Similarity Joins in Relational Database Systems (Synthesis Lectures on Data Management)
Nikolaus Augsten, Michael H. Böhlen
- Publisher: Morgan & Claypool
- Publication date: 2013-11-01
- List price: $1,620
- VIP price: 5% off, $1,539
- Language: English
- Pages: 124
- Binding: Paperback
- ISBN: 1627050280
- ISBN-13: 9781627050289
Related categories:
Databases, SQL
Overseas special-order book (requires separate checkout)
Product description
State-of-the-art database systems manage and process a variety of complex objects, including strings and trees. For such objects equality comparisons are often not meaningful and must be replaced by similarity comparisons. This book describes the concepts and techniques to incorporate similarity into database systems. We start out by discussing the properties of strings and trees, and identify the edit distance as the de facto standard for comparing complex objects. Since the edit distance is computationally expensive, token-based distances have been introduced to speed up edit distance computations. The basic idea is to decompose complex objects into sets of tokens that can be compared efficiently. Token-based distances are used to compute an approximation of the edit distance and prune expensive edit distance calculations. A key observation when computing similarity joins is that many of the object pairs, for which the similarity is computed, are very different from each other. Filters exploit this property to improve the performance of similarity joins. A filter preprocesses the input data sets and produces a set of candidate pairs. The distance function is evaluated on the candidate pairs only. We describe the essential query processing techniques for filters based on lower and upper bounds. For token equality joins we describe prefix, size, positional and partitioning filters, which can be used to avoid the computation of small intersections that are not needed since the similarity would be too low.
Table of Contents: Preface / Acknowledgments / Introduction / Data Types / Edit-Based Distances / Token-Based Distances / Query Processing Techniques / Filters for Token Equality Joins / Conclusion / Bibliography / Authors' Biographies / Index
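The filter-and-verify idea in the description can be illustrated with a minimal sketch. It is not taken from the book: the function names, the choice of q-grams as tokens, and the two filters used (a length filter and a q-gram count filter, both standard lower bounds on the string edit distance) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def edit_distance(s, t):
    """Classic dynamic-programming string edit distance (unit costs)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # substitute or match
        prev = cur
    return prev[-1]


def qgrams(s, q=2):
    """Decompose a string into a bag (multiset) of overlapping q-grams."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_filters(s, t, tau, q=2):
    """Return False only if the pair provably has edit distance > tau."""
    # Length filter: the edit distance is at least the length difference.
    if abs(len(s) - len(t)) > tau:
        return False
    # Count filter: one edit operation destroys at most q q-grams, so two
    # strings within distance tau must share at least this many q-grams.
    required = max(len(s), len(t)) - q + 1 - q * tau
    overlap = sum((qgrams(s, q) & qgrams(t, q)).values())
    return overlap >= required


def similarity_join(left, right, tau, q=2):
    """Filter first, then evaluate the edit distance on candidates only."""
    candidates = [(s, t) for s, t in product(left, right)
                  if passes_filters(s, t, tau, q)]
    return [(s, t) for s, t in candidates if edit_distance(s, t) <= tau]


if __name__ == "__main__":
    left = ["similarity", "database", "join"]
    right = ["similarly", "databases", "joint", "tree"]
    print(similarity_join(left, right, tau=1))
```

This sketch still enumerates every pair before filtering. The prefix, size, positional, and partitioning filters named in the description are meant to avoid that enumeration, and they are not implemented here.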