Data Profiling (Synthesis Lectures on Data Management)

Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock

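  • Publisher: Morgan & Claypool
  • Publication date: 2018-11-08
  • List price: $2,390
  • VIP price: $2,271 (5% off)
  • Language: English
  • Pages: 154
  • Binding: Paperback
  • ISBN: 168173446X
  • ISBN-13: 9781681734460
  • Overseas special-order title (requires separate checkout)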

Product description

Data profiling refers to the activity of collecting data about data, i.e., metadata. Most IT professionals and researchers who work with data have engaged in data profiling, at least informally, to understand and explore an unfamiliar dataset or to determine whether a new dataset is appropriate for a particular task at hand. Data profiling results are also important in a variety of other situations, including query optimization, data integration, and data cleaning. Simple metadata are statistics, such as the number of rows and columns, schema and datatype information, the number of distinct values, statistical value distributions, and the number of null or empty values in each column. More complex types of metadata are statements about multiple columns and their correlation, such as candidate keys, functional dependencies, and other types of dependencies.

This book provides a classification of the various types of profilable metadata, discusses popular data profiling tasks, and surveys state-of-the-art profiling algorithms. While most of the book focuses on tasks and algorithms for relational data profiling, we also briefly discuss systems and techniques for profiling non-relational data such as graphs and text. We conclude with a discussion of data profiling challenges and directions for future work in this area.
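As a rough illustration of the metadata types named in the description, the following Python sketch computes a few single-column statistics and checks two multi-column properties on a toy table. It is not taken from the book: the function names (`profile_column`, `is_unique`, `holds_fd`) and the sample rows are hypothetical, nulls are treated as ordinary values in the multi-column checks, and the brute-force checks are far simpler than the algorithms the book surveys.

```python
from collections import Counter


def profile_column(values):
    """Return simple single-column metadata for a list of cell values."""
    non_null = [v for v in values if v is not None and v != ""]
    return {
        "rows": len(values),
        "nulls_or_empty": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": sorted({type(v).__name__ for v in non_null}),
        "top_values": Counter(non_null).most_common(3),
    }


def is_unique(rows, columns):
    """True if no two rows share the same values in `columns`.

    A candidate key is a minimal column combination with this property.
    """
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds in `rows`."""
    mapping = {}
    for row in rows:
        left = tuple(row[c] for c in lhs)
        right = tuple(row[c] for c in rhs)
        # The first right-hand value seen for each left-hand value must repeat.
        if mapping.setdefault(left, right) != right:
            return False
    return True


# Hypothetical sample data, invented for this sketch.
rows = [
    {"id": 1, "zip": "10115", "city": "Berlin"},
    {"id": 2, "zip": "10115", "city": "Berlin"},
    {"id": 3, "zip": "14482", "city": "Potsdam"},
    {"id": 4, "zip": None, "city": ""},
]

zip_profile = profile_column([r["zip"] for r in rows])
print(zip_profile)
print(is_unique(rows, ["id"]), is_unique(rows, ["zip"]))
print(holds_fd(rows, ["zip"], ["city"]), holds_fd(rows, ["city"], ["id"]))
```

In this toy table, `id` is unique while `zip` is not, and `zip -> city` holds while `city -> id` does not.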
