Data Cleaning: A Practical Perspective (Synthesis Lectures on Data Management)

Venkatesh Ganti, Anish Das Sarma

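  • Publisher: Morgan & Claypool
  • Publication date: 2013-09-01
  • List price: $1,270
  • VIP price: $1,207 (5% off)
  • Language: English
  • Pages: 86
  • Binding: Paperback
  • ISBN: 1608456773
  • ISBN-13: 9781608456772
  • Overseas special-order title (requires separate checkout)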


Product Description

Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors creep into the data for a variety of reasons, including mistakes during input data collection and during the merging of data collected independently across different databases. These errors often result in erroneous downstream reports and can affect business decisions negatively. Therefore, one of the critical challenges in maintaining large data warehouses is keeping the quality of their data high. The process of maintaining high data quality is commonly referred to as data cleaning.

In this book, we first discuss the goals of data cleaning. Often, these goals are not well defined and can call for different solutions in different scenarios. To clarify them, we abstract out a common set of data cleaning tasks that often need to be addressed. This abstraction allows us to develop solutions for these common tasks. We then discuss a few popular approaches for developing such solutions. In particular, we focus on an operator-centric approach for developing a data cleaning platform. The operator-centric approach involves the development of customizable operators that can be used as building blocks for common solutions. This is similar to the approach of relational algebra for query processing, where a basic set of operators can be put together to build complex queries. Finally, we discuss the development of custom scripts that leverage the basic data cleaning operators along with relational operators to implement effective solutions for data cleaning tasks.
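
As a rough illustration of the operator-centric idea, the sketch below composes two small operators (a similarity join and a clustering step) into a deduplication task. It is not taken from the book: the function names, the word-token Jaccard similarity, the 0.5 threshold, and the connected-components clustering are all assumptions chosen to keep the example short.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Similarity function (assumed): Jaccard overlap of lowercase word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_join(records, sim=jaccard, threshold=0.5):
    """Operator: index pairs (i, j), i < j, whose similarity meets the threshold.

    This is a naive all-pairs comparison, fine for a demo but quadratic in size.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if sim(a, b) >= threshold
    ]


def cluster(n, pairs):
    """Operator: group indices 0..n-1 into connected components of the match graph."""
    parent = list(range(n))

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def deduplicate(records, threshold=0.5):
    """Task: compose the two operators and keep the first record of each cluster."""
    pairs = similarity_join(records, threshold=threshold)
    return [records[group[0]] for group in cluster(len(records), pairs)]


if __name__ == "__main__":
    names = [
        "Morgan Claypool Publishers",
        "morgan claypool publishers inc",
        "Microsoft Research",
        "Google Inc",
    ]
    print(deduplicate(names))
```

Each operator takes a plain parameter (the similarity function, the threshold), so the same building blocks could be reused for a different task, such as matching records across two lists, by swapping the composition rather than rewriting the operators.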

Table of Contents: Preface / Acknowledgments / Introduction / Technological Approaches / Similarity Functions / Operator: Similarity Join / Operator: Clustering / Operator: Parsing / Task: Record Matching / Task: Deduplication / Data Cleaning Scripts / Conclusion / Bibliography / Authors' Biographies
