Data Cleaning: A Practical Perspective (Synthesis Lectures on Data Management)
Venkatesh Ganti, Anish Das Sarma
- Publisher: Morgan & Claypool
- Publication date: 2013-09-01
- Price: $1,290
- VIP price: 5% off, $1,226
- Language: English
- Pages: 86
- Binding: Paperback
- ISBN: 1608456773
- ISBN-13: 9781608456772
Overseas special-order title (must be checked out separately)
Product description
Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors in data tend to creep in for a variety of reasons, including errors during input data collection and errors while merging data collected independently across different databases. These errors in data warehouses often result in erroneous downstream reports, and could impact business decisions negatively. Therefore, one of the critical challenges in maintaining large data warehouses is ensuring that the quality of the data in the warehouse remains high. The process of maintaining high data quality is commonly referred to as data cleaning.
In this book, we first discuss the goals of data cleaning. Often, the goals of data cleaning are not well defined and could mean different solutions in different scenarios. Toward clarifying these goals, we abstract out a common set of data cleaning tasks that often need to be addressed. This abstraction allows us to develop solutions for these common data cleaning tasks. We then discuss a few popular approaches for developing such solutions. In particular, we focus on an operator-centric approach for developing a data cleaning platform. The operator-centric approach involves the development of customizable operators that could be used as building blocks for developing common solutions. This is similar to the approach of relational algebra for query processing. The basic set of operators can be put together to build complex queries. Finally, we discuss the development of custom scripts which leverage the basic data cleaning operators along with relational operators to implement effective solutions for data cleaning tasks.
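To make the operator-centric idea concrete, here is a minimal sketch, not taken from the book, of how a deduplication task might be assembled from two generic operators: a similarity join and a clustering step. All names (`jaccard`, `similarity_self_join`, `cluster`, `deduplicate`), the token-based Jaccard measure, the 0.5 threshold, and the sample records are assumptions chosen for illustration. The book's own operators and similarity functions may differ.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Similarity function (assumed): Jaccard overlap of lowercase word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_self_join(records, sim, threshold):
    """Operator: index pairs (i, j), i < j, whose similarity meets the threshold.

    This naive version compares every pair, which is quadratic in the input.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if sim(a, b) >= threshold
    ]


def cluster(n, pairs):
    """Operator: group indices 0..n-1 into connected components (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def deduplicate(records, sim=jaccard, threshold=0.5):
    """Task: compose the two operators and keep the first record per cluster."""
    pairs = similarity_self_join(records, sim, threshold)
    return [records[group[0]] for group in cluster(len(records), pairs)]


# Hypothetical sample data for illustration only.
records = [
    "Morgan Claypool Publishers",
    "morgan claypool publishers inc",
    "Microsoft Research",
    "Microsoft Research Redmond",
    "Google",
]

print(deduplicate(records))
```

Because the similarity function and threshold are parameters, the same two operators can be reused for other tasks, much as relational operators are recombined into different queries.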
Table of Contents: Preface / Acknowledgments / Introduction / Technological Approaches / Similarity Functions / Operator: Similarity Join / Operator: Clustering / Operator: Parsing / Task: Record Matching / Task: Deduplication / Data Cleaning Scripts / Conclusion / Bibliography / Authors' Biographies