Hands-On Entity Resolution: A Practical Guide to Data Matching with Python

Shearer, Michael

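  • Publisher: O'Reilly
  • Publication date: 2024-03-12
  • List price: $2,410
  • Sale price: $2,290 (5% off)
  • VIP price: $2,169 (10% off)
  • Language: English
  • Pages: 196
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1098148487
  • ISBN-13: 9781098148485
  • Related category: Python programming language
  • Ships immediately (stock = 1)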

Product description

Entity resolution is a key analytic technique that enables you to identify multiple data records that refer to the same real-world entity. With this hands-on guide, product managers, data analysts, and data scientists will learn how to add value to data by cleansing, analyzing, and resolving datasets using open source Python libraries and cloud APIs.

Author Michael Shearer shows you how to scale up your data matching processes and improve the accuracy of your reconciliations. You'll be able to remove duplicate entries within a single source and join disparate data sources together when common keys aren't available. Using real-world data examples, this book helps you gain practical understanding to accelerate the delivery of real business value.
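To make the idea of joining sources without a common key concrete, here is a minimal sketch that is not taken from the book. It assumes company names are the only shared attribute and uses Python's standard-library `difflib.SequenceMatcher` as a stand-in similarity measure. The helper names (`normalize`, `similarity`, `fuzzy_join`), the sample records, and the 0.85 threshold are all hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_join(left, right, threshold=0.85):
    """Pair each left record with its best right match at or above threshold."""
    pairs = []
    for l in left:
        best = max(right, key=lambda r: similarity(l, r), default=None)
        if best is not None and similarity(l, best) >= threshold:
            pairs.append((l, best))
    return pairs


# Hypothetical sample data: two sources with no shared identifier.
crm = ["Acme Corp.", "Globex Corporation", "Initech"]
billing = ["ACME Corp", "Globex Corporation Ltd", "Umbrella Inc"]

matches = fuzzy_join(crm, billing)
for left_name, right_name in matches:
    print(f"{left_name!r} <-> {right_name!r}")
```

This sketch compares every left record with every right record, which is fine for a handful of rows but grows quadratically. Dedicated entity resolution tooling typically reduces the number of candidate pairs before scoring them.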

With entity resolution, you'll build rich and comprehensive data assets that reveal relationships for marketing and risk management purposes, key to harnessing the full potential of ML and AI. This book covers:

  • Challenges in deduplicating and joining datasets
  • Extracting, cleansing, and preparing datasets for matching
  • Text matching algorithms to identify equivalent entities
  • Techniques for deduplicating and joining datasets at scale
  • Matching datasets containing persons and organizations
  • Evaluating data matches
  • Optimizing and tuning data matching algorithms
  • Entity resolution using cloud APIs
  • Matching using privacy-enhancing technologies
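As one illustration of the "evaluating data matches" topic listed above, the following sketch (an assumption about a common approach, not the book's own code) scores a set of predicted match pairs against a hand-labelled set of true pairs using precision and recall. The function name `precision_recall` and the record identifiers are hypothetical.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for predicted match pairs.

    Both arguments are iterables of (left_id, right_id) tuples.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical labelled data: pairs a matcher proposed vs. pairs known to be true.
predicted_pairs = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
true_pairs = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

precision, recall = precision_recall(predicted_pairs, true_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision falls when the matcher links records that do not belong together, and recall falls when it misses true links, so the two are usually read side by side when tuning a match threshold.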
