Python Data Cleaning and Preparation Best Practices: A practical guide to organizing and handling data from various sources and formats using Python

Zervou, Maria

  • Publisher: Packt Publishing
  • Publication date: 2024-09-27
  • List price: $1,840
  • VIP price: $1,748 (5% off)
  • Language: English
  • Pages: 456
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1837634742
  • ISBN-13: 9781837634743
  • Related categories: GAN (generative adversarial networks), Python programming language

Product Description

Take your data preparation skills to the next level by converting any type of data asset into a structured, formatted, and readily usable dataset

Key Features:

- Maximize the value of your data through effective data cleaning methods

- Enhance your data skills using strategies for handling structured and unstructured data

- Elevate the quality of your data products by testing and validating your data pipelines

- Purchase of the print or Kindle book includes a free PDF eBook

Book Description:

Professionals face several challenges in effectively leveraging data in today's data-driven world. One of the main challenges is the low quality of data products, often caused by inaccurate, incomplete, or inconsistent data. Another significant challenge is that many data professionals lack the skills to analyze unstructured data, so they miss valuable insights that are difficult or impossible to obtain from structured data alone.

To help you tackle these challenges, this book will take you on a journey through the upstream data pipeline, which includes ingesting data from various sources, validating and profiling data for high-quality end tables, and writing data to different sinks. You'll focus on structured data by performing essential tasks, such as cleaning and encoding datasets and handling missing values and outliers, before learning how to manipulate unstructured data with simple techniques. You'll also be introduced to a variety of natural language processing techniques, from tokenization to vector models, as well as techniques to structure images, videos, and audio.

By the end of this book, you'll be proficient in data cleaning and preparation techniques for both structured and unstructured data.

What You Will Learn:

- Ingest data from different sources and write it to the required sinks

- Profile and validate data pipelines for better quality control

- Get up to speed with grouping, merging, and joining structured data

- Handle missing values and outliers in structured datasets

- Implement techniques to manipulate and transform time series data

- Apply structure to text, image, voice, and other unstructured data
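
To give a feel for the missing-value and outlier handling listed above, here is a minimal sketch using only the Python standard library. It is not taken from the book: the function names, the sample readings, the choice of median imputation, and the 1.5 × IQR rule are all assumptions made for illustration, and the book itself may use different libraries and methods.

```python
from statistics import median, quantiles


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outlier_flags(values, k=1.5):
    """Return True for each value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical sensor readings with one gap and one implausible spike.
readings = [10.0, 12.0, None, 11.0, 13.0, 250.0, 12.0]

filled = fill_missing_with_median(readings)
flags = iqr_outlier_flags(filled)
cleaned = [v for v, is_outlier in zip(filled, flags) if not is_outlier]
```

Imputing before flagging keeps the two steps independent. Whether to drop, cap, or keep a flagged value depends on the dataset and is left to the reader.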

Who this book is for:

Whether you're a data analyst, data engineer, data scientist, or a data professional responsible for data preparation and cleaning, this book is for you. Working knowledge of Python programming is needed to get the most out of this book.

Table of Contents

- Data Ingestion Techniques

- Importance of Data Quality

- Data Profiling - Understanding Data Structure, Quality, and Distribution

- Cleaning Messy Data and Data Manipulation

- Data Transformation - Merging and Concatenating

- Data Grouping, Aggregation, Filtering, and Applying Functions

- Data Sinks

- Detecting and Handling Missing Values and Outliers

- Normalization and Standardization

- Handling Categorical Features

- Consuming Time Series Data

- Text Preprocessing in the Era of LLMs

- Image and Audio Preprocessing with LLMs
