Data Cleaning and Exploration with Machine Learning: Get to grips with machine learning techniques to achieve sparkling-clean data quickly

Walker, Michael

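  • Publisher: Packt Publishing
  • Publication date: 2022-08-26
  • List price: $1,650
  • VIP price: $1,568 (5% off)
  • Language: English
  • Pages: 542
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1803241675
  • ISBN-13: 9781803241678
  • Related categories: Spark, Machine Learning
  • Ordered from the supplier once you place your order (about 3 to 4 weeks)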

Product description

Explore supercharged machine learning techniques to take care of your data laundry loads

Key Features

- Learn how to prepare data for machine learning processes
- Understand which algorithms to choose based on prediction objectives and the properties of the data
- Explore how to interpret and evaluate the results from machine learning

Book Description

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results.

As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You'll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you'll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You'll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, while focusing more on the accuracy of predictions.
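To make the idea of studying a variable's distribution and identifying anomalies concrete, here is a minimal sketch using only the Python standard library. It is not taken from the book: the interquartile-range (IQR) rule, the helper name `iqr_outliers`, the multiplier `k=1.5`, and the sample readings are all assumptions chosen for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper: the IQR rule is one common convention for
    flagging univariate outliers, not necessarily the book's approach.
    """
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up sample data with one obviously extreme reading.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
outliers = iqr_outliers(readings)
print(outliers)
```

A rule like this only looks at one variable at a time; the ML-based anomaly detection the description mentions would typically consider several features together.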

By the end of this book, you'll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.

What you will learn

- Explore essential data cleaning and exploration techniques to be used before running the most popular machine learning algorithms
- Understand how to perform preprocessing and feature selection, and how to set up the data for testing and validation
- Model continuous targets with supervised learning algorithms
- Model binary and multiclass targets with supervised learning algorithms
- Execute clustering and dimension reduction with unsupervised learning algorithms
- Understand how to use regression trees to model a continuous target
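As a rough illustration of what setting up data for testing and validation can look like, the sketch below holds out a test set and fits scaling parameters on the training rows only. It uses just the Python standard library and is not code from the book; the function names, the 25% test fraction, the seed, and the toy data are assumptions.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction for testing."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def standardize(xs, mean, stdev):
    """Scale values to zero mean and unit variance using given parameters."""
    return [(x - mean) / stdev for x in xs]


# Toy (feature, target) pairs, invented for this example.
rows = [(x, 2 * x + 1) for x in range(20)]
train, test = train_test_split(rows)

# Estimate scaling parameters on the training features only, then reuse
# them for the test features so no test information leaks into training.
train_x = [x for x, _ in train]
test_x = [x for x, _ in test]
mean, stdev = statistics.mean(train_x), statistics.stdev(train_x)
train_scaled = standardize(train_x, mean, stdev)
test_scaled = standardize(test_x, mean, stdev)

print(len(train), len(test))
```

Fitting the scaler before splitting would let statistics of the held-out rows influence the training data, which is why the parameters are computed from `train_x` alone.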

Who this book is for

This book is for professional data scientists, particularly those in the first few years of their career, or more experienced analysts who are relatively new to machine learning. Readers should have prior knowledge of concepts in statistics typically taught in an undergraduate introductory course as well as beginner-level experience in manipulating data programmatically.

About the author

Michael Walker has worked as a data analyst for over 30 years at a variety of educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He is currently the Chief Information Officer at College Unbound in Providence, Rhode Island.

Table of contents

1. Examining the Distribution of Features and Targets
2. Examining Bivariate and Multivariate Relationships between Features and Targets
3. Identifying and Fixing Missing Values
4. Encoding, Transforming, and Scaling Features
5. Feature Selection
6. Preparing for Model Evaluation
7. Linear Regression Models
8. Support Vector Regression
9. K-Nearest Neighbor, Decision Tree, Random Forest and Gradient Boosted Regression
10. Logistic Regression
11. Decision Trees and Random Forest Classification
12. K-Nearest Neighbors for Classification
13. Support Vector Machine Classification
14. Naive Bayes Classification
15. Principal Component Analysis
16. K-Means and DBSCAN Clustering