Big Data Analysis with Python (Paperback): Combine Spark and Python to unlock the powers of parallel computing and machine learning

Ivan Marin, Ankit Shukla, Sarang VK

Product Description

Key Features

  • Get a hands-on, fast-paced introduction to the Python data science stack
  • Explore ways to create useful metrics and statistics from large datasets
  • Create detailed analysis reports with real-world data

Book Description

Processing big data in real time is challenging because of scalability, information inconsistency, and fault-tolerance concerns. Big Data Analysis with Python teaches you how to use tools that can control this data avalanche for you. With this book, you'll learn practical techniques to aggregate data into useful dimensions for subsequent analysis, extract statistical measurements, and transform datasets into features for other systems.

The book begins with an introduction to data manipulation in Python using pandas. You'll then get familiar with statistical analysis and plotting techniques. Through multiple hands-on activities, you'll learn to analyze data that is distributed across several computers by using Dask. As you progress, you'll study how to aggregate data for plots when the entire dataset cannot fit in memory. You'll also explore Hadoop (HDFS and YARN), which will help you tackle larger datasets. The book also covers Spark and explains how it interacts with other tools.

By the end of this book, you'll be able to bootstrap your own Python environment, process large files, and manipulate data to generate statistics, metrics, and graphs.
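
The idea of aggregating data for plots when the full dataset cannot fit in memory can be illustrated with a minimal sketch. The book itself works with pandas, Dask, and Spark; the version below uses only the Python standard library, and the function name `streaming_group_means`, the column names, and the sample data are all hypothetical, chosen for illustration only.

```python
import csv
import io
from collections import defaultdict


def streaming_group_means(rows, key, value):
    """Read rows one at a time, keeping only per-group running totals in memory."""
    count = defaultdict(int)
    total = defaultdict(float)
    for row in rows:  # only the current row is held in memory
        count[row[key]] += 1
        total[row[key]] += float(row[value])
    # The result has one entry per group, so it stays small enough to plot
    # even when the input is far larger than available memory.
    return {group: total[group] / count[group] for group in count}


# Hypothetical sample data; a real file would be opened with open(path, newline="").
sample = io.StringIO("city,temp\nA,10\nB,20\nA,14\nB,22\n")
means = streaming_group_means(csv.DictReader(sample), key="city", value="temp")
print(means)  # {'A': 12.0, 'B': 21.0}
```

Only the running counts and sums are stored, so memory use grows with the number of groups rather than the number of rows.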

What you will learn

  • Use Python to read and transform data into different formats
  • Generate basic statistics and metrics using data on disk
  • Work with computing tasks distributed over a cluster
  • Convert data from various sources into storage or querying formats
  • Prepare data for statistical analysis, visualization, and machine learning
  • Present data in the form of effective visuals

Who this book is for

Big Data Analysis with Python is designed for Python developers, data analysts, and data scientists who want to get hands-on with methods to control data and transform it into impactful insights. Basic knowledge of statistical measurements and relational databases will help you to understand various concepts explained in this book.

 

About the Authors

Ivan Marin is a Systems Architect and Data Scientist working at Daitan Group, a Campinas-based software company. He designs Big Data systems for large volumes of data, and implements Machine Learning pipelines end to end using Python and Spark. He is also an active organizer of Data Science, Machine Learning and Python in São Paulo and has given Python for Data Science courses at university level.

Sarang VK is a data scientist whose responsibilities include identifying data sources, preparing data, and developing and evaluating predictive and optimization models for production-level machine learning and statistical solutions with back-end and front-end development. He also supports pre-sales, stakeholder communication, requirement gathering, scoping, and solutions.

His strengths are Machine / Deep Learning, SQL, Predictive Analytics, Time-Series, Simulation Modelling, Optimization, Image/Text Analytics, NLP, Python, R, Spark, TensorFlow, Keras, h2o, SAP-PAL, AWS, SAP Predictive Factory, Azure, Financial Analytics, Supply Chain, Banking and Insurance, Retail/Customer Analytics, Trading Analytics, Healthcare Analytics, RPA, IPA.

Ankit Shukla is a data scientist with a passion for using data science and advanced analytics to solve real-life problems and bring ideas to fruition. He is skilled in using machine learning, AI, and statistical modelling techniques to solve business problems and create actual dollar value for clients. He is experienced in working with copious amounts of data, using the latest Big Data technologies to design data pipelines and generate impactful data-driven insights and reports.

His skill sets are:

  • Languages and tools: R, Python, SQL, HiveQL, Excel, Linux shell scripting, SAS (working knowledge), Docker
  • Frameworks: Keras, OpenCV, XGBoost, NumPy, Scikit-learn, Caret, ggplot2, recommenderlab
  • Big Data: Hadoop, Hive, Impala, PySpark, SparkR, Pig, AWS (S3, EC2, EMR, SageMaker, Redshift)
  • Machine Learning: Regression, Classification, Clustering, Feature Selection, Model Selection/Assessment, Recommender Systems, Neural Networks, Deep Learning, Transfer Learning
  • Visualization: Tableau, R, Shiny

Table of Contents

  1. The Python Data Science Stack
  2. Statistical Visualizations
  3. Working with Big Data Frameworks
  4. Diving Deeper with Spark
  5. Handling Missing Values and Correlation Analysis
  6. Exploratory Data Analysis
  7. Reproducibility in Big Data Analysis
  8. Creating a Full Analysis Report
