Essential PySpark for Scalable Data Analytics: A beginner's guide to harnessing the power and ease of PySpark 3

Nudurupati, Sreeram

  • Publisher: Packt Publishing
  • Publication date: 2021-10-29
  • List price: $1,750
  • Sale price: $1,575 (10% off)
  • Language: English
  • Pages: 322
  • Binding: Quality Paper (also called trade paper)
  • ISBN: 1800568878
  • ISBN-13: 9781800568877
  • Related categories: JVM Languages, Spark, Data Science
  • Ships immediately (stock: 1)

Product Description

Get started with distributed computing using PySpark, a single unified framework for end-to-end data analytics at scale


Key Features:

  • Discover how to convert huge amounts of raw data into meaningful and actionable insights
  • Use Spark's unified analytics engine for end-to-end analytics, from data preparation to predictive analytics
  • Perform data ingestion, cleansing, and integration for ML, data analytics, and data visualization


Book Description:

Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use scalable data analytics framework.

Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. This book helps you build real-time analytics pipelines that enable you to gain insights much faster. You'll then discover methods for building cloud-based data lakes, and explore Delta Lake, which brings reliability and performance to data lakes. The book also covers Data Lakehouse, an emerging paradigm, which combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, such as data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries along with a new pandas API on top of PySpark called Koalas.

By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems.


What You Will Learn:

  • Understand the role of distributed computing in the world of big data
  • Gain an appreciation for Apache Spark as the de facto go-to for big data processing
  • Scale out your data analytics process using Apache Spark
  • Build data pipelines using data lakes, and perform data visualization with PySpark and Spark SQL
  • Leverage the cloud to build truly scalable and real-time data analytics applications
  • Explore the applications of data science and scalable machine learning with PySpark
  • Integrate your clean and curated data with BI and SQL analysis tools


Who this book is for:

This book is for practicing data engineers, data scientists, data analysts, and data enthusiasts who already use data analytics and want to explore distributed, scalable data analytics. Basic to intermediate knowledge of data engineering, data science, and SQL analytics is expected. General proficiency in a programming language, especially Python, and working knowledge of data analytics with tools such as pandas and SQL will help you get the most out of this book.
