Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up Using PySpark (Paperback)

Parsian, Mahmoud

  • Publisher: O'Reilly
  • Publication date: 2022-05-17
  • List price: $2,780
  • Sale price: $2,641 (5% off)
  • VIP price: $2,502 (10% off)
  • Language: English
  • Pages: 435
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1492082384
  • ISBN-13: 9781492082385
  • Related categories: Spark, Algorithms-data-structures, Design Pattern
  • Ships immediately (stock < 3)

Product Description

Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples using PySpark.

In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You'll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms using the PySpark driver and shell script.

With this book, you will:

- Learn how to select Spark transformations for optimized solutions
- Explore powerful transformations and reductions including reduceByKey(), combineByKey(), and mapPartitions()
- Understand data partitioning for optimized queries
- Build and apply a model using PySpark design patterns
- Apply motif-finding algorithms to graph data
- Analyze graph data by using the GraphFrames API
- Apply PySpark algorithms to clinical and genomics data
- Learn how to use and apply feature engineering in ML algorithms
- Understand and use practical and pragmatic data design patterns
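
For readers unfamiliar with the reductions named in the list above, here is a minimal plain-Python sketch of what reduceByKey(), combineByKey(), and mapPartitions() compute. It is not taken from the book and does not use Spark: the sample data, the two-partition layout, and the helper names reduce_by_key, combine_by_key, and map_partitions are hypothetical stand-ins that only mirror the semantics of the PySpark RDD methods.

```python
# Hypothetical sample data: (key, value) pairs split across two "partitions",
# standing in for a distributed RDD. Not from the book.
partitions = [
    [("a", 1), ("b", 4), ("a", 3)],
    [("b", 2), ("a", 5)],
]


def reduce_by_key(parts, func):
    """Mirrors rdd.reduceByKey(func): merge all values of a key with one function."""
    merged = {}
    for part in parts:
        for key, value in part:
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


def combine_by_key(parts, create_combiner, merge_value, merge_combiners):
    """Mirrors rdd.combineByKey(...): build combiners per partition, then merge them."""
    per_partition = []
    for part in parts:
        local = {}
        for key, value in part:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        per_partition.append(local)

    merged = {}
    for local in per_partition:
        for key, comb in local.items():
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged


def map_partitions(parts, func):
    """Mirrors rdd.mapPartitions(func): func receives one whole partition as an iterator."""
    return [list(func(iter(part))) for part in parts]


# Sum of values per key.
sums = reduce_by_key(partitions, lambda a, b: a + b)

# Mean per key via (sum, count) combiners; the combiner type differs from the value type.
sum_counts = combine_by_key(
    partitions,
    lambda v: (v, 1),
    lambda c, v: (c[0] + v, c[1] + 1),
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),
)
means = {key: total / count for key, (total, count) in sum_counts.items()}

# One record per partition: the number of elements it holds.
partition_sizes = map_partitions(partitions, lambda it: [sum(1 for _ in it)])

print(sums, means, partition_sizes)
```

With a running Spark installation, the corresponding calls on an RDD of the same pairs would be rdd.reduceByKey(...), rdd.combineByKey(...), and rdd.mapPartitions(...). The sketch only shows why combineByKey() suits aggregations such as a mean, where the intermediate result (a sum and a count) has a different type from the input values.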

From the Preface

Spark has become the de facto standard for large-scale data analytics. I have been using and teaching Spark since its inception nine years ago, and I have seen tremendous improvements in Extract, Transform, Load (ETL) processes, distributed algorithm development, and large-scale data analytics. I started using Spark with Java, but I found that while the code is pretty stable, you have to write long lines of code, which can become unreadable. For this book, I decided to use PySpark (a Python API for Spark) because it is easier to express the power of Spark in Python: the code is short, readable, and maintainable. PySpark is powerful but simple to use, and you can express any ETL or distributed algorithm in it with a simple set of transformations and actions.

Why I Wrote This Book

This is an introductory book about data analysis using PySpark. The book consists of a set of guidelines and examples intended to help software and data engineers solve data problems in the simplest possible way. As you know, there are many ways to solve any data problem: PySpark enables us to write simple code for complex problems. This is the motto I have tried to express in this book: keep it simple and use parameters so that your solution can be reused by other developers. My aim is to teach readers how to think about data and understand its origins and final intended form, as well as showing how to use fundamental data transformation patterns to solve a variety of data problems.

Who This Book Is For

To use this book effectively it will be helpful to know the basics of the Python programming language, such as how to use conditionals (if-then-else), iterate through lists, and define and call functions. However, if your background is in another programming language (such as Java or Scala) and you do not know Python, you will still be able to use the book as I have provided a reasonable introduction to Spark and PySpark.

This book is primarily intended for people who want to analyze large amounts of data and develop distributed algorithms using the Spark engine and PySpark. I have provided simple examples showing how to perform ETL operations and write distributed algorithms in PySpark. The code examples are written in such a way that you can cut and paste them to get the job done easily.

About the Author

Mahmoud Parsian, Ph.D. in Computer Science, is a practicing software professional with 30 years of experience as a developer, designer, architect, and author. For the past 15 years, he has been involved in Java server-side development, databases, MapReduce, Spark, PySpark, and distributed computing. Dr. Parsian currently leads Illumina's Big Data team, which is focused on large-scale genome analytics and distributed computing using Spark and PySpark. He leads the development of scalable regression algorithms and DNA sequencing pipelines using Java, MapReduce, PySpark, Spark, and open source tools. He is the author of the following books: Data Algorithms (O'Reilly, 2015), PySpark Algorithms (Amazon.com, 2019), JDBC Recipes (Apress, 2005), and JDBC Metadata Recipes (Apress, 2006). Dr. Parsian is also an Adjunct Professor at Santa Clara University, where he teaches Big Data Modeling and Analytics and Machine Learning in the MSIS program using Spark, PySpark, Python, and scikit-learn.
