Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark

Akash Tandon, Sandy Ryza, Uri Laserson

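  • Publisher: O'Reilly
  • Publication date: 2022-07-19
  • List price: $2,100
  • Sale price: $1,680 (20% off)
  • Language: English
  • Pages: 233
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1098103653
  • ISBN-13: 9781098103651
  • Related categories: Python programming language, Spark
  • Ships immediately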

Product Description

The amount of data being generated today is staggering--and growing. Apache Spark has emerged as the de facto tool to analyze big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark's Python API, and other best practices in Spark programming.

Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques--including classification, clustering, collaborative filtering, and anomaly detection--to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing.

If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.

  • Familiarize yourself with Spark's programming model and ecosystem
  • Learn general approaches in data science
  • Examine complete implementations that analyze large public datasets
  • Discover which machine learning tools make sense for particular problems
  • Explore code that can be adapted to many uses

About the Authors

Akash Tandon is an independent consultant and experienced full-stack data engineer. Previously, he was a senior data engineer at Atlan, where he built software for enterprise data science teams. In another life, he worked on data science projects for governments and built risk assessment tools at a FinTech startup. As a student, he wrote open source software with the R project for statistical computing and Google. In his free time, he researches things for no good reason.

Sandy Ryza is a software engineer at Elementl. Previously, he developed algorithms for public transit at Remix and was a senior data scientist at Cloudera and Clover Health. He is an Apache Spark committer, Apache Hadoop PMC member, and founder of the Time Series for Spark project.

Uri Laserson is founder & CTO of Patch Biosciences. Previously, he worked on big data and genomics at Cloudera.

Sean Owen is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committer and PMC member, and a co-author of Advanced Analytics with Spark. Previously, he was director of Data Science at Cloudera and an engineer at Google.

Josh Wills is an independent data science and engineering consultant, the former head of data engineering at Slack and data science at Cloudera, and wrote a tweet about data scientists once.
