Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch

Polak, Adi

  • Publisher: O'Reilly
  • Publication date: 2023-04-11
  • List price: $2,700
  • Sale price: $2,430 (10% off)
  • Language: English
  • Pages: 291
  • Binding: Quality Paper (also called trade paper)
  • ISBN: 1098106822
  • ISBN-13: 9781098106829
  • Related categories: Deep Learning, Spark, TensorFlow, Machine Learning
  • Ships immediately


Get up to speed on Apache Spark, the popular engine for large-scale data processing, including machine learning and analytics. If you're looking to expand your skill set or advance your career in scalable machine learning with MLlib, distributed PyTorch, and distributed TensorFlow, this practical guide is for you. Using Spark as your main data processing platform, you'll discover several open source technologies designed and built for enriching Spark's ML capabilities.

Scaling Machine Learning with Spark examines various technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, PyTorch, and Petastorm. This book shows you when to use each technology and why. If you're a data scientist working with machine learning, you'll learn how to:

  • Build practical distributed machine learning workflows, including feature engineering and data formats
  • Extend deep learning functionalities beyond Spark by bridging into distributed TensorFlow and PyTorch
  • Manage your machine learning experiment lifecycle with MLflow
  • Use Petastorm as a storage layer for bridging data from Spark into TensorFlow and PyTorch
  • Use machine learning terminology to understand distribution strategies

