Distributed Machine Learning with Python: Accelerating model training and serving with distributed systems (Paperback)

Wang, Guanhua

Product Description

Build and deploy an efficient data processing pipeline for machine learning model training in an elastic, in-parallel, multi-tenant cluster or cloud environment

Key Features

- Accelerate model training and inference with order-of-magnitude time reductions
- Learn state-of-the-art parallel schemes for both model training and serving
- Study in detail the bottlenecks at the distributed model training and serving stages

Book Description

Reducing the time cost of machine learning shortens the wait for model training and speeds up the model update cycle. Distributed machine learning enables practitioners to shorten model training and inference time by orders of magnitude. With the help of this practical guide, you'll be able to put your Python development knowledge to work to get up and running with the implementation of distributed machine learning, including multi-node machine learning systems, in no time.

You'll begin by exploring how distributed systems work in the machine learning field and how distributed machine learning is applied to state-of-the-art deep learning models. As you advance, you'll see how to use distributed systems to speed up machine learning model training and serving. You'll also get to grips with applying data-parallel and model-parallel approaches before optimizing the in-parallel model training and serving pipeline in local clusters or cloud environments.

By the end of this book, you'll have gained the knowledge and skills needed to build and deploy an efficient data processing pipeline for machine learning model training and inference in a distributed manner.
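
As a flavor of the data-parallel approach described above, here is a minimal sketch using PyTorch's DistributedDataParallel; the toy linear model, synthetic data, gloo backend, and two-process setup are illustrative assumptions, not code from the book.

```python
# Minimal data-parallel training sketch with PyTorch's
# DistributedDataParallel (DDP). The model, data, and settings below
# are illustrative placeholders.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process joins the same process group; DDP averages
    # gradients across processes automatically during backward().
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # Each rank trains on its own randomly generated batch,
    # standing in for one shard of a real dataset.
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    for _ in range(5):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()  # gradient all-reduce happens here
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # 2 processes = 2 data shards
```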

What you will learn

- Deploy distributed model training and serving pipelines
- Get to grips with the advanced features in TensorFlow and PyTorch
- Mitigate system bottlenecks during in-parallel model training and serving
- Discover the latest techniques built on top of the classical parallelism paradigms (one of them, model parallelism, is sketched after this list)
- Explore advanced features in Megatron-LM and Mesh-TensorFlow
- Use state-of-the-art hardware such as NVLink, NVSwitch, and GPUs
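
Complementing data parallelism, model parallelism splits a single model across devices. Below is a minimal sketch of the idea in PyTorch; the two-layer network, layer sizes, and the assumption of two CUDA devices are illustrative, not the book's code.

```python
# Minimal model-parallel sketch: each half of the network lives on a
# different GPU, and activations are moved between devices in forward().
# Assumes a machine with at least two CUDA devices.
import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 1 on GPU 0, stage 2 on GPU 1.
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        # Ship the intermediate activations to the second device.
        return self.stage2(x.to("cuda:1"))

model = TwoStageNet()
out = model(torch.randn(8, 1024))  # output tensor lives on cuda:1
out.sum().backward()               # autograd routes gradients back across devices
```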

Who this book is for

This book is for data scientists, machine learning engineers, and ML practitioners in both academia and industry. A fundamental understanding of machine learning concepts and a working knowledge of Python programming are assumed. Prior experience implementing ML/DL models with TensorFlow or PyTorch will be beneficial. You'll find this book useful if you are interested in using distributed systems to boost machine learning model training and serving speed.

About the Author

Guanhua Wang is a final-year Computer Science PhD student in the RISELab at UC Berkeley, advised by Professor Ion Stoica. His research lies primarily in the machine learning systems area, including fast collective communication, efficient in-parallel model training, and real-time model serving. His work has attracted attention from both academia and industry: he has been invited to give talks at top-tier universities (MIT, Stanford, CMU, and Princeton) and big tech companies (Facebook/Meta and Microsoft). He received his master's degree from HKUST and his bachelor's degree from Southeast University in China. He also did some cool research on wireless networks. He enjoys playing soccer and has run half-marathons multiple times in the Bay Area of California.

Table of Contents

1. Splitting Input Data
2. Parameter Server and All-Reduce (the all-reduce primitive is sketched after this outline)
3. Building a Data Parallel Training and Serving Pipeline
4. Bottlenecks and Solutions
5. Splitting the Model
6. Pipeline Input and Layer Split
7. Implementing Model Parallel Training and Serving Workflows
8. Achieving Higher Throughput and Lower Latency
9. A Hybrid of Data and Model Parallelism
10. Federated Learning and Edge Devices
11. Elastic Model Training and Serving
12. Advanced Techniques for Further Speed-Ups
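
Chapter 2 centers on the all-reduce collective, the primitive behind data-parallel gradient synchronization. As flagged in the outline above, here is a minimal sketch of it with torch.distributed; the gloo backend, two-process setup, and toy tensors are illustrative assumptions, not code from the book.

```python
# Minimal all-reduce sketch with torch.distributed: every process
# contributes a tensor, and each ends up with the elementwise sum.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each rank holds a different value, e.g. a local gradient shard.
    t = torch.tensor([float(rank + 1)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # in-place; both ranks now hold 1+2=3
    print(f"rank {rank}: {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```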
