Distributed Machine Learning with PySpark: Migrating Effortlessly from Pandas and Scikit-Learn
Testas, Abdelaziz
- Publisher: Apress
- Publication date: 2023-11-24
- Price: $1,880
- VIP price: 5% off, $1,786
- Language: English
- Pages: 490
- Binding: Trade paperback (quality paper)
- ISBN: 1484297504
- ISBN-13: 9781484297506
- Related categories: Spark, Machine Learning
Overseas special-order book (requires separate checkout)
Product Description
Migrate from pandas and scikit-learn to PySpark to handle vast amounts of data and process it faster. This book shows you how to make the transition by adapting your existing skills and leveraging the similarities in syntax, functionality, and interoperability between these tools.
Distributed Machine Learning with PySpark offers a roadmap for data scientists considering a transition from small-data libraries (pandas/scikit-learn) to big data processing and machine learning with PySpark. You will learn to translate Python code from pandas/scikit-learn to PySpark to preprocess large volumes of data and to build, train, test, and evaluate popular machine learning algorithms such as linear and logistic regression, decision trees, random forests, support vector machines, Naïve Bayes, and neural networks.
After completing this book, you will understand the foundational concepts of data preparation and machine learning and will have the skills necessary to apply these methods using PySpark, the industry standard for building scalable ML data pipelines.
What You Will Learn
- Master the fundamentals of supervised learning, unsupervised learning, NLP, and recommender systems
- Understand the differences between PySpark, scikit-learn, and pandas
- Perform linear regression, logistic regression, and decision tree regression with pandas, scikit-learn, and PySpark
- Distinguish between the pipelines of PySpark and scikit-learn
Who This Book Is For
Data scientists, data engineers, and machine learning practitioners who have some familiarity with Python, but who are new to distributed machine learning and the PySpark framework.
About the Author
Abdelaziz Testas, Ph.D., is a data scientist with over a decade of experience in data analysis and machine learning, specializing in the use of standard Python libraries and Spark distributed computing. He holds a Ph.D. in Economics from Leeds University and a Master's degree in Finance from Glasgow University. He has also earned several certificates in computer science and data science.
In the last ten years, he has worked for Nielsen in Fremont, California as a Lead Data Scientist focused on improving the company's audience measurement through planning, initiating, and executing end-to-end data science projects and methodology work. He has created advanced solutions for Nielsen's digital ad and content rating products by leveraging subject matter expertise in media measurement and data science. He is passionate about helping others improve their machine learning skills and workflows, and is excited to share his knowledge and experience with a wider audience through this book.