Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling

Javier Luraschi, Kevin Kuo, Edgar Ruiz

Product Description

If you’re like most R users, you have deep knowledge of and love for statistics. But as your organization continues to collect huge amounts of data, adding tools such as Apache Spark makes a lot of sense. With this practical book, data scientists and professionals working with large-scale data applications will learn how to use Spark from R to tackle big data and big compute problems.

Authors Javier Luraschi, Kevin Kuo, and Edgar Ruiz show you how to use R with Spark to solve different data analysis problems. This book covers relevant data science topics, cluster computing, and issues that should interest even the most advanced users.

  • Analyze, explore, transform, and visualize data in Apache Spark with R
  • Create statistical models to extract information and predict outcomes; automate the process in production-ready workflows
  • Perform analysis and modeling across many machines using distributed computing techniques
  • Use large-scale data from multiple sources and different formats with ease from within Spark
  • Learn about alternative modeling frameworks for graph processing, geospatial analysis, and genomics at scale
  • Dive into advanced topics including custom transformations, real-time data processing, and creating custom Spark extensions

About the Authors

Javier is a software engineer with experience in technologies ranging from desktop, web, mobile, and backend to augmented reality and deep learning applications. He previously worked for Microsoft Research and SAP and holds a double degree in Mathematics and Software Engineering. He is the author of various R packages, such as sparklyr, cloudml, r2d3, mlflow, tfdeploy, and kerasjs.

Kevin builds open source libraries for machine learning and model deployment. He has held data science positions in various industries, including insurance, where he was a credentialed actuary. Kevin is the creator of mlflow, mleap, and sparkxgb, among other R packages. He is also an amateur mixologist and sommelier.

Edgar Ruiz has a background in deploying enterprise reporting and business intelligence solutions. He is the author of multiple articles and blog posts sharing analytics insights and server infrastructure for data science. Edgar is the author and administrator of the db.rstudio.com web site and the current administrator of the sparklyr web site. He is also a co-author of the dbplyr package and the creator of the dbplot, tidypredict, and modeldb packages.
