Frank Kane's Taming Big Data with Apache Spark and Python

Frank Kane

Product Description

Key Features

  • Understand how Spark can be distributed across computing clusters
  • Develop and run Spark jobs efficiently using Python
  • A hands-on tutorial by Frank Kane with over 15 real-world examples teaching you Big Data processing with Spark

Book Description

Frank Kane's Taming Big Data with Apache Spark and Python is your companion to learning Apache Spark in a hands-on manner. Frank will start you off by teaching you how to set up Spark on a single system or on a cluster, and you'll soon move on to analyzing large data sets using Spark RDD, and developing and running effective Spark jobs quickly using Python.

Apache Spark has emerged as the next big thing in the Big Data domain, rising from an emerging technology to an established superstar in just a few years. Spark allows you to quickly extract actionable insights from large amounts of data in real time, making it an essential tool in many modern businesses.

Frank has packed this book with over 15 interactive, fun-filled examples relevant to the real world, and he will empower you to understand the Spark ecosystem and implement production-grade real-time Spark projects with ease.

What you will learn

  • Find out how you can identify Big Data problems as Spark problems
  • Install and run Apache Spark on your computer or on a cluster
  • Analyze large data sets across many CPUs using Spark's Resilient Distributed Datasets
  • Implement machine learning on Spark using the MLlib library
  • Process continuous streams of data in real time using the Spark Streaming module
  • Perform complex network analysis using Spark's GraphX library
  • Use Amazon's Elastic MapReduce service to run your Spark jobs on a cluster

About the Author

My name is Frank Kane. I spent nine years at Amazon and IMDb, wrangling millions of customer ratings and customer transactions to produce things such as personalized recommendations for movies and products and "people who bought this also bought." I tell you, I wish we had Apache Spark back then, when I spent years trying to solve these problems there. I hold 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, I left to start my own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.

Table of Contents

  1. Getting Started with Spark
  2. Spark Basics and Simple Examples
  3. Advanced Examples of Spark Programs
  4. Running Spark on a Cluster
  5. SparkSQL, DataFrames, and Datasets
  6. Other Spark Technologies and Libraries
  7. Where to Go From Here? - Learning More About Spark and Data Science
