Learning Spark SQL

Aurobindo Sarkar

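  • Publisher: Packt Publishing
  • Publication date: 2017-09-04
  • List price: $2,200
  • VIP price: $2,090 (5% off)
  • Language: English
  • Pages: 452
  • Binding: Paperback
  • ISBN: 1785888358
  • ISBN-13: 9781785888359
  • Related category: SparkSQL
  • Ordered from the supplier once you place your order (approx. 3 to 4 weeks)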

Product Description

Design, implement, and deliver successful streaming applications, machine learning pipelines, and graph applications using the Spark SQL API

About This Book

  • Learn about the design and implementation of streaming applications, machine learning pipelines, deep learning, and large-scale graph processing applications using Spark SQL APIs and Scala.
  • Learn data exploration, data munging, and how to process structured and semi-structured data using real-world datasets and gain hands-on exposure to the issues and challenges of working with noisy and "dirty" real-world data.
  • Understand design considerations for scalability and performance in web-scale Spark application architectures.

Who This Book Is For

If you are a developer, engineer, or architect and want to learn how to use Apache Spark in a web-scale project, then this is the book for you. It is assumed that you have prior knowledge of SQL querying. Basic programming knowledge of Scala, Java, R, or Python is all you need to get started with this book.

What You Will Learn

  • Familiarize yourself with Spark SQL programming, including working with DataFrame/Dataset API and SQL
  • Perform a series of hands-on exercises with different types of data sources, including CSV, JSON, Avro, MySQL, and MongoDB
  • Perform data quality checks, data visualization, and basic statistical analysis tasks
  • Perform data munging tasks on publicly available datasets
  • Learn how to use Spark SQL and Apache Kafka to build streaming applications
  • Learn key performance-tuning tips and tricks in Spark SQL applications
  • Learn key architectural components and patterns in large-scale Spark SQL applications

In Detail

In the past year, Apache Spark has been increasingly adopted for the development of distributed applications. Spark SQL APIs provide an optimized interface that helps developers build such applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. Understanding design and implementation best practices before you start your project will help you manage that complexity.

This book gives an insight into the engineering practices used to design and build real-world, Spark-based applications. The book's hands-on examples will give you the required confidence to work on any future projects you encounter in Spark SQL.

It starts by familiarizing you with data exploration and data munging tasks using Spark SQL and Scala. Extensive code examples will help you understand the methods used to implement typical use cases for various types of applications. You will get a walkthrough of the key concepts and terms that are common to streaming, machine learning, and graph applications. You will also learn key performance-tuning details in Spark SQL applications, including cost-based optimization (Spark 2.2). Finally, you will move on to learning how such systems are architected and deployed for the successful delivery of your project.

Style and approach

This book is a hands-on guide to designing, building, and deploying Spark SQL-centric production applications at scale.
