Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala

Eric Tome, Rupam Bhattacharjee, David Radford

  • Publisher: Packt Publishing
  • Publication date: 2024-01-31
  • List price: $1,740
  • VIP price: $1,653 (95% of list price)
  • Language: English
  • Pages: 300
  • Binding: Quality Paper (also called trade paper)
  • ISBN: 1804612588
  • ISBN-13: 9781804612583
  • Related categories: JVM languages, Spark
  • Overseas import (requires separate checkout)

Product Description

Take your data engineering skills to the next level by learning how to utilize Scala and functional programming to create continuous and scheduled pipelines that ingest, transform, and aggregate data


Key Features:


  • Transform data into a clean and trusted source of information for your organization using Scala
  • Build streaming and batch-processing pipelines with step-by-step explanations
  • Implement and orchestrate your pipelines by following CI/CD best practices and test-driven development (TDD)
  • Purchase of the print or Kindle book includes a free PDF eBook


Book Description:


Most data engineers know that performance problems in a distributed computing environment can easily undermine the overall efficiency and effectiveness of data engineering tasks. While Python remains a popular choice for data engineering because of its ease of use, Scala shines in scenarios where the performance of distributed data processing is paramount.


This book will teach you how to leverage the Scala programming language on the Spark framework and use the latest cloud technologies to build continuous and triggered data pipelines. You'll do this by setting up a data engineering environment for local development and scalable distributed cloud deployments, using data engineering best practices, test-driven development, and CI/CD. You'll also get to grips with the DataFrame, Dataset, and Spark SQL APIs and how to use them. Data profiling and quality in Scala are also covered, alongside techniques for orchestrating and performance-tuning your end-to-end pipelines to deliver data to your end users.
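As a flavor of the three Spark APIs mentioned above, here is a minimal, self-contained Scala sketch that expresses the same aggregation three ways. The Sale case class, the column names, and the sales.csv path are illustrative assumptions, not examples taken from the book.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical record type for the illustration
case class Sale(region: String, amount: Double)

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("api-comparison")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // DataFrame API: untyped rows, columns resolved at runtime
    val salesDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("sales.csv")
    val byRegionDf = salesDf.groupBy("region").agg(sum("amount").as("total"))

    // Dataset API: the same data with a compile-time checked case class
    val salesDs = salesDf.as[Sale]
    val byRegionDs = salesDs.groupByKey(_.region).mapGroups { (region, rows) =>
      (region, rows.map(_.amount).sum)
    }

    // Spark SQL API: the identical aggregation expressed as SQL over a temp view
    salesDf.createOrReplaceTempView("sales")
    val byRegionSql =
      spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

    byRegionDf.show(); byRegionDs.show(); byRegionSql.show()
    spark.stop()
  }
}
```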


By the end of this book, you will be able to build streaming and batch data pipelines using Scala while following software engineering best practices.


What You Will Learn:


  • Set up your development environment to build pipelines in Scala
  • Get to grips with polymorphic functions, type parameterization, and Scala implicits
  • Use Spark DataFrames, Datasets, and Spark SQL with Scala
  • Read and write data to object stores
  • Profile and clean your data using Deequ (see the sketch after this list)
  • Performance tune your data pipelines using Scala
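
As a brief illustration of the Deequ-based profiling and cleaning mentioned in the list above, the sketch below declares a few data quality constraints and verifies them against a DataFrame. The orders.csv path and the id and amount columns are hypothetical; only the Deequ calls (VerificationSuite, Check, CheckLevel, CheckStatus) reflect the library's actual API.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

object DataQualityCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("deequ-example")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input: any DataFrame with "id" and "amount" columns would do
    val orders = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("orders.csv")

    // Declare the constraints the data must satisfy, then run the verification
    val result = VerificationSuite()
      .onData(orders)
      .addCheck(
        Check(CheckLevel.Error, "basic order checks")
          .isComplete("id")        // no nulls in the key column
          .isUnique("id")          // key column has no duplicates
          .isNonNegative("amount") // amounts are never negative
      )
      .run()

    if (result.status == CheckStatus.Success)
      println("All data quality checks passed")
    else
      println("Some data quality checks failed")

    spark.stop()
  }
}
```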


Who this book is for:


This book is for data engineers who have experience working with data and want to learn how to transform raw data into a clean, trusted, and valuable source of information for their organization using Scala and the latest cloud technologies.
