Data Engineering with Azure Databricks: Design, build, and optimize scalable data pipelines and analytics solutions with Azure Databricks

Dmitry Foshin, Dmitry Anoshin, Tonya Chernyshova

  • Publisher: Packt Publishing
  • Publication date: 2026-04-30
  • List price: $1,840
  • VIP price: $1,748 (95% of list)
  • Language: English
  • Pages: 412
  • Binding: Quality Paper (also called trade paper)
  • ISBN: 180610637X
  • ISBN-13: 9781806106370
  • Related categories: Microsoft Azure
  • Imported title, ordered from overseas (checked out separately)


Product Description

Master end-to-end data engineering on Azure Databricks. From data ingestion and Delta Lake to CI/CD and real-time streaming, build secure, scalable, and performant data solutions with Spark, Unity Catalog, and ML tools.

Key Features:

- Build scalable data pipelines using Apache Spark and Delta Lake

- Automate workflows and manage data governance with Unity Catalog

- Learn real-time processing and structured streaming with practical use cases

- Implement CI/CD, DevOps, and security for production-ready data solutions

- Explore Databricks-native ML, AutoML, and Generative AI integration

Book Description:

Data Engineering with Azure Databricks is your essential guide to building scalable, secure, and high-performing data pipelines using the powerful Databricks platform on Azure. Designed for data engineers, architects, and developers, this book demystifies the complexities of Spark-based workloads, Delta Lake, Unity Catalog, and real-time data processing.

Beginning with the foundational role of Azure Databricks in modern data engineering, you'll explore how to set up robust environments, manage data ingestion with Auto Loader, optimize Spark performance, and orchestrate complex workflows using tools like Azure Data Factory and Airflow.
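To give a flavor of the ingestion pattern described above, here is a minimal Auto Loader sketch in PySpark. The paths, file format, and table name are illustrative assumptions, not examples from the book; on Databricks the `spark` session is provided by the runtime.

```python
# Minimal Auto Loader sketch (illustrative names; assumes a Databricks runtime
# where `spark` exists and the paths point at real cloud storage).
def ingest_with_auto_loader(spark, source_path, checkpoint_path, target_table):
    """Incrementally load new JSON files from cloud storage into a Delta table."""
    return (
        spark.readStream
        .format("cloudFiles")                                  # Auto Loader source
        .option("cloudFiles.format", "json")                   # raw file format
        .option("cloudFiles.schemaLocation", checkpoint_path)  # inferred-schema tracking
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)         # exactly-once progress
        .trigger(availableNow=True)                            # drain the backlog, then stop
        .toTable(target_table)
    )
```

With `trigger(availableNow=True)`, the same streaming code doubles as an incremental batch job, which is the usual starting point before moving to continuous triggers.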

The book offers deep dives into structured streaming, Delta Live Tables, and Delta Lake's ACID features for data reliability and schema evolution. You'll also learn how to manage security, compliance, and access controls using Unity Catalog, and gain insights into managing CI/CD pipelines with Azure DevOps and Terraform.
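As a hedged sketch of the Delta Lake reliability features mentioned above, the following upsert combines an ACID `MERGE` with automatic schema evolution. The `id` join key and table names are assumptions for illustration; the `delta` package ships with Databricks runtimes.

```python
# Sketch of a Delta Lake upsert with automatic schema evolution
# (illustrative `id` key and table names; assumes a Databricks runtime).
def upsert_with_schema_evolution(spark, updates_df, target_table):
    from delta.tables import DeltaTable  # bundled with Databricks runtimes

    # Allow new columns in `updates_df` to be added to the target schema.
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

    (DeltaTable.forName(spark, target_table).alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")  # match on the business key
        .whenMatchedUpdateAll()                       # update existing rows
        .whenNotMatchedInsertAll()                    # insert new rows
        .execute())
```

Because the merge runs as a single ACID transaction, readers never observe a half-applied batch, which is the property the book leans on for reliable pipelines.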

With a special focus on machine learning and generative AI, the final chapters guide you in automating model workflows, leveraging MLflow, and fine-tuning large language models on Databricks. Whether you're building a modern data lakehouse or operationalizing analytics at scale, this book provides the tools and insights you need.
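The MLflow workflow mentioned above can be sketched as a small tracking wrapper. Here `train_fn` is a hypothetical callable returning a fitted scikit-learn model and a metrics dict; `mlflow` comes preinstalled on Databricks ML runtimes.

```python
# Hedged MLflow tracking sketch (assumes a Databricks ML runtime and a
# hypothetical `train_fn` that returns a scikit-learn model plus metrics).
def train_and_log(params, train_fn):
    import mlflow  # preinstalled on Databricks ML runtimes

    with mlflow.start_run():
        mlflow.log_params(params)                 # record hyperparameters
        model, metrics = train_fn(params)         # user-supplied training step
        mlflow.log_metrics(metrics)               # record evaluation results
        mlflow.sklearn.log_model(model, "model")  # version the fitted model
    return model
```

Logging params, metrics, and the model artifact in one run is what later lets the Model Registry and deployment steps pick up a reproducible candidate.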

What You Will Learn:

- Set up a full-featured Azure Databricks environment

- Implement batch and streaming ingestion using Auto Loader

- Optimize Spark jobs with partitioning and caching

- Build real-time pipelines with structured streaming and DLT

- Manage data governance using Unity Catalog

- Orchestrate production workflows with Databricks Jobs and Azure Data Factory (ADF)

- Apply CI/CD best practices with Azure DevOps and Git

- Secure data with RBAC, encryption, and compliance standards

- Use MLflow and Feature Store for ML pipelines

- Build generative AI applications in Databricks
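The partitioning-and-caching bullet above can be sketched as follows. Table and column names are assumptions; caching pays off here because one DataFrame feeds two aggregations, and partitioning the output by date enables pruning in later date-filtered queries.

```python
# Illustrative partitioning and caching pattern (assumed table/column names;
# requires a running Spark session with Delta support, as on Databricks).
def build_daily_aggregates(spark, source_table):
    from pyspark.sql import functions as F

    events = spark.table(source_table).cache()  # reused by both aggregates below
    daily = events.groupBy("event_date").agg(F.count("*").alias("events"))
    by_user = (events.groupBy("event_date", "user_id")
                     .agg(F.count("*").alias("events")))

    (daily.write.format("delta")
        .mode("overwrite")
        .partitionBy("event_date")              # physical layout for pruning
        .saveAsTable("analytics.daily_events"))
    return daily, by_user
```

The rule of thumb the pattern illustrates: cache only frames that are reused by multiple actions, and partition on low-cardinality columns that queries actually filter on.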

Who this book is for:

This book is for data engineers, solution architects, cloud professionals, and software engineers seeking to build robust and scalable data pipelines using Azure Databricks. Whether you're migrating legacy systems, implementing a modern lakehouse architecture, or optimizing data workflows for performance, this guide will help you leverage the full power of Databricks on Azure. A basic understanding of Python, Spark, and cloud infrastructure is recommended.

Table of Contents

- The role of Azure Databricks in modern data engineering

- Setting up an end-to-end Azure Databricks environment

- Data ingestion strategies for Azure Databricks

- Deep dive into Apache Spark on Azure Databricks

- Streaming architectures with structured streaming

- Working with Delta Lake: ACID transactions & schema evolution

- Automating data pipelines with Delta Live Tables (DLT)

- Orchestrating data workflows: from notebooks to production

- CI/CD and DevOps for Azure Databricks

- Optimizing query performance and cost management

- Security, compliance, and data governance

- Machine learning, AutoML, and generative AI in Databricks
