Engineering Lakehouses with Open Table Formats: Build scalable and efficient lakehouses with Apache Iceberg, Apache Hudi, and Delta Lake

Dipankar Mazumdar, Vinoth Govindarajan

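  • Publisher: Packt Publishing
  • Publication date: 2025-12-26
  • List price: $1,690
  • VIP price: $1,606 (5% off)
  • Language: English
  • Pages: 414
  • Binding: Quality paper (also called trade paper)
  • ISBN: 1836207239
  • ISBN-13: 9781836207238
  • Related category: Data-visualization
  • Overseas special-order title (requires separate checkout)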

Product Description

Jump-start your journey toward mastering open data architectural patterns by learning the fundamentals and applications of open table formats

Key Features:

- Build lakehouses with open table formats using compute engines such as Apache Spark, Flink, Trino, and Python

- Optimize lakehouses with techniques such as pruning, partitioning, compaction, indexing, and clustering

- Find out how to enable seamless integration, data management, and interoperability using Apache XTable

- Purchase of the print or Kindle book includes a free PDF eBook

Book Description:

Engineering Lakehouses with Open Table Formats provides detailed insights into lakehouse concepts, and dives deep into the practical implementation of open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake.

You'll explore the internals of a table format and learn in detail about the transactional capabilities of lakehouses. You'll also get hands-on experience with each table format through exercises that use popular compute engines, such as Apache Spark, Flink, Trino, and Python-based tools. The book addresses advanced topics, including performance optimization techniques and interoperability among different formats, equipping you to build production-ready lakehouses. With step-by-step explanations, you'll get to grips with the key components of lakehouse architecture and learn how to build, maintain, and optimize them.

By the end of this book, you'll be proficient in evaluating and implementing open table formats, optimizing lakehouse performance, and applying these concepts to real-world scenarios, ensuring you make informed decisions in selecting the right architecture for your organization's data needs.

What You Will Learn:

- Explore lakehouse fundamentals, such as table formats, file formats, compute engines, and catalogs

- Gain a complete understanding of data lifecycle management in lakehouses

- Learn how to systematically evaluate and choose the right lakehouse table format

- Optimize performance with sorting, clustering, and indexing techniques

- Use data stored in open table formats with ML frameworks such as TensorFlow and MLflow

- Interoperate across different table formats with Apache XTable and UniForm

- Secure your lakehouse with access controls and ensure regulatory compliance

Who this book is for:

This book is for data engineers, software engineers, and data architects who want to deepen their understanding of open table formats, such as Apache Iceberg, Apache Hudi, and Delta Lake, and see how they are used to build lakehouses. It is also valuable for professionals working with traditional data warehouses, relational databases, and data lakes who wish to transition to an open data architectural pattern. Basic knowledge of databases, Python, Apache Spark, Java, and SQL is recommended for a smooth learning experience.

Table of Contents:

- Open Data Lakehouse: A New Architectural Paradigm

- Transactional Capabilities of the Lakehouse

- Apache Iceberg Deep Dive

- Apache Hudi Deep Dive

- Delta Lake Deep Dive

- Catalog and Metadata Management

- Interoperability in Lakehouses

- Performance Optimization and Tuning in a Lakehouse

- Data Governance and Security in Lakehouses

- Evaluating and Selecting Open Table Formats

- Real-World Applications and Learnings
