Serverless ETL and Analytics with AWS Glue: Your comprehensive reference guide to learning about AWS Glue and its features

Vishal Pathak, Subramanya Vajiraya, Noritaka Sekiyama

  • Publisher: Packt Publishing
  • Publication date: 2022-08-30
  • List price: $2,100
  • VIP price: $1,995 (5% off)
  • Language: English
  • Pages: 434
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1800564988
  • ISBN-13: 9781800564985
  • Related categories: Amazon Web Services, Serverless
  • Overseas special-order title (requires separate checkout)

Product Description

Build efficient data lakes that can scale to virtually unlimited size using AWS Glue

Key Features

- Learn to work with AWS Glue to overcome typical implementation challenges in data lakes
- Create and manage serverless ETL pipelines that can scale to manage big data
- Written by AWS Glue community members, this practical guide shows you how to implement AWS Glue in no time

Book Description

Organizations these days have gravitated toward services such as AWS Glue that undertake undifferentiated heavy lifting and provide serverless Spark, enabling you to create and manage data lakes in a serverless fashion. This guide shows you how AWS Glue can be used to solve real-world problems along with helping you learn about data processing, data integration, and building data lakes.

Beginning with AWS Glue basics, this book teaches you how to perform various aspects of data analysis such as ad hoc queries, data visualization, and real-time analysis using this service. It also provides a walk-through of CI/CD for AWS Glue and shows how to shift left on quality using automated regression tests. You'll find out how data security aspects such as access control, encryption, auditing, and networking are implemented, and you'll get to grips with useful techniques such as picking the right file format, compression, partitioning, and bucketing. As you advance, you'll discover AWS Glue features such as crawlers, Lake Formation, governed tables, lineage, DataBrew, Glue Studio, and custom connectors. The concluding chapters help you to understand various performance tuning, troubleshooting, and monitoring options.

By the end of this AWS book, you'll be able to create, manage, troubleshoot, and deploy ETL pipelines using AWS Glue.

What you will learn

- Apply various AWS Glue features to manage and create data lakes
- Use Glue DataBrew and Glue Studio for data preparation
- Optimize data layout in cloud storage to accelerate analytics workloads
- Manage metadata including database, table, and schema definitions
- Secure your data through access control, encryption, auditing, and networking
- Monitor AWS Glue jobs to detect delays and loss of data
- Integrate Spark ML and SageMaker with AWS Glue to create machine learning models

Who this book is for

This book is for ETL developers, data engineers, and data analysts who want to understand how AWS Glue can help them solve their business problems. Basic knowledge of AWS data services is assumed.

About the Authors

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Subramanya Vajiraya is a Big Data Cloud Engineer at AWS Sydney specializing in AWS Glue. He obtained his Bachelor of Engineering degree specializing in Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015 and obtained his Master of Information Technology degree specializing in Internetworking from the University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workloads and implementing scalable data integration and analytics pipelines on AWS.

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues and helping guide customer architectures.

Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes across AWS Glue, AWS IoT, and big data technologies such as Apache Spark, Hadoop, and Iceberg.

Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.

Ishan Gaur has more than 13 years of IT experience in software development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and DataStage. He currently works at AWS as a senior big data cloud engineer and is an SME of AWS Glue. He is responsible for helping customers to build out large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies, such as Apache Spark, Hadoop, and Hive.

Table of Contents

1. Data Management – Introduction and Concepts
2. Introduction to Important AWS Glue Features
3. Data Ingestion
4. Data Preparation
5. Designing Data Layouts
6. Data Management
7. Metadata Management
8. Data Security
9. Data Sharing
10. Data Pipeline Management
11. Monitoring
12. Tuning, Debugging, and Troubleshooting
13. Data Analysis
14. Machine Learning Integration
15. Architecting Data Lakes for Real-World Scenarios and Edge Cases
