Real-World SRE: The Survival Guide for Responding to a System Outage and Maximizing Uptime

Nat Welch


Product Description

This hands-on survival manual will give you the tools to confidently prepare for and respond to a system outage.

Key Features

  • Proven methods for keeping your website running
  • A survival guide for incident response
  • Written by an ex-Google SRE expert

Book Description

Real-World SRE is the go-to survival guide for the software developer in the middle of a catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book provides a step-by-step framework to follow when your website is down and the countdown to fix it is on.

Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response.

Real-World SRE goes beyond reacting to disaster: uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE equips you to set up your own robust plan of action to see you through a company-wide website crisis.

The final chapter of Real-World SRE is dedicated to acing SRE interviews, whether you are aiming for your first job or a valued promotion.

What you will learn

  • Monitor for approaching catastrophic failure
  • Alert your team to an outage emergency
  • Dissect your incident response strategies
  • Test automation tools and build your own software
  • Predict bottlenecks and fight for user experience
  • Eliminate the competition in an SRE interview

Who this book is for

Real-World SRE is aimed at software developers who are facing a website crisis or who want to improve the reliability of their company's software. Newcomers to Site Reliability Engineering looking to succeed in interviews will also find it invaluable.

Table of Contents

  1. Introduction
  2. Monitoring
  3. Incident Response
  4. Postmortems
  5. Testing & Releasing
  6. Capacity Planning
  7. Building Tools
  8. User Experience
  9. Networking Foundations
  10. Linux And Cloud Foundations
