Full-Stack Data Analytics with Spark

Russell Jurney

Product Description

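This book presents the author's Agile Data Science methodology. Drawing on his years of industry experience, it gives data science teams a set of practices for conducting data science research in a way that resembles agile development. The entire book builds full-stack data analytics on Spark and demonstrates tools commonly used in industry, covering every stage from front-end display to back-end processing. It guides data scientists step by step in quickly turning theory into real user-facing applications, so that readers can create real value from data while continually refining their own research. The book is suitable for beginners, and data scientists, engineers, and analysts can all benefit from it.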

About the Author

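Author: Russell Jurney (USA). Translator: Wang Daoyuan (王道遠).
Russell Jurney honed his data analysis skills in casino gaming, building web applications to analyze the performance of slot machines in the United States and Mexico. After stints in entrepreneurship, interactive media, and journalism, he moved to Silicon Valley, where he built analytics applications at Ning and LinkedIn. Russell is now principal consultant at Data Syndrome, where he helps companies build analytics products using the principles and methods described in this book.
Wang Daoyuan graduated from Zhejiang University and currently works in Alibaba's Computing Platform business unit. Before joining Alibaba, he spent five years in the big data department of Intel Asia-Pacific Research and Development Ltd.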

Table of Contents
Preface
Part I. Setup
Chapter 1. Theory
Introduction
Definition
Methodology
Agile Data Science Manifesto
The Problem with the Waterfall Model
Research Versus Application Development
The Problem with Agile Software Development
Eventual Quality: Paying Down Technical Debt
The Pull of the Waterfall Model
The Data Science Process
Setting Expectations
Data Science Team Roles
Recognizing the Opportunity and the Challenge
Adapting to Change
Notes on Process
Code Review and Pair Programming
Agile Environments: Improving Productivity
Realizing Ideas with Large-Format Printing
Chapter 2. Agile Tools
Scalability = Ease of Use
Data Processing in Agile Data Science
Setting Up a Local Environment
System Requirements
Configuring Vagrant
Downloading the Data
Setting Up an EC2 Environment
Downloading the Data
Downloading and Running the Code
Downloading the Code
Running the Code
Jupyter Notebooks
Toolset Overview
Agile Stack Requirements
Python 3
Serializing Events with JSON Lines and Parquet
Collecting Data
Processing Data with Spark
Publishing Data with MongoDB
Searching Data with Elasticsearch
Distributing Streaming Data with Apache Kafka
Processing Streaming Data with PySpark Streaming
Machine Learning with scikit-learn and Spark MLlib
Scheduling with Apache Airflow (Incubating)
Reflecting on Our Workflow
Lightweight Web Applications
Presenting Data
Chapter Summary
Chapter 3. Data
Flight Data
Flight On-Time Performance Data
OpenFlights Database
Weather Data
Data Processing in Agile Data Science
Structured vs. Semi-structured Data
SQL vs. NoSQL
SQL
NoSQL and Dataflow Programming
Spark: SQL + NoSQL
Schemas in NoSQL
Data Serialization
Extracting and Exposing Features in Evolving Schemas
Chapter Summary
Part II. Climbing the Pyramid
Chapter 4. Collecting and Displaying Records
Putting It All Together
Collecting and Serializing Flight Data
Processing and Publishing Flight Records
Publishing Flight Records to MongoDB
Presenting Flight Records in a Browser
Serving Flight Information with Flask and pymongo
Rendering HTML5 Pages with Jinja2
Agile Checkpoint
Listing Flight Records
Listing Flight Records with MongoDB
Paginating Data
Searching Flight Data
Creating the Index
Publishing Flight Data to Elasticsearch
Searching Flight Data on the Web
Chapter Summary
Chapter 5. Visualizing Data with Charts
Chart Quality: Iteration Is Essential
Scaling a Database with the Publish/Decorate Model
First-Order Form
Second-Order Form
Third-Order Form
Choosing a Form
Exploring Seasonality
Querying and Presenting Total Flight Counts
Extracting "Metal" (Airplanes [Entities])
Extracting Tail Numbers
Assessing Airplane Records
Data Enrichment
Reverse Engineering a Web Form
Gathering Tail Numbers
Automating Form Submission
Extracting Data from HTML
Evaluating the Enriched Data
Chapter Summary
Chapter 6. Exploring Data with Reports
Extracting Airlines as Entities
Defining Airlines as Groups of Airplanes Using PySpark
Querying Airline Data in MongoDB
Building an Airline Page in Flask
Adding a Link Back to the Airline Page
Creating a Home Page Listing All Airlines
Curating Ontologies of Semi-structured Data
Improving the Airline Page
Adding Names to Airline Codes
Incorporating Wikipedia Content
Publishing the Enriched Airlines Table to MongoDB
Enriching Airline Information on the Web
Investigating Airplanes (Entities)
SQL Subqueries vs. Dataflow Programming
Dataflow Programming Without Subqueries
Subqueries in Spark SQL
Creating an Airplanes Home Page
Adding Search to the Airplanes Page
Creating an Airplane Manufacturers Bar Chart
Iterating on the Airplane Manufacturers Bar Chart
Entity Resolution: Another Chart Iteration
Chapter Summary
Chapter 7. Making Predictions
The Role of Predictions
Predict What?
Introduction to Predictive Analytics
Making Predictions
Exploring Flight Delays
Extracting Features with PySpark
Building a Regression Model with scikit-learn
Loading the Data
Sampling the Data
Vectorizing the Results
Preparing the Training Data
Vectorizing the Features
Sparse vs. Dense Matrices
Preparing an Experiment
Training the Model
Testing the Model
Summary
Building a Classifier with Spark MLlib
Loading the Training Data with a Specified Schema
Handling Nulls
Replacing FlightNum (Flight Number) with Route
Bucketizing a Continuous Variable for Classification
Vectorizing Features with pyspark.ml.feature
Classification with Spark ML
Chapter Summary
Chapter 8. Deploying Predictive Systems
Deploying a scikit-learn Application as a Web Service
Saving and Loading scikit-learn Models
Groundwork for Serving Predictions
Creating an API for the Flight Delay Regression
Testing the API
Using the API in the Product
Deploying Spark ML Applications in Batch Mode with Airflow
Gathering Training Data in Production
Training, Storing, and Loading Spark ML Models
Creating Prediction Requests in MongoDB
Fetching Prediction Requests from MongoDB
Making Predictions in Batch Mode with Spark ML
Storing Predictions in MongoDB
Displaying Batch Prediction Results in the Web Application
Automating the Workflow with Apache Airflow (Incubating)
Summary
Deploying Spark ML Applications in Streaming Mode with Spark Streaming
Gathering Training Data in Production
Training, Storing, and Loading Spark ML Models
Sending Prediction Requests to Kafka
Making Predictions with Spark Streaming
Testing the Entire System
Chapter Summary
Chapter 9. Improving Predictions
Fixing the Prediction Problem
When to Improve Predictions
Improving Prediction Performance
Experimental Adhesion Method: See What Sticks
Establishing Rigorous Metrics for Experiments
Time of Day as a Feature
Incorporating Airplane Data
Extracting Airplane Features
Incorporating Airplane Features into the Classifier Model
Incorporating Flight Time
Chapter Summary
Appendix A. Installation Guide
Installing Hadoop
Installing Spark
Installing MongoDB
Installing the MongoDB Java Driver
Installing mongo-hadoop
Building mongo-hadoop
Installing pymongo_spark
Installing Elasticsearch
Installing the Elasticsearch Support Library for Hadoop
Configuring Our Spark Environment
Installing Kafka
Installing scikit-learn
Installing Zeppelin