Big Data for Chimps: A Guide to Massive-Scale Data Processing in Practice (大数据猩球:海量数据处理实践指南)

Philip Kromer, Russell Jurney

  • Publisher: Publishing House of Electronics Industry (電子工業)
  • Publication date: 2016-08-01
  • List price: $414
  • Sale price: $352 (15% off)
  • Language: Simplified Chinese
  • Pages: 192
  • Binding: Paperback
  • ISBN: 7121294184
  • ISBN-13: 9787121294181
  • Related categories: Big Data
  • Ships immediately (stock: 1)

Product description

<Synopsis>

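This book explains big data from a practical, hands-on perspective. Using the metaphor of chimpanzees and elephants and working with a baseball statistics dataset, it shows how to process data at scale with tools such as Apache Hadoop and Pig. By working with real data to solve real problems, the authors also distill a set of practical analytic patterns, presented through worked examples, that give creative analysts the most powerful and valuable methods. The book is particularly suited to readers who need a big data toolkit for solving practical problems.

<Table of contents>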

Preface
Part I. Getting Started: Theory and Tools
Chapter 1. Hadoop Basics
Chimpanzee and Elephant Start a Business
Map-Only Jobs: Processing Records One at a Time
Pig Latin Map-Only Job
Creating a Docker Hadoop Cluster
Running the Job
Summary
Chapter 2. MapReduce
Chimpanzee and Elephant Save Christmas
Trouble in Toyland
Chimpanzees Turn Letters into Labeled Toy Forms
Little Elephants Deliver Toy Forms to the Appropriate Workbench
Example: Reindeer Games
UFO Data
Grouping UFO Sightings by Reporting Delay
Mapper
Reducer
Data Visualization
Reindeer Summary
Hadoop Versus Traditional Databases
The MapReduce Haiku
The Map Phase in Brief
The Group-Sort Phase in Brief
The Reduce Phase in Brief
Summary
Chapter 3. A Quick Look at the Baseball Dataset
The Data
Acronyms and Terminology
Rules and Goals
Performance Metrics
Summary
Chapter 4. Introduction to Pig
Pig Helps Hadoop Work with Tables, Not Records
Wikipedia Visitor Counts
Fundamental Data Operations
Control Operations
Pipeline Operations
Structural Operations
LOAD Locates and Describes Your Data
Simple Types
Complex Type 1, Tuples: Fixed-Length Sequences of Typed Fields
Complex Type 2, Bags: Unbounded Collections of Tuples
Defining the Schema of a Transformed Record
STORE Writes Data to Disk
Helper Commands
DESCRIBE
DUMP
SAMPLE
ILLUSTRATE
EXPLAIN
Pig Functions
Piggybank
Apache DataFu
Summary
Part II. Tactics: Analytic Patterns
Chapter 5. Map-Only Operations
Pattern in Use
Eliminating Data
Selecting Records That Satisfy a Condition: FILTER and Friends
Selecting Records That Satisfy Multiple Conditions
Selecting or Rejecting Records with a Null Value
Selecting Records That Match a Regular Expression (MATCHES)
Matching Records Against a Fixed List of Values
Projecting Fields by Name
Using FOREACH to Select, Rename, and Reorder Fields
Extracting a Random Sample of Records
Extracting a Consistent Sample by Key
Rough Sampling by Loading Only Some part- Files
Selecting a Fixed Number of Records with LIMIT
Other Data Elimination Patterns
Transforming Records
Transforming Records One at a Time with FOREACH
A Nested FOREACH Allows Intermediate Expressions
Formatting a String According to a Template
Assembling Literals with Complex Types
Manipulating the Type of a Field
Integers, Floats, and Rounding
Calling a User-Defined Function from an External Package
Operations That Split One Table into Many
Directing Data Conditionally into Multiple Data Flows (SPLIT)
Operations That Combine Several Tables into One
Merging Several Pig Relations into a Single Table (Stacking Rowsets)
Summary
Chapter 6. Grouping Operations
Grouping Records into a Bag by Key
Pattern in Use
Counting Occurrences of a Key
Representing a Collection of Values with a Delimited String
Representing a Complex Data Structure with a Delimited String
Representing a Complex Data Structure with a JSON-Encoded String
Grouping and Aggregating
Aggregating Statistics of a Group
Completely Summarizing a Field
Summarizing Aggregate Statistics of a Full Table
Summarizing a String Field
Calculating the Distribution of Numeric Values with a Histogram
Pattern in Use
Binning Data for a Histogram
Choosing a Bin Size
Interpreting Histograms and Quantiles
Binning Data into Exponentially Sized Buckets
Creating Pig Macros for Common Code Snippets
Distribution of Games Played
Extreme Cases and Confounding Factors
Don't Trust Distributions at the Tails
Calculating a Relative Distribution Histogram
Reinjecting Global Values
Calculating a Histogram Within a Group
Exporting Readable Results
The Summing Trick
Counting Conditional Subsets of a Group: The Summing Trick
Summarizing Multiple Subsets of a Group Simultaneously
Testing for the Absence of a Value Within a Group
Summary
References
Chapter 7. Joining Tables
Matching Records Between Tables (Inner Join)
Joining Records in One Table with Directly Matching Records in Another (Direct Inner Join)
How a Join Works
A Join Is a COGROUP+FLATTEN
A Join Is a MapReduce Job with a Secondary Sort on the Table Name
Handling Nulls and Nonmatches in Joins and Groups
Enumerating a Many-to-Many Relationship
Joining a Table with Itself (Self-Join)
Joins That Keep Nonmatching Records (Outer Join)
Pattern in Use
Joining Tables That Have No Foreign-Key Relationship
Joining on an Integer Table to Fill Gaps in a List
Selecting Only Records That Have No Match in Another Table (Anti-Join)
Selecting Only Records That Have a Match in Another Table (Semi-Join)
An Alternative to Anti-Join: Using COGROUP
Summary
Chapter 8. Ordering Operations
Preparing Career Epochs
Sorting All Records in Total Order
Sorting by Multiple Fields
Sorting on an Expression (Does Not Work)
Sorting Strings Case-Insensitively
Handling Nulls When Sorting
Floating Values to the Top or Bottom of the Sort Order
Sorting Within a Group
Pattern in Use
Selecting Rows with the Top-K Values of a Field
Top-K Within a Group
Numbering Records in Sort Order
Finding the Records Associated with Maximum Values
Shuffling a Set of Records
Summary
Chapter 9. Duplicate and Unique Records
Handling Duplicates
Eliminating Duplicate Records from a Table
Eliminating Duplicate Records from a Group
Eliminating Duplicates Based on a Key
Selecting Unique (or Duplicate) Records Based on a Key
Set Operations
Set Operations on Full Tables
Distinct Union
Distinct Union (Alternative Method)
Set Intersection
Set Difference
Symmetric Difference: (A-B)+(B-A)
Set Equality
Set Operations Within Groups
Constructing a Sequence of Sets
Set Operations Within a Group
Summary
Index