Big Data for Chimps: A Guide to Massive-Scale Data Processing in Practice
Description
<Synopsis>
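This book explains big data from a practical, hands-on perspective. Using the metaphor of chimpanzees and elephants and a baseball statistics dataset, it shows how to process data at scale with tools such as Apache Hadoop and Pig. By working with real data and solving real problems, the authors also distill a set of practical analytic patterns through worked examples, giving creative analysts their most powerful and valuable methods. The book is especially suited to anyone who needs a big data toolkit to solve real-world problems.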
<Table of Contents>
Preface

Part I. Introduction: Theory and Tools

Chapter 1. Hadoop Basics
    Chimpanzee and Elephant Start a Business
    Map-Only Jobs: Processing Records Individually
    A Pig Latin Map-Only Job
    Setting Up a Docker Hadoop Cluster
    Running the Job
    Summary

Chapter 2. MapReduce
    Chimpanzee and Elephant Save Christmas
    Trouble in Toyland
    Chimpanzees Turn Letters into Labeled Toy Forms
    Pygmy Elephants Carry Each Toy Form to the Appropriate Workbench
    Example: Reindeer Games
    UFO Data
    Grouping UFO Sightings by Reporting Delay
    Mapper
    Reducer
    Visualizing the Data
    Reindeer Summary
    Hadoop Versus Traditional Databases
    The MapReduce Haiku
    The Map Phase in Brief
    The Group-Sort Phase in Brief
    The Reduce Phase in Brief
    Summary

Chapter 3. A Quick Look at the Baseball Dataset
    The Data
    Acronyms and Terminology
    Rules and Goals
    Performance Metrics
    Summary

Chapter 4. Introduction to Pig
    Pig Helps Hadoop Work with Tables, Not Records
    Wikipedia Visitor Counts
    Fundamental Data Operations
    Control Operations
    Pipeline Operations
    Structural Operations
    LOAD Locates and Describes Your Data
    Simple Types
    Complex Type 1, Tuples: Fixed-Length Sequences of Typed Fields
    Complex Type 2, Bags: Unbounded Collections of Tuples
    Defining the Schema of a Transformed Record
    STORE Writes Data to Disk
    Helper Commands
    DESCRIBE
    DUMP
    SAMPLE
    ILLUSTRATE
    EXPLAIN
    Pig Functions
    Piggybank
    Apache DataFu
    Summary

Part II. Tactics: Analytic Patterns

Chapter 5. Map-Only Operations
    Pattern in Use
    Eliminating Data
    Selecting Records That Satisfy a Condition: FILTER and Friends
    Selecting Records That Satisfy Multiple Conditions
    Selecting or Rejecting Records with Null Values
    Selecting Records That Match a Regular Expression (MATCHES)
    Matching Records Against a Fixed List of Values
    Projecting Fields by Name
    Using FOREACH to Select, Rename, and Reorder Fields
    Extracting a Random Sample of Records
    Extracting a Consistent Sample by Key
    Rough Sampling by Loading Only Some part- Files
    Selecting a Fixed Number of Records with LIMIT
    Other Data Elimination Patterns
    Transforming Records
    Transforming Records Individually with FOREACH
    A Nested FOREACH Allows Intermediate Expressions
    Formatting a String According to a Template
    Assembling Literals with Complex Types
    Manipulating the Type of a Field
    Integers, Floats, and Rounding
    Calling a User-Defined Function from an External Package
    Operations That Split One Table into Many
    Directing Data Conditionally into Multiple Data Flows (SPLIT)
    Operations That Combine Several Tables into One
    Merging Several Pig Relations into a Single Table (Stacking Rowsets)
    Summary

Chapter 6. Grouping Operations
    Grouping Records into a Bag by Key
    Pattern in Use
    Counting Occurrences of a Key
    Representing a Collection of Values with a Delimited String
    Representing a Complex Data Structure with a Delimited String
    Representing a Complex Data Structure with a JSON-Encoded String
    Grouping and Aggregating
    Aggregating Statistics of a Group
    Completely Summarizing a Field
    Summarizing Aggregate Statistics of a Full Table
    Summarizing a String Field
    Calculating the Distribution of Numeric Values with a Histogram
    Pattern in Use
    Binning Data for a Histogram
    Choosing a Bin Size
    Interpreting Histograms and Quantiles
    Binning Data into Exponentially Sized Buckets
    Creating Pig Macros for Common Code Stanzas
    Distribution of Games Played
    Extreme Cases and Confounding Factors
    Don't Trust Distributions at the Tails
    Calculating a Relative Distribution Histogram
    Reinjecting Global Values
    Calculating a Histogram Within a Group
    Exporting Readable Results
    The Summing Trick
    Counting Conditional Subsets of a Group: The Summing Trick
    Summarizing Multiple Subsets of a Group Simultaneously
    Testing Whether a Value Is Absent Within a Group
    Summary
    References

Chapter 7. Joining Tables
    Matching Records Between Tables (Inner Join)
    Joining Records in One Table with Directly Matching Records in Another (Direct Inner Join)
    How a Join Works
    A Join Is a COGROUP+FLATTEN
    A Join Is a MapReduce Job with a Secondary Sort on the Table Name
    Handling Nulls and Nonmatches in Joins and Groups
    Enumerating a Many-to-Many Relationship
    Joining a Table with Itself (Self-Join)
    Joins That Keep Nonmatching Records (Outer Join)
    Pattern in Use
    Joining Tables That Have No Foreign-Key Relationship
    Joining on an Integer Table to Fill Gaps in a List
    Selecting Only Records with No Match in Another Table (Anti-Join)
    Selecting Only Records with a Match in Another Table (Semi-Join)
    An Alternative to Anti-Join: Using COGROUP
    Summary

Chapter 8. Ordering Operations
    Preparing Career Epochs
    Sorting All Records in Total Order
    Sorting by Multiple Fields
    Sorting on an Expression (It Doesn't Work)
    Case-Insensitive String Sorting
    Handling Nulls When Sorting
    Floating Values to the Top or Bottom of the Sort Order
    Sorting Within a Group
    Pattern in Use
    Selecting Rows with the Top-K Values of a Field
    Top-K Within a Group
    Numbering Records in Sort Order
    Finding the Records Associated with Maximum Values
    Shuffling a Set of Records
    Summary

Chapter 9. Duplicate and Unique Records
    Handling Duplicates
    Eliminating Duplicate Records from a Table
    Eliminating Duplicate Records Within a Group
    Eliminating Duplicates Based on a Key
    Selecting Unique (or Duplicate) Records Based on a Key
    Set Operations
    Set Operations on Full Tables
    Distinct Union
    Distinct Union (Alternative Method)
    Set Intersection
    Set Difference
    Symmetric Difference: (A-B)+(B-A)
    Set Equality
    Set Operations Within Groups
    Constructing a Sequence of Sets
    Set Operations Within a Group
    Summary

Index