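Full-Stack Data Analysis with Spark
Description
This book presents the author's Agile Data Science methodology. Drawing on the author's many years of industry experience, it gives data science teams a practical way to conduct data science research in a manner similar to agile development. The whole book performs full-stack data analysis on Spark and demonstrates tools commonly used in industry at every stage, from front-end display to back-end processing. It walks data scientists step by step through quickly turning theory into real user-facing applications, so that readers can create real value from data while continually refining their own research. The book is suitable for beginners, and data scientists, engineers, and analysts can all benefit from it.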
Table of Contents
Preface
Part I. Setup
Chapter 1. Theory
Introduction
Definition
Methodology
Agile Data Science Manifesto
The Problem with the Waterfall Model
Research Versus Application Development
The Problem with Agile Software Development
Eventual Quality: Paying Off Technical Debt
The Pull of the Waterfall
The Data Science Process
Setting Expectations
Data Science Team Roles
Recognizing Opportunities and Challenges
Adapting to Change
Notes on Process
Code Review and Pair Programming
Agile Environments: Improving Productivity
Realizing Ideas with Large-Format Printing
Chapter 2. Agile Tools
Scalability = Ease of Use
Data Processing in Agile Data Science
Setting Up a Local Environment
System Requirements
Configuring Vagrant
Downloading the Data
Setting Up an EC2 Environment
Downloading the Data
Downloading and Running the Code
Downloading the Code
Running the Code
Jupyter Notebooks
Toolset Overview
Agile Stack Requirements
Python 3
Serializing Events with JSON Lines and Parquet
Collecting Data
Data Processing with Spark
Publishing Data with MongoDB
Searching Data with Elasticsearch
Distributing Streaming Data with Apache Kafka
Processing Streaming Data with PySpark Streaming
Machine Learning with scikit-learn and Spark MLlib
Scheduling with Apache Airflow (Incubating)
Reflecting on Our Workflow
Lightweight Web Applications
Presenting Data
Chapter Summary
Chapter 3. Data
Flight Data
Flight On-Time Performance Data
OpenFlights Database
Weather Data
Data Processing in Agile Data Science
Structured Versus Semi-Structured Data
SQL Versus NoSQL
SQL
NoSQL and Dataflow Programming
Spark: SQL + NoSQL
Schemas in NoSQL
Data Serialization
Extracting and Exposing Features in Evolving Schemas
Chapter Summary
Part II. Climbing the Pyramid
Chapter 4. Collecting and Displaying Records
Putting It All Together
Collecting and Serializing Flight Data
Processing and Publishing Flight Records
Publishing Flight Records to MongoDB
Presenting Flight Records in a Browser
Serving Flight Information with Flask and pymongo
Rendering HTML5 Pages with Jinja2
Agile Checkpoint
Listing Flight Records
Listing Flight Records with MongoDB
Paginating Data
Searching Flight Data
Creating an Index
Publishing Flight Data to Elasticsearch
Searching Flight Data on the Web
Chapter Summary
Chapter 5. Visualizing Data with Charts
Chart Quality: Iteration Is Essential
Scaling a Database with the Publish/Decorate Model
First-Order Form
Second-Order Form
Third-Order Form
Choosing a Form
Exploring Seasonality
Querying and Presenting Total Flight Counts
Extracting "Metal" (Airplanes [Entities])
Extracting Tail Numbers
Assessing Our Airplane Records
Data Enrichment
Reverse Engineering a Web Form
Gathering Tail Numbers
Automating Form Submission
Extracting Data from HTML
Evaluating the Enriched Data
Chapter Summary
Chapter 6. Exploring Data with Reports
Extracting Airlines as Entities
Defining Airlines as Groups of Airplanes Using PySpark
Querying Airline Data in MongoDB
Building an Airline Page in Flask
Adding Links Back to the Airline Page
Creating a Home Page Listing All Airlines
Curating Ontologies of Semi-Structured Data
Improving the Airline Page
Adding Names to Airline Codes
Incorporating Wikipedia Content
Publishing the Enriched Airlines Table to MongoDB
Showing Enriched Airline Information on the Web
Investigating Airplanes (Entities)
SQL Subqueries Versus Dataflow Programming
Dataflow Programming Without Subqueries
Subqueries in Spark SQL
Creating an Airplanes Home Page
Adding Search to the Airplanes Page
Creating a Bar Chart of Airplane Manufacturers
Iterating on the Airplane Manufacturers Bar Chart
Entity Resolution: Another Chart Iteration
Chapter Summary
Chapter 7. Making Predictions
The Role of Predictions
Predict What?
Introduction to Predictive Analytics
Making Predictions
Exploring Flight Delays
Extracting Features with PySpark
Building a Regression Model with scikit-learn
Loading the Data
Sampling the Data
Vectorizing the Results
Preparing the Training Data
Vectorizing the Features
Sparse Versus Dense Matrices
Preparing an Experiment
Training the Model
Testing the Model
Summary
Building a Classifier with Spark MLlib
Loading Training Data with a Specified Schema
Handling Nulls
Replacing FlightNum (Flight Number) with Route
Bucketizing a Continuous Variable for Classification
Vectorizing Features with pyspark.ml.feature
Classification with Spark ML
Chapter Summary
Chapter 8. Deploying Predictive Systems
Deploying a scikit-learn Application as a Web Service
Saving and Loading scikit-learn Models
Groundwork for Serving Predictions
Creating an API for the Flight Delay Regression
Testing the API
Using the API in the Product
Deploying Spark ML Applications in Batch Mode with Airflow
Gathering Training Data in Production
Training, Storing, and Loading Spark ML Models
Creating Prediction Requests in MongoDB
Fetching Prediction Requests from MongoDB
Making Predictions in Batch Mode with Spark ML
Storing Prediction Results in MongoDB
Displaying Batch Prediction Results in the Web Application
Automating the Workflow with Apache Airflow (Incubating)
Summary
Deploying Spark ML Applications in Streaming Mode with Spark Streaming
Gathering Training Data in Production
Training, Storing, and Loading Spark ML Models
Sending Prediction Requests to Kafka
Making Predictions with Spark Streaming
Testing the Entire System
Chapter Summary
Chapter 9. Improving Predictions
Fixing Our Prediction Problem
When to Improve Predictions
Improving Prediction Performance
The Experimental Adhesion Method: Seeing What Sticks
Establishing Rigorous Metrics for Experiments
Time of Day as a Feature
Incorporating Airplane Data
Extracting Airplane Features
Incorporating Airplane Features into the Classifier Model
Incorporating Flight Time
Chapter Summary
Appendix A. Installation Manual
Installing Hadoop
Installing Spark
Installing MongoDB
Installing the MongoDB Java Driver
Installing mongo-hadoop
Building mongo-hadoop
Installing pymongo_spark
Installing Elasticsearch
Installing the Elasticsearch Hadoop Support Library
Configuring Our Spark Environment
Installing Kafka
Installing scikit-learn
Installing Zeppelin
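About the Author
Author: Russell Jurney (USA). Translator: 王道遠 (Wang Daoyuan)
Russell Jurney honed his data analysis skills in casino gaming, building web applications to analyze the performance of slot machines in the United States and Mexico. After working in startups, interactive media, and journalism, he moved to Silicon Valley, where he built analytics applications at Ning and LinkedIn. Russell is now principal consultant at Data Syndrome, where he helps companies build analytics products using the principles and methods described in this book.
王道遠 (Wang Daoyuan) graduated from Zhejiang University and currently works in the Computing Platform Business Unit at Alibaba. Before joining Alibaba, he spent five years in the big data department of Intel Asia-Pacific Research and Development Ltd.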