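深入淺出R語言數據分析 (An Accessible Guide to Data Analysis with R)
Description
This book first introduces the methodology of data analysis, then presents the models and methods used in it, and finally works through case studies that illustrate the thinking, methods, and model implementation involved. It focuses on applying the R language to data analysis, so that readers can quickly start analyzing data and building models in R.
The book has 17 chapters, covering: acquiring data with R, data processing and data exploration, survival analysis, principal component analysis, multidimensional scaling, linear regression models, logistic regression models, clustering models, association rules, random forests, support vector machines, neural networks, text mining, and social network analysis, plus two extended topics in R data analysis: H2O machine learning and web scraping with R.
The material is easy to follow, rich in case studies, and practical. It is especially suited to beginner and intermediate R users, and also to data analysts, data miners, and other data science practitioners. It can also serve undergraduate and graduate students in statistics, computer science, machine learning, mathematics, and related fields.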
Table of Contents
Chapter 1 The Data Analysis Project Workflow
1.1 Roles in a Data Analysis Project
1.2 Stages of a Data Analysis Project
1.2.1 Setting Goals
1.2.2 Collecting Data
1.2.3 Data Processing and Analysis
1.2.4 Building Models
1.2.5 Evaluating Models
1.2.6 Presenting Results
1.2.7 Deploying and Maintaining Models
1.3 Summary
Chapter 2 Reading Data
2.1 RData Data
2.2 Efficient Data Reading with readr
2.3 Reading Excel Data
2.4 Reading SPSS, SAS, and STATA Data
2.5 Working with Databases in R
2.6 Summary
Chapter 3 Data Exploration
3.1 Identifying and Handling Missing Values
3.1.1 Identifying Missing Values and Descriptive Statistics
3.1.2 Visualizing Missing Values
3.1.3 Methods for Handling Missing Values
3.2 Outliers
3.3 The dlookr Data Processing Package
3.3.1 General Diagnosis of All Variables
3.3.2 Diagnosis of Numeric Variables
3.3.3 Diagnosis of Categorical Variables
3.3.4 Diagnosis of Outliers
3.3.5 Creating a Diagnostic Report
3.3.6 Data Processing
3.3.7 Handling Missing Values
3.3.8 Handling Outliers
3.3.9 Data Transformation
3.3.10 Data Binning
3.3.11 Creating a Data Transformation Report
3.4 Data Correlation
3.5 Automatically Generating a Data Exploration Report
3.6 Summary
Chapter 4 Survival Analysis
4.1 Basics of Survival Analysis
4.2 Survival Analysis in R
4.3 Nonparametric Models
4.3.1 Fitting Data with the Kaplan-Meier Method
4.3.2 Visualizing the Kaplan-Meier Method
4.4 Semiparametric Survival Analysis Methods
4.4.1 Building a Cox Model
4.4.2 Checking Assumptions
4.4.3 Visualizing the Coxph Model
4.4.4 Prediction
4.4.5 Stratification
4.5 Parametric Models
4.6 Random Survival Forest Models
4.7 Summary
Chapter 5 Principal Component Analysis
5.1 Overview
5.1.1 Dimensionality-Related Problems
5.1.2 Detecting Multicollinearity
5.1.3 Variance Inflation Factor
5.2 Principal Component Analysis in Detail
5.2.1 Definition of Principal Component Analysis
5.2.2 The Basic Principle of Principal Component Analysis
5.2.3 The Principal Component Analysis Algorithm
5.3 Principal Component Analysis in R
5.3.1 Implementing Principal Component Analysis
5.3.2 A Principal Component Analysis Case Study
5.4 Summary
Chapter 6 Multidimensional Scaling
6.1 How MDS Works
6.2 Implementing MDS in R
6.3 Advantages of MDS
6.4 Summary
Chapter 7 Linear Regression Models
7.1 Overview of Linear Regression Models
7.2 Implementing Regression Models in R
7.2.1 Graphical Analysis
7.2.2 Building a Linear Model
7.2.3 Graphical Diagnostics for Regression Models
7.2.4 Predictive Models
7.2.5 Sampling Methods
7.3 Summary
Chapter 8 Logistic Regression Models
8.1 Principles of Logistic Regression
8.2 Implementing Logistic Regression Models in R
8.2.1 Data Exploration
8.2.2 Building a Logistic Regression Model
8.2.3 Prediction with Logistic Regression
8.2.4 Evaluating the Logistic Regression Model
8.3 Summary
Chapter 9 Clustering Models
9.1 Overview
9.1.1 Clustering Algorithms
9.1.2 Principles of K-Means Clustering
9.2 Implementing Clustering Models in R
9.2.1 K-Means Clustering
9.2.2 Hierarchical Clustering
9.2.3 Medoids Clustering (PAM)
9.3 Summary
Chapter 10 Association Rules
10.1 Overview of Association Rules
10.2 Basic Concepts of Association Rules
10.3 Implementing Association Rules in R
10.3.1 Training the Model
10.3.2 Evaluating the Model
10.3.3 Improving Association Rule Performance
10.3.4 Visualizing Association Rules
10.4 Summary
Chapter 11 Random Forests
11.1 Basic Concepts of Random Forests
11.2 Implementing Random Forests in R
11.3 Summary
Chapter 12 Support Vector Machines
12.1 Overview
12.2 Implementing Support Vector Machines in R
12.3 Summary
Chapter 13 Neural Networks
13.1 Overview
13.2 Implementing Neural Networks in R
13.2.1 Building a Neural Network Model
13.2.2 Evaluating Model Performance
13.3 Summary
Chapter 14 Text Mining
14.1 Overview
14.2 Background and Basic Principles of text2vec
14.3 Principles and Implementation of DTM and TFIDF
14.3.1 Principles of DTM and TFIDF
14.3.2 Implementing DTM
14.3.3 Implementing TFIDF
14.4 Sentiment Analysis
14.5 The LDA Topic Model and Its Implementation
14.6 Building an Automatic Question Answering System
14.7 Summary
Chapter 15 Social Network Analysis
15.1 Overview of Social Networks
15.2 Introduction to igraph
15.2.1 Preparation
15.2.2 Computing Graph Metrics
15.3 Common Structures in Social Networks
15.4 Social Network Analysis Algorithms
15.4.1 Girvan-Newman
15.4.2 Community Detection Based on Label Propagation
15.4.3 Community Detection Based on Greedy Modularity Optimization
15.4.4 Spinglass Communities
15.5 Weibo Social Group Analysis
15.5.1 Spinglass Communities
15.5.2 Community Detection
15.6 Summary
Chapter 16 H2O Machine Learning
16.1 The H2O Machine Learning Platform
16.2 Using H2O in R
16.2.1 Installing H2O
16.2.2 Case Study
16.2.3 Common H2O APIs
16.2.4 Common Model Parameters
16.2.5 Parameter Tuning
16.3 H2O Flow
16.3.1 Installing H2O Flow
16.3.2 Basic Usage of H2O Flow
16.4 Summary
Chapter 17 Web Scraping with R
17.1 Quickly Scraping Web Page Data
17.2 Introduction to rvest
17.2.1 rvest API
17.2.2 rvest API in Detail
17.3 Scraping BOSS Zhipin (BOSS 直聘) Data
17.4 Simulating Login