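Web Data Collection Technology: Java Web Crawlers in Practice
Description
This book uses Java as its development language and systematically introduces the theory and basic tools of web crawling, including the Java fundamentals that crawlers rely on, HTTP protocol basics and network packet capture, fetching web page content, parsing web page content, and storing crawled data. It selects typical websites and uses case studies to explain the problems that come up in web crawling, so as to strengthen readers' hands-on skills. The book also introduces three open-source Java web crawler frameworks: Crawler4j, WebCollector, and WebMagic. It is suitable for both beginners and more advanced developers of Java web crawlers. It can also serve as a reference for web crawler courses, for undergraduate and graduate students in related disciplines such as text mining, natural language processing, and big data business analytics, and for web crawler developers in industry.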
Table of Contents
Chapter 1 Overview and Principles of Web Crawlers
1.1 Introduction to Web Crawlers
1.2 Types of Web Crawlers
1.3 Web Crawler Workflow
1.4 Web Crawler Collection Strategies
1.5 Advice on Learning Web Crawling
1.6 Chapter Summary
Chapter 2 Java Fundamentals for Web Crawlers
2.1 Setting Up the Development Environment
2.1.1 Installing the JDK and Configuring Environment Variables
2.1.2 Downloading Eclipse
2.2 Basic Data Types
2.3 Arrays
2.4 Conditionals and Loops
2.5 Collections
2.5.1 List and Set Collections
2.5.2 Map Collections
2.5.3 Queue Collections
2.6 Objects and Classes
2.7 The String Class
2.8 Date and Time Handling
2.9 Regular Expressions
2.10 Creating a Maven Project
2.11 Using log4j
2.12 Chapter Summary
Chapter 3 HTTP Protocol Basics and Network Packet Capture
3.1 Introduction to the HTTP Protocol
3.2 URL
3.3 Messages
3.4 HTTP Request Methods
3.5 HTTP Status Codes
3.5.1 2XX Status Codes
3.5.2 3XX Status Codes
3.5.3 4XX Status Codes
3.5.4 5XX Status Codes
3.6 HTTP Headers
3.6.1 General Headers
3.6.2 Request Headers
3.6.3 Response Headers
3.6.4 Entity Headers
3.7 HTTP Response Body
3.7.1 HTML
3.7.2 XML
3.7.3 JSON
3.8 Network Packet Capture
3.8.1 Introduction
3.8.2 Use Cases
3.8.3 Capturing Packets with the Browser
3.8.4 Other Recommended Packet Capture Tools
3.9 Chapter Summary
Chapter 4 Fetching Web Page Content
4.1 Using Jsoup
4.1.1 Downloading the JAR
4.1.2 Requesting a URL
4.1.3 Setting Headers
4.1.4 Submitting Request Parameters
4.1.5 Timeout Settings
4.1.6 Using a Proxy Server
4.1.7 Writing the Response to an Output Stream (Downloading Images, PDFs, etc.)
4.1.8 HTTPS Request Authentication
4.1.9 Issues with Fetching Large Files
4.2 Using HttpClient
4.2.1 Downloading the JAR
4.2.2 Requesting a URL
4.2.3 The EntityUtils Class
4.2.4 Setting Headers
4.2.5 Submitting a Form via POST
4.2.6 Timeout Settings
4.2.7 Using a Proxy Server
4.2.8 File Download
4.2.9 HTTPS Request Authentication
4.2.10 Request Retries
4.2.11 Executing Requests with Multiple Threads
4.3 URLConnection and HttpURLConnection
4.3.1 Instantiation
4.3.2 Fetching Web Page Content
4.3.3 GET Requests
4.3.4 Simulating Form Submission (POST Requests)
4.3.5 Setting Headers
4.3.6 Connection Timeout Settings
4.3.7 Using a Proxy Server
4.3.8 HTTPS Request Authentication
4.4 Chapter Summary
Chapter 5 Parsing Web Page Content
5.1 HTML Parsing
5.1.1 CSS Selectors
5.1.2 XPath Syntax
5.1.3 Parsing HTML with Jsoup
5.1.4 Parsing HTML with HtmlCleaner
5.1.5 Parsing HTML with HTMLParser
5.2 XML Parsing
5.3 JSON Parsing
5.3.1 Checking and Correcting JSON
5.3.2 Parsing JSON with org.json
5.3.3 Parsing JSON with Gson
5.3.4 Parsing JSON with Fastjson
5.3.5 Web Crawler Hands-On Exercise
5.4 Chapter Summary
Chapter 6 Storing Web Crawler Data
6.1 Input and Output Streams
6.1.1 Introduction
6.1.2 The File Class
6.1.3 File Byte Streams
6.1.4 File Character Streams
6.1.5 Buffered Streams
6.1.6 Hands-On: Downloading Images with a Web Crawler
6.1.7 Hands-On: Storing Text with a Web Crawler
6.2 Excel Storage
6.2.1 Using Jxl
6.2.2 Using POI
6.2.3 Crawler Case Study
6.3 MySQL Data Storage
6.3.1 Basic Database Concepts
6.3.2 SQL Statement Basics
6.3.3 Working with Databases in Java
6.3.4 Crawler Case Study
6.4 Chapter Summary
Chapter 7 Web Crawler Projects in Practice
7.1 News Data Collection
7.1.1 Target Web Pages
7.1.2 Framework Overview
7.1.3 Writing the Program
7.2 Company Information Collection
7.2.1 Target Web Pages
7.2.2 Framework Overview
7.2.3 First-Level Information Collection
7.2.4 Second-Level Information Collection
7.3 Stock Information Collection
7.3.1 Target Web Pages
7.3.2 Framework Overview
7.3.3 Program Design
7.3.4 Scheduling Tasks with Quartz
7.4 Chapter Summary
Chapter 8 Using Selenium
8.1 Introduction to Selenium
8.2 Setting Up the Java Selenium Environment
8.3 Controlling the Browser
8.4 Locating Elements
8.4.1 Locating by id
8.4.2 Locating by name
8.4.3 Locating by class
8.4.4 Locating by tag name
8.4.5 Locating by link text
8.4.6 Locating by XPath
8.4.7 Locating by CSS Selector
8.5 Simulating Login
8.6 Dynamically Loaded JavaScript Data (Operating the Scroll Bar)
8.7 Hiding the Browser
8.8 Capturing the CAPTCHA
8.9 Chapter Summary
Chapter 9 Open-Source Web Crawler Frameworks
9.1 Using Crawler4j
9.1.1 Introduction to Crawler4j
9.1.2 Downloading the JAR
9.1.3 Getting-Started Example
9.1.4 Configuration
9.1.5 Collecting Images
9.1.6 Saving Collected Data to a Database
9.2 Using WebCollector
9.2.1 Introduction to WebCollector
9.2.2 Downloading the JAR
9.2.3 Getting-Started Example
9.2.4 Configuration
9.2.5 Extending HTTP Requests
9.2.6 Collecting Paginated Data
9.2.7 Collecting Images
9.2.8 Saving Collected Data to a Database
9.3 Using WebMagic
9.3.1 Introduction to WebMagic
9.3.2 Downloading the JAR
9.3.3 Getting-Started Example (Collecting Paginated Data)
9.3.4 Configuration
9.3.5 Data Storage Options
9.3.6 Saving Collected Data to a Database
9.3.7 Collecting Images
9.4 Chapter Summary
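About the Authors
Qian Yang (錢洋): PhD, Department of Management Science and Engineering, Hefei University of Technology, and a CSDN blog expert. Qian has taken part as a technical staff member in several industry-commissioned and government-funded academic research projects, with responsibility for designing and developing data collection systems, and has written a number of original CSDN blog posts on data collection, natural language processing, programming languages, and related topics.
Jiang Yuanchun (姜元春): professor and doctoral supervisor at Hefei University of Technology. Jiang has long been engaged in theoretical research and teaching in e-commerce, business intelligence, and data collection and mining, and has led research projects funded by the National Natural Science Foundation of China (the Excellent Young Scientists Fund, a Major Research Plan cultivation project, and the Young Scientists Fund), the Ministry of Education Humanities and Social Sciences Youth Fund, the Alibaba Young Scholars Support Program, and the CCF-Tencent Rhino-Bird Fund, among others.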