Web Data Collection Technology: Java Web Crawlers in Practice (網絡數據採集技術:Java 網絡爬蟲實戰)

Authors: Qian Yang (錢洋), Jiang Yuanchun (姜元春)
Publisher: Publishing House of Electronics Industry (電子工業出版社)
Published: 2020-01-01
ISBN-13: 9787121376078
ISBN-10: 7121376075





Description


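Using Java as the development language, this book systematically introduces the theory and basic tools of web crawling, including the Java fundamentals that crawlers rely on, HTTP basics and network packet capture, fetching web page content, parsing web page content, and storing crawler data. It selects representative websites and uses case studies to work through the problems that arise in web crawling, strengthening readers' hands-on skills. The book also introduces three open-source Java web crawler frameworks: Crawler4j, WebCollector, and WebMagic. It is suitable for both beginners and more advanced developers of Java web crawlers. It can serve as a reference text for web crawler courses, for undergraduate and graduate students in related disciplines such as text mining, natural language processing, and big data business analytics, and as a reference for web crawler developers in industry.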


Table of Contents


Chapter 1 Overview and Principles of Web Crawlers
1.1 Introduction to Web Crawlers
1.2 Types of Web Crawlers
1.3 Web Crawler Workflow
1.4 Crawling Strategies
1.5 Advice on Learning Web Crawling
1.6 Chapter Summary

Chapter 2 Java Fundamentals for Web Crawlers
2.1 Setting Up the Development Environment
2.1.1 Installing the JDK and Configuring Environment Variables
2.1.2 Downloading Eclipse
2.2 Primitive Data Types
2.3 Arrays
2.4 Conditionals and Loops
2.5 Collections
2.5.1 List and Set Collections
2.5.2 Map Collections
2.5.3 Queue Collections
2.6 Objects and Classes
2.7 The String Class
2.8 Date and Time Handling
2.9 Regular Expressions
2.10 Creating a Maven Project
2.11 Using log4j
2.12 Chapter Summary

Chapter 3 HTTP Basics and Network Packet Capture
3.1 Introduction to HTTP
3.2 URL
3.3 Messages
3.4 HTTP Request Methods
3.5 HTTP Status Codes
3.5.1 2XX Status Codes
3.5.2 3XX Status Codes
3.5.3 4XX Status Codes
3.5.4 5XX Status Codes
3.6 HTTP Headers
3.6.1 General Headers
3.6.2 Request Headers
3.6.3 Response Headers
3.6.4 Entity Headers
3.7 HTTP Response Body
3.7.1 HTML
3.7.2 XML
3.7.3 JSON
3.8 Network Packet Capture
3.8.1 Introduction
3.8.2 Use Cases
3.8.3 Capturing Packets with the Browser
3.8.4 Other Recommended Packet Capture Tools
3.9 Chapter Summary

Chapter 4 Fetching Web Page Content
4.1 Using Jsoup
4.1.1 Downloading the JAR
4.1.2 Requesting a URL
4.1.3 Setting Headers
4.1.4 Submitting Request Parameters
4.1.5 Timeout Settings
4.1.6 Using a Proxy Server
4.1.7 Converting the Response to an Output Stream (Downloading Images, PDFs, etc.)
4.1.8 HTTPS Request Authentication
4.1.9 Issues with Fetching Large Files
4.2 Using HttpClient
4.2.1 Downloading the JAR
4.2.2 Requesting a URL
4.2.3 The EntityUtils Class
4.2.4 Setting Headers
4.2.5 Submitting Forms via POST
4.2.6 Timeout Settings
4.2.7 Using a Proxy Server
4.2.8 File Download
4.2.9 HTTPS Request Authentication
4.2.10 Request Retry
4.2.11 Executing Requests with Multiple Threads
4.3 URLConnection and HttpURLConnection
4.3.1 Instantiation
4.3.2 Fetching Web Page Content
4.3.3 GET Requests
4.3.4 Simulating Form Submission (POST Requests)
4.3.5 Setting Headers
4.3.6 Connection Timeout Settings
4.3.7 Using a Proxy Server
4.3.8 HTTPS Request Authentication
4.4 Chapter Summary

Chapter 5 Parsing Web Page Content
5.1 HTML Parsing
5.1.1 CSS Selectors
5.1.2 XPath Syntax
5.1.3 Parsing HTML with Jsoup
5.1.4 Parsing HTML with HtmlCleaner
5.1.5 Parsing HTML with HTMLParser
5.2 XML Parsing
5.3 JSON Parsing
5.3.1 JSON Validation
5.3.2 Parsing JSON with org.json
5.3.3 Parsing JSON with Gson
5.3.4 Parsing JSON with Fastjson
5.3.5 Hands-On Web Crawler Exercise
5.4 Chapter Summary

Chapter 6 Storing Web Crawler Data
6.1 Input and Output Streams
6.1.1 Introduction
6.1.2 The File Class
6.1.3 File Byte Streams
6.1.4 File Character Streams
6.1.5 Buffered Streams
6.1.6 Hands-On: Downloading Images with a Web Crawler
6.1.7 Hands-On: Storing Text with a Web Crawler
6.2 Excel Storage
6.2.1 Using Jxl
6.2.2 Using POI
6.2.3 Crawler Case Study
6.3 MySQL Data Storage
6.3.1 Basic Database Concepts
6.3.2 SQL Statement Basics
6.3.3 Working with Databases in Java
6.3.4 Crawler Case Study
6.4 Chapter Summary

Chapter 7 Web Crawler Projects in Practice
7.1 News Data Collection
7.1.1 Target Web Pages
7.1.2 Framework Overview
7.1.3 Writing the Program
7.2 Company Information Collection
7.2.1 Target Web Pages
7.2.2 Framework Overview
7.2.3 Collecting First-Level Information
7.2.4 Collecting Second-Level Information
7.3 Stock Information Collection
7.3.1 Target Web Pages
7.3.2 Framework Overview
7.3.3 Program Design
7.3.4 Scheduling Tasks with Quartz
7.4 Chapter Summary

Chapter 8 Using Selenium
8.1 Introduction to Selenium
8.2 Setting Up Selenium for Java
8.3 Controlling the Browser
8.4 Locating Elements
8.4.1 Locating by id
8.4.2 Locating by name
8.4.3 Locating by class
8.4.4 Locating by tag name
8.4.5 Locating by link text
8.4.6 Locating by XPath
8.4.7 Locating by CSS Selector
8.5 Simulating Login
8.6 Dynamically Loading JavaScript Data (Operating the Scroll Bar)
8.7 Hiding the Browser
8.8 Capturing CAPTCHA Images
8.9 Chapter Summary

Chapter 9 Open-Source Web Crawler Frameworks
9.1 Using Crawler4j
9.1.1 Introduction to Crawler4j
9.1.2 Downloading the JAR
9.1.3 Getting-Started Example
9.1.4 Configuration
9.1.5 Collecting Images
9.1.6 Storing Collected Data in a Database
9.2 Using WebCollector
9.2.1 Introduction to WebCollector
9.2.2 Downloading the JAR
9.2.3 Getting-Started Example
9.2.4 Configuration
9.2.5 Extending HTTP Requests
9.2.6 Collecting Paginated Data
9.2.7 Collecting Images
9.2.8 Storing Collected Data in a Database
9.3 Using WebMagic
9.3.1 Introduction to WebMagic
9.3.2 Downloading the JAR
9.3.3 Getting-Started Example (Collecting Paginated Data)
9.3.4 Configuration
9.3.5 Data Storage Methods
9.3.6 Storing Collected Data in a Database
9.3.7 Collecting Images
9.4 Chapter Summary


About the Authors


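Qian Yang (錢洋): Ph.D., Department of Management Science and Engineering, Hefei University of Technology, and a CSDN blog expert. He has taken part as a technical contributor in a number of industry-sponsored and government-funded research projects, where he was responsible for designing and developing data collection systems. He has written many original CSDN blog posts on data collection, natural language processing, programming languages, and related topics.

Jiang Yuanchun (姜元春): Professor and doctoral supervisor at Hefei University of Technology. He has long been engaged in theoretical research and teaching in e-commerce, business intelligence, and data collection and mining. He has led research projects funded by the National Natural Science Foundation of China (Excellent Young Scientists Fund, Major Research Plan cultivation project, and Young Scientists Fund), the Ministry of Education Humanities and Social Sciences Youth Fund, the Alibaba Young Scholars Support Program, and the CCF-Tencent Rhino-Bird Fund.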



