Web Data Collection Technology: Java Web Crawlers in Practice (網絡數據採集技術:Java 網絡爬蟲實戰)
Qian Yang (錢洋), Jiang Yuanchun (姜元春)
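- Publisher: 電子工業 (Publishing House of Electronics Industry)
- Publication date: 2020-01-01
- List price: $474
- Sale price: $403 (15% off)
- Language: Simplified Chinese
- ISBN: 7121376075
- ISBN-13: 9787121376078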
Related categories:
Web crawlers, Java programming language
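Product Description
Using Java as the development language, this book systematically introduces the theory and basic tools of web crawling, including the Java fundamentals that crawlers rely on, HTTP protocol basics and network packet capture, web page content fetching, web page content parsing, and crawler data storage. It works through representative websites case by case to cover the problems that arise in web crawling and to strengthen readers' hands-on skills. It also introduces three open-source Java web crawler frameworks: Crawler4j, WebCollector, and WebMagic. The book is suitable for beginners and intermediate developers of Java web crawlers. It can also serve as a reference for web crawler courses, for undergraduate and graduate students in fields such as text mining, natural language processing, and big data business analytics, and for web crawler developers in industry.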
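About the Authors
Qian Yang: Ph.D., Department of Management Science and Engineering, Hefei University of Technology, and a CSDN blog expert. Qian has taken part as a technical contributor in multiple industry-sponsored and government-funded academic research projects, with responsibility for designing and developing data collection systems, and has written many original CSDN blog posts on data collection, natural language processing, programming languages, and related areas.
Jiang Yuanchun: Professor and doctoral supervisor at Hefei University of Technology, with long-term involvement in theoretical research and teaching in e-commerce, business intelligence, and data collection and mining. Jiang has led research under the National Natural Science Foundation of China (NSFC) Excellent Young Scientists Fund, an NSFC Major Research Plan cultivation project, the NSFC Young Scientists Fund, the Ministry of Education Humanities and Social Sciences Youth Fund, the Alibaba Young Scholars Support Program, and the CCF-Tencent Rhino-Bird Fund, among other projects.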
Table of Contents
Chapter 1 Overview and Principles of Web Crawlers
1.1 Introduction to Web Crawlers
1.2 Types of Web Crawlers
1.3 Web Crawler Workflow
1.4 Crawling Strategies
1.5 Advice on Learning Web Crawling
1.6 Chapter Summary
Chapter 2 Java Fundamentals for Web Crawlers
2.1 Setting Up the Development Environment
2.1.1 Installing the JDK and Configuring Environment Variables
2.1.2 Downloading Eclipse
2.2 Primitive Data Types
2.3 Arrays
2.4 Conditionals and Loops
2.5 Collections
2.5.1 List and Set Collections
2.5.2 Map Collections
2.5.3 Queue Collections
2.6 Objects and Classes
2.7 The String Class
2.8 Date and Time Handling
2.9 Regular Expressions
2.10 Creating a Maven Project
2.11 Using log4j
2.12 Chapter Summary
Chapter 3 HTTP Protocol Basics and Network Packet Capture
3.1 Introduction to the HTTP Protocol
3.2 URL
3.3 Messages
3.4 HTTP Request Methods
3.5 HTTP Status Codes
3.5.1 2XX Status Codes
3.5.2 3XX Status Codes
3.5.3 4XX Status Codes
3.5.4 5XX Status Codes
3.6 HTTP Headers
3.6.1 General Headers
3.6.2 Request Headers
3.6.3 Response Headers
3.6.4 Entity Headers
3.7 HTTP Response Body
3.7.1 HTML
3.7.2 XML
3.7.3 JSON
3.8 Network Packet Capture
3.8.1 Introduction
3.8.2 Use Cases
3.8.3 Capturing Network Traffic with the Browser
3.8.4 Other Recommended Packet Capture Tools
3.9 Chapter Summary
Chapter 4 Fetching Web Page Content
4.1 Using Jsoup
4.1.1 Downloading the JAR
4.1.2 Requesting a URL
4.1.3 Setting Headers
4.1.4 Submitting Request Parameters
4.1.5 Timeout Settings
4.1.6 Using a Proxy Server
4.1.7 Converting the Response to an Output Stream (Downloading Images, PDFs, etc.)
4.1.8 HTTPS Request Authentication
4.1.9 Fetching Large Files
4.2 Using HttpClient
4.2.1 Downloading the JAR
4.2.2 Requesting a URL
4.2.3 The EntityUtils Class
4.2.4 Setting Headers
4.2.5 Submitting a Form via POST
4.2.6 Timeout Settings
4.2.7 Using a Proxy Server
4.2.8 File Download
4.2.9 HTTPS Request Authentication
4.2.10 Request Retry
4.2.11 Executing Requests with Multiple Threads
4.3 URLConnection and HttpURLConnection
4.3.1 Instantiation
4.3.2 Fetching Web Page Content
4.3.3 GET Requests
4.3.4 Simulating Form Submission (POST Requests)
4.3.5 Setting Headers
4.3.6 Connection Timeout Settings
4.3.7 Using a Proxy Server
4.3.8 HTTPS Request Authentication
4.4 Chapter Summary
Chapter 5 Parsing Web Page Content
5.1 HTML Parsing
5.1.1 CSS Selectors
5.1.2 XPath Syntax
5.1.3 Parsing HTML with Jsoup
5.1.4 Parsing HTML with HtmlCleaner
5.1.5 Parsing HTML with HTMLParser
5.2 XML Parsing
5.3 JSON Parsing
5.3.1 JSON Correction
5.3.2 Parsing JSON with org.json
5.3.3 Parsing JSON with Gson
5.3.4 Parsing JSON with Fastjson
5.3.5 Hands-On Web Crawler Exercise
5.4 Chapter Summary
Chapter 6 Storing Web Crawler Data
6.1 Input and Output Streams
6.1.1 Introduction
6.1.2 The File Class
6.1.3 File Byte Streams
6.1.4 File Character Streams
6.1.5 Buffered Streams
6.1.6 Hands-On: Downloading Images with a Web Crawler
6.1.7 Hands-On: Storing Text with a Web Crawler
6.2 Excel Storage
6.2.1 Using Jxl
6.2.2 Using POI
6.2.3 Crawler Case Study
6.3 MySQL Data Storage
6.3.1 Basic Database Concepts
6.3.2 SQL Statement Basics
6.3.3 Working with Databases in Java
6.3.4 Crawler Case Study
6.4 Chapter Summary
Chapter 7 Web Crawler Projects in Practice
7.1 News Data Collection
7.1.1 Target Web Pages
7.1.2 Framework Overview
7.1.3 Writing the Program
7.2 Company Information Collection
7.2.1 Target Web Pages
7.2.2 Framework Overview
7.2.3 First-Level Information Collection
7.2.4 Second-Level Information Collection
7.3 Stock Information Collection
7.3.1 Target Web Pages
7.3.2 Framework Overview
7.3.3 Program Design
7.3.4 Scheduling Tasks with Quartz
7.4 Chapter Summary
Chapter 8 Using Selenium
8.1 Introduction to Selenium
8.2 Setting Up the Java Selenium Environment
8.3 Controlling the Browser
8.4 Locating Elements
8.4.1 Locating by id
8.4.2 Locating by name
8.4.3 Locating by class
8.4.4 Locating by tag name
8.4.5 Locating by link text
8.4.6 Locating by XPath
8.4.7 Locating by CSS Selector
8.5 Simulating Login
8.6 Dynamically Loading JavaScript Data (Operating the Scroll Bar)
8.7 Hiding the Browser
8.8 Capturing CAPTCHA Images
8.9 Chapter Summary
Chapter 9 Open-Source Web Crawler Frameworks
9.1 Using Crawler4j
9.1.1 Introduction to Crawler4j
9.1.2 Downloading the JAR
9.1.3 Getting-Started Example
9.1.4 Configuration
9.1.5 Collecting Images
9.1.6 Storing Collected Data in a Database
9.2 Using WebCollector
9.2.1 Introduction to WebCollector
9.2.2 Downloading the JAR
9.2.3 Getting-Started Example
9.2.4 Configuration
9.2.5 Extending HTTP Requests
9.2.6 Collecting Paginated Data
9.2.7 Collecting Images
9.2.8 Storing Collected Data in a Database
9.3 Using WebMagic
9.3.1 Introduction to WebMagic
9.3.2 Downloading the JAR
9.3.3 Getting-Started Example (Collecting Paginated Data)
9.3.4 Configuration
9.3.5 Data Storage Options
9.3.6 Storing Collected Data in a Database
9.3.7 Collecting Images
9.4 Chapter Summary
