Web Scraping with Python

Richard Lawson

Product Description

Successfully scrape data from any website with the power of Python

About This Book

  • A hands-on guide to web scraping with real-life problems and solutions
  • Techniques to download and extract data from complex websites
  • Create a number of different web scrapers to extract information

Who This Book Is For

This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python is useful but not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principles involved.

What You Will Learn

  • Extract data from web pages with simple Python programming
  • Build a threaded crawler to process web pages in parallel
  • Follow links to crawl a website
  • Cache downloads to reduce bandwidth use
  • Use multiple threads and processes to scrape faster
  • Learn how to parse JavaScript-dependent websites
  • Interact with forms and sessions
  • Solve CAPTCHAs on protected web pages
  • Discover how to track the state of a crawl
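
As a rough illustration of the link-following and caching items above, here is a minimal sketch using only the Python standard library. It is not taken from the book: the names `LinkParser`, `extract_links`, and `crawl` are hypothetical, the cache is a plain in-memory dictionary, and `fetch` is any function you supply that takes a URL and returns its HTML as a string.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(base_url, html):
    """Return absolute URLs for all links in html, without #fragments."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl that stays under start_url.

    The returned dict doubles as a cache, so each URL is fetched only once.
    """
    cache = {}
    queue = deque([start_url])
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        if url in cache:
            continue  # already downloaded; skip the repeat request
        cache[url] = fetch(url)
        for link in extract_links(url, cache[url]):
            if link.startswith(start_url) and link not in cache:
                queue.append(link)
    return cache
```

Because `fetch` is passed in, the same loop can be tried against a dictionary of canned pages before it is pointed at a real site.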

In Detail

The Internet contains the most useful set of data ever assembled, most of it publicly accessible for free. However, this data is not easily reusable: it is embedded within the structure and style of websites and must be carefully extracted to be useful. Web scraping is becoming increasingly useful as a way to gather and make sense of the wealth of information available online. With a simple language like Python, you can crawl information out of complex websites using straightforward programs.

This book is the ultimate guide to using Python to scrape data from websites. The early chapters cover how to extract data from static web pages and how to use caching to manage the load on servers. After the basics, we get our hands dirty building a more sophisticated threaded crawler and move on to more advanced topics. Learn step by step how to use Ajax URLs, employ the Firebug extension for monitoring, and scrape data indirectly. Discover further details of scraping, such as using a browser renderer, managing cookies, and submitting forms to extract data from complex websites protected by CAPTCHA. The book wraps up with how to create high-level scrapers with the Scrapy library and apply what has been learned to real websites.
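
To give a feel for the threaded crawling mentioned above, here is a minimal sketch that downloads several pages in parallel with the standard library's `ThreadPoolExecutor`. It is an assumption about one reasonable approach, not the book's own code; `fetch` and `download_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def download_all(urls, fetch=fetch, max_workers=4):
    """Fetch every URL on a pool of worker threads; return {url: page}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs each
        # URL with its own page even though downloads overlap in time.
        return dict(zip(urls, pool.map(fetch, urls)))
```

Threads suit this workload because each download spends most of its time waiting on the network. Keep `max_workers` modest so the target server is not overloaded.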

Style and approach

This book is a hands-on guide with real-life examples and solutions that start simple and become progressively more complex. Each chapter introduces a problem and then provides one or more possible solutions.
