Spidering Hacks (Paperback)

Kevin Hemenway, Tara Calishain

  • Publisher: O'Reilly
  • Publication date: 2003-12-02
  • List price: $1,240
  • Member price: $1,178 (5% off)
  • Language: English
  • Pages: 424
  • Binding: Paperback
  • ISBN: 0596005776
  • ISBN-13: 9780596005771
  • Categories: Python, Web crawlers
  • Imported title (must be checked out separately)

Description

Summary

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.

Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:


  • Aggregate and associate data from disparate locations, then store and manipulate the data as you like
  • Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
  • Integrate third-party data into your own applications or web sites
  • Make your own site easier to scrape and more usable to others
  • Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day


Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.
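The scraping approach the book teaches leans heavily on pattern matching (see hack 23, "In Praise of Regular Expressions"). As a minimal illustration of the idea--sketched here in Python rather than the book's Perl, with an invented HTML snippet--a single regular expression can pull structured pairs out of raw markup:

```python
import re

# An invented fragment of the kind of HTML a spider might fetch.
html = """
<ul>
  <li><a href="/hacks/23">In Praise of Regular Expressions</a></li>
  <li><a href="/hacks/24">Painless RSS with Template::Extract</a></li>
</ul>
"""

# Extract (URL, link text) pairs -- the same pattern-matching idea
# the book implements with Perl regular expressions.
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

for url, text in links:
    print(f"{url}\t{text}")
```

Regexes like this are quick but brittle against site redesigns, which is why the book also covers parser-based tools such as HTML::TreeBuilder and HTML::TokeParser.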

Table of Contents

  • Credits  

    Preface  

    Chapter 1. Walking Softly 

          1. A Crash Course in Spidering and Scraping  

          2. Best Practices for You and Your Spider  

          3. Anatomy of an HTML Page  

          4. Registering Your Spider  

          5. Preempting Discovery  

          6. Keeping Your Spider Out of Sticky Situations  

          7. Finding the Patterns of Identifiers  

    Chapter 2. Assembling a Toolbox 

        Perl Modules

        Resources You May Find Helpful

          8. Installing Perl Modules  

          9. Simply Fetching with LWP::Simple  

          10. More Involved Requests with LWP::UserAgent  

          11. Adding HTTP Headers to Your Request  

          12. Posting Form Data with LWP  

          13. Authentication, Cookies, and Proxies  

          14. Handling Relative and Absolute URLs  

          15. Secured Access and Browser Attributes  

          16. Respecting Your Scrapee's Bandwidth  

          17. Respecting robots.txt  

          18. Adding Progress Bars to Your Scripts  

          19. Scraping with HTML::TreeBuilder  

          20. Parsing with HTML::TokeParser  

          21. WWW::Mechanize 101  

          22. Scraping with WWW::Mechanize  

          23. In Praise of Regular Expressions  

          24. Painless RSS with Template::Extract  

          25. A Quick Introduction to XPath  

          26. Downloading with curl and wget  

          27. More Advanced wget Techniques  

          28. Using Pipes to Chain Commands  

          29. Running Multiple Utilities at Once  

          30. Utilizing the Web Scraping Proxy  

          31. Being Warned When Things Go Wrong  

          32. Being Adaptive to Site Redesigns  

    Chapter 3. Collecting Media Files 

          33. Detective Case Study: Newgrounds  

          34. Detective Case Study: iFilm  

          35. Downloading Movies from the Library of Congress  

          36. Downloading Images from Webshots  

          37. Downloading Comics with dailystrips  

          38. Archiving Your Favorite Webcams  

          39. News Wallpaper for Your Site  

          40. Saving Only POP3 Email Attachments  

          41. Downloading MP3s from a Playlist  

          42. Downloading from Usenet with nget  

    Chapter 4. Gleaning Data from Databases 

          43. Archiving Yahoo! Groups Messages with yahoo2mbox  

          44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups  

          45. Gleaning Buzz from Yahoo!  

          46. Spidering the Yahoo! Catalog  

          47. Tracking Additions to Yahoo!  

          48. Scattersearch with Yahoo! and Google  

          49. Yahoo! Directory Mindshare in Google  

          50. Weblog-Free Google Results  

          51. Spidering, Google, and Multiple Domains  

          52. Scraping Amazon.com Product Reviews  

          53. Receive an Email Alert for Newly Added Amazon.com Reviews  

          54. Scraping Amazon.com Customer Advice  

          55. Publishing Amazon.com Associates Statistics  

          56. Sorting Amazon.com Recommendations by Rating  

          57. Related Amazon.com Products with Alexa  

          58. Scraping Alexa's Competitive Data with Java  

          59. Finding Album Information with FreeDB and Amazon.com  

          60. Expanding Your Musical Tastes  

          61. Saving Daily Horoscopes to Your iPod  

          62. Graphing Data with RRDTOOL  

          63. Stocking Up on Financial Quotes  

          64. Super Author Searching  

          65. Mapping O'Reilly Best Sellers to Library Popularity  

          66. Using All Consuming to Get Book Lists  

          67. Tracking Packages with FedEx  

          68. Checking Blogs for New Comments  

          69. Aggregating RSS and Posting Changes  

          70. Using the Link Cosmos of Technorati  

          71. Finding Related RSS Feeds  

          72. Automatically Finding Blogs of Interest  

          73. Scraping TV Listings  

          74. What's Your Visitor's Weather Like?  

          75. Trendspotting with Geotargeting  

          76. Getting the Best Travel Route by Train  

          77. Geographic Distance and Back Again  

          78. Super Word Lookup  

          79. Word Associations with Lexical Freenet  

          80. Reformatting Bugtraq Reports  

          81. Keeping Tabs on the Web via Email  

          82. Publish IE's Favorites to Your Web Site  

          83. Spidering GameStop.com Game Prices  

          84. Bargain Hunting with PHP  

          85. Aggregating Multiple Search Engine Results  

          86. Robot Karaoke  

          87. Searching the Better Business Bureau  

          88. Searching for Health Inspections  

          89. Filtering for the Naughties  

    Chapter 5. Maintaining Your Collections 

          90. Using cron to Automate Tasks  

          91. Scheduling Tasks Without cron  

          92. Mirroring Web Sites with wget and rsync  

          93. Accumulating Search Results Over Time  

    Chapter 6. Giving Back to the World 

          94. Using XML::RSS to Repurpose Data  

          95. Placing RSS Headlines on Your Site  

          96. Making Your Resources Scrapable with Regular Expressions  

          97. Making Your Resources Scrapable with a REST Interface  

          98. Making Your Resources Scrapable with XML-RPC  

          99. Creating an IM Interface  

          100. Going Beyond the Book  

    Index
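
Chapter 1 ("Walking Softly") and hack 17 ("Respecting robots.txt") stress polite spidering: check a site's robots.txt before fetching. A minimal sketch of that check in Python (the book works in Perl; the rules, bot name, and URLs below are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt of the kind a well-behaved spider consults first.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask before fetching: skip any path the rules disallow.
allowed = rp.can_fetch("MyBot/1.0", "http://example.com/index.html")
blocked = rp.can_fetch("MyBot/1.0", "http://example.com/private/data.html")
print(allowed, blocked)
```

In practice a spider would download robots.txt from the target host (e.g. via `rp.set_url(...)` and `rp.read()`) rather than parse a hardcoded string.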
