Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage

Zdravko Markov, Daniel T. Larose

  • Publisher: Wiley
  • Publication date: 2007-04-01
  • List price: $1,980
  • Sale price: $990
  • Language: English
  • Pages: 218
  • Binding: Hardcover
  • ISBN: 0471666556
  • ISBN-13: 9780471666554
  • Related category: Data-mining

Description

This book introduces the reader to methods of data mining on the web, including uncovering patterns in web content (classification, clustering, language processing), structure (graphs, hubs, metrics), and usage (modeling, sequence analysis, performance). 

 

Table of Contents 

PREFACE.

PART I: WEB STRUCTURE MINING.

1 INFORMATION RETRIEVAL AND WEB SEARCH.

Web Challenges.

Web Search Engines.

Topic Directories.

Semantic Web.

Crawling the Web.

Web Basics.

Web Crawlers.

Indexing and Keyword Search.

Document Representation.

Implementation Considerations.

Relevance Ranking.

Advanced Text Search.

Using the HTML Structure in Keyword Search.

Evaluating Search Quality.

Similarity Search.

Cosine Similarity.

Jaccard Similarity.

Document Resemblance.

References.

Exercises.

2 HYPERLINK-BASED RANKING.

Introduction.

Social Networks Analysis.

PageRank.

Authorities and Hubs.

Link-Based Similarity Search.

Enhanced Techniques for Page Ranking.

References.

Exercises.

PART II: WEB CONTENT MINING.

3 CLUSTERING.

Introduction.

Hierarchical Agglomerative Clustering.

k-Means Clustering.

Probability-Based Clustering.

Finite Mixture Problem.

Classification Problem.

Clustering Problem.

Collaborative Filtering (Recommender Systems).

References.

Exercises.

4 EVALUATING CLUSTERING.

Approaches to Evaluating Clustering.

Similarity-Based Criterion Functions.

Probabilistic Criterion Functions.

MDL-Based Model and Feature Evaluation.

Minimum Description Length Principle.

MDL-Based Model Evaluation.

Feature Selection.

Classes-to-Clusters Evaluation.

Precision, Recall, and F-Measure.

Entropy.

References.

Exercises.

5 CLASSIFICATION.

General Setting and Evaluation Techniques.

Nearest-Neighbor Algorithm.

Feature Selection.

Naive Bayes Algorithm.

Numerical Approaches.

Relational Learning.

References.

Exercises.

PART III: WEB USAGE MINING.

6 INTRODUCTION TO WEB USAGE MINING.

Definition of Web Usage Mining.

Cross-Industry Standard Process for Data Mining.

Clickstream Analysis.

Web Server Log Files.

Remote Host Field.

Date/Time Field.

HTTP Request Field.

Status Code Field.

Transfer Volume (Bytes) Field.

Common Log Format.

Identification Field.

Authuser Field.

Extended Common Log Format.

Referrer Field.

User Agent Field.

Example of a Web Log Record.

Microsoft IIS Log Format.

Auxiliary Information.

References.

Exercises.

7 PREPROCESSING FOR WEB USAGE MINING.

Need for Preprocessing the Data.

Data Cleaning and Filtering.

Page Extension Exploration and Filtering.

De-Spidering the Web Log File.

User Identification.

Session Identification.

Path Completion.

Directories and the Basket Transformation.

Further Data Preprocessing Steps.

References.

Exercises.

8 EXPLORATORY DATA ANALYSIS FOR WEB USAGE MINING.

Introduction.

Number of Visit Actions.

Session Duration.

Relationship between Visit Actions and Session Duration.

Average Time per Page.

Duration for Individual Pages.

References.

Exercises.

9 MODELING FOR WEB USAGE MINING: CLUSTERING, ASSOCIATION, AND CLASSIFICATION.

Introduction.

Modeling Methodology.

Definition of Clustering.

The BIRCH Clustering Algorithm.

Affinity Analysis and the A Priori Algorithm.

Discretizing the Numerical Variables: Binning.

Applying the A Priori Algorithm to the CCSU Web Log Data.

Classification and Regression Trees.

The C4.5 Algorithm.

References.

Exercises.

INDEX.
