Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage

Authors: Zdravko Markov, Daniel T. Larose
Publisher: Wiley
Published: 2007-04-01
ISBN-13: 9780471666554
ISBN-10: 0471666556
Binding: Hardcover
Pages: 218





Description

This book introduces the reader to methods of data mining on the web, including uncovering patterns in web content (classification, clustering, language processing), structure (graphs, hubs, metrics), and usage (modeling, sequence analysis, performance).

Table of Contents 

PREFACE.
PART I: WEB STRUCTURE MINING.
1 INFORMATION RETRIEVAL AND WEB SEARCH.
Web Challenges.
Web Search Engines.
Topic Directories.
Semantic Web.
Crawling the Web.
Web Basics.
Web Crawlers.
Indexing and Keyword Search.
Document Representation.
Implementation Considerations.
Relevance Ranking.
Advanced Text Search.
Using the HTML Structure in Keyword Search.
Evaluating Search Quality.
Similarity Search.
Cosine Similarity.
Jaccard Similarity.
Document Resemblance.
References.
Exercises.
2 HYPERLINK-BASED RANKING.
Introduction.
Social Networks Analysis.
PageRank.
Authorities and Hubs.
Link-Based Similarity Search.
Enhanced Techniques for Page Ranking.
References.
Exercises.
PART II: WEB CONTENT MINING.
3 CLUSTERING.
Introduction.
Hierarchical Agglomerative Clustering.
k-Means Clustering.
Probability-Based Clustering.
Finite Mixture Problem.
Classification Problem.
Clustering Problem.
Collaborative Filtering (Recommender Systems).
References.
Exercises.
4 EVALUATING CLUSTERING.
Approaches to Evaluating Clustering.
Similarity-Based Criterion Functions.
Probabilistic Criterion Functions.
MDL-Based Model and Feature Evaluation.
Minimum Description Length Principle.
MDL-Based Model Evaluation.
Feature Selection.
Classes-to-Clusters Evaluation.
Precision, Recall, and F-Measure.
Entropy.
References.
Exercises.
5 CLASSIFICATION.
General Setting and Evaluation Techniques.
Nearest-Neighbor Algorithm.
Feature Selection.
Naive Bayes Algorithm.
Numerical Approaches.
Relational Learning.
References.
Exercises.
PART III: WEB USAGE MINING.
6 INTRODUCTION TO WEB USAGE MINING.
Definition of Web Usage Mining.
Cross-Industry Standard Process for Data Mining.
Clickstream Analysis.
Web Server Log Files.
Remote Host Field.
Date/Time Field.
HTTP Request Field.
Status Code Field.
Transfer Volume (Bytes) Field.
Common Log Format.
Identification Field.
Authuser Field.
Extended Common Log Format.
Referrer Field.
User Agent Field.
Example of a Web Log Record.
Microsoft IIS Log Format.
Auxiliary Information.
References.
Exercises.
7 PREPROCESSING FOR WEB USAGE MINING.
Need for Preprocessing the Data.
Data Cleaning and Filtering.
Page Extension Exploration and Filtering.
De-Spidering the Web Log File.
User Identification.
Session Identification.
Path Completion.
Directories and the Basket Transformation.
Further Data Preprocessing Steps.
References.
Exercises.
8 EXPLORATORY DATA ANALYSIS FOR WEB USAGE MINING.
Introduction.
Number of Visit Actions.
Session Duration.
Relationship between Visit Actions and Session Duration.
Average Time per Page.
Duration for Individual Pages.
References.
Exercises.
9 MODELING FOR WEB USAGE MINING: CLUSTERING, ASSOCIATION, AND CLASSIFICATION.
Introduction.
Modeling Methodology.
Definition of Clustering.
The BIRCH Clustering Algorithm.
Affinity Analysis and the A Priori Algorithm.
Discretizing the Numerical Variables: Binning.
Applying the A Priori Algorithm to the CCSU Web Log Data.
Classification and Regression Trees.
The C4.5 Algorithm.
References.
Exercises.
INDEX.



