Understanding Web Mining Techniques

Web mining involves using data mining techniques to extract knowledge from web data, which includes web documents, hyperlinks, and usage logs. It is categorized into three types: web content mining, web structure mining, and web usage mining, each focusing on different aspects of web data. The document also discusses the challenges of mining large datasets and various techniques employed in web mining, such as PageRank and HITS for structure mining, and the analysis of user behavior in usage mining.

Uploaded by

vsreevathsan

Web Mining

Contents
● Introduction to Web Mining

● Web Content mining

● Web Structure mining

● Web Usage mining

What is Web Mining?
● Web mining is the use of data mining techniques to extract knowledge from web data.
● Web data includes :
○ web documents
○ hyperlinks between documents
○ usage logs of web sites
● The WWW is a huge, widely distributed, global information service centre and therefore
constitutes a rich source for data mining.
Data Mining vs Web Mining
● Data Mining: the process of identifying significant patterns in data that can be used to reach
better outcomes.
● Web Mining: the process of applying data mining to the web, that is, extracting web
documents and discovering patterns in them.
Web Data Mining Process

Issues
● Web data sets can be very large
○ Tens to hundreds of terabytes
● Cannot be mined on a single server
○ Need large farms of servers
● Hardware and software must be properly organized to mine multi-terabyte data sets
● Difficulty in finding relevant information
● Extracting new knowledge from the web
Web Mining Taxonomy

Web Content Mining - Introduction
● Mining, extraction and integration of useful data, information and knowledge from Web page
content.
● Web content mining is related to, but different from, data mining and text mining.
● Web data are mainly semi-structured and/or unstructured, while data mining deals primarily
with structured data.
Web Content Mining Includes
Unstructured Web Data Mining

Unstructured Documents - Feature Extraction
● Bag of words to represent unstructured documents
○ Takes each single word as a feature
○ Ignores the sequence in which words occur
● Features could be
○ Boolean
■ Word either occurs or does not occur in a document
○ Frequency based
■ Frequency of the word in a document
● Preprocessing variations applied before feature extraction include
○ Ignoring case and removing punctuation, infrequent words, stop words, etc.
● The number of features can be reduced using
○ Feature selection measures: information gain, mutual information, cross entropy.
○ Stemming, which reduces words to their morphological roots.
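The bag-of-words representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the stop-word list, the `tokenize` and `bag_of_words` helpers, and the two sample documents are all assumptions made for the example. It builds both the Boolean and the frequency-based feature vectors.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def bag_of_words(docs):
    """Return (vocabulary, Boolean vectors, frequency vectors) for the documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))          # one feature per distinct word
    freq = [[c[w] for w in vocab] for c in counts]
    boolean = [[1 if c[w] else 0 for w in vocab] for c in counts]
    return vocab, boolean, freq

# Hypothetical sample documents.
docs = ["The web is a graph of pages.", "Pages link to pages in the web graph!"]
vocab, boolean, freq = bag_of_words(docs)
print(vocab)
print(boolean)
print(freq)
```

Word order is discarded on purpose: each document is reduced to one vector over the shared vocabulary, which is what makes the representation easy to feed into standard data mining algorithms.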
Structured Web Data

Mining Techniques Using Agent and Database


Agent-Based Approach
● Intelligent search agents: agents developed to search for relevant information, using its
characteristics to organize and interpret what they discover.
● Information filtering/categorization: uses various information retrieval techniques and the
characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize
them. Examples: HyPursuit and BO (Bookmark Organizer).
● Sophisticated AI systems that act on behalf of users, autonomously or semi-autonomously, to
discover and organize information.
Database Approaches
These approaches transform unstructured data into more structured, higher-level collections of
resources, such as relational databases, and then use standard database querying mechanisms and
data mining techniques to access and analyze the information.

●Multilevel databases
○ Lowest level: semi-structured information is kept.
○ Higher levels: generalizations from the lower levels, organized into relations and objects.
●Web query systems
○ Web-based query systems and languages that use standard database query languages such as SQL, or natural language processing (NLP), to extract data.
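As a rough illustration of the database approach, the sketch below pulls the hyperlinks out of a semi-structured HTML page, stores them in a relational table, and then queries them with ordinary SQL. It uses only Python's standard `html.parser` and `sqlite3` modules. The page content, the URLs, the `LinkExtractor` class, and the `links` table layout are hypothetical, chosen only for the example.

```python
import sqlite3
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical page URL and content, for illustration only.
PAGE_URL = "http://example.org/index.html"
PAGE_HTML = """
<html><body>
<a href="http://example.org/a.html">Page A</a>
<a href="http://example.org/b.html">Page B</a>
<a href="http://example.org/a.html">A again</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(PAGE_HTML)

# Structured level: one relation holding the extracted links.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (source TEXT, target TEXT, anchor TEXT)")
conn.executemany("INSERT INTO links VALUES (?, ?, ?)",
                 [(PAGE_URL, href, text) for href, text in parser.links])

# Standard SQL over the extracted data: how often is each target linked?
rows = conn.execute(
    "SELECT target, COUNT(*) FROM links "
    "GROUP BY target ORDER BY COUNT(*) DESC, target"
).fetchall()
print(rows)
```

Once the links are in a table, any further analysis (joins with page metadata, aggregation per site, and so on) can be expressed as queries rather than custom parsing code.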
Typical Crawler


Text Mining - Brief


What is Web Structure Mining?
● Web structure mining is the process of discovering structure information from the web.
● A typical web graph consists of Web pages as nodes and hyperlinks as edges, each edge
connecting two related pages.

[Figure: web graph with web documents as nodes and hyperlinks as edges]
Web Structure Mining (cont.)
● This type of mining can be performed either at the document level (intra-page) or at the
hyperlink level (inter-page).
● Research at the hyperlink level is called hyperlink analysis.
● Hyperlink structure can be used to retrieve useful information on the web.

There are two main approaches:

● PageRank
● Hubs and Authorities - HITS
PageRank
● Used to discover the most important pages on the web.
● Prioritizes the pages returned by a search by looking at the web's link structure.
● The importance of a page is calculated from the number of pages that point to it (backlinks).
● Backlinks are weighted so that links coming from important pages count for more.
● PR(p) = (1 - d) + d × (PR(1)/N1 + ... + PR(n)/Nn)
○ PR(i): PageRank of a page i that points to the target page p.
○ Ni: number of links going out of page i.
○ d: damping factor, a constant between 0 and 1 (0.85 is a common choice).
○ (1 - d): baseline score that every page receives regardless of its backlinks. With this form of the formula the PageRanks average to one (they sum to the number of pages) when every page has at least one outgoing link; replacing (1 - d) with (1 - d)/N, where N is the total number of pages, makes them sum to one.
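The formula above can be evaluated by simple repeated substitution. The following is a minimal sketch under stated assumptions: the function name `pagerank`, the dictionary representation of the graph, the choice d = 0.85, the fixed iteration count, and the three-page example graph are all illustrative. It also assumes every page has at least one outgoing link, since the formula divides by Ni.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(i) / N_i) over all pages i linking to p.

    links maps each page to the list of pages it links to. Every page must
    appear as a key and have at least one outgoing link (assumption).
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}            # initial guess: every page scores 1
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Each page i that links to p passes on PR(i) split over its N_i links.
            incoming = sum(pr[i] / len(links[i]) for i in pages if p in links[i])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical three-page graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this small graph page C ends up with the highest score because it is pointed to by both A and B, and A comes second because its single backlink comes from the highly ranked C. That second effect is the weighting of backlinks by the importance of their source.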
PageRank (cont.)

Hubs and Authorities
● Authoritative pages
○ An authority is defined as the best source for the requested information.
○ Highly important pages.
● Hub pages
○ Contain links to highly important pages.

[Figure: hubs and authorities]
HITS (Hyperlink Induced Topic Search)
● Iterative algorithm for mining the Web graph to identify the topic hubs and authorities.
● Algorithm:
○ Consider a matrix A whose rows and columns correspond to web pages: Aij = 1 if page i links to page j, and 0 otherwise.
○ Let a and h be vectors whose ith components are the degree of authority and the hubbiness of the ith page.
○ The hubbiness of a page is the sum of the authorities of all the pages it links to, i.e. h = A × a.
○ The authority of a page is the sum of the hubbiness of all the pages that link to it, i.e. a = A^T × h, where A^T is the transpose of A.
○ The two updates are repeated, normalizing a and h after each step, until the vectors stop changing.
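The two update rules can be sketched directly in plain Python. This is an illustrative version, not a reference implementation: the function names, the fixed number of iterations, the Euclidean normalization after each round (needed so the values do not grow without bound), and the three-page adjacency matrix are assumptions made for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def hits(A, iterations=50):
    """Return (hub scores h, authority scores a) for adjacency matrix A.

    A[i][j] == 1 means page i links to page j.
    """
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        # a = A^T x h: authority of j sums the hubbiness of pages linking to j.
        a = normalize([sum(A[i][j] * h[i] for i in range(n)) for j in range(n)])
        # h = A x a: hubbiness of i sums the authority of pages i links to.
        h = normalize([sum(A[i][j] * a[j] for j in range(n)) for i in range(n)])
    return h, a

# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
hubs, authorities = hits(A)
print("hubs:", [round(x, 3) for x in hubs])
print("authorities:", [round(x, 3) for x in authorities])
```

Here page 0 comes out as the strongest hub because it links to both other pages, and page 2 as the strongest authority because both other pages link to it.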
Web Structure Mining Applications
● Information retrieval in social networks.
● Estimating the relevance of each web page.
● Measuring the completeness of Web sites.
● Ranking results in search engines so that relevant information is found.
Web Usage Mining
● Web usage mining: automatic discovery of patterns in clickstreams and associated data
collected or generated as a result of user interactions with one or more Web sites.
● Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
● The discovered patterns are usually represented as collections of pages, objects, or resources that
are frequently accessed by groups of users with common interests.
● Data in Web Usage Mining:
a. Web server logs
b. Site contents
c. Data about the visitors, gathered from external channels
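A very small sketch of usage mining on web server logs is given below. It assumes the logs are in Common Log Format, treats each IP address as one visitor (a simplification, since real systems also have to identify sessions), and reports the pairs of pages that are accessed together by at least `min_support` visitors. The log lines, the regular expression, and the helper names are hypothetical.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Matches: host ident user [timestamp] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+'
)

# Hypothetical log lines, for illustration only.
LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:01:10 +0000] "GET /products HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Jan/2024:10:01:30 +0000] "GET /cart HTTP/1.1" 200 1024',
    '10.0.0.3 - - [01/Jan/2024:10:02:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.3 - - [01/Jan/2024:10:02:05 +0000] "GET /missing HTTP/1.1" 404 0',
]

def pages_by_visitor(lines):
    """Group successfully served pages (status 200) by visitor IP address."""
    visits = defaultdict(set)
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and m.group(3) == "200":
            visits[m.group(1)].add(m.group(2))
    return visits

def frequent_pairs(visits, min_support=2):
    """Return page pairs accessed together by at least min_support visitors."""
    counts = Counter()
    for pages in visits.values():
        counts.update(combinations(sorted(pages), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

visits = pages_by_visitor(LOG_LINES)
pairs = frequent_pairs(visits)
print(pairs)
```

The resulting pairs are a simple instance of the "collections of pages frequently accessed by groups of users" mentioned above; larger page sets or ordered sequences would call for a full frequent-itemset or sequence mining algorithm.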
