Understanding Web Mining Techniques

Web mining involves using data mining techniques to extract knowledge from web data, which includes web documents, hyperlinks, and usage logs. It is categorized into three types: web content mining, web structure mining, and web usage mining, each focusing on different aspects of web data. The document also discusses the challenges of mining large datasets and various techniques employed in web mining, such as PageRank and HITS for structure mining, and the analysis of user behavior in usage mining.

Uploaded by

vsreevathsan

Web Mining

Contents
● Introduction to Web Mining
● Web Content Mining
● Web Structure Mining
● Web Usage Mining


What is Web Mining?
● Web mining is the use of data mining techniques to extract knowledge from web data.
● Web data includes:
○ web documents
○ hyperlinks between documents
○ usage logs of web sites
● The WWW is a huge, widely distributed, global information service centre and therefore
constitutes a rich source for data mining.
Data Mining vs Web Mining
● Data Mining: the process of identifying significant patterns in data that lead to better
outcomes.
● Web Mining: the process of performing data mining on the web, that is, extracting web
documents and discovering patterns in them.
Web Data Mining Process

Issues
● Web data sets can be very large
○ Tens to hundreds of terabytes
● Cannot be mined on a single server
○ Need large farms of servers
● Hardware and software must be properly organized to mine multi-terabyte data sets
● Difficulty in finding relevant information
● Difficulty in extracting new knowledge from the web
Web Mining Taxonomy

Web Content Mining - Introduction
● Mining, extraction and integration of useful data, information and knowledge from Web page
content.
● Web content mining is related to, but different from, data mining and text mining.
● Web data are mainly semi-structured and/or unstructured, while data mining deals primarily
with structured data.
Web Content Mining Includes
Unstructured Web Data Mining

Unstructured Documents - Feature Extraction
● Bag of words to represent unstructured documents
○ Takes each single word as a feature
○ Ignores the sequence in which words occur
● Features could be
○ Boolean
■ Word either occurs or does not occur in a document
○ Frequency based
■ Frequency of the word in a document
● Preprocessing variations include
○ Ignoring case and removing punctuation, infrequent words, and stop words.
● The number of features can be reduced using:
○ Feature selection measures: information gain, mutual information, cross entropy.
○ Stemming, which reduces words to their morphological roots.
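
The bag-of-words representation described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop-word list, the `min_doc_freq` threshold, and the two sample documents are assumptions made for the example, and stemming is left out.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def bag_of_words(docs, min_doc_freq=1):
    """Return (vocabulary, Boolean vectors, frequency vectors) for the documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(w for c in counts for w in c)
    # Dropping infrequent words is one simple way to reduce the feature set.
    vocab = sorted(w for w, df in doc_freq.items() if df >= min_doc_freq)
    freq = [[c[w] for w in vocab] for c in counts]
    boolean = [[1 if c[w] else 0 for w in vocab] for c in counts]
    return vocab, boolean, freq

docs = ["The web is a graph of pages.", "Mining the web: web pages and links."]
vocab, boolean, freq = bag_of_words(docs)
print(vocab)    # one feature per distinct word, word order ignored
print(boolean)  # 1 if the word occurs in the document, else 0
print(freq)     # number of times the word occurs in the document
```

Word order is discarded on purpose: each document becomes a fixed-length vector over the shared vocabulary, which is the form most classifiers and clustering methods expect.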
Structured Web Data


Mining Techniques Using Agent and Database



Agent-Based Approach
● Intelligent search agents: search for relevant information and use its characteristics to
organize and interpret what they discover.
● Information filtering/categorization: uses various information retrieval techniques and the
characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize
them. Examples: HyPursuit, BO (Bookmark Organizer).
● More generally, the approach involves developing sophisticated AI systems that act on behalf
of users, autonomously or semi-autonomously, to discover and organize information.
Database Approaches
Database approaches transform unstructured data into more structured, high-level collections of
resources, such as relational databases, and then use standard database querying mechanisms and
data mining techniques to access and analyze this information.

● Multilevel databases
○ Lowest level: semi-structured information is kept.
○ Higher levels: generalizations from lower levels, organized into relations and objects.
● Web query systems
○ Web-based query systems and languages that use SQL-like queries or natural language
processing (NLP) to extract data.
Typical Crawler



Text Mining - Brief



What is Web Structure Mining?
● Web structure mining is the process of discovering structure information from the web.
● The structure of a typical web graph consists of Web pages as nodes and hyperlinks as edges,
each connecting two related pages.
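
As a rough sketch of how such a graph might be built, the Python below extracts the `href` targets from each page's HTML and records them as edges. It uses only the standard-library `html.parser` module. The three-page site and the helper names `LinkExtractor` and `build_web_graph` are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def build_web_graph(pages):
    """Map each page (node) to the sorted list of known pages it links to (edges)."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        # Keep only links to pages inside the collection and ignore self-links.
        graph[url] = sorted({t for t in parser.links if t in pages and t != url})
    return graph

# Hypothetical three-page site used only for illustration.
pages = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="c.html">C</a>',
    "c.html": '<a href="a.html">A</a> <a href="http://example.com/">ext</a>',
}
graph = build_web_graph(pages)
print(graph)
```

The adjacency-list form (page mapped to its outgoing links) is a convenient input for link-analysis algorithms, since both out-degree and backlinks can be derived from it.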

Web Structure Mining (cont.)
● This type of mining can be performed either at the document level (intra-page) or at the
hyperlink level (inter-page).
● Research at the hyperlink level is called hyperlink analysis.
● Hyperlink structure can be used to retrieve useful information on the web.

There are two main approaches:

● PageRank
● Hubs and Authorities - HITS
PageRank
● Used to discover the most important pages on the web.
● Prioritizes pages returned from a search by looking at web structure.
● The importance of a page is calculated from the number of pages that point to it (backlinks).
● Weighting gives more importance to backlinks coming from important pages.
● PR(p) = (1 - d) + d (PR(1)/N1 + ... + PR(n)/Nn)
○ PR(i): PageRank of a page i that points to the target page p.
○ Ni: number of links going out of page i.
○ d: damping factor, a constant between 0 and 1 (0.85 is a common choice).
○ (1 - d): the chance that a surfer jumps to a random page instead of following a link. With this
form of the formula the PageRanks sum to the number of pages when every page has at least one
outgoing link; using (1 - d)/N in place of (1 - d) makes them sum to one.
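
The formula above is usually evaluated iteratively: start every page at the same value and reapply the update until the values settle. The Python below is a minimal sketch under stated assumptions. The three-page graph is made up for the example, `d = 0.85` and the fixed iteration count are illustrative choices, and every page is assumed to have at least one outgoing link.

```python
def pagerank(graph, d=0.85, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(i) / N_i) over pages i linking to p.

    graph maps each page to the list of pages it links to.
    """
    ranks = {page: 1.0 for page in graph}
    for _ in range(iterations):
        new_ranks = {}
        for page in graph:
            # Each backlink page i contributes its rank divided by its out-degree N_i.
            incoming = sum(
                ranks[src] / len(targets)
                for src, targets in graph.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(rank, 4))
```

In this example C ends up ranked highest because it has two backlinks, and A comes second because its single backlink comes from the highly ranked C. With this form of the formula the three ranks sum to 3, the number of pages.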
PageRank (cont.)

Hubs and Authorities
● Authoritative pages
○ An authority is the best source for the requested information.
○ Highly important pages.
● Hub pages
○ Contain links to highly important pages.
HITS (Hyperlink Induced Topic Search)
● Iterative algorithm for mining the Web graph to identify topic hubs and authorities.
● Algorithm:
○ Consider a matrix A with rows and columns corresponding to web pages. Aij = 1 if page i
links to page j, and 0 otherwise.
○ Let a and h be vectors whose ith components are the degree of authority and the
hubbiness of the ith page.
○ The hubbiness of a page is the sum of the authorities of all the pages it links to, i.e.
h = A x a.
○ The authority of a page is the sum of the hubbiness of all the pages that link to it, i.e.
a = At x h, where At is the transpose of A.
○ The two updates are repeated, rescaling a and h after each step, until the values stabilize.
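
The two update rules can be sketched as a short iteration in plain Python. The adjacency matrix below is a made-up three-page example, and rescaling each vector to unit Euclidean length after every step is an assumption added here to keep the values bounded; other normalizations are possible.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def hits(A, iterations=50):
    """Return (authority, hub) score vectors for adjacency matrix A.

    A[i][j] == 1 means page i links to page j.
    """
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        # a = At x h: authority of j sums the hubbiness of pages linking to j.
        a = normalize([sum(A[i][j] * h[i] for i in range(n)) for j in range(n)])
        # h = A x a: hubbiness of i sums the authority of pages i links to.
        h = normalize([sum(A[i][j] * a[j] for j in range(n)) for i in range(n)])
    return a, h

# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
authority, hub = hits(A)
print([round(x, 4) for x in authority])
print([round(x, 4) for x in hub])
```

In this example page 2 comes out as the strongest authority (it is linked from pages 0 and 1) and page 0 as the strongest hub (it links to both of the pages with high authority).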
Web Structure Mining applications
● Information retrieval in social networks.
● Determining the relevance of each web page.
● Measuring the completeness of Web sites.
● Finding relevant information in search engines.
Web Usage Mining
● Web usage mining: automatic discovery of patterns in clickstreams and associated data
collected or generated as a result of user interactions with one or more Web sites.
● Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
● The discovered patterns are usually represented as collections of pages, objects, or resources that
are frequently accessed by groups of users with common interests.
● Data in Web Usage Mining:
a. Web server logs
b. Site contents
c. Data about the visitors, gathered from external channels
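
As a small illustration of the idea, the Python below groups successful requests from a web server log by visitor and reports pairs of pages that several visitors accessed. It is a sketch under stated assumptions: the log lines are invented, they are assumed to follow the Common Log Format, and the client address is treated as the visitor identity, which is a simplification.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Matches Common Log Format lines: host, identity, user, [time], "request", status, size.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+'
)

def pages_by_visitor(log_lines):
    """Map each visitor (client address) to the set of pages fetched successfully."""
    visits = defaultdict(set)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match and match.group(3) == "200":
            visits[match.group(1)].add(match.group(2))
    return visits

def frequent_page_pairs(visits, min_visitors=2):
    """Return page pairs accessed together by at least min_visitors visitors."""
    pair_counts = Counter()
    for pages in visits.values():
        for pair in combinations(sorted(pages), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_visitors}

# Hypothetical log lines used only for illustration.
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:01:09 +0000] "GET /products HTTP/1.1" 200 1024',
    '10.0.0.3 - - [01/Jan/2024:10:02:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.3 - - [01/Jan/2024:10:02:03 +0000] "GET /missing HTTP/1.1" 404 128',
]
visits = pages_by_visitor(log_lines)
pairs = frequent_page_pairs(visits)
print(pairs)
```

Real usage mining typically adds steps this sketch skips, such as splitting a visitor's requests into sessions and filtering out robots, before looking for patterns.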
