Web Mining
Contents
● Introduction to Web Mining
● Web Content Mining
● Web Structure Mining
● Web Usage Mining
What is Web Mining?
● Web mining is the use of data mining techniques to extract knowledge from web data.
● Web data includes:
○ web documents
○ hyperlinks between documents
○ usage logs of web sites
● The WWW is a huge, widely distributed, global information service centre and therefore
constitutes a rich source for data mining.
Data Mining vs Web Mining
● Data Mining: identifying significant patterns in data that lead to a better outcome.
● Web Mining: performing data mining on the web, that is, extracting web documents and
discovering patterns in them.
Web Data Mining Process
Issues
● Web data sets can be very large
○ Tens to hundreds of terabytes
● Cannot mine on a single server
○ Need large farms of servers
● Mining multi-terabyte data sets requires properly organized hardware and software
● Difficulty in finding relevant information
● Extracting new knowledge from the web
Web Mining Taxonomy
Web Content Mining - Introduction
● Mining, extraction and integration of useful data, information and knowledge from Web page
content.
● Web content mining is related to, but different from, data mining and text mining.
● Web data are mainly semi-structured and/or unstructured, while data mining deals primarily
with structured data.
Web Content Mining Includes
Unstructured Web Data Mining
Unstructured Documents - Feature Extraction
● Bag of words to represent unstructured documents
○ Takes each single word as a feature
○ Ignores the sequence in which words occur
● Features could be
○ Boolean
■ Word either occurs or does not occur in a document
○ Frequency based
■ Frequency of the word in a document
● Variations of the feature set include
○ Converting to a single case and removing punctuation, infrequent words, and stop words.
● The number of features can be reduced further using:
○ Feature selection measures such as information gain, mutual information, and cross entropy.
○ Stemming, which reduces words to their morphological roots.
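The bag-of-words representation described above can be sketched in a few lines. This is a minimal illustration, not a full pipeline: the stop-word list, the tokenizer, and the function names are assumptions made for the example, and stemming and feature selection are left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "on"}

def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def bag_of_words(documents):
    """Return (vocabulary, frequency vectors, Boolean vectors)."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # One feature per distinct word; word order inside a document is ignored.
    vocabulary = sorted(set().union(*counts))
    frequency = [[c[w] for w in vocabulary] for c in counts]
    boolean = [[1 if c[w] > 0 else 0 for w in vocabulary] for c in counts]
    return vocabulary, frequency, boolean

docs = ["The web is a graph of pages.", "Pages link to pages on the web!"]
vocabulary, frequency, boolean = bag_of_words(docs)
```

Each document becomes one row: the frequency row counts occurrences of every vocabulary word, and the Boolean row only records whether the word occurs.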
Structured Web Data
Mining Techniques Using Agent and Database
Agent-Based Approach
● Intelligent search agents: agents that search for relevant information and use domain
characteristics to organize and interpret what they discover.
● Information filtering/categorization: uses various information retrieval techniques and the
characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize
them. Examples: HyPursuit and BO (Bookmark Organizer).
● Sophisticated AI systems that act on behalf of users, autonomously or semi-autonomously,
to discover and organize information.
Database Approaches
Database approaches transform unstructured data into more structured, high-level collections of
resources, such as relational databases, and use standard database querying mechanisms and
data mining techniques to access and analyze this information.
● Multilevel databases
○ Lowest level: semi-structured information is kept.
○ Higher levels: generalizations from lower levels, organized into relations and objects.
● Web query systems
○ Web-based query systems and languages have been developed that use techniques such as SQL
and natural language processing (NLP) to extract data.
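As a rough illustration of the database approach, the sketch below loads a few records into a relational table and queries them with SQL. The records, the table name `products`, and its columns are hypothetical; the step that extracts such records from real web pages is assumed to have already happened and is not shown.

```python
import sqlite3

# Hypothetical records, as if already extracted from semi-structured pages.
records = [
    ("a.html", "Laptop X", 899.0),
    ("b.html", "Laptop Y", 1099.0),
    ("c.html", "Phone Z", 499.0),
]

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (url TEXT, name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", records)

# Standard SQL query over the structured collection.
cheap = conn.execute(
    "SELECT name FROM products WHERE price < ? ORDER BY price", (1000.0,)
).fetchall()
conn.close()
```

Once the data is in relational form, any standard query or data mining tool that works on tables can be applied to it.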
Typical Crawler
Text Mining - Brief
What is Web Structure Mining?
● Web structure mining is the process of discovering structure information from the web.
● A typical web graph consists of web pages as nodes and hyperlinks as edges, each connecting
two related pages.
[Figure: web graph; labels: Hyperlink (edge), Web document (node)]
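A web graph of this kind can be built by extracting the hyperlinks from each page. The sketch below uses Python's standard `html.parser`; the page names, the toy HTML, and the helper names `LinkExtractor` and `build_web_graph` are assumptions for illustration, and links are matched by exact string rather than resolved as real URLs.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def build_web_graph(pages):
    """Map each page (node) to the sorted pages it links to (edges)."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        # Keep only links to pages inside the collection; drop self-links.
        graph[url] = sorted({t for t in parser.links if t in pages and t != url})
    return graph

# Hypothetical three-page site.
pages = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="c.html">C</a>',
    "c.html": '<a href="a.html">A</a> <a href="http://example.com/">ext</a>',
}
web_graph = build_web_graph(pages)
```

The resulting adjacency lists are the input that link-analysis algorithms operate on.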
Web Structure Mining (cont.)
● This type of mining can be performed either at the document level (intra-page) or at the
hyperlink level (inter-page).
● Research at the hyperlink level is called hyperlink analysis.
● Hyperlink structure can be used to retrieve useful information on the web.
There are two main approaches:
● PageRank
● Hubs and Authorities - HITS
PageRank
● Used to discover the most important pages on the web.
● Prioritize pages returned from search by looking at web structure.
● The importance of a page is calculated from the number of pages that point to it (backlinks).
● Weighting is used to provide more importance to backlinks coming from important pages.
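● PR(p) = (1 - d) + d × (PR(1)/N1 + … + PR(n)/Nn)
○ PR(i): PageRank of page i, one of the n pages that point to target page p.
○ Ni: number of links going out of page i.
○ d: damping factor, a constant between 0 and 1 (commonly set to 0.85).
○ (1 - d): the chance that a random surfer jumps to an arbitrary page instead of following a link.
With this form the PageRank values average to one when every page has outgoing links; dividing
the (1 - d) term by the total number of pages makes them sum to one instead.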
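The PageRank formula above can be evaluated by repeated substitution. The following is a minimal sketch under stated assumptions: the three-page graph, the value d = 0.85, the fixed iteration count, and the function name `pagerank` are illustrative choices, and every page is assumed to have at least one outgoing link.

```python
def pagerank(graph, d=0.85, iterations=50):
    """Iterate PR(p) = (1 - d) + d * sum(PR(i) / N_i) over pages i linking to p.

    graph maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    ranks = {page: 1.0 for page in graph}
    for _ in range(iterations):
        new_ranks = {}
        for page in graph:
            # Each backlink source i contributes PR(i) / N_i.
            incoming = sum(
                ranks[src] / len(targets)
                for src, targets in graph.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this example C ranks highest because it receives links from both other pages, and A ranks above B because its single backlink comes from the highly ranked C. With the (1 - d) form used here, the scores on this graph sum to the number of pages rather than to one.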
PageRank (cont.)
Hubs and Authorities
● Authoritative pages
○ An authority is the best source for the requested information.
○ Highly important pages.
● Hub pages
○ Contain links to highly important pages (authorities).
HITS (Hyperlink Induced Topic Search)
● Iterative algorithm for mining the Web graph to identify the topic hubs and authorities.
● Algorithm:
○ Consider a matrix A whose rows and columns correspond to web pages: Aij = 1 if page i links
to page j, and 0 otherwise.
○ Let a and h be vectors whose i-th components are the degree of authority and the hubbiness
of page i.
○ The hubbiness of a page is the sum of the authorities of all the pages it links to, i.e.
h = A a.
○ The authority of a page is the sum of the hubbiness of all the pages that link to it, i.e.
a = A^T h, where A^T is the transpose of A.
○ The two updates are repeated, with the vectors rescaled after each step, until the scores
stabilize.
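The two update rules can be sketched directly from the matrix definitions. The example matrix, the iteration count, and the helper names are assumptions; the rescaling to unit Euclidean length after each step is also an assumption of this sketch (it keeps the scores bounded), not something the definitions above prescribe.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def hits(A, iterations=50):
    """Iterate h = A a and a = A^T h on adjacency matrix A.

    A[i][j] == 1 means page i links to page j.
    Returns (authority, hub) score lists.
    """
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # Hubbiness: sum of authorities of the pages that i links to (h = A a).
        h = normalize([sum(A[i][j] * a[j] for j in range(n)) for i in range(n)])
        # Authority: sum of hubbiness of the pages linking to j (a = A^T h).
        a = normalize([sum(A[i][j] * h[i] for i in range(n)) for j in range(n)])
    return a, h

# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(A)
```

Here page 2 ends up as the strongest authority (two hubs point to it) and page 0 as the strongest hub (it points to both authorities).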
Web Structure Mining applications
● Information retrieval in social networks.
● Determining the relevance of each web page.
● Measuring the completeness of web sites.
● Finding relevant information in search engines.
Web Usage Mining
● Web usage mining: automatic discovery of patterns in clickstreams and associated data
collected or generated as a result of user interactions with one or more Web sites.
● Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
● The discovered patterns are usually represented as collections of pages, objects, or resources that
are frequently accessed by groups of users with common interests.
● Data in Web Usage Mining:
a. Web server logs
b. Site contents
c. Data about the visitors, gathered from external channels
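A first step in working with web server logs is turning raw log lines into per-visitor clickstreams. The sketch below assumes the logs follow the Common Log Format and identifies a visitor by host address alone; both are simplifying assumptions, and the sample lines, regular expression, and function names are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed Common Log Format:
# host ident user [time] "METHOD path PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+$'
)

def parse_log(lines):
    """Group successfully served pages by visitor host, in log order."""
    clicks = defaultdict(list)
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        # Skip malformed lines and requests that did not succeed.
        if match and match.group("status") == "200":
            clicks[match.group("host")].append(match.group("path"))
    return dict(clicks)

def page_counts(clickstreams):
    """Count how often each page was accessed across all visitors."""
    return Counter(path for paths in clickstreams.values() for path in paths)

# Hypothetical log lines.
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2024:10:01:09 +0000] "GET /missing.html HTTP/1.1" 404 0',
]
clickstreams = parse_log(log_lines)
counts = page_counts(clickstreams)
```

The per-visitor page sequences and the access counts are the kind of raw material from which frequently accessed groups of pages can then be mined.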