100% found this document useful (1 vote)
20 views45 pages

CLARANS in Spatial Web Mining

This document discusses spatial data mining and web mining. For spatial data mining, it covers spatial data types and attributes, spatial queries and data structures like quad trees and R-trees. It also discusses spatial rules, clustering algorithms and applications. For web mining, it provides an overview of web content mining, structure mining and usage mining. It describes techniques like crawlers, PageRank, HITS and pattern discovery from web server logs.

Uploaded by

rekha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
20 views45 pages

CLARANS in Spatial Web Mining

This document discusses spatial data mining and web mining. For spatial data mining, it covers spatial data types and attributes, spatial queries and data structures like quad trees and R-trees. It also discusses spatial rules, clustering algorithms and applications. For web mining, it provides an overview of web content mining, structure mining and usage mining. It describes techniques like crawlers, PageRank, HITS and pattern discovery from web server logs.

Uploaded by

rekha
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Spatial & Web Mining

1
• Spatial Data: Introduction
• Spatial Data Overview
• Spatial Rule
• Spatial Clustering Algorithm
Spatial Object

• Contains both spatial and nonspatial


attributes.
• Must have a location type attributes:
– Latitude/longitude
– Zip code
– Street address
• May retrieve object using either (or
both) spatial or nonspatial attributes.
Spatial Data Mining
Applications
• Geology
• GIS Systems
• Environmental Science
• Agriculture
• Medicine
• Robotics
• May involved both spatial and
temporal aspects
Spatial Queries
• Spatial selection may involve specialized
selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
• Region (Range) Query – find objects that
intersect a given region.
• Nearest Neighbor Query – find object close to
identified object.
• Distance Scan – find object within a certain
distance of an identified object where distance
is made increasingly larger.
Spatial Data Structures
• Data structures designed specifically to store or
index spatial data.
• Often based on B-tree or Binary Search Tree
• Cluster data on disk basked on geographic
location.
• May represent complex spatial structure by
placing the spatial object in a containing
structure of a specific geographic shape.
• Techniques:
– Quad Tree
– R-Tree
– k-D Tree
MBR

• Minimum Bounding Rectangle


• Smallest rectangle that completely
contains the object
MBR Examples
Quad Tree

• Hierarchical decomposition of the space


into quadrants (MBRs)
• Each level in the tree represents the
object as the set of quadrants which
contain any portion of the object.
• Each level is a more exact
representation of the object.
• The number of levels is determined by
the degree of accuracy desired.
Quad Tree Example
R-Tree

• As with Quad Tree the region is


divided into successively smaller
rectangles (MBRs).
• Rectangles need not be of the same
size or number at each level.
• Rectangles may actually overlap.
• Lowest level cell has only one object.
• Tree maintenance algorithms similar
to those for B-trees.
R-Tree Example
K-D Tree

• Designed for multi-attribute data, not


necessarily spatial
• Variation of binary search tree
• Each level is used to index one of the
dimensions of the spatial object.
• Lowest level cell has only one object
• Divisions not based on MBRs but
successive divisions of the dimension
range.
k-D Tree Example
Spatial Rules
• Characteristic Rule
The average family income in Dallas is
$50,000.
• Discriminant Rule
The average family income in Dallas is
$50,000, while in Plano the average income is
$75,000.
• Association Rule
The average family income in Dallas for
families living near White Rock Lake is
$100,000.
Spatial Clustering

• Detect clusters of irregular shapes


• Use of centroids and simple distance
approaches may not work well.
• Clusters should be independent of
order of input.
CLARANS Extensions

• Remove main memory assumption of


CLARANS.
• Use spatial index techniques.
• Use sampling and R*-tree to identify
central objects.
• Change cost calculations by reducing
the number of objects examined.
• Voronoi Diagram
Voronoi
Web Mining Outline

• Introduction
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
Web Mining Taxonomy
Web Content Mining

• Extends work of basic search engines


• Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis
Crawlers
• Robot (spider) traverses the hypertext
sructure in the Web.
• Collect information from visited pages
• Used to construct indexes for search
engines
• Traditional Crawler – visits entire Web (?)
and replaces index
• Periodic Crawler – visits portions of the
Web and updates subset of index
• Incremental Crawler – selectively searches
the Web and incrementally modifies index
• Focused Crawler – visits pages related to a
particular subject
Focused Crawler

• Only visit links from a page if that


page is determined to be relevant.
• Classifier is static after learning
phase.
• Components:
– Classifier which assigns relevance score
to each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages to based on crawler
and distiller scores.
Focused Crawler

• Classifier to related documents to


topics
• Classifier also determines how useful
outgoing links are
• Hub Pages contain links to many
relevant pages. Must be visited even if
not high relevance score.
Focused Crawler
Context Focused Crawler

• Context Graph:
– Context graph created for each seed
document .
– Root is the sedd document.
– Nodes at each level show documents with links
to documents at next higher level.
– Updated during crawl itself .
• Approach:
1. Construct context graph and classifiers using
seed documents as training data.
2. Perform crawling using classifiers and context
graph created.
Context Graph
Virtual Web View
• Multiple Layered DataBase (MLDB) built on
top of the Web.
• Each layer of the database is more generalized
(and smaller) and centralized than the one
beneath it.
• Upper layers of MLDB are structured and can be
accessed with SQL type queries.
• Translation tools convert Web documents to
XML.
• Extraction tools extract desired information to
place in first layer of MLDB.
• Higher levels contain more summarized data
obtained through generalizations of the lower
levels.
Personalization

• Web access or contents tuned to better fit the


desires of each user.
• Manual techniques identify user’s
preferences based on profiles or
demographics.
• Collaborative filtering identifies
preferences based on ratings from similar
users.
• Content based filtering retrieves pages
based on similarity between pages and user
profiles.
Web Structure Mining

• Mine structure (links, graph) of the Web


• Techniques
– PageRank
– CLEVER
• Create a model of the Web organization.
• May be combined with content mining
to more effectively retrieve important
pages.
PageRank
• Used by Google
• Prioritize pages returned from search
by looking at Web structure.
• Importance of page is calculated
based on number of pages which
point to it – Backlinks.
• Weighting is used to provide more
importance to backlinks coming form
important pages.
PageRank (cont’d)

• PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)


– PR(i): PageRank for a page i which points
to target page p.
– Ni: number of links coming out of page i
CLEVER

• Identify authoritative and hub pages.


• Authoritative Pages :
– Highly important pages.
– Best source for requested information.
• Hub Pages :
– Contain links to highly important pages.
HITS

• Hyperlink-Induces Topic Search


• Based on a set of keywords, find set of
relevant pages – R.
• Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or
from R.
– Calculate weights for authorities and hubs.
• Pages with highest ranks in R are returned.
Web Usage Mining

• Extends work of basic search engines


• Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis
Web Usage Mining
Applications
• Personalization
• Improve structure of a site’s Web
pages
• Aid in caching and prediction of future
page references
• Improve design of individual pages
• Improve effectiveness of e-commerce
(sales and advertising)
Web Usage Mining Activities
• Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a
sitting.
• Pattern Discovery
– Count patterns that occur in sessions
– Pattern is sequence of pages references in session.
– Similar to association rules
• Transaction: session
• Itemset: pattern (or subset)
• Order is important
• Pattern Analysis
Web Usage Mining Issues

• Identification of exact user not


possible.
• Exact sequence of pages referenced
by a user not possible due to caching.
• Session not well defined
• Security, privacy, and legal issues
Data Structures

• Keep track of patterns identified


during Web usage mining process
• Common techniques:
– Trie
– Suffix Tree
– Generalized Suffix Tree
– WAP Tree
Trie vs. Suffix Tree

• Trie:
– Rooted tree
– Edges labeled which character (page)
from pattern
– Path from root to leaf represents pattern.
• Suffix Tree:
– Single child collapsed with parent. Edge
contains labels of both prior edges.
Trie and Suffix Tree
Generalized Suffix Tree

• Suffix tree for multiple sessions.


• Contains patterns from all sessions.
• Maintains count of frequency of
occurrence of a pattern in the node.
• WAP Tree:
Compressed version of generalized suffix
tree
Types of Patterns

• Algorithms have been developed to


discover different types of patterns.
• Properties:
– Ordered – Characters (pages) must occur in the
exact order in the original session.
– Duplicates – Duplicate characters are allowed in
the pattern.
– Consecutive – All characters in pattern must
occur consecutive in given session.
– Maximal – Not subsequence of another pattern.
Pattern Types

• Association Rules
None of the properties hold
• Episodes
Only ordering holds
• Sequential Patterns
Ordered and maximal
• Forward Sequences
Ordered, consecutive, and maximal
• Maximal Frequent Sequences
All properties hold
Episodes

• Partially ordered set of pages


• Serial episode – totally ordered with
time constraint
• Parallel episode – partial ordered with
time constraint
• General episode – partial ordered
with no time constraint

You might also like