Bahir Dar Institute of Technology
Faculty of Computing
Department of Information System
Data Mining and Data Warehousing
By: Belete B.
Chapter one
Introduction
In this chapter we will cover the following issues in
brief
– Motivation: Why data mining?
– What is data mining?
– Data Mining: On what kind of data?
– Data mining functionalities
Motivation:
“Necessity is the Mother of Invention”
Our capacity to generate and collect data has increased
rapidly over the last several decades
A huge amount of data is available at our fingertips
It is predicted that more data will be produced in the next
year than has been generated during the entire existence of
humankind!
According to Witten and Frank, the amount of data stored in
the world's databases is estimated to double every twenty
months
Contributing factors include
– Widespread use of bar codes for most commercial products,
– Computerization of many business, scientific, and governmental
transactions,
– Advances in data collection tools (audio, video, satellite remote
sensing, scanning, image capturing tools)
– Usage of WWW as a global information system (the Internet in
general),
– Development of comprehensive application software,
– New computing and storage technologies
All of this has made it easier to create, collect, and store all
types of data
The result is a problem known as data explosion
Data explosion is the problem of an enterprise holding huge
amounts of data in databases, data warehouses, and other
information repositories, generated by automated data
collection tools
As the volume of data increases, the proportion of it that
people can understand decreases substantially; the larger the
data, the harder it becomes to analyze
This shows that people's understanding of the data at hand
cannot keep pace with the rate at which data is generated
in its various forms, which results in a widening
information gap
Consequently, scholars began to recognize this bottleneck and
to look into possible remedies
Current technological progress permits the storage of and
access to large amounts of data at virtually no cost
The true value is not in storing the data, but rather in our ability to
extract useful reports and to find interesting trends and correlations
that support the decisions and policies made by businesses
We are drowning in data, but starving for knowledge!
To bridge the gap between the large volumes of data to be analyzed and
the useful information and knowledge needed for decision making,
a new generation of computerized methods known as Data
Mining (DM) has emerged in recent years
What is Data Mining?
Different scholars have provided different definitions of DM
According to Berry and Linoff (2000) and Han and Kamber (2006), DM
is the process of extracting or mining knowledge from large amounts
of data in order to discover meaningful patterns and rules
Data mining is the extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) information or patterns
from data in large databases (e.g. a data warehouse)
The term data mining is a misnomer, as it does not directly describe
what the process does
It would be better described as knowledge mining from data
rather than data mining
Anyway, we will use the term with this understanding
Alternative names
– Knowledge discovery (mining) from databases (KDD),
– knowledge extraction,
– data/pattern analysis,
– data archeology,
– data dredging,
– information harvesting,
– business intelligence, etc.
DM involves the use of sophisticated data analysis tools
to discover previously unknown, valid patterns and
relationships in large datasets
– These tools can include statistical models, mathematical
algorithms, and machine learning methods
According to Han and Kamber (2006), DM has attracted a
great deal of attention in the information industry in
recent years mainly because of
– the wide availability of huge amounts of data and
– the imminent need for turning such data into useful
information and knowledge
The information and knowledge gained can be used for
applications ranging from market analysis, fraud detection,
and customer retention, to production control and science
exploration
Data Mining: On What Kind of Data?
In principle, data mining is not specific to one type of media
or data
– Data mining should be applicable to any kind of information
repository
– Data mining is being put into use and studied for several kinds of
databases and repositories, described next
Relational databases
– A relational database is a collection of tables, each of which is
assigned a unique name
– Relational databases are one of the most commonly available and rich
information repositories, and thus they are a major data form in our
study of data mining
– DM algorithms using relational databases can be more versatile than
DM algorithms specifically written for flat files, since they can take
advantage of the structure inherent to relational databases
– While DM can benefit from SQL for data selection, transformation, and
consolidation, it goes beyond what SQL can provide, with tasks such as
predicting, comparing, and detecting deviations
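The division of labour between SQL and DM can be sketched as follows. This is a minimal illustration only, assuming Python's built-in sqlite3 module; the purchase table, its columns, the data, and the 1.5 standard deviation threshold are all hypothetical choices made for the example. SQL does the selection and consolidation, and a simple deviation check is then applied to the consolidated result.

```python
import sqlite3
from statistics import mean, stdev

# Hypothetical relational table of purchases (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?)",
    [("ann", 60), ("ann", 40), ("ben", 110), ("cat", 50),
     ("cat", 40), ("dan", 105), ("eve", 500)],
)

# SQL step: select and consolidate the data (total spend per customer).
totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM purchase GROUP BY customer"
).fetchall())
conn.close()

# Step beyond plain querying: flag totals that deviate strongly from the rest.
# The 1.5 standard deviation threshold is an arbitrary choice for this sketch.
values = list(totals.values())
avg = mean(values)
sd = stdev(values)
deviations = sorted(c for c, t in totals.items() if abs(t - avg) / sd > 1.5)

print(totals)
print("Unusual customers:", deviations)
```

With this toy data only one customer's total lies far from the others, so only that customer is flagged. Real deviation detection would use more robust methods than a single threshold.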
Data warehouses
– A data warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and usually residing
at a single site
– Data warehouses are constructed via a process of data cleaning,
data integration, data transformation, data loading, and periodic
data refreshing
– To facilitate decision making, the data in a data warehouse are
organized around major subjects, such as customer, item, supplier,
and activity
– The data are stored to provide information from a historical
perspective and are typically summarized
Transactional databases
– A transactional database consists of a file where each record
represents a transaction
– One typical data mining analysis on such data is the so-called market
basket analysis
Advanced DB and information repositories
– Spatial databases: store geographical information such as maps
and global or regional positioning
• Such spatial databases present new challenges to data mining
algorithms
– Multimedia databases: include video, images, audio, and text
media
• Multimedia data are characterized by high dimensionality, which makes
data mining even more challenging
– WWW: the most heterogeneous and dynamic repository
available
• Conceptually, the World Wide Web comprises three major
components: the content of the Web, the structure of the Web, and the
usage of the Web
• Data mining in the WWW, or web mining, is often divided into web
content mining, web structure mining, and web usage mining
Data Mining Functionalities
Data mining functionalities are used to specify the kind of
patterns to be found in a data mining task
Data mining tasks can be broadly classified as
– Descriptive (unsupervised)
– Predictive (supervised)
Descriptive data mining tasks characterize the general
properties of the data in a database
Predictive data mining tasks perform inference on the
current data in order to make predictions about future
cases
– They permit the value of one variable to be predicted from the known
values of other variables
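The difference between the two kinds of task can be sketched in a few lines of Python. This is only an illustration: the advertising and sales figures are made up, and a least-squares straight line is just one simple assumed way of predicting one variable from another. The descriptive part summarizes the data at hand; the predictive part fits a model and applies it to a value not seen before.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend and resulting sales.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

# Descriptive task: characterize general properties of the data.
summary = {"min": min(sales), "max": max(sales),
           "mean": sum(sales) / len(sales)}

# Predictive task: learn from current data, then predict an unseen case.
slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 5.0 + intercept

print(summary)
print(predicted_sales)
```

The toy data lie exactly on a straight line, so the fitted model reproduces it; real data would leave some error around the line.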
The supervised (predictive) data mining functionalities include
– Classification
– Regression
– Time series analysis
– Prediction
The unsupervised (descriptive) data mining functionalities include
– Association rule discovery
– Clustering analysis
– Summarization
– Sequence discovery
Data Mining Functionalities:
Classification
Classification is the process of finding a set of models that
describe and distinguish data classes, so that the model can be
used to predict the class of an object whose class is
unknown
– The derived model is based on a training data set and can be represented in
various forms, such as classification IF-THEN rules, decision trees,
mathematical formulae, or neural networks
– Classification approaches normally use a training set where all objects are
already associated with known class labels
– The classification algorithm learns from the training set and builds a model
– The model is used to classify new objects
Different algorithms are used for classification,
such as decision trees, neural networks, genetic algorithms,
naïve Bayes, etc.
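The train-then-classify workflow described above can be sketched with a very small example. Note the assumptions: the sketch uses a nearest-centroid classifier, which is not one of the algorithms named on this slide but is short enough to read in full, and the (income, debt) records and risk labels are hypothetical.

```python
from math import dist

def train(training_set):
    """Build the model: one centroid (mean point) per class label."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}

def classify(model, features):
    """Predict the class whose centroid is closest to the new object."""
    return min(model, key=lambda label: dist(model[label], features))

# Training set: every object already carries a known class label.
# Hypothetical (income, debt) records, both in thousands.
training_set = [
    ((50, 5), "low_risk"),
    ((60, 10), "low_risk"),
    ((20, 30), "high_risk"),
    ((30, 40), "high_risk"),
]

model = train(training_set)          # learn from the training set
new_label = classify(model, (58, 8)) # classify an object of unknown class

print(model)
print(new_label)
```

The model here is simply the pair of class centroids; a decision tree or naïve Bayes classifier would follow the same two steps with a different internal representation.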
Data Mining Functionalities
Cluster analysis
Clustering is a DM technique that finds similarities between
data according to the characteristics found in the data and
groups similar data objects into one cluster
In cluster analysis, class labels are unknown and a set of
data is given to be grouped
The objective of clustering is to distribute cases (people,
objects, events, etc.) into groups so that the degree of
association is strong between members of the same
cluster (high intra-cluster similarity) and weak between members of
different clusters (low inter-cluster similarity)
Clustering tools assign records to the same cluster if
they have something in common, making it easier to discover
meaningful patterns in the dataset
Clustering often serves as a starting point for some supervised
DM techniques or modeling
Generally, similar to classification, clustering is the organization
of data in classes
However, unlike classification, in clustering, class labels are
unknown and it is up to the clustering algorithm to discover
acceptable classes
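How an algorithm can discover groups without class labels can be sketched with k-means. This is an assumption for illustration: the slides do not name a specific clustering algorithm, the points are made up, and the initialization (taking the first k points as starting centroids) is deliberately naive.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group unlabeled points into k clusters around moving centroids."""
    centroids = list(points[:k])  # naive start: the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, clusters

# Hypothetical unlabeled data: no class labels are supplied.
points = [(1, 1), (8, 8), (1, 2), (9, 8), (2, 1), (8, 9)]

centroids, clusters = k_means(points, k=2)

print(centroids)
print(clusters)
```

Points that end up in the same cluster are close to each other (strong association within a cluster) and far from the other cluster's points (weak association between clusters).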
Data Mining Functionalities:
Association Rule Mining
Association rule mining aims to extract interesting
correlations, frequent patterns, associations, or causal
structures among sets of items in transaction databases
or other data repositories
It studies the frequency of items occurring together in
transactional databases and, based on a threshold called
support, identifies the frequent itemsets
Another threshold, confidence, which is the conditional
probability that an item appears in a transaction when
another item appears, is used to pinpoint association rules
Association analysis is commonly used for market basket
analysis
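Support and confidence can be computed directly from their definitions. The following is a minimal sketch on a hypothetical four-transaction market basket; the items and the function names are illustrative, and no frequent-itemset search algorithm is implemented here.

```python
# Hypothetical market basket data: one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of consequent given antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule under test: bread -> milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)

print(rule_support)
print(rule_confidence)
```

Bread and milk appear together in 2 of the 4 transactions, so the support of the rule is 0.5; bread appears in 3 transactions, 2 of which also contain milk, so the confidence is 2/3. A rule is reported only if both values reach the chosen thresholds.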
Quiz 1
1. What is data mining and what is it used for?
2. What is the association rule mining technique? Give
two examples of association rule mining algorithms
3. What are the main reasons for DM to attract a
great deal of attention in the information
industry in recent years according to Han and
Kamber?
4. What do we call DM applied to the WWW, and what are the
three aspects of DM that can be applied to the WWW?