Data Mining:
Concepts and Techniques
Jiawei Han, Micheline Kamber,
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Data Mining: On What Kind of Data?
Classifications of Data Mining Systems
Major Challenges in Data Mining
Data Mining: Concepts and Techniques
April 12, 2012
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks
Science: Remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
Necessity is the mother of invention: data mining, the automated analysis of massive data sets
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s, computational science
Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
1990-now, data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible
Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
1960s:
Data collection, database creation, and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Data Mining: On What Kind of Data?
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Structure and Network Analysis
Evaluation of Knowledge
Applications of Data Mining
Major Challenges in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Is everything data mining? No. The following, for example, are not:
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection & Transformation -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
The KDD process, in which data mining is one step, generally involves:
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
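The seven steps listed above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a definitive implementation: the two toy sources, the field names, the `kdd_pipeline` function, and the `min_count` threshold are all hypothetical and chosen only to make each step visible.

```python
from collections import Counter

# Hypothetical toy sources: two "databases" of purchase records.
STORE_A = [
    {"customer": "c1", "item": "milk", "qty": 2},
    {"customer": "c2", "item": "bread", "qty": 1},
    {"customer": "c3", "item": None, "qty": 1},      # missing item: noise
]
STORE_B = [
    {"customer": "c4", "item": "Milk ", "qty": 1},   # inconsistent spelling
    {"customer": "c5", "item": "eggs", "qty": -3},   # invalid quantity: noise
    {"customer": "c6", "item": "milk", "qty": 4},
]


def kdd_pipeline(sources, min_count=2):
    # 1. Data cleaning: drop records with a missing item or invalid quantity.
    cleaned = [[r for r in src if r["item"] and r["qty"] > 0] for src in sources]
    # 2. Data integration: combine the cleaned sources into one collection.
    integrated = [r for src in cleaned for r in src]
    # 3. Data selection: keep only the attributes relevant to the task.
    selected = [(r["item"], r["qty"]) for r in integrated]
    # 4. Data transformation: bring item names into one canonical form.
    transformed = [(item.strip().lower(), qty) for item, qty in selected]
    # 5. Data mining: count how many records mention each item.
    counts = Counter(item for item, _ in transformed)
    # 6. Pattern evaluation: keep only items that reach the threshold.
    interesting = {item: n for item, n in counts.items() if n >= min_count}
    # 7. Knowledge presentation: render the surviving patterns as text.
    return [f"{item}: appears in {n} records" for item, n in sorted(interesting.items())]


report = kdd_pipeline([STORE_A, STORE_B])
print(report)
```

In a real system each step would be far more involved; the point of the sketch is only the ordering, with cleaning and integration happening before any pattern is counted or evaluated.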
Data Mining in Business Intelligence
Increasing potential to support business decisions (bottom to top)
[Figure: layered pyramid, listed here from top to bottom]
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles alongside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA
KDD Process
This is a view from typical machine learning and statistics communities
[Figure: Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing]
Data pre-processing: data integration, normalization, feature selection, dimension reduction
Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
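The pre-processing, data mining, and post-processing stages of this view can be illustrated with a short sketch. It assumes min-max scaling as the normalization step and plain k-means as the clustering step. Both are stand-ins picked for brevity, and the data set and function names are hypothetical.

```python
import math

# Hypothetical input data: (age in years, annual income in dollars).
RAW = [(25, 30_000), (27, 32_000), (30, 35_000),
       (55, 90_000), (58, 95_000), (60, 99_000)]


def min_max_normalize(rows):
    """Pre-processing: rescale every feature to the range [0, 1].

    Assumes each feature takes at least two distinct values.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def k_means(points, centroids, iterations=10):
    """Data mining: plain k-means from the given starting centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def summarize(rows, labels):
    """Post-processing: describe each cluster in the original units."""
    summary = {}
    for k in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == k]
        summary[k] = {
            "size": len(members),
            "mean_age": sum(r[0] for r in members) / len(members),
        }
    return summary


normalized = min_max_normalize(RAW)
# Starting from the first and last points keeps the run deterministic.
labels = k_means(normalized, [normalized[0], normalized[-1]])
clusters = summarize(RAW, labels)
print(labels)
print(clusters)
```

Normalizing first matters here because income is several orders of magnitude larger than age; without it, the distance computation would be dominated by income alone.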
Example: Medical Data Mining
Health care and medical data mining often adopts this statistics and machine learning view:
Preprocessing of the data (including feature extraction and dimension reduction)
Classification and/or clustering processes
Post-processing for presentation
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing]
Multi-Dimensional View of Data Mining
Knowledge to be mined (or: data mining functions)
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and Web, multimedia, graphs & social and information networks
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as terabytes
High dimensionality of data
Microarray data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Data Mining Functionalities
Data mining functionalities specify the kinds of patterns to be found in data mining tasks. These tasks fall into two categories:
1. Descriptive tasks
2. Predictive tasks
1. Descriptive tasks
1.1 Data characterization and discrimination
a. Identifying data
b. Selecting data
c. Identifying and selecting the data
1.2 Mining frequent patterns, associations, and correlations
a. Frequent itemsets
b. Frequent subsequences
c. Frequent substructures
2. Predictive tasks
2.1 Data classification and prediction
a. Decision trees
b. Neural networks
2.2 Cluster analysis
2.3 Outlier analysis
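As one concrete instance of the descriptive tasks above, frequent item sets and an association rule can be computed by brute force on a small basket data set. This is only a sketch: the transactions, the function names, and the 0.6 support threshold are hypothetical, and the exhaustive enumeration stands in for scalable algorithms, which the slides do not specify.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Enumerate all item sets of up to `max_size` items and keep the frequent ones."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                result[combo] = s
    return result


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


frequent = frequent_itemsets(TRANSACTIONS, min_support=0.6)
rule_conf = confidence({"diapers"}, {"beer"}, TRANSACTIONS)
print(frequent)
print(rule_conf)
```

Support measures how often an item set occurs at all, while confidence measures how often the consequent appears given the antecedent; a rule is usually reported only when both exceed user-chosen thresholds.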
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
Classifications of data mining systems
Classification according to the kinds of databases mined
Classification according to the kinds of knowledge mined
Classification according to the kinds of techniques utilized
Classification according to the applications adapted