Data Mining and
Predictive
Modeling
Lecture 9: Association Rule Mining, Apriori Algorithm
Association Rule
• It is an importantMining
data mining model studied extensively by the
database and data mining community.
• Assume all data are categorical.
• Initially used for market basket analysis to find how items purchased
by customers are related.
Bread Milk [sup = 5%, conf = 100%]
Frequent Pattern
• Frequent Analysis
pattern: a pattern ( a set of items,
subsequences, substructures, etc.) that occurs frequently in a data
set.
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in
the context of frequent itemsets and association rule mining.
• Motivation: Finding inherent regularities in data
• What products were often purchased together?
• What are the subsequent purchases after buying a PC?
• Can we automatically classify web documents?
• Applications:
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis
Market Basket
Analysis
Why Frequent Pattern
Mining?
• Freq. pattern: An intrinsic and important property of datasets
• Foundation for many essential data mining tasks
• Association, correlation, and causality analysis
• Sequential, structural (e.g., sub-graph) patterns
• Pattern analysis in spatiotemporal, multimedia, time-series, and
stream data
• Classification: frequent pattern analysis
• Cluster analysis: frequent pattern-based clustering
• Data warehousing: iceberg cube and cube-gradient
• Semantic data compression
• Broad applications
Basing Concepts: Frequent
Patterns
•I = {i1, i2, …, im}: a set of items.
• Transaction t :
• t a set of items, and t I.
•Transaction Database T: a set of transactions T = {t1,
t2, …, tn}.
Transaction Data:
Supermarket
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
• An item: an item/article in a basket
• I: the set of all items sold in the store
• A transaction: items purchased in a basket; it may have TID
(transaction ID)
• A transactional dataset: A set of transactions
The Model:
rules
• A transaction t contains X, a set of items (itemset) in I,
if X t.
• An association rule is an implication of the form:
X Y, where X, Y I, and X Y =
• An itemset is a set of items.
• E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
• E.g., {milk, bread, cereal} is a 3-itemset
Exam
Tid Items bought ple• itemset: A set of one or more items
10 Beer, Nuts, Diaper
• k-itemset X = {x1, …, xk}
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs • (absolute) support, or, support count
40 Nuts, Eggs, Milk of X: Frequency or occurrence of an
50 Nuts, Coffee, Diaper, Eggs, Milk
itemset X
• (relative) support, s, is the fraction
Customer
buys
Customer of transactions that contains X (i.e.,
buys diaper
both the probability that a transaction
contains X)
• An itemset X is frequent if X’s
support is no less than a minsup
Customer threshold
buys beer
Association
Tid Items bought Rules
• Find all the rules X Y with
10 Beer, Nuts, Diaper minimum support and confidence
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
• support, s, probability that a
40 Nuts, Eggs, Milk transaction contains X Y
50 Nuts, Coffee, Diaper, Eggs, Milk • confidence, c, conditional
Customer
probability that a transaction
Customer
buys both
buys
having X also contains Y
diaper Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer,
Diaper}:3
Customer
buys Association rules: (many
beer
Beer Diaper
more!)
(60%,
100%)
Diaper Beer (60%, 1
0
Frequent Itemset Mining
Methods
Frequent Itemset Mining
Methods
Apriori: A Candidate Generation-and-Test Approach
Improving the efficiency of Apriori
FPGrowth: A Frequent Pattern-Growth Approach
Frequent Pattern Mining with Vertical Data Format
Proposed by R. Agrawal and R. Srikanth in
1994.
Apriori The name of this algorithm is based on the
fact that the algorithm uses prior knowledge
Algorith of the frequent itemset properties.
m Apriori employs an iterative approach known
as a level-wise search where k-itemsets are
used to explore (k+1) itemsets.
Apriori Algorithm
Idea is to generate candidate itemsets of a given size and then scan
dataset to check if their counts are really large. The process is
iterative
i) All singeltons itemsets are candidate in first pass. Any items with
less than specified support value is eliminated.
ii)Two member itemsets
iii)Three member itemsets
iv) Frequent itemsets constitute set of frequent itemsets
Generate Association rule which have confidence value greater than
or equal to specified minimum confidence.
Apriori
• Method:
Method
• Initially, scan DB once to get frequent 1-itemset
• Generate length (k+1) candidate itemsets from length k frequent
itemsets
• Terminate when no frequent or candidate set can be generated
• To improve the efficiency of the level-wise generation of frequent
itemsets, the apriori property is used to reduce the search space
• All non-empty subsets of a frequent itemset must also be frequent
• If {beer, diaper, nuts} is frequent, so is {beer, diaper}
Apriori -
Example
Sup = min Itemset sup
Itemset sup
2 {A} 2
Database TDB L1 {A} 2
Tid Items C1 {B} 3
{B} 3
10 A, C, D {C} 3
{C} 3
20 B, C, E 1st scan {D} 1
{E} 3
30 A, B, C, E {E} 3
40 B, E
C2 Itemset sup
C2 Itemset
{A, B} 1
L2 Itemset sup
{A, C} 2 2nd scan {A, B}
{A, C} 2 {A, C}
{A, E} 1
{B, C} 2 {A, E}
{B, C} 2
{B, E} 3
{B, E} 3 {B, C}
{C, E} 2
{C, E} 2 {B, E}
{C, E}
C3 Itemset L3 Itemset sup
{B, C, E}
3rd scan {B, C, E} 2
Apriori -
Algorithm
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are
contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return k Lk;
Apriori -
Implementation
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
Generating Association Rules from
Frequent
• Method Itemsets
• For each frequent itemset l, generate all nonempty subsets of l
• For every nonempty subset of l, output the rule s -> (l - s) if (support_coutn(l) /
support_count(s)) >=min_conf
• Because the rules are generated from frequent itemsets, each one automatically
satisfies the minimum support
Generating Association Rules
•Example
Example: Given the following table and the frequent itemset X = {I1, I2, I5}, generate
the association rules.
• The nonempty subsets of X are:
{I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}
• The resulting association rules are:
• If the minimum confidence threshold is, say, 70%, then only the second, third, and last
rules are output, because these are the only ones generated that are strong
Improving the Efficiency of
Apriori
• Major computational challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Improving Apriori: general ideas
• Reduce passes of transaction database scans
• Shrink number of candidates
• Facilitate support counting of candidates
Scan Database only
Twice
• Scan 1: partition database and find local frequent patterns.
• Scan 2: consolidate global frequent patterns.
Hash-based Technique: Reduce the
Number of candidates
• A k-itemset whose corresponding hashing bucket count is below the threshold
cannot be frequent
Exam
ple with five transactions. Let
• Consider the following database
min_sup = 60% and min_conf = 80%.
• Find all the frequent itemsets using Apriori method
• List all the strong association rules matching the following
metarule
Thank
You