Business Analytics (PEC)
BAD417B
SEMESTER 7
Professor Rachitha E
CSE-Data Science,
RNSIT
Data mining, once a new technology, is now widely used by
organizations to extract valuable patterns from large datasets. Nobel
laureate Dr. Arno Penzias highlighted it as a “killer application” for
companies, and Thomas Davenport (2006) noted that companies like
Amazon, Capital One, and Marriott use analytics to understand
customers, optimize supply chains, and maximize ROI.
Importance of Data Mining:
• Helps organizations understand customers, vendors, and business
processes.
• Enabled by reduced storage and processing costs and data
consolidation into warehouses.
• Provides strategic advantage in competitive, global markets.
Origins and Techniques:
• Rooted in statistics and artificial intelligence since the 1980s.
• Initially meant discovering unknown patterns in data; now often used
broadly for data analysis.
Reasons for Popularity:
• Global competition and changing customer needs.
• Recognition of untapped value in large datasets.
• Consolidation of databases and single-view access.
• Advances in storage and processing technology at lower costs.
• Movement toward demassification (digital transformation of business
practices).
Applications:
• Science & Research: Astronomy, genomics, nuclear physics, medical
research.
Commercial:
Finance & Insurance
Retail
Healthcare
End Users (Miners):
• Can ask ad hoc questions and get quick answers with minimal
programming.
• Creativity is needed to interpret unexpected results.
• Tools integrate easily with spreadsheets and software, enabling fast
analysis.
• Parallel processing may be needed for large data volumes.
• Proper use of data mining provides strategic competitive advantage.
How Data Mining Works:
• Uses internal and external data to build models identifying patterns.
• Models can be simple linear relationships or complex nonlinear
relationships.
• Patterns can be:
• Associations: Items that co-occur (e.g., products frequently bought together).
• Predictions: Forecasting future events (e.g., weather, sales trends).
• Clusters: Grouping similar entities (e.g., customer segmentation).
• Sequential Relationships: Time-ordered events (e.g., banking
product adoption).
Automation:
• Manual pattern discovery has existed for centuries, but modern large
datasets require automated or semi-automated tools.
• Tasks are classified into prediction, association, and clustering.
Learning Methods:
• Supervised Learning: Uses descriptive attributes + class/output
attribute.
• Unsupervised Learning: Uses descriptive attributes only.
Prediction:
• Involves forecasting future outcomes using data and experience.
• Can be:
• Classification: Predicting discrete labels (e.g., “rainy” or “sunny”).
• Regression: Predicting numeric values (e.g., temperature = 65°F).
Classification:
• Analyzes historical data to generate predictive models.
• Tools include: neural networks, decision trees, logistic regression,
discriminant analysis, support vector machines, genetic
algorithms, and rough sets.
• Goal: Predict future behavior accurately based on patterns learned
from training data.
Clustering, Associations, and Visualization in Data Mining
Clustering:
• Partitions objects/events into natural groups based on similarity.
• Class labels are unknown; clusters are created using heuristic
algorithms.
• Goal: maximize similarity within clusters and minimize similarity
across clusters.
• Common techniques: k-means (statistics) and self-organizing maps
(neural networks).
• Applications: market segmentation, customer targeting, and
identifying patterns in events/objects.
Associations:
• Discovers interesting relationships among variables in large
databases.
• Popular in retail as market-basket analysis.
• Derivatives:
• Link analysis: Finds connections among objects (e.g., web pages,
research authors).
• Sequence mining: Finds relationships based on the order of
occurrence.
• Algorithms: Apriori, FP-Growth, OneR, ZeroR, Eclat.
Visualization and Time-Series Forecasting:
• Visualization: Helps understand relationships; combined with
analytics for visual analytics.
• Time-series forecasting: Uses data collected over time to predict
future values of the same variable.
Data Mining Application Areas:
• Customer Relationship Management
• Banking
• Retailing & Logistics
• Manufacturing & Production
• Brokerage & Securities
• Insurance
• Computer Hardware & Software
• Government & Defense
• Travel Industry
• Healthcare
• Medicine
• Entertainment
• Homeland Security & Law Enforcement
• Sports
The CRISP-DM (Cross-Industry Standard Process for Data
Mining) is the most widely used framework for carrying out data
mining projects.
• It consists of six main steps, followed in a logical but flexible order:
• Business Understanding – Define the goals and requirements of the
project from a business perspective.
• Data Understanding – Collect and explore the data to identify quality
issues and patterns.
• Data Preparation – Clean, transform, and organize the data for
analysis.
• Modeling – Apply data mining techniques (like classification,
clustering, or regression).
• Evaluation – Assess the model’s performance and check if it meets
business goals.
• Deployment – Implement the final model and share insights or
automate decisions.
SEMMA
• SEMMA stands for:
• Sample – Select a representative portion of data.
• Explore – Use visualization and statistics to understand data.
• Modify – Select, clean, and transform important variables.
• Model – Apply machine learning or statistical models.
• Assess – Evaluate model accuracy and usefulness.
SEMMA is iterative—you may return to earlier steps as new insights
emerge.
Difference from CRISP-DM:
• SEMMA focuses more on the technical modeling steps,
• CRISP-DM includes business understanding and deployment,
making it more comprehensive.
KDD (Knowledge Discovery in Databases)
• KDD is a broader concept where data mining is just one step.
Its stages include:
• Data Selection
• Data Preprocessing (Cleaning)
• Data Transformation
• Data Mining (Pattern Extraction)
• Interpretation/Evaluation (Turning patterns into knowledge)
• KDD = Complete discovery process;
Data Mining = Core analytical step within KDD.
• Popularity (as per KDnuggets Survey)
• CRISP-DM is the most widely used data mining methodology.
• SEMMA and KDD are also popular, especially in research and
industry tools.
Several methods are used in data mining — classification, regression,
clustering, and association.
Most tools include more than one algorithm for these methods.
Classification
Classification is a supervised learning method that learns patterns from
labeled data (past examples) to predict the class labels of new data.
• Examples:
• Weather prediction → Sunny / Rainy / Cloudy
• Credit approval → Good / Bad risk
• Fraud detection → Yes / No
• Marketing → Likely customer / Not likely
Difference from Regression:
• If the output is a class label → it’s classification.
• If the output is a numeric value → it’s regression (e.g.,
temperature = 68°F).
Aspect | Classification | Clustering
Learning Type | Supervised (uses input + output labels) | Unsupervised (uses only input data)
Goal | Learn function between features and known classes | Find natural groups in data
Example | Predict spam vs. non-spam emails | Group customers based on buying patterns
Classification Process
• Model Development / Training:
Train the model using data with known class labels.
• Model Testing / Deployment:
Test with unseen data to check accuracy and then deploy for real use.
Model Evaluation Factors
• Predictive Accuracy:
Percentage of test samples correctly classified.
(Most important measure.)
• Speed:
How quickly the model is trained and used.
• Robustness:
Ability to handle noisy or missing data.
• Scalability:
Efficiency with large datasets.
• Interpretability:
How well humans can understand model decisions.
Confusion Matrix and Accuracy Estimation
• In classification problems, the main tool for estimating accuracy is
the confusion matrix (also called a classification matrix or
contingency table).
• It displays the actual vs. predicted classifications.
• Diagonal entries → Correct classifications
• Off-diagonal entries → Misclassifications
Key Performance Metrics
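In the formulas below, TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives in the confusion matrix.
Accuracy → Overall correctness of the model
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision → Out of all predicted positives, how many were actually positive
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall (Sensitivity or True Positive Rate) → Out of all actual positives, how many did the model correctly identify
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1-Score → Harmonic mean of Precision and Recall
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$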
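The four metrics can be computed directly from the confusion-matrix counts. The following is a minimal sketch; the function name `classification_metrics` and the example counts are assumptions made for illustration and are not taken from the slides.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for a two-class problem (100 test samples)
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

With these counts the accuracy is 0.85 while the recall is only 0.80, which shows why accuracy alone can hide weak performance on the positive class.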
Simple Split (Holdout Method)
The Simple Split (also called Holdout or Test Sample Estimation)
divides the dataset into two parts:
• Training Set – Used to build the model.
• Test Set – Used to evaluate model performance.
• Typical Split:
• ⅔ (Two-thirds) → Training data
• ⅓ (One-third) → Testing data
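A simple split can be sketched in a few lines of Python. The helper name `holdout_split`, the fixed random seed, and the toy data are assumptions for illustration only.

```python
import random

def holdout_split(records, train_fraction=2 / 3, seed=42):
    """Shuffle the records and split them into a training list and a test list."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # shuffle a copy, leave the input intact
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))                      # 30 hypothetical record IDs
train, test = holdout_split(data)
print(len(train), len(test))
```

Shuffling before the cut matters: if the records are stored in some order (for example by date or by class), an unshuffled split would give a training set that does not resemble the test set.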
K-Fold Cross-Validation
K-fold cross-validation is used to reduce the bias caused by random sampling of the training and test sets when comparing the accuracy of models.
• Process:
• The entire dataset is divided into k equal parts (folds) — usually
using stratified sampling to keep class balance.
• The model is trained and tested k times:
• Each time, (k−1) folds are used for training.
• The remaining 1 fold is used for testing.
• The final accuracy (Cross-Validation Accuracy, CVA) is the
average of all k accuracy results.
Example:
For 10-fold cross-validation, data is split into 10 parts; each part gets a turn
as the test set while the other 9 are used for training.
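The procedure can be sketched as follows. This is an illustrative version under stated assumptions: the folds are formed by plain shuffling (not stratified sampling), and the "model" is a toy majority-class predictor standing in for a real classifier. All function names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validation_accuracy(X, y, k, fit, predict):
    """Average the test accuracy over k train/test rounds."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Train on k-1 folds, test on the remaining fold
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k

# Toy stand-in classifier: always predicts the majority class of the training labels
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

X = [[i] for i in range(20)]
y = ["good"] * 14 + ["bad"] * 6
cva = cross_validation_accuracy(X, y, k=5, fit=fit_majority, predict=predict_majority)
print(round(cva, 3))
```

Because every record is tested exactly once, the averaged accuracy of the majority-class baseline here equals the share of the majority class (14 of 20), regardless of how the folds are shuffled.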
Other Classification Assessment Methods
• Leave-One-Out (LOO):
• A special case of k-fold where k = number of data points.
• Each instance is used exactly once for testing.
• Uses nearly all the data for training in every round, but is time-consuming on large datasets.
• Bootstrapping:
• A fixed number of instances is sampled with replacement from the original data for training.
• The instances that were not drawn are used for testing.
• The process is repeated several times for stability.
• Jackknifing:
• Similar to leave-one-out: the accuracy estimate is recalculated with one sample left out at each iteration.
• ROC Curve (Receiver Operating Characteristic):
• Plots:
• True Positive Rate (TPR) on the y-axis
• False Positive Rate (FPR = 1 − Specificity) on the x-axis
• The Area Under the Curve (AUC) measures classifier
performance:
• AUC = 1 → Perfect classifier
• AUC = 0.5 → Random performance (no better than chance)
• A model with a higher AUC is considered better.
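To make the ROC curve and AUC concrete, the sketch below builds the (FPR, TPR) points by lowering the decision threshold one score at a time and then integrates with the trapezoidal rule. The labels and scores are hypothetical, and the code assumes that labels are 1 or 0 and that no two scores are tied.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, from the strictest threshold to the loosest.

    Assumes labels are 1 (positive) or 0 (negative) and all scores are distinct.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit instances from the highest score to the lowest
    for label, _score in sorted(zip(labels, scores), key=lambda pair: -pair[1]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]                 # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # hypothetical model scores
curve = roc_points(labels, scores)
print(round(auc(curve), 3))
```

In this example one negative instance (score 0.7) outranks one positive instance (score 0.6), so 8 of the 9 positive/negative pairs are ordered correctly and the AUC is 8/9.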
A variety of algorithms are used to build classification models, including:
• Decision Tree Analysis
One of the most popular machine learning techniques.
Uses a tree-like structure of decisions and outcomes to classify data.
Easy to interpret and visualize.
• Statistical Analysis
Traditional method used before machine learning became dominant.
Examples: Logistic Regression and Discriminant Analysis.
Typical assumptions:
• Linear relationship between inputs and output
• Data are normally distributed
• Variables are independent and not correlated
Because these assumptions often do not hold in real-world data, machine-learning methods are frequently preferred.
• Neural Networks
Powerful machine-learning models inspired by the human brain.
Can capture complex, nonlinear relationships.
Widely used for classification problems such as image or speech
recognition.
• Case-Based Reasoning (CBR)
Classifies new data by comparing it to past cases and finding the most
similar examples.
Learns from experience rather than fixed rules.
• Bayesian Classifiers
Based on probability theory (Bayes’ theorem).
Classifies new instances into the most probable class based on past
data.
Example: Naïve Bayes classifier.
• Genetic Algorithms
Inspired by natural evolution (selection, crossover, mutation).
Uses search-based optimization to evolve classification rules.
• Rough Sets
Deals with uncertainty and partial membership of class labels.
Builds rule-based models when class boundaries are not clear.
Ensemble models combine predictions from multiple models to improve accuracy and robustness and to reduce bias.
Reason: There’s no single best model for all problems; combining
different models often gives better results.
Types of Ensembles:
Homogeneous Ensembles: Combine models of the same type (e.g., decision trees).
• Bagging: Trains each model on a bootstrap sample of the data and combines their predictions (e.g., Random Forest).
• Boosting: Builds models sequentially, increasing the weights of misclassified samples so that later models focus on them (e.g., AdaBoost).
Heterogeneous Ensembles: Combine different model types (e.g., decision tree + neural network + SVM).
• Also called Information Fusion Models.
Example:
Model Type | Strength
Decision Tree | Good at capturing non-linear rules
Neural Network | Learns complex hidden patterns
SVM | Works well with high-dimensional data
Combination Methods:
•Simple Voting: All models contribute equally.
•Weighted Voting: More accurate models get higher weight.
Advantages: Higher accuracy and robustness.
Disadvantage: Increased complexity and reduced interpretability.
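The two combination methods differ only in the weights given to each model. The sketch below is illustrative: the function name `weighted_vote`, the class labels, and the weights are assumptions, and in practice the weights might be derived from each model's validation accuracy.

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Return the class label with the largest total weight.

    predictions: one predicted label per model.
    weights: one weight per model; omitting them gives simple (equal) voting.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Hypothetical outputs of a decision tree, a neural network, and an SVM
preds = ["bad risk", "good risk", "good risk"]
simple = weighted_vote(preds)                        # every model counts equally
weighted = weighted_vote(preds, [0.9, 0.4, 0.4])     # the first model is trusted most
print(simple, "|", weighted)
```

Here simple voting returns "good risk" (two votes to one), while weighted voting returns "bad risk" because the single highly weighted model (0.9) outweighs the other two combined (0.8).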
Decision Trees
• A Decision Tree is a flow-chart–like model used for classification or
prediction. It splits data into smaller groups based on the values of
input attributes to reach a final decision.
• Key Terms
• Attributes: Input features used for making decisions (e.g., Income,
Credit Rating).
• Node: A decision point based on an attribute.
• Branch: Outcome of a decision (Yes/No, High/Low, etc.).
• Leaf Node: Final result/class label (e.g., "High Risk").
How a Decision Tree Works
• Start with all training data at the root node.
• Pick the best attribute to split the data (the one that gives the most
useful separation).
• Split the data into subsets and create branches.
• Repeat splitting on each branch until:
• All samples in a node belong to one class (pure), or
• The node becomes too small to split further.
• Finally, the tree may be pruned to remove unnecessary branches and
improve accuracy on new data.
Types of Splits
• Continuous attributes: Split like Income < 50,000
• Categorical attributes: Split like Gender = Male or Female
Choosing the Best Split
Gini Index (used in CART)
• Measures purity of a node. Lower Gini = purer data.
• Formula:
$$\text{Gini}(S) = 1 - \sum_{j} p_j^2$$
• Where $p_j$ = proportion of class j in dataset S.
• The best attribute is the one with the lowest Gini after splitting.
Information Gain (used in ID3/C4.5)
• Based on Entropy, which measures uncertainty.
• Entropy is zero when data is perfectly pure (all one class).
• $\text{Gain}(A) = \text{Entropy}(S) - \text{(expected entropy after the split on } A\text{)}$
• Higher information gain = better attribute for splitting.
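Both split criteria can be computed from class counts. The sketch below uses the standard entropy definition, the negative sum of $p_j \log_2 p_j$ over the classes, and takes the "expected" impurity after a split as the size-weighted average over the resulting subsets. The function names and the toy labels are assumptions for illustration.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def weighted_impurity(subsets, impurity):
    """Size-weighted average impurity of the subsets produced by a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * impurity(s) for s in subsets)

def information_gain(parent, subsets):
    """Entropy of the parent node minus the expected entropy after the split."""
    return entropy(parent) - weighted_impurity(subsets, entropy)

# Hypothetical node with 5 "yes" and 5 "no", split into two subsets of 5
parent = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

print(gini(parent), round(weighted_impurity(split, gini), 2))
print(round(information_gain(parent, split), 3))
```

In this example the split lowers the Gini index from 0.5 to 0.32 and yields an information gain of about 0.278 bits, so both criteria agree that the split improves purity.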
Cluster Analysis is a data mining technique used to group similar items into
clusters, so that items in the same cluster are more alike, and items in
different clusters are less alike. It helps discover patterns or natural
groupings in data.
• It groups people, objects, or events based on similarity
(e.g., grouping customers by buying behavior).
• It is an exploratory technique—used to uncover hidden patterns that are
not already known.
• Example:
Like grouping students into A, B, C grade ranges, or sorting students into
houses in Harry Potter (Gryffindor, Slytherin, etc.).
• Clustering is widely used in many fields such as:
• Marketing – customer segmentation in CRM
• Fraud detection – identifying unusual transactions
• Biology, Medicine, Genetics, Astronomy
• Image/Character recognition, Anthropology, Social networks
Clustering helps to:
• Identify types or categories (e.g., customer groups)
• Build statistical models for different populations
• Develop rules for assigning new data to clusters
• Understand cluster size and characteristics over time
• Detect outliers or rare events
• Reduce data complexity for other data mining methods
• Cluster analysis can be performed using different approaches:
• Common Methods
• Statistical Methods – e.g., k-means (for numeric data), k-modes (for
categorical data)
• Neural Networks – uses Self-Organizing Maps (SOM) to form clusters
• Fuzzy Logic – e.g., Fuzzy c-means, where a data point can belong to
more than one cluster
• Genetic Algorithms – use evolutionary principles for clustering
• Types of Clustering Techniques
• Clustering can follow two main strategies:
Type | Meaning
Divisive | Start with one big cluster and split it into smaller clusters
Agglomerative | Start with each item as its own cluster and merge them
Distance Measures
• Most clustering techniques measure how similar or different two
items are by calculating distance:
• Euclidean Distance – straight-line distance (like using a ruler)
• Manhattan Distance – distance along grid paths (like a taxi driving
through city blocks)
• Sometimes, weighted distances are used if some features are more
important than others.
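The distance measures above can be written as short functions. This is a minimal sketch; the optional `weights` parameter is one simple way to express feature importance and is an assumption, not a prescribed formula from the slides.

```python
from math import sqrt

def euclidean(a, b, weights=None):
    """Straight-line distance; optional per-feature weights scale each squared gap."""
    weights = weights or [1.0] * len(a)
    return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def manhattan(a, b, weights=None):
    """Grid-path distance; optional per-feature weights scale each absolute gap."""
    weights = weights or [1.0] * len(a)
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))

p, q = (1.0, 2.0), (4.0, 6.0)    # two hypothetical items with two features each
print(euclidean(p, q))           # 5.0
print(manhattan(p, q))           # 7.0
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, because the taxi must follow the grid while the ruler cuts across it.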
K-Means Clustering
k-means is one of the most popular clustering algorithms.
K = number of clusters you want to form.
Steps:
1. Choose k (number of clusters).
2. Pick k initial random points as cluster centers (centroids).
3. Assign each data point to the nearest centroid.
4. Recalculate centroids by finding the average position of points in each cluster.
5. Repeat steps 3 and 4 until the clusters become stable (no change in assignment).
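The k-means steps translate almost line by line into code. The following is an illustrative sketch, not a production implementation: it uses squared Euclidean distance, picks the initial centroids from the data points with a fixed seed, and keeps the old centroid if a cluster becomes empty. The toy points are hypothetical.

```python
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster tuples of numbers into k groups; return (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # pick k random points as initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest centroid (squared Euclidean distance)
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:   # stable: no point changed cluster
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = k_means(points, k=2)
print(assignment)
print(sorted(centroids))
```

On data this well separated the result does not depend on the starting points, but in general k-means can converge to different clusterings for different initial centroids, so it is commonly run several times.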
The Apriori Algorithm is a popular data mining method used to find
association rules—that is, to discover items that frequently occur
together in transactions (e.g., items bought together in a supermarket).
• Apriori works by:
• Finding frequent itemsets (groups of items purchased together)
• Using a minimum support value to decide if an itemset is frequent
• Building larger itemsets step-by-step, only from previously frequent
smaller itemsets
(this avoids unnecessary combinations)
• This step-by-step expansion is called candidate generation
How Apriori Works
• Start with 1-item sets: Count how often each item appears (support).
• Keep only those with minimum support (frequent items).
• Generate 2-item sets using only frequent 1-item sets.
• Count their support and keep only frequent ones.
• Generate 3-item sets using frequent 2-item sets.
• Repeat until no more frequent itemsets can be formed.
• Apriori stops when no new frequent itemset meets minimum support.
Example (Simple Grocery Case)
• Transactions show items bought together.
Suppose minimum support = 3 (items must appear in at least 3 out of 6
transactions).
• Step 1: All 1-item sets had support ≥ 3 → all kept
• Step 2: Generate 2-item sets and count support
• Example: {1,3} had support < 3 → discarded
• Step 3: Only frequent 2-item sets are used to form 3-item sets
• This pruning saves time by eliminating combinations that can never
be frequent.
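The level-wise search can be sketched as below. The six transactions are an illustrative dataset chosen so that one 2-item set falls below a minimum support of 3; they are an assumption and not necessarily the table used in the original slides. The function name `apriori` is likewise hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the itemset
        return sum(itemset <= t for t in transactions)

    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    frequent = {}
    size = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        size += 1
        # Candidate generation: join frequent itemsets of the previous size
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Pruning: every (size-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Illustrative transactions; items are identified by number
transactions = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4}]
frequent = apriori(transactions, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With this data {1,3} appears in only two transactions and is discarded, so the candidates {1,2,3} and {1,3,4} are pruned without ever being counted. Only {1,2,4} and {2,3,4} survive as frequent 3-item sets.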
Prescriptive Analytics is the stage of analytics that recommends actions
or decisions based on insights gained from descriptive and predictive
analytics.
It answers the question: “What should we do?”
• We first analyze past data (descriptive analytics)
• Then predict future outcomes (predictive analytics)
• Finally, we decide the best action to achieve desired results → this is
prescriptive analytics
• It helps decision makers act in a way that maximizes benefits,
reduces costs, and improves results.
Examples of Prescriptive Analytics in Action
• Prescriptive analytics can guide decisions such as:
• Targeting customers with the best offers to increase sales or profit
• Retaining customers with the right promotions to stop them from
switching
• Vendor/contract decisions to minimize cost while meeting
requirements
• Marketing campaign planning to select the best audience within
budget
• Ad spending for paid search keywords to maximize ROI
• Staff scheduling based on predicted customer demand
• Warehouse location planning to reduce supply chain costs
• Delivery route optimization to lower travel time and fuel cost
Benefits of Prescriptive Analytics
• Provides data-backed decision making instead of relying only on intuition
• Improves accuracy, profitability, and efficiency
• Helps justify decisions with solid reasoning and models
• Often results in huge financial gains
• Example: A shipping company used a prescriptive model called
TurboRouter and increased profits by $1–2 million in just three
weeks by improving fleet scheduling.
Identification of the Problem & Environmental Analysis
• Decisions are not taken in isolation. Before making a decision, it is
important to study:
• The scope of the domain – where and how the issue exists
• Environmental forces and dynamics – internal and external factors
influencing the situation
• Organizational culture & decision-making structure –
• Who takes decisions?
• Is the organization centralized or decentralized?
• Many times, environmental factors are responsible for the current
problem.
This study process is called Environmental Scanning and Analysis,
which involves:
• Monitoring
• Scanning
• Collecting information
• Interpreting the findings
• BI/BA (Business Intelligence / Business Analytics) tools assist in
scanning data and identifying problems.
For effective modeling, everyone involved must have a common
understanding of the problem, because the model represents the
problem. If the problem is unclear, the model cannot support correct
decision-making.
Variable Identification
• Identifying the right variables for a model is a crucial step.
These variables may include:
• Decision Variables
• Result/Outcome Variables
• Uncontrollable Variables
• Along with the variables, the relationships among them must also be
clearly understood.
• Two tools help in identifying variables:
Tool | Purpose
Influence Diagram | Graphical representation of variables and their relationships in a model
Cognitive Map | A more detailed form of an influence diagram that helps understand variables and their interactions
Forecasting (Predictive Analytics)
• To use Prescriptive Analytics, one must first know:
• What has already happened (Past Data)
• What is likely to happen (Future Prediction)
• Forecasting uses Predictive Analytics to estimate future outcomes so
that better decisions can be made.
• Decisions affect the future, not the past.
• There is no use conducting “what-if analysis” on past situations, because decisions made now cannot change the past.
• Online business and digital communication have increased the need
for faster and more accurate forecasting
Forecasting helps in:
• Predicting customer demand
• Analyzing product life cycles
• Understanding market conditions and consumer behavior
Model Construction Techniques
Models can be built for static or dynamic situations under certainty,
uncertainty, or risk.
To speed up model building, Decision Analysis Systems with built-in
modeling support are used, such as:
•Spreadsheets
•Data Mining Systems
•OLAP Systems
•Modeling Languages
Model Management
• Just like data, models must be stored, organized, maintained, and
updated to stay relevant and valid.
This is done using Model Base Management Systems (MBMS), similar to how a DBMS manages data.
• Purpose:
Maintain model integrity
Ensure models remain applicable and usable
Knowledge-Based Modeling
• Decision Support Systems (DSS) mainly use quantitative models
(number-based).
• Expert Systems use qualitative, knowledge-based models (rule-
based, experience-based).
• Some knowledge is necessary for building models that can actually be
solved and used.
Predictive analytics techniques like classification and clustering also
support the development of knowledge-based models.
Current Trends in Modeling
• Model & Solution Libraries available online for use or download
(e.g., NEOS Optimization Server, INFORMS resources)
• Increasing use of cloud-based tools for modeling, optimization, and
simulation
• Though tools simplify modeling, practical understanding is
essential for effective application
• Widely used in revenue management, CRM, retail, insurance,
entertainment, etc.
• Large models require data warehouses and parallel computing for
faster results
• Trend toward transparent models for decision makers (e.g., OLAP
and multidimensional analysis)
• Influence diagrams act as a “model of a model” and some tools can
build and solve models from them
1. Components of a Quantitative Model:
All mathematical models have four main components:
• Decision Variables: Factors controlled by the decision maker (e.g.,
investment amount, production level).
• Uncontrollable Variables (Parameters): External factors beyond
control (e.g., interest rate, tax rules, competition).
• Intermediate Result Variables: Indicate partial or intermediate
outcomes (e.g., spoilage, employee satisfaction).
• Result (Outcome) Variables: Final performance indicators or goals
(e.g., profit, cost, market share).
2. Decision Variables:
They represent choices or actions available to the decision maker (e.g.,
scheduling people, setting budgets).
3. Uncontrollable Variables / Parameters:
These are environmental factors that affect results but cannot be
changed by the decision maker (e.g., regulations, inflation, customer
income).
4. Intermediate Result Variables:
Show intermediate effects that lead to final outcomes (e.g., employee
satisfaction → productivity → profit).
5. Result (Outcome) Variables:
Show how well goals are achieved — they are dependent variables
affected by decisions and uncontrollable factors.
6. Structure of Mathematical Models:
• Models are expressed using mathematical equations linking
variables.
• Example:
• Profit model: P = R – C
• Present Value model: P = F / (1 + i)^n
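Both example models are single equations, so they can be evaluated directly. In the sketch below P is the result variable, and the revenue, cost, future value, interest rate, and number of periods are hypothetical inputs chosen for illustration.

```python
def profit(revenue, cost):
    """Profit model: P = R - C."""
    return revenue - cost

def present_value(future_value, rate, periods):
    """Present value model: P = F / (1 + i)^n."""
    return future_value / (1 + rate) ** periods

# Hypothetical inputs
p = profit(revenue=120_000, cost=85_000)
pv = present_value(future_value=100_000, rate=0.10, periods=5)
print(p)                # 35000
print(round(pv, 2))     # 62092.13
```

In the present value model the interest rate and the number of periods act as uncontrollable parameters: receiving 100,000 five years from now at a 10% rate is worth about 62,092 today.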
Certainty, Uncertainty, and Risk in Decision-Making
Decision-making involves evaluating alternatives and predicting their
outcomes. The knowledge about these outcomes falls into three
categories:
1. Decision-Making under Certainty
• Complete knowledge of outcomes is assumed.
• Each alternative has a single known outcome.
• Models are deterministic and easy to solve.
• Common in structured problems or short-term decisions (e.g.,
investing in U.S. Treasury bills).
• Advantage: Can yield optimal solutions.
2. Decision-Making under Uncertainty
• Several outcomes are possible, but probabilities are unknown.
• More difficult than certainty due to insufficient information.
• Managers often try to reduce uncertainty by gathering more
information or approximating risk.
• Models must consider the decision maker’s attitude toward risk.
3. Decision-Making under Risk
• Several outcomes are possible, with known or estimated
probabilities.
• Allows calculation of expected values for each alternative.
• Decision-makers can assess calculated risk.
• Most major business decisions fall into this category
(probabilistic/stochastic).
Summary:
• Certainty → a single known outcome for each alternative
• Risk → several possible outcomes, with known or estimated probabilities
• Uncertainty → several possible outcomes, probabilities unknown
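Under risk, each alternative can be compared by its expected value, the probability-weighted sum of its payoffs. The sketch below is illustrative: the two alternatives, their probabilities, and their payoffs are hypothetical numbers, and choosing purely by expected value assumes a risk-neutral decision maker.

```python
def expected_value(outcomes):
    """outcomes: list of (probability, payoff) pairs for one alternative."""
    assert abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * payoff for p, payoff in outcomes)

# Hypothetical investment alternatives
alternatives = {
    "bonds": [(1.0, 5_000)],                                   # certainty: one outcome
    "stocks": [(0.5, 20_000), (0.3, 4_000), (0.2, -10_000)],   # risk: several outcomes
}
ev = {name: expected_value(outcomes) for name, outcomes in alternatives.items()}
best = max(ev, key=ev.get)
print(ev, best)
```

Here stocks have the higher expected value (9,200 versus 5,000) but also a 20% chance of a loss, which is why the decision maker's attitude toward risk still matters.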
Spreadsheets as Modeling Tools:
•Widely used for developing models in business, engineering,
mathematics, and science.
•Easy-to-use, flexible, and accessible for end users.
•Capabilities of Spreadsheets:
•Built-in statistical, forecasting, mathematical, and database
functions.
•Can perform tasks like linear programming, regression analysis, and
simulation.
Add-Ins for Advanced Modeling:
• Solver: Linear and nonlinear optimization
• What’sBest!: Optimization (version of Lindo)
• NeuralTools: Artificial neural networks
• Evolver: Genetic algorithms
• @RISK: Simulation studies
• Many add-ins are available free or at low cost, and new ones are
regularly introduced.
Popularity:
• Spreadsheets remain the most popular end-user modeling tool
due to their power and versatility.
• Used widely for analysis, planning, and decision support.
Thank You:)