ADVANCED DATA ANALYTICS
UE23AM343AB1
UNIT-2
Preparing Data For Machine Learning
Tasks – Data Cleaning
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering
in AI & ML
Advanced Data Analytics
Unit 2
Preparing Data for Machine Learning Tasks -1
Slides excerpted from: Data Mining : Concepts and
Techniques by Han, Kamber and Pei, 3rd Edition
With grateful thanks for contribution of slides to:
Dr. Gowri Srinivasa, Professor at the Department of CSE,
PESU
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering
ADVANCED DATA ANALYTICS
Data Preparation for machine learning tasks
• A predictive modelling project will typically have steps as below:
• Define problem
• Prepare Data
• Analyse Data
• Evaluate model
• Finalize model
• The step of Data Preparation typically involves the following steps
• Data Cleaning: Identifying and correcting mistakes or errors in the data.
• Feature Selection: Identifying those input variables that are most relevant
to the task.
• Data Transforms: Changing the scale or distribution of variables.
• Feature Engineering: Deriving new variables from available data.
• Dimensionality Reduction: Creating compact projections of the data.
ADVANCED DATA ANALYTICS
Data Preparation Without Data Leakage
• Data preparation is the process of transforming raw data into a form
that is appropriate for modeling.
• A naive approach to preparing data applies the transform on the
entire dataset before evaluating the performance of the model.
• This results in a problem referred to as data leakage, where
knowledge of the hold-out test set leaks into the dataset used to train
the model.
• This can result in an incorrect estimate of model performance when
making predictions on new data.
• Data preparation must be fit on the training set only, and then applied
unchanged to the test set, in order to avoid data leakage.
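A minimal sketch of leakage-free preparation, assuming a single numeric feature that has already been split into train and test. The helper names `fit_standardizer` and `apply_standardizer` and the data values are hypothetical, chosen only for illustration.

```python
import statistics

def fit_standardizer(train_values):
    """Learn the mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std

def apply_standardizer(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

# Hypothetical feature values, already split into train and test.
train = [10.0, 12.0, 14.0, 16.0]
test = [18.0, 40.0]

mean, std = fit_standardizer(train)                 # statistics come from train only
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)   # test reuses the train statistics
```

The naive alternative would compute `mean` and `std` over `train + test`, letting the large test value 40.0 influence how the training data is scaled.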
ADVANCED DATA ANALYTICS
Why Data Cleaning ?
• Analysis on data can only be as good as the data
itself. Low quality data will lead to low quality
analysis.
• Real world databases are highly susceptible to
noisy , missing and inconsistent data owing to their
huge size and multiple heterogeneous sources.
ADVANCED DATA ANALYTICS
Measures of Data Quality
1) Accuracy : Data must not contain errors or a lot of noise.
Example of inaccurate data : Date = 30/02/2002 (February never has 30 days).
Reasons for inaccurate data :
• Data collection instruments may be faulty.
• Human errors occur during data entry.
• Disguised missing data: Users may purposefully submit incorrect data
values for mandatory fields when they don’t want to share their
personal information. Example : Choosing the default value of January
1st for date of birth.
ADVANCED DATA ANALYTICS
Measures of Data Quality
2) Completeness : Data must not lack attribute values. It must contain attributes
of interest and relevance to the problem at hand.
Reasons for incompleteness :
• Attributes of interest were not considered important at the time of entry.
• Data might not be recorded due to equipment malfunction resulting in
missing data.
3) Consistency : Should not contain any discrepancies in the data or the naming
convention of the attributes.
Examples of inconsistency:
• Age is recorded as 50 but Date of Birth = 03/04/2005.
• In the result column of students’ marks , few entries are in GPA format and
rest in percentage.
• Discrepancies can exist between duplicate records.
ADVANCED DATA ANALYTICS
Measures of Data Quality
4) Timeliness : The data must be updated in a timely fashion. For example , for
an analysis run on the first day of every month, the previous month’s data must
be up to date for accurate analysis.
5) Interpretability : The data must be easily understood. If the attributes of the
data aren’t easily understandable , the analysis is going to be hindered.
6) Believability : The data and its source must be trusted by the users. If this
data or the source caused problems in the past , current users will find it
hard to trust it.
NOTE : The quality of data is subjective and depends on the intended use of
data. The data needs of each problem are different.
ADVANCED DATA ANALYTICS
Data Cleaning
Data cleaning entails :
• Filling in missing values
• Smoothing noisy data
• Identifying and removing outliers
If users believe the data is dirty, they are
unlikely to trust the outcome of the analysis.
ADVANCED DATA ANALYTICS
Noisy data
Noise is a random error or variance in a measured variable.
Data smoothing techniques to combat noise :
• Binning
▪ Sort the data and partition into bins(equal-width, equal-frequency, etc.)
▪ Smooth by bin means, bin medians, by bin boundaries etc.
▪ More on binning in further lectures.
• Regression - Data can be smoothed by fitting it to a regression model.
• Clustering - Outliers can be detected with the help of clustering and can be
removed to smooth the data.
• Combined computer and human inspection – The computer detects suspicious
values, which are then validated by a human. This is useful when dealing with
possible outliers.
ADVANCED DATA ANALYTICS
Outliers
Outliers are data objects with characteristics that are
considerably different than most of the other data
objects in the data set.
• Can use simple univariate statistics like standard
deviation and interquartile range to identify and
remove outliers from a data sample.
Case 1 : Outliers are noise that interferes with data
analysis.
Case 2 : Outliers are the main goal of our analysis.
Examples : Credit card fraud, Spam detection,
Intrusion detection
ADVANCED DATA ANALYTICS
Typical Data Cleaning in Code
• Identify columns or rows that have very low variance ( very few unique
values). They may not be effective features and can be removed
• Outlier removal with the IQR: The IQR can be used to identify
outliers by defining limits on the sample values that are a factor k of the
IQR below the 25th percentile or above the 75th percentile. A typical value
is k = 1.5.
• Outlier removal with a one-class classifier: A one-class classifier aims at
capturing the characteristics of the training instances in order to be able to
distinguish between them and potential outliers that appear later.
The local outlier factor, or LOF for short, is a technique that
attempts to harness the idea of nearest neighbors for outlier detection.
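The IQR rule above can be sketched with the standard library. The function name `iqr_outliers` and the sample data are hypothetical; the quartiles use the "inclusive" interpolation method of `statistics.quantiles`, which is one of several common conventions, so the limits may differ slightly with other tools.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower, upper, outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical prices with one suspiciously large value appended.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)  # -4.5 47.5 [120]
```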
ADVANCED DATA ANALYTICS
Missing or incomplete Data
Data is not always complete. Missing data may be due to :
• Equipment malfunction
• Inconsistent with other recorded data leading to its deletion
• Data not recorded due to a misunderstanding
• Certain data may not be considered useful at the time of entry
Missing data may need to be inferred.
ADVANCED DATA ANALYTICS
Types of Missing Values
1. Missing Completely At Random (MCAR)
• The missing data is independent of the observed and unobserved data.
• example : A weighing scale running out of batteries. This is not dependent on the person and the
probability of this happening is the same for everyone.
• example : when we take a random sample of a population, where each member has the same chance
of being included in the sample. The (unobserved) data of members in the population that were not
included in the sample are MCAR
• If the probability of being missing is the same for all cases, then the data are said to be missing
completely at random (MCAR). This effectively implies that the causes of the missing data are
unrelated to the data. MCAR data doesn’t add bias to the analysis.
• Assuming the data as MCAR is a strong and often unrealistic assumption as “true randomness” is rare
in the real world.
• Ways to deal with it :
▪ Delete the records : If it is a small fraction of data
▪ Delete the attributes : If it is a small fraction of attributes
▪ Mean imputation
▪ Pairwise deletion : Compute each statistic (mean, variance, covariance) from all the cases in which the variables it needs are observed.
ADVANCED DATA ANALYTICS
Types of Missing Values
2. Missing At Random (MAR)
• Example: Employed people are less likely to answer all questions of a survey when
compared to unemployed people. Data is MAR if the likelihood of completing the survey is
dependent on the employment status but not on the topic of the survey.
• Example: In a survey, you recorded emotion and many values are null. On digging, you find
that the nulls are mainly for men!
• Example: men are more likely to tell you their weight or age than women: weight or age is
MAR.
• The probability of missing is the same only within groups defined by
the observed data. MAR assumes that the missing value can be predicted based on the
other observed data. The missingness is still random.
• Almost always produces a bias in the analysis if the missingness is simply ignored.
• MCAR implies MAR but the converse isn’t true.
• Ways to deal with it : Regression imputation: unbiased if it considers the factor which
influences the missingness; Last observation carried forward (LOCF) and Baseline
observation carried forward(BOCF) : Yields biased estimates. Must be used only if the
underlying assumptions are scientifically justifiable;
ADVANCED DATA ANALYTICS
Types of Missing Values
3. Missing Not At Random (MNAR)
• The missingness of the data depends on the value of the data. The mechanism for why the
data is missing is known. Yet , the values can’t be effectively inferred.
• MNAR means that the probability of being missing varies for reasons that are
unknown to us
• Hardest to address
• Examples :
▪ Censored data
▪ People belonging to certain income brackets might not wish to disclose their assets.
▪ A weighing machine can only measure weights in a particular range.
▪ Two or more variables show the same missingness pattern. Then it requires modeling
• Ways to deal with this :
▪ One must model the missingness explicitly, jointly modeling the response and
missingness.
▪ Generally , the data is assumed to be MAR whenever feasible to avoid this situation.
NOTE: There is no statistical way to determine which category your missing data
falls under.
ADVANCED DATA ANALYTICS
Types of Missing Values-A Quick Glance
Missing Completely at Random, MCAR : There is nothing systematic going on that
makes some data more likely to be missing than others. We may consequently
ignore many of the complexities that arise because data are missing.
Missing at Random, MAR, means there is a systematic relationship between the
propensity of missing values and the observed data, but not the missing data. Modern
missing data methods generally start from the MAR assumption.
Missing Not at Random, MNAR, means there is a relationship between the propensity
of a value to be missing and its values.
ADVANCED DATA ANALYTICS
An Interesting Thought
• Imagine you are collecting some information from your classmates. For
many reasons, not everyone will answer every question of yours. And that
is okay!
• Well the next step is replacing missing values right? We can use any one of
the methods we have discussed till now after some analysis of the data.
• But wait! Don’t you think the fact that they did not answer is some kind of
information per se that can be beneficial to our analysis?
• So the next time you build a model , before dealing with the missing values
, create an additional variable (preferably a binary variable ) in which you
store if the particular student answered or not.
• This may (or may not! ) help you gain more insights about the population
or improve the analytics model you are building!
[Link]
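A minimal sketch of the indicator-variable idea, assuming a hypothetical list of survey answers in which `None` marks a skipped question. Mean imputation is used afterwards only as an example of "dealing with the missing values".

```python
# Hypothetical survey answers; None marks a question the student skipped.
ages = [21, None, 22, 20, None, 23]

# Binary indicator: 1 if the value was missing, 0 if it was answered.
# It is built BEFORE imputation so the information is not lost.
age_missing = [1 if a is None else 0 for a in ages]

# Afterwards the gaps can be filled, here with the mean of the observed values.
observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)
ages_imputed = [mean_age if a is None else a for a in ages]

print(age_missing)   # [0, 1, 0, 0, 1, 0]
print(ages_imputed)  # [21, 21.5, 22, 20, 21.5, 23]
```

Both `ages_imputed` and `age_missing` can then be passed to the model as separate features.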
ADVANCED DATA ANALYTICS
Handling missing data
1. Ignore the tuple : Usually done when the class label is missing (for a
classification task). This is not effective when the percentage of missing values
per attribute varies considerably
2. Fill in the missing value manually : Time consuming and infeasible for a large
data set.
3. Fill it with a global constant : Replace it with a global constant like the word
“Unknown”. The downside is that the model might learn patterns with respect
to the occurrence of the word “Unknown”.
4. Fill it with a central tendency : For symmetric data distributions , replace it
with the mean and for skewed data distributions, replace it with the median.
5. A smarter way is to use the attribute mean or median (based on the
distribution) of all samples belonging to the same class as the given tuple.
6. Use most probable model : Use models like regression , decision tree or
inference-based Bayesian formalism to infer the missing value.
ADVANCED DATA ANALYTICS
Missing Data Handling in code – statistical Imputation
• It is common to identify missing values in a dataset and replace them with
a numeric value. This is called data imputing, or missing data imputation.
• Statistical Imputation - A simple and popular approach to data
imputation involves using statistical methods to estimate a value for a
column from those values that are present, then replace all missing values
in the column with the calculated statistic.
• The column mean value.
• The column median value.
• The column mode value.
• A constant value.
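The four statistics listed above can be sketched in one small function. The name `impute`, its parameters, and the income values are hypothetical; a real project would normally rely on a library imputer and fit it on the training split only.

```python
import statistics

def impute(column, strategy="mean", fill_value=None):
    """Replace None entries in a column with a statistic of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        stat = statistics.fmean(observed)
    elif strategy == "median":
        stat = statistics.median(observed)
    elif strategy == "mode":
        stat = statistics.mode(observed)
    elif strategy == "constant":
        stat = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [stat if v is None else v for v in column]

# Hypothetical skewed income column (in thousands) with two missing entries.
income = [30, 32, None, 35, 32, None, 200]

print(impute(income, "mean")[2])         # 65.8, pulled up by the value 200
print(impute(income, "median")[2])       # 32
print(impute(income, "mode")[2])         # 32
print(impute(income, "constant", 0)[2])  # 0
```

The example also shows why the median is usually preferred for skewed columns: the single large value drags the mean far from the typical income.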
ADVANCED DATA ANALYTICS
Missing Data Handling in code – Iterative and KNN Imputation
• KNN Imputation - If input variables are numeric, then regression models can be
used for prediction, and this case is quite common. A range of different models
can be used, although a simple k-nearest neighbor (KNN) model has proven to
be effective in experiments. The use of a KNN model to predict or fill missing
values is referred to as Nearest Neighbor Imputation or KNN imputation.
• Iterative Imputation - One approach to imputing missing values is to use an
iterative imputation model. Iterative imputation refers to a process where each
feature is modeled as a function of the other features, e.g. a regression
problem where missing values are predicted. Each feature is imputed
sequentially, one after the other, allowing prior imputed values to be used as
part of a model in predicting subsequent features. This methodology is
attractive if the multivariate distribution is a reasonable description of the data.
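A rough sketch of KNN imputation in plain Python. The function `knn_impute`, the distance definition (root mean squared difference over the columns both rows have observed), and the data are assumptions made for illustration; library implementations differ in details such as weighting and scaling.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    def distance(a, b):
        # Compare only the columns that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where column j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Hypothetical records; the last row is missing its third attribute.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [8.0, 9.0, 50.0],
    [1.05, 2.05, None],
]
filled = knn_impute(rows, k=2)
print(filled[3][2])  # 11.0, the mean of the two closest rows' values
```

Iterative imputation follows the same pattern but replaces the neighbor average with a regression model per feature, cycling over the features until the filled values stabilize.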
ADVANCED DATA ANALYTICS
Test your understanding!
1. Which of these is not a method to deal with noisy data?
a) Binning
b) Regression
c) Principal Component Analysis
d) Clustering
Solution
c) Principal Component Analysis
2. Outliers need to be removed in every dataset , regardless of the problem
statement.
Solution
False
3. Mean imputation can be done for which type of missing data?
Solution
MCAR
ADVANCED DATA ANALYTICS
Test your understanding!
• The statement “Most of the people missing from work are the sickest people” denotes
what type of missingness?
MNAR
• Which type of missingness is called “non-ignorable”?
MNAR
Because the missing data mechanism itself has to be modelled as you deal with the
missing data. You have to include some model for why the data are missing and what
the likely values are.
ADVANCED DATA ANALYTICS
References
• Data Mining : Concepts and Techniques by Han, Kamber and Pei , The
Morgan Kaufmann Series in Data Management Systems ,3rd Edition
Chapter : 3.1-3.2
• [Link]
• [Link]
• [Link]
THANK YOU
Dr. Bhaskarjyoti Das
Professor, Department of Computer Science and
Engineering in AI & ML , PES University, Bengaluru
Email: bhaskarjyotidas@[Link]
ADVANCED DATA ANALYTICS
UE23AM343AB1
UNIT-2
Preparing Data for ML –
Data Integration and Reduction
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering in
AI & ML
Advanced Data Analytics
Unit 1
Lecture 6 : Data Preprocessing – Data Integration and Reduction
Slides excerpted from: Data Mining : Concepts and
Techniques by Han, Kamber and Pei, 3rd Edition
With grateful thanks for contribution of slides to:
Dr. Gowri Srinivasa, Professor at the Department of CSE,
PESU
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering
DATA ANALYTICS
Data Integration
• Data analysis often requires data integration – the merging of data from
multiple data stores into a coherent store.
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting dataset. This can help improve the accuracy
and speed of the subsequent data analysis process.
• The semantic heterogeneity and structure of data pose great challenges in
data integration.
• How can we match schema and objects from different sources?
• Schema Integration!
• Example : How can a data analyst be sure that the attribute
customer_id in table A and customer_number in table B refer to the
same attribute?
• With the help of metadata! It provides all possible information
regarding the attributes , thus ensuring error free schema integration.
ADVANCED DATA ANALYTICS
Data Integration
• Entity identification problem : Identify real world entities from multiple
data sources. Example : Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources
are different.
• Possible reasons : different representations, different scales, example –
metric vs British units
• During integration , special attention must be paid to the structure of the
data. This is to ensure that any attribute functional dependencies and
referential constraints in the source system match those in the target
system. For example , in one system, a discount may be applied to the
entire order whereas in another system , it is applied to each individual
line item. If this is not caught before integration, items in the target system
may be improperly discounted.
ADVANCED DATA ANALYTICS
Data Value Conflict Detection and Resolution
• Data integration also involves the detection and resolution of data
value conflicts.
• For example, for the same real-world entity, attribute values from
different sources may differ.
• This may be due to differences in representation, scaling, or encoding.
• For instance, a weight attribute may be stored in metric units in one
system and British imperial units in another.
ADVANCED DATA ANALYTICS
Redundancy in Data Integration
Redundant data often occur during the integration of multiple databases.
• Object identification : The same attribute or object may have different
names in different databases which causes redundancy.
• Derivable data : An attribute may be redundant if it can be derived from
another attribute or set of attributes. For example , annual revenue can be
derived from monthly revenue.
• Even otherwise, some redundancies can be detected by correlation analysis.
• For nominal data , the χ2 (chi-square) test is employed.
• For numeric data , the correlation coefficient and covariance are used.
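Both checks can be sketched in a few lines. The functions `pearson` and `chi_square` and all the numbers are illustrative assumptions: the revenue columns mimic a derivable attribute, and the 2x2 table is a made-up contingency table of two nominal attributes.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two numeric attributes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def chi_square(table):
    """Chi-square statistic of a contingency table (list of rows of counts)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Numeric case: annual revenue is derivable from monthly revenue.
monthly = [10, 12, 9, 15]
annual = [12 * m for m in monthly]
r = pearson(monthly, annual)          # correlation of 1: redundant attribute

# Nominal case: hypothetical counts for two categorical attributes.
table = [[250, 200],
         [50, 1000]]
chi2 = chi_square(table)              # about 507.94
```

A correlation close to +1 or -1, or a chi-square value far above the critical value for the table's degrees of freedom, suggests that one of the two attributes may be redundant.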
ADVANCED DATA ANALYTICS
Tuple Duplication
• In addition to detecting redundancies between attributes, duplication should be
detected at the tuple level (Example, where there are two or more identical
tuples for a unique data entry case)
• The use of denormalized tables (often done to improve performance by avoiding
joins ) is another source of data redundancy.
• Inconsistencies often arise between various duplicates, due to inaccurate data
entry or updating some but not all data occurrences.
ADVANCED DATA ANALYTICS
Data Reduction
• Data reduction techniques are applied to obtain a reduced representation of the
dataset that is much smaller in volume , yet closely maintains the integrity of the
original data.
• Analysis on the reduced dataset should be more efficient yet produce the same
or almost the same analytical results.
• Why do we need data reduction? A database or a data warehouse may store
terabytes of data. Complex data analysis may take a very long time to run on the
complete data set.
ADVANCED DATA ANALYTICS
Data Reduction Strategies
• Dimensionality reduction – process of removing unimportant attributes
• Wavelet transforms, Principal Component Analysis (PCA), Attribute subset selection
• Numerosity reduction – replaces the original data volume by alternative, smaller
forms of data representation
• Parametric - data is represented using some model. The model is used to estimate
the data, so that only parameters of data are required to be stored. Example -
Regression and log-linear models
• Nonparametric – Histograms (binning + KDE to get an approx. representation),
clustering (cluster representation) and sampling, Data cube aggregation (moving the
data from detailed level to a fewer number of dimensions enough for analysis)
• Data compression – transformations are applied to the data to obtain a reduced or a
compressed representation of the original data. Example - embedding
ADVANCED DATA ANALYTICS
Curse of dimensionality
• Example – 1 (density and distance issue)
• When dimensionality increases, data becomes increasingly sparse.
• A circle inside a square fills \( \pi r^2 / (2r)^2 = \pi/4 \approx 78.5\% \) of the area.
• A sphere inside a cube fills \( \frac{4}{3}\pi r^3 / (2r)^3 = \pi/6 \approx 52.4\% \) of the volume.
• This fraction shrinks rapidly, to about 0.25% for just 10 dimensions! What will happen in
1000 dimensions, which is common? Distance loses its meaning! A higher dimension means
more sparsity.
• Density and distance between points, critical to clustering and outlier analysis, become less
meaningful.
• Example – 2 (Insufficient training data)
• I have 80 samples. Is that sufficient with respect to the size of the sample space? (a) 1-dimensional data with 100
possible values (b) 100-dimensional data with 100 possible values per dimension. Implication : The
number of possible combinations grows exponentially with the number of dimensions.
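The percentages in Example 1 can be reproduced with the closed-form volume of a d-dimensional ball. The function name `inscribed_ball_fraction` is hypothetical; the formula \( V_d = \pi^{d/2} / \Gamma(d/2 + 1) \) for the unit ball is standard.

```python
import math

def inscribed_ball_fraction(d):
    """Fraction of a d-dimensional hypercube filled by its inscribed ball."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # volume of the radius-1 ball
    cube = 2.0 ** d                                    # volume of the side-2 cube
    return ball / cube

for d in (2, 3, 10):
    print(d, round(100 * inscribed_ball_fraction(d), 2), "%")
# 2 78.54 %
# 3 52.36 %
# 10 0.25 %
```

Almost all of the cube's volume ends up in its corners, far from the center, which is one way to see why distances become less informative as the number of dimensions grows.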
ADVANCED DATA ANALYTICS
Dimensionality Reduction
• Dimensionality reduction
• Avoids the curse of dimensionality.
• Helps to eliminate irrelevant attributes and reduce noise.
• Reduces time and space required for data analytics.
• Enables easier visualization.
• Wavelet transforms, Principal Component Analysis (PCA), Attribute subset
selection
ADVANCED DATA ANALYTICS
Principal Component Analysis (PCA)
What is PCA?
Assume there are 50 questions in all in the survey. The following three are among
them:
1.I feel comfortable around people
2.I easily make friends
3.I like going out
These queries could appear different now. There is a catch, though. They aren’t,
generally speaking. They all gauge how extroverted you are. Therefore, combining
them makes it logical, right? That’s where linear algebra and dimensionality reduction
methods come in! We want to lessen the complexity of the problem by minimizing
the number of variables since we have much too many variables that aren’t all that
different. That is the main idea behind dimensionality reduction. And it just so
happens that PCA is one of the most straightforward and popular techniques in this
field.
ADVANCED DATA ANALYTICS
Principal Component Analysis (PCA)
1. Each data vector lies in a multi-dimensional space. We want to project the
vectors onto an optimal direction so that the projections retain as much of
the spread of the data as possible.
2. That is, we look for a direction such that the variance of the projections
is maximized. PCA solves this maximization problem.
3. The problem is solved with a Lagrange multiplier (the direction is
constrained to have unit length). Setting the first derivative of the
Lagrangian to zero shows that the solution is an eigenvector of the
covariance matrix.
4. With a little more manipulation, the variance along that direction equals
the corresponding eigenvalue, so the solution is the eigenvector with the
largest eigenvalue.
ADVANCED DATA ANALYTICS
Principal Component Analysis (PCA)
• Finds a projection that captures the largest amount of variation in data.
• The original data is projected onto much smaller space , resulting in
dimensionality reduction.
• We find eigenvectors of the covariance matrix and these eigenvectors define
the new space.
[Figure: plot with axes x1 and x2]
ADVANCED DATA ANALYTICS
PCA - Steps
Step 1: Standardize the dataset.
Step 2: Calculate the covariance matrix for the features in the dataset.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.
Step 4: Sort eigenvalues and their corresponding eigenvectors.
Step 5: Pick k eigenvalues and form a matrix of eigenvectors.
Step 6: Transform the original matrix
Let's go step by step
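The six steps can be sketched with NumPy as follows. This is an illustrative sketch rather than the worked example from the slides: the function name `pca` and the synthetic data (three features, two of them strongly correlated) are assumptions.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # Step 1: standardize each feature to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardized features.
    C = np.cov(Z, rowvar=False)
    # Step 3: eigenvalues and eigenvectors (eigh is meant for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(C)
    # Step 4: sort in descending order of eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: keep the eigenvectors of the k largest eigenvalues.
    W = eigvecs[:, :k]
    # Step 6: transform the data into the new k-dimensional space.
    return Z @ W, eigvals

# Synthetic data: the second feature is almost a multiple of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     2 * x1 + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])

projected, eigvals = pca(X, k=2)
explained = eigvals / eigvals.sum()   # share of variance per component
```

Because two of the three features carry nearly the same information, the first two components are expected to account for almost all of the variance in `explained`.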
ADVANCED DATA ANALYTICS
PCA - Steps
ADVANCED DATA ANALYTICS
PCA - Steps
And Finally!
ADVANCED DATA ANALYTICS
Data Reduction – LDA
• LDA (Linear Discriminant Analysis), is a linear machine learning
algorithm used for multiclass classification.
• Linear Discriminant Analysis seeks to best separate (or discriminate) the
samples in the training dataset by their class value.
• LDA seeks to find a linear combination of input variables that achieves
the maximum separation for samples between classes (class centroids
or means) and the minimum separation of samples within each class.
• There are many ways to frame and solve LDA such as LDA algorithm in
terms of Bayes Theorem and conditional probabilities.
• In practice, LDA for multiclass classification is typically implemented
using the tools from linear algebra, and like PCA, uses matrix
factorization at the core of the technique.
ADVANCED DATA ANALYTICS
Data Reduction – SVD
• A popular approach to dimensionality reduction is to use
techniques from the field of linear algebra.
• Singular Value Decomposition, or SVD, might be the most popular
technique for dimensionality reduction when data is sparse.
• Sparse data refers to rows of data where many of the values are zero.
This is often the case in some problem domains like recommender
systems and BOW representation of text.
• SVD uses feature projection and the algorithms used are referred to as
projection methods. The resulting dataset, the projection, can then be
used as input to train a machine learning model.
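A small sketch of SVD as a projection method, using NumPy. The function name `truncated_svd` and the tiny sparse matrix (imagine user-by-item ratings) are assumptions for illustration; note that, unlike the PCA steps, the data is not centered first, which keeps zero entries meaningful.

```python
import numpy as np

def truncated_svd(X, k):
    """Return the k-dimensional projection of X, the k components, and all singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    reduced = U[:, :k] * s[:k]      # coordinates of each row in the k-dim space
    return reduced, Vt[:k], s

# Hypothetical sparse matrix: many entries are zero.
X = np.array([[5, 0, 0, 1],
              [4, 0, 0, 1],
              [0, 3, 4, 0],
              [0, 4, 5, 0]], dtype=float)

k = 2
reduced, components, s = truncated_svd(X, k)
approx = reduced @ components        # best rank-k reconstruction of X
error = np.linalg.norm(X - approx)   # information lost by the projection
```

`reduced` has one row per original row but only `k` columns, and can be used as the input to a machine learning model in place of `X`.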
ADVANCED DATA ANALYTICS
Test your understanding!
• What is the first step in data integration?
Understanding the metadata
• Is PCA lossy or lossless?
Lossy
• Amongst PCA and Attribute subset selection , which data reduction
method has more interpretability?
Attribute Subset Selection
ADVANCED DATA ANALYTICS
Test your understanding!
• ----------------- is a nonzero vector that stays parallel to its original
direction after multiplication by a matrix
Eigenvector
ADVANCED DATA ANALYTICS
References
• Data Mining: Concepts and Techniques by Jiawei Han, Micheline
Kamber and Jian Pei, The Morgan Kaufmann Series in Data
Management Systems, 3rd Edition Chapter : 3.3 – 3.4
• [Link]
• [Link]
e9cfa85d7b34
• [Link]
[Link]
• [Link]
component-analysis-pca-step-by-step-e7a4bb4031d9
THANK YOU
Dr. Bhaskarjyoti Das
Professor, Department of Computer Science and
Engineering in AI & ML, PES University, Bengaluru
Email: Bhaskarjyotidas@[Link]
ADVANCED DATA ANALYTICS
UE21CS342AA2
UNIT-2
Data Transformation
Dr. Bhaskarjyoti Das
Department of Computer Science and
Engineering in AI & ML
Advanced Data Analytics
Unit 2
Data Transformations
Slides excerpted from: Data Mining : Concepts and
Techniques by Han, Kamber and Pei, 3rd Edition
With grateful thanks for contribution of slides to:
Dr. Gowri Srinivasa, Professor at the Department of CSE,
PESU
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML
ADVANCED DATA ANALYTICS
Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be
identified with one of the new values.
ADVANCED DATA ANALYTICS
Why Smoothing ?
• Data smoothing is done in data analytics to reduce noise and
highlight underlying patterns.
• It helps in improving the accuracy of analysis by removing
fluctuations that obscure true trends and relationships, making it
easier to identify meaningful insights.
ADVANCED DATA ANALYTICS
Normalization
• Min-Max Normalization : Performs a linear transformation. It transforms the
values from [minA, maxA] to [new_minA, new_maxA]. A value v is
transformed to v' by
\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
• Example : Let the income range $12,000 to $98,000 be normalized to
[0.0,1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 \approx 0.716 \]
• This preserves the relationship among the original data values. It will
encounter an “out-of-bounds” error if an input which is outside the range of
the original data is provided.
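A short sketch of min-max normalization applied to the income example. The function name `min_max` and its parameter names are hypothetical.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

scaled = min_max(73_600, 12_000, 98_000)
print(round(scaled, 3))  # 0.716

# A later value outside the original range falls outside [0, 1]:
out_of_range = min_max(110_000, 12_000, 98_000)
print(out_of_range > 1.0)  # True
```

The second call illustrates the "out-of-bounds" problem: the mapping itself still works, but the result is no longer inside the target interval.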
ADVANCED DATA ANALYTICS
Normalization
• Z-score normalization : The values for an attribute A are normalized based
on the mean and standard deviation of A. A value v is normalized to v' by
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
where μA is the mean and σA the standard deviation.
• Example : Let μ = 54,000 and σ = 16,000. Then the z-score of 73,600 is
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
• This method of normalization is useful when the actual minimum or
maximum value of A is unknown or there are outliers in A.
• A variation of this is obtained by using the Mean Absolute Deviation (MAD)
instead of the standard deviation. It is less susceptible to outliers.
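A sketch of z-score normalization and of the mean-absolute-deviation variant. The function names and the small sample with one outlier are hypothetical.

```python
import statistics

def z_score(v, mean, std):
    """Standard z-score of a single value."""
    return (v - mean) / std

print(z_score(73_600, 54_000, 16_000))  # 1.225

def z_scores_mad(values):
    """Variant that divides by the mean absolute deviation instead of sigma."""
    mean = statistics.fmean(values)
    mad = statistics.fmean([abs(v - mean) for v in values])
    return [(v - mean) / mad for v in values]

# Hypothetical sample with one outlier.
sample = [1, 2, 3, 4, 100]
mad_scores = z_scores_mad(sample)
print(mad_scores[-1])  # 2.5
```

Because deviations are not squared, a single extreme value inflates the mean absolute deviation less than it inflates the standard deviation.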
ADVANCED DATA ANALYTICS
Why Discretization ?
• Data discretization is typically the process of converting continuous
numerical data into discrete categories
• This is done to simplify complex data, improve model performance,
and enhance interpretability.
• It's particularly useful for machine learning algorithms that require
categorical input or when analyzing patterns that are more apparent
in grouped data.
• Divides the range of a continuous attribute into intervals. Interval
labels are then used to replace the actual data values. Data size can
be reduced by discretization.
ADVANCED DATA ANALYTICS
Types of Discretization
• If the discretization process uses class information , it is called
supervised discretization ; otherwise it is called unsupervised
discretization.
• Top-down discretization : The process starts by finding one or a few points
(called split or cut points) to split the entire attribute range and
then repeats this recursively on the resulting intervals.
• Bottom-up discretization : Also called merging , it starts by
considering all the continuous values as potential split points. It removes
some split points by merging neighboring values to form intervals.
This process is recursively applied to the resulting intervals.
ADVANCED DATA ANALYTICS
Data discretization methods
• Binning :
▪ Top-down split , unsupervised
• Histogram analysis :
▪ Top-down split , unsupervised
• Clustering analysis :
▪ Unsupervised , top-down split or bottom-up merge
• Decision-tree analysis :
▪ Supervised , top-down split
• Correlation analysis :
▪ Supervised (e.g. χ2-based ChiMerge uses class labels) , bottom-up merge
Note : All these methods can be applied recursively.
ADVANCED DATA ANALYTICS
Binning
• Equal-width (distance) partitioning
▪ Divides the range into N intervals of equal size.
▪ The width of the interval is w = (Maximum – Minimum)/N.
▪ Is susceptible to outliers and skewed data.
• Equal-depth (frequency) partitioning
▪ Divides the range into N intervals, each containing approximately
same number of samples.
▪ Ensures good data scaling but managing categorical attributes can
get tricky.
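Both partitioning schemes can be sketched in plain Python. The function names are hypothetical, and details such as which bin receives a value lying exactly on an edge are assumptions of this sketch.

```python
def equal_width_bins(values, n):
    """Split the value range into n intervals of width (max - min) / n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), n - 1)  # the maximum goes in the last bin
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n):
    """Split the sorted values into n bins with (nearly) equal counts."""
    ordered = sorted(values)
    size = len(ordered)
    return [ordered[i * size // n:(i + 1) * size // n] for i in range(n)]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

print(equal_width_bins(prices, 3))
# [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(prices, 3))
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

With equal width, the bins hold 3, 3 and 6 values; with equal depth, each bin holds exactly 4.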
ADVANCED DATA ANALYTICS
Binning - Example
[Link]
ADVANCED DATA ANALYTICS
Data Smoothing with Binning
Sorted data for price (in $ ) : 4,8,9,15,21,21,24,25,26,28,29,34
Partition into equal-depth (frequency) bins
Bin-1 : 4,8,9,15
Bin-2 : 21,21,24,25
Bin-3 : 26,28,29,34
ADVANCED DATA ANALYTICS
Data Smoothing with Binning
• Smoothing by bin means :
Bin-1 : 9,9,9,9
Bin-2 : 23,23,23,23
Bin-3 : 29,29,29,29
• Smoothing by bin boundaries :
Bin-1 : 4,4,4,15
Bin-2 : 21,21,25,25
Bin-3 : 26,26,26,34
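The two smoothing rules can be checked with a short sketch. The function names are hypothetical; bin means are rounded to whole numbers to match the slide, and a value exactly halfway between the two boundaries is sent to the lower one, which is an assumption.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

print(smooth_by_means(bins))
# [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The exact bin means are 9, 22.75 and 29.25, so the second and third rows of the first result are rounded values.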
ADVANCED DATA ANALYTICS
Binning vs Clustering (Unsupervised Discretization)
ADVANCED DATA ANALYTICS
Discretization by Classification
• Supervised : Class labels are
used in determining the split
point.
• It is a top-down discretization
where recursive split is applied.
• Example : Decision tree
analysis.
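• Entropy is used to determine
the split point. Lower the
entropy , better the split. It is
given by
\[ Entropy(D) = -\sum_{i} p_i \log_2 p_i \]
where p_i is the fraction of
samples in D that belong to
class i.
[Link]
decisions-2946b9c18c8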
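A sketch of how one entropy-based split point could be chosen for a numeric attribute. The functions `entropy` and `best_split` and the age and label data are hypothetical; a full discretizer would apply `best_split` recursively to the two resulting intervals until a stopping rule is met.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (weighted_entropy, cut_point) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = (math.inf, None)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut between identical values
        cut = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if weighted < best[0]:
            best = (weighted, cut)
    return best

# Hypothetical attribute (age) and class label (buys the product or not).
ages = [22, 25, 30, 35, 40, 52, 60]
buys = ["no", "no", "no", "yes", "yes", "yes", "yes"]

weighted, cut = best_split(ages, buys)
print(cut)  # 32.5, which separates the two classes perfectly
```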
ADVANCED DATA ANALYTICS
Discretization by Correlation Analysis
• Supervised : Class labels are used in determining the split point.
• Example , Chi-merge: χ2-based discretization. It is a bottom-up merge.
• Initially each distinct value of the attribute is considered to be one
interval.
• χ2 tests are performed for every pair of adjacent intervals.
• Adjacent intervals with the least χ2 values are merged , since a low χ2 value indicates
similar class distributions.
• This merging process proceeds recursively until a pre-defined
threshold for χ2 is met.
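The ChiMerge steps above can be sketched as follows. This is a simplified illustration, not a reference implementation: the function names, the toy data and the threshold of 2.706 (the χ2 critical value for 1 degree of freedom at the 0.10 level) are assumptions.

```python
from collections import Counter

def chi2_pair(a, b, classes):
    """Pearson chi-square statistic for the 2 x k class-count table of two intervals."""
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:              # skip classes absent from both intervals
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, threshold):
    """Return the lower bounds of the intervals that remain after merging."""
    classes = sorted(set(labels))
    counts = {}
    for v, lab in zip(values, labels):
        counts.setdefault(v, Counter())[lab] += 1
    # Initially each distinct value is its own interval: (lower bound, class counts).
    intervals = [(v, counts[v]) for v in sorted(counts)]
    while len(intervals) > 1:
        scores = [chi2_pair(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] >= threshold:        # every adjacent pair differs enough: stop
            break
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals[i:i + 2] = [merged]
    return [lower for lower, _ in intervals]

values = [1, 2, 3, 7, 8, 9]
labels = ["a", "a", "a", "b", "b", "b"]
cuts = chimerge(values, labels, threshold=2.706)
print(cuts)
```

With this toy data the values 1 to 3 share one class and 7 to 9 the other, so merging stops with two intervals.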
ADVANCED DATA ANALYTICS
Concept Hierarchy Generation
• A concept hierarchy organizes concepts ( attribute values ) hierarchically by
representing a series of mappings from a set of low-level concepts to
higher-level , more general concepts.
• It facilitates drill-down and roll-up in data warehouses to view data at multiple
granularities.
• Method : Recursively reduce the data by collecting and replacing low level
concepts (such as numeric values for age) by higher level concepts ( such as
kids, teenagers, adults, senior citizens).
• They can be explicitly specified by domain experts and/or data warehouse
designers.
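As a sketch of the method, numeric ages can be replaced by higher-level concepts with a simple lookup. The cut points 13, 20 and 60 are assumptions chosen only for illustration; in practice a domain expert would specify them.

```python
from bisect import bisect_right
from collections import Counter

AGE_CUTS = [13, 20, 60]     # assumed boundaries: <13, 13-19, 20-59, 60+
AGE_CONCEPTS = ["kid", "teenager", "adult", "senior citizen"]

def age_concept(age):
    """Map a numeric age to its higher-level concept."""
    return AGE_CONCEPTS[bisect_right(AGE_CUTS, age)]

ages = [4, 15, 17, 34, 52, 61, 78]
rolled_up = Counter(age_concept(a) for a in ages)   # view the data at the coarser level
print(rolled_up)
```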
ADVANCED DATA ANALYTICS
Concept Hierarchy Generation for Numeric Data
Discretization methods discussed till now can be used for numeric data.
Example:
ADVANCED DATA ANALYTICS
Concept Hierarchy Generation for Nominal Data
1) Specification of a partial ordering of attributes explicitly at the schema
level
• A user or an expert defines a concept hierarchy by specifying a partial
or a total ordering of attributes at the schema level.
• For example , suppose a relational database contains the attributes
street, city, state and country . Location dimension of the data
warehouse may contain the same attributes.
• A hierarchy can be defined by specifying the total ordering among
these attributes at the schema level
street < city < state < country
ADVANCED DATA ANALYTICS
Concept Hierarchy Generation for Nominal Data
2) Specification of a portion of hierarchy by explicit data grouping
• A portion of the concept hierarchy is manually defined.
• In a large database , it is unrealistic to define the entire concept
hierarchy by explicit value enumeration.
• However , we can easily specify explicit groupings for a small portion of
intermediate-level data.
• For example , after specifying that state and country form a hierarchy at the
schema level , a user can define a few intermediate levels manually
{Karnataka , Tamil Nadu, Kerala, Andhra Pradesh, Telangana} < South India
ADVANCED DATA ANALYTICS
Concept Hierarchy Generation for Nominal Data
3) Specification of only a partial set of attributes.
• At times , a user can have a vague idea about what should be included in
the hierarchy.
• The user may have included only a small subset of the relevant attributes
in the hierarchy specification.
• For example, instead of including all the hierarchically relevant attributes
for location , the user might have specified only street and city.
• To handle this , embed data semantics into the database schema. Hence
one attribute will trigger a whole group of linked attributes to be added to
the hierarchy. For example , when city is added , it would automatically
include state and country as they are semantically related.
ADVANCED DATA ANALYTICS
Concept Hierarchy Generation for Nominal Data
4) Automatic generation of hierarchies by
analysis of distinct values per attribute
• Some hierarchies can be automatically
generated based on the analysis of the
number of distinct values per attribute in
the dataset.
• The attribute with the most distinct values
is placed at the lowest level of the
hierarchy.
Note : This method is not foolproof. For example, a time dimension in a database might
contain 20 distinct years , 12 distinct months and 7 distinct days of the week. However , this
doesn’t suggest that the time hierarchy should be year<month<days of the week .
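A minimal sketch of the distinct-value heuristic, using a handful of made-up location rows. The function name and the rows are assumptions.

```python
def hierarchy_by_distinct_values(rows, attributes):
    """Order attributes from lowest level (most distinct values) to highest."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a], reverse=True)

rows = [
    {"street": "MG Road", "city": "Bengaluru", "state": "Karnataka", "country": "India"},
    {"street": "Brigade Road", "city": "Bengaluru", "state": "Karnataka", "country": "India"},
    {"street": "Palace Road", "city": "Mysuru", "state": "Karnataka", "country": "India"},
    {"street": "Anna Salai", "city": "Chennai", "state": "Tamil Nadu", "country": "India"},
    {"street": "Mount Road", "city": "Chennai", "state": "Tamil Nadu", "country": "India"},
]
order = hierarchy_by_distinct_values(rows, ["country", "state", "street", "city"])
print(" < ".join(order))
```

Applied to the time example in the note above, the same heuristic would rank year (20 distinct values) below month (12) and day of the week (7), which is why the generated hierarchy should be reviewed before use.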
ADVANCED DATA ANALYTICS
Data Transformation in Code
• There is some overlap with the Data Cleaning steps discussed earlier
• Data Transformation may consist of the following steps
• Scaling numerical data ( standardization and normalization ) , including scaling with
the interquartile range
• Encoding categorical data ( ordinal and one-hot encoding )
• Using a power transform to make the data distribution more Gaussian-like ( Box-Cox
, Yeo-Johnson ) or a quantile transform to make it uniform ( Uniform Quantile Transform )
• Discretizing numerical data , for example with K-means
discretization ( fits K clusters )
• Applying the appropriate transforms to both numerical and categorical data
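A short NumPy sketch of the scaling and encoding steps listed above. The sample arrays are invented, and scikit-learn's `StandardScaler`, `MinMaxScaler`, `RobustScaler`, `OrdinalEncoder` and `OneHotEncoder` provide the same operations as fitted transformers.

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 200.0])   # last value is an outlier

# Standardization (z-score): zero mean, unit standard deviation.
z_score = (x - x.mean()) / x.std()

# Normalization (min-max): rescales to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: centre on the median and divide by the interquartile range.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# Ordinal codes and one-hot encoding for a categorical attribute.
colours = np.array(["red", "green", "blue", "green"])
categories, codes = np.unique(colours, return_inverse=True)  # codes follow sorted order
one_hot = np.eye(len(categories), dtype=int)[codes]

print(min_max.round(3))
print(robust.round(3))
print(categories, codes)
print(one_hot)
```

To avoid data leakage, the statistics used here (mean, minimum, quartiles, category list) would be computed on the training set only and then reused to transform the test set.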
ADVANCED DATA ANALYTICS
Data Compression
• Transformations are applied so as to obtain a reduced or
compressed representation of the original data.
• If the original data can be reconstructed from the
compressed data without any information loss, the data
reduction is called lossless. But if we can reconstruct only an
approximation of the original data , then the data reduction
is called lossy.
• There are several lossless algorithms for string compression ,
however they allow only limited data manipulation.
• Dimensionality reduction and numerosity reduction can also
be considered as data compression.
ADVANCED DATA ANALYTICS
Test your understanding!
• Which method of normalization must one choose if they are dealing with a lot of
outliers and don’t know the range of their data?
Solution
Z-Score Normalization
• Which split-point is preferred for discretization using decision trees?
Solution
A Split point which results in least entropy.
• Which normalization method strictly works in the range of input data?
Solution
Min-Max Normalization
ADVANCED DATA ANALYTICS
Test your understanding!
• Consider a set of unsorted data for price in dollars
8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
1) Partition the data into equal-frequency bins
2) On the results of part (1), apply smoothing by bin means
Solution:
1) 2)
Advanced Data Analytics
Unit 2
Feature Selection
Slides excerpted from: Data Mining : Concepts and
Techniques by Han, Kamber and Pei, 3rd Edition
With grateful thanks for contribution of slides to:
Dr. Gowri Srinivasa, Professor at the Department of CSE,
PESU
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML
ADVANCED DATA ANALYTICS
Attribute Subset Selection
• Attribute Subset Selection reduces the dataset size by removing irrelevant or
redundant attributes.
• Redundant attributes : Information is contained or can be extracted from
other attributes. Example : MRP of a product and the corresponding sales tax
paid.
• Irrelevant attributes : They contain no information which is useful for the
data analysis task at hand. Example : SRN is irrelevant to predict students’
GPA.
• The goal of attribute subset selection is to find the minimum set of attributes
such that the resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes.
ADVANCED DATA ANALYTICS
Filter based Feature Selection
• Filter-based feature selection methods score each feature and select the
features with the largest (or smallest) scores.
• The statistical measures used in filter-based feature selection are
generally calculated one input variable at a time with the target variable.
As such, they are referred to as univariate statistical measures.
• This may mean that any interaction between input variables is not
considered in the filtering process.
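A minimal sketch of a univariate filter, scoring each input by its absolute Pearson correlation with the target. The synthetic data, the choice of score and the value of k are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
# Only features 0 and 2 influence the target in this synthetic example.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Score every feature on its own against the target (univariate measure).
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

k = 2
selected = np.sort(np.argsort(scores)[-k:])   # indices of the k highest scores
print(scores.round(3), selected)
```

Each feature is scored in isolation, so a pair of features that is informative only in combination would receive low scores here.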
ADVANCED DATA ANALYTICS
Feature Selection Methods for Machine Learning
ADVANCED DATA ANALYTICS
Recursive Feature Elimination
• RFE is a wrapper-type feature selection algorithm. This means that a
different machine learning algorithm is given and used in the core of the
method, is wrapped by RFE, and used to help select features.
• RFE works by searching for a subset of features by starting with all
features in the training dataset and successively removing features until
the desired number remains.
• This is achieved by fitting the given machine learning algorithm used in
the core of the model, ranking features by importance, discarding the least
important features, and re-fitting the model.
• This process is repeated until a specified number of features remains.
• Technically, RFE is a wrapper-style feature selection algorithm that also
uses filter-based feature selection internally.
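The RFE loop can be sketched with ordinary least squares as the wrapped model. Using the absolute coefficient on standardized inputs as the importance measure is an assumption of this sketch; scikit-learn offers a general version as `sklearn.feature_selection.RFE`.

```python
import numpy as np

def rfe(X, y, n_keep):
    """Recursive feature elimination with a linear model as the core estimator."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize so coefficients are comparable
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        A = np.column_stack([np.ones(len(y)), Xs[:, remaining]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # fit the model
        weakest = int(np.argmin(np.abs(coef[1:])))       # rank features, skipping the intercept
        del remaining[weakest]                           # discard the least important one
    return remaining

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=300)
kept = rfe(X, y, n_keep=2)
print(kept)
```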
ADVANCED DATA ANALYTICS
Heuristic Search in Attribute Subset Selection
• For n attributes , there are 2^n possible subsets. An exhaustive search is infeasible.
• The best and the worst attributes are determined using tests of statistical
significance , which assume that the attributes are independent of each other.
• Stepwise forward selection : The procedure starts with an empty set as the reduced
set. In each iteration , the best of the original attributes is selected and added to the
reduced set.
• Stepwise backward elimination : The procedure starts with the full set of attributes.
In each iteration , the worst attribute remaining in the set is removed.
• Combination of forward and backward selection : In each iteration, the best
attribute is selected and the worst attribute is removed.
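A sketch of stepwise forward selection. The slides pick the best attribute with a test of statistical significance; this sketch substitutes the R² of a least-squares fit as the selection criterion, which is a simplifying assumption, as are the function names and the synthetic data.

```python
import numpy as np

def r_squared(X, y, cols):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centred = y - y.mean()
    return 1.0 - (resid @ resid) / (centred @ centred)

def forward_selection(X, y, n_select):
    """Start from an empty set and greedily add the attribute that helps most."""
    chosen, candidates = [], list(range(X.shape[1]))
    while len(chosen) < n_select:
        best = max(candidates, key=lambda c: r_squared(X, y, chosen + [c]))
        chosen.append(best)
        candidates.remove(best)
    return chosen

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = 4.0 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(scale=0.5, size=300)
order = forward_selection(X, y, n_select=2)
print(order)
```

Backward elimination follows the same pattern in reverse: start from all attributes and repeatedly drop the one whose removal hurts the criterion least.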
ADVANCED DATA ANALYTICS
Attribute Creation – Feature Generation
• Creating new features that can capture important information in the dataset
more effectively than the original ones.
• For example, an attribute area can be added based on the attributes height
and width. By combining attributes, accuracy can be improved and missing
information about the relationships between the attributes can be discovered.
• Three general methodologies
▪ Attribute extraction – Domain specific.
▪ Mapping data to a new space – Fourier transforms, wavelet transforms and
manifold approaches (isomap, Multidimensional Scaling, Spectral
Embedding etc)
▪ Attribute construction
o Combining features
o Data Discretization
ADVANCED DATA ANALYTICS
References
• Data Mining: Concepts and Techniques by Jiawei Han, Micheline
Kamber and Jian Pei, The Morgan Kaufmann Series in Data
Management Systems, 3rd Edition Chapter 3.5
• [Link]
data-mining/
THANK YOU
Dr. Bhaskarjyoti Das
Professor, Department of Computer Science and
Engineering in AI & ML, PES University, Bengaluru
Email: Bhaskarjyotidas@[Link]
ADVANCED DATA ANALYTICS
UE23AM343AB1
UNIT-2
Simple Linear Regression
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering
in AI & ML
Advanced Data Analytics
Unit 2: Regression Analysis
Simple Linear Regression
Slides excerpted from: U. Dinesh Kumar,
“Business Analytics”, Wiley, 2nd Edition 2022
With grateful thanks for contribution of slides to:
Dr. Gowri Srinivasa, Professor at the Department of CSE, PESU
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering in AI &ML
ADVANCED DATA ANALYTICS
In this session
• Introduction to Regression
• What is Regression?
• Types of Regression
• Simple Linear Regression (SLR)
• SLR Model Building
• Ordinary Least Squares Assumptions
• Interpretation of coefficients
ADVANCED DATA ANALYTICS
What is Regression?
• Regression is a tool for finding whether an association exists between a
dependent variable Y and one or more independent variables X1, X2, …, Xn
• The relationship can be linear or non-linear
• Dependent variable: measures the outcome of a study; responds to changes in the independent
variables (also called outcome variable or response variable)
• Independent variable: explains the change in a response variable (also called explanatory variable)
• Regression: set the values of the explanatory variables and see how they affect the response
variable
ADVANCED DATA ANALYTICS
Regression - Definition
• Regression is a supervised learning algorithm in machine learning terminology
▪ Requires knowledge of both Y and X in the training set
• Regression is a statistical technique to determine if there is a relationship between a
dependent variable (Y) and a set of independent variables (X1, X2, …, Xn)
• Regression (association) does not imply causation, even though terms like
“independent” and “dependent” are used
• Regression is a tool that lets us predict the value of the dependent variable given the
values of the independent variables
• Regression is used for generating new hypotheses and validating hypotheses
ADVANCED DATA ANALYTICS
Regression Nomenclature
Dependent Variable      Independent Variable
Explained Variable      Explanatory Variable
Regressand              Regressor
Predictand              Predictor
Endogenous Variable     Exogenous Variable
Controlled Variable     Control Variable
Target Variable         Stimulus Variable
Response Variable       Feature
Outcome Variable
ADVANCED DATA ANALYTICS
Let’s Noodle Around
• A common business use case of regression is predicting the prices of goods and assets.
Say you are trying to predict the prices of houses in a certain neighborhood of a certain
suburb. Think of possible features/factors that may be used as explanatory variables.
• Some features may be
▪ Square footage
▪ Average income
▪ …
Source: Getty Images
ADVANCED DATA ANALYTICS
Types of Regression
• Regression coefficients are used to measure the average functional relationship
between variables
• Simple Linear Regression: single independent variable; the functional relationship
between the dependent variable and the regression coefficients is linear. The
linearity in linear regression models refers to linearity in the coefficients
• Multiple Linear Regression: multiple independent variables; the functional relationship
between the dependent variable and the regression coefficients is linear
• Non-linear Regression: non-linear relationship between the dependent variable and the
regression coefficients
ADVANCED DATA ANALYTICS
Linear Regression
• Linear relationship between the dependent variable and the regression coefficients (β0 and β1)
• Not defined as a linear relationship between the dependent variable and the independent
variables
• A regression that is non-linear in X but linear in the coefficients, for example Y = β0 + β1 X² + ε, is therefore also a linear regression
ADVANCED DATA ANALYTICS
Simple Linear Regression
• Functional form of SLR is Yi = β0 + β1 Xi + εi
SLR attempts to explain the changes in the value of the response variable Y using the knowledge of the values of the explanatory variable X. Thus, the expression β0 + β1 Xi gives the predicted value of Yi for a given value of Xi, whereas the term εi is the error in predicting the value of Yi. In fact, β0 + β1 Xi is the conditional expected value of Yi for a given value of Xi.
• Regression is a statistical relationship (not a mathematical one) and is, hence, not exact
(calculated as a best-fit function)
• Since it is inexact, there is an error term called the residual or random error
• Such errors cannot be explained by the model
ADVANCED DATA ANALYTICS
Simple Linear Regression Model Building
• 9-step process
ADVANCED DATA ANALYTICS
SLR Model Building
• Step 1: Collect data
▪ The first step in building a regression model is to collect and/or extract data from different sources for the
identified problem (or KPI – Key Performance Indicator)
▪ Can be time consuming and expensive
• Step 2: Preprocess data
▪ Pre-processing the data is an important stage in regression model building
▪ Before the model is built, it is also essential to check the quality of the data for issues such as reliability,
completeness, usefulness, accuracy, missing data, and outliers.
▪ Dummy variables for nominal data
• Step 3: Divide into training and validation sets
▪ While choosing the best model, select based on performance on the validation set
• Step 4: Perform descriptive analysis
▪ It is always good practice to perform descriptive analytics before moving to predictive analytics model building
▪ Visualization techniques to see the distribution of the data
ADVANCED DATA ANALYTICS
SLR Model Building (cont.)
• Step 5: Define the functional form of the relationship
▪ Scatter plots
• Step 6: Estimate regression parameters
▪ The method of Ordinary Least Squares (OLS) is used to estimate the regression parameters
• Step 7: Perform regression model diagnostics
▪ Before it can be deployed, the regression model must be validated against all model
assumptions, including the definition of the functional form
• Step 8: Validate model using validation set
▪ Model may be cross-validated using multiple training and test data sets
▪ Check for overfitting
• Step 9: Decide on model deployment
ADVANCED DATA ANALYTICS
Ordinary Least Squares Solution
• Estimation of regression coefficients involves solving a system of linear equations
• Best equation to represent the relationship between observations
• Provides the Best Linear Unbiased Estimate (BLUE): E[β̂ − β] = 0
• OLS: minimize the sum of squared errors (SSE) over all n observations
ADVANCED DATA ANALYTICS
Ordinary Least Squares Assumptions
1. The regression model is linear in regression parameters
2. The explanatory variable, X, is assumed to be non-stochastic (i.e., X is deterministic)
3. The conditional expected value of the residuals, E(ei|Xi), is zero
4. In the case of time series data, residuals are uncorrelated, that is, Cov(ei , ej) = 0 for all i ≠ j
5. The residuals, ei, follow a normal distribution
6. The variance of the residuals, Var(ei|Xi), is constant for all values of Xi. When the variance
of the residuals is constant for different values of Xi, it is called homoscedasticity. A non-
constant variance of residuals is called heteroscedasticity.
ADVANCED DATA ANALYTICS
Linear Regression Derivation
• The core idea is to find the "best-fit" line (or hyperplane in higher
dimensions) that minimizes the difference between predicted and
actual values, using the principle of least squares.
• Linear Regression can be derived using calculus, linear algebra,
Maximum Likelihood Estimation (MLE), and Gradient Descent.
• Calculus helps in finding the optimal parameters by minimizing the error
• Linear algebra provides a concise way to represent and manipulate the data
and model.
• MLE provides a probabilistic framework for finding the parameters. Most
importantly, it explains one of the fundamental assumptions: that the residuals
follow a normal distribution
• Gradient Descent is an iterative optimization algorithm used to find the
parameters that minimize the cost function
ADVANCED DATA ANALYTICS
Ordinary Least Squares (OLS) – Linear Algebra way
1. Typically, a linear system in matrix form is Ax = b, and we
have to find the vector x.
2. We hope the vector b lies in the column space of A, C(A). That
is, we’re hoping there’s some linear combination of the
columns of A that gives us our vector of observed b values.
3. However, we already know b doesn’t fit our model perfectly. That
means it’s outside the column space of A.
4. The linear regression answer is that we should forget about
finding a model that perfectly fits b, and instead swap out b
for another vector that’s pretty close to it but that fits our
model. Specifically, we want to pick a vector p that’s in the
column space of A, but is also as close as possible to b.
ADVANCED DATA ANALYTICS
Ordinary Least Squares (OLS) – Linear Algebra way
1. Specifically, we want to pick a vector p that’s in the column
space of A, but is also as close as possible to b. So, we choose
the projection of b onto the column space of A: p = A x̂.
2. The error e is just the observed vector b minus the projection p, or e = b − p.
Since the vector e is perpendicular to A’s column space, its dot product
with every column of A must be zero: Aᵀ(b − A x̂) = 0, which gives
x̂ = (AᵀA)⁻¹ Aᵀ b.
3. The elements of the vector x̂ are the estimated regression
coefficients C and D we’re looking for.
In subsequent discussions on regression, the hat matrix H = A(AᵀA)⁻¹Aᵀ, which maps b to p = Hb, will keep
appearing
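The projection argument can be checked numerically. This is a small sketch on made-up data; the variable names follow the Ax = b notation of the slides.

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])           # explanatory values (made up)
b = np.array([2.1, 3.9, 6.2, 7.8, 10.1])          # observed responses (made up)
A = np.column_stack([np.ones_like(t), t])         # columns: intercept, slope

# Normal equations: A^T A x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

p = A @ x_hat                                     # projection of b onto C(A)
e = b - p                                         # error vector, perpendicular to C(A)
H = A @ np.linalg.inv(A.T @ A) @ A.T              # hat matrix: p = H b

print("coefficients:", x_hat)
print("A^T e:", A.T @ e)                          # numerically zero
```

Forming the explicit inverse is done here only to display the hat matrix; for estimation alone, `np.linalg.lstsq` is the numerically safer route.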
ADVANCED DATA ANALYTICS
Ordinary Least Squares Solution (Calculus)
• Equate the partial derivatives of SSE with respect to the regression coefficients to 0 (the derivation can be skipped; it is
included just for understanding)
▪ For β0: equation (1)
▪ For β1: equation (2)
• Solving (1) and (2), then rewriting, gives equation (3)
• Adding 0 to the numerator and denominator of (3) gives the slope in terms of Pearson’s correlation coefficient
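A sketch of the standard derivation these slides walk through. The symbols β0, β1 and the placement of the labels (1) to (3) are assumptions made to match the references above; the intermediate form on the original slides may differ.

```latex
\begin{align*}
SSE &= \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 \\
\frac{\partial SSE}{\partial \beta_0} &= -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i) = 0 \tag{1} \\
\frac{\partial SSE}{\partial \beta_1} &= -2 \sum_{i=1}^{n} X_i \,(Y_i - \beta_0 - \beta_1 X_i) = 0 \tag{2} \\
\hat{\beta}_0 &= \bar{Y} - \hat{\beta}_1 \bar{X}, \qquad
\hat{\beta}_1 = \frac{\sum_{i} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i} X_i^2 - n \bar{X}^2} \tag{3} \\
\hat{\beta}_1 &= \frac{\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i} (X_i - \bar{X})^2}
               = r \, \frac{S_Y}{S_X}
\end{align*}
```

Here r is Pearson's correlation coefficient between X and Y, and S_X, S_Y are their standard deviations.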
ADVANCED DATA ANALYTICS
OLS as MLE
ADVANCED DATA ANALYTICS
OLS as MLE
ADVANCED DATA ANALYTICS
OLS as MLE
ADVANCED DATA ANALYTICS
OLS as MLE
ADVANCED DATA ANALYTICS
OLS as MLE
ADVANCED DATA ANALYTICS
OLS as MLE
ADVANCED DATA ANALYTICS
OLS as MLE
ADVANCED DATA ANALYTICS
ADVANCED DATA ANALYTICS
OLS as MLE
ADVANCED DATA ANALYTICS
OLS as MLE
Mathematics for Machine Learning
OLS with Gradient Descent
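A minimal sketch of fitting the SLR coefficients by gradient descent on the mean squared error. The data, learning rate and iteration count are assumptions picked so that this small example converges.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b0, b1 = 0.0, 0.0          # initial guess for intercept and slope
learning_rate = 0.02
for _ in range(20000):
    error = (b0 + b1 * x) - y
    # Gradients of the cost J = mean(error^2) with respect to b0 and b1.
    grad_b0 = 2.0 * error.mean()
    grad_b1 = 2.0 * (error * x).mean()
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

slope, intercept = np.polyfit(x, y, 1)   # closed-form least-squares fit for comparison
print(b0, b1)
print(intercept, slope)
```

Too large a learning rate makes the updates diverge, while a very small one needs many more iterations; the closed-form OLS solution avoids this tuning for small problems.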
ADVANCED DATA ANALYTICS
Example
The table provides the salaries of 50 graduating MBA students of a Business School in 2016 and their
corresponding percentage marks in grade 10. Develop a linear regression model by estimating the model
parameters.
ADVANCED DATA ANALYTICS
Ordinary Least Squares Assumptions
• Tip: if you have a Casio fx-991EX or 991ES, you could use the Statistics menu
(number 6), y=a+bx and type out all the x and y values
ADVANCED DATA ANALYTICS
Interpretation of Coefficients
• Important for understanding the impact of change in the values of explanatory variables
on the response variable
• The interpretation will depend on the functional form of the relationship between the
response and the explanatory variables
1. Interpretation in Y = β0 + β1 X
• β1 is the change in Y for a unit change in X
• β0 is the expected value of Y when X = 0
2. Interpretation in Y = β0 + β1 ln(X)
• β1/100 is the approximate change in Y for a 1% change in X
• β0 is the expected value of Y when ln(X) = 0, that is, X = 1
ADVANCED DATA ANALYTICS
Test your understanding!
• Which of the following are linear regression models?
• Solution
ADVANCED DATA ANALYTICS
Test your understanding!
• What is the meaning of “Heteroscedasticity”?
1. The variance of errors is not constant
2. The variance of the dependent variable is not constant
3. The errors are not linearly independent of one another
4. The errors have nonzero mean
• Solution
▪ The variance of the errors is not constant
ADVANCED DATA ANALYTICS
Test your understanding!
• State True or false:
In an OLS set-up, if the errors follow normal distribution, then Y will also follow a normal
distribution.
• Solution:
• True, since B0 and B1 are parameters and X is deterministic.
ADVANCED DATA ANALYTICS
References
• Business Analytics by U. Dinesh Kumar – Wiley 2nd Edition, 2022 Chapter 9
• U. Dinesh Kumar’s slides
• Getty Images
THANK YOU
Dr. Bhaskarjyoti Das
Professor, Department of Computer Science and
Engineering in AI & ML , PES University, Bengaluru
Email: bhaskarjyotidas@[Link]
ADVANCED DATA ANALYTICS
UE23AM343AB1
UNIT-2
SLR – Validation, Outlier Analysis
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering
in AI & ML
Advanced Data Analytics
Unit 2: Regression Analysis
SLR – Validation, Outlier Analysis, Numericals
Slides excerpted from: U. Dinesh Kumar,
“Business Analytics”, Wiley, 2nd Edition 2022
With grateful thanks for contribution of slides to:
Dr. Gowri Srinivasa, Professor at the Department of CSE, PESU
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering in AI &ML
ADVANCED DATA ANALYTICS
Validation of the SLR Model
⚫It is important to validate the regression model to ensure its validity and goodness of
fit before it can be used for practical applications
⚫The following measures are used to validate the simple linear regression models
1. Co-efficient of determination (R2)
2. Hypothesis test for regression coefficient B1
3. Analysis of variance for overall model validity (relevant more for multiple linear regression)
4. Residual analysis to validate the regression model assumptions
5. Outlier analysis
ADVANCED DATA ANALYTICS
1. Coefficient of Determination (R2)
⚫The primary objective of regression is to explain the variation in Y using the knowledge of
X
⚫The coefficient of determination measures the percentage of variation in Y explained by
the regression model
⚫In the absence of the regression model, predictions would be made using the mean
value of Y
⚫Total variation of a single observation = Yi − Ȳ
⚫ Broken down into variation explained by the model (Ŷi − Ȳ) and variation not explained by the model (Yi − Ŷi)
⚫ The model gives us the predicted value Ŷi (a better prediction than just the mean value)
ADVANCED DATA ANALYTICS
1. Coefficient of Determination (R2) (cont.)
⚫Total variation can be broken down into variation explained by model and variation not
explained by model
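The decomposition can be verified numerically: summed over all observations, total variation SST splits into explained variation SSR and unexplained variation SSE, and R² = SSR / SST. The data below are made up for this sketch.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 4.1, 5.9, 8.2, 9.9, 12.3])

# OLS estimates for simple linear regression.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the model
sse = np.sum((y - y_hat) ** 2)          # variation not explained by the model

r2 = ssr / sst
r = np.corrcoef(x, y)[0, 1]             # Pearson correlation between X and Y
print(r2, r ** 2)
```

For simple linear regression, R² equals the square of the correlation coefficient between X and Y, which the last line illustrates.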
ADVANCED DATA ANALYTICS
Example
Find the regression coefficients and the coefficient of determination for a linear
regression model fit to the data shown below. The explanatory variable is the cheese
age and the outcome variable is the price.
ADVANCED DATA ANALYTICS
Example
Tip: use your calculator’s Statistics menu if you have one to calculate the summary
statistics (Statistics menu -> y=a+bx -> enter values -> OPTN -> 2-
Variable Calc)
ADVANCED DATA ANALYTICS
Example
Equation of line
R2 is the square of the correlation coefficient
ADVANCED DATA ANALYTICS
1. Coefficient of Determination (R2) – Spurious Regression
⚫One of the major problems with the coefficient of determination (R2) is that two sets of
data without any relationship can have a very high coefficient of determination value
⚫The data in Table 9.7 shows the number of Facebook users (in millions) and the
number of people who died of helium poisoning in the UK between 2004 and 2012
⚫The R-square value for the regression model between the number of deaths due to helium
poisoning in the UK and the number of Facebook users is 0.9928
⚫This is an example of a spurious regression
Source: Textbook
ADVANCED DATA ANALYTICS
2. Hypothesis test for regression coefficient B1 (t-test)
⚫The standard error of estimate Se (or standard error of residuals) is the standard deviation
of the sampling distribution of the residuals
⚫ 2 degrees of freedom lost due to calculation of 2 regression coefficients
⚫Standard error of regression coefficient
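A sketch of the quantities behind the t-test, using the standard SLR formulas Se = sqrt(SSE / (n − 2)), Se(β1) = Se / sqrt(Σ(Xi − X̄)²) and t = β1 / Se(β1) under the null hypothesis β1 = 0. The data are made up; the critical value or p-value would come from a t table or a statistics library with n − 2 degrees of freedom.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 4.1, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Standard error of estimate: two degrees of freedom are lost to b0 and b1.
s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard error of the slope and the t-statistic for H0: slope = 0.
se_b1 = s_e / np.sqrt(sxx)
t_stat = b1 / se_b1
print(s_e, se_b1, t_stat)
```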
ADVANCED DATA ANALYTICS
2. Hypothesis test for regression coefficient B1 (cont.)
A two-tailed test checks both whether the mean is significantly greater than x and whether the mean is significantly less than x.
[Figure: two-tailed test; source listed in References]
ADVANCED DATA ANALYTICS
3. Test for Overall Model: Analysis of Variance (F-test)
⚫Using the Analysis of Variance (ANOVA), we can test whether the overall model is statistically
significant
⚫For a simple linear regression, the null and alternative hypotheses in ANOVA and the t-test are
exactly the same, and thus there will be no difference in the p-value
⚫Hypothesis test (two-tailed test)
⚫ H0: no linear relationship between Y and any of the X’s
⚫ H1: linear relationship between Y and at least one of the X’s
⚫Alternatively
⚫ H0: all regression coefficients are 0
⚫ H1: not all regression coefficients are 0
⚫F-statistic for hypothesis test
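The usual form of the F-statistic, written here as a sketch with k denoting the number of explanatory variables and n the number of observations (symbols assumed for this block):

```latex
\begin{align*}
F &= \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)} \\
\text{For SLR } (k = 1):\quad
F &= \frac{SSR}{SSE / (n - 2)} = t^2
\end{align*}
```

The last equality is why the F-test and the t-test on the slope give the same p-value in simple linear regression.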
ADVANCED DATA ANALYTICS
4. Residual Analysis
3. Testing the functional form of the regression
⚫ Any pattern in the residual plot would indicate incorrect specification (misspecification) of the model
⚫ The functional form needs to be transformed
⚫ Good residual plots (left) and bad ones (right); source listed in References (a must-read)
ADVANCED DATA ANALYTICS
4. Residual Analysis (cont.)
3. Testing the functional form of the regression (cont.)
⚫ Improve model by transforming variables (log transformation, squaring, power etc.)
⚫ Improve model by adding new variables (nominal, combination/interaction of variables etc.)
⚫ Resort to a non-linear model
[Figure panels: log transformation; adding new variables]
ADVANCED DATA ANALYTICS
5. Outlier Analysis
⚫The following distance measures are useful in identifying the influential observations
1. Z-Score
2. Mahalanobis Distance
3. Other measures such as Cook’s Distance, Leverage Values, DFBeta and DFFit values
that are not covered
ADVANCED DATA ANALYTICS
5. Outlier Analysis (cont.)
1. Z-Score
⚫ Z-score is the standardized distance of an observation from its mean value
⚫ For the predicted value of the dependent variable Y, the Z-score is given by Z = (Yi − Ȳ) / σY , where Ȳ and σY are the mean and standard deviation of Y
⚫ Any observation with an absolute Z-score of more than 3 may be flagged as an outlier; such influential observations
may change the regression parameter values significantly
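A minimal sketch of flagging outliers with the Z-score rule. The values are invented, and the sample standard deviation (`ddof=1`) is an assumption of this sketch.

```python
import numpy as np

y = np.array([48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
              50, 52, 48, 50, 49, 51, 50, 47, 53, 150], dtype=float)

z = (y - y.mean()) / y.std(ddof=1)        # standardized distance from the mean
outliers = np.where(np.abs(z) > 3)[0]     # indices of flagged observations
print(z.round(2))
print(outliers)
```

With very small samples no point can reach |Z| > 3, because the extreme value itself inflates the standard deviation, so the rule is only informative when enough observations are available.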
ADVANCED DATA ANALYTICS
5. Outlier Analysis (cont.)
2. Mahalanobis Distance
⚫ Mahalanobis distance is used to calculate the distance of a point from a distribution
⚫ Distance between specific values of the independent variable (Xi) and the centroid of all observations of the
explanatory variable
⚫ If the dimensions (columns in your dataset) are correlated to one another, which is
typically the case in real-world datasets, the Euclidean distance between a point and
the center of the points (distribution) can be misleading; the Mahalanobis distance corrects for this by taking the covariance between the dimensions into account
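A sketch of the Mahalanobis distance, D = sqrt((x − μ)ᵀ S⁻¹ (x − μ)), where μ is the centroid and S the covariance matrix of the data. The synthetic correlated data and the two query points are assumptions chosen to contrast it with the Euclidean distance.

```python
import numpy as np

def mahalanobis(points, data):
    """Mahalanobis distance of each row of `points` from the distribution of `data`."""
    mu = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = points - mu
    # Row-wise quadratic form diff_i^T * cov_inv * diff_i.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(42)
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=500)

# Same Euclidean distance from the origin, but only the first follows the correlation.
points = np.array([[1.5, 1.5], [1.5, -1.5]])
d_mahalanobis = mahalanobis(points, data)
d_euclidean = np.linalg.norm(points - data.mean(axis=0), axis=1)
print(d_euclidean.round(2), d_mahalanobis.round(2))
```

The second point lies against the direction in which the two dimensions vary together, so its Mahalanobis distance is much larger even though its Euclidean distance is about the same.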
ADVANCED DATA ANALYTICS
Test your understanding!
Question 1:
Using the data above, answer the following questions:
a) Obtain the estimated regression line to predict sugar content based on the number of days the fruit is left on the tree
b) Calculate and plot the residuals against days (attribute). Comment on whether the residuals suggest a fault in the model.
ADVANCED DATA ANALYTICS
Test your understanding!
• Solution:
Part a:
Scatter plot drawn for primary understanding of how the attributes are
related
Since we see a slightly linear pattern,
linear regression may be appropriate
ADVANCED DATA ANALYTICS
Test your understanding!
ADVANCED DATA ANALYTICS
Test your understanding!
• Part b)
ADVANCED DATA ANALYTICS
Test your understanding!
Question 2:
The figures below show three residual plots. For each plot decide
whether the graph suggests a violation of one or more
assumptions for regression analysis.
ADVANCED DATA ANALYTICS
Test your understanding!
Solution:
a. The graph does not suggest a violation of one or more of the
assumptions for regression inferences; all points are randomly
scattered in a horizontal band.
b. Assumption on linear relationship between y and x appears to
be violated since the points seem to form a (slight) curve
indicating that the data do not follow a straight-line pattern.
c. Assumption of homoscedasticity appears to be violated since
the points form a funnel shape indicating non-constant
variability.
ADVANCED DATA ANALYTICS
Test your understanding!
Question 3:
In a linear regression problem, we are using “R-squared” to
measure goodness-of-fit. We add a feature to the linear regression
model and retrain the same model. Which of the following options
is true?
A. If R Squared increases, this variable is significant.
B. If R Squared decreases, this variable is not significant.
C. Individually R squared cannot tell about variable importance.
We can’t say anything about it right now.
D. None of these.
ADVANCED DATA ANALYTICS
Test your understanding!
Solution: Option C
“R squared” individually can’t tell whether a variable is significant
or not, because each time we add a feature, “R squared” can
either increase or stay constant. This is not true of
“Adjusted R squared”, which increases only when the added feature
improves the fit enough to offset the lost degree of freedom.
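For reference, the usual definition of adjusted R², with n observations and k explanatory variables (symbols assumed for this block):

```latex
\begin{align*}
R^2_{adj} = 1 - \frac{SSE / (n - k - 1)}{SST / (n - 1)}
          = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
\end{align*}
```

Adding a feature raises k, so the penalty factor (n − 1)/(n − k − 1) grows; adjusted R² rises only if R² improves by more than this penalty.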
ADVANCED DATA ANALYTICS
Test your understanding!
Question 4:
Suppose we have generated the data with help of polynomial regression of degree 3 (degree 3 will perfectly fit
this data). Now consider below points and choose the option based on these points.
1. Simple Linear regression will have high bias and low variance
2. Simple Linear regression will have low bias and high variance
3. polynomial of degree 3 will have low bias and high variance
4. Polynomial of degree 3 will have low bias and Low variance
• A. Only 1
• B. 1 and 3
• C. 1 and 4
• D. 2 and 4
ADVANCED DATA ANALYTICS
Test your understanding!
Solution: Option C
If we fit a polynomial of degree greater than 3, it will overfit the data because the
model is more complex than the process that generated the data (low bias, high variance).
If we fit a polynomial of degree less than 3, such as simple linear regression, the model
is too simple, so it has high bias and low variance. A polynomial of degree 3 matches the
data-generating process, so it has low bias and low variance.
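The bias part of this answer can be checked numerically. The sketch below is only an illustration: it assumes a noise-free cubic, y = x^3, as the "true" process (the slide does not give one) and compares the training error of a degree-1 and a degree-3 fit. It shows bias only; demonstrating variance would need repeated noisy samples.

```python
import numpy as np

# Assumed example data: a noise-free cubic relationship.
x = np.linspace(-2, 2, 41)
y = x ** 3

def train_mse(degree):
    """Mean squared error on the training data of a polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

mse_linear = train_mse(1)   # straight line cannot follow the curve: high bias
mse_cubic = train_mse(3)    # degree 3 reproduces the cubic almost exactly: low bias

print(f"degree 1 MSE: {mse_linear:.4f}")
print(f"degree 3 MSE: {mse_cubic:.2e}")
```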
ADVANCED DATA ANALYTICS
References
⚫Business Analytics by U. Dinesh Kumar – Wiley 2nd Edition, 2022 Chapter 9
⚫“Interpreting Residual Plots to Improve Your Regression”, Qualtrics:
[Link]
guides/interpreting-residual-plots-improve-regression/
⚫“Influential Points in Regression”, StatTrek:
[Link]
⚫“Two-tailed test”, Investopedia: [Link]
[Link]
⚫Mahalanobis distance, TileStats: [Link]
⚫
[Link]
df
⚫[Link]
scientist-on-regression-skill-test-regression-solution/
THANK YOU
Dr. Bhaskarjyoti Das
Professor, Department of Computer Science
and Engineering in AI & ML , PES University,
Bengaluru
Email: bhaskarjyotidas@[Link]
ADVANCED DATA ANALYTICS
UE23AM343AB1
UNIT-2
MLR – Multicollinearity, VIF, Numericals
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering
in AI & ML
Advanced Data Analytics
Unit 2: Regression Analysis
MLR – Multicollinearity, VIF, Numericals
Slides excerpted from: U. Dinesh Kumar,
“Business Analytics”, Wiley, 2nd Edition 2022
With grateful thanks for contribution of slides to:
Dr. Gowri Srinivasa, Professor at the Department of CSE, PESU
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML
ADVANCED DATA ANALYTICS
In this session
• Validation of the MLR model
1. Co-efficient of multiple determination (R2) and Adjusted R2
2. Hypothesis test for individual variables (t-test)
3. Analysis of variance for overall model validity (F-test)
4. Residual analysis
5. Presence of Multi-collinearity
6. Check for auto-correlation
7. Outlier analysis (Distance measures to find influential points – there
are metrics and this is not covered)
ADVANCED DATA ANALYTICS
5. Multi-Collinearity and Variance Inflation Factor
• When the data set has a large number of independent variables, it is possible that some of
these independent variables are highly correlated with one another (multi-collinearity)
• Multi-collinearity can have the following impact on the model:
1. The standard error of estimate of a regression coefficient may be inflated, so the t-test may
fail to reject the null hypothesis and a statistically significant explanatory variable may be
dropped from the model
2. The sign of the regression coefficient may be reversed, that is, a coefficient that should be
negative may be estimated as positive and vice versa
3. Adding or removing a variable, or even an observation, may cause large changes in the
regression coefficient estimates
• Variance inflation factor (VIF) measures the magnitude of multi-collinearity
ADVANCED DATA ANALYTICS
5. Variance Inflation Factor
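• Let us consider a regression model with two explanatory variables defined as follows
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$
• To find whether there is multi-collinearity, we develop a regression model between the two
explanatory variables as follows
$$X_1 = \alpha_0 + \alpha_1 X_2 + \varepsilon'$$
• Let $R_{12}^2$ be the $R^2$ value for the regression model between $X_1$ and $X_2$
• Variance Inflation Factor (VIF) is given by
$$\mathrm{VIF} = \frac{1}{1 - R_{12}^2}$$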
ADVANCED DATA ANALYTICS
5. Variance Inflation Factor (cont.)
• The square root of VIF is the factor by which the standard error of estimate is inflated in the
presence of multi-collinearity
• Equivalently, the square root of VIF is the factor by which the t-statistic is deflated
• So, the actual t-value is given by
$$t_{\text{actual}} = \frac{\hat{\beta}_1}{S_e(\hat{\beta}_1)} \times \sqrt{\mathrm{VIF}}$$
• There will be some correlation between explanatory variables in almost all cases, thus the value of
VIF is likely to be more than one
• A VIF above 4 is usually taken as the threshold for concern (a few authors suggest 10)
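As a rough illustration of the definition above, the sketch below computes a VIF for each predictor as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The data are synthetic and the helper names (`r_squared`, `vif`) are made up for this example; with only two predictors this reduces to the two-variable formula on the slide.

```python
import numpy as np

def r_squared(y, X):
    """R^2 of an OLS fit of y on X (X must already contain an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / sst

def vif(X):
    """VIF of every column of X (predictors only, no intercept column)."""
    n, k = X.shape
    values = []
    for j in range(k):
        others = np.delete(X, j, axis=1)                  # all predictors except j
        design = np.column_stack([np.ones(n), others])    # add the intercept
        values.append(1.0 / (1.0 - r_squared(X[:, j], design)))
    return np.array(values)

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x2 should be far above the threshold, x3 close to 1
```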
ADVANCED DATA ANALYTICS
5. Variance Inflation Factor (cont.)
Remedies for Handling Multi-Collinearity:
1. Remove one of the variables from the model building
2. Use mean-centered variables (to remove structural multicollinearity)
3. Use Principal Component Analysis (PCA) to reduce number of dimensions
4. Use methods like Ridge Regression and LASSO Regression
ADVANCED DATA ANALYTICS
6. Check for Auto-Correlation
• Auto-correlation is the correlation between successive error terms in time-series data
• One of the assumptions of the regression model is that there should be no correlation between error
terms $e_t$ and $e_{t-1}$ (known as auto-correlation of errors of lag 1)
• In general, errors $e_t$ and $e_{t-k}$ may be correlated (known as auto-correlation of lag k)
• If there is auto-correlation, the standard error of the beta coefficient may be
underestimated (the opposite of multi-collinearity), which results in overestimation of the
t-statistic and, in turn, a low p-value
• Thus, a variable which has no statistically significant relationship with the response variable may be
accepted in the model due to the presence of auto-correlation
• The presence of auto-correlation can be established using the Durbin−Watson test
ADVANCED DATA ANALYTICS
6. Durbin-Watson Test
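• Let rho ($\rho$) be the correlation between error terms ($e_t$, $e_{t-1}$)
• The null and alternative hypotheses are stated below
$$H_0: \rho = 0 \qquad H_A: \rho \neq 0$$
• The Durbin−Watson statistic, D, for correlation between errors of one lag is given by
$$D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$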
ADVANCED DATA ANALYTICS
6. Durbin-Watson Test (cont.)
• The value of the D statistic will lie between 0 and 4
• The Durbin−Watson test has two critical values, $D_L$ and $D_U$
• The inference of the test can be made based on the following conditions:
1. If $D < D_L$, then the errors are positively correlated
2. If $D > D_U$, then there is no evidence for positive auto-correlation
3. If $D_L < D < D_U$, the Durbin−Watson test is inconclusive
4. If $(4 - D) < D_L$, then errors are negatively correlated
5. If $(4 - D) > D_U$, there is no evidence for negative auto-correlation
6. If $D_L < (4 - D) < D_U$, the test is inconclusive
• Rule of thumb: a Durbin−Watson statistic close to 2 would imply absence of auto-correlation
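The rule of thumb can be seen on simulated errors. The sketch below assumes the usual form of the statistic (sum of squared successive differences divided by the sum of squared errors) and compares independent errors with errors generated as e_t = 0.8 e_{t-1} + noise; the value 0.8 and the function name `durbin_watson` are illustrative choices, not taken from the slides.

```python
import numpy as np

def durbin_watson(residuals):
    """D = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2, which always lies in [0, 4]."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
noise = rng.normal(size=500)

white = noise                       # independent errors
ar1 = np.zeros(500)                 # positively auto-correlated errors
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]

d_white = durbin_watson(white)      # expected to be close to 2
d_ar1 = durbin_watson(ar1)          # expected to be well below 2

print(f"independent errors: D = {d_white:.2f}")
print(f"AR(1) errors:       D = {d_ar1:.2f}")
```

A value well below 2 would then be compared with the tabulated critical values for the given sample size and number of predictors.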
ADVANCED DATA ANALYTICS
6. Addressing auto-correlation in MLR
The procedure is as follows:
1. Initial OLS Regression: Start with an initial regression analysis using ordinary least squares
(OLS) to estimate the model parameters.
2. Residual Calculation: Calculate the residuals from the initial regression.
3. Test for Autocorrelation: Examine the residuals for the presence of autocorrelation using
ACF plots or tests such as the Durbin-Watson test. If the autocorrelation is not significant,
there is no need to follow the procedure.
4. Transformation: The estimated model is transformed by differencing the dependent and
independent variables to remove autocorrelation. The idea here is to make the residuals
closer to being uncorrelated.
5. Regress the Transformed Model: Perform a new regression analysis with the transformed
model and compute new residuals.
6. Check for Autocorrelation: Test the new residuals for autocorrelation again. If
autocorrelation remains, go back to step 4 and transform the model further until the
residuals show no significant autocorrelation.
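The six steps can be sketched on synthetic data as follows. This is one possible reading of the procedure, assuming simple first differencing as the transformation and a regression through the origin on the differenced data (the intercept cancels when differencing); the data, the error process with coefficient 0.9, and the helper names are all assumptions made for illustration.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def durbin_watson(e):
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Synthetic time series: y = 2 + 3x + error, with strongly auto-correlated errors.
rng = np.random.default_rng(2)
n = 300
x = np.cumsum(rng.normal(size=n))
shocks = rng.normal(size=n)
err = np.zeros(n)
for t in range(1, n):
    err[t] = 0.9 * err[t - 1] + shocks[t]
y = 2.0 + 3.0 * x + err

# Steps 1-3: initial OLS, residuals, test for auto-correlation.
X = np.column_stack([np.ones(n), x])
beta_ols, resid_ols = ols(X, y)
d_before = durbin_watson(resid_ols)

# Step 4: transform by first differencing both variables.
dy, dx = np.diff(y), np.diff(x)

# Steps 5-6: regress the transformed model (no intercept) and test again.
beta_diff, resid_diff = ols(dx.reshape(-1, 1), dy)
d_after = durbin_watson(resid_diff)

print(f"D before differencing: {d_before:.2f}")
print(f"D after differencing:  {d_after:.2f}")
print(f"slope from differenced model: {beta_diff[0]:.2f}")
```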
ADVANCED DATA ANALYTICS
6. Methods for Addressing auto-correlation in MLR
There are different procedures:
1. Auto-correlation is discussed further under Time Series, where regression is one
method for handling time-series data. Essentially, an ACF (autocorrelation function) plot
is used to check whether auto-correlation exists at lag n
2. Different methods are then available to correct the regression coefficients, for
example the Cochrane-Orcutt procedure, the Hildreth-Lu procedure, and the First
Differences procedure
ADVANCED DATA ANALYTICS
Test your understanding!
Question 1:
Consider the estimated regression equation: ŷ = 3536 + 1183x1 – 1208x2. Suppose the model
is changed to reflect the deletion of x2 and the resulting estimated simple linear equation
becomes ŷ= –10663 + 1386x1.
How should we interpret the meaning of the coefficient on x1 in the estimated simple linear
regression equation
ŷ= –10663 + 1386x1?
Solution:
A one-unit change in the independent variable x1 is associated with an expected change of
1386 units in the dependent variable ŷ.
ADVANCED DATA ANALYTICS
Test your understanding!
Question 2:
Interpret the results below and answer the following questions.
Suppose we regress the dependent variable y on four independent variables x1, x2, x3,
and x4. After running the regression on n = 16 observations, we have the following
information: SSreg = 946.181 and SSres = 49.773. Please answer the following questions.
HINT:SSreg measures explained variation and SSres measures unexplained variation
What is the value of r-square?
Solution:
0.95
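The arithmetic behind this answer, using the hint that the total variation is the sum of the explained and unexplained parts:

```python
ss_reg = 946.181   # explained variation (given)
ss_res = 49.773    # unexplained variation (given)

ss_total = ss_reg + ss_res          # total variation
r_squared = ss_reg / ss_total       # share of variation explained by the model

print(round(r_squared, 2))  # 0.95
```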
ADVANCED DATA ANALYTICS
Test your understanding!
Question 3:
In a multiple regression analysis, when there is no linear relationship between each
of the independent variables and the dependent variable, then:
A)multiple t -tests of the individual coefficients will likely show some are significant
B)we will conclude erroneously that the model has some validity
C)the chance of erroneously concluding that the model is useful is substantially less
with the F -test than with multiple t -tests
D)All of the above statements are correct
Solution:
D)All of the above are correct
ADVANCED DATA ANALYTICS
Test your understanding!
Question 4:
For which regression assumption does the Durbin–Watson statistic
test?
[Link]
[Link]
[Link]
[Link] of error
Solution:
Independence of error
ADVANCED DATA ANALYTICS
Test your understanding!
Question 5:
Leverage value of an observation measures the influence of that observation on ______________
and is related to the __________ distance.
Solution:
the overall fit of the regression function, Mahalanobis
ADVANCED DATA ANALYTICS
Test your understanding!
Question 6:
If the data is time-series, then there could be __________ which can result in adding a variable
which is statistically not significant. The presence of it can result in inflating the ________ value.
Solution:
auto-correlation, t-statistic
ADVANCED DATA ANALYTICS
Test your understanding!
Question 7:
When a new variable is added to an MLR model, the increase in R2 is given by the square of the
____________ between the dependent variable and the newly added variable.
Solution:
part-correlation (semi-partial correlation)
ADVANCED DATA ANALYTICS
References
• R-Squared, Adjusted R-Squared and the Degree of Freedom, Medium:
[Link]
degree-of-freedom-80e7203a7e27
• [Link]
Multiple-Linear-Regression
• [Link]
exercises/chapter-13-multiple-regression
• Business Analytics by U. Dinesh Kumar – Wiley 2nd Edition, 2022
THANK YOU
Dr. Bhaskarjyoti Das
Professor, Department of Computer Science and
Engineering in AI & ML , PES University, Bengaluru
Email: bhaskarjyotidas@[Link]
ADVANCED DATA ANALYTICS
UE23AM343AB1
UNIT-2
Multiple Linear Regression and Validation
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering
in AI & ML
Advanced Data Analytics
Unit 2: Regression Analysis
Multiple Linear Regression
Slides excerpted from: U. Dinesh Kumar,
“Business Analytics”, Wiley, 2nd Edition 2022
With grateful thanks for contribution of slides to:
Dr. Gowri Srinivasa, Professor at the Department of CSE, PESU
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML
ADVANCED DATA ANALYTICS
In this session
• Multiple Linear Regression
• OLS Assumptions
• OLS Solution
• MLR Model Building
• Partial and Semi-Partial Correlation
• Interpretation of MLR Coefficients
• Standardized Regression Coefficient
• Regression Models with Qualitative Variables
ADVANCED DATA ANALYTICS
Multiple Linear Regression
⚫Multiple Linear Regression (MLR) model is a statistical model that establishes the existence of
a linear relationship (association) between a dependent variable and several
independent variables
⚫Functional form:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$
⚫The regression coefficients $\beta_1, \beta_2, \dots, \beta_k$ are called partial regression
coefficients since the relationship between an explanatory variable and the response
variable is calculated after removing (partialling out) the effect of all the other
explanatory variables in the model
⚫Response surface is a hyperplane in k+1 dimensions
⚫Matrix form of equation (N = sample size, K = dimensions)
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
ADVANCED DATA ANALYTICS
OLS Assumptions
1. The regression model is linear in regression parameters
2. The explanatory variable, X, is assumed to be non-stochastic (i.e., X is deterministic or fixed in
repeated samples)
3. The conditional expected value of the residuals, E(ei|Xi), is zero (errors are unbiased; we should not
see a pattern in residual plots; residuals when averaged over a given set of predictor values, will be
zero)
4. In case of time series data, residuals are uncorrelated, that is, Cov(ei , ej) = 0 for all i ≠ j
5. The residuals, ei, follow a normal distribution
6. The variance of the residuals, Var(ei|Xi), is constant for all values of Xi. When the variance of the
residuals is constant for different values of Xi, it is called homoscedasticity; a non-constant variance
of residuals is called heteroscedasticity. Homoscedasticity means the spread or dispersion of the
residuals is uniform across all levels of the independent variables, and the residuals do not cluster
or spread out systematically as the predictor values change
7. There is no high correlation between independent variables in the model (high correlation
between them is called multi-collinearity)
ADVANCED DATA ANALYTICS
OLS Solution
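⚫Equate partial derivatives of SSE with respect to the partial regression coefficients to 0
(can skip derivation; just for understanding)
$$SSE = \sum_{i=1}^{N} \left(Y_i - \beta_0 - \beta_1 X_{1i} - \dots - \beta_k X_{ki}\right)^2, \qquad \frac{\partial\, SSE}{\partial \beta_j} = 0 \quad (j = 0, 1, \dots, k)$$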
ADVANCED DATA ANALYTICS
OLS Solution (cont.)
⚫Simplifying
ADVANCED DATA ANALYTICS
OLS Solution (cont.)
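⚫Converting to matrix form
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}, \qquad \hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{H}\mathbf{Y}, \qquad \mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \ \text{(hat matrix)}$$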
⚫Hat matrix describes the influence of each
observation on the predicted values of the response
variable
⚫Hat matrix plays a crucial role in identifying the
outliers and influential observations in the sample
Source: Medium
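A minimal numerical sketch of the matrix-form solution, assuming the standard normal-equation result beta_hat = (X'X)^-1 X'Y and hat matrix H = X (X'X)^-1 X'. The data are synthetic and the coefficient values are arbitrary; the diagonal of H gives the leverage of each observation.

```python
import numpy as np

# Synthetic data: intercept column plus two predictors.
rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)   # solves (X'X) beta = X'y
H = X @ np.linalg.solve(XtX, X.T)          # hat matrix X (X'X)^-1 X'
y_hat = H @ y                              # H "puts the hat on" y
leverage = np.diag(H)                      # influence of each observation on its own fit

print("estimated coefficients:", beta_hat)
print("sum of leverages (number of parameters):", leverage.sum())
```

Solving the linear system with `np.linalg.solve` is used here instead of forming the inverse explicitly, which is generally the more stable choice.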
ADVANCED DATA ANALYTICS
MLR Model Building
⚫10-step process
ADVANCED DATA ANALYTICS
1. Partial Correlation
⚫ Partial correlation is the correlation between the response variable Y and the
explanatory variable X1 when influence of X2 is removed from both Y and X1
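⚫ In other words, when X2 is kept constant
$$r_{YX_1, X_2} = \frac{r_{YX_1} - r_{YX_2}\, r_{X_1X_2}}{\sqrt{(1 - r_{YX_2}^2)(1 - r_{X_1X_2}^2)}}$$
⚫ $r_{YX_1}$ is the Pearson correlation coefficient between Y and X1 (and similarly for the other pairs)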
⚫ In the Venn Diagram, it is equivalent to removing segments B, C, and E from the
variation in Y and X1
⚫ Once we remove the influence of X2 from both X1 and Y, the segment that is
common between Y and X1 is segment A (out of A + D), that is, the ratio [A/(A +
D)]
⚫ Interpretation: It reveals how X1 and Y are related to each other, independent
of X2's influence on both. It removes the confounding variable's influence from
both original variables.
ADVANCED DATA ANALYTICS
2. Semi-Partial Correlation (or Part Correlation)
⚫ Semi-Partial correlation is the correlation between the response variable Y
and the explanatory variable X1 when the influence of X2 is removed only
from X1 and not from Y
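$$sr_{YX_1, X_2} = \frac{r_{YX_1} - r_{YX_2}\, r_{X_1X_2}}{\sqrt{1 - r_{X_1X_2}^2}}$$
⚫ $r_{YX_1}$ is the Pearson correlation coefficient between Y and X1 (and similarly for the other pairs)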
⚫ In the Venn Diagram, it is equivalent to removing segments C, and E from
the variation in X1 (but C will be retained in Y)
⚫ Once we remove the influence of X2 from X1, the segment that is common
between Y and X1 is segment A (out of A + B + C + D in Y), that is, the ratio
[A/(A + B + C + D)]
⚫ Interpretation: It removes the confounding variable's influence from only one of
the original variables (X1), not from Y
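Both correlations can be computed either from the pairwise Pearson correlations or by explicitly removing the influence of X2 with a regression and correlating what is left. The sketch below, on synthetic data with made-up helper names, checks that the two routes agree and that the squared part correlation equals the gain in R2 when X1 is added to a model that already contains X2.

```python
import numpy as np

def residualize(a, b):
    """Residuals of a after an OLS regression of a on b (with intercept)."""
    B = np.column_stack([np.ones(len(b)), b])
    coef, *_ = np.linalg.lstsq(B, a, rcond=None)
    return a - B @ coef

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def r2(y, *cols):
    """R^2 of an OLS regression of y on the given predictor columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Synthetic data in which X2 influences both X1 and Y.
rng = np.random.default_rng(4)
x2 = rng.normal(size=500)
x1 = 0.6 * x2 + rng.normal(size=500)
y = 1.0 * x1 + 0.5 * x2 + rng.normal(size=500)

r_y1, r_y2, r_12 = corr(y, x1), corr(y, x2), corr(x1, x2)

# Route 1: formulas based on pairwise correlations.
partial_formula = (r_y1 - r_y2 * r_12) / np.sqrt((1 - r_y2**2) * (1 - r_12**2))
semi_formula = (r_y1 - r_y2 * r_12) / np.sqrt(1 - r_12**2)

# Route 2: remove X2 by regression, then correlate.
partial_resid = corr(residualize(y, x2), residualize(x1, x2))  # X2 removed from both
semi_resid = corr(y, residualize(x1, x2))                      # X2 removed from X1 only

# Gain in R^2 from adding X1 to a model that already has X2.
r2_gain = r2(y, x2, x1) - r2(y, x2)

print(partial_formula, partial_resid)
print(semi_formula, semi_resid, r2_gain)
```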
ADVANCED DATA ANALYTICS
Partial and Semi-Partial (Part) Correlation
ADVANCED DATA ANALYTICS
Interpretation of MLR Coefficients - Example
The cumulative television rating points (CTRP) of a
television program, money spent on promotion
(denoted as P), and the advertisement revenue (in
Indian rupees, denoted as R) generated over a one-month
period for 38 different television programs are provided in
Table 10.1.
Given a multiple linear regression model (estimated
from OLS) representing the relationship between the
advertisement revenue (R) generated as response
variable for predictor variables promotions (P) and
CTRP, interpret the meaning of the coefficients.
ADVANCED DATA ANALYTICS
Interpretation of MLR Coefficients – Example (cont.)
Solution:
For every one unit increase in CTRP, the revenue increases by
5931.850 when the variable promotion is kept constant, and for
one unit increase in promotion the revenue increases by 3.136
when CTRP is kept constant
ADVANCED DATA ANALYTICS
Interpretation of MLR Coefficients – Example (cont.)
⚫Assume we have 3 SLR models along with the MLR model
⚫The interpretations of the models are
Model #   Model Equation                     Interpretation of residuals
1         R = β0 + β1 CTRP + β2 P + ε1       Variation in R not explained by CTRP or P
2         R = α0 + α1 CTRP + ε2              Variation in R not explained by CTRP
3         P = γ0 + γ1 CTRP + ε3              Variation in P not explained by CTRP
4
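The residual interpretations above explain why MLR coefficients are called partial regression coefficients. The sketch below uses synthetic numbers (not the data of Table 10.1) and assumes, for illustration, that the fourth regression is the residuals of Model 2 regressed on the residuals of Model 3. Its slope then equals the coefficient of P in the full MLR model.

```python
import numpy as np

def fit(y, *cols):
    """OLS with intercept; returns (coefficients, residuals)."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

# Synthetic stand-ins for CTRP, promotion spend P and revenue R (38 programs).
rng = np.random.default_rng(5)
ctrp = rng.normal(100, 20, size=38)
p = 50 * ctrp + rng.normal(0, 500, size=38)
r = 1000 + 60 * ctrp + 3 * p + rng.normal(0, 800, size=38)

coef_mlr, _ = fit(r, ctrp, p)              # Model 1: R on CTRP and P
_, resid_r = fit(r, ctrp)                  # Model 2: part of R not explained by CTRP
_, resid_p = fit(p, ctrp)                  # Model 3: part of P not explained by CTRP
coef_resid, _ = fit(resid_r, resid_p)      # assumed Model 4: residuals on residuals

print("coefficient of P in the MLR model:   ", coef_mlr[2])
print("slope of residual-on-residual model: ", coef_resid[1])
```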
ADVANCED DATA ANALYTICS
Standardized Regression Coefficients
⚫How can we assess the importance of explanatory variables in an MLR model?
⚫For example, for the model defined in the previous example, the coefficient value for CTRP
is 5931.85 and the coefficient for promotion spend is 3.136. Does this mean CTRP is more
influential than P on revenue?
⚫No, because the unit of measurement for CTRP is different from the unit of
measurement of P
⚫We have to derive standardized regression coefficients to compare the impact of different
explanatory variables that have different units of measurement
⚫Since the regression coefficients cannot be compared directly due to difference in scale
and units of measurements of variables, one has to normalize the data to compare the
regression coefficients and their impact on the response variable
ADVANCED DATA ANALYTICS
Standardized Regression Coefficients (cont.)
⚫A regression model can be built on standardized dependent variable and standardized
independent variables, the resulting regression coefficients are then known as
standardized regression coefficients
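⚫The standardized regression coefficient can also be calculated using the following formula
$$\text{Standardized } \hat{\beta}_i = \hat{\beta}_i \times \frac{S_{X_i}}{S_Y}$$
where $S_{X_i}$ and $S_Y$ are the standard deviations of $X_i$ and $Y$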
⚫The standardized regression coefficients can be interpreted as follows: For one standard
deviation change in the explanatory variable, the standardized regression coefficient
captures the number of standard deviations by which the response variable will change
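The two routes to standardized coefficients can be compared directly. The sketch below uses synthetic data with two predictors on very different scales; it assumes the rescaling rule "raw coefficient times (standard deviation of X / standard deviation of Y)" and checks it against refitting on z-scored variables.

```python
import numpy as np

def ols_slopes(X, y):
    """Slopes (intercept dropped) of an OLS regression of y on the columns of X."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

# Synthetic predictors measured in very different units.
rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(50, 10, size=n)       # large-scale predictor
x2 = rng.normal(0.5, 0.1, size=n)     # small-scale predictor
y = 2.0 * x1 + 150.0 * x2 + rng.normal(0, 5, size=n)
X = np.column_stack([x1, x2])

b = ols_slopes(X, y)                  # raw coefficients: x2 looks far more important

# Route 1: rescale the raw coefficients.
beta_std_formula = b * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Route 2: standardize every variable, then refit.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_std_refit = ols_slopes(Z, zy)

print("raw coefficients:         ", b)
print("standardized coefficients:", beta_std_formula)
```

In this made-up example the ranking flips after standardization, which is the point of the slide: raw coefficients reflect units as well as influence.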
ADVANCED DATA ANALYTICS
Regression Models with Qualitative Variables
⚫In MLR, many predictor variables are likely to be qualitative or categorical variables
⚫We cannot include them directly in the model; we have to pre-process the categorical variables using
dummy variables for building a regression model
⚫For example, consider a regression model between dependent variable Y and an independent variable,
marital status (MS)
⚫It is incorrect to enter MS into the model as a single numerically coded variable, as this forces the
effect on Y to be 3 times higher for marital status divorced than for marital status single
ADVANCED DATA ANALYTICS
Regression Models with Qualitative Variables (cont.)
⚫The correct model specification is shown below
⚫Whenever we have n levels (or categories) for a qualitative variable (categorical variable), we
will use (n − 1) dummy variables, where each dummy variable is a binary variable used for
representing whether an observation belongs to a category or not
⚫The reason why we create only (n − 1) dummy variables is that including dummy variables for
all categories together with the constant in the regression equation creates perfect
multi-collinearity (discussed later), and the matrix XᵀX in the MLR solution becomes singular
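The (n − 1) rule can be checked with a rank computation. In the sketch below the category labels are assumed for illustration (the slides mention single and divorced; "married" is added here), and the first level is taken as the base category.

```python
import numpy as np

# Hypothetical marital-status column with three levels.
status = np.array(["single", "married", "divorced", "married", "single", "divorced"])
levels = ["single", "married", "divorced"]   # first level acts as the base category

# Correct coding: n - 1 = 2 dummy variables plus the intercept.
dummies = np.column_stack([(status == lvl).astype(float) for lvl in levels[1:]])
X_ok = np.column_stack([np.ones(len(status)), dummies])

# Incorrect coding: a dummy for every level plus the intercept.
all_dummies = np.column_stack([(status == lvl).astype(float) for lvl in levels])
X_bad = np.column_stack([np.ones(len(status)), all_dummies])

rank_ok = np.linalg.matrix_rank(X_ok)     # equals the number of columns
rank_bad = np.linalg.matrix_rank(X_bad)   # one less than the number of columns

print(f"n-1 dummies: rank {rank_ok} of {X_ok.shape[1]} columns")
print(f"n dummies:   rank {rank_bad} of {X_bad.shape[1]} columns (X'X is singular)")
```

In the second design the dummy columns add up to the intercept column, which is exactly the perfect multi-collinearity described above.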
ADVANCED DATA ANALYTICS
Regression Models with Qualitative Variables - Example
⚫The data in Table 10.12 provides salary and
educational qualifications of 30 randomly chosen
people in Bangalore. Build a regression model to
establish the relationship between salary earned
and their educational qualifications.
⚫Note that, if we build a model Y = b0 + b1 ×
Education, it will be incorrect. We have to use 3
dummy variables since there are 4 categories for
educational qualification.
ADVANCED DATA ANALYTICS
Regression Models with Qualitative Variables - Example
⚫Data in Table 10.12 has to be pre-processed using 3 dummy variables (HS, UG, and PG)
as shown in Table 10.13.
⚫The corresponding regression model is
$$Y = \beta_0 + \beta_1\, HS + \beta_2\, UG + \beta_3\, PG + \varepsilon$$
⚫The fourth category (none) for which we did not create an explicit dummy variable is
called the base category
⚫The SPSS output for the regression model is shown in Table 10.14 (all the dummy
variables are statistically significant at alpha = 0.01 as p-values are all less than alpha)
Advanced Data Analytics
Unit 2: Regression Analysis
MLR – Validation
Slides excerpted from: U. Dinesh Kumar,
“Business Analytics”, Wiley, 2nd Edition 2022
With grateful thanks for contribution of slides to:
Dr. Gowri Srinivasa, Professor at the Department of CSE, PESU
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML
ADVANCED DATA ANALYTICS
In this session
• Validation of the MLR model
1. Co-efficient of multiple determination (R2) and Adjusted R2
2. Hypothesis test for individual variables (t-test)
3. Analysis of variance for overall model validity (F-test)
4. Residual analysis
5. Presence of Multi-collinearity
6. Check for auto-correlation
7. Outlier analysis (Distance measures to find influential points – there
are metrics and this is not covered)
ADVANCED DATA ANALYTICS
1. Coefficient of Determination (R2)
⚫Total variation (SST) can be broken down into variation explained by the model (SSR) and
variation not explained by the model (SSE), that is, SST = SSR + SSE
ADVANCED DATA ANALYTICS
1. Coefficient of Multiple Determination (R2) and Adjusted R2
⚫In case of MLR, SSE will decrease as the number of explanatory
variables increases, and SST remains constant
⚫Recall that R-square is defined as
$$R^2 = 1 - \frac{SSE}{SST}$$
⚫So, it is possible that R-square will increase even when there is no
statistically significant relationship between the explanatory variable
and the response variable
⚫To counter this, the R2 value is adjusted by normalizing both SSE and SST
with the corresponding degrees of freedom
⚫The adjusted R-square with k predictors is given by
$$\text{Adjusted } R^2 = 1 - \frac{SSE/(n - k - 1)}{SST/(n - 1)}$$
Note: MSE is calculated by dividing the Sum of Squared Errors (SSE) by its degrees of
freedom, which is n - (k + 1) (for k independent variables and the intercept).
SST measures the total variability of the dependent variable around its mean,
and estimating this mean uses up one degree of freedom, leaving n - 1 degrees
of freedom that are free to vary.
ADVANCED DATA ANALYTICS
1. Coefficient of Multiple Determination (R2) and Adjusted R2 (cont.)
⚫For an equation with k variables, we have the freedom to choose k-1 of them
⚫In terms of regression, for a dataset with n observations and k predictors (k+1 partial regression
coefficients), the number of values we can freely choose is n-k-1
⚫While R-square is a non-decreasing function of the number of predictors, adjusted R-square is not
⚫The adjusted R-square value is always less than or equal to the R-square value
⚫No increase in adjusted R-square after adding a new predictor variable to the model may
indicate that the newly added variable is not statistically significant, or that it does not explain
variation in the response variable beyond what is explained by the variables already
present in the model
⚫Adjusted R² increases or decreases on addition of a new variable depending on how the
addition affects the overall explanatory power of the model, which R² cannot always
indicate. Hence adjusted R² is a more reliable metric than R² when variables are added
or removed during the regression process.
ADVANCED DATA ANALYTICS
1. Coefficient of Multiple Determination (R2) and Adjusted R2 (cont.)
⚫ R² is adjusted by dividing RSS and TSS by their corresponding
degrees of freedom
⚫Since RSS (residual sum of squares, also written SSE) is measured about the
regression line and TSS (total sum of squares, also written SST) is measured about
the mean value of ‘Y’, we divide RSS by the residual degrees of freedom (n - k - 1)
and TSS by its degrees of freedom (n - 1).
⚫Adjusted R² = 1 - [SSE/(n - k - 1)] / [SST/(n - 1)]
=> Adjusted R² = 1 - [(n - 1)/(n - k - 1)] × (SSE / SST)
Since RSS/TSS = 1 - R², the formula can also be written in the form:
Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1)
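As a quick check that the two forms agree, the sketch below evaluates both, reusing the sums of squares from the earlier exercise in this unit (SSres = 49.773, SSreg = 946.181, n = 16, k = 4). The function names are made up for this example.

```python
def adjusted_r2_from_ss(sse, sst, n, k):
    """Adjusted R^2 = 1 - [SSE/(n-k-1)] / [SST/(n-1)]."""
    return 1.0 - (sse / (n - k - 1)) / (sst / (n - 1))

def adjusted_r2_from_r2(r2, n, k):
    """Equivalent form: 1 - (1 - R^2)(n-1)/(n-k-1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

sse = 49.773
sst = 946.181 + 49.773      # SST = SSreg + SSres
n, k = 16, 4

r2 = 1.0 - sse / sst
adj_a = adjusted_r2_from_ss(sse, sst, n, k)
adj_b = adjusted_r2_from_r2(r2, n, k)

print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_a:.4f}")
```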
ADVANCED DATA ANALYTICS
2. Hypothesis test for Individual Variables in MLR – t-test
⚫Checking the statistical significance of individual variables is achieved through t-test
⚫The estimate of the regression coefficients is given by
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$
⚫The estimated value of regression coefficient is a linear function of the response variable Y
⚫Since we assume that the residuals follow normal distribution, Y follows a normal
distribution, and the estimate of regression coefficient also follows a normal distribution
⚫Since the standard deviation of the regression coefficient is estimated from the sample, we
use a t-test
⚫The null and alternative hypotheses in the case of an individual independent variable Xi and
the dependent variable Y are given, respectively, by
$$H_0: \beta_i = 0 \qquad H_A: \beta_i \neq 0$$
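A sketch of how the t-statistics could be computed by hand, assuming the usual result that the standard error of each coefficient is the square root of MSE times the corresponding diagonal element of (X'X)^-1. The data are synthetic: x1 has a real effect on y and x2 has none. Each statistic would then be compared with a t-distribution with n - k - 1 degrees of freedom to obtain a p-value.

```python
import numpy as np

# Synthetic data: only x1 actually influences y.
rng = np.random.default_rng(7)
n, k = 100, 2
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # OLS estimates

resid = y - X @ beta_hat
mse = (resid @ resid) / (n - k - 1)          # SSE divided by its degrees of freedom
se = np.sqrt(mse * np.diag(XtX_inv))         # standard error of each coefficient
t_stats = beta_hat / se                      # t-statistic for H0: beta_i = 0

for name, t in zip(["intercept", "x1", "x2"], t_stats):
    print(f"{name:9s} t = {t:7.2f}")
```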
ADVANCED DATA ANALYTICS
2. Hypothesis test for Individual Variables in MLR – t-test (cont.)
Two-tailed test:
ADVANCED DATA ANALYTICS
3. Analysis of Variance for Overall Model Validity – F-test
⚫Analysis of Variance (ANOVA) is used to validate the overall regression model
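⚫If there are k independent variables in the model, then the null and the alternative
hypotheses are
$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \qquad H_A: \text{not all } \beta_i \text{ are zero}$$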
⚫Note that the statement in alternative hypothesis is that “not all betas are zero”, that is,
some of those beta values may be zero
⚫That is the reason why we have to do the t-test to check the existence of statistically
significant relationship between individual explanatory variables and the response variable
⚫F-statistic is given by (notice the degrees of freedom)
$$F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n - k - 1)}$$
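A small worked example of the overall F-statistic, assuming the standard definition F = (SSR/k) / (SSE/(n - k - 1)) and reusing the numbers from the earlier exercise in this unit (SSreg = 946.181, SSres = 49.773, n = 16, k = 4). The function name is illustrative.

```python
def overall_f(ssr, sse, n, k):
    """F = MSR / MSE with k and n - k - 1 degrees of freedom."""
    msr = ssr / k                 # mean square due to regression
    mse = sse / (n - k - 1)       # mean square error
    return msr / mse

f_stat = overall_f(ssr=946.181, sse=49.773, n=16, k=4)
print(f"F = {f_stat:.2f} with (4, 11) degrees of freedom")
```

The statistic would then be compared with the critical value of the F-distribution with (k, n - k - 1) degrees of freedom at the chosen significance level.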
DATA ANALYTICS
3. Analysis of Portions of a Model – Partial F-test
⚫In many cases, data scientists may like to validate the portions of the model or a subset of
explanatory variables
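⚫Assume that the data set has N observations (sample size), and we define two models:
a full model which has k independent variables and a reduced model which has r
independent variables (r < k), defined as follows
$$\text{Full model: } Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon$$
$$\text{Reduced model: } Y = \beta_0 + \beta_1 X_1 + \dots + \beta_r X_r + \varepsilon$$
⚫Objective of partial F-test: check whether the additional variables ($X_{r+1}, X_{r+2}, \dots, X_k$) in the
full model are statistically significant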
DATA ANALYTICS
3. Analysis of Portions of a Model – Partial F-test (cont.)
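⚫The corresponding partial F-test has the following null and alternative hypotheses
$$H_0: \beta_{r+1} = \beta_{r+2} = \dots = \beta_k = 0 \qquad H_A: \text{not all of } \beta_{r+1}, \dots, \beta_k \text{ are zero}$$
⚫Partial F statistic
$$F = \frac{(SSE_R - SSE_F)/(k - r)}{SSE_F/(N - k - 1)}$$
where $SSE_R$ and $SSE_F$ are the sums of squared errors of the reduced and full models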
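A sketch of the partial F-test on synthetic data, assuming the usual form of the statistic: the drop in SSE from the reduced to the full model, divided by the number of added variables (k - r), over the full model's MSE. Here the full model has k = 3 predictors and the reduced model keeps only the first (r = 1); the second predictor has a real effect, so the added block should come out as significant.

```python
import numpy as np

def sse(X, y):
    """Sum of squared errors of an OLS fit of y on design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

# Synthetic data: X1 and X2 matter, X3 does not.
rng = np.random.default_rng(8)
N, k, r = 120, 3, 1
X_all = rng.normal(size=(N, k))
y = 1.0 + 2.0 * X_all[:, 0] + 1.5 * X_all[:, 1] + rng.normal(size=N)

ones = np.ones((N, 1))
X_full = np.hstack([ones, X_all])             # intercept + k predictors
X_reduced = np.hstack([ones, X_all[:, :r]])   # intercept + first r predictors

sse_full = sse(X_full, y)
sse_reduced = sse(X_reduced, y)

partial_f = ((sse_reduced - sse_full) / (k - r)) / (sse_full / (N - k - 1))
print(f"partial F = {partial_f:.2f} with ({k - r}, {N - k - 1}) degrees of freedom")
```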
ADVANCED DATA ANALYTICS
4. Residual Analysis
⚫Residual analysis is important for checking assumptions about normal distribution of
residuals, homoscedasticity, and the functional form of a regression model
⚫If the residuals do not follow normal distribution, then we cannot trust the p-values of t-
test and F-test since for the statistic to follow t-distribution and F-distribution, the
residuals should follow normal or approximate normal distribution
⚫There are many reasons why residuals may not be normal; one such case is
misspecification of functional form of regression, that is, the data scientist may have used
linear model instead of log-linear or log-log model
⚫Similar to residual analysis of SLR
⚫ Construct P-P plots
ADVANCED DATA ANALYTICS
References
⚫Business Analytics by U. Dinesh Kumar – Wiley 2nd Edition, 2022 Chapter 10
⚫U. Dinesh Kumar’s slides
⚫Linear Regression pt 3, Medium: [Link]
⚫R-Squared, Adjusted R-Squared and the Degree of Freedom, Medium:
[Link]
80e7203a7e27
THANK YOU
Dr. Bhaskarjyoti Das
Professor, Department of Computer Science
and Engineering in AI & ML , PES University,
Bengaluru
Email: bhaskarjyotidas@[Link]
ADVANCED DATA ANALYTICS
UE23AM343AB1
UNIT-2
Regularization via Objective Functions
in Multivariate Linear Regression
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering
in AI & ML
Advanced Data Analytics
Regularization via Objective Functions
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML
Advanced Data Analytics
Why do we need Regularization?
Overfitting happens when a model fits the training
data too well, including the noise and irregularities
present in the training data, rather than capturing
the true underlying relationship or pattern.
Overfit models show poor performance on unseen
data. (Overfitting is associated with high variance
and low bias)
Solution: Using validation (also called ‘holdout’ or
‘development set’) data, feature selection, data
augmentation, early stopping, cross-validation,
regularization, etc.
Advanced Data Analytics
What is Regularization?
Regularization is a technique used in machine learning to prevent
overfitting and improve a model’s ability to generalize to new data. It
does this by adding a penalty term, called a regularization term, to the
model’s objective (cost) function. This term is typically based on the
model’s parameters.
Typically, regularization trades a marginal decrease in training accuracy
for an increase in generalizability.
Intuitively, the idea of regularization is to penalize complex models that
may overfit the data.
Advanced Data Analytics
Norm-based Penalty
•
Advanced Data Analytics
Techniques of Regularization (via Objective Functions)
• L2 Regularization (Ridge Regression)
• L1 Regularization (Lasso Regression)
• Elastic Net Regularization (A combination of lasso and ridge regression)
Advanced Data Analytics
L2 Regularization - Ridge Regression
Ridge Regression shrinks the less important features’ coefficients to NEAR zero.
The effect of some features on the model can be greatly reduced, but never made nil.
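Here, the L2 Regularization term is:
$$\lambda \sum_{j=1}^{p} \beta_j^2$$
Considering MSE (Mean Squared Error) as the loss function, the updated loss
function with the regularization term is:
$$\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$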
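For ridge regression the penalized loss has a closed-form minimizer. The sketch below assumes the loss (1/n) * ||y - Xb||^2 + lam * ||b||^2 on centred data (so that no intercept is needed), for which setting the gradient to zero gives b = (X'X + n * lam * I)^-1 X'y. The data and the values of `lam` are arbitrary illustrations.

```python
import numpy as np

def ridge(X, y, lam):
    """Minimise (1/n)*||y - X b||^2 + lam*||b||^2 for centred X and y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# Synthetic data: one strong feature, one weak feature, one irrelevant feature.
rng = np.random.default_rng(9)
n = 100
X = rng.normal(size=(n, 3))
y = X @ np.array([3.0, 0.5, 0.0]) + rng.normal(size=n)
X = X - X.mean(axis=0)
y = y - y.mean()

b_ols = ridge(X, y, 0.0)      # lam = 0 reduces to ordinary least squares
b_small = ridge(X, y, 0.1)
b_large = ridge(X, y, 10.0)

print("lam = 0   :", b_ols)
print("lam = 0.1 :", b_small)
print("lam = 10  :", b_large)   # heavily shrunk, but no coefficient is exactly zero
```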
Advanced Data Analytics
L2 Regularization – The Algorithm (for understanding)
Algorithms and Optimizations in Machine Learning
L2 Regularization – The Algorithm (for understanding) continued
Advanced Data Analytics
L1 Regularization - Lasso Regression
Lasso Regression (Least Absolute Shrinkage and Selection Operator) shrinks the less
important features’ coefficients to zero, thus removing some features altogether.
This works well for feature selection in case we have a huge number of features.
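Here, the L1 Regularization term is:
$$\lambda \sum_{j=1}^{p} |\beta_j|$$
Considering MSE (Mean Squared Error) as the loss function, the updated loss
function with the regularization term is:
$$\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$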
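Unlike ridge, the Lasso has no closed-form solution, but it can be fitted one coefficient at a time. The sketch below uses cyclic coordinate descent with soft-thresholding, which is one common way of fitting the Lasso and not necessarily the algorithm shown on the following slides. It assumes the loss (1/n) * ||y - Xb||^2 + lam * sum|b_j| on centred data, for which the per-coordinate threshold works out to n * lam / 2.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/n)*||y - X b||^2 + lam*sum|b_j| (centred data)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            partial_resid = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ partial_resid
            b[j] = soft_threshold(rho, n * lam / 2.0) / col_sq[j]
    return b

# Synthetic data: only the first two of five features are relevant.
rng = np.random.default_rng(10)
n = 200
X = rng.normal(size=(n, 5))
y = X @ np.array([4.0, -3.0, 0.0, 0.0, 0.0]) + rng.normal(size=n)
X = X - X.mean(axis=0)
y = y - y.mean()

b_lasso = lasso_cd(X, y, lam=1.0)
print(b_lasso)   # irrelevant features are set exactly to zero; the others are shrunk
```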
Advanced Data Analytics
L1 Regularization – The Algorithm (for understanding)
Advanced Data Analytics
L1 Regularization – The Algorithm (for understanding) continued
Advanced Data Analytics
L1 Regularization - Sparsity
•
Advanced Data Analytics
L1 Regularization – Sparsity continued
The parameters chosen by both models are the points with
the smallest MSE loss that lie within the light blue
feasible region. To find these points, we look for
the level curve of the loss that is tangent to the feasible
region (this is shown in the figure).
The shape of the L2 feasible region is round and it’s
unlikely that the tangent point will be one that is
sparse.
In the L1 case, the level curve will most likely be
tangent to the L1 feasible region at a “vertex” of the
diamond. These “vertices” are aligned with the axes—
therefore at these points some of the coefficients are
exactly zero. This is the intuition for why L1 produces
sparse parameter vectors.
Advanced Data Analytics
Why is Sparsity Useful?
• It makes the model more interpretable. If we have a large number of features,
the Lasso will set most of their parameters to zero, thus effectively excluding them.
This allows us to focus our attention on a small number of relevant features.
• Sparsity can also make models computationally more tractable. Once the model
has set certain weights to zero, we can ignore their corresponding features
entirely. This saves us from spending any computation or memory on these
features.
Advanced Data Analytics
L1 and L2: Differences
Advanced Data Analytics
Elastic Net Regression
• What is Elastic Net Regression?
Elastic net regression is a linear regression technique that combines the strengths of
Ridge regression (L2 regularization) and Lasso regression (L1 regularization).
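One common way to write the elastic net objective is the MSE plus both penalties, each with its own weight. The sketch below assumes that parameterisation (the names `lam1` and `lam2` are illustrative; libraries often use a single strength and a mixing ratio instead) and shows that it reduces to the Lasso or ridge loss when one of the weights is zero.

```python
import numpy as np

def elastic_net_loss(X, y, b, lam1, lam2):
    """MSE + lam1 * sum|b_j| (L1 part) + lam2 * sum b_j^2 (L2 part)."""
    mse = np.mean((y - X @ b) ** 2)
    return mse + lam1 * np.sum(np.abs(b)) + lam2 * np.sum(b ** 2)

# Tiny hand-checkable example: b fits y exactly, so only the penalties contribute.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0])

loss_plain = elastic_net_loss(X, y, b, 0.0, 0.0)   # plain MSE
loss_lasso = elastic_net_loss(X, y, b, 0.5, 0.0)   # Lasso loss: 0.5 * (1 + 2)
loss_ridge = elastic_net_loss(X, y, b, 0.0, 0.5)   # ridge loss: 0.5 * (1 + 4)
loss_enet = elastic_net_loss(X, y, b, 0.5, 0.5)    # both penalties together

print(loss_plain, loss_lasso, loss_ridge, loss_enet)
```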
Advanced Data Analytics
Elastic Net – The Algorithm (for understanding)
Advanced Data Analytics
Elastic Net – The Algorithm (for understanding) continued
Advanced Data Analytics
Why do we need Elastic Net Regression?
• Limitation of Ridge Regression – Ridge regression does not eliminate
coefficients in our model even if the variables are irrelevant, which can lead to
overfitting when dealing with more features than observations.
• Solution offered by Elastic Net Regression – Elastic Net combines the feature elimination
aspect of Lasso with the coefficient shrinkage of Ridge, so irrelevant variables can be
dropped while the remaining coefficients are shrunk rather than eliminated
• Limitation of LASSO Regression – Lasso regression struggles with
multicollinearity, where it might randomly choose one of the multicollinear
variables without understanding the context, potentially eliminating relevant
independent variables
• Solution offered by Elastic Net Regression - Elastic Net addresses this issue by
incorporating both L1 and L2 regularization, offering a compromise that attempts to
shrink and do a sparse selection simultaneously, thus providing a more stable and
interpretable model
Advanced Data Analytics
THANK YOU
Dr. Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML
bhaskarjyoti01@[Link]