Artificial Intelligence and Machine Learning
Module 2.3 – Application of Exploratory Data Analysis
Prof. Sridevi Tandley
Agenda – Module 2.1 & 2.2
1. Introduction to Basic Statistical Learning & Data Sampling Method
2. Feature Engineering Techniques
• What is a Feature? What is feature Engineering?
• What is its importance and why it is used?
• Main processes of Feature Engineering
• Widely Used Feature Engineering Techniques.
• Imputation.
• Binning
• Ordinal Encoding
• One-hot Encoding.
• Feature Splitting
• Handling Outliers.
• Transformations.
• Scaling (Normalization & Standardization).
Agenda – Module 2.3 - Application of Exploratory Data Analysis
• Why Exploratory Data Analysis (EDA) for ML models?
• EDA for ML Models – Step by Step Process followed in Real World
• Import collected Data
• Data Cleaning and Preprocessing
• Inspecting Data - Fill Rate
• Descriptive statistics
• Data Visualization
• Distribution (Normal & Binomial)
• Bivariate Analysis
• Multivariate Analysis: Correlation Analysis
• Feature Engineering
• Variance Inflation Factor (VIF)
• Hypothesis testing and Confidence Intervals
• EDA for Different ML Problems
• Common Pitfalls in EDA for ML
• Best Practices followed in EDA
Application of Exploratory Data Analysis (EDA)
EDA Definition: EDA is an approach to analyzing data sets to summarize their main
characteristics, often with visual methods.
Exploratory Data Analysis refers to the crucial process of performing initial investigations on data to discover
patterns and check assumptions with the help of summary statistics and graphical representations.
• EDA can be leveraged to check for outliers, patterns, and trends in the given data.
• EDA helps to find meaningful patterns in data.
• EDA provides in-depth insights into the data sets to solve our business problems.
• EDA gives clues on how to impute missing values in the dataset.
Importance of EDA in ML pipeline:
• Helps understand data structure and relationships
• Guides feature selection and engineering
• Informs model selection and hyperparameter tuning
Goals of EDA for ML:
• Identify patterns, anomalies, and relationships in data
• Check assumptions required for ML models
• Inform data preprocessing and feature engineering decisions
EDA - Process Steps
• Import collected Data
• Data cleaning and preprocessing
• Inspecting Data - Prints the columns and basic statistics (count, mean, std, min, etc.) of the dataset.
• Calculating Fill Rate and Count - Calculates the fill rate (percentage of non-null values) and count of non-null values for each
column.
• Descriptive statistics
• Data visualization
• Plotting Distribution Charts - Plots histograms for each column to visualize the distribution of data (Normal, Binomial, Poisson)
• Plotting Boxplots for Relationships - Plots boxplots to analyze the relationship between each numerical feature (column) and
the target variable
• Feature Engineering
• Variance Inflation Factor (VIF) - Calculates VIF for each numerical feature to detect multicollinearity
among predictors.
• Hypothesis testing and confidence intervals
Iterative process: Findings at each stage may require revisiting previous steps
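The fill rate step above can be sketched in a few lines of pandas. This is a minimal illustration, not the course's reference code: the DataFrame and its column names (age, city) are hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [27, 28, np.nan, 34, 23],
    "city": ["Chennai", None, "Delhi", "Hyd", None],
})

# count() returns the number of non-null values per column.
non_null_count = df.count()

# Fill rate = non-null values / total rows, expressed as a percentage.
fill_rate = non_null_count / len(df) * 100

summary = pd.DataFrame({"count": non_null_count, "fill_rate_pct": fill_rate})
print(summary)
```

Columns with a low fill rate are the first candidates for imputation or removal in the cleaning step.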
EDA - Process Step 1 – Importing Data
Data can arrive in various files and formats, so importing it correctly is a crucial first task
• Importing Text files
• Reading text files is similar to reading CSV files. The only nuance is that you need to specify a separator with the sep argument, as
shown below. The separator is the symbol that separates the columns (fields) within each row of the file. Here the regular
expression \s+ treats any run of whitespace as the separator.
df = pd.read_csv("[Link]", sep=r"\s+")
• Importing Excel files (single sheet)
• Reading Excel files (both XLS and XLSX) is as easy as calling the read_excel() function with the file path as input. You can also
specify one additional argument, sheet_name, passing either a string for the sheet name or an integer for the sheet
position (note that Python uses 0-indexing, so the first sheet is sheet_name=0 and sheet_name=1 reads the second sheet)
df = pd.read_excel('diabetes_multi.xlsx', sheet_name=1)
• Importing JSON file
• Similar to the read_csv() function, you can use read_json() for JSON file types with the JSON file name as the argument. The
code below first builds a DataFrame from a dictionary and then reads a JSON file from disk into the DataFrame object df.
C_Data = {'Name': ['Bharath','Bharath','Sumathi','Lalit','Abdul'],
          'Age': [27,28,33,34,23],
          'Address': ['Chennai','Chennai','Delhi','Hyd','bangalore'],
          'Company': ['Aspire system','Aspire system','Aspire system','Aspire system','Aspire system']}
df = pd.DataFrame(C_Data)
df = pd.read_json('[Link]')
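A self-contained sketch of the text and JSON imports described above. To avoid depending on files on disk, it uses in-memory buffers (io.StringIO) as stand-ins; the file contents are hypothetical.

```python
import io

import pandas as pd

# In-memory stand-ins for files on disk (hypothetical contents).
text_data = io.StringIO("Name Age\nBharath 27\nSumathi 33\n")
json_data = io.StringIO('[{"Name": "Lalit", "Age": 34}, {"Name": "Abdul", "Age": 23}]')

# sep=r"\s+" splits columns on one or more whitespace characters.
df_text = pd.read_csv(text_data, sep=r"\s+")

# read_json accepts a file path or a file-like object.
df_json = pd.read_json(json_data)

print(df_text)
print(df_json)
```

With real files, the buffers would simply be replaced by the file paths.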
EDA - Process Step 2 - Data Cleaning and Preprocessing
• Data cleaning means fixing bad data in your data set. Bad data could be:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
• The main goal of data understanding is to gain general insights about the data: the number
of rows and columns, the values in the data, the datatypes, and the missing values in the dataset.
• Data Inspection - The quality of the data needs to be assessed first
• shape – displays the number of observations (rows) and features (columns) in the dataset
• info() – shows each variable's non-null count and datatype
• nunique() – returns the number of unique values in each column; together with the data description, this helps identify the
continuous and categorical columns
• isnull() – widely used in pre-processing steps to identify null values in the data
• Data Cleaning & Preprocessing
Replace empty values on a case-by-case basis, separately for categorical and numerical variables
• Replace using mean, median, or mode
• drop_duplicates() - Duplicate rows can bias the model and distort accuracy estimates, so they are usually dropped
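The inspection calls listed above can be tried on a small example. This is a minimal sketch; the DataFrame is hypothetical and contains one duplicate row and one missing value on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicate row and one missing value.
df = pd.DataFrame({
    "Name": ["Bharath", "Bharath", "Sumathi", "Lalit", "Abdul"],
    "Age": [27, 27, 33, np.nan, 23],
    "Address": ["Chennai", "Chennai", "Delhi", "Hyd", "Bangalore"],
})

print(df.shape)            # (rows, columns)
df.info()                  # datatypes and non-null counts per column
print(df.nunique())        # number of distinct values per column
print(df.isnull().sum())   # number of missing values per column

# Drop exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates()
print(deduped.shape)
```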
EDA - Process Step 2 - Data Cleaning and Preprocessing (Cont…)
A few techniques for handling missing values:
Drop the missing values – If the dataset is huge and the missing values are very few, we
can drop the affected rows directly because doing so will not have much impact.
Replace with mean values – We can replace the missing values with the mean, but this
is not advisable when the data has outliers, because outliers pull the mean.
Replace with median values – We can replace the missing values with the median, which
is recommended when the data has outliers.
Replace with mode values – We can do this in the case of a categorical feature.
Regression – A regression model can be used to predict the missing value from the other columns of the dataset.
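A minimal sketch of the first four techniques, using hypothetical data in which one salary value (500) is an outlier. It shows why the median is the safer fill value in that case; regression-based imputation is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 500.0 is an outlier in the numerical column.
df = pd.DataFrame({
    "salary": [30.0, 32.0, np.nan, 31.0, 500.0],
    "city": ["Delhi", "Delhi", None, "Hyd", "Delhi"],
})

# Option 1: drop rows that contain any missing value.
dropped = df.dropna()

# Option 2: fill with the mean (pulled upward by the outlier).
mean_filled = df["salary"].fillna(df["salary"].mean())

# Option 3: fill with the median (robust to the outlier).
median_filled = df["salary"].fillna(df["salary"].median())

# Option 4: fill a categorical column with its most frequent value.
mode_filled = df["city"].fillna(df["city"].mode()[0])

print(mean_filled[2], median_filled[2], mode_filled[2])
```

Here the mean fill is 148.25 while the median fill is 31.5, which is much closer to the typical salaries in the column.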
EDA - Process Step 3 - Descriptive Statistics
• Descriptive statistics give a quick and simple description of the data.
• They can include count, mean, median, mode, minimum value, maximum value, range,
standard deviation, etc.
• The statistics summary gives a high-level idea of whether the data has any outliers or data entry errors, and of
how the data is distributed (for example, normally distributed or left/right skewed).
• We will use the describe() method, which shows basic statistical characteristics of each numerical feature
(int64 and float64 types): number of non-missing values, mean, standard deviation, minimum, maximum, and the
0.25, 0.50 (median), and 0.75 quantiles.
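A short sketch of describe() on hypothetical workout data (the column names Duration and Calories are borrowed from the correlation example later in this module; the values are made up). The range is not reported directly, so it is derived from the min and max rows.

```python
import pandas as pd

# Hypothetical values for illustration.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 30],
    "Calories": [400, 380, 300, 280, 200],
})

# Rows of the result: count, mean, std, min, 25%, 50%, 75%, max.
stats = df.describe()
print(stats)

# The median is the 50% row; the range is max minus min.
median = stats.loc["50%"]
value_range = stats.loc["max"] - stats.loc["min"]
print(value_range)
```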
EDA - Process Step 4 - Data Visualization
• Bivariate analysis helps to understand how variables are related to each other, including the relationship
between the dependent and independent variables in the dataset.
• We must decide which charts to plot to better understand the data. In this module, we visualize our data
using the Matplotlib and Seaborn libraries.
• Categorical variables can be visualized using a count plot, bar chart, pie plot, etc.
• Numerical variables can be visualized using a histogram, box plot, density plot, etc.
• E.g.: Plotting boxplots for relationships - boxplots are plotted to analyze the relationship between each numerical
feature (column) and the target variable
• Python libraries – widely used are:
• Matplotlib is a Python 2D plotting library used to draw basic charts.
• Seaborn is a Python library built on top of Matplotlib that uses short lines of code to create and style statistical plots from
Pandas and NumPy data
EDA - Process Step 4 - Data Visualization - Distribution
Plots histograms for each column to visualize the distribution of data
• Normal Distribution
• The normal distribution is informally called a bell curve
• The shape of the curve varies depending on how spread out the data is.
• If the data has a very high range and standard deviation, the curve is spread out and flatter, since a
large number of values lie far from the mean
• If the standard deviation is low, most of the values lie close to the mean, so a randomly drawn value has a high probability
of being near the mean and the curve is much narrower and taller.
• The higher the standard deviation, the wider and flatter the curve.
• Binomial Distribution
• The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent
Bernoulli trials (each having only yes/no or true/false outcomes) with the same success probability
• Many of the situations we encounter are of the pass-fail type. The Democrats either win or lose the election. A coin toss
gives either heads or tails. You either win or lose your football game (assuming that there is always a forced outcome). So
there are only two outcomes – win and lose, or success and failure. The likelihood of the two may or may not be the same.
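The two distributions above can be explored by drawing random samples with NumPy. This is a sketch under assumed parameters (means, standard deviations, n and p are all illustrative choices): two normal samples with different spreads, and a binomial sample counting heads in 10 fair coin tosses.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Two normal samples with the same mean but different standard deviations.
narrow = rng.normal(loc=50, scale=2, size=10_000)
wide = rng.normal(loc=50, scale=10, size=10_000)

# Binomial: number of heads in n=10 independent fair coin tosses,
# repeated 10,000 times.
heads = rng.binomial(n=10, p=0.5, size=10_000)

# The wide sample has the larger spread, so its histogram is flatter.
print(narrow.std(), wide.std())

# The average number of heads should be close to n * p = 5.
print(heads.mean())
```

Plotting histograms of narrow and wide on the same axes would show the narrower, taller curve against the wider, flatter one.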
EDA - Process Step 4 - Data Visualization – Bivariate Analysis
For numerical variables, pair plots and scatter plots are widely
used for bivariate analysis.
A stacked bar chart can be used for categorical variables if the
output variable is categorical (a classification target). Bar plots
can be used if the output variable is continuous.
In this example, a pair plot is used to show the pairwise
relationships between the numerical variables.
# df is assumed to be the DataFrame loaded earlier
plt.figure(figsize=(13,17))
sns.pairplot(data=df.drop(['Kilometers_Driven','Price'],axis=1))
plt.show()
EDA - Process Step 4 - Data Visualization – Multivariate Analysis
using Correlation Matrix
Multivariate analysis looks at more than two variables. Multivariate analysis is one of the most useful methods to
determine relationships and analyze patterns for any dataset.
We can find the pairwise correlation between the different columns of the data using the corr() method. (Note – Non-numeric
columns are not part of the result; in pandas 2.0 and later, pass numeric_only=True so they are skipped instead of raising an error.)
Pearson correlation - Default method of the corr() function
The resulting coefficient is a value between -1 and 1 inclusive,
where:
Perfect Correlation: 1 or -1: Total positive (1) or total negative (-1) linear correlation
We can see that "Duration" and "Duration" got the number 1.000000, which
makes sense, since each column always has a perfect relationship with itself.
Good Correlation: 0.7 to 0.99 (in absolute value)
"Duration" and "Calories" got a 0.922721 correlation, which is a very good
correlation, and we can predict that the longer you work out, the more calories
you burn, and the other way around: if you burned a lot of calories, you probably
had a long work out.
Bad Correlation: close to 0: No linear correlation; the two variables most likely
have no linear relationship
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad
correlation, meaning that we cannot predict the max pulse by just looking at the
duration of the work out, and vice versa.
Figure: Heat map showing the correlation between the variables, drawn using Seaborn
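A minimal sketch of the correlation matrix computation. The values are hypothetical, and the non-numeric Label column is included only to show how numeric_only=True excludes it (this parameter assumes pandas 1.5 or later).

```python
import pandas as pd

# Hypothetical workout data; "Label" is a non-numeric column.
df = pd.DataFrame({
    "Duration": [30, 45, 60, 60, 90],
    "Calories": [200, 290, 410, 400, 600],
    "Label": ["a", "b", "c", "d", "e"],
})

# Pearson correlation is the default method; numeric_only=True
# leaves non-numeric columns out of the calculation.
corr = df.corr(numeric_only=True)
print(corr)
```

The resulting matrix can be passed to seaborn's heatmap function (for example with annot=True) to draw a heat map like the one on this slide.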
EDA - Process Step 5 - Feature Engineering
Please go through earlier class notes & GitHub link to understand with examples
[Link]
EDA - Process Step 6 – Variance Inflation Factor
• Pairwise correlations may not always be sufficient: a single variable may not be able to
completely explain some other variable, but several variables combined may be able to do so.
To check these sorts of relations between variables, one can use VIF. VIF describes the relationship of
one independent variable with all the other independent variables.
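• VIF is given by
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
• where i refers to the ith variable, which is being represented as a linear combination of the rest of the
independent variables, and $R_i^2$ is the coefficient of determination of that regression.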
• The common heuristic for VIF values: if VIF > 10, the value is high and the variable should be considered
for removal. If the VIF is between 5 and 10, the variable may be valid but should be inspected first. If VIF < 5,
it is considered a good VIF value.
• Rule of thumb: VIF values of 5 to 10 are commonly used as thresholds
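The VIF definition can be computed directly with NumPy by regressing each feature on the others. This is a sketch under stated assumptions, not the course's reference implementation: the helper name vif and the synthetic columns x1, x2, x3 are hypothetical, and x3 is deliberately built as a near-combination of x1 and x2, a case that pairwise correlation alone can understate.

```python
import numpy as np
import pandas as pd


def vif(features: pd.DataFrame) -> pd.Series:
    """Compute VIF_i = 1 / (1 - R_i^2) for every column."""
    result = {}
    for col in features.columns:
        y = features[col].to_numpy(dtype=float)
        others = features.drop(columns=col).to_numpy(dtype=float)
        # Add an intercept column and fit ordinary least squares.
        X = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coef
        # R^2 of regressing this feature on all the other features.
        r2 = 1.0 - residuals.var() / y.var()
        result[col] = 1.0 / (1.0 - r2)
    return pd.Series(result)


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
# x3 is almost exactly x1 + x2, so all three columns are collinear.
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)
data = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

vifs = vif(data)
print(vifs)
```

All three VIF values come out far above 10, while vif(data[["x1", "x2"]]) stays close to 1 because x1 and x2 on their own are independent.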
EDA - Process Step 7 - Hypothesis testing and confidence intervals
• Hypothesis testing is framed in terms of two hypotheses – the Null Hypothesis and the Alternate Hypothesis.
• The Null Hypothesis states that the sample statistic is equal to the population statistic. For example: the Null Hypothesis would be
that the average marks after extra classes are the same as before the classes.
• The Alternate Hypothesis for this example would be that the marks after extra classes are significantly different from those before
the classes.
• Compute the p-value: the probability of obtaining a test statistic at least as extreme as the one observed purely by chance,
assuming the null hypothesis is true.
• Hypothesis testing is done at different levels of confidence and, for a z-test, makes use of the z-score to calculate the
probability. So, at a 95% confidence level (significance level 0.05), a z-score beyond the critical value (|z| > 1.96 for a
two-sided test) rejects the null hypothesis.
Points to be noted:
• We cannot accept the Null Hypothesis, only reject it or fail to reject it.
• As a practical tip, the Null Hypothesis is generally chosen as the statement we want to disprove. For example: you
want to prove that students performed better on their exam after taking extra classes. The
Null Hypothesis, in this case, would be that the marks obtained after the classes are the same as
before the classes.
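A worked sketch of the extra-classes example as a two-sided one-sample z-test, using only the Python standard library. All numbers are hypothetical, and the test assumes the population standard deviation is known; with an unknown standard deviation and a small sample, a t-test would be the usual choice.

```python
import math
from statistics import NormalDist

# Hypothetical numbers for the extra-classes example.
pop_mean = 60.0      # average marks before the classes (null hypothesis value)
pop_std = 10.0       # population standard deviation, assumed known
sample_mean = 64.0   # average marks after the classes
n = 36               # number of students in the sample

# z-score: how many standard errors the sample mean is from the null value.
z = (sample_mean - pop_mean) / (pop_std / math.sqrt(n))

# Two-sided p-value: probability of a statistic at least this extreme under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Critical value for a 95% confidence level (alpha = 0.05, two-sided).
z_crit = NormalDist().inv_cdf(0.975)
reject = abs(z) > z_crit

print(z, p_value, z_crit, reject)
```

Here z is 2.4, which exceeds the critical value of about 1.96, so the null hypothesis is rejected at the 95% confidence level.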
EDA - Process Step 7 - Confidence interval
A confidence interval is the mean of your estimate plus and minus the variation in that estimate. This is the
range of values you expect your estimate to fall between if you redo your test, within a certain level of
confidence. Confidence, in statistics, is another way to describe probability.
Interesting points to note about Confidence Intervals:
1. Confidence intervals can be built with different degrees of
confidence to suit a user's needs, such as 70%, 90%, etc.
2. The larger the sample size, the narrower the confidence interval,
i.e. the more precisely the population mean is determined from
the sample means.
3. There are different confidence intervals for different sample
means. For example, a sample mean of 40 will have a
different confidence interval from a sample mean of 45.
4. By a 95% confidence interval, we do not mean that the
probability of the population mean lying in a particular interval is 95%.
Instead, a 95% C.I. means that 95% of the interval estimates
built this way will contain the population statistic.
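Points 1 and 2 above can be checked numerically. This is a minimal sketch of a z-based interval for a mean, assuming a normal approximation; the function name mean_confidence_interval and the sample statistics are hypothetical.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """Return (lower, upper) for the mean using a normal approximation."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sample_std / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Same sample mean and standard deviation, different sample sizes.
ci_small = mean_confidence_interval(40.0, 8.0, n=25)
ci_large = mean_confidence_interval(40.0, 8.0, n=400)

# Same sample, lower confidence level.
ci_90 = mean_confidence_interval(40.0, 8.0, n=25, confidence=0.90)

print(ci_small, ci_large, ci_90)
```

The interval for n=400 is narrower than the one for n=25, and the 90% interval is narrower than the 95% interval for the same sample.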
EDA for Different ML Problems
• EDA for classification problems:
• Class balance analysis
• Feature distributions by class
• ROC curves for binary classification
• EDA for regression problems:
• Residual analysis
• Heteroscedasticity checks
• Polynomial relationship exploration
• EDA for clustering problems:
• Dimensionality reduction for visualization
• Silhouette analysis
• Hierarchical clustering dendrograms
• EDA for time series analysis:
• Trend and seasonality decomposition
• Autocorrelation and partial autocorrelation plots
• Stationarity tests
Common Pitfalls in EDA for ML
• Data leakage:
• Including target-related information in features
• Using future data in time series problems
• Overfitting to the training data:
• Drawing conclusions from small samples
• Not validating findings on separate test sets
• Ignoring outliers or rare events:
• Removing important signal from the data
• Failing to model important edge cases
• Misinterpreting correlations:
• Assuming correlation implies causation
• Overlooking spurious correlations
Best Practices for EDA in ML
• Iterative process:
• Continuously refine analysis based on new insights
• Revisit EDA after initial modeling
• Domain knowledge integration:
• Consult subject matter experts
• Validate findings against business logic
• Documentation and reproducibility:
• Keep detailed notes of all steps and decisions
• Use version control for code and data
• Balancing automation and manual analysis:
• Use automated tools for initial insights
• Perform deeper manual analysis on key areas
• Linking EDA insights to business impact:
• Translate statistical findings to business metrics
• Provide actionable recommendations based on insights