Bim 41

Module 2.3 focuses on the application of Exploratory Data Analysis (EDA) in machine learning, detailing its importance, processes, and best practices. It emphasizes the steps involved in EDA, including data cleaning, visualization, feature engineering, and hypothesis testing, while also addressing common pitfalls. The module aims to equip learners with the skills to effectively analyze data sets to uncover patterns and inform model development.

Uploaded by

khanshawez894

Artificial Intelligence and Machine Learning
Module 2.3 – Application of Exploratory Data Analysis
Prof. Sridevi Tandley
Agenda – Module 2.1 & 2.2
1. Introduction to Basic Statistical Learning & Data Sampling Methods
2. Feature Engineering Techniques
• What is a Feature? What is feature Engineering?
• What is its importance and why it is used?
• Main processes of Feature Engineering
• Widely Used Feature Engineering Techniques.
• Imputation.
• Binning
• Ordinal Encoding
• One-hot Encoding.
• Feature Splitting
• Handling Outliers.
• Transformations.
• Scaling (Normalization & Standardization).
Agenda – Module 2.3 - Application of Exploratory Data Analysis

• Why Exploratory Data Analysis (EDA) for ML models?

• EDA for ML Models – Step by Step Process followed in Real World
• Import collected Data
• Data Cleaning and Preprocessing
• Inspecting Data - Fill Rate
• Descriptive statistics
• Data Visualization
• Distribution (Normal & Binomial)
• Bivariate Analysis
• Multivariate Analysis: Correlation Analysis
• Feature Engineering
• Variance Inflation Factor (VIF)
• Hypothesis testing and Confidence Intervals
• EDA for Different ML Problems
• Common Pitfalls in EDA for ML
• Best Practices followed in EDA
Application of Exploratory Data Analysis (EDA)
EDA Definition: EDA is an approach to analyzing data sets to summarize their main
characteristics, often with visual methods.

Exploratory Data Analysis refers to the crucial process of performing initial investigations on data to discover patterns and check assumptions with the help of summary statistics and graphical representations.
• EDA can be leveraged to check for outliers, patterns, and trends in the given data.
• EDA helps to find meaningful patterns in data.
• EDA provides in-depth insights into the data sets to solve our business problems.
• EDA gives clues about how to impute missing values in the dataset.

Importance of EDA in ML pipeline:
• Helps understand data structure and relationships
• Guides feature selection and engineering
• Informs model selection and hyperparameter tuning

Goals of EDA for ML:
• Identify patterns, anomalies, and relationships in data
• Check assumptions required for ML models
• Inform data preprocessing and feature engineering decisions
EDA - Process Steps
• Import collected Data
• Data cleaning and preprocessing
• Inspecting Data - Print the columns and basic statistics (count, mean, std, min, etc.) of the dataset.
• Calculating Fill Rate and Count - Calculate the fill rate (percentage of non-null values) and the count of non-null values for each column.
• Descriptive statistics
• Data visualization
• Plotting Distribution Charts - Plot histograms for each column to visualize the distribution of the data (Normal, Binomial, Poisson).
• Plotting Boxplots for Relationships - Plot boxplots to analyze the relationship between each numerical feature (column) and the target variable.
• Feature Engineering
• Variance Inflation Factor (VIF) - Calculate VIF for each numerical feature to detect multicollinearity among predictors.
• Hypothesis testing and confidence intervals
Iterative process: Findings at each stage may require revisiting previous steps
EDA - Process Step 1 – Importing Data
Data can come in various file types and formats, so importing it correctly is a crucial task.
• Importing Text files
• Reading text files is similar to reading CSV files. The only nuance is that you need to specify a separator with the sep argument, as shown below. The separator is the symbol used to separate the columns (fields) within each row of the file.
import pandas as pd
# r"\s+" is a regular expression matching one or more whitespace characters
df = pd.read_csv("[Link]", sep=r"\s+")
• Importing Excel files
• Reading Excel files (both XLS and XLSX) is as easy as calling the read_excel() function with the file path as input. For a workbook with several sheets, specify one additional argument, sheet_name, which takes either a string for the sheet name or an integer for the sheet position (note that Python uses 0-indexing, so the first sheet is sheet_name=0).
# sheet_name=1 reads the second sheet of the workbook
df = pd.read_excel('diabetes_multi.xlsx', sheet_name=1)
• Importing JSON file
• Similar to the read_csv() function, you can use read_json() for JSON files, with the JSON file name as the argument. The code below first builds a DataFrame from a Python dictionary, then reads a JSON file from disk into a DataFrame object df.
import pandas as pd
# Build a DataFrame directly from a dictionary of columns
C_Data = { 'Name' :['Bharath','Bharath','Sumathi','Lalit','Abdul'],
'Age': [27,28,33,34,23],
'Address': ['Chennai','Chennai','Delhi','Hyd','bangalore'],
'Company':['Aspire system','Aspire system','Aspire system','Aspire system','Aspire system']}
df = pd.DataFrame(C_Data)

# Read a JSON file from disk into a DataFrame
df = pd.read_json('[Link]')
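As a minimal sketch of the import calls above, the following writes two tiny throwaway files to a temporary directory and reads them back. The file names (people.txt, people.json) and the sample values are made up for illustration; they are not part of the course material.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample files, created only so the example is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
txt_path = tmp_dir / "people.txt"
json_path = tmp_dir / "people.json"

# A whitespace-separated text file: a header row, then one record per line.
txt_path.write_text("Name Age City\nAsha 27 Chennai\nRavi 33 Delhi\n")

# sep tells read_csv which symbol separates the columns within each line;
# r"\s+" is a regular expression meaning "one or more whitespace characters".
df_txt = pd.read_csv(txt_path, sep=r"\s+")

# Save the same records as JSON, then load them back with read_json.
df_txt.to_json(json_path, orient="records")
df_json = pd.read_json(json_path)

print(df_txt)
print(df_json.dtypes)
```

Reading an Excel workbook follows the same pattern with read_excel(), but it needs an extra Excel engine package installed, so it is left out of this sketch.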
EDA - Process Step 2 - Data Cleaning and Preprocessing
• Data cleaning means fixing bad data in your data set. Bad data could be:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
• The main goal of data understanding is to gain general insights about the data: the number of rows and columns, the values in the data, the datatypes, and the missing values in the dataset.

• Data Inspection - The quality of the data needs to be assessed first

• shape – displays the number of observations (rows) and features (columns) in the dataset
• info() – shows each variable's non-null count (and hence its missing values) and datatype
• nunique() – returns the number of unique values in each column; together with the data description, this helps identify the continuous and categorical columns
• isnull() – widely used in pre-processing steps to identify null values in the data
• Data Cleaning & Preprocessing
Replace empty values on a case-by-case basis, separately for categorical and numerical variables
• Replace using Mean, Median, or Mode
• drop_duplicates() - Duplicate rows can hamper model accuracy, so they need to be dropped
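The inspection calls and the fill-rate calculation can be sketched on a small made-up table. The column names and values below are assumptions chosen only to include one duplicate row and a couple of missing cells.

```python
import numpy as np
import pandas as pd

# Made-up data with one duplicate row and a few missing values.
df = pd.DataFrame({
    "Name": ["Bharath", "Bharath", "Sumathi", "Lalit", "Abdul"],
    "Age": [27, 27, 33, np.nan, 23],
    "City": ["Chennai", "Chennai", "Delhi", "Hyd", None],
})

print(df.shape)       # (number of rows, number of columns)
df.info()             # non-null count and dtype of each column
print(df.nunique())   # number of distinct values per column

# Fill rate: percentage of non-null values in each column.
fill_count = df.notnull().sum()
fill_rate = df.notnull().mean() * 100
print(pd.DataFrame({"count": fill_count, "fill_rate_%": fill_rate}))

# Remove exact duplicate rows (the first occurrence is kept).
df_clean = df.drop_duplicates().reset_index(drop=True)
print(df_clean.shape)
```

A column with a low fill rate is a candidate for imputation or removal, which is why this check usually comes before any cleaning decision.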
EDA - Process Step 2 - Data Cleaning and Preprocessing (Cont…)

Missing values can be handled using a few techniques:

Drop the missing values – If the dataset is huge and missing values are very few, we can drop the affected rows directly because doing so will not have much impact.
Replace with mean values – We can replace the missing values with the mean, but this is not advisable if the data has outliers, because outliers pull the mean towards them.
Replace with median values – We can replace the missing values with the median, which is recommended if the data has outliers.
Replace with mode values – We can do this in the case of a categorical feature.

Regression – A regression model can be used to predict the missing value from the other features in the dataset.
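A short sketch of the simpler techniques, using a made-up table in which one Salary value is an obvious outlier. The column names and numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Salary": [30.0, 32.0, np.nan, 35.0, 400.0],   # 400.0 is an outlier
    "City": ["Delhi", "Chennai", "Delhi", None, "Delhi"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: mean imputation (the outlier drags the mean upwards).
mean_filled = df["Salary"].fillna(df["Salary"].mean())

# Option 3: median imputation (robust to the outlier).
median_filled = df["Salary"].fillna(df["Salary"].median())

# Option 4: mode imputation for a categorical column.
mode_filled = df["City"].fillna(df["City"].mode()[0])

print(mean_filled[2], median_filled[2], mode_filled[3])
```

Comparing the two filled Salary values shows why the median is the safer default when outliers are present.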
EDA - Process Step 3 - Descriptive Statistics
• Descriptive statistics give a quick and simple description of the data.

• They can include count, mean, standard deviation, median, mode, minimum value, maximum value, range, etc.
• The statistics summary gives a high-level idea of whether the data has any outliers or data entry errors, and of how the data is distributed, for example normally distributed or left/right skewed.
• We will use the describe() method, which shows basic statistical characteristics of each numerical feature (int64 and float64 types): number of non-missing values, mean, standard deviation, minimum, maximum, and the 0.25, 0.50 (median), and 0.75 quartiles.
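For illustration, describe() can be tried on a few made-up workout records. The values are assumptions, chosen so that one long session stretches the upper tail.

```python
import pandas as pd

# Made-up workout records; the last row is deliberately much larger.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 30, 120],
    "Calories": [409.1, 479.0, 282.4, 406.0, 195.1, 1000.1],
})

summary = df.describe()
print(summary)

# A mean well above the median (the 50% row) hints at right skew.
mean_above_median = summary.loc["mean"] > summary.loc["50%"]
print(mean_above_median)
```

Comparing the mean with the 50% row is a quick numeric check for skew before any chart is drawn.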
EDA - Process Step 4 - Data Visualization
• Bivariate Analysis helps to understand how variables are related to each other, including the relationship between the dependent and independent variables in the dataset.

• We must decide which charts to plot to better understand the data. In this module, we visualize our data using the Matplotlib and Seaborn libraries.
• Categorical variables can be visualized using a Count plot, Bar Chart, Pie Plot, etc.
• Numerical variables can be visualized using a Histogram, Box Plot, Density Plot, etc.

• E.g.: Plotting Boxplots for Relationships - Plot boxplots to analyze the relationship between each numerical feature (column) and the target variable

• Python Libraries – Widely used are:

• Matplotlib is a Python 2D plotting library used to draw basic charts.
• Seaborn is a Python library built on top of Matplotlib that uses short lines of code to create and style statistical plots from Pandas and NumPy data.
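A minimal plotting sketch with Matplotlib and pandas only: a histogram for one numerical feature and a boxplot of that feature against the target. The feature name glucose, the target name outcome, and the synthetic values are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: the feature is shifted upwards for outcome == 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "glucose": np.concatenate([rng.normal(100, 10, 50), rng.normal(140, 15, 50)]),
    "outcome": [0] * 50 + [1] * 50,
})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single numerical variable.
ax_hist.hist(df["glucose"], bins=20)
ax_hist.set_title("Distribution of glucose")

# Boxplot: the numerical feature split by the target variable.
df.boxplot(column="glucose", by="outcome", ax=ax_box)
ax_box.set_title("glucose by outcome")
fig.suptitle("")  # clear the automatic "grouped by" title added by pandas

out_path = Path(tempfile.gettempdir()) / "eda_plots.png"
fig.savefig(out_path)
plt.close(fig)
```

Seaborn offers equivalent one-line calls for the same charts; plain Matplotlib is used here only to keep the dependencies small.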
EDA - Process Step 4 - Data Visualization - Distribution
Plot histograms for each column to visualize the distribution of the data.
• Normal Distribution

• The normal distribution is informally called a bell curve.

• The shape of the distribution varies with how spread out the data is.
• If the data has a very high range and standard deviation, the normal curve is spread out and flatter, since a large number of values lie far from the mean.
• If the standard deviation is low, most of the values lie close to the mean and the curve is much narrower and taller.
• The higher the standard deviation, the wider and flatter the curve.

• Binomial Distribution

• The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent Bernoulli trials (each having only yes/no or true/false outcomes, with the same probability of success).
• Many of the situations we encounter are of the pass-fail type. The Democrats either win or lose the election. I get either heads or tails on a coin toss. You either win or lose your football game (assuming that there is always a forced outcome). So there are only two outcomes: win and lose, or success and failure. The likelihood of the two may or may not be the same.
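Both distributions can be explored by simulation. The sketch below uses NumPy's random generator; the means, standard deviations, and number of coin tosses are arbitrary values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two normal samples with the same mean but different spread.
narrow = rng.normal(loc=50, scale=2, size=10_000)
wide = rng.normal(loc=50, scale=10, size=10_000)

# Share of values within 5 units of the mean: high for the narrow curve,
# much lower for the wide (flatter) curve.
share_narrow = np.mean(np.abs(narrow - 50) <= 5)
share_wide = np.mean(np.abs(wide - 50) <= 5)
print(share_narrow, share_wide)

# Binomial: number of heads in 10 fair coin tosses, repeated 10,000 times.
heads = rng.binomial(n=10, p=0.5, size=10_000)
values, counts = np.unique(heads, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

Passing either sample to a histogram reproduces the bell curve and the discrete binomial bars described above.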
EDA - Process Step 4 - Data Visualization – Bivariate Analysis
For numerical variables, pair plots and scatter plots are widely used for Bivariate Analysis.

A stacked bar chart can be used for categorical variables if the output variable is categorical (a classification target). Bar plots can be used if the output variable is continuous.

In this example, a pair plot is used to show the pairwise relationships between the numerical variables.

import matplotlib.pyplot as plt
import seaborn as sns
# df is assumed to be the DataFrame loaded earlier
plt.figure(figsize=(13,17))
sns.pairplot(data=df.drop(['Kilometers_Driven','Price'],axis=1))
plt.show()
EDA - Process Step 4 - Data Visualization – Multivariate Analysis using Correlation Matrix
Multivariate analysis looks at more than two variables. It is one of the most useful methods to determine relationships and analyze patterns in a dataset.
We can find the pairwise correlation between the different columns of the data using the corr() method. (Note – Only numeric columns can be correlated; in recent pandas versions, pass numeric_only=True so that non-numeric columns are ignored.)
Pearson Correlation - Default method of the corr() function
The resulting coefficient is a value between -1 and 1 inclusive, where:

Perfect Correlation: 1 or -1: 1 is a total positive linear correlation, -1 a total negative linear correlation
We can see that "Duration" and "Duration" got the number 1.000000, which makes sense: each column always has a perfect relationship with itself.
Good Correlation: 0.7 to 0.99
"Duration" and "Calories" got a 0.922721 correlation, which is a very good correlation, and we can predict that the longer you work out, the more calories you burn, and the other way around: if you burned a lot of calories, you probably had a long workout.
Bad Correlation: close to 0: No linear correlation, the 2 variables most likely do not affect each other
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad correlation, meaning that we cannot predict the max pulse by just looking at the duration of the workout, and vice versa.
[Figure: heat map showing the correlation between the variables, drawn using Seaborn]
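A small correlation-matrix sketch follows. The numbers are made up (they are not the dataset behind the coefficients quoted above), and the extra text column Level is an assumption added to show how non-numeric columns are excluded.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 30, 120],
    "Calories": [409.1, 479.0, 282.4, 406.0, 195.1, 1000.1],
    "Maxpulse": [130, 145, 135, 175, 120, 150],
    "Level": ["a", "b", "a", "b", "a", "b"],   # non-numeric column
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes("number").corr()
print(corr.round(3))

# Each column is perfectly correlated with itself.
print(corr.loc["Duration", "Duration"])
```

The resulting matrix can be passed to Seaborn's heatmap() function to draw the kind of heat map referred to above.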
EDA - Process Step 5 - Feature Engineering

Please go through earlier class notes & GitHub link to understand with examples
[Link]
EDA - Process Step 6 – Variance Inflation Factor
• Pairwise correlations may not always be sufficient: a single variable might not be able to explain another variable completely, yet several variables combined could. To check for these sorts of relationships, one can use VIF. VIF measures how strongly one independent variable is related to all the other independent variables.
• VIF is given by:

VIF_i = 1 / (1 - R_i^2)

• where i refers to the ith variable, which is being represented as a linear combination of the rest of the independent variables, and R_i^2 is the R-squared of that regression.

• The common heuristic for VIF values: if VIF > 10, the value is high and the variable should be dropped. If VIF is between 5 and 10, the variable may be valid but should be inspected first. If VIF < 5, it is considered a good VIF value.
• Rule of thumb: 5 and 10 are the commonly used VIF threshold values
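The definition VIF_i = 1 / (1 - R_i^2) can be computed directly with NumPy by regressing each feature on the others. This is a sketch under stated assumptions: the helper name vif and the synthetic features x1, x2, x3 are hypothetical, and ordinary least squares with an intercept is assumed for the auxiliary regressions.

```python
import numpy as np

def vif(X):
    """Return VIF_i = 1 / (1 - R_i^2) for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    result = []
    for i in range(n_cols):
        y = X[:, i]                                   # feature being explained
        others = np.delete(X, i, axis=1)              # all remaining features
        A = np.column_stack([np.ones(n_rows), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r2 = 1.0 - residuals.var() / y.var()          # R-squared of the fit
        result.append(1.0 / (1.0 - r2))
    return np.array(result)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
# x3 is almost an exact combination of x1 and x2, so it is multicollinear.
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)

vif_independent = vif(np.column_stack([x1, x2]))
vif_collinear = vif(np.column_stack([x1, x2, x3]))
print(vif_independent)
print(vif_collinear)
```

Note that x1 and x2 are only weakly correlated with each other, so a pairwise correlation check alone would not reveal the problem that the VIF values expose. The statsmodels package also provides a variance_inflation_factor function for the same purpose.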
EDA - Process Step 7 - Hypothesis testing and confidence intervals
• Hypothesis testing is defined in two terms – Null Hypothesis and Alternate Hypothesis.
• The Null Hypothesis states that there is no effect or difference, for example that the sample statistic equals the population statistic. E.g.: the Null Hypothesis would be that the average marks after the extra classes are the same as before the classes.
• The Alternate Hypothesis for this example would be that the marks after the extra classes are significantly different from those before the classes.
• Compute the p-value: the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one observed by chance.
• Hypothesis testing is done at different levels of confidence and can use a z-score to calculate the probability. So, at a 95% confidence level, a test statistic beyond the critical z-value (about 1.96 in absolute value for a two-sided test) leads to rejecting the null hypothesis.

Points to be noted:
• We cannot accept the Null hypothesis, only reject it or fail to reject it.
• As a practical tip, the Null hypothesis is generally set to the statement we want to disprove. E.g.: you want to prove that students performed better on their exam after taking extra classes. The Null Hypothesis, in this case, would be that the marks obtained after the classes are the same as before the classes.
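The extra-classes example can be sketched as a two-sided z-test on the per-student change in marks. The marks below are simulated, so the numbers are assumptions; the normal approximation is used because the text refers to z-scores, although for small samples a t-test is usually preferred.

```python
import math

import numpy as np

# Simulated marks for 40 students before and after the extra classes.
rng = np.random.default_rng(1)
before = rng.normal(loc=60, scale=10, size=40)
after = before + rng.normal(loc=3, scale=5, size=40)

# Null hypothesis: the mean change in marks is zero.
diff = after - before
n = diff.size
standard_error = diff.std(ddof=1) / math.sqrt(n)
z = diff.mean() / standard_error

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Two-sided p-value: probability of a |z| at least this large under the null.
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))

alpha = 0.05  # corresponds to a 95% confidence level
reject_null = p_value < alpha
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject null: {reject_null}")
```

If reject_null is False, the correct wording is that we fail to reject the null hypothesis, not that we accept it.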
EDA - Process Step 7 - Confidence interval
A confidence interval is the mean of your estimate plus and minus the variation in that estimate. This is the range of values you expect your estimate to fall between if you redo your test, within a certain level of confidence. Confidence, in statistics, is another way to describe probability.
Interesting points to note about Confidence Intervals:

1. Confidence Intervals can be built with different degrees of confidence to suit a user's needs, such as 70%, 90%, etc.

2. The larger the sample size, the narrower the Confidence Interval, i.e. the more accurate the determination of the population mean from the sample means.

3. There are different confidence intervals for different sample means. For example, a sample mean of 40 will have a different confidence interval from a sample mean of 45.

4. By a 95% Confidence Interval, we do not mean that the probability of the population mean lying in a given interval is 95%. Instead, a 95% C.I. means that 95% of the interval estimates will contain the population statistic.
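Points 2 and 4 can be checked by simulation. In this sketch the population mean and standard deviation, the sample sizes, and the helper name mean_ci are all assumptions; the interval uses the normal critical value 1.96 for 95% confidence.

```python
import numpy as np

rng = np.random.default_rng(7)
true_mean, true_sd = 50.0, 10.0
z_95 = 1.96  # critical z-value for a two-sided 95% interval

def mean_ci(sample, z=z_95):
    """Return (lower, upper) bounds of a z-based interval for the mean."""
    centre = sample.mean()
    half_width = z * sample.std(ddof=1) / np.sqrt(sample.size)
    return centre - half_width, centre + half_width

# Point 2: a larger sample gives a narrower interval.
low_small, high_small = mean_ci(rng.normal(true_mean, true_sd, 30))
low_large, high_large = mean_ci(rng.normal(true_mean, true_sd, 3000))
print(high_small - low_small, high_large - low_large)

# Point 4: across many repeated samples, about 95% of the intervals
# contain the true population mean.
n_repeats = 2000
hits = 0
for _ in range(n_repeats):
    low, high = mean_ci(rng.normal(true_mean, true_sd, 100))
    hits += int(low <= true_mean <= high)
coverage = hits / n_repeats
print(coverage)
```

Each individual interval either contains the true mean or it does not; the 95% describes the long-run share of intervals that do.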
EDA for Different ML Problems
• EDA for classification problems:
• Class balance analysis
• Feature distributions by class
• ROC curves for binary classification
• EDA for regression problems:
• Residual analysis
• Heteroscedasticity checks
• Polynomial relationship exploration
• EDA for clustering problems:
• Dimensionality reduction for visualization
• Silhouette analysis
• Hierarchical clustering dendrograms
• EDA for time series analysis:
• Trend and seasonality decomposition
• Autocorrelation and partial autocorrelation plots
• Stationarity tests
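For the classification case, class balance and feature distributions by class take only a couple of pandas calls. The table below is made up; the column names amount and is_fraud are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [12.0, 15.5, 11.0, 250.0, 14.2, 13.8, 300.0, 12.9],
    "is_fraud": [0, 0, 0, 1, 0, 0, 1, 0],
})

# Class balance: share of rows in each class.
class_share = df["is_fraud"].value_counts(normalize=True)
print(class_share)

# Feature distribution by class: summary statistics per target value.
by_class = df.groupby("is_fraud")["amount"].describe()
print(by_class)
```

A strongly uneven class_share is an early warning that accuracy alone will be a misleading metric for the model.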
Common Pitfalls in EDA for ML
• Data leakage:
• Including target-related information in features
• Using future data in time series problems
• Overfitting to the training data:
• Drawing conclusions from small samples
• Not validating findings on separate test sets
• Ignoring outliers or rare events:
• Removing important signal from the data
• Failing to model important edge cases
• Misinterpreting correlations:
• Assuming correlation implies causation
• Overlooking spurious correlations
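The data leakage pitfall can be sketched with a time-ordered split in which the imputation value is learned from the training rows only. The daily sales figures and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": [10.0, 12.0, np.nan, 11.0, 13.0, 12.0, 14.0, 40.0, np.nan, 41.0],
})

# Time series: split by time, never randomly, so the test set is the future.
df = df.sort_values("day")
split = int(len(df) * 0.7)
train, test = df.iloc[:split].copy(), df.iloc[split:].copy()

# Learn the imputation value from the training period only ...
train_median = train["sales"].median()
train["sales"] = train["sales"].fillna(train_median)
# ... and reuse it, unchanged, on the test period.
test["sales"] = test["sales"].fillna(train_median)

# Leaky alternative (shown only for comparison): a median computed on the
# full data already contains information from the future test period.
leaky_median = df["sales"].median()
print(train_median, leaky_median)
```

The gap between the two medians is small here, but the same mistake with scaling or target-based features can inflate validation scores noticeably.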
Best Practices for EDA in ML
• Iterative process:
• Continuously refine analysis based on new insights
• Revisit EDA after initial modeling
• Domain knowledge integration:
• Consult subject matter experts
• Validate findings against business logic
• Documentation and reproducibility:
• Keep detailed notes of all steps and decisions
• Use version control for code and data
• Balancing automation and manual analysis:
• Use automated tools for initial insights
• Perform deeper manual analysis on key areas
• Linking EDA insights to business impact:
• Translate statistical findings to business metrics
• Provide actionable recommendations based on insights
Thank You
