Exploratory Data Analysis in Python

EDA for Data Science

Uploaded by

spraga1995

Exploratory Data Analysis - EDA
Data Preparation and Missing Values
Predictive Modeling and Handling Outliers
Binning and Log Transform
Feature Engineering Techniques
Univariate and Bivariate Analysis
Statistical Analysis and Visualization

“Torture the data, and it will confess to anything”

But raw, unprocessed data isn’t of much use...

EXPLORATORY DATA ANALYSIS - EDA

Exploratory Data Analysis (EDA) is the process of visualizing and analysing data to extract insights from it. In other words, EDA summarizes the important characteristics of a dataset in order to understand it better.

After the data has been collected, it is processed and cleaned, and EDA is then performed. After EDA we may go back to processing and cleaning the data, so this can be an iterative process.

It mainly has the following goals:

• To gain an understanding of the data and find clues in it
  o Analyse the data
  o Extract insights from it
• To prepare a proper input dataset, compatible with the requirements of the machine learning algorithm
• To improve the performance of machine learning models

Data scientists are often said to spend about 80% of their time on data preparation.

The best way to gain expertise in feature engineering is to practice different techniques on various datasets and observe their effect on model performance.

Basic Python scripts need the pandas and NumPy libraries for basic operations such as data tables, arithmetic, and logical operations.

import pandas as pd
import numpy as np

Convert to data frame:

df = pd.DataFrame(data)  # data: a dict, a list of records, or an array

Missing values:
Missing values are one of the most common problems you encounter when preparing data for machine learning. They can be caused by human error, interruptions in the data flow, privacy concerns, and so on. Missing values affect the performance of machine learning models, and most algorithms do not accept datasets with missing values and raise an error.

df.isnull()        # boolean mask of the missing cells
df.isnull().sum()  # number of missing values per column

We can handle missing values in several ways:

Delete: You can delete the rows that contain missing values, or delete a whole column that has missing values.

df = df.dropna()  # see the axis and inplace parameters
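The axis parameter mentioned above can be sketched as follows. The toy frame and the 70% completeness threshold are assumptions made for illustration only. Note that inplace=True modifies the frame and returns None, so it should not be combined with an assignment back to df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'b' has one gap, 'c' is mostly empty.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

rows_dropped = df.dropna(axis=0)  # drop rows that contain any missing value
cols_dropped = df.dropna(axis=1)  # drop columns that contain any missing value

# Keep only the columns in which at least 70% of the values are present
# (the threshold is an arbitrary choice for this sketch).
threshold = 0.7
keep = df.columns[df.isnull().mean() <= 1 - threshold]
mostly_complete = df[keep]
```

Dropping rows leaves a single complete row here, while the threshold rule keeps columns a and b and discards only the mostly empty column c.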

Impute: Deleting data can cause a large loss of information, so replacing missing values is often a better option than deleting them. What you impute for the missing values, however, is an important choice.
Numerical Imputation:
If the column has a sensible default value, use it for the missing values.

df = df.fillna(0)  # fill all missing values with 0

A standard replacement technique is to replace missing values with the average value of the column.

df = df.fillna(df.mean(numeric_only=True))

However, the median of the column is usually a better choice when the distribution is skewed, because column averages are sensitive to outlier values.

df = df.fillna(df.median(numeric_only=True))
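To see why the median is the more robust choice, here is a small sketch with made-up numbers. The income values are hypothetical and include one extreme entry.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme value and one missing entry.
income = pd.Series([30, 32, 35, 38, 40, 1000, np.nan], dtype=float)

mean_value = income.mean()      # pulled upward by the 1000 outlier
median_value = income.median()  # stays near the bulk of the data

filled_mean = income.fillna(mean_value)
filled_median = income.fillna(median_value)
print(mean_value, median_value)
```

The mean lands near 196, far from any typical value, while the median is 36.5, so the median-imputed entry looks like an ordinary observation.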

Categorical Imputation:
Replacing missing values with the most frequent value (the mode) of a column is a good option for categorical columns.

most_frequent = df['column_name'].value_counts().idxmax()
df['column_name'] = df['column_name'].fillna(most_frequent)

Predictive filling:
Alternatively, you can fill missing values by interpolating from the neighbouring values.

df = df.interpolate(method='linear')  # fill missing values by linear interpolation
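A minimal sketch of linear interpolation on a single column, using made-up values. With method="linear", pandas treats the rows as equally spaced and ignores the index.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of two values.
s = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0])

filled = s.interpolate(method="linear")
print(filled.tolist())
```

The two gaps are filled with 20 and 30, the evenly spaced points on the line between 10 and 40.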
Predictive model:
Create a predictive model to fill in the data:
• Split the dataset into two datasets
• One without missing values and one with missing values
• Create a model from the dataset without missing values
• Run the model on the dataset with missing values
This fills the missing values. Note, however, that estimated values are usually better behaved than the true values.
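The steps above can be sketched as follows. The column names, the data, and the choice of a straight-line fit with np.polyfit are assumptions for illustration. Any regression model could take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'weight' has gaps, 'height' is complete.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0, 165.0],
    "weight": [50.0, 60.0, 70.0, 80.0, np.nan, np.nan],
})

# 1. Split into rows with and without the missing target.
known = df[df["weight"].notnull()]
unknown = df[df["weight"].isnull()]

# 2. Fit a model on the complete rows (here: a straight line).
slope, intercept = np.polyfit(known["height"], known["weight"], deg=1)

# 3. Predict the missing values and write them back.
df.loc[unknown.index, "weight"] = slope * unknown["height"] + intercept
```

Because the predictions lie exactly on the fitted line, they carry none of the scatter that real observations would have, which is the caveat noted above.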

Handling Outliers:

The best way to detect outliers is to look at the data visually. A box-and-whisker plot is the easiest way to spot them.

import seaborn as sns

sns.boxplot(x='Variable', y='Target variable', data=df)
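The points a box plot draws beyond its whiskers can also be found numerically. The sketch below uses the common 1.5 x IQR rule, which is the default whisker setting in seaborn and matplotlib. The sample values, including the extreme 300, are made up.

```python
import pandas as pd

# Hypothetical values with one obvious extreme point.
values = pd.Series([2, 45, -23, 85, 28, 2, 35, -12, 300])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range: the height of the box

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones plotted as individual points.
outliers = values[(values < lower_fence) | (values > upper_fence)]
```

Only the value 300 falls outside the fences in this sample.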

An Outlier Dilemma: Drop or Cap

Dropping the outlier rows is one option for handling outliers.

Dropping the outlier rows with percentiles:

upper_lim = df['column'].quantile(.95)
lower_lim = df['column'].quantile(.05)

df = df[(df['column'] < upper_lim) & (df['column'] > lower_lim)]

Another option for handling outliers is to cap them instead of dropping them.

Capping the outlier rows with percentiles:

upper_lim = df['column'].quantile(.95)
lower_lim = df['column'].quantile(.05)

df.loc[df['column'] > upper_lim, 'column'] = upper_lim
df.loc[df['column'] < lower_lim, 'column'] = lower_lim
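The same capping can be written in one step with Series.clip. The data below (the integers 1 to 100) and the new column name capped are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column holding the values 1..100.
df = pd.DataFrame({"column": list(range(1, 101))})

upper_lim = df["column"].quantile(0.95)
lower_lim = df["column"].quantile(0.05)

# Values below lower_lim are raised to it, values above upper_lim are lowered to it.
df["capped"] = df["column"].clip(lower=lower_lim, upper=upper_lim)
```

Unlike dropping, capping keeps all 100 rows, so the dataset size is unchanged.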

Binning:

Binning can be applied to both categorical and numerical data.

Numerical Binning
Example:
Value      Bin
0-30    -> Low
31-70   -> Mid
71-100  -> High

Categorical Binning
Example:
Value      Bin
Spain   -> Europe
Italy   -> Europe
Chile   -> South America
Brazil  -> South America
Numerical Binning Example

df['bin'] = pd.cut(df['value'], bins=[0, 30, 70, 100],
                   labels=["Low", "Mid", "High"],
                   include_lowest=True)  # so that a value of exactly 0 lands in "Low"

   value   bin
0      2   Low
1     45   Mid
2      7   Low
3     85  High
4     28   Low

Categorical Binning Example

conditions = [
    df['Country'].str.contains('Spain'),
    df['Country'].str.contains('Italy'),
    df['Country'].str.contains('Chile'),
    df['Country'].str.contains('Brazil')]

choices = ['Europe', 'Europe', 'South America', 'South America']

df['Continent'] = np.select(conditions, choices, default='Other')

     Country      Continent
0      Spain         Europe
1      Chile  South America
2  Australia          Other
3      Italy         Europe
4     Brazil  South America
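When the categories match exactly, a dictionary lookup is a compact alternative to the condition list. This is a sketch of a swapped-in technique (Series.map), not the method shown above. The mapping dictionary continent_of is a name introduced here.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Spain", "Chile", "Australia", "Italy", "Brazil"]})

# Hypothetical lookup table from country to continent.
continent_of = {
    "Spain": "Europe",
    "Italy": "Europe",
    "Chile": "South America",
    "Brazil": "South America",
}

# Countries missing from the table map to NaN, which is then replaced by 'Other'.
df["Continent"] = df["Country"].map(continent_of).fillna("Other")
```

Note the difference in matching: map requires the whole value to equal a key, whereas str.contains also matches substrings.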

Log Transform

The logarithm transformation (or log transform) is one of the most commonly used mathematical transformations in feature engineering. What are the benefits of the log transform?

• It helps to handle skewed data; after the transformation, the distribution is closer to normal.
• In most cases, the order of magnitude of the data varies within the range of the data, and the log transform compresses these differences.
• It also decreases the effect of outliers, because magnitude differences are normalized, and the model becomes more robust.
Log Transform Example

df['log+1'] = (df['value'] + 1).transform(np.log)

Negative Values Handling

The logarithm is undefined for negative inputs, so shift the values by the minimum first. Note that the resulting values are different:

df['log'] = (df['value'] - df['value'].min() + 1).transform(np.log)

   value  log(x+1)  log(x-min(x)+1)
0      2   1.09861          3.25810
1     45   3.82864          4.23411
2    -23       nan          0.00000
3     85   4.45435          4.69135
4     28   3.36730          3.95124
5      2   1.09861          3.25810
6     35   3.58352          4.07754
7    -12       nan          2.48491
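The shifted transform in the table can also be written with np.log1p, which computes log(1 + x) directly, and undone with its inverse np.expm1. This is a sketch on the same values. The names shifted and restored are introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [2, 45, -23, 85, 28, 2, 35, -12]})

shifted = df["value"] - df["value"].min()  # smallest value becomes 0
df["log"] = np.log1p(shifted)              # log(1 + x), same as log(x - min(x) + 1)

# Undo the transform: expm1 is the inverse of log1p.
restored = np.expm1(df["log"]) + df["value"].min()
```

Being able to invert the transform matters when the transformed column is a model target and predictions have to be reported in the original units.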

One-hot encoding (or dummy coding):

One-hot encoding is one of the most common encoding methods in machine learning. It spreads the values in a column over multiple flag columns and assigns 0 or 1 to them. These binary values express the relationship between the grouped and the encoded column. This method converts categorical data, which is hard for algorithms to work with, into a numerical format, and lets you group your categorical data without losing any information. If the column has N distinct values, it is enough to map them to N-1 binary columns, because the remaining value can be deduced from the other columns.

encoded_columns = pd.get_dummies(df['column'])

df = df.join(encoded_columns).drop('column', axis=1)
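The N versus N-1 point can be illustrated with the drop_first parameter of pd.get_dummies. The city column and its values are made up for this sketch.

```python
import pandas as pd

# Hypothetical categorical column with N = 3 distinct values.
df = pd.DataFrame({"city": ["Rome", "Madrid", "Rome", "Istanbul"]})

full = pd.get_dummies(df["city"])                      # N flag columns
reduced = pd.get_dummies(df["city"], drop_first=True)  # N-1 flag columns
```

In reduced, a row whose flags are all 0 can only be the dropped category (here Istanbul, the first in sorted order), so no information is lost.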
New variable creation:
Sometimes a new variable can be created by combining information from two or more columns. This can reveal a hidden relationship between the variables and the target.
E.g., in ticket reservation:
The number-of-passengers column and the relationship column can be combined to tell whether the passenger is travelling with family, with friends, or alone.
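A minimal sketch of the ticket-reservation idea. The column names relatives and friends, the data, and the rule inside travel_type are all hypothetical, since the source does not specify the layout of the booking table.

```python
import pandas as pd

# Hypothetical booking data: how many relatives and friends travel along.
df = pd.DataFrame({
    "relatives": [0, 2, 0, 1],
    "friends":   [0, 0, 3, 1],
})

# New numeric feature: size of the travelling group (+1 for the passenger).
df["group_size"] = df["relatives"] + df["friends"] + 1

def travel_type(row):
    """Derive a categorical feature from the two count columns."""
    if row["relatives"] > 0:
        return "family"
    if row["friends"] > 0:
        return "friends"
    return "alone"

df["travelling_with"] = df.apply(travel_type, axis=1)
```

Neither new column exists in the raw data, yet both may relate to the target more directly than the original counts do.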

Feature Split:

Splitting features is a good way to make them useful for machine learning. By extracting the usable parts of a column into new features, we:

• enable machine learning algorithms to comprehend them,
• make it possible to bin and group them,
• improve model performance by uncovering potential information.

String extraction example

data.title.head()

0                      Toy Story (1995)
1                        Jumanji (1995)
2               Grumpier Old Men (1995)
3              Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)

data.title.str.split("(", n=1, expand=True)[1].str.split(")", n=1, expand=True)[0]

0    1995
1    1995
2    1995
3    1995
4    1995
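The same extraction can be sketched with a regular expression, which also copes with titles that contain additional parentheses. This is a swapped-in technique (Series.str.extract), and the new columns year and name are names introduced here.

```python
import pandas as pd

data = pd.DataFrame({"title": [
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Father of the Bride Part II (1995)",
]})

# Capture the four digits inside the parentheses at the end of the title.
data["year"] = data["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)

# Remove the trailing "(year)" to keep only the name.
data["name"] = data["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)
```

Converting the year to an integer makes it directly usable for binning or for computing the age of a film.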

Scaling:

In most cases, the numerical features of a dataset do not share a common range and differ from each other. In real life, it makes no sense to expect the age and income columns to have the same range. But from a machine learning point of view, how can these two columns be compared?

Scaling solves this problem. After scaling, the continuous features have comparable ranges. Scaling is not mandatory for many algorithms, but it can still be worth applying. Algorithms based on distance calculations, such as k-NN or k-Means, however, need scaled continuous features as model input.

Basically, there are two common ways of scaling:

Normalization

Normalization (or min-max normalization) scales all values to a fixed range between 0 and 1. This transformation does not change the shape of the feature's distribution, but because the standard deviation shrinks, the effect of outliers increases. It is therefore recommended to handle outliers before normalization.

data = pd.DataFrame({'value': [2, 45, -23, 85, 28, 2, 35, -12]})

data['normalized'] = ((data['value'] - data['value'].min()) /
                      (data['value'].max() - data['value'].min()))

   value  normalized
0      2        0.23
1     45        0.63
2    -23        0.00
3     85        1.00
4     28        0.47
5      2        0.23
6     35        0.54
7    -12        0.10
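The advice to handle outliers first can be illustrated with a short sketch. The helper min_max and the extra value 1000 are assumptions added for the example.

```python
import pandas as pd

def min_max(series):
    """Scale a numeric Series linearly to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())

values = pd.Series([2, 45, -23, 85, 28, 2, 35, -12])

# Same data plus one hypothetical extreme value.
with_outlier = pd.concat([values, pd.Series([1000])], ignore_index=True)

scaled = min_max(values)
scaled_outlier = min_max(with_outlier)
```

With the outlier present, the eight original values are squeezed into roughly the lowest tenth of the [0, 1] range, so the differences between them almost disappear.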

Standardization

Standardization (or z-score normalization) scales the values while taking the standard deviation into account. If the standard deviations of the features differ, their ranges will differ as well. This reduces the effect of outliers in the features.

In the standardization formula, the mean is denoted by μ and the standard deviation by σ:

$$z = \frac{x - \mu}{\sigma}$$

data = pd.DataFrame({'value': [2, 45, -23, 85, 28, 2, 35, -12]})

data['standardized'] = (data['value'] - data['value'].mean()) / data['value'].std()

   value  standardized
0      2         -0.52
1     45          0.70
2    -23         -1.23
3     85          1.84
4     28          0.22
5      2         -0.52
6     35          0.42
7    -12         -0.92
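One detail worth knowing when reproducing the table: pandas' std() defaults to the sample standard deviation (ddof=1), whereas some tools, scikit-learn's StandardScaler for instance, use the population form (ddof=0). The sketch below compares the two. The names z_sample and z_population are introduced here.

```python
import pandas as pd

data = pd.DataFrame({"value": [2, 45, -23, 85, 28, 2, 35, -12]})
mean = data["value"].mean()

# Sample standard deviation (ddof=1), the pandas default used in the table.
z_sample = (data["value"] - mean) / data["value"].std()

# Population standard deviation (ddof=0) gives slightly larger z-scores.
z_population = (data["value"] - mean) / data["value"].std(ddof=0)
```

On small datasets the two versions differ visibly (1.84 versus about 1.97 for the value 85). On large datasets the difference becomes negligible.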

“garbage in, garbage out!”

Univariate Analysis:

Check it Out:
The first step in examining your data is a quick look at its rows, column types, and summary statistics, for example:

df.head()
df.info()
df.describe()

Explore target Variable:

Find the distribution of the target variable. If the target variable is categorical, find the success rate. Use matplotlib and seaborn plots.

For a continuous variable:
import matplotlib.pyplot as plt
ax = target['Column'].hist(density=True, stacked=True)
target['Column'].plot(kind='density', ax=ax)
plt.show()

For a categorical variable:
import seaborn as sns
sns.countplot(x='Survived', data=df)
Visualization:
Histogram:
In univariate analysis, we use histograms to analyse and visualize the frequency distribution.

sns.histplot(df['column'], kde=False)

With kde=True, the plot combines the histogram with an estimate of the distribution function.

Box-plot:
The box plot is the second visualization tool used in univariate analysis. It is used for detecting outliers, and it shows the distribution of continuous data in a way that facilitates comparison between variables or across the levels of a categorical variable.

sns.boxplot(x='categorical column', y='continuous column', data=df)

Count Plot:
A histogram of a categorical variable.

sns.countplot(x='categorical column', data=df)

Bar Plot:
Represents the central tendency of a numerical variable with the height of a solid rectangle, and the uncertainty around it with an error bar on top.

sns.barplot(x='categorical variable', y='numerical variable', data=df)
Bivariate Analysis:
Pair plot:
Plots pairwise relationships in a dataset. This function creates a grid of graphs with all combinations of variables, mostly as scatter plots.

sns.pairplot(df, hue='target variable')

Reg plot:
Plots the data and the best-fit line of a linear regression model in the same graph.

sns.regplot(x='Var1', y='Var2', data=df)

Joint plot:
Plots two variables with bivariate and univariate graphs in the same figure.

sns.jointplot(x='Var1', y='Var2', data=df, kind='reg')

Point plot:
Shows point estimates and confidence intervals of a continuous variable across the levels of categorical variables.

sns.pointplot(x='cat var', y='cont var', hue='cat var2', data=df)

Factor plot:
Used for multiple group comparison. In current seaborn versions, factorplot is called catplot.

sns.catplot(data=df, x='var1', y='var2', hue='cat var', kind='point')

Strip plot:
A scatter plot of one continuous and one categorical variable.

sns.stripplot(data=df, x='cat variable', y='continuous variable')

Swarm plot:
Like a strip plot for one continuous and one categorical variable, but with the points adjusted so that they do not overlap.

sns.swarmplot(data=df, x='cat variable', y='continuous variable')

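Covariance:
$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Correlation:
$$r = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)} \cdot \sqrt{\operatorname{var}(y)}}$$

R2:
$$R^2 = r^2 = \frac{\operatorname{Var}(\text{mean}) - \operatorname{Var}(\text{line})}{\operatorname{Var}(\text{mean})}$$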
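The three quantities above can be computed directly with NumPy. This sketch uses made-up sample values and reads "variance around the mean" and "variance around the line" as sums of squared deviations, which is an interpretation of the formula rather than something the source spells out.

```python
import numpy as np

# Hypothetical sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
r = cov_xy / (np.sqrt(x.var(ddof=1)) * np.sqrt(y.var(ddof=1)))

# R^2 from a least-squares line: the share of the variation around the
# mean that the fitted line explains.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
var_mean = np.sum((y - y.mean()) ** 2)  # squared deviations from the mean
var_line = np.sum(residuals ** 2)       # squared deviations from the line
r_squared = (var_mean - var_line) / var_mean
```

For a straight-line fit with an intercept, r_squared equals r squared, which is the identity stated in the formula. The hand-computed values also agree with np.cov and np.corrcoef.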


