Introduction To Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a method for analyzing datasets to summarize their main characteristics using statistical graphics and visualization. It includes various types of analysis such as univariate and multivariate, both graphical and non-graphical, and utilizes tools like Python and R for implementation. EDA is crucial in data science for detecting missing data, identifying outliers, and understanding data distribution before applying machine learning models.

Uploaded by

rupikashreecg

Exploratory Data Analysis (EDA)

16-Mark Answer | Unit 1 – Data Acquisition

1. Introduction to Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main
characteristics, often using statistical graphics and other data visualization methods. EDA was introduced
by statistician John Tukey to help analysts explore data before making assumptions or applying formal
models.
EDA assists data science professionals in the following ways:
• Getting a better understanding of data and its structure
• Identifying various data patterns, trends, and anomalies
• Getting a clearer understanding of the problem statement
• Detecting missing values, outliers, and inconsistencies in datasets

2. Types of Exploratory Data Analysis

There are four primary types of EDA:

2.1 Univariate Non-Graphical EDA

This is the simplest form of data analysis where the data being analyzed consists of only one variable.
Since it is a single variable, it does not deal with causes or relationships. The main purpose of univariate
analysis is to describe the data and find patterns that exist within it. Summary statistics such as mean,
median, mode, variance, and standard deviation are used.
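The summary statistics named above can be computed with Python's standard `statistics` module. This is a minimal sketch: the `marks` list is made-up illustrative data, and the variance and standard deviation shown are the sample versions (dividing by n - 1).

```python
import statistics

# Hypothetical sample: exam marks for a single variable (illustrative values only).
marks = [62, 75, 75, 81, 90, 58, 75, 88]

summary = {
    "mean": statistics.mean(marks),
    "median": statistics.median(marks),
    "mode": statistics.mode(marks),
    "variance": statistics.variance(marks),  # sample variance (divides by n - 1)
    "std_dev": statistics.stdev(marks),      # square root of the sample variance
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For a whole population rather than a sample, `statistics.pvariance` and `statistics.pstdev` divide by n instead.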

2.2 Univariate Graphical EDA

Non-graphical methods do not provide a full picture of the data. Graphical methods are therefore required
to better understand the distribution of a single variable. Common types of univariate graphics include:
• Stem-and-leaf plots: Show all data values and the shape of the distribution.
• Histograms: A bar plot where each bar represents the frequency or proportion of cases for a
range of values.
• Box plots: Graphically depict the five-number summary — minimum, first quartile, median, third
quartile, and maximum.
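As an illustration of the first item, a stem-and-leaf plot can be produced as plain text. The sketch below assumes non-negative integers and uses the last digit as the leaf; the function name `stem_and_leaf` and the `ages` data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return a text stem-and-leaf plot for non-negative integers.

    The stem is the value with its last digit removed; the leaf is the last digit.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return "\n".join(lines)

# Hypothetical data: ages of ten respondents.
ages = [23, 25, 31, 31, 38, 42, 47, 47, 49, 65]
print(stem_and_leaf(ages))
```

Every original value can be read back from the plot (stem 4 with leaves 2, 7, 7, 9 stands for 42, 47, 47, 49), which is what distinguishes it from a histogram.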

2.3 Multivariate Non-Graphical EDA

Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques
generally show the relationship between two or more variables through cross-tabulation or summary
statistics such as correlation coefficients.
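Both techniques can be sketched with the standard library alone. The example below is an illustration under assumptions: the `records`, `hours_studied`, and `marks` data are invented, and `pearson` is a hand-written helper for the Pearson correlation coefficient rather than a library function.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Cross-tabulation: count how often each (region, product) pair occurs.
records = [("North", "A"), ("North", "B"), ("South", "A"), ("South", "A"), ("North", "A")]
crosstab = Counter(records)

# Correlation between two numeric variables (hypothetical values).
hours_studied = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]
r = pearson(hours_studied, marks)

print(crosstab)
print(f"r = {r:.3f}")
```

A coefficient close to +1 or -1 suggests a strong linear relationship; a value near 0 suggests little linear relationship.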

2.4 Multivariate Graphical EDA

Multivariate graphical EDA uses graphics to display relationships between two or more sets of data.
Common chart types include:
• Scatter plot: Shows how the values of one variable relate to those of another.
• Multivariate chart: A graphical representation of relationships between factors and a response.
• Run chart: A line graph of data plotted over time.
• Bubble chart: A data visualization displaying multiple circles (bubbles) in a two-dimensional
plot.
• Heat map: A graphical representation of data where values are depicted by color.

3. Exploratory Data Analysis Tools

The two most common programming languages used to perform EDA are Python and R.
• Python: An interpreted, object-oriented programming language with dynamic semantics. Its
high-level built-in data structures and dynamic typing make it very attractive for rapid application
development. In EDA, Python can be used to identify missing values in a dataset, an important
step before training machine learning models.
• R: An open-source programming language and free software environment for statistical
computing and graphics. It is widely used by statisticians and data scientists for statistical
analysis and data exploration.

4. Basic Tools of EDA – Plots and Graphs

4.1 Line Plot

A line plot displays information as a series of data points called "markers" connected by straight lines.
The measurement points must be ordered (typically by their x-axis values). This type of plot is often used
to visualize a trend in data over intervals of time — commonly referred to as a time series.
In Python (Matplotlib, with pyplot imported as plt):
plt.plot(x_values, y_values) | plt.show()
The first argument is for horizontal-axis data, the second for vertical-axis data. plt.show() displays the
final plot. Used best for: Trend analysis, time-series data.
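A fuller version of the line plot call might look like the sketch below. It assumes Matplotlib is installed; the function name `make_line_plot`, the axis labels, and the monthly sales figures are hypothetical.

```python
import matplotlib.pyplot as plt

def make_line_plot(x_values, y_values):
    """Draw y_values against x_values as a line with point markers."""
    fig, ax = plt.subplots()
    ax.plot(x_values, y_values, marker="o")  # points joined by straight lines
    ax.set_xlabel("Month")
    ax.set_ylabel("Sales")
    ax.set_title("Monthly sales (hypothetical data)")
    return fig

# Hypothetical time series: sales for six consecutive months, ordered by x value.
months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 128, 150, 162, 158]
fig = make_line_plot(months, sales)
```

Calling `plt.show()` afterwards opens the figure in a window, and `fig.savefig("sales.png")` writes it to a file instead.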

4.2 Scatter Plot

A scatter plot shows all individual data points without connecting them with lines. Each data point is
defined by its x-axis and y-axis values. This type of plot can be used to display trends or correlations
between two variables.
plt.scatter(x_values, y_values)
Used best for: Showing how two variables compare, identifying correlation or outliers.

4.3 Histogram

A histogram is a frequency chart that approximates the distribution of numeric data. To create a
histogram, the entire range of values is divided into a series of intervals (also called bins), and the
count of values falling into each bin is recorded. Bins are consecutive and non-overlapping intervals
of a variable.
plt.hist(data, bins=10)
The default value for the bins argument is 10. Used best for: Understanding frequency distribution of
numerical data.
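The binning step itself can be shown without any plotting library. The sketch below assumes equal-width bins and at least two distinct values; `bin_counts` and the `heights_cm` data are hypothetical names and values used only for illustration.

```python
def bin_counts(data, bins=10):
    """Count how many values fall into each of `bins` equal-width intervals.

    Every interval is closed on the left and open on the right, except the
    last one, which also includes the maximum value.
    """
    low, high = min(data), max(data)
    width = (high - low) / bins  # assumes high > low
    counts = [0] * bins
    for value in data:
        # The maximum would land one past the end, so clamp it into the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return counts, edges

# Hypothetical data: ten heights in centimetres, grouped into three bins.
heights_cm = [150, 152, 155, 158, 160, 161, 163, 165, 170, 180]
counts, edges = bin_counts(heights_cm, bins=3)
print(counts)  # [4, 4, 2]
print(edges)   # [150.0, 160.0, 170.0, 180.0]
```

The bar heights of a histogram are these counts; choosing a different number of bins changes the apparent shape of the distribution.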

4.4 Box Plot (Box-and-Whisker Plot)

A box plot, also called the box-and-whisker plot, is a way to show the distribution of values based on the
five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. Box plots
are especially effective in summarizing the spread of large datasets and in identifying outliers.
Key components of a Box Plot:
• Median: Divides data into two equal halves
• IQR (Interquartile Range): Range between Q1 (25th percentile) and Q3 (75th percentile)
• Whiskers: Lines extending from Q1 and Q3 to the minimum and maximum non-outlier values
• Outliers: Data points that fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, shown as circles
plt.boxplot(data_points)
Used best for: Summarizing large data spread, comparing distributions, and identifying outliers.
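The 1.5*IQR rule can be checked numerically. This is a sketch using the standard `statistics` module; `iqr_outliers` and the `delivery_days` data are hypothetical. Quartile conventions differ between tools, so the fences computed here may differ slightly from those another package draws.

```python
import statistics

def iqr_outliers(data):
    """Return the (lower, upper) fences and the values lying outside them."""
    # method="inclusive" treats the data as the whole population of interest.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return (lower, upper), outliers

# Hypothetical data: delivery times in days, with one unusually long delivery.
delivery_days = [12, 14, 15, 15, 16, 17, 18, 19, 45]
fences, outliers = iqr_outliers(delivery_days)
print(fences)    # (10.5, 22.5)
print(outliers)  # [45]
```

Here Q1 = 15 and Q3 = 18, so IQR = 3 and any value below 10.5 or above 22.5 is flagged.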

4.5 Bar Chart

A bar chart represents categorical data with rectangular bars, where each bar's height corresponds to
the value it represents. It is useful when comparing a given numeric value across different categories.
Bar charts can also be used with two data series for side-by-side comparison.
plt.bar(categories, values)
Used best for: Comparing performance across categories (e.g., sales by region, marks by subject).

4.6 Pie Chart

A pie chart is a circular plot divided into slices to show numerical proportion. They are widely used in the
business world. However, many experts recommend avoiding pie charts because it is difficult to compare
the sections of a given pie chart, and even more difficult to compare data across multiple pie charts. In
many cases, they can be replaced by a bar chart for clearer comparison.
plt.pie(values, autopct='%1.1f%%')
Used best for: Showing composition/proportion of data (e.g., market share, budget breakdown).

4.7 Bubble Chart

Scatter and bubble plots help understand how variables are spread across the range considered. A
bubble chart adds a third dimension to a scatter plot — the size of each bubble represents a third variable.
It can be used to identify patterns, the presence of outliers, and the relationship between variables.
Used best for: Comparing three variables simultaneously, relationship and distribution analysis.
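In Matplotlib, a bubble chart is usually drawn as a scatter plot whose marker sizes carry the third variable. The sketch below assumes Matplotlib is installed; `make_bubble_chart`, the axis labels, and the advertising data are hypothetical.

```python
import matplotlib.pyplot as plt

def make_bubble_chart(x, y, sizes):
    """Scatter plot in which each marker's area encodes a third variable."""
    fig, ax = plt.subplots()
    # `s` is the marker area in points squared; alpha keeps overlapping bubbles visible.
    ax.scatter(x, y, s=sizes, alpha=0.5)
    ax.set_xlabel("Advertising spend")
    ax.set_ylabel("Revenue")
    ax.set_title("Bubble size = number of customers (hypothetical data)")
    return fig

# Hypothetical data for four products.
ad_spend = [10, 20, 30, 40]
revenue = [15, 32, 41, 70]
customers = [100, 300, 200, 600]  # used directly as marker areas
fig = make_bubble_chart(ad_spend, revenue, customers)
```

If the third variable has a very different scale, it normally needs rescaling before being passed as `s`, otherwise the bubbles become too small or too large to read.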

4.8 Line Graph (Time-series)

A line graph is preferred when time-dependent data has to be presented. It is best suited to analyze
trends over time, such as monthly sales, stock prices, or population growth. Multiple series can be plotted
on the same graph using different line styles or colors.
Used best for: Time-series trend analysis across multiple categories (e.g., sales across regions over
years).

4.9 Heatmap

A heatmap is a common choice when checking for correlations between variables. It is a graphical
representation of data where values are depicted by color. How colors map to values depends on the color
scale: with a diverging scale, strong positive and strong negative correlations appear in two contrasting
colors, and values near zero appear in a neutral color.
In a correlation heatmap:
• Positive values indicate positive correlation between variables
• Negative values indicate negative correlation
• A value of 1 on the diagonal means a variable is perfectly correlated with itself
Used best for: Understanding correlations in multivariate datasets.
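A correlation heatmap can be sketched with NumPy and Matplotlib, assuming both are installed. The function name `correlation_heatmap`, the column names, and the values are hypothetical; the `coolwarm` colormap is one possible choice of diverging color scale, fixed to the range -1 to 1 so that colors are comparable between plots.

```python
import numpy as np
import matplotlib.pyplot as plt

def correlation_heatmap(columns):
    """Plot the Pearson correlation matrix of a dict of equal-length columns."""
    names = list(columns)
    corr = np.corrcoef([columns[name] for name in names])  # each row is one variable
    fig, ax = plt.subplots()
    image = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names)
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    return fig, corr

# Hypothetical data: three numeric variables for five students.
data = {
    "hours_studied": [1, 2, 3, 4, 5],
    "marks": [52, 58, 61, 70, 74],
    "absences": [9, 7, 6, 3, 1],
}
fig, corr = correlation_heatmap(data)
```

In this made-up dataset, hours studied and marks are strongly positively correlated, while hours studied and absences are strongly negatively correlated; the diagonal is always 1.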

4.10 Donut Chart and Stacked Column Chart

When we want to find the composition of data, donut charts, pie charts, and stacked column charts are
best suited. A donut chart is similar to a pie chart but with a hole in the center, allowing the center to
display additional information. Stacked column charts show how sub-components contribute to the total
across categories.
Used best for: Showing sales composition by category or time period.

4.11 Subplots

Sometimes it is better to plot different graphs in the same grid to understand and compare data better.
Subplots allow multiple charts (e.g., different regions' sales trends) to be displayed side by side in one
figure. This makes comparison much more intuitive and concise.
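A side-by-side layout of this kind might be built with `plt.subplots`, as in the sketch below. It assumes Matplotlib is installed; `make_region_subplots`, the region names, and the sales figures are hypothetical.

```python
import matplotlib.pyplot as plt

def make_region_subplots(years, sales_by_region):
    """Draw one line chart per region, side by side, on a shared y-axis."""
    n = len(sales_by_region)
    # squeeze=False always returns a 2-D array of axes, even for a single region.
    fig, axes = plt.subplots(1, n, sharey=True, squeeze=False, figsize=(3 * n, 3))
    for ax, (region, sales) in zip(axes[0], sales_by_region.items()):
        ax.plot(years, sales, marker="o")
        ax.set_title(region)
        ax.set_xlabel("Year")
    axes[0][0].set_ylabel("Sales")
    fig.tight_layout()
    return fig

# Hypothetical data: yearly sales for three regions.
years = [2021, 2022, 2023]
sales_by_region = {
    "North": [100, 120, 150],
    "South": [80, 95, 90],
    "East": [60, 85, 110],
}
fig = make_region_subplots(years, sales_by_region)
```

Sharing the y-axis (`sharey=True`) keeps the panels on the same scale, so differences in height between regions are directly comparable.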

5. Summary Statistics in EDA

Summary statistics are numerical measures used to describe the main characteristics of a dataset. They
are a key non-graphical tool in EDA.
Statistic            Description
Mean                 Average of all values in a dataset
Median               Middle value that separates higher and lower halves
Mode                 Most frequently occurring value in a dataset
Variance             Measure of the spread/dispersion of data values
Standard Deviation   Square root of variance; shows spread around the mean
Quartiles (Q1, Q3)   Divide data into four equal parts; used in box plots
IQR                  Interquartile range = Q3 - Q1; measures middle 50% spread
Min / Max            Smallest and largest values in the dataset

6. Why EDA is Important in Data Science

Before applying any machine learning model, EDA is a critical step. It helps in:
• Detecting missing data, which can disrupt true data patterns and lead to inaccurate models
• Identifying outliers or anomalous data points that may affect model predictions
• Understanding data distribution to choose the right machine learning algorithm
• Discovering correlations between variables that inform feature selection
• Removing inconsistencies such as incorrect spellings, duplicate data, or mis-populated columns
• Understanding data formats and overall structure (mean, median, standard deviation, quantiles)
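Two of these checks, missing values and duplicate rows, can be sketched in plain Python. The function `data_quality_report`, the set of markers treated as missing, and the sample `rows` are assumptions made for illustration; in practice a library such as pandas is commonly used for the same checks.

```python
from collections import Counter

def data_quality_report(rows, missing_markers=(None, "", "NA")):
    """Count missing values per column and the number of repeated rows."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value in missing_markers:
                missing[column] += 1
    # Rows are compared as sorted (column, value) tuples so that key order is irrelevant.
    seen = Counter(tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return dict(missing), duplicates

# Hypothetical records with two missing marks, one missing city, and one repeated row.
rows = [
    {"name": "Asha", "city": "Chennai", "marks": 82},
    {"name": "Ravi", "city": "", "marks": None},
    {"name": "Asha", "city": "Chennai", "marks": 82},
    {"name": "Meena", "city": "Madurai", "marks": "NA"},
]
missing, duplicates = data_quality_report(rows)
print(missing)     # {'city': 1, 'marks': 2}
print(duplicates)  # 1
```

Which markers count as missing (empty strings, "NA", sentinel numbers) depends on how the dataset was collected, so that list should be adjusted for each dataset.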

