
Introduction To Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a method for analyzing datasets to summarize their main characteristics using statistical graphics and visualization. It includes various types of analysis such as univariate and multivariate, both graphical and non-graphical, and utilizes tools like Python and R for implementation. EDA is crucial in data science for detecting missing data, identifying outliers, and understanding data distribution before applying machine learning models.

Uploaded by

rupikashreecg

Exploratory Data Analysis (EDA)

16-Mark Answer | Unit 1 – Data Acquisition

1. Introduction to Exploratory Data Analysis


Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main
characteristics, often using statistical graphics and other data visualization methods. EDA was introduced
by statistician John Tukey to help analysts explore data before making assumptions or applying formal
models.
EDA assists data science professionals in the following ways:
• Getting a better understanding of data and its structure
• Identifying various data patterns, trends, and anomalies
• Getting a clearer understanding of the problem statement
• Detecting missing values, outliers, and inconsistencies in datasets

2. Types of Exploratory Data Analysis


There are four primary types of EDA:

2.1 Univariate Non-Graphical EDA


This is the simplest form of data analysis where the data being analyzed consists of only one variable.
Since it is a single variable, it does not deal with causes or relationships. The main purpose of univariate
analysis is to describe the data and find patterns that exist within it. Summary statistics such as mean,
median, mode, variance, and standard deviation are used.
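The summary statistics named above can be computed with Python's standard statistics module. This is a minimal sketch; the marks list is made-up sample data, not taken from any real dataset.

```python
import statistics

# Hypothetical sample: exam marks for a single variable (made-up values).
marks = [62, 75, 75, 81, 68, 90, 75, 58]

summary = {
    "mean": statistics.mean(marks),
    "median": statistics.median(marks),
    "mode": statistics.mode(marks),
    "variance": statistics.variance(marks),  # sample variance (divides by n - 1)
    "std_dev": statistics.stdev(marks),      # square root of the sample variance
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

statistics.pvariance and statistics.pstdev give the population versions (dividing by n) when the data covers the whole population rather than a sample.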

2.2 Univariate Graphical EDA


Non-graphical methods do not provide a full picture of the data. Graphical methods are therefore required
to better understand the distribution of a single variable. Common types of univariate graphics include:
• Stem-and-leaf plots: Show all data values and the shape of the distribution.
• Histograms: A bar plot where each bar represents the frequency or proportion of cases for a
range of values.
• Box plots: Graphically depict the five-number summary — minimum, first quartile, median, third
quartile, and maximum.

2.3 Multivariate Non-Graphical EDA


Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques
generally show the relationship between two or more variables through cross-tabulation or summary
statistics such as correlation coefficients.
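As a rough illustration of both techniques, the sketch below computes a Pearson correlation coefficient by hand and builds a small cross-tabulation with collections.Counter. The function name pearson_r and all data values are hypothetical, chosen only for this example.

```python
from collections import Counter
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical paired observations for two numeric variables.
hours_studied = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]
r = pearson_r(hours_studied, marks)
print(f"correlation: {r:.3f}")

# Cross-tabulation of two categorical variables: count each (region, passed) pair.
records = [("North", "Yes"), ("North", "No"), ("South", "Yes"),
           ("South", "Yes"), ("North", "Yes")]
crosstab = Counter(records)
for (region, passed), count in sorted(crosstab.items()):
    print(region, passed, count)
```

A value of r near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship.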

2.4 Multivariate Graphical EDA

Multivariate graphical EDA uses graphics to display relationships between two or more sets of data.
Common chart types include:
• Scatter plot: Plots paired values of two variables to show how strongly they are related.
• Multivariate chart: A graphical representation of relationships between factors and a response.
• Run chart: A line graph of data plotted over time.
• Bubble chart: A data visualization displaying multiple circles (bubbles) in a two-dimensional
plot.
• Heat map: A graphical representation of data where values are depicted by color.

3. Exploratory Data Analysis Tools


The two most common programming languages used to perform EDA are Python and R.
• Python: An interpreted, object-oriented programming language with dynamic semantics. Its
high-level built-in data structures and dynamic typing make it very attractive for rapid application
development. In EDA, Python can be used to identify missing values in a dataset,
an important step before applying machine learning.
• R: An open-source programming language and free software environment for statistical
computing and graphics. It is widely used by statisticians and data scientists for
statistical analysis and data exploration.

4. Basic Tools of EDA – Plots and Graphs

4.1 Line Plot


A line plot displays information as a series of data points called "markers" connected by straight lines.
The measurement points must be ordered (typically by their x-axis values). This type of plot is often used
to visualize a trend in data over intervals of time — commonly referred to as a time series.
In Python (Matplotlib, assuming matplotlib.pyplot is imported as plt):
plt.plot(x_values, y_values) | plt.show()
The first argument is the horizontal-axis data and the second is the vertical-axis data. plt.show() displays
the final plot.
Used best for: Trend analysis, time-series data.
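A minimal sketch of a line plot for a short time series, assuming Matplotlib is installed. The helper name make_line_plot, the axis labels, and the monthly sales figures are hypothetical and serve only as an illustration.

```python
import matplotlib.pyplot as plt


def make_line_plot(x_values, y_values):
    """Draw y against x as markers joined by straight lines and return the figure."""
    fig, ax = plt.subplots()
    ax.plot(x_values, y_values, marker="o")  # marker="o" makes each data point visible
    ax.set_xlabel("Month")
    ax.set_ylabel("Sales")
    ax.set_title("Monthly sales")
    return fig


# Hypothetical data, already ordered by x value as a line plot requires.
months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 128, 150, 162, 158]
fig = make_line_plot(months, sales)
```

Calling plt.show() afterwards opens the plot in a window, and fig.savefig("sales.png") writes it to a file instead.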

4.2 Scatter Plot


A scatter plot shows all individual data points without connecting them with lines. Each data point is
defined by its x-axis and y-axis values. This type of plot can be used to display trends or correlations
between two variables.
plt.scatter(x_values, y_values)
Used best for: Showing how two variables relate, identifying correlation or outliers.

4.3 Histogram

A histogram is an accurate representation of the distribution of numeric data and is a frequency chart that
records the number of occurrences of an entry in a dataset. To create a histogram, the entire range of
values is divided into a series of intervals (also called bins), and the count of values falling into each bin
is recorded. Bins are consecutive and non-overlapping intervals of a variable.
plt.hist(data, bins=10)
The default value for the bins argument is 10.
Used best for: Understanding frequency distribution of numerical data.
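To make the binning step concrete, the sketch below counts values in equal-width bins using only plain Python. The function name histogram_counts and the ages list are hypothetical. It is meant to illustrate the idea, not to reproduce any library's exact implementation.

```python
def histogram_counts(data, bins=10):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(data), max(data)
    if lo == hi:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in data:
        # Bins are half-open [left, right); the last bin also includes max(data).
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges


# Hypothetical ages, grouped into 4 bins.
ages = [21, 22, 25, 29, 30, 31, 35, 38, 40, 40]
counts, edges = histogram_counts(ages, bins=4)
print(edges)   # bin boundaries
print(counts)  # number of values per bin
```

Each value lands in exactly one bin, so the counts always sum to the number of data points.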

4.4 Box Plot (Box-and-Whisker Plot)


A box plot, also called the box-and-whisker plot, is a way to show the distribution of values based on the
five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. Box plots
are especially effective in summarizing the spread of large datasets and in identifying outliers.
Key components of a Box Plot:
• Median: Divides data into two equal halves
• IQR (Interquartile Range): Range between Q1 (25th percentile) and Q3 (75th percentile)
• Whiskers: Lines extending from Q1 and Q3 to the minimum and maximum non-outlier values
• Outliers: Data points that fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, shown as circles
plt.boxplot(data_points)
Used best for: Summarizing large data spread, comparing distributions, and identifying outliers.
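The quantities a box plot draws can be computed directly. The sketch below uses statistics.quantiles with method="inclusive", which interpolates linearly between data points. Other quartile conventions exist and give slightly different values. The function name box_plot_summary and the readings list are hypothetical.

```python
import statistics


def box_plot_summary(data):
    """Return quartiles, IQR, outlier fences, whisker ends, and outliers."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inliers = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        # Whiskers stop at the most extreme values that are not outliers.
        "whisker_low": min(inliers), "whisker_high": max(inliers),
        "outliers": outliers,
    }


# Hypothetical readings with one unusually large value.
readings = [12, 15, 17, 18, 19, 20, 21, 22, 24, 60]
summary = box_plot_summary(readings)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sample only the value 60 lies beyond the upper fence, so it is reported as an outlier and the upper whisker ends at 24.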

4.5 Bar Chart


A bar chart represents categorical data with rectangular bars, where each bar's height corresponds to
the value it represents. It is useful when comparing a given numeric value across different categories.
Bar charts can also be used with two data series for side-by-side comparison.
plt.bar(categories, values)
Used best for: Comparing performance across categories (e.g., sales by region, marks by subject).

4.6 Pie Chart


A pie chart is a circular plot divided into slices to show numerical proportion. Pie charts are widely used
in the business world. However, many experts recommend avoiding them because it is difficult to compare
the sections of a given pie chart, and even more difficult to compare data across multiple pie charts. In
many cases, they can be replaced by a bar chart for clearer comparison.
plt.pie(values, autopct='%1.1f%%')
Used best for: Showing composition/proportion of data (e.g., market share, budget breakdown).

4.7 Bubble Chart


Scatter and bubble plots help understand how variables are spread across the range considered. A
bubble chart adds a third dimension to a scatter plot — the size of each bubble represents a third variable.
It can be used to identify patterns, the presence of outliers, and the relationship between variables.
Used best for: Comparing three variables simultaneously, relationship and distribution analysis.
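One way to draw a bubble chart in Matplotlib is a scatter plot whose marker size is driven by the third variable. This is a sketch under that assumption; the helper name make_bubble_chart, the scale factor, and the data are hypothetical.

```python
import matplotlib.pyplot as plt


def make_bubble_chart(x_values, y_values, third_values, scale=20):
    """Scatter plot in which each marker's area grows with a third variable."""
    sizes = [scale * v for v in third_values]  # marker area in points squared
    fig, ax = plt.subplots()
    # alpha < 1 keeps overlapping bubbles visible.
    ax.scatter(x_values, y_values, s=sizes, alpha=0.5)
    ax.set_xlabel("Advertising spend")
    ax.set_ylabel("Sales")
    return fig


# Hypothetical data: three variables per observation.
ad_spend = [10, 20, 30, 40]
sales = [2, 4, 5, 9]
store_count = [3, 8, 5, 12]
fig = make_bubble_chart(ad_spend, sales, store_count)
```

The s argument of scatter sets marker area rather than radius, so a value twice as large produces a bubble with twice the area.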

4.8 Line Graph (Time-series)


A line graph is preferred when time-dependent data has to be presented. It is best suited to analyze
trends over time, such as monthly sales, stock prices, or population growth. Multiple series can be plotted
on the same graph using different line styles or colors.
Used best for: Time-series trend analysis across multiple categories (e.g., sales across regions over
years).

4.9 Heatmap

A heatmap is a common choice when checking for correlations between variables. It is a graphical
representation of data where values are depicted by color. With a sequential color scale, for example,
darker cells can mark higher positive correlation and lighter cells higher negative correlation. The exact
mapping depends on the color scale chosen, so the plot should include a color legend.
In a correlation heatmap:
• Positive values indicate positive correlation between variables
• Negative values indicate negative correlation
• A value of 1 on the diagonal means a variable is perfectly correlated with itself
Used best for: Understanding correlations in multivariate datasets.
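A correlation heatmap can be sketched with NumPy and Matplotlib as follows. The helper name correlation_heatmap, the column names, and the values are hypothetical. The "coolwarm" color scale is one possible choice, fixed here to the range -1 to 1 so that colors are comparable across plots.

```python
import numpy as np
import matplotlib.pyplot as plt


def correlation_heatmap(columns):
    """Plot the Pearson correlation matrix of a dict of equal-length columns."""
    names = list(columns)
    # np.corrcoef treats each row as one variable by default.
    corr = np.corrcoef([columns[name] for name in names])
    fig, ax = plt.subplots()
    image = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names)
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    return fig, corr


# Hypothetical dataset with three numeric variables.
data = {
    "hours": [1, 2, 3, 4, 5],
    "marks": [52, 58, 61, 70, 74],
    "absences": [9, 7, 6, 3, 1],
}
fig, corr = correlation_heatmap(data)
print(np.round(corr, 2))
```

The matrix is symmetric with ones on the diagonal, which is why some analysts display only its lower triangle.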

4.10 Donut Chart and Stacked Column Chart


When we want to find the composition of data, donut charts, pie charts, and stacked column charts are
best suited. A donut chart is similar to a pie chart but with a hole in the center, allowing the center to
display additional information. Stacked column charts show how sub-components contribute to the total
across categories.
Used best for: Showing sales composition by category or time period.

4.11 Subplots

Sometimes it is better to plot different graphs in the same grid to understand and compare data better.
Subplots allow multiple charts (e.g., different regions' sales trends) to be displayed side by side in one
figure. This makes comparison much more intuitive and concise.
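A minimal sketch of side-by-side subplots using plt.subplots, with one panel per region. The helper name make_region_subplots and the sales figures are hypothetical.

```python
import matplotlib.pyplot as plt


def make_region_subplots(years, sales_by_region):
    """Draw one line chart per region in a single row, sharing the y-axis."""
    n = len(sales_by_region)
    # squeeze=False always returns a 2-D array of axes, even when n == 1.
    fig, axes = plt.subplots(1, n, sharey=True, squeeze=False, figsize=(4 * n, 3))
    for ax, (region, sales) in zip(axes[0], sales_by_region.items()):
        ax.plot(years, sales, marker="o")
        ax.set_title(region)
        ax.set_xlabel("Year")
    axes[0][0].set_ylabel("Sales")
    fig.tight_layout()
    return fig


# Hypothetical yearly sales for three regions.
years = [2020, 2021, 2022, 2023]
sales_by_region = {
    "North": [100, 120, 130, 150],
    "South": [90, 95, 110, 105],
    "East": [60, 80, 85, 120],
}
fig = make_region_subplots(years, sales_by_region)
```

Sharing the y-axis (sharey=True) puts every panel on the same scale, so the heights of the lines can be compared directly across regions.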

5. Summary Statistics in EDA


Summary statistics are numerical measures used to describe the main characteristics of a dataset. They
are a key non-graphical tool in EDA.
Statistic            Description
Mean                 Average of all values in a dataset
Median               Middle value that separates higher and lower halves
Mode                 Most frequently occurring value in a dataset
Variance             Measure of the spread/dispersion of data values
Standard Deviation   Square root of variance; shows spread around the mean
Quartiles (Q1, Q3)   Divide data into four equal parts; used in box plots
IQR                  Interquartile range = Q3 - Q1; measures middle 50% spread
Min / Max            Smallest and largest values in the dataset

6. Why EDA is Important in Data Science


Before applying any machine learning model, EDA is a critical step. It helps in:
• Detecting missing data, which can disrupt true data patterns and lead to inaccurate models
• Identifying outliers or anomalous data points that may affect model predictions
• Understanding data distribution to choose the right machine learning algorithm
• Discovering correlations between variables that inform feature selection
• Removing inconsistencies such as incorrect spellings, duplicate data, or mis-populated columns
• Understanding data formats and summarizing each variable's overall distribution (mean, median, standard deviation, quantiles)
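Several of these checks can be sketched in a few lines with pandas, assuming it is installed. The DataFrame below is made-up data containing one missing value, one duplicated row, and one inconsistently spelled category.

```python
import pandas as pd

# Hypothetical raw data with typical quality problems.
df = pd.DataFrame({
    "region": ["North", "South", "south", "East", "North"],
    "sales": [250.0, None, 310.0, 2900.0, 250.0],
})

missing_per_column = df.isna().sum()     # count of missing values in each column
duplicate_rows = df.duplicated().sum()   # rows identical to an earlier row

# Normalize spelling so "south" and "South" count as the same category.
df["region"] = df["region"].str.strip().str.title()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(df["region"].value_counts())
print(df.describe())                     # count, mean, std, min, quartiles, max
```

How to act on the findings (dropping rows, filling in missing values, treating 2900.0 as an outlier) depends on the dataset and is a separate decision from detecting them.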

