Smlds
Mod 1
Exploratory Data Analysis (EDA)
Introduction / Definition
Exploratory Data Analysis (EDA) is a method used to analyze and summarize
data sets in order to understand their main characteristics.
It uses simple statistical measures and graphical techniques to explore data
before applying advanced analysis or modeling.
EDA helps in understanding the structure of data, identifying patterns, and
detecting unusual values.
As explained in Module 1, EDA mainly works with rectangular data, where
rows represent records and columns represent variables.
BAD702-module-1-textbook
Objectives of Exploratory Data Analysis
Identify Central Tendency
To find a “typical value” for each feature to see where most of the data is
located.
• Central tendency shows the center or average behavior of data
• Measures such as mean, median, and trimmed mean are used
• Helps in summarizing large datasets into a single representative value
This objective helps in understanding what value is most common or typical in
the data.
Measure Variability
To determine the dispersion of data—whether values are tightly clustered or
spread out.
• Variability shows how much data values differ from each other
• Measures include variance, standard deviation, and interquartile range
(IQR)
• High variability means data is widely spread, while low variability means
data is close to the center
This helps in understanding the consistency of the data.
Explore Distribution
To visualize the overall shape and pattern of the data beyond single-number
summaries.
• Frequency tables group data into intervals
• Histograms, boxplots, and density plots are used for visualization
• Helps identify skewness, spread, and concentration of data
This objective gives a complete picture of how data values are distributed.
Detect Outliers
To identify extreme values that are significantly different from the rest of the
data.
• Outliers may occur due to data entry errors
• They may also represent rare or important cases
• Boxplots and percentile-based methods help in detecting outliers
Detecting outliers makes it possible to decide whether to correct, remove, or
keep them, which improves data quality and reliability.
Uncover Relationships
To examine how different variables relate to or influence one another.
• Correlation is used to measure the strength and direction of relationships
• Scatterplots are used to visually study relationships between variables
• Helps identify positive, negative, or no relationship
This objective is important for understanding interactions among variables.
Techniques for Exploring Categorical Data
Introduction
Categorical data refers to data that represents categories or groups rather than
numerical values.
Exploring categorical data is an important part of Exploratory Data Analysis
(EDA).
The main goal is to summarize category-wise information and understand how
frequently each category occurs.
EDA uses simple summary measures and visual techniques to analyze
categorical data effectively.
Techniques for Exploring Categorical Data
1. Proportions and Percentages
One of the simplest techniques for exploring categorical data is calculating
proportions.
• Shows how frequently each category occurs
• Helps compare categories easily
• Often expressed as percentages
This technique gives a quick overview of the importance of each category.
2. Mode
The mode is the category that appears most frequently in the dataset.
• It represents the most common category
• Most useful for categorical or discrete data
• Helps identify dominant or popular categories
Mode is a basic but effective summary measure for categorical data.
3. Frequency Tables
A frequency table shows the count of each category.
• Lists categories along with their frequencies
• Makes data easy to read and interpret
• Forms the basis for graphical representation
Frequency tables help in organizing categorical data in a structured form.
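The first three techniques (proportions, mode, and frequency tables) can be sketched together in a few lines. This is a minimal illustration rather than a prescribed method: the `delay_cause` values and the `frequency_table` helper are hypothetical names made up for the example.

```python
from collections import Counter

# Hypothetical categorical column: recorded cause of each flight delay.
delay_cause = ["Carrier", "Weather", "Carrier", "ATC", "Carrier",
               "ATC", "Security", "Weather", "Carrier", "ATC"]

def frequency_table(values):
    """Return (category, count, proportion) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(cat, n, n / total) for cat, n in counts.most_common()]

table = frequency_table(delay_cause)
mode = table[0][0]  # first row holds the most frequent category

# Print category, count, and percentage side by side.
for category, count, proportion in table:
    print(f"{category:<10}{count:>3}{proportion:>8.0%}")
print("Mode:", mode)
```

The same rows can feed a bar chart directly, with categories on the x-axis and counts or proportions on the y-axis.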
4. Bar Charts
Bar charts are the most commonly used visualization technique for categorical
data.
• Categories are shown on the x-axis
• Frequencies or proportions are shown on the y-axis
• Bars are separated, unlike histograms
Bar charts make category-wise comparisons visually clear and easy.
5. Pie Charts
Pie charts represent categorical data as parts of a whole.
• Each slice shows the proportion of a category
• Useful when the number of categories is small
• Gives a visual sense of percentage contribution
However, pie charts are generally less informative than bar charts, because
slice angles are harder to compare than bar heights.
Key Points
• Categorical data is summarized using counts and proportions
• Mode is the main measure of central tendency
• Bar charts are the most preferred visualization method
• These techniques help simplify and understand category-based data
Methods for Exploring Binary Data
Introduction
Binary data is a special type of categorical data that has only two possible values,
such as Yes/No, True/False, or 0/1.
Exploring binary data is an important part of Exploratory Data Analysis (EDA).
The main aim is to summarize the distribution of the two outcomes and
understand how frequently each outcome occurs.
EDA uses simple numerical summaries and visual tools to explore binary
variables.
Methods for Exploring Binary Data
1. Proportions
Proportion is the most common method used to explore binary data.
• It shows the fraction of observations belonging to each outcome
• For example, proportion of delayed vs non-delayed flights
• Values range between 0 and 1
Proportions give a clear idea of how data is divided between the two categories.
2. Percentages
Percentages are obtained by multiplying proportions by 100.
• Makes interpretation easier
• Useful for comparison and reporting
• Commonly used in summaries and tables
Percentages clearly show the dominance of one category over the other.
3. Frequency Counts
Frequency count shows the number of occurrences of each binary value.
• Counts how many 0s and 1s are present
• Forms the base for calculating proportions and percentages
• Helps understand data balance
This method is simple and easy to interpret.
4. Bar Charts
Bar charts are widely used to visualize binary data.
• Each bar represents one binary category
• Height of the bar shows frequency or proportion
• Bars are shown separately
Bar charts allow quick visual comparison between the two outcomes.
5. Expected Value (When Binary Data Is Numeric)
When binary outcomes are associated with numeric values, expected value is
used.
• Combines probability and value
• Calculated as the sum of value × probability
• Useful in business and decision-making problems
Expected value provides a single summary measure for binary outcomes.
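As a worked illustration of "sum of value × probability", the sketch below estimates the probabilities from observed proportions and then combines them with a numeric value per outcome. The `purchased` data, the `value_of` mapping, and both helper functions are hypothetical and chosen only for the example.

```python
# Hypothetical binary outcomes: 1 = customer purchased, 0 = did not.
purchased = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Hypothetical numeric value attached to each outcome (revenue per customer).
value_of = {1: 40.0, 0: 0.0}

def proportions(outcomes):
    """Proportion of each distinct outcome; the proportions sum to 1."""
    n = len(outcomes)
    return {k: outcomes.count(k) / n for k in sorted(set(outcomes))}

def expected_value(probabilities, values):
    """Sum of value x probability over all outcomes."""
    return sum(values[k] * p for k, p in probabilities.items())

p = proportions(purchased)          # proportions of 0s and 1s
ev = expected_value(p, value_of)    # 0.7 * 0 + 0.3 * 40

print(p)
print(ev)
```

Here 30% of customers purchase, so the expected revenue per customer is 0.3 × 40 = 12, even though no single customer actually pays that amount.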
Correlation and Covariance
Introduction
Correlation and covariance are statistical measures used in Exploratory Data
Analysis (EDA) to study the relationship between two numerical variables.
They help in understanding how one variable changes with respect to another.
Both measures indicate whether variables move together or in opposite directions.
These techniques are important for identifying relationships before further
analysis or modeling.
Correlation
Meaning of Correlation
Correlation measures the strength and direction of the relationship between
two numerical variables.
• If both variables increase together, the correlation is positive
• If one variable increases while the other decreases, the correlation is
negative
• If there is no linear relationship, the correlation is zero
Correlation Coefficient
The correlation coefficient is a standardized measure of correlation.
• Its value ranges from –1 to +1
• +1 indicates perfect positive correlation
• –1 indicates perfect negative correlation
• 0 indicates no linear correlation
Because it is standardized, correlation values are easy to compare across
datasets.
Scatterplot
Scatterplots are used to visually represent correlation.
• One variable is plotted on the x-axis
• The other variable is plotted on the y-axis
• Each point represents one observation
Scatterplots help in identifying the pattern and direction of the relationship.
Covariance
Meaning of Covariance
Covariance measures the direction of the relationship between two variables.
• Positive covariance means variables increase together
• Negative covariance means one increases while the other decreases
• Zero covariance means no linear relationship
Covariance shows whether variables move together, but its magnitude depends on
the units of the variables, so it does not show how strong the relationship is.
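The difference between the two measures is easiest to see by changing the unit of one variable. The sketch below uses the usual sample formulas (dividing by n - 1); the `hours`, `minutes`, and `score` data are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]
minutes = [h * 60 for h in hours]  # same variable, different unit

# Covariance changes with the unit; correlation does not.
print(covariance(hours, score), covariance(minutes, score))
print(correlation(hours, score), correlation(minutes, score))
```

Switching from hours to minutes multiplies the covariance by 60 (10.25 becomes 615), while the correlation stays at about 0.994 in both cases. This is why correlation, not covariance, is used to compare the strength of relationships.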
Methods for Exploring Two Variables
Introduction
Exploring two variables together is called bivariate analysis in Exploratory Data
Analysis (EDA).
It helps in understanding how one variable is related to another.
The methods used depend on whether the variables are numerical or categorical.
EDA uses summary tables and visual techniques to explore such relationships
clearly.
Methods for Exploring Two Variables
1. Scatterplot (Numeric vs Numeric)
Scatterplot is the most common method for exploring two numerical variables.
• One variable is plotted on the x-axis
• The other variable is plotted on the y-axis
• Each point represents one observation
It helps to identify:
• Positive relationship
• Negative relationship
• No relationship
2. Correlation Analysis
Correlation measures the strength and direction of relationship between two
numerical variables.
• Positive correlation: both variables increase together
• Negative correlation: one increases while the other decreases
• Zero correlation: no linear relationship
Correlation coefficient ranges from –1 to +1.
3. Contingency Table (Categorical vs Categorical)
A contingency table summarizes the relationship between two categorical
variables.
• Shows category-wise counts
• Can also include percentages
• Helps compare combinations of categories
It is commonly used in classification and survey data.
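A contingency table can be built by counting each combination of the two categories. The following is a small sketch under assumed data: the `records` list and the `contingency_table` helper are hypothetical.

```python
from collections import Counter

# Hypothetical survey records: (region, preferred plan).
records = [("North", "Basic"), ("North", "Premium"), ("South", "Basic"),
           ("South", "Basic"), ("North", "Basic"), ("South", "Premium"),
           ("South", "Basic"), ("North", "Premium")]

def contingency_table(pairs):
    """Count every (row category, column category) combination."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    return rows, cols, {r: {c: counts[(r, c)] for c in cols} for r in rows}

rows, cols, table = contingency_table(records)

# Print the table with a row total in the last column.
print(f"{'':<8}" + "".join(f"{c:>9}" for c in cols) + f"{'Total':>9}")
for r in rows:
    total = sum(table[r].values())
    print(f"{r:<8}" + "".join(f"{table[r][c]:>9}" for c in cols) + f"{total:>9}")
```

Dividing each cell by its row total would turn the counts into row percentages.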
4. Boxplot (Categorical vs Numeric)
Boxplots are used when:
• One variable is categorical
• The other variable is numerical
They help in:
• Comparing distributions across categories
• Identifying variability and outliers
5. Hexagonal Binning and Density Plots
For very large datasets:
• Scatterplots become overcrowded
• Hexagonal binning groups data into hexagon-shaped bins
• Density and contour plots show concentration of data
These methods give a clear picture of dense data regions.
Measures of Central Tendency and Variability (With Diagram)
Introduction
Measures of central tendency and variability are used to summarize numerical
data in Exploratory Data Analysis.
Central tendency shows the typical value, while variability shows the spread of
data.
Together, they give a compact summary of the dataset.
These measures are widely used before advanced analysis.
Measures of Central Tendency
1. Mean
Mean is the average of all data values.
• Calculated by dividing the sum of values by number of observations
• Simple and widely used
• Sensitive to outliers
2. Median
Median is the middle value of sorted data.
• Divides data into two equal halves
• Largely unaffected by extreme values
• More robust than mean
3. Trimmed Mean
Trimmed mean is calculated after sorting the data and removing a fixed number
of the smallest and largest values.
• Reduces the effect of outliers
• Used when data contains extreme values
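The effect of an extreme value on the three measures can be seen in a short sketch. The `incomes` data and the `trimmed_mean` helper are hypothetical; `trim` is assumed here to mean the number of values dropped from each end.

```python
import statistics

def trimmed_mean(values, trim):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    if 2 * trim >= len(values):
        raise ValueError("trim removes every value")
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)

# Hypothetical monthly incomes (in thousands); 250 is an extreme value.
incomes = [22, 25, 27, 28, 30, 31, 33, 35, 39, 250]

print(statistics.mean(incomes))    # 52
print(statistics.median(incomes))  # 30.5
print(trimmed_mean(incomes, 1))    # 31.0
```

The single value 250 pulls the mean up to 52, while the median and the trimmed mean both stay close to 31, where most of the data lies.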
Measures of Variability
1. Range
Range is the difference between the maximum and minimum values.
• Simple measure
• Highly affected by outliers
2. Variance
Variance measures the average of squared deviations from the mean.
• Shows overall dispersion
• Sensitive to extreme values
3. Standard Deviation
Standard deviation is the square root of variance.
• Expressed in same units as data
• Easy to interpret
• Widely used measure
4. Interquartile Range (IQR)
IQR is the difference between the 75th and 25th percentiles.
• Robust to outliers
• Measures the spread of the middle 50% of the data
Minimum ──|─────[ Q1 ── Median ── Q3 ]─────|── Maximum
                <------- IQR ------->
Q1 = 25th percentile
Q3 = 75th percentile
The box spans Q1 to Q3 and shows the central spread
Whiskers show the range of the data outside the box
Points outside the whiskers are outliers
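The variability measures above can be computed with Python's `statistics` module. The sketch uses the population versions (`pvariance`, `pstdev`) to match the "average of squared deviations" definition given earlier; the sample versions divide by n - 1 instead. The 1.5 × IQR fences are an assumed convention commonly used for boxplots and are not stated in these notes, and the `incomes` data is hypothetical.

```python
import statistics

# Hypothetical monthly incomes (in thousands); 250 is an extreme value.
incomes = [22, 25, 27, 28, 30, 31, 33, 35, 39, 250]

value_range = max(incomes) - min(incomes)
variance = statistics.pvariance(incomes)  # average squared deviation from the mean
std_dev = statistics.pstdev(incomes)      # square root of the variance

# Quartiles by linear interpolation; method="inclusive" treats the minimum
# as the 0th percentile and the maximum as the 100th percentile.
q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention (not from the notes): fences at 1.5 x IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in incomes if x < lower_fence or x > upper_fence]

print("Range:", value_range)
print("Variance:", variance, "Std dev:", std_dev)
print("Q1:", q1, "Median:", q2, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers)
```

With this data the range is 228 but the IQR is only 7.25, which shows how strongly one extreme value affects the range while leaving the IQR almost unchanged.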
Explain Different Types of Data Distribution and Visualization Methods
Introduction
Data distribution describes how data values are spread or arranged in a dataset.
In Exploratory Data Analysis (EDA), understanding distribution helps in
identifying patterns, spread, and unusual values.
EDA uses statistical summaries and graphical methods to study data
distribution.
These methods give a clear picture of the overall behavior of data.
Types of Data Distribution
1. Symmetric Distribution
• Data is evenly distributed around the center
• Mean and median are approximately equal
• Left and right sides of the distribution are similar
This type of distribution indicates balanced data.
2. Skewed Distribution
Skewed distribution occurs when data is not symmetrical.
a) Positively Skewed Distribution
• Tail extends towards higher values
• Mean is usually greater than median
b) Negatively Skewed Distribution
• Tail extends towards lower values
• Mean is usually less than median
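Comparing the mean with the median gives a quick, rough check of skew direction. The sketch below is only a heuristic, not a formal skewness measure; the `skew_direction` helper and both samples are hypothetical.

```python
import statistics

def skew_direction(values):
    """Rough skew indicator from comparing mean and median (a heuristic only)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"
    if mean < median:
        return "negative"
    return "symmetric"

# Hypothetical samples.
waiting_times = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]         # long tail toward high values
exam_scores = [35, 62, 78, 80, 82, 84, 85, 86, 88, 90]  # long tail toward low values

print(skew_direction(waiting_times))  # positive
print(skew_direction(exam_scores))    # negative
```

A histogram or density plot should still be used to confirm the shape, since the mean and median can be close even when the distribution is not symmetric.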
3. Distribution with Outliers
• Contains extreme values far from most data
• Outliers affect mean and standard deviation
• Indicates unusual or rare observations
Visualization Methods for Data Distribution
1. Frequency Table
• Groups data into intervals
• Shows count of values in each interval
2. Histogram
• Bars represent frequency of data intervals
• Helps understand shape and spread
• Empty bins show absence of data
3. Boxplot
• Displays median, quartiles, and outliers
• Useful for comparing distributions
4. Density Plot
• Smoothed version of histogram
• Shows distribution as a continuous curve
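A frequency table for numeric data, and the histogram built from it, can be sketched by splitting the data range into equal-width intervals. The `binned_frequency_table` helper and the `ages` data are hypothetical; equal-width bins are an assumed choice.

```python
def binned_frequency_table(values, n_bins):
    """Split the data range into equal-width intervals and count values in each."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        raise ValueError("all values are identical; cannot form intervals")
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last interval, not a new one.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical ages of 14 survey respondents.
ages = [18, 21, 22, 25, 25, 27, 30, 31, 33, 34, 38, 41, 58, 60]

edges, counts = binned_frequency_table(ages, n_bins=6)

# Text histogram: one row per interval, one '#' per observation.
for i, count in enumerate(counts):
    print(f"{edges[i]:5.1f} to {edges[i + 1]:5.1f} | {'#' * count} ({count})")
```

The interval from 46 to 53 has a count of zero, which is the "empty bin" case: no respondents fall in that age range.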
Explain Techniques for Exploring Multivariate Data
Introduction
Multivariate data involves more than two variables observed together.
Exploring multivariate data helps in understanding complex relationships among
variables.
EDA uses tables and advanced visualization techniques for multivariate analysis.
These techniques help in identifying patterns across multiple variables.
Techniques for Exploring Multivariate Data
1. Correlation Matrix
• Displays correlation between multiple numerical variables
• Values range from –1 to +1
• Helps identify strongly related variables
2. Heatmaps
• Visual representation of correlation matrix
• Color intensity shows strength of relationship
• Easy to interpret large correlation tables
3. Contingency Tables
• Used for multiple categorical variables
• Shows joint frequency counts
• Helps compare category combinations
4. Boxplots for Grouped Data
• Numeric data grouped by categorical variables
• Helps compare distributions across groups
• Identifies variation and outliers
5. Hexagonal Binning and Contour Plots
• Used for large numerical datasets
• Shows density of data points
• Reduces overcrowding in scatterplots
Key Points
• Multivariate analysis studies more than two variables
• Correlation matrices summarize numeric relationships
• Visual tools simplify complex data
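A correlation matrix is simply the correlation coefficient computed for every pair of numeric columns. The sketch below shows one way to build it with plain Python; the `pearson` and `correlation_matrix` helpers and the housing `data` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Correlation of every pair of columns, as a nested dict."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical housing data.
data = {
    "area_sqft": [800, 1000, 1200, 1500, 1800],
    "bedrooms": [2, 2, 3, 3, 4],
    "age_years": [30, 25, 18, 10, 4],
}

matrix = correlation_matrix(data)
names = list(data)

# Print the matrix with column names as header and row labels.
print(f"{'':<10}" + "".join(f"{n:>11}" for n in names))
for a in names:
    print(f"{a:<10}" + "".join(f"{matrix[a][b]:>11.2f}" for b in names))
```

The diagonal is always 1 and the matrix is symmetric. A heatmap displays the same numbers with color intensity in place of the printed values.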
Explain EDA Workflow for a Real-World Dataset
Introduction
The Exploratory Data Analysis (EDA) workflow is a systematic, step-by-step
process used to understand and prepare real-world data before further analysis
or modeling.
It helps in identifying data structure, errors, patterns, and relationships.
The workflow helps ensure that the dataset is clean, reliable, and meaningful.
It forms the foundation of effective data analysis and is essential before
applying statistical models, machine learning, or business analysis techniques.
EDA Workflow for a Real-World Dataset
Data Profiling
Checking data types, missing values, and the number of records.
• Identify number of rows and columns
• Check whether variables are numeric, categorical, or binary
• Detect missing or null values
• Helps understand the overall structure of the dataset
Data profiling gives a basic overview of the dataset.
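The profiling checks can be sketched on a small in-memory dataset. Everything here is hypothetical: the `rows` records, the `infer_kind` and `profile` helpers, and the simple rule that a column with exactly two distinct values is treated as binary.

```python
# Hypothetical dataset as a list of records; None marks a missing value.
rows = [
    {"age": 34, "city": "Pune", "subscribed": 1},
    {"age": None, "city": "Delhi", "subscribed": 0},
    {"age": 29, "city": None, "subscribed": 1},
    {"age": 41, "city": "Mumbai", "subscribed": 0},
]

def infer_kind(values):
    """Classify a column as binary, numeric, or categorical (a simple heuristic)."""
    present = [v for v in values if v is not None]
    if len(set(present)) == 2:
        return "binary"
    if present and all(isinstance(v, (int, float)) for v in present):
        return "numeric"
    return "categorical"

def profile(records):
    """Return row count, column count, and per-column kind and missing count."""
    columns = list(records[0])
    summary = {}
    for col in columns:
        values = [r.get(col) for r in records]
        summary[col] = {
            "kind": infer_kind(values),
            "missing": sum(v is None for v in values),
        }
    return len(records), len(columns), summary

n_rows, n_cols, summary = profile(rows)
print(n_rows, "rows,", n_cols, "columns")
for col, info in summary.items():
    print(col, info)
```

On a real dataset the inferred kinds should be checked by hand, since a numeric column may be stored as text or a category may be coded as a number.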
Univariate Analysis
Analyzing each variable individually using mean, median, and histograms.
• Calculate measures of central tendency such as mean and median
• Measure spread using simple statistics
• Use histograms and boxplots to study distribution
• Helps understand individual variable behavior
Univariate analysis focuses on one variable at a time.
Bivariate Analysis
Checking correlations and scatter plots to see how variables relate to each
other.
• Use correlation to measure relationship strength
• Use scatter plots to visualize relationships
• Identify positive, negative, or no relationship
This step helps in understanding interaction between two variables.
Data Cleaning
Removing or correcting outliers and handling missing data based on the findings.
• Identify extreme or incorrect values
• Remove or correct outliers if required
• Handle missing values using suitable methods
Data cleaning improves accuracy and reliability of the dataset.
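One possible way to handle missing values is to replace them with the median of the observed values. This is only one of several suitable methods and is an assumed choice for illustration; the `fill_missing_with_median` helper and the `ages` data are hypothetical.

```python
import statistics

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a median from")
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries.
ages = [34, None, 29, 41, None, 38]

cleaned = fill_missing_with_median(ages)
print(cleaned)  # [34, 36.0, 29, 41, 36.0, 38]
```

The median is used instead of the mean because it is less affected by any outliers that are still present at this stage.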
Summarization
Documenting the final insights to inform the next steps of machine learning or
business decisions.
• Combine results from all analysis steps
• Highlight key patterns and findings
• Decide suitable models or business actions
Summarization converts analysis into useful conclusions.
The EDA workflow plays a vital role in analyzing real-world datasets.
By following data profiling, analysis, cle