Exploratory Data Analysis Techniques

Uploaded by

dzedziphilly

DSI 237 Statistical analysis

Group 2
GROUP MEMBERS

NAME       SURNAME    REG NUMBER
TANAKA     TONGOWONA  R2212031B
MERCY      MUGAVIRI   R2214108G
PADIEL     GERALD     R2211649M
TINOTENDA  JECHE      R229476X
ISHMAEL    MOYO       R2212593Y
RYAN       HOMBERUME  R2214520V

EXPLORATORY DATA ANALYSIS

 Exploratory Data Analysis (EDA) is a crucial step in the data analysis
process, involving a thorough examination of data through statistical and
visualization tools. The primary objective of EDA is to summarize the data,
uncover patterns, generate hypotheses, and test assumptions, setting
the foundation for in-depth analytics.
 Data scientists leverage EDA to gain insights into datasets, ultimately
influencing business strategies and outcomes. The insights obtained from
EDA, including features extracted, are pivotal not only for further data
analysis and modelling but also for enhancing machine learning
applications.
 Data visualization is a cornerstone of EDA, enabling the representation of
complex data in an easily understandable visual format.
Data Visualization
 Data visualization represents text or numerical data in a visual format,
which makes the information the data expresses easy to grasp. People
remember pictures more easily than text, and Python provides various
libraries for data visualization, such as matplotlib, seaborn, and plotly.
Exploratory Data Analysis
 Creating hypotheses and testing business assumptions is very important
when dealing with any machine learning problem statement, and this is
what EDA helps to accomplish. There are various tools and techniques for
understanding your data. The basic requirement is a working knowledge of
NumPy for mathematical operations and pandas for data manipulation.
Now let us start exploring data and study
different data visualization plots with different
types of data
 Let's get started by importing libraries and loading data

 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt
 import seaborn as sns
 from seaborn import load_dataset
 # Titanic dataset
 data = pd.read_csv("titanic_train.csv")
 # tips dataset
 tips = load_dataset("tips")
Data visualization techniques
HISTOGRAMS

 Histograms provide a graphical representation of the distribution of a
single continuous variable by dividing it into bins and displaying the
count of observations within each bin. They help in understanding the
underlying data distribution by showing its central tendency
and spread.
 A histogram is a value distribution plot for a numerical column. It
creates bins over ranges of values and plots the count in each bin, so we
can see how the values are distributed.

 # Age contains missing values; drop them before binning
 plt.hist(data['Age'].dropna(), bins=5)
 plt.show()
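The binning step behind a histogram can be sketched without plotting. This is a minimal illustration using NumPy's `np.histogram`; the `ages` array is hypothetical data made up for the example, not the Titanic column.

```python
import numpy as np

# Hypothetical ages, used only for illustration (not the Titanic data)
ages = np.array([2, 9, 14, 19, 22, 25, 28, 31, 35, 38, 47, 54, 61, 80])

# Five equal-width bins spanning the observed range, as with bins=5 in a histogram
counts, edges = np.histogram(ages, bins=5)

# Print each bin's range and the number of observations that fall in it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

The bar heights of the plotted histogram correspond to the `counts` array, and the bar boundaries to `edges`.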
BAR CHARTS

 Bar charts are commonly used to visualize and compare categorical
variables. They use bars to represent the frequency or count of each
category, making it easy to identify the most prevalent categories and
their relative proportions.
 # Bar height is the mean Age within each Pclass
 sns.barplot(x=data['Pclass'], y=data['Age'])
 plt.show()
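The frequency and relative proportion of each category, which a count-based bar chart displays, can be tabulated directly. The sketch below uses only the standard library; the `pclass` list is hypothetical and stands in for a categorical column.

```python
from collections import Counter

# Hypothetical passenger classes, used only for illustration
pclass = [3, 1, 3, 2, 3, 1, 3, 2, 3, 3]

# Count how often each category occurs
counts = Counter(pclass)
total = sum(counts.values())

# Report the count and the relative proportion of each category
for category in sorted(counts):
    share = counts[category] / total
    print(f"class {category}: count={counts[category]}, share={share:.0%}")
```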
BOX PLOTS

 Box plots, also known as box-and-whisker plots, effectively display the
distribution, central tendency, and spread of a dataset. They provide
insights into the data's minimum, maximum, median, quartiles, and
potential outliers, shedding light on skewness, symmetry, and the
presence of outliers.

 sns.boxplot(x=data['Sex'], y=data["Age"])
 plt.show()
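The quantities a box plot draws can be computed by hand. The sketch below assumes the common convention of whiskers reaching the most extreme points within 1.5 times the IQR of the box; the `values` array is hypothetical.

```python
import numpy as np

# Hypothetical ages, used only for illustration
values = np.array([22, 25, 26, 28, 30, 31, 33, 35, 38, 40, 71])

# The box spans Q1 to Q3, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box decide which points count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers extend to the most extreme observations inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are drawn individually
fliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("points outside the fences:", fliers)
```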
SCATTER PLOTS

 Scatter plots, a fundamental visualization technique, showcase the
relationship between two continuous variables. Each point on the plot
represents an observation, with its position determined by the values of
the compared variables. Scatter plots are excellent for identifying
patterns like trends, clusters, or outliers.

 sns.scatterplot(x=tips["total_bill"], y=tips["tip"])
 plt.show()
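A linear trend seen in a scatter plot can be quantified with the Pearson correlation coefficient. This is a small sketch using `np.corrcoef`; the two arrays are hypothetical values made up for the example, not the seaborn tips data.

```python
import numpy as np

# Hypothetical bills and tips, used only for illustration
total_bill = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 40.0])
tip = np.array([1.5, 2.0, 3.5, 3.0, 4.5, 6.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(total_bill, tip)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A value of r close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear association.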
Data preprocessing and cleaning
Data preprocessing is a crucial step in the data analysis process that involves
transforming raw data into a clean and organized format for analysis. It aims
to improve the quality of the data, making it more suitable for further analysis.
One of the key aspects of data preprocessing is data cleaning, which involves
handling missing data and outlier detection and treatment.

Handling missing values

Why do missing values occur in data?
 Missing values can occur in data for a number of reasons, such as survey
non-responses or errors in data entry.
 Missing data is grouped into three broad categories:
 Missing completely at random (MCAR).
 Missing at random (MAR).
 Missing not at random (MNAR).
Deletion
 When data is MCAR or MAR, deletion may be a suitable method for
dealing with missing values. However, when data is MNAR, deletion of
missing observations can lead to bias. The forms of deletion covered here are:
 Listwise deletion
 Pairwise deletion
 Variable deletion
 Example
 Consider a small sample of data from the Nelson-Plosser macroeconomic
dataset:
Year   gnp     ip     emp     cpi
1906   -       9.8    33749   28
1907   -       10     34371   29
1908   -       -      33246   28
1909   116.8   8.5    35072   28
1910   120.1   10     55762   28
Listwise deletion

 Uses only the observations from 1909 and 1910 for all parts of the analysis.
 It eliminates all data from 1906-1908 because of the missing values in
gnp and ip.
Pairwise deletion:
 Uses the observations in 1906-1910 when computing the means and
covariances of emp and cpi.
 Uses the observations in 1906-1907 and 1909-1910 when computing the
means and covariances of ip.
 Uses the observations in 1909-1910 when computing the means and
covariances of gnp.
 Uses the observations in 1909-1910 for model estimation other than
means and covariances.
Variable deletion

 Uses only emp and cpi for all parts of analysis.
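The three deletion strategies can be sketched with pandas on the five-row sample above. This is an illustration under stated assumptions rather than the slides' own code: the name `df` and the result variables are made up for the example, and the "-" entries are encoded as NaN.

```python
import numpy as np
import pandas as pd

# The five-row sample from the table, with "-" encoded as NaN
df = pd.DataFrame(
    {
        "gnp": [np.nan, np.nan, np.nan, 116.8, 120.1],
        "ip": [9.8, 10.0, np.nan, 8.5, 10.0],
        "emp": [33749, 34371, 33246, 35072, 55762],
        "cpi": [28, 29, 28, 28, 28],
    },
    index=pd.Index([1906, 1907, 1908, 1909, 1910], name="Year"),
)

# Listwise deletion: drop every row that has any missing value
listwise = df.dropna(axis=0, how="any")

# Variable deletion: drop every column that has any missing value
variable = df.dropna(axis=1, how="any")

# Pairwise deletion: each statistic uses all rows available for the
# columns involved (pandas skips NaN by default in mean and cov)
pairwise_means = df.mean()
pairwise_cov = df.cov()
pairwise_counts = df.count()

print(listwise.index.tolist())    # years kept by listwise deletion
print(variable.columns.tolist())  # columns kept by variable deletion
print(pairwise_counts.to_dict())  # observations available per column
```

The per-column counts show why pairwise deletion keeps more information than listwise deletion: emp and cpi use all five years, ip uses four, and gnp uses two.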

Imputation
 Imputing data replaces missing values with statistically determined values.
Methods of imputation can vary from simply replacing missing values with
the mean to sophisticated multiple imputation.
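The simplest case, replacing missing values with the mean or the median, can be sketched with pandas `fillna`. The example reuses the ip column from the Nelson-Plosser sample, with the missing 1908 value encoded as NaN; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# ip column from the sample table; the 1908 value is missing
ip = pd.Series(
    [9.8, 10.0, np.nan, 8.5, 10.0],
    index=[1906, 1907, 1908, 1909, 1910],
    name="ip",
)

# Mean imputation: fill the gap with the mean of the observed values
mean_imputed = ip.fillna(ip.mean())

# Median imputation: fill the gap with the median of the observed values
median_imputed = ip.fillna(ip.median())

print(mean_imputed.loc[1908], median_imputed.loc[1908])
```

Single-value imputation like this keeps every row but understates the variability of the imputed column, which is one reason multiple imputation exists.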
Outliers with Z-scores
Steps
 Loop through all the data points and compute the Z-score of each using the
formula (Xi - mean) / std.
 Define a threshold value of 3 and mark the data points whose absolute
Z-score is greater than the threshold as outliers.
 import numpy as np

 # Assumed sample, chosen to be consistent with the outputs quoted in these slides
 sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

 def detect_outliers_zscore(data):
     outliers = []  # local list, so repeated calls do not accumulate results
     thres = 3
     mean = np.mean(data)
     std = np.std(data)
     # print(mean, std)
     for i in data:
         z_score = (i - mean) / std
         if np.abs(z_score) > thres:
             outliers.append(i)
     return outliers

 # Driver code
 sample_outliers = detect_outliers_zscore(sample)
 print("Outliers from Z-scores method: ", sample_outliers)
 The above code outputs: Outliers from Z-scores method: [101]
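The same Z-score rule can be written without an explicit loop. This is a vectorized sketch, not the slides' own code: the function name `zscore_outliers` is hypothetical, and `sample` is an assumed data set chosen to be consistent with the outputs quoted in the slides.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0), which is the np.std default
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

# Assumed sample, consistent with the outputs quoted in the slides
sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
print(zscore_outliers(sample))
```

With small samples a single extreme value inflates both the mean and the standard deviation, so a threshold of 3 can miss outliers that the IQR rule would flag.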
Detecting Outliers using the Interquartile Range (IQR)
 Criterion: data points that lie more than 1.5 times the IQR above Q3 or
below Q1 are outliers.

Steps
 Sort the dataset in ascending order
 Calculate the 1st and 3rd quartiles (Q1, Q3)
 Compute IQR = Q3 - Q1
 Compute lower bound = (Q1 - 1.5*IQR), upper bound = (Q3 + 1.5*IQR)
 Loop through the values of the dataset, check for those that fall below
the lower bound or above the upper bound, and mark them as outliers
Python Code
 def detect_outliers_iqr(data):
     outliers = []  # local list, so repeated calls do not accumulate results
     data = sorted(data)
     q1 = np.percentile(data, 25)
     q3 = np.percentile(data, 75)
     # print(q1, q3)
     IQR = q3 - q1
     lwr_bound = q1 - (1.5 * IQR)
     upr_bound = q3 + (1.5 * IQR)
     # print(lwr_bound, upr_bound)
     for i in data:
         if i < lwr_bound or i > upr_bound:
             outliers.append(i)
     return outliers

 # Driver code
 sample_outliers = detect_outliers_iqr(sample)
 print("Outliers from IQR method: ", sample_outliers)
 The above code outputs: Outliers from IQR method: [101]
How to Handle Outliers?
 So far we have learned how to detect outliers. The following steps treat them.
Step 1: Trimming/Removing the outliers
 In this technique, we remove the outliers from the dataset, although this is
not generally good practice because it discards observations.
 Python code to delete the outlier and copy the rest of the elements to
another array.
 # Trimming: delete every value that was flagged as an outlier
 sample = np.array(sample)
 a = np.delete(sample, np.where(np.isin(sample, sample_outliers))[0])
 print(a)
 # print(len(sample), len(a))
 The outlier '101' is deleted and the rest of the data points are copied to
another array 'a'.

Step 2: Quantile Based Flooring and Capping

 In this technique, an outlier is capped at a chosen upper quantile or
floored at a chosen lower quantile. Here the 90th and 10th percentiles are used.
Python code to compute the percentiles and replace the outliers.
 # Computing 10th, 90th percentiles and replacing the outliers
 sample = np.asarray(sample)  # element-wise comparison needs an array
 tenth_percentile = np.percentile(sample, 10)
 ninetieth_percentile = np.percentile(sample, 90)
 # print(tenth_percentile, ninetieth_percentile)
 b = np.where(sample<tenth_percentile, tenth_percentile, sample)
 b = np.where(b>ninetieth_percentile, ninetieth_percentile, b)
 # print("Sample:", sample)
 print("New array:",b)
 The above code outputs: New array: [15, 20.7, 18, 7.2, 13, 16, 11, 20.7, 7.2, 15,
10, 9]

 The data points that are less than the 10th percentile are replaced with the
10th percentile value and the data points that are greater than the 90th
percentile are replaced with the 90th percentile value.
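Flooring and capping at two percentiles can also be expressed in one call with `np.clip`. This is an alternative sketch, not the slides' own code; `sample` is an assumed data set chosen to be consistent with the output quoted above.

```python
import numpy as np

# Assumed sample, consistent with the output quoted in the slides
sample = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9], dtype=float)

# 10th and 90th percentiles serve as the floor and the cap
low, high = np.percentile(sample, [10, 90])

# np.clip raises values below `low` to `low` and lowers values above `high` to `high`
capped = np.clip(sample, low, high)
print(capped)
```

Note that this treatment changes every value outside the two percentiles, not only the points flagged as outliers by the Z-score or IQR rules.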
 Step 3: Mean/Median Imputation
 As the mean value is highly influenced by the outliers, it is advised to replace the
outliers with the median value.
Python Code:
 median = np.median(sample)
 # Replace each outlier with the median
 c = np.where(np.isin(sample, sample_outliers), median, sample)
 print("Sample: ", sample)
 print("New array: ", c)
 Step 4: Visualizing the Data after Treating the Outlier
 plt.boxplot(c, vert=False)
 plt.title("Boxplot of the sample after treating the outliers")
 plt.xlabel("Sample")
 plt.show()
