Exploratory Data Analysis Techniques

Uploaded by

dzedziphilly

DSI 237 Statistical analysis

Group 2
GROUP MEMBERS

NAME        SURNAME      REG NUMBER
TANAKA      TONGOWONA    R2212031B
MERCY       MUGAVIRI     R2214108G
PADIEL      GERALD       R2211649M
TINOTENDA   JECHE        R229476X
ISHMAEL     MOYO         R2212593Y
RYAN        HOMBERUME    R2214520V


EXPLORATORY DATA ANALYSIS

 Exploratory Data Analysis (EDA) is a crucial step in the data analysis
process, involving a thorough examination of data through statistical and
visualization tools. The primary objective of EDA is to summarize the data,
uncover patterns, generate hypotheses, and test assumptions, setting
the foundation for in-depth analytics.
 Data scientists leverage EDA to gain insights into datasets, ultimately
influencing business strategies and outcomes. The insights obtained from
EDA, including features extracted, are pivotal not only for further data
analysis and modelling but also for enhancing machine learning
applications.
 Data visualization is a cornerstone of EDA, enabling the representation of
complex data in an easily understandable visual format.
Data Visualization
 Data visualization presents text or numerical data in a visual format,
which makes the information the data express easier to grasp. People
generally remember pictures more easily than text, and Python offers
several libraries for data visualization, such as matplotlib, seaborn,
and plotly.
Exploratory Data Analysis
 Forming hypotheses and testing business assumptions is an important part
of any machine learning problem, and EDA is what helps to accomplish it.
There are various tools and techniques for understanding your data. The
basic prerequisites are NumPy for mathematical operations and pandas for
data manipulation.
Now let us start exploring data and study
different data visualization plots with different
types of data
 Let's get started by importing the libraries and loading the data

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from seaborn import load_dataset

    # Titanic dataset (expects titanic_train.csv in the working directory)
    data = pd.read_csv("titanic_train.csv")
    # tips dataset (seaborn fetches it on first use)
    tips = load_dataset("tips")
Data visualization techniques
HISTOGRAMS

 Histograms provide a graphical representation of the distribution of a
single continuous variable by dividing it into bins and displaying the
count of observations within each bin. They help in understanding the
underlying data distribution by showing measures of central tendency
and spread.
 A histogram plots the value distribution of a numerical column: it splits
the range of values into bins and draws one bar per bin, so we can see
how the values are distributed.

    # dropna() guards against missing ages
    plt.hist(data['Age'].dropna(), bins=5)
    plt.show()
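To make the binning step concrete, the sketch below counts how many values fall into each of five equal-width bins with `np.histogram`, which `plt.hist` relies on for its binning. The `ages` array is made-up illustration data, not the Titanic column.

```python
import numpy as np

# Hypothetical ages, for illustration only (not taken from the Titanic file)
ages = np.array([2, 9, 18, 22, 25, 27, 30, 31, 35, 40, 48, 54, 61, 80])

# Five equal-width bins spanning min(ages) to max(ages)
counts, edges = np.histogram(ages, bins=5)

# One line per bin: left edge, right edge, number of observations
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n}")
```

Each printed count is the height of the corresponding bar in the histogram.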
BAR CHARTS

 Bar charts are commonly used to visualize and compare categorical
variables. They use bars to represent the frequency or count of each
category, making it easy to identify the most prevalent categories and
their relative proportions.
    # Bar height is the mean Age in each passenger class (seaborn's default estimator)
    sns.barplot(x=data['Pclass'], y=data['Age'])
    plt.show()
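As a rough sketch of the counting step behind a frequency bar chart, the snippet below tallies a categorical column with pandas `value_counts`. The small `pclass` series is hypothetical. With the Titanic data the same calls would presumably be applied to `data['Pclass']`.

```python
import pandas as pd

# Hypothetical passenger classes, for illustration only
pclass = pd.Series([3, 1, 3, 2, 3, 1, 3, 2, 3, 3], name="Pclass")

# Count of each category, ordered by class label
counts = pclass.value_counts().sort_index()
print(counts)

# Relative proportion of each category
proportions = counts / counts.sum()
print(proportions)

# counts.plot(kind="bar") would then draw one bar per class (needs matplotlib)
```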
BOX PLOTS

 Box plots, also known as box-and-whisker plots, effectively display the
distribution, central tendency, and spread of a dataset. They provide
insights into the data's minimum, maximum, median, quartiles, and
potential outliers, shedding light on skewness, symmetry, and the
presence of outliers.

    sns.boxplot(x=data['Sex'], y=data['Age'])
    plt.show()
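The numbers a box plot draws can be computed directly. The sketch below derives the five-number summary and the fences at 1.5 times the interquartile range, which is the default whisker rule in matplotlib and seaborn. The `ages` array is hypothetical illustration data.

```python
import numpy as np

# Hypothetical ages, for illustration only
ages = np.array([4, 19, 22, 24, 26, 28, 29, 31, 35, 38, 42, 71])

# Quartiles: the box spans Q1 to Q3, with a line at the median
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Fences: whiskers reach the most extreme points that still lie inside them
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as potential outliers
flagged = ages[(ages < lower_fence) | (ages > upper_fence)]

print("min, Q1, median, Q3, max:", ages.min(), q1, median, q3, ages.max())
print("fences:", lower_fence, upper_fence)
print("flagged:", flagged)
```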
SCATTER PLOTS

 Scatter plots, a fundamental visualization technique, showcase the
relationship between two continuous variables. Each point on the plot
represents an observation, with its position determined by the values of
the compared variables. Scatter plots are excellent for identifying
patterns like trends, clusters, or outliers.

    sns.scatterplot(x=tips["total_bill"], y=tips["tip"])
    plt.show()
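A trend seen in a scatter plot can be summarized with a single number. The sketch below computes the Pearson correlation coefficient with `np.corrcoef`. The bill and tip values are made up for illustration and are not the seaborn tips data.

```python
import numpy as np

# Hypothetical bills and tips, for illustration only
total_bill = np.array([10.0, 15.5, 20.0, 24.5, 30.0, 42.0])
tip = np.array([1.5, 2.0, 3.0, 3.5, 4.5, 6.0])

# Pearson correlation: values near +1 indicate a strong increasing linear trend,
# values near 0 indicate no linear trend
r = np.corrcoef(total_bill, tip)[0, 1]
print(f"Pearson r = {r:.3f}")
```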
Data preprocessing and cleaning
Data preprocessing is a crucial step in the data analysis process that involves
transforming raw data into a clean and organized format for analysis. It aims
to improve the quality of the data, making it more suitable for further analysis.
One of the key aspects of data preprocessing is data cleaning, which involves
handling missing data and outlier detection and treatment.

Handling missing values


Why do missing values occur in data?
 Missing values can occur in data for a number of reasons, such as survey
non-responses or errors in data entry.
 Missing data is grouped into three broad categories:
 Missing completely at random (MCAR).
 Missing at random (MAR).
 Missing not at random (MNAR).
Deletion
 When data is MCAR or MAR, deletion may be a suitable method for
dealing with missing values. However, when data is MNAR, deleting
observations with missing values can lead to bias.
 Listwise deletion
 Pairwise deletion
 Variable deletion
 Example
 Consider a small sample of data from the Nelson-Plosser macroeconomic
dataset:
Year   gnp     ip     emp     cpi
1906   -       9.8    33749   28
1907   -       10     34371   29
1908   -       -      33246   28
1909   116.8   8.5    35072   28
1910   120.1   10     55762   28
Listwise deletion

 Uses only the observations from 1909 and 1910 for all parts of analysis.
 It eliminates all data from 1906-1908 because of the missing values in
gnp and ip.
Pairwise deletion:
 Uses the observations in 1906-1910 when computing the means and
covariances of emp and cpi.
 Uses the observations in 1906-1907 and 1909-1910 when computing the
means and covariances of ip.
 Uses the observations in 1909-1910 when computing the means and
covariances of gnp.
 Uses the observations in 1909-1910 for model estimation other than
means and covariances.
Variable deletion

 Uses only emp and cpi for all parts of analysis.
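The three deletion strategies can be sketched in pandas on the same five-row sample. This is a minimal illustration rather than a prescribed workflow. The DataFrame name `df` and the choice of pandas are assumptions, and the dashes in the table are entered as NaN.

```python
import numpy as np
import pandas as pd

# The five-row sample from the table; "-" entries become NaN
df = pd.DataFrame(
    {
        "gnp": [np.nan, np.nan, np.nan, 116.8, 120.1],
        "ip": [9.8, 10.0, np.nan, 8.5, 10.0],
        "emp": [33749, 34371, 33246, 35072, 55762],
        "cpi": [28, 29, 28, 28, 28],
    },
    index=pd.Index([1906, 1907, 1908, 1909, 1910], name="Year"),
)

# Listwise deletion: keep only the rows with no missing value
listwise = df.dropna()

# Pairwise deletion: each statistic uses every row available for it.
# pandas skips NaN per column in mean() and per pair of columns in cov().
pairwise_means = df.mean()
pairwise_cov = df.cov()

# Variable deletion: drop every column that contains a missing value
variable = df.dropna(axis="columns")

print(listwise.index.tolist())    # years kept by listwise deletion
print(df.count())                 # observations available per variable
print(variable.columns.tolist())  # variables kept by variable deletion
```

Comparing `df.count()` with the listwise result shows how much data each strategy keeps for each variable.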


Imputation
 Imputing data replaces missing values with statistically determined values.
Methods of imputation range from simply replacing missing values with
the mean to sophisticated multiple imputation.
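As a minimal sketch of the simplest case, the snippet below fills the missing 1908 value of ip from the Nelson-Plosser sample with the mean of the observed values, and with the median for comparison. Using pandas `fillna` here is an illustrative choice, not something the slides specify.

```python
import numpy as np
import pandas as pd

# The ip column from the sample table; the 1908 value is missing
ip = pd.Series(
    [9.8, 10.0, np.nan, 8.5, 10.0],
    index=[1906, 1907, 1908, 1909, 1910],
    name="ip",
)

# Mean imputation: fill the gap with the mean of the observed values
ip_mean_imputed = ip.fillna(ip.mean())

# Median imputation: the same idea, less sensitive to extreme values
ip_median_imputed = ip.fillna(ip.median())

print(ip_mean_imputed)
print(ip_median_imputed)
```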
Detecting Outliers with Z-scores
Steps
 Loop through all the data points and compute the Z-score of each using
the formula Z = (Xi - mean) / std.
 Define a threshold value of 3 and mark the data points whose absolute
Z-score is greater than the threshold as outliers.
    import numpy as np

    # Assumed sample: the slides do not list it; these values match the quoted outputs
    sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

    def detect_outliers_zscore(data):
        outliers = []  # local list, so repeated calls do not accumulate
        thres = 3
        mean = np.mean(data)
        std = np.std(data)
        # print(mean, std)
        for i in data:
            z_score = (i - mean) / std
            if np.abs(z_score) > thres:
                outliers.append(i)
        return outliers

    # Driver code
    sample_outliers = detect_outliers_zscore(sample)
    print("Outliers from Z-scores method: ", sample_outliers)
 The above code outputs: Outliers from Z-scores method: [101]
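The same check can be written without an explicit loop. The sketch below computes every Z-score at once and selects outliers with a boolean mask. The `sample` values are an assumption: the slides never list the array, and these values were chosen because they are consistent with the outputs quoted in this section.

```python
import numpy as np

# Assumed sample: not given in the slides, chosen to match the quoted outputs
sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
values = np.asarray(sample)

# Z-score of every point in one step (np.std uses the population formula)
z_scores = (values - values.mean()) / values.std()

# Boolean mask marking points whose absolute Z-score exceeds the threshold
outlier_mask = np.abs(z_scores) > 3

print(np.round(z_scores, 2))
print(values[outlier_mask])
```

Printing the rounded Z-scores shows how far each point sits from the threshold, which the loop version hides.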
Detecting Outliers using the Interquartile Range (IQR)
 Criterion: data points that lie more than 1.5 times the IQR above Q3 or
below Q1 are outliers.

Steps
 Sort the dataset in ascending order.
 Calculate the 1st and 3rd quartiles (Q1, Q3).
 Compute IQR = Q3 - Q1.
 Compute lower bound = Q1 - 1.5*IQR and upper bound = Q3 + 1.5*IQR.
 Loop through the values of the dataset and mark those that fall below
the lower bound or above the upper bound as outliers.
Python Code
    def detect_outliers_iqr(data):
        outliers = []  # local list, so repeated calls do not accumulate
        data = sorted(data)
        q1 = np.percentile(data, 25)
        q3 = np.percentile(data, 75)
        # print(q1, q3)
        IQR = q3 - q1
        lwr_bound = q1 - (1.5 * IQR)
        upr_bound = q3 + (1.5 * IQR)
        # print(lwr_bound, upr_bound)
        for i in data:
            if i < lwr_bound or i > upr_bound:
                outliers.append(i)
        return outliers

    # Driver code (sample is the same data used for the Z-score method)
    sample_outliers = detect_outliers_iqr(sample)
    print("Outliers from IQR method: ", sample_outliers)
 The above code outputs: Outliers from IQR method: [101]
How to Handle Outliers?
 So far we have covered detecting outliers. The steps below cover
treating them.
Step 1: Trimming/Removing the outliers
 In this technique, we remove the outliers from the dataset. This is
generally not good practice, because it discards observations.
 Python code to delete the outliers and copy the rest of the elements to
another array:
    # Trimming
    a = np.asarray(sample)
    for i in sample_outliers:
        a = np.delete(a, np.where(a == i))
    print(a)
    # print(len(sample), len(a))
 The outlier ‘101’ is deleted and the rest of the data points are copied to
another array ‘a’.

Step 2: Quantile Based Flooring and Capping


 In this technique, values above the 90th percentile are capped at the
90th percentile value, and values below the 10th percentile are floored at
the 10th percentile value.
Python code to compute the percentiles and replace the outliers:
    # Computing the 10th and 90th percentiles and replacing the outliers
    sample_arr = np.asarray(sample)
    tenth_percentile = np.percentile(sample_arr, 10)
    ninetieth_percentile = np.percentile(sample_arr, 90)
    # print(tenth_percentile, ninetieth_percentile)
    b = np.where(sample_arr < tenth_percentile, tenth_percentile, sample_arr)
    b = np.where(b > ninetieth_percentile, ninetieth_percentile, b)
    # print("Sample:", sample)
    print("New array:", b)
 The above code outputs: New array: [15, 20.7, 18, 7.2, 13, 16, 11, 20.7, 7.2, 15,
10, 9]

 The data points that are less than the 10th percentile are replaced with the
10th percentile value, and the data points that are greater than the 90th
percentile are replaced with the 90th percentile value.
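Flooring and capping can also be done in a single call. The sketch below uses `np.clip` as a compact alternative to two `np.where` passes. The `sample` values are an assumption, chosen because they are consistent with the percentiles and the output array quoted in this section.

```python
import numpy as np

# Assumed sample: not given in the slides, chosen to match the quoted outputs
sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
values = np.asarray(sample)

# 10th and 90th percentiles of the sample
low, high = np.percentile(values, [10, 90])

# np.clip floors values below `low` and caps values above `high`
capped = np.clip(values, low, high)

print(low, high)
print(capped)
```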
Step 3: Mean/Median Imputation
 As the mean value is highly influenced by the outliers, it is advised to replace the
outliers with the median value.
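The effect of a single outlier on the mean and the median can be checked directly. In the sketch below the `sample` values are an assumption, chosen because they are consistent with the outputs quoted in this section.

```python
import numpy as np

# Assumed sample: not given in the slides, chosen to match the quoted outputs
sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
values = np.asarray(sample)

# The same data with the outlier 101 removed
without_outlier = values[values != 101]

print("mean   with / without 101:", values.mean(), without_outlier.mean())
print("median with / without 101:", np.median(values), np.median(without_outlier))
```

The mean shifts by several units when the outlier is removed, while the median moves by one, which is why the median is the safer replacement value.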
Python Code:
    c = np.asarray(sample, dtype=float)
    median = np.median(c)
    # Replace each outlier with the median
    for i in sample_outliers:
        c = np.where(c == i, median, c)
    print("Sample: ", sample)
    print("New array: ", c)
Step 4: Visualizing the Data after Treating the Outlier
    plt.boxplot(c, vert=False)
    plt.title("Boxplot of the sample after treating the outliers")
    plt.xlabel("Sample")
    plt.show()
