0% found this document useful (0 votes)
8 views45 pages

Unit-2 Data Collection, Cleaning & Preprocessing

Unit 2 covers data collection, cleaning, and preprocessing techniques, emphasizing the importance of primary and secondary data sources. It discusses methods for handling missing data, outliers, and noise, along with practical programming examples using Python libraries like Pandas and NumPy. The document highlights the significance of these processes in ensuring data quality for analysis and decision-making.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views45 pages

Unit-2 Data Collection, Cleaning & Preprocessing

Unit 2 covers data collection, cleaning, and preprocessing techniques, emphasizing the importance of primary and secondary data sources. It discusses methods for handling missing data, outliers, and noise, along with practical programming examples using Python libraries like Pandas and NumPy. The document highlights the significance of these processes in ensuring data quality for analysis and decision-making.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Unit 2 Data Collection, Cleaning & Preprocessing

1. Sources (Primary and Secondary),


2. Handling missing data,
3. Outliers,
4. Noise,
5. Transformation,
6. Normalization and
7. Data wrangling using libraries like Pandas and numpy.
Collect → Clean → Prepare

1. Sources of Data (Primary and Secondary)


Data Sources
A data source is the origin from where data is collected for analysis,
decision-making, or research purposes.

Broadly classified into:

1. Primary Data Sources


2. Secondary Data Sources

PRIMARY DATA SOURCES

Primary data is data collected directly by a researcher or


organization for a specific purpose.
It is first-hand, original, fresh, problem-specific, and highly reliable,
but its collection requires more time, effort, and cost.

Primary Data Sources


• Surveys & Questionnaires: Online, paper, or telephonic surveys
Example: Student feedback form

1
• Interviews: Face-to-face, telephonic, or video interviews
Example: Employee job-satisfaction interview
• Observations: Direct or participant observation
Example: Customer behavior analysis
• Experiments: Controlled or scientific testing
Example: Testing a new medicine
• Sensors & IoT: Automated data from devices
Example: Smartwatch heart-rate data

Applications: Market research, academic studies, healthcare


analytics, customer satisfaction, scientific research.

Advantages: Accurate, relevant, fresh, quality-controlled.


Disadvantages: Costly, time-consuming, manpower-intensive.

Secondary Data Sources


Secondary data is data already collected, processed, and published
by others for a different purpose and reused for analysis.
It is easy to access, cost-effective, and time-saving, but may not
exactly match the current problem.

Types of Secondary Data


• Government Data: Census, economic and health reports
Example: India Census data
• Public Datasets: Kaggle, UCI, Google Dataset Search
Example: Student dataset from Kaggle
• Company Reports: Annual, sales, financial reports
Example: Previous year sales report
• Online Databases: Journals, data portals, libraries
Example: Research datasets

2
• Web & Social Media Data: Reviews, logs, social media data
Example: Amazon product reviews

Applications

Trend analysis, comparative studies, forecasting, academic


research, policy making.

Advantages : Low cost, quick access, large datasets.


Disadvantages: Less accurate, possibly outdated, limited quality
control, not problem-specific.
In short:

Primary data = First-hand, accurate, costly


Secondary data = Already available, cheap, quick

Program for Primary Data Collection


Example 1: Collecting Primary Data using Python (User Input)

import pandas as pd

# Collecting primary data


names = []
marks = []

for i in range(3):
name = input("Enter student name: ")
mark = int(input("Enter marks: "))
[Link](name)
[Link](mark)

3
# Creating DataFrame
data = [Link]({
"Name": names,
"Marks": marks
})

print(data)

Explanation
• input() → collects data directly from user (primary data)
• Data is stored in lists
• Pandas converts it into a structured dataset
Output

Enter student name: Amit


Enter marks: 85
Enter student name: Neha
Enter marks: 90
Enter student name: Rahul
Enter marks: 78

Name Marks
0 Amit 85
1 Neha 90
2 Rahul 78

This is a perfect example of a Primary Data Collection program

Program for Secondary Data Collection


(Data already exists — we only fetch/load it)

4
Example 2: Collecting Secondary Data from a CSV file

import pandas as pd

data = pd.read_csv("[Link]")
print([Link]())

Already collected data → secondary data

Classification: Primary/ Secondary

Program Type Data Type


User input program Primary data
Survey / Form program Primary data
CSV / Excel loading Secondary data
API / Web scraping Secondary data
Sensor data Primary data

2. Handling Missing Data


Missing data refers to the absence of expected values in a dataset,
represented as blank, null, or NaN entries.
It can occur in numerical, categorical, and time-series data.
Causes:
• Human input errors, System or sensor failure, Data corruption,

Survey non-response, Data integration issues


Effects: Biased analysis, Incorrect statistics, Reduced dataset size,
Errors in ML models, Lower prediction accuracy

Therefore, handling missing data is a key step in data cleaning.

5
METHODS TO HANDLE MISSING DATA
(1) Removing Missing Data
a) Removing Rows: Use when missing data is very small.

[Link]()

b) Removing Columns: Removes unusable columns

[Link](axis=1)

(2) Filling Missing Data (Imputation)

a) Fill with Mean (Numerical)

df['Marks'].fillna(df['Marks'].mean())

b) Fill with Median


df['Marks'].fillna(df['Marks'].median())

c) Fill with Mode (Categorical)

df['Grade'].fillna(df['Grade'].mode()[0])

(3) Forward Fill / Backward Fill


Forward Fill replaces missing values with the previous (above) value
in the column.
Example:
Day Sales
1 100
2 NaN

6
3 150
Using Forward Fill:
Day Sales
1 100
2 100
3 150

Backward Fill (bfill): replaces missing values with the next (below)
value in the column.
Example:
Day Sales
1 100
2 NaN
3 150
Using Backward Fill:
Day Sales
1 100
2 150
3 150

[Link](method='ffill')
[Link](method='bfill')

Using Constant Value

[Link](0)

PROGRAM (Pandas)
Example Dataset: Student Marks

7
import pandas as pd
import numpy as np

data = {
'Name': ['Amit', 'Neha', 'Rahul', 'Priya'],
'Marks': [85, [Link], 78, [Link]],
'Grade': ['A', 'A', None, 'B']
}

df = [Link](data)
print("Original Data:")
print(df)

Output

Name Marks Grade


0 Amit 85.0 A
1 Neha NaN A
2 Rahul 78.0 None
3 Priya NaN B

Handling Missing Values


# Fill Marks with mean
df['Marks'] = df['Marks'].fillna(df['Marks'].mean())

# Fill Grade with mode


df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0])

print("\nAfter Handling Missing Data:")


print(df)

8
Output

Name Marks Grade


0 Amit 85.00 A
1 Neha 81.50 A
2 Rahul 78.00 A
3 Priya 81.50 B

APPLICATIONS OF HANDLING MISSING DATA

• Healthcare analytics (missing patient records)


• Survey analysis, Financial data analysis, ML preprocessing
• Business decision making

Handling missing data means identifying, analyzing, and treating


absent values to improve data quality and accuracy.

3. Outliers
They are extreme values that differ significantly from the majority
of data and can distort analysis and model performance. In simple
terms: Outliers are values that lie far away from the normal range
of data.

If outliers are not handled:

• average (mean) becomes misleading,


• statistical results get distorted,
• machine learning models perform poorly,
• wrong business decisions may be taken.

9
Hence, outlier detection and treatment is a crucial step of Data
Cleaning.

CAUSES OF OUTLIERS

Cause Example
Extra zero in salary (500000 instead of
Data entry error
50000)
Measurement
Faulty sensor
error
Natural variation Extremely high income
Experimental error Wrong unit used

Methods of Outlier Detection


1. Mean & Standard Deviation Method

Mean–Standard Deviation method detects a value as outlier


when it lies beyond mean ± 3 standard deviations.

Program:
import numpy as np

data = [Link]([10, 12, 11, 13, 12, 100])


mean = [Link](data)

std = [Link](data)

print("Mean:",mean)

print("standard deviation:",std)
outliers = data[(data < mean - 3*std) | (data > mean + 3*std)]

10
print(outliers)

Output:

Why output is empty []

• Mean ≈ 26.33
• Standard Deviation ≈ 32.96
• Upper limit = 26.33 + 3×32.96 ≈ 125.21
• Lower limit = 26.33 − 3×32.96 ≈ −72.54

➔Value 100 lies within this range, so no outlier is detected using


the ±3σ rule.

Another example where outlier comes:

import numpy as np
data = [Link]([
50, 51, 49, 50, 52, 48, 51, 49, 50, 52,1000
])
mean = [Link](data)
std = [Link](data)
print("Mean:", mean)

print("Standard Deviation:", std)


outliers = data[(data < mean - 3*std) | (data > mean + 3*std)]

print("Found Outliers:", outliers)

11
Output:

Mean: 136.54545454545453

Standard Deviation: 273.0508988969451

Outliers: [1000]

2. Z-Score Method
Z-score shows how many standard deviations a value is away from
the mean. If Z-score is greater than +3 or less than −3, the value is
an outlier.
Program:
import numpy as np
data = [Link]([50, 51, 49, 50, 52, 48, 51, 49, 50, 52,1000
])
mean = [Link](data)
std = [Link](data)
print("Mean:", mean)
print("Standard Deviation:", std)
z = (data - mean) / std
print("Z :", z)
outliers = data[(z > 3) | (z < -3)]
print(outliers)
output:
Mean: 136.54545454545453

Standard Deviation: 273.0508988969451

12
Z : [-0.31695722 -0.3132949 -0.32061954 -0.31695722 -
0.30963258 -0.32428186

-0.3132949 -0.32061954 -0.31695722 -0.30963258


3.16224758]

[1000]
3. IQR (Interquartile Range) Method
Values lying below the lower bound or above the upper bound
are outliers.
Program:
import numpy as np

data = [Link]([10, 12, 11, 13, 12, 100])


# Step 2: Calculate the First Quartile (Q1)

# Q1 represents the value below which 25% of the data lies


Q1 = [Link](data, 25)

# Step 3: Calculate the Third Quartile (Q3)

# Q3 represents the value below which 75% of the data lies


Q3 = [Link](data, 75)

# Step 4: Calculate the Interquartile Range (IQR)

# IQR is the range of the middle 50% of the data

IQR = Q3 - Q1

13
# Step 5: Calculate the Lower Bound

# Values below this limit are considered outliers

lower_bound = Q1 - 1.5 * IQR


# Step 6: Calculate the Upper Bound

# Values above this limit are considered outliers

upper_bound = Q3 + 1.5 * IQR

# Step 7: Identify outliers


# Values outside the lower and upper bounds are outliers

outliers = data[(data < lower_bound) | (data > upper_bound)]

# Step 8: Print all results

print("Data:", data)
print("Q1 (25th percentile):", Q1)

print("Q3 (75th percentile):", Q3)

print("IQR:", IQR)

print("Lower Bound:", lower_bound)

print("Upper Bound:", upper_bound)


print("Outliers:", outliers)
Output:

14
Data: [ 10 12 11 13 12 100]

Q1 (25th percentile): 11.25

Q3 (75th percentile): 12.75


IQR: 1.5

Lower Bound: 9.0

Upper Bound: 15.0

Outliers: [100]
4. Box Plot Method
A box plot visually shows outliers as points outside the whiskers.
Program:

import pandas as pd

import [Link] as plt

df = [Link]({'Marks': [45, 50, 52, 48, 49, 120]})

[Link](column='Marks')
[Link]()

Output: Box plot showing 120 as an outlier


5. Scatter Plot Method
In a scatter plot, outliers appear as points far away from other
points.

15
Program:

import [Link] as plt

data = [10, 12, 11, 13, 12, 100]


[Link](range(len(data)), data)

[Link]()

Output: Scatter plot showing 100 far from other values

APPLICATIONS OF OUTLIER HANDLING


• Fraud detection
• Medical diagnosis
• Quality control
• Financial analytics
• Network security

4. Noise in Data
Noise refers to random, irrelevant, or meaningless data values that do not
represent the true underlying pattern of the dataset. In simple words: Noise
is unwanted disturbance in data that hides useful information.

Some variations occur due to:

• faulty instruments, environmental disturbances,


• transmission errors, human mistakes.
These unwanted and meaningless variations are called noise.

16
If noise is not handled:

• patterns become unclear, accuracy of analysis reduces,


• ML models overfit or underperform, wrong conclusions are drawn.

Therefore, noise handling is a crucial part of Data Cleaning.

SOURCES / CAUSES OF NOISE

Source
Sensor malfunction
Data entry errors
Transmission errors
Environmental factors
Measurement limitations

DIFFERENCE: NOISE vs OUTLIERS

Noise Outliers
Random disturbance Extreme value
Small variations Large deviation
Affects many points Affects few points
Hard to identify Easy to detect

METHODS TO HANDLE / REMOVE NOISE


1) Smoothing Techniques

Moving Average: Replaces value with average of neighbors

Exponential Smoothing: Gives more weight to recent values

17
2) Binning Method

• Sort data
• Divide into bins
• Replace values with bin mean/median

3) Regression

• Fit a line or curve


• Remove deviations from trend

4) Clustering :
• Noise points lie far from clusters

NOISE HANDLING – COMPLETE PROGRAM

Example Dataset: Daily Temperature (°C)

30, 31, 29, 45, 30, 32, 31


Here 45 is clearly abnormal (noise).
METHOD 1: Moving Average (Manual Calculation)

Formula
𝑋1 + 𝑋2 + 𝑋3
Moving Average =
3

We use window size = 3


Step-by-Step Calculation

For 3rd value (index 2)

18
(30 + 31 + 29)/3 = 90/3 = 30

For 4th value (index 3)


(31 + 29 + 45)/3 = 105/3 = 35

For 5th value

(29 + 45 + 30)/3 = 104/3 = 34.67

For 6th value


(45 + 30 + 32)/3 = 107/3 = 35.67

For 7th value

(30 + 32 + 31)/3 = 93/3 = 31

Result (Manual)

Original Smoothed
30 -
31 -
29 30
45 35
30 34.67
32 35.67
31 31

19
Noise 45 is reduced from 45 → 35

Program (Moving Average)


import pandas as pd

data = {'Temperature': [30, 31, 29, 45, 30, 32, 31]}

df = [Link](data)

df['Moving_Avg'] = df['Temperature'].rolling(window=3).mean()
print(df)

Output

Temperature Moving_Avg

0 30 NaN

1 31 NaN
2 29 30.00

3 45 35.00

4 30 34.67

5 32 35.67
6 31 31.00

Explanation

• rolling(window=3) takes 3 values


• mean() calculates average

20
• Sudden spike 45 becomes smoother (35)

METHOD 2: Binning (Manual Working)

Step 1: Sort Data


29, 30, 30, 31, 31, 32, 45

Step 2: Divide into 3 bins

Bin 1 → 29, 30
Bin 2 → 30, 31
Bin 3 → 31, 32, 45

Step 3: Replace by Mean

Bin 1 Mean:
(29 + 30)/2 = 59/2 = 29.5

Bin 2 Mean:

(30 + 31)/2 = 61/2 = 30.5

Bin 3 Mean:
(31 + 32 + 45)/3 = 108/3 = 36

Noise 45 becomes 36

Program (Binning)

import numpy as np

21
sorted_data = [Link](df['Temperature'])

bins = np.array_split(sorted_data, 3)
bin_means = [[Link](bin) for bin in bins]

print("Bins:", bins)

print("Bin Means:", bin_means)

Output
Bins: [array([29, 30]), array([30, 31]), array([31, 32, 45])]
Bin Means: [29.5, 30.5, 36.0]

Explanation

• Data sorted
• Divided into equal groups
• Each group replaced with its mean
• Noise value adjusted

METHOD 3: Regression (Manual Concept)

We fit a straight line:

𝑦 = 𝑚𝑥 + 𝑐
𝑚
𝑏
-10-8-6-4-2246810-10-5510y-interceptx-intercept

22
It finds best trend.

Noise point (45) will not strongly affect trend because line averages effect.

(Manual regression calculation is long, so we demonstrate


programmatically.)

Program (Linear Regression)


import numpy as np

from sklearn.linear_model import LinearRegression

X = [Link](len(df)).reshape(-1, 1)

y = df['Temperature']
model = LinearRegression()

[Link](X, y)

df['Trend_Line'] = [Link](X)

print(df)

Output (Approx)
Temperature Trend_Line

0 30 30.1

1 31 31.3

2 29 32.5
3 45 33.7

23
4 30 34.9

5 32 36.1

6 31 37.3
Explanation

• Model finds best straight line


• Instead of using noisy value 45 directly
• Uses smooth predicted values

MINI PROJECT
Title: Noise Handling and Data Smoothing Techniques Using Python
Objective: To identify and reduce noise in numerical data using various
smoothing techniques such as:
• Moving Average,Exponential Smoothing, Binning, Regression

Problem Statement
Real-world datasets such as sensor readings, stock prices, or temperature
data often contain noise — random irregular values that hide the true
trend.
Example:
A temperature sensor may suddenly show an abnormal value due to device
error.
The goal is to clean the dataset and preserve the real pattern.

24
5. Data Transformation
After data collection and data cleaning (handling missing values,
outliers, noise), the data may still not be suitable for analysis or
machine learning.

Reasons:

• Data may be in wrong format


• Values may be on different scales
• Categorical data may exist

➔ Data Transformation converts clean data into suitable format


so that it can be efficiently analyzed, visualized, or used for
modeling. It is a core step of Data Preprocessing.

OBJECTIVES:

• Improve data consistency


• Reduce skewness, Improve accuracy of analysis
• Convert raw data into meaningful features

Applications:
• DA, ML, DWg, BI, Healthcare & Finance analytics

Dataset Used
import pandas as pd
import numpy as np

data = {
'Name': ['Amit', 'Shruti', 'Rahul', 'Pooja'],
'Age': [20, 21, 22, 21],
25
'Marks': [45, 78, 90, 60],
'City': ['Delhi', 'Mumbai', 'Delhi', 'Pune']
}
df = [Link](data)
print("RAW DATA", df)
Output
Name Age Marks City
0 Amit 20 45 Delhi
1 Shruti 21 78 Mumbai
2 Rahul 22 90 Delhi
3 Pooja 21 60 Pune

TYPES OF DATA TRANSFORMATION


1) Scaling / Normalization
Scaling converts numerical values into a smaller, comparable range
(usually 0–1).

Program (Min-Max)
df['Marks_Normalized'] = (df['Marks'] - df['Marks'].min()) /
(df['Marks'].max() - df['Marks'].min'])

print(df[['Marks', 'Marks_Normalized']])
Output
Marks Marks_Normalized
0 45 0.00
1 78 0.61
2 90 1.00
3 60 0.33
Explanation
• Minimum (45) becomes 0

• Maximum (90) becomes 1

26
• Other values fall between 0 and 1
Used in Machine Learning.

2) Log / Mathematical Transformation


Log transformation reduces skewness and compresses large
numeric values.
Program
df['Marks_Log'] = [Link](df['Marks'])
print(df[['Marks', 'Marks_Log']])
Output (Approx)
Marks Marks_Log
0 45 3.81
1 78 4.36
2 90 4.50
3 60 4.09
Explanation
• Large differences become smaller

• Useful when data is highly skewed

3) Encoding (Categorical → Numerical)


Encoding converts text data into numerical form. ML algorithms
cannot process text. They need numeric representation

Program (Label Encoding)


df['City_Code'] = df['City'].astype('category').[Link]
print(df[['City', 'City_Code']])
Output
City City_Code
0 Delhi 0
1 Mumbai 1
2 Delhi 0

27
3 Pune 2
Explanation
df['City']: This selects the City column.
.astype('category'): converts column into categorical type, as
Unique categories.
Internally, Pandas assigns:
Delhi → Category 0
Mumbai → Category 1
Pune → Category 2
(Alphabetical order by default)
.[Link]: This extracts numeric code of each category.

4) Aggregation
Aggregation summarizes data using operations like mean, sum,
count.
Program
avg_marks = [Link]('City')['Marks'].mean()
print(avg_marks)
Output
City
Delhi 67.5
Mumbai 78.0
Pune 60.0
Explanation
• Delhi has two students → average = 67.5

• Data is grouped and summarized

5) Feature Creation (Derived Column)


Creating new meaningful columns from existing data.
Program
df['Result'] = [Link](df['Marks'] >= 40, 'Pass', 'Fail')

28
print(df[['Name', 'Marks', 'Result']])
Output
Name Marks Result
0 Amit 45 Pass
1 Shruti 78 Pass
2 Rahul 90 Pass
3 Pooja 60 Pass
Explanation
• Marks ≥ 40 → Pass

• New column created from Marks

Improves interpretability.

6) Discretization / Binning
Converts continuous numbers into categories.
Program
bins = [0, 50, 75, 100]
labels = ['Low', 'Medium', 'High']

df['Performance'] = [Link](df['Marks'], bins=bins, labels=labels)

print(df[['Marks', 'Performance']])
Output
Marks Performance
0 45 Low
1 78 High
2 90 High
3 60 Medium
Explanation
• 0–50 → Low

• 50–75 → Medium

• 75–100 → High

29
Makes analysis easier.

7) Data Type Conversion


Changing data type for correctness or consistency.
Program
df['Age'] = df['Age'].astype(float)
print([Link])
Output
Name object
Age float64
Marks int64
City object
...
Explanation
• Age converted from integer to float

• Ensures correct numeric operations

FINAL TRANSFORMED DATASET


print("\nFINAL TRANSFORMED DATA")
print(df)
Now dataset contains:
• Normalized values

• Log values

• Encoded values

• Derived features

• Binned categories

6. DATA NORMALIZATION
It is a data preprocessing technique used to rescale numerical values into
a common range so that no feature dominates others due to its
magnitude.

30
- It makes different numerical features comparable.

Why Normalization is Required?

In real datasets, features have different scales:

Feature Range Example


Salary 10,000 – 1,00,000
Age 18 – 60
Marks 0 – 100

If we use raw data:

• Large values dominate, Give biased results, Training becomes slow

Benefits of Normalization
• Equal importance to all features, Faster model convergence
• Improved accuracy, Better performance of ML models

Used:
• ML (KNN, SVM, Neural Networks)
• Data Mining , Clustering
• Recommendation Systems, Statistical Analysis

We will use this dataset:

import pandas as pd
data = {
'Student': ['Amit', 'Neha', 'Rahul', 'Priya', 'Karan'],
'Marks': [45, 88, 76, 95, 60]
}
df = [Link](data)
print("Original Data:")
31
print(df)
Output:
Student Marks
0 Amit 45
1 Neha 88
2 Rahul 76
3 Priya 95
4 Karan 60

Types of Data Normalization


1. Min–Max Normalization
2. Z-Score Normalization
3. Decimal Scaling
4. Max Normalization

1- Min–Max Normalization : It rescales data into a fixed range (usually


0 to 1). It subtracts the minimum value and divides by the data
range to rescale data between 0 and 1.
𝑿−𝑿𝒎𝒊𝒏
Formula: 𝑿′ = 𝑿 −𝑿
𝒎𝒂𝒙 𝒎𝒊𝒏
Meaning of Each Term
• X → Original value // (Ex: 76)
• Xₘᵢₙ → Minimum value in dataset // (Ex: 45)
• Xₘₐₓ → Maximum value in dataset // (Ex: 95)
• X − Xₘᵢₙ → Distance of value from min // (76−45=31)
• Xₘₐₓ − Xₘᵢₙ → Total range of data // (95−45=50)
• X′ → Normalized value (b/w 0 and 1) // X′=31/50=0.62
So, according to example, normalized value of 76 ≈ 0.62
This Formula Actually works as:
• Converts minimum value → 0

• Converts maximum value → 1

• Converts all other values → between 0 and 1

32
It rescales data proportionally within range.

Program
df['Marks_MinMax'] = (df['Marks'] - df['Marks'].min()) /
(df['Marks'].max() - df['Marks'].min())

print(df[['Student', 'Marks_MinMax']])
Output
Student Marks_MinMax
0 Amit 0.000000
1 Neha 0.782609
2 Rahul 0.565217
3 Priya 1.000000
4 Karan 0.260870
➔ Best when min and max values are known

2- Z-Score Normalization (Standardization)


It transforms data so that it has mean = 0 and standard deviation = 1.
In simple terms, Z-Score normalization subtracts the mean from each
value and divides by standard deviation to standardize data with mean 0
and standard deviation 1.

Formula:
Where:
• X = Original value

• μ (mu) = Mean of dataset

• σ (sigma) = Standard deviation

• Z = Standardized value

Example Dataset: Marks = 45, 88, 76, 95, 60


33
Step 1: Find Mean (μ)
Formula:
Sum of all values
𝝁=
Number of values

𝟒𝟓+𝟖𝟖+𝟕𝟔+𝟗𝟓+𝟔𝟎 𝟑𝟔𝟒
𝝁= = =𝟕𝟐. 𝟖
𝟓 𝟓

So, Mean (μ) = 72.8

Step 2: Find Standard Deviation (σ)


Formula:
∑(𝑿 − 𝝁)𝟐
𝝈=√
𝑵

Now calculate each part:

X X−μ (X − μ)²

45 -27.8 772.84

88 15.2 231.04

76 3.2 10.24

95 22.2 492.84

60 -12.8 163.84

34
Now sum of squared differences:
𝟕𝟕𝟐. 𝟖𝟒 + 𝟐𝟑𝟏. 𝟎𝟒 + 𝟏𝟎. 𝟐𝟒 + 𝟒𝟗𝟐. 𝟖𝟒 + 𝟏𝟔𝟑. 𝟖𝟒 = 𝟏𝟔𝟕𝟎. 𝟖

Divide by N = 5:
𝟏𝟔𝟕𝟎. 𝟖
= 𝟑𝟑𝟒. 𝟏𝟔
𝟓

Now square root:


𝝈 = √𝟑𝟑𝟒. 𝟏𝟔
𝝈 ≈ 𝟏𝟖. 𝟐𝟖

So, Standard Deviation (σ) ≈ 18.28

Step 3: Apply Z-Score Formula


Let’s find Z-score for 76
𝟕𝟔 − 𝟕𝟐. 𝟖
𝒁=
𝟏𝟖. 𝟐𝟖

𝟑. 𝟐
𝒁=
𝟏𝟖. 𝟐𝟖

𝒁 ≈ 𝟎. 𝟏𝟕

Interpretation,
Z = 0.17 means:
- 76 is 0.17 standard deviations above the mean
• Positive value → Above mean

• Negative value → Below mean

• 0 → Exactly at mean

Z-Score Normalization Program


import pandas as pd
35
import numpy as np

# Step 1: Create dataset


marks = [45, 88, 76, 95, 60]
df = [Link]({'Marks': marks})
print("Original Data:", df)

# Step 2: Calculate Mean


mean = df['Marks'].mean()
print("\nMean:", mean)

# Step 3: Calculate Standard Deviation


std = df['Marks'].std()
print("Standard Deviation:", std)

# Step 4: Apply Z-Score Formula


df['Z_Score'] = (df['Marks'] - mean) / std

print("\nData after Z-Score Normalization:")


print(df)

Output:
Original Data:
Marks
0 45
1 88
2 76
3 95
4 60

Mean: 72.8

36
Standard Deviation: 19.99

Data after Z-Score Normalization:


Marks Z_Score
0 45 -1.39
1 88 0.76
2 76 0.16
3 95 1.11
4 60 -0.64

Values are centered around mean 0


Can produce negative and positive values
Useful when data has outliers
Commonly used in ML algorithms

3- Decimal Scaling Normalization


Decimal scaling moves the decimal point of values so that the
maximum value becomes less than 1.
Formula

𝑿
𝑿 = 𝒋
𝟏𝟎

Where:
• X = Original value

• j = Smallest integer such that max value becomes < 1

• X′ = Normalized value

In decimal scaling, j is chosen equal to the number of digits in the


maximum value so that dividing by 𝟏𝟎𝒋 makes all values less than 1.
Rule in Decimal Scaling
We choose j such that:
Maximum value
<𝟏
𝟏𝟎𝒋
37
That means:
𝟏𝟎𝒋 > Maximum value
We use same dataset:
Marks = 45, 88, 76, 95, 60

Step 1: Find j
Maximum value = 95
To make 95 < 1, divide by 100.
𝟗𝟓/𝟏𝟎𝟎 = 𝟎. 𝟗𝟓
So,
𝟏𝟎𝒋 = 𝟏𝟎𝟎
𝒋=𝟐

Step 2: Apply Formula


For example, X = 76

𝟕𝟔
𝑿 = 𝟐
𝟏𝟎

𝟕𝟔
𝑿 =
𝟏𝟎𝟎

𝑿 = 𝟎. 𝟕𝟔

Program
import pandas as pd

data = {
'Student': ['Amit', 'Neha', 'Rahul', 'Priya', 'Karan'],
'Marks': [45, 88, 76, 95, 60]
}
df = [Link](data)

38
# Decimal Scaling

df['Marks_Decimal'] = df['Marks'] / 100

print(df[['Student', 'Marks_Decimal']])
Output
Student Marks_Decimal
0 Amit 0.45
1 Neha 0.88
2 Rahul 0.76
3 Priya 0.95
4 Karan 0.60
Explanation
• We divided all values by 100

• Because maximum value (95) has 2 digits

• All values become less than 1

• Very simple and fast method

5. Max Normalization
It divides each value by the maximum value in the dataset.
𝑿
Formula: 𝑿′ = 𝑿
𝒎𝒂𝒙

Where:
• X = Original value

• X_max = Maximum value in dataset

• X′ = Normalized value

We use same dataset:


Marks = 45, 88, 76, 95, 60

Step 1: Find Maximum


Maximum = 95
39
Step 2: Apply Formula
For example, X = 76

𝟕𝟔
𝑿 =
𝟗𝟓

𝑿 = 𝟎. 𝟖𝟎

Program
# Max Normalization
df['Marks_MaxNorm'] = df['Marks'] / df['Marks'].max()

print(df[['Student', 'Marks_MaxNorm']])

Output
Student Marks_MaxNorm
0 Amit 0.47
1 Neha 0.93
2 Rahul 0.80
3 Priya 1.00
4 Karan 0.63
Explanation
• Maximum value (95) becomes 1

• Other values are scaled proportionally

• Easy and quick method

• Useful when only maximum value is important

Thus, Data normalization converts numerical values into a common


scale to improve fairness, stability, and accuracy of machine learning
and analytical models.

40
7. DATA WRANGLING
Data Wrangling (also called Data Munging) is the complete process of
converting raw, messy, incomplete, inconsistent data into a clean,
structured, meaningful, and usable format suitable for data analysis and
machine learning.

It combines multiple preprocessing steps, such as:


• Handling missing values
• Removing duplicates
• Fixing data types
• Treating outliers and noise
• Transforming data
• Normalizing values
• Feature creation

Without data wrangling, analysis results are inaccurate and misleading.

TOOLS USED FOR DATA WRANGLING


Libraries: Pandas, NumPy, Matplotlib / Seaborn (optional)

DATA WRANGLING PROCESS

1. Load data
2. Inspect data
3. Handle missing values
4. Remove duplicates
5. Fix data types
6. Handle outliers
7. Data transformation
8. Data normalization
9. Final clean dataset

41
COMPLETE DATA WRANGLING PROGRAM
MINI PROJECT

Title: Data Wrangling and Preprocessing on Employee Salary Dataset


Using Pandas

Objective: To clean, preprocess, transform, and normalize a messy


employee dataset using Python and apply complete data wrangling
techniques.
Problem Statement: Real-world datasets often contain:

• Missing values
• Duplicate records
• Incorrect data types
• Outliers
• Inconsistent formats

These issues must be corrected before performing analysis or applying machine learning
algorithms.

This project demonstrates complete data wrangling step-by-step.

Dataset Description

The dataset contains the following columns:

Column Description
Employee Employee name
Salary Monthly salary
Experience Years of experience
Age Employee age

Tools & Technologies Used:

• Python
• Pandas
• NumPy

42
Implementation

Step 1: Create Raw Dataset

import pandas as pd

import numpy as np

data = {

'Employee': ['Aman', 'Riya', 'Kunal', 'Riya', 'Vikas', 'Meena'],

'Salary': [50000, [Link], 45000, [Link], 999999, 40000],

'Experience': [5, 3, None, 3, 8, 2],

'Age': ['25', '28', '30', '28', '35', 'thirty']

df = [Link](data)

print("RAW DATA:")

print(df)

Problems Identified

• Missing Salary
• Missing Experience
• Duplicate Row (Riya appears twice)
• Invalid Age value ("thirty")
• Outlier Salary (999999)

Step 2: Handle Missing Values

df['Salary'].fillna(df['Salary'].mean(), inplace=True)

df['Experience'].fillna(df['Experience'].mean(), inplace=True)

➔ Missing values replaced with mean.

Step 3: Remove Duplicate Records

43
df.drop_duplicates(inplace=True)

➔ Exact duplicate row removed.

Step 4: Fix Data Types

df['Age'] = pd.to_numeric(df['Age'], errors='coerce') // errors='coerce' forces invalid values to


be converted into NaN instead of raising an error.

df['Age'].fillna(df['Age'].mean(), inplace=True)// inplace=True → updates the original dataset


directly

➔ "thirty" converted to NaN


➔ Replaced with mean age

Step 5: Handle Outliers

df = df[df['Salary'] < 200000]

➔ Unrealistic salary (999999) removed.

Step 6: Data Transformation

df['Level'] = df['Experience'].apply(

lambda x: 'Senior' if x >= 5 else 'Junior'

➔ Created new categorical column "Level".

Step 7: Data Normalization (Min–Max)

df['Salary_Normalized'] = (

(df['Salary'] - df['Salary'].min()) /

(df['Salary'].max() - df['Salary'].min())

➔ Salary scaled between 0 and 1.

44
Step 8: Final Clean Dataset

print("\nCLEAN DATA:")

print(df)

Final Output (Approx)

Employee Salary Experience Age Level Salary_Normalized

0 Aman 50000 5.0 25.0 Senior 1.00

1 Riya 45000 3.0 28.0 Junior 0.50

2 Kunal 45000 4.2 30.0 Junior 0.50

5 Meena 40000 2.0 29.5 Junior 0.00

Conclusion

This mini project successfully demonstrated complete data wrangling including:

• Data cleaning, Data correction, Outlier removal, Feature transformation, Data


normalization

The dataset was transformed from messy raw data into clean, structured, and analysis-ready
data suitable for further analytics or machine learning.

APPLICATIONS OF DATA WRANGLING

• Data Analytics projects, ML pipelines, BI, Healthcare analytics, Banking & finance
• Education systems

45

You might also like