0% found this document useful (0 votes)

8 views45 pages

Unit-2 Data Collection, Cleaning & Preprocessing

Unit 2 covers data collection, cleaning, and preprocessing techniques, emphasizing the importance of primary and secondary data sources. It discusses methods for handling missing data, outliers, and noise, along with practical programming examples using Python libraries like Pandas and NumPy. The document highlights the significance of these processes in ensuring data quality for analysis and decision-making.

Uploaded by

ayushchaurasia09112006

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

8 views45 pages

Unit-2 Data Collection, Cleaning & Preprocessing

Uploaded by

ayushchaurasia09112006

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Unit 2 Data Collection, Cleaning & Preprocessing

1. Sources (Primary and Secondary),

2. Handling missing data,
3. Outliers,
4. Noise,
5. Transformation,
6. Normalization and
7. Data wrangling using libraries like Pandas and numpy.
Collect → Clean → Prepare

1. Sources of Data (Primary and Secondary)

Data Sources
A data source is the origin from where data is collected for analysis,
decision-making, or research purposes.

Broadly classified into:

1. Primary Data Sources

2. Secondary Data Sources

PRIMARY DATA SOURCES

Primary data is data collected directly by a researcher or

organization for a specific purpose.
It is first-hand, original, fresh, problem-specific, and highly reliable,
but its collection requires more time, effort, and cost.

Primary Data Sources

• Surveys & Questionnaires: Online, paper, or telephonic surveys
Example: Student feedback form

1
• Interviews: Face-to-face, telephonic, or video interviews
Example: Employee job-satisfaction interview
• Observations: Direct or participant observation
Example: Customer behavior analysis
• Experiments: Controlled or scientific testing
Example: Testing a new medicine
• Sensors & IoT: Automated data from devices
Example: Smartwatch heart-rate data

Applications: Market research, academic studies, healthcare

analytics, customer satisfaction, scientific research.

Advantages: Accurate, relevant, fresh, quality-controlled.

Disadvantages: Costly, time-consuming, manpower-intensive.

Secondary Data Sources

Secondary data is data already collected, processed, and published
by others for a different purpose and reused for analysis.
It is easy to access, cost-effective, and time-saving, but may not
exactly match the current problem.

Types of Secondary Data

• Government Data: Census, economic and health reports
Example: India Census data
• Public Datasets: Kaggle, UCI, Google Dataset Search
Example: Student dataset from Kaggle
• Company Reports: Annual, sales, financial reports
Example: Previous year sales report
• Online Databases: Journals, data portals, libraries
Example: Research datasets

2
• Web & Social Media Data: Reviews, logs, social media data
Example: Amazon product reviews

Applications

Trend analysis, comparative studies, forecasting, academic

research, policy making.

Advantages : Low cost, quick access, large datasets.

Disadvantages: Less accurate, possibly outdated, limited quality
control, not problem-specific.
In short:

Primary data = First-hand, accurate, costly

Secondary data = Already available, cheap, quick

Program for Primary Data Collection

Example 1: Collecting Primary Data using Python (User Input)

import pandas as pd

# Collecting primary data

names = []
marks = []

for i in range(3):
name = input("Enter student name: ")
mark = int(input("Enter marks: "))
[Link](name)
[Link](mark)

3
# Creating DataFrame
data = [Link]({
"Name": names,
"Marks": marks
})

print(data)

Explanation
• input() → collects data directly from user (primary data)
• Data is stored in lists
• Pandas converts it into a structured dataset
Output

Enter student name: Amit

Enter marks: 85
Enter student name: Neha
Enter marks: 90
Enter student name: Rahul
Enter marks: 78

Name Marks
0 Amit 85
1 Neha 90
2 Rahul 78

This is a perfect example of a Primary Data Collection program

Program for Secondary Data Collection

(Data already exists — we only fetch/load it)

4
Example 2: Collecting Secondary Data from a CSV file

import pandas as pd

data = pd.read_csv("[Link]")
print([Link]())

Already collected data → secondary data

Classification: Primary/ Secondary

Program Type Data Type

User input program Primary data
Survey / Form program Primary data
CSV / Excel loading Secondary data
API / Web scraping Secondary data
Sensor data Primary data

2. Handling Missing Data

Missing data refers to the absence of expected values in a dataset,
represented as blank, null, or NaN entries.
It can occur in numerical, categorical, and time-series data.
Causes:
• Human input errors, System or sensor failure, Data corruption,

Survey non-response, Data integration issues

Effects: Biased analysis, Incorrect statistics, Reduced dataset size,
Errors in ML models, Lower prediction accuracy

Therefore, handling missing data is a key step in data cleaning.

5
METHODS TO HANDLE MISSING DATA
(1) Removing Missing Data
a) Removing Rows: Use when missing data is very small.

[Link]()

b) Removing Columns: Removes unusable columns

[Link](axis=1)

(2) Filling Missing Data (Imputation)

a) Fill with Mean (Numerical)

df['Marks'].fillna(df['Marks'].mean())

b) Fill with Median

df['Marks'].fillna(df['Marks'].median())

c) Fill with Mode (Categorical)

df['Grade'].fillna(df['Grade'].mode()[0])

(3) Forward Fill / Backward Fill

Forward Fill replaces missing values with the previous (above) value
in the column.
Example:
Day Sales
1 100
2 NaN

6
3 150
Using Forward Fill:
Day Sales
1 100
2 100
3 150

Backward Fill (bfill): replaces missing values with the next (below)
value in the column.
Example:
Day Sales
1 100
2 NaN
3 150
Using Backward Fill:
Day Sales
1 100
2 150
3 150

[Link](method='ffill')
[Link](method='bfill')

Using Constant Value

[Link](0)

PROGRAM (Pandas)
Example Dataset: Student Marks

7
import pandas as pd
import numpy as np

data = {
'Name': ['Amit', 'Neha', 'Rahul', 'Priya'],
'Marks': [85, [Link], 78, [Link]],
'Grade': ['A', 'A', None, 'B']
}

df = [Link](data)
print("Original Data:")
print(df)

Output

Name Marks Grade

0 Amit 85.0 A
1 Neha NaN A
2 Rahul 78.0 None
3 Priya NaN B

Handling Missing Values

# Fill Marks with mean
df['Marks'] = df['Marks'].fillna(df['Marks'].mean())

# Fill Grade with mode

df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0])

print("\nAfter Handling Missing Data:")

print(df)

8
Output

Name Marks Grade

0 Amit 85.00 A
1 Neha 81.50 A
2 Rahul 78.00 A
3 Priya 81.50 B

APPLICATIONS OF HANDLING MISSING DATA

• Healthcare analytics (missing patient records)

• Survey analysis, Financial data analysis, ML preprocessing
• Business decision making

Handling missing data means identifying, analyzing, and treating

absent values to improve data quality and accuracy.

3. Outliers
They are extreme values that differ significantly from the majority
of data and can distort analysis and model performance. In simple
terms: Outliers are values that lie far away from the normal range
of data.

If outliers are not handled:

• average (mean) becomes misleading,

• statistical results get distorted,
• machine learning models perform poorly,
• wrong business decisions may be taken.

9
Hence, outlier detection and treatment is a crucial step of Data
Cleaning.

CAUSES OF OUTLIERS

Cause Example
Extra zero in salary (500000 instead of
Data entry error
50000)
Measurement
Faulty sensor
error
Natural variation Extremely high income
Experimental error Wrong unit used

Methods of Outlier Detection

1. Mean & Standard Deviation Method

Mean–Standard Deviation method detects a value as outlier

when it lies beyond mean ± 3 standard deviations.

Program:
import numpy as np

data = [Link]([10, 12, 11, 13, 12, 100])

mean = [Link](data)

std = [Link](data)

print("Mean:",mean)

print("standard deviation:",std)
outliers = data[(data < mean - 3*std) | (data > mean + 3*std)]

10
print(outliers)

Output:

Why output is empty []

• Mean ≈ 26.33
• Standard Deviation ≈ 32.96
• Upper limit = 26.33 + 3×32.96 ≈ 125.21
• Lower limit = 26.33 − 3×32.96 ≈ −72.54

➔Value 100 lies within this range, so no outlier is detected using

the ±3σ rule.

Another example where outlier comes:

import numpy as np
data = [Link]([
50, 51, 49, 50, 52, 48, 51, 49, 50, 52,1000
])
mean = [Link](data)
std = [Link](data)
print("Mean:", mean)

print("Standard Deviation:", std)

outliers = data[(data < mean - 3*std) | (data > mean + 3*std)]

print("Found Outliers:", outliers)

11
Output:

Mean: 136.54545454545453

Standard Deviation: 273.0508988969451

Outliers: [1000]

2. Z-Score Method
Z-score shows how many standard deviations a value is away from
the mean. If Z-score is greater than +3 or less than −3, the value is
an outlier.
Program:
import numpy as np
data = [Link]([50, 51, 49, 50, 52, 48, 51, 49, 50, 52,1000
])
mean = [Link](data)
std = [Link](data)
print("Mean:", mean)
print("Standard Deviation:", std)
z = (data - mean) / std
print("Z :", z)
outliers = data[(z > 3) | (z < -3)]
print(outliers)
output:
Mean: 136.54545454545453

Standard Deviation: 273.0508988969451

12
Z : [-0.31695722 -0.3132949 -0.32061954 -0.31695722 -
0.30963258 -0.32428186

-0.3132949 -0.32061954 -0.31695722 -0.30963258

3.16224758]

[1000]
3. IQR (Interquartile Range) Method
Values lying below the lower bound or above the upper bound
are outliers.
Program:
import numpy as np

data = [Link]([10, 12, 11, 13, 12, 100])

# Step 2: Calculate the First Quartile (Q1)

# Q1 represents the value below which 25% of the data lies

Q1 = [Link](data, 25)

# Step 3: Calculate the Third Quartile (Q3)

# Q3 represents the value below which 75% of the data lies

Q3 = [Link](data, 75)

# Step 4: Calculate the Interquartile Range (IQR)

# IQR is the range of the middle 50% of the data

IQR = Q3 - Q1

13
# Step 5: Calculate the Lower Bound

# Values below this limit are considered outliers

lower_bound = Q1 - 1.5 * IQR

# Step 6: Calculate the Upper Bound

# Values above this limit are considered outliers

upper_bound = Q3 + 1.5 * IQR

# Step 7: Identify outliers

# Values outside the lower and upper bounds are outliers

outliers = data[(data < lower_bound) | (data > upper_bound)]

# Step 8: Print all results

print("Data:", data)
print("Q1 (25th percentile):", Q1)

print("Q3 (75th percentile):", Q3)

print("IQR:", IQR)

print("Lower Bound:", lower_bound)

print("Upper Bound:", upper_bound)

print("Outliers:", outliers)
Output:

14
Data: [ 10 12 11 13 12 100]

Q1 (25th percentile): 11.25

Q3 (75th percentile): 12.75

IQR: 1.5

Lower Bound: 9.0

Upper Bound: 15.0

Outliers: [100]
4. Box Plot Method
A box plot visually shows outliers as points outside the whiskers.
Program:

import pandas as pd

import [Link] as plt

df = [Link]({'Marks': [45, 50, 52, 48, 49, 120]})

[Link](column='Marks')
[Link]()

Output: Box plot showing 120 as an outlier

5. Scatter Plot Method
In a scatter plot, outliers appear as points far away from other
points.

15
Program:

import [Link] as plt

data = [10, 12, 11, 13, 12, 100]

[Link](range(len(data)), data)

[Link]()

Output: Scatter plot showing 100 far from other values

APPLICATIONS OF OUTLIER HANDLING

• Fraud detection
• Medical diagnosis
• Quality control
• Financial analytics
• Network security

4. Noise in Data
Noise refers to random, irrelevant, or meaningless data values that do not
represent the true underlying pattern of the dataset. In simple words: Noise
is unwanted disturbance in data that hides useful information.

Some variations occur due to:

• faulty instruments, environmental disturbances,

• transmission errors, human mistakes.
These unwanted and meaningless variations are called noise.

16
If noise is not handled:

• patterns become unclear, accuracy of analysis reduces,

• ML models overfit or underperform, wrong conclusions are drawn.

Therefore, noise handling is a crucial part of Data Cleaning.

SOURCES / CAUSES OF NOISE

Source
Sensor malfunction
Data entry errors
Transmission errors
Environmental factors
Measurement limitations

DIFFERENCE: NOISE vs OUTLIERS

Noise Outliers
Random disturbance Extreme value
Small variations Large deviation
Affects many points Affects few points
Hard to identify Easy to detect

METHODS TO HANDLE / REMOVE NOISE

1) Smoothing Techniques

Moving Average: Replaces value with average of neighbors

Exponential Smoothing: Gives more weight to recent values

17
2) Binning Method

• Sort data
• Divide into bins
• Replace values with bin mean/median

3) Regression

• Fit a line or curve

• Remove deviations from trend

4) Clustering :
• Noise points lie far from clusters

NOISE HANDLING – COMPLETE PROGRAM

Example Dataset: Daily Temperature (°C)

30, 31, 29, 45, 30, 32, 31

Here 45 is clearly abnormal (noise).
METHOD 1: Moving Average (Manual Calculation)

Formula
𝑋1 + 𝑋2 + 𝑋3
Moving Average =
3

We use window size = 3

Step-by-Step Calculation

For 3rd value (index 2)

18
(30 + 31 + 29)/3 = 90/3 = 30

For 4th value (index 3)

(31 + 29 + 45)/3 = 105/3 = 35

For 5th value

(29 + 45 + 30)/3 = 104/3 = 34.67

For 6th value

(45 + 30 + 32)/3 = 107/3 = 35.67

For 7th value

(30 + 32 + 31)/3 = 93/3 = 31

Result (Manual)

Original Smoothed
30 -
31 -
29 30
45 35
30 34.67
32 35.67
31 31

19
Noise 45 is reduced from 45 → 35

Program (Moving Average)

import pandas as pd

data = {'Temperature': [30, 31, 29, 45, 30, 32, 31]}

df = [Link](data)

df['Moving_Avg'] = df['Temperature'].rolling(window=3).mean()
print(df)

Output

Temperature Moving_Avg

0 30 NaN

1 31 NaN
2 29 30.00

3 45 35.00

4 30 34.67

5 32 35.67
6 31 31.00

Explanation

• rolling(window=3) takes 3 values

• mean() calculates average

20
• Sudden spike 45 becomes smoother (35)

METHOD 2: Binning (Manual Working)

Step 1: Sort Data

29, 30, 30, 31, 31, 32, 45

Step 2: Divide into 3 bins

Bin 1 → 29, 30
Bin 2 → 30, 31
Bin 3 → 31, 32, 45

Step 3: Replace by Mean

Bin 1 Mean:
(29 + 30)/2 = 59/2 = 29.5

Bin 2 Mean:

(30 + 31)/2 = 61/2 = 30.5

Bin 3 Mean:
(31 + 32 + 45)/3 = 108/3 = 36

Noise 45 becomes 36

Program (Binning)

import numpy as np

21
sorted_data = [Link](df['Temperature'])

bins = np.array_split(sorted_data, 3)
bin_means = [[Link](bin) for bin in bins]

print("Bins:", bins)

print("Bin Means:", bin_means)

Output
Bins: [array([29, 30]), array([30, 31]), array([31, 32, 45])]
Bin Means: [29.5, 30.5, 36.0]

Explanation

• Data sorted
• Divided into equal groups
• Each group replaced with its mean
• Noise value adjusted

METHOD 3: Regression (Manual Concept)

We fit a straight line:

𝑦 = 𝑚𝑥 + 𝑐
𝑚
𝑏
-10-8-6-4-2246810-10-5510y-interceptx-intercept

22
It finds best trend.

Noise point (45) will not strongly affect trend because line averages effect.

(Manual regression calculation is long, so we demonstrate

programmatically.)

Program (Linear Regression)

import numpy as np

from sklearn.linear_model import LinearRegression

X = [Link](len(df)).reshape(-1, 1)

y = df['Temperature']
model = LinearRegression()

[Link](X, y)

df['Trend_Line'] = [Link](X)

print(df)

Output (Approx)
Temperature Trend_Line

0 30 30.1

1 31 31.3

2 29 32.5
3 45 33.7

23
4 30 34.9

5 32 36.1

6 31 37.3
Explanation

• Model finds best straight line

• Instead of using noisy value 45 directly
• Uses smooth predicted values

MINI PROJECT
Title: Noise Handling and Data Smoothing Techniques Using Python
Objective: To identify and reduce noise in numerical data using various
smoothing techniques such as:
• Moving Average,Exponential Smoothing, Binning, Regression

Problem Statement
Real-world datasets such as sensor readings, stock prices, or temperature
data often contain noise — random irregular values that hide the true
trend.
Example:
A temperature sensor may suddenly show an abnormal value due to device
error.
The goal is to clean the dataset and preserve the real pattern.

24
5. Data Transformation
After data collection and data cleaning (handling missing values,
outliers, noise), the data may still not be suitable for analysis or
machine learning.

Reasons:

• Data may be in wrong format

• Values may be on different scales
• Categorical data may exist

➔ Data Transformation converts clean data into suitable format

so that it can be efficiently analyzed, visualized, or used for
modeling. It is a core step of Data Preprocessing.

OBJECTIVES:

• Improve data consistency

• Reduce skewness, Improve accuracy of analysis
• Convert raw data into meaningful features

Applications:
• DA, ML, DWg, BI, Healthcare & Finance analytics

Dataset Used
import pandas as pd
import numpy as np

data = {
'Name': ['Amit', 'Shruti', 'Rahul', 'Pooja'],
'Age': [20, 21, 22, 21],
25
'Marks': [45, 78, 90, 60],
'City': ['Delhi', 'Mumbai', 'Delhi', 'Pune']
}
df = [Link](data)
print("RAW DATA", df)
Output
Name Age Marks City
0 Amit 20 45 Delhi
1 Shruti 21 78 Mumbai
2 Rahul 22 90 Delhi
3 Pooja 21 60 Pune

TYPES OF DATA TRANSFORMATION

1) Scaling / Normalization
Scaling converts numerical values into a smaller, comparable range
(usually 0–1).

Program (Min-Max)
df['Marks_Normalized'] = (df['Marks'] - df['Marks'].min()) /
(df['Marks'].max() - df['Marks'].min'])

print(df[['Marks', 'Marks_Normalized']])
Output
Marks Marks_Normalized
0 45 0.00
1 78 0.61
2 90 1.00
3 60 0.33
Explanation
• Minimum (45) becomes 0

• Maximum (90) becomes 1

26
• Other values fall between 0 and 1
Used in Machine Learning.

2) Log / Mathematical Transformation

Log transformation reduces skewness and compresses large
numeric values.
Program
df['Marks_Log'] = [Link](df['Marks'])
print(df[['Marks', 'Marks_Log']])
Output (Approx)
Marks Marks_Log
0 45 3.81
1 78 4.36
2 90 4.50
3 60 4.09
Explanation
• Large differences become smaller

• Useful when data is highly skewed

3) Encoding (Categorical → Numerical)

Encoding converts text data into numerical form. ML algorithms
cannot process text. They need numeric representation

Program (Label Encoding)

df['City_Code'] = df['City'].astype('category').[Link]
print(df[['City', 'City_Code']])
Output
City City_Code
0 Delhi 0
1 Mumbai 1
2 Delhi 0

27
3 Pune 2
Explanation
df['City']: This selects the City column.
.astype('category'): converts column into categorical type, as
Unique categories.
Internally, Pandas assigns:
Delhi → Category 0
Mumbai → Category 1
Pune → Category 2
(Alphabetical order by default)
.[Link]: This extracts numeric code of each category.

4) Aggregation
Aggregation summarizes data using operations like mean, sum,
count.
Program
avg_marks = [Link]('City')['Marks'].mean()
print(avg_marks)
Output
City
Delhi 67.5
Mumbai 78.0
Pune 60.0
Explanation
• Delhi has two students → average = 67.5

• Data is grouped and summarized

5) Feature Creation (Derived Column)

Creating new meaningful columns from existing data.
Program
df['Result'] = [Link](df['Marks'] >= 40, 'Pass', 'Fail')

28
print(df[['Name', 'Marks', 'Result']])
Output
Name Marks Result
0 Amit 45 Pass
1 Shruti 78 Pass
2 Rahul 90 Pass
3 Pooja 60 Pass
Explanation
• Marks ≥ 40 → Pass

• New column created from Marks

Improves interpretability.

6) Discretization / Binning
Converts continuous numbers into categories.
Program
bins = [0, 50, 75, 100]
labels = ['Low', 'Medium', 'High']

df['Performance'] = [Link](df['Marks'], bins=bins, labels=labels)

print(df[['Marks', 'Performance']])
Output
Marks Performance
0 45 Low
1 78 High
2 90 High
3 60 Medium
Explanation
• 0–50 → Low

• 50–75 → Medium

• 75–100 → High

29
Makes analysis easier.

7) Data Type Conversion

Changing data type for correctness or consistency.
Program
df['Age'] = df['Age'].astype(float)
print([Link])
Output
Name object
Age float64
Marks int64
City object
...
Explanation
• Age converted from integer to float

• Ensures correct numeric operations

FINAL TRANSFORMED DATASET

print("\nFINAL TRANSFORMED DATA")
print(df)
Now dataset contains:
• Normalized values

• Log values

• Encoded values

• Derived features

• Binned categories

6. DATA NORMALIZATION
It is a data preprocessing technique used to rescale numerical values into
a common range so that no feature dominates others due to its
magnitude.

30
- It makes different numerical features comparable.

Why Normalization is Required?

In real datasets, features have different scales:

Feature Range Example

Salary 10,000 – 1,00,000
Age 18 – 60
Marks 0 – 100

If we use raw data:

• Large values dominate, Give biased results, Training becomes slow

Benefits of Normalization
• Equal importance to all features, Faster model convergence
• Improved accuracy, Better performance of ML models

Used:
• ML (KNN, SVM, Neural Networks)
• Data Mining , Clustering
• Recommendation Systems, Statistical Analysis

We will use this dataset:

import pandas as pd
data = {
'Student': ['Amit', 'Neha', 'Rahul', 'Priya', 'Karan'],
'Marks': [45, 88, 76, 95, 60]
}
df = [Link](data)
print("Original Data:")
31
print(df)
Output:
Student Marks
0 Amit 45
1 Neha 88
2 Rahul 76
3 Priya 95
4 Karan 60

Types of Data Normalization

1. Min–Max Normalization
2. Z-Score Normalization
3. Decimal Scaling
4. Max Normalization

1- Min–Max Normalization : It rescales data into a fixed range (usually

0 to 1). It subtracts the minimum value and divides by the data
range to rescale data between 0 and 1.
𝑿−𝑿𝒎𝒊𝒏
Formula: 𝑿′ = 𝑿 −𝑿
𝒎𝒂𝒙 𝒎𝒊𝒏
Meaning of Each Term
• X → Original value // (Ex: 76)
• Xₘᵢₙ → Minimum value in dataset // (Ex: 45)
• Xₘₐₓ → Maximum value in dataset // (Ex: 95)
• X − Xₘᵢₙ → Distance of value from min // (76−45=31)
• Xₘₐₓ − Xₘᵢₙ → Total range of data // (95−45=50)
• X′ → Normalized value (b/w 0 and 1) // X′=31/50=0.62
So, according to example, normalized value of 76 ≈ 0.62
This Formula Actually works as:
• Converts minimum value → 0

• Converts maximum value → 1

• Converts all other values → between 0 and 1

32
It rescales data proportionally within range.

Program
df['Marks_MinMax'] = (df['Marks'] - df['Marks'].min()) /
(df['Marks'].max() - df['Marks'].min())

print(df[['Student', 'Marks_MinMax']])
Output
Student Marks_MinMax
0 Amit 0.000000
1 Neha 0.782609
2 Rahul 0.565217
3 Priya 1.000000
4 Karan 0.260870
➔ Best when min and max values are known

2- Z-Score Normalization (Standardization)

It transforms data so that it has mean = 0 and standard deviation = 1.
In simple terms, Z-Score normalization subtracts the mean from each
value and divides by standard deviation to standardize data with mean 0
and standard deviation 1.

Formula:
Where:
• X = Original value

• μ (mu) = Mean of dataset

• σ (sigma) = Standard deviation

• Z = Standardized value

Example Dataset: Marks = 45, 88, 76, 95, 60

33
Step 1: Find Mean (μ)
Formula:
Sum of all values
𝝁=
Number of values

𝟒𝟓+𝟖𝟖+𝟕𝟔+𝟗𝟓+𝟔𝟎 𝟑𝟔𝟒
𝝁= = =𝟕𝟐. 𝟖
𝟓 𝟓

So, Mean (μ) = 72.8

Step 2: Find Standard Deviation (σ)

Formula:
∑(𝑿 − 𝝁)𝟐
𝝈=√
𝑵

Now calculate each part:

X X−μ (X − μ)²

45 -27.8 772.84

88 15.2 231.04

76 3.2 10.24

95 22.2 492.84

60 -12.8 163.84

34
Now sum of squared differences:
𝟕𝟕𝟐. 𝟖𝟒 + 𝟐𝟑𝟏. 𝟎𝟒 + 𝟏𝟎. 𝟐𝟒 + 𝟒𝟗𝟐. 𝟖𝟒 + 𝟏𝟔𝟑. 𝟖𝟒 = 𝟏𝟔𝟕𝟎. 𝟖

Divide by N = 5:
𝟏𝟔𝟕𝟎. 𝟖
= 𝟑𝟑𝟒. 𝟏𝟔
𝟓

Now square root:

𝝈 = √𝟑𝟑𝟒. 𝟏𝟔
𝝈 ≈ 𝟏𝟖. 𝟐𝟖

So, Standard Deviation (σ) ≈ 18.28

Step 3: Apply Z-Score Formula

Let’s find Z-score for 76
𝟕𝟔 − 𝟕𝟐. 𝟖
𝒁=
𝟏𝟖. 𝟐𝟖

𝟑. 𝟐
𝒁=
𝟏𝟖. 𝟐𝟖

𝒁 ≈ 𝟎. 𝟏𝟕

Interpretation,
Z = 0.17 means:
- 76 is 0.17 standard deviations above the mean
• Positive value → Above mean

• Negative value → Below mean

• 0 → Exactly at mean

Z-Score Normalization Program

import pandas as pd
35
import numpy as np

# Step 1: Create dataset

marks = [45, 88, 76, 95, 60]
df = [Link]({'Marks': marks})
print("Original Data:", df)

# Step 2: Calculate Mean

mean = df['Marks'].mean()
print("\nMean:", mean)

# Step 3: Calculate Standard Deviation

std = df['Marks'].std()
print("Standard Deviation:", std)

# Step 4: Apply Z-Score Formula

df['Z_Score'] = (df['Marks'] - mean) / std

print("\nData after Z-Score Normalization:")

print(df)

Output:
Original Data:
Marks
0 45
1 88
2 76
3 95
4 60

Mean: 72.8

36
Standard Deviation: 19.99

Data after Z-Score Normalization:

Marks Z_Score
0 45 -1.39
1 88 0.76
2 76 0.16
3 95 1.11
4 60 -0.64

Values are centered around mean 0

Can produce negative and positive values
Useful when data has outliers
Commonly used in ML algorithms

3- Decimal Scaling Normalization

Decimal scaling moves the decimal point of values so that the
maximum value becomes less than 1.
Formula
′
𝑿
𝑿 = 𝒋
𝟏𝟎

Where:
• X = Original value

• j = Smallest integer such that max value becomes < 1

• X′ = Normalized value

In decimal scaling, j is chosen equal to the number of digits in the

maximum value so that dividing by 𝟏𝟎𝒋 makes all values less than 1.
Rule in Decimal Scaling
We choose j such that:
Maximum value
<𝟏
𝟏𝟎𝒋
37
That means:
𝟏𝟎𝒋 > Maximum value
We use same dataset:
Marks = 45, 88, 76, 95, 60

Step 1: Find j
Maximum value = 95
To make 95 < 1, divide by 100.
𝟗𝟓/𝟏𝟎𝟎 = 𝟎. 𝟗𝟓
So,
𝟏𝟎𝒋 = 𝟏𝟎𝟎
𝒋=𝟐

Step 2: Apply Formula

For example, X = 76
′
𝟕𝟔
𝑿 = 𝟐
𝟏𝟎
′
𝟕𝟔
𝑿 =
𝟏𝟎𝟎
′
𝑿 = 𝟎. 𝟕𝟔

Program
import pandas as pd

data = {
'Student': ['Amit', 'Neha', 'Rahul', 'Priya', 'Karan'],
'Marks': [45, 88, 76, 95, 60]
}
df = [Link](data)

38
# Decimal Scaling

df['Marks_Decimal'] = df['Marks'] / 100

print(df[['Student', 'Marks_Decimal']])
Output
Student Marks_Decimal
0 Amit 0.45
1 Neha 0.88
2 Rahul 0.76
3 Priya 0.95
4 Karan 0.60
Explanation
• We divided all values by 100

• Because maximum value (95) has 2 digits

• All values become less than 1

• Very simple and fast method

5. Max Normalization
It divides each value by the maximum value in the dataset.
𝑿
Formula: 𝑿′ = 𝑿
𝒎𝒂𝒙

Where:
• X = Original value

• X_max = Maximum value in dataset

• X′ = Normalized value

We use same dataset:

Marks = 45, 88, 76, 95, 60

Step 1: Find Maximum

Maximum = 95
39
Step 2: Apply Formula
For example, X = 76
′
𝟕𝟔
𝑿 =
𝟗𝟓
′
𝑿 = 𝟎. 𝟖𝟎

Program
# Max Normalization
df['Marks_MaxNorm'] = df['Marks'] / df['Marks'].max()

print(df[['Student', 'Marks_MaxNorm']])

Output
Student Marks_MaxNorm
0 Amit 0.47
1 Neha 0.93
2 Rahul 0.80
3 Priya 1.00
4 Karan 0.63
Explanation
• Maximum value (95) becomes 1

• Other values are scaled proportionally

• Easy and quick method

• Useful when only maximum value is important

Thus, Data normalization converts numerical values into a common

scale to improve fairness, stability, and accuracy of machine learning
and analytical models.

40
7. DATA WRANGLING
Data Wrangling (also called Data Munging) is the complete process of
converting raw, messy, incomplete, inconsistent data into a clean,
structured, meaningful, and usable format suitable for data analysis and
machine learning.

It combines multiple preprocessing steps, such as:

• Handling missing values
• Removing duplicates
• Fixing data types
• Treating outliers and noise
• Transforming data
• Normalizing values
• Feature creation

Without data wrangling, analysis results are inaccurate and misleading.

TOOLS USED FOR DATA WRANGLING

Libraries: Pandas, NumPy, Matplotlib / Seaborn (optional)

DATA WRANGLING PROCESS

1. Load data
2. Inspect data
3. Handle missing values
4. Remove duplicates
5. Fix data types
6. Handle outliers
7. Data transformation
8. Data normalization
9. Final clean dataset

41
COMPLETE DATA WRANGLING PROGRAM
MINI PROJECT

Title: Data Wrangling and Preprocessing on Employee Salary Dataset

Using Pandas

Objective: To clean, preprocess, transform, and normalize a messy

employee dataset using Python and apply complete data wrangling
techniques.
Problem Statement: Real-world datasets often contain:

• Missing values
• Duplicate records
• Incorrect data types
• Outliers
• Inconsistent formats

These issues must be corrected before performing analysis or applying machine learning
algorithms.

This project demonstrates complete data wrangling step-by-step.

Dataset Description

The dataset contains the following columns:

Column Description
Employee Employee name
Salary Monthly salary
Experience Years of experience
Age Employee age

Tools & Technologies Used:

• Python
• Pandas
• NumPy

42
Implementation

Step 1: Create Raw Dataset

import pandas as pd

import numpy as np

data = {

'Employee': ['Aman', 'Riya', 'Kunal', 'Riya', 'Vikas', 'Meena'],

'Salary': [50000, [Link], 45000, [Link], 999999, 40000],

'Experience': [5, 3, None, 3, 8, 2],

'Age': ['25', '28', '30', '28', '35', 'thirty']

df = [Link](data)

print("RAW DATA:")

print(df)

Problems Identified

• Missing Salary
• Missing Experience
• Duplicate Row (Riya appears twice)
• Invalid Age value ("thirty")
• Outlier Salary (999999)

Step 2: Handle Missing Values

df['Salary'].fillna(df['Salary'].mean(), inplace=True)

df['Experience'].fillna(df['Experience'].mean(), inplace=True)

➔ Missing values replaced with mean.

Step 3: Remove Duplicate Records

43
df.drop_duplicates(inplace=True)

➔ Exact duplicate row removed.

Step 4: Fix Data Types

df['Age'] = pd.to_numeric(df['Age'], errors='coerce') // errors='coerce' forces invalid values to

be converted into NaN instead of raising an error.

df['Age'].fillna(df['Age'].mean(), inplace=True)// inplace=True → updates the original dataset

directly

➔ "thirty" converted to NaN

➔ Replaced with mean age

Step 5: Handle Outliers

df = df[df['Salary'] < 200000]

➔ Unrealistic salary (999999) removed.

Step 6: Data Transformation

df['Level'] = df['Experience'].apply(

lambda x: 'Senior' if x >= 5 else 'Junior'

➔ Created new categorical column "Level".

Step 7: Data Normalization (Min–Max)

df['Salary_Normalized'] = (

(df['Salary'] - df['Salary'].min()) /

(df['Salary'].max() - df['Salary'].min())

➔ Salary scaled between 0 and 1.

44
Step 8: Final Clean Dataset

print("\nCLEAN DATA:")

print(df)

Final Output (Approx)

Employee Salary Experience Age Level Salary_Normalized

0 Aman 50000 5.0 25.0 Senior 1.00

1 Riya 45000 3.0 28.0 Junior 0.50

2 Kunal 45000 4.2 30.0 Junior 0.50

5 Meena 40000 2.0 29.5 Junior 0.00

Conclusion

This mini project successfully demonstrated complete data wrangling including:

• Data cleaning, Data correction, Outlier removal, Feature transformation, Data

normalization

The dataset was transformed from messy raw data into clean, structured, and analysis-ready
data suitable for further analytics or machine learning.

APPLICATIONS OF DATA WRANGLING

• Data Analytics projects, ML pipelines, BI, Healthcare analytics, Banking & finance
• Education systems

Data Wrangling with Python: Student Dataset
No ratings yet
Data Wrangling with Python: Student Dataset
7 pages
Data Cleaning and Preparation Guide
No ratings yet
Data Cleaning and Preparation Guide
26 pages
Data Wrangling for Academic Dataset
No ratings yet
Data Wrangling for Academic Dataset
9 pages
Data Cleaning with Pandas Guide
No ratings yet
Data Cleaning with Pandas Guide
13 pages
Data Cleaning for Analytics & ML
No ratings yet
Data Cleaning for Analytics & ML
23 pages
Data Preprocessing with Pandas in Python
No ratings yet
Data Preprocessing with Pandas in Python
15 pages
Unit 2.2 - Data Acquiring & Cleaning - 2
No ratings yet
Unit 2.2 - Data Acquiring & Cleaning - 2
79 pages
Lecture 5 02102025 041330pm
No ratings yet
Lecture 5 02102025 041330pm
10 pages
EDA Basics: Python Guide for Data Analysis
100% (1)
EDA Basics: Python Guide for Data Analysis
30 pages
Python Data Cleaning and Visualization Guide
No ratings yet
Python Data Cleaning and Visualization Guide
42 pages
DataAnalytics Units456 Notes
No ratings yet
DataAnalytics Units456 Notes
29 pages
Unit 2 - Fod
No ratings yet
Unit 2 - Fod
82 pages
Data Preparation Techniques in Python
No ratings yet
Data Preparation Techniques in Python
54 pages
Data Wrangling Techniques in Python
No ratings yet
Data Wrangling Techniques in Python
141 pages
Data Cleaning and Preprocessing Guide
No ratings yet
Data Cleaning and Preprocessing Guide
32 pages
Datascience 8mark Answer
No ratings yet
Datascience 8mark Answer
21 pages
PR 2
No ratings yet
PR 2
10 pages
Research Methodology Overview
No ratings yet
Research Methodology Overview
29 pages
Research Methodology Overview
No ratings yet
Research Methodology Overview
29 pages
Comprehensive Guide to Data Cleaning
No ratings yet
Comprehensive Guide to Data Cleaning
36 pages
DWDM
No ratings yet
DWDM
37 pages
Essential Steps in Exploratory Data Analysis
No ratings yet
Essential Steps in Exploratory Data Analysis
47 pages
Data Cleaning Essentials for Machine Learning
No ratings yet
Data Cleaning Essentials for Machine Learning
20 pages
Data Wrangling for Student Performance
No ratings yet
Data Wrangling for Student Performance
4 pages
Data Collection and Cleaning Techniques
No ratings yet
Data Collection and Cleaning Techniques
64 pages
Data Cleaning and Outlier Removal Guide
No ratings yet
Data Cleaning and Outlier Removal Guide
21 pages
Unit-2-Data Cleaning Integration Transformation Detailed
No ratings yet
Unit-2-Data Cleaning Integration Transformation Detailed
37 pages
Data Wrangling Techniques in Python
No ratings yet
Data Wrangling Techniques in Python
44 pages
Data Collection and Analysis Guide
No ratings yet
Data Collection and Analysis Guide
5 pages
Data Preprocessing
No ratings yet
Data Preprocessing
19 pages
Exploratory Data Analysis Techniques
No ratings yet
Exploratory Data Analysis Techniques
27 pages
Data Processing and Statistics Basics
No ratings yet
Data Processing and Statistics Basics
21 pages
Data Science Lec 6
No ratings yet
Data Science Lec 6
27 pages
Exploratory Data Analysis & Outliers Guide
No ratings yet
Exploratory Data Analysis & Outliers Guide
99 pages
Dad54123
No ratings yet
Dad54123
10 pages
Unit 2 Notes
No ratings yet
Unit 2 Notes
29 pages
Data Cleaning Techniques in Python
No ratings yet
Data Cleaning Techniques in Python
28 pages
Implementation of Libraries
No ratings yet
Implementation of Libraries
36 pages
Data Preprocessing in Machine Learning
No ratings yet
Data Preprocessing in Machine Learning
23 pages
Data Cleaning and Outlier Management
No ratings yet
Data Cleaning and Outlier Management
17 pages
Data Cleaning for Quality Insights
No ratings yet
Data Cleaning for Quality Insights
42 pages
Data Preparation For Analysis
No ratings yet
Data Preparation For Analysis
17 pages
Data Analysis with Python Tools
No ratings yet
Data Analysis with Python Tools
9 pages
Unit3 DSP
No ratings yet
Unit3 DSP
13 pages
Essential Guide to Exploratory Data Analysis
No ratings yet
Essential Guide to Exploratory Data Analysis
11 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
44 pages
Data Analysis with Python and Pandas
No ratings yet
Data Analysis with Python and Pandas
15 pages
Data Cleaning Techniques and Tools
No ratings yet
Data Cleaning Techniques and Tools
27 pages
Essential Steps in Data Science Process
No ratings yet
Essential Steps in Data Science Process
13 pages
Importance of Data Cleaning in ML
No ratings yet
Importance of Data Cleaning in ML
27 pages
Data Analytics and Visualization Lab Manual Print Out
No ratings yet
Data Analytics and Visualization Lab Manual Print Out
23 pages
Chapter 7-Exploratory Data Analysis-4
No ratings yet
Chapter 7-Exploratory Data Analysis-4
13 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
37 pages
Factory Reset Procedures for EX Switches
No ratings yet
Factory Reset Procedures for EX Switches
4 pages
Alcatel BTS Training Overview
No ratings yet
Alcatel BTS Training Overview
91 pages
Correlation Analysis in Inferential Statistics
No ratings yet
Correlation Analysis in Inferential Statistics
9 pages
Java Programming Basics and Examples
No ratings yet
Java Programming Basics and Examples
7 pages
Vector Database Management Overview
No ratings yet
Vector Database Management Overview
13 pages
Nafl Prayers of Rasulullah (ﷺ) Explained
No ratings yet
Nafl Prayers of Rasulullah (ﷺ) Explained
3 pages
Integrating Technology in Education
No ratings yet
Integrating Technology in Education
6 pages
Romeo and Juliet: Beauty and Immortality
0% (1)
Romeo and Juliet: Beauty and Immortality
3 pages
The LinkedIn Design Playbook
No ratings yet
The LinkedIn Design Playbook
16 pages
Mutesa's Reign and Missionary Influence
No ratings yet
Mutesa's Reign and Missionary Influence
19 pages
Grade 2 CVC Words Lesson Plan
No ratings yet
Grade 2 CVC Words Lesson Plan
13 pages
User Manual for Language Learning App
No ratings yet
User Manual for Language Learning App
2 pages
Character Training for Homeschoolers
No ratings yet
Character Training for Homeschoolers
64 pages
Tafseer of Surah at-Teen Explained
100% (1)
Tafseer of Surah at-Teen Explained
29 pages
Reading & Writing Skills Overview
No ratings yet
Reading & Writing Skills Overview
30 pages
Caution on AI in Education Exposition
No ratings yet
Caution on AI in Education Exposition
11 pages
Fortigate BGP Configuration Guide
No ratings yet
Fortigate BGP Configuration Guide
23 pages
Teaching Strategies and Differentiation Techniques
No ratings yet
Teaching Strategies and Differentiation Techniques
4 pages
Introduction to Translation Studies
100% (3)
Introduction to Translation Studies
60 pages
English 5 Thesaurus Activity Sheet
No ratings yet
English 5 Thesaurus Activity Sheet
7 pages
The Evolution of Einfühlung Concept
No ratings yet
The Evolution of Einfühlung Concept
26 pages
HCIA-AI Exam Review Guide
100% (2)
HCIA-AI Exam Review Guide
36 pages
Ancient Mariner: Line-by-Line Analysis
No ratings yet
Ancient Mariner: Line-by-Line Analysis
7 pages
Form 2 English Evaluation Exam 2022
No ratings yet
Form 2 English Evaluation Exam 2022
13 pages
Deploying Windows 10 Roaming Profiles
No ratings yet
Deploying Windows 10 Roaming Profiles
29 pages
Flexbox vs. Grid Layout Explained
No ratings yet
Flexbox vs. Grid Layout Explained
7 pages
Online Fashion Store Project Report
No ratings yet
Online Fashion Store Project Report
298 pages
Micro:bit Nightlight Lesson Plan
No ratings yet
Micro:bit Nightlight Lesson Plan
11 pages
Strategies for Gender-Inclusive Writing
No ratings yet
Strategies for Gender-Inclusive Writing
1 page
Analysis of Crc-Polynomials For Safety-Critical
100% (1)
Analysis of Crc-Polynomials For Safety-Critical
6 pages