Unit-2 Data Collection, Cleaning & Preprocessing
Unit-2 Data Collection, Cleaning & Preprocessing
1
• Interviews: Face-to-face, telephonic, or video interviews
Example: Employee job-satisfaction interview
• Observations: Direct or participant observation
Example: Customer behavior analysis
• Experiments: Controlled or scientific testing
Example: Testing a new medicine
• Sensors & IoT: Automated data from devices
Example: Smartwatch heart-rate data
2
• Web & Social Media Data: Reviews, logs, social media data
Example: Amazon product reviews
Applications
import pandas as pd
for i in range(3):
name = input("Enter student name: ")
mark = int(input("Enter marks: "))
[Link](name)
[Link](mark)
3
# Creating DataFrame
data = [Link]({
"Name": names,
"Marks": marks
})
print(data)
Explanation
• input() → collects data directly from user (primary data)
• Data is stored in lists
• Pandas converts it into a structured dataset
Output
Name Marks
0 Amit 85
1 Neha 90
2 Rahul 78
4
Example 2: Collecting Secondary Data from a CSV file
import pandas as pd
data = pd.read_csv("[Link]")
print([Link]())
5
METHODS TO HANDLE MISSING DATA
(1) Removing Missing Data
a) Removing Rows: Use when missing data is very small.
[Link]()
[Link](axis=1)
df['Marks'].fillna(df['Marks'].mean())
df['Grade'].fillna(df['Grade'].mode()[0])
6
3 150
Using Forward Fill:
Day Sales
1 100
2 100
3 150
Backward Fill (bfill): replaces missing values with the next (below)
value in the column.
Example:
Day Sales
1 100
2 NaN
3 150
Using Backward Fill:
Day Sales
1 100
2 150
3 150
[Link](method='ffill')
[Link](method='bfill')
[Link](0)
PROGRAM (Pandas)
Example Dataset: Student Marks
7
import pandas as pd
import numpy as np
data = {
'Name': ['Amit', 'Neha', 'Rahul', 'Priya'],
'Marks': [85, [Link], 78, [Link]],
'Grade': ['A', 'A', None, 'B']
}
df = [Link](data)
print("Original Data:")
print(df)
Output
8
Output
3. Outliers
They are extreme values that differ significantly from the majority
of data and can distort analysis and model performance. In simple
terms: Outliers are values that lie far away from the normal range
of data.
9
Hence, outlier detection and treatment is a crucial step of Data
Cleaning.
CAUSES OF OUTLIERS
Cause Example
Extra zero in salary (500000 instead of
Data entry error
50000)
Measurement
Faulty sensor
error
Natural variation Extremely high income
Experimental error Wrong unit used
Program:
import numpy as np
std = [Link](data)
print("Mean:",mean)
print("standard deviation:",std)
outliers = data[(data < mean - 3*std) | (data > mean + 3*std)]
10
print(outliers)
Output:
• Mean ≈ 26.33
• Standard Deviation ≈ 32.96
• Upper limit = 26.33 + 3×32.96 ≈ 125.21
• Lower limit = 26.33 − 3×32.96 ≈ −72.54
import numpy as np
data = [Link]([
50, 51, 49, 50, 52, 48, 51, 49, 50, 52,1000
])
mean = [Link](data)
std = [Link](data)
print("Mean:", mean)
11
Output:
Mean: 136.54545454545453
Outliers: [1000]
2. Z-Score Method
Z-score shows how many standard deviations a value is away from
the mean. If Z-score is greater than +3 or less than −3, the value is
an outlier.
Program:
import numpy as np
data = [Link]([50, 51, 49, 50, 52, 48, 51, 49, 50, 52,1000
])
mean = [Link](data)
std = [Link](data)
print("Mean:", mean)
print("Standard Deviation:", std)
z = (data - mean) / std
print("Z :", z)
outliers = data[(z > 3) | (z < -3)]
print(outliers)
output:
Mean: 136.54545454545453
12
Z : [-0.31695722 -0.3132949 -0.32061954 -0.31695722 -
0.30963258 -0.32428186
[1000]
3. IQR (Interquartile Range) Method
Values lying below the lower bound or above the upper bound
are outliers.
Program:
import numpy as np
IQR = Q3 - Q1
13
# Step 5: Calculate the Lower Bound
print("Data:", data)
print("Q1 (25th percentile):", Q1)
print("IQR:", IQR)
14
Data: [ 10 12 11 13 12 100]
Outliers: [100]
4. Box Plot Method
A box plot visually shows outliers as points outside the whiskers.
Program:
import pandas as pd
[Link](column='Marks')
[Link]()
15
Program:
[Link]()
4. Noise in Data
Noise refers to random, irrelevant, or meaningless data values that do not
represent the true underlying pattern of the dataset. In simple words: Noise
is unwanted disturbance in data that hides useful information.
16
If noise is not handled:
Source
Sensor malfunction
Data entry errors
Transmission errors
Environmental factors
Measurement limitations
Noise Outliers
Random disturbance Extreme value
Small variations Large deviation
Affects many points Affects few points
Hard to identify Easy to detect
17
2) Binning Method
• Sort data
• Divide into bins
• Replace values with bin mean/median
3) Regression
4) Clustering :
• Noise points lie far from clusters
Formula
𝑋1 + 𝑋2 + 𝑋3
Moving Average =
3
18
(30 + 31 + 29)/3 = 90/3 = 30
Result (Manual)
Original Smoothed
30 -
31 -
29 30
45 35
30 34.67
32 35.67
31 31
19
Noise 45 is reduced from 45 → 35
df = [Link](data)
df['Moving_Avg'] = df['Temperature'].rolling(window=3).mean()
print(df)
Output
Temperature Moving_Avg
0 30 NaN
1 31 NaN
2 29 30.00
3 45 35.00
4 30 34.67
5 32 35.67
6 31 31.00
Explanation
20
• Sudden spike 45 becomes smoother (35)
Bin 1 → 29, 30
Bin 2 → 30, 31
Bin 3 → 31, 32, 45
Bin 1 Mean:
(29 + 30)/2 = 59/2 = 29.5
Bin 2 Mean:
Bin 3 Mean:
(31 + 32 + 45)/3 = 108/3 = 36
Noise 45 becomes 36
Program (Binning)
import numpy as np
21
sorted_data = [Link](df['Temperature'])
bins = np.array_split(sorted_data, 3)
bin_means = [[Link](bin) for bin in bins]
print("Bins:", bins)
Output
Bins: [array([29, 30]), array([30, 31]), array([31, 32, 45])]
Bin Means: [29.5, 30.5, 36.0]
Explanation
• Data sorted
• Divided into equal groups
• Each group replaced with its mean
• Noise value adjusted
𝑦 = 𝑚𝑥 + 𝑐
𝑚
𝑏
-10-8-6-4-2246810-10-5510y-interceptx-intercept
22
It finds best trend.
Noise point (45) will not strongly affect trend because line averages effect.
X = [Link](len(df)).reshape(-1, 1)
y = df['Temperature']
model = LinearRegression()
[Link](X, y)
df['Trend_Line'] = [Link](X)
print(df)
Output (Approx)
Temperature Trend_Line
0 30 30.1
1 31 31.3
2 29 32.5
3 45 33.7
23
4 30 34.9
5 32 36.1
6 31 37.3
Explanation
MINI PROJECT
Title: Noise Handling and Data Smoothing Techniques Using Python
Objective: To identify and reduce noise in numerical data using various
smoothing techniques such as:
• Moving Average,Exponential Smoothing, Binning, Regression
Problem Statement
Real-world datasets such as sensor readings, stock prices, or temperature
data often contain noise — random irregular values that hide the true
trend.
Example:
A temperature sensor may suddenly show an abnormal value due to device
error.
The goal is to clean the dataset and preserve the real pattern.
24
5. Data Transformation
After data collection and data cleaning (handling missing values,
outliers, noise), the data may still not be suitable for analysis or
machine learning.
Reasons:
OBJECTIVES:
Applications:
• DA, ML, DWg, BI, Healthcare & Finance analytics
Dataset Used
import pandas as pd
import numpy as np
data = {
'Name': ['Amit', 'Shruti', 'Rahul', 'Pooja'],
'Age': [20, 21, 22, 21],
25
'Marks': [45, 78, 90, 60],
'City': ['Delhi', 'Mumbai', 'Delhi', 'Pune']
}
df = [Link](data)
print("RAW DATA", df)
Output
Name Age Marks City
0 Amit 20 45 Delhi
1 Shruti 21 78 Mumbai
2 Rahul 22 90 Delhi
3 Pooja 21 60 Pune
Program (Min-Max)
df['Marks_Normalized'] = (df['Marks'] - df['Marks'].min()) /
(df['Marks'].max() - df['Marks'].min'])
print(df[['Marks', 'Marks_Normalized']])
Output
Marks Marks_Normalized
0 45 0.00
1 78 0.61
2 90 1.00
3 60 0.33
Explanation
• Minimum (45) becomes 0
26
• Other values fall between 0 and 1
Used in Machine Learning.
27
3 Pune 2
Explanation
df['City']: This selects the City column.
.astype('category'): converts column into categorical type, as
Unique categories.
Internally, Pandas assigns:
Delhi → Category 0
Mumbai → Category 1
Pune → Category 2
(Alphabetical order by default)
.[Link]: This extracts numeric code of each category.
4) Aggregation
Aggregation summarizes data using operations like mean, sum,
count.
Program
avg_marks = [Link]('City')['Marks'].mean()
print(avg_marks)
Output
City
Delhi 67.5
Mumbai 78.0
Pune 60.0
Explanation
• Delhi has two students → average = 67.5
28
print(df[['Name', 'Marks', 'Result']])
Output
Name Marks Result
0 Amit 45 Pass
1 Shruti 78 Pass
2 Rahul 90 Pass
3 Pooja 60 Pass
Explanation
• Marks ≥ 40 → Pass
Improves interpretability.
6) Discretization / Binning
Converts continuous numbers into categories.
Program
bins = [0, 50, 75, 100]
labels = ['Low', 'Medium', 'High']
print(df[['Marks', 'Performance']])
Output
Marks Performance
0 45 Low
1 78 High
2 90 High
3 60 Medium
Explanation
• 0–50 → Low
• 50–75 → Medium
• 75–100 → High
29
Makes analysis easier.
• Log values
• Encoded values
• Derived features
• Binned categories
6. DATA NORMALIZATION
It is a data preprocessing technique used to rescale numerical values into
a common range so that no feature dominates others due to its
magnitude.
30
- It makes different numerical features comparable.
Benefits of Normalization
• Equal importance to all features, Faster model convergence
• Improved accuracy, Better performance of ML models
Used:
• ML (KNN, SVM, Neural Networks)
• Data Mining , Clustering
• Recommendation Systems, Statistical Analysis
import pandas as pd
data = {
'Student': ['Amit', 'Neha', 'Rahul', 'Priya', 'Karan'],
'Marks': [45, 88, 76, 95, 60]
}
df = [Link](data)
print("Original Data:")
31
print(df)
Output:
Student Marks
0 Amit 45
1 Neha 88
2 Rahul 76
3 Priya 95
4 Karan 60
32
It rescales data proportionally within range.
Program
df['Marks_MinMax'] = (df['Marks'] - df['Marks'].min()) /
(df['Marks'].max() - df['Marks'].min())
print(df[['Student', 'Marks_MinMax']])
Output
Student Marks_MinMax
0 Amit 0.000000
1 Neha 0.782609
2 Rahul 0.565217
3 Priya 1.000000
4 Karan 0.260870
➔ Best when min and max values are known
Formula:
Where:
• X = Original value
• Z = Standardized value
𝟒𝟓+𝟖𝟖+𝟕𝟔+𝟗𝟓+𝟔𝟎 𝟑𝟔𝟒
𝝁= = =𝟕𝟐. 𝟖
𝟓 𝟓
X X−μ (X − μ)²
45 -27.8 772.84
88 15.2 231.04
76 3.2 10.24
95 22.2 492.84
60 -12.8 163.84
34
Now sum of squared differences:
𝟕𝟕𝟐. 𝟖𝟒 + 𝟐𝟑𝟏. 𝟎𝟒 + 𝟏𝟎. 𝟐𝟒 + 𝟒𝟗𝟐. 𝟖𝟒 + 𝟏𝟔𝟑. 𝟖𝟒 = 𝟏𝟔𝟕𝟎. 𝟖
Divide by N = 5:
𝟏𝟔𝟕𝟎. 𝟖
= 𝟑𝟑𝟒. 𝟏𝟔
𝟓
𝟑. 𝟐
𝒁=
𝟏𝟖. 𝟐𝟖
𝒁 ≈ 𝟎. 𝟏𝟕
Interpretation,
Z = 0.17 means:
- 76 is 0.17 standard deviations above the mean
• Positive value → Above mean
• 0 → Exactly at mean
Output:
Original Data:
Marks
0 45
1 88
2 76
3 95
4 60
Mean: 72.8
36
Standard Deviation: 19.99
Where:
• X = Original value
• X′ = Normalized value
Step 1: Find j
Maximum value = 95
To make 95 < 1, divide by 100.
𝟗𝟓/𝟏𝟎𝟎 = 𝟎. 𝟗𝟓
So,
𝟏𝟎𝒋 = 𝟏𝟎𝟎
𝒋=𝟐
Program
import pandas as pd
data = {
'Student': ['Amit', 'Neha', 'Rahul', 'Priya', 'Karan'],
'Marks': [45, 88, 76, 95, 60]
}
df = [Link](data)
38
# Decimal Scaling
print(df[['Student', 'Marks_Decimal']])
Output
Student Marks_Decimal
0 Amit 0.45
1 Neha 0.88
2 Rahul 0.76
3 Priya 0.95
4 Karan 0.60
Explanation
• We divided all values by 100
5. Max Normalization
It divides each value by the maximum value in the dataset.
𝑿
Formula: 𝑿′ = 𝑿
𝒎𝒂𝒙
Where:
• X = Original value
• X′ = Normalized value
Program
# Max Normalization
df['Marks_MaxNorm'] = df['Marks'] / df['Marks'].max()
print(df[['Student', 'Marks_MaxNorm']])
Output
Student Marks_MaxNorm
0 Amit 0.47
1 Neha 0.93
2 Rahul 0.80
3 Priya 1.00
4 Karan 0.63
Explanation
• Maximum value (95) becomes 1
40
7. DATA WRANGLING
Data Wrangling (also called Data Munging) is the complete process of
converting raw, messy, incomplete, inconsistent data into a clean,
structured, meaningful, and usable format suitable for data analysis and
machine learning.
1. Load data
2. Inspect data
3. Handle missing values
4. Remove duplicates
5. Fix data types
6. Handle outliers
7. Data transformation
8. Data normalization
9. Final clean dataset
41
COMPLETE DATA WRANGLING PROGRAM
MINI PROJECT
• Missing values
• Duplicate records
• Incorrect data types
• Outliers
• Inconsistent formats
These issues must be corrected before performing analysis or applying machine learning
algorithms.
Dataset Description
Column Description
Employee Employee name
Salary Monthly salary
Experience Years of experience
Age Employee age
• Python
• Pandas
• NumPy
42
Implementation
import pandas as pd
import numpy as np
data = {
df = [Link](data)
print("RAW DATA:")
print(df)
Problems Identified
• Missing Salary
• Missing Experience
• Duplicate Row (Riya appears twice)
• Invalid Age value ("thirty")
• Outlier Salary (999999)
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
df['Experience'].fillna(df['Experience'].mean(), inplace=True)
43
df.drop_duplicates(inplace=True)
df['Level'] = df['Experience'].apply(
df['Salary_Normalized'] = (
(df['Salary'] - df['Salary'].min()) /
(df['Salary'].max() - df['Salary'].min())
44
Step 8: Final Clean Dataset
print("\nCLEAN DATA:")
print(df)
Conclusion
The dataset was transformed from messy raw data into clean, structured, and analysis-ready
data suitable for further analytics or machine learning.
• Data Analytics projects, ML pipelines, BI, Healthcare analytics, Banking & finance
• Education systems
45