📘 Chapter 6 — Introduction to NumPy
(Numerical Computing in Python — Foundation for Machine Learning)
1. Why Do We Need NumPy?
Python lists are flexible, but they are not efficient for numerical computation.
Problems with Python lists:
• Slow for large datasets
• No true vectorized operations
• Not memory-efficient
• Not suitable for mathematical modeling
📌 Solution:
NumPy (Numerical Python)
NumPy provides:
• Fast numerical arrays
• Vectorized operations
• Efficient memory usage
• Core foundation for ML, Pandas, SciPy, TensorFlow
2. What is NumPy?
NumPy is a Python library used for:
• Numerical computation
• Working with arrays
• Linear algebra
• Mathematical operations on large datasets
📌 Almost every ML and Data Science library depends on NumPy.
3. Installing and Importing NumPy
3.1 Installation
pip install numpy
3.2 Importing NumPy
import numpy as np
📌 np is the standard alias used everywhere.
4. NumPy Arrays vs Python Lists
4.1 Python List Example
numbers = [1, 2, 3, 4]
numbers * 2
Output:
[1, 2, 3, 4, 1, 2, 3, 4]
❌ Not mathematical multiplication.
4.2 NumPy Array Example
import numpy as np
arr = np.array([1, 2, 3, 4])
arr * 2
Output:
[2 4 6 8]
✔ True mathematical operation
✔ Vectorized
✔ Fast
5. Creating NumPy Arrays
5.1 From a Python List
arr = np.array([1, 2, 3])
5.2 Using Built-in Functions
np.zeros(5)
np.ones(5)
np.arange(0, 10)
np.linspace(0, 1, 5)
Explanation:
• zeros → array of zeros
• ones → array of ones
• arange → range with step
• linspace → evenly spaced values
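As a quick check of the four constructors above, the sketch below prints what each one returns. The step value passed to arange and the variable names are illustrative choices, not part of the original example.

```python
import numpy as np

zeros = np.zeros(5)             # five 0.0 values (float64 by default)
ones = np.ones(5)               # five 1.0 values
evens = np.arange(0, 10, 2)     # start=0, stop=10 (excluded), step=2
points = np.linspace(0, 1, 5)   # 5 values from 0 to 1, both ends included

print(zeros)    # [0. 0. 0. 0. 0.]
print(ones)     # [1. 1. 1. 1. 1.]
print(evens)    # [0 2 4 6 8]
print(points)   # [0.   0.25 0.5  0.75 1.  ]
```

Note the difference: arange takes a step and excludes the stop value, while linspace takes a count and includes it.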
6. Array Properties
arr.shape
arr.size
arr.dtype
Meaning:
• shape → dimensions
• size → number of elements
• dtype → data type
📌 NumPy arrays contain elements of the same type.
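A small sketch of these properties on a 2D array, plus what happens when types are mixed. The arrays here are made up for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape)   # (2, 3): 2 rows, 3 columns
print(matrix.size)    # 6 elements in total
print(matrix.ndim)    # 2 dimensions

# Mixing an int and a float: NumPy upcasts every element to float64.
mixed = np.array([1, 2.5])
print(mixed.dtype)    # float64
print(mixed)          # [1.  2.5]
```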
7. Indexing and Slicing
7.1 Indexing
arr[0]
arr[-1]
7.2 Slicing
arr[1:4]
📌 Same syntax as Python lists. Note that a NumPy slice is a view of the original array, not a copy.
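The sketch below runs the indexing and slicing expressions on a made-up array and shows one difference from lists that often surprises beginners: changing a slice changes the original array.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
first = arr[0]      # 10
last = arr[-1]      # 50
middle = arr[1:4]   # positions 1, 2, 3 -> [20 30 40]

# A NumPy slice shares memory with the original array.
middle[0] = 99
print(arr)          # [10 99 30 40 50]

# Use .copy() when an independent array is needed.
safe = arr[1:4].copy()
safe[0] = 0
print(arr)          # [10 99 30 40 50]
```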
8. Mathematical Operations
NumPy supports vectorized math operations:
arr + 5
arr * 2
arr ** 2
No loops required.
9. NumPy Arrays and Loops
9.1 Loop Approach (Slow)
result = []
for x in arr:
    result.append(x * 2)
9.2 Vectorized Approach (Fast)
arr * 2
📌 This is why NumPy is powerful.
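To see the difference yourself, here is a rough timing sketch using the standard timeit module. The array size and the function names are arbitrary choices, and the exact timings depend on your machine, so no numbers are claimed here.

```python
import timeit
import numpy as np

arr = np.arange(100_000)

def double_with_loop(values):
    # Element-by-element Python loop.
    result = []
    for x in values:
        result.append(x * 2)
    return result

def double_vectorized(values):
    # One vectorized operation on the whole array.
    return values * 2

# Both approaches produce the same numbers.
same = np.array_equal(np.array(double_with_loop(arr)), double_vectorized(arr))
print(same)   # True

loop_time = timeit.timeit(lambda: double_with_loop(arr), number=5)
vec_time = timeit.timeit(lambda: double_vectorized(arr), number=5)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```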
10. Boolean Operations and Filtering
arr > 3
arr[arr > 3]
Used heavily in:
• Data filtering
• Feature selection
• Preprocessing
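A short sketch of how filtering works: the comparison builds a Boolean mask, and the mask selects elements. The array and the combined conditions are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

mask = arr > 3
print(mask)         # [False False False  True  True  True]
print(arr[mask])    # [4 5 6]

# Combine conditions with & (and), | (or), ~ (not).
# Each condition needs its own parentheses.
even_above_two = arr[(arr > 2) & (arr % 2 == 0)]
print(even_above_two)   # [4 6]
print(arr[~mask])       # [1 2 3]
print(mask.sum())       # 3 (number of True values)
```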
11. Multidimensional Arrays (2D)
matrix = np.array([[1, 2], [3, 4]])
Access elements:
matrix[0, 1]
📌 This is the base of matrices in ML.
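Beyond single elements, whole rows and columns can be selected with the same bracket syntax. A minimal sketch on the same 2x2 matrix:

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

print(matrix.shape)    # (2, 2)
print(matrix[0, 1])    # 2: row 0, column 1
print(matrix[1])       # [3 4]: the whole second row
print(matrix[:, 0])    # [1 3]: the whole first column
print(matrix.T)        # transpose: rows become columns
# [[1 3]
#  [2 4]]
```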
12. NumPy and Machine Learning
In ML, NumPy is used for:
• Feature matrices
• Label vectors
• Mathematical modeling
• Linear algebra operations
Many ML libraries (for example, scikit-learn) operate on NumPy arrays internally.
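To make this concrete, here is a sketch of a feature matrix, a label vector, and a linear prediction written with NumPy. The data, weights, and bias are made-up values for illustration, not a trained model.

```python
import numpy as np

# Feature matrix: 3 samples (rows) x 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 0.0, -1.0])   # label vector, one value per sample

w = np.array([0.5, -1.0])        # hypothetical weights
b = 2.0                          # hypothetical bias

y_pred = X @ w + b               # matrix-vector product plus bias
print(y_pred)                    # [ 0.5 -0.5 -1.5]

mse = np.mean((y_pred - y) ** 2) # mean squared error
print(mse)                       # 0.25
```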
13. Common Beginner Mistakes
• Using Python lists instead of NumPy arrays
• Using loops instead of vectorized operations
• Mixing data types in arrays
• Ignoring array shape
14. Best Practices
• Always use NumPy for numerical data
• Avoid loops when possible
• Check shape frequently
• Think in vectors and matrices
15. Summary
• NumPy is the backbone of numerical computing
• Arrays are faster and more efficient than lists
• Vectorization is key
• NumPy is essential before Pandas and ML
📘 Chapter 7 — Pandas: Data Analysis with DataFrames
(Core Tool for Data Analysis & Machine Learning)
1. Why Do We Need Pandas?
After learning:
• File handling
• CSV & JSON
• NumPy arrays
We face a new problem:
📌 Real data is messy and structured in tables, not just numbers.
Python lists and NumPy arrays:
• Are not convenient for labeled data
• Do not handle missing values easily
• Are not ideal for real datasets
✅ Solution:
Pandas
2. What is Pandas?
Pandas is a Python library used for:
• Data manipulation
• Data analysis
• Working with structured (tabular) data
Pandas provides:
• High-level data structures
• Easy data filtering
• Powerful data cleaning tools
📌 Pandas is built on top of NumPy.
3. Core Data Structures in Pandas
Pandas has two main data structures:
• Series → 1D labeled data
• DataFrame → 2D labeled data (table)
4. Pandas Series
4.1 What is a Series?
A Series is:
• One-dimensional
• Labeled
• Similar to a column in a table
4.2 Creating a Series
import pandas as pd
s = pd.Series([10, 20, 30])
4.3 Series with Custom Index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
📌 Each value has a label (index).
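A brief sketch of what the labels are for: values can be read by label or by position, and operations keep the labels attached.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s["b"])       # 20: access by label
print(s.iloc[0])    # 10: access by position

doubled = s * 2     # vectorized, like a NumPy array
print(doubled.tolist())          # [20, 40, 60]

large = s[s > 15]   # Boolean filtering keeps the labels
print(large.index.tolist())      # ['b', 'c']
```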
5. Pandas DataFrame
5.1 What is a DataFrame?
A DataFrame is:
• A table of rows and columns
• Each column has a name
• Each row has an index
📌 It is the most important structure in Pandas.
5.2 Creating a DataFrame from Dictionary
data = {
"name": ["Ali", "Sara", "Omar"],
"age": [20, 22, 21],
"grade": [85, 90, 88]
}
df = pd.DataFrame(data)
5.3 Viewing Data
df.head()
df.tail()
• head() → first rows
• tail() → last rows
6. Reading Data from Files
6.1 Reading CSV Files
df = pd.read_csv("[Link]")
📌 Automatically creates a DataFrame.
6.2 Reading Excel Files
df = pd.read_excel("[Link]")
6.3 Reading JSON Files
df = pd.read_json("[Link]")
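If you have no file at hand, read_csv also accepts a file-like object. The sketch below reads CSV text from an in-memory string; the column names and values are made up to match the earlier student example.

```python
import io
import pandas as pd

csv_text = """name,age,grade
Ali,20,85
Sara,22,90
Omar,21,88
"""

# io.StringIO wraps the string so it behaves like an open file.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)              # (3, 3)
print(df.columns.tolist())   # ['name', 'age', 'grade']
print(df.head(2))            # first two rows
```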
7. Exploring the Dataset
7.1 Dataset Information
df.info()
Shows:
• Column names
• Data types
• Missing values
7.2 Statistical Summary
df.describe()
Provides:
• Mean
• Min / Max
• Standard deviation
8. Selecting Data
8.1 Selecting a Column
df["age"]
Returns a Series.
8.2 Selecting Multiple Columns
df[["name", "grade"]]
Returns a DataFrame.
8.3 Selecting Rows by Index
df.loc[0]
df.iloc[0]
• loc → label-based
• iloc → position-based
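With the default index (0, 1, 2, ...) loc and iloc look identical. The difference shows up once rows have their own labels. A sketch with hypothetical row labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ali", "Sara", "Omar"], "grade": [85, 90, 88]},
    index=["s1", "s2", "s3"],   # hypothetical row labels
)

by_label = df.loc["s2", "name"]   # row labeled "s2"
by_position = df.iloc[1, 0]       # second row, first column
print(by_label, by_position)      # Sara Sara

# Slices differ too: iloc excludes the end, loc includes it.
print(len(df.iloc[0:2]))          # 2 rows (positions 0 and 1)
print(len(df.loc["s1":"s2"]))     # 2 rows (labels s1 and s2)
```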
9. Filtering Data
9.1 Conditional Filtering
df[df["grade"] > 85]
9.2 Multiple Conditions
df[(df["age"] > 20) & (df["grade"] > 85)]
📌 Used heavily in data preprocessing.
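A sketch of the three logical operators used in filters, on the same student data. Each condition must be wrapped in parentheses, and Python's and/or keywords do not work here.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar"],
    "age": [20, 22, 21],
    "grade": [85, 90, 88],
})

both = df[(df["age"] > 20) & (df["grade"] > 85)]     # AND
either = df[(df["age"] > 21) | (df["grade"] < 86)]   # OR
not_top = df[~(df["grade"] >= 90)]                   # NOT

print(both["name"].tolist())      # ['Sara', 'Omar']
print(either["name"].tolist())    # ['Ali', 'Sara']
print(not_top["name"].tolist())   # ['Ali', 'Omar']
```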
10. Handling Missing Data
10.1 Detect Missing Values
df.isnull()
df.isnull().sum()
10.2 Dropping Missing Values
df.dropna()
10.3 Filling Missing Values
df.fillna(0)
11. Modifying Data
11.1 Adding a Column
df["passed"] = df["grade"] >= 60
11.2 Updating Values
df["grade"] = df["grade"] + 5
11.3 Renaming Columns
df.rename(columns={"grade": "final_grade"})
📌 rename returns a new DataFrame; assign the result to a variable to keep the change.
12. Sorting Data
df.sort_values("grade")
df.sort_values("grade", ascending=False)
13. Grouping Data
13.1 Group By
df.groupby("age")["grade"].mean()
📌 Very important for analysis.
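Grouping is easier to see when several rows share a group. The sketch below adds a hypothetical section column to the student data and computes statistics per section.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar", "Mona"],
    "section": ["A", "B", "A", "B"],   # hypothetical grouping column
    "grade": [85, 90, 88, 70],
})

# Select the numeric column first, then aggregate per group.
mean_by_section = df.groupby("section")["grade"].mean()
print(mean_by_section["A"])   # 86.5
print(mean_by_section["B"])   # 80.0

# Several statistics at once.
summary = df.groupby("section")["grade"].agg(["mean", "max", "count"])
print(summary)
```

Selecting the column before calling mean() avoids trying to average text columns such as name.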
14. Pandas in Machine Learning
In ML, Pandas is used for:
• Loading datasets
• Cleaning data
• Feature selection
• Preparing NumPy arrays
Typical flow:
CSV → Pandas → NumPy → ML Model
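The Pandas → NumPy step of this flow can be sketched as follows. The choice of feature and target columns is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [20, 22, 21],
    "grade": [85, 90, 88],
    "passed": [True, True, True],
})

X = df[["age", "grade"]].to_numpy()   # feature matrix
y = df["passed"].to_numpy()           # label vector

print(X.shape, y.shape)   # (3, 2) (3,)
```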
15. Common Beginner Mistakes
• Confusing Series and DataFrame
• Forgetting what axis=0 and axis=1 mean
• Modifying data without checking
• Ignoring missing values
16. Best Practices
• Always inspect data first
• Use meaningful column names
• Handle missing values early
• Convert to NumPy before ML models
17. Summary
• Pandas is essential for data analysis
• DataFrame is the core structure
• Pandas works with real datasets
• Pandas bridges raw data and ML
📘 Chapter 8 — Data Preprocessing for Machine Learning
(The Final Gate Before ML)
1. What is Data Preprocessing?
Data preprocessing is the process of:
• Cleaning raw data
• Transforming data
• Preparing data to be used by machine learning models
📌 Machine Learning models do NOT understand raw data.
They only understand numbers in a clean, consistent format.
Without preprocessing:
❌ Models give wrong results
❌ Training becomes unstable
❌ Accuracy is misleading
2. Why Data Preprocessing is Mandatory Before ML
Real-world data is usually:
• Incomplete
• Inconsistent
• Noisy
• Mixed (numbers + text)
Examples of problems:
• Missing values
• Text data (categories)
• Different scales (age vs salary)
• Irrelevant columns
📌 Preprocessing is not optional.
3. Types of Data in a Dataset
A typical dataset contains:
3.1 Features
• Input variables
• Used to make predictions
• Example: age, salary, education
3.2 Target (Label)
• Output variable
• What we want to predict
• Example: price, disease status, class
📌 ML = Learn a mapping from features → target
4. Handling Missing Values
4.1 What are Missing Values?
Missing values appear as:
• NaN
• Empty cells
• Null values
4.2 Detecting Missing Values
Using Pandas:
df.isnull()
df.isnull().sum()
4.3 Removing Missing Values
df.dropna()
📌 Use only if missing values are few.
4.4 Filling Missing Values (Imputation)
df.fillna(0)
Common strategies:
• Mean
• Median
• Mode
📌 The choice depends on the data type: mean or median for numeric columns, mode for categorical columns.
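A minimal sketch of imputation with the median for a numeric column and the mode for a text column. The DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [20.0, np.nan, 22.0, 30.0],
    "city": ["Cairo", "Giza", None, "Cairo"],
})

print(int(df["age"].isnull().sum()))   # 1 missing value in age

filled = df.copy()   # keep the original untouched
# Numeric column: fill with the median of the known values.
filled["age"] = filled["age"].fillna(filled["age"].median())
# Categorical column: fill with the most frequent value (mode).
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(filled["age"].tolist())    # [20.0, 22.0, 22.0, 30.0]
print(filled["city"].tolist())   # ['Cairo', 'Giza', 'Cairo', 'Cairo']
```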
5. Handling Categorical Data (Encoding)
5.1 Why Encoding is Needed
ML models:
• Cannot understand text
• Only work with numbers
Example:
Gender = Male / Female
❌ Not usable directly.
5.2 Label Encoding
Converts categories to numbers:
Male → 0
Female → 1
Used when:
• Categories have order
• Binary categories
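One simple way to do label encoding in Pandas is to map each category to a number with a dictionary. This is a sketch of that approach; the shirt_size column and its order are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "shirt_size": ["small", "large", "medium", "small"],   # hypothetical
})

# Binary category: any consistent 0/1 assignment works.
df["gender_code"] = df["gender"].map({"Male": 0, "Female": 1})

# Ordered category: the numbers should respect the real order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["shirt_size"].map(size_order)

print(df["gender_code"].tolist())   # [0, 1, 1, 0]
print(df["size_code"].tolist())     # [0, 2, 1, 0]
```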
5.3 One-Hot Encoding
Creates separate columns for each category.
Example:
Color = Red, Blue
Becomes:
Color_Red | Color_Blue
📌 Commonly used in ML preprocessing.
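In Pandas, one-hot encoding can be sketched with get_dummies. The dtype=int argument is an optional choice made here so the new columns hold 0/1 instead of True/False.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Red"]})

# One new column per category, named <column>_<category>.
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)

print(encoded.columns.tolist())         # ['Color_Blue', 'Color_Red']
print(encoded["Color_Red"].tolist())    # [1, 0, 1]
print(encoded["Color_Blue"].tolist())   # [0, 1, 0]
```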
6. Feature Scaling
6.1 Why Scaling is Important
Different features may have different ranges:
• Age: 0–100
• Salary: 1000–100000
Models may give more importance to larger values.
6.2 Normalization
Scales data to range:
0 → 1
Used when:
• Values have known bounds
6.3 Standardization
Centers data around:
Mean = 0
Standard deviation = 1
Used when:
• Data follows normal distribution
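Both scaling methods are short formulas. The sketch below writes them out in NumPy on made-up salary values: normalization is (x - min) / (max - min), and standardization is (x - mean) / std.

```python
import numpy as np

salary = np.array([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])

# Normalization (min-max): smallest value -> 0, largest value -> 1.
normalized = (salary - salary.min()) / (salary.max() - salary.min())
print(normalized)   # [0.   0.25 0.5  0.75 1.  ]

# Standardization (z-score): subtract the mean, divide by the std.
standardized = (salary - salary.mean()) / salary.std()
print(standardized.mean(), standardized.std())   # approximately 0 and 1
```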
7. Feature Selection
7.1 What is Feature Selection?
Feature selection means:
• Keeping useful features
• Removing irrelevant or redundant ones
📌 More features ≠ better model.
7.2 Why Feature Selection Matters
• Reduces overfitting
• Improves performance
• Simplifies the model
• Reduces training time
8. Train-Test Split
8.1 Why Split Data?
To evaluate model performance fairly.
If we train and test on the same data:
❌ Model memorizes data
❌ Accuracy is misleadingly high
8.2 Train-Test Split Concept
• Training set → Learn
• Test set → Evaluate
Typical split:
• 80% training
• 20% testing
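A minimal sketch of an 80/20 split using only NumPy, with made-up data and an arbitrary seed. In practice many projects use scikit-learn's train_test_split helper instead; the manual version is shown only to make the idea visible.

```python
import numpy as np

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (made-up data)
y = np.arange(10)                  # 10 labels

rng = np.random.default_rng(seed=42)   # fixed seed: repeatable split
indices = rng.permutation(len(X))      # shuffled row positions

n_train = int(0.8 * len(X))            # 80% of the rows
train_idx, test_idx = indices[:n_train], indices[n_train:]

# Use the same indices for X and y so rows stay aligned.
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
```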
9. Data Leakage (Important Warning)
9.1 What is Data Leakage?
Data leakage occurs when:
• Information from the test set leaks into training
📌 Leads to:
• Unrealistically high accuracy
• Wrong conclusions
9.2 Common Causes
• Scaling before splitting
• Using future information
• Features that are derived from the target
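The first cause, scaling before splitting, can be sketched as follows. The data is randomly generated for illustration, and the simple slice used as a split is an assumption to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))   # made-up feature
X_train, X_test = X[:80], X[80:]

# Leaky: statistics computed on ALL rows, so test rows influence the scaling.
leaky_mean, leaky_std = X.mean(axis=0), X.std(axis=0)

# Safe: statistics computed on the training rows only...
train_mean, train_std = X_train.mean(axis=0), X_train.std(axis=0)

# ...then reused, unchanged, for both sets.
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std

print(leaky_mean, train_mean)   # the two means differ slightly
print(X_train_scaled.mean())    # approximately 0 by construction
print(X_test_scaled.mean())     # generally not exactly 0, and that is fine
```

The test set is treated like new, unseen data: it is transformed with the training statistics and never used to compute them.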
10. Preprocessing Pipeline (Big Picture)
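A typical ML pipeline:
Raw Data
↓
Cleaning
↓
Encoding
↓
Train / Test Split
↓
Scaling (fitted on the training set only)
↓
Feature Selection
↓
Machine Learning Model
📌 Most ML projects follow some version of this pipeline. The split comes before scaling so that test-set statistics never leak into training.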
11. Preprocessing and Machine Learning
Every ML algorithm assumes:
• Clean input
• Numerical values
• Consistent scale
📌 Bad preprocessing → bad model
📌 Good preprocessing → meaningful results
12. Common Beginner Mistakes
• Ignoring missing values
• Encoding target incorrectly
• Scaling before splitting
• Using all features blindly
• Forgetting data leakage
13. Best Practices
• Always explore data first
• Handle missing values carefully
• Encode categorical data properly
• Split data early
• Build preprocessing pipelines
14. Summary
• Data preprocessing is mandatory
• Raw data is never ready for ML
• Encoding and scaling are essential
• Train-test split is critical
• Preprocessing quality determines model quality
15. Final Transition — Entering Machine Learning
📌 If you understand this chapter well, you are now able to:
• Take any dataset
• Prepare it correctly
• Feed it into a machine learning model
📘 Exercises — NumPy, Pandas & Data Preprocessing
(With Clear Solutions)
🧮 Part 1 — NumPy Exercises
🔹 Exercise 1 — Creating NumPy Arrays
Question
Create a NumPy array that contains the numbers from 0 to 9.
Expected Output
[0 1 2 3 4 5 6 7 8 9]
Solution
import numpy as np
arr = np.arange(10)
print(arr)
🔹 Exercise 2 — Zeros, Ones, Linspace
Question
Create:
1. An array of 5 zeros
2. An array of 5 ones
3. 5 evenly spaced numbers between 0 and 1
Solution
zeros = np.zeros(5)
ones = np.ones(5)
lin = np.linspace(0, 1, 5)
print(zeros)
print(ones)
print(lin)
🔹 Exercise 3 — Array Properties
Question
Given the array:
arr = np.array([10, 20, 30, 40])
Print:
• shape
• size
• data type
Solution
print(arr.shape)
print(arr.size)
print(arr.dtype)
🔹 Exercise 4 — Vectorized Operations
Question
Multiply each element in the array by 3 without using loops.
arr = np.array([1, 2, 3, 4])
Solution
result = arr * 3
print(result)
📌 This is the essence of NumPy: no loops.
🔹 Exercise 5 — Boolean Filtering
Question
From the array below, extract numbers greater than 25:
arr = np.array([10, 20, 30, 40, 50])
Solution
filtered = arr[arr > 25]
print(filtered)
🔹 Exercise 6 — 2D Arrays
Question
Given the matrix:
matrix = np.array([[1, 2], [3, 4]])
Access the value 4.
Solution
print(matrix[1, 1])
🐼 Part 2 — Pandas Exercises
🔹 Exercise 7 — Creating a DataFrame
Question
Create a DataFrame with:
• name
• age
• grade
Solution
import pandas as pd
data = {
"name": ["Ali", "Sara", "Omar"],
"age": [20, 22, 21],
"grade": [85, 90, 88]
}
df = pd.DataFrame(data)
print(df)
🔹 Exercise 8 — Exploring the Data
Question
Display:
• First 2 rows
• Dataset info
• Statistical summary
Solution
print(df.head(2))
df.info()  # info() prints its report itself; wrapping it in print() adds a stray "None"
print(df.describe())
🔹 Exercise 9 — Column Selection
Question
Select:
1. Only the grade column
2. name and age columns
Solution
print(df["grade"])
print(df[["name", "age"]])
🔹 Exercise 10 — Row Selection
Question
Select:
• First row using iloc
• First row using loc
Solution
print(df.iloc[0])
print(df.loc[0])
🔹 Exercise 11 — Filtering Data
Question
Show students with grade > 85.
Solution
filtered = df[df["grade"] > 85]
print(filtered)
🔹 Exercise 12 — Adding a New Column
Question
Add a column passed where:
• True if grade ≥ 60
• False otherwise
Solution
df["passed"] = df["grade"] >= 60
print(df)
🔹 Exercise 13 — Sorting Data
Question
Sort students by grade (descending).
Solution
sorted_df = df.sort_values("grade", ascending=False)
print(sorted_df)
🧹 Part 3 — Data Preprocessing Exercises
🔹 Exercise 14 — Detect Missing Values
Question
Check missing values in the DataFrame.
Solution
print(df.isnull())
print(df.isnull().sum())
🔹 Exercise 15 — Filling Missing Values
Question
Fill missing values with 0.
Solution
df_filled = df.fillna(0)
print(df_filled)
🔹 Exercise 16 — Categorical Encoding (Conceptual)
Question
Why do we need to encode categorical data before ML?
Answer
Because Machine Learning models only understand numbers, not text.
🔹 Exercise 17 — Feature vs Target
Question
In the following dataset:
age, salary → ?
purchased → ?
Answer
• Features → age, salary
• Target → purchased
🔹 Exercise 18 — Train-Test Split (Conceptual)
Question
Why do we split data into training and testing sets?
Answer
To evaluate the model fairly and avoid memorization.
🔹 Exercise 19 — Data Leakage (Important)
Question
What happens if we scale the data before train-test split?
Answer
Data leakage occurs → unrealistically high accuracy → wrong conclusions.
✅ Final Mini Task (Revision)
Question
Arrange the correct ML preprocessing pipeline:
• Scaling
• Raw Data
• Encoding
• Train/Test Split
• Cleaning
Correct Order
Raw Data
↓
Cleaning
↓
Encoding
↓
Train/Test Split
↓
Scaling (fitted on the training set only)
↓
ML Model