
Introduction to NumPy for ML

Chapter 6 introduces NumPy, a Python library essential for numerical computing, providing efficient arrays and vectorized operations, which are crucial for machine learning. It covers the creation and manipulation of NumPy arrays, mathematical operations, and the importance of using NumPy over Python lists for performance. The chapter emphasizes that NumPy is foundational for data science and machine learning libraries, including Pandas and TensorFlow.

Uploaded by

Sohaila Elbadry

📘 Chapter 6 — Introduction to NumPy

(Numerical Computing in Python — Foundation for Machine Learning)

1. Why Do We Need NumPy?


Python lists are flexible, but they are not efficient for numerical computation.

Problems with Python lists:

• Slow for large datasets
• No element-wise (vectorized) arithmetic
• Not memory-efficient, since every element is a separate Python object
• Awkward for linear algebra and other mathematical modeling

📌 Solution:
NumPy (Numerical Python)

NumPy provides:

• Fast numerical arrays


• Vectorized operations
• Efficient memory usage
• Core foundation for ML, Pandas, SciPy, TensorFlow

2. What is NumPy?
NumPy is a Python library used for:

• Numerical computation
• Working with arrays
• Linear algebra
• Mathematical operations on large datasets

📌 Almost every ML and Data Science library depends on NumPy.

3. Installing and Importing NumPy


3.1 Installation
pip install numpy

3.2 Importing NumPy


import numpy as np

📌 np is the standard alias used everywhere.

4. NumPy Arrays vs Python Lists


4.1 Python List Example
numbers = [1, 2, 3, 4]
numbers * 2

Output:

[1, 2, 3, 4, 1, 2, 3, 4]

❌ The list is repeated, not multiplied element by element.

4.2 NumPy Array Example


import numpy as np

arr = np.array([1, 2, 3, 4])


arr * 2

Output:

[2 4 6 8]

✔ True mathematical operation


✔ Vectorized
✔ Fast

5. Creating NumPy Arrays


5.1 From a Python List
arr = np.array([1, 2, 3])

5.2 Using Built-in Functions


np.zeros(5)
np.ones(5)
np.arange(0, 10)
np.linspace(0, 1, 5)

Explanation:

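• zeros(n) → array of n zeros
• ones(n) → array of n ones
• arange(start, stop) → values from start up to, but not including, stop (an optional third argument sets the step)
• linspace(start, stop, n) → n evenly spaced values from start to stop, both included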
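
As a quick check of what these constructors return, the sketch below prints each result. The variable names are illustrative only, and the three-argument arange call is an extra not shown above.

```python
import numpy as np

zeros = np.zeros(5)            # five 0.0 values (float64 by default)
ones = np.ones(5)              # five 1.0 values
counts = np.arange(0, 10)      # integers 0..9; the stop value 10 is excluded
stepped = np.arange(0, 10, 2)  # optional third argument is the step
grid = np.linspace(0, 1, 5)    # 5 evenly spaced values; both endpoints included

print(zeros)    # [0. 0. 0. 0. 0.]
print(counts)   # [0 1 2 3 4 5 6 7 8 9]
print(stepped)  # [0 2 4 6 8]
print(grid)     # [0.   0.25 0.5  0.75 1.  ]
```

Note that zeros and ones produce floats unless a dtype is passed, while arange with integer arguments produces integers.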

6. Array Properties
arr.shape
arr.size
arr.dtype

Meaning:

• shape → size of each dimension, as a tuple
• size → total number of elements
• dtype → data type of the elements

📌 NumPy arrays contain elements of the same type.
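
The properties are easier to read on a 2D array. The following is a small sketch with made-up values; ndim is an additional attribute not listed above, and the last lines show what "same type" means in practice when integers and floats are mixed.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape)  # (2, 3): 2 rows, 3 columns
print(matrix.size)   # 6 elements in total
print(matrix.ndim)   # 2 dimensions

# Mixing ints and floats upcasts every element to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64
```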

7. Indexing and Slicing


7.1 Indexing
arr[0]
arr[-1]

7.2 Slicing
arr[1:4]

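📌 The syntax matches Python lists. A NumPy slice is a view of the original array rather than a copy, so slicing stays cheap even for large arrays.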
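
One difference from lists that often surprises beginners: a basic NumPy slice shares memory with the original array. The sketch below uses illustrative values to show the effect and how .copy() avoids it.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
print(arr[0])    # 10
print(arr[-1])   # 50
print(arr[1:4])  # [20 30 40]

# A slice is a view: writing through it changes the original array
view = arr[1:4]
view[0] = 99
print(arr)       # [10 99 30 40 50]

# Use .copy() when an independent array is needed
safe = arr[1:4].copy()
safe[0] = 0
print(arr)       # still [10 99 30 40 50]
```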

8. Mathematical Operations
NumPy supports vectorized math operations:

arr + 5
arr * 2
arr ** 2

No loops required.

9. NumPy Arrays and Loops


9.1 Loop Approach (Slow)
result = []
for x in arr:
    result.append(x * 2)

9.2 Vectorized Approach (Fast)


arr * 2

📌 This is why NumPy is powerful.
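
To see the difference on your own machine, the two approaches can be timed with the standard timeit module. This is a rough sketch, not a benchmark: the function names and the array size are arbitrary choices, and the exact timings vary from system to system.

```python
import timeit
import numpy as np

arr = np.arange(100_000)

def loop_double(values):
    """Double every element with an explicit Python loop."""
    result = []
    for x in values:
        result.append(x * 2)
    return result

def vectorized_double(values):
    """Double every element with one vectorized operation."""
    return values * 2

# Both approaches must agree on the result
assert loop_double(arr) == vectorized_double(arr).tolist()

loop_time = timeit.timeit(lambda: loop_double(arr), number=5)
vec_time = timeit.timeit(lambda: vectorized_double(arr), number=5)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```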

10. Boolean Operations and Filtering


arr > 3
arr[arr > 3]

Used heavily in:

• Data filtering
• Feature selection
• Preprocessing
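
The two expressions above work together: the comparison builds a boolean mask, and indexing with that mask keeps only the True positions. A minimal sketch with illustrative values:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

mask = arr > 3
print(mask)        # [False False False  True  True  True]
print(arr[mask])   # [4 5 6]

# Combine conditions with & (and), | (or), ~ (not); parentheses are required
evens_over_3 = arr[(arr > 3) & (arr % 2 == 0)]
print(evens_over_3)  # [4 6]
```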
11. Multidimensional Arrays (2D)
matrix = np.array([[1, 2], [3, 4]])

Access elements:

matrix[0, 1]

📌 This is the base of matrices in ML.
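
Beyond single elements, whole rows and columns can be selected, and the @ operator performs matrix multiplication. The weights vector below is a made-up example, not something from the chapter.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

print(matrix[0, 1])  # 2: row 0, column 1
print(matrix[1])     # [3 4]: the whole second row
print(matrix[:, 0])  # [1 3]: the whole first column

# Matrix-vector product, the core operation behind linear models
weights = np.array([10, 1])
print(matrix @ weights)  # [12 34]
```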

12. NumPy and Machine Learning


In ML, NumPy is used for:

• Feature matrices
• Label vectors
• Mathematical modeling
• Linear algebra operations

Most ML algorithms internally operate on NumPy arrays.

13. Common Beginner Mistakes


• Using Python lists instead of NumPy arrays
• Using loops instead of vectorized operations
• Mixing data types in arrays
• Ignoring array shape

14. Best Practices


• Always use NumPy for numerical data
• Avoid loops when possible
• Check shape frequently
• Think in vectors and matrices
15. Summary
• NumPy is the backbone of numerical computing
• Arrays are faster and more efficient than lists
• Vectorization is key
• NumPy is essential before Pandas and ML

📘 Chapter 7 — Pandas: Data Analysis with DataFrames
(Core Tool for Data Analysis & Machine Learning)

1. Why Do We Need Pandas?


After learning:

• File handling
• CSV & JSON
• NumPy arrays

We face a new problem:

📌 Real data is messy and structured in tables, not just numbers.

Python lists and NumPy arrays:

• Are not convenient for labeled data


• Do not handle missing values easily
• Are not ideal for real datasets

✅ Solution:

Pandas

2. What is Pandas?
Pandas is a Python library used for:

• Data manipulation
• Data analysis
• Working with structured (tabular) data

Pandas provides:

• High-level data structures


• Easy data filtering
• Powerful data cleaning tools

📌 Pandas is built on top of NumPy.

3. Core Data Structures in Pandas


Pandas has two main data structures:

• Series → 1D labeled data


• DataFrame → 2D labeled data (table)

4. Pandas Series
4.1 What is a Series?

A Series is:

• One-dimensional
• Labeled
• Similar to a column in a table

4.2 Creating a Series


import pandas as pd

s = pd.Series([10, 20, 30])

4.3 Series with Custom Index


s = pd.Series([10, 20, 30], index=["a", "b", "c"])

📌 Each value has a label (index).
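
The labels can then be used to look values up. A short sketch of the common access patterns, using the same Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s["b"])        # 20: lookup by label
print(s.iloc[0])     # 10: lookup by position
print(s[s > 15])     # keeps the labels "b" and "c"
print(s.to_numpy())  # [10 20 30]: the underlying values as a NumPy array
```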

5. Pandas DataFrame
5.1 What is a DataFrame?

A DataFrame is:

• A table of rows and columns


• Each column has a name
• Each row has an index

📌 It is the most important structure in Pandas.

5.2 Creating a DataFrame from Dictionary


data = {
    "name": ["Ali", "Sara", "Omar"],
    "age": [20, 22, 21],
    "grade": [85, 90, 88]
}

df = pd.DataFrame(data)

5.3 Viewing Data


df.head()
df.tail()

• head() → first rows


• tail() → last rows

6. Reading Data from Files


6.1 Reading CSV Files
df = pd.read_csv("[Link]")
📌 Automatically creates a DataFrame.

6.2 Reading Excel Files


df = pd.read_excel("[Link]")

6.3 Reading JSON Files


df = pd.read_json("[Link]")

7. Exploring the Dataset


7.1 Dataset Information
df.info()

Shows:

• Column names
• Data types
• Non-null counts per column (which reveal missing values)

7.2 Statistical Summary


df.describe()

Provides:

• Mean
• Min / Max
• Standard deviation

8. Selecting Data
8.1 Selecting a Column
df["age"]
Returns a Series.

8.2 Selecting Multiple Columns


df[["name", "grade"]]

Returns a DataFrame.

8.3 Selecting Rows by Index


df.loc[0]
df.iloc[0]

• loc → label-based
• iloc → position-based
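
With the default index 0, 1, 2 both calls return the same row, which hides the difference. The sketch below assumes a hypothetical index of student IDs to make it visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ali", "Sara", "Omar"], "grade": [85, 90, 88]},
    index=[101, 102, 103],  # hypothetical student IDs used as row labels
)

print(df.loc[101, "name"])   # Ali: row labelled 101
print(df.iloc[0]["name"])    # Ali: row at position 0
print(df.iloc[-1]["name"])   # Omar: last row by position

# df.loc[0] would raise KeyError here because no row carries the label 0
```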

9. Filtering Data
9.1 Conditional Filtering
df[df["grade"] > 85]

9.2 Multiple Conditions


df[(df["age"] > 20) & (df["grade"] > 85)]

📌 Used heavily in data preprocessing.
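
Conditions are combined with & (and), | (or) and ~ (not), and each condition needs its own parentheses because these operators bind more tightly than comparisons. A small sketch on the chapter's example table; the result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar"],
    "age": [20, 22, 21],
    "grade": [85, 90, 88],
})

older_and_strong = df[(df["age"] > 20) & (df["grade"] > 85)]  # both true
young_or_top = df[(df["age"] <= 20) | (df["grade"] >= 90)]    # either true
not_top = df[~(df["grade"] >= 90)]                            # negation

print(older_and_strong["name"].tolist())  # ['Sara', 'Omar']
print(young_or_top["name"].tolist())      # ['Ali', 'Sara']
print(not_top["name"].tolist())           # ['Ali', 'Omar']
```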

10. Handling Missing Data


10.1 Detect Missing Values
df.isnull()
df.isnull().sum()

10.2 Dropping Missing Values
df.dropna()

10.3 Filling Missing Values


df.fillna(0)
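
The three steps can be tried on a tiny table with one missing grade. This is a sketch with made-up data; note that dropna and fillna return new DataFrames rather than changing df.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar"],
    "grade": [85, np.nan, 88],
})

print(df.isnull().sum())   # grade has 1 missing value, name has 0
dropped = df.dropna()      # removes Sara's row
filled = df.fillna(0)      # replaces the missing grade with 0

# Both calls return new DataFrames; df itself is unchanged
print(len(dropped), len(filled), df["grade"].isnull().sum())  # 2 3 1
```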

11. Modifying Data


11.1 Adding a Column
df["passed"] = df["grade"] >= 60

11.2 Updating Values


df["grade"] = df["grade"] + 5

11.3 Renaming Columns


df.rename(columns={"grade": "final_grade"})

12. Sorting Data


df.sort_values("grade")
df.sort_values("grade", ascending=False)

13. Grouping Data


13.1 Group By
df.groupby("age")["grade"].mean()

📌 Very important for analysis.
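
A small sketch of grouping, assuming a fourth hypothetical student ("Mona") so that each age group has two rows. A numeric column is selected before calling mean, because averaging a text column such as name raises an error in recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar", "Mona"],
    "age": [20, 22, 20, 22],
    "grade": [85, 90, 75, 80],
})

mean_by_age = df.groupby("age")["grade"].mean()
print(mean_by_age[20])  # 80.0
print(mean_by_age[22])  # 85.0

# Several statistics at once
summary = df.groupby("age")["grade"].agg(["mean", "max", "count"])
print(summary)
```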

14. Pandas in Machine Learning


In ML, Pandas is used for:

• Loading datasets
• Cleaning data
• Feature selection
• Preparing NumPy arrays

Typical flow:

CSV → Pandas → NumPy → ML Model
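
The first three stages of this flow might look like the sketch below. The CSV content is invented and read from a string so the example runs without a file; the column names age, salary and purchased are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (hypothetical data)
csv_text = """age,salary,purchased
25,3000,0
40,8000,1
31,5000,0
"""

df = pd.read_csv(io.StringIO(csv_text))

X = df[["age", "salary"]].to_numpy()   # feature matrix, shape (3, 2)
y = df["purchased"].to_numpy()         # label vector, shape (3,)

print(X.shape, y.shape)  # (3, 2) (3,)
```

X and y are the arrays that would then be passed to a model.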

15. Common Beginner Mistakes


• Confusing Series and DataFrame
• Forgetting axis meaning
• Modifying data without checking
• Ignoring missing values

16. Best Practices


• Always inspect data first
• Use meaningful column names
• Handle missing values early
• Convert to NumPy before ML models

17. Summary
• Pandas is essential for data analysis
• DataFrame is the core structure
• Pandas works with real datasets
• Pandas bridges raw data and ML

📘 Chapter 8 — Data Preprocessing for Machine Learning
(The Final Gate Before ML)

1. What is Data Preprocessing?


Data preprocessing is the process of:

• Cleaning raw data


• Transforming data
• Preparing data to be used by machine learning models

📌 Machine Learning models do NOT understand raw data.


They only understand numbers in a clean, consistent format.

Without preprocessing:
❌ Models give wrong results
❌ Training becomes unstable
❌ Accuracy is misleading

2. Why Data Preprocessing is Mandatory Before ML


Real-world data is usually:

• Incomplete
• Inconsistent
• Noisy
• Mixed (numbers + text)

Examples of problems:

• Missing values
• Text data (categories)
• Different scales (age vs salary)
• Irrelevant columns

📌 Preprocessing is not optional.

3. Types of Data in a Dataset


A typical dataset contains:

3.1 Features

• Input variables
• Used to make predictions
• Example: age, salary, education

3.2 Target (Label)

• Output variable
• What we want to predict
• Example: price, disease status, class

📌 ML = Learn a mapping from features → target

4. Handling Missing Values


4.1 What are Missing Values?

Missing values appear as:

• NaN
• Empty cells
• Null values

4.2 Detecting Missing Values

Using Pandas:

df.isnull()
df.isnull().sum()

4.3 Removing Missing Values


df.dropna()

📌 Use only if missing values are few.


4.4 Filling Missing Values (Imputation)
df.fillna(0)

Common strategies:

• Mean
• Median
• Mode

📌 The choice depends on the data type: mean or median for numeric columns, mode for categorical ones.
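
The strategies listed above can be sketched with fillna. The table and its city column are hypothetical; the median is used for the numeric column and the mode for the text column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [20.0, np.nan, 22.0, 30.0],
    "city": ["Cairo", "Giza", None, "Cairo"],
})

# Numeric column: median is robust to outliers; mean() works the same way
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: most frequent value (mode() returns a Series)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df["age"].tolist())   # [20.0, 22.0, 22.0, 30.0]
print(df["city"].tolist())  # ['Cairo', 'Giza', 'Cairo', 'Cairo']
```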

5. Handling Categorical Data (Encoding)


5.1 Why Encoding is Needed

ML models:

• Cannot understand text


• Only work with numbers

Example:

Gender = Male / Female

❌ Not usable directly.

5.2 Label Encoding

Converts categories to numbers:

Male → 0
Female → 1

Used when:

• Categories have order


• Binary categories
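
One simple way to label-encode a column is an explicit dictionary passed to Series.map. This is a sketch of the idea rather than the only approach (libraries such as scikit-learn provide dedicated encoders); the size column is a made-up example of an ordered category.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "size": ["small", "large", "medium"],   # hypothetical ordered category
})

# Binary category: an explicit mapping keeps the codes predictable
df["gender_code"] = df["gender"].map({"Male": 0, "Female": 1})

# Ordered category: codes follow the real-world order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)

print(df["gender_code"].tolist())  # [0, 1, 1]
print(df["size_code"].tolist())    # [0, 2, 1]
```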

5.3 One-Hot Encoding


Creates separate columns for each category.

Example:

Color = Red, Blue

Becomes:

Color_Red | Color_Blue

📌 Commonly used in ML preprocessing.
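
In pandas, one-hot encoding can be done with get_dummies. The sketch below reuses the Color example; dtype=int is passed so the new columns hold 0/1 instead of booleans.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Red"]})

encoded = pd.get_dummies(df, columns=["Color"], dtype=int)
print(encoded.columns.tolist())       # ['Color_Blue', 'Color_Red']
print(encoded["Color_Red"].tolist())  # [1, 0, 1]
```

Each row has exactly one 1 across the new columns, which is where the name "one-hot" comes from.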

6. Feature Scaling
6.1 Why Scaling is Important

Different features may have different ranges:

• Age: 0–100
• Salary: 1000–100000

Models that rely on distances or gradient descent may give more weight to features with larger values.

6.2 Normalization

Scales data to the range 0 to 1.

Used when:

• Values have known bounds

6.3 Standardization

Rescales data to have:

Mean = 0
Standard deviation = 1

Used when:

• Data roughly follows a normal distribution
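
Both scaling methods come down to one line of NumPy each. The sketch below applies the usual formulas, x' = (x - min) / (max - min) for normalization and z = (x - mean) / std for standardization, to a made-up salary column.

```python
import numpy as np

salary = np.array([1000.0, 25000.0, 50000.0, 100000.0])

# Normalization (min-max): x' = (x - min) / (max - min)
normalized = (salary - salary.min()) / (salary.max() - salary.min())

# Standardization (z-score): z = (x - mean) / std
standardized = (salary - salary.mean()) / salary.std()

print(normalized.min(), normalized.max())      # 0.0 1.0
print(np.isclose(standardized.mean(), 0.0))    # True
print(np.isclose(standardized.std(), 1.0))     # True
```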

7. Feature Selection
7.1 What is Feature Selection?

Feature selection means:

• Keeping useful features


• Removing irrelevant or redundant ones

📌 More features ≠ better model.

7.2 Why Feature Selection Matters

• Reduces overfitting
• Improves performance
• Simplifies the model
• Reduces training time

8. Train-Test Split
8.1 Why Split Data?

To evaluate model performance fairly.

If we train and test on the same data:


❌ Model memorizes data
❌ Accuracy is fake

8.2 Train-Test Split Concept

• Training set → Learn
• Test set → Evaluate

Typical split:

• 80% training
• 20% testing
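
A split like this can be sketched with NumPy alone by shuffling the row indices. The helper train_test_split_simple is a hypothetical name written for this example; in practice many projects use scikit-learn's train_test_split, which does the same job.

```python
import numpy as np

def train_test_split_simple(X, y, test_ratio=0.2, seed=0):
    """Shuffle the row indices once, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # 10 labels

X_train, X_test, y_train, y_test = train_test_split_simple(X, y)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Shuffling before cutting matters when the rows are sorted, and using the same indices for X and y keeps every sample paired with its label.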

9. Data Leakage (Important Warning)


9.1 What is Data Leakage?

Data leakage occurs when:

• Information from the test set leaks into training

📌 Leads to:

• Unrealistically high accuracy


• Wrong conclusions

9.2 Common Causes

• Scaling before splitting


• Using future information
• Target-related features
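
The first cause is the easiest to avoid: split first, compute the scaling statistics on the training rows only, then reuse them for the test rows. The sketch below uses randomly generated data and a plain positional split purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))  # hypothetical feature

X_train, X_test = X[:80], X[80:]   # split first

# Fit the scaling statistics on the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std   # reuse the training statistics

# Leaky variant (avoid): X.mean() and X.std() would include the test rows
```

The scaled test set will not have exactly mean 0 and standard deviation 1, and that is expected: it is being treated like unseen data.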

10. Preprocessing Pipeline (Big Picture)


Typical ML pipeline:

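Raw Data

Cleaning

Train / Test Split

Encoding

Scaling

Feature Selection

Machine Learning Model

📌 Splitting comes before encoding, scaling, and feature selection so that these steps are fitted on the training set only (see Data Leakage above).
📌 Most ML projects follow some version of this pipeline.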

11. Preprocessing and Machine Learning


Every ML algorithm assumes:

• Clean input
• Numerical values
• Consistent scale

📌 Bad preprocessing → bad model


📌 Good preprocessing → meaningful results

12. Common Beginner Mistakes


• Ignoring missing values
• Encoding target incorrectly
• Scaling before splitting
• Using all features blindly
• Forgetting data leakage

13. Best Practices


• Always explore data first
• Handle missing values carefully
• Encode categorical data properly
• Split data early
• Build preprocessing pipelines

14. Summary
• Data preprocessing is mandatory
• Raw data is never ready for ML
• Encoding and scaling are essential
• Train-test split is critical
• Preprocessing quality determines model quality
15. Final Transition — Entering Machine Learning
📌 If you understand this chapter well, you are now able to:

• Take any dataset


• Prepare it correctly
• Feed it into a machine learning model

📘 Exercises — NumPy, Pandas & Data Preprocessing
(With Clear Solutions)

🧮 Part 1 — NumPy Exercises

🔹 Exercise 1 — Creating NumPy Arrays

Question
Create a NumPy array that contains the numbers from 0 to 9.

Expected Output

[0 1 2 3 4 5 6 7 8 9]

Solution

import numpy as np

arr = np.arange(10)
print(arr)

🔹 Exercise 2 — Zeros, Ones, Linspace


Question
Create:

1. An array of 5 zeros
2. An array of 5 ones
3. 5 evenly spaced numbers between 0 and 1

Solution

zeros = np.zeros(5)
ones = np.ones(5)
lin = np.linspace(0, 1, 5)

print(zeros)
print(ones)
print(lin)

🔹 Exercise 3 — Array Properties

Question
Given the array:

arr = np.array([10, 20, 30, 40])

Print:

• shape
• size
• data type

Solution

print(arr.shape)
print(arr.size)
print(arr.dtype)

🔹 Exercise 4 — Vectorized Operations

Question
Multiply each element in the array by 3 without using loops.

arr = np.array([1, 2, 3, 4])

Solution

result = arr * 3
print(result)

📌 This is the essence of NumPy: no loops.

🔹 Exercise 5 — Boolean Filtering

Question
From the array below, extract numbers greater than 25:

arr = np.array([10, 20, 30, 40, 50])

Solution

filtered = arr[arr > 25]


print(filtered)

🔹 Exercise 6 — 2D Arrays

Question
Given the matrix:

matrix = np.array([[1, 2], [3, 4]])

Access the value 4.

Solution

print(matrix[1, 1])

🐼 Part 2 — Pandas Exercises

🔹 Exercise 7 — Creating a DataFrame

Question
Create a DataFrame with:

• name
• age
• grade
Solution

import pandas as pd

data = {
    "name": ["Ali", "Sara", "Omar"],
    "age": [20, 22, 21],
    "grade": [85, 90, 88]
}

df = pd.DataFrame(data)
print(df)

🔹 Exercise 8 — Exploring the Data

Question
Display:

• First 2 rows
• Dataset info
• Statistical summary

Solution

print(df.head(2))
df.info()  # prints its report directly and returns None
print(df.describe())

🔹 Exercise 9 — Column Selection

Question
Select:

1. Only the grade column


2. name and age columns

Solution

print(df["grade"])
print(df[["name", "age"]])

🔹 Exercise 10 — Row Selection

Question
Select:
• First row using iloc
• First row using loc

Solution

print(df.iloc[0])
print(df.loc[0])

🔹 Exercise 11 — Filtering Data

Question
Show students with grade > 85.

Solution

filtered = df[df["grade"] > 85]


print(filtered)

🔹 Exercise 12 — Adding a New Column

Question
Add a column passed where:

• True if grade ≥ 60
• False otherwise

Solution

df["passed"] = df["grade"] >= 60


print(df)

🔹 Exercise 13 — Sorting Data

Question
Sort students by grade (descending).

Solution

sorted_df = df.sort_values("grade", ascending=False)


print(sorted_df)

🧹 Part 3 — Data Preprocessing Exercises


🔹 Exercise 14 — Detect Missing Values

Question
Check missing values in the DataFrame.

Solution

print(df.isnull())
print(df.isnull().sum())

🔹 Exercise 15 — Filling Missing Values

Question
Fill missing values with 0.

Solution

df_filled = df.fillna(0)
print(df_filled)

🔹 Exercise 16 — Categorical Encoding (Conceptual)

Question
Why do we need to encode categorical data before ML?

Answer

Because Machine Learning models only understand numbers, not text.

🔹 Exercise 17 — Feature vs Target

Question
In the following dataset:

age, salary → ?
purchased → ?

Answer

• Features → age, salary


• Target → purchased

🔹 Exercise 18 — Train-Test Split (Conceptual)

Question
Why do we split data into training and testing sets?

Answer

To evaluate the model fairly and avoid memorization.

🔹 Exercise 19 — Data Leakage (Important)

Question
What happens if we scale the data before train-test split?

Answer

Data leakage occurs → unrealistically high accuracy → wrong conclusions.

✅ Final Mini Task (Revision)


Question
Arrange the correct ML preprocessing pipeline:

• Scaling
• Raw Data
• Encoding
• Train/Test Split
• Cleaning

Correct Order

Raw Data

Cleaning

Train/Test Split

Encoding

Scaling

ML Model

📌 Scaling comes after the split so that it is fitted on the training set only (see Exercise 19).
