Data Processing & Machine Learning Guide

Uploaded by

tawoni3834

Notes on Data Processing, Supervised Learning, and Unsupervised Learning

Data Preprocessing

Data preprocessing is the process of transforming raw data into an understandable format.
It is a crucial step before feeding data into machine learning models.
Steps involved in data preprocessing:
- Data Cleaning: Handling missing values, removing duplicates, correcting errors.
- Data Integration: Combining data from different sources.
- Data Transformation: Normalization, standardization, encoding categorical variables.
- Data Reduction: Reducing the volume but producing similar analytical results (e.g.,
PCA, feature selection).
- Data Discretization: Converting continuous data into discrete buckets.

Data Preprocessing in Python:
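The original code for this heading is not available, so what follows is only a minimal sketch of the cleaning, transformation, and discretization steps listed above, using the Python standard library. The records, the column meanings (age, income, city), and the bucket boundaries are all hypothetical and chosen purely for illustration.

```python
from statistics import mean, pstdev

# Hypothetical raw records: (age, income, city). None marks a missing value.
raw = [
    (25, 40000.0, "Pune"),
    (32, None, "Delhi"),
    (25, 40000.0, "Pune"),   # exact duplicate of the first record
    (47, 85000.0, "Delhi"),
    (51, 62000.0, "Mumbai"),
]

# Data cleaning: remove duplicates (keeping order), then fill missing
# incomes with the mean of the known incomes.
deduped = list(dict.fromkeys(raw))
income_mean = mean(r[1] for r in deduped if r[1] is not None)
cleaned = [(age, inc if inc is not None else income_mean, city)
           for age, inc, city in deduped]

ages = [r[0] for r in cleaned]
incomes = [r[1] for r in cleaned]


def min_max(values):
    """Normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Encoding categorical variables: one-hot encode the city column.
cities = sorted({r[2] for r in cleaned})
one_hot = [[1 if r[2] == c else 0 for c in cities] for r in cleaned]


def age_bucket(age):
    """Discretization: map a continuous age to a (hypothetical) bucket."""
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"


buckets = [age_bucket(a) for a in ages]

print(min_max(ages))
print(standardize(incomes))
print(cities, one_hot)
print(buckets)
```

In practice, libraries such as pandas and scikit-learn provide these operations; the hand-written versions here are only meant to make each step visible.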

Data Processing

Data processing refers to the collection and manipulation of data to produce meaningful
information. It includes all stages from data collection to data visualization.
Key phases of data processing:
- Data Collection: Gathering data from various sources.
- Data Input: Converting collected data into a machine-readable form.
- Data Processing: Applying algorithms, transformations, and models.
- Data Output: Generating results from processed data.
- Data Storage: Saving processed data for future use.

Data Processing in Python:
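The original code for this heading is not available. As a rough illustration only, the sketch below walks through the five phases listed above with the standard library. The CSV text, its column names, and the revenue calculation are hypothetical, and an in-memory buffer stands in for real storage.

```python
import csv
import io

# Data collection: hypothetical sales records arriving as CSV text.
collected = "product,units,unit_price\npen,10,1.5\nbook,2,12.0\npen,5,1.5\n"

# Data input: parse the text into machine-readable records (dicts).
rows = list(csv.DictReader(io.StringIO(collected)))

# Data processing: aggregate revenue per product.
revenue = {}
for row in rows:
    amount = int(row["units"]) * float(row["unit_price"])
    revenue[row["product"]] = revenue.get(row["product"], 0.0) + amount

# Data output: report the results.
for product, total in sorted(revenue.items()):
    print(f"{product}: {total:.2f}")

# Data storage: save the processed data. An in-memory buffer is used here;
# an open file object would work the same way.
stored = io.StringIO()
writer = csv.writer(stored)
writer.writerow(["product", "revenue"])
writer.writerows(sorted(revenue.items()))
```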

Introduction to Machine Learning
- Machine learning is a branch of artificial intelligence and computer science.
- It focuses on using data and algorithms to imitate the way humans learn, gradually improving its accuracy.

Types of Machine Learning

Supervised Learning

As its name suggests, supervised machine learning is based on supervision: the model is trained on a "labeled" dataset and, based on that training, predicts the output for new inputs.
Characteristics:
- Input and output pairs are provided.
- Goal is to learn a mapping from inputs to outputs.
Some real-world applications of supervised learning are Risk Assessment, Fraud
Detection, Spam filtering, etc.

Types of Supervised Learning:

Supervised machine learning can be classified into two types of problems, which are
given below:

- Classification
- Regression

a) Classification
Classification algorithms are used to solve problems in which the output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. The algorithm predicts which of the categories present in the dataset a new sample belongs to. Real-world applications of classification include spam detection and email filtering.

Some popular classification algorithms are given below:

- KNN Algorithm
- Support Vector Machine Algorithm

b) Regression
Regression algorithms are used to solve problems in which the output variable is continuous. They model the relationship between the input and output variables (a linear relationship in the simplest case) and are used to predict continuous quantities, such as market trends or weather measurements.

A common regression algorithm is given below:

- Simple Linear Regression Algorithm

Simple Linear Regression

Simple Linear Regression models the relationship between a single independent variable
(X) and a dependent variable (Y) by fitting a straight line: Y = β₀ + β₁X.

Steps
1. Collect and visualize data to check linearity.
2. Compute the slope (β₁) and intercept (β₀) using least squares:
   β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
   β₀ = ȳ - β₁ x̄
3. Use the fitted line to predict Y for new X values.
4. Evaluate model performance using metrics like MSE and R².

Formulas
Slope (β₁):
β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²

Intercept (β₀):
β₀ = ȳ - β₁ x̄

Model: Y = β₀ + β₁ X
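
The formulas above can be sketched directly in plain Python. This is a minimal illustration, not a production implementation; the function names and the five data points are made up for the example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0, beta1) from the least-squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx                 # slope
    beta0 = y_bar - beta1 * x_bar     # intercept
    return beta0, beta1


def mse_and_r2(xs, ys, beta0, beta1):
    """Return (MSE, R^2) of the fitted line on the given data."""
    preds = [beta0 + beta1 * x for x in xs]
    n = len(ys)
    y_bar = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return ss_res / n, 1 - ss_res / ss_tot


# Hypothetical data that is roughly linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

beta0, beta1 = fit_simple_linear_regression(xs, ys)
mse, r2 = mse_and_r2(xs, ys, beta0, beta1)
print(f"Y = {beta0:.2f} + {beta1:.2f} X, MSE = {mse:.4f}, R^2 = {r2:.4f}")
```

For this data, x̄ = 3 and ȳ = 6, so β₁ = 19.7 / 10 = 1.97 and β₀ = 6 - 1.97 × 3 = 0.09.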

Illustration
The following figure shows the data points and the fitted regression line:

k-Nearest Neighbors (kNN)
k-Nearest Neighbors (kNN) is a non-parametric, instance-based learning algorithm used
for classification and regression. For classification, an unlabeled sample is assigned the
label most common among its k nearest neighbors.

Algorithm Steps
1. Choose the number of neighbors k.
2. Compute the distance (e.g., Euclidean) between the query point and all training points.
3. Select the k training samples with the smallest distances.
4. For classification, assign the class by majority vote among the k neighbors.
5. For regression, take the average of the k neighbors' values.
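
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes Euclidean distance and small in-memory datasets; the function names and sample points are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Steps 2-3: return the k (features, target) pairs closest to query."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]


def knn_classify(train, query, k):
    """Step 4: majority vote among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k):
    """Step 5: average of the k nearest neighbors' values."""
    neighbors = k_nearest(train, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)


# Hypothetical labeled points forming two well-separated groups.
labeled = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
# Hypothetical one-feature regression data.
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_classify(labeled, (2.0, 2.0), k=3))
print(knn_regress(numeric, (2.1,), k=3))
```

Because the prediction depends entirely on distances, features on very different scales should be normalized or standardized before using a sketch like this.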

Key Characteristics
- Instance-based: stores the entire training dataset.
- Lazy learning: no explicit training phase.
- Distance metric choice (Euclidean, Manhattan) impacts behavior.
- Sensitive to the scale of features and choice of k.

Decision Boundary Illustration

The figure below shows a kNN decision boundary for k=3 on a synthetic dataset:

Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised learning algorithm that seeks an optimal
separating hyperplane between classes by maximizing the margin between the nearest
data points of each class.

Core Concepts
- Hyperplane: decision boundary that separates classes.
- Margin: distance between the hyperplane and the closest samples (support vectors).
- Support Vectors: training samples that lie on the margin and define the hyperplane.
- Soft Margin: allows some misclassifications via a penalty parameter C for non-separable data.
- Kernel Trick: implicitly maps data into a higher-dimensional space to handle non-linear separations.
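
To make the hyperplane, margin, support vector, and soft-margin ideas concrete, the sketch below evaluates them for a linear hyperplane w·x + b = 0. It does not train an SVM: the weights w = (1, 1) and b = -5 are hand-picked for the hypothetical data, and the function names are illustrative only. The kernel trick is not covered.

```python
import math

# Hypothetical 2-D samples with labels +1 / -1.
data = [
    ((4.0, 4.0), 1), ((3.0, 3.0), 1), ((5.0, 3.0), 1),
    ((1.0, 1.0), -1), ((2.0, 2.0), -1), ((1.0, 2.0), -1),
]

# Hand-picked hyperplane x1 + x2 - 5 = 0 (an assumption, not a trained model).
w, b = (1.0, 1.0), -5.0


def decision(w, b, x):
    """Value of w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def geometric_margin(w, b, data):
    """Smallest distance from any sample to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) / norm for x, y in data)


def support_vectors(w, b, data, tol=1e-9):
    """Samples lying on the margin, i.e. with y * (w.x + b) equal to 1."""
    return [x for x, y in data if abs(y * decision(w, b, x) - 1.0) < tol]


def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses); C penalizes violations."""
    hinge = sum(max(0.0, 1.0 - y * decision(w, b, x)) for x, y in data)
    return 0.5 * sum(wi * wi for wi in w) + C * hinge


print(geometric_margin(w, b, data))        # 1 / sqrt(2) for this setup
print(support_vectors(w, b, data))
print(soft_margin_objective(w, b, data, C=1.0))
```

Adding a sample on the wrong side of the margin, such as ((2.5, 2.0), +1), leaves the hyperplane unchanged in this sketch but raises the objective by C times its hinge loss, which is how the parameter C trades margin width against misclassifications.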

Illustration of Hyperplane and Margins

The figure below depicts a linear SVM separating two classes, with its hyperplane and
margins:

Supervised Learning in Python:

Unsupervised Learning

Unsupervised learning is a type of machine learning where the model is trained on unlabeled data.
Characteristics:
- Only input data is provided; no labels.
- Goal is to find hidden patterns or intrinsic structures.

Common algorithms:
- K-Means Clustering
- Apriori Algorithm (for association rule mining)
Applications:
- Customer segmentation, anomaly detection, market basket analysis, etc.

Types of Unsupervised Learning

Clustering:
A clustering problem is one where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
For example, data points that lie close together in the graph below can be assigned to a single group. The clusters can be told apart visually, and three clusters can be identified in the picture.

K-Means Clustering

The K-Means algorithm works as follows:

Step 1: Choose K, the number of clusters.

Step 2: Select K initial centroids at random. (They need not be points from the input dataset.)

Step 3: Assign each data point to its closest centroid, forming K clusters.

Step 4: Recompute the centroid of each cluster as the mean of the points assigned to it.

Step 5: Repeat Step 3, reassigning each data point to its new closest centroid.

Step 6: If any reassignment occurred, go back to Step 4; otherwise, finish.

Step 7: The model is ready.
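
The steps above can be sketched in plain Python as follows. This is a minimal illustration under a few assumptions: initial centroids are sampled from the input points with a fixed seed, distance is Euclidean, and a cluster that ends up empty keeps its previous centroid. The function name and data are hypothetical.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, labels) after running the K-Means loop."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # Step 2: initial centroids
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:                   # Step 6: nothing changed
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                            # empty cluster: keep old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Which cluster gets index 0 depends on the random initialization, so only the grouping itself should be relied on, not the label numbers.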

Association
Association rule learning, as the name suggests, tries to discover associations between items in the data.

The goal is to find relationships in the data.

The most common example of this is market basket analysis.

Association rules take the form of if-then statements, such as "if A, then B".

The "if" part is called the antecedent.

The "then" part is called the consequent.

Apriori Algorithm

- This algorithm uses frequent itemsets to generate association rules.
- It is designed to work on databases that contain transactions.
- It uses a breadth-first search and a hash tree to count itemsets efficiently.
- It is mainly used for market basket analysis and helps identify products that are often bought together.
- It can also be used in the healthcare field to find drug reactions for patients.

Unsupervised Learning in Python:
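
The original code for this heading is not available. As one possible illustration, the sketch below implements the level-by-level (breadth-first) search of Apriori over a small set of transactions. It is a simplification: itemsets are counted with plain set operations rather than a hash tree, and the function names, the transactions, and the 0.5 support threshold are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every individual item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(level)
        size += 1
        # Join: combine frequent itemsets into candidates one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune: every subset one item smaller must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule 'if antecedent then consequent'."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

frequent = apriori(baskets, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
print(confidence(frequent, {"bread"}, {"milk"}))
```

Here the rule "if bread, then milk" has bread as the antecedent and milk as the consequent; its confidence is the support of {bread, milk} divided by the support of {bread}, about 0.75 for these baskets.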

Common questions

How does supervised learning differ from unsupervised learning, and what are typical applications for each?
Supervised learning uses labeled datasets to train models to predict outcomes based on input-output pairs, and is often used in risk assessment, fraud detection, and spam filtering. In contrast, unsupervised learning deals with unlabeled data and aims to uncover inherent patterns or structures, such as in customer segmentation and anomaly detection. Supervised methods require direct supervision and feedback during training, whereas unsupervised learning explores the data's intrinsic structure autonomously.

Describe how Simple Linear Regression is used to model relationships between variables and the process involved in developing such a model.
Simple Linear Regression models the linear relationship between a single independent variable (X) and a dependent variable (Y) by fitting the best straight line through the data points. The steps include: 1) Collecting and visualizing data to assess linearity; 2) Computing the slope (β₁) and intercept (β₀) using the least squares method, where β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² and β₀ = ȳ - β₁ x̄; 3) Using the fitted line for prediction; 4) Evaluating model performance through metrics such as Mean Squared Error (MSE) and R², which measure prediction accuracy.

What challenges might arise when choosing the number of clusters (K) in K-Means clustering, and what strategies can be employed to address these challenges?
Choosing the number of clusters (K) in K-Means clustering is crucial but challenging due to the subjective nature of clusters, the lack of a specific baseline, and the possibility of overfitting or underfitting the data. Common strategies to address this include: 1) The elbow method, where the within-cluster sum of squares is plotted against K, with the optimal K found at the "elbow" point; 2) Silhouette analysis, which provides insight into how well-separated the clusters are; 3) Domain knowledge, leveraging practical insights about the data. These techniques help decide on an appropriate K by balancing model complexity and interpretability.

What are the critical components of a Support Vector Machine (SVM) and how do these components contribute to its function as a classification algorithm?
SVMs operate on several critical components: the hyperplane, which is the decision boundary separating different classes, and the margin, which is the distance between the hyperplane and the closest samples, known as support vectors. Support vectors are crucial because they determine the position and orientation of the hyperplane. Additionally, the soft margin concept allows the model to tolerate misclassified samples via a penalty parameter C, making it suitable for non-separable data. The kernel trick enables SVMs to operate in higher-dimensional spaces, dealing with non-linear separations by transforming the input data.

What role does the Apriori Algorithm play in association rule learning, and what are its primary applications?
The Apriori Algorithm facilitates association rule learning by identifying frequent itemsets within transactional databases, which form the basis for generating association rules. It uses a breadth-first search and a hash tree structure to count itemsets efficiently. The algorithm's primary application is market basket analysis, which helps identify products often bought together, thus assisting in planning marketing strategies and store layouts. It also finds use in healthcare for discovering patterns in drug reactions.

What are the fundamental differences between classification and regression in supervised learning, and what are some examples of each approach?
Classification and regression are the two main approaches in supervised learning, and they differ fundamentally in their outputs. Classification involves predicting discrete categorical labels, such as "spam" or "not spam", using algorithms like KNN and SVM. Applications include email filtering and fraud detection. Regression focuses on predicting continuous numerical values by modeling the relationship between variables, as in market trend forecasting and temperature prediction, using algorithms like Simple Linear Regression. The key distinction lies in the nature of the predicted output: categorical for classification, continuous for regression.

How does the k-Nearest Neighbors (kNN) algorithm function in both classification and regression tasks, and what are the algorithm's core characteristics?
The kNN algorithm is an instance-based, non-parametric method used for classification and regression that relies on distances between points. For classification, kNN assigns the label most common among the k nearest neighbors; for regression, it averages the values of the k nearest neighbors. Core characteristics include storing the entire training dataset (instance-based), having no explicit training phase (lazy learning), and being sensitive to the scale of features and the choice of k because it relies on distance metrics such as Euclidean or Manhattan.

Explain the process and purpose of data reduction in the context of data preprocessing.
Data reduction aims to decrease the data volume while preserving essential analytical properties, facilitating efficient processing and analysis. It involves techniques like Principal Component Analysis (PCA), which reduces dimensionality by identifying principal components, and feature selection, which identifies and keeps only the most relevant variables. This simplifies the dataset, reduces the computational burden, and can improve model performance by mitigating overfitting and enhancing interpretability without sacrificing critical information.

How do normalization and standardization differ in data transformation, and why are these processes important for machine learning?
Normalization and standardization are distinct data transformations used to prepare datasets for machine learning models. Normalization rescales features so they lie within a specific range, typically 0 to 1, which matters for algorithms such as kNN and SVM that rely on distance metrics. Standardization rescales each feature to have a mean of 0 and a standard deviation of 1, which helps algorithms that assume normally distributed input data. Both processes help features contribute equally, preventing bias due to differing scales.

What are the key steps involved in the data preprocessing phase, and why is each step crucial for machine learning models?
Data preprocessing transforms raw data into a usable format in preparation for machine learning models. Key steps include: 1) Data Cleaning: handling missing values, removing duplicates, and correcting errors to ensure data quality; 2) Data Integration: combining data from different sources into a consistent dataset; 3) Data Transformation: normalizing or standardizing data and encoding categorical variables to ensure compatibility with models; 4) Data Reduction: reducing data volume with techniques like PCA while maintaining analytical results; 5) Data Discretization: converting continuous data into discrete buckets for specific analyses. These steps make the data more structured, which improves model performance.
