
Data Processing & Machine Learning Guide

Uploaded by

tawoni3834
Copyright
© All Rights Reserved

Notes on Data Processing, Supervised Learning, Unsupervised Learning

Data Preprocessing

Data preprocessing is the process of transforming raw data into an understandable format.
It is a crucial step before feeding data into machine learning models.
Steps involved in data preprocessing:
- Data Cleaning: Handling missing values, removing duplicates, correcting errors.
- Data Integration: Combining data from different sources.
- Data Transformation: Normalization, standardization, encoding categorical variables.
- Data Reduction: Reducing data volume while producing similar analytical results (e.g.,
PCA, feature selection).
- Data Discretization: Converting continuous data into discrete buckets.

Data Preprocessing in Python:
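A minimal sketch of several of the steps listed above (cleaning, transformation, discretization), using only the Python standard library. The records, field names, bucket edges, and helper functions are hypothetical and chosen for illustration; data integration and data reduction are not shown.

```python
from statistics import mean, pstdev

# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 25, "income": 50000.0, "city": "Paris"},
    {"age": None, "income": 62000.0, "city": "Rome"},
    {"age": 40, "income": 58000.0, "city": "Paris"},
    {"age": 25, "income": 50000.0, "city": "Paris"},  # exact duplicate
]

def clean(records):
    """Data cleaning: drop exact duplicates, fill missing ages with the mean."""
    unique = []
    for r in records:
        if r not in unique:
            unique.append(dict(r))  # copy so the raw data is not modified
    known = [r["age"] for r in unique if r["age"] is not None]
    fill = mean(known)
    for r in unique:
        if r["age"] is None:
            r["age"] = fill
    return unique

def min_max(values):
    """Normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Encoding: one 0/1 column per category, in sorted category order."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]

def discretize(value, edges):
    """Discretization: index of the first bucket whose upper edge exceeds value."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

rows = clean(raw)
ages = [r["age"] for r in rows]
age_norm = min_max(ages)
income_std = standardize([r["income"] for r in rows])
city_onehot = one_hot([r["city"] for r in rows])
age_bucket = [discretize(a, [30, 50]) for a in ages]
```

Filling missing values with the mean is only one possible cleaning strategy; dropping the incomplete rows or using the median are common alternatives.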


Data Processing

Data processing refers to the collection and manipulation of data to produce meaningful
information. It includes all stages from data collection to data visualization.
Key phases of data processing:
- Data Collection: Gathering data from various sources.
- Data Input: Converting collected data into a machine-readable form.
- Data Processing: Applying algorithms, transformations, and models.
- Data Output: Generating results from processed data.
- Data Storage: Saving processed data for future use.

Data Processing in Python:
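The five phases listed above can be sketched end to end with the standard library. The CSV text, column names, and the revenue aggregation are hypothetical stand-ins; a real pipeline would read from and write to actual files or databases.

```python
import csv
import io
import json

# Data collection: a hypothetical CSV export, inlined so the sketch is self-contained.
collected = "product,units,price\npen,10,1.5\nbook,2,12.0\npen,5,1.5\n"

# Data input: parse the text into machine-readable records with typed fields.
records = [
    {"product": row["product"], "units": int(row["units"]), "price": float(row["price"])}
    for row in csv.DictReader(io.StringIO(collected))
]

# Data processing: aggregate revenue per product.
revenue = {}
for rec in records:
    revenue[rec["product"]] = revenue.get(rec["product"], 0.0) + rec["units"] * rec["price"]

# Data output: a human-readable report, highest revenue first.
report = [
    f"{name}: {total:.2f}"
    for name, total in sorted(revenue.items(), key=lambda kv: -kv[1])
]

# Data storage: serialize the result; here it is kept as a JSON string.
stored = json.dumps(revenue, sort_keys=True)
```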


Introduction to Machine Learning
- Machine learning is a branch of artificial intelligence and computer science.
- It focuses on using data and algorithms to imitate the way that humans
learn, gradually improving its accuracy.

Types of Machine Learning

Supervised Learning

As its name suggests, supervised machine learning is based on supervision. In the
supervised learning technique, we train the machine on a "labeled" dataset and, based
on that training, the machine predicts the output for new inputs.

Supervised learning is a type of machine learning where the model is trained on labeled
data.
Characteristics:
- Input and output pairs are provided.
- Goal is to learn a mapping from inputs to outputs.
Some real-world applications of supervised learning are Risk Assessment, Fraud
Detection, Spam filtering, etc.

Types of Supervised Learning:

Supervised machine learning can be classified into two types of problems, which are
given below:

o Classification
o Regression

a) Classification
Classification algorithms are used to solve classification problems, in which the output
variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. The
classification algorithms predict the categories present in the dataset. Some real-world
applications of classification are spam detection, email filtering, etc.

Some popular classification algorithms are given below:

o KNN Algorithm
o Support Vector Machine Algorithm

b) Regression
Regression algorithms are used to solve regression problems, in which the output variable
is continuous. Linear regression additionally assumes a linear relationship between the
input and output variables. Regression is used to predict continuous quantities, such as
market trends, weather prediction, etc.

A popular regression algorithm is given below:

o Simple Linear Regression Algorithm

Simple Linear Regression


Simple Linear Regression models the relationship between a single independent variable
(X) and a dependent variable (Y) by fitting a straight line: Y = β₀ + β₁X.

Steps
1. Collect and visualize data to check linearity.
2. Compute the slope (β₁) and intercept (β₀) using least squares:
   β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
   β₀ = ȳ - β₁ x̄
3. Use the fitted line to predict Y for new X values.
4. Evaluate model performance using metrics like MSE and R².

Formulas
Slope (β₁):
β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²

Intercept (β₀):
β₀ = ȳ - β₁ x̄

Model: Y = β₀ + β₁ X
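The slope and intercept formulas above translate directly into code. The following is a small sketch in plain Python; the function names and the five sample points are made up for illustration, and the MSE and R² helpers use the standard definitions of those metrics.

```python
def fit_line(xs, ys):
    """Return (beta0, beta1) from the least-squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def predict(beta0, beta1, x):
    """Model: Y = beta0 + beta1 * X."""
    return beta0 + beta1 * x

def mse(ys, preds):
    """Mean squared error between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def r_squared(ys, preds):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical sample data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
beta0, beta1 = fit_line(xs, ys)
preds = [predict(beta0, beta1, x) for x in xs]
```

For this sample, x̄ = 3 and ȳ = 4, so β₁ = 6 / 10 = 0.6 and β₀ = 4 - 0.6 × 3 = 2.2; the fit gives MSE = 0.48 and R² = 0.6.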

Illustration
The following figure shows the data points and the fitted regression line:

k-Nearest Neighbors (kNN)
k-Nearest Neighbors (kNN) is a non-parametric, instance-based learning algorithm used
for classification and regression. For classification, an unlabeled sample is assigned the
label most common among its k nearest neighbors.

Algorithm Steps
1. Choose the number of neighbors k.
2. Compute the distance (e.g., Euclidean) between the query point and all training points.
3. Select the k training samples with the smallest distances.
4. For classification, assign the class by majority vote among the k neighbors.
5. For regression, take the average of the k neighbors' values.
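The algorithm steps above can be sketched in a few lines of plain Python. The function names and the toy datasets are hypothetical; Euclidean distance is computed with `math.dist` (Python 3.8+), and a tied vote goes to the label encountered first among the neighbors, which is an arbitrary choice of this sketch.

```python
from collections import Counter
from math import dist

def k_nearest(train, query, k):
    """Steps 2-3: the k (point, target) pairs closest to query (Euclidean)."""
    return sorted(train, key=lambda pair: dist(pair[0], query))[:k]

def knn_classify(train, query, k):
    """Step 4: majority vote among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Step 5: average of the k nearest neighbors' values."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)

# Hypothetical training data: (point, label) and (point, value) pairs.
labeled = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
           ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
valued = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]

label_near_a = knn_classify(labeled, (2, 2), k=3)
label_near_b = knn_classify(labeled, (6, 5), k=3)
estimate = knn_regress(valued, (2.5,), k=2)
```

Sorting all training points for every query keeps the sketch short; it also makes the "lazy learning" cost visible, since all the work happens at prediction time.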

Key Characteristics
- Instance-based: stores the entire training dataset.
- Lazy learning: no explicit training phase.
- Distance metric choice (Euclidean, Manhattan) impacts behavior.
- Sensitive to the scale of features and choice of k.
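To illustrate the sensitivity to feature scale, the sketch below finds the single nearest neighbor of a query before and after min-max scaling. The two training points, their labels, and the query are invented for this example; the income feature has a much larger numeric range than the age feature, so it dominates the unscaled distance.

```python
from math import dist

# Hypothetical (age in years, income in dollars) training points with labels.
train = [((25, 50000), "young"), ((60, 50500), "senior")]
query = (27, 50400)

def nearest_label(points, q):
    """Label of the single closest training point (k = 1)."""
    return min(points, key=lambda pair: dist(pair[0], q))[1]

def min_max_scaler(rows):
    """Build a function that rescales each feature to [0, 1] using the rows' ranges."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    return lambda row: tuple(
        (v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)
    )

scale = min_max_scaler([p for p, _ in train])
scaled_train = [(scale(p), label) for p, label in train]

raw_label = nearest_label(train, query)                   # income dominates
scaled_label = nearest_label(scaled_train, scale(query))  # both features count
```

On the raw features the query is closest to the "senior" point because of the 100-dollar income gap; after scaling, the 2-year age gap decides and the nearest point is "young".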

Decision Boundary Illustration


The figure below shows a kNN decision boundary for k=3 on a synthetic dataset:

Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised learning algorithm that seeks an optimal
separating hyperplane between classes by maximizing the margin between the nearest
data points of each class.

Core Concepts
- Hyperplane: decision boundary that separates classes.
- Margin: distance between the hyperplane and the closest samples (support vectors).
- Support Vectors: training samples that lie on the margin and define the hyperplane.
- Soft Margin: allows some misclassifications via a penalty parameter C for non-
separable data.
- Kernel Trick: transforms data into higher-dimensional space to handle non-linear
separations.

Illustration of Hyperplane and Margins


The figure below depicts a linear SVM separating two classes, with its hyperplane and
margins:

Supervised Learning in Python:
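As one supervised-learning sketch, the SVM core concepts described above (hyperplane, margin, support vectors, soft margin) can be made concrete for a linear hyperplane w·x + b = 0. This does not train an SVM: the weights `w` and `b` are hand-picked for a toy dataset, the function names are hypothetical, and the kernel trick is not covered. The objective shown is the standard soft-margin form, ½‖w‖² plus C times the total hinge loss.

```python
from math import sqrt

def decision(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the margin lines w.x + b = +1 and w.x + b = -1."""
    return 2 / sqrt(sum(wi * wi for wi in w))

def support_vectors(w, b, data, tol=1e-9):
    """Samples lying on a margin line, i.e. y * (w.x + b) == 1."""
    return [x for x, y in data if abs(y * decision(w, b, x) - 1) <= tol]

def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * sum of hinge losses max(0, 1 - y * (w.x + b))."""
    hinge = sum(max(0.0, 1 - y * decision(w, b, x)) for x, y in data)
    return 0.5 * sum(wi * wi for wi in w) + C * hinge

# Toy 2-D data with labels +1 / -1 and a hand-picked hyperplane x1 = 2.
data = [((1, 1), -1), ((1, 3), -1), ((3, 1), 1), ((3, 3), 1),
        ((0, 2), -1), ((4, 2), 1)]
w, b = (1.0, 0.0), -2.0

width = margin_width(w)
svs = support_vectors(w, b, data)
clean_cost = soft_margin_objective(w, b, data, C=1.0)

# One sample on the wrong side of the hyperplane adds a hinge penalty scaled by C.
noisy = data + [((2.5, 2), -1)]
noisy_cost = soft_margin_objective(w, b, noisy, C=1.0)
```

A larger C makes each margin violation more expensive, so the optimizer tolerates fewer misclassifications at the price of a narrower margin.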


Unsupervised Learning

Unsupervised learning is a type of machine learning where the model is trained on
unlabeled data.
Characteristics:
- Only input data is provided; no labels.
- Goal is to find hidden patterns or intrinsic structures.

Common algorithms:
- K-Means Clustering
- Apriori Algorithm (for association rule mining)
Applications:
- Customer segmentation, anomaly detection, market basket analysis, etc.

Types of Unsupervised Learning

Clustering:
A clustering problem is where you want to discover the inherent groupings in the data,
such as grouping customers by purchasing behavior.
For example, data points that lie close together in a scatter plot can be assigned to
one single group. In the picture below, three clusters can be distinguished.

K-Means Clustering

The working of the K-Means algorithm is explained in the steps below:

Step 1: Select the number K to decide the number of clusters.

Step 2: Select K random points as initial centroids. (They need not be points from the
input dataset.)

Step 3: Assign each data point to its closest centroid, which will form the predefined K
clusters.

Step 4: Calculate the variance and place a new centroid for each cluster (the mean of the
points assigned to it).

Step 5: Repeat Step 3, which means reassigning each data point to the new closest
centroid.

Step 6: If any reassignment occurred, go to Step 4; otherwise, finish.

Step 7: The model is ready.
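The steps above can be sketched in plain Python as follows. The function name, the `seed`, `init`, and `max_iter` parameters, and the sample points are assumptions of this sketch: `init` lets the caller fix the starting centroids for reproducibility, and `max_iter` is only a safeguard against long runs. Each centroid is updated to the mean of its assigned points, which is the position that minimizes the within-cluster variance.

```python
import random
from math import dist

def k_means(points, k, seed=0, init=None, max_iter=100):
    """Return (centroids, assignments) following the steps above."""
    rng = random.Random(seed)
    # Step 2: initial centroids, random unless given explicitly.
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign each point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Step 6: stop when no point changed cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments

# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = k_means(points, k=2, init=[(1, 1), (8, 8)])
```

With random initialization the cluster numbering, and on harder data the clusters themselves, can differ between runs, which is why K-Means is often run several times with different seeds.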


Association
Association rule learning, as the name suggests, tries to find associations
between items in the data.

The goal is to discover relationships in the data.

The most common example of this is market basket analysis.

Association rule learning works on if-then statements, such as "if A,
then B".

The "if" part is called the antecedent.

The "then" part is called the consequent.

Apriori Algorithm

- This algorithm uses frequent itemsets to generate association rules.
- It is designed to work on databases that contain transactions.
- It uses a breadth-first search and a hash tree to count candidate
itemsets efficiently.
- It is mainly used for market basket analysis and helps to identify
products that are often bought together.
- It can also be used in the healthcare field to find drug reactions in patients.

Unsupervised Learning in Python:
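As one example, the level-wise (breadth-first) search for frequent itemsets in the Apriori algorithm can be sketched in plain Python. This is a simplified illustration, not a full implementation: candidates are counted with ordinary set operations rather than a hash tree, the function names and the transactions are hypothetical, and `min_support` is an absolute count. The `confidence` helper uses the standard definition, support(A and B) / support(A), for a rule "if A, then B".

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        kept = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(kept)
        # Next level: join frequent k-itemsets into (k+1)-candidates, then
        # prune any candidate that has an infrequent k-item subset.
        k += 1
        joined = {a | b for a in kept for b in kept if len(a | b) == k}
        level = {
            c for c in joined
            if all(frozenset(s) in kept for s in combinations(c, k - 1))
        }
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of 'if antecedent then consequent'."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(baskets, min_support=2)
butter_then_bread = confidence(frequent, {"butter"}, {"bread"})
```

Here {milk, butter} appears in only one basket, so it is dropped, and the 3-itemset {bread, milk, butter} is pruned without being counted because one of its subsets is infrequent.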
