Machine Learning
By [Link] Mhatre
Application
1. Collecting Data:
As you know, machines initially learn from the data that you give them.
It is of the utmost importance to collect reliable data so that your machine learning
model can find the correct patterns.
The quality of the data that you feed to the machine will determine how accurate your
model is.
If you have incorrect or outdated data, you will have wrong outcomes or predictions
which are not relevant.
Make sure you use data from a reliable source, as it will directly affect the outcome of
your model.
Good data is relevant, contains very few missing and repeated values, and has a good
representation of the various subcategories/classes present.
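As a rough illustration of these quality criteria, the sketch below counts missing values, duplicate rows, and examples per class. The record layout, the field names, and the `summarize_quality` helper are assumptions made for this example, not part of any particular library.

```python
from collections import Counter

def summarize_quality(rows, label_key):
    """Report missing values, duplicate rows, and class balance for a list of dict records."""
    # A value of None is treated as "missing" in this sketch.
    missing = sum(1 for row in rows for value in row.values() if value is None)
    # Dicts are not hashable, so each row is turned into a tuple of (key, value) pairs.
    seen = Counter(tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    class_counts = Counter(row[label_key] for row in rows)
    return {"missing": missing, "duplicates": duplicates, "class_counts": dict(class_counts)}

# Hypothetical toy records.
records = [
    {"height": 1.7, "weight": 65.0, "label": "a"},
    {"height": 1.7, "weight": 65.0, "label": "a"},   # exact duplicate of the first row
    {"height": None, "weight": 80.0, "label": "b"},  # missing height
    {"height": 1.6, "weight": 55.0, "label": "a"},
]
report = summarize_quality(records, "label")
print(report)
```

A class count that is heavily skewed towards one class is a hint that some subcategories are under-represented.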
Figure 1: Collecting Data
Some popular sources for datasets include:
Kaggle (a large repository with diverse data types), FiveThirtyEight (curated
datasets from data journalism), AWS Public Datasets (datasets across various
domains like genomics and climate), World Bank Open Data (global development
data), EU Open Data Portal (datasets from European Union institutions), Pew
Research Center (datasets related to public opinion), and GitHub (user-contributed
datasets).
Key points about these sources:
Kaggle:
Well-known for hosting machine learning competitions and providing a vast collection of
datasets.
FiveThirtyEight:
Offers high-quality datasets often accompanied by code used for visualizations in their news
articles.
AWS Public Datasets:
Access to datasets across various fields, hosted on Amazon S3 storage.
World Bank:
Comprehensive data source on global development topics like poverty, education, and health.
EU Open Data Portal:
Public access to datasets from European Union institutions on topics like government, health,
and energy.
Pew Research Center:
Datasets based on surveys, providing insights into public opinion and societal trends.
GitHub:
User-contributed datasets shared in public repositories.
2. Preparing the Data:
After you have your data, you have to prepare it. You can do this by:
Putting together all the data you have and randomizing its order. This helps make sure
that the data is evenly distributed and that the ordering does not affect the learning process.
Cleaning the data: removing unwanted rows and columns, handling missing values,
dropping duplicate values, converting data types, etc. You might even have to restructure
the dataset by changing its rows and columns or their indexes.
Visualizing the data to understand how it is structured and to understand the relationships
between the various variables and classes present.
Splitting the cleaned data into two sets: a training set and a testing set. The training set
is the set your model learns from. The testing set is used to check the accuracy of your
model after training.
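The randomizing and splitting steps can be sketched as follows. This is a minimal example using only the Python standard library; the `shuffle_and_split` helper, the 20% test fraction, and the fixed seed are assumptions chosen for illustration.

```python
import random

def shuffle_and_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then split it into (training set, testing set)."""
    shuffled = list(rows)                  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # stand-in for ten cleaned records
train_set, test_set = shuffle_and_split(data, test_fraction=0.2)
print(len(train_set), len(test_set))
```

Every record ends up in exactly one of the two sets, so the model is never tested on an example it learned from.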
Figure 2: Cleaning and Visualizing Data
3. Choosing a Model:
A machine learning model is what you get after running a machine learning
algorithm on the collected data; it determines the output produced for a given input.
It is important to choose a model which is relevant to the task at hand. Over the
years, scientists and engineers have developed various models suited for different tasks
like speech recognition, image recognition, prediction, etc.
Apart from this, you also have to see if your model is suited for numerical or
categorical data and choose accordingly.
Figure 3: Choosing a model
4. Training the Model:
Training is the most important step in machine learning. In training, you pass the
prepared data to your machine learning model to find patterns and make
predictions.
It results in the model learning from the data so that it can accomplish the task set.
Over time, with training, the model gets better at predicting.
Figure 4: Training a model
5. Evaluating the Model:
After training your model, you have to check to see how it’s performing.
This is done by testing the performance of the model on previously unseen data.
The unseen data used is the testing set that you split off from your data earlier.
If you test on the same data that was used for training, you will not get an
accurate measure, because the model has already seen that data and finds the same
patterns in it as it did before.
This will give you misleadingly high accuracy.
When you test on the testing set, you get an accurate measure of how well your model will
perform, as well as its speed.
Figure 5: Evaluating a model
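To see why testing on the training data is misleading, here is a small sketch. It uses a 1-nearest-neighbour classifier, chosen only because it memorizes its training data; the synthetic data with 20% flipped labels and all function names are assumptions for this example.

```python
import random

def predict_1nn(train, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def accuracy(train, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for x, y in samples if predict_1nn(train, x) == y)
    return correct / len(samples)

rng = random.Random(0)

def make_point():
    """Synthetic example: label is 1 when x > 5, with 20% of labels flipped as noise."""
    x = rng.uniform(0, 10)
    label = int(x > 5)
    if rng.random() < 0.2:
        label = 1 - label
    return (x, label)

train = [make_point() for _ in range(100)]
test = [make_point() for _ in range(100)]

train_acc = accuracy(train, train)  # each point is its own nearest neighbour
test_acc = accuracy(train, test)    # unseen points expose the memorized noise
print(train_acc, test_acc)
```

The accuracy on the training data is perfect because the model simply looks up each point, while the accuracy on the testing set is lower and closer to what you would see in practice.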
Unseen data:
•This is a broader term encompassing any data that the model has not been
trained on, including potential real-world scenarios that the model might
encounter.
Testing data:
•A carefully selected subset of unseen data used to test the model's
accuracy and ability to generalize to new situations.
How they are used:
•Training phase: The model learns patterns from the "training data".
•Evaluation phase: Once the model is trained, it is tested on the "testing
data" to assess its performance on unseen examples.
6. Parameter Tuning:
Once you have created and evaluated your model, see if its accuracy can be
improved in any way.
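This is done by tuning the hyperparameters of your model.
Hyperparameters are the settings of the model that the programmer generally chooses
before training (for example, the learning rate or the number of trees), as opposed to
the values that the model learns from the data.
For some particular values of these settings, the accuracy will be at its maximum.
Parameter tuning refers to finding these values.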
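One simple way to search for such values is to try a handful of candidates and keep the one that scores best. The sketch below tunes k, the number of neighbours in a k-nearest-neighbours classifier. The model, the candidate list, the synthetic data, and the separate validation set (held out so that the testing set stays untouched) are all assumptions for this example.

```python
import random
from collections import Counter

def predict_knn(train, x, k):
    """Majority label among the k training points closest to x."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def accuracy(train, samples, k):
    """Fraction of samples classified correctly with the given k."""
    return sum(predict_knn(train, x, k) == y for x, y in samples) / len(samples)

rng = random.Random(1)

def make_point():
    """Synthetic example: label is 1 when x > 5, with 20% of labels flipped as noise."""
    x = rng.uniform(0, 10)
    label = int(x > 5)
    if rng.random() < 0.2:
        label = 1 - label
    return (x, label)

train = [make_point() for _ in range(80)]
validation = [make_point() for _ in range(40)]

candidates = [1, 3, 5, 9, 15]  # hypothetical values of k to try
scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Trying every candidate like this is often called a grid search; it is easy to write but becomes slow when there are many settings to combine.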
Figure 6: Parameter Tuning
7. Making Predictions:
In the end, you can use your model on unseen data to make accurate predictions.
Types of Machine Learning:
There are three important types of Machine Learning algorithms:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning:
Supervised Learning is the most popular paradigm for performing machine learning
operations. It is widely used for data where there is a precise mapping between inputs
and outputs.
The dataset, in this case, is labeled, meaning that every example comes with its known
output, and the algorithm uses the input features to carry out prediction or classification
accordingly.
As the training period progresses, the algorithm is able
to identify the relationship between the inputs and the outputs, so that we can predict the
outcome for a new input.
Supervised learning algorithms are task-oriented.
As we provide the algorithm with more and more examples, it is able to learn
better, so that it can undertake the task and yield the output more accurately.
Some of the algorithms that come under supervised learning are as follows –
Linear Regression:
Linear Regression is a supervised machine learning algorithm where the
predicted output is continuous and has a constant slope.
It is used to predict values within a continuous range (e.g. sales, price) instead of trying to
classify them into categories (e.g. cat, dog).
It is used whenever we want to predict the value of one variable based on the value
of another variable.
The variable we want to predict is called the dependent variable (or sometimes,
the outcome variable).
In linear regression, we measure the linear relationship between two or more
variables. Based on this relationship, we make predictions that follow
this linear pattern.
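For the case of a single input variable, the line can be fitted with the ordinary least-squares formulas. The sketch below is a minimal illustration; the `fit_line` helper and the toy data (which lie exactly on y = 2x + 1) are assumptions for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # toy data generated from y = 2x + 1

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict the dependent variable for a new x
print(slope, intercept, prediction)
```

Here the fitted slope is 2 and the intercept is 1, so the prediction for x = 5 is 11.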
Random Forest:
Random forest is a flexible, easy-to-use machine learning algorithm that
produces, even without hyper-parameter tuning, a great result most of the
time.
It is also one of the most used algorithms, due to
its simplicity and versatility (it can be used for both classification and regression
tasks).
Random Forests are an ensemble learning method for performing
classification, regression, and other tasks by constructing many decision
trees and outputting the mode of the individual trees' predictions (for classification)
or their mean (for regression).
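The sketch below illustrates only the final aggregation step of a random forest, not how the trees are grown. The three hand-written "trees", their feature names, and their thresholds are hypothetical stand-ins for trained decision trees.

```python
from statistics import mean, mode

# Hypothetical classification "trees": each maps a feature dict to a class label.
classification_trees = [
    lambda f: "spam" if f["links"] > 3 else "ham",
    lambda f: "spam" if f["caps_ratio"] > 0.5 else "ham",
    lambda f: "spam" if f["links"] > 1 and f["caps_ratio"] > 0.2 else "ham",
]

# Hypothetical regression "trees": each maps a feature dict to a number.
regression_trees = [
    lambda f: 100.0 + 10.0 * f["rooms"],
    lambda f: 120.0 + 5.0 * f["rooms"],
    lambda f: 90.0 + 12.0 * f["rooms"],
]

def forest_classify(trees, features):
    """Classification: the forest outputs the most common (mode) tree prediction."""
    return mode([tree(features) for tree in trees])

def forest_regress(trees, features):
    """Regression: the forest outputs the mean of the tree predictions."""
    return mean([tree(features) for tree in trees])

email_label = forest_classify(classification_trees, {"links": 5, "caps_ratio": 0.3})
house_price = forest_regress(regression_trees, {"rooms": 2})
print(email_label, house_price)
```

Two of the three classification trees vote "spam", so the forest answers "spam"; the regression forest averages 120, 130 and 114.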
Gradient Boosting:
Gradient boosting is a machine learning technique
for regression and classification problems, which produces
a prediction model in the form of an ensemble of weak
prediction models, typically decision trees.
It is an ensemble learning method: the trees are added one at a time, each one
correcting the errors of the trees before it, which results in a powerful model.
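A minimal sketch of this idea for regression is shown below, assuming squared-error loss and one-split trees ("stumps") as the weak models. The function names, the number of rounds, the learning rate, and the toy data are assumptions for this example; real libraries add many refinements.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared-error loss, the residuals are the negative gradient (up to a constant).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]  # toy data with a jump between x = 4 and x = 5

model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_error = mse(lambda x: mean_y, xs, ys)
boosted_error = mse(model, xs, ys)
print(baseline_error, boosted_error)
```

Each stump on its own is a weak predictor, but the sum of many small corrections fits the training data much better than the constant baseline.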
Support Vector Machine:
SVMs are powerful classifiers that, in their basic form,
separate a dataset into two classes with the help
of a hyperplane.
The benefits of support vector machines are:
They are effective in high-dimensional spaces.
They are still effective in cases where the number of
dimensions is greater than the number of samples.
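Once a hyperplane has been found, classifying a point only requires checking which side of it the point falls on. The sketch below shows that step for a linear SVM; the weight vector `w` and offset `b` are hypothetical values standing in for a trained model, and the training procedure itself is not shown.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

# Hypothetical trained parameters: the hyperplane is the line x1 - x2 = 0.
w = [1.0, -1.0]
b = 0.0

label_a = classify(w, b, [3.0, 1.0])
label_b = classify(w, b, [1.0, 3.0])

# For a linear SVM in its standard form, the margin width is 2 / ||w||.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(label_a, label_b, margin_width)
```

Training an SVM amounts to choosing `w` and `b` so that this margin between the two classes is as wide as possible.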
Logistic Regression:
Logistic regression is a statistical model that in its basic form uses a logistic
function to model a binary dependent variable, although more complex extensions exist.
In regression analysis, logistic regression (or logit regression) is estimating
the parameters of a logistic model (a form of binary regression).
It makes use of an S-shaped (sigmoid) curve, generated by the logistic
function, to assign the data to their respective classes.
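The logistic function and the resulting class decision can be sketched as follows. The `weight`, `bias`, and 0.5 threshold are hypothetical values for illustration; in practice the weight and bias are estimated from the training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))  # adequate for moderate z in this sketch

def predict_class(x, weight, bias, threshold=0.5):
    """Return (class, probability) for a single input value x."""
    probability = sigmoid(weight * x + bias)
    return (1 if probability >= threshold else 0), probability

# Hypothetical fitted parameters.
weight, bias = 1.5, -3.0

class_low, p_low = predict_class(0.0, weight, bias)
class_high, p_high = predict_class(4.0, weight, bias)
print(class_low, p_low, class_high, p_high)
```

Plotting `sigmoid` over a range of inputs gives the S-shaped curve: close to 0 for very negative inputs, exactly 0.5 at zero, and close to 1 for large positive inputs.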
Artificial Neural Networks:
Artificial Neural Networks are modeled after the human brain and they learn from the
data over time. They form the basis of a large subfield of machine learning called Deep
Learning.
Unsupervised Learning:
In the case of an unsupervised learning algorithm, the data is not explicitly labeled into different
classes, that is, there are no labels. The model is able to learn from the data by finding implicit
patterns.
Unsupervised Learning algorithms group the data based on their densities, structures, similar
segments, and other similar features. Some unsupervised learning algorithms are based on
Hebbian Learning.
Cluster analysis is one of the most widely used techniques in unsupervised learning.
Let us look at some of the important algorithms that come under Unsupervised Learning.
Clustering:
Clustering, also known as cluster analysis, is a technique of grouping similar
objects into the same group, so that they differ from the objects in other groups.
Some of the essential clustering techniques are as follows –
a. K-means:
The aim of the k-means clustering algorithm is to partition the n observations in the data
into k clusters such that each observation belongs to the cluster with the nearest mean.
This mean serves as the prototype of the cluster.
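A minimal sketch of k-means on 2-D points is shown below. It alternates between assigning each point to its nearest mean and recomputing the means. The hand-picked starting centroids, the fixed number of iterations, and the toy points are assumptions for this example; real implementations usually start from random centroids and stop when nothing changes.

```python
def k_means(points, centroids, n_iterations=10):
    """Alternate assignment and update steps, starting from the given initial centroids."""
    for _ in range(n_iterations):
        # Assignment step: put each point in the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
initial_centroids = [(1.0, 1.0), (9.0, 9.0)]  # hand-picked starting guesses for k = 2

centroids, clusters = k_means(points, initial_centroids)
print(centroids)
print(clusters)
```

With k = 2 the three points near the origin form one cluster and the three points near (8.5, 8.8) form the other.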
b. DBSCAN:
This is a clustering method that groups the data based on density. It groups
together the points that are closely packed in the space and marks the points lying in
low-density regions as outliers.
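The sketch below is a simplified version of DBSCAN for 2-D points. The two settings, `eps` (how close two points must be to count as neighbours) and `min_pts` (how many neighbours make a region dense), follow the usual description of the algorithm; the toy points and the values chosen for them are assumptions for this example.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for an outlier."""
    def neighbours(i):
        # All points within distance eps of point i (including point i itself).
        return [j for j, q in enumerate(points)
                if (points[i][0] - q[0]) ** 2 + (points[i][1] - q[1]) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional outlier; may later join a cluster as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # former outlier becomes a border point
            if labels[j] is not None:
                continue                # already assigned, do not expand again
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # dense point: keep growing the cluster
                queue.extend(j_neighbours)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),          # dense group
          (10, 10), (10, 11), (11, 10), (11, 11),  # second dense group
          (5, 5)]                                  # isolated point
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The two dense groups become clusters 0 and 1, and the isolated point at (5, 5) is marked as an outlier with the label -1.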
c. Hierarchical clustering:
In this form of clustering, a hierarchy of clusters is built.
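One common way to build such a hierarchy is bottom-up (agglomerative): start with every point as its own cluster and repeatedly merge the two closest clusters. The sketch below does this for 1-D values, measuring the distance between clusters by their closest pair of members (single linkage). The bottom-up direction, the linkage choice, and the toy values are assumptions for this example.

```python
def agglomerative(values):
    """Single-linkage agglomerative clustering of 1-D values; returns the merge history."""
    clusters = [[v] for v in values]  # every value starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest to each other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        distance, i, j = best
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

values = [1.0, 2.0, 6.0, 7.0, 20.0]
merges = agglomerative(values)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

The merge history is the hierarchy: nearby values are joined first, and the distant value 20 is only attached at the very end.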
| Classification | Regression |
|---|---|
| In this problem statement, the target variables are discrete. | In this problem statement, the target variables are continuous. |
| Problems like spam email classification and disease prediction are solved using classification algorithms. | Problems like house price prediction and rainfall prediction are solved using regression algorithms. |
| In this algorithm, we try to find the best possible decision boundary, which can separate the two classes with the maximum possible separation. | In this algorithm, we try to find the best-fit line, which can represent the overall trend in the data. |
| Evaluation metrics like Precision, Recall, and F1-Score are used here to evaluate the performance of the classification algorithms. | Evaluation metrics like Mean Squared Error, R2-Score, and MAPE are used here to evaluate the performance of the regression algorithms. |
| Here we face problems like binary classification or multi-class classification. | Here we face problems like linear regression models as well as non-linear models. |
| Input data are independent variables and a categorical dependent variable. | Input data are independent variables and a continuous dependent variable. |
| The classification algorithm's task is mapping the input value (x) to the discrete output variable (y). | The regression algorithm's task is mapping the input value (x) to the continuous output variable (y). |
| Output is categorical labels. | Output is continuous numerical values. |
| Objective is to predict categorical/class labels. | Objective is to predict continuous numerical values. |
| Example use cases are spam detection, image recognition, and sentiment analysis. | Example use cases are stock price prediction, house price prediction, and demand forecasting. |
| Examples of classification algorithms are: Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), K-Nearest Neighbors (K-NN), Naive Bayes, Neural Networks, Multi-layer Perceptron (MLP), etc. | Examples of regression algorithms are: Linear Regression, Polynomial Regression, Ridge Regression, Lasso Regression, Support Vector Regression (SVR), Decision Trees for Regression, Random Forest Regression, K-Nearest Neighbors (K-NN) Regression, Neural Networks for Regression, etc. |
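The evaluation metrics named above can be computed directly from the true and predicted values. The sketch below covers Precision, Recall, and F1-Score for a binary classifier and Mean Squared Error for a regressor; the function names and the toy labels and values are assumptions for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for the class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of the predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy classification results: 3 true positives, 1 false positive, 1 false negative.
labels_true = [1, 0, 1, 1, 0, 1]
labels_pred = [1, 0, 0, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(labels_true, labels_pred)

# Toy regression results.
values_true = [3.0, 5.0, 2.0]
values_pred = [2.0, 5.0, 4.0]
error = mean_squared_error(values_true, values_pred)
print(precision, recall, f1, error)
```

For classification, higher precision, recall, and F1 are better; for regression, a lower mean squared error is better.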
Thank you.