Understanding Machine Learning Basics

Machine learning is a branch of AI that uses data and algorithms to mimic human learning and improve prediction accuracy across various fields such as finance, entertainment, and insurance. Key processes in machine learning include data importing, exploratory data analysis, model selection, training, testing, and deployment, with algorithms categorized into supervised and unsupervised learning. Techniques like linear regression and K-nearest neighbors are commonly used for prediction and classification tasks.

Uploaded by

adityadas8512
Copyright
© All Rights Reserved

Machine Learning

What is Machine Learning?

● Machine learning is a branch of artificial intelligence (AI) and
computer science which focuses on the use of data and algorithms to
imitate the way that humans learn, gradually improving its accuracy.
● It is the use of data to help the computer understand various
patterns, which helps it make better predictions (better accuracy).
Why Machine Learning?
Machine Learning helps us make better predictions. It is like knowing
approximately what’s going to happen in the future. Almost every field now
uses machine learning. Let’s discuss a few.
● Stock Market Price Predictions (Finance)
● Recommendation System (Entertainment / Sales)
● Customer Segmentation (All Fields)
● Fraud Detection (Insurance)
Stock Market Price Predictions (Finance)
Stock market prices vary hugely every second and tons of data is
generated on a daily basis. Understanding the data, deriving
useful information and patterns, and then
making predictions is one of the use cases of
Machine Learning.
Recommendation System (Entertainment / Sales)
A system that looks into the various
factors it is provided with and then makes
similar suggestions based on that.
The recommendations made by the
program can be judged on the basis of
how appealing they are to the customers.
If it works perfectly, then the GIF on the
right comes true.
Customer Segmentation (All Fields)
This system helps us segregate the
customers on the basis of various
factors and then cater to them
separately so that we can get the best
results.
We again use various factors to identify
and determine which customer falls into
which group.
Fraud Detection (Insurance)
The system predicts whether a particular
transaction is fraudulent or not. This
is one of the famous use cases for ML.
With the advancement in technology, this
topic has gained a lot of attention. Normal
transactions are becoming harder
to differentiate from fraudulent ones as
the days pass by.
Basic Requirements

● You must know how to code in some language (Python is best)
● Basic understanding of statistics (applied statistics)
● Basic understanding of mathematics (applied math)
Steps in Machine Learning
Importing the Data → EDA → Data Transformation → Model Selection →
Training the Model → Testing the Model → Deployment
Importing the Data
● The data can be in various formats and distributed amongst various files.
Combining them is one of the tasks that is crucial before processing it any
further.
● Many times the data is unstructured. Understanding its flow and patterns
and converting it into a consistent format is another task that comes
under this process.
● Data Cleaning : Sometimes there are missing values in the data that
need to be taken care of. There are various ways of doing so.
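A minimal sketch of the combining and cleaning steps above, using only the standard library. The two CSV strings, the column names and the choice of mean imputation are all assumptions made for illustration; mean imputation is just one of the various ways of handling missing values.

```python
import csv
import io

# Two hypothetical CSV sources standing in for separate files on disk.
FILE_A = "id,age,income\n1,34,52000\n2,,61000\n"
FILE_B = "id,age,income\n3,45,\n4,29,48000\n"

def read_rows(text):
    """Parse CSV text into a list of dicts, turning empty cells into None."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: (v if v != "" else None) for k, v in row.items()} for row in reader]

def fill_missing_with_mean(rows, column):
    """Replace None in a numeric column with the mean of the observed values."""
    observed = [float(r[column]) for r in rows if r[column] is not None]
    mean = sum(observed) / len(observed)
    for r in rows:
        r[column] = float(r[column]) if r[column] is not None else mean
    return rows

rows = read_rows(FILE_A) + read_rows(FILE_B)   # combine the sources
for col in ("age", "income"):
    fill_missing_with_mean(rows, col)
```

After this runs, every row has a numeric age and income, and the gaps have been filled with the column mean.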
Exploratory Data Analysis (EDA)
● Viewing the data in various ways to understand the
structure.
● Understanding how the values are distributed in the columns.
● Use visual and non-visual methods to understand the
data.
● Make a note of all the inferences, insights and
assumptions that you gain or make from the visuals.
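As a rough illustration of the non-visual and visual methods mentioned above, the sketch below summarises one numeric column and one categorical column and prints a crude text histogram. The data values are made up for the example.

```python
import statistics
from collections import Counter

ages = [23, 25, 25, 31, 38, 41, 41, 41, 57]            # made-up numeric column
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Pune"]   # made-up categorical column

# Non-visual: summary statistics and value counts.
summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
}
city_counts = Counter(cities)

# Visual (crude): one '#' per record in each 10-year bin.
bins = Counter((a // 10) * 10 for a in ages)
for start in sorted(bins):
    print(f"{start:>3}-{start + 9:<3} {'#' * bins[start]}")
```

In practice a plotting library would replace the text histogram; the point is only that distribution, spread and category counts are the first things to look at.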
Data Transformation
● If there is any discrepancy in the data types of the column
values, convert the data into a consistent
format.
● Sometimes you need to create new columns from
the existing ones, or remove a few columns that don’t add
any information that helps us build a better model.
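The following sketch shows the three kinds of transformation described above on a tiny made-up table: casting text to numbers, deriving a new column, and dropping a column that carries no information. The field names and the BMI feature are hypothetical examples.

```python
raw = [
    {"id": "1", "height_cm": "170", "weight_kg": "65.0", "note": "n/a"},
    {"id": "2", "height_cm": "182", "weight_kg": "81.5", "note": "n/a"},
]

def transform(row):
    """Cast string fields to numbers, derive BMI, and drop the uninformative column."""
    height_m = float(row["height_cm"]) / 100
    weight = float(row["weight_kg"])
    return {
        "id": int(row["id"]),
        "height_cm": float(row["height_cm"]),
        "weight_kg": weight,
        "bmi": round(weight / height_m ** 2, 1),   # new column built from existing ones
        # "note" is deliberately left out: it is the same for every row
    }

clean = [transform(r) for r in raw]
```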
Model selection
After understanding the data and transforming it in the
best way possible for further computation, choosing the
model is an important task. Even though a complex model
generally gives us better accuracy, we don’t start off with
one.
We start off with the weak (simple) models and gradually move up the
ladder. Each model is selected, trained, tested and tuned until
we feel that it is the best the model can do on this data
set.
Training the Model
Every model selected is trained multiple times in
multiple ways, to understand why one training set is better than
another or why one model performs better than another.
Testing the Model
We can only understand how good the model is by testing its learning on
data that it has not been trained on.
The test data has to be unseen by the model during training, so that we can
measure how well the model works on data that it hasn’t seen before.
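One common way to keep some data unseen is to hold out a fraction of the records before training. The sketch below is a minimal version of that idea; the function name, the 20% test size and the fixed seed are assumptions for illustration.

```python
import random

def train_test_split(records, test_size=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(50))
train, test = train_test_split(data, test_size=0.2)
```

The model would be fitted on `train` only and scored on `test`, which it never saw during fitting.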
Deployment
After being satisfied with the results we deploy the model in the
real world to see how it works on the data from the real world.
The task doesn’t end here. If the model works fine we try to
make it work better. If the model fails to work, then we take it up
again and go through the entire cycle all over again. It’s a
continuous process.
Types of machine learning algorithms
Supervised
When the data set that we are working with has labels, which tell us
the category or continuous value that each record corresponds to, we
perform a supervised task on it.
Regression:
When the value that you want to predict is of continuous type.
Classification:
When the value that you are trying to predict is a category.
[Figure: regression task vs. classification task]
Unsupervised
If the target variable or the value that we are trying to predict is not available,
then we perform an unsupervised task.
Clustering :
Where you group similar points together.
Association:
Where you try to find patterns in the data and make recommendations based on them.
[Figure: clustering vs. association]
What is Linear Regression?
Linear regression predicts the value of a
dependent variable (y)
based on a given independent variable
(x). This regression technique finds
a linear relationship between x
(input) and y (output). Hence the name,
Linear Regression.
Formula
Since we are trying to build a linear
relation between the 2 variables, we
use the formula for a straight line:
y = θ1 + θ2x
The same formula can be written as
y = mx + c.
● c = θ1 = Intercept
● m = θ2 = Slope
How do we update the θ1 and θ2 values?
To find the best fit line we need the
best θ1 and θ2 values. To find them
we minimise the cost function (J).
The cost function measures the difference
between the predicted and actual values.
Since individual differences can be
positive or negative and would cancel out,
we square each error before averaging.
This gives the Mean Squared Error (MSE).
How do we know if the line is the best fit?
θ1 and θ2 are randomly initialised and then optimised by running Gradient
Descent on the cost function. The line at which the cost function reaches its
minimum is taken as the best fit line.
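The sketch below shows one way the update could look, assuming the cost J is the mean squared error and that θ1 is the intercept and θ2 the slope. The learning rate, step count and toy data are assumptions, and the starting values are fixed at zero rather than random so that the run is repeatable.

```python
def mse(theta1, theta2, xs, ys):
    """Cost J: mean of squared differences between predicted and actual y."""
    return sum((theta1 + theta2 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Repeatedly move theta1 and theta2 against the gradient of the MSE."""
    theta1, theta2 = 0.0, 0.0          # fixed start instead of a random one
    n = len(xs)
    for _ in range(steps):
        errors = [theta1 + theta2 * x - y for x, y in zip(xs, ys)]
        grad1 = 2 / n * sum(errors)                               # dJ/d(theta1)
        grad2 = 2 / n * sum(e * x for e, x in zip(errors, xs))    # dJ/d(theta2)
        theta1 -= lr * grad1
        theta2 -= lr * grad2
    return theta1, theta2

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]          # generated from y = 1 + 2x
theta1, theta2 = gradient_descent(xs, ys)
```

On this toy data the values settle near an intercept of 1 and a slope of 2, where the cost is essentially zero. A learning rate that is too large would make the updates diverge instead.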
Other ways of evaluation
MAE : Mean Absolute Error, the average of
the absolute values of the errors.
RMSE : Root Mean Squared Error. MSE is
expressed in squared units, which makes it
difficult to interpret when we
are dealing with large values. Taking its
square root brings the error back to the
units of the value being predicted.
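Both metrics can be computed in a few lines. The numbers below are made up; the functions follow the standard definitions of MAE and RMSE.

```python
import math

def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the mean of the squared errors."""
    mean_sq = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_sq)

actual = [100, 200, 300, 400]
predicted = [110, 190, 320, 380]
```

Here the MSE is 250 (in squared units), while the MAE of 15 and the RMSE of about 15.8 are on the same scale as the values themselves.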
Other ways of evaluation (Cont.)
R2 compares two fits. First we measure the variation of the data
around its mean, Var(mean). Then we fit our line and measure the
variation that remains around the line, Var(line).

R2 = (Var(mean) - Var(line)) / Var(mean)

The more variation the line explains, the better the line.
For a least-squares line with an intercept, scored on the data it
was fitted to, the value lies between 0 and 1. On new data it can
drop below 0 when the line predicts worse than the mean.
Example for R2
Consider Var(mean) = 32 and Var(line) = 6.
R2 = (Var(mean) - Var(line)) / Var(mean)
R2 = (32 - 6) / 32
R2 = 26/32 = 0.8125

That means the line explains 81.25% of the variance. The remaining 18.75% is
treated as error that the line can’t explain.
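The same calculation as a small function, assuming Var(mean) and Var(line) are measured as sums of squared differences (the common factor 1/n cancels in the ratio). The function name is an illustrative choice.

```python
def r_squared(actual, predicted):
    """(Var(mean) - Var(line)) / Var(mean), using sums of squared differences."""
    mean_y = sum(actual) / len(actual)
    var_mean = sum((y - mean_y) ** 2 for y in actual)                # variation around the mean
    var_line = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # variation around the line
    return (var_mean - var_line) / var_mean

# The example's numbers plugged straight into the formula.
example = (32 - 6) / 32
```

A perfect line gives 1.0, and a line that does no better than predicting the mean everywhere gives 0.0.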
Classification
What is Logistic Regression?
Logistic regression is a
supervised classification algorithm. In
this approach, the dependent
variable is categorical: either
A or B (binary logistic regression) or one
of a finite set of options A, B, C or D
(multinomial logistic regression).
It is used in statistical software to
understand the relationship between
the dependent variable and one or
more independent variables by
estimating probabilities using a logistic
regression equation.
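For the binary case, the logistic regression equation passes a linear combination of the inputs through the sigmoid function to get a probability. The sketch below assumes a single feature and hand-picked coefficients; in a real model the intercept and coefficient would be learned from data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(x, intercept, coef):
    """Estimated probability of class B for a single feature x."""
    return sigmoid(intercept + coef * x)

def predict(x, intercept, coef, threshold=0.5):
    """Turn the probability into a class label using a cut-off."""
    return "B" if predict_proba(x, intercept, coef) >= threshold else "A"
```

With the hypothetical values `intercept=-4` and `coef=2`, an input of 2 sits exactly on the decision boundary (probability 0.5), smaller inputs are labelled A and larger ones B.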
Evaluating Logistic Regression

A confusion matrix is a good way to look at the correctly identified
classes and the misclassified classes.

Using the values from it we can find the accuracy. The formula is the
total number of correctly classified records divided by the
total number of records.
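A small sketch of both ideas, with made-up labels. The confusion matrix is stored here as counts of (actual, predicted) pairs, which is an implementation choice for the example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Correctly classified records divided by all records."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
```

The pairs (1, 1) and (0, 0) are the correctly identified records; (1, 0) and (0, 1) are the misclassified ones.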
Additional Information

Stratified Sampling : When there is a class imbalance it’s best to use stratified
sampling. This makes sure that the test data and the train data have the
same class proportions.

Example:

● Number of records per class : {0: 100, 1: 50}
● Considering the test size to be 20%:
○ Test records for the model : {0: 20, 1: 10}
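The example above can be reproduced with a short sketch that splits each class separately and at the same rate. The function name and the fixed seed are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with each class sampled at the same rate."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx += indices[:n_test]
        train_idx += indices[n_test:]
    return train_idx, test_idx

labels = [0] * 100 + [1] * 50
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
```

The test set ends up with 20 records of class 0 and 10 of class 1, the same 2:1 ratio as the full data.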
What is KNN?

KNN stands for K Nearest Neighbours.
K is simply a placeholder for the number of neighbours
you want to take into consideration.
For example, with k = 3 we take the 3
nearest neighbours.

How do we measure which elements are
close? We use a distance measure
to decide that.
Introduction to KNN

● K-Nearest Neighbour is one of the simplest Machine Learning
algorithms based on the Supervised Learning technique.
● The K-NN algorithm assumes that similar cases lie close together,
and puts a new case into the category whose available cases it
most resembles.
● The K-NN algorithm stores all the available data and classifies a new
data point based on similarity. This means that when new data
appears, it can easily be classified into a well-suited category
using the K-NN algorithm.
● The K-NN algorithm can be used for Regression as well as for
Classification, but it is mostly used for Classification problems.
● It is also called a lazy learner algorithm because it does not learn
from the training set immediately. Instead it stores the dataset and
defers the work until a new point has to be classified.
Steps of KNN:
Step 1 : Load the training data as well as the test data.

Step 2 : Choose the value of K, i.e. the number of nearest data points to consider. K can be any positive integer.

Step 3 : For each point in the test data, do the following:

● Calculate the distance between the test point and each row of the training data, using
Euclidean distance.
● Take the K nearest neighbours according to the calculated distance.
● Among these K neighbours, count the number of data points in each category.
● Assign the test point to the category with the largest count.

Step 4 : The model is ready.
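The steps above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the function name and the toy points are made up, and `math.dist` (Python 3.8+) is used for the Euclidean distance.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)          # Euclidean distance to each training row
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]   # the K nearest neighbours
    votes = Counter(label for _, label in nearest)              # count per category
    return votes.most_common(1)[0][0]                           # category with the most votes

train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["red", "red", "red", "blue", "blue", "blue"]
```

Calling `knn_predict(train_points, train_labels, (2, 2))` votes among the three closest points, all of which are red. Note that nothing is computed until a query arrives, which is why KNN is called a lazy learner.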


How to select the value of K:
● Selecting the right K value is a process called parameter tuning, which is important for achieving higher
accuracy.

● There is no definitive way to determine the best value of K.

● It depends on the type of problem you are solving.

● A K value of one or two makes the model sensitive to noise and outliers, which results in
overfitting of the model.

● A common rule of thumb is to start near the square root of n (sqrt(n)), where n is the total number of data points, and tune from there.

● Usually, an odd value of K is selected to avoid tied votes between two classes of data.
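The square-root and odd-value guidelines can be combined into a small helper. This is only a starting point for tuning, and the helper name is an assumption for illustration.

```python
import math

def starting_k(n_samples):
    """Rule-of-thumb starting point: sqrt(n), nudged up to the next odd number."""
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1
```

For 150 data points the square root is about 12.2, so the helper suggests 13. The value would still need to be checked against accuracy on held-out data.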
Distance Measures for KNN

Euclidean : The straight-line distance between two
points.

Manhattan : The sum of the horizontal and vertical
distances between two points (the absolute
differences along each axis).

Minkowski : A generalisation of both, with an order
parameter p. Setting p = 1 gives the Manhattan
distance and p = 2 gives the Euclidean distance.
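As a sketch of how the three measures relate, the Minkowski distance of order p is the p-th root of the sum of the absolute coordinate differences raised to the power p. Manhattan and Euclidean are the special cases p = 1 and p = 2. The function names below are illustrative.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    """Sum of absolute differences along each axis (p = 1)."""
    return minkowski(a, b, 1)

def euclidean(a, b):
    """Straight-line distance (p = 2)."""
    return minkowski(a, b, 2)
```

Between (0, 0) and (3, 4) the Manhattan distance is 3 + 4 = 7, while the Euclidean distance is the hypotenuse, 5.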
Additional Information

Model Summary:

● Precision : What proportion of positive
identifications was actually correct?
● Recall : What proportion of actual positives
was identified correctly?
● F1 Score : The harmonic mean of precision
and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
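These three numbers can be computed directly from the true positives, false positives and false negatives. The labels below are made up, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(actual, predicted, positive=1):
    """Precision, recall and F1 for one class treated as the positive class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(actual, predicted)
```

Here 3 of the 5 positive predictions are correct (precision 0.6) and 3 of the 4 actual positives are found (recall 0.75), giving an F1 score of about 0.67.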
Additional Information

Grid Search Cross Validation: This is a hyperparameter tuning method where
you list the values you want to try for each parameter. Every combination of
those values is trained and tested with cross-validation, and
you select the best-scoring combination.

Random Search Cross Validation: This is similar to Grid Search but doesn’t
try every combination. It samples a fixed number of combinations at random,
so it is cheaper to run. (Best for larger datasets and a larger
number of parameters)
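The difference between the two searches comes down to which parameter combinations get evaluated. In the sketch below the parameter grid and `toy_score` are hypothetical; `toy_score` stands in for a cross-validated score, which in practice would train and test a model on each fold.

```python
import itertools
import random

param_grid = {"k": [1, 3, 5, 7], "metric": ["euclidean", "manhattan"]}

def grid_candidates(grid):
    """Every combination of the listed values."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in itertools.product(*grid.values())]

def random_candidates(grid, n_iter, seed=0):
    """A random sample of n_iter combinations, drawn without replacement."""
    return random.Random(seed).sample(grid_candidates(grid), n_iter)

def search(candidates, score_fn):
    """Return the candidate with the highest score."""
    return max(candidates, key=score_fn)

def toy_score(params):
    """Stand-in for a cross-validated score (higher is better)."""
    return -abs(params["k"] - 5) - (0.5 if params["metric"] == "manhattan" else 0.0)

best_grid = search(grid_candidates(param_grid), toy_score)
best_random = search(random_candidates(param_grid, n_iter=3), toy_score)
```

Grid search scores all 8 combinations and is guaranteed to find the best one in the grid. Random search scores only 3 here, so it is cheaper but may miss the best combination.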
