
ML Notes Mod 2

The document covers two main topics in machine learning: Linear Regression and Classification. It explains the concepts, types, assumptions, and evaluation metrics of Linear Regression, including Simple and Multiple Linear Regression, along with regularization techniques. Additionally, it introduces Classification, which involves categorizing data based on labeled examples.

Uploaded by

Chirag Chiru

MODULE 2

CONTENTS

CHAPTER 1
1. Regression: Linear Regression,
2. Multiple Linear Regression and Polynomial Regression,
3. Evaluating Regression Model’s Performance (RMSE, Mean Absolute Error, Correlation,
RSquare),
4. Regularization Methods

CHAPTER 2
1. Classification: Need and Applications of Classification,
2. Logistic Regression,
3. Decision tree

DEPARTMENT OF CSE, MVJCE Page 1


CHAPTER 1

Linear Regression in Machine learning


Linear regression is a statistical method used to model the relationship between a dependent variable and one or
more independent variables. It provides valuable insights for prediction and data analysis. This chapter will
explore its types, assumptions, implementation, advantages, and evaluation metrics.
Understanding Linear Regression
Linear regression is also a supervised machine-learning algorithm: it learns from labelled datasets and fits the
linear function that best maps the input features to the target, which can then be used for prediction on new data.
It computes the linear relationship between the dependent variable and one or more independent features by fitting
a linear equation to the observed data, and it predicts a continuous output variable from the independent input
variables.
For example, to predict a house price we might consider factors such as house age, distance from the main road,
location, area and number of rooms. Linear regression uses all of these features to predict the price, on the
assumption that the price is linearly related to each of them.
Why is Linear Regression Important?

The interpretability of linear regression is one of its greatest strengths. The model’s equation offers clear
coefficients that illustrate the influence of each independent variable on the dependent variable, enhancing our
understanding of the underlying relationships. Its simplicity is a significant advantage; linear regression is
transparent, easy to implement, and serves as a foundational concept for more advanced algorithms.

Having seen why linear regression is important, we now look at how it works, starting with the best-fit line.
What is the best Fit Line?
Our primary objective while using linear regression is to locate the best-fit line, which implies that the error
between the predicted and actual values should be kept to a minimum. There will be the least error in the best-fit
line.
The best Fit Line equation provides a straight line that represents the relationship between the dependent and
independent variables. The slope of the line indicates how much the dependent variable changes for a unit change
in the independent variable(s).

Linear Regression
Here Y is called the dependent or target variable and X is called the independent variable, also known as the
predictor of Y. Many types of functions or models can be used for regression; a linear function is the simplest.
Here, X may be a single feature or multiple features representing the problem.
Linear regression predicts the value of a dependent variable (y) from a given independent variable (x) by assuming
a linear relationship between them; hence the name Linear Regression. In the running example, X (input) is the work
experience and Y (output) is the salary of a person. The regression line is the best-fit line for our model.
In linear regression, some assumptions are made to ensure the reliability of the model's results.

Hypothesis function in Linear Regression

Assumptions are:
 Linearity: It assumes that there is a linear relationship between the independent and dependent variables.
This means that changes in the independent variable lead to proportional changes in the dependent variable.
 Independence: The observations should be independent of each other; that is, the error in one
observation should not influence the error in another.
As discussed, our independent feature is the experience X and the corresponding salary Y is the dependent
variable. Assuming there is a linear relationship between X and Y, the salary can be predicted using:
Ŷ = θ1 + θ2·X

The model gets the best regression fit line by finding the best θ1 and θ2 values.
 θ1: intercept
 θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best-fit line. So when we are finally using our model for
prediction, it will predict the value of y for the input value of x.
How to update θ1 and θ2 values to get the best-fit line?

To achieve the best-fit regression line, the model aims to predict the target value Ŷ such that the difference
between the predicted value Ŷ and the true value Y is minimal. So it is very important to update the θ1 and θ2
values until they reach the values that minimize the error between the predicted value (Ŷ) and the true value (Y).

Types of Linear Regression


When there is only one independent feature, it is known as Simple Linear Regression or Univariate Linear
Regression; when there is more than one feature, it is known as Multiple Linear Regression (sometimes loosely
called Multivariate Regression, although strictly that term refers to models with more than one dependent variable).
1. Simple Linear Regression
Simple linear regression is the simplest form of linear regression and it involves only one independent variable
and one dependent variable. The equation for simple linear regression is:
Y = β0 + β1X
where:
 Y is the dependent variable
 X is the independent variable
 β0 is the intercept
 β1 is the slope
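
As a minimal sketch of fitting Y = β0 + β1X, the snippet below uses the closed-form least-squares formulas
(slope = covariance of X and Y divided by the variance of X, intercept = mean of Y minus slope times mean of X).
This is a swapped-in shortcut for the single-feature case, not the iterative procedure these notes describe later.
The experience and salary numbers are made up for illustration, and the helper name `fit_simple_linear` is
hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # Intercept: the fitted line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data (assumed): years of experience vs. salary in thousands.
experience = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

b0, b1 = fit_simple_linear(experience, salary)
predicted = b0 + b1 * 6  # predicted salary for 6 years of experience
print(f"intercept={b0:.2f}, slope={b1:.2f}, prediction={predicted:.2f}")
```

Because the toy data lie exactly on a line, the fitted slope is the salary increase per extra year of experience.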

Assumptions of Simple Linear Regression


Linear regression is a powerful tool for understanding and predicting the behavior of a variable; however, it needs
to meet a few conditions in order to give accurate and dependable results.
 Linearity: The independent and dependent variables have a linear relationship with one another. This
implies that changes in the dependent variable follow those in the independent variable(s) in a linear
fashion. This means that there should be a straight line that can be drawn through the data points. If the
relationship is not linear, then linear regression will not be an accurate model.
 Independence: The observations in the dataset are independent of each other. This means that the value
of the dependent variable for one observation does not depend on the value of the dependent variable for
another observation. If the observations are not independent, then linear regression will not be an accurate
model.
 Homoscedasticity: Across all levels of the independent variable(s), the variance of the errors is constant.
This indicates that the value of the independent variable(s) has no impact on the variance of the errors.
If the variance of the residuals is not constant, the standard errors of the coefficients, and with them
confidence intervals and significance tests, become unreliable.
 Normality: The residuals should be normally distributed. This means that the residuals should follow a
bell-shaped curve. If the residuals are not normally distributed, confidence intervals and hypothesis tests
on the coefficients may be unreliable, especially for small samples.

Homoscedasticity in Linear Regression

Use Case of Simple Linear Regression

 In a case study evaluating student performance, analysts use simple linear regression to examine the
relationship between study hours and exam scores. By collecting data on the number of hours students studied
and their corresponding exam results, the analysts develop a model that quantifies the relationship: for
example, each additional hour spent studying might be associated with an average increase of 5 points in exam
score. This case highlights the utility of simple linear regression in understanding and improving academic
performance.
 Another case study focuses on marketing and sales, where businesses use simple linear regression to forecast
sales from historical data, in particular how advertising expenditure influences revenue. By collecting data on
past advertising spending and the corresponding sales figures, analysts develop a regression model that
describes the relationship between these variables. For instance, the analysis might reveal that every
additional dollar spent on advertising is associated with a $10 increase in sales. This predictive capability
enables companies to optimize their advertising strategies and allocate resources effectively.

2. Multiple Linear Regression


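Multiple linear regression involves more than one independent variable and one dependent variable. The equation
for multiple linear regression is:
Y = β0 + β1X1 + β2X2 + … + βpXp
where Y is the dependent variable, X1, …, Xp are the independent variables, β0 is the intercept and β1, …, βp are
the coefficients (slopes) of the independent variables.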

The goal of the algorithm is to find the best-fit equation (a line for one feature, a plane or hyperplane for
several) that can predict the values based on the independent variables.
In regression, a set of records with known X and Y values is used to learn a function, so that Y can be predicted
for a new, unseen X. Because Y is continuous, the learned function must return a continuous value given the
independent features X.
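
To make the idea of a learned function concrete, here is a small sketch of how a multiple linear regression model
turns several features into one continuous prediction: a weighted sum plus an intercept. The coefficients below are
hand-picked for illustration (they are not fitted to any real data), and the function name `predict` and the
house features are assumptions.

```python
def predict(intercept, coefs, features):
    """Evaluate Y = b0 + b1*x1 + ... + bp*xp for a single observation."""
    return intercept + sum(b * x for b, x in zip(coefs, features))


# Hypothetical coefficients, chosen by hand rather than learned.
intercept = 50.0
coefs = [0.1, 10.0, -0.5]   # per sq ft of area, per room, per year of age
house = [1200, 3, 10]       # area (sq ft), number of rooms, age (years)

price = predict(intercept, coefs, house)
print(f"predicted price: {price:.1f}")
```

Each coefficient is read as the change in the prediction for a one-unit change in its feature while the other
features are held fixed.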

Assumptions of Multiple Linear Regression

For Multiple Linear Regression, all four of the assumptions from Simple Linear Regression apply. In addition,
there are two further assumptions (items 1 and 2) and two practical considerations (items 3 and 4):
1. No multicollinearity: There is little or no correlation between the independent variables. Multicollinearity
occurs when two or more independent variables are highly correlated with each other, which can make it
difficult to determine the individual effect of each variable on the dependent variable. If there is
multicollinearity, the coefficient estimates of multiple linear regression become unstable and hard to
interpret.
2. Additivity: The model assumes that the effect of changes in a predictor variable on the response variable is
consistent regardless of the values of the other variables. This assumption implies that there is no interaction
between variables in their effects on the dependent variable.
3. Feature Selection: In multiple linear regression, it is essential to carefully select the independent variables
that will be included in the model. Including irrelevant or redundant variables may lead to overfitting and
complicate the interpretation of the model.
4. Overfitting: Overfitting occurs when the model fits the training data too closely, capturing noise or random
fluctuations that do not represent the true underlying relationship between variables. This can lead to poor
generalization performance on new, unseen data.
Multiple linear regression sometimes faces issues like multicollinearity.

Multicollinearity
Multicollinearity is a statistical phenomenon where two or more independent variables in a multiple regression
model are highly correlated, making it difficult to assess the individual effects of each variable on the dependent
variable.
Two common techniques for detecting multicollinearity are:
 Correlation Matrix: Examining the correlation matrix among the independent variables is a common way
to detect multicollinearity. High correlations (close to 1 or -1) indicate potential multicollinearity.
 VIF (Variance Inflation Factor): VIF is a measure that quantifies how much the variance of an estimated
regression coefficient increases if your predictors are correlated. A high VIF (typically above 10) suggests
multicollinearity.
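
Both checks can be sketched in a few lines. The example below computes the Pearson correlation between pairs of
predictors and, for the special case of exactly two predictors, the VIF as 1 / (1 − r²), where r is their
correlation. With more than two predictors, r² would be replaced by the R² obtained from regressing one predictor
on all the others. The data and the helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


# Made-up predictors for a house-price model.
area = [50, 60, 80, 100, 120]
rooms = [2, 2, 3, 4, 5]       # grows almost in step with area
age = [12, 3, 20, 7, 15]      # only weakly related to area

vif_area_rooms = vif_two_predictors(area, rooms)
vif_area_age = vif_two_predictors(area, age)
print(f"corr(area, rooms)={pearson(area, rooms):.3f}  VIF={vif_area_rooms:.1f}")
print(f"corr(area, age)={pearson(area, age):.3f}  VIF={vif_area_age:.2f}")
```

Here area and rooms would be flagged by the "above 10" rule of thumb, while area and age would not.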

Use Case of Multiple Linear Regression

Multiple linear regression allows us to analyze relationship between multiple independent variables and a single
dependent variable. Here are some use cases:
 Real Estate Pricing: In real estate MLR is used to predict property prices based on multiple factors such as
location, size, number of bedrooms, etc. This helps buyers and sellers understand market trends and set
competitive prices.
 Financial Forecasting: Financial analysts use MLR to predict stock prices or economic indicators based on
multiple influencing factors such as interest rates, inflation rates and market trends. This enables better
investment strategies and risk management.
 Agricultural Yield Prediction: Farmers can use MLR to estimate crop yields based on several variables
like rainfall, temperature, soil quality and fertilizer usage. This information helps in planning agricultural
practices for optimal productivity.
 E-commerce Sales Analysis: An e-commerce company can utilize MLR to assess how various factors such
as product price, marketing promotions and seasonal trends impact sales.
Now that we have covered linear regression, its assumptions and its types, we will learn how to build a linear
regression model.
Cost function for Linear Regression
As discussed earlier, the best-fit line is rarely obvious in real-life cases, so we need a way to measure how far
a candidate line is from the data. The difference between the predicted value Ŷ and the true value Y is the error,
and the cost function (or loss function) summarizes these errors over the whole dataset. For linear regression the
usual choice is the Mean Squared Error (MSE):
J(θ1, θ2) = (1/n) Σ (Ŷi − Yi)², summed over the n training examples i = 1, …, n
Having defined the cost function, we need to optimize the model to reduce this error, which is done through
gradient descent.
Gradient Descent for Linear Regression
A linear regression model can be trained using the optimization algorithm gradient descent, which iteratively
modifies the model's parameters to reduce the mean squared error (MSE) of the model on a training dataset.
To update the θ1 and θ2 values in order to reduce the cost function (minimizing the MSE, and therefore also the
RMSE) and achieve the best-fit line, the model uses Gradient Descent. The idea is to start with random θ1 and θ2
values and then iteratively update them until the cost reaches a minimum.
A gradient is simply a derivative: it describes how much the output of a function changes for a small change in
its inputs.
Finding the coefficients of the linear equation that best fits the training data is the objective of linear
regression. The coefficients are changed by moving in the direction of the negative gradient of the Mean Squared
Error with respect to them. With MSE = (1/n) Σ (Ŷi − Yi)² and α as the learning rate, the updates for the
intercept θ1 and the coefficient of X, θ2, are:
θ1 := θ1 − α · (2/n) Σ (Ŷi − Yi)
θ2 := θ2 − α · (2/n) Σ (Ŷi − Yi) · Xi
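
The update loop can be sketched as follows. This is a minimal illustration under stated assumptions: the cost is
the MSE, (1/n) Σ (Ŷi − Yi)², so its gradients carry a factor 2/n; the parameters start at zero rather than at
random values so that the run is reproducible; and the learning rate, number of iterations and toy data are
arbitrary choices, not values from these notes.

```python
def gradient_descent(xs, ys, alpha=0.05, epochs=2000):
    """Fit y ~ theta1 + theta2 * x by batch gradient descent on the MSE."""
    theta1, theta2 = 0.0, 0.0  # assumed starting point (a random start also works)
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example with the current parameters.
        errors = [(theta1 + theta2 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to theta1 and theta2.
        grad1 = (2.0 / n) * sum(errors)
        grad2 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient, scaled by the learning rate alpha.
        theta1 -= alpha * grad1
        theta2 -= alpha * grad2
    return theta1, theta2


# Toy data generated from y = 1 + 2x, so the true parameters are known.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

theta1, theta2 = gradient_descent(xs, ys)
print(f"theta1 (intercept) = {theta1:.4f}, theta2 (slope) = {theta2:.4f}")
```

If α is chosen too large the updates overshoot and the cost grows instead of shrinking; if it is too small,
convergence needs many more iterations.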

Gradient Descent
After optimizing our model, we evaluate the model's accuracy to see how well it will perform in real-world
scenarios.
Evaluation Metrics for Linear Regression
A variety of evaluation measures can be used to determine the strength of any linear regression model. These
assessment metrics often give an indication of how well the model is producing the observed outputs.
The most common measurements are:
Mean Square Error (MSE)
Mean Squared Error (MSE) is an evaluation metric that calculates the average of the squared differences between
the actual and predicted values for all the data points. The difference is squared to ensure that negative and positive
differences don’t cancel each other out.
MSE = (1/n) Σ (Yi − Ŷi)²
where n is the number of data points, Yi is the actual value and Ŷi is the predicted value for the i-th data point.
MSE is a way to quantify the accuracy of a model’s predictions. It is sensitive to outliers, because large errors
are squared and so contribute disproportionately to the overall score.

Mean Absolute Error (MAE)

Mean Absolute Error is an evaluation metric used to calculate the accuracy of a regression model. MAE measures
the average absolute difference between the predicted values and actual values.
Mathematically, MAE is expressed as:
MAE = (1/n) Σ |Yi − Ŷi|
A lower MAE value indicates better model performance. MAE is less sensitive to outliers than MSE, because the
errors are not squared.
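
The different behaviour of MSE and MAE around outliers is easy to see on a tiny example. The sketch below computes
both metrics for two sets of predictions: one where every prediction is off by 1, and one where a single prediction
is off by 10. All numbers are invented for illustration.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [10, 12, 14, 16]
pred_clean = [11, 11, 15, 15]     # every prediction is off by 1
pred_outlier = [11, 11, 15, 26]   # the last prediction is off by 10

print("clean:  ", mse(actual, pred_clean), mae(actual, pred_clean))
print("outlier:", mse(actual, pred_outlier), mae(actual, pred_outlier))
```

The single large error multiplies the MSE by more than 25 but the MAE by only a little over 3, which is why MSE is
said to be the more outlier-sensitive of the two.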

Root Mean Squared Error (RMSE)

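Root Mean Squared Error (RMSE) is the square root of the Mean Squared Error, that is, the square root of the
average squared residual:
RMSE = √( (1/n) Σ (Yi − Ŷi)² )
It describes how well the observed data points match the predicted values, or the model’s absolute fit to the
data, and it is expressed in the same units as the dependent variable.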
Coefficient of Determination (R-squared)

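The R-squared metric measures the proportion of variance in the dependent variable that is explained by the
independent variables in the model:
R² = 1 − RSS / TSS
where RSS = Σ (Yi − Ŷi)² is the residual sum of squares and TSS = Σ (Yi − Ȳ)² is the total sum of squares around
the mean Ȳ of the actual values.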

Adjusted R-Squared Error

Adjusted R² also measures the proportion of variance in the dependent variable that is explained by the
independent variables in a regression model, but it accounts for the number of predictors in the model and
penalizes the model for including irrelevant predictors that don’t contribute significantly to explaining the
variance in the dependent variable:
Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
where n is the number of observations and k is the number of independent variables.
Because of this penalty, adjusted R² helps to guard against overfitting.
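
A short sketch of RMSE, R² and adjusted R² on invented data follows. It assumes the standard definitions
R² = 1 − RSS/TSS and adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), with n observations and k predictors; the
function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """1 - RSS/TSS: share of the variance in `actual` explained by the model."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - rss / tss


def adjusted_r_squared(r2, n, k):
    """Penalize R^2 for the number of predictors k, given n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


actual = [3, 5, 7, 9, 11, 13]
predicted = [2.5, 5.5, 7.0, 9.5, 10.5, 13.0]

r2 = r_squared(actual, predicted)
adj_one = adjusted_r_squared(r2, n=len(actual), k=1)
adj_three = adjusted_r_squared(r2, n=len(actual), k=3)
print(f"RMSE={rmse(actual, predicted):.4f}  R2={r2:.4f}")
print(f"adjusted R2 with 1 predictor={adj_one:.4f}, with 3 predictors={adj_three:.4f}")
```

For the same R², the adjusted value drops as more predictors are assumed, which is the penalty described above.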
While evaluation metrics help us measure the performance of a model, regularization helps in improving that
performance by addressing overfitting and enhancing generalization.

Regularization Techniques for Linear Models


Lasso Regression (L1 Regularization)
Lasso Regression is a technique for regularizing a linear regression model: it adds a penalty term to the
linear regression objective function to prevent overfitting.
The objective function after applying lasso regression is:
J(θ) = (1/2m) Σi (Ŷi − Yi)² + λ Σj |θj|
where m is the number of training examples and λ is the regularization strength (the constant in front of the
loss is a convention and varies between texts).
 the first term is the least squares loss, representing the squared difference between predicted and actual
values.
 the second term is the L1 regularization term; it penalizes the sum of the absolute values of the regression
coefficients θj.

Ridge Regression (L2 Regularization)

Ridge regression is a linear regression technique that adds a regularization term to the standard linear objective.
Again, the goal is to prevent overfitting by penalizing large coefficients in the linear regression equation. It is
useful when the dataset has multicollinearity, where predictor variables are highly correlated.
The objective function after applying ridge regression is:
J(θ) = (1/2m) Σi (Ŷi − Yi)² + λ Σj θj²
where m is the number of training examples and λ is the regularization strength (the constant in front of the
loss is a convention and varies between texts).
 the first term is the least squares loss, representing the squared difference between predicted and actual
values.
 the second term is the L2 regularization term; it penalizes the sum of the squares of the regression
coefficients θj.
Elastic Net Regression

Elastic Net Regression is a hybrid regularization technique that combines the power of both L1 and L2
regularization in the linear regression objective. One common way to write the objective is:
J(θ) = (1/2m) Σi (Ŷi − Yi)² + αλ Σj |θj| + ½(1 − α)λ Σj θj²
 the first term is the least squares loss.
 the second term is the L1 (lasso) regularization term and the third is the L2 (ridge) regularization term.
 λ is the overall regularization strength.
 α controls the mix between L1 and L2 regularization.
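
The three penalized objectives can be compared with one small function. The sketch below assumes the form
J(θ) = (1/2m) Σ (Ŷi − Yi)² + αλ Σ |θj| + ½(1 − α)λ Σ θj², leaves the intercept unpenalized (a common convention,
assumed here), and uses made-up data and coefficients. Setting α = 1 keeps only the L1 (lasso-style) penalty and
α = 0 keeps only the L2 (ridge-style) penalty.

```python
def elastic_net_cost(intercept, thetas, rows, targets, lam, alpha):
    """Least-squares loss plus a weighted mix of L1 and L2 penalties."""
    m = len(targets)
    preds = [intercept + sum(t * x for t, x in zip(thetas, row)) for row in rows]
    loss = sum((p - y) ** 2 for p, y in zip(preds, targets)) / (2 * m)
    l1 = sum(abs(t) for t in thetas)   # sum of absolute coefficient values
    l2 = sum(t * t for t in thetas)    # sum of squared coefficient values
    return loss + alpha * lam * l1 + 0.5 * (1 - alpha) * lam * l2


# Made-up data that the coefficients [1, 2] fit exactly, so the loss term is 0
# and only the penalty terms remain.
rows = [[1, 2], [2, 1], [3, 3]]
targets = [5, 4, 9]
thetas = [1.0, 2.0]

cost_l1_only = elastic_net_cost(0.0, thetas, rows, targets, lam=0.1, alpha=1.0)
cost_l2_only = elastic_net_cost(0.0, thetas, rows, targets, lam=0.1, alpha=0.0)
cost_mixed = elastic_net_cost(0.0, thetas, rows, targets, lam=0.1, alpha=0.5)
print(cost_l1_only, cost_l2_only, cost_mixed)
```

Raising λ makes large coefficients more expensive in all three cases; α only decides how that cost is split
between the absolute-value and squared terms.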

CHAPTER 2

Classification teaches a machine to sort things into categories. It learns by looking at examples with labels
(like emails marked “spam” or “not spam”). After learning, it can decide which category new items belong to,
like identifying if a new email is spam or not. For example, a classification model might be trained on a dataset
of images labeled as either dogs or cats, and it can then be used to predict the class of new, unseen images as
dogs or cats based on their features such as color, texture and shape.
In the figure explaining classification in ML, the horizontal axis represents the combined values of the color and
texture features, and the vertical axis represents the combined values of the shape and size features.
 Each colored dot in the plot represents an individual image, with the color indicating whether the model
predicts the image to be a dog or a cat.
 The shaded areas in the plot show the decision boundary, which is the line or region that the model uses to
decide which category (dog or cat) an image belongs to. The model classifies images on one side of the
boundary as dogs and on the other side as cats, based on their features.

Types of Classification
When we talk about classification in machine learning, we’re talking about the process of sorting data into
categories based on specific features or characteristics.
There are different types of classification problems depending on how many categories (or classes) we are
working with and how they are organized.
The two main classification types in machine learning are binary and multiclass classification; multi-label
classification is a third, more specialized type:

1. Binary Classification

This is the simplest kind of classification. In binary classification, the goal is to sort the data into two distinct
categories. Think of it like a simple choice between two options. Imagine a system that sorts emails into
either spam or not spam. It works by looking at different features of the email like certain keywords or sender
details, and decides whether it’s spam or not. It only chooses between these two options.

2. Multiclass Classification

Here, instead of just two categories, the data needs to be sorted into more than two categories. The model picks
the one that best matches the input. Think of an image recognition system that sorts pictures of animals into
categories like cat, dog, and bird.
Basically, the machine looks at the features in the image (like shape, color, or texture) and chooses which animal
the picture is most likely to show, based on the training it received.
Binary classification vs Multi class classification

3. Multi-Label Classification

In multi-label classification, a single piece of data can belong to multiple categories at once. Unlike multiclass
classification, where each data point belongs to only one class, multi-label classification allows data points to
belong to multiple classes. For example, a movie recommendation system could tag a movie as both action and
comedy. The system checks various features (like movie plot, actors, or genre tags) and assigns multiple labels
to a single piece of data, rather than just one.
Multi-label classification is relevant in specific use cases, but it is not as crucial for a starting overview of
classification.
How does Classification in Machine Learning Work?
Classification involves training a model using a labeled dataset, where each input is paired with its correct
output label. The model learns patterns and relationships in the data, so it can later predict the category or
class of new, unseen inputs. Here’s how it works:
1. Data Collection: You start with a dataset where each item is labeled with the correct class (for example,
“cat” or “dog”).
2. Feature Extraction: The system identifies features (like color, shape, or texture) that help distinguish one
class from another. These features are what the model uses to make predictions.
3. Model Training: The classification algorithm uses the labeled data to learn how to map the features to the
correct class. It looks for patterns and relationships in the data.
4. Model Evaluation: Once the model is trained, it’s tested on new, unseen data to check how accurately it
can classify the items. This is a key step: it shows how well the model performs and how well it handles data
it has not seen before. Depending on the problem and needs, we can use different metrics to measure its
performance.
5. Prediction: After being trained and evaluated, the model can be used to predict the class of new data based
on the features it has learned.
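
The steps above can be sketched end to end with a deliberately simple classifier. The example uses a
nearest-centroid rule (each class is summarized by the mean of its training points, and a new point gets the label
of the closest mean); this is a stand-in chosen for brevity, not an algorithm covered in these notes. The features
(weight in kg, ear length in cm), the data and all function names are hypothetical.

```python
import math


def train(features, labels):
    """Training: compute one centroid (mean feature vector) per class."""
    sums, counts = {}, {}
    for x, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, x):
    """Prediction: return the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


def accuracy(centroids, features, labels):
    """Evaluation: fraction of examples whose predicted class is correct."""
    correct = sum(predict(centroids, x) == y for x, y in zip(features, labels))
    return correct / len(labels)


# Data collection and feature extraction: labelled [weight_kg, ear_length_cm] pairs.
train_X = [[4.0, 6.0], [5.0, 7.0], [3.5, 6.5], [20.0, 10.0], [25.0, 12.0], [30.0, 11.0]]
train_y = ["cat", "cat", "cat", "dog", "dog", "dog"]
test_X = [[4.5, 6.0], [22.0, 11.0]]   # held-out examples not used for training
test_y = ["cat", "dog"]

centroids = train(train_X, train_y)
test_accuracy = accuracy(centroids, test_X, test_y)
new_label = predict(centroids, [6.0, 7.0])
print(f"test accuracy = {test_accuracy:.2f}, new animal classified as {new_label}")
```

Evaluating on examples that were held out of training is what makes the accuracy figure a fair estimate of how the
model handles unseen data.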

Classification Machine Learning


If the quality metric is not satisfactory, the ML algorithm or hyperparameters can be adjusted, and the model
is retrained. This iterative process continues until a satisfactory performance is achieved. In short, classification
in machine learning is all about using existing labeled data to teach the model how to predict the class of new,
unlabeled data based on the patterns it has learned.
Examples of Machine Learning Classification in Real Life
Classification algorithms are widely used in many real-world applications across various domains, including:
 Email spam filtering
 Credit risk assessment: Algorithms predict whether a loan applicant is likely to default by analyzing factors
such as credit score, income, and loan history. This helps banks make informed lending decisions and
minimize financial risk.
 Medical diagnosis: Machine learning models classify whether a patient has a certain condition (e.g., cancer
or diabetes) based on medical data such as test results, symptoms, and patient history. This aids doctors in
making quicker, more accurate diagnoses, improving patient care.
 Image classification: Applied in fields such as facial recognition, autonomous driving, and medical
imaging.
 Sentiment analysis: Determining whether the sentiment of a piece of text is positive, negative, or neutral.
Businesses use this to understand customer opinions, helping to improve products and services.
 Fraud detection: Algorithms detect fraudulent activities by analyzing transaction patterns and identifying
anomalies, which is crucial in protecting against credit card fraud and other financial crimes.
 Recommendation systems: Used to recommend products or content based on past user behavior, such as
suggesting movies on Netflix or products on Amazon. This personalization boosts user satisfaction and
sales for businesses.

Classification Modeling in Machine Learning


Now that we understand the fundamentals of classification, it’s time to explore how we can use these concepts
to build classification models. Classification modeling refers to the process of using machine learning
algorithms to categorize data into predefined classes or labels. These models are designed to handle both binary
and multi-class classification tasks, depending on the nature of the problem.
Let’s see key characteristics of Classification Models:
1. Class Separation: Classification relies on distinguishing between distinct classes. The goal is to learn a
model that can separate or categorize data points into predefined classes based on their features.
2. Decision Boundaries: The model draws decision boundaries in the feature space to differentiate between
classes. These boundaries can be linear or non-linear.
3. Sensitivity to Data Quality: Classification models are sensitive to the quality and quantity of the training
data. Well-labeled, representative data ensures better performance, while noisy or biased data can lead to
poor predictions.
4. Handling Imbalanced Data: Classification problems may face challenges when one class is
underrepresented. Special techniques like resampling or weighting are used to handle class imbalances.
5. Interpretability: Some classification algorithms, such as Decision Trees, offer higher interpretability,
meaning it’s easier to understand why a model made a particular prediction.
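
As one illustration of the weighting idea in item 4, the sketch below computes inverse-frequency class weights, so
that mistakes on the rare class count more during training. The formula used, n_samples / (n_classes × class
count), is one common heuristic and is an assumption here; the label counts are invented.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced dataset: 90 legitimate transactions, 10 fraudulent.
labels = ["legit"] * 90 + ["fraud"] * 10

weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each class contributes the same total weight to the training loss even though one class has
nine times as many examples.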

Classification Algorithms
To implement any classification model, it helps to first understand Logistic Regression, which is one of the
most fundamental and widely used machine learning algorithms for classification tasks. There are various types
of classification algorithms. Some of them are:
Linear Classifiers: Linear classifier models create a linear decision boundary between classes. They are simple
and computationally efficient. Some of the linear classification models are as follows:
 Logistic Regression
 Support Vector Machines having kernel = ‘linear’
 Single-layer Perceptron
 Stochastic Gradient Descent (SGD) Classifier
Non-linear Classifiers: Non-linear models create a non-linear decision boundary between classes. They can
capture more complex relationships between input features and target variable. Some of the non-linear
classification models are as follows:
 K-Nearest Neighbours
 Kernel SVM
 Naive Bayes
 Decision Tree Classification
 Ensemble learning classifiers:
 Random Forests,
 AdaBoost,
 Bagging Classifier,
 Voting Classifier,
 Extra Trees Classifier
 Multi-layer Artificial Neural Networks
Logistic Regression

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is
dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to
describe data and to explain the relationship between one dependent binary variable and one or more
nominal, ordinal, interval or ratio-level independent variables.

Logistic regression is another statistical analysis method borrowed by machine learning. It is used when
the dependent variable is dichotomous (binary), meaning it has only two possible outcomes: for example, a
person will or will not survive an accident, or a student will or will not pass an exam. The outcome is either
yes or no. This regression technique is similar to linear regression and can be used to predict probabilities
for classification problems.

Types of Logistic Regression

Binary Logistic Regression

Binary logistic regression is used to predict the probability of a binary outcome, such as yes or no, true or
false, or 0 or 1. For example, it could be used to predict whether a customer will churn or not, whether a
patient has a disease or not, or whether a loan will be repaid or not.

Multinomial Logistic Regression

Multinomial logistic regression is used to predict the probability of one of three or more possible outcomes
that have no natural order, such as the type of product a customer will buy or the political party a person
will vote for.

Ordinal Logistic Regression

It is used to predict the probability of an outcome that falls into a predetermined order, such as the level of
customer satisfaction, the severity of a disease, or the stage of cancer.
Why do we use Logistic Regression rather than Linear Regression?

From the definition above, binary logistic regression is used when the dependent variable is binary, whereas
in linear regression the dependent variable is continuous. This is the first reason linear regression is a poor
fit for classification.

The second problem is that if we add an outlier to our dataset, the best-fit line in linear regression shifts to
fit that point.

Consider classifying tumors as malignant (1) or benign (0). If we use linear regression to find the best-fit
line h(x), which aims at minimizing the distance between the predicted value and the actual value, we can
classify a tumor by comparing h(x) with a threshold.

Suppose the threshold value is 0.5, which means if the value of h(x) is greater than 0.5 then we predict a
malignant tumor (1) and if it is less than 0.5 then we predict a benign tumor (0). Everything seems okay so far,
but now let’s change it a bit: we add some outliers to our dataset, and the best-fit line shifts towards those
points.

Do you see the problem? In the figure, the blue line represents the old threshold, and the yellow line
represents the new threshold, which is maybe 0.2. To keep our predictions right, we had to lower our threshold
value. Hence, we can say that linear regression is sensitive to outliers: only if h(x) is compared with 0.2
instead of 0.5 will this regression give correct outputs. Another problem with linear regression is that the
predicted values may be out of range. We know that a probability must lie between 0 and 1, but if we use linear
regression, the predicted value may exceed 1 or go below 0. To overcome these problems, we use Logistic
Regression, which converts the straight best-fit line of linear regression to an S-curve using the sigmoid
function, which will always give values between 0 and 1. How this works and the math behind it will be covered
in a later section.
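
The range problem and its fix can be seen in a few lines. The sketch below evaluates a straight line
h(x) = w·x + b at several inputs and then passes each value through the sigmoid function 1 / (1 + e^(−z)). The
values of w and b are hypothetical, chosen only so that the line leaves the interval from 0 to 1.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical line h(x) = w*x + b; w and b are assumed, not fitted.
w, b = 0.3, -1.0

inputs = [-5, 0, 5, 10]
linear_outputs = [w * x + b for x in inputs]
probabilities = [sigmoid(z) for z in linear_outputs]
# Classify with the usual 0.5 threshold on the sigmoid output.
labels = [1 if p >= 0.5 else 0 for p in probabilities]

for x, z, p, label in zip(inputs, linear_outputs, probabilities, labels):
    print(f"x={x:>3}  linear={z:+.2f}  sigmoid={p:.3f}  class={label}")
```

The raw line produces values below 0 and above 1, which cannot be read as probabilities, while every sigmoid
output stays strictly between 0 and 1.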

Assumptions of Logistic regression

Logistic regression is a statistical method commonly used to analyze data with binary outcomes (yes/no,
1/0) and identify the relationship between those outcomes and independent variables. Here are some key
assumptions for logistic regression:

Data Specific

 Binary Dependent Variable: Logistic regression is designed for binary dependent variables. If your
outcome has more than two categories, you might need a multinomial logistic regression or other
classification techniques.
 Independent Observations: The data points should be independent of each other. This means no
repeated measurements or clustering within the data.

Relationship Between Variables

 Linearity in the Logit: The relationship between the independent variables and the logit of the
dependent variable (ln(p / (1-p))) is assumed to be linear. This doesn’t necessarily mean the outcome
itself has a linear relationship with the independent variables, but the log-odds do.
 No Multicollinearity: Independent variables shouldn’t be highly correlated with each other.
Multicollinearity can cause instability in the model and make it difficult to interpret the coefficients.

Other

 Absence of Outliers: While not a strict requirement, outliers can significantly influence the model.
It’s important to check for and address any outliers that might distort the results.
 Adequate Sample Size: Logistic regression typically requires a reasonably large sample size to
ensure reliable parameter estimates. There are different rules of thumb, but a common guideline is
to have at least 10 observations for each independent variable in the model.

How does Logistic Regression work?

 Prepare the data: The data should be in a format where each row represents a single observation
and each column represents a different variable. The target variable (the variable you want to predict)
should be binary (yes/no, true/false, 0/1).
 Train the model: We teach the model by showing it the training data. This involves finding the
values of the model parameters that minimize the error in the training data.
 Evaluate the model: The model is evaluated on the held-out test data to assess its performance on
unseen data.
 Use the model to make predictions: After the model has been trained and assessed, it can be used
to forecast outcomes on new data.
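
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the data values are made up, there is a single input feature, the parameters are fitted with basic gradient descent on the log loss, and the names `train` and `predict` are hypothetical helpers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, learning_rate=0.1, steps=2000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


def predict(model, x, threshold=0.5):
    """Return class 1 if the predicted probability reaches the threshold."""
    w, b = model
    return 1 if sigmoid(w * x + b) >= threshold else 0


# 1. Prepare the data: one observation per entry, binary target (made-up values).
train_x, train_y = [1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1]
test_x, test_y = [0.5, 2.5, 4.5, 6.5], [0, 0, 1, 1]
center = sum(train_x) / len(train_x)  # centering the feature helps gradient descent
train_c = [x - center for x in train_x]
test_c = [x - center for x in test_x]

# 2. Train the model on the training data.
model = train(train_c, train_y)

# 3. Evaluate the model on the held-out test data.
correct = sum(predict(model, x) == y for x, y in zip(test_c, test_y))
accuracy = correct / len(test_y)
print("test accuracy:", accuracy)

# 4. Use the model to make a prediction for a new observation.
print("prediction for x = 5.5:", predict(model, 5.5 - center))
```

In practice a library implementation would normally be used; the loop here only makes the "minimize the error on the training data" step visible.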
Logistic Function

You must be wondering how logistic regression squeezes the output of linear regression between 0 and 1.

Well, there’s a little bit of math behind this, and it is pretty interesting, trust me.

Let’s start by mentioning the formula of logistic function:

Notice how similar it is to linear regression. If you haven’t read the chapter on Linear Regression, then please
have a look at it for a better understanding.

Best Fit Equation in Linear Regression

We all know the equation of the best fit line in linear regression is:

Let’s say that instead of y we model a probability P. But there is an issue here: the value of P can exceed
1 or go below 0, and we know that a probability must lie between 0 and 1. To overcome this issue we take
the “odds” of P:

Do you think we are done here? No, we are not. We know that odds are always positive, which means
their range is (0,+∞). Odds are nothing but the ratio of the probability of success to the probability
of failure. Now the question arises: out of so many other options for transforming P, why did we
take ‘odds’? Because odds are probably the easiest way to do this, that’s it.

The problem here is that this range is still restricted, and it is difficult to model a variable that has a
restricted range: the right-hand side of the linear equation can take any real value, while the odds can never
go below 0. To remove this restriction we take the log of odds, which has a range of (-∞,+∞).
If you understood what I did here then you have done 80% of the maths. Now we just want a function of P,
because we want to predict a probability, not the log of odds. To get it, we exponentiate both sides and then
solve for P.
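
Written out, the steps from log-odds to probability look as follows. The linear form $\beta_0 + \beta_1 x$ is an assumed notation for the best fit line, since the original equations are not reproduced in this text.

```latex
\begin{aligned}
\log\!\left(\frac{P}{1-P}\right) &= \beta_0 + \beta_1 x \\
\frac{P}{1-P} &= e^{\beta_0 + \beta_1 x} \\
P &= (1-P)\,e^{\beta_0 + \beta_1 x} \\
P\left(1 + e^{\beta_0 + \beta_1 x}\right) &= e^{\beta_0 + \beta_1 x} \\
P &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
   = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\end{aligned}
```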

Now we have our logistic function, also called a sigmoid function. The graph of a sigmoid function is as
shown below. It squeezes a straight line into an S-curve.
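
As a quick numerical check, the sketch below evaluates the sigmoid at a few points and confirms that taking the log of odds undoes it. The function names `sigmoid` and `log_odds` are chosen here for illustration.

```python
import math


def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Inverse of the sigmoid: map a probability back to the real line."""
    return math.log(p / (1.0 - p))


# Columns: input z, sigmoid(z), and log_odds(sigmoid(z)), which recovers z.
for z in [-6, -2, 0, 2, 6]:
    p = sigmoid(z)
    print(z, round(p, 4), round(log_odds(p), 4))
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of 0 maps to exactly 0.5, which gives the S-shape.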

Differences Between Linear and Logistic Regression

Linear regression and logistic regression, while both workhorses in machine learning, serve distinct
purposes. The core difference lies in their target predictions. Linear regression excels at predicting
continuous values along a spectrum. Imagine predicting house prices based on size and location – the
resulting output would be a specific dollar amount, a continuous value on the price scale.

Logistic regression, on the other hand, deals with categories. It doesn’t predict a specific value but rather
the likelihood of something belonging to a particular class. For instance, classifying emails as spam
(category 1) or not spam (category 0). The output here would be a probability between 0 (not likely spam)
and 1 (very likely spam). This probability is then used to assign an email to a definitive category (spam or
not spam) based on a chosen threshold.
In simpler terms, linear regression answers “how much” questions, providing a specific value on a
continuous scale. Logistic regression tackles “yes or no” scenarios, giving the probability of something
belonging to a certain category.

Key properties of the logistic regression equation

 Sigmoid Function: The logistic regression model uses a special “S”-shaped curve to predict
probabilities. It ensures that the predicted probabilities stay between 0 and 1, which makes
sense for probabilities.
 Straightforward Relationship: Even though the logistic regression model might seem complex, the
relationship between our inputs (like age, height, etc.) and the outcome (like yes/no) is pretty simple
to understand: the log-odds of the outcome follow a straight line, and the sigmoid bends that line
into a curve.
 Coefficients: These are just numbers that tell us how much each input affects the outcome in the
logistic regression model. For example, if age is a predictor, the coefficient tells us how much the
log-odds of the outcome change for every one-year increase in age.
 Best Guess: We figure out the best coefficients for the logistic regression model by looking at the
data we have and tweaking them until our predictions match the real outcomes as closely as possible.
 Basic Assumptions: We assume that our observations are independent, meaning one doesn’t affect
the other. We also assume that there’s not too much overlap (correlation) between our predictors
(like age and height), and that the relationship between our predictors and the log-odds of the
outcome is linear.
 Probabilities, Not Certainties: Instead of saying “yes” or “no” directly, logistic regression gives us
probabilities, like saying there’s a 70% chance it’s a “yes”. We can then decide on a cutoff point to
make our final decision.
 Checking Our Work: We have some tools to make sure our predictions are good, like accuracy,
precision, recall, and a curve called the ROC curve. These help us see how well our logistic
regression model is doing its job.
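
The last two points can be illustrated together: predicted probabilities are turned into classes with a cutoff, and the result is scored with accuracy, precision, and recall. The probabilities and labels below are made up, and `classify` and `evaluate` are hypothetical helper names.

```python
def classify(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 decisions using a cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]


def evaluate(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up true labels and model probabilities for eight observations.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]

default_cutoff = evaluate(y_true, classify(probs, threshold=0.5))
low_cutoff = evaluate(y_true, classify(probs, threshold=0.25))
print(default_cutoff)
print(low_cutoff)
```

Lowering the cutoff catches every positive case in this toy data (higher recall) at the cost of more false alarms (lower precision), which is why the cutoff is a choice rather than a fixed value.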

A decision tree is a simple diagram that shows different choices and their possible results, helping you make
decisions easily. This section covers what decision trees are, how they work, their advantages and
disadvantages, and their applications.
Understanding Decision Tree
A decision tree is a graphical representation of the different options for solving a problem and shows how
different factors are related. It has a hierarchical tree structure that starts with one main question at the top,
called the root node, which further branches out into different possible outcomes, where:
 Root Node: The starting point that represents the entire dataset.
 Branches: The lines that connect nodes. They show the flow from one decision to another.
 Internal Nodes: The points where decisions are made based on the input features.
 Leaf Nodes: The terminal nodes at the end of branches that represent final outcomes or predictions.

Decision Tree Structure


They also support decision-making by visualizing outcomes. You can quickly evaluate and compare the
“branches” to determine which course of action is best for you.
Now, let’s take an example to understand the decision tree. Imagine you want to decide whether to drink coffee
based on the time of day and how tired you feel. First the tree checks the time of day. If it’s morning, it asks
whether you are tired. If you’re tired, the tree suggests drinking coffee; if not, it says there’s no need. Similarly,
in the afternoon the tree again asks if you are tired. If you are, it recommends drinking coffee; if not, it concludes
no coffee is needed.
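
The coffee example maps directly onto nested if statements, with one level per question. The function name and the handling of any other time of day are assumptions, since the example only covers morning and afternoon.

```python
def should_drink_coffee(time_of_day, tired):
    """Walk the coffee decision tree from the root to a leaf."""
    # Root node: check the time of day.
    if time_of_day == "morning":
        # Internal node: are you tired?
        if tired:
            return True   # leaf: drink coffee
        return False      # leaf: no need
    if time_of_day == "afternoon":
        # Internal node: the same question is asked again on this branch.
        if tired:
            return True   # leaf: drink coffee
        return False      # leaf: no coffee needed
    # Assumption: the example does not say what happens at other times.
    return False


print(should_drink_coffee("morning", True))
print(should_drink_coffee("afternoon", False))
```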

DECISION TREES
Classification of Decision Tree
We have mainly two types of decision tree based on the nature of the target variable: classification
trees and regression trees.
 Classification trees: They are designed to predict categorical outcomes, meaning they classify data into
different classes. For example, they can determine whether an email is “spam” or “not spam” based on
various features of the email.
 Regression trees: These are used when the target variable is continuous. They predict numerical values
rather than categories. For example, a regression tree can estimate the price of a house based on its size,
location, and other features.
How Decision Trees Work?
A decision tree starts with a main question known as the root node. This question is derived from the
features of the dataset and serves as the starting point for decision-making.
From the root node, the tree asks a series of yes/no questions. Each question is designed to split the data into
subsets based on specific attributes. For example, if the first question is “Is it raining?”, the answer will
determine which branch of the tree to follow. If your answer is “Yes,” you proceed down one path; if “No,”
you take another.
This branching continues through a sequence of decisions. As you follow each branch, you get more questions
that break the data into smaller groups. This step-by-step process continues until there are no more helpful
questions to ask.
You then reach the end of a branch, where you find the final outcome or decision. It could be a classification
(like “spam” or “not spam”) or a prediction (such as an estimated price).
Advantages of Decision Trees
 Simplicity and Interpretability: Decision trees are straightforward and easy to understand. You can
visualize them like a flowchart which makes it simple to see how decisions are made.
 Versatility: They can be used for different types of tasks and work well for both classification and
regression.
 No Need for Feature Scaling: They don’t require you to normalize or scale your data.
 Handles Non-linear Relationships: They are capable of capturing non-linear relationships between features
and target variables.
Disadvantages of Decision Trees
 Overfitting: Overfitting occurs when a decision tree captures noise and details in the training data and, as
a result, performs poorly on new data.
 Instability: The model can be unreliable, because slight variations in the input can lead to
significant differences in predictions.
 Bias towards Features with More Levels: Decision trees can become biased towards features with many
categories, focusing too much on them during decision-making. This can cause the model to miss other
important features and lead to less accurate predictions.
Applications of Decision Trees
 Loan Approval in Banking: A bank needs to decide whether to approve a loan application based on
customer profiles.
o Input features include income, credit score, employment status, and loan history.
o The decision tree predicts loan approval or rejection, helping the bank make quick and reliable
decisions.
 Medical Diagnosis: A healthcare provider wants to predict whether a patient has diabetes based on clinical
test results.
o Features like glucose levels, BMI, and blood pressure are used to make a decision tree.
o The tree classifies patients into diabetic or non-diabetic, assisting doctors in diagnosis.
 Predicting Exam Results in Education: A school wants to predict whether a student will pass or fail based
on study habits.
o Data includes attendance, time spent studying, and previous grades.
o The decision tree identifies at-risk students, allowing teachers to provide additional support.

Earlier, we discussed how decision trees model decisions through a tree-like structure, where internal nodes
represent feature tests, branches represent decision rules, and leaf nodes contain the final predictions. This basic
understanding is crucial for building and interpreting decision trees, which are widely used for classification
and regression tasks.
Now, let’s take this understanding a step further and dive into how decision trees are implemented in
machine learning. We will explore how to train a decision tree model, make predictions, and evaluate its
performance.
Why Decision Tree Structure in ML?
A decision tree is a supervised learning algorithm used for both classification and regression tasks. It models
decisions as a tree-like structure where internal nodes represent attribute tests, branches represent attribute
values, and leaf nodes represent final decisions or predictions. Decision trees are versatile, interpretable, and
widely used in machine learning for predictive modeling.
Now that we have covered the very basics of decision trees, it is important to understand the intuition
behind them, so let’s move on to that.

Here’s an example to make it simple to understand the intuition of decision tree:


Imagine you’re deciding whether to buy an umbrella:
1. Step 1 – Ask a Question (Root Node):
Is it raining?
If yes, you might decide to buy an umbrella. If no, you move to the next question.
2. Step 2 – More Questions (Internal Nodes):
If it’s not raining, you might ask:
Is it likely to rain later?
If yes, you buy an umbrella; if no, you don’t.
3. Step 3 – Decision (Leaf Node):
Based on your answers, you either buy or skip the umbrella.
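
The same umbrella tree can also be stored as data and walked with a small loop, which is closer to how a learned tree is used at prediction time. The nested-dictionary layout and the `decide` helper are illustrative choices, not a fixed format.

```python
# Internal nodes are dicts holding a question and one subtree per answer;
# leaves are plain strings holding the final decision.
umbrella_tree = {
    "question": "Is it raining?",
    "yes": "buy umbrella",
    "no": {
        "question": "Is it likely to rain later?",
        "yes": "buy umbrella",
        "no": "skip umbrella",
    },
}


def decide(node, answers):
    """Follow the branch matching each answer until a leaf is reached."""
    while isinstance(node, dict):
        answer = answers[node["question"]]
        node = node[answer]
    return node


print(decide(umbrella_tree, {"Is it raining?": "yes"}))
print(decide(umbrella_tree, {"Is it raining?": "no",
                             "Is it likely to rain later?": "no"}))
```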

Approach in Decision Tree


A decision tree uses a tree representation to solve the problem: each leaf node corresponds to a class
label, and attributes are represented on the internal nodes of the tree. We can represent any boolean function on
discrete attributes using a decision tree.

Example: Predicting Whether a Person Likes Computer Games


Imagine you want to predict if a person enjoys computer games based on their age and gender. Here’s how the
decision tree works:
1. Start with the Root Question (Age):
 The first question is: “Is the person’s age less than 15?”
 If Yes, move to the left.
 If No, move to the right.
2. Branch Based on Age:
 If the person is younger than 15, they are likely to enjoy computer games (+2 prediction score).
 If the person is 15 or older, ask the next question: “Is the person male?”
3. Branch Based on Gender (For Age 15+):
 If the person is male, they are somewhat likely to enjoy computer games (+0.1 prediction score).
 If the person is not male, they are less likely to enjoy computer games (-1 prediction score).
Example: Predicting Whether a Person Likes Computer Games Using Two Decision Trees

Tree 1: Age and Gender

1. The first tree asks two questions:


 “Is the person’s age less than 15?”
o If Yes, they get a score of +2.
o If No, proceed to the next question.
 “Is the person male?”
o If Yes, they get a score of +0.1.
o If No, they get a score of -1.

Tree 2: Computer Usage

1. The second tree focuses on daily computer usage:


 “Does the person use a computer daily?”
o If Yes, they get a score of +0.9.
o If No, they get a score of -0.9.

Combining Trees: Final Prediction

The final prediction score is the sum of the scores from both trees.
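
A minimal sketch of this two-tree scoring, using the scores given above. The function names and the example people are hypothetical.

```python
def tree_age_gender(age, is_male):
    """Tree 1: split on age first, then on gender for people aged 15 or older."""
    if age < 15:
        return 2.0
    return 0.1 if is_male else -1.0


def tree_computer_usage(uses_computer_daily):
    """Tree 2: a single split on daily computer usage."""
    return 0.9 if uses_computer_daily else -0.9


def final_score(age, is_male, uses_computer_daily):
    """Combine the trees by summing their individual scores."""
    return tree_age_gender(age, is_male) + tree_computer_usage(uses_computer_daily)


# A 10-year-old boy who uses a computer daily: 2 + 0.9
print(final_score(10, True, True))
# A 40-year-old woman who does not use a computer daily: -1 + (-0.9)
print(final_score(40, False, False))
```

A higher total means the person is more likely to enjoy computer games.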
Information Gain and Gini Index in Decision Tree
So far we have covered the basic intuition and approach of how a decision tree works, so let’s move on
to the attribute selection measures of a decision tree.
There are two popular attribute selection measures:
1. Information Gain
2. Gini Index

1. Information Gain:

Information Gain tells us how useful a question (or feature) is for splitting data into groups. It measures how
much the uncertainty decreases after the split. A good question will create clearer groups, and the feature with
the highest Information Gain is chosen to make the decision.
For example, if we split a dataset of people into “Young” and “Old” based on age, and all young people bought
the product while all old people did not, the Information Gain would be high because the split perfectly
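separates the two groups with no uncertainty left.
Suppose S is a set of instances, A is an attribute, v is an individual value that A can take, Sv is the subset of S
with A = v, and Values (A) is the set of all possible values of A. Then the Information Gain of A is the entropy
of S minus the weighted average entropy of the subsets Sv, where each subset is weighted by its share
|Sv| / |S| of the instances.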
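
In symbols, using the quantities just defined, the standard definitions read as follows. The class proportion $p_c$ (the fraction of instances in $S$ that belong to class $c$) is notation introduced here for the entropy term, which is described in the text that follows.

```latex
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S)
  - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v),
\qquad
\mathrm{Entropy}(S) = -\sum_{c} p_c \log_2 p_c
```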

Entropy: This is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary
collection of examples. The higher the entropy, the more the information content.
For example, if a dataset has an equal number of “Yes” and “No” outcomes (like 3 people who bought a product
and 3 who didn’t), the entropy is high because it’s uncertain which outcome to predict. But if all the outcomes
are the same (all “Yes” or all “No”), the entropy is 0, meaning there is no uncertainty left in predicting the
outcome.
Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values (A) is the set of
all possible values of A, then

Example:
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of b: 5
Instances of a: 3
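
The entropy of this set can be computed from the two proportions, 3/8 for a and 5/8 for b, as the negative sum of each proportion times its base-2 logarithm. The sketch below does this with the standard library; the function name `entropy` is chosen here for illustration.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the class proportions in `items`."""
    total = len(items)
    counts = Counter(items).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


X = ["a", "a", "a", "b", "b", "b", "b", "b"]
print(round(entropy(X), 3))  # 0.954
```

The value is a little below 1 bit, the maximum for two classes, because the set is close to an even split but not perfectly balanced.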

Building Decision Tree using Information Gain
The essentials:

 Start with all training instances associated with the root node
 Use info gain to choose which attribute to label each node with
 Note: No root-to-leaf path should contain the same discrete attribute twice
 Recursively construct each subtree on the subset of training instances that would be classified down that
path in the tree.
 If all positive or all negative training instances remain, label that node “yes” or “no” accordingly
 If no attributes remain, label with a majority vote of training instances left at that node
 If no instances remain, label with a majority vote of the parent’s training instances.
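
The steps above can be sketched as a short recursive function. This recursion is commonly known as the ID3 algorithm. The toy weather data, the attribute names, and the helper names below are made up for illustration, and the sketch only handles discrete attributes.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, attrs, values):
    if len(set(labels)) == 1:   # all instances agree: label the leaf
        return labels[0]
    if not attrs:               # no attributes remain: majority vote
        return majority(labels)
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]  # never reuse an attribute on a path
    node = {"attr": best, "children": {}}
    for value in values[best]:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        if not idx:             # no instances remain: majority vote of the parent
            node["children"][value] = majority(labels)
        else:
            node["children"][value] = build_tree(
                [rows[i] for i in idx], [labels[i] for i in idx], remaining, values)
    return node


def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree


# Hypothetical training data: should we play outside?
values = {"outlook": ["sunny", "rainy"], "windy": ["no", "yes"]}
rows = [
    {"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"}, {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"}, {"outlook": "rainy", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
]
labels = ["yes", "yes", "yes", "yes", "no", "no", "yes"]

tree = build_tree(rows, labels, list(values), values)
print(tree)
print(predict(tree, {"outlook": "rainy", "windy": "yes"}))
```

On this toy data the outlook attribute has the higher information gain, so it becomes the root, and only the rainy branch needs a second question about wind.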
