
Understanding Decision Trees in ML

The document explains Decision Trees and Logistic Regression, two important concepts in machine learning. Decision Trees are used for classification and regression tasks, providing a visual representation of decisions and their outcomes, while Logistic Regression is a classification algorithm predicting probabilities for binary outcomes. Both methods have specific advantages, disadvantages, and applications across various fields such as banking, healthcare, and education.

Uploaded by

somashakerreddyt
Copyright
© All Rights Reserved

Decision Tree


A Decision Tree helps us to make decisions by mapping out different
choices and their possible outcomes. It’s used in machine learning for tasks
like classification and prediction. In this article, we’ll see more about
Decision Trees, their types and other core concepts.

A Decision Tree shows different options and how they are related. It has a
tree-like structure that starts with one main question called the root node,
which represents the entire dataset. From there, the tree branches out into
different possibilities based on features in the data.
• Root Node: Starting point representing the whole dataset.
• Branches: Lines connecting nodes showing the flow from one decision
to another.
• Internal Nodes: Points where decisions are made based on data
features.
• Leaf Nodes: End points of the tree where the final decision or prediction
is made.

Decision Tree
A Decision Tree also helps with decision-making by showing possible
outcomes clearly. By looking at the "branches" we can quickly compare
options and figure out the best choice.
There are mainly two types of Decision Trees based on the target variable:
1. Classification Trees: Used for predicting categorical outcomes like
spam or not spam. These trees split the data based on features to
classify data into predefined categories.
2. Regression Trees: Used for predicting continuous outcomes like
predicting house prices. Instead of assigning categories, it provides
numerical predictions based on the input features.
How Do Decision Trees Work?
1. Start with the Root Node: It begins with a main question at the root
node which is derived from the dataset’s features.
2. Ask Yes/No Questions: From the root, the tree asks a series of yes/no
questions to split the data into subsets based on specific attributes.
3. Branching Based on Answers: Each question leads to different
branches:
• If the answer is yes, the tree follows one path.
• If the answer is no, the tree follows another path.
4. Continue Splitting: This branching continues through further decisions,
narrowing the data down step by step.
5. Reach the Leaf Node: The process ends when there are no more useful
questions to ask, leading to a leaf node where the final decision or
prediction is made.
Let’s look at a simple example to understand how it works. Imagine we
need to decide whether to drink coffee based on the time of day and how
tired we feel. The tree first checks the time:
1. In the morning: It asks “Tired?”
• If yes, the tree suggests drinking coffee.
• If no, it says no coffee is needed.
2. In the afternoon: It asks again “Tired?”
• If yes, it suggests drinking coffee.
• If no, no coffee is needed.
Example
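The coffee example above can be sketched in code. This is only an illustrative encoding: the nested-dictionary layout, the key names (`question`, `branches`) and the `predict` helper are assumptions made for this sketch, not part of any library.

```python
# Hypothetical encoding of the coffee example as a nested dictionary.
# Internal nodes hold a "question" and one child per answer; leaves are plain strings.
coffee_tree = {
    "question": "time_of_day",  # root node: the first question asked
    "branches": {
        "morning": {
            "question": "tired",
            "branches": {True: "drink coffee", False: "no coffee"},
        },
        "afternoon": {
            "question": "tired",
            "branches": {True: "drink coffee", False: "no coffee"},
        },
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the branch that matches each answer."""
    while isinstance(node, dict):          # still at an internal node
        answer = sample[node["question"]]  # look up the feature this node asks about
        node = node["branches"][answer]    # follow the matching branch
    return node                            # a leaf: the final decision


print(predict(coffee_tree, {"time_of_day": "morning", "tired": True}))
print(predict(coffee_tree, {"time_of_day": "afternoon", "tired": False}))
```

Each pass through the loop corresponds to one internal node, and the returned string plays the role of the leaf node.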
Splitting Criteria in Decision Trees
In a Decision Tree, the process of splitting data at each node is important.
The splitting criterion determines the best feature to split the data on. Common
splitting criteria include Gini Impurity and Entropy.
• Gini Impurity: This criterion measures how "impure" a node is, that is, how
mixed the classes in it are. It is 0 when every sample in the node belongs to
the same class. The lower the Gini Impurity after a split, the better the
feature separates the data into distinct categories.
• Entropy: This measures the amount of uncertainty or disorder in the
data. The tree tries to reduce the entropy by splitting the data on
features that provide the most information about the target variable.
These criteria help decide which features are useful for making the best
split at each decision point in the tree.
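To make the two criteria concrete, here is a minimal sketch that computes Gini Impurity and Entropy for a list of class labels and compares two candidate splits. The function names and the toy "spam"/"ham" labels are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes present."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Impurity of a split: each child's impurity weighted by its share of the samples."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * impurity(group) for group in groups)


parent = ["spam", "spam", "spam", "ham", "ham", "ham"]
split_a = [["spam", "spam", "spam"], ["ham", "ham", "ham"]]  # separates the classes
split_b = [["spam", "ham", "spam"], ["ham", "spam", "ham"]]  # leaves both children mixed

print("parent gini:", gini(parent), "parent entropy:", entropy(parent))
print("split A gini:", weighted_impurity(split_a, gini))
print("split B gini:", weighted_impurity(split_b, gini))
```

A tree-building algorithm would prefer split A here, because it reduces the impurity of the parent node the most.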
Pruning in Decision Trees
• Pruning is an important technique used to prevent overfitting in Decision
Trees. Overfitting occurs when a tree becomes too deep and starts to
memorize the training data rather than learning general patterns. This
leads to poor performance on new, unseen data.
• This technique reduces the complexity of the tree by removing branches
that have little predictive power. It improves model performance by
helping the tree generalize better to new data. It also makes the model
simpler and faster to deploy.
• It is useful when a Decision Tree is too deep and starts to capture noise
in the data.
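One way to try pruning in practice is cost-complexity pruning in scikit-learn, controlled by the `ccp_alpha` parameter of `DecisionTreeClassifier`. The sketch below is an assumption-laden illustration: the dataset, the split and the value `ccp_alpha=0.01` are arbitrary choices, and other pruning strategies (for example limiting `max_depth`) exist.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Unpruned tree: grows until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: ccp_alpha > 0 removes branches whose impurity reduction
# does not justify the added complexity (0.01 is an arbitrary example value).
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(name, "nodes:", tree.tree_.node_count,
          "test accuracy:", round(tree.score(X_test, y_test), 3))
```

The pruned tree has at most as many nodes as the full one. Whether it also generalizes better depends on the data, so `ccp_alpha` is usually tuned with validation data.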
Advantages of Decision Trees
• Easy to Understand: Decision Trees are visual which makes it easy to
follow the decision-making process.
• Versatility: Can be used for both classification and regression
problems.
• No Need for Feature Scaling: Unlike many machine learning models, they
don't require us to scale or normalize our data.
• Handles Non-linear Relationships: They can capture complex, non-linear
relationships between features and outcomes effectively.
• Interpretability: The tree structure is easy to interpret, allowing
users to understand the reasoning behind each decision.
• Handles Missing Data: Some implementations can handle missing values by using
strategies like assigning the most common value or ignoring missing data during
splits.
Disadvantages of Decision Trees
• Overfitting: They can overfit the training data if they are too deep which
means they memorize the data instead of learning general patterns.
This leads to poor performance on unseen data.
• Instability: Decision Trees can be unstable, which means that small changes in the
data may lead to significant differences in the tree structure and
predictions.
• Bias towards Features with Many Categories: Trees can become biased
toward features with many distinct values, focusing too much on
them and potentially missing other important features, which can reduce
prediction accuracy.
• Difficulty in Capturing Complex Interactions: Decision Trees may
struggle to capture some complex interactions between features, which makes
them less effective for certain types of data.
• Computationally Expensive for Large Datasets: For large datasets,
building and pruning a Decision Tree can be computationally intensive,
especially as the tree depth increases.
Applications of Decision Trees
Decision Trees are used across various fields due to their simplicity,
interpretability and versatility. Let's see some key applications:
1. Loan Approval in Banking: Banks use Decision Trees to assess
whether a loan application should be approved. The decision is based
on factors like credit score, income, employment status and loan history.
This helps predict approval or rejection, enabling quick and
reliable decisions.
2. Medical Diagnosis: In healthcare they assist in diagnosing diseases.
For example, they can predict whether a patient has diabetes based on
clinical data like glucose levels, BMI and blood pressure. This helps
classify patients into diabetic or non-diabetic categories, supporting
early diagnosis and treatment.
3. Predicting Exam Results in Education: Educational institutions use Decision Trees to
predict whether a student will pass or fail based on factors like
attendance, study time and past grades. This helps teachers identify at-
risk students and offer targeted support.
4. Customer Churn Prediction: Companies use Decision Trees to predict
whether a customer will leave or stay based on behavior patterns,
purchase history, and interactions. This allows businesses to take
proactive steps to retain customers.
5. Fraud Detection: In finance, Decision Trees are used to detect
fraudulent activities, such as credit card fraud. By analyzing past
transaction data and patterns, Decision Trees can identify suspicious
activities and flag them for further investigation.
A decision tree can also be used to help build automated predictive models
which have applications in machine learning, data mining and statistics. By
mastering Decision Trees, we can gain a deeper understanding of data and
make more informed decisions across different fields.

Logistic Regression is a supervised machine learning algorithm used for
classification problems. Unlike linear regression, which predicts continuous
values, it predicts the probability that an input belongs to a specific class.
It is most often used for binary classification, where the output can be one of
two possible categories such as Yes/No, True/False or 0/1. It uses the sigmoid
function to convert inputs into a probability value between 0 and 1. In this
article, we will see the basics of logistic regression and its core concepts.

Types of Logistic Regression


Logistic regression can be classified into three main types based on the
nature of the dependent variable:
1. Binomial Logistic Regression: This type is used when the dependent
variable has only two possible categories. Examples include Yes/No,
Pass/Fail or 0/1. It is the most common form of logistic regression and
is used for binary classification problems.
2. Multinomial Logistic Regression: This is used when the dependent
variable has three or more possible categories that are not ordered.
For example, classifying animals into categories like "cat," "dog" or
"sheep." It extends the binary logistic regression to handle multiple
classes.
3. Ordinal Logistic Regression: This type applies when the dependent
variable has three or more categories with a natural order or ranking.
Examples include ratings like "low," "medium" and "high." It takes the
order of the categories into account when modeling.
Assumptions of Logistic Regression
Understanding the assumptions behind logistic regression is important to
ensure the model is applied correctly. The main assumptions are:
1. Independent observations: Each data point is assumed to be
independent of the others, meaning there should be no correlation or
dependence between the input samples.
2. Binary dependent variable: The dependent variable is assumed to be
binary, meaning it can take only two values.
For more than two categories, the softmax function is used instead.
3. Linear relationship between independent variables and log
odds: The model assumes a linear relationship between the
independent variables and the log odds of the dependent variable,
which means the predictors affect the log odds in a linear way.
4. No outliers: The dataset should not contain extreme outliers as they
can distort the estimation of the logistic regression coefficients.
5. Large sample size: It requires a sufficiently large sample size to
produce reliable and stable results.
Understanding Sigmoid Function
1. The sigmoid function is an important part of logistic regression. It is
used to convert the raw output of the model into a probability value
between 0 and 1.
2. This function takes any real number and maps it into the range 0 to 1
forming an "S" shaped curve called the sigmoid curve or logistic curve.
Because probabilities must lie between 0 and 1, the sigmoid function is
perfect for this purpose.
3. In logistic regression, we use a threshold value, usually 0.5, to decide
the class label.
• If the sigmoid output is equal to or above the threshold, the input is
classified as Class 1.
• If it is below the threshold, the input is classified as Class 0.
This approach helps to transform continuous input values into meaningful
class predictions.
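The sigmoid mapping and the thresholding rule described above can be sketched in a few lines. The helper names `sigmoid` and `classify` are assumptions for this illustration, and the simple formula is only suitable for moderate values of z.

```python
import math


def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    # Note: math.exp(-z) can overflow for very large negative z; fine for a sketch.
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return Class 1 if the sigmoid output reaches the threshold, else Class 0."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  class={classify(z)}")
```

With the default threshold of 0.5, the decision boundary sits exactly at z = 0, where the sigmoid output equals 0.5.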
How does Logistic Regression work?
A logistic regression model takes the continuous output of a linear function
of the inputs and passes it through the sigmoid function, which maps any real
value to a value between 0 and 1. That value is interpreted as a probability
and then converted into a class label. The sigmoid function is also known as
the logistic function.
Suppose we have input features represented as a matrix with n observations
(rows) and m features (columns):
$$X = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ x_{21} & \cdots & x_{2m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix}$$
and the dependent variable Y takes only binary values, i.e. 0 or 1:
$$Y = \begin{cases} 0 & \text{if Class 1} \\ 1 & \text{if Class 2} \end{cases}$$
Then apply the multi-linear function to the input variables of one observation:
$$z = \left(\sum_{j=1}^{m} w_j x_j\right) + b$$
Here x_j is the j-th feature of the observation (a row of X),
w = [w_1, w_2, w_3, ..., w_m] is the vector of weights or coefficients and b is
the bias term, also known as the intercept. This can be written compactly as
the dot product of the weights and the input, plus the bias:
$$z = w \cdot X + b$$
At this stage, z is a continuous value, as in linear regression. Logistic
regression then applies the sigmoid function to z to convert it into a
probability between 0 and 1, which is the predicted y and can be used to
predict the class:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Sigmoid function
As shown above, the sigmoid function converts a continuous value
into a probability, i.e. a value between 0 and 1.
• σ(z) tends towards 1 as z → ∞
• σ(z) tends towards 0 as z → −∞
• σ(z) is always bounded between 0 and 1
The probability of each class can then be written as:
$$P(y = 1) = \sigma(z), \qquad P(y = 0) = 1 - \sigma(z)$$
Logistic Regression Equation and Odds:
It models the odds of the dependent event occurring, which is the ratio of
the probability of the event to the probability of it not occurring:
$$\frac{p(x)}{1 - p(x)} = e^{z}$$
Taking the natural logarithm of the odds gives the log-odds or logit. Solving
the logit equation for p(x) recovers the sigmoid form:
$$\begin{aligned}
\log\left[\frac{p(x)}{1 - p(x)}\right] &= z = w \cdot X + b \\
\frac{p(x)}{1 - p(x)} &= e^{w \cdot X + b} \quad \text{(exponentiate both sides)} \\
p(x) &= e^{w \cdot X + b} \, (1 - p(x)) \\
p(x) &= e^{w \cdot X + b} - e^{w \cdot X + b} \, p(x) \\
p(x) + e^{w \cdot X + b} \, p(x) &= e^{w \cdot X + b} \\
p(x) \left(1 + e^{w \cdot X + b}\right) &= e^{w \cdot X + b} \\
p(x) &= \frac{e^{w \cdot X + b}}{1 + e^{w \cdot X + b}}
\end{aligned}$$
Then the final logistic regression equation will be:
$$p(X; b, w) = \frac{e^{w \cdot X + b}}{1 + e^{w \cdot X + b}} = \frac{1}{1 + e^{-(w \cdot X + b)}}$$
This formula represents the probability of the input belonging to Class 1.
Likelihood Function for Logistic Regression
The goal is to find weights w and bias b that maximize the likelihood of
observing the data.
For each data point i:
• for y_i = 1, the predicted probability is p(x_i; b, w) = p(x_i)
• for y_i = 0, the predicted probability is 1 − p(x_i; b, w) = 1 − p(x_i)
Combining both cases over all n data points gives the likelihood:
$$L(b, w) = \prod_{i=1}^{n} p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i}$$
Taking natural logs on both sides:
$$\begin{aligned}
\log L(b, w) &= \sum_{i=1}^{n} \left[ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \right] \\
&= \sum_{i=1}^{n} \left[ y_i \log p(x_i) + \log(1 - p(x_i)) - y_i \log(1 - p(x_i)) \right] \\
&= \sum_{i=1}^{n} \log(1 - p(x_i)) + \sum_{i=1}^{n} y_i \log \frac{p(x_i)}{1 - p(x_i)} \\
&= \sum_{i=1}^{n} -\log\left(1 + e^{w \cdot x_i + b}\right) + \sum_{i=1}^{n} y_i \, (w \cdot x_i + b)
\end{aligned}$$
The last step uses 1 − p(x_i) = 1 / (1 + e^{w·x_i + b}) and the fact that the
log-odds equal w·x_i + b.
This is known as the log-likelihood function.
Gradient of the log-likelihood function
To find the best w and b we use gradient ascent on the log-likelihood
function. The gradient with respect to each weight w_j is:
$$\begin{aligned}
\frac{\partial \log L(b, w)}{\partial w_j} &= -\sum_{i=1}^{n} \frac{1}{1 + e^{w \cdot x_i + b}} \, e^{w \cdot x_i + b} \, x_{ij} + \sum_{i=1}^{n} y_i x_{ij} \\
&= -\sum_{i=1}^{n} p(x_i; b, w) \, x_{ij} + \sum_{i=1}^{n} y_i x_{ij} \\
&= \sum_{i=1}^{n} \left( y_i - p(x_i; b, w) \right) x_{ij}
\end{aligned}$$
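The gradient above translates directly into a training loop. The following is a minimal pure-Python sketch of gradient ascent on the log-likelihood. The toy dataset, the learning rate, the number of steps and all function names are assumptions made for illustration, and the bias gradient is taken to be the sum of (y_i − p(x_i)), which follows from the same derivation with x_ij replaced by 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(w, b, x):
    """z = w . x + b for a single observation x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def log_likelihood(X, y, w, b):
    """Sum over i of y_i * z_i - log(1 + e^{z_i}), the final form derived above."""
    return sum(yi * linear(w, b, xi) - math.log(1.0 + math.exp(linear(w, b, xi)))
               for xi, yi in zip(X, y))


def fit_gradient_ascent(X, y, lr=0.05, steps=2000):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            error = yi - sigmoid(linear(w, b, xi))  # (y_i - p(x_i))
            for j, xij in enumerate(xi):
                grad_w[j] += error * xij            # sum of (y_i - p(x_i)) * x_ij
            grad_b += error
        # Ascent: move in the direction of the gradient to increase the likelihood.
        w = [wj + lr * gj for wj, gj in zip(w, grad_w)]
        b += lr * grad_b
    return w, b


# Hypothetical data: one feature (e.g. hours studied) and a pass/fail label.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

ll_before = log_likelihood(X, y, [0.0], 0.0)
w, b = fit_gradient_ascent(X, y)
ll_after = log_likelihood(X, y, w, b)
print("weights:", w, "bias:", b)
print("log-likelihood before:", ll_before, "after:", ll_after)
```

Libraries such as scikit-learn use more sophisticated solvers and regularization by default, so their fitted coefficients will generally differ from this plain gradient-ascent sketch.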
Terminologies involved in Logistic Regression
Here are some common terms involved in logistic regression:
1. Independent Variables: These are the input features or predictor
variables used to make predictions about the dependent variable.
2. Dependent Variable: This is the target variable that we aim to predict.
In logistic regression, the dependent variable is categorical.
3. Logistic Function: This function transforms a linear combination of the
independent variables into a value between 0 and 1, which represents the
probability that the dependent variable equals 1.
4. Odds: This is the ratio of the probability of an event happening to the
probability of it not happening. It differs from probability because
probability is the ratio of occurrences to total possibilities.
5. Log-Odds (Logit): The natural logarithm of the odds. In logistic
regression, the log-odds are modeled as a linear combination of the
independent variables and the intercept.
6. Coefficient: These are the parameters estimated by the logistic
regression model, which show how strongly each independent variable
affects the log-odds of the dependent variable.
7. Intercept: The constant term in the logistic regression model which
represents the log-odds when all independent variables are equal to
zero.
8. Maximum Likelihood Estimation (MLE): This method is used to
estimate the coefficients of the logistic regression model by maximizing
the likelihood of observing the given data.
Implementation for Logistic Regression
Now, let's see the implementation of logistic regression in Python. Here
we will be implementing two main types of Logistic Regression:
1. Binomial Logistic regression:
In binomial logistic regression, the target variable can only have two
possible values such as "0" or "1", "pass" or "fail". The sigmoid function is
used for prediction.
We will use the scikit-learn library for this. The example below uses the
breast cancer dataset to train a Logistic Regression model for
classification.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load features and labels, then hold out 20% of the data for testing
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=23)

# A large max_iter gives the solver room to converge on unscaled features
clf = LogisticRegression(max_iter=10000, random_state=0)
clf.fit(X_train, y_train)

# Accuracy on the held-out test set, as a percentage
acc = accuracy_score(y_test, clf.predict(X_test)) * 100
print(f"Logistic Regression model accuracy: {acc:.2f}%")
Output:
Logistic Regression model accuracy: 96.49%
This code uses logistic regression to classify whether a sample from the
breast cancer dataset is malignant or benign.
2. Multinomial Logistic Regression:
The target variable can have 3 or more possible types which are not ordered,
i.e. the types have no quantitative significance, like "disease A" vs "disease B"
vs "disease C".
In this case, the softmax function is used in place of the sigmoid
function. The softmax function for K classes is:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Here K represents the number of elements in the
vector z, and i, j iterate over all the elements in the vector.
Then the probability for class c will be:
$$P(Y = c \mid X = x) = \frac{e^{w_c \cdot x + b_c}}{\sum_{k=1}^{K} e^{w_k \cdot x + b_k}}$$
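As a small numeric illustration of the softmax formula, the sketch below turns a vector of class scores into probabilities. The score values are made up, and subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Convert K real-valued scores into K probabilities that sum to 1."""
    shift = max(scores)                            # stability: avoids large exponents
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]                           # hypothetical scores z for K = 3 classes
probs = softmax(scores)
predicted_class = probs.index(max(probs))          # class with the highest probability

print("probabilities:", [round(p, 3) for p in probs])
print("predicted class index:", predicted_class)
```

The class with the largest score always receives the largest probability, so the prediction is simply the index of the maximum.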
Below is an example of implementing multinomial logistic regression using
the Digits dataset from scikit-learn:
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model, metrics

# Load the 8x8 digit images as feature vectors with labels 0-9
digits = datasets.load_digits()
X = digits.data
y = digits.target

# Hold out 40% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

reg = linear_model.LogisticRegression(max_iter=10000, random_state=0)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
acc = metrics.accuracy_score(y_test, y_pred) * 100
print(f"Logistic Regression model accuracy: {acc:.2f}%")
Output:
Logistic Regression model accuracy: 96.66%
This model is used to predict one of 10 digits (0-9) based on the image
features.
How to Evaluate a Logistic Regression Model?
Evaluating the logistic regression model helps assess its performance and
ensure it generalizes well to new, unseen data. The following metrics are
commonly used:
1. Accuracy: Accuracy provides the proportion of correctly classified
instances.
$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total}}$$
2. Precision: Precision focuses on the accuracy of positive predictions.
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
3. Recall (Sensitivity or True Positive Rate): Recall measures the
proportion of correctly predicted positive instances among all actual
positive instances.
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
4. F1 Score: F1 score is the harmonic mean of precision and recall.
$$\text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
5. Area Under the Receiver Operating Characteristic Curve (AUC-ROC):
The ROC curve plots the true positive rate against the false
positive rate at various thresholds. AUC-ROC measures the area under
this curve, which provides an aggregate measure of a model's
performance across different classification thresholds.
6. Area Under the Precision-Recall Curve (AUC-PR): Similar to AUC-ROC,
AUC-PR measures the area under the precision-recall curve,
providing a summary of a model's performance across different
precision-recall trade-offs.
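The first four metrics can be computed directly from the counts of true and false positives and negatives. The sketch below does this in plain Python for a small, made-up set of labels and predictions. The function names are assumptions, and the two area-under-curve metrics are left out because they need predicted probabilities rather than hard labels.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # guard against division by zero
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions for ten samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

results = classification_metrics(y_true, y_pred)
print(results)
```

In practice, scikit-learn's `sklearn.metrics` module provides `accuracy_score`, `precision_score`, `recall_score`, `f1_score` and `roc_auc_score` for the same purpose.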
What is a Neural Network?
Last Updated : 07 Oct, 2025


Neural networks are machine learning models that mimic the complex
functions of the human brain. These models consist of interconnected
nodes or neurons that process data, learn patterns and enable tasks such
as pattern recognition and decision-making.

Neural networks are capable of learning and identifying patterns directly
from data without pre-defined rules. These networks are built from several
key components:
• Neurons: The basic units that receive inputs. Each neuron is governed
by a threshold and an activation function.
• Connections: Links between neurons that carry information, regulated
by weights and biases.
• Weights and Biases: These parameters determine the strength and
influence of connections.
• Propagation Functions: Mechanisms that help process and transfer
data across layers of neurons.
• Learning Rule: The method that adjusts weights and biases over time
to improve accuracy.
Learning in neural networks follows a structured, three-stage
process:
1. Input Computation: Data is fed into the network.
2. Output Generation: Based on the current parameters, the network
generates an output.
3. Iterative Refinement: The network refines its output by adjusting
weights and biases, gradually improving its performance on diverse
tasks.
In an adaptive learning environment:
• The neural network is exposed to a simulated scenario or dataset.
• Parameters such as weights and biases are updated in response to new
data or conditions.
• With each adjustment, the network’s response evolves allowing it to
adapt effectively to different tasks or environments.
The image illustrates the analogy between a biological neuron and an
artificial neuron, showing how inputs are received and processed to
produce outputs in both systems.
Importance of Neural Networks
• Identify Complex Patterns: Recognize intricate structures and
relationships in data; adapt to dynamic and changing environments.
• Learn from Data: Handle vast datasets efficiently; improve performance
with experience and retraining.
• Drive Key Technologies: Power natural language processing (NLP);
enable self-driving vehicles; support automated decision-making
systems.
• Boost Efficiency: Streamline workflows and processes; enhance
productivity across industries.
• Backbone of AI: Serve as the core driver of artificial intelligence
progress; continue shaping the future of technology and innovation.
Layers in Neural Network Architecture
Layers
1. Input Layer: This is where the network receives its input data. Each
input neuron in the layer corresponds to a feature in the input data.
2. Hidden Layers: These layers perform most of the computational heavy
lifting. A neural network can have one or multiple hidden layers. Each
layer consists of units (neurons) that transform the inputs into something
that the output layer can use.
3. Output Layer: The final layer produces the output of the model. The
format of these outputs varies depending on the specific task, such as
classification or regression.
Working of Neural Networks
1. Forward Propagation
When data is input into the network, it passes through the network in the
forward direction, from the input layer through the hidden layers to the
output layer. This process is known as forward propagation. Here’s what
happens during this phase:
1. Linear Transformation: Each neuron in a layer receives inputs which
are multiplied by the weights associated with the connections. These
products are summed together and a bias is added to the sum. This can be
represented mathematically as:
$$z = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b$$
where
• w represents the weights
• x represents the inputs
• b is the bias
2. Activation: The result of the linear transformation (denoted as z) is
then passed through an activation function. The activation function is
crucial because it introduces non-linearity into the system, enabling the
network to learn more complex patterns. Popular activation functions
include ReLU, sigmoid and tanh.
2. Backpropagation
After forward propagation, the network evaluates its performance using a
loss function which measures the difference between the actual output and
the predicted output. The goal of training is to minimize this loss. This is
where backpropagation comes into play:
• Loss Calculation: The network calculates the loss which provides a
measure of error in the predictions. The loss function could vary;
common choices are mean squared error for regression tasks or cross-
entropy loss for classification.
• Gradient Calculation: The network computes the gradients of the loss
function with respect to each weight and bias in the network. This
involves applying the chain rule of calculus to find out how much each
part of the output error can be attributed to each weight and bias.
• Weight Update: Once the gradients are calculated, the weights and
biases are updated using an optimization algorithm like stochastic
gradient descent (SGD). The weights are adjusted in the opposite
direction of the gradient to minimize the loss. The size of the step taken
in each update is determined by the learning rate.
3. Iteration
This process of forward propagation, loss calculation, backpropagation and
weight update is repeated for many iterations over the dataset. Over time,
this iterative process reduces the loss and the network's predictions
become more accurate.
Through these steps, neural networks can adapt their parameters to better
approximate the relationships in the data, thereby improving their
performance on tasks such as classification, regression or any other
predictive modeling.
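The loop of forward propagation, loss calculation, gradient calculation and weight update can be sketched for the smallest possible network: a single sigmoid neuron. Everything specific here is an assumption for illustration: the toy AND-gate data, the learning rate, the number of epochs, and the use of full-batch gradient descent rather than the stochastic variant mentioned above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy dataset: the logical AND of two binary inputs.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0, 0, 0, 1]


def forward(w, b, x):
    """Forward propagation: linear transformation followed by the activation."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


def mean_loss(w, b):
    """Cross-entropy loss averaged over the dataset."""
    return sum(-(t * math.log(forward(w, b, x)) + (1 - t) * math.log(1 - forward(w, b, x)))
               for x, t in zip(X, y)) / len(X)


w, b = [0.0, 0.0], 0.0
learning_rate = 1.0
initial_loss = mean_loss(w, b)

for epoch in range(2000):
    grad_w, grad_b = [0.0, 0.0], 0.0
    for x, t in zip(X, y):
        # Chain rule: for sigmoid + cross-entropy, dLoss/dz simplifies to (p - t).
        dz = forward(w, b, x) - t
        grad_w[0] += dz * x[0] / len(X)
        grad_w[1] += dz * x[1] / len(X)
        grad_b += dz / len(X)
    # Weight update: step in the opposite direction of the gradient.
    w = [w[0] - learning_rate * grad_w[0], w[1] - learning_rate * grad_w[1]]
    b -= learning_rate * grad_b

final_loss = mean_loss(w, b)
predictions = [1 if forward(w, b, x) >= 0.5 else 0 for x in X]
print("loss before:", initial_loss, "loss after:", final_loss)
print("predictions:", predictions)
```

Networks with hidden layers follow the same pattern, except that the chain rule is applied layer by layer to propagate the error backwards.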
Example of Email Classification
Let's consider a record of an email dataset:
Email ID | Email Content | Sender | Subject Line | Label
1 | "Get free gift cards now!" | spam@[Link] | "Exclusive Offer" | 1

To classify this email, we will create a feature vector based on the analysis
of keywords such as "free", "win" and "offer".
The feature vector of the record can be presented as:
• "free": Present (1)
• "win": Absent (0)
• "offer": Present (1)
How Neurons Process Data in a Neural Network
In a neural network, input data is passed through multiple layers, including
one or more hidden layers. Each neuron in these hidden layers performs
several operations, transforming the input into a usable output.
1. Input Layer: The input layer contains 3 nodes that indicate the
presence of each keyword.
2. Hidden Layer: The input vector is passed through the hidden layer.
Each neuron in the hidden layer performs two primary operations: a
weighted sum followed by an activation function.
Weights:
• Neuron H1: [0.5,−0.2,0.3]
• Neuron H2: [0.4,0.1,−0.5]
Input Vector: [1,0,1]
Weighted Sum Calculation
• For H1: (1×0.5)+(0×−0.2)+(1×0.3)=0.5+0+0.3=0.8
• For H2: (1×0.4)+(0×0.1)+(1×−0.5)=0.4+0−0.5=−0.1
Activation Function
Here we will use the ReLU activation function:
• H1 Output: ReLU(0.8) = 0.8
• H2 Output: ReLU(−0.1) = 0
3. Output Layer: The activated values from the hidden neurons are sent to
the output neuron where they are again processed using a weighted sum
and an activation function.
• Output Weights: [0.7, 0.2]
• Input from Hidden Layer: [0.8, 0]
• Weighted Sum: (0.8×0.7)+(0×0.2)=0.56+0=0.56
• Activation (Sigmoid): σ(0.56) = 1 / (1 + e^(−0.56)) ≈ 0.636
4. Final Classification:
• The output value of approximately 0.636 indicates the probability of the
email being spam.
• Since this value is greater than 0.5, the neural network classifies the
email as spam (1).

Neural Network for Email Classification Example
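The arithmetic of the email example can be reproduced with a short script. It uses the weights and input vector given above and, like the worked example, assumes no bias terms. The helper names are assumptions for this sketch.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))


features = [1, 0, 1]                       # "free" present, "win" absent, "offer" present
hidden_weights = {"H1": [0.5, -0.2, 0.3],  # weights from the example (no biases)
                  "H2": [0.4, 0.1, -0.5]}
output_weights = [0.7, 0.2]

# Hidden layer: weighted sum followed by ReLU for each neuron.
hidden_outputs = [relu(weighted_sum(w, features)) for w in hidden_weights.values()]

# Output layer: weighted sum of hidden outputs followed by the sigmoid.
output_sum = weighted_sum(output_weights, hidden_outputs)
spam_probability = sigmoid(output_sum)
label = 1 if spam_probability > 0.5 else 0

print("hidden outputs:", hidden_outputs)
print("spam probability:", round(spam_probability, 3), "label:", label)
```

Changing a single input, for example setting the "offer" feature to 0, is a quick way to see how each keyword influences the final probability.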


Learning of a Neural Network
1. Learning with Supervised Learning
In supervised learning, a neural network learns from labeled input-output
pairs provided by a teacher. The network generates outputs based on
inputs and by comparing these outputs to the known desired outputs, an
error signal is created. The network iteratively adjusts its parameters to
minimize errors until it reaches an acceptable performance level.
2. Learning with Unsupervised Learning
Unsupervised learning involves data without labeled output variables. The
primary goal is to understand the underlying structure of the input data (X).
Unlike supervised learning, there is no instructor to guide the process.
Instead, the focus is on modeling data patterns and relationships, with
techniques like clustering and association commonly used.
3. Learning with Reinforcement Learning
Reinforcement learning enables a neural network to learn through
interaction with its environment. The network receives feedback in the form
of rewards or penalties, guiding it to find an optimal policy or strategy that
maximizes cumulative rewards over time. This approach is widely used in
applications like gaming and decision-making.
Types of Neural Networks
Commonly used types of neural networks include:
• Feedforward Networks: It is a simple artificial neural network
architecture in which data moves from input to output in a single
direction.
• Single-layer Perceptron: It has one layer and it applies weights, sums
inputs and uses activation to produce output.
• Multilayer Perceptron (MLP): It is a type of feedforward neural network
with three or more layers, including an input layer, one or more hidden
layers and an output layer. It uses nonlinear activation functions.
• Convolutional Neural Network (CNN): It is designed for image
processing. It uses convolutional layers to automatically learn features
from input images, enabling effective image recognition and
classification.
• Recurrent Neural Network (RNN): Handles sequential data using
feedback loops to retain context over time.
• Long Short-Term Memory (LSTM): A type of RNN with memory cells
and gates to handle long-term dependencies and avoid vanishing
gradients.
K-Nearest Neighbor(KNN) Algorithm
Last Updated : 23 Aug, 2025


K-Nearest Neighbors (KNN) is a supervised machine learning algorithm
generally used for classification but can also be used for regression tasks.
It works by finding the "k" closest data points (neighbors) to a given input
and makes a prediction based on the majority class (for classification) or
the average value (for regression). Since KNN makes no assumptions
about the underlying data distribution, it is a non-parametric and
instance-based learning method.

K-Nearest Neighbors is also called a lazy learner algorithm because it
does not learn from the training set immediately. Instead, it stores the entire
dataset and performs computations only at the time of classification.
For example, consider the following table of data points containing two
features:

[Figure: KNN algorithm working visualization]


The new point is classified as Category 2 because most of its closest
neighbors are blue squares. KNN assigns the category based on the
majority of nearby points. The image shows how KNN predicts the category
of a new data point based on its closest neighbours.
• The red diamonds represent Category 1 and the blue squares represent
Category 2.
• The new data point checks its closest neighbors (circled points).
• Since the majority of its closest neighbors are blue squares (Category 2)
KNN predicts the new data point belongs to Category 2.
KNN works by using proximity and majority voting to make predictions.
What is 'K' in K Nearest Neighbour?
In the k-Nearest Neighbours algorithm k is just a number that tells the
algorithm how many nearby points or neighbors to look at when it makes a
decision.
Example: Imagine you're deciding which fruit it is based on its shape and
size. You compare it to fruits you already know.
• If k = 3, the algorithm looks at the 3 closest fruits to the new one.
• If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the
new fruit is an apple because most of its neighbors are apples.
How to choose the value of k for KNN Algorithm?
• The value of k in KNN decides how many neighbors the algorithm looks
at when making a prediction.
• Choosing the right k is important for good results.
• If the data has lots of noise or outliers, using a larger k can make the
predictions more stable.
• But if k is too large the model may become too simple and miss
important patterns and this is called underfitting.
• So k should be picked carefully based on the data.
Statistical Methods for Selecting k
• Cross-Validation: A good way to find the best value of k is k-fold
cross-validation. (The number of folds is a separate setting from the k in
KNN.) The dataset is divided into equal parts, or folds. For each candidate
k, the model is trained on all folds but one and tested on the held-out
fold, and this is repeated until every fold has served as the test set. The
k value that gives the highest average accuracy across these tests is
usually the best one to use.
• Elbow Method: In Elbow Method we draw a graph showing the error
rate or accuracy for different k values. As k increases the error usually
drops at first. But after a certain point error stops decreasing quickly.
The point where the curve changes direction and looks like an "elbow" is
usually the best choice for k.
• Odd Values for k: It’s a good idea to use an odd number for k
especially in classification problems. This helps avoid ties when deciding
which class is the most common among the neighbors.
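The cross-validation procedure above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the data set is made up (two clusters plus one deliberately mislabeled point), the helper names `knn_predict` and `cv_accuracy` are hypothetical, and the number of folds is called `n_folds` to keep it distinct from the number of neighbors `k`.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(data, k, n_folds):
    """Average accuracy of KNN with k neighbors over n_folds held-out folds."""
    fold_scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]  # every n_folds-th point is held out
        train = [d for i, d in enumerate(data) if i % n_folds != fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        fold_scores.append(correct / len(test))
    return sum(fold_scores) / n_folds

# Made-up data: cluster "A" near (1, 1), cluster "B" near (5, 5),
# plus one noisy point at (1.1, 1.4) that carries the label "B".
data = [((1.0, 1.0), "A"), ((5.0, 5.0), "B"), ((1.2, 0.8), "A"), ((5.2, 4.8), "B"),
        ((0.8, 1.2), "A"), ((4.8, 5.2), "B"), ((1.5, 1.5), "A"), ((5.5, 5.5), "B"),
        ((1.1, 1.4), "B"), ((5.1, 5.4), "B"), ((0.9, 0.7), "A"), ((4.9, 4.7), "B")]

# Odd candidate values avoid ties in this two-class problem.
scores = {k: cv_accuracy(data, k, n_folds=4) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Plotting `scores` against k would give the curve used by the elbow method.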
Distance Metrics Used in KNN Algorithm
KNN uses a distance metric to identify the nearest neighbors, which are then
used for classification or regression. The most common distance metrics
are:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two
points in a plane or space. You can think of it like the shortest path you
would walk if you were to go directly from one point to another.
$$\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}$$
2. Manhattan Distance
This is the total distance you would travel if you could only move along
horizontal and vertical lines like a grid or city streets. It’s also called
"taxicab distance" because a taxi can only drive along the grid-like streets
of a city.
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both
Euclidean and Manhattan distances as special cases.
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$$
From the formula above, when p=2, it becomes the same as the Euclidean
distance formula and when p=1, it turns into the Manhattan distance
formula. Minkowski distance is essentially a flexible formula that can
represent either Euclidean or Manhattan distance depending on the value
of p.
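The relationship between the three metrics can be checked with a short sketch. The function name `minkowski` and the sample points are illustrative assumptions; the formula is the Minkowski distance with absolute differences.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b = (0, 0), (3, 4)

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

With p=1 the result is the grid distance between the points, and with p=2 it is the straight-line distance.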
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of
similarity where it predicts the label or value of a new data point by
considering the labels or values of its K nearest neighbors in the training
dataset.
Step 1: Selecting the optimal value of K
• K represents the number of nearest neighbors to consider when making
a prediction.
Step 2: Calculating distance
• Euclidean distance is widely used to measure the similarity between the
target point and the training data points. The distance is calculated
between the target point and every data point in the dataset.
Step 3: Finding Nearest Neighbors
• The k data points with the smallest distances to the target point are
nearest neighbors.
Step 4: Voting for Classification or Taking Average for
Regression
• When you want to classify a data point into a category like spam or not
spam, the KNN algorithm looks at the K closest points in the dataset.
These closest points are called neighbors. The algorithm then looks at
which category the neighbors belong to and picks the one that appears
the most. This is called majority voting.
• In regression, the algorithm still looks for the K closest points, but
instead of voting for a class it takes the average of the values of those
K neighbors. This average is the predicted value for the new point.
The visualization shows how a test point is classified based on its nearest
neighbors. As the test point moves, the algorithm identifies the closest k
data points (5 in this case) and assigns the test point the majority class
label, which is the grey class here.
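The four steps can be put together in a minimal from-scratch sketch. The data sets, labels and function names below are made up for illustration, and K is simply fixed at 3 rather than tuned.

```python
from collections import Counter
import math

def k_nearest(train, query, k):
    """Steps 2 and 3: compute Euclidean distances and keep the k closest points."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k):
    """Step 4 (classification): majority vote among the k nearest labels."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Step 4 (regression): average of the k nearest target values."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / k

# Made-up (features, label) pairs for a spam / not spam classifier.
emails = [((1, 1), "not spam"), ((1, 2), "not spam"), ((2, 1), "not spam"),
          ((6, 6), "spam"), ((7, 6), "spam"), ((6, 7), "spam")]
label = knn_classify(emails, (5, 5), k=3)

# Made-up (features, value) pairs for a one-feature regression problem.
houses = [((1,), 100.0), ((2,), 120.0), ((3,), 140.0), ((10,), 400.0)]
price = knn_regress(houses, (2,), k=3)

print(label, price)
```

The query (5, 5) sits next to the three spam points, so all three votes go to "spam". For the regression query, the three nearest values are 120, 100 and 140, whose average is 120.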

Bayesian methods for predictive modeling use probability to update
beliefs as new data arrives, combining prior knowledge with observed data using Bayes'
Theorem to produce a posterior distribution. This approach provides a full quantification of
uncertainty in predictions by modeling parameters as probability distributions rather than
point estimates. This results in a predictive distribution that is a weighted average of
predictions from all possible models, leading to more reliable and nuanced forecasts.
Core concepts

• Prior belief ($P(A)$): The initial probability or belief about a model's parameters
before seeing any new data.
• Likelihood ($P(B \mid A)$): The probability of the observed data given a specific set
of model parameters.
• Posterior distribution ($P(A \mid B)$): The updated belief about the model parameters
after combining the prior and the likelihood, calculated using Bayes' Theorem.
• Bayes' Theorem: The formula that connects these components:

$$P(A \mid B) = \frac{P(A) \times P(B \mid A)}{P(B)}$$

• Uncertainty quantification: Unlike traditional methods that provide a single point
estimate, Bayesian models provide a full probability distribution, which indicates the
uncertainty surrounding the prediction.
• Bayesian model averaging: To get a final prediction, all possible models are
averaged together, weighted by their posterior probability, to create a more robust
predictive distribution that accounts for uncertainty in the model itself.
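As a small illustration of how prior, likelihood, posterior and model averaging fit together, the sketch below applies Bayes' Theorem to three candidate models. Everything concrete here is an assumption made for the example: each "model" is a possible heads probability for a coin, the prior weights are invented, and the data are 6 heads in 10 flips.

```python
def posterior_over_models(prior, likelihood):
    """Bayes' Theorem over a discrete set of models: P(A|B) = P(A) * P(B|A) / P(B)."""
    unnormalized = {m: prior[m] * likelihood[m] for m in prior}
    evidence = sum(unnormalized.values())  # P(B), the normalizing constant
    return {m: w / evidence for m, w in unnormalized.items()}

heads, tails = 6, 4  # assumed observations

# Prior belief P(A): each key is a candidate heads probability (a "model").
prior = {0.4: 0.25, 0.5: 0.50, 0.6: 0.25}

# Likelihood P(B|A): probability of the observed flip sequence under each model.
likelihood = {theta: theta ** heads * (1 - theta) ** tails for theta in prior}

posterior = posterior_over_models(prior, likelihood)

# Bayesian model averaging: weight each model's prediction by its posterior.
predictive_heads = sum(posterior[theta] * theta for theta in posterior)
print(posterior, predictive_heads)
```

The predictive probability is not the answer of any single model but a posterior-weighted blend of all three.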

Key advantages

• Incorporates prior knowledge: Can leverage existing information in the form of a
prior distribution, which is useful when data is scarce or ambiguous.
• Handles uncertainty: Explicitly models uncertainty at all levels, from the parameters
to the model structure, leading to more reliable predictions.
• Provides a full picture: Produces a probability distribution for predictions, not just a
single value, which allows for a better understanding of risk and confidence.
• Flexible framework: Applicable to a wide range of problems, from simple linear
regression to complex machine learning models.

Example application
In Bayesian regression, instead of getting a single "best" slope, the model outputs a
probability distribution for the slope, showing how likely it is to be within a certain range.
For example, if you start with a belief that a coin is fair (prior), and then flip it and get 60
heads in 100 flips (likelihood), you would use Bayes' theorem to update your belief to a
posterior that shows a higher probability of the coin being biased towards heads.
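The coin example can be worked through with a conjugate Beta prior, which is one common way (not the only one) to encode a belief about a coin. The Beta(10, 10) prior below is an assumption standing in for "the coin is probably fair"; the data are the 60 heads in 100 flips from the text.

```python
import math

def beta_update(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing coin flips (conjugate update)."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_std(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return math.sqrt(alpha * beta / (total ** 2 * (total + 1)))

prior = (10, 10)  # assumed prior, centered on a fair coin
posterior = beta_update(*prior, heads=60, tails=40)

print("prior mean:", beta_mean(*prior), "std:", beta_std(*prior))
print("posterior mean:", beta_mean(*posterior), "std:", beta_std(*posterior))
```

The posterior mean moves from 0.5 toward the observed frequency of 0.6, and the standard deviation shrinks, which is the reduced uncertainty that a full posterior distribution makes visible.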


A batch approach to model assessment involves evaluating a model on multiple
inputs grouped together as a single batch, rather than one at a time. This method
optimizes resource usage, reduces overall processing time, and improves efficiency
for non-real-time tasks like large-scale model evaluation or inference. It is
particularly useful for improving throughput and reducing the overhead associated
with repeated individual requests.

How it works
• Grouping inputs: Instead of sending one data point to the model for evaluation,
similar inputs are collected and processed together as a single batch.

• Concurrent processing: Modern hardware like GPUs can process all inputs in the
batch concurrently, significantly increasing efficiency.

• Reduced overhead: By handling a single batch, the system can reduce the
overhead that comes from starting a new individual task for each input.
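The grouping step can be sketched as follows. The `toy_model` function is a hypothetical stand-in for a real model that scores a whole batch per call, and the inputs and labels are made up; the point is only that the model is called once per batch instead of once per input.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def toy_model(batch):
    """Hypothetical model: one call scores a whole batch (predicts 1 when x > 0)."""
    return [1 if x > 0 else 0 for x in batch]

def evaluate_in_batches(inputs, labels, batch_size):
    """Return (accuracy, number of model calls) for a batched evaluation run."""
    correct, calls = 0, 0
    for xs, ys in zip(batches(inputs, batch_size), batches(labels, batch_size)):
        predictions = toy_model(xs)  # one call handles the whole batch
        calls += 1
        correct += sum(p == y for p, y in zip(predictions, ys))
    return correct / len(inputs), calls

inputs = [-2, -1, 3, 4, -5, 6, 7, -8, 9, 10]
labels = [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]

accuracy, model_calls = evaluate_in_batches(inputs, labels, batch_size=4)
print(accuracy, model_calls)
```

Ten inputs with a batch size of 4 need only three model calls, compared with ten calls when inputs are sent one at a time.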

Benefits
• Efficiency: It significantly increases the throughput of the model, meaning it can
process more data in the same amount of time.

• Speed: For tasks that don't need an immediate response, such as evaluating a large
dataset, batch processing can drastically shorten the overall duration.

• Resource optimization: It keeps hardware, like CPUs and GPUs, more consistently
occupied, leading to better utilization.
• Cost-effectiveness: By improving efficiency and throughput, batch processing can
lead to substantial cost savings, especially for large-scale operations.

Use cases
• Model inference: Evaluating a pre-trained model on a set of images or text inputs at
once.

• Model evaluation: Running a new model version against a test dataset in a single,
large evaluation job.

• Data preprocessing: Performing feature extraction or data transformation on a
large collection of data points together.

• Batch experiments: Running experiments with different model configurations or
prompts to see which performs best.

"Percent correct classification" is a common term for accuracy, a key metric used
to evaluate the performance of a classification model. It is the proportion of the
total number of predictions that were correct, whether the actual outcome was
positive or negative.

Calculation
Accuracy is calculated using the following formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Where:

• TP = True Positives (correctly predicted positive outcomes)

• TN = True Negatives (correctly predicted negative outcomes)

• FP = False Positives (incorrectly predicted positive outcomes, or Type I error)

• FN = False Negatives (incorrectly predicted negative outcomes, or Type II error)

When to use it
Accuracy is an intuitive and easy-to-understand metric. However, it has limitations,
particularly with imbalanced datasets. For example, if a disease is very rare
(affecting 1% of the population), a simple model that predicts no one has the disease
will have 99% accuracy, but it is useless because it fails to identify any positive
cases (the people who actually have the disease).

In such cases, other metrics like precision, recall (sensitivity), and the F1
score (which can be derived from a confusion matrix) provide a more detailed and
useful breakdown of the model's performance.
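The rare-disease example can be reproduced with a few lines of arithmetic. The population size of 1,000 is an assumption made for the illustration; the 1% prevalence and the "predict no disease for everyone" model come from the text.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Proportion of actual positives that the model found (sensitivity)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Assumed population of 1,000 people, 1% of whom have the disease.
# The model predicts "no disease" for everyone.
tp, fn = 0, 10    # all 10 sick people are missed
tn, fp = 990, 0   # all 990 healthy people are labeled correctly

acc = accuracy(tp, tn, fp, fn)
rec = recall(tp, fn)
print(acc, rec)
```

Accuracy comes out at 99% while recall is zero, which is why accuracy alone is misleading on imbalanced data.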

The "rank order approach to model assessment" can refer to several techniques,
such as using human judgment to rank a set of items (like student essays or search
results) by quality, or using a statistical method like Learning to Rank (LTR) in
machine learning to train a model to order items based on a relevance score. In a
broader sense, it can also describe a survey method where respondents rank options
in order of preference or importance to assess user priorities or preferences.

Rank ordering based on human judgment


• Process: Experts or judges are asked to place items, such as student work,
research papers, or recordings, into a rank order based on perceived quality or some
other attribute.

• Application: Used in educational assessment to create a scale of perceived quality
for a set of essays or oral exams, where items are ranked against each other by
multiple judges.

• Analysis: The judgments are used to create a measurement scale, and analysis can
include calculating fit statistics to see how consistent each judge's ranking was with
the overall estimated measures.

Machine learning: Learning to Rank (LTR)


• Process: A supervised machine learning technique used to train a model to order
items to reflect their relevance to a query.

• Application: Widely used in information retrieval systems like search engines and
recommendation platforms to improve the accuracy of results presented to users.

• Approaches: LTR models can be categorized into three types:


o Pointwise: Scores each item individually, then orders items by score.

o Pairwise: Ranks pairs of items relative to each other.

o Listwise: Orders an entire list of items at once, and is often more effective.

Rank order scaling in surveys


• Process: Respondents are asked to rank a set of items, such as product features,
brands, or ideas, in order of preference or importance.

• Application: Can be used by companies to understand what features users value
most, or by organizations to prioritize changes based on customer feedback.

• Benefit: This method provides deeper insights than simple rating scales and is often
easy for respondents to answer, leading to thoughtful responses and better decision-
making.

Assessing regression models involves using metrics like R-squared, Mean Squared Error
(MSE), and Mean Absolute Error (MAE) to evaluate how well the model predicts a
continuous outcome. Key metrics include R-squared, which indicates the proportion of
variance in the dependent variable explained by the model, and adjusted R-squared, which
accounts for the number of predictors to prevent overfitting. Other important measures are
MAE, MSE, and Root Mean Squared Error (RMSE), which quantify the average magnitude
of prediction errors.
Common assessment metrics
• R-squared: A statistical measure of how close the data points are to the fitted regression line.

o Typically ranges from 0 to 1, where 1 indicates a perfect fit and 0 indicates the
model is no better than always predicting the mean of the target.
o Represents the proportion of the variance in the dependent variable that is
predictable from the independent variable(s).
• Adjusted R-squared: A modified version of R-squared that adjusts for the number of
predictors in the model.
o It increases only if the new term improves the model more than would be
expected by chance.
o More reliable for comparing models with different numbers of independent
variables.
• Mean Absolute Error (MAE): The average of the absolute differences between
predicted and actual values.
o Provides a linear score of the accuracy of a model.
• Mean Squared Error (MSE): The average of the squared errors between predicted
and actual values.
o Punishes larger errors more heavily than smaller ones.
• Root Mean Squared Error (RMSE): The square root of the MSE.
o Has the same units as the target variable, which makes it more interpretable
than MSE.
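The metrics above can be computed directly from their definitions. This is a minimal sketch on made-up values; the function name `regression_metrics` and the assumption of a single predictor are illustrative only.

```python
import math

def regression_metrics(y_true, y_pred, n_predictors):
    """Compute MAE, MSE, RMSE, R-squared and adjusted R-squared."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}

# Made-up actual and predicted values for a model with one predictor.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.0, 5.0, 8.0, 9.0, 11.0]

metrics = regression_metrics(y_true, y_pred, n_predictors=1)
print(metrics)
```

Two of the five predictions are off by one unit, so MAE and MSE are both 0.4, R-squared is 0.95, and adjusted R-squared is slightly lower because it penalizes the predictor count.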

Other considerations

• F-test: Evaluates the overall statistical significance of the regression model. A
significant F-test indicates that at least one predictor variable has a reliable
relationship with the response variable.
• Residual analysis: Examining the residuals (the differences between actual and
predicted values) can help identify patterns and issues in the model, such as
heteroscedasticity or non-linearity.
• Theoretical relevance: In some cases, such as social sciences, it is important to
assess whether the model includes variables that are theoretically relevant to the
outcome, even if they do not have a statistically significant effect.

Common questions


Backpropagation is a critical process in neural networks that involves three main steps: loss calculation, gradient calculation, and weight update. After forward propagation, the network first calculates the loss, measuring the error between predicted and actual outputs. It then computes the gradients of the loss function with respect to each network parameter, using these gradients to adjust the weights and biases in the opposite direction of the gradient via an optimization algorithm like stochastic gradient descent. This iterative process minimizes the loss and enhances model performance by fine-tuning network parameters.

Decision Trees face computational challenges with large datasets due to the complexity involved in building and pruning deep trees, which can be computationally intensive. This challenge arises from the need for extensive calculations at each node to determine the optimal splits and the subsequent complexity of managing a large decision structure. Such computational demands can slow down the model's deployment and evaluation, potentially limiting its applicability in real-time or resource-constrained environments.

In the medical field, Decision Trees assist in diagnosing diseases by analyzing clinical data such as glucose levels, BMI, and blood pressure to predict the likelihood of a condition like diabetes. The predictive mechanism involves constructing a tree model that makes decisions based on input characteristics, classifying patients into categories like diabetic or non-diabetic. This aids in early diagnosis and treatment. Benefits include the method's interpretability and simplicity, making it easy for healthcare professionals to understand the reasoning behind decisions and potentially communicate them to patients.

The sigmoid function in logistic regression is used to convert the linear combination of input features into a probability value between 0 and 1, which is crucial for classification tasks. It forms an 'S' shaped curve known as the logistic curve, and the output probabilities are typically compared with a threshold (usually 0.5) to classify the input into one of two categories.

Neural networks handle complex pattern recognition through their layered architecture, consisting of an input layer, multiple hidden layers, and an output layer. Each neuron or unit in these layers performs computations that transform inputs using weights and activation functions, capturing intricate patterns and dynamics within the data. The architecture, specifically the number and configuration of layers and neurons, determines the model's capacity to learn complex relationships, with deeper networks (more hidden layers) typically identifying more abstract features. This design allows neural networks to adapt to various tasks like image recognition and natural language processing by efficiently processing and classifying complex data.

Pruning enhances the performance of Decision Trees by addressing overfitting, which occurs when a tree becomes too complex and starts to memorize training data instead of learning general patterns. This complexity can lead to poor performance on new, unseen data. Pruning reduces the tree's complexity by removing branches that have little predictive power, allowing the model to better generalize and improving its performance on new datasets.

The main criteria used for splitting data in Decision Trees are Gini Impurity and Entropy. Gini Impurity measures how impure a node is, referring to how mixed the data is at the node; a lower Gini Impurity suggests a better split as it indicates clear separation into categories. Entropy, on the other hand, measures the amount of uncertainty or disorder within the data, with the objective of reducing entropy by choosing splits that provide the most information about the target variable.

Decision Trees offer several advantages, including ease of understanding and interpretation due to their visual nature, versatility for both classification and regression tasks, handling of non-linear relationships, and ability to work without feature scaling or normalization. However, these advantages come with limitations such as the tendency to overfit with deep trees, instability with small changes in data, potential bias towards features with many categories, and difficulty capturing complex interactions, which make them less effective for certain types of data.

Reinforcement learning in neural networks is applied by allowing the network to learn optimal actions through interactions with its environment. Feedback is provided in the form of rewards or penalties, guiding the network to maximize cumulative rewards. It particularly excels in scenarios where decision-making involves sequential, dynamic environments, such as gaming and autonomous vehicles, where the response actions need to adaptively change based on the current state and past learnings to achieve long-term benefits.

The key assumptions of logistic regression include the following: observations are independent, the dependent variable is binary for binary logistic regression, a linear relationship exists between independent variables and log odds, there are no extreme outliers, and a sufficiently large sample size is used. These assumptions are crucial to ensure accurate and reliable model estimates, thereby affecting the model's predictive performance and generalizability.
