Unit-1
Q1) Explain the architecture and working of a feed-forward neural network.
A Feedforward Neural Network (FNN) is a type of artificial neural network in which
information flows in a single direction, i.e., from the input layer through the hidden layers to the
output layer, without loops or feedback. It is mainly used for pattern recognition tasks such as
image and speech classification.
For example, in a credit scoring system, a bank can use an FNN that analyzes users'
financial profiles, such as income, credit history and spending habits, to determine their
creditworthiness.
Each piece of information flows through the network's layers, where various calculations are
made to produce a final score.
Structure of a Feedforward Neural Network
Feedforward Neural Networks have a structured layered design where data flows sequentially
through each layer.
1. Input Layer: The input layer consists of neurons that receive the input data. Each neuron
in the input layer represents a feature of the input data.
2. Hidden Layers: One or more hidden layers are placed between the input and output
layers. These layers are responsible for learning the complex patterns in the data. Each
neuron in a hidden layer applies a weighted sum of inputs followed by a non-linear
activation function.
3. Output Layer: The output layer provides the final output of the network. The number of
neurons in this layer corresponds to the number of classes in a classification problem or
the number of outputs in a regression problem.
Each connection between neurons in these layers has an associated weight that is adjusted
during the training process to minimize the error in predictions.
Feed Forward Neural Network
Activation Functions
Activation functions introduce non-linearity into the network enabling it to learn and model
complex data patterns.
Common activation functions include:
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
Tanh: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
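The three functions listed above can be written directly in NumPy. This is a minimal sketch for illustration; the function names and the sample input are assumptions, not part of any library.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes inputs into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes inputs into the range (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    """Rectified linear unit: keeps positive inputs, zeroes out the rest."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])  # arbitrary sample inputs
print("sigmoid:", sigmoid(x))
print("tanh:   ", tanh(x))
print("relu:   ", relu(x))
```

In practice `np.tanh` would normally be used; the explicit form is shown only to mirror the formula.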
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the neurons to
minimize the error between the predicted output and the actual output. This process is
typically performed using backpropagation and gradient descent.
1. Forward Propagation: During forward propagation the input data passes through the
network and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean
Squared Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.
3. Backpropagation: In backpropagation the error is propagated back through the network
to update the weights. The gradient of the loss function with respect to each weight is
calculated and the weights are adjusted using gradient descent.
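The three steps above can be sketched from scratch in NumPy. Everything specific here is an assumption chosen for illustration: the XOR toy data, the 2-4-1 layer sizes, tanh in the hidden layer, sigmoid at the output, MSE as the loss and the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (XOR): 4 examples, 2 features, 1 target each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters of a 2-4-1 network (sizes are arbitrary choices).
W1 = rng.normal(0.0, 1.0, size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0.0, 1.0, size=(4, 1))
b2 = np.zeros((1, 1))
lr = 0.5
losses = []

for step in range(2000):
    # 1. Forward propagation
    h = np.tanh(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # 2. Loss calculation (mean squared error)
    loss = np.mean((y_hat - y) ** 2)
    losses.append(loss)

    # 3. Backpropagation: apply the chain rule layer by layer
    d_yhat = 2.0 * (y_hat - y) / len(X)      # dLoss/dy_hat
    d_z2 = d_yhat * y_hat * (1.0 - y_hat)    # back through the sigmoid
    dW2 = h.T @ d_z2
    db2 = d_z2.sum(axis=0, keepdims=True)
    d_h = d_z2 @ W2.T
    d_z1 = d_h * (1.0 - h ** 2)              # back through tanh
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0, keepdims=True)

    # Gradient descent update of every weight and bias
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print("first loss:", losses[0], "last loss:", losses[-1])
```

Frameworks such as Keras compute these gradients automatically; writing them out once makes the flow of the error signal visible.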
Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the loss function by
iteratively updating the weights in the direction of the negative gradient. Common variants of
gradient descent include:
Batch Gradient Descent: Updates weights after computing the gradient over the entire
dataset.
Stochastic Gradient Descent (SGD): Updates weights for each training example
individually.
Mini-batch Gradient Descent: Updates weights after computing the gradient over a
small batch of training examples.
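The three variants differ only in how many examples feed each update, so one function with a batch-size argument can illustrate all of them. This is a sketch on a noise-free linear regression problem; the data, learning rate, epoch count and function name are assumptions.

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.05, epochs=200, seed=0):
    """Minimize mean squared error of a linear model with (mini-)batch updates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)
            w -= lr * grad                       # step against the gradient
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                                   # noise-free targets

w_batch = gradient_descent(X, y, batch_size=len(X))  # batch: one update per epoch
w_sgd = gradient_descent(X, y, batch_size=1)         # stochastic: one example per update
w_mini = gradient_descent(X, y, batch_size=32)       # mini-batch
print(w_batch, w_sgd, w_mini)
```

With noisy data, smaller batches give noisier but cheaper updates, which is the trade-off the three variants represent.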
Evaluation of Feedforward Neural Network
Evaluating the performance of the trained model involves several metrics:
Accuracy: The proportion of correctly classified instances out of the total instances.
Precision: The ratio of true positive predictions to the total predicted positives.
Recall: The ratio of true positive predictions to the actual positives.
F1 Score: The harmonic mean of precision and recall, providing a balance between the
two.
Confusion Matrix: A table used to describe the performance of a classification model,
showing the true positives, true negatives, false positives and false negatives.
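As a small worked example, the metrics above can be computed by hand from the four confusion-matrix counts. The labels below are made up for illustration, and 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model predictions

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall, "f1:", f1)
```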
Implementation of Feedforward Neural Network
This code demonstrates the process of building, training and evaluating a neural network
model using TensorFlow and Keras to classify handwritten digits from the MNIST dataset.
The model architecture is defined using the Sequential API and consists of:
a Flatten layer to convert the 2D image input into a 1D array
a Dense layer with 128 neurons and ReLU activation
a final Dense layer with 10 neurons and softmax activation to output probabilities for
each digit class.
The model is compiled with:
Adam optimizer
Sparse Categorical Crossentropy loss function
Sparse Categorical Accuracy metric
It is then trained for 5 epochs on the training data.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

# Load the MNIST dataset and scale pixel values to the range [0, 1]
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Build the model
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer=Adam(),
              loss=SparseCategoricalCrossentropy(),
              metrics=[SparseCategoricalAccuracy()])

# Train the model
model.fit(x_train, y_train, epochs=5)

# Evaluate the model on the held-out test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'\nTest accuracy: {test_acc}')
Output:
Test accuracy: 0.9767000079154968
____________________________*********_______________________
Q2) What is an activation function? Explain different types with examples.
An activation function in a neural network is a mathematical function applied to the output
of a neuron. It introduces non-linearity, enabling the model to learn and represent complex
data patterns. Without it, even a deep neural network would behave like a simple linear
regression model.
Activation functions decide whether a neuron should be activated based on the weighted
sum of inputs and a bias term. They also make backpropagation possible by providing
gradients for weight updates.
Activation Functions in neural Networks
Why Non-Linearity is Important
Real-world data is rarely linearly separable.
Non-linear functions allow neural networks to form curved decision boundaries, making
them capable of handling complex patterns (e.g., classifying apples vs. bananas under
varying colors and shapes).
They ensure networks can model advanced problems like image recognition, NLP and
speech processing.
Mathematical Example
Consider a neural network with:
Inputs: i1, i2
Hidden layer: neurons h1 and h2
Output layer: one neuron (output)
Weights: w1, w2, w3, w4, w5, w6
Biases: b1 for hidden layer, b2 for output layer
neural network
The hidden layer outputs are:
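$h_1 = i_1 w_1 + i_2 w_3 + b_1$
$h_2 = i_1 w_2 + i_2 w_4 + b_1$
The output before activation is:
$\text{output} = h_1 w_5 + h_2 w_6 + b_2$
Without activation, these are linear equations, so the output is only a linear function of the inputs.
To introduce non-linearity, we apply a sigmoid activation:
$\sigma(x) = \frac{1}{1 + e^{-x}}$
$\text{final output} = \sigma(h_1 w_5 + h_2 w_6 + b_2)$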
This gives the final output of the network after applying the sigmoid activation function at the
output layer, introducing the desired non-linearity.
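The following sketch reproduces this 2-2-1 network numerically and checks that, without an activation, it collapses into a single linear map. The weight values are random and the bias values are arbitrary; `b1` is the hidden-layer bias and `b2` the output-layer bias, as in the list of biases above.

```python
import numpy as np

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(2, 2))  # [[w1, w2], [w3, w4]]: row = input, column = hidden neuron
w_out = rng.normal(size=2)          # [w5, w6]
b1, b2 = 0.3, -0.1                  # hidden-layer and output-layer biases (arbitrary values)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def network_without_activation(inputs):
    h = inputs @ W_hidden + b1      # h1 = i1*w1 + i2*w3 + b1, h2 = i1*w2 + i2*w4 + b1
    return h @ w_out + b2           # h1*w5 + h2*w6 + b2

def network_with_sigmoid(inputs):
    return sigmoid(network_without_activation(inputs))

# Without activation the two layers are equivalent to one linear layer:
w_equiv = W_hidden @ w_out
b_equiv = b1 * w_out.sum() + b2

inputs = rng.normal(size=(5, 2))    # five hypothetical (i1, i2) pairs
print(np.allclose(network_without_activation(inputs), inputs @ w_equiv + b_equiv))
print(network_with_sigmoid(inputs))
```

The first print shows that stacking linear layers adds no expressive power; the sigmoid version cannot be rewritten as a single linear map in the same way.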
Types of Activation Functions in Deep Learning
1. Linear Activation Function
The linear activation function is a straight line defined by y = x. No matter how many
layers the neural network contains, if they all use linear activation functions the output is a
linear combination of the input.
The range of the output is (−∞, +∞).
The linear activation function is typically used in only one place, the output layer (for example, in regression).
Using linear activation across all layers limits the network's ability to learn complex
patterns.
Linear activation functions are useful for specific tasks but must be combined with
non-linear functions to enhance the neural network's learning and predictive capabilities.
Linear Activation Function or Identity Function returns the input as the output
2. Non-Linear Activation Functions
1. Sigmoid Function
The sigmoid activation function is characterized by its 'S' shape. It is mathematically defined as
$A = \frac{1}{1 + e^{-x}}$. The function is smooth and continuous, which is essential for
gradient-based optimization methods.
It allows neural networks to handle and model complex patterns that linear equations
cannot.
The output ranges between 0 and 1, which makes it useful for binary classification.
The function exhibits a steep gradient when x values are between -2 and 2. This
sensitivity means that small changes in input x can cause significant changes in output y
which is critical during the training process.
Sigmoid or Logistic Activation Function Graph
2. Tanh Activation Function
The tanh function (hyperbolic tangent) is a shifted and rescaled version of the sigmoid,
so its output is centered on zero. It is defined as:
$f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$
Alternatively, it can be expressed using the sigmoid function:
$\tanh(x) = 2\,\sigma(2x) - 1$
Value Range: Outputs values from -1 to +1.
Non-linear: Enables modeling of complex data patterns.
Use in Hidden Layers: Commonly used in hidden layers due to its zero-centered output,
facilitating easier learning for subsequent layers.
Tanh Activation Function
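The relationship between tanh and the sigmoid stated above can be checked numerically. This is only a quick sanity check on a handful of sample points, which are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)                # arbitrary sample points
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0

# Both columns should agree up to floating-point rounding.
for xi, a, b in zip(x, tanh_from_sigmoid, np.tanh(x)):
    print(f"x={xi:+.1f}  2*sigmoid(2x)-1={a:+.6f}  tanh(x)={b:+.6f}")
```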
3. ReLU (Rectified Linear Unit) Function
ReLU activation is defined by $A(x) = \max(0, x)$. This means that if the input x is positive,
ReLU returns x; if the input is negative, it returns 0.
Value Range: [0,∞), meaning the function only outputs non-negative values.
Nature: It is a non-linear activation function, allowing neural networks to learn complex
patterns and making backpropagation more efficient.
Advantage over other activations: ReLU is less computationally expensive than tanh and
sigmoid because it involves simpler mathematical operations. At any given time only some
neurons are active, which makes the activations sparse and the network efficient to
compute.
ReLU Activation Function
4. Leaky ReLU
$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$
Leaky ReLU is similar to ReLU but uses a small slope α (e.g., 0.01) for negative inputs
instead of zero.
Solves the “dying ReLU” problem, where neurons get stuck with zero outputs.
Range: (−∞,∞).
Preferred in some cases for better gradient flow.
Leaky ReLU Activation Function
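A short sketch of Leaky ReLU and its gradient illustrates why neurons do not get stuck: the gradient for negative inputs is α rather than zero. The function names and sample inputs are assumptions for illustration.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Return x for positive inputs and alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Slope of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 2.0])  # arbitrary sample inputs
print("output:  ", leaky_relu(x))
print("gradient:", leaky_relu_grad(x))
```

Setting `alpha=0.0` recovers plain ReLU, whose gradient is exactly zero for negative inputs.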
3. Exponential-Based Activation Functions
1. Softmax Function
Softmax function is designed to handle multi-class classification problems. It transforms
raw output scores from a neural network into probabilities. It works by squashing the
output values of each class into the range of 0 to 1 while ensuring that the sum of all
probabilities equals 1.
Softmax is a non-linear activation function.
The Softmax function ensures that each class is assigned a probability, helping to
identify which class the input belongs to.
Softmax Activation Function
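Softmax maps a score vector $z$ to probabilities via $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. The sketch below implements it in NumPy; subtracting the maximum score before exponentiating is a common numerical-stability trick added here, and the sample scores are made up.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw outputs for three classes
probs = softmax(scores)
print("probabilities:", probs)
print("sum:", probs.sum(), "predicted class:", int(np.argmax(probs)))
```

The shift does not change the result, because the common factor cancels between numerator and denominator.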
2. SoftPlus Function
The Softplus function is defined mathematically as: $A(x) = \log(1 + e^{x})$.
This ensures that the output is always positive and that the function is differentiable at all points,
which is an advantage over the traditional ReLU function.
Nature: The Softplus function is non-linear.
Range: The function outputs values in the range (0,∞), similar to ReLU, but without the
hard zero threshold that ReLU has.
Smoothness: Softplus is a smooth, continuous function, meaning it avoids the sharp
kink at zero where ReLU is not differentiable, which can sometimes lead to problems during optimization.
Softplus Activation Function
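A minimal Softplus sketch follows. It uses `np.logaddexp(0, x)`, which computes log(e^0 + e^x) = log(1 + e^x) without overflowing for large x; the derivative of Softplus is the sigmoid function. The function names and sample inputs are assumptions.

```python
import numpy as np

def softplus(x):
    """Softplus log(1 + e^x), computed in a numerically stable way."""
    return np.logaddexp(0.0, x)

def softplus_grad(x):
    """Derivative of Softplus, which equals the sigmoid of x."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])  # arbitrary sample inputs
print("softplus:", softplus(x))
print("relu:    ", np.maximum(0.0, x))
print("gradient:", softplus_grad(x))
```

For large positive inputs Softplus is close to ReLU, while near zero it bends smoothly instead of forming a corner.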
Impact of Activation Functions on Model Performance
The choice of activation function has a direct impact on the performance of a neural
network in several ways:
1. Convergence Speed: Functions like ReLU often allow faster training by mitigating the
vanishing gradient problem, while Sigmoid and Tanh can slow down convergence in
deep networks.
2. Gradient Flow: Activation functions like ReLU ensure better gradient flow, helping
deeper layers learn effectively. In contrast Sigmoid can lead to small gradients,
hindering learning in deep layers.
3. Model Complexity: Activation functions like Softmax allow the model to handle
multi-class problems at the output layer, whereas simpler functions like ReLU or Leaky ReLU
are typically used in hidden layers.
__________________________ ******___________________________
Q3) Write short notes on supervised, unsupervised, and reinforcement learning.
Supervised, Unsupervised, and Reinforcement Learning (Short Notes with Examples)
1. Supervised Learning
Supervised learning is a type of machine learning where the model is trained using labeled
data. Each training example consists of an input and a corresponding correct output. The goal
is to learn a mapping from inputs to outputs so that the model can predict outcomes for new
data.
Common tasks: Classification and Regression
Examples:
Predicting student results (pass/fail) based on marks and attendance
Email spam detection (spam or not spam)
House price prediction using features like area, location, and number of rooms
Algorithms: Linear Regression, Logistic Regression, Decision Tree, KNN, SVM
2. Unsupervised Learning
Unsupervised learning uses unlabeled data. The model tries to find hidden patterns,
structures, or relationships in the data without any predefined output.
Common tasks: Clustering, Association and Dimensionality Reduction
Examples:
Customer segmentation in marketing
Grouping students based on performance
Market basket analysis (items frequently bought together)
Algorithms: K-Means Clustering, Hierarchical Clustering, Apriori Algorithm, PCA
3. Reinforcement Learning
Reinforcement learning is a learning method where an agent interacts with an environment
and learns by trial and error. The agent receives rewards or penalties based on its actions
and aims to maximize the total reward over time.
Key elements: Agent, Environment, Action, Reward
Examples:
Game playing (Chess, Ludo, Video games)
Robot navigation
Traffic signal control systems
Algorithms: Q-Learning, SARSA, Deep Q-Network (DQN)
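To make the agent, environment, action and reward loop concrete, here is a minimal tabular Q-Learning sketch. The environment is a made-up five-cell corridor with a reward of 1 at the right end, and the agent explores with purely random actions (a simplification; Q-Learning is off-policy, so it can still learn the greedy values). All constants are assumptions.

```python
import numpy as np

N_STATES = 5            # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)      # action index 0 = move left, 1 = move right
GOAL = N_STATES - 1

def step(state, action_idx):
    """Environment: move within the corridor, reward 1.0 on reaching the goal."""
    next_state = min(max(state + ACTIONS[action_idx], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, len(ACTIONS)))   # Q[state, action] value table
alpha, gamma = 0.5, 0.9                  # learning rate and discount factor

for episode in range(500):
    state, done = 0, False
    while not done:
        action = int(rng.integers(len(ACTIONS)))          # trial and error
        next_state, reward, done = step(state, action)
        # Q-Learning update: move Q toward reward + discounted best future value
        target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

policy = np.argmax(Q[:GOAL], axis=1)     # greedy action in each non-goal cell
print("Q table:\n", Q)
print("greedy policy (1 = right):", policy)
```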
✅ Summary:
Supervised Learning: Learns from labeled data
Unsupervised Learning: Finds patterns in unlabeled data
Reinforcement Learning: Learns by interacting with the environment using rewards
___________________________*******_____________________________________
Regularization in Machine Learning
Last Updated : 11 Dec, 2025
Regularization is a technique used in machine learning to prevent overfitting, which
otherwise causes models to perform poorly on unseen data. By adding a penalty for
complexity, regularization encourages simpler and more generalizable models.
Prevents overfitting: Adds constraints to the model to reduce the risk of memorizing noise
in the training data.
Improves generalization: Encourages simpler models that perform better on new, unseen
data.
Regularization in Machine Learning
Types of Regularization
There are mainly 3 types of regularization techniques, each applying penalties in different
ways to control model complexity and improve generalization.
1. Lasso Regression
A regression model which uses the L1 Regularization technique is called LASSO (Least
Absolute Shrinkage and Selection Operator) regression. It adds the sum of the absolute values
of the coefficients as a penalty term to the loss function (L). This penalty can shrink
some coefficients to zero, which helps in selecting only the important features and ignoring
the less important ones.
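$$\text{Cost} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{i=1}^{m} \lvert w_i \rvert$$
Where
m: Number of Features
n: Number of Examples
$y_i$: Actual Target Value
$\hat{y}_i$: Predicted Target Value
$w_i$: Coefficients of the features
λ: Regularization parameter that controls the strength of regularization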
Note: These formulas apply to linear models. In neural networks, the number of weights is
much larger than the number of features, but the same regularization principles (L1, L2) still
apply on all weights.
Let's see how to implement this using Python:
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42) :
Generates a regression dataset with 100 samples, 5 features and some noise.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) :
Splits the data into 80% training and 20% testing sets.
lasso = Lasso(alpha=0.1): Creates a Lasso regression model with regularization strength
alpha set to 0.1.
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Generate synthetic regression data and split it into train and test sets
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit Lasso (L1-regularized) regression and evaluate it on the test set
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print("Coefficients:", lasso.coef_)
Output:
Lasso Regression
The output shows the model's prediction error and the importance of features with some
coefficients reduced to zero due to L1 regularization.
2. Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge regression. It
adds the sum of the squared coefficients as a penalty term to the loss function (L). It
handles multicollinearity by shrinking the coefficients of correlated features instead of
eliminating them.
$$\text{Cost} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{i=1}^{m} w_i^2$$
Where,
n: Number of examples or data points
m: Number of features i.e predictor variables
yi: Actual target value for the ith example
y^i: Predicted target value for the ith example
wi: Coefficients of the features
λ: Regularization parameter that controls the strength of regularization
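For a linear model without an intercept, the Ridge cost above has a closed-form minimizer: setting its gradient to zero gives $(X^\top X + n\lambda I)\,w = X^\top y$. The NumPy sketch below solves that system and shows the shrinkage effect; the synthetic data, the absence of an intercept and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 5
X = rng.normal(size=(n, m))
true_w = np.array([3.0, -2.0, 0.0, 1.0, 0.5])    # hypothetical ground-truth weights
y = X @ true_w + 0.1 * rng.normal(size=n)

def ridge_cost(w, X, y, lam):
    """Mean squared error plus lam times the sum of squared weights."""
    return np.mean((y - X @ w) ** 2) + lam * np.sum(w ** 2)

def ridge_fit(X, y, lam):
    """Solve (X^T X + n * lam * I) w = X^T y, the minimizer of ridge_cost."""
    n_samples, n_features = X.shape
    A = X.T @ X + n_samples * lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

w_ols = ridge_fit(X, y, lam=0.0)     # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, lam=1.0)   # penalized: coefficients shrink toward zero

print("OLS weights:  ", w_ols)
print("Ridge weights:", w_ridge)
print("norms:", np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

Note that the coefficients shrink but generally do not become exactly zero, in contrast to the L1 penalty.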
Let's see how to implement this using Python:
ridge = Ridge(alpha=1.0): Creates a Ridge regression model with regularization strength
alpha set to 1.0.
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate synthetic regression data and split it into train and test sets
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit Ridge (L2-regularized) regression and evaluate it on the test set
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Coefficients:", ridge.coef_)
Output:
Ridge Regression
The output shows the MSE, which measures model performance; a lower MSE means better predictions.
The coefficients reflect the regularized feature weights.
3. Elastic Net Regression
Elastic Net Regression is a combination of both L1 and L2 regularization. That is,
we add the absolute norm of the weights as well as the squared norm of the weights,
with an extra hyperparameter that controls the ratio of the L1 and L2
regularization.
$$\text{Cost} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\left(\alpha \sum_{i=1}^{m} \lvert w_i \rvert + (1-\alpha)\sum_{i=1}^{m} w_i^2\right)$$
Where
n: Number of examples (data points)
m: Number of features (predictor variables)
yi: Actual target value for the ith example
y^i: Predicted target value for the ith example
wi: Coefficients of the features
λ: Regularization parameter that controls the strength of regularization
α: Mixing parameter where 0 ≤ α ≤ 1. α = 1 corresponds to Lasso (L1)
regularization, α = 0 corresponds to Ridge (L2) regularization, and values between 0 and
1 provide a balance of both L1 and L2 regularization
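The role of the mixing parameter can be illustrated by computing the penalty term on its own. The sketch follows the convention stated in the definitions above, where α = 1 gives a pure L1 penalty and α = 0 a pure L2 penalty; the weight vector and function name are made up.

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    """lam * (alpha * L1 + (1 - alpha) * L2); alpha=1 is Lasso, alpha=0 is Ridge."""
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

w = np.array([0.5, -2.0, 0.0, 1.5])   # hypothetical coefficients
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha}: penalty={elastic_net_penalty(w, lam=1.0, alpha=alpha)}")
```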
Let's see how to implement this using Python:
model = ElasticNet(alpha=1.0, l1_ratio=0.5): Creates an Elastic Net model with
regularization strength alpha=1.0 and L1/L2 mixing ratio 0.5. In scikit-learn, alpha plays
the role of λ and l1_ratio plays the role of the mixing parameter α.
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate synthetic regression data and split it into train and test sets
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit Elastic Net (combined L1 and L2) regression and evaluate it on the test set
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Coefficients:", model.coef_)
Output:
Elastic Net Regression
The output shows MSE which measures how far off predictions are from actual values (lower
is better) and coefficients show feature importance.
Benefits of Regularization
Now, let’s see various benefits of regularization which are as follows:
Prevents Overfitting: Regularization helps models focus on underlying patterns instead of
memorizing noise in the training data.
Enhances Performance: Prevents excessive weighting of outliers or irrelevant features,
which helps improve overall model accuracy.
Stabilizes Models: Reduces sensitivity to minor data changes which ensures consistency
across different data subsets.
Prevents Complexity: Keeps model from becoming too complex which is important for
limited or noisy data.
Handles Multicollinearity: Reduces the magnitudes of correlated coefficients, which helps
improve model stability.
Promotes Consistency: Ensures reliable performance across different datasets which
reduces the risk of large performance shifts.
Model Selection for Machine Learning
Last Updated : 06 Aug, 2025
Machine learning (ML) is a field that enables computers to learn patterns from data and make
predictions without being explicitly programmed. However, one of the most crucial aspects
of machine learning is selecting the right model for a given problem. This process is called
model selection. The choice of model significantly affects the accuracy, efficiency and
reliability of predictions. A bad model can cause overfitting or underfitting and sometimes
even lead to increased computational costs.
In this article, we explore the process of model selection, its
importance and the techniques used to determine the best-performing machine learning model for
different problems.
Importance of Model Selection
Model selection is a key step in machine learning because it affects how well a system can
learn from data and make accurate predictions. Different models have different ways of
processing data and choosing the right one ensures that the system works efficiently. A
model that is too simple cannot capture details and has poor accuracy (underfitting), while a
model that is too complex might overfit, that is, do very well on training data but fail on new data.
The goal is to find a model that learns patterns effectively without being too simple or too complex.
Proper model selection involves experimenting with different models and comparing their
performance using evaluation metrics such as accuracy, precision, recall or mean squared
error. These metrics help in determining which model is best suited for a given task.
Apart from performance metrics, other factors such as training time, dataset size and
available computing power also play a crucial role in choosing the right model.
Selecting an appropriate model not only improves prediction accuracy but also enhances
efficiency, making the system faster and more reliable. This ensures that AI-driven
applications perform well in real-world scenarios.
Steps in Model Selection
Understanding the Problem and Data
Before selecting a model, it is important to first analyze the problem we are trying to solve.
The initial step is to determine whether it is a regression problem, where the goal is to predict
continuous values like house prices. If the task involves predicting categorical labels, such as
distinguishing between spam and non-spam emails, it falls under classification problem. On
the other hand, if the objective is to group similar data points, like segmenting customers
based on behavior, then it is a clustering problem. Understanding the type of problem helps in
choosing the most suitable machine learning model.
Another important point is the nature of the dataset itself. One has to check for
missing values, the number of numerical and categorical variables and the distribution of
the data. Understanding both the type of problem and the dataset helps in choosing the most suitable
machine learning model.
Selecting Suitable Models
After understanding the problem, we shortlist candidate models that are suited to it.
Different types of models work better for different kinds of problems:
For Regression: Linear Regression, Decision Trees, Random Forest, Neural Networks.
For Classification: Logistic Regression, Support Vector Machines (SVM), k-Nearest
Neighbors (k-NN), Neural Networks.
For Clustering: k-Means, Hierarchical Clustering, DBSCAN.
Model Evaluation
Once we have identified candidate models, we must evaluate each one according to how well it
does the job. The most common method is to split the dataset into two parts:
Training Set: The data used to train a machine learning model by learning patterns and
relationships.
Testing Set: This checks how well a model performs over new, unseen data.
We use k-fold cross-validation to further improve the evaluation. In k-fold cross-validation,
the data is split into k subsets. The model is trained on k-1 subsets and tested on the
remaining one, repeating the process k times. This way, our evaluation is not biased by a
particular train-test split.
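The k-fold procedure described above can be sketched by hand in NumPy. The helper names, the synthetic data and the choice of ordinary least squares as the model being evaluated are assumptions for illustration.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_val_mse(X, y, k=5):
    """Train least squares on k-1 folds, record test MSE on the held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        scores.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
    return np.array(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=50)

scores = cross_val_mse(X, y, k=5)
print("per-fold MSE:", scores)
print("mean MSE:", scores.mean())
```

Averaging the per-fold scores gives a single estimate that depends less on any one split.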
Different machine learning problems require different evaluation metrics.
For Regression Problems: We make use of Mean Squared Error (MSE), Mean Absolute
Error (MAE) and R-squared.
For Classification Problems: We make use of Accuracy, Precision, Recall and F1-score.
After evaluating the models, we compare them to identify the one that best balances predictive
performance and computational efficiency.
Model Selection Techniques in Machine Learning
Grid Search
One of the simplest and most commonly used model selection techniques is grid search. In
this approach, different combinations of hyperparameters are tried systematically and the
combination that gives the best performance is chosen. It can be effective, but its main drawback
is that it is computationally intensive, especially for complex models with many parameters.
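A grid search can be sketched with `itertools.product`, which enumerates every combination in the grid. The model here is a polynomial ridge regression fitted with NumPy on made-up data; the grid values, the train/validation split and the helper name are all assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=120)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=120)   # hypothetical noisy target
x_train, y_train = x[:80], y[:80]
x_val, y_val = x[80:], y[80:]

def fit_and_score(degree, lam):
    """Fit polynomial ridge regression on the training split, return validation MSE."""
    A_train = np.vander(x_train, degree + 1)
    A_val = np.vander(x_val, degree + 1)
    w = np.linalg.solve(A_train.T @ A_train + lam * np.eye(degree + 1),
                        A_train.T @ y_train)
    return np.mean((A_val @ w - y_val) ** 2)

grid = {"degree": [1, 3, 5, 7], "lam": [1e-3, 1e-1, 10.0]}

results = []
for degree, lam in itertools.product(grid["degree"], grid["lam"]):
    results.append(((degree, lam), fit_and_score(degree, lam)))

best_params, best_score = min(results, key=lambda item: item[1])
print("combinations tried:", len(results))
print("best (degree, lam):", best_params, "validation MSE:", best_score)
```

The number of fits is the product of the grid sizes (here 4 × 3 = 12), which is why the cost grows quickly as hyperparameters are added.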
Random Search
Unlike grid search, random search doesn't check all possible combinations. Instead, it
randomly samples a subset of the hyperparameter combinations. Random search
often runs much faster than grid search and can still achieve comparably good results.
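A random search replaces the exhaustive loop with a fixed budget of random draws. In the sketch below, `validation_loss` is a hypothetical stand-in for training a model and measuring its validation loss, and the hyperparameter names, ranges and budget of 20 trials are assumptions.

```python
import numpy as np

def validation_loss(learning_rate, reg_strength):
    """Hypothetical stand-in for 'train a model and return its validation loss'."""
    return (np.log10(learning_rate) + 2.0) ** 2 + (np.log10(reg_strength) + 4.0) ** 2

rng = np.random.default_rng(0)
trials = []
for _ in range(20):                                   # fixed evaluation budget
    params = {
        "learning_rate": 10 ** rng.uniform(-4, 0),    # log-uniform in [1e-4, 1]
        "reg_strength": 10 ** rng.uniform(-6, -2),    # log-uniform in [1e-6, 1e-2]
    }
    trials.append((validation_loss(**params), params))

best_loss, best_params = min(trials, key=lambda trial: trial[0])
print("best loss:", best_loss)
print("best params:", best_params)
```

The budget is chosen up front and does not grow with the number of hyperparameters, unlike a full grid.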
Bayesian Optimization
Bayesian optimization is a smarter approach to model selection. Instead of just randomly
searching for the best hyperparameters, it uses probability models to predict which
parameters are likely to perform best and focuses on evaluating those. This method is
efficient and often finds better results than grid or random search.
Cross-Validation Based Selection
This method involves using cross-validation to evaluate multiple models and selecting the
one with the best average performance. Instead of relying on a single train-test split,
cross-validation divides the dataset into multiple parts and trains the model on different subsets.
This helps to ensure that the model's performance is not just due to a specific split of data. By
averaging the results from different splits, we get a more reliable estimate of how well the model
will perform on new, unseen data. This approach reduces the risk of overfitting and helps in
selecting a good model.
Dropout in Neural Networks
Last Updated : 12 Jul, 2025
The concept of neural networks is inspired by the neurons in the human brain; scientists
wanted a machine to replicate the same process. This carved a path to one of the most
important topics in Artificial Intelligence. A Neural Network (NN) is based on a collection of
connected units or nodes called artificial neurons, which loosely model the neurons in a
biological brain. Since such a network is created artificially in machines, we refer to it as an
Artificial Neural Network (ANN). This article assumes that you have a decent knowledge of
ANNs. Now, let us look more closely at the details
of Dropout in ANN.
tf.keras.layers.Dropout(rate)
# rate: Float between 0 and 1.
# The fraction of the input units to drop.
Problem: When a fully-connected layer has a large number of neurons, co-adaptation is more
likely to happen. Co-adaptation refers to when multiple neurons in a layer extract the same,
or very similar, hidden features from the input data. This can happen when the connection
weights for two different neurons are nearly identical.
This poses two different problems to our model:
Wastage of machine's resources when computing the same output.
If many neurons are extracting the same features, it adds more significance to those
features for our model. This leads to overfitting if the duplicate extracted features are
specific to only the training set.
Solution to the problem: As the title suggests, we use dropout while training the NN to
minimize co-adaptation. In dropout, we randomly shut down some fraction of a layer’s
neurons at each training step by zeroing out the neuron values. The fraction of neurons to be
zeroed out is known as the dropout rate, $r_d$. The remaining neurons have their values
multiplied by $\frac{1}{1 - r_d}$ so that the expected overall sum of the neuron values remains the
same.
The two images represent dropout applied to a layer of 6 units, shown at multiple training
steps. The dropout rate is 1/3, and the remaining 4 neurons at each training step have their
values scaled by ×1.5. In this way, we train a random sample of neurons at each step rather than
the whole network at once. This reduces co-adaptation and helps the neurons
learn the hidden features better.
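The zero-and-rescale mechanism can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the Keras source: the function name, its `training` flag and the layer of six ones are made up to mirror the example with dropout rate 1/3.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero each unit with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return activations                        # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
layer = np.ones(6)                                # six units, all with value 1

for training_step in range(3):                    # a fresh random mask at every step
    print(dropout(layer, rate=1/3, rng=rng))

print(dropout(layer, rate=1/3, rng=rng, training=False))
```

Each printed row contains only zeros and values of about 1.5. The number of zeros varies from step to step because each unit is dropped independently, so the sum is preserved in expectation rather than exactly.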
Why does dropout work?
With dropout, every iteration trains a smaller, randomly thinned version of the
network, which has a regularizing effect.
Dropout helps in shrinking the squared norm of the weights, which tends to reduce
overfitting.
Dropout can be applied to a network using TensorFlow APIs as follows:
tf.keras.layers.Dropout(rate)
# rate: Float between 0 and 1.
# The fraction of the input units to drop.