Introduction to Neural Networks
Syllabus
● Introduction To Artificial Neural Networks, Machine Learning vs. Neural Networks
● Perceptron, Ex-Or Problem, Multilayer perceptron
● Activation Function and its types
● Backpropagation
1. Introduction to Artificial Neural Networks
Artificial Neural Networks (ANNs) are computational models inspired by the
structure and functioning of the human brain. They consist of interconnected processing
units called neurons, which work collectively to learn patterns and relationships from data.
Unlike traditional rule-based algorithms, ANNs learn automatically through exposure to
examples, making them highly effective for complex tasks where explicit programming is
difficult. Their ability to approximate nonlinear functions allows them to model intricate
real-world phenomena across diverse domains.
At the core of an ANN is the artificial neuron, which receives inputs, applies weights,
performs a weighted summation, and passes the result through an activation function.
Activation functions such as Sigmoid, ReLU, or Tanh introduce nonlinearity, enabling the
network to learn complex mappings. Neurons are arranged in layers: an input layer that
receives data, one or more hidden layers that extract features or patterns, and an output
layer that produces the final prediction. The structure and depth of these layers determine
the network’s learning capacity.
ANNs learn through a process called training, where they iteratively adjust weights
to minimize prediction errors. This is typically achieved using optimization algorithms such
as gradient descent and the backpropagation technique. During training, the network
assesses how far its output deviates from the expected result and updates the weights
accordingly. This iterative process continues until the network achieves a desired level of
accuracy. Training may involve large datasets, and the network’s performance is influenced
by hyperparameters such as learning rate, batch size, and number of epochs.
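The iterative weight adjustment described above can be sketched with the smallest possible model: a single weight w in y = w * x, fitted by gradient descent on the mean squared error. The toy data, learning rate, and epoch count below are illustrative assumptions, not recommended settings.

```python
# Minimal gradient descent sketch: fit y = w * x to toy data.
# The data, learning rate and epoch count are illustrative assumptions.

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys, learning_rate=0.05, epochs=200):
    w = 0.0  # arbitrary starting weight
    for _ in range(epochs):
        # Step against the gradient to reduce the error.
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the ideal weight is 2
w = fit(xs, ys)
print(w, mse(w, xs, ys))
```

A learning rate that is too large for the data would make the updates overshoot and diverge, which is one reason the hyperparameters listed above matter in practice.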
Artificial Neural Networks have numerous advantages, including adaptability,
robustness to noisy data, and strong generalization capabilities. They are widely used in
tasks such as image classification, speech recognition, medical diagnosis, financial
forecasting, and control systems. Despite their strengths, ANNs also present challenges such
as high computational requirements, long training times, and difficulty in interpreting how
decisions are made. Nonetheless, they remain a foundational component of modern
machine learning and form the basis for more advanced architectures such as Convolutional
Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and deep learning systems.
Key Characteristics
1. Nonlinear Processing: Neural networks can model highly complex and nonlinear
relationships due to nonlinear activation functions in hidden layers.
2. Learning from Data: ANNs improve performance by learning patterns directly from
training data rather than relying on explicitly programmed rules.
3. Generalization Ability: Once trained, neural networks can generalize from previously
seen data to new, unseen inputs, enabling reliable predictions.
4. Adaptive Nature: Weights and biases are continuously adjusted during training, allowing
the network to adapt to changing input patterns.
Why Neural Networks are needed
● Can model complex and nonlinear relationships that traditional algorithms cannot.
● Do not require strict assumptions like linearity or normality.
● Learn patterns directly from data without explicit human-defined rules.
● Handle high-dimensional, unstructured, or noisy data effectively.
● Useful in tasks such as image recognition, speech processing, NLP, and biomedical analysis.
● Can generalize to new, unseen data, making them suitable for dynamic environments.
● Adaptable and robust to noise and incomplete data.
1.1 Machine Learning vs. Neural Networks:
| Aspect | Machine Learning (ML) | Neural Networks (NNs) |
|---|---|---|
| Definition | A branch of AI that enables systems to learn from data and make predictions or decisions without explicit programming. | A subset of ML inspired by the human brain, consisting of interconnected neurons that learn complex patterns from data. |
| Data Requirements | Works best with structured or tabular data; performance depends on manual feature engineering. | Can handle structured, unstructured (images, text, audio), and high-dimensional data; learns features automatically. |
| Learning Capability | Can model linear, moderately complex, or specific patterns depending on the algorithm. | Capable of modeling highly nonlinear, complex, and hierarchical relationships. |
| Computation | Typically less computationally intensive; can run efficiently on standard hardware. | Computationally intensive; training large/deep networks often requires GPUs or specialized hardware. |
| Training Time | Usually faster; depends on algorithm and dataset size. | Slower training, especially for deep architectures and large datasets. |
| Feature Engineering | Requires manual feature selection and pre-processing. | Learns features automatically through hidden layers; reduces the need for manual feature engineering. |
| Overfitting Risk | Less prone to overfitting with simpler models; regularization methods can be applied. | High risk of overfitting, especially in deep networks; requires regularization, dropout, and large datasets. |
| Applications | Predictive modeling, classification, regression, clustering, recommendation systems, anomaly detection. | Image and speech recognition, natural language processing, autonomous vehicles, deep learning, robotics, generative models. |
| Memory Requirement | Generally moderate; depends on dataset and algorithm. | High memory requirement due to storage of weights, activations, and multiple layers. |
2. Perceptron
The Perceptron is the simplest type of artificial neural network and serves as the
fundamental building block for more complex neural networks. It was introduced by Frank
Rosenblatt in 1958 and is designed to perform binary classification tasks. The perceptron
mimics the behavior of a biological neuron: it receives multiple inputs, processes them using
weights, applies a bias, and passes the result through an activation function to produce an
output. This output typically represents one of two classes in a classification problem.
Mathematically, the perceptron computes a weighted sum of its inputs and adds a bias
term. The result is then passed through a step (threshold) activation function, which
determines whether the neuron “fires” (output = 1) or not (output = 0). The model can be
represented as:
y = f(∑ᵢ wᵢxᵢ + b)
Where wᵢ are weights, xᵢ are inputs, b is bias, and f is the activation function. The step
function converts the continuous weighted sum into a binary output, making the perceptron
suitable for linearly separable data.
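As a small illustration of the weighted sum followed by a step function, the sketch below evaluates a perceptron on the four inputs of logical AND. The weights and bias are hand-picked for this example, and firing when the sum is at least 0 is an assumed convention (some texts use a strict inequality).

```python
# Perceptron forward pass: y = step(sum_i w_i * x_i + b).
# The weights and bias are hand-picked (an assumption) so the unit computes AND.

def step(z):
    """Threshold activation: fire (1) when z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0

def perceptron(inputs, weights, bias):
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(weighted_sum)

AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", perceptron([x1, x2], AND_WEIGHTS, AND_BIAS))
```

The decision boundary here is the line x₁ + x₂ = 1.5, which leaves only the point (1, 1) on the firing side.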
The learning process of a perceptron involves adjusting the weights and bias based
on the error between predicted and actual outputs. The Perceptron Learning Rule updates
weights iteratively without computing gradients: if the output is correct, no change is made;
if incorrect, each weight is adjusted in proportion to the error and its input, wᵢ ← wᵢ + η (t − y) xᵢ,
where t is the target, y is the predicted output, and η is the learning rate. Training continues until all training
samples are classified correctly or a maximum number of iterations is reached. This
supervised learning approach enables the perceptron to learn simple patterns and decision
boundaries.
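The learning rule described above can be sketched as follows, here trained on the logical OR function, which is linearly separable. The learning rate, the epoch limit, and the zero initialisation are illustrative assumptions.

```python
# Perceptron learning rule sketch, trained on the OR function.
# learning_rate, max_epochs and zero initialisation are illustrative assumptions.

def step(z):
    return 1 if z >= 0 else 0

def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_perceptron(samples, learning_rate=0.1, max_epochs=100):
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for epoch in range(max_epochs):
        mistakes = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                # Shift the boundary toward the misclassified sample.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # every sample classified correctly: stop
            return weights, bias, epoch + 1
    return weights, bias, max_epochs

OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias, epochs_used = train_perceptron(OR_DATA)
print(weights, bias, epochs_used)
```

Because OR is linearly separable, the loop stops early once an epoch passes with no mistakes; on data that is not linearly separable it would simply run until max_epochs.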
Despite its historical importance, the perceptron has limitations. It can only solve
linearly separable problems; it fails for problems like XOR, which require non-linear decision
boundaries. However, multilayer perceptrons (MLPs) with hidden layers overcome this
limitation by using nonlinear activation functions. The perceptron remains a key concept in
understanding neural networks, serving as a foundation for more advanced architectures in
machine learning and deep learning.
2.1 Ex-Or Problem in Perceptron
The XOR (exclusive OR) problem is a classic example that demonstrates the
limitation of a single-layer perceptron. One of the most fundamental limitations of
perceptrons is their inability to solve problems that are not linearly separable. A perceptron
can only draw a straight line (or hyperplane in higher dimensions) to separate two classes. If
the data points from different classes cannot be separated by a straight line, the perceptron
fails to classify them correctly.
The perceptron is a type of feed-forward network, which means the process of
generating an output — known as forward propagation — flows in one direction from the
input layer to the output layer. There are no connections between units in the input layer.
Instead, all units in the input layer are connected directly to the output unit.
A simplified explanation of the forward propagation process is that the input values
X1 and X2, along with the bias input of 1, are multiplied by their respective weights W1, W2,
and W0 and passed to the output unit. The output unit takes the sum of those values and employs
an activation function — typically the Heaviside step function — to convert the resulting
value to a 0 or 1, thus classifying the input values as 0 or 1.
Geometrically, this means the perceptron can separate its input space only with a single
hyperplane, which is why it is limited to linearly separable problems. The four XOR points
cannot be split this way: the inputs (0,1) and (1,0), which map to 1, lie on one diagonal of
the unit square, while (0,0) and (1,1), which map to 0, lie on the other, so no single
straight line puts the two classes on opposite sides. A single-layer perceptron therefore
cannot solve XOR. To address this, a multilayer perceptron (MLP) with at least one hidden
layer is used.
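A quick way to see the limitation is to search for perceptron weights by brute force. The sketch below tries every (w1, w2, b) combination on a coarse grid and counts how many reproduce a given truth table. The grid is an arbitrary assumption, so the search is supporting evidence rather than a proof.

```python
# Brute-force search for single-perceptron weights on a coarse grid.
# The grid is an illustrative assumption; this is evidence, not a proof.
from itertools import product

def step(z):
    return 1 if z >= 0 else 0

def count_solutions(truth_table, grid):
    """Count (w1, w2, b) triples on the grid that classify every row correctly."""
    solutions = 0
    for w1, w2, b in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 + b) == y for (x1, x2), y in truth_table):
            solutions += 1
    return solutions

GRID = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND solutions:", count_solutions(AND, GRID))
print("XOR solutions:", count_solutions(XOR, GRID))
```

The XOR count is zero for any grid. With this step convention, XOR would need b < 0, w1 + b ≥ 0, w2 + b ≥ 0, and w1 + w2 + b < 0. Adding the two middle conditions gives w1 + w2 + b ≥ −b > 0, which contradicts the last one.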
The key idea is to transform the input space into an intermediate representation
where the classes become linearly separable. This is achieved by introducing nonlinear
activation functions in the hidden layer, enabling the network to model nonlinear decision
boundaries. The hidden neurons learn intermediate features that separate the XOR outputs
into distinct linear regions that the output layer can finally classify.
A typical solution uses a 2–2–1 network: two input neurons (for x₁ and x₂), two
hidden neurons, and one output neuron. During training, the network adjusts its weights
using backpropagation so that specific hidden neurons activate for patterns like (0,1) and
(1,0) while others activate for (0,0) and (1,1). These learned patterns allow the network to
correctly output 1 for unequal inputs and 0 for equal inputs. Thus, the XOR problem
demonstrated that multilayer networks with nonlinear activations are essential for solving
complex, nonlinearly separable classification tasks—marking an important advancement
that led to the development of modern neural network architectures.
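One possible 2–2–1 solution can be written down by hand, without training. In the sketch below the weights are chosen manually so that one hidden unit computes OR, the other computes NAND, and the output unit combines them with AND. These weights are an assumption for illustration; a network trained with backpropagation would typically settle on different values.

```python
# A 2-2-1 network computing XOR with hand-chosen weights (an assumption:
# a trained network would find its own, usually different, weights).

def step(z):
    return 1 if z >= 0 else 0

def hidden_layer(x1, x2):
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)      # fires unless both inputs are 0
    h_nand = step(-1.0 * x1 - 1.0 * x2 + 1.5)   # fires unless both inputs are 1
    return h_or, h_nand

def xor_network(x1, x2):
    h_or, h_nand = hidden_layer(x1, x2)
    return step(1.0 * h_or + 1.0 * h_nand - 1.5)  # AND of the two hidden units

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "hidden:", hidden_layer(x1, x2), "->", xor_network(x1, x2))
```

In the hidden representation the four inputs become (0,1), (1,1), (1,1), and (1,0). Only the two unequal inputs land on (1,1), so a single line can now separate the classes, which is the intermediate representation the text describes.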
2.2 Multi-layer Perceptron
A Multi-Layer Perceptron (MLP) is a class of feedforward artificial neural network
composed of multiple layers of interconnected neurons. Unlike a single-layer perceptron, an
MLP contains one input layer, one or more hidden layers, and an output layer, allowing it to
learn complex, nonlinear relationships in data. Each neuron performs a weighted sum of its
inputs and passes the result through a nonlinear activation function such as sigmoid, tanh,
or ReLU. This nonlinear transformation is what allows an MLP with enough hidden neurons to
approximate any continuous function on a bounded input range to arbitrary accuracy, making it
a universal function approximator. The network's learning process
typically involves backpropagation, where errors at the output layer are propagated
backward to adjust the weights and biases throughout the network.
MLPs are widely used in supervised learning tasks such as classification, regression,
pattern recognition, and time-series prediction. The hidden layers extract intricate patterns
and higher-level features from the input data, while the output layer produces the final
prediction. The capacity of an MLP depends on the number of hidden layers, the number of
neurons in each layer, and the choice of activation functions. Because of their ability to
model complex mappings, MLPs serve as the foundation for many advanced neural
architectures and remain one of the most commonly used neural networks in practical
applications.
Types and Roles of Various Layers in an MLP
A Multi-Layer Perceptron (MLP) consists of three primary types of layers: the input
layer, one or more hidden layers, and the output layer. Each layer has a distinct functional
role in processing information and transforming it into meaningful predictions. The input
layer serves as the entry point of the network and holds the feature values from the dataset.
Although it does not perform any computation, it structures the data in a form that the
subsequent layers can interpret. The neurons in the input layer simply pass the input vector
to the first hidden layer without applying any activation or weight transformations.
The hidden layers are the computational core of the MLP and are responsible for
extracting patterns, learning relationships, and performing nonlinear transformations. Each
neuron in a hidden layer computes a weighted sum of its inputs and passes the result
through an activation function, enabling the network to model complex nonlinear functions.
Multiple hidden layers allow the MLP to build hierarchical representations of data, where
earlier layers capture simple features and deeper layers learn more abstract concepts. The
number of hidden layers and neurons determines the network’s learning capacity and
directly influences model performance.
Finally, the output layer generates the final predictions of the model. Its structure
and activation function depend on the type of task being performed. For classification tasks,
the output layer may use softmax or sigmoid activations to produce probabilities, whereas
regression tasks typically use a linear activation to generate continuous numeric values. The
output layer converts the learned internal representations into meaningful outputs that
align with the target variable. Together, the three types of layers form a complete MLP
architecture capable of learning and generalizing from data through supervised training.
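The roles of the three layer types can be traced in a short forward pass. The sketch below assumes a 2-input, 3-hidden-neuron, 2-class network with ReLU in the hidden layer and softmax at the output. The layer sizes and every weight value are arbitrary illustrative choices.

```python
# Forward pass through a small MLP: input -> hidden (ReLU) -> output (softmax).
# Layer sizes and weight values are arbitrary illustrative assumptions.
import math

def relu(z):
    return max(0.0, z)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Weighted sum per neuron; weights[j] holds the input weights of neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def mlp_forward(x, params):
    hidden = [relu(z) for z in dense(x, params["W1"], params["b1"])]
    scores = dense(hidden, params["W2"], params["b2"])
    return softmax(scores)

params = {
    "W1": [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]],  # 3 hidden neurons, 2 inputs
    "b1": [0.0, 0.1, -0.1],
    "W2": [[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]],    # 2 output classes
    "b2": [0.0, 0.0],
}
probs = mlp_forward([1.0, 2.0], params)
print(probs)
```

For a regression task the final softmax call would be dropped and the raw scores returned, which corresponds to the linear output activation mentioned above.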
3. Activation Function
An activation function is a mathematical transformation applied to the output of a
neuron in a neural network. After computing a weighted sum of inputs, the neuron passes
this value through the activation function, which determines whether the neuron should be
activated and how strongly it should contribute to the next layer. Activation functions
introduce nonlinearity into the neural network, allowing it to learn and represent complex
patterns in data. Without these functions, the entire neural network would behave like a
simple linear model and would not be able to model real-world problems effectively.
The importance of activation functions lies in their ability to enable deep neural
architectures to extract high-level features and perform complex decision-making. They
allow neural networks to approximate nonlinear relationships such as image features,
speech patterns, medical signals, and other intricate datasets. Activation functions also play
a critical role in controlling gradient flow during training. Proper activation selection helps
avoid issues such as vanishing or exploding gradients, ensuring stable and efficient learning.
Overall, activation functions form the core mechanism that gives neural networks their
expressive power and distinguishes them from traditional linear models.
3.1 Types of Activation Functions
A] Linear Activation Function
The linear activation function is defined as f(x) = x, meaning the output is a direct,
unmodified mapping of the input. It does not introduce nonlinearity, making it suitable for
tasks where the relationship between inputs and outputs is fundamentally linear. It is
commonly used in the output layer of neural networks designed for regression tasks, where
continuous numeric values must be predicted.
Because the linear function does not constrain its output, it allows a model to
produce values across the entire real number space. This makes it ideal for predicting
variables such as cost, temperature, growth rate, or bone density. The absence of saturation
regions eliminates the risk of vanishing gradients, ensuring the optimization process remains
stable during training.
However, using linear activation in hidden layers severely limits the expressive
power of neural networks. A network composed of purely linear layers is mathematically
equivalent to a single linear transformation, regardless of depth. As a result, the network
cannot capture nonlinear patterns or complex relationships. For this reason, linear
activation is typically restricted to the output layer in regression-oriented architectures.
Advantages
No saturation; avoids vanishing gradients.
Useful for regression tasks requiring unbounded outputs.
Simple and computationally efficient.
Limitations
Cannot model nonlinear relationships.
If used in hidden layers, the network collapses to a simple linear model.
Limited expressive capabilities.
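The collapse of stacked linear layers into one linear transformation can be checked numerically. The sketch below applies two weight matrices in sequence and compares the result with a single layer whose weights are their product. The matrices are arbitrary illustrative values, and biases are omitted for brevity (with biases the combined layer is still a single affine map).

```python
# Two stacked linear layers (no activation) equal one linear layer.
# The matrices are arbitrary illustrative values; biases are omitted.

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # layer 1: 2 inputs -> 3 units
W2 = [[2.0, 0.0, 1.0], [-1.0, 4.0, 0.0]]    # layer 2: 3 units -> 2 outputs
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))  # W2 (W1 x)
one_layer = matvec(matmul(W2, W1), x)   # (W2 W1) x
print(two_layers, one_layer)
```

Both expressions give the same output vector, so the extra layer adds parameters but no expressive power unless a nonlinear activation is placed between the two.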
B] Sigmoid
The sigmoid activation function is a smooth, S-shaped nonlinear function defined as
f(x) = 1/(1 + e⁻ˣ). It maps any real-valued input into the range between 0 and 1, making it
particularly useful in scenarios where the output represents a probability. Because of its
differentiable nature, the sigmoid function was widely adopted in early neural network
architectures and remains common in binary classification models.
In neural networks, the sigmoid function helps in introducing nonlinearity, enabling models
to learn complex patterns in data. Its smooth, everywhere-defined gradient allows the application of
backpropagation, as small changes in the input produce small, gradual changes in the
output. This property is valuable for tasks where subtle input variations must influence the
final decision. The function also ensures that outputs remain bounded, preventing extreme
numeric values.
However, sigmoid functions suffer from numerical challenges that may hinder training
efficiency. For very large positive or negative inputs, the sigmoid becomes saturated near 1
or 0, causing gradients to approach zero. This “vanishing gradient” issue slows convergence,
especially in deep networks. Despite this limitation, sigmoid remains useful in specific
architectures where bounded probability-like output is required.
Advantages
Smooth and differentiable.
Outputs lie between 0 and 1, suitable for probabilistic interpretation.
Historically well-understood and easy to implement.
Limitations
Prone to vanishing gradient problems.
Slow convergence in deep models.
Outputs are not zero-centered, affecting gradient dynamics.
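The saturation behaviour can be seen by evaluating the sigmoid together with its derivative, f'(x) = f(x)(1 − f(x)). The sketch below is a plain-Python illustration and the sample inputs are arbitrary.

```python
# Sigmoid and its derivative, evaluated at a few arbitrary sample points.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_derivative(x):.5f}")
```

The gradient peaks at 0.25 when x = 0 and shrinks toward zero as |x| grows. When many such factors are multiplied across layers during backpropagation, the product can become very small, which is the vanishing gradient issue noted above.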
C] ReLU (Rectified Linear Unit) Activation Function
The ReLU function is defined as f(x) = max(0, x). It is one of the most widely used activation
functions in modern deep neural networks because of its simplicity and effectiveness. ReLU
introduces nonlinearity while maintaining computational efficiency, as it requires only a
thresholding operation. It allows neural networks to learn complex functions and feature
hierarchies with faster convergence.
ReLU avoids saturation in the positive region, which allows gradients to remain stable during
backpropagation. This makes deep networks train more efficiently compared to sigmoid or
tanh activation. The sparsity introduced by ReLU, where neurons output zero for negative
inputs, also helps reduce overfitting and improves computational efficiency. These
characteristics make ReLU suitable for deep learning architectures such as CNNs and MLPs.
Despite its advantages, ReLU suffers from the “dying ReLU” problem, where neurons
permanently output zero due to negative weighted sums. Once a neuron dies, it stops
contributing to learning because its gradient becomes zero. Additionally, ReLU is unbounded
on the positive side, which may cause unstable gradients if not controlled. Variants such as
Leaky ReLU and ELU are designed to address these limitations.
Advantages
Fast computation and efficient training.
Avoids vanishing gradients for positive inputs.
Encourages sparse activation, reducing overfitting.
Limitations
Susceptible to the “dying ReLU” problem.
Outputs are unbounded, sometimes causing exploding gradients.
Not suitable for tasks requiring smooth gradients.
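The difference between ReLU and a leaky variant is easiest to see in their gradients. In the sketch below, the negative slope of 0.01 for Leaky ReLU is a commonly used default taken here as an assumption. Treating the ReLU gradient at exactly 0 as 0 is likewise a convention, since the derivative is undefined there.

```python
# ReLU and Leaky ReLU with their gradients.
# negative_slope=0.01 is an assumed, commonly used default.

def relu(x):
    return max(0.0, x)

def relu_gradient(x):
    # Undefined at exactly 0; returning 0 there is a common convention.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, negative_slope=0.01):
    return x if x > 0 else negative_slope * x

def leaky_relu_gradient(x, negative_slope=0.01):
    return 1.0 if x > 0 else negative_slope

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, relu(x), relu_gradient(x), leaky_relu(x), leaky_relu_gradient(x))
```

For negative inputs the ReLU gradient is exactly zero, so a neuron whose weighted sum stays negative receives no weight updates. The leaky variant keeps a small nonzero gradient there, which is how it tries to avoid dead neurons.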
D] Tanh Activation Function
The tanh activation function is defined as f(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ). It produces
outputs in the range −1 to +1, making it a zero-centered activation function. Its S-shaped
curve resembles the sigmoid but offers symmetrical outputs, which improve training
dynamics and convergence speed. Tanh has been widely used in neural networks before the
rise of ReLU-based architectures.
In neural networks, the tanh function introduces strong nonlinearity and helps model
intricate input-output relationships. Because the output is centered around zero, gradients
flow more evenly through the network, resulting in faster and more stable learning than
sigmoid. This feature makes tanh particularly useful in hidden layers when modeling
normalized or centered data.
However, tanh suffers from the same vanishing gradient problem as the sigmoid
function. At extreme inputs, the gradient approaches zero, making it difficult for the
network to update weights effectively. This limitation is especially problematic in deep
architectures. Despite these issues, tanh remains relevant in specific tasks and is still used in
recurrent neural networks and shallow multilayer perceptrons.
Advantages
Zero-centered outputs improve training stability.
Provides stronger gradients than sigmoid.
Well-suited for normalized input data.
Limitations
Suffers from vanishing gradients at extreme values.
Slower training compared to ReLU-based models.
Less effective for very deep neural architectures.
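The link between tanh and sigmoid, and the claim about stronger gradients, can be checked with a few lines of Python. The sketch below uses the identity tanh(x) = 2·sigmoid(2x) − 1 and the derivative 1 − tanh²(x). The sample inputs are arbitrary.

```python
# tanh from its definition, its relation to sigmoid, and its derivative.
import math

def tanh(x):
    """Direct form of the definition; math.tanh is the numerically safer choice."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_derivative(x):
    return 1.0 - math.tanh(x) ** 2

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    rescaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0  # equals tanh(x)
    print(x, round(tanh(x), 6), round(rescaled_sigmoid, 6), round(tanh_derivative(x), 6))
```

At x = 0 the tanh derivative is 1, compared with 0.25 for the sigmoid. For large |x| it also shrinks toward zero, which is why tanh has the same saturation problem.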
| Parameter | Sigmoid | Tanh | ReLU | Linear |
|---|---|---|---|---|
| Mathematical Form | 1 / (1 + e⁻ˣ) | (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ) | max(0, x) | f(x) = x |
| Output Range | 0 to 1 | −1 to +1 | 0 to ∞ | −∞ to ∞ |
| Nonlinear? | Yes | Yes | Yes | No |
| Zero-Centered Output | No | Yes | No (outputs are never negative) | Yes (identity) |
| Computational Cost | Moderate (exp operation) | Moderate (exp operations) | Very low (simple max function) | Very low |
| Use in Hidden Layers | Limited due to slow learning | Good for centered data | Highly preferred in deep networks | Not recommended |
| Common Use Cases | Binary classification output layer | RNNs, centered data | CNNs, deep MLPs, large networks | Regression problems (output layer) |
| Training Speed | Slow | Moderate | Fast | Fast |
| Saturation Problem | Yes | Yes | No (except negative side) | No |
| Bounded Output | Yes | Yes | No | No |
| Suitability for Deep Networks | Poor | Moderate | Excellent | Poor |
| Interpretability | Good (probabilities) | Good (range −1 to +1) | Moderate | Good (linear mapping) |
4. Backpropagation
Backpropagation is a fundamental learning algorithm used in training artificial neural
networks. It operates by adjusting the weights and biases of the network to minimize the difference
between predicted outputs and actual target values. The forward pass computes the output of the
network, while the backward pass calculates the gradients of the loss function with respect to each
parameter. This gradient information is then used by an optimization algorithm such as gradient
descent to update the parameters and reduce overall error.
The algorithm relies on the chain rule of calculus to propagate errors backward from the
output layer to all preceding layers. By computing partial derivatives layer by layer, backpropagation
efficiently determines how much each weight contributed to the final prediction error. This stepwise
gradient calculation enables deep networks with many layers to learn complex and nonlinear
relationships in the data. Without backpropagation, training neural networks with millions of
parameters would be computationally infeasible.
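The forward and backward passes can be followed by hand on a very small network. The sketch below assumes one input, one sigmoid hidden unit, one linear output, and a squared-error loss L = ½(y − t)². These choices, the parameter values, and the finite-difference check are illustrative assumptions, not part of any particular library.

```python
# Backpropagation on a 1-1-1 network, checked against finite differences.
# Architecture, loss and parameter values are illustrative assumptions.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x, target):
    """Forward pass: returns the loss and the values reused by the backward pass."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)        # hidden activation
    y = w2 * h + b2                 # linear output
    loss = 0.5 * (y - target) ** 2
    return loss, (h, y)

def backward(params, x, target):
    """Backward pass: chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (h, y) = forward(params, x, target)
    dloss_dy = y - target                # dL/dy
    dloss_dw2 = dloss_dy * h             # dL/dw2 = dL/dy * dy/dw2
    dloss_db2 = dloss_dy
    dloss_dh = dloss_dy * w2             # error propagated to the hidden unit
    dloss_dz = dloss_dh * h * (1.0 - h)  # sigmoid'(z) = h * (1 - h)
    dloss_dw1 = dloss_dz * x
    dloss_db1 = dloss_dz
    return [dloss_dw1, dloss_db1, dloss_dw2, dloss_db2]

def numerical_gradient(params, x, target, eps=1e-6):
    """Finite-difference estimate, used only to sanity-check backward()."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, target)[0] - forward(down, x, target)[0]) / (2 * eps))
    return grads

params = [0.4, -0.1, 0.7, 0.2]  # w1, b1, w2, b2
analytic = backward(params, x=1.5, target=1.0)
numeric = numerical_gradient(params, x=1.5, target=1.0)
print(analytic)
print(numeric)
```

The finite-difference check needs two forward passes per parameter, whereas the backward pass yields all gradients from a single forward pass. That difference is what makes backpropagation practical for networks with many parameters.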
Backpropagation is widely used because it supports large-scale learning and works well with
different architectures such as multilayer perceptrons, convolutional neural networks, and recurrent
networks. It enables supervised learning by continuously updating parameters based on labeled
data. The iterative process gradually improves accuracy, making the network more capable of
generalizing to new inputs. Because the gradient computation is deterministic and mathematically grounded, it
gives a systematic way to reduce the loss, although convergence to the global minimum is not guaranteed.
Despite its effectiveness, backpropagation has certain challenges. Deep networks may
encounter vanishing or exploding gradients, which hinder learning in earlier layers. Training can be
computationally intensive and requires large datasets to avoid overfitting. Additionally, the
performance of backpropagation heavily depends on proper parameter initialization, choice of
activation function, and tuning of hyperparameters like learning rate. Nevertheless, it remains the
backbone of modern deep learning and continues to be optimized through improved architectures
and training techniques.
Advantages of Backpropagation
Efficient method for computing gradients using the chain rule.
Can train deep and complex neural architectures.
Works with a wide range of activation functions and optimization techniques.
Supports supervised learning for classification and regression tasks.
Enables continuous improvement through iterative parameter updates.
Computationally faster than naive gradient computation methods.
Limitations of Backpropagation
Suffers from vanishing and exploding gradient issues, especially in deep networks.
Requires differentiable activation functions to compute gradients.
Sensitive to learning rate and hyperparameter settings.
May converge to local minima rather than the global optimum.
Computationally expensive for large networks and datasets.
Requires labelled data, making it less suitable for unsupervised learning.
Applications of Backpropagation
Training multilayer perceptrons for prediction and classification tasks.
Deep learning models such as CNNs for image analysis.
Medical signal and imaging analysis, such as BMD estimation tasks.
Financial forecasting, pattern recognition and anomaly detection.
Robotics, control systems, and reinforcement learning components.