Module 2: Deep Learning and
Architectures in VLSI
Syllabus
• Deep Learning Overview:
• Artificial Neural Networks,
• Convolutional Neural Networks (CNN), and
• Recurrent Neural Networks (RNN).
• Training Deep Neural Networks: Backpropagation and Optimization
Techniques.
• Introduction to TensorFlow/PyTorch.
• VLSI Application: Utilizing CNNs for VLSI layout pattern recognition
and classification.
Artificial Neural Network
• Inspired by the structure of the human brain, neural networks have
established a reputation for successfully learning complex tasks such
as object recognition in images, automatic speech recognition (ASR),
machine translation, image captioning, and video classification.
• As the name suggests, the design of Artificial Neural Networks
(ANNs) is inspired by the human brain. Although not as powerful as
the brain (yet), artificial neural networks are among the most powerful
learning models in the field of machine learning.
• To see where the inspiration comes from, first consider the basic
structure of the brain, the structure of a neuron, and how information
travels through neurons.
Artificial neural networks (ANNs) provide a general, practical method
for learning real-valued, discrete-valued, and vector-valued target
functions from examples.
Examples:
• Real-valued: sales price, percentage, weather forecast, stock market prediction
• Discrete-valued: Yes/No
• Vector-valued: an array of outputs
• The study of artificial neural networks (ANNs) has been inspired by the
observation that biological learning systems are built of very complex webs of
interconnected Neurons.
• The human information processing system is built from brain neurons: the basic
building-block cells that communicate information to and from various parts of the body.
• The simplest model of a neuron is a threshold unit, also called a processing
element (PE). It collects inputs and produces an output if the sum of the inputs exceeds
an internal threshold value.
• ANNs are built out of a densely interconnected set of simple units, where each unit
takes a number of real-valued inputs (possibly the outputs of other units) and
produces a single real-valued output (which may become the input to many other
units).
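The threshold unit described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the weights, and the threshold value are hypothetical choices made only for the example.

```python
def threshold_unit(inputs, weights, threshold):
    """Return 1 when the weighted sum of the inputs exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0


# Hypothetical weights and threshold, chosen only for illustration.
weights = [0.4, 0.9, 0.3]
fires = threshold_unit([1.0, 0.0, 1.0], weights, threshold=0.5)   # sum is about 0.7
silent = threshold_unit([0.0, 0.0, 1.0], weights, threshold=0.5)  # sum is 0.3
print(fires, silent)
```

Each unit produces a single output from several inputs, which is what allows units to be chained into larger networks.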
Biological Motivation
Properties of artificial neural networks (ANN's)
o Many neuron-like threshold switching units
o Many weighted interconnections among units
o Highly parallel, distributed process
o Emphasis on tuning weights automatically
• Input is high-dimensional and discrete or real-valued (e.g., sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
1. Phoneme recognition
2. Classification
3. Prediction
Appropriate Problems for Neural Network Learning
1. Instances are represented by many attribute-value pairs.
2. The target function output may be discrete-valued, real-valued, or a vector of
several real- or discrete-valued attributes.
3. The training examples may contain errors.
4. Long training times are acceptable.
5. Fast evaluation of the learned target function may be required
6. The ability of humans to understand the learned target function is not important
ANN versus BNN
Model of ANN
• Universal Function Approximators
Types of Activation Function
1. Threshold Activation Function (binary step function)
2. Sigmoid Activation Function (logistic function)
3. Hyperbolic Tangent Function (tanh)
4. Rectified Linear Unit (ReLU)
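The four activation functions listed above can be compared side by side with a short Python sketch. It uses only the standard library; the function names are chosen here for illustration, and the step function follows the convention "1 if the input is greater than 0, else 0".

```python
import math


def step(z):
    """Binary step: 1 for positive input, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    """Logistic function, squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Hyperbolic tangent, squashes any real input into (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Rectified linear unit: passes positive input, clips negative input to 0."""
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(z, step(z), round(sigmoid(z), 4), round(tanh(z), 4), relu(z))
```

The step function is the only one of the four that is not useful for gradient-based training, because its slope is zero everywhere except at the jump.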
Network layers
An artificial neural network can be divided into three parts (layers), which are known as:
• Input layer: This layer receives information (data), signals, features,
or measurements from the external environment. These inputs are usually normalized
to the range of values produced by the activation functions.
• Hidden, intermediate, or invisible layers: These layers are composed of neurons which
are responsible for extracting patterns associated with the process or system being
analysed. These layers perform most of the internal processing of the network.
• Output layer: This layer is also composed of neurons, and is responsible for
producing and presenting the final network outputs, which result from the processing
performed by the neurons in the previous layers.
Architectures of Artificial Neural Network layers
The main architectures of artificial neural networks, considering the neuron
disposition, how the neurons are interconnected, and how the layers are composed, can
be divided as follows:
1. Single-layer feedforward Network
2. Multi-layer feedforward Network
3. Recurrent or Feedback Network
4. Mesh Networks
1. Single-layer feedforward Network
• This artificial neural network has
just one input layer and a single
neural layer, which is also the
output layer.
• The figure illustrates a single-layer
feedforward network composed of n
inputs and m outputs.
• The information always flows in a
single direction (thus,
unidirectional), which is from the
input layer to the output layer.
2. Multi-layer feedforward Network
• Feedforward networks with multiple layers
are composed of one or more hidden neural
layers.
• The figure shows a feedforward network with
multiple layers: one input layer with n sample
signals, two hidden neural layers consisting
of n1 and n2 neurons respectively, and one
output neural layer composed of m neurons
representing the respective output values of
the problem being analyzed.
3. Recurrent or Feedback Network
• In these networks, the outputs of
the neurons are used as feedback
inputs for other neurons.
• Figure illustrates an example of a
Perceptron network with feedback,
where one of its output signals is
fed back to the middle layer.
4. Mesh Networks
• The main features of networks with mesh
structures reside in considering the spatial
arrangement of neurons for pattern
extraction purposes, that is, the spatial
localization of the neurons is directly
related to the process of adjusting their
synaptic weights and thresholds.
• The figure illustrates an example of the
Kohonen network, where the neurons are
arranged within a two-dimensional space.
Types of NN
Perceptron
• The perceptron takes a weighted sum of multiple inputs (along
with a bias) as the cumulative input and applies a step
function to it. The perceptron “fires” (returns 1) if the
cumulative input is positive and "stays dormant" (returns 0)
otherwise.
• Note that there are different ways to define the step function. A common choice is:
• y = 1 if x > 0
• y = 0 if x <= 0
• One can also use 1 and -1 instead of 1 and 0.
• The input to a perceptron is the sum of the weights multiplied by their respective inputs, plus the bias:
$z = w_1 x_1 + w_2 x_2 + \dots + w_k x_k + b$
• In terms of vectors and matrices, say w and x are vectors representing the weights and inputs as follows (note
that, by default, a vector is assumed to be a column vector):
$w = [w_1, w_2, \dots, w_k]^T, \quad x = [x_1, x_2, \dots, x_k]^T$
• A neat and concise way to represent the weighted sum of w and x is the dot product of $w^T$ and x.
• The transpose of w is $w^T = [w_1, w_2, \dots, w_k]$, a row vector of size 1 x k. Taking the dot product
of $w^T$ with x:
$w^T x = w_1 x_1 + w_2 x_2 + \dots + w_k x_k$
• Upon adding the bias to $w^T x$:
$z = w^T x + b$
• Upon applying the step function, if this cumulative input is > 0, the output is 1 (yes), else 0 (no).
Perceptron as a Classifier
• The weights represent the importance of the corresponding
feature for classification.
• The sign function is similar to the step
function: it outputs +1 when the input is greater than 0 and
-1 otherwise. In a binary classification setting, +1 and -1
represent the two classes.
• The following simple exercise will help you better understand how a
perceptron works.
• Consider a simple binary classification task.
• Consider the decision of whether to go to the sushi place being taken by a
perceptron model. The following factors affect the decision to go/not
go: Distance, Cost and Company.
• These three variables are inputs to the perceptron.
• Suppose the inputs can be only 0/1 and the weights you assign to
each variable add up to 1.
• A sample set of weights can be $w = [0.5, 0.2, 0.3]^T$.
• Each input is mapped to 1 or 0 by a rule. These are arbitrary mappings that you have
decided on to make your model simpler.
• Assume that the bias value is -0.7. The sushi place is 5 km away and 3 of your friends are ready to accompany you.
Also, the cost for 2 is INR 2500. What is the input vector for the above-mentioned sushi place?
$\text{Input vector} = [\text{Distance}, \text{Cost}, \text{Company}]^T$
• The place is within 8 km, hence 1 for distance. The cost of INR 2500 for two falls on the 0 side of the cost rule, hence 0 for the cost. Since 3 friends are
accompanying you, the company is 1.
Ans: $\text{Input vector} = [1, 0, 1]^T$
• Consider an arbitrary set of weights to give to each of the factors. What do you get as the decision after applying the
step function on the cumulative input?
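The exercise can be checked numerically with a minimal sketch. It assumes the sample weights 0.5, 0.2 and 0.3 apply to Distance, Cost and Company in that order (the order is an assumption, since the slides do not state it), and it uses the bias of -0.7 and the input vector [1, 0, 1] from the example.

```python
def perceptron(x, w, b):
    """Return (decision, cumulative_input) using the step function y = 1 if z > 0 else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (1 if z > 0 else 0), z


x = [1, 0, 1]        # distance, cost, company (from the worked answer)
w = [0.5, 0.2, 0.3]  # sample weights; the feature order is assumed
b = -0.7             # bias from the exercise

decision, z = perceptron(x, w, b)
print(decision, round(z, 2))
```

Under these assumptions the cumulative input is about 0.1, which is positive, so the perceptron outputs 1 (go). A different assignment of the same three weights to the features could flip the decision, which is why the order matters.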
Binary Classification using Perceptron
Perceptron Problem Statement
• We need to find the correct w and b such that $w^T x + b > 0$ for all
points where y = +1 and $w^T x + b < 0$ for all points where y = −1.
• Note that the step function used is defined as
$y = 1$ if $x > 0$
$y = -1$ if $x \le 0$
• A certain set (w, b) is a valid separator if $y(w^T x + b) > 0$ for all the
data points.
• It is not a valid separator if $y(w^T x + b) < 0$ for any one of the data points.
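The validity condition can be tested directly in code. The sketch below is illustrative only: the function name and the four labelled points are made up for the example, and a point lying exactly on the boundary is treated as not separated.

```python
def is_valid_separator(w, b, points, labels):
    """Return True when y * (w . x + b) > 0 holds for every labelled point."""
    for x, y in zip(points, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin <= 0:
            return False
    return True


# Hypothetical 2-D data set with labels +1 and -1.
points = [(2.0, 2.0), (3.0, 1.0), (-1.0, -1.0), (-2.0, 0.0)]
labels = [1, 1, -1, -1]

good = is_valid_separator([1.0, 1.0], 0.0, points, labels)
bad = is_valid_separator([1.0, -1.0], 0.0, points, labels)
print(good, bad)
```

Multiplying by the label y folds the two inequalities (one per class) into a single check, which is why the condition is usually written as one expression.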
Multiclass Classification using Perceptrons
Neuron
• With an understanding of perceptrons, you are now
equipped to study the design of Artificial Neural Networks
(ANNs).
• Neural networks are a collection of artificial neurons
arranged in a particular structure.
• In this segment, you will understand how an artificial
neuron works, i.e. how it converts inputs into outputs.
• You will also understand the topology or the structure of
large neural networks.
• A neuron is very similar to a perceptron, the
only difference being that there is an activation function
applied to the weighted sum of inputs.
• In perceptrons, the activation function is the step function,
whereas in artificial neural networks it can be any non-linear
function.
Working of Neuron
• Let's now look at how large neural networks are designed using multiple individual neurons.
• Neurons in a neural network are arranged in layers.
• The first and the last layer are called the input and output layers.
• Input layers have as many neurons as the number of attributes in the data set and the output layer has
as many neurons as the number of classes of the target variable (for a classification problem).
• For a regression problem, the number of neurons in the output layer would be 1 (a numeric variable).
Assumptions made to Simplify Neural Networks
Since large neural networks can potentially have extremely complex
structures, certain assumptions are made to simplify the way information
flows in them. Assumptions are:
1. Neurons are arranged in layers and the layers are arranged
sequentially.
2. Neurons within the same layer do not interact with each other.
3. All the inputs enter the network through the input layer and all
the outputs go out of the network through the output layer.
4. Neurons in consecutive layers are densely connected, i.e.
all neurons in layer l are connected to all neurons in layer l+1.
5. Every interconnection in the neural network has a
weight associated with it, and every neuron has a bias associated
with it.
6. All neurons in all layers use the same activation function.
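Under these six assumptions, information flow through the network reduces to a loop over layers. The following is a minimal sketch: the layer sizes, weights and biases are hypothetical values, and the sigmoid is used as the single shared activation function purely as an example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """One densely connected layer; weights[j][i] links input i to neuron j."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers, activation=sigmoid):
    """Pass x through the layers in sequence, using the same activation everywhere."""
    for weights, biases in layers:
        x = dense_layer(x, weights, biases, activation)
    return x


# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [0.0, 0.1]),
    ([[0.7, -0.8]], [0.05]),
]
output = forward([1.0, 0.5, -1.0], layers)
print(output)
```

Every interconnection has a weight and every neuron has a bias, so a layer with n inputs and m neurons carries n × m weights plus m biases.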
Convolutional Neural Network
• Convolutional Neural Networks, or CNNs, are specialized architectures which work
particularly well with visual data, i.e. images and videos.
• They have been largely responsible for revolutionizing 'deep learning' by setting new
benchmarks for many image processing tasks that were very recently considered
extremely hard.
• Although the vanilla neural networks (MLPs) can learn extremely complex functions,
their architecture does not exploit what we know about how the brain reads and processes
images.
• For this reason, although MLPs are successful in solving many complex problems, they
haven't been able to achieve any major breakthroughs in the image processing domain.
• In deep learning, a convolutional neural network (CNN/ConvNet) is a class of deep neural
networks, most commonly applied to analyze visual imagery.
• The CNN architecture uses a special technique called Convolution instead of relying solely on
matrix multiplications like traditional neural networks.
• Convolutional networks use a process called convolution, which combines two functions to
show how one changes the shape of the other.
• The bottom line is that the role of the convolutional networks is to reduce the images into a
form that is easier to process, without losing features that are critical for getting a good
prediction.
• A Convolutional Neural Network is a specialized type of neural network designed to process
data with a grid-like topology, such as images. CNNs are particularly effective at capturing
spatial hierarchies in images through their layered architecture, which consists of multiple types
of layers that transform the input data into meaningful representations.
Convolutional Neural Network-Colour Image
Mathematically, the convolution operation is the summation of the
element-wise product of two matrices. Let’s take two matrices, X and Y. If
you 'convolve the image X using the filter Y', this operation will produce
the matrix Z. Let’s first take the case where X and Y have the same dimension.
Figure: Colour image represented as a matrix.
Let’s see another case when Image size is
5x5 and filter size is 3x3.
The basic idea of filters is to detect desired features (such
as vertical or horizontal edges) through convolution.
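Both cases, equal-sized matrices and a 3x3 filter sliding over a 5x5 image, can be sketched with one small function. The image and filter values below are made up for illustration (an image with a vertical edge and a simple vertical-edge filter), and the function assumes square inputs with no padding.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the element-wise products (no padding)."""
    n, f = len(image), len(kernel)
    out_size = (n - f) // stride + 1
    output = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = 0
            for i in range(f):
                for j in range(f):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5x5 image: bright on the left, dark on the right (a vertical edge).
image = [[10, 10, 10, 0, 0] for _ in range(5)]
# Hypothetical 3x3 vertical-edge filter.
kernel = [[1, 0, -1] for _ in range(3)]

feature_map = convolve2d(image, kernel)
same_size = convolve2d(kernel, kernel)  # X and Y of the same dimension give a 1x1 result
print(feature_map)
print(same_size)
```

The output is 3x3, and its non-zero columns mark where the filter window straddles the edge. When the image and filter have the same size, the whole operation collapses to a single number.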
Input, Filter and Output Dimension?
• In convolution, the depth (number of channels) of each filter should match the depth of the image.
• Say we have a greyscale image of size 10x10 with a depth of 1; the filter also needs
to have a depth of 1, with a size of, say, 5x5.
• In the case of a colour image with RGB channels, say an image of size 10x10x3 where 3
is the depth, we need a filter of dimension 5x5x3, so that the depth of 3 is present in
both the image and the filter.
• Note also that the result of convolving with one filter is a 2D array/matrix, irrespective
of the depth of the input. So, if we use multiple filters, there will be multiple 2D
arrays/matrices, and we can stack them up.
• Filter size in a CNN is chosen based on input image resolution, task complexity, and
computational constraints. Small filters such as 3×3 are preferred as they capture local
features efficiently, reduce parameters, and allow deeper architectures. Larger filters are
typically used only in early layers, where a wider receptive field helps capture broader context.
Convolution- Stride and Padding
Stride: Stride is the number of pixels by which we move the filter (both
horizontally and vertically) to perform the convolution operation.
Padding: Padding is the number of pixels we add all around
the image. In the figure, a padding of 1 is used.
Figure: Stride of 2 and padding of 1.
Without padding, stride causes loss of
boundary information and reduces
output size, while with padding, stride
allows controlled down-sampling
without losing edge information.
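The effect of stride and padding on output size can be made concrete with the standard output-size relation. As a worked sketch, assume a square n×n input, an f×f filter, stride s and padding p; these symbols are introduced here for illustration and do not appear in the slides. The numbers reuse the 5x5 image and 3x3 filter from the earlier example.

```latex
\[
  n_{\text{out}} = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1
\]
% 5x5 image, 3x3 filter, stride 2, padding 1:
\[
  n_{\text{out}} = \left\lfloor \frac{5 + 2 \cdot 1 - 3}{2} \right\rfloor + 1 = 3
\]
% Same image and filter, stride 2, no padding:
\[
  n_{\text{out}} = \left\lfloor \frac{5 - 3}{2} \right\rfloor + 1 = 2
\]
```

With padding, the stride-2 output keeps a position centred on each border region; without it, the output shrinks further and the border pixels contribute to fewer windows.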
CNN-Feature Map
The term 'feature map' refers to the
(non-linear) output of the activation
function, not what goes into the
activation function (i.e. the output of the
convolution). Generally, ReLU is used
as an activation function, except in the
last layer where we use SoftMax
activation for classification.
Pooling
• After extracting features (as feature maps), CNNs typically aggregate these features using the pooling
layer to make the representation more compact.
• We already used padding to make the width and height of feature map same as that of input.
• But we also need compact representation of the feature map. We take aggregate over a patch of
feature map to get the output of pooling.
• Most popular are ‘Max Pooling’ and ‘Average Pooling’.
• Max pooling is more popular than average pooling, as it has been shown to work better in most cases.
• Max pooling: If any one of the patches says something strongly about the presence of a certain
feature, then the pooling layer counts that feature as 'detected’.
• Average pooling: If one patch says something very firmly but the other ones disagree, the pooling
layer takes the average to find out.
Max Pooling
• Max pooling selects the maximum element
from the region of the feature map covered by
the filter. Thus, the output after max-pooling
layer would be a feature map containing the
most prominent features of the previous
feature map.
• Max pooling layer preserves the most
important features (edges, textures, etc.) and
provides better performance in most cases.
Average Pooling
• Average pooling computes the average of the
elements present in the region of the feature map
covered by the filter. Thus, while max pooling
gives the most prominent feature in a
particular patch of the feature map, average
pooling gives the average of the features present in
a patch.
• Average pooling provides a more generalized
representation of the input. It is useful in
cases where preserving the overall context is
important.
• Worked example from the figure, with 4 samples per patch:
(7 + 3 + 6 + 1) / 4 = 4.25
(2 + 2 + 9 + 4) / 4 = 4.25
(8 + 5 + 3 + 1) / 4 = 4.25
(2 + 4 + 2 + 6) / 4 = 3.5
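Max and average pooling can be compared on the same numbers with a short sketch. The four 2x2 patches are the ones from the average pooling example; how they are laid out inside a 4x4 feature map is an assumption made here, since the figure itself is not available.

```python
def pool2d(feature_map, size, stride, reduce_fn):
    """Apply reduce_fn to each size x size patch, moving by stride each step."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for r in range(rows):
        pooled_row = []
        for c in range(cols):
            patch = [
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ]
            pooled_row.append(reduce_fn(patch))
        pooled.append(pooled_row)
    return pooled


def average(values):
    return sum(values) / len(values)


# Assumed layout: patches (7,3,6,1), (2,2,9,4), (8,5,3,1), (2,4,2,6) as 2x2 blocks.
feature_map = [
    [7, 3, 2, 2],
    [6, 1, 9, 4],
    [8, 5, 2, 4],
    [3, 1, 2, 6],
]

max_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=max)
avg_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=average)
print(max_pooled)
print(avg_pooled)
```

Three of the four patches have the same average (4.25) but different maxima (7, 9 and 8), which illustrates how average pooling smooths over differences that max pooling preserves.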
Key Components of CNN Architecture
• 1. Input Layer: The input layer accepts raw pixel values from an image, typically represented as a three-dimensional tensor (height,
width, depth). For instance, a color image with dimensions 100×100 pixels would have a shape of
100×100×3 (where 3 represents the RGB color channels).
• Example: 224×224×3 (RGB image)
• 2. Convolutional Layers: These layers are the core building blocks of CNNs. They apply convolution operations using filters (also
known as kernels) that slide over the input image to detect features such as edges, textures, and shapes.
• Each filter produces a feature map that highlights specific patterns in the input data. The convolution operation involves computing the dot product between the
filter and a local region of the input image.
• What is learned: Edges, Corners, Textures
• 3. Activation Function: After convolution, an activation function is applied to introduce non-linearity into the model. The Rectified
Linear Unit (ReLU) is commonly used due to its effectiveness in accelerating convergence during training.
• Output depth of a convolutional layer = number of filters used. Example: 224×224×3 input → 32 filters → 224×224×32 output (with padding that preserves height and width)
• 4. Pooling Layers: Pooling layers reduce the spatial dimensions of feature maps while retaining essential information. This down
sampling process helps decrease computational load and mitigate overfitting.
• Common pooling techniques include Max Pooling (selecting the maximum value from a region) and Average Pooling (calculating the average value).
• Purpose: Reduces spatial size, Makes features translation invariant, Reduces computation and overfitting
• Flatten Layer: Converts 3D feature maps into a 1D vector and Acts as a bridge between convolution layers and dense layers
• 5. Fully Connected Layers: Towards the end of the network, fully connected layers are used to combine features learned from
previous layers and make final predictions. Each neuron in these layers is connected to every neuron in the preceding layer.
• 6. Output Layer: The output layer generates predictions based on the learned features. For classification tasks, this layer typically
employs a softmax activation function to produce probabilities for each class.
• A CNN consists of convolution and pooling layers for hierarchical feature extraction followed by fully
connected layers for classification. The convolution layers learn spatial features, pooling reduces
dimensionality, and the softmax layer produces probabilistic class predictions.
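To see how these components fit together, it helps to trace the tensor shape through a small stack of layers. The sketch below is a shape calculator only, not a trained model, and the particular stack (two 3x3 convolutions with padding 1, each followed by 2x2 pooling) is a hypothetical architecture chosen for illustration.

```python
def conv_shape(shape, num_filters, f, stride=1, padding=0):
    """Output shape (height, width, depth) of a convolutional layer."""
    h, w, _ = shape
    out_h = (h + 2 * padding - f) // stride + 1
    out_w = (w + 2 * padding - f) // stride + 1
    return (out_h, out_w, num_filters)  # depth equals the number of filters


def pool_shape(shape, size=2, stride=2):
    """Output shape of a pooling layer; depth is unchanged."""
    h, w, d = shape
    return ((h - size) // stride + 1, (w - size) // stride + 1, d)


def flatten_size(shape):
    """Length of the 1D vector produced by the flatten layer."""
    h, w, d = shape
    return h * w * d


shape = (224, 224, 3)                                   # RGB input image
after_conv1 = conv_shape(shape, 32, f=3, padding=1)     # (224, 224, 32)
after_pool1 = pool_shape(after_conv1)                   # (112, 112, 32)
after_conv2 = conv_shape(after_pool1, 64, f=3, padding=1)
after_pool2 = pool_shape(after_conv2)
vector_length = flatten_size(after_pool2)
print(after_conv1, after_pool1, after_conv2, after_pool2, vector_length)
```

Convolution changes the depth, pooling halves the height and width, and flattening turns the final feature maps into the vector that feeds the fully connected layers.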
Operation of CNNs
• The operation of CNNs can be summarized in several key steps:
• 1. Convolution Operation: Filters slide over the input image to produce feature maps that
capture various aspects of the image.
• 2. Activation: Non-linear activation functions like ReLU are applied to introduce complexity
into the model.
• 3. Pooling: Spatial dimensions are reduced through pooling layers, which help retain important
features while simplifying computations.
• 4. Hierarchical Feature Learning: As data passes through multiple layers, CNNs learn
increasingly complex features—from simple edges in early layers to intricate shapes and
objects in deeper layers.
• 5. Classification: The final fully connected layer processes these features to classify or predict
outcomes based on learned patterns.
Advantages of Convolutional Neural Networks
• 1. Automatic Feature Extraction: Unlike traditional machine learning methods that
require manual feature engineering, CNNs automatically learn relevant features from
raw data.
• 2. Translation Invariance: CNNs can recognize objects regardless of their position in
an image due to their hierarchical structure and pooling operations.
• 3. Reduced Computational Complexity: By using shared weights in convolutional
layers, CNNs significantly reduce the number of parameters compared to fully
connected networks.
• 4. Scalability: CNN architectures can be scaled up with additional layers and filters to
improve performance on complex tasks.
• Despite their advantages, CNNs face several challenges:
• 1. Data Requirements: Training CNNs requires large labeled datasets to achieve
high accuracy; insufficient data can lead to overfitting.
• 2. Computational Resources: Training deep CNNs can be computationally
intensive and may require specialized hardware such as GPUs.
• 3. Interpretability: Understanding how CNNs make decisions can be challenging
due to their complex architectures, leading to concerns about transparency in
critical applications like healthcare or autonomous vehicles.
Applications
CNNs have found applications across various domains:
• 1. Image Classification: Used extensively in identifying objects within images for applications
ranging from social media tagging to medical imaging diagnostics.
• 2. Object Detection: Techniques like YOLO (You Only Look Once) utilize CNNs for real-time
object detection in videos and images.
• 3. Facial Recognition: Employed in security systems for identifying individuals based on facial
features.
• 4. Autonomous Vehicles: Used for recognizing road signs, pedestrians, and other vehicles in real-
time for navigation purposes.
• 5. Medical Imaging Analysis: Applied in diagnosing diseases by analyzing X-rays, MRIs, and CT
scans with high accuracy.
Recurrent Neural Networks (RNN)
• Traditional ANNs and CNNs assume independence between inputs.
However, many real-world problems are sequential, where past information matters:
• Time-series forecasting
• Speech recognition
• Natural Language Processing (NLP)
• Video analysis
• DNA / genome sequences
• RNNs are designed to handle sequential data by maintaining memory of past inputs.
• An RNN processes inputs one time step at a time and shares parameters across time.
• It has a hidden state that acts as memory
• Current output depends on:
• Current input
• Previous hidden state
Recurrent Neural Networks (RNNs)
An Introduction to Sequence-Based Deep Learning
• Recurrent Neural Networks (RNNs) are a class
of artificial neural networks designed for
sequential data.
• Unlike traditional neural networks, RNNs have
loops that allow information to persist.
• Used in applications like speech recognition,
language modeling, and time-series forecasting.
• Same weights are reused at each time step
Applications of RNN
• Speech recognition
• Language modeling
• Machine translation
• Stock price prediction
• Weather forecasting
• Handwriting recognition
Advantages
✔ Handles variable-length sequences
✔ Temporal dependency modeling
✔ Parameter sharing reduces model size
Limitations
❌ Slow training
❌ Vanishing gradients
❌ Difficult to parallelize
Working of RNN
• RNNs process input sequentially and maintain a hidden state that captures past information.
• The same weights are applied at each time step.
• Hidden state is updated recursively based on previous state and current input.
• Information from previous steps is retained, allowing the network to learn temporal
dependencies.
• RNNs introduce memory via hidden states
• RNNs are trained using Backpropagation Through Time (BPTT)
• Plain RNNs suffer from vanishing gradients
• LSTM and GRU are improved RNN variants
• RNNs are widely used in NLP and time-series tasks
Mathematical Model of RNN
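• The following equations show how the hidden state and output of a Recurrent
Neural Network (RNN) are computed at each time step.
The states of the RNN are calculated as follows:
Updated state: $h_t = f(h_{t-1}, x_t)$
where $h_t$ is the current state, $h_{t-1}$ is the previous state, and $x_t$ is the input at time step t.
Applying the activation function (typically tanh): $h_t = \tanh(w_{hh} h_{t-1} + w_{sh} x_t)$
where $w_{hh}$ is the weight at the recurrent neuron and $w_{sh}$ is the weight at the input neuron.
Output state: $y_t = w_{hy} h_t$
where $w_{hy}$ is the weight of the output layer.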
• In an RNN, the hidden state at each time step is computed using
the current input and the previous hidden state, enabling the
network to model sequential and temporal dependencies.
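The recurrence can be unrolled in a few lines of Python. This is a scalar sketch under stated assumptions: each state, input and weight is a single number rather than a vector or matrix, tanh is assumed as the activation function, the initial state is 0, and the weight values are hypothetical.

```python
import math


def rnn_forward(inputs, w_hh, w_sh, w_hy, h0=0.0):
    """Run a scalar RNN over a sequence, reusing the same weights at every time step."""
    h = h0
    hidden_states, outputs = [], []
    for x_t in inputs:
        h = math.tanh(w_hh * h + w_sh * x_t)  # new state from previous state and input
        hidden_states.append(h)
        outputs.append(w_hy * h)              # output read from the current state
    return hidden_states, outputs


# Hypothetical weights and a three-step input sequence.
hidden_states, outputs = rnn_forward([1.0, 0.5, -1.0], w_hh=0.5, w_sh=1.0, w_hy=2.0)
for t, (h, y) in enumerate(zip(hidden_states, outputs), start=1):
    print(t, round(h, 4), round(y, 4))
```

The second hidden state depends on the first input even though that input is no longer fed in, which is the sense in which the hidden state acts as memory.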
Backpropagation Models in Neural Networks
1. Standard Backpropagation in Feedforward Networks
2. Backpropagation Through Time (BPTT) for RNNs
Recurrent Neural Networks (RNNs) involve time dependencies. The BPTT algorithm extends standard
backpropagation by unfolding the network over time.
3. Backpropagation in LSTMs (Long Short-Term Memory)
• LSTMs mitigate the vanishing gradient problem in RNNs using gates (forget, input, and output).
• During backpropagation, gradients are computed for: Cell state, Gates (forget, input, output), Weights and biases
4. Backpropagation in GRUs (Gated Recurrent Units)
Compare RNN, LSTM and GRU
Model                    | Structure              | Pros                            | Cons
Standard Backpropagation | Fully connected layers | Works for feedforward networks  | No memory of past inputs
BPTT (RNNs)              | Loops over time steps  | Works for sequences             | Vanishing/exploding gradients
LSTMs                    | Memory cells & gates   | Solves vanishing gradient issue | More complex, computationally heavy
GRUs                     | Fewer gates than LSTMs | Similar performance as LSTMs    | Less interpretable than LSTMs
Optimization Techniques
• Optimization is the process of adjusting model parameters (weights and biases) to minimize
the loss function. There are two major categories of optimization techniques:
1. Gradient-Based Optimization: uses gradients to update parameters.
2. Adaptive Optimization: modifies the learning rate dynamically for better convergence.
1. Gradient-Based Optimization
(a) Gradient Descent (GD)
Gradient Descent is the fundamental optimization technique in machine learning and deep learning. It
updates parameters using:
$\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
where $\theta$ denotes the parameters, $\eta$ the learning rate, and $J(\theta)$ the loss function.
(b) Variants of Gradient Descent
1. Batch Gradient Descent
• Computes gradients using the entire dataset.
• Pros: Stable convergence.
• Cons: Slow for large datasets.
2. Stochastic Gradient Descent (SGD)
• Updates parameters using a single random sample.
• Pros: Fast updates, good for large datasets.
• Cons: Noisy updates, may not converge smoothly.
3. Mini-Batch Gradient Descent
• Uses a small batch of data for updates.
• Pros: Balances efficiency and stability.
• Cons: Still needs tuning of batch size.
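The three variants differ only in how many samples feed each update, which a single training loop with a batch-size parameter can show. The sketch below is a toy setup with assumptions stated up front: a one-parameter model y_hat = w * x, a mean squared error loss, noise-free data generated from y = 2x, and hypothetical values for the learning rate and epoch count.

```python
import random


def gradient(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Gradient descent where batch_size selects the variant."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            w -= learning_rate * gradient(w, batch)
    return w


data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

w_batch = train(data, batch_size=len(data))  # batch gradient descent
w_sgd = train(data, batch_size=1)            # stochastic gradient descent
w_mini = train(data, batch_size=2)           # mini-batch gradient descent
print(w_batch, w_sgd, w_mini)
```

Because the toy data contains no noise, all three variants settle on w close to 2. On real data the single-sample updates of SGD would jitter around the minimum, which is the "noisy updates" drawback listed above.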
2. Adaptive Optimization Techniques
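• Adaptive optimization techniques modify the learning rate dynamically for each parameter,
improving efficiency and convergence. Momentum and NAG are grouped here because they reshape
the update using past gradients, even though they keep a single global learning rate.
In the formulas below, $g_t = \nabla_\theta J(\theta)$ is the gradient at step t, $\eta$ is the learning rate, and $\epsilon$ is a small constant that avoids division by zero.
(a) Momentum
• Momentum helps accelerate gradient descent by considering past gradients:
$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta), \quad \theta \leftarrow \theta - v_t$
(b) Nesterov Accelerated Gradient (NAG)
An improvement over momentum, NAG anticipates future gradients by evaluating the gradient at the look-ahead position:
$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1}), \quad \theta \leftarrow \theta - v_t$
(c) AdaGrad (Adaptive Gradient Algorithm)
• AdaGrad adapts the learning rate for each parameter based on the accumulated squares of past gradients:
$G_t = G_{t-1} + g_t^2, \quad \theta \leftarrow \theta - \frac{\eta}{\sqrt{G_t} + \epsilon} g_t$
(d) RMSProp (Root Mean Square Propagation)
• RMSProp improves AdaGrad by using an exponentially decaying average of
past squared gradients:
$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2, \quad \theta \leftarrow \theta - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} g_t$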
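The four update rules can be compared on a toy problem. The sketch below minimizes J(theta) = theta^2, whose gradient is 2 * theta, starting from theta = 1. The hyperparameter values, the variable names, and the placement of the small constant eps outside the square root are assumptions made for this illustration; libraries differ in such details.

```python
import math


def grad(theta):
    return 2.0 * theta  # gradient of J(theta) = theta^2


def momentum(theta, steps, lr=0.1, gamma=0.9):
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)  # accumulate past gradients
        theta -= v
    return theta


def nesterov(theta, steps, lr=0.1, gamma=0.9):
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad(theta - gamma * v)  # gradient at look-ahead point
        theta -= v
    return theta


def adagrad(theta, steps, lr=0.5, eps=1e-8):
    g_sum = 0.0
    for _ in range(steps):
        g = grad(theta)
        g_sum += g * g  # running sum of squared gradients
        theta -= lr * g / (math.sqrt(g_sum) + eps)
    return theta


def rmsprop(theta, steps, lr=0.01, rho=0.9, eps=1e-8):
    avg_sq = 0.0
    for _ in range(steps):
        g = grad(theta)
        avg_sq = rho * avg_sq + (1 - rho) * g * g  # decaying average of squared gradients
        theta -= lr * g / (math.sqrt(avg_sq) + eps)
    return theta


results = {
    "momentum": momentum(1.0, 500),
    "nesterov": nesterov(1.0, 500),
    "adagrad": adagrad(1.0, 500),
    "rmsprop": rmsprop(1.0, 500),
}
for name, value in results.items():
    print(name, value)
```

All four end close to the minimum at 0. RMSProp with a fixed learning rate tends to hover within a small band around the minimum rather than converge exactly, because its normalized step does not shrink as the gradient shrinks.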
Introduction to TensorFlow/PyTorch
• PyTorch and TensorFlow are both popular deep learning frameworks. They are
used to build artificial neural networks that can learn and make decisions.
          | PyTorch                                          | TensorFlow
Known for | Flexibility, ease of use                         | Scalability, optimization
Good for  | Research, prototyping, dynamic projects          | Large-scale applications, production
Features  | Dynamic computation graph, Pythonic nature       | Strong visualization capabilities
Community | Rapidly growing, especially popular in academia  | Large community with extensive corporate support
• A deep learning framework is a software library that provides:
• Tensor operations
• Automatic differentiation
• GPU/TPU acceleration
• Tools to build, train, and deploy neural networks efficiently
• Examples: TensorFlow, PyTorch, Keras, MXNet
TensorFlow
• TensorFlow is an open-source deep learning framework developed by Google Brain, used for
building and deploying machine learning and deep learning models at scale.
• Key Features of TensorFlow
• Uses tensors (multi-dimensional arrays)
• Supports static and dynamic computation graphs
• High performance on CPU, GPU, and TPU
• Strong support for production and deployment
• Integrates with Keras API
• Advantages of TensorFlow
• Excellent for large-scale deployment
• Supports mobile and embedded systems
• Strong visualization using TensorBoard
• Widely used in industry
• Limitations
• More complex syntax (especially TF 1.x)
• Less flexible than PyTorch for research
TensorFlow Architecture
• User Code (Python)
↓
• Keras / TensorFlow API
↓
• Computation Graph
↓
• Runtime Engine
↓
• CPU / GPU / TPU
PyTorch
• PyTorch is an open-source deep learning framework developed by
Facebook (Meta AI), widely used for research and experimentation.
• Key Features of PyTorch
• Uses dynamic computation graphs
• Pythonic and intuitive syntax
• Excellent for research and prototyping
• Strong debugging support
• GPU acceleration using CUDA
Advantages of PyTorch
• Easy to learn and debug
• Dynamic graphs (define-by-run)
• Preferred in academic research
• Cleaner code for complex models
Limitations
• Earlier versions had limited deployment support
• Slightly slower than TensorFlow in large-scale production
Applications of TensorFlow and PyTorch
• Image Classification (CNN)
• Natural Language Processing (RNN, Transformers)
• Speech Recognition
• Medical Image Analysis
• Autonomous Systems
• Quantum ML research (hybrid models)
VLSI Application: Utilizing CNNs for VLSI
layout pattern recognition and classification.
Machine Learning for Compact Lithographic Process Models
• Deep within any modern IC production flow is an obscure but
widespread technology at the intersection of nano-fabrication and
machine learning, called computational lithography.
The Lithographic Patterning Process
• The figure represents the process flow in photolithography, as
used in semiconductor manufacturing.
• It depicts the transformation of a layout through four stages:
mask creation, imaging, resist processing, and etching.
• Each step is associated with a transfer function that modifies the
information.
• The mask stage, $f_m$, maps the polygon shapes 𝓛 in a
design layout to a distribution of shapes on the photomask 𝓜.
• The image stage, $f_i$, maps the shapes on the
photomask 𝓜 to an image 𝓘 distributed within the thin resist film on
the wafer surface.
• The resist stage, $f_r$, transforms the image distribution 𝓘 into a
distribution of shapes 𝓡 representing regions protected by the
developed photoresist.
• The etch stage, $f_e$, transforms the shapes of regions protected
by photoresist 𝓡 into the final etched shapes transferred into the
wafer substrate.
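The four stages compose into a single mapping from layout to etched pattern, which a toy pipeline can illustrate. Everything inside the stage functions below is a placeholder assumption, not a physical model: the mask and etch stages copy their input, the image stage uses a 3x3 box blur as a crude stand-in for optical diffraction, and the resist stage applies a fixed threshold.

```python
def mask_stage(layout):
    """Placeholder mask stage: copy the layout onto the mask unchanged."""
    return [row[:] for row in layout]


def image_stage(mask):
    """Placeholder image stage: 3x3 box blur, treating pixels outside the grid as 0."""
    n_rows, n_cols = len(mask), len(mask[0])
    image = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        total += mask[rr][cc]
            row.append(total / 9.0)
        image.append(row)
    return image


def resist_stage(image, threshold=0.5):
    """Placeholder resist stage: keep resist where the image intensity reaches the threshold."""
    return [[1 if value >= threshold else 0 for value in row] for row in image]


def etch_stage(resist):
    """Placeholder etch stage: transfer the resist pattern unchanged."""
    return [row[:] for row in resist]


def pattern(layout):
    """Compose the four stages: etch(resist(image(mask(layout))))."""
    return etch_stage(resist_stage(image_stage(mask_stage(layout))))


# Hypothetical layout: a 3x3 square inside a 5x5 grid.
layout = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
etched = pattern(layout)
for row in etched:
    print(row)
```

Even this crude blur-and-threshold chain rounds off the corners of the square, so the etched shape differs from the drawn layout. Predicting and correcting such differences is the kind of task that learned lithographic process models are applied to.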
Step 1: Layout 𝓛 → Mask 𝓜
The layout 𝓛 represents the original circuit design.
This design is converted into a photomask 𝓜 using a mask generation
process.
A fracture tool receives a layout $\mathcal{L} \in \{0, 1\}^2$ and converts it into instructions
for a mask-writing tool.
$f_m = T_m(\mathcal{L})$
A mask-writing tool receives instructions to direct an electron beam to
deliver energy in selected regions and doses onto a photoresist film
deposited on a mask blank.
After development, the photoresist selectively protects regions of the
mask blank. The mask is etched and materials are removed from the mask
blank in regions unprotected by photoresist.
The mask generation function 𝑻𝒎 accounts for resolution enhancements
and corrections.
Step 2: Mask 𝓜 → Image 𝓘
The photomask is used in a lithographic system, where light (or
another radiation source) passes through it.
The image 𝓘 is projected onto a photosensitive resist.
An exposure tool receives a photomask $\mathcal{M} \in \{0, \dots, N\}^2$.
The exposure tool illuminates the photomask, and the photomask
diffracts the light.
A projection lens collects a subset of the diffracted orders and forms an
image within a thin photoresist layer on the surface of a silicon wafer.
$f_i = T_i \mathcal{M}$
The imaging process 𝑻𝒊 is influenced by diffraction, optical
aberrations, and the exposure system.
Step 3: Image 𝓘 → Resist 𝓡
The projected image interacts with the photoresist 𝓡 (a light-sensitive material).
A series of photoresist processing tools receives the wafer, which has
been exposed to light energy in a three-dimensional distribution $\mathcal{I} \in \mathbb{R}^3$.
$f_r = T_r f_i$
The resist development process $T_r$ determines how the chemical properties of the resist change after
exposure.
This step defines the pattern in the resist, influenced by factors such as exposure dose and resist contrast.
Step 4: Resist 𝓡 → Etch ℰ
The developed resist pattern is used as a mask for the etching process ℰ.
Etching removes the exposed regions of the underlying material, transferring the
resist pattern into the wafer.
An etch reactor receives a wafer that is partially masked by a film of
photoresist, in a pattern defined by
$\mathcal{R} \in \{0, 1\}^2$ or $\{0, 1\}^3$.
The etch reactor bombards the wafer with etching particles that react with the
unmasked regions on the surface of the wafer substrate.
$f_e = T_e f_r$
The etch transfer function 𝑻𝒆 depends on etching selectivity and anisotropy.
• Final Process Equation
$f_e = T_e T_r T_i T_m \mathcal{L}$
This equation describes the cumulative effect of each step in
transferring the original layout into the final etched structure on a
semiconductor wafer. Each step introduces distortions and variations,
making process optimization crucial in semiconductor fabrication.
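The chain of transfer functions can be sketched as plain function composition. The four operators below are toy stand-ins chosen only for illustration (a pass-through mask step, a 3x3 box blur for imaging, a constant-threshold resist, and an ideal etch); they are assumptions, not the models used in a real process flow.

```python
# Toy stand-ins for the four stages; every operator here is a hypothetical
# simplification used only to show how the stages compose.

def t_mask(layout):
    """T_m: copy the layout onto the mask unchanged (no corrections applied)."""
    return [row[:] for row in layout]

def t_image(mask):
    """T_i: 3x3 box average with zero padding, a crude stand-in for diffraction blur."""
    h, w = len(mask), len(mask[0])
    image = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        total += mask[rr][cc]
            image[r][c] = total / 9.0
    return image

def t_resist(image, threshold=0.5):
    """T_r: constant-threshold resist model (1 = resist remains)."""
    return [[1 if v >= threshold else 0 for v in row] for row in image]

def t_etch(resist):
    """T_e: ideal etch that transfers the resist pattern unchanged."""
    return [row[:] for row in resist]

def pattern(layout):
    """f_e = T_e T_r T_i T_m L, applied right to left."""
    return t_etch(t_resist(t_image(t_mask(layout))))

# A 3x3 square drawn on a 5x5 grid.
layout = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
etched = pattern(layout)
for row in etched:
    print(row)
```

In this toy model the four corners of the square fall below the resist threshold and do not print, which is a small-scale picture of the corner rounding that the stages introduce.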
OPC (Optical Proximity Correction)
• OPC (Optical Proximity Correction) is a technique used in semiconductor manufacturing to
improve the accuracy of printed circuit features on a chip during photolithography.
• It's a critical process in photomask design, especially for advanced semiconductor nodes
where the resolution of the photolithography equipment cannot directly produce the intended
circuit features due to limitations in the optical process.
• The OPC model refers to the mathematical and computational methods used to simulate the
optical behavior of masks and predict how the printed features on the wafer will differ from
the original design.
• The goal is to modify the mask layout in such a way that the final printed features closely
match the intended design, correcting for optical distortions.
Why OPC is Necessary:
• Diffraction Effects: When light is projected through the mask,
diffraction occurs, causing the light to spread out and blur, which
distorts the pattern being projected.
• Short-Wavelength Light: As semiconductor technology advances,
shorter wavelengths of light (e.g., deep ultraviolet (DUV) or extreme
ultraviolet (EUV)) are used for smaller nodes, but the resolution
limit of the imaging optics still causes inaccuracies in the final pattern.
• Small Feature Sizes: As feature sizes get smaller, the inaccuracies due
to diffraction become more significant, making it difficult to directly
print the desired pattern using conventional masks.
How OPC Works:
The OPC model typically involves several key steps:
• Pattern Simulation: The first step is to simulate how a given mask pattern will print on the wafer.
This is done by simulating the optical and resist process, considering diffraction and the
properties of the photolithography equipment.
• Error Detection: After simulating the printed pattern, the differences between the simulated
printed pattern and the intended design are analyzed to identify any distortions or "errors."
• Mask Modification: To correct the errors, the mask pattern is modified. This can involve:
  • Biasing: Adjusting the size or shape of specific features (e.g., making a line slightly wider or
  adding serifs to a corner) to compensate for the distortion during printing.
  • Assist Features: Adding sub-resolution assist features (SRAFs), which are not intended to
  print as part of the final design but improve the printability of critical features.
  • Resolution Enhancement Techniques (RET): Techniques such as phase-shifting masks (PSM),
  used alongside SRAFs, improve the fidelity of the printed pattern.
• Iterative Optimization: The process may involve several iterations of simulation, error detection,
and mask modification to ensure the best possible pattern correction.
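The simulate, detect, modify, and iterate steps above can be sketched on a single 1D line. The blur kernel, resist threshold, and one-pixel bias step are hypothetical values chosen for illustration, not a calibrated lithography model.

```python
# Hypothetical 1D process model: a 5-tap blur followed by a resist threshold.
KERNEL = [1 / 9, 2 / 9, 3 / 9, 2 / 9, 1 / 9]
THRESHOLD = 0.7

def make_line(n, start, end):
    """Binary row of length n with 1s on pixels start..end inclusive."""
    return [1 if start <= i <= end else 0 for i in range(n)]

def simulate(mask):
    """Pattern simulation: blur the mask, then threshold the intensity."""
    n = len(mask)
    printed = []
    for i in range(n):
        intensity = 0.0
        for k, weight in enumerate(KERNEL):
            j = i + k - 2          # kernel is centred on pixel i
            if 0 <= j < n:
                intensity += weight * mask[j]
        printed.append(1 if intensity >= THRESHOLD else 0)
    return printed

def opc(n, start, end, max_iter=10):
    """Iteratively bias the mask until the printed width matches the target."""
    target_width = end - start + 1
    bias = 0                        # pixels added to each side of the line
    for _ in range(max_iter):
        mask = make_line(n, start - bias, end + bias)   # mask modification
        printed_width = sum(simulate(mask))             # pattern simulation
        error = target_width - printed_width            # error detection
        if error == 0:
            break
        bias += 1 if error > 0 else -1
    return bias, mask, simulate(mask)

target = make_line(11, 4, 6)
uncorrected = simulate(target)
bias, corrected_mask, printed = opc(11, 4, 6)
print("target     ", target)
print("uncorrected", uncorrected)
print("mask       ", corrected_mask)
print("printed    ", printed)
```

Without correction the three-pixel line prints only one pixel wide under this model; widening the mask by one pixel per side restores the target width.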
Types of OPC:
There are several types of OPC, each with different complexities and goals:
• Rule-based OPC: This approach uses predefined rules to make adjustments to the
mask. These rules are based on design patterns and their known behavior under
lithographic projection. Rule-based OPC is faster but less accurate than model-
based OPC.
• Model-based OPC: This approach uses more sophisticated mathematical models
to simulate the lithographic process and make mask corrections. It's more
accurate but computationally expensive.
• Inverse Lithography Technology (ILT): A more advanced technique that uses
optimization algorithms to generate the mask directly from the desired pattern
while minimizing the error during printing.
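Rule-based OPC can be pictured as a table lookup, which is why it is fast. The spacing thresholds and bias values below are invented for illustration only; real rule tables are derived from process characterization.

```python
# Hypothetical rule table: (maximum spacing to the nearest neighbour in nm,
# edge bias in nm). Values are made up for illustration.
RULES = [(100, 0), (200, 2), (400, 4)]
ISOLATED_BIAS = 6   # hypothetical bias for spacings beyond the last rule

def rule_bias(spacing_nm):
    """Return the edge bias for a feature edge given its neighbour spacing."""
    for max_spacing, bias in RULES:
        if spacing_nm <= max_spacing:
            return bias
    return ISOLATED_BIAS

for spacing in (80, 150, 1000):
    print(spacing, "nm spacing ->", rule_bias(spacing), "nm bias")
```

A model-based flow would replace the table with a simulation of the printed pattern for each edge, which is more accurate but far more expensive per feature.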
Machine Learning OPC (ML-OPC)
Training Phase
• Patterns:
• A set of known patterns (possibly semiconductor mask
patterns) is collected as input.
• Extracting Parameters:
• Feature extraction is performed to generate numerical
representations of these patterns.
• These parameters are represented as pairs
(x1, y1), (x2, y2), …, where x represents input
features and y represents the target output.
• ML-Model Training:
• A machine learning model is trained to map inputs (x) to
outputs (y) by minimizing an error function over the training
pairs, for example the squared error $\sum_i (y_i - f(x_i))^2$.
• This optimization reduces the difference between predicted
and actual values.
• Trained ML-Model 𝑓(𝑥): Once trained, the model 𝑓(𝑥)
can generalize to new inputs.
Testing Phase
• Unknown Pattern:
• A new pattern (not seen during training) is
provided as input.
• Extracting Parameters:
• Features of the unknown pattern are extracted as xn;
the corresponding output value is unknown.
• Prediction Using the Trained Model:
• The trained ML model f(x) predicts the missing
output value.
• Testing and Predictions Include:
• Mask bias: Differences between the designed
and actual printed patterns on the mask.
• Etch bias: Errors occurring during the etching
process.
• SRAF (Sub-Resolution Assist Features) insertion:
Optimizations in mask design to improve
lithography accuracy.
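The training and testing phases can be sketched with the simplest possible model, a least-squares line. The single feature (local pattern density), the bias values, and the choice of a linear model are all assumptions made for illustration; practical ML-OPC models use many features and more expressive models.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Training phase: hypothetical (x_i, y_i) pairs.
# x = local pattern density, y = mask bias in nm (made-up values).
train_x = [0.1, 0.3, 0.5, 0.7]
train_y = [5.0, 4.0, 3.0, 2.0]
a, b = fit_line(train_x, train_y)

def f(x):
    """Trained model f(x): predicted mask bias for feature value x."""
    return a + b * x

# Testing phase: features of an unseen pattern, output unknown.
x_n = 0.6
predicted_bias = f(x_n)
print(f"f(x) = {a:.2f} + {b:.2f} * x, predicted bias = {predicted_bias:.2f} nm")
```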
Etch Proximity Correction (EPC)
Etch Proximity Correction (EPC) is a technique used in semiconductor manufacturing to
compensate for distortions that occur during the etching process. Similar to Optical
Proximity Correction (OPC), which corrects for distortions caused by optical lithography,
EPC focuses on compensating for variations introduced during plasma etching.
Why is EPC Needed?
• After photolithography prints the pattern onto a wafer, the etching process transfers this
pattern into underlying layers of semiconductor materials. However, during etching,
proximity effects can cause distortions due to:
  • Loading Effects: Different pattern densities cause variations in etch rates.
  • Microloading Effects: Smaller features etch differently compared to larger features.
  • Aspect Ratio-Dependent Etching (ARDE): The etch rate varies with the ratio of feature
  depth to width.
  • Etch Lag Effects: Dense areas etch at different rates than isolated areas.
• EPC is used to pre-compensate these distortions so that the final etched pattern closely
matches the intended design.
How EPC Works
1. EPC Training Phase (ML-Based Approach)
Step 1: Data Collection
• Simulated and measured etch profiles are gathered.
• Feature parameters such as line width, pitch, pattern density, and aspect ratio are extracted.
• Corresponding etch biases (errors in etch dimensions) are recorded.
Step 2: Model Training
• A machine learning model is trained to predict the etch bias based on layout parameters.
• The training objective is to minimize the error between actual and predicted etch dimensions.
2. EPC Testing/Correction Phase
Step 1: New Layout Input
• A new semiconductor design layout is analyzed.
• Features are extracted and fed into the trained etch bias model.
Step 2: Etch Bias Prediction
• The trained model predicts the required etch correction values.
• Adjustments (e.g., widening or shrinking features) are made before fabrication.
Step 3: Mask Correction
• The corrected design is sent for mask generation and subsequent processing.
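The correction phase can be sketched on one row of a layout: extract each line's width and local density, predict its etch bias, and widen the drawn line by that amount. The bias model below is a hypothetical stand-in for a trained model, and the convention that etch bias equals pre-etch width minus post-etch width is an assumption of this sketch.

```python
def extract_lines(row):
    """Return (start, width) for each run of 1s in a binary layout row."""
    lines, start = [], None
    for i, v in enumerate(row + [0]):      # trailing 0 closes a final run
        if v and start is None:
            start = i
        elif not v and start is not None:
            lines.append((start, i - start))
            start = None
    return lines

def local_density(row, start, width, window=3):
    """Fraction of filled pixels within `window` pixels of the line."""
    lo = max(0, start - window)
    hi = min(len(row), start + width + window)
    return sum(row[lo:hi]) / (hi - lo)

def predict_etch_bias(density):
    """Hypothetical stand-in for a trained model: bias grows with density."""
    return 1.5 * density

# Two closely spaced lines on the left, one isolated line on the right.
row = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0]

lines = extract_lines(row)
densities = [local_density(row, s, w) for s, w in lines]
# Pre-compensate: draw each line wider by its predicted etch bias.
corrected_widths = [w + predict_etch_bias(d)
                    for (s, w), d in zip(lines, densities)]

for (s, w), d, cw in zip(lines, densities, corrected_widths):
    print(f"line at {s}: width {w}, density {d:.2f}, corrected width {cw:.2f}")
```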
OPC vs EPC
Feature | Optical Proximity Correction (OPC) | Etch Proximity Correction (EPC)
Purpose | Corrects distortions caused by optical lithography | Corrects distortions caused by the etching process
Main Effects Corrected | Diffraction, interference, focus, exposure | Etch loading, aspect ratio effects, etch lag
Correction Method | Modifies mask shapes to pre-compensate optical distortions | Modifies mask or etch process to counteract etching effects
Application | Applied before lithography | Applied after lithography, before etching
CNN for EPC
• Input:
• Raw layout images before etching.
• Measured post-etch feature profiles (scanning electron microscope [SEM] images), used as training targets.
• Feature parameter maps (line width, aspect ratio, density).
• CNN Architecture for EPC:
• Input Layer: Takes 2D images of the pre-etch layout.
• Convolutional Layers: Learn etch-related distortions such as microloading and
aspect ratio-dependent etching (ARDE).
• Residual Blocks (Optional): Help in learning finer distortions.
• Fully Connected Layers: Output correction values.
• Training Objective:
• Predict etch bias (difference between pre-etch and post-etch dimensions).
• Loss function:
$L = (\mathrm{Bias}_{\mathrm{actual}} - \mathrm{Bias}_{\mathrm{predicted}})^2$
• Output:
• Adjusted mask patterns or etch process parameters to ensure final feature dimensions match the
intended design.
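A framework-free sketch of the forward pass may help make the architecture concrete: one residual block, a global average, a single linear output for the bias, and the squared-error loss averaged over samples. The kernels, output weight, and "actual" bias value are untrained placeholders invented for illustration; in practice the same structure would be built and trained in PyTorch or TensorFlow.

```python
def conv2d_same(img, kernel):
    """3x3 convolution (cross-correlation, as in CNNs) with zero padding."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for kr in range(3):
                for kc in range(3):
                    rr, cc = r + kr - 1, c + kc - 1
                    if 0 <= rr < h and 0 <= cc < w:
                        total += kernel[kr][kc] * img[rr][cc]
            out[r][c] = total
    return out

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def residual_block(x, k1, k2):
    """relu(x + conv(relu(conv(x)))): the skip connection adds the input back."""
    y = conv2d_same(relu(conv2d_same(x, k1)), k2)
    return relu([[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)])

def predict_bias(layout, k1, k2, weight, offset):
    """Global average of the feature map followed by one linear output."""
    fmap = residual_block(layout, k1, k2)
    mean = sum(map(sum, fmap)) / (len(fmap) * len(fmap[0]))
    return weight * mean + offset

def mse(actual, predicted):
    """Mean over samples of (Bias_actual - Bias_predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Untrained placeholder parameters (hypothetical values).
k1 = [[1 / 9] * 3 for _ in range(3)]
k2 = [[0.1] * 3 for _ in range(3)]
layout = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0]]

predicted = predict_bias(layout, k1, k2, weight=2.0, offset=0.0)
actual = 1.0                      # made-up measured etch bias
print("predicted bias:", predicted, "loss:", mse([actual], [predicted]))
```

Training would adjust the kernels and output weight to reduce this loss over many layout and measurement pairs.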
CNN for OPC
• Input:
• Lithography mask images (2D images of mask layouts).
• Corresponding optical proximity error maps (ground truth).
• CNN Architecture:
• Convolutional Layers: Extract local pattern features (edges, corners, etc.).
• Pooling Layers: Reduce feature map size while preserving essential information.
• Fully Connected Layers: Predict the required mask correction per pixel.
• Training Objective:
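• Minimize the difference between simulated and predicted OPC corrections using
Mean Squared Error (MSE), where $x_i$ is a mask image, $y_i$ its ground-truth correction,
and $F$ the network:
$\min_F \frac{1}{N} \sum_{i=1}^{N} \lVert y_i - F(x_i) \rVert^2$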
• Output:
• A corrected mask layout that compensates for diffraction and optical distortions.
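The layer sequence above (convolution, pooling, fully connected) can be traced by hand on a tiny mask. The edge-detecting kernel and the fully connected weights are hypothetical, untrained values used only to show how the data shape changes from layer to layer.

```python
def conv2d_valid(img, kernel):
    """Convolution (cross-correlation, as in CNNs) without padding."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(kernel[i][j] * img[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)]
            for r in range(oh)]

def relu(fmap):
    return [[max(0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """2x2 max pooling with stride 2."""
    return [[max(fmap[r][c], fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

def fully_connected(vec, weights, biases):
    """One output per weight row: dot(row, vec) + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, biases)]

# 6x6 mask with a vertical edge between columns 2 and 3.
mask = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hypothetical vertical-edge kernel (a trained network learns its own).
edge_kernel = [[-1, 0, 1] for _ in range(3)]

features = relu(conv2d_valid(mask, edge_kernel))    # 6x6 -> 4x4
pooled = max_pool2x2(features)                      # 4x4 -> 2x2
flat = [v for row in pooled for v in row]           # 2x2 -> 4 values
correction = fully_connected(flat, [[0.25] * 4], [0.0])

print("features:", features)
print("pooled:  ", pooled)
print("output:  ", correction)
```

The convolution responds only where the mask edge lies, pooling halves each spatial dimension, and the fully connected layer maps the remaining features to an output value.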