0% found this document useful (0 votes)
9 views11 pages

KL Divergence and Image Processing Formulas

The document discusses various image processing techniques, focusing on convolution, correlation, and edge detection methods such as Sobel and Canny. It outlines the mathematical formulas and steps for applying these techniques, including noise reduction, feature extraction, and periodicity detection. Key concepts like KL Divergence, autocorrelation, and the Laplacian of Gaussian are also explained in the context of image analysis.

Uploaded by

k247605
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views11 pages

KL Divergence and Image Processing Formulas

The document discusses various image processing techniques, focusing on convolution, correlation, and edge detection methods such as Sobel and Canny. It outlines the mathematical formulas and steps for applying these techniques, including noise reduction, feature extraction, and periodicity detection. Key concepts like KL Divergence, autocorrelation, and the Laplacian of Gaussian are also explained in the context of image analysis.

Uploaded by

k247605
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Formulae layers extract features.

• Noise Reduction: Smoothing images to reduce


KL Divergence (Kullback-Leibler random noise.
Divergence) Steps for 2D Image Convolution
1. Flip the Kernel: Rotate the kernel 180◦ (both
Formula horizontally and vertically). This is the key differ-
KL Divergence ence from correlation.
For discrete probability distributions P and Q: 2. Place the Kernel: Position the center of the
X 
P (i)
 flipped kernel over each pixel (i, j) of the input
DKL (P ||Q) = P (i) log image.
i
Q(i)
3. Multiply and Sum: Multiply each kernel ele-
Explanation of Terms ment by the corresponding image pixel value it
• DKL (P ||Q): Represents the KL Divergence of Q overlaps.
from P . It is read as ”the divergence of P from 4. Sum the Products: Sum all the resulting prod-
Q”. ucts.
• P (i) or P (x): The probability of event i (discrete)
or the probability density function of x (continu- 5. Assign Result: The sum becomes the new pixel
ous) under distribution P . This is the ”true” or value at (i, j) in the output image.
reference distribution.
• Q(i) or Q(x): The probability of event i (discrete) 6. Repeat: Move the kernel to the next pixel and
or the probability density function of x (continu- repeat the process until all pixels are covered.
ous) under distribution Q. This is the approximat-
ing distribution. Correlation
• Cross-correlation is a measure of similarity of two series
P
i : Summation over all possible events i in the
discrete case. as a function of the displacement of one relative to the
• log: The logarithm, typically the natural logarithm other. In image processing, it’s used to find patterns or
(base e), but can also be base 2 for information the- templates within an image.
ory contexts. The base of the logarithm determines Formula
the units of divergence (nats for natural log, bits For a 2D image I and a template T :
for base 2 log). XX
P (i)
• Q(i) P (x)
or Q(x) : The ratio of the probabilities (or (I ⋆ T )(i, j) = I(i + m, j + n)T (m, n)
m n
probability densities) of P to Q for a given event
or value. Where:
Key Properties • (I ⋆ T )(i, j): The output pixel value at coordinates
• DKL (P ||Q) ≥ 0: KL Divergence is always non- (i, j) after correlation.
negative. • I(i + m, j + n): The pixel value from the input
• DKL (P ||Q) = 0 if and only if P = Q (i.e., the two image I at a shifted coordinate.
distributions are identical). • T (m, n): The value of the template T at coordi-
• Non-symmetric: DKL (P ||Q) ̸= DKL (Q||P ) in gen- nates
P P(m, n).
eral. This means it is not a true distance metric. • m n : Summation over all elements of the tem-
plate.
Image Filters When to Apply
Convolution Correlation is primarily used for:
Formula • Template Matching: Locating a small template
For a 2D image I and a kernel K: image within a larger image (e.g., finding a specific
object).
• Pattern Recognition: Identifying known pat-
XX
(I ∗ K)(i, j) = I(i − m, j − n)K(m, n)
m n
terns.
• Feature Detection: Although convolution is
Where: more common for general filtering, correlation can
• (I ∗ K)(i, j): The output pixel value at coordinates also be used.
(i, j) after convolution. Steps for 2D Image Correlation
• I(i − m, j − n): The pixel value from the input 1. Place the Template: Position the center of the
image I at a shifted coordinate. template over each pixel (i, j) of the input image.
• K(m, n): The value of the kernel K at coordinates
(m, 2. Multiply and Sum: Multiply each template el-
P n). ement by the corresponding image pixel value it

P
m n : Summation over all elements of the ker-
nel. overlaps.
When to Apply 3. Sum the Products: Sum all the resulting prod-
Convolution is widely used for: ucts.
• Filtering: Blurring (e.g., Gaussian filter), sharp-
ening, edge detection (e.g., Sobel, Prewitt, Lapla- 4. Assign Result: The sum becomes the new pixel
cian). value at (i, j) in the output image.
• Feature Extraction: In neural networks (Con-
volutional Neural Networks - CNNs), convolution 5. Repeat: Move the template to the next pixel and
repeat the process until all pixels are covered.

1
Autocorrelation Here, IGx (i, j)2 means squaring the pixel value at
Autocorrelation is the correlation of a signal with a de- (i, j) in the IGx image, and similarly for IGy (i, j)2 .
layed copy of itself as a function of the delay. In image
processing, it can be used to find repeating patterns or Θ(i, j) = atan2(IGy (i, j), IGx (i, j))
textures within a single image.
Formula The atan2(y, x) function is a two-argument arct-
For a 2D image I: angent. It correctly calculates the angle in all four
quadrants, taking into account the signs of both y
(here, IGy ) and x (here, IGx ). This provides the
XX
RII (u, v) = I(x, y)I(x − u, y − v)
orientation of the edge.
x y

Where: Prewitt Filter


• RII (u, v): The autocorrelation value for a displace- It is simpler than Sobel as it does not give extra weight
ment (u, v). to the central pixels.
• I(x, y): The pixel value from the image at (x, y). Kernels
• I(x − u, y − v): The pixel value from the image at The Prewitt filter uses two 3 × 3 kernels:
a
Pshifted
P coordinate. −1 0 1 −1 −1 −1
! !
• x y : Summation over all relevant pixels in the Gx = −1 0 1 Gy = 0 0 0
image. −1 0 1 1 1 1
When to Apply
Autocorrelation is used for: Steps
• Periodicity Detection: Identifying repeating The steps are identical to the Sobel filter, using the
patterns or textures in an image. Prewitt kernels instead:
• Texture Analysis: Characterizing the regularity
or randomness of textures. 1. Apply Gx : Convolve the image with the Gx ker-
• Signal Processing: Analyzing signals for repeti- nel. This results in a gradient component image,
tive structures. let’s call it IGx .
Steps for 2D Image Autocorrelation
1. Shift the Image: Create a shifted copy of the 2. Apply Gy : Convolve the image with the Gy ker-
original image by a displacement (u, v). nel. This results in another gradient component
image, IGy .
2. Multiply Corresponding Pixels: Multiply the
original image’s pixel values by the corresponding 3. Combine Gradients: At each pixel (i, j), com-
pixel values in the shifted copy. pute the gradient magnitude G(i, j) and direction
Θ(i, j):
3. Sum the Products: Sum all the resulting prod- q
ucts. G(i, j) = IGx (i, j)2 + IGy (i, j)2
4. Assign Result: The sum becomes the autocorre-
lation value for the displacement (u, v). Here, IGx (i, j)2 means squaring the pixel value at
(i, j) in the IGx image, and similarly for IGy (i, j)2 .
5. Repeat: Vary the displacement (u, v) and repeat
the process to build the autocorrelation function. Θ(i, j) = atan2(IGy (i, j), IGx (i, j))

Sobel Filter The atan2(y, x) function is a two-argument arct-


Kernels angent. It correctly calculates the angle in all four
The Sobel filter uses two 3×3 kernels, one for horizontal quadrants, taking into account the signs of both y
edges (Gx ) and one for vertical edges (Gy ): (here, IGy ) and x (here, IGx ). This provides the
orientation of the edge.
−1 0 1 −1 −2 −1
! !
Gx = −2 0 2 Gy = 0 0 0 Edge Detection Algorithms
−1 0 1 1 2 1 Edge detection is a fundamental task in image process-
ing that aims to identify points in a digital image at
Steps which the image brightness changes sharply or, more
1. Apply Gx : Convolve the image with the Gx kernel formally, has discontinuities.
to detect vertical edges. This results in a gradient Marr-Hildreth Edge Detection
component image, let’s call it IGx . The Marr-Hildreth edge detection algorithm is a classic
2. Apply Gy : Convolve the image with the Gy kernel approach based on the zero-crossings of the Laplacian
to detect horizontal edges. This results in another of Gaussian (LoG) operator. It emphasizes the impor-
gradient component image, IGy . tance of multi-scale analysis and finding edges at differ-
ent resolutions.
3. Combine Gradients: At each pixel (i, j), com- Laplacian of Gaussian (LoG)
pute the gradient magnitude G(i, j) and direction The Laplacian of Gaussian (LoG) is a second-order
Θ(i, j): derivative operator that combines Gaussian smoothing
q with Laplacian edge detection. It is often used as a sin-
2
G(i, j) = IGx (i, j) + IGy (i, j)2 gle filter, which is the convolution of the image with the
LoG kernel. The LoG kernel can be derived by applying

2
the Laplacian operator to the Gaussian function: (100 × 0.25) = 25 (simplified, full convolution
would involve more neighbors). A more realis-
∂2 ∂2 tic smoothed segment might look like: Ismooth ≈
∇2 G(x, y, σ) = G(x, y, σ) + 2 G(x, y, σ)
∂x2 ∂y [0, 25, 75, 95, 75, 25] (values are illustrative, actual
values depend on padding and full convolution).
The formula for the 2D LoG kernel is:
 2 2. Apply the Laplacian Operator: Simplified 1D
x + y 2 − 2σ 2 − x2 +y2 2

LoG(x, y, σ) = e 2σ Laplacian Kernel Lop = [1, −2, 1]. Now, convolve
2πσ 6 Ismooth with Lop . Let’s focus on where the edge is,
between 0 and 100. Consider the transition around
When applied as a single convolution, the LoG kernel index 3-4 in Ismooth : [75, 95, 75]. Applying Lop at a
acts as a band-pass filter, detecting intensity changes at point in the transition region (e.g., conceptually at
a certain scale. The zero-crossings of the filtered image index 3.5): L(x) = (75 × 1) + (95 × −2) + (75 × 1) =
indicate the locations of edges. 75 − 190 + 75 = −40. If we consider a point just
Steps before the transition, say using [25, 75, 95]: L(x) =
1. Smooth the Image with a Gaussian Filter (25×1)+(75×−2)+(95×1) = 25−150+95 = −30.
(or convolve with LoG directly): The input If we consider a point just after the transition, say
image I(x, y) is first convolved with a Gaussian using [95, 75, 25]: L(x) = (95 × 1) + (75 × −2) +
filter G(x, y, σ) to reduce noise. The Gaussian filter (25 × 1) = 95 − 150 + 25 = −30. The exact zero-
is given by: crossing would occur where the values transition
from negative to positive or vice-versa. In a real 1D
1 − x2 +y2 2 scenario, the LoG response would show a negative
G(x, y, σ) = e 2σ
2πσ 2 peak on one side of the edge, a positive peak on
where σ is the standard deviation, controlling the the other, and a zero-crossing exactly at the edge.
amount of smoothing. The smoothed image is For a 2D image, the zero-crossings form contours,
Ismooth = I ∗ G. Alternatively, one can directly which are the detected edges.
convolve the image with the LoG kernel: L(x, y) =
I ∗ LoG(x, y, σ).
Canny Edge Detection
The Canny edge detector is a multi-stage algorithm
2. Apply the Laplacian Operator (if not using widely recognized for its effectiveness and robustness.
direct LoG convolution): If the image was only Its primary goals are to detect all real edges, localize
smoothed with a Gaussian, the Laplacian operator them accurately, and minimize the number of false pos-
is then applied to the smoothed image. The 2D itives.
Laplacian operator is defined as: Steps
1. Noise Reduction (Gaussian Smoothing): The
∂2f ∂2f image is smoothed using a Gaussian filter to sup-
∇2 f = + press noise. This is the same step as in Marr-
∂x2 ∂y 2
Hildreth: Ismooth = I ∗ G.
Applying the Laplacian to the smoothed image
Ismooth gives L(x, y) = ∇2 (I ∗ G). 2. Calculate Intensity Gradients: Compute the
gradient magnitude G(x, y) and direction Θ(x, y)
3. Find Zero-Crossings: Edges are identified as for each pixel in the smoothed image. This is typ-
the zero-crossings in the LoG response L(x, y). ically done using Sobel or Prewitt kernels:
A zero-crossing occurs where the sign of L(x, y) q
changes (e.g., from positive to negative or vice G(x, y) = IGx (x, y)2 + IGy (x, y)2
versa). These zero-crossings correspond to points
of maximum intensity change in the smoothed im- Θ(x, y) = atan2(IGy (x, y), IGx (x, y))
age.
where IGx and IGy are the horizontal and verti-
4. Thresholding (Optional): To remove weak or cal gradient images obtained by convolving Ismooth
spurious edges, a threshold can be applied to the with Gx and Gy kernels respectively.
magnitude of the gradient at the zero-crossing
points. Only zero-crossings with a sufficiently high 3. Non-maximum Suppression: This step thins
gradient magnitude are considered true edges. the edges by preserving only the local maxima in
the gradient magnitude image. For each pixel, its
Example (Numerical) gradient magnitude is compared to the magnitudes
Let’s consider a simple 1D image (a row of pixels) and of its two neighbors along the gradient direction. If
a simplified 1D Gaussian and Laplacian for illustration. the pixel’s magnitude is not a local maximum, it is
Input 1D Image I: suppressed (set to zero). This results in thin, crisp
edges.
I = [0, 0, 0, 100, 100, 100]
4. Double Thresholding: Two thresholds, a high
Simplified 1D Gaussian Kernel G (σ chosen for small threshold (Thigh ) and a low threshold (Tlow ), are
kernel, e.g., G = [0.25, 0.5, 0.25]): used to classify edge pixels:
1. Gaussian Smoothing: Convolve I with G. Let’s • Pixels with gradient magnitude > Thigh are
look at the pixel at index 3 (value 100). Original classified as strong edge pixels.
neighborhood around index 3: [0, 0, 100] (consider- • Pixels with gradient magnitude between Tlow
ing padding for simplicity). Applying G around and Thigh are classified as weak edge pixels.
index 3: Ismooth [3] = (0 × 0.25) + (0 × 0.5) +

3
• Pixels with gradient magnitude < Tlow are (e.g., G(x, y) between Tlow and Thigh ) adjacent to
suppressed. this strong edge, they would also be promoted to
strong edges, forming a continuous edge segment.
5. Edge Tracking by Hysteresis: This final step Isolated weak edges would be discarded.
converts weak edges into strong edges if and only
if they are connected to strong edge pixels. It ef-
fectively connects broken edge segments. Interest Point Detection
• Strong edge pixels are immediately included Interest points (also known as keypoints or corner
in the final edge map. points) are distinctive features in an image that are
• Weak edge pixels are included only if they are stable under various transformations such as rotation,
connected to a strong edge pixel (directly or scaling, and illumination changes. They are crucial for
through other weak edge pixels that are them- tasks like image matching, object recognition, and 3D
selves connected to a strong edge). reconstruction.
• This process helps to bridge gaps in edges and Harris Corner Detection
reduce false positives. The Harris Corner Detector is a classic and widely used
algorithm for detecting corners in an image. A corner
Example (Numerical) is defined as a point where there is a significant change
Let’s trace a pixel through the Canny algorithm. in image intensity in multiple directions (e.g., both hor-
Input Image (small 3x3 region with an edge): izontal and vertical).
Principle
0 0 0
!
The core idea behind Harris Corner Detection is to look
I = 0 200 200 for points where the intensity changes rapidly when a
0 0 0 small window is shifted in any direction. It uses the
concept of an ”autocorrelation matrix” (also known as
Assume after Gaussian Smoothing, the pixel at the structure tensor) to analyze the local image struc-
(1, 1) (center) has values that are slightly spread. For ture.
simplicity, let’s assume the gradient calculation directly Steps
on the original image for this small example. 1. Convert to Grayscale: If the input image is in
color, convert it to grayscale. Corner detection is
1. Calculate Intensity Gradients (using Sobel typically performed on intensity values.
kernels): Consider the pixel at (1, 1) (value 200).
−1 0 1 −1 −2 −1
! !
2. Calculate Image Gradients: Compute the hor-
Gx = −2 0 2 , Gy = 0 0 0 izontal and vertical gradients of the image, Ix and
−1 0 1 1 2 1 Iy , respectively. This is usually done using Sobel
To calculate IGx (1, 1): IGx (1, 1) = (0 × −1) + (0 × or Prewitt operators.
0) + (0 × 1) + (0 × −2) + (200 × 0) + (200 × 2) + (0 ×
Ix = I ∗ Gx and Iy = I ∗ Gy
−1) + (0 × 0) + (0 × 1) = 400 (This is a simplified
calculation for illustration, assuming zero padding where Gx and Gy are the Sobel/Prewitt kernels.
and only considering the relevant part of the image
for the kernel application. In a real scenario, the 3. Compute Products of Gradients: Calculate
convolution would be over the entire image.) the products of gradients at each pixel:
To calculate IGy (1, 1): IGy (1, 1) = (0 × −1) + (0 ×
Ix2 = Ix · Ix
−2) + (0 × −1) + (0 × 0) + (200 × 0) + (200 × 0) +
(0 × 1) + (0 × 2) + (0 × 1) = 0
Iy2 = Iy · Iy
Gradient Magnitude at √(1, 1): G(1, 1) =
p Ixy = Ix · Iy
IGx (1, 1)2 + IGy (1, 1)2 = 4002 + 02 = 400
Gradient Direction at (1, 1): Θ(1, 1) = These are element-wise multiplications.

atan2(IGy (1, 1), IGx (1, 1)) = atan2(0, 400) = 0 4. Apply Gaussian Smoothing to Products of
(indicating a vertical edge). Gradients: Smooth the gradient products using
2. Non-maximum Suppression: Suppose for pixel a Gaussian filter (with standard deviation σG ):

(1, 1) with G(1, 1) = 400 and Θ(1, 1) = 0 .
The gradient direction is horizontal. We compare Sx2 = G(σG ) ∗ Ix2
G(1, 1) with its neighbors in the horizontal direc-
tion. Let’s say G(1, 0) = 100 and G(1, 2) = 300. Sy2 = G(σG ) ∗ Iy2
Since G(1, 1) = 400 is greater than both G(1, 0) = Sxy = G(σG ) ∗ Ixy
100 and G(1, 2) = 300, pixel (1, 1) is kept (it’s a
local maximum). If G(1, 1) was less than either This step creates a weighted sum of the gradient
neighbor, it would be suppressed to 0. products in a local neighborhood, making the de-
tector more robust to noise.
3. Double Thresholding: Let Thigh = 350 and
Tlow = 150. For pixel (1, 1), G(1, 1) = 400. Since 5. Construct the Structure Tensor (M Matrix):
G(1, 1) > Thigh , pixel (1, 1) is classified as a strong For each pixel (x, y), form the 2 × 2 matrix M :
edge.  2 
Sx (x, y) Sxy (x, y)
M (x, y) =
4. Edge Tracking by Hysteresis: Since pixel (1, 1) Sxy (x, y) Sy2 (x, y)
is a strong edge, it is automatically included in
the final edge map. If there were weak edge pixels

4
6. Calculate the Harris Response (Cornerness values within their local neighborhood, and above
Measure): For each pixel, compute the Harris re- a set threshold, will be marked as corners.
sponse R using the determinant and trace of the
matrix M :
Matrix Operations
Trace of a Square Matrix
R = det(M ) − k · (trace(M ))2 The trace of a square matrix is the sum of the elements
on its main diagonal. It is a linear operator and has
where: several useful properties in linear algebra.
• det(M ) = Sx2 Sy2 − Sxy
2 Trace of a 2 × 2 Matrix
• trace(M ) = Sx + Sy
2 2 For a 2 × 2 matrix A:
• k is an empirical constant, typically in the 
a11 a12

range 0.04 to 0.06. A=
a21 a22
The value of R indicates whether a pixel is a corner,
edge, or flat region: The trace of A is:
• |R| is small: Flat region (intensity is constant
trace(A) = a11 + a22
in all directions).
• R < 0: Edge region (large intensity change in Trace of an n × n Matrix
one direction, small in orthogonal direction). For an arbitrary n × n square matrix M :
• R is large and positive: Corner region (large
intensity change in all directions). 
m11 m12 · · · m1n

 m21 m22 · · · m2n 
7. Non-maximum Suppression: Apply non- M =
maximum suppression to the Harris response map  ... ..
.
..
.
.. 
. 
R to find local maxima. This ensures that only mn1 mn2 · · · mnn
the most prominent corners are selected, prevent-
ing multiple detections for a single corner. A com- The trace of M is the sum of its diagonal elements:
mon approach is to compare a pixel’s R value with n
its neighbors in a small window (e.g., 3×3 or 5×5) X
and keep only the highest R value. trace(M ) = mii = m11 + m22 + · · · + mnn
i=1
8. Thresholding: Apply a threshold to the non-
maximum suppressed response map. Pixels with Convolutional Neural Networks
R values above this threshold are considered de-
tected corners.
(CNNs)
Convolutional Neural Networks (CNNs) are a class of
Example (Conceptual) deep neural networks, most commonly applied to ana-
Consider a single pixel at the corner of a white square lyzing visual imagery. They are inspired by the organi-
on a black background. zation of the animal visual cortex.
• Gradients (Ix , Iy ): At the corner pixel, Ix would Convolutional Layer Output Size
be high (due to the vertical edge) and Iy would be The output size of a convolutional layer depends on the
high (due to the horizontal edge). In flat regions, input image dimensions, filter size, padding, and stride.
both would be near zero. Along an edge, one would Given:
be high, and the other near zero. • Win : Input width
• Products of Gradients (Ix2 , Iy2 , Ixy ): At the cor- • Hin : Input height
ner, Ix2 and Iy2 would both be large, and Ixy would • F : Filter size (width and height, assuming square
also be significant (if the corner is not perfectly filters)
axis-aligned after smoothing, or if the gradients are • P : Padding (number of pixels added to each side)
not perfectly orthogonal). • S: Stride (number of pixels to shift the filter)
• Smoothed Products (Sx2 , Sy2 , Sxy ): These values The output width Wout and output height Hout are
represent the average gradient information in the calculated as:
local neighborhood. 
Win − F + 2P

• Harris Response (R): Wout = +1
– For a flat region: Sx2 , Sy2 , Sxy are all small. S
det(M ) is small, trace(M ) is small. R will be 
Hin − F + 2P

small. Hout = +1
– For an edge region (e.g., vertical edge): Sx2 is S
large, Sy2 is small, Sxy is small. det(M ) ≈ 0. where ⌊·⌋ denotes the floor function (rounding down to
trace(M ) is large. R ≈ 0 − k · (large)2 , so R the nearest integer).
will be negative. Padding in Convolutional Layers
– For a corner region: Sx2 is large, Sy2 is large, Padding is a technique used in convolutional layers to
Sxy is also significant. det(M ) will be large control the spatial dimensions of the output feature
and positive. trace(M ) will be large. R = map. It involves adding zeros (or other values) around
large positive − k · (large)2 . If the positive de- the input image’s border.
terminant term dominates, R will be large and Valid Padding (No Padding)
positive. • Concept: No padding is added to the input. The
• Non-maximum Suppression & Threshold- convolution operation is performed only on the
ing: Only the pixels with the highest positive R valid regions of the input, where the kernel com-

5
pletely overlaps with the input. • Cout : Number of output channels (number of filters
• Formula for Output Size (Height/Width): in the layer)
• B: Boolean, 1 if bias is used, 0 otherwise
I −K The total number of parameters is:
O=⌊ ⌋+1
S
Parameters = (Fw × Fh × Cin + B) × Cout
Where:
– O: Output size (height or width). Number of Parameters in a Fully Con-
– I: Input size (height or width).
– K: Kernel size (height or width).
nected (Dense) Layer
For a fully connected layer, the number of parameters
– S: Stride.
depends on the number of input units and output units.
– ⌊·⌋: Floor function.
Given:
Same Padding
• Nin : Number of input units (neurons from the pre-
• Concept: Padding is added to the input such that
vious layer)
the output feature map has the same spatial dimen-
• Nout : Number of output units (neurons in the cur-
sions as the input feature map (assuming a stride of
rent layer)
1). The amount of padding is calculated to achieve
• B: Boolean, 1 if bias is used, 0 otherwise
this.
The total number of parameters is:
• Formula for Output Size (Height/Width):
Parameters = (Nin + B) × Nout
I
O=⌈ ⌉
S Pooling Layer Output Size
Where: Pooling layers (e.g., Max Pooling, Average Pooling) re-
– O: Output size (height or width). duce the spatial dimensions of the input.
– I: Input size (height or width). Given:
– S: Stride. • Win : Input width
– ⌈·⌉: Ceiling function. • Hin : Input height
• Formula for Padding Amount (for stride 1): • K: Pool size (width and height, assuming square
pool)
K −1 • S: Stride
P = The output width Wout and output height Hout are
2
calculated as:
Where:  
– P : Padding amount on one side (total Win − K
Wout = +1
padding is 2P ). S
– K: Kernel size.  
Full Padding Hin − K
• Concept: Padding is added to the input such that Hout = +1
S
every input pixel is ”hit” by the center of the ker-
nel. This results in an output feature map that is Attention Mechanism (in Trans-
larger than the input.
• Formula for Output Size (Height/Width): formers)
The attention mechanism is a core component of trans-
I + 2P − K former models, allowing the model to weigh the im-
O=⌊ ⌋+1 portance of different parts of the input sequence when
S
processing each element. It’s crucial for tasks like ma-
Where P = K − 1 is the padding amount. Substi- chine translation, where context from distant words is
tuting P : important.
I + 2(K − 1) − K I +K −2 Core Concept: Query, Key, Value
O=⌊ ⌋+1=⌊ ⌋+1 The attention mechanism operates on three main vec-
S S tors:
– O: Output size (height or width). • Query (Q): Represents the current word/token
– I: Input size (height or width). being processed.
– K: Kernel size (height or width). • Key (K): Represents all other words/tokens in
– S: Stride. the sequence.
– ⌊·⌋: Floor function. • Value (V ): Represents the actual information
Number of Parameters in a Convolu- content of all other words/tokens.
tional Layer Scaled Dot-Product Attention
The number of trainable parameters in a convolutional This is the most common form of attention used in
layer depends on the filter dimensions, number of in- transformers.
put channels, number of output channels (filters), and
whether a bias term is used. Formula: The attention score is calculated as follows:
Given:
QK T
 
• Fw : Filter width Attention(Q, K, V ) = softmax √ V
• Fh : Filter height dk
• Cin : Number of input channels (e.g., 3 for RGB
image) Where:
• Q ∈ Rn×dk is the Query matrix.

6
• K ∈ Rm×dk is the Key matrix. that same sequence, capturing internal dependencies
• V ∈ Rm×dv is the Value matrix. and relationships.
• n is the number of queries (e.g., sequence length • Application: Used in the encoder of a trans-
for current output). former, where each word in the input sentence at-
• m is the number of keys/values (e.g., sequence tends to other words in the *same* input sentence
length for input). to build a richer representation.
• dk is the dimension of the keys (and queries).
• d
√v is the dimension of the values. Cross-Attention In cross-attention, the Query ma-
• dk is the scaling factor to prevent large dot prod-
trix is derived from one sequence (e.g., the decoder’s
ucts from pushing the softmax into regions with output sequence), while the Key and Value matrices are
very small gradients. derived from a different sequence (e.g., the encoder’s
output sequence). This allows the decoder to focus on
Steps to calculate attention for a single head: relevant parts of the encoder’s output when generating
its own output.
1. Generate Q, K, V matrices: These are typ- • Application: Used in the decoder of a trans-
ically linear transformations of the input embed- former, where each word being generated in the
dings. Let X be the input embedding matrix. output sentence attends to words in the *input*
sentence to ensure proper translation or summa-
Q = XW Q K = XW K V = XW V rization.
Where W Q , W K , W V are learnable weight matri- Multi-Head Attention
ces. Transformers use Multi-Head Attention, which runs
several attention mechanisms (heads) in parallel.
2. Calculate attention scores (dot product):
Multiply the Query matrix by the transpose of the Formula (Conceptual):
Key matrix.
Scores = QK T MultiHead(Q, K, V ) = Concat(head1 , . . . , headh )W O
3. Scale the scores: Divide the scores by the square Where headi = Attention(QWiQ , KWiK , V WiV ) and
root of the dimension of the keys, dk .
W O is a learnable output weight matrix.
QK T
Scaled Scores = √ Steps:
dk
1. Split into heads: The Q, K, V matrices are split
4. Apply Softmax: Apply the softmax function to into h heads. Each head gets a smaller dimension
the scaled scores to get attention weights. This (dk /h, dv /h).
ensures weights sum to 1.
2. Run Scaled Dot-Product Attention in par-
QK T allel: Each head independently calculates its at-
 
Attention Weights = softmax √ tention output.
dk
3. Concatenate results: The outputs from all at-
5. Multiply by Value matrix and apply to tention heads are concatenated.
embeddings: Multiply the attention weights by
the Value matrix to get the final attention out- 4. Linear transformation: The concatenated out-
put. This output, Attention Output, is a new, put is then passed through a final linear layer (W O )
context-aware representation for each query. Each to produce the final Multi-Head Attention output.
row of the Attention Output matrix represents a
query’s transformed embedding, which has been Neural Networks
enriched by selectively focusing on relevant infor- Neural Networks are computational models inspired by
mation from the entire input sequence (via the the structure and function of biological neural networks.
Value vectors). This transformed embedding then They consist of interconnected nodes (neurons) orga-
serves as the input for the subsequent layers in the nized in layers, designed to learn patterns from data.
transformer block, such as the feed-forward net- Forward Propagation (Computing
work or another attention sub-layer, effectively re-
placing the original input embedding for that stage
Weights and Activations)
Forward propagation is the process of passing input
of processing.
data through the neural network to produce an out-
Attention Output = Attention Weights · V put. At each neuron, the inputs are multiplied by their
respective weights, summed, and then passed through
an activation function.
Example (Conceptual): If you’re translating ”The
cat sat on the mat”, when processing ”cat”, the atten-
tion mechanism might give higher weight to ”The” and Formula for a single neuron: For a single neuron,
”sat” to understand ”cat’s” role in the sentence. the output a is calculated as:
Types of Attention N
X
!
Self-Attention In self-attention, the Query, Key, a=σ wi xi + b
and Value matrices are all derived from the same in- i=1
put sequence. This allows each element in the sequence
to attend to all other elements (including itself) within Where:

7
• xi are the inputs. classes).
• wi are the weights associated with each input.
• b is the bias term. N C
1 XX
• N is the number of inputs. LCCE = − yic log(ŷic )
N i=1 c=1
• σ is the activation function (e.g., sigmoid, ReLU,
tanh).
Where:
• N is the number of data points.
For a layer (vectorized form): For a layer with J • C is the total number of classes.
neurons and I inputs, the output A is: • yic is 1 if data point i belongs to class c, and 0
otherwise (one-hot encoded true labels).
Z = W X + BA = σ(Z)
• ŷic is the predicted probability for data point i be-
Where: longing to class c.
• X ∈ RI×1 is the input vector. Transformers
• W ∈ RJ×I is the weight matrix.
• B ∈ RJ×1 is the bias vector.
Architecture Overview
A standard Transformer consists of an encoder and a
• Z ∈ RJ×1 is the weighted sum before activation.
decoder.
• A ∈ RJ×1 is the output (activations) of the layer.
• Encoder: Maps an input sequence of symbol rep-
Backward Propagation (Weight Up- resentations (x1 , . . . , xn ) to a sequence of contin-
dates) uous representations (z1 , . . . , zn ). It processes the
Weight Update Formula (Gradient Descent): entire input sequence simultaneously.
Weights are updated iteratively to minimize the loss • Decoder: Receives the encoder’s output and gen-
function. For a weight wij connecting neuron i to neu- erates an output sequence (y1 , . . . , ym ) one element
ron j: at a time. It uses masked self-attention to prevent
new old ∂L attending to future tokens.
wij = wij − η
∂wij Key Components
Each encoder and decoder layer typically includes:
Where: • Multi-Head Attention: Allows the model to
• wijnew
is the updated weight. jointly attend to information from different rep-
• wijold
is the current weight. resentation subspaces at different positions. (See
• η (eta) is the learning rate, a hyperparameter con- Section 1.3 for details.)
trolling the step size of the update. • Feed-Forward Networks: A simple, fully con-
• ∂w∂L
ij
is the gradient of the loss function L with nected feed-forward network applied independently
respect to the weight wij . to each position.
A similar update rule applies to biases. The gradient • Residual Connections: Add the input of a sub-
∂L layer to its output, followed by layer normalization.
∂wij is computed using the chain rule, propagating the
• Positional Encoding: Added to the input em-
error backward through the network.
beddings to inject information about the relative
Common Loss Functions or absolute position of the tokens in the sequence,
Mean Squared Error (MSE): Commonly used for as the model contains no recurrence or convolution.
regression tasks.
Variational Autoencoders (VAEs)
N
1 X Architecture
LMSE = (yi − ŷi )2 A VAE consists of two main parts:
N i=1
• Encoder (Inference Network): Maps the in-
Where: put data x to a distribution over a latent space z,
• N is the number of data points. typically a Gaussian distribution parameterized by
• yi is the true value for data point i. mean µ and log-variance log σ 2 .
• ŷi is the predicted value for data point i. • Decoder (Generative Network): Samples a
point z from the latent distribution and recon-
structs the original data x.
Binary Cross-Entropy Loss (BCE): Used for bi- Loss Function (Evidence Lower Bound -
nary classification problems (two classes).
ELBO)
N The VAE is trained by optimizing the Evidence Lower
1 X Bound (ELBO), which balances reconstruction accu-
LBCE = − [yi log(ŷi ) + (1 − yi ) log(1 − ŷi )]
N i=1 racy and the regularization of the latent space.

Where: LVAE (θ, ϕ) = Eqϕ (z|x) [log pθ (x|z)] − DKL (qϕ (z|x)||p(z))
• N is the number of data points.
• yi ∈ {0, 1} is the true binary label for data point i. Where:
• ŷi ∈ (0, 1) is the predicted probability for data • pθ (x|z) is the likelihood of observing x given z (re-
point i belonging to class 1. construction term).
• qϕ (z|x) is the approximate posterior distribution
(encoder output).
Categorical Cross-Entropy Loss (CCE): Used • p(z) is the prior distribution over the latent space
for multi-class classification problems (more than two (often a standard normal distribution).
• DKL is the Kullback-Leibler divergence, which

8
measures the difference between two probability Simplified Training Steps (Alternating
distributions. Training)
Reparameterization Trick 1. Train Discriminator (D):
To allow backpropagation through the sampling pro-
cess, the reparameterization trick is used: • Pass real data x to D, calculate D(x).
• Generate fake data x̂ = G(z). Pass x̂ to D,
z =µ+σ⊙ϵ calculate D(x̂).
• Update D’s weights to maximize log D(x) +
Where ϵ ∼ N (0, I) is a random sample from a stan- log(1−D(x̂)). (Effectively, D wants D(x) ≈ 1
dard normal distribution, and ⊙ denotes element-wise and D(x̂) ≈ 0).
multiplication. • Loss for D (Binary Cross-Entropy):
Steps for Training
1. Encode: Pass input x through the encoder to get LD = −Ex∼pdata (x) [log D(x)]−Ez∼pz (z) [log(1−D(G(z)))]
µ and log σ 2 .
2. Train Generator (G):
2. Reparameterize: Sample z using the reparame- • Generate fake data x̂ = G(z). Pass x̂ to D,
terization trick. calculate D(x̂).
• Update G’s weights to minimize log(1−D(x̂)).
3. Decode: Pass z through the decoder to recon- (Effectively, G wants D(x̂) ≈ 1 for its fake
struct x̂. data). A common alternative is to maximize
4. Calculate Loss: log D(x̂) to provide stronger gradients early
in training.
• Reconstruction Loss: Measures how well • Loss for G (Binary Cross-Entropy):
x̂ matches x (e.g., MSE for continuous data,
BCE for binary data). LG = −Ez∼pz (z) [log D(G(z))]
• KL Divergence Loss: Measures how close
qϕ (z|x) is to p(z). For Gaussian distributions: 3. Repeat steps 1 and 2, typically training D for a
few steps then G for one step.
D
1X
DKL (N (µ, σ 2 )||N (0, I)) = (exp(log σk2 ) Conditional Generative Adversar-
2
k=1
+ µ2k − 1 − log σk2 ) (1)
ial Networks (CGANs)
Conditional GANs (CGANs) are an extension of GANs
Where D is the dimension of the latent space. that allow for conditioning the generation process on
some auxiliary information, such as a class label or at-
5. Backpropagate and Update: Sum the two loss tributes. This gives more control over the generated
terms and use backpropagation to update encoder data.
and decoder weights. Architecture
Both the Generator and Discriminator receive an addi-
Generative Adversarial Networks tional input, c (the condition), alongside their original
inputs.
(GANs) • Generator (G): Takes a random noise vector z
Architecture and a condition c as input, generating data x̂ =
• Generator (G): Takes a random noise vector z as G(z, c).
input and generates synthetic data x̂ = G(z). Its • Discriminator (D): Takes either real data x
goal is to produce data that is indistinguishable (with its corresponding condition cx ) or synthetic
from real data. data x̂ (with its corresponding condition cz ) as in-
• Discriminator (D): Takes either real data x or put. So, D receives (x, cx ) or (x̂, cz ).
synthetic data x̂ as input and tries to distinguish Training Objective
between them. Its output D(x) is a probability The objective function is a conditional version of the
that the input is real. Its goal is to correctly classify original GAN objective:
real vs. fake.
Training Objective (Minimax Game) min max V (D, G) = Ex∼pdata (x),c∼pc (c) [log D(x|c)]
G D
GANs are trained as a minimax game, where G tries
to minimize the value function V (D, G) and D tries to + Ez∼pz (z),c∼pc (c) [log(1 − D(G(z|c)))] (3)
maximize it.
Where the conditioning variable c is incorporated into
both the generator and discriminator.
min max V (D, G) = Ex∼pdata (x) [log D(x)]
G D Simplified Training Steps
+ Ez∼pz (z) [log(1 − D(G(z)))] (2) The training steps are similar to GANs, but with the
condition c concatenated to relevant inputs.
Where:
• pdata (x) is the distribution of real data. 1. Train Discriminator (D):
• pz (z) is the distribution of the input noise. • Sample real data x and its condition cx . Pass
• D(x) is the discriminator’s output for real data. (x, cx ) to D.
• D(G(z)) is the discriminator’s output for generated • Sample random noise z and a condition cz .
data. Generate fake data x̂ = G(z, cz ). Pass (x̂, cz )
to D.

9
• Update D’s weights to maximize Core Concept: Contrastive Learning
log D(x|cx ) + log(1 − D(x̂|cz )). CLIP learns by maximizing the similarity of correct
image-text pairs and minimizing the similarity of in-
2. Train Generator (G): correct pairs. This is achieved using a contrastive loss
• Sample random noise z and a condition cz . function.
Generate fake data x̂ = G(z, cz ). Pass (x̂, cz ) Architecture
to D. CLIP consists of two main encoders:
• Update G’s weights to minimize log(1 − • Image Encoder: Encodes an image into an em-
D(x̂|cz )) (or maximize log D(x̂|cz )). bedding vector.
• Text Encoder: Encodes a text caption into an
BERT (Bidirectional Encoder embedding vector.
Representations from Transform- Contrastive Loss Formula
Given a batch of N image-text pairs, we form an N ×N
ers) similarity matrix. The rows represent image embed-
BERT is a pre-training language representation model dings, and columns represent text embeddings. The
designed to deeply understand context from both left diagonal elements are positive pairs, and off-diagonal
and right sides of a token. It achieves this by using elements are negative pairs.
a Transformer encoder and being pre-trained on two The loss for a batch is a sum of two softmax cross-
specific tasks. entropy losses: one for images (treating text as targets)
Key Idea: Bidirectional Context and one for text (treating images as targets).
Unlike previous models that read text input sequen- Let Ii be the i-th image embedding and Tj be the j-
tially (left-to-right or right-to-left), BERT processes th text embedding in a batch. Let sim(Ii , Tj ) = exp(Ii ·
words in relation to all other words in a sentence si- T j /τ ) where τ is a learnable temperature parameter.
multaneously. This is achieved by using a deep stack of
Transformer encoder layers. Image-to-Text Loss (LI ): For each image i, we
Pre-training Tasks want it to be highly similar to its true paired text Ti
BERT is pre-trained on large text corpora using two and dissimilar to all other texts Tj (j ̸= i).
unsupervised tasks:
N
• Masked Language Model (MLM): X sim(Ii , Ti )
– Concept: Randomly masks out some per- L I = − log PN
centage of tokens from the input, then pre- i=1 k=1 sim(Ii , Tk )
dicts the original vocabulary ID of the masked
word based on its context. Text-to-Image Loss (LT ): Similarly, for each text
– Steps: j, we want it to be highly similar to its true paired
1. Input sentence with 15% of tokens ran- image Ij and dissimilar to all other images Ik (k ̸= j).
domly masked. N
2. BERT predicts masked tokens using the X sim(Ij , Tj )
LT = − log PN
surrounding context.
j=1 k=1 sim(Ik , Tj )
3. Loss: Standard Cross-Entropy loss cal-
culated only on the masked tokens.
Total Contrastive Loss:
• Next Sentence Prediction (NSP):
– Concept: Predicts whether a second sen- 1
tence logically follows the first sentence. LCLIP = (LI + LT )
2
– Steps:
1. Given two sentences, e.g., “Sentence A” During training, the encoders’ parameters are updated
and “Sentence B”. to minimize this loss, bringing positive pairs closer and
pushing negative pairs apart in the joint embedding
2. BERT predicts if “Sentence B” is the ac- space.
tual next sentence of “Sentence A” (Is-
Next) or a random sentence (NotNext). GPT (Generative Pre-trained
3. Loss: Binary Cross-Entropy loss for clas- Transformer)
sification. GPT models are a series of large language models devel-
Application (Fine-tuning) oped by OpenAI, primarily based on the Transformer’s
After pre-training, BERT can be fine-tuned with a sin- decoder architecture. They are designed for generative
gle additional output layer for a wide range of down- tasks, learning to predict the next token in a sequence.
stream tasks (e.g., text classification, question answer- Core Idea: Autoregressive Language
ing) with minimal task-specific architecture modifica- Modeling
tions. GPT models are pre-trained on a massive amount of
Contrastive Loss for CLIP diverse text data using a simple objective: predicting
CLIP (Contrastive Language-Image Pre-training) is a the next word in a sentence. This unsupervised pre-
model that learns visual concepts from natural language training allows them to learn grammar, facts, reasoning
supervision. It’s trained on a vast dataset of image- abilities, and general language understanding.
text pairs to learn a joint embedding space where cor- Architecture
responding image and text embeddings are close, and GPT models utilize a stack of Transformer decoder
uncorresponding ones are far apart. blocks. A key difference from the full Transformer is
the masked self-attention mechanism.

10
• Masked Self-Attention: Each token can only Steps for Training
attend to previous tokens in the sequence. This 1. Sample Data and Noise: Sample a data point
prevents tokens from ”seeing” future information, x0 from the training set and a timestep t ∼ U (1, T ).
which is crucial for the autoregressive (left-to-
right) generation process. 2. Add Noise: Generate a noisy version xt by adding
Pre-training Objective (Language Mod- Gaussian noise ϵ to x0 according to the forward
process formula.
eling Loss)
The model is trained to minimize the negative log- 3. Predict Noise: Pass xt and t to the neural net-
likelihood of predicting the next token, given the pre- work ϵθ .
ceding tokens.
4. Calculate Loss: Compute the MSE between the
1 X
N predicted noise ϵθ (xt , t) and the true noise ϵ.
LLM =− log P (tokeni |tokeni−1 , . . . , token1 )
N i=1 5. Backpropagate and Update: Use backpropa-
gation to update the network’s weights.
Where:
• N is the number of tokens in the sequence. Steps for Generation (Sampling)
• P (tokeni | . . . ) is the probability of the i-th token 1. Start with Noise: Sample random Gaussian
given all previous tokens. noise xT ∼ N (0, I).
This is essentially a Categorical Cross-Entropy loss over
the vocabulary at each time step. 2. Iterative Denoising: For t = T, T − 1, . . . , 1:
Application (Few-shot/Zero-shot Learn- • Predict the noise ϵθ (xt , t) using the trained
ing & Fine-tuning) network.
After pre-training, GPT models can be used for various • Apply the denoising step to xt to get xt−1 .
downstream tasks: This involves subtracting the predicted noise
• Few-shot/Zero-shot Learning (In-context and adding a small amount of new noise for
learning): By providing a few examples within stochasticity (unless using deterministic sam-
the prompt, the model can perform tasks without pling).
explicit fine-tuning. • The formula for xt−1 is complex but effec-
• Fine-tuning: A small task-specific dataset can be tively reverses the forward step using the pre-
used to further train the model for specific appli- dicted noise. A simplified conceptual step:
cations, similar to BERT.  
1 1 − αt
Diffusion Models (DDPM - De- xt−1 = √ xt − √ ϵθ (xt , t) + σt z
αt 1 − ᾱt
noising Diffusion Probabilistic
where αt and ᾱt are terms from the forward
Models) process schedule, and z ∼ N (0, I) for stochas-
Core Concept: Forward and Reverse ticity.
Process 3. Final Output: After T steps, x0 is the generated
• Forward Diffusion Process (Noising): A fixed sample.
Markov chain that gradually adds Gaussian noise
to a data sample x0 over T steps, producing a se-
quence of noisy samples x1 , x2 , . . . , xT . As T → ∞,
xT approaches pure Gaussian noise.
• Reverse Diffusion Process (Denois-
ing/Generation): A learned Markov chain
that starts from pure noise xT and gradually
denoises it over T steps to reconstruct the original
data x0 . The model learns to predict the noise
added at each step of the forward process.
Loss Function
The training objective is to learn the reverse transitions.
For DDPMs, this simplifies to learning to predict the
noise ϵ added at each step. Given a noisy sample xt at
timestep t, the model ϵθ (xt , t) predicts the noise.

LDDPM = Et∼U (1,T ),x0 ∼q(x0 ),ϵ∼N (0,I) ||ϵ − ϵθ (xt , t)||2
 

Where:
• ϵ is the actual noise added at step t.
• ϵθ (xt , t) is the neural network’s prediction of the
noise, parameterized by θ.
• xt is the noisy sample at time t, obtained by ap-
plying noise to x0 .
• t is sampled uniformly from 1 to T .
This is a simple Mean Squared Error (MSE) between
the true noise and the predicted noise.

11

You might also like