Understanding CNNs with Visuals
Step-by-Step Example Using a 6x6 Image
Introduction
In this document, we walk step by step through how a Convolutional Neural Network (CNN) processes an image, using a small 6 × 6 grayscale image and showing the calculations at each stage.
We also include visualizations using the TikZ package to help you see what’s happening at each stage.
—
Step 1: Input Image (6x6 Grayscale)
Let’s begin with a small grayscale image of size 6 × 6:
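\[
\text{Input Image} =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 5 & 5 & 5 & 5 & 0 \\
0 & 5 & 9 & 9 & 5 & 0 \\
0 & 5 & 9 & 9 & 5 & 0 \\
0 & 5 & 5 & 5 & 5 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\]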
Below is a visual representation using shades of gray, in which a bright center (value 9) sits inside a dimmer ring (value 5) on a dark border (value 0).
This could represent part of a digit or shape.
—
Step 2: Convolution Layer – Apply Filter
Apply a vertical edge detection filter:
\[
\text{Filter } (3 \times 3) =
\begin{bmatrix}
-1 & 0 & 1 \\
-1 & 0 & 1 \\
-1 & 0 & 1
\end{bmatrix}
\]
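Now slide this filter over the image one pixel at a time (stride 1, no padding). At each position, compute the dot product of the filter with the 3 × 3 patch beneath it. The filter fits in 6 − 3 + 1 = 4 positions along each axis, so the result is a 4 × 4 feature map.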
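Here’s one position, with the filter placed over the top-left 3 × 3 patch of the image:
\[
\text{Patch} =
\begin{bmatrix}
0 & 0 & 0 \\
0 & 5 & 5 \\
0 & 5 & 9
\end{bmatrix}
\]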
Dot product calculation:
(−1)(0)+(0)(0)+(1)(0)+(−1)(0)+(0)(5)+(1)(5)+(−1)(0)+(0)(5)+(1)(9) = 0+0+0+0+0+5+0+0+9 = 14
Continue this process across the image...
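The sliding-window calculation can be sketched in plain Python with no libraries. This is a minimal sketch that assumes a square image and filter, stride 1, and no padding. The names `IMAGE`, `KERNEL`, `patch_dot`, and `conv2d_valid` are illustrative and do not come from any library. The sketch follows the convolution layers in common deep-learning frameworks and does not flip the filter. Strictly speaking, it therefore computes a cross-correlation, which is the same calculation done by hand above.

```python
# Input image (Step 1) and vertical edge detection filter (Step 2).
IMAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 5, 5, 5, 5, 0],
    [0, 5, 9, 9, 5, 0],
    [0, 5, 9, 9, 5, 0],
    [0, 5, 5, 5, 5, 0],
    [0, 0, 0, 0, 0, 0],
]

KERNEL = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def patch_dot(image, kernel, row, col):
    """Dot product of the kernel with the patch whose top-left corner is (row, col)."""
    k = len(kernel)
    return sum(
        kernel[i][j] * image[row + i][col + j]
        for i in range(k)
        for j in range(k)
    )


def conv2d_valid(image, kernel):
    """Slide the kernel with stride 1 and no padding over a square image.

    The kernel is not flipped (cross-correlation), as in typical CNN layers.
    """
    k = len(kernel)
    out_size = len(image) - k + 1  # 6 - 3 + 1 = 4 for this example
    return [
        [patch_dot(image, kernel, r, c) for c in range(out_size)]
        for r in range(out_size)
    ]


print(patch_dot(IMAGE, KERNEL, 0, 0))  # 14
for row in conv2d_valid(IMAGE, KERNEL):
    print(row)
# [14, 4, -4, -14]
# [23, 8, -8, -23]
# [23, 8, -8, -23]
# [14, 4, -4, -14]
```

The first printed value, 14, matches the hand calculation for the top-left position. The loop then prints one row of the feature map per line.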
Final Feature Map:
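\[
\begin{bmatrix}
14 & 4 & -4 & -14 \\
23 & 8 & -8 & -23 \\
23 & 8 & -8 & -23 \\
14 & 4 & -4 & -14
\end{bmatrix}
\]
Positive values mark the left edge of the bright shape, where brightness increases from left to right. Negative values mark the right edge, where brightness decreases. The two middle rows have the largest magnitudes because their windows include both rows of 9s.
—
Step 3: ReLU Activation – Remove Negatives
Apply ReLU(x) = max(0, x) to every entry, which replaces each negative value with 0:
Before ReLU:
\[
\begin{bmatrix}
14 & 4 & -4 & -14 \\
23 & 8 & -8 & -23 \\
23 & 8 & -8 & -23 \\
14 & 4 & -4 & -14
\end{bmatrix}
\;\Rightarrow\;
\text{After ReLU} =
\begin{bmatrix}
14 & 4 & 0 & 0 \\
23 & 8 & 0 & 0 \\
23 & 8 & 0 & 0 \\
14 & 4 & 0 & 0
\end{bmatrix}
\]
Only the positive (left-edge) responses remain.
—
Step 4: Max Pooling – Reduce Size
Use a 2 × 2 window with stride 2. The windows do not overlap, so the 4 × 4 map shrinks to 2 × 2.
Result after max pooling:
\[
\begin{bmatrix}
23 & 0 \\
23 & 0
\end{bmatrix}
\]
Each output value is the largest entry in one 2 × 2 block. The top-left block {14, 4, 23, 8} and the bottom-left block {23, 8, 14, 4} both give 23, and the two right-hand blocks contain only zeros.
Step 5: Flatten – Turn into a List
Convert the 2D output to 1D by reading it row by row:
Flattened Output = [23, 0, 23, 0]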
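ReLU, max pooling, and flattening each take only a few lines of plain Python. This minimal sketch starts from the 4 × 4 feature map that the vertical edge filter produces on the input image; the map is typed in directly so the block runs on its own. The helper names `relu`, `max_pool`, and `flatten` are illustrative. `max_pool` assumes the windows tile the input, and it drops any leftover rows or columns.

```python
# 4 x 4 feature map produced by the vertical edge filter on the 6 x 6 input image.
FEATURE_MAP = [
    [14, 4, -4, -14],
    [23, 8, -8, -23],
    [23, 8, -8, -23],
    [14, 4, -4, -14],
]


def relu(matrix):
    """Apply ReLU(x) = max(0, x) to every entry."""
    return [[max(0, x) for x in row] for row in matrix]


def max_pool(matrix, size=2, stride=2):
    """Take the maximum of each size x size window, moving `stride` cells at a time."""
    out_rows = (len(matrix) - size) // stride + 1
    out_cols = (len(matrix[0]) - size) // stride + 1
    return [
        [
            max(
                matrix[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def flatten(matrix):
    """Concatenate the rows into one list (row-major order)."""
    return [x for row in matrix for x in row]


activated = relu(FEATURE_MAP)
pooled = max_pool(activated)
print(activated)         # [[14, 4, 0, 0], [23, 8, 0, 0], [23, 8, 0, 0], [14, 4, 0, 0]]
print(pooled)            # [[23, 0], [23, 0]]
print(flatten(pooled))   # [23, 0, 23, 0]
```

Passing `stride=1` to `max_pool` would instead produce overlapping windows and a 3 × 3 result.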
—
Step 6: Fully Connected Layers – Make Prediction
The flattened vector is passed to one or more dense (fully connected) layers, which combine the features to classify the image.
Output might look like:
[0.8, 0.1, 0.05, 0.05] ⇒ Predicted Class: 0
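The final step can be sketched as one dense layer followed by softmax, a common choice for the output of a classifier. This minimal sketch assumes a single dense layer with four outputs, one per class. The weights and biases are made up for illustration; a trained network learns them, so these probabilities will differ from the example output above. The predicted class is the index of the largest probability.

```python
import math

# Flattened output of the pooling step for this example image.
flattened = [23, 0, 23, 0]

# Hypothetical weights and biases for one dense layer with 4 outputs (one per class).
# These values are invented for illustration; a trained network would learn them.
WEIGHTS = [
    [0.1, 0.0, 0.1, 0.0],   # class 0
    [0.0, 0.1, 0.0, 0.1],   # class 1
    [-0.1, 0.0, 0.0, 0.0],  # class 2
    [0.0, 0.0, -0.1, 0.0],  # class 3
]
BIASES = [0.0, 0.0, 0.0, 0.0]


def dense(x, weights, biases):
    """One output per weight row: the sum of weight * input, plus that row's bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = dense(flattened, WEIGHTS, BIASES)
probs = softmax(logits)
predicted = probs.index(max(probs))  # index of the largest probability
print([round(p, 3) for p in probs])
print("Predicted class:", predicted)  # Predicted class: 0
```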