1. Standard Convolution
The basic sliding filter operation across the input.
Each filter computes a weighted sum over its receptive field.
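A minimal sketch of the sliding weighted sum in one dimension, assuming no padding and stride 1. The name `conv1d` and the sample values are illustrative only. As in most deep learning libraries, the kernel is not flipped (strictly a cross-correlation).

```python
def conv1d(x, w):
    """Slide kernel w over x (no padding, stride 1) and return the weighted sums."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [1, 2, 3, 4, 5]
w = [1, 0, -1]          # simple difference filter
y = conv1d(x, w)
print(y)                # [-2, -2, -2]
```

Each output value sees exactly len(w) neighbouring inputs, which is the receptive field of that output.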
2. Dilated (Atrous) Convolution
Introduces gaps ("dilations") between kernel elements.
Increases receptive field without increasing parameters.
Useful in segmentation tasks (e.g., DeepLab).
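A small sketch of how dilation widens the receptive field while the number of weights stays the same. The function names are hypothetical, and the example assumes a 1D input, no padding, and stride 1.

```python
def effective_kernel_size(k, dilation):
    """Span covered by k kernel taps spaced `dilation` samples apart."""
    return dilation * (k - 1) + 1

def dilated_conv1d(x, w, dilation=1):
    """1D convolution (no padding) whose taps skip `dilation - 1` samples."""
    k = len(w)
    span = effective_kernel_size(k, dilation)
    return [sum(w[j] * x[i + j * dilation] for j in range(k))
            for i in range(len(x) - span + 1)]

x = list(range(10))                 # 0, 1, ..., 9
w = [1, 1, 1]                       # still only three parameters
y = dilated_conv1d(x, w, dilation=2)
print(effective_kernel_size(3, 2))  # 5
print(y)                            # [6, 9, 12, 15, 18, 21]
```

With dilation 2, a 3-tap kernel covers 5 input positions, so each output sums x[i], x[i+2], and x[i+4].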
3. Transposed Convolution (Deconvolution / Fractionally Strided)
Upsampling variant of convolution.
Used for generating larger feature maps (e.g., in decoders, GANs).
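One way to picture the upsampling is that every input value is multiplied by the kernel and scattered onto a larger output grid, with overlaps summed. The sketch below assumes a 1D input and no padding; `conv_transpose1d` is an illustrative name, not a library call.

```python
def conv_transpose1d(x, w, stride=1):
    """Scatter each input value, scaled by the kernel, onto the output grid."""
    k = len(w)
    out = [0] * ((len(x) - 1) * stride + k)   # output length without padding
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            out[i * stride + j] += xi * wj    # overlapping contributions add up
    return out

y = conv_transpose1d([1, 2, 3], [1, 1], stride=2)
print(y)   # [1, 1, 2, 2, 3, 3]
```

Here a length-3 input becomes a length-6 output. In a trained network the kernel weights are learned, so the layer learns how to upsample.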
4. Separable Convolutions
Spatially Separable: Breaks a 2D kernel into two 1D kernels (e.g., 3×3 → 3×1 + 1×3).
Depthwise Separable: Splits convolution into two steps:
1. Depthwise convolution (per-channel).
2. Pointwise convolution (1×1 across channels).
Used in MobileNet for efficiency.
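To make the efficiency claim concrete, the sketch below counts weights (biases ignored) for a standard versus a depthwise separable layer, and builds a 3×3 kernel from a 3×1 and a 1×3 kernel. The channel counts (64 in, 128 out) are arbitrary illustrative values. Note that only rank-1 kernels factorize exactly this way.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k×k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    depthwise = k * k * c_in     # one k×k filter per input channel
    pointwise = c_in * c_out     # 1×1 convolution that mixes channels
    return depthwise + pointwise

def outer(col, row):
    """Combine a k×1 and a 1×k kernel into the equivalent k×k kernel."""
    return [[c * r for r in row] for c in col]

std = standard_conv_params(64, 128, 3)
sep = depthwise_separable_params(64, 128, 3)
print(std, sep)                           # 73728 8768
kernel_2d = outer([1, 2, 1], [1, 0, -1])  # 6 weights describe 9 entries
print(kernel_2d)                          # [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
```

For these sizes the separable layer uses roughly 8 times fewer weights than the standard one.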
5. Grouped Convolution
Input channels are split into groups, and each group is convolved separately.
Reduces parameters and computation by roughly a factor equal to the number of groups.
Used in ResNeXt and AlexNet.
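A rough weight count showing the saving from grouping, with biases ignored. The helper name and the channel counts are assumptions chosen for illustration.

```python
def grouped_conv_params(c_in, c_out, k, groups):
    """Weights in a k×k convolution whose channels are split into `groups`."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    per_group = k * k * (c_in // groups) * (c_out // groups)
    return groups * per_group

print(grouped_conv_params(64, 128, 3, groups=1))   # 73728 (standard convolution)
print(grouped_conv_params(64, 128, 3, groups=4))   # 18432 (4 times fewer)
print(grouped_conv_params(64, 64, 3, groups=64))   # 576   (depthwise: one group per channel)
```

Setting groups equal to the number of input channels gives the depthwise convolution from the previous section.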
6. Pointwise Convolution (1×1 Conv)
A convolution with kernel size = 1.
Used for channel mixing and dimensionality reduction.
Core part of Inception modules.
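A 1×1 convolution can be read as the same small matrix multiplication applied independently at every pixel. The sketch below assumes a feature map stored as fmap[row][col] = list of channel values; the weights are hand-picked for illustration, not learned.

```python
def pointwise_conv(pixel, weights):
    """Mix the channels of one pixel: weights has c_out rows of c_in values."""
    return [sum(w * v for w, v in zip(row, pixel)) for row in weights]

def pointwise_conv_map(fmap, weights):
    """Apply the same channel mixing at every spatial location."""
    return [[pointwise_conv(pixel, weights) for pixel in row] for row in fmap]

fmap = [[[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [0, 1, 0]]]      # 2×2 map with 3 channels
weights = [[1, 0, 0],                # output channel 0 copies input channel 0
           [0, 1, 1]]                # output channel 1 sums input channels 1 and 2
out = pointwise_conv_map(fmap, weights)
print(out)   # [[[1, 5], [4, 11]], [[7, 17], [0, 1]]]
```

The spatial size stays 2×2 while the channel count drops from 3 to 2, which is the dimensionality reduction mentioned above.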
7. Causal Convolution
Ensures output at time t depends only on input at time ≤ t.
Used in temporal models like WaveNet.
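A minimal sketch of causality through left-only zero padding, so that y[t] never reads x at a later time step. The name `causal_conv1d` and the inputs are illustrative.

```python
def causal_conv1d(x, w):
    """y[t] = sum_j w[j] * x[t - j], with x treated as zero before time 0."""
    k = len(w)
    padded = [0] * (k - 1) + list(x)            # pad on the left only
    return [sum(w[j] * padded[t + k - 1 - j] for j in range(k))
            for t in range(len(x))]

w = [1, 1]
y = causal_conv1d([1, 2, 3, 4], w)
print(y)                                        # [1, 3, 5, 7]

# Changing a future input leaves earlier outputs untouched.
y_changed = causal_conv1d([1, 2, 3, 100], w)
print(y_changed[:3] == y[:3])                   # True
```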
8. Deformable Convolution
Learns offsets for sampling positions instead of fixed grid.
Improves handling of geometric transformations (e.g., object detection).
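A simplified 1D sketch of sampling at offset positions. In an actual deformable convolution the offsets are predicted from the input by an additional learned layer and sampling is bilinear in 2D; here the offsets are passed in by hand and linear interpolation is used, purely for illustration.

```python
import math

def sample_linear(x, pos):
    """Linearly interpolate x at a fractional position (0 outside the signal)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return x[i] if 0 <= i < len(x) else 0.0

    return (1 - frac) * at(lo) + frac * at(lo + 1)

def deformable_conv1d_at(x, w, p, offsets):
    """Output at position p: sum_j w[j] * x(p + j + offsets[j])."""
    return sum(wj * sample_linear(x, p + j + dj)
               for j, (wj, dj) in enumerate(zip(w, offsets)))

x = [0.0, 10.0, 20.0, 30.0, 40.0]
w = [1.0, 1.0, 1.0]
fixed_grid = deformable_conv1d_at(x, w, 1, [0.0, 0.0, 0.0])   # samples x[1], x[2], x[3]
shifted = deformable_conv1d_at(x, w, 1, [0.5, 0.5, 0.5])      # samples x(1.5), x(2.5), x(3.5)
print(fixed_grid, shifted)   # 60.0 75.0
```

With zero offsets the result matches a standard convolution on the fixed grid.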
In short:
Standard = basic
Dilated = bigger receptive field
Transposed = upsampling
Separable (depthwise/pointwise) = efficiency
Grouped = channel grouping
Pointwise (1×1) = channel mixing
Causal = time-series
Deformable = adaptive receptive field
| Variant | Operation / Formula | Key Idea | Use Case |
|---|---|---|---|
| Standard Convolution | $y = \sum w \cdot x$ | Sliding kernel over input, weighted sum | Feature extraction in CNNs |
| Dilated (Atrous) | $y = \sum w \cdot x_{d \cdot i}$ | Inserts gaps (dilation rate) in kernel | Semantic segmentation (DeepLab) |
| Transposed (Deconv) | Reverse of standard conv | Spreads input over output grid, learns upsampling | Image generation, decoders, GANs |
| Spatially Separable | 2D kernel → two 1D kernels | Factorizes kernel (e.g., 3×3 → 3×1 + 1×3) | Reduces computation |
| Depthwise Separable | Depthwise conv + Pointwise (1×1) | Convolution per channel + mixing channels | MobileNet, efficient CNNs |
| Grouped Convolution | Split input channels into groups | Convolve each group separately | AlexNet, ResNeXt |
| Pointwise (1×1) | Kernel size = 1×1 | Mixes channels, dimensionality reduction | Inception modules |
| Causal Convolution | $y_t = \sum w \cdot x_{\leq t}$ | Only depends on current & past inputs | Time-series, WaveNet |
| Deformable Convolution | $y = \sum w \cdot x(p + \Delta p)$ | Learns offsets for sampling positions | Object detection, dense prediction |
CNN LEARNING: NONLINEARITY FUNCTIONS IN CNN
After each convolution layer extracts features, an activation function introduces
nonlinearity so that the CNN can approximate nonlinear decision boundaries.
Without nonlinearity, multiple convolution layers would collapse into a single linear
transformation → the CNN would behave like a single linear classifier.
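The collapse can be checked numerically. The sketch below uses two small matrices as stand-ins for two linear layers (a convolution is also a linear map); W1, W2, and x are arbitrary illustrative values.

```python
def matvec(a, v):
    return [sum(r * x for r, x in zip(row, v)) for row in a]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(v):
    return [max(0, x) for x in v]

W1 = [[1, -1], [2, 0]]
W2 = [[1, 1], [0, -1]]
x = [1, 2]

two_layers = matvec(W2, matvec(W1, x))        # layer 1, then layer 2
one_layer = matvec(matmul(W2, W1), x)         # single merged layer W2·W1
with_relu = matvec(W2, relu(matvec(W1, x)))   # ReLU between the layers

print(two_layers, one_layer)   # [1, -2] [1, -2]
print(with_relu)               # [2, -2]
```

Without an activation, the two layers give exactly the same output as the single merged matrix. Inserting ReLU between them breaks that equivalence, so depth starts to add expressive power.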
Flow of Nonlinearity in CNN
1. Input image → Convolution layer (linear feature extraction)
2. Activation (ReLU, etc.) → Nonlinearity
3. Pooling → Downsampling
4. Stack multiple layers (conv + activation)
5. Fully connected + Softmax for final prediction
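The flow above can be sketched end to end on a tiny 1D signal. Everything here is a toy assumption: the kernel and the fully connected weights are hand-picked rather than learned, and step 4 (stacking more conv + activation layers) is left out for brevity.

```python
import math

def conv1d(x, w):
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]

def relu(v):
    return [max(0.0, a) for a in v]

def max_pool1d(v, size=2):
    return [max(v[i:i + size]) for i in range(0, len(v) - size + 1, size)]

def dense(v, weights):
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def softmax(scores):
    m = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

signal = [1.0, 3.0, 2.0, 5.0, 4.0]
features = conv1d(signal, [1.0, -1.0])       # step 1: linear feature extraction
activated = relu(features)                   # step 2: nonlinearity
pooled = max_pool1d(activated, size=2)       # step 3: downsampling
scores = dense(pooled, [[1.0, 1.0], [0.0, 0.0]])   # step 5: fully connected
probs = softmax(scores)                      # step 5: class probabilities

print(features)    # [-2.0, 1.0, -3.0, 1.0]
print(activated)   # [0.0, 1.0, 0.0, 1.0]
print(pooled)      # [1.0, 1.0]
```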
| Activation Function | Advantages | Disadvantages |
|---|---|---|
| Sigmoid | Smooth output between 0 and 1 (probability-like); Historically well understood | Vanishing gradient (small updates in deep layers); Not zero-centered → slower convergence |
| Tanh | Output between -1 and 1 (zero-centered); Stronger gradients than sigmoid | Still suffers vanishing gradient; Slower than ReLU |
| ReLU | Very fast to compute; Reduces vanishing gradient problem; Sparse activation (only positive neurons fire) | Dead neuron problem (neurons stuck at 0 forever); Not smooth at 0 |
| Leaky ReLU | Fixes dead neuron issue (small slope for negatives); Works better than ReLU in some tasks | Extra parameter α to tune; Slightly more compute |
| ELU | Smooth curve for negative values; Faster convergence than ReLU; Mean activations closer to 0 → helps training | More computationally expensive; Slower than ReLU |
| Softmax (output layer) | Converts raw scores into probability distribution; Good for classification | Not used in hidden layers; Can be unstable with very large inputs (needs normalization) |
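For reference, the activations in the table can be written as short scalar functions. This is a plain-Python sketch; the default values alpha=0.01 for Leaky ReLU and alpha=1.0 for ELU are common choices assumed here, not values given in these notes. The softmax subtracts the largest score before exponentiating, which is the usual normalization that avoids overflow on very large inputs.

```python
import math

def sigmoid(x):
    # Two branches so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)      # at most 0.25, and tiny for large |x| (vanishing gradient)

def tanh(x):
    return math.tanh(x)       # zero-centered, output in (-1, 1)

def relu(x):
    return max(0.0, x)        # gradient is 0 for x < 0 (source of dead neurons)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def softmax(scores):
    m = max(scores)           # shift so the largest exponent is exp(0) = 1
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid_grad(0.0))                     # 0.25
probs = softmax([1000.0, 1001.0, 1002.0])    # would overflow without the shift
```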