Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8 → 19 layers), 140M parameters, ensembles
– GoogLeNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
2a. Spatial structure for image recognition
Using Spatial Structure
Input: 2D image. Array of pixel values.
Idea: connect patches of input to neurons in hidden layer.
Neuron connected to region of input. Only “sees” these values.
Using Spatial Structure
Connect patch in input layer to a single neuron in subsequent layer.
Use a sliding window to define connections.
How can we weight the patch to detect particular features?
Feature Extraction with Convolution
- Filter of size 4x4 : 16 different weights
- Apply this same filter to 4x4 patches in input
- Shift by 2 pixels for next patch
This “patchy” operation is convolution
1) Apply a set of weights – a filter – to extract local features
2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
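The sliding-window operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name `conv2d_valid`, the toy 8x8 image, and the averaging filter are all assumptions chosen for the example. Like most deep-learning libraries it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]
            # The same 16 weights are reused at every patch (parameter sharing).
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(64, dtype=float).reshape(8, 8)  # toy 8x8 "image" (assumed)
kernel = np.ones((4, 4)) / 16.0                   # 4x4 filter: 16 weights (assumed averaging filter)
feature_map = conv2d_valid(image, kernel, stride=2)  # shift by 2 pixels per patch
print(feature_map.shape)  # (3, 3)
```

Using several different kernels on the same image would give one feature map per kernel.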
Fully Connected Neural Network
Input:
• 2D image
• Vector of pixel values
Fully Connected:
• Each neuron in hidden layer connected to all neurons in input layer
• No spatial information
• Many, many parameters
Key idea: Use spatial structure in input to inform architecture of the network
High Level Feature Detection
Let’s identify key features in each image category
Nose, Eyes, Mouth
Wheels, License Plate, Headlights
Door, Windows, Steps
2b. Convolutions and filters
Convolution operation is element-wise multiply and add
Filter / Kernel
Producing Feature Maps
Original | Sharpen | Edge Detect | “Strong” Edge Detect
A simple pattern: Edges
How can we detect edges with a kernel?
[Figure: Input, Filter (entries shown: -1 -1), Output] (Goodfellow 2016)
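One way to answer the question above is a kernel that subtracts each pixel from its right-hand neighbour. The sketch below is illustrative only: the two-entry kernel `[-1, 1]` and the toy half-dark, half-bright image are assumptions, not values taken from the figure.

```python
import numpy as np

# Toy image (assumed): dark left half, bright right half.
image = np.zeros((4, 6))
image[:, 3:] = 1.0

# Assumed kernel: right pixel minus left pixel.
kernel = np.array([[-1.0, 1.0]])

kh, kw = kernel.shape
out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # Element-wise multiply and add over the current patch.
        out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)

print(out[0].tolist())  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

The output is zero wherever the image is flat and non-zero only at the column where the brightness changes, which is the vertical edge.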
Simple Kernels / Filters
X or X?
Image is represented as matrix of pixel values… and computers are literal!
We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed.
Rohrer How do CNNs work?
There are three approaches to edge cases in convolution
Zero Padding Controls Output Size
(Goodfellow 2016)
• Same convolution: zero pad input so output is same size as input dimensions
• Valid-only convolution: output only when entire kernel contained in input (shrinks output)
• Full convolution: zero pad input so output is produced whenever an output value contains at least one input value (expands output)
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
• TF convolution operator takes stride and zero fill option as parameters
• Stride is distance between kernel applications in each dimension
• Padding can be SAME or VALID
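The effect of the three padding conventions on output size can be checked in one dimension with NumPy's `np.convolve`, which supports `valid`, `same`, and `full` modes. This is a small sketch with an assumed toy signal and kernel; note that `np.convolve` flips the kernel (true convolution), which does not affect the output lengths shown here.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # input of length N = 5 (assumed)
kernel = np.array([1.0, 0.0, -1.0])           # kernel of length F = 3 (assumed)

lengths = {}
for mode in ("valid", "same", "full"):
    out = np.convolve(signal, kernel, mode=mode)
    lengths[mode] = len(out)

print(lengths)  # {'valid': 3, 'same': 5, 'full': 7}
```

With stride 1, valid gives N - F + 1 outputs, same gives N, and full gives N + F - 1.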
3a. Learning Visual Features de novo
Key idea: learn hierarchy of features directly from the data (rather than hand-engineering them)
Low level features: Edges, dark spots
Mid level features: Eyes, ears, nose
High level features: Facial structure
Lee+ ICML 2009
Key idea: re-use parameters
Convolution shares parameters
Example 3x3 convolution on a 5x5 image
Feature Extraction with Convolution
1) Apply a set of weights – a filter – to extract local features
2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
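The saving from sharing parameters can be made concrete for the 3x3 convolution on a 5x5 image mentioned above. The sketch below only counts weights (biases ignored); the comparison cases and variable names are assumptions added for illustration.

```python
image_h = image_w = 5
k = 3  # 3x3 filter

# Valid convolution, stride 1: one output unit per filter position.
out_h = image_h - k + 1
out_w = image_w - k + 1
n_outputs = out_h * out_w

shared_weights = k * k                              # one filter reused everywhere
unshared_local_weights = n_outputs * k * k          # separate 3x3 weights per output unit
fully_connected_weights = image_h * image_w * n_outputs  # every pixel to every output unit

print(n_outputs, shared_weights, unshared_local_weights, fully_connected_weights)
# 9 9 81 225
```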
LeNet-5
• Gradient-Based Learning Applied to Document Recognition - Y. LeCun, L. Bottou, Y. Bengio, P. Haffner; 1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction
[LeCun et al., 1998]
LeNet-5
32×32×1 → conv (5×5, s=1) → 28×28×6 → avg pool (f=2, s=2) → 14×14×6
→ conv (5×5, s=1) → 10×10×16 → avg pool (f=2, s=2) → 5×5×16
→ FC (120) → FC (84) → ŷ (10)
Reminder: Output size = (N+2P-F)/stride + 1
This slide is taken from Andrew Ng [LeCun et al., 1998]
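The output-size reminder can be applied layer by layer to reproduce the shapes on this slide. This is a minimal sketch; the helper name `out_size` and the layer list are assumptions written for this example, with no padding (P = 0) throughout.

```python
def out_size(n, f, stride=1, pad=0):
    """Output size = (N + 2P - F) / stride + 1, using integer division."""
    return (n + 2 * pad - f) // stride + 1

# (name, filter size F, stride, output channels), read off the slide above.
layers = [
    ("conv1", 5, 1, 6),
    ("pool1", 2, 2, 6),
    ("conv2", 5, 1, 16),
    ("pool2", 2, 2, 16),
]

size = 32
shapes = [(size, size, 1)]
for name, f, s, channels in layers:
    size = out_size(size, f, s)
    shapes.append((size, size, channels))
    print(name, shapes[-1])
# conv1 (28, 28, 6)
# pool1 (14, 14, 6)
# conv2 (10, 10, 16)
# pool2 (5, 5, 16)
```

The final 5×5×16 volume is flattened to 400 values before the fully connected layers.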
LeNet-5
• Only 60K parameters
• As we go deeper in the network: N_H ↓, N_W ↓, N_C ↑
• General structure:
conv->pool->conv->pool->FC->FC->output
• Different filters look at different channels
• Sigmoid and Tanh nonlinearity
[LeCun et al., 1998]
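The "60K parameters" figure can be roughly reproduced by counting weights and biases for the conv->pool->conv->pool->FC->FC->output structure. The sketch below assumes every filter in the second convolution sees all 6 input channels and that pooling has no parameters; the original paper connects channels more sparsely, so its exact count differs.

```python
def conv_params(f, in_channels, out_channels):
    # Each filter has f*f*in_channels weights plus one bias.
    return (f * f * in_channels + 1) * out_channels

def fc_params(n_in, n_out):
    # Dense layer: one weight per input-output pair plus one bias per output.
    return n_in * n_out + n_out

counts = {
    "conv1": conv_params(5, 1, 6),
    "conv2": conv_params(5, 6, 16),   # assumes full channel connectivity
    "fc1": fc_params(5 * 5 * 16, 120),
    "fc2": fc_params(120, 84),
    "output": fc_params(84, 10),
}
total = sum(counts.values())
print(counts, total)  # total is 61706 under these assumptions
```

Most of the parameters sit in the first fully connected layer, not in the convolutions.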
Backpropagation of convolution
Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
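For a single-channel, stride-1, unpadded layer, the gradient of the loss with respect to the filter is itself a correlation of the input with the upstream gradient. The sketch below checks this numerically; the helper name `correlate_valid`, the random toy data, and the sum-of-outputs loss are assumptions made for the example.

```python
import numpy as np

def correlate_valid(x, w):
    """Valid cross-correlation: out[i, j] = sum(x[i:i+kh, j:j+kw] * w)."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))   # toy input (assumed)
w = rng.normal(size=(3, 3))   # toy filter (assumed)

# Toy loss L = sum of all outputs, so dL/dout is all ones.
dout = np.ones((3, 3))

# Analytic filter gradient: dL/dw[a, b] = sum_ij dout[i, j] * x[i + a, j + b].
grad_w = correlate_valid(x, dout)

# Finite-difference estimate of the same gradient.
eps = 1e-6
numeric = np.zeros_like(w)
for a in range(3):
    for b in range(3):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[a, b] += eps
        w_minus[a, b] -= eps
        diff = correlate_valid(x, w_plus).sum() - correlate_valid(x, w_minus).sum()
        numeric[a, b] = diff / (2 * eps)

gradients_match = np.allclose(grad_w, numeric, atol=1e-5)
print(gradients_match)
```

Because the same weights are used at every position, each weight's gradient accumulates contributions from all output locations.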
3b. Convolutional Neural Networks (CNNs)
An image classification CNN
Representation Learning in Deep CNNs
Conv Layer 1: Low level features (Edges, dark spots)
Conv Layer 2: Mid level features (Eyes, ears, nose)
Conv Layer 3: High level features (Facial structure)
Lee+ ICML 2009
CNNs for Classification
1. Convolution: Apply filters to generate feature maps. (tf.keras.layers.Conv2D)
2. Non-linearity: Often ReLU. (tf.keras.activations.*)
3. Pooling: Downsampling operation on each feature map. (tf.keras.layers.MaxPool2D)
Train model with image data.
Learn weights of filters in convolutional layers.
Example – Six convolutional layers
Convolutional Layers: Local Connectivity
tf.keras.layers.Conv2D
For a neuron in hidden layer:
• Take inputs from patch
• Compute weighted sum
• Apply bias
4x4 filter: matrix of weights w_ij for neuron (p,q) in hidden layer
1) applying a window of weights
2) computing linear combinations
3) activating with non-linear function
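The three steps above can be written as a single expression for neuron (p,q). This is a sketch under assumptions: the symbols x (input pixel values), b (bias), z (pre-activation), and a (output) are introduced here for illustration, the patch is indexed from offset (p,q), and ReLU is assumed as the non-linearity.

```latex
z_{p,q} = \sum_{i=1}^{4} \sum_{j=1}^{4} w_{ij}\, x_{i+p,\, j+q} + b,
\qquad
a_{p,q} = \max\left(0,\; z_{p,q}\right)
```

The weights w_ij and bias b do not depend on (p,q), which is the parameter sharing described earlier.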
CNNs: Spatial Arrangement of Output Volume
Layer Dimensions: h × w × d
where h and w are spatial dimensions (height, width) and d (depth) = number of filters
Stride: Filter step size
Receptive Field: Locations in input image that a node is path connected to
tf.keras.layers.Conv2D( filters=d, kernel_size=(h,w), strides=s )
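The receptive field of a node grows as layers are stacked, and it can be computed from each layer's kernel size and stride. The sketch below uses the standard recurrence (each layer adds (kernel - 1) times the accumulated stride); the function name `receptive_field` and the example stack, which mirrors a conv 5×5 / pool 2×2 / conv 5×5 / pool 2×2 sequence, are assumptions for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output.

    Returns the side length of the input region one output node depends on.
    """
    rf, jump = 1, 1  # jump = spacing in input pixels between adjacent nodes
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

stack = [(5, 1), (2, 2), (5, 1), (2, 2)]
print(receptive_field(stack))  # 16
```

A single 3×3 convolution gives a receptive field of 3, so depth and stride are what let later nodes "see" large parts of the image.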
Introducing Non-Linearity
- Apply after every convolution operation (i.e., after convolutional layers)
- ReLU: pixel-by-pixel operation that replaces all negative values by zero.
- Non-linear operation
Rectified Linear Unit (ReLU)
tf.keras.layers.ReLU
Karn Intuitive CNNs
Pooling
tf.keras.layers.MaxPool2D(
    pool_size=(2,2),
    strides=2
)
1) Reduced dimensionality
2) Spatial invariance
Max Pooling, average pooling
The Rectified Linear Unit (ReLU) is a common non-linear detector stage after convolution
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
x = tf.nn.bias_add(x, b)
x = tf.nn.relu(x)
f(x) = max(0, x)
When will we backpropagate through this?
Once it “dies” what happens to it?
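Both questions can be explored with a small NumPy sketch of ReLU and its derivative. The function names and toy values are assumptions; the derivative at exactly zero is taken as 0 by convention.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 where x > 0, otherwise 0."""
    return (x > 0).astype(float)

pre_activation = np.array([-2.0, -0.5, 0.0, 1.5])  # toy values (assumed)
upstream = np.ones_like(pre_activation)            # gradient arriving from the next layer

activated = relu(pre_activation)
passed_back = upstream * relu_grad(pre_activation)

print(activated.tolist())    # [0.0, 0.0, 0.0, 1.5]
print(passed_back.tolist())  # [0.0, 0.0, 0.0, 1.0]
```

Gradient flows back only through units with positive input. A unit whose input is negative for every training example receives zero gradient, so its incoming weights stop updating.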
Pooling reduces dimensionality by giving up spatial location
• max pooling reports the maximum output
within a defined neighborhood
• Padding can be SAME or VALID
x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')
Left to right: output, input, pooling neighborhood (ksize)
ksize and strides are ordered [batch, height, width, channels]
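Max pooling over non-overlapping neighborhoods can be sketched in NumPy with a reshape. This is an illustration on a single 2D feature map, not the TensorFlow operator; the function name `max_pool2d` and the toy input are assumptions, and the input sides must be divisible by k.

```python
import numpy as np

def max_pool2d(x, k=2):
    """Non-overlapping k x k max pooling (stride k) on a 2D array."""
    h, w = x.shape
    # Split rows and columns into blocks of size k, then take the max of each block.
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [5, 1, 9, 2],
              [0, 2, 3, 4]])
pooled = max_pool2d(x)
print(pooled.tolist())  # [[4, 2], [5, 9]]
```

The 4×4 map shrinks to 2×2, and the output does not record where inside each neighborhood the maximum occurred.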
Dilated Convolution
CNNs for Classification: Feature Learning
1. Learn features in input image through convolution
2. Introduce non-linearity through activation function (real-world data is non-linear!)
3. Reduce dimensionality and preserve spatial invariance with pooling
CNNs for Classification: Class Probabilities
- CONV and POOL layers output high-level features of input
- Fully connected layer uses these features for classifying input image
- Express output as probability of image belonging to a particular class
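Expressing the output as class probabilities is commonly done with a softmax over the final layer's scores. The sketch below is a minimal NumPy version with assumed toy scores for three classes; subtracting the maximum is a standard trick for numerical stability and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Map raw class scores to probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # stability: avoid large exponents
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])  # toy scores for three classes (assumed)
probs = softmax(logits)
predicted_class = int(np.argmax(probs))
print(probs.round(3), predicted_class)
```

The largest score gets the largest probability, and the predicted class is the index of that maximum.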
Putting it all together
import tensorflow as tf

def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # second convolutional layer
        tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # fully connected classifier
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),  # 10 outputs
    ])
    return model
4a. Real-world feature invariance is hard
How can computers recognize objects?
Challenge:
• Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc.
• How can we overcome this challenge?
Answer:
• Learn a ton of features (millions) from the bottom up
• Learn the convolutional filters, rather than pre-computing them
Feature invariance to perturbation is hard
Detect features to classify
Li/Johnson/Yeung CS231n
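Why perturbations are hard for a literal pixel comparison can be seen with a toy example. The sketch below builds a small "X" pattern and shifts it by one pixel; the image, the helper name `perturbations`, and the choice of shifts are assumptions for illustration, and `np.roll` wraps pixels around the border rather than padding.

```python
import numpy as np

def perturbations(image):
    """Return a few shifted variants of a 2D image (an illustrative, non-exhaustive set)."""
    return {
        "shift_right": np.roll(image, shift=1, axis=1),
        "shift_down": np.roll(image, shift=1, axis=0),
    }

# 5x5 "X": both diagonals set to 1.
x_image = np.clip(np.eye(5) + np.fliplr(np.eye(5)), 0, 1)
variants = perturbations(x_image)

for name, variant in variants.items():
    same_pixels = np.array_equal(variant, x_image)
    overlap = float((variant == x_image).mean())
    print(name, same_pixels, round(overlap, 2))
```

Each variant is still an X to a human, but it no longer matches the original pixel for pixel. Training on such perturbed copies is one way to expose a network to each variation.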