Object detection
One-stage detectors
Object detection approaches
Two-stage detectors
Stage 1 −
• Generate multiple candidates (bounding boxes) all over the image. It’s typical to make use of “anchors”.
• Selective search, Region Proposal Network (RPN)
• Throw away the candidates without an object.
Stage 2 −
• Process each candidate independently.
• Assign a category (class) to each candidate and adjust its bounding box.
One-stage detectors
• Pass (process) the image through a neural network.
• On the final feature map(s) of the network:
  • Slide a window.
  • For each window location, predict an object class and a bounding box for each “anchor” (also called “default box” or “prior”), i.e., adjust the anchor.
One-stage anchor-free detectors: same as one-stage detectors, but instead of adjusting an anchor, directly predict the top-left and bottom-right corners and/or the centre of the object.
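As a minimal sketch of what "adjusting an anchor" can mean, the snippet below applies predicted offsets to one anchor box. The centre/size offset parameterisation (tx, ty, tw, th) is the one used by Faster R-CNN and SSD and is an assumption here, since the slides do not fix one; the function name `adjust_anchor` is hypothetical.

```python
import math

def adjust_anchor(anchor, offsets):
    """Shift and rescale one anchor (cx, cy, w, h) by predicted offsets (tx, ty, tw, th)."""
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = offsets
    cx = acx + tx * aw          # centre moves in units of the anchor size
    cy = acy + ty * ah
    w = aw * math.exp(tw)       # size is scaled multiplicatively, so it stays positive
    h = ah * math.exp(th)
    return (cx, cy, w, h)

# Zero offsets leave the anchor unchanged.
print(adjust_anchor((50.0, 50.0, 20.0, 40.0), (0.0, 0.0, 0.0, 0.0)))
```

An anchor-free detector would skip the anchor entirely and regress the corners or centre directly.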
Object detection approaches
Two-stage detectors
• R-CNN (2013 – 2014)
• SPP Net (2014 – 2015)
• Fast R-CNN (2015)
• Faster R-CNN (2015)
• R-FCN (2016)
• Feature Pyramid Network (2017)
One-stage detectors
• YOLO (2015 – 2016) − Latest: YOLOv8 (2023)
• SSD (2016)
• RetinaNet (2017)
• CenterNet (2019)
• EfficientDet (2019 – 2020)
• Swin Transformer (2021)
Object detection techniques: Comparisons
YOLO- You Only Look Once
• The YOLO model was first described by Joseph Redmon, et al. in the 2015
paper titled “You Only Look Once: Unified, Real-Time Object Detection.”
Ross Girshick, developer of R-CNN, was also an author and contributor to
this work.
• The approach involves a single neural network trained end to end that
takes a photograph as input and predicts bounding boxes and class labels
for each bounding box directly.
• The R-CNN models may be generally more accurate, but the YOLO family
of models is much faster, achieving object detection in real time.
You Only Look Once: Unified, Real-Time Object Detection, Joseph Redmon, Santosh Divvala,
Ross Girshick, Ali Farhadi
YOLO- You Only Look Once: Concepts
• Detection as a single regression problem (unified detection)
• No bounding box proposals.
• A single regression problem, straight from image pixels to bounding box
coordinates and class probabilities
• Developed as a single convolutional network
• Reasons globally about the entire image
• Learns generalizable representations
• Easy and fast
Redmon et al. CVPR 2016.
YOLO: Step 1
• Divide the image into a grid of cells.
• E.g. an S×S grid, typically 7×7 or 13×13.
• If the centre of an object falls into a grid cell, that cell is responsible for detecting the object.
• Each cell is responsible for predicting a set of bounding boxes and class probabilities.
• A bounding box consists of the x, y coordinates, the width and height, and the
confidence.
• The class prediction is also made per cell. For example, an image may be divided
into a 7×7 grid and each cell in the grid may predict 2 bounding boxes, resulting in
7 × 7 × 2 = 98 proposed bounding box predictions.
• The class probability map and the bounding boxes with confidences are then
combined into a final set of bounding boxes and class labels.
• Hence, each grid cell predicts:
  • B bounding boxes;
  • B confidence scores, each defined as Pr(Object) × IOU;
  • C conditional class probabilities, Pr(Class_i | Object).
• The confidence target is the IOU between the predicted box and the ground truth box; it is zero when no object falls in the cell.
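The counts can be checked with a short sketch. With S = 7 and B = 2 the grid yields 7 × 7 × 2 = 98 boxes, and with C = 20 classes (the PASCAL VOC setting of the YOLO paper, assumed here) the output has 7 × 7 × 30 values. The helper names `yolo_output_size` and `iou` are illustrative and not taken from any library.

```python
def yolo_output_size(S, B, C):
    """Return (number of predicted boxes, length of the flattened output tensor)."""
    num_boxes = S * S * B                 # B boxes per grid cell
    tensor_len = S * S * (B * 5 + C)      # per box: x, y, w, h, confidence; plus C class probs per cell
    return num_boxes, tensor_len

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(yolo_output_size(7, 2, 20))          # (98, 1470)
# For a cell that contains an object, Pr(Object) = 1, so the confidence target equals the IOU.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))     # intersection 1, union 7
```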
YOLO: Step 2
• Predict bounding boxes and class probabilities for each cell.
• For each cell, the YOLO algorithm predicts a set of bounding boxes and
class probabilities.
• Each bounding box is represented by four values: the x, y coordinates of
the box centre, and the width and height of the box.
• The class probabilities represent the probability that the object in the
box belongs to a particular class.
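A minimal sketch of how one cell's box prediction maps back to image coordinates, assuming the parameterisation of the original YOLO paper: (x, y) is the box centre as an offset within the cell, in [0, 1], and (w, h) are fractions of the whole image. The function name `decode_cell_box` is hypothetical.

```python
def decode_cell_box(row, col, pred, S, img_w, img_h):
    """Convert a cell-relative prediction (x, y, w, h) to corner coordinates in pixels."""
    x, y, w, h = pred
    cx = (col + x) / S * img_w      # centre: cell index plus offset, scaled to the image
    cy = (row + y) / S * img_h
    bw = w * img_w                  # width and height are relative to the full image
    bh = h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)

# Middle cell (row 3, col 3) of a 7x7 grid over a 448x448 image.
print(decode_cell_box(3, 3, (0.5, 0.5, 0.5, 0.25), S=7, img_w=448, img_h=448))
```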
YOLO: Step 3
• Apply non-max suppression.
• The bounding boxes predicted by the YOLO algorithm may overlap.
• To remove overlapping boxes, the YOLO algorithm applies a non-max
suppression algorithm.
• This algorithm keeps the box with the highest confidence score, and it
removes all other boxes that have a high overlap with the selected box.
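The suppression step can be sketched as a greedy loop. This is an illustrative pure-Python version, not the Darknet implementation; the 0.5 IOU threshold is an assumed value, and in practice the procedure is usually run separately for each class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that are kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring box still in play
        keep.append(best)
        # drop every remaining box that overlaps the selected one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))   # [0, 2]
```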
YOLO: Step 4
• Draw the bounding boxes and class labels on the image.
• The final step is to draw the bounding boxes and class labels on the
image.
• The bounding boxes are drawn in a different color for each class, and
the class labels are displayed next to the bounding boxes.
Loss-Function
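For reference, the sum-squared-error loss given in the YOLO paper (Redmon et al., CVPR 2016) has the form below. The indicator 1_{ij}^{obj} is 1 when box predictor j in cell i is responsible for an object, 1_{i}^{obj} is 1 when an object appears in cell i, hatted and unhatted symbols pair each prediction with its target, and the paper sets λ_coord = 5 and λ_noobj = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2
  + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on width and height reduce, but do not remove, the tendency to weight a given error equally in large and small boxes.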
Pros
• Trained on a loss function that directly corresponds to
detection performance.
• The entire model is trained jointly.
• At the time of publication, the fastest general-purpose object detector in the literature.
• Detection at 45 fps or more.
Limitations
• Struggles with small objects.
• Struggles with objects in new or unusual aspect ratios.
• The loss function is an approximation of detection performance.
• The loss function treats errors the same in small and large bounding boxes.
SSD: Single Shot MultiBox Detector
• Don’t generate object proposals!
• Consider a tiny subset of the output space by design; directly
classify this small set of boxes
SSD: Design of Small set of boxes
SSD Network Structure
SSD architecture taken from the original paper
VGG-16 network, by Oxford's Visual Geometry Group (VGG):
● 3×3 conv kernels
● 2×2 pooling with stride = 2
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
SSD: Multi-scale Feature Map
Default Bounding Boxes - scale and shape
Default Bounding Boxes
Why small boxes in large feature maps?
• large feature map - small receptive field - small object
• small feature map - large receptive field - large object
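The link between feature map and box size can be made concrete with the scale rule from the SSD paper, s_k = s_min + (s_max − s_min)(k − 1)/(m − 1), where k = 1 is the largest feature map. The sketch below assumes the paper's values s_min = 0.2 and s_max = 0.9 and its aspect ratios {1, 2, 3, 1/2, 1/3}; the function names are hypothetical, and the paper's extra box for aspect ratio 1 is omitted for brevity.

```python
import math

def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale of the default boxes for each of m feature maps (k = 1..m), relative to the image."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
    """Width and height of each default box at one scale: w = s * sqrt(a), h = s / sqrt(a)."""
    return [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]

scales = default_box_scales(6)
print([round(s, 2) for s in scales])       # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
print(default_box_shapes(scales[0])[:2])   # square box and 2:1 box at the smallest scale
```

The largest feature map gets the smallest scale, so its dense grid of small boxes covers small objects.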
Convolutional predictors for detection
Bounding Box Matching Strategy
Training Objective
• After pairing ground truth and default boxes, we can write the objective
function:
x_ij^p ∈ {1, 0}: indicator for matching the i-th default box to the j-th ground truth box of
category p.
N: number of matched default boxes.
c: class confidences.
l: predicted bounding box.
g: ground truth bounding box.
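Using the symbols defined above, the objective given in the SSD paper is a weighted sum of a confidence loss and a localization loss. In the paper, L_conf is a softmax loss over the class confidences, L_loc is a Smooth L1 loss between the predicted box and the ground truth box, the weight α is set to 1, and the loss is taken as 0 when N = 0.

```latex
L(x, c, l, g) = \frac{1}{N} \left( L_{\text{conf}}(x, c) + \alpha \, L_{\text{loc}}(x, l, g) \right)
```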
SSD Network Structure vs YOLO
• Similar to YOLO, but with a denser grid map and multi-scale grid maps.
• Plus data augmentation, hard negative mining, and other design choices in the network.
Design Improvement over YOLO
Hard Negative Mining
• Instead of using all the negative examples, we sort them by the confidence
loss of each default box and pick the ones with the highest loss.
• Negatives are picked so that the ratio between negative and positive examples is at most 3:1.
• This method leads to faster optimization and more stable training.
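The selection rule can be sketched in a few lines. This is an illustrative version operating on plain Python lists of per-box confidence losses and match flags; the function name and the sample numbers are made up for the example.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return indices of the negatives kept for the loss: the highest-loss ones,
    at most neg_pos_ratio times the number of positives."""
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)   # hardest negatives first
    return negatives[: neg_pos_ratio * num_pos]

losses = [0.1, 2.3, 0.05, 1.7, 0.9, 0.4, 3.0, 0.2]
positive = [False, False, False, False, True, False, False, False]
print(hard_negative_mining(losses, positive))   # [6, 1, 3]
```

With one positive box, only the three negatives with the largest confidence loss contribute to training.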
Data Augmentation
• Makes the model more robust to various input object sizes and shapes. Each training image is randomly sampled by one of:
1. Use the entire original image.
2. Sample a patch so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7 or 0.9.
3. Randomly sample a patch.
Experiments: Effects of various design choices and
components on SSD performance.
Experiments: Effects of using multiple output layers.
Detection Results
Strength and Drawbacks
Strengths
• High speed
• High accuracy
• Simple training (single shot)
Drawbacks
• The classification task for small objects is relatively hard for SSD.
One-stage detection
• What could be the problems?
• The extreme foreground-background class imbalance: we have far more
negative (background) examples than positive ones.
• Even though each easy negative has a small loss value, together their
gradients overwhelm the model.
• Solution: Focal Loss for Dense Object Detection (Lin et al. ICCV 2017)
• For easy examples, we down-weight their loss, so that the gradients from
these examples have a smaller impact on the model.
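A minimal sketch of the down-weighting, using the binary focal loss FL(p_t) = −α_t (1 − p_t)^γ log(p_t) from Lin et al. The defaults γ = 2 and α = 0.25 are the values reported as working best in that paper; the scalar, pure-Python form is a simplification for illustration.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for one prediction p = Pr(y = 1), with label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p                # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha    # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy background example (p = 0.01 for the object class) versus a hard one (p = 0.6).
for p in (0.01, 0.6):
    print(p, cross_entropy(p, 0), focal_loss(p, 0))
```

The easy example's loss is scaled by a factor of (0.01)² times α_t, while the hard example keeps a much larger share of its cross-entropy.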
Resources
• YOLO
  • Original (Darknet)
  • TensorFlow
  • Keras
• SSD (Caffe)