Overview of One-Stage Detectors

Object detection

One-stage detectors
Object detection approaches

Two-stage detectors
Stage 1:
• Generate multiple candidates (bounding boxes) all over the image. It is typical to make use of “anchors”.
• Candidate generators: Selective Search, Region Proposal Network (RPN).
• Throw away the candidates without an object.
Stage 2:
• Process each candidate independently.
• Assign a category (class) to each candidate and adjust its bounding box.

One-stage detectors
• Pass (process) the image through a neural network.
• On the final feature map(s) of the network:
  • Slide a window.
  • For each window location, predict an object class and a bounding box for each “anchor” (also called “default box” or “prior”), i.e., adjust the anchor.

One-stage anchor-free detectors
• Same as one-stage detectors, but instead of adjusting an anchor, directly predict the top-left and bottom-right corners and/or the centre of the object.
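For the anchor-based case, “adjusting the anchor” can be sketched as applying a small set of predicted offsets to a fixed reference box. The parametrization below (centre shifts scaled by the anchor size, log-scale width and height) is the one commonly used in the Faster R-CNN and SSD family; the slides do not specify one, so treat it as an assumption. The function name `adjust_anchor` is hypothetical.

```python
import math

def adjust_anchor(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (cx, cy, w, h).

    Assumed parametrization: the centre moves by a fraction of the anchor
    size, and width/height are rescaled by exp(tw), exp(th).
    """
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    new_cx = cx + tx * w          # shift centre horizontally
    new_cy = cy + ty * h          # shift centre vertically
    new_w = w * math.exp(tw)      # rescale width (always positive)
    new_h = h * math.exp(th)      # rescale height (always positive)
    return (new_cx, new_cy, new_w, new_h)

# Zero offsets leave the anchor unchanged.
print(adjust_anchor((50, 50, 20, 10), (0.0, 0.0, 0.0, 0.0)))
```

With this form the network only has to predict small corrections around each anchor rather than absolute coordinates.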
Object detection approaches
Two-stage detectors
• R-CNN (2013 – 2014)
• SPP Net (2014 – 2015)
• Fast R-CNN (2015)
• Faster R-CNN (2015)
• R-FCN (2016)
• Feature Pyramid Network (2017)
One-stage detectors
• YOLO (2015 – 2016); latest: YOLOv8 (2023)
• SSD (2016)
• RetinaNet (2017)
• CenterNet (2019)
• EfficientDet (2019 – 2020)
• Swin Transformer (2021)
Object detection techniques: Comparisons
YOLO- You Only Look Once

• The YOLO model was first described by Joseph Redmon, et al. in the 2015
paper titled “You Only Look Once: Unified, Real-Time Object Detection.”
Ross Girshick, developer of R-CNN, was also an author and contributor to
this work.
• The approach involves a single neural network trained end to end that
takes a photograph as input and predicts bounding boxes and class labels
for each bounding box directly.
• The R-CNN models may be generally more accurate, yet the YOLO family
of models is much faster than R-CNN, achieving object detection in
real time.

You Only Look Once: Unified, Real-Time Object Detection, Joseph Redmon, Santosh Divvala,
Ross Girshick, Ali Farhadi
YOLO- You Only Look Once: Concepts
• Detection as a single regression problem (unified detection)
  • No bounding box proposals.
  • A single regression problem, straight from image pixels to bounding box coordinates and class probabilities.
• Developed as a single convolutional network
• Reasons globally about the entire image
• Learns generalizable representations

Easy and Fast

Redmon et al. CVPR 2016.
YOLO: Step 1
• Divide the image into a grid of cells.
• For example, an S×S grid, typically 7×7 or 13×13.
• If the center of an object falls into a grid cell, that cell is responsible for detecting the object.
• Each cell is responsible for predicting a set of bounding boxes and class probabilities.
• A bounding box consists of the x, y coordinates, the width and height, and a confidence score.
• Class probabilities are also predicted per cell. For example, an image may be divided
into a 7×7 grid and each cell in the grid may predict 2 bounding boxes, resulting in
7 × 7 × 2 = 98 proposed bounding box predictions.
• The class probability map and the bounding boxes with confidences are then
combined into a final set of bounding boxes and class labels.
• Hence, each grid cell predicts:
  • B bounding boxes;
  • B confidence scores, one per box, defined as Confidence = Pr(Object) × IOU;
  • C conditional class probabilities, P = Pr(Class_i | Object).
• The confidence target is Pr(Object) times the IOU between the predicted box and the ground-truth box, so it is zero when the cell contains no object.
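The bookkeeping in this step can be sketched in a few lines of Python. The function names are hypothetical, and the default of C = 20 classes is an assumption (the PASCAL VOC setting used in the YOLO paper); S = 7 and B = 2 follow the example above.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object centre."""
    col = min(int(cx / img_w * S), S - 1)   # clamp centres on the right edge
    row = min(int(cy / img_h * S), S - 1)   # clamp centres on the bottom edge
    return row, col

def prediction_counts(S=7, B=2, C=20):
    """Return (number of predicted boxes, total number of output values)."""
    num_boxes = S * S * B
    values_per_cell = B * 5 + C             # 5 = x, y, w, h, confidence
    return num_boxes, S * S * values_per_cell

def class_specific_score(pr_class_given_object, box_confidence):
    """Pr(Class_i | Object) * Pr(Object) * IOU for one box and one class."""
    return pr_class_given_object * box_confidence

print(responsible_cell(224, 224, 448, 448))  # centre of a 448x448 image
print(prediction_counts())                   # boxes and output size for 7x7, B=2, C=20
```

Multiplying the per-cell class probability by the per-box confidence is how the class probability map and the boxes are combined into class-specific scores at test time.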
YOLO: Step 2
• Predict bounding boxes and class probabilities for each cell.
• For each cell, the YOLO algorithm predicts a set of bounding boxes and
class probabilities.
• Each bounding box is represented by four values: the (x, y) coordinates
of the box center and the width and height of the box.
• The class probabilities represent the probability that the object in the
box belongs to a particular class.
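Overlap computation and drawing usually need corner coordinates rather than a centre and a size, so a conversion between the two box representations is handy. The helper names below are hypothetical; this is a minimal sketch, not part of any YOLO codebase.

```python
def center_to_corners(cx, cy, w, h):
    """Convert (centre x, centre y, width, height) to (x1, y1, x2, y2)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def corners_to_center(x1, y1, x2, y2):
    """Convert (x1, y1, x2, y2) back to (centre x, centre y, width, height)."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

box = center_to_corners(50, 40, 20, 10)
print(box)                      # top-left and bottom-right corners
print(corners_to_center(*box))  # round trip back to centre form
```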
YOLO: Step 3
• Apply non-max suppression.
• The bounding boxes predicted by the YOLO algorithm may overlap.
• To remove overlapping boxes, the YOLO algorithm applies a non-max
suppression algorithm.
• This algorithm keeps the box with the highest confidence score, and it
removes all other boxes that have a high overlap with the selected box.
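A minimal sketch of greedy non-max suppression, assuming boxes are given as corner tuples (x1, y1, x2, y2) and that “high overlap” means an IOU above a threshold. The threshold of 0.5 is an illustrative assumption, not a value taken from the slides.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # drop every remaining box that overlaps the selected one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```

In practice this is typically run per class, so that overlapping boxes of different classes do not suppress each other.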
YOLO: Step 4
• Draw the bounding boxes and class labels on the image.
• The final step is to draw the bounding boxes and class labels on the
image.
• The bounding boxes are drawn in a different color for each class, and
the class labels are displayed next to the bounding boxes.
Loss-Function
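For reference, the sum-squared-error loss given in the YOLO paper (Redmon et al., CVPR 2016) can be written as below. Here the indicator 1_ij^obj is 1 when box j in cell i is responsible for an object, C_i denotes the box confidence, p_i(c) the conditional class probability, and hats mark ground-truth targets; the paper sets λ_coord = 5 and λ_noobj = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on width and height reduce, but do not remove, the tendency of the loss to weight deviations in large boxes the same as deviations in small boxes.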
Pros
• Trained on a loss function that directly corresponds to
detection performance.
• The entire model is trained jointly.
• The fastest general-purpose object detector in the literature at the time of publication.
• Runs detection at 45 fps or more.
Limitations
• Struggles with small objects.
• Struggles with objects in unusual aspect ratios.
• The loss function is only an approximation of detection performance.
• The loss function treats errors the same in small and large bounding boxes.
SSD: Single Shot MultiBox Detector
• Don’t generate object proposals!
• Consider a tiny subset of the output space by design; directly
classify this small set of boxes

SSD: Design of Small set of boxes
SSD Network Structure

[Figure: SSD architecture, taken from the original paper]

VGG-16 network, by Oxford's Visual Geometry Group (VGG):
• 3×3 conv kernels
• 2×2 pooling with stride 2
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
SSD: Multi-scale Feature Map
Default Bounding Boxes - scale and shape
Default Bounding Boxes
Why small boxes in large feature maps?

• Large feature map → small receptive field → small objects
• Small feature map → large receptive field → large objects
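The scale and shape of the default boxes can be sketched with the rule from the SSD paper: for m feature maps, the scale of map k is s_k = s_min + (s_max − s_min)(k − 1)/(m − 1), and a box with aspect ratio a has width s_k·√a and height s_k/√a (all relative to the image size). The defaults s_min = 0.2, s_max = 0.9 and the aspect ratios are the values reported in the paper; the function names are hypothetical, and the extra square box the paper adds for a = 1 is omitted here.

```python
import math

def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale s_k for each of m feature maps, from the largest map to the smallest."""
    if m == 1:
        return [s_min]
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 3.0, 1 / 2, 1 / 3)):
    """(width, height) of the default boxes at one scale, relative to the image."""
    return [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]

scales = default_box_scales(6)
print([round(s, 2) for s in scales])   # small boxes first, large boxes last
print(default_box_shapes(scales[0]))   # shapes used on the largest feature map
```

Every shape at a given scale has the same area, s_k², so the aspect ratio changes the shape of the box but not its size.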
Convolutional predictors for detection
Bounding Box Matching Strategy
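A minimal sketch of the matching strategy described in the SSD paper: each ground-truth box is matched to the default box with the best Jaccard overlap, and in addition any default box whose overlap with a ground-truth box exceeds a threshold (0.5 in the paper) is matched to it. Boxes are assumed to be corner tuples (x1, y1, x2, y2); the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap (IOU) of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """Return match[i]: index of the ground truth matched to default box i, or -1."""
    match = [-1] * len(default_boxes)
    if not default_boxes or not gt_boxes:
        return match
    # Default boxes with overlap above the threshold take their best ground truth.
    for i, d in enumerate(default_boxes):
        overlaps = [jaccard(d, g) for g in gt_boxes]
        j = max(range(len(gt_boxes)), key=lambda k: overlaps[k])
        if overlaps[j] > threshold:
            match[i] = j
    # Every ground truth is guaranteed its single best default box.
    for j, g in enumerate(gt_boxes):
        best = max(range(len(default_boxes)), key=lambda i: jaccard(default_boxes[i], g))
        match[best] = j
    return match

defaults = [(0, 0, 10, 10), (0, 0, 12, 12), (50, 50, 60, 60), (100, 100, 104, 104)]
truths = [(0, 0, 10, 10), (98, 98, 108, 108)]
print(match_default_boxes(defaults, truths))
```

Matched default boxes are the positives; all unmatched ones (marked -1) are negatives.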
Training Objective
• After pairing ground-truth and default boxes, the objective function is
written in terms of the following quantities:
x_ij^p ∈ {0, 1}: indicator for matching the i-th default box to the j-th ground-truth box of
category p.
N: number of matched default boxes.
c: class confidences.
l: predicted bounding box.
g: ground-truth bounding box.
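With these symbols, the overall objective in the SSD paper is a weighted sum of a confidence loss and a localization loss, normalized by the number of matched boxes. In the paper, L_conf is a softmax loss over class confidences, L_loc is a Smooth L1 loss between predicted and ground-truth box parameters, α is a weighting term, and the loss is set to 0 when N = 0.

```latex
L(x, c, l, g) = \frac{1}{N} \left( L_{\text{conf}}(x, c) + \alpha \, L_{\text{loc}}(x, l, g) \right)
```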
SSD Network Structure vs YOLO

Similar to YOLO, but with a denser grid map and multi-scale grid maps, plus data augmentation, hard negative mining, and
other design choices in the network.
Design Improvement over YOLO
Hard Negative Mining
• Instead of using all the negative examples, sort them by the confidence
loss of each default box, highest first, and pick the top ones.
• The ratio between negative examples and positive examples is kept at most 3:1.
• This leads to faster optimization and more stable training.
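The selection rule above can be sketched as follows, assuming a per-box confidence loss and a positive/negative flag are already available for every default box. The function name and argument layout are hypothetical.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return (positive indices, selected hard-negative indices).

    Negatives are sorted by confidence loss, highest first, and at most
    neg_pos_ratio negatives are kept per positive.
    """
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    num_neg = min(len(negatives), neg_pos_ratio * len(positives))
    hardest = sorted(negatives, key=lambda i: conf_losses[i], reverse=True)[:num_neg]
    return positives, hardest

losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.3, 0.9]
flags = [False, True, False, False, False, False, False]
print(hard_negative_mining(losses, flags))  # one positive, three hardest negatives
```

Only the returned boxes contribute to the confidence loss, so the many easy background boxes do not dominate training.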
Data Augmentation
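• Makes the model more robust to various input object sizes and shapes. Each training image is sampled by one of the following options:
1. Use the original image.
2. Sample a patch so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
3. Randomly sample a patch.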
Experiments: Effects of various design choices and
components on SSD performance.
Experiments: Effects of using multiple output layers.
Detection Results
Strengths and Drawbacks
Strengths
• High speed
• High accuracy
• Simple training (single shot)
Drawbacks
• The classification task for small objects is relatively hard for SSD.
One-stage detection
• What could be the problems?
• The extreme foreground-background class imbalance: there are far
more negative (background) examples than positive ones.
• Even though each easy negative has a small loss value, together their gradients overwhelm the
model.
• Solution: Focal Loss for Dense Object Detection (Lin et al. ICCV 2017)
• For easy examples, down-weight the loss so that the gradients from
these examples have a smaller impact on the model.
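The down-weighting can be sketched for a single binary prediction using the focal loss form from Lin et al., FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where p_t is the probability assigned to the true class. The defaults γ = 2 and α = 0.25 are the settings the paper reports as working best; the function names are hypothetical.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Easy negative: background box confidently predicted as background.
easy = focal_loss(0.01, 0) / cross_entropy(0.01, 0)
# Hard positive: object box that the model mostly misses.
hard = focal_loss(0.1, 1) / cross_entropy(0.1, 1)
print(easy, hard)  # the easy example keeps a far smaller share of its loss
```

With γ = 0 the modulating factor disappears and the loss reduces to α-weighted cross-entropy.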

Resources
• YOLO
  • Original (Darknet)
  • TensorFlow
  • Keras
• SSD (Caffe)
