You Only Look Once
path to design a detector
Feng Wang
AIRD, Coretronic Co.
Apr 17, 2019
The slides and a list of references can be found at
[Link]
Outline
Concepts in object detection
A brief history of object detection
YOLO
design
loss function
training
weaknesses
Classification vs detection/recognition
Common tasks on images
[Link]
Bounding box proposal
Region of interest, region proposal, box proposal
Ground truth
Proposed bounding box
5 parameters
x, y
w, h
confidence score: how likely it contains an object & accuracy of the box
How good: Intersection over Union (IOU)
IOU = Overlap Area / Union Area
Examples
0: the boxes do not overlap
1: the boxes coincide exactly
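The IOU definition can be sketched in a few lines of Python. This is a minimal illustration, assuming boxes are given as corner coordinates (x_min, y_min, x_max, y_max); the function name `iou` and this box format are choices made for the example, not something fixed by the slides.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlap rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at 0 when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two identical boxes give 1, two disjoint boxes give 0, and two 2x2 boxes shifted by one unit along x give 2 / 6 = 1/3.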
A brief history of object detection
[Link]
A brief history of object detection
Before CNNs, people used handcrafted features to locate and classify objects. (not too bad)
CNNs boosted the accuracy of classification
ImageNet
A brief history of object detection
Region proposal -> classification
e.g. RCNN
accurate, slow
Single shot: region proposal + classification
e.g. YOLO, SSD
fast, less accurate
YOLO: you only look once
Look once
Results
x, y, w, h
confidence score: contains an object & box accuracy
class score: belongs to a class
Let's use CNN, since it's good.
Why not regress? They are just numbers.
Let's go to CNN
YOLO v1's CNN: GoogLeNet variant, 24 convolutional layers
YOLO v2's CNN: darknet-19, 19 convolutional layers
YOLO v3's CNN: darknet-53, 53 convolutional layers
Let's do regression
-- wait, wait, how many bounding boxes? Where are they initially?
Maybe set N as a large number?
Maybe initially put them randomly?
Better solution: using grids
Results for one box
x, y, w, h
confidence score: contains an object & box accuracy
class score: belongs to a class
Note: N is large, but much smaller than the number of R-CNN's region proposals.
Let's do regression with non-maximal suppression
We can use CNN to extract features, and finally perform a regression to detect objects.
YOLO v1: fully connected layers
v2 & v3: convolutional layers

Grid cell | Proposed box 1 | Proposed box 2 | Class scores
1 | x, y, w, h, confidence score | x, y, w, h, confidence score | class 1, class 2, ..., class 20
... | ... | ... | ...
SxS | x, y, w, h, confidence score | x, y, w, h, confidence score | class 1, class 2, ..., class 20

arXiv: 1506.02640, 1612.08242, 1804.02767
vector size: SxSx(5x2+20)
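As a worked example of the vector size, the sketch below plugs in S = 7 (the grid size used in the YOLO v1 paper), 2 boxes per cell and 20 classes. The flat layout used by the hypothetical helper `split_cell` (box 1, box 2, then class scores) follows the table on this slide and is an assumption; a real implementation may order the values differently.

```python
S, B, C = 7, 2, 20                 # grid size, boxes per cell, number of classes
values_per_cell = B * 5 + C        # 5 numbers per box: x, y, w, h, confidence
output_size = S * S * values_per_cell


def split_cell(cell_vector, num_boxes=B, num_classes=C):
    """Split one cell's flat prediction into per-box tuples and class scores.

    Assumed layout: [box 1 (5 values), box 2 (5 values), class scores].
    """
    boxes = [tuple(cell_vector[5 * b: 5 * b + 5]) for b in range(num_boxes)]
    start = 5 * num_boxes
    class_scores = list(cell_vector[start: start + num_classes])
    return boxes, class_scores
```

With these numbers each cell carries 30 values and the whole output has 7 x 7 x 30 = 1470 values.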
Loss function
Problems
One object is partially/fully covered by several boxes.
Most boxes have no objects.
Multi-task training problem: location & class
Small objects need more accurate location & box size.
Solution
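For reference, the loss as written in the YOLO v1 paper (arXiv:1506.02640) is reproduced below. The indicator $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when box $j$ of cell $i$ is responsible for an object, $\mathbb{1}_{ij}^{\text{noobj}}$ is 1 otherwise, and $\mathbb{1}_{i}^{\text{obj}}$ is 1 when an object falls in cell $i$. Each squared term compares a predicted value with its target. The paper sets $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines are the localization terms, the next two are the confidence terms for boxes with and without objects, and the last line is the classification term.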
Oh, no math please. Let's speak human language
Problem 1: One object is partially/fully covered by several boxes.
Each true object has one proposed box “responsible” for it.
Rule: the one with the highest IOU with the ground-truth box.
At inference, we use non-maximal suppression to select the best among the proposals.
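Non-maximal suppression at inference can be sketched as the usual greedy procedure: sort proposals by score, keep the best one, and drop any later proposal that overlaps a kept one too much. This is a minimal sketch, not the exact code of any YOLO release. The corner box format, the function names and the threshold of 0.5 are assumptions made for the example. In practice the procedure is usually run per class.

```python
def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that are kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better-scored kept box
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

For example, with boxes `[(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]` and scores `[0.9, 0.8, 0.7]`, the second box overlaps the first with an IOU of about 0.68 and is suppressed, so indices 0 and 2 are kept.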
Human language
Problem 2: Most boxes have no objects.
Weight 0.5: the confidence loss of boxes without objects is down-weighted.
Human language
Problem 3: Multi-task training problem: location & class.
Weighted sum: here the problem is left untouched.
Human language
Problem 4: Small objects need more accurate location & box size.
sqrt: the loss compares the square roots of w and h, so the same error costs more on a small box.
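A small numeric check shows why the square root helps small boxes. The widths below are made-up values, expressed as fractions of the image width, and `size_error` is a hypothetical helper written for this example.

```python
import math


def size_error(w_true, w_pred, use_sqrt=True):
    """Squared error on a box width, optionally taken on square roots."""
    if use_sqrt:
        return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2
    return (w_true - w_pred) ** 2


# The same absolute error of 0.05 on a small box and on a large box
small = size_error(0.05, 0.10)
large = size_error(0.80, 0.85)
```

Without the square root both errors are equal. With it, the error on the small box is roughly ten times larger than the error on the large box, so the network is pushed harder to get small boxes right.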
Other problems
x, y can be out of the grid cell
smaller objects can be localized worse than larger ones
probability can be out of [0, 1]
Fixed in YOLO v2
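The YOLO v2 paper (arXiv:1612.08242) handles the first and last items by passing the raw outputs through a sigmoid, and predicts the box size as a scaling of a pre-defined box. The sketch below follows the formulas in that paper. The function and argument names are chosen for this example, and coordinates are assumed to be in units of grid cells.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, t_o, cell_x, cell_y, prior_w, prior_h):
    """Turn raw network outputs into a box, following the YOLO v2 parameterization."""
    b_x = cell_x + sigmoid(t_x)      # centre stays inside the cell (cell_x, cell_y)
    b_y = cell_y + sigmoid(t_y)
    b_w = prior_w * math.exp(t_w)    # size is a positive scaling of the pre-defined box
    b_h = prior_h * math.exp(t_h)
    confidence = sigmoid(t_o)        # always within [0, 1]
    return b_x, b_y, b_w, b_h, confidence
```

With all raw outputs at zero, the box sits at the centre of its cell, has exactly the pre-defined size, and has confidence 0.5.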
Pre-defined box size
Pre-defined box: anchor
Naturally, objects have characteristic aspect ratios and sizes.
This can be a good starting point.
We don't need to initialize box shapes randomly.
Handcrafted box sizes vs clustering algorithms
Boxes can still change shape during training.
The number of pre-defined boxes is
a hyperparameter
v2 uses 5
v3 uses 9
Anchor-free detection is a research topic, see [Link] for an example.
[Figure: anchors used in YOLO v2]
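The clustering route can be sketched as k-means over the (w, h) pairs of the training boxes, using 1 - IOU as the distance, which is the choice described in the YOLO v2 paper. The code below is a minimal pure-Python sketch. The function names, the fixed iteration cap and the random initialization are assumptions made for the example.

```python
import random


def wh_iou(a, b):
    """IOU of two boxes given as (w, h), assuming both share the same centre."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(sizes, k, iterations=50, seed=0):
    """Cluster (w, h) pairs into k anchors; the closest anchor has the highest IOU."""
    rng = random.Random(seed)
    anchors = rng.sample(sizes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for wh in sizes:
            best = max(range(k), key=lambda j: wh_iou(wh, anchors[j]))
            clusters[best].append(wh)
        new_anchors = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[j])   # keep the anchor of an empty cluster
        if new_anchors == anchors:
            break                                # assignments no longer change
        anchors = new_anchors
    return sorted(anchors)
```

Given three roughly square boxes of size about 1 x 1 and three wide boxes of size about 4 x 2, asking for k = 2 returns one anchor near (1, 1) and one near (4, 2).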
Improvements (in v2)
multi-scale training
Resize the input image randomly during training: {320, 352, ..., 608}
The CNN only reduces an image by a constant factor (here 32), hence is robust to input image size
A new size is chosen every 10 batches.
Passthrough layer
Reshapes the feature map with no loss of information
Odd number of grid cells
[Figure: feature map comparison]
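The multi-scale schedule can be sketched as follows. The set of sizes and the downsampling factor of 32 come from the slide. The helper names, and the choice to switch every `period` training steps, are assumptions made for the example.

```python
import random

STRIDE = 32                                  # total downsampling factor of the CNN
SIZES = list(range(320, 608 + 1, STRIDE))    # 320, 352, ..., 608


def input_size_for_step(step_index, current_size, rng, period=10):
    """Pick a new square input size every `period` steps, else keep the current one."""
    if step_index % period == 0:
        return rng.choice(SIZES)
    return current_size


def grid_size(input_size, stride=STRIDE):
    """Number of grid cells per side for a given input size."""
    return input_size // stride
```

Because every size is a multiple of 32, the grid always has a whole number of cells: 10 x 10 for 320, 13 x 13 for 416, and 19 x 19 for 608.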
Training
YOLO
Step 1, on ImageNet (classification dataset):
train classification backbone
Step 2, on COCO/PASCAL VOC (detection dataset), transfer learning:
remove head layers
add regression as new head
fine-tune backbone & train head
Training tricks
decaying learning rate
batch normalization
data augmentation
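As an illustration of a decaying learning rate, the sketch below uses the step values reported in the YOLO v1 paper: 1e-2 for 75 epochs, 1e-3 for 30 epochs, then 1e-4 for 30 epochs. The short warm-up that the paper applies in the first epochs is left out, and the function name is chosen for this example.

```python
def learning_rate(epoch):
    """Step-decay schedule: 1e-2 for 75 epochs, 1e-3 for the next 30, then 1e-4."""
    if epoch < 75:
        return 1e-2
    if epoch < 105:
        return 1e-3
    return 1e-4
```

Calling `learning_rate(epoch)` at the start of each epoch and passing the result to the optimizer is enough to reproduce this kind of schedule.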
Performance
Generalizability
Picasso & People-Art datasets
But ... no free lunch
YOLO is not as accurate as RCNN-series models
multi-task problem: YOLO wins with fewer background errors, but loses on localization error.
YOLO is poor at detecting small objects
CNN: training on ImageNet may not generalize well to small objects (classification)
loss function equalizes location weights for small & large objects (localization)
YOLO is not good at crowded objects
cause: non-maximal suppression can remove true detections that overlap. See an improvement: Adaptive NMS (arXiv:1904.03629)
YOLO is bad at unusual aspect ratios
cause: pre-defined anchors, or anchors learned from data. Alternative: go anchor-free (arXiv:1904.01355).
Security
CNN (classification) can be fooled, and so can YOLO; the issues can be even worse.
Non-maximal suppression can be fooled:
Daedalus: Breaking Non-Maximum Suppression in Object Detection via Adversarial Examples. arXiv:1902.02067
Is there anything helpful to improve?
Darwin's evolution
arXiv: 1807.05511