Advanced Deep Learning and
Computer Vision
Object Detection
Business Scenario
Ana is employed by a large retail company. She aims to
simplify the purchasing process by eliminating checkout
hassles. To achieve this, she intends to implement sensors
and cameras equipped with computer vision. This
technology allows shoppers to take items from shelves
while a computer monitors their selections. Shoppers can
then pay at a contactless machine or, in stores with fully
automated checkout systems, simply exit without any
further action.
Approach:
To make the process smooth, Ana will need to thoroughly
study the algorithm that detects objects.
Learning Objectives
By the end of this lesson, you will be able to:
Understand the difference between image classification, object
detection, and segmentation
Grasp the general framework of object detection
Comprehend the progression within the R-CNN family
Learn the architecture and working of YOLO
Image Classification and Object Detection
Discussion: Image Classification and Object Detection
• What is the goal of image classification?
• What is the main difference between image classification and object
detection?
Object Detection
Object detection is a computer vision technique that identifies and locates objects within images
or videos.
[Figure: detected butterflies labeled with confidence scores of 41%, 65%, 71%, and 73%]
Image Classification vs. Object Detection
Image classification predicts the class of objects in an image; in this case, the class is ‘Cat’.
Object detection identifies and locates instances of objects in images.
Image classification: Cat
Object detection: Duck, Cat, Dog, Duck, Duck
Object Detection
It requires localization as well as classification.
Object localization: Predict the coordinates of the bounding boxes that fit the detected objects.
Object classification: Predict the class of the objects in the image.
Image Segmentation
Image segmentation is the process of partitioning an image into multiple image segments,
known as image regions.
Source: [Link]
Discussion: Image Classification and Object Detection
• What is the goal of image classification?
Answer: The goal is to assign a label or category to an input image.
• What is the main difference between image classification and object
detection?
Answer: Image classification assigns a single label to an image, while
object detection locates and identifies multiple objects within an image.
General Framework
Object Detection Framework
The object detection framework has the following components:
01 Creating ground truth data with labels of bounding boxes and classes of objects
02 Developing mechanisms to identify regions containing objects
03 Creating a target class variable using the IoU (Intersection over Union) metric
04 Creating a target bounding box offset variable for correcting the location of the region proposal
05 Building a model to predict object class and corresponding bounding box offset
06 Evaluating using mean Average Precision (mAP)
Region Proposal
The regions that the system detects with a high probability of containing an object are called
Regions of Interest (RoIs). This probability is measured using the objectness score.
Low objectness score
High objectness score
Region of Interest proposed by the system
Regions with high objectness scores are pushed forward in the network, and those
with low scores are not processed further.
Feature Extraction and Network Predictions
The network analyzes all the regions that have been identified as having a high probability of
containing an object and makes two predictions for each region.
1 Bounding-box prediction: Predicts the bounding box coordinates (x, y, w, h).
2 Class prediction: Softmax function predicts the class probability for each object.
Note
In this step, generally, a pre-trained network is selected that is used for
feature extraction.
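The class prediction step can be illustrated with a small softmax sketch. This is a minimal example in plain Python; the function name `softmax` and the example scores are assumptions for illustration, not part of the lesson material.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps math.exp from overflowing on large scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes of one region.
region_scores = [2.0, 0.5, -1.0]
probs = softmax(region_scores)
predicted_class = probs.index(max(probs))  # index of the most likely class
```

The region would be reported with its four box coordinates (x, y, w, h) alongside the class that has the highest probability.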
Non-Maximum Suppression
The network draws multiple boxes for the same object.
Prediction before NMS After applying NMS
NMS ensures that the object detection algorithm detects each object only once.
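A minimal sketch of greedy non-maximum suppression in plain Python follows. It assumes boxes are given as (x1, y1, x2, y2) corners and that an IoU threshold of 0.5 is used; the function names and the threshold value are illustrative choices, not values taken from the slides.

```python
def _iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(_iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)  # the second box overlaps the first and is dropped
```

In practice NMS is usually applied per class, so that overlapping objects of different classes are not suppressed against each other.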
Object Detection Evaluation Metrics
For evaluating the performance of object detection, two primary metrics are used:
Frames per second (FPS): It is used to measure detection speed.
Mean average precision (mAP): It is used to measure network precision.
Let's delve deeper into mAP.
Calculating Mean Average Precision
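Step 1: Find the intersection over union (IoU) score for each bounding box.
$$\mathrm{IoU} = \frac{\mathrm{area}(B_g \cap B_p)}{\mathrm{area}(B_g \cup B_p)}$$
Here B_g is the ground truth bounding box and B_p is the predicted bounding box.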
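The IoU score can be computed with a few lines of Python. This sketch assumes boxes are given as (x1, y1, x2, y2) corner coordinates, which is an illustrative convention rather than one fixed by the slides.

```python
def iou(box_g, box_p):
    """IoU of a ground truth box and a predicted box, both (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_g[0], box_p[0])
    iy1 = max(box_g[1], box_p[1])
    ix2 = min(box_g[2], box_p[2])
    iy2 = min(box_g[3], box_p[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    union = area_g + area_p - inter
    return inter / union if union > 0 else 0.0

# Two 2 x 2 boxes sharing a 1 x 1 corner: intersection 1, union 7.
score = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Identical boxes give an IoU of 1, and boxes that do not overlap give 0.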
Calculating Mean Average Precision
Step 2: Find true positives and false positives.
Intersection over union (IoU) is calculated to determine whether the detection is
valid (true positive) or not (false positive).
True positive: if IoU > 0.5
False positive: if IoU < 0.5
A threshold of 0.5 is set here.
Calculating Mean Average Precision
Step 3: Calculate the average precision using the precision-recall curve.
Average Precision (AP) can be found by calculating the area under the curve (AUC).
The mAP for object detection is
the average of the APs calculated
for all the classes.
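The three steps can be sketched in Python as follows. This is a minimal example under stated assumptions: each detection has already been marked as a true or false positive using the IoU threshold, and the area under the curve is computed with all-point interpolation (precision made non-increasing before summing). Benchmarks differ in the exact interpolation they use, and the function names here are hypothetical.

```python
def average_precision(detections, num_gt):
    """detections: list of (confidence, is_true_positive); num_gt: ground truth count."""
    if num_gt == 0 or not detections:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolate: precision at each point becomes the best precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(ap_per_class):
    """Average the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

ap_cat = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
m_ap = mean_average_precision({"cat": ap_cat, "dog": 1.0})
```

For the example detections, the curve passes through recall 0.5 at precision 1 and recall 1.0 at precision 2/3, which gives an AP of 5/6.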
Region-Based Convolutional Neural Networks (R-CNN Family)
Discussion: R-CNN Family
• What is the role of convolutional neural networks (CNNs) in R-CNN?
• What are some popular variants of R-CNN?
R-CNN
In 2014, R-CNN made a big impact by using convolutional neural networks for object detection
and localization.
It became one of the first successful applications in this field and inspired the creation of
advanced detection algorithms.
Paper: Rich feature hierarchies for accurate object detection and semantic segmentation, Tech report (v5)
R-CNN Architecture
An illustration of the R-CNN architecture is shown below:
Each proposed RoI is passed through the CNN to extract features, followed by a
bounding-box regressor and an SVM classifier to produce the network output prediction.
R-CNN: Selective Search
Selective search is a region proposal technique that starts from many small segments, shown
in distinct colors, and repeatedly merges the most similar neighboring regions.
• In the provided image, small regions appear at the start of object selection.
• The regions grow as similar neighboring regions are merged.
Paper – ‘Selective Search for Object Recognition’
Training R-CNN Model
To train R-CNN, follow these steps:
01 Train the feature extractor CNN.
02 Train the Support Vector Machine classifier.
03 Train the bounding-box regressors.
Drawbacks of R-CNN
R-CNN has the following drawbacks:
R-CNN is not suitable for quick, real-time applications such as self-
driving cars because it is slow and requires significant computation.
The training process is very complex and not seamless.
The accuracy of detection is lower compared to its successors.
Fast R-CNN
In Fast R-CNN, the last max pooling layer of the CNN is replaced with an RoI pooling layer, which
extracts a fixed-size feature for every region from a single feature map shared by all regions.
The architecture is trained with a multi-task loss, which typically includes both
classification and bounding box regression losses.
Paper: Fast R-CNN, Ross Girshick
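As a sketch of what such a multi-task loss looks like, the form given in the Fast R-CNN paper combines a log loss over classes with a smooth L1 loss over box offsets. The symbols p, u, t^u, v, and λ come from that paper rather than from this lesson.

```latex
L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\text{loc}}(t^u, v)

L_{\text{cls}}(p, u) = -\log p_u

L_{\text{loc}}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \operatorname{smooth}_{L_1}(t^u_i - v_i),
\qquad
\operatorname{smooth}_{L_1}(z) =
\begin{cases}
0.5\, z^2 & \text{if } |z| < 1 \\
|z| - 0.5 & \text{otherwise}
\end{cases}
```

Here p is the predicted class distribution, u is the true class (with 0 reserved for background), t^u are the predicted box offsets for class u, and v are the target offsets. The indicator [u ≥ 1] switches off the box regression term for background regions.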
Fast R-CNN
A few important points about Fast R-CNN:
• Fast R-CNN performs a convolution operation once per image to produce a
feature map, making it quicker than R-CNN.
• Training is also faster because all the components, like the feature extractor,
object classifier, and bounding-box regressor, are in one CNN network.
• However, a big bottleneck remains: region proposals are still generated separately
by the selective search algorithm, which is very slow.
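The idea of pooling every region from one shared feature map can be sketched as follows. This is a simplified single-channel RoI max pooling in plain Python; the function name, the integer RoI coordinates, and the 2 × 2 output size are assumptions for illustration, not the exact implementation used in Fast R-CNN.

```python
def roi_max_pool(feature_map, roi, output_size=2):
    """Pool one RoI of a 2D feature map into an output_size x output_size grid.

    feature_map: list of rows; roi: (x1, y1, x2, y2) in feature-map cells,
    with x2 and y2 exclusive and the RoI at least one cell wide and tall.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(output_size):
        # Row range of bin i: floor for the start, ceiling for the end.
        r0 = y1 + (i * h) // output_size
        r1 = y1 + -(-((i + 1) * h) // output_size)
        row = []
        for j in range(output_size):
            c0 = x1 + (j * w) // output_size
            c1 = x1 + -(-((j + 1) * w) // output_size)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# One shared 4 x 4 feature map, computed once per image.
fmap = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
full = roi_max_pool(fmap, (0, 0, 4, 4))   # 4 x 4 region
small = roi_max_pool(fmap, (1, 1, 4, 4))  # 3 x 3 region
```

Both regions yield a 2 × 2 output even though their sizes differ, which is what lets a fixed-size classifier head follow the pooling layer.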
Faster R-CNN – Region Proposal Network
Faster R-CNN is the third iteration of the R-CNN family, introduced in 2015.
In Faster R-CNN, a Region Proposal Network (RPN) is used to generate region proposals
directly from the feature maps produced by the deep convolutional network.
Paper: Faster RCNN
Region Proposal Network
An RPN takes an image (of any size) as input and outputs a set of rectangular object
proposals, each with an objectness score.
Example: Object detection using RPN proposals on the PASCAL VOC 2007 test set
Paper: Faster RCNN
Predicting the Bounding Box by the Regressor
Predicting the absolute coordinates of a bounding box directly is challenging, so the regressor
predicts offsets relative to reference boxes.
• Anchor boxes, also known as reference
boxes, are predefined boxes placed across the image.
• The regression layer predicts offsets,
known as deltas (Δx, Δy, Δw, Δh), from
these anchor boxes to fine-tune their
position and size.
• This improves the fit around the objects
to get final bounding box proposals.
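How the deltas adjust an anchor can be sketched with the parameterization used in the Faster R-CNN paper, where the center shifts are scaled by the anchor size and the size changes are applied in log space. The function name and the (center x, center y, width, height) box layout are assumptions for illustration.

```python
import math

def apply_deltas(anchor, deltas):
    """Shift and resize an anchor (xa, ya, wa, ha) by deltas (dx, dy, dw, dh)."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = deltas
    x = xa + dx * wa        # center shift is relative to the anchor width
    y = ya + dy * ha        # center shift is relative to the anchor height
    w = wa * math.exp(dw)   # size change is predicted in log space
    h = ha * math.exp(dh)
    return (x, y, w, h)

anchor = (100.0, 100.0, 50.0, 40.0)
unchanged = apply_deltas(anchor, (0.0, 0.0, 0.0, 0.0))
moved = apply_deltas(anchor, (0.1, -0.5, 0.0, math.log(2.0)))
```

Zero deltas return the anchor unchanged. In the second call the box moves right by 10% of its width, moves up by half its height, and doubles in height.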
Predicting the Bounding Box by the Regressor
The RPN uses a sliding window approach to generate a set of region proposals
based on anchor boxes.
Manning Publication: Deep Learning for Vision Systems
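The sliding window over anchor boxes can be sketched as follows. The defaults (stride 16, three scales, and three aspect ratios, giving k = 9 anchors per location) follow the settings reported in the Faster R-CNN paper; the function name and the (center x, center y, width, height) layout are assumptions for illustration.

```python
def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors for every feature-map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell in input-image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # r is height / width; the area stays close to s * s.
                    w = s / r ** 0.5
                    h = s * r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors

anchors = generate_anchors(feat_h=2, feat_w=3)
```

A 2 × 3 feature map with 9 anchors per location produces 54 anchors, each of which the RPN scores and refines.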
Architecture: RPN
The RPN consists of a 3 × 3 convolutional layer followed by two sibling 1 × 1 convolutional layers.
3 × 3 CONV (pad 1, 512 output channels)
Regression layer: 1 × 1 CONV (4k output channels)
Classification layer: 1 × 1 CONV (2k output channels)
k is the number of anchors per sliding-window location.
Architecture: RPN
The architecture of the RPN is shown below:
Paper: Faster RCNN
Multi-Stage vs. Single-Stage Detector
Two primary categories of object detection architectures are:
Multi-stage detector
• In multi-stage detectors such as the R-CNN family, detection happens in two stages:
region proposal, followed by classification and bounding-box refinement.
• These models are relatively slow.
Single-stage detector
• One-stage detectors skip the region proposal stage and run detection directly over a
dense sampling of possible locations.
• This approach is faster and simpler, but it might trade off some accuracy.
Faster R-CNN on Custom Data
Objective: Train Faster R-CNN on custom data
Train a Faster R-CNN model to detect the bounding boxes around objects present in images.
Download the data from [Link] and the required dataset from the dataset folder.
Note: Please download the solution document from the reference material section and follow
the Jupyter Notebook for step-by-step execution.
Discussion: R-CNN Family
• What is the role of convolutional neural networks (CNNs) in R-CNN?
Answer: Convolutional neural networks (CNNs) in R-CNN play a crucial role
in extracting features from the proposed regions. These extracted features
are vital for object classification and enable precise detection within the R-
CNN framework.
• What are some popular variants of R-CNN?
Answer: Fast R-CNN, Faster R-CNN, and Mask R-CNN are well-known
variants of R-CNN. They provide enhancements in terms of speed, efficiency,
and advanced capabilities such as instance segmentation.
YOLO - Real-Time Object Detection
Discussion: YOLO
• What is YOLO in real-time object detection?
• How does YOLO achieve real-time object detection?
• What are some advantages of YOLO for object detection?
YOLO: Unified, Real-Time Object Detection
You Only Look Once (YOLO) is a single-stage detector that predicts multiple object
classes and bounding boxes in a single network pass.
[Figure: bar chart of frames per second (FPS, axis from 0 to 100) for Faster R-CNN (low), Faster R-CNN (high), R-FCN (low), SSD (low), SSD (high), YOLO (low), and YOLO (high)]
Paper: A review: Comparison of performance metrics of pre-trained models for
object detection using the TensorFlow framework
The figure above compares the number of frames processed per second (FPS) using
different models on images of varying resolutions.
Generalizability
YOLO generalizes to new domains: it runs on sample artwork as well as natural images from the internet.
[Figures: The Picasso Dataset; Picasso Dataset precision-recall curves]
Paper: You Only Look Once: Unified, Real-Time Object Detection
Its speed and accuracy have made it an industry favorite for applications ranging from
surveillance to autonomous driving.
The YOLO Detection System
Processing images with YOLO is simple and straightforward.
1. Resize: Resizes the input image to 448 × 448.
2. Run CNN: Runs a single convolutional network on the image.
3. Non-maximum suppression: Thresholds the resulting detections by the model’s confidence.
Paper: You Only Look Once: Unified, Real-Time Object Detection
The Model
The system interprets detection as a regression problem.
• It divides the image into an S × S grid, and each grid cell predicts B bounding
boxes, confidence scores for those boxes, and C class probabilities.
• These predictions are encoded as an S × S ×
(B ∗ 5 + C) tensor.
Paper: You Only Look Once: Unified, Real-Time Object Detection
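The size of the prediction tensor can be checked with a short sketch. The values S = 7, B = 2, and C = 20 are the ones the YOLO paper uses for PASCAL VOC. The function names and the ordering inside a cell vector (boxes first, then class probabilities) are assumptions for illustration.

```python
def yolo_output_shape(S, B, C):
    """Each of the S x S cells holds B boxes of 5 values plus C class probabilities."""
    return (S, S, B * 5 + C)

def split_cell(cell, B, C):
    """Split one cell vector into B boxes (x, y, w, h, confidence) and C class scores.

    The boxes-first layout is an illustrative assumption.
    """
    boxes = [cell[i * 5:(i + 1) * 5] for i in range(B)]
    class_probs = cell[B * 5:B * 5 + C]
    return boxes, class_probs

shape = yolo_output_shape(S=7, B=2, C=20)   # (7, 7, 30)
boxes, class_probs = split_cell(list(range(30)), B=2, C=20)
```

With these settings every cell carries 30 numbers, so the full prediction is a 7 × 7 × 30 tensor.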
Architecture
The detection network has 24 convolutional layers, followed by two fully connected layers.
Paper: You Only Look Once: Unified, Real-Time Object Detection
Alternating 1 × 1 convolutional layers reduce the feature space from preceding layers.
YOLOv3
YOLOv3 workflow when applying a 13 × 13 grid to the input image:
Each grid cell produces "B" bounding boxes, with each box having an associated
objectness score and class predictions.
The objectness score (Po) is calculated as:
$$P_o = \Pr(\text{containing an object}) \times \mathrm{IoU}(\text{pred}, \text{truth})$$
Manning Publication: Deep Learning for Vision Systems
The objectness score undergoes a sigmoid transformation, converting it into a
probability value ranging between 0 and 1.
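The sigmoid transformation can be sketched in a few lines of Python. The function name and the example raw scores are assumptions for illustration.

```python
import math

def sigmoid(x):
    """Map any real-valued score to a value between 0 and 1."""
    # Two branches avoid overflow in math.exp for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

raw_scores = [-3.0, 0.0, 3.0]                 # hypothetical raw network outputs
objectness = [sigmoid(s) for s in raw_scores]
```

A raw score of 0 maps to 0.5, negative scores map below 0.5, and positive scores map above it.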
Prediction Across Different Scales
YOLOv3 has nine anchor boxes, with each of the three scales being assigned three
specific anchors.
The detection layer makes detections at feature maps of three different sizes with strides
32, 16, and 8, respectively.
Example: With an input image of size 416 × 416, the detections on scales
would be 13 × 13, 26 × 26, and 52 × 52.
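The grid sizes follow directly from dividing the input size by each stride. The sketch below also counts the total number of predicted boxes, assuming three anchors per grid cell at each scale as stated above; the function names are hypothetical.

```python
def detection_grid_sizes(input_size=416, strides=(32, 16, 8)):
    """Side length of the detection grid at each scale."""
    return [input_size // s for s in strides]

def total_predicted_boxes(input_size=416, strides=(32, 16, 8), anchors_per_scale=3):
    """Number of boxes predicted across all scales before thresholding and NMS."""
    grids = detection_grid_sizes(input_size, strides)
    return sum(g * g * anchors_per_scale for g in grids)

grids = detection_grid_sizes()    # [13, 26, 52]
total = total_predicted_boxes()   # (169 + 676 + 2704) * 3 = 10647
```

The large number of candidate boxes is why confidence thresholding and non-maximum suppression are applied afterwards.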
Architecture
Let’s look at the YOLOv3 network architecture in detail.
[Figure: YOLOv3 network architecture. The DarkNet architecture is followed by upsampling and concatenation layers, with detection layers at three scales: scale 1 (stride 32) at layer 82, scale 2 (stride 16) at layer 94, and scale 3 (stride 8) at layer 106. Other labeled layers: 36, 61, 79, and 91.]
Source: [Link]
YOLOv3
The YOLO architecture was influenced by other models, such as GoogLeNet (Inception) for
feature extraction.
Instead of the Inception modules, YOLO uses 1 × 1 reduction layers followed by
3 × 3 convolutional layers. This kind of architecture is known as Darknet.
YOLOv3 network structure, where the blue and red lines represent two-fold
up-sampling.
Source: [Link]
Darknet-53
YOLOv3 employs a 53-layer network variant of Darknet, known as Darknet-53, which
is trained on ImageNet.
For the detection task, YOLOv3 stacks 53 more layers on top of Darknet-53,
resulting in a 106-layer fully convolutional architecture.
Paper: YOLOv3: An Incremental Improvement
Hyperparameters
YOLOv3 has the following hyperparameters:
class_threshold: Defines the probability threshold for the predicted object.
Non-Max Suppression Threshold: Helps to overcome the problem of detecting an object multiple times in an image.
input_height and input_shape: Define the dimensions of the input image.
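The effect of class_threshold can be sketched as a simple filter over predictions. The dictionary layout of a detection (keys "label" and "score") and the threshold value 0.6 are hypothetical choices for illustration, not part of any specific YOLOv3 implementation.

```python
def filter_by_class_threshold(detections, class_threshold=0.6):
    """Keep only detections whose class probability reaches the threshold."""
    return [d for d in detections if d["score"] >= class_threshold]

# Hypothetical predictions for one image.
detections = [
    {"label": "person", "score": 0.92},
    {"label": "dog", "score": 0.35},
    {"label": "bicycle", "score": 0.60},
]
confident = filter_by_class_threshold(detections)
```

Raising the threshold removes more low-confidence boxes at the cost of missing some true objects. The boxes that remain are then passed to non-maximum suppression.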
Discussion: YOLO
• What is YOLO in real-time object detection?
Answer: YOLO (You Only Look Once) is an efficient object detection
algorithm that performs bounding box prediction and class probability
estimation directly from the entire image in a single pass.
• How does YOLO achieve real-time object detection?
Answer: YOLO achieves real-time object detection by dividing the image into
a grid, predicting bounding boxes and class probabilities for objects in each
grid cell, and employing a single neural network for simultaneous
predictions across the entire image.
• What are some advantages of YOLO for object detection?
Answer: YOLO offers real-time processing capabilities, high detection
accuracy, and the ability to detect multiple objects within an image. These
advantages make it well-suited for applications such as video surveillance,
autonomous driving, and robotics.
Key Takeaways
Mean average precision (mAP) is the standard metric for evaluating object detection.
Image classification assigns a single label to an image, while object
detection locates and identifies multiple objects within an image.
You Only Look Once (YOLO) is a single-stage detector that predicts
multiple object classes and bounding boxes in a single network pass.
Fast R-CNN, Faster R-CNN, and Mask R-CNN are well-known variants
of R-CNN.
Knowledge Check
1. Which is the relevant metric for the object detection task?
A. Coefficient of determination
B. Mean Absolute Percentage Error
C. Mean Average Precision
D. ROC
The correct answer is C
Mean Average Precision is the standard metric for evaluating object detection tasks, measuring
precision across various recall levels.
2. Which one would you prefer for real-time object detection?
A. R-CNN
B. Multi-stage detectors
C. Faster R-CNN
D. Single-stage detectors
The correct answer is D
Single-stage detectors are designed for speed, making them suitable for real-time object detection
without multiple steps.
3. Which architecture is used in YOLOv3 for feature extraction?
A. Darknet-63
B. Darknet-53
C. SVM
D. Darknet-15
The correct answer is B
Darknet-53 is the backbone architecture used in YOLOv3 for feature extraction.
Lesson-End Project
Problem Statement:
You are provided with a trained model of YOLOv3 on the MS COCO dataset.
Using this model, you have to create an object detection program for the
different objects in the dataset.
Dataset:
The model provided is trained on the MS COCO dataset.
Steps to Perform:
1. Load and prepare the image in which the object will be detected
2. Load the pretrained model
3. Create bounding boxes for the detected objects
4. Use the model for prediction