
Object Detection in Deep Learning

Ana, a retail employee, aims to enhance the shopping experience by implementing computer vision technology for automated checkout processes. The document outlines the learning objectives related to object detection, including understanding the differences between image classification, object detection, and segmentation, as well as the architecture of YOLO and the R-CNN family. It also discusses the evaluation metrics for object detection and the advancements in detection algorithms.

Uploaded by

Sushma M
Copyright
© All Rights Reserved

Advanced Deep Learning and Computer Vision
Object Detection
Business Scenario

Ana is employed by a large retail company. She aims to
simplify the purchasing process by eliminating checkout
hassles. To achieve this, she intends to implement sensors
and cameras equipped with computer vision. This
technology allows shoppers to take items from shelves
while a computer monitors their selections. Shoppers can
then pay at a contactless machine or, in stores with fully
automated checkout systems, simply exit without any
further action.

Approach:
To make the process smooth, Ana will need to thoroughly
study the algorithm that detects objects.
Learning Objectives

By the end of this lesson, you will be able to:

• Understand the difference between image classification, object detection, and segmentation

• Grasp the general framework of object detection

• Comprehend the progression within the R-CNN family

• Learn the architecture and working of YOLO


Image Classification and Object Detection
Discussion: Image Classification and Object Detection

• What is the goal of image classification?

• What is the main difference between image classification and object detection?
Object Detection

Object detection is a computer vision task that identifies and locates objects within images
or videos.

[Figure: butterflies detected with confidence scores of 41%, 65%, 71%, and 73%]


Image Classification vs. Object Detection

Image classification predicts the class of objects in an image; in this case, the class is ‘Cat’.
Object detection identifies and locates instances of objects in images.

Image classification: Cat
Object detection: Duck, Cat, Dog, Duck, Duck


Object Detection

Object detection requires localization as well as classification.

• Object localization: Predict the coordinates of the bounding boxes that fit the detected objects.
• Object classification: Predict the class of the objects in the image.
Image Segmentation

Image segmentation is the process of partitioning an image into multiple image segments,
known as image regions.

Source: [Link]
Discussion: Image Classification and Object Detection

• What is the goal of image classification?
Answer: The goal is to assign a label or category to an input image.

• What is the main difference between image classification and object detection?
Answer: Image classification assigns a single label to an image, while
object detection locates and identifies multiple objects within an image.
General Framework
Object Detection Framework

The object detection framework has the following components:

01 Creating ground truth data with labels of bounding boxes and classes of objects
02 Developing mechanisms to identify regions containing objects
03 Creating a target class variable using the IoU (Intersection over Union) metric
04 Creating a target bounding box offset variable for correcting the location of the region proposal
05 Building a model to predict object class and corresponding bounding box offset
06 Evaluating using mean Average Precision (mAP)
Region Proposal

Regions that the system judges to have a high probability of containing an object are called
regions of interest (RoIs). That probability is measured by the objectness score.

[Figure: regions of interest proposed by the system, labeled with low and high objectness scores]

Regions with high objectness scores are pushed forward in the network, and those
with low scores are not processed further.
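The filtering step described above can be sketched in a few lines of Python. The proposal boxes, their scores, and the 0.5 cutoff are made-up values for illustration; the lesson does not specify a particular threshold.

```python
# Hypothetical region proposals: ((x1, y1, x2, y2), objectness score).
proposals = [
    ((10, 10, 50, 50), 0.92),
    ((12, 14, 48, 52), 0.15),
    ((100, 80, 160, 140), 0.71),
]

OBJECTNESS_THRESHOLD = 0.5  # assumed cutoff, not taken from the lesson


def filter_by_objectness(proposals, threshold):
    """Keep only the proposals whose objectness score reaches the threshold."""
    return [(box, score) for box, score in proposals if score >= threshold]


kept = filter_by_objectness(proposals, OBJECTNESS_THRESHOLD)
print(kept)
```

Only the kept regions would be passed on to feature extraction and prediction.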
Feature Extraction and Network Predictions

The network analyzes all the regions that have been identified as having a high probability of
containing an object and makes two predictions for each region.

1. Bounding-box prediction: Predicts the bounding box coordinates (x, y, w, h).
2. Class prediction: Softmax function predicts the class probability for each object.

Note
In this step, generally, a pre-trained network is selected that is used for
feature extraction.
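As a rough illustration of the class prediction, the sketch below applies a softmax to raw class scores for one region. The class names and score values are hypothetical; a real network would produce the scores from the extracted features.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["cat", "dog", "duck"]   # hypothetical classes
logits = [2.0, 1.0, 0.1]               # hypothetical raw scores for one region

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(class_names[best], probs)
```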
Non-Maximum Suppression

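The network often draws multiple boxes for the same object.

[Figure: predictions before NMS and after applying NMS]

Non-maximum suppression (NMS) keeps the highest-scoring box for each object and discards the boxes that overlap it heavily, so that the object detection algorithm detects each object only once.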
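A minimal sketch of greedy NMS in plain Python follows. The box format (x1, y1, x2, y2), the sample boxes, and the 0.5 overlap threshold are assumptions made for the example; production libraries implement the same idea with vectorized operations.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept after greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this example the second box overlaps the first almost entirely, so it is suppressed, while the third box covers a different object and survives.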
Object Detection Evaluation Metrics

For evaluating the performance of object detection, two primary metrics are used:

• Frames per second (FPS): It is used to measure detection speed.
• Mean average precision (mAP): It is used to measure network precision.

Let's delve deeper into mAP.


Calculating Mean Average Precision

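Step 1: Find the intersection over union (IoU) score for each bounding box.

$$\mathrm{IoU} = \frac{B_g \cap B_p}{B_g \cup B_p}$$

Here $B_g$ is the ground-truth bounding box and $B_p$ is the predicted bounding box; the ratio compares the area of their overlap with the area of their union.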
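The IoU ratio can be computed directly from box corners. The sketch below assumes boxes are given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1; that convention and the sample boxes are choices made for this example.

```python
def iou(box_g, box_p):
    """IoU of a ground-truth box and a predicted box, each (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1, iy1 = max(box_g[0], box_p[0]), max(box_g[1], box_p[1])
    ix2, iy2 = min(box_g[2], box_p[2]), min(box_g[3], box_p[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    union = area_g + area_p - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 5, 15, 15)
score = iou(ground_truth, prediction)  # overlap 25, union 175
print(round(score, 4))
```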
Calculating Mean Average Precision

Step 2: Find true positives and false positives.

Intersection over union (IoU) is calculated to determine whether the detection is
valid (true positive) or not (false positive).

• True positive: if IoU > 0.5
• False positive: if IoU < 0.5

A threshold of 0.5 is set here.
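The labeling step can be sketched as follows. This is a simplified matching rule, assumed for illustration: detections are visited in descending confidence, and each ground-truth box may be matched by at most one detection, so duplicate detections of the same object count as false positives. Benchmark toolkits differ in the details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, ground_truths, threshold=0.5):
    """Label each (box, score) detection as 'TP' or 'FP', highest score first."""
    matched = set()   # indices of ground-truth boxes already claimed
    labels = []
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            value = iou(box, gt)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou > threshold:
            matched.add(best_idx)
            labels.append("TP")
        else:
            labels.append("FP")
    return labels


ground_truths = [(0, 0, 10, 10)]
detections = [
    ((0, 0, 10, 10), 0.9),     # exact match
    ((1, 1, 11, 11), 0.8),     # duplicate of the same object
    ((50, 50, 60, 60), 0.7),   # no overlap with any ground truth
]
labels = label_detections(detections, ground_truths)
print(labels)
```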


Calculating Mean Average Precision

Step 3: Calculate the average precision using the precision-recall curve.

Average Precision (AP) can be found by calculating the area under the curve (AUC).

The mAP for object detection is the average of the APs calculated for all the classes.
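The sketch below computes AP as the area under the raw precision-recall curve and then averages the per-class APs. It is one simple way to take that area; common benchmarks use interpolated variants of the curve, so their numbers can differ. The labels and the "dog" AP value are hypothetical.

```python
def average_precision(labels, num_ground_truths):
    """AP from 'TP'/'FP' labels sorted by descending confidence.

    Assumes num_ground_truths > 0.
    """
    tp = fp = 0
    ap = 0.0
    previous_recall = 0.0
    for label in labels:
        if label == "TP":
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truths
        # Add the rectangle under the curve for this recall increment.
        ap += precision * (recall - previous_recall)
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


cat_ap = average_precision(["TP", "FP", "TP"], num_ground_truths=2)
ap_per_class = {"cat": cat_ap, "dog": 0.5}  # hypothetical per-class results
m_ap = mean_average_precision(ap_per_class)
print(cat_ap, m_ap)
```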
Region-Based Convolutional Neural Networks (R-CNN Family)
Discussion: R-CNN Family

• What is the role of convolutional neural networks (CNNs) in R-CNN?

• What are some popular variants of R-CNN?


R-CNN

In 2014, R-CNN made a big impact by using convolutional neural networks for object detection
and localization.

It became one of the first successful applications in this field and inspired the creation of
advanced detection algorithms.

Paper: Rich feature hierarchies for accurate object detection and semantic segmentation, tech report (v5)
R-CNN Architecture

An illustration of the R-CNN architecture is shown below:

Each proposed RoI is passed through the CNN to extract features, followed by a
bounding-box regressor and an SVM classifier to produce the network output prediction.
R-CNN: Selective Search

Selective search is a region proposal technique that groups an image into candidate object regions, shown in the figure in distinct colors.

• In the provided image, many small regions appear at the start of the search.
• Similar neighboring regions are merged step by step, so each region covers more of the image as the search proceeds.

Paper – ‘Selective Search for Object Recognition’


Training R-CNN Model

To train R-CNN, follow these steps:

01 Train the feature extractor CNN.
02 Train the Support Vector Machine classifier.
03 Train the bounding-box regressors.
Drawbacks of R-CNN

The following conclusions can be drawn about R-CNN in the current scenarios:

• R-CNN is not suitable for quick, real-time applications such as self-driving cars because it is slow and requires significant computation.

• The training process is very complex and not seamless.

• The accuracy of detection is lower compared to its successors.


Fast R-CNN

In Fast R-CNN, the last max-pooling layer is replaced by an RoI pooling layer, which extracts a fixed-size feature for every region from a single feature map shared by all regions.

The architecture is trained with a multi-task loss that combines a classification loss and a bounding-box regression loss.

Paper: Fast R-CNN, Ross Girshick
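To make the RoI pooling idea concrete, here is a small pure-Python sketch that max-pools regions of different sizes from one shared feature map into the same fixed output size. The bin-splitting rule (floor for the start, ceiling for the end of each bin) and the toy feature map are assumptions for this example; real implementations vary in how they round bin edges.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (r1, c1, r2, c2) to an out_h x out_w grid.

    r2 and c2 are exclusive; the region must be at least 1 x 1.
    """
    r1, c1, r2, c2 = roi
    h, w = r2 - r1, c2 - c1
    pooled = []
    for i in range(out_h):
        row_start = r1 + (i * h) // out_h
        row_end = r1 + max(((i + 1) * h + out_h - 1) // out_h, (i * h) // out_h + 1)
        row = []
        for j in range(out_w):
            col_start = c1 + (j * w) // out_w
            col_end = c1 + max(((j + 1) * w + out_w - 1) // out_w, (j * w) // out_w + 1)
            row.append(max(
                feature_map[r][c]
                for r in range(row_start, row_end)
                for c in range(col_start, col_end)
            ))
        pooled.append(row)
    return pooled


# Toy 4 x 4 feature map shared by all regions.
feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

whole = roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2)   # 4 x 4 region
part = roi_max_pool(feature_map, (1, 1, 4, 3), 2, 2)    # 3 x 2 region
print(whole, part)
```

Both regions produce a 2 × 2 output even though their sizes differ, which is what lets the later fully connected layers accept any proposal.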
Fast R-CNN

There are a few important points to note with regard to Fast R-CNN:

• Fast R-CNN performs a convolution operation once per image to produce a feature map, making it quicker than R-CNN.
• Training is also faster because all the components, like the feature extractor, object classifier, and bounding-box regressor, are in one CNN network.

• However, a big bottleneck remains: region proposals are still generated by the selective search algorithm, which is very slow and runs as a separate step outside the network.
Faster R-CNN – Region Proposal Network

Faster R-CNN, the third iteration of the R-CNN family, was introduced in 2015.

In Faster R-CNN, a Region Proposal Network (RPN) is used to generate region proposals directly from
the feature maps produced by the deep convolutional network.

Paper: Faster RCNN


Region Proposal Network

An RPN takes an image (of any size) as input and outputs a set of rectangular object
proposals, each with an objectness score.

Example: Object detection using RPN proposals on the PASCAL VOC 2007 test

Paper: Faster RCNN


Predicting the Bounding Box by the Regressor

Predicting bounding-box coordinates directly is challenging, so the regressor predicts them relative to reference boxes.

• Anchor boxes, also known as reference boxes, are predefined boxes placed across the image.
• The regression layer predicts offsets, known as deltas (Δx, Δy, Δw, Δh), from these anchor boxes to fine-tune their position and size.
• This improves the fit around the objects to get final bounding box proposals.
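The way deltas adjust an anchor can be sketched as below. The sketch assumes the parameterization used in the Faster R-CNN paper, where the center shifts are scaled by the anchor size and the width and height are scaled exponentially; the anchor and delta values are made up for the example.

```python
import math


def apply_deltas(anchor, deltas):
    """Refine an anchor (x_center, y_center, width, height) with (dx, dy, dw, dh)."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = deltas
    x = xa + dx * wa          # shift the center by a fraction of the anchor width
    y = ya + dy * ha          # shift the center by a fraction of the anchor height
    w = wa * math.exp(dw)     # scale the width
    h = ha * math.exp(dh)     # scale the height
    return (x, y, w, h)


anchor = (100.0, 100.0, 50.0, 40.0)                      # hypothetical anchor box
unchanged = apply_deltas(anchor, (0.0, 0.0, 0.0, 0.0))   # zero deltas keep the anchor
refined = apply_deltas(anchor, (0.1, -0.2, math.log(2.0), 0.0))
print(unchanged, refined)
```

With zero deltas the proposal equals the anchor; the second call moves the center by 5 pixels right and 8 pixels up and doubles the width.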
Predicting the Bounding Box by the Regressor

The RPN uses a sliding window approach to generate a set of region proposals
based on anchor boxes.

Manning Publication: Deep Learning for Vision Systems


Architecture: RPN

The RPN consists of a shared convolutional layer followed by two sibling output layers.

3 × 3 CONV (pad 1, 512 output channels), followed by:

• Regression layer: 1 × 1 CONV (4k output channels)
• Classification layer: 1 × 1 CONV (2k output channels)

k is the number of anchors at each sliding-window position.
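The relationship between k and the two output layers can be checked with a short sketch. The three scales and three aspect ratios below are the values commonly cited from the Faster R-CNN paper and are treated as assumptions here; the function name is hypothetical.

```python
def make_anchor_shapes(scales, ratios):
    """Return one (width, height) per scale/ratio pair, with ratio = height / width.

    Each anchor keeps an area of scale * scale.
    """
    shapes = []
    for scale in scales:
        for ratio in ratios:
            width = scale / (ratio ** 0.5)
            height = scale * (ratio ** 0.5)
            shapes.append((width, height))
    return shapes


scales = (128, 256, 512)     # assumed anchor scales in pixels
ratios = (0.5, 1.0, 2.0)     # assumed aspect ratios

shapes = make_anchor_shapes(scales, ratios)
k = len(shapes)              # anchors per sliding-window position
regression_channels = 4 * k      # four box deltas per anchor
classification_channels = 2 * k  # object / not-object score per anchor
print(k, regression_channels, classification_channels)
```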


Architecture: RPN

The architecture of the RPN is shown below:

Paper: Faster RCNN


Multi-Stage vs. Single-Stage Detector

Two primary categories of object detection architectures are:

Multi-stage detector

• In multi-stage detectors, detection through the R-CNN family happens in two stages.
• These models are relatively slow.

Single-stage detector

• One-stage detectors skip the region proposal stage and run detection directly over a dense sampling of possible locations.
• This approach is faster and simpler, but it might trade off some accuracy.
Faster R-CNN on Custom Data

Objective: Train Faster R-CNN on custom data

Train a Faster R-CNN model to detect the bounding boxes around objects present in images.

Download the data from [Link] and the required dataset from the dataset folder.

Note: Please download the solution document from the reference material section and follow
the Jupyter Notebook for step-by-step execution.
Discussion: R-CNN Family

• What is the role of convolutional neural networks (CNNs) in R-CNN?
Answer: Convolutional neural networks (CNNs) in R-CNN play a crucial role
in extracting features from the proposed regions. These extracted features
are vital for object classification and enable precise detection within the
R-CNN framework.

• What are some popular variants of R-CNN?
Answer: Fast R-CNN, Faster R-CNN, and Mask R-CNN are well-known
variants of R-CNN. They provide enhancements in terms of speed, efficiency,
and advanced capabilities such as instance segmentation.
YOLO - Real-Time Object Detection
Discussion: YOLO

• What is YOLO in real-time object detection?

• How does YOLO achieve real-time object detection?

• What are some advantages of YOLO for object detection?


YOLO: Unified, Real-Time Object Detection

You Only Look Once (YOLO) is a single-stage detector that predicts multiple object
classes and bounding boxes in a single network pass.

[Figure: bar chart of frames per second (FPS) for Faster R-CNN (low), Faster R-CNN (high), R-FCN (low), SSD (low), SSD (high), YOLO (low), and YOLO (high)]

Paper: A review: Comparison of performance metrics of pre-trained models for object detection using the TensorFlow framework

The figure compares the number of frames processed per second (FPS) by
different models on images of varying resolutions.
Generalizability

YOLO generalizes beyond natural images: it also detects objects in sample artwork taken from the internet.

[Figure: detections on the Picasso Dataset and Picasso Dataset precision-recall curves]


Paper: You Only Look Once: Unified, Real-Time Object Detection

Its speed and accuracy have made it an industry favorite for applications ranging from
surveillance to autonomous driving.
The YOLO Detection System

Processing images with YOLO is simple and straightforward.

1. Resize: Resizes the input image to 448 × 448.
2. Run CNN: Runs a single convolutional network on the image.
3. Non-maximum suppression: Thresholds the resulting detections by the model’s confidence.

Paper: You Only Look Once: Unified, Real-Time Object Detection


The Model

The system interprets detection as a regression problem.

• It divides the image into an S × S grid, and each grid cell predicts B bounding boxes, a confidence score for each of those boxes, and C class probabilities.
• These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.
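The size of the prediction tensor follows directly from S, B, and C, since each box contributes five numbers (x, y, w, h, and confidence). The sketch below uses S = 7, B = 2, and C = 20, the settings reported in the YOLO paper for PASCAL VOC; the helper names are hypothetical.

```python
def yolo_output_shape(S, B, C):
    """Shape of the YOLO prediction tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)


def responsible_cell(x_center, y_center, image_size, S):
    """Return (row, col) of the grid cell that contains an object's center."""
    col = min(int(x_center / image_size * S), S - 1)
    row = min(int(y_center / image_size * S), S - 1)
    return (row, col)


shape = yolo_output_shape(S=7, B=2, C=20)
cell = responsible_cell(x_center=224, y_center=100, image_size=448, S=7)
print(shape, cell)
```

The second helper illustrates the grid assignment: an object whose center falls inside a cell is predicted by that cell.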

Paper: You Only Look Once: Unified, Real-Time Object Detection


Architecture

The detection network has 24 convolutional layers, followed by two fully connected layers.

[Figure: YOLO detection network architecture]

Paper: You Only Look Once: Unified, Real-Time Object Detection

Alternating 1 × 1 convolutional layers reduce the feature space from preceding layers.
YOLOv3

[Figure: YOLOv3 workflow when applying a 13 × 13 grid to the input image]

Each grid cell produces B bounding boxes, with each box having an associated
objectness score and class predictions.

The objectness score (Po) is calculated as:

Manning Publication: Deep Learning for Vision Systems

The objectness score undergoes a sigmoid transformation, converting it into a
probability value ranging between 0 and 1.
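The sigmoid transformation mentioned above can be written as a short function. The raw scores passed in are hypothetical; the two-branch form is simply one way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Map any real-valued raw score to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)           # safe for large negative x
    return z / (1.0 + z)


raw_scores = [-4.0, 0.0, 4.0]            # hypothetical raw objectness outputs
objectness = [sigmoid(s) for s in raw_scores]
print(objectness)
```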
Prediction Across Different Scales

YOLOv3 has nine anchor boxes, with each of the three scales being assigned three
specific anchors.

The detection layer makes detections at feature maps of three different sizes with strides
32, 16, and 8, respectively.

Example: With an input image of size 416 × 416, the detections on scales
would be 13 × 13, 26 × 26, and 52 × 52.
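The grid sizes in the example come from dividing the input size by each stride. A minimal sketch, with hypothetical function names, that also counts how many boxes are predicted in total when three anchors are used per scale:

```python
def grid_sizes(input_size, strides=(32, 16, 8)):
    """Side length of the detection grid at each stride."""
    return [input_size // stride for stride in strides]


def total_boxes(input_size, anchors_per_scale=3, strides=(32, 16, 8)):
    """Number of boxes predicted across all scales before any filtering."""
    return sum(g * g * anchors_per_scale for g in grid_sizes(input_size, strides))


sizes = grid_sizes(416)
boxes = total_boxes(416)
print(sizes, boxes)
```

The large number of raw boxes is why confidence thresholding and non-maximum suppression are applied afterwards.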
Architecture

Let’s look at the YOLOv3 network architecture in detail.

[Figure: YOLOv3 architecture. The DarkNet backbone feeds detection layers at three scales, connected by upsampling and concatenation layers: scale 1 (stride 32) at layer 82, scale 2 (stride 16) at layer 94, and scale 3 (stride 8) at layer 106.]

Source: [Link]
YOLOv3

The YOLO architecture was influenced by other models, such as GoogLeNet (Inception) for
feature extraction.

Instead of the Inception modules, YOLO uses 1 × 1 reduction layers followed by
3 × 3 convolutional layers. This kind of architecture is known as Darknet.

YOLOv3 network structure, where the blue and red lines represent two-fold
up-sampling.
Source: [Link]
Darknet-53

YOLOv3 employs a 53-layer network variant of Darknet, known as Darknet-53, which
is trained on ImageNet.

For the detection task, YOLOv3 adds 53 more layers, resulting in a 106-layer fully
convolutional underlying architecture for YOLOv3.

Paper: YOLOv3: An Incremental Improvement


Hyperparameters

YOLOv3 has the following hyperparameters:

• class_threshold: Defines the probability threshold for the predicted object.
• Non-Max Suppression Threshold: Helps to overcome the problem of detecting an object multiple times in an image.
• input_height and input_shape: Define the dimensions of the input image.
Discussion: YOLO

• What is YOLO in real-time object detection?
Answer: YOLO (You Only Look Once) is an efficient object detection
algorithm that performs bounding box prediction and class probability
estimation directly from the entire image in a single pass.

• How does YOLO achieve real-time object detection?
Answer: YOLO achieves real-time object detection by dividing the image into
a grid, predicting bounding boxes and class probabilities for objects in each
grid cell, and employing a single neural network for simultaneous
predictions across the entire image.

• What are some advantages of YOLO for object detection?
Answer: YOLO offers real-time processing capabilities, high detection
accuracy, and the ability to detect multiple objects within an image. These
advantages make it well-suited for applications such as video surveillance,
autonomous driving, and robotics.
Key Takeaways

• Average precision is an important metric for evaluation.
• Image classification assigns a single label to an image, while object detection locates and identifies multiple objects within an image.
• You Only Look Once (YOLO) is a single-stage detector that predicts multiple object classes and bounding boxes in a single network pass.
• Fast R-CNN, Faster R-CNN, and Mask R-CNN are well-known variants of R-CNN.
Knowledge Check
Knowledge Check 1
Which is the relevant metric for the object detection task?

A. Coefficient of determination

B. Mean Absolute Percentage Error

C. Mean Average Precision

D. ROC
The correct answer is C

Mean Average Precision is the standard metric for evaluating object detection tasks, measuring
precision across various recall levels.
Knowledge Check 2
Which one would you prefer for real-time object detection?

A. R-CNN

B. Multi-stage detectors

C. Faster R-CNN

D. Single-stage detectors
The correct answer is D

Single-stage detectors are designed for speed, making them suitable for real-time object detection
without multiple steps.
Knowledge Check 3
Which architecture is used in YOLOv3 for feature extraction?

A. Darknet-63

B. Darknet-53

C. SVM

D. Darknet-15
The correct answer is B

Darknet-53 is the backbone architecture used in YOLOv3 for feature extraction.


Lesson-End Project

Problem Statement:
You are provided with a trained model of YOLOv3 on the MS COCO dataset.
Using this model, you have to create an object detection program for the
different objects in the dataset.

Dataset:
The model provided is trained on the MS COCO dataset.

Steps to Perform:
1. Load and prepare the image in which the object will be detected
2. Load the pretrained model
3. Create bounding boxes for the detected objects
4. Use the model for prediction
