
Research on traffic signal detection algorithm based on improved YOLOv8

Liangchang Li Zonghong Feng∗ Kai Xu Ning Zhang Yangyang Zhang

School of mathematics and physics


Lanzhou Jiaotong University, Lanzhou, Gansu, 730070, P. R. China.

Abstract

Frequent traffic safety incidents and the development of autonomous driving technology place higher demands on the accurate detection of traffic signs, while small-target detection remains a limitation of existing detectors. To address this, this paper proposes DA-YOLO, an improved model based on YOLOv8n. First, the Bottleneck in the C2f module of the YOLOv8n backbone network is replaced with Bottleneck-DAttention. By introducing DAttention, features can be extracted more effectively, thereby improving the performance of the model. Second, DySample, an ultra-lightweight and efficient upsampler, is introduced into the neck network to further improve performance while reducing computational overhead. Finally, SlideLoss is replaced by NWDLoss, which better handles prediction results with more complex distributions and better measures the distance between the predicted box and the true box in the feature space. Experimental results show that DA-YOLO performs better than the SSD, Faster R-CNN, YOLOv4, YOLOv5 and YOLOv7 algorithms. Compared with the baseline model YOLOv8n, the improved DA-YOLO has an mAP improvement of 3.3%, an accuracy improvement of 1.5%, and a recall improvement of 4.3%. The proposed method strikes a balance between model size and detection accuracy, enabling it to meet the requirements of traffic signal detection, and provides new ideas and methods for future research in the field of traffic light detection.

Keywords: YOLOv8, Traffic Signal Detection, Autonomous Driving, DySample, Object Detection

1 Introduction
Cars have become a common choice for daily travel rather than a luxury. However, this has also led to an increase in traffic safety problems. According to official Chinese statistics, more than 500,000 traffic accidents occur each year, and on average one person is injured in a traffic accident every minute. The main causes of these accidents include negligence caused by psychological or physiological factors, and operational errors caused by unskilled or inexperienced drivers. In order to reduce the occurrence of traffic accidents, autonomous driving technology and unmanned driving technology have been widely used. The popularization of these technologies has made transportation more efficient and safe.

∗ Corresponding author: Zonghong Feng (sccdfzh@[Link]). This work is supported by the National Natural Science Foundation of China (No. 12162020) and the Young Scholars Science Foundation of Lanzhou Jiaotong University (No. 2020022).

Traditional traffic light monitoring usually relies on manual inspections or fixed cameras. However, this approach is inefficient, costly, and error-prone. With the rapid development of artificial intelligence technology, especially breakthroughs in the field of computer vision, deep learning-based object detection has gradually become an effective means of solving the traffic light monitoring problem. Deep learning algorithms can automatically recognize and monitor traffic lights and their surrounding traffic signs, thereby improving the efficiency and accuracy of traffic management.
This paper aims to use deep learning technology, especially the target detection method based on the YOLO algorithm, to accurately identify and monitor traffic lights and the traffic signs around them. By building an efficient traffic light detection model, real-time monitoring and identification of traffic light status can be achieved, which will help improve the level of intelligent traffic management, reduce the occurrence of traffic accidents, and ensure the safety and smoothness of road traffic. The specific goals are to improve the accuracy and real-time performance of traffic light detection, to provide effective support for traffic management and traffic safety, and to explore the application of deep learning algorithms in the field of traffic light monitoring as a reference for future related research.
Traditional traffic detection methods have begun to be combined with advanced technologies such
as machine learning to improve detection results. Common combined methods include support vector
machines, random forests, genetic algorithms, AdaBoost [1, 14, 27], and artificial neural networks. In
these machine learning methods, the commonly used feature information is local information based on
HOG (Histogram of Oriented Gradients) [2,16,23]. HOG can effectively extract the contour information
of the target object and is usually combined with the SVM classifier for feature classification.
The research team led by Zaklouta conducted continuous exploration between 2011 and 2012 [3,12,28]. In their 2011 study, they used histogram of oriented gradients (HOG) features to describe traffic sign images and trained support vector machines (SVM) on them to achieve automated classification of signs. In 2012, the research team further combined HOG features with the random forest algorithm to identify various traffic signs and carefully selected the most discriminative features [4,17,22]. This improvement not only reduces the size of the feature set and the demand for storage and computing resources, but also significantly improves recognition accuracy.
Research on neural networks began in the mid-20th century, and their first applications in the field of computer vision can be traced back to the 1990s. With the popularization of advanced media devices and the increase in the amount of image, audio and video data, deep learning gradually attracted attention and began to develop rapidly in 2006. In the ImageNet Large-Scale Visual Recognition Challenge held in 2012, a convolutional neural network called AlexNet [5, 11, 25] clearly outperformed traditional recognition methods, marking a new era of deep learning in the field of visual recognition. Since then, deep learning technology has attracted widespread attention in the field of object detection. Many researchers have begun to use deep learning technology to explore visual detection problems in many fields such as transportation, industry, and medicine, promoting technological progress and application innovation in this field.

Figure 1: Deep Learning Object Detection Algorithm Timeline.

Object detection methods based on deep learning can be roughly divided into two categories: one is the two-stage detection algorithm based on candidate regions, such as R-CNN, Fast R-CNN and Faster R-CNN [6,13,24]; the other is the single-stage detection algorithm based on regression, such as the Single Shot MultiBox Detector (SSD) [7, 15, 29] and the YOLO series of algorithms (such as YOLOv1, YOLOv3, YOLOv5) [8, 18, 30]. In recent years, single-stage object detection algorithms have received extensive attention and research, and their development process is shown in Figure 1.
In 2021, Ma et al. proposed an innovative strategy, which is to embed a feature selection unit (FSU) in
the connection layer of the network [9, 21]. The unit evaluates the feature contribution of each channel,
screens and prioritizes those features that contribute more to target detection, thereby reducing the
interference of unnecessary features in the fusion process. Through this improvement, the performance
of YOLOv3, YOLOv4, and YOLOv5-L has been improved by 0.60%, 1.10%, and 1.50%, respectively. In
addition, there are also studies that attempt to integrate the attention mechanism of the two dimensions
of space and channel into the connection layer. For example, in their 2020 study, Ju et al. developed
an adaptive feature fusion algorithm (AFFA) by combining the attention mechanism of global and local
spatial positions [10, 19, 20, 26]. This algorithm, combined with the YOLOv3 model, can intelligently
identify and emphasize the importance of channel and spatial features at different scales [8]. This method
achieved 5.08%, 2.30% and 7.41% mAP improvements on PASCAL VOC07, KITTI and Smart UVM
datasets, respectively.

2 Materials and Methods

2.1 Dataset source and processing

The dataset in this article comes from the traffic signal data on the Roboflow website. The URL for obtaining the data is [Link]. The dataset contains traffic signs in the real world. Figure 2 shows 6 dataset samples of different categories, including road scenes under different light and weather conditions.

Figure 2: Dataset Samples


The dataset contains 4969 traffic sign images and is divided into three parts: Train, Valid, and Test. The training set has 3530 images, the validation set has 801 images, and the test set has 638 images. The internal parameters of the model are optimized using the training set. We then evaluate the performance of the model on the validation set and, based on the evaluation results, fine-tune the hyperparameters of the model to improve its performance. The images contain 15 categories: green light, red light, stop, speed limit 10, speed limit 20, speed limit 30, speed limit 40, speed limit 50, speed limit 60, speed limit 70, speed limit 80, speed limit 90, speed limit 100, speed limit 110, speed limit 120.
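
As a quick sanity check on the split sizes quoted above, the following minimal sketch recomputes the total and the share of each subset. The dictionary name `splits` is only an illustrative choice; the counts are the ones stated in the text.

```python
# Image counts per subset, as reported in the text.
splits = {"train": 3530, "valid": 801, "test": 638}

# Total number of images and the fraction held by each subset.
total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, frac in fractions.items():
    print(f"{name}: {splits[name]} images ({frac:.1%})")
```

The three subsets add up to the 4969 images reported for the whole dataset.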

2.2 Standard YOLOv8 Model

YOLOv8 is the latest version of the YOLO series algorithm. While maintaining high efficiency and
accuracy, it further optimizes the algorithm structure and improves the running speed. YOLOv8 adopts
a new network structure and loss function, making the model more stable during training and improving
the generalization ability of the model.
YOLOv8 adopts an end-to-end training method, treating target detection as a regression problem,
and directly predicting the location and category of the object. In terms of model structure, YOLOv8
uses CSPDarknet53 as the backbone network, and improves the feature extraction ability of the model
by adding residual connections and feature fusion. YOLOv8 uses the CIoU loss function and considers
factors such as the overlapping area between the predicted box and the real box, the center point distance,
and the aspect ratio, so that the model focuses more on improving the accuracy of the predicted box
during training.
The YOLOv8 algorithm was proposed by Glenn Jocher and is in the same vein as the YOLOv3 and YOLOv5 algorithms. The YOLOv8 network structure is shown in Figure 3. The main components are Input, Backbone, Neck, and Head.
The input end mainly includes Mosaic image enhancement, adaptive anchor box calculation and adaptive image scaling. The Mosaic data enhancement method randomly selects 4 different images and splices them into one large image, which increases the diversity and difficulty of the training set and helps improve the generalization ability of the target detection model [3].
The Backbone of YOLOv8 is mainly based on a neural network architecture called Darknet. Darknet is a lightweight deep neural network framework developed by Joseph Redmon and is widely used in target detection tasks. It uses a convolutional neural network (CNN) to extract image features and has high computational efficiency and good performance. YOLOv8 uses CSPDarknet53 as its backbone network. CSPDarknet53 is an improved version of the Darknet architecture that introduces the Cross-Stage Partial Connection (CSP) module to improve the performance and efficiency of the model. The CSP module divides the input feature map into two parts: one part is connected directly to the output, and the other part is connected to the output after a series of convolution and pooling operations. This design reduces the amount of computation and the number of parameters of the model while improving its performance and convergence speed.
When training the YOLOv8 model, a pre-trained Backbone model is usually used for initialization to accelerate convergence and improve performance. Pre-trained models are usually trained on large-scale image datasets (such as ImageNet) to learn common image features. The Backbone network extracts image features through a series of operations such as convolutional layers and pooling layers, and gradually downsamples the feature maps to obtain higher-level semantic information. These feature maps are then passed to the subsequent network layers of YOLOv8 for object detection. To improve the detection performance and computational efficiency of YOLOv8, researchers may further optimize the Backbone network, for example by adjusting the number of layers, the number of channels, and the module design to suit different application scenarios and hardware environments. The Backbone therefore plays a vital role: it extracts and downsamples image features, providing the input for the subsequent stages.
The Neck module of YOLOv8 is an important part of the overall model. It is responsible for further processing the feature maps extracted by the Backbone before they are used for object detection.
The Neck module of YOLOv8 usually adopts PANet as one of its main components. PANet is an improved structure for the Feature Pyramid Network (FPN). FPN achieves multi-scale feature extraction in target detection by building a pyramid-shaped feature hierarchy and fusing feature maps of different resolutions. However, the traditional FPN structure suffers from poor information transmission between feature maps of different resolutions when fusing features, and PANet solves this problem by introducing a path aggregation mechanism. In addition to PANet, the Neck module of YOLOv8 may also adopt the FPN structure. The main task of the Neck module is to fuse feature maps of different resolutions from the Backbone and perform the necessary upsampling operations for use in the subsequent detection head. This usually involves operations such as convolution and upsampling to adjust and fuse feature maps. The Neck module may also use downsampling and upsampling operations to adjust the resolution of feature maps so that they suit targets of different scales. Downsampling is usually achieved through pooling or strided convolution, while upsampling is usually achieved through deconvolution or upsampling convolution.

Figure 3: Standard YOLOv8 model structure diagram.

The Neck module is also responsible for fusing feature information from different levels to improve the model's detection performance. This is usually achieved through feature fusion operations at the channel level or the spatial level. Through the processing of the Neck module, YOLOv8 obtains richer information from the feature maps extracted by the Backbone and provides better input for subsequent target detection. The design and optimization of the Neck module are therefore crucial to the performance and efficiency of the model.
The Head module of YOLOv8 is a key component of the target detection model. It is responsible for converting the extracted feature maps into the coordinates and category probabilities of the target boxes. The Head module defines a set of anchor boxes, which represent predicted target boxes of different shapes and sizes. Each anchor box is usually associated with the size and aspect ratio of the predicted target. The Head module decodes the feature map and converts it into the location of the target box (the center coordinates, width and height of the bounding box) and the category probability. It is usually used in conjunction with the feature fusion operation in the Neck module to obtain richer feature information.
The Head module performs target detection predictions at multiple scales to handle targets of different sizes and proportions. This is usually achieved by applying anchor boxes of different sizes to feature maps at different levels. The Head module generates the final output of target detection, including the location, category, and confidence score of each detected target box. The predicted target boxes may overlap heavily. To eliminate the redundant information brought by overlapping boxes, the Head module usually uses the NMS algorithm for filtering and retains the target box with the highest confidence. Through the processing of the Head module, YOLOv8 can achieve efficient and accurate target detection, map the extracted feature maps to target boxes in the real world, and provide reliable detection results for subsequent applications.
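
The NMS filtering step mentioned above can be illustrated with a minimal, class-agnostic sketch in plain Python. This is not the implementation used in the paper or in any YOLOv8 release; the function names, the corner box format `(x1, y1, x2, y2)`, and the threshold value 0.5 are assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```

In practice NMS is usually applied per category, so that overlapping boxes of different classes do not suppress each other.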

2.3 Improved YOLOv8 Model

In the context of autonomous driving, traffic sign detection still presents some challenges. For example, the size of traffic signs changes dynamically as the vehicle moves. To ensure that the vehicle can respond to complex traffic conditions in a timely manner, the detection algorithm needs to accurately identify small-sized signs when they appear.
This experiment builds a traffic sign detection model that takes into account both detection accuracy and speed through three changes. First, the Bottleneck in the C2f module of YOLOv8n is replaced with Bottleneck-DAttention. By introducing DAttention, features can be extracted more effectively and model performance can be improved. Second, DySample, an efficient and lightweight upsampler, is introduced in the neck network to improve performance while reducing computational overhead. Finally, NWDLoss is used to replace SlideLoss to better measure the distance between the predicted box and the true box in complex prediction results, thereby improving target detection. Figure 4 is a diagram of the improved YOLOv8n algorithm architecture.

Figure 4: Improved YOLOv8 model structure diagram.

2.3.1 Vision Transformer with Deformable Attention

DAttention is an efficient deformable self-attention mechanism that aims to improve the model’s ability
to capture key information by dynamically adjusting the attention area. Different from the traditional
global attention mechanism, DAttention introduces a query-independent offset method to focus attention
on more important areas. By generating reference points and offsets, DAttention can flexibly adjust
the positions of candidate keys/values, thereby improving the flexibility and efficiency of the attention
module. This design not only reduces memory and computational complexity, but also enhances the
performance of the model in image classification and dense prediction tasks.
In traffic sign detection, the introduction of the DAttention module enables the model to focus on
the target area more effectively and improve detection accuracy, especially in complex backgrounds and
multi-scale target scenes. The architecture of DAttention is shown in Figure 5.
Figure 5: Model design of DAttention.

As illustrated in Figure 5(a), given the input feature map $x \in \mathbb{R}^{H \times W \times C}$, a uniform grid of points $p \in \mathbb{R}^{H_G \times W_G \times 2}$ is generated as the references. Specifically, the grid size is downsampled from the input feature map size by a factor $r$, $H_G = H/r$, $W_G = W/r$. The values of the reference points are linearly spaced 2D coordinates $\{(0, 0), \ldots, (H_G - 1, W_G - 1)\}$, which are then normalized to the range $[-1, +1]$ according to the grid shape $H_G \times W_G$, in which $(-1, -1)$ indicates the top-left corner and $(+1, +1)$ indicates the bottom-right corner.
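
The construction of the reference grid can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `reference_grid`, the use of integer division for $H/r$ and $W/r$, and the linear mapping of index $i$ to $2i/(n-1) - 1$ are choices made here, and an actual implementation may normalize slightly differently.

```python
def reference_grid(H: int, W: int, r: int):
    """Return an H_G x W_G grid of (y, x) reference points normalized to [-1, +1].

    (-1, -1) is the top-left corner and (+1, +1) the bottom-right corner.
    """
    HG, WG = H // r, W // r  # grid size downsampled by the factor r

    def normalize(i: int, n: int) -> float:
        # Map integer index 0..n-1 linearly onto [-1, +1].
        return 0.0 if n == 1 else 2.0 * i / (n - 1) - 1.0

    return [[(normalize(row, HG), normalize(col, WG)) for col in range(WG)]
            for row in range(HG)]


grid = reference_grid(8, 8, 2)  # an 8 x 8 feature map with r = 2 gives a 4 x 4 grid
```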
To obtain the offset for each reference point, the feature maps are projected linearly to the query tokens $q = xW_q$ and then fed into a lightweight sub-network $\theta_{\mathrm{offset}}(\cdot)$ to generate the offsets $\Delta p = \theta_{\mathrm{offset}}(q)$. To stabilize the training process, we scale the amplitude of $\Delta p$ by a predefined factor $s$ to prevent overly large offsets, i.e., $\Delta p \leftarrow s \tanh(\Delta p)$. The features are then sampled at the locations of the deformed points as keys and values, followed by projection matrices:

$$q = xW_q, \quad \bar{k} = \bar{x}W_k, \quad \bar{v} = \bar{x}W_v, \tag{1}$$

$$\text{with} \quad \Delta p = \theta_{\mathrm{offset}}(q), \quad \bar{x} = \phi(x; p + \Delta p). \tag{2}$$

$\bar{k}$ and $\bar{v}$ represent the deformed key and value embeddings respectively. Specifically, we set the sampling function $\phi(\cdot; \cdot)$ to a bilinear interpolation to make it differentiable:

$$\phi(z; (p_x, p_y)) = \sum_{(r_x, r_y)} g(p_x, r_x)\, g(p_y, r_y)\, z[r_y, r_x, :] \tag{3}$$

where $g(a, b) = \max(0, 1 - |a - b|)$ and $(r_x, r_y)$ indexes all the locations on $z \in \mathbb{R}^{H \times W \times C}$. As $g$ is non-zero only on the 4 integral points closest to $(p_x, p_y)$, Eq. (3) simplifies to a weighted average over 4 locations. Similar to existing approaches, we perform multi-head attention on $q$, $\bar{k}$, $\bar{v}$ and adopt relative position offsets $R$. The output of an attention head is formulated as:

$$z^{(m)} = \sigma\left( q^{(m)} \bar{k}^{(m)\top} / \sqrt{d} + \phi(\hat{B}; R) \right) \bar{v}^{(m)}, \tag{4}$$

where $\sigma(\cdot)$ denotes the softmax function and $d$ is the dimension of each head.
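
The bounded offset and the bilinear sampling function can be checked numerically with the following minimal sketch. It works on a single-channel feature map stored as a nested list and uses pixel coordinates rather than normalized ones; both simplifications, as well as the function names, are assumptions made for illustration.

```python
import math


def g(a: float, b: float) -> float:
    """Bilinear kernel g(a, b) = max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))


def bilinear_sample(z, px: float, py: float) -> float:
    """Evaluate phi(z; (px, py)) by summing over every location (rx, ry) of z.

    Only the (at most) 4 integer neighbours of (px, py) get a non-zero weight,
    so the sum is a weighted average of those locations.
    """
    total = 0.0
    for ry, row in enumerate(z):
        for rx, value in enumerate(row):
            total += g(px, rx) * g(py, ry) * value
    return total


def bounded_offset(raw: float, s: float) -> float:
    """Scale a raw offset with s * tanh(raw) so that its magnitude never exceeds s."""
    return s * math.tanh(raw)


z = [[0.0, 1.0],
     [2.0, 3.0]]
center = bilinear_sample(z, 0.5, 0.5)   # average of the four pixels
corner = bilinear_sample(z, 1.0, 0.0)   # exactly z[0][1]
```

Because every term of the sum is differentiable with respect to $(p_x, p_y)$ almost everywhere, gradients can flow back into the offset network.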

2.3.2 Lightweight image upsampler DySample

To improve the performance of the YOLOv8n model in the traffic sign detection task, we introduced the DySample module. DySample is an ultra-lightweight and efficient dynamic upsampler designed to increase feature resolution with little computational overhead. Compared with traditional kernel-based dynamic upsamplers such as CARAFE and SAPA, DySample replaces dynamic convolution with a point sampling method, significantly reducing the number of parameters, computational resource consumption, and latency. In addition, it does not require additional custom CUDA packages and can be implemented using standard built-in functions in PyTorch, making it easier to integrate and deploy. DySample not only performs well in resource-constrained environments, but also excels in multiple dense prediction tasks such as object detection and semantic segmentation. In our traffic sign detection task, the introduction of DySample improves the detection accuracy of the model while accelerating inference, demonstrating its efficiency in practice. The architecture of DySample is shown in Figure 6.

Figure 6: Model design of DySample.

Figure 6(a) demonstrates the feasibility of dynamic upsampling based on sampling. Given a feature map $X$ of size $C \times H_1 \times W_1$ and a sampling set $\delta$ of size $2 \times H_2 \times W_2$, where the first dimension 2 holds the x and y coordinates, the grid_sample function resamples $X$ into $X'$ of size $C \times H_2 \times W_2$, assuming bilinear interpolation at the positions in $\delta$. The process is defined as

$$X' = \mathrm{grid\_sample}(X, \delta) \tag{5}$$

Given an upsampling factor $s$ and a feature map $X$ of size $C \times H \times W$, a linear layer with $C$ input channels and $2s^2$ output channels generates an offset $O$ of size $2s^2 \times H \times W$, which is then reshaped to $2 \times sH \times sW$ by pixel shuffling. The offset $O$ and the original sampling grid $G$ are summed to obtain the sampling set $S$. The process is as follows:

$$O = \mathrm{linear}(X) \tag{6}$$

$$S = G + O \tag{7}$$
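
The reshaping and summation in Eqs. (6) and (7) can be sketched on nested lists as follows. This is an illustration only, not the DySample implementation: the offset tensor is supplied directly instead of being produced by a learned linear layer, the channel order (x first, then y) and the pixel-centre convention used for the grid $G$ are assumptions, and `pixel_shuffle` follows the usual PyTorch channel-to-space layout.

```python
def pixel_shuffle(t, s: int):
    """Rearrange a (C*s*s, H, W) nested list into (C, s*H, s*W)."""
    C, H, W = len(t) // (s * s), len(t[0]), len(t[0][0])
    out = [[[0.0] * (s * W) for _ in range(s * H)] for _ in range(C)]
    for c in range(C):
        for i in range(s):
            for j in range(s):
                src = t[c * s * s + i * s + j]
                for h in range(H):
                    for w in range(W):
                        out[c][h * s + i][w * s + j] = src[h][w]
    return out


def base_grid(H: int, W: int, s: int):
    """Original sampling grid G of size (2, sH, sW): the position of every
    output pixel expressed in input-pixel coordinates (channel 0 = x, 1 = y)."""
    gx = [[(col + 0.5) / s - 0.5 for col in range(s * W)] for _ in range(s * H)]
    gy = [[(row + 0.5) / s - 0.5 for _ in range(s * W)] for row in range(s * H)]
    return [gx, gy]


def sampling_set(offset, H: int, W: int, s: int):
    """S = G + O, where O is the pixel-shuffled offset of size (2, sH, sW)."""
    O = pixel_shuffle(offset, s)
    G = base_grid(H, W, s)
    return [[[G[c][r][k] + O[c][r][k] for k in range(s * W)]
             for r in range(s * H)] for c in range(2)]


# With all-zero offsets the sampling set equals the original grid.
zero_offset = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(8)]  # 2*s*s = 8 channels
S = sampling_set(zero_offset, H=2, W=2, s=2)
```

With zero offsets the result reduces to plain bilinear upsampling positions; learned non-zero offsets shift each sampling point individually.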

2.3.3 Normalized Gaussian Wasserstein Distance

The Wasserstein distance from optimal transport theory is used to quantify the divergence between distributions. Consider two bivariate Gaussian distributions $\mu_1 = \mathcal{N}(m_1, \Sigma_1)$ and $\mu_2 = \mathcal{N}(m_2, \Sigma_2)$. The squared second-order Wasserstein distance between these distributions is formulated as:

$$W_2^2(\mu_1, \mu_2) = \|m_1 - m_2\|_2^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2\left( \Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2} \right)^{1/2} \right) \tag{8}$$

which can be expressed more compactly as:

$$W_2^2(\mu_1, \mu_2) = \|m_1 - m_2\|_2^2 + \left\| \Sigma_1^{1/2} - \Sigma_2^{1/2} \right\|_F^2 \tag{9}$$

where $\|\cdot\|_F$ denotes the Frobenius norm.

For Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$ derived from bounding boxes $A = (cx_a, cy_a, w_a, h_a)$ and $B = (cx_b, cy_b, w_b, h_b)$, the equation simplifies to:

$$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\| \left[ cx_a, cy_a, \frac{w_a}{2}, \frac{h_a}{2} \right]^T - \left[ cx_b, cy_b, \frac{w_b}{2}, \frac{h_b}{2} \right]^T \right\|_2^2 \tag{10}$$

While $W_2^2(\mathcal{N}_a, \mathcal{N}_b)$ is a distance, it cannot be used directly as a similarity metric, which should fall within the range of 0 to 1 like the Intersection over Union (IoU). To address this, we use the Normalized Wasserstein Distance (NWD), defined by the exponential normalization of the Wasserstein distance:

$$NWD(\mathcal{N}_a, \mathcal{N}_b) = \exp\left( -\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C} \right) \tag{11}$$

where $C$ is a constant closely related to the dataset. NWDLoss enhances the traffic signal detection model's ability to recognize targets at different scales and in complex backgrounds. Compared to the traditional IoU-based loss function, applying NWDLoss to the traffic signal detection model can more effectively capture the subtle differences between traffic signals and backgrounds.
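
Eqs. (10) and (11) translate directly into a few lines of Python. The sketch below is illustrative: the function names are made up here, boxes are assumed to be given as `(cx, cy, w, h)`, the default `C = 12.8` is only a placeholder since the text does not state the value used, and taking the loss as `1 - NWD` is a common convention rather than something the text specifies.

```python
import math


def wasserstein_sq(box_a, box_b) -> float:
    """Squared 2-Wasserstein distance of Eq. (10) for boxes given as (cx, cy, w, h)."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    return ((cxa - cxb) ** 2 + (cya - cyb) ** 2
            + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)


def nwd(box_a, box_b, C: float = 12.8) -> float:
    """Normalized Wasserstein Distance of Eq. (11); C is a placeholder constant."""
    return math.exp(-math.sqrt(wasserstein_sq(box_a, box_b)) / C)


def nwd_loss(box_a, box_b, C: float = 12.8) -> float:
    """Assumed loss form: 1 - NWD, so identical boxes give zero loss."""
    return 1.0 - nwd(box_a, box_b, C)


a = (10.0, 10.0, 4.0, 4.0)
b = (13.0, 14.0, 4.0, 4.0)   # same size, centre shifted by (3, 4)
similarity = nwd(a, b)
```

Unlike IoU, the value stays strictly positive and varies smoothly even when two small boxes do not overlap at all, which is why it is attractive for small targets.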

2.4 Training Equipment and Parameter Setting

2.4.1 Experimental Environment and Parameter Adjustment

This paper is dedicated to optimizing the YOLOv8n algorithm for traffic signal detection. As a cutting-edge target detection technology, YOLOv8n has significant advantages in real-time performance and detection accuracy. However, experiments showed that the YOLOv8n baseline model suffers from problems such as overfitting and insufficient performance. We therefore addressed these problems by adjusting the hyperparameters and structure of the model. The experiments were run on a GPU rented from the AutoDL cloud computing platform. The experimental environment is shown in Table 1.

Table 1: Experimental environment configuration.

Configuration          Version
Framework              PyTorch 2.1.2
Programming Language   Python 3.10
GPU                    RTX 3080 Ti
Operating System       Linux Ubuntu 22.04

2.5 Model Evaluation Indicators

This study uses precision ($P$), recall ($R$), and mean average precision ($mAP$) as accuracy evaluation indicators. Specifically, $P$ is the proportion of predicted detections that are correct, while $R$ is the proportion of ground-truth targets that are correctly detected. $mAP$ is computed over predictions whose overlap (IoU) with the actual boxes exceeds 50%. The higher the $mAP$ value, the higher the prediction accuracy. In addition, this study also uses inference time and the number of network parameters as performance indicators. A shorter inference time indicates better real-time performance, while a smaller model size indicates lower memory usage. The number of true positive samples is denoted $TP$, the number of false positive samples $FP$, the number of false negative samples $FN$, the total number of samples $N$, and the number of detected traffic signal categories $Q$. The average precision of the $i$-th category is denoted $AP_i$. The calculation formulas are as follows.

$$P = \frac{TP}{TP + FP} \tag{12}$$

$$R = \frac{TP}{TP + FN} \tag{13}$$

$$AP_i = \frac{TP}{TP + FP} \div N \tag{14}$$

$$mAP = \frac{\sum_{i=1}^{Q} AP_i}{Q} \times 100\% \tag{15}$$
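
As a small worked example of Eqs. (12), (13) and (15), the sketch below computes precision and recall from raw counts and averages per-category AP values into mAP. The function names and the sample numbers are hypothetical; the per-category AP values are taken as given inputs rather than computed.

```python
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """Eq. (12): share of predicted detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Eq. (13): share of ground-truth targets that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """Eq. (15): mean of the per-category AP values, expressed in percent."""
    return 100.0 * sum(ap_per_class) / len(ap_per_class)


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed targets.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
m = mean_average_precision([0.5, 1.0])  # two categories
```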

3 Results

3.1 Implementation Details

In the experiments in this chapter, we used the traffic signal dataset on the Roboflow website for model training. The model input size is 640 × 640, and the initial learning rate (LR0) is set to 0.0001, which is intended to keep gradient descent efficient in the early stage of training while avoiding local optima in the later stage. The maximum number of training epochs is set to 100. To accelerate learning, the momentum factor is set to 0.937, the weight decay is set to 0.0005, and Dropout is set to 0.15. The settings are summarized in Table 2.

Table 2: Parameter settings

Parameter          Value
Input Image Size   640x640
Epochs             100
Batch Size         64
Patience           50
Optimizer          SGD
Learning Rate      0.0001
Momentum           0.937
Weight Decay       0.0005
Warmup Momentum    0.8
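
For readers who want to reproduce a comparable setup, the settings above can be collected into a single configuration. The sketch below assumes the Ultralytics Python package and its training argument names; the paper does not state which training code was used, so this mapping, the function name `train`, the default weights file `yolov8n.pt`, and the dataset YAML path are all assumptions. The function is defined but not called, so nothing is downloaded or trained when the block runs.

```python
# Hyperparameters from Table 2 and the accompanying text, keyed by the
# argument names used by the Ultralytics trainer (assumed mapping).
train_args = {
    "imgsz": 640,            # input image size 640x640
    "epochs": 100,
    "batch": 64,
    "patience": 50,          # early-stopping patience
    "optimizer": "SGD",
    "lr0": 0.0001,           # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "warmup_momentum": 0.8,
    "dropout": 0.15,         # value quoted in the text
}


def train(data_yaml: str, weights: str = "yolov8n.pt"):
    """Launch training with the settings above (requires the ultralytics package).

    data_yaml is a hypothetical path to a dataset description file.
    """
    from ultralytics import YOLO  # imported lazily so the block runs without it

    model = YOLO(weights)
    return model.train(data=data_yaml, **train_args)
```

This configures only the baseline YOLOv8n; the DAttention, DySample and NWDLoss modifications described earlier would require changes to the model definition and loss code.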

3.2 Experimental Results

In deep learning, the loss function plays a core role. It quantifies the model's error by comparing the model's predicted output with the actual observed value, and is a key factor in measuring the model's effectiveness. The YOLOv8n object detection algorithm relies on three main loss functions: the localization loss for optimizing spatial position prediction, the classification loss for improving category recognition accuracy, and the distribution focal loss (DFL) for refining box regression, which helps small object detection. The three formulas are as follows:

$$L_{box} = 1 - IoU + \frac{\rho^2}{c^2} + \alpha\nu \tag{16}$$

$$L_{cls} = \begin{cases} -q\left( q\log(p) + (1 - q)\log(1 - p) \right) & \text{if } q > 0, \\ -\alpha p \log(1 - p) & \text{if } q = 0. \end{cases} \tag{17}$$

$$L_{dfl} = -\left( (y_{i+1} - y)\log(S_i) + (y - y_i)\log(S_{i+1}) \right) \tag{18}$$
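
The localization term in Eq. (16) has the form of the CIoU loss, which can be sketched in plain Python as follows. The corner box format `(x1, y1, x2, y2)`, the small `eps` added for numerical safety, and the standard CIoU definitions $\nu = \frac{4}{\pi^2}(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h})^2$ and $\alpha = \nu / (1 - IoU + \nu)$ are assumptions of this sketch, since the text does not spell them out.

```python
import math


def ciou_loss(pred, gt, eps: float = 1e-9) -> float:
    """CIoU loss 1 - IoU + rho^2 / c^2 + alpha * v for boxes (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over Union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # rho^2: squared distance between the two box centres.
    rho2 = (((px1 + px2) - (gx1 + gx2)) / 2) ** 2 + (((py1 + py2) - (gy1 + gy2)) / 2) ** 2

    # c^2: squared diagonal of the smallest box enclosing both boxes.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # v measures aspect-ratio consistency; alpha weights it.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))   # close to 0
apart = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))      # no overlap: 1 + 8/18
```

The centre-distance term keeps the loss informative even when the boxes do not overlap, where plain IoU would be constant at zero.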

Localization loss measures the difference between the actual bounding box of the target object and the predicted bounding box; it depends directly on the distance between the two. Classification loss measures how accurately the model predicts the category of each target object. Generally, we assign a category label to each target category to be detected, and the model then predicts the category of the target object in each bounding box. Classification loss quantifies the difference between the predicted category and the actual category, which can be calculated by cross-entropy loss. Distribution focal loss (DFL) treats each box boundary as a discrete probability distribution and, as in Eq. (18), encourages the network to concentrate probability on the two values $y_i$ and $y_{i+1}$ closest to the target $y$.
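
The DFL term of Eq. (18) is easy to evaluate by hand. The minimal sketch below computes it for one target value lying between two neighbouring bins and compares the loss at the interpolation weights $S_i = \frac{y_{i+1} - y}{y_{i+1} - y_i}$, $S_{i+1} = \frac{y - y_i}{y_{i+1} - y_i}$ with the loss at a uniform guess. The function names and the numbers are illustrative assumptions.

```python
import math


def dfl(y: float, y_i: float, y_ip1: float, s_i: float, s_ip1: float) -> float:
    """Eq. (18): -((y_{i+1} - y) * log(S_i) + (y - y_i) * log(S_{i+1}))."""
    return -((y_ip1 - y) * math.log(s_i) + (y - y_i) * math.log(s_ip1))


def interpolation_weights(y: float, y_i: float, y_ip1: float):
    """Probabilities on the two neighbouring bins whose expectation equals y."""
    width = y_ip1 - y_i
    return (y_ip1 - y) / width, (y - y_i) / width


# Hypothetical target 2.3 lying between the bins 2 and 3.
s_i, s_ip1 = interpolation_weights(2.3, 2.0, 3.0)
loss_at_weights = dfl(2.3, 2.0, 3.0, s_i, s_ip1)
loss_uniform = dfl(2.3, 2.0, 3.0, 0.5, 0.5)
```

The loss is lower at the interpolation weights than at the uniform guess, which is the behaviour the term is meant to reward.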
The six images on the left of Figure 7 show how the loss functions of the DySample-YOLOv8n algorithm change on the training set and the validation set over 100 epochs of training. These images include the localization loss (box loss), classification loss (cls loss), and distribution focal loss (dfl loss). The horizontal axis represents the number of epochs, and the vertical axis represents the average value of the loss function. The curves show that in the early stage of training the loss value decreases rapidly. After about 20 epochs, the downward trend slows significantly and the loss eventually tends toward zero.

Figure 7: Loss function curve of DAYOLO algorithm

The changes in the loss function indicate that the algorithm achieves high accuracy and fine discrimination when identifying and classifying traffic sign objects. This also reflects that, as parameters are adjusted over the course of training, DySample-YOLOv8n effectively improves its recognition accuracy for small traffic signals, which reduces errors during target detection. The four images on the right side of the figure show the curves of the evaluation indicators precision, recall, mAP50 and mAP50-95 of the DySample-YOLOv8n algorithm during 100 epochs of training. The accuracy of the model increases rapidly in the early stages of training and then grows slowly as training continues, which shows that the improved model learns effectively and handles tasks such as traffic detection well.
$q$ usually represents the label, and IoU is a measure of the overlap between the predicted bounding
box and the true bounding box. $\rho$ represents the Euclidean distance between the center points of the two
rectangular boxes. $c$ represents the diagonal length of the smallest region enclosing the two rectangular
boxes, and $\nu$ measures the consistency of the aspect ratios of the two rectangular boxes. $\alpha$ is a
weight coefficient used to balance the different loss terms. $y$ usually represents the value of a general
distribution, and $i$ represents the index for items in iterations or sequences, with
$S_i = \frac{y_{i+1} - y}{y_{i+1} - y_i}$ and $S_{i+1} = \frac{y - y_i}{y_{i+1} - y_i}$. The
confusion matrix is shown in Figure 8.
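The symbols defined above match the usual CIoU formulation, CIoU = IoU - rho^2 / c^2 - alpha * nu, and the weights S_i and S_{i+1} are the linear interpolation weights of a target value y between two neighboring values y_i and y_{i+1}. The sketch below assumes that standard formulation, with boxes given as (x1, y1, x2, y2). The function names and the sample boxes are hypothetical.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """CIoU of two boxes (x1, y1, x2, y2), assuming the standard definition."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # IoU: intersection area over union area.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    iou = inter / (union + eps)

    # rho^2: squared Euclidean distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # c^2: squared diagonal of the smallest box enclosing both boxes.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # nu: aspect-ratio consistency; alpha: its trade-off weight.
    nu = (4 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = nu / ((1 - iou) + nu + eps)
    return iou - rho2 / c2 - alpha * nu


def interpolation_weights(y, y_i, y_next):
    """Weights S_i and S_{i+1} of y between its neighbors y_i <= y <= y_next."""
    s_i = (y_next - y) / (y_next - y_i)
    s_next = (y - y_i) / (y_next - y_i)
    return s_i, s_next


predicted = (1, 0, 3, 2)   # hypothetical predicted box
truth = (0, 0, 2, 2)       # hypothetical ground-truth box
print("CIoU:", round(ciou(predicted, truth), 4))
print("CIoU loss:", round(1 - ciou(predicted, truth), 4))
print("weights for y=2.3 between 2 and 3:", interpolation_weights(2.3, 2, 3))
```

The two weights always sum to 1, and the closer y lies to y_i, the larger S_i becomes.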

Figure 8: Confusion matrix.
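A confusion matrix of this kind can be turned into per-class precision and recall. The sketch below is a minimal illustration: it assumes rows are true classes and columns are predicted classes, and the three-class counts are invented for the example (they are not the values in Figure 8).

```python
def per_class_precision_recall(matrix):
    """Return a list of (precision, recall) pairs, one per class.

    Assumes matrix[t][p] counts samples of true class t predicted as class p.
    """
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # class k predicted as another class
        fp = sum(matrix[r][k] for r in range(n)) - tp   # other classes predicted as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results.append((precision, recall))
    return results


# Hypothetical counts for three classes.
matrix = [
    [48, 2, 0],
    [3, 45, 2],
    [1, 1, 38],
]

for k, (p, r) in enumerate(per_class_precision_recall(matrix)):
    print(f"class {k}: precision={p:.3f}, recall={r:.3f}")
```

If the matrix is stored with predicted classes as rows instead, the roles of precision and recall in this function are swapped.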

3.3 Ablation Experiment

To evaluate the effectiveness of the DAYOLO model, three sets of ablation experiments were conducted.
These experiments are designed to evaluate the effectiveness and feasibility of each improved module.
The results of these experiments are shown in Table 3.

Table 3: Ablation experiment results.

Baseline  IoU   DAttention  Dysample  nwdloss  Params   FLOPs (G)  FPS  Precision  Recall  mAP50
YOLOv8    CIoU  -           -         -        3208573  8.9        161  95.0%      90.3%   93.9%
YOLOv8    CIoU  ✓           -         -        3296349  8.8        159  95.5%      90.0%   95.4%
YOLOv8    CIoU  -           ✓         -        2818701  8.3        171  95.2%      93.5%   95.3%
YOLOv8    CIoU  -           -         ✓        3303274  8.9        158  95.8%      90.0%   96.4%
YOLOv8    CIoU  ✓           ✓         -        3244328  8.3        169  96.1%      92.6%   95.6%
YOLOv8    CIoU  -           ✓         ✓        3153293  8.4        168  95.8%      92.8%   97.2%
YOLOv8    CIoU  ✓           -         ✓        3061297  9.0        156  95.9%      93.5%   96.9%
YOLOv8    CIoU  ✓           ✓         ✓        3013773  8.2        167  96.1%      94.0%   96.9%
DAYOLO    CIoU  ✓           ✓         ✓        3013773  8.2        167  96.7%      94.6%   97.2%

In this experiment, ablation studies were conducted on the YOLOv8n model to evaluate the effects
of incorporating various modules, including DAttention, Dysample, and nwdloss, on model performance.
The baseline YOLOv8n model showed strong detection capabilities with a precision of 95.0%, a recall of
90.3%, and an mAP@0.5 of 93.9%, while maintaining a high inference speed (FPS = 161).

By adding the Dysample module, the precision increased slightly to 95.2%, and recall rose to 93.5%,
resulting in a higher mAP@0.5 of 95.3%. This indicates that Dysample improves the model’s detection
ability without sacrificing efficiency, as the FPS increased to 171.
The introduction of the nwdloss module increased the precision to 95.8%, though the recall slightly
decreased to 90.0%. The mAP@0.5 improved to 96.4%, but the FPS dropped to 158, suggesting that
while the nwdloss module enhances accuracy, it slightly affects processing speed.
Incorporating all three modules (DAttention, Dysample, and nwdloss) in the full DAYOLO model further
boosted the precision to 96.7%, recall to 94.6%, and mAP@0.5 to 97.2%, while maintaining a high FPS of
167. This combination delivered the best trade-off between detection accuracy and real-time performance.
Overall, the ablation study demonstrates that combining the three modules has the most significant
impact on the model’s performance, improving both accuracy and inference speed over the baseline. The
FLOPs decreased slightly from 8.9 G to 8.2 G, suggesting that these enhancements are
computationally efficient and provide a meaningful performance gain.
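The gains over the baseline can be checked directly from Table 3. The sketch below copies five rows of the table and computes the difference of each configuration from the YOLOv8n baseline. The dictionary layout, the row labels, and the helper name are illustrative choices, not part of the original experiments.

```python
# Rows copied from Table 3:
# (params, FLOPs in G, FPS, precision %, recall %, mAP50 %)
ABLATION = {
    "YOLOv8n baseline":   (3208573, 8.9, 161, 95.0, 90.3, 93.9),
    "+ DAttention":       (3296349, 8.8, 159, 95.5, 90.0, 95.4),
    "+ Dysample":         (2818701, 8.3, 171, 95.2, 93.5, 95.3),
    "+ nwdloss":          (3303274, 8.9, 158, 95.8, 90.0, 96.4),
    "DAYOLO (all three)": (3013773, 8.2, 167, 96.7, 94.6, 97.2),
}
FIELDS = ("params", "flops_g", "fps", "precision", "recall", "map50")


def delta_vs_baseline(name, baseline="YOLOv8n baseline"):
    """Difference of one configuration from the baseline, rounded to 0.1."""
    base = ABLATION[baseline]
    row = ABLATION[name]
    return {field: round(r - b, 1) for field, r, b in zip(FIELDS, row, base)}


for name in ABLATION:
    if name != "YOLOv8n baseline":
        print(name, delta_vs_baseline(name))
```

The precision, recall, and mAP50 differences are in percentage points. A negative value for params or flops_g means the configuration is lighter than the baseline.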
Figure 9 shows the impact of the different improved modules on the model’s precision, recall, and
mAP50.

Figure 9: mAP50 curve.

3.4 Comparison of Detection Performance between Different Models

To further confirm the advantages of the DAYOLO model in traffic signal detection, comparative
experiments with existing models were conducted on detection in complex and changeable conditions
(such as small targets, long distances, and occlusions). The models in the comparison include YOLOv3,
YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, and Faster R-CNN. The experimental results are evaluated
using several key indicators, including precision, recall, mAP@0.5, mAP@0.5:0.95, number of parameters,
FLOPs, and model size, as shown in Table 4.

Table 4: Model Performance Comparison

Model          Precision  Recall  mAP@0.5  mAP@0.5:0.95  Parameters (M)  FLOPs (G)  Model Size (MB)
YOLOv3         90.0%      88.7%   87.7%    80.4%         98.8            282.2      207.8
YOLOv3-tiny    86.3%      83.9%   85.7%    77.9%         11.6            18.9       24.4
YOLOv5s        93.5%      92.1%   91.6%    78.5%         8.7             23.8       18.5
YOLOv5m        94.8%      94.4%   93.8%    76.8%         23.9            64         50.5
YOLOv5l        94.3%      90.5%   93.7%    78.8%         50.7            134.7      106.8
YOLOv5x        94.8%      90.9%   94.1%    77.3%         92.7            246        195
YOLOv6s        93.2%      90.7%   94%      75.5%         15.5            44         32.8
YOLOv6m        94.4%      89.9%   91.8%    75.3%         49.6            161.1      104.3
YOLOv6l        91.1%      88.4%   89.9%    74.4%         105.7           391.2      222.2
YOLOv6x        90.7%      87.3%   91.1%    75.4%         165             610.2      346.5
YOLOv7         93.8%      91.9%   91.2%    78.8%         35.5            105.1      74.8
YOLOv7-tiny    88.8%      87.2%   85.2%    72.9%         5.7             13.2       12.3
YOLOv8n        95.0%      90.3%   93.9%    82.6%         3.2             8.9        6.2
YOLOv8s        93.8%      91.9%   92.1%    82.1%         11.2            28.6       22.5
YOLOv8m        94.9%      88.5%   90.6%    81.1%         25.9            78.9       52
YOLOv8l        94.3%      89.2%   91.9%    80.3%         43.7            165.2      87.7
YOLOv8x        94.8%      87.2%   89.1%    79.4%         68.2            257.8      136.7
YOLOv9m        95.0%      92.1%   93.9%    79.5%         25.3            76.3       51.6
Faster R-CNN   95.5%      93.9%   91.2%    79.1%         41.4            239.3      321
DAYOLO (Ours)  96.7%      94.6%   97.2%    84.6%         3.0             8.2        6.1

Across these comparisons, the mAP@0.5 of the DAYOLO model reaches 97.2%, which is higher than that of
every other model in Table 4. For example, the mAP@0.5 of YOLOv5s is 91.6% and that of YOLOv6s is 94%,
so DAYOLO improves on them by 5.6 and 3.2 percentage points, respectively. DAYOLO also has a clear
advantage over Faster R-CNN, which has the second-highest precision in the comparison: Faster R-CNN’s
mAP@0.5 is 91.2%, which is 6.0 percentage points lower than DAYOLO’s 97.2%. At the same time, the
DAYOLO model reaches a recall of 94.6%, surpassing the YOLO series models and Faster R-CNN in Table 4,
indicating that DAYOLO has higher target capture capabilities in complex scenes.
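One way to read the accuracy versus size trade-off in Table 4 is to ask which models are not beaten on both mAP@0.5 and parameter count at once. The sketch below applies a simple Pareto-dominance check to a subset of rows copied from Table 4. The choice of rows, the two criteria, and the function names are assumptions made for illustration.

```python
# (model, mAP@0.5 in %, parameters in millions), copied from Table 4
MODELS = [
    ("YOLOv3-tiny", 85.7, 11.6),
    ("YOLOv5s", 91.6, 8.7),
    ("YOLOv6s", 94.0, 15.5),
    ("YOLOv7-tiny", 85.2, 5.7),
    ("YOLOv8n", 93.9, 3.2),
    ("YOLOv9m", 93.9, 25.3),
    ("Faster R-CNN", 91.2, 41.4),
    ("DAYOLO", 97.2, 3.0),
]


def dominates(a, b):
    """True if a is at least as accurate and as small as b, and strictly better in one."""
    return a[1] >= b[1] and a[2] <= b[2] and (a[1] > b[1] or a[2] < b[2])


def pareto_front(models):
    """Names of the models that no other model in the list dominates."""
    return [m[0] for m in models if not any(dominates(other, m) for other in models)]


print("with DAYOLO:   ", pareto_front(MODELS))
print("without DAYOLO:", pareto_front(MODELS[:-1]))
```

On these rows DAYOLO has both the highest mAP@0.5 and the fewest parameters, so it is the only model left on the front. Without it, YOLOv6s and YOLOv8n remain.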
The test set was evaluated with the YOLOv8n and DAYOLO models, and sample detection results are
shown in Figure 10.
As Figure 10 shows, the algorithm proposed in this paper performs well in normal environments. In the
first pair of images, the detection accuracy of the YOLOv8n algorithm in panel (a) is 75%, while that of
the DAYOLO algorithm in panel (b) reaches 97%, so DAYOLO is substantially more accurate than YOLOv8n
when detecting the speed limit sign. The second pair of images (panels (c) and (d)) gives a further
comparison: the accuracy of the YOLOv8n algorithm in panel (c) is 93%, while that of the DAYOLO algorithm
in panel (d) rises to 97%. In this case, DAYOLO performs slightly better. Overall, DAYOLO is more robust
and consistent across different scenes. In summary, these comparative experiments show that the DAYOLO
algorithm has excellent detection accuracy and reliability in normal environments and outperforms the
standard YOLOv8n algorithm. The optimization of the algorithm in this paper achieves higher accuracy in
traffic sign detection tasks, especially under complex and occluded conditions, providing a more effective
solution for traffic signal detection.

(a) YOLOv8n test results (b) DAYOLO test results

(c) YOLOv8n test results (d) DAYOLO test results

Figure 10: Comparison diagram of algorithms in normal environment.

4 Conclusions
In view of the challenges in traffic signal detection, especially the demand for high-precision recognition
driven by the development of autonomous driving technology, this study proposes an improved DAYOLO
model based on YOLOv8n. First, the introduction of the DAttention module extracts features more
effectively, especially in complex backgrounds and multi-scale target scenes, which improves the detection
accuracy. Secondly, the introduction of DySample reduces the computational overhead, improves the
detection accuracy of the model, and accelerates inference. Finally, nwdloss is adopted to better handle
prediction results under complex distributions and to better measure the distance between the predicted
box and the true box in the feature space, thereby improving the accuracy of target detection.
Experimental results show that DAYOLO performs better than algorithms such as SSD, Faster R-CNN, YOLOv4,
YOLOv5, and YOLOv7. Compared with the baseline model YOLOv8n, DAYOLO improves mAP@0.5 by 3.3
percentage points, precision by 1.7 percentage points, and recall by 4.3 percentage points (Tables 3 and 4).
These improvements enable DAYOLO to balance model size and detection accuracy, meet the requirements of
traffic signal detection, and provide new ideas and methods for future research in this field.
Future work can explore applying these improvements in other fields and scenarios, and combining them
with other advanced deep learning techniques to achieve more efficient and accurate target detection.

References
[1] Hastie, Trevor, et al. ”Multi-class adaboost.” Statistics and its Interface 2.3 (2009): 349-360.
[2] Dalal, Navneet, and Bill Triggs. ”Histograms of oriented gradients for human detection.” 2005 IEEE computer society

conference on computer vision and pattern recognition (CVPR’05). Vol. 1. Ieee, 2005.

[3] Zaklouta, Fatin, and Bogdan Stanciulescu. ”Real-time traffic-sign recognition using tree classifiers.” IEEE Transactions on

Intelligent Transportation Systems 13.4 (2012): 1507-1514.


[4] Zaklouta, Fatin, and Bogdan Stanciulescu. ”Real-time traffic sign recognition in three stages.” Robotics and autonomous
systems 62.1 (2014): 16-24.
[5] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks.
Advances in neural information processing systems (pp. 1097–1105).
[6] Girshick, R. (2015). Fast r-cnn. Proceedings of the IEEE International Conference on Computer Vision (pp. 1440-1448).
[7] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). Ssd: single-shot multi-box detector.
European conference on computer vision (pp. 21–37).
[8] Jiang, Peiyuan, et al. ”A Review of Yolo algorithm developments.” Procedia computer science 199 (2022): 1066-1073.

[9] HUANG S H, LU Z C, CHENG R, et al. FaPN: Feature-aligned Pyramid Network for Dense Image Prediction[J] IEEE/ CVF

International Conference on Computer Vision (ICCV) (2021): 844-853

[10] WANG G R, WANG K Z, LIN L. Adaptively Connected Neural Networks[J]. IEEE/ CVF Conference on Computer Vision

and Pattern Recognition (CVPR) (2019): 1781-1790.


[11] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014) . Rich feature hierarchies for accurate object detection and semantic
segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580–587).
[12] Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of
the thirteenth international conference on artificial intelligence and statistics (pp. 249–256).
[13] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: towards real-time object detection with region proposal
networks. Advances in neural information processing systems (pp. 91–99).
[14] He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. Proceedings of the IEEE international conference on
computer vision (pp. 2961–2969).
[15] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks.
Advances in neural information processing systems (pp. 1097–1105).
[16] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference
on computer vision and pattern recognition (pp. 770–778).
[17] Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167.

18
[18] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint

arXiv:1409.1556.

[19] Liu W, Lu H, Fu H, et al. Learning to Upsample by Learning to Sample[C]//Proceedings of the IEEE/CVF International

Conference on Computer Vision. 2023: 6027-6037.

[20] Reis D, Kupec J, Hong J, et al. Real-time flying object detection with YOLOv8[J]. arXiv preprint arXiv:2305.09972, 2023.

[21] Hossain M S, Betts J M, Paplinski A P. Dual focal loss to address class imbalance in semantic segmentation[J]. Neurocom-

puting, 2021, 462: 69-87.

[22] Redmon J, Farhadi A. YOLO9000: better, faster, stronger[C]//Proceedings of the IEEE conference on computer vision and

pattern recognition. 2017: 7263-7271.

[23] Redmon J, Farhadi A. Yolov3: An incremental improvement[J]. arXiv preprint arXiv:1804.02767, 2018.

[24] Yang, G.; Wang, J.; Nie, Z.; Yang, H.; Yu, S. A Lightweight YOLOv8 Tomato Detection Algorithm Combining Feature

Enhancement and Attention. Agronomy 2023, 13, 1824. [Link]

[25] Jiang, T.; Zhou, J.; Xie, B.; Liu, L.; Ji, C.; Liu, Y.; Liu, B.; Zhang, B. Improved YOLOv8 Model for Lightweight Pigeon

Egg Detection. Animals 2024, 14, 1226. [Link]

[26] Yang, H.; Qiu, S. A Novel Dynamic Contextual Feature Fusion Model for Small Object Detection in Satellite Remote-Sensing

Images. Information 2024, 15, 230. [Link]

[27] Hussain M. YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and

industrial defect detection[J]. Machines, 2023, 11(7): 677.

[28] Terven J, Córdova-Esparza D M, Romero-González J A. A comprehensive review of yolo architectures in computer vision:

From yolov1 to yolov8 and yolo-nas[J]. Machine Learning and Knowledge Extraction, 2023, 5(4): 1680-1716.

[29] Lou H, Duan X, Guo J, et al. DC-YOLOv8: small-size object detection algorithm based on camera sensor[J]. Electronics,

2023, 12(10): 2323.

[30] Soylu E, Soylu T. A performance comparison of YOLOv8 models for traffic sign detection in the Robotaxi-full scale au-

tonomous vehicle competition[J]. Multimedia Tools and Applications, 2024, 83(8): 25005-25035.

19
