Violence Detection Using Computer Vision Approaches
Conference Paper · June 2022
DOI: 10.1109/AIIoT54504.2022.9817374
Violence Detection Using Computer Vision Approaches
Koushik Bandapadya, Khalid Raihan Talha, Mohammad Monirujjaman Khan
Department of Electrical & Computer Engineering
North South University
Bashundhara, Dhaka-1229, Bangladesh
[Link]@[Link]
Abstract - Violent crime has always been a major social problem. The rise of violent behavior in public areas can be attributed to a variety of factors. Greed, frustration, and hostility among individuals, as well as social and economic anxieties, are the primary causes of increased violence. It is critical to protect our possessions, as well as our lives, from threats such as robbery or homicide. It is impossible to prevent crime and violent acts unless brain signals are studied and a certain pattern deduced from criminal ideas is detected in real time. Because this is not yet technologically feasible, it has yet to be realized. However, using deep learning-based computer vision technologies, we can detect violent activities in public areas. The goal of this project is to build a real-time violent activity monitoring system that is capable of detecting violence quickly and efficiently. The public of any city can benefit from it, as it will allow law enforcement personnel to take the necessary actions to prevent violent activities. When the system is implemented, it will be able to detect, using cameras, the speed of people's movements and their distances from other people walking in public places. The system will mainly detect the speed of the hand and leg movements of a person who is very close to another person. If anyone is identified as a perpetrator of violence, the server side of the system will notify, within a very short time, the people responsible for preventing violence. The system was built using the concepts of computer vision and neural networks. It has been developed and tested initially on the personal computing devices of the system developers. The system is easy to design and develop, which makes it easy to use for any kind of public area surveillance. At the same time, the system gives its desired output due to its high accuracy.

Index Terms - Violence detection, convolutional neural networks, LSTM, Computer vision.

I. INTRODUCTION

For a long time, one of the major issues has been the occurrence of violence in daily life. It can easily destroy the peace and harmony of any society. Criminal activity declined considerably from 2014 to 2017. Starting in 2017, however, it began to rise again: from 2017 to 2018 there was an increase of 6.79% [1]. Violent behavior in public areas happens due to various factors. Individual greed, frustration, and hatred, as well as social and economic insecurity, are the leading causes of violence. To solve this issue, expected or unexpected violence should be detected at an early stage so that it can be stopped as soon as possible.

Computer vision and deep learning have recently been used to investigate human actions and behavior. Even though it is the scariest social problem, very few works automate the detection of actions, violence, or protests. In terms of social security and stability, this field of study is quite useful. It is impossible to prevent crime and violent acts unless brain signals are studied and a specific pattern derived from criminal thinking is discovered in real time. Because this is not yet technologically feasible, it has yet to be accomplished. Using deep learning-based computer vision, we can now detect aggressive activity in public areas. The majority of public and private institutions already have CCTV [2]. Effective violence detection techniques can assist the government or authorities in taking a quick and systematic approach to identifying violence and preventing the loss of human life and property. As human beings and members of society, we all desire to have secure streets, communities, and workplaces. Because it does not involve any explicit feature engineering, deep learning can outperform classical machine learning. There are some disadvantages, including high processing costs and large training datasets. These technological considerations drive us to create a model that requires less training time and a smaller number of training examples. Using deep learning methodologies, we offer approaches in our system that will be able to spot violent threats and activities.

Previously, the presence of a body, the degree of action, and even aspects of the sound associated with violent activities were used to distinguish between violent and non-violent activities. Surveillance cameras are not very effective in recording sounds related to certain activities (Audio-visual content-based violent scene characterization) [3]. Frame-based video analysis, on the other hand, is purely based on a sequence of frames (that is, pictures) rather than sounds. There are various sorts of violence, including one-on-one violence, mob violence, family violence, sports violence, gun violence, and many others. Violence detection with a C3D Convolutional Neural Network (3D-CNN) was one of the earlier works to find violent scenes in a video stream. The 3D-CNN is a deep supervised learning system that uses videos (sequences of image frames) to learn spatiotemporal discriminant features. Unlike 2D convolutions, this method applies 3D kernels to a series of image frames in their context, resulting in 3D activation maps that capture both spatial and temporal information. Three datasets were combined for this task: hockey fight, movies, and crowd violence [4]. They were able to get an accuracy of 84.428% at the 36th training epoch [5]. Another contribution uses convolutional neural networks (CNNs) and the Google Object
Detection API to retrain a pre-trained model to perform weapon detection in real-time surveillance. From one of the latest contributions, we learned that this problem can also be solved by using convolutional neural networks. By scanning the sequential flow of video frames, a CNN with a bidirectional LSTM (CNN-BiLSTM) architecture is used to detect violence in real time. They reported more than 98% accuracy for their three different models [6]. Unlike their work, we create only a single model, which will not only decrease the server load but also reduce the response time of the applications where it is deployed.

To predict violence in the sequential flow of frames, we utilize the Convolutional Neural Network Bidirectional LSTM (CNN-BiLSTM) architecture. To begin with, we divide a video into numerous frames. We pass each frame through a convolutional neural network to extract the information present in the current frame. Then, to recognize any sequential flow of events, we utilize a bidirectional LSTM layer to compare the information of the current frame once with the prior frames and once with the upcoming frames. Finally, the classifier determines whether or not an action is violent.

After introducing our topic, we go directly to the methodology in Section 2, where we discuss the steps and ways to implement our system, including experimental setup, data processing, and training methods. Then we discuss the results of our work with qualitative and quantitative data in Section 3. Under Section 3, we illustrate accuracy evaluation and accuracy comparison. At the end, we present the conclusion of our paper, including the necessary figures and tables, as well as the prospects for future upgrades.

The literature evaluation states that most systems are not operating in Bangladesh and do not fully meet the needs of our customer’s criteria. The main initiative has been based on a website. In the next stages, we intend to make this Android/iOS based mobile application more user-friendly. This system’s most appealing feature is that the vendor may communicate directly with customers to advertise the goods and obtain feedback. Because it is quite versatile in terms of expansion, the project can be upgraded in the near future as and when the need arises. This site can have several branches, and additional features such as a virtual shopping basket and a virtual trial room can be added to make it more robust.

Firstly, we have given a brief idea of the project. Then we present an introduction to the e-commerce site. We look through the problem statements as well as the existing system and compare them with our site. We’ve talked about the remaining gaps in the project and discussed its implementation. The rest of the sections of this paper are as follows: Section 2 covers the methodology; Section 3 presents the results and analysis of the system; and finally, the conclusions are set out in Section 4 along with references.

II. METHODOLOGY

In this section, we discuss the proposed framework for creating our computer vision model and its architecture.

A. Model Architecture

Our model must be able to predict sequences in successive frames, such as a pattern in the movement of the individuals or the degree of their motion, to classify violent or non-violent activities. This is not possible by considering only the spatial features (features belonging to a particular frame) of the frames. While detecting sequences in frames, temporal or time-related factors must be taken into account. The temporal features can be handled in either a forward or backward order. Our model processes the temporal features in both directions in addition to the spatial features, which helps the model become more accurate while consuming less computational time. Lightweight models are preferred in surveillance because of their low cost. The model consists of three sub-parts [7].

1) CNN: The Convolutional Neural Network (CNN) is the most common neural network in the field of computer vision for detecting and classifying images. Ours comprises an input convolutional layer followed by three layers of convolution and max pooling. The kernel size for each convolutional layer is 3×3, and 64 kernels are used in each convolutional layer. After passing through the "relu" activation function, the output from each convolutional layer is max pooled to extract the features. Each max pooling uses a filter size of 2×2. Finally, the features are flattened and sent to the next model. The TensorFlow [8] and Keras [9] APIs have been used to implement the convolutional neural networks. The basic CNN structure is displayed in Fig. 1 [10].

Fig. 1. Convolutional Neural Network (CNN)

2) The Bidirectional LSTM Cells: The basic LSTM cell appears in Fig. 2 [6]. Long short-term memory (LSTM) cells are frequently used to revisit a portion of previously processed features. An LSTM mimics the way the human brain keeps already processed events in mind. The first layer in an LSTM cell is known as the forget gate layer, denoted by f_t. It is passed through a sigmoid function to obtain an output between 0 and 1. A value of 0 indicates a forget state and 1 indicates a remember state. The equation of the forget gate layer follows the standard LSTM formulation.
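For reference, the gate computations described in this subsection correspond to the standard LSTM formulation sketched below. The symbols $f_t$, $i_t$, $o_t$, $c_t$, $h_t$, $x_t$, $b_f$, $b_i$, and $b_o$ follow the text; the weight matrices $W_\ast$ and $U_\ast$, the candidate-state bias $b_c$, and the element-wise product $\odot$ are assumed notation that the text does not define, so this should be read as the common textbook form rather than the authors' exact equations.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

Here $\sigma$ is the sigmoid function, so every gate value lies between 0 and 1: a forget-gate entry near 0 discards the corresponding component of $c_{t-1}$, and an entry near 1 keeps it.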
Fig. 2. Basic LSTM cell

The next layer is called the input gate layer. In this layer, the remembered state data is combined with the new features. The output from the forget gate layer is multiplied by the cell state vector of the previous LSTM cell (c_{t-1}). The result is added to the output from the input gate layer, which is multiplied by the hidden state vector of the last state after it has passed through a "tanh" operation, to form the cell state vector (c_t) for the next LSTM cell. This vector, upon passing through a "tanh" function, is multiplied by the hidden state vector of the previous state (h_{t-1}) after it has passed through a "sigmoid" function, to create the hidden state vector (h_t) for the next LSTM cell. Then, in the last layer (the output gate layer), some of the features from the previous state and the newly acquired features from the current cell are added together and sent to the next state.

In the gate equations, x_t is the input vector to the LSTM unit, and b_f, b_i, and b_o are the bias vectors for the forget gate layer, input gate layer, and output gate layer, respectively. In the LSTM, the features are remembered and passed from state 1 to state 2 and on to state n. The LSTM can also work in the reverse direction; the features are then remembered and passed from state n back to state 2 and state 1. By combining both mechanisms, we obtain a bidirectional LSTM layer, as shown in Fig. 3. The bidirectional LSTM cells are more accurate in storing data. For violence detection, a bidirectional LSTM compares the sequence of frames once in the forward direction and once in the reverse direction. This mechanism gives our model more strength by adding different cell states and training features.

Fig. 3. BiLSTM Shell

3) The Dense Layers: Dense layers are omnipresent in deep learning. Here, the fully connected dense layers apply weights (W_i), initialized randomly, to the features (X_i) and, over a certain number of epochs, learn which set of features gives the best accuracy after passing through an activation function. The entire architecture of our proposed model is shown in Fig. 4 [6].

Fig. 4. Node Architecture

II. EXPERIMENTAL SETUP

A. Data Processing

Frames have been extracted from the videos. The extracted frames are reshaped to 100×100 pixels (denoted as x × y). The training data is a NumPy array, with each of its rows representing a sequence or pattern in the videos. A sequence might capture a degree of movement and action, for example whether a movement of the arm is a punch or a handshake. The minimum number of frames required to extract a sequence is 2. However, we have used 10 consecutive frames (denoted as n) to extract the temporal features (that is, time-related features). The total number of samples (denoted by N) is the total number of such sequences in the dataset, that is, (total number of frames) / (number of frames in a sequence). For a simple implementation, NumPy allows the placeholder value -1 to be used for this dimension, which it then infers. Hence, a structure containing sequences of 10 consecutive frames with their respective class labels is prepared. The shape of the training data is (-1, n, x, y, c), where c represents the number of channels in each frame. A pictorial representation of the training data is shown in Fig. 5 [6].

Fig. 5. Visualization of the training data
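The data layout described under Data Processing can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' script: the helper name `build_sequences`, the rule that a sequence takes the label of its first frame, the dropping of leftover frames, and the synthetic all-zero frames are assumptions made for the example.

```python
import numpy as np

def build_sequences(frames, labels, n=10):
    """Group consecutive frames into non-overlapping sequences of length n.

    frames: array of shape (total_frames, x, y, c)
    labels: array of shape (total_frames,), one class label per frame
    Returns (sequences, sequence_labels).
    """
    # Drop trailing frames that do not fill a whole sequence (assumption).
    total = (len(frames) // n) * n
    x, y, c = frames.shape[1:]
    # -1 lets NumPy infer N = total // n, giving shape (N, n, x, y, c).
    sequences = frames[:total].reshape(-1, n, x, y, c)
    # Assumption: each sequence inherits the label of its first frame.
    sequence_labels = labels[:total:n]
    return sequences, sequence_labels

# Synthetic stand-in for 45 frames already resized to 100x100 with 3 channels.
frames = np.zeros((45, 100, 100, 3), dtype=np.float32)
labels = np.zeros(45, dtype=np.int64)

sequences, sequence_labels = build_sequences(frames, labels)
print(sequences.shape)        # (4, 10, 100, 100, 3)
print(sequence_labels.shape)  # (4,)
```

With 45 input frames and n = 10, the last 5 frames are discarded and N = 4 sequences remain, which matches the (total number of frames) / (number of frames in a sequence) rule when the division is rounded down.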
B. Data Frame Separation

The video datasets are divided in a 90/10 ratio by random selection. 10% of the images and videos are used for testing in the evaluation step, and 90% are fed into the model for training purposes; this can be done with a Python script. The weapon image dataset, on the other hand, is divided in an 80/20 ratio by random selection.

C. Model Training

To find fights between people, sets of 10 frames with dimensions of 100×100 were fed into a model with the architecture shown in Fig. 4 to extract the spatial and temporal features. Stochastic gradient descent has been used as the optimizer with a learning rate of 0.01 and a decay rate of 1e-6. The loss function used in this paper is "sparse categorical cross-entropy". In this classification problem, we have used "0 or 1" as class labels, instead of one-hot encoding, with a batch size of 5 samples. For training and testing purposes, the datasets are divided in a 9:1 ratio. To keep the computation cost low, the whole model was built and trained from scratch for 25 epochs.

D. Model Testing

Once the model finishes training, a test dataset is used to evaluate the model and output the average precision and mean average precision (mAP). The script then prints the result from the model at the command prompt. The testing process can be run on the existing trained model.

Fig. 6. System Layout

E. Requirement
i) Software
a. TensorFlow with GPU support - An open-source software library used for machine learning.
b. Python 3.9.x
c. Algorithm - CNN, RNN, LSTM, Deep Learning, Computer Vision, Visual Studio Code.
d. Libraries used: Keras, Numpy, Tensorflow Object Detection.
ii) Hardware
Processor - AMD Ryzen 5 2400G
RAM - 16.0 GB
GPU - Radeon Vega 11 Graphics
Operating System - Windows 10 Professional 64 bit
iii) Resources
a. Movies Fight Detection Dataset - 200 video clips
b. Hockey Fight Detection Dataset - 1000 video clips
c. Crowd / Violent-Flows Fight Detection Dataset - 246 video clips

F. Dataset

The effectiveness of the CNN Bidirectional LSTM model architecture has been validated by running it on the standard datasets for violent and non-violent action detection, namely the Hockey Fight dataset [11], the Movies dataset [11], and the Violent Flows dataset [12], and on the Weapons dataset for image classification and object detection tasks.

The Hockey Fight Dataset: The Hockey Fight dataset contains clips from ice hockey matches. The dataset has 500 violent clips and 500 non-violent clips with an average duration of 1 s. The clips have similar backgrounds and subjects. Source: Hockey Fight Detection Dataset - Academic Torrents.

The Movies Dataset: The Movies dataset contains clips from different movies for the action (fight) sequences, whereas the non-fight sequences consist of clips from action recognition datasets. The dataset has 100 violent clips and 100 non-violent clips with an average duration of 1 s. Unlike the Hockey Fight dataset, the clips from movies have different backgrounds and subjects. Source: Movies Fight Detection Dataset - Academic Torrents.

The Violent Flows Dataset: The Violent Flows dataset deals with crowd violence. The dataset consists of videos of human actions from the real world, CCTV footage of crowd violence, and YouTube videos, and follows the standard benchmark protocols. The dataset consists of 246 videos with balanced samples. Source: Crowd Violence \ Non-violence Database ([Link]).

III. RESULT & ANALYSIS

In this section, we discuss the results of our proposed framework and analyze the strengths and weaknesses of our model.

A. Accuracy Evaluation

The CNN-BiLSTM model architecture handles our chosen datasets efficiently. Each of our datasets is divided into ten parts: nine for training the model and one for validation. From each training epoch, we obtain the training accuracy, training loss, validation accuracy, and validation loss.

B. Hockey Fight Dataset

For training the model on the hockey dataset, we used 10 epochs. Every epoch had 55 steps with a batch size of 16. From Figs. 7 and 8, we can see that the maximum accuracy achieved was 94.9% for training and 96.94% for validation.

Fig. 7. Training and validation accuracy achieved from the hockey fight dataset
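The relation between steps per epoch and batch size quoted for the hockey run can be made concrete with a short calculation. The sketch assumes the usual convention that one epoch takes ceil(samples / batch_size) steps, which the text does not state explicitly; the function names are illustrative only.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

def sample_range(steps, batch_size):
    """Smallest and largest sample counts that give exactly `steps` batches."""
    return (steps - 1) * batch_size + 1, steps * batch_size

# Values quoted in the text: 55 steps per epoch at a batch size of 16.
low, high = sample_range(steps=55, batch_size=16)
print(low, high)  # 865 880
```

Under that assumption, 55 steps at a batch size of 16 corresponds to between 865 and 880 training samples per epoch.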
Fig. 8. Training and validation loss achieved from the hockey fight dataset

C. Movie dataset

After obtaining the first trained model from the hockey dataset, we fine-tuned it on the movie dataset for 10 epochs. Each epoch runs for 55 steps. From Figs. 9 and 10, we can see that the maximum accuracy achieved was 92.92% for training and 96.94% for validation.

Fig. 9. Training and validation accuracy achieved from the movies dataset

Fig. 10. Training and validation loss achieved from the movies dataset

D. Violence Flows Dataset

After obtaining the first trained model from the hockey fight dataset, we fine-tuned it on the violence flows dataset for 10 epochs. Each epoch runs for 55 steps. From Figs. 11 and 12, we can see that the maximum accuracy achieved was 77.31% for training and 80% for validation.

Fig. 11. Training and validation accuracy achieved from the violence flows dataset
Fig. 12. Training and validation loss achieved from the violence flows dataset

E. Accuracy Comparison

From the accuracy evaluation, we can see that our final model has achieved an accuracy of more than 77% for training and 80% for validation. The comparison between our model and its architecture and other existing models and their architectures is given in Table I.

TABLE I
A comparison between the accuracy of our model and the existing models

Methods                       Hockey        Movies     Violence Flows
MoIWLD [13]                   96.8±1.04%    -          93.19±0.12%
ViF+OViF [14]                 87.5±1.7%     -          88±2.45%
Spatiotemporal Encoder [15]   98.1±0.58%    100±0%     93.87±2.58%
Conv 3D [16]                  98.3±0.81%    100±0%     97.17±0.95%
CNN-LSTM [6]                  97.1±0.55%    100±0%     94.57±2.34%
CNN-BiLSTM (our model)        94.9%         92.92%     77.31%

Our main comparison was with the Conv 3D [16] and CNN-LSTM [6] models. The Conv 3D model was created by extracting all the frames into a single folder and then training on it, so all the datasets were used at once to create the model. In the previous work on CNN-BiLSTM, three separate models were created from three different datasets. In our case, we used CNN-BiLSTM to create our first model from the hockey fight dataset and then fine-tuned that model on the other two datasets (movies and violence flows).

IV. CONCLUSION

Our proposed CNN-BiLSTM based violence detection system can make society a secure place for peace-loving people. By training the model once and fine-tuning it twice, we were able to achieve decent accuracy for training, validation, and testing. Using our proposed framework, we obtained the final results in the final fine-tuning stage of our model. We could also see that at the last stage our results nearly stabilized. Despite the satisfactory performance of our proposed model, it needs to be further validated with more standard datasets in which one-to-many or many-to-many violent activities can be identified. In future work, we aim to increase the accuracy of our model while maintaining our model architecture. Our model will have combined violence and weapon detection capabilities. In the near future, we plan to detect metal by using thermal vision cameras, which will allow us to differentiate between real guns and fake guns [17]. We will also give our system the capability to determine whether a gun holder is a member of the law enforcement team (police) or not. In the near future, our system will also be able to use night vision [18] and thermal vision [19] technologies to find violent activities.

REFERENCES
[1] “Bangladesh Crime Rate & Statistics 2000-2022.” [Online] Available: [Link]statistics (accessed Jan. 13, 2022).
[2] M. Ramzan et al., “A Review on State-of-the-Art Violence Detection Techniques,” IEEE Access, vol. 7. pp. 107560–107575, 2019. doi: 10.1109/access.2019.2932114.
[3] “Audio-visual content-based violent scene characterization.” [Online] Available: [Link]umber=15617 (accessed Nov. 11, 2021).
[4] S. Accattoli, P. Sernani, N. Falcionelli, D. N. Mekuria, and A. F. Dragoni, “Violence Detection in Videos by Combining 3D Convolutional Neural Networks and Support Vector Machines,” Applied Artificial Intelligence, vol. 34, no. 4. pp. 329–344, 2020. doi: 10.1080/08839514.2020.1723876.
[5] F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq, and S. W. Baik, “Violence Detection Using Spatiotemporal Features with 3D Convolutional Neural Network,” Sensors, vol. 19, no. 11, May 2019, doi: 10.3390/s19112472.
[6] R. Halder and R. Chatterjee, “CNN-BiLSTM Model for Violence Detection in Smart Surveillance,” SN Computer Science, vol. 1, no. 4. 2020. doi: 10.1007/s42979-020-00207-x.
[7] R. Halder and R. Chatterjee, “CNN-BiLSTM Model for Violence Detection in Smart Surveillance,” SN Computer Science, vol. 1, no. 4. 2020. doi: 10.1007/s42979-020-00207-x.
[8] “TensorFlow,” TensorFlow. [Online] Available: [Link] (accessed Jan. 15, 2022).
[9] Keras Team, “Keras: the Python deep learning API.” [Online] Available: [Link] (accessed Jan. 15, 2022).
[10] Phung, Phung, and Rhee, “A High-Accuracy Model Average Ensemble of Convolutional Neural Networks for Classification of Cloud Image Patches on Small Datasets,” Applied Sciences, vol. 9, no. 21. p. 4500, 2019. doi: 10.3390/app9214500.
[11] E. B. Nievas, O. D. Suarez, G. B. García, and R. Sukthankar, “Violence Detection in Video Using Computer Vision Techniques,” Computer Analysis of Images and Patterns. pp. 332–339, 2011. doi: 10.1007/978-3-642-23678-5_39.
[12] T. Hassner, Y. Itcher, and O. Kliper-Gross, “Violent flows: Real-time detection of violent crowd behavior,” 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. 2012. doi: 10.1109/cvprw.2012.6239348.
[13] T. Zhang, W. Jia, X. He, and J. Yang, “Discriminative Dictionary Learning With Motion Weber Local Descriptor for Violence Detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3. pp. 696–709, 2017. doi: 10.1109/tcsvt.2016.2589858.
[14] Y. Gao, H. Liu, X. Sun, C. Wang, and Y. Liu, “Violence detection using Oriented VIolent Flows,” Image and Vision Computing, vol. 48–49. pp. 37–41, 2016. doi: 10.1016/[Link].2016.01.006.
[15] A. Hanson, K. Pnvr, S. Krishnagopal, and L. Davis, “Bidirectional Convolutional LSTM for the Detection of Violence in Videos,” Lecture Notes in Computer Science. pp. 280–295, 2019. doi: 10.1007/978-3-030-11012-3_24.
[16] J. Li, X. Jiang, T. Sun, and K. Xu, “Efficient Violence Detection Using
3D Convolutional Neural Networks,” 2019 16th IEEE International
Conference on Advanced Video and Signal Based Surveillance (AVSS).
2019. doi: 10.1109/avss.2019.8909883.
[17] A. Castillo, S. Tabik, F. Pérez, R. Olmos, and F. Herrera, “Brightness
guided preprocessing for automatic cold steel weapon detection in
surveillance videos with deep learning,” Neurocomputing, vol. 330. pp.
151–161, 2019. doi: 10.1016/[Link].2018.10.076.
[18] A. Castillo, S. Tabik, F. Pérez, R. Olmos, and F. Herrera, “Brightness
guided preprocessing for automatic cold steel weapon detection in
surveillance videos with deep learning,” Neurocomputing, vol. 330. pp.
151–161, 2019. doi: 10.1016/[Link].2018.10.076.
[19] R. Ippalapally, S. H. Mudumba, M. Adkay, and N. V. H. R., “Object
Detection Using Thermal Imaging,” 2020 IEEE 17th India Council
International Conference (INDICON). 2020. doi:
10.1109/indicon49873.2020.9342179.