Violence Detection via Computer Vision

The paper discusses a real-time violence detection system using deep learning-based computer vision techniques, specifically a CNN-BiLSTM architecture, to identify violent activities in public areas. The proposed system aims to enhance social security by quickly notifying authorities of potential violent incidents, utilizing video frame analysis to detect movement patterns. The authors highlight the advantages of deep learning over traditional methods, including higher accuracy and reduced need for extensive feature engineering.


Violence Detection Using Computer Vision Approaches

Conference Paper · June 2022

DOI: 10.1109/AIIoT54504.2022.9817374


3 authors:

Khalid Raihan Talha, North South University
Koushik Bandapadya, Westcliff University
Mohammad Monirujjaman Khan, North South University

All content following this page was uploaded by Mohammad Monirujjaman Khan on 12 January 2025.


Violence Detection Using Computer Vision Approaches
Koushik Bandapadya, Khalid Raihan Talha, Mohammad Monirujjaman Khan
Department of Electrical & Computer Engineering
North South University
Bashundhara, Dhaka-1229, Bangladesh
[Link]@[Link]

Abstract - Violent crime has always been a major social problem. The rise of violent behavior in public areas can be attributed to a variety of factors. Greed, frustration, and hostility among individuals, as well as social and economic anxieties, are the primary causes of increased violence. It is critical to protect our possessions, as well as our lives, from threats such as robbery or homicide. It is impossible to prevent crime and violent acts unless brain signals are studied and a certain pattern deduced from criminal ideas is detected in real time. Because this is not yet technologically feasible, it has not been realized. However, using deep learning-based computer vision technologies, we can detect violent activities in public areas. The goal of this project is to build a real-time violent activity monitoring system that is capable of detecting violence quickly and efficiently. The public of any city can benefit from it, as it will allow law enforcement personnel to take the necessary actions to prevent violent activities. When the system is implemented, it will be able to detect, by using cameras, the speed of people's movements and their distances from other people walking in public places. The system will mainly detect the speed of hand and leg movements of a person who is very close to another person. If anyone is identified as acting violently, the server side of the system will notify the people responsible for preventing violence within a very short period of time. The system was built using the concepts of computer vision and neural networks. It has been developed and tested initially on the personal computing devices of the system developers. The system is easy to design and develop, which makes it easy to use for any kind of public area surveillance. At the same time, the system gives its desired output due to its high accuracy.

Index Terms - Violence detection, convolutional neural networks, LSTM, Computer vision.

I. INTRODUCTION

For a long time, one of the major issues has been the occurrence of violence in daily life. It can easily destroy the peace and harmony of any society. Criminal activities declined considerably from 2014 to 2017 but began to rise again in 2017, with an increase of 6.79% from 2017 to 2018 [1]. Violent behavior in public areas happens due to various factors. Individual greed, frustration, and hatred, as well as social and economic insecurity, are the leading causes of violence. To address this issue, expected or unexpected violence should be detected at an early stage so that it can be stopped as soon as possible.

Computer vision and deep learning have recently been used to investigate human actions and behavior. Even though it is the scariest social problem, very few works automate the detection of actions, violence, or protests. In terms of social security and stability, this field of study is quite useful. It is impossible to prevent crime and violent acts unless brain signals are studied and a specific pattern derived from criminal thinking is discovered in real time. Because this is not yet technologically feasible, it has yet to be accomplished. Using deep learning-based computer vision, we can now detect aggressive activity in public areas. The majority of public and private institutions already have CCTV [2]. Effective violence detection techniques can assist the government or authorities in taking a quick and systematic approach to identifying violence and preventing the loss of human life and property. As human beings and members of society, we all desire to have secure streets, communities, and workplaces. Because it does not require explicit feature engineering, deep learning often outperforms traditional machine learning. It has some disadvantages, including high processing costs and the need for large training datasets. These technological considerations drive us to create a model that requires less training time and a smaller number of training examples. Using deep learning methodologies, we offer approaches in our system that will be able to spot violent threats and activities.

Previously, the presence of a body, the degree of action, and even aspects of the sound associated with violent activities were used to distinguish between violent and non-violent activities. Surveillance cameras are not very effective in recording sounds related to certain activities (Audio-visual content-based violent scene characterization) [3]. Frame-based video analysis, on the other hand, is based purely on a sequence of frames (that is, pictures) rather than sounds. There are various sorts of violence, including one-on-one violence, mob violence, family violence, sports violence, gun violence, and many others. Violence detection with a C3D Convolutional Neural Network (3D-CNN) was one of the earlier works on finding violent scenes in a video stream. The 3D-CNN is a deep supervised learning system that learns spatiotemporal discriminant features from videos (sequences of image frames). Unlike 2D convolutions, this method applies 3D kernels to a series of image frames in their context, resulting in 3D activation maps that capture both spatial and temporal information. Three datasets were combined for this task: hockey fight, movies, and crowd violence [4]. They were able to get an accuracy of 84.428% at the 36th training epoch [5]. Another contribution was a work that uses the concept of convolutional neural networks (CNNs) and the Google Object
Detection API and uses these two new developments in technology to retrain a pre-trained model to perform weapon detection in real-time surveillance. From one of the latest contributions, we learned that this problem can also be solved by using convolutional neural networks. By scanning the sequential flow of video frames, a bidirectional LSTM model (CNN-BiLSTM) architecture is used to detect real-time violence. They had more than 98% accuracy for their three different models [6]. Unlike their work, we create only a single model, which will not only decrease the server load but also improve the response time of the applications where it is deployed.

To predict violence in the sequential flow of frames, we utilize the Convolutional Neural Network Bidirectional LSTM (CNN-BiLSTM) architecture. To begin with, we divide a video into numerous frames. We pass each frame through a convolutional neural network to extract the information present in the current frame. Then, to recognize any sequential flow of events, we use a bidirectional LSTM layer to compare the information of the current frame once with the prior frames and once with the upcoming frames. Finally, the classifier determines whether or not an action is violent.

After introducing our topic, we go directly to the methodology in Section 2, where we discuss the steps and ways to implement our system, including the experimental setup, data processing, and training methods. Then we discuss the results of our work with qualitative and quantitative data in Section 3, where we illustrate the accuracy evaluation and accuracy comparison. At the end, we present the conclusion of our paper, including the necessary figures and tables, as well as the prospects for future upgrades.

The literature evaluation states that most systems are not operating in Bangladesh and do not fully meet the needs of our customer's criteria. The main initiative has been based on a website. In the next stages, we intend to make this Android/iOS based mobile application more user-friendly. This system's most appealing feature is that the vendor may communicate directly with customers to advertise the goods and obtain feedback. Because it is quite versatile in terms of expansion, the project can be upgraded in the near future as and when the need arises. This site can have several branches, and additional features such as a virtual shopping basket and a virtual trial room can be added to make it more robust.

Firstly, we have given a brief idea of the project. Then we present an introduction to the e-commerce site. We look through the problem statements as well as the existing system and compare them with our site. We have talked about the remaining gaps in the project and discussed its implementation. The rest of the sections of this paper are as follows: Section 2 covers the methodology; Section 3 presents the results and analysis of the system; and finally, the conclusions are set out in Section 4 along with references.

II. METHODOLOGY

In this section, we discuss the proposed framework for creating our computer vision model and its architecture.

A. Model Architecture
Our model must be able to predict sequences in successive frames, such as a pattern in the movement of the individuals or the degree of their motion, to classify violent or non-violent activities. This is not possible by considering only the spatial features of the frames (features belonging to a particular frame). While detecting sequences in frames, temporal or time-related factors must be taken into account. The temporal features can be handled in either a forward or backward order. Our model processes the temporal features in both directions in addition to the spatial features, which helps the model to become more accurate while consuming less computational time. Lightweight models are generally preferred in surveillance because of their low cost. The model consists of three sub-parts [7].

1) CNN: The Convolutional Neural Network (CNN) is the most common neural network in the field of computer vision for detecting and classifying images. Ours comprises an input convolutional layer followed by three layers of convolution and max pooling. The kernel size for each convolutional layer is 3×3, and 64 kernels are used in each convolutional layer. After passing through a "relu" activation, the output from each convolutional layer is max pooled to extract the features. Each max pooling uses a filter size of 2×2. Finally, the features are flattened and sent to the next part of the model. The TensorFlow [8] and Keras [9] APIs have been used to implement the convolutional neural networks. The basic CNN operation is shown in Fig. 1 [10].

Fig. 1. Convolutional Neural Network (CNN)

2) The Bidirectional LSTM Cells: The basic LSTM cell appears in Fig. 2 [6]. Long short-term memory (LSTM) cells are commonly used to retain part of the previously processed features. An LSTM mimics the human brain's ability to remember events it has already processed. The first layer in an LSTM cell is known as the forget gate layer, denoted by f_t. It is passed through a sigmoid function to get an output between 0 and 1. The value 0 denotes a forget state and 1 denotes a remember state. The forget gate layer is governed by the standard LSTM forget-gate equation.
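For reference, the gate equations that the LSTM description relies on can be written in the commonly used form below. This is the textbook LSTM formulation, not necessarily the exact notation of the authors: the weight matrices W and U, the candidate state \tilde{c}_t, and the bias b_c are assumed symbols, while f_t, c_t, h_t, x_t, b_f, b_i, and b_o follow the names used in the text.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate state, assumed symbol)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

Here \sigma is the sigmoid function and \odot is element-wise multiplication. A value of f_t close to 0 discards the corresponding entry of c_{t-1}, and a value close to 1 keeps it.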
Fig. 2. Basic LSTM cell

The next layer is called the input gate layer. In this layer, the remembered state data is combined with the new features.

The output from the forget gate layer is multiplied by the cell state vector of the previous LSTM cell (c_{t-1}). The result is added to the output from the input gate layer, multiplied by the hidden state vector of the previous state after passing through a "tanh" function, to produce the cell state vector for the next LSTM cell (c_t). This vector, after passing through a "tanh" function, is multiplied by the hidden state vector of the previous state (h_{t-1}) after passing through a "sigmoid" function, to produce the hidden state vector for the next LSTM cell (h_t). Then, in the last layer, c_t, some of the features from the previous state and the newly acquired features from the current cell are added together and sent to the next state.

Here, x_t is the input vector to the LSTM unit, and b_f, b_i, and b_o are the bias vectors for the forget gate layer, input gate layer, and output gate layer, respectively. In the LSTM, the features are remembered and passed from state 1 to state 2 and on to state n. The LSTM can also work in the reverse direction; the features are then remembered and passed from state n down to state 2 and state 1. By combining both of these mechanisms, we obtain a bidirectional LSTM layer, as shown in Fig. 3. The bidirectional LSTM cells are more accurate in storing data. For violence detection, a bidirectional LSTM compares the sequence of frames once in the forward direction and once in the reverse direction. This mechanism gives our model more strength by adding different cell states and training features.

Fig. 3. BiLSTM Shell

3) The Dense Layers: The dense layers are omnipresent when it comes to deep learning. Here, the fully connected dense layers assign weights (W_i), initialized randomly, to features (X_i) and test which set of features gives the best accuracy over a certain number of epochs by passing through an activation function. The entire architecture of our proposed model is shown in Fig. 4 [6].

Fig. 4. Node Architecture

II. EXPERIMENTAL SETUP

A. Data Processing
Frames have been extracted from the videos. The extracted frames are reshaped to 100×100 pixels (denoted as x × y). The training data is a NumPy array, with each of its rows representing a sequence or pattern in the videos. A sequence might capture a degree of movement and actions, for example whether a movement of the arm is a punch or a handshake. The minimum number of frames required to extract a sequence is 2. However, we have used 10 consecutive frames (denoted as n) to extract the temporal features (that is, time-related features). The total number of samples (denoted by N) represents the total number of such sequences in the dataset ((total number of frames) / (number of frames to consider in a sequence)). For a simple implementation, NumPy allows the placeholder value -1 to be used for this dimension, which it then infers from the data. Hence, a structure containing sequences of 10 consecutive frames with their respective class labels is prepared. The shape of the training data is (-1, n, x, y, c). Here, c represents the number of channels in each frame. The pictorial representation of the training data is shown in Fig. 5 [6].

Fig. 5. Visualization of the training data
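The data layout described under Data Processing can be sketched with NumPy as follows. This is a minimal illustration, not the authors' script: the function name `frames_to_sequences`, the dummy all-zero frames, and the choice to drop trailing frames that do not fill a complete sequence are assumptions made for the example.

```python
import numpy as np


def frames_to_sequences(frames, n=10):
    """Group consecutive frames into non-overlapping sequences of length n.

    frames: array of shape (total_frames, x, y, c).
    Assumption: trailing frames that do not fill a whole sequence are dropped.
    """
    total, x, y, c = frames.shape
    usable = (total // n) * n  # largest multiple of n that fits
    # -1 lets NumPy infer N = usable / n, the number of sequences.
    return frames[:usable].reshape(-1, n, x, y, c)


# Dummy stand-in for 47 extracted frames, already resized to 100x100 RGB.
frames = np.zeros((47, 100, 100, 3), dtype=np.uint8)
sequences = frames_to_sequences(frames, n=10)

# One integer class label per sequence (hypothetical: 1 = violent).
labels = np.ones(len(sequences), dtype=np.int64)

print(sequences.shape)  # (4, 10, 100, 100, 3)
```

With 47 frames and n = 10, four full sequences are produced and the last seven frames are discarded under the stated assumption.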
B. Data Frame Separation
The video datasets are divided in a 90/10 ratio by random selection. 10% of the images and videos are used for testing in the evaluation step. 90% of the images are fed into the model for training purposes; this split can be done with a Python script. The weapon image dataset, on the other hand, is divided in an 80/20 ratio by random selection.

C. Model Training
To find fights between people, a set of 10 frames with dimensions of 100 × 100 was fed into a model with the shape shown in Fig. 4 to extract the spatial and temporal features. Stochastic gradient descent has been used as the optimizer with a learning rate of 0.01 and a decay rate of 1e-6. The loss function used in this paper is "sparse categorical cross-entropy". In this two-class classification problem, we have used the integer class labels 0 and 1 instead of one-hot encoding, with a batch size of 5 samples. For training and testing purposes, the datasets are divided in a 9:1 ratio. To keep the computation cost low, the whole model was built and trained from scratch for 25 epochs.

D. Model Testing
Once the model finishes training, a test dataset is used to evaluate the model and output the average precision and mean average precision (mAP). The script then prints the result from the model at the command prompt. The testing process can be run on the existing trained model.

Fig. 6. System Layout

E. Requirement
i) Software
a. TensorFlow with GPU support - an open-source software library used for machine learning.
b. Python 3.9.x
c. Algorithm - CNN, RNN, LSTM, Deep Learning, Computer Vision, Visual Studio Code.
d. Libraries used: Keras, NumPy, TensorFlow Object Detection.
ii) Hardware
Processor - AMD Ryzen 5 2400G
RAM - 16.0 GB
GPU - Radeon Vega 11 Graphics
Operating System - Windows 10 Professional 64 bit
iii) Resources
a. Movies Fight Detection Dataset - 200 video clips
b. Hockey Fight Detection Dataset - 1000 video clips
c. Crowd / Violent-Flows Fight Detection Dataset - 246 video clips

F. Dataset
The effectiveness of the CNN Bidirectional LSTM model architecture has been validated by running it on the standard datasets for violent and non-violent action detection, namely the Hockey Fight dataset [11], the Movies dataset [11], the Violent Flows dataset [12], and the Weapons dataset for image classification and object detection tasks.

The Hockey Fight Dataset: The Hockey Fight dataset contains clips from ice hockey matches. The dataset has 500 violent clips and 500 non-violent clips with an average duration of 1 s. The clips have similar backgrounds and subjects. Source: Hockey Fight Detection Dataset - Academic Torrents.

The Movies Dataset: The Movies dataset contains fight clips taken from action sequences in different movies, whereas the non-fight sequences consist of clips from action recognition datasets. The dataset has 100 violent clips and 100 non-violent clips with an average duration of 1 s. Unlike the Hockey Fight dataset, the clips from movies have different backgrounds and subjects. Source: Movies Fight Detection Dataset - Academic Torrents.

The Violent Flows Dataset: The Violent Flows dataset deals with crowd violence. The dataset consists of videos of human actions from the real world, CCTV footage of crowd violence, and YouTube videos, properly maintaining the standard benchmark protocols. The dataset consists of 246 videos with properly biased samples. Source: Crowd Violence \ Non-violence Database ([Link]).

III. RESULT & ANALYSIS

In this section, we discuss the results of our proposed framework and analyze the strengths and weaknesses of our model.

A. Accuracy Evaluation
Because we have used the CNN-BiLSTM model architecture, the model can handle our specifically chosen datasets very efficiently. Each of our datasets is divided into 10 parts: 9 for training the model and 1 for validation. For each training epoch, we record the training accuracy, training loss, validation accuracy, and validation loss.

B. Hockey Fight Dataset
For training the model on the hockey dataset, we used 10 epochs. Every epoch had 55 steps with a batch size of 16. From Figs. 7 and 8, we can see that the maximum accuracy achieved was 94.9% for training and 96.94% for validation.
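To give a sense of what a decay rate of 1e-6 means over a run of 10 epochs with 55 steps each, the short calculation below applies time-based decay to the stated learning rate of 0.01. It assumes the decay follows the schedule used by the legacy Keras SGD optimizer, lr / (1 + decay × iteration); the paper does not state which schedule was used, and the function name `decayed_lr` is hypothetical.

```python
def decayed_lr(initial_lr, decay, iteration):
    """Time-based decay (assumed schedule): lr / (1 + decay * iteration)."""
    return initial_lr / (1.0 + decay * iteration)


epochs = 10           # epochs reported for the hockey fight run
steps_per_epoch = 55  # steps per epoch reported for the same run
total_steps = epochs * steps_per_epoch

final_lr = decayed_lr(initial_lr=0.01, decay=1e-6, iteration=total_steps)
relative_drop = 1.0 - final_lr / 0.01

print(total_steps)
print(final_lr)
print(relative_drop)
```

Under this assumption the learning rate falls by well under 0.1% across the whole run, so with so few update steps it behaves almost like a constant rate of 0.01.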
Fig. 7. Training and validation accuracy achieved from the hockey fight dataset

Fig. 8. Training and validation loss achieved from the hockey fight dataset

C. Movie dataset
After getting the first trained model from the hockey dataset, we continued training (over-fitted) it on the movies dataset for 10 epochs. Each of the epochs goes through 55 steps. From Figs. 9 and 10, we can see that the maximum accuracy achieved was 92.92% for training and 96.94% for validation.

Fig. 9. Training and validation accuracy achieved from the movies dataset

Fig. 10. Training and validation loss achieved from the movies dataset

D. Violence Flows Dataset
After getting the first trained model from the hockey fight dataset, we continued training (over-fitted) it on the violence flows dataset for 10 epochs. Each of the epochs goes through 55 steps. From Figs. 11 and 12, we can see that the maximum accuracy achieved was 77.31% for training and 80% for validation.

Fig. 11. Training and validation accuracy achieved from the violence flows dataset
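The staged procedure described in these subsections, where one model is trained on the hockey fight data and the same weights are then trained further on each later dataset, can be illustrated with a deliberately small stand-in. The sketch below uses a NumPy logistic regression instead of the CNN-BiLSTM, and randomly generated toy data instead of video sequences; the names `sgd_epoch`, `train_sequentially`, and `toy_dataset` are hypothetical, and only the control flow (one set of weights carried across stages) is meant to mirror the text.

```python
import numpy as np


def sgd_epoch(w, X, y, lr=0.01):
    """One pass of per-sample SGD for logistic regression; returns new weights."""
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-(xi @ w)))  # predicted probability of class 1
        w = w - lr * (p - yi) * xi           # gradient step on the log loss
    return w


def train_sequentially(stages, n_features, epochs=10, lr=0.01):
    """Train one weight vector on each dataset in turn, without resetting it."""
    w = np.zeros(n_features)
    snapshots = {}
    for name, (X, y) in stages:
        for _ in range(epochs):
            w = sgd_epoch(w, X, y, lr)
        snapshots[name] = w.copy()  # weights as they stand after this stage
    return w, snapshots


rng = np.random.default_rng(0)


def toy_dataset(n_samples):
    """Toy stand-in for a dataset: label is 1 when the first feature is positive."""
    X = rng.normal(size=(n_samples, 4))
    y = (X[:, 0] > 0).astype(float)
    return X, y


stages = [
    ("hockey", toy_dataset(100)),
    ("movies", toy_dataset(20)),
    ("violent_flows", toy_dataset(25)),
]
final_w, snapshots = train_sequentially(stages, n_features=4)
print(list(snapshots))
```

Because the weights are never reset between stages, the final model reflects all three datasets, with the most recent one having the last influence on the weights.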
Fig. 12. Training and validation loss achieved from the violence flows dataset

E. Accuracy Comparison
From the accuracy evaluation, we can see that our final model has achieved an accuracy of more than 77% for training and 80% for validation. The comparison between our model and its architecture and other existing models and their architectures is given in Table I.

TABLE I
A comparison between the accuracy of our model and the existing models

Methods                        Hockey        Movies     Violence Flows
MoIWLD [13]                    96.8±1.04%    -          93.19±0.12%
ViF+OViF [14]                  87.5±1.7%     -          88±2.45%
Spatiotemporal Encoder [15]    98.1±0.58%    100±0%     93.87±2.58%
Conv 3D [16]                   98.3±0.81%    100±0%     97.17±0.95%
CNN-LSTM [6]                   97.1±0.55%    100±0%     94.57±2.34%
CNN-BiLSTM (our model)         94.9%         92.92%     77.31%

Our main comparison was with the models created with Conv 3D [16] and CNN-LSTM [6]. The Conv 3D model was created by extracting all the frames into a single folder and then training on them, so all the datasets were used at once to create the model. In the previous work on CNN-BiLSTM, three separate models were created from three different datasets. In our case, we used CNN-BiLSTM to create our first model from the hockey fight dataset and then over-fitted that model with the other two datasets (movies and violence flows).

IV. CONCLUSION

Our proposed CNN-BiLSTM based violence detection system can make society a secure place for peace-loving people. By training the model once and over-fitting twice, we were able to achieve decent accuracy for training, validation, and testing. Using our proposed framework, we obtained the final results in the final over-fitting of our model. We could also see that at the last stage, our results nearly stabilized. Despite the satisfactory performance of our proposed model, it needs to be further validated with more standard datasets where the identification of one-to-many or many-to-many violent activities is possible. In future work, we expect to increase the accuracy of our model while maintaining our model architecture. Our model will have combined violence and weapon detection capabilities. In the near future, we are planning to detect metal by using thermal vision cameras, which will allow us to differentiate between real guns and fake guns [17]. We will also give our system the capability to determine whether a gun holder is a member of the law enforcement team (police) or not. In the near future, our system will also be able to use night vision [18] and thermal vision [19] technologies to find violent activities.

REFERENCES
[1] "Bangladesh Crime Rate & Statistics 2000-2022." [Online] Available: [Link]statistics (accessed Jan. 13, 2022).
[2] M. Ramzan et al., "A Review on State-of-the-Art Violence Detection Techniques," IEEE Access, vol. 7. pp. 107560–107575, 2019. doi: 10.1109/access.2019.2932114.
[3] "Audio-visual content-based violent scene characterization." [Online] Available: [Link]umber=15617 (accessed Nov. 11, 2021).
[4] S. Accattoli, P. Sernani, N. Falcionelli, D. N. Mekuria, and A. F. Dragoni, "Violence Detection in Videos by Combining 3D Convolutional Neural Networks and Support Vector Machines," Applied Artificial Intelligence, vol. 34, no. 4. pp. 329–344, 2020. doi: 10.1080/08839514.2020.1723876.
[5] F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq, and S. W. Baik, "Violence Detection Using Spatiotemporal Features with 3D Convolutional Neural Network," Sensors, vol. 19, no. 11, May 2019, doi: 10.3390/s19112472.
[6] R. Halder and R. Chatterjee, "CNN-BiLSTM Model for Violence Detection in Smart Surveillance," SN Computer Science, vol. 1, no. 4. 2020. doi: 10.1007/s42979-020-00207-x.
[7] R. Halder and R. Chatterjee, "CNN-BiLSTM Model for Violence Detection in Smart Surveillance," SN Computer Science, vol. 1, no. 4. 2020. doi: 10.1007/s42979-020-00207-x.
[8] "TensorFlow," TensorFlow. [Online] Available: [Link] (accessed Jan. 15, 2022).
[9] Keras Team, "Keras: the Python deep learning API." [Online] Available: [Link] (accessed Jan. 15, 2022).
[10] Phung, Phung, and Rhee, "A High-Accuracy Model Average Ensemble of Convolutional Neural Networks for Classification of Cloud Image Patches on Small Datasets," Applied Sciences, vol. 9, no. 21. p. 4500, 2019. doi: 10.3390/app9214500.
[11] E. B. Nievas, O. D. Suarez, G. B. García, and R. Sukthankar, "Violence Detection in Video Using Computer Vision Techniques," Computer Analysis of Images and Patterns. pp. 332–339, 2011. doi: 10.1007/978-3-642-23678-5_39.
[12] T. Hassner, Y. Itcher, and O. Kliper-Gross, "Violent flows: Real-time detection of violent crowd behavior," 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. 2012. doi: 10.1109/cvprw.2012.6239348.
[13] T. Zhang, W. Jia, X. He, and J. Yang, "Discriminative Dictionary Learning With Motion Weber Local Descriptor for Violence Detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3. pp. 696–709, 2017. doi: 10.1109/tcsvt.2016.2589858.
[14] Y. Gao, H. Liu, X. Sun, C. Wang, and Y. Liu, "Violence detection using Oriented VIolent Flows," Image and Vision Computing, vol. 48–49. pp. 37–41, 2016. doi: 10.1016/[Link].2016.01.006.
[15] A. Hanson, K. Pnvr, S. Krishnagopal, and L. Davis, "Bidirectional Convolutional LSTM for the Detection of Violence in Videos," Lecture Notes in Computer Science. pp. 280–295, 2019. doi: 10.1007/978-3-030-11012-3_24.
[16] J. Li, X. Jiang, T. Sun, and K. Xu, “Efficient Violence Detection Using
3D Convolutional Neural Networks,” 2019 16th IEEE International
Conference on Advanced Video and Signal Based Surveillance (AVSS).
2019. doi: 10.1109/avss.2019.8909883.
[17] A. Castillo, S. Tabik, F. Pérez, R. Olmos, and F. Herrera, “Brightness
guided preprocessing for automatic cold steel weapon detection in
surveillance videos with deep learning,” Neurocomputing, vol. 330. pp.
151–161, 2019. doi: 10.1016/[Link].2018.10.076.
[18] A. Castillo, S. Tabik, F. Pérez, R. Olmos, and F. Herrera, “Brightness
guided preprocessing for automatic cold steel weapon detection in
surveillance videos with deep learning,” Neurocomputing, vol. 330. pp.
151–161, 2019. doi: 10.1016/[Link].2018.10.076.
[19] R. Ippalapally, S. H. Mudumba, M. Adkay, and N. V. H. R., “Object
Detection Using Thermal Imaging,” 2020 IEEE 17th India Council
International Conference (INDICON). 2020. doi:
10.1109/indicon49873.2020.9342179.

View publication stats

Common questions

The CNN-BiLSTM architecture, which combines convolutional layers for spatial feature extraction with BiLSTM for temporal feature analysis, generally outperforms single-model approaches in violence detection. Single-model architectures might efficiently handle either spatial or temporal data but often lack the integrative capability to process both simultaneously. CNN-BiLSTM models achieve substantial accuracy improvements due to their ability to handle both aspects effectively, demonstrated by results showing over 98% accuracy in some cases. However, the complexity and computational demands are higher than simpler models, presenting a trade-off in terms of resource usage versus accuracy .

The CNNs are used to extract spatial features from each frame of a video, capturing the details present in single frames such as shapes and edges. However, to detect violence, it's necessary to also analyze the temporal flow of frames to understand the sequence of actions over time. This is where the Bidirectional LSTM (BiLSTM) architecture comes into play, as it processes the sequence of frame data bidirectionally to analyze temporal patterns. By combining CNNs and BiLSTM, the model can effectively utilize both spatial and temporal data, allowing for the detection of real-time violence with high accuracy, as evidenced by more than 98% accuracy achieved in previous models .

The CNN-BiLSTM framework handles varying backgrounds and subjects in different datasets by using the CNN component to focus on spatial features and identifying distinctive characteristics of violence across different contexts. The BiLSTM component processes temporal sequences, which aids in discerning patterns indicative of violence despite background differences. By training on multiple datasets with diverse examples, the model can better generalize these patterns, allowing it to adapt to variations in backgrounds and subjects, maintaining performance across diverse situations .

There are inherent trade-offs between model accuracy and computational efficiency in CNN-BiLSTM architectures. Higher accuracy generally requires a more complex model with increased computational demands due to more layers or larger sizes of kernels and LSTM units. Conversely, achieving computational efficiency typically involves reducing model complexity, such as fewer layers or using smaller datasets, which can lead to lower accuracy. The challenge is balancing these factors to maintain acceptable accuracy while ensuring the model remains feasible for real-time applications .

Using multiple datasets such as Hockey Fight, Movies, and Violent Flows allows the CNN-BiLSTM model to learn from a wider variety of data types, contexts, and backgrounds which enhances its robustness and generalization skills. These datasets present different challenges, such as varied lighting conditions, backgrounds, and types of violence, allowing the model to capture more comprehensive features. This helps in developing a more adaptable model that can effectively interpret and identify violence across various scenarios, thus improving the model's overall reliability and accuracy .

The main challenges include the lack of existing systems tailored to the specific needs of the region, such as linguistic and cultural differences that might affect the dataset. Systems that perform well elsewhere may not directly transfer due to these differences. Another consideration is the requirement of a lightweight model for cost-effectiveness and efficiency given potentially limited computational resources. The initiative is to develop mobile applications to make the system more user-friendly and accessible .

Reducing computational costs while maintaining accuracy can be achieved by optimizing the model's complexity and execution. This includes minimizing the model's weight by reducing the number of layers and parameters, using techniques such as pruning and quantization, and employing efficient frameworks like TensorFlow and Keras. Modifying the architecture to balance the processing of spatial and temporal features and adopting a single model approach instead of multiple might also decrease server load and improve response time without sacrificing performance .

Different datasets provide diverse scenarios and contexts for the model to learn from, which improves its ability to generalize. For instance, Hockey Fight, Movies, and Violent Flows together cover a range of environments and backgrounds. Training on them lets the CNN-BiLSTM model learn varied representations of violence, making it adaptable to diverse real-world situations. The training and validation accuracy on these datasets indicates that the model can predict violent actions across different scenes.

Continuing to train the model on subsequent datasets such as Movies and Violent Flows after initial training on the Hockey Fight dataset, a step the source describes as overfitting, lets the model adjust to features specific to each kind of data. This step can tune the model's weights to capture the unique aspects of each dataset, increasing specificity and precision. However, it also raises the risk that the model becomes too tailored to the details of particular data, reducing its ability to generalize to unseen data. A balance must be struck between improving performance and maintaining flexibility.

Bidirectional processing lets a BiLSTM consider both past and future context at every time step, giving a more complete analysis of the temporal information in a sequence of video frames. The model can therefore better capture the actions that lead up to or follow a potentially violent event, which supports more accurate identification of violence. This use of context from both directions contributes significantly to the accuracy of violence detection.
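The idea of combining past and future context can be sketched without a full LSTM. The toy example below replaces the LSTM cell with a simple decayed running sum (an assumption made purely for illustration), runs it once forward and once backward over made-up per-frame motion scores, and pairs the two states at each frame.

```python
def scan(frames, decay=0.5):
    """Toy recurrent cell: each state is a decayed sum of the inputs seen so far."""
    state, states = 0.0, []
    for x in frames:
        state = decay * state + x
        states.append(state)
    return states


def bidirectional_scan(frames, decay=0.5):
    """Pair a forward pass (past context) with a backward pass (future context)."""
    forward = scan(frames, decay)
    backward = scan(frames[::-1], decay)[::-1]  # run reversed, then realign
    return list(zip(forward, backward))


# Made-up per-frame motion scores; the second clip has a spike at the last frame.
calm = [0.1, 0.1, 0.1, 0.1]
spike = [0.1, 0.1, 0.1, 0.9]

out_calm = bidirectional_scan(calm)
out_spike = bidirectional_scan(spike)

# At frame 0 the forward state is the same for both clips, but the backward
# state differs because it already reflects the later spike.
print(out_calm[0], out_spike[0])
```

One consequence of this design is that the backward pass needs the whole clip before it can produce any output, so a streaming system has to run the model over a buffered window of frames rather than frame by frame.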




Common questions

Evaluate the effectiveness of CNN-BiLSTM architecture in comparison to single-model approaches in violence detection.

How do Convolutional Neural Networks (CNNs) and Bidirectional LSTM (BiLSTM) architectures complement each other for real-time violence detection in video frames?

How does the CNN-BiLSTM framework handle varying backgrounds and subjects in different datasets for violence detection?

What are the trade-offs between model accuracy and computational efficiency in the context of CNN-BiLSTM architectures for violence detection?

Analyze how using multiple datasets like Hockey Fight, Movies, and Violent Flows can be advantageous for training a robust CNN-BiLSTM model for violence detection.

What are the main challenges and considerations in applying violence detection systems using CNN-BiLSTM architectures in regions like Bangladesh?

In what ways can the CNN-BiLSTM architecture be optimized to reduce computational costs while maintaining accuracy in detecting violent scenes?

How do different datasets used in training CNN-BiLSTM models influence the model's ability to generalize violence detection across diverse environments?

In the context of CNN-BiLSTM models for violence detection, how do overfitting practices with subsequent datasets like Movies and Violence Flows contribute to model performance?

What role does the bidirectional processing of temporal features in the BiLSTM model play in improving the accuracy of violence detection tasks?
