Toward Human Activity Recognition: A Survey: Original Article
ORIGINAL ARTICLE
Received: 6 March 2021 / Accepted: 10 October 2022 / Published online: 20 October 2022
The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022
Abstract
Human activity recognition (HAR) is a complex and multifaceted problem. The research community has reported
numerous approaches to perform HAR. Along with HAR approaches, various surveys have revealed HAR trends in various
environments and applications. HAR is linked to a variety of technology-dependent daily life systems, such as human–
computer interaction systems, security surveillance, video surveillance, healthcare surveillance, robotics, content-based
information retrieval, and monitoring systems. Because of technological advancements, HAR trends change quickly and
necessitate an up-to-date and broader perspective. This study offers an HAR taxonomy, which includes online/offline HAR,
multimodal/unimodal HAR, handcrafted feature-based, and learning-based approaches. This study attempts to present the
multidisciplinary nature of HAR, such as application areas, activity types, task complexities, benchmark datasets, and
methods. This research includes a comparative analysis of state-of-the-art HAR methods and a discussion of popular
datasets. The selected studies have been categorized using taxonomy, and different attributes such as activity complexity,
dataset size, and recognition rate have been used for their analysis. The comparative analysis of HAR approaches has also
helped to highlight domain challenges and open research directions for HAR researchers to follow.
Keywords Activity recognition · Action recognition · Video datasets · Deep learning · Handcrafted features · Video analysis · Computer vision
4146 Neural Computing and Applications (2023) 35:4145–4182
Advanced methods are a combination of multiple steps, which can collectively extract advanced features and perform in-depth analysis to recognize human activities [5–8]. Basic computer vision-based methods such as optical flow [9–11], spatiotemporal interest points (STIP) [12], hidden Markov model (HMM) [13], and advanced deep learning tools, for example, convolutional neural networks (CNN) and recurrent neural networks (RNN) [14–16], are used to recognize human activity.

HAR has a multidisciplinary nature, and various daily life systems are influenced by performing HAR. HAR plays its role in indoor/outdoor environments, robotics, content-based information retrieval, human–computer interactions (HCI), security surveillance, video surveillance, educational sector, monitoring, and social interaction-based applications [17]. Hence, because of rapid technological advancement of daily life systems, there is a need for an up-to-date survey to discuss the progress of HAR and also to highlight its challenges [1]. Considering previous surveys, HAR systems can be classified as online or offline based on the input data and processing strategy. Then, there are unimodal/multimodal approaches that use different modalities, such as video frames, audio cues, skeleton data, and depth data. Most of the previous surveys have discussed handcrafted approaches, and a few recent surveys have incorporated learning-based approaches as well [18–20].

1.1 Methodology for survey of HAR approaches

This study attempts to provide a method-based classification of approaches through taxonomy. It also provides a comparative analysis of state-of-the-art methods presented since 2011 to present an overview of the HAR domain. This study includes 46 state-of-the-art methods, which are based on topics such as “Human Activity Recognition,” “Human Action Recognition,” “Online Activity Recognition,” “Learning-based Human Activity Recognition,” and “Handcrafted features-based Human Activity Recognition.” The existing state-of-the-art surveys are also discussed to analyze the up-to-date findings. Figure 2 shows the process of selection of studies ((a) Selection Process) and parameters used for the analysis of these studies ((b) Analysis Process). As shown in Fig. 2, studies are selected from multiple databases and initial selection is made on the basis of relevant topics. Then, all studies published earlier than 2011 are not considered, and the search is performed again by analyzing the titles of studies. This process helps to remove duplicates and survey studies from the selected set, and as a result, 2500 studies are left while others have been discarded. This survey is based on reviewing video-based HAR approaches and benchmark datasets. Therefore, the selected set is refined to get studies that have used a video benchmark dataset for evaluation of their model. For this purpose, we have reviewed the abstract, and sometimes the experiments too when the abstract does not provide the necessary details.

As a result, 46 studies are selected for state-of-the-art analysis, which includes studies on feature-based models, deep learning-based models, online activity recognition models, and methods for multimodal HAR. We have analyzed all selected studies using various parameters, such as publication year, method type, data input, activity level, dataset size, and performance on benchmark datasets. These parameters help in identifying which activities among simple, intermediate, and complex are frequently used in research. The size of the dataset is an important indicator to determine what types of datasets are more useful across selected studies. Several evaluation measures are used for human activity recognition, such as Average Precision, which is the most reported measure. However, accuracy, recall, f-measure, likelihood ratio, and area under curve (AUC) are also popular among studies.

The major contributions of this study are as follows:

• This study highlights various approaches and proposes a HAR taxonomy, and elements of taxonomy are discussed with the help of HAR methods. HAR approaches are mainly divided into handcrafted/feature-based HAR and learning-based HAR approaches which are further sub-divided up to four levels to cover simple feature extraction-based methods such as trajectories and space–time feature.
Author | Year | Activities | Complexity | Application | Contribution
Vishwakarma et al. [18] | 2013 | Abnormal Actions, Behavior, and interactions | Intermediate | Security Surveillance | Based on Activity recognition, object tracking, object detection tasks, and behavior understanding using handcrafted approaches. Few surveillance-based video datasets are also discussed
Ke et al. [21] | 2013 | Single person multi-person crowd activities (Actions, Interactions) | High | Pose estimation, Falling Detection, Security Surveillance | This survey provided details of Video-based activity and abnormal activity recognition methods
Vrigkas et al. [24] | 2015 | Actions, Behavior | Intermediate | Action Recognition, Behavior Understanding | This survey categorized the HAR into unimodal and multimodal approaches and supports the effectiveness of later approach
Cheng et al. [22] | 2015 | Multi-type of activities including Actions, Interaction | Simple | Action Recognition Systems | This survey focused on human action recognition-based approaches and few benchmark datasets have also been discussed
Zhu et al. [36] | 2016 | Actions | Simple | Action Recognition System | This survey covered the handcrafted and learned representations for human action recognition
Dawn and Shaikh [23] | 2016 | Actions | Simple | Action Recognition System | This survey discussed human action recognition with Spatiotemporal interest point (STIP) detector-based methods. Performance of selected methods has been discussed along with their results on different benchmarks
Sargano et al. [20] | 2017 | Actions, Interactions | Intermediate | Human activity Recognition | HAR approaches along with benchmarks have been discussed. Application areas have also been highlighted
Herath et al. [25] | 2017 | Multi type of activities including actions and Interaction | Intermediate | Daily Monitoring Systems, Activity Recognition Systems | This survey is focused on deep representation of action recognition domain. It provides the architectural details of different action recognition models along with performance on few benchmark datasets
Tripathi et al. [37] | 2018 | Abnormal Activities (Actions, Interaction, Group Activities) | High | Abandoned object Detection, Theft Detection, Violence Detection, Illegal Parking on Road Detection, Accidents Detection, Fire Detection | This survey is focused on suspicious activity recognition. Feature-based approaches along with classical machine learning methods have been described to explain state-of-the-art methods
Yao et al. [32] | 2019 | Daily activities, Sports activities (Actions, Interaction) | Intermediate | Human Activity Recognition System, Daily activity monitoring system, Sports System | This survey provided Convolutional neural network-based action recognition along with performance of popular methods on large-scale datasets and highlighted the limitations and future directions
Moreno et al. [28] | 2019 | Daily activities (Actions, Interactions) | Intermediate | Human activity recognition system. Monitoring Systems | The survey has divided the approaches into three main categories, i.e., handcrafted features, depth sensors, and deep learning-based approaches which are further explained briefly
Table 1 (continued)
Author | Year | Activities | Complexity | Application | Contribution
Wang et al. [27] | 2019 | Abnormal Actions, Behavior, and interactions | High | Human behavior recognition | Focused on sensor-based behavior recognition and described the process of channel state-based behavior recognition. They categorized methods into model based, pattern based, and deep learning-based approaches
Liu et al. [29] | 2019 | Actions, gestures, and interactions | Intermediate | Daily activity recognition, Gesture recognition, User identification, Indoor localization & tracking | Focused on Wi-Fi signal processing-based activity recognition. Explained different setups of wireless sensing strategies such as RSSI-based, CSI-based, FMCW-based, and Doppler shift-based methods
Zhang et al. [33] | 2019 | Actions, Interactions, Group Activity | High | Human Activity Recognition System, Action Detection System | The survey discussed both action recognition and action detection, whereas action recognition is further extended toward action representation methods and interaction recognition methods
Jegham et al. [26] | 2020 | Multi activities (Actions, Interactions) | Intermediate | Human Activity Recognition System | Highlighted the constraints and challenges faced during the process of activity recognition. Action recognition approaches and few benchmarks have also been described
Dang et al. [30] | 2020 | Sensor-based data for Action Recognition, Multi Activities (Actions, Interaction) | Intermediate | Ambient Living Environment. Daily Monitoring System. Human Activity Recognition System | Based on sensor and vision-based HAR including benchmarks for both. Focused on feature engineering and preprocessing methods used for HAR
Beddiar et al. [1] | 2020 | Multi type of activities (Actions, Interaction, Group activity) | High | Human Activity Recognition | Provided general overview of HAR, including approaches, datasets, evaluation measures, and challenges of the domain
Das et al. [34] | 2021 | Actions, Interactions | Intermediate | Real-time human activity recognition. Daily activity monitoring | Focused on methods used for real-time human activity recognition. Presented challenges of real-time HAR
Chaurasia et al. [31] | 2022 | Multi type of activities (Actions, Interaction, Group activity) | Complex | Daily activities, Military activities, Abnormal activities, Ambulation, Transportation activities | Authors have worked on activity recognition and classification (ARC) using smartphones and wearable sensors. Moreover, authors have concluded that ARC depends on the classification technique, number of sensors, device type, orientation, and placement. They have classified studies using ten parameters and highlighted domain challenges
Gupta et al. [35] | 2022 | Multi type of activities (Actions, Interaction, Group activity) | Complex | AI-based HAR applications, Hybrid AI models for HAR, Abnormal Human activities based | Authors have stated HAR design, dependability, and stability are major areas that need improvement to improve the HAR process
• This study also discusses HAR benchmark datasets, which have been used to perform experimentation and evaluation of methods. The datasets are discussed in terms of their characteristics, e.g., single-view, multi-view, RGB, and RGB-D information, as well as instance-based details. Every dataset serves a purpose, and their brief description can help researchers to choose one accordingly.

• State-of-the-art methods are analyzed based on predefined parameters to highlight strengths and limitations of the domain. This survey includes 46 state-of-the-art approaches presented since 2011, and we have divided the methods into three categories: online/offline, unimodal/multimodal, and handcrafted feature-based approach/learning-based approach. The selected studies are further classified based on the complexity of the activity (i.e., simple, intermediate, or complex), as well as the size of the dataset (i.e., small, medium, large). It also includes the recognition rate (Average Precision) of selected studies to highlight how studies perform as compared to each other. Hence, comparative analysis of various studies provides recent trends among the HAR research community and highlights open challenges for future research.

• The selected methods are classified as online/offline, unimodal/multimodal, and handcrafted feature-based approach/learning-based approach. The selected studies are further categorized based on activity complexity (i.e., simple, intermediate, complex) and size of dataset (i.e., small, medium, large). It also includes recognition rate (average precision) of selected studies to highlight their performance as compared to each other. Reported recognition rate may contribute toward significance of a selected study, but it is not a basis for comparison. We aim to provide the recent trend among HAR research community so that open challenges can be highlighted for future research.

• This study discusses HAR issues that were brought to light through comparison analysis, and it includes the environmental complexity of high intra-class variations and the inter-class similarity problem. Similarly, background, multi-view, and illumination variations are the primary issues that can affect the performance of the recognition system.

Section 2 provides a review of previous surveys and emphasizes the importance of this study. The characteristics of widely used video benchmarks for HAR are covered in Sect. 3. Section 4 then provides a taxonomy and detailed review of state-of-the-art HAR approaches to highlight research trends in HAR. Section 5 discusses the limitations of HAR and open research areas, and Sect. 6 concludes the study.

2 State-of-the-art HAR surveys

Human activity recognition is complex and involves a variety of tasks. For example, action representation-based approaches need feature extraction and descriptor-based methods. Human activity analysis is complex and performed by using both machine learning and deep learning approaches, whereas we have conducted a survey on different approaches of HAR and categorized HAR into input processing strategy-based, modality-based, and model-based approaches. In previous years, authors have contributed toward HAR and presented specific to general
Activity dataset | No. of Videos (Resolution)/FPS | No. of Actions (Actors) | View | Depth (D) | Activity types | Application Areas
KTH [42] | 600 (160 × 120)/25 | 6 (25) | S | RGB | Actions | Human action recognition in outdoor conditions
Weizmann [43] | 90 (180 × 144)/50 | 10 (9) | S | RGB | Actions | Human action recognition
UCF Sports [45] | 150 (720 × 480)/10 | 10 | S | RGB | Actions, Interactions (human–object) | Sports actions recognition
Olympic Sports [48] | 783 | 16 | S | RGB | Actions, Interactions (human–object) | Sports actions recognition
Hollywood [49] | 233 (400 × 300, 300 × 200)/24 | 8 | S | RGB | Actions, Behavior, Interactions, Group Activity | Activity recognition, Behavior Understanding, Interaction Recognition, Event Detection
UCF50 [50] | 6681 (320 × 240)/25 | 50 | S | RGB | Actions, Interactions (human–object) | Human Sports activity recognition
UCF101 [45] | 13,320 (320 × 240)/25 | 101 | S | RGB | Actions, Behavior, Interactions, Group Activity | Human activity recognition
YouTube Sports 1 M [51] | 1,133,158 | 487 | S | RGB | Actions, Interactions (human–object) | Human Sports activity recognition
IXMAS [47] | 1650 (390 × 291)/23 | 13 (11) | M | RGB | Actions | Multi-view-invariant action recognitions
ActivityNet [52] | 27,801 (1280 × 720)/30 | 203 | S | RGB | Actions, Behavior, Interactions, Group Activity | Human activity and behavior understanding
YouTube 8 M [53] | ~800,000 | 4716 | S | RGB | Actions, Behavior, Interactions, Group Activity | Human activity and behavior understanding
HMDB51 [54] | 6766 (320 × 240)/30 | 51 | S | RGB | Actions, Behavior, Interactions, Group Activity | Human activity and behavior understanding
CASIA Action [55] | 1446 (320 × 240)/25 | 8 (24) | M | RGB | Actions, Behavior, Interaction | Human behavior and interaction-based systems
AVA [56] | 430 | 80 | M | RGB | Actions, Interactions | Poses, person to person interaction and person-object interaction Recognition
UCF Crime [57] | 1900 | 13 | S | RGB- | Actions, Behavior, Interactions, Group Activity | Security Surveillance
UTKinect [44] | 200 (320 × 240)/30 | 10 (10) | S | RGB-D | Actions | Human actions
MSR Action 3D [58] | 567 (640 × 480)/15 | 20 (7) | S | RGB-D | Actions | Sports Gesture recognition
MSR Action Pairs [59] | 180 (320 × 240)/30 | 10 (12) | S | RGB-D | Actions | Action pairs recognitions
SYSU-3D HOI [60] | 480 (640 × 480)/30 | 40 (12) | S | RGB-D | Actions, Interactions (human–object) | Daily activity Recognition
CAD-60 [61] | 60 (640 × 480)/25 | 12 (4) | S | RGB-D | Actions | Daily activity recognition
CAD-120 [62] | 120 (640 × 480)/25 | 10 (4) | S | RGB-D | Actions | Action labeling, human and object tracking
UTD-MHAD [63] | 861 (512 × 424)/30 | 27 (8) | S | RGB-D | Actions, Interactions | View-invariant human action recognition
RGB-D HuDaAct [64] | 1189 (640 × 480)/30 | 12 (30) | M | RGB-D | Actions, Interactions | Daily activity recognition
Table 2 (continued)
Activity dataset | No. of Videos (Resolution)/FPS | No. of Actions (Actors) | View | Depth (D) | Activity types | Application Areas
Berkeley MHAD [65] | 660 (640 × 480)/30 | 11 (12) | M | RGB-D | Behavior | Human behavior Recognition
Northwestern-UCLA [41] | 1475 (640 × 480)/30 | 10 (10) | M | RGB-D | Actions, interactions | Cross-view action recognition
UWA3D Multi-view [46] | 900 (640 × 480)/30 | 30 (10) | M | RGB-D | Actions | Similar and cross-view action recognition
LIRIS [66] | 9800 (640 × 480, 720 × 576)/25 | 828 (21) | M | RGB-D | Actions, Interactions | Human activity recognition
G3Di [67] | 574 (640 × 480)/30 | 12 (15) | S | RGB-D | Actions, Interactions | Gaming interaction activity
NTU RGB+D [68] | 56,880 (512 × 424, 1920 × 1080)/30 | 60 (40) | M | RGB-D | Actions, Behavior, Interaction | Daily Activity Recognition, Health surveillance systems
ShakeFive [69] | 100 | 2 (37) | S | RGB-D | Actions | Handshake Recognition
survey-based studies, which are discussed in this section, and Table 1 summarizes these surveys.

Action representation-based survey

In 2013, Vishwakarma et al. [18] published a survey on surveillance-based activity recognition that primarily covers classical HAR approaches. They classified HAR approaches as hierarchical or non-hierarchical. The survey provides a review of motion detection and object tracking methods, and characteristics of a few HAR datasets are discussed. Ke et al. [21] published a survey to provide a general framework of HAR, which includes object segmentation techniques, feature extraction techniques, activity detection techniques, and classification techniques. The authors of both [18] and [21] have thoroughly discussed the handcrafted approaches used in HAR. Cheng et al. [22] discussed an approach similar to that used in [18] and provided characteristics of action recognition benchmarks. Dawn and Shaikh [23] used spatiotemporal interest points (STIP) to emphasize the effectiveness of STIP detectors, as STIP detectors can improve HAR tasks because of their robustness.

Handcrafted vs. learned representation-based survey

Vrigkas et al. [24] presented their findings based on unimodal and multimodal approaches that are further subdivided to discuss HAR. The survey’s focus is skewed toward multimodal approaches because they provide a better feature set for learning. They have highlighted challenges faced by multimodal approaches, such as computational cost. It includes both traditional ML and advanced deep learning models, i.e., CNN. In addition, the survey provides a review of a few publicly available datasets that can be used for HAR. Zhen et al. [19] published a survey that discussed two major HAR approaches: learned representation and handcrafted representations. Each one is further subdivided to analyze both categories, and the survey highlighted the strength of deep learning-based approaches. The survey in [19] was the first survey to compare traditional approaches with modern deep learning-based approaches. Similarly, Sargano et al. [20] presented a survey on handcrafted vs. learning-based approaches in 2017. They discussed a few publicly available HAR datasets and popular HAR applications. In contrast to [20], Herath et al. [25] conducted a survey focusing on deep representation of action recognition. It has thoroughly discussed the popular handcrafted HAR features such as optic flow, motion history image, trajectories, and other motion descriptors. They have also shown the architectural differences between popular networks like spatiotemporal networks, multiple stream networks, deep generative networks, and temporal coherency networks. Then, in 2020, Jegham et al. [26] attempted to provide a quantitative analysis of a few popular methods while also discussing their applicability in various scenarios. The primary goal of their work is to highlight HAR issues through comparative analysis.

Sensor-based survey

Authors in [27] have surveyed channel state-based behavior recognition and thoroughly described the concept of channel state information. They have provided details of methods used for channel state-based behavior recognition and categorized it as refined behavior recognition, coarse behavior recognition, and inference activity. They have described channel state information-based behavior recognition with the help of three application areas which are model based, pattern based, and deep learning-based approaches. Authors have considered five major aspects for describing behavior recognition application, which are experimental equipment, experimental environment, behavior type, classifier, and performance. The authors in [28] have discussed sensor-based HAR systems and showed handcrafted feature-based approaches and deep learning-based approaches. Authors in [29] have presented a survey on HAR through wireless signal (e.g., Wi-Fi) as motion of the human body affects the wireless signal propagation. The authors have described the basic strategy and structure of wireless sensing environment for HAR. They have presented a variety of HAR applications which can be recognized by using wireless sensing technology such as fraud detection, daily activity monitoring. The authors have categorized sensing strategies based on HAR into received signal strength indicator-based (RSSI), channel state information-based (CSI), frequency shift for
study. As a result, this survey attempts to combine all necessary elements of HAR to show its multidisciplinary nature. These elements include feature-based methods, classification-based methods, multi-modality-based methods, online learning-based methods, datasets used for these methods, and state-of-the-art approaches of HAR. Moreover, it attempts to highlight limitations of HAR and provide open research directions.

3 Activity recognition datasets

So far, many benchmark datasets have been published, covering a wide range of activities. The choice of dataset influences the selection of a suitable approach for human activity recognition. Regarding datasets, HAR presents several challenges, such as inter-/intra-class variations and the environmental setup used while recording actions (indoor/outdoor, camera, view angles). Inter-/intra-class variation occurs because of the unique nature of each human. For instance, when walking, some people take small steps while others take gigantic steps, and some people avoid obstacles while others jump over them. Moreover, action classes may overlap because complex actions comprise smaller actions; for example, the fighting class involves punching, kicking, and thrusting. HAR datasets involve a lot of variations, which are explained in a few previous surveys. In [38], the authors divided datasets into three categories: heterogeneous actions, specific actions, and others. The heterogeneous actions include different types of actions, for example, walking, jumping, running, etc. Specific actions include application-based datasets such as datasets of crowd behavior, abandoned objects, activities of daily living (ADL), fall detection, and pose & gesture, whereas the other category has datasets of motion capture (MOCAP), infrared and thermal. The authors have discussed HAR datasets and explained their characteristics, including publishing year, number of videos, actors involved, type of actions, application area, view information, and ground truth data of HAR datasets. The authors have presented a variety of methods used for each dataset. In [38], datasets were classified based on actions, whereas in [39], the authors have discussed RGB-D (Fig. 3) video datasets. They have included characteristics of 27 single view action datasets, 10 multi-view datasets, and 7 multi-person datasets. The survey contains information about publishing year, number of videos, actions, and actors, and dataset complexity issues. It provides details of dataset splits (i.e., test, train, validation) and discusses some HAR methods for each dataset. In another survey [40], the authors have classified datasets into RGB and RGB-D to discuss challenges of HAR datasets. They have highlighted five distinct challenges of datasets, which are illumination, view variation, occlusion, annotations, and fusion of modalities. They have discussed HAR methods for each dataset and also discussed HAR studies to highlight dataset challenges. Considering available reviews on HAR datasets, this study highlights the major findings of datasets and provides discussion to support HAR benchmark analysis. Therefore, we have shown the major classification of datasets in Table 2 using attributes such as image resolution, camera view, modality and type of activity, and the respective application areas, etc.

This survey presents a collection of HAR benchmark datasets organized by data view or data acquisition mode, i.e., single view or multi-view. Figure 4 illustrates the HAR task variation across datasets, including type of activities and modality variations. A few HAR datasets are discussed below under single view and multi view datasets.
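The organization of benchmarks by view and modality can be sketched as a small catalogue. The Python sketch below is illustrative only: the `HarDataset` class and the two helper functions are hypothetical names introduced here, not part of any HAR library, and the five entries simply copy the video counts, action counts, view code (S or M), and modality listed in Table 2.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HarDataset:
    """One benchmark entry, using the column notation of Table 2."""
    name: str
    num_videos: int
    num_actions: int
    view: str      # "S" = single view, "M" = multi-view
    modality: str  # "RGB" or "RGB-D"


# A few rows copied from Table 2 (illustrative subset, not the full table).
DATASETS = [
    HarDataset("KTH", 600, 6, "S", "RGB"),
    HarDataset("UCF101", 13320, 101, "S", "RGB"),
    HarDataset("IXMAS", 1650, 13, "M", "RGB"),
    HarDataset("UTKinect", 200, 10, "S", "RGB-D"),
    HarDataset("NTU RGB+D", 56880, 60, "M", "RGB-D"),
]


def group_by_view(datasets):
    """Map each view code to the dataset names recorded with that view."""
    groups = defaultdict(list)
    for dataset in datasets:
        groups[dataset.view].append(dataset.name)
    return dict(groups)


def select(datasets, view=None, modality=None):
    """Return names of datasets matching the optional view/modality filters."""
    return [
        dataset.name
        for dataset in datasets
        if (view is None or dataset.view == view)
        and (modality is None or dataset.modality == modality)
    ]


if __name__ == "__main__":
    print(group_by_view(DATASETS))
    print(select(DATASETS, view="M", modality="RGB-D"))
```

A filter of this kind mirrors how a researcher might shortlist benchmarks, for example multi-view RGB-D datasets for cross-view recognition, before reading the per-dataset descriptions.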
3.1 Single-view action dataset

The single-view dataset is captured using a single camera and a single view, so it does not involve view complexity in sequences, as shown in Fig. 4.

3.1.1 KTH

The KTH dataset [42] is based on six different actions (i.e., walking, running, jogging, boxing, waving, and clapping), and these actions are performed by 25 actors. While performing these actions, four different background variations are used, which are indoor, outdoor, scale variation within outdoor, and trying different clothes. The dataset includes 2391 videos captured through a static camera at a rate of 25 frames per second (fps) with a resolution of 160 × 120. The dataset is provided in training (performed by 8 persons), validation (performed by 8 persons), and test (performed by 9 persons) splits, but it does not include extracted silhouette of different actions and background of action.

3.1.2 Weizmann

The Weizmann dataset [43] includes 10 types of actions including running, walking, skipping, forward jump, up-down jump, galloping, 2-hands waving, 1-hand waving, and leaning performed by nine actors. The dataset includes a total of 93 sequences captured through a static camera at a resolution of 180 × 144 and a rate of 25 fps, with an additional 10 sequences of walking (captured from a different viewpoint). The background of the captured data is subtracted and the actions happening in the background are also included in the dataset.

3.1.3 UTKinect

The UTKinect dataset [44] is composed of 10 different actions, which include activities like walk, sit-down, stand-up, carry, throw, pull, push, wave and clapping. The actions were performed by 10 actors, and each action is performed twice and captured from a variety of views, giving 200 sequences. Along with action videos, labels and actions happening in the background are also included in the dataset.

3.1.4 UCF 101

The UCF 101 [45] dataset includes 13,320 RGB videos with 101 action categories which belong to 25 different groups, and each group has 4–7 videos. All actions belong to five major groups: human–human interaction, human–object interaction, body motion, sports, and playing music, as shown in Fig. 5. The UCF 101 provides realistic action videos rather than staged videos, which improves the overall recognition task.

3.1.5 Multi-view datasets

The multi-view datasets are usually captured in one of two ways: multiple cameras at different angles or by using different viewpoints, as shown in Fig. 6.

3.1.6 UWA3D multiview

The UWA3D Multiview dataset [46] includes a variety of sequences that were captured in a row with no pause. All these actions were performed by 10 different actors. They have performed different actions which include punching and waving with one hand, sitting down, and standing up, holding chest, walking, turning around, drinking, bending, running, holding head, holding back, kicking, jumping, mopping floor, sneezing, sitting down (chair), squatting, two hands waving, two hand punching, vibrating, falling, irregular walking, lying down, phone answering, jumping jack, picking up, putting down, dancing, and coughing. This dataset is available in two versions: a single view version with 30 activities performed twice/thrice by actors,
Wang et al. [208] | 2011 | Dense trajectories are used to find actions from the data, along with histogram of oriented gradients, optic flow, and motion boundary information | Offline, Unimodal, Handcrafted features, Simple, Small | UCF Sports: 88.2%, Hollywood: 58.3%
Kliper-Gross et al. [209] | 2012 | Addresses action recognition from unconstrained videos with a representation-based architecture that extracts motion interchange patterns from action data. A general set of feature descriptors then shows the importance of the feature set | Offline, Unimodal, Handcrafted Approach, Intermediate, Medium | HMDB-51: 29.2%, UCF-50: 68.5%
Oneata et al. [210] | 2013 | Performed action and event recognition by classifying short actions, locating these actions in lengthy movie videos, and recognizing complex events. Fisher vectors are used instead of BoW, and a set of handcrafted features, including motion boundary histogram and SIFT, is used to process the input data. The data are normalized using L2-normalization and classified with a linear classifier. A second approach that excludes human detectors is also evaluated | Offline, Unimodal, Handcrafted features, Complex, Medium | HMDB-51: 55.9%, UCF-50: 90.5%, Hollywood: 63%, Olympic Sports: 91.2%
Wang and Schmid [154] | 2013 | Based on extraction of optic flow information, which encodes the motion pixel value, combined with extracted trajectories of data for action recognition (Trajectories, HoF feature descriptors) | Offline, Unimodal, Handcrafted features, Simple, Medium | HMDB-51: 57.2%, UCF-50: 91.2%, Hollywood: 64.3%, Olympic Sports: 91.1%
Jain et al. [211] | 2013 | Extracts motion-based information using a representation-based method to detect actions from data. A finite set of feature descriptors is incorporated, which includes HoG, Traj, MBH, HoF, and DCS. The extracted features are then fed to the VLAD encoding technique | Offline, Unimodal, Handcrafted features, Simple, Medium | HMDB-51: 52.1%, Hollywood: 62.5%
Peng et al. [212] | 2014 | Performed action recognition using a representation-based method with stacked Fisher vectors (SFV) and Fisher vectors to extract action representations. SFV provides a refined representation and abstract semantic information in a layered manner, supporting mid-level as well as high-level activity recognition | Offline, Unimodal, Handcrafted features, Complex, Medium | HMDB-51: 66.8%
Simonyan and Zisserman [213] | 2014 | Extracted appearance-based information from still frames and motion information between frames. Action recognition from videos is performed with a deep neural network and transfer learning. The authors used a two-stream convolutional neural network; moreover, multi-task learning is used to improve the results by adding classes from both action datasets, i.e., HMDB-51 and UCF-101 | Offline, Unimodal, Deep learning, Intermediate, Medium | HMDB-51: 59.4%, UCF-101: 88.0%
Table 3 (continued)
Reported paper | Year | Methodology | Online/Offline, Modality, HAR Approach, Activity Level, Training Data | Mean precision
Karpathy et al. [51] | 2014 | Used a deep network (CNN) based on spatiotemporal information to perform action recognition from large-scale videos and opted for a slow fusion-based learning strategy | Offline, Unimodal, Deep learning, Intermediate, Large | Clip Hit Sports-1M: 41.9%, Sports-1M: 60.9%, UCF-101: 63.3%
Sun et al. [214] | 2015 | Worked on factorized spatiotemporal convolutional networks (FstCN), which factorize the original 3D kernel into 2D kernels for action recognition, i.e., two-stream ClarifaiNet | Offline, Unimodal, Deep learning, Intermediate, Medium | HMDB-51: 59.1%, UCF-101: 88.1%
Wang et al. [215] | 2015 | Worked with deep convolutional networks for action recognition and used two-stream GoogLeNet and two-stream VGG-16. The aim is to overcome the overfitting problem caused by small datasets, so both the spatial and temporal nets are pretrained and then trained with a low learning rate and a high drop-out ratio, along with data augmentation | Offline, Unimodal, Deep learning, Intermediate, Medium | UCF-101: 91.4%
Wang et al. [196] | 2015 | Proposed the trajectory-pooled deep convolutional descriptor (TDD) for action recognition, and a second method that uses TDD together with histogram of optic flow | Offline, Unimodal, Handcrafted features, Intermediate, Large | HMDB-51: 65.9%, UCF-101: 91.5%, Conv pooling hit Sports-1M: 72.4%
Yue-Hei-Ng et al. [216] | 2015 | Worked on handling full-length videos and proposed two methods. The first searches for the best CNN design through a convolutional temporal feature pooling architecture. The second treats the video as an ordered sequence of frames, which is modeled using an RNN (LSTM) combined with the output of the CNN | Offline, Unimodal, Deep learning, Complex, Large | Sports-1M: 73.1%, LSTM (image + opt flow) UCF-101: 88.6%
Fernando et al. [217] | 2015 | Proposed to use video-wide temporal information to follow sequences using a ranking machine, which assigns ranks to produce an action representation; the method is named rank pooling | Offline, Unimodal, Handcrafted features, Intermediate, Medium | HMDB-51: 63.7%, Hollywood: 73.7%
Donahue et al. [218] | 2015 | Proposed a recurrent convolutional architecture for large-scale visual learning (LRCN), which learns temporal dynamics along with convolutional perceptual representations of actions within videos | Offline, Unimodal, Deep learning, Complex, Medium | UCF-101: 82.9%
Wu et al. [80] | 2015 | Proposed a multi-stream architecture for multimodal feature extraction, in which CNNs extract multiple features from videos. An LSTM is then used to learn long-term temporal variations in the data. Both methods are fused to perform activity recognition | Offline, Multimodal, Deep learning, Complex, Medium | UCF-101: 92.2%, Columbia Consumer Videos: 84.9%
Jiang et al. [219] | 2015 | Proposed a representation-based method that extracts motion-related information from data using global and local referencing to overcome the camera movement problem in unconstrained videos | Offline, Unimodal, Handcrafted features, Intermediate, Medium | HMDB-51: 57.3%, UCF-101: 78.5%, Hollywood: 55.2%, Olympic Sports: 80.6%
Lan et al. [220] | 2015 | Proposed Multi-Skip feature stacking (MIFS), which stacks the extracted features in the form of differential filters and so helps prevent data loss at the coarse level. MIFS helps in action matching at different speeds and ranges and speeds up feature extraction | Offline, Unimodal, Handcrafted features, Intermediate, Medium | HMDB-51: 65.1%, UCF-101: 89.1%, UCF-50: 94.4%, Hollywood: 68%3, Olympic Sports: 91.4%
Tran et al. [221] | 2015 | Proposed a deep three-dimensional convolutional neural network (3D ConvNet) for spatiotemporal feature extraction (C3D), which uses a linear classifier and significantly improves action recognition | Offline, Unimodal, Deep learning, Intermediate, Medium | UCF-101: 90.4%
Soomro et al. [72] | 2016 | A few frames are converted into super-pixels, which are combined with spatiotemporal points to extract action segments; dynamic programming on SVM scores is then performed for action prediction. Pose and appearance data are incorporated in an online manner | Online, Unimodal, Handcrafted features, Simple, Small | UCF Sports: 83.7%
Fernando and Gould [222] | 2016 | Proposed a temporal pooling layer that can be incorporated into any convolutional neural network, such as VGG-16 and AlexNet. The pooling layer encodes temporal semantics from long videos, which are converted into fixed-length vectors | Offline, Unimodal, Deep learning, Intermediate, Small | UCF Sports: 87%, Hollywood: 40.6%
Fernando et al. [223] | 2016 | Proposed hierarchical rank pooling, which extracts the dynamics of CNN features from video sequences using a rank pooling function. Rank pooling is then combined with a non-linear feature function to provide a video encoding mechanism | Offline, Unimodal, Deep learning, Complex, Medium | HMDB-51: 66.9%, UCF-101: 91.4%, Hollywood: 76.7%
Li et al. [224] | 2016 | Proposed a VLAD video representation framework based on a linear dynamic system, which captures video data over short, medium, and long ranges and includes motion and deep features of a video | Offline, Unimodal, Handcrafted features, Complex, Large | THUMOS15: 80.8%, Olympic Sports: 96.6%, UCF-101: 90.9%
Feichtenhofer et al. [225] | 2016 | Worked on fusion methods for convolutional networks and proposed that spatiotemporal fusion be performed at a convolutional layer rather than at the softmax layer | Offline, Unimodal, Deep learning, Intermediate, Medium | HMDB-51: 69.2%, UCF-101: 93.5%
Varol et al. [226] | 2017 | Proposed the LTC-CNN model for video representation based on long-term temporal convolutions (LTC); moreover, raw pixels and optic flow estimation features are incorporated within the model to improve action recognition | Offline, Unimodal, Deep learning, Simple, Medium | HMDB-51: 67.2%, UCF-101: 92.7%
Jalal et al. [70] | 2017 | Presented spatiotemporal multi-fused features for online activity recognition, which include joint features, torso and key joint-based distant features, HoG, and a few others | Online, Multimodal, Handcrafted features, Complex, Large | MSR Actions 3D: 93.3%, 1 M-MSR Daily Depth Activity: 74.3%
Singh et al. [227] | 2017 | Proposed a novel graphical representation for abnormal activity recognition by introducing geometric structure along with motion and appearance-based information. Activity classification is performed through SVM, and global abnormal activity is detected through Bag of Words (BoW) using the STIP, SIFT, and DT feature set | Offline, Unimodal, Handcrafted features, Intermediate, Small | UCSD ped1: 97.14%, UCSD ped2: 90.13%, UMN: 95.24%
Carmona et al. [228] | 2018 | Worked on improved dense trajectories (IDT) by incorporating more temporal template-based features; three templates are constructed in the form of a third-order tensor | Offline, Unimodal, Handcrafted features, Intermediate, Medium | KTH: 97.5%, Weizmann: 98.8%, HMDB-51: 65.3%, UCF-101: 89.3%
Zolfaghari et al. [74] | 2018 | Proposed an online recognition architecture (ECO) that uses feature representations from all video frames, which are then fed to a CNN. The model uses half of the frames from the current sequence and half from the incoming queue data to reduce the overhead | Online, Unimodal, Deep learning, Intermediate, Intermediate | UCF-101: 93.3%, HMDB-51: 68.7%
Mukherjee et al. [81] | 2018 | Proposed a motion capture strategy and produced dynamic images from RGB and depth videos separately using a ResNet-101 network. The dynamic image reduces complexity by extracting a sparse matrix from the video, and the resultant framework is fast and memory efficient | Offline, Multimodal, Deep learning, Complex, Medium | MSR Action 3D: 96.17%
Zhang et al. [82] | 2018 | Proposed a semantics-based multi-stream deep neural network for action attribute learning and action recognition, along with zero-shot action recognition. It also combines semantics in graph regularization, and joint learning is achieved using the ADMM optimization algorithm | Offline, Multimodal, Deep learning, Complex, Medium | MRA: 72.03%, UTA: 81.89%, MRP: 94.69%; Accuracy: MSR Actions 3D: 93.40%, UTA Action 3D: 87.88%, MSR Action Pairs: 99.44%
Mao et al. [229] | 2018 | Proposed a deep convolutional graph neural network and used a self-attention graph pooling mechanism for action classification | Offline, Unimodal, Deep learning, Complex, Large | YouTube-8M: 87.7%
Siddiq et al. [230] | 2019 | Proposed a feature selection approach named normalized mutual information-based feature selection (NMIFS), which extends both max-relevancy and min-redundancy. A combination of Curvelet transform, LDA, and HMM is used to demonstrate state-of-the-art results | Offline, Unimodal, Handcrafted features, Simple, Small | Accuracy: KTH: 99%, Weizmann: 98.2%
Lin et al. [71] | 2019 | Proposed the temporal shift module (TSM) to achieve efficiency with high performance. TSM provides 3D CNN performance at the cost of a 2D CNN and is further categorized as unidirectional TSM (online recognition) and bidirectional TSM (offline recognition) | Online, Multimodal, Deep learning, Complex, Large | Accuracy: Something-Something V2: 50.7%, Kinetics: 76.3%
Franco et al. [83] | 2020 | A multimodal approach based on two data streams, i.e., skeleton data and RGB data. Skeleton data provide human posture-based data, whereas RGB provides temporal information for evaluating the action, which improves action recognition | Offline, Multimodal, Handcrafted features, Complex, Small | CAD-60: 98.8%, CAD-120: 85.4%, Office Activity: 90.6%
Zhang et al. [231] | 2020 | Addresses the bad sample problem that arises from the random cropping technique by proposing a motion patch-based Siamese convolutional neural network (MSCNN). The motion patch is based on extracting the critical square region of motion | Offline, Unimodal, Deep learning, Simple, Medium | Pretrained on Kinect UCF-101: 96.8%, HMDB-51: 74.8%
Arzani et al. [232] | 2020 | Worked on a human–robot interaction system that handles both simple and complex activities and used probabilistic graphical models (PGMs) to design a structured prediction strategy. A deterministic switch is used to identify simple and complex activity subspaces considering all possible activities | Offline, Unimodal, Handcrafted features, Complex, Small | UT Kinect: 100%, Florence 3D Dataset: 96.11%, CAD-60: 97.6%
Gowda et al. [233] | 2020 | Proposed the SMART model, which provides an efficient frame selection strategy for videos and is based on the temporal segment network (TSN and Kinetics) | Offline, Unimodal, Deep learning, Complex, Large | Accuracy: ActivityNet: 84.4%, UCF-101: 98.6%, HMDB-51: 84.36%
Gowda et al. [198] | 2021 | Worked on zero-shot learning using a reinforcement method and proposed a clustering framework (CLASTER) that can take all training data at once rather than using individual optimization. The model is trained on activity recognition benchmark datasets and then tested on unseen real-world examples, which makes the task complex but close to real-world scenarios | Offline, Unimodal, Deep learning, Complex, Medium | Accuracy when tested on unseen data while training data: [Olympic Sports: 68.8%, HMDB-51: 53.3.4%, UCF-101: 69.3%]
Wharton et al. [234] | 2021 | Proposed a Coarse Temporal Attention Network (CTA-Net), which aims to capture high-level temporal data to learn useful spatial and temporal variations in a video | Offline, Unimodal, Deep learning, Complex, Large | SBU Kinect Interaction: 92.9%
Ullah et al. [235] | 2021 | Proposed a sequential extraction method that uses an optical CNN model, together with a Deep Skip Gated Recurrent Unit for sequential pattern learning | Online, Unimodal, Deep learning, Complex, Large | HMDB-51: 64.98%, UCF-101: 86.39%, UCF-50: 91.29%, Hollywood2: 68.21%, YouTube Actions: 92.63%
Khan et al. [236] | 2021 | Worked on the feature extraction process to improve action recognition and used shape features along with deep learning features to improve learning. For example, entropy-controlled LSVM maximization is used for robust feature extraction | Offline, Unimodal, Handcrafted features, Intermediate, Medium | KTH: 98.66%, Weizmann: 99.1%, UCF Sports: 99.12%, UT Interaction: 100%
Ullah et al. [237] | 2021 | Proposed a multi-view action recognition method that performs frame-level feature extraction and feeds the features forward to a conflux LSTM. A correlation coefficient is then computed using view inter-reliant pattern learning, and action classification is performed | Offline, Multimodal, Deep learning, Intermediate, Medium | MCAD: 86.9%, Northwestern-UCLA: 88.9%
Reinolds et al. [238] | 2022 | The authors compared the performance of video-based and audio-based activity recognition. Classification is performed for both types of input by extracting features for each | Online, Multimodal, Deep learning, Intermediate, Large | Real-Life Violence situations: 89%
Siddiqi et al. [239] | 2022 | The authors used a mutual information algorithm and expanded the max-relevance and min-redundancy methods to select optimal features. Features are extracted through the symlet wavelet transform, and action classification is then performed through a hidden Markov model | Offline, Unimodal, Handcrafted features, Simple, Medium | Kinect depth dataset: 98.2%
Khare et al. [240] | 2022 | Proposed a multiresolution video analysis scheme and used local binary pattern (LBP) along with Zernike moment (ZM) | Offline, Unimodal, Handcrafted features, Intermediate, Medium | KTH dataset: 96.38%, CASIA dataset: 98.82%
Deotale et al. [241] | 2022 | Proposed a four-step activity recognition method that involves frame conversion, human body detection, action recognition, and estimation of the occurrence time of the action, using two-stream data (i.e., RGB image and optic flow) through a CNN-based network | Offline, Multimodal, Deep learning, Complex, Large | ActivityNet: 39.37%
Zhang et al. [242] | 2022 | Proposed ActionFormer, an efficient method for timely action recognition in a single-shot setting. It aggregates multiscale feature representations and local self-attention information, which are forwarded to a decoder to perform action recognition | Offline, Unimodal, Deep learning, Complex, Large | ActivityNet: 53.5%, THUMOS: 65.6%
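The unidirectional (online) and bidirectional (offline) temporal shift summarized for TSM [71] in Table 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `temporal_shift`, the `fold_div` parameter, and the choice of which channel groups are shifted are assumptions made for illustration only. The sketch shows that, in the online variant, a frame receives features only from the preceding frame, whereas the offline variant also borrows features from the following frame.

```python
import numpy as np

def temporal_shift(x, fold_div=8, bidirectional=True):
    """Shift a fraction of channels along the time axis (illustrative sketch).

    x: array of shape (T, C, ...) holding per-frame features.
    fold_div: hypothetical parameter; C // fold_div channels form one shifted group.
    """
    x = np.asarray(x)
    fold = x.shape[1] // fold_div
    out = x.copy()  # channels outside the shifted groups stay unchanged

    # Group 1: each frame receives the features of the previous frame (past -> current).
    out[1:, :fold] = x[:-1, :fold]
    out[0, :fold] = 0  # the first frame has no predecessor

    if bidirectional:
        # Group 2 (offline only): each frame receives the features of the next frame.
        out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]
        out[-1, fold:2 * fold] = 0  # the last frame has no successor
    return out

# Usage: 4 frames with 8 channels each; fold_div=4 gives groups of 2 channels.
features = np.arange(32, dtype=float).reshape(4, 8)
online = temporal_shift(features, fold_div=4, bidirectional=False)
offline = temporal_shift(features, fold_div=4, bidirectional=True)
```

Because the online variant never reads a later frame, its output for frame t can be computed as soon as frame t arrives, which is the property that makes it suitable for streaming input.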
3.1.8 IXMAS
and modality information. Dataset characteristics may help in choosing a dataset for a specific model: large-scale datasets are appropriate for deep learning-based methods (e.g., CNN, RNN), whereas small-sized datasets are typically used to validate handcrafted feature-based approaches. Small datasets are ineffective for deep learning-based models, which require massive amounts of training data. The Weizmann dataset is a small dataset with 90 videos, whereas the YouTube-8M dataset is the largest. Availability of large amounts of data is no longer a problem because of cheap CCTVs everywhere, but labeling that data remains difficult. The variety of available datasets nevertheless simplifies the task and provides flexibility when validating a method. Along with the size of the dataset, the number of videos within a class is important when describing the quality of the dataset. It is preferable if each class within a dataset has an equal number of videos to avoid class imbalance.

3.2 Discussion

HAR benchmark datasets are complex to analyze because they try to mimic real-life scenarios based on human activities. The purpose of a HAR benchmark is to provide a close representation of human behavior in different scenarios. One of the most important aspects of a dataset is its relation to reality, and a closer relation between the two leads to better human activity recognition. In daily life, illumination, scene variations, occlusion, and background activities vary widely. However, many datasets have not focused on such issues and were recorded in a controlled environment. The majority of HAR datasets are actor based, which means they include activities performed by different actors. For example, a few daily life activity datasets do not focus on occlusion and background activities: the KTH [42], UT Kinect [44], and Northwestern-UCLA [41] datasets have a static background. KTH [42] and Weizmann [43] are small action datasets, and most recent methods achieve close to 100% accuracy on them. The reason is that both datasets have a clear background with no occlusion and simple actions, which most recent HAR methods classify almost perfectly. That is why both datasets can be used as a good starting point but cannot be scaled up to complex HAR scenarios.

A few datasets that have considered occlusion and background variations are useful for gaming/sports systems, e.g., MSR Action 3D [58] and G3Di [67]. The MSR Action 3D dataset has RGB and depth information, but both channels are recorded separately, which causes a synchronization problem. UCF-101 [45] and HMDB-51 [54] are daily activity-based datasets of intermediate size, which offer dynamic backgrounds and can be used for evaluating daily activity monitoring-based systems. The activities include human-to-human and human-to-object interactions and are useful for evaluating human–computer interaction (HCI) systems. CAD-60 [61] is an RGBD dataset of daily actions recorded in five different scene variations, but it has a class imbalance problem.

NTU-RGBD [68] is a large dataset with 56,880 videos recorded in a laboratory under strict guidelines, which makes it only partially useful for real-time activity recognition. It has daily life activities and health-related actions such as falling and sneezing. The NTU-RGBD [68] dataset can be used for evaluation of healthcare surveillance and daily activity monitoring systems. Sports-1M [51] and YouTube-8M [53] are large-scale datasets that offer background variation; the occlusion and complexity of these datasets can be scaled up. The Sports-1M dataset has a substantial variation in sports actions, which are annotated. The annotation or labeling is performed by a content-based retrieval strategy and may therefore be inaccurate. UCF Crime [57] is a large-scale dataset with 1900 videos of 13 different anomalies. The dataset exhibits inter-class and intra-class problems, which may result in an increased false positive rate. UCF Crime is also unbalanced, which means a few classes have a significantly larger amount of data than others. Considering the datasets listed in Table 2, there is a lack of 3D datasets captured in unconstrained environments. The majority of datasets avoid background and distant activities, which are useful in real-time scenarios, for example, surveillance systems.

4 Human activity recognition approaches

HAR is used in various daily life systems and can be performed by a variety of methods. This emphasizes the need for a HAR taxonomy to discuss existing approaches. Previous surveys focus on specific tasks; for instance, [24] discussed only unimodal and multimodal HAR approaches, [20] used handcrafted vs learned representations to discuss HAR, [18] used only a division into single-layered and hierarchical approaches, and [1] discussed both handcrafted vs learned representations and unimodal vs multimodal approaches. This study proposes a top-down taxonomy that can encompass all methods, from simple to complex. Figure 7 depicts input data variations within HAR, while Fig. 8 depicts the taxonomy.

Human activity recognition can be done offline (via stored videos) or online (via a live stream), which is critical when dealing with real-time systems. Another variant is the source of modalities, which refers to either unimodal or multimodal methods. Unimodal methods rely on a single modality for input, whereas multimodal methods may use multiple modality inputs, for example, depth, audio cues,
and skeleton data [24]. HAR includes simple offline-unimodal methods [51] as well as complex online-multimodal systems [70] [71]. All HAR systems, whether online/offline or unimodal/multimodal, rely on handcrafted feature-based approaches or learning-based approaches. Baseline approaches used for the above-mentioned systems are shown in Fig. 8. The taxonomy is divided into two categories: handcrafted feature-based approaches and learning-based approaches. A unimodal or multimodal framework necessitates the careful selection of methods from handcrafted feature-based approaches or learning-based approaches. Both handcrafted and learning-based approaches are divided into sub-categories. Furthermore, recent learning-based methods, such as zero-shot learning and transfer learning, are significant. Such methods are useful when not all activity classes are available. All above-mentioned methods are part of the HAR taxonomy to present a relationship between the various variations. As this has not been done before, it has the potential to contribute significantly to the domain by demonstrating the multidisciplinary nature of HAR. The HAR taxonomy is discussed in the qualitative analysis section through existing HAR methods to provide a brief description of each.

4.1 Online/offline processing strategy-based human activity recognition

Online human activity recognition uses a live stream that is fed to the HAR model to perform activity recognition, such as in augmented reality/virtual reality (AR/VR) and self-driving cars. Most methods target offline systems that process all video frames together and are not suitable for real-time systems, e.g., security surveillance. Soomro et al. [72] used a batch of frames from videos to estimate pose. They converted the current frame into super-pixels, used conditional random fields to produce nodes, and used spatiotemporal points to extract actions. Short-duration clips are used to predict action confidence via dynamic programming based on SVM scores. This approach helps capture the sequential information of the video, whereas appearance-based information and pose estimation are done online, and only a few frames are used for this purpose. Singh et al. [73] addressed the slow execution of offline approaches in real-time scenarios through multiple spatiotemporal action localization. To overcome these issues, a CNN is used along with a single-shot multibox detector, which helps in the construction and labeling of action tubes and achieves real-time action recognition performance of up to 40 fps. Jalal et al. [70] used Depth Differential Silhouettes (DDS) along with human temporal points to perform online activity recognition. They further considered skeleton joint features, which include torso and key joint-based distant features. They reduced the size of the feature set through code vectors. An HMM is trained on these code vectors to recognize human activity segments through forward spotting, and the depth map is used for online activity recognition. Zolfaghari et al. [74], in contrast, focused on long-term content along with fast video processing to perform efficient online recognition. They proposed a 3D and 2D Combination Architecture (ECO) in which the 2D network provides feature representations from still images, whereas complex information is extracted by the 3D network. To reduce the complexity and data overhead, half of the frames are taken from the current sequence and half from the upcoming sequence (queue) to make predictions. Xu et al. [75] performed online HAR based on the temporal context of each frame while performing action detection in parallel. They proposed a Temporal Recurrent Network (TRN), which is based on an RNN. It works by predicting actions from each frame while anticipating future actions, so that the anticipated future actions combined with historical data may produce better predictions. Lin et al. [71] proposed a temporal shift module (TSM) for both online and offline recognition. The offline recognition is bidirectional, whereas online recognition is unidirectional because frames are processed as they arrive and future frames are unavailable, as shown in Fig. 9. The TSM-based online recognition model is shown in Fig. 10; it provides low latency and low memory consumption compared to other methods. Their model provides an average precision of 95.5% when trained on the UCF-101 dataset. It performs well on offline activity recognition with zero latency rate and 95.8% average precision.

To improve online action recognition from untrimmed videos, Gao et al. [76] proposed a Weakly Supervised Online Action Detection (WOAD) framework. It uses a temporal proposal generator (TPG) that works offline to generate frame-level labels and an online action recognizer (OAR) that detects online actions. Offline recognition is less complex than online recognition because it is based on stored videos, which are easier to handle than live data. In offline scenarios, a decision is made after analyzing the entire video, whereas in real-time scenarios, recognition is required immediately based on new frames. Because action recognition from videos is primarily performed on stored data, most of the methods discussed in this study are offline; however, as shown in Table 3, online approaches for video-based activity recognition are gaining popularity.

4.2 Modality-based human activity recognition

Most of the methods are offline and unimodal because these methods involve less complex computational strategies and resources. Unimodal approaches recognize
activity by utilizing data from a single modality, for example, visual representation learned from image sequences or still images. Unimodal approaches perform well when motion-based features are used, as in methods based on space–time, stochastic, and shape-based data. Besides the methods mentioned, rule-based approaches, which include CFGs, and statistical models (HMM) have performed well [24].

The research community’s attention is shifting to multimodal approaches based on data from two or more modalities. Ofli et al. [65, 68, 77, 84] use a variety of modalities to describe an activity, including RGB data, depth data, audio cues, skeleton data, optic flow, motion capture, and temporal data. Multimodal approaches primarily use two or three different sources of information to recognize actions by performing feature fusion, such as early and late fusion, and can be classified as affective methods, behavioral methods, or social networking-based methods. Chen et al. used facial expression along with action recognition to design an emotion recognition system [63]. Rigkas et al. [78] worked on behavior recognition using a fully connected conditional random fields (CRFs) model, which can recognize friendly, aggressive, and neutral behaviors. In [79], a joint sparse regression-based method has been proposed, which uses depth data as well as body part information to extract a variety of features for action recognition. Wu et al. [80] proposed a deep learning-based multi-stream architecture that can extract multiple features from videos using CNNs to perform multimodal feature extraction. The extracted feature data are fed into a Long Short-Term Memory (LSTM) model, which uses this information to learn long-term temporal variations in the data and then combines it to perform human activity recognition. Jalal et al. [70] fused data of different modalities, which include torso-based distant feature descriptors, key joint-based feature descriptors, motion features, shape-based features, and a few others. Mukherjee et al. [81] proposed the use of dynamic images by extracting motion information from RGB images and depth images separately, which are then combined. The task is performed using two streams of a ResNet-101 network and results in a reduced sparse matrix from videos. Zhang et al. [82] also worked with different modalities and produced a semantics-based multi-stream deep neural network for action attribute learning and action recognition, along with zero-shot action recognition. It also combines semantics in graph regularization and joint learning to use adam (adaptive moment estimation) optimization algorithm. Franco et al. [83] used temporal and posture-based data for activity recognition, and a two-stream architecture based on skeleton and RGB data was proposed. Skeleton data provide human posture information, whereas RGB
resulting in an improvement in the action recognition process.

4.3 Model-based human activity recognition

Model-based human activity recognition involves methods for extracting features from action data and classifying these data into specified classes. In this study, handcrafted feature-based and learning-based approaches are discussed to cover a wide range of HAR methods.

4.3.1 Handcrafted feature-based approaches

The feature-based/handcrafted approaches use statistical or image processing techniques to calculate features. Figure 11 depicts the general framework of feature-based human activity recognition. These methods rely on manual feature extraction, which includes different statistical, temporal, and appearance-based features.

Non-hierarchical approaches
Non-hierarchical or single-layered approaches use raw video data and are classified into two types (i.e., space–time and sequential approaches) based on how the temporal dimension is considered, which are further classified into relevant groups of methods. Such methods are used to recognize short and simple human actions (e.g., running, jumping, walking) and are normally evaluated on small datasets, for example, the KTH [42] and Weizmann [43] datasets. Non-hierarchical approaches are based on data representation and matching, which is normally done using a suitable feature extraction strategy. The non-hierarchical approaches can be used in different sequential combinations to recognize more complex actions.

Space–time approaches
The space–time approaches are based on the problem’s spatiotemporal nature. Because time is a regular domain, features can be extracted from a 3D volume containing a 2D spatiotemporal sequence of images with another equal set of pixels in the third dimension (XYZ plane). This means that the video has a spatiotemporal volume with important information for action recognition, and as a result, many researchers have contributed by proposing significant matching-based algorithms to identify underlying motion patterns.

Space–time volumes
The space–time volume approaches consider the entire volume as a template or simply a feature that is then matched with previously existing videos to perform action classification. It is done by using a matching algorithm such as Bobick and Davis’s [84] identified motion pattern. Hu et al. [85] contributed by combining the motion history images (MHI) and appearance-based information, whereas appearance still relies on two fea-
data provide temporal information for action evaluation, tures, i.e., foreground image and Roh et al. [86] extended
motion pattern strategy using volumetric motion template to provide view-independent action recognition and shifted MHI from 2D to 3D. Histogram of Oriented Gradients (HOG) to get the magnitude and direction of edges and corners of a specific action. Another famous combination was to use both global and local features; here, global features involve the contour coding of the motion energy image (MEI), whereas local features simply provide a bounding box for an action, and a multi-SVM is then used for classification of feature points [87]. Then, Kim et al. [88] used the concept of representing spatiotemporal features gained from different actions by producing accumulated motion images (AMI). AMI pixel values are used to produce a rank matrix. Recognition is based on computing the distance between the rank matrices of two videos, i.e., the candidate video and the target video. Another group of researchers [89] has designed a pose descriptor by using rectangular patches that were extracted over human silhouettes and named that descriptor Histogram of Oriented Rectangles (HOR). A similar silhouette-based approach was presented by Fang et al. [90], which aimed at mapping high-dimensional silhouettes to the spatial motion of low-dimensional points to get information about inherent motion structures. After the pose descriptor, Ziaeefard et al. [91] used skeleton-based data and designed a cumulative skeletonized image (CSI) over time. This skeleton-based image is used to create a distance-based histogram that is fed to an SVM model for matching. The authors have also used the idea of similar and dissimilar actions during the matching process. Two types of CSI histograms were taken for similar and dissimilar actions. Wang et al. [92] used the notion of a "bag-of-words" framework, taking frames as words and videos as documents, to design semi-latent topic models (STM), which resulted in an efficient action recognition system with better accuracy, but the drawback was the limited number of latent topics. Another study was presented by Guo et al. [93], stating that an action is the deformation of local shape features (i.e., centroid-centered object silhouettes) over a temporal sequence. The feature set of 13-dimensional normalized geometric vectors is used to produce a covariance matrix that holds the shape of the silhouette. The Riemannian metric is calculated between the covariance matrices of two actions for classification. Another direction toward action recognition was to improve the process of video analysis and, for that purpose, Kim et al. [94] have proposed a method to check the similarity between two videos by extending canonical correlation analysis, under the assumption that similar videos represent similar actions. The idea helped in ignoring irrelevant complexity within a task by avoiding explicit motion estimation within a frame.

Space–time trajectories The trajectory-based approaches use raw data from videos, and for that purpose, tracking points are obtained by considering joint positions of the human body. The tracking of such joints or interest points results in the construction of a trajectory. Similarly, Messing et al. [95] used the KLT tracker to track Harris3D joint areas (feature trajectories), which produced log-polar velocities as a sequence. Then learning of these velocities (i.e., velocity-history language) is performed by applying a generative mixture model to classify videos and actions by producing a weighted mixture of augmented trajectories. Another major contribution by Wang et al. [96] was to use dense trajectories, which were sampled by taking dense points from each frame. The dense optical flow field is used to calculate displacement for tracking dense trajectories with the calculation of other local descriptors (i.e., HOG, HOF).

Space–time local features Object recognition inspired the concept of using local features for action recognition from images, whereas local features are based on interest points and provide distinct features, which can be learned as features. The local features can be sparse (Harris 3D [12] and Dollar detector [97]) or dense (i.e., optical flow) depending on their extraction purpose. Jones et al. [98] have extended the Dollar detector [97] using k-means for clustering of detected interest points and asymmetric bagging with a random subspace support vector machine to incorporate a feedback process. Gilbert et al. [99] have extended the Harris 3D detector [12] to handle the sparsity issue and used hierarchical grouping for action classification. Sadek et al. [100] have used the concept of taking temporal self-similarities using the fuzzy log-polar histogram on the Harris 3D detector [12] to describe the local interest points, which are further classified through SVM. Ikizler-Cinbis and Sclaroff [101] have worked on feature extraction of various objects and humans by incorporating optical flow and foreground flow. The extracted features are fed to a multiple instance learning (MIL) framework to find the locality of interest points. Minhas et al. [102] have used the 3D dual-tree discrete wavelet transform (DT-DWT) for spatiotemporal feature extraction and affine SIFT for local feature extraction. They have used a hybrid combination of both to feed the feature values to an extreme learning machine (ELM).

Fig. 12 General framework of learning-based approaches
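The motion history image (MHI) and motion energy image (MEI) that recur in the space–time volume methods above can be illustrated with a minimal sketch. It assumes the commonly cited temporal-template update rule (a pixel with detected motion is set to a duration value, every other pixel decays by one per frame) and uses plain frame differencing with a fixed threshold as the motion detector. Both choices, along with the names `update_mhi`, `motion_history`, `motion_energy`, `tau`, and `diff_threshold`, are illustrative assumptions rather than the exact formulation of any cited work. Frames are plain nested lists here; an array library would normally be used in practice.

```python
from typing import List

Frame = List[List[int]]  # grayscale frame: rows of pixel intensities (assumed 0..255)


def update_mhi(mhi: Frame, prev: Frame, curr: Frame,
               tau: int, diff_threshold: int) -> Frame:
    """One update step: moving pixels are set to tau, all others decay by 1 (floor 0)."""
    new_mhi = []
    for mhi_row, prev_row, curr_row in zip(mhi, prev, curr):
        row = []
        for h, p, c in zip(mhi_row, prev_row, curr_row):
            # Assumption: motion is detected by thresholded frame differencing.
            moving = abs(c - p) >= diff_threshold
            row.append(tau if moving else max(0, h - 1))
        new_mhi.append(row)
    return new_mhi


def motion_history(frames: List[Frame], tau: int = 10,
                   diff_threshold: int = 30) -> Frame:
    """Accumulate an MHI over a sequence of equally sized frames."""
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        mhi = update_mhi(mhi, prev, curr, tau, diff_threshold)
    return mhi


def motion_energy(mhi: Frame) -> Frame:
    """Binary MEI: 1 wherever motion occurred within the last tau frames."""
    return [[1 if h > 0 else 0 for h in row] for row in mhi]


if __name__ == "__main__":
    # A single bright pixel moving one step to the right per frame.
    frames = [
        [[255, 0, 0, 0]],
        [[0, 255, 0, 0]],
        [[0, 0, 255, 0]],
    ]
    mhi = motion_history(frames, tau=3)
    print(mhi)                 # more recent motion has larger values
    print(motion_energy(mhi))  # where any recent motion occurred
```

In this sketch, larger MHI values mark more recent motion, so a single image encodes both where and roughly when movement happened; the MEI discards the timing and keeps only the region of motion.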
Sequential approaches The sequential approaches are of two types, i.e., exemplar-based and state-based, which are briefly described as follows:

Exemplar-based The exemplar-based approaches use a representation of human actions as a template containing a set of sequences of an action that can be compared with new incoming video sequences, so contributions were made to compare such templates for the process of action recognition. Darrell et al. [103] proposed a Dynamic Time Warping (DTW) algorithm to recognize and handle the variations in an action. It is extended by Gavrila et al. [104] to perform gesture analysis through the DTW algorithm along with a 3D joint angle model. Veeraraghavan et al. [105] proposed another modification to the DTW algorithm, in which they used a time function to monitor the overall activity process. It distinguishes between activities that appear similar but differ, for example, pulling, pushing, throwing, and so on. Other useful methods are principal component analysis (PCA) and singular value decomposition (SVD). SVD is used for representation of video data to extract features as eigenvectors [106]. Then Efros et al. [107] attempted to use motion descriptors and incorporated optical flow as the baseline of the model. Optic flow is used to track human activity mainly in public places, with a threshold of 30 pixels set for a normal person’s height. Lublinerman et al. [108] have proposed a similar system that was limited in performance due to noise and requires background enhancement. Jiang et al. [109] have worked on the use of geometric models to recognize actions through postures. Lin et al. [110] have worked on video representation and used k-means clustering to generate prototype sequences. They have generated a unique prototype for each video using the prototype sequence estimation approach. For prototype matching, the fast DTW algorithm was used, which resulted in increased computational efficiency.

State model-based approaches The second category of sequential approaches is the state-based model, which uses hidden states for the representation of actions. Yamato et al. [111] have used the Hidden Markov Model (HMM) for video representation and action recognition. HMM is already being used for speech recognition and text classification. A modification by Starner et al. [112] based on HMMs was proposed that targets American Sign Language (ASL). In this approach, each sign is stored as an HMM to generate a corresponding sequence of features. An issue with this approach is that ASL can describe a limited number of actions. Then Vogler et al. [113] have worked on reducing the number of combinations of ASL by using Parallel Hidden Markov Models (PHMMs). Bobick et al. [114] have used the state model by representing gestures as 2D trajectories, which helps in finding the locality of interest points, i.e., location changes of the hand. Another study has been conducted by Oliver et al. [115] to propose Coupled Hidden Markov Models (CHMMs), which overcome the limitation of the traditional HMM and make it possible to analyze the interaction between more than two people. Along with HMMs, Dynamic Bayesian Networks (DBN) are also used for human body gesture estimation, and a lot of improvement has been made to analyze person-person interactions [116]. In [117], a coupled hidden semi-Markov model is used to track the duration of actions occurring within an event (sub-events). It models the representation of a person-person interaction but results in considerable model complexity, which compromised the performance of the model. Gupta et al. [118] have used a probabilistic model which helps in the extraction of context-based information to perform the analysis of actions and demonstrated better performance in the object recognition process. Moore et al. [119] have introduced the use of both HMM and Bayesian relations for object classification and motion detection, with the limitation of hand movement detection only. Yu et al. [120] have presented another study that is based on the modification of HMMs. It has used star skeletons for representation to analyze the edges and corners of human postures through the application of contour and histogram-based methods. Novel texture descriptors were also proposed by Kellokumpu et al. [121] for motion analysis with the use of HMM to assess the temporal information of motion histograms. Another work by Shi et al. [122] has been presented to resolve the inference issue while performing segmentation and recognition of human actions and proposed a dynamic programming algorithm (a Viterbi-like algorithm) to perform action recognition.

Appearance-based approaches The appearance or outlook of any target can be presented through 2D (XY) and 3D (XYZ) depth images, and such methods rely on the information related to shape, motion, and a blend of both. Such methods use appearance-based information along with any suitable feature extraction method, which can be shape and contour-based features, or optic flow in the case of motion-based features.

Shape-based approaches The human silhouette [123] is used to extract the local features, which is done by foreground silhouette subtraction using a segmentation technique. The image can be assumed to have two spaces, i.e., positive space (image silhouette) and negative space (surrounding region between boundary of image and human) [124]. To work with human silhouette, one must use contour points, geometric information, and region-based features of frame, and a
successful contribution was made to perform region-based feature extraction through division of the human silhouette into a fixed number of cells and grids to represent actions. The method further used a combination of two popular classifiers, support vector machine and nearest neighbor (SVM + NN), to recognize actions [125]. Another study focused on time-series data, using Symbolic Aggregate approximation (SAX), which first converts the silhouette into time-series data to produce a SAX vector and then applies a random forest algorithm for action recognition [126]. Along with the silhouette, pose-invariant data are useful to estimate actions through the shape of the human body, and a contour-based method is used, employing multi-view key poses for action recognition [127]. It is further extended through extraction of contour points from the silhouette with a radial scheme to perform action representation and classification through SVM [128]. Another method based on pose-related information was proposed that uses scale-invariant features from silhouettes. Key poses are produced through clustering of these features, which are fed to a weighted voting scheme for action recognition [129].

Motion-based approaches The basic trend is to extract the motion features through any useful mechanism and then apply a classifier to recognize actions. Such a contribution was made in [130] to produce a motion descriptor that uses motion directions and motion-intensity histograms of a moving body. Classification of different action categories is performed using SVM. Besides the motion descriptors, motion history images and histogram of oriented gradients (HoG) are also useful measures. Another useful approach was proposed in [131] to use templates of motion which are based on motion history image and HoG. In [132], an optic flow feature descriptor is used for human activity recognition and only motion-based features are extracted.

Hybrid approaches The hybrid approaches are based on the combination of both shape-based and motion-based information, such as optic flow with silhouette-based features to perform view-invariant action recognition. The approach in [133] also incorporated dimensionality reduction using principal component analysis. Another method was proposed to perform view-invariant action recognition by using a coarse silhouette with radial grid-based features and employing motion features [134]. Among these methods, in a study [135], action representation was done in the form of a sequence of prototypes by combining both motion and shape space. The action recognition of such representations is done by applying distance measures for sequence matching. The idea of combining both shape- and motion-based information was further improved by using motion energy images (MEI) and motion history images (MHI) for action key poses, and action recognition is performed through a nearest-neighbor classifier [136].

Hierarchical approaches

The second feature-based category is hierarchical approaches, which have a lot of similarities with non-hierarchical approaches, especially for atomic actions. The hierarchical approaches mainly address complex activities by considering the sub-events within them, e.g., fighting, which involves subtasks like pulling, pushing, punching, etc. Such approaches show their significance where flexibility is required while dealing with complex interactions, e.g., human–human interaction and human–object interaction. The hierarchical approaches are of three types, which are statistical, syntactic, and description-based approaches.

Statistical approaches Initially, most of the statistical approaches were based on the extension or modification of the Hidden Markov Model (HMM) and Dynamic Bayesian Networks (DBN) to handle concurrent and sequential sub-activities, respectively. After that, another hierarchical approach was proposed to emphasize the use of propagation networks (p-nets), and these networks proved significantly better for both sequential and concurrent activities [137]. Along with p-nets, a 4-layered probabilistic latent model [138] was proposed, which uses a Bayesian model for clustering after spatiotemporal feature extraction, and then recognition is performed through the probabilistic latent model. The proposed model aimed to handle atomic actions through clustered space–time features and complex actions with hierarchical descriptions. In another study, hierarchical clustering was proposed for action recognition through the representation of feature cues [139]. Cascaded Conditional Random Fields (CRFs) are helpful while analyzing the motion pattern, and SVM can classify these motion patterns as human actions [140]. Another study was conducted when data-related issues were raised, and integration of training data with domain knowledge was proposed to resolve the insufficient data problem [141].

Syntactic approaches An activity is made up of multiple sub-activities and atomic actions, which can be recognized by activity recognition approaches such as Context-Free Grammar (CFG)-based methods, which are categorized under syntactic approaches. If the atomic sub-activities are symbols, then syntactic approaches integrate these in the form of a string of symbols but involve concurrent action recognition problems. To overcome the concurrent action recognition problem, a lot of improvements using CFGs have been made, for instance, the Stochastic CFGs (SCFGs) used in [142, 143]. Activity recognition is performed by processing basic actions at lower layers of the
model, while complex activities are recognized by applying parser techniques at the top layer of the model. In [144], a method is proposed to handle the production rules problem, i.e., the requirement that rules be defined in advance. The proposed algorithm has done the task through automatic learning of rules. Along with 2-layered frameworks, a few researchers have put their efforts into producing multi-layer frameworks such as a 4-layered framework which uses the spatiotemporal features to generate a relevant set of rules for actions, i.e., strong, weak, and stochastic [145].

Description-based approaches A method that can explicitly retain spatiotemporal structures extracted from human activities is known as a description-based approach. Due to their explicit ability to describe the structure of spatiotemporal changes, description-based approaches can recognize both concurrent and sequential human activities. These methods use spatiotemporal and logical relationships to define relationships between simple actions that result in higher activity, such as sub-events. CFGs with the use of formal syntax have been proposed for activity recognition [146, 147], and a PNF network for distinct temporal identification is used [148]. The famous Bayesian Belief Networks (BBN), event logic, and Petri nets have also been introduced for the task of complex activity recognition [149–151]. Markov Logic Networks (MLN), which are symbolic, were also proposed to infer human activities based on different probabilities [152]. Afterward, another approach was proposed to handle higher-level activity recognition by using different input sources based on temporal information, without any probabilistic computation [89]. Another study [153] has attempted to perform event annotation of one-to-one basketball videos through mixed probabilistic and logical inference. The semantic description of different scenarios has been employed using first-order logic to extract spatiotemporal knowledge, and MLN is used for basic information extraction.

The popularity of feature-based methods has increased through continuous improvement in existing methods, which include techniques based on optical flow, Motion Boundary Histogram (MBH), Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and dense trajectories. Among these methods, IDT [154] remained a successful method, which is further changed by using the Fisher vector for effective action recognition [155]. The space–time volume representation does not support view-invariant scenarios and is only useful when multiple people are involved in a single event. The space–time trajectories perform better with known video points and can accommodate different viewing angles. The space–time features can recognize multiple activities, but not complex activities with view-invariant representations. The appearance-based approaches primarily focus on using shape and motion-related information to generate motion descriptors or key poses for sequence matching. These methods use silhouette and interest point detectors, which are then fed into any suitable classifier (e.g., SVM) to perform action classification. Sequential approaches can deal with view-invariant data and complex activities. When compared to state-based methods, exemplar-based methods are more adaptable to complex activities and require less training data. Among layered approaches (i.e., hierarchical), description-based techniques outperformed other methods in terms of high-level activity recognition because of their explicit nature of maintaining spatiotemporal changes. Syntactic and statistical approaches have proven useful in dealing with noisy data.

4.3.2 Learning-based approaches

Learning-based approaches are the second major category of methods used for HAR, and Fig. 12 shows the general framework used by learning-based models to perform the task. Learning-based methods rely on automatic feature generation, which does not require a manual feature engineering process. Methods in this category have proven to be effective for a variety of tasks and can be used independently or in any hybrid combination (e.g., with a handcrafted feature-based method). The learning-based approaches are subdivided into two main types, i.e., non-neural network-based approaches and neural network-based approaches.

Non-neural network-based approaches

The non-neural network-based approaches use a pre-defined set of rules or sequences for learning a model to evaluate future data. The genetic programming and dictionary-based approaches are examples of such methods, which are explained as follows:

Genetic programming Genetic programming (GP) [156] is based on the Darwinian theory of selection and is famous for vision-based tasks involving natural and random selection of the solution set. The GP algorithm is an evolutionary algorithm that uses biologically inspired operators, i.e., crossover or mutation, to perform the natural selection process over an initialized computational program, which should be randomly assembled. Shao et al. [157] have presented a study for the evolution of motion features based on colors and optic flow fields by using a population of operators, i.e., the 3D-Gabor filter [158] and wavelet [159]. Then the classification error is calculated using the GP fitness function along with SVM. The error is incorporated in the evolutionary algorithm, which provides the final solution set in the form of cascaded operators for the feature extraction process.

Dictionary learning Dictionary learning tries to learn the sparse data from the input by using linear combinations
based on the dictionary atoms, and these representations are useful mainly for categorization of data such as classification of images, detection, and action recognition. The bag-of-words model (BoW) [160] is very popular among researchers because of its usability, and it is also based on dictionary learning. Guha et al. [161] have presented a method that outperforms the BoW model through sparse coding and improved the action recognition performance. The primary function of sparse representation is based on two components, i.e., a regularization term and a reconstruction-error value. Another concept, the cross-view dictionary, is proposed by Zheng et al. [162] to deal with sparse coding and cross-domain inconsistencies so that action recognition can be view-invariant. Along with sparse coding, supervised dictionary learning-based approaches are also popular. In [163], a loosely supervised dictionary learning technique has been proposed to help the learning adaptation process from one action recognition dataset to another. It used both discriminative and cross-domain discrepancy terms to ensure smoothness in action recognition.

Fig. 13 Recognition rate variation in selected HAR approaches on different benchmark datasets

Neural network-based approaches

Such types of methods try to model the human visual system to analyze data. A structure similar to biological neurons is used to build a learning model. Deep learning models have proven their worth in almost every domain where high-level data abstraction needs to be modeled, and a few of them are discussed below:

Multi-layer perceptron network (fully connected neural networks) This type of method requires building a fully connected neural network (FCNN) framework by using low-level information. Kim et al. [164] have designed a framework for feature extraction and classification by using FCNN as baseline and named it Modified Neural Network (MNN). At first, handcrafted features
are used to extract the basic information, and then the 2D contour of the actor is obtained to generate a spatiotemporal volume to obtain outer boundary information through 3D Gabor filters [165]. In [166], a neural network is applied to extract the action-based features using four layers (i.e., two convolutional layers and two sub-sampling layers) and a discriminative classifier to classify actions. Jhuang et al. [167] used layers based on spatiotemporal feature detectors through the approximation of motion directive units. It performed the global max computation of every feature map extracted from action data. Shao et al. [168] have proposed to use a multi-layer network instead of Restricted Boltzmann Machines (RBM) [169] to provide a hierarchical parametric network (HPN) using skeleton features. It has outperformed [170] in emission probability estimation of HMM.

Convolutional neural network (CNN) The 3D-CNN [171] is developed to perform convolution on both spatial and temporal dimensions to extract features from multiple channels to provide a variety of action representations, whereas the 2D-CNN [172] is only concerned with convolution on the spatial domain. The proposed 3D-CNN model is a 7-layered architecture, including an input layer, 3 convolutional layers, 2 sub-sampling layers, and a fully connected layer. Its feed-forward nature made it possible to extract features for action recognition. Another group of researchers [173] has presented the Hierarchical Invariant spatiotemporal (HIST) framework. They have used Independent Subspace Analysis (ISA) [174] for feature extraction and Principal Component Analysis (PCA) [175] during the training to cater to large video data. So, the proposed HIST model works on training of multiple ISA, which is subsequently reduced through PCA and therefore uses the characteristics of ISA to perform unsupervised analysis, which can help label large video data. Then Baccouche et al. [176] have presented a sequential deep learning model by using 3D-CNN, and it works differently than the model presented in [164] because of its sequence of layers, i.e., two alternative convolutional layers, rectification layer, sub-sampling layer, and another convolutional layer, sub-sampling layer, third convolutional layer, and then two fully connected layers. The action representations are extracted through CNN by capturing the temporal information over time with adaptation over sequential information, and hence a sequential approach is used for action labeling.

Another variation is to use static frames as input for the action recognition process; for example, OverFeat [177] and Caffe [178] have used image recognition models to learn from static frames. Along with image-based frameworks, there are many video recognition frameworks that can recognize human activities from videos. Ning et al. [179] have proposed a video framework that can decompose videos into 2D images and then used the 2D CNN to analyze different stages of embryonic development. Later on, Karpathy et al. [51] have performed a comparative analysis to provide the best CNN-based architecture by using motion-related information on static videos. The experimentation involves the performance evaluation of different approaches on a famous large-scale action recognition dataset (i.e., YouTube-1M), which reveals that the fixed-size architecture is not a suitable option for action recognition. Karpathy et al. [51] have presented 3D-CNN-based learning strategies, i.e., early fusion, late fusion, and slow fusion. The early fusion approach works by modifying the 2D convolutional window by adding temporal dimensions and passing these 3D cubes to the first convolutional layer. In [180], CNN-Bi-LSTM is trained on RGB images to extract temporal information from video data. The late fusion strategy is applied at the decision level of the network to provide end-to-end learning. Late fusion is based on applying two CNNs to two distant frames and then combining both at the fully connected layer to extract the motion-related information at the global level, whereas slow fusion is based on connecting two frames in both spatial and temporal dimensions.

Recurrent neural networks (RNN) Recurrent neural networks are popular for dealing with sequences because they act as a memory unit by storing the previous state and forwarding it to the next unit. They are computationally expensive, and RNN is further extended to reduce implementation issues, such as the Long Short-Term Memory network (LSTM) and Gated Recurrent Unit (GRU). Yao et al. [181] have proposed the DT-3DResNet-LSTM to exploit the localities within a video where any activity is taking place. The task is performed at different levels: first, the detected object becomes the input of an object tracking model, and the clipped video frame is fed to a CNN for feature extraction. Then LSTM is used for HAR classification and final temporal information is achieved. Meng et al. [182] have proposed the Quaternion Spatiotemporal convolutional neural network (QST-CNN) combined with a Long Short-Term Memory network (LSTM), known as QST-CNN-LSTM, to use on RGB data by considering its spatiotemporal information. LSTM is used to capture the difference between two frames of a video. The model works through motion region extraction and outperforms other methods on both the UCF11 and UCF Sports datasets. Another work [183] performs group activity recognition, and the proposed model is known as stagNet, which uses spatiotemporal and semantic information of the data to feed an RNN. This makes the model learn the intergroup relation and
Anthropometric variation [47, 55, 69, 127, 133, 136]: Anthropometric issues relate to posture and angle problems that arise primarily from variation in human bodies and from acting in various poses. Such problems can occur when using shape or pose feature extraction.
Multi-view variation [55, 101, 128, 130, 131, 133]: Multi-view variation occurs because of unsettled camera views, which give a different perspective on any action. To avoid confusion, use synchronized multi-camera setups while producing features from each view.
Cluttered and dynamic scene variation [15, 101, 211, 243]: This applies to action datasets that are recorded in an indoor environment to provide a static and uniform background throughout the activity; such an approach causes problems when these methods are evaluated in outdoor conditions with highly dynamic backgrounds. Many activity recognition algorithms, such as the optic flow method, combine background noise with human motion information to overcome these issues.
Intra-class variability and inter-class similarity [79]: Intra-class variability and inter-class similarity arise from the unique behavior of each human being and the tendency to repeat the same actions. For example, everyone walks differently due to age or muscular condition, so we must deal with complex scenarios. We cannot rely on a single model to perform the same task, so such issues must be addressed, e.g., by using discriminative features.
Occlusion [5, 10, 55, 101]: The occlusion problem occurs when the human body is obscured by an object in front of it; it can be caused by different body parts obscuring one another or by another object.
Environmental constraints [55, 243–245]: Lighting changes the overall impression of a scene. Light sources cast shadows on objects and cause variations in illumination. Changes in weather and time of day cause significant changes in the scene and in the artifacts created; for example, an action recorded during rain differs completely from the same action recorded in broad daylight or in the evening.
Dynamic cameras [55, 101, 246]: HAR is relatively simple in static camera scenarios, but the variations introduced by dynamic cameras cause changes in pose and illumination, making the problem more difficult.
Inadequate data [53, 199, 247]: When working with deep learning models, the amount of data is the most important consideration. The limited amount of data is primarily due to the difficulty of creating and labeling human activity videos, as well as processing and storage limitations in some systems.
distinguishable spatiotemporal features of the data. stagNet is further extended to provide both group activity and individual action recognition by incorporating body regions and global part feature pooling [184]. The authors in [185] describe how pre-trained weights may affect the learning of a model and address this through a bidirectional long short-term memory (BiLSTM) model. They propose an attention mechanism to prioritize the human actions in a video sequence. The authors in [186] use dilated CNN (DCNN) layers for feature extraction, and a BiLSTM uses these features to analyze long-term dependencies. Action recognition is further improved by applying an attention mechanism, which can extract high-level patterns. The authors propose a densely connected bidirectional LSTM (DB-LSTM) to improve the robustness of the model. Their model works in both the forward and backward directions to model the visual and temporal information of the data. Moreover, the authors use appearance- and motion-based modalities to improve human activity recognition. In [187], the authors propose a spatial–temporal differential long short-term memory (ST-D LSTM) and use Inception V3 for feature extraction from video data. The features are fed to the enhanced input differential module and the spatial memory state module. Spatial information is extracted and transferred horizontally, and the data are forwarded to traditional long-term convolutional networks to evaluate the performance of the proposed model.

Transformer networks: Transformer networks [188] are popular in both natural language processing and computer vision tasks. Like recurrent neural networks, they are used to model sequential input data such as audio and video. Transformer networks are mainly used for speech recognition, but they are also popular in action recognition tasks such as multimodal-based methods. Deep learning-based methods are usually complex and require many parameters to train a model, which increases the computational overhead. Transformer networks can reduce the number of trainable parameters. The video transformer network [189] is a sequence-based neural network architecture that attempts to recognize long-range dependencies and analyze full-length videos. The authors claim that their model works better than any 2D spatial network and achieves faster training with fewer GFLOPs (giga floating-point operations). In [190], the authors use a spatiotemporal transformer network model (ST-TR) that operates on skeleton-based data. They apply a sparse attention mechanism to the spatial information of human actions to extract intra-frame interactions between different body parts, and a temporal self-attention mechanism to produce inter-frame correlations, which help to model skeleton data for action recognition. Action Transformer [191] is a fully self-attentional architecture that has performed better than complex CNN, RNN, and attentive layer-based architectures. The authors work on pose representations over small temporal windows, which yields low latency for accurate recognition. They have also published a large-scale dataset entitled "MPOSE2021", which can be used for real-time, short-time human action recognition.

Auto-encoder: Auto-encoders learn data representations through unsupervised learning, primarily to obtain low dimensionality. The authors of [192] used auto-encoders with a CNN to perform online action recognition, with the CNN learning frame-level representations. As a result, the auto-encoder performs sequence learning and feature dimensionality reduction. Then, unlike CNN, another group of researchers used such methods for HAR and auto-encoders for abnormal activity detection, where the auto-encoder learns spatiotemporal features from data to avoid missing-label problems [193].

Generative adversarial network (GAN): Generative methods use an unsupervised learning regime to learn data representations from any type of unlabeled data. These methods are popular for generating synthetic data, which is achieved by learning the features of each class from the original data. Today, we have a large amount of unlabeled data that are useless without labeling, but generative methods have made it possible to work with such data. A group of researchers has used a GAN for early prediction of human activity, in which the GAN is used to avoid motion blur problems and predict future motion [194].

Hybrid model: Hybrid models use handcrafted features along with neural network models to benefit from both strategies. Simonyan et al. [195] proposed a two-stream CNN-based architecture that decomposes video data into spatial and temporal domains; a CNN is then trained on top of optical flow. The authors proposed several variations, such as optical flow stacking, trajectory stacking, and bidirectional optical flow, and two-stream training is performed on the HMDB-51 [54] and UCF-101 [45] datasets to compare classification accuracy. The proposed architecture is a hybrid model because it uses a CNN model and learns from both handcrafted features and raw pixels. The two-stream CNN architecture is extended by Wang et al. [196], who introduce the use of trajectories along with it. Then the two-stream trajectory-pooled deep-convolutional descriptor (TDD) [154] was proposed, which has also been trained on the HMDB-51 and UCF-101 datasets to provide a generalized feature extractor for future videos. In [197], the authors proposed the use of dense trajectories and a discriminative Fisher vector to encode TDDs via a Fisher vector representation.

Learning strategy

Transfer learning: Transfer learning is a type of learning in which the knowledge learned by one network is transferred to another network in the form of weights to improve recognition results. There are several transfer learning strategies, such as freezing the convolutional layers of a new network and allowing only the fully connected layers to perform classification, for tasks where the target problem is similar to that of the pre-trained model (for example, a model pre-trained on Sports-1M can be used for sports activity recognition). Du et al. [199] proposed a cascaded architecture for activity recognition that is based on a convolutional neural network.

Zero-shot learning: Zero-shot learning is a type of learning that deals with unseen classes of data, normally with synthetic data generation. Gowda et al. [198] proposed a reinforcement learning model that learns all classes at once rather than optimizing over individual data points. Moreover, [199, 200] used zero-shot action recognition, and [82] proposed a semantic-based multi-stream deep neural network that learns both actions and action attributes.

In short, the models presented in [195] and [196] build on handcrafted feature-based methods, using a convolutional architecture as a baseline. The framework proposed in [201] also uses a convolutional architecture to learn motion-related actions. Learning-based approaches can be discussed in terms of how and when the learning framework is used, because only a few studies are based on the direct use of CNN and the rest follow a hybrid regime. The learning process works entirely differently if the two frames are fused by following different learning strategies, i.e., early, late, and slow fusion. The learning frameworks may also vary in the number of layers within a network; for example, the slow fusion CNN has the maximum number of layers. Some note that the performance of the slow fusion CNN (SFCNN) is not satisfactory when compared with feature-based shallow representations [154, 155]. This means that a greater number of layers does not promise better results. Two other deep learning frameworks, early fusion CNN [195] and late fusion CNN [196], initially performed well, but both showed reduced performance when tested on spatial stream networks. While working with CNN-based networks, the major problem is the size of the dataset, as the majority of datasets have
small number of representative videos with missing labels. For training, two datasets can be combined to increase the data volume, but because the action classes of different datasets intersect, this is not a suitable option. Along with the methods discussed, multi-task [202–204] and transfer learning [205–207]-based approaches are also in use, which help in combining the data or in using the representation learned on one dataset with another dataset. For example, Wang et al. [196] used transfer learning by training the model on UCF-101 and then training it on HMDB-51 to extract features for action classification.

4.4 Analysis of State-of-the-art HAR approaches

We have discussed HAR approaches in the above section, following a taxonomy that covers online/offline processing-based, modality-based, and method-based approaches. In this section, state-of-the-art methods are analyzed to identify recent trends and to highlight domain challenges; we performed our analysis on 46 state-of-the-art HAR techniques. The selected techniques include machine learning approaches, deep learning approaches, multimodal approaches, and a framework for online HAR. We analyzed all selected studies using six major parameters: publication year, method type, data input, activity level, dataset size, and performance on benchmark datasets. Publication year is important for characterizing a study because recent methods are close to the general trend of HAR. The second parameter is the type of method used to perform recognition, which demonstrates the popularity of specific types of methods among both handcrafted and learning-based HAR. Data input indicates whether the videos are stored in a database or come from a real-time camera feed. Activity level shows whether a method is useful for the recognition of simple, intermediate, or complex activities. All studies include experiments on relevant HAR benchmark datasets, which may help new researchers identify which datasets are more useful for a given problem. Moreover, the size of the dataset is relevant to the type of method used; for example, deep learning-based solutions perform well with large datasets, and handcrafted feature-based methods with small or intermediate-sized datasets. The performance of methods is an important parameter too, and we report the achieved performance of all selected studies. Table 3 provides a quantitative analysis of HAR approaches based on the above-mentioned parameters.

4.4.1 Discussion

Table 3 is based on 46 state-of-the-art methods from 2011 to 2022, 20 of which are handcrafted and the remaining 26 are learning-based approaches. This table provides details of recent methods, popular datasets used for different tasks, and the achieved performance. Performance does not directly determine the importance of a study, as not all studies focus on increasing the recognition rate. Earlier studies were more focused on increasing the recognition rate, but later many challenges were highlighted and researchers started working on multiple perspectives of the domain, for example in convolutional neural network-based methods. Figure 13 provides a performance graph that shows how the activity recognition rate changed over a decade on HAR benchmarks. We have added the performance of studies in the "average precision" column, which provides performance on different benchmark datasets. For example, Table 3 shows that 2011 has performance on only two datasets, whereas 2020 shows performance on eleven datasets. Therefore, the graph can be dense in places, depending on the number of datasets used in a given year by the selected set of studies. UCF-101 and HMDB-51 are both highly cited datasets, and Table 3 also shows that most researchers have performed their experiments on these two.

Table 3 shows that the authors of [233] achieved good performance on UCF-101, HMDB-51, and ActivityNet by using a temporal segment network-based approach. Later, in 2021, the authors of [198] tried to improve unseen-class recognition by using zero-shot action recognition. They achieved lower performance than [233] because the work focused on unseen-class recognition, which means they used UCF-101 and HMDB-51 as training data only. For performance evaluation, data are randomly taken from any action class, which resulted in low performance compared with the other studies mentioned. Unseen-class action recognition is still an open research area and needs a lot of improvement. HAR is a diverse domain that includes simple to complex activities, and among these, intermediate activities are frequently recognized in the selected studies. Task complexity depends on the type of activity recognized by a study; for example, daily actions are simple tasks, whereas group activities or crowd behavior are complex tasks. Human-to-human and human-to-object interactions are categorized as intermediate tasks. Another variation is the size of the dataset, which has a significant impact on the recognition rate and is directly related to the type and level of activities. Within HAR, abnormal activity recognition still needs improvement, as it may be affected by various factors; for example, Reinolds et al. [238] proposed that abnormal activities are influenced by audio too. They performed recognition by extracting both audio-based and video-based features. They also compared the two kinds of features and claimed that video-based features are more useful for recognition. In Table 3, since 2020, nine methods are based on deep learning-based
approaches, whereas only five are handcrafted feature-based approaches, which shows researchers' inclination toward deep learning. This is because, due to technological advancements, large amounts of data are available in the form of videos. However, such data still need a good annotation method, and hybrid approaches (combinations of handcrafted feature-based and deep learning-based approaches) are getting attention due to their performance. Table 3 shows that most of the studies use intermediate-sized datasets for experiments, as small datasets have a limited number of training instances. Small datasets may compromise model performance, whereas large datasets are computationally expensive to deal with. However, large datasets provide better learning and hence better performance. Therefore, if resources are not a bottleneck, it is better to use large datasets for human activity recognition. In [208], for example, a handcrafted feature-based approach trained on a small dataset is used to perform simple action recognition. Datasets must be chosen based on the task and method; for example, large-scale datasets are more popular among deep learning solutions. Ideally, this dependence should not exist, but current resources and research need to improve to deal with both data size and data type issues. GAN is widely used, and most researchers are working on synthetic data generation to cover potential HAR scenarios.

5 Limitations and future research

The preceding discussion can be expanded to highlight the limitations of datasets across HAR variables. Table 4 includes the pertinent details to present the highlighted issues while considering state-of-the-art approaches.

5.1 Future research directions

HAR is constantly evolving and offers promising performance in applications ranging from simple day-to-day living systems to real-time surveillance systems. Its multi-purpose application has made it an ever-active research area, and each technical advancement opens new research directions. Hence, it is essential to design a representative dataset for HAR that can overcome the occlusion, view variation, and weather constraints of recorded data.

5.1.1 Improvement within benchmark datasets

The preparation of data and approaches for multi-camera-based human activity recognition also needs to be improved. The size of the dataset and the activity classes with proper labels are another important consideration. For example, YouTube-8M is the largest dataset with a variety of classes, but not all activities are covered. Collecting a large-scale dataset of human activities, either by combining existing datasets or by adding new samples, could solve this problem. However, this may necessitate time-consuming labeling of the content and its temporal position. To avoid the time-consuming manual process, another option is synthetic data generation and synthetic label generation. As some activities occur relatively rarely (for example, anomalies), class imbalance is also an open issue. Normally, data augmentation is used to solve the problem of class imbalance, but synthetic data generation is also an option. Because human activity data contain subtle variations, synthetic data generation necessitates further investigation. Another approach is to use a weakly supervised learning strategy with web-based videos that have weak labels. This may solve the problem of small dataset size and improve the overall performance of an HAR system.

5.1.2 Improvement in models

Deep learning has proven its worth everywhere, including HAR. However, deep learning models are improving all the time, and another improvement can be made by modifying the global average pooling layer in existing 3D deep convolutional neural networks. Using temporal information or Improved Dense Trajectories (IDT) may be useful for this purpose. Multimodal approaches rely on the fusion of data from various devices, such as audio-visual data. Such information can be more useful in distinguishing visually similar activities. Researchers can extend HAR to improve the performance of traditional ML approaches on large-scale datasets. Normally, HAR is performed on long videos, which may contain irrelevant frames that are not part of the recognition. Machine learning models require improvement to identify trimmed actions in long videos. Deep learning models such as convolutional architecture-based models (3D CNN) can be extended to exploit the spatiotemporal information of action data. The HAR generalization problem can be addressed by improving reinforcement and active learning strategies. Another problem that needs attention is the classification of overlapped actions; classification models can therefore also be improved for classifying overlapped actions in a dataset. Multimodal approaches require improvement in fusing multiple modalities to perform HAR. Data from multimodal sources can also be exploited to perform action recognition along with the emotional state of the human (e.g., walking in anger, running in fear, smiling while talking). Multimodal approaches that combine human activity recognition with emotional state identification may help to improve the context-based human activity recognition process. Similarly, another approach to augmenting visual data for better learning is to use multi-camera views and data fusion from
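As a minimal sketch of the class-imbalance remedy mentioned in Section 5.1.1, the following Python example balances a toy set of clips by random oversampling. Oversampling is used here only as the simplest stand-in for data augmentation or synthetic data generation; the function name `oversample`, the clip identifiers, and the labels are hypothetical and are not taken from any of the surveyed studies.

```python
import random
from collections import Counter, defaultdict


def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    has as many samples as the largest class (illustrative only)."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Every class is padded up to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: four "walking" clips and one rare "falling" clip.
clips = ["walk_01", "walk_02", "walk_03", "walk_04", "fall_01"]
labels = ["walking", "walking", "walking", "walking", "falling"]

balanced_clips, balanced_labels = oversample(clips, labels)
print(Counter(balanced_labels))
```

In practice the duplicated clips would typically be perturbed (cropping, flipping, temporal jitter) or replaced by generated samples, since exact copies add no new variation.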
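The observation in Section 5.1.2 that fusing audio-visual data can help distinguish visually similar activities can be illustrated with a small late-fusion sketch. It assumes that each modality has its own classifier producing class scores and that the scores are combined by a weighted average of softmax probabilities; this is one common fusion choice, not the method of a specific surveyed study. The class names and score values are invented for illustration.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_scores, weights=None):
    """Weighted average of per-stream class probabilities."""
    if weights is None:
        # Default assumption: every stream contributes equally.
        weights = [1.0 / len(stream_scores)] * len(stream_scores)
    probs = [softmax(scores) for scores in stream_scores]
    n_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs))
            for c in range(n_classes)]


# Hypothetical scores: clapping and knocking look alike but sound different.
classes = ["clapping", "knocking", "waving"]
video_scores = [2.0, 1.9, 0.1]
audio_scores = [0.2, 2.5, 0.1]

video_only = classes[video_scores.index(max(video_scores))]
fused = late_fusion([video_scores, audio_scores])
predicted = classes[fused.index(max(fused))]
print(video_only, "->", predicted)
```

With these made-up scores the video stream alone prefers "clapping", while the fused prediction is "knocking", because the audio stream is far more confident. Fusing features before classification (early fusion) is the usual alternative and requires a jointly trained model.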
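The temporal self-attention used by the transformer-based models discussed earlier (for example, to produce inter-frame correlations in ST-TR [190]) can be illustrated by a simplified scaled dot-product self-attention over per-frame feature vectors. The sketch below sets queries, keys, and values equal to the input frames and omits the learned projections, multiple heads, and positional encodings of a real transformer, so it shows the mechanism only and is not any author's implementation. The function name `self_attention` and the toy frame features are hypothetical.

```python
import math


def self_attention(frames):
    """Scaled dot-product self-attention with Q = K = V = frames.

    frames: list of equal-length feature vectors, one per video frame.
    Returns (outputs, weights); weights[i][j] is how strongly frame i
    attends to frame j.
    """
    d = len(frames[0])
    outputs, weights = [], []
    for query in frames:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in frames]
        # Softmax turns the similarities into attention weights.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        weights.append(attn)
        # Each output is the attention-weighted sum of all frame vectors.
        outputs.append([sum(w * v[j] for w, v in zip(attn, frames))
                        for j in range(d)])
    return outputs, weights


# Hypothetical 2-D features: frames 0 and 1 are alike, frame 2 differs.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(frames)
for row in weights:
    print([round(w, 3) for w in row])
```

Frames with similar features receive larger mutual weights, which is how correlations between distant frames can be captured without recurrence.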
20. Sargano AB, Angelov P, Habib Z (2017) A comprehensive review on handcrafted and learning-based action representation approaches for human activity recognition. Appl Sci 7(1):110
21. Ke S-R, Thuc HLU, Lee Y-J, Hwang J-N, Yoo J-H, Choi K-H (2013) A review on video-based human activity recognition. Computers 2(2):88–131
22. Cheng G, Wan Y, Saudagar A, Namuduri K, Buckles B (2015) Advances in human action recognition: a survey. arXiv preprint arXiv:1501.05964
23. Dawn DD, Shaikh SH (2016) A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis Comput 32(3):289–306
24. Vrigkas M, Nikou C, Kakadiaris IA (2015) A review of human activity recognition methods. Front Robot AI 2:28
25. Herath S, Harandi M, Porikli F (2017) Going deeper into action recognition: a survey. Image Vis Comput 60:4–21
26. Jegham I, Khalifa AB, Alouani I, Mahjoub MA (2020) Vision-based human action recognition: an overview and real world challenges. Forensic Sci Int Digit Invest 32:200901
27. Wang Z et al (2019) A survey on human behavior recognition using channel state information. IEEE Access 7:155986–156024
28. Rodríguez-Moreno I, Martínez-Otzeta JM, Sierra B, Rodriguez I, Jauregi E (2019) Video activity recognition: state-of-the-art. Sensors 19(14):3160
29. Liu J, Liu H, Chen Y, Wang Y, Wang C (2019) Wireless sensing for human activity: a survey. IEEE Commun Surv Tutor 22(3):1629–1645
30. Dang LM, Min K, Wang H, Piran MJ, Lee CH, Moon H (2020) Sensor-based and vision-based human activity recognition: a comprehensive survey. Pattern Recogn 108:107561
31. Chaurasia SK, Reddy S (2022) State-of-the-art survey on activity recognition and classification using smartphones and wearable sensors. Multimedia Tools Appl 81(1):1077–1108
32. Yao G, Lei T, Zhong J (2019) A review of convolutional-neural-network-based action recognition. Pattern Recogn Lett 118:14–22
33. Zhang H-B et al (2019) A comprehensive survey of vision-based human action recognition methods. Sensors 19(5):1005
34. Das B, Saha A (2021) A survey on current trends in human action recognition. In: Advances in medical physics and healthcare engineering, Springer, pp 443–453
35. Gupta N, Gupta SK, Pathak RK, Jain V, Rashidi P, Suri JS (2022) Human activity recognition in artificial intelligence framework: a narrative review. Artif Intell Rev 3:1–54
36. Zhu F, Shao L, Xie J, Fang Y (2016) From handcrafted to learned representations for human action recognition: a survey. Image Vis Comput 55:42–52
37. Tripathi RK, Jalal AS, Agrawal SC (2018) Suspicious human activity recognition: a review. Artif Intell Rev 50(2):283–339
38. Chaquet JM, Carmona EJ, Fernández-Caballero A (2013) A survey of video datasets for human action and activity recognition. Comput Vis Image Underst 117(6):633–659
39. Zhang J, Li W, Ogunbona PO, Wang P, Tang C (2016) RGB-D-based action recognition datasets: a survey. Pattern Recogn 60:86–105
40. Singh T, Vishwakarma DK (2019) Video benchmarks of human action datasets: a review. Artif Intell Rev 52(2):1107–1154
41. Wang J, Nie X, Xia Y, Wu Y, Zhu S-C (2014) Cross-view action modeling, learning and recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2649–2656
42. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proceedings of the 17th international conference on pattern recognition, 2004. ICPR 2004, vol. 3: IEEE, pp 32–36
43. Gorelick L, Blank M, Shechtman E, Irani M, Basri R (2007) Actions as space-time shapes. IEEE Trans Pattern Anal Mach Intell 29(12):2247–2253
44. Xia L, Chen C-C, Aggarwal JK (2012) View invariant human action recognition using histograms of 3d joints. In: 2012 IEEE computer society conference on computer vision and pattern recognition workshops, IEEE, pp 20–27
45. Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center Res Comput Vis 2:666
46. Rahmani A, Mahmood A, Huynh D, Mian A (2014) Action classification with locality-constrained linear coding. In: 2014 22nd international conference on pattern recognition, IEEE, pp 3511–3516
47. Weinland D, Ronfard R, Boyer E (2006) Free viewpoint action recognition using motion history volumes. Comput Vis Image Underst 104(2–3):249–257
48. Niebles JC, Chen C-W, Fei-Fei L (2010) Modeling temporal structure of decomposable motion segments for activity classification. European conference on computer vision. Springer, Berlin, pp 392–405
49. Marszalek M, Laptev I, Schmid C (2009) Actions in context. In: 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp 2929–2936
50. Reddy KK, Shah M (2013) Recognizing 50 human action categories of web videos. Mach Vis Appl 24(5):971–981
51. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp 1725–1732
52. Heilbron FC, Escorcia V, Ghanem B, Niebles JC (2015) Activitynet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 961–970
53. Abu-El-Haija S et al. (2016) Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675
54. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: 2011 international conference on computer vision, IEEE, pp 2556–2563
55. Yu S, Tan D, Tan T (2006) A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: 18th international conference on pattern recognition (ICPR'06), vol 4: IEEE, pp 441–444
56. Gu C et al. (2018) Ava: a video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6047–6056
57. Sultani W, Chen C, Shah M (2018) Real-world anomaly detection in surveillance videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6479–6488
58. Li W, Zhang Z, Liu Z (2010) Action recognition based on a bag of 3d points. In: 2010 IEEE computer society conference on computer vision and pattern recognition-workshops, IEEE, pp 9–14
59. Berclaz J, Fleuret F, Turetken E, Fua P (2011) Multiple object tracking using k-shortest paths optimization. IEEE Trans Pattern Anal Mach Intell 33(9):1806–1819
60. Hu J-F, Zheng W-S, Ma L, Wang G, Lai J (2016) Real-time RGB-D activity prediction by soft regression. European Conference on Computer Vision. Springer, Berlin, pp 280–296
61. Sung J, Ponce C, Selman B, Saxena A (2012) Unstructured human activity detection from rgbd images. In: 2012 IEEE international conference on robotics and automation, IEEE, pp 842–849
62. Koppula HS, Gupta R, Saxena A (2013) Learning human activities and object affordances from rgb-d videos. Int J Robot Res 32(8):951–970
63. Chen C, Jafari R, Kehtarnavaz N (2015) UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In: 2015 IEEE international conference on image processing (ICIP), IEEE, pp 168–172
64. Ni B, Wang G, Moulin P (2011) Rgbd-hudaact: A color-depth video database for human daily activity recognition. In: 2011 IEEE international conference on computer vision workshops (ICCV workshops), IEEE, pp 1147–1153
65. Ofli F, Chaudhry R, Kurillo G, Vidal R, Bajcsy R (2013) Berkeley mhad: a comprehensive multimodal human action database. In: 2013 IEEE workshop on applications of computer vision (WACV), IEEE, pp 53–60
66. Wolf C et al (2014) Evaluation of video activity localizations integrating quality and quantity measurements. Comput Vis Image Underst 127:14–30
67. Bloom V, Argyriou V, Makris D (2014) G3di: A gaming interaction dataset with a real time detection and evaluation framework. European conference on computer vision. Springer, Berlin, pp 698–712
68. Shahroudy A, Liu J, Ng T-T, Wang G (2016) Ntu rgb+d: a large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1010–1019
69. Van Gemeren C, Tan RT, Poppe R, Veltkamp RC (2014) Dyadic interaction detection from pose and flow. International Workshop on Human Behavior Understanding. Springer, Berlin, pp 101–115
70. Jalal A, Kim Y-H, Kim Y-J, Kamal S, Kim D (2017) Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recogn 61:295–308
71. Lin J, Gan C, Han S (2019) Tsm: temporal shift module for efficient video understanding. In: Proceedings of the IEEE international conference on computer vision, pp 7083–7093
72. Soomro K, Idrees H, Shah M (2016) Predicting the where and what of actors and actions through online action localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2648–2657
73. Singh G, Saha S, Sapienza M, Torr PH, Cuzzolin F (2017) Online real-time multiple spatiotemporal action localisation and prediction. In: Proceedings of the IEEE international conference on computer vision, pp 3637–3646
80. Wu Z, Jiang Y-G, Wang X, Ye H, Xue X, Wang J (2015) Fusing multi-stream deep networks for video classification. arXiv preprint arXiv:1509.06086
81. Mukherjee S, Anvitha L, Lahari TM (2018) Human activity recognition in RGB-D videos by dynamic images. arXiv preprint arXiv:1807.02947
82. Zhang C, Tian Y, Guo X, Liu J (2018) DAAL: deep activation-based attribute learning for action recognition in depth videos. Comput Vis Image Underst 167:37–49
83. Franco A, Magnani A, Maio D (2020) A multimodal approach for human activity recognition based on skeleton and RGB data. Pattern Recogn Lett 131:293–299
84. Bobick AF, Davis JW (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(3):257–267
85. Hu Y, Cao L, Lv F, Yan S, Gong Y, Huang TS (2009) Action detection in complex scenes with spatial and temporal ambiguities. In: 2009 IEEE 12th international conference on computer vision, IEEE, pp 128–135
86. Roh M-C, Shin H-K, Lee S-W (2010) View-independent human action recognition with volume motion template on single stereo camera. Pattern Recogn Lett 31(7):639–647
87. Qian H, Mao Y, Xiang W, Wang Z (2010) Recognition of human activities using SVM multi-class classifier. Pattern Recogn Lett 31(2):100–111
88. Kim W, Lee J, Kim M, Oh D, Kim C (2010) Human action recognition using ordinal measure of accumulated motion. EURASIP J Adv Signal Process 2010(1):1–11
89. Ijsselmuiden J, Stiefelhagen R (2010) Towards high-level human activity recognition through computer vision and temporal logic. Annual conference on artificial intelligence. Springer, Berlin, pp 426–435
90. Fang C-H, Chen J-C, Tseng C-C, Lien J-JJ (2009) Human action recognition using spatio-temporal classification. Asian conference on computer vision. Springer, Berlin, pp 98–109
91. Ziaeefard M, Ebrahimnezhad H (2010) Hierarchical human action recognition by normalized-polar histogram. In: 2010 20th international conference on pattern recognition, IEEE, pp 3720–3723
92. Wang Y, Mori G (2009) Human action recognition by semilatent topic models. IEEE Trans Pattern Anal Mach Intell 31(10):1762–1774
93. Guo K, Ishwar P, Konrad J (2009) Action recognition in video by covariance matching of silhouette tunnels. In: 2009 XXII Brazilian symposium on computer graphics and image pro-
74. Zolfaghari M, Singh K, Brox T (2018) Eco: efficient convolu- cessing, IEEE, pp 299–306
tional network for online video understanding. In: Proceedings 94. Kim T-K, Cipolla R (2008) Canonical correlation analysis of
of the European conference on computer vision (ECCV), video volume tensors for action categorization and detection.
pp 695–712 IEEE Trans Pattern Anal Mach Intell 31(8):1415–1428
75. Xu M, Gao M, Chen Y-T, Davis LS, Crandall DJ (2019) 95. Messing R, Pal C, Kautz H (2009) Activity recognition using the
Temporal recurrent networks for online action detection. In: velocity histories of tracked keypoints. In: 2009 IEEE 12th
Proceedings of the IEEE/CVF international conference on international conference on computer vision, IEEE, pp 104–111
computer vision, pp 5532–5541 96. Wang H, Kläser A, Schmid C, Liu C-L (2011) Action recog-
76. Gao M, Zhou Y, Xu R, Socher R, Xiong C (2020) WOAD: nition by dense trajectories. In: CVPR 2011, IEEE,
weakly supervised online action detection in untrimmed videos. pp 3169–3176
arXiv preprint arXiv:2006.03732 97. Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior
77. Ye Y, Li K, Qi G-J, Hua KA (2015) Temporal order-preserving recognition via sparse spatio-temporal features. In: 2005 IEEE
dynamic quantization for human action recognition from mul- international workshop on visual surveillance and performance
timodal sensor streams. In: Proceedings of the 5th ACM on evaluation of tracking and surveillance, IEEE, pp 65–72
international conference on multimedia retrieval, pp 99–106 98. Jones S, Shao L, Zhang J, Liu Y (2012) Relevance feedback for
78. Vrigkas M, Nikou C, Kakadiadis IA (2014) Classifying behav- real-world human action retrieval. Pattern Recogn Lett
ioral attributes using conditional random fields. Hellenic con- 33(4):446–452
ference on artificial intelligence. Springer, Berlin, pp 95–104 99. Gilbert A, Illingworth J, Bowden R (2009) Fast realistic multi-
79. Shahroudy A, Ng T-T, Yang Q, Wang G (2015) Multimodal action recognition using mined dense spatio-temporal features.
multipart learning for action recognition in depth videos. IEEE In: 2009 IEEE 12th international conference on computer vision,
Trans Pattern Anal Mach Intell 38(10):2123–2129 IEEE, pp 925–931
100. Sadek S, Al-Hamadi A, Michaelis B, Sayed U (2011) An action recognition scheme using fuzzy log-polar histogram and temporal self-similarity. EURASIP J Adv Signal Process 2011(1):540375
101. Ikizler-Cinbis N, Sclaroff S (2010) Object, scene and actions: Combining multiple features for human action recognition. European conference on computer vision. Springer, Berlin, pp 494–507
102. Minhas R, Baradarani A, Seifzadeh S, Wu QJ (2010) Human action recognition using extreme learning machine based on visual vocabularies. Neurocomputing 73(10–12):1906–1917
103. Darrell T, Pentland A (1993) Space-time gestures. In: Proceedings of IEEE conference on computer vision and pattern recognition, IEEE, pp 335–340
104. Gavrila DM, Davis LS (1996) 3-D model-based tracking of humans in action: a multi-view approach. In: Proceedings cvpr ieee computer society conference on computer vision and pattern recognition, IEEE, pp 73–80
105. Veeraraghavan A, Chellappa R, Roy-Chowdhury AK (2006) The function space of an activity. In: 2006 IEEE Computer society conference on computer vision and pattern recognition (CVPR’06), vol 1: IEEE, pp 959–968
106. Yacoob Y, Black MJ (1999) Parameterized modeling and recognition of activities. Comput Vis Image Underst 73(2):232–247
107. Efros AA, Berg AC, Mori G, Malik J (2003) Recognizing action at a distance. In: Null, IEEE, p 726
108. Lublinerman R, Ozay N, Zarpalas D, Camps O (2006) Activity recognition from silhouettes using linear systems and model (in)validation techniques. In: 18th international conference on pattern recognition (ICPR’06), vol 1: IEEE, pp 347–350
109. Jiang H, Drew MS, Li Z-N (2006) Successive convex matching for action detection. In: 2006 IEEE Computer society conference on computer vision and pattern recognition (CVPR’06), vol 2: IEEE, pp 1646–1653
110. Lin Z, Jiang Z, Davis LS (2009) Recognizing actions by shape-motion prototype trees. In: 2009 IEEE 12th international conference on computer vision, IEEE, pp 444–451
111. Yamato J, Ohya J, Ishii K (1992) Recognizing human action in time-sequential images using hidden markov model. CVPR 92:379–385
112. Starner T, Pentland A (1997) Real-time american sign language recognition from video using hidden Markov models. In: Motion-based recognition, Springer, pp 227–243
113. Vogler C, Metaxas D (1999) Parallel hidden Markov models for American sign language recognition. In: Proceedings of the seventh IEEE international conference on computer vision, vol 1: IEEE, pp 116–122
114. Bobick AF, Wilson AD (1997) A state-based approach to the representation and recognition of gesture. IEEE Trans Pattern Anal Mach Intell 19(12):1325–1337
115. Oliver NM, Rosario B, Pentland AP (2000) A Bayesian computer vision system for modeling human interactions. IEEE Trans Pattern Anal Mach Intell 22(8):831–843
116. Park S, Aggarwal JK (2004) A hierarchical Bayesian network for event recognition of human actions and interactions. Multimedia Syst 10(2):164–179
117. Natarajan P, Nevatia R (2007) Coupled hidden semi markov models for activity recognition. In: 2007 IEEE workshop on motion and video computing (WMVC’07), IEEE, pp 10–10
118. Gupta A, Davis LS (2007) Objects in action: An approach for combining action understanding and object perception. In: 2007 IEEE conference on computer vision and pattern recognition, IEEE, pp 1–8
119. Moore DJ, Essa IA, Hayes MH (1999) Exploiting human actions and object context for recognition tasks. In: Proceedings of the seventh IEEE international conference on computer vision, vol 1: IEEE, pp 80–86
120. Yu E, Aggarwal JK (2009) Human action recognition with extremities as semantic posture representation. In: 2009 IEEE computer society conference on computer vision and pattern recognition workshops, IEEE, pp 1–8
121. Kellokumpu V, Zhao G, Pietikäinen M (2011) Recognition of human actions using texture descriptors. Mach Vis Appl 22(5):767–780
122. Shi Q, Cheng L, Wang L, Smola A (2011) Human action segmentation and recognition using discriminative semi-Markov models. Int J Comput Vision 93(1):22–32
123. Wang L, Suter D (2007) Recognizing human activities from silhouettes: motion subspace and factorial discriminative graphical model. In: 2007 IEEE conference on computer vision and pattern recognition, IEEE, pp 1–8
124. Rahman SA, Cho S-Y, Leung M (2012) Recognising human actions by analysing negative spaces. IET Comput Vision 6(3):197–213
125. Vishwakarma DK, Kapoor R (2015) Hybrid classifier based human activity recognition using the silhouette and cells. Expert Syst Appl 42(20):6957–6965
126. Junejo IN, Junejo KN, Al Aghbari Z (2014) Silhouette-based human action recognition using SAX-Shapes. The Visual Comput 30(3):259–269
127. Chaaraoui AA, Climent-Pérez P, Flórez-Revuelta F (2013) Silhouette-based human action recognition using sequences of key poses. Pattern Recogn Lett 34(15):1799–1807
128. Chaaraoui AA, Flórez-Revuelta F (2014) A low-dimensional radial silhouette-based feature for fast human action recognition fusing multiple views. Int Schol Res Notices 2014:6666
129. Cheema S, Eweiwi A, Thurau C, Bauckhage C (2011) Action recognition by learning discriminative key poses. In: 2011 IEEE international conference on computer vision workshops (ICCV Workshops), IEEE, pp 1302–1309
130. Chun S, Lee C-S (2016) Human action recognition using histogram of motion intensity and direction from multiple views. IET Comput Vision 10(4):250–257
131. Murtaza F, Yousaf MH, Velastin SA (2016) Multi-view human action recognition using 2D motion templates based on MHIs and their HOG description. IET Comput Vision 10(7):758–767
132. Ladjailia A, Bouchrika I, Merouani HF, Harrati N, Mahfouf Z (2020) Human activity recognition via optical flow: decomposing activities into basic actions. Neural Comput Appl 32(21):16387–16400
133. Ahmad M, Lee S-W (2006) HMM-based human action recognition using multiview image sequences. In: 18th international conference on pattern recognition (ICPR’06), vol 1: IEEE, pp 263–266
134. Pehlivan S, Forsyth DA (2014) Recognizing activities in multiple views with fusion of frame judgments. Image Vis Comput 32(4):237–249
135. Jiang Z, Lin Z, Davis L (2012) Recognizing human actions by learning and matching shape-motion prototype trees. IEEE Trans Pattern Anal Mach Intell 34(3):533–547
136. Eweiwi A, Cheema S, Thurau C, Bauckhage C (2011) Temporal key poses for human action recognition. In: 2011 IEEE international conference on computer vision workshops (ICCV Workshops), IEEE, pp 1310–1317
137. Shi Y, Huang Y, Minnen D, Bobick A, Essa I (2004) Propagation networks for recognition of partially ordered sequential action. In: Proceedings of the 2004 IEEE computer society conference on computer vision and pattern recognition, CVPR 2004, vol. 2: IEEE, pp II–II
138. Yin J, Meng Y (2010) Human activity recognition in video using a hierarchical probabilistic latent model. In: 2010 IEEE
computer society conference on computer vision and pattern recognition-workshops, IEEE, pp 15–20
139. Mauthner T, Roth PM, Bischof H (2010) Temporal feature weighting for prototype-based action recognition. Asian conference on computer vision. Springer, Berlin, pp 566–579
140. Han L, Wu X, Liang W, Hou G, Jia Y (2010) Discriminative human action recognition in the learned hierarchical manifold space. Image Vis Comput 28(5):836–849
141. Zeng Z, Ji Q (2010) Knowledge based activity recognition with dynamic bayesian network. European conference on computer vision. Springer, Berlin, pp 532–546
142. Minnen D, Essa I, Starner T (2003) Expectation grammars: leveraging high-level expectations for activity recognition. In: 2003 IEEE computer society conference on computer vision and pattern recognition, 2003. Proceedings, vol 2: IEEE, pp II–II
143. Moore D, Essa I (2002) Recognizing multitasked activities from video using stochastic context-free grammar. In: AAAI/IAAI, pp 770–776
144. Kitani KM, Sato Y, Sugimoto A (2008) Recovering the basic structure of human activities from noisy video-based symbol strings. Int J Pattern Recognit Artif Intell 22(08):1621–1646
145. Wang L, Wang Y, Gao W (2011) Mining layered grammar rules for action recognition. Int J Comput Vision 93(2):162–182
146. Nevatia R, Hobbs J, Bolles B (2004) An ontology for video event representation. In: 2004 Conference on computer vision and pattern recognition workshop, IEEE, pp 119–119
147. Ryoo MS, Aggarwal JK (2006) Recognition of composite human activities through context-free grammar based representation. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), vol 2: IEEE, pp 1709–1718
148. Pinhanez CS, Bobick AF (1998) Human action detection using pnf propagation of temporal constraints. In: Proceedings. 1998 IEEE computer society conference on computer vision and pattern recognition (Cat. No. 98CB36231), IEEE, pp 898–904
149. Ghanem N, De Menthon D, Doermann D, Davis L (2004) Representation and recognition of events in surveillance video using petri nets. In: 2004 conference on computer vision and pattern recognition workshop, IEEE, pp 112–112
150. Intille SS, Bobick AF (1999) A framework for recognizing multi-agent action from visual evidence. AAAI/IAAI 99(518–525):2
151. Siskind JM (2001) Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. J Artif Intell Res 15:31–90
152. Tran SD, Davis LS (2008) Event modeling and recognition using markov logic networks. European conference on computer vision. Springer, Berlin, pp 610–623
153. Morariu VI, Davis LS (2011) Multi-agent event recognition in structured scenarios. In: CVPR 2011, IEEE, pp 3289–3296
154. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558
155. Kang L, Ye P, Li Y, Doermann D (2014) Convolutional neural networks for no-reference image quality assessment. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1733–1740
156. Banzhaf W, Nordin P, Keller RE, Francone FD (1998) Genetic programming. Springer, Berlin
157. Shao L, Ji L, Liu Y, Zhang J (2012) Human action segmentation and recognition via motion and shape analysis. Pattern Recogn Lett 33(4):438–445
158. Marĉelja S (1980) Mathematical description of the responses of simple cortical cells. JOSA 70(11):1297–1300
159. Primer A, Burrus CS, Gopinath RA (1998) Introduction to wavelets and wavelet transforms. Prentice Hall, Upper Saddle River
160. Harris ZS (1954) Distributional structure. Word 10(2–3):146–162
161. Guha T, Ward RK (2011) Learning sparse representations for human action recognition. IEEE Trans Pattern Anal Mach Intell 34(8):1576–1588
162. Zheng J, Jiang Z, Phillips PJ, Chellappa R (2012) Cross-view action recognition via a transferable dictionary pair. BMVC 1:7
163. Zhu F, Shao L (2014) Weakly-supervised cross-domain dictionary learning for visual recognition. Int J Comput Vision 109(1–2):42–59
164. Kim H-J, Lee JS, Yang H-S (2007) Human action recognition using a modified convolutional neural network. International symposium on neural networks. Springer, Berlin, pp 715–723
165. Jones JP, Palmer LA (1987) An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J Neurophysiol 58(6):1233–1258
166. Kim H-J, Lee J, Yang H-S (2006) A weighted FMM neural network and its application to face detection. International conference on neural information processing. Springer, Berlin, pp 177–186
167. Jhuang H, Serre T, Wolf L, Poggio T (2007) A biologically inspired system for action recognition. In: 2007 IEEE 11th international conference on computer vision, IEEE, pp 1–8
168. Shao L, Liu L, Li X (2013) Feature learning for image classification via multiobjective genetic programming. IEEE Trans Neural Netw Learn Syst 25(7):1359–1371
169. Taylor GW, Hinton GE, Roweis ST (2007) Modeling human motion using binary latent variables. In: Advances in neural information processing systems, pp 1345–1352
170. Baum LE, Petrie T (1966) Statistical inference for probabilistic functions of finite state Markov chains. Ann Math Stat 37(6):1554–1563
171. Ji S, Xu W, Yang M, Yu K (2012) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
172. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
173. Le QV, Zou WY, Yeung SY, Ng AY (2011) Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR 2011, IEEE, pp 3361–3368
174. Hyvarinen A, Hurri J, Hoyer PO (2009) A probabilistic approach to early computational vision. Nat Image Stat 2:666
175. Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2(1–3):37–52
176. Baccouche M, Mamalet F, Wolf C, Garcia C, Baskurt A (2011) Sequential deep learning for human action recognition. International workshop on human behavior understanding. Springer, Berlin, pp 29–39
177. Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2013) Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229
178. Jia Y et al. (2014) Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on Multimedia, ACM, pp 675–678
179. Ning F, Delhomme D, LeCun Y, Piano F, Bottou L, Barbano PE (2005) Toward automatic phenotyping of developing embryos from videos. IEEE Trans Image Process 14(9):1360–1371
180. Singh T, Vishwakarma DK (2021) A deeply coupled ConvNet for human activity recognition using dynamic and RGB images. Neural Comput Appl 33(1):469–485
181. Yao L, Qian Y (2018) Dt-3dresnet-lstm: An architecture for temporal activity recognition in videos. Pacific Rim conference on multimedia. Springer, Berlin, pp 622–632
182. Meng B, Liu X, Wang X (2018) Human action recognition based on quaternion spatial-temporal convolutional neural network and LSTM in RGB videos. Multimedia Tools Appl 77(20):26901–26918
183. Qi M, Qin J, Li A, Wang Y, Luo J, Van Gool L (2018) stagnet: an attentive semantic RNN for group activity recognition. In: Proceedings of the European conference on computer vision (ECCV), pp 101–117
184. Qi M, Wang Y, Qin J, Li A, Luo J, Van Gool L (2019) stagNet: an attentive semantic RNN for group activity and individual action recognition. IEEE Trans Circuits Syst Video Technol 30(2):549–565
185. Muhammad K et al (2021) Human action recognition using attention based LSTM network with dilated CNN features. Futur Gener Comput Syst 125:820–830
186. He J-Y, Wu X, Cheng Z-Q, Yuan Z, Jiang Y-G (2021) DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition. Neurocomputing 444:319–331
187. Hu K, Zheng F, Weng L, Ding Y, Jin J (2021) Action recognition algorithm of Spatio-temporal differential LSTM based on feature enhancement. Appl Sci 11(17):7876
188. Vaswani A et al. (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
189. Neimark D, Bar O, Zohar M, Asselmann D (2021) Video transformer network. arXiv preprint arXiv:2102.00719
190. Plizzari C, Cannici M, Matteucci M (2021) Spatial temporal transformer network for skeleton-based action recognition. International conference on pattern recognition. Springer, Berlin, pp 694–701
191. Mazzia V, Angarano S, Salvetti F, Angelini F, Chiaberge M (2021) Action transformer: a self-attention model for short-time human action recognition. arXiv preprint arXiv:2107.00606
192. Ullah A, Muhammad K, Haq IU, Baik SW (2019) Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments. Futur Gener Comput Syst 96:386–397
193. Chong YS, Tay YH (2017) Abnormal event detection in videos using spatiotemporal autoencoder. International symposium on neural networks. Springer, Berlin, pp 189–196
194. Cui R, Hua G, Wu J (2020) AP-GAN: predicting skeletal activity to improve early activity recognition. J Vis Commun Image Represent 73:102923
195. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
196. Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4305–4314
197. Sánchez J, Perronnin F, Mensink T, Verbeek J (2013) Image classification with the fisher vector: theory and practice. Int J Comput Vision 105(3):222–245
198. Gowda SN, Sevilla-Lara L, Keller F, Rohrbach M (2021) CLASTER: clustering with reinforcement learning for zero-shot action recognition. arXiv preprint arXiv:2101.07042
199. Liu K, Liu W, Ma H, Huang W, Dong X (2019) Generalized zero-shot learning for action recognition with web-scale video data. World Wide Web 22(2):807–824
200. Ornek EP (2020) Zero-shot activity recognition with videos. arXiv preprint arXiv:2002.02265
201. Taylor GW, Fergus R, LeCun Y, Bregler C (2010) Convolutional learning of spatio-temporal features. European conference on computer vision. Springer, Berlin, pp 140–153
202. Collobert R, Weston J (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th international conference on Machine learning, pp 160–167
203. Yan Y, Ricci E, Subramanian R, Liu G, Sebe N (2014) Multi-task linear discriminant analysis for view invariant action recognition. IEEE Trans Image Process 23(12):5599–5611
204. Yang Q (2009) Activity recognition: linking low-level sensors to high-level intelligence. In: Twenty-first international joint conference on artificial intelligence
205. Zheng VW, Hu DH, Yang Q (2009) Cross-domain activity recognition. In: Proceedings of the 11th international conference on Ubiquitous computing, pp 61–70
206. Liu J, Shah M, Kuipers B, Savarese S (2011) Cross-view action recognition via view knowledge transfer. In: CVPR 2011, IEEE, pp 3209–3216
207. Oquab M, Bottou L, Laptev I, Sivic J (2014) Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1717–1724
208. Wang H, Schmid AC, Liu C-L (2011) Action recognition by dense trajectories. Proc IEEE Conf Comput Vis Pattern Recognit 2:3169–3176
209. Kliper-Gross O, Gurovich Y, Hassner T, Wolf L (2012) Motion interchange patterns for action recognition in unconstrained videos. European conference on computer vision. Springer, Berlin, pp 256–269
210. Oneata D, Verbeek J, Schmid C (2013) Action and event recognition with fisher vectors on a compact feature set. In: Proceedings of the IEEE international conference on computer vision, pp 1817–1824
211. Jain M, Jégou H, Bouthemy P (2013) Better exploiting motion for better action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2555–2562
212. Peng X, Zou C, Qiao Y, Peng Q (2014) Action recognition with stacked fisher vectors. European conference on computer vision. Springer, Berlin, pp 581–595
213. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199
214. Sun L, Jia K, Yeung D-Y, Shi BE (2015) Human action recognition using factorized spatio-temporal convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4597–4605
215. Wang L, Xiong Y, Wang Z, Qiao Y (2015) Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159
216. Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4694–4702
217. Fernando B, Gavves E, Oramas JM, Ghodrati A, Tuytelaars T (2015) Modeling video evolution for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5378–5387
218. Donahue J et al. (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2625–2634
219. Jiang Y-G, Dai Q, Liu W, Xue X, Ngo C-W (2015) Human action recognition in unconstrained videos by explicit motion modeling. IEEE Trans Image Process 24(11):3781–3795
220. Lan Z, Lin M, Li X, Hauptmann AG, Raj B (2015) Beyond gaussian pyramid: Multi-skip feature stacking for action
recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 204–212
221. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497
222. Fernando B, Gould S (2016) Learning end-to-end video classification with rank-pooling. In: International conference on machine learning, PMLR, pp 1187–1196
223. Fernando B, Anderson P, Hutter M, Gould S (2016) Discriminative hierarchical rank pooling for activity recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1924–1932
224. Li Y, Li W, Mahadevan V, Vasconcelos N (2016) Vlad3: encoding dynamics of deep features for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1951–1960
225. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1933–1941
226. Varol G, Laptev I, Schmid C (2017) Long-term temporal convolutions for action recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1510–1517
227. Singh D, Mohan CK (2017) Graph formulation of video activities for abnormal activity recognition. Pattern Recogn 65:265–272
228. Carmona JM, Climent J (2018) Human action recognition by means of subtensor projections and dense trajectories. Pattern Recogn 81:443–455
229. Mao F, Wu X, Xue H, Zhang R (2018) Hierarchical video frame sequence representation with deep convolutional graph network. In: Proceedings of the European conference on computer vision (ECCV) workshops, pp 0–0
230. Siddiqi MH, Alruwaili M, Ali A (2019) A novel feature selection method for video-based human activity recognition systems. IEEE Access 7:119593–119602
231. Zhang Y, Po LM, Liu M, Rehman YAU, Ou W, Zhao Y (2020) Data-level information enhancement: motion-patch-based Siamese convolutional neural networks for human activity recognition in videos. Expert Syst Appl 147:113203
232. Arzani MM, Fathy M, Azirani AA, Adeli E (2020) Switching structured prediction for simple and complex human activity recognition. IEEE Trans Cybern 6:7777
233. Gowda SN, Rohrbach M, Sevilla-Lara L (2020) SMART frame selection for action recognition. arXiv e-prints, p. arXiv:2012.10671
234. Wharton Z, Behera A, Liu Y, Bessis N (2021) Coarse temporal attention network (cta-net) for driver’s activity recognition. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1279–1289
235. Ullah A, Muhammad K, Ding W, Palade V, Haq IU, Baik SW (2021) Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications. Appl Soft Comput 103:107102
236. Khan MA et al (2021) A fused heterogeneous deep neural network and robust feature selection framework for human actions recognition. Arabian J Sci Eng 6:1–16
237. Ullah A, Muhammad K, Hussain T, Baik SW (2021) Conflux LSTMs network: a novel approach for multi-view action recognition. Neurocomputing 435:321–329
238. Reinolds F, Neto C, Machado J (2022) Deep learning for activity recognition using audio and video. Electronics 11(5):782
239. Siddiqi MH, Alsirhani A (2022) An efficient feature selection method for video-based activity recognition systems. Math Problems Eng 2022:66689
240. Khare M, Jeon M (2022) Multi-resolution approach to human activity recognition in video sequence based on combination of complex wavelet transform, Local Binary Pattern and Zernike moment. Multimedia Tools Appl 2:1–30
241. Deotale D et al (2022) HARTIV: human activity recognition using temporal information in videos. CMC-Comput Mater Continua 70(2):3919–3938
242. Zhang C, Wu J, Li Y (2022) ActionFormer: localizing moments of actions with transformers. arXiv preprint arXiv:2202.07925
243. Ahmed N, Asif HMS, Khalid H (2021) PIQI: perceptual image quality index based on ensemble of Gaussian process regression. Multimedia Tools Appl 80(10):15677–15700
244. Ahmed SAN (2022) BIQ2021: a large-scale blind image quality assessment database. arXiv preprint arXiv:submit/4155160
245. Ahmed N, Asif HS, Bhatti AR, Khan A (2022) Deep ensembling for perceptual image quality assessment. Soft Comput 2:1–22
246. Ahmed N, Asif HMS (2020) Perceptual quality assessment of digital images using deep features. Comput Inform 39(3):385–409
247. Alzantot M, Chakraborty S, Srivastava M (2017) Sensegen: a deep learning architecture for synthetic sensor data generation. In: 2017 IEEE international conference on pervasive computing and communications workshops (PerCom Workshops), IEEE, pp 188–193

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.