Toward Human Activity Recognition: A Survey: Original Article

This survey examines human activity recognition (HAR), detailing various approaches and trends in the field, including a proposed taxonomy that categorizes methods into online/offline, unimodal/multimodal, and handcrafted/learning-based approaches. The study analyzes 46 state-of-the-art HAR methods and discusses benchmark datasets, highlighting the challenges and future research directions in HAR. It aims to provide a comprehensive overview of the multidisciplinary nature of HAR and its applications across various domains.

Neural Computing and Applications (2023) 35:4145–4182


ORIGINAL ARTICLE

Toward human activity recognition: a survey

Gulshan Saleem1 • Usama Ijaz Bajwa1 • Rana Hammad Raza2

Received: 6 March 2021 / Accepted: 10 October 2022 / Published online: 20 October 2022
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022

Abstract
Human activity recognition (HAR) is a complex and multifaceted problem. The research community has reported
numerous approaches to perform HAR. Along with HAR approaches, various surveys have revealed HAR trends in various
environments and applications. HAR is linked to a variety of technology-dependent daily life systems, such as human–
computer interaction systems, security surveillance, video surveillance, healthcare surveillance, robotics, content-based
information retrieval, and monitoring systems. Because of technological advancements, HAR trends change quickly and
necessitate an up-to-date and broader perspective. This study offers an HAR taxonomy, which includes online/offline HAR,
multimodal/unimodal HAR, handcrafted feature-based, and learning-based approaches. This study attempts to present the
multidisciplinary nature of HAR, such as application areas, activity types, task complexities, benchmark datasets, and
methods. This research includes a comparative analysis of state-of-the-art HAR methods and a discussion of popular
datasets. The selected studies have been categorized using taxonomy, and different attributes such as activity complexity,
dataset size, and recognition rate have been used for their analysis. The comparative analysis of HAR approaches has also
helped to highlight domain challenges and open research directions for HAR researchers to follow.

Keywords Activity recognition • Action recognition • Video datasets • Deep learning • Handcrafted features • Video analysis • Computer vision

✉ Usama Ijaz Bajwa
usamabajwa@[Link]

Gulshan Saleem
gulshnsaleem26@[Link]

Rana Hammad Raza
hammad@[Link]

1 Department of Computer Science, COMSATS University Islamabad, Lahore Campus, 1.5 KM Defence Road Off Raiwind Road, Lahore, Pakistan

2 Electronics and Power Engineering Department, Pakistan Navy Engineering College (PNEC), National University of Sciences and Technology (NUST), Habib Ibrahim Rehmatullah Road, Karachi, Pakistan

1 Introduction

Human activity recognition (HAR) is used to detect and classify human activities under appropriate labels. Human activities are complex and evolve temporally, necessitating suitable division into sub-activities, as illustrated in Fig. 1. Human activity is an ongoing task composed of single or multiple gestures, actions, and interactions. Gesture refers to the movement of body parts to emphasize speech, whereas action refers to the collective movement of body parts to complete a task. For example, moving the head in negation is a gesture, walking is an action, and speaking loudly with unpleasant facial expressions is an angry behavior. Interaction is a collection of actions usually performed by two or more subjects, for example, a two-person conversation, fighting, cooking food, data entry, and car washing. Group activities are performed by multiple persons and may include a collection of gestures, actions, and interactions, for example, a football game or a strike. Gestures and actions are easy to recognize and are considered simple, whereas behavior and interactions are intermediate. Multi-person activities such as human–human interaction, group activities, or events are highly complex [1]. Considering the above-mentioned sub-activities, the approaches used to recognize them vary widely. For example, basic methods include feature-based image processing techniques, background/foreground subtraction, action detection, and classification (e.g., optical flow, spatiotemporal interest points) [2–4].


Fig. 1 Human activity recognition (simple to complex activities): Gesture, Action, Behavior, Interaction, Group activity, Event
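
The complexity ordering described in the introduction and in Fig. 1 can be sketched as a small lookup table. This is only an illustrative sketch: the names `Complexity`, `ACTIVITY_COMPLEXITY`, `complexity_of`, and `most_complex` are hypothetical and do not come from the survey, and mapping the generic "interaction" type to the intermediate level is an assumption based on the sentence that groups behavior and interactions together (the text rates multi-person human–human interaction as highly complex).

```python
from enum import Enum


class Complexity(Enum):
    """Complexity levels used in the text: simple, intermediate, high."""
    SIMPLE = 1
    INTERMEDIATE = 2
    HIGH = 3


# Assumed mapping, following the grouping stated in the introduction:
# gestures and actions are simple, behavior and interactions are
# intermediate, group activities and events are highly complex.
ACTIVITY_COMPLEXITY = {
    "gesture": Complexity.SIMPLE,
    "action": Complexity.SIMPLE,
    "behavior": Complexity.INTERMEDIATE,
    "interaction": Complexity.INTERMEDIATE,
    "group activity": Complexity.HIGH,
    "event": Complexity.HIGH,
}


def complexity_of(activity_type: str) -> Complexity:
    """Return the complexity level of an activity type (case-insensitive)."""
    return ACTIVITY_COMPLEXITY[activity_type.strip().lower()]


def most_complex(activity_types):
    """Return the highest complexity level among the given activity types."""
    return max((complexity_of(a) for a in activity_types), key=lambda c: c.value)


if __name__ == "__main__":
    # A video containing both single-person actions and an event is
    # rated by its most complex component.
    print(most_complex(["action", "event"]).name)
```

Rating a recording by its most complex component is a design choice made for this sketch, not a rule stated in the survey.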

Advanced methods are a combination of multiple steps, which can collectively extract advanced features and perform in-depth analysis to recognize human activities [5–8]. Basic computer vision-based methods such as optical flow [9–11], spatiotemporal interest points (STIP) [12], and the hidden Markov model (HMM) [13], as well as advanced deep learning tools, for example, convolutional neural networks (CNN) and recurrent neural networks (RNN) [14–16], are used to recognize human activity.

HAR has a multidisciplinary nature, and various daily life systems are influenced by it. HAR plays a role in indoor/outdoor environments, robotics, content-based information retrieval, human–computer interaction (HCI), security surveillance, video surveillance, the educational sector, monitoring, and social interaction-based applications [17]. Hence, because of the rapid technological advancement of daily life systems, there is a need for an up-to-date survey to discuss the progress of HAR and to highlight its challenges [1]. Considering previous surveys, HAR systems can be classified as online or offline based on the input data and processing strategy. Then, there are unimodal/multimodal approaches that use different modalities, such as video frames, audio cues, skeleton data, and depth data. Most of the previous surveys have discussed handcrafted approaches, and a few recent surveys have incorporated learning-based approaches as well [18–20].

1.1 Methodology for survey of HAR approaches

This study attempts to provide a method-based classification of approaches through a taxonomy. It also provides a comparative analysis of state-of-the-art methods presented since 2011 to give an overview of the HAR domain. This study includes 46 state-of-the-art methods, which are based on topics such as “Human Activity Recognition,” “Human Action Recognition,” “Online Activity Recognition,” “Learning-based Human Activity Recognition,” and “Handcrafted features-based Human Activity Recognition.” The existing state-of-the-art surveys are also discussed to analyze the up-to-date findings. Figure 2 shows the process of selection of studies ((a) Selection Process) and the parameters used for the analysis of these studies ((b) Analysis Process). As shown in Fig. 2, studies are selected from multiple databases, and the initial selection is made on the basis of relevant topics. Then, all studies published earlier than 2011 are excluded, and the search is performed again by analyzing the titles of studies. This process helps to remove duplicates and survey studies from the selected set, and as a result, 2500 studies are left while the others are discarded. This survey is based on reviewing video-based HAR approaches and benchmark datasets. Therefore, the selected set is refined to retain studies that have used a video benchmark dataset to evaluate their model. For this purpose, we have reviewed the abstract, and sometimes the experiments as well when the abstract does not provide the necessary details.

As a result, 46 studies are selected for state-of-the-art analysis, which includes studies on feature-based models, deep learning-based models, online activity recognition models, and methods for multimodal HAR. We have analyzed all selected studies using various parameters, such as publication year, method type, data input, activity level, dataset size, and performance on benchmark datasets. These parameters help in identifying which activities among simple, intermediate, and complex are frequently used in research. The size of the dataset is an important indicator to determine what types of datasets are more useful across selected studies. Several evaluation measures are used for human activity recognition, such as Average Precision, which is the most reported measure. However, accuracy, recall, f-measure, likelihood ratio, and area under curve (AUC) are also popular among studies.

The major contributions of this study are as follows:

• This study highlights various approaches and proposes a HAR taxonomy, and the elements of the taxonomy are discussed with the help of HAR methods. HAR approaches are mainly divided into handcrafted/feature-based HAR and learning-based HAR approaches, which are further sub-divided up to four levels to cover simple feature extraction-based methods such as trajectories and space–time features.


Fig. 2 HAR survey process for analyzing state-of-the-art methods
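
The evaluation measures named in Sect. 1.1 (precision, recall, f-measure, and Average Precision) can be illustrated with a minimal sketch. The survey does not give formulas, so the code below uses the standard textbook definitions as an assumption; in particular, Average Precision is computed here in its ranking form (the mean of the precision values at the rank of each relevant item), which may differ from the exact variant used by an individual study. The function names and the toy labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one activity class (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_precision(relevant, scores):
    """Ranking-form AP: mean precision at the rank of each relevant item.

    relevant: booleans, True if the clip belongs to the class.
    scores:   classifier confidence for the class, same order as relevant.
    """
    ranked = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_relevant) in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


if __name__ == "__main__":
    # Toy labels for four clips; "walk" is treated as the positive class.
    y_true = ["walk", "run", "walk", "jump"]
    y_pred = ["walk", "walk", "walk", "jump"]
    print(precision_recall_f1(y_true, y_pred, "walk"))
    print(average_precision([True, False, True], [0.9, 0.8, 0.7]))
```

A mean over all activity classes of the per-class AP gives the mean Average Precision that large-scale benchmarks commonly report; that aggregation step is omitted here to keep the sketch short.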


Table 1 State-of-the-art survey related to HAR

Author | Year | Activities | Complexity | Application | Contribution
Vishwakarma et al. [18] | 2013 | Abnormal Actions, Behavior, and interactions | Intermediate | Security Surveillance | Based on activity recognition, object tracking, object detection tasks, and behavior understanding using handcrafted approaches. Few surveillance-based video datasets are also discussed
Ke et al. [21] | 2013 | Single person, multi-person, crowd activities (Actions, Interactions) | High | Pose estimation, Falling Detection, Security Surveillance | This survey provided details of video-based activity and abnormal activity recognition methods
Vrigkas et al. [24] | 2015 | Actions, Behavior | Intermediate | Action Recognition, Behavior Understanding | This survey categorized HAR into unimodal and multimodal approaches and supports the effectiveness of the latter approach
Cheng et al. [22] | 2015 | Multi-type of activities including Actions, Interaction | Simple | Action Recognition Systems | This survey focused on human action recognition-based approaches and few benchmark datasets have also been discussed
Zhu et al. [36] | 2016 | Actions | Simple | Action Recognition System | This survey covered the handcrafted and learned representations for human action recognition
Dawn and Shaikh [23] | 2016 | Actions | Simple | Action Recognition System | This survey discussed human action recognition with spatiotemporal interest point (STIP) detector-based methods. Performance of selected methods has been discussed along with their results on different benchmarks
Sargano et al. [20] | 2017 | Actions, Interactions | Intermediate | Human activity Recognition | HAR approaches along with benchmarks have been discussed. Application areas have also been highlighted
Herath et al. [25] | 2017 | Multi type of activities including actions and Interaction | Intermediate | Daily Monitoring Systems, Activity Recognition Systems | This survey is focused on deep representation of action recognition domain. It provides the architectural details of different action recognition models along with performance on few benchmark datasets
Tripathi et al. [37] | 2018 | Abnormal Activities (Actions, Interaction, Group Activities) | High | Abandoned object Detection, Theft Detection, Violence Detection, Illegal Parking on Road Detection, Accidents Detection, Fire Detection | This survey is focused on suspicious activity recognition. Feature-based approaches along with classical machine learning methods have been described to explain state-of-the-art methods
Yao et al. [32] | 2019 | Daily activities, Sports activities (Actions, Interaction) | Intermediate | Human Activity Recognition System, Daily activity monitoring system, Sports System | This survey provided convolutional neural network-based action recognition along with performance of popular methods on large-scale datasets and highlighted the limitations and future directions
Moreno et al. [28] | 2019 | Daily activities (Actions, Interactions) | Intermediate | Human activity recognition system, Monitoring Systems | The survey has divided the approaches into three main categories, i.e., handcrafted features, depth sensors, and deep learning-based approaches, which are further explained briefly
Wang et al. [27] | 2019 | Abnormal Actions, Behavior, and interactions | High | Human behavior recognition | Focused on sensor-based behavior recognition and described the process of channel state-based behavior recognition. They categorized methods into model based, pattern based, and deep learning-based approaches
Liu et al. [29] | 2019 | Actions, gestures, and interactions | Intermediate | Daily activity recognition, Gesture recognition, User identification, Indoor localization & tracking | Focused on Wi-Fi signal processing-based activity recognition. Explained different setups of wireless sensing strategies such as RSSI-based, CSI-based, FMCW-based, and Doppler shift-based methods
Zhang et al. [33] | 2019 | Actions, Interactions, Group Activity | High | Human Activity Recognition System, Action Detection System | The survey discussed both action recognition and action detection, whereas action recognition is further extended toward action representation methods and interaction recognition methods
Jegham et al. [26] | 2020 | Multi activities (Actions, Interactions) | Intermediate | Human Activity Recognition System | Highlighted the constraints and challenges faced during the process of activity recognition. Action recognition approaches and few benchmarks have also been described
Dang et al. [30] | 2020 | Sensor-based data for Action Recognition, Multi Activities (Actions, Interaction) | Intermediate | Ambient Living Environment, Daily Monitoring System, Human Activity Recognition System | Based on sensor and vision-based HAR including benchmarks for both. Focused on feature engineering and preprocessing methods used for HAR
Beddiar et al. [1] | 2020 | Multi type of activities (Actions, Interaction, Group activity) | High | Human Activity Recognition | Provided general overview of HAR, including approaches, datasets, evaluation measures, and challenges of the domain
Das et al. [34] | 2021 | Actions, Interactions | Intermediate | Real-time human activity recognition, Daily activity monitoring | Focused on methods used for real-time human activity recognition. Presented challenges of real-time HAR
Chaurasia et al. [31] | 2022 | Multi type of activities (Actions, Interaction, Group activity) | Complex | Daily activities, Military activities, Abnormal activities, Ambulation, Transportation activities | Worked on activity recognition and classification (ARC) using smartphones and wearable sensors. Moreover, authors have concluded that ARC depends on the classification technique, number of sensors, device type, orientation, and placement. They have classified studies using ten parameters and highlighted domain challenges
Gupta et al. [35] | 2022 | Multi type of activities (Actions, Interaction, Group activity) | Complex | AI-based HAR applications, Hybrid AI models for HAR, Abnormal Human activities based | Authors have stated HAR design, dependability, and stability are major areas that need improvement to improve the HAR process


• This study also discusses HAR benchmark datasets, which have been used to perform experimentation and evaluation of methods. The discussion covers dataset characteristics, e.g., single-view, multi-view, RGB, and RGB-D information, as well as instance-based details. Every dataset serves a purpose, and a brief description of each can help researchers choose one accordingly.
• State-of-the-art methods are analyzed based on predefined parameters to highlight the strengths and limitations of the domain. This survey includes 46 state-of-the-art approaches presented since 2011, and we have divided the methods into three categories: online/offline, unimodal/multimodal, and handcrafted feature-based approach/learning-based approach. The selected studies are further classified based on the complexity of the activity (i.e., simple, intermediate, or complex), as well as the size of the dataset (i.e., small, medium, large). It also includes the recognition rate (Average Precision) of selected studies to highlight how studies perform compared to each other. Hence, comparative analysis of various studies provides recent trends among the HAR research community and highlights open challenges for future research.
• The selected methods are classified as online/offline, unimodal/multimodal, and handcrafted feature-based approach/learning-based approach. The selected studies are further categorized based on activity complexity (i.e., simple, intermediate, complex) and size of dataset (i.e., small, medium, large). It also includes the recognition rate (average precision) of selected studies to highlight their performance compared to each other. The reported recognition rate may contribute toward the significance of a selected study, but it is not a basis for comparison. We aim to provide the recent trend among the HAR research community so that open challenges can be highlighted for future research.
• This study discusses HAR issues that were brought to light through comparative analysis, including the environmental complexity of high intra-class variations and the inter-class similarity problem. Similarly, background, multi-view, and illumination variations are the primary issues that can affect the performance of the recognition system.

Section 2 provides a review of previous surveys and emphasizes the importance of this study. The characteristics of widely used video benchmarks for HAR are covered in Sect. 3. Section 4 then provides a taxonomy and detailed review of state-of-the-art HAR approaches to highlight research trends in HAR. Section 5 discusses the limitations of HAR and open research areas, and Sect. 6 concludes the study.

2 State-of-the-art HAR surveys

Human activity recognition is complex and involves a variety of tasks. For example, action representation-based approaches need feature extraction and descriptor-based methods. Human activity analysis is complex and is performed using both machine learning and deep learning approaches, whereas we have conducted a survey on different approaches of HAR and categorized HAR into input processing strategy-based, modality-based, and model-based approaches. In previous years, authors have contributed toward HAR and presented specific to general

Fig. 3 RGB & RGB-D image from northwestern-UCLA [41]


Table 2 Characteristics of HAR benchmarks

Activity dataset | No. of Videos (Resolution)/FPS | No. of Actions (Actors) | View | Depth (D) | Activity types | Application Areas
KTH [42] | 600 (160 × 120)/25 | 6 (25) | S | RGB | Actions | Human action recognition in outdoor conditions
Weizmann [43] | 90 (180 × 144)/50 | 10 (9) | S | RGB | Actions | Human action recognition
UCF Sports [45] | 150 (720 × 480)/10 | 10 | S | RGB | Actions, Interactions (human–object) | Sports actions recognition
Olympic Sports [48] | 783 | 16 | S | RGB | Actions, Interactions (human–object) | Sports actions recognition
Hollywood [49] | 233 (400 × 300, 300 × 200)/24 | 8 | S | RGB | Actions, Behavior, Interactions, Group Activity | Activity recognition, Behavior Understanding, Interaction Recognition, Event Detection
UCF50 [50] | 6681 (320 × 240)/25 | 50 | S | RGB | Actions, Interactions (human–object) | Human Sports activity recognition
UCF101 [45] | 13,320 (320 × 240)/25 | 101 | S | RGB | Actions, Behavior, Interactions, Group Activity | Human activity recognition
YouTube Sports 1M [51] | 1,133,158 | 487 | S | RGB | Actions, Interactions (human–object) | Human Sports activity recognition
IXMAS [47] | 1650 (390 × 291)/23 | 13 (11) | M | RGB | Actions | Multi-view-invariant action recognitions
ActivityNet [52] | 27,801 (1280 × 720)/30 | 203 | S | RGB | Actions, Behavior, Interactions, Group Activity | Human activity and behavior understanding
YouTube 8M [53] | ~800,000 | 4716 | S | RGB | Actions, Behavior, Interactions, Group Activity | Human activity and behavior understanding
HMDB51 [54] | 6766 (320 × 240)/30 | 51 | S | RGB | Actions, Behavior, Interactions, Group Activity | Human activity and behavior understanding
CASIA Action [55] | 1446 (320 × 240)/25 | 8 (24) | M | RGB | Actions, Behavior, Interaction | Human behavior and interaction-based systems
AVA [56] | 430 | 80 | M | RGB | Actions, Interactions | Poses, person to person interaction and person-object interaction Recognition
UCF Crime [57] | 1900 | 13 | S | RGB- | Actions, Behavior, Interactions, Group Activity | Security Surveillance
UTKinect [44] | 200 (320 × 240)/30 | 10 (10) | S | RGB-D | Actions | Human actions
MSR Action 3D [58] | 567 (640 × 480)/15 | 20 (7) | S | RGB-D | Actions | Sports Gesture recognition
MSR Action Pairs [59] | 180 (320 × 240)/30 | 10 (12) | S | RGB-D | Actions | Action pairs recognitions
SYSU-3D HOI [60] | 480 (640 × 480)/30 | 40 (12) | S | RGB-D | Actions, Interactions (human–object) | Daily activity Recognition
CAD-60 [61] | 60 (640 × 480)/25 | 12 (4) | S | RGB-D | Actions | Daily activity recognition
CAD-120 [62] | 120 (640 × 480)/25 | 10 (4) | S | RGB-D | Actions | Action labeling, human and object tracking
UTD-MHAD [63] | 861 (512 × 424)/30 | 27 (8) | S | RGB-D | Actions, Interactions | View-invariant human action recognition
RGB-D HuDaAct [64] | 1189 (640 × 480)/30 | 12 (30) | M | RGB-D | Actions, Interactions | Daily activity recognition
Berkeley MHAD [65] | 660 (640 × 480)/30 | 11 (12) | M | RGB-D | Behavior | Human behavior Recognition
Northwestern-UCLA [41] | 1475 (640 × 480)/30 | 10 (10) | M | RGB-D | Actions, interactions | Cross-view action recognition
UWA3D Multi-view [46] | 900 (640 × 480)/30 | 30 (10) | M | RGB-D | Actions | Similar and cross-view action recognition
LIRIS [66] | 9800 (640 × 480, 720 × 576)/25 | 828 (21) | M | RGB-D | Actions, Interactions | Human activity recognition
G3Di [67] | 574 (640 × 480)/30 | 12 (15) | S | RGB-D | Actions, Interactions | Gaming interaction activity
NTU RGB + D [68] | 56,880 (512 × 424, 1920 × 1080)/30 | 60 (40) | M | RGB-D | Actions, Behavior, Interaction | Daily Activity Recognition, Health surveillance systems
ShakeFive [69] | 100 | 2 (37) | S | RGB-D | Actions | Handshake Recognition

Fig. 4 HAR datasets categorization
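
The dataset categorization shown in Fig. 4 and tabulated in Table 2 can be sketched as a small record type with grouping and filtering helpers. This is a minimal sketch under stated assumptions: the `Dataset` class and the helper functions are hypothetical, the five rows are copied from Table 2 purely as sample data, and "S"/"M" are read as single view and multi-view.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    videos: int
    actions: int
    view: str      # "S" = single view, "M" = multi-view (assumed reading of Table 2)
    modality: str  # "RGB" or "RGB-D"


# A handful of rows taken from Table 2 as sample data.
DATASETS = [
    Dataset("KTH", 600, 6, "S", "RGB"),
    Dataset("UCF101", 13320, 101, "S", "RGB"),
    Dataset("IXMAS", 1650, 13, "M", "RGB"),
    Dataset("UTKinect", 200, 10, "S", "RGB-D"),
    Dataset("NTU RGB+D", 56880, 60, "M", "RGB-D"),
]


def group_by_view_and_modality(datasets):
    """Group dataset names by (view, modality), preserving input order."""
    groups = defaultdict(list)
    for d in datasets:
        groups[(d.view, d.modality)].append(d.name)
    return dict(groups)


def select(datasets, view=None, modality=None, min_videos=0):
    """Return names of datasets matching the optional criteria."""
    return [
        d.name
        for d in datasets
        if (view is None or d.view == view)
        and (modality is None or d.modality == modality)
        and d.videos >= min_videos
    ]


if __name__ == "__main__":
    print(group_by_view_and_modality(DATASETS))
    # Candidate benchmarks for a multi-view, depth-aware method.
    print(select(DATASETS, view="M", modality="RGB-D"))
```

Filtering on view, modality, and size mirrors how a researcher might shortlist a benchmark for a given method; the thresholds for small, medium, and large datasets are not defined in this part of the survey, so none are hard-coded here.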

survey-based studies, which are discussed in this section, and Table 1 summarizes these surveys.

Action representation-based survey

In 2013, Vishwakarma et al. [18] published a survey on surveillance-based activity recognition that primarily covers classical HAR approaches. They classified HAR approaches as hierarchical or non-hierarchical. The survey reviews motion detection and object tracking methods and discusses the characteristics of a few HAR datasets. Ke et al. [21] published a survey to provide a general framework of HAR, which includes object segmentation techniques, feature extraction techniques, activity detection techniques, and classification techniques. The authors of both [18] and [21] have thoroughly discussed the handcrafted approaches used in HAR. Cheng et al. [22] discussed an approach similar to that used in [18] and provided characteristics of action recognition benchmarks. Dawn and Shaikh [23] used spatiotemporal interest points (STIP) to emphasize the effectiveness of STIP detectors, as STIP detectors can improve HAR tasks because of their robustness.

Handcrafted vs. learned representation-based survey

Vrigkas et al. [24] presented their findings based on unimodal and multimodal approaches, which are further subdivided to discuss HAR. The survey's focus is skewed toward multimodal approaches because they provide a better feature set for learning. They have highlighted challenges faced by multimodal approaches, such as computational cost. The survey includes both traditional ML and advanced deep learning models, e.g., CNN. In addition, it provides a review of a few publicly available datasets that can be used for HAR. Zhen et al. [19] published a survey that discussed two major HAR approaches: learned representations and handcrafted representations. Each one is further subdivided to analyze both categories, and the survey highlighted the strength of deep learning-based approaches. The survey in [19] was the first survey to compare traditional approaches with modern deep learning-based approaches. Similarly, Sargano et al. [20] presented a survey on handcrafted vs. learning-based approaches in 2017. They discussed a few publicly available HAR datasets and popular HAR applications. In contrast to [20], Herath et al. [25] conducted a survey focusing on deep representation of action recognition. It thoroughly discussed popular handcrafted HAR features such as optical flow, motion history image, trajectories, and other motion descriptors. They also showed the architectural differences between popular networks like spatiotemporal networks, multiple stream networks, deep generative networks, and temporal coherency networks. Then, in 2020, Jegham et al. [26] attempted to provide a quantitative analysis of a few popular methods while also discussing their applicability in various scenarios. The primary goal of their work is to highlight HAR issues through comparative analysis.

Fig. 5 UCF-101: single-view dataset [45]

Fig. 6 IXMAS: a multi-view dataset [47]

Sensor-based survey

Authors in [27] surveyed channel state-based behavior recognition and thoroughly described the concept of channel state information. They provided details of methods used for channel state-based behavior recognition and categorized it as refined behavior recognition, coarse behavior recognition, and inference activity. They described channel state information-based behavior recognition with the help of three application areas, which are model based, pattern based, and deep learning-based approaches. The authors considered five major aspects for describing behavior recognition applications, which are experimental equipment, experimental environment, behavior type, classifier, and performance. The authors in [28] discussed sensor-based HAR systems and presented handcrafted feature-based approaches and deep learning-based approaches. Authors in [29] presented a survey on HAR through wireless signals (e.g., Wi-Fi), as motion of the human body affects wireless signal propagation. The authors described the basic strategy and structure of a wireless sensing environment for HAR. They presented a variety of HAR applications that can be recognized by using wireless sensing technology, such as fraud detection and daily activity monitoring. The authors categorized sensing strategies for HAR into received signal strength indicator-based (RSSI), channel state information-based (CSI), frequency shift for frequency-modulated continuous wave-based (FMCW), and Doppler shift-based methods. They also added recent HAR methods based on these sensing variations and highlighted limitations of Wi-Fi-based approaches. Dang et al. [30] discussed both vision- and sensor-based HAR systems along with corresponding HAR approaches and datasets. Chaurasia et al. [31] worked on activity recognition and classification (ARC) using smartphones and wearable sensors, covering the basics of ARC along with wearable and inertial sensors of smartphones. Moreover, the authors concluded that ARC depends on classification technique, number of sensors, device type, orientation, and placement.

Fig. 7 Input data variations within HAR approaches (baseline methods are shown in Fig. 8)

Other

This category includes surveys that cover multiple areas within HAR or are hard to classify among the above-mentioned categories. Authors in [32] discuss convolutional neural network-based HAR methods and their performance on large-scale datasets. Zhang et al. [33] investigated action classification and detection methods using RGB and depth-based datasets. The authors discussed both handcrafted and deep learning methods for action classification and detection and showed the importance of action detection strategies for improving HAR performance. Beddiar et al. [1] summarized HAR by discussing methods and benchmark datasets. They identified HAR's limitations and challenges, which can be explored to extend research. The authors focused on the HAR process and presented methods for action detection and classification. Authors in [34] presented a HAR survey to highlight the recent trends for real-time human activity recognition. They described various types of methods and evaluated their application to real-time scenarios. Their survey also highlighted the challenges of real-time online HAR, such as the processing time of a method. Gupta et al. [35] analyzed human activity recognition to highlight future directions and explained three major points that need improvement. The authors stated that HAR design, dependability, and stability are the major areas that need improvement to improve the HAR process.

All above-discussed surveys are summarized in Table 1, which presents highlights of each survey. Based on the foregoing, it is possible to conclude that most of the surveys lack a general perspective of the domain and cannot combine all elements of the HAR system within a single
Fig. 8 Proposed taxonomy of human activity recognition approaches


study. As a result, this survey attempts to combine all necessary elements of HAR to show its multidisciplinary nature. These elements include feature-based methods, classification-based methods, multi-modality-based methods, online learning-based methods, the datasets used for these methods, and state-of-the-art approaches of HAR. Moreover, it attempts to highlight limitations of HAR and provide open research directions.

3 Activity recognition datasets

So far, many benchmark datasets have been published, covering a wide range of activities. The choice of dataset influences the selection of a suitable approach for human activity recognition. Regarding datasets, HAR presents several challenges, such as inter-/intra-class variations and the environmental setup used while recording actions (indoor/outdoor, camera, view angles). Inter-/intra-class variation occurs because of the unique nature of each human. For instance, when walking, some people take small steps while others take gigantic steps, and some people avoid obstacles while others jump over them. Moreover, action classes may overlap, because complex actions comprise smaller actions; for example, the fighting class involves punching, kicking, and thrusting. HAR datasets involve a lot of variations, which are explained in a few previous surveys. In [38], the authors divided datasets into three categories: heterogeneous actions, specific actions, and others. Heterogeneous actions include different types of actions, for example, walking, jumping, running, etc. Specific actions include application-based datasets such as datasets of crowd behavior, abandoned objects, activities of daily living (ADL), fall detection, and pose & gesture, whereas the others category has datasets of motion capture (MOCAP), infrared, and thermal. The authors discuss HAR datasets and explain their characteristics, including publishing year, number of videos, actors involved, type of actions, application area, view information, and ground truth data of HAR datasets. They also present a variety of methods used for each dataset. In [38], datasets were classified based on actions, whereas in [39], the authors discuss RGB-D (Fig. 3) video datasets. They include characteristics of 27 single-view action datasets, 10 multi-view datasets, and 7 multi-person datasets. That survey contains information about publishing year, number of videos, actions, and actors, and dataset complexity issues. It provides details of dataset splits (i.e., test, train, validation) and discusses some HAR methods for each dataset. In another survey [40], the authors classify datasets into RGB and RGB-D to discuss challenges of HAR datasets. They highlight five distinct dataset challenges: illumination, view variation, occlusion, annotations, and fusion of modalities. They discuss HAR methods for each dataset and also discuss HAR studies to highlight dataset challenges. Considering the available reviews on HAR datasets, this study highlights the major findings of datasets and provides discussion to support HAR benchmark analysis. Therefore, we show the major classification of datasets in Table 2 using attributes such as image resolution, camera view, modality and type of activity, and the respective application areas.

This survey presents a collection of HAR benchmark datasets organized by data view or data acquisition mode, i.e., single view or multi-view. Figure 4 illustrates the HAR task variation across datasets, including type of activities and modality variations. A few HAR datasets are discussed below under single-view and multi-view datasets.
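As an illustration of organizing benchmark datasets by acquisition mode, the following is a minimal Python sketch, not part of the surveyed methods. The record type `HarDataset`, its field names, and the helper `group_by_view` are hypothetical names introduced here; the figures are the ones quoted in the dataset descriptions of this section (Sects. 3.1.1 to 3.1.6), and `None` marks values this section does not state.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HarDataset:
    """Minimal metadata record; field names are illustrative assumptions."""
    name: str
    view: str                      # "single" or "multi"
    num_actions: int
    num_actors: Optional[int]      # None where this section gives no figure
    num_sequences: Optional[int]   # None where this section gives no figure


# Figures as quoted in the dataset descriptions of this section.
CATALOGUE = [
    HarDataset("KTH", "single", 6, 25, 2391),
    HarDataset("Weizmann", "single", 10, 9, 93),
    HarDataset("UTKinect", "single", 10, 10, 200),
    HarDataset("UCF 101", "single", 101, None, 13320),
    HarDataset("UWA3D Multiview", "multi", 30, 10, None),
]


def group_by_view(datasets):
    """Group dataset names by acquisition mode (single view vs multi-view)."""
    groups = defaultdict(list)
    for ds in datasets:
        groups[ds.view].append(ds.name)
    return dict(groups)


if __name__ == "__main__":
    for view, names in group_by_view(CATALOGUE).items():
        print(f"{view}-view: {', '.join(names)}")
```

The same record could be extended with the other attributes mentioned above (publishing year, modality, application area) once those values are taken from Table 2.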

Fig. 9 Online vs offline HAR model [71]


Fig. 10 Online human activity recognition framework [71]

3.1 Single-view action dataset

The single-view dataset is captured using a single camera and a single view, so it does not involve view complexity in sequences, as shown in Fig. 4.

3.1.1 KTH

The KTH dataset [42] is based on six different actions (i.e., walking, running, jogging, boxing, waving, and clapping), and these actions are performed by 25 actors. While performing these actions, four different background variations are used: indoor, outdoor, scale variation within outdoor, and different clothes. The dataset includes 2391 videos captured through a static camera at a rate of 25 frames per second (fps) with a resolution of 160 × 120. The dataset is provided in training (performed by 8 persons), validation (performed by 8 persons), and test (performed by 9 persons) splits, but it does not include extracted silhouettes of the different actions or the background of the action.

3.1.2 Weizmann

The Weizmann dataset [43] includes 10 types of actions, including running, walking, skipping, forward jump, up-down jump, galloping, 2-hands waving, 1-hand waving, and leaning, performed by nine actors. The dataset includes a total of 93 sequences captured through a static camera with 180 × 144 resolution at 25 fps, with an additional 10 sequences of walking (captured from different viewpoints). The background of the captured data is subtracted, and the actions happening in the background are also included in the dataset.

3.1.3 UTKinect

The UTKinect dataset [44] is composed of 10 different actions, which include activities like walk, sit-down, stand-up, carry, throw, pull, push, wave and clapping. The actions were performed by 10 actors, and each action is performed twice through a variety of views, giving 200 sequences. Along with the action videos, labels and actions happening in the background are also included in the dataset.

3.1.4 UCF 101

The UCF 101 [45] dataset includes 13,320 RGB videos with 101 action categories. The videos of each category are divided into 25 groups, and each group has 4–7 videos. All actions belong to five major groups: human–human interaction, human–object interaction, body motion, sports, and playing music, as shown in Fig. 5. UCF 101 provides realistic action videos rather than staged videos, which improves the overall recognition task.

3.1.5 Multi-view datasets

The multi-view datasets are usually captured in one of two ways: multiple cameras at different angles or by using different viewpoints, as shown in Fig. 6.

3.1.6 UWA3D multiview

The UWA3D Multiview dataset [46] includes a variety of sequences that were captured in a row with no pause. All these actions were performed by 10 different actors. They performed different actions, which include punching and waving with one hand, sitting down, standing up, holding chest, walking, turning around, drinking, bending, running, holding head, holding back, kicking, jumping, mopping floor, sneezing, sitting down (chair), squatting, two hands waving, two hand punching, vibrating, falling, irregular walking, lying down, phone answering, jumping jack, picking up, putting down, dancing, and coughing. This dataset is available in two versions: a single view version with 30 activities performed twice/thrice by actors,


Table 3 Quantitative analysis of state-of-the-art approaches of HAR

| Reported paper | Year | Methodology | Online/Offline, Modality, HAR Approach, Activity Level, Training Data | Mean precision |
|---|---|---|---|---|
| Wang et al. [208] | 2011 | Dense trajectories are used to find actions from the data along with information of Histogram of oriented Gradient, optic flow, and motion boundary | Offline, Unimodal, Handcrafted features, Simple, Small | UCF Sports: 88.2%, Hollywood: 58.3% |
| Kliper-Gross et al. [209] | 2012 | The study is based on action recognition from unconstrained videos and representation-based architecture is used by extracting Motion interchange patterns from action data. Then General set of feature descriptors shows importance of feature set | Offline, Unimodal, Handcrafted Approach, Intermediate, Medium | HMDB-51: 29.2%, UCF-50: 68.5% |
| Oneata et al. [210] | 2013 | Performed action and event recognition through performing short action classification then locating these actions in lengthy movie videos along with recognition of complex events. It used Fisher vector instead of BoW and a set of handcrafted features is used to process the input data which include motion boundary histogram and SIFT. The data are normalized using L2-normalization method and classified through linear classifier. Also performed another approach by excluding human detectors | Offline, Unimodal, Handcrafted features, Complex, Medium | HMDB-51: 55.9%, UCF-50: 90.5%, Hollywood: 63%, Olympic Sports: 91.2% |
| Wang and Schmid [154] | 2013 | Based on extraction of optic flow information which encodes the motion pixel value, and it is combined with extracted trajectories of data for action recognition (Trajectories, HoF feature descriptors) | Offline, Unimodal, Handcrafted features, Simple, Medium | HMDB-51: 57.2%, UCF-50: 91.2%, Hollywood: 64.3%, Olympic Sports: 91.1% |
| Jain et al. [211] | 2013 | Worked on extracting motion-based information using representation-based method to detect the actions from data. Finite set of feature descriptors are incorporated which includes HoG, Traj, MBH, HoF, and DCS. The extracted features are further fed to VLAD encoding technique | Offline, Unimodal, Handcrafted features, Simple, Medium | HMDB-51: 52.1%, Hollywood: 62.5% |
| Peng et al. [212] | 2014 | Performed action recognition by using representative-based method along with stacked Fisher vector (SFV) and Fisher Vector to extract action representations. SFV provides refined representation and abstract semantic information in layered manner to provide mid-level as well as high-level activity recognition | Offline, Unimodal, Handcrafted features, Complex, Medium | HMDB-51: 66.8% |
| Simonyan and Zisserman [213] | 2014 | Extracted appearance-based information from still frames and motion information of frames. Performed action recognition from videos by using deep neural network along with transfer learning. Authors have used two stream convolutional neural network to perform action recognition moreover multi-task learning is used to improve the results by adding classes from both action datasets i.e., HMDB-51, UCF-101 | Offline, Unimodal, Deep learning, Intermediate, Medium | HMDB-51: 59.4%, UCF-101: 88.0% |
