2014 22nd International Conference on Pattern Recognition

A Hybrid Method for Human Interaction Recognition Using Spatio-Temporal Interest Points
Nijun Li, Xu Cheng, Haiyan Guo, Zhenyang Wu
School of Information Science and Engineering
Southeast University, Nanjing, China
{lnjleo, xcheng, [Link], zhenyang}@[Link]

Abstract: This paper proposes an innovative and effective hybrid way to recognize human interactions, which incorporates the advantages of both a global feature (Motion Context, MC) and the Spatio-Temporal (S-T) correlation of local Spatio-Temporal Interest Points (STIPs). The MC feature, which also derives from STIPs, is used to train a random forest in which a Genetic Algorithm (GA) is applied to the training phase to achieve a good compromise between reliability and efficiency. In addition, we design an effective and efficient S-T correlation based match to assist the MC feature, where MC's structure and a biological sequence matching algorithm are employed to calculate the spatial and temporal correlation scores, respectively. Experiments on the UT-Interaction dataset show that our GA search based random forest and S-T correlation based match achieve better performance than some other prevalent machine learning methods, and that a combination of those two methods outperforms most of the state-of-the-art works.

Keywords: spatio-temporal interest points (STIPs); motion context (MC); random forest; genetic algorithm (GA); spatio-temporal (S-T) correlation

I. INTRODUCTION

Human action recognition, which has attracted increasing research interest in the past decades, is now of central importance in many applications related to computer vision such as video surveillance, video retrieval, human-computer interaction, robot vision, etc. Early studies in this area usually experiment on simple datasets which only contain single-person activities (e.g. Weizmann and KTH), and the recognition rates on those benchmark datasets are now close to 100% [1]. However, the recognition rates on human interactions are relatively low due to their richer inner semantics and contextual information [2].

3D reconstruction and pose estimation based methods were often used in the early years of this century, but now the prevalent approach is extracting 2D features directly from video sequences, among which the Spatio-Temporal Interest Points (STIPs) [3, 4] have been prevalent in the past decade due to their simplicity, effectiveness and robustness to cluttered backgrounds [5]. To exploit STIPs wisely, the next question to be considered is whether to use descriptors to describe them. Most researchers give an affirmative answer to this question: they describe STIPs by various histograms (e.g. HOF, HOG, HOG3D [6], 3D-SIFT [7], etc.), then cluster the STIP descriptors to form an unstructured (Bag-of-Words, BoW [7]) or structured (vocabulary tree [9]) codebook, and finally fit them into a supervised (e.g. SVM [8] or neural network [10]) or unsupervised (e.g. probabilistic Latent Semantic Analysis, pLSA [11]) learning framework.

Nevertheless, it is also possible to focus only on the Spatio-Temporal (S-T) relationships of STIPs. Bregonzio et al. [12] extract multiple features from "clouds" of STIPs and successfully use a Nearest Neighbor (NN) classifier and SVM to recognize human actions. Another inspiring example is "Motion Context (MC)" [11], which is derived from "Shape Context" [13] for object recognition and captures the distribution of STIPs.

Although the approach combining STIPs with BoW and SVM is well known for its good performance, it has some obvious shortcomings: (1) BoW uses unstructured local features whose informative S-T relationships are totally ignored; (2) SVM is not necessarily the best choice of discriminative learning machine due to its binary classification nature and the difficulty in determining the kernel function parameters. To overcome those shortcomings, Matikainen et al. [14] describe human actions by pairwise S-T relationships whereas Zhang et al. [15] put forward a "Bag of S-T Phrases (BoP)" model, both taking advantage of the S-T constraints of STIPs and achieving promising results. As an alternative to SVM, the decision tree has received more and more attention [1, 9] for its merits of multiclass classification and its ability to create a structured codebook. To better deal with noise and enjoy the benefits of boosting, some works train a series of decision trees (also called "random forest") [16, 17] instead of a single tree.

This paper aims at exploring and presenting effective and efficient methods for human interaction recognition, and the contributions of our work are as follows.

(1) An innovative hybrid framework which incorporates both global features and S-T correlation of local features is proposed to recognize human interactions and achieves promising results.

(2) Genetic Algorithm (GA) search is integrated into the training of random forest for the first time, which proves to be a good compromise between reliability and efficiency.

(3) An efficient scheme to calculate the S-T correlation score between two videos is presented, and a match based on this score outperforms both BoW and pLSA (using the same codebook).

1051-4651/14 $31.00 © 2014 IEEE

DOI 10.1109/ICPR.2014.434
A. Overview of Our Approach

Fig. 1. Overview of our human interaction recognition system. Note that the flowchart within the red box does not require STIP descriptions whereas the part within the blue box does.

Fig. 1 gives an overview of our human interaction recognition system. We first extract STIPs from each frame of the input videos, and the STIPs in each frame then form an MC vector. The MC feature can be seen as global, for it reflects the distribution of STIPs and contains shape information as well. We then use the MC vectors, labeled by action class, to train a random forest composed of a series of GA search based binary decision trees, as shown in the red box. To assist the MC based recognition, an S-T correlation based match (within the blue box) is also adopted. Specifically, we describe each STIP by a vector which contains three parts: PCA of the original image patch, HOG, and the distribution of nearby STIPs. Afterwards, we use the k-means algorithm to cluster the STIPs, so that each video can be described by a series of STIP occurrence sequences which serve as a template of the video. The spatial correlation score between two videos is calculated within the MC framework in a way similar to the "histogram intersection kernel" [25], whereas the temporal correlation score is calculated by a biological sequence matching algorithm [18].

The rest of the paper is organized as follows. STIP based mid-level feature extraction is introduced in Section II. Sections III and IV elaborate on the GA search based random forest and the S-T correlation calculation, respectively. Experimental results are shown and discussed in Section V, followed by our conclusion in Section VI.

II. STIP BASED MID-LEVEL FEATURE EXTRACTION

A. Voxel Variance based STIP Extraction

Numerous studies [2, 12] have confirmed the superiority of Dollar's STIPs [4] over Laptev's counterparts [3]. Here we use an even more straightforward method presented in [11] to extract STIPs. A sliding window is used to calculate the motion saliency maps from groups of frames within the window. Each pixel value on the motion saliency map is simply the variance of the voxel values at the same location across the group of frames within the window. Empirically, the window size ranges from 5 to 10, and we choose 7 in our experiments. The STIPs are extracted by finding the local maxima of the saliency maps. We also compare such STIPs with Dollar's counterparts, finding that the two can have similar density (i.e. STIPs per frame, see Fig. 2) with proper thresholds, whereas the former are much faster to extract since they bypass the time-consuming S-T filtering.

Fig. 2. Examples of STIP extraction. (a) Snapshots of "shaking hands" and "kicking"; (b) Motion saliency maps and STIPs based on voxel variance; (c) Motion saliency maps and STIPs based on Dollar's method.

B. Motion Context (MC)

The MC feature, whose idea comes from "Shape Context" [13], uses a log-polar diagram (centered at a reference point) to measure the distribution of other STIPs and captures the global information of motion and shape. As depicted in Fig. 3(a, b), we use a log-polar diagram containing 24 sub-regions to generate a 24-D histogram called the MC descriptor. Although an MC descriptor can be constructed from every frame, in practice we discard frames with fewer than 30 STIPs to avoid overly sparse histograms.

Fig. 3. Illustration of MC. (a) Example of a log-polar diagram; (b) Sub-region numbers in the log-polar diagram, which are consistent with the histogram's bin numbers; (c) Alignment of MC, where the gray region denotes the main orientation.

The diagram's center $(c_x, c_y)$ and diameter $D$ are determined by

$$
\begin{cases}
(c_x, c_y) = \left( \dfrac{x_{\min} + x_{\max}}{2},\ \dfrac{y_{\min} + y_{\max}}{2} \right), \\[4pt]
D = \eta \times \max\left( x_{\max} - x_{\min},\ y_{\max} - y_{\min} \right),
\end{cases}
\tag{1}
$$

where $x_{\min}$, $x_{\max}$, $y_{\min}$ and $y_{\max}$ denote the extremes of all the STIPs' coordinates in the current frame, and the coefficient $\eta$ ($\eta = 1.2$) is used to make $D$ larger so that the diagram covers most STIPs. Similar to [11], we align the MCs so that their main orientations (the fan sectors with most STIPs) are always located at the right side (Fig. 3(c)).

C. STIP Occurrence Sequences

STIP occurrence sequences serve as a template for calculating the temporal correlation score between videos. The first step is to describe the STIPs and cluster them. We create a compact yet discriminative STIP descriptor which captures both shape and motion information of the neighborhood.
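The describe-then-cluster step named above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: each STIP is assumed to already carry a descriptor vector and a frame index, the clustering is plain Lloyd's k-means with a farthest-point initialization (the initialization is an assumption), and the helper names `kmeans` and `occurrence_sequences` are hypothetical. The sequences hold raw per-frame counts; any mapping from counts to symbols is left out.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; farthest-point initialization is an assumption."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centers)
        )
        centers.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center gives the visual-word label.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def occurrence_sequences(frame_indices, labels, k, num_frames):
    """sequences[w][t] = number of STIPs with visual word w in frame t."""
    sequences = [[0] * num_frames for _ in range(k)]
    for t, w in zip(frame_indices, labels):
        sequences[w][t] += 1
    return sequences


# Toy data: four 2-D "descriptors" observed in frames 0, 1, 1 and 2.
descriptors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
frames = [0, 1, 1, 2]
labels, _ = kmeans(descriptors, k=2)
print(labels)                                         # [0, 0, 1, 1]
print(occurrence_sequences(frames, labels, 2, 3))     # [[1, 1, 0], [0, 1, 1]]
```

Each of the k rows is one per-word sequence of length equal to the number of frames, which is the shape a sequence-matching step would consume.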

The descriptor is built without calculating the optical flow. Our 28-D descriptor (Fig. 4) consists of three normalized parts: a 15-D PCA of the original image patch (sized 11 × 11 and centered at the STIP), a 9-D HOG of the same image patch (the angular bin is 20°), and a 4-D histogram describing the distribution of nearby STIPs (the radius is 25 voxels and the angular bin is 90°); $\alpha$, $\beta$ and $\gamma$ are three weight coefficients normalized by $\alpha + \beta + \gamma = 1$. Note that although we do not employ optical flow, the distribution of nearby STIPs reflects motion information.

Fig. 4. Sketch map of our STIP descriptor. It is composed of three parts, and each part is normalized and multiplied by a weight coefficient (in our experiments, a bit more emphasis is put on the steady HOG feature such that $\alpha = \gamma = 0.3$, $\beta = 0.4$).

After describing each STIP, we use k-means to cluster them into k visual words. The number of STIPs in each frame is about 30 to 120, which means we need a small k to make the spatial correlation calculation (Subsection IV-A) feasible. We will also see in Subsection IV-B that a small k is good for the temporal correlation calculation. In experiments, we find that k = 10 already achieves high performance while guaranteeing high computational efficiency. Given a video volume and all its visual words (Fig. 5(a)), we can describe it by k sequences, each corresponding to a certain visual word (Fig. 5(b)). Each sequence is a series of symbols that reflect the occurrence of a certain visual word in each frame. Using symbols rather than actual STIP numbers brings two benefits: (1) the temporal matching between videos of the same category is more robust; (2) it is more convenient to apply the biological sequence matching algorithm introduced in Subsection IV-B.

Fig. 5. Illustration of STIP occurrence sequences. (a) STIPs scattered in a video volume of length T. Different colors denote different visual words. (b) Occurrence sequences of visual words. The length of each sequence is also T, and each symbol reflects the number of a certain visual word in each frame.

III. GA SEARCH BASED RANDOM FOREST

A. Decision Tree Learning by C4.5 Algorithm

The decision tree is a convenient and robust way to fit discrete functions. In the 1980s, Quinlan [19] proposed a practical iterative algorithm called ID3 to train the decision tree, and ID3 was later developed into C4.5 [20] to deal with continuous attributes. Both ID3 and C4.5 use a greedy top-down method to construct the tree without backtracking. In the training (tree growing) phase, a criterion called "information gain" is usually used to decide whether to split a node and, if so, which test is the best to split it.

This paper adopts C4.5 (for limited space, readers can refer to [21] for details) to train decision trees. Specifically, each candidate test is denoted by a tuple (a, c), i.e. the attribute and the threshold. Hence, a continuous attribute can be classified into {0, 1} through a comparison with the threshold c, which results in a binary tree. Hoping to reach a good balance between reliability and efficiency, we integrate GA search to find the optimal (or suboptimal) binary test for every split.

It is pointed out by [21] that the bias of the decision tree is the "preference for short trees", which is consistent with "Occam's razor". Accordingly, post-pruning is often applied to a fully grown tree to mitigate "over-fitting". In order to avoid using validation datasets for post-pruning while still maintaining high performance, we employ a series of decision trees called a "binary random forest", which can benefit from boosting. Note that here "binary" not only refers to the tree structure, but also means that each tree focuses on just one positive sample category.

B. GA Search

We use GA search to select good binary tests during the decision tree learning, which proves to be an effective compromise between reliability and efficiency (see Subsection V-A). To apply GA, we first need to code the candidate binary tests (elements in the solution space). As mentioned before, a candidate test is denoted by a tuple (a, c), which is exactly our coded individual containing two genes: a is an integer which denotes the dimension index (attribute) to be tested, and c is a real number which denotes the threshold of the binary test.

In each iteration of GA, the initial group of size N is generated at random. As illustrated by Fig. 6(a), parents generate offspring in three different ways: the crossover recombination simply exchanges the "allele" of the two parents; the median recombination calculates the mean of the two parents; the mutation is accomplished by adding a Gaussian variable of variance 3 to the second gene c.

Fig. 6. Illustration of GA search. (a) Three ways to obtain the next generation: crossover recombination, median recombination and mutation; (b) One iteration of GA search.

As displayed in Fig. 6(b), the offspring groups generated by crossover recombination, median recombination and mutation are of size N/2, N/2 and N, respectively. "Information gain" is used as the fitness function. We then pick out the N best individuals from the offspring together with their parents (3N individuals in total).
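The GA search over binary tests (a, c) described in this subsection can be sketched as below. This is a hedged illustration rather than the authors' code. Several details are assumptions made here for the sake of a runnable example: a sample goes to the left branch when `x[a] <= c`, parents are paired at random with one child per pair, the attribute index of a median-recombined child is the rounded mean, and initial thresholds are drawn uniformly from the observed attribute range. The defaults `group_size=100` and `generations=4` mirror the settings reported for the experiments, and all function names are hypothetical.

```python
import math
import random


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log2(p)
    return result


def information_gain(samples, labels, test):
    """Fitness of a binary test (a, c): entropy reduction of the split."""
    a, c = test
    left = [y for x, y in zip(samples, labels) if x[a] <= c]
    right = [y for x, y in zip(samples, labels) if x[a] > c]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def ga_search(samples, labels, group_size=100, generations=4, rng=None):
    """Return the best test (a, c) found and its information gain."""
    rng = rng or random.Random(0)
    dims = len(samples[0])
    lows = [min(x[d] for x in samples) for d in range(dims)]
    highs = [max(x[d] for x in samples) for d in range(dims)]

    def random_individual():
        a = rng.randrange(dims)
        return (a, rng.uniform(lows[a], highs[a]))

    def fitness(individual):
        return information_gain(samples, labels, individual)

    parents = [random_individual() for _ in range(group_size)]
    for _ in range(generations):
        offspring = []
        for _ in range(group_size // 2):      # crossover: exchange genes
            p, q = rng.sample(parents, 2)
            offspring.append((p[0], q[1]))
        for _ in range(group_size // 2):      # median: mean of two parents
            p, q = rng.sample(parents, 2)
            offspring.append((round((p[0] + q[0]) / 2), (p[1] + q[1]) / 2))
        for p in parents:                     # mutation: variance-3 noise on c
            offspring.append((p[0], p[1] + rng.gauss(0.0, math.sqrt(3.0))))
        pool = parents + offspring            # 3N individuals in total
        parents = sorted(pool, key=fitness, reverse=True)[:group_size]
    return parents[0], fitness(parents[0])


# Toy data: attribute 0 is constant, attribute 1 separates the two classes.
toy_samples = [[0.0, 0.0], [0.0, 0.5], [0.0, 1.0],
               [0.0, 9.0], [0.0, 9.5], [0.0, 10.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
best_test, best_gain = ga_search(toy_samples, toy_labels)
print(best_test, best_gain)
```

Because the parents compete with their offspring for the N survivor slots, the best fitness never decreases from one generation to the next, which is why a handful of generations can already be enough for a one-dimensional threshold search.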

These N selected individuals become the parents for the next iteration. In our experiments, N = 100 and the GA iteration number is set to 4 for each binary test search, where the parameters are tuned by taking both effectiveness and efficiency into account.

IV. S-T CORRELATION CALCULATION

A. Spatial Correlation Calculation

The spatial correlation between two videos is calculated within the MC framework in a way similar to the "histogram intersection kernel" [25]. After k-means clustering, the STIPs in each MC's log-polar diagram have visual word labels. Suppose Video i and Video j have $T_1$ and $T_2$ frames respectively; we need to compare all the pairs of frames from the two videos to calculate the spatial correlation, as defined by

$$
\mathrm{SC\_Score}_{i,j} = \left[ \sum_{t_1=1}^{T_1} \sum_{t_2=1}^{T_2} \sum_{r=1}^{24} \mathrm{SC\_Score}(f_{i,t_1}, f_{j,t_2}, r) \right] / (N_1 N_2),
\tag{2}
$$

where $N_1$ and $N_2$ are the total numbers of STIPs in Video i and Video j; $\mathrm{SC\_Score}(\cdot)$ is the spatial correlation score of two frames in a certain sub-region, which means the number of shared visual words in that area (see Fig. 7).

Fig. 7. Illustration of spatial correlation score calculation. Different colored dots stand for different visual words. Those two frames share 2 red dots in Sub-region 8, so their spatial correlation score equals 2 in that area.

B. Temporal Correlation Calculation

Having described each video by 10 STIP occurrence sequences, we can now calculate the temporal correlation of the two videos using

$$
\mathrm{TC\_Score}_{i,j} = \sum_{k=1}^{10} \mathrm{TC\_Score}(seq_{i,k}, seq_{j,k}),
\tag{3}
$$

where $\mathrm{TC\_Score}(\cdot)$ calculates the temporal correlation score of two symbol sequences corresponding to a certain visual word using a biological sequence matching algorithm called the Needleman-Wunsch (N-W) algorithm [18]. The N-W algorithm is composed of two main steps: (1) calculating a similarity matrix of the two given symbol sequences; (2) finding the best global alignment route using dynamic programming based backtracking. Here we use a toy example of matching two DNA fragments to clarify the former process. Consider two DNA sequences "ACGCTG" and "CATGT" denoted by S and T (Fig. 8(a)): how similar are they, and how are they related to each other?

To apply the N-W method to answer those questions, a score function which rewards exact matches and penalizes mismatches should be predefined. In this case, the score function of two symbols is defined by

$$
\begin{cases}
\sigma(x, x) = 2, \\
\sigma(x, y) = \sigma(x, -) = \sigma(-, y) = -1,
\end{cases}
\tag{4}
$$

where x and y are two different symbols, and "-" means a missing symbol, because "insertion" and "deletion" are quite common for DNA sequences. The similarity matrix $V(i, j)$ is of size $(|S| + 1) \times (|T| + 1)$ ($|\cdot|$ means the length of the sequence) and its initial values are set as

$$
\begin{cases}
V(0, 0) = 0, \\
V(i, 0) = V(i - 1, 0) + \sigma(S[i], -), \\
V(0, j) = V(0, j - 1) + \sigma(-, T[j]).
\end{cases}
\tag{5}
$$

Then the remaining elements in the matrix can be obtained using the recursion formula

$$
V(i, j) = \max
\begin{cases}
V(i - 1, j - 1) + \sigma(S[i], T[j]), \\
V(i - 1, j) + \sigma(S[i], -), \\
V(i, j - 1) + \sigma(-, T[j]).
\end{cases}
\tag{6}
$$

The similarity matrix is finally completed as in Fig. 8(b), where the bottom-right element $V(|S| + 1, |T| + 1) = 2$ is the similarity score of the two sequences. Hence, the first question is answered.

Fig. 8. (a) Two DNA sequences to be matched; (b) The similarity matrix calculated from the two sequences; (c) The score function used to calculate temporal correlation.

To find out how the two sequences are related, a dynamic programming based backtracking procedure should be implemented. In this paper, however, we can skip that procedure, for our focus is just on the temporal correlation score. We use only the first step of the N-W algorithm to calculate the temporal correlation score of two STIP occurrence sequences:

$$
\mathrm{TC\_Score}(seq_{i,k}, seq_{j,k}) = V(|seq_{i,k}| + 1, |seq_{j,k}| + 1).
\tag{7}
$$

In particular, our score function $\sigma(x, y)$ for calculating the similarity matrix is defined by Fig. 8(c).

V. EXPERIMENTAL RESULTS

We validate our algorithm on the UT-Interaction dataset [23] using a 3.5 GHz CPU under the Matlab 2012a platform. The UT-Interaction dataset was provided in the contest on Semantic Description of Human Activities (SDHA 2010) [2], during which four teams declared to take the challenge but only one team came up with a solution of acceptable performance.

The dataset is composed of two parts: Set 1 is taken on a parking lot with slight variation of zoom rates, whereas Set 2 is taken on a lawn with slightly moving backgrounds.

Each set consists of 10 sequences (720 × 480, 30 fps) including 6 kinds of human interactions (see Fig. 9). Experiments in Subsections V-A and V-B are carried out only on Set 1, aiming to compare the performance of our approach with other prevalent machine learning methods. Finally, experimental results on both sets using combined methods are reported in Subsection V-C. 10-fold cross validation (54 clips for training and 6 for testing) is used throughout all the experiments.

Fig. 9. Snapshots from the UT-Interaction dataset, where the former three are from Set 1 and the latter three are from Set 2. From left to right, the interaction categories are "shaking hands", "hugging", "kicking", "pointing", "punching" and "pushing".

A. Recognition by MC and Random Forest

Using the MC feature, we compare the performance of a single decision tree and a random forest with various search strategies in this subsection. When training a single tree, the standard post-pruning procedure is applied to avoid over-fitting. As to the random forest, we train 6 trees for each interaction category, so the forest contains 36 binary trees. "Full search", "random search" and "GA search" are adopted in the training of the forest, respectively.

TABLE I. RECOGNITION RESULTS OF SINGLE DECISION TREE AND RANDOM FOREST WITH VARIOUS SEARCH STRATEGIES

The results are listed in Table I. As expected, the single tree method is the fastest but has the lowest recognition accuracy. Because each binary tree in the random forest is more specialized than a single multiclass tree, and the random forest enjoys the benefit of boosting, even the random search based forest outperforms the single tree with post-pruning. What is more, the training time of the single tree is not much shorter than that of the random search based forest, for the post-pruning is really time-consuming. On the other hand, we can see that GA search performs close to full search but is much more time-saving, and that GA search does not take much longer than random search but achieves much higher accuracy. GA search therefore strikes a good balance between reliability and efficiency.

Having confirmed the good performance of the GA search based random forest, we compare it with some other prevalent methods: unstructured labeled MC descriptors directly applied to SVM, Bag-of-Words (BoW) with NN classification, and a Naïve Bayesian classifier with respect to labeled MC descriptors.

For the SVM method, (6 × 5) / 2 = 15 one-vs-one linear SVMs are trained and results are obtained by voting. As to BoW, we create a small codebook of MC words to avoid overly sparse histograms, and finally we choose k = 15 for best performance. Decisions are made by NN classification. When using the Naïve Bayesian classifier, we assume independence among the MC feature's dimensions and use the "Parzen Window" method [22] to estimate the conditional probability of each dimension.

Fig. 10 gives the comparison results in terms of average recognition accuracy and training time. We can see that although BoW and the Naïve Bayesian classifier take much less time to train, their performance is relatively low. The GA search based random forest achieves the highest recognition accuracy, though it takes more training time than SVM. Note, however, that the training time, which depends on the initial group size, the number of evolution generations, the complexity of the evolution strategies and the number of trees in the forest, is quite flexible.

Fig. 10. Comparison of GA search based random forest and other prevalent machine learning methods. (a) Average recognition accuracy; (b) Average training time.

B. Recognition by S-T Correlation

We add the spatial and temporal correlation scores together to obtain the final S-T correlation score. Note that the two scores have different ranges, so we normalize them before adding them together. Instead of NN classification, we add together the S-T correlation scores with the video clips in the same category and choose the category with the maximum total score. Details of the S-T correlation based recognition results are shown in Fig. 11. We can see from Fig. 11 that every interaction category has the largest S-T correlation score with respect to the same category (peaks of the curves), which is fairly natural. Specifically, an interaction whose curve has an obvious peak is easy to distinguish from other interactions, which is also reflected in the first line of Table II.

Fig. 11. Normalized S-T correlation score for each interaction category with respect to the same category and other categories.

We compare S-T correlation based recognition with two other common machine learning methods that also use labeled STIPs. One is the BoW method, which is much the same as described in the former subsection.

In this BoW variant, the "word" is a STIP rather than an MC. The other is pLSA, which is a generative method to estimate the joint probability of observed data and latent states. The EM algorithm is adopted to estimate the conditional probabilities, whose details can be found in [11]. Table II compares the three methods, from which we can see the advantages of the S-T correlation calculation in both recognition accuracy and training time. BoW and pLSA might achieve higher performance using a larger codebook, but they would require even more training time.

TABLE II. RECOGNITION RESULTS OF DIFFERENT MACHINE LEARNING METHODS USING LABELED STIPS

C. Combined Methods on the Whole Dataset

Finally, we experiment on both sets of the dataset by combining the methods in the above two subsections. In particular, we test two combinations of the best and second best methodologies (i.e. GA search based random forest + S-T correlation, SVM + pLSA). The combination rule is simple: since each method can provide scores for each interaction category, we agree with the decision made by the methodology with higher confidence.

TABLE III. AVERAGE RECOGNITION ACCURACY COMPARISON ON BOTH SETS OF UT-INTERACTION DATASET

Table III reports the recognition rate of our combined methods as well as some state-of-the-art works. It is exciting to see that our approach that combines the GA based random forest and S-T correlation achieves the second best and the best performance on Set 1 and Set 2, respectively. The reason for such robust recognition results is that such a combination takes advantage of both global motion and shape information as well as S-T relationships of local features. Note that although [15] has the highest accuracy on Set 1, it has not been tested on Set 2.

VI. CONCLUSION

An innovative and effective hybrid method which exploits both global motion and shape information and the S-T correlation of local STIPs to recognize human interactions is proposed in this paper. Experiments on the UT-Interaction dataset show the merits of our approaches compared to some other prevalent machine learning methods, and that the combination of the GA search based random forest and S-T correlation achieves state-of-the-art performance. However, our stationary training dataset and offline training procedure cannot deal with online datasets which have dynamic recognition demands, so we plan to explore incremental learning in the future.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 60971098 and 61302152.

REFERENCES

[1] Z. Lin, Z. Jiang and L.S. Davis, "Recognizing actions by shape-motion prototype trees," Proc. ICCV, pp. 444-451, 2009.
[2] M.S. Ryoo, C.C. Chen and J.K. Aggarwal, "An overview of contest on semantic description of human activities (SDHA) 2010," Recognizing Patterns in Signals, Speech, Images and Videos, pp. 270-285, 2010.
[3] I. Laptev, "On space-time interest points," IJCV, vol. 64(2/3), pp. 107-123, 2005.
[4] P. Dollar, V. Rabaud, G. Cottrell and S. Belongie, "Behavior recognition via sparse spatio-temporal features," IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pp. 65-72, 2005.
[5] A.H. Shabani, D.A. Clausi and J.S. Zelek, "Evaluation of local spatio-temporal salient feature detectors for human action recognition," Proc. Computer and Robot Vision (CRV), pp. 468-475, 2012.
[6] A. Klaser, M. Marszałek and C. Schmid, "A spatio-temporal descriptor based on 3D-gradients," Proc. BMVC, pp. 995-1004, 2008.
[7] P. Scovanner, S. Ali and M. Shah, "A 3-dimensional SIFT descriptor and its application to action recognition," Proc. ACM International Conference on Multimedia, pp. 357-360, 2007.
[8] C. Schuldt, I. Laptev and B. Caputo, "Recognizing human actions: a local SVM approach," Proc. ICPR, vol. 3, pp. 32-36, 2004.
[9] K. Mikolajczyk and H. Uemura, "Action recognition with motion-appearance vocabulary forest," Proc. CVPR, pp. 1-8, 2008.
[10] N. Li, X. Chen, S. Zhang and Z. Wu, "Recognizing Human actions by BP-AdaBoost algorithm under a hierarchical recognition framework," Proc. ICASSP, pp. 3407-3411, 2013.
[11] Z. Zhang, Y. Hu, S. Chan and L.T. Chia, "Motion context: a new representation for human action recognition," Proc. ECCV, pp. 817-829, 2008.
[12] M. Bregonzio, S. Gong and T. Xiang, "Recognising action as clouds of space-time interest points," Proc. CVPR, pp. 1948-1955, 2009.
[13] S. Belongie, J. Malik and J. Puzicha, "Shape context: a new descriptor for shape matching and object recognition," Advances in Neural Information Processing Systems, pp. 831-837, 2001.
[14] P. Matikainen, M. Hebert and R. Sukthankar, "Representing pairwise spatial and temporal relations for action recognition," Proc. ECCV, pp. 508-521, 2010.
[15] Y. Zhang, X. Liu, M.C. Chang et al., "Spatio-temporal phrases for activity recognition," Proc. ECCV, pp. 707-721, 2012.
[16] A. Yao, J. Gall and L. Van Gool, "A hough transform-based voting framework for action recognition," Proc. CVPR, pp. 2061-2068, 2010.
[17] D. Waltisberg, A. Yao, J. Gall and L. Van Gool, "Variations of a hough-voting action recognition system," Recognizing Patterns in Signals, Speech, Images and Videos, pp. 306-312, 2010.
[18] S.B. Needleman and C.D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48(3), pp. 443-453, 1970.
[19] J.R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1(1), pp. 81-106, 1986.
[20] J.R. Quinlan, "C4.5: programs for machine learning," vol. 1, Morgan Kaufmann, 1993.
[21] T.M. Mitchell, "Machine learning," New York: McGraw-Hill, 1997.
[22] S. Theodoridis and K. Koutroumbas, "Pattern Recognition," Singapore: Elsevier Pte Ltd., 2010.
[23] M.S. Ryoo and J.K. Aggarwal, "Spatio-temporal relationship match: video structure comparison for recognition of complex human activities," Proc. ICCV, pp. 1593-1600, 2009.
[24] X. Zhang, J. Cui, L. Tian and H. Zha, "Local spatio-temporal feature based voting framework for complex human activity detection and localization," Proc. ACPR, pp. 12-16, 2011.
[25] S. Maji, A.C. Berg and J. Malik, "Classification using intersection kernel support vector machines is efficient," Proc. CVPR, pp. 1-8, 2008.
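
As a supplementary illustration of Subsection IV-B, the similarity-matrix step of the N-W algorithm (Eqs. (4) to (6)) and the summed temporal score of Eq. (3) can be sketched as follows. This is a minimal sketch, not the authors' Matlab code. The function names and the use of `None` as the gap marker are assumptions made here, and the matrix is indexed from 0, so the bottom-right entry is `v[len(s)][len(t)]`. The score values +2 and -1 are those of the DNA toy example; the score function actually used for STIP occurrence sequences is given only in Fig. 8(c) and is not reproduced.

```python
GAP = None  # stands for the missing symbol "-" in Eq. (4)


def sigma(x, y):
    """Toy score function of Eq. (4): +2 for a match, -1 otherwise."""
    if x is GAP or y is GAP:
        return -1
    return 2 if x == y else -1


def similarity_matrix(s, t):
    """Fill the (|S|+1) x (|T|+1) matrix V following Eqs. (5) and (6)."""
    rows, cols = len(s) + 1, len(t) + 1
    v = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: only gaps in T
        v[i][0] = v[i - 1][0] + sigma(s[i - 1], GAP)
    for j in range(1, cols):          # first row: only gaps in S
        v[0][j] = v[0][j - 1] + sigma(GAP, t[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            v[i][j] = max(
                v[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # align symbols
                v[i - 1][j] + sigma(s[i - 1], GAP),           # gap in T
                v[i][j - 1] + sigma(GAP, t[j - 1]),           # gap in S
            )
    return v


def nw_score(s, t):
    """Similarity score of two sequences: bottom-right entry of V."""
    return similarity_matrix(s, t)[-1][-1]


def tc_score(sequences_i, sequences_j):
    """Eq. (3): sum of per-visual-word sequence scores of two videos."""
    return sum(nw_score(a, b) for a, b in zip(sequences_i, sequences_j))


print(nw_score("ACGCTG", "CATGT"))  # 2, the value quoted for Fig. 8(b)
```

Only the matrix fill is needed for the score, so the backtracking step is omitted here, as it is in the text.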

2518

