2014 22nd International Conference on Pattern Recognition

A Hybrid Method for Human Interaction Recognition using Spatio-Temporal Interest Points
Nijun Li, Xu Cheng, Haiyan Guo, Zhenyang Wu
School of Information Science and Engineering
Southeast University, Nanjing, China
{lnjleo, xcheng, [Link], zhenyang}@[Link]

Abstract—This paper proposes an innovative and effective hybrid way to recognize human interactions, which incorporates the advantages of both a global feature (Motion Context, MC) and the Spatio-Temporal (S-T) correlation of local Spatio-Temporal Interest Points (STIPs). The MC feature, which also derives from STIPs, is used to train a random forest where a Genetic Algorithm (GA) is applied to the training phase to achieve a good compromise between reliability and efficiency. Besides, we design an effective and efficient S-T correlation based match to assist the MC feature, where MC’s structure and a biological sequence matching algorithm are employed to calculate the spatial and temporal correlation scores, respectively. Experiments on the UT-Interaction dataset show that our GA search based random forest and S-T correlation based match achieve better performance than some other prevalent machine learning methods, and that a combination of those two methods outperforms most of the state-of-the-art works.

Keywords—spatio-temporal interest points (STIPs); motion context (MC); random forest; genetic algorithm (GA); spatio-temporal (S-T) correlation

I. INTRODUCTION

Human action recognition, which has attracted increasing research interest in the past decades, is now of central importance in many applications related to computer vision such as video surveillance, video retrieval, human-computer interaction, robot vision, etc. Early studies in this area usually experiment on simple datasets which only contain single-person activities (e.g. Weizmann and KTH), and the recognition rates on those benchmark datasets are now close to 100% [1]. However, the recognition rates on human interactions are relatively low due to their richer inner semantics and contextual information [2].

3D reconstruction and pose estimation based methods were often used in the early years of this century, but now the prevalent approach is to extract 2D features directly from video sequences, among which the Spatio-Temporal Interest Points (STIPs) [3, 4] have been prevalent in the past decade due to their simplicity, effectiveness and robustness to cluttered backgrounds [5]. To exploit STIPs wisely, the next question to consider is whether to use descriptors to describe them. Most researchers give an affirmative answer to this question: they describe STIPs by various histograms (e.g. HOF, HOG, HOG3D [6], 3D-SIFT [7], etc.), then cluster the STIP descriptors to form an unstructured (Bag-of-Words, BoW [7]) or structured (vocabulary tree [9]) codebook, and finally fit them into a supervised (e.g. SVM [8] or neural network [10]) or unsupervised (e.g. probabilistic Latent Semantic Analysis, pLSA [11]) learning framework.

Nevertheless, it is also possible to focus only on the Spatio-Temporal (S-T) relationships of STIPs. Bregonzio et al. [12] extract multiple features from “clouds” of STIPs and successfully use a Nearest Neighbor (NN) classifier and SVM to recognize human actions. Another inspiring example is “Motion Context (MC)” [11], which is derived from “Shape Context” [13] for object recognition and captures the distribution of STIPs.

Although the approach combining STIPs with BoW and SVM is well known for its good performance, it has some obvious shortcomings: (1) BoW uses unstructured local features whose informative S-T relationships are totally ignored; (2) SVM is not necessarily the best choice of discriminative learning machine due to its binary classification nature and the difficulty in determining the kernel function parameters. To overcome those shortcomings, Matikainen et al. [14] describe human actions by pairwise S-T relationships, whereas Zhang et al. [15] put forward a “Bag of S-T Phrases (BoP)” model, both taking advantage of the S-T constraints of STIPs and achieving promising results. As an alternative to SVM, the decision tree receives more and more attention [1, 9] for its merits of multiclass classification and its ability to create a structured codebook. To better deal with noise and enjoy the benefits of boosting, some works train a series of decision trees (also called a “random forest”) [16, 17] instead of a single tree.

This paper aims at exploring and presenting effective and efficient methods for human interaction recognition, and the contributions of our work are as follows.

(1) An innovative hybrid framework which incorporates both global features and the S-T correlation of local features is proposed to recognize human interactions and achieves promising results.

(2) Genetic Algorithm (GA) search is integrated into the training of a random forest for the first time, which proves to be a good compromise between reliability and efficiency.

(3) An efficient scheme to calculate the S-T correlation score between two videos is presented, and the match based on this score outperforms both BoW and pLSA (using the same codebook).
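
As a rough illustration of the conventional descriptor-codebook pipeline mentioned in the introduction (describe STIPs, quantize them against a codebook, hand the resulting histogram to a classifier), the following Python sketch builds a BoW histogram for one video. It is only a sketch under stated assumptions: the names `nearest_word` and `bow_histogram`, the toy 2-D descriptors and the hand-written codebook are hypothetical and do not come from any of the cited works; a real system would learn the codebook (e.g. by k-means) from training descriptors.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def nearest_word(descriptor, codebook):
    """Return the index of the codeword closest to the descriptor."""
    return min(range(len(codebook)), key=lambda k: dist(descriptor, codebook[k]))


def bow_histogram(descriptors, codebook):
    """Count codeword occurrences over all descriptors and normalize to unit sum."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy data (assumed for illustration only): 2-D descriptors and 3 codewords.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bow_histogram(descriptors, codebook)
```

Note that the histogram keeps only how often each word occurs, not where or when, which is the loss of S-T relationships criticized in shortcoming (1) above.
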

1051-4651/14 $31.00 © 2014 IEEE
DOI 10.1109/ICPR.2014.434
A. Overview of Our Approach

Fig. 1. Overview of our human interaction recognition system. Note that the flowchart within the red box does not require STIP descriptions whereas the part within the blue box does.

Fig. 1 gives an overview of our human interaction recognition system. We first extract STIPs from each frame of the input videos, and then the STIPs in each frame form an MC vector. The MC feature can be seen as global, for it reflects the distribution of STIPs and contains shape information as well. Then we use the MC vectors, labeled with their action classes, to train a random forest which is composed of a series of GA search based binary decision trees, as shown in the red box. To assist the MC based recognition, an S-T correlation based match (within the blue box) is also adopted. Specifically, we describe each STIP by a vector which contains 3 parts: PCA of the original image patch, HOG, and the distribution of nearby STIPs. Afterwards, we use the k-means algorithm to cluster the STIPs, so that we can describe each video by a series of STIP occurrence sequences which serve as a template of the video. The spatial correlation score between two videos is calculated within the MC framework in a way similar to the “histogram intersection kernel” [25], whereas the temporal correlation score is calculated by a biological sequence matching algorithm [18].

The rest of the paper is organized as follows. STIP based mid-level feature extraction is introduced in Section II. Sections III and IV elaborate on the GA search based random forest and the S-T correlation calculation, respectively. Experimental results are shown and discussed in Section V, followed by our conclusion in Section VI.

II. STIP BASED MID-LEVEL FEATURE EXTRACTION

A. Voxel Variance based STIP Extraction

Numerous studies [2, 12] have confirmed the superiority of Dollar’s STIPs [4] over Laptev’s counterparts [3]. Here we use an even more straightforward method presented in [11] to extract STIPs. A sliding window is used to calculate the motion saliency maps from groups of frames within the window. Each pixel value on the motion saliency map is just the variance of the voxel values at the same location across the group of frames within the window. An empirical window size ranges from 5 to 10, and we choose 7 in our experiments. The STIPs are extracted by finding the local maxima of the saliency maps. We also compare such STIPs with Dollar’s counterparts, finding that they can have similar density (i.e. STIPs per frame, see Fig. 2) with proper thresholds, whereas the former are much faster to extract since they bypass the time-consuming S-T filtering.

Fig. 2. Examples of STIP extraction. (a) Snapshots of “shaking hands” and “kicking”; (b) Motion saliency maps and STIPs based on voxel variance; (c) Motion saliency maps and STIPs based on Dollar’s method.

B. Motion Context (MC)

The MC feature, whose idea comes from “Shape Context” [13], uses a log-polar diagram (centered at a reference point) to measure the distribution of other STIPs and captures the global information of motion and shape. As depicted in Fig. 3(a, b), we use a log-polar diagram containing 24 sub-regions to generate a 24-D histogram called the MC descriptor. Although an MC descriptor can be constructed from every frame, in practice we discard those frames with fewer than 30 STIPs to avoid overly sparse histograms.

Fig. 3. Illustration of MC. (a) Example of a log-polar diagram; (b) Sub-region numbers in the log-polar diagram, which are consistent with the histogram’s bin numbers; (c) Alignment of MC, where the gray region denotes the main orientation.

The diagram’s center $(c_x, c_y)$ and diameter $D$ are determined by

$$\begin{cases} (c_x, c_y) = \left( \dfrac{x_{\min} + x_{\max}}{2}, \dfrac{y_{\min} + y_{\max}}{2} \right), \\ D = \eta \times \max(x_{\max} - x_{\min},\; y_{\max} - y_{\min}), \end{cases} \tag{1}$$

where $x_{\min}$, $x_{\max}$, $y_{\min}$ and $y_{\max}$ denote the extreme values of all the STIPs’ coordinates in the current frame, and the coefficient $\eta$ ($\eta = 1.2$) is used to make $D$ larger so that the diagram covers most STIPs. Similar to [11], we align the MCs so that their main orientations (the fan sectors with the most STIPs) always lie on the right side (Fig. 3(c)).

C. STIP Occurrence Sequences

STIP occurrence sequences serve as a template for calculating the temporal correlation score between videos. The first step is to describe the STIPs and cluster them. We create a compact yet discriminative STIP descriptor which captures both shape and motion information of the neighborhood
without calculating the optical flow. Our 28-D descriptor (Fig. 4) consists of three normalized parts: a 15-D PCA of the original image patch (sized 11 × 11 and centered at the STIP), a 9-D HOG of the same image patch (the angular bin is 20°), and a 4-D histogram describing the distribution of nearby STIPs (the radius is 25 voxels and the angular bin is 90°); α, β and γ are the weight coefficients of the three parts, normalized by α + β + γ = 1. Note that although we do not employ optical flow, the distribution of nearby STIPs reflects motion information.

Fig. 4. Sketch map of our STIP descriptor. It is composed of three parts, and each part is normalized and multiplied by a weight coefficient (in our experiments, a bit more emphasis is put on the steady HOG feature such that α = γ = 0.3, β = 0.4).

After describing each STIP, we use k-means to cluster them into k visual words. The number of STIPs in each frame is about 30 to 120, which means we need a small k to make the spatial correlation calculation (Subsection IV-A) feasible. We will also see in Subsection IV-B that a small k is good for the temporal correlation calculation. In experiments, we find that k = 10 already achieves high performance while guaranteeing high computational efficiency. Given a video volume and all its visual words (Fig. 5(a)), we can describe it by k sequences, each corresponding to a certain visual word (Fig. 5(b)). Each sequence is a series of symbols that reflect the occurrence of a certain visual word in each frame. Using symbols rather than actual STIP numbers brings two benefits: (1) the temporal matching between videos of the same category is more robust; (2) it is more convenient to apply the biological sequence matching algorithm introduced in Subsection IV-B.

Fig. 5. Illustration of STIP occurrence sequences. (a) STIPs scattered in a video volume of length T. Different colors denote different visual words. (b) Occurrence sequences of visual words. The length of each sequence is also T, and each symbol reflects the number of a certain visual word in each frame.

III. GA SEARCH BASED RANDOM FOREST

A. Decision Tree Learning by C4.5 Algorithm

The decision tree is a convenient and robust way to fit discrete-valued functions. In the 1980s, Quinlan [19] proposed a practical iterative algorithm called ID3 to train the decision tree, and ID3 was later developed into C4.5 [20] to deal with continuous attributes. Both ID3 and C4.5 use a greedy top-down method to construct the tree without backtracking. In the training (tree growing) phase, usually a criterion called “information gain” is used to decide whether to split a node and, if so, which test is the best to split it.

This paper adopts C4.5 (for limited space, readers can refer to [21] for details) to train decision trees. Specifically, each candidate test is denoted by a tuple (a, c), which stands for the attribute and the threshold. Hence, a continuous attribute can be classified into {0, 1} through a comparison with the threshold c, which results in a binary tree. Hoping to reach a good balance between reliability and efficiency, we integrate GA search to find the optimal (or suboptimal) binary test for every split.

It is pointed out in [21] that the bias of the decision tree is the “preference for short trees”, which is consistent with “Occam’s razor”. In line with this, post-pruning is often applied to a fully grown tree to mitigate “over-fitting”. In order not to use validation datasets for post-pruning but still maintain high performance, we employ a series of decision trees called a “binary random forest”, which can benefit from boosting. Note that here “binary” not only refers to the tree structure, but also means that each tree just focuses on one positive sample category.

B. GA Search

We use GA search to select good binary tests during the decision tree learning, which proves to be an effective compromise between reliability and efficiency (see Subsection V-A). To apply GA, we first need to code the candidate binary tests (elements in the solution space). As mentioned before, a candidate test is denoted by a tuple (a, c), which is exactly our coded individual containing two genes: a is an integer which denotes the dimension index (attribute) to be tested, and c is a real number which denotes the threshold of the binary test.

At the start of each GA search, the initial group of size N is generated at random. As illustrated by Fig. 6(a), parents generate offspring in three different ways: the crossover recombination simply exchanges the “alleles” of the two parents; the median recombination just calculates the mean of the two parents; the mutation is accomplished by adding a Gaussian variable of variance 3 to the second gene c.

Fig. 6. Illustration of GA search. (a) Three ways to obtain the next generation: crossover recombination, median recombination and mutation; (b) One iteration of GA search.

As displayed in Fig. 6(b), the offspring groups generated by crossover recombination, median recombination and mutation are of size N/2, N/2 and N, respectively. “Information gain” is used as the fitness function. We pick out the N best individuals from the offspring together with their parents (3N individuals in total).
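
One GA search over candidate tests (a, c), as described above, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the binary test is assumed to have the form x[a] <= c, each parent pair is assumed to yield one crossover child (a from one parent, c from the other), the mean of two attribute indices is rounded to an integer, and the initial thresholds are drawn uniformly from the observed range of the chosen attribute. The default group size and iteration count follow the values reported in the text (N = 100, 4 iterations).

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def info_gain(test, X, y):
    """Information gain of the binary test x[a] <= c, used as the GA fitness."""
    a, c = test
    left = [lab for row, lab in zip(X, y) if row[a] <= c]
    right = [lab for row, lab in zip(X, y) if row[a] > c]
    n = len(y)
    return entropy(y) - len(left) / n * entropy(left) - len(right) / n * entropy(right)


def ga_search(X, y, n_pop=100, n_iter=4, rng=None):
    """Return the best binary test (a, c) found for samples X with labels y.

    n_pop is assumed to be even so that the parents can be paired.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    dims = len(X[0])
    lo = [min(row[a] for row in X) for a in range(dims)]
    hi = [max(row[a] for row in X) for a in range(dims)]

    def random_test():
        a = rng.randrange(dims)
        return (a, rng.uniform(lo[a], hi[a]))

    parents = [random_test() for _ in range(n_pop)]
    for _ in range(n_iter):
        shuffled = parents[:]
        rng.shuffle(shuffled)
        pairs = list(zip(shuffled[0::2], shuffled[1::2]))  # N/2 parent pairs
        # Crossover recombination: exchange the alleles of the two parents.
        crossover = [(p[0], q[1]) for p, q in pairs]
        # Median recombination: mean of the two parents (index rounded).
        median = [(round((p[0] + q[0]) / 2), (p[1] + q[1]) / 2) for p, q in pairs]
        # Mutation: add a Gaussian variable of variance 3 to the threshold c.
        mutation = [(a, c + rng.gauss(0.0, math.sqrt(3.0))) for a, c in parents]
        pool = parents + crossover + median + mutation  # 3N individuals
        pool.sort(key=lambda t: info_gain(t, X, y), reverse=True)
        parents = pool[:n_pop]  # keep the N best
    return parents[0]


# Toy data (assumed): attribute 1 separates the two classes, attribute 0 does not.
X = [(0.3, 0.2), (0.9, 0.5), (0.1, 0.9), (0.7, 9.1), (0.5, 9.4), (0.2, 9.9)]
y = [0, 0, 0, 1, 1, 1]
best = ga_search(X, y)
```

Because the parents compete with their offspring, the best fitness in the group never decreases from one iteration to the next, while only 2N new tests are evaluated per iteration instead of every possible attribute and threshold.
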
These N individuals serve as the parents for the next iteration. In our experiments, N = 100 and the GA iteration number is set to 4 for each binary test search, where the parameters are tuned by taking both effectiveness and efficiency into account.

IV. S-T CORRELATION CALCULATION

A. Spatial Correlation Calculation

The spatial correlation between two videos is calculated within the MC framework in a way similar to the “histogram intersection kernel” [25]. After k-means clustering, the STIPs in each MC’s log-polar diagram now have visual word labels. Suppose Video i and Video j have $T_1$ and $T_2$ frames, respectively. We need to compare all pairs of frames from the two videos to calculate the spatial correlation, as defined by

$$SC\_Score_{i,j} = \left[ \sum_{t_1=1}^{T_1} \sum_{t_2=1}^{T_2} \sum_{r=1}^{24} SC\_Score(f_{i,t_1}, f_{j,t_2}, r) \right] / (N_1 N_2), \tag{2}$$

where $N_1$ and $N_2$ are the total numbers of STIPs in Video i and Video j; $SC\_Score(\cdot)$ is the spatial correlation score of two frames in a certain sub-region r, i.e. the number of visual words shared in that area (see Fig. 7).

Fig. 7. Illustration of spatial correlation score calculation. Different colored dots stand for different visual words. Those two frames share 2 red dots in Sub-region 8, so their spatial correlation score equals 2 in that area.

B. Temporal Correlation Calculation

Having described each video by 10 STIP occurrence sequences, we can now calculate the temporal correlation of the two videos using

$$TC\_Score_{i,j} = \sum_{k=1}^{10} TC\_Score(seq_{i,k}, seq_{j,k}), \tag{3}$$

where $TC\_Score(\cdot)$ calculates the temporal correlation score of two symbol sequences corresponding to a certain visual word using a biological sequence matching algorithm called the Needleman-Wunsch (N-W) algorithm [18]. The N-W algorithm is composed of two main steps: (1) calculating a similarity matrix of the two given symbol sequences; (2) finding the best global alignment route using dynamic programming based backtracking. Here we use a toy example of matching two DNA fragments to clarify the former process. Consider two DNA sequences “ACGCTG” and “CATGT”, denoted by S and T (Fig. 8(a)): how similar are they, and how are they related to each other?

To apply the N-W method to answer those questions, a score function which rewards exact matches and penalizes mismatches should be predefined. In this case, the score function of two symbols is defined by

$$\begin{cases} \sigma(x, x) = 2, \\ \sigma(x, y) = \sigma(x, -) = \sigma(-, y) = -1, \end{cases} \tag{4}$$

where x and y are two different symbols, and “-” means a missing symbol, because “insertion” and “deletion” are quite common for DNA sequences. The similarity matrix $V(i, j)$ is of size $(|S| + 1) \times (|T| + 1)$ ($|\cdot|$ means the length of the sequence) and its initial values are set as

$$\begin{cases} V(0, 0) = 0, \\ V(i, 0) = V(i-1, 0) + \sigma(S[i], -), \\ V(0, j) = V(0, j-1) + \sigma(-, T[j]). \end{cases} \tag{5}$$

Then the remaining elements in the matrix can be obtained using the recursion formula

$$V(i, j) = \max \begin{cases} V(i-1, j-1) + \sigma(S[i], T[j]), \\ V(i-1, j) + \sigma(S[i], -), \\ V(i, j-1) + \sigma(-, T[j]). \end{cases} \tag{6}$$

The similarity matrix is finally completed as in Fig. 8(b), where the bottom-right element $V(|S|, |T|) = 2$ (with indices starting from 0 as in (5)) is the similarity score of the two sequences. Hence, the first question is answered.

Fig. 8. (a) Two DNA sequences to be matched; (b) The similarity matrix calculated from the two sequences; (c) The score function used to calculate temporal correlation.

To find out how the two sequences are related, a dynamic programming based backtracking procedure should be implemented. But in this paper, we can skip that procedure, for our focus is just on the temporal correlation score. We just use the first step of the N-W algorithm to calculate the temporal correlation score of two STIP occurrence sequences:

$$TC\_Score(seq_{i,k}, seq_{j,k}) = V(|seq_{i,k}|, |seq_{j,k}|). \tag{7}$$

In particular, our score function $\sigma(x, y)$ for calculating the similarity matrix is defined by Fig. 8(c).

V. EXPERIMENTAL RESULTS

We validate our algorithm on the UT-Interaction dataset [23] using a 3.5 GHz CPU under the Matlab 2012a platform. The UT-Interaction dataset was provided in the contest on Semantic Description of Human Activities (SDHA 2010) [2], during which four teams declared to take the challenge but only one team came up with a solution of acceptable performance.

The dataset is composed of two parts: Set 1 is taken on a parking lot with slight variation of zoom rates, whereas Set 2 is taken on a lawn with slightly moving backgrounds. Each set
consists of 10 sequences (720 × 480, 30 fps) including 6 kinds of human interactions (see Fig. 9). Experiments in Subsections V-A and V-B are carried out only on Set 1, aiming to compare the performance of our approach with other prevalent machine learning methods. Finally, the combined methods are tested on both sets in Subsection V-C. 10-fold cross validation (54 clips for training and 6 for testing) is used throughout all the experiments.

Fig. 9. Snapshots from the UT-Interaction dataset, where the former three are from Set 1 and the latter three are from Set 2. From left to right, the interaction categories are “shaking hands”, “hugging”, “kicking”, “pointing”, “punching” and “pushing”.

A. Recognition by MC and Random Forest

Using the MC feature, we compare the performance of a single decision tree and of random forests with various search strategies in this subsection. When training a single tree, the standard post-pruning procedure is applied to avoid over-fitting. As to the random forest, we train 6 trees for each interaction category, so the forest contains 36 binary trees. “Full search”, “random search” and “GA search” are adopted in the training of the forest, respectively.

TABLE I. RECOGNITION RESULTS OF SINGLE DECISION TREE AND RANDOM FOREST WITH VARIOUS SEARCH STRATEGIES

The results are listed in Table I. As expected, the single tree method is the fastest but has the lowest recognition accuracy. Because each binary tree in the random forest is more specialized than a single multiclass tree, and the random forest enjoys the benefit of boosting, even the random search based forest outperforms the single tree with post-pruning. What’s more, the training time of the single tree is not much shorter than that of the random search based forest, for the post-pruning is really time-consuming. On the other hand, we can see that GA search performs close to full search while saving much more time, and that GA search does not take much longer than random search but achieves much higher accuracy. So GA search really strikes a good balance between reliability and efficiency.

Having confirmed the good performance of the GA search based random forest, we compare it with some other prevalent methods: labeled MC descriptors directly applied to SVM, Bag-of-Words (BoW) with NN classification, and a Naïve Bayesian classifier applied to labeled MC descriptors.

For the SVM method, (6 × 5) / 2 = 15 one-vs-one linear SVMs are trained and results are obtained by voting. As to BoW, we create a small codebook of MC words to avoid overly sparse histograms, and finally we choose k = 15 for best performance. Decisions are made by NN classification. When using the Naïve Bayesian classifier, we assume independence among the dimensions of the MC feature and use the “Parzen Window” method [22] to estimate the conditional probability of each dimension.

Fig. 10 gives the comparison results in terms of average recognition accuracy and training time. We can see that although BoW and the Naïve Bayesian classifier take much less time to train, their performance is relatively low. The GA search based random forest achieves the highest recognition accuracy, though it takes more training time than SVM. But note that the training time, which depends on the initial group size, the number of evolution generations, the complexity of the evolution strategies and the number of trees in the forest, is quite flexible.

Fig. 10. Comparison of GA search based random forest and other prevalent machine learning methods. (a) Average recognition accuracy; (b) Average training time.

B. Recognition by S-T Correlation

We add the spatial and temporal correlation scores together to obtain the final S-T correlation score. Note that the two scores have different ranges, so we normalize them before adding them together. Instead of NN classification, we sum the S-T correlation scores with the video clips in the same category and choose the category with the maximum total score. Details of the S-T correlation based recognition results are shown in Fig. 11. We can see from Fig. 11 that every interaction category has the largest S-T correlation score with respect to the same category (peaks of the curves), which is fairly natural. Specifically, an interaction whose curve has an obvious peak is easy to distinguish from other interactions, which is also reflected in the first line of Table II.

Fig. 11. Normalized S-T correlation score for each interaction category with respect to the same category and other categories.

We compare S-T correlation based recognition with two other common machine learning methods that also use labeled STIPs. One is the BoW method, which is much the same as described in the former subsection except that the “word” is
STIP rather than MC. The other is pLSA, which is a generative method to estimate the joint probability of observed data and latent states. The EM algorithm is adopted to estimate the conditional probabilities, whose details can be found in [11]. Table II compares the three methods, from which we can see the advantages of S-T correlation calculation in both recognition accuracy and training time. BoW and pLSA might achieve higher performance with a larger codebook, but they would then require even more training time.

TABLE II. RECOGNITION RESULTS OF DIFFERENT MACHINE LEARNING METHODS USING LABELED STIPS

C. Combined Methods on the Whole Dataset

Finally, we experiment on both sets, combining the methods of the above two subsections. In particular, we test two combinations of the best and second best methodologies (i.e. GA search based random forest + S-T correlation, SVM + pLSA). The combination rule is simple: since each method can provide scores for each interaction category, we adopt the decision made by the method with the higher confidence.

TABLE III. AVERAGE RECOGNITION ACCURACY COMPARISON ON BOTH SETS OF UT-INTERACTION DATASET

Table III reports the recognition rates of our combined methods as well as some state-of-the-art works. It is exciting to see that our approach that combines the GA based random forest and S-T correlation achieves the second best and the best performance on Set 1 and Set 2, respectively. The reason for such robust recognition results is that such a combination takes advantage of both global motion and shape information as well as the S-T relationships of local features. Note that although [15] has the highest accuracy on Set 1, it has not been tested on Set 2.

VI. CONCLUSION

An innovative and effective hybrid method which exploits both global motion and shape information and the S-T correlation of local STIPs to recognize human interactions is proposed in this paper. Experiments on the UT-Interaction dataset show the merits of our approaches compared with some other prevalent machine learning methods, and that the combination of the GA search based random forest and S-T correlation achieves state-of-the-art performance. However, our stationary training dataset and offline training procedure cannot deal with online datasets which have dynamic recognition demands, so we plan to explore incremental learning in the future.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 60971098 and 61302152.

REFERENCES

[1] Z. Lin, Z. Jiang and L.S. Davis, “Recognizing actions by shape-motion prototype trees,” Proc. ICCV, pp. 444-451, 2009.
[2] M.S. Ryoo, C.C. Chen and J.K. Aggarwal, “An overview of contest on semantic description of human activities (SDHA) 2010,” Recognizing Patterns in Signals, Speech, Images and Videos, pp. 270-285, 2010.
[3] I. Laptev, “On space-time interest points,” IJCV, vol. 64(2/3), pp. 107-123, 2005.
[4] P. Dollar, V. Rabaud, G. Cottrell and S. Belongie, “Behavior recognition via sparse spatio-temporal features,” IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pp. 65-72, 2005.
[5] A.H. Shabani, D.A. Clausi and J.S. Zelek, “Evaluation of local spatio-temporal salient feature detectors for human action recognition,” Proc. Computer and Robot Vision (CRV), pp. 468-475, 2012.
[6] A. Klaser, M. Marszałek and C. Schmid, “A spatio-temporal descriptor based on 3D-gradients,” Proc. BMVC, pp. 995-1004, 2008.
[7] P. Scovanner, S. Ali and M. Shah, “A 3-dimensional SIFT descriptor and its application to action recognition,” Proc. ACM International Conference on Multimedia, pp. 357-360, 2007.
[8] C. Schuldt, I. Laptev and B. Caputo, “Recognizing human actions: a local SVM approach,” Proc. ICPR, vol. 3, pp. 32-36, 2004.
[9] K. Mikolajczyk and H. Uemura, “Action recognition with motion-appearance vocabulary forest,” Proc. CVPR, pp. 1-8, 2008.
[10] N. Li, X. Chen, S. Zhang and Z. Wu, “Recognizing Human actions by BP-AdaBoost algorithm under a hierarchical recognition framework,” Proc. ICASSP, pp. 3407-3411, 2013.
[11] Z. Zhang, Y. Hu, S. Chan and L.T. Chia, “Motion context: a new representation for human action recognition,” Proc. ECCV, pp. 817-829, 2008.
[12] M. Bregonzio, S. Gong and T. Xiang, “Recognising action as clouds of space-time interest points,” Proc. CVPR, pp. 1948-1955, 2009.
[13] S. Belongie, J. Malik and J. Puzicha, “Shape context: a new descriptor for shape matching and object recognition,” Advances in Neural Information Processing Systems, pp. 831-837, 2001.
[14] P. Matikainen, M. Hebert and R. Sukthankar, “Representing pairwise spatial and temporal relations for action recognition,” Proc. ECCV, pp. 508-521, 2010.
[15] Y. Zhang, X. Liu, M.C. Chang et al., “Spatio-temporal phrases for activity recognition,” Proc. ECCV, pp. 707-721, 2012.
[16] A. Yao, J. Gall and L. Van Gool, “A hough transform-based voting framework for action recognition,” Proc. CVPR, pp. 2061-2068, 2010.
[17] D. Waltisberg, A. Yao, J. Gall and L. Van Gool, “Variations of a hough-voting action recognition system,” Recognizing Patterns in Signals, Speech, Images and Videos, pp. 306-312, 2010.
[18] S.B. Needleman and C.D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48(3), pp. 443-453, 1970.
[19] J.R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1(1), pp. 81-106, 1986.
[20] J.R. Quinlan, “C4.5: programs for machine learning,” vol. 1, Morgan Kaufmann, 1993.
[21] T.M. Mitchell, “Machine learning,” New York: McGraw-Hill, 1997.
[22] S. Theodoridis and K. Koutroumbas, “Pattern Recognition,” Singapore: Elsevier Pte Ltd., 2010.
[23] M.S. Ryoo and J.K. Aggarwal, “Spatio-temporal relationship match: video structure comparison for recognition of complex human activities,” Proc. ICCV, pp. 1593-1600, 2009.
[24] X. Zhang, J. Cui, L. Tian and H. Zha, “Local spatio-temporal feature based voting framework for complex human activity detection and localization,” Proc. ACPR, pp. 12-16, 2011.
[25] S. Maji, A.C. Berg and J. Malik, “Classification using intersection kernel support vector machines is efficient,” Proc. CVPR, pp. 1-8, 2008.
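
As a supplementary illustration of the similarity matrix computation in Subsection IV-B, the following Python sketch fills the N-W matrix for two symbol sequences and returns its bottom-right entry, which is the quantity used as the temporal correlation score of one pair of occurrence sequences. It is a sketch under stated assumptions, not the authors' code: the function name `nw_score` is hypothetical, and the default scores (2 for identical symbols, -1 for mismatches and gaps) are those of the DNA toy example; the score function actually used for STIP symbols is the one given in Fig. 8(c).

```python
def nw_score(s, t, match=2, penalty=-1):
    """Bottom-right entry of the Needleman-Wunsch similarity matrix of s and t."""

    def sigma(x, y):
        # Score of aligning two symbols: reward identical ones, penalize the rest.
        return match if x == y else penalty

    rows, cols = len(s) + 1, len(t) + 1
    V = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):  # first column: symbols of s aligned with gaps
        V[i][0] = V[i - 1][0] + penalty
    for j in range(1, cols):  # first row: symbols of t aligned with gaps
        V[0][j] = V[0][j - 1] + penalty
    for i in range(1, rows):  # fill the rest with the three-way recursion
        for j in range(1, cols):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # align two symbols
                V[i - 1][j] + penalty,                        # gap in t
                V[i][j - 1] + penalty,                        # gap in s
            )
    return V[-1][-1]


# The DNA toy example from the text.
score = nw_score("ACGCTG", "CATGT")  # 2
```

Only the matrix is filled; the backtracking step that recovers the alignment itself is omitted, since the score alone is needed here. The cost is proportional to the product of the two sequence lengths.
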
