PornProbe: an LDA-SVM based Pornography Detection System
Sheng Tang 1,2, Jintao Li 1, Yongdong Zhang 1, Cheng Xie 1, Ming Li 1, Yizhi Liu 1, Xiufeng Hua 1
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 100190
{ts, jtli, zhyd, xiecheng, mli, liuyizhi, huaxiufeng}@[Link]
Yan-Tao Zheng 2, Jinhui Tang 2, Tat-Seng Chua 2
2 School of Computing, National University of Singapore, Singapore, 117417
{yantaozheng, tangjh, chuats}@[Link]
ABSTRACT

We present PornProbe, a pornography detection system that detects pornographic content in videos. To build such a detection system, we leverage a large-scale training data set with 65,827 positive training image samples out of a total of 420,615 training samples, and a novel detection scheme based on hierarchical LDA-SVM. The system combines the unsupervised clustering of Latent Dirichlet Allocation (LDA) with the supervised learning of Support Vector Machines (SVMs), so as to achieve both high precision and recall while ensuring efficiency in both training and testing. This demonstration shows how the system detects pornographic scenes in restricted artistic (RA) movies.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval;
H.2.4 [Database Management]: Systems - multimedia databases

General Terms: Algorithms, Experimentation, Performance

Keywords: Latent Dirichlet Allocation, SVM, Pornography Detection

This work was supported by National Natural Science Foundation of China (60873165), National Basic Research Program of China (973 Program, 2007CB311105) and Co-building Program of Beijing Municipal Education.

Copyright is held by the author/owner(s).
MM’09, October 19–24, 2009, Beijing, China.
ACM 978-1-60558-608-3/09/10.

1. Introduction

With the advancement of the Internet, the proliferation of images and videos has inevitably increased the chance of individuals encountering adult-oriented content such as pornographic images and videos. Pornographic materials contain subjects that induce sexual excitement. As such materials raise many social, moral and religious issues, automatic detection of pornographic content in images and videos is in high demand to facilitate possible regulation and censorship.

In this demonstration, we present PornProbe, a pornography detection system that can detect pornographic video shots and images. The design of PornProbe aims to accomplish two goals. First, we want to achieve high precision and recall in detection. This is because, compared to the total amount of images and videos on the Internet, the number of pornographic images and videos is relatively small. Second, the system must be highly efficient in order to scale up to the huge number of videos and images on the Internet. In this research, we tackle these goals from the aspects of database and machine learning. Since pornographic images vary considerably, it is necessary to establish a large-scale image database that covers as many variations of such images as possible (such as significant variations in race, lighting conditions, textures, viewing angles, and complicated body structures). Here, we set up a large-scale database with more than 10^5 images. To the best of our knowledge, this is the largest pornographic image data set, as most existing systems use fewer than 10^3 pornographic images for training. However, training support vector machines on such a large data set is very time-consuming and is often a bottleneck. Therefore, it is important to develop fast algorithms for training SVMs to learn the pornography detection rules both efficiently and effectively from large-scale training databases.

2. Large-scale training data set

Table 1. Statistics of our training data set
            Images    Video Keyframes    Total
Positive    21,699    44,128             65,827
Negative    51,680    303,108            354,788

Table 1 presents the statistics of our large-scale training image database (including key frames extracted from videos) for pornography detection. Altogether, we collected 420,615 training image samples from a wide variety of sources. For videos, we collected 1,108 pornographic videos from offline VCD sources; 20,000 short pornographic video clips gathered from online media streams using the skin-based detection method [1] between Dec 2007 and Dec 2008; and about 65,000 non-pornographic videos from YouTube, Tudou, YouKu and other websites. For images, we used the non-pornographic images from the Corel database and downloaded pornographic ones from Pinkworld.

Fig. 1. The framework of PornProbe (training: feature extraction, unsupervised LDA clustering into Topic 1 ... Topic k, and one SVM per topic; testing: feature extraction, LDA inference, SVM prediction and adaptive Bayesian fusion, giving the porn degree).
Fig. 2. ROC curves.
Fig. 3. The interface of PornProbe.

3. LDA-SVM Model

Our proposed solution is motivated by the insight from psychophysical studies that humans can perform coarse categorization of visual objects quite easily and quickly, followed by successively finer but slower discrimination [2]. Specifically, as shown in Figure 1, we propose a hierarchical LDA-SVM model that scales up to large data sets through a combination of unsupervised clustering and supervised learning. First, we use the generative Latent Dirichlet Allocation (LDA) model [3] to model the relationship between images and mine their hidden structure. We then perform coarse categorization by clustering the large-scale training data set into small topic sets according to the
principle of maximum membership probability determined by the topic-simplex representation vector inferred by the LDA model. Second, we perform successively finer discrimination by training an SVM on each clustered topic set, which yields multiple smaller SVM models. These topic-based SVM models are generally more effective, since their optimal separating hyperplanes may be much simpler and may generalize better on the small homogeneous topic sets than on the large overall data set. Finally, we propose an adaptive Bayesian approach to fuse the membership probability with the probability predicted by the corresponding SVM models over all the topic clusters. For a given test sample, we adaptively select only the SVM models of the most probable clusters for prediction. Furthermore, as the number of support vectors (SVs) is greatly reduced by training on the smaller topic sets, testing efficiency can be considerably improved. This makes online detection practical in spite of the large training data set.

The separate training of multiple SVM models, one per topic set, makes the SVMs parallelizable. Distributed training can drastically lower the computational complexity from O(n^2) to O(n^2/k^2). The overall computational complexity for training all the k SVM models is O(n^2/k), where n is the total number of training samples and k is the number of topics. An optimal topic number k is not required, since the first-stage clustering is only a coarse categorization: we only need to partition the training data set broadly. Therefore, k can be roughly determined by the ratio of the total number of training samples n to the desired average size of a topic set.

In our hierarchical framework for combining unsupervised clustering and supervised learning, we use the global color histogram for coarse categorization at the top layer because of its relatively low dimension and fast extraction [4], and the prior fusion (concatenation) of the color moment and edge histogram for finer discrimination at the bottom layer, to further remove the false detections produced by many existing skin-based methods. Although there is no special emphasis on detecting skin, skin detection is implicitly included in the latent semantic analysis of the color histogram and in the training of SVM models with color moments. Since we train multiple models on a large-scale training data set that includes nearly all possible variations, our method is more stable and robust than skin-based methods and single SVM methods.

For comparison, we evaluate the following three systems: (a) the skin-based method [1]; (b) a single SVM (trained on 120,000 samples randomly selected from our training data set rather than on the whole set, for which the amount of training computation is impractical); and (c) our proposed LDA-SVM method (k=40). Figure 2 shows the ROC curves of the three methods tested on 1,695 pornographic and 5,415 non-pornographic samples. The figure clearly indicates that LDA-SVM is much more effective than the other two methods. To test the effectiveness of our adaptive Bayesian fusion method, we compare it against the average fusion method, whose ROC curve is also shown in Fig. 2 (the cyan one). The training time, testing time, number of training samples and number of SVs of the single SVM method and LDA-SVM are shown in Table 2, which demonstrates the high training and testing efficiency of the proposed method.

Table 2. Training time and testing time of the SVM methods
           Training samples    Number of SVs      Training time    Testing time for 320×240
SVM        120,000             24,112             72 hours         667 ms
LDA-SVM    420,615             1,842 per topic    6 hours          49 ms

4. Demonstration

In this demonstration, we focus on the effectiveness and efficiency of PornProbe in pornography detection. Figure 3 shows the interface of PornProbe. The interface consists of three panes: a folder browsing pane (top left) used for selecting the video to be examined; a playback pane (bottom left) used for playing back the detected shots or the whole video; and a result browsing pane (right) used for displaying the detected pornographic key frames. After the input video is specified, the detection process starts examining the key frames of the input video. Once a pornographic key frame is detected, it is blurred and displayed in the result pane. The pornographic content in a key frame is measured by its porn degree, which is shown at the bottom of each key frame. Our demonstration shows that our system achieves both high precision and recall while ensuring efficiency in testing.

5. References
[1] H. Zheng; “Maximum entropy modeling for skin detection: with an application to Internet filtering”; Ph.D. Thesis, Université des Sciences et Technologies de Lille, France, 2004.
[2] S. Thorpe, D. Fize, and C. Marlot; “Speed of processing in the human visual system”; Nature, 381:520-522, June 1996.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan; “Latent Dirichlet allocation”; Journal of Machine Learning Research, 3:993-1022, 2003.
[4] S. Tang, J.-T. Li, M. Li, C. Xie, Y.-Z. Liu, K. Tao, S.-X. Xu; “TRECVID 2008 High-Level Feature Extraction By MCG-ICT-CAS”; Proc. TRECVID Workshop, Gaithersburg, USA, Nov 2008.
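
Illustrative sketch (not part of the original system description). The two decision rules described in the LDA-SVM model section, hard assignment by maximum membership probability during training and fusion of membership probabilities with per-topic SVM outputs during testing, might look roughly as follows. The paper does not spell out its adaptive Bayesian fusion formula, so everything here is an assumption made for illustration: the function names, the cumulative-mass rule for picking "the most probable clusters", the default mass of 0.9, and the renormalized weighted sum.

```python
from typing import Callable, List, Sequence


def assign_topic(membership: Sequence[float]) -> int:
    """Training-time rule: put a sample in the topic with maximum membership probability."""
    if not membership:
        raise ValueError("membership vector is empty")
    return max(range(len(membership)), key=lambda k: membership[k])


def select_topics(membership: Sequence[float], mass: float = 0.9) -> List[int]:
    """Assumed selection rule: take the most probable topics until their
    membership probabilities sum to at least `mass`."""
    if not membership:
        raise ValueError("membership vector is empty")
    order = sorted(range(len(membership)), key=lambda k: membership[k], reverse=True)
    chosen: List[int] = []
    covered = 0.0
    for k in order:
        chosen.append(k)
        covered += membership[k]
        if covered >= mass:
            break
    return chosen


def porn_degree(
    membership: Sequence[float],
    svm_prob: Callable[[int], float],
    mass: float = 0.9,
) -> float:
    """Assumed fusion: membership-weighted average of the selected topics' SVM
    probabilities. `svm_prob(k)` is called only for selected topics, so the
    SVMs of unlikely topics are never evaluated."""
    chosen = select_topics(membership, mass)
    total = sum(membership[k] for k in chosen)
    return sum(membership[k] * svm_prob(k) for k in chosen) / total


if __name__ == "__main__":
    # Hypothetical LDA membership vector and per-topic SVM outputs for one key frame.
    membership = [0.70, 0.25, 0.05]
    per_topic_output = {0: 0.9, 1: 0.2, 2: 0.5}
    print(assign_topic(membership))
    print(porn_degree(membership, lambda k: per_topic_output[k]))
```

Passing the SVM as a callable rather than a precomputed list reflects the efficiency argument in the text: only the selected topics' models need to run on a test sample.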
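
Worked arithmetic (an illustration, not a measurement). The complexity argument in the LDA-SVM model section can be checked numerically for the reported n = 420,615 and k = 40. The sketch below assumes, as the text does, a quadratic training cost, and additionally assumes evenly sized topic sets; the function names and the target topic size of 10,500 are hypothetical.

```python
def svm_training_cost(n: int, k: int) -> dict:
    """Relative training cost under a quadratic cost model with k equal-sized topic sets."""
    per_model = (n / k) ** 2          # O(n^2 / k^2): one topic SVM, e.g. on one machine
    return {
        "single_svm": float(n) ** 2,  # O(n^2): one SVM on the whole data set
        "per_topic_model": per_model,
        "all_topic_models": k * per_model,  # O(n^2 / k): all k models trained in sequence
    }


def rough_topic_count(n: int, target_topic_size: int) -> int:
    """Choose k as the ratio of n to the desired average topic-set size."""
    return max(1, round(n / target_topic_size))


if __name__ == "__main__":
    cost = svm_training_cost(420_615, 40)
    print(cost["single_svm"] / cost["all_topic_models"])   # sequential saving factor
    print(cost["single_svm"] / cost["per_topic_model"])    # fully parallel saving factor
    print(rough_topic_count(420_615, 10_500))              # hypothetical target size
```

Under these assumptions the sequential saving is a factor of k and the fully parallel saving a factor of k^2. Real topic sets are unlikely to be equal in size, so these ratios are upper-level estimates rather than predictions of the timings in Table 2.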