Comparative Machine Learning for Lipreading
Approach
Ziad Thabet, Faculty of Computer Science, MISR INTERNATIONAL UNIVERSITY, Cairo, Egypt, Ziad1407174@[Link]
Amr Nabih, Faculty of Computer Science, MISR INTERNATIONAL UNIVERSITY, Cairo, Egypt, Amr1410718@[Link]
Karim Azmi, Faculty of Computer Science, MISR INTERNATIONAL UNIVERSITY, Cairo, Egypt, Karim1405338@[Link]
Abstract—Lipreading is the process of interpreting spoken words by observing lip movement. It plays a vital role in human communication and speech understanding, especially for hearing-impaired individuals. Automated lipreading approaches have recently been used in such applications as biometric identification, silent dictation, forensic analysis of surveillance camera capture, and communication with autonomous vehicles. However, lipreading is a difficult process that poses several challenges to human- and machine-based approaches alike. This is due to the large number of phonemes in human language that are visually represented by a smaller number of lip movements (visemes). Consequently, the same viseme may be used to represent several phonemes, which confuses any lipreader. In this paper, we present a detailed study of the machine learning approach for the real-time visual recognition of spoken words. Our focus on real-time performance is motivated by the recent trend of using lipreading in autonomous vehicles. Machine learning approaches are applied to lipreading, and nine different classifiers have been implemented and tested, reporting their confusion matrices among different groups of words. Of the classifiers tested, the three that achieved the best results were Gradient Boosting, Support Vector Machine (SVM), and logistic regression, with accuracies of 64.7%, 63.5%, and 59.4%, respectively.

Index Terms—Lipreading, Classification, Autonomous Vehicles, Speech Recognition.

I. INTRODUCTION

Lipreading, widely known as visual speech recognition (VSR), is a process that aims to interpret and understand spoken words by using only the visual signal produced by lip movement. Lipreading plays a crucial role in both human-human and human-computer interaction. For example, people use lipreading in their daily conversations to understand one another in noisy environments and in situations where the audio speech signal is not readily comprehensible. Therefore, the skill of lipreading has long been mastered by individuals with hearing impairment. It enables them to understand speech and maintain social activities without relying on the perception of sounds.

The recent advent of novel machine learning and signal processing approaches has increased researchers’ interest in automating the process of lipreading. This attention is motivated by the promising results of lipreading in application areas such as human-computer interaction, forensic analysis of surveillance camera capture, biometric identification, silent dictation, and autonomous vehicles [1].

However, the recognition of lip motion presents several challenges to linear classifiers, mainly because the features used in the classification are calculated from a sequence of shapes that the lip takes, also known as “visemes”. The number of visemes that the lip can take is between 10 and 14 [2], whereas the number of phonemes (i.e., acoustic sounds) that can be produced by these visemes exceeds 50. This mismatch between visual and audio signals creates new horizons in machine learning research. It motivates the quest for improved visual features and classifiers to bridge the gap between what has been spoken and what is visually perceived.

In this paper, we present LipDrive: a novel system for visual speech recognition that targets autonomous vehicles as an application. The focus here is on the application area of autonomous vehicles due to its thriving nature and the possibilities that lipreading can offer. A human-computer interaction approach is taken to characterize the challenges and opportunities of lipreading in facilitating the communication between humans and autonomous vehicles, especially in noisy car environments. Furthermore, a comparative analysis of nine different linear classifiers that we tested in LipDrive is presented. Their performance was studied in lipreading using raw visual features as well as using a preprocessed feature set. Through presenting our experimental results, we aim to propose a set of guidelines for researchers working in the area of lipreading that can steer their choice of classification method and preprocessing steps.

The main contributions of this paper can be summarized as follows:
Authorized licensed use limited to: Chaitanya Bharathi Institute of Tech - HYDERABAD. Downloaded on August 22,2024 at 08:20:54 UTC from IEEE Xplore. Restrictions apply.
Therefore, instead of passing videos and pictures, we extract only the needed features from the videos. This is realized by passing the grayscale images to the feature extraction stage, which uses “DLib”, a modern C++ library that implements a multitude of machine learning algorithms [9]. The “Shape Predictor 68 Face Landmarks” model is used to detect the human face in images and to extract the 68 landmarks of the face. These landmarks represent points on the mouth, nose, eyes and so forth, as shown in Figure 2.

Furthermore, the number of landmarks is reduced to twenty points per frame, which represent the features of the lips, as shown in Figure 3. The points of each frame are then translated to Z-order by calculating a z-value, which maps a 2D point (x, y) to a one-dimensional value. This value is calculated by interleaving the binary representations of the point's coordinate values.

Where,
$$R = \frac{\theta}{X_{Max} - p_x}, \qquad H = Y_{Max} - p_y$$

D. Concatenation

Individual frames are passed through the face detection, feature extraction, cropping and normalizing processes described above. However, since a word is rarely classified from an individual frame, we concatenate the frames back to form a sequence of feature vectors (Figure 4). This process creates a training dataset that has the sequence of feature vectors as input and the spoken word as the class label. For example, if the word “ABOUT” is captured in 10 frames, each of which contains 20 features, this yields the 200 features that form the sequence for that word.
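The lip-point selection, Z-order encoding, and frame concatenation described in this section can be sketched as follows. This is a minimal illustration rather than the LipDrive implementation: it assumes the common 68-point landmark layout in which the 20 mouth points sit at 0-based indices 48 to 67, non-negative integer pixel coordinates, and a bit order in which x takes the even bit positions. The function names `z_value`, `frame_features`, and `word_features` are hypothetical, and the landmark data is synthetic.

```python
# Assumption: landmarks follow the common 68-point layout, where
# 0-based indices 48..67 are the 20 mouth (lip) points.
LIP_POINTS = slice(48, 68)


def z_value(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order (Morton) value.

    Assumption: x fills the even bit positions and y the odd ones; both
    coordinates are non-negative integers that fit in `bits` bits.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x -> bit 2i of z
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> bit 2i+1 of z
    return z


def frame_features(landmarks):
    """Reduce one frame's 68 (x, y) landmarks to 20 lip z-values."""
    return [z_value(x, y) for x, y in landmarks[LIP_POINTS]]


def word_features(frames):
    """Concatenate the per-frame lip features into one vector per word."""
    features = []
    for landmarks in frames:
        features.extend(frame_features(landmarks))
    return features


# Synthetic stand-in for 10 frames of 68 landmarks each (not real data).
frames = [[(i + f, 2 * i) for i in range(68)] for f in range(10)]
sample = word_features(frames)
print(len(sample))  # 10 frames x 20 lip points
```

With 10 frames and 20 lip points per frame, the resulting vector has 200 entries, matching the “ABOUT” example above. Words captured in a different number of frames would give vectors of a different length, so a fixed frame count (or padding) would be needed before training; how the original system handled this is not stated here.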
understanding of the strengths and weaknesses of the classifiers while attempting to classify different words with varying visual and phonetic similarities.
A. Dataset
V. EXPERIMENTAL RESULTS
Fig. 7. SGDClassifier’s Confusion Matrix

4) Experiment 4 - Multi-Layer Perceptron Classifier (MLP): We used a Multi-Layer Perceptron, a neural network classifier that optimizes the log-loss function using L-BFGS. It achieved an accuracy of 48.3%. Figure 8 depicts the confusion matrix for this experiment.

Fig. 8. Multi-Layer Perceptron Classifier’s Confusion Matrix

5) Experiment 5 - AdaBoost Classifier: We used the AdaBoost classifier, which fits a model on the training dataset and then fits additional copies of the model on the same dataset, reweighted to emphasize previously misclassified samples. It achieved an accuracy of 54.5%. Figure 9 depicts the confusion matrix for this experiment.

Fig. 9. AdaBoost Classifier’s Confusion Matrix

6) Experiment 6 - Linear Discriminant Analysis Classifier (LDA): We used the Linear Discriminant Analysis classifier, which fits class-conditional densities to the data and applies Bayes’ theorem. It achieved an accuracy of 56.1%. Figure 10 depicts the confusion matrix for this experiment.

Fig. 10. Linear Discriminant Analysis Classifier’s Confusion Matrix

7) Experiment 7 - Logistic Regression Classifier (LR): We used the Logistic Regression classifier, which analyzes independent variables to determine an outcome. It achieved an accuracy of 59.4%. Figure 11 depicts the confusion matrix for this experiment.
Fig. 11. Logistic Regression Classifier’s Confusion Matrix

8) Experiment 8 - Support Vector Machine Classifier (SVM): We used the Support Vector Machine classifier, which analyzes data for classification and regression analysis. It achieved an accuracy of 63.5%. Figure 12 depicts the confusion matrix for this experiment.

Fig. 13. Gradient Boosting Classifier’s Confusion Matrix

A. Discussion of Results

The results show that most of the classifiers confuse the words “About” and “Between”, and that the accuracies of the classifiers are close to one another; this is due to the small number of words used in training. However, as the number of words to be trained increases, the accuracy of the linear classifiers decreases in proportion to the number of added words. Thus, neural network classifiers are essential for large-scale datasets, which became clear when we started to use this larger data with the MLP classifier. Meanwhile, the use of CNNs is recommended in order to obtain promising results. In addition, to ensure high accuracy with real-time processing, we recommend using RNN and LSTM classifiers.
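To make the confusion-matrix reading concrete, the sketch below shows one way to build a confusion matrix from true and predicted word labels, derive the accuracy from it, and find the pair of words that are most often mistaken for each other. It is an illustration only: the helper names are hypothetical, the label lists are invented for the example (including the third word, “CHANGE”), and none of the numbers correspond to the experiments reported above.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Return matrix[i][j] = count of classes[i] samples predicted as classes[j]."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


def most_confused_pair(matrix, classes):
    """Return the pair of distinct classes with the most mutual errors."""
    best_pair, best_count = None, -1
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            errors = matrix[i][j] + matrix[j][i]
            if errors > best_count:
                best_pair, best_count = (classes[i], classes[j]), errors
    return best_pair, best_count


# Hypothetical labels for illustration; not the paper's data.
classes = ["ABOUT", "BETWEEN", "CHANGE"]
y_true = ["ABOUT", "ABOUT", "BETWEEN", "BETWEEN", "CHANGE", "CHANGE"]
y_pred = ["ABOUT", "BETWEEN", "ABOUT", "BETWEEN", "CHANGE", "CHANGE"]

matrix = confusion_matrix(y_true, y_pred, classes)
print(matrix)
print(accuracy(matrix))
print(most_confused_pair(matrix, classes))
```

Off-diagonal cells that are large in both directions, as for “ABOUT” and “BETWEEN” in this toy data, are the pattern the discussion above refers to when it says the classifiers confuse two words.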
REFERENCES
[1] A. Hassanat, “Visual speech recognition,” arXiv preprint arXiv:1409.1411, 2014.
[2] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, and R. Bowden, “Improving visual features for lip-reading,” in Auditory-Visual Speech Processing 2010, 2010.
[3] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, “Lipnet: end-to-end sentence-level lipreading,” 2016.
[4] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision, pp. 87–103, Springer, 2016.
[5] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu, “Towards better analysis of deep convolutional neural networks,” IEEE transactions on visualization and computer graphics, vol. 23, no. 1, pp. 91–100, 2017.
[6] N. Rathee, “A novel approach for lip reading based on neural network,” in Computational Techniques in Information and Communication Technologies (ICCTICT), 2016 International Conference on, pp. 421–426, IEEE, 2016.
[7] F. S. Lesani, F. F. Ghazvini, and R. Dianat, “Mobile phone security using automatic lip reading,” in e-Commerce in Developing Countries: With focus on e-Business (ECDC), 2015 9th International Conference on, pp. 1–5, IEEE, 2015.
[8] P. Domingos, “A few useful things to know about machine learning,” Communications of the ACM, vol. 55, no. 10, pp. 78–87, 2012.
[9] R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, “Face detection in color images,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 5, pp. 696–706, 2002.
[10] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision, 2016.