This document outlines a project focused on real-time gender and age estimation from facial images using deep learning techniques, specifically convolutional neural networks (CNNs). It details the architecture, training methodology, and results of a system that utilizes webcam input and OpenCV for face detection, achieving significant accuracy improvements over existing methods. The report also discusses challenges faced during implementation and the potential applications of the technology in various fields.
DL MINI FINAL
INDEX
Sr No Title
1 Abstract
2 Introduction
3 Network Architecture
4 Testing and Training
5 Results
6 Conclusion
7 Code
8 Output
Abstract
Gender and age estimation from facial images is a critical component in intelligent systems that
require human understanding. With advancements in deep learning, especially convolutional
neural networks (CNNs), it has become feasible to extract meaningful features from facial data
to predict attributes like age and gender with impressive accuracy. This project implements a
real-time gender and age prediction system using webcam input, OpenCV for face detection, and
pre-trained Caffe models for classification. The system is capable of detecting a person’s face
from a video stream and predicting their gender and age group in real time. This report
documents the complete architecture, methodology, technologies used, challenges faced, and
potential real-world applications of the system.
Introduction
Human faces carry rich information that can be used to estimate demographic traits such as age
and gender. The human ability to interpret these cues is intuitive, but replicating this process
using a computer system requires advanced machine learning and image processing techniques.
This project aims to implement a real-time system that can classify gender and age group from
video frames captured through a webcam using deep learning models.
The integration of facial recognition, deep learning, and computer vision creates opportunities
for a wide range of real-world applications in security, retail, marketing, and human-computer
interaction. The system leverages the OpenCV library and its DNN module to run deep learning
models trained on the Adience dataset, providing fast and reliable predictions on a standard
computing setup.
Age and gender play fundamental roles in social interactions. Languages reserve different
salutations and grammar rules for men or women, and very often different vocabularies are used
when addressing elders compared to young people. Despite the basic roles these attributes play
in our day-to-day lives, the ability to automatically estimate them accurately and reliably from
face images is still far from meeting the needs of commercial applications. This is particularly
perplexing when considering recent claims to super-human capabilities in the related task of face
recognition (e.g., [48]). Past approaches to estimating or classifying these attributes from face
images have relied on differences in facial feature dimensions [29] or “tailored” face descriptors
(e.g., [10, 15, 32]). Most have employed classification schemes designed particularly for age or
gender estimation tasks, including [4] and others. Few of these past methods were designed to
handle the many challenges of unconstrained imaging conditions [10]. Moreover, the machine
learning methods employed by these systems did not fully exploit the massive numbers of image
examples and data available through the Internet in order to improve classification
capabilities. In this paper we attempt to close the gap between automatic face recognition
capabilities and those of age and gender estimation methods.
Figure 1. Faces from the Adience benchmark for age and gender classification [10]. These images
represent some of the challenges of age and gender estimation from real-world, unconstrained
images. Most notably, extreme blur (low resolution), occlusions, out-of-plane pose variations,
expressions and more.
To this end, we follow the successful example laid down by recent face recognition systems:
face recognition techniques described in the last few years have shown that tremendous progress
can be made by the use of deep convolutional neural networks (CNN) [31]. We demonstrate
similar gains with a simple network architecture, designed by considering the rather limited
availability of accurate age and gender labels in existing face data sets. We test our network on
the newly released Adience benchmark for age and gender classification of unfiltered face
images [10]. We show that despite the very challenging nature of the images in the Adience set
and the simplicity of our network design, our method outperforms existing state of the art by
substantial margins.
Although these results provide a remarkable baseline for deep-learning-based approaches, they
leave room for improvements by more elaborate system designs, suggesting that the problem of
accurately estimating age and gender in unconstrained settings, as reflected by the Adience
images, remains unsolved. In order to provide a foothold for the development of more effective
future methods, we make our trained models and classification system publicly available.
A CNN for age and gender estimation
Gathering a large, labeled image training set for age and
gender estimation from social image repositories requires either access to personal information
on the subjects appearing in the images (their birth date and gender), which is often private, or
tedious and time-consuming manual labeling. Data sets for age and gender estimation from
real-world social images are therefore relatively limited in size and presently no match for
the much larger image classification data sets (e.g., the ImageNet dataset [45]).
Overfitting is a common problem when machine-learning-based methods are used on such small
image collections. This problem is exacerbated when considering deep convolutional neural
networks due to their huge numbers of model parameters. Care must therefore be taken in order
to avoid overfitting under such circumstances.
Network architecture
Our proposed network architecture is used throughout our experiments for both age and gender
classification. It is illustrated in Figure 2. A more detailed, schematic diagram of the entire
network design is additionally provided in Figure 3. The network comprises only three
convolutional layers and two fully-connected layers with a small number of neurons. This is by
comparison to the much larger architectures applied, for example, in [28] and [5]. Our choice of
a smaller network design is motivated both by our desire to reduce the risk of overfitting and
by the nature of the problems we are attempting to solve: age classification on the
Adience set requires distinguishing between eight classes; gender only two. Compare this to,
e.g., the ten thousand identity classes used to train the network used for face recognition in [48].
Figure 3. Full schematic diagram of our network architecture. Please see text for more details.
All three color channels are processed directly by the network. Images are first rescaled to 256
x 256 and a crop of 227 x 227 is fed to the network. The three subsequent convolutional layers
are then defined as follows.
1. 96 filters of size 3 x 7 x 7 pixels are applied to the input in the first convolutional layer,
followed by a rectified linear operator (ReLU), a max pooling layer taking the maximal value of
3 x 3 regions with two-pixel strides and a local response normalization layer [28].
2. The 96 x 28 x 28 output of the previous layer is then processed by the second convolutional
layer, containing 256 filters of size 96 x 5 x 5 pixels. Again, this is followed by ReLU, a max
pooling layer and a local response normalization layer with the same hyperparameters as before.
3. Finally, the third and last convolutional layer operates on the 256 x 14 x 14 blob by applying
a set of 384 filters of size 256 x 3 x 3 pixels, followed by ReLU and a max pooling layer.
The following fully connected layers are then defined by:
4. A first fully connected layer that receives the output of the third convolutional layer and
contains 512 neurons, followed by a ReLU and a dropout layer.
5. A second fully connected layer that receives the 512-dimensional output of the first fully
connected layer and again contains 512 neurons, followed by a ReLU and a dropout layer.
6. A third, fully connected layer which maps to the final classes for age or gender.
Finally, the output of the last fully connected layer is fed to a soft-max layer that assigns a
probability for each class. The prediction itself is made by taking the class with the maximal
probability for the given test image.
Testing and training
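The layer sizes quoted in the network architecture description above (a 96 x 28 x 28 blob after the first stage and 256 x 14 x 14 after the second) can be checked with a little arithmetic. The following is a minimal sketch, not the authors' code. The text does not state the convolution strides or paddings, so the values in `CONV_LAYERS` are assumptions chosen because they reproduce the quoted sizes. The rounding rules (floor for convolution, ceil for pooling) follow Caffe's convention, and the helper names are hypothetical.

```python
import math


def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution (floor rounding, as in Caffe)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel=3, stride=2):
    """Spatial output size of 3 x 3, stride-2 max pooling (ceil rounding, as in Caffe)."""
    return math.ceil((size - kernel) / stride) + 1


# (name, filters, kernel, stride, pad). Filter counts and kernel sizes come from
# the text; the stride and pad values are ASSUMPTIONS, not stated in the report.
CONV_LAYERS = [
    ("conv1", 96, 7, 4, 0),
    ("conv2", 256, 5, 1, 2),
    ("conv3", 384, 3, 1, 1),
]


def trace_shapes(input_size=227, in_channels=3):
    """Return (name, channels, height, width, parameters) after each conv + pool stage."""
    shapes = []
    size, channels = input_size, in_channels
    for name, filters, kernel, stride, pad in CONV_LAYERS:
        params = filters * channels * kernel * kernel + filters  # weights + biases
        size = pool_out(conv_out(size, kernel, stride, pad))
        channels = filters
        shapes.append((name, channels, size, size, params))
    return shapes


if __name__ == "__main__":
    for name, c, h, w, n in trace_shapes():
        print(f"{name}: output {c} x {h} x {w}, {n} parameters")
```

Under these assumptions the third stage yields a 384 x 7 x 7 blob, which would be the input to the first fully connected layer.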
Initialization. The weights in all layers are initialized with random values from a zero-mean
Gaussian with standard deviation of 0.01. To stress this, we do not use pretrained models for
initializing the network; the network is trained from scratch, without using any data outside of
the images and the labels available from the benchmark. This, again, should be compared with
CNN implementations used for face recognition, where hundreds of thousands of images are
used for training [48]. Target values for training are represented as sparse, binary vectors
corresponding to the ground truth class. Each target vector has a length equal to the number of
classes (two for gender, eight for the eight age classes of the age classification task),
containing 1 in the index of the ground truth and 0 elsewhere.
Network training. Aside from our use of a lean network architecture, we apply
two additional methods to further limit the risk of overfitting. First, we apply dropout learning
[24] (i.e., randomly setting the output value of network neurons to zero). The network includes
two dropout layers with a dropout ratio of 0.5 (50% chance of setting a neuron’s output value to
zero). Second, we use data augmentation by taking a random crop of 227 x 227 pixels from the
256 x 256 input image and randomly mirroring it in each forward-backward training pass. This is
similar to the multiple crop and mirror variations used by [48]. Training itself is performed
using stochastic gradient descent with an image batch size of fifty images. The initial learning
rate is e-3, reduced to e-4 after 10K iterations.
Prediction. We experimented with two methods of
using the network in order to produce age and gender predictions for novel faces:
* Center crop: feeding the network with the face image, cropped to 227 x 227 around the face
center.
* Over-sampling: we extract five 227 x 227 pixel crop regions, four from the corners of the
256 x 256 face image, and an additional crop region from the center of the face. The network is
presented with all five images, along with their horizontal reflections. Its final prediction is
taken to be the average prediction value across all these variations.
We have found that small misalignments in
the Adience images, caused by the many challenges of these images (occlusions, motion blur,
etc.), can have a noticeable impact on the quality of our results. This second, over-sampling
method is designed to compensate for these small misalignments, bypassing the need for
improving alignment quality and instead directly feeding the network with multiple translated
versions of the same face.
Experiments
Our method is implemented using the Caffe open-source framework [26]. Training was performed
on an Amazon GPU machine with 1,536 CUDA
cores and 4GB of video memory. Training each network required about four hours; predicting
age or gender on a single image using our network requires about 200ms. Prediction running
times can conceivably be substantially improved by running the network on image batches.
Results
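The random crop-and-mirror augmentation and the over-sampling prediction described in the training and prediction procedure above can be sketched with NumPy. This is an illustrative sketch under stated assumptions rather than the implementation used for the reported experiments: images are assumed to be height x width x channel arrays already rescaled to 256 x 256, and `predict_fn` is a hypothetical callable that maps one 227 x 227 crop to a vector of class probabilities.

```python
import numpy as np

CROP = 227  # network input size stated in the text


def random_crop_mirror(image, rng):
    """Training-time augmentation: random 227 x 227 crop, mirrored half the time."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - CROP + 1)
    left = rng.integers(0, w - CROP + 1)
    crop = image[top:top + CROP, left:left + CROP]
    return crop[:, ::-1] if rng.random() < 0.5 else crop


def oversample(image):
    """Test-time views: four corner crops, one center crop, then their mirrors (10 total)."""
    h, w = image.shape[:2]
    offsets = [
        (0, 0), (0, w - CROP), (h - CROP, 0), (h - CROP, w - CROP),
        ((h - CROP) // 2, (w - CROP) // 2),
    ]
    crops = [image[t:t + CROP, l:l + CROP] for t, l in offsets]
    return np.stack(crops + [c[:, ::-1] for c in crops])


def predict_oversampled(image, predict_fn):
    """Average the class probabilities over all ten views; return (class index, probabilities)."""
    probs = np.mean([predict_fn(view) for view in oversample(image)], axis=0)
    return int(np.argmax(probs)), probs
```

Averaging over translated and mirrored views is what lets the second method tolerate small misalignments without an explicit alignment step.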
Tables 2 and 3 present our results for gender and age classification, respectively. Table 4
further provides a confusion matrix for our multi-class age classification results. For age
classification, we measure and compare both the accuracy when the algorithm gives the exact
age-group classification and when the algorithm is off by one adjacent age-group (i.e., the
subject belongs to the group immediately older or immediately younger than the predicted
group). This follows others who have done so in the past, and reflects the uncertainty inherent
to the task: facial features often change very little between the oldest faces in one age class
and the youngest faces of the subsequent class. Both tables compare performance with the methods
described in [10]. Table 2 also provides a comparison with [23], which used the same gender
classification pipeline of [10] applied to more effective alignment of the faces; faces in their
tests were synthetically modified to appear facing forward. Evidently, the proposed method
outperforms the reported state of the art on both tasks with considerable gaps. Also evident is
the contribution of the over-sampling approach, which provides an additional performance boost
over the original network. This implies that better alignment (e.g., frontalization [22, 23]) may
provide an additional boost in performance. We provide a few examples of both gender and age
misclassifications in Figures 4 and 5, respectively. These show that many of the mistakes made
by our system are due to extremely challenging viewing conditions of some of the Adience
benchmark images. Most notable are mistakes caused by blur or low resolution and occlusions
(particularly from heavy makeup). Gender estimation mistakes also frequently occur for images
of babies or very young children where obvious gender attributes are not yet visible.
Conclusions
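The exact and one-off accuracy measures used in the results above can be computed directly from a confusion matrix. The sketch below is illustrative only: the function name `exact_and_one_off` is hypothetical, and the three-class counts in the example are made up to show the arithmetic. They are not figures from this report.

```python
def exact_and_one_off(confusion):
    """Return (exact accuracy, one-off accuracy).

    confusion[i][j] is the number of samples whose true class is i and whose
    predicted class is j, with classes ordered from youngest to oldest.
    """
    total = sum(sum(row) for row in confusion)
    exact = sum(confusion[i][i] for i in range(len(confusion)))
    # One-off also counts predictions that land in an adjacent age group.
    one_off = sum(
        count
        for i, row in enumerate(confusion)
        for j, count in enumerate(row)
        if abs(i - j) <= 1
    )
    return exact / total, one_off / total


# Made-up counts for three ordered classes (illustration only).
example = [
    [8, 2, 0],
    [1, 7, 2],
    [3, 1, 6],
]
exact_acc, one_off_acc = exact_and_one_off(example)
```

One-off accuracy is never lower than exact accuracy, since every exact hit also satisfies the adjacency condition.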
Code
import cv2
import math
import argparse

def highlightFace(net, frame, conf_threshold=0.7):
    # Run the face detector on a copy of the frame and draw a box around each face.
    frameOpencvDnn=frame.copy()
    frameHeight=frameOpencvDnn.shape[0]
    frameWidth=frameOpencvDnn.shape[1]
    blob=cv2.dnn.blobFromImage(frameOpencvDnn, 1.0, (300, 300), [104, 117, 123], True, False)

    net.setInput(blob)
    detections=net.forward()
    faceBoxes=[]
    for i in range(detections.shape[2]):
        confidence=detections[0,0,i,2]
        if confidence>conf_threshold:
            # Detections are normalized to [0, 1]; scale them to pixel coordinates.
            x1=int(detections[0,0,i,3]*frameWidth)
            y1=int(detections[0,0,i,4]*frameHeight)
            x2=int(detections[0,0,i,5]*frameWidth)
            y2=int(detections[0,0,i,6]*frameHeight)
            faceBoxes.append([x1,y1,x2,y2])
            cv2.rectangle(frameOpencvDnn, (x1,y1), (x2,y2), (0,255,0),
                          int(round(frameHeight/150)), 8)
    return frameOpencvDnn,faceBoxes

parser=argparse.ArgumentParser()
parser.add_argument('--image')
args=parser.parse_args()

# Model definition and weight files, expected in the working directory.
faceProto="opencv_face_detector.pbtxt"
faceModel="opencv_face_detector_uint8.pb"
ageProto="age_deploy.prototxt"
ageModel="age_net.caffemodel"
genderProto="gender_deploy.prototxt"
genderModel="gender_net.caffemodel"

MODEL_MEAN_VALUES=(78.4263377603, 87.7689143744, 114.895847746)
ageList=['(0-2)', '(4-6)', '(8-12)', '(15-20)', '(25-32)', '(38-43)', '(48-53)', '(60-100)']
genderList=['Male','Female']

faceNet=cv2.dnn.readNet(faceModel,faceProto)
ageNet=cv2.dnn.readNet(ageModel,ageProto)
genderNet=cv2.dnn.readNet(genderModel,genderProto)

# Read from the given image or video file if provided, otherwise from the default webcam.
video=cv2.VideoCapture(args.image if args.image else 0)
padding=20
while cv2.waitKey(1)<0:
    hasFrame,frame=video.read()
    if not hasFrame:
        cv2.waitKey()
        break

    resultImg,faceBoxes=highlightFace(faceNet,frame)
    if not faceBoxes:
        print("No face detected")

    for faceBox in faceBoxes:
        # Crop the face with some padding, clamped to the frame bounds.
        face=frame[max(0,faceBox[1]-padding):
                   min(faceBox[3]+padding,frame.shape[0]-1),
                   max(0,faceBox[0]-padding):
                   min(faceBox[2]+padding,frame.shape[1]-1)]

        blob=cv2.dnn.blobFromImage(face, 1.0, (227,227), MODEL_MEAN_VALUES, swapRB=False)

        genderNet.setInput(blob)
        genderPreds=genderNet.forward()
        gender=genderList[genderPreds[0].argmax()]
        print(f'Gender: {gender}')

        ageNet.setInput(blob)
        agePreds=ageNet.forward()
        age=ageList[agePreds[0].argmax()]
        print(f'Age: {age[1:-1]} years')

        cv2.putText(resultImg, f'{gender}, {age}', (faceBox[0], faceBox[1]+10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0,255,255), 2, cv2.LINE_AA)
        cv2.imshow("Detecting age and gender", resultImg)
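The padded face crop in the loop above can be isolated into a small helper, which makes the clamping easier to read and to test without a webcam or model files. This is a sketch, not part of the original listing: the name `crop_face` is hypothetical, and returning `None` for an empty crop is an added assumption about how a degenerate detection might be handled.

```python
import numpy as np


def crop_face(frame, box, padding=20):
    """Crop box [x1, y1, x2, y2] from frame, grown by `padding` pixels on each
    side and clamped to the frame bounds, mirroring the slicing in the listing."""
    height, width = frame.shape[:2]
    x1, y1, x2, y2 = box
    top = max(0, y1 - padding)
    bottom = min(y2 + padding, height - 1)
    left = max(0, x1 - padding)
    right = min(x2 + padding, width - 1)
    face = frame[top:bottom, left:right]
    # Added assumption: signal an empty crop so the caller can skip the face.
    return face if face.size else None


# Usage with a blank 100 x 200 frame standing in for a webcam image.
frame = np.zeros((100, 200, 3), dtype=np.uint8)
inner = crop_face(frame, [50, 30, 90, 70])    # box well inside the frame
edge = crop_face(frame, [5, 5, 195, 95])      # padding clipped at the borders
```

A caller could skip any face for which the helper returns `None` before building the blob for the gender and age networks.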