DL MINI FINAL

This document outlines a project focused on real-time gender and age estimation from facial images using deep learning techniques, specifically convolutional neural networks (CNNs). It details the architecture, training methodology, and results of a system that utilizes webcam input and OpenCV for face detection, achieving significant accuracy improvements over existing methods. The report also discusses challenges faced during implementation and the potential applications of the technology in various fields.

Uploaded by

ritikashetye189
Copyright
© All Rights Reserved
INDEX

Sr No  Title
1      Abstract
2      Introduction
3      Network Architecture
4      Testing and Training
5      Results
6      Conclusion
7      Code
8      Output

Abstract

Gender and age estimation from facial images is a critical component in intelligent systems that require human understanding. With advancements in deep learning, especially convolutional neural networks (CNNs), it has become feasible to extract meaningful features from facial data to predict attributes like age and gender with impressive accuracy. This project implements a real-time gender and age prediction system using webcam input, OpenCV for face detection, and pre-trained Caffe models for classification. The system is capable of detecting a person's face from a video stream and predicting their gender and age group in real time. This report documents the complete architecture, methodology, technologies used, challenges faced, and potential real-world applications of the system.

Introduction

Human faces carry rich information that can be used to estimate demographic traits such as age and gender. The human ability to interpret these cues is intuitive, but replicating this process using a computer system requires advanced machine learning and image processing techniques. This project aims to implement a real-time system that can classify gender and age group from video frames captured through a webcam using deep learning models. The integration of facial recognition, deep learning, and computer vision creates opportunities for a wide range of real-world applications in security, retail, marketing, and human-computer interaction. The system leverages the OpenCV library and its DNN module to run deep learning models trained on the Adience dataset, providing fast and reliable predictions on a standard computing setup.

Age and gender play fundamental roles in social interactions.
Languages reserve different salutations and grammar rules for men or women, and very often different vocabularies are used when addressing elders compared to young people. Despite the basic roles these attributes play in our day-to-day lives, the ability to automatically estimate them accurately and reliably from face images is still far from meeting the needs of commercial applications. This is particularly perplexing when considering recent claims to super-human capabilities in the related task of face recognition (e.g., [48]).

Past approaches to estimating or classifying these attributes from face images have relied on differences in facial feature dimensions [29] or "tailored" face descriptors (e.g., [10, 15, 32]). Most have employed classification schemes designed particularly for age or gender estimation tasks, including [4] and others. Few of these past methods were designed to handle the many challenges of unconstrained imaging conditions [10]. Moreover, the machine learning methods employed by these systems did not fully exploit the massive numbers of image examples and data available through the Internet in order to improve classification capabilities.

Figure 1. Faces from the Adience benchmark for age and gender classification [10]. These images represent some of the challenges of age and gender estimation from real-world, unconstrained images. Most notably, extreme blur (low resolution), occlusions, out-of-plane pose variations, expressions and more.

In this paper we attempt to close the gap between automatic face recognition capabilities and those of age and gender estimation methods. To this end, we follow the successful example laid down by recent face recognition systems: face recognition techniques described in the last few years have shown that tremendous progress can be made by the use of deep convolutional neural networks (CNN) [31].
We demonstrate similar gains with a simple network architecture, designed by considering the rather limited availability of accurate age and gender labels in existing face data sets. We test our network on the newly released Adience benchmark for age and gender classification of unfiltered face images [10]. We show that despite the very challenging nature of the images in the Adience set and the simplicity of our network design, our method outperforms the existing state of the art by substantial margins. Although these results provide a remarkable baseline for deep-learning-based approaches, they leave room for improvements by more elaborate system designs, suggesting that the problem of accurately estimating age and gender in unconstrained settings, as reflected by the Adience images, remains unsolved. In order to provide a foothold for the development of more effective future methods, we make our trained models and classification system publicly available.

A CNN for age and gender estimation

Gathering a large, labeled image training set for age and gender estimation from social image repositories requires either access to personal information on the subjects appearing in the images (their birth date and gender), which is often private, or tedious and time-consuming manual labeling. Data sets for age and gender estimation from real-world social images are therefore relatively limited in size and presently no match for the much larger image classification data sets (e.g., the ImageNet dataset [45]). Overfitting is a common problem when machine-learning-based methods are used on such small image collections. This problem is exacerbated when considering deep convolutional neural networks due to their huge numbers of model parameters. Care must therefore be taken in order to avoid overfitting under such circumstances.
Network architecture

Our proposed network architecture is used throughout our experiments for both age and gender classification. It is illustrated in Figure 2. A more detailed, schematic diagram of the entire network design is additionally provided in Figure 3. The network comprises only three convolutional layers and two fully connected layers with a small number of neurons. This is by comparison to the much larger architectures applied, for example, in [28] and [5]. Our choice of a smaller network design is motivated both by our desire to reduce the risk of overfitting and by the nature of the problems we are attempting to solve: age classification on the Adience set requires distinguishing between eight classes; gender only two. Compare this to, e.g., the ten thousand identity classes used to train the network used for face recognition in [48].

Figure 3. Full schematic diagram of our network architecture. Please see text for more details.

All three color channels are processed directly by the network. Images are first rescaled to 256 × 256 and a crop of 227 × 227 is fed to the network. The three subsequent convolutional layers are then defined as follows.

1. 96 filters of size 3 × 7 × 7 pixels are applied to the input in the first convolutional layer, followed by a rectified linear operator (ReLU), a max pooling layer taking the maximal value of 3 × 3 regions with two-pixel strides, and a local response normalization layer [28].
2. The 96 × 28 × 28 output of the previous layer is then processed by the second convolutional layer, containing 256 filters of size 96 × 5 × 5 pixels. Again, this is followed by ReLU, a max pooling layer and a local response normalization layer with the same hyperparameters as before.
3. Finally, the third and last convolutional layer operates on the 256 × 14 × 14 blob by applying a set of 384 filters of size 256 × 3 × 3 pixels, followed by ReLU and a max pooling layer.

The following fully connected layers are then defined by:

4. A first fully connected layer that receives the output of the third convolutional layer and contains 512 neurons, followed by a ReLU and a dropout layer.
5. A second fully connected layer that receives the 512-dimensional output of the first fully connected layer and again contains 512 neurons, followed by a ReLU and a dropout layer.
6. A third, fully connected layer which maps to the final classes for age or gender.

Finally, the output of the last fully connected layer is fed to a soft-max layer that assigns a probability for each class. The prediction itself is made by taking the class with the maximal probability for the given test image.

Testing and training

Initialization. The weights in all layers are initialized with random values from a zero-mean Gaussian with a standard deviation of 0.01. To stress this, we do not use pretrained models for initializing the network; the network is trained from scratch, without using any data outside of the images and the labels available from the benchmark. This, again, should be compared with CNN implementations used for face recognition, where hundreds of thousands of images are used for training [48]. Target values for training are represented as sparse, binary vectors corresponding to the ground truth class: each vector has a length equal to the number of classes (two for gender, eight for the eight age classes of the age classification task), containing 1 in the index of the ground truth and 0 elsewhere.

Network training. Aside from our use of a lean network architecture, we apply two additional methods to further limit the risk of overfitting. First, we apply dropout learning [24] (i.e., randomly setting the output value of network neurons to zero). The network includes two dropout layers with a dropout ratio of 0.5 (a 50% chance of setting a neuron's output value to zero). Second, we use data augmentation by taking a random crop of 227 × 227 pixels from the 256 × 256 input image and randomly mirroring it in each forward-backward training pass.
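The random crop-and-mirror augmentation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `random_crop_and_mirror`, the list-of-rows image layout and the 0.5 mirroring probability are hypothetical choices made for the example, not details taken from the original training code.

```python
import random


def random_crop_and_mirror(image, crop_size=227, rng=random):
    """Return a random crop_size x crop_size crop of `image`, mirrored half the time.

    `image` is assumed to be a list of rows, each row a list of pixels
    (a hypothetical layout chosen to keep the sketch dependency-free).
    """
    height, width = len(image), len(image[0])
    if height < crop_size or width < crop_size:
        raise ValueError("image is smaller than the requested crop")
    # randint is inclusive on both ends, so every valid offset can be drawn.
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    crop = [row[left:left + crop_size] for row in image[top:top + crop_size]]
    # Assumed mirroring probability of 0.5.
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]  # horizontal mirror
    return crop


# Dummy 256 x 256 image whose "pixels" record their own (row, column) position.
image = [[(y, x) for x in range(256)] for y in range(256)]
patch = random_crop_and_mirror(image, 227, random.Random(0))
print(len(patch), len(patch[0]))  # prints: 227 227
```

Because a fresh offset and mirror decision are drawn on every call, the network rarely sees exactly the same 227 × 227 view of an image twice, which is the point of the augmentation.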
The random crop-and-mirror augmentation is similar to the multiple crop and mirror variations used by [48]. Training itself is performed using stochastic gradient descent with an image batch size of fifty images. The initial learning rate is 1e-3, reduced to 1e-4 after 10K iterations.

Prediction. We experimented with two methods of using the network in order to produce age and gender predictions for novel faces:

• Center crop: Feeding the network with the face image, cropped to 227 × 227 around the face center.
• Over-sampling: We extract five 227 × 227 pixel crop regions, four from the corners of the 256 × 256 face image, and an additional crop region from the center of the face. The network is presented with all five images, along with their horizontal reflections. Its final prediction is taken to be the average prediction value across all these variations.

We have found that small misalignments in the Adience images, caused by the many challenges of these images (occlusions, motion blur, etc.), can have a noticeable impact on the quality of our results. The second, over-sampling method is designed to compensate for these small misalignments, bypassing the need for improving alignment quality by directly feeding the network with multiple translated versions of the same face.

Experiments

Our method is implemented using the Caffe open-source framework [26]. Training was performed on an Amazon GPU machine with 1,536 CUDA cores and 4GB of video memory. Training each network required about four hours; predicting age or gender on a single image using our network requires about 200ms. Prediction running times can conceivably be substantially improved by running the network on image batches.

Results

Tables 2 and 3 present our results for gender and age classification, respectively. Table 4 provides a confusion matrix for our multi-class age classification results.
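As a rough illustration of how a confusion matrix over the eight age groups can be tallied, and how exact and one-off accuracy can be read from it, here is a minimal Python sketch. One-off accuracy is taken here to mean that a prediction counts as correct when it lands in the true age group or an immediately adjacent one. The function names and the toy labels are hypothetical; they are not taken from the original system or its reported results.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes=8):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_class, predicted_class in zip(true_labels, predicted_labels):
        matrix[true_class][predicted_class] += 1
    return matrix


def exact_and_one_off_accuracy(matrix):
    """Return (exact accuracy, one-off accuracy) computed from a confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Exact hits sit on the diagonal.
    exact = sum(matrix[i][i] for i in range(n))
    # One-off hits sit on the diagonal or directly beside it.
    one_off = sum(matrix[i][j] for i in range(n) for j in range(n) if abs(i - j) <= 1)
    return exact / total, one_off / total


# Toy labels (made up for illustration): indices 0..7 stand for the eight age groups.
true_groups = [0, 1, 2, 3, 4, 5, 6, 7]
predicted_groups = [0, 2, 2, 5, 4, 4, 6, 7]
matrix = confusion_matrix(true_groups, predicted_groups)
print(exact_and_one_off_accuracy(matrix))  # prints: (0.625, 0.875)
```

In the toy data, five of eight predictions are exact and two more miss by a single adjacent group, so the one-off figure is higher than the exact one.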
For age classification, we measure and compare both the accuracy when the algorithm gives the exact age-group classification and when the algorithm is off by one adjacent age group (i.e., the subject belongs to the group immediately older or immediately younger than the predicted group). This follows others who have done so in the past, and reflects the uncertainty inherent to the task: facial features often change very little between the oldest faces in one age class and the youngest faces of the subsequent class.

Both tables compare performance with the methods described in [10]. Table 2 also provides a comparison with [23], which used the same gender classification pipeline of [10] applied to more effective alignment of the faces; faces in their tests were synthetically modified to appear facing forward.

Evidently, the proposed method outperforms the reported state of the art on both tasks with considerable gaps. Also evident is the contribution of the over-sampling approach, which provides an additional performance boost over the original network. This implies that better alignment (e.g., frontalization [22, 23]) may provide an additional boost in performance.

We provide a few examples of both gender and age misclassifications in Figures 4 and 5, respectively. These show that many of the mistakes made by our system are due to extremely challenging viewing conditions of some of the Adience benchmark images. Most notable are mistakes caused by blur or low resolution and occlusions (particularly from heavy makeup). Gender estimation mistakes also frequently occur for images of babies or very young children where obvious gender attributes are not yet visible.
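The over-sampling prediction scheme credited above with an additional performance boost can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: `predict` is a hypothetical stand-in for a forward pass of the network, images are assumed to be square lists of rows, and the center crop offset is assumed to be the integer midpoint.

```python
def oversample_offsets(image_size=256, crop_size=227):
    """Top-left (row, col) offsets of the four corner crops and the center crop."""
    far = image_size - crop_size  # 29 when cropping 227 from 256
    center = far // 2             # integer midpoint (an assumption of this sketch)
    return [(0, 0), (0, far), (far, 0), (far, far), (center, center)]


def crop(image, top, left, size):
    return [row[left:left + size] for row in image[top:top + size]]


def mirror(image):
    return [row[::-1] for row in image]


def predict_oversampled(image, predict, crop_size=227):
    """Average class probabilities over five crops and their horizontal mirrors.

    `predict` is a hypothetical callable mapping one crop to a list of
    class probabilities; `image` is assumed to be square.
    """
    variations = []
    for top, left in oversample_offsets(len(image), crop_size):
        patch = crop(image, top, left, crop_size)
        variations.extend([patch, mirror(patch)])
    outputs = [predict(v) for v in variations]  # ten forward passes
    num_classes = len(outputs[0])
    averaged = [sum(o[c] for o in outputs) / len(outputs) for c in range(num_classes)]
    best_class = max(range(num_classes), key=averaged.__getitem__)
    return best_class, averaged


print(oversample_offsets())
# prints: [(0, 0), (0, 29), (29, 0), (29, 29), (14, 14)]
```

Averaging over ten slightly shifted and mirrored views costs ten forward passes per face, which is the price paid for the robustness to small misalignments.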
Code

import cv2
import math
import argparse


def highlightFace(net, frame, conf_threshold=0.7):
    # Run the face detector on a copy of the frame and draw a box around each face.
    frameOpencvDnn = frame.copy()
    frameHeight = frameOpencvDnn.shape[0]
    frameWidth = frameOpencvDnn.shape[1]
    blob = cv2.dnn.blobFromImage(frameOpencvDnn, 1.0, (300, 300), [104, 117, 123], True, False)

    net.setInput(blob)
    detections = net.forward()
    faceBoxes = []
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > conf_threshold:
            # Detections are normalized to [0, 1]; scale them to pixel coordinates.
            x1 = int(detections[0, 0, i, 3] * frameWidth)
            y1 = int(detections[0, 0, i, 4] * frameHeight)
            x2 = int(detections[0, 0, i, 5] * frameWidth)
            y2 = int(detections[0, 0, i, 6] * frameHeight)
            faceBoxes.append([x1, y1, x2, y2])
            # Keep the line thickness at 1 or more so small frames do not fail.
            thickness = max(1, int(round(frameHeight / 150)))
            cv2.rectangle(frameOpencvDnn, (x1, y1), (x2, y2), (0, 255, 0), thickness, 8)
    return frameOpencvDnn, faceBoxes


parser = argparse.ArgumentParser()
parser.add_argument('--image')
args = parser.parse_args()

faceProto = "opencv_face_detector.pbtxt"
faceModel = "opencv_face_detector_uint8.pb"
ageProto = "age_deploy.prototxt"
ageModel = "age_net.caffemodel"
genderProto = "gender_deploy.prototxt"
genderModel = "gender_net.caffemodel"

MODEL_MEAN_VALUES = (78.4263377603, 87.7689143744, 114.895847746)
ageList = ['(0-2)', '(4-6)', '(8-12)', '(15-20)', '(25-32)', '(38-43)', '(48-53)', '(60-100)']
genderList = ['Male', 'Female']

faceNet = cv2.dnn.readNet(faceModel, faceProto)
ageNet = cv2.dnn.readNet(ageModel, ageProto)
genderNet = cv2.dnn.readNet(genderModel, genderProto)

# Read from the given image or video file if one was passed, otherwise from the webcam.
video = cv2.VideoCapture(args.image if args.image else 0)
padding = 20
while cv2.waitKey(1) < 0:
    hasFrame, frame = video.read()
    if not hasFrame:
        cv2.waitKey()
        break

    resultImg, faceBoxes = highlightFace(faceNet, frame)
    if not faceBoxes:
        print("No face detected")

    for faceBox in faceBoxes:
        # Crop the face with some padding, clamped to the frame borders.
        face = frame[max(0, faceBox[1] - padding):min(faceBox[3] + padding, frame.shape[0] - 1),
                     max(0, faceBox[0] - padding):min(faceBox[2] + padding, frame.shape[1] - 1)]
        if face.size == 0:
            continue  # skip degenerate boxes that would produce an empty crop

        blob = cv2.dnn.blobFromImage(face, 1.0, (227, 227), MODEL_MEAN_VALUES, swapRB=False)

        genderNet.setInput(blob)
        genderPreds = genderNet.forward()
        gender = genderList[genderPreds[0].argmax()]
        print(f'Gender: {gender}')

        ageNet.setInput(blob)
        agePreds = ageNet.forward()
        age = ageList[agePreds[0].argmax()]
        print(f'Age: {age[1:-1]} years')

        cv2.putText(resultImg, f'{gender}, {age}', (faceBox[0], faceBox[1] + 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 255), 2, cv2.LINE_AA)
        cv2.imshow("Detecting age and gender", resultImg)

# Free the capture device and close the preview window on exit.
video.release()
cv2.destroyAllWindows()
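The face crop in the script pads each detected box by a fixed margin and clamps the result to the frame borders before the crop is passed to the gender and age networks. The sketch below isolates that arithmetic in plain Python so it can be checked without a camera or OpenCV. The helper name `padded_face_region` is hypothetical and does not appear in the script; the frame size and box in the example are made up.

```python
def padded_face_region(face_box, frame_height, frame_width, padding=20):
    """Return (top, bottom, left, right) slice bounds for a padded face crop.

    `face_box` is [x1, y1, x2, y2] in pixels, as produced by the face detector.
    The bounds mirror the clamping used in the script: never below 0 and
    never beyond the last row or column index of the frame.
    """
    x1, y1, x2, y2 = face_box
    top = max(0, y1 - padding)
    bottom = min(y2 + padding, frame_height - 1)
    left = max(0, x1 - padding)
    right = min(x2 + padding, frame_width - 1)
    return top, bottom, left, right


# A box that nearly fills a 640 x 480 frame: the padding is clipped on every side.
print(padded_face_region([10, 5, 630, 470], frame_height=480, frame_width=640))
# prints: (0, 479, 0, 639)
```

With a NumPy frame, the crop would then be taken as `frame[top:bottom, left:right]`, which is the same slice the script builds inline.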
