Working of Speaker Recognition and Voice Biometrics Systems
Presented By:
Deepanshu Kumar - IEC2022073
Prashant Agrawal - IEC2022120
Aditya Raj Singh - IEC2022054
Shivam Kumar - IEC2022055
Department of ECE, IIIT Allahabad, India
Supervised By:
Dr. Ramesh Kumar Bhukya
Assistant Professor, Dept. of ECE
IIIT Allahabad, India
INTRODUCTION
• Definition:
Speaker Recognition is a technique where a computer system
identifies or verifies a person using only their voice.
• Difference from Speech Recognition:
Speech recognition focuses on "what is said",
while speaker recognition focuses on "who is speaking".
• Usage Areas:
Phone banking, smart assistants, online exams and secure access control.
• Key Components:
Speech Signal Processing combined with Machine Learning and
Pattern Recognition methods.
TYPES OF SPEAKER RECOGNITION
• Speaker Identification:
System determines which speaker, out of a group of
enrolled speakers, is talking.
Example: call center system deciding which customer is on the call.
• Speaker Verification:
System checks if the claimed identity is genuine or fake.
Example: system verifies "Is this really the account holder?" using voice.
• Text-Dependent Systems:
User speaks a fixed pass-phrase such as
"My voice is my password" during both enrollment and testing.
• Text-Independent Systems:
User can speak any sentence.
System focuses on speaker characteristics, not the exact words.
SYSTEM ARCHITECTURE AND FLOW
• Overall Pipeline:
Complete sequence from microphone input to final
speaker accept / reject decision.
• Front-End:
Speech capture followed by pre-processing and Voice Activity Detection
to prepare clean speech segments.
• Feature Stage:
Extract MFCC and related features from each frame of speech
and form a stream of feature vectors.
• Back-End:
Use feature vectors to build speaker models, store them in a database
and later compare test features with stored models.
[Figure: Speech Capture → Pre-processing & VAD → Feature Extraction (MFCC) → Speaker Modelling → Database (Enrollment)]
Simple left-to-right flow of a speaker recognition system from raw speech to final decision.
SPEECH CAPTURE AND PRE-PROCESSING
• Sampling:
Speech is typically recorded at 8 kHz for telephone quality
or 16 kHz for better-quality applications.
• DC Offset Removal:
Signal is shifted so that average value becomes zero,
which avoids bias in later processing.
• Pre-emphasis Filter:
High frequencies are slightly boosted to balance the
natural tilt of speech spectrum and highlight important information.
• Normalization:
Overall amplitude is scaled so that recordings from different
sessions have comparable loudness.
• Voice Activity Detection (VAD):
Detects regions where speech is present and
removes long silence or background-only segments.
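The pre-processing steps on this slide can be sketched with NumPy as follows. This is a minimal illustration, not a production front-end: the function names `preprocess` and `energy_vad`, the peak normalization, and the threshold of 30 dB below the loudest frame are all assumptions made for the example.

```python
import numpy as np

def preprocess(x, pre_emph=0.95):
    """DC removal, pre-emphasis and peak normalization of a mono signal."""
    x = np.asarray(x, dtype=np.float64)
    x = x - x.mean()                                 # DC offset removal
    y = np.append(x[0], x[1:] - pre_emph * x[:-1])   # y[n] = x[n] - a*x[n-1]
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y               # scale into [-1, 1]

def energy_vad(x, frame_len, hop, threshold_db=-30.0):
    """Boolean mask, one entry per frame, True where speech is likely.

    A frame counts as speech if its energy is within |threshold_db| dB
    of the loudest frame (an assumed, very simple decision rule).
    """
    x = np.asarray(x, dtype=np.float64)
    n_frames = 1 + (len(x) - frame_len) // hop
    energies = np.array([
        np.mean(x[i * hop : i * hop + frame_len] ** 2) for i in range(n_frames)
    ])
    energies_db = 10.0 * np.log10(energies + 1e-12)  # floor avoids log(0)
    return energies_db > (energies_db.max() + threshold_db)
```

With 16 kHz audio, `energy_vad(preprocess(x), 400, 160)` would use 25 ms frames with a 10 ms hop. Practical systems usually add smoothing of the mask or use a trained VAD model.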
FEATURE EXTRACTION (MFCC)
• Why Features:
Raw waveform has too many samples and is not directly suitable
for pattern matching, so we convert it into compact feature vectors.
• MFCC Concept:
Mel-Frequency Cepstral Coefficients capture the overall
spectral shape of speech in a way similar to human hearing.
• MFCC Pipeline:
Pre-emphasis → Framing → Windowing → FFT → Mel filterbank →
log energies → Discrete Cosine Transform → MFCCs.
MFCC – STEP BY STEP
• Pre-emphasis:
y[n] = x[n] − a·x[n−1], where a is around 0.95.
This boosts higher frequencies which are important for intelligibility.
• Framing:
Speech is divided into short overlapping frames
(typically 20–30 ms) where the signal is almost stationary.
• Windowing:
Each frame is multiplied by a Hamming window to reduce discontinuities
at frame edges and lower spectral leakage.
• FFT and Spectrum:
Fast Fourier Transform converts each windowed frame from
time domain to frequency domain magnitude spectrum.
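The framing, windowing and FFT steps above can be sketched as below. The 25 ms frame length, 10 ms hop and 512-point FFT are assumed values that fall within the ranges mentioned on the slide, and the function names are illustrative only. The input is expected to be already pre-emphasized.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def magnitude_spectrum(x, sr=16000, frame_ms=25, hop_ms=10, n_fft=512):
    """Hamming-windowed magnitude spectrum of every frame."""
    x = np.asarray(x, dtype=np.float64)
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = frame_signal(x, frame_len, hop)
    frames = frames * np.hamming(frame_len)        # taper the frame edges
    spec = np.fft.rfft(frames, n=n_fft, axis=1)    # zero-padded to n_fft
    return np.abs(spec)                            # shape (n_frames, n_fft//2 + 1)
```

Each row of the result describes one frame; bin `k` corresponds to frequency `k * sr / n_fft` Hz.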
MEL FILTERBANK AND CEPSTRUM
• Mel Filterbank:
Magnitude spectrum is passed through a bank of triangular filters
spaced on the mel scale, which approximates human perception of pitch.
• Log Energies:
Logarithm of filter outputs is taken to approximate loudness perception
and to turn multiplicative spectral effects (such as channel gain) into additive ones.
• Cepstrum via DCT:
Discrete Cosine Transform of log filter energies produces MFCCs,
which compactly represent the spectral envelope of speech.
• Dynamic Features:
First and second time derivatives of MFCCs (delta and delta-delta)
are often added to capture speech dynamics.
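A compact sketch of the mel filterbank, log and DCT steps, plus a simple delta computation, is given below. Several details are assumptions chosen for illustration: the common formula mel = 2595·log10(1 + f/700), 26 filters, 13 coefficients, filtering the squared magnitude (power) rather than the magnitude itself, an unnormalized DCT-II, and a central-difference delta. Function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):              # rising slope
            fbank[i, k] = (k - left) / (centre - left)
        for k in range(centre, right):             # falling slope
            fbank[i, k] = (right - k) / (right - centre)
    return fbank

def mfcc_from_spectrum(mag, sr=16000, n_filters=26, n_ceps=13):
    """MFCCs from a (n_frames, n_fft//2 + 1) magnitude spectrum."""
    n_fft = 2 * (mag.shape[1] - 1)
    fbank = mel_filterbank(n_filters, n_fft, sr)
    energies = (mag ** 2) @ fbank.T                # filterbank energies
    log_e = np.log(energies + 1e-10)               # floor avoids log(0)
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return log_e @ basis.T                         # unnormalized DCT-II

def delta(feat):
    """First time derivative by central difference along the frame axis."""
    padded = np.pad(feat, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0
```

Stacking `np.hstack([c, delta(c), delta(delta(c))])` for MFCCs `c` gives the usual static + delta + delta-delta feature vector (39 dimensions for 13 coefficients).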
SPEAKER MODELLING
• Goal:
Represent each speaker by a mathematical model that captures their unique
vocal characteristics over many frames.
• Gaussian Mixture Models (GMM):
Probability density of MFCC vectors is modelled
as a weighted sum of multiple Gaussian components.
• GMM-UBM Approach:
A universal background GMM is trained using large multi-speaker data
and then adapted to each individual speaker using MAP adaptation.
• I-vectors and X-vectors:
Low-dimensional embeddings that summarize a full utterance
into a single fixed-length vector for classification or scoring.
• Neural Network Embeddings:
Deep neural networks can directly learn speaker embeddings
from spectrograms or MFCC sequences.
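To make the GMM-UBM idea concrete, the sketch below scores feature vectors against a diagonal-covariance GMM and adapts only the component means towards a speaker's data. It is an illustration under stated assumptions: diagonal covariances, mean-only MAP adaptation, and a relevance factor of 16 are common choices but are not specified in these slides, and the function names are hypothetical.

```python
import numpy as np

def component_log_probs(X, weights, means, variances):
    """log(w_k * N(x_t | mu_k, diag(var_k))) for each frame t, component k."""
    X = np.asarray(X, dtype=np.float64)
    diff = X[:, None, :] - means[None, :, :]                              # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)     # (K,)
    log_exp = -0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)    # (T, K)
    return np.log(weights)[None, :] + log_norm[None, :] + log_exp

def avg_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of X under the GMM."""
    lp = component_log_probs(X, weights, means, variances)
    m = lp.max(axis=1, keepdims=True)                 # log-sum-exp for stability
    frame_ll = m[:, 0] + np.log(np.sum(np.exp(lp - m), axis=1))
    return float(frame_ll.mean())

def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """Shift UBM means towards the speaker's data (mean-only MAP)."""
    X = np.asarray(X, dtype=np.float64)
    lp = component_log_probs(X, weights, means, variances)
    post = np.exp(lp - lp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)           # responsibilities (T, K)
    n_k = post.sum(axis=0)                            # soft frame counts (K,)
    e_k = (post.T @ X) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]        # data-dependent mixing
    return alpha * e_k + (1.0 - alpha) * means
```

A verification score in this framework is typically the difference between the average log-likelihood under the adapted speaker model and under the UBM.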
ENROLLMENT PHASE
• Data Collection:
User speaks several prompted or free sentences in a quiet room
using the target microphone or device.
• Feature Extraction:
System performs all pre-processing and extracts MFCC
and related features from the recorded speech.
• Model Training:
Using these feature vectors, a speaker model or embedding
(i-vector, x-vector or GMM) is estimated for that user.
• Template Storage:
The resulting model is stored securely in the database as a
voice print that represents that particular speaker.
• Quality Requirements:
Good enrollment needs enough duration of speech and
minimal background noise for reliable templates.
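For embedding-based systems, one simple way to form the stored voice print is to average the embeddings of the enrollment utterances. The sketch below assumes that embeddings are already available from some extractor; the function name `enroll` and the 10-second minimum duration are hypothetical choices used only to illustrate the quality requirement.

```python
import numpy as np

def enroll(utterance_embeddings, durations_s, min_total_s=10.0):
    """Build a unit-length template from several utterance embeddings.

    Raises ValueError if the total speech duration is below min_total_s
    (an assumed, illustrative threshold).
    """
    total = float(np.sum(durations_s))
    if total < min_total_s:
        raise ValueError(f"only {total:.1f} s of speech; need {min_total_s:.1f} s")
    embs = np.asarray(utterance_embeddings, dtype=np.float64)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # length-normalize
    template = embs.mean(axis=0)
    return template / np.linalg.norm(template)
```

The returned vector is what would be encrypted and stored as the speaker's template.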
VERIFICATION / IDENTIFICATION PHASE
• Test Recording:
During use, the user again speaks a sentence which is captured
through the microphone in similar conditions.
• Feature Extraction:
The same MFCC-based pipeline is applied to the test recording
to generate feature vectors or an embedding.
• Verification Mode:
Test voice is matched only against the claimed speaker's model;
score is compared with a threshold to accept or reject.
• Identification Mode:
Test voice is matched against all enrolled models and the
speaker with highest score is selected as the predicted identity.
• Threshold Tuning:
Decision threshold is chosen to maintain a good balance between
false acceptance and false rejection errors.
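The two decision modes can be sketched as follows, assuming each speaker is represented by a fixed-length embedding and that cosine similarity is used as the score (one common option; GMM log-likelihood ratios or PLDA scores are alternatives). The names `verify`, `identify` and the `enrolled` dictionary are hypothetical.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two embeddings."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(test_emb, enrolled, claimed_id, threshold):
    """Verification: compare against the claimed speaker's model only."""
    return cosine_score(test_emb, enrolled[claimed_id]) >= threshold

def identify(test_emb, enrolled):
    """Identification: score against every enrolled model, pick the best."""
    scores = {spk: cosine_score(test_emb, emb) for spk, emb in enrolled.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Raising `threshold` lowers false acceptances at the cost of more false rejections, which is the trade-off that threshold tuning has to balance.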
PERFORMANCE METRICS
• False Acceptance Rate (FAR):
Percentage of impostor trials that are wrongly
accepted as genuine users by the system.
• False Rejection Rate (FRR):
Percentage of genuine trials that are wrongly
rejected as impostors by the system.
• Equal Error Rate (EER):
Error rate at the decision threshold where FAR and FRR are equal.
Lower EER means better overall performance.
• DET Curve:
Detection Error Tradeoff curve plots FRR against FAR on normal-deviate
(probit-scaled) axes and helps visually compare different systems or settings.
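Given lists of scores from genuine and impostor trials, the metrics above can be computed as in this sketch. It assumes that higher scores mean "more likely genuine" and that a trial is accepted when its score is at or above the threshold. The EER is approximated by scanning the observed scores as candidate thresholds, which is adequate for illustration but coarser than interpolating the DET curve.

```python
import numpy as np

def far_frr(genuine, impostor, threshold):
    """FAR and FRR (as fractions) at a given accept threshold."""
    genuine = np.asarray(genuine, dtype=np.float64)
    impostor = np.asarray(impostor, dtype=np.float64)
    far = float(np.mean(impostor >= threshold))   # impostors wrongly accepted
    frr = float(np.mean(genuine < threshold))     # genuine trials wrongly rejected
    return far, frr

def equal_error_rate(genuine, impostor):
    """Approximate EER and the threshold at which it occurs."""
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    rates = np.array([far_frr(genuine, impostor, t) for t in thresholds])
    i = int(np.argmin(np.abs(rates[:, 0] - rates[:, 1])))
    return float(rates[i].mean()), float(thresholds[i])
```

Multiply the returned fractions by 100 to express them as percentages.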
APPLICATIONS
• Banking and Finance:
Voice-based authentication for telephone banking,
customer support and high-value transaction approval.
• Smart Home and IoT Devices:
Voice biometrics used to personalize responses
and restrict access to sensitive commands on smart speakers.
• Forensics and Law Enforcement:
Speaker comparison for recorded calls or
threat messages to support investigations.
• Online Exams and Remote Work:
Continuous voice verification to reduce
impersonation and maintain academic or workplace integrity.
CHALLENGES AND LIMITATIONS
• Background Noise:
Traffic, crowd or music can corrupt features and
significantly reduce recognition accuracy.
• Channel and Device Mismatch:
Different microphones, codecs or networks
introduce variations not seen during training.
• Intra-Speaker Variability:
Same person may sound different when ill,
tired, emotional or speaking in another language.
• Spoofing Attacks:
Replay of recorded speech and modern deepfake voices
can fool naive systems if no countermeasures are used.
• Privacy and Security:
Voice prints are biometric data and must be encrypted,
access-controlled and used according to privacy regulations.
IMPROVEMENTS AND FUTURE SCOPE
• Robust Feature Design:
Explore features or learned representations that are
less affected by noise, channel and language variations.
• Domain Adaptation:
Use techniques such as cepstral mean and variance
normalization and score normalization to handle new devices and environments.
• Anti-Spoofing Front-End:
Add dedicated spoofing detection module before
verification to filter replay and synthetic attacks.
• Multimodal Systems:
Combine voice with face, fingerprint or typing pattern
so that an attacker must fool multiple modalities at once.
• Edge-Friendly Models:
Design lightweight neural architectures that run on
mobile phones and embedded boards with low delay and power consumption.
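The two normalization techniques named under Domain Adaptation can be sketched briefly. Per-utterance CMVN and Z-norm style score normalization against a cohort of impostor scores are shown here as assumed variants; other forms (sliding-window CMVN, T-norm, S-norm) exist, and the function names are illustrative.

```python
import numpy as np

def cmvn(feats, eps=1e-8):
    """Cepstral mean and variance normalization over one utterance.

    feats has shape (n_frames, n_coeffs); each coefficient is shifted to
    zero mean and scaled to unit variance, which removes a constant
    offset in the cepstral domain such as a fixed channel response.
    """
    feats = np.asarray(feats, dtype=np.float64)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)

def z_norm(score, cohort_scores, eps=1e-8):
    """Normalize a raw score using impostor (cohort) score statistics."""
    cohort = np.asarray(cohort_scores, dtype=np.float64)
    return float((score - cohort.mean()) / (cohort.std() + eps))
```

Because CMVN subtracts the per-utterance mean, adding a constant vector to every frame leaves the normalized features unchanged.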
CONCLUSION
• Summary:
Speaker Recognition and Voice Biometrics provide an automatic way
to recognize or verify a person using only their voice signal.
• Pipeline:
The system performs speech capture, pre-processing, MFCC-based
feature extraction, speaker modelling and final decision making.
• Benefits:
Voice biometrics offer convenient, hands-free and password-free
access for many real-world applications such as banking and smart devices.
• Open Issues:
Accuracy in noisy and mismatched conditions and robustness
against spoofing attacks remain active research areas in this field.
THANK YOU
SUGGESTIONS / QUESTIONS