Speaker Recognition and Voice Biometrics

The document provides an overview of speaker recognition and voice biometrics systems, highlighting their definition, key components, and differences from speech recognition. It discusses various types of speaker recognition, the system architecture, feature extraction methods like MFCC, and the enrollment and verification phases. Additionally, it addresses applications, challenges, and future improvements in the field, emphasizing the importance of accuracy and security in voice-based systems.

Working of Speaker Recognition and Voice Biometrics Systems

Presented By:
Deepanshu Kumar - IEC2022073
Prashant Agrawal - IEC2022120
Aditya Raj Singh - IEC2022054
Shivam Kumar - IEC2022055
Department of ECE, IIIT Allahabad, India

Supervised By:
Dr. Ramesh Kumar Bhukya
Assistant Professor, Dept. of ECE
IIIT Allahabad, India
INTRODUCTION

• Definition:
Speaker Recognition is a technique where a computer system
identifies or verifies a person using only their voice.
• Difference from Speech Recognition:
Speech recognition focuses on "what is said",
while speaker recognition focuses on "who is speaking".
• Usage Areas:
Phone banking, smart assistants, online exams and secure access control.
• Key Components:
Speech Signal Processing combined with Machine Learning and
Pattern Recognition methods.
TYPES OF SPEAKER RECOGNITION

• Speaker Identification:
System chooses which registered speaker is talking
from a group of enrolled speakers.
Example: call center system deciding which customer is on the call.
• Speaker Verification:
System checks if the claimed identity is genuine or fake.
Example: system verifies "Is this really the account holder?" using voice.
• Text-Dependent Systems:
User speaks a fixed pass-phrase such as
"My voice is my password" during both enrollment and testing.
• Text-Independent Systems:
User can speak any sentence.
System focuses on speaker characteristics, not the exact words.
SYSTEM ARCHITECTURE AND FLOW

• Overall Pipeline:
Complete sequence from microphone input to final
speaker accept / reject decision.
• Front-End:
Speech capture followed by pre-processing and Voice Activity Detection
to prepare clean speech segments.
• Feature Stage:
Extract MFCC and related features from each frame of speech
and form a stream of feature vectors.
• Back-End:
Use feature vectors to build speaker models, store them in a database
and later compare test features with stored models.
SYSTEM ARCHITECTURE AND FLOW

[Block diagram] Speech Capture → Pre-processing & VAD → Feature Extraction (MFCC) → Speaker Modelling → Database (Enrollment)

Simple left-to-right flow of a speaker recognition system from raw speech to final decision.
SPEECH CAPTURE AND PRE-PROCESSING

• Sampling:
Speech is typically recorded at 8 kHz for telephone-quality audio
or at 16 kHz for higher-quality applications.
• DC Offset Removal:
Signal is shifted so that average value becomes zero,
which avoids bias in later processing.
• Pre-emphasis Filter:
High frequencies are slightly boosted to balance the
natural tilt of speech spectrum and highlight important information.
• Normalization:
Overall amplitude is scaled so that recordings from different
sessions have comparable loudness.
• Voice Activity Detection (VAD):
Detects regions where speech is present and
removes long silence or background-only segments.
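
The pre-processing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenters' implementation: the function names, the peak-based normalization and the -30 dB energy rule for VAD are assumptions chosen for simplicity, and practical VADs are usually more elaborate.

```python
import math

def remove_dc(x):
    """Subtract the mean so the signal averages to zero."""
    mean = sum(x) / len(x)
    return [s - mean for s in x]

def peak_normalize(x, peak=1.0):
    """Scale so the largest absolute sample equals `peak`."""
    top = max(abs(s) for s in x)
    if top == 0.0:
        return list(x)
    return [s * peak / top for s in x]

def energy_vad(x, frame_len, threshold_db=-30.0):
    """Return one True/False flag per non-overlapping frame.

    Assumed heuristic: a frame counts as speech when its mean energy
    is within `threshold_db` of the loudest frame in the recording.
    """
    energies = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    loudest = max(energies)
    if loudest == 0.0:
        return [False] * len(energies)
    floor = loudest * 10.0 ** (threshold_db / 10.0)
    return [e > floor for e in energies]

# Synthetic recording at 8 kHz: silence, a 200 Hz tone, silence,
# all riding on a constant DC offset of 0.2.
sr = 8000
quiet = [0.2] * 800
tone = [0.2 + 0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]
clean = peak_normalize(remove_dc(quiet + tone + quiet))
flags = energy_vad(clean, frame_len=160)  # 20 ms frames
```

Only the frames flagged True would be passed on to feature extraction.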
FEATURE EXTRACTION (MFCC)

• Why Features:
Raw waveform has too many samples and is not directly suitable
for pattern matching, so we convert it into compact feature vectors.
• MFCC Concept:
Mel-Frequency Cepstral Coefficients capture the overall
spectral shape of speech in a way similar to human hearing.
• MFCC Pipeline:
Pre-emphasis → Framing → Windowing → FFT → Mel filterbank →
log energies → Discrete Cosine Transform → MFCCs.
MFCC – STEP BY STEP

• Pre-emphasis:
y[n] = x[n] − a·x[n−1], where a is around 0.95.
This boosts higher frequencies which are important for intelligibility.
• Framing:
Speech is divided into short overlapping frames
(typically 20–30 ms) where the signal is almost stationary.
• Windowing:
Each frame is multiplied by a Hamming window to reduce discontinuities
at frame edges and lower spectral leakage.
• FFT and Spectrum:
Fast Fourier Transform converts each windowed frame from
time domain to frequency domain magnitude spectrum.
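
The four steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the 25 ms frame with a 10 ms hop is one common choice within the range given above, and a direct O(N^2) DFT stands in for the FFT so that the example needs no external library (it produces the same values as an FFT, only more slowly).

```python
import cmath
import math

def pre_emphasis(x, a=0.95):
    """y[n] = x[n] - a*x[n-1]; the first sample is passed through unchanged."""
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames; an incomplete tail is dropped."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, computed with a direct DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

# 0.1 s of a 1 kHz tone sampled at 8 kHz, 25 ms frames with a 10 ms hop.
sr = 8000
x = [math.sin(2 * math.pi * 1000 * n / sr) for n in range(800)]
frames = frame_signal(pre_emphasis(x), frame_len=200, hop=80)
window = hamming(200)
spectrum = magnitude_spectrum([s * w for s, w in zip(frames[1], window)])
peak_bin = spectrum.index(max(spectrum))  # bin spacing is sr / 200 = 40 Hz
```

With a 200-sample frame each bin is 40 Hz wide, so the 1 kHz tone is expected to peak at bin 25.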
MEL FILTERBANK AND CEPSTRUM

• Mel Filterbank:
Magnitude spectrum is passed through a bank of triangular filters
spaced on mel scale which matches human perception of pitch.
• Log Energies:
Logarithm of filter outputs is taken to approximate loudness perception
and to turn multiplicative effects in the spectrum (such as a channel gain) into additive ones.
• Cepstrum via DCT:
Discrete Cosine Transform of log filter energies produces MFCCs,
which compactly represent the spectral envelope of speech.
• Dynamic Features:
First and second time derivatives of MFCCs (delta and delta-delta)
are often added to capture speech dynamics.
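
The filterbank, log, DCT and delta steps above can be sketched as below. The sketch makes several assumptions that the slides do not specify: the mel formula 2595·log10(1 + f/700) (one common definition), 20 filters, 13 retained coefficients, an unnormalized DCT-II, and a simple two-frame difference for the deltas. All function names are hypothetical.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters over DFT bins 0..n_fft/2, equally spaced in mel."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edges = [f * n_fft / sample_rate for f in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc_from_spectrum(mags, bank, n_ceps=13, eps=1e-10):
    """Magnitude spectrum -> mel filter outputs -> log -> DCT-II."""
    log_e = [math.log(sum(w * m for w, m in zip(row, mags)) + eps) for row in bank]
    n = len(log_e)
    return [sum(log_e[j] * math.cos(math.pi * c * (j + 0.5) / n) for j in range(n))
            for c in range(n_ceps)]

def deltas(frames):
    """First time derivative per coefficient via a +/-1 frame difference."""
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

bank = mel_filterbank(n_filters=20, n_fft=200, sample_rate=8000)
ceps = mfcc_from_spectrum([1.0] * 101, bank)  # flat test spectrum
```

Applying `deltas` to the delta sequence itself gives the delta-delta features.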
SPEAKER MODELLING

• Goal:
Represent each speaker by a mathematical model that captures their unique
vocal characteristics over many frames.
• Gaussian Mixture Models (GMM):
Probability density of MFCC vectors is modelled
as a weighted sum of multiple Gaussian components.
• GMM-UBM Approach:
A universal background GMM is trained using large multi-speaker data
and then adapted to each individual speaker using MAP adaptation.
• I-vectors and X-vectors:
Low-dimensional embeddings that summarize a full utterance
into a single fixed-length vector for classification or scoring;
i-vectors are derived from GMM-UBM statistics, x-vectors from a deep neural network.
• Neural Network Embeddings:
Deep neural networks can directly learn speaker embeddings
from spectrograms or MFCC sequences.
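
As a rough illustration of GMM-based scoring, the sketch below evaluates a diagonal-covariance GMM and the average log-likelihood ratio between a speaker model and a UBM. It assumes the models are already trained and represents each one as a hypothetical (weights, means, variances) tuple; EM training and MAP adaptation are not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log sum_k w_k N(x | mean_k, var_k), using log-sum-exp for stability."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def average_llr(frames, speaker_gmm, ubm):
    """Mean per-frame log-likelihood ratio: speaker model versus UBM."""
    total = sum(gmm_log_likelihood(f, *speaker_gmm) - gmm_log_likelihood(f, *ubm)
                for f in frames)
    return total / len(frames)

# Toy one-dimensional models: (weights, means, variances).
ubm = ([1.0], [[0.0]], [[1.0]])
speaker = ([1.0], [[2.0]], [[1.0]])
score = average_llr([[2.0], [2.1]], speaker, ubm)
```

A positive score means the frames are better explained by the speaker model than by the background model.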
ENROLLMENT PHASE

• Data Collection:
User speaks several prompted or free sentences in a quiet room
using the target microphone or device.
• Feature Extraction:
System performs all pre-processing and extracts MFCC
and related features from the recorded speech.
• Model Training:
Using these feature vectors, a speaker model or embedding
(i-vector, x-vector or GMM) is estimated for that user.
• Template Storage:
The resulting model is stored securely in the database as a
voice print that represents that particular speaker.
• Quality Requirements:
Good enrollment needs enough duration of speech and
minimal background noise for reliable templates.
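
The enrollment steps above might look like the following sketch. It is only an illustration: the mean feature vector is used as a stand-in for a real embedding extractor (i-vector or x-vector), and the 300-frame minimum is a hypothetical quality check, roughly 3 s of speech at a 10 ms hop. Secure storage of the template is outside the scope of the example.

```python
import math

MIN_FRAMES = 300  # hypothetical minimum amount of enrollment speech

def utterance_embedding(frames):
    """Stand-in embedding: the mean feature vector of one utterance."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0.0 else list(v)

def enroll(utterances, min_frames=MIN_FRAMES):
    """Build a voice print from several utterances (lists of feature frames)."""
    total = sum(len(u) for u in utterances)
    if total < min_frames:
        raise ValueError(f"need at least {min_frames} frames, got {total}")
    embeddings = [l2_normalize(utterance_embedding(u)) for u in utterances]
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    return l2_normalize(mean)
```

Averaging over several utterances reduces the influence of any single noisy recording on the stored template.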
VERIFICATION / IDENTIFICATION PHASE

• Test Recording:
During use, the user again speaks a sentence which is captured
through the microphone in similar conditions.
• Feature Extraction:
The same MFCC-based pipeline is applied to the test recording
to generate feature vectors or an embedding.
• Verification Mode:
Test voice is matched only against the claimed speaker's model;
score is compared with a threshold to accept or reject.
• Identification Mode:
Test voice is matched against all enrolled models and the
speaker with highest score is selected as the predicted identity.
• Threshold Tuning:
Decision threshold is chosen to maintain a good balance between
false acceptance and false rejection errors.
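
The two decision modes above can be sketched as follows, assuming each enrolled speaker is stored as a fixed-length embedding and that cosine similarity is used as the score. Both are assumptions for illustration; a GMM-based system would use a likelihood ratio instead. The database layout and the 0.7 threshold are hypothetical.

```python
import math

def cosine_score(a, b):
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0

def verify(test_embedding, claimed_id, database, threshold):
    """Verification: compare only against the claimed speaker's template."""
    return cosine_score(test_embedding, database[claimed_id]) >= threshold

def identify(test_embedding, database):
    """Identification: return (speaker_id, score) of the best-scoring template."""
    scored = ((sid, cosine_score(test_embedding, tpl)) for sid, tpl in database.items())
    return max(scored, key=lambda pair: pair[1])

database = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
test = [0.9, 0.1]
accepted = verify(test, "alice", database, threshold=0.7)
best_id, best_score = identify(test, database)
```

Verification costs one comparison per attempt, while identification grows with the number of enrolled speakers.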
PERFORMANCE METRICS

• False Acceptance Rate (FAR):
Percentage of impostor trials that are wrongly
accepted as genuine users by the system.
• False Rejection Rate (FRR):
Percentage of genuine trials that are wrongly
rejected as impostors by the system.
• Equal Error Rate (EER):
Error rate at the decision threshold where FAR and FRR are equal.
Lower EER means better overall performance.
• DET Curve:
Detection Error Tradeoff curve plots FRR against FAR on normal-deviate
scaled axes and helps visually compare different systems or settings.
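
The metrics above can be computed from two lists of trial scores. The sketch below assumes that higher scores mean "more likely genuine" and that a trial is accepted when its score reaches the threshold. The EER is approximated by sweeping the observed scores as candidate thresholds, which is a simplification; evaluation toolkits usually interpolate between points.

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR (as fractions) when scores >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Approximate EER and the threshold at which it occurs.

    Picks the candidate threshold with the smallest |FAR - FRR| and
    reports the mean of the two rates at that point.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]

# Toy scores for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer, eer_threshold = equal_error_rate(genuine_scores, impostor_scores)
```

Raising the threshold lowers FAR at the cost of a higher FRR, which is the trade-off a DET curve visualizes.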
APPLICATIONS

• Banking and Finance:
Voice-based authentication for telephone banking,
customer support and high-value transaction approval.
• Smart Home and IoT Devices:
Voice biometrics used to personalize responses
and restrict access to sensitive commands on smart speakers.
• Forensics and Law Enforcement:
Speaker comparison for recorded calls or
threat messages to support investigations.
• Online Exams and Remote Work:
Continuous voice verification to reduce
impersonation and maintain academic or workplace integrity.
CHALLENGES AND LIMITATIONS

• Background Noise:
Traffic, crowd or music can corrupt features and
significantly reduce recognition accuracy.
• Channel and Device Mismatch:
Different microphones, codecs or networks
introduce variations not seen during training.
• Intra-Speaker Variability:
Same person may sound different when ill,
tired, emotional or speaking in another language.
• Spoofing Attacks:
Replay of recorded speech and modern deepfake voices
can fool naive systems if no countermeasures are used.
• Privacy and Security:
Voice prints are biometric data and must be encrypted,
access-controlled and used according to privacy regulations.
IMPROVEMENTS AND FUTURE SCOPE

• Robust Feature Design:
Explore features or learned representations that are
less affected by noise, channel and language variations.
• Domain Adaptation:
Use techniques such as cepstral mean and variance
normalization and score normalization to handle new devices and environments.
• Anti-Spoofing Front-End:
Add dedicated spoofing detection module before
verification to filter replay and synthetic attacks.
• Multimodal Systems:
Combine voice with face, fingerprint or typing pattern
so that an attacker must fool multiple modalities at once.
• Edge-Friendly Models:
Design lightweight neural architectures that run on
mobile phones and embedded boards with low delay and power consumption.
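
The two normalization techniques named under Domain Adaptation can be sketched as below. Cepstral mean and variance normalization is applied here per utterance, and zero normalization (Z-norm) is shown as one example of score normalization; the slides do not say which variant is intended, so both choices are assumptions, and the function names are hypothetical.

```python
import math

def cmvn(frames, eps=1e-10):
    """Per-utterance cepstral mean and variance normalization.

    A fixed channel adds a constant offset to every cepstral frame,
    so subtracting the per-utterance mean removes it.
    """
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
            for f in frames]

def z_norm(raw_score, cohort_scores):
    """Standardize a score using impostor (cohort) scores obtained
    against the same speaker model."""
    n = len(cohort_scores)
    mu = sum(cohort_scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in cohort_scores) / n)
    return (raw_score - mu) / sigma if sigma > 0.0 else raw_score - mu
```

After CMVN, the same utterance recorded through two channels that differ only by a constant cepstral offset yields identical features.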
CONCLUSION

• Summary:
Speaker Recognition and Voice Biometrics provide an automatic way
to recognize or verify a person using only their voice signal.
• Pipeline:
The system performs speech capture, pre-processing, MFCC-based
feature extraction, speaker modelling and final decision making.
• Benefits:
Voice biometrics offer convenient, hands-free and password-free
access for many real-world applications such as banking and smart devices.
• Open Issues:
Accuracy in noisy and mismatched conditions and robustness
against spoofing attacks remain active research areas in this field.
THANK YOU
SUGGESTIONS / QUESTIONS
