
Real-Time Deepfake Audio Detection

This document presents a case study on developing a deep learning framework for real-time detection of deepfake audio, addressing the increasing threats of identity fraud and misinformation. The proposed system aims to achieve low-latency detection with an explainable architecture, utilizing a hybrid CNN-Transformer model and focusing on robustness against various audio manipulation techniques. The project outlines objectives, methodologies, expected outcomes, and a timeline for implementation, emphasizing the need for efficient, real-time audio processing in communication platforms.

Synopsis on

Case Study of Emerging Areas of Technology


(AIDS361)

Deep Learning for Detecting Deepfake Audio in Real-Time Communication

BACHELOR OF TECHNOLOGY

ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

Submitted To: Mr. Ritesh Kumar
Submitted By: Rishabh Chaturvedi
Roll No.: 03996211923
Sem: 5th Sec: T20
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
Dr. AKHILESH DAS GUPTA INSTITUTE OF TECHNOLOGY & MANAGEMENT
(FORMERLY NORTHERN INDIA ENGINEERING COLLEGE)
(AFFILIATED TO GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI)
SHASTRI PARK, DELHI – 110053

ODD SESSION, 2024-27


TABLE OF CONTENTS
I. Introduction
II. Objectives
III. Literature Review
IV. Methodology
V. Case Study Description
VI. Analysis & Discussion
VII. Conclusion
VIII. References

Take-away – Text-to-speech (TTS) and voice-cloning models now fabricate speech that fools
both humans and speaker-verification systems, enabling live vishing, identity fraud and
political misinformation. This project proposes a low-latency, transformer-enhanced
deep-learning framework that flags synthetic speech in an audio stream within 250 ms, is
hardened against unseen attacks, and explains its decisions to support trust and regulatory
compliance.

1 Background and Motivation

• Verified deepfake incidents surged 41% to 487 cases and US $347 M losses in Q2 2025
alone.

• Modern TTS/VC pipelines (e.g., VALL-E, WaveNet) generate near-perfect prosody
and timbre.

• Existing detectors achieve high offline accuracy but regress sharply when confronted
with compressed, re-recorded or novel attacks in live calls.

• Live platforms (VoIP, conferencing, call-centres) require decisions in <300 ms to
intercept fraudulent dialogue without audible delay.

2 Research Gap & Problem Statement

Most published models are (i) static – trained on fixed corpora such as ASVspoof and FoR and
brittle to unseen algorithms; (ii) heavy – ResNet/LSTM stacks exceed 30 M parameters,
precluding on-device use; (iii) opaque – offering no human-interpretable rationale, hampering
forensic acceptance. The project asks:

How can we design an explainable, lightweight deep-learning architecture that generalises to
emerging synthesis methods yet meets real-time streaming budgets?

3 Objectives

1. Build a stream-aware inference pipeline that processes 1-s audio windows with < 250
ms end-to-end latency.

2. Develop a hybrid CNN–Transformer detector that fuses spectral (CQCC/LFCC) and
self-supervised waveform tokens (Wav2Vec-style) for robustness to compression and
channel noise.

3. Implement an attention roll-out explanation module to visualise frequency bands
driving each decision for analyst review.

4. Evaluate cross-dataset generalisation on ASVspoof 2021, ADD 2023, FoR, In-the-Wild
and the FakeAVCeleb benchmark.

5. Package the model into a C++/ONNX edge library callable by WebRTC or SIP media
servers.
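
The stream-aware windowing in objective 1 can be sketched as follows. This is a minimal illustration and not the project's implementation: the 250 ms hop, the function name `sliding_windows` and the constants are assumptions introduced here; only the 16 kHz rate and the 1-s window come from the synopsis.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List

SAMPLE_RATE = 16_000          # 16 kHz, as listed in the data-curation plan
WINDOW_SAMPLES = SAMPLE_RATE  # 1-s analysis window (objective 1)
HOP_SAMPLES = 4_000           # ASSUMPTION: 250 ms hop; the synopsis does not state one


def sliding_windows(chunks: Iterable[List[float]],
                    window: int = WINDOW_SAMPLES,
                    hop: int = HOP_SAMPLES) -> Iterator[List[float]]:
    """Yield overlapping fixed-length windows from a stream of audio chunks.

    A bounded deque acts as a ring buffer: once it is full, every new
    sample pushes out the oldest one, so memory use stays constant.
    """
    buf: Deque[float] = deque(maxlen=window)
    seen = 0  # total number of samples received so far
    for chunk in chunks:
        for sample in chunk:
            buf.append(sample)
            seen += 1
            # Emit the first window when the buffer fills, then one per hop.
            if seen >= window and (seen - window) % hop == 0:
                yield list(buf)


if __name__ == "__main__":
    # Two seconds of silence delivered as 20 ms packets (320 samples each).
    packets = [[0.0] * 320 for _ in range(100)]
    print(sum(1 for _ in sliding_windows(packets)), "windows emitted")
```

Each emitted window would then be handed to the feature extractor and detector. A shorter hop gives more frequent decisions at the cost of more inference calls per second.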

4 Proposed Methodology

Phase | Key Tasks | Planned Techniques | Deliverables
Data curation | Stream 16 kHz chunks from ASVspoof, FoR, In-the-Wild; simulate jitter, codec loss | SoX augmentation, Opus/G.711 transcode | Balanced train/val/test shards
Feature pipeline | Dual branch: (a) 128-bin log-Mel & CQCC spectrograms; (b) raw waveform tokens via Wav2Vec 2.0 base | PyTorch torchaudio, mixed-precision | Realtime feature extractor
Model design | Conformer-Lite encoder (≈4 M params); lightweight LCNN front-end; gated cross-attention fusion | Knowledge distillation, quantisation-aware training | <10 MB .onnx
Explainability | Attention roll-out heatmaps, gradient-guided spectral masking | Captum, custom GUI | Analyst dashboard
Deployment | C++ inference, ring-buffer double-buffering to overlap I/O and compute | AVX2/ARM-NEON kernels | WebRTC plug-in
Evaluation | Metrics: EER, min-tDCF, AUC, DSR and latency budget | DeepfakeBench harness | Benchmark report
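
The attention roll-out listed under Explainability can be illustrated with a small dependency-free sketch. It follows the commonly used roll-out recipe (mix each layer's attention with the identity to account for residual connections, renormalise the rows, then multiply across layers). The function names are hypothetical, the input is assumed to be head-averaged attention per layer, and mapping token relevance back to frequency bands is outside this sketch.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def attention_rollout(layers: List[Matrix]) -> Matrix:
    """Combine per-layer attention maps into one token-to-token relevance map.

    `layers` holds one square, row-stochastic matrix per encoder layer,
    ordered from the first layer to the last (ASSUMPTION: heads already averaged).
    """
    n = len(layers[0])
    rollout: Matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Residual connection: half attention, half identity.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        # Renormalise so that each row still sums to 1.
        mixed = [[v / sum(row) for v in row] for row in mixed]
        rollout = matmul(mixed, rollout)
    return rollout


if __name__ == "__main__":
    uniform = [[0.5, 0.5], [0.5, 0.5]]
    for row in attention_rollout([uniform, uniform]):
        print(row)
```

Row i of the result estimates how much each input token contributes to output token i. In a real model the per-layer matrices would be read from the encoder's attention weights.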

Latency Budget

• Feature extraction ≈ 90 ms

• Model inference (FP16 on CPU/NPU) ≈ 110 ms

• Decision and callback ≈ 30 ms


Total ≈ 230 ms (meets sub-250 ms target).
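
As a sanity check on the arithmetic, the budget above can be written down and compared against measured stage timings. The sketch below is illustrative only: the stage names, the `over_budget` helper and the timing wrapper are assumptions, and the per-stage figures are the planned estimates from the list above, not measurements.

```python
import time
from typing import Callable, Dict

# Planned per-stage budget in milliseconds (figures from the synopsis).
BUDGET_MS: Dict[str, float] = {
    "feature_extraction": 90.0,
    "model_inference": 110.0,
    "decision_callback": 30.0,
}
TARGET_MS = 250.0  # end-to-end latency target


def total_budget_ms(budget: Dict[str, float]) -> float:
    """Sum of all stage budgets."""
    return sum(budget.values())


def headroom_ms(budget: Dict[str, float], target: float = TARGET_MS) -> float:
    """Slack left between the summed budget and the end-to-end target."""
    return target - total_budget_ms(budget)


def time_stage_ms(fn: Callable[[], object]) -> float:
    """Wall-clock duration of one call to `fn`, in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000.0


def over_budget(measured: Dict[str, float],
                budget: Dict[str, float]) -> Dict[str, float]:
    """Return, per stage, by how many milliseconds it exceeded its budget."""
    return {name: measured[name] - limit
            for name, limit in budget.items()
            if measured.get(name, 0.0) > limit}


if __name__ == "__main__":
    print("total:", total_budget_ms(BUDGET_MS), "ms")
    print("headroom:", headroom_ms(BUDGET_MS), "ms")
```

The planned figures leave 20 ms of headroom under the 250 ms target, so a single stage overshooting by more than that would break the end-to-end goal.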

5 Expected Outcomes and Contributions

1. Real-time detector that outperforms LCNN and RawNetLite under unseen attacks by
≥15% relative EER reduction while halving model size.

2. Open-sourced toolkit (Apache 2.0) for integrating audio forgery detection into
SIP/RTC stacks.

3. Annotated benchmark of 100 h live-stream style audio with ground-truth deepfakes for
future research.

4. Explainability guidelines correlating model saliency with human perceptual cues,
aiding legal admissibility.
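
One way to read the "≥15% relative EER" target in outcome 1 is as a relative reduction against the baseline's EER. The formula below states that reading; the symbols are introduced here for illustration, and the 8.0% baseline is a made-up number, not a reported result.

```latex
\Delta_{\mathrm{rel}}
  = \frac{\mathrm{EER}_{\mathrm{baseline}} - \mathrm{EER}_{\mathrm{proposed}}}
         {\mathrm{EER}_{\mathrm{baseline}}}
  \ge 0.15
% Hypothetical example: EER_baseline = 8.0\% requires
% EER_proposed \le 8.0\% \times (1 - 0.15) = 6.8\%.
```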

6 Evaluation Plan

• Primary metric: Equal Error Rate (EER) on ADD 2023 real-time track.

• Secondary: Detection latency, Deception Success Rate (DSR), and computational
footprint (MACs, RAM).

• Ablations: feature branch removal, token length, spectrogram patch size.

• Statistical tests: McNemar for paired proportions, bootstrapped 95% CIs.
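
The primary metric and the paired test named above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions: scores are taken to be "higher means more likely spoof", the EER is approximated by a threshold sweep over observed scores, and McNemar's test is computed in its exact binomial form. The function names are hypothetical, and bootstrapped confidence intervals are not covered.

```python
import math
from typing import Sequence


def equal_error_rate(bona_scores: Sequence[float],
                     spoof_scores: Sequence[float]) -> float:
    """Approximate EER by sweeping a threshold over all observed scores.

    ASSUMPTION: a clip is flagged as spoof when its score >= threshold.
    """
    thresholds = sorted(set(bona_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # above every score: nothing is flagged
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        # False alarm: bona fide speech flagged as spoof.
        false_alarm = sum(s >= t for s in bona_scores) / len(bona_scores)
        # Miss: spoofed speech accepted as bona fide.
        miss = sum(s < t for s in spoof_scores) / len(spoof_scores)
        gap = abs(false_alarm - miss)
        if gap < best_gap:
            best_gap, best_eer = gap, (false_alarm + miss) / 2.0
    return best_eer


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for paired classifier decisions.

    b: trials where system A is right and system B is wrong.
    c: trials where system A is wrong and system B is right.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    bona = [0.1, 0.4, 0.35, 0.8]
    spoof = [0.2, 0.6, 0.7, 0.9]
    print("EER:", equal_error_rate(bona, spoof))
    print("McNemar p:", mcnemar_exact_p(10, 0))
```

Only the discordant pairs enter McNemar's test, which is why it suits comparing two detectors scored on the same evaluation clips.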

7 Resources & Timeline (12 months)

Quarter Milestones

Q1 Dataset curation, baseline LCNN re-implementation

Q2 Feature extractor, Conformer-Lite prototype

Q3 Cross-attention fusion + QAT, initial latency tuning

Q4 Explainability module, user study with 10 forensic analysts

Q5 Edge packaging, on-prem call-centre pilot test

Q6 Paper submission, code/data release

8 Risks and Mitigation

• Emerging synthesis unseen in training – adopt continual learning with replay buffer
and domain-mix training.

• Latency overshoot on low-end hardware – provide pruning + NPU offload path; fall
back to tiered cloud validation.

• Privacy concerns with audio upload – process on-device; only logits leave device.
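
The replay buffer mentioned in the first mitigation could take many forms. The sketch below assumes reservoir sampling as the retention policy, which keeps a uniform random subset of everything seen so far in fixed memory; the synopsis does not specify a policy, and the class and method names here are hypothetical.

```python
import random
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class ReservoirReplayBuffer(Generic[T]):
    """Fixed-size buffer holding a uniform random sample of past examples."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[T] = []
        self.seen = 0  # number of examples offered to the buffer so far
        self._rng = random.Random(seed)

    def add(self, item: T) -> None:
        """Offer one example; it is kept with probability capacity / seen."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        slot = self._rng.randrange(self.seen)  # uniform in [0, seen)
        if slot < self.capacity:
            self.items[slot] = item

    def sample(self, k: int) -> List[T]:
        """Draw up to k stored examples without replacement."""
        return self._rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_examples: List[T],
                buffer: ReservoirReplayBuffer[T],
                replay_count: int) -> List[T]:
    """Combine fresh examples with replayed ones, then store the fresh ones."""
    batch = list(new_examples) + buffer.sample(replay_count)
    for example in new_examples:
        buffer.add(example)
    return batch


if __name__ == "__main__":
    buf: ReservoirReplayBuffer[str] = ReservoirReplayBuffer(capacity=3, seed=0)
    print(mixed_batch(["clip_a", "clip_b"], buf, replay_count=2))
    print(mixed_batch(["clip_c"], buf, replay_count=2))
```

Mixing replayed clips from earlier synthesis methods into each update batch is one common way to limit forgetting when the detector is fine-tuned on newly observed attacks.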

By uniting efficient spectro-temporal encoding with transformer attention and stringent real-
time engineering, the project aims to deliver a deployable defence against the rapidly escalating
threat of live deepfake speech, safeguarding communications, finance and democracy alike.

