Real-Time Deepfake Audio Detection
The document identifies three weaknesses in existing deepfake audio detectors, particularly in live communication scenarios. These detectors often achieve high offline accuracy, but their performance declines sharply when they face compressed, re-recorded, or novel attacks during live calls. First, they are static: models are trained on fixed corpora and do not adapt to unseen algorithms. Second, they are heavy: architectures such as ResNet or LSTM exceed 30 million parameters, which limits their use on edge devices. Third, they are typically opaque, offering no human-interpretable rationale, which hampers forensic acceptance.
The document proposes risk mitigation strategies for two deployment concerns: privacy and latency overshoot. For privacy, the system processes audio on-device so that only the processed logits leave the device, which minimizes exposure of raw audio. For latency overshoot, particularly on low-end hardware, the document suggests model pruning and NPU offload paths. It also provides a fallback to tiered cloud validation when necessary to preserve real-time engagement.
The proposed detector is claimed to outperform existing systems such as LCNN and RawNetLite by being more robust under unseen attacks, with a 15% relative reduction in EER. It is also claimed to halve the model size compared with traditional models such as ResNet or LSTM stacks. The hybrid CNN–Transformer architecture supports efficient real-time processing, and attention roll-out provides explainability, which sets the detector apart from current models that often lack transparency and adaptability.
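Because the 15% figure is relative rather than absolute, a short worked example may help. The sketch below is illustrative only: the baseline EER of 10% is a hypothetical number chosen for the arithmetic, not a result from the document, and the function names are assumptions.

```python
def eer_after_relative_reduction(baseline_eer, relative_reduction=0.15):
    """Return the EER implied by a relative reduction of the baseline."""
    return baseline_eer * (1.0 - relative_reduction)


def relative_reduction(baseline_eer, new_eer):
    """Return the relative reduction achieved when moving from baseline to new."""
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical baseline: a 10% EER shrinks by 15% of itself,
# not by 15 percentage points.
hypothetical_baseline = 10.0
target = eer_after_relative_reduction(hypothetical_baseline)
print(f"baseline {hypothetical_baseline:.1f}% -> target {target:.1f}%")
```

Under this reading, a detector would need to bring a 10% baseline down to 8.5% to meet the claim.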
To improve generalization to unseen deepfake synthesis methods, the document proposes several strategies. These include continual learning with a replay buffer and domain-mix training, both intended to handle emerging synthesis methods not seen in training. The framework also uses a hybrid CNN–Transformer architecture that combines spectral features with self-supervised waveform tokens to improve robustness to compression and channel noise. Generalization is assessed through cross-dataset evaluations on several benchmark datasets.
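The replay-buffer idea can be sketched as follows. The document does not say how the buffer is filled or sampled, so this minimal example assumes reservoir sampling and a fixed replay fraction. The class name `ReplayBuffer`, the helper `mixed_batch`, and all parameter values are hypothetical.

```python
import random


class ReplayBuffer:
    """Fixed-capacity store of past examples (reservoir sampling is an assumption)."""

    def __init__(self, capacity, seed=0):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of examples offered so far
        self._rng = random.Random(seed)

    def add(self, example):
        """Keep each example seen so far with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            slot = self._rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def sample(self, k):
        """Draw up to k stored examples without replacement."""
        return self._rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_examples, buffer, replay_fraction=0.5):
    """Mix new-domain examples with replayed old ones, then store the new ones."""
    n_replay = int(len(new_examples) * replay_fraction)
    batch = list(new_examples) + buffer.sample(n_replay)
    for example in new_examples:
        buffer.add(example)
    return batch


# Usage with placeholder labels standing in for audio clips.
buffer = ReplayBuffer(capacity=3)
for clip in ["old_a", "old_b", "old_c", "old_d"]:
    buffer.add(clip)
batch = mixed_batch(["new_1", "new_2", "new_3", "new_4"], buffer)
```

Training on such mixed batches is one common way to limit forgetting of older attack types while adapting to new ones. Whether the proposal uses this exact policy is not stated.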
Explainability is crucial for audio deepfake detection systems because it makes decisions transparent, which is essential for trust, regulatory compliance, and forensic analysis. The project addresses this requirement with an attention roll-out explanation module that visualizes the frequency bands driving each decision, allowing analysts to understand the rationale behind the model's output. This feature is intended to correlate model saliency with human perceptual cues and to aid legal admissibility.
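As an illustration of the attention roll-out technique named above, the sketch below follows the commonly used formulation: each layer's attention matrix is averaged with the identity to account for residual connections, and the results are multiplied across layers. It assumes head-averaged, row-normalized attention matrices and uses made-up toy values. How the document's module maps tokens back to frequency bands is not specified and is not shown here.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layers):
    """Combine per-layer attention into one token-to-token relevance matrix.

    layers: list of n x n attention matrices, ordered from the first layer
    to the last, each with rows summing to 1.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Average with the identity to model the residual connection;
        # rows still sum to 1 afterwards.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout


# Toy example: two layers, three tokens (values are illustrative only).
layer1 = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
layer2 = [[0.5, 0.25, 0.25], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
rollout = attention_rollout([layer1, layer2])
for row in rollout:
    print([round(value, 3) for value in row])
```

Each row of the result indicates how much every input token contributes to one output token. If tokens correspond to spectrogram patches, those weights could be aggregated per frequency band for display.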
The proposed deep-learning framework for detecting deepfake audio in real-time communication includes a stream-aware inference pipeline that processes 1-second audio windows with less than 250 ms end-to-end latency. It employs a hybrid CNN–Transformer detector that combines spectral features (CQCC/LFCC) with self-supervised waveform tokens (Wav2Vec-style) to improve robustness to compression and channel noise. An attention roll-out explanation module visualizes the frequency bands influencing each decision for analyst review. Cross-dataset generalization is evaluated on several benchmarks, and the model is packaged as a C++/ONNX edge library for integration with WebRTC or SIP media servers.
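The windowing step of such a stream-aware pipeline can be sketched as follows. Only the 1-second window length comes from the document. The 16 kHz sample rate, the 0.5-second hop, the 20 ms chunk size, and the class name `WindowBuffer` are assumptions made for illustration.

```python
class WindowBuffer:
    """Accumulate small audio chunks and emit fixed-length analysis windows."""

    def __init__(self, sample_rate=16000, window_s=1.0, hop_s=0.5):
        self.window = int(sample_rate * window_s)  # samples per window
        self.hop = int(sample_rate * hop_s)        # samples between window starts
        self._samples = []

    def push(self, chunk):
        """Add a chunk of samples and return any windows that are now complete."""
        self._samples.extend(chunk)
        ready = []
        while len(self._samples) >= self.window:
            ready.append(self._samples[:self.window])
            # Drop only one hop so consecutive windows overlap.
            del self._samples[:self.hop]
        return ready


# Simulate 1.5 s of audio arriving as 20 ms chunks of silence.
buffer = WindowBuffer()
windows = []
for _ in range(75):
    windows.extend(buffer.push([0.0] * 320))
print(len(windows), "windows of", len(windows[0]), "samples")
```

Each emitted window would then be passed to feature extraction and the detector. Overlapping windows trade extra compute for more frequent decisions, a design choice the document does not address.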
Despite these advances, challenges remain in deepfake audio detection. One is achieving real-time processing within stringent latency budgets in diverse and uncontrolled environments such as live communication platforms. Another is the need for models to generalize across datasets and to remain robust to audio compression and channel effects. Explainability and interpretability of model outputs also remain a priority, since opaque systems may not satisfy forensic or regulatory demands. Finally, continuous adaptation to emerging deepfake synthesis technologies is an ongoing challenge.
The proposed framework targets efficient real-time processing of audio streams through a streamlined inference pipeline whose latency budget stays under the 250 ms target. The budget is split into feature extraction (approximately 90 ms), model inference using FP16 precision on CPU/NPU (approximately 110 ms), and decision-making and callback processing (approximately 30 ms), for a total of approximately 230 ms. A lightweight Conformer-Lite encoder, quantisation-aware training, and a model size of less than 10 MB further support efficient edge deployment.
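The budget above can be checked with simple arithmetic. The sketch below uses the stage estimates and the 10 MB size limit from the document. The parameter ceiling it derives is a rough estimate that assumes 1 MB = 1,000,000 bytes and ignores any non-weight data in the model file. The function names are hypothetical.

```python
BUDGET_MS = 250  # end-to-end latency target

# Approximate per-stage estimates quoted in the text.
STAGE_MS = {
    "feature_extraction": 90,
    "model_inference_fp16": 110,
    "decision_and_callback": 30,
}


def headroom_ms(stage_ms, budget_ms=BUDGET_MS):
    """Return the slack left after all stages; negative means overshoot."""
    return budget_ms - sum(stage_ms.values())


def max_parameters(size_bytes, bytes_per_param):
    """Upper bound on weight count for a given file size (weights only)."""
    return size_bytes // bytes_per_param


total_ms = sum(STAGE_MS.values())
print(f"total {total_ms} ms, headroom {headroom_ms(STAGE_MS)} ms")

# A 10 MB limit at 2 bytes per FP16 weight; 1 byte per weight if stored as int8.
print("FP16 ceiling:", max_parameters(10_000_000, 2))
print("int8 ceiling:", max_parameters(10_000_000, 1))
```

With the quoted estimates the pipeline has about 20 ms of slack, so an overshoot of more than 20 ms in any stage would breach the target unless another stage compensates.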
The expected outcomes of the proposed system include a real-time detector that outperforms existing models such as LCNN and RawNetLite under unseen attacks, reducing the equal error rate (EER) by 15% while halving the model size. An open-source toolkit will be released for integrating audio forgery detection into SIP/RTC stacks. The project will also provide an annotated benchmark of 100 hours of live-stream-style audio with ground-truth deepfakes as a resource for future research. Finally, explainability guidelines will be developed to correlate model saliency with human perceptual cues and to aid legal admissibility.
The document outlines several evaluation metrics for the audio deepfake detection model. The primary metric is the Equal Error Rate (EER), assessed on the ADD 2023 real-time track. Secondary metrics are detection latency, Deception Success Rate (DSR), and computational footprint measured in MACs and RAM usage. The testing plan includes ablation studies, such as feature branch removal and spectrogram patch size, and statistical tests, such as McNemar's test for paired proportions, with bootstrapped 95% confidence intervals used to evaluate model performance under different conditions.
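Two of the quantities above can be sketched in a few lines. This is a minimal illustration, not the document's evaluation code. It assumes that a higher score means "more likely spoofed", approximates the EER by sweeping observed scores as thresholds, and uses the exact binomial form of McNemar's test. All scores and counts are made-up toy values.

```python
from math import comb


def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER: the point where false rejection equals false acceptance."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # Bona fide audio flagged as spoof (false rejection).
        frr = sum(s >= threshold for s in bonafide_scores) / len(bonafide_scores)
        # Spoofed audio passed as bona fide (false acceptance).
        far = sum(s < threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: trials where only model A was correct.
    c: trials where only model B was correct.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy scores: one bona fide clip scores above one spoofed clip.
bonafide = [0.1, 0.2, 0.3, 0.6]
spoof = [0.4, 0.7, 0.8, 0.9]
eer = equal_error_rate(bonafide, spoof)
print("EER:", eer)

# Toy disagreement counts between two detectors on the same trials.
p_value = mcnemar_exact_p(0, 10)
print("McNemar p:", p_value)
```

McNemar's test looks only at trials where the two detectors disagree, which is why it suits paired comparisons on a shared test set.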