Bhujbal Knowledge City Institute of Engineering
Department of Information Technology
Final Year Project Presentation
on Topic
Deepfake Audio Detection Using Machine Learning and Deep Learning
:: Presented By ::
1. Saket Patil
2. Rau Wagh
3. Sanika Patil
4. Faiza Shekh
Contents
• Introduction
• Problem statement
• Objective
• Literature survey
• Existing system
• Proposed system
• Block diagram of the deep-fake audio detection
• Requirement analysis/models
• Proposed MODEL
• UML/ER DIAGRAM
• System architecture
• Data flow diagram
• Project plan
• Implementation
• Testing
• Conclusion
• References
Introduction
In today's digital era, deepfake technology poses a serious threat to audio authenticity.
Our project addresses this issue by developing a Machine Learning and Deep Learning-based
system to detect manipulated audio, specifically voice deepfakes. We utilize Logistic Regression,
Convolutional Neural Networks (CNN), and Generative Adversarial Networks (GAN) to analyze and
classify audio as real or fake. Among these, GAN-based models demonstrated the highest accuracy
and robustness, making them the most effective in detecting synthetic audio.
Problem Statement
The project aims to address the growing challenge of detecting deepfake audio, which has become
increasingly realistic and difficult to identify. To tackle this, a robust detection system is developed using
advanced techniques such as Logistic Regression, Convolutional Neural Networks (CNNs), and Mel-
Frequency Cepstral Coefficients (MFCCs) for audio feature extraction. Additionally, Generative Adversarial
Networks (GANs) are utilized not only to generate synthetic audio samples for training but also to enhance
detection capabilities. By comparing the performance of these models, the project seeks to improve the
accuracy and reliability of audio authentication and forensic analysis in the face of rapidly evolving deepfake
technologies.
Objectives
The primary objectives of this project are to:
• Develop a deepfake audio detection system: design and implement a deep learning-based system
capable of detecting deepfake audio with high accuracy and reliability, integrated into a web application
where users can upload audio files and receive an authenticity result with a confidence score.
• Create an accessible and user-friendly system that requires minimal technical knowledge, ensuring a
wide range of users can benefit from it.
• Compare ML and deep learning models and identify the best one for detecting deepfake audio, using
accuracy, precision, recall, and F1-score, supported by confusion matrices and visualizations.
Literature Survey
| Sr. No. | Research Paper Name | Methodology | Limitations | Year | Authors |
|---|---|---|---|---|---|
| 1 | Single Domain Generalization for Audio Deepfake Detection | Shuffle Mix Aggregation and Separation Domain Generalization (SM-ASDG), structured into several key components | The SM-ASDG method relies on single-domain data, which may restrict its ability to generalize effectively to diverse, real-world multi-domain scenarios | August 2023 | Yuankun Xie, Haonan Cheng, Yutian Wang, and Long Ye |
| 2 | Deepfake Audio Detection Using MFCC Features | Emphasizes Mel-frequency cepstral coefficients (MFCCs) for audio feature extraction | Handling larger feature sets and greater complexity is a limitation, which led the authors to explore transfer learning-based deep learning approaches | December 2022 | Ameer Hamza, Abdul Rehman Javed, Farkhund Iqbal, et al. |
| 3 | Beyond the Illusion: Ensemble Deep Learning for Effective Voice Deepfake Detection | Recurrent Neural Networks (LSTM), 1D Convolutional Neural Networks, Convolutional LSTM | Model complexity: while ensemble methods can improve performance, they also introduce complexity | 2020 | Gulam Ali et al. |
Existing System
The existing system for detecting deepfake audio uses Mel-Frequency Cepstral Coefficients (MFCCs)
for feature extraction. MFCCs are computed on the mel scale, which approximates human pitch
perception, to capture key audio characteristics. The system employs SVM, Random Forest, and VGG-16
models to classify audio as real or fake. While it performs reasonably well, especially on the
Fake-or-Real dataset, it struggles with noisy, high-dimensional data and cannot perform real-time
detection due to its computational load.
• MFCC: Extracts compact spectral features from the audio.
• Models: SVM and Random Forest (machine learning) and VGG-16 (deep learning) for classification.
• Performance: Moderate accuracy in detecting fake audio.
• Limitations: Struggles with noisy data and is less accurate on audio produced by newer generation tools.
Proposed System
• In this project, we propose a deepfake audio detection system using a combination of Machine
Learning and Deep Learning techniques.
• The system utilizes Logistic Regression, Convolutional Neural Networks (CNNs), and Generative
Adversarial Networks (GANs) for audio classification.
• The process begins with preprocessing the audio input to extract relevant features such as Mel-
Frequency Cepstral Coefficients (MFCCs) and spectrograms.
• Logistic Regression serves as a baseline model using MFCC features to classify audio as real or fake.
• The CNN model focuses on identifying spatial and temporal patterns in spectrogram images to
detect anomalies in synthetic audio.
• GANs are employed both to generate deepfake audio samples and to improve the detection
model's robustness by learning subtle differences between real and fake audio.
• The combined approach improves overall detection accuracy and adaptability against increasingly
realistic deepfake threats.
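The Logistic Regression baseline described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: the two-value feature vectors stand in for per-clip MFCC summaries, and the function names, learning rate, epoch count, and toy data are all hypothetical.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """Probability that a clip is fake (label 1) given its feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def train_logistic_regression(samples, labels, lr=0.5, epochs=500):
    """Fit weights with batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Toy stand-ins for MFCC summary features (hypothetical values): 0 = real, 1 = fake.
toy_features = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
                [0.9, 0.8], [0.8, 0.9], [0.85, 0.75]]
toy_labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(toy_features, toy_labels)
predictions = [int(predict_proba(weights, bias, x) >= 0.5) for x in toy_features]
```

In a real pipeline the feature vectors would come from the MFCC extraction step, and a library implementation would normally replace the hand-written training loop.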
• Block diagram of the deep-fake audio detection
User & System Requirements
1. User Requirements
Upload Audio Files: Users should upload audio files (.wav/.mp3/.flac) directly for analysis.
Provide YouTube Links: Users can input YouTube links for automatic audio extraction.
Receive Detection Results: Users receive a clear result indicating if the audio is real or deepfake, along with
confidence scores.
2. System Requirements
Audio Processing: Support for multiple audio formats and extraction from YouTube links.
Feature Extraction: Use MFCCs and Spectrograms for feature extraction from audio.
Deep Learning Models: Implement Logistic Regression (ML), CNN, and GAN (Deep Learning) models for detection.
Result Display: Provide results with real/fake classification and confidence scores.
Security: Encrypt and securely handle all user data and audio files.
Performance: Aim for high accuracy (~96%) and fast response time (<10 seconds) across platforms.
REQUIREMENT ANALYSIS/MODELS
1. Functional Requirements
Audio Input: Accept multiple audio formats (.wav, .mp3, .flac).
Feature Extraction: Extract MFCCs and Spectrograms for model input.
Model Training: Train and evaluate Logistic Regression, CNN, and GAN models on labeled genuine and fake audio
datasets.
User Interface: Provide a user-friendly interface for uploading and analyzing audio samples.
Model Evaluation: Use metrics such as accuracy, precision, recall, and F1-score to evaluate model performance.
Result Display: Present results indicating whether audio is genuine or manipulated, along with
confidence/accuracy scores.
2. Non-functional Requirements
Performance: Process audio samples within a specified time limit (e.g., under 10 seconds for typical files).
Scalability: Support multiple simultaneous users without performance degradation.
Robustness: Maintain detection accuracy across different accents, languages, and audio quality levels.
Usability: Ensure an intuitive interface that allows easy file uploads or YouTube link inputs.
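The audio-input requirement above can be checked before any processing starts. The following is a small sketch, assuming a hypothetical `validate_upload` helper and an assumed 25 MB size cap; only the three file extensions come from the requirements.

```python
from pathlib import Path

# Formats listed in the requirements above.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac"}

def is_supported_audio(filename: str) -> bool:
    """Return True when the file name ends with a supported audio extension."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS

def validate_upload(filename: str, size_bytes: int,
                    max_bytes: int = 25 * 1024 * 1024) -> str:
    """Return "ok" or a short rejection reason. The 25 MB cap is an assumed limit."""
    if not is_supported_audio(filename):
        return "unsupported format"
    if size_bytes <= 0:
        return "empty file"
    if size_bytes > max_bytes:
        return "file too large"
    return "ok"
```

Checking only the extension is a first filter; a production system would also inspect the file contents before decoding.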
Proposed MODEL
1. Data Collection
Datasets used: InTheWild and RealAndFake, both containing genuine and deepfake audio
samples.
Diverse samples including multiple speakers, accents, and real-world conditions to ensure model
robustness.
2. Feature Extraction
Primary features: MFCC (Mel-frequency cepstral coefficients) extracted from audio files.
Additional acoustic features include spectrograms and chroma features for better audio
characterization.
Preprocessing steps: noise reduction, normalization, and resampling to standardize input audio.
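MFCCs are built on the mel scale, which spaces frequency bands more densely at low frequencies. The sketch below shows one common (HTK-style) conversion between Hz and mels and how equally spaced mel points turn into filter-bank band edges. The function names, the 0 to 8000 Hz range, and the band count are assumptions chosen for illustration; the project's own extraction settings are not stated in the slides.

```python
import math

def hz_to_mel(frequency_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> list:
    """Band edges in Hz, equally spaced on the mel scale (n_bands + 2 points)."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_bands + 1)
    return [mel_to_hz(mel_min + i * step) for i in range(n_bands + 2)]

# Assumed example: 10 bands between 0 Hz and 8000 Hz.
edges = mel_band_edges(0.0, 8000.0, 10)
```

The edges are close together at low frequencies and spread out at high frequencies, which is the perceptual weighting that MFCC features inherit.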
3. Model Selection
Baseline ML models: Logistic Regression and Random Forest for initial classification.
Deep Learning model: Convolutional Neural Networks (CNN) trained on spectrogram images to
capture spatial and frequency patterns in audio.
GAN (Generative Adversarial Network) model used to enhance detection by distinguishing
between genuine and fake audio samples, achieving the highest accuracy (~96%).
4. Training and Validation
Data split into training, validation, and test sets ensuring balanced representation of real and fake
samples.
Hyperparameter tuning and early stopping applied during training to prevent overfitting.
Model evaluation on unseen audio samples from the test set and real-world noisy audio.
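A balanced train/validation/test split can be obtained by splitting each class separately. This is a minimal sketch, assuming a 70/15/15 split and a fixed seed; the slides do not state the ratios actually used, and `stratified_split` is a hypothetical helper.

```python
import random

def stratified_split(labels, train_pct=70, val_pct=15, seed=42):
    """Return (train_idx, val_idx, test_idx) with each class split in the same ratios."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Integer arithmetic keeps the split sizes exact and reproducible.
        n_train = len(indices) * train_pct // 100
        n_val = len(indices) * val_pct // 100
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]
    return train_idx, val_idx, test_idx

# Toy labels: 20 real (0) and 20 fake (1) clips.
labels = [0] * 20 + [1] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
```

Each returned list holds indices into the dataset, so the same split can be reused for every model being compared.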
5. Performance Metrics
Accuracy:
• Logistic Regression: ~60%
• CNN: ~73% final training accuracy, 85.94% best validation accuracy
• GAN: up to 96%
Precision, Recall, and F1 Score: To evaluate model effectiveness in detecting deepfake audio.
Use of confusion matrix and ROC curve for comprehensive performance analysis.
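The four metrics all follow from the confusion matrix counts. The sketch below treats "fake" as the positive class; the function name and the example counts are hypothetical and are not the project's results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 with "fake" as the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for illustration only; not the project's results.
metrics = classification_metrics(tp=45, fp=5, fn=5, tn=45)
```

Reporting precision and recall next to accuracy matters here because a detector can reach high accuracy on an imbalanced dataset while still missing many fake clips.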
UML/ER DIAGRAM
a. Use Case Diagram
Actors
1. User (Audio Analyst/Investigator): Someone who uses the system to analyze audio samples.
2. System Administrator: Manages the system and its configurations.
3. Database: Stores audio samples and results.
4. Machine Learning Model: Processes audio samples for detection.
Use Cases
1. Upload Audio Sample: User uploads an audio file for analysis.
2. Analyze Audio: System processes the uploaded audio using algorithms to detect deep fakes.
3. View Results: User views the results of the analysis, including confidence scores and flags.
4. Manage User Accounts: System Administrator manages user accounts and permissions.
5. Update Detection Algorithms: System Administrator updates the algorithms used for detection.
6. Retrieve Historical Data: User retrieves past analysis results from the database.
7. Report Findings: User generates and exports reports based on analysis.
Fig. Use Case Diagram
b. Class Diagram
[Link] Diagram
[Link] Diagram
[Link] Diagram
Activity Diagram Components
Initial Node
• Start of the process
Activities
1. Input Audio File
1. Action: User uploads or selects an audio file for analysis.
2. Preprocess Audio
1. Action: Normalize and format the audio file for feature extraction.
3. Extract Features
1. Action: Analyze the audio to extract relevant features.
4. Load Model
1. Action: Load the pre-trained model for prediction.
5. Make Prediction
1. Action: Use the model to predict if the audio is fake or real.
6. Display Results
1. Action: Display the prediction result to the user.
7. Evaluate Model (Optional)
1. Action: Run an evaluation process with test data to measure performance.
Decision Node
• Is Audio Fake?
• If Yes: Output results indicating the audio is fake.
• If No: Output results indicating the audio is real.
Final Node
• End of the process
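The activity flow above can be sketched as a single function that takes each stage as a parameter. Every stage below is a placeholder: `toy_preprocess`, `toy_features`, and `toy_model` are hypothetical stand-ins, and the toy model is a fixed rule rather than a trained detector.

```python
def run_detection(samples, preprocess, extract_features, model):
    """Follow the activity flow: preprocess, extract features, predict, decide."""
    cleaned = preprocess(samples)
    features = extract_features(cleaned)
    fake_probability = model(features)
    # Decision node: is the audio fake?
    label = "Fake" if fake_probability >= 0.5 else "Real"
    return {"label": label, "fake_probability": fake_probability}

# Placeholder stages; a real system would plug in MFCC extraction and a trained model.
def toy_preprocess(samples):
    """Scale the signal so its largest magnitude is 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak else list(samples)

def toy_features(samples):
    """Single feature: mean absolute amplitude."""
    return [sum(abs(s) for s in samples) / len(samples)] if samples else [0.0]

def toy_model(features):
    """Stand-in rule, not a real detector: clamp the feature into [0, 1]."""
    return min(1.0, max(0.0, features[0]))

result = run_detection([0.0, 0.5, -1.0, 0.25],
                       toy_preprocess, toy_features, toy_model)
```

Passing the stages as arguments keeps the flow identical for every model, so Logistic Regression, CNN, and GAN-based detectors can be swapped in without changing the surrounding code.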
• DETAILED ARCHITECTURE / SYSTEM ARCHITECTURE
1. User Interface Layer
This layer serves as the primary interaction point between the user and the system. It is responsible for:
• Allowing users to upload audio files
• Displaying analysis results such as classification labels (Real/Fake) and confidence scores
• Providing basic visualizations (e.g., spectrogram, waveform)
Technology Stack: HTML, CSS, JavaScript, Streamlit or Flask-based interface
2. Backend Processing Layer
This layer acts as the control center that routes requests from the user interface to the relevant system
components. Key responsibilities include:
• Managing API endpoints for file uploads and result retrieval
• Handling user inputs and coordinating the flow of data to the ML models
• Triggering the preprocessing and model inference pipeline
Technology Stack: Python, Flask REST API
3. Preprocessing and Feature Extraction Layer
This layer prepares the raw audio input for model consumption. It performs:
• Noise reduction, silence trimming, and audio normalization
• Conversion of raw audio signals into spectrogram or MFCC representations
• Extraction of temporal and frequency-domain features required for classification
Tools Used: Librosa, NumPy, SciPy
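Two of the preprocessing steps named above, silence trimming and splitting the signal into short overlapping frames, can be sketched in plain Python. The threshold, frame length, and hop length are assumed values for illustration, and in practice the listed libraries would do this work on NumPy arrays.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below the threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

def frame_signal(samples, frame_length, hop_length):
    """Split a signal into overlapping frames; a trailing partial frame is dropped."""
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frames.append(samples[start:start + frame_length])
    return frames

# Toy signal with near-silent samples at both ends.
signal = [0.0, 0.001, 0.2, -0.4, 0.3, 0.1, -0.2, 0.0, 0.0]
trimmed = trim_silence(signal)
frames = frame_signal(trimmed, frame_length=3, hop_length=2)
```

Spectrograms and MFCCs are computed frame by frame, so the frame and hop lengths determine the time resolution of the features passed to the models.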
4. Machine Learning and Deep Learning Layer
This is the core intelligence component of the system. It includes:
• Convolutional Neural Network (CNN) for spatial feature extraction from spectrogram images
• Long Short-Term Memory (LSTM) for sequence modeling and capturing temporal patterns in audio data
• Generative Adversarial Network (GAN) for generating fake audio data or as a classifier enhancement
• Baseline Classifiers such as Logistic Regression and Random Forest for performance benchmarking
5. Data Management and Storage Layer
This layer ensures persistent storage and logging of:
• Uploaded audio files
• Intermediate features and processed representations
• Classification results and metadata (timestamp, accuracy, confidence)
Database: MongoDB for production environments
6. Data Flow Overview
• User uploads audio via the web interface
• The backend server receives the file and forwards it to the preprocessing module
• The preprocessed features are passed to the selected ML or DL model
• The model outputs the classification (Real or Fake), which is returned to the user
• Results are stored in the database for future analysis or auditing
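The last step of the flow stores each result with its metadata. The record below is a sketch of what such a stored document might contain; the field names, the `make_record` helper, and the rounding are assumptions, not the project's schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DetectionRecord:
    """One stored analysis result (field names are assumptions)."""
    filename: str
    label: str          # "Real" or "Fake"
    confidence: float   # confidence in the chosen label, 0.0 to 1.0
    model_name: str
    timestamp: str      # ISO 8601, UTC

def make_record(filename, fake_probability, model_name):
    """Turn a model's fake-probability into a storable record."""
    label = "Fake" if fake_probability >= 0.5 else "Real"
    confidence = fake_probability if label == "Fake" else 1.0 - fake_probability
    return DetectionRecord(
        filename=filename,
        label=label,
        confidence=round(confidence, 4),
        model_name=model_name,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

record = make_record("sample.wav", 0.91, "cnn")
document = json.dumps(asdict(record))
```

A dictionary of this shape, as produced by `asdict(record)`, is the kind of document a store such as MongoDB can hold, which supports the auditing use mentioned above.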
• Data flow diagram
Level 0 (DFD) Level 1 (DFD)
• Data flow diagram Level 2 (DFD)
PROJECT PLAN
| Phase No. | Phase Name | Duration | Key Activities |
|---|---|---|---|
| 1 | Project Planning & Topic Finalization | September | Problem identification, topic selection, literature survey, team formation, initial planning |
| 2 | Synopsis Submission & Approval | September | Preparing and submitting the project synopsis for approval, addressing feedback from faculty |
| 3 | Requirement Analysis & Feasibility Study | October | Defining objectives, use-case scenarios, technical feasibility study, review presentation 1 |
| 4 | System Design | November | Designing system architecture, model workflow, dataset sourcing (InTheWild, RealAndFake), creating ER diagrams, review presentation 2 |
| 5 | Model Development (ML & DL) | Jan - Feb | Implementing Logistic Regression, CNN, GAN; feature extraction (MFCC, spectrograms); initial testing |
| 6 | Testing & Debugging | March | Model evaluation, bug identification and fixing, testing for accuracy, stability, and performance |
| 7 | Evaluation & Comparative Study | April | Performance testing, accuracy comparison between models, final model selection (GAN), review presentation 3 |
| 8 | Final Integration | April | Final frontend-backend integration, web interface implementation and testing |
| 9 | Documentation & Report Writing | April | Final blackbook writing, abstract, conclusion, formatting, preparing for viva, final presentation |
Implementation
Test cases and results of test cases.
1-For Logistic Regression
| Test Case ID | Audio Clip Description | Duration | Noise Level | Ground Truth | Model Output | Result |
|---|---|---|---|---|---|---|
| TC-01 | Male voice, clear speech | 10s | Low | Real | Real | Pass |
| TC-02 | Synthetic female voice | 7s | Low | Fake | Fake | Pass |
| TC-03 | Real speech, background noise | 5s | High | Real | Fake | Fail |
| TC-04 | AI-generated dialogue | 12s | Medium | Fake | Fake | Pass |
| TC-05 | Real clip | 15s | Low | Real | Real | Pass |
• Cross-validation score: 0.92 → shows model consistency across different splits of data.
• Precision: 0.95 → out of all audios predicted as fake, 95% were truly fake.
• Recall: 0.95 → the model correctly identified 95% of all actual fake audios.
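The Result column in the test-case table follows from comparing the ground truth with the model output. A short sketch of that check, using the five rows listed for Logistic Regression; the `evaluate_cases` helper is hypothetical.

```python
# Rows from the test-case table: (test case ID, ground truth, model output).
test_cases = [
    ("TC-01", "Real", "Real"),
    ("TC-02", "Fake", "Fake"),
    ("TC-03", "Real", "Fake"),
    ("TC-04", "Fake", "Fake"),
    ("TC-05", "Real", "Real"),
]

def evaluate_cases(cases):
    """Return per-case Pass/Fail results and the overall pass rate."""
    results = {
        case_id: ("Pass" if truth == output else "Fail")
        for case_id, truth, output in cases
    }
    passed = sum(1 for outcome in results.values() if outcome == "Pass")
    return results, passed / len(cases)

results, pass_rate = evaluate_cases(test_cases)
```

Four of the five cases pass; the single failure is the noisy real clip, which matches the noise sensitivity noted elsewhere in the presentation.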
2- MelGAN-based model
| Test Case ID | Audio Clip Description | Duration | Noise Level | Ground Truth | Model Output | Result |
|---|---|---|---|---|---|---|
| TC-01 | Male voice, clear speech | 10s | Low | Real | Real | Pass |
| TC-02 | Synthetic female voice | 7s | Low | Fake | Fake | Pass |
| TC-03 | Real speech, background noise | 5s | High | Real | Fake | Fail |
| TC-04 | AI-generated dialogue | 12s | Medium | Fake | Fake | Pass |
| TC-05 | Real clip | 15s | Low | Real | Real | Pass |
• Accuracy: 96%. The model correctly predicts real vs fake in 96% of the total 26,659 samples.
• Precision (Fake): 96%. Out of all audios predicted as fake, 96% were truly fake.
• Recall (Fake): 98%. The model correctly detected 98% of all fake audios.
• F1-Score: 97%. Indicates a strong balance between precision and recall for fake audio detection.
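As a consistency check, the reported F1-score can be recomputed from the reported precision and recall for the fake class:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2 \times 0.96 \times 0.98}{0.96 + 0.98}
    = \frac{1.8816}{1.94}
    \approx 0.970
```

This rounds to the 97% stated above, so the three figures agree with one another.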
3- CNN
| Test Case ID | Audio Type | Duration | Source | Ground Truth | Model Output | Result |
|---|---|---|---|---|---|---|
| TC-01 | Real conversation | 9s | Original audio | Real | Real | Pass |
| TC-02 | Deepfake podcast | 11s | GAN-generated | Fake | Fake | Pass |
| TC-03 | Real noisy telephone | 6s | Mic-recorded | Real | Fake | Fail |
| TC-04 | Synthetic female | 7s | CNN test | Fake | Fake | Pass |
| TC-05 | Real news broadcast | 13s | Public dataset | Real | Real | Pass |
• Initial Accuracy: Started at ~46% (Epoch 1)
• Final Training Accuracy: ~73% (Epoch 10)
• Best Validation Accuracy: 85.94% (Epoch 9)
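Because the best validation accuracy was reached at epoch 9 rather than the final epoch, it helps to select the checkpoint by validation accuracy and to stop once it no longer improves. The sketch below shows both ideas; the helper names and the patience value are assumptions, and the validation curve is hypothetical apart from the 85.94% peak at epoch 9 reported above.

```python
def best_epoch(validation_accuracies):
    """Return (1-based epoch, accuracy) of the highest validation accuracy."""
    best_index = max(range(len(validation_accuracies)),
                     key=lambda i: validation_accuracies[i])
    return best_index + 1, validation_accuracies[best_index]

def should_stop(validation_accuracies, patience=3):
    """Early stopping: True when the last `patience` epochs did not beat the earlier best."""
    if len(validation_accuracies) <= patience:
        return False
    best_before = max(validation_accuracies[:-patience])
    return max(validation_accuracies[-patience:]) <= best_before

# Hypothetical validation curve; only the 0.8594 peak at epoch 9 comes from the slide.
history = [0.50, 0.58, 0.63, 0.69, 0.74, 0.78, 0.80, 0.83, 0.8594, 0.84]
epoch, accuracy = best_epoch(history)
stop_now = should_stop(history)
```

With this curve the best checkpoint is epoch 9, and training would not yet be stopped because the peak falls within the last three epochs.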
Conclusion
This project aimed to develop an effective system for detecting deepfake audio using Machine Learning
and Deep Learning techniques, addressing the growing concern of AI-generated synthetic voices. In a
comparative study of three models (Logistic Regression, CNN, and GAN), we observed that the GAN
model achieved the highest accuracy of 96%, demonstrating superior performance in identifying modern
synthetic audio. The CNN performed reasonably well, with about 73% training accuracy and a best validation
accuracy of 85.94%, but its effectiveness was limited by the lack of diversity in the dataset. Logistic
Regression, though basic, provided a baseline with about 60% accuracy along with high reported precision
and recall values. Despite promising results, the system's performance could benefit from larger and
more diverse datasets, particularly to improve detection in noisy and real-world conditions. Future
enhancements may include real-time detection capabilities. Overall, this work contributes to the
field of audio forensics and digital media authentication, with potential applications in media security,
journalism, and cybersecurity.
References
[1] A. Hamza et al., "Deepfake Audio Detection via MFCC Features Using Machine Learning," IEEE Access, vol. 10,
pp. 134018–134028, 2022. doi: 10.1109/ACCESS.2022.3231480.
[2] Y. Xie, H. Cheng, Y. Wang, and L. Ye, "Single Domain Generalization for Audio Deepfake Detection," IEEE Transactions on
Information Forensics and Security, vol. 19, pp. 344–358, 2024. doi: 10.1109/TIFS.2023.3324724.
[3] K. Li, X. Lu, M. Akagi, and M. Unoki, "Contributions of Jitter and Shimmer in the Voice for Fake Audio Detection," IEEE
Access, vol. 11, pp. 84689–84698, 2023. doi: 10.1109/ACCESS.2023.3301616.
[4] Y. Xie, H. Cheng, Y. Wang, and L. Ye, "Domain Generalization via Aggregation and Separation for Audio Deepfake
Detection," IEEE Transactions on Information Forensics and Security, vol. 19, pp. 344–358, 2024.
doi: 10.1109/TIFS.2023.3324724.
[5] A. Adi, Y. Nisal, M. Shrey, et al., "Deepfake Audio Detection Using Convolutional Neural Networks," International
Conference on Artificial Intelligence and Data Engineering (AIDE), 2021.
[6] L. Chen, Z. Jing, Z. Wei, et al., "Audio Deepfake Detection: A Comprehensive Review," Journal of Cybersecurity and Privacy,
2020.
[7] "The Effect of Deep Learning Methods on Deepfake Audio Detection for Digital Investigation," Procedia Computer Science,
2023. doi: 10.1016/j.procs.2023.01.283.
[8] S. Dai, S. Zhang, Y. Liu, et al., "Detecting Deepfake Audio with Machine Learning Techniques," IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.