Project 4 Plag

The document presents a project report on a multi-modal deepfake detection framework that analyzes both audio and video streams to identify manipulations in synthetic media. It incorporates advanced techniques such as Vision Transformers for spatial analysis and Discrete Cosine Transform for frequency-domain feature extraction, along with audio processing using Bi-GRU and Self-Supervised Learning. The methodology aims to enhance detection accuracy and interpretability, addressing the challenges posed by modern deepfake technologies.

Page 1 of 21 - Cover Page Submission ID trn:oi[Link]1

Sapare Aravind
Project 4 Report
Assignment 2

Document Details

Submission ID

trn:oi[Link]1 17 Pages

Submission Date 3,009 Words

Nov 25, 2025, 3:29 PM GMT+5:30


18,835 Characters

Download Date

Nov 25, 2025, 3:32 PM GMT+5:30

File Name

Turnit in [Link]

File Size

608.8 KB


9% Overall Similarity
The combined total of all matches, including overlapping sources, for each database.

Filtered from the Report


Bibliography

Quoted Text

Small Matches (less than 10 words)

Match Groups

12 Not Cited or Quoted (9%): Matches with neither in-text citation nor quotation marks
0 Missing Quotations (0%): Matches that are still very similar to source material
0 Missing Citation (0%): Matches that have quotation marks, but no in-text citation
0 Cited and Quoted (0%): Matches with in-text citation present, but no quotation marks

Top Sources

5% Internet sources
2% Publications
7% Submitted works (Student Papers)


Top Sources
The sources with the highest number of matches within the submission. Overlapping sources will not be displayed.

1. Student papers: BML Munjal University on 2024-05-15 (4%)
2. Student papers: Liverpool John Moores University on 2020-09-28 (<1%)
3. Student papers: Flinders University on 2025-09-08 (<1%)
4. Publication: Xin-yan Wang, Li-ming Zhang, Kai Zhang, Cheng Cheng. "Knowledge-data synerg… (<1%)
5. Student papers: University of Queensland on 2025-05-29 (<1%)
6. Internet: [Link] (<1%)
7. Student papers: Wageningen University on 2025-04-06 (<1%)
8. Internet: [Link] (<1%)
9. Publication: Shaheen Usmani, Sunil Kumar, Debanjan Sadhya. "Spatio-temporal knowledge di… (<1%)
10. Student papers: University of East London on 2025-09-09 (<1%)
11. Student papers: University of Surrey on 2023-08-15 (<1%)
12. Internet: [Link] (<1%)


ABSTRACT
The rapid advancement of generative artificial intelligence has resulted in highly realistic
deepfakes that pose severe risks to digital privacy, public trust, and the integrity of online
media. These manipulated videos often alter both facial appearance and speech
characteristics, making detection increasingly difficult. To address this challenge, this
work introduces a multi-modal deepfake detection framework that jointly analyzes audio
and video streams to capture spatial, spectral, temporal, and frequency-based
inconsistencies.

The video pipeline incorporates a Vision Transformer (ViT) to extract high-level spatial
representations of facial regions and applies Discrete Cosine Transform (DCT) on
grayscale frames to identify frequency-domain artifacts commonly introduced during
synthesis. The audio pipeline extracts Linear Frequency Cepstral Coefficients (LFCCs)
and models temporal dynamics using a Bi-directional Gated Recurrent Unit (Bi-GRU).
Additionally, a pre-trained Self-Supervised Learning (SSL) model provides powerful audio
embeddings, enabling the detection of subtle vocal anomalies characteristic of synthetic
speech. These three feature streams are fused using a multi-scale integration module
and passed to a classification head that predicts whether the input is real or fake.

Experiments on a curated set of audio–video deepfakes yield an average test AUC of
0.8173 and an accuracy of 0.7516 across five folds. Explainable AI (XAI) techniques,
including ViT attention maps and Integrated Gradients for both DCT and audio features,
provide transparency into the model’s decision process. The results indicate that
combining spatial, temporal, and spectral cues yields a more robust and interpretable
deepfake detection system.

ACKNOWLEDGEMENT

We are highly grateful to Dr. Sukhandeep Kaur, Assistant Professor, BML Munjal
University, Gurugram, for supervising this seminar/case study from July to
November 2025.

Dr. Sukhandeep Kaur provided great help in carrying out our work and is
acknowledged with reverential thanks. Without such wise counsel and able guidance, it
would have been impossible to complete this work in this manner.

We would also like to thank Dr. Sukhandeep Kaur profusely for encouraging us
from time to time. We also thank the entire team at BML Munjal University, and
our friends who devoted their valuable time and helped us in all possible ways
toward successful completion.

Akshay Korrapati

Revanth Kaki

Aravind Sapare

E. Umesh


List of tables
Table 1: Best Average Test Results Across 5 Folds
Table 2: Confusion Matrix

List of Abbreviations

AUC: Area Under Curve

CNN: Convolutional Neural Network

CPU: Central Processing Unit

CV: Computer Vision (often used for OpenCV)

DCT: Discrete Cosine Transform

DFDC: Deepfake Detection Challenge

FFN: Feed-Forward Network

GAN: Generative Adversarial Network

GRU: Gated Recurrent Unit

IG: Integrated Gradients

LFCC: Linear Frequency Cepstral Coefficients


LN: Layer Normalization

MHSA: Multi-Head Self-Attention

MLP: Multi-Layer Perceptron

MTCNN: Multi-Task Cascaded Convolutional Network

PIL: Python Imaging Library (Pillow)

ReLU: Rectified Linear Unit

RGB: Red, Green, Blue (color model)

ROC: Receiver Operating Characteristic

SSL: Self-Supervised Learning

ViT: Vision Transformer

XAI: Explainable Artificial Intelligence


1. Introduction

Deepfakes, which represent artificially generated or manipulated audio and video
content, have evolved rapidly in both complexity and realism due to the growth of
generative adversarial networks, diffusion models, and self-supervised learning
techniques. These synthetic media forms threaten digital authenticity by enabling
identity theft, influencing political narratives, creating fraudulent evidence, and
manipulating public opinion. Although early research in deepfake detection relied heavily
on spatial inconsistencies or visual artifacts present within individual frames, modern
manipulations have become sophisticated enough to minimize such flaws. Furthermore,
deepfake threats have expanded beyond visual domains to include synthetic speech,
which can convincingly imitate the vocal characteristics of real individuals. Relying on
single-modality detection therefore leaves systems vulnerable to adversarial attacks and
cross-manipulation methods that exploit modality weaknesses.

To address these challenges, this project proposes a hybrid multi-modal deepfake
detection architecture designed to process both the video and audio components of a
manipulated clip. The video stream is analyzed through a Vision Transformer that
models fine-grained spatial relationships across facial regions. To complement this, the
Discrete Cosine Transform is applied to frame-level grayscale data, enabling the
extraction of frequency-domain artifacts that are often present even in visually
convincing deepfakes. The audio stream is processed using a combination of LFCC
features modeled through a Bi-GRU network and high-dimensional SSL embeddings
that capture subtle variations in speech characteristics. By integrating these
complementary representations, the system is capable of identifying manipulation
patterns across multiple domains, making it more resilient to diverse deepfake
generation techniques. This report presents the complete methodology, experimental
setup, results, and explainability analysis of this multi-modal detection approach.


2. Literature Survey

Deepfake research has advanced rapidly with the growth of generative AI, enabling highly
realistic manipulation of audio–visual media. Kaur et al. introduced the HAV-DF dataset,
the first Hindi audio-video deepfake corpus, developed using face-swap, reenactment,
lip-sync, and voice-cloning pipelines. Their study demonstrates that widely used
detectors such as Xception and Mesonet exhibit significantly reduced accuracy on Hindi-
based deepfakes due to linguistic and facial expression variations unique to native
speakers, highlighting the need for region-specific datasets.

To address India’s linguistic and ethnic diversity, Das et al. proposed InDeepFake, a
multilingual multimodal dataset covering seven major Indian languages and multiple
demographic groups. The dataset includes over 4,600 deepfakes generated using seven
state-of-the-art manipulation methods. Benchmarking results reveal that existing
deepfake detectors, trained mostly on Western datasets, struggle with Indian faces and
multilingual audio-video samples, suggesting poor cross-dataset generalization. In
parallel, Bhatia et al. focused on Hindi speech deepfakes, showing that higher-order
spectral features such as bicoherence and cepstral coefficients can effectively
distinguish AI-generated speech, achieving accuracies above 99% with CNN and
VGG-based architectures.

On the visual forensics front, Soudy et al. presented a hybrid deepfake detection
approach using Convolutional Neural Networks and Convolutional Vision Transformers
to analyze facial regions like eyes, nose, and whole face. Their majority-voting fusion
model achieved strong performance on datasets such as FaceForensics++ and DFDC,
demonstrating the importance of fine-grained facial cues in deepfake identification.
Across all studies, a consistent conclusion emerges: multilingual and culturally diverse
datasets are essential for building robust deepfake detectors, especially for low-resource
languages like Hindi.


3. Problem Statement

Modern deepfake systems generate synthetic audio and video content that is nearly
indistinguishable from genuine recordings. These manipulations present a range of
societal and security risks, including misinformation, identity fraud, and political
influence operations. The difficulty of this problem lies not only in identifying subtle
spatial artifacts but also in detecting inconsistencies in voice patterns, synchronization
between lip movement and speech, and frequency irregularities that arise during
generative processing. Existing deepfake detection methods typically analyze only a
single modality, either video or audio, which makes them vulnerable to adversarial
attacks and limits their generalizability. Furthermore, many deep learning-based
detectors operate as black boxes, offering no interpretability or insight into the features
influencing their predictions. This lack of transparency undermines trust and
complicates real-world deployment where verification and justification are critical.

The primary objective of this project is to develop a robust and interpretable multi-modal
deepfake detection system that integrates visual, audio, and frequency-domain
information. The system must identify spatial inconsistencies in facial features, detect
frequency artifacts through DCT analysis, and capture temporal and spectral anomalies
in the audio stream using LFCC features and SSL-based embeddings. Additionally, the
detector should incorporate explainability tools to visualize attention regions and feature
importance, providing users with insight into the decision-making process. The system
aims to address the limitations of unimodal detectors by creating a unified architecture
capable of identifying complex and highly realistic deepfake manipulations.


4. Methodology
4.1 System Architecture
The proposed system follows a three-stream multi-modal architecture designed to
extract complementary cues from video frames and audio signals. The overall
architecture integrates a Vision Transformer for spatial analysis, a DCT-based network for
frequency-domain feature extraction, and an LFCC–BiGRU plus SSL embedding pipeline
for audio analysis. The outputs from these streams are fused and passed to a final
classification head.

4.1.1 Video Preprocessing and Frame Extraction


Video files are first processed to extract frames at fixed intervals, ensuring temporal
coverage across the clip. Each frame is converted from BGR to RGB and passed through
an MTCNN-based face detector that isolates the largest detected face. The detected
region is then cropped and resized to the required resolution. This preprocessing ensures
that the ViT and DCT streams operate on clean, standardized face regions that emphasize
the areas most often manipulated in deepfakes.
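
The selection logic in this preprocessing step can be sketched without committing to a particular video or face-detection library. The following is a minimal illustration only: the helper names (`sample_frame_indices`, `largest_box`, `clamp_box`), the sampling step, and the `(x1, y1, x2, y2)` box layout are assumptions made for the example, not details taken from the project code. In practice the boxes would come from an MTCNN detector and the frames from a video reader.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2) in pixels


def sample_frame_indices(total_frames: int, step: int) -> List[int]:
    """Indices of frames taken every `step` frames, starting at frame 0."""
    if total_frames <= 0 or step <= 0:
        return []
    return list(range(0, total_frames, step))


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def largest_box(boxes: List[Box]) -> Optional[Box]:
    """Pick the detection with the greatest area, or None if nothing was detected."""
    if not boxes:
        return None
    return max(boxes, key=box_area)


def clamp_box(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Round a box to integer pixels and clip it to the frame before cropping."""
    x1, y1, x2, y2 = box
    left = min(max(int(round(x1)), 0), width)
    top = min(max(int(round(y1)), 0), height)
    right = min(max(int(round(x2)), 0), width)
    bottom = min(max(int(round(y2)), 0), height)
    return left, top, right, bottom


indices = sample_frame_indices(total_frames=100, step=30)
face = largest_box([(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 50.0, 60.0)])
crop = clamp_box((-3.2, 4.6, 700.0, 200.4), width=640, height=480)
```

Clipping the box before cropping avoids negative or out-of-range slice indices when a detection extends past the frame border.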

4.1.2 Vision Transformer for Spatial Feature Extraction


Each cropped face image is fed into a ViT model, specifically the ViT-Base Patch16-224
architecture. The input image is divided into non-overlapping patches that are linearly
projected into embedding vectors. A learnable CLS token is prepended to the patch
sequence, enabling the model to aggregate global information. The final few transformer
layers are fine-tuned while earlier layers remain frozen to balance computational
efficiency with task-specific learning. The CLS token output serves as a 768-dimensional
spatial feature representation that captures manipulation artifacts across the face. This
feature vector forms the video stream’s spatial representation.
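
The token arithmetic behind this description can be checked with a short NumPy sketch. It only reproduces the patch-splitting step for a 224×224 RGB input with 16×16 patches; the function name `patchify` is hypothetical, and the learned linear projection and transformer layers are deliberately left out.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = patchify(image)

num_patches = patches.shape[0]                # 14 * 14 patches
patch_dim = patches.shape[1]                  # 16 * 16 * 3 raw values per patch
num_tokens = num_patches + 1                  # one extra slot for the CLS token
```

Each raw patch vector happens to have 768 values, which matches the 768-dimensional hidden size of ViT-Base, so the sequence entering the encoder has 197 tokens of width 768.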


Figure 1: Multi-Modal Deepfake Detection Using SSL Audio Embeddings and ViT–DCT
Video Feature Fusion

4.1.3 DCT Frequency-Domain Feature Extraction


To complement spatial cues, each grayscale face frame undergoes a 2D Discrete Cosine
Transform, which decomposes the image into its frequency components. The
log-magnitude spectrum is computed, and three frequency bands (low, mid, and high) are
extracted and resized for consistency. These regions are flattened and fed into a custom
MLP that projects the 3072-dimensional input into a 128-dimensional frequency
embedding. This embedding captures subtle spectral distortions introduced by deepfake
synthesis, providing an orthogonal perspective to the spatial ViT features.
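
One way the 3072-dimensional DCT input could be assembled is sketched below. The report states only that three bands are extracted, resized, and flattened; the band boundaries, the 32×32 target size per band (3 × 32 × 32 = 3072), the block-mean resizing, and the function names are all assumptions chosen for this illustration.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)
    return m


def dct2(gray: np.ndarray) -> np.ndarray:
    """Separable 2D DCT-II: transform the rows, then the columns."""
    h, w = gray.shape
    return dct_matrix(h) @ gray @ dct_matrix(w).T


def block_mean(a: np.ndarray, factor: int) -> np.ndarray:
    """Downsample by averaging non-overlapping factor x factor blocks."""
    h, w = a.shape
    return a.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def dct_band_features(gray: np.ndarray) -> np.ndarray:
    """Log-magnitude DCT spectrum -> three 32x32 bands -> 3072-dim vector."""
    assert gray.shape == (224, 224), "sketch assumes a 224x224 grayscale face"
    spec = np.log1p(np.abs(dct2(gray.astype(np.float64))))
    low = spec[0:32, 0:32]                         # assumed low band, already 32x32
    mid = block_mean(spec[32:96, 32:96], 2)        # assumed mid band, 64x64 -> 32x32
    high = block_mean(spec[96:224, 96:224], 4)     # assumed high band, 128x128 -> 32x32
    return np.concatenate([low.ravel(), mid.ravel(), high.ravel()])


rng = np.random.default_rng(0)
features = dct_band_features(rng.random((224, 224)))
flat_features = dct_band_features(np.ones((224, 224)))
```

For a constant image all energy lands in the DC coefficient, so only the first entry of `flat_features` is non-zero, which is a quick sanity check on the transform.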

4.1.4 Audio Feature Extraction Using LFCC and Bi-GRU


The audio signal is separated from the video and processed to extract LFCC features. The
signal is framed, windowed, and transformed into the frequency domain before passing
through linear filter banks to compute cepstral coefficients. The sequence of LFCC
frames is then fed into a Bi-directional GRU that models temporal dependencies and
vocal patterns. The GRU outputs a fixed-length representation that encodes the temporal
structure of genuine versus synthetic speech.
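
The LFCC steps named above (framing, windowing, spectrum, linear filter banks, cepstral coefficients) can be sketched in NumPy as follows. The parameter values (16 kHz sampling rate, 25 ms frames, 10 ms hop, 512-point FFT, 20 filters, 20 coefficients) are common defaults assumed for the example and are not taken from the report; the Bi-GRU that consumes the resulting sequence is not shown.

```python
import numpy as np


def frame_signal(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Slice a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    assert len(x) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def linear_filterbank(n_filters: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular filters with centres spaced linearly from 0 Hz to sr / 2."""
    n_bins = n_fft // 2 + 1
    edges = np.linspace(0.0, sr / 2, n_filters + 2)
    freqs = np.linspace(0.0, sr / 2, n_bins)
    fb = np.zeros((n_filters, n_bins))
    for m in range(n_filters):
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def dct_ii(x: np.ndarray, n_out: int) -> np.ndarray:
    """Orthonormal DCT-II along the last axis, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    basis[0] = np.sqrt(1.0 / n)
    return x @ basis.T


def lfcc(x, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=20, n_ceps=20):
    """Return an (n_frames, n_ceps) matrix of LFCC features."""
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    energies = power @ linear_filterbank(n_filters, n_fft, sr).T
    return dct_ii(np.log(energies + 1e-10), n_ceps)


t = np.arange(16000) / 16000.0
coeffs = lfcc(np.sin(2 * np.pi * 440.0 * t))   # one second of a 440 Hz tone
```

The only difference from the more familiar MFCC pipeline is the filter spacing: the filter centres are placed linearly in frequency rather than on the mel scale, which keeps more resolution at high frequencies.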

4.1.5 Self-Supervised Audio Embeddings


In addition to LFCC–BiGRU features, a pre-trained SSL audio representation model, such
as wav2vec2 or HuBERT, is used to extract high-level embeddings that capture phonetic,
prosodic, and acoustic signatures. These embeddings are projected through a linear
layer to align their dimensionality with the other feature streams. SSL embeddings
enhance the model’s ability to detect subtle speech manipulations that traditional
spectral features may overlook.

4.1.6 Multi-Stream Feature Fusion and Classification


The spatial embedding from the ViT, the frequency embedding from the DCT-MLP, and the
audio embeddings from both the Bi-GRU and SSL model are concatenated to form a
unified representation. A multi-layer classification head processes this fused vector
through dense layers with batch normalization and dropout, ultimately producing logits
that indicate whether the input clip is real or fake. The architecture is designed to balance
representational power with robustness and interpretability.
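
As a rough, framework-agnostic illustration of concatenation-based fusion, the sketch below runs an inference-style forward pass in NumPy with random weights. The 768 and 128 sizes for the ViT and DCT streams come from the report; the two audio embedding sizes, the hidden width, and all names are assumptions, and batch normalization and dropout are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# 768 (ViT CLS) and 128 (DCT-MLP) are stated in the report;
# the two audio sizes below are assumptions made for this sketch.
STREAM_DIMS = {"vit": 768, "dct": 128, "bigru": 128, "ssl": 128}


def fuse(streams: dict) -> np.ndarray:
    """Concatenate per-stream embeddings along the feature axis."""
    return np.concatenate([streams[name] for name in STREAM_DIMS], axis=1)


def head_forward(fused, w1, b1, w2, b2) -> np.ndarray:
    """Dense -> ReLU -> Dense, returning two logits (real, fake) per clip."""
    hidden = np.maximum(fused @ w1 + b1, 0.0)
    return hidden @ w2 + b2


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


batch = 4
streams = {name: rng.standard_normal((batch, dim)) for name, dim in STREAM_DIMS.items()}
fused = fuse(streams)                          # 768 + 128 + 128 + 128 features

hidden_dim = 256                               # assumed width of the head
w1 = rng.standard_normal((fused.shape[1], hidden_dim)) * 0.02
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, 2)) * 0.02
b2 = np.zeros(2)

probs = softmax(head_forward(fused, w1, b1, w2, b2))
```

Simple concatenation keeps each stream's features intact and lets the head learn the cross-modal weighting, at the cost of a wider first dense layer.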

4.1.7 Explainable AI (XAI) Integration


To ensure transparency, XAI techniques are applied to all three streams. ViT attention
maps reveal the spatial regions most influential in classification. Integrated Gradients are
computed for both DCT frequency features and audio embeddings to highlight the
spectral and temporal components contributing to the decision. These visualizations
provide insight into the model’s internal reasoning.
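
Integrated Gradients attributes a prediction to input features by averaging gradients along a straight path from a baseline to the input and scaling by the input difference. The sketch below applies it to a tiny logistic model with an analytic gradient, purely to show the mechanics; the model, weights, zero baseline, and step count are assumptions for illustration and are unrelated to the trained detector.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def model(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Toy scorer: probability from a single logistic unit."""
    return float(sigmoid(x @ w + b))


def model_grad(x: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Analytic gradient of the toy scorer with respect to its input."""
    s = model(x, w, b)
    return s * (1.0 - s) * w


def integrated_gradients(x, baseline, w, b, steps: int = 200) -> np.ndarray:
    """Midpoint Riemann approximation of the path integral of gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([model_grad(baseline + a * (x - baseline), w, b) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)


w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([1.0, 0.5, -1.0])
baseline = np.zeros_like(x)

attributions = integrated_gradients(x, baseline, w, b)
gap = model(x, w, b) - model(baseline, w, b)
```

A useful check is the completeness property: the attributions should sum to the difference between the model output at the input and at the baseline, which is `gap` here.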


Figure 2: Example of ViT Last Layer CLS Token Attention. (Left: Original Face Image,
Right: Attention Heatmap)

Figure 3: Audio Feature Attribution Visualization


4.2 Experimental Setup


The system was implemented in a Kaggle environment using PyTorch and the Hugging
Face Transformers library. A curated dataset of audio–video deepfake clips was used,
containing real and synthetic samples in balanced proportions. The dataset was split into
training, validation, and test sets. Training employed the AdamW optimizer with
differential learning rates to adequately fine-tune the ViT layers while training the MLP,
GRU, and SSL projection layers. Automatic mixed precision was enabled to optimize GPU
usage. Performance was monitored using accuracy, precision, recall, F1-score, and AUC.

Figure 4: Training Curves (Loss, Accuracy, AUC)


5. Results
The multi-modal architecture demonstrated strong performance, showing stable
convergence across training epochs and balanced accuracy on the test set. The fused
feature representation outperformed unimodal baselines, confirming the importance of
combining spatial, frequency, and audio cues. The classification report indicated high
precision and recall for both real and fake categories, while the confusion matrix
reflected the system’s ability to correctly identify manipulations in diverse scenarios.

Figure 5: Examples of model predictions on test set images. Titles indicate True Label (R for Real, F for Fake) and
Predicted Label

Based on the final evaluation conducted using five-fold testing, the best performance
obtained with the current implementation is reflected in the averaged test metrics.
Among all evaluated measures, the Area Under the ROC Curve (AUC) achieved the
highest value, 0.8173, demonstrating the model’s ability to separate real and fake
samples across threshold variations. The overall Accuracy reached 0.7516, indicating
that the model correctly classified approximately 75% of all test inputs. The model
achieved a Recall (FAKE) of 0.7759, meaning it detected roughly 78% of fake
instances, while the Precision (REAL) value of 0.7627 indicates that about 76% of
samples predicted as real were indeed genuine. These performance levels
collectively reflect a balanced detection capability across both real and fake
categories, with the AUC serving as the strongest indicator of model reliability.


Metric                        Result Value
Area Under ROC Curve (AUC)    0.8173
Accuracy (ACC)                0.7516
Recall (FAKE)                 0.7759
Precision (REAL)              0.7627

Table 1: Best Average Test Results Across 5 Folds

               Predicted Real    Predicted Fake
Actual Real    9                 25
Actual Fake    7                 59

Table 2: Confusion Matrix
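
For reference, the standard metrics can be derived directly from the four counts in Table 2, as in the short sketch below (the function and key names are hypothetical). The report does not state which fold or run produced Table 2, so these single-matrix values need not coincide with the five-fold averages in Table 1.

```python
def binary_metrics(real_as_real: int, real_as_fake: int,
                   fake_as_real: int, fake_as_fake: int) -> dict:
    """Accuracy plus per-class precision and recall from a 2x2 confusion matrix."""
    total = real_as_real + real_as_fake + fake_as_real + fake_as_fake
    return {
        "accuracy": (real_as_real + fake_as_fake) / total,
        "recall_fake": fake_as_fake / (fake_as_real + fake_as_fake),
        "precision_fake": fake_as_fake / (real_as_fake + fake_as_fake),
        "recall_real": real_as_real / (real_as_real + real_as_fake),
        "precision_real": real_as_real / (real_as_real + fake_as_real),
    }


# Counts read from Table 2 (rows are actual labels, columns are predictions).
metrics = binary_metrics(real_as_real=9, real_as_fake=25,
                         fake_as_real=7, fake_as_fake=59)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For this matrix the accuracy is 0.68, the fake-class recall is about 0.894, and the real-class recall is about 0.265, so most of the errors in Table 2 are real clips predicted as fake.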

The XAI visualizations validated the model’s interpretability. ViT attention maps
consistently focused on facial areas commonly manipulated in deepfakes, such as the
eyes, mouth, and cheeks. Integrated Gradients revealed that high-frequency DCT
features played a significant role in identifying visual glitches, while SSL audio
embeddings demonstrated sensitivity to subtle inconsistencies in speech prosody and
articulation patterns. These findings confirm that the hybrid design leverages
complementary domains for reliable deepfake detection.


6. Conclusion
This project presents a comprehensive audio–visual deepfake detection framework that
integrates spatial, spectral, temporal, and frequency-domain cues within a unified
architecture. By combining Vision Transformer spatial features, DCT frequency
representations, LFCC temporal modeling, and SSL-based audio embeddings, the
system achieves robust and interpretable deepfake detection. The incorporation of
explainability techniques enhances trust and provides valuable insights into the model’s
decision-making process. The results demonstrate that multi-modal fusion offers clear
advantages over single-modality approaches, especially as deepfake generation
technologies continue to advance. This architecture lays the groundwork for future
improvements involving temporal ViT models, synchronization analysis, and contrastive
multi-modal learning.


7. References
1. Kaur, Sukhandeep & Buhari, Mubashir & Khandelwal, Naman & Tyagi, Priyansh &
Sharma, Kiran. (2024). Hindi audio-video-Deepfake (HAV-DF): A Hindi language-
based Audio-video Deepfake Dataset.
[Link]
2. Arnab Kumar Das, Aritra Bose, Priya Manohar, Anurag Dutta, Ruchira Naskar, Rajat
Subhra Chakraborty, “InDeepFake: A novel multimodal multilingual indian
deepfake video dataset”, Pattern Recognition Letters, Volume 197,2025,Pages 16-
23,ISSN 0167-8655, [Link]
3. K. Bhatia, A. Agrawal, P. Singh, and A. K. Singh, "Detection of AI Synthesized Hindi
Speech," arXiv preprint arXiv:2203.03706, 2022.
[Link]
4. Li, X., et al., “Deepfake detection using convolutional vision transformers and
convolutional neural networks,” Neural Computing and Applications, 2024.
[Link]
5. Heo, Y.-J., et al., “Deepfake Detection Scheme Based on Vision Transformer and
Distillation,” arXiv, 2021.
[Link]
6. Frank, J., et al., “Frequency-Aware Deepfake Detection: Improving
Generalizability through Frequency Space Learning,” arXiv, 2024.
[Link]
7. Kumar, P., et al., “Hybrid Deepfake Image Detection: A Comprehensive
Dataset-Driven Approach Integrating Convolutional and Attention Mechanisms
with Frequency Domain Analysis,” arXiv, 2025. [Link]
8. Wodajo, D., et al., “A Novel Hybrid Framework for Deepfake Detection,” Sciety,
2025. [Link]
9. Yang, C., et al., “Explainable AI for DeepFake Detection,” Applied Sciences, 2025.
[Link]
10. Venkateswarulu, S., et al., “DeepExplain: Enhancing DeepFake Detection
Through Transparent and Explainable AI model,” Informatica, 2025.
[Link]
11. “Deepfake Detection Model Combining Texture Differences and Frequency
Domain Information,” ACM Transactions on Privacy and Security, 2025.
[Link]
12. ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection.
(2023). IEEE Journals & Magazine | IEEE Xplore.
[Link]
13. Junshuai Zheng, Yichao Zhou, Ning Zhang, Xiyuan Hu, Kaiwen Xu, Dongyang Gao,
Zhenmin Tang, A spatio-frequency cross fusion model for deepfake detection and
segmentation, Neurocomputing, Volume 628, 2025, 129683, ISSN 0925-2312,
[Link]
14. Usmani, S., Kumar, S., & Sadhya, D. (2024). Spatio-temporal knowledge distilled
video vision transformer (STKD-VViT) for multimodal deepfake detection.
Neurocomputing, 129256.
[Link]
15. Ganguly, S., Ganguly, A., Mohiuddin, S., Malakar, S., & Sarkar, R. (2022). ViXNet:
Vision Transformer with Xception Network for deepfakes based video and image
forgery detection. Expert Systems With Applications, 210, 118423.
[Link]
16. Essa, E. (2024). Feature fusion Vision Transformers using MLP-Mixer for enhanced
deepfake detection. Neurocomputing, 598, 128128.
[Link]
17. N, A. D., & Simon, P. (2025). DeepGuardNet: A novel CNN architecture for
DeepFake image Detection. Procedia Computer Science, 258, 811–818.
[Link]
18. P. Korshunov and S. Marcel, "Vulnerability assessment and detection of Deepfake
videos," 2019 International Conference on Biometrics (ICB), Crete, Greece, 2019,
pp. 1-6, [Link]
