Final Report


Department of Electrical and Computer Engineering

North South University


CSE 499.A: SENIOR DESIGN PROJECT I
Group name: G1
Section: 16
Submitted to:
Faculty: Dr. Fariah Mahzabeen (FMA)
Project Name: SurfShield

Team Members and Contribution:

Name: Md. Mazenul Islam Khan
ID: 2232309642
Contribution:
- Data preprocessing for the website content-based model.
- Building and validating multiple decision tree models.
- Building the Chrome extension.

Name: Akib Sadman Nafi
ID: 2211224042
Contribution:
- Data preprocessing for the website URL-based model.
- Building and validating multiple linear classification models.

Note: In building this project and preparing its documentation, we used AI tools. For
documentation we used ChatGPT and DeepSeek, and for coding we used Claude with a Claude Code
subscription.

Index:
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Purpose and Goal of the Project
1.3 Organization of the Report
Chapter 2 Research Literature Review
2.1 Existing Research and Limitations
Chapter 3 Methodology
3.1 System Design/Architecture
3.2 Hardware and/or Software Components and Implementation
3.3 Budget and Timeline
Chapter 4 Investigation/Experiment, Result, Analysis and Discussion
4.1 Experiments/Tests
4.2 Results
4.3 Analysis and Discussion
4.4 Conclusion and Future Plan for 499B
Project Demo
References

ABSTRACT
Due to the rapid growth of digital technology and global networks, the number of internet users
is increasing rapidly. The number of cyber scams is rising as well, and phishing is the most
common of them. These attacks have disastrous consequences: an estimated $520 million was lost
worldwide to phishing attacks in 2011 alone [15]. Phishing traditionally focused on email, but
it has spread to social media and even SMS. To address this problem, we propose SurfShield, an
intelligent cybersecurity solution that addresses critical gaps in existing detection
mechanisms. The project was chosen to tackle fundamental limitations of current phishing
protection systems, which predominantly employ single-channel analysis, suffer from high
false-positive rates exceeding 15%, and fail to adapt effectively to rapidly evolving attack
techniques, including AI-generated deepfake content, polymorphic URLs, and sophisticated social
engineering tactics. SurfShield introduces a hybrid deep learning architecture with attention
mechanisms that performs multimodal analysis across URLs, website content, website appearance,
and behavioral patterns, achieving higher detection accuracy while maintaining fast response
times through edge computing deployment.

LIST OF FIGURES
Number Name

1 Figure - System Architecture

2 Figure - Safe website detected

3 Figure - Phishing website detected

LIST OF TABLES
Number Name

1 Table - Tools

2 Table - Budgets

3 Table - Timeline

4 Table - Model Results

Chapter 1 Introduction
1.1 Background and Motivation
We live in an age of digital information in which billions of users rely on online services
every day for banking, shopping, socializing, and communicating. This reliance has paved the
way for sophisticated cybercriminals who use phishing attacks to manipulate people into
providing sensitive information. Hundreds of thousands of phishing sites, which pose risks for
both individuals and companies, are detected and deactivated each month by the Anti-Phishing
Working Group (APWG).

Phishing attacks have come a long way from cobbled-together emails to the psychological
manipulation, realistic-looking websites, and real-time deception seen today. Modern phishing
sites usually use SSL certificates and closely copy legitimate websites, with web addresses
that look much like those of real brands or websites, through brand names hidden within other
words, typosquatting, or homograph attacks. Machine learning and real-time URL analysis have
emerged as promising approaches to detecting phishing attempts, but these technologies have not
been fully democratized for everyday users in accessible, user-friendly formats.

1.2 Purpose and Goal of the Project


Purpose
Our purpose is to develop an automated system for detecting phishing websites using machine
learning techniques, providing real-time protection to users while they browse the internet.

Goal
The project encompasses:
- Research and implementation of multiple ML algorithms
- Feature extraction from URLs and webpage content
- Model training, evaluation, and comparison
- Browser extension development for end-user deployment
- Backend API server for model serving
- Comprehensive documentation and testing

1.3 Organization of the Report

This report is organized into four chapters. Chapter 1 describes the background and motivation
of our project, SurfShield, along with its purpose and goals. Chapter 2 reviews existing
research that aligns with our project and the limitations of that research; these ideas and
limitations form the baseline from which we started. Chapter 3 presents our methodology: the
project architecture, the software components and where they are used, and our budget and
timeline. Chapter 4 discusses our experiments, results, analysis, and outputs.

Chapter 2 Research Literature Review


2.1 Existing Research and Limitations
Phishing is no longer restricted to email. It has evolved to target people through a variety of
channels, including chat, social media, email, SMS, and the web. With the rise of digital
connectivity and remote services, attacks are increasing alarmingly. They now frequently target
live messengers, blogs, forums, SMS, and even VoIP [1][2][3][4][5]. Phishing detection has
traditionally relied on blacklists, heuristics, and rule-based strategies. These approaches
remain foundational but are brittle against evolving tactics and zero-day attacks [6][2][7][1].
Features from URLs, HTML, and third-party services have been used to train SVM, Decision Tree,
Random Forest, k-NN, and Naive Bayes classifiers [7][3][4][6]. For automatic feature learning
and better performance on massive data, DL models such as CNN, LSTM, Bi-LSTM, GRU, DNN, and
hybrids have become increasingly popular in recent years [8][2][3][9][1]. To reduce manual
feature engineering and increase detection accuracy, the most recent research combines deep
character embeddings with NLP and manual features [3][8]. While visual, content-based, and
URL-based features are commonplace, newer features include favicon analysis, WHOIS information,
page structure, and even behavioral indicators. For robust models, integrating as many
pertinent feature categories as possible is highly valued [10][7][3]. Sources of datasets
include custom crawls, PhishTank, Alexa, and UCI. Because attack patterns are always evolving,
benchmarking quality and dataset freshness are essential [2][1][7][10][3]. For benchmarking,
reproducibility, and real-world deployment, accuracy, precision, recall, F1 score, AUC, and
other metrics are used. Reproducibility, open datasets, and browser-based or cloud-integrated
deployments are the main topics of recent studies [8][10].
The overall pipeline is consistent: Data collection → Feature extraction → Model training
(ML and DL) → Testing/Validation → Real-world benchmarking [4][11][1][2][7][8].
Hybrid deep learning models now combine different feature types and often ensemble diverse
classifiers (e.g., stacking, boosting, voting) for maximum performance [7][3][8]. Tools and
browser plugins are developed for live classification, with robust but variable reported
performance [10][8].
Some gaps still remain, however. Most current models, especially ML models, require continual
retraining for new attack patterns. There is a deficit in the detection of zero-day and
“intelligent” adversarial attacks [1][2][3]. Many solutions are language-, data-, or
modality-specific; robust cross-platform, cross-language, and cross-medium detectors remain
rare [2][3]. Dataset issues such as a lack of quantity, diversity, and freshness still limit
model utility in real-world deployment. Features that once worked (e.g., IP address or HTTPS
tokens in the URL) become obsolete as attacker strategies evolve [7][10]. Traditional machine
learning techniques and rule-based approaches rely mainly on manually created features, which
lose their utility as attackers evolve. DL models show promise, although they need large,
high-quality datasets and are sometimes viewed as a "black box" [3][1][8][2]. External
service-based features introduce delay and make real-time detection difficult, while certain
content-based features are ineffective against contemporary threats [10][7]. Although DL models
are quite accurate, their limited interpretability makes them harder to accept for use in
mission-critical applications [1][2][3]. Emerging attacks can get past signature-based,
batch-trained models, and browser and cloud plugins are still constrained by update and latency
limits [8][10]. To mitigate these gaps, we make use of a hybrid deep learning architecture that
processes character embeddings and NLP or manual features at the same time. This architecture
may use ensemble stacking or BiLSTM and DNN to take advantage of both low-level and high-level
structures [3][8]. We can combat the quick evolution of phishing by optimizing real-time,
scalable deployment in browsers, cloud, and cross-platform settings with frequent updates
[4][8][10][3][7].

Chapter 3 Methodology
3.1 System Design/Architecture

Chrome extension --Request--> Backend server --Request--> Multiple models
Chrome extension <--Response-- Backend server <--Response-- Multiple models

Figure - System Architecture
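
The request/response flow in the figure can be sketched as follows. This is a minimal, dependency-free illustration and not the project's actual server code: the JSON field names (`url`, `features`, `verdicts`), the function `handle_request`, and both stub models are hypothetical stand-ins for the trained models that the backend server would load.

```python
import json
from typing import Callable, Dict

# A "model" here is any callable that maps a request payload to a phishing
# score in [0, 1]. Both stubs are hypothetical placeholders, not the
# project's trained models.
Model = Callable[[dict], float]


def stub_url_model(payload: dict) -> float:
    # Placeholder rule: treat URLs without HTTPS as more suspicious.
    return 0.2 if payload["url"].startswith("https://") else 0.8


def stub_content_model(payload: dict) -> float:
    # Placeholder rule: look at one page feature sent by the extension.
    return 0.9 if payload.get("features", {}).get("InsecureForms") else 0.1


MODELS: Dict[str, Model] = {
    "url": stub_url_model,
    "content": stub_content_model,
}


def handle_request(body: str, models: Dict[str, Model] = MODELS) -> str:
    """Take the extension's JSON request and return the server's JSON response."""
    payload = json.loads(body)
    verdicts = {name: model(payload) for name, model in models.items()}
    return json.dumps({"url": payload["url"], "verdicts": verdicts})


if __name__ == "__main__":
    request = json.dumps(
        {"url": "http://example.test/login", "features": {"InsecureForms": 1}}
    )
    print(handle_request(request))
```

Keeping the models behind a plain name-to-callable mapping means a model can be added or swapped on the server without changing the extension, which only sees the JSON response.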

3.2 Software Components and Implementation


Tools:
Tool                 Purpose
VS Code              IDE
Python + libraries   Model building, backend server, and API
JavaScript           Data collection from the website
Postman              API testing
ChatGPT/Claude       AI assistance

Table - Tools

3.3 Budget and timeline
Budgets:
Category                   Amount
Software & Tools           $0.00
Claude Code Subscription   $40.00 ($20 per month for two months)
External APIs & Services   $0.00
Development & Deployment   $0.00
Monitoring & Maintenance   $0.00
TOTAL PROJECT COST         $40.00
Cost per person            $20
Time Investment            350–400 hours
ROI                        Skills + Portfolio + Potential Revenue = PRICELESS

Table - Budgets

Tentative Timeline:
Milestone                Week        Key Deliverables
M1: Proposal Submitted   Week 1-2    What we want to develop and what our goal is
M2: Literature Review    Week 3-4    Study of 15 research or conference papers
M3: Data Ready           Week 5-6    21,430 samples
M4: Features Built       Week 7      46 features, baseline models
M5: Model Trained        Week 8-9    Multiple models trained and results compared
M6: Model Validated      Week 10     Model selection
M7: Extension Ready      Week 11     Chrome extension
M8: Optimized            Week 12     Performance tuned

Table - Timeline

Chapter 4 Investigation/Experiment, Result, Analysis and Discussion
4.1 Investigation/Experiment
Data preprocessing for the website content-based model
Original Dataset: 10,000 samples × 48 features
Cleaned Dataset: 9,581 samples × 41 features
Preprocessing Steps:
1. Duplicate Removal: 419 duplicates (4.19%) removed
2. Feature Selection: 7 uninformative features removed
- Zero variance: HttpsInHostname
- Low variance: AtSymbol, NumHash, DoubleSlashInPath,
FakeLinkInStatusBar, RightClickDisabled, PopUpWindow
3. Stratified Splitting: 80/20 train/test split
4. Normalization: StandardScaler for distance-based algorithms
Class Distribution:
- Training: 3,996 legitimate (52.14%), 3,668 phishing (47.86%)
- Testing: 1,000 legitimate (52.16%), 917 phishing (47.84%)
- Balance: Excellent (52/48 split maintained)
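
The preprocessing steps above can be sketched on toy data as follows. This is a standard-library illustration under stated assumptions, not the project's code: the report names StandardScaler, which is a scikit-learn class, whereas here standardization is written by hand, and the variance threshold, function names, and toy rows are all hypothetical.

```python
import random
from statistics import mean, pstdev, pvariance


def drop_duplicates(rows):
    """Keep the first occurrence of each (features, label) row."""
    seen, unique = set(), []
    for features, label in rows:
        key = (tuple(features), label)
        if key not in seen:
            seen.add(key)
            unique.append((features, label))
    return unique


def low_variance_columns(rows, threshold):
    """Return indices of feature columns whose variance is <= threshold."""
    columns = zip(*(features for features, _ in rows))
    return [i for i, col in enumerate(columns) if pvariance(col) <= threshold]


def drop_columns(rows, to_drop):
    """Remove the given feature columns from every row."""
    dropped = set(to_drop)
    return [([v for i, v in enumerate(f) if i not in dropped], y) for f, y in rows]


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split each class separately so both parts keep the class ratio."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def fit_standardizer(train):
    """Compute per-column mean and standard deviation on the training set only."""
    columns = list(zip(*(features for features, _ in train)))
    return [mean(c) for c in columns], [pstdev(c) or 1.0 for c in columns]


def standardize(rows, means, stds):
    """Rescale every feature to zero mean and unit variance."""
    return [([(v - m) / s for v, m, s in zip(f, means, stds)], y) for f, y in rows]


if __name__ == "__main__":
    # Toy data: 20 rows with 3 features (the middle one is constant) plus 2 duplicates.
    rows = [([i, 0, i % 3], i % 2) for i in range(20)]
    rows += rows[:2]
    clean = drop_duplicates(rows)
    reduced = drop_columns(clean, low_variance_columns(clean, threshold=0.0))
    train, test = stratified_split(reduced, test_fraction=0.2)
    means, stds = fit_standardizer(train)
    train_scaled = standardize(train, means, stds)
    test_scaled = standardize(test, means, stds)
    print(len(clean), len(train), len(test))
```

The scaler statistics are fitted on the training rows only and then reused for the test rows, so no information from the test set leaks into training.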

FEATURE SET (41 FEATURES)
URL Structure Features (12):
- NumDots, SubdomainLevel, PathLevel, UrlLength
- NumDash, NumDashInHostname, TildeSymbol, NumUnderscore
- NumPercent, NumQueryComponents, NumAmpersand, NumNumericChars
Security Features (6):
- NoHttps, RandomString, IpAddress
- DomainInSubdomains, DomainInPaths, NumSensitiveWords
Page Structure Features (7):
- HostnameLength, PathLength, QueryLength
- EmbeddedBrandName, PctExtHyperlinks
- PctExtResourceUrls, ExtFavicon
Form & Behavior Features (8):
- InsecureForms, RelativeFormAction, ExtFormAction
- AbnormalFormAction, PctNullSelfRedirectHyperlinks
- FrequentDomainNameMismatch, SubmitInfoToEmail, IframeOrFrame
Content Features (2):
- MissingTitle, ImagesOnlyInForm
Runtime Features (6):
- SubdomainLevelRT, UrlLengthRT, PctExtResourceUrlsRT
- AbnormalExtFormActionR, ExtMetaScriptLinkRT
- PctExtNullSelfRedirectHyperlinksRT
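
Several of the features listed above can be computed from the URL string alone. The sketch below shows one plausible way to do so with Python's `urllib.parse`. The feature names come from the list above, but the exact definitions (for example, counting the subdomain level as the number of dots in the hostname minus one) are assumptions and may differ from how the dataset was built. Page-level features such as PctExtHyperlinks need the rendered page and are not covered.

```python
from urllib.parse import urlsplit


def url_structure_features(url: str) -> dict:
    """Compute a subset of the URL-derived features (assumed definitions)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path
    query = parts.query
    return {
        "NumDots": url.count("."),
        # Assumption: dots in the hostname minus one; ignores multi-part TLDs.
        "SubdomainLevel": max(host.count(".") - 1, 0),
        "PathLevel": len([seg for seg in path.split("/") if seg]),
        "UrlLength": len(url),
        "NumDash": url.count("-"),
        "NumDashInHostname": host.count("-"),
        "TildeSymbol": int("~" in url),
        "NumUnderscore": url.count("_"),
        "NumPercent": url.count("%"),
        "NumQueryComponents": len([q for q in query.split("&") if q]),
        "NumAmpersand": url.count("&"),
        "NumNumericChars": sum(ch.isdigit() for ch in url),
        "NoHttps": int(parts.scheme != "https"),
        "HostnameLength": len(host),
        "PathLength": len(path),
        "QueryLength": len(query),
    }


if __name__ == "__main__":
    example = "http://secure-login.example.test/~user/account_update.php?id=42&ref=a%20b"
    for name, value in url_structure_features(example).items():
        print(f"{name}: {value}")
```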

4.2 Results
Rank  Model              Accuracy (Test)  Precision  Recall   F1-Score  Time
1     LightGBM           98.85%           98.38%     99.24%   98.81%    0.08s
2     Gradient Boosting  98.70%           98.27%     99.02%   98.64%    1.74s
3     CatBoost           98.70%           98.16%     99.13%   98.64%    0.44s
4     XGBoost            98.64%           98.16%     99.02%   98.59%    0.45s
5     Random Forest      98.28%           97.84%     98.58%   98.21%    0.18s
6     Decision Tree      97.34%           97.17%     97.27%   97.22%    0.04s

Table - Model Results
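
As a consistency check on the table, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R), so it can be recomputed from the other two columns. The short sketch below does this for the reported values. Because the reported precision and recall are themselves rounded to two decimals, a difference of about 0.01 percentage points would be expected and acceptable.

```python
# (precision %, recall %, reported F1 %) taken from the model results table.
REPORTED = {
    "LightGBM": (98.38, 99.24, 98.81),
    "Gradient Boosting": (98.27, 99.02, 98.64),
    "CatBoost": (98.16, 99.13, 98.64),
    "XGBoost": (98.16, 99.02, 98.59),
    "Random Forest": (97.84, 98.58, 98.21),
    "Decision Tree": (97.17, 97.27, 97.22),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for name, (p, r, reported_f1) in REPORTED.items():
        computed = f1_score(p, r)
        print(f"{name:18s} reported={reported_f1:.2f} computed={computed:.2f}")
```

For all six models the recomputed F1 agrees with the reported value to within rounding.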

4.3 Analysis and Discussion


For URL-only analysis we use an SVM model, and for website content-based analysis we use a
LightGBM model. The Chrome extension offers the option of using either model on its own or
both models together.
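
The choice between single-model and dual-model use can be sketched as a small decision function. The report does not state how the two model outputs are combined when both are enabled, so the averaging rule, the 0.5 threshold, the mode names, and the function name below are all assumptions made for illustration.

```python
from typing import Optional


def combined_verdict(
    mode: str,
    url_score: Optional[float],
    content_score: Optional[float],
    threshold: float = 0.5,
) -> str:
    """Return "phishing" or "safe" from one or both model scores in [0, 1].

    mode is "url" (URL model only), "content" (content model only),
    or "both" (assumed rule: average the two scores).
    """
    if mode == "url":
        scores = [url_score]
    elif mode == "content":
        scores = [content_score]
    elif mode == "both":
        scores = [url_score, content_score]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if any(score is None for score in scores):
        raise ValueError("missing score for the selected mode")
    average = sum(scores) / len(scores)
    return "phishing" if average >= threshold else "safe"


if __name__ == "__main__":
    print(combined_verdict("url", 0.7, 0.1))
    print(combined_verdict("content", 0.7, 0.1))
    print(combined_verdict("both", 0.7, 0.1))
```

Averaging is only one option. A stricter alternative would flag a page when either model exceeds the threshold, trading more false positives for fewer missed phishing pages.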

4.4 Conclusion and future plan for 499B


Conclusion:
This CSE499A project successfully developed a comprehensive phishing detection
system using machine learning.
The system consists of three integrated components:
1. SVM-based phishing detector with PCA (79.21% accuracy)
2. Multi-model training system comparing 6 algorithms (up to 98.85% accuracy)
3. Chrome browser extension with dual-model architecture
The project demonstrates:
- Machine learning algorithms and techniques
- Feature engineering and selection
- Model evaluation and comparison
- Software engineering and architecture
- Web development and browser extensions

Future Improvement for 499B:
1. Deep Learning Models
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNN)
- Transformer-based models
- Better accuracy potential
2. Scalability Improvements
- Distributed deployment
- Load balancing
- Caching layer (Redis)
- Database logging
3. Additional Browsers
- Firefox extension
- Safari extension
- Edge extension
- Mobile browsers
4. Enhanced UI/UX
- Detailed explanations
- Feature importance visualization
- Historical analysis
- Whitelist/blacklist management
5. Security Enhancements
- HTTPS deployment
- Authentication
- Rate limiting
- Input sanitization

Project demo:

Figure - Safe website detected

Figure - Phishing website detected

References:

1. Shakeel Ahmad et al., "Across the Spectrum: In-Depth Review AI-Based Models for
Phishing Detection," 2025.

2. Junaid Rashid et al., "Phishing Detection Using Machine Learning Technique," 2020.

3. Nguyet Quang Do et al., "Deep Learning for Phishing Detection: Taxonomy, Current
Challenges and Future Directions," 2022.

4. Moruf A. Adebowale et al., "Intelligent Phishing Detection Scheme Using Deep Learning
Algorithms," 2020.

5. Ozgur Koray Sahingoz et al., "Machine learning based phishing detection from URLs,"
2019.

6. Sanjiban Sekhar Roy et al., "Multimodel Phishing URL Detection Using LSTM,
Bidirectional LSTM, and GRU Models," 2022.

7. Alper Ozcan et al., "A hybrid DNN-LSTM model for detecting phishing URLs," 2021.

8. Khaled L. Chiew et al., "Utilisation of Website Logo for Phishing Detection," 2019.

9. Umer Ahmed Butt et al., "Cloud-based email phishing attack using machine and deep
learning algorithm," 2022.

10. Ashokbhai Bhadani-Dhara, "Phishing Detection Thesis," 2023.

11. Abdelhakim Hannousse & Salima Yahiouche, "Benchmark Dataset Construction for
Phishing Detection," 2021.

12. Ala Mughaid et al., "An intelligent cyber security phishing detection system using deep
learning techniques," 2022.

13. Abdulrahman Alhogail et al., "Game-based Phishing Awareness Training," 2021.

14. Esraa Abu Elsoud Salloum et al., "Phishing Attacks Detection Methods: Literature
Review," 2021.

15. Anupama Aggarwal et al., "PhishAri: Automatic Real-Time Phishing Detection on Twitter,"
2012.

16. K. S. Jishnu and B. Arthi, "Real-time phishing URL detection framework using knowledge
distilled ELECTRA," Automatika, vol. 65, no. 4, pp. 1621-1639, Oct. 2024, doi:
10.1080/00051144.2024.2415797.

17. M. W. Shaukat, R. Amin, M. M. A. Muslam, A. H. Alshehri, and J. Xie, "A hybrid approach
for alluring ads phishing attack detection using machine learning," Sensors, vol. 23, no. 19,
p. 8070, Sep. 2023, doi: 10.3390/s23198070.

18. K. M. Sudar, M. Rohan, and K. Vignesh, "Detection of adversarial phishing attack using
machine learning techniques," Sādhanā, vol. 49, no. 232, Aug. 2024, doi:
10.1007/s12046-024-02582-0.

19. W. Li, S. Manickam, Y.-W. Chong, W. Leng, and P. Nanda, "A state-of-the-art review on
phishing website detection techniques," IEEE Access, vol. 12, pp. 187976–188012, 2024,
doi: 10.1109/ACCESS.2024.3514972.

20. A. Karim, M. Shahroz, K. Mustofa, S. B. Belhaouari, and S. R. K. Joga, "Phishing
detection system through hybrid machine learning based on URL," IEEE Access, vol. 11,
pp. 36815–36829, 2023, doi: 10.1109/ACCESS.2023.3252366.

21. M. Sánchez-Paniagua, E. Fidalgo Fernández, E. Alegre, W. Al-Nabki, and V. González-Castro,
"Phishing URL detection: A real-case scenario through login URLs," IEEE Access,
vol. 10, pp. 42950–42963, 2022, doi: 10.1109/ACCESS.2022.3168681.

22. P. Yang, G. Zhao, and P. Zeng, "Phishing website detection based on multidimensional
features driven by deep learning," IEEE Access, vol. 7, pp. 15196–15209, 2019, doi:
10.1109/ACCESS.2019.2892066.

23. J. Feng, L. Zou, X. Wang, and C. Wang, "Web2Vec: Phishing webpage detection method
based on multidimensional features driven by deep learning," IEEE Access, vol. 8, pp.
221220–221234, 2020, doi: 10.1109/ACCESS.2020.3044053.

24. A. Odeh, I. Keshta, and E. Abdelfattah, "Machine learning techniques for detection of
website phishing: A review for promises and challenges," in Proc. IEEE CCWC, 2021, pp.
583–589.

25. R. Mohammad, F. Thabtah, and L. McCluskey, "Detection of phishing websites using an
efficient feature-based machine learning framework," Neural Computing and Applications,
vol. 31, no. 8, pp. 3851–3873, 2019.

26. S. Sahoo, B. Liu, and C. Hoi, "Detecting malicious URLs using machine learning
techniques: Review and research directions," IEEE Access, vol. 7, pp. 140540–140559,
2019.

27. M. Aburrous et al., "A survey of machine learning-based solutions for phishing website
detection," Journal of Information Security and Applications, vol. 51, 2020.

28. M. Zouina and B. Outtaj, "A stacking model using URL and HTML features for phishing
webpage detection," Future Generation Computer Systems, vol. 106, pp. 27–39, 2020.

29. A. Adebowale et al., "An effective detection approach for phishing websites using URL
and HTML features," Computers & Security, vol. 84, pp. 102–120, 2019.
