Phishing Detection Using Machine Learning
1.1 Introduction
CHAPTER 4: SOFTWARE REQUIREMENT SPECIFICATION
5.1 Python
5.6 Variables
5.9 Datasets
6.1 Introduction
6.2 Normalization
6.5 UML Diagrams
CHAPTER 8: CONCLUSION
CHAPTER - 1
1. INTRODUCTION
1.1 Introduction to project
Internet use has become an essential part of our daily activities as a result of rapidly
growing technology. Because of this rapid growth and the intensive use of digital
systems, the data security of these systems has gained great importance. The primary
objective of maintaining security in information technologies is to ensure that the necessary
precautions are taken against the threats and dangers users are likely to face while using
these technologies. Phishing is defined as imitating reliable websites in order to
obtain the private information that users enter into websites every day for various purposes,
such as usernames, passwords and citizenship numbers. Phishing websites contain
various hints among their contents and web browser-based information. The individuals
committing the fraud send the fake website or e-mail to the target address
as if it came from an organization, bank or any other reliable source that performs
trusted transactions. The contents of the website or e-mail include requests aiming to lure
individuals into entering or updating their personal information or changing their passwords,
as well as links to websites that look like exact copies of the websites of the organizations
concerned.
1.2 Purpose of the project
Phishing is one of the most common and most dangerous attacks among cybercrimes.
The aim of these attacks is to steal the information used by individuals and organizations
to conduct transactions. Phishing websites contain various hints among their contents and
web browser-based information. The purpose of this study is to perform Extreme
Learning Machine (ELM) based classification on the 30 features of the Phishing
Websites dataset in the UC Irvine Machine Learning Repository.
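As an illustration of the kind of hints mentioned above, the following minimal sketch derives a few URL-based features using only the Python standard library. The function name `extract_url_features`, the feature names, the 54-character length threshold and the example URL are assumptions made for illustration; they are not taken from the project code, and the 30 features actually used are those defined by the dataset itself.

```python
import re
from urllib.parse import urlparse

# Matches a host written as a dotted-quad IPv4 address, e.g. "192.168.0.1".
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def extract_url_features(url, long_url_threshold=54):
    """Return a dict of simple hand-crafted URL features (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        # A raw IP address instead of a domain name is a common warning sign.
        "has_ip_address": bool(IPV4_HOST.match(host)),
        # Very long URLs can hide the suspicious part (threshold is assumed).
        "is_long_url": len(url) >= long_url_threshold,
        # An "@" in a URL can make browsers ignore what precedes it.
        "has_at_symbol": "@" in url,
        # Hyphens in the host are often used to mimic a legitimate brand.
        "has_hyphen_in_host": "-" in host,
        "uses_https": parsed.scheme == "https",
        "dots_in_host": host.count("."),
    }


# Hypothetical example URL, used only to exercise the function.
features = extract_url_features("http://192.168.10.5/secure-login/update@account")
for name, value in features.items():
    print(f"{name}: {value}")
```

Each feature is a cheap test on the URL string, so a table of such values can be built for many URLs and passed to a classifier.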
We propose a system that uses machine learning algorithms such as
Logistic Regression, KNN, SVC, Random Forest, Decision Tree, XGB Classifier and
Naïve Bayes to predict phishing. We trained our models on a large dataset with more
than 30 features and performed hyperparameter tuning to improve the accuracy of the
algorithms. We considered all the standard classification metrics when testing
the models and tuned them to improve precision and recall.
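The metrics mentioned above can be sketched in plain Python as follows. This is a minimal illustration, not the project's evaluation code: the function names, the label convention (1 for phishing, 0 for legitimate) and the sample labels are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 = phishing, 0 = legitimate (assumed convention).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision falls with false positives (legitimate sites flagged as phishing), while recall falls with false negatives (phishing sites that were missed), which is why the two are reported separately from accuracy.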
CHAPTER - 2
2. REQUIREMENTS
• RAM: 4GB
• Processor: Intel i3
• Software: Anaconda
• Jupyter IDE
3. SYSTEM ANALYSIS
1. NumPy
2. Pandas
3. Matplotlib
4. Scikit-learn
1. NumPy:
2. Pandas:
3. Matplotlib:
For simple plotting, the pyplot module provides a MATLAB-like interface, particularly
when combined with IPython. For the power user, you have full control of line styles, font
properties, axes properties, etc., via an object-oriented interface or via a set of functions
familiar to MATLAB users.
4. Scikit-learn:
➢ In existing systems, phishing website detection is done with data mining
algorithms such as J48 and C4.5 in the Weka Explorer, which are not suitable for very large
datasets.
➢ In most existing systems the dataset is too small to achieve good precision, and
as a result they produce false positives and false negatives.
➢ The number of features in the existing models is very small, and as a result the existing
systems risk becoming outdated.
➢ With rapid growth in technology, there are many ways open for an attacker to pose
phishing threats to an internet user.
➢ Attackers are finding new ways to break into the existing systems and are able to
phish the data of users. Traditional features are being thoroughly exploited by cyber
attackers.
Inputs:
➢ Importing all the required packages, such as numpy, pandas, matplotlib and scikit-learn, and
the required machine learning algorithm packages.
This project uses an iterative development lifecycle, where components of the application
are developed through a series of tight iterations. The first iteration focuses on very basic
functionality, with subsequent iterations adding new functionality to the previous work and/or
correcting errors identified in the components in production.
The six stages of the SDLC are designed to build on one another, taking outputs from the
previous stage, adding additional effort, and producing results that leverage the previous effort
and are directly traceable to the previous stages. During each stage, additional information is
gathered or developed, combined with the inputs, and used to produce the stage deliverables.
It is important to note that the additional information is restricted in scope; new ideas that
would take the project in directions not anticipated by the initial set of high-level
requirements, or features that are out of scope, are preserved for later consideration.
Too many software development efforts go awry when the development team and
customer personnel get caught up in the possibilities of automation. Instead of focusing on
high-priority features, the team can become mired in a sea of nice-to-have features that are not
essential to solving the problem but are in themselves highly attractive. This is the root cause
of a large percentage of failed and/or abandoned development efforts and is the primary
reason the development team uses the iterative model.
When object orientation is used in analysis as well as design, the boundary between
OOA and OOD is blurred. This is particularly true in methods that combine analysis and
design. One reason for this blurring is the similarity of the basic constructs (i.e., objects and
classes) that are used in OOA and OOD. Though there is no agreement about what parts of
the object-oriented development process belong to analysis and what parts to design, there is
some general agreement about the domains of the two activities.
The fundamental difference between OOA and OOD is that the former models the
problem domain, leading to an understanding and specification of the problem, while the
latter models the solution to the problem. That is, analysis deals with the problem domain,
while design deals with the solution domain. However, in OOAD the solution domain
representation subsumes the analysis representation. That is, the solution domain
representation, created by OOD, generally contains much of the representation created by
OOA. The separating line is a matter of perception, and different people have different views
on it. The lack of clear separation between analysis and design can also be considered one of
the strong points of the object-oriented approach; the transition from analysis to design is
“seamless”. This is also the main reason for OOAD methods, in which analysis and design are
both performed.
The main difference between OOA and OOD, due to the different domains of modeling,
is in the type of objects that come out of the analysis and design process.
Features of OOAD:
• All objects can be represented graphically, including the relations between them.
• All key participants in the system are represented as actors, and the actions they perform
are represented as use cases.
• A typical use case is nothing but a systematic flow of a series of events, which can be well
described using sequence diagrams; each event can be described diagrammatically by
activity as well as state chart diagrams.
• The entire system can therefore be well described using the OOAD model; hence this model is
chosen as the SDLC model.
Preliminary investigation examines project feasibility: the likelihood that the system will be
useful to the organization. The main objective of the feasibility study is to test the technical,
operational and economic feasibility of adding new modules and debugging old running
systems. All systems are feasible given unlimited resources and infinite time. There are
three aspects in the feasibility study portion of the preliminary investigation:
● Technical Feasibility
● Operational Feasibility
● Economical Feasibility
A system that can be developed technically and that will be used if installed must still be a
good investment for the organization. In economic feasibility, the development cost of
creating the system is evaluated against the ultimate benefit derived from the new system.
Financial benefits must equal or exceed the costs.
The system is economically feasible. It does not require any additional hardware or
software. Since the interface for this system is developed using the existing resources and
technologies available at NIC, the expenditure is nominal and the system is certainly
economically feasible.
Proposed projects are beneficial only if they can be turned into an information system
that will meet the organization’s operating requirements. The operational feasibility aspects of
the project are to be taken as an important part of the project implementation. Some of the
important issues raised to test the operational feasibility of a project include the
following:
The well-planned design would ensure the optimal utilization of the computer resources
and would help in the improvement of performance status.
3.9 TECHNICAL FEASIBILITY
The technical issues usually raised during the feasibility stage of the investigation include
the following:
PURPOSE
In software engineering, the same meanings of requirements apply, except that the focus
of interest is the software itself.
• Data analysis
• Data preprocessing
• Model building
• Prediction
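Between data preprocessing and model building, the data has to be divided into a training part and a held-out test part. The following is a minimal standard-library sketch of that step; the helper name `split_rows`, the 20% test fraction, the seed and the toy rows are assumptions for illustration, not the project's actual code.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows of (sample_id, label); real rows would hold the feature values.
rows = [(i, i % 2) for i in range(10)]
train_rows, test_rows = split_rows(rows, test_fraction=0.2)
print(len(train_rows), len(test_rows))
```

The model is fitted on `train_rows` only, so that predictions on `test_rows` estimate how it behaves on unseen data.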
4.2 NON-FUNCTIONAL REQUIREMENTS
Introduction to Django
Django is a Web development framework that saves you time and
makes Web development a joy. Using Django, you can build and maintain high quality Web
applications with minimal fuss. At its best, Web development is an exciting, creative act; at
its worst, it can be a repetitive, frustrating nuisance. Django lets you focus on the fun stuff —
the crux of your Web application — while easing the pain of the repetitive bits. In doing so, it
provides high-level abstractions of common Web development patterns, shortcuts for frequent
programming tasks, and clear conventions for how to solve problems. At the same time,
Django tries to stay out of your way, letting you work outside the scope of the framework as
needed. The goal of this book is to make you a Django expert. The focus is twofold. First, we
explain, in depth, what Django does and how to build Web applications with it. Second, we
discuss higher-level concepts where appropriate, answering the question “How can I apply
these tools effectively in my own projects?” By reading this book, you’ll learn the skills
needed to develop powerful Web sites quickly, with code that is clean and easy to maintain.
Should a developer really have to worry about printing the “Content-Type” line and
remembering to close the database connection? This sort of boilerplate reduces programmer
productivity and introduces opportunities for mistakes. These setup- and teardown-related
tasks would best be handled by some common infrastructure.
CHAPTER - 5
5. LANGUAGES OF IMPLEMENTATION
5.1 Python
What Is A Script?
Basically, a script is a text file containing the statements that comprise a Python
program. Once you have created the script, you can execute it over and over without
having to retype it each time.
Scripts are editable
Perhaps, more importantly, you can make different versions of the script by
modifying the statements from one file to the next using a text editor. Then you can
execute each of the individual versions. In this way, it is easy to create different programs
with a minimum amount of typing.
Just about any text editor will suffice for creating Python script files.
You can use Microsoft Notepad, Microsoft WordPad, Microsoft Word, or just about
any word processor if you want to.
Script:
Scripts are distinct from the core code of the application, which is usually written in a
different language, and are often created or at least modified by the end-user. Scripts are
often interpreted from source code or bytecode, whereas the applications they control are
traditionally compiled to native machine code.
Program:
A program has an executable form that the computer can use directly to execute the
instructions. The same program also has a human-readable source code form, from which
executable programs are derived (e.g., compiled).
Python
What is Python?
Chances are you are asking yourself this. You may have found this book because you
want to learn to program but don’t know anything about programming languages. Or you
may have heard of programming languages like C, C++, C#, or Java and want to know
what Python is and how it compares to “big name” languages. Hopefully I can explain it
for you.
Python concepts
If you're not interested in the hows and whys of Python, feel free to skip to the next chapter.
In this chapter I will try to explain to the reader why I think Python is one of the best
languages available and why it’s a great one to start programming with.
• Python is Interpreted − Python is processed at runtime by the interpreter. You do not need to
compile your program before executing it. This is similar to Perl and PHP.
• Python is Interactive − You can actually sit at a Python prompt and interact with the interpreter directly to
write your programs.
Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
SmallTalk, and Unix shell and other scripting languages.
Python is copyrighted. Its source code is available under the Python Software Foundation
License, an open-source license that is compatible with the GNU General Public License (GPL).
Python is now maintained by a team of core developers, and Guido van
Rossum long held a vital role in directing its progress.
• Easy-to-read − Python code is clearly defined and easy on the eyes.
• Easy-to-maintain − Python's source code is fairly easy to maintain.
• A broad standard library − The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.
• Interactive Mode − Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.
• Portable − Python can run on a wide variety of hardware platforms and has the same interface
on all platforms.
• Extendable − You can add low-level modules to the Python interpreter. These modules enable
programmers to add to or customize their tools to be more efficient.
• Scalable − Python provides a better structure and support for large programs than shell
scripting.
Apart from the above-mentioned features, Python has many other good features; a few are listed
below −
• It supports functional and structured programming methods as well as OOP.
• It can be used as a scripting language or can be compiled to byte-code for building large applications.
• It provides very high-level dynamic data types and supports dynamic type checking.
• It supports automatic garbage collection.
• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
Types
Python is a dynamically typed language. Many other languages are statically typed, such as
C/C++ and Java. A statically typed language requires the programmer to explicitly tell the
computer what type of “thing” each data value is.
For example, in C if you had a variable that was to contain the price of something, you would
have to declare the variable as a “float” type.
This tells the compiler that the only data that can be used for that variable must be a floating
point number, i.e. a number with a decimal point.
Python, however, doesn’t require this. You simply give your variables names and assign
values to them. The interpreter takes care of keeping track of what kinds of objects your
program is using. This also means that you can change the size of the values as you develop
the program. Say you have another decimal number (a.k.a. a floating point number) you need
in your program.
With a statically typed language, you have to decide the memory size the variable can take when
you first declare that variable. A double is a floating point value that can handle a much
larger number than a normal float (the actual memory sizes depend on the operating
environment). If you declare a variable to be a float but later on assign a value that is too big
for it, your program can fail or produce wrong results; you will have to go back and change
that variable to be a double.
With Python, it doesn’t matter. You simply give it whatever number you want and Python will
take care of manipulating it as needed. It even works for derived values.
For example, say you are dividing two numbers. One is a floating point number and one is an
integer. Python realizes that it’s more accurate to keep track of decimals so it automatically
calculates the result as a floating point number.
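The behaviour described above can be tried directly. The variable names in this short sketch are chosen only for illustration.

```python
price = 10                           # no declaration needed; this is an int
kind_before = type(price).__name__   # "int"

price = 10.99                        # the same name can later refer to a float
kind_after = type(price).__name__    # "float"

ratio = 7 / 2.0                      # integer divided by float gives a float
big = 2 ** 100                       # Python integers grow as large as needed

print(kind_before, kind_after, ratio, big)
```

The type belongs to the value rather than to the name, which is why `price` can be rebound from an integer to a floating point number without any change to a declaration.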
5.6 Variables
Variables are nothing but reserved memory locations to store values. This means that when
you create a variable you reserve some space in memory.
Based on the data type of a variable, the interpreter allocates memory and decides what can be
stored in the reserved memory. Therefore, by assigning different data types to variables, you
can store integers, decimals or characters in these variables.
The data stored in memory can be of many types. For example, a person's age is stored as a
numeric value and his or her address is stored as alphanumeric characters. Python has various
standard data types that are used to define the operations possible on them and the storage
method for each of them.
• Numbers
• String
• List
• Tuple
• Dictionary
Python Numbers
Number data types store numeric values. Number objects are created when you assign a value
to them.
Python Strings
Strings in Python are contiguous sequences of characters enclosed in quotation
marks. Python allows either pairs of single or double quotes. Subsets of strings can be
taken using the slice operator ([ ] and [:]), with indexes starting at 0 at the beginning of the
string and, for negative indexes, counting backward from -1 at the end.
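A short sketch of string indexing and slicing follows; the sample text and variable names are arbitrary.

```python
text = "Hello World!"

first_char = text[0]       # indexes start at 0
middle = text[2:5]         # characters at positions 2, 3 and 4
tail = text[2:]            # from position 2 to the end
last_char = text[-1]       # negative indexes count from the end
repeated = text * 2        # repetition
combined = text + "TEST"   # concatenation

print(first_char, middle, tail, last_char)
print(repeated)
print(combined)
```

A slice `text[a:b]` includes position `a` and excludes position `b`, so `text[2:5]` yields three characters.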
Python Lists
Lists are the most versatile of Python's compound data types. A list contains items separated
by commas and enclosed within square brackets ([]). To some extent, lists are similar to arrays
in C. One difference between them is that the items belonging to a list can be of different
data types.
The values stored in a list can be accessed using the slice operator ([ ] and [:]), with indexes
starting at 0 at the beginning of the list and, for negative indexes, counting backward from -1
at the end. The plus (+) sign is
the list concatenation operator, and the asterisk (*) is the repetition operator.
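The list operations described above can be sketched as follows; the sample values are arbitrary.

```python
items = ["abcd", 786, 2.23, "john", 70.2]   # mixed types are allowed
short = [123, "john"]

first = items[0]          # first element
middle = items[1:3]       # elements at positions 1 and 2
last = items[-1]          # last element
repeated = short * 2      # repetition
combined = items + short  # concatenation produces a new list

items[1] = 787            # lists are mutable, so elements can be replaced

print(first, middle, last)
print(repeated)
print(combined)
```

Because `combined` was built before the assignment to `items[1]`, it still holds the original value 786.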
Python Tuples
A tuple is another sequence data type that is similar to the list. A tuple consists of a number of
values separated by commas. Unlike lists, however, tuples are enclosed within parentheses.
The main differences between lists and tuples are: Lists are enclosed in brackets ( [ ] ) and
their elements and size can be changed, while tuples are enclosed in parentheses ( ( ) ) and
cannot be updated. Tuples can be thought of as read-only lists.
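The read-only nature of tuples can be seen in a small sketch; the sample values are arbitrary.

```python
point = ("abcd", 786, 2.23)

first = point[0]     # indexing works as for lists
rest = point[1:]     # slicing returns a new tuple

# Attempting to change an element raises TypeError.
try:
    point[0] = "changed"
except TypeError as exc:
    error_message = str(exc)

# To modify the data, build a list (or a new tuple) from it instead.
as_list = list(point)
as_list[0] = "changed"

print(first, rest)
print(error_message)
print(as_list)
```

The original tuple is unaffected by changes to the list copy.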
Python Dictionary
Python's dictionaries are a kind of hash table. They work like associative arrays or hashes
found in Perl and consist of key-value pairs. A dictionary key can be any hashable Python type,
but keys are usually numbers or strings. Values, on the other hand, can be any arbitrary Python
object.
Dictionaries are enclosed by curly braces ({ }) and values can be assigned and accessed using
square brackets ([]).
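A minimal sketch of creating and reading dictionaries follows; the keys and values are made up for illustration.

```python
lookup = {}
lookup["one"] = "This is one"   # string key
lookup[2] = "This is two"       # integer key

record = {"name": "john", "code": 6734, "dept": "sales"}

code = record["code"]                   # access by key
keys = list(record.keys())              # keys, in insertion order
values = list(record.values())          # values, in the same order
missing = record.get("email", "n/a")    # default instead of KeyError

print(lookup["one"], lookup[2])
print(code, keys, values, missing)
```

Using `get` with a default avoids the `KeyError` that `record["email"]` would raise for an absent key.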
The normal mode is the mode where the scripted and finished .py files are run in the Python
interpreter.
Interactive mode is a command line shell which gives immediate feedback for each statement,
while running previously fed statements in active memory. As new lines are fed into the
interpreter, the fed program is evaluated both in part and in whole.
20 Python libraries
1. Requests. The most famous HTTP library, written by Kenneth Reitz. It’s a must
have for every Python developer.
2. Scrapy. If you are involved in web scraping then this is a must-have library for
you. After using this library you won’t use any other.
3. wxPython. A GUI toolkit for Python. I have primarily used it in place of tkinter.
You will really love it.
4. Pillow. A friendly fork of PIL (Python Imaging Library). It is more user friendly
than PIL and is a must have for anyone who works with images.
5. SQLAlchemy. A database library. Many love it and many hate it. The choice is
yours.
6. BeautifulSoup. I know it’s slow but this XML and HTML parsing library is very
useful for beginners.
7. Twisted. The most important tool for any network application developer. It has
a very beautiful API and is used by a lot of famous Python developers.
8. NumPy. How can we leave out this very important library? It provides some
advanced math functionality to Python.
9. SciPy. When we talk about NumPy then we have to talk about SciPy. It is a
library of algorithms and mathematical tools for Python and has caused many
scientists to switch from Ruby to Python.
10. matplotlib. A numerical plotting library. It is very useful for any data
scientist or data analyst.
11. Pygame. Which developer does not like to play games and develop them? This
library will help you achieve your goal of 2D game development.
12. Pyglet. A 3D animation and game creation engine. This is the engine in which
the famous Python port of Minecraft was made.
13. PyQt. A GUI toolkit for Python. It is my second choice after wxPython for
developing GUIs for my Python scripts.
14. PyGTK. Another Python GUI library. It is the same library in which the famous
BitTorrent client is created.
15. Scapy. A packet sniffer and analyzer for Python, made in Python.
16. pywin32. A Python library which provides some useful methods and classes
for interacting with Windows.
17. nltk. Natural Language Toolkit – I realize most people won’t be using this one,
but it’s generic enough. It is a very useful library if you want to manipulate
strings. But its capacity is beyond that. Do check it out.
20. IPython. I just can’t stress enough how useful this tool is. It is a Python prompt
on steroids. It has completion, history, shell capabilities, and a lot more. Make
sure that you take a look at it.
NumPy:
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually
numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes.
The number of axes is called the rank.
NumPy offers MATLAB-like capabilities within Python.
Matplotlib
5.9 Datasets
The DataSet object is similar to the ADO Recordset object, but more powerful, and
with one other important distinction: the DataSet is always disconnected. The DataSet
object represents a cache of data, with database-like structures such as tables, columns,
relationships, and constraints. However, though a DataSet can and does behave much
like a database, it is important to remember that DataSet objects do not interact
directly with databases, or other source data. This allows the developer to work with a
programming model that is always consistent, regardless of where the source data
resides. Data coming from a database, an XML file, from code, or user input can all be
placed into DataSet objects. Then, as changes are made to the DataSet they can be
tracked and verified before updating the source data. The GetChanges method of the
DataSet object actually creates a second DataSet that contains only the changes to the
data. This DataSet is then used by a DataAdapter (or other objects) to update the
original data source.
The DataSet has many XML characteristics, including the ability to produce and consume
XML data and XML schemas. XML schemas can be used to describe schemas interchanged
via WebServices. In fact, a DataSet with a schema can actually be compiled for type safety
and statement completion.
CHAPTER - 6
6. SYSTEM DESIGN
6.1 INTRODUCTION
Software design sits at the technical kernel of the software engineering process and is applied
regardless of the development paradigm and area of application. Design is the first step in the
development phase for any engineered product or system. The designer’s goal is to produce a
model or representation of an entity that will later be built. Once system requirements
have been specified and analyzed, system design is the first of the three technical activities
(design, code and test) that are required to build and verify software.
The importance of design can be stated with a single word: “Quality”. Design is the place where
quality is fostered in software development. Design provides us with representations of
software that can be assessed for quality. Design is the only way that we can accurately translate a
customer’s view into a finished software product or system. Software design serves as a
foundation for all the software engineering steps that follow. Without a strong design we risk
building an unstable system – one that will be difficult to test, one whose quality cannot be
assessed until the last stage.
6.2 NORMALIZATION
Insertion anomaly: Inability to add data to the database due to the absence of other data.
Deletion anomaly: Unintended loss of data due to the deletion of other data.
Update anomaly: Data inconsistency resulting from data redundancy and partial updates.
Normal Forms: These are the rules for structuring relations that eliminate anomalies.
A relation is said to be in first normal form if the values in the relation are atomic for every
attribute in the relation. By this we mean simply that no attribute value can be a set of values
or, as it is sometimes expressed, a repeating group.
A relation is said to be in second normal form if it is in first normal form and satisfies
any one of the following rules.
3) Every non-key attribute is fully functionally dependent on the full set of primary key attributes.
The above normalization principles were applied to decompose the data into multiple tables,
so that the data is maintained in a consistent state.
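As a small illustration of first normal form, the sketch below takes records whose `flags` attribute holds a repeating group and decomposes them into two tables with atomic values. The record layout, the field names and the helper `to_first_normal_form` are hypothetical and are not taken from the project's data model.

```python
# Each record violates 1NF because "flags" holds a set of values.
unnormalized = [
    {"url_id": 1, "url": "http://example.test/a", "flags": ["has_ip", "long_url"]},
    {"url_id": 2, "url": "http://example.test/b", "flags": ["has_at"]},
]


def to_first_normal_form(records):
    """Split the records into a URL table and a URL-flag table with atomic values."""
    urls = [{"url_id": r["url_id"], "url": r["url"]} for r in records]
    url_flags = [
        {"url_id": r["url_id"], "flag": flag}
        for r in records
        for flag in r["flags"]
    ]
    return urls, url_flags


urls, url_flags = to_first_normal_form(unnormalized)
for row in url_flags:
    print(row)
```

After the decomposition every attribute value is a single value, and a flag can be added or removed for one URL without rewriting the URL row itself.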
6.3 E – R DIAGRAMS
• The relation upon the system is structure through a conceptual ER-Diagram, which not only
specifics the existential entities but also the standard relations through which the system exists
and the cardinalities that are necessary for the system state to continue.
• The entity Relationship Diagram (ERD) depicts the relationship between the data objects. The
ERD is the notation that is used to conduct the date modeling activity the attributes of each data
object noted is the ERD can be described resign a data object descriptions. • The set of primary
components that are identified by the ERD are
The primary purpose of the ERD is to represent data objects and their relationships.
6.4 DATA FLOW DIAGRAMS
A data flow diagram is graphical tool used to describe and analyze movement of data
through a system. These are the central tool and the basis from which the other components
are developed. The transformation of data from input to output, through processed, may be
described logically and independently of physical components associated with the system.
These are known as the logical data flow diagrams. The physical data flow diagrams show the
actual implements and movement of data between people, departments and workstations. A
full description of a system actually consists of a set of data flow diagrams. Using two familiar
notations Yourdon, Gane and Sarson notation develops the data flow diagrams. Each
component in a DFD is labeled with a descriptive name. Process is further identified with a
number that will be used for identification purpose. The development of DFD’S is done in
several levels. Each process in lower level diagrams can be broken down into a more detailed
DFD in the next level. The lop-level diagram is often called context diagram. It consists a
single process bit, which plays vital role in studying the current system. The process in the
context level diagram is exploded into other process at the first level DFD.
The idea behind the explosion of a process into more process is that understanding at
one level of detail is exploded into greater detail at the next level. This is done until further
explosion is necessary and an adequate amount of detail is described for analyst to understand
the process.
Larry Constantine first developed the DFD as a way of expressing system requirements in a
graphical form; this led to modular design.
A DFD, also known as a “bubble chart”, has the purpose of clarifying system requirements
and identifying major transformations that will become programs in system design. So it is the
starting point of the design, down to the lowest level of detail. A DFD consists of a series of bubbles
joined by data flows in the system.
1. The DFD shows the flow of data, not of control; loops and decisions are control considerations
and do not appear on a DFD.
2. The DFD does not indicate the time factor involved in any process, that is, whether the data flows take
place daily, weekly, monthly or yearly.
1. Current Physical
2. Current Logical
3. New Logical
4. New Physical
CURRENT PHYSICAL:
In a current physical DFD, process labels include the names of people or their positions, or the
names of computer systems, that might provide some of the overall system processing; the label
includes an identification of the technology used to process the data. Similarly, data flows and
data stores are often labeled with the names of the actual physical media on which data are
stored, such as file folders, computer files, business forms or computer tapes.
CURRENT LOGICAL:
The physical aspects of the system are removed as much as possible so that the current system
is reduced to its essence: the data and the processes that transform them, regardless of
their actual physical form.
NEW LOGICAL:
This is exactly like the current logical model if the user was completely happy with
the functionality of the current system but had problems with how it
was implemented. Typically, though, the new logical model will differ from the current logical
model by having additional functions, obsolete functions removed and inefficient flows
reorganized.
NEW PHYSICAL:
The new physical represents only the physical implementation of the new system.
6.5 UML Diagrams
6.5.1 Use Case Diagram
[Figure: use case diagram. Recoverable labels: Data Understanding, Data Analytics (EDA), Dataset, Trained Dataset, Particular Data, Predictive Learning, Model Building, Model Evaluation, Test/Test Split.]
EXPLANATION:
The primary purpose of a use case diagram is to show which system functions are performed for which
actor. The roles of the actors in the system can be depicted. The above diagram has the user as the
actor. Each actor plays a specific role in achieving the goal.
6.5.2 Class Diagram
[Figure: class diagram. Recoverable labels: Model Evaluation, dataset, traineddataset(), particulardata().]
EXPLANATION
This class diagram shows how the classes, with their attributes and methods, are linked together to perform the
verification with security. The above diagram shows the various classes involved in our project.
[Figure: object diagram; recoverable label: Model Evaluation]
EXPLANATION:
The above diagram shows the flow of objects between the classes. An object diagram shows a complete or partial
view of the structure of a modeled system. Here it shows how the classes, with their attributes and methods, are
linked together to perform the verification with security.
6.5.4 Component Diagram
[Figure: component diagram; recoverable labels: Trained Dataset, Particular Data]
EXPLANATION:
A component provides the set of required interfaces that it realizes or implements. Component diagrams are among
the static diagrams of the Unified Modeling Language. They are used to represent the working and behavior of the
various components of a system.
6.5.5 Deployment Diagram
[Figure: deployment diagram with the nodes Data Understanding, Data Analysis (EDA), Train Data, Model Building and Model Evaluation]
EXPLANATION:
A UML deployment diagram shows the configuration of run-time processing nodes and the components that live on
them. Deployment diagrams are a kind of structure diagram used in modeling the physical aspects of an
object-oriented system. They are often used to model the static deployment view of a system.
6.5.6 State Diagram
[Figure: state diagram with the states Dataset, Split Data, Model-Building Phase, Particular Data and Machine Learning]
EXPLANATION:
State diagrams are loosely defined diagrams that show workflows of stepwise activities and actions, with support
for choice, iteration and concurrency. State diagrams require that the system described is composed of a finite
number of states; sometimes this is indeed the case, while at other times it is a reasonable abstraction. Many
forms of state diagrams exist, which differ slightly and have different semantics.
6.5.7 Sequence Diagram
[Figure: sequence diagram with the messages Datasets Transfers, Datasets, Training Data, Predictive Learning, Machine Learning, Trained Dataset, Split Data and Particular Data]
EXPLANATION:
UML sequence diagrams are interaction diagrams that detail how operations are carried out. They capture the
interaction between objects in the context of a collaboration. Sequence diagrams are time-focused: they show the
order of the interaction visually by using the vertical axis of the diagram to represent time, that is, what
messages are sent and when.
6.5.8 Collaboration Diagram
[Figure: collaboration diagram. Objects: Data Understanding, Data Analysis (EDA), Train Data, Model Building, Model Evaluation, Dataset. Messages: 1: Datasets Transfers, 2: Datasets, 3: Training Data, 4: Predictive Learning, 5: Machine Learning, 6: Trained Dataset, 8: Split Data, 9: Particular Data]
EXPLANATION:
Collaboration diagrams are used to show how objects interact to perform the behavior of a particular use case, or
a part of a use case. Along with sequence diagrams, collaboration diagrams are used by designers to define and
clarify the roles of the objects that perform a particular flow of events of a use case. They are the primary
source of information used to determine class responsibilities and interfaces.
6.5.9 Activity Diagram
[Figure: activity diagram; recoverable label: Dataset]
EXPLANATION:
Activity diagrams are loosely defined diagrams that show workflows of stepwise activities and actions, with
support for choice, iteration and concurrency. In UML, activity diagrams can be used to describe the business and
operational step-by-step workflows of components in a system. UML activity diagrams can also model the internal
logic of a complex operation. In many ways UML activity diagrams are the object-oriented equivalent of flow
charts and data flow diagrams (DFDs) from structured development.
6.5.10 System Architecture
[Figure: system architecture. DATASET -> EXPLORATORY DATA ANALYTICS -> TRAIN/TEST SPLIT -> MODEL BUILDING/HYPER PARAMETER TUNING -> MODEL EVALUATION -> RESULT]
CHAPTER - 7
[Link]
7.1 Data Collection
We collected the phishing websites dataset from Kaggle. It contains a
mix of phishing and legitimate URL features. The dataset has 11055 rows
and 31 columns.
7.2 Exploratory Data Analysis
We loaded the dataset into a Python IDE with the pandas package and
checked for missing values. There were none. We then removed one
column that was not needed for our process. The columns remaining in
the dataset are shown below.
Figure: 7.2.1
We analyzed the data and made the following observations:
Figure: 7.2.2
The count plot above, drawn with the seaborn package, shows the number
of samples for each value of the target variable.
The plot below, drawn with the matplotlib package, shows the
correlation among the features.
From this analysis we concluded that the task is a classification
problem in which the target variable is Result.
X = df.drop('Result', axis=1)
Figure: 7.3.1
Figure: 7.3.2
7.4.1 Logistic Regression
Logistic regression is named for the function used at the core of the
method, the logistic function. The logistic function, also called the
sigmoid function, was developed by statisticians to describe properties of
population growth in ecology, rising quickly and maxing out at the carrying
capacity of the environment. It is an S-shaped curve that can take any
real-valued number and map it into a value between 0 and 1, but never
exactly at those limits.
1 / (1 + e^-value)
Where e is the base of the natural logarithms (Euler’s number or the EXP()
function in your spreadsheet) and value is the actual numerical value that
you want to transform. Below is a plot of the numbers between -5 and 5
transformed into the range 0 and 1 using the logistic function.
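As a minimal sketch of the formula above, the logistic function can be written in a few lines of Python. The function name logistic is our own choice for illustration; only the standard math module is used.

```python
import math


def logistic(value: float) -> float:
    """Map a real number into the open interval (0, 1): 1 / (1 + e^-value)."""
    return 1.0 / (1.0 + math.exp(-value))


# Transform the integers from -5 to 5, the range discussed in the text.
for v in range(-5, 6):
    print(f"{v:+d} -> {logistic(v):.4f}")
```

The value 0 maps to exactly 0.5, and the outputs for v and -v always add up to 1, which is what gives the curve its symmetric S shape.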
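For example, if we model a person's sex as male or female from their
height, the model gives the probability of male given the height:
P(sex=male|height)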
Written another way, we are modeling the probability that an input (X)
belongs to the default class (Y=1), we can write this formally as:
P(X) = P(Y=1|X)
Logistic regression is a classification algorithm even though it predicts
probabilities: the predicted probability must be transformed into a binary
value (0 or 1) in order to actually make a class prediction. Logistic
regression is a linear method, but the predictions are transformed using the
logistic function. As a result, we can no longer understand the predictions
as a linear combination of the inputs, as we can with linear regression.
Continuing on from above, the model can be stated as:
p(X) = e^(b0 + b1 * X) / (1 + e^(b0 + b1 * X))
Taking the natural logarithm of the odds, p(X) / (1 - p(X)), gives an
expression that is linear in X:
ln(p(X) / (1 - p(X))) = b0 + b1 * X
We can move the exponent back to the right and write it as:
odds = e^(b0 + b1 * X)
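The relation between probability, odds and log-odds can be checked numerically. The sketch below is illustrative only: the coefficients b0 and b1, the input x and the 0.5 threshold are hypothetical values chosen for the example, not values fitted in this project.

```python
import math


def probability(b0: float, b1: float, x: float) -> float:
    """p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))


def odds(p: float) -> float:
    """Odds of the default class: p / (1 - p)."""
    return p / (1.0 - p)


def classify(p: float, threshold: float = 0.5) -> int:
    """Turn a probability into a binary class label (assumed 0.5 cut-off)."""
    return 1 if p >= threshold else 0


b0, b1, x = -1.0, 0.5, 4.0          # hypothetical coefficients and input
p = probability(b0, b1, x)
print("probability:", round(p, 4))
print("odds:", round(odds(p), 4), "e^(b0 + b1*x):", round(math.exp(b0 + b1 * x), 4))
print("log-odds:", round(math.log(odds(p)), 4), "b0 + b1*x:", b0 + b1 * x)
print("class:", classify(p))
```

The printed odds agree with e^(b0 + b1 * x) and the log-odds agree with b0 + b1 * x, which is the sense in which logistic regression is linear.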
We trained a logistic regression model on the training split and tested it on the
test split.
Accuracy score - 0.92
Confusion Matrix:
Phishing Non-Phishing
Table:[Link]
Classification Report:
Precision Recall F1-Score Support
Phishing 0.94 0.92 0.93 901
Table:[Link]
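The quantities reported in the confusion matrix and classification report can be computed by hand from the true and predicted labels. The following is a minimal sketch on made-up labels, assuming that -1 marks a phishing site and 1 a legitimate one; the function name binary_report and the toy data are ours and are not taken from the project results.

```python
def binary_report(y_true, y_pred, positive):
    """Confusion-matrix counts and derived scores for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": tp + fn,   # number of true samples of the class
    }


# Toy labels (assumed encoding: -1 = phishing, 1 = legitimate).
y_true = [-1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
y_pred = [-1, -1, -1, 1, 1, 1, 1, 1, 1, -1]
report = binary_report(y_true, y_pred, positive=-1)
print(report)
```

Precision answers "of the sites flagged as phishing, how many really were", while recall answers "of the real phishing sites, how many were flagged"; the F1-score is their harmonic mean.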
Random forest is a supervised learning algorithm that can be used for both classification
and regression. It is flexible and easy to use. A forest is made up of trees, and in general
the more trees it has, the more robust the forest is. A random forest builds decision trees
on randomly selected data samples, gets a prediction from each tree and selects the best
solution by means of voting. It also provides a good indicator of feature importance.
Advantages:
• Random forest is considered a highly accurate and robust method because of the
number of decision trees participating in the process.
• It is much less prone to overfitting than a single decision tree, because it combines
the predictions of many trees, which averages out their individual errors.
• Random forests can also handle missing values. There are two ways to handle these: using median
values to replace continuous variables, and computing the proximity-weighted average of missing
values.
It works in four steps:
1. Select random samples from a given dataset.
2. Construct a decision tree for each sample and get a prediction result from each decision
tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.
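The four steps above can be sketched without any library. This is a simplified illustration, not the scikit-learn implementation: each "tree" is reduced to a one-feature lookup table (a decision stump), and the names fit_stump, fit_forest and predict, the toy data and the parameters n_trees and seed are all assumptions made for the example.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """A one-feature stand-in for a tree: majority label per feature value."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature], Counter())[label] += 1
    default = Counter(labels).most_common(1)[0][0]
    table = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return lambda row: table.get(row[feature], default)


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        # Step 1: draw a random sample (with replacement) from the dataset.
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: build one tree on that sample, here on one random feature.
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def predict(forest, row):
    # Step 3: collect one vote per tree. Step 4: return the most voted label.
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy data in which every feature agrees with the label.
rows = [(1, 1, 1)] * 4 + [(-1, -1, -1)] * 4
labels = [1] * 4 + [-1] * 4
forest = fit_forest(rows, labels)
print(predict(forest, (1, 1, 1)), predict(forest, (-1, -1, -1)))
```

Because every tree sees a different random sample, an unlucky sample can mislead a single tree, but it rarely misleads the majority of the trees.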
We trained a random forest model on the training split and tested it on the test split.
Accuracy score - 0.96
Confusion Matrix:
Phishing Non-Phishing
Table:[Link]
Classification Report:
Precision Recall F1-Score Support
Table:[Link]
A decision tree is a flowchart-like tree structure in which an internal node represents a
feature, a branch represents a decision rule and each leaf node represents an outcome. The
topmost node in a decision tree is known as the root node. The tree learns to partition the
data on the basis of attribute values, and it does so recursively, which is called recursive
partitioning. This flowchart-like structure helps in decision making. Its visualization
resembles a flowchart diagram, which mimics human-level thinking. That is why decision
trees are easy to understand and interpret.
The decision tree algorithm works as follows:
1. Select the best attribute using Attribute Selection Measures (ASM) to split the
records.
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Build the tree by repeating this process recursively for each child until one of
the following conditions is met:
o All the tuples belong to the same attribute value.
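Step 1 can be illustrated with one common attribute selection measure, the Gini impurity. The sketch below is an assumption for illustration: the text does not state which measure was used in this project, and the function names and toy data are ours.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, labels, feature):
    """Impurity left after splitting the records on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(rows)
    return sum(len(group) / n * gini(group) for group in groups.values())


def best_attribute(rows, labels):
    """Pick the attribute whose split leaves the lowest impurity."""
    return min(range(len(rows[0])), key=lambda f: weighted_gini(rows, labels, f))


# Toy records: attribute 0 separates the classes, attribute 1 does not.
rows = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
labels = [1, 1, -1, -1]
print("impurity before split:", gini(labels))
print("after split on 0:", weighted_gini(rows, labels, 0))
print("after split on 1:", weighted_gini(rows, labels, 1))
print("best attribute:", best_attribute(rows, labels))
```

The chosen attribute becomes the decision node, and the same selection is then repeated on each subset, which is the recursion described in steps 2 and 3.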
Pros
• It requires less data preprocessing from the user; for example, there is no need to
normalize columns.
• It can be used for feature engineering, such as predicting missing values, and is
suitable for variable selection.
• The decision tree makes no assumptions about the distribution of the data because of
the non-parametric nature of the algorithm.
Cons
• A small variation in the data can result in a different decision tree. This variance
can be reduced by bagging and boosting algorithms.
• Decision trees are biased on imbalanced datasets, so it is recommended to balance
the dataset before creating the decision tree.
Classification Report:
Table:[Link]
Table:[Link]
The naive Bayes classifier is a generative model for classification. Before the advent of
deep learning and its easy-to-use libraries, the Naive Bayes classifier was one of the
widely deployed classifiers for machine learning applications. Despite its simplicity, the
naive Bayes classifier performs quite well in many applications.
A Naive Bayes classifier is a probabilistic machine learning model that is used for
classification tasks. The crux of the classifier is Bayes' theorem.
Bayes Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Using Bayes' theorem, we can find the probability of A happening, given that B has
occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that
the predictors/features are independent, that is, the presence of one particular feature does
not affect the others. Hence it is called naive.
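A small worked example may make the theorem and the independence assumption concrete. All probabilities below are hypothetical numbers chosen for illustration, not estimates from the project dataset, and the function name naive_bayes_posterior is ours.

```python
import math


def naive_bayes_posterior(prior_a, likelihoods_given_a, likelihoods_given_not_a):
    """P(A | evidence) for independent pieces of evidence B1..Bn.

    Under the naive assumption the joint likelihood is the product of the
    per-feature likelihoods, and P(B) is obtained by summing over A and not-A.
    """
    joint_a = prior_a * math.prod(likelihoods_given_a)
    joint_not_a = (1.0 - prior_a) * math.prod(likelihoods_given_not_a)
    return joint_a / (joint_a + joint_not_a)


# Hypothetical numbers: A = "site is phishing", prior P(A) = 0.4.
# One observed feature with P(B|A) = 0.9 and P(B|not A) = 0.2.
print(naive_bayes_posterior(0.4, [0.9], [0.2]))

# Two observed features treated as independent.
print(naive_bayes_posterior(0.4, [0.9, 0.8], [0.2, 0.4]))
```

With one feature the posterior is 0.36 / (0.36 + 0.12) = 0.75; adding a second, independent feature raises it to 0.288 / 0.336, about 0.857.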
Confusion Matrix:
Phishing Non-Phishing
Table:[Link]
Classification Report:
Precision Recall F1-Score Support
Table:[Link]
Support Vector Machine (SVM) is one of the most popular supervised learning algorithms. It
can be used for classification as well as regression problems, but it is primarily used
for classification problems in machine learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes, so that a new data point can easily be put in the
correct category in the future. This best decision boundary is called a hyperplane. SVM
chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed Support Vector Machine.
Types of SVM
Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
We always create the hyperplane that has the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane, and which affect the position
of the hyperplane, are termed support vectors. Since these vectors support the hyperplane,
they are called support vectors.
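The ideas of hyperplane, margin and support vectors can be illustrated geometrically. The sketch below does not train an SVM: it takes a hyperplane w.x + b = 0 as given (the weights w, the offset b and the points are hypothetical) and only measures which points lie closest to it.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))


def margin_and_support_vectors(w, b, points):
    """Smallest distance to the hyperplane and the points that attain it."""
    distances = [distance_to_hyperplane(w, b, p) for p in points]
    margin = min(distances)
    support = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
    return margin, support


# Hypothetical hyperplane x1 + x2 = 0 and four points, two on each side.
w, b = (1.0, 1.0), 0.0
points = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
margin, support = margin_and_support_vectors(w, b, points)
print("margin:", margin)
print("support vectors:", support)
print("sides:", [1 if decision_value(w, b, p) > 0 else -1 for p in points])
```

Moving either of the two closest points would change the best hyperplane, while moving the two distant points (without crossing the margin) would not; this is why only the support vectors matter for the decision boundary.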
Advantages
SVM Classifiers offer good accuracy and perform faster prediction compared to Naïve Bayes
algorithm. They also use less memory because they use a subset of training points in the
decision phase. SVM works well with a clear margin of separation and with high dimensional
space.
Confusion Matrix:
Phishing Non-Phishing
Table:[Link]
Classification Report:
Precision Recall F1-Score Support
Table:[Link]
The KNN algorithm assumes that similar things exist in close proximity. In other words,
similar things are near to each other. KNN captures the idea of similarity (sometimes
called distance, proximity, or closeness) with some mathematics we might have learned in
our childhood— calculating the distance between points on a graph.
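The distance-based idea can be sketched directly with the standard library. The function name knn_predict, the choice of k = 3, the Euclidean distance and the toy points are assumptions made for this illustration.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    # Sort the training points by Euclidean distance to the query point.
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters.
train_rows = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [1, 1, 1, -1, -1, -1]
print(knn_predict(train_rows, train_labels, (0.5, 0.5)))
print(knn_predict(train_rows, train_labels, (5.5, 5.5)))
```

A query point near the first cluster receives that cluster's label because its three closest neighbours all belong to it, which is the "similar things are near each other" assumption in code form.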
Advantages
Table:[Link]
Classification Report:
Precision Recall F1-Score Support
Table:[Link]
XGBoost is a powerful machine learning algorithm especially where speed and accuracy
are concerned. XGBoost (eXtreme Gradient Boosting) is an advanced implementation
of gradient boosting algorithm.
ADVANTAGES
1. Regularization:
• Unlike XGBoost, a standard GBM implementation has no regularization. The
regularization in XGBoost helps to reduce overfitting.
High Flexibility
Tree Pruning:
• A GBM would stop splitting a node when it encounters a negative loss in the
split. Thus it is more of a greedy algorithm.
• XGBoost, on the other hand, makes splits up to the max_depth specified and then
starts pruning the tree backwards, removing splits beyond which there is no
positive gain.
• Another advantage is that sometimes a split with a negative loss, say -2, may be
followed by a split with a positive loss of +10. GBM would stop as soon as it
encounters the -2, but XGBoost will go deeper, see the combined effect of +8 for
the two splits and keep both.
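The -2 / +10 example can be traced with a small sketch. This is a simplified illustration on a single branch, not XGBoost's actual implementation (real pruning also depends on settings such as the gamma parameter); the function names and the list-of-gains representation are our own assumptions.

```python
def greedy_depth(gains):
    """Stop at the first split whose gain is not positive (greedy behaviour)."""
    depth = 0
    for gain in gains:
        if gain <= 0:
            break
        depth += 1
    return depth


def grow_then_prune_depth(gains, max_depth):
    """Grow to max_depth, then prune backwards from the deepest split.

    A split is kept when its own gain plus the total gain of the splits kept
    beneath it is positive; otherwise it and everything beneath it are removed.
    """
    kept = list(gains[:max_depth])
    depth = len(kept)
    below = 0.0  # total gain of the splits currently kept beneath this one
    for i in range(len(kept) - 1, -1, -1):
        total = kept[i] + below
        if total > 0:
            below = total
        else:
            depth = i
            below = 0.0
    return depth


gains = [-2, 10]  # the example from the text: -2 followed by +10
print("greedy keeps", greedy_depth(gains), "splits")
print("grow-then-prune keeps", grow_then_prune_depth(gains, max_depth=2), "splits")
```

For the gains -2 and +10 the greedy rule keeps no split at all, while growing first and pruning afterwards sees the combined +8 and keeps both.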
Built-in Cross-Validation
Confusion Matrix:
Phishing Non-Phishing
Table:[Link]
Classification Report:
Precision Recall F1-Score Support
Table:[Link]
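import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Load the dataset and inspect it
df=pd.read_csv('C:/Users/[Link]/Documents/phishing/[Link]')
df.head()
df.info()
# Count missing values per column
df.isnull().sum()
df.columns
# Count plot of the target variable
sns.countplot(x='Result',data=df)
# Separate the features from the target
x=df.drop(['Result','index'],axis=1)
y=df['Result']
# Hold out 20% of the rows for testing
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)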
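# Logistic Regression
# Assumed: default LogisticRegression settings, mirroring the SVC block below.
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
# Accuracy Score
from sklearn.metrics import accuracy_score
a1=accuracy_score(y_test,y_pred)
a1
# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
Output: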
# Random Forest
Output:
# Support Vector Machine
from sklearn.svm import SVC
sv=SVC()
sv.fit(x_train,y_train)
y_pred=sv.predict(x_test)
from sklearn.metrics import accuracy_score
a4=accuracy_score(y_test,y_pred)
a4
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
# Root mean square error between true and predicted labels
MSE = np.square(np.subtract(y_test,y_pred)).mean()
RMSE4 = np.sqrt(MSE)
print("Root Mean Square Error:\n",RMSE4)
# Predict the class of a single sample with 30 feature values
sv.predict([[-1,-1,-1,1,-1,1,-1,1,1,-1,-1,1,1,1,1,-1,-1,-1,-1,1,-1,1,-1,-1,-1,1,-1,1,-1,-1]])
Output:
# Naive Bayes
sns.barplot(x='Algorithm',y='Accuracy',data=df1)
plt.xticks(rotation=90)
plt.title('Comparison of Accuracy Levels for various algorithms')
CODING AND SCREENSHOT
#!/usr/bin/env python
# coding: utf-8
# Thanks to the users over at #python on [Link] (formerly Freenode), for answering newbie questions.
# Special thanks to the helpful people on Twitter and Discord, for code, kits to test, ideas, and general education in the space:
# @nullcookies @dyngnosis @olihough86 @dave_daves @JCyberSec_ @n0p1shing @ANeilan @selenalarson @sysgoblin @PaulWebSec @BushidoToken @sjhilt @phage_nz
#
#
# Version 2.6.5
import os
import time
import gzip
import zipfile
import rarfile
import tarfile
import sys
import argparse
from collections import defaultdict
from datetime import datetime
# Script directions and basic settings. This also generates the help listing.
####################################################################################
parser = argparse.ArgumentParser(description='Kit Hunter v2.6.5')
group = parser.add_mutually_exclusive_group()
args = parser.parse_args()
# Show matching lines or not? The default is to always show matching lines.
####################################################################################
if [Link]:
line_match = False
else:
line_match = True
####################################################################################
# Show archives and files that have zero matches, or not? The default is to show zero matches.
####################################################################################
if [Link]:
detect_zero_matches = False
else:
detect_zero_matches = True
####################################################################################
# Custom directory scanning arguments. Custom will require a tag file that is in the same directory as Kit Hunter.
# Quick scanning will need to be configured up top, and pointed to the directory where the quick scan tag folder resides.
# By default, once configured properly, the script will run in the current working directory, with full scan options.
####################################################################################
if [Link]:
import glob
databses = filter([Link], [Link]('./*.tag'))
if not databses:
print ("No custom tag files found!\n\n")
[Link]()
for databse in databses:
kh_tag_path = [Link]()
break
else:
print ("No custom tag files found!\nYou need to make a custom .tag file in the same directory Kit Hunter is running from.\nSee Help for more details.\n")
[Link]()
elif [Link]:
kh_tag_path = kh_quick_scan
elif [Link]:
kh_tag_path = kh_shell_scan
else:
kh_tag_path = kh_full_scan
####################################################################################
# These are directed help modules for the directory and custom scan switches.
####################################################################################
if [Link]:
print ("")
print ("")
print ("==================================================================")
print (" Kit Hunter Help: Using the [-d] switch ")
print ("==================================================================")
print ("")
print ("Kit Hunter is designed to be launched from within the directory you")
print ("wish to scan. As such, it will scan the current working directory by")
print ("default.")
print ("")
print ("The recommended search should start within the directory you wish to")
print ("scan, or a directory above it. For website administrators, that means")
print ("starting from /www/ or /public_html/, or a directory above.")
print ("")
print ("However, you can trigger a custom directory scan by using the -d switch.")
print ("You need to make sure you /use/a/full/path/ and remember the trailing slash.")
print ("")
print ("Example: kit_hunter_2.py -mlqd /this/is/the/full/path/")
print ("")
print ("On error, the script will terminate with a message.")
print ("")
print ("Note: The -d switch must be called last followed by the directory. (See example)")
print ("You can use -d along with the [-m] and/or [-l] switches, and one of the")
print ("following: [-c], [-q], [-s].")
print ("")
print ("==================================================================")
print ("")
print ("")
[Link]()
if [Link]:
print ("")
print ("")
print ("==================================================================")
print (" Kit Hunter Help: Using the [-c] switch ")
print ("==================================================================")
print ("")
print ("Using the [-c] switch in Kit Hunter enables custom scanning. To use")
print ("custom scanning, you will need to place a single .tag file in the same directory")
print ("where Kit Hunter is running from. The script will then scan from this")
print ("new tag file only, but otherwise operate as usual.")
print ("")
print ("This function will allow you to search for custom strings and other")
print ("elements, no matter what they are. If the custom tags are constant")
print ("indicators, then you might consider taking the custom .tag file and")
print ("giving it a name, before saving it in the tag_files directory.")
print ("")
print ("Remember to avoid having any whitespace in the tag file, and to place")
print ("each keyword on its own line. See existing .tag files as examples.")
print ("")
print ("You cannot use the [-c] switch with [-q] or [-s]")
print ("")
print ("==================================================================")
print ("")
print ("")
[Link]()
if [Link]:
print ("")
print ("")
print ("==================================================================")
print (" Kit Hunter Help: Using the [-q] switch ")
print ("==================================================================")
print ("")
print ("The [-q] switch in Kit Hunter activates the quick scan function, and")
print ("enables a quick scan of the target directory. The quick scan uses a")
print ("small tag file with basic, but very common phishing detections. It ")
print ("won't find everything, but it will find many of the typical phishing kits")
print ("that exist in the wild.")
print ("")
print ("Any detections made with quick scan should be immediately investigated,")
print ("as the tags are all medium to high-confidence markers.")
print ("")
print ("You cannot use the [-q] switch with [-c] or [-s]")
print ("")
print ("==================================================================")
print ("")
print ("")
[Link]()
if [Link]:
print ("")
print ("")
print ("==================================================================")
print (" Kit Hunter Help: Using the [-s] switch ")
print ("==================================================================")
print ("")
print ("The [-s] switch in Kit Hunter activates a special type of scanning.")
print ("")
print ("Calling this switch alone, or with the [-d] switch will enable you to")
print ("scan for common shell scripts. Shell scripts are often packaged with")
print ("phishing kits, or used to install phishing kits on webservers.")
print ("")
print ("The existence of a shell script on a webserver is a serious problem")
print ("and should be investigated immediately.")
print ("")
print ("Usage: kit_hunter_2.py -s")
print ("- or -")
print ("Usage: kit_hunter_2.py -sd /this/is/the/full/path/")
print ("")
print ("You cannot use the [-c] switch with [-q] or [-s]")
print ("")
print ("==================================================================")
print ("")
print ("")
[Link]()
# Several tag files were created for this release. However, you can have as many as you want.
# Just remember to give the file the .tag extension so the script picks it up.
# Tag files should have no whitespace or empty lines.
####################################################################################
tag_files_ext = '.tag'
# You can list files to be ignored below. Just replace the example names and extension with your own.
# If you only need to exclude a single file, then you'd place the name between the [ ] and format it
# like this: ['[Link]']
#
# If the listed files do not exist, then this aspect of the generation process does nothing.
# Keep in mind, the ignore focuses on the directory that Kit Hunter is launched from.
####################################################################################
files_to_ignore_in_current_directory = ['[Link]', '[Link]'] + [generated_report_file_name]
files_to_ignore_in_current_directory = [os.path.join(directory_path, f) for f in files_to_ignore_in_current_directory]
##########################################
# DO NOT EDIT BELOW THIS LINE #
##########################################
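def get_contents_of_tag_files(directory_path):
    tag_files_content = dict()
    for file in os.listdir(directory_path):
        if file.endswith(tag_files_ext):
            file_path = os.path.join(directory_path, file)
            # Tag files are read in binary mode, so each tag is a bytes object.
            with open(file_path, "rb") as f:
                file_contents = f.read().splitlines()
            tag_files_content[file] = file_contents
    return tag_files_content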
def tag_files_reverse_lookup(tag_file_contents):
reverse_lookup = {}
for file_name, tags in tag_file_contents.items():
for tag in tags:
if not tag in reverse_lookup:
reverse_lookup[tag] = []
reverse_lookup[tag].append(file_name)
return reverse_lookup
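def get_compressed_files(directory_path, supported_compressed_files_formats):
    files = []
    # Walk the directory tree and collect every file with a supported archive extension.
    folder_paths = [dirpath for dirpath, _, _ in os.walk(directory_path)]
    for folder_path in folder_paths:
        for file in os.listdir(folder_path):
            for supported_format in supported_compressed_files_formats:
                if file.endswith(supported_format):
                    files.append(os.path.join(folder_path, file))
    return files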
files_contents = dict()
for file_info in [Link]():
file_name = file_info.filename
if file_is_supported(file_name, supported_file_formats):
ifile = [Link](file_info)
file_contents = [Link]().splitlines()
files_contents[file_name] = file_contents
return files_contents
files_contents = dict()
for file_info in [Link]():
file_name = file_info.filename
if file_is_supported(file_name, supported_file_formats):
ifile = [Link](file_info)
file_contents = [Link]().splitlines()
files_contents[file_name] = file_contents
return files_contents
files_contents = dict()
for file_name in [Link]():
f = [Link](file_name)
if f:
if file_is_supported(file_name, supported_file_formats):
file_contents = [Link]().splitlines()
files_contents[file_name] = file_contents
return files_contents
files_contents = dict()
for file_name in [Link]():
f = [Link](file_name)
if f:
if file_is_supported(file_name, supported_file_formats):
file_contents = [Link]().splitlines()
files_contents[file_name] = file_contents
return files_contents
# During testing, there were instances where the script failed, and reported the archive as corrupt.
# In these instances, manual inspection is required. The traceback will alert to the archive name and location.
####################################################################################
def get_content_of_compressed_file(directory_path, filename, compression_format):
    file_contents = None
    if compression_format == '.zip':
        file_contents = get_contents_of_zip_file(directory_path, filename, supported_file_formats)
    else:
        # Raising a plain string is a TypeError in Python 3, so raise a real exception.
        raise ValueError("Unsupported Format")
    return file_contents
if not found_files_dict.get(file_name):
found_files_dict[file_name] = {}
# Report Generation
####################################################################################
def create_report(directory_path, filename, found_files, found_tags, found_lines, found_files_dict, tag_file_reverse_lookup, error=None):
    # Checking if we've got a folder or an archive
    is_folder = True if filename == '' else False
    file_type = 'Folder' if is_folder else 'Archive'
    report = []
    if error:
        report.append('| ===============================================================================================\n')
        report.append('| Encountered the following errors:\n')
        report.append('|\n')
        report.append(f"| {error['identifier']} - {str(error['exception'])}\n")
        report.append('|\n')
    else:
        # Nothing found: write a short "clean" report and return early.
        if len(found_files) == len(found_tags) == len(found_lines) == 0:
            report.append('| ===============================================================================================\n')
            report.append('| ' + file_type + ' Scanned:\n')
            report.append('| \n')
            report.append('| ' + str(dir_path) + '\n')
            report.append('| \n')
            report.append('| ===============================================================================================\n')
            report.append('| No known phishing keywords were discovered in any file.\n\n\n\n\n\n')
            return report
        report.append('| ===============================================================================================\n')
        report.append('| ' + file_type + ' Scanned:\n')
        report.append('| \n')
        report.append('| ' + str(dir_path) + '\n')
        report.append('| \n')
        report.append('| ===============================================================================================\n')
        # TAG MATCHING
        ####################################################################################
        found_tags_by_tag_file = {}
        for ft in found_tags:
            # A tag can exist in multiple files; here we just take the 1st file it appeared in.
            tag_file = tag_file_reverse_lookup.get(ft, ['not found'])[0]
            if tag_file not in found_tags_by_tag_file:
                found_tags_by_tag_file[tag_file] = []
            found_tags_by_tag_file[tag_file].append(ft)
        for tag_file, found_tags in found_tags_by_tag_file.items():
            report.append('|\n')
            report.append('| ===============================================================================================\n')
            report.append(f'| The following tag file reported matches: {tag_file}\n')
            report.append('| ===============================================================================================\n')
            report.append('| \n')
            for ft in found_tags:
                report.append('| Tag: ' + str(ft)[1:200] + '\n')
        # LINE MATCHING
        ####################################################################################
        if line_match is not False:
            report.append('| \n')
            report.append('| ===============================================================================================\n')
            report.append('| The following lines contained the previously flagged phishing tags:\n')
            report.append('| ===============================================================================================\n')
            report.append('| \n')
            for fl in found_lines:
                report.append('| Line: '+ str(fl)[1:300] + '\n')
            report.append('| ===============================================================================================\n')
        else:
            report.append('| ===============================================================================================\n')
        report.append('\n\n\n\n\n')
    return report
def write_report(overall_report):
    # Each entry of overall_report is either a string or a list of lines.
    with open(os.path.join(directory_path, generated_report_file_name), "w+") as f:
        for report in overall_report:
            f.writelines(report)
overall_report = []
tag_file_contents = get_contents_of_tag_files(directory_path)
errors = []
tag_file_reverse_lookup = tag_files_reverse_lookup(tag_file_contents)
try:
found_files, found_tags, found_lines, found_files_dict = search_tag_strings(folder_contents, tag_file_contents)
except Exception as e:
error = {"dir" : directory_path, "identifier" : folder, "exception" : e}
[Link](error)
overall_report.append(report)
if folder_files_errors:
    overall_report.append('| ===============================================================================================\n')
    overall_report.append('| Scan Error Report:\n')
    overall_report.append('| ===============================================================================================\n')
    overall_report.append('| The following errors occurred during processing:\n')
    overall_report.append('| ===============================================================================================\n')
    report = []
    # One entry per failed file: where the error happened and what it was.
    for error in folder_files_errors:
        report.append(f"| Error Location:\n")
        report.append(f"| {error['identifier']}\n")
        report.append('|\n')
        report.append(f"| Error Type:\n")
        report.append(f"| {str(error['exception'])}\n")
        report.append('| ===============================================================================================\n')
    overall_report.append(report)
write_report(overall_report)
print('=========================\n')
print('Done! All file processing is complete.\n')
print('=========================\n')
if errors or folder_files_errors:
total_errors = len(errors) + len(folder_files_errors)
print('WARNING:\n')
print(f'{total_errors} Errors were encountered during execution!\n')
print('See', generated_report_file_name, 'for details.\n')
print('=========================\n')
print('=========================\n')
print('The finished report is located at:\n')
file_path = os.path.join(directory_path, generated_report_file_name)
print("", file_path, '\n')
print('=========================\n')
compressed_files = get_compressed_files(directory_path, supported_compressed_files_formats)
folder_files, errors = get_contents_of_folder_files(directory_path, supported_file_formats)
process_files(directory_path, compressed_files, folder_files, errors)
CONCLUSION
The present project is aimed at the classification of phishing websites based on their features.
For that we took the phishing dataset, which was collected from the UCI Machine Learning
Repository, and built our model with seven different classifiers: Logistic Regression, SVC,
Naïve Bayes, XGB Classifier, Random Forest, K-Nearest Neighbours and Decision Tree. We obtained
good accuracy scores. There is scope to enhance the project further. With more data the project
would be much more effective and could give very good results. For this we need API
integrations to get the data of different websites.