Advanced Genetic Disorder Prediction Model

Uploaded by

Pavan Kohli
Copyright
© All Rights Reserved

TABLE OF CONTENTS

_______________________________________________________
Abstract
List of Figures
List of Symbols
CHAPTER 1 : INTRODUCTION
1.1 General
1.2 Objective
1.3 Existing System
1.3.1 Disadvantages of Existing System
1.4 Literature Survey
1.5 Proposed System
1.5.1 Proposed System Advantages
CHAPTER 2 : PROJECT DESCRIPTION
2.1 General
2.2 Methodologies
2.2.1 Modules Name
2.2.2 Modules Description
2.3 Technique used or Algorithm used
2.3.1 Existing Technique
2.3.2 Proposed Technique and Algorithm Used
CHAPTER 3 : REQUIREMENTS ENGINEERING
3.1 General
3.2 Hardware Requirements
3.3 Software Requirements
3.4 Functional Requirements
3.5 Non-Functional Requirements
CHAPTER 4 : DESIGN ENGINEERING
4.1 General
4.2 UML Diagrams
4.2.1 Use Case Diagram
4.2.2 Class Diagram
4.2.3 Object Diagram
4.2.4 Component Diagram
4.2.5 Deployment Diagram
4.2.6 Sequence Diagram
4.2.7 Collaboration Diagram
4.2.8 Statechart Diagram
4.2.9 Activity Diagram
4.3 Data Flow Diagram
4.4 System Architecture
CHAPTER 5 : SOFTWARE SPECIFICATION
5.1 General
CHAPTER 6 : IMPLEMENTATION
6.1 Code and Implementation
6.2 Test Cases
CHAPTER 7 : SNAPSHOTS
7.1 Snapshots
CHAPTER 8 : SOFTWARE TESTING
8.1 General
8.2 Developing Methodologies
8.3 Types of Testing
8.3.1 Unit Testing
8.3.2 Functional Test
8.3.3 System Test
8.3.4 Performance Test
8.3.5 Integration Testing
8.3.6 Acceptance Testing
CHAPTER 9 : APPLICATIONS AND FUTURE ENHANCEMENT
9.1 General
9.2 Applications
9.3 Future Enhancements
CHAPTER 10 : CONCLUSION
10.1 Conclusion
REFERENCES

ABSTRACT

Genetic illness prediction is an important and timely issue in the realm of biomedical
science. Mutations in the genome are the root cause of many diseases with significant global
mortality rates, including Alzheimer's, cancer, diabetes, cystic fibrosis, Leigh syndrome, and
others. Theoretical and explanatory approaches to predicting genetic abnormalities have been
developed through prior research. Genetic data has expanded to practically include the entire
genome and protein, and methods based on deep learning and machine learning have been
created to forecast genomic abnormalities in response. Concurrently with the introduction of
machine learning techniques, deep learning methods also emerged. Studies on the forecasting
of genetic anomalies have previously employed a variety of learning strategies, including
supervised, unsupervised, and semi-supervised approaches. Most of these studies used
genetic sequence data to make predictions on binary classification problems. These methods produced
questionable results since they were less accurate and relied on binary class prediction algorithms,
which ignore the histories of individuals with genetic anomalies. The majority of the approaches
relied on RNA gene sequences, which led to frequent issues when dealing with actual data.
Here, we use the XGBoost algorithm to predict multiclass genome disorders from a large
dataset using an advanced genome disorder prediction model (AGDPM). AGDPM
outperformed the trained XGBoost Algorithm in every category, with an average accuracy
of 92.65% in both the training and testing phases of the study. Therefore, the state-of-the-art
genome disorder prediction model can reliably predict genome disorder and analyse a large
quantity of patient genome disorder data thanks to the incorporation of a multi-class
prediction technique. Multiple statistical performance metrics demonstrate that AGDPM may
accurately predict diseases caused by a single gene, mitochondrial genes, and multiple genes.
As a result, AGDPM will help biomedical researchers manage mortality rates and anticipate
genetic disorders.

LIST OF FIGURES

4.1 Use Case Diagram
4.2 Class Diagram
4.3 Object Diagram
4.4 Component Diagram
4.5 Deployment Diagram
4.6 Sequence Diagram
4.7 Collaboration Diagram
4.8 State Chart Diagram
4.9 Activity Diagram
4.10 Data Flow Diagram
4.11 System Architecture

LIST OF SYMBOLS

_______________________________________________________
NOTATION
S.No. Name: Description

1. Class: Represents a collection of similar entities grouped together. The notation is a box with the class name, its attributes, and its operations; visibility is marked + (public), - (private), or # (protected).
2. Association: Represents static relationships between classes. Roles represent the way the two classes see each other.
3. Actor: Represents the interaction between the system and the external environment.
4. Aggregation: Aggregates several classes into a single class.
5. Relation (uses): Used for additional process communication.
6. Relation (extends): The extends relationship is used when one use case is similar to another use case but does a bit more.
7. Communication: Communication between various use cases.
8. State: State of the process.
9. Initial State: Initial state of the object.
10. Final State: Final state of the object.
11. Control flow: Represents the control flow between the states.
12. Decision box: Represents a decision-making process based on a constraint.
13. Use case: Interaction between the system and the external environment.
14. Component: Represents physical modules which are a collection of components.
15. Node: Represents physical modules which are a collection of components.
16. Data Process/State: A circle in a DFD represents a state or process which has been triggered by some event or action.
17. External entity: Represents external entities such as keyboards, sensors, etc.
18. Transition: Represents communication that occurs between processes.
19. Object Lifeline: Represents the vertical dimension along which the object communicates.
20. Message: Represents the message exchanged.

CHAPTER 1
INTRODUCTION

1.1 GENERAL

It is estimated that nearly 2,000 different human diseases can be traced back to a single faulty
gene, classifying them as monogenic syndromes. The genes responsible for each condition
exhibit various manifestations, leading to a diverse range of phenotypic outcomes. Therefore,
establishing phenotype-gene correlations is crucial for researchers and medical professionals
in deciphering the fundamental genetic mechanisms behind these disorders. Identifying
disease-causing genes aids in patient diagnosis and provides insight into the complex network
of genetic interactions. Essentially, a potential genetic disease can be detected by analyzing
causative mutant genotypes during the gene identification process. Genetic anomalies, such
as single nucleotide changes, additions or deletions, and complete gene loss, can all impact
disease-causing genes. Traditional approaches to identifying pathogenic genes include
positional cloning, linkage analysis, and mutation analysis. Initially, linkage analysis on
human pedigrees helps locate the chromosomal interval associated with the disease,
identifying candidate genes in the region. Next, positional cloning involves sequencing a set
of candidate genes within this interval, combining spatial and transcriptional mapping.
Human genetic disorders are inherited conditions arising from genetic or chromosomal
abnormalities present from conception. These disorders fall into two primary categories:
single-gene diseases and complex disorders. Single-gene diseases result from a mutation in
a single gene and are passed down easily from one generation to the next, referred to as
Mendelian diseases. Complex diseases, on the other hand, result from a combination of
environmental, behavioral, and lifestyle factors, with genetic defects contributing only a
small fraction to the overall phenotype. Single-gene disorders can originate in any gene, but
they share common genetic and psychosocial care needs, allowing for informed decisions on
risk management and support for affected individuals. Mitochondrial diseases, caused by
alterations in mitochondrial DNA rather than nuclear DNA, are inherited solely from the
mother. These diseases can present with symptoms such as lactic acidosis, stroke-like
episodes, eye abnormalities, and encephalopathy. Inherited disorders have various
underlying causes, and many conditions result from a combination of genetic alterations and
environmental factors. Complex genetic disorders, such as diabetes, Alzheimer's, and cancer,
illustrate the multifaceted nature of polygenic illnesses.

1.2 OBJECTIVE

The primary objectives of this project are as follows:

• To use numerous statistical performance parameters to predict the results of the
multifactorial gene inheritance disease simulation.
• To address the fact that genetic illnesses might be multifactorial, meaning genetic
factors contribute to only a subset of the phenotypes associated with the disorder.
• To consider that diseases with multiple causal factors include those caused by both
genetic predisposition and environmental influences.
• To identify that a mutation in a single gene is the sole cause of a single-gene disorder,
which can originate in any gene.
• To recognize that despite clinical distinctions, all single-gene illnesses are inherited,
share a common biological basis, and require the same fundamental genetic and
counseling services.
• To provide the ability to make educated decisions about risk management strategies
and offer emotional and practical assistance to individuals, regardless of age.
• To understand that mitochondrial diseases are associated with alterations in
mitochondrial DNA, which is inherited maternally and involves between five and ten
circular DNA strands.
• To note that symptoms of mitochondrial disease include lactic acidosis, stroke-like
episodes, eye abnormalities, and encephalopathy.
• To acknowledge that mitochondrial diseases, which often result from the interplay
between environmental and nutritional factors, may involve multiple mutations and
are sometimes referred to as complicated or polygenic diseases.
• To explore how complex genetic disorders can underlie conditions such as
diabetes, Alzheimer's, and cancer.
• To consider machine learning as an alternative to conventional methods of genetic
prediction, noting that advancements in this field, along with growing data sets and
computing power, have made deep learning increasingly popular.
• To utilize deep learning methods in statistical genetics to identify interactions
between multiple loci without assuming additivity, addressing the high
dimensionality of factors and improving the prediction of their relative importance.

1.3 EXISTING SYSTEM

In the realm of genetics and medical research, forecasting genome disorders is crucial.
Although Deep Neural Networks (DNNs) have demonstrated significant potential in
addressing this challenge, their effectiveness can be hindered by overfitting. Convolutional
Neural Networks (CNNs) face limitations due to the increased spatial correlation of zeroed-
out values in output feature maps. To combat overfitting, dropout is commonly used. The
current recommendation is to utilize Checkerboard Dropout, a structured dropout method
designed to enhance performance and generalization while addressing the spatial correlation
issue. However, despite its benefits, Checkerboard Dropout may still encounter problems that
require further refinement.
1.3.1 Disadvantage of Existing System
Recommendation for Overfitting Solution
The recommendation is to use Checkerboard Dropout as an effective strategy to address the
overfitting problem in deep learning models. Overfitting occurs when a model performs
exceptionally well on training data but struggles to generalize to new, unseen data. This issue
is particularly prominent in Convolutional Neural Networks (CNNs) where the spatial
correlation of zeroed-out values in output feature maps can hinder performance and
generalization. Checkerboard Dropout offers a targeted solution by introducing structured
dropout, which systematically removes features to reduce the likelihood of overfitting and
improve the model's ability to generalize.

Checkerboard Dropout: A Structured Dropout Technique
Checkerboard Dropout is a structured dropout technique designed to address the issues of
randomness and spatial correlation that commonly affect neural networks. Unlike traditional
dropout methods that randomly eliminate individual features, Checkerboard Dropout
removes contiguous blocks of features, creating a more organized pattern of dropout. This
approach helps mitigate the spatial correlation problem by ensuring that removed features do
not follow an unpredictable, random pattern. As a result, it enhances model generalization
and performance by promoting more robust learning. Despite its advantages, it is important
to note that Checkerboard Dropout may still face challenges that require further refinement
and investigation.

1.4 LITERATURE SURVEY
Title: Network-Based Methods for Human Disease Gene Prediction

Author: X. Wang, N. Gulbahce, and H. Yu

Year: 2011

Description:

This paper, published in Briefings in Functional Genomics, offers a comprehensive review
of network-based approaches for predicting human disease genes. The authors explore how
integrating various types of biological networks—such as protein-protein interaction
networks, gene co-expression networks, and metabolic networks—can significantly enhance
the accuracy and depth of disease gene prediction. The paper delves into multiple network-
based methods, including network propagation techniques that utilize the structure of the
network to infer gene-disease associations, and network-based enrichment analysis that
identifies genes related to specific diseases by examining their presence in known network
modules. Additionally, the study discusses the application of network topological features,
such as centrality and connectivity, to prioritize candidate disease genes. By leveraging the
rich contextual information embedded in biological networks, these methods can uncover
previously hidden relationships between genes and diseases. The paper underscores the
potential of network-based approaches to capture the complexity of gene interactions and the
biological pathways involved in diseases, offering new avenues for research and therapeutic
discovery. Overall, this review highlights how advanced computational techniques and
network analyses can provide a deeper understanding of the molecular mechanisms
underlying human diseases and contribute to the identification of novel targets for treatment.

Title: ImageNet Classification with Deep Convolutional Neural Networks

Author: A. Krizhevsky, I. Sutskever, and G. E. Hinton

Year: 2012

Description: Krizhevsky, Sutskever, and Hinton's seminal paper, "ImageNet Classification
with Deep Convolutional Neural Networks," marked a pivotal moment in the evolution of
artificial intelligence. By introducing AlexNet, a deep convolutional neural network
architecture, the authors dramatically advanced the field of computer vision. This
groundbreaking work surpassed previous state-of-the-art image classification models by a
substantial margin, demonstrating the immense potential of deep learning. Key to AlexNet's
success were several innovative techniques: the employment of rectified linear units (ReLUs)
as activation functions, which accelerated training and improved performance; the
incorporation of dropout regularization to mitigate overfitting and enhance generalization;
and the development of a highly efficient GPU implementation to handle the computational
demands of training such a complex model. The paper's impact extends far beyond its
immediate contributions, as it ignited a resurgence of interest in deep learning, fostering a
wave of research and development that has led to transformative advancements in various
domains, from autonomous vehicles to medical image analysis.

Title: First Glimpses of the Neurobiology of Autism Spectrum Disorder

Author: S. J. Sanders

Year: 2015

Description:

In this comprehensive review, S. J. Sanders delves into the early insights into the
neurobiological underpinnings of autism spectrum disorder (ASD). The paper meticulously
examines the genetic landscape of ASD, highlighting significant advances in identifying
specific genetic mutations and variations that are associated with the disorder. It discusses
how these genetic findings contribute to a broader understanding of how ASD affects brain
development and function, exploring the implications of these genetic insights for
understanding the etiology of the disorder. Sanders reviews research on brain structure and
connectivity, noting how abnormalities in these areas may relate to the core symptoms of
ASD. The paper also addresses the role of environmental factors and their interaction with
genetic predispositions, providing a nuanced view of the complex interplay between genetics
and environmental influences. Additionally, it outlines the progress made in identifying
potential biomarkers for ASD, which could pave the way for improved diagnostic methods
and targeted therapies. By synthesizing recent research and presenting an overview of key
studies, Sanders provides a valuable resource for researchers and clinicians seeking to
understand the neurobiological basis of ASD and offers insights into future research
directions aimed at unraveling the complexities of this multifaceted disorder.

Title: Biological Insights from 108 Schizophrenia-Associated Genetic Loci

Author: Schizophrenia Working Group of the Psychiatric Genomics Consortium

Year: 2014

Description:

This study published in Nature provides an in-depth analysis of 108 genetic loci associated
with schizophrenia. The research focuses on elucidating the biological mechanisms
underlying schizophrenia by examining the genetic variants identified through large-scale
genome-wide association studies (GWAS). The paper highlights how these genetic loci
contribute to the understanding of the disorder's etiology, including their effects on brain
structure, function, and molecular pathways. By integrating genetic data with functional
annotations and gene expression profiles, the authors provide valuable insights into the
biological processes disrupted in schizophrenia. The findings offer a clearer picture of the
genetic architecture of the disorder and suggest potential pathways for future research and
therapeutic interventions. This comprehensive analysis underscores the complexity of
schizophrenia’s genetic basis and the importance of continued research in uncovering the
mechanisms driving this severe mental illness.

Title: Uncovering Disease-Disease Relationships Through the Incomplete Interactome

Author: J. Menche, A. Sharma, M. Kitsak, S. D. Ghiassian, M. Vidal, J. Loscalzo, and A.-L. Barabási

Year: 2015

Description:
This paper, published in Science, explores the relationships between different diseases
through the lens of the incomplete interactome—a network of protein interactions that is not
fully mapped. The authors propose a novel approach to uncovering disease-disease
relationships by integrating partial interactome data with disease association information.
They demonstrate that by analyzing the connections between proteins involved in different
diseases, it is possible to identify previously unknown relationships between diseases. The
study provides insights into how interactions within this incomplete network can reveal
shared molecular pathways and mechanisms across various diseases. The findings have
significant implications for understanding the comorbidities and underlying biological
connections between different health conditions, potentially guiding future research and
therapeutic strategies. The approach highlights the value of network-based analyses in
revealing complex disease relationships that are not apparent through traditional methods.

1.5 PROPOSED SYSTEM

Complex multiple gene abnormalities can lead to a diverse array of symptoms, encompassing
multifactorial genome disorders, mitochondrial gene inheritance disorders, and single-gene
inheritance disorders. Recent advancements in genomic technology have enabled more
precise acquisition of genetic data. Large-scale genetic studies, including those focused on
multifactorial genome disorders (MGD) and single-gene inheritance disorders (SGID), have
identified hundreds of individuals with various abnormalities. However, despite the vast
amount of data generated by these studies, identifying the specific genes responsible for the
diseases remains challenging. Additionally, since mitochondrial DNA is inherited
maternally, mitochondrial disorders are passed to children through the mother, because
the egg's mitochondria are retained after fertilization.

1.5.1 PROPOSED SYSTEM ADVANTAGES

Efficient Loss Minimization through Gradient Descent


The proposed system leverages a gradient descent method to optimize the loss function when
integrating new models. This approach ensures that the model continuously improves its
accuracy by systematically reducing errors through iterative adjustments. By minimizing the
loss, the system effectively enhances overall predictive performance and reliability.
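
The idea can be illustrated with a minimal sketch, assuming a squared-error loss and single-feature decision stumps as the "new models". This is not the project's code: the names `fit_stump`, `boost`, and `learning_rate` and the toy data are illustrative assumptions. Each round fits a stump to the current residuals (the negative gradient of the squared loss) and adds a damped copy of it to the ensemble, so the training loss shrinks step by step.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, rounds=20, learning_rate=0.3):
    """Additive training: every new stump is fitted to the remaining error."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    losses = [mse(ys, preds)]
    for _ in range(rounds):
        # For the loss 1/2 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        losses.append(mse(ys, preds))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, losses


# Toy one-feature data (assumed for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
predict, losses = boost(xs, ys)
```

The `losses` list records the training loss after each round. XGBoost follows the same additive scheme but also uses second-order gradient information and a complexity penalty, which this sketch leaves out.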

Independence in Feature Engineering


One of the key strengths of the proposed system is its capability to perform feature
engineering independently. This means that the system can autonomously identify and select
relevant features from the data without requiring extensive manual intervention. This
independence enhances the model's adaptability and efficiency, allowing it to focus on
generating accurate predictions based on the most pertinent features.

High Prediction Accuracy with XGBoost Algorithm
The suggested model, utilizing the XGBoost algorithm, achieved an impressive 92.65%
prediction accuracy based on patients' clinical feature data. This high level of accuracy
demonstrates the model's effectiveness in processing and analyzing complex clinical
information, leading to reliable and precise predictions. The XGBoost algorithm's robustness
and accuracy significantly contribute to the system's overall performance.

Optimal Space and Computational Complexity


The proposed model benefits from the XGBoost algorithm's optimal space and computational
complexity. The XGBoost algorithm is designed to handle large datasets efficiently while
maintaining low computational overhead. This efficiency ensures that the model performs
well in terms of both resource utilization and processing speed, making it suitable for real-
world applications where data volume and complexity can be substantial.

Significant Improvement in Prediction Results


The implementation of the proposed model has resulted in a substantial improvement in
prediction outcomes. By leveraging the XGBoost algorithm and the advantages of gradient
descent and independent feature engineering, the system has enhanced its ability to provide
accurate and actionable predictions. This dramatic improvement underscores the
effectiveness of the proposed system in advancing predictive analytics and decision-making
processes.

CHAPTER 2
PROJECT DESCRIPTION

2.1 GENERAL

The paper explores advancements in predictive modeling for genetic disorders using modern
machine learning techniques. It focuses on leveraging deep learning methodologies,
particularly Convolutional Neural Networks (CNNs), to enhance the accuracy and efficiency
of predicting multifactorial and single-gene abnormalities.

The study addresses challenges such as overfitting and spatial correlation in feature maps,
proposing solutions like Checkerboard Dropout to mitigate these issues and improve model
generalization. It highlights the use of advanced algorithms, including XGBoost, to achieve
high prediction accuracy, optimize computational performance, and reduce space
complexity.

Furthermore, the paper discusses the integration of gradient descent methods for loss
minimization and the independence of feature engineering processes to streamline model
development. It underscores the significance of recent genomic technologies in providing
precise genetic data and the difficulty of pinpointing disease-causing genes despite the
availability of large-scale genetic studies.

Overall, the paper presents a comprehensive approach to refining predictive models in
genomics, emphasizing both methodological improvements and practical implications for
enhancing diagnostic accuracy and understanding genetic disorders.

2.2 METHODOLOGIES

2.2.1 MODULES NAME


➢ Data collection
➢ Dataset
➢ Data preparation
➢ Model Selection
➢ Analysis and prediction
➢ Accuracy on the test set
➢ Saving the trained model

2.2.2 MODULES DESCRIPTION


1. Data collection: This initial stage involves gathering data and developing a machine
learning model, marking a crucial phase in the process. The effectiveness of the model
heavily relies on the volume and quality of the data collected. Techniques such as manual
interventions, online scraping, and various other methods are employed to collect this data.
2. Dataset: The dataset contains 22,084 unique data points across 45 columns. Each column
is described as follows:
• Patient Id: Identifier with "Genetic Disorder" noted.
• Patient Age: The age of the patient or user.
• Mother's Side Genes: Presence of maternal genes.
• Inherited from Father: DNA traits passed from father, such as blood type and eye
color.
• Maternal Gene: Genes present in the oocyte or embryo prior to zygotic gene
expression.
• Paternal Gene: Characteristics passed from father to offspring.
• Blood Cell Count (mcL): Measurement of red, white, and platelet-rich blood.
• Patient First Name: The patient's first name.
• Father’s Name and Family Name: The father's name and surname.
• Mother’s Name and Family Name: The mother’s name and surname.
• Age of Mother: The mother's age.
• Age of Father: The father's age.
• Institution Name: The name of the hospital or institution.
• Institute's Location: Location of the hospital or institution.
• Status: Whether the person or patient is alive or deceased.
• Respiratory Rate (breaths/min): The rate of breathing controlled by the brain's
respiratory center.
• Heart Rate (rates/min): The frequency of heartbeats per minute.
• Test 1: Status of Test 1.
• Test 2: Status of Test 2.
• Test 3: Status of Test 3.
• Test 4: Status of Test 4.
• Test 5: Status of Test 5.
• Parental Consent: Indicates if parental assent was provided for participation.
• Follow-up Level: Indicates whether follow-up is high or low.
• Gender: Male, Female, or Indeterminate.
• Birth Asphyxia: Condition where insufficient oxygen is received during childbirth.
• Autopsy Reveals Birth Defect (if any): Findings from an autopsy regarding birth
defects.
• Place of Birth: The birthplace.
• Information about Folic Acid (peri-conceptional): Data on folic acid, a vitamin
important for new cell production.
• H/O Serious Maternal Disease: Impact of serious maternal disease on the patient's
mother.
• H/O Radiation Exposure (x-ray): Indicates if the patient has been exposed to
radiation.
• H/O Substance Abuse: Indicates if a parent has struggled with drug addiction.
• Assisted Conception (IVF/ART): Type of infertility therapy used.
• Previous Pregnancy Abnormalities: History of abnormalities in prior pregnancies.
• Number of Prior Abortions: Total number of prior abortions.
• Birth Defects: Indicates if the patient has birth defects.
• White Blood Cell Count: Number of white blood cells per microliter.
• Blood Test Result: Categorized as Normal, Slightly Abnormal, Unclear, or
Abnormal.
• Symptom 1: Presence of Symptom 1.
• Symptom 2: Presence of Symptom 2.
• Symptom 3: Presence of Symptom 3.
• Symptom 4: Presence of Symptom 4.
• Symptom 5: Presence of Symptom 5.
• Genetic Disorder: Professional detection of genetic disorders.
• Type of Disorder: Subclass of the disorder.
3. Data Preparation: Prepare the data for training by cleaning and organizing it. This
involves eliminating duplicates, correcting errors, addressing missing values, normalizing
the data, converting data types as needed, and removing any other potential inconsistencies.
Randomize the data to ensure that any effects from the specific order in which it was collected
or processed are minimized. Next, conduct further exploratory analysis, which includes
visualizing the data to identify any significant class imbalances or relationships between
variables, while being cautious of potential biases. Finally, split the data into training and
assessment sets to facilitate model evaluation.
4. Model selection: We trained XGBoost and Support Vector Machine models, which reached
accuracies of 98% and 80% on the training set, respectively, and selected XGBoost for the
proposed method.
5. Analysis and prediction: Out of the entire dataset, we chose only two attributes:
• Input: a description of the patient's health values.
• Outcome: indicates the type of genetic condition that the patient or individual has.
6. Accuracy on the test set: We obtained accuracies of 92.65% and 41.40% on the test set for XGBoost and the Support Vector Machine, respectively.
7. Saving the Trained Model: Once the model has been trained and validated, save it as a
`.pkl` file using a serialization library such as Python's pickle module. The saved `.pkl`
file can then be loaded in a production setting to deploy the model.
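
Steps 3, 6, and 7 above can be sketched with the Python standard library alone. This is an illustrative outline, not the project's code: the toy records, the column names `age` and `disorder`, the helper names, the mean imputation, and the 80/20 split are all assumptions. A trivial majority-class baseline stands in for the trained classifier so that the example stays self-contained.

```python
import os
import pickle
import random
import tempfile
from collections import Counter


def prepare(records):
    """Drop exact duplicates, then fill missing ages with the mean known age."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))
    known = [r["age"] for r in unique if r["age"] is not None]
    mean_age = sum(known) / len(known)
    for r in unique:
        if r["age"] is None:
            r["age"] = mean_age
    return unique


def shuffle_and_split(rows, test_fraction=0.2, seed=42):
    """Randomize the row order, then hold out a fraction for evaluation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[:-n_test], rows[-n_test:]


def fit_majority(rows):
    """Placeholder 'model': remembers the most common training label."""
    label = Counter(r["disorder"] for r in rows).most_common(1)[0][0]
    return {"majority_label": label}


def accuracy(model, rows):
    """Fraction of rows whose label matches the model's prediction."""
    hits = sum(model["majority_label"] == r["disorder"] for r in rows)
    return hits / len(rows)


# Hypothetical records; the real dataset has 45 columns.
records = [
    {"age": 4.0, "disorder": "mitochondrial"},
    {"age": 4.0, "disorder": "mitochondrial"},   # exact duplicate, dropped
    {"age": None, "disorder": "single-gene"},    # missing age, imputed
    {"age": 7.0, "disorder": "multifactorial"},
    {"age": 2.0, "disorder": "mitochondrial"},
    {"age": 9.0, "disorder": "single-gene"},
    {"age": 5.0, "disorder": "mitochondrial"},
    {"age": 11.0, "disorder": "multifactorial"},
    {"age": 3.0, "disorder": "mitochondrial"},
    {"age": 8.0, "disorder": "single-gene"},
    {"age": 6.0, "disorder": "mitochondrial"},
]

clean = prepare(records)
train, test = shuffle_and_split(clean)
model = fit_majority(train)
test_accuracy = accuracy(model, test)

# Save the model as a .pkl file and load it back, as in step 7.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)
```

In a real run the placeholder would be replaced by the trained classifier, and the `.pkl` file would be written to a permanent location. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.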

2.3 TECHNIQUE USED OR ALGORITHM USED

2.3.1 EXISTING TECHNIQUE


➢ Dropout is a technique used in Deep Neural Networks (DNNs) to prevent overfitting
by randomly removing features from feature maps during training.
➢ Despite its effectiveness, dropout has limitations in Convolutional Neural Networks
(CNNs), as it can increase spatial correlation among zeroed-out values in output
feature maps.
➢ This increase in spatial correlation can negatively impact the network's overall
performance and generalization ability.
➢ DropBlock is a more structured dropout method that drops a continuous region of
the feature map, reducing the randomness associated with standard dropout.
➢ By using DropBlock, the issue of spatial correlation in CNNs is effectively mitigated,
leading to improved network performance.
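
To make the contrast above concrete, here is a small sketch that builds both kinds of mask for a 2-D feature map in plain Python. It is a simplified illustration, not a reference implementation of block-wise dropout: the function names, the `drop_prob` and `block_size` parameters, and the way block centres are sampled are assumptions. `block_size` is assumed to be odd, and the rescaling of surviving activations used in practice is omitted.

```python
import random


def dropout_mask(height, width, drop_prob, rng):
    """Standard dropout: each position is zeroed independently."""
    return [[0 if rng.random() < drop_prob else 1 for _ in range(width)]
            for _ in range(height)]


def block_dropout_mask(height, width, drop_prob, block_size, rng):
    """Structured dropout: sample block centres, then zero a square around each."""
    mask = [[1] * width for _ in range(height)]
    half = block_size // 2
    for row in range(height):
        for col in range(width):
            if rng.random() < drop_prob:
                # Zero the whole block, clipped at the feature-map border.
                for r in range(max(0, row - half), min(height, row + half + 1)):
                    for c in range(max(0, col - half), min(width, col + half + 1)):
                        mask[r][c] = 0
    return mask


def apply_mask(feature_map, mask):
    """Multiply the feature map by the mask element by element."""
    return [[v * m for v, m in zip(frow, mrow)]
            for frow, mrow in zip(feature_map, mask)]


rng = random.Random(0)
feature_map = [[1.0] * 8 for _ in range(8)]
point_mask = dropout_mask(8, 8, 0.2, rng)
block_mask = block_dropout_mask(8, 8, 0.05, 3, rng)
dropped = apply_mask(feature_map, block_mask)
```

Printing `point_mask` and `block_mask` side by side shows scattered zeros in the first and contiguous square holes in the second.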

2.3.2 PROPOSED TECHNIQUE USED OR ALGORITHM USED


➢ XGBoost uses a regularized objective function, which includes a convex loss function
based on the difference between predicted and target outputs, and a penalty term for
model complexity.
➢ The method adds new trees during training to predict the errors or residuals from
earlier trees, which are then combined with existing trees to improve the final
prediction.
➢ XGBoost incorporates L1 (Lasso) and L2 (Ridge) regularization to control model
complexity and prevent overfitting, enhancing generalization.
➢ XGBoost is known for its high performance, scalability, and accuracy, making it
widely used in applications such as image classification, text mining, and
recommender systems.
➢ The AGDPM model uses input features that include data on genetic diseases,
leveraging XGBoost’s predictive capabilities to effectively analyze and classify this
data.
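
For reference, the regularized objective described in the first point above is usually written as follows. The notation follows the common presentation of XGBoost and is not taken from this report: $l$ is the convex loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, $w$ is its vector of leaf weights, and $\gamma$, $\lambda$, $\alpha$ are the penalty strengths.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1,
\qquad
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)
```

The L2 term ($\lambda$) shrinks leaf weights toward zero, the L1 term ($\alpha$) can set small weights exactly to zero, and $\gamma$ charges a fixed cost for every additional leaf.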

CHAPTER 3
REQUIREMENTS ENGINEERING

3.1 GENERAL

This chapter sets out the hardware, software, functional, and non-functional requirements
of the genetic disorder prediction system, whose goal is the accurate and rapid
classification of genetic disorder data.

3.2 HARDWARE REQUIREMENTS

The hardware requirements may serve as the basis for a contract for the implementation of
the system and should therefore be a complete and consistent specification of the whole
system. They are used by software engineers as the starting point for the system design.
They state what the system should do, not how it should be implemented.

• Processor - Pentium IV
• Speed - 1.1 GHz
• RAM - 256 MB
• Hard Disk - 20 GB
• Keyboard - Standard Windows Keyboard
• Mouse - Two or Three Button Mouse
• Monitor - SVGA

3.3 SOFTWARE REQUIREMENTS

The software requirements document is the specification of the system. It should include both
a definition and a specification of requirements. It states what the system should do rather
than how it should do it. The software requirements provide a basis for creating the software
requirements specification. They are useful in estimating cost, planning team activities,
performing tasks, and tracking the team’s progress throughout the development activity.

MINIMUM SYSTEM REQUIREMENTS

HARDWARE REQUIREMENTS

• PROCESSOR : Intel Core i3 Processor
• RAM : 8 GB DDR RAM
• HARD DISK : 500 GB

SOFTWARE REQUIREMENTS

• BACK END : PYTHON
• OPERATING SYSTEM : WINDOWS 10 and above
• IDE : Spyder3

3.4 FUNCTIONAL REQUIREMENTS

The functional requirements for the proposed system involve developing a predictive model
using XGBoost, a robust machine learning algorithm known for its high performance,
scalability, and accuracy. The system must effectively handle and process large datasets
related to genetic diseases, ensuring accurate prediction and classification outcomes. The
model will incorporate regularised objective functions (L1 and L2), focusing on minimising
the convex loss function and penalising model complexity. The system must be capable of
integrating new regression trees to predict residuals from previous iterations, refining the
final prediction. The end goal is to create a reliable and efficient tool for analyzing genetic
data and predicting disease outcomes using the XGBoost algorithm.

3.5 NON-FUNCTIONAL REQUIREMENTS

The major non-functional requirements of the system are as follows:

Usability

The system is designed as a fully automated process, so it requires little or no user
intervention.

Reliability

The system is reliable because of the qualities inherited from the chosen platform, Python,
in which the code is written.

Performance

The system is developed in a high-level language using modern front-end and back-end
technologies, so it responds to the end user on the client system within a very short time.

Supportability

The system is designed to be cross-platform. It is supported on a wide range of hardware
and software platforms.

Implementation

The system is implemented in a web environment using the Django framework, with
Windows as the platform. The user interface is a web application served through Django.

CHAPTER 4
DESIGN ENGINEERING

4.1 GENERAL

Design engineering deals with the various Unified Modeling Language (UML)
diagrams used to implement the project. A design is a meaningful engineering
representation of something that is to be built. Software design is the process through which
the requirements are translated into a representation of the software. Design is where
quality is built into software engineering, and it is the means of accurately translating
customer requirements into a finished product.

4.2 UML DIAGRAMS


Unified Modeling Language (UML) diagrams are a standardized way of visually
representing the various aspects of a system, software application, or business process. This
chapter uses nine types of UML diagram, each serving a specific purpose and providing a
different level of detail.

4.2.1 USE CASE DIAGRAM

Figure : 4.1 Use Case Diagram

EXPLANATION
Use-case diagrams describe the high-level functions and scope of a system. These diagrams
also identify the interactions between the system and its actors. The use cases and actors in
use-case diagrams describe what the system does and how the actors use it, but not how the
system operates internally.

4.2.2 CLASS DIAGRAM

Figure : 4.2 Class Diagram

EXPLANATION

In software engineering, a class diagram in the Unified Modeling Language (UML) is a
type of static structure diagram that describes the structure of a system by showing the
system's classes, their attributes, operations (or methods), and the relationships among
objects.

4.2.3 OBJECT DIAGRAM

Figure : 4.3 Object Diagram

EXPLANATION

An object is an instance of a class at a particular moment at runtime and can have its own
state and data values. Likewise, a static UML object diagram is an instance of a class
diagram: it shows a snapshot of the detailed state of a system at a point in time. An object
diagram therefore encompasses objects and their relationships, and it may be considered a
special case of a class diagram or a communication diagram.

4.2.4 COMPONENT DIAGRAM

Figure : 4.4 Component Diagram

EXPLANATION

In the Unified Modeling Language, a component diagram depicts how components are wired
together to form larger components and/or software systems. Component diagrams are used
to illustrate the structure of arbitrarily complex systems. The user submits a main query,
which is converted into sub-queries and sent through data dissemination to the data
aggregators. The data aggregators then show the results to the user. All boxes are
components, and the arrows indicate dependencies.

4.2.5 DEPLOYMENT DIAGRAM

Figure : 4.5 Deployment Diagram

EXPLANATION
The Unified Modeling Language (UML) is the standard language that many software
engineers and business professionals use to create a broad overview of complex systems. A
deployment diagram is one type of diagram created with this language; it shows how the
software components of the system are deployed onto the hardware nodes on which they run.

4.2.6 SEQUENCE DIAGRAM

Figure : 4.6 Sequence Diagram

EXPLANATION
A sequence diagram is a type of interaction diagram because it describes how—and
in what order—a group of objects works together. These diagrams are used by software
developers and business professionals to understand requirements for a new system or to
document an existing process.

4.2.7 COLLABORATION DIAGRAM

Figure : 4.7 Collaboration Diagram

EXPLANATION
A collaboration diagram, also known as a communication diagram, is an illustration
of the relationships and interactions among software objects in the Unified Modelling
Language (UML).

4.2.8 STATECHART DIAGRAM

Figure : 4.8 Statechart Diagram

EXPLANATION
The terms state diagram and statechart diagram are often used interchangeably. Put simply,
a state diagram models the dynamic behavior of a class in response to time and changing
external stimuli. Every class has a state, but we do not model every class with a state
diagram; we prefer to model only classes that have three or more states.

4.2.9 ACTIVITY DIAGRAM

Figure : 4.9 Activity Diagram

EXPLANATION

An activity diagram is a type of Unified Modeling Language (UML) flowchart that shows
the flow from one activity to another in a system or process. It's used to describe the different
dynamic aspects of a system and is referred to as a 'behavior diagram' because it describes
what should happen in the modeled system.

4.3 DATA FLOW DIAGRAM

Figure : 4.10 Data Flow Diagram

EXPLANATION
Figure 4.10 shows a sequence that starts with user data input, followed by conversion of the
input to an array. Next, the model is loaded and applied to the converted data. Finally,
genome disorder detection is performed.
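
The flow in the figure can be sketched as a short function chain. This is a hedged illustration: the feature names in `FEATURE_ORDER`, the `StubModel` class and its decision rule are all hypothetical and exist only so that the example runs end to end; the real system would load its trained model instead.

```python
# Hypothetical feature names and order; the real model defines its own.
FEATURE_ORDER = ["patient_age", "blood_cell_count", "white_blood_cell_count"]


class StubModel:
    """Stand-in for the loaded model; exposes the usual predict() interface."""

    def predict(self, rows):
        # Toy rule used only so the example runs end to end.
        return ["disorder" if row[1] < 4.5 else "no disorder" for row in rows]


def to_feature_array(form_data):
    """Convert user input (a mapping of field name to value) into a 2-D array."""
    missing = [name for name in FEATURE_ORDER if name not in form_data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return [[float(form_data[name]) for name in FEATURE_ORDER]]


def detect(form_data, model):
    """User input -> array conversion -> model applied -> predicted class."""
    return model.predict(to_feature_array(form_data))[0]


user_input = {"patient_age": "7", "blood_cell_count": "4.2", "white_blood_cell_count": "6.1"}
print(detect(user_input, StubModel()))  # disorder
```

Fixing the feature order in one place keeps the array layout at prediction time identical to the layout used during training.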

4.4 SYSTEM ARCHITECTURE

Figure : 4.11 System Architecture

EXPLANATION

The above figure shows the system architecture. The data is first processed, and feature
extraction is performed on the entire dataset to produce a feature set. The dataset is then
divided into training and test sets. Finally, the XGBoost model is applied to the data to
produce predictions.
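
The division into training and test sets can be sketched as below. The helper name `train_test_split_rows`, the 80/20 ratio, and the toy feature set are assumptions made for illustration; the report does not state the ratio used for the tabular data.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=1842):
    """Shuffle the rows reproducibly and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy feature set: (feature_vector, label) pairs invented for the example.
feature_set = [([i, i % 3], i % 2) for i in range(10)]
train_rows, test_rows = train_test_split_rows(feature_set)
print(len(train_rows), len(test_rows))  # 8 2
```

Using a fixed seed makes the split repeatable, so that accuracy figures from different runs are computed on the same test rows.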

CHAPTER 5
DEVELOPMENT TOOLS

5.1 GENERAL

Python

Python is a high-level, interpreted, interactive and object-oriented scripting language. Python
is designed to be highly readable. It uses English keywords frequently, whereas other
languages use punctuation, and it has fewer syntactical constructions than other languages.

History of Python

Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.

Python is derived from many other languages, including ABC, Modula-3, C, C++,
Algol-68, Smalltalk, and Unix shell and other scripting languages.

Python is copyrighted. Its source code is available under the Python Software Foundation
License, an open-source license that is compatible with the GNU General Public License (GPL).

Python is now maintained by a team of core developers under the Python Software
Foundation. Guido van Rossum directed its progress for decades and stepped down as its
lead decision-maker in 2018.

Importance of Python
• Python is Interpreted − Python is processed at runtime by the interpreter. You do
not need to compile your program before executing it. This is similar to Perl and
PHP.

• Python is Interactive − You can actually sit at a Python prompt and interact with
the interpreter directly to write your programs.

• Python is Object-Oriented − Python supports Object-Oriented style or technique of
programming that encapsulates code within objects.

• Python is a Beginner's Language − Python is a great language for beginner-level
programmers and supports the development of a wide range of applications,
from simple text processing to WWW browsers to games.

Features of Python

• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined
syntax. This allows the student to pick up the language quickly.

• Easy-to-read − Python code is more clearly defined and visible to the eyes.

• Easy-to-maintain − Python's source code is fairly easy-to-maintain.

• A broad standard library − The bulk of Python's standard library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.

• Interactive Mode − Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.

• Portable − Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.

• Extendable − You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more efficient.

• Databases − Python provides interfaces to all major commercial databases.

• GUI Programming − Python supports GUI applications that can be created and
ported to many system calls, libraries and windows systems, such as Windows MFC,
Macintosh, and the X Window system of Unix.

• Scalable − Python provides a better structure and support for large programs than
shell scripting.

Apart from the above-mentioned features, Python has many other useful features; a few are
listed below −

• It supports functional and structured programming methods as well as OOP.

• It can be used as a scripting language or can be compiled to byte-code for building
large applications.

• It provides very high-level dynamic data types and supports dynamic type checking.

• It supports automatic garbage collection.

• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

Libraries used in python:

• numpy - mainly useful for its N-dimensional array objects.

• pandas - Python data analysis library, including structures such as data frames.

• matplotlib - 2D plotting library producing publication quality figures.

• scikit-learn - the machine learning algorithms used for data analysis and data mining
tasks.

CHAPTER 6
IMPLEMENTATION

6.1 CODE AND IMPLEMENTATION

#IMPORTING ALL THE REQUIRED LIBRARIES

import pandas as pd
import numpy as np
import os
import cv2
import matplotlib.pyplot as plt
import warnings
from tensorflow.keras.layers import Input, Lambda, Dense, Flatten, Dropout
from tensorflow.keras.models import Model
import xgboost as xgb
from tensorflow.keras.preprocessing import image#, image_dataset_from_directory
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow import keras
import tensorflow
#import scipy
#print("Num GPUs Available: ", len(tensorflow.config.list_physical_devices('GPU')))
# Set the seed value for experiment reproducibility.
seed = 1842
tensorflow.random.set_seed(seed)
np.random.seed(seed)
# Turn off warnings for cleaner looking notebook
warnings.filterwarnings('ignore')
#SPLITTING DATA FOR TRAINING AND TESTING SET

#DEFINE IMAGE DATASET & RESCALE

image_generator = ImageDataGenerator(rescale=1/255, validation_split=0.2)
# Optional augmentation (disabled): shear_range=.25, zoom_range=.2, horizontal_flip=True, rotation_range=20

train_dataset = image_generator.flow_from_directory(directory='genome_disorder_prediction/train',
                                                    target_size=(176,208),
                                                    subset="training",
                                                    class_mode='categorical')

validation_dataset = image_generator.flow_from_directory(directory='genome_disorder_prediction/train',
                                                         target_size=(176,208),
                                                         subset="validation",
                                                         class_mode='categorical')

image_generator_submission = ImageDataGenerator(rescale=1/255)

submission = image_generator_submission.flow_from_directory(
    directory='genome_disorder_prediction/test',
    target_size=(176,208),
    class_mode=None)

#OUTPUT

Found 880 images belonging to 4 classes.
Found 219 images belonging to 4 classes.
Found 1279 images belonging to 4 classes.

#VERIFY IF DATA HAS BEEN SPLIT ACCORDING TO CLASSES

batch_1_img = train_dataset[0]  # first batch: (images, one-hot labels)

for i in range(0,4):
    img = batch_1_img[0][i]
    lab = batch_1_img[1][i]
    plt.imshow(img)
    plt.title(lab)
    plt.axis('off')
    plt.show()

#ANN

model = keras.Sequential([keras.layers.Flatten(input_shape = [176,208,3]),
                          keras.layers.Dense(300, activation = 'relu' ),
                          keras.layers.Dense(400, activation = 'relu' ),
                          keras.layers.Dense(400, activation = 'relu' ),
                          keras.layers.Dense(400, activation = 'relu' ),
                          keras.layers.Dense(4, activation = 'softmax')])

model.compile(optimizer='adam',
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=[keras.metrics.AUC(name='auc')])

# Stop training when the validation loss stops improving
callback = keras.callbacks.EarlyStopping(monitor='val_loss',
                                         patience=3,
                                         restore_best_weights=True)

model.fit(train_dataset, epochs=1, validation_data=validation_dataset, callbacks=[callback])

#OUTPUT

28/28 [==============================] - 7s 218ms/step - loss: 4.3382 - auc: 0.8710 - val_loss: 0.4611 - val_auc: 0.9752

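#DETERMINING LOSS & ACCURACY

# Note: the second value returned by evaluate() is the metric configured in
# compile(), i.e. AUC, so the value printed as "Accuracy" is the validation AUC.
loss, accuracy = model.evaluate(validation_dataset)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

#OUTPUT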

7/7 [==============================] - 1s 100ms/step - loss: 0.4611 - auc: 0.9752

Loss: 0.46110430359840393

Accuracy: 0.9752264022827148

#SIMPLE ANN WITH 3 LAYERS

model = keras.Sequential([keras.layers.Flatten(input_shape = [176,208,3]),
                          keras.layers.Dense(300, activation = 'relu' ),
                          keras.layers.Dropout(0.2),
                          keras.layers.Dense(400, activation = 'relu' ),
                          keras.layers.Dropout(0.2),
                          keras.layers.Dense(400, activation = 'relu' ),
                          keras.layers.Dropout(0.2),
                          keras.layers.Dense(400, activation = 'relu' ),
                          keras.layers.Dense(4, activation = 'softmax')])

model.compile(optimizer='adam',
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=[keras.metrics.AUC(name='auc')])

# Stop training when the validation loss stops improving
callback = keras.callbacks.EarlyStopping(monitor='val_loss',
                                         patience=4,
                                         restore_best_weights=True)

model.fit(train_dataset, epochs=1, validation_data=validation_dataset, callbacks=[callback])

#OUTPUT

28/28 [==============================] - 7s 215ms/step - loss: 4.8982 - auc: 0.8235 - val_loss: 0.3963 - val_auc: 0.9704

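#DETERMINING LOSS & ACCURACY

# Note: the second value returned by evaluate() is the metric configured in
# compile(), i.e. AUC, so the value printed as "Accuracy" is the validation AUC.
loss, accuracy = model.evaluate(validation_dataset)

print("Loss: ", loss)
print("Accuracy: ", accuracy)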

#OUTPUT

7/7 [==============================] - 1s 102ms/step - loss: 0.3963 - auc: 0.9704

Loss: 0.3963065445423126

Accuracy: 0.9703787565231323

#CNN

#EXPERIMENT WITH CONVOLUTIONAL NEURAL NET

model = keras.Sequential([
    keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape = [176,208,3]),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, (2, 2), activation='relu'),
    keras.layers.MaxPooling2D(),

    keras.layers.SeparableConv2D(64, 3, activation='relu', padding='same'),
    keras.layers.SeparableConv2D(64, 3, activation='relu', padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(),

    keras.layers.SeparableConv2D(128, 3, activation='relu', padding='same'),
    keras.layers.SeparableConv2D(128, 3, activation='relu', padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(),
    keras.layers.Dropout(0.2),

    keras.layers.SeparableConv2D(256, 3, activation='relu', padding='same'),
    keras.layers.SeparableConv2D(256, 3, activation='relu', padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(),
    keras.layers.Dropout(0.2),

    # Classifier head
    keras.layers.Flatten(),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.7),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(4, activation ='softmax')])

model.compile(
    optimizer='adam',
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=[keras.metrics.AUC(name='auc')])

# Learning-rate schedule: the rate is divided by 10 every s epochs
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 **(epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(0.01, 20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

checkpoint_cb = keras.callbacks.ModelCheckpoint("Alzheimer_disease_classification_cnn.h5",
                                                save_best_only=True)

early_stopping_cb = keras.callbacks.EarlyStopping(patience=5,restore_best_weights=True)

#FITTING DATA TO A CNN MODEL

history = model.fit(
    train_dataset,
    validation_data=validation_dataset,
    callbacks=[checkpoint_cb, early_stopping_cb, lr_scheduler],
    epochs=1)

#OUTPUT

28/28 [==============================] - 21s 680ms/step - loss: 0.8408 - auc: 0.8972 - val_loss: 5.0772 - val_auc: 0.4073

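#DETERMINING LOSS AND ACCURACY

# Note: the second value returned by evaluate() is the metric configured in
# compile(), i.e. AUC, so the value printed as "Accuracy" is the validation AUC.
loss, accuracy = model.evaluate(validation_dataset)

print("Loss: ", loss)
print("Accuracy: ", accuracy)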

#OUTPUT

7/7 [==============================] - 1s 197ms/step - loss: 5.0772 - auc: 0.4073

Loss: 5.077194690704346

Accuracy: 0.407303124666214
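
The exponential decay schedule passed to the learning-rate scheduler in the CNN listing can be checked in isolation. The sketch below repeats the same formula with the same arguments (initial rate 0.01, s = 20) and prints the learning rate at a few epochs; only the variable name `schedule` and the chosen epochs are added for illustration.

```python
import math


def exponential_decay(lr0, s):
    """Return a schedule that divides the learning rate by 10 every `s` epochs."""
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn


schedule = exponential_decay(0.01, 20)
for epoch in (0, 10, 20, 40):
    print(f"epoch {epoch:2d}: learning rate = {schedule(epoch):.6f}")
```

With these arguments the rate starts at 0.01 and reaches 0.001 at epoch 20 and 0.0001 at epoch 40. Because the listing trains for a single epoch, only the initial value of the schedule is actually used in the run shown above.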

# GENERATE BATCH AND LABELS

train_images, train_labels = next(train_dataset)

train_labels

#OUTPUT

array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       ...
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.]], dtype=float32)

# LABEL NAMES

label_names = {0: 'Mild Demented', 1: 'Moderate demented', 2: 'Very mild demented',
               3: 'non-demented'}

train_images.shape

#OUTPUT

(32, 176, 208, 3)

train_labels.shape

#OUTPUT

(32, 4)

# DATA VISUALIZATION

Len = 4
Wid = 4

fig, axes = plt.subplots(Len, Wid, figsize=(8,8))
axes = axes.ravel()  # Flatten the 4x4 grid of axes into a 1-D array

# Show the first 8 images of the batch with their decoded class names
for i in np.arange(0,8):
    axes[i].imshow(train_images[i])
    axes[i].set_title(label_names[np.argmax(train_labels[i])])
    axes[i].axis('off')

plt.subplots_adjust(wspace=0.5)

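# Imports needed by the grid search and the evaluation below
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

# X_train, X_test, y_train and y_test are the feature/label splits prepared
# beforehand; their construction is not shown in this listing.
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    use_label_encoder=False)

# Define the hyperparameter grid
param_grid = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 1.0]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid,
                           scoring='roc_auc', cv=5, verbose=1)

grid_search.fit(X_train, y_train)

# Get the best parameters and best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predict on the test set
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

# Report the metrics listed in the output
print("Best Parameters:", best_params)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.2f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))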

#OUTPUT

Fitting 5 folds for each of 54 candidates, totalling 270 fits

Best Parameters: {'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 200, 'subsample': 0.8}

Accuracy: 0.87

ROC AUC: 0.92

Confusion Matrix:

[[150 20]
[ 18 112]]
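
As a worked check, the reported accuracy can be recomputed from the confusion matrix above. The sketch assumes the scikit-learn layout, in which rows are true classes and columns are predicted classes; the precision, recall and F1 values are derived here and are not figures quoted in the report.

```python
# Confusion matrix reported above; rows are assumed to be true classes and
# columns predicted classes (the scikit-learn convention).
confusion = [[150, 20],
             [18, 112]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")   # 0.8733
print(f"precision = {precision:.4f}")  # 0.8485
print(f"recall    = {recall:.4f}")     # 0.8615
print(f"f1        = {f1:.4f}")         # 0.8550
```

The accuracy of 262/300 rounds to 0.87, which agrees with the accuracy printed in the output.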

6.2 TEST CASES
Testcase 1

Column Name              Description
-----------------------  ----------------------------------------------------
Test Case ID             TC_001
Test Objective           Evaluate the model’s ability to correctly identify
                         genome disorders with high accuracy.
Test Requirement         The model should achieve an accuracy of at least 85%
                         on the test dataset.
Pass/Fail Criteria       Pass.
Actions Taken            Train the XGBoost model with the provided training
                         data, then evaluate its performance on the test set.
                         Record and compare the accuracy against the required
                         threshold.
Segmentation Errors      N/A
Test Environment         PC
Image Characteristics    Describe the characteristics of the uploaded scans
                         (e.g., slice thickness, resolution, contrast).
Other Potential Issues   N/A

CHAPTER 7
SNAPSHOTS

7.1 SNAPSHOTS

ACTIVATING PROMPT AND RUNNING [Link] FILE

Figure 7.1 Activating prompt

Figure 7.1 shows the prompt being activated. Accessing [Link] redirects users to the
corresponding web page.

#MAIN PAGE

Figure 7.2 Main page

The above figure 7.2 shows the main page.
Figure 7.3 Disorder Subclass Detection

The above figure 7.3 shows the subclasses of disorder detection.
Figure 7.4 Genome Disorder Detection

The above figure 7.4 shows the genome disorder detection classes.
CHAPTER 8
SOFTWARE TESTING

8.1 GENERAL
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies, and the finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests, and each test type addresses a specific testing requirement.

8.2 DEVELOPING METHODOLOGIES


The test process is initiated by developing a comprehensive plan to test the general
functionality and special features on a variety of platform combinations. Strict quality control
procedures are used. The process verifies that the application meets the requirements
specified in the system requirements document and is free of known bugs. The following
considerations were used to develop the testing methodologies.

8.3 TYPES OF TESTING

8.3.1 UNIT TESTING


Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly and that program inputs produce valid outputs. All decision branches
and internal code flow should be validated. It is the testing of the individual software units of
the application, and it is done after the completion of an individual unit, before integration.
This is structural testing that relies on knowledge of the unit's construction and is invasive.
Unit tests perform basic tests at component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and contains clearly
defined inputs and expected results.
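
A unit test of this kind can be sketched with Python's built-in `unittest` module. The helper `meets_accuracy_requirement` is hypothetical and was written for this example; it encodes the 85% accuracy requirement from test case TC_001 and is not part of the project code.

```python
import unittest


def meets_accuracy_requirement(accuracy, threshold=0.85):
    """Hypothetical helper: True when accuracy reaches the required threshold."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie between 0 and 1")
    return accuracy >= threshold


class AccuracyRequirementTest(unittest.TestCase):
    def test_passes_at_or_above_threshold(self):
        self.assertTrue(meets_accuracy_requirement(0.87))
        self.assertTrue(meets_accuracy_requirement(0.85))

    def test_fails_below_threshold(self):
        self.assertFalse(meets_accuracy_requirement(0.80))

    def test_rejects_out_of_range_value(self):
        with self.assertRaises(ValueError):
            meets_accuracy_requirement(1.2)


# Run the test case explicitly so the example also works outside a test runner.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AccuracyRequirementTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method covers one decision branch of the helper (pass, fail, invalid input), which mirrors the requirement that all decision branches be validated.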

8.3.2 FUNCTIONAL TEST


Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user
manuals.
Functional testing is centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
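
The valid-input and invalid-input items above can be illustrated with a small validation routine. The field names and the allowed ranges in `ALLOWED_RANGES` are invented for this sketch and do not come from the project's dataset.

```python
# Hypothetical input fields and allowed ranges, invented for this example.
ALLOWED_RANGES = {"patient_age": (0, 14), "blood_cell_count": (4.0, 5.7)}


def validate_input(form_data):
    """Return a list of problems; an empty list means the input is accepted."""
    problems = []
    for name, (low, high) in ALLOWED_RANGES.items():
        if name not in form_data:
            problems.append(f"{name}: missing")
            continue
        try:
            value = float(form_data[name])
        except (TypeError, ValueError):
            problems.append(f"{name}: not a number")
            continue
        if not low <= value <= high:
            problems.append(f"{name}: outside {low}-{high}")
    return problems


print(validate_input({"patient_age": "6", "blood_cell_count": "4.9"}))  # []
print(validate_input({"patient_age": "abc", "blood_cell_count": "9.9"}))
```

A functional test would then assert that every identified class of valid input yields an empty list and that every class of invalid input (missing, non-numeric, out of range) is rejected with a message.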

8.3.3 SYSTEM TEST


System testing ensures that the entire integrated software system meets requirements. It tests
a configuration to ensure known and predictable results. An example of system testing is the
configuration oriented system integration test. System testing is based on process
descriptions and flows, emphasizing pre-driven process links and integration points.

8.3.4 PERFORMANCE TEST


The performance test ensures that output is produced within the time limits. It measures the
time the system takes to compile, to respond to users, and to process the requests sent to it
to retrieve results.

8.3.5 INTEGRATION TESTING
Software integration testing is the incremental testing of two or more integrated
software components on a single platform to expose failures caused by interface
defects.
The task of the integration test is to check that components or software applications (for
example, components in a software system or, one step up, software applications at the
company level) interact without error.
8.3.6 ACCEPTANCE TESTING
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.

ACCEPTANCE TESTING FOR DATA SYNCHRONIZATION:


➢ The Acknowledgements will be received by the Sender Node after the Packets are
received by the Destination Node
➢ The Route add operation is done only when there is a Route request in need
➢ The Status of Nodes information is done automatically in the Cache Updation process

CHAPTER 9
APPLICATIONS AND FUTURE ENHANCEMENT

9.1 General

Genetic disorders pose a significant challenge in biomedical science due to their complexity
and the substantial impact they have on global health. Accurate prediction and classification
of these disorders are critical for effective diagnosis and treatment. This paper focuses on
enhancing the prediction of genome disorders by employing an advanced Gradient Boosting
model, specifically the XGBoost Algorithm, to analyze a comprehensive dataset of genetic
information. By leveraging this approach, we aim to achieve high prediction accuracy and
reliability in identifying single-gene, mitochondrial, and multifactorial genetic disorders. Our
goal is to advance the field of genetic disorder prediction and improve clinical outcomes
through more precise and actionable insights.

9.2 Applications

Predictive Analytics in Healthcare: XGBoost can be applied to predict patient outcomes
and disease progression based on various medical datasets, enabling more personalized
treatment plans.

Genetic Disorder Diagnosis: The algorithm helps in the classification and prediction of
genetic disorders by analyzing complex genetic data, assisting in early diagnosis and targeted
interventions.

9.3 FUTURE ENHANCEMENT
Expanding this study to include additional genetic disorders and incorporating more
advanced prediction models could significantly enhance its scope and impact. By integrating
a broader range of genetic disorders, the research can provide a more comprehensive
understanding of the genetic factors influencing various conditions. This expansion would
also facilitate the development of more precise diagnostic tools and personalized treatment
strategies. Additionally, incorporating cutting-edge prediction models, such as ensemble
methods or hybrid approaches that combine different machine learning techniques, could
improve the accuracy and reliability of predictions. These advancements would contribute to
more effective early detection and management of genetic disorders, ultimately benefiting
patient outcomes and advancing the field of genetic research.

CHAPTER 10
CONCLUSION

10.1 CONCLUSION

In conclusion, technological advancements in artificial intelligence have profoundly
impacted the field of biological research. In this study, we enhanced the original AGDPM
model by integrating a machine learning approach. Genetic anomaly data were sourced from
online databases, and the XGBoost algorithm was employed to refine the AGDPM. The
model's performance was evaluated using a range of statistical metrics. The AGDPM
demonstrated superior prediction accuracy (92.65%) compared to ResNet-50 for identifying
diseases linked to single-gene mutations, mitochondrial disorders, and multifactorial
diseases. By improving the prediction of genetic abnormalities, the AGDPM has the potential
to advance biomedical research significantly. Future work could include incorporating
additional genetic disorders and refining prediction models to achieve even greater accuracy.

REFERENCES

[1] McKusick-Nathans Institute of Genetic Medicine. Online Mendelian Inheritance in Man.
Johns Hopkins University School of Medicine. Accessed: Nov. 1, 2021. Available:
[Link]/omim.

[2] B. Irom, ‘‘Genetic disorders: A literature review,’’ Genet. Mol. Biol. Res., vol. 4, no. 2,
p. 30, 2020.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification with deep
convolutional neural networks,’’ Commun. ACM, vol. 60, no. 2, pp. 84–90, Jun. 2012.

[4] S. J. Sanders, ‘‘First glimpses of the neurobiology of autism spectrum disorder,’’ Current
Opinion Genet. Develop., vol. 33, pp. 80–92, Aug. 2015.

[5] Europe PMC Funders Group, ‘‘Biological insights from 108 schizophrenia-associated
genetic loci,’’ Nature, vol. 511, no. 7510, pp. 421–427, Jul. 2014.

[6] J. Menche, A. Sharma, M. Kitsak, S. D. Ghiassian, M. Vidal, J. Loscalzo, and A.-L.
Barabasi, ‘‘Uncovering disease-disease relationships through the incomplete interactome,’’
Science, vol. 347, no. 6224, Feb. 2015, Art. no. 1257601.

[7] A. L. Barabási, N. Gulbahce, and J. Loscalzo, ‘‘Network medicine: A network-based
approach to human disease,’’ Nature Rev. Genet., vol. 12, pp. 56–68, Oct. 2011.

[8] M. Vidal, M. E. Cusick, and A. L. Barabási, ‘‘Interactome networks and human disease,’’
Cell, vol. 144, no. 6, pp. 986–998, Mar. 2011.

[9] X. Wang, N. Gulbahce, and H. Yu, ‘‘Network-based methods for human disease gene
prediction,’’ Briefings Funct. Genomics, vol. 10, no. 5, pp. 280–293, 2011.

[10] T.-P. Nguyen and T. B. Ho, ‘‘Detecting disease genes based on semi-supervised learning
and protein–protein interaction networks,’’ Artif. Intell. Med., vol. 54, no. 1, pp. 63–71, Jan.
2012.

[11] P. Yang, X. L. Li, J. P. Mei, C. K. Kwoh, and S. K. Ng, ‘‘Positive-unlabeled learning
for disease gene identification,’’ Bioinformatics, vol. 28, no. 20, pp. 2640–2647, 2012.

[12] A. Rishabh. Of Genomes and Genetics HackerEarth Machine Learning Challenge.
Kaggle. Accessed: Oct. 27, 2021. Available: [Link]
genomes-and-genetics-hackerearth-ml-challenge.

[13] P. Han, P. Yang, P. Zhao, S. Shang, Y. Liu, J. Zhou, X. Gao, and P. Kalnis, ‘‘GCN-MF:
Disease-gene association identification by graph convolutional networks and matrix
factorization,’’ in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul.
2019, pp. 705–713.

[14] X. Zeng, Y. Liao, Y. Liu, and Q. Zou, ‘‘Prediction and validation of disease genes using
HeteSim scores,’’ IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 14, no. 3, pp. 687–695, May
2017.

[15] H. Zhou and J. Skolnick, ‘‘A knowledge-based approach for predicting gene–disease
associations,’’ Bioinformatics, vol. 32, no. 18, pp. 2831–2838, Sep. 2016.

[16] Y. Li, H. Kuwahara, P. Yang, L. Song, and X. Gao, ‘‘PGCN: Disease gene prioritization
by disease and gene embedding through graph convolutional neural networks,’’ bioRxiv, vol.
2019, Jan. 2019, Art. no. 532226, doi: 10.1101/532226.

[17] K. Yang, Y. Zheng, K. Lu, K. Chang, N. Wang, Z. Shu, J. Yu, B. Liu, Z. Gao, and X.
Zhou, ‘‘PDGNet: Predicting disease genes using a deep neural network with multi-view
features,’’ IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 19, no. 1, pp. 575–584, Jan. 2022,
doi: 10.1109/TCBB.2020.3002771.

[18] M. Alshahrani and R. Hoehndorf, ‘‘Semantic disease gene embeddings (SmuDGE):
Phenotype-based disease gene prioritization without phenotypes,’’ Bioinformatics, vol. 34,
no. 17, pp. i901–i907, Sep. 2018.

[19] K. Yang, R. Wang, G. Liu, Z. Shu, N. Wang, R. Zhang, J. Yu, J. Chen, X. Li, and X.
Zhou, ‘‘HerGePred: Heterogeneous network embedding representation for disease gene
prediction,’’ IEEE J. Biomed. Health Informat., vol. 23, no. 4, pp. 1805–1815, Jul. 2019.

[20] K. Yang, N. Wang, G. Liu, R. Wang, J. Yu, R. Zhang, J. Chen, and X. Zhou,
‘‘Heterogeneous network embedding for identifying symptom candidate genes,’’ J. Amer.
Med. Inform. Assoc., vol. 25, Nov. 2018.

[21] Y. Liu, H. Q. Qu, X. Chang, L. Tian, J. Qu, J. Glessner, P. M. A. Sleiman, and H.
Hakonarson, ‘‘Machine learning reduced gene/non-coding RNA features that classify
schizophrenia patients accurately and highlight insightful gene clusters,’’ Int. J. Mol. Sci.,
vol. 22, no. 7, p. 3364, Mar. 2021.

[22] Y. Liu, H. Q. Qu, F. D. Mentch, J. Qu, X. Chang, K. Nguyen, L. Tian, J. Glessner, P.
M. A. Sleiman, and H. Hakonarson, ‘‘Application of deep learning algorithm on whole
genome sequencing data uncovers structural variants associated with multiple mental
disorders in African American patients,’’ Mol. Psychiatry, vol. 27, no. 3, pp. 1469–1478,
Mar. 2022, doi: 10.1038/s41380-021-01418-1.

[23] Rectifier (Neural Networks). Accessed: Nov. 4, 2021.

[24] Statistics#03—Standard Deviation and Variance. Accessed: Nov. 4, 2021. Available:
[Link]
9724f33b58df.

[25] Softmax Activation Function—How It Actually Works. Accessed: Nov. 4, 2021.
Available: [Link]
works-d292d335bd78.

[26] A.-U. Rahman, S. Abbas, M. Gollapalli, R. Ahmed, S. Aftab, M. Ahmad, M. A. Khan,
and A. Mosavi, ‘‘Rainfall prediction system using machine learning fusion for smart cities,’’
Sensors, vol. 22, no. 9, p. 3504, May 2022.

[27] M. Saleem, S. Abbas, T. M. Ghazal, M. A. Khan, N. Sahawneh, and M. Ahmad, ‘‘Smart
cities: Fusion-based intelligent traffic congestion control system for vehicular networks using
machine learning techniques,’’ Egyptian Informat. J., vol. 6, pp. 1–10, Apr. 2022.

[28] M. W. Nadeem, H. G. Goh, M. A. Khan, M. Hussain, M. F. Mushtaq, and V. A.
Ponnusamy, ‘‘Fusion-based machine learning architecture for heart disease prediction,’’
Comput., Mater. Continua, vol. 67, no. 2, pp. 2481–2496, 2021.

[29] S. Y. Siddiqui, A. Athar, M. A. Khan, S. Abbas, Y. Saeed, M. F. Khan, and M. Hussain,
‘‘Modeling, simulation and optimization of diagnosis cardiovascular disease using
computational intelligence approaches,’’ J. Med. Imag. Health Informat., vol. 10, no. 5, pp.
1005–1022, May 2020.

[30] N. Taleb, S. Mehmood, M. Zubair, I. Naseer, B. Mago, and M. U. Nasir, ‘‘Ovary cancer
diagnosing empowered with machine learning,’’ in Proc. Int. Conf. Bus. Anal. Technol.
Secur. (ICBATS), Feb. 2022, pp. 1–6.

[31] A.-U. Rahman, A. Alqahtani, N. Aldhafferi, M. U. Nasir, M. F. Khan, M. A. Khan, and
A. Mosavi, ‘‘Histopathologic oral cancer prediction using oral squamous cell carcinoma
biopsy empowered with transfer learning,’’ Sensors, vol. 22, no. 10, p. 3833, May 2022.
