Advanced Genetic Disorder Prediction Model
_______________________________________________________
Abstract iv
List Of Figures v
List Of Symbols vi
CHAPTER 1 : INTRODUCTION
1.1 General 1
1.2 Objective 2
1.3 Existing System 3
1.3.1 Disadvantages of Existing System 3
1.4 Literature Survey 5
1.5 Proposed System 10
1.5.1 Proposed System Advantages 10
CHAPTER 2 : PROJECT DESCRIPTION
2.1 General 12
2.2 Methodologies 12
2.2.1 Modules Name 12
2.2.2 Modules Description 13
2.3 Technique used or Algorithm used 16
2.3.1 Existing Technique 16
2.3.2 Proposed Technique and Algorithm Used 16
CHAPTER 3 : REQUIREMENTS ENGINEERING
3.1 General 17
3.2 Hardware Requirements 17
3.3 Software Requirements 17
3.4 Functional Requirements 18
3.5 Non-Functional Requirements 19
CHAPTER 4 : DESIGN ENGINEERING
4.1 General 20
4.2 UML Diagrams 20
4.2.1 Use Case Diagram 21
4.2.2 Class Diagram 22
4.2.3 Object Diagram 23
4.2.4 Component Diagram 24
4.2.5 Deployment Diagram 25
4.2.6 Sequence Diagram 26
4.2.7 Collaboration Diagram 27
4.2.8 Statechart Diagram 28
4.2.9 Activity Diagram 29
4.3 Data Flow Diagram 30
4.4 System Architecture 31
8.3.5 Integration Testing 53
8.3.6 Acceptance Testing 53
CHAPTER 9 : APPLICATIONS AND FUTURE ENHANCEMENT
9.1 General 54
9.2 Applications 54
9.3 Future Enhancements 55
CHAPTER 10 : CONCLUSION
10.1 Conclusion 56
REFERENCES 57
ABSTRACT
Genetic illness prediction is an important and timely issue in the realm of biomedical
science. Mutations in the genome are the root cause of many diseases with significant global
mortality rates, including Alzheimer's, cancer, diabetes, cystic fibrosis, Leigh syndrome, and
others. Theoretical and explanatory approaches to predicting genetic abnormalities have been
developed through prior research. Genetic data has expanded to practically include the entire
genome and protein, and methods based on deep learning and machine learning have been
created to forecast genomic abnormalities in response. Concurrently with the introduction of
machine learning techniques, deep learning methods also emerged. Studies on the forecasting
of genetic anomalies have previously employed a variety of learning strategies, including
supervised, unsupervised, and semi-supervised approaches. Most of these studies used
genetic sequence data to make predictions on binary classification problems. These methods produced
questionable results since they were less accurate and relied on binary class prediction algorithms,
which ignore the histories of individuals with genetic anomalies. The majority of the approaches
relied on RNA gene sequences, which led to frequent issues when dealing with auction data.
Here, we present an advanced genome disorder prediction model (AGDPM) that uses the XGBoost
algorithm to predict multiclass genome disorders from a large dataset. AGDPM, trained with the
XGBoost algorithm, achieved an average accuracy of 92.65% in both the training and testing
phases of the study. Therefore, the state-of-the-art
genome disorder prediction model can reliably predict genome disorder and analyse a large
quantity of patient genome disorder data thanks to the incorporation of a multi-class
prediction technique. Multiple statistical performance metrics demonstrate that AGDPM may
accurately predict diseases caused by a single gene, mitochondrial genes, and multiple genes.
As a result, AGDPM will help biomedical researchers manage mortality rates and anticipate
genetic disorders.
LIST OF FIGURES
LIST OF SYMBOLS
_______________________________________________________
NOTATION
S.NO   NOTATION NAME        DESCRIPTION
1.     Class                Represents a collection of similar entities grouped together. The class box lists the class name, its attributes, and its operations; members are marked + (public), - (private), or # (protected).
2.     Association          Represents static relationships between classes. Roles represent the way the two classes see each other.
3.     Actor                It aggregates several classes into a single class.
5.     Relation (uses)      Used for additional process communication.
6.     Relation (extends)   The extends relationship is used when one use case is similar to another use case but does a bit more.
13.    Use case             Interaction between the system and the external environment.
14.    Component            Represents physical modules which are a collection of components.
15.    Node                 Represents physical modules which are a collection of components.
19.    Object Lifeline      Represents the vertical dimension along which the object communicates.
CHAPTER 1
INTRODUCTION
1.1 GENERAL
It is estimated that nearly 2,000 different human diseases can be traced back to a single faulty
gene, classifying them as monogenic syndromes. The genes responsible for each condition
exhibit various manifestations, leading to a diverse range of phenotypic outcomes. Therefore,
establishing phenotype-gene correlations is crucial for researchers and medical professionals
in deciphering the fundamental genetic mechanisms behind these disorders. Identifying
disease-causing genes aids in patient diagnosis and provides insight into the complex network
of genetic interactions. Essentially, a potential genetic disease can be detected by analyzing
causative mutant genotypes during the gene identification process. Genetic anomalies, such
as single nucleotide changes, additions or deletions, and complete gene loss, can all impact
disease-causing genes. Traditional approaches to identifying pathogenic genes include
positional cloning, linkage analysis, and mutation analysis. Initially, linkage analysis on
human pedigrees helps locate the chromosomal interval associated with the disease,
identifying candidate genes in the region. Next, positional cloning involves sequencing a set
of candidate genes within this interval, combining spatial and transcriptional mapping.
Human genetic disorders are inherited conditions arising from genetic or chromosomal
abnormalities present from conception. These disorders fall into two primary categories:
single-gene diseases and complex disorders. Single-gene diseases result from a mutation in
a single gene and are passed down easily from one generation to the next, referred to as
Mendelian diseases. Complex diseases, on the other hand, result from a combination of
environmental, behavioral, and lifestyle factors, with genetic defects contributing only a
small fraction to the overall phenotype. Single-gene disorders can originate in any gene, but
they share common genetic and psychosocial care needs, allowing for informed decisions on
risk management and support for affected individuals. Mitochondrial diseases, caused by
alterations in mitochondrial DNA rather than nuclear DNA, are inherited solely from the
mother. These diseases can present with symptoms such as lactic acidosis, stroke-like
episodes, eye abnormalities, and encephalopathy. Inherited disorders have various
underlying causes, and many conditions result from a combination of genetic alterations and
environmental factors. Complex genetic disorders, such as diabetes, Alzheimer's, and cancer,
illustrate the multifaceted nature of polygenic illnesses.
1.2 OBJECTIVE
• To explore how complex genetic disorders underlie conditions such as
diabetes, Alzheimer's, and cancer.
• To consider machine learning as an alternative to conventional methods of genetic
prediction, noting that advancements in this field, along with growing data sets and
computing power, have made deep learning increasingly popular.
• To utilize deep learning methods in statistical genetics to identify interactions
between multiple loci without assuming additivity, addressing the high
dimensionality of factors and improving the prediction of their relative importance.
1.3 EXISTING SYSTEM
In the realm of genetics and medical research, forecasting genome disorders is crucial.
Although Deep Neural Networks (DNNs) have demonstrated significant potential in
addressing this challenge, their effectiveness can be hindered by overfitting. Convolutional
Neural Networks (CNNs) face limitations due to the increased spatial correlation of zeroed-
out values in output feature maps. To combat overfitting, dropout is commonly used. The
current recommendation is to utilize Checkerboard Dropout, a structured dropout method
designed to enhance performance and generalization while addressing the spatial correlation
issue. However, despite its benefits, Checkerboard Dropout may still encounter problems that
require further refinement.
1.3.1 Disadvantage of Existing System
Recommendation for Overfitting Solution
The recommendation is to use Checkerboard Dropout as an effective strategy to address the
overfitting problem in deep learning models. Overfitting occurs when a model performs
exceptionally well on training data but struggles to generalize to new, unseen data. This issue
is particularly prominent in Convolutional Neural Networks (CNNs) where the spatial
correlation of zeroed-out values in output feature maps can hinder performance and
generalization. Checkerboard Dropout offers a targeted solution by introducing structured
dropout, which systematically removes features to reduce the likelihood of overfitting and
improve the model's ability to generalize.
Checkerboard Dropout: A Structured Dropout Technique
Checkerboard Dropout is a structured dropout technique designed to address the issues of
randomness and spatial correlation that commonly affect neural networks. Unlike traditional
dropout methods that randomly eliminate individual features, Checkerboard Dropout
removes contiguous blocks of features, creating a more organized pattern of dropout. This
approach helps mitigate the spatial correlation problem by ensuring that removed features do
not follow an unpredictable, random pattern. As a result, it enhances model generalization
and performance by promoting more robust learning. Despite its advantages, it is important
to note that Checkerboard Dropout may still face challenges that require further refinement
and investigation.
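To make the idea of a structured dropout pattern concrete, the following is a minimal NumPy sketch of a checkerboard-style mask applied to a single feature map. It is an illustration under stated assumptions, not the implementation used by the existing system: the function name `checkerboard_dropout`, the `block` parameter, the random choice of which tiles are dropped, and the rescaling by the kept fraction are all hypothetical choices made for this example.

```python
import numpy as np


def checkerboard_dropout(feature_map, block=2, rng=None, training=True):
    """Zero alternating block x block tiles of an (H, W, C) feature map.

    Assumes the map is at least `block` wide and high, so that some
    tiles are always kept.
    """
    if not training:
        return feature_map  # dropout is only applied during training
    rng = np.random.default_rng() if rng is None else rng
    height, width = feature_map.shape[:2]
    tile_rows = np.arange(height) // block
    tile_cols = np.arange(width) // block
    phase = rng.integers(0, 2)  # randomly choose which "colour" of tile is dropped
    keep = (tile_rows[:, None] + tile_cols[None, :] + phase) % 2 == 0
    # Rescale the surviving activations so the expected magnitude is preserved.
    return feature_map * keep[..., None] / keep.mean()


demo = checkerboard_dropout(np.ones((4, 4, 1)), block=2,
                            rng=np.random.default_rng(0))
print(demo[..., 0])
```

With a 4 x 4 map and `block=2`, half of the 2 x 2 tiles are zeroed and the remaining values are doubled, so dropped features always form contiguous blocks rather than isolated positions.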
1.4 LITERATURE SURVEY
Title: Network-Based Methods for Human Disease Gene Prediction
Year: 2011
Description:
Title: ImageNet Classification with Deep Convolutional Neural Networks
Year: 2012
Title: First Glimpses of the Neurobiology of Autism Spectrum Disorder
Author: S. J. Sanders
Year: 2015
Description:
In this comprehensive review, S. J. Sanders delves into the early insights into the
neurobiological underpinnings of autism spectrum disorder (ASD). The paper meticulously
examines the genetic landscape of ASD, highlighting significant advances in identifying
specific genetic mutations and variations that are associated with the disorder. It discusses
how these genetic findings contribute to a broader understanding of how ASD affects brain
development and function, exploring the implications of these genetic insights for
understanding the etiology of the disorder. Sanders reviews research on brain structure and
connectivity, noting how abnormalities in these areas may relate to the core symptoms of
ASD. The paper also addresses the role of environmental factors and their interaction with
genetic predispositions, providing a nuanced view of the complex interplay between genetics
and environmental influences. Additionally, it outlines the progress made in identifying
potential biomarkers for ASD, which could pave the way for improved diagnostic methods
and targeted therapies. By synthesizing recent research and presenting an overview of key
studies, Sanders provides a valuable resource for researchers and clinicians seeking to
understand the neurobiological basis of ASD and offers insights into future research
directions aimed at unraveling the complexities of this multifaceted disorder.
Title: Biological Insights from 108 Schizophrenia-Associated Genetic Loci
Year: 2014
Description:
This study published in Nature provides an in-depth analysis of 108 genetic loci associated
with schizophrenia. The research focuses on elucidating the biological mechanisms
underlying schizophrenia by examining the genetic variants identified through large-scale
genome-wide association studies (GWAS). The paper highlights how these genetic loci
contribute to the understanding of the disorder's etiology, including their effects on brain
structure, function, and molecular pathways. By integrating genetic data with functional
annotations and gene expression profiles, the authors provide valuable insights into the
biological processes disrupted in schizophrenia. The findings offer a clearer picture of the
genetic architecture of the disorder and suggest potential pathways for future research and
therapeutic interventions. This comprehensive analysis underscores the complexity of
schizophrenia’s genetic basis and the importance of continued research in uncovering the
mechanisms driving this severe mental illness.
Title: Uncovering Disease-Disease Relationships Through the Incomplete Interactome
Year: 2015
Description:
This paper, published in Science, explores the relationships between different diseases
through the lens of the incomplete interactome—a network of protein interactions that is not
fully mapped. The authors propose a novel approach to uncovering disease-disease
relationships by integrating partial interactome data with disease association information.
They demonstrate that by analyzing the connections between proteins involved in different
diseases, it is possible to identify previously unknown relationships between diseases. The
study provides insights into how interactions within this incomplete network can reveal
shared molecular pathways and mechanisms across various diseases. The findings have
significant implications for understanding the comorbidities and underlying biological
connections between different health conditions, potentially guiding future research and
therapeutic strategies. The approach highlights the value of network-based analyses in
revealing complex disease relationships that are not apparent through traditional methods.
1.5 PROPOSED SYSTEM
Complex multiple gene abnormalities can lead to a diverse array of symptoms, encompassing
multifactorial genome disorders, mitochondrial gene inheritance disorders, and single-gene
inheritance disorders. Recent advancements in genomic technology have enabled more
precise acquisition of genetic data. Large-scale genetic studies, including those focused on
multifactorial genome disorders (MGD) and single-gene inheritance disorders (SGID), have
identified hundreds of individuals with various abnormalities. However, despite the vast
amount of data generated by these studies, identifying the specific genes responsible for the
diseases remains challenging. Additionally, since mitochondrial DNA is inherited
maternally, mothers are the primary source of mitochondrial disorders in their children, as
the organelles are maintained through fertilization.
High Prediction Accuracy with XGBoost Algorithm
The suggested model, utilizing the XGBoost algorithm, achieved an impressive 92.65%
prediction accuracy based on patients' clinical feature data. This high level of accuracy
demonstrates the model's effectiveness in processing and analyzing complex clinical
information, leading to reliable and precise predictions. The XGBoost algorithm's robustness
and accuracy significantly contribute to the system's overall performance.
CHAPTER 2
PROJECT DESCRIPTION
2.1 GENERAL
The paper explores advancements in predictive modeling for genetic disorders using modern
machine learning techniques. It focuses on leveraging deep learning methodologies,
particularly Convolutional Neural Networks (CNNs), to enhance the accuracy and efficiency
of predicting multifactorial and single-gene abnormalities.
The study addresses challenges such as overfitting and spatial correlation in feature maps,
proposing solutions like Checkerboard Dropout to mitigate these issues and improve model
generalization. It highlights the use of advanced algorithms, including XGBoost, to achieve
high prediction accuracy, optimize computational performance, and reduce space
complexity.
Furthermore, the paper discusses the integration of gradient descent methods for loss
minimization and the independence of feature engineering processes to streamline model
development. It underscores the significance of recent genomic technologies in providing
precise genetic data and the difficulty of pinpointing disease-causing genes despite the
availability of large-scale genetic studies.
2.2 METHODOLOGIES
➢ Data preparation
➢ Model Selection
➢ Analysis and prediction
➢ Accuracy on the test set
➢ Saving the trained model
• Status: Whether the person or patient is alive or deceased.
• Respiratory Rate (breaths/min): The rate of breathing controlled by the brain's
respiratory center.
• Heart Rate (rates/min): The frequency of heartbeats per minute.
• Test 1: Status of Test 1.
• Test 2: Status of Test 2.
• Test 3: Status of Test 3.
• Test 4: Status of Test 4.
• Test 5: Status of Test 5.
• Parental Consent: Indicates if parental assent was provided for participation.
• Follow-up Level: Indicates whether follow-up is high or low.
• Gender: Male, Female, or Indeterminate.
• Birth Asphyxia: Condition where insufficient oxygen is received during childbirth.
• Autopsy Reveals Birth Defect (if any): Findings from an autopsy regarding birth
defects.
• Place of Birth: The birthplace.
• Information about Folic Acid (peri-conceptional): Data on folic acid, a vitamin
important for new cell production.
• H/O Serious Maternal Disease: Indicates whether the patient's mother has a history of
serious illness.
• H/O Radiation Exposure (x-ray): Indicates if the patient has been exposed to
radiation.
• H/O Substance Abuse: Indicates if a parent has struggled with drug addiction.
• Assisted Conception (IVF/ART): Type of infertility therapy used.
• Previous Pregnancy Abnormalities: History of abnormalities in prior pregnancies.
• Number of Prior Abortions: Total number of prior abortions.
• Birth Defects: Indicates if the patient has birth defects.
• White Blood Cell Count: Number of white blood cells per microliter.
• Blood Test Result: Categorized as Normal, Slightly Abnormal, Unclear, or
Abnormal.
• Symptom 1: Presence of Symptom 1.
• Symptom 2: Presence of Symptom 2.
• Symptom 3: Presence of Symptom 3.
• Symptom 4: Presence of Symptom 4.
• Symptom 5: Presence of Symptom 5.
• Genetic Disorder: Professional detection of genetic disorders.
• Type of Disorder: Subclass of the disorder.
3. Data Preparation: Prepare the data for training by cleaning and organizing it. This
involves eliminating duplicates, correcting errors, addressing missing values, normalizing
the data, converting data types as needed, and removing any other potential inconsistencies.
Randomize the data to ensure that any effects from the specific order in which it was collected
or processed are minimized. Next, conduct further exploratory analysis, which includes
visualizing the data to identify any significant class imbalances or relationships between
variables, while being cautious of potential biases. Finally, split the data into training and
assessment sets to facilitate model evaluation.
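The preparation steps above can be sketched with pandas and scikit-learn as follows. This is a minimal illustration rather than the project's actual code: the function name `prepare_data`, the lowercase column names, and the small synthetic table are assumptions made for the example, and normalization is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_data(df, target, seed=42, test_size=0.2):
    """Clean, shuffle, encode, and split a table (hypothetical helper)."""
    # Remove exact duplicates and rows without a label.
    df = df.drop_duplicates().dropna(subset=[target])
    # Randomize the row order to remove any effect of collection order.
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    X = df.drop(columns=[target])
    y = df[target].astype("category").cat.codes  # class labels as integers

    # Fill missing numeric values with the column median.
    num_cols = X.select_dtypes(include="number").columns
    cat_cols = X.columns.difference(num_cols)
    X[num_cols] = X[num_cols].fillna(X[num_cols].median())
    # One-hot encode categorical columns, keeping "missing" as its own category.
    X = pd.get_dummies(X, columns=list(cat_cols), dummy_na=True)

    # Stratify so both sets keep the same class proportions.
    return train_test_split(X, y, test_size=test_size,
                            stratify=y, random_state=seed)


# Synthetic stand-in for the patient table (all values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "heart_rate": rng.normal(80, 10, 40),
    "gender": rng.choice(["Male", "Female"], 40),
    "disorder": ["A", "B"] * 20,
})
demo.loc[3, "heart_rate"] = np.nan        # one missing value
demo = pd.concat([demo, demo.iloc[[0]]])  # one exact duplicate row

X_train, X_test, y_train, y_test = prepare_data(demo, "disorder")
print(X_train.shape, X_test.shape)
```

Stratifying the split is one way to limit the effect of the class imbalances mentioned above, since both sets then contain each disorder class in roughly the same proportion.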
4. Model selection: We trained XGBoost and Support Vector Machine (SVM) models, which
produced accuracies of 98% and 80% on the training set, respectively, and based the proposed
method on these results.
5. Analysis and prediction: Out of the entire dataset, we used only two groups of attributes:
• Health values: the attributes that describe the patient's health measurements.
• Outcome: indicates the type of genetic condition that the patient or individual has.
6. Accuracy on test set: We obtained accuracies of 92.65% and 41.40% on the test set.
7. Saving the trained model: Once the model has been trained and validated, save it as a
`.pkl` file using Python's pickle module, which is part of the standard library. The saved
file can then be loaded to deploy the model in a production setting.
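A minimal sketch of saving and reloading a model with pickle is shown below. The decision tree is only a stand-in for the trained model, and the file name `disorder_model.pkl` is a hypothetical choice for this example.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in model trained on synthetic data (not the project's model).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Hypothetical file name, written to a temporary directory.
model_path = Path(tempfile.mkdtemp()) / "disorder_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, for example inside the web application, load the model again.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model gives the same predictions as the original.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.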
2.3 TECHNIQUE USED OR ALGORITHM USED
CHAPTER 3
REQUIREMENTS ENGINEERING
3.1 GENERAL
The interpretation of the handwriting character by developing techniques and methods such
as improvement of character classification techniques. The accurate and rapid classification
for accurate information retrieval, sound classification, stock price forecasting.
The hardware requirements may serve as the basis for a contract for the implementation of
the system and should therefore be a complete and consistent specification of the whole
system. They are used by software engineers as the starting point for the system design. They
state what the system should do, not how it should be implemented.
• Processor - Pentium IV
• Speed - 1.1 GHz
• RAM - 256 MB
• Hard Disk - 20 GB
• Keyboard - Standard Windows Keyboard
• Mouse - Two or Three Button Mouse
• Monitor - SVGA
The software requirements document is the specification of the system. It should include both
a definition and a specification of requirements. It states what the system should do rather
than how it should do it. The software requirements provide a basis for creating the software
requirements specification. It is useful in estimating cost, planning team activities,
performing tasks, and tracking the team's progress throughout the development activity.
HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
The functional requirements for the proposed system involve developing a predictive model
using XGBoost, a robust machine learning algorithm known for its high performance,
scalability, and accuracy. The system must effectively handle and process large datasets
related to genetic diseases, ensuring accurate prediction and classification outcomes. The
model will incorporate regularised objective functions (L1 and L2), focusing on minimising
the convex loss function and penalising model complexity. The system must be capable of
integrating new regression trees to predict residuals from previous iterations, refining the
final prediction. The end goal is to create a reliable and efficient tool for analyzing genetic
data and predicting disease outcomes using the XGBoost algorithm.
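The idea of adding regression trees that predict the residuals of previous iterations can be sketched as follows. This is a simplified illustration with a squared-error loss and scikit-learn trees, not XGBoost itself: it omits the L1 and L2 penalty terms and the second-order approximation that XGBoost uses, and the function names and default settings are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Fit an additive model in which each tree predicts the current residuals."""
    y = np.asarray(y, dtype=float)
    base = float(y.mean())                # initial constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred               # negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += learning_rate * tree.predict(X)  # refine the prediction
        trees.append(tree)
    return base, learning_rate, trees


def predict_boosted(model, X):
    """Sum the base value and the scaled contribution of every tree."""
    base, learning_rate, trees = model
    return base + learning_rate * sum(tree.predict(X) for tree in trees)


# Toy regression problem (synthetic data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0])

boosted = fit_boosted_trees(X_demo, y_demo)
baseline_mse = np.mean((y_demo - y_demo.mean()) ** 2)
boosted_mse = np.mean((y_demo - predict_boosted(boosted, X_demo)) ** 2)
print(baseline_mse, boosted_mse)
```

Each round reduces the training error a little, and the learning rate controls how much of each new tree is added to the final prediction.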
3.5 NON-FUNCTIONAL REQUIREMENTS
Usability
The system is designed as a fully automated process, so little or no user intervention is
required.
Reliability
The system is reliable because of the qualities inherited from the chosen platform, Python.
Performance
The system is developed in high-level languages using advanced front-end and back-end
technologies, so it responds to the end user on the client system within a very short time.
Supportability
The system is designed to be cross-platform. It is supported on a wide range of hardware
and software platforms.
Implementation
The system is implemented in a web environment using the Django framework. The application
is hosted on a web server, and Windows XP Professional is used as the platform. The user
interface is provided by the Django web application.
CHAPTER 4
DESIGN ENGINEERING
4.1 GENERAL
Design engineering deals with the various UML (Unified Modeling Language)
diagrams for the implementation of the project. Design is a meaningful engineering
representation of a thing that is to be built. Software design is a process through which the
requirements are translated into a representation of the software. Design is the place where
quality is rendered in software engineering. Design is the means to accurately translate
customer requirements into a finished product.
4.2.1 USE CASE DIAGRAM
EXPLANATION
Use-case diagrams describe the high-level functions and scope of a system. These diagrams
also identify the interactions between the system and its actors. The use cases and actors in
use-case diagrams describe what the system does and how the actors use it, but not how the
system operates internally.
4.2.2 CLASS DIAGRAM
EXPLANATION
In software engineering, a class diagram in the Unified Modeling
Language (UML) is a type of static structure diagram that describes the structure of a system
by showing the system's classes, their attributes, operations (or methods), and the
relationships among objects.
4.2.3 OBJECT DIAGRAM
EXPLANATION
An object is an instance of a class at a particular moment at runtime, with its own state
and data values. Likewise, a static UML object diagram is an instance of a class diagram; it
shows a snapshot of the detailed state of a system at a point in time. An object diagram thus
encompasses objects and their relationships, and may be considered a special case of a class
diagram or a communication diagram.
4.2.4 COMPONENT DIAGRAM
EXPLANATION
In the Unified Modeling Language, a component diagram depicts how components are wired
together to form larger components and/or software systems. They are used to illustrate the
structure of arbitrarily complex systems. The user gives a main query, which is converted into
sub-queries and sent through data dissemination to data aggregators. Results are shown
to the user by the data aggregators. All boxes are components, and arrows indicate dependencies.
4.2.5 DEPLOYMENT DIAGRAM
EXPLANATION
The Unified Modeling Language (UML) is the standard language that many software
engineers and business professionals use to create a broad overview for complex systems. A
deployment diagram is one type of diagram created with this language.
4.2.6 SEQUENCE DIAGRAM
EXPLANATION
A sequence diagram is a type of interaction diagram because it describes how—and
in what order—a group of objects works together. These diagrams are used by software
developers and business professionals to understand requirements for a new system or to
document an existing process.
4.2.7 COLLABORATION DIAGRAM
EXPLANATION
A collaboration diagram, also known as a communication diagram, is an illustration
of the relationships and interactions among software objects in the Unified Modelling
Language (UML).
4.2.8 STATE CHART DIAGRAM
EXPLANATION
The terms state diagram and statechart diagram are often used interchangeably. A state
diagram is used to model the dynamic behavior of a class in response to time and changing
external stimuli. Every class has a state, but not every class is modeled using a state diagram.
We prefer to model classes that have three or more states.
4.2.9 ACTIVITY DIAGRAM
EXPLANATION
An activity diagram is a type of Unified Modeling Language (UML) flowchart that shows
the flow from one activity to another in a system or process. It's used to describe the different
dynamic aspects of a system and is referred to as a 'behavior diagram' because it describes
what should happen in the modeled system.
4.3 DATA FLOW DIAGRAM
EXPLANATION
Figure 4.10 above shows a sequence that starts with user data input, followed by array
conversion. Next, the model is loaded and applied to the converted data. Finally, genome
disorder detection is performed.
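The flow from user input to prediction can be sketched as follows. The example is a hedged illustration only: the feature order, the class labels, the toy training values, and the helper names `to_feature_array` and `predict_disorder` are hypothetical, and a small decision tree stands in for the loaded model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature order and class labels for this illustration.
FEATURE_ORDER = ["heart_rate", "respiratory_rate", "white_blood_cell_count"]
LABELS = ["class_0", "class_1"]


def to_feature_array(record):
    """Convert a dict of user inputs into a 2-D array in a fixed column order."""
    return np.array([[float(record[name]) for name in FEATURE_ORDER]])


def predict_disorder(model, record):
    """Apply the model to one converted record and return its class label."""
    class_index = int(model.predict(to_feature_array(record))[0])
    return LABELS[class_index]


# Stand-in for the loaded model, fitted on made-up values.
train_X = np.array([[60, 14, 5.0], [65, 15, 6.0],
                    [110, 30, 11.0], [120, 32, 12.0]])
train_y = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)

user_input = {"heart_rate": 115, "respiratory_rate": 31,
              "white_blood_cell_count": 11.5}
result = predict_disorder(model, user_input)
print(result)
```

Fixing the column order in one place helps ensure that the array passed to the model at prediction time matches the layout used during training.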
4.4 SYSTEM ARCHITECTURE
EXPLANATION
The figure above shows the system architecture. The data is first processed, and feature
extraction is performed on the entire dataset to produce a feature set. The dataset is then
divided into training and test sets. Finally, the XGBoost model is applied to the data for
accurate outputs.
CHAPTER 5
DEVELOPMENT TOOLS
5.1 GENERAL
Python
History of Python
Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++, Algol-
68, Smalltalk, and Unix shell and other scripting languages.
Python is copyrighted. Its source code is available under the Python Software Foundation
License, an open-source license that is compatible with the GNU General Public License (GPL).
Python is now maintained by a core development team, and Guido van Rossum directed its
progress for many years.
Importance of Python
• Python is Interpreted − Python is processed at runtime by the interpreter. You do
not need to compile your program before executing it. This is similar to PERL and
PHP.
• Python is Interactive − You can actually sit at a Python prompt and interact with
the interpreter directly to write your programs.
• Python is Object-Oriented − Python supports Object-Oriented style or technique of
programming that encapsulates code within objects.
Features of Python
• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined
syntax. This allows the student to pick up the language quickly.
• Easy-to-read − Python code is more clearly defined and visible to the eyes.
• A broad standard library − Python's bulk of the library is very portable and cross-
platform compatible on UNIX, Windows, and Macintosh.
• Interactive Mode − Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.
• Portable − Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.
• Extendable − You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more efficient.
• GUI Programming − Python supports GUI applications that can be created and
ported to many system calls, libraries and windows systems, such as Windows MFC,
Macintosh, and the X Window system of Unix.
• Scalable − Python provides a better structure and support for large programs than
shell scripting.
Apart from the above-mentioned features, Python has many other good features. A few are
listed below −
• It provides very high-level dynamic data types and supports dynamic type checking.
• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
• pandas - Python data analysis library, including structures such as data frames.
• scikit-learn - the machine learning algorithms used for data analysis and data mining
tasks.
CHAPTER 6
IMPLEMENTATION
import pandas as pd
import numpy as np
import os
import cv2
import matplotlib.pyplot as plt
import warnings
from tensorflow.keras.layers import Input, Lambda, Dense, Flatten, Dropout
from tensorflow.keras.models import Model
import xgboost as xgb
from tensorflow.keras.preprocessing import image#, image_dataset_from_directory
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow import keras
import tensorflow
#import scipy
#print("Num GPUs Available: ", len(tensorflow.config.list_physical_devices('GPU')))
# Set the seed value for experiment reproducibility.
seed = 1842
tensorflow.random.set_seed(seed)
np.random.seed(seed)
# Turn off warnings for cleaner looking notebook
warnings.filterwarnings('ignore')
#SPLITTING DATA FOR TRAINING AND TESTING SET
target_size=(176,208),
subset="training",
class_mode='categorical')
validation_dataset = image_generator.flow_from_directory(
    directory='genome_disorder_prediction/train',
    target_size=(176,208),
    subset="validation",
    class_mode='categorical')
image_generator_submission = ImageDataGenerator(rescale=1/255)
submission = image_generator_submission.flow_from_directory(
    directory='genome_disorder_prediction/test',
    target_size=(176,208),
    class_mode=None)
#OUTPUT
Found 1279 images belonging to 4 classes.
# Show the first four images of the first training batch with their labels
batch_1_img = train_dataset[0]
for i in range(0, 4):
    img = batch_1_img[0][i]
    lab = batch_1_img[1][i]
    plt.imshow(img)
    plt.title(lab)
    plt.axis('off')
    plt.show()
#ANN
model.compile(optimizer='adam',
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=[keras.metrics.AUC(name='auc')])
callback = keras.callbacks.EarlyStopping(monitor='val_loss',
                                         patience=3,
                                         restore_best_weights=True)
#OUTPUT
#OUTPUT
Loss: 0.46110430359840393
Accuracy: 0.9752264022827148
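    keras.layers.Dropout(0.2),
    keras.layers.Dropout(0.2),
    keras.layers.Dropout(0.2),
model.compile(optimizer='adam',
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=[keras.metrics.AUC(name='auc')])
callback = keras.callbacks.EarlyStopping(monitor='val_loss',
                                         patience=4,
                                         restore_best_weights=True)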
#OUTPUT
Loss: 0.3963065445423126
Accuracy: 0.9703787565231323
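#CNN
# The convolution layers and the output layer are not shown in this listing.
model = keras.models.Sequential([
    keras.layers.MaxPooling2D(),
    keras.layers.MaxPooling2D(),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(),
    keras.layers.Dropout(0.2),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(),
    keras.layers.Dropout(0.2),
    keras.layers.Flatten(),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.7),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
])
model.compile(
    optimizer='adam',
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=[keras.metrics.AUC(name='auc')]
)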
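def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        # Assumed standard form: divide the learning rate by 10 every s epochs
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn
# exponential_decay_fn must first be created by calling exponential_decay(lr0, s);
# the values used are not shown in this listing.
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
checkpoint_cb = keras.callbacks.ModelCheckpoint("Alzheimer_disease_classification_cnn.h5",
                                                save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
history = model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=1
)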
#OUTPUT
#DETERMINING LOSS AND ACCURACY
#OUTPUT
Loss: 5.077194690704346
Accuracy: 0.407303124666214
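The learning-rate scheduler in the CNN listing wraps an exponential_decay helper whose body is not fully shown. The following is a minimal sketch of one common form of such a schedule, assuming the rate is divided by 10 every s epochs; the values lr0=0.01 and s=20 are illustrative only and are not taken from the project.

```python
def exponential_decay(lr0, s):
    """Return a schedule that divides the learning rate by 10 every s epochs."""
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn


# Illustrative values only; the project's lr0 and s are not given in the listing.
schedule = exponential_decay(lr0=0.01, s=20)

# Learning rate at the start, halfway through one decay period, and after one period.
rates = [schedule(epoch) for epoch in (0, 10, 20)]
for epoch, rate in zip((0, 10, 20), rates):
    print(f"epoch {epoch:2d}: learning rate {rate:.6f}")
```

A function of this shape can be passed to a Keras LearningRateScheduler callback, which calls it with the epoch index at the start of each epoch.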
train_labels
#OUTPUT
...
# LABEL NAMES
train_images.shape
#OUTPUT
train_labels.shape
#OUTPUT
(32, 4)
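# DATA VISUALIZATION
Len = 4
Wid = 4
# Create the Len x Wid grid and flatten it so axes[i] can be indexed linearly
fig, axes = plt.subplots(Len, Wid)
axes = axes.ravel()
for i in np.arange(0, 8):
    axes[i].imshow(train_images[i])
    axes[i].set_title(label_names[np.argmax(train_labels[i])])
    axes[i].axis('off')
plt.subplots_adjust(wspace=0.5)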
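The subplot titles above depend on turning a one-hot label row back into a class name. The sketch below illustrates that lookup in isolation. The class names are hypothetical placeholders, since the project's actual label_names list is not shown, and the small label array only mimics the (batch, 4) shape reported for train_labels.

```python
import numpy as np

# Hypothetical class names; the project's actual label_names are not shown.
label_names = ["class_0", "class_1", "class_2", "class_3"]

# A small batch of one-hot labels, shaped like train_labels: (batch, 4).
train_labels = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])


def decode_labels(one_hot, names):
    """Map each one-hot row to its class name via the index of its largest entry."""
    indices = np.argmax(one_hot, axis=1)
    return [names[i] for i in indices]


decoded = decode_labels(train_labels, label_names)
print(decoded)
```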
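from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    use_label_encoder=False
)
#Code
param_grid = {
    # hyperparameter ranges are not shown in this listing
}
# Search the grid with 5-fold cross-validation, scoring each candidate by ROC AUC
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid,
                           scoring='roc_auc', cv=5, verbose=1)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
# Hard class predictions and positive-class probabilities on the test set
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]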
#OUTPUT
Accuracy: 0.87
Confusion Matrix:
[[150 20]
[ 18 112]]
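The reported accuracy can be checked directly against the confusion matrix. The sketch below does so and also derives precision, recall, and F1. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and treats class 1 as the positive class; neither is stated explicitly in the output above.

```python
# Confusion matrix reported above, assuming scikit-learn's layout:
# rows are actual classes, columns are predicted classes.
cm = [[150, 20],
      [18, 112]]

tn, fp = cm[0]  # actual negatives: correctly rejected, falsely flagged
fn, tp = cm[1]  # actual positives: missed, correctly detected
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Under these assumptions the accuracy is 262/300, which rounds to the reported 0.87.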
6.2 TEST CASES
Testcase 1
Column Name             Description
Test Case ID            TC_001
Test Objective          Evaluate the model’s ability to correctly identify genome disorders with high accuracy.
Test Requirement        The model should achieve an accuracy of at least 85% on the test dataset.
Pass/Fail Criteria      Pass.
Actions Taken           Train the XGBoost model with the provided training data, then evaluate its performance on the test set. Record and compare the accuracy against the required threshold.
Segmentation Errors     N/A
Test Environment        PC
Image Characteristics   Describe the characteristics of the uploaded scans (e.g., slice thickness, resolution, contrast).
Other Potential Issues  N/A
CHAPTER 7
SNAPSHOTS
7.1 SNAPSHOTS
Figure 7.1 above shows that accessing [Link] redirects users to the
corresponding web page.
#MAIN PAGE
Figure 7.3 Disorder Subclass Detection
Figure 7.4 Genome Disorder Detection
Figure 7.4 above shows the genome disorder detection classes.
CHAPTER 8
SOFTWARE TESTING
8.1 GENERAL
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies, and a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests. Each test type addresses a specific testing requirement.
and/or system configuration. Unit tests ensure that each unique path of a business process
performs accurately to the documented specifications and contains clearly defined inputs and
expected results.
8.3.5 INTEGRATION TESTING
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects.
The task of the integration test is to check that components or software applications (for
example, components in a software system or, one step up, software applications at the
company level) interact without error.
8.3.6 ACCEPTANCE TESTING
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.
CHAPTER 9
APPLICATIONS AND FUTURE ENHANCEMENT
9.1 General
Genetic disorders pose a significant challenge in biomedical science due to their complexity
and the substantial impact they have on global health. Accurate prediction and classification
of these disorders are critical for effective diagnosis and treatment. This paper focuses on
enhancing the prediction of genome disorders by employing an advanced Gradient Boosting
model, specifically the XGBoost Algorithm, to analyze a comprehensive dataset of genetic
information. By leveraging this approach, we aim to achieve high prediction accuracy and
reliability in identifying single-gene, mitochondrial, and multifactorial genetic disorders. Our
goal is to advance the field of genetic disorder prediction and improve clinical outcomes
through more precise and actionable insights.
9.2 Applications
Genetic Disorder Diagnosis: The algorithm helps in the classification and prediction of
genetic disorders by analyzing complex genetic data, assisting in early diagnosis and targeted
interventions.
9.3 Future Enhancements
Expanding this study to include additional genetic disorders and incorporating more
advanced prediction models could significantly enhance its scope and impact. By integrating
a broader range of genetic disorders, the research can provide a more comprehensive
understanding of the genetic factors influencing various conditions. This expansion would
also facilitate the development of more precise diagnostic tools and personalized treatment
strategies. Additionally, incorporating cutting-edge prediction models, such as ensemble
methods or hybrid approaches that combine different machine learning techniques, could
improve the accuracy and reliability of predictions. These advancements would contribute to
more effective early detection and management of genetic disorders, ultimately benefiting
patient outcomes and advancing the field of genetic research.
CHAPTER 10
CONCLUSION
10.1 CONCLUSION