Deep Learning in Multimodal Medical Data
Review
ARTICLE INFO
Keywords: Deep learning; Multimodal medical data; Review
ABSTRACT
Deep learning methods have achieved significant results in various fields. Due to the success of these methods, many researchers have used deep learning algorithms in medical analyses. Using multimodal data to achieve more accurate results is a successful strategy because multimodal data provide complementary information. This paper first introduces the most popular modalities, fusion strategies, and deep learning architectures. We also explain learning strategies, including transfer learning, end-to-end learning, and multitask learning. Then, we give an overview of deep learning methods for multimodal medical data analysis. We have focused on articles published over the last four years. We end with a summary of the current state-of-the-art, common problems, and directions for future research.
1. Introduction
Medical data analysis using computers has always attracted many researchers, but deep learning methods have revolutionized this field over the past decade. The history of deep learning can be traced back to 1943, when McCulloch and Pitts (1943) created the first mathematical model of a neuron. Although this neuron had no learning mechanism, it laid the foundation for deep learning. Rosenblatt (1957) introduced the perceptron, a single-layer neural network with learning capabilities that performs binary classification on its own. This invention inspired the revolution in research on shallow neural networks. After a lot of research in this field, Ivakhnenko (1968) introduced the Group Method of Data Handling (GMDH) for training neural networks. These networks are widely considered the first deep neural networks of the feedforward multilayer perceptron type. The first convolutional neural network (CNN), called Neocognitron, and the first recurrent neural network (RNN) were introduced in 1980 and 1982, respectively (Fukushima, 1980; Hopfield, 1982). The implementation of backpropagation in neural networks by Rumelhart et al. (1986) opened the gates for training complex deep neural networks easily. Three years later, LeCun et al. (1989) used backpropagation to train a CNN for handwritten digit recognition. This was a breakthrough moment as it laid the foundation of modern computer vision using deep learning. Hinton et al. (2006) proposed deep belief networks, in which the training process is efficient for a large amount of data.
The earliest attempt at using graphics processing units (GPUs) for deep learning was a study by Chellapilla et al. (2006). They implemented a CNN using a GPU, which was four times faster than any equivalent implementation on CPU. However, AlexNet (Krizhevsky et al., 2012), the winning image classification model of the 2012 ImageNet Challenge, proved to be a landmark deep learning model with GPU acceleration. Many complex deep neural networks, such as VGG (Simonyan & Zisserman, 2014), ResNet (He et al., 2016), Inception (Szegedy et al., 2015), and DenseNet (Huang et al., 2017), have been introduced since 2012. All of these networks require high computational resources. Another complex neural network is the generative adversarial network (GAN), created by Goodfellow et al. (2014). The idea of this network is to synthesize realistic data from a random vector. Because of the data scarcity problem, GAN has become popular in medical analyses.
Motivated by the success of deep learning, researchers in medical fields have also attempted to apply deep learning-based approaches to different tasks, such as feature extraction, classification, segmentation in images, prognosis prediction of disease, and overall survival (OS) prediction. Also, some deep learning architectures have been proposed for a specific task in medicine. For example, several architectures have been introduced for medical image segmentation. The most popular one is U-Net, which was first introduced by Ronneberger et al. (2015). The main downside of U-Net is that it can only process 2D images, while most medical images used in clinical practice consist of 3D volumes. To solve this problem, Milletari et al. (2016) designed V-Net, which segments volumetric medical images. Architectures such as U-Net++ (Zhou et al., 2018), U-Net 3+ (H. Huang et al., 2020), R2U-Net (Alom et al., 2019), and attention U-Net (Oktay et al., 2018) have also been designed to improve the performance of U-Net. In 2021, inspired by the recent
* Corresponding author.
E-mail address: saniee@[Link] (M. Saniee Abadeh).
Received 1 June 2021; Received in revised form 25 December 2021; Accepted 26 March 2022
Available online 4 April 2022
0957-4174/© 2022 Elsevier Ltd. All rights reserved.
F. Behrad and M. Saniee Abadeh Expert Systems With Applications 200 (2022) 117006
success of transformers in Natural Language Processing (NLP), which leverage self-attention mechanisms and encode long-range dependencies, different transformer-based networks have been designed, such as UNETR (Hatamizadeh et al., 2021), TransUNet (J. Chen et al., 2021), Swin-U-Net (H. Cao et al., 2021), and MedT (Valanarasu et al., 2021). Fig. 1 shows the key developments in deep learning architectures along with state-of-the-art neural networks in medical fields.
In general, medical data divide into three categories: imaging, clinical, and omics data. As every single modality carries different important information, the combination of different modalities provides a more comprehensive view of disease. Therefore, multimodal medical data analysis reduces information uncertainty and improves models’ performance. The most widely used imaging modalities are magnetic resonance imaging (MRI), computerized tomography (CT), positron emission tomography (PET), and single-photon emission computed tomography (SPECT). Multiparametric MRI also provides complementary information by combining different MRI sequences, such as T1-weighted (T1), contrast-enhanced T1-weighted (T1c), T2-weighted (T2), and fluid attenuation inversion recovery (Flair) images. Moreover, clinical data, such as antecedents, age, sex, and medical treatments, help physicians better understand patients’ characteristics and disease evolution. Furthermore, genomic data are beneficial in medically-relevant prediction and diagnosis of disease progression (Bell, 2004; Schrodi et al., 2014).
When we use different modalities, we should decide how to integrate their information. Generally, there are three fusion strategies: input-level, layer-level, and decision-level fusion. In input-level fusion, we integrate the information of different modalities before giving them to a single network. In layer-level fusion, one or more modalities are given to the network independently, then their intermediate representations are fused in a layer of the network. In input-level and layer-level fusion, all modalities must be available for each sample in the training set. This is a serious downside because this condition is rarely satisfied in the medical field. However, these two methods find the relationship between different modalities better than decision-level fusion. In decision-level fusion, each modality is used as a single input to train a single neural network. The outputs of individual networks will then be integrated to get the final result. Decision-level fusion is also popular because it does not need all modalities for each sample. Also, individual networks better exploit the unique information of their corresponding modality because the search space in decision-level fusion is much smaller than in other fusion methods.
In recent years, the number of articles on multimodal medical data analysis using deep learning has increased steadily. We performed a thorough literature analysis with keywords ‘deep learning,’ ‘medical,’ and ‘multimodal’ on the PubMed database on August 14, 2021. We observed that the number of papers had increased since 2010, which means multimodal medical data analysis using deep learning is attracting more and more attention (see Fig. 2). We also conducted a similar analysis on the Google Scholar search engine, which showed the same trend.
There are some other reviews on multimodal medical data analysis using deep learning. Some of them gave a detailed review of deep learning applications in medical image analysis. For instance, Zhou et al. (2019) proposed a general pipeline for multimodal medical image segmentation based on deep learning. This pipeline consists of data preparation, network architecture, fusion strategy, and data post-processing. They also compared the results of different deep learning architectures in multimodal medical image segmentation. Xu (2019) introduced a series of studies on deep learning applications in multimodal medical image analysis, emphasizing fusion techniques and feature extraction deep models. Litjens et al. (2017) reviewed major deep learning concepts pertinent to medical image analysis. They summarized deep learning techniques used in medical image analysis and identified the challenges for successful deep learning applications in medical imaging tasks. Zhang et al. (2020) presented an overview of multimodal data fusion in neuroimaging. They also outlined the strengths and limitations of imaging modalities, fundamental fusion rules, fusion quality assessment methods, and current challenges in multimodal fusion. Moreover, they summarised current developments and applications of multimodal neuroimaging in terms of neurological disorders and brain diseases. Ramachandram and Taylor (2017) introduced various applications in which multimodal deep learning had attracted great attention. These applications included human activity recognition, medical applications,
Fig. 1. Key developments in deep learning architectures and state-of-the-art neural networks for medical data analysis.
Fig. 2. The general trend in multimodal medical data analysis using deep learning obtained from PubMed (left) and Google Scholar (right).
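The three fusion strategies outlined in the Introduction (input-level, layer-level, and decision-level) differ mainly in where the modalities are combined. The following is a minimal NumPy sketch of that difference, not an implementation from any reviewed paper. The feature sizes, the helper names `hidden`, `predict`, and `majority_vote`, and the random weights standing in for trained layers are all assumptions made for illustration; decision-level fusion is shown with majority voting, one of several possible combination rules.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 6

# Hypothetical per-modality feature matrices (rows are patients).
modalities = {
    "imaging": rng.normal(size=(n_samples, 8)),
    "clinical": rng.normal(size=(n_samples, 3)),
    "omics": rng.normal(size=(n_samples, 5)),
}


def hidden(x, width=4):
    """Stand-in for a trained hidden layer: random weights plus ReLU."""
    w = rng.normal(size=(x.shape[1], width))
    return np.maximum(x @ w, 0.0)


def predict(h, n_classes=2):
    """Stand-in for a trained output layer: random weights plus arg-max."""
    w = rng.normal(size=(h.shape[1], n_classes))
    return np.argmax(h @ w, axis=1)


def majority_vote(votes, n_classes=2):
    """Combine (n_models, n_samples) labels; ties go to the lowest label."""
    votes = np.asarray(votes)
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)


# Input-level fusion: concatenate raw features, then use one network.
x_fused = np.concatenate(list(modalities.values()), axis=1)      # (6, 16)
pred_input = predict(hidden(x_fused))

# Layer-level fusion: one branch per modality, fuse hidden representations.
h_fused = np.concatenate([hidden(x) for x in modalities.values()], axis=1)  # (6, 12)
pred_layer = predict(h_fused)

# Decision-level fusion: one network per modality, then vote on the outputs.
votes = [predict(hidden(x)) for x in modalities.values()]
pred_decision = majority_vote(votes)
```

Only the decision-level path can still produce a prediction when one modality is missing for a sample, since each per-modality network runs on its own; the two concatenation-based paths need every feature block to be present.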
autonomous systems, and multimedia applications. They also reviewed several options, including stochastic regularization, casting architecture optimization, and incremental online reinforcement learning, to learn an optimal architecture.
On the other hand, some review papers focused on omics data and their combinations with other modalities. For example, Antonelli et al. (2019) focused on integrating omics and imaging data. Also, they summarized features extracted by omics imaging data and methods adopted for their analysis and integration. Lo Gullo et al. (2020) explained data related to radiomics and radiogenomics and the differences between them. They reviewed the radiomics and radiogenomics literature in oncology, focusing on breast, brain, gynecological, liver, kidney, prostate, and lung malignancies. Unlike previous review papers, we have not limited our study to one or two modalities. As a result, we cover imaging, clinical, omics, and other types of data simultaneously. To the best of our knowledge, we are the first to review deep learning applications in multimodal medical data analysis without constraints on the data type.
The rest of the paper is structured as follows. In Section 2, we introduce four important decisions on multimodal medical data analysis using deep learning. Section 3 describes the most commonly used modalities in multimodal medical data analysis. When dealing with multimodal data, it is essential to know how to integrate information from different modalities. Consequently, various fusion techniques are explained in Section 4. In Section 5, we present the most popular deep learning architectures, which include CNN, GAN, inception network, U-Net, VGG network, ResNet, RNN, and attention neural network. Section 6 describes different learning strategies, and Section 7 reviews papers on deep learning applications in multimodal medical data analysis. We focus on articles published in the last four years. Section 8 introduces some of the most well-known multimodal datasets in the medical field. Finally, we summarize common problems and open challenges in Sections 9 and 10.
2. Multimodal deep learning process
In the first step of multimodal medical data analysis, researchers should decide on data sources, fusion strategy, learning strategy, and deep learning architecture (as shown in Fig. 3). Choosing the right combination of data sources in multimodal analyses is critical because a wrong combination leads to lower performance. Data sources should provide complementary information to improve results. The next step is to decide how to integrate different modalities. Furthermore, a suitable learning strategy should be chosen. Finally, researchers should choose a network architecture. Knowing different deep learning architectures helps find the most suitable architecture for research. In the following sections, these concepts are explained.
3. Data
3.1 Imaging data
Image data coming from different imaging technologies are fundamental for medical analysis. Biomedical images reveal a lot of information about human organs’ structure and functions (hemodynamic, metabolic, and chemical processes) (Antonelli et al., 2019). Many researchers have used a combination of imaging modalities in recent years. There are two reasons why the integration of imaging modalities is beneficial. First, all individual modalities have their limitations. Second, a disease, disorder, or lesion may manifest itself in different forms, symptoms, or etiology. On the other hand, different diseases may share some common symptoms or appearances. Therefore, an individual image modality may not reveal a complete picture of a disease (Zhang et al., 2020).
As medical images can be 2D or 3D, we should decide on images’ dimensions before training a network. The 2D approach takes image slices extracted from the 3D image and feeds them to the network. This approach reduces the computational cost, but it ignores the spatial information of images in the z-direction (T. Zhou et al., 2019). We can also use 3D images and feed them directly to the network. This approach can be expensive, but it does not lose any information. In this section, we introduce some of the most popular imaging modalities and mention their advantages and disadvantages.
3.1.1. CT
CT is undoubtedly one of the most important technologies in medical imaging and offers views inside the human body that are highly valuable to physicians (Maier et al., 2018). CT is an imaging procedure in which an X-ray tube rotates around a patient, shooting narrow beams of X-rays at the patient. This procedure produces signals measured by a computer to generate cross-sectional images of the organ under investigation (Antonelli et al., 2019). The primary strength of this modality is that it provides clear anatomical structure information due to its excellent spatial resolution (Lin & Alessio, 2009). In other words, CT images convey details and discriminate between structures located within small
Fig. 3. Four decisions on multimodal medical data analysis using deep learning algorithms.
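The choice between the 2D and 3D approaches described in Section 3.1 comes down to how a volume is presented to the network. The sketch below illustrates this with array shapes only. The volume size, the (z, y, x) axis order, and the helper names `as_2d_batch` and `as_3d_batch` are assumptions for illustration, and the (batch, channel, ...) layout is one common convention rather than a requirement.

```python
import numpy as np


def as_2d_batch(volume):
    """Treat every axial slice as one single-channel 2D sample."""
    return volume[:, np.newaxis, :, :]           # (D, 1, H, W)


def as_3d_batch(volume):
    """Treat the whole volume as one single-channel 3D sample."""
    return volume[np.newaxis, np.newaxis, ...]   # (1, 1, D, H, W)


# Hypothetical volume with 16 slices of 64 x 64 voxels, ordered (z, y, x).
volume = np.zeros((16, 64, 64), dtype=np.float32)

batch_2d = as_2d_batch(volume)
batch_3d = as_3d_batch(volume)

# Each 2D sample holds one slice; the 3D sample holds all slices at once.
voxels_per_2d_sample = batch_2d[0].size   # 64 * 64
voxels_per_3d_sample = batch_3d[0].size   # 16 * 64 * 64
```

In the 2D layout the slices become independent samples, so a network sees no neighbouring slices along z; in the 3D layout that context is kept, at the price of a much larger input per sample.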
proximity to each other very well. Also, CT imaging is non-invasive, quick, and painless. On the other hand, CT images cannot identify soft tissues well (Zhang et al., 2020). Furthermore, because of exposure to ionizing radiation, the possibility of developing cancer increases later in life (Kasban et al., 2015). An example of a CT scan is shown in Fig. 4.
3.1.2. PET
PET provides information about the metabolism of a disease. In PET images, the diseased areas, such as tumor and inflammation, appear as ‘hot’ areas, reflecting high contrast to the normal surrounding tissues. This high contrast in PET images distinguishes malignant areas from normal tissues easily (Ju et al., 2015). There are some advantages to this technique. First, this technique can check how far cancer has spread and how well the treatment is working. Second, it can provide highly accurate functional information. However, this method has some downsides. The main one is that using ionizing radiation makes patients radioactive for a variable period. Also, it has relatively low spatial resolution and high cost (Kasban et al., 2015). This low spatial resolution blurs target boundaries, so detecting tumors just by PET images is challenging. In the last few years, PET images have been fused with CT or MRI to capture information from different modalities and put this information in one image. Fig. 4 shows an example of PET/CT and PET/MRI (Boss et al., 2010).
3.1.3. SPECT
The goal of SPECT is to determine the three-dimensional radioactivity distribution that results from the uptake of a radiopharmaceutical (a radioactive-labeled pharmaceutical) inside a patient. SPECT and PET are very similar. For example, both of them use a radioactive tracer and detect γ-rays to reflect functional information about a disease. However, unlike PET, the radioisotopes used for SPECT emit only a single γ-ray during decay. Also, SPECT scans produce lower resolution images than PET. SPECT scans are significantly less expensive than PET scans because the nuclides used in SPECT have a longer half-life and are more easily obtained than those used in PET (Zhang et al., 2020). Fusion of SPECT and other modalities provides more information than SPECT images alone. For example, hybrid SPECT/CT adds clinical value over SPECT imaging due to more precise anatomical lesion localization (“Mathematics and Physics of Emerging Biomedical Imaging” 1996). Fig. 5 shows an example of SPECT/CT (Jacene et al., 2008).
3.1.4. MRI
Modern MRI systems allow physicians to look inside the body without ionizing radiation. They provide excellent soft-tissue contrast and high spatial resolution for morphological imaging as well as a range of possibilities for functional imaging (Maier et al., 2018). MRI is also non-invasive and painless. However, relatively low sensitivity, long scan and post-processing times, and high cost are its downsides. Moreover, this technique cannot detect intraluminal abnormalities (Kasban et al., 2015). Multiparametric MRI provides complementary information due to its dependence on variable acquisition parameters, which improves models’ performance. MRI modalities include T1, T1c, T2, and Flair images, as shown in Fig. 6. For further details about different imaging techniques, see Table 2 in Kasban et al. (2015).
3.2 Omics data
Omics data characterize the behaviors of cells, tissues, and organs at a molecular level and provide a comprehensive understanding of the etiology of human diseases (Raja et al., 2017). Omics data are high-dimensional, but only a small subset of them has important implications (Kristensen et al., 2014). As a result, feature selection and feature extraction strategies are helpful when we use omics data. Genomics, transcriptomics, proteomics, and epigenomics are the four main categories of omics data. Although there are some other categories, such as metabolomics and lipidomics, we only describe the four main groups. Genomics is the study of the genomes of organisms. The genome is
Fig. 4. PET/MRI and PET/CT images of a 30-year-old patient with low-grade glioma. (Top) low-dose non–contrast-enhanced CT image (left), PET/CT (center), and
PET images (right). (Bottom) T2-weighted Flair image (left), PET/MR (center), and PET image (right).
Fig. 6. Different MRI modalities in the BraTS dataset. From left to right: Flair, T1, T1ce, and T2 image.
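One simple way to give a single network all four sequences named in the Fig. 6 caption is to stack them as input channels. The sketch below shows this under two assumptions that are not stated in the text: the sequences are already co-registered, and they share the same voxel grid. The dictionary keys and the helper name `stack_sequences` are hypothetical.

```python
import numpy as np

# Channel order follows the Fig. 6 caption; the key names are hypothetical.
SEQUENCES = ("flair", "t1", "t1ce", "t2")


def stack_sequences(volumes, order=SEQUENCES):
    """Stack co-registered MRI sequences into one (channels, D, H, W) array."""
    shapes = {volumes[name].shape for name in order}
    if len(shapes) != 1:
        raise ValueError(f"sequences must share one shape, got {sorted(shapes)}")
    return np.stack([volumes[name] for name in order], axis=0)


# Hypothetical study: four volumes of 8 slices with 32 x 32 voxels each.
rng = np.random.default_rng(0)
study = {name: rng.normal(size=(8, 32, 32)) for name in SEQUENCES}
multi_channel = stack_sequences(study)   # shape (4, 8, 32, 32)
```

The shape check matters because stacking silently assumes voxel-wise correspondence between sequences; volumes on different grids would need resampling first, which this sketch does not attempt.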
the complete sequence of DNA in a cell or organism. Complete or partial DNA sequences can be assayed using various experimental platforms, such as single-nucleotide polymorphisms (SNP). Genomic analysis can detect insertions, deletions, and copy number variation (CNV), which refers to the loss or amplification of the expected two copies of each gene (one from the mother and one from the father at each gene locus). Gene sequences and regulatory motifs are other types of genomic data. The transcriptome is the complete set of RNA transcripts from DNA in a cell or tissue. The transcriptome includes ribosomal RNA (rRNA), messenger RNA (mRNA), transfer RNA (tRNA), micro RNA (miRNA), and other non-coding RNA (ncRNA; Götz, 2019). Proteomics is concerned with the structure, function, and modification of proteins expressed in a biological system. Proteomics data include protein expression and protein structure. Epigenomics characterizes the epigenetic modifications of the genome and aims to understand the regulation of gene expression (Raja et al., 2017).
There is a correlation between different types of omics data. For example, Fig. 7 illustrates a two-step process by which a sequence of nucleotides from DNA is converted into a sequence of amino acids to build the desired protein (Betts et al., 2013). These two steps are called transcription and translation.
3.3 Clinical data
Different types of clinical data can be used in medical analyses. The results of physiological measurements (lab results, vital signs), demographic information (gender, location, age, and marital status), payment and insurance information (which is indirectly related to the disease), and clinical notes are different groups of clinical data. Clinical notes include descriptions of lab test results, physician diagnoses, drugs, and treatments. Clinical notes may also contain other information such as the chief complaint, family history, medical history, and allergies (Yu et al., 2019). Usually, NLP techniques are used to process clinical notes. Some researchers have integrated clinical data with other data types to improve performance.
3.4 Others
Depending on the type of disease, other data types are also available. For example, molecular aberrations in Alzheimer’s disease (AD) are reflected in the cerebrospinal fluid (CSF), so many studies on AD have used the combination of CSF with other modalities (Kim & Lee, 2018; Lee, Kang, Nho, Sohn, & Kim, 2019; Lee, Nho, Kang, Sohn, & Kim, 2019;
Lin et al., 2020). Another modality is electroencephalography (EEG), a monitoring method that records the brain’s electrical activity and helps physicians be more accurate in diagnosing some diseases such as epilepsy (Hosseini et al., 2020), sleep disorders, depth of anesthesia, coma, encephalopathies, AD, and brain death (Shikalgar & Sonavane, 2020). Another modality is the fluorescein angiography (FA) image, which has been used in several multimodal eye analyses using deep learning (Hervella et al., 2020, 2019; Li et al., 2020; Vaghefi et al., 2020). FA is a medical procedure in which a fluorescent dye is injected into the bloodstream. The dye highlights the blood vessels in the back of the eye, which helps photograph vessels.
Although multimodality was not very popular in machine learning before, it has recently gained a lot of interest. The main reason why few studies used multimodal data in the past is that it used to be very difficult to access a multimodal dataset. Also, most of the papers and challenges only focused on one modality. Consequently, models were optimized for a single data type and a specific task. Several multimodal data repositories, which contain hundreds of matched patient samples for imaging, genomic, and clinical data, have been created in recent years. These repositories let researchers benefit from multimodality and adopt a more comprehensive multimodal data approach to tackle more challenging and global tasks. We introduce some of these data repositories in Section 8.
Multimodal data help researchers have the most comprehensive understanding of patients, their background, and their disease evolution because each modality provides different important information. For example, clinical data help understand patients’ characteristics and biological differences in their disease evolution. Genomic data provide prognostic signals, which are not accessible through other modalities, so they are beneficial in medically-relevant prediction and diagnosis of disease progression (Micheel et al., 2012). Similarly, each imaging modality provides different types of biological information about a disease. For instance, CT images help diagnose muscle and bone disorders, such as bone tumors and fractures, while MRI offers a good soft-tissue contrast without radiation. Functional images, such as PET and SPECT, lack anatomical characterization but provide quantitative metabolic and functional information about diseases (Bhatnagar et al., 2015). Histology slides help clinicians understand the structure of the problem at a cellular scale. Additionally, analyzing the lymphocyte infiltration on a histology slide is an excellent indicator of organism resistance. In the case of brain tumor segmentation, various imaging modalities can be used to map brain tumor-induced tissue changes. For instance, T2 and Flair MRI highlight differences in tissue water relaxational properties and detect the tumor with peritumoral edema. However, T1 and T1c detect the tumor core without peritumoral edema. MRSI shows relative concentrations of selected metabolites, while post gadolinium T1 MRI shows pathological intratumoral take-up of contrast agents. Perfusion and diffusion MRI show local water diffusion and blood flow (Menze et al., 2015; Zhou et al., 2019).
Choosing the best combination of modalities is crucial because a wrong combination leads to a bad performance. Knowledge of the limitations and strengths of available modalities and an understanding of disease biology help determine which modalities are optimal for a task. Also, reading related articles helps choose the right combination of modalities. For example, for disease assessment in Hodgkin lymphoma, Kanoun et al. (2018) noted that FDG-PET plus CT increased the sensitivity from 10% to 20% compared to conventional CT. Many researchers have integrated genomic data with other modalities to make more precise predictions about a disease (Bell, 2004; Schrodi et al., 2014). For example, Chen et al. (2019a) demonstrated that combining clinical data and gene expression data improves the accuracy of prognostic prediction for breast cancer patients. Lai et al. (2019) integrated gene expression and clinical data using a deep model to predict the 5-year survival of non-small cell lung cancer patients. In short, reading related papers is a valuable guide for researchers who are new to multimodal medical data analysis. Table 1 summarizes a list of papers that employed the most popular combinations of modalities.
Table 1
Different combinations of modalities used in the reviewed articles.
Combination: Articles
Multiparametric MRI: Abrol et al. (2019); Ge et al. (2020); Isensee et al. (2019); Jiang et al. (2020); Jiang et al. (2020); Liang et al. (2018); McKinley et al. (2019, 2020); Milecki et al. (2021); Myronenko (2019); Nie et al. (2019); Saba et al. (2020); Soltaninejad et al. (2019); Taleb et al. (2021); Tang et al. (2020); Varghese et al. (2016); Wang et al. (2018); Wang et al. (2020); Zhao et al. (2020)
PET / CT: Guo, et al. (2019); Kirienko et al. (2018); Li et al. (2019); Peng et al. (2019); Qin et al. (2020); Rubinstein et al. (2019); Shi et al. (2018); Zhao et al. (2018); Zhao et al. (2020); Zhou et al. (2018)
MRI / PET: Feng et al. (2019a); Liu et al. (2018); Lu et al. (2018); Shi et al. (2018); Suk et al. (2014); Vu et al. (2018); Zhang and Shi (2020a)
CT / X-ray: El Asnaoui and Chawki (2020); Kassani et al. (2020); Maghdid et al. (2020); Mukherjee et al. (2020); Rehman et al. (2020); Zhang et al. (2021)
Omics / Clinical: Chen et al. (2019a); Hooshmand et al. (2020); Lai et al. (2019); Sun et al. (2019)
CT / Clinical: Bai et al. (2020); Lassau et al. (2020); Xu et al. (2020)
MRI / CT: Cao et al. (2020); Ma et al. (2018); Xu et al. (2021)
MRI / Clinical: Huang and Chung (2020); Liu et al. (2019)
4. Fusion structures
When we work with multimodal data, we need to decide how to integrate them. There are three ways to fuse different modalities: input-level fusion, layer-level fusion, and decision-level fusion. In this section, we describe these methods.
4.1 Input-level fusion
This method is also known as early integration, feature-based integration, and data-based integration. In this method, different modalities are fused before conducting an analysis, as illustrated in Fig. 8(a). One benefit of input-level fusion is that it finds the relationship between different modalities. However, to find this relationship, all modalities must be available for each sample in the training set, which is hardly satisfied in practice. Another disadvantage of this method is that it leads to a very large feature vector, which causes a high computational cost.
4.2 Layer-level fusion
This method is also known as intermediate integration and transformation-based integration. In this method, one or more modalities are given to a network independently, then their intermediate representations are fused in a layer of the network. Like input-level fusion, this method finds the relationship between different modalities and requires all modalities for each sample in the training set. Fig. 8(b) illustrates this method.
4.3 Decision-level fusion
Late integration and model-based integration are the other names of this method. In this method, each modality is used as a single input to train a neural network, then the outputs of models are fused to make the final decision (see Fig. 8(c)). This method is prevalent because it does not require all modalities for each sample. Unlike other fusion techniques, this technique cannot find the relationship between different modalities. However, models which use decision-level fusion can
Table 2
The most common deep learning architectures in multimodal medical data analysis.
Architecture: Articles
CNN: Cheerla and Gevaert (2019); El-Sappagh et al. (2020); Feng et al. (2019b); Ge et al. (2020); Li, et al. (2019); Hosseini et al. (2020); Huang and Chung (2020); Kirienko et al. (2018); Li et al. (2019); Liu et al. (2018); Liu et al. (2019); Ma et al. (2018); Milecki et al. (2021); Nie et al. (2019); Peng et al. (2019); Qiu et al. (2020); Shi et al. (2018); Shikalgar and Sonavane (2020); Tang et al. (2020); Wang et al. (2018); Zhang et al. (2019); Zhang et al. (2021); Zhou et al. (2018)
Inception: El Asnaoui and Chawki (2020); Kassani et al. (2020); Vaghefi et al. (2020)
U-Net: Hervella et al. (2020, 2019); Isensee et al. (2019); Jiang et al. (2020); Wang et al. (2020); Zhao et al. (2020); Zhao et al. (2020)
VGGNet: El Asnaoui and Chawki (2020); Jiang et al. (2020); Kassani et al. (2020); Liu et al. (2018); Saba et al. (2020); Yan et al. (2019)
ResNet: Abrol et al. (2019); El Asnaoui and Chawki (2020); Lassau et al. (2020); Rehman et al. (2020); Vaghefi et al. (2020); van Sonsbeek and Worring (2020); Yap et al. (2018); Zhang and Shi (2020b)
DenseNet: El Asnaoui and Chawki (2020); Guo, et al. (2019); Kassani et al. (2020); Liang et al. (2018); Qin et al. (2020); Wang et al. (2020)
RNN: Lee, Kang, et al. (2019); Lee, Nho, et al. (2019); Shukla and Marlin (2020)
LSTM: Bagheri et al. (2020); Bai et al. (2020); El-Sappagh et al. (2020); Feng et al. (2019b); Hosseini et al. (2020); van Sonsbeek and Worring (2020); Zhang et al. (2020)
FCN: Ali et al. (2020); Bagheri et al. (2020); Cheerla and Gevaert (2019); Chen et al. (2019b); Hung et al. (2019); Lai et al. (2019); Soltaninejad et al. (2019); Sun et al. (2019)
Auto-encoder: Khamparia et al. (2019); Myronenko (2019); Rubinstein et al. (2019); Varghese et al. (2016)
GAN: B. Cao et al. (2020); Ge et al. (2020); Li et al. (2020)
RBM: Hooshmand et al. (2020); Suk et al. (2014)
Attention-based: Chen et al. (2019a); Zhang and Shi (2020b); Zhang et al. (2021)
achieve high performance because the search space is smaller than other fusion techniques. In this technique, information is independently learned from different modalities, so the likelihood of overfitting is lower than other methods. There are a variety of techniques for decision-level fusion (Rokach, 2010). The most popular technique is based on majority voting, in which after training each model separately, the final prediction is chosen based on the majority of the predictions of the individual networks (Shikalgar & Sonavane, 2020).

Fig. 8. The illustration of various fusion strategies for multimodal learning. (a) Input-level fusion, (b) Layer-level fusion, and (c) Decision-level fusion.

5. Deep learning

5.1 Introduction to deep neural network

Neural networks are built of many neurons with specific activation functions. Each set of neurons belongs to a hidden layer, many of which create a deep neural network. Deep learning methods are a kind of representation learning algorithm that allows a machine to automatically discover the needed representation from the raw input data (LeCun et al., 2015).

5.2 Important architectures in deep learning

This section introduces some of the most popular neural network architectures, which greatly impact medical analyses. Fully connected networks (FCNs), CNNs, and RNNs are mainly used for supervised tasks. On the other hand, unsupervised neural networks, including GANs, restricted Boltzmann machines (RBMs), and auto-encoders can be employed in the absence of labeled data. Moreover, a combination of these techniques can be used in semi-supervised tasks. The detailed list of studies using each deep learning architecture in multimodal medical data analysis is shown in Table 2.

5.2.1. CNN

CNN architecture comprises three basic layers: convolutional layer, pooling layer, and fully connected layer. The convolutional layer uses multiple filters for extracting high-level features. The pooling layer decreases the spatial size of feature maps obtained from the convolutional layer, which leads to a decrease in the computational power required to process image data. It also provides a degree of translation invariance, which is the ability to ignore translations of the target in the input. Usually, after convolutional and pooling layers, there are some fully connected layers. Various modifications, such as structural reformulation, regularization, and parameter optimizations, have been made in CNN architecture from 1989 until today (Kanoun et al., 2018). Fig. 9 illustrates an example of a basic CNN architecture. Although CNNs are mainly used for image analysis, some studies have used them for sequential data analysis (Zhang & Wallace, 2015). We explain the most popular CNN architectures in this section.

Inception network

Inception network, also known as GoogLeNet, achieved state-of-the-art results for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14) (Szegedy et al., 2015). Finding the optimal value for filter size, which is one of the hyperparameters of neural networks, is challenging. The main idea of the inception network is to apply various filter sizes and max-pooling simultaneously. An inception network consists of inception modules. In an inception module, 1 × 1, 3 × 3, and 5 × 5 filters are applied to the output of a layer. Then, their outputs are concatenated to make a large vector, which is the input of the next layer. However, this concatenation would inevitably increase
the number of outputs from stage to stage. To solve this problem, a 1 × 1 filter is placed before the 3 × 3 and 5 × 5 filters to reduce the number of channels (Fig. 10).

U-Net

The U-net architecture, the winner of ISBI Cell Tracking Challenge 2015, has achieved good performance on different biomedical segmentation applications (Ronneberger et al., 2015). This network can be trained with only a few annotated images. As shown in Fig. 11, the architecture consists of a contracting path to capture context and an expansive path that enables precise localization. The contracting path comprises the repeated application of two 3 × 3 convolutions, each followed by a rectified linear unit (ReLU), and a 2 × 2 max pooling operation with stride 2 for downsampling. Every step in the expansive path consists of an upsampling of the feature map followed by a 2 × 2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3 × 3 convolutions, each followed by a ReLU. Finally, a 1 × 1 convolution maps each feature vector to the desired number of classes.

ResNet

ResNet, which won first place in the classification task of ILSVRC 2015, was proposed to ease the training of very deep neural networks (He et al., 2016). Increasing the network depth may cause the vanishing gradient problem. ResNet attempts to solve this problem with the idea of a skip connection, or shortcut, which adds the input of one layer to the output of the linear function of a subsequent layer. This structure is called a residual block, and ResNet is built by stacking these blocks. Fig. 12 shows the residual block.

VGG network

The input of this architecture is a fixed-size 224 × 224 × 3 image. The image is passed through a stack of convolutional layers, where all filters are 3 × 3 with stride one and same padding. Some of the convolutional layers are followed by max-pooling, which is performed over a 2 × 2 pixel window with stride 2. Convolutional layers are followed by three fully connected layers: the first two have 4096 channels, and the third one, which is a soft-max layer, contains 1000 channels (Simonyan & Zisserman, 2014). What is explained here is VGG16 (Fig. 13). Another version of the VGG network is VGG19, which is deeper than VGG16 but performs only about as well. The main downside of this network is its large number of parameters (i.e., around 138 million). However, the uniformity of this architecture has attracted many researchers.
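The figure of roughly 138 million parameters quoted for VGG16 can be checked with simple arithmetic. The following is a minimal sketch, not the authors' code. It assumes the standard layer widths of "configuration D" in Simonyan and Zisserman (2014), which are not listed in the text above, and it counts one bias term per filter or unit. The names `CONV_BLOCKS`, `FC_SIZES`, and `count_vgg16_params` are illustrative.

```python
# Rough parameter count for VGG16.
# Assumption: layer widths follow "configuration D" of Simonyan & Zisserman
# (2014); they are not given in the surrounding text.

# Output channels of the 13 convolutional layers, grouped into five blocks.
# A 2 x 2 max-pooling with stride 2 follows each block.
CONV_BLOCKS = [
    [64, 64],
    [128, 128],
    [256, 256, 256],
    [512, 512, 512],
    [512, 512, 512],
]
FC_SIZES = [4096, 4096, 1000]  # two hidden FC layers, then the soft-max layer


def count_vgg16_params(input_size=224, input_channels=3, kernel=3):
    """Return (conv_params, fc_params), both including bias terms."""
    conv_params = 0
    channels = input_channels
    size = input_size
    for block in CONV_BLOCKS:
        for out_channels in block:
            # weights: kernel * kernel * in * out, plus one bias per filter
            conv_params += kernel * kernel * channels * out_channels + out_channels
            channels = out_channels
        size //= 2  # same padding keeps the size; pooling halves it

    fc_params = 0
    features = size * size * channels  # flattened 7 x 7 x 512 volume
    for units in FC_SIZES:
        fc_params += features * units + units
        features = units
    return conv_params, fc_params


conv_params, fc_params = count_vgg16_params()
total_params = conv_params + fc_params
print(f"conv: {conv_params:,}  fc: {fc_params:,}  total: {total_params:,}")
```

Under these assumptions the total comes to 138,357,544, and about 124 million of those parameters sit in the three fully connected layers, which is why the first fully connected layer dominates the model size.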
Fig. 11. U-net which consists of a contracting path (left side) and an expansive path (right side).
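The contracting and expansive paths of U-Net can be followed numerically by tracing the feature-map size through each step. This is a minimal sketch under stated assumptions rather than a reference implementation: it assumes unpadded 3 × 3 convolutions (which is why skip maps must be cropped), four downsampling steps, 64 channels after the first step, and a 572 × 572 input tile, as in Ronneberger et al. (2015). None of these values is given in the text above, and the function names are illustrative.

```python
# Trace feature-map sizes through a U-Net-style encoder/decoder.
# Assumptions: unpadded 3 x 3 convolutions, four downsampling steps,
# 64 base channels, and a 572 x 572 input tile (Ronneberger et al., 2015).

def double_conv(size):
    """Two unpadded 3 x 3 convolutions; each one trims 2 pixels per axis."""
    return size - 2 - 2


def unet_shapes(input_size=572, base_channels=64, depth=4):
    """Return (skips, bottleneck, decoder, crops); shapes are (size, channels)."""
    size, channels = input_size, base_channels
    skips = []
    # Contracting path: double conv, keep the map for the skip connection,
    # then 2 x 2 max pooling with stride 2 and twice as many channels.
    for _ in range(depth):
        size = double_conv(size)
        skips.append((size, channels))
        if size % 2:
            raise ValueError("feature-map size must be even before pooling")
        size //= 2
        channels *= 2
    size = double_conv(size)
    bottleneck = (size, channels)

    decoder, crops = [], []
    # Expansive path: upsample by 2, up-convolution halves the channels,
    # crop and concatenate the matching skip map, then double conv.
    for skip_size, _skip_channels in reversed(skips):
        size *= 2
        channels //= 2
        crops.append(skip_size - size)  # total pixels cropped per axis
        size = double_conv(size)
        decoder.append((size, channels))
    return skips, bottleneck, decoder, crops


skips, bottleneck, decoder, crops = unet_shapes()
print("skip maps: ", skips)
print("bottleneck:", bottleneck)
print("decoder:   ", decoder)
print("crops:     ", crops)
```

With these settings the 572 × 572 tile shrinks to a 388 × 388 map with 64 channels before the final 1 × 1 convolution. Implementations that use same padding instead would keep the input size and need no cropping.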
RNNs are a class of neural networks that successfully model sequence data (Mandic & Chambers, 2001). Sequential data are ordered data in which related things follow each other, such as a DNA sequence. The basic premise of a traditional RNN is to parse every item in an input series, one after the other, and keep updating its “hidden state” vector every step of the way. At the end of every step, this hidden vector represents the context of all prior inputs. Therefore, when the network makes a decision, it considers the current input and what it has learned from previous inputs. This is an important advantage of RNN because a sequence of data contains important information about what is coming next. Another advantage of RNN is its ability to process variable-length sequence input. Also, input size does not affect the RNN size. An example of a traditional RNN is shown in Fig. 14.

On the other hand, the basic RNN is not very good at capturing long dependencies. Another disadvantage of the basic RNN is the vanishing gradient problem. Gated recurrent unit (GRU) is an extension of RNN, which makes the basic RNN better in facing these two problems (Chung et al., 2014). Long short-term memory (LSTM) is another extension of RNN, which is generally considered more powerful than GRU (Hochreiter & Schmidhuber, 1997). In LSTM, memory is extended, so it is well suited to learn from meaningful experiences that have very long time lags in between.

Attention models have recently become very popular within the artificial intelligence community as an essential component of neural architecture (Alzubaidi et al., 2021). The intuition behind attention models can be explained using human biological systems. Our visual processing system tends to focus selectively on some parts of an image while ignoring other irrelevant information in a manner that can assist in perception (Chaudhari et al., 2019). Similarly, in several problems such as language and speech, some parts of the input are more important than others. The attention mechanism allows a model to dynamically pay attention to only certain parts of the input that help perform the task effectively. Important parts of the data are chosen based on the context and learned through the training procedure by gradient descent. One of the reasons why attention models have become so popular is that they improve the interpretability of neural networks, which are otherwise considered black-box models. This is a great benefit mainly because of the growing interest in the fairness, accountability, and transparency of machine learning models, especially in applications that influence human lives. Furthermore, the attention mechanism helps overcome some challenges with RNNs, such as performance degradation with an increase in the input size and the computational inefficiencies resulting from sequential processing of the input (Xu et al., 2015).

The first attention model, proposed by Bahdanau et al. (2014), is shown in Fig. 15. This architecture consists of two RNNs, one of which is
called an encoder and reads a variable-length sequence input. The other RNN is called a decoder and produces a sequence output. A context vector is utilized to preserve information from all hidden states of the encoder and align them with the current target output. Attention weights α are assigned to the input sequence to prioritize the set of positions where relevant information is present. The context vector c is the sum of all hidden states of the encoder, each weighted by its corresponding attention weight. By doing so, the model can attend to a certain part of the source input and better learn the complex relationship between the source and target.

GANs are unsupervised learning algorithms that use a supervised cost function as part of their training process. GANs comprise generator and discriminator networks, as shown in Fig. 16 (Goodfellow et al., 2014). The generator network uses a noise vector as its input to generate an image, then this generated image is given to the discriminator network. The discriminator network tries to distinguish between the real and the generated image. In other words, the discriminator is a classifier that determines which image is real and which one is fake. In the training process, the generator aims to create realistic images that confuse the discriminator. However, it is important to keep these networks at the same level and make them improve together. In recent years, GANs have become popular in medical analyses and have been used for different tasks, such as data augmentation and image translation. One of the applications of GAN to image translation is CycleGAN (Zhu et al., 2017), which allows converting an image from one domain to another domain.

6. Different learning strategies

6.1 Transfer learning

The main idea of transfer learning is to use the knowledge gained from one problem and apply it to a different problem. These problems are usually related to each other. When a large dataset is available for the first problem, while there is not enough data available for the target problem, transfer learning can be helpful. Transfer learning includes two steps: pre-training and fine-tuning. A model is first trained on a large
dataset to learn all network parameters. Next, this pre-trained network is fine-tuned on a new dataset. In fine-tuning, the last few layers of the network are randomly initialized, then the entire network is retrained using the new dataset. When the new dataset is very small, only the last layer should be randomly initialized to avoid the overfitting problem. Moreover, if the second task is not related to the first task, only the initial layers of the pre-trained network, which capture generic features, should be used for the second task. The most widely used dataset for pre-training a network is the ImageNet dataset (Ng, 2018; Pratt et al., 1991).

6.2 Multitask learning

Unlike transfer learning, which is a sequential process, multitask learning aims to do multiple tasks simultaneously by a shared model. In this technique, a big enough neural network is needed to do several tasks with similar low-level features and a similar amount of data for each task in the training set (Caruana, 1997; Ng, 2018).

6.3 End-to-end learning

Some systems need multiple stages of processing. In end-to-end learning, a single neural network replaces all these stages. End-to-end learning simplifies a system and reduces the need to design components manually. However, this learning strategy has some downsides.
For example, this method needs a lot of data to work well and excludes potentially useful hand-designed components. On the other hand, if there is enough data to train a big neural network, end-to-end learning can find the most appropriate mapping function between inputs and outputs (Ng, 2018). Table 3 shows different learning strategies.

Table 3
Different learning strategies.

Learning strategy: Articles
Transfer learning: El Asnaoui and Chawki (2020); Jiang et al. (2020); Kassani et al. (2020); Maghdid et al. (2020); Rehman et al. (2020); Saba et al. (2020); van Sonsbeek and Worring (2020); Vu et al. (2018); Wang et al. (2020); Yap et al. (2018)
End-to-End learning: Bai et al. (2020); Cao et al. (2020); Cheerla and Gevaert (2019); Chen et al. (2019b); Feng et al. (2019b); Ge et al. (2020); Guo, et al. (2019); Li, et al. (2019); Hervella et al. (2019, 2020); Hooshmand et al. (2020); Huang and Chung (2020); Hung et al. (2019); Isensee et al. (2019); Jiang et al. (2020); Khamparia et al. (2019); Kirienko et al. (2018); Lai et al. (2019); Lee, Nho, et al. (2019); Li et al. (2020); Liang et al. (2018); Lin et al. (2020); Liu et al. (2018); Ma et al. (2018); McKinley et al. (2019); Milecki et al. (2021); Mukherjee et al. (2020); Myronenko (2019); Peng et al. (2019); Qin et al. (2020); Shi et al. (2018); Shi et al. (2018); Shikalgar and Sonavane (2020); Suk et al. (2014); Sun et al. (2019); Taleb et al. (2021); Vaghefi et al. (2020); Wang et al. (2018); Wang et al. (2020); Xu et al. (2021); Zhang et al. (2020); Zhang et al. (2019); Zhang and Shi (2020b); Zhang et al. (2021); Zhao et al. (2018); Y. Zhao et al. (2020)
Multitask learning: El-Sappagh et al. (2020); Liu et al. (2019); McKinley et al. (2020); Tang et al. (2020); Zhao et al. (2020)
Hybrid¹: Abrol et al. (2019); Ali et al. (2020); Bagheri et al. (2020); Hosseini et al. (2020); Kim and Lee (2018); Lassau et al. (2020); Lee, Kang, et al. (2019); Li et al. (2019); Liu and Hu (2019); Lu et al. (2018); Nie et al. (2019); Qiu et al. (2020); Rubinstein et al. (2019); Shukla and Marlin (2020); Soltaninejad et al. (2019); Vasquez-Correa et al. (2018); Xu et al. (2020); Yan et al. (2019); Zhou et al. (2018)
¹ Articles that used techniques such as machine learning or image processing for some stages of processing.

7. Applications of deep learning in multimodal medical data analysis

This section provides an overview of different deep learning applications in multimodal medical data analysis. Generally, deep learning techniques are classified into four major categories: unsupervised, semi-supervised, self-supervised, and supervised. Fig. 17 shows the hierarchy of papers considered in this article. “Alzheimer multimodal deep learning,” “cancer multimodal deep learning,” “multimodal deep learning in medical analysis,” and “COVID-19 multimodal deep learning” are queries that we used to search and filter papers in Google Scholar. The main focus of this survey is on COVID-19, cancer, and Alzheimer’s.

7.1 Supervised methods

Supervised learning indicates learning methods that use data with human-annotated labels to train networks (Jing & Tian, 2020). The objective of a supervised learning model is to predict a correct label for new input data based on prior training data.

7.1.1. Classification

Many studies have used multimodal deep learning algorithms to improve classification performance in the medical field. Using multimodal data with deep learning helps achieve superior results, provided that the right combination of modalities is used. Some combinations of modalities are prevalent in a disease analysis, such as MRI and PET in Alzheimer’s. Zhang et al. (2019) employed two independent CNNs to diagnose AD based on MRI and PET images. They used the Pearson correlation coefficient to judge the consistency of the CNNs’ predictions. Also, they proposed a formula to combine the neuroimaging diagnoses with clinical neuropsychological diagnoses (MMSE and CDR) based on the Pearson correlation coefficient. When the results of the neuroimaging diagnoses are consistent, the algorithm only takes the results of the neuroimaging diagnoses. Otherwise, the algorithm only focuses on the diagnoses of clinical neuropsychology. Zhang and Shi (2020b) proposed a deep learning model based on the attention mechanism for AD diagnosis using MRI and PET images. In this model, the fusion ratio of each modality is assigned automatically according to its importance. The final output of their model determines whether the input sample has Alzheimer’s or not.

Also, the combination of PET and CT images is prevalent for multimodal cancer analysis. For example, H. Shi et al. (2018) used CT, PET, and PET/CT images for lung tumor detection. They trained three CNNs, one on each modality, and integrated the outputs of these CNNs to make the final decision. Similarly, Qin et al. (2020) used a CNN architecture to combine the fine-grained features from PET and CT images for lung cancer detection. After the outbreak of COVID-19, many researchers have used deep learning models to analyze this disease. Since chest CT and chest X-ray provide complementary information, using their combination is very popular among researchers. For instance, Zhang et al. (2021) proposed a deep convolutional attention network for COVID-19 diagnosis based on chest CT and chest X-ray. This network has two branches, one of which receives 3D CT images, and the other one receives 2D X-ray images. After five convolutional block attention modules (Woo et al., 2018) in each branch, the extracted deep CT features and deep X-ray features are flattened. Next, the concatenation of these feature vectors is given to fully connected layers to diagnose COVID-19. This method achieved a high accuracy of 98.02 ± 1.35% on a private dataset collected from local hospitals.

In some diseases, modalities should be chosen based on the purpose of the research. For instance, Vasquez-Correa et al. (2018) used speech, handwriting, and gait signals to detect patients with Parkinson’s disease. They trained three individual CNNs, one on each modality, to create feature maps. Then, they averaged these feature maps across different tasks and transitions of a given subject. Finally, they concatenated embeddings from three bio-signals and fed them to a support vector machine (SVM) with a radial basis function kernel to detect Parkinson’s disease patients. Vaghefi et al. (2020) investigated the role of combining different image modalities in diagnosing intermediate dry age-related macular degeneration. They trained a network based on Inception-ResNet-v2 using optical coherence tomography (OCT), OCT angiography (OCT-A), and color fundus photographs. Ali et al. (2020) used deep learning to diagnose heart disease from sensor data and electronic medical records. They combined extracted features from both data modalities and used the information gain technique for feature selection. With this technique, they decreased the computational burden and enhanced the system performance. They also employed the conditional probability approach to identify the significance of features. Next, they trained an ensemble deep learning model for heart disease prediction. Finally, they recommended ontology-based dietary plans and activities based on each patient’s health condition. Khamparia et al. (2019) proposed a multimodal deep learning model for chronic kidney disease classification. Their model is constructed using stacked autoencoders with one softmax classifier. The model proposed by Shikalgar and Sonavane (2020) classifies AD using MRI images and EEG signals. The key objective of this method is to enhance the learning procedure in which the weight factor of the deep belief network (DBN) is incorporated with CNN for dealing with multimodal heterogeneous information. First, the median filter on MRI images and the Gaussian filter on EEG signals are applied to minimize noise. After extracting texture properties of images by gray level co-occurrence matrix (GLCM), features from both modalities are concatenated. Finally, a hybrid CNN-DBN model classifies final features.

On the other hand, some articles focus on side issues instead of
classification. For example, most existing methods for AD analysis need image preprocessing, such as registration and segmentation, but Manhua Liu et al. (2018) proposed a cascaded CNN, which requires neither image segmentation nor rigid registration in the preprocessing of brain images. This model learns multimodal features from MRI and PET images for AD classification, saves computation costs, and achieves more robustness to some variations of translation and rotation in images. Many CNN-based methods use a flattening layer after convolutional layers because a fully connected layer only processes 1D information. The feature maps of a 3D-CNN are always in 3D, so using a flattening layer leads to the loss of 3D spatial information in feature maps. To solve this problem, Feng et al. (2019a) used a fully stacked bidirectional LSTM to get rich spatial and semantic information from feature maps. They employed their method for the diagnosis of AD using MRI and PET.

Fig. 17. The tree of papers related to deep learning applications in multimodal medical data analysis.

Another issue on which many researchers are working is the data scarcity problem, especially in COVID-19 analysis. Since large COVID-19 datasets have not been publicly available yet, data augmentation techniques and transfer learning have been used in many articles to prevent the overfitting problem. For example, Kassani et al. (2020) compared different pre-trained CNN models for COVID-19 detection based on MobileNet, DenseNet, Xception, ResNet, InceptionV3, Inception-ResNet-v2, VGGNet, and NASNet. They used pre-trained CNNs for feature extraction from X-ray and CT scans and fed the extracted features into several machine learning classifiers to identify whether a subject has COVID-19 or not. The DenseNet121 feature extractor with a bagging tree classifier achieved the best performance with 99% classification accuracy. El Asnaoui and Chawki (2020) also employed different pre-trained models (VGG16, VGG19, DenseNet201, Inception-ResNet-v2, Inception_V3, Resnet50, and MobileNet_V2) for feature extraction from X-ray and CT scans. Then, they flattened these features and passed them to a multilayer perceptron to classify each image. Their results concluded that Inception-ResNet-v2 is the best feature extractor, which achieved 92.18% accuracy. Similarly, Maghdid et al. (2020) and Rehman et al. (2020) used transfer learning along with the combination of X-ray and CT scans.

Some articles address several problems simultaneously by using a multitask model. For instance, El-Sappagh et al. (2020) proposed a multitask multimodal deep learning model for AD classification and regression of four critical cognitive scores (ADAS, MMSE, FAQ, and CDRSB). They extracted a set of static baseline features and temporal features from five heterogeneous data sources, including MRI, PET, cognitive scores, neuropsychological data, and assessment data. In this method, each time-series source is separately learned using a pipeline of stacked CNN-Bidirectional LSTM blocks. The CNN automatically extracts local features from each time series, and the LSTM extracts temporal features and temporal relationships among them. Then, a decision fusion by a dense layer is used to get more abstract deep features from time-series sources and baseline data. The final step is task-specific learning to produce the final results. Mingxia Liu et al. (2019) proposed a multitask deep learning model for simultaneous AD classification and clinical score regression using MRI and demographic information (age, education, and gender). After extracting multiple image patches from MRI images, they fed them into a CNN. Next, they concatenated demographic information and the feature vector of each MRI patch and used these feature vectors for AD classification and clinical score regression.

7.1.2. Prediction

This section explains some articles that used deep learning algorithms for predicting a medical event in the future. Various predictive tasks can be performed based on disease type, such as metastasis prediction in cancer. Peng et al. (2019) used PET and CT images to predict distant metastases of soft-tissue sarcoma. The CNN proposed in this study has two branches for processing PET and CT images
simultaneously. After feature extraction in each branch, the extracted sequential data and missing values. Usually, a feature extraction phase is
features are concatenated. After another convolutional layer, these required to produce a fixed-size feature representation. However, Lee,
features are concatenated with texture features, which are obtained by Nho, et al. (2019) proposed an RNN that uses any irregular length of
applying grey-level co-occurrence matrix (GLCM), grey-level run-length data as input without preprocessing to predict conversion from MCI to
matrix (GLRLM), grey-level size zone matrix (GLSZM), and neighbor AD using demographic information (age, sex, years of education, and
hood grey-tone difference matrix (NGTDM) on PET images and feature APOE ε4 status), neuroimaging phenotypes measured by MRI, cognitive
selection. Finally, the new feature vector is given to fully connected performance, and CSF measurements. For extracting features, a single
layers to predict whether the input PET-CT image will develop metas GRU is trained separately on each modality. These GRU components
tasis. Zhou et al. (2018) proposed a hybrid model that predicts lymph make an encoding process in which longitudinal data are transformed
node metastasis in head and neck cancer. This method includes two into a vector. Next, four extracted feature vectors from each modality
models, one of which extracts intensity, texture, and geometric features are concatenated and given to an l1-regularized logistic regression
from PET and CT images. Then, these features are fed into an SVM to model for the final prediction.
predict the three classes of lymph nodes, including normal, suspicious, After the outbreak of COVID-19, another predictive task has attrac
and involved. The second model uses 3D-CNN to predict lymph node ted many researchers. In this task, researchers find patients who easily
classes based on CT and PET images. Finally, the outputs of these two deteriorate into critical cases because these patients have a higher pri
models are fused through the evidential reasoning (ER) approach. ority to receive medical treatment and special care. As there are not
Other common predictive tasks in cancer are prognosis and survival enough medical resources in epidemic areas, this task becomes more
prediction. In a study by Li et al. (2019), PET and CT images are fed into important. Some articles have addressed this problem by predicting
a CNN model to predict survival risk in patients with rectal cancer. The disease severity. Lassau et al. (2020) used clinical characteristics, lab
CNN learns imaging features by optimizing the partial likelihood of a tests, and CT images to predict the severity of hospitalized COVID-19
proportional hazards model. Several studies have combined genomic patients. They first trained a deep learning model to extract features
data with other modalities to achieve high performance. For instance, from CT images. Then, they concatenated these features with lab tests
Chen et al. (2019a) proposed an attention-based neural network that and clinical characteristics and gave them to a logistic regression model
fuses patients’ gene expression and clinical data for the breast cancer for predicting the severity score of patients. Similarly, Wang et al.
prognosis. They focused on the efficient fusion of multiple feature (2020) proposed a method to identify the potential high-risk COVID-19
extraction algorithms. In this model, five non-negative matrix factor patients who are more likely to become severe. In this method,
ization (NMF) algorithms extract features from gene expression and DenseNet121-FPN first finds lung masks in CT images. Some non-lung
obtain five feature matrices. Next, the attention mechanism calculates tissues such as the spine and heart inside the lung mask may still exist,
the weight of each NMF algorithm according to the clinical data of each so a non-lung area suppression operation suppresses the intensities of
patient. The weighted sum of five eigenvectors of feature matrices is non-lung areas inside the lung mask. Then, the standardized lung mask
then concatenated with clinical data to generate the final representa is sent to another DenseNet-based model, which is pre-trained using CT
tion, which is fed into a deep neural network for prediction. Lai et al. images and gene information. This model generates deep features and
(2019) combined gene expression and clinical data to predict the sur predicts the probability that a patient has COVID-19. After feature se
vival status of patients within five years. First, they removed patients lection from the combination of deep features and clinical features (age,
with incomplete clinical data. Then, they identified eight novel survival- sex, and comorbidity), a multivariate Cox proportional hazard model is
related genes based on seven previously well-known NSCLC biomarkers. used to identify high-risk COVID-19 patients. In a similar study, Fang
The combined 15 biomarkers and clinical data are fed into an FCN with et al. (2021) used a 3D ResNet to extract features from CT images and
two separate branches for gene markers and clinical data. After four used a multilayer perceptron to obtain features from clinical laboratory
hidden layers in each branch, these branches are stacked for the final data and personal information. Then, they concatenated these feature
prediction. Genomic data is beneficial for OS prediction, but tumor ge vectors and gave them to an LSTM to determine which patients deteri
notype is not available pre-operatively. Tang et al. (2020) proposed a orate into severe cases.
multitask CNN to simultaneously predict the genotype and OS of glio
blastoma patients from pre-operative multiparametric MRI. After two 7.1.3. Segmentation
convolutional blocks, their model is split into five branches, four
branches for genotype prediction (MGMT, IDH, 1p/19q, TERT), and one Image segmentation has greatly benefited from recent developments
for OS prediction. High-level features learned from four genotype pre in multimodal deep learning. In image segmentation, we determine the
diction branches are fed into the fully connected layer of the OS pre outline of an organ or anatomical structure as accurately as possible
diction branch to provide it with tumor genomic features. Furthermore, (Maier et al., 2019). Due to the variable size, shape, and location of the
patients’ age, gender, tumor size, and tumor location are fed to the fully target tissue, medical image segmentation is one of the most challenging
connected layers of all prediction tasks. This study achieves a remark tasks in medical image analysis (Zhou et al., 2019). U-Net and V-Net are
able result for OS prediction. Another study that focused on predicting the most successful architectures for medical image segmentation, so
IDH genotype from multiparametric MRI is Liang et al. (2018), which many researchers have employed them. Zhao et al. (2020) developed a
developed a multimodal 3D deep learning model based on the 3D model to automatically detect local prostate tumors, bone lesions, and
DenseNet framework. This model takes four MRI sequences (T1, T2, lymph node metastasis based on PET/CT images. Their model consists of
T1Gd, FLAIR) as input and gives IDH genotype as output. three 2.5D U-Nets, each of which is applied separately to one of the
A popular predictive task in AD is to predict conversion from mild different planes (axial, coronal, and sagittal) to make predictions about
cognitive impairment (MCI) to AD. For example, Lin et al. (2020) rec each voxel. Then, the majority voting strategy combines the outputs of
ommended an extreme learning machine (ELM) based (Huang et al., U-Nets to segment lesions. Zhao et al. (2018) developed two individual
2012) grading method to fuse multimodal data and predict MCI-to-AD V-Nets for extracting features from CT and PET images. Then, they fused
conversion efficiently. All modalities, including MRI, PET, CSF, and extracted features and gave them to a softmax layer to find the tumor
gene data, are first individually graded using the ELM method. Then, mask in lung cancer. Other architectures can also segment tissues and
these grading scores calculated from different modalities are fed into an achieve good results. For instance, Guo et al. (2019) proposed a 3D
ELM classifier to discriminate subjects with progressive MCI from those DenseNet model for gross tumor volume segmentation in head and neck
with stable MCI. One of the major problems while dealing with longi cancer. In this study, after standardization and normalization, PET and
tudinal data, such as longitudinal CSF and longitudinal cognitive per CT images are fed into the model to adopt the tumor volume contour.
formance, in Alzheimer’s analysis is handling variable-length of Since 2012, the Brain Tumor Image Segmentation Benchmark (BraTS)
F. Behrad and M. Saniee Abadeh Expert Systems With Applications 200 (2022) 117006
challenge has been organized in conjunction with the international conference on Medical Image Computing and Computer Assisted Interventions (MICCAI) to assess and compare state-of-the-art methods in automated brain tumor segmentation. This challenge has encouraged many researchers to focus on brain tumor segmentation because it provides a multiparametric dataset of MRI, which contains T1, T1c, T2, and Flair images (see Section 8). Before this challenge, virtually all studies were validated on relatively small private datasets with varying metrics for performance quantification, making objective comparisons between methods highly challenging (Menze et al., 2015). Brain tumor segmentation is the main task in the BraTS challenge, but other tasks have also been added over the years. For example, the OS prediction task was added to this challenge in 2017. For OS prediction based on multiparametric MRI in the BraTS dataset, machine learning techniques with hand-crafted features are more popular than deep learning techniques. As this topic is beyond the scope of this paper, we do not cover it. For further details about datasets provided by this challenge, see Table 1 in Ghaffari et al. (2020).
An example of studies on the BraTS dataset is Myronenko (2019). In this study, after normalization and data augmentation, input images are given to an encoder-decoder model in which the encoder extracts features from inputs, and the decoder predicts segmentation masks. Furthermore, a variational auto-encoder (VAE) branch is added to the encoder endpoint to reconstruct the original image. Also, the VAE loss is considered in the overall loss function to regularize the shared encoder. This study won first place in the segmentation task of the BraTS 2018 challenge. Jiang et al. (2020), the winner of the segmentation task in the BraTS 2019 challenge, took a variant of the model of Myronenko (2019) as the basic segmentation architecture and proposed a two-stage cascaded U-Net. In the first stage, a U-Net predicts coarse segmentation maps, which are used to calculate the first loss. Then, these maps and raw images are fed into the second-stage U-Net to provide more accurate segmentation maps. The second-stage U-Net has two decoders with the same structure, except that one decoder uses deconvolution and the other uses trilinear interpolation. The outputs of these decoders are used to calculate the second and third losses, which are added to the first loss to create the final loss. The interpolation decoder is only used during training for regularizing the shared encoder.
7.2 Unsupervised methods
Unsupervised learning refers to learning methods without using human-annotated labels (Jing & Tian, 2020). In unsupervised learning, we group an unlabeled dataset based on underlying hidden features. Auto-encoders, GANs, deep Boltzmann machines (DBMs), RBMs, and DBNs are popular deep learning models in unsupervised tasks (Raza & Singh, 2021). Labeling data in the medical field is difficult and time-consuming; also, it needs a lot of knowledge and experience. As a result, unsupervised, self-supervised, and semi-supervised methods have recently received considerable attention in medical analyses due to their potential for reducing the effort to label data. However, only a few studies have used these techniques in multimodal medical data analysis. For instance, Hooshmand et al. (2020) addressed the problem of drug repurposing in COVID-19 in an unsupervised manner. Drug repurposing is a way to discover new applications of existing drugs for treating other diseases (Pushpakom et al., 2018). They employed an RBM to categorize two types of drug data, including differentially expressed genes and chemical structures, into 12 clusters. Then, they chose clusters that consisted of drugs used for curing COVID-19 to discover medications that may be useful in treating this disease. Xu et al. (2021) proposed an unsupervised deep learning method for multimodal image registration. This framework has two branches, both of which use an encoder-decoder architecture. The image registration branch estimates the primary deformation field for a moving CT and a fixed MRI, while the gradient map registration branch takes the corresponding gradient maps of CT and MRI as inputs to produce another deformation field. They also designed a gated dual-branch fusion mechanism to adaptively fuse the estimated deformation fields. With the help of auxiliary gradient-space guidance, their network concentrates more on the spatial relationship of the organ boundary.
In another study, Rubinstein et al. (2019) extracted three feature classes, including statistical, kinetic biological, and deep features, from PET/CT images of patients with prostate tumors. They trained a stacked convolutional auto-encoder and considered its reconstruction errors in different training epochs as deep features. Then, they used all features to compute an anomaly score for each voxel. Finally, they used density estimation to detect anomalies, which are classified as tumors, in the feature space. Milecki et al. (2021) employed an unsupervised deep learning method to segment kidney grafts in T2 and Dynamic Contrast-Enhanced (DCE) MRI. They applied thresholding techniques and morphological operators to detect the area of interest. Then, an unsupervised CNN model, based on differentiable feature clustering, was used for the pixel-wise segmentation of the kidney graft. Suk et al. (2014) devised an unsupervised deep learning method for learning high-level latent and shared features from MRI and PET images. In this method, paired patches of MRI and PET are given to a Gaussian RBM, which is used as a preprocessor to transform the real-valued observations into binary vectors. Then, these binary vectors are given to a DBM, which finds a shared feature representation from the paired patches. As a DBM is an undirected graphical model, bidirectional information flows from one modality to another one and vice versa. Therefore, feature representations are distributed over different layers in the path between modalities, and thus a shared representation is discovered. To validate their model, the authors trained multiple SVM classifiers with this shared representation for AD/MCI diagnosis.
7.3 Semi-supervised methods
Semi-supervised learning is a branch of machine learning that combines supervised and unsupervised learning (Chapelle et al., 2006). Usually, in semi-supervised learning, a small amount of labeled data is used in conjunction with a large amount of unlabeled data (Jing & Tian, 2020). In recent years, semi-supervised learning has attracted many researchers, especially in the medical field, where often only small annotated datasets are available.
Some studies use a small labeled dataset to estimate labels for unlabeled data. For instance, Ge et al. (2020) trained a multi-stream 2D CNN using only labeled data in the training dataset. They also devised a graph-based semi-supervised method to estimate labels for unlabeled data. Then, they trained a GAN using training data from labeled and unlabeled sets to generate synthetic MRIs for data augmentation. Finally, they passed the labeled training dataset, the unlabeled training dataset with estimated labels, and augmented data to the pretrained multi-stream CNN to classify glioma. In the study by Huang and Chung (2020), a graph convolutional neural network is designed to determine whether a subject is healthy or diseased. Their model accepts subjects’ imaging and non-imaging data and represents them as a population graph (partially labeled). Features extracted from imaging data of subjects are considered nodes in the population graph. A trainable module, called the edge adapter, encodes the non-imaging data, such as phenotypic information (age, gender, and site), into the population connectivity. Then, the population graph is given to a graph convolutional neural network, which allows the edge adapter to learn the pairwise associations between subjects during the training of the network. While their goal is to predict the disease state of unlabeled subjects under the supervision of the labeled ones in the population graph, the learned graph can be easily applied to clustering analyses by thresholding.
In another study, Zhao et al. (2020) combined different tricks to achieve better accuracy for 3D brain tumor segmentation. To cope with the problem of data imbalance in segmentation, they employed heuristic sampling and hard sample mining. They also constructed a training batch pool with batches of different patch sizes. For each iteration
during training, they randomly selected a batch from the pool to update the model, so they took advantage of both large patches and the large batch size. Moreover, they used a multi-space semi-supervised method to tackle the lack of annotated data. In this method, they trained different models on the training set under different conditions, such as different subsets of the training set or different subspaces of features, at each iteration. Then, they combined all of these models and used them to label the unlabeled dataset. After each iteration, they merged the manually labeled dataset and the model-labeled dataset as the new training set. Furthermore, they developed a self-ensemble U-Net, which makes predictions at different scales and joins them to obtain segmentation masks. They also applied various optimization techniques, including gradual learning-rate warm-up (Goyal et al., 2018) and multitask learning.
On the other hand, some researchers use both labeled and unlabeled data to improve the performance of their model without labeling unlabeled data. For example, Wang et al. (2020) proposed a semi-supervised method to synthesize high-quality pairs of Apparent Diffusion Coefficient (ADC) and T2 images containing clinically significant (CS) prostate cancer. Their model comprises an encoder to obtain latent vectors from real ADC maps, a decoder to derive low-dimensional ADC maps from latent vectors, a StitchLayer to convert low-dimensional ADC maps to full-size ADC images, and a U-Net to convert the full-size ADC images to T2 images. By training the synthesizer in a supervised manner, the model enforces the correct paired relationship between synthesized ADC and T2 images of a pair. To increase the diversity of the generated data and avoid overfitting, they also trained the synthesizer in an unsupervised manner by providing various random latent vectors. Furthermore, they minimized the Wasserstein distances between the marginal distributions of synthesized and real images of two modalities to ensure high visual similarity between real ADC/T2 images and fake ones. Finally, to force the synthetic images to contain distinguishable CS prostate cancer lesions, they maximized the Jensen-Shannon divergence between CS and non-CS images. In the study by Hosseini et al. (2020), an unsupervised CNN extracts high-level features, and wavelet and spatial group ICA extract time and frequency features from preprocessed EEG. Then, these features are given to a nonlinear SVM to classify interictal epileptiform discharge (IED) and non-IED time intervals. Based on the SVM outputs, a differential connectivity graph is built. Also, a reclustering method is proposed to identify brain networks from preprocessed rs-fMRI data. Finally, for seizure focus localization in epilepsy, an LSTM is used to merge the estimation of brain network and connectivity found by rs-fMRI analysis and the differential connectivity graph found by EEG analysis. The authors claimed their method achieves better results than other methods for IED detection and localization of epileptogenicity.
In another study, Varghese et al. (2016) designed a new post-processing technique to improve glioma segmentation using multiparametric MRI. In this study, two stacked denoising auto-encoders (SDAE) were pretrained using many unlabeled High-Grade Glioma (HGG) patches. One of the SDAEs is fine-tuned using labeled HGG patches to segment HGG images, and another one is fine-tuned using labeled Low-Grade Glioma (LGG) patches to segment LGG images. Furthermore, a one-layer denoising auto-encoder (DAE), called Novelty detector (ND), is trained to create reconstruction error maps by assigning every voxel the mean reconstruction error of the patch centered at that voxel. This leads to a heat map-like image with large error regions corresponding to the location of the glioma. Error maps are then binarized using Otsu’s thresholding. After applying connected component analysis on images predicted by the two SDAEs, connected components that have an empty intersection with their corresponding binary error mask are discarded. Their results demonstrate that the novelty detector causes a reduction in false-positive voxels and improves glioma segmentation. Cheerla and Gevaert (2019) developed an unsupervised encoder to compress clinical data, gene expression, miRNA, and histopathology whole slide images (WSIs) into a single feature vector for each patient. Then, they used this vector for predicting pan-cancer OS. They tailored encoding methods to each data type using deep highway networks (Srivastava et al., 2015) to extract features from clinical and genomic data, and SqueezeNet (Iandola et al., 2016) to extract features from WSIs.
7.4. Self-supervised methods
Self-supervised learning algorithms solve a series of handcrafted auxiliary tasks (so-called pretext tasks) in which supervision signals are acquired from the data itself, without the need for manual annotation (Liu et al., 2021). Pretext tasks result in a model or representation that can be used to solve the original modeling problem (Kolesnikov et al., 2019). With the help of well-designed pretext tasks, self-supervised learning enables models to learn more informative representations from unlabeled data to achieve better performance, generalization, and robustness on various downstream tasks (Liu et al., 2021). Tasks such as image inpainting, colorizing grayscale images, jigsaw puzzles, and super-resolution have proven effective for learning good representations (Jaiswal et al., 2020). For example, Taleb et al. (2021) introduced a self-supervised jigsaw puzzle-solving task for learning semantic representations that facilitate downstream tasks in the multimodal medical imaging context. In this model, all modalities are cut into puzzle pieces or patches and are shuffled randomly according to a specific permutation. These shuffled image pieces are then assembled and create a set of patches called P. In other words, P is a restored image in which each element is drawn from a different modality. Then, a neural network processes each element independently to produce a single output feature vector for each element in P. The matrix created by the concatenation of feature vectors is then passed to the Sinkhorn operator to obtain the soft permutation matrix. This soft permutation matrix is applied to the scrambled input P to reconstruct it. The network aims to minimize the mean squared error between the sorted ground-truth P and the reconstructed version of the scrambled input P. In this way, the network learns different tissue structures across given modalities. After the training process, the network parameters can be used in downstream tasks by fine-tuning on target domains.
Other researchers devise new pretext tasks to learn valuable representations. For instance, Li et al. (2020) proposed a representation learning method that exploits FA and color fundus images for retinal disease diagnosis. First, they reconstructed FA images from corresponding color fundus images using a CycleGAN to learn the mapping function between these modalities. Each patient's data consists of a color fundus image, a transformed fundus image obtained from a random data augmentation technique, and the corresponding synthesized FA. Next, they randomly sampled n patients and fed the batch of data into the ResNet18 to get high-level representations. Then, they classified high-level representations into n classes, where each class represents a patient, and used the classification error for optimizing the model. Finally, they trained a KNN classifier using high-level representations to solve the downstream task, which was retinal disease diagnosis. Hervella et al. (2020) proposed a self-supervised method for the optic disc and cup segmentation using unlabeled pairs of retinography and FA images. In the first phase, a U-Net reconstructs FA images from their corresponding retinographies. This multimodal reconstruction is a self-supervised task that aims at learning domain-specific patterns. This trained network is then fine-tuned using the annotated data for the downstream task (i.e., the optic disc and cup segmentation). This technique causes a significant improvement in the segmentation performance. When multimodal data is used, some imaging modalities may be unavailable due to clinical and practical restrictions. To impute missing data with adequate clinical accuracy, Cao et al. (2020) designed a self-supervised collaborative learning framework to synthesize a missing modality using other available imaging modalities. In the training phase, all modalities are available for each sample. In this phase, the encoder of a translation network encodes the input images from different sources into a common latent feature space. The latent features are then concatenated and given
to the translation network’s decoder to construct the target image. The self-representation network is an auto-encoder that attempts to model the distribution of target images by reconstructing them. Once well-trained, feature maps extracted from the decoder of the self-representation network are used to guide the optimization of the translation network’s decoder. The pseudo images generated by translation and self-representation networks are utilized for training a discriminator network with ground-truth images. The discriminator tries to classify whether each patch in the input image is real or fake. In the testing phase, self-representation and discriminator networks are removed, and only the translation network is used to translate images from multiple source domains to the target domain.
In another form of self-supervised learning, pseudo labels are generated for an unlabeled dataset according to the structure or characteristics of the data itself. Then, these pseudo labels are used to train a model in a supervised manner (Yuan et al., 2021). For instance, Hervella et al. (2019) proposed a self-supervised strategy for retinal vessel segmentation. They used a multiscale Laplacian operation, an edge detection filter in image processing, to obtain vessel maps for each angiography. They aligned these generated maps with corresponding retinographies and used them as pseudo labels. Then, they used these pseudo labels to train a U-Net with standard pixel-wise metrics. They chose these modalities because, unlike retinography, the vasculature is already highlighted in angiography. As a result, angiography provides complementary information that allows the network to segment retinal vessels in retinography without manually annotated labels.
8. Data sources
In Table 4, some of the most well-known multimodal datasets are introduced. Some of them only focus on a specific disease such as ADNI (Alzheimer), BCDR (breast cancer), DDSM (breast cancer), and BraTS (brain tumor), while others such as TCIA and Grand Challenge provide data for different diseases.
9. Common problems
This section describes common problems of applying deep learning algorithms to medical analyses.
9.1 Lack of data
Labeling medical data is time-consuming and needs a lot of knowledge and expertise. Researchers have made a tremendous effort to create several medical datasets; however, most of them are limited in size. Small medical datasets usually cause the overfitting problem because training a neural network needs a lot of data. When data is insufficient, the network tends to memorize samples in the training set. Consequently, it performs very well on the training set but not on the test set.
There are different strategies to overcome this problem. One of them is decreasing the network complexity by regularizing the network or controlling the number of layers and parameters. Another common technique is increasing the training set size by data augmentation techniques. Traditional data augmentation methods include mirroring, random crop, rotation, shearing, and local warping. These techniques improve the results slightly, but they add little new information. A more sophisticated data augmentation technique is to synthesize high-quality examples of the data using generative models, such as GAN (Frid-Adar et al., 2018). For example, Frid-Adar et al. (2018) trained a GAN for generating synthetic liver lesions. They used these synthetic images to increase the size of the training set and fed this larger training set into a CNN for liver lesion classification. Their results show that using synthetic images generated by GAN improves the accuracy of their model. Waheed et al. (2020) used a GAN-based network to
Table 4
Different multimodal medical datasets.
Dataset | Description | Data types | Website
ADNI | The Alzheimer’s disease dataset designed for the early detection and tracking of AD biomarkers | MRI, PET images, genetics, cognitive tests, CSF, and blood | [Link]
BCDR | A digital repository for breast cancer analysis | Mammography, related ultrasound images, and clinical reports | [Link]
OASIS | Normal Aging and Alzheimer’s disease dataset | Multiparametric MRI (T1, T2, FLAIR, ASL, SWI, DTI) and PET images | [Link]org/
TCIA | An archive which includes imaging data for different cancer types. Also, some COVID-19 datasets have been added to this archive recently. | Based on the disease type, different image modalities are available in this archive, such as MRI, CT, and digital histopathology. Supporting data, including patient outcomes, treatment details, genomics, and image analyses, are also provided. | [Link][Link]/
IDA | An archive consisting of medical images for autism, brain mapping, brain aging, and Parkinson’s progression | Based on the project, different image modalities, including MRI, fMRI, DTI, PET, and SPECT, are available. | [Link]rvices/Menu/[Link]?project=
DDSM | A digital database for mammographic image analysis in breast cancer | Mammography and patient information such as age, ACR breast density rating, subtlety rating for abnormalities, and ACR keyword description of abnormalities | [Link]edu/cvprg/Mammography/[Link]
Grand Challenge | A repository for different challenges in medical image analysis. For each challenge, a dataset is provided by organizers. | Based on the challenge, different image modalities are available. | [Link]
BraTS | A challenge with three tasks: brain tumor segmentation, survival time prediction, and the evaluation of uncertainty in tumor segmentation | Multiparametric MRI, including T1, T1c, T2, and Flair, and the age of patients | [Link]edu/cbica/brats2020/
MIMIC | A dataset comprising de-identified health-related data from ~40,000 critical care patients for computational physiology | Patients’ demographics, lab tests, and textual patient notes | [Link]org/about/mimic/
EMBL-EBI’s COVID-19 Data Portal | COVID-19 datasets provided by The European Bioinformatics Institute | Biological images, genomics, protein expression data, and chemical structure data | [Link][Link]/about
ISLES | A dataset for the segmentation of stroke lesions in brain images | Multiparametric MRI including T1, T1c, Flair, and DWI sequences | [Link]org/
ISEG 2017 | A dataset for the segmentation of 6-month infant brain tissues | Multiparametric MRI including T1 and T2 | [Link]edu/
MRBrainS | A dataset for the segmentation of grey matter, white matter, and cerebrospinal fluid of the brain | Multiparametric MRI including T1, T1-weighted inversion recovery, and Flair | [Link]
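The traditional augmentations named in Section 9.1 (mirroring, random crop, and rotation) can be illustrated with a minimal sketch. It assumes each image is a 2D NumPy array; the function names are hypothetical and do not come from any of the cited studies. Rotation is restricted to multiples of 90 degrees so that no interpolation is needed, and shearing and local deformations are left out.

```python
import numpy as np

def mirror(image):
    """Flip the image left-to-right (mirroring)."""
    return image[:, ::-1]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w window out of the image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

def rotate_quarter_turns(image, k):
    """Rotate by k * 90 degrees; other angles would need interpolation."""
    return np.rot90(image, k)

def augment(image, crop_h, crop_w, rng):
    """Apply a random mirror, a random quarter-turn rotation, then a random crop."""
    out = image
    if rng.random() < 0.5:
        out = mirror(out)
    out = rotate_quarter_turns(out, int(rng.integers(0, 4)))
    return random_crop(out, crop_h, crop_w, rng)
```

Calling `augment(image, 4, 4, np.random.default_rng(0))` on an 8 x 10 array returns one 4 x 4 variant; repeated calls yield further variants built from the same pixels, which is consistent with the remark that such transformations add little new information.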
generate synthetic chest X-ray images. They added these generated data are unrelated to any observed or unobserved elements. MAR means
images to the training set and gave this new training set to a CNN model missing data are related to some observed values; therefore, they can be
for COVID-19 detection. With this technique, the accuracy of their predicted by other features in the dataset. MNAR means missing data are
model increased from 85% to 95%. Another prevalent technique to related to both observed and unobserved values.
overcome the overfitting problem is transfer learning, which is To handle missing data, we should determine their type first. With
explained in Section 6. Moreover, unsupervised, semi-supervised, and MCAR, any missing data can be dropped without biasing models. This
self-supervised techniques are beneficial when the training set is small. method is not prevalent as medical datasets are usually small. In MAR
and MNAR, dropping data may bias models. To handle MNAR, we
9.2 Data preprocessing should find the causes of missingness. Also, we should perform what-if
analyses to see how sensitive the results are under various scenarios
Different modalities provide different aspects of the same problem; (van Buuren, 2012). Finally, data imputation techniques are practical to
however, they have been acquired differently. Therefore, standardiza address MAR. In data imputation, we replace missing data with some
tion and normalization techniques are required to make them compa estimated ones. Regression imputation, k-nearest neighbor (KNN)
rable (Antonelli et al., 2019). Furthermore, some modalities are high- imputation (Cover & Hart, 1967), and multiple imputation (Rubin,
dimensional, sparse, irregular, biased, or multi-scale, so it is critical to 1987) are well-known data imputation methods.
preprocess them before an analysis (Kwak & Hui, 2019). For example,
genomic data are high-dimensional, so it is difficult to match a large 9.6 Privacy concerns about data
amount of data from whole-genome sequencing with other kinds of data.
As a result, the dimensionality of genomic data should be reduced to Deep learning algorithms need large centralized training datasets.
match with other modalities (Xu, 2019). Gathering data from different repositories is a possible strategy to in
crease the size of training sets and gain a better understanding of data.
9.3 Class imbalance However, institutions cannot legally share their medical data with other
institutions because medical data are the most sensitive information
A significant difference between the number of negative and positive associated with an individual. As data protection is crucial in the med
samples in the training set leads to the class imbalance problem. This ical field, different strategies, such as federated learning (Konečný et al.,
problem causes models to over-classify the majority group due to the 2015), differential privacy (Dwork, 2006), and multiparty computation
higher prior probability of this group. As a result, the instances (Lindell & Pinkas, 2009), have been suggested to deal with data privacy
belonging to the minority group are misclassified more often than those problems.
belonging to the majority group (Johnson & Khoshgoftaar, 2019). There
are three main approaches to deal with this problem: data-level 9.7 Generalization
methods, algorithm-level methods, and hybrid methods. Data-level
methods modify the training set to make it suitable for a standard Deep learning models should generalize well to unseen data. Most
learning algorithm. These methods include generating new objects for existing deep learning models perform well as long as the test set distribution
the minority group (oversampling), removing examples from the ma is similar to the training set distribution; otherwise, their performance
jority group (undersampling), and random selection of target samples might degrade significantly. Domain adaptation and domain general
for preprocessing. Algorithm-level methods modify the model to alle ization techniques reduce differences between training and test distri
viate their bias towards the majority group. The most popular technique butions by learning domain invariant features (Khandelwal &
is the cost-sensitive approach, which assigns a higher penalty to the Yushkevich, 2020). Also, several studies have focused on the general
minority class and a lower penalty to the majority class. Hybrid methods ization of deep learning models. For instance, Lee et al. (2020) trained a
combine the advantages of previous groups (Krawczyk, 2016). CycleGAN to learn the training set intensity distribution. Then, they
used this network to adapt the arbitrary intensity distribution to the
9.4 Model interpretability and reliability specific intensity distribution of the training set. They confirmed that
their method creates images similar to the training set domain without
Accuracy is a critical factor for convincing users to use methods significant feature loss.
proposed in medical studies because these studies are related to human
lives (Kwak & Hui, 2019). Using multimodal data improves models’ 9.8 High-performance computational resources
accuracy, which leads more people to trust models. Another important
factor that persuades the medical community to use deep learning Deep learning algorithms usually demand high-performance
models is interpretability. Many studies have used deep learning in the computational resources. The high computational cost of medical data
medical field, and some of them have achieved perfect results. However, analysis makes this problem worse. Recently, many hardware acceler
the medical community has not accepted these methods yet. The main ators have been developed to provide the required computational power
reason is that they cannot trust the predictions of deep learning models, for deep learning projects (Talib et al., 2020). However, the training of
which are considered uninterpretable and black-box models. In recent deep learning algorithms is still time-consuming, especially with medi
years, several studies have focused on improving the interpretability of cal data. Parallel processing techniques also alleviate this problem.
deep neural networks. Different visualization techniques are also
developed to understand deep learning models. These methods include 10. Open challenges and future works
weight histograms, saliency maps (Simonyan et al., 2014), occlusion
maps, class maximization, and activation maximization. The integration of multiple modalities helps a model perform much
better because the model takes advantage of additional information
9.5 Missing data provided by different data types. As a result, multimodal data have
attracted many researchers recently, especially in the medical field. This
Handling missing data is challenging in medical data analyses, section describes open challenges and future scopes of this topic.
especially when multimodal data are used. There are three types of
missing data: missing completely at random (MCAR), missing at random • The most challenging problem of training a deep neural network
(MAR), and missing not at random (MNAR) (Sterne et al., 2009). MCAR using medical data is data scarcity. A state-of-the-art approach to
means the probability of missing is the same for all cases, and missing deal with the lack of data is few-shot learning (Fei Fei et al., 2006;
F. Behrad and M. Saniee Abadeh Expert Systems With Applications 200 (2022) 117006
Fink, 2004). As the name implies, few-shot learning is the practice of training a model with a limited amount of training data, which also leads to lower computational cost. Few-shot learning involves identifying the key features of each class to distinguish between the classes (Kotia et al., 2021). Few-shot learning with one training sample in each class is called one-shot learning. Another type of few-shot learning is zero-shot learning, in which no training sample is available (Wang et al., 2020). Although few-shot learning is an active research area, only a few studies have used it in multimodal medical data analyses. As a result, further investigation is required.
• Unsupervised, self-supervised, and semi-supervised methods have recently received considerable attention in medical analyses due to their potential for reducing the effort of labeling data. However, only a few studies in multimodal medical data analysis have used these techniques.
• GANs have become widely popular in the medical field recently. One of their current applications is data augmentation; however, further research is required to make GAN output more reliable. A novel target in medical analyses is the differential privacy-based GAN, which creates new synthetic data without memorizing individual characteristics of samples in the training set. Consequently, these models can produce large amounts of new synthetic medical data that can be released publicly. On the other hand, GANs require enormous computational cost and GPU memory. Some researchers have focused on this problem, but it is still an open area for further research. Finally, GANs can be used for unsupervised learning and generalization.
• Large neural networks generally address complex problems better, but they require more computational power and memory. One popular approach to tackle this problem is neural network pruning, which aims to eliminate a significant number of parameters from a neural network without affecting its accuracy. Neural networks can be pruned before, during, or after training the model (LeCun et al., 1990; Han et al., 2015; Frankle et al., 2020). Furthermore, many studies have concentrated on reducing overall memory by compression, but only a few have aimed at speeding up layers (Maji & Mullins, 2018).
• Some combinations of modalities are commonly used in multimodal medical analysis. The main reason is that they have proved effective in improving models’ performance. However, it may be beneficial to consider other combinations because they may reveal more information. As a result, future studies should try new combinations.

11. Conclusion

Deep learning is a powerful technique for analyzing complex medical data. Recently, deep learning methods have become very popular in medical analyses because they have achieved outstanding results in this field. Multimodal data improve neural networks’ performance as they provide complementary information. This paper presents a comprehensive overview of the latest studies on multimodal medical data analysis using deep learning algorithms. We divided related articles into four main categories: supervised, semi-supervised, self-supervised, and unsupervised methods. We observed that many articles on COVID-19 had used transfer learning because large datasets were not available to them. We conclude that transfer learning methods are invaluable in situations such as pandemics when not enough data is available. Different modalities, deep learning architectures, and fusion strategies are also introduced in this paper. Furthermore, we provided links to access some of the most well-known multimodal datasets and identified common problems and open challenges in this field. We believe that deep learning methods in multimodal medical data analysis will remain an active research area in the coming years.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Abrol, A., Fu, Z., Du, Y., & Calhoun, V. D. (2019). Multimodal Data Fusion of Deep Learning and Dynamic Functional Connectivity Features to Predict Alzheimer’s Disease Progression. In IEEE Xplore. [Link]
Ali, F., El-Sappagh, S., Islam, S. M. R., Kwak, D., Ali, A., Imran, M., & Kwak, K.-S. (2020). A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion. Information Fusion, 63. [Link]10.1016/[Link].2020.06.008
Alom, M. Z., Yakopcic, C., Hasan, M., Taha, T. M., & Asari, V. K. (2019). Recurrent residual U-Net for medical image segmentation. Journal of Medical Imaging, 6(01). [Link]
Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M. A., Al-Amidie, M., & Farhan, L. (2021). Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8(1). [Link]
Antonelli, L., Guarracino, M. R., Maddalena, L., & Sangiovanni, M. (2019). Integrating imaging and omics data: A review. Biomedical Signal Processing and Control, 52. [Link]
Bagheri, A., Groenhof, T. K. J., Veldhuis, W. B., de Jong, P. A., Asselbergs, F. W., & Oberski, D. L. (2020). Multimodal Learning for Cardiovascular Risk Prediction using EHR Data. In arXiv:2008.11979 [cs, eess, stat]. [Link]
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. ArXiv Preprint ArXiv:1409.0473.
Bai, X., Fang, C., Zhou, Y., Bai, S., Liu, Z., Chen, Q., Xu, Y., Xia, T., Gong, S., Xie, X., Song, D., Du, R., Zhou, C., Chen, C., Nie, D., Tu, D., Zhang, C., Liu, X., Qin, L., & Chen, W. (2020). Predicting COVID-19 malignant progression with AI techniques. SSRN. [Link]
Bell, J. (2004). Predicting disease using genomics. Nature, 429(6990). [Link]10.1038/nature02624
Betts, J. G., Young, K. A., Wise, J. A., Johnson, E., Poe, B., Kruse, D. H., Korol, O., Johnson, E. J., Womble, M., & DeSaix, P. (2013). Anatomy and Physiology. OpenStax.
Bhatnagar, G., Wu, Q. M. J., & Liu, Z. (2015). A new contrast based multimodal medical image fusion framework. Neurocomputing, 157. [Link]neucom.2015.01.025
Boss, A., Bisdas, S., Kolb, A., Hofmann, M., Ernemann, U., Claussen, C. D., Pfannenberg, C., Pichler, B. J., Reimold, M., & Stegger, L. (2010). Hybrid PET/MRI of Intracranial Masses: Initial Experiences and Comparison to PET/CT. Journal of Nuclear Medicine, 51(8). [Link]
van Buuren, S. (2012). Flexible Imputation of Missing Data. CRC Press.
Cao, B., Zhang, H., Wang, N., Gao, X., & Shen, D. (2020). Auto-GAN: Self-supervised collaborative learning for medical image synthesis. In AAAI 2020–34th AAAI Conference on Artificial Intelligence. [Link]
Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Qi, T., & Wang, M. (2021). Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. ArXiv Preprint ArXiv:2105.05537.
Caruana, R. (1997). Multitask Learning. Machine Learning, 28(1). [Link]10.1023/a:1007379606734
Chapelle, O., Chi, M., & Zien, A. (2006). A continuation method for semi-supervised SVMs. ACM International Conference Proceeding Series, 148. [Link]1143844.1143868
Chaudhari, S., Mithal, V., Polatkan, G., & Ramanath, R. (2019). An attentive survey of attention models. ArXiv Preprint ArXiv:1904.02874.
Cheerla, A., & Gevaert, O. (2019). Deep learning with multimodal representation for pancancer prognosis prediction. Bioinformatics, 35(14). [Link]bioinformatics/btz342
Chellapilla, K., Puri, S., & Simard, P. (2006). High Performance Convolutional Neural Networks for Document Processing. Tenth International Workshop on Frontiers in Handwriting Recognition.
Chen, H., Gao, M., Zhang, Y., Liang, W., & Zou, X. (2019). Attention-Based Multi-NMF Deep Neural Network with Multimodality Data for Breast Cancer Prognosis Model. BioMed Research International, 2019. [Link]
Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A. L., & Zhou, Y. (2021). Transunet: Transformers make strong encoders for medical image segmentation. ArXiv Preprint ArXiv:2102.04306.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. [Link]1412.3555.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1). [Link]
LeCun, Y., Denker, J., & Solla, S. (1990). Optimal Brain Damage. Advances in Neural Information Processing Systems, 2.
Dwork, C. (2006). Differential Privacy. Automata, Languages and Programming, 4052. [Link]
El-Sappagh, S., Abuhmed, T., Riazul Islam, S. M., & Kwak, K. S. (2020). Multimodal multitask deep learning model for Alzheimer’s disease progression detection based on time series data. Neurocomputing, 412. [Link]neucom.2020.05.087
El Asnaoui, K., & Chawki, Y. (2020). Using X-ray images and deep learning for automated detection of coronavirus disease. Journal of Biomolecular Structure and Dynamics. [Link]
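Neural network pruning, listed among the open challenges in Section 10, can be illustrated with a minimal sketch of one common post-training variant, magnitude-based pruning, in which the weights with the smallest absolute values are set to zero. This is only an illustration under stated assumptions: the function name `magnitude_prune`, the `sparsity` parameter, and the toy weight vector are hypothetical, and the criterion shown here is not necessarily the one used in the works cited above.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `sparsity` is the fraction of weights to remove (hypothetical parameter
    name); a value of 0.5 zeroes half of the weights.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Rank weight indices by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    # Zero the selected weights and keep the others unchanged.
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


# Toy example: prune half of a six-weight layer.
layer_weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
print(magnitude_prune(layer_weights, sparsity=0.5))
```

In practice, a pruned model is usually fine-tuned afterwards so that the remaining weights can compensate for the removed ones; that step is omitted from this sketch.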
Fang, C., Bai, S., Chen, Q., Zhou, Y., Xia, L., Qin, L., Gong, S., Xie, X., Zhou, C., Tu, D., Zhang, C., Liu, X., Chen, W., Bai, X., & Torr, P. H. S. (2021). Deep learning for predicting COVID-19 malignant progression. Medical Image Analysis, 72. [Link]org/10.1016/[Link].2021.102096
Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4). [Link]10.1109/tpami.2006.79
Feng, C., Elazab, A., Yang, P., Wang, T., Zhou, F., Hu, H., Xiao, X., & Lei, B. (2019). Deep Learning Framework for Alzheimer’s Disease Diagnosis via 3D-CNN and FSBi-LSTM. IEEE Access, 7. [Link]
Fink, M. (2004). Object classification from a single example utilizing class relevance metrics. Advances in Neural Information Processing Systems, 17.
Frankle, J., Dziugaite, G. K., Roy, D. M., & Carbin, M. (2020). Pruning Neural Networks at Initialization: Why are We Missing the Mark? [Link]
Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., & Greenspan, H. (2018). GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing, 321. [Link]neucom.2018.09.013
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4). [Link]
Ge, C., Gu, I. Y. H., Jakola, A. S., & Yang, J. (2020). Deep semi-supervised learning for brain tumor classification. BMC Medical Imaging, 20(1). [Link]s12880-020-00485-0
Ghaffari, M., Sowmya, A., & Oliver, R. (2020). Automated Brain Tumor Segmentation Using Multimodal Brain Scans: A Survey Based on Models Submitted to the BraTS 2012–2018 Challenges. IEEE Reviews in Biomedical Engineering, 13. [Link]10.1109/RBME.2019.2946868
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems. [Link]
Götz, T. I. (2019). Technical report: time-activity-curve integration in Lu-177 therapies in nuclear medicine. ArXiv Preprint ArXiv:1907.06617.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., & He, K. (2018). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. [Link]
Guo, Z., Li, X., Huang, H., Guo, N., & Li, Q. (2019). Deep Learning-Based Image Segmentation on Multimodal Medical Imaging. IEEE Transactions on Radiation and Plasma Medical Sciences, 3(2). [Link]
Han, S., Pool, J., Tran, J., & Dally, W. J. (2015). Learning both weights and connections for efficient neural networks. Advances in Neural Information Processing Systems, 2015-January.
Hatamizadeh, A., Yang, D., Roth, H., & Xu, D. (2021). UNETR: Transformers for 3D Medical Image Segmentation. ArXiv Preprint ArXiv:2103.10504.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Hervella, A. S., Ramos, L., Rouco, J., Novo, J., & Ortega, M. (2020). Multi-Modal Self-Supervised Pre-Training for Joint Optic Disc and Cup Segmentation in Eye Fundus Images. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, 2020-May. [Link]ICASSP40776.2020.9053551.
Hervella, A. S., Rouco, J., Novo, J., & Ortega, M. (2019). Self-Supervised Deep Learning for Retinal Vessel Segmentation Using Automatically Generated Labels from Multimodal Data. Proceedings of the International Joint Conference on Neural Networks, 2019-July. [Link]
Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7). [Link]
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8). [Link]
Hooshmand, S. A., Zarei Ghobadi, M., Hooshmand, S. E., Azimzadeh Jamalkandi, S., Alavi, S. M., & Masoudi-Nejad, A. (2020). A multimodal deep learning-based drug repurposing approach for treatment of COVID-19. Molecular Diversity. [Link]org/10.1007/s11030-020-10144-9
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8). [Link]
Hosseini, M. P., Tran, T. X., Pompili, D., Elisevich, K., & Soltanian-Zadeh, H. (2020). Multimodal data analysis of epileptic EEG and rs-fMRI via deep learning and edge computing. Artificial Intelligence in Medicine, 104. [Link]artmed.2020.101813
Huang, G.-B., Zhou, H., Ding, X., & Zhang, R. (2012). Extreme Learning Machine for Regression and Multiclass Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2). [Link]tsmcb.2011.2168604
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [Link]
Huang, H., Lin, L., Tong, R., Hu, H., Zhang, Q., Iwamoto, Y., Han, X., Chen, Y. W., & Wu, J. (2020). UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, 2020-May. [Link]
Huang, Y., & Chung, A. C. S. (2020). Semi-Supervised Multimodality Learning with Graph Convolutional Neural Networks for Disease Diagnosis. Proceedings – International Conference on Image Processing, ICIP, 2020-October. [Link]10.1109/ICIP40778.2020.9191172.
Hung, C. Y., Lin, C. H., Chang, C. S., Li, J. L., & Lee, C. C. (2019). Predicting Gastrointestinal Bleeding Events from Multimodal In-Hospital Electronic Health Records Using Deep Fusion Networks. In IEEE Xplore. [Link]EMBC.2019.8857244.
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. [Link]
Isensee, F., Kickingereder, P., Wick, W., Bendszus, M., & Maier-Hein, K. H. (2019). No New-Net. Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, 11384. [Link]
Ivakhnenko, A. G. (1968). The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control, 1(3), 43–55.
Jacene, H. A., Goetze, S., Patel, H., Wahl, R. L., & Ziessman, H. A. (2008). Advantages of Hybrid SPECT/CT vs SPECT Alone. The Open Medical Imaging Journal, 2(1). https://[Link]/10.2174/1874347100802010067
Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D., & Makedon, F. (2020). A Survey on Contrastive Self-Supervised Learning. Technologies, 9(1). [Link]technologies9010002
Jiang, X., Li, J., Kan, Y., Yu, T., Chang, S., Sha, X., Zheng, H., Luo, Y., & Wang, S. (2020). MRI Based Radiomics Approach with Deep Learning for Prediction of Vessel Invasion in Early-Stage Cervical Cancer. IEEE/ACM Transactions on Computational Biology and Bioinformatics. [Link]
Jing, L., & Tian, Y. (2020). Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. [Link]
Johnson, J. M., & Khoshgoftaar, T. M. (2019). Survey on deep learning with class imbalance. Journal of Big Data, 6(1). [Link]
Ju, W., Xiang, D., Zhang, B., Wang, L., Kopriva, I., & Chen, X. (2015). Random Walk and Graph Cut for Co-Segmentation of Lung Tumor on PET-CT Images. IEEE Transactions on Image Processing, 24(12). [Link]
Kanoun, S., Rossi, C., & Casasnovas, O. (2018). [18F]FDG-PET/CT in hodgkin lymphoma: Current usefulness and perspectives. In Cancers (Vol. 10, Issue 5). [Link]
Kasban, H., El-Bendary, M. A. M., & Salama, D. H. (2015). A Comparative Study of Medical Imaging Techniques. International Journal of Information Science and Intelligent System, 4(2).
Kassani, S. H., Kassani, P. H., Wesolowski, M. J., Schneider, K. A., & Deters, R. (2020). Automatic Detection of Coronavirus Disease (COVID-19) in X-ray and CT Images: A Machine Learning-Based Approach. [Link]
Khamparia, A., Saini, G., Pandey, B., Tiwari, S., Gupta, D., & Khanna, A. (2019). KDSAE: Chronic kidney disease classification with multimedia data learning using deep stacked autoencoder network. Multimedia Tools and Applications, 79(47–48). https://[Link]/10.1007/s11042-019-07839-z
Khandelwal, P., & Yushkevich, P. (2020). Domain Generalizer: A Few-Shot Meta Learning Framework for Domain Generalization in Medical Imaging. Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning, 12444. [Link]
Kim, J., & Lee, B. (2018). Identification of Alzheimer’s disease and mild cognitive impairment using multimodal sparse hierarchical extreme learning machine. Human Brain Mapping, 39(9). [Link]
Kirienko, M., Sollini, M., Silvestri, G., Mognetti, S., Voulaz, E., Antunovic, L., Rossi, A., Antiga, L., & Chiti, A. (2018). Convolutional Neural Networks Promising in Lung Cancer T-Parameter Assessment on Baseline FDG-PET/CT. Contrast Media & Molecular Imaging. [Link]
Kolesnikov, A., Zhai, X., & Beyer, L. (2019). Revisiting self-supervised visual representation learning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June. [Link]CVPR.2019.00202.
Konečný, J., McMahan, B., & Ramage, D. (2015). Federated Optimization: Distributed Optimization Beyond the Datacenter. [Link]
Kotia, J., Kotwal, A., Bharti, R., & Mangrulkar, R. (2021). Few Shot Learning for Medical Imaging. Machine Learning Algorithms for Industrial Applications. Studies in Computational Intelligence, 907. [Link]
Krawczyk, B. (2016). Learning from imbalanced data: Open challenges and future directions. Progress in Artificial Intelligence, 5(4). [Link]0094-0
Kristensen, V. N., Lingjærde, O. C., Russnes, H. G., Vollan, H. K. M., Frigessi, A., & Børresen-Dale, A.-L. (2014). Principles and methods of integrative genomic analyses in cancer. Nature Reviews Cancer, 14(5). [Link]
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2.
Kwak, G. H., & Hui, P. (2019). DeepHealth: Review and challenges of artificial intelligence in health informatics. [Link]
Lai, Y. H., Chen, W. N., Hsu, T. C., Lin, C., Tsao, Y., & Wu, S. (2019). Predicting the Prognosis of Non-Small Cell Lung Cancer by Integrating Microarray and Clinical Data with Deep Learning. bioRxiv. [Link]
Lassau, N., Ammari, S., Chouzenoux, E., Gortais, H., Herent, P., Devilder, M., … Blum, M. G. B. (2020). Integration of clinical characteristics, lab tests and a deep learning CT scan analysis to predict severity of hospitalized COVID-19 patients. MedRxiv. [Link]
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4). [Link]
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553). https://[Link]/10.1038/nature14539
Lee, D. H., Li, Y., & Shin, B.-S. (2020). Generalization of intensity distribution of medical images using GANs. Human-Centric Computing and Information Sciences, 10(1). [Link]
Lee, G., Kang, B., Nho, K., Sohn, K.-A., & Kim, D. (2019). MildInt: Deep Learning-Based Multimodal Longitudinal Data Integration Framework. Frontiers in Genetics, 10. [Link]
Lee, G., Nho, K., Kang, B., Sohn, K.-A., & Kim, D. (2019). Predicting Alzheimer’s disease progression using multi-modal deep learning approach. Scientific Reports, 9(1). [Link]
Li, H., Boimel, P., Janopaul-Naylor, J., Zhong, H., Xiao, Y., Ben-Josef, E., & Fan, Y. (2019). Deep convolutional neural networks for imaging data based survival analysis of rectal cancer. Proceedings. IEEE International Symposium on Biomedical Imaging (ISBI 2019). [Link]
Li, X., Jia, M., Islam, M. T., Yu, L., & Xing, L. (2020). Self-Supervised Feature Learning via Exploiting Multi-Modal Data for Retinal Disease Diagnosis. IEEE Transactions on Medical Imaging, 39(12). [Link]
Liang, S., Zhang, R., Liang, D., Song, T., Ai, T., Xia, C., Xia, L., & Wang, Y. (2018). Multimodal 3D DenseNet for IDH Genotype Prediction in Gliomas. Genes, 9(8). [Link]
Lin, E., & Alessio, A. (2009). What are the basic concepts of temporal, contrast, and spatial resolution in cardiac CT? Journal of Cardiovascular Computed Tomography, 3(6). [Link]
Lin, W., Gao, Q., Yuan, J., Chen, Z., Feng, C., Chen, W., Du, M., & Tong, T. (2020). Predicting Alzheimer’s Disease Conversion From Mild Cognitive Impairment Using an Extreme Learning Machine-Based Grading Method With Multimodal Data. Frontiers in Aging Neuroscience, 12. [Link]
Lindell, Y., & Pinkas, B. (2009). Secure Multiparty Computation for Privacy-Preserving Data Mining. Journal of Privacy and Confidentiality, 1(1). [Link]jpc.v1i1.566.
Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., van der Laak, J. A. W. M., van Ginneken, B., & Sánchez, C. I. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42. [Link]10.1016/[Link].2017.07.005
Liu, M., Cheng, D., Wang, K., & Wang, Y. (2018). Multi-Modality Cascaded Convolutional Neural Networks for Alzheimer’s Disease Diagnosis. Neuroinformatics, 16(3–4). [Link]
Liu, M., Zhang, J., Adeli, E., & Shen, D. (2019). Joint Classification and Regression via Deep Multi-Task Multi-Channel Learning for Alzheimer’s Disease Diagnosis. IEEE Transactions on Biomedical Engineering, 66(5). [Link]TBME.2018.2869989
Liu, Q., & Hu, P. (2019). Association Analysis of Deep Genomic Features Extracted by Denoising Autoencoders in Breast Cancer. Cancers, 11(4). [Link]cancers11040494
Liu, Y., Pan, S., Jin, M., Zhou, C., Xia, F., & Yu, P. S. (2021). Graph Self-Supervised Learning: A Survey. CoRR, abs/2103.00111. [Link]
Lo Gullo, R., Daimiel, I., Morris, E. A., & Pinker, K. (2020). Combining molecular and imaging metrics in cancer: Radiogenomics. Insights into Imaging, 11(1). [Link]org/10.1186/s13244-019-0795-6
Lu, D., Popuri, K., Ding, G. W., Balachandar, R., & Beg, M. F. (2018). Multimodal and Multiscale Deep Neural Networks for the Early Diagnosis of Alzheimer’s Disease using structural MR and FDG-PET images. Scientific Reports, 8(1). [Link]10.1038/s41598-018-22871-z
Ma, Z., Wu, X., Sun, S., Xia, C., Yang, Z., Li, S., & Zhou, J. (2018). A discriminative learning based approach for automated nasopharyngeal carcinoma segmentation leveraging multi-modality similarity metric learning. 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). [Link]ISBI.2018.8363696.
Maghdid, H. S., Asaad, A. T., Ghafoor, K. Z., Sadiq, A. S., & Khan, M. K. (2020). Diagnosing COVID-19 Pneumonia from X-Ray and CT Images using Deep Learning and Transfer Learning Algorithms. [Link]
Maier, A., Steidl, S., Christlein, V., & Hornegger, J. (Eds.). (2018). Medical Imaging Systems (Vol. 11111). Springer International Publishing. [Link]978-3-319-96520-8.
Maier, A., Syben, C., Lasser, T., & Riess, C. (2019). A gentle introduction to deep learning in medical image processing. Zeitschrift fur Medizinische Physik, 29(2). [Link]org/10.1016/[Link].2018.12.003
Maji, P., & Mullins, R. (2018). On the Reduction of Computational Complexity of Deep Convolutional Neural Networks. Entropy, 20(4). [Link]e20040305
Mandic, D. P., & Chambers, J. A. (2001). Recurrent Neural Networks for Prediction. Wiley Series in Adaptive and Learning Systems for Signal Processing, Communications, and Control. [Link]
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4). [Link]BF02478259
McKinley, R., Meier, R., & Wiest, R. (2019). Ensembles of Densely-Connected CNNs with Label-Uncertainty for Brain Tumor Segmentation. Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, 11384. [Link]030-11726-9_40
McKinley, R., Rebsamen, M., Meier, R., & Wiest, R. (2020). Triplanar Ensemble of 3D-to-2D CNNs with Label-Uncertainty for Brain Tumor Segmentation. Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, 11992. [Link]10.1007/978-3-030-46640-4_36
Menze, B. H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., … Van Leemput, K. (2015). The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Transactions on Medical Imaging, 34(10). [Link]TMI.2014.2377694
Micheel, C. M., Nass, S. J., Omenn, G. S., & Policy, H. S. (2012). Evolution of Translational Omics Lessons Learned and the Path Forward. Evolution.
Milecki, L., Bodard, S., Correas, J. M., Timsit, M. O., & Vakalopoulou, M. (2021). 3D unsupervised kidney graft segmentation based on deep learning and multi-sequence MRI. Proceedings – International Symposium on Biomedical Imaging, 2021-April. [Link]
Milletari, F., Navab, N., & Ahmadi, S.-A. (2016). V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In 2016 Fourth International Conference on 3D Vision (3DV). [Link]
Mukherjee, H., Ghosh, S., Dhar, A., Obaidullah, S. M., Santosh, K. C., & Roy, K. (2020). Deep neural network to detect COVID-19: One architecture for both CT Scans and Chest X-rays. Applied Intelligence. [Link]
Myronenko, A. (2019). 3D MRI brain tumor segmentation using autoencoder regularization. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 11384 LNCS. [Link]org/10.1007/978-3-030-11726-9_28.
Ng, A. (2018). Structuring Machine Learning Projects. Coursera. [Link]org/learn/machine-learning-projects?specialization=deep-learning.
Nie, D., Lu, J., Zhang, H., Adeli, E., Wang, J., Yu, Z., Liu, L., Wang, Q., Wu, J., & Shen, D. (2019). Multi-Channel 3D Deep Feature Learning for Survival Time Prediction of Brain Tumor Patients Using Multi-Modal Neuroimages. Scientific Reports, 9(1). [Link]
Oktay, O., Schlemper, J., Folgoc, L. Le, Lee, M. C. H., Heinrich, M. P., Misawa, K., Mori, K., McDonagh, S. G., Hammerla, N. Y., Kainz, B., Glocker, B., & Rueckert, D. (2018). Attention U-Net: Learning Where to Look for the Pancreas. CoRR, abs/1804.03999. [Link]
Peng, Y., Bi, L., Guo, Y., Feng, D., Fulham, M., & Kim, J. (2019). Deep multi-modality collaborative learning for distant metastases predication in PET-CT soft-tissue sarcoma studies. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). [Link]EMBC.2019.8857666
Pratt, L. Y., Mostow, J., & Kamm, C. A. (1991). Direct Transfer of Learned Information Among Neural Networks. In Proceedings of the ninth National conference on Artificial intelligence (AAAI 91) (Vol. 2). [Link][Link].
Pushpakom, S., Iorio, F., Eyers, P. A., Escott, K. J., Hopper, S., Wells, A., Doig, A., Guilliams, T., Latimer, J., McNamee, C., Norris, A., Sanseau, P., Cavalla, D., & Pirmohamed, M. (2018). Drug repurposing: Progress, challenges and recommendations. Nature Reviews Drug Discovery, 18(1). [Link]nrd.2018.168
Qin, R., Wang, Z., Jiang, L., Qiao, K., Hai, J., Chen, J., Xu, J., Shi, D., & Yan, B. (2020). Fine-Grained Lung Cancer Classification from PET and CT Images Based on Multidimensional Attention Mechanism. Complexity, 2020. [Link]2020/6153657
Qiu, S., Joshi, P. S., Miller, M. I., Xue, C., Zhou, X., Karjadi, C., … Kolachalama, V. B. (2020). Development and validation of an interpretable deep learning framework for Alzheimer’s disease classification. Brain, 143(6). [Link]awaa137
Raja, K., Patrick, M., Gao, Y., Madu, D., Yang, Y., & Tsoi, L. C. (2017). A Review of Recent Advancement in Integrating Omics Data with Literature Mining towards Biomedical Discoveries. International Journal of Genomics, 2017. [Link]6213474
Ramachandram, D., & Taylor, G. W. (2017). Deep Multimodal Learning: A Survey on Recent Advances and Trends. IEEE Signal Processing Magazine, 34(6). [Link]10.1109/msp.2017.2738401
Raza, K., & Singh, N. K. (2021). A Tour of Unsupervised Deep Learning for Medical Image Analysis. Current Medical Imaging Formerly Current Medical Imaging Reviews, 17. [Link]
Rehman, A., Naz, S., Khan, A., Zaib, A., & Razzak, I. (2020). Improving Coronavirus (COVID-19) Diagnosis using Deep Transfer Learning. [Link]2020.04.11.20054643.
Rokach, L. (2010). Ensemble-based classifiers. Artificial Intelligence Review, 33(1–2). [Link]
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. Lecture Notes in Computer Science, 9351. https://[Link]/10.1007/978-3-319-24574-4_28
Rosenblatt, F. (1957). The Perceptron, A Perceiving and Recognizing Automaton Project Para. Cornell Aeronautical Laboratory.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. In Wiley Series in Probability and Statistics. Hoboken, NJ, USA: John Wiley & Sons, Inc. [Link]org/10.1002/9780470316696
Rubinstein, E., Salhov, M., Nidam-Leshem, M., White, V., Golan, S., Baniel, J., Bernstine, H., Groshar, D., & Averbuch, A. (2019). Unsupervised tumor detection in Dynamic PET/CT imaging of the prostate. Medical Image Analysis, 55. [Link]org/10.1016/[Link].2019.04.001
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088). [Link]
Saba, T., Sameh Mohamed, A., El-Affendi, M., Amin, J., & Sharif, M. (2020). Brain tumor detection using fusion of hand crafted and deep learning features. Cognitive Systems Research, 59. [Link]
Schrodi, S. J., Mukherjee, S., Shan, Y., Tromp, G., Sninsky, J. J., Callear, A. P., Carter, T. C., Ye, Z., Haines, J. L., Brilliant, M. H., Crane, P. K., Smelser, D. T., Elston, R. C., & Weeks, D. E. (2014). Genetic-based prediction of disease traits:
21
F. Behrad and M. Saniee Abadeh Expert Systems With Applications 200 (2022) 117006
Prediction is very difficult, especially about the future. Frontiers in Genetics, 5, 162. Wang, S., Zha, Y., Li, W., Wu, Q., Li, X., Niu, M., Wang, M., Qiu, X., Li, H., Yu, H.,
[Link] Gong, W., Bai, Y., Li, L., Zhu, Y., Wang, L., & Tian, J. (2020). A Fully Automatic Deep
Shi, J., Zheng, X., Li, Y., Zhang, Q., & Ying, S. (2018). Multimodal Neuroimaging Feature Learning System for COVID-19 Diagnostic and Prognostic Analysis. European
Learning With Multimodal Stacked Deep Polynomial Networks for Diagnosis of Respiratory Journal. [Link]
Alzheimer’s Disease. IEEE Journal of Biomedical and Health Informatics, 22(1). Wang, Y., Yang, Y., Guo, X., Ye, C., Gao, N., Fang, Y., & Ma, H. T. (2018). A Novel
[Link] Multimodal MRI Analysis for Alzheimer’s Disease Based on Convolutional Neural
Shikalgar, A., & Sonavane, S. (2020). Hybrid Deep Learning Approach for Classifying Network. IEEE Xplore. [Link]
Alzheimer Disease Based on Multimodal Data. Advances in Intelligent Systems and Woo, S., Park, J., Lee, J. Y., & Kweon, I. S. (2018). CBAM: Convolutional block attention
Computing, 1025. [Link] module. Lecture Notes in Computer Science (Including Subseries Lecture Notes in
Shukla, S. N., & Marlin, B. M. (2020). Integrating Physiological Time Series and Clinical Artificial Intelligence and Lecture Notes in Bioinformatics), 11211 LNCS. [Link]
Notes with Deep Learning for Improved ICU Mortality Prediction. In arXiv: org/10.1007/978-3-030-01234-2_1.
2003.11059 [cs, stat]. [Link] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y.
Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep Inside Convolutional Networks: (2015). Show, attend and tell: Neural image caption generation with visual
Visualising Image Classification Models and Saliency Maps. [Link] attention. International Conference on Machine Learning, 2048–2057.
1312.6034. Xu, M., Ouyang, L., Gao, Y., Chen, Y., Yu, T., Li, Q., Sun, K., Bao, F. S., Safarnejad, L.,
Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large- Wen, J., Jiang, C., Chen, T., Han, L., Zhang, H., Gao, Y., Yu, Z., Liu, X., Yan, T., Li, H.,
Scale Image Recognition. [Link] … Chen, S. (2020). Accurately Differentiating COVID-19, Other Viral Infection, and
Soltaninejad, M., Zhang, L., Lambrou, T., Yang, G., Allinson, N., & Ye, X. (2019). MRI Healthy Individuals Using Multimodal Features via Late Fusion Learning. [Link]
Brain Tumor Segmentation using Random Forests and Fully Convolutional org/10.1101/2020.08.18.20176776.
Networks. In arXiv:1909.06337 [cs]. [Link] Xu, Y. (2019). Deep Learning in Multimodal Medical Image Analysis. Health Information
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway Networks. [Link] Science, 11837. [Link]
org/abs/1505.00387. Xu, Z., Yan, J., Luo, J., Li, X., & Jagadeesan, J. (2021). Unsupervised Multimodal Image
Sterne, J. A. C., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., Registration with Adaptative Gradient Guidance. [Link]
Wood, A. M., & Carpenter, J. R. (2009). Multiple imputation for missing data in icassp39728.2021.9414320.
epidemiological and clinical research: Potential and pitfalls. BMJ, 338, 1. https:// Yan, R., Ren, F., Rao, X., Shi, B., Xiang, T., Zhang, L., Liu, Y., Liang, J., Zheng, C., &
[Link]/10.1136/bmj.b2393 Zhang, F. (2019). Integration of Multimodal Data for Breast Cancer Classification
Suk, H. I., Lee, S. W., & Shen, D. (2014). Hierarchical feature representation and Using a Hybrid Deep Learning Method. Intelligent Computing Theories and Application,
multimodal fusion with deep learning for AD/MCI diagnosis. NeuroImage, 101. 11643. [Link]
[Link] Yap, J., Yolland, W., & Tschandl, P. (2018). Multimodal skin lesion classification using
Sun, D., Wang, M., & Li, A. (2019). A Multimodal Deep Neural Network for Human Breast deep learning. Experimental Dermatology, 27(11). [Link]
Cancer Prognosis Prediction by Integrating Multi-Dimensional Data. IEEE/ACM exd.13777
Transactions on Computational Biology and Bioinformatics, 16(3). [Link] Yu, Y., Li, M., Liu, L., Li, Y., & Wang, J. (2019). Clinical big data and deep learning:
10.1109/TCBB.2018.2806438 Applications, challenges, and future outlooks. Big Data Mining and Analytics, 2(4). htt
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., ps://[Link]/10.26599/bdma.2019.9020007.
Vanhoucke, V., & Rabinovich, A. (2015). Going Deeper With Convolutions. Yuan, Y., Borrmann, D., Hou, J., Ma, Y., Nüchter, A., & Schwertfeger, S. (2021). Self-
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). supervised point set local descriptors for point cloud registration. Sensors
Taleb, A., Lippert, C., Klein, T., & Nabi, M. (2021). Multimodal Self-supervised Learning for (Switzerland), 21(2). [Link]
Medical Image Analysis. [Link] Zhang, F., Li, Z., Zhang, B., Du, H., Wang, B., & Zhang, X. (2019). Multi-modal deep
Talib, M. A., Majzoub, S., Nasir, Q., & Jamal, D. (2020). A systematic literature review on learning model for auxiliary diagnosis of Alzheimer’s disease. Neurocomputing, 361.
hardware implementation of artificial intelligence algorithms. The Journal of [Link]
Supercomputing. [Link] Zhang, T., & Shi, M. (2020). Multi-modal neuroimaging feature fusion for diagnosis of
Tang, Z., Xu, Y., Jin, L., Aibaidula, A., Lu, J., Jiao, Z., Wu, J., Zhang, H., & Shen, D. Alzheimer’s disease. Journal of Neuroscience Methods, 341. [Link]
(2020). Deep Learning of Imaging Phenotype and Genotype for Predicting Overall [Link].2020.108795
Survival Time of Glioblastoma Patients. IEEE Transactions on Medical Imaging, 39(6). Zhang, Y. D., Dong, Z., Wang, S. H., Yu, X., Yao, X., Zhou, Q., Hu, H., Li, M., Jiménez-
[Link] Mesa, C., Ramirez, J., Martinez, F. J., & Gorriz, J. M. (2020). Advances in
Vaghefi, E., Hill, S., Kersten, H. M., & Squirrell, D. (2020). Multimodal Retinal Image multimodal data fusion in neuroimaging: Overview, challenges, and novel
Analysis via Deep Learning for the Diagnosis of Intermediate Dry Age-Related orientation. Information Fusion, 64. [Link]
Macular Degeneration: A Feasibility Study. In. Journal of Ophthalmology. https Zhang, Y. D., Zhang, Z., Zhang, X., & Wang, S. H. (2021). MIDCAN: A multiple input deep
://[Link]/journals/joph/2020/7493419/. convolutional attention network for Covid-19 diagnosis based on chest CT and chest
Valanarasu, J., Jose, M., Oza, P., Hacihaliloglu, I., & Patel, V. M. (2021). Medical X-ray. Pattern Recognition Letters, 150. [Link]
transformer: Gated axial-attention for medical image segmentation. ArXiv Preprint Zhang, Y., & Wallace, B. C. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to)
ArXiv:2102.10662. ConvolutionalNeural Networks for Sentence Classification. CoRR, abs/1510.03820.
van Sonsbeek, T., & Worring, M. (2020). Towards Automated Diagnosis with Attentive [Link]
Multi-modal Learning Using Electronic Health Records and Chest X-Rays. Multimodal Zhao, X., Li, L., Lu, W., & Tan, S. (2018). Tumor co-segmentation in PET/CT using multi-
Learning for Clinical Decision Support and Clinical Image-Based Procedures, 12445. modality fully convolutional neural network. Physics in Medicine & Biology, 64(1).
[Link] [Link]
Varghese, A., Vaidhya, K., Thirunavukkarasu, S., Kesavdas, C., & Krishnamurthi, G. Zhao, Y., Gafita, A., Vollnberg, B., Tetteh, G., Haupt, F., Afshar-Oromieh, A., Menze, B.,
(2016). Semi-supervised Learning using Denoising Autoencoders for Brain Eiber, M., Rominger, A., & Shi, K. (2020). Deep neural network for automatic
LesionDetection and Segmentation. CoRR, abs/1611.08664. [Link] characterization of lesions on 68Ga-PSMA-11 PET/CT. European Journal of Nuclear
1611.08664. Medicine and Molecular Imaging, 47(3). [Link]
Vasquez-Correa, J. C., Arias-Vergara, T., Orozco-Arroyave, J. R., Eskofier, B., Klucken, J., y
& Noth, E. (2018). Multimodal Assessment of Parkinson’s Disease: A Deep Learning Zhou, T., Ruan, S., & Canu, S. (2019). A review: Deep learning for medical image
Approach. IEEE Journal of Biomedical and Health Informatics, 23(4). [Link] segmentation using multi-modality fusion. Array, 3–4. [Link]
10.1109/jbhi.2018.2866873 array.2019.100004
Vu, T.-D., Ho, N.-H., Yang, H.-J., Kim, J., & Song, H.-C. (2018). Non-white matter tissue Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N., & Liang, J. (2018). Unet++: A nested
extraction and deep convolutional neural network for Alzheimer’s disease detection. u-net architecture for medical image segmentation. In Lecture Notes in Computer
Soft Computing, 22(20). [Link] Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Waheed, A., Goyal, M., Gupta, D., Khanna, A., Al-Turjman, F., & Pinheiro, P. R. (2020). Bioinformatics), 11045 LNCS. [Link]
CovidGAN: Data Augmentation using Auxiliary Classifier GAN for Improved Covid- Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired Image-To-Image Translation
19 Detection. IEEE Access, 8. [Link] Using.
22
Integrating a bidirectional LSTM with CNNs in tasks such as MRI and PET analysis exploits both the spatial and the sequential information in the feature maps. CNNs are adept at spatial feature extraction, while the added LSTM layers capture longer-range dependencies and sequential context. This combination can improve pattern recognition and classification accuracy on complex datasets such as brain imaging data for Alzheimer's diagnosis.
Challenges include the need for large labeled datasets, which are not always available. This matters for non-small cell lung cancer survival prediction, which relies heavily on carefully integrating imaging and clinical data. The high dimensionality of multimodal data strains computational resources, and model interpretability is critical, especially in clinical settings. In addition, data heterogeneity and the varying quality of the available modalities increase model complexity and can make training unstable.
Decision-level fusion processes each modality separately and combines the per-modality outputs only for the final decision, whereas input-level and layer-level fusion combine the data earlier in the pipeline. Its main advantage is flexibility: it does not require every modality for every sample, and it reduces both the risk of overfitting and the computational demand. It cannot directly uncover relationships between modalities, but it can still achieve high performance because of its smaller search space.
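The flexibility described above can be illustrated with a minimal sketch. It assumes that each modality already has its own trained classifier that outputs class probabilities, and that fusion is a plain average over the modalities present for a sample. The function name `fuse_decisions`, the modality names, and the probability values are hypothetical and chosen only for illustration; averaging is one of several possible combination rules.

```python
from typing import Dict, List, Optional


def fuse_decisions(probs_by_modality: Dict[str, Optional[List[float]]]) -> List[float]:
    """Average class probabilities over the modalities that are present.

    A value of None marks a modality that is missing for this sample.
    """
    available = [p for p in probs_by_modality.values() if p is not None]
    if not available:
        raise ValueError("at least one modality is required")
    n_classes = len(available[0])
    if any(len(p) != n_classes for p in available):
        raise ValueError("all modalities must score the same classes")
    return [sum(p[c] for p in available) / len(available) for c in range(n_classes)]


# Hypothetical per-modality outputs for one patient; the PET scan is missing.
patient = {"mri": [0.75, 0.25], "pet": None, "clinical": [0.5, 0.5]}
fused = fuse_decisions(patient)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the missing modality is simply skipped, the sample can still be classified. The sketch also shows the limitation: each model sees only its own modality, so no cross-modality relationship is learned.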
Input-level fusion, or early integration, requires all modalities to be available for every sample in the training set, which is often not feasible in practice. It also produces a large feature vector that demands substantial computational resources. Despite these drawbacks, it helps identify relationships between the different modalities.
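As a rough sketch of why early integration needs complete samples, the example below concatenates per-modality feature vectors into one input vector and rejects a sample when any modality is absent. The helper `fuse_inputs`, the modality names, and the feature values are assumptions made for illustration and do not come from a specific study.

```python
from typing import Dict, List, Optional, Sequence


def fuse_inputs(
    features: Dict[str, Optional[List[float]]], modalities: Sequence[str]
) -> List[float]:
    """Concatenate the feature vectors of all modalities in a fixed order."""
    fused: List[float] = []
    for name in modalities:
        vector = features.get(name)
        if vector is None:
            # Early fusion has no slot to leave empty: the sample is unusable.
            raise ValueError(f"modality '{name}' is missing")
        fused.extend(vector)
    return fused


MODALITIES = ("mri", "pet", "clinical")  # hypothetical modality names

complete = {"mri": [0.1, 0.2, 0.3], "pet": [0.4, 0.5], "clinical": [63.0]}
fused_vector = fuse_inputs(complete, MODALITIES)  # length 3 + 2 + 1 = 6

incomplete = {"mri": [0.1, 0.2, 0.3], "pet": None, "clinical": [63.0]}
try:
    fuse_inputs(incomplete, MODALITIES)
    rejected = False
except ValueError:
    rejected = True
```

The fused vector grows with every modality added, which is the source of the computational cost noted above. In practice, missing modalities are often imputed rather than rejected.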
GANs are beneficial in unsupervised tasks because they can generate realistic, high-quality images, which are useful for augmenting medical imaging datasets where labeled data are scarce. However, they are difficult to train: they require careful tuning and are prone to instabilities such as mode collapse. Their ability to learn complex data distributions can help improve diagnostic models through synthetic data generation.
CNNs can be optimized for medical image analysis through modifications such as structural reformulation, regularization, and parameter optimization. These changes improve the networks' ability to learn from complex, high-dimensional medical data while reducing computational cost and enhancing performance. In addition, CNNs use pooling layers to reduce spatial dimensions and to provide a degree of invariance to small translations, which is important for processing image data effectively.
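The effect of a pooling layer can be shown with a small, framework-free sketch. It assumes non-overlapping 2x2 max pooling with stride 2 on a single-channel image stored as nested lists. The function name `max_pool_2x2` and the toy images are hypothetical.

```python
from typing import List


def max_pool_2x2(image: List[List[float]]) -> List[List[float]]:
    """Non-overlapping 2x2 max pooling (stride 2) on an even-sized 2D grid."""
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


# Toy 4x4 image with a single bright pixel.
image = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The same image with the bright pixel shifted by one position (up and left).
shifted = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

pooled = max_pool_2x2(image)  # 4x4 input becomes a 2x2 output
pooled_shifted = max_pool_2x2(shifted)
```

Both inputs pool to the same 2x2 map because the bright pixel stays inside the same pooling window. A shift that crosses a window boundary would change the output, so the invariance holds only for small displacements.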
End-to-end learning approaches streamline processing by learning complex nonlinear mappings directly from raw multimodal inputs to outputs, which reduces the need for preprocessing. Such approaches can extract and learn from diverse data types simultaneously, and they simplify model pipelines. For cancer prognosis, they ease the integration of omics and imaging data and can thereby improve predictive performance.
Majority voting in decision-level fusion aggregates the predictions of the individual per-modality models and selects the class that is predicted most often. This consensus mechanism improves the robustness and accuracy of medical diagnoses by smoothing out errors made by any single model. It is widely used because it is simple and effective when data from multiple modalities must be integrated into a final diagnostic prediction.
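Majority voting can be sketched in a few lines. The example assumes that each per-modality model has already produced a single class label. The function name `majority_vote`, the modality names, and the labels "AD" and "CN" are hypothetical. Ties are broken in favor of the label encountered first, which is an arbitrary choice made here for simplicity.

```python
from collections import Counter
from typing import Dict, Hashable


def majority_vote(predictions: Dict[str, Hashable]) -> Hashable:
    """Return the label predicted by the largest number of modality models."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions.values())
    # most_common orders equal counts by first occurrence (Python 3.7+).
    return counts.most_common(1)[0][0]


# Hypothetical labels from three per-modality classifiers for one patient.
votes = {"mri": "AD", "pet": "AD", "clinical": "CN"}
diagnosis = majority_vote(votes)
```

Here the clinical model disagrees with the two imaging models, and the vote overrides its single dissenting prediction.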
Multitask learning can enhance Alzheimer's disease progression analysis by letting a model learn simultaneously from related tasks, such as different stages of progression or related symptoms, which can improve generalization. Because representations are shared between tasks, learning can become more robust and efficient, as demonstrated by models designed for Alzheimer's-related time series data.
Decision-level fusion is often preferred because it is flexible: each modality is processed separately by its own model, which makes the method robust when some modalities are missing for some samples. Unlike input-level or layer-level fusion, it does not require every modality to be available for each sample, and it reduces the likelihood of overfitting. Its lower computational demands and its flexibility make it practical for real-world applications, despite its inability to analyze relationships between modalities directly.