Chang Su, Zhenxing Xu, Jyotishman Pathak, Fei Wang.
Abstract
Mental illnesses, such as depression, are highly prevalent and have been shown to impact an individual's physical health. Recently, artificial intelligence (AI) methods have been introduced to assist mental health providers, including psychiatrists and psychologists, in decision-making based on patients' historical data (e.g., medical records, behavioral data, social media usage, etc.). Deep learning (DL), as one of the most recent generations of AI technologies, has demonstrated superior performance in many real-world applications ranging from computer vision to healthcare. The goal of this study is to review existing research on applications of DL algorithms in mental health outcome research. Specifically, we first briefly overview the state-of-the-art DL techniques. Then we review the literature relevant to DL applications in mental health outcomes. According to the application scenarios, we categorize these relevant articles into four groups: diagnosis and prognosis based on clinical data, analysis of genetics and genomics data for understanding mental health conditions, vocal and visual expression data analysis for disease detection, and estimation of risk of mental illness using social media data. Finally, we discuss challenges in using DL algorithms to improve our understanding of mental health conditions and suggest several promising directions for their applications in improving mental health diagnosis and treatment.
Year: 2020 PMID: 32532967 PMCID: PMC7293215 DOI: 10.1038/s41398-020-0780-3
Source DB: PubMed Journal: Transl Psychiatry ISSN: 2158-3188 Impact factor: 6.222
Fig. 1 Examples of deep neural networks.
a Deep feedforward neural network (DFNN). It is the basic design of DL models and commonly contains multiple hidden layers. b Recurrent neural network (RNN), designed to process sequence data. To encode history information, each recurrent neuron receives the input element and the state vector of its predecessor neuron, and yields a hidden state that is fed to its successor neuron. For example, the RNN architecture encodes not only the individual elements of the sequence x1 → x2 → x3 → x4 → x5 but also the dependence among them. c Convolutional neural network (CNN). Between the input layer (e.g., an input neuroimage) and the output layer, a CNN commonly contains three types of layers: the convolutional layer, which generates feature maps by sliding convolutional kernels over the previous layer; the pooling layer, which reduces the dimensionality of the preceding convolutional layer; and the fully connected layer, which makes the prediction. For illustrative purposes, this example has only one layer of each type; a real-world CNN would have multiple convolutional and pooling layers (usually interleaved) and one fully connected layer. d Autoencoder. It consists of two components: the encoder, which learns to compress the input data into a latent representation layer by layer, and the decoder, which mirrors the encoder and learns to reconstruct the data at the output layer. The learned compressed representations can be fed to a downstream predictive model.
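The recurrence described for the RNN in panel b can be sketched in a few lines of Python. This is a minimal illustration only, assuming a single recurrent unit with scalar inputs, a tanh activation, and hand-picked weights; the names `rnn_forward`, `w_in`, and `w_rec` are hypothetical and do not come from the reviewed studies.

```python
import math


def rnn_forward(xs, w_in, w_rec, bias=0.0):
    """Run one recurrent unit over a sequence of scalar inputs.

    Assumed (illustrative) update rule: h_t = tanh(w_in * x_t + w_rec * h_{t-1} + bias).
    """
    h = 0.0  # state before the first element of the sequence
    states = []
    for x in xs:
        # Each step combines the current element with the predecessor's state.
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states


sequence = [0.5, -1.0, 0.25, 0.0, 1.0]  # stands in for x1 ... x5
states = rnn_forward(sequence, w_in=0.8, w_rec=0.5)
reversed_states = rnn_forward(sequence[::-1], w_in=0.8, w_rec=0.5)

print("hidden states:", [round(s, 4) for s in states])
print("final state, original order:", round(states[-1], 4))
print("final state, reversed order:", round(reversed_states[-1], 4))
```

Because each hidden state depends on its predecessor, feeding the same five values in reverse order ends in a different final state, which is the order dependence the caption refers to.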
Fig. 2 Technical details of neural networks.
a An illustration of the basic unit of neural networks, i.e., the artificial neuron. Each input $x_i$ is associated with a weight $w_i$. The weighted sum of all inputs, $\sum_i w_i x_i$, is fed to a nonlinear activation function $f$ to generate the output $y_j$ of the $j$-th neuron, i.e., $y_j = f\left(\sum_i w_i x_i\right)$. b Illustrations of widely used nonlinear activation functions.
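The artificial neuron of panel a can be written directly as code. The sketch below is illustrative: the caption does not name the activation functions shown in panel b, so sigmoid, tanh, and ReLU are assumed here as common examples, and the function name `neuron_output` is hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, squashing z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, z)


def neuron_output(inputs, weights, activation):
    """Compute y = f(sum_i w_i * x_i) for a single artificial neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


inputs = [1.0, 2.0, -1.0]
weights = [0.5, -0.25, 1.0]
# Weighted sum: 0.5 * 1.0 + (-0.25) * 2.0 + 1.0 * (-1.0) = -1.0

for name, f in [("sigmoid", sigmoid), ("tanh", math.tanh), ("relu", relu)]:
    print(name, neuron_output(inputs, weights, f))
```

The same weighted sum produces different outputs depending on the choice of f, which is why the activation function is treated as a separate design decision.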
Fig. 3 PRISMA flow diagram: deep learning in mental health outcome research.
In total, 57 studies that met our eligibility criteria were included in this review, spanning clinical data analysis, genetic data analysis, vocal and visual expression data analysis, and social media data analysis.
A summary of the selected studies in this review.
| Authors, year | Used deep model | Data | Study cohort | Outcome assessment | Aims | Performance | Findings |
|---|---|---|---|---|---|---|---|
| Clinical data | |||||||
| Neuroimage data | |||||||
| Kuang and He, 2014[ | DBN | fMRI | 449 Subjects (ADHD-200a) | Human annotation | Prediction of ADHD status and subtype | ACC = 0.407–0.809 | This study was the first to use a DL method to discriminate ADHD with fMRI data. |
| Kuang et al., 2014[ | DBN | fMRI | 492 Subjects (ADHD-200a) | Human annotation | Prediction of ADHD status and subtype | ACC = 0.344–0.718 | The study verified that there are differences between ADHD subjects and controls in the prefrontal cortex and cingulate cortex. |
| Ulloa et al., 2015[ | DFNN | sMRI | 198 Schizophrenia subjects, 191 HCs | Human annotation | Prediction of schizophrenia | ACC = 0.75; Baseline ACC = 0.70 | The model classified neuroimaging data in an online fashion using purely synthetic data. |
| Pinaya et al., 2016[ | DBN | sMRI | 143 Schizophrenia subjects, 32 first-episode psychosis, 191 HCs | Human annotation based on SCID-I | Prediction of schizophrenia | ACC = 0.736; Baseline ACC = 0.681 | The DBN highlighted differences between classes, especially in the frontal, temporal, parietal, and insular cortices, and in some subcortical regions, including the corpus callosum, putamen, and cerebellum. |
| Farzi et al., 2017[ | DBN | fMRI | 336 Subjects (ADHD-200a) | Human annotation | Prediction of ADHD | ACC = 0.637–0.698; Baseline ACC = 0.352–0.642 | The deep model captured relationships from multiple features, including fMRI features, diagnosis status, ADHD measures, secondary symptoms, age, gender, etc. |
| Zou et al., 2017[ | 3D CNN | fMRI | 239 ADHDs, 429 TDCs | Human annotation | Prediction of ADHD | ACC = 0.657; Baseline ACC = 0.615 | The 3D CNN architecture can detect physiologically meaningful 3D local patterns from fMRI data. |
| Geng and Xu, 2017[ | Autoencoder and CNN | fMRI | 24 MDDs, 24 HCs | Not specified | Prediction of depression | ACC = 0.95; Baseline ACC = 0.71 | The model automatically learned meaningful features from the original time series of the fMRI. |
| Zou et al., 2017[ | 3D CNN | fMRI and sMRI | 239 ADHDs, 429 TDCs | Human annotation | Prediction of ADHD | ACC = 0.692; Baseline ACC = 0.615 | The study found that brain functional and structural information are complementary. The low-level features and high-level features from fMRI and sMRI are useful for the detection of ADHD. |
| Sen et al., 2018[ | Autoencoder | fMRI and sMRI | 279 ADHDs, 491 HCs (ADHD-200a); 538 ASDs and 573 HCs (ABIDEb) | Human annotation | Prediction of ADHD and ASD | ACC = 0.643–0.673; Baseline ACC = 0.500–0.516 | Combining multimodal features can yield good classification accuracy for diagnosis of ADHD and autism. |
| Aghdam et al., 2018[ | DBN | fMRI and sMRI | 116 ASDs, 69 TDCs (ABIDE b) | Human annotation | Prediction of ASD | ACC = 0.656 | (1) There were significant relationships between rs-fMRI and sMRI; (2) Increasing the depth of DBN can help improve diagnostic classification. |
| Matsubara et al., 2019[ | DFNN | fMRI | 50 Schizophrenia subjects, 49 BDs, and 122 HCsc | | Diagnosis of psychiatric disorder | ACC = 0.766; Baseline ACC = 0.720 | The study modeled the joint distribution of rs-fMRI data, class labels, and remaining frame-wise variabilities. |
| Pinaya et al., 2019[ | Autoencoder | sMRI | 35 Schizophrenia subjects, 40 HCs (NUSDASTd); 83 ASDs, 105 HCs (ABIDEb) | Human annotation | Identification of abnormal brain structural patterns in neuropsychiatric disorders (schizophrenia and ASD) | ACC = 0.639 to 0.707; Baseline ACC = 0.569 to 0.637 | There are distinct patterns of neuroanatomical deviations for the two diseases (schizophrenia and ASD). |
| Electroencephalogram data | |||||||
| Mohan et al., 2017[ | DFNN | 6.25-sec EEG | 116 University students | PHQ-9 score and DASS-21 | Prediction of depression | 19 out of 20 testers were detected correctly | The study found that signals collected from the central (C3 and C4) region were marginally higher than those from other brain regions. |
| Acharya et al., 2018[ | CNN | 5-min EEG | 15 Depressed subjects, 15 HCs | Human annotation based on specific questions and physical examination | Prediction of depression | ACC = 0.935 (left hemisphere) and 0.960 (right hemisphere) | The study found that the EEG signals from the right hemisphere are more distinctive in depression than those from the left hemisphere. |
| Zhang et al., 2018[ | CNN | 1000 Hz EEG | 20 subjects | Cross-task mental workload assessment | Cross-task mental workload assessment | ACC = 0.889 | (1) Spectral changes of EEG hemispheric asymmetry provide effective information to distinguish different mental workload tasks. (2) Different time periods can provide different hemispheric EEG activities, and selection of an appropriate time window is essential for extracting hemispheric asymmetry information. |
| Li et al., 2019[ | CNN | 1-sec EEG | 24 Mild depression, 24 HCs | BDI-II | Prediction of mild depression | ACC = 0.856 | They found that the spectral information of EEG played a major role and the temporal information of EEG provided a statistically significant improvement to accuracy. |
| Electronic health records | |||||||
| Pham et al., 2017[ | DeepCare (LSTM-based model) | Longitudinal EMRs | 11,000 Patients | ICD-10 diagnosis code | Prediction of future mental outcomes | F-score = 0.754; Baseline F-score = 0.679 | The LSTM architecture appropriately captured disease progression by modeling not only the illness history but also the medical interventions. |
| Geraci et al., 2017[ | DFNN | Clinical notes | 366 Patients | Human annotation | Prediction of youth depression | Sensitivity = 0.935, Specificity = 0.68, Positive predictive value = 0.77 | The model identified individuals who meet the inclusion–exclusion criteria for depression research. |
| Rios and Kavuluru, 2017[ | CNN | Clinical notes | 1000 Neuropsychiatric notese | Human annotation | Prediction of psychiatric symptom severity | NMMAE = 0.856 | The CNN scheme was superior at extracting text features, and its predictive performance was better than that of many traditional text classification methods. |
| Tran and Kavuluru, 2017[ | CNN and attention-based RNN | Clinical notes | 1000 Neuropsychiatric notese | Human annotation | Prediction of 11 mental health conditions (e.g., ADHD, anxiety, bipolar, dementia, depression, etc.) | F-score = 0.631; Baseline F-score = 0.598 | Both the CNN and RNN architectures achieved desirable prediction performances. |
| Choi et al., 2018[ | DFNN | Structured EHRs | SD: 2546, HC: 817,405 | ICD-10 diagnosis code | Prediction of suicide death | AUC = 0.683; Baseline AUC = 0.688 | The model is able to address the imbalance classification problem. |
| Lin et al., 2018[ | DFNN | Clinical biomarkers and genetic biomarkers (SNPs) | 257 MDD treatment responders, 164 MDD treatment non-responders | HRSD | Prediction of antidepressant response and remission | AUC = 0.823; Baseline AUC = 0.816 | The model achieved better performance than the logistic regression classifier. |
| Dai and Jonnagaddala, 2018[ | CNN | Clinical notes | Clinical notes of psychiatric disorder subjects: Absent: 92, Mild: 252, Moderate: 156, Severe: 149e | Human annotation | Prediction of positive valence symptom severity | MAE = 0.539; Baseline MAE = 0.583 | The CNN models provided comparable solutions without sophisticated preprocessing on the text data. |
| Genetic data | |||||||
| Laksshman et al., 2017[ | CNN | Whole exome sequencing data | 1000 Subjectsf | Not specified | Differentiating bipolar disorder patients from healthy controls | AUC = 0.65; Baseline AUC = 0.62 | The 1D convolution captured correlation of neighboring loci. The model achieved a winning predictive performance of 0.65 AUC, compared with traditional methods ranging from 0.5 to 0.55. This suggests that the model might be picking up complex patterns across the samples. |
| Khan and Wang, 2017[ | ncDeepBrain (DFNN based) | Genome sequencing data | - | Not specified | Identification of non-coding variants associated with mental disorders | ACC = 0.82; Baseline ACC = 0.71 | The model was trained for scoring the non-coding variants for prioritization. |
| Khan et al., 2018[ | iMEGES (DFNN based) | Genome sequencing data | - | Not specified | Prioritization of susceptibility genes for mental disorders | AUC = 0.57 (schizophrenia) and 0.58 (ASD) | The model integrated the ncDeepBrain score, general gene scores, and disease-specific scores to prioritize susceptibility genes for mental disorders. |
| Wang et al., 2018[ | Deep structured phenotype network (DSPN) | Regulatory network | PsychENCODE Consortium dataset g | Not specified | Prediction of psychiatric phenotypes from genotype and expression | ACC = 0.729; Baseline ACC = 0.681 | The model provided insights about intermediate phenotypes and their connections to high-level phenotypes (disease traits). |
| Vocal and visual expression data | |||||||
| Chao et al., 2015[ | CNN and LSTM | Voice and visual data | 84 Subjects (AVEC dataset) | Human annotation | Prediction of depression severity | MAE = 8.7 | Face appearance features were extracted by CNN. The deep-learned appearance features, combined with audio and face shape features, were fed to an LSTM to capture long-term sequence features. |
| Yang et al., 2016[ | LSTM and autoencoder | Elicited speech voice data | 13 BDs, 13 UDs, and 13 HCs (Chi-Mei mood dataset) | Human annotation | Prediction of mood disorder | ACC = 0.769; Baseline ACC = 0.498 | The denoising autoencoder adapted emotion domain data to the speech data space to generate emotion profiles (EPs). The LSTM characterized the temporal evolution of the EP sequence with respect to eliciting emotional videos. |
| Ma et al., 2016[ | CNN and LSTM | Voice data | (AVEC dataset) | PHQ-8 score | Prediction of depression | F-score = 0.52 | The model incorporated short-term temporal and spectral correlations by a 1D CNN, captured middle-term correlations by 1D max-pooling, and extracted long-term correlations with LSTM. |
| Huang et al., 2017[ | LSTM and autoencoder | Elicited speech voice data | 15 BDs, 15 UDs, and 15 HCs (Chi-Mei mood dataset) | Human annotation | Prediction of mood disorder | ACC = 0.733 | The denoising autoencoder adapted emotion domain data to the speech data space to generate emotion profiles (EPs). The LSTM characterized the temporal evolution of the EP sequence with respect to eliciting emotional videos. |
| Su et al., 2017[ | LSTM and autoencoder | Elicited video data | 12 BDs, 12 MDDs, and 12 HCs (Chi-Mei mood dataset) | Human annotation | Classification of mood disorders | ACC = 0.677; Baseline ACC = 0.556 | The study modeled the long-term variation among different mood disorders types by LSTM. |
| Jaiswal et al., 2017[ | CNN | Facial expression RGBD data (An RGB-D image is simply a combination of a color image and its corresponding depth image.) | 4 ADHDs, 22 ASDs, 11 ADHD + ASDs, and 18 HCs | Not specified | Prediction of ADHD and ASD | ACC = 0.96 (condition vs. HC) and 0.93 (ADHD + ASD vs. ASD only) | The study established the relationship between facial expression/gestures and neurodevelopmental conditions such as ADHD and ASD. |
| Cho et al., 2017[ | CNN | Thermal images | 8 Healthy adultsh | Human annotation | Recognition of psychological stress level (mental overload) | ACC = 0.846 (no stress vs. stress) and 0.565 (no stress vs. low-level stress vs. high-level stress) | The model identified psychological stress level by using a low-cost thermal camera, which tracks the person’s breathing patterns. |
| Yang et al., 2017[ | CNN and DFNN | Voice and visual data | 189 Segments of clinical interview (AVEC dataset) | PHQ-8 score | Prediction of depression | MAE = 5.4 | The study proposed a multimodal approach: two CNNs were introduced to encode audio and video data, respectively. Then a fully connected DNN was used to combine the two channel feature maps to predict PHQ-8 scores. |
| Gupta et al., 2017[ | DFNN | Voice and visual data | 300 Video samples (AVEC dataset) | Valence, arousal, and dominance ratings by human annotation | Affective prediction | Correlation coefficient | The DFNN incorporated depression severity as the parameter, linking the effects of depression on subjects’ affective expressions. |
| He and Cao, 2018[ | CNN | Voice data | 300 Video samples (AVEC dataset) | BDI-II | Prediction of depression | MAE = 8.2; Baseline MAE = 10.4 | The model consists of four CNNs, one for extracting audio features from raw waveform, one for extracting texture features from spectrogram images, and two for modeling handcraft features. |
| Dawood et al., 2018[ | CNN and LSTM | Video collected by webcam | 862 Videos of AS, 545 videos of TDC | Not specified | Prediction of depression | ACC = 0.901 | The model used a CNN to learn facial expression features from images (each frame’s response map) and an LSTM to learn from temporal data (the sequence of response maps). |
| Song et al., 2018[ | CNN | Video data | 30 Depressed subjects, 77 non-depressed subjects, and 35 subjects for development (AVEC dataset) | PHQ-8 score | Prediction of depression and depression severity | MAE = 5.01; Baseline = 4.4 | The model transformed behavior signals to spectrum maps to capture long-term series information. Then CNN was used to extract spectral features. |
| Zhu et al., 2018[ | CNN | Video data | 340 Videos from 292 subjects (AVEC dataset) | BDI-II | Prediction of depression | MAE = 7.6; Baseline MAE = 8.2 | The model introduced two CNNs, one pre-trained for modeling the static facial appearance and the other modeling the optical flow images extracted from different frames. |
| Prasetio et al., 2018[ | CNN | Facial image | Female: 87 high stress, 129 low stress, and 175 neutral; Male: 134 high stress, 212 low stress, and 237 neutral | Human annotation | Stress recognition | ACC = 0.959; Baseline ACC = 0.890 | The features were from facial images and fed to a CNN to identify stress. |
| Jan et al., 2018[ | CNN (only for image) | Voice and visual data | 300 Videos (AVEC dataset) | BDI-II | Prediction of depression severity | MAE = 6.7 (Unimodal) and 6.1 (Bimodal); Baseline MAE = 8.0 (Unimodal) and 6.4 (Bimodal) | The deep-learnt features showed significant improvement on prediction. |
| Harati et al., 2018[ | LSTM | Audio of interview during Deep Brain Stimulation treatment | 13 Subjects | HRMD score | Prediction of depression severity | AUC = 0.80 | The model extracted emotion features from patients’ clinical audio utterances. |
| Huang et al., 2019[ | CNN and LSTM | Elicited speech voice data | 15 BDs, 15 UDs, and 15 HCs (Chi-Mei mood dataset) | Human annotation | Short-term detection of mood disorders | ACC = 0.756; Baseline ACC = 0.622 | The CNN was used to generate an emotion profile (EP) of each elicited speech response. The LSTM was used to characterize temporal evolution of EPs of patients. |
| Su et al., 2019[ | Autoencoder and LSTM | Voice and visual data | 13 BDs, 13 UDs, and 13 HCs (Chi-Mei mood dataset) | Human annotation | Prediction of mood disorder | ACC = 0.692; Baseline ACC = 0.498 | Autoencoder generated bottleneck features of the facial expression and speech response. LSTM modeled the temporal information of all elicited responses. The model is able to overcome misdiagnosis of bipolar disorder as unipolar disorder. |
| Social media data | |||||||
| Lin et al., 2014[ | CNN and DFNN | Sina Weibo posts | 11,074 Subjects of stress, 12,230 subjects of no stress | Pattern matching in tweets | Stress detection | ACC = 0.756–0.844 | There are relationships between users’ stress and their tweeting content, social engagement, and behavior patterns. |
| Lin et al., 2014[ | Denoising autoencoder | Hashtag-labeled tweets | 3634 Tweets of affection stress, 3966 tweets of work stress, 5747 tweets with social stress, 13,973 tweets of physiological stress, 14,543 tweets of other stress, and 14,931 tweets of no stress | User-labeled hashtag | Stress detection | ACC = 0.823; Baseline ACC = 0.597 | Detection results were improved by using deep neural network models. |
| Gkotsis et al., 2017[ | CNN and DFNN | Reddit posts | 538,245 Posts related to 11 mental themes, 476,388 non-mental health postsi | Human annotation | Identification of posts related to mental illness | ACC = 0.911 (binary classification) and 0.714 (multiclass classification); Baseline ACC = 0.908 (binary classification) and 0.708 (multiclass classification) | (1) The most common misclassification is depression; (2) Some of the themes are highly inter-related and not always distinguishable as separate and exclusive classes. |
| Li et al., 2017[ | RNN | Tencent Weibo posts | 29,232 Posts of 124 students, containing 122 study-related stressor events | Human annotation | Prediction of adolescent stress | MSE = 0.19; Baseline MSE = 0.25 | The model incorporated relationships of stressor events and improved the prediction of stress in adolescents. |
| Lin et al., 2017[ | CNN | Sina Weibo posts, Tencent Weibo posts, and Twitter posts; social interactions | 11,074 Subjects of stress, 12,230 subjects of no stress | Pattern matching in tweets | Stress detection | ACC = 0.916 | A user’s stress state is closely related to that of their friends on social media. |
| Sadeque et al., 2017[ | GRU | Reddit posts | 136 Depressed subjects, 752 HCs | Self-declaration of depression in posts | Prediction of early depression | F-score = 0.64; Baseline F-score = 0.40 | The RNN captured sequential information from texts with sequential property. |
| Cong et al., 2018[ | LSTM | The Reddit Self-reported Depression Diagnosis (RSDD) dataset | 9000 Depressed subjects, 107,000 HCs | Self-declaration of depression in posts | Prediction of depression | F-score = 0.60; Baseline F-score = 0.44 | The model reduced data imbalance and enhanced classification capacity. |
| Coppersmith et al., 2018[ | LSTM | Social media posts | 418 Users with suicide attempts; number of HC not specified | Self-declaration of depression in posts | Prediction of suicidal risk | AUC = 0.94 | The LSTM captured contextual information between words and better obtained nuances of language related to mental health. |
| Du et al., 2018[ | CNN and RNN | Twitter posts | 1,962,766 Tweets | Suicide-related keywords matching | Identification of suicide-related psychiatric stressors | ACC = 0.74 (CNN) and 0.72 (RNN); Baseline ACC = 0.703 | The CNN- and RNN-based models obtained better performance at identifying suicide-related tweets and psychiatric stressors, respectively. |
| Ive et al., 2018[ | GRU | Social media posts | 538,245 Posts related to 11 mental themes, 476,388 non-mental health posts | Human annotation | Classification of media text related to mental health | ACC = 0.76 | RNN has the intrinsic ability of considering input in its sequence and the hierarchical structure is beneficial for the analysis of health-related online text. |
| Fraga et al., 2018[ | RNN | Reddit posts | 261,511 Posts and 1,256,669 comments from 105,878 users related to depression, 44,541 users related to SuicideWatch, 43,321 users related to anxiety, 13,939 users related to BDj | Keywords matching | Analysis of four subreddits (anxiety, bipolar, depression, and suicide) related to mental health disorders | – | (1) Interaction patterns are very similar across the subreddits and interactions are centered around content rather than users; (2) the four subreddits share a common language. |
| Alambo et al., 2019[ | RNN | Reddit posts | 4992 Posts of 500 users | Human annotation | Prediction of suicidal risk | – | This study generated a gold standard dataset of suicide posts with their risk levels and formed a basis for the next step of constructing conversational agents that elicit suicide-related natural conversation on the basis of questions. |
ACC accuracy, ADHD attention-deficit hyperactivity disorder, ASD autism spectrum disorder, AUC area under the receiver operating characteristic curve, AVEC Audio-Visual Emotion recognition Challenge, BD bipolar disorder, BDI-II Beck Depression Inventory II, CNN convolutional neural network, DASS-21 Depression Anxiety stress scale, DBN deep belief network, DFNN deep feedforward neural network, GRU gated recurrent unit network, HC healthy control, HRSD Hamilton Rating Scale for Depression, LSTM long short-term memory network, MAE mean absolute error, MSE mean squared error, NMMAE normalized macro mean absolute error, PHQ-8 Patient Health Questionnaire eighth version, PHQ-9 Patient Health Questionnaire ninth version, RNN recurrent neural network, SCID-I Structured Clinical Interview for DSM-IV, SNP single-nucleotide polymorphism, TDC typical developing control, UD unipolar depression
a ADHD-200 dataset, http://fcon_1000.projects.nitrc.org/indi/adhd200/
b ABIDE dataset, http://fcon_1000.projects.nitrc.org/indi/abide/
c https://openneuro.org/datasets/ds000030
d NUSDAST dataset, http://schizconnect.org
e https://www.i2b2.org/NLP/RDoCforPsychiatry/
f https://genomeinterpretation.org/content/4-bipolar-exomes
g PsychENCODE Consortium dataset, https://www.nimhgenetics.org/resources/psychencode
h http://youngjuncho.com/datasets/
i https://www.reddit.com/comments/3mg812
j http://files.pushshift.io/reddit/
k https://github.com/mihaelacr/pydeeplearn
l https://github.com/trangptm/DeepCare
m https://github.com/WGLab/iMEGES
n http://youngjuncho.com/2017/acii2017-open-sources/
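For readers comparing the Performance column, the point metrics abbreviated in the table (ACC, MAE, MSE, F-score) can be computed as in the sketch below. It rests on two assumptions: F-score is taken to be the balanced F1 score for a binary task, and the labels and values are toy data invented for illustration. Individual studies may define their metrics differently, and AUC and NMMAE are omitted.

```python
def accuracy(y_true, y_pred):
    """ACC: fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_score(y_true, y_pred, positive=1):
    """Balanced F1 (assumed): harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mae(y_true, y_pred):
    """MAE: mean absolute difference between reference and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """MSE: mean squared difference between reference and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy classification labels (1 = condition present, 0 = healthy control).
labels = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
print("ACC:", accuracy(labels, predicted))
print("F-score:", f_score(labels, predicted))

# Toy severity scores, e.g. a questionnaire total and a model's estimate.
scores = [3.0, 5.0]
estimates = [2.0, 7.0]
print("MAE:", mae(scores, estimates))
print("MSE:", mse(scores, estimates))
```

Higher is better for ACC and F-score, while lower is better for MAE and MSE, which matters when reading a row's result against its baseline.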
Fig. 4 An illustration of the multimodal deep neural network.
One can model each modality with a specific network and combine them using the final fully connected layers. In this way, the parameters of the entire neural network can be jointly learned through standard backpropagation.
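The fusion scheme in this caption can be sketched as a forward pass. The example is a simplification under stated assumptions: two modalities (labeled audio and video here purely for illustration), one small dense layer per modality standing in for a full modality-specific network, hand-picked weights, and no training loop. The names `dense`, `multimodal_forward`, and `params` are hypothetical.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases, activation):
    """Fully connected layer: one output unit per row of `weights`."""
    return [
        activation(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def multimodal_forward(audio, video, params):
    """Encode each modality separately, concatenate, then fuse."""
    h_audio = dense(audio, params["audio_w"], params["audio_b"], relu)
    h_video = dense(video, params["video_w"], params["video_b"], relu)
    fused = h_audio + h_video  # concatenation of the two feature vectors
    # Final fully connected layer maps the fused vector to one score in (0, 1).
    return dense(fused, params["fusion_w"], params["fusion_b"], sigmoid)[0]


# Illustrative, hand-picked parameters (not learned from data).
params = {
    "audio_w": [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]],  # 3 audio features -> 2 units
    "audio_b": [0.0, 0.1],
    "video_w": [[0.5, -0.5], [0.1, 0.2]],  # 2 video features -> 2 units
    "video_b": [0.0, 0.0],
    "fusion_w": [[0.3, -0.2, 0.1, 0.4]],  # 4 fused units -> 1 output
    "fusion_b": [0.0],
}

score = multimodal_forward([1.0, 0.5, -0.5], [0.2, 0.8], params)
score_other_video = multimodal_forward([1.0, 0.5, -0.5], [0.8, 0.2], params)
print("score:", round(score, 4))
print("score with different video input:", round(score_other_video, 4))
```

Because the output depends on every entry of `params`, a gradient of the prediction error reaches both modality-specific layers and the fusion layer, which is what allows all parameters to be learned jointly.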