| Literature DB >> 31856829 |
Léon-Charles Tranchevent1,2, Francisco Azuaje1,3, Jagath C Rajapakse4.
Abstract
BACKGROUND: The availability of high-throughput omics datasets from large patient cohorts has allowed the development of methods that aim at predicting patient clinical outcomes, such as survival and disease recurrence. Such methods are also important to better understand the biological mechanisms underlying disease etiology and development, as well as treatment responses. Recently, different predictive models, relying on distinct algorithms (including Support Vector Machines and Random Forests) have been investigated. In this context, deep learning strategies are of special interest due to their demonstrated superior performance over a wide range of problems and datasets. One of the main challenges of such strategies is the "small n large p" problem. Indeed, omics datasets typically consist of small numbers of samples and large numbers of features relative to typical deep learning datasets. Neural networks usually tackle this problem through feature selection or by including additional constraints during the learning process.Entities:
Keywords: Clinical outcome prediction; Deep learning; Deep neural network; Disease prediction; Graph topology; Machine learning; Network-based methods
Year: 2019 PMID: 31856829 PMCID: PMC6923884 DOI: 10.1186/s12920-019-0628-y
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Details about the four expression datasets used in the present study
| Name | Reference | Data type | Size | Usage |
|---|---|---|---|---|
| Fischer-M | Zhang et al., 2014 | Microarray | 498 × 43,291 | Training, testing |
| Fischer-R | Zhang et al., 2014 | RNA-seq | 498 × 43,827 | Training, testing |
| Maris | Wang et al., 2006 | Microarray | 92 × 12,625 | Testing |
| Versteeg | Molenaar et al., 2012 | Microarray | 88 × 40,918 | Testing |
List of the possible data configurations (topological feature sets, datasets) used to train classification models
| Dataset | Topological features | Total size |
|---|---|---|
| Fischer-M | Centralities | 12 |
| Fischer-M | Modularities | {30, 39}ᵃ |
| Fischer-M | Both | {42, 51} |
| Fischer-R | Centralities | 12 |
| Fischer-R | Modularities | {36, 47}ᵃ |
| Fischer-R | Both | {48, 59} |
| Fischerᵇ | Centralities | 24 |
| Fischerᵇ | Modularities | {75, 77}ᵃ |
| Fischerᵇ | Both | {99, 101}ᵃ |
ᵃ The number of modules per graph differs between the two clinical outcomes of interest
ᵇ Combined dataset in which the topological features of both ‘Fischer-M’ and ‘Fischer-R’ are concatenated
Fig. 1 General workflow of the proposed method. Our strategy relies on a topological analysis to reduce the dimensionality of both the training (light green) and test data (dark green). Data matrices are transformed into graphs, from which topological features are extracted. Although the original features (light blue) differ between datasets, the topological features extracted from the graphs (dark blue) have the same meaning and are directly comparable. These features are then used to train and test several models that rely on different learning algorithms (DNN, SVM and RF). The models are compared based on the accuracy of their predictions on the test data
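The feature-extraction step of this workflow can be sketched in a few lines. Note that the correlation-threshold graph construction, the toy threshold of 0.2, the four centrality measures, and the use of `networkx` are all illustrative assumptions, not the authors' documented choices:

```python
# Sketch of the Fig. 1 pipeline up to topological feature extraction.
# The graph construction and the centrality set are assumptions.
import numpy as np
import networkx as nx

def expression_to_graph(X, threshold=0.2):
    """Build a sample similarity graph from an expression matrix X
    (samples x genes) by linking highly correlated samples."""
    corr = np.corrcoef(X)                    # sample-by-sample correlations
    G = nx.Graph()
    G.add_nodes_from(range(X.shape[0]))
    for i in range(X.shape[0]):
        for j in range(i + 1, X.shape[0]):
            if corr[i, j] >= threshold:
                G.add_edge(i, j)
    return G

def centrality_features(G):
    """One row of comparable topological features per sample."""
    measures = [
        nx.degree_centrality(G),
        nx.closeness_centrality(G),
        nx.betweenness_centrality(G),
        nx.pagerank(G),
    ]
    return np.array([[m[n] for m in measures] for n in G.nodes])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))               # toy 20-sample expression matrix
F = centrality_features(expression_to_graph(X))
print(F.shape)                               # 20 samples x 4 topological features
```

Because such features describe a sample's position in the graph rather than specific probes, feature matrices built this way from different platforms share the same columns, which is what makes cross-dataset testing possible.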
Fig. 2 Model performance for different inputs. DNN models relying on different feature sets are compared by reporting their performance on the validation data for ‘Death from disease’ (a) and ‘Disease progression’ (b). Feature sets are defined by the original data used (microarray data, RNA-seq data or the integration of both) and by the topological features considered (centrality, modularity or both). Each point represents a single model. For each feature set, several models are trained by varying the neural network architecture and by performing replicates
Best-performing DNN architectures.
| Configuration | Architecture | Balanced accuracy |
|---|---|---|
| Clinical outcome = ‘Death from disease’ | | |
| Fischer-M, centralities | [8,8,8,2] | |
| Fischer-M, modularities | [8,4] | 83.9% |
| Fischer-M, both | [8,8,8] | 86.8% |
| Fischer-R, centralities | [8,8,8,4] | 85.8% |
| Fischer-R, modularities | [8,8,8,2] | 82.1% |
| Fischer-R, both | [2,2,2,2] | 85.2% |
| Fischerᵃ, centralities | [8,2,2] | 86.1% |
| Fischerᵃ, modularities | [8,2,2] | 84.7% |
| Fischerᵃ, both | [8,8,4] | 84.7% |
| Clinical outcome = ‘Disease progression’ | | |
| Fischer-M, centralities | [8,8,8,2] | 84.3% |
| Fischer-M, modularities | [8,8,2] | 82.3% |
| Fischer-M, both | [4,4,2] | 83.7% |
| Fischer-R, centralities | [8,8,4] | 83.7% |
| Fischer-R, modularities | [8,2,2] | 79.1% |
| Fischer-R, both | [8,8,8,8] | 77.9% |
| Fischerᵃ, centralities | [4,2,2,2] | |
| Fischerᵃ, modularities | [8,8] | 79.6% |
| Fischerᵃ, both | [4,2] | 81.5% |
One row corresponds to the best model for a given clinical outcome and configuration (from Table 2). The best performance (i.e., balanced accuracy) is displayed in bold for each clinical outcome
ᵃ Combined dataset in which the topological features of both ‘Fischer-M’ and ‘Fischer-R’ are concatenated
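The bracketed architectures above list hidden-layer sizes, e.g. [8,8,8] is three hidden layers of eight units each. A minimal sketch of training such a network, using scikit-learn's `MLPClassifier` as a stand-in for whichever deep learning framework the authors used, and toy data in place of the topological features:

```python
# Minimal sketch of training an [8,8,8] DNN as in Table 3. The data are
# synthetic stand-ins; optimizer and learning rate follow the DNN row of
# Table 4, while the authors' dropout setting has no MLPClassifier
# equivalent and is omitted here.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 42))       # toy stand-in for 42 topological features
y_train = (X_train[:, 0] > 0).astype(int)  # toy binary clinical outcome
X_val = rng.normal(size=(100, 42))
y_val = (X_val[:, 0] > 0).astype(int)

model = MLPClassifier(
    hidden_layer_sizes=(8, 8, 8),  # the [8,8,8] architecture notation
    solver="adam",                 # optimizer from Table 4
    learning_rate_init=1e-3,       # learning rate from Table 4
    max_iter=2000,
    random_state=0,
)
model.fit(X_train, y_train)
acc = balanced_accuracy_score(y_val, model.predict(X_val))
print(f"validation balanced accuracy: {acc:.3f}")
```

Balanced accuracy, the metric reported throughout the tables, averages the recall of both classes, which keeps scores honest on the imbalanced clinical outcomes studied here.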
Parameter optimization for all classifiers.
| Algorithm | Parameters | Balanced accuracy |
|---|---|---|
| Clinical outcome = ‘Death from disease’ | | |
| Data= | ||
| DNN [8,8,8,2] | o=Adam, lr=1e-3, d=0.3 | |
| GEDFNᵃ | lr=1e-2, h=[64,16], b=8 | 79.5% (+8.6) |
| SVM | t=RBF, c=64, g=0.25 | 75.4% (+5.9) |
| RF | n=100 | 75.1% (+3.1) |
| Clinical outcome = ‘Disease progression’ | | |
| Data= | ||
| DNN [4,2,2,2] | o=Adam, lr=1e-3, d=0.3 | |
| GEDFNᵃ | lr=1e-4, h=[16,4], b=32 | 81.2% (+0.4) |
| SVM | t=RBF, c=16, g=0.0625 | 81.8% (+2.0) |
| RF | n=100 | 78.1% (+3.2) |
One row corresponds to the best model for a given clinical outcome and algorithm. The optimal parameter values are provided (o: optimizer, lr: learning rate, d: dropout, h: sizes of the second and third GEDFN hidden layers, b: batch size, t: SVM kernel type, c: cost, g: gamma, n: number of trees). The gain in balanced accuracy with respect to the models run with default parameters is indicated in parentheses (from Table 3 for DNN)
ᵃ For GEDFN, the corresponding omics data are used as input instead of the topological features
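The reported SVM optima (RBF kernel with c=64, g=0.25) are powers of two, which suggests a grid search. A sketch of that kind of search with scikit-learn's `GridSearchCV`; the grid ranges and the toy data are assumptions, not the authors' documented search space:

```python
# Sketch of a parameter search like the one behind Table 4's SVM row.
# Power-of-two grids are assumed from the reported optima.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 12))              # toy 12 centrality features
y = (X[:, :2].sum(axis=1) > 0).astype(int)  # toy clinical outcome

grid = {
    "C": [2 ** k for k in range(-2, 8)],      # cost (c)
    "gamma": [2 ** k for k in range(-6, 2)],  # RBF width (g)
}
search = GridSearchCV(
    SVC(kernel="rbf"),
    grid,
    scoring="balanced_accuracy",  # the metric reported in the tables
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```

Scoring the search with balanced accuracy rather than plain accuracy matters here, since it is the metric the resulting models are ranked by.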
External validation results (balanced accuracy).
| Training | Test | DNN | SVM | RF |
|---|---|---|---|---|
| Clinical outcome = ‘Death from disease’ | | | | |
| Data = centralities | | | | |
| Fischer-M | | | | |
| | Fischer-R | 53.5% | 66.8% | |
| | Maris | 53.1% | 50.0% | |
| | Versteeg | 53.3% | 67.5% | |
| Fischer-R | | | | |
| | Fischer-M | 75.4% | 61.2% | |
| | Maris | 49.7% | 50.0% | |
| | Versteeg | 68.3% | 67.5% | |
| Clinical outcome = ‘Disease progression’ | | | | |
| Data = centralities | | | | |
| Fischer-M | | | | |
| | Fischer-R | 75.2% | 71.8% | |
| | Maris | 66.0% | 53.8% | |
| | Versteeg | 78.1% | 78.1% | |
| Fischer-R | | | | |
| | Fischer-M | 76.8% | 75.0% | |
| | Maris | 58.8% | 58.8% | |
| | Versteeg | 77.2% | 73.9% | |
Models are trained using one of the ‘Fischer’ datasets and then tested using either the other ‘Fischer’ dataset or another independent dataset (‘Maris’ or ‘Versteeg’). The ‘Maris’ and ‘Versteeg’ datasets are too small to be used for training and are therefore only used for validation. Rows in italics represent reference models (training and test data extracted from the same dataset)
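The external validation protocol can be sketched as follows: fit a classifier on one cohort's topological features and score it on the independent cohorts. Cohort names and sizes follow Table 1, but the feature matrices here are random stand-ins, and a random forest is used purely as one of the paper's three classifier families:

```python
# Sketch of the cross-dataset validation protocol of Table 5.
# All data are synthetic stand-ins for the topological feature matrices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(2)

def toy_cohort(n, n_features=12):
    """Random stand-in for one cohort's centrality feature matrix."""
    X = rng.normal(size=(n, n_features))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy clinical outcome
    return X, y

# Cohort sizes follow Table 1; the data are synthetic.
cohorts = {name: toy_cohort(n)
           for name, n in [("Fischer-M", 498), ("Fischer-R", 498),
                           ("Maris", 92), ("Versteeg", 88)]}

X_tr, y_tr = cohorts["Fischer-M"]            # train on a single cohort
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

results = {}
for name in ("Fischer-R", "Maris", "Versteeg"):  # independent test cohorts
    X_te, y_te = cohorts[name]
    results[name] = balanced_accuracy_score(y_te, clf.predict(X_te))
print(results)
```

The key design point is that no row of any test cohort influences training: each external score estimates how the model transfers to a cohort measured on a different platform, which is exactly what the shared topological feature space is meant to enable.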