Abstract
In predictive model development, gene expression data pose a unique challenge: the number of samples (n) is much smaller than the number of features (p). This "n ≪ p" property has prevented the classification of gene expression data from benefiting from deep learning techniques, which have proven powerful under "n > p" scenarios in other application fields, such as image classification. Further, the sparsity of effective features with unknown correlation structures in gene expression profiles brings more challenges for classification tasks. To tackle these problems, we propose a new classifier, the forest deep neural network (fDNN), which integrates the deep neural network architecture with a supervised forest feature detector. Using this built-in feature detector, the method is able to learn sparse feature representations and feed them into a neural network to mitigate the overfitting problem. Simulation experiments and real data analyses using two RNA-seq expression datasets are conducted to evaluate fDNN's capability. The method is shown to be a useful addition to current predictive models, with better classification performance and more meaningful selected features than ordinary random forests and deep neural networks.
Year: 2018 PMID: 30405137 PMCID: PMC6220289 DOI: 10.1038/s41598-018-34833-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Visualization of the architecture of the fDNN model.
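As a rough illustration of the forest-as-feature-detector idea, the sketch below trains a toy forest and converts each sample into a binary vector holding one prediction per tree; vectors of this kind are what a downstream neural network would consume. This is not the authors' implementation: the use of per-tree predictions as the representation, the depth-one trees (stumps), the helper names (`fit_stump`, `fit_forest`, `forest_features`), and the simulated data are all assumptions made for illustration.

```python
import random


def stump_predict(stump, row):
    """Predict 0/1 from a depth-one tree given as (feature, threshold, polarity)."""
    j, t, polarity = stump
    above = row[j] > t
    return int(above) if polarity == 1 else int(not above)


def fit_stump(X, y, feature_ids):
    """Return the stump with the highest training accuracy over the candidate features."""
    best_stump, best_correct = None, -1
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            for polarity in (0, 1):
                stump = (j, t, polarity)
                correct = sum(stump_predict(stump, r) == lab for r, lab in zip(X, y))
                if correct > best_correct:
                    best_stump, best_correct = stump, correct
    return best_stump


def fit_forest(X, y, n_trees, n_sub, rng):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature subset."""
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        feats = rng.sample(range(p), n_sub)         # random subset of features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def forest_features(forest, X):
    """Map each sample to a binary vector with one entry per tree prediction."""
    return [[stump_predict(s, row) for s in forest] for row in X]


# Simulated "n << p" data (hypothetical): 30 samples, 200 features,
# of which only the first 5 carry signal about the class label.
rng = random.Random(0)
n, p = 30, 200
y = [rng.randrange(2) for _ in range(n)]
X = [[rng.gauss(0, 1) + (1.5 * y[i] if j < 5 else 0.0) for j in range(p)]
     for i in range(n)]

forest = fit_forest(X, y, n_trees=20, n_sub=15, rng=rng)
Z = forest_features(forest, X)
print(len(Z), "samples x", len(Z[0]), "forest features")
```

In this toy setting the 200 raw features are compressed into 20 binary ones, so a network trained on `Z` has far fewer inputs to fit than one trained on `X`. A realistic version would use a library random forest with full-depth trees and a multilayer network.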
Table 1. Classification comparison of the forest Deep Neural Network (fDNN) method, deep neural networks (DNN) and random forests (RF).
| Case | Clustered | | | | | Scattered | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| # true predictors | 10 | 20 | 30 | 40 | 50 | 10 | 20 | 30 | 40 | 50 |
| fDNN | 0.79 | 0.828 | 0.832 | 0.872 | 0.892 | 0.775 | 0.781 | 0.829 | 0.861 | 0.851 |
| DNN_3_256 | 0.762 | 0.791 | 0.809 | 0.829 | 0.865 | 0.75 | 0.727 | 0.822 | 0.823 | 0.836 |
| DNN_4_1024 | 0.76 | 0.754 | 0.76 | 0.836 | 0.833 | 0.742 | 0.724 | 0.774 | 0.846 | 0.805 |
| RF_300 | 0.783 | 0.82 | 0.823 | 0.862 | 0.887 | 0.772 | 0.76 | 0.825 | 0.858 | 0.831 |
| RF_500 | 0.765 | 0.826 | 0.824 | 0.86 | 0.904 | 0.765 | 0.738 | 0.818 | 0.843 | 0.852 |
Values are classification performance measured by the area under the ROC curve (AUC).
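For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition; the function name `auc` and the example inputs are illustrative and not taken from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 class labels.
    scores: iterable of classifier scores, higher meaning "more likely positive".
    Tied positive/negative pairs count as half a win.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The quadratic loop is adequate for datasets of this size; a rank-based formulation gives the same value more efficiently for large sample counts.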
Figure 2. Plots of the classification comparison in Table 1. Cases: (a) clustered, (b) scattered.
Testing results for the GSE99095 dataset.
| Method | Architecture | Testing AUC |
|---|---|---|
| fDNN | 400Trees + 256 + 64 + 16 | |
| DNN | 1024 + 512 + 128 + 16 | 0.949 |
| RF | 1000Trees | 0.897 |
Numbers in the architecture column denote the number of trees in Random Forest and the number of hidden neurons in neural network methods.
Testing results for the GSE106291 dataset.
| Method | Architecture | Testing AUC |
|---|---|---|
| fDNN | 500Trees + 256 + 64 + 16 | |
| DNN | 1024 + 256 + 16 | 0.751 |
| RF | 1000Trees | 0.716 |
Numbers in the architecture column denote the number of trees in Random Forest and the number of hidden neurons in neural network methods.
Figure 3. ROC plots for (a) GSE99095 and (b) GSE106291.
The top 10 GO biological processes overrepresented among the top 1% of genes selected by fDNN from the GSE99095 data, after manual removal of redundant GO terms.
| GO BP ID | P-value | Term | Significant in RF selected genes (p < 0.01) | Significant in genes uniquely selected by fDNN (p < 0.01) |
|---|---|---|---|---|
| GO:0070125 | 0.000319438 | mitochondrial translational elongation | Y | |
| GO:1990542 | 0.000319438 | mitochondrial transmembrane transport | Y | |
| GO:0006119 | 0.000431138 | oxidative phosphorylation | Y | Y |
| GO:0006412 | 0.000524598 | translation | Y | |
| GO:0048534 | 0.000553723 | hematopoietic or lymphoid organ development | Y | |
| GO:0007229 | 0.00166512 | integrin-mediated signaling pathway | Y | |
| GO:0098754 | 0.00166512 | detoxification | Y | |
| GO:0016073 | 0.002434088 | snRNA metabolic process | ||
| GO:0007599 | 0.004203111 | hemostasis | ||
| GO:1903018 | 0.00560232 | regulation of glycoprotein metabolic process | | |
The top 10 GO biological processes overrepresented among the top 2% of genes selected by fDNN from the GSE106291 data, after manual removal of redundant GO terms.
| GO BP ID | P-value | Term | Significant in RF selected genes (p < 0.01) | Significant in genes uniquely selected by fDNN (p < 0.01) |
|---|---|---|---|---|
| GO:0006935 | 0.000640609 | chemotaxis | Y | |
| GO:0002274 | 0.001091917 | myeloid leukocyte activation | Y | |
| GO:0062014 | 0.001389434 | negative regulation of small molecule metabolic process | Y | |
| GO:0016477 | 0.001567641 | cell migration | Y | |
| GO:0045055 | 0.002003684 | regulated exocytosis | Y | |
| GO:0060078 | 0.002129227 | regulation of postsynaptic membrane potential | ||
| GO:0030334 | 0.00244581 | regulation of cell migration | Y | |
| GO:0030501 | 0.002766529 | positive regulation of bone mineralization | ||
| GO:0061045 | 0.002925766 | negative regulation of wound healing | ||
| GO:0071320 | 0.003321428 | cellular response to cAMP | Y |
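
Overrepresentation p-values such as those in the GO tables are commonly obtained from a hypergeometric (one-sided Fisher) test; the text here does not state which test was used, so the sketch below is an assumption about the general approach rather than a description of the study's analysis. The function name `hypergeom_upper_tail` and all counts in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric.

    N: genes in the background set.
    K: background genes annotated to the GO term.
    n: genes in the selected list.
    k: selected genes annotated to the GO term.
    """
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical counts: 20000 background genes, 150 annotated to a term,
# 200 genes selected, 8 of them annotated to the term.
p_value = hypergeom_upper_tail(N=20000, K=150, n=200, k=8)
print(f"overrepresentation p-value: {p_value:.3g}")
```

A small p-value indicates that the term appears among the selected genes more often than expected if genes were drawn at random from the background.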