Rui Xie, Jia Wen, Andrew Quitadamo, Jianlin Cheng, Xinghua Shi.
Abstract
BACKGROUND: Gene expression is a key intermediate level through which genotypes lead to a particular trait. Gene expression is affected by various factors, including the genotypes of genetic variants. With the aim of delineating the genetic impact on gene expression, we build a deep auto-encoder model to assess how well genetic variants contribute to gene expression changes. This new deep learning model is a regression-based predictive model that combines a Multilayer Perceptron with a Stacked Denoising Auto-encoder (MLP-SAE). The model is trained using a stacked denoising auto-encoder for feature selection and a multilayer perceptron framework for backpropagation. We further improve the model by introducing dropout to prevent overfitting and improve performance.
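The architecture described above (two stacked auto-encoder layers feeding a regression output, with dropout on the hidden units) can be sketched in a few lines of numpy. This is a toy illustration, not the authors' implementation; the layer sizes, weight scales, and dropout rate are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MLPSAESketch:
    """Toy MLP-SAE-style network: two hidden layers standing in for the
    pre-trained denoising auto-encoders, then a linear regression output.
    Layer sizes and the dropout rate are illustrative assumptions."""

    def __init__(self, n_in, n_h1, n_h2, n_out, dropout=0.5):
        self.W1 = rng.normal(0, 0.1, (n_in, n_h1))
        self.W2 = rng.normal(0, 0.1, (n_h1, n_h2))
        self.W3 = rng.normal(0, 0.1, (n_h2, n_out))
        self.dropout = dropout

    def forward(self, X, train=False):
        h1 = sigmoid(X @ self.W1)          # features from auto-encoder 1
        h2 = sigmoid(h1 @ self.W2)         # features from auto-encoder 2
        if train:                          # inverted dropout on hidden units
            mask = rng.random(h2.shape) >= self.dropout
            h2 = h2 * mask / (1.0 - self.dropout)
        return h2 @ self.W3                # regression output layer

# toy genotype matrix (samples x variants, coded 0/1/2) -> expression (samples x genes)
X = rng.integers(0, 3, size=(8, 20)).astype(float)
model = MLPSAESketch(n_in=20, n_h1=10, n_h2=5, n_out=3)
pred = model.forward(X)
print(pred.shape)  # (8, 3)
```

Dropout is applied only at training time; at prediction time all hidden units are used, which is why `forward` takes a `train` flag.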
Keywords: Deep learning; Gene expression; Multilayer perceptron; Predictive model; Stacked denoising auto-encoder
Year: 2017 PMID: 29219072 PMCID: PMC5773895 DOI: 10.1186/s12864-017-4226-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 2 An Overview of Our MLP-SAE Model. The input layer takes the pre-processed data. Auto-encoder 1 and auto-encoder 2 serve as hidden layers for the prediction model and are trained using backpropagation. The output layer is built on a regression model to make the final predictions
Fig. 1 An Illustration of an Auto-encoder Corruption Model. The raw input X is corrupted via process q. The black nodes denote the corrupted input. The corrupted input is converted to Y via process f. Afterwards, Y attempts to reconstruct the raw input via process g, and generates the reconstruction Z. A loss function L(X,Z) is used to calculate the reconstruction error for backpropagation
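The corruption-and-reconstruction cycle in this caption (q, f, g, and L(X,Z)) can be written out directly. The snippet below is a minimal numpy sketch; masking noise for q, tied weights for f/g, and a squared-error loss are common but assumed choices here, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def corrupt(X, level=0.3):
    """Process q: zero out a random fraction of the raw input entries."""
    return X * (rng.random(X.shape) >= level)

n_in, n_hidden = 12, 6
W = rng.normal(0, 0.1, (n_in, n_hidden))

X = rng.random((16, n_in))        # raw input X
Xc = corrupt(X)                   # corrupted input (the black nodes)
Y = sigmoid(Xc @ W)               # process f: encode corrupted input to Y
Z = sigmoid(Y @ W.T)              # process g: reconstruct Z (tied weights)
loss = np.mean((X - Z) ** 2)      # L(X, Z): reconstruction error
print(loss)
```

The key point the figure makes is that the loss compares Z against the clean input X, not the corrupted one, which forces the encoder to learn noise-robust features.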
Fig. 3 An Overall Workflow of the MLP-SAE Model. After pre-processing the input data, two layers of denoising auto-encoders are applied and the final regression layer produces the predicted gene expression quantification. The model is trained using pre-training and backpropagation to optimize the objective function
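The pre-training-then-fine-tuning workflow can be sketched as follows: each denoising auto-encoder layer is pre-trained greedily on the output of the previous one, and a regression layer is then fit on top. All sizes, rates, and the tied-weight/squared-error choices are illustrative assumptions, and a least-squares fit stands in for the backpropagation fine-tuning stage for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dae(X, n_hidden, noise=0.3, lr=0.1, epochs=100):
    """Greedily pre-train one denoising auto-encoder layer.
    Tied weights and a squared-error loss are illustrative assumptions."""
    n, n_in = X.shape
    W = rng.normal(0, 0.1, (n_in, n_hidden))
    for _ in range(epochs):
        Xc = X * (rng.random(X.shape) >= noise)  # corrupt the input
        Y = sigmoid(Xc @ W)                      # encode
        Z = sigmoid(Y @ W.T)                     # reconstruct
        dZ = (Z - X) * Z * (1 - Z)               # d(loss)/d(decoder pre-activation)
        dY = (dZ @ W) * Y * (1 - Y)              # backprop into the encoder
        W -= lr * (dZ.T @ Y + Xc.T @ dY) / n     # tied-weight gradient step
    return W

# stack two pre-trained layers, then fit the regression output layer
X = rng.random((30, 12))                         # toy pre-processed genotype data
y = rng.random((30, 2))                          # toy expression targets
W1 = pretrain_dae(X, 8)
H1 = sigmoid(X @ W1)
W2 = pretrain_dae(H1, 4)
H2 = sigmoid(H1 @ W2)
# least-squares stand-in for the backpropagation fine-tuning stage
W_out, *_ = np.linalg.lstsq(H2, y, rcond=None)
pred = H2 @ W_out
print(pred.shape)  # (30, 2)
```

Each `pretrain_dae` call only sees the representation produced by the layer below it, which is what makes the pre-training greedy and layer-wise.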
Comparison of Lasso, Random Forests, and the MLP-SAE Model

| Method | Hyperparameter | Hyperparameter value | MSE |
|---|---|---|---|
| Lasso |  | 0.05 | 0.3516 |
|  |  | 0.1 | 0.3182 |
|  |  | 0.2 | 0.3002 |
|  |  | 0.3 | 0.2951 |
|  |  | 0.4 | 0.2930 |
|  |  | 0.5 | 0.2918 |
|  |  | 0.6 | 0.2914 |
|  |  | 0.7 |  |
|  |  | **0.8** | **0.2912** |
| Random forests | Number of estimators | 10 | 0.3221 |
|  |  | 20 | 0.3127 |
|  |  | 30 | 0.3080 |
|  |  | 40 | 0.3001 |
|  |  | 50 | 0.2989 |
|  |  | 60 | 0.3003 |
|  |  | 70 | 0.2986 |
|  |  | 100 | 0.3003 |
|  |  | **150** | **0.2974** |
| MLP-SAE model | Learning rate | 0.01 | 0.2909 |
|  |  | **0.001** | **0.2895** |
|  |  | 0.0001 | 0.2908 |
|  |  | 0.00001 | 0.2918 |
Each row lists one hyperparameter setting and the corresponding MSE for each model. Bold values mark the hyperparameter and corresponding MSE of the optimal model for each of the three methods
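Reading off the comparison table, model selection amounts to picking the hyperparameter with the lowest MSE per method. The snippet below reproduces that from the transcribed values (the Lasso 0.7 entry is missing from the table and is therefore omitted):

```python
# MSE values transcribed from the comparison table (lower is better)
results = {
    "Lasso": {0.05: 0.3516, 0.1: 0.3182, 0.2: 0.3002, 0.3: 0.2951,
              0.4: 0.2930, 0.5: 0.2918, 0.6: 0.2914, 0.8: 0.2912},
    "Random forests": {10: 0.3221, 20: 0.3127, 30: 0.3080, 40: 0.3001,
                       50: 0.2989, 60: 0.3003, 70: 0.2986, 100: 0.3003,
                       150: 0.2974},
    "MLP-SAE": {0.01: 0.2909, 0.001: 0.2895, 0.0001: 0.2908,
                0.00001: 0.2918},
}

# best (hyperparameter, MSE) pair per method
best = {m: min(vals.items(), key=lambda kv: kv[1]) for m, vals in results.items()}
for method, (hp, mse) in best.items():
    print(f"{method}: best hyperparameter {hp}, MSE {mse}")
# MLP-SAE attains the lowest MSE (0.2895) of the three methods
```

This matches the paper's conclusion that the MLP-SAE model outperforms the Lasso and random forest baselines on held-out MSE.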
Number of Genes Within R² Bins for the MLP-SAE and MLP-SAE with Dropout Models

| R² bin | MLP-SAE | MLP-SAE with Dropout |
|---|---|---|
| (0,0.05] | 3621 | 3507 |
| (0.05,0.1] | 1128 | 1121 |
| (0.1,0.2] | 1111 | 1086 |
| (0.2,0.3] | 436 | 493 |
| (0.3,0.4] | 181 | 229 |
| (0.4,0.5] | 96 | 110 |
| (0.5,0.6] | 23 | 43 |
| (0.6,0.7] | 8 | 13 |
| (0.7,0.8] | 0 | 2 |
For each gene, R² is calculated between the true and estimated expression values using the MLP-SAE model or the MLP-SAE model with dropout
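The per-gene binning described in this note can be reproduced as follows. This is a sketch on synthetic data; `r_squared` uses the standard coefficient-of-determination formula, which is an assumed choice since the paper's record here does not spell out its exact R² definition.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination for one gene across samples."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def bin_counts(r2_values, edges):
    """Count genes whose R^2 falls in each half-open bin (lo, hi]."""
    return {f"({lo},{hi}]": int(np.sum((r2_values > lo) & (r2_values <= hi)))
            for lo, hi in zip(edges[:-1], edges[1:])}

rng = np.random.default_rng(2)
true = rng.random((100, 50))                     # 100 samples x 50 genes (toy)
pred = true + rng.normal(0, 0.3, true.shape)     # noisy stand-in predictions
r2 = np.array([r_squared(true[:, g], pred[:, g]) for g in range(50)])
edges = [0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
counts = bin_counts(r2, edges)
print(counts)
```

Note that genes with R² ≤ 0 (predictions worse than the per-gene mean) fall outside every bin, which is consistent with the table's bins starting at (0, 0.05].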
Fig. 4 Predictions Using the MLP-SAE Model with Dropout Are More Correlated with the True Gene Expression than Predictions from the MLP-SAE Model Without Dropout. The X axis denotes the correlation bins between true and predicted gene expression values, and the Y axis represents the log of the number of genes in each correlation bin
Fig. 5 True Expression and Predicted Expression of All Genes Using MLP-SAE with Dropout
Fig. 6 True Expression and Predicted Expression of Selected Genes Using MLP-SAE with Dropout