| Literature DB >> 31969903 |
Xiaohui Niu, Kun Yang, Ge Zhang, Zhiquan Yang, Xuehai Hu.
Abstract
Deciphering the code of cis-regulatory elements (CREs) is one of the core issues of today's biology. Enhancers are distal CREs that play significant roles in gene transcriptional regulation. Although identifying enhancer locations across the whole genome [discriminative enhancer prediction (DEP)] is necessary, it is more important to predict in which specific cell or tissue types they will be activated and functional [tissue-specific enhancer prediction (TSEP)]. Although existing deep learning models have achieved great success in DEP, they cannot be directly employed for TSEP because a specific cell or tissue type offers only a limited number of enhancer samples for training. Here, we first adopted a previously reported deep learning architecture and then developed a novel training strategy named the "pretraining-retraining strategy" (PRS) for TSEP by decomposing the whole training process into two successive stages: a pretraining stage trains on the whole enhancer dataset to perform DEP, and a retraining stage then trains on tissue-specific enhancer samples, starting from the trained pretraining model, to make TSEPs. As a result, PRS is found to be valid for DEP, with an AUC of 0.922 and a GM (geometric mean) of 0.696, when tested on a large-scale FANTOM5 enhancer dataset via five-fold cross-validation. Interestingly, starting from the trained pretraining model, a new finding is that only twenty additional epochs are needed to complete the retraining process on each of 23 tested tissues or cell lines. For TSEP tasks, PRS achieved a mean GM of 0.806, significantly higher than the 0.528 of gkm-SVM, an existing mainstream method for CRE prediction. Notably, PRS is further shown to be superior to two other state-of-the-art methods: DEEP and BiRen. In summary, PRS employs useful ideas from the domain of transfer learning and is a reliable method for TSEP.
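The two-stage PRS training described above can be sketched in miniature. The paper's actual model is a hybrid deep network trained on one-hot DNA sequences; the sketch below substitutes a plain logistic regression on synthetic features, so `make_data`, the epoch counts, and the learning rate are hypothetical placeholders, not the authors' code. Only the structure matters: a long pretraining run on the large generic dataset, then a short retraining run on a small tissue-specific dataset that warm-starts from the pretrained weights.

```python
# Minimal sketch of the pretraining-retraining strategy (PRS), assuming
# a stand-in logistic-regression model instead of the paper's hybrid CNN.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, d=16):
    """Hypothetical stand-in for one-hot encoded enhancer features."""
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = (X @ w_true > 0).astype(float)
    return X, y

def train(X, y, w=None, epochs=100, lr=0.1):
    """Batch gradient descent on logistic loss; passing `w` lets the
    retraining stage start from pretrained weights instead of scratch."""
    n, d = X.shape
    if w is None:
        w = np.zeros(d)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        w -= lr * X.T @ (p - y) / n          # gradient step
    return w

# Stage 1: pretrain on the large generic enhancer set (DEP task).
X_big, y_big = make_data(5000)
w_pre = train(X_big, y_big, epochs=100)

# Stage 2: retrain on a small tissue-specific set (TSEP task),
# warm-starting from the pretrained weights -- only 20 extra epochs,
# mirroring the paper's finding.
X_tissue, y_tissue = make_data(200)
w_tsep = train(X_tissue, y_tissue, w=w_pre.copy(), epochs=20)
```

The warm start is the whole point: the small tissue-specific sample would be too little data to train from scratch, but is enough to adapt weights already shaped by the full enhancer corpus.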
Keywords: deep learning; prediction; pretraining; retraining; tissue-specific enhancers
Year: 2020 PMID: 31969903 PMCID: PMC6960260 DOI: 10.3389/fgene.2019.01305
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Flow chart of the hybrid deep learning architecture.
Figure 2. Determining optimal model hyperparameters: filter number, filter length, and pooling size. (A) GM values of grid search over combinations of filter number and filter length. (B) GM values of grid search over combinations of filter length and pooling size. (C) AUC values of grid search over combinations of filter length and pooling size.
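The grid search behind Figure 2 is a standard exhaustive loop over hyperparameter combinations scored by GM. A generic sketch follows; the candidate grids and the scoring function are made-up placeholders (the real score would come from training the CNN and cross-validating), not values from the paper.

```python
# Generic hyperparameter grid search, scored by a placeholder GM function.
from itertools import product

filter_numbers = [64, 128, 256]   # hypothetical candidate grids
filter_lengths = [8, 12, 16]
pooling_sizes  = [2, 4, 8]

def gm_score(fn, fl, ps):
    """Stand-in for training the model with these hyperparameters and
    returning the cross-validated GM; here a dummy formula peaking at
    (128, 12, 4) purely for illustration."""
    return 1.0 / (1 + abs(fn - 128) / 128 + abs(fl - 12) / 12 + abs(ps - 4) / 4)

# Evaluate every combination and keep the best-scoring one.
best = max(product(filter_numbers, filter_lengths, pooling_sizes),
           key=lambda cfg: gm_score(*cfg))
print(best)   # → (128, 12, 4)
```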
Prediction performance of the pretraining stage on large-scale FANTOM5 enhancer data via five-fold cross-validation.
| Enhancer dataset | Sample size | ACC | AUC | SEN | SPE | MCC | GM |
|---|---|---|---|---|---|---|---|
| FANTOM5 enhancer data | 4653 + 46530 | 0.929 | 0.922 | 0.499 | 0.972 | 0.527 | 0.696 |
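The abstract defines GM as the geometric mean (of sensitivity and specificity). As a sanity check, GM and ACC for the FANTOM5 row can be recomputed from the class sizes (4653 positives + 46530 negatives) and the reported SEN/SPE; small discrepancies in other derived metrics (e.g. MCC) would stem from the table's three-decimal rounding.

```python
# Recompute GM and ACC for the FANTOM5 row from reported SEN/SPE.
import math

P, N = 4653, 46530        # positive / negative sample sizes
sen, spe = 0.499, 0.972   # reported sensitivity and specificity

gm = math.sqrt(sen * spe)             # geometric mean of SEN and SPE
acc = (sen * P + spe * N) / (P + N)   # overall accuracy

print(round(gm, 3), round(acc, 3))    # → 0.696 0.929
```

Both values match the table, which also illustrates why GM is the preferred summary here: with a 1:10 class imbalance, ACC (0.929) is dominated by the negatives, while GM (0.696) exposes the much lower sensitivity.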
Figure 3. Determining the optimal pretraining-retraining model and comparison with the classic model without a pretraining stage. (A) Comparison analysis determines the FANTOM-ep20 model to be the optimal pretraining-retraining model. (B) Comparison of GM values between FANTOM-ep20 models and None-ep100 models on 23 different tissues or cell lines.
Figure 4. Comparisons between our FANTOM-ep20 model and the gkm-SVM tool on 23 different tissues or cell lines. (A) One-to-one direct comparison of GM values on each tissue or cell line. (B) Distribution comparisons of GM values and AUC values with box plots.
Comprehensive comparison of the FANTOM-ep20 model with DEEP and BiRen.
| Comparison target | Sample size | Method | ACC | AUC | Sens | Spec | MCC | GM |
|---|---|---|---|---|---|---|---|---|
| Heart | 295 + 2950 | DEEP | 0.822 | NA | 0.802 | 0.824 | NA | 0.812 |
| Heart | 239 + 2390 | FANTOM-ep20ᵃ | 0.946 | 0.963 | 0.664 | 0.976 | 0.669 | 0.805 |
| Liver | 84 + 840 | DEEP | 0.745 | NA | 0.740 | 0.755 | NA | 0.741 |
| Liver | 75 + 750 | FANTOM-ep20 | 0.982 | 0.990 | 0.905 | 0.989 | 0.891 | 0.946 |
| Brain | 639 + 6390 | DEEP | 0.853 | NA | 0.832 | 0.855 | NA | 0.843 |
| Brain | 619 + 6190 | FANTOM-ep20 | 0.906 | 0.915 | 0.630 | 0.933 | 0.501 | 0.766 |
| VISTA | 1747 + 17470 | BiRen | NA | 0.957 | NA | NA | NA | NA |
| VISTA | 1848 + 18480 | FANTOM-ep20 | 0.946 | 0.958 | 0.650 | 0.975 | 0.655 | 0.796 |
ᵃ Our best pretraining-retraining model, obtained by pretraining with large-scale FANTOM enhancer data and retraining for 20 epochs; "NA" indicates values not provided by the original publications.