| Literature DB >> 31931701 |
Xueming Zheng1, Xingli Fu2, Kaicheng Wang3, Meng Wang4.
Abstract
BACKGROUND: MicroRNAs (miRNAs) play important roles in a variety of biological processes by regulating gene expression at the post-transcriptional level. So, the discovery of new miRNAs has become a popular task in biological research. Since the experimental identification of miRNAs is time-consuming, many computational tools have been developed to identify miRNA precursor (pre-miRNA). Most of these computation methods are based on traditional machine learning methods and their performance depends heavily on the selected features which are usually determined by domain experts. To develop easily implemented methods with better performance, we investigated different deep learning architectures for the pre-miRNAs identification.Entities:
Keywords: DNN; Detection; miRNAs
Year: 2020 PMID: 31931701 PMCID: PMC6958766 DOI: 10.1186/s12859-020-3339-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance of the proposed models
| Model (Data set partition) | Sen.(%) | Spe.(%) | F1(%) | MCC(%) | Acc.(%) |
|---|---|---|---|---|---|
| CNN (Training/Evalu./Test) | 88.83 | 88.28 | 88.83 | 77.11 | 88.56 |
| CNN (10-fold CV) | 89.58 ± 4.72 | 84.90 ± 4.84 | 87.53 ± 1.38 | 74.72 ± 3.52 | 87.24 ± 1.80 |
| RNN (Training/Evalu./Test) | 85.71 | 91.28 | 88.35 | 77.03 | 88.43 |
| RNN (10-fold CV) | 85.89 ± 3.29 | 91.14 ± 2.75 | 88.09 ± 2.03 | 77.04 ± 3.66 | 88.44 ± 1.80 |
Note: Classification performance of different models on the testing dataset is shown as sensitivity (column 2), specificity (column 3), F1-score (column 4), MCC (column 5) and accuracy (column 6). For the 10-fold CV, performance is shown as mean ± standard error
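The five columns in the table are standard confusion-matrix statistics. As a reference for how they relate, here is a small helper (not from the paper) that computes all of them from TP/TN/FP/FN counts:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the table's metrics from confusion-matrix counts."""
    sen = tp / (tp + fn)                      # sensitivity (recall)
    spe = tn / (tn + fp)                      # specificity
    pre = tp / (tp + fp)                      # precision
    f1 = 2 * pre * sen / (pre + sen)          # F1-score
    mcc = (tp * tn - fp * fn) / math.sqrt(    # Matthews correlation coefficient
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    acc = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    return sen, spe, f1, mcc, acc
```

Multiplying each value by 100 gives the percentages used in the table.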
Fig. 1ROC and PRC of proposed DNN models. ROC (a) and PRC (b) are shown as indicated. The AUC is also shown in (a)
Comparison of model performance on the same benchmark datasets
| Model | hsa | Pseudo |
|---|---|---|
| This work (CNN) | 96 | 88 |
| This work (RNN) | 90 | 92 |
| Average (DT) | 97 | 93 |
| Consensus (NB) | 86 | 86 |
| Consensus (DT) | 99 | 90 |
| Ding (NB) | 88 | 84 |
| Average (NB) | 83 | 89 |
| Ng (DT) | 89 | 89 |
| Consensus | 97 | 96 |
| Batuwita (NB) | 86 | 83 |
| Bentwich (NB) | 92 | 71 |
| Ng (NB) | 86 | 81 |
Note: Prediction results of this work (rows 2–3) and the top ten models (rows 4–13) of the izMiR framework [19]. The values are true prediction rates (TPR, %) achieved by each model on each dataset. Pseudo: negative data, from the coding regions of human RefSeq genes, 8492 hairpins; hsa: positive data, Homo sapiens miRNAs, 1881 sequences
Prediction accuracy on pre-miRNA datasets from other species using the CNN model trained with human data
| Database | Species | # of pre-miRNAs | # of correct predictions | Accuracy (%) |
|---|---|---|---|---|
| miRBase release 22 | | 617 | 564 | 91.41 |
| | | 1234 | 1081 | 87.60 |
| | | 495 | 436 | 88.08 |
| MirGeneDB | | 499 | 498 | 99.80 |
| | | 449 | 446 | 99.33 |
| | | 414 | 411 | 99.28 |
Fig. 2 One-hot encoding and vectorization of pre-miRNA sequence. The seq_struc is the combination of nucleotide/base and the corresponding secondary structure indicated with different symbols. The left bracket "(" means a paired base at the 5′ end. The right bracket ")" means a paired base at the 3′ end. The dot "." means an unpaired base. The encoded sequence is padded with zero vectors to the length of 180
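The 12 one-hot channels implied by the caption and the 180 × 12 input shape correspond to the 4 bases crossed with the 3 structure symbols. A minimal sketch of this encoding (the base and symbol ordering here is my own choice, not necessarily the authors'):

```python
import numpy as np

BASES = "AUGC"     # nucleotide alphabet (ordering is illustrative)
STRUCTS = "(.)"    # dot-bracket symbols: paired 5', unpaired, paired 3'
MAX_LEN = 180      # sequences are zero-padded to this length

def one_hot_encode(seq, struct):
    """Encode a pre-miRNA sequence plus its dot-bracket structure as a
    MAX_LEN x 12 matrix: one channel per (base, structure symbol) pair."""
    mat = np.zeros((MAX_LEN, len(BASES) * len(STRUCTS)))
    for i, (b, s) in enumerate(zip(seq, struct)):
        mat[i, BASES.index(b) * len(STRUCTS) + STRUCTS.index(s)] = 1.0
    return mat
```

Each sequence position activates exactly one channel; padding rows stay all-zero.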
Fig. 3 The proposed CNN and RNN architectures for pre-miRNA prediction. a. CNN model. The pre-miRNA sequence is treated as a 180 × 12 × 1 vector. There are three cascades of convolution and max-pooling layers followed by two fully connected layers. The shapes of the tensors in the model are indicated by height × width × channels. FC: fully connected layer with 32 units. b. RNN model. Three LSTM layers with 128, 64 and 2 units respectively are shown in the RNN. The final output is passed through a softmax function that yields a probability distribution over labels. In each time step along the pre-miRNA sequence, the LSTM cells remember or ignore old information passed along the arrows. The output is the probability distribution over the true or false labels.
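The "remember or ignore" behavior described for the LSTM cells is implemented by the forget and input gates of a single LSTM time step. A minimal numpy sketch (the stacked weight layout and gate ordering here are illustrative, not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step with hidden size n. W: (4n, input_dim),
    U: (4n, n), b: (4n,) hold the four gates' parameters stacked."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # stacked gate pre-activations
    f = sigmoid(z[:n])                # forget gate: how much old state to keep
    i = sigmoid(z[n:2 * n])           # input gate: how much new info to write
    o = sigmoid(z[2 * n:3 * n])       # output gate
    g = np.tanh(z[3 * n:])            # candidate cell state
    c = f * c_prev + i * g            # keep old information + add new
    h = o * np.tanh(c)                # hidden state passed to the next step
    return h, c
```

Stacking three such layers (128, 64 and 2 units) and applying a softmax to the final 2-unit output reproduces the shape of the architecture in panel b.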