| Literature DB >> 30987229 |
Xiu-Qin Liu, Bing-Xiu Li, Guan-Rong Zeng, Qiao-Yue Liu, Dong-Mei Ai.
Abstract
With the rapid development of high-throughput sequencing technology, a large number of transcript sequences have been discovered, and identifying long non-coding RNAs (lncRNAs) among them remains a challenging task. Identifying and cataloguing lncRNAs not only helps us understand life activities more clearly, but also helps us further explore and study disease at the molecular level. At present, lncRNAs are detected in two main ways: computationally and experimentally. Owing to the limitations of biological sequencing technology and unavoidable errors in the sequencing process, the performance of these methods is not fully satisfactory. In this paper, we constructed a deep-learning model to effectively distinguish lncRNAs from mRNAs. We used k-mer embedding vectors obtained by training the GloVe algorithm as input features, and set up a deep learning framework comprising a bidirectional long short-term memory (BLSTM) layer and a convolutional neural network (CNN) layer with three additional hidden layers. In testing, our model achieved best values of 97.9%, 96.4% and 99.0% in F1score, accuracy and auROC, respectively, showing better classification performance than the traditional PLEK, CNCI and CPC methods for identifying lncRNAs. We hope that our model will provide effective help in distinguishing mature mRNAs from lncRNAs, and become a potential tool for understanding and detecting diseases associated with lncRNAs.
Keywords: BLSTM; CNN; GloVe; deep learning; k-mer; long non-coding RNAs
Year: 2019 PMID: 30987229 PMCID: PMC6523782 DOI: 10.3390/genes10040273
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. The structure diagram of the model. We first split each input RNA sequence into k-mers using a moving-window approach [21]. Then, based on all k-mer sequences, the k-mer embedding vectors were learned by the unsupervised GloVe method. The embedding layer embedded all k-mers into the vector space, turning each k-mer sequence into a real-valued matrix. The BLSTM layer consisted of two parallel LSTM layers running in opposite directions, to capture long-term dependencies within sequences. The following CNN, with three convolution layers, scanned these results using multiple convolutional filters to obtain different features. The final fully connected layer and logistic layer acted as the classifier, producing the probability and the final decision that the input sequence belongs to the positive or negative class.
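The moving-window k-mer split described in the caption can be sketched as follows; the `k` and `stride` values here are illustrative defaults, not the paper's tuned settings (those are explored in Figure 5):

```python
def split_kmers(seq, k=3, stride=1):
    """Slide a window of width k along the sequence, advancing by `stride`,
    and collect each k-mer (analogous to words in a sentence)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# Example: a short RNA sequence becomes a "sentence" of overlapping 3-mers.
print(split_kmers("AUGGCUA", k=3, stride=1))
# -> ['AUG', 'UGG', 'GGC', 'GCU', 'CUA']
```

With `stride=1` the k-mers overlap heavily; a larger stride shortens the resulting "sentence" at the cost of coverage.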
Figure 2. The process of the embedding stage. All the k-mer sequences obtained in the previous stage are taken as input, the co-occurrence matrix is calculated over the entire k-mer corpus, and the embedding vectors are obtained after training.
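A minimal sketch of the co-occurrence counting that precedes GloVe training; the 1/distance weighting follows the standard GloVe recipe, and the window size here is illustrative:

```python
from collections import defaultdict

def cooccurrence(kmer_sentences, window=2):
    """Count symmetric co-occurrences of k-mers within `window` positions,
    weighting each pair by 1/distance as in the standard GloVe formulation."""
    counts = defaultdict(float)
    for sent in kmer_sentences:
        for i, w in enumerate(sent):
            for j in range(i + 1, min(i + window + 1, len(sent))):
                d = j - i                      # separation between the two k-mers
                counts[(w, sent[j])] += 1.0 / d
                counts[(sent[j], w)] += 1.0 / d
    return counts
```

GloVe then fits embedding vectors so that dot products of k-mer vectors approximate the logarithms of these counts.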
Description of six datasets of mouse and human for lncRNA prediction.
| Dataset | Database | Transcript | Size | Max Length | Min Length | Mean Length |
|---|---|---|---|---|---|---|
| Human1 | RefSeq (version 60) | mRNA | 22,389 | 109,224 | 201 | 3346 |
| | GENCODE.v17 | lncRNA | 22,389 | 91,667 | 200 | 965 |
| Human2 | RefSeq (version 90) | mRNA | 45,550 | 109,224 | 201 | 3346 |
| | GENCODE.v28 | lncRNA | 28,181 | 205,012 | 200 | 1054 |
| Mouse | RefSeq.mouse.2 | mRNA | 15,896 | 24,271 | 224 | 3208 |
| | GENCODE.vM18 | lncRNA | 17,624 | 93,147 | 200 | 1404 |
Database denotes the source of the data, which came from the RefSeq and GENCODE databases. Size denotes the number of sequences each dataset contained, and max, min and mean length denote the maximum, minimum and mean sequence lengths of each dataset in nucleotides (nt), respectively. Note that we deleted any sequence shorter than 200 nt from the lncRNA datasets.
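The 200-nt cutoff mentioned in the note can be sketched as a simple filter; the function and record names here are hypothetical:

```python
def filter_lncrna(records, min_len=200):
    """Drop any candidate lncRNA shorter than min_len nucleotides,
    mirroring the 200-nt cutoff described in the table note."""
    return {name: seq for name, seq in records.items() if len(seq) >= min_len}

# Toy records: tx1 (250 nt) survives, tx2 (150 nt) is removed.
records = {"tx1": "A" * 250, "tx2": "G" * 150}
print(sorted(filter_lncrna(records)))  # -> ['tx1']
```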
Detailed results of our model on each dataset, including cross-entropy loss and accuracy on training and test datasets.
| Dataset | Train Loss | Test Loss | Train Accuracy | Test Accuracy |
|---|---|---|---|---|
| Human1 | 0.143 | 0.167 | 0.966 | 0.959 |
| Human2 | 0.137 | 0.154 | 0.973 | 0.964 |
| Mouse | 0.166 | 0.175 | 0.953 | 0.949 |
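The loss reported above is the standard binary cross-entropy between the true labels and the model's predicted probabilities; a minimal sketch:

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy loss over a batch of predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct predictions give a small loss.
print(round(binary_cross_entropy([1, 0], [0.9, 0.1]), 3))  # -> 0.105
```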
Classification performance of the different methods in the lncRNA prediction experiments.
| Dataset | Tool | Precision | Recall | F1score | Accuracy | auROC |
|---|---|---|---|---|---|---|
| Human1 | PLEK | 0.950 | 0.968 | 0.959 | 0.949 | 0.987 |
| | CNCI | 0.962 | 0.919 | 0.940 | 0.938 | 0.936 |
| | CPC | 0.975 | 0.849 | 0.908 | 0.913 | 0.978 |
| | Word2vec | 0.897 | | 0.936 | 0.917 | 0.969 |
| | Our model | | 0.971 | | | |
| Human2 | PLEK | 0.951 | 0.965 | 0.958 | 0.949 | 0.987 |
| | CNCI | 0.954 | 0.901 | 0.927 | 0.953 | 0.932 |
| | CPC | | 0.836 | 0.908 | 0.897 | 0.982 |
| | Word2vec | 0.900 | | 0.936 | 0.919 | 0.965 |
| | Our model | **0.982** | **0.976** | **0.979** | **0.964** | **0.990** |
| Mouse | PLEK | 0.930 | 0.919 | 0.925 | 0.929 | 0.976 |
| | CNCI | 0.957 | 0.931 | 0.944 | | 0.947 |
| | CPC | | 0.838 | 0.905 | 0.917 | |
| | Word2vec | 0.885 | 0.884 | 0.884 | 0.891 | 0.956 |
| | Our model | 0.943 | | | | |
This table records the various indicators for evaluating the performance of the model, including precision, recall, F1score, accuracy and auROC value. Best results are shown in bold.
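The indicators in the table derive from the binary confusion matrix in the usual way (lncRNA = positive class); a minimal sketch with made-up counts for illustration:

```python
def metrics(tp, fp, tn, fn):
    """Precision, recall, F1score and accuracy from a binary confusion matrix,
    treating lncRNA as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

# Illustrative counts, not values from the paper.
p, r, f1, acc = metrics(tp=90, fp=10, tn=85, fn=15)
print(round(f1, 3))  # -> 0.878
```

auROC, in contrast, is threshold-free: it is the area under the curve of true positive rate against false positive rate as the decision threshold varies (Figure 3).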
Figure 3. ROC curves for the human and mouse datasets, with false positive rate on the horizontal axis and true positive rate on the vertical axis. Results of the different methods are marked in different colors; in particular, our method is marked in green.
Figure 4. Violin plots of the length distribution of RNA sequences in the human and mouse datasets. The width of each violin indicates the size of the dataset, and the white dots mark the median sequence lengths.
Our model's performance on the human2 dataset with different maximum input sequence lengths at the BLSTM stage.
| Length (nt) | Precision | Recall | F1score | Accuracy | auROC |
|---|---|---|---|---|---|
| 2000 | 0.975 | 0.982 | 0.979 | 0.963 | 0.988 |
| 1500 | 0.962 | 0.989 | 0.975 | 0.982 | 0.957 |
| 1000 | 0.982 | 0.976 | 0.979 | 0.964 | 0.990 |
| 500 | 0.968 | 0.953 | 0.960 | 0.932 | 0.977 |
Precision, recall, F1score, accuracy and auROC are reported separately for input sequences limited to each maximum length.
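Fixing a maximum input length, as the experiment above varies, implies truncating longer k-mer sequences and padding shorter ones; a sketch under the assumption of left padding with a dedicated pad index (the actual padding scheme is not stated in this record):

```python
def fit_length(kmer_ids, max_len, pad_id=0):
    """Truncate a k-mer index sequence to max_len, or left-pad shorter ones,
    so every input to the BLSTM stage has the same length."""
    if len(kmer_ids) >= max_len:
        return kmer_ids[:max_len]
    return [pad_id] * (max_len - len(kmer_ids)) + kmer_ids

print(fit_length([5, 7, 9], max_len=5))           # -> [0, 0, 5, 7, 9]
print(fit_length([5, 7, 9, 2, 4, 6], max_len=5))  # -> [5, 7, 9, 2, 4]
```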
Classification performance of two variant deep learning architectures and our original model.
| Model | Precision | Recall | F1score | Accuracy | auROC |
|---|---|---|---|---|---|
| Full | 0.982 | 0.976 | 0.979 | 0.964 | 0.990 |
| No BLSTM | 0.861 | 1 | 0.925 | 0.861 | 0.746 |
| No conv | 0.861 | 1 | 0.925 | 0.861 | 0.746 |
“Full” means the original model, including the embedding stage, the BLSTM stage and the convolution stage; “No BLSTM” means the variant architecture with the BLSTM stage removed; “No conv” means the variant architecture with the convolution layer removed.
Figure 5. Sensitivity analysis of the hyperparameters k-mer length k, embedding dimension D and stride s on the human2 dataset. auROC scores are displayed.
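A sensitivity analysis like Figure 5 amounts to scoring every combination of (k, D, s); a toy sketch with a stand-in scoring function (the candidate grids and the scoring function here are illustrative, not the paper's):

```python
from itertools import product

def grid_search(evaluate, ks=(1, 2, 3, 4), dims=(50, 100, 150), strides=(1, 2, 3)):
    """Exhaustively score every (k, D, s) combination with a user-supplied
    evaluate(k, D, s) -> auROC function and return the best setting."""
    return max(product(ks, dims, strides), key=lambda cfg: evaluate(*cfg))

# Toy stand-in for training + evaluation; a real evaluate() would retrain
# the model for each setting and report auROC on a validation split.
toy = lambda k, D, s: -abs(k - 3) - abs(D - 100) / 100 - abs(s - 1)
print(grid_search(toy))  # -> (3, 100, 1)
```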