Jianghui Wen, Yeshu Liu, Yu Shi, Haoran Huang, Bing Deng, Xinping Xiao.
Abstract
BACKGROUND: Long non-coding RNA (lncRNA) is closely related to many biological activities. Because its sequence structure is similar to that of messenger RNA (mRNA), the two are difficult to distinguish from sequence features alone. It is therefore important to construct a model that can effectively distinguish lncRNA from mRNA.
Keywords: Convolutional neural network; K-mers; Relative entropy; lncRNA; mRNA
Year: 2019 PMID: 31519146 PMCID: PMC6743109 DOI: 10.1186/s12859-019-3039-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The 2-mer frequency mean line graph. (a) The 2-mer frequency mean line graph of lncRNA. (b) The 2-mer frequency mean line graph of mRNA
Model classification accuracy for individual k values
| k | number of k-mers | matrix form | model accuracy | precision rate (P) | recall rate (R) | F1 score | calculating time (s/epoch) |
|---|---|---|---|---|---|---|---|
| 3 | 64 | 8 × 8 | 0.7508 | 0.81 | 0.79 | 0.79 | 5 |
| 4 | 256 | 16 × 16 | 0.7610 | 0.85 | 0.83 | 0.83 | 20 |
| 5 | 1024 | 32 × 32 | 0.7565 | 0.93 | 0.92 | 0.92 | 95 |
| 6 | 4096 | 64 × 64 | 0.7748 | 0.87 | 0.85 | 0.84 | 855 |
Model classification accuracy for combinations of two k values
| k | number of k-mers | matrix form | model accuracy | precision rate (P) | recall rate (R) | F1 score | calculating time (s/epoch) |
|---|---|---|---|---|---|---|---|
| 1 + 3 | 68 | 17 × 4 | 0.9280 | 0.94 | 0.94 | 0.94 | 4 |
| 1 + 4 | 260 | 10 × 26 | 0.9600 | 0.98 | 0.98 | 0.98 | 32 |
| 1 + 5 | 1028 | 4 × 257 | 0.4995 | 0.50 | 0.50 | 0.36 | 43 |
| 2 + 3 | 80 | 8 × 10 | 0.9810 | 0.99 | 0.99 | 0.99 | 9 |
| 2 + 4 | 272 | 16 × 17 | 0.7838 | 0.87 | 0.86 | 0.86 | 37 |
| 2 + 5 | 1040 | 26 × 40 | 0.7672 | 0.91 | 0.90 | 0.90 | 180 |
| 3 + 4 | 320 | 16 × 20 | 0.7666 | 0.90 | 0.90 | 0.90 | 47 |
| 3 + 5 | 1088 | 32 × 34 | 0.7566 | 0.94 | 0.94 | 0.94 | 189 |
| 4 + 5 | 1280 | 32 × 40 | 0.7532 | 0.95 | 0.94 | 0.94 | 290 |
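
The "number of k-mers" values in these tables are consistent with counting every possible k-mer over the four bases: 4^k features for a single k, and the sum of 4^k over the k values in a combination. A minimal Python sketch of that arithmetic follows; the function name `feature_count` is hypothetical and not taken from the source.

```python
def feature_count(ks):
    """Total number of distinct k-mers over {A, C, G, T} for the given k values."""
    return sum(4 ** k for k in ks)

# Single k values and combinations that appear in the tables
for ks in [(3,), (4,), (5,), (6,), (1, 3), (2, 3), (4, 5)]:
    print("+".join(map(str, ks)), feature_count(ks))
```

For example, `feature_count((2, 3))` gives 16 + 64 = 80, which matches the 8 × 10 matrix form listed for k = 2 + 3.
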
Model classification accuracy for combinations of three k values
| k | number of k-mers | matrix form | model accuracy | precision rate (P) | recall rate (R) | F1 score | calculating time (s/epoch) |
|---|---|---|---|---|---|---|---|
| 1 + 2 + 3 | 84 | 17 × 20 | 0.9872 | 1.00 | 1.00 | 1.00 | 6 |
| 2 + 3 + 4 | 336 | 12 × 28 | 0.9738 | 1.00 | 1.00 | 1.00 | 57 |
| 2 + 3 + 5 | 1104 | 24 × 46 | 0.9798 | 1.00 | 1.00 | 1.00 | 217 |
K-mer calculation results after KL screening
| k | number of k-mers | number of k-mers after KL screening | original model accuracy | model accuracy after KL screening | calculation time of the original model (s/epoch) | calculation time of KL screening model (s/epoch) |
|---|---|---|---|---|---|---|
| 5 | 1024 | 115 | 0.7565 | 0.782 | 95 s | 4 s |
| 6 | 4096 | 1045 | 0.7748 | 0.779 | 855 s | 47 s |
| 4 + 5 | 1280 | 112 | 0.7532 | 0.629 | 290 s | 4 s |
| 2 + 3 + 5 | 1104 | 195 | 0.9798 | 0.9761 | 217 s | 27 s |
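
The record does not spell out the screening rule, so the following is only a sketch of one plausible reading. It ranks k-mers by the magnitude of their term p_i log(p_i / q_i) in the relative entropy (KL divergence) between the mean k-mer frequency profiles of the two classes and keeps the top-ranked ones. The function names, the `keep` parameter, the ranking rule, and the toy frequencies are all assumptions made for illustration.

```python
import math

def kl_contributions(p, q, eps=1e-12):
    """Per-k-mer terms p_i * log(p_i / q_i) of the relative entropy D(p || q).

    p and q map each k-mer to its mean relative frequency in one class.
    eps guards against log(0) and division by zero (an assumption).
    """
    return {kmer: p[kmer] * math.log((p[kmer] + eps) / (q[kmer] + eps)) for kmer in p}

def screen_kmers(p, q, keep):
    """Return the `keep` k-mers with the largest absolute contribution."""
    contrib = kl_contributions(p, q)
    ranked = sorted(contrib, key=lambda kmer: abs(contrib[kmer]), reverse=True)
    return ranked[:keep]

# Toy profiles over four hypothetical 2-mers (not data from the paper)
p = {"AA": 0.40, "AC": 0.10, "CA": 0.25, "CC": 0.25}
q = {"AA": 0.10, "AC": 0.40, "CA": 0.25, "CC": 0.25}
print(screen_kmers(p, q, keep=2))
```

With these toy profiles, `AA` and `AC` are kept because `CA` and `CC` have identical frequencies in both classes and contribute nothing to the divergence.
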
Comparison of five models on human data
| model | model accuracy | precision rate (P) | recall rate (R) | F1 score |
|---|---|---|---|---|
| CNN | 0.9872 | 0.9993 | 0.9955 | 0.9974 |
| RF | 0.8820 | 0.8949 | 0.8867 | 0.8925 |
| LR | 0.7020 | 0.7247 | 0.7183 | 0.7218 |
| DT | 0.8030 | 0.7873 | 0.7852 | 0.7869 |
| SVM | 0.7020 | 0.7245 | 0.7158 | 0.7179 |
Fig. 2 ROC curve of CNN, RF, LR, DT and SVM
Comparison of five models on mouse data
| model | model accuracy | precision rate (P) | recall rate (R) | F1 score |
|---|---|---|---|---|
| CNN | 0.8797 | 0.8960 | 0.8590 | 0.8771 |
| RF | 0.8120 | 0.8132 | 0.8130 | 0.8131 |
| LR | 0.7541 | 0.7454 | 0.7700 | 0.7575 |
| DT | 0.7001 | 0.6991 | 0.6977 | 0.6984 |
| SVM | 0.7528 | 0.7564 | 0.7476 | 0.7520 |
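
The last column of the mouse table above is consistent with the F1 score, the harmonic mean of precision and recall. As a quick check under that assumption (the record does not state how the column was computed), the rows can be recomputed from P and R; the `f1` helper is hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the mouse comparison table
mouse = {
    "CNN": (0.8960, 0.8590),
    "RF": (0.8132, 0.8130),
    "LR": (0.7454, 0.7700),
    "DT": (0.6991, 0.6977),
    "SVM": (0.7564, 0.7476),
}

for model, (p, r) in mouse.items():
    print(model, round(f1(p, r), 4))
```

For CNN this gives 2 × 0.8960 × 0.8590 / (0.8960 + 0.8590) ≈ 0.8771, the value listed in the table.
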
Comparison of five models on chicken data
| model | model accuracy | precision rate (P) | recall rate (R) | F1 score |
|---|---|---|---|---|
| CNN | 0.9963 | 0.9943 | 0.9984 | 0.9963 |
| RF | 0.9302 | 0.9351 | 0.9245 | 0.9298 |
| LR | 0.8743 | 0.8902 | 0.8546 | 0.8720 |
| DT | 0.8227 | 0.8148 | 0.8315 | 0.8230 |
| SVM | 0.8724 | 0.8881 | 0.8538 | 0.8706 |
Fig. 3 The 3-mer sliding window. The process of extracting k-mers from a sequence in sliding-window mode when k = 3, yielding 21 3-mers
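
The sliding-window extraction in Fig. 3 can be sketched as follows, assuming a step of one base so that a sequence of length L yields L - k + 1 overlapping k-mers (21 3-mers therefore correspond to a 23-nt sequence). The example sequence and the function names are made up for illustration and are not from the source.

```python
from collections import Counter
from itertools import product

def kmers(seq, k):
    """All overlapping k-mers of seq, taken with a sliding window of step 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequency of every possible k-mer over the alphabet (4**k entries)."""
    windows = kmers(seq, k)
    counts = Counter(windows)
    total = max(len(windows), 1)  # avoid division by zero for very short sequences
    # Windows containing other symbols (e.g. N) count toward total but are not reported
    return {"".join(p): counts["".join(p)] / total for p in product(alphabet, repeat=k)}

seq = "ATGCGTACGTTAGCCGATAGCTA"  # hypothetical 23-nt sequence
print(len(kmers(seq, 3)))         # number of 3-mers extracted
freqs = kmer_frequencies(seq, 3)
print(len(freqs))                 # number of 3-mer features
```

A frequency vector built this way has a fixed length of 4^k regardless of sequence length, which is what allows it to be reshaped into the matrix forms listed in the tables.
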
Fig. 4 The 1-mer frequency distribution histogram. For 5000 randomly selected lncRNA sequences and 5000 randomly selected mRNA sequences, the A, C, G, and T base contents are approximately 254 nt, 217 nt, 216 nt, and 240 nt in lncRNA and approximately 364 nt, 420 nt, 422 nt, and 343 nt in mRNA
Fig. 5 The 3-mer distribution frequency diagram of mRNA and lncRNA. (a) The 32 3-mers beginning with T and A. (b) The other 32 3-mers beginning with G and C
Fig. 6 The lncRNA recognition model calculation flow chart. The lncRNA and mRNA classification model consists of an input part and a convolutional neural network part