Hao Zhang, Chunhe Zhang, Zhi Li, Cong Li, Xu Wei, Borui Zhang, Yuanning Liu.
Abstract
In recent years, obtaining RNA secondary structure information has played an important role in RNA and gene function research. Although some RNA secondary structures can be determined experimentally, in most cases efficient and accurate computational methods are still needed to predict RNA secondary structure. Current prediction methods are mainly based on the minimum free energy algorithm, which uses an iterative method to find the optimal folding state of RNA in vivo, that is, the state that satisfies the minimum energy or other constraints. However, because of the complexity of the biological environment, a real RNA structure sits at an equilibrium of biological potential energy rather than at the optimal folding state with minimum energy. For short RNA sequences, this equilibrium energy state is close to the minimum free energy state, so the minimum free energy algorithm predicts secondary structure with relatively high accuracy. For longer RNA sequences, however, continued folding causes the biological potential energy equilibrium to deviate far from the minimum free energy state. This deviation stems from the complexity of the structure and leads to a serious decline in prediction accuracy. In this paper, we propose a novel RNA secondary structure prediction algorithm that combines a convolutional neural network model with a dynamic programming method to improve accuracy using large-scale RNA sequence and structure data. We analyze existing experimental RNA sequence and structure data to construct a deep convolutional network model, which extracts implicit features for effective classification from large-scale data and predicts the pairing probability of each base in an RNA sequence. An enhanced dynamic programming method is then applied to the predicted base-pairing probabilities to obtain the optimal RNA secondary structure.
Results indicate that our proposed method is superior to common RNA secondary structure prediction algorithms on three benchmark RNA families. Based on the characteristics of deep learning algorithms, it can be inferred that the proposed method has a 30% higher prediction success rate than other algorithms, and that such a method will be increasingly needed as the amount of real RNA structure data grows.
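The abstract's final step, turning predicted base-pairing probabilities into a structure, can be illustrated with a generic Nussinov-style recursion. This is only a sketch under stated assumptions: the paper's "enhanced" dynamic programming method is not specified in this record, so the function name `fold_max_probability`, the `min_loop` parameter, and the scoring rule (maximize the summed probability of non-crossing pairs) are illustrative choices, not the authors' algorithm.

```python
from typing import List, Tuple


def fold_max_probability(
    prob: List[List[float]], min_loop: int = 3
) -> Tuple[float, List[Tuple[int, int]]]:
    """Pick non-crossing base pairs that maximize the summed pairing probability.

    prob[i][j] (i < j) is the predicted probability that bases i and j pair.
    min_loop is the minimum number of unpaired bases enclosed by a pair
    (an assumed hairpin constraint, not taken from the paper).
    """
    n = len(prob)
    if n == 0:
        return 0.0, []

    # best[i][j] holds the best score for the subsequence i..j (inclusive).
    best = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                left = best[i][k - 1] if k > i else 0.0
                score = max(score, left + prob[k][j] + best[k + 1][j - 1])
            best[i][j] = score

    # Traceback: recover one optimal set of pairs.
    pairs: List[Tuple[int, int]] = []
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            left = best[i][k - 1] if k > i else 0.0
            if best[i][j] == left + prob[k][j] + best[k + 1][j - 1]:
                pairs.append((k, j))
                if k > i:
                    stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return best[0][n - 1], sorted(pairs)
```

Because the recursion only ever combines nested or adjacent intervals, the returned pairs are pseudoknot-free by construction, which matches the pseudoknot-free data sets tabulated further down.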
Keywords: RNA secondary structure; base pairing probability; convolutional neural network; dynamic programming; energy balance status
Year: 2019 PMID: 31191603 PMCID: PMC6540740 DOI: 10.3389/fgene.2019.00467
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. The process of CDPfold.
Figure 2. The process of RNA matrix representation based on RNA sequence pairing.
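A matrix representation of this kind can be sketched as follows. This is a minimal illustration, not the paper's exact encoding: the weights in `PAIR_WEIGHTS` are hypothetical placeholders, and the function name `pairing_matrix` is invented here. The wobble (G-U) weight is kept as an explicit entry because the record mentions that accuracy varies with the wobble base pairing weight.

```python
from typing import Dict, FrozenSet, List

# Hypothetical pair weights (assumptions for illustration only).
PAIR_WEIGHTS: Dict[FrozenSet[str], float] = {
    frozenset("AU"): 2.0,  # Watson-Crick pair
    frozenset("GC"): 3.0,  # Watson-Crick pair
    frozenset("GU"): 0.8,  # wobble pair; the weight is a tunable assumption
}


def pairing_matrix(
    seq: str, weights: Dict[FrozenSet[str], float] = PAIR_WEIGHTS
) -> List[List[float]]:
    """Return a symmetric n x n matrix whose entry (i, j) is the weight of
    pairing base i with base j, or 0.0 if the two bases cannot pair."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input
    n = len(seq)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = weights.get(frozenset((seq[i], seq[j])), 0.0)
            mat[i][j] = mat[j][i] = w
    return mat
```

For example, `pairing_matrix("GCAU")` places the G-C weight at positions (0, 1), the A-U weight at (2, 3), and the wobble weight at (0, 3). A matrix like this is the sort of image-like input a convolutional network can consume.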
Distribution of RNA types and their number in each dataset.
| RNA type | Number of RNAs |
| --- | --- |
| 5sRNA | 1283 |
| 16sRNA | 110 |
| 25sRNA | 35 |
| grp1RNA | 98 |
| grp2RNA | 11 |
| RNasePRNA | 454 |
| srpRNA | 928 |
| tmRNA | 462 |
| tRNA | 557 |
| telomeraseRNA | 37 |
Figure 3. 5sRNA maximum stem length statistics.
Figure 4. 5sRNA length distribution.
Figure 5. Convolutional neural network structure in the 5sRNA experiment.
Figure 6. The accuracy of the model on the training set and the validation set.
Figure 7. Error bars showing how accuracy changes with the wobble base pairing weight.
Comparison of algorithms on 5sRNA.
| mfold | 0.693 | 0.704 | 0.698 |
| RNAfold | 0.694 | 0.704 | 0.699 |
| cofold | 0.585 | 0.591 | 0.588 |
| Sfold | 0.703 | 0.733 | 0.718 |
| CDPfold | 0.932 | 0.916 | 0.924 |
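The three value columns in the table above are unlabeled in this record. One observation that may help in reading them: in every row, the third value equals the harmonic mean of the first two to three decimals, which is what an F-measure computed from two complementary accuracy metrics would give. The check below is a sketch of that arithmetic. Reading the third column as an F-measure is an assumption, and which of the first two columns is which metric cannot be recovered from this record.

```python
# Rows copied from the table above: (column 1, column 2, column 3).
ROWS = {
    "mfold": (0.693, 0.704, 0.698),
    "RNAfold": (0.694, 0.704, 0.699),
    "cofold": (0.585, 0.591, 0.588),
    "Sfold": (0.703, 0.733, 0.718),
    "CDPfold": (0.932, 0.916, 0.924),
}


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two positive values (the F-measure formula)."""
    return 2 * a * b / (a + b)


# For each algorithm, compare the harmonic mean of the first two columns
# with the reported third column, allowing for rounding to three decimals.
consistent = {
    name: abs(harmonic_mean(c1, c2) - c3) < 0.001
    for name, (c1, c2, c3) in ROWS.items()
}
```

With the values as listed, `consistent` is true for all five rows.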
The number of RNAs in each data set before and after pseudoknot removal.
| RNA type | Before | After |
| --- | --- | --- |
| 5sRNA | 1283 | 1283 |
| 16sRNA | 110 | 50 |
| 25sRNA | 35 | 20 |
| grp1RNA | 98 | 0 |
| grp2RNA | 11 | 11 |
| RNasePRNA | 454 | 37 |
| srpRNA | 928 | 928 |
| tmRNA | 462 | 3 |
| tRNA | 557 | 557 |
| telomeraseRNA | 37 | 0 |
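The filtering summarized in the table above depends on deciding whether a structure contains a pseudoknot. The record does not say how this was done. A common definition, sketched here as an assumption, is that a structure is pseudoknotted when two base pairs (i, j) and (k, l) cross, that is, i < k < j < l. The function name `has_pseudoknot` is illustrative.

```python
from typing import Iterable, Tuple


def has_pseudoknot(pairs: Iterable[Tuple[int, int]]) -> bool:
    """Return True if any two base pairs cross (i < k < j < l)."""
    # Normalize each pair so the smaller index comes first, then sort
    # by opening position so every later pair opens at or after (i, j).
    ordered = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(ordered):
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return True
    return False
```

Nested pairs such as `[(0, 9), (1, 8)]` and side-by-side pairs such as `[(0, 3), (4, 7)]` are not flagged, while `[(0, 5), (3, 8)]` is. The quadratic scan is adequate for a sketch; a stack-based pass would be preferable for long sequences.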
Figure 8. Number and length distribution of the RNA data of each family after redundancy removal.
Figure 9. Stem length statistics in the data set.
Figure 10. RNA sequence length distribution.
Figure 11. Convolutional neural network model in the general model.
Comparison of prediction accuracy of the algorithms on three types of RNA.
| Mfold | 0.698 | 0.631 | 0.566 |
| RNAfold | 0.699 | 0.632 | 0.577 |
| CDPfold | 0.911 | 0.905 | 0.823 |