Linyu Wang1,2, Yuanning Liu1,2, Xiaodan Zhong1,2,3, Haiming Liu1,2, Chao Lu1,2, Cong Li1,2, Hao Zhang1,2.
Abstract
Predicting the secondary structure of RNA is vital for studying its function, but determining that structure is challenging, especially for RNA with pseudoknots. Several computational methods can predict secondary structure (with or without pseudoknots), each with its own merits and drawbacks. These methods fall into two categories: multi-sequence methods and single-sequence methods. The main advantage of the multi-sequence method is that it uses auxiliary sequences to assist prediction, but it succeeds only when multiple highly homologous sequences are available. The main merit of the single-sequence method is ease of use (only the target sequence is needed), but its folding parameters capture features common to diverse RNAs and cannot describe the unique characteristics of a particular RNA, which can lower prediction accuracy for some RNAs. In this paper we propose "DMfold," a method based on deep learning and an Improved Base Pair Maximization Principle, to predict secondary structure with pseudoknots; it combines the advantages of both categories while avoiding some of their disadvantages. Notably, DMfold predicts the secondary structure of an RNA by learning from similar RNAs with known structures; it relies on similar RNA sequences rather than the highly homologous sequences required by the multi-sequence method, which reduces the requirement for auxiliary sequences. DMfold needs only the target sequence as input. Its folding parameters are extracted automatically by deep learning, which avoids the limited folding parameters of the single-sequence method. Experiments show that our method is simple to operate and improves prediction accuracy compared with several established prediction methods.
A repository containing our code can be found at https://github.com/linyuwangPHD/RNA-Secondary-Structure-Database.
Keywords: RNA; deep learning; improved base pair maximization principle; multi-sequence method; pseudoknot; secondary structure prediction; single-sequence method
Year: 2019 PMID: 30886627 PMCID: PMC6409321 DOI: 10.3389/fgene.2019.00143
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. An RNA structure can be decomposed into three pseudoknot-free substructures. Each color represents a substructure. The figure uses three types of brackets and a dot: brackets represent paired bases and dots represent unpaired bases. Each bracket type corresponds to a separate substructure, and the edges, which represent the base pairs, are nested within a substructure.
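The decomposition described in this caption can be illustrated with a minimal sketch, assuming the structure is given as a dot-bracket string that uses `()`, `[]`, and `{}`. The function name `decompose` and the example string are hypothetical and are not taken from the DMfold code; the sketch only shows that matching each bracket type on its own stack yields one nested (pseudoknot-free) set of base pairs per bracket type.

```python
# Hypothetical helper, not part of DMfold: split a dot-bracket string with
# up to three bracket types into one list of base pairs per bracket type.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = {closer: opener for opener, closer in PAIRS.items()}


def decompose(dot_bracket):
    """Return {opening bracket: [(i, j), ...]} using 0-based positions."""
    stacks = {opener: [] for opener in PAIRS}
    substructures = {opener: [] for opener in PAIRS}
    for pos, symbol in enumerate(dot_bracket):
        if symbol in PAIRS:
            # Remember where this opening bracket sits.
            stacks[symbol].append(pos)
        elif symbol in CLOSERS:
            opener = CLOSERS[symbol]
            if not stacks[opener]:
                raise ValueError(f"unmatched {symbol!r} at position {pos}")
            # The most recent unmatched opener of the same type is the partner.
            substructures[opener].append((stacks[opener].pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {opener!r} at position {stack[-1]}")
    return {opener: sorted(pairs) for opener, pairs in substructures.items()}


example = decompose("((..[[..))..]]")
print(example["("])  # pairs of the round-bracket substructure
print(example["["])  # pairs of the square-bracket substructure
```

In the example string the round and square brackets cross each other, so the full structure contains a pseudoknot, while each per-bracket list on its own is properly nested.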
Figure 2. Schematic diagram of the DMfold architecture, which contains two parts: PU and CU. PU is a deep learning model that predicts a dot-bracket sequence from the input RNA sequence. CU corrects the predicted dot-bracket sequence and outputs the predicted secondary structure.
The rules of transformation between bases and One-Hot vectors (Details can be found in Supplementary Materials).
| Base | One-Hot vector |
| --- | --- |
| A | 10000010 |
| U | 00101000 |
| G | 01000010 |
| C | 00100100 |
| N (padding base) | 00000000 |
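As a minimal sketch of how this table could be applied, the snippet below encodes a sequence into rows of bits, padding with `N` to a fixed length. The codes are copied verbatim from the table; the function name `encode_sequence` and the `padded_length` parameter are assumptions made for illustration and do not come from the DMfold repository.

```python
# Codes copied verbatim from the table above.
BASE_TO_CODE = {
    "A": "10000010",
    "U": "00101000",
    "G": "01000010",
    "C": "00100100",
    "N": "00000000",  # padding base
}


def encode_sequence(sequence, padded_length):
    """Encode an RNA sequence as a list of 8-bit rows, padded with 'N'.

    `padded_length` is a hypothetical parameter: a fixed input length is
    assumed here, and the value actually used by DMfold is not stated.
    Bases outside the table raise KeyError.
    """
    if len(sequence) > padded_length:
        raise ValueError("sequence is longer than padded_length")
    padded = sequence.upper() + "N" * (padded_length - len(sequence))
    return [[int(bit) for bit in BASE_TO_CODE[base]] for base in padded]


for row in encode_sequence("AUGC", 6):
    print(row)
```

Each row has eight entries, and the trailing all-zero rows correspond to the padding base.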
The rules of transformation between dot-brackets and One-Hot vectors (Details can be found in Supplementary Materials).
| Symbol | One-Hot vector |
| --- | --- |
| ( | 1000000 |
| ) | 0000001 |
| . | 0001000 |
| [ | 0100000 |
| ] | 0000010 |
| { | 0010000 |
| } | 0000100 |
| N (padding symbols) | 0000000 |
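The dot-bracket table can be read as "position of the 1 selects the symbol", with the positions ordered `( [ { . } ] )`. The sketch below encodes and decodes structures under that reading. The names `encode_structure` and `decode_structure` are hypothetical. Decoding takes the largest entry of each row, so it would also accept per-symbol scores; that is an assumption about how a model output might be read, not something the source states.

```python
# Position of the 1 in each code from the table above: index 0 is '(',
# index 3 is '.', index 6 is ')'.
SYMBOL_ORDER = "([{.}])"


def encode_structure(dot_bracket, padded_length):
    """Encode a dot-bracket string as 7-entry rows; padding rows are all zero."""
    if len(dot_bracket) > padded_length:
        raise ValueError("structure is longer than padded_length")
    rows = []
    for symbol in dot_bracket:
        row = [0] * len(SYMBOL_ORDER)
        row[SYMBOL_ORDER.index(symbol)] = 1  # ValueError on unknown symbols
        rows.append(row)
    rows.extend([0] * len(SYMBOL_ORDER) for _ in range(padded_length - len(dot_bracket)))
    return rows


def decode_structure(rows):
    """Map each row to the symbol with the largest entry; skip all-zero padding."""
    symbols = []
    for row in rows:
        best = max(row)
        if best <= 0:
            continue  # padding symbol 'N'
        symbols.append(SYMBOL_ORDER[list(row).index(best)])
    return "".join(symbols)


encoded = encode_structure("([.])", 7)
print(decode_structure(encoded))
```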
Figure 3. Mean accuracy and loss for training and testing in the 10-fold cross-validation experiments. The brown and green curves show the training and testing accuracy, respectively, and the red and blue curves show the training and testing loss, respectively.
Figure 4. Principle diagram of IBPMP. (A) The IBPMP procedure, which contains two parts: initialization and the algorithm section. In the initialization, the procedure processes the prediction results of PU as the input of CU. In the algorithm section, it obtains the predicted secondary structure with pseudoknots. See below for details of FirstStep, SecondaryStep, and ThirdStep. (B) The CSCP procedure, which contains two parts: initialization and the algorithm section. In the initialization, it collects all stems and sets a priority for each. In the algorithm section, it obtains the optimal stem combinations. (C) An example of CSCP.
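The caption says that CSCP collects stems and assigns them priorities, but it does not give the rules. The following is only an illustrative sketch of that idea, not the authors' algorithm: it groups base pairs into stems, taken here to mean maximal runs of stacked pairs (i, j), (i+1, j-1), ..., and ranks longer stems first. The names `collect_stems` and `rank_stems` and the length-based priority are assumptions, and the later step of choosing an optimal stem combination is not attempted.

```python
def collect_stems(pairs):
    """Group base pairs into stems: maximal runs (i, j), (i+1, j-1), ..."""
    remaining = set(pairs)
    stems = []
    # Visiting pairs by increasing i reaches the outermost pair of a run first.
    for i, j in sorted(remaining):
        if (i, j) not in remaining:
            continue  # already absorbed into an earlier stem
        stem = []
        while (i, j) in remaining:
            remaining.remove((i, j))
            stem.append((i, j))
            i, j = i + 1, j - 1  # move inward to the next stacked pair
        stems.append(stem)
    return stems


def rank_stems(stems):
    """Assumed priority rule: longer stems first, ties broken by position."""
    return sorted(stems, key=lambda stem: (-len(stem), stem[0]))


pairs = [(0, 9), (1, 8), (4, 15), (5, 14), (6, 13)]
for stem in rank_stems(collect_stems(pairs)):
    print(len(stem), stem)
```

Ranking by length is one plausible priority because longer stacked regions are generally more stable, but other criteria could be substituted without changing the grouping step.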
The comparison between DMfold and other methods on 5sRNA and tRNA.
| mfold | 0.741 | 0.708 | 0.722 | 0.708 | 0.675 | 0.690 |
| RNAfold | 0.708 | 0.634 | 0.667 | 0.613 | 0.550 | 0.579 |
| Cofold | 0.627 | 0.595 | 0.609 | 0.578 | 0.548 | 0.562 |
| IPknot | 0.787 | 0.775 | 0.774 | 0.485 | 0.555 | 0.512 |
| Probknot | 0.745 | 0.635 | 0.683 | 0.562 | 0.538 | 0.548 |
| DMfold | ||||||
The bold value is the maximum of each column.
The comparison between DMfold and other methods on tmRNA and RnaseP.
| mfold | 0.558 | 0.518 | 0.536 | 0.605 | ||
| RNAfold | 0.470 | 0.433 | 0.448 | 0.564 | 0.499 | 0.526 |
| Cofold | 0.358 | 0.329 | 0.342 | 0.518 | 0.481 | 0.495 |
| IPknot | 0.463 | 0.495 | 0.476 | 0.587 | 0.640 | 0.604 |
| Probknot | 0.457 | 0.410 | 0.431 | 0.583 | 0.531 | 0.551 |
| DMfold | 0.547 | 0.619 | ||||
The bold value is the maximum of each column.
Figure 5. Visualization of the structures predicted by multiple methods and the real structure. Green bases represent stems. Red bases represent bifurcation loops and unpaired single strands. Blue bases represent hairpin loops. Yellow bases represent interior and bulge loops.