Marcell Szikszai, Michael Wise, Amitava Datta, Max Ward, David H Mathews.
Abstract
MOTIVATION: The secondary structure of RNA is important to its function. Over the last few years, several papers have attempted to use machine learning to improve de novo RNA secondary structure prediction. Many of these papers report impressive results for intra-family predictions but seldom address the much more difficult (and practical) inter-family problem.
Year: 2022 PMID: 35748706 PMCID: PMC9364374 DOI: 10.1093/bioinformatics/btac415
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Example of a simple sequence–structure pair in dot-bracket format
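As an illustration of the dot-bracket format, the following minimal sketch converts a structure string into a set of base pairs. The function name `dot_bracket_to_pairs` and the toy sequence are assumptions made for this example; they are not the pair shown in Fig. 1, and pseudoknot brackets such as `[` and `]` are deliberately not handled.

```python
def dot_bracket_to_pairs(structure: str) -> set:
    """Return the 0-based (i, j) base pairs, with i < j, encoded by a dot-bracket string."""
    stack = []   # positions of '(' that have not been closed yet
    pairs = set()
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.add((stack.pop(), j))  # pair with the most recent open bracket
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


# Hypothetical toy hairpin: three stacked pairs closing a four-nucleotide loop.
sequence = "GGGAAAUCCC"
structure = "(((....)))"
for i, j in sorted(dot_bracket_to_pairs(structure)):
    print(f"{sequence[i]}{i + 1} pairs with {sequence[j]}{j + 1}")
```

Each `(` is matched with the nearest unmatched `)` to its right, which is why a stack is sufficient for nested (pseudoknot-free) structures.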
Breakdown of RNA families in ArchiveII after filtering
| Family | Mean length | Sequences |
|---|---|---|
| 5S rRNA | 119 | 1283 |
| SRP RNA | 180 | 918 |
| tRNA | 77 | 557 |
| tmRNA | 366 | 462 |
| RNase P RNA | 332 | 454 |
| Group I Intron | 375 | 74 |
| 16S rRNA | 317 | 67 |
| Telomerase RNA | 438 | 35 |
| 23S rRNA | 326 | 15 |
| Mean | 281 | |
| Total | | 3865 |
16S rRNA and 23S rRNA are split into independent folding domains (Mathews ).
Recent papers that used machine learning for RNA secondary structure prediction
| Name | Authors | Year | Method | Intra-family | Inter-family | Re-trained |
|---|---|---|---|---|---|---|
| CROSS | Delli Ponti | 2017 | ANN | ✓ | ✗ | ✗ |
| DMfold | Wang | 2019 | LSTM | ✓ | ✗ | ✗ |
| SPOT-RNA | Singh | 2019 | CNN | ✓ | ✗ | ✗ |
| E2Efold | Chen | 2019 | CNN | ✓ | ✗ | ✗ |
| RNA-state-inf | Willmott | 2020 | BLSTM | ✓ | ✓ | ✗ |
| RPRes | Wang | 2021 | BLSTM | ✓ | ✗ | ✗ |
| MXfold2 | Sato | 2021 | BLSTM | ✓ | ✓ | ✓ |
| UFold | Fu | 2021 | CNN | ✓ | ✓ | ✓ |
Note: The inter-family and intra-family columns indicate the splitting methodology used in the paper, while the re-trained column indicates whether we successfully re-trained the model on our dataset. Attempts were made to re-train nearly every model; however, many papers do not publish their training methodology, or the model could not be re-trained for another reason. See Sections 3.2 and 4.2 and the Supplementary Information for a detailed discussion.
Artificial neural network (Rumelhart ).
Long short-term memory neural network (Hochreiter and Schmidhuber, 1997).
Convolutional neural network (LeCun ).
Bidirectional long short-term memory neural network (Schuster and Paliwal, 1997).
Attention transformer (Vaswani ).
Residual neural network (He ).
Fig. 2. Comparison of the effect of nudges between families. The mean of all sequences in each family is calculated across the values. F1 scores have been normalized (min–max scaled) to account for the differences in underlying secondary structure prediction performance between families.
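The min–max scaling mentioned in the caption can be sketched as follows. This is the standard formula, (x - min) / (max - min), applied to one family's series of mean F1 scores; the function name and the numbers are hypothetical, and the exact procedure used for the figure is not given in this excerpt.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no relative information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical mean F1 scores for two families over the same series of settings.
# After scaling, both curves span [0, 1], so their shapes can be compared even
# though the families differ in absolute prediction performance.
family_a = [0.40, 0.45, 0.50]
family_b = [0.80, 0.85, 0.90]
print(min_max_scale(family_a))
print(min_max_scale(family_b))
```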
Performance of the demonstrative model separated by RNA family
| Family | Sequences | Baseline F1 | k-fold AUC | k-fold F1 | Family-fold AUC | Family-fold F1 |
|---|---|---|---|---|---|---|
| 5S rRNA | 1283 | 0.63 | 0.95 | 0.94 | 0.72 | 0.46 |
| SRP RNA | 918 | 0.64 | 0.88 | 0.81 | 0.73 | 0.50 |
| tRNA | 557 | 0.80 | 0.97 | 0.97 | 0.79 | 0.65 |
| tmRNA | 462 | 0.43 | 0.82 | 0.64 | 0.68 | 0.41 |
| RNase P RNA | 454 | 0.55 | 0.81 | 0.66 | 0.71 | 0.48 |
| Group I Intron | 74 | 0.53 | 0.73 | 0.53 | 0.72 | 0.49 |
| 16S rRNA | 67 | 0.58 | 0.77 | 0.60 | 0.72 | 0.48 |
| Telomerase RNA | 35 | 0.50 | 0.76 | 0.61 | 0.68 | 0.45 |
| 23S rRNA | 15 | 0.73 | 0.79 | 0.68 | 0.73 | 0.54 |
| Total | 3865 | | | | | |
| Mean | | 0.60 | 0.83 | 0.72 | 0.72 | 0.50 |
Note: F1 score refers to the performance of secondary structure prediction, and AUC refers to the performance of predicting the structures’ shadow via deep learning. The baseline is RNAstructure for free energy minimization without the deep learning input. Both k-fold and family-fold models are included.
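For readers unfamiliar with how an F1 score applies to secondary structure, here is a minimal sketch that compares predicted and reference base pairs as sets. It assumes exact-match scoring; some benchmarks also accept pairs shifted by one position, and this excerpt does not state which convention the reported scores use. The function name and the example pairs are hypothetical.

```python
def base_pair_f1(predicted: set, reference: set) -> float:
    """F1 score (harmonic mean of precision and recall) over sets of (i, j) base pairs."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        # Covers empty inputs too; returning 0.0 here is a convention, not a rule.
        return 0.0
    precision = true_positives / len(predicted)  # fraction of predicted pairs that are correct
    recall = true_positives / len(reference)     # fraction of reference pairs that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two of three predicted pairs match the reference.
reference = {(0, 9), (1, 8), (2, 7)}
predicted = {(0, 9), (1, 8), (3, 6)}
print(round(base_pair_f1(predicted, reference), 3))
```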
Fig. 3. Performance of family-fold testing on our demonstrative model. The training set comprises all families except 5S rRNA, the validation set is a 10% split of the training set, and the testing set consists of 5S rRNAs. Note the consistently poor performance on the testing set throughout.
Performance of family-fold cross-validation on MXfold2 and UFold
| Family | RNAstructure (F1) | MXfold2 (F1) | UFold (F1) |
|---|---|---|---|
| 5S rRNA | 0.63 | 0.54 | 0.53 |
| SRP RNA | 0.64 | 0.50 | 0.26 |
| tRNA | 0.80 | 0.64 | 0.26 |
| tmRNA | 0.43 | 0.46 | 0.40 |
| RNase P RNA | 0.55 | 0.51 | 0.41 |
| Group I Intron | 0.53 | 0.45 | 0.45 |
| 16S rRNA | 0.58 | 0.55 | 0.41 |
| Telomerase RNA | 0.50 | 0.34 | 0.80 |
| 23S rRNA | 0.73 | 0.64 | 0.45 |
| Mean | 0.60 | 0.51 | 0.44 |
Fig. 4. Secondary structure of three tRNAs: (a) tRNA tdbR00000247, (b) tRNA tdbR00000372 and (c) tRNA tdbR00000435. Despite relatively low sequence identity (<60%), their secondary structures appear nearly identical. Many machine learning model benchmarks fail to separate these RNAs between the training and testing sets, causing significant overlap.
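The family-fold protocol described for Fig. 3 (hold out one whole family for testing, train on the rest, and carve a 10% validation set out of the training data) can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' code: the record layout `(family, sequence_id)`, the function name, the seed and the toy identifiers are all hypothetical.

```python
import random


def family_fold_splits(records, validation_fraction=0.1, seed=0):
    """Yield (held_out_family, train, validation, test) for every family.

    `records` is a list of (family, sequence_id) tuples. The test set holds
    every record of one family, so no member of that family is seen in training.
    """
    rng = random.Random(seed)
    families = sorted({family for family, _ in records})
    for held_out in families:
        test = [r for r in records if r[0] == held_out]
        rest = [r for r in records if r[0] != held_out]
        rng.shuffle(rest)  # randomize before taking the validation slice
        n_validation = int(len(rest) * validation_fraction)
        validation, train = rest[:n_validation], rest[n_validation:]
        yield held_out, train, validation, test


# Hypothetical toy dataset: three families with ten sequences each.
toy_records = [
    (family, f"{family}-{index}")
    for family in ("5S rRNA", "SRP RNA", "tRNA")
    for index in range(10)
]
for held_out, train, validation, test in family_fold_splits(toy_records):
    print(held_out, len(train), len(validation), len(test))
```

Splitting by family rather than by random k-fold is what keeps near-identical structures, such as the three tRNAs in Fig. 4, from appearing on both sides of the train/test boundary.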