Teresa Maria Rosaria Noviello1,2, Francesco Ceccarelli3,4, Michele Ceccarelli1,3, Luigi Cerulo2,5.
Abstract
Small non-coding RNAs (ncRNAs) are short non-coding sequences involved in gene regulation in many biological processes and diseases. The lack of a complete comprehension of their biological functionality, especially in a genome-wide scenario, has demanded new computational approaches to annotate their roles. It is widely known that secondary structure is determinant to know RNA function and machine learning based approaches have been successfully proven to predict RNA function from secondary structure information. Here we show that RNA function can be predicted with good accuracy from a lightweight representation of sequence information without the necessity of computing secondary structure features which is computationally expensive. This finding appears to go against the dogma of secondary structure being a key determinant of function in RNA. Compared to recent secondary structure based methods, the proposed solution is more robust to sequence boundary noise and reduces drastically the computational cost allowing for large data volume annotations. Scripts and datasets to reproduce the results of experiments proposed in this study are available at: https://github.com/bioinformatics-sannio/ncrna-deep.Entities:
Year: 2020 PMID: 33175836 PMCID: PMC7682815 DOI: 10.1371/journal.pcbi.1008415
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Distribution of sequences among the 88 Rfam classes downloaded from the Rfam database.
Distribution of downloaded Rfam classes among non-coding macro classes.
| non-coding class | Rfam classes |
|---|---|
| snRNA snoRNA | RF00003, RF00004, RF00007, RF00012, RF00015, RF00016, RF00020, RF00026, RF00066, RF00097, RF00149, RF00156, RF00191, RF00309, RF00321, RF00409, RF00432, RF00548, RF00560, RF00561, RF00619, RF01210 |
| Cis-regulatory | RF00037, RF00050, RF00059, RF00080, RF00162, RF00167, RF00168, RF00174, RF00234, RF00379, RF00380, RF00391, RF00442, RF00485, RF00504, RF00515, RF00521, RF00524, RF00557, RF01051, RF01055, RF01057, RF01068, RF01073, RF01497, RF01726, RF01731, RF01734, RF01750, RF02271, RF02913, RF02914 |
| miRNA | RF00104, RF00451, RF00639, RF00641, RF00643, RF00645, RF00865, RF00875, RF00876, RF00882, RF00886, RF00906, RF01059, RF01911, RF01942, RF02000, RF02096 |
| sRNA | RF00519, RF01687, RF01690, RF01699, RF01705, RF02924, RF03064 |
| Intron | RF00029, RF01998, RF01999, RF02001, RF02003, RF02012 |
| rRNA | RF00001, RF00002 |
| tRNA | RF00005, RF01852 |
Fig 2. Distribution of sequence lengths among the 88 Rfam classes downloaded from the Rfam database.
Computational cost required to build the input representations of a sequence of length N. (The extraction dropped the cost column; the values below are reconstructed: the four sequence encodings touch each nucleotide once, hence O(N), while iPknot and ViennaRNA compute base-pairing probability matrices by dynamic programming, hence cubic cost.)
| Input representation | Computational cost | Adopted in |
|---|---|---|
| Hilbert | O(N) | Noviello (this study) |
| Morton | O(N) | Noviello (this study) |
| Snake | O(N) | Noviello (this study) |
| k-mer | O(N) | Noviello (this study) |
| iPknot | O(N³) | nRC |
| ViennaRNA | O(N³) | EDeN |
Fig 3. k-mer representation: examples of 1-, 2-, and 3-mer encodings.
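The k-mer encoding can be sketched as follows — a minimal illustration, not the authors' exact implementation: each non-overlapping window of k nucleotides is mapped to a one-hot vector over the 4^k possible k-mers.

```python
from itertools import product

def kmer_one_hot(seq, k=2):
    """Encode a sequence as one-hot vectors over non-overlapping k-mers."""
    alphabet = [''.join(p) for p in product('ACGU', repeat=k)]
    index = {kmer: i for i, kmer in enumerate(alphabet)}
    vectors = []
    for start in range(0, len(seq) - k + 1, k):
        vec = [0] * len(alphabet)          # 4**k dimensions per position
        vec[index[seq[start:start + k]]] = 1
        vectors.append(vec)
    return vectors

encoded = kmer_one_hot('ACGUAC', k=2)  # 3 windows, each a 16-dim one-hot vector
```

With k = 1 this reduces to the familiar one-hot nucleotide encoding over a 4-symbol alphabet.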
Fig 4. Examples of bi-dimensional space-filling curves.
The raw linear 47-base-long sequence is encoded into the bi-dimensional space-filling curve depicted in blue. The padding necessary to fill the entire space is depicted in grey.
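The snake curve, the simplest of the three, can be illustrated as below — a sketch assuming boustrophedon row ordering; the Hilbert and Morton encodings differ only in the index-to-coordinate mapping.

```python
import math

def snake_encode(seq, pad='N'):
    """Place a sequence on a 2D grid row by row, reversing every other row."""
    side = math.ceil(math.sqrt(len(seq)))           # smallest square that fits
    padded = seq + pad * (side * side - len(seq))   # the grey padding in the figure
    grid = [list(padded[r * side:(r + 1) * side]) for r in range(side)]
    for r in range(1, side, 2):                     # reverse odd rows (the snake turn)
        grid[r].reverse()
    return grid

grid = snake_encode('ACGUACGUA')  # 9 bases -> 3x3 grid, no padding needed
```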
Fig 5. A graphical representation of the deep learning architecture.
The raw RNA sequence is first encoded into an input layer representation (e.g., the Hilbert space-filling curve); then up to 3 convolution layers with rectifier activation, each followed by a max-pooling layer, learn sub-sequences with functional properties. Finally, two dense layers of rectified linear units reduce the data dimension down to a softmax multi-class classification output layer.
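The stack described above can be sketched as a NumPy forward pass — a schematic with assumed kernel sizes and filter counts (the actual hyper-parameters are in the linked repository), here for a 1-mer-encoded input:

```python
import numpy as np

def conv1d(x, w):
    """Valid 1D convolution. x: (L, C_in), w: (K, C_in, C_out)."""
    K, _, _ = w.shape
    L = x.shape[0] - K + 1
    return np.array([(x[i:i + K, :, None] * w).sum(axis=(0, 1)) for i in range(L)])

def relu(x):
    return np.maximum(x, 0)

def maxpool(x, p=2):
    """Halve the sequence length by taking the max over adjacent positions."""
    L = (x.shape[0] // p) * p
    return x[:L].reshape(-1, p, x.shape[1]).max(axis=1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random((200, 4))  # 200-nt sequence, 4 channels (one-hot in practice)
h = maxpool(relu(conv1d(x, rng.standard_normal((5, 4, 8)))))    # conv block 1
h = maxpool(relu(conv1d(h, rng.standard_normal((5, 8, 16)))))   # conv block 2
h = relu(h.reshape(-1) @ rng.standard_normal((h.size, 32)))     # dense layer
probs = softmax(h @ rng.standard_normal((32, 88)))              # 88 Rfam classes
```

With weights untrained the output is meaningless, but the shapes and layer order mirror the figure: convolution, rectification, pooling, dense reduction, softmax.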
Maximum dimension allowed for each input representation and sequences of at most 200 nucleotides.
The dimensions of the Hilbert and Morton spaces are the smallest powers of two whose squares cover 200 cells, while the dimension of Snake is simply the ceiling of √200.
| Input representation | Maximum dimension allowed |
|---|---|
| Hilbert | 16 × 16 |
| Morton | 16 × 16 |
| Snake | 15 × 15 |
| k-mer | ⌈200/k⌉ |
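The dimensions in the table follow directly from the 200-nucleotide length cap; a quick check:

```python
import math

max_len = 200
# Hilbert/Morton: side must be a power of two whose square covers 200 cells
hilbert_side = 2 ** math.ceil(math.log2(math.sqrt(max_len)))
# Snake: smallest square covering 200 cells
snake_side = math.ceil(math.sqrt(max_len))
# k-mer: number of non-overlapping windows per sequence
kmer_positions = {k: math.ceil(max_len / k) for k in (1, 2, 3)}
```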
Fig 6. Classification performance, in terms of accuracy, obtained on the test set with different padding schemes.
The deep learning architecture is composed of 3 CNN layers. Confidence intervals are drawn assuming a normal distribution of the classification error.
Fig 7. Classification performance, in terms of accuracy, obtained on the test set with different numbers of CNN layers, where inputs are padded with a new symbol.
Zero indicates a dense network. Confidence intervals are drawn assuming a normal distribution of the classification error.
Fig 8. Classification performance, in terms of accuracy, obtained on the test set at different boundary noise levels.
The deep learning architecture is composed of 3 CNN layers, and inputs are padded with a new symbol.
Fig 9. Recognizing non-functional RNA with Monte Carlo Dropout.
Sequences are encoded with 1-mers, and performance is estimated in terms of area under the ROC curve (left). The figures on the right show the distributions of functional and non-functional RNA sequences as a function of information entropy (H) and top distance (D).
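The two uncertainty estimators can be sketched as follows — a minimal illustration in which H is the entropy of the mean softmax output over T dropout-enabled forward passes, and D is assumed to be the gap between the two highest mean class probabilities (exact definitions and thresholds are in the linked repository). High H or low D flags a sample as uncertain:

```python
import numpy as np

def uncertainty(prob_samples):
    """Uncertainty estimators from T stochastic (dropout-on) forward passes.

    prob_samples: array of shape (T, n_classes), one softmax output per pass.
    Returns (H, D): information entropy of the mean prediction, and the
    distance between the two highest mean class probabilities.
    """
    mean = prob_samples.mean(axis=0)
    H = -np.sum(mean * np.log(mean + 1e-12))
    top2 = np.sort(mean)[-2:]
    D = top2[1] - top2[0]
    return H, D

# A confident prediction: every pass concentrates mass on one class
confident = np.tile([0.97, 0.01, 0.02], (20, 1))
# An uncertain one: probability mass spread across classes
uncertain = np.tile([0.40, 0.35, 0.25], (20, 1))
H_c, D_c = uncertainty(confident)
H_u, D_u = uncertainty(uncertain)
```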
Overall performance improvement, in terms of Accuracy, Kappa, and MCC, after rejecting uncertain samples with Monte Carlo Dropout.
| Estimator | Approach | Accuracy | Kappa | MCC | % of rejected samples |
|---|---|---|---|---|---|
| Entropy | 3mer | 0.99 | 0.99 | 0.99 | 24.28 |
| 2mer | 0.99 | 0.99 | 0.99 | 24.32 | |
| 1mer | 0.99 | 0.99 | 0.99 | 18.79 | |
| Snake | 0.98 | 0.98 | 0.98 | 41.04 | |
| Morton | 0.98 | 0.97 | 0.97 | 38.50 | |
| Hilbert | 0.98 | 0.98 | 0.98 | 37.76 | |
| Top | 3mer | 0.98 | 0.98 | 0.98 | 20.07 |
| 2mer | 0.98 | 0.98 | 0.98 | 21.55 | |
| 1mer | 0.99 | 0.99 | 0.99 | 18.61 | |
| Snake | 0.97 | 0.97 | 0.97 | 31.17 | |
| Morton | 0.96 | 0.96 | 0.96 | 31.69 | |
| Hilbert | 0.97 | 0.97 | 0.97 | 31.09 |
Fig 10. The effect of rejecting uncertain samples (1-mer encoded) with Monte Carlo Dropout on per-class prediction performance.
Summary of results on the test13 dataset, containing 13 non-coding classes.
Results for nRC and RNAGCN are taken from [9].
| Architecture | Approach | Accuracy | Recall | Precision | F1-score | MCC |
|---|---|---|---|---|---|---|
| EDeN | 0.67 | 0.60 | 0.75 | 0.65 | 0.61 | |
| nRC | 0.82 | 0.82 | 0.81 | 0.82 | 0.80 | |
| RNAGCN | 0.86 | 0.86 | 0.86 | 0.86 | 0.85 | |
| RNN 50 nodes | 1mer | 0.86 | 0.86 | 0.86 | 0.86 | 0.85 |
| 2mer | 0.77 | 0.77 | 0.77 | 0.77 | 0.75 | |
| 3mer | 0.77 | 0.77 | 0.77 | 0.77 | 0.75 | |
| RNN 100 nodes | 1mer | 0.88 | 0.88 | 0.88 | 0.87 | 0.87 |
| 2mer | 0.79 | 0.79 | 0.79 | 0.79 | 0.77 | |
| 3mer | 0.78 | 0.78 | 0.79 | 0.78 | 0.75 | |
| RNN 150 nodes | 1mer | 0.89 | 0.90 | 0.90 | 0.90 | 0.89 |
| 2mer | 0.80 | 0.80 | 0.80 | 0.79 | 0.78 | |
| 3mer | 0.79 | 0.79 | 0.79 | 0.79 | 0.77 | |
| CNN standard | 1mer | 0.88 | 0.88 | 0.89 | 0.88 | 0.87 |
| 2mer | 0.83 | 0.83 | 0.84 | 0.83 | 0.82 | |
| 3mer | 0.81 | 0.81 | 0.82 | 0.81 | 0.79 | |
| Morton | 0.78 | 0.78 | 0.79 | 0.78 | 0.77 | |
| Snake | 0.82 | 0.82 | 0.83 | 0.81 | 0.80 | |
| Hilbert | 0.81 | 0.81 | 0.84 | 0.82 | 0.80 | |
| CNN improved | 1mer | |||||
| 2mer | 0.92 | 0.92 | 0.92 | 0.92 | 0.91 | |
| 3mer | 0.88 | 0.88 | 0.88 | 0.88 | 0.86 | |
| Morton | 0.86 | 0.86 | 0.88 | 0.86 | 0.85 | |
| Snake | 0.86 | 0.86 | 0.88 | 0.86 | 0.85 | |
| Hilbert | 0.86 | 0.87 | 0.89 | 0.87 | 0.86 |