Thaís A R Ramos, Nilbson R O Galindo, Raúl Arias-Carrasco, Cecília F da Silva, Vinicius Maracaja-Coutinho, Thaís G do Rêgo.
Abstract
Non-coding RNAs (ncRNAs) are important players in cellular regulation in organisms from different kingdoms. A key step in ncRNA research is distinguishing coding from non-coding sequences. We applied seven machine learning algorithms (Naive Bayes, Support Vector Machine, K-Nearest Neighbors, Random Forest, Extreme Gradient Boosting, Neural Networks and Deep Learning) to model organisms from different evolutionary branches to create a stand-alone and web server tool (RNAmining) that distinguishes coding and non-coding sequences. First, we downloaded coding and non-coding sequences from Ensembl (April 14th, 2020). The coding and non-coding sets were then balanced, their trinucleotide counts were computed (64 features), and the counts were normalized by sequence length, resulting in a total of 180 models. The machine learning algorithms were validated using 10-fold cross-validation, and we selected the algorithm with the best results (eXtreme Gradient Boosting) for implementation in RNAmining. The best F1-scores ranged from 97.56% to 99.57%, depending on the organism. Moreover, we benchmarked RNAmining against other tools already described in the literature (CPAT, CPC2, RNAcon and TransDecoder), and it outperformed them. Both the stand-alone and web server versions of RNAmining are freely available at https://rnamining.integrativebioinformatics.me/.
Keywords: Machine Learning; benchmarking; coding potential prediction; non-coding RNA
Year: 2021 PMID: 34164114 PMCID: PMC8201426 DOI: 10.12688/f1000research.52350.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. A. Taxonomic tree of the organisms used for model building (black) and validation (red). B. Pipeline used to perform the benchmarking and create the tool. First, we downloaded the coding and non-coding sequences from Ensembl. Next, we computed the trinucleotide counts and normalized them by sequence length. We then benchmarked the 7 machine learning algorithms and selected the one with the best performance (XGBoost) for implementation in the RNAmining tool, which was evaluated again using sequences from 9 other species and sets of artificially generated sequences. Finally, we benchmarked RNAmining against the publicly available tools for coding potential prediction.
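The feature-extraction step of the pipeline (trinucleotide counts normalized by sequence length, 64 features) can be illustrated with a minimal Python sketch. This is not the RNAmining source code: the function name `trinucleotide_features`, the overlapping sliding window, and the handling of ambiguous bases are assumptions made for illustration.

```python
from itertools import product

# All 64 possible trinucleotides over the DNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_features(sequence):
    """Return 64 trinucleotide counts, each divided by the sequence length.

    Hypothetical helper: the window scheme (overlapping, step 1) is an
    assumption of this sketch, not something stated in the source.
    """
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA input
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows containing N or other ambiguity codes
            counts[kmer] += 1
    length = len(seq)
    if length == 0:
        return [0.0] * len(TRINUCLEOTIDES)
    return [counts[k] / length for k in TRINUCLEOTIDES]


features = trinucleotide_features("ATGATG")
print(len(features))
```

Each sequence is mapped to a fixed-length vector regardless of its length, which is what allows the same classifier to be applied to transcripts of very different sizes.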
Set of sequences used in the training and testing datasets.
List of organisms and the total number of coding and non-coding sequences used for training and testing. The numbers are given as training/testing values. All sequences can be retrieved from the RNAmining website (https://rnamining.integrativebioinformatics.me/download).
| Species | Total | Coding | ncRNAs |
|---|---|---|---|
| | | | |
| | 12,542 / 3,136 | 6,243 / 1,596 | 6,299 / 1,540 |
| | 11,260 / 2,816 | 5,626 / 1,412 | 5,634 / 1,404 |
| | 7,388 / 1,848 | 3,700 / 918 | 3,688 / 930 |
| | 12,984 / 3,246 | 6,527 / 1,588 | 6,457 / 1,658 |
| | 1,742 / 436 | 867 / 222 | 875 / 214 |
| | 16,851 / 4,213 | 8,426 / 2,106 | 8,425 / 2,107 |
| | 92,844 / 23,212 | 46,575 / 11,453 | 46,269 / 11,759 |
| | 4,668 / 1,168 | 2,344 / 574 | 2,324 / 594 |
| | 34,336 / 8,584 | 17,113 / 4,347 | 17,223 / 4,237 |
| | 35,272 / 8,818 | 17,668 / 4,377 | 17,604 / 4,441 |
| | 2,705 / 677 | 1,351 / 340 | 1,354 / 337 |
| | 12,604 / 3,152 | 6,280 / 1,598 | 6,324 / 1,554 |
| | 4,243 / 1,061 | 2,107 / 545 | 2,136 / 516 |
| | 1,456 / 364 | 723 / 187 | 733 / 177 |
| | 2,224 / 556 | 1,120 / 270 | 1,104 / 286 |
| | | | |
| | 11,308 | 5,654 | 5,654 |
| | 50,558 | 25,279 | 25,279 |
| | 15,004 | 7,502 | 7,502 |
| | 31,808 | 15,904 | 15,904 |
| | 15,978 | 7,989 | 7,989 |
| | 1,486 | 743 | 743 |
| | 18,662 | 9,331 | 9,331 |
| | 848 | 424 | 424 |
| | 2,054 | 1,027 | 1,027 |
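In the table above, the coding and non-coding totals of each training/testing species are equal (for example, 6,243 + 1,596 = 6,299 + 1,540 = 7,839), and the testing set is close to 20% of the total. A minimal sketch of one way to obtain such a dataset is shown below; the downsampling strategy, the function name `balance_and_split`, and the fixed seed are assumptions, not the authors' documented procedure.

```python
import random


def balance_and_split(coding, noncoding, test_fraction=0.2, seed=0):
    """Balance two classes by random downsampling, then split train/test.

    Hypothetical helper for illustration only. Labels: 1 = coding,
    0 = non-coding (an arbitrary convention chosen for this sketch).
    """
    rng = random.Random(seed)
    n = min(len(coding), len(noncoding))
    # Downsample the larger class so both classes contribute n sequences.
    labelled = [(s, 1) for s in rng.sample(coding, n)]
    labelled += [(s, 0) for s in rng.sample(noncoding, n)]
    # Shuffle the pooled set, then hold out a fraction for testing.
    rng.shuffle(labelled)
    n_test = round(len(labelled) * test_fraction)
    return labelled[n_test:], labelled[:n_test]


coding = [f"coding_{i}" for i in range(10)]
noncoding = [f"noncoding_{i}" for i in range(30)]
train, test = balance_and_split(coding, noncoding)
print(len(train), len(test))
```

Because the split is drawn from the pooled, shuffled set, the two classes are balanced overall but only approximately balanced within the training and testing subsets, which matches the pattern of counts in the table.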
Benchmarking machine learning methods for coding potential prediction based on trinucleotides count.
F1-score for each of the 15 species on which the algorithms were tested. Other metrics (sensitivity, specificity, precision, accuracy and the confusion matrix) used to compare the algorithms' performance are available in the Extended data: Supplementary File S2.
| Species | ANN | CNN | K-NN | NAIVE BAYES | RANDOM FOREST | SVM | XGBoost |
|---|---|---|---|---|---|---|---|
| | 98.47 | 98.31 | 93.55 | 95.50 | 98.30 | 98.03 | |
| | 96.54 | 96.02 | 93.54 | 93.13 | 96.89 | 96.04 | |
| | 96.74 | 96.48 | 93.67 | 93.93 | 97.26 | 96.35 | |
| | 97.54 | 97.77 | 95.44 | 94.55 | 97.56 | 97.27 | |
| | 94.88 | 95.69 | 92.24 | 94.57 | 97.35 | 95.82 | |
| | 98.47 | 98.27 | 96.87 | 95.11 | 98.91 | 98.06 | |
| | 98.01 | 97.66 | 96.63 | 86.00 | 98.30 | 96.83 | |
| | 99.05 | 98.72 | 91.61 | 98.23 | 99.56 | 99.24 | |
| | 98.39 | 98.09 | 97.11 | 95.31 | 98.67 | 98.01 | |
| | 96.67 | 96.96 | 95.95 | 91.56 | 97.66 | 96.10 | |
| | 95.90 | 94.10 | 87.77 | 89.81 | 94.94 | 95.73 | |
| | 97.23 | 96.59 | 93.59 | 91.45 | 96.99 | 96.38 | |
| | 98.40 | 98.26 | 88.10 | 95.99 | 98.79 | 97.49 | |
| | 97.83 | 96.97 | 78.41 | 96.70 | 96.46 | 95.29 | |
| | 98.28 | 98.81 | 85.53 | 97.14 | 98.88 | 97.20 | |
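For reference, the F1-score reported in the table above and the other metrics named in its caption can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `classification_metrics` is hypothetical, and which class (coding or non-coding) is treated as "positive" is an assumption left to the caller.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall / true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)     # positive predictive value
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print({name: round(100 * value, 2) for name, value in metrics.items()})
```

The values are multiplied by 100 in the usage line only to match the percentage scale used in the tables.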
Evaluation (F1-score) of the models generated by XGBoost, the method implemented in RNAmining, on evolutionarily related and unrelated organisms.
Each row corresponds to the model trained on one species, while the columns represent the set of 9 evolutionarily related and unrelated organisms on which the method was evaluated. Other metrics (sensitivity, specificity, precision, accuracy and the confusion matrix) used for the comparisons are available in the Extended data: Supplementary File S3.
| Testing | Arabidopsis | Caenorhabditis | Carassius | Drosophila | Gorilla | Pseudonaja | Rattus | Saccharomyces | Terrapene |
|---|---|---|---|---|---|---|---|---|---|
| | 95.35 | 89.97 | 94.77 | 97.16 | 95.17 | 96.56 | 96.74 | 93.07 | 95.83 |
| | 97.24 | 97.79 | 95.97 | 98.13 | 97.01 | 97.73 | 97.15 | 96.09 | 98.10 |
| | 96.19 | 96.76 | 95.73 | 97.87 | 97.01 | 96.90 | 97.25 | 95.07 | 97.56 |
| | 96.64 | 90.50 | 95.29 | 97.96 | 97.24 | 96.89 | 96.42 | 93.96 | 96.62 |
| | 94.90 | 95.57 | 94.80 | 96.73 | 95.34 | 95.43 | 95.76 | 91.49 | 95.51 |
| | 97.60 | 97.89 | 95.76 | 98.02 | 97.93 | 97.79 | 97.59 | 96.48 | 97.69 |
| | 95.71 | 81.25 | 92.19 | 96.44 | 97.73 | 96.24 | 94.60 | 93.57 | 95.65 |
| | 93.71 | 96.78 | 91.63 | 86.25 | 96.30 | 93.39 | 94.37 | 95.47 | 95.63 |
| | 97.40 | 97.91 | 95.69 | 98.04 | 97.90 | 97.53 | 97.46 | 93.54 | 97.31 |
| | 96.44 | 87.68 | 94.66 | 97.17 | 97.57 | 97.31 | 96.67 | 94.32 | 96.30 |
| | 97.16 | 97.54 | 95.22 | 97.46 | 97.35 | 97.37 | 96.79 | 94.96 | 97.22 |
| | 97.39 | 97.48 | 95.39 | 87.74 | 97.32 | 97.86 | 97.29 | 94.67 | 97.53 |
| | 93.31 | 94.48 | 92.07 | 87.74 | 95.81 | 93.47 | 94.72 | 92.48 | 95.56 |
| | 94.00 | 97.07 | 91.94 | 86.89 | 96.60 | 93.95 | 94.12 | 95.02 | 95.81 |
| | 93.46 | 96.65 | 91.53 | 84.86 | 95.51 | 93.68 | 93.16 | 94.42 | 95.02 |
Benchmarking results from RNAmining and the other tools already described in the literature according to organisms from different evolutionary branches.
F1-score for CPAT, CPC2, RNAcon, TransDecoder and RNAmining, based on predictions using the models provided by each tool or generated according to their instructions. The bold numbers are the best F1-score values. The results for other metrics are available in the Extended data: Supplementary File S2.
| Species | CPAT | CPC2 | RNAcon | TransDecoder | RNAmining |
|---|---|---|---|---|---|
| | 94.55 | 86.87 | 83.03 | 88.26 | |
| | 92.56 | 89.01 | 82.36 | 84.80 | |
| | 94.07 | 92.48 | 84.32 | 87.63 | |
| | 94.64 | 87.17 | 80.97 | 87.74 | |
| | 95.59 | 78.82 | 75.84 | 76.26 | |
| | 96.95 | 90.69 | 75.81 | 83.50 | |
| | 95.20 | 75.85 | 71.73 | 76.02 | |
| | | 91.60 | 97.45 | 98.86 | |
| | 96.24 | 91.44 | 80.90 | 85.22 | |
| | 95.48 | 81.40 | 76.78 | 80.80 | |
| | 85.19 | 86.29 | 84.83 | 83.44 | |
| | 87.47 | 72.04 | 84.73 | 84.63 | |
| | 96.59 | 75.14 | 95.11 | 96.68 | |
| | 97.61 | 91.91 | 97.86 | 95.24 | |
| | 99.07 | 97.92 | 98.70 | 97.77 | |
Evaluation of RNAmining performance according to different sets of artificial sequences from each trained model.
F1-scores for 10 datasets of randomly generated artificial sequences for each species. The mean, minimum and maximum values are shown for each organism. Other metrics (sensitivity, specificity, precision, accuracy and the confusion matrix) used for the comparisons are available in the Extended data: Supplementary File S4.
| Species | MEAN | MINIMUM | MAXIMUM |
|---|---|---|---|
| | 1.66 | 0.86 | 2.44 |
| | 1.08 | 0.70 | 1.40 |
| | 0.95 | 0.43 | 1.72 |
| | 1.25 | 0.12 | 2.21 |
| | 2.31 | 0.90 | 3.51 |
| | 2.48 | 1.88 | 2.89 |
| | 11.15 | 10.53 | 11.52 |
| | 24.86 | 21.95 | 27.03 |
| | 1.34 | 1.00 | 1.18 |
| | 6.64 | 5.74 | 7.58 |
| | 1.80 | 0.58 | 3.99 |
| | 3.62 | 2.67 | 5.04 |
| | 64.13 | 62.99 | 65.76 |
| | 37.43 | 31.72 | 41.84 |
| | 23.26 | 17.65 | 28.21 |
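The source does not state how the artificial sequences were generated. As a rough illustration of the kind of experiment summarized in the table above, the sketch below draws random nucleotide sequences and reduces a list of per-dataset scores to mean, minimum and maximum. The uniform base composition, the length range, and the names `random_sequences` and `summarise` are all assumptions of this sketch.

```python
import random
from statistics import mean


def random_sequences(n, min_len, max_len, rng):
    """Generate n random DNA sequences with uniformly sampled bases.

    Assumption: uniform A/C/G/T composition and uniformly sampled lengths;
    the generation scheme actually used for RNAmining is not given here.
    """
    return [
        "".join(rng.choice("ACGT") for _ in range(rng.randint(min_len, max_len)))
        for _ in range(n)
    ]


def summarise(scores):
    """Reduce per-dataset scores to the three statistics shown in the table."""
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}


rng = random.Random(42)
# Ten independent artificial datasets, as in the evaluation described above.
datasets = [random_sequences(100, 200, 1000, rng) for _ in range(10)]
print(len(datasets), len(datasets[0]))
print(summarise([1.0, 2.0, 3.0]))
```

In a full evaluation, each dataset would be converted to features, scored by a trained model, and the resulting ten scores passed to `summarise`.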
Figure 2. RNAmining web server overview.
A. Job launcher screen (Run tab). The user only needs to upload the nucleotide sequences in FASTA format and select the model to use, based on the most closely related species. B. Results page. A general report lists the coding and non-coding sequences in a dynamic table, in which the user can search for a particular sequence or show only coding or only non-coding RNAs through a free-text field that filters the table dynamically. The user can download the complete table in tabular format, as well as two FASTA files containing the coding and non-coding RNAs separately.