Md Nafis Ul Alam, Umar Faruq Chowdhury.
Abstract
High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data have become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all other life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes and, owing to the intrinsic genomic complexity of viruses, RNA viruses have been completely overlooked. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near-perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data.
Year: 2020 PMID: 32946529 PMCID: PMC7500682 DOI: 10.1371/journal.pone.0239381
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
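The short k-mer scoring the abstract describes can be illustrated with a minimal sketch. Note this is an assumption-laden illustration, not the authors' exact scoring scheme: k = 3 and plain frequency normalization are chosen for brevity.

```python
from collections import Counter
from itertools import product

def kmer_profile(seq, k=3):
    """Normalized abundance of every possible DNA k-mer (4**k features)."""
    # Count all overlapping k-mers along the sequence
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    # Fixed-order feature vector over the canonical ACGT alphabet
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]

vec = kmer_profile("ATGGCATGCA")
print(len(vec), round(sum(vec), 6))  # 64 1.0
```

Each sequence thus maps to a fixed-length numeric vector suitable as classifier input, regardless of sequence length.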
Fig 1. Feature selection algorithms applied at successive levels to scale the model.
Tree-based and Lasso regularization-based feature selection, applied successively at two levels, yielded a total of 6 feature sets under the corresponding selection criteria.
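The two-level selection described above can be sketched with scikit-learn. The estimators, thresholds and regularization strength shown are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 5460))      # toy k-mer score matrix (samples x features)
y = rng.integers(0, 2, 200)      # 1 = viral RNA, 0 = human transcript

# Level 1: tree-based importance filter
tree_sel = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0)
).fit(X, y)
X_tree = tree_sel.transform(X)

# Level 2: Lasso (L1) regularization filter on the surviving features
lasso_sel = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X_tree, y)
X_final = lasso_sel.transform(X_tree)

print(X.shape[1], X_tree.shape[1], X_final.shape[1])
```

Varying the selector (tree importance vs. L1 coefficients) and its threshold at each level is what yields multiple filtered feature sets.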
Top k-mer features ranked by the summed feature importance assigned by our feature selection workflow.
| Summed Feature Importance | Features |
|---|---|
| 6 | |
| 5 | |
| 4 | |
The summed feature importance is the number of times a feature appears across the six filtered feature sets obtained from our selection algorithm. The remainder of the table can be found in S1 Table.
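A minimal sketch of computing the summed score described above; the k-mer sets are made-up placeholders, not the authors' features.

```python
from collections import Counter

# Six filtered feature sets (placeholder k-mers for illustration only)
feature_sets = [
    {"AAC", "GCA", "TTG"}, {"AAC", "GCA"}, {"AAC", "TTG"},
    {"AAC", "GCA"}, {"AAC"}, {"AAC", "GCA", "CGT"},
]

# Summed importance = number of filtered sets each k-mer survives
summed = Counter(kmer for s in feature_sets for kmer in s)
print(summed.most_common(3))  # [('AAC', 6), ('GCA', 4), ('TTG', 2)]
```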
Performance metrics of the 18 fitted models and the selected model.
| Part 1: Performance of 18 fitted models on cross-validation and test sets | |||||||||||
| Model | Features | True Positives | False Positives | True Negatives | False Negatives | cv_bin | cv_bin std | cv_mac | cv_mac std | f1bin | f1mac |
| Logistic Regression | 5460 | 1049 | 4 | 3844 | 1 | 0.997559 | 0.001668 | 0.998457 | 0.001054 | 0.997622 | 0.998486 |
| Logistic Regression | 194 | 1020 | 0 | 3878 | 0 | 0.999194 | 0.001485 | 0.999489 | 0.000942 | 1 | 1 |
| Logistic Regression | 68 | 1025 | 10 | 3851 | 12 | 0.983759 | 0.005604 | 0.989727 | 0.003549 | 0.989382 | 0.993267 |
| Linear SVM | 5460 | 1010 | 2 | 3885 | 1 | 0.995825 | 0.000784 | 0.997351 | 0.000498 | 0.998517 | 0.999066 |
| Linear SVM | 194 | 1081 | 0 | 3817 | 0 | 0.99885 | 0.001281 | 0.999275 | 0.000808 | 1 | 1 |
| Linear SVM | 68 | 954 | 13 | 3922 | 9 | 0.985245 | 0.005137 | 0.990608 | 0.003267 | 0.988601 | 0.992902 |
| RBF Kernel SVM | 5460 | 995 | 1 | 3897 | 5 | 0.996293 | 0.00178 | 0.99765 | 0.001127 | 0.996994 | 0.998112 |
| RBF Kernel SVM | 194 | 1049 | 0 | 3848 | 1 | 0.99656 | 0.001552 | 0.997829 | 0.000978 | 0.999524 | 0.999697 |
| RBF Kernel SVM | 68 | 1012 | 0 | 3883 | 3 | 0.997572 | 0.001661 | 0.998463 | 0.001051 | 0.99852 | 0.999067 |
| Decision Tree | 5460 | 964 | 51 | 3817 | 66 | 0.939809 | 0.007824 | 0.961954 | 0.005977 | 0.942787 | 0.963846 |
| Decision Tree | 194 | 1006 | 67 | 3775 | 50 | 0.947326 | 0.006255 | 0.966403 | 0.004345 | 0.945045 | 0.964892 |
| Decision Tree | 68 | 1013 | 54 | 3786 | 45 | 0.943388 | 0.009383 | 0.964025 | 0.006727 | 0.953412 | 0.970253 |
| Random Forest | 5460 | 1032 | 0 | 3847 | 19 | 0.989437 | 0.003518 | 0.991985 | 0.002445 | 0.990879 | 0.994208 |
| Random Forest | 194 | 973 | 1 | 3911 | 13 | 0.991293 | 0.003575 | 0.994484 | 0.002069 | 0.992857 | 0.995535 |
| Random Forest | 68 | 987 | 3 | 3886 | 22 | 0.988948 | 0.003525 | 0.993217 | 0.001624 | 0.987494 | 0.992144 |
| Naïve Bayes | 5460 | 973 | 377 | 3493 | 55 | 0.81459 | 0.013953 | 0.877261 | 0.009596 | 0.818335 | 0.880049 |
| Naïve Bayes | 194 | 997 | 23 | 3867 | 11 | 0.982157 | 0.005434 | 0.988655 | 0.003468 | 0.983235 | 0.989429 |
| Naïve Bayes | 68 | 970 | 95 | 3812 | 21 | 0.942488 | 0.012554 | 0.963099 | 0.0081 | 0.94358 | 0.965126 |
| Part 2: Performance of the Selected Model and our Random Forest Model on Divergent Data | |||||||||||
| Model | Features | True Positives | False Positives | True Negatives | False Negatives | f1bin | f1mac | ||||
| rbf-SVM68 | 68 | 1855 | 92 | 11259 | 677 | 0.82830989 | 0.897644 | ||||
| RF68 | 68 | 1814 | 32 | 11319 | 718 | 0.8286889 | 0.898311 | ||||
Part 1 lists the performance metrics on the test splits of the training data; Part 2 lists the performance on the divergent data comprising viruses of other genome types and mouse RNA transcripts.
a cv_bin: binary f1 score on cross-validation sets.
b cv_mac: macro-averaged f1 score on cross-validation sets.
c f1bin: binary f1 score on test sets.
d f1mac: macro-averaged f1 score on test sets.
e std: standard deviation.
Fig 2. AUROC for the 18 fitted models trained with varying numbers of selected features.
(a) Performance of all classifiers with all features. (b) Performance with 194 features after the first round of selection. (c) Performance with 68 selected features. The weakest-performing model was Gaussian Naïve Bayes, at 0.96 AUC. No significant decline in performance was noted when the feature count was reduced from 194 to 68.
Fig 3. Leave-one-family-out cross-validation results heatmap.
For every RNA virus family on the horizontal axis, there is a performance score for each model on the vertical axis. Apart from the Decision Tree and Naïve Bayes models, performance was reasonably uniform throughout the map.
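Leave-one-family-out cross-validation can be sketched with scikit-learn's `LeaveOneGroupOut`, where the virus family is the group label. The data, labels and family names below are illustrative, not the paper's data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.random((120, 68))                 # 68 selected k-mer features (toy)
y = rng.integers(0, 2, 120)               # 1 = viral, 0 = human (toy labels)
families = rng.choice(["Filoviridae", "Coronaviridae", "Flaviviridae"], 120)

logo = LeaveOneGroupOut()                 # one fold per virus family
for train_idx, test_idx in logo.split(X, y, groups=families):
    model = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    held_out = set(families[test_idx])    # exactly one family held out
    print(held_out, round(model.score(X[test_idx], y[test_idx]), 3))
```

This protocol forces the classifier to score sequences from a family it never saw during training, which is the stringency the heatmap visualizes.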
Fig 4. Performance of the rbf-SVM68 and RF68 models on the divergent data set.
To assess the capacity of our models to generalize to new examples, we attempted to confound them with sequence data from viruses of other genome types (i.e., other than plus- and minus-sense RNA genomes) and RNA transcripts from mice (listed in S3 Table). Model performance metrics remained robust.
Performance metrics of the rbf-SVM68 model on real data.
| Data Set | Transcript/Software Filter | Total Transcripts | True Positives | False Positives | True Negatives | False Negatives | f1binary | f1macro | AUROC | Unary Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Human Transcriptome Assembly (Unary) | All | 6303331 | N/A | 2327082 | 3976249 | N/A | N/A | 0.38681 | N/A | 63% |
| Human Transcriptome Assembly (Unary) | Length Filtered | 146382 | N/A | 7067 | 139315 | N/A | N/A | 0.487632 | N/A | 95% |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | All | 1179617 | 5318 | 331849 | 540622 | 1828 | 0.03089 | 0.432644 | 0.799 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | Both Classes Length Filtered | 142803 | 49 | 3799 | 138906 | 49 | 0.024835 | 0.505587 | 0.776 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | Only Human Length Filtered | 149851 | 5318 | 3799 | 138906 | 1828 | 0.684 | 0.817074 | 0.951 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | bridger | 83744 | 16 | 20252 | 63452 | 24 | 0.001576 | 0.431906 | 0.6 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | idba-tran | 78116 | 30 | 19860 | 58182 | 44 | 0.003005 | 0.42847 | 0.604 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | soap-trans-default | 131711 | 19 | 44147 | 87538 | 7 | 0.0086 | 0.399727 | 0.793 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | rna-spades | 206133 | 372 | 71543 | 134089 | 129 | 0.010274 | 0.39969 | 0.775 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | spades | 93814 | 4488 | 25645 | 62559 | 1122 | 0.251126 | 0.537447 | 0.843 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | trinity | 97849 | 32 | 22936 | 74839 | 42 | 0.002778 | 0.434846 | 0.611 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | oases | 178498 | 180 | 34721 | 143373 | 224 | 0.010197 | 0.450784 | 0.691 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | trans-abyss | 213005 | 171 | 68268 | 144346 | 220 | 0.004969 | 0.406611 | 0.583 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | binpacker | 12291 | 7 | 800 | 11468 | 16 | 0.016867 | 0.491256 | 0.657 | N/A |
| Human Cells Incubated with Ebola Virus, Transcriptome Assembly | shannon | 84456 | 3 | 23677 | 60776 | 0 | 0.00253 | 0.418611 | 0.862 | N/A |
a Unary accuracy is used to assess performance for test sets containing only one class of examples and thus rendering f1 scores uninformative.
Fig 5. Performance on RNA-Seq assembly data from human cells cultured with Ebola virus.
The model performed best, with an AUC of 0.95, when human transcripts shorter than the training length range were left out. Filtering out the short viral transcripts as well as the human ones lowered the curve, owing to the skew toward examples of the human class.
Fig 6. Performance on RNA-Seq assemblies of human cells cultured with Ebola virus, by assembly software.
There is significant variation in performance across the different assembly software. SPAdes gave the most uniform, if modest, performance, whereas more popular assembly tools such as Trinity, Oases and Trans-ABySS performed uniformly worse.
Fig 7. Count of detected ORFs in RNA virus sequences as a function of the minimum length cutoff.
We would expect all viral genomes to encode a minimal set of proteins necessary for successful replication, and the number of these proteins to deviate minimally across viruses of different types. The graph shows that a cutoff between 100 and 150 yields the optimal point, where only important ORFs with a greater probability of being involved in the information pathway of the virus are retained.
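The cutoff idea can be illustrated with a minimal ORF counter. The scanning rules here (ATG start, canonical stops, all six frames, non-overlapping hits) are a simplification for illustration, not the authors' exact ORF detection method.

```python
import re

def count_orfs(seq, min_aa=100):
    """Count ORFs (ATG..stop) of at least min_aa codons in all six frames."""
    comp = str.maketrans("ACGT", "TGCA")
    strands = [seq, seq.translate(comp)[::-1]]  # forward and reverse complement
    n = 0
    for strand in strands:
        for frame in range(3):
            codons = re.findall("...", strand[frame:])
            i = 0
            while i < len(codons):
                if codons[i] == "ATG":
                    for j in range(i, len(codons)):
                        if codons[j] in ("TAA", "TAG", "TGA"):
                            if j - i >= min_aa:
                                n += 1  # ORF long enough to keep
                            i = j       # resume scanning after the stop
                            break
                i += 1
    return n

print(count_orfs("ATG" + "GCA" * 120 + "TAA"))  # 1 ORF of 121 codons
```

Sweeping `min_aa` over a sequence set and plotting the resulting counts reproduces the kind of curve shown in Fig 7.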
Performance evaluation of rbf-SVM68 against BLAST and HMMER3.
| Tool | Cutoff | Query | Subject | False Hits | Our Model | False Positives | True Negatives | Accuracy |
|---|---|---|---|---|---|---|---|---|
| BLASTn | None | Human Transcriptome De novo Assembly | RNA Virus Sequence Database | 135 | rbf-SVM68 | 34 | 101 | 75% |
| BLASTn | None | Mouse Transcriptome De novo Assembly | RNA Virus Sequence Database | 130 | rbf-SVM68 | 37 | 93 | 72% |
| BLASTx | None | Human Transcriptome De novo Assembly | Informative Proteins Only | 505794 | rbf-SVM68 | 191958 | 315528 | 62% |
| BLASTx | 0.01 | Human Transcriptome De novo Assembly | Informative Proteins Only | 3392 | rbf-SVM68 | 1214 | 2178 | 64.20% |
| BLASTx | 0.0001 | Human Transcriptome De novo Assembly | Informative Proteins Only | 34 | rbf-SVM68 | 10 | 24 | 70.60% |
| hmmsearch on informative vFams | None | Human Transcriptome De novo Assembly | Informative vFam Profiles Only | 26003 | rbf-SVM68 | 11914 | 14089 | 54% |
Fig 8. Performance of the rbf-SVM model trained with the same computational pipeline for classifying positive- and negative-sense RNA viruses.
(a) Performance on different train-test splits when trained with all 5460 features. (b) Performance when trained with 194 features. (c) Performance when trained with 68 selected features. All models discriminate the two sequence classes well, demonstrating the flexibility of our pipeline.
Performance of rbf-SVM in classifying positive- and negative-sense RNA viruses.
| Number of Features | Train-Test Split | True Positives | False Positives | True Negatives | False Negatives | cv_binary mean | cv_binary std | cv_macro mean | cv_macro std | f1binary | f1macro |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5460 | 4:1 | 145 | 0 | 1422 | 6 | 0.972715 | 0.012229 | 0.985083 | 0.006681 | 0.97973 | 0.988812 |
| 5460 | 2:3 | 308 | 0 | 3445 | 21 | 0.966849 | 0.028873 | 0.981788 | 0.015834 | 0.967033 | 0.981997 |
| 5460 | 1:5 | 416 | 0 | 4587 | 29 | 0.950275 | 0.056018 | 0.97274 | 0.03045 | 0.966318 | 0.981584 |
| 194 | 4:1 | 125 | 0 | 1435 | 13 | 0.9759 | 0.013335 | 0.986788 | 0.007303 | 0.95057 | 0.973031 |
| 194 | 2:3 | 298 | 0 | 3453 | 23 | 0.968434 | 0.010478 | 0.982572 | 0.005786 | 0.962843 | 0.979762 |
| 194 | 1:5 | 389 | 0 | 4575 | 68 | 0.924052 | 0.043204 | 0.95789 | 0.023325 | 0.919622 | 0.956122 |
| 68 | 4:1 | 125 | 3 | 1430 | 15 | 0.945588 | 0.026894 | 0.970244 | 0.014663 | 0.932836 | 0.963291 |
| 68 | 2:3 | 281 | 6 | 3444 | 43 | 0.924708 | 0.031719 | 0.958641 | 0.01733 | 0.919804 | 0.95637 |
| 68 | 1:5 | 383 | 9 | 4575 | 65 | 0.896827 | 0.065865 | 0.943638 | 0.035787 | 0.911905 | 0.951941 |
Fig 9. Diagrammatic representation of our complete experimental design.
Fig 10. Suggested approach for acquiring the best automated annotations.
Our explorations indicate that no single annotation tool produces the best results without further curation. Perhaps the best way to use these tools is to stack their results on top of each other, removing the maximum number of false positives while consolidating the true positive results.
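The suggested stacking can be sketched as a simple set intersection over the candidate viral contig IDs each tool reports; the contig names below are illustrative placeholders.

```python
# Toy candidate viral contig IDs from three annotation approaches
blast_hits = {"contig_1", "contig_7", "contig_9"}
hmmer_hits = {"contig_4", "contig_7", "contig_9"}
model_hits = {"contig_2", "contig_7", "contig_9"}

# Keep only consensus calls: contigs every tool flagged as viral
consensus = blast_hits & hmmer_hits & model_hits
print(sorted(consensus))  # ['contig_7', 'contig_9']
```

Requiring agreement across tools trades some sensitivity for a large reduction in false positives, which matches the curation burden described above.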