Joan Carles Pons1, David Paez-Espino2, Gabriel Riera1, Natalia Ivanova2, Nikos C Kyrpides2, Mercè Llabrés1.
Abstract
MOTIVATION: Two key steps in the analysis of uncultured viruses recovered from metagenomes are the taxonomic classification of the viral sequences and the identification of putative host(s). Both steps rely mainly on the assignment of viral proteins to orthologs in cultivated viruses. Viral Protein Families (VPFs) can be used for the robust identification of new viral sequences in large metagenomics datasets. Despite the importance of VPF information for viral discovery, VPFs have not yet been explored for determining viral taxonomy and host targets.
Year: 2021 PMID: 33471063 PMCID: PMC8830756 DOI: 10.1093/bioinformatics/btab026
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Graphical representation of the pipeline for viral genome classification (steps 1 to 3). The first step is the initial classification of the VPFs by taxonomy and host at three different levels, based on information about the isolate reference viruses retrieved from the IMG/M and ViralZone databases. The second step is to classify the uncultured viral genomes (UViGs) based on hits to homogeneously classified VPFs. The third step is the final VPF classification, which adds the information from the UViG classification.
Minimal number of required hits per category
| Feature | Level | Cat 1.1 | Cat 1.2 | Cat 1.3 | Cat 1.4 |
|---|---|---|---|---|---|
| Taxonomy | Baltimore | >11 hits | >5 hits | >2 hits | >0 hits |
| | Family | >10 hits | >4 hits | >2 hits | >0 hits |
| | Genus | >6 hits | >2 hits | >1 hits | >0 hits |
| Host | Domain | >7 hits | >3 hits | >0 hits | – |
| | Family | >5 hits | >2 hits | >0 hits | – |
| | Genus | >3 hits | >1 hits | >0 hits | – |
Note: Cat denotes the category.
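
As a rough illustration of how the thresholds in this table could be applied, the following Python sketch maps a hit count to a category. It assumes that a VPF receives the most stringent category whose threshold its hit count strictly exceeds; the function name `category_for` and the dictionary layout are hypothetical and are not taken from the VPF-Class code.

```python
# Thresholds transcribed from the table above; a value t means "> t hits".
# Host levels have no category 1.4 in the table.
THRESHOLDS = {
    ("taxonomy", "baltimore"): [("1.1", 11), ("1.2", 5), ("1.3", 2), ("1.4", 0)],
    ("taxonomy", "family"): [("1.1", 10), ("1.2", 4), ("1.3", 2), ("1.4", 0)],
    ("taxonomy", "genus"): [("1.1", 6), ("1.2", 2), ("1.3", 1), ("1.4", 0)],
    ("host", "domain"): [("1.1", 7), ("1.2", 3), ("1.3", 0)],
    ("host", "family"): [("1.1", 5), ("1.2", 2), ("1.3", 0)],
    ("host", "genus"): [("1.1", 3), ("1.2", 1), ("1.3", 0)],
}


def category_for(feature, level, hits):
    """Return the first category whose threshold `hits` exceeds, or None.

    Assumption: categories are tried from most to least stringent.
    """
    for category, threshold in THRESHOLDS[(feature.lower(), level.lower())]:
        if hits > threshold:
            return category
    return None  # zero hits: no category applies
```

For example, under these assumptions `category_for("taxonomy", "baltimore", 12)` falls in category 1.1, while 11 hits only reach category 1.2 because the thresholds are strict inequalities.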
Fig. 2. Score distributions of the UViGs for taxonomic classification (left panel) and host prediction (right panel). Score values are displayed on the x-axis, while the y-axis represents the number of UViGs with the corresponding score.
Viral genomes classification–Example
| Hits | Normalized | VPF-classification |
|---|---|---|
|
| 300 | 1.1 |
|
| 250 | 1.2 |
|
| 200 | 30% |
|
| 350 | 60% |
|
| 275 | 25% |
UViGs classification
| Level | Classified UViGs | Homogeneous | 75% homogeneous |
|---|---|---|---|
| Balt. tax | 713 648 (97.64%) | 712 129 | 713 528 |
| Fam. tax | 634 397 (86.79%) | 409 625 | 525 767 |
| Genus tax | 362 962 (49.65%) | 270 390 | 307 625 |
| Domain host | 633 475 (86.67%) | 592 920 | 620 938 |
| Fam. host | 489 016 (66.90%) | 306 747 | 368 612 |
| Genus host | 461 113 (63.09%) | 289 089 | 344 538 |
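
To make the relationship between the columns easier to read, the counts above can be turned into shares of the classified UViGs. The short Python sketch below only performs that arithmetic on the transcribed counts; the names `UVIG_COUNTS` and `homogeneous_shares` are illustrative and do not come from the paper.

```python
# Counts transcribed from the "UViGs classification" table above:
# level -> (classified UViGs, homogeneous, 75% homogeneous)
UVIG_COUNTS = {
    "Balt. tax": (713_648, 712_129, 713_528),
    "Fam. tax": (634_397, 409_625, 525_767),
    "Genus tax": (362_962, 270_390, 307_625),
    "Domain host": (633_475, 592_920, 620_938),
    "Fam. host": (489_016, 306_747, 368_612),
    "Genus host": (461_113, 289_089, 344_538),
}


def homogeneous_shares(counts):
    """Return {level: (homogeneous share, 75% homogeneous share)}.

    Both shares are relative to the classified UViGs at that level.
    """
    return {
        level: (homog / classified, homog75 / classified)
        for level, (classified, homog, homog75) in counts.items()
    }


for level, (strict, relaxed) in homogeneous_shares(UVIG_COUNTS).items():
    print(f"{level:12s} homogeneous: {strict:6.1%}   75% homogeneous: {relaxed:6.1%}")
```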
VPFs final classification
| Feature | Level | Homog. (Cat 1.1 / 1.2 / 1.3 / 1.4) | Heterog. | Total |
|---|---|---|---|---|
| Taxonomy | Baltimore | 14 264 / 4042 / 2122 / 397 | 4142 | 24 957 |
| | Family | 1709 / 1511 / 614 / 81 | 18 124 | 22 039 |
| | Genus | 1710 / 1425 / 144 / 94 | 15 732 | 19 105 |
| Host | Domain | 4594 / 1975 / 748 / 0 | 16 226 | 23 543 |
| | Family | 1109 / 580 / 216 / 0 | 17 877 | 19 782 |
| | Genus | 1427 / 329 / 84 / 0 | 17 216 | 19 056 |
Prediction coverage and accuracy with the NCBI test
| Taxonomy | Thresholds | dsDNA | ssDNA | RT |
|---|---|---|---|---|
| Family | MR | 66.5% / 37% | 49.6% / 94% | 93% / 100% |
| | MR | 66% / 98.2% | 49% / 99.8% | 93% / 100% |
| | MR | 65% / 99% | 48% / 99.9% | 93% / 100% |
| Genus | MR | 56.8% / 39% | 52.5% / 94% | 100% / 100% |
| | MR | 56.5% / 97% | 52% / 97.5% | 100% / 100% |
| | MR | 55.5% / 98% | 52% / 98% | 100% / 100% |

Note: In every entry, the coverage (left) is separated from the accuracy (right) by a slash. MR and CS denote the membership ratio and confidence score, respectively.
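
The following Python sketch illustrates one way coverage and accuracy could be computed once MR and CS thresholds are applied. It assumes that coverage is the fraction of all test sequences whose prediction meets both thresholds, and that accuracy is the fraction of those retained predictions that match the known label; the `Prediction` class, the `>=` comparison, and the toy data are hypothetical and are not taken from the paper or from VPF-Class.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    membership_ratio: float  # MR of the predicted class
    confidence_score: float  # CS of the predicted class
    predicted: str
    truth: str


def coverage_and_accuracy(preds, min_mr, min_cs):
    """Return (coverage, accuracy) for predictions meeting both thresholds.

    Assumed definitions: coverage = retained / total,
    accuracy = correct retained / retained (NaN if nothing is retained).
    """
    kept = [
        p for p in preds
        if p.membership_ratio >= min_mr and p.confidence_score >= min_cs
    ]
    coverage = len(kept) / len(preds) if preds else 0.0
    if not kept:
        return coverage, float("nan")
    accuracy = sum(p.predicted == p.truth for p in kept) / len(kept)
    return coverage, accuracy


# Toy data, invented purely for illustration.
toy_predictions = [
    Prediction(0.9, 0.8, "Siphoviridae", "Siphoviridae"),
    Prediction(0.6, 0.9, "Myoviridae", "Podoviridae"),
    Prediction(0.3, 0.2, "Microviridae", "Microviridae"),
    Prediction(0.8, 0.1, "Microviridae", "Microviridae"),
]
print(coverage_and_accuracy(toy_predictions, min_mr=0.5, min_cs=0.5))
```

Raising either threshold in this sketch can only keep the same predictions or discard some, so coverage never increases; accuracy may move in either direction depending on which predictions are dropped.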
Fig. 3. Heatmaps depicting results obtained with the NCBI test. The x-axes show the confidence score thresholds while the y-axes show the membership ratio thresholds. The colors represent the ratio of classified viral sequences above a confidence score (x value) and a membership ratio (y value) with respect to the total number of sequences.
Fig. 4. Coverage and accuracy values obtained by VPF-Class (left panel) and VirHostMatcher (right panel) in host prediction of the prophages test. On the right, we show the coverage and accuracy values (y-axis) obtained by VirHostMatcher with respect to the values of the ONF dissimilarity measure (x-axis).