Stephen Solis-Reyes, Mariano Avino, Art Poon, Lila Kari.
Abstract
For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine component of clinical management, and there are now many classification algorithms available for this purpose. Although several of these algorithms are similar in accuracy and speed, the majority are proprietary and require laboratories to transmit HIV-1 sequence data over the network to remote servers. This potentially exposes sensitive patient data to unauthorized access, and makes it impossible to determine how classifications are made and to maintain the data provenance of clinical bioinformatic workflows. We propose an open-source supervised and alignment-free subtyping method (Kameris) that operates on k-mer frequencies in HIV-1 sequences. We performed a detailed study of the accuracy and performance of subtype classification in comparison to four state-of-the-art programs. Based on our testing data set of manually curated real-world HIV-1 sequences (n = 2,784), Kameris obtained an overall accuracy of 97%, which matches or exceeds all other tested software, with a processing rate of over 1,500 sequences per second. Furthermore, our fully standalone general-purpose software provides key advantages in terms of data security and privacy, transparency and reproducibility. Finally, we show that our method is readily adaptable to subtype classification of other viruses including dengue, influenza A, and hepatitis B and C virus.
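The abstract describes an alignment-free method that operates on k-mer frequencies. As a rough illustration of that idea only, the sketch below turns a nucleotide sequence into a normalized k-mer frequency vector. It is not the Kameris implementation: the function name `kmer_frequencies`, the choice to skip k-mers containing ambiguous bases, and the lexicographic ordering of k-mers are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the relative frequency of every ACGT k-mer in seq.

    Hypothetical helper for illustration; k-mers containing characters
    other than A, C, G, T (e.g. N) are ignored, which is an assumption.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed lexicographic ordering so vectors are comparable across sequences.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


vec = kmer_frequencies("ACGTACGT", k=2)
print(len(vec))  # 16 possible 2-mers over ACGT
```

Vectors of this kind have a fixed length of 4^k regardless of sequence length, which is what allows a supervised classifier to be trained on them without first aligning the sequences.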
Year: 2018 PMID: 30427878 PMCID: PMC6235296 DOI: 10.1371/journal.pone.0206409
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Statistics for the manually curated testing datasets.
The first author, year, and reference number of the publication associated with each data set are listed under the ‘Source’ column heading. The historically most prevalent HIV-1 subtype(s) is indicated under the ‘Subtype’ column heading.
| Source | Country | Subtype | Count | Average length (nt) | Min. length (nt) | Max. length (nt) |
|---|---|---|---|---|---|---|
| Nadai (2009) [ | Haiti | B | 66 | 1024.0 | 1024 | 1025 |
| Niculescu (2015) [ | Romania | F | 97 | 1301.2 | 1257 | 1302 |
| Paraschiv (2017) [ | Romania | F | 86 | 1295.9 | 1164 | 1299 |
| Rhee (2017) [ | Thailand | CRF01_AE | 282 | 703.8 | 633 | 756 |
| Sukasem (2007) [ | Thailand | CRF01_AE | 221 | 286.4 | 270 | 288 |
| Eshleman (2001) [ | Uganda | A/D | 102 | 1261.2 | 1260 | 1302 |
| Ssemwanga (2012) [ | Uganda | A/D | 72 | 1025.0 | 1025 | 1025 |
| Wolf (2017) [ | USA | B | 1653 | 1020.8 | 868 | 1080 |
| TenoRes Study Group (2016) [ | South Africa | C | 102 | 1001.4 | 921 | 1209 |
| van Zyl (2017) [ | South Africa | C | 59 | 1056.7 | 1002 | 1070 |
| Huang (2003) [ | Reference panel | n/a | 44 | 1189.9 | 1187 | 1190 |
| Overall | | | 2784 | 960.4 | 270 | 1302 |
Fig 1. Highest accuracy score and running time averaged across all fifteen classifier algorithms, at different values of k, for the full set of 6625 whole HIV-1 genomes from the LANL database.
Classification accuracy scores after 10-fold cross-validation and running times averaged over five runs for each of the fifteen classifiers at k = 6, for the full set of 6625 whole HIV-1 genomes from the LANL database.
| Classifier | Accuracy | Running time (mean) | Running time (standard deviation) |
|---|---|---|---|
| | 96.66% | 59.7s | 0.81s |
| | 96.59% | 58.3s | 0.91s |
| | 96.49% | 57.7s | 1.60s |
| | 95.49% | 60.6s | 0.94s |
| | 95.32% | 102.0s | 0.92s |
| | 93.97% | 44.3s | 0.68s |
| | 93.95% | 34.0s | 0.83s |
| | 93.84% | 33.7s | 0.33s |
| | 93.53% | 62.3s | 1.03s |
| | 93.07% | 43.7s | 0.77s |
| | 91.10% | 37.4s | 1.32s |
| | 87.75% | 34.0s | 0.80s |
| | 77.76% | 36.0s | 1.37s |
| | 75.13% | 38.3s | 0.48s |
| | 64.85% | 159.3s | 1.21s |
Classification accuracies for all tested HIV-1 subtyping tools, for each testing dataset; average accuracy both with and without weighting datasets by the number of sequences they contain.
| Source | Kameris | COMET | CASTOR | SCUEAL | REGA |
|---|---|---|---|---|---|
| Nadai (2009) [ | 100.0% | 100.0% | 81.8% | 92.4% | 86.4% |
| Niculescu (2015) [ | 95.9% | 96.9% | 75.3% | 94.8% | 100.0% |
| Paraschiv (2017) [ | 91.9% | 73.3% | 46.5% | 68.6% | 87.2% |
| Rhee (2017) [ | 94.0% | 95.4% | 0.4% | 75.9% | 12.8% |
| Sukasem (2007) [ | 90.0% | 91.0% | 0.9% | 64.3% | 8.1% |
| Eshleman (2001) [ | 88.5% | 90.6% | 4.2% | 84.4% | 90.6% |
| Ssemwanga (2012) [ | 88.3% | 90.0% | 0.0% | 73.3% | 95.0% |
| Wolf (2017) [ | 99.8% | 99.8% | 61.1% | 99.3% | 98.2% |
| TenoRes Study Group (2016) [ | 99.0% | 99.0% | 28.4% | 99.0% | 100.0% |
| van Zyl (2017) [ | 94.9% | 93.2% | 57.6% | 93.2% | 94.9% |
| Huang (2003) [ | 95.2% | 97.6% | 19.0% | 81.0% | 95.2% |
| Average (unweighted) | 94.3% | 93.3% | 34.1% | 84.2% | 78.9% |
| Average (weighted by sequence count) | 97.1% | 96.9% | 45.1% | 91.2% | 81.4% |
1 In this case, a substantial number of sequences that were classified as subtype A by REGA and our method were labeled unclassified subtypes (U) by COMET. In an HIV-1 phylogeny, subtype U sequences tend to be assigned a basal position (near the root) within the subtype A clade, suggesting that these sequences may be unrecognized variants or complex recombinants of subtype A.
2 These low accuracies arise primarily because REGA classifies many CRF01 sequences as subtype A, which is largely equivalent to CRF01 in the pol region. If CRF01 and A were treated as equivalent, REGA's accuracies would be 97.9% and 86.4% for the Rhee and Sukasem datasets, respectively, and its unweighted and weighted averages would be 93.8% and 96.2%, respectively.
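The two average rows of the accuracy table differ only in whether each dataset is weighted by its number of sequences. The sketch below reproduces both averages for the first accuracy column (the authors' tool). It assumes that the per-dataset counts from the testing-dataset statistics table are the weights; the number of sequences actually scored per dataset may differ slightly, so this is a consistency check rather than the authors' exact calculation.

```python
# (dataset, sequence count from the dataset statistics table,
#  accuracy in % from the first accuracy column)
rows = [
    ("Nadai (2009)", 66, 100.0),
    ("Niculescu (2015)", 97, 95.9),
    ("Paraschiv (2017)", 86, 91.9),
    ("Rhee (2017)", 282, 94.0),
    ("Sukasem (2007)", 221, 90.0),
    ("Eshleman (2001)", 102, 88.5),
    ("Ssemwanga (2012)", 72, 88.3),
    ("Wolf (2017)", 1653, 99.8),
    ("TenoRes Study Group (2016)", 102, 99.0),
    ("van Zyl (2017)", 59, 94.9),
    ("Huang (2003)", 44, 95.2),
]

# Unweighted: every dataset counts equally.
unweighted = sum(acc for _, _, acc in rows) / len(rows)

# Weighted: every sequence counts equally, so large datasets dominate.
total_sequences = sum(n for _, n, _ in rows)
weighted = sum(n * acc for _, n, acc in rows) / total_sequences

print(f"unweighted: {unweighted:.1f}%")  # 94.3%
print(f"weighted:   {weighted:.1f}%")    # 97.1%
```

The weighted figure is higher because the Wolf dataset, which has the most sequences, is also among the most accurately classified.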
Approximate running times for all tested subtyping tools, for the dataset of van Zyl et al. [95] and all datasets listed in Table 3.
The van Zyl dataset was chosen at random for this purpose.
| Tool | Running time for the van Zyl dataset | Running time for all datasets from Table 3 |
|---|---|---|
| Kameris | less than 2 seconds | 16 seconds |
| COMET | less than 2 seconds | 14 seconds |
| CASTOR | 3 seconds | 46 seconds |
| SCUEAL | 18 minutes | 8 hours |
| REGA | 31 minutes | 19 hours |
1 The REGA and SCUEAL web servers have limits of 1000 and 500 sequences per run, respectively. Thus, 3 batches of sequences were needed for REGA, and 6 batches for SCUEAL to classify all sequences. COMET, CASTOR, and our tool have no such limits.
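The batch counts in the footnote follow from dividing the number of sequences by each server's per-run limit and rounding up. The sketch below shows that arithmetic, assuming all 2784 test sequences were submitted; the helper name `split_into_batches` is hypothetical and not part of any of the tools discussed.

```python
def split_into_batches(items, limit):
    """Split items into consecutive batches of at most `limit` elements."""
    return [items[i:i + limit] for i in range(0, len(items), limit)]


# Stand-in identifiers for the 2784 test sequences (assumed total).
sequence_ids = list(range(2784))

rega_batches = split_into_batches(sequence_ids, 1000)
scueal_batches = split_into_batches(sequence_ids, 500)

print(len(rega_batches))    # 3
print(len(scueal_batches))  # 6
```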
Fig 2. MoDMap of 4373 full-length HIV-1 genomes of 9 different pure subtypes or groups, at k = 6.
Fig 3. MoDMap of 4124 full-length HIV-1 genomes of subtypes A, B, and C, at k = 6.
Fig 4. MoDMap of 9270 natural HIV-1 pol genes vs. 1500 synthetically generated HIV-1 pol genes of various subtypes.
The same plot is colored on the left by type (natural and synthetic) and on the right by HIV-1 subtype.
Fig 5. MoDMap of 5164 whole hepatitis B genomes of 6 different pure subtypes.