Daniel Struck, Glenn Lawyer, Anne-Marie Ternes, Jean-Claude Schmit, Danielle Perez Bercoff.
Abstract
Viral sequence classification has wide applications in clinical, epidemiological, structural and functional categorization studies. Most existing approaches rely on an initial alignment step followed by classification based on phylogenetic or statistical algorithms. Here we present an ultrafast alignment-free subtyping tool for human immunodeficiency virus type one (HIV-1) adapted from Prediction by Partial Matching compression. This tool, named COMET, was compared to the widely used phylogeny-based REGA and SCUEAL tools using synthetic and clinical HIV data sets (1,090,698 and 10,625 sequences, respectively). COMET's sensitivity and specificity were comparable to or higher than the two other subtyping tools on both data sets for known subtypes. COMET also excelled in detecting and identifying new recombinant forms, a frequent feature of the HIV epidemic. Runtime comparisons showed that COMET was almost as fast as USEARCH. This study demonstrates the advantages of alignment-free classification of viral sequences, which feature high rates of variation, recombination and insertions/deletions. COMET is free to use via an online interface.
Year: 2014 PMID: 25120265 PMCID: PMC4191385 DOI: 10.1093/nar/gku739
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 2. Subtype decision tree. The row sums of the log-likelihood matrix give the overall likelihood that the query sequence belongs to each subtype. These sums are ordered to identify the most likely subtype (S) and the most likely pure subtype (PS). If the query sequence has the highest likelihood of belonging to a pure subtype (i.e. S = PS), this likelihood is challenged against the likelihoods of the sequence belonging to any other subtype (PURE or CRF) by sliding a 100-bp window over the matrix in steps of 3 bp. If the difference between the row sums within the current window remains below the recombination threshold (i.e. 28) for every window, the pure subtype is assigned. Otherwise, COMET returns the result ‘UNASSIGNED’. If the query sequence has the highest likelihood of being a CRF, COMET performs a similar challenge, but at first only against the most likely pure subtype (PS). If this difference remains below the recombination threshold (i.e. 28), COMET assigns the pure subtype (S) with an indication to check for the CRF, indicating a region where the CRF is pure. If the difference is higher than the recombination threshold, a second scan is performed as for the PURE situation, challenging each subtype against the initially assigned CRF.
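The decision procedure in this caption can be sketched roughly as follows. This is an illustrative reading, not COMET's actual code: the function names, the input layout (one list of per-position log-likelihoods per subtype) and the sign convention of the "difference" (challenger window sum minus incumbent window sum) are assumptions. In the CRF branch the sketch returns the most likely pure subtype together with a note to check the CRF, which is one interpretation of the caption.

```python
THRESHOLD = 28   # recombination threshold quoted in the caption
WINDOW = 100     # window length in bp
STEP = 3         # step size in bp


def window_sums(row, window=WINDOW, step=STEP):
    """Sum the per-position log-likelihoods inside each sliding window."""
    last_start = max(len(row) - window, 0)
    return [sum(row[i:i + window]) for i in range(0, last_start + 1, step)]


def exceeds(challenger, incumbent, threshold=THRESHOLD):
    """True if, in any window, the challenger beats the incumbent by more
    than the threshold (assumed sign convention)."""
    pairs = zip(window_sums(challenger), window_sums(incumbent))
    return any(c - s > threshold for c, s in pairs)


def classify(loglik, pure):
    """loglik: dict mapping subtype name -> list of per-position log-likelihoods.
    pure: set of names that are pure subtypes (all other names are CRFs)."""
    totals = {name: sum(row) for name, row in loglik.items()}
    s = max(totals, key=totals.get)                               # most likely subtype
    ps = max((n for n in totals if n in pure), key=totals.get)    # most likely pure subtype
    others = [n for n in loglik if n != s]

    if s == ps:
        # PURE branch: challenge S against every other subtype.
        if any(exceeds(loglik[o], loglik[s]) for o in others):
            return "UNASSIGNED"
        return s

    # CRF branch, first scan: does the CRF clearly beat the best pure subtype anywhere?
    if not exceeds(loglik[s], loglik[ps]):
        return f"{ps} (check for {s})"
    # Second scan: challenge every other subtype against the CRF.
    if any(exceeds(loglik[o], loglik[s]) for o in others):
        return "UNASSIGNED"
    return s
```

For a query whose first part fits one pure subtype and whose last part fits another, some window near the breakpoint favours the second subtype by more than the threshold, so this sketch returns `UNASSIGNED`.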
Figure 1. N-ary tree representation of a Markov model, with the context ‘CGT’ highlighted. Each node (circle) has an associated frequency table (box) over the next base in the sequence following the context.
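A minimal sketch of such a tree, assuming a fixed maximum context length: each node stores its child nodes (one per base) and a frequency table of the bases observed immediately after the context spelled out by the path from the root. The class and function names are hypothetical and do not come from COMET.

```python
from collections import Counter


class Node:
    """One context node: children keyed by base, plus next-base counts."""

    def __init__(self):
        self.children = {}          # base -> Node
        self.next_base = Counter()  # how often each base follows this context


def build_tree(sequence, max_order=3):
    """Count which base follows every context of length 0..max_order."""
    root = Node()
    for start in range(len(sequence)):
        node = root
        for depth in range(max_order + 1):
            pos = start + depth
            if pos >= len(sequence):
                break
            base = sequence[pos]
            # `node` represents the context sequence[start:pos]; `base` follows it.
            node.next_base[base] += 1
            if depth < max_order:
                node = node.children.setdefault(base, Node())
    return root


def next_base_counts(root, context):
    """Return the frequency table stored at the node for `context`."""
    node = root
    for base in context:
        node = node.children.get(base)
        if node is None:
            return Counter()  # context never observed
    return node.next_base
```

Calling `next_base_counts(tree, "CGT")` on a tree built with `max_order=3` returns the table attached to the node that the figure highlights.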
Sensitivity and specificity of COMET to type known sequences with varying levels of noise
(a1) Sensitivity, PURE subtypes; column headings give the noise level

| Fragment size (bp) | Sequences (n) | 0.0% | 2.0% | 4.0% | 6.0% | 8.0% | 10.0% | 12.0% | 14.0% | 16.0% | 18.0% | 20.0% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 494 256 | 98.4% | 96.5% | 92.6% | 87.2% | 80.8% | 72.9% | 63.1% | 53.6% | 44.2% | 38.3% | 32.3% |
| 200 | 242 235 | 99.9% | 99.8% | 99.0% | 97.5% | 94.9% | 89.5% | 82.1% | 71.0% | 61.3% | 52.9% | 44.1% |
| 400 | 128 604 | 100.0% | 100.0% | 100.0% | 99.9% | 98.9% | 97.6% | 93.1% | 87.0% | 78.1% | 68.8% | 57.2% |
| 600 | 87 402 | 100.0% | 100.0% | 100.0% | 100.0% | 99.9% | 98.9% | 96.3% | 91.9% | 86.2% | 75.2% | 63.5% |
| 800 | 65 352 | 100.0% | 100.0% | 100.0% | 100.0% | 99.8% | 99.5% | 97.1% | 95.4% | 89.5% | 78.4% | 66.6% |
| 1200 | 43 008 | 100.0% | 100.0% | 100.0% | 100.0% | 99.5% | 100.0% | 99.1% | 97.4% | 92.9% | 83.4% | 70.3% |
| 1600 | 29 841 | 100.0% | 100.0% | 100.0% | 100.0% | 99.5% | 100.0% | 99.2% | 97.8% | 93.7% | 84.8% | 71.0% |
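
(a2) Specificity, PURE subtypes; column headings give the noise level

| Fragment size (bp) | Sequences (n) | 0.0% | 2.0% | 4.0% | 6.0% | 8.0% | 10.0% | 12.0% | 14.0% | 16.0% | 18.0% | 20.0% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|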
| 100 | 494 256 | 93.8% | 93.8% | 93.7% | 93.6% | 93.5% | 93.5% | 93.5% | 93.5% | 93.5% | 93.7% | 93.8% |
| 200 | 242 235 | 96.9% | 96.1% | 95.2% | 94.6% | 94.2% | 93.9% | 93.7% | 93.6% | 93.6% | 93.7% | 93.9% |
| 400 | 128 604 | 98.6% | 97.9% | 97.0% | 96.1% | 95.4% | 94.7% | 94.3% | 94.0% | 93.8% | 93.8% | 94.0% |
| 600 | 87 402 | 99.3% | 98.8% | 98.1% | 97.2% | 96.3% | 95.5% | 94.8% | 94.3% | 94.1% | 94.0% | 94.2% |
| 800 | 65 352 | 99.5% | 99.2% | 98.6% | 97.9% | 97.0% | 96.0% | 95.3% | 94.7% | 94.3% | 94.1% | 94.3% |
| 1200 | 43 008 | 99.8% | 99.6% | 99.3% | 98.8% | 98.1% | 97.0% | 96.2% | 95.4% | 94.8% | 94.4% | 94.5% |
| 1600 | 29 841 | 99.9% | 99.7% | 99.5% | 99.2% | 98.7% | 97.8% | 96.8% | 96.0% | 95.2% | 94.6% | 94.7% |
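
(b1) Sensitivity, CRFs; column headings give the noise level

| Fragment size (bp) | Sequences (n) | 0.0% | 2.0% | 4.0% | 6.0% | 8.0% | 10.0% | 12.0% | 14.0% | 16.0% | 18.0% | 20.0% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|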
| 100 | 494 256 | 91.0% | 85.4% | 77.2% | 66.7% | 56.1% | 46.1% | 37.0% | 28.7% | 22.1% | 16.6% | 12.6% |
| 200 | 242 235 | 97.6% | 95.6% | 90.7% | 83.7% | 75.3% | 64.7% | 52.2% | 41.6% | 31.2% | 21.9% | 15.8% |
| 400 | 128 604 | 99.5% | 99.1% | 97.3% | 93.4% | 88.2% | 78.3% | 66.6% | 54.4% | 41.3% | 28.9% | 19.7% |
| 600 | 87 402 | 99.7% | 99.7% | 98.9% | 96.4% | 92.6% | 84.3% | 74.4% | 61.4% | 46.9% | 32.6% | 21.3% |
| 800 | 65 352 | 99.9% | 99.8% | 99.5% | 97.0% | 94.5% | 87.6% | 78.3% | 64.8% | 48.7% | 34.2% | 22.8% |
| 1200 | 43 008 | 99.8% | 99.9% | 100.0% | 97.3% | 96.1% | 91.2% | 82.9% | 69.0% | 52.8% | 36.9% | 23.0% |
| 1600 | 29 841 | 100.0% | 100.0% | 100.0% | 97.7% | 97.8% | 91.9% | 84.7% | 71.7% | 53.8% | 38.1% | 25.4% |
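
(b2) Specificity, CRFs; column headings give the noise level

| Fragment size (bp) | Sequences (n) | 0.0% | 2.0% | 4.0% | 6.0% | 8.0% | 10.0% | 12.0% | 14.0% | 16.0% | 18.0% | 20.0% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|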
| 100 | 494 256 | 99.7% | 99.7% | 99.6% | 99.6% | 99.5% | 99.4% | 99.4% | 99.4% | 99.3% | 99.3% | 99.3% |
| 200 | 242 235 | 99.9% | 99.9% | 99.8% | 99.8% | 99.7% | 99.7% | 99.7% | 99.6% | 99.6% | 99.6% | 99.6% |
| 400 | 128 604 | 99.9% | 99.9% | 99.9% | 99.9% | 99.9% | 99.8% | 99.8% | 99.8% | 99.7% | 99.7% | 99.7% |
| 600 | 87 402 | 100.0% | 100.0% | 99.9% | 99.9% | 99.9% | 99.9% | 99.8% | 99.8% | 99.8% | 99.8% | 99.8% |
| 800 | 65 352 | 100.0% | 100.0% | 99.9% | 99.9% | 99.9% | 99.9% | 99.9% | 99.9% | 99.8% | 99.8% | 99.8% |
| 1200 | 43 008 | 100.0% | 100.0% | 100.0% | 99.9% | 99.9% | 99.9% | 99.9% | 99.9% | 99.8% | 99.8% | 99.8% |
| 1600 | 29 841 | 100.0% | 100.0% | 100.0% | 100.0% | 99.9% | 99.9% | 99.9% | 99.9% | 99.9% | 99.8% | 99.8% |
A synthetic data set was generated from reference sequences from the LANL HIV database by randomly introducing mutations (‘noise’) throughout the genome. The sensitivity and specificity of COMET were calculated for varying degrees of noise (0–20%) introduced into PURE subtypes (A–J; Tables a1 and a2) or CRFs (CRF01_AE-CRF49_cpx; Tables b1 and b2). Sequences of different lengths were submitted to COMET.
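The kind of evaluation described here could be sketched as below. The mutation model (independent substitutions only, each to a uniformly chosen different base) and the one-versus-rest definition of sensitivity and specificity are assumptions made for illustration, since this note does not specify them. All names are hypothetical.

```python
import random

BASES = "ACGT"


def add_noise(sequence, rate, rng):
    """Replace each base, with probability `rate`, by a different random base.
    Assumed model: substitutions only, no insertions or deletions."""
    out = []
    for base in sequence:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def sensitivity_specificity(truth, predicted, subtype):
    """One-versus-rest sensitivity and specificity for a single subtype."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if t == subtype and p == subtype:
            tp += 1
        elif t == subtype:
            fn += 1
        elif p == subtype:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec
```

A run would mutate each reference fragment at a given noise level, submit it to the classifier, and pass the true and predicted labels to `sensitivity_specificity` for each subtype.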
Figure 3. Sensitivities and specificities of COMET, REGAv2 and SCUEAL assessed using the synthetic variation data set spanning the pol region.
Sensitivity and specificity of COMET to detect and identify synthetic recombinants
| Insert size (bp) | Sequences (n) | URF found | Composition found |
|---|---|---|---|
| 100 | 118 508 | 84.5% | 84.1% |
| 200 | 59 254 | 96.9% | 96.8% |
| 300 | 57 876 | 98.7% | 98.7% |
| 400 | 57 876 | 99.6% | 99.6% |
| 600 | 56 498 | 99.9% | 99.9% |
| 800 | 55 120 | 100.0% | 100.0% |
| Mean | | 96.6% | 96.5% |
A synthetic recombination data set was generated by replacing a stretch of one sequence with the corresponding stretch from another subtype. Inserts of different sizes (from 100 bp to 800 bp) were introduced throughout the genome, generating 405 132 different recombinants. In the table, ‘URF found’ means that COMET recognized the sequence as ‘UNASSIGNED’. ‘Composition found’ means that COMET correctly identified the subtypes composing the background and the insert.
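As a rough illustration of this construction, the sketch below swaps a region of a background sequence for the same coordinates of a donor sequence. It assumes the two sequences are already in a shared coordinate system (equal length). The `step` parameter controlling insert placement is hypothetical, since the placement scheme is not given here.

```python
def make_recombinant(background, donor, start, size):
    """Replace background[start:start+size] with the same region of the donor.
    Assumes both sequences share one coordinate system (equal length)."""
    if len(background) != len(donor):
        raise ValueError("sequences must have the same length")
    end = start + size
    return background[:start] + donor[start:end] + background[end:]


def recombinants(background, donor, size, step):
    """Yield (start, recombinant) for inserts placed along the whole sequence.
    `step` is an illustrative spacing between successive insert positions."""
    for start in range(0, len(background) - size + 1, step):
        yield start, make_recombinant(background, donor, start, size)
```

Each generated recombinant carries a known background subtype, donor subtype and breakpoint position, which is what allows ‘URF found’ and ‘Composition found’ to be scored afterwards.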
Detection and identification of unknown recombinants by COMET, REGAv2 and SCUEAL
| Insert size (bp) | Sequences (n) | COMET: URF found | COMET: composition found | REGAv2: URF found | REGAv2: composition found | SCUEAL: URF found | SCUEAL: composition found |
|---|---|---|---|---|---|---|---|
| 100 | 3300 | 68.6% | 67.9% | 1.0% | 0.9% | 49.5% | 31.5% |
| 200 | 1650 | 90.4% | 90.1% | 18.5% | 17.5% | 75.3% | 58.7% |
| 300 | 1540 | 96.2% | 96.1% | 56.5% | 54.2% | 88.3% | 74.8% |
| 400 | 1540 | 97.9% | 97.9% | 78.1% | 74.2% | 94.6% | 84.4% |
| 600 | 1430 | 99.7% | 99.7% | 96.5% | 92.0% | 97.5% | 91.5% |
| 800 | 1320 | 99.9% | 99.9% | 96.9% | 94.9% | 99.1% | 95.5% |
| Mean | | 92.1% | 91.9% | 57.9% | 55.6% | 84.1% | 72.7% |
The synthetic recombination data set was restricted to the pol region, and one recombinant was selected for each pattern, giving 10 780 sequences for this analysis. In the table, ‘URF found’ means that COMET reported the sequence as ‘UNASSIGNED’, REGAv2 reported ‘check the bootscan’ or ‘check the report’, and SCUEAL reported ‘complex’ or ‘recombinant’. ‘Composition found’ means that the tool correctly identified the subtypes of the background and of the insert composing the synthetic recombinant.
Sensitivity and specificity of COMET, REGAv2 and SCUEAL to type clinical patient-derived sequences retrieved from the LANL database
PURE subtypes

| Subtype | Sequences (n) | COMET sensitivity | COMET specificity | REGAv2 sensitivity | REGAv2 specificity | SCUEAL sensitivity | SCUEAL specificity |
|---|---|---|---|---|---|---|---|
| A1 | 1000 | 97.0% | 99.9% | 95.7% | 99.4% | 80.2% | 100.0% |
| A2 | 142 | 71.1% | 100.0% | 72.5% | 100.0% | 76.1% | 100.0% |
| B | 1000 | 99.6% | 99.8% | 96.2% | 99.9% | 97.6% | 99.9% |
| C | 1000 | 98.9% | 100.0% | 99.5% | 99.6% | 91.8% | 99.9% |
| D | 1000 | 92.4% | 100.0% | 87.2% | 100.0% | 86.8% | 100.0% |
| F1 | 1000 | 93.1% | 99.8% | 96.8% | 99.4% | 87.8% | 99.8% |
| F2 | 184 | 75.0% | 100.0% | 53.8% | 100.0% | 77.7% | 100.0% |
| G | 1000 | 89.0% | 99.9% | 89.9% | 98.6% | 79.5% | 98.9% |
| H | 97 | 74.2% | 100.0% | 87.6% | 100.0% | 85.6% | 100.0% |
| Mean | | 87.8% | 99.9% | 86.6% | 99.7% | 84.8% | 99.8% |

CRFs

| CRF | Sequences (n) | COMET sensitivity | COMET specificity | REGAv2 sensitivity | REGAv2 specificity | SCUEAL sensitivity | SCUEAL specificity |
|---|---|---|---|---|---|---|---|
| 01_AE | 1000 | 97.5% | 100.0% | 93.7% | 100.0% | 68.3% | 100.0% |
| 02_AG | 1000 | 95.6% | 99.5% | 14.8% | 100.0% | 26.6% | 99.9% |
| 06_cpx | 823 | 92.5% | 100.0% | 84.2% | 100.0% | 40.0% | 100.0% |
| 07_BC | 581 | 97.4% | 100.0% | 97.2% | 100.0% | 14.8% | 100.0% |
| 08_BC | 365 | 95.9% | 100.0% | 91.0% | 100.0% | 77.5% | 100.0% |
| 11_cpx | 116 | 82.8% | 100.0% | 75.0% | 100.0% | 72.4% | 100.0% |
| 12_BF | 317 | 87.1% | 100.0% | 50.2% | 100.0% | 9.1% | 100.0% |
| Mean | | 92.7% | 99.9% | 72.3% | 100.0% | 44.1% | 100.0% |
This data set includes 10 625 sequences spanning pol.
Figure 4. Agreement between the three subtyping tools on the subtype assigned to clinical patient-derived sequences retrieved from the LANL database. This data set includes 10 625 sequences spanning pol.