Sergei L Kosakovsky Pond, David Posada, Eric Stawiski, Colombe Chappey, Art F Y Poon, Gareth Hughes, Esther Fearnhill, Mike B Gravenor, Andrew J Leigh Brown, Simon D W Frost.
Abstract
Genetically diverse pathogens (such as Human Immunodeficiency virus type 1, HIV-1) are frequently stratified into phylogenetically or immunologically defined subtypes for classification purposes. Computational identification of such subtypes is helpful in surveillance, epidemiological analysis and detection of novel variants, e.g., circulating recombinant forms in HIV-1. A number of conceptually and technically different techniques have been proposed for determining the subtype of a query sequence, but there is not a universally optimal approach. We present a model-based phylogenetic method for automatically subtyping an HIV-1 (or other viral or bacterial) sequence, mapping the location of breakpoints and assigning parental sequences in recombinant strains as well as computing confidence levels for the inferred quantities. Our Subtype Classification Using Evolutionary ALgorithms (SCUEAL) procedure is shown to perform very well in a variety of simulation scenarios, runs in parallel when multiple sequences are being screened, and matches or exceeds the performance of existing approaches on typical empirical cases. We applied SCUEAL to all available polymerase (pol) sequences from two large databases, the Stanford Drug Resistance database and the UK HIV Drug Resistance Database. Comparing with subtypes which had previously been assigned revealed that a minor but substantial (approximately 5%) fraction of pure subtype sequences may in fact be within- or inter-subtype recombinants. A free implementation of SCUEAL is provided as a module for the HyPhy package and the Datamonkey web server. Our method is especially useful when an accurate automatic classification of an unknown strain is desired, and is positioned to complement and extend faster but less accurate methods. 
Given the increasingly frequent use of HIV subtype information in studies focusing on the effect of subtype on treatment, clinical outcome, pathogenicity and vaccine design, the importance of accurate, robust and extensible subtyping procedures is clear.
Year: 2009 PMID: 19956739 PMCID: PMC2776870 DOI: 10.1371/journal.pcbi.1000581
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. An example to illustrate the concepts of a mosaic and its binary encoding upon which the genetic algorithm operates.
Panel A: a phylogenetic breakpoint/lineage model which “threads” a query sequence (labeled ‘Q’) onto the reference tree with sequences. Panel B: the example individual model (mosaic) is encoded by a 36-bit binary vector on 5 fragments (genes): 2 for placing the breakpoints (Gray-binary encoded) and 3 for identifying sister lineages, binary encoded using the post-order traversal scheme shown in the reference tree of Panel A.
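The Gray-binary encoding of breakpoint positions mentioned in the Figure 1 caption can be illustrated with a minimal Python sketch. The function names and the 11-bit width used in the usage note are assumptions made for illustration; the caption states only that the full mosaic uses 36 bits over 5 genes, not how those bits are divided. The property that makes Gray codes attractive for a genetic algorithm is that consecutive integers differ in exactly one bit, so a single-bit mutation can move a breakpoint to a neighbouring position.

```python
def to_gray(n: int) -> int:
    """Convert a non-negative integer to its reflected Gray code."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Invert to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def encode_breakpoint(position: int, bits: int) -> str:
    """Return the Gray-coded bit string (one 'gene') for a breakpoint position.

    `bits` is a hypothetical gene width chosen by the caller.
    """
    if not 0 <= position < (1 << bits):
        raise ValueError("position does not fit in the requested number of bits")
    return format(to_gray(position), f"0{bits}b")


def decode_breakpoint(bitstring: str) -> int:
    """Recover the breakpoint position from a Gray-coded bit string."""
    return from_gray(int(bitstring, 2))
```

For example, `encode_breakpoint(1000, 11)` followed by `decode_breakpoint` returns 1000, and the codes for positions 1000 and 1001 differ in a single bit.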
Figure 2. Algorithmic flowchart of SCUEAL.
Algorithmic logic underlying SCUEAL; see Figure 3 for a description of the genetic algorithm itself. Refer to the text for more detailed descriptions of individual procedures and parameter definitions.
Figure 3. Algorithmic flowchart of the genetic algorithm in SCUEAL.
A flowchart description of the genetic algorithm applied to a given starting population and controlled by input parameter values. Refer to the text and Figure 2 for further description of individual steps and parameter definitions.
SCUEAL performance on simulated data.
| Scenario | Seq., sites | Type/Distance | Inferred Mosaics: Type | Inferred Mosaics: Count | Breakpoints: Simulated Location, Parents | Breakpoints: Inferred #/Median Location, Std.Dev. (95% Range) |
| 1. No recombination | | N/A | Correct | 100 (100) | None | |
| 2. An evident breakpoint | 8, 2000 | Close | Correct | 100 (88) | 1000 bp 1:3 | 100/990, 18.99 (931, 1015) |
| | | Divergent | Correct | 100 (75) | 1000 bp 1:7 | 100/1000, 5.04 (987, 1007) |
| | | Ancient | Correct | 92 (86) | 1000 bp 1/2:5/6 | 96/992, 16.62 (947, 1017) |
| | | | Superset | 7 (6) | | |
| | | | M/M | 1 (1) | | |
| 3. Two evident breakpoints | 8, 2000 | Close | Correct | 98 (96) | 750 bp 1:3 | 99/749, 10.23 (720, 769) |
| | | | Superset | 1 (1) | 1250 bp 7:1 | 99/1251, 15.26 (1201, 1273) |
| | | | M/M | 1 (1) | | |
| | | Divergent | Correct | 95 (89) | 750 bp 1:7 | 98/751, 5.61 (735, 762) |
| | | | Superset | 5 (5) | 1250 bp 7:1 | 100/1251, 5.92 (1237, 1265) |
| | | Ancient (69%) | Correct | 91 (90) | 750 bp 1/2:5/6 | 96/749, 22.37 (697, 824) |
| | | 69% | Superset | 5 (4) | 1250 bp 5/6:1/2 | 96/1250, 20.09 (1192, 1283) |
| | | | M/M | 4 (4) | | |
| 4. Two close breakpoints | 8, 2000 | Close (42%) | Correct | 22 (21) | 950 bp 1:3 | 22/948, 15.35 (888, 960) |
| | | 42% | Subset | 77 (76) | 1050 bp 3:1 | 22/1050, 7.77 (1031, 1066) |
| | | | M/M | 1 (1) | | |
| | | Divergent (102%) | Correct | 73 (69) | 950 bp 1:7 | 73/951, 6.40 (932, 960) |
| | | 102% | Subset | 11 (0) | 1050 bp 7:1 | 73/1051, 5.79 (1038, 1068) |
| | | | M/M | 16 (15) | | |
| 5. Four breakpoints | 8, 2000 | Close (42%) | Correct | 96 (96) | 400 bp 1:3 | 98/399, 16.86 (342, 428) |
| | | 42% | Superset | 3 (2) | 800 bp 3:1 | 97/803, 9.63 (784, 837) |
| | | 42% | M/M | 1 (1) | 1200 bp 1:3 | 98/1200, 11.34 (1161, 1220) |
| | | 42% | | | 1600 bp 3:1 | 99/1602, 12.24 (1570, 1634) |
| | | Divergent (102%) | Correct | 96 (96) | 400 bp 1:7 | 98/401, 5.08 (389, 413) |
| | | 102% | Superset | 2 (2) | 800 bp 7:1 | 99/802, 5.00 (785, 809) |
| | | 102% | M/M | 2 (2) | 1200 bp 1:7 | 99/1201, 5.73 (1188, 1211) |
| | | 102% | | | 1600 bp 7:1 | 99/1602, 4.19 (1594, 1613) |
| | | Ancient (69%) | Correct | 54 (54) | 400 bp 1/2:5/6 | 65/402, 14.88 (357, 434) |
| | | 69% | Subset | 22 (3) | 800 bp 5/6:1/2 | 67/802, 15.74 (745, 826) |
| | | 69% | M/M | 20 (19) | 1200 bp 1/2:5/6 | 69/1201, 19.67 (1169, 1270) |
| | | 69% | Superset | 4 (4) | 1600 bp 5/6:1/2 | 69/1602, 18.04 (1550, 1627) |
| 6. Nine breakpoints | 8, 2000 | Close (42%) | Correct | 30 (30) | 200 bp 1:3 | 68/201, 11.72 (176, 235) |
| | | 42% | Subset | 13 (2) | 400 bp 3:1 | 62/403, 8.18 (391, 420) |
| | | 42% | Superset | 9 (7) | 600 bp 1:3 | 68/601, 14.69 (561, 634) |
| | | 42% | M/M | 48 (25) | 800 bp 3:1 | 71/803, 10.72 (783, 838) |
| | | 42% | | | 1000 bp 1:3 | 71/1001, 10.95 (974, 1021) |
| | | 42% | | | 1200 bp 3:1 | 72/1203, 13.14 (1177, 1253) |
| | | 42% | | | 1400 bp 1:3 | 75/1401, 11.45 (1364, 1414) |
| | | 42% | | | 1600 bp 3:1 | 73/1602, 8.52 (1582, 1626) |
| | | 42% | | | 1800 bp 1:3 | 77/1801, 13.40 (1746, 1816) |
| | | Divergent (102%) | Correct | 64 (64) | 200 bp 1:7 | 96/202, 4.87 (188, 212) |
| | | 102% | Superset | 9 (7) | 400 bp 7:1 | 94/402, 7.98 (386, 415) |
| | | 102% | M/M | 27 (25) | 600 bp 1:7 | 93/601, 5.95 (591, 625) |
| | | 102% | | | 800 bp 7:1 | 93/802, 5.37 (790, 815) |
| | | 102% | | | 1000 bp 1:7 | 92/1002, 5.35 (985, 1015) |
| | | 102% | | | 1200 bp 7:1 | 93/1202, 6.17 (1191, 1228) |
| | | 102% | | | 1400 bp 1:7 | 93/1402, 4.52 (1391, 1411) |
| | | 102% | | | 1600 bp 7:1 | 93/1602, 4.05 (1594, 1612) |
| | | 102% | | | 1800 bp 1:7 | 89/1802, 3.80 (1794, 1814) |
| 7. Complex mosaic | 8, 2000 | 42% | Correct | 88 (86) | 400 bp 1:2 | 94/400, 11.89 (375, 440) |
| | | 12% | Subset | 3 (1) | 800 bp 3:4 | 89/793, 28.29 (737, 853) |
| | | 108% | Superset | 5 (4) | 1200 bp 4:7 | 98/1202, 4.02 (1192, 1211) |
| | | 48% | M/M | 4 (4) | 1600 bp 7:5 | 98/1601.5, 11.08 (1586, 1640) |
| 8. HIV within-patient | 13, 2000 | Close (0.4%) | Subset | 96 (96) | 750 bp 1:2 | |
| | | 0.4% | M/M | 4 (4) | 1250 bp 2:1 | |
| | | Divergent (2.3%) | Correct | 38 (36) | 750 bp 1:9 | 38/741.5, 34.62 (666, 790) |
| | | 2.3% | Subset | 4 (2) | 1250 bp 9:1 | 39/1256, 36.11 (1156, 1326) |
| | | | Superset | 1 (1) | | |
| | | | M/M | 57 (55) | | |
| 9. HIV within-patient | 13, 2000 | Close (0.4%) | Subset | 97 (97) | 400 bp 1:2 | |
| | | 0.4% | M/M | 3 (3) | 800 bp 2:1 | |
| | | 0.4% | | | 1200 bp 1:2 | |
| | | 0.4% | | | 1650 bp 2:1 | |
| | | Divergent (2.9%) | Correct | 7 (7) | 400 bp 1:9 | 16/391.5, 32.00 (349, 475) |
| | | 2.9% | Subset | 2 (1) | 800 bp 9:1 | 21/808, 39.70 (730, 868) |
| | | 2.9% | Superset | 1 (0) | 1200 bp 1:9 | 22/1202.5, 42.90 (1118, 1284) |
| | | 2.9% | M/M | 90 (70) | 1600 bp 9:1 | 20/1610.5, 32.87 (1551, 1676) |
| 10. HIV within-subtype | 5, 2000 | 4% | Correct | 16 (16) | 400 bp 1:2 | 30/402, 31.53 (317, 460) |
| | | 4% | Subset | 80 (77) | 800 bp 2:1 | 21/802, 35.56 (716, 885) |
| | | 4% | M/M | 4 (4) | 1200 bp 1:2 | 20/1209, 30.18 (1151, 1266) |
| | | 4% | | | 1600 bp 2:1 | 35/1589, 38.80 (1506, 1689) |
| 11. HIV mosaic | 12, 10000 | Close (12%) | Correct | 95 (95) | 2000 bp 1:2 | 94/2002.5, 30.11 (1925, 2092) |
| | | 12% | Subset | 1 (0) | 4000 bp 2:1 | 93/4000, 29.62 (3928, 4085) |
| | | 12% | Superset | 2 (2) | 6000 bp 1:2 | 94/6002.5, 26.99 (5941, 6067) |
| | | 12% | M/M | 2 (2) | 8000 bp 2:1 | 92/7996, 33.39 (7929, 8078) |
| | | Intermediate (12%) | Correct | 100 (100) | 2000 bp 1:6 | 100/2000, 17.40 (1959, 2042) |
| | | 12% | | | 4000 bp 6:1 | 99/4003, 21.62 (3964, 4053) |
| | | 12% | | | 6000 bp 1:6 | 100/6001, 18.61 (5952, 6040) |
| | | 12% | | | 8000 bp 6:1 | 99/8004, 16.88 (7968, 8046) |
| | | Divergent (11.5%) | Correct | 99 (97) | 2000 bp 1:9 | 99/2002, 20.92 (1956, 2043) |
| | | 11.5% | Superset | 1 (1) | 4000 bp 9:1 | 100/4002.5, 19.85 (3945, 4056) |
| | | 11.5% | | | 6000 bp 1:9 | 98/6000, 21.89 (5937, 6042) |
| | | 11.5% | | | 8000 bp 9:1 | 99/7999, 22.49 (7953, 8070) |
| | | Complex 12% | Correct | 94 (93) | 2000 bp 1:2 | 96/2003, 27.61 (1940, 2070) |
| | | 14% | Superset | 5 (4) | 4000 bp 2:6 | 99/4000, 18.14 (3969, 4053) |
| | | 12% | M/M | 1 (1) | 6000 bp 6:1 | 100/6003, 20.35 (5959, 6068) |
| | | 11.5% | | | 8000 bp 1:9 | 97/8000, 21.34 (7947, 8062) |
Scenario provides a brief description of a given simulation scenario. Seq., sites lists the number and length of simulated sequences. Type/Distance classifies the simulation scenario by type and mean divergence between parental strains, measured as the total branch length (expected number of substitutions/site × 100%) between the strains. Inferred Mosaics tabulates the number of cases (and, in parentheses, the number of those that matched or bested the BIC score of the correct model) that fell into each of the classification categories (see main text for further detail). Correct: the simulated mosaic was recovered; Superset: the simulated mosaic and superfluous breakpoints were inferred; Subset: a partial correct mosaic was recapitulated (some breakpoints missing); M/M: the inferred mosaic was a mismatch with the generating one. Breakpoints enumerates the location of each simulated breakpoint and its parental lineages, the number of times the breakpoint was recovered by SCUEAL, and the median (2.5%–97.5% range) of the distribution of distances between the simulated and inferred breakpoints.
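The four classification categories in the legend can be made concrete with a small Python sketch. It is a simplification under stated assumptions: the function name, the 50 bp matching tolerance and the decision to compare breakpoint locations only (ignoring parental lineages) are illustrative choices, not the criteria used in the paper, which are described in its main text.

```python
def classify_mosaic(simulated, inferred, tolerance=50):
    """Label an inferred breakpoint set relative to the simulated one.

    Assumption: an inferred breakpoint recovers a simulated one when it lies
    within `tolerance` bp of it; parental lineages are not checked here.
    """
    unmatched = list(inferred)
    recovered = 0
    for true_bp in simulated:
        hits = [bp for bp in unmatched if abs(bp - true_bp) <= tolerance]
        if hits:
            # Pair each simulated breakpoint with its closest unused match.
            closest = min(hits, key=lambda bp: abs(bp - true_bp))
            unmatched.remove(closest)
            recovered += 1
    missing = len(simulated) - recovered   # simulated breakpoints not found
    extra = len(unmatched)                 # inferred breakpoints with no partner
    if missing == 0 and extra == 0:
        return "correct"
    if missing == 0:
        return "superset"
    if extra == 0:
        return "subset"
    return "mismatch"
```

Under these assumptions, `classify_mosaic([750, 1250], [749])` is labelled a subset, and a replicate in which no breakpoints are inferred at all is also a subset, which is consistent with rows of the table that report Subset counts but no inferred breakpoints.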
Figure 4. A simulation scenario example.
One of the simulation scenarios used to assess our detection method, with the results over replicates (scenario 5/close in Table 2). The query sequence (2) was simulated to move from reference lineage 1 to reference lineage 3 every 400 bp as shown in the tree panel. The clustering chart depicts model- and replicate-averaged support for assigning the query sequence to a particular reference lineage, as estimated by the genetic algorithm over 100 simulated data replicates, whereas black impulse plots indicate the inferred placements of breakpoints. The y-axis does not reach 1 because each replicate contributes the model-averaged support for the best inferred mosaic type, a value that is at most 1; the upper limit on the y-axis is, therefore, the mean (over replicates) model-averaged support for the best-fitting mosaic (0.92 in this case).
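The caption refers to model-averaged support without giving the weighting formula. As a hedged illustration, the Python sketch below uses the standard BIC-weight approximation (weights proportional to exp(-ΔBIC/2)); whether SCUEAL uses exactly this weighting is an assumption, and the function names and the (BIC, mosaic label) input format are hypothetical.

```python
import math
from collections import defaultdict


def bic_weights(bic_scores):
    """Normalised weights proportional to exp(-0.5 * (BIC - best BIC))."""
    best = min(bic_scores)
    raw = [math.exp(-0.5 * (score - best)) for score in bic_scores]
    total = sum(raw)
    return [value / total for value in raw]


def mosaic_support(models):
    """Sum model weights per mosaic type.

    `models` is a list of (bic, mosaic_label) pairs for one replicate.
    """
    support = defaultdict(float)
    weights = bic_weights([bic for bic, _ in models])
    for weight, (_, label) in zip(weights, models):
        support[label] += weight
    return dict(support)


def mean_best_support(replicates):
    """Average, over replicates, the support of each replicate's best mosaic."""
    best = [max(mosaic_support(models).values()) for models in replicates]
    return sum(best) / len(best)
```

Because the weights within a replicate sum to 1, the support for any single mosaic type cannot exceed 1, and the mean over replicates falls below 1 whenever some weight goes to competing mosaics. This is the sense in which the plotted upper limit is a mean of model-averaged supports.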
SCUEAL screening results on partial HIV-1 reverse transcriptase sequences from the Stanford Drug Resistance database.
| Subtype | Sequences | Agree | within-subtype | Diff. pure subtype | Diff. recombinant | Top 3 CRFs and URFs |
| A | 1740 | | | | | CRF33/34 (31); A1, D (14); AE, B (7) |
| B | 16116 | | | | | CRF28/29 (273); CRF42 (54); CRF20/23/24 (30) |
| C | 3133 | | | | | B, C, CRF31 (56); B, C (36); C/CRF07 (8) |
| D | 624 | | | | | A1, D (16); B, CRF19 (4); B, D (3) |
| F | 464 | | | | | B, F1 (27); CRF29, F1 (5); B, CRF40, F1 (4) |
| G | 757 | | | | | B, CRF14 (4); B, G (3); G, J (3) |
| H | 28 | | | | | G, H (2); A, H (1); A, B, K (1) |
| J | 22 | | | | | C, J (1) |
| K | 166 | | | | | CRF32, G (22); CRF30, CRF32 (7); C, CRF32 (5) |
| CRF01 (AE) | 1552 | | | | | CRF22 (5); AE, B (4); B, CRF33 (4) |
| CRF02 (AG) | 1352 | | | | | A, G (285); A, CRF36, G (41); A, CRF02, G (34) |
Subtype lists the sequence subtype as annotated in the database. Sequences provides the number of sequences downloaded from the database. Agree gives the percentage of sequences for which SCUEAL returned the same subtype as that stored in the database. within-subtype: SCUEAL inferred within-subtype recombination within the same subtype as the one stored in the database; figures in parentheses show the proportion of within-subtype recombinants identified when DRAM positions were masked. Diff. pure subtype: the proportion of cases where SCUEAL inferred a pure subtype different from the annotated one. Diff. recombinant: the proportion of cases where SCUEAL inferred a recombinant mosaic with at least one fragment different from the annotated subtype; figures in parentheses show the proportion of within-subtype recombinants identified when DRAM positions were masked. Top 3 CRFs and URFs: the three most frequent mosaics inferred by SCUEAL.
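The column definitions in this legend amount to a four-way comparison between the annotated subtype and the inferred mosaic. The Python sketch below shows one way such a tabulation could be done; the representation of a mosaic as a list of per-fragment subtype labels, and all names in the code, are assumptions made for illustration rather than SCUEAL's output format.

```python
from collections import Counter

CATEGORIES = ("agree", "within_subtype", "diff_pure", "diff_recombinant")


def compare_to_annotation(annotated, fragments):
    """Classify one inferred mosaic against the annotated subtype.

    `fragments` is a hypothetical list of subtype labels, one per inferred
    fragment; a pure (non-recombinant) result has exactly one entry.
    """
    if not fragments:
        raise ValueError("an inferred mosaic needs at least one fragment")
    if len(fragments) == 1:
        return "agree" if fragments[0] == annotated else "diff_pure"
    if set(fragments) == {annotated}:
        # Several fragments, all from the annotated subtype.
        return "within_subtype"
    return "diff_recombinant"


def summarize(records):
    """Return the percentage of records falling into each category."""
    counts = Counter(compare_to_annotation(a, f) for a, f in records)
    total = sum(counts.values())
    return {name: 100.0 * counts[name] / total for name in CATEGORIES}
```

For instance, `summarize([("B", ["B"]), ("B", ["B", "B"]), ("B", ["C"]), ("B", ["B", "F1"])])` assigns 25% of the records to each of the four categories.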
Figure 5. Power and accuracy in the sequence shuffling simulation.
Power of SCUEAL to detect breakpoints in the HIV-1 pol sequence shuffling scenario as a function of recombinant fragment length (x-axis) and divergence between parental strains (y-axis). Grid cells are colored according to the proportion of correctly detected breakpoints (different cells may summarize different numbers of simulations). White squares are plotted when there were no simulated breakpoints within a corresponding length-divergence range of values.
Figure 6. An example of a good agreement between SCUEAL and REGA in classifying a partial pol subtype B sequence.
The SCUEAL clustering plots presented in this figure and in Figures 7, 8 and 9 are conceptually analogous to bootscan plots, i.e. they show which reference sequence is the most likely sister lineage of the query sequence at a given site, but they are based on model-averaged support values instead of phylogenetic bootstrap. A partial reference tree with placed query is shown; color coding is consistent between the similarity plot and the tree. A phylogenetic tree with bootstrap support values and a bootscan plot using the REGA alignment generated for the query sequence are shown.
Figure 7. An instance when a sequence unclassified by REGA is inferred to be a novel recombinant form by SCUEAL; the A–J mosaic structure is also confirmed by trees and bootscan plots based on the REGA reference alignment.
Figure 8. An example of within-subtype (B) recombination detected by SCUEAL, but not by REGA. A partial reference tree with placed query is shown; color coding is consistent between the similarity plot and the tree.
Figure 9. An instance when a sequence assigned to subtype A by REGA is deduced to be an A-B-A mosaic by SCUEAL.
Similarity plots based on the reduced REGA alignments (only A and B subtype reference sequences) confirm that the same mosaic structure is supported if a small enough window is selected for a sliding window analysis.
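The dependence on window size noted in the Figure 9 caption can be seen in a minimal sliding-window sketch. The Python code below computes plain percent identity between an aligned query and one reference; this is a simplification chosen for illustration (the function name, default window and step are assumptions, and published similarity plots may use corrected distances instead of raw identity).

```python
def window_identity(query, reference, window=200, step=20):
    """Return (window midpoint, fraction identical) along two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        matches = 0
        compared = 0
        for q, r in zip(query[start:start + window], reference[start:start + window]):
            if q == "-" or r == "-":
                continue
            compared += 1
            if q == r:
                matches += 1
        identity = matches / compared if compared else float("nan")
        results.append((start + window // 2, identity))
    return results
```

A short recombinant segment contributes only a fraction of the columns in a wide window, so its signal is diluted; with a window no longer than the segment, at least one window falls entirely inside it and the change in similarity becomes visible.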
SCUEAL screening results on partial HIV-1 polymerase sequences from the UK.
| Subtype | Sequences | Agree | within-subtype | Diff. pure subtype | Diff. recombinant | Top 3 CRFs and URFs |
| A | 2119 | | | | | CRF22 (24); A1, D (12); A1, C (4) |
| B | 19871 | | | | | B, D (120); B, CRF03 (40); B, F1 (38) |
| C | 7381 | | | | | B, C (11); C, D (11); C, J (10) |
| D | 614 | | | | | B, D (3); D, K (2); A, D (2) |
| F | 110 | | | | | B, F (2); F, G (1); F, H (1) |
| G | 673 | | | | | F1, G (25); CRF30, G (10); A, G (10) |
| H | 35 | | | | | |
| J | 35 | | | | | B, J (3); CRF09, J (3); G, J (2) |
| CRF01 (AE) | 419 | | | | | AE, B (2) |
| CRF02 (AG) | 1014 | | | | | A, G (278); A, CRF30, G (72); A, CRF30, CRF36 (56) |
| CRF06 | 147 | | | | | CRF32, K (34); CRF32, G (23); CRF30, CRF32 (14) |
Subtype lists the sequence subtype as annotated in the database. Sequences provides the number of sequences downloaded from the database. Agree gives the percentage of sequences for which SCUEAL returned the same subtype as the one inferred by REGA. within-subtype: SCUEAL inferred within-subtype recombination within the same subtype as the one inferred by REGA; figures in parentheses show the proportion of within-subtype recombinants identified when DRAM positions were masked. Diff. pure subtype: the proportion of cases where SCUEAL inferred a pure subtype different from the REGA assignment. Diff. recombinant: the proportion of cases where SCUEAL inferred a recombinant mosaic with at least one fragment different from the annotated subtype; figures in parentheses show the proportion of within-subtype recombinants identified when DRAM positions were masked. Top 3 CRFs and URFs: the three most frequent mosaics inferred by SCUEAL.