Oliver Ratmann, Chris Wymant, Caroline Colijn, Siva Danaviah, M Essex, Simon D W Frost, Astrid Gall, Simani Gaiseitsiwe, Mary Grabowski, Ronald Gray, Stephane Guindon, Arndt von Haeseler, Pontiano Kaleebu, Michelle Kendall, Alexey Kozlov, Justen Manasa, Bui Quang Minh, Sikhulile Moyo, Vladimir Novitsky, Rebecca Nsubuga, Sureshnee Pillay, Thomas C Quinn, David Serwadda, Deogratius Ssemwanga, Alexandros Stamatakis, Jana Trifinopoulos, Maria Wawer, Andrew Leigh Brown, Tulio de Oliveira, Paul Kellam, Deenan Pillay, Christophe Fraser.
Abstract
To characterize HIV-1 transmission dynamics in regions where the burden of HIV-1 is greatest, the 'Phylogenetics and Networks for Generalised HIV Epidemics in Africa' consortium (PANGEA-HIV) is sequencing full-genome viral isolates from across sub-Saharan Africa. We report the first 3,985 PANGEA-HIV consensus sequences from four cohort sites (Rakai Community Cohort Study, n=2,833; MRC/UVRI Uganda, n=701; Mochudi Prevention Project, n=359; Africa Health Research Institute Resistance Cohort, n=92). Next-generation sequencing success rates varied: more than 80% of the viral genome from the gag to the nef genes could be determined for all sequences from South Africa, 75% of sequences from Mochudi, 60% of sequences from MRC/UVRI Uganda, and 22% of sequences from Rakai. Partial sequencing failure was primarily associated with low viral load, increased for amplicons closer to the 3' end of the genome, was not associated with subtype diversity except HIV-1 subtype D, and remained significantly associated with sampling location after controlling for other factors. We assessed the impact of the missing data patterns in PANGEA-HIV sequences on phylogeny reconstruction in simulations. We found a threshold in terms of taxon sampling below which the patchy distribution of missing characters in next-generation sequences has an excess negative impact on the accuracy of HIV-1 phylogeny reconstruction, which is attributable to tree reconstruction artifacts that accumulate when branches in viral trees are long. The large number of PANGEA-HIV sequences provides unprecedented opportunities for evaluating HIV-1 transmission dynamics across sub-Saharan Africa and identifying prevention opportunities. Molecular epidemiological analyses of these data must proceed cautiously because sequence sampling remains below the identified threshold and a considerable negative impact of missing characters on phylogeny reconstruction is expected.
Year: 2017 PMID: 28540766 PMCID: PMC5597042 DOI: 10.1089/AID.2017.0061
Source DB: PubMed Journal: AIDS Res Hum Retroviruses ISSN: 0889-2229 Impact factor: 2.205

Alignment of the first PANGEA-HIV consensus sequences. Three thousand nine hundred eighty-five HIV-1 consensus sequences were generated from samples collected as part of the Mochudi Prevention Project (dark blue), the Rakai Community Cohort Study (purple), the Africa Health Research Institute Resistance Cohort (red), and the general population, fisherfolk, and female sex worker cohorts from MRC/UVRI Uganda (green). Locations of the HIV-1 gag, pol, and env genes are indicated on the x-axis, along with the primer sets of the Gall protocol that were used to amplify four overlapping genomic regions (arrows and blue dots). Vertical lines indicate the position of primers in the alignment. Missing data and gaps are shown in white. The total length of the alignment is 9,742 nt and covers the viral genome between HIV-1 gag and nef (length 8,628 nt in reference strain HXB2).
Characteristics of the First PANGEA-HIV Consensus Sequences
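| Cohort site | Africa Health Research Institute Resistance Cohort | Mochudi Prevention Project | Rakai Community Cohort Study | MRC/UVRI Uganda |
| Number of sequences | 92 | 359 | 2,833 | 701 |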
| Number of individuals | 92 | 351 | 2,820 | 694 |
| Sex, % | ||||
| F | 73 | 73 | 56 | 51 |
| M | 27 | 24 | 44 | 32 |
| Missing | 0 | 3 | 0 | 16 |
| Age at time of sampling, % | ||||
| <25 | 10 | 17 | 25 | 11 |
| 25–29 | 20 | 21 | 28 | 13 |
| 30–34 | 16 | 18 | 22 | 20 |
| 35–39 | 24 | 14 | 15 | 19 |
| 40 or older | 30 | 24 | 10 | 15 |
| Missing | 0 | 6 | 0 | 23 |
| Serum/plasma HIV-1 RNA within 1 year of sampling date (copies/ml), % | ||||
| <10,000 | 7 | 30 | 19 | 1 |
| 10,000–49,999 | 36 | 22 | 8 | 0 |
| 50,000–99,999 | 13 | 16 | 3 | 1 |
| 100,000 or higher | 35 | 19 | 2 | 2 |
| Missing | 9 | 13 | 68 | 96 |
| Self-reported ART use before sampling, % | ||||
| Yes | 100 | 3 | 6 | 0 |
| No | 0 | 91 | 94 | 90 |
| Missing | 0 | 6 | 0 | 10 |
| Year of sampling, % | ||||
| 2009 | 0 | 0 | 0 | 35 |
| 2010 | 0 | 38 | 0 | 5 |
| 2011 | 55 | 36 | 25 | 0 |
| 2012 | 42 | 17 | 46 | 0 |
| 2013 | 2 | 7 | 19 | 40 |
| 2014 | 0 | 0 | 11 | 10 |
| Missing | 0 | 2 | 0 | 10 |
| HIV-1 subtype, % | ||||
| A1 | 0 | 0 | 19 | 23 |
| B | 0 | 0 | 0 | 2 |
| C | 94 | 93 | 3 | 1 |
| D | 0 | 0 | 30 | 21 |
| Other | 0 | 0 | 0 | 1 |
| Potentially recombinant | 6 | 3 | 33 | 38 |
| <500 nt to determine subtype | 0 | 4 | 15 | 14 |
Potentially recombinant sequences were identified with the COMET HIV-1 subtyping tool[52] on four partial amplicon sequences; see Materials and Methods. More refined approaches are underway to confirm recombinant sequences among potentially recombinant sequences.
Sequencing Success Rates Among the First PANGEA-HIV Consensus Sequences
| Genomic region | Length (nt) | Africa Health Research Institute Resistance Cohort | Mochudi Prevention Project | Rakai Community Cohort Study | MRC/UVRI Uganda |
| Partial genome | |||||
| gag | 1,503 | 99 | 91 | 82 | 85 |
| pol | 2,844 | 100 | 81 | 37 | 68 |
| env | 2,571 | 100 | 80 | 37 | 66 |
| gag to nef | 8,628 | 99 | 82 | 46 | 71 |
| Genomic region between primers | |||||
| gag start-2F | 241 | 98 | 88 | 78 | 81 |
| 2F-1R | 879 | 100 | 93 | 84 | 89 |
| 1R-3F | 2,375 | 100 | 79 | 33 | 66 |
| 3F-2R | 231 | 100 | 91 | 48 | 76 |
| 2R-4F | 908 | 99 | 81 | 40 | 68 |
| 4F-3R | 1,844 | 100 | 84 | 42 | 71 |
| 3R-nef end | 1,048 | 99 | 71 | 30 | 59 |
See Figure 1 for location of the four forward and reverse primer sets.
Adjusted Odds Ratios of Partial Sequencing Failure Among the First PANGEA-HIV Sequences
| | Analysis 1 | | Analysis 2 | |
| | Excluding sequences from 21 batches that were significantly associated with sequencing failure | | Excluding sequences from 21 batches as in analysis 1, and short sequences of less than 500 nt whose subtype could not be determined | |
| | Odds ratio | 95% confidence interval | Odds ratio | 95% confidence interval |
| Serum/plasma HIV-1 RNA within 1 year of sampling date (copies/ml) | ||||
| <10,000 | 13.82 | 10.34–18.79 | 12.81 | 9.03–18.69 |
| 10,000–49,999 | 2.47 | 1.81–3.41 | 3.6 | 2.5–5.32 |
| 50,000–99,999 | 1.02 | 0.69–1.5 | 1.36 | 0.87–2.14 |
| 100,000 or higher | 0.02 | 0.01–0.02 | 0.01 | 0.01–0.02 |
| Missing | 5.76 | 4.34–7.79 | 6.1 | 4.33–8.84 |
| Self-reported ART use before sampling | ||||
| No | 1.0 | | 1.0 | |
| Yes | 1.05 | 0.95–1.16 | 0.96 | 0.85–1.08 |
| Cohort site | ||||
| Mochudi | 1.0 | | 1.0 | |
| Africa Health Research Institute Resistance Cohort | 0 | singular | 0 | singular |
| Rakai | 4.01 | 3.38–4.76 | 5.95 | 4.24–8.38 |
| MRC/UVRI Historic | 2.65 | 1.88–3.72 | 3.72 | 2.3–6.01 |
| MRC/UVRI FSW cohort | 1.69 | 1.38–2.07 | 1.75 | 1.21–2.53 |
| MRC/UVRI population cohorts | 1.26 | 1–1.59 | 1.55 | 1.04–2.31 |
| Amplicon | ||||
| Amplicon 1 | 1.0 | | 1.0 | |
| Amplicon 2 | 3.7 | 3.27–4.2 | 8.19 | 6.86–9.82 |
| Amplicon 3 | 5.06 | 4.48–5.73 | 12.42 | 10.42–14.87 |
| Amplicon 4 | 6.97 | 6.16–7.91 | 18.24 | 15.29–21.86 |
| HIV-1 subtype | ||||
| A1 | — | — | 1.0 | |
| B | — | — | 0 | singular |
| C | — | — | 1.18 | 0.87–1.59 |
| D | — | — | 1.22 | 1.07–1.38 |
| Other | — | — | 1.18 | 0.35–4.11 |
| Potential recombinant | — | — | 0.64 | 0.54–0.76 |
Entries with an odds ratio of 0 and a “singular” confidence interval: no partial sequencing failure was observed.
The following genomic regions (partial amplicon sequences) in each amplicon were considered: gag start-2F in amplicon 1, 1R-3F in amplicon 2, 2R-4F in amplicon 3, 3R-nef end in amplicon 4. See Figure 1 for location of the partial amplicon sequences.
HIV-1 subtype was determined with the COMET HIV-1 subtyping tool,[52] version 2.1, for each of the genomic regions 1F-1R, 3F-4F, 4F-3R, if these were determined to at least 500 nt. If all region-specific assignments agreed, corresponding sequences were classified as “A1,” “B,” “C,” “D,” “other” (pure subtype). All other sequences were classified as “potential recombinant.”
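The classification rule in the footnote above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_sequence`, the input layout (a dictionary mapping each genomic region to a pair of subtype call and number of determined nucleotides), and the label returned when no region reaches 500 nt are hypothetical choices, and the sketch does not call COMET itself. It also assumes that each region-specific call is a single subtype label.

```python
MIN_LENGTH_NT = 500                    # threshold stated in the footnote
PURE_SUBTYPES = {"A1", "B", "C", "D"}  # subtypes reported separately in the tables


def classify_sequence(region_calls):
    """Combine region-specific subtype calls into one label.

    region_calls maps a region name (e.g. "1F-1R") to a tuple
    (subtype_call, determined_nt). Hypothetical input layout.
    """
    # Keep only regions that were determined to at least 500 nt.
    usable = [call for call, n_nt in region_calls.values()
              if n_nt >= MIN_LENGTH_NT]
    if not usable:
        return "<500 nt to determine subtype"
    # Any disagreement between regions marks a potential recombinant.
    if len(set(usable)) > 1:
        return "potential recombinant"
    subtype = usable[0]
    return subtype if subtype in PURE_SUBTYPES else "other"


# Usage: the short third region is ignored, the first two disagree.
example = {"1F-1R": ("A1", 1200), "3F-4F": ("D", 900), "4F-3R": ("A1", 300)}
label = classify_sequence(example)
```

Requiring agreement across all usable regions is conservative: a single discordant region is enough to move a sequence out of the pure-subtype categories, which is consistent with the footnote's wording "potential recombinant".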

Correctly reconstructed clades in simulated HIV-1 phylogenies from sequence alignments of 1,600 taxa with and without missing characters. Viral phylogenies of a generalized HIV-1 epidemic in a hypothetical sub-Saharan African setting were simulated, and HIV-1 gag, pol, and env sequences were generated along this phylogeny. The sampling coverage was 6% of individuals living with HIV-1 by 2020 in the simulation, corresponding to 1,600 taxa. PhyML was used to reconstruct the simulated viral tree. (A) Parts of the simulated viral phylogeny (blue) that were correctly reconstructed in 10 out of 10 replicate runs of PhyML from the sequence alignment of gag+pol+env sequences without missing characters (data set D1, see Supplementary Table S1). (B) Parts of the same simulated viral phylogeny that were correctly reconstructed in 10 out of 10 replicate runs of PhyML from a patchy sequence alignment, obtained by copying missing characters of randomly selected PANGEA-HIV sequences from Botswana into the sequence alignment D1 (data set D2). For visualization purposes, only the first five clades of the phylogeny are shown, each corresponding to a distinct transmission chain in the simulation. Results were similar with other tree reconstruction methods, and PhyML was chosen for illustration purposes.

Impact of missing characters in PANGEA-HIV sequences on phylogeny reconstruction when sequences are sparsely sampled. Three sequence data sets of 1,600 taxa of concatenated HIV-1 gag, pol, env genes were simulated. For each data set, missing characters in real PANGEA-HIV sequences from specific sampling locations (see x-axis) were copied into simulated sequences (data sets D1–D3, see Supplementary Table S1). Phylogenies were reconstructed in replicate with several tree reconstruction algorithms and compared to the true phylogeny. (A) Quartet distance between reconstructed and true subtrees that correspond to sampled transmission chains in the simulations. (B) Kendall-Colijn distance between reconstructed and true subtrees that correspond to sampled transmission chains in the simulations. (C) Proportion of false-positive transmission pairs among pairs of individuals that diverged by less than 1% substitutions/site in reconstructed phylogenies. (D) Mean absolute error (years) in estimated divergence times between sequences from sampled transmission pairs. Across all error measures, reconstructed phylogenies were considerably less accurate when sequences were sparsely sampled and contained missing characters as seen among PANGEA-HIV sequences from Botswana or Uganda, compared to gag+pol+env sequences without missing characters.
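The error measures in panels (C) and (D) are simple to state in code. The sketch below is one plausible reading of the caption, not the authors' implementation: the function names, the dictionary of pairwise distances keyed by `frozenset`, and the toy numbers are all hypothetical. It assumes that a pair is "called" when its distance in the reconstructed tree is below 0.01 substitutions/site, and that the true transmission pairs are known from the simulation.

```python
def false_positive_proportion(pairwise_distance, true_pairs, threshold=0.01):
    """Share of called pairs that are not true transmission pairs.

    pairwise_distance: dict mapping frozenset({a, b}) to the distance
    (substitutions/site) between a and b in the reconstructed tree.
    true_pairs: set of frozensets known from the simulation.
    """
    called = [pair for pair, d in pairwise_distance.items() if d < threshold]
    if not called:
        return 0.0  # convention chosen here when no pair is called
    false_pos = sum(1 for pair in called if pair not in true_pairs)
    return false_pos / len(called)


def mean_absolute_error(estimated, true):
    """Mean absolute difference between paired estimates and true values."""
    return sum(abs(e - t) for e, t in zip(estimated, true)) / len(true)


# Toy example with hypothetical distances and divergence times (years).
distances = {
    frozenset({"A", "B"}): 0.004,
    frozenset({"A", "C"}): 0.008,
    frozenset({"B", "C"}): 0.020,
    frozenset({"C", "D"}): 0.006,
}
truth = {frozenset({"A", "B"}), frozenset({"C", "D"})}
fp = false_positive_proportion(distances, truth)  # one of three called pairs is false
mae = mean_absolute_error([2.0, 5.5, 1.0], [1.5, 4.0, 1.0])
```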

Excess negative impact of irregularly distributed missing characters on HIV-1 phylogeny reconstruction. Four times 60 sequence alignments of varying size (1,600 to 9,629 sequences, shape of points) and varying missing site patterns (either patchy or allocated in a single block after a certain genome position, color of points) were simulated (data sets D1-Mxx, D4-Mxx, D5-Mxx, D6-Mxx, D1-Pyy, see Supplementary Table S1). For each alignment, the average proportion of missing characters per sequence relative to the length of the gag+pol+env genome (6,807 nt) was calculated. One phylogeny per alignment was reconstructed with RAxML. (A) We first compared Quartet distances of trees reconstructed from patchy sequence alignments of 1,600 taxa to those of trees reconstructed from partial sequence alignments of 1,600 taxa. For the same average number/average proportion of missing characters, viral trees were less accurately reconstructed when missing characters were irregularly distributed. (B) We then compared Quartet distances of trees reconstructed from patchy sequence alignments that increased in the number of viral sequences sampled. The excess error in Quartet distances associated with irregularly distributed missing characters vanished as sampling coverage approached 30% of individuals living with HIV-1 by 2020 in the simulations (∼10,000 taxa).
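The distinction between patchy and single-block missing site patterns, and the per-sequence proportion of missing characters, can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `?` symbol for a missing character, the function names, and the 10 nt toy sequence standing in for the 6,807 nt gag+pol+env genome.

```python
MISSING = "?"  # hypothetical symbol for a missing character


def missing_proportion(seq, genome_length):
    """Missing characters in one sequence relative to the genome length."""
    return seq.count(MISSING) / genome_length


def mean_missing_proportion(alignment, genome_length):
    """Average proportion of missing characters per sequence."""
    return sum(missing_proportion(s, genome_length)
               for s in alignment) / len(alignment)


def mask_block(seq, start):
    """Single block: everything from position `start` onward is missing."""
    return seq[:start] + MISSING * (len(seq) - start)


def mask_patchy(seq, template):
    """Patchy: copy the missing positions of a template sequence.

    Assumes seq and template have the same length.
    """
    return "".join(MISSING if t == MISSING else c
                   for c, t in zip(seq, template))


# Toy example: both patterns remove 4 of 10 characters.
seq = "ACGTACGTAC"
block = mask_block(seq, 6)
patchy = mask_patchy(seq, "A??TAC??AC")
average = mean_missing_proportion([block, patchy], genome_length=len(seq))
```

The two masked sequences have the same proportion of missing characters but a different distribution along the genome, which is the contrast panel (A) examines at a fixed average amount of missing data.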

Alignment trimming to reduce tree reconstruction artifacts. (A) Sixty alignments of 1,600 gag+pol+env sequences (6,807 nt) with increasing proportions of missing characters were simulated. Missing site patterns were copied at random from PANGEA-HIV sequences (data sets D1-Mxx, see Supplementary Table S1). Thirty alignments were trimmed to the gag gene. One phylogeny per alignment was reconstructed with RAxML. We compared Quartet distances of trees reconstructed from patchy gag+pol+env sequences (gray) to those of patchy gag sequences (orange). It is possible to reconstruct more accurate phylogenies from shorter gag sequences, but only when the trimmed alignment harbors substantially fewer missing characters than the longer original alignment and sequence sampling coverage is low (6%). The proportion of missing characters in gag and gag+pol+env sequences among PANGEA-HIV sequences from Botswana and Uganda is indicated with triangles and diamonds. (B) The three sequence data sets of 1,600 gappy gag+pol+env sequences of Figure 2 were trimmed to the gag gene. Ten phylogenies were reconstructed with IQ-TREE, PhyML, and RAxML per alignment, and results are shown for IQ-TREE and PhyML. Tree reconstructions from gag genes that harbored missing characters as seen in PANGEA-HIV sequences from Botswana or Uganda were not more accurate than those from patchy gag+pol+env sequences, regardless of distance measure and tree reconstruction method. The differences in missing character patterns between the trimmed and original alignments were not large enough to result in more accurate tree reconstructions with the trimmed alignment.
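As a rough illustration of the trimming comparison, the sketch below cuts an alignment to a column range and reports how much the fraction of missing characters changes. The column range, the `?` symbol, the function names, and the toy sequences are hypothetical; the caption does not give a numerical cutoff for "substantially fewer" missing characters, so none is assumed here.

```python
MISSING = "?"  # hypothetical symbol for a missing character


def trim(alignment, start, end):
    """Keep alignment columns start..end-1 (0-based, half-open)."""
    return [seq[start:end] for seq in alignment]


def missing_fraction(alignment):
    """Fraction of all alignment cells that are missing."""
    total = sum(len(seq) for seq in alignment)
    missing = sum(seq.count(MISSING) for seq in alignment)
    return missing / total


# Toy alignment; columns 0..3 stand in for the gag gene.
full = ["ACGT????ACGT",
        "AC??ACGTAC??"]
gag_only = trim(full, 0, 4)

before = missing_fraction(full)       # 8 of 24 cells missing
after = missing_fraction(gag_only)    # 2 of 8 cells missing
reduction = before - after
```

In this toy case trimming lowers the missing fraction only modestly, which mirrors the situation described for the PANGEA-HIV sequences from Botswana and Uganda, where the difference was not large enough to improve tree reconstruction.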