
HIV-1 full-genome phylogenetics of generalized epidemics in sub-Saharan Africa: impact of missing nucleotide characters in next-generation sequences.

Oliver Ratmann¹, Chris Wymant², Caroline Colijn³, Siva Danaviah⁴, M Essex^5,6, Simon D W Frost⁷, Astrid Gall⁸, Simani Gaiseitsiwe⁹, Mary Grabowski^10,11, Ronald Gray^10,12, Stephane Guindon^13,14, Arndt von Haeseler^15,16, Pontiano Kaleebu¹⁷, Michelle Kendall¹⁸, Alexey Kozlov¹⁹, Justen Manasa^20,21, Bui Quang Minh²², Sikhulile Moyo^23,24, Vladimir Novitsky^25,26, Rebecca Nsubuga²⁷, Sureshnee Pillay²⁸, Thomas C Quinn^10,29,30, David Serwadda^31,32, Deogratius Ssemwanga^33,34, Alexandros Stamatakis^35,36, Jana Trifinopoulos³⁷, Maria Wawer^10,38, Andrew Leigh Brown³⁹, Tulio de Oliveira⁴⁰, Paul Kellam⁴¹, Deenan Pillay⁴², Christophe Fraser⁴³.

Abstract

To characterize HIV-1 transmission dynamics in regions where the burden of HIV-1 is greatest, the 'Phylogenetics and Networks for Generalised HIV Epidemics in Africa' consortium (PANGEA-HIV) is sequencing full-genome viral isolates from across sub-Saharan Africa. We report the first 3,985 PANGEA-HIV consensus sequences from four cohort sites (Rakai Community Cohort Study, n=2,833; MRC/UVRI Uganda, n=701; Mochudi Prevention Project, n=359; Africa Health Research Institute Resistance Cohort, n=92). Next-generation sequencing success rates varied: more than 80% of the viral genome from the gag to the nef genes could be determined for all sequences from South Africa, 75% of sequences from Mochudi, 60% of sequences from MRC/UVRI Uganda, and 22% of sequences from Rakai. Partial sequencing failure was primarily associated with low viral load, increased for amplicons closer to the 3' end of the genome, was not associated with subtype diversity except HIV-1 subtype D, and remained significantly associated with sampling location after controlling for other factors. We assessed the impact of the missing data patterns in PANGEA-HIV sequences on phylogeny reconstruction in simulations. We found a threshold in terms of taxon sampling below which the patchy distribution of missing characters in next-generation sequences has an excess negative impact on the accuracy of HIV-1 phylogeny reconstruction, which is attributable to tree reconstruction artifacts that accumulate when branches in viral trees are long. The large number of PANGEA-HIV sequences provides unprecedented opportunities for evaluating HIV-1 transmission dynamics across sub-Saharan Africa and identifying prevention opportunities. Molecular epidemiological analyses of these data must proceed cautiously because sequence sampling remains below the identified threshold and a considerable negative impact of missing characters on phylogeny reconstruction is expected.

Year: 2017 PMID: 28540766 PMCID: PMC5597042 DOI: 10.1089/AID.2017.0061

Source DB: PubMed Journal: AIDS Res Hum Retroviruses ISSN: 0889-2229 Impact factor: 2.205

Introduction

Viral phylogenetic methods are proving effective in addressing central questions in HIV-1 epidemiology: from characterizing continued transmissions in vulnerable populations[1,2] to quantifying their sources of transmission,[3,4] and detecting HIV-1 outbreaks in near real time.[5] In the past, these investigations were largely based on partial HIV-1 sequences of less than 1,500 nucleotides (nt) in length, obtained through Sanger sequencing. To expand the utility of viral phylogenetic methods, several consortia are now generating HIV-1 sequence data sets that span the entire viral genome.[6-9] The “Phylogenetics and Networks for Generalised HIV Epidemics in Africa” consortium (PANGEA-HIV) is in the process of providing more than 10,000 NGS from partnering cohort sites in sub-Saharan Africa for a comprehensive evaluation of current HIV-1 transmission dynamics.[6] We report the first 3,985 PANGEA-HIV consensus sequences that were generated in high throughput at the Wellcome Trust Sanger Institute on the Illumina MiSeq platform, after automated extraction of viral RNA and amplification with a universal HIV-1 primer set.[10] The sequences are from diverse settings in sub-Saharan Africa, including cohorts of the general population at various surveillance sites (Rakai Community Cohort Study,[11] Mochudi Prevention Project,[12,13] MRC/UVRI Uganda general population and fisherfolk cohorts[14-16]), a cohort of female sex-workers (MRC/UVRI Uganda Good Health for Women[17]), historical sequences from the 1980s, and a cohort of HIV-1 drug-resistant individuals from northern KwaZulu-Natal in South Africa (Africa Health Research Institute Resistance Cohort[18]). Most PANGEA-HIV consensus sequences from Botswana, South Africa, and MRC/UVRI Uganda cover nearly the entire viral genome from the gag to nef genes. Sequencing success rates were considerably lower for samples from the Rakai Community Cohort and varied substantially across the genome. 
Potential reasons for variation in NGS success rates could be as follows: low viral RNA count at time of sampling; sample degradation before RNA extraction; failure to extract viral RNA from plasma or serum samples; failure to amplify extracted RNA with the universal HIV-1 primer set; and failure during sequencing or sequence assembly. Our investigations below indicate that a number of factors, and not only low serum/plasma HIV-1 RNA loads, were associated with partial sequencing failure. Phylogenomic studies across the tree of life highlight that phylogenies can be accurately reconstructed from sequences with very high proportions of missing characters.[19-25] This could also be the case for HIV-1 phylogenies. Longer sequences of HIV-1 genomes increase phylogenetic accuracy,[26,27] because more nucleotide characters are available to resolve internal branches through characters that are uniquely shared among sets of sequences, and to infer multiple substitutions between convergent sequences.[28] On the other hand, and similar to many other pathogens, the HIV-1 genome is short (9,719 nt for the reference strain HXB2, including 5′- and 3′-LTR sequences). Thus, the number of informative characters between full-genome HIV-1 sequences remains limited, and missing data reduce the number of shared informative characters disproportionately when any two sequences have missing characters at different alignment positions.
In addition, HIV-1 phylogenies from generalized epidemics in sub-Saharan Africa are broad and exhibit many long branches.[12,29,30] Indeed, due to the sheer magnitude of generalized HIV-1 epidemics in sub-Saharan Africa, closely related sequences are often not available to break long branches in viral phylogenies,[31] although other factors, including onward transmission months or years after infection, also contribute to the presence of long branches.[32] When branches are long, more informative sites are required to correctly infer phylogenetic relationships, and missing data could in this context indirectly exacerbate tree reconstruction artifacts.[31,33,34] These considerations suggest that missing data in HIV-1 sequences could have a substantial negative impact on tree reconstruction accuracy and subsequent molecular epidemiological studies, especially in sub-Saharan African settings where sequence sampling is limited. To characterize the implications of missing nucleotide characters in PANGEA-HIV sequences on tree reconstruction accuracy, we conducted simulation studies. An individual-level transmission and prevention model was used to generate regional HIV-1 epidemics in populations of ∼80,000 individuals, as well as corresponding HIV-1 phylogenies and full-genome sequences.[35] Each simulated sequence was paired at random with a PANGEA-HIV sequence, and missing nucleotide patterns of PANGEA-HIV sequences were superimposed onto the simulated sequences. We then tested several tree reconstruction tools in their ability to re-estimate known HIV-1 phylogenies from partially determined consensus sequences. These analyses provide insight into the accuracy with which viral phylogenetic relationships can be reconstructed from NGS that have missing characters at different positions in the sequence alignment. 
Since all molecular epidemiological investigations rely on accurately reconstructed viral phylogenies, our findings are fundamental to PANGEA-HIV and analogous full-genome viral sequencing efforts.

Materials and Methods

Next-generation sequencing

Serum and plasma samples from PANGEA-HIV participating cohort sites in Uganda and Botswana were shipped to University College London Hospital, London, United Kingdom, for automated RNA sample extraction on QIAsymphony SP workstations with the QIAsymphony DSP Virus/Pathogen Kit (Cat. No. 937036, 937055; Qiagen, Hilden, Germany), followed by one-step reverse transcription polymerase chain reaction (RT-PCR) as described in Ref.[10]. Amplification was assessed through gel electrophoresis on a fraction of samples, and samples were shipped to the Wellcome Trust Sanger Institute, Hinxton, United Kingdom. From plasma samples from the resistance cohort, RNA was extracted at the Africa Health Research Institute in Durban, South Africa, using the QIAamp Viral Mini kit (Cat. No. 52906; Qiagen), followed by one-step RT-PCR as described in Ref.[10]. Amplicons were purified using the QIAquick Purification kit (Cat. No. 28106; Qiagen) and shipped to the Wellcome Trust Sanger Institute. Next-generation sequencing was performed as described previously on the Illumina MiSeq platform in the DNA pipelines core facility at the Wellcome Trust Sanger Institute.[36]

HIV-1 consensus sequences

Next-generation sequencing output was assembled with the SHIVER sequence assembly pipeline.[37] Briefly, short reads were mapped to a de novo reference constructed using contigs (that were assembled from the short reads with IVA[38]) and a set of standard whole-genome reference sequences.[39] Using the contigs for mapping, where available, increased accuracy in the constructed consensus sequences compared to using standard reference sequences alone. Gaps between contigs in the reference sequence were filled with a “best guess” standard reference sequence, giving those short reads that failed to result in contigs a chance to be mapped and produce additional consensus sequence. The SHIVER pipeline thus combined de novo assembly and read mapping to maximize accuracy and the length of the genome that can be assembled. The consensus sequence of mapped reads was determined by the most frequent read call at each site. To mitigate the effects of low-level contaminant reads, sites with fewer than 10 mapped reads were classified as undetermined. Consensus sequences were trimmed to the viral genome from HIV-1 gag (p17) to HIV-1 nef. This process yielded consensus sequences that were each aligned against a de novo reference sequence and a set of standard reference sequences, in this case the Los Alamos HIV-1 sequence compendium 2012.[39] To construct alignments of HIV-1 consensus sequences, insertions in consensus sequences were excised if they were not present in the standard reference sequences. The resulting alignment was uncertain in that gap characters which flank missing data characters could represent a deletion or a missing nucleotide as in “AC-GT-??--?-ACGT.” These sites were set to missing nucleotide characters: “AC-GT???????ACGT.”
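The two rules described above (majority-call consensus with a minimum read depth, and recoding of gaps that flank missing data) can be sketched in a few lines of Python. This is an illustration only and not the SHIVER implementation: the function names, the per-site count dictionaries used as input, and the tie-breaking behaviour are assumptions made for this example.

```python
import re
from collections import Counter

MISSING = "?"


def call_consensus(site_counts, min_depth=10):
    """Build a consensus string from per-site base counts.

    site_counts is a list with one dict per alignment site, mapping a
    base call to the number of mapped reads supporting it (hypothetical
    input format). Sites with fewer than min_depth mapped reads are
    classified as undetermined.
    """
    consensus = []
    for counts in site_counts:
        depth = sum(counts.values())
        if depth < min_depth:
            consensus.append(MISSING)
        else:
            # Most frequent call; ties fall back to first-seen order,
            # which is an assumption of this sketch.
            base, _ = Counter(counts).most_common(1)[0]
            consensus.append(base)
    return "".join(consensus)


def mask_ambiguous_gaps(aligned_seq):
    """Recode every run of '-' and '?' that contains a '?' as all '?'.

    Runs made of gap characters only are left untouched, since they are
    not adjacent to missing data.
    """
    def recode(match):
        run = match.group(0)
        return MISSING * len(run) if MISSING in run else run

    return re.sub(r"[-?]+", recode, aligned_seq)


if __name__ == "__main__":
    print(call_consensus([{"A": 12, "G": 3}, {"C": 4, "T": 5}, {"T": 10}]))
    print(mask_ambiguous_gaps("AC-GT-??--?-ACGT"))
```

Treating a whole gap-and-missing run as missing is the conservative choice: a deletion that cannot be distinguished from sequencing failure is not counted as evidence for a shared indel.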

Statistical analysis of factors associated with partial sequencing failure

To evaluate factors associated with partial sequencing failure, we focused on four genomic regions in the unaligned consensus sequences that were amplified by exactly one of the four primer sets of the Gall protocol[10]: region start-2F between the start of the gag gene and the 2F primer on amplicon 1, region 1R-3F between the 1R and 3F primers on amplicon 2, region 2R-4F between the 2R and 4F primers on amplicon 3, and region 3R-end between the 3R primer and the end of the nef gene on amplicon 4 (“partial amplicon sequences,” see Fig. 1).

Figure 1. Alignment of the first PANGEA-HIV consensus sequences. Three thousand nine hundred eighty-five HIV-1 consensus sequences were generated from samples collected as part of the Mochudi Prevention Project (dark blue), the Rakai Community Cohort Study (purple), the Africa Health Research Institute Resistance Cohort (red), and the general population, fisherfolk, and female sex worker cohorts from MRC/UVRI Uganda (green). Locations of the HIV-1 gag, pol, and env genes are indicated on the x-axis, along with the primer sets of the Gall protocol that were used to amplify four overlapping genomic regions (arrows and blue dots). Vertical lines indicate the position of primers in the alignment. Missing data and gaps are shown in white. The total length of the alignment is 9,742 nt and covers the viral genome between HIV-1 gag and nef (length 8,628 nt in reference strain HXB2).

Based on PANGEA-HIV sequencing failure rates, partial amplicon sequences were classified into “undetermined” when more than 80% of nucleotide characters were missing, and “determined” when less than 60% of characters were missing. Ambiguous partial amplicon sequences with 20%–40% missing characters were not used in the analysis. Multivariate logistic regression analysis (gamlss,[40] R version 3.2.0) was used to identify covariates that were significantly associated with undetermined partial amplicon sequences.

Simulations to assess impact of missing nucleotides in PANGEA-HIV sequences

Viral trees were generated under the regional PANGEA-HIV simulation model[35] and captured disease dynamics in a regional population of ∼80,000 individuals from 1985 until 2020. Sequences of 6,807 nt were simulated along the viral trees with SeqGen version 1.3.2,[41] using codon- and gene-specific evolutionary rates and relative substitution rate parameters that were estimated from HIV-1 subtype C sequences (see supplementary fig. S13 in Ref.[35]). The simulated sequences correspond to concatenated gag, pol, and env genes, excluding the gag stem loop and variable loops in the env gene. To create alignments with missing data, missing nucleotide patterns of aligned PANGEA-HIV sequences were superimposed onto the simulated sequences. This step preserved the nonrandom distribution of missing data in PANGEA-HIV sequences. Other missing data patterns were also considered. Simulated sequence alignments systematically varied in the average proportion of missing characters per sequence in an alignment (0% to 60%), the distribution of missing characters [structured as in PANGEA-HIV sequences (“patchy” sequences), or in a single block after a given genomic position (“partial” sequences)], and sequence sampling coverage [1,600 (6%) to 9,629 (30%) of individuals living with HIV-1 in 2020 in the simulations]. Alignments and corresponding viral trees were indexed as described in Supplementary Table S1, and are available from https://doi.org/10.6084/m9.figshare.5056837.v1
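The superimposition step can be illustrated with a short Python sketch. It assumes that simulated and observed sequences have already been brought onto the same alignment coordinates and that missing nucleotides are coded as "?"; the function names, the pairing with replacement, and the toy sequences are assumptions of this example, not the code used in the study.

```python
import random


def superimpose_missing(simulated, observed, missing_char="?", seed=None):
    """Copy missing-character positions from observed onto simulated sequences.

    simulated: dict mapping taxon name -> simulated sequence.
    observed:  list of observed sequences on the same coordinates.
    Each simulated sequence is paired at random with one observed
    sequence (sampling with replacement is an assumption here).
    """
    rng = random.Random(seed)
    masked = {}
    for name, sim_seq in simulated.items():
        donor = rng.choice(observed)
        if len(donor) != len(sim_seq):
            raise ValueError("sequences must share alignment coordinates")
        masked[name] = "".join(
            missing_char if obs_char == missing_char else sim_char
            for sim_char, obs_char in zip(sim_seq, donor)
        )
    return masked


def mean_missing_fraction(sequences, missing_char="?"):
    """Average proportion of missing characters per sequence."""
    fractions = [seq.count(missing_char) / len(seq) for seq in sequences.values()]
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    sims = {"taxon1": "GGGGGGGG", "taxon2": "ACGTACGT"}
    obs = ["A??C-T?A"]
    patchy = superimpose_missing(sims, obs, seed=1)
    print(patchy, mean_missing_fraction(patchy))
```

Because whole observed patterns are copied rather than positions drawn independently, the block structure of missing data along the genome is carried over to the simulated alignment.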

Maximum-likelihood tree reconstruction

To ensure optimal deployment of existing phylogenetic inference tools, HIV-1 trees were reconstructed with IQ-TREE,[42,43] PhyML,[44] and RAxML[45] by the respective software developers. To determine best program settings, the ‘true’ phylogeny, from which the sequences were simulated, was provided for one data set without missing nucleotides to the teams. Trees were also reconstructed with FastTree[46] by the authors of this study. The command line options that were used for HIV-1 tree reconstructions are listed in Supplementary Data (Supplementary Data are available online at www.liebertpub.com/aid). Trees were subsequently dated and rooted with LSD version 0.3beta.[47]

Assessment of phylogeny reconstructions from sequences with missing data

Reconstructed trees were compared to true trees using several distance measures for tree topologies and HIV-1 transmission pairs. The central aim of PANGEA-HIV is to characterize recent transmission dynamics. For this reason, we focused on comparing the topology of phylogenetic clades that corresponded to transmission chains within the simulated regional population. This excluded deep splits in the true and inferred phylogenies from consideration. For each clade with at least four taxa, we calculated the proportion of unrooted, labeled subtrees of four taxa whose topologies differed between inferred and true clades (Quartet distance).[48] In addition, we evaluated the Kendall-Colijn distance on the same clades.[49] Tree distances typically scaled with clade size.[50] We estimated average functional relationships between tree distance and clade size with polynomial regression techniques, and adjusted tree distances for differences in clade size. To evaluate whether transmission pairs were accurately identified, we considered phylogenetically very close individuals as a proxy of transmission pairs and evaluated the proportion of false positives. The divergence cutoff was set deliberately at a low value of 1% substitutions per site,[51,52] so that a high proportion of true transmission pairs was expected under baseline analyses from near complete sequences. To evaluate whether transmission pairs were accurately dated, we considered for each sampled transmission pair (for whom both the transmitter and recipient had a sequence taken) the distance in units of time between their sequences in the true phylogeny, as well as the inferred phylogeny. We then calculated the mean absolute error of these distances across pairs. These distance measures provided an assessment of tree reconstruction accuracy in terms of local HIV-1 transmission chains and sampled transmission pairs.
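To make the Quartet distance concrete, the following Python sketch computes it by brute force for two small trees on the same taxa. It is a minimal illustration under stated assumptions: trees are given as nested tuples with string leaf names (a hypothetical input format), quartet topologies are read off edge-count path lengths using the four-point condition, and all function names are invented for this example. Enumerating every quartet scales with the fourth power of the number of taxa, so dedicated software would be needed for large clades.

```python
from collections import deque
from itertools import combinations


def _adjacency(tree):
    """Turn a nested-tuple tree with string leaves into an adjacency map."""
    adj = {}
    counter = [0]

    def visit(node):
        if isinstance(node, str):
            adj.setdefault(node, set())
            return node
        node_id = ("internal", counter[0])
        counter[0] += 1
        adj[node_id] = set()
        for child in node:
            child_id = visit(child)
            adj[node_id].add(child_id)
            adj[child_id].add(node_id)
        return node_id

    visit(tree)
    return adj


def _leaf_distances(adj):
    """Edge-count path lengths between all leaves (breadth-first search)."""
    leaves = [n for n in adj if isinstance(n, str)]
    dist = {}
    for start in leaves:
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        dist[start] = {leaf: seen[leaf] for leaf in leaves}
    return dist


def _quartet_split(dist, a, b, c, d):
    """Pairing with the smallest distance sum, or None if unresolved."""
    sums = {
        frozenset([frozenset([a, b]), frozenset([c, d])]): dist[a][b] + dist[c][d],
        frozenset([frozenset([a, c]), frozenset([b, d])]): dist[a][c] + dist[b][d],
        frozenset([frozenset([a, d]), frozenset([b, c])]): dist[a][d] + dist[b][c],
    }
    smallest = min(sums.values())
    best = [split for split, total in sums.items() if total == smallest]
    return best[0] if len(best) == 1 else None


def quartet_distance(tree_a, tree_b):
    """Proportion of four-taxon subtrees whose topology differs."""
    dist_a = _leaf_distances(_adjacency(tree_a))
    dist_b = _leaf_distances(_adjacency(tree_b))
    if set(dist_a) != set(dist_b):
        raise ValueError("trees must contain the same taxa")
    taxa = sorted(dist_a)
    if len(taxa) < 4:
        raise ValueError("at least four taxa are required")
    quartets = list(combinations(taxa, 4))
    differing = sum(
        _quartet_split(dist_a, *q) != _quartet_split(dist_b, *q) for q in quartets
    )
    return differing / len(quartets)


if __name__ == "__main__":
    true_clade = ((("A", "B"), "C"), ("D", "E"))
    inferred_clade = ((("A", "C"), "B"), ("D", "E"))
    print(quartet_distance(true_clade, inferred_clade))
```

In the toy example, two of the five possible quartets (those containing A, B, and C together) change topology, so the distance is 2/5.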

Results

PANGEA-HIV next-generation sequences

Table 1 characterizes the first 3,985 PANGEA-HIV consensus sequences. Next-generation sequencing data are available through the European Nucleotide Archive (www.ebi.ac.uk/ena/data/view/PRJEB19239) and HIV-1 consensus sequences are available upon request to the PANGEA-HIV steering committee (Supplementary Data).

Table 1. Characteristics of the First PANGEA-HIV Consensus Sequences

	Africa Health Research Institute Resistance Cohort	Mochudi Prevention Project	Rakai Community Cohort Study	MRC/UVRI Uganda
Number of sequences	92	359	2,833	701
Number of individuals	92	351	2,820	694
Sex, %
F	73	73	56	51
M	27	24	44	32
Missing	0	3	0	16
Age at time of sampling, %
<25	10	17	25	11
25–29	20	21	28	13
30–34	16	18	22	20
35–39	24	14	15	19
40 or older	30	24	10	15
Missing	0	6	0	23
Serum/plasma HIV-1 RNA within 1 year of sampling date (copies/ml), %
<10,000	7	30	19	1
10,000–49,999	36	22	8	0
50,000–99,999	13	16	3	1
100,000 or higher	35	19	2	2
Missing	9	13	68	96
Self-reported ART use before sampling, %
Yes	100	3	6	0
No	0	91	94	90
Missing	0	6	0	10
Year of sampling, %
2009	0	0	0	35
2010	0	38	0	5
2011	55	36	25	0
2012	42	17	46	0
2013	2	7	19	40
2014	0	0	11	10
Missing	0	2	0	10
HIV-1 subtype, %
A1	0	0	19	23
B	0	0	0	2
C	94	93	3	1
D	0	0	30	21
Other	0	0	0	1
Potentially recombinant[a]	6	3	33	38
<500 nt to determine subtype	0	4	15	14

[a] As identified with the COMET HIV-1 subtyping tool[52] on four partial amplicon sequences, see Materials and Methods. More refined approaches are underway to confirm recombinant sequences among potentially recombinant sequences.

Two thousand eight hundred thirty-three sequences are from 26 communities of the Rakai Community Cohort Study, Uganda.[53] Serum samples were obtained from household residents (aged 15–49 years) in three survey rounds between 2011 and 2014 in fisherfolk communities at the shores of Lake Victoria, and predominantly agrarian or trading communities inland. Participants were recruited at central community locations after a community mobilization event. Samples were sequenced regardless of viral load.

Two hundred thirty-one sequences were obtained from 25 neighboring communities in Kalungu district, Uganda, and from fisherfolk communities on the shores of Lake Victoria, Uganda, through MRC/UVRI. Plasma samples were obtained from household residents (aged 13+ years) through house-to-house census rounds in Kalungu district between 2013 and 2014, and from a subset of residents (aged 13+ years) in fisherfolk communities between 2009 and 2010.[14-16] Fifty-two sequences were from a historic sample collection of the 1980s from MRC/UVRI. Four hundred eighteen sequences were obtained from female sex workers in Kampala, Uganda, as part of the Good Health for Women Project by MRC/UVRI.[17] Women (aged 15+ years) involved in commercial sex or employed in entertainment facilities were enrolled through peers between 2009 and 2014. Samples were sequenced regardless of viral load.

Three hundred fifty-nine sequences are from the Mochudi Prevention Project, Botswana. Plasma samples were obtained from ART-naive individuals (aged 16–64 years) who tested positive during three rounds of an enhanced HIV testing and counseling campaign in households in northeastern Mochudi between 2010 and 2013.[12,13] Samples were sequenced regardless of viral load.

Finally, 92 sequences are from the Africa Health Research Institute Resistance Cohort, South Africa. Plasma samples were obtained from primary health clinic attendees who failed ART in the Hlabisa sub-district of KwaZulu-Natal. Patients (>18 years) had been on ART for at least 1 year, had two successive plasma HIV-1 RNA measurements >1,000 copies/ml, and were seen between 2011 and 2013.

NGS success rates

Table 2 and Supplementary Figure S1 characterize sequencing success rates on the first PANGEA-HIV samples. More than 80% of the HIV-1 genome from the gag to the nef gene could be determined for all samples from the Africa Health Research Institute resistance cohort, 75% of samples from Mochudi, 60% of samples from MRC/UVRI Uganda, and 22% of samples from the Rakai Community Cohort. Sequencing success rates varied considerably across the genome. Figure 1 shows the alignment of PANGEA-HIV consensus sequences, directly obtained from paired consensus and assembly reference sequences (see the Materials and Methods section). As a result of alignment uncertainty, on average 0.11 additional missing nucleotide characters were introduced in the PANGEA-HIV alignment per missing character in unaligned consensus sequences (Supplementary Fig. S2).

Table 2. Sequencing Success Rates Among the First PANGEA-HIV Consensus Sequences

	Average proportion of nonmissing nucleotide characters per sequence (relative to corresponding de novo reference sequence)
	Length in HXB2 (nt)	Africa Health Research Institute Resistance Cohort (%)	Mochudi Prevention Project (%)	Rakai Community Cohort Study (%)	MRC/UVRI Uganda (%)
Partial genome
gag	1,503	99	91	82	85
pol	2,844	100	81	37	68
env	2,571	100	80	37	66
gag (p17)-nef	8,628	99	82	46	71
Genomic region between primers[a]
gag start-2F	241	98	88	78	81
2F-1R	879	100	93	84	89
1R-3F	2,375	100	79	33	66
3F-2R	231	100	91	48	76
2R-4F	908	99	81	40	68
4F-3R	1,844	100	84	42	71
3R-nef end	1,048	99	71	30	59

[a] See Figure 1 for location of the four forward and reverse primer sets.

Factors associated with partial sequencing failure

Serum and plasma samples were processed in batches from shipment to sequencing. Twenty-one of 110 batches were significantly associated with partial sequencing failure across the four amplicons of unaligned consensus sequences, after controlling for recent viral load, sampling location, amplicon, and ART use before sampling (Supplementary Fig. S3). The 21 batches were processed consecutively up to and inclusive of RNA extraction, and contained 770 sequences from Rakai. Among the remaining sequences, serum or plasma viral load below 50,000 copies/ml within 1 year of sampling was significantly associated with partial sequencing failure across the four amplicons [adjusted odds ratio (OR) 2.47 (1.81–3.41) for viral loads within 10,000–49,999 copies/ml and 13.82 (10.34–18.79) for viral loads below 10,000 copies/ml; analysis 1 in Table 3]. After the 21 sequencing batches in Supplementary Fig S3 were excluded from analysis, we found sequencing success rates steadily decreased from amplicon 1 to amplicon 4. Sampling location remained significantly associated with partial sequencing failure after controlling for viral load, prior ART use, and differential amplicon success rates (Table 3).
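As a reading aid for the adjusted odds ratios reported here, the Python sketch below shows how an odds ratio and its interval relate to a logistic regression coefficient on the log-odds scale. It assumes symmetric Wald-type 95% intervals (z = 1.96), which the text does not state; the function names are hypothetical, and the back-calculated standard error is only an approximation derived from the rounded published values.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio and Wald-type interval from a log-odds coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coefficient_from_or(odds_ratio, lower, upper, z=1.96):
    """Approximate coefficient and standard error from a reported OR and CI.

    Assumes the interval is symmetric on the log scale (an assumption).
    """
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


if __name__ == "__main__":
    # Reported adjusted OR for viral loads of 10,000-49,999 copies/ml.
    beta, se = coefficient_from_or(2.47, 1.81, 3.41)
    print(round(beta, 3), round(se, 3))
    print(odds_ratio_with_ci(beta, se))
```

Under this assumption the round trip reproduces the reported interval to within rounding, which suggests the published interval is close to symmetric on the log-odds scale.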

Table 3. Adjusted Odds Ratios of Partial Sequencing Failure Among the First PANGEA-HIV Sequences

Sample characteristic	Adjusted odds ratio for sequencing failure (>80% missing characters) in partial amplicon sequences
	Analysis 1		Analysis 2
	Excluding sequences from 21 batches that were significantly associated with sequencing failure		Excluding sequences from 21 batches as in analysis 1, and short sequences of less than 500 nt whose subtype could not be determined
	(n = 3,125 PANGEA-HIV sequences with 12,214 partial amplicon sequences)		(n = 2,725 PANGEA-HIV sequences with 10,635 partial amplicon sequences)
	Odds ratio	95% confidence interval	Odds ratio	95% confidence interval
Serum/plasma HIV-1 RNA within 1 year of sampling date (copies/ml)
<10,000	13.82	10.34–18.79	12.81	9.03–18.69
10,000–49,999	2.47	1.81–3.41	3.6	2.5–5.32
50,000–99,999	1.02	0.69–1.5	1.36	0.87–2.14
100,000 or higher	0.02	0.01–0.02	0.01	0.01–0.02
Missing	5.76	4.34–7.79	6.1	4.33–8.84
Self-reported ART use before sampling
No	1.0		1.0
Yes	1.05	0.95–1.16	0.96	0.85–1.08
Cohort site
Mochudi	1.0		1.0
Africa Health Research Institute Resistance Cohort	0[a]	singular[a]	0[a]	singular[a]
Rakai	4.01	3.38–4.76	5.95	4.24–8.38
MRC/UVRI Historic	2.65	1.88–3.72	3.72	2.3–6.01
MRC/UVRI FSW cohort	1.69	1.38–2.07	1.75	1.21–2.53
MRC/UVRI population cohorts	1.26	1–1.59	1.55	1.04–2.31
Amplicon[b]
Amplicon 1	1.0		1.0
Amplicon 2	3.7	3.27–4.2	8.19	6.86–9.82
Amplicon 3	5.06	4.48–5.73	12.42	10.42–14.87
Amplicon 4	6.97	6.16–7.91	18.24	15.29–21.86
HIV-1 subtype[c]
A1	—	—	1.0
B	—	—	0[a]	singular[a]
C	—	—	1.18	0.87–1.59
D	—	—	1.22	1.07–1.38
Other	—	—	1.18	0.35–4.11
Potential recombinant	—	—	0.64	0.54–0.76

[a] No partial sequencing failure observed.

[b] The following genomic regions (partial amplicon sequences) in each amplicon were considered: gag start-2F in amplicon 1, 1R-3F in amplicon 2, 2R-4F in amplicon 3, 3R-nef end in amplicon 4. See Figure 1 for location of the partial amplicon sequences.

[c] HIV-1 subtype was determined with the COMET HIV-1 subtyping tool,[52] version 2.1, for each of the genomic regions 1F-1R, 3F-4F, 4F-3R, if these were determined to at least 500 nt. If all region-specific assignments agreed, corresponding sequences were classified as “A1,” “B,” “C,” “D,” “other” (pure subtype). All other sequences were classified as “potential recombinant.”

The distribution of HIV-1 subtypes varied across sampling locations, with relatively homogeneous subtype C epidemics in Botswana and South Africa,[12,29] and more diverse epidemics in Uganda where subtypes A and D circulate predominantly.[30,54] This prompted us to investigate if HIV-1 subtypes or recombinant forms could be associated with partial sequencing failure. We conducted a subanalysis on sufficiently long sequences whose subtype could be determined with the COMET HIV-1 subtyping tool version 2.1.[55] The short sequences that were excluded all represent partial sequencing failures, which led to changes in the ORs relative to the central analysis. Relative to subtype A1, sequences of subtype D were significantly associated with more frequent partial sequencing failure, although not very strongly [adjusted OR 1.22 (1.07–1.38), analysis 2 in Table 3]. By contrast, no subtype B sequence had more than 80% missing characters in the partial amplicon sequences. Sample sizes were small for sequences of subtypes other than A1, B, C, and D (Table 1). Depending on the exclusion criteria, potentially recombinant sequences were or were not significantly associated with partial sequencing failure. This indicates that more detailed analyses are required to identify recombinants among PANGEA-HIV sequences, and to evaluate their impact on sequencing success rates.

Large impact of missing characters in NGS on estimating HIV-1 phylogenies when sequences are sparsely sampled

We generated 921 phylogenies with IQ-TREE,[42,43] PhyML,[44] RAxML,[45] and FastTree[46] from simulated sequence alignments that varied in size and missing data patterns (Supplementary Table S1). On the sequence alignments with 1,600 taxa (6% sequence sampling coverage of individuals living with HIV-1 by 2020 in the simulations), increased phylogenetic error was readily visible even when the gag+pol+env sequences contained relatively few missing characters, as among the PANGEA-HIV sequences from Botswana. Figure 2 illustrates this when using PhyML; similar results were obtained with IQ-TREE, RAxML, and FastTree.

Figure 2. Correctly reconstructed clades in simulated HIV-1 phylogenies from sequence alignments of 1,600 taxa with and without missing characters. Viral phylogenies of a generalized HIV-1 epidemic in a hypothetical sub-Saharan African setting were simulated, and HIV-1 gag, pol, and env sequences were generated along this phylogeny. The sampling coverage was 6% of individuals living with HIV-1 by 2020 in the simulation, corresponding to 1,600 taxa. PhyML was used to reconstruct the simulated viral tree. (A) Parts of the simulated viral phylogeny (blue) that were correctly reconstructed in 10 out of 10 replicate runs of PhyML from the sequence alignment of gag+pol+env sequences without missing characters (data set D1, see Supplementary Table S1). (B) Parts of the same simulated viral phylogeny that were correctly reconstructed in 10 out of 10 replicate runs of PhyML from a patchy sequence alignment, obtained by copying missing characters of randomly selected PANGEA-HIV sequences from Botswana into the sequence alignment D1 (data set D2). For visualization purposes, only the first five clades of the phylogeny are shown, each corresponding to a distinct transmission chain in the simulation. Results were similar with other tree reconstruction methods, and PhyML was chosen for illustration purposes.

Figure 3 summarizes results from four error measures. We first assessed the accuracy with which clades that correspond to sampled HIV-1 transmission chains were reconstructed. When phylogenies were inferred from gag+pol+env sequences without missing characters using IQ-TREE, the average Quartet distance between true and reconstructed clades was 5.8% (meaning that 5.8% of all unrooted, labeled subtrees of four taxa of these clades were not correct). The average Quartet distance rose to 11% when phylogenies were reconstructed from sequences with missing characters as seen in PANGEA-HIV sequences from Botswana, and to 15.3% when simulated sequences had missing characters as seen in PANGEA-HIV sequences from Uganda (Fig. 3A).

Impact of missing characters in PANGEA-HIV sequences on phylogeny reconstruction when sequences are sparsely sampled. Three sequence data sets of 1,600 taxa of concatenated HIV-1 gag, pol, env genes were simulated. For each data set, missing characters in real PANGEA-HIV sequences from specific sampling locations (see x-axis) were copied into simulated sequences (data sets D1–D3, see Supplementary Table S1). Phylogenies were reconstructed in replicate with several tree reconstruction algorithms and compared to the true phylogeny. (A) Quartet distance between reconstructed and true subtrees that correspond to sampled transmission chains in the simulations. (B) Kendall-Colijn distance between reconstructed and true subtrees that correspond to sampled transmission chains in the simulations. (C) Proportion of false-positive transmission pairs among pairs of individuals that diverged by less than 1% substitutions/site in reconstructed phylogenies. (D) Mean absolute error (years) in estimated divergence times between sequences from sampled transmission pairs. Across all error measures, reconstructed phylogenies were considerably less accurate when sequences were sparsely sampled and contained missing characters as seen among PANGEA-HIV sequences from Botswana or Uganda, compared to gag+pol+env sequences without missing characters.

This trend was consistent regardless of tree distance measure (see Fig. 3B for results using the Kendall-Colijn distance), and results using other reconstruction methods were broadly comparable. Trees generated with FastTree did not have larger Quartet distances than other methods, despite significantly shorter run times of the method. We also observed similar problems in accurately reconstructing basic topological relationships as the extent of missing characters increased in simulations. Specifically, we evaluated the proportion of incorrect transmission pairs among phylogenetically very close individuals. Despite tight selection criteria (<1% substitutions per site for classifying any pair as phylogenetically close), the false-positive rate was 28% on gag+pol+env sequences without missing characters using PhyML, and rose to 41% when gag+pol+env sequences contained missing characters as seen for PANGEA-HIV sequences from Botswana (Fig. 3C). Results were broadly comparable using other reconstruction methods. The impact of missing data on falsely identified transmission pairs depended primarily on how well branch lengths for these topologically fundamental units were estimated, which was increasingly challenging and variable on patchy sequence alignments when sequence sampling was sparse. Specifically, the mean absolute error in divergence time estimates of sampled transmission pairs was 1.83 years with fully determined gag+pol+env sequences using IQ-TREE, and increased to 5.51 years with simulated gag+pol+env sequences that had missing characters as seen among PANGEA-HIV sequences from Uganda (Fig. 3D). Shorter branch lengths were typically inferred with increasing proportions of missing characters, suggesting that detection of multiple nucleotide substitutions was increasingly difficult (Supplementary Fig. S4). This led to more individuals estimated to be phylogenetically very closely related and increased false-positive rates (Supplementary Fig. S5). Overall, trees reconstructed with FastTree had longer branch lengths compared to trees reconstructed with other methods, implying that the criteria for selecting phylogenetically close pairs were implicitly tighter for trees reconstructed with FastTree. This explains why phylogenetically close pairs identified with FastTree were overall more accurate with our error measure, compared to using IQ-TREE or RAxML.
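The false-positive measure described above can be sketched as follows. This is an illustrative calculation under assumed inputs, not the study's code: pairwise patristic distances are supplied as a dictionary keyed by pairs of individuals, the true transmission pairs are known (as they are in a simulation), and the 1% substitutions/site cutoff follows the text. The function name and toy values are hypothetical.

```python
def false_positive_rate(patristic, true_pairs, threshold=0.01):
    """Proportion of phylogenetically close pairs that are not true transmission pairs.

    patristic  -- dict mapping (id_1, id_2) to patristic distance in substitutions/site
    true_pairs -- iterable of (id_1, id_2) pairs known to be transmission pairs
    threshold  -- pairs with distance below this value count as phylogenetically close
    """
    close = {frozenset(pair) for pair, dist in patristic.items() if dist < threshold}
    if not close:
        return float("nan")  # no close pairs, so the rate is undefined
    truth = {frozenset(pair) for pair in true_pairs}
    return len(close - truth) / len(close)


# Toy values, invented for illustration only.
distances = {
    ("A", "B"): 0.004,
    ("A", "C"): 0.008,
    ("B", "C"): 0.020,
    ("C", "D"): 0.009,
}
transmission_pairs = [("A", "B"), ("D", "C")]
print(false_positive_rate(distances, transmission_pairs))
```

Because the cutoff is applied to estimated branch lengths, underestimated branches (as reported with more missing characters) push additional unrelated pairs below the threshold and raise this rate.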

Irregular distribution of missing characters in NGS exacerbates tree reconstruction errors, but only when sequences are sparsely sampled

HIV-1 phylogenies have been more successfully reconstructed from partial gag or pol sequences even when taxon sampling is limited.[29,30,56] We therefore suspected that the large increases in tree reconstruction error of Figure 3 were related to the irregular, nonrandom distribution of missing data patches seen in Figure 1. To test this hypothesis, we compared tree reconstructions from increasingly patchy gag+pol+env sequences to those from partial sequences that were fully determined up to a certain genome position. Thus, in simulated alignments of partial sequences, missing characters formed a contiguous block from a certain genome position to the end of the gag+pol+env sequence. We then compared trees from patchy and partial sequences, while maintaining the overall proportion of missing nucleotides constant. For the same level of missing characters, viral trees were substantially less accurately reconstructed from gag+pol+env sequence alignments with irregularly distributed missing data than from alignments of partial sequences (Fig. 4A). Thus, the poor performance of tree reconstruction methods is attributable to an excess negative impact of missing characters when these are irregularly distributed.
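One way to see why patchy and partial sequences can differ at the same overall level of missing characters is to count the sites that two sequences have both determined. The sketch below is a toy illustration with invented positions on a 100-site sequence, not an analysis of PANGEA-HIV data: each sequence is missing 40% of its sites in both scenarios, but the missing sites either form the same block at the end of the sequence or fall in different places. The function names are hypothetical.

```python
def determined_sites(length, missing_intervals):
    """Positions not covered by any missing interval (intervals are half-open)."""
    missing = set()
    for start, end in missing_intervals:
        missing.update(range(start, end))
    return set(range(length)) - missing


def shared_determined(length, missing_a, missing_b):
    """Number of sites that are determined in both sequences."""
    sites_a = determined_sites(length, missing_a)
    sites_b = determined_sites(length, missing_b)
    return len(sites_a & sites_b)


LENGTH = 100

# Partial sequences: both are fully determined up to position 60, then missing.
block_a = [(60, 100)]
block_b = [(60, 100)]

# Patchy sequences: 40% missing each, but at different positions.
patchy_a = [(0, 20), (50, 70)]
patchy_b = [(30, 50), (80, 100)]

shared_block = shared_determined(LENGTH, block_a, block_b)
shared_patchy = shared_determined(LENGTH, patchy_a, patchy_b)
print(shared_block, shared_patchy)
```

In this toy case the two partial sequences share 60 determined sites, whereas the two patchy sequences share only 20, although each sequence individually has the same proportion of missing characters.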

Excess negative impact of irregularly distributed missing characters on HIV-1 phylogeny reconstruction. Four times 60 sequence alignments of varying size (1,600 to 9,629 sequences, shape of points) and varying missing site patterns (either patchy or allocated in a single block after a certain genome position, color of points) were simulated (data sets D1-Mxx, D4-Mxx, D5-Mxx, D6-Mxx, D1-Pyy, see Supplementary Table S1). For each alignment, the average proportion of missing characters per sequence in alignments relative to the length of the gag+pol+env genome (6,807 nt) was calculated. One phylogeny per alignment was reconstructed with RAxML. (A) We first compared Quartet distances of trees reconstructed from patchy sequence alignments of 1,600 taxa to those of trees reconstructed from partial sequence alignments of 1,600 taxa. For the same average number/average proportion of missing characters, viral trees were less accurately reconstructed when missing characters were irregularly distributed. (B) We then compared Quartet distances of trees reconstructed from patchy sequence alignments that increased in the number of viral sequences sampled. The excess error in Quartet distances associated with irregularly distributed missing characters vanished as sampling coverage approached 30% of individuals living with HIV-1 by 2020 in the simulations (∼10,000 taxa).

In addition, when a larger number of patchy sequences were available for tree reconstruction, accuracy increased and approached that of alignments of partial sequences with the same average proportion of missing characters (Fig. 4B). With a sequence sampling coverage above 30%, trees from patchy gag+pol+env sequence alignments were not significantly less accurately reconstructed than trees from partial sequence alignments with a comparable level of missing characters. This indicates that the excess negative impact of irregularly distributed missing characters in HIV-1 sequence alignments is sidestepped when sufficiently many patchy sequences are available for analysis. Below this sequence coverage threshold, error in phylogeny reconstructions from patchy sequence alignments was larger than expected from alignments of partial sequences with the same overall level of missing characters.

Alignment trimming to reduce tree reconstruction artifacts

The excess negative impact of irregularly distributed missing characters arises through a combination of direct and indirect effects, including disproportionally fewer informative sites that are shared between any two sequences, and accumulating tree reconstruction artifacts. Intriguingly, the indirect effects could potentially be mitigated by excluding alignment columns with disproportionally many missing characters (“trimming”),[21,57] at the expense of fewer shared informative sites. Sequencing success rates were highest for amplicon 1 among PANGEA-HIV sequences, prompting us to compare tree reconstructions from more complete gag genes (1,440 nt without stem loop) to tree reconstructions from more patchy gag+pol+env sequences (6,807 nt). Figure 5A shows that trees reconstructed from gag sequence alignments of 1,600 taxa with <10% missing characters were on average more accurate than trees reconstructed from gag+pol+env sequence alignments of 1,600 taxa with >40% missing characters. Thus, more accurate HIV-1 phylogenies are only expected from trimmed alignments when sequencing success rates are highly uneven. This explains why we did not reconstruct more accurate phylogenies from gag genes compared to gag+pol+env sequences when both alignments had missing characters as seen among PANGEA-HIV sequences from Botswana, nor when both alignments had missing characters as seen among PANGEA-HIV sequences from Uganda (Fig. 5B).
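The column-wise form of trimming mentioned above (excluding alignment columns with disproportionately many missing characters) can be sketched as follows. This is a minimal illustration and differs from the gene-level trimming to gag that was evaluated in this study. It assumes the alignment is a list of equal-length strings and that `?`, `-`, and `N` denote missing characters; the 50% cutoff, the function name, and the toy alignment are assumptions for illustration.

```python
MISSING = set("?-Nn")  # assumed encodings of a missing character


def trim_alignment(seqs, max_missing=0.5):
    """Drop alignment columns whose fraction of missing characters exceeds max_missing."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n = len(seqs)
    keep = [
        j
        for j, column in enumerate(zip(*seqs))
        if sum(ch in MISSING for ch in column) / n <= max_missing
    ]
    return ["".join(s[j] for j in keep) for s in seqs]


# Toy alignment: missing characters accumulate toward the end of the sequences.
alignment = [
    "ACGT??",
    "AC?T??",
    "ACGTA?",
    "A?GTAC",
]
print(trim_alignment(alignment))                    # drops only the last column
print(trim_alignment(alignment, max_missing=0.25))  # drops the last two columns
```

A stricter cutoff removes more missing characters but also more determined sites, which is the trade-off against fewer shared informative sites described in the text.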

Alignment trimming to reduce tree reconstruction artifacts. (A) Sixty alignments of 1,600 gag+pol+env sequences (6,807 nt) with increasing proportions of missing characters were simulated. Missing site patterns were copied at random from PANGEA-HIV sequences (data sets D1-Mxx, see Supplementary Table S1). Thirty alignments were trimmed to the gag gene. One phylogeny per alignment was reconstructed with RAxML. We compared Quartet distances of trees reconstructed from patchy gag+pol+env sequences (gray) to those of patchy gag sequences (orange). It is possible to reconstruct more accurate phylogenies from shorter gag sequences, but only when the trimmed alignment harbors substantially fewer missing characters than the longer original alignment and sequence sampling coverage is low (6%). The proportion of missing characters in gag and gag+pol+env sequences among PANGEA-HIV sequences from Botswana and Uganda is indicated with triangles and diamonds. (B) The three sequence data sets of 1,600 gappy gag+pol+env sequences of Figure 2 were trimmed to the gag gene. Ten phylogenies were reconstructed with IQ-TREE, PhyML, and RAxML per alignment, and results are shown for IQ-TREE and PhyML. Tree reconstructions from gag genes that harbored missing characters as seen in PANGEA-HIV sequences from Botswana or Uganda were not more accurate than those from patchy gag+pol+env sequences, regardless of distance measure and tree reconstruction method. The differences in missing character patterns between the trimmed and original alignments were not large enough to result in more accurate tree reconstructions with the trimmed alignment.

Discussion

NGS data of HIV-1 viruses offer unprecedented opportunities for studying disease progression,[58,59] evolution of resistance to antiretrovirals,[60,61] as well as aspects of transmission dynamics.[51,62,63] Obtaining NGS data from serum or plasma samples is fraught with difficulties, owing, in part, to the extreme genetic diversity of the virus, large variation in copy numbers in samples, as well as sample degradation.[64] PANGEA-HIV adopted a sequencing protocol that combined automated RNA extraction with amplification-dependent next-generation sequencing under the Gall protocol.[10] With this approach, consensus sequences of the HIV-1 genome could be generated from a diverse set of samples in high throughput. Sequencing success rates varied across the genome and were particularly low on samples from the Rakai Community Cohort Study, Uganda.

Our phylogenetic simulation study indicates that missing nucleotide characters in PANGEA-HIV sequences have limited impact on phylogeny reconstruction when a sufficiently high proportion of viral sequences from epidemics are sampled. Specifically, the particular missing data patterns in PANGEA-HIV sequences did not have a significant excess negative impact on reconstructing phylogenies of simulated HIV-1 transmission chains when sequence sampling coverage was at least 30% (of individuals living with HIV-1 by the end of the simulation, Fig. 4). Above this threshold, phylogenetic inference error from alignments with missing characters at differing positions did not increase faster than on alignments with missing characters at the same positions, and overall relatively slowly. The Mochudi Prevention Project, the Rakai Community Cohort Study, and other sites have collected samples at higher coverage within the surveillance sites.[30,56] At larger geographical areas (e.g., the regions that encompass individual surveillance sites), current sequence sampling coverage remains below 30% of all individuals living with HIV-1.
Large data sets of partial HIV-1 sequences are now available for Southern and Central Africa, including historical samples from partnering cohort sites. These sequence data sets should be used to increase taxon sampling in future analyses, and further mitigate the impact that missing characters could have on phylogeny reconstruction.[29,69]

Phylogenetic analysis of real HIV-1 sequences is more challenging than that of the simulated sequences in this study. Simulations were generated under standard nucleotide evolution models and did not account for recombination between viral strains. Failure to appropriately account for recombination[65] as well as differences in relative nucleotide substitution rates,[66] evolutionary rates,[67] and nucleotide composition bias across the genome[68] can substantially increase systematic bias and lead to incorrect phylogenies that are highly supported in bootstrap analyses.[25,57] It is not clear to what extent these factors could exacerbate the impact of irregularly distributed missing data on phylogeny reconstruction and subsequent molecular epidemiologic investigations. Conversely, our accuracy measures were not restricted to phylogenetically credible clades due to computational limitations in generating large numbers of replicate trees. Phylogenetically credible clades (that occur in a high proportion of replicate trees) could be considerably more robust to the impact of missing sequence data than our error analyses across all clades suggest.

Several previous studies support our finding that irregularly distributed missing data patterns have a substantial excess negative impact on HIV-1 phylogeny reconstruction below a certain sampling level.
Missing nucleotide characters can exacerbate long branch attraction artifacts,[33,34] while increased taxon sampling reduces systematic errors in tree reconstruction.[31,70] In settings with sparse sequence sampling, our findings also support previous arguments against ad hoc alignment trimming[71]: we found that missing characters must be highly unevenly distributed across the genome for this trimming to have a net positive impact on phylogeny reconstruction. We suggest evaluating the impact of alignment trimming in simulations before application to real data. Simulation routines for this purpose are available as part of the PANGEA-HIV simulation tool (https://github.com/olli0601/PANGEA.HIV.sim).

Even when using NGS without missing characters, a large proportion (25%–35%) of phylogenetically very close pairs of individuals (patristic distance <1% substitutions/site) were not transmission pairs in our simulations. The fact that complete NGS cannot confirm HIV-1 transmission events is primarily a consequence of sparse sampling in our simulations and also of viral evolution within hosts, which can lead to incongruencies between phylogenies and transmission trees regardless of sampling.[72] Further work is needed to characterize false-positive rates when sequencing is targeted at subpopulations, for example, at young women and their sexual partners.

Considering the observed variation in sequencing success rates, there is clear scope for improving the PANGEA-HIV sequencing protocol. Our retrospective analyses indicate that in addition to low viral load, amplicon order and sampling locations were also strongly associated with partial sequencing failure. This could be due to several factors.
One study found that manual extraction of viral RNA led to improved recovery of near full-length viral sequences compared to automated RNA extraction.[73] Considering differential amplicon sequencing success rates, modified protocols[74] or amplification-independent sequencing techniques[75-77] could also potentially improve NGS success rates. Data from the first PANGEA sequences were not sufficient to robustly evaluate the potential impact of nucleotide mutations at specific primer sites: at the 2R primer, we found larger proportions of sequences with a mutation toward the 3′-end, although the percent difference was not significant between sequences with >80% and <60% missing characters in 1R-3F, and mutations at the 2R primer were also not independently associated with partial sequencing failure in a subanalysis. A systematic comparison of NGS protocols on the same specimen is needed to identify more robust NGS approaches.

This study provides evidence that the missing data patterns in PANGEA-HIV sequences do not substantially affect phylogeny reconstruction when sufficiently many viral sequences are sampled. Current sequence sampling levels of regional HIV-1 epidemics in sub-Saharan Africa remain considerably below the sampling coverage threshold of ∼30% that was identified on simulated data. Further efforts to develop more robust NGS protocols would be highly beneficial for using NGS data to characterize patterns of HIV-1 transmission and HIV-1 prevention opportunities.
