Literature DB >> 30040081

Selection analyses of paired HIV-1 gag and gp41 sequences obtained before and after antiretroviral therapy.

Philip L Tzou1, Soo-Yon Rhee1, Sergei L Kosakovsky Pond2, Justen Manasa1, Robert W Shafer1.   

Abstract

Most HIV-1-infected individuals with virological failure on a pharmacologically-boosted protease inhibitor (PI) regimen do not develop PI-resistance protease mutations. One proposed explanation is that HIV-1 gag or gp41 cytoplasmic domain mutations might also reduce PI susceptibility. In a recent study of paired gag and gp41 sequences from individuals with virological failure on a PI regimen, we did not identify PI-selected mutations and concluded that if such mutations existed, larger numbers of paired sequences from multiple studies would be needed for their identification. In this study, we generated site-specific amino acid profiles using gag and gp41 published sequences from 5,338 and 4,242 ART-naïve individuals, respectively, to assist researchers identify unusual mutations arising during therapy and to provide scripts for performing established and novel maximal likelihood estimates of dN/dS substitution rates in paired sequences. The pipelines used to generate the curated sequences, amino acid profiles, and dN/dS analyses will facilitate the application of consistent methods to paired gag and gp41 sequence datasets and expedite the identification of potential sites under PI-selection pressure.

Entities:  

Mesh:

Substances:

Year:  2018        PMID: 30040081      PMCID: PMC6057438          DOI: 10.1038/sdata.2018.147

Source DB:  PubMed          Journal:  Sci Data        ISSN: 2052-4463            Impact factor:   6.444


Background & Summary

HIV-1 protease mutations responsible for protease inhibitor (PI) resistance are now uncommon in patients with virological failure on an initial PI-containing regimen, particularly regimens including pharmacologically-boosted lopinavir, atazanavir, or darunavir[1-3] One explanation for the infrequent occurrence of PI-resistance mutations in protease is that mutations outside of protease might reduce PI susceptibility even in the absence of primary PI resistance protease mutations. Indeed, many studies have reported that gag cleavage and non-cleavage site mutations may compensate for the reduced fitness associated with primary PI-resistance protease mutations[4-11] and several studies of pseudotyped viruses reported that genetic loci in matrix (MA) gag[12,13] and in the gp41 cytoplasmic domain (CD)[14] can reduce PI susceptibility in the absence of PI-resistance protease mutations. To identify gag and gp41 mutations under selective PI pressure, we recently sequenced gag and/or gp41 in 61 individuals with virological failure on a PI or a control nonnucleoside RT inhibitor (NNRTI) containing regimen[15]. We quantified nonsynonymous and synonymous mutations in both genes and identified sites exhibiting signal for directional or diversifying selection. We also used published gag and gp41 polymorphism data to highlight mutations displaying a high selection index, defined as changing from a conserved or common amino acid variant to an uncommon amino acid variant. The rationale for this latter analysis is that most drug-resistance mutations in established targets of antiviral therapy including protease, RT, integrase, and the extracellular domain of gp41 are amino acid variants at sites that are non-polymorphic in the absence of selective drug pressure. In our previous study, many amino acid mutations were found to emerge in gag and in gp41-CD in both the PI- and NNRTI-treated groups. However, in neither gene, were there discernible differences between the two groups in overall numbers of mutations, mutations displaying evidence of diversifying or directional selection, or mutations with a high selection index. Based on this previous study, we concluded that if gag and/or gp41 encoded PI-resistance mutations, they might not be confined to repeated mutations at a few sites, and that multiple studies with large numbers of paired sequences from individuals with virological failure on a PI-containing regimen would need to be pooled to identify such mutations. To facilitate such studies, we provide here a detailed description of the methods used to generate the datasets and analytic results used in our previous study. The selection index analyses, in particular, require additional exposition because they used data derived from the curation and annotation of gag and gp41 sequences from more than 500 GenBank submission sets and/or peer-reviewed publications to determine the polymorphism rates at each gag and gp41 position. The annotation of these references according to the treatment status of the individuals in the references is included as part of this manuscript’s Data Citation. The selection index analyses, also required performing quality control analyses of each gag and gp41 sequence and determining the prevalence of each mutation at each position. Finally, the HyPhy scripts described in this manuscript make it possible to exactly replicate each of the maximum likelihood estimates of the ratio of non-synonymous and synonymous substitution rates presented in our original manuscript.

Methods

Gag and gp41 sequences of paired samples obtained before and after PI or NNRTI therapy

The sequences described in our previous manuscript were obtained from HIV-1-infected individuals in Northern California who had genotypic resistance tests performed between April 2001 and June 2013 and from participants in the ACTG A5202 clinical trial[3,15]. The sequences were derived from plasma virus samples obtained before and after therapy from 41 previously PI-naïve subjects who had received a ritonavir-boosted PI-containing regimen or from 20 control subjects who received an NNRTI-containing regimen. Among the 41 PI-treated subjects, paired sequences before and after PI treatment were available for both gag and gp41 in 11 individuals, for gag alone in 13 individuals, and for gp41 alone in 17 individuals. Among 20 NNRTI control subjects, paired sequences before and after NNRTI treatment were available for both gag and gp41 in 13 individuals, for gag alone in three individuals, and gp41 alone in four individuals. Table 1 (available online only) contains the GenBank accessions for each of the paired protease, gag and gp41 sequences from the 41 PI- and 20 NNRTI-treated individuals. This study was approved by the Institutional Review Boards (IRBs) of Stanford University, KPNC, and the NIH ACTG and all study methods were performed in accordance with the guidelines of these IRBs. Informed consent was required for participation in the ACTG 5202 trial. The Stanford University and KPNC IRBs provided a waiver of informed consent for the study of remnant KPNC samples that were unlinked to individual protected health information.
Table 1

GenBank accessions for each of the paired protease, gag and gp41 sequences

GroupSubjectProtease (PR)
 Gag
Gp41
  BaselineFollow-upBaselineFollow-upBaselineFollow-up
Abbreviations: PI: Individuals who received a ritonavir-boosted protease inhibitor containing regimen. NNRTI: Individuals who received a nonnucleoside RT inhibitor (NNRTI) containing regimen. NA: not applicable as a sequence was not performed       
PI8006KT340026KT340027NANAKT339970KT339971
PI14728KT340028KT340029NANAKT339972KT339973
PI14736GQ210720GQ213759NANAKT339974KT339975
PI18380AY798294GQ213798NANAKT339976KT339977
PI24950KT340030KT340031NANAKT339978KT339979
PI25082GQ210971GQ212432NANAKT339980KT339981
PI26307KT340032KT340033KT339954KT339955KT339982KT339983
PI38099KT340034KT340035NANAKT339985KT339984
PI39143KT340036KT340037KT339958KT339959KT339986KT339987
PI39270MG171048MG171049NANAKY579942KY579943
PI42080KT340039KT340040KT339960KT339961KT339988KT339989
PI42654KT340041KT340042NANAKT339990KT339991
PI56120KT340047KT340048KT339966KT339967KT339994KT339995
PI56141KT340049KT340050NANAKT339996KT339997
PI57479KT340051KT340052NANAKT339998KT339999
PI118724MG171059MG171060NANANANA
PI118745MG171063MG171064KY579860KY579861KY579932KY579933
PI118747MG171065MG171066NANANANA
PI118754MG171067MG171068NANANANA
PI118761MG171069MG171070KY579878KY579879NANA
PI118770MG171071MG171072NANANANA
PI118792MG171073MG171074KY579870KY579871NANA
PI118811MG171075MG171076KY579872KY579873NANA
PI118820MG171077MG171078NANANANA
PI118823MG171079MG171080KY579864KY579865NANA
PI118827MG171081MG171082KY579846KY579847NANA
PI118828MG171083MG171084KY579854KY579855NANA
PI118840MG171085MG171086NANAKY579940KY579941
PI118846MG171087MG171088KY579880KY579881KY579938KY579939
PI118849MG171089MG171090KY579884KY579885NANA
PI118853MG171091MG171092KY579866KY579867NANA
PI118855MG171093MG171094NANANANA
PI118856MG171095MG171096NANAKY579944KY579945
PI118860MG171097MG171098KY579848KY579849KY579920KY579921
PI118865MG171099MG171100NANANANA
PI118870MG171101MG171102NANANANA
PI118886MG171103MG171104KY579868KY579869NANA
PI118899MG171105MG171106NANAKY579924KY579925
PI118903MG171107MG171108KY579874KY579875KY579936KY579937
PI118905MG171109MG171110NANANANA
PI118910MG171111MG171112NANAKY579922KY579923
PI118918MG171113MG171114NANANANA
PI118925MG171115MG171116KY579876KY579877NANA
PI118930MG171117MG171118NANANANA
PI118934MG171119MG171120NANANANA
PI118935MG171121MG171122KY579850KY579851KY579928KY579929
PI118942MG171123MG171124NANANANA
PI118951MG171125MG171126NANAKY579926KY579927
PI118954MG171127MG171128KY579858KY579859NANA
PI118956MG171129MG171130KY579882KY579883NANA
PI118962MG171131MG171132NANANANA
PI118965MG171133MG171134NANAKY579946KY579947
PI118972MG171135MG171136KY579862KY579863KY579934KY579935
PI118973MG171137MG171138KY579852KY579853KY579930KY579931
PI118982MG171139MG171140NANANANA
PI118985MG171141MG171142NANANANA
PI118986MG171143MG171144KY579856KY579857NANA
NNRTI8349GQ206503MG171044NANAKY579904KY579905
NNRTI9918GQ206632MG171045KY579838KY579839KY579906KY579907
NNRTI25036GQ210904MG171046KY579840KY579841KY579908KY579909
NNRTI35596GQ212974KY190132KY579842KY579843KY579910KY579911
NNRTI37879MG171047KY190134KY579844KY579845KY579912KY579913
NNRTI42036MG171050KY190141KY579814KY579815KY579886KY579887
NNRTI42183MG171051MG171052KY579816KY579817KY579888KY579889
NNRTI55928MG171053MG171054KY579818KY579819KY579890KY579891
NNRTI57448MG171055KY190153NANAKY579914KY579915
NNRTI61483MG171056MG171057NANAKY579916KY579917
NNRTI61631MG171058KY190163NANAKY579918KY579919
NNRTI252392MG171061MG171062KY579832KY579833KY579898KY579899
NNRTI264159KY787124KY787125KY579836KY579837KY579902KY579903
NNRTI108149KY787112KY787113KY579824KY579825NANA
NNRTI232768KY787118KY787119KY579830KY579831NANA
NNRTI122034KY787114KY787115KY579826KY579827KY579894KY579895
NNRTI214046KY787116KY787117KY579828KY579829KY579896KY579897
NNRTI21890KY787108KY787109KY579820KY579821KY579892KY579893
NNRTI253540KY787122KY787123KY579834KY579835KY579900KY579901
NNRTI44969KY787110KY787111KY579822KY579823NANA
The plasma samples were processed and underwent direct PCR Sanger sequencing as described in our previous manuscript[15]. Each gag, and gp41 sequence was aligned using the Translation Align option with the ClustalW algorithm for multiple sequence alignment using the Geneious R11 software[16]. The parameters used were the default values (cost matrix: BLOSUM; gap open cost: 10; gap extend cost: 0.1). The multiple sequence alignment was then manually edited using the subtype B consensus sequence[17]. Manual edits were required for the gag but not the gp41 sequence because gag contained many more indels than gp41. The manual edits were primarily the shifting of indels to be consistent with the remaining sequences and the subtype B consensus. Nucleotide insertions were then stripped from the sequence prior to subsequent analyses. The original and aligned sets of FASTA sequences for gag and gp41 are in the files gagOriginal.fas, gagAligned.fas, gp41Original.fas, and gp41Aligned.fas (Data Citation 1). The insertions in gag and gp41 are in file insertions.csv. Neighbor-joining trees for each gene, created by HyPhy version 2.3.2 using the TN93 distance, confirmed that each pair of sequences clustered by individual. All trees (in Newick format) are included in Data Citation 1. Data Citation 1 also contains the initial and edited gag gene alignments in data/gag.geneious (Data Citation 1).

Pairwise sequence comparisons and dN/dS analyses of gag and gp41

Tables 2 and 3 summarize the median proportions of pairwise nucleotide and amino acid changes, pairwise dN/dS ratios, and median proportions of IUPAC ambiguities in the pre- and post-treatment gag and gp41 sequences, respectively. Pairwise dN/dS ratio estimation was implemented using a custom HyPhy v2.3.2 script scripts/pairwise-estimator-dnds.bf (Data Citation 1); this script reads in a collection of aligned coding sequences, splits them into host pairs (based on patient ID encoded in the FASTA sequence name), and estimates dN/dS by maximum likelihood using the MG94xREV codon substitution model[18]. The script also optionally restricts the analysis to a contiguous region of the alignment, for example to focus on a specific protein domain.
Table 2

Pairwise nucleotide and amino acid changes, dN/dS ratios, and percent ambiguities in HIV-1 gag before and after protease inhibitor (PI) and nonnucleoside RT inhibitor (NNRTI) therapy.

 PI (n=24)NNRTI (n=16)Pa
Median % pairwise NA changes (interquartile range)
   
 Complete gene1.1 (0.8–1.6)0.6 (0.5–1.3)0.1
 Matrix0.4 (0.2–0.6)0.2 (0.1–0.6)0.1
 C-terminal region0.3 (0.2–0.6)0.3 (0.2–0.3)0.6
Median % pairwise AA changes (interquartile range)
   
 Complete gene1.4 (0.8–1.8)0.8 (0.6–1.2)0.1
 Matrix0.6 (0.4–0.6)0.4 (0.4–0.9)1.0
 C-terminal region0.7 (0.2–0.9)0.4 (0.4–0.8)0.6
Median pairwise dN/dS ratio (interquartile range)b
   
 Complete gene0.21 (0.08–0.48)0.35 (0.15–0.57)0.2
 Matrix0.24 (0.00–0.68)0.79 (0.16–∞)0.1
 C-terminal region0.46 (0.19–1.14)0.50 (0.24–1.05)0.9
Median % IUPAC Ambiguities (interquartile range)c
   
 Baseline (complete gene)0.4c (0.1–0.8)0.1 (0.1–0.1)0.1
 Follow-up (complete gene)0.0c (0.0–0.6)0.0 (0.0–0.1)0.2

aMann-Whitney U Test.

bRatio of nonsynonymous to synonymous mutations.

cThe proportion of ambiguities (i.e., mixtures of more than one base at the same position) was significantly higher at baseline than at follow-up in the PI group (p=0.02; Mann-Whitney U Test).

Table 3

Pairwise nucleotide and amino acid changes, dN/dS ratios, and percent ambiguities in HIV-1 gp41 before and after protease inhibitor (PI) and nonnucleoside inhibitor (NNRTI) therapy.

 PI (n=28)NNRTI (n=17)Pa
Median % pairwise NA changes (interquartile range)
   
 Complete gene1.1 (0.5–1.5)0.8 (0.6–1.4)0.8
 Cytoplasmic domain0.4 (0.3–0.9)0.6 (0.2–0.7)1.0
Median % pairwise AA changes (interquartile range)
   
 Complete gene1.4 (0.6–2.2)1.3 (0.9–2.2)0.7
 Cytoplasmic domain0.9 (0.3–1.4)0.9 (0.6–1.1)0.9
Median pairwise dN/dS ratio (interquartile range)b
   
 Complete gene0.42 (0.16–0.74)0.44 (0.29–0.96)0.4
 Cytoplasmic domain0.59 (0.17–1.99)0.47 (0.32–0.93)0.9
Median % IUPAC Ambiguities (interquartile range)
   
 Baseline (complete gene)0.1 (0.0–0.2)0.1 (0.0–0.1)0.8
 Follow-up (complete gene)0.0 (0.0–0.1)0.0 (0.0–0.1)0.4

aMann-Whitney U Test.

bRatio of nonsynonymous to synonymous mutations.

As noted in the previous manuscript, there was no difference in median proportions of pairwise nucleotide and amino acid changes and pairwise dN/dS ratios in gag and gp41 between the PI- and NNRTI-treated patients. Also, as previously noted, there was a significant reduction in the median proportion of ambiguous nucleotides in gag between baseline and follow-up among the PI-treated patients: 0.4% (IQR:0.1% to 0.8%) vs 0.0% (IQR:0% to 0.6%; p=0.02 Mann-Whitney U test). Table 4 lists each of the amino acid changes that occurred at gag cleavage sites.
Table 4

Amino acid changes occurring within protease cleavage sites in gag during protease inhibitor (PI) and nonnucleoside RT inhibitor (NNRTI) therapy.

Cleavage sitePositionBaseline AAsFollow-up AAs# patients
PI group
    
SP1 / Nucleocapsid (NC)373IPAAM|MQRGNMPAAM|MQRGN1
 373PTAIM|MQKGNSTAIM|MQKGN1
 373, 375STAIM|MQKGNPTTIM|MQKGN1
 374, 380PXAIM|MQKGNaPPAIM|MQRGN1
 374, 381SAAMM|MQRSNSTAMM|MQRGN1
 374SXAIM|MQKGNaSTAIM|MQKGN1
 375, 378SANIM|MQRGNSAAIM|IQRGN1
 376, 380SATIM|MQKGNSATTM|MQRGN1
 378SASVM|MQRGNSASVM|IQRGN1
Nucleocapsid (NC) / SP2429, 436EKQAN|FLGRIERQAN|FLGKI1
 436ERQAN|FLGKLERQAN|FLGRL1
 436ERQAN|FLGXIaERQAN|FLGKI1
 437ERQAN|FLGKXaERQAN|FLGKL1
SP2 / p6gag453RPGNF|LQSRLRPGNF|LQSRP2
p6pol / Proteaseb485, 486, 487, 490VXLXF|PXITLaVSFSF|PQITL1
 486VSVNF|PQITLVSLNF|PQITL1
NNRTI group
    
 Matrix (MA) / Capsid (CA)132VSHNY|PIVQNVSHNF|PIVQN1
 SP1 / Nucleocapsid (NC)374-STAM|MQRGN-TTAM|MQRGN1
 374PTTIM|MQRGNPATIM|MQRGN1
 375PAAIM|MQRGNPATIM|MQRGN1
 375SAAIM|MQKGNSANIM|MQKGN1
 375STAIM|MQRGNSTTIM|MQRGN1
 Nucleocapsid (NC) / SP2429EKQAN|FLGRLERQAN|FLGRL1
 SP2 / p6gag453RPGNF|PQSRLRPGNF|PQSRP1
 p6pol / Proteaseb487VSFSF|PQITLVSFNF|PQITL1
 488VSLDL|PQITLVSLDF|PQITL1

aX stands for mixtures consisted of at least two amino acids which were not subtype B consensus.

bA -1 frameshift was applied to the pol reading frame relative to the gag reading frame.

Positional dN/dS selection analyses of gag and gp41

We ran the fixed effects likelihood (FEL) method, as implemented in HyPhy v2.3.2, to detect codon sites exhibiting diversifying selection in gag and gp41 on the post-treatment branches using a p-value of 0.05 (refs 18,19). This analysis requires annotated phylogenetic trees (e.g., internalFiles/phylo/gagNNRTIs.tre (Data Citation 1)), i.e. trees where post-treatment branches are marked for testing. A convenient tool for annotating trees can be found at http://phylotree.hyphy.org. We also used Hyphy v2.3.2 to fit a model of episodic directional selection (MEDS) to the post-treatment branches pressure, also using a p-value of 0.05 (ref. 20). Table 5 shows the gag positions with evidence of diversifying selection and the gag mutations with evidence of directional selection within the PI and NNRTI-treatment groups. Table 6 shows the gp41 positions with evidence of diversifying selection and evidence of directional selection within the PI and NNRTI-treatment groups. Both of these analyses identify candidate positions and mutations that are most likely to be under selective drug pressure. However, as is the case with any statistical testing procedures, it is possible that some of the positions and/or mutations are misclassified as either false positives or false negatives. Files containing shell scripts are provided to enable users to repeat these selection analyses with the same set of parameters that we used.
Table 5

Amino acid positions with evidence of diversifying selection and mutations with evidence of directional selection in HIV-1 gag within the protease inhibitor (PI) and nonnucleoside RT inhibitor (NNRTI)-treatment groups.

Positions with Evidence of Diversifying Selection (FEL)a
 
 PI67 (0.04), 115 (0.05), 223 (0.008), 468 (0.03), 469 (0.008), 474 (0.03)
 NNRTI54 (0.04), 69 (0.03), 173 (0.04)
Mutations with Evidence of Directional Selection (MEDS)b
 
 PIK59M (1, p=0.000), Q219H (3, p=0.000)c, F370Y (1, p=0.000), T371N (1, p=0.001)
 NNRTIY79F (2, p=0.000), K110M (1, p=0.001), A371T (1, p=0.001), N371T (1, p=0.001)

aParentheses contain p-values.

bParentheses contain number of individuals and p-values.

cIn one individual, there was a change from H219→Q.

Table 6

Amino acid positions with evidence of diversifying selection and the mutations with evidence of directional selection in HIV-1 gp41 within the protease inhibitor (PI) and nonnucleoside RT inhibitor (NNRTI)-treatment groups.

Positions with Evidence of Diversifying Selection (FEL)a
 
 PI55 (0.02), 101 (0.01), 273 (0.02)
 NNRTI24 (0.009), 187 (0.04), 310 (0.04)
Mutations with Evidence of Directional Selection (MEDS)b
 
 PIT307I (3, p=0.001), I325F (1, p=0.001), L325F (1, p=0.001)
 NNRTINone

aParentheses contain p-values.

bParentheses contain number of individuals and p-values.

Collection of previously published gag sequences from ARV-naïve individuals

We downloaded the complete set of 7,550 one-per-person aligned complete gag sequences from the Los Alamos National Laboratories (LANL) HIV Sequence Database[21]. We filtered 565 sequences that contained either large deletions or missing nucleotides (n=238), more than 3 frameshift mutations (n=281), or 3 or more signature APOBEC mutations (n=46) defined as mutations at highly conserved positions that were likely to occur in sequences containing stop codons and that occurred in an appropriate dinucleotide context: GG→AG for APOBEC3G GA→AA for APOBEC3F (ref. 22). Applying the Local FDR Poisson distribution using the R LocFDRPois package to our data, we found that presence of ≥3 signature APOBEC mutations was associated with a 0.99 likelihood of a sequence having undergone APOBEC-mediated G to A hypermutation[23]. Table 7 lists the 45 signature APOBEC mutations that we identified and Fig. 1 shows the distribution of the number of signature APOBEC mutations per gag sequence.
Table 7

HIV-1 gag signature APOBEC mutationsa.

PositionConsensus AAbConsensus %bSignature mutationcProportion occurring in sequence with a stop codon# Sequences with mutation
1M98.9I8936
16W98.7*10031
24G99.2E5317
25G99.2R717
36W99.6*10015
56G99.9R605
71G99.5R754
99E99.3K676
140G99.8E1001
140G99.8R8211
155W99.5*10031
192G99.8E605
192G99.8R9010
212W99.4*10033
214R99.8K5812
221G99.9R789
229R99.6K6718
232R99.1K6025
233G99.8R1005
249W99.5*10022
265W99.6*10025
269G99.7R8015
284D99.5N1003
288G99.6R8824
294R99.7K6715
298D99.8N673
299R99.8Q1005
305R99.8K676
316W99.5*10031
338G99.7R5319
352G99.8R717
354G99.8R7010
355G99.9R605
365E99.8K1002
396G100S1002
399G99.7R7117
414W99.3*10031
417G99.4R7124
420G99.7E676
420G99.7R628
435G99.3R7229
438W99.5*1004
443G98.8R6010
446G99.4E6010
446G99.4R7621

aMutations at highly conserved positions that are likely to occur in sequences with stop codons and that occur in an appropriate dinucleotide context: GG→AG or GA→AA.

bAmino acid present in >97.5% of Group M HIV-1 sequences.

cMutations strongly consistent with APOBEC-mediated G-to-A hypermutation. “*”: stop codon.

Figure 1

Distribution of gag APOBEC signature mutations.

Distribution of the number of APOBEC signature gag mutations in the complete set of 7,031 one-per-person, quality-control filtered (i.e. sequences with large numbers of missing nucleotides and frame shifts were excluded) aligned complete gag sequences downloaded from the LANL HIV Sequence Database. The 46 sequences containing ≥3 APOBEC signature mutations were considered to be at high risk of having been subject to APOBEC-mediated G-to-A hypermutation and were excluded from our amino acid prevalence calculations.

We then used Batch Entrez to submit each of the Accession IDs to GenBank[24] and parsed the XML results to aggregate sequences into GenBank submission sets, henceforth referred to as studies, sharing either the same PubMed ID or the same Title and Author List fields. We reviewed the 264 studies reporting three or more individuals. Of these studies, 164 (62.1%) comprised solely ART-naïve individuals, 75 (28.4%) comprised individuals whose treatment status was unknown, and 25 (9.5%) comprised individuals who were ART-experienced. A summary of these 264 studies is provided in gagStudies.csv (Data Citation 1). We then determined the proportion of each amino acid at each position in gag for the complete set of 5,365 one-per-person group M ART-naïve sequences as well as for the four LANL-designated subtypes (A, B, C, and CRF01_AE) for which at least 100 sequences were present. Site-specific amino acids present in 0.1% or fewer sequences were considered unusual. Fig. 2 shows that the numbers of gag sequences according to the number of unusual mutations per sequence monotonically decreases until n=10 unusual mutations. Therefore, we excluded sequences containing ≥11 unusual mutations since these 27 sequences were considered to be at high risk of poor sequence quality. We then recalculated the proportion of each amino acid at each position for the remaining 5,338 sequences. The original and aligned sets of FASTA sequences for the 5,338 one-per-person filtered sequences from these studies are in gagNaiveOriginal.fas and gagNaiveAligned.fas (Data Citation 1). The header for each sequence contains the GenBank accession number and the LANL-designated subtype. The file gagAAPrevalence.csv (Data Citation 1) lists the proportion of each amino acid at each gag position. In this file insertions, deletions, and mixtures are indicated by “ins”, “del”, and “X”, respectively. Figure 3 displays the distribution of amino acids at each of the 500 gag positions in the one-per-person group M HIV-1 gag sequences from ARV-naïve individuals.
Figure 2

Distribution of unusual gag mutations.

Distribution of the number of unusual gag amino acids in the 5,365 one-per-person, quality-control and APOBEC-filtered aligned complete gag sequences from PI-naïve individuals. Unusual amino acids were defined as those occurring in ≤0.1% of sequences. The 27 sequences containing ≥11 unusual amino acids were removed from the final PI-naïve gag amino acid profile.

Figure 3

Distribution of HIV-1 group M gag amino acid variants in sequences from PI-naïve individuals.

Amino acid variants occurring in 5,338 one-per-person sequences from PI-naïve individuals. Amino acids occurring in ≥50% of sequences are shown in bold black; those occurring in 10 to 49% of sequences are shown in red; and those occurring in 1 to 9% of sequences are shown in blue. Positions at which insertions or deletions have been reported in ≥1% of sequences are indicated by dots. The complete summary of all amino acid variants in group M sequences and for the most common subtypes can be found in the file naiveStudies/gagAAPrevalence.csv (Data Citation 1).

Collection of previously published gp41 sequences from ARV-naïve individuals

We downloaded the complete set of 7,489 one-per-person aligned complete gp41 sequences from the LANL HIV Sequence Database[21]. We filtered 453 sequences that contained either large deletions or missing nucleotides (n=234), more than 3 frameshift mutations (n=89), or 3 or more signature APOBEC mutations (n=130) defined as mutations at highly conserved positions that were likely to occur in sequences containing stop codons and that occurred in an appropriate dinucleotide context[22]. Applying the Local FDR Poisson distribution using the R LocFDRPois package to our data, we found that presence of ≥3 signature APOBEC mutations was associated with 0.97 likelihood of a sequence having undergone APOBEC-mediated G to A hypermutation[23]. Table 8 lists the 47 signature APOBEC mutations and Fig. 4 shows the distribution of the number of signature APOBEC mutations per sequence.
Table 8

HIV-1 gp41 signature APOBEC mutationsa.

PositionConsensus AAbConsensus %bSignature mutationcProportion occurring in sequence with a stop codon# Sequences with mutation
5G98.4R8739
10G98.3R9069
16G99.2E6213
16G99.2R8335
19M98.6I9395
36G99R1005
36G99S9344
60W98.5*100103
61G99.4S836
68R99.5K9225
73E99.4K6932
78D99.3N8318
83G98.5E717
83G98.5R8658
85W98.4*100100
86G99.7R1001
86G99.7S9315
89G98.1E6520
89G98.1R9170
99W98.4*100102
103W98.6*10093
112W98.3*100107
117W98.5*10099
120W98.2*100117
146E99.2K5540
155W98.7*10085
159W98.3*10099
161W98.6*10087
167W99.3*10041
169W98.2*10066
179G97.6E577
179G97.6R9654
180G98.2R1002
183G98.4R6916
183G98.4S8360
200G98.7E6015
200G98.7R9069
227G98.8E7919
227G98.8R9048
240G98.1E7512
240G98.1R8272
246W98.2*100110
248D99.7N8211
269R98.3K5753
292W97.8*100108
339G99.1S6534
341E99K5553

aMutations at highly conserved positions that are likely to occur in sequences with stop codons and that occur in an appropriate dinucleotide context: GG→AG or GA→AA.

bAmino acids present in >97.5% of Group M HIV-1 sequences.

cMutations strongly consistent with APOBEC-mediated G-to-A hypermutation. “*”: stop codon.

Figure 4

Distribution of gp41 APOBEC signature mutations.

Distribution of the number of gp41 APOBEC signature mutations in the complete set of 7,166 one-per-person quality-control filtered (i.e. sequences with large numbers of missing nucleotides and frame shifts were excluded) aligned complete gp41 sequences from the LANL HIV sequence database. The 130 sequences containing ≥3 APOBEC signature mutations were considered to be at high risk of having been subject to APOBEC-mediated G-to-A hypermutation and were excluded from our amino acid prevalence calculations.

We then used Batch Entrez to submit each of the Accession IDs to GenBank[24] and parsed the XML results to aggregate sequences into GenBank submission sets (or studies) sharing either the same PubMed ID or the same Title and Author List fields. We reviewed the 329 studies reporting five or more individuals. Of these studies, 191 (58.1%) comprised solely ART-naïve individuals, 95 (28.9%) comprised individuals whose treatment status was unknown, and 43 (13.0%) comprised individuals who were ART-experienced. A summary of these 329 studies is provided in gp41Studies.csv (Data Citation 1). We then determined the proportion of each amino acid at each position in gp41 for the complete set of 4,263 one-per-person group M ART-naïve sequences as well as for each of the four LANL-designated subtypes (A, B, C, and CRF01_AE) for which at least 100 sequences were present. Amino acids occurring at a proportion ≤0.1% were considered unusual. Figure 5 shows that the numbers of gp41 sequences according to the number of unusual mutations per sequence monotonically decreases until n=7 unusual mutations. Therefore, we excluded sequences containing ≥8 unusual mutations, since these 21 sequences were considered to be at high risk of poor sequence quality. We then recalculated the proportion of each amino acid at each position for the remaining 4,242 sequences. The original and aligned sets of sequences for the 4,242 one-per-person filtered sequences from these studies in FASTA format are in gp41NaiveOriginal.fas and gp41NaiveAligned.fas (Data Citation 1). The header for each sequence contains the GenBank accession number and the LANL-designated subtype. The file gp41AAPrevalence.csv (Data Citation 1) lists the proportion of each amino acid at each gp41 position. In this file insertions, deletions, and mixtures are indicated by “ins”, “del”, and “X”, respectively. Figure 6 displays the distribution of amino acids at each of the 345 gp41 positions in the one-per-person group M HIV-1 gp41 sequences from ART-naïve individuals.
Figure 5

Distribution of unusual gp41 mutations.

Distribution of the number of unusual gp41 amino acids in the 4,263 one-per-person quality-control and APOBEC-filtered aligned complete gp41 sequences from PI-naïve individuals. Unusual amino acids were defined as those occurring in ≤0.1% of sequences. The 21 sequences containing ≥8 unusual amino acids were removed from the final PI-naïve gp41 amino acid profile.

Figure 6

Distribution of HIV-1 group M gp41 amino acid variants in sequences from PI-naïve individuals.

Amino acid variants occurring in 4,242 one-per-person sequences from PI-naïve individuals. Amino acids occurring in ≥50% of sequences are shown in bold black; those occurring in 10 to 49% of sequences are shown in red; and those occurring in 1 to 9% of sequences are shown in blue. Positions at which insertions or deletions have been reported in ≥1% of sequences are indicated by dots. The complete summary of all amino acid variants in group M sequences and for the most common subtypes can be found in the file naiveStudies/gp41AAPrevalence.csv (Data Citation 1).

Gag and gp41 selection indexes

We used empirical gag and gp41 amino acid site frequencies to calculate a selection index for each amino acid change that developed during therapy defined as follows: log10 of the ratio of the proportion of the pre-therapy amino acid in PI-naïve individuals divided by the proportion of the post-therapy amino acid in PI-naïve individuals (fold change). Amino acid changes with a high selection index were defined as changing from a highly conserved or relatively common amino acid variant at a position to an amino acid with a prevalence at least 10 times less common (i.e., a selection index ≥1.0). The distribution of all selection indexes for gag and gp41 according to treatment was plotted using an R script that accepts as input the list of amino acid changes between pairs of sequences and data on the proportion of each amino acid in an external database. The script and the resulting figures are located at scripts/make-graphical-summary.r, reports/gag-mutations.pdf, and reports/gp41-mutations.pdf (Data Citation 1). As no statistical model for the expected distribution of selection indexes for proteins under selective drug pressure has been developed, the plots are useful primarily for identifying loci at which changes from a conserved to an unusual amino acid were clustered. As noted in our previous manuscript, we found no discernible difference in either gag or gp41 in the distribution of selection indexes between PI- and NNRTI-treated individuals.

Code availability

The code used in this manuscript includes the set of Python (version 3.5.2), R (version 3.2.3), and Linux shell scripts that are available on Github (https://github.com/hivdb/gag-gp41) and in the gag-gp41.zip file submitted to Dryad Digital Repository (Data Citation 1). The Github site and the Dryad zip file also includes the 24 files cited in this Data Descriptor. There are no restrictions on accessing or using the code or files as they are released under the open source MIT License.

Data Records

The original and aligned pre- and post-treatment gag and gp41 sequences from our previous published study are available in four FASTA files in which each sequence header contains four fields: GenBank accession ID, PID, treatment time point, and treatment category (for example KY579846|118827_PIs_Pre). The sequence files, which are named gagOriginal.fas, gagAligned.fas, gp41Original.fas, and gp41Aligned.fas, are located in the directory data/fasta/ (Data Citation 1). Table 1 (available online only) lists the GenBank accession IDs according to PID, treatment time point, and treatment category (Data Citation 2, Data Citations 3, Data Citations 4, Data Citations 5, Data Citations 6, Data Citations 7, Data Citations 8, Data Citations 9, Data Citations 10, Data Citations 11, Data Citations 12, Data Citations 13, Data Citations 14, Data Citations 15, Data Citations 16, Data Citations 17, Data Citations 18, Data Citations 19, Data Citations 20). The file data/insertions.csv (Data Citation 1) lists each of the insertions in gag and gp41 as these were removed during sequence alignment. The Newick representation of the neighbour-joining trees for the aligned gag and gp41 sequences are in the directory data/phylo/ (Data Citation 1). Tables 2, 3, and 4 summarize the differences between pre- and post-treatment sequences for gag, the gag cleavage sites, and gp41, respectively. The Linux shell scripts that pass parameters to Hyphy batch language scripts used to perform the pairwise dN/dS analysis, FEL diversifying selection analysis, and MEDS directional selection analysis are named run-pairwise.sh, run-fel.sh, and run-meds.sh, respectively (Data Citation 1). They are available in the directory scripts/. The summarized results of the pairwise dN/dS analysis are in Tables 2 and 4. The results of FEL and MEDS are in Tables 5 and 6. The file data/naiveStudies/gagStudies.csv (Data Citation 1) contains the list of all studies with three or more individuals from whom gag sequences were obtained for analyzing mutation prevalence. The files data/naiveStudies/gagNaiveOriginal.fas and data/naiveStudies/gagNaiveAligned.fas (Data Citation 1) contain the 5,338 one-per-person quality-control filtered original and aligned sequences from these studies. Table 7 lists the gag signature APOBEC mutations. Figure 1 shows the distribution of the number of gag signature APOBEC mutations prior to quality control filtering. Figure 2 shows the distribution of the number of unusual amino acids per gag sequence. The file gagAAPrevalence.csv (Data Citation 1) lists the proportion of each amino acid at each gag position according to HIV-1 subtype in one-per-person sequences filtered for an excess of signature APOBEC mutations and unusual amino acids. report/gag-naive-indels.pdf (Data Citation 1) displays the distribution of insertions and deletions at each position in gag. Figure 3 displays the proportions of all gag amino acid variants present at 1.0% or greater frequency of one-per-person sequences. The file data/naiveStudies/gp41Studies.csv (Data Citation 1) contains the list of all studies with five or more individuals from whom gp41 sequences were obtained for analyzing mutation prevalence. The files data/naiveStudies/gp41NaiveOriginal.fas and data/naiveStudies/gp41NaiveAligned.fas (Data Citation 1) contain the 4,242 one-per-person quality-control filtered original and aligned sequences from these studies. Table 8 lists the gp41 signature APOBEC mutations. Figure 4 shows how the distribution of the number of gp41 signature APOBEC mutations prior to quality control filtering. Figure 5 shows the distribution of the number of unusual amino acids per gp41 sequence. The file gp41AAPrevalence.csv (Data Citation 1) lists the proportion of each amino acid at each gp41 position according to HIV-1 subtype in one-per-person sequences filtered for an excess of signature APOBEC mutations and unusual amino acids. report/gp41-naive-indels.pdf (Data Citation 1) displays the distribution of insertions and deletions at each position in gp41. Figure 6 displays the proportions of all gp41 amino acid variants present in ≥1.0% of one-per-person sequences. The file scripts/run-basic.py and scripts/make-graphical-summary.r (Data Citation 1) contain the script that accept a list of amino acid changes between pairs of sequences and data on the proportion of each mutation to generate a plot showing the selection indexes for each mutation. The files reports/gag-mutations.pdf, and reports/gp41-mutations.pdf contain the output of these scripts.

Technical Validation

Several concerns arise when calculating positional amino acid prevalence from large numbers of sequences in public databases: (i) Do any of the sequences contain nucleotide sequence errors introduced by those who submitted the sequence?; (ii) Do any of the sequences contain annotation errors introduced by those who submitted the sequence or by database curators?; (iii) Have errors been introduced during sequence alignment resulting in the spurious alignment of nonhomologous positions and secondarily inaccurate mutation proportion data?; and (iv) In the case of HIV-1, do the sequences have evidence of APOBEC-mediated G-to-A hypermutation or other evidence for biological artefact consistent with a nonviable virus protein? In our analyses, we used the LANL HIV Sequence Database[21] to retrieve complete group M HIV-1 gag and gp41 sequences from previously published studies submitted to GenBank[24]. Despite the fact that GenBank is the standard database for sequences determined by dideoxynucleoside sequencing and that the LANL HIV Sequence Database is a curated HIV sequence database, we performed additional analyses to address the concerns cited in the previous paragraph. This process involved first removing sequences containing large gaps, multiple frame shift mutations, and an excess of signature APOBEC mutations. This was followed by removing a small number of sequences containing high numbers of unusual mutations. The approach for identifying likely APOBEC-mediated G-to-A hypermutation was similar to an approach that we previously described for HIV-1 protease, RT, and integrase[25]. We first identified 45 gag and 47 gp41 signature APOBEC mutations. Overall, 23 of the signature mutations were stop codons at positions for which the highly conserved consensus amino acid was tryptophan (W) and 69 were highly unusual amino acids at conserved positions that usually occurred in a sequence containing one or more stop codons. The distribution of the number of signature APOBEC mutations per sequence was then used to exclude a small proportion of sequences with ≥3 signature APOBEC mutations: 0.6% for gag and 1.7% for gp41. Following these steps, we plotted the distribution of pairwise uncorrected nucleotide distances for group M HIV-1 gag and gp41 and for those subtypes for which more than 100 sequences were available (Figs. 7 and 8). The intra- and inter-subtype pairwise distances for both genes clustered around previously reported genetic distances for these genes[26]. The absence of highly divergent gag or gp41 sequences is consistent with our attention to sequence alignment and sequence quality in creating our curated sequence datasets and amino acid profiles.
Figure 7

Distribution of pairwise uncorrected nucleotide distances for 5,338 group M HIV-1 gag sequences and within each of the four most common subtypes.

Approximately 0.001% of group M sequence pairs had distance between 17.5 to 19.6% and cannot be visualized on the figure. The left- and right-sided distributions for the group M sequences reflect intra- and inter-subtype distances, respectively.

Figure 8

Distribution of pairwise uncorrected nucleotide distances for 4,242 group M HIV-1 gag sequences and within each of the four most common subtypes.

Approximately 0.001% of group M sequence pairs had distance between 20.5 to 21.3% and cannot be visualized on the figure. The left- and right-sided distributions for the group M sequences reflect intra- and inter-subtype distances, respectively.

We also identified a previous study of gag amino acid variation and found that 92.9% (914) of the 984 amino acids which we detected at a prevalence of at least 1.0% of PI-naïve individuals were detected among the 993 amino acids detected at a prevalence of at least 1.0% in this earlier study [27]. This previous study was similar in design to ours with the following exceptions: (i) Sequences were obtained from all published studies through 2012 regardless of whether the individuals from whom the sequences were obtained were PI-experienced, PI-naïve, or of uncertain PI treatment history; (ii) Hypermutated sequences were excluded using the Los Alamos Hypermut tool[28]; and (iii) No isolates were excluded solely on the basis of having a large number of unusual mutations. We did not identify a similarly large study of gp41 amino acid variation. The exclusion of outlier sequences is a logical approach to creating useful sequence sets and alignments from which mutation proportion data can be calculated. Nonetheless, highly unusual sequences may not always reflect erroneous or artefactual data. As part of our technical validation pipeline, we have identified those sequences that were excluded should other researchers be interested in their analysis.

Usage Notes

The complete set of files including tables, figures, sequence files, tab-delimited files, and code files are available as a Dryad data citation and on the GitHub repository. The Dryad data citation provides a stable permanent snapshot of the analyses described in this manuscript. The GitHub repository will evolve as new studies are published and as more published studies are reviewed to expand the sets of filtered annotated gag and gp41 sequences from PI-naïve individuals. Other researchers sequencing gag and/or gp41 sequences before and after PI therapy will be able to pool our data with theirs and to perform the same dN/dS and selection index analyses using the software described in this manuscript and provided on Dryad and GitHub. Several other aspects of our data and code will be useful to other researchers even if they are not planning to perform the same analyses described in this manuscript: (i) the signature APOBEC mutations for gag and gp41 will be useful for the study of sequences of these two genes; (ii) the gag and gp41 sequence sets, publication summaries, and mutation prevalence files will also be useful to other researchers studying these genes; and (iii) the Hyphy and shell scripts will be useful to other researchers performing dN/dS analyses on sequence pairs. In particular, the HyPhy script for pairwise analysis has not been previously published.

Additional information

How to cite this article: Tzou, P. L. et al. Selection analyses of paired HIV-1 gag and gp41 sequences obtained before and after antiretroviral therapy. Sci. Data 5:180147 doi: 10.1084/sdata.2018.147 (2018). Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
  25 in total

1.  Human immunodeficiency virus type 1 protease cleavage site mutations associated with protease inhibitor cross-resistance selected by indinavir, ritonavir, and/or saquinavir.

Authors:  H C Côté; Z L Brumme; P R Harrigan
Journal:  J Virol       Date:  2001-01       Impact factor: 5.103

2.  Detecting hypermutations in viral sequences with an emphasis on G --> A hypermutation.

Authors:  P P Rose; B T Korber
Journal:  Bioinformatics       Date:  2000-04       Impact factor: 6.937

3.  Changes in human immunodeficiency virus type 1 Gag at positions L449 and P453 are linked to I50V protease mutants in vivo and cause reduction of sensitivity to amprenavir and improved viral fitness in vitro.

Authors:  Michael F Maguire; Rosario Guinea; Philip Griffin; Sarah Macmanus; Robert C Elston; Josie Wolfram; Naomi Richards; Mary H Hanlon; David J T Porter; Terri Wrin; Neil Parkin; Margaret Tisdale; Eric Furfine; Chris Petropoulos; B Wendy Snowden; Jörg-Peter Kleim
Journal:  J Virol       Date:  2002-08       Impact factor: 5.103

4.  Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data.

Authors:  Matthew Kearse; Richard Moir; Amy Wilson; Steven Stones-Havas; Matthew Cheung; Shane Sturrock; Simon Buxton; Alex Cooper; Sidney Markowitz; Chris Duran; Tobias Thierer; Bruce Ashton; Peter Meintjes; Alexei Drummond
Journal:  Bioinformatics       Date:  2012-04-27       Impact factor: 6.937

5.  GenBank.

Authors:  Dennis A Benson; Karen Clark; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal:  Nucleic Acids Res       Date:  2014-11-20       Impact factor: 19.160

6.  A novel substrate-based HIV-1 protease inhibitor drug resistance mechanism.

Authors:  Monique Nijhuis; Noortje M van Maarseveen; Stephane Lastere; Pauline Schipper; Eoin Coakley; Bärbel Glass; Mirka Rovenska; Dorien de Jong; Colombe Chappey; Irma W Goedegebuure; Gabrielle Heilek-Snyder; Dominic Dulude; Nick Cammack; Lea Brakier-Gingras; Jan Konvalinka; Neil Parkin; Hans-Georg Kräusslich; Francoise Brun-Vezinet; Charles A B Boucher
Journal:  PLoS Med       Date:  2007-01       Impact factor: 11.069

7.  HIV-1 drug resistance mutations emerging on darunavir therapy in PI-naive and -experienced patients in the UK.

Authors:  Kate El Bouzidi; Ellen White; Jean L Mbisa; Caroline A Sabin; Andrew N Phillips; Nicola Mackie; Anton L Pozniak; Anna Tostevin; Deenan Pillay; David T Dunn
Journal:  J Antimicrob Chemother       Date:  2016-09-28       Impact factor: 5.790

8.  Gag-protease coevolution analyses define novel structural surfaces in the HIV-1 matrix and capsid involved in resistance to Protease Inhibitors.

Authors:  Francisco M Codoñer; Ruth Peña; Oscar Blanch-Lombarte; Esther Jimenez-Moyano; Maria Pino; Thomas Vollbrecht; Bonaventura Clotet; Javier Martinez-Picado; Rika Draenert; Julia G Prado
Journal:  Sci Rep       Date:  2017-06-16       Impact factor: 4.379

Review 9.  Human Immunodeficiency Virus Gag and protease: partners in resistance.

Authors:  Axel Fun; Annemarie M J Wensing; Jens Verheyen; Monique Nijhuis
Journal:  Retrovirology       Date:  2012-08-06       Impact factor: 4.602

10.  HIV-1 Protease, Reverse Transcriptase, and Integrase Variation.

Authors:  Soo-Yon Rhee; Kris Sankaran; Vici Varghese; Mark A Winters; Christopher B Hurt; Joseph J Eron; Neil Parkin; Susan P Holmes; Mark Holodniy; Robert W Shafer
Journal:  J Virol       Date:  2016-06-10       Impact factor: 5.103

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.