
The HIV positive selection mutation database.

Calvin Pan¹, Joseph Kim, Lamei Chen, Qi Wang, Christopher Lee.

Abstract

The HIV positive selection mutation database is a large-scale database available at http://www.bioinformatics.ucla.edu/HIV/ that provides detailed selection pressure maps of HIV protease and reverse transcriptase, both of which are molecular targets of antiretroviral therapy. This database makes available for the first time a very large HIV sequence dataset (sequences from approximately 50 000 clinical AIDS samples, generously contributed by Specialty Laboratories, Inc.), which makes possible high-resolution selection pressure mapping. It provides information about not only the selection pressure on individual sites but also how selection pressure at one site is affected by mutations on other sites. It also includes datasets from other public databases, namely the Stanford HIV database [S. Y. Rhee, M. J. Gonzales, R. Kantor, B. J. Betts, J. Ravela and R. W. Shafer (2003) Nucleic Acids Res., 31, 298-303]. Comparison between these datasets in the database enables cross-validation with independent datasets and also specific evaluation of the effect of drug treatment.


Year: 2006 PMID: 17108357 PMCID: PMC1669717 DOI: 10.1093/nar/gkl855

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The HIV-1 virus is the causative agent of AIDS, a growing worldwide epidemic and also a fascinating system for studying fundamental scientific questions. For example, one major clinical problem in the treatment of AIDS is HIV's ability to develop resistance to antiviral drugs rapidly, often within weeks of introduction of a new drug (1–3). Foremost among the factors responsible for this are the virus' extremely high mutation rate (4,5) and replication rate (3,6–8). For this reason, there is great medical interest in understanding both the specific causes of drug resistance, and predicting fast versus slow evolutionary pathways to multiple drug resistance. At the same time, HIV provides an extraordinary wealth of data about fundamental scientific questions such as the fitness landscape for protein evolution (9,10). Evolutionary biology has developed a powerful and general approach for investigating such problems: metrics of selection pressure that measure whether a particular genetic change is selected for or against during evolution. Such metrics can reveal important selection forces either constraining or driving evolution of a protein, directly from raw sequence variation data (11,12). One very widely used metric of selection pressure on amino acid mutations is known as Ka/Ks or dn/ds (13,14) and measures the ratio of observed amino acid mutations over observed synonymous mutations, normalized by the ratio expected under a neutral model. Thus a Ka/Ks = 1 value indicates neutral selection. Ordinarily Ka/Ks is ≪ 1, indicating negative selection against amino acid mutations (far fewer observed than expected under a neutral model). Ka/Ks > 1 is referred to as positive selection (i.e. amino acid mutations increase reproductive fitness) and is observed in rare cases where new evolutionary challenges create strong pressure for rapid evolution of a protein (e.g. immune system genes like MHC that are involved in recognizing pathogenic antigens). 
Ordinarily, a single Ka/Ks value is calculated for a whole gene, but with very large datasets it becomes possible to estimate distinct Ka/Ks values for individual codon positions or amino acid mutations. This yields a ‘selection pressure map’ of a gene, revealing its detailed functional constraints and in rare cases positive selection peaks that signal important new evolutionary pressures such as drug treatment. We used Ka/Ks because it provides a powerful tool for detecting positive selection. Phylogenetic analysis of our HIV sequence dataset using Phylip (15) shows a star-like topology (data are available at , but will be presented in detail elsewhere), in agreement with previous studies (16,17). We have assembled a large-scale database that provides researchers detailed selection pressure maps of HIV proteins involved in drug resistance. These data have many possible applications, including prediction of mutations contributing to drug resistance, distinguishing primary drug resistance mutations from accessory mutations, rate measurements of fast versus slow evolutionary pathways to multiple drug resistance, and the evolutionary dynamics of different types of mutations as the virus moves from untreated to drug-treated conditions and back. This database makes available for the first time a very large HIV sequence dataset (sequences from ∼50 000 clinical AIDS samples), which makes possible high-resolution selection pressure mapping, as well as smaller datasets from other public databases. The methods and most of the data described herein have been published previously (12,18).

DATABASE CONTENT, INTERFACE AND APPLICATIONS

Datasets

The primary dataset consists of sequences for HIV protease and reverse transcriptase (RT) for ∼50 000 clinical AIDS patient samples from the United States, collected during 1999–2003 (12), and mostly under drug treatment. These data cover 1.4 kb each [300 000 chromatograms; six overlapping reads per sample, including both strands; see (12) for details] and were generously contributed by Specialty Laboratories Inc. Owing to HIV's high mutation rate, on average each sequence contains 32 mutations/kb [with respect to the Los Alamos reference sequence (12)], for a total of more than 2 million mutation observations in the dataset (12). Over 5000 distinct codon mutations were observed, each with an average count of 364 samples (12). For comparison, this density of polymorphism information is equivalent to sequencing ∼1 million people. This very large dataset, made available publicly for the first time, has made detailed selection pressure mapping possible. Of the samples, 99.3% are subtype B; non-subtype-B samples were excluded from the analysis (12). The dataset is fully HIPAA-compliant; all information concerning the source patients was removed by Specialty. The database currently includes two additional datasets, also covering HIV protease and RT. These datasets were obtained from the Stanford HIV database (19). The Stanford-Treated dataset consists of 1797 subtype B samples with known drug treatments. This dataset provides a useful comparison with the Specialty results, for validating whether a specific mutation is reproducibly selected by drug treatment. The Stanford-Untreated dataset consists of 2628 subtype B samples not under drug treatment. By comparing results from this dataset with Specialty and Stanford-Treated, users can assess whether a specific mutation is more likely to be associated with drug resistance or other types of phenotypic fitness effects (e.g. interactions with the immune system). 
The Specialty raw sequence data are available as a gzip'ed FASTA file at .
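As a minimal sketch of how a gzip'ed FASTA file of this kind could be read, the following uses only the Python standard library. The file name in the usage comment is hypothetical, and nothing here assumes a particular header layout for the Specialty records.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def read_gzipped_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Stream records from a gzip-compressed FASTA file without unpacking it."""
    with gzip.open(path, "rt") as handle:
        yield from read_fasta(handle)


# Usage (hypothetical file name):
# for header, sequence in read_gzipped_fasta("specialty_sequences.fasta.gz"):
#     print(header, len(sequence))
```

Streaming the records keeps memory use low, which matters for a dataset of roughly 50 000 samples.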

Amino acid selection pressure mapping

The first aspect of the database is mapping of Ka/Ks selection pressure at each codon position in HIV protease and the first 381 codon positions of RT (Figure 1). Further positions in RT were not sequenced in this dataset. Codon-specific selection pressure (12) was calculated using the following formula:

$$\frac{K_a}{K_s} = \frac{N_a / N_s}{\left(f_t\, n_{a,t} + f_v\, n_{a,v}\right) / \left(f_t\, n_{s,t} + f_v\, n_{s,v}\right)}$$

where Na and Ns are the number of amino acid mutations and synonymous mutations observed at the codon, na,t is the number of possible transition mutations in the codon that would change the resulting amino acid, ns,t is the number of possible transition mutations that are synonymous, na,v and ns,v are the equivalent numbers for transversions, and ft and fv are the transition and transversion frequencies, respectively. We calculated an LOD confidence score for a codon to be under positive selection pressure according to the following formula:

$$\mathrm{LOD} = -\log_{10} p\left(i \ge N_a \mid N, q, K_a/K_s = 1\right) = -\log_{10} \sum_{i=N_a}^{N} \binom{N}{i}\, q^{i} (1-q)^{N-i}$$

where N is the total number of mutations observed in the codon and q is calculated as follows:

$$q = \frac{f_t\, n_{a,t} + f_v\, n_{a,v}}{f_t\, n_{a,t} + f_v\, n_{a,v} + f_t\, n_{s,t} + f_v\, n_{s,v}}$$

This analysis includes Ka/Ks values for 2946 individual amino acid mutations (12) at 399 codon positions with LOD scores >2. These data have many applications. For example, strong positive selection (Ka/Ks > 1) indicates drug-resistance mutations or important fitness effects. Experimental validation data in HIV protease (where causes of drug resistance are well characterized) showed that 19 of 23 known drug resistance codons were correctly predicted by our database, which also accurately predicts the mutant enzyme's activity phenotype (12,20). Of the 47 positively selected sites found in the Specialty dataset, 28 were also found in the Stanford-Untreated dataset, possibly indicating that those sites can harbor fitness mutations (18).

The database has a simple graphical interface (Figure 1): users can peruse the codon-position selection pressure map directly, click on a position, and inspect detailed tabular results grouped either by codon position, individual amino acid mutations or individual nucleotide mutations (e.g. to see whether two different nucleotide mutations producing the same amino acid replacement show the same Ka/Ks value).
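The codon-specific calculation described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the database's own code: it assumes Ka/Ks is the observed ratio Na/Ns divided by the neutral expectation (ft·na,t + fv·na,v)/(ft·ns,t + fv·ns,v), that the LOD score is the negative log10 of a binomial tail probability with q equal to the neutral fraction of amino acid-changing mutations, and that mutations to stop codons are ignored. The codon, counts and frequencies in the example are hypothetical.

```python
from itertools import product
from math import exp, lgamma, log

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def site_counts(codon):
    """Count possible single-base changes: keys are 'a,t', 's,t', 'a,v', 's,v'."""
    counts = {"a,t": 0, "s,t": 0, "a,v": 0, "s,v": 0}
    for pos, ref in enumerate(codon):
        for alt in BASES:
            if alt == ref:
                continue
            mutant = codon[:pos] + alt + codon[pos + 1:]
            if CODE[mutant] == "*":
                continue  # assumption: nonsense mutations are ignored
            effect = "a" if CODE[mutant] != CODE[codon] else "s"
            kind = "t" if TRANSITION[ref] == alt else "v"
            counts[effect + "," + kind] += 1
    return counts


def neutral_weights(counts, f_t, f_v):
    """Expected weights of amino acid-changing and synonymous mutations."""
    expected_a = f_t * counts["a,t"] + f_v * counts["a,v"]
    expected_s = f_t * counts["s,t"] + f_v * counts["s,v"]
    return expected_a, expected_s


def ka_ks(n_a, n_s, counts, f_t, f_v):
    """Observed Na/Ns normalized by the ratio expected under neutrality."""
    expected_a, expected_s = neutral_weights(counts, f_t, f_v)
    if n_s == 0 or expected_a == 0 or expected_s == 0:
        raise ValueError("Ka/Ks is undefined for these counts")
    return (n_a / n_s) / (expected_a / expected_s)


def lod_score(n_a, n_total, q):
    """-log10 of P(i >= n_a) for a Binomial(n_total, q), summed in log space."""
    if not 0.0 < q < 1.0 or not 0 <= n_a <= n_total:
        raise ValueError("need 0 < q < 1 and 0 <= n_a <= n_total")
    log_terms = [
        lgamma(n_total + 1) - lgamma(i + 1) - lgamma(n_total - i + 1)
        + i * log(q) + (n_total - i) * log(1.0 - q)
        for i in range(n_a, n_total + 1)
    ]
    peak = max(log_terms)
    log_tail = peak + log(sum(exp(t - peak) for t in log_terms))
    return -log_tail / log(10.0)


# Hypothetical example: codon CTG with 30 amino acid and 10 synonymous mutations.
counts = site_counts("CTG")
exp_a, exp_s = neutral_weights(counts, f_t=2.0, f_v=1.0)
q = exp_a / (exp_a + exp_s)
ratio = ka_ks(30, 10, counts, f_t=2.0, f_v=1.0)
lod = lod_score(30, 40, q)
print(f"Ka/Ks = {ratio:.2f}, LOD = {lod:.2f}")
```

The binomial tail is summed in log space so that the function stays usable for codons with thousands of observed mutations, where the raw binomial coefficients would overflow a float.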

Figure 1

The interface to the positive selection mutation database is a clickable imagemap. Clicking on any codon position performs a query and returns the results in an easy-to-read format. (Specialty dataset is shown.)

Selection pressure interaction mapping

The massive size of the Specialty dataset makes it possible to measure how selection pressure for one amino acid mutation Y is affected by amino acid mutations at other sites X. Specifically, the database computes Ka/Ks for mutation Y conditioned on the presence of amino acid mutations at site X versus the absence of any mutation at site X. This ‘conditional Ka/Ks’ (18) calculation is performed as follows:

$$\left.\frac{K_a}{K_s}\right|_{Y_a \mid X_a} = \frac{N_{Y_a X_a} / N_{Y_s X_a}}{\left(f_t\, n_{a,t} + f_v\, n_{a,v}\right) / \left(f_t\, n_{s,t} + f_v\, n_{s,v}\right)}$$

where $N_{Y_a X_a}$ and $N_{Y_s X_a}$ are the numbers of amino acid mutations and synonymous mutations at site Y observed in the presence of amino acid mutations at site X and all other variables retain their previous definitions. Dividing this result by the value obtained in the absence of any mutation at site X gives the ‘conditional selection ratio’ (18):

$$\text{conditional selection ratio} = \frac{N_{Y_a X_a} / N_{Y_s X_a}}{N_{Y_a X_o} / N_{Y_s X_o}}$$

where $N_{Y_a X_o}$ and $N_{Y_s X_o}$ are the numbers of samples containing either an amino acid mutation or synonymous mutation at Y and no mutation at X. The LOD score by which we evaluated the significance of apparent positive conditional selection was calculated using the following:

$$\mathrm{LOD} = -\log_{10} \sum_{i=N_{Y_a X_a}}^{N} \binom{N}{i}\, q^{i} (1-q)^{N-i}$$

where $N = N_{Y_a X_a} + N_{Y_s X_a}$ and q as defined above.

For experimental validation, this database correctly predicted 80 of 92 known mutation positive interaction pairs identified in HIV protease by independent experimental studies (P-value = $10^{-70}$) (11,18). The database again provides a graphical interface (Figure 2) as a 2D heatmap showing all pairwise interactions, which users can click at any position to inspect detailed tabular results.
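The conditional calculation can be illustrated with a short Python sketch. It assumes each sample has been reduced to a pair of labels giving the state at site X and at site Y ('aa' for an amino acid mutation, 'syn' for a synonymous mutation, 'none' for no mutation). The labels, function names, sample counts and the neutral ratio of 2.0 are all hypothetical choices for illustration, not values from the database.

```python
from collections import Counter


def tally_pairs(samples):
    """Count samples by (state at X, state at Y)."""
    return Counter(samples)


def conditional_ka_ks(counts, x_state, neutral_ratio):
    """Ka/Ks at Y among samples whose site X is in x_state ('aa' or 'none')."""
    n_ya = counts[(x_state, "aa")]
    n_ys = counts[(x_state, "syn")]
    if n_ys == 0 or neutral_ratio == 0:
        raise ValueError("conditional Ka/Ks is undefined for these counts")
    return (n_ya / n_ys) / neutral_ratio


def conditional_selection_ratio(counts):
    """Ka/Ks at Y with X mutated, divided by Ka/Ks at Y with X unmutated.

    The neutral expectation cancels, so only the four counts are needed.
    """
    with_x = conditional_ka_ks(counts, "aa", neutral_ratio=1.0)
    without_x = conditional_ka_ks(counts, "none", neutral_ratio=1.0)
    if without_x == 0:
        raise ValueError("no amino acid mutations at Y in the absence of X")
    return with_x / without_x


# Hypothetical counts: Y is rarely mutated alone but often mutated alongside X.
samples = (
    [("aa", "aa")] * 40 + [("aa", "syn")] * 10
    + [("none", "aa")] * 5 + [("none", "syn")] * 50
    + [("none", "none")] * 100
)
counts = tally_pairs(samples)
given_x = conditional_ka_ks(counts, "aa", neutral_ratio=2.0)
given_no_x = conditional_ka_ks(counts, "none", neutral_ratio=2.0)
ratio = conditional_selection_ratio(counts)
print(f"Ka/Ks(Y | X) = {given_x:.2f}, Ka/Ks(Y | no X) = {given_no_x:.2f}, "
      f"ratio = {ratio:.1f}")
```

In this made-up example, Y looks negatively selected on its own but positively selected once X is present, which is the pattern the text goes on to describe for an accessory mutation.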

Figure 2

Selection pressure interaction map. The degree to which a mutation at one site X (horizontal axis) affects the selection pressure at another site Y (vertical axis) is shown as the conditional selection ratio for all amino acid mutations at site Y conditioned on any amino acid mutation at site X. The color coding scale indicates increasing values of positive conditional selection ratio. Interactions showing conditional selection ratios >1 (positive conditional selection) with LOD scores >3 are shown, with blue indicating stronger interactions and yellow indicating weaker ones. Clicking any particular square provides details on the numbers used in the calculation.

These data can yield useful insights into HIV drug resistance. For example, the data show a significant interaction between protease site 90 (a known drug resistance mutation site) and site 10 (Figure 3). Amino acid mutations at 90 displayed strong, unconditional positive selection, indicating that they directly cause drug resistance. In contrast, mutations at 10 are negatively selected in the absence of the 90 mutation, but become positively selected in the presence of the 90 mutation (Figure 3). These results closely match previous experimental studies showing that mutations at 90 cause drug resistance, while mutations at 10 have an accessory effect of compensating for the destabilizing effect of mutations at 90 (21). Thus, our database can help users by providing information that can distinguish primary drug-resistance mutations from accessory mutations (18). Users can navigate through links on every result page to see mutations that strongly select for a given mutation, mutations that are strongly selected for by this mutation, or links to the Stanford (22) and Los Alamos HIV databases (23) giving further information about mutations at this site.

Figure 3

For the two possible pathways from wild-type protease to the 10/90 double mutant, we computed the conditional Ka/Ks values for each mutation conditioned on the presence or absence of the other mutation (shown as numbers next to each edge in the figure). For example, in the absence of the 10 mutation, the 90 mutation shows strong positive selection in both the Specialty and Stanford-Treated datasets, but was negatively selected in the Stanford-Untreated dataset. Since the steady-state speed of a multistep path is determined by its slowest step, we highlighted the rate-limiting step in each path (boldface). For example, in the Specialty dataset, the steady-state rate of the upper pathway appears to be ∼10-fold faster than that of the lower pathway. (a) Specialty dataset, (b) Stanford-Treated dataset and (c) Stanford-Untreated dataset.

Comparison between the independent datasets in the database can shed additional light on such questions. For example, users can assess whether positively selected mutations in the Specialty dataset are really due to drug resistance, by comparing with the Stanford-Treated and Stanford-Untreated datasets. As shown in Figure 3b and c, the Stanford-Treated data strongly corroborate the Specialty result, while the Stanford-Untreated data show that 90 is indeed involved in drug resistance; it becomes strongly negatively selected in the absence of drug treatment. These data can help users distinguish genuine drug-treatment mutations from those that affect phenotype in other ways, e.g. interactions with the host immune system. Detailed analysis of these datasets demonstrates that the Ka/Ks results are highly reproducible: independent datasets from different sets of patients show strong quantitative agreement (18).
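The rate-limiting-step comparison described for Figure 3 can be sketched as follows. The sketch treats each step's conditional Ka/Ks value as a proxy for its rate and takes the smallest value on a path as that path's rate-limiting step. The step values are hypothetical placeholders chosen to give a 10-fold difference; they are not the numbers shown in the figure.

```python
def rate_limiting_step(steps):
    """Return the (label, value) step with the smallest conditional Ka/Ks."""
    return min(steps, key=lambda step: step[1])


def compare_pathways(paths):
    """Map each pathway name to its rate-limiting step."""
    return {name: rate_limiting_step(steps) for name, steps in paths.items()}


# Hypothetical conditional Ka/Ks values for the two routes to the 10/90 double mutant.
paths = {
    "upper (90 first)": [("wt -> 90", 5.0), ("90 -> 10/90", 8.0)],
    "lower (10 first)": [("wt -> 10", 0.5), ("10 -> 10/90", 6.0)],
}
limits = compare_pathways(paths)
fold = limits["upper (90 first)"][1] / limits["lower (10 first)"][1]
for name, (label, value) in limits.items():
    print(f"{name}: rate-limiting step {label} (Ka/Ks = {value})")
print(f"upper/lower ratio of rate-limiting values: {fold:.1f}")
```

Taking the minimum over steps encodes the idea that a path can be no faster than its slowest step; a fuller kinetic model would need actual rate constants rather than Ka/Ks values.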

FUTURE ADDITIONS

We are currently working to add new datasets and features to the database. First, we will add data for additional HIV genes, such as the env gene, which is important for HIV immune evasion (24); although these datasets have smaller numbers of sequences, our analysis has shown that useful Ka/Ks mapping information can be obtained from such counts. Second, we will analyze mutation data from patients under specific drug treatments to compare selection pressures caused by different drugs. Third, we will add datasets for other HIV subtypes (e.g. subtype C) to reveal where selection pressure patterns appear to be consistent with those seen in subtype B (allowing diagnostic criteria from subtype B to be applied to other subtypes) and where there are important differences. Fourth, we will add a new very large dataset for the Hepatitis C core gene, consisting of approximately 60 000 samples, generously donated by Specialty Laboratories. Lastly, we will add new analyses and graphical interfaces to the database, including phylogenetic analysis and clickable pathway diagrams.

22 in total

1. Population dynamics of HIV-1 inferred from gene sequences.

Authors: N C Grassly; P H Harvey; E C Holmes
Journal: Genetics Date: 1999-02 Impact factor: 4.562

2. Unbiased estimation of the rates of synonymous and nonsynonymous substitution.

Authors: W H Li
Journal: J Mol Evol Date: 1993-01 Impact factor: 2.395

3. HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span, and viral generation time.

Authors: A S Perelson; A U Neumann; M Markowitz; J M Leonard; D D Ho
Journal: Science Date: 1996-03-15 Impact factor: 47.728

4. Baseline human immunodeficiency virus type 1 phenotype, genotype, and RNA response after switching from long-term hard-capsule saquinavir to indinavir or soft-gel-capsule saquinavir in AIDS clinical trials group protocol 333.

Authors: M F Para; D V Glidden; R W Coombs; A C Collier; J H Condra; C Craig; R Bassett; R Leavitt; S Snyder; V McAuliffe; C Boucher
Journal: J Infect Dis Date: 2000-08-14 Impact factor: 5.226

5. Mutation patterns and structural correlates in human immunodeficiency virus type 1 protease following different protease inhibitor treatments.

Authors: Thomas D Wu; Celia A Schiffer; Matthew J Gonzales; Jonathan Taylor; Rami Kantor; Sunwen Chou; Dennis Israelski; Andrew R Zolopa; W Jeffrey Fessel; Robert W Shafer
Journal: J Virol Date: 2003-04 Impact factor: 5.103

6. Evidence for positive epistasis in HIV-1.

Authors: Sebastian Bonhoeffer; Colombe Chappey; Neil T Parkin; Jeanette M Whitcomb; Christos J Petropoulos
Journal: Science Date: 2004-11-26 Impact factor: 47.728

7. Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase.

Authors: L M Mansky; H M Temin
Journal: J Virol Date: 1995-08 Impact factor: 5.103

8. Genetic and phenotypic analyses of human immunodeficiency virus type 1 escape from a small-molecule CCR5 inhibitor.

Authors: Shawn E Kuhmann; Pavel Pugach; Kevin J Kunstman; Joann Taylor; Robyn L Stanfield; Amy Snyder; Julie M Strizki; Janice Riley; Bahige M Baroudy; Ian A Wilson; Bette T Korber; Steven M Wolinsky; John P Moore
Journal: J Virol Date: 2004-03 Impact factor: 5.103

9. Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection.

Authors: D D Ho; A U Neumann; A S Perelson; W Chen; J M Leonard; M Markowitz
Journal: Nature Date: 1995-01-12 Impact factor: 49.962

10. Human immunodeficiency virus reverse transcriptase and protease sequence database.

Authors: Soo-Yon Rhee; Matthew J Gonzales; Rami Kantor; Bradley J Betts; Jaideep Ravela; Robert W Shafer
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971
