
hapConstructor: automatic construction and testing of haplotypes in a Monte Carlo framework.

Ryan Abo, Stacey Knight, Jathine Wong, Angela Cox, Nicola J Camp.

Abstract

SUMMARY: Haplotypes carry important information that can direct investigators towards underlying susceptibility variants, and hence multiple tagging single nucleotide polymorphisms (tSNPs) are usually studied in candidate gene association studies. However, it is often unknown which SNPs should be included in haplotype analyses, or which tests should be performed for maximum power. We have developed a program, hapConstructor, which automatically builds multi-locus SNP sets to test for association in a case-control framework. The multi-SNP sets considered need not be contiguous; they are built based on significance. An important feature is that the missing data imputation is carried out based on the full data, for maximal information and consistency. HapConstructor is implemented in a Monte Carlo framework and naturally extends to allow for significance testing and false discovery rates that account for the construction process and to related individuals. HapConstructor is a useful tool for exploring multi-locus associations in candidate genes and regions. AVAILABILITY: http://www-genepi.med.utah.edu/Genie.

Entities: Disease Gene

Year: 2008 PMID: 18653522 PMCID: PMC2530882 DOI: 10.1093/bioinformatics/btn359

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Multiple tagging-SNPs (tSNPs) are widely used in candidate gene association studies. Performing both haplotype and single-locus analyses has been shown to increase power to detect low-frequency disease variants, even after correcting for multiple testing (Becker and Knapp, 2004). In new studies, tSNPs are usually analyzed independently and in multi-SNP combinations. Even when associations are considered established (Cox et al., 2007), comprehensive SNP-set haplotype analyses can be performed to more accurately define the haplotype(s) on which the susceptibility variants lie. One avenue that may effectively guide such searches is a more systematic haplotype-mining analysis. Multi-locus analyses are of high dimension, leading to reduced power when testing association. Haplotype similarity, cladistic and phylogenetic techniques can be used to reduce dimensionality (Bardel et al., 2006; Camp et al., 2005; Jannot et al., 2004; Liu et al., 2007; Molitor et al., 2003; Tzeng and Zhang, 2007; Waldron et al., 2006; Yu et al., 2005). However, these methods require a priori determination of which SNPs to include, and there remains the question of whether to analyze monotype or diplotype data and the mode of expression. Studying SNP subsets may be optimal and reduce dimension. Sliding windows (Lin et al., 2004) and haplotype clustering using variable-length Markov chain models (Browning and Browning, 2007; Browning, 2006) have been proposed for traditional case-control data and contiguous subsets of SNPs. An approach for non-contiguous SNP subsets exists: it constructs haplotypes by starting from SNP pairs and iteratively adding SNPs based on significance and base pair distance (haploBuild; Laramie et al., 2007).
While this latter approach is flexible in haplotype construction, it is limited to transmission statistics in the FBAT software (Horvath et al., 2001) and lacks a valid significance assessment that accounts for all the multiple testing inherent in a data-mining technique. We present hapConstructor, software to construct and test multi-locus data, allowing for non-contiguous SNP subsets. Tests for non-independence and effect size are incorporated. Monotype (alleles or haplotypes); diplotype (genotypes or haplotype pairs); and composite genotype (unphased genotypes across multiple loci) tests are included. Standard reductions of dimensionality are incorporated, such as specific haplotype tests for monotype data, and dominant, recessive and additive tests for specific haplotypes for diplotype data. Multi-locus SNP sets are constructed through a forward–backward stepwise process. HapConstructor operates in a Monte Carlo (MC) framework which offers two advantages. First, it naturally extends to testing related individuals. Second, the null distribution for the full SNP set is simulated once, and can be used to assess both empirical significance of individual tests and construction-wide P-values and false discovery rates (FDRs) that account for the construction process. HapConstructor is a Java-based extension of Genie (Allen-Brady et al., 2006).
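The dominant, recessive and additive reductions for a specific haplotype can be illustrated with a minimal sketch. It assumes each individual's diplotype is available as a pair of phased haplotype strings; the function name `code_diplotype` and the string encoding are hypothetical and are not taken from hapConstructor itself.

```python
from typing import Tuple


def code_diplotype(hap_pair: Tuple[str, str], target: str, model: str) -> int:
    """Reduce a haplotype pair to a single score for one target haplotype.

    hap_pair: the two phased haplotypes carried by an individual (assumed input).
    target:   the specific haplotype being tested against all others.
    model:    "dominant", "recessive" or "additive".
    """
    # Count how many of the two haplotypes match the target (0, 1 or 2).
    copies = sum(1 for hap in hap_pair if hap == target)
    if model == "dominant":
        return int(copies >= 1)   # carrier of at least one copy
    if model == "recessive":
        return int(copies == 2)   # carrier of two copies
    if model == "additive":
        return copies             # number of copies, used for trend tests
    raise ValueError(f"unknown model: {model}")
```

Under this coding, the dominant and recessive scores would feed a 2×2 case-control table, while the additive score would feed a trend test.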

2 METHODS

The MC framework is provided by Genie, with imputation of missing data, estimation of population haplotype frequencies and maximum likelihood estimates (MLE) of individuals’ haplotype pairs provided by the hapMC component. First, all single SNPs {s1, s2,…,sn} are tested. In each forward step, a SNP is added to each SNP set whose P-value met the user-defined threshold at the previous step. The thresholds can be constant or may vary by step. For example, if s1 met the first threshold, the next step would consider the two-locus SNP sets {s1-s2, s1-s3,…,s1-sn}. An optional backward process starts at the third step and consists of testing all (n−1)-locus subsets not previously considered. To maintain efficiency and speed and reduce redundancy, each subsequent step in the build process extends the haplotypes with the specific alleles that met the threshold at the prior step, rather than considering all haplotypes spanning the new loci set. Test statistics available are χ², χ²-trend and odds ratio. The data can be considered as diplotype or monotype or both. For diplotype data, haplotype and composite genotype tests are performed. Haplotype models are dominant, recessive and additive models for each haplotype. Composite genotypes include each of the dominant and recessive combinations across loci. For monotype data, each specific haplotype is compared to all others. Summaries for all tests performed are stored. A user interface allows these to be sorted by step, SNP, test-type and significance. If required, a construction-wide assessment that accounts for the building process can be made. Valid global P-values and FDRs are generated; the FDR is more appropriate for data mining (Benjamini and Hochberg, 1995). These are achieved by reusing the null configurations generated for the MC procedure.
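The forward step described in this section can be sketched as follows. This is a simplified illustration and not hapConstructor's implementation: it works at the level of SNP sets only, omits the optional backward step and the allele-specific haplotype extension, and takes the per-set P-value from a caller-supplied function. The names `forward_construct` and `p_value` are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence

SnpSet = FrozenSet[str]


def forward_construct(
    snps: Sequence[str],
    p_value: Callable[[SnpSet], float],
    thresholds: Sequence[float],
) -> Dict[SnpSet, float]:
    """Forward-only stepwise construction of (possibly non-contiguous) SNP sets.

    snps:       names of all SNPs under study.
    p_value:    hypothetical callable returning the best P-value for a SNP set.
    thresholds: one threshold per step; thresholds[0] applies to single SNPs.
    Returns every SNP set that was tested, with its P-value.
    """
    tested: Dict[SnpSet, float] = {}
    # Step 1 starts from all single SNPs.
    current: List[SnpSet] = [frozenset([s]) for s in snps]

    for threshold in thresholds:
        passing: List[SnpSet] = []
        for snp_set in current:
            if snp_set not in tested:          # never test the same set twice
                tested[snp_set] = p_value(snp_set)
            if tested[snp_set] <= threshold:   # set qualifies for extension
                passing.append(snp_set)

        # Extend each qualifying set by one SNP it does not already contain.
        extended = {
            snp_set | {s} for snp_set in passing for s in snps if s not in snp_set
        }
        current = sorted(extended, key=sorted)
        if not current:
            break

    return tested
```

With three SNPs, thresholds of 0.05 at both steps and a toy `p_value` that is small only for sets containing `s1`, the sketch tests the three single SNPs and then only the two-locus sets that include `s1`; the pair `s2-s3` is never evaluated.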
Each null configuration is considered as the ‘observed data’ and the construction algorithm is used with significances determined from the remaining N−1 null configurations. This is repeated to generate a set of null ‘constructions’ from which valid empirical construction-wide P-values and FDRs are determined.
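The reuse of null configurations for a construction-wide P-value can be sketched as below. The sketch makes several assumptions that are not in the source: the set of tests is fixed, whereas hapConstructor reruns the full threshold-driven construction on each null configuration; larger statistics are taken as more extreme; and the empirical P-value is the simple proportion of null statistics at least as large as the observed one. The FDR calculation is omitted. The function names are hypothetical.

```python
from typing import List, Sequence


def empirical_p(stat: float, null_stats: Sequence[float]) -> float:
    """Proportion of null statistics at least as extreme as `stat`."""
    return sum(1 for s in null_stats if s >= stat) / len(null_stats)


def construction_wide_p(
    observed: Sequence[float], nulls: Sequence[Sequence[float]]
) -> float:
    """Construction-wide P-value for the best result across a set of tests.

    observed: one statistic per test from the real data.
    nulls:    N null configurations, each with one statistic per test.
    """
    null_list: List[Sequence[float]] = list(nulls)
    n_tests = len(observed)

    # Best (smallest) empirical P-value in the real data, using all N nulls.
    observed_best = min(
        empirical_p(observed[t], [cfg[t] for cfg in null_list])
        for t in range(n_tests)
    )

    # Treat each null configuration in turn as the 'observed data' and
    # assess it against the remaining N-1 null configurations.
    null_best: List[float] = []
    for i, cfg in enumerate(null_list):
        others = null_list[:i] + null_list[i + 1:]
        null_best.append(
            min(empirical_p(cfg[t], [o[t] for o in others]) for t in range(n_tests))
        )

    # Fraction of null 'constructions' whose best result is at least as good.
    return sum(1 for b in null_best if b <= observed_best) / len(null_best)
```

Because the same simulated configurations serve both the per-test and the construction-wide assessment, no additional null data need to be simulated; the cost lies in repeating the construction once per null configuration.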

3 RESULTS

We illustrate hapConstructor using a sample of 1128 independent breast cancer cases and 1149 independent controls from Sheffield, UK and 14 tSNPs in the CASP8 gene. Single-SNP tests yielded three SNPs with P-values below 0.05 (0.010–0.047). The construction process continued to the fifth step (five-locus haplotypes). A four-locus haplotype was identified as the most significantly associated haplotype, with an empirical P-value of 8.0×10⁻⁵ and a construction-wide FDR of 0.044, a result which is consistent with the established association between breast cancer and CASP8 (Cox et al., 2007). This four-locus haplotype contained only one of the three SNPs that had obtained significant single-SNP test results. HapConstructor completed the building process for the real data in 96 h with 100 000 MC simulations, on a machine with an Intel Pentium Core 2 Duo with 3.0 GHz per processor and 2 GB of memory. It required 7 days using 10 server nodes to complete 1000 simulated builds for the construction-wide significance assessment. To assess the potential added value of the construction process in our illustrative example, we analyzed all 14-SNP haplotypes with frequencies over 1% and also performed exhaustive sliding window analyses for window sizes of 2- to 6-SNP haplotypes. Of the fifteen 14-SNP haplotypes analyzed, only one obtained nominal significance (P=0.0357). For the sliding windows, 2351 tests were conducted and 314 were found to be nominally significant (P=0.0021–0.05, not accounting for multiple testing). The most significantly associated haplotypes were found in the four-, five- and six-locus window sizes. The results from both of these more standard approaches were inferior to the haplotype building in terms of significance and indicate that hapConstructor was a valuable approach and that exhaustive searches using contiguous multi-SNP sets are not the optimal solution in this situation.
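For scale, the size of the two search spaces for 14 SNPs and set sizes of 2 to 6 can be worked out directly. The short calculation below counts SNP sets only, not individual tests; each set presumably contributes several haplotype-specific tests, which is why the 2351 sliding-window tests reported above exceed the number of windows. The function names are hypothetical.

```python
from math import comb
from typing import Iterable


def contiguous_windows(n_snps: int, sizes: Iterable[int]) -> int:
    """Number of contiguous sliding windows of the given sizes."""
    return sum(n_snps - k + 1 for k in sizes)


def noncontiguous_subsets(n_snps: int, sizes: Iterable[int]) -> int:
    """Number of SNP subsets of the given sizes, contiguous or not."""
    return sum(comb(n_snps, k) for k in sizes)


sizes = range(2, 7)  # 2- to 6-SNP sets
print(contiguous_windows(14, sizes))     # 13 + 12 + 11 + 10 + 9 = 55
print(noncontiguous_subsets(14, sizes))  # 91 + 364 + 1001 + 2002 + 3003 = 6461
```

The exhaustive non-contiguous space is more than a hundred times larger than the sliding-window space, which is the motivation for a threshold-guided stepwise search rather than full enumeration.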

4 CONCLUSIONS

HapConstructor offers a data-mining approach to association analyses, allowing automatic and comprehensive construction of multi-locus SNP-set tests. It improves upon other methods in the variety of analyses and statistics performed, and the ability to appropriately assess global significance. Additional features are the immediate extension to mixtures of independent and related individuals, a virtue of the method being nested in Genie (Allen-Brady et al., 2006), and the ability to impute missing data. It should be noted, however, that the extension to related individuals is limited to an assumption of no recombination, as only under these conditions are MLE haplotype estimates using relatives unbiased. A limitation of hapConstructor, and MC testing in general, is computational burden. This is dependent upon the number of simulations (especially construction-wide assessment), sample size, number of SNPs considered and threshold values. Depending on the dataset being analyzed, hapConstructor may require significant time and computational resources to complete both the build process and construction-wide assessment. Construction-wide assessment may be intractable for large datasets due to time or resources. Despite the computational intensity, hapConstructor is a useful tool for exploring multi-locus associations in candidate genes and regions, and fulfills a current need of many investigators. Our future work will include more sophisticated heuristics for the construction process and extensions to interaction models.

REFERENCES (16 in total)

1. Clustering of haplotypes based on phylogeny: how good a strategy for association testing?

Authors: Claire Bardel; Pierre Darlu; Emmanuelle Génin
Journal: Eur J Hum Genet Date: 2006-02 Impact factor: 4.246

2. Association in multifactorial traits: how to deal with rare observations?

Authors: A-S Jannot; L Essioux; F Clerget-Darpoux
Journal: Hum Hered Date: 2004 Impact factor: 0.444

3. Characterization of linkage disequilibrium structure, mutation history, and tagging SNPs, and their use in association analyses: ELAC2 and familial early-onset prostate cancer.

Authors: Nicola J Camp; Jeff Swensen; Benjamin D Horne; James M Farnham; Alun Thomas; Lisa A Cannon-Albright; Sean V Tavtigian
Journal: Genet Epidemiol Date: 2005-04 Impact factor: 2.135

4. Fine mapping of disease genes via haplotype clustering.

Authors: E R B Waldron; J C Whittaker; D J Balding
Journal: Genet Epidemiol Date: 2006-02 Impact factor: 2.135

5. Multilocus association mapping using variable-length Markov chains.

Authors: Sharon R Browning
Journal: Am J Hum Genet Date: 2006-04-07 Impact factor: 11.025

6. Efficient multilocus association testing for whole genome association studies using localized haplotype clustering.

Authors: Brian L Browning; Sharon R Browning
Journal: Genet Epidemiol Date: 2007-07 Impact factor: 2.135

7. HaploBuild: an algorithm to construct non-contiguous associated haplotypes in family based genetic studies.

Authors: Jason M Laramie; Jemma B Wilk; Anita L DeStefano; Richard H Myers
Journal: Bioinformatics Date: 2007-06-22 Impact factor: 6.937

8. A common coding variant in CASP8 is associated with breast cancer risk.

Authors: Angela Cox; Alison M Dunning; Montserrat Garcia-Closas; Sabapathy Balasubramanian; Malcolm W R Reed; Karen A Pooley; Serena Scollen; Caroline Baynes; Bruce A J Ponder; Stephen Chanock; Jolanta Lissowska; Louise Brinton; Beata Peplonska; Melissa C Southey; John L Hopper; Margaret R E McCredie; Graham G Giles; Olivia Fletcher; Nichola Johnson; Isabel dos Santos Silva; Lorna Gibson; Stig E Bojesen; Børge G Nordestgaard; Christen K Axelsson; Diana Torres; Ute Hamann; Christina Justenhoven; Hiltrud Brauch; Jenny Chang-Claude; Silke Kropp; Angela Risch; Shan Wang-Gohrke; Peter Schürmann; Natalia Bogdanova; Thilo Dörk; Rainer Fagerholm; Kirsimari Aaltonen; Carl Blomqvist; Heli Nevanlinna; Sheila Seal; Anthony Renwick; Michael R Stratton; Nazneen Rahman; Suleeporn Sangrajrang; David Hughes; Fabrice Odefrey; Paul Brennan; Amanda B Spurdle; Georgia Chenevix-Trench; Jonathan Beesley; Arto Mannermaa; Jaana Hartikainen; Vesa Kataja; Veli-Matti Kosma; Fergus J Couch; Janet E Olson; Ellen L Goode; Annegien Broeks; Marjanka K Schmidt; Frans B L Hogervorst; Laura J Van't Veer; Daehee Kang; Keun-Young Yoo; Dong-Young Noh; Sei-Hyun Ahn; Sara Wedrén; Per Hall; Yen-Ling Low; Jianjun Liu; Roger L Milne; Gloria Ribas; Anna Gonzalez-Neira; Javier Benitez; Alice J Sigurdson; Denise L Stredrick; Bruce H Alexander; Jeffery P Struewing; Paul D P Pharoah; Douglas F Easton
Journal: Nat Genet Date: 2007-02-11 Impact factor: 38.330

9. PedGenie: an analysis approach for genetic association testing in extended pedigrees and genealogies of arbitrary size.

Authors: Kristina Allen-Brady; Jathine Wong; Nicola J Camp
Journal: BMC Bioinformatics Date: 2006-04-18 Impact factor: 3.169

10. Incorporating single-locus tests into haplotype cladistic analysis in case-control studies.

Authors: Jianfeng Liu; Chris Papasian; Hong-Wen Deng
Journal: PLoS Genet Date: 2007-03-23 Impact factor: 5.917
