Automated construction and testing of multi-locus gene-gene associations.

Ryan Abo, Stacey Knight, Alun Thomas, Nicola J Camp.

Abstract

It has been argued that the missing heritability in common diseases may be due in part to rare variants and gene-gene effects. Haplotype analyses provide more power for rare variants, and joint analyses across genes can address multi-gene effects. Currently, methods are lacking to perform joint multi-locus association analyses across more than one gene/region. Here, we present a haplotype-mining gene-gene analysis method, which considers multi-locus data for two genes/regions simultaneously. This approach extends our single region haplotype-mining algorithm, hapConstructor, to two genes/regions. It allows construction of multi-locus SNP sets at both genes and tests joint gene-gene effects and interactions between single variants or haplotype combinations. A Monte Carlo framework is used to assess the statistical significance of the joint and interaction statistics; thus, the method can also be used with related individuals. This tool provides a flexible data-mining approach to identifying gene-gene effects that is otherwise currently unavailable.

AVAILABILITY: http://bioinformatics.med.utah.edu/Genie/hapConstructor.html.

Year: 2010; PMID: 21076150; PMCID: PMC3008644; DOI: 10.1093/bioinformatics/btq616

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 INTRODUCTION

Haplotype and gene–gene analyses have been suggested as strategies to identify disease loci that single nucleotide polymorphism (SNP) approaches may have missed (Manolio et al.). Haplotypes have the potential for improved characterization of variation across the locus set (Clark, 2004; Schaid, 2004). Yet, it is usually unclear which haplotypes to test and how to model them. Numerous methods consider all haplotypes spanning the entire locus set, with attempts to reduce the degrees of freedom that this approach otherwise confers (Liu et al.; Tzeng and Zhang, 2007). Other techniques have been designed to analyze contiguous and non-contiguous locus subsets (Abo et al.; Browning, 2006; Browning and Browning, 2007; Laramie et al.; Lin, 2004). It has been hypothesized (Moore, 2003), and in some cases shown (Combarros et al.), that genetic factors at one gene can modify the effects of another gene on disease susceptibility. If such biological interaction exists, the association may only be evident by considering both genes simultaneously. Gene–gene studies are complicated by issues surrounding what constitutes a gene–gene interaction. For example, some approaches for testing interactions focus on association between two unlinked loci (Wu et al.; Zhao et al.); these approaches do not provide any measure of departure from additivity, which is how a statistical interaction is classically defined. Most often, haplotype analyses are performed for a single region and gene–gene studies concentrate on single SNPs in each region. Methods that consider multi-locus data at more than one gene would be desirable to maximize the ability to detect association evidence. One such method exists to test specific haplotype interactions at unlinked regions (Becker et al.). However, both haplotype and gene–gene analyses can result in high dimensionality, and how to combine them is therefore a challenging problem.
To address these challenges, we have extended our single region haplotype-mining approach (Abo et al.) to consider multi-locus data at two genes and test for association and interaction. We concentrate on a broad set of tests that considers both joint effects and interaction effects. In our gene–gene-mining process, data considered at each gene can be single or multi-locus. We anticipate that this gene–gene-mining approach will be most useful for hypothesis generation. However, if required, haplotype testing can also be performed using an empirical correction for multiple testing. Case–control and case-only designs are available, in addition to statistics to test joint and interaction effects. The method is implemented in a Monte Carlo (MC) testing framework, and empirical construction-wide significance assessment is available for hypothesis testing.

2 METHODS

For both genes/regions considered, maximum likelihood estimates (MLE) for all individuals' haplotype pairs and population haplotype frequencies are determined. All SNPs in each region and all individuals with sufficient data at both regions are considered (based on a user-defined genotype call rate threshold). Full-length MLE haplotypes, or sub-haplotypes extracted from them, are the genetic variables considered in the construction and testing process.
Consider h and k loci in unlinked genes, G1 = {M_1, …, M_h} and G2 = {M_{h+1}, …, M_{h+k}}. The full locus set is S = G1 ∪ G2. First, all single locus association tests are conducted. These single locus associations are assessed against the first significance threshold, T_1, which is user-defined. For any locus i with P-value ≤ T_1, all locus pairs {M_i, M_j | ∀ M_j ∈ S; j ≠ i} are considered at the second step. The locus pair {M_i, M_j} is the locus set, L, being considered. When the two loci in L span both genes, gene–gene tests between the loci are performed. When the loci in L are all within the same gene, the two loci are tested as a haplotype or composite genotype. Tests at step n are assessed at significance threshold T_n (∈ {T_1, …, T_{h+k}}), with the thresholds usually chosen to increase in stringency with n. A locus set can be written as L = {g1 g2 | g1 ⊂ G1 and g2 ⊂ G2}, where g1 denotes loci that reside in G1 and g2 those that reside in G2. In steps n > 2, if there are multiple SNPs in both genes, gene–gene tests between haplotypes across g1 and haplotypes across g2 will be performed. The steps continue until no further locus sets pass the defined threshold values or the full locus sets have been tested.
To avoid a strict uphill climb algorithm, which is susceptible to identifying local minima, we have incorporated a backward step. At each backward step, the algorithm considers subsets of size n − 1 from the current locus set that were not previously tested.
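The stepwise construction with its backward step can be illustrated with a minimal sketch. It is not the hapConstructor implementation: it works at the level of locus sets only (the actual method expands specific risk haplotypes), and the names `construct`, `kind_of_test`, `gene_of`, `p_value` and `thresholds` are hypothetical, with `p_value` standing in for whichever association test is applied to a locus set.

```python
from itertools import combinations

def kind_of_test(locus_set, gene_of):
    """Gene-gene test if the set spans both genes, otherwise a within-gene test."""
    genes = {gene_of[m] for m in locus_set}
    return "gene-gene" if len(genes) == 2 else "within-gene"

def forward_candidates(current, all_loci):
    """Locus sets of size n + 1 formed by adding one locus from S."""
    return [current | {m} for m in sorted(all_loci - current)]

def backward_candidates(current):
    """Non-empty subsets of size n - 1 of the current locus set."""
    if len(current) < 2:
        return []
    return [frozenset(c) for c in combinations(sorted(current), len(current) - 1)]

def construct(gene_of, p_value, thresholds):
    """Return (passing, tested): sets that met their threshold, and all P-values.

    gene_of    -- dict mapping locus name to gene label (assumed input format)
    p_value    -- hypothetical callable returning a P-value for a locus set
    thresholds -- list [T_1, T_2, ...]; the last entry is reused for larger sets
    """
    all_loci = frozenset(gene_of)
    tested, passing = {}, []

    def assess(locus_set):
        # Test each locus set at most once, against the threshold for its size.
        if locus_set in tested:
            return False
        tested[locus_set] = p_value(locus_set)
        t_n = thresholds[min(len(locus_set), len(thresholds)) - 1]
        return tested[locus_set] <= t_n

    # Step 1: all single-locus tests.
    frontier = [s for s in (frozenset([m]) for m in sorted(all_loci)) if assess(s)]
    while frontier:
        passing.extend(frontier)
        next_frontier = []
        for current in frontier:
            # Forward step (size n + 1) plus backward step (untested size n - 1).
            candidates = forward_candidates(current, all_loci) + backward_candidates(current)
            next_frontier.extend(c for c in candidates if assess(c))
        frontier = next_frontier
    return passing, tested
```

Because every locus set is tested at most once, the loop terminates when no untested set passes its threshold or when all sets have been examined.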
After a backward step, any subsets that pass the corresponding significance threshold are retained and the process continues forward again.
For locus sets where g1 and/or g2 are multi-locus, haplotypes or composite genotypes are considered. The algorithm considers each haplotype across g1 or g2 as a potential ‘risk haplotype’ and compares it with all other haplotypes grouped together. For any specific haplotype, this reduces the multi-locus data to a biallelic system, which can be used for standard allelic, dominant, recessive and additive models for testing both within and across genes. For composite genotype combinations, phase is unimportant: each locus in L is modeled separately as dominant or recessive, and the combinations of these are considered across loci. Hence, composite genotype tests can be performed within or across genes. To reduce the number of tests performed, at step n + 1 the algorithm only expands the specific risk haplotypes that passed the significance threshold (i.e. the alleles at loci from step n are fixed). A similar rule is applied to the composite genotypes. Single locus, haplotype and composite genotype models are tested using odds ratios, chi-square and chi-square trend association statistics.
For locus sets containing loci in two genes, L = {g1 g2 | g1 ⊂ G1 and g2 ⊂ G2}, an interaction odds ratio test and a correlation-based statistic are offered to identify gene–gene effects between the two locus sets, g1 and g2. As described above, multi-locus sets within genes are considered using biallelic recoding. We refer to specific haplotypes across g1 and g2 as h1 and h2. The interaction odds ratio between h1 and h2 is calculated using the method described by Thomas (2004), IOR = OR_mn / (OR_m0 × OR_0n), where m and n denote dominant or recessive models imposed on h1 and h2, respectively, and 0 indicates the wildtype. Under the null hypothesis, H0: IOR = 1, the odds of disease given h1 and h2 is the product of the odds of disease for each h.
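As an illustration of the biallelic recoding and the interaction odds ratio, the sketch below assumes the IOR takes the common form OR_mn / (OR_m0 × OR_0n), which is consistent with the stated null of multiplicative odds. The function names and the layout of the `counts` table are hypothetical and are not taken from the Genie software.

```python
def carrier(hap_pair, risk_hap, model):
    """Recode a haplotype pair as 0/1 for one risk haplotype versus all others."""
    copies = sum(1 for h in hap_pair if h == risk_hap)
    if model == "dominant":
        return int(copies >= 1)
    if model == "recessive":
        return int(copies == 2)
    raise ValueError("model must be 'dominant' or 'recessive'")

def odds_ratio(case_exposed, control_exposed, case_ref, control_ref):
    """Odds ratio of an exposed group relative to a reference group."""
    return (case_exposed * control_ref) / (case_ref * control_exposed)

def interaction_odds_ratio(counts):
    """IOR from counts[(a, b)] = (cases, controls).

    a and b are the 0/1 carrier codes at h1 and h2; (0, 0) is the
    wildtype reference. Assumed form: OR_mn / (OR_m0 * OR_0n).
    """
    ref_cases, ref_controls = counts[(0, 0)]
    or_mn = odds_ratio(*counts[(1, 1)], ref_cases, ref_controls)
    or_m0 = odds_ratio(*counts[(1, 0)], ref_cases, ref_controls)
    or_0n = odds_ratio(*counts[(0, 1)], ref_cases, ref_controls)
    return or_mn / (or_m0 * or_0n)
```

With counts in which the joint odds ratio equals the product of the two single-haplotype odds ratios, the function returns 1, matching the null hypothesis. Zero cells are not handled in this sketch.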
We have also implemented interaction tests based on correlation (Wu et al.; Zhao et al.). Correlations between specific haplotypes, h1 and h2, from locus sets g1 and g2 are computed. Following Wang et al., each individual i is assigned a value x_ij for locus set g_j based on its MLE haplotype pair. The correlation between h1 and h2 is estimated by the correlation coefficient r = Σ_i (x_i1 − x̄_1)(x_i2 − x̄_2) / (N s_1 s_2), where x̄_j = (1/N) Σ_i x_ij and s_j² = (1/N) Σ_i (x_ij − x̄_j)², j = (1, 2), and N is the number of individuals. This correlation coefficient is an estimate of the composite correlation statistic (Zaykin et al.), which is robust to Hardy–Weinberg disequilibrium. For a case–control study design, the method tests H0: r_case − r_control = 0. For a case-only design, the method tests H0: r_case = 0, and the first step in the automated process considers the correlation between pairs of single SNPs. We also note the availability of meta-statistics for analyzing multiple datasets.
Statistical significances are determined with an MC procedure. The validity of the MC procedure is based on properly matching the null simulations with the observed data with regard to pedigree structure, missing data structure and phasing procedure (Curtis and Sham, 2006). Our MC procedure is based on a two-region multi-locus gene-drop. In both regions, haplotype pairs are assigned to founders and independent individuals based on the estimated full-length haplotype frequencies. Full-length haplotypes for both regions are then assigned to pedigree descendants using gene-dropping techniques based on Mendelian inheritance (MacCluer et al.). The missing data structure is then imposed on the simulated multi-locus genotype data and the known phase is ignored. These simulated data are then statistically phased, to match the procedure performed with the observed data. The procedure generates null genotype configurations from which null statistics are calculated and a null empirical distribution created. It must be noted that this MC procedure assumes a null of no linkage and no association.
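The correlation statistic and the MC significance assessment can be sketched as follows. Several points are assumptions made for illustration only: each individual's value is taken to be the number of copies (0, 1 or 2) of the specific haplotype, the gene-drop is reduced to founders plus a single child with no missing data or re-phasing step, and the empirical P-value uses a conventional add-one correction. None of the function names come from the Genie software.

```python
import math
import random

def copies(hap_pair, hap):
    """Assumed coding: number of copies (0, 1 or 2) of a specific haplotype."""
    return sum(1 for h in hap_pair if h == hap)

def pearson(x1, x2):
    """Correlation coefficient between the per-individual values at g1 and g2."""
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    var1 = sum((a - mean1) ** 2 for a in x1)
    var2 = sum((b - mean2) ** 2 for b in x2)
    if var1 == 0 or var2 == 0:
        return 0.0  # no variation: correlation is undefined, reported as 0 here
    return cov / math.sqrt(var1 * var2)

def case_control_statistic(cases, controls):
    """r_case - r_control, where each argument is a pair (x1, x2) of value lists."""
    return pearson(*cases) - pearson(*controls)

def draw_founder(freqs, rng):
    """Assign a haplotype pair to a founder from estimated haplotype frequencies."""
    haps = sorted(freqs)
    return tuple(rng.choices(haps, weights=[freqs[h] for h in haps], k=2))

def drop_to_child(father, mother, rng):
    """Mendelian transmission of one full-length haplotype from each parent."""
    return (rng.choice(father), rng.choice(mother))

def empirical_p(observed, null_stats):
    """Two-sided empirical P-value from statistics computed on null configurations."""
    exceed = sum(1 for s in null_stats if abs(s) >= abs(observed))
    return (exceed + 1) / (len(null_stats) + 1)
```

Under the stated null of no linkage, `draw_founder` and `drop_to_child` would be applied independently to each of the two regions.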
If strong linkage exists (but no association), there is the potential for inflated type 1 errors, although in simulations we find that for reasonable linkage models the MC procedure remains a good approximation for the null and type 1 errors remain valid. Correction for the data-mining process is also available and, if selected, will provide construction-wide significance and false discovery rates. Correction for construction is implemented in the same way as for hapConstructor (Abo et al.), where the null distribution for a complete construction run is generated by conducting the same search process starting from 1000 null configurations.
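One way to picture the construction-wide correction is sketched below. It assumes that each null construction run is summarized by its smallest P-value and that an observed P-value is compared against that distribution; this is a common approach to empirical multiple-testing correction and is an assumption here, not a description of the hapConstructor code. `run_search` is a hypothetical callable that repeats the full search on one null configuration and returns the P-values it encountered.

```python
def null_minimum_p_values(run_search, null_configs):
    """Smallest P-value found by the full search on each null configuration."""
    return [min(run_search(config)) for config in null_configs]

def construction_wide_p(observed_p, null_min_ps):
    """Fraction of null construction runs whose best result is at least as extreme."""
    hits = sum(1 for p in null_min_ps if p <= observed_p)
    return hits / len(null_min_ps)
```

With 1000 null configurations, the smallest non-zero construction-wide P-value this sketch can report is 0.001.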

3 IMPLEMENTATION

Our method is implemented as a Java-based program. It is an extension of the hapConstructor module (Abo et al.) in the Genie software (Allen-Brady et al.). The program can be run on Windows, Unix or Linux machines with Java 1.6 and at least 2 GB of RAM. An example dataset consisting of 14 SNPs in one gene and 11 SNPs in the second gene required 7 h and 11 min with 4 GB of memory to complete building to step 3. Parameter options for this example included default critical thresholds, 10 000 null simulations and no construction-wide assessment. This example may not be a useful guide to the running time of other analyses, because many factors affect the running time of the program. These include: number of SNPs, number of samples, number of null simulations selected for significance assessment, critical thresholds selected for the steps in the building process, use of the multiple-testing correction procedure and whether or not there is an association signal. Program details, including the example described above, are available at http://bioinformatics.med.utah.edu/Genie/hapConstructor.html.
Funding: R.A. is an NLM fellow (grant T15 LM0724); National Institutes of Health (CA 098364); the Susan G. Komen Foundation and the Avon Foundation Breast Cancer Fund (to N.J.C.).
Conflict of Interest: none declared.

19 in total

1. Multilocus association mapping using variable-length Markov chains.

2. Test for interaction between two unlinked loci.

3. Efficient multilocus association testing for whole genome association studies using localized haplotype clustering.

4. Improving power in contrasting linkage-disequilibrium patterns between cases and controls.

5. HaploBuild: an algorithm to construct non-contiguous associated haplotypes in family based genetic studies.

6. Estimated haplotype counts from case-control samples cannot be treated as observed counts.

7. Composite measure of linkage disequilibrium for testing interaction between unlinked loci.

8. Haplotype-based association analysis via variance-components score test.

9. Incorporating single-locus tests into haplotype cladistic analysis in case-control studies.

10. hapConstructor: automatic construction and testing of haplotypes in a Monte Carlo framework.
