
Use of diplotypes - matched haplotype pairs from homologous chromosomes - in gene-disease association studies.

Lingjun Zuo¹, Kesheng Wang², Xingguang Luo¹.

Abstract

Alleles, genotypes and haplotypes (combinations of alleles) have been widely used in gene-disease association studies. More recently, association studies using diplotypes (haplotype pairs on homologous chromosomes) have become increasingly common. This article reviews the rationale of the four types of association analyses and discusses the situations in which diplotype-based analyses are more powerful than the other types of association analyses. Haplotype-based association analyses are more powerful than allele-based association analyses, and diplotype-based association analyses are more powerful than genotype-based analyses. In circumstances where there are no interaction effects between markers and where the criteria for Hardy-Weinberg Equilibrium (HWE) are met, the larger sample size and smaller degrees of freedom of allele-based and haplotype-based association analyses make them more powerful than genotype-based and diplotype-based association analyses, respectively. However, under certain circumstances diplotype-based analyses are more powerful than haplotype-based analysis.


Keywords: Hardy-Weinberg equilibrium; association analysis; diplotype; genotypes; haplotype; interaction effects

Year: 2014 PMID: 25114493 PMCID: PMC4118015 DOI: 10.3969/j.issn.1002-0829.2014.03.009

Source DB: PubMed Journal: Shanghai Arch Psychiatry ISSN: 1002-0829

Introduction: definition and composition of diplotypes

Humans are diploid organisms; they have paired homologous chromosomes in their somatic cells, which contain two copies of each gene. An allele is one member of a pair of genes occupying a specific position on a chromosome (called a locus). Two alleles at the same locus on homologous chromosomes make up the individual’s genotype. A haplotype (a contraction of the term ‘haploid genotype’) is a combination of alleles at multiple loci that are transmitted together on the same chromosome. A haplotype may span as few as two loci or an entire chromosome, depending on the number of recombination events that have occurred between a given set of loci. Genewise haplotypes are established with markers within a gene; familywise haplotypes are established with markers within members of a gene family; and regionwise haplotypes are established with markers in different genes in a region of the same chromosome. Finally, a diplotype is a matched pair of haplotypes on homologous chromosomes (see Figure 1).[1]

Figure 1.

Model of alleles, genotypes, haplotypes and diplotypes on a pair of chromosomes

Traditionally, the expectation-maximization (EM) algorithm has been used to estimate haplotype frequencies.[2],[3] This algorithm assumes Hardy-Weinberg Equilibrium (HWE).[4] If the genotype frequency distributions of individual markers are not in HWE, this assumption is violated. The magnitude of the error of the EM estimates is greater when the HWE violation (the so-called Hardy-Weinberg Disequilibrium [HWD]) is attributable to an expected heterozygote frequency that is greater than the observed heterozygote frequency.[4] Several programs can be used to construct both haplotypes and diplotypes. The HelixTree program[5] is based on the EM algorithm. New-generation programs such as the PHASE program are based on the Bayesian approach and the Partition Ligation algorithm; their proponents claim that they are more accurate in constructing haplotypes than the traditional programs based on the EM algorithm.[6],[7],[8] Both HelixTree and PHASE can estimate the diplotype frequency distributions in a population and the diplotype probabilities for each individual. The probabilities of unambiguously observed diplotypes for each individual estimated by these programs should be 1.0; the probabilities of inferred diplotypes for each subject will be between 0.0 and 1.0.
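To make the EM idea concrete, the following is a minimal sketch for the simplest case of two biallelic markers; it is not the implementation used by HelixTree or any other program named above. The function name `em_haplotype_freqs`, the 0/1 allele coding and the example genotypes are assumptions made for illustration, and the sketch assumes HWE and a non-empty sample. Only double heterozygotes are phase-ambiguous, so the E-step splits them between the two possible haplotype pairs in proportion to the current frequency estimates.

```python
from collections import Counter

HAPLOTYPES = ("00", "01", "10", "11")  # allele at locus 1, then allele at locus 2


def em_haplotype_freqs(genotypes, max_iter=200, tol=1e-10):
    """Estimate two-locus haplotype frequencies from unphased genotypes.

    genotypes: non-empty list of (g1, g2) pairs, where each g is the number
    of copies (0, 1 or 2) of allele "1" at that locus. Assumes HWE.
    """
    counts = Counter(genotypes)
    n_chromosomes = 2 * len(genotypes)
    freqs = dict.fromkeys(HAPLOTYPES, 0.25)  # uninformative starting values

    for _ in range(max_iter):
        expected = dict.fromkeys(HAPLOTYPES, 0.0)
        for (g1, g2), n in counts.items():
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10.
                cis = freqs["00"] * freqs["11"]
                trans = freqs["01"] * freqs["10"]
                p_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
                expected["00"] += n * p_cis
                expected["11"] += n * p_cis
                expected["01"] += n * (1 - p_cis)
                expected["10"] += n * (1 - p_cis)
            else:
                # At most one heterozygous locus: the phase is unambiguous.
                hap_a = f"{int(g1 == 2)}{int(g2 == 2)}"
                hap_b = f"{int(g1 >= 1)}{int(g2 >= 1)}"
                expected[hap_a] += n
                expected[hap_b] += n
        # M-step: re-estimate frequencies from the expected haplotype counts.
        new_freqs = {h: expected[h] / n_chromosomes for h in HAPLOTYPES}
        converged = max(abs(new_freqs[h] - freqs[h]) for h in HAPLOTYPES) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


# Hypothetical sample: 10 double heterozygotes plus 5 of each double homozygote.
sample = [(1, 1)] * 10 + [(0, 0)] * 5 + [(2, 2)] * 5
print(em_haplotype_freqs(sample))
```

The per-individual quantity `p_cis` plays the role of an inferred diplotype probability: it is 1.0 only when the phase is unambiguous and otherwise lies between 0.0 and 1.0.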

Diplotype-based association analysis: application and interpretation

Haplotype-based and diplotype-based association analyses are more powerful than allele-based and genotype-based analyses.[9],[10],[11] Under certain circumstances (reviewed below), diplotype-based analysis is more powerful than haplotype-based analysis. Under these specific circumstances, diplotype-based association analysis is the most powerful of the four types of association analyses, a finding that has been confirmed in about 200 studies since 2002.[12],[13] For example, Lee and colleagues[14] found that the 111 haplotype of the Calpain-10 gene was associated with an increased risk of polycystic ovary syndrome (PCOS) (OR=2.4; 95% CI 1.8–3.3), the 112 haplotype was associated with a decreased risk of PCOS (OR=0.6; 95% CI 0.4–0.8), and the 121 haplotype was not associated with PCOS; however, the 111/121 diplotype was more strongly associated with increased susceptibility to PCOS than any of the haplotypes (OR=3.4; 95% CI 2.2–5.2). Luo and colleagues[15],[16],[17],[18],[19],[20],[21],[22] reported that the diplotypes at ADH1A, 1B, 1C, 4 and 7, CHRM2, OPRM1, OPRD1 and OPRK1 were much more strongly associated with alcohol dependence, drug dependence and personality factors than the alleles, genotypes and haplotypes at these sites. And Li and colleagues[23] found that specific growth traits were significantly associated with the diplotypes of four individual SNPs at IGF-II but not with the haplotypes of these SNPs. Similar findings have been reported in other studies.[24],[25] There are several possible interpretations of these findings:

Haplotypes and diplotypes contain more information than alleles and genotypes

As shown in Figure 1, a haplotype is a combination of alleles from multiple loci on a single chromosome, a genotype is composed of two alleles on homologous chromosomes, and a diplotype is composed of two haplotypes (i.e., multiple genotypes) on homologous chromosomes. Theoretically, the information contained in a multi-locus haplotype is greater than that in a single-locus allele and the information contained in a multi-locus diplotype is greater than that contained in a single-locus genotype. Similarly, haplotypes with more alleles contain more information than those with fewer alleles, and diplotypes with more genotypes contain more information than those with fewer genotypes. A multi-locus haplotype is a specific variant of all possible combinations of single-locus alleles on the chromosome; both alleles and haplotypes reflect the features of chromosomes in the population. A diplotype is a specific variant of all possible combinations of single-locus genotypes on the paired chromosomes; both genotypes and diplotypes represent the types of chromosome pairs in each individual (see Table 1). A diplotype can also be conceptualized as a specific variant of all possible combinations of haplotypes from the two participating chromosomes. So haplotype-based analyses are equivalent to a stratified analysis of all alleles (at all loci), and diplotype-based analyses are equivalent both to a stratified analysis of all genotypes at all loci and to a stratified analysis of all haplotypes. Thus, when the sample size is sufficiently large, haplotype- and diplotype-based analyses should be more powerful than allele-based and genotype-based analyses. Similarly, the analysis of an individual diplotype should be more informative than analysis of the corresponding individual haplotype.

Two alleles at one biallelic marker can divide the chromosomes in a population into two categories; these two alleles would result in three genotypes at the specified marker on homologous chromosomes and, thus, could be used to divide the individuals in a population into three categories. Assuming n independent biallelic markers, up to 2ⁿ haplotypes constructed from these n markers can divide the chromosomes in a population into 2ⁿ categories. At the same time, n independent biallelic markers would result in up to 2ⁿ(2ⁿ+1)/2 diplotypes on the paired chromosomes, dividing the individuals in a population into 2ⁿ(2ⁿ+1)/2 categories. (Note: each of these 2ⁿ(2ⁿ+1)/2 diplotype categories is a subset of one of the 2ⁿ haplotype categories.) When the sample size is large enough, dividing a sample into more categories increases the ability to identify meaningful variance between different subgroups in the sample, so haplotype-based and diplotype-based analyses are more powerful than allele-based and genotype-based analyses and an individual’s diplotype is more informative than an individual’s haplotype. However, the overall diplotype-based analysis may not be more powerful than the corresponding haplotype-based analysis because in some situations the much greater degrees of freedom in a diplotype-based analysis weaken the strength of the identified associations.

The multi-locus haplotype and diplotype are composed of multiple markers that are in linkage disequilibrium (LD). They contain information from all of these individual markers and from several unknown flanking markers on the same chromosome. They are, therefore, usually more informative and closer to representing a ‘whole gene’ than single-marker alleles and genotypes. This is particularly the case when several of the known and unknown markers are etiologically related to the disease(s) of interest.[9],[10],[11]
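The counting argument (at most 2ⁿ haplotypes and 2ⁿ(2ⁿ+1)/2 diplotypes for n biallelic markers, with the degrees of freedom listed in Table 1) can be checked with a short sketch. The function names and the "1"/"2" allele labels are assumptions chosen for illustration; the brute-force enumeration simply confirms the closed-form counts.

```python
from itertools import combinations_with_replacement, product


def max_haplotypes(n_markers):
    """Upper bound on haplotype categories for n biallelic markers: 2^n."""
    return 2 ** n_markers


def max_diplotypes(n_markers):
    """Upper bound on diplotype categories: unordered haplotype pairs."""
    h = max_haplotypes(n_markers)
    return h * (h + 1) // 2


def enumerate_diplotypes(n_markers):
    """List every unordered pair of haplotypes (alleles labelled 1 and 2)."""
    haplotypes = ["".join(a) for a in product("12", repeat=n_markers)]
    return list(combinations_with_replacement(haplotypes, 2))


for n in range(1, 5):
    h, d = max_haplotypes(n), max_diplotypes(n)
    assert len(enumerate_diplotypes(n)) == d  # enumeration matches the formula
    print(f"n={n}: haplotypes={h} (df={h - 1}), diplotypes={d} (df={d - 1})")
```

With a single marker the counts reduce to the familiar two alleles and three genotypes; the gap between the two degrees of freedom widens quickly as markers are added.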

Genotype-based and diplotype-based analyses remain valid in the presence of Hardy-Weinberg Disequilibrium

When the genotype frequency distributions of some markers are not in Hardy-Weinberg Equilibrium, the allele-based and haplotype-based analyses become less powerful and may be invalid, but the genotype-based and diplotype-based analyses are still valid. When there is Hardy-Weinberg Disequilibrium, the marker alleles and haplotypes are not independent of each other, so the effects of disease-predisposing alleles and haplotypes may be ‘masked’ by other non-predisposing alleles and haplotypes[25] or, in the case of a recessive condition, by the presence of a dominant allele on the homologous chromosome. This weakens or invalidates the association between the allele or haplotype and the disease(s) of interest. However, genotype-based and diplotype-based association analyses remain valid even in the presence of strong Hardy-Weinberg Disequilibrium. This has been demonstrated in several studies.[15],[16],[17],[18],[27],[28],[29],[30]
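Whether a marker departs from HWE is usually checked before choosing between these analyses. The text does not specify a test, so the sketch below assumes the common one-degree-of-freedom chi-square goodness-of-fit test for a biallelic marker; the function name and the genotype counts are hypothetical, and the counts are assumed to be non-zero in total. The p-value uses the identity that the survival function of a chi-square variable with 1 df is erfc(sqrt(x/2)).

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions at one marker.

    n_aa, n_ab, n_bb: observed counts of the three genotypes.
    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival, 1 df
    return chi2, p_value


# Hypothetical counts: a heterozygote deficit relative to HWE expectations.
chi2, p_value = hwe_chi_square(40, 20, 40)
print(f"chi2={chi2:.2f}, p={p_value:.3g}")
```

For small samples or rare alleles an exact test is generally preferred over this approximation.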

Haplotype and diplotype analyses incorporate interaction effects and, thus, are more informative when interaction between assessed markers is present

Haplotypes and diplotypes incorporate information on linkage disequilibrium among markers, so information on the multivariate interaction effects between markers is incorporated into haplotype-based and diplotype-based analyses.[31] In most cases,[18],[20],[21],[22] reported interaction effects between alleles and between genotypes are similar to those seen with corresponding multi-locus haplotype-based and diplotype-based analyses; this supports the contention that diplotype-based analyses incorporate information on the interactions between different markers and between different haplotypes. The interaction effect is often a more powerful predictor of disease status than the main effect,[32] especially when the main effects are marginal,[33] so when interaction effects occur, diplotype-based association analyses would likely be more informative than association analyses based on haplotypes, genotypes or alleles.

Using quantitative measures instead of categorical measures makes diplotype-based analysis more powerful

Programs implementing the Bayesian approach can estimate the probabilities of all possible pairs of haplotypes (i.e., a ‘full model’ in which the probabilities of all diplotype categories are assessed) or the probabilities of the most relevant subset of diplotype categories (i.e., a “reduced” model) for each individual. The estimated diplotype probabilities are quantitative measures so they usually preserve more information than the original categorical list of the different diplotype categories. Thus the analyses are more powerful if they employ diplotype probabilities instead of diplotype categories.[17]
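The difference between a categorical and a quantitative coding can be sketched as follows. The input format (a dictionary of diplotype labels to posterior probabilities for one individual), the function names and the example values are assumptions for illustration and do not reproduce the output format of any particular program.

```python
def hard_call(diplotype_probs):
    """Categorical coding: keep only the most probable diplotype."""
    return max(diplotype_probs, key=diplotype_probs.get)


def one_hot_features(diplotype_probs, categories):
    """Predictor vector that treats the best guess as if it were certain."""
    best = hard_call(diplotype_probs)
    return [1.0 if c == best else 0.0 for c in categories]


def probability_features(diplotype_probs, categories):
    """Predictor vector that keeps the estimated probability of each category."""
    return [diplotype_probs.get(c, 0.0) for c in categories]


# Hypothetical individual whose phase is ambiguous between two diplotypes.
categories = ["111/112", "111/121", "112/121"]
individual = {"111/121": 0.7, "111/112": 0.3}

print(one_hot_features(individual, categories))      # uncertainty discarded
print(probability_features(individual, categories))  # uncertainty retained
```

In a regression framework the probability vector would be used as the predictor in place of the one-hot vector, so that ambiguous individuals contribute in proportion to what is actually known about them.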

Avoiding multiple testing preserves the power of haplotype-based and diplotype-based analyses

When testing the association between single markers and a phenotype, multiple independent tests are required so the analysis needs to be adjusted for multiple testing, which reduces the power of the analysis to identify significant differences between groups. But there is no need to adjust for multiple testing when incorporating multiple markers into haplotype-based or diplotype-based analyses, preserving the power of the analysis.[34] This is another reason that haplotype-based and diplotype-based association analyses are more powerful than single-locus analyses.
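The cost of correcting single-marker tests can be illustrated with a small sketch. The text does not name a correction method; Bonferroni is assumed here only because it is the simplest, and the p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag which single-marker tests survive the correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five single-marker tests.
single_marker_p = [0.004, 0.04, 0.2, 0.5, 0.03]
print(bonferroni_significant(single_marker_p))

# One joint haplotype- or diplotype-based test is judged against alpha itself.
print(bonferroni_threshold(0.05, 1))
```

Three of the five hypothetical markers fall below 0.05 before correction, but only one remains significant afterwards, whereas a single multi-marker test is not penalized in this way.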

Discussion: conclusion and future aspects

This review shows that haplotype-based association analyses are more powerful than allele-based association analyses and that diplotype-based association analyses are more powerful than genotype-based analyses. Moreover, under certain circumstances, diplotype-based analyses are more powerful than haplotype-based analyses. Thus, in circumstances where very large sample sizes are available, diplotype-based association analysis is the most powerful of the four potential analytic strategies. The sample sizes of association analyses based on alleles and haplotypes are twice those of the corresponding association analyses based on genotypes and diplotypes, and the degrees of freedom in allele-based and haplotype-based analyses are much smaller than those of the corresponding genotype-based and diplotype-based analyses. Thus, in circumstances where there are no interaction effects between markers and where the criteria for Hardy-Weinberg Equilibrium are met, allele-based association analyses are more powerful than genotype-based analyses and haplotype-based association analyses are more powerful than diplotype-based analyses.[9],[33] However, in several other circumstances the diplotype-based analysis is more powerful than the haplotype-based analysis: (a) when there are interaction effects between haplotypes, (b) when there is Hardy-Weinberg Disequilibrium, and (c) when considering a recessive model of inheritance.[33] One disadvantage of diplotype-based analysis compared to haplotype-based analysis is that there are typically more rare diplotype categories (i.e., categories with few individuals) than rare haplotype categories. For each category, no matter how small, an additional degree of freedom needs to be included in the analysis, which reduces the power of diplotype-based association tests more than that of haplotype-based association tests. Strategies to deal with rare observations include excluding such categories or merging them with other categories.[29],[33]
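The merging strategy can be sketched in a few lines. The minimum count of 5, the pooled label and the example diplotype calls are assumptions for illustration; the cited papers should be consulted for how a threshold is actually chosen.

```python
from collections import Counter


def pool_rare_categories(diplotypes, min_count=5, pooled_label="rare"):
    """Merge diplotype categories observed fewer than min_count times."""
    counts = Counter(diplotypes)
    return [d if counts[d] >= min_count else pooled_label for d in diplotypes]


def degrees_of_freedom(categories):
    """Degrees of freedom of a test across categories: categories minus one."""
    return len(set(categories)) - 1


# Hypothetical sample of diplotype calls with two sparsely observed categories.
calls = ["111/111"] * 20 + ["111/121"] * 12 + ["112/121"] * 2 + ["121/121"] * 1
pooled = pool_rare_categories(calls)

print(degrees_of_freedom(calls), "->", degrees_of_freedom(pooled))
```

Pooling trades resolution for power: the two rare diplotypes can no longer be distinguished, but the test spends one fewer degree of freedom on categories that hold only three individuals between them.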

Table 1

Comparison of haplotype-based and diplotype-based association analyses

	Haplotype-based association analysis	Diplotype-based association analysis

Composition	A haplotype is a subset of all alleles on specific chromosomes in the population.	A diplotype is a subset of all genotypes on homologous chromosome pairs in the population. A specific diplotype is one variant of all possible combinations of the haplotypes that exist in the population.
Feature	Both alleles and haplotypes reflect the components of chromosomes in individuals and in the population.	Both genotypes and diplotypes reflect the components of chromosome pairs in individuals and in the population.
n independent single-nucleotide polymorphisms (SNPs)	At most 2ⁿ haplotypes	At most 2ⁿ(2ⁿ+1)/2 diplotypes.
Degrees of freedom in analysis	2ⁿ-1	[2ⁿ(2ⁿ+1)/2]-1
Markers not in Hardy-Weinberg Equilibrium (HWE)	Less powerful predictor of disease status	More powerful predictor of disease status
Recessive genetic model	Less powerful predictor of disease status	More powerful predictor of disease status
With interaction	Less powerful predictor of disease status	More powerful predictor of disease status
Without interaction	More powerful predictor of disease status	Less powerful predictor of disease status
Sample size (n individuals)	2n	n
Frequency of rare categories	Less common	More common (decrease power)

33 in total

1. Haplotypes vs single marker linkage disequilibrium tests: what do we gain?

Authors: J Akey; L Jin; M Xiong
Journal: Eur J Hum Genet Date: 2001-04 Impact factor: 4.246

2. A comparison of bayesian methods for haplotype reconstruction from population genotype data.

Authors: Matthew Stephens; Peter Donnelly
Journal: Am J Hum Genet Date: 2003-10-20 Impact factor: 11.025

3. Tests of association between quantitative traits and haplotypes in a reduced-dimensional space.

Authors: Qiuying Sha; Jianping Dong; Renfang Jiang; Shuanglin Zhang
Journal: Ann Hum Genet Date: 2005-11 Impact factor: 1.670

4. Association in multifactorial traits: how to deal with rare observations?

Authors: A-S Jannot; L Essioux; F Clerget-Darpoux
Journal: Hum Hered Date: 2004 Impact factor: 0.444

5. Genome-wide strategies for detecting multiple loci that influence complex diseases.

Authors: Jonathan Marchini; Peter Donnelly; Lon R Cardon
Journal: Nat Genet Date: 2005-03-27 Impact factor: 38.330

6. Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus.

Authors: D M Nielsen; M G Ehm; B S Weir
Journal: Am J Hum Genet Date: 1998-11 Impact factor: 11.025

7. ADH4 gene variation is associated with alcohol dependence and drug dependence in European Americans: results from HWD tests and case-control association studies.

Authors: Xingguang Luo; Henry R Kranzler; Lingjun Zuo; Jaakko Lappalainen; Bao-zhu Yang; Joel Gelernter
Journal: Neuropsychopharmacology Date: 2006-05 Impact factor: 7.853

8. Relationship of CYP3A5 genotype and ABCB1 diplotype to tacrolimus disposition in Brazilian kidney transplant patients.

Authors: Diego Alberto C Cusinato; Riccardo Lacchini; Elen A Romao; Miguel Moysés-Neto; Eduardo B Coelho
Journal: Br J Clin Pharmacol Date: 2014-08 Impact factor: 4.335

9. A high IL-4 production diplotype is associated with an increased risk but better prognosis of oral and pharyngeal carcinomas.

Authors: Cheng-Mei Yang; Hung-Chih Chen; Yu-Yi Hou; Ming-Chien Lee; Huei-Han Liou; Sin-Jhih Huang; Liang-Ming Yen; Dong-Mei Eng; Yao-Dung Hsieh; Luo-Ping Ger
Journal: Arch Oral Biol Date: 2013-10-06 Impact factor: 2.633

10. A multilocus likelihood approach to joint modeling of linkage, parental diplotype and gene order in a full-sib family.

Authors: Qing Lu; Yuehua Cui; Rongling Wu
Journal: BMC Genet Date: 2004-07-26 Impact factor: 2.797
