
Estimating heritability of complex traits from genome-wide association studies using IBS-based Haseman-Elston regression.

Abstract

Exploring heritability of complex traits is a central focus of statistical genetics. Among various previously proposed methods to estimate heritability, variance component methods are advantageous when estimating heritability using markers. Due to the high-dimensional nature of data obtained from genome-wide association studies (GWAS) in which genetic architecture is often unknown, the most appropriate heritability estimator model is often unclear. The Haseman-Elston (HE) regression is a variance component method that was initially only proposed for linkage studies. However, this study presents a theoretical basis for a modified HE that models linkage disequilibrium for a quantitative trait, and consequently can be used for GWAS. After replacing identical by descent (IBD) scores with identity by state (IBS) scores, we applied the IBS-based HE regression to single-marker association studies (scenario I) and estimated the variance component using multiple markers (scenario II). In scenario II, we discuss the circumstances in which the HE regression and the mixed linear model are equivalent; the disparity between these two methods is observed when a covariance component exists for the additive variance. When we extended the IBS-based HE regression to case-control studies in a subsequent simulation study, we found that it provided a nearly unbiased estimate of heritability, more precise than that estimated via the mixed linear model. Thus, for the case-control scenario, the HE regression is preferable. GEnetic Analysis Repository (GEAR; http://sourceforge.net/p/gbchen/wiki/GEAR/) software implemented the HE regression method and is freely available.


Keywords: GWAS; Haseman–Elston regression; REML; case-control; identity by state; missing heritability; mixed linear model; variance component

Year: 2014 PMID: 24817879 PMCID: PMC4012219 DOI: 10.3389/fgene.2014.00107

Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599

Introduction

So-called “missing heritability” can arise for various reasons, such as small sample size, an underrepresented variant spectrum, experimental design, and improper methodological assumptions (Manolio et al., 2009). Because of the high-dimensional nature of genome-wide association study (GWAS) data, in which the number of markers (M) is far greater than the number of individuals (N), estimating heritability is difficult. For instance, if the statistical power is insufficient, variants with small effects may not be captured under a stringent p-value threshold (~10^−8). This obstacle can be partially bypassed by implementing the mixed linear model, which uses the genetic relationship between individuals estimated from single nucleotide polymorphism (SNP) markers in lieu of fitting hundreds of thousands of markers together (Yang et al., 2010). Nevertheless, how an estimator should be adjusted for the genetic architecture has recently been disputed. Speed et al. (2012) suggested using a weighted genetic relationship matrix under different genetic architectures, which are often unknown. As demonstrated in large-scale empirical data studies (Lee et al., 2013), Speed's ad hoc weighting method depends on the genetic architecture and often does not outperform plain (unweighted) methods. As the genetic architecture, such as the relationship between variant frequency and variant effect, is often unknown, criteria should be established to justify the model used to estimate heritability. For GWAS, as many samples are collected to study diseases, many studies eventually adopt a case-control design. Due to ascertainment in case-control studies, scale transformation is necessary. Without scale transformation, the heritability on the observed scale can be greater than 1, rendering the estimated heritability meaningless, as it is not representative of the heritability on the liability scale, which is more interpretable (Falconer, 1966) for disease data.
An equation (Lee et al., 2011) that transforms heritability from the observed scale to the liability scale has been proposed (as it is the 23rd equation in Sang Hong Lee's paper, it is henceforth denoted as Hong23) and was investigated under the infinitesimal model, for which the number of causal loci is infinite. However, in practice, the number of disease loci is limited for many diseases (Yang et al., 2011), which raises the question of whether Hong23 works well for mixed linear model estimates if the infinitesimal model does not hold. All of the above concerns are related to the heritability estimated via variance component methods implemented thus far in mixed linear models. The Haseman–Elston (HE) regression is a well-established method for estimating variance components (Haseman and Elston, 1972). The HE regression, a well-known tool for linkage studies that uses identity by descent (IBD) (Lynch and Walsh, 1998; Hill and Weir, 2011) scores, however, seems a rusty weapon in the genomics analysis armory of the GWAS era. This is because the HE regression relies on relatedness measured by IBD rather than identity by state (IBS). Although IBS has been employed for linkage analysis, such as under the affected-pedigree-member design (Lange, 1986b; Weeks and Lange, 1988; Bishop and Williamson, 1990), its performance is largely dependent on marker polymorphism and may cause a high false-positive rate when ad hoc weighting functions or incorrect frequencies are adopted. As an underrepresented concept of the linkage era, IBS was neither well adapted to linkage studies nor employed in the original HE regression framework. Taken together, the following questions remain. Can the HE regression be applied in an IBS context such as GWAS? If the answer is affirmative, what is the theoretical basis and the genetic interpretation in this new context?
An equivalence has been established between the HE regression and the variance component method (Sham et al., 2002) for linkage studies. Does this equivalence hold for high-dimensional data such as GWAS data, and what are its assumptions? If the IBS-based HE regression is applied to case-control studies, can it estimate heritability better than the mixed linear model? Given the recent dispute regarding heritability estimation of complex traits, can the HE regression provide further justification? Recently, a new theory using standardized IBS has paved another route to assess genetic relatedness (Ritland, 1996; Powell et al., 2010; Yang et al., 2010) between individuals who are unrelated in the conventional sense. The IBS score resembles the conventional IBD score (Powell et al., 2010), which raises the question of whether this IBS score can be used in the HE regression for unrelated individuals. In this study, by replacing the IBD scores with standardized IBS scores, we used the HE regression to conduct association studies for GWAS data. Assuming random mating, biallelic loci, and only additive genetic effects in the genetic architecture of the quantitative trait loci (QTLs) underlying a complex trait, this report establishes the theoretical basis for using the HE regression for GWAS. Two generic scenarios were investigated, and their regression coefficients were derived and shown to have genetically meaningful interpretations. In scenario I, the IBS score was assessed via a marker that was in linkage disequilibrium (LD) with a QTL. This enabled the HE regression to be a tool for single-marker GWAS. In scenario II, the IBS score was assessed on multiple markers, each of which could be in LD with multiple QTLs. This allowed the HE regression to be used to estimate the variance component tagged by markers.
The second scenario has implications for estimation of heritability for complex trait using whole genome-wide markers together, similar to the mixed linear model (Yang et al., 2010; Lee et al., 2011). Using an analytical method that establishes the equivalence between the IBS-based HE regression and the mixed linear model, a simple criterion is proposed to justify the estimates in this study. A similar equivalence between the HE regression and the variance component analysis with the mixed linear model was determined in the context of linkage analysis (Sham and Purcell, 2001). In this study, their equivalence is established under the context of GWAS, and the conditions for equivalence are explored analytically as well as in silico. After extending the established HE regression into case-control scenarios, we demonstrated that Hong23 fits the estimate from the HE regression better than that from the mixed linear model. Furthermore, as the IBS-based HE regression uses least squares, it is advantageous in its computational efficiency and is N (N is the sample size) times faster than the mixed linear model. In order to facilitate the application, the HE regression algorithm for GWAS data was implemented in Java software, GEnetic Analysis Repository (GEAR), which is freely available online. As the first half of this report is focused on establishing the mathematical basis of the IBS-based HE regression, many mathematical symbols are introduced (Table 1). In the text below, the HE regression is the IBS-based HE regression unless explicitly noted otherwise.

Table 1

Notation definitions.

Notation	Definition
p_k and q_k	Allele frequencies of A and a at the k^th locus. A is the reference allele.
D_kl	Linkage disequilibrium of a pair of loci, D_kl = f_{a_ka_l} − q_kq_l, in which f_{a_ka_l} is the frequency of haplotype a_ka_l.
r_kl and R_kl	r_kl = p(a_l\|a_k) and R_kl = p(A_l\|A_k), the conditional probabilities of the two coupling haplotypes, a_ka_l and A_kA_l.
ρ_kl	ρ_kl = D_kl/√(p_kq_kp_lq_l), the Pearson's correlation between a pair of biallelic loci, k and l.
ρ²_M	The mean of the squared correlation between any marker pair, including the marker with itself. This can be estimated from the genotype data.
ρ²_Q	The mean of the squared correlation between any marker and a QTL.
Λ	The ratio between ρ²_Q and ρ²_M. This indicates how markers tag causal variants.
M	The number of markers.
M_e	The effective number of markers. See the text and Supplementary Note II for definition.
x_i	x_i = [x_i1, x_i2, x_i3, …, x_iM], genotype scores, a vector. It counts the reference allele number for each locus.
g_k	The genotype set for the k^th locus, such as g_k = {a_ka_k, A_ka_k, A_kA_k}. Analogously, for a QTL, g_k = {Q_kQ_k, Q_kq_k, q_kq_k}.
s_i	Standardized genotype scores for the i^th individual, a vector. s_i = [(x_i1 − 2p_1)/√(2p_1q_1), (x_i2 − 2p_2)/√(2p_2q_2), …, (x_iM − 2p_M)/√(2p_Mq_M)].
L	The number of QTLs.
N	Sample size.
	.
	, in which d is the number of parameters in the HE regression.
y_i	The phenotype of the i^th individual.
Y_ij	The square of the phenotype difference between the i^th and the j^th individuals.
Ω_ij	The genetic relatedness between the i^th and the j^th individuals. See the text for definition.
β_l	The additive effect of the l^th QTL.
σ²_A	Total additive variance.
h²	Narrow-sense heritability.
σ_l	The square-root of the additive variance of the l^th QTL, σ_l = √(2p_lq_l) β_l.
Hong23	Expressed as h²_l = h²_o × [K(1 − K)/z²] × [K(1 − K)/(P(1 − P))], in which h²_l is the heritability on the liability scale, h²_o is the heritability on the observed scale directly estimated based on the case-control data, K is the disease prevalence, P is the proportion of cases in the data, and z is the height of the standard normal distribution at the threshold corresponding to the prevalence (Lee et al., 2011).
Subscript	Subscripts i and j are used to indicate individuals, and k and l are used to indicate loci, which can be either markers or QTLs.


Theory of the IBS-based HE regression

For an individual, the phenotype is denoted as $y_i$, which follows the normal distribution $N(\mu, \sigma^2)$, and the genotype is $x_i = [x_{i1}, x_{i2}, \ldots, x_{iM}]$, in which M is the number of markers. For the $i$th individual, the genotype at the $k$th locus is $x_{ik}$, which counts the reference alleles at the $k$th locus. The reference allele is denoted as $A_k$ and the alternative allele as $a_k$. The frequency of $A_k$ is $p_k$ and the frequency of $a_k$ is $q_k$. $g_k$ is the set of possible genotypes, $\{a_ka_k, A_ka_k, A_kA_k\}$, at the $k$th locus; $a_ka_k$, $A_ka_k$, and $A_kA_k$ are coded as 0, 1, and 2, respectively. After standardization, $x_{ik}$ is expressed as $s_{ik} = (x_{ik} - 2p_k)/\sqrt{2p_kq_k}$. For $x_{ik}$, given a genotype of $a_ka_k$, $A_ka_k$, and $A_kA_k$, the standardized scores are $-2p_k/\sqrt{2p_kq_k}$, $(q_k - p_k)/\sqrt{2p_kq_k}$, and $2q_k/\sqrt{2p_kq_k}$, respectively. The additive effect of the $l$th QTL is denoted as $\beta_l$. Throughout the study, we assume a polygenic model with L QTLs.
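The standardization above can be checked numerically. The following is a minimal sketch, assuming Hardy-Weinberg genotype frequencies; the function names are hypothetical and not part of GEAR. It confirms that the standardized scores have mean 0 and variance 1.

```python
import math

def standardized_scores(p):
    """Standardized scores for genotypes aa, Aa, AA (coded 0, 1, 2)."""
    q = 1.0 - p
    sd = math.sqrt(2.0 * p * q)
    return [(x - 2.0 * p) / sd for x in (0, 1, 2)]

def genotype_freqs(p):
    """Hardy-Weinberg frequencies of aa, Aa, AA (random mating assumed)."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]

p = 0.3  # illustrative reference-allele frequency
scores = standardized_scores(p)
freqs = genotype_freqs(p)
mean = sum(f * s for f, s in zip(freqs, scores))
var = sum(f * s * s for f, s in zip(freqs, scores)) - mean ** 2
print(scores, mean, var)
```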

The IBS-based HE regression

Haseman and Elston (1972) proposed a linear model, $Y = \mu + b\pi + e$, for detecting linkage between a marker and a QTL in a full-sib design. Y represents the squared phenotypic difference between a pair of full sibs, and π is the proportion of IBD at an observed marker locus; μ is the intercept of the regression, b is the regression coefficient, and e is the residual. The mathematical expectation of the regression coefficient is $b = -2(1 - 2c)^2\sigma_l^2$, in which c is the recombination fraction between the marker locus and the QTL, and $\sigma_l^2$ is the additive genetic variance of the QTL. Now consider a sample consisting of N unrelated individuals. If the phenotype for the $i$th individual is $y_i$, we can modify the original HE regression as $Y_{ij} = \mu + b\,\Omega_{ij} + e_{ij}$ (Equation 1), in which $Y_{ij} = (y_i - y_j)^2$ represents the squared difference, $\Omega_{ij}$ is the measure of the genetic relatedness of a pair of individuals, and $e_{ij}$ is the residual. Given N unrelated individuals, there are $N(N-1)/2$ such individual pairs. $\Omega_{ij}$ is the similarity score between a pair of individuals based on the IBS, as recently proposed (Powell et al., 2010; Yang et al., 2010). For the linear model in Equation (1), the expectation of the regression coefficient is $E(b) = \mathrm{cov}(Y_{ij}, \Omega_{ij})/\mathrm{var}(\Omega_{ij})$, in which $\mathrm{var}(\Omega_{ij})$ is the variance of the genetic relatedness. $\mathrm{cov}(Y_{ij}, \Omega_{ij}) = E(\Omega_{ij}Y_{ij}) - E(\Omega_{ij})E(Y_{ij}) = E(\Omega_{ij}Y_{ij})$ because $E(\Omega_{ij}) = 0$ [see the definition of $\Omega_{ij}$ in section The Derivation of var(Ω) and Effective Number of Markers (M_e)], and $E(\Omega_{ij}Y_{ij})$ is the mathematical expectation over the joint distribution of $\Omega_{ij}$ and $Y_{ij}$. In order to derive $\mathrm{var}(\Omega_{ij})$ and $\mathrm{cov}(Y_{ij}, \Omega_{ij})$, we need to introduce the haplotype distribution of a biallelic loci pair (section Haplotypes of a Biallelic Loci Pair). When the haplotype is constructed on a pair of markers, it leads to the derivation of $\mathrm{var}(\Omega_{ij})$ [section The Derivation of var(Ω) and Effective Number of Markers (M_e)]; when the haplotype is constructed for a marker and a QTL, it leads to $E(y_i|x_{ik})$, the conditional expectation of the phenotype based on a marker [section The Derivation of E(y|x)].

Derivations of var(Ω) and E(y)

Haplotypes of a biallelic loci pair

For a pair of biallelic loci, there are four haplotype phases, and their conditional probabilities are summarized in Table S1. $r_{kl} = p(a_l|a_k)$ and $R_{kl} = p(A_l|A_k)$ are defined as the conditional probabilities of the haplotypes in the coupling phases, $a_ka_l$ and $A_kA_l$, respectively; $1 - r_{kl}$ and $1 - R_{kl}$ represent the conditional probabilities of the haplotypes in the repulsion phases, $a_kA_l$ and $A_ka_l$, respectively. $D_{kl} = f_{A_kA_l} - p_kp_l$, in which $f_{A_kA_l}$ is the frequency of the haplotype $A_kA_l$; $D_{kl}$ is the covariance between the loci, quantifying the LD between them. The correlation of a pair of biallelic loci can be expressed as $\rho_{kl} = D_{kl}/\sqrt{p_kq_kp_lq_l}$ (Equation 2), a 2 × 2 correlation. $\rho_{kl}^2$ is often used to parameterize the LD of a loci pair (Hill and Robertson, 1968). For more LD parameterizations, please refer to Devlin and Risch (1995) and Wray (2005). Once the conditional probabilities of the haplotypes are defined, it is straightforward to obtain the joint probabilities of the genotypes for a pair of loci. For example, under random mating, the probability of the genotype $A_kA_kA_lA_l$ is $p(A_kA_kA_lA_l) = p(A_kA_k)\,p(A_lA_l|A_kA_k) = p_k^2R_{kl}^2$. Analogously, this leads to the joint probabilities of the other eight two-locus genotypes (see Table 2).

Table 2

The joint distribution of two loci.

		The k^th locus
		a_ka_k	A_ka_k	A_kA_k
The l^th locus	a_la_l	q²_kr²_kl	2p_kq_kr_kl(1 − R_kl)	p²_k(1 − R_kl)²
	A_la_l	2q²_kr_kl(1 − r_kl)	2p_kq_k[r_klR_kl + (1 − r_kl) (1 − R_kl)]	2p²_kR_kl(1 − R_kl)
	A_lA_l	q²_k(1 − r_kl)²	2p_kq_kR_kl(1 − r_kl)	p²_kR²_kl
Marginal probability		q²_k	2p_kq_k	p²_k

Each cell lists the joint probability of a genotype pair at the k^th and the l^th loci.
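The construction of Table 2 can be sketched in code. This is an illustrative check rather than the authors' implementation: the function name and the example frequencies are assumptions, and the conditional probabilities are derived from the LD coefficient as r_kl = (q_kq_l + D_kl)/q_k and R_kl = (p_kp_l + D_kl)/p_k. Exact fractions are used so that the marginal probabilities can be compared without rounding error.

```python
from fractions import Fraction

def two_locus_table(pk, pl, D):
    """Joint genotype probabilities of two biallelic loci under random mating.

    Rows are a_l a_l, A_l a_l, A_l A_l; columns are a_k a_k, A_k a_k, A_k A_k.
    """
    qk, ql = 1 - pk, 1 - pl
    r = (qk * ql + D) / qk          # p(a_l | a_k)
    R = (pk * pl + D) / pk          # p(A_l | A_k)
    rho2 = D ** 2 / (pk * qk * pl * ql)  # squared correlation between the loci
    table = [
        [qk**2 * r**2, 2*pk*qk * r*(1 - R), pk**2 * (1 - R)**2],
        [2*qk**2 * r*(1 - r), 2*pk*qk * (r*R + (1 - r)*(1 - R)), 2*pk**2 * R*(1 - R)],
        [qk**2 * (1 - r)**2, 2*pk*qk * R*(1 - r), pk**2 * R**2],
    ]
    return r, R, rho2, table

# Illustrative values: p_k = 0.3, p_l = 0.4, D_kl = 0.05
r, R, rho2, table = two_locus_table(Fraction(3, 10), Fraction(2, 5), Fraction(1, 20))
column_sums = [sum(row[c] for row in table) for c in range(3)]
row_sums = [sum(row) for row in table]
print(column_sums, row_sums)
```

The column sums recover the marginal genotype frequencies of the k^th locus, and the row sums recover those of the l^th locus.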


The derivation of var(Ω) and effective number of markers (M_e)

For a sample consisting of unrelated individuals, their pairwise genetic relationships, say additive genetic relationships, can be estimated with genetic markers, such as SNP markers (Powell et al., 2010; Yang et al., 2010). The genetic relatedness $\Omega_{ij}$ between the $i$th and the $j$th individuals is measured by the dot product of their standardized genotypes divided by the number of markers, $\Omega_{ij} = \frac{1}{M}\sum_{k=1}^{M}\Omega_{ijk}$ with $\Omega_{ijk} = s_{ik}s_{jk}$. The possible relatedness scores of a pair of individuals at one locus are summarized in Table 3A, totaling nine products. After combining the same score values, there are six unique terms, as in Table 3B. It is easy to derive that $E(\Omega_{ij}) = 0$ and $\mathrm{var}(\Omega_{ij}) = \frac{1}{M^2}\sum_{k=1}^{M}\sum_{l=1}^{M}\mathrm{cov}(\Omega_{ijk}, \Omega_{ijl})$, in which $\mathrm{cov}(\Omega_{ijk}, \Omega_{ijl}) = E(\Omega_{ijk}\Omega_{ijl}) - E(\Omega_{ijk})E(\Omega_{ijl}) = E(\Omega_{ijk}\Omega_{ijl})$ because $E(\Omega_{ijk}) = 0$.
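The single-locus distribution of the relatedness score can be enumerated directly. The sketch below is an illustration under the random-mating assumption, with a hypothetical function name; it lists the nine genotype pairs, merges equal scores, and checks the mean and variance of the single-locus score.

```python
from fractions import Fraction
from itertools import product

def omega_distribution(p):
    """Distribution of the single-locus score s_ik * s_jk for two unrelated individuals."""
    p = Fraction(p)
    q = 1 - p
    freq = {0: q * q, 1: 2 * p * q, 2: p * p}   # genotype frequencies for aa, Aa, AA
    dist = {}
    for xi, xj in product(freq, repeat=2):
        # product of the two standardized scores; the square roots cancel
        omega = (xi - 2 * p) * (xj - 2 * p) / (2 * p * q)
        dist[omega] = dist.get(omega, 0) + freq[xi] * freq[xj]
    return dist

dist = omega_distribution(Fraction(3, 10))   # illustrative allele frequency
mean = sum(w * f for w, f in dist.items())
var = sum(w * w * f for w, f in dist.items()) - mean ** 2
print(len(dist), mean, var)
```

For an allele frequency other than 0.5, the nine products collapse to six distinct score values, with mean 0 and variance 1.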

Table 3A

The joint distribution of the genetic relatedness between individuals i and j.

Individual i			Individual j			Relatedness for individuals i and j
Genotype	s_ik	Frequency	Genotype	s_jk	Frequency	Ω_ijk	Frequency
a_ka_k	−2p_k/√(2p_kq_k)	q²_k	a_ka_k	−2p_k/√(2p_kq_k)	q²_k	4p²_k/(2p_kq_k)	q⁴_k
			A_ka_k	(q_k − p_k)/√(2p_kq_k)	2p_kq_k	−2p_k(q_k − p_k)/(2p_kq_k)	2p_kq³_k
			A_kA_k	2q_k/√(2p_kq_k)	p²_k	−4p_kq_k/(2p_kq_k)	p²_kq²_k
A_ka_k	(q_k − p_k)/√(2p_kq_k)	2p_kq_k	a_ka_k	−2p_k/√(2p_kq_k)	q²_k	−2p_k(q_k − p_k)/(2p_kq_k)	2p_kq³_k
			A_ka_k	(q_k − p_k)/√(2p_kq_k)	2p_kq_k	(q_k − p_k)²/(2p_kq_k)	4p²_kq²_k
			A_kA_k	2q_k/√(2p_kq_k)	p²_k	2q_k(q_k − p_k)/(2p_kq_k)	2p³_kq_k
A_kA_k	2q_k/√(2p_kq_k)	p²_k	a_ka_k	−2p_k/√(2p_kq_k)	q²_k	−4p_kq_k/(2p_kq_k)	p²_kq²_k
			A_ka_k	(q_k − p_k)/√(2p_kq_k)	2p_kq_k	2q_k(q_k − p_k)/(2p_kq_k)	2p³_kq_k
			A_kA_k	2q_k/√(2p_kq_k)	p²_k	4q²_k/(2p_kq_k)	p⁴_k

As A_k was set as the reference allele, with a frequency of p_k, the genotypes a_ka_k, A_ka_k, and A_kA_k were coded as 0, 1, and 2, respectively.

Table 3B

A reorganization of Table 3A.

Ω_ijk	4p²_k/(2p_kq_k)	−2p_k(q_k − p_k)/(2p_kq_k)	−4p_kq_k/(2p_kq_k)	(q_k − p_k)²/(2p_kq_k)	2q_k(q_k − p_k)/(2p_kq_k)	4q²_k/(2p_kq_k)
Frequency	q⁴_k	4p_kq³_k	2p²_kq²_k	4p²_kq²_k	4p³_kq_k	p⁴_k


$\Omega_{ij}$ is informative in revealing hidden relatedness. For example, for a duplicated individual in the sample, $E(\Omega_{ij}) = 1$; for first-degree relatives, $E(\Omega_{ij}) = 0.5$; for second-degree relatives, $E(\Omega_{ij}) = 0.25$. Consequently, it can be used to exclude samples whose relatedness exceeds a chosen cutoff. After some additional algebra (see Supplementary Note I), we arrived at $\mathrm{cov}(\Omega_{ijk}, \Omega_{ijl}) = \rho_{kl}^2$ (Equation 4). When the $k$th locus and the $l$th locus are in linkage equilibrium, $\mathrm{cov}(\Omega_{ijk}, \Omega_{ijl}) = 0$; when the $k$th locus and the $l$th locus are the same locus, $\mathrm{cov}(\Omega_{ijk}, \Omega_{ijl}) = 1$. The distribution of $\rho_{kl}^2$ varies with $p_k$ and $p_l$ (Wray, 2005). We can also interpret $\mathrm{var}(\Omega_{ij})$ as the mean of the squared Pearson's correlation between the markers along the genome, denoted as $\rho_M^2$. For simplicity of the following derivation, the concept of an effective number of markers, $M_e$, is introduced here. Intuitively, as markers are often in linkage disequilibrium, the real number of “independent” markers is smaller than the total number of markers genotyped. This concept was previously introduced in the context of risk prediction (Purcell et al., 2009), and $M_e$ was evaluated using Monte Carlo simulation. As indicated in Supplementary Note II, $1/\mathrm{var}(\Omega_{ij})$ is the mathematical expectation of the effective number of markers evaluated under the simulation method (Purcell et al., 2009). For example, for 100 equifrequent biallelic loci, if the correlation for each pair of consecutive markers is 0, 0.25, 0.5, and 0.75, the effective number of markers is approximately 100, 90, 61, and 29, respectively. For real GWAS data, $M_e$ is often on the order of 10^4 (Vinkhuyzen et al., 2013).
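The effective number of markers in the 100-locus example can be approximated with a short calculation. The sketch below rests on an assumption that is not stated in the text: that LD decays geometrically with distance, so that the correlation between markers k and l is the consecutive-marker correlation raised to the power |k − l|. The function name is hypothetical.

```python
def effective_markers(m, rho):
    """M_e = 1 / var(Omega) = M^2 / sum over all marker pairs of rho_kl^2.

    Assumes rho_kl = rho ** |k - l| (geometric decay of LD along the chromosome).
    """
    total = 0.0
    for k in range(m):
        for l in range(m):
            total += rho ** (2 * abs(k - l))   # squared correlation of the pair (k, l)
    return m * m / total

for rho in (0.0, 0.25, 0.5, 0.75):
    print(rho, round(effective_markers(100, rho), 1))
```

Under this decay assumption the results fall close to the approximate values quoted above; a different LD structure would give different numbers.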

The derivation of E(y|x)

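The expected phenotype $y_i$ given the marker genotype $x_{ik}$ depends on the genotype of the QTL, say the $l$th locus, in LD with the marker. Assuming a biallelic QTL in LD with the marker, the conditional expectation given the marker genotype is $E(y_i|x_{ik} = A_kA_k) = \sum_{g_l}E(y_i|g_l)\,p(g_l|x_{ik} = A_kA_k)$, in which $g_l = \{Q_lQ_l, Q_lq_l, q_lq_l\}$, and $E(y_i|x_{ik} = A_kA_k) = \beta_l \times R_{kl}^2 + 0 \times 2R_{kl}(1 - R_{kl}) - \beta_l \times (1 - R_{kl})^2 = (2R_{kl} - 1)\beta_l$. Analogously, we can derive the expected values $E(y_i|x_{ik} = A_ka_k) = (R_{kl} - r_{kl})\beta_l$ and $E(y_i|x_{ik} = a_ka_k) = (1 - 2r_{kl})\beta_l$ (see Table 4). Once $E(y_i|x_{ik})$ is defined, the joint distribution of $\Omega_{ijk}$ and $E(Y_{ij})$ can be tabulated as in Table 5.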
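The conditional expectations above can be reproduced by summing over the QTL genotypes. The sketch below assumes the coding implied by the derivation (genotypic values of β, 0, and −β for QQ, Qq, and qq) and assumes that r_kl and R_kl give the probabilities that marker alleles a and A sit on haplotypes carrying q and Q, respectively. The function name is hypothetical.

```python
from fractions import Fraction

def conditional_means(r, R, beta):
    """E(y | marker genotype) for aa, Aa, AA, summed over the QTL genotypes."""
    # Probability that a haplotype carrying marker allele a or A also carries Q
    carries_Q = {"a": 1 - r, "A": R}
    means = {}
    for genotype in ("aa", "Aa", "AA"):
        h1, h2 = carries_Q[genotype[0]], carries_Q[genotype[1]]
        p_QQ = h1 * h2                    # both haplotypes carry Q
        p_qq = (1 - h1) * (1 - h2)        # both haplotypes carry q
        means[genotype] = beta * p_QQ - beta * p_qq   # heterozygotes contribute 0
    return means

r, R, beta = Fraction(2, 3), Fraction(3, 5), Fraction(2)   # illustrative values
means = conditional_means(r, R, beta)
print(means)
```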

Table 4

The expected phenotype conditional on one's genotype at the observed marker.

Marker genotype	QTL genotype	QTL conditional probability	E(y_i\|x_ik)
a_ka_k	q_lq_l	r²_kl	(1 − 2r_kl)β_l
	Q_lq_l	r_kl(1 − r_kl)
	q_lQ_l	r_kl(1 − r_kl)
	Q_lQ_l	(1 − r_kl)²
A_ka_k	q_lq_l	r_kl(1 − R_kl)	(R_kl − r_kl)β_l
	Q_lq_l	(1 − r_kl)(1 − R_kl)
	q_lQ_l	r_klR_kl
	Q_lQ_l	R_kl(1 − r_kl)
A_kA_k	q_lq_l	(1 − R_kl)²	(2R_kl − 1)β_l
	Q_lq_l	R_kl(1 − R_kl)
	q_lQ_l	R_kl(1 − R_kl)
	Q_lQ_l	R²_kl

It is assumed that the k.

Table 5

The joint distribution of Ω_ijk and Y_ij.

For the nine cells, the symmetrical cells are highlighted in same color. In each highlighted cell, three terms from the top to the bottom are Ω.

τ.


Deriving the mathematical expectation of the regression coefficient

In this section, we investigated two scenarios to derive the expected value of the regression coefficient for Equation (1). In scenario I, genetic similarity is estimated at a single marker, which is in LD with one or more QTLs. In scenario II, genetic similarity is estimated based on M markers, each of which can be in LD with L QTLs.

Scenario I: one marker and one QTL

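Under the scenario of one marker, say the $k$th marker, and one QTL, say the $l$th QTL, since $\mathrm{var}(\Omega_{ijk}) = 1$, $E(b) = E(\Omega_{ijk}Y_{ij})$, which is $E(\Omega_{ijk}Y_{ij}) = \sum_{x_{ik}}\sum_{x_{jk}}\Omega_{ijk}[E(y_i|x_{ik}) - E(y_j|x_{jk})]^2\,p(x_{ik})\,p(x_{jk})$. Consequently, we can derive the regression coefficient as $E(b) = -4p_kq_k\tau_{kl}^2\beta_l^2$ (Equation 6), in which $\tau_{kl} = 1 - r_{kl} - R_{kl}$. When the QTL overlaps with the marker, or the correlation between the QTL and the marker is 1, $E(b) = -4p_kq_k\beta_l^2$ because $r_{kl} = R_{kl} = 1$. When the QTL is in linkage equilibrium with the marker, $r_{kl} = q_l$ and $R_{kl} = p_l$, so $1 - r_{kl} - R_{kl} = 0$, and consequently $E(b) = 0$. According to Table S1, $r_{kl} = q_l + D_{kl}/q_k$ and $R_{kl} = p_l + D_{kl}/p_k$, so that $\tau_{kl} = -D_{kl}/(p_kq_k)$. Consequently, Equation (6) can be rearranged as $E(b) = -4D_{kl}^2\beta_l^2/(p_kq_k)$. The correlation of a pair of biallelic loci is $\rho_{kl} = D_{kl}/\sqrt{p_kq_kp_lq_l}$ [Equation (2)], and consequently $E(b) = -2\rho_{kl}^2\sigma_l^2$ (Equation 7), in which $\sigma_l^2 = 2p_lq_l\beta_l^2$; equivalently, $E(b) = -4\rho_{kl}^2p_lq_l\beta_l^2$. In the GWAS context, the squared LD (Pearson's correlation) takes the place of the recombination fraction used for linkage. The mathematical expectation of the regression coefficient resembles the one in the original HE regression. However, it should be noted that here the interpretation of the regression coefficient is based on linkage disequilibrium and association, whereas the original interpretation is based on linkage between the marker and the QTL. When multiple QTLs are in LD with the marker, the conditional expectations for $y_i$ given $x_{ik} = a_ka_k$, $A_ka_k$, and $A_kA_k$ are $\sum_{l=1}^{L}(1 - 2r_{kl})\beta_l$, $\sum_{l=1}^{L}(R_{kl} - r_{kl})\beta_l$, and $\sum_{l=1}^{L}(2R_{kl} - 1)\beta_l$, respectively. The joint distribution of $\Omega_{ijk}$ and $Y_{ij}$ is summarized in Table S2, which resembles Table 5. Still using $E(\Omega_{ijk}Y_{ij}) = \sum_{x_{ik}}\sum_{x_{jk}}\Omega_{ijk}[E(y_i|x_{ik}) - E(y_j|x_{jk})]^2\,p(x_{ik})\,p(x_{jk})$, the regression coefficient can be derived as $E(b) = -4p_kq_k[\sum_{l=1}^{L}\tau_{kl}\beta_l]^2$ (Equation 8). Equation (8) can be rearranged as $E(b) = -2\sum_{l_1=1}^{L}\sum_{l_2=1}^{L}\rho_{kl_1}\rho_{kl_2}\sigma_{l_1}\sigma_{l_2}$ (Equation 9). It is easy to see that when L = 1, Equation (9) simplifies to Equation (7).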
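The agreement between the enumeration and the two closed forms for scenario I can be checked exactly. The sketch below is an illustrative verification under random mating, with a hypothetical function name and example values; the conditional probabilities are obtained from the LD coefficient D_kl, which is an assumption about how Table S1 parameterizes them.

```python
from fractions import Fraction

def expected_b_single_marker(pk, pl, D, beta):
    """Return E(b) three ways: by enumeration, via tau, and via rho^2 * sigma_l^2."""
    qk, ql = 1 - pk, 1 - pl
    r = (qk * ql + D) / qk              # p(q_l | a_k)
    R = (pk * pl + D) / pk              # p(Q_l | A_k)
    tau = 1 - r - R
    freq = {0: qk**2, 1: 2 * pk * qk, 2: pk**2}
    mean_y = {0: (1 - 2*r) * beta, 1: (R - r) * beta, 2: (2*R - 1) * beta}
    by_enumeration = 0
    for xi in freq:
        for xj in freq:
            omega = (xi - 2*pk) * (xj - 2*pk) / (2 * pk * qk)
            by_enumeration += omega * (mean_y[xi] - mean_y[xj])**2 * freq[xi] * freq[xj]
    rho2 = D**2 / (pk * qk * pl * ql)
    sigma2_l = 2 * pl * ql * beta**2
    return by_enumeration, -4 * pk * qk * tau**2 * beta**2, -2 * rho2 * sigma2_l

results = expected_b_single_marker(Fraction(3, 10), Fraction(2, 5), Fraction(1, 20), Fraction(1))
print(results)
```

All three expressions coincide for any admissible choice of allele frequencies and D_kl, which is the content of the rearrangement from the genetic to the statistical parameterization.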

Scenario II: multiple markers and multiple QTLs

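When the genetic relatedness matrix is constructed with M markers, each of which may be in LD with L QTLs, the HE regression becomes $Y_{ij} = a + b\,\Omega_{ij} + e_{ij}$, in which $\Omega_{ij} = \frac{1}{M}\sum_{k=1}^{M}\Omega_{ijk}$. For convenience, $\Omega_{ijk}$ denotes the relatedness constructed with the $k$th marker between the $i$th and the $j$th individuals. According to the definition of the regression coefficient, $E(b) = \mathrm{cov}(Y_{ij}, \Omega_{ij})/\mathrm{var}(\Omega_{ij})$. The numerator is $\frac{1}{M}\sum_{k=1}^{M}-4p_kq_k[\sum_{l=1}^{L}\tau_{kl}\beta_l]^2$, and the denominator is $\mathrm{var}(\Omega_{ij}) = \frac{1}{M^2}\sum_{k=1}^{M}\sum_{l=1}^{M}\mathrm{cov}(\Omega_{ijk}, \Omega_{ijl})$, in which $\mathrm{cov}(\Omega_{ijk}, \Omega_{ijl}) = \rho_{kl}^2$, as expressed in Equation (4). After rearrangement, $E(b) = -2(\Lambda\sigma_A^2 + \Delta)$ (Equation 11), in which $\Lambda = \rho_Q^2/\rho_M^2$, and $\Delta$ collects the cross-products between different QTLs, summarizing the between-locus covariance. $\rho_Q^2$ is the average squared LD between a marker and a QTL across the genome, and $\rho_M^2$ is the average squared LD between every pair of markers, including the LD between each marker and itself. The interpretation of Equation (11) will be clear in Simulation III and Simulation IV. If the phenotype is standardized, heritability equals the additive variance component. It is straightforward to obtain an estimate of the heritability for a single QTL, as in scenario I, or all QTLs, as in scenario II (see Supplementary Note IV).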

The sampling variance of the regression coefficient

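The variance of the dependent variable is $\mathrm{var}(Y_{ij}) = 8\sigma^4$ (Supplementary Note III), and the sampling variance of the regression coefficient is $\sigma_b^2 = [\mathrm{var}(Y_{ij}) - b^2\mathrm{var}(\Omega_{ij})]/\{[N(N-1)/2 - d]\,\mathrm{var}(\Omega_{ij})\}$, in which $N(N-1)/2$ is the number of individual pairs and d is the number of regression coefficients (here d = 1). For scenario I, as only one marker is used, $\mathrm{var}(\Omega_{ij}) = 1$ and $\rho_M^2 = 1$. Given the current GWAS data, which incorporate thousands of individuals and often up to one million markers, it is reasonable to assume $N(N-1)/2 - d \approx N^2/2$ and $8M_e \gg 4\sigma_A^4\Lambda^2$, so that $\sigma_b \approx 4\sigma^2\sqrt{M_e}/N$. For real GWAS data with about one million markers, $M_e$ ranges from 30,000 to 50,000 due to the strong LD pattern (Vinkhuyzen et al., 2013). When the phenotype is standardized, the additive variance component is estimated as $-b/2$, so its standard error is half that of the regression coefficient.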
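The sampling variance can be evaluated with a few lines of code. This is a rough sketch with hypothetical function names, assuming a standardized phenotype by default and var(Ω) = 1/M_e; the "exact" version ignores the small b²var(Ω) correction, consistent with the large-sample assumption in the text.

```python
import math

def he_se_approx(n, m_e, sigma2=1.0):
    """Large-sample standard error of b: 4 * sigma^2 * sqrt(M_e) / N."""
    return 4.0 * sigma2 * math.sqrt(m_e) / n

def he_se_pairs(n, m_e, sigma2=1.0, d=1):
    """Standard error of b using the exact pair count N(N-1)/2 - d."""
    var_y = 8.0 * sigma2 ** 2          # var(Y_ij) = 8 * sigma^4
    var_omega = 1.0 / m_e              # var(Omega_ij) = 1 / M_e
    n_pairs = n * (n - 1) / 2.0
    return math.sqrt(var_y / ((n_pairs - d) * var_omega))

print(he_se_approx(1000, 1), he_se_pairs(1000, 1))
print(he_se_approx(10000, 40000))
```

For N = 1000 and a single marker the approximation gives 0.004, the analytical standard error listed in Table 7.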

The mathematical expectation of the HE regression intercept

The expectation of the intercept is $E(Y_{ij}) = E[(y_i - y_j)^2] = E(y_i^2) + E(y_j^2) - 2E(y_iy_j)$, in which $E(y_i^2) = \mathrm{var}(y_i) + E(y_i)^2$, $E(y_j^2) = \mathrm{var}(y_j) + E(y_j)^2$, and $E(y_iy_j) = \mathrm{cov}(y_i, y_j) + E(y_i)E(y_j)$. As the individuals are not related to each other, assuming no common environment, $\mathrm{cov}(y_i, y_j) = 0$, and the mean terms cancel because $E(y_i) = E(y_j) = \mu$. So, $E(Y_{ij}) = \mathrm{var}(y_i) + \mathrm{var}(y_j) = 2\sigma_A^2 + 2\sigma_e^2$, twice the phenotypic variance, with $\sigma_e^2$ denoting the residual variance. The negative ratio between the regression coefficient and the intercept provides an estimate of the heritability if the phenotype is not standardized. The derived regression coefficients and their sampling variances are summarized in Table 6.
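The estimator described above can be illustrated with a small simulation. This is a minimal pure-Python sketch under stated assumptions, not the GEAR implementation: all markers are independent, equifrequent, and causal (so Λ = 1), effects are drawn from N(0, 1), and the function name, sample size, and seed are arbitrary choices. It regresses the squared phenotype differences on the IBS relatedness by ordinary least squares and reports −b/a.

```python
import math
import random

def simulate_he(n=500, m=20, h2=0.5, p=0.5, seed=1):
    """Simulate a polygenic trait and return (intercept, slope, -slope/intercept)."""
    rng = random.Random(seed)
    q = 1.0 - p
    sd = math.sqrt(2.0 * p * q)
    beta = [rng.gauss(0.0, 1.0) for _ in range(m)]
    var_g = sum(2.0 * p * q * b * b for b in beta)   # additive variance
    var_e = var_g * (1.0 - h2) / h2                  # residual variance for the target h2
    s, y = [], []
    for _ in range(n):
        x = [(rng.random() < p) + (rng.random() < p) for _ in range(m)]  # allele counts
        s.append([(xk - 2.0 * p) / sd for xk in x])
        g = sum(b * (xk - 2.0 * p) for b, xk in zip(beta, x))
        y.append(g + rng.gauss(0.0, math.sqrt(var_e)))
    # Accumulate the sums needed for least squares over all pairs i < j
    sw = sy = sww = swy = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            w = sum(a * b for a, b in zip(s[i], s[j])) / m   # relatedness Omega_ij
            d = (y[i] - y[j]) ** 2                           # squared difference Y_ij
            sw += w
            sy += d
            sww += w * w
            swy += w * d
            pairs += 1
    mw, my = sw / pairs, sy / pairs
    slope = (swy / pairs - mw * my) / (sww / pairs - mw * mw)
    intercept = my - slope * mw
    return intercept, slope, -slope / intercept

intercept, slope, h2_hat = simulate_he()
print(intercept, slope, h2_hat)
```

With these settings the approximate standard error of the heritability estimate is 2√M_e/N, roughly 0.02, so the estimate should fall near the simulated value of 0.5, although any single run varies.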

Table 6

Summary of the derivations.

Scenario	E(b) in genetic parameters	E(b) in statistical parameters	σ_b
One marker and one QTL	−4τ²_klp_kq_kβ²_l	−2ρ²_klσ²_l	≈ 4/N
One marker and multiple QTLs	−4p_kq_k[Σ_{l=1}^{L} τ_klβ_l]²	−2 Σ_{l1=1}^{L} Σ_{l2=1}^{L} ρ_kl1ρ_kl2σ_l1σ_l2	As above
Multiple markers and multiple QTLs	{Σ_{k=1}^{M} −4p_kq_k[Σ_{l=1}^{L} τ_klβ_l]²/M} / {Σ_{k=1}^{M} Σ_{l=1}^{M} ρ²_kl/M²}	−2σ²_AΛ if QTLs are randomly allocated along the genome.	≈ 4√M_e/N

For .

When the phenotype is standardized, h.


The additive variance component structure of a quantitative trait without ascertainment

The additive variance of a trait is defined as $\sigma_A^2 = \sum_{l=1}^{L}\sigma_l^2 + \sum_{l_1 \neq l_2}\rho_{l_1l_2}\sigma_{l_1}\sigma_{l_2}$. However, for a complex trait with polygenic genetic architecture, if the QTLs are randomly allocated along the genome, $E(\sum_{l_1 \neq l_2}\rho_{l_1l_2}\sigma_{l_1}\sigma_{l_2}) = 0$ (Supplementary Note V), a phenomenon in which the between-locus covariances cancel out. This is often true for a trait without ascertainment or selection. When each QTL is tagged perfectly and the QTLs are randomly allocated along the genome, Λ = 1. The Δ term in Equation (11) then vanishes, and the regression coefficient directly gives an unbiased estimate of minus twice the additive variance. Standardizing the phenotype then yields an unbiased heritability estimate. In practice, due to imperfect LD, the estimated heritability is reduced to $h^2\Lambda$. In fact, the HE regression and the mixed model are equivalent and can agree on the heritability estimate (see Simulation III). However, this equivalence can be disturbed when QTL effects are not randomly distributed (Simulation IV).

Extension to case-control GWAS data

Like the debut application of the original HE regression for schizophrenia (Elston et al., 1973), the IBS-based HE regression is also extended to case-control GWAS data in this study. However, due to scale issues and ascertainment (Dempster and Lerner, 1950; Falconer, 1966), the estimated heritability needs to be transformed to the liability scale, which is genetically meaningful for ascertained samples. One transformation was proposed by Lee et al. (2011), denoted here as Hong23. It is expressed as $h_l^2 = h_o^2\,\frac{K(1-K)}{z^2}\,\frac{K(1-K)}{P(1-P)}$, in which $h_l^2$ is the heritability on the liability scale, $h_o^2$ is the heritability on the observed scale directly estimated based on the case-control data, K is the prevalence of the disease, P is the proportion of cases in the data, and z is the height of the standard normal distribution at the threshold corresponding to the prevalence K. Once the heritability is estimated by the HE regression on the observed scale, it can be easily transformed to the liability scale with Hong23. Simulation studies will be conducted to investigate whether the HE regression estimates heritability better than the mixed linear model does (Simulation IV). In addition, $Y_{ij}$ in the HE regression can also be expressed as a cross-product, and then $E(b) = -2p_kq_k\tau_{kl}^2\beta_l^2$, which is half that of Equation (7) (see Supplementary Note VI).
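The Hong23 transformation can be written as a small function. The sketch below uses only the Python standard library; the function name and the example inputs are hypothetical, and the threshold is taken as the standard normal quantile that leaves a fraction K in the upper tail.

```python
import math
from statistics import NormalDist

def liability_scale_h2(h2_observed, K, P):
    """Transform observed-scale heritability to the liability scale (Hong23).

    K is the disease prevalence and P is the proportion of cases in the sample.
    """
    normal = NormalDist()
    threshold = normal.inv_cdf(1.0 - K)   # liability threshold for prevalence K
    z = normal.pdf(threshold)             # height of the normal density at the threshold
    return h2_observed * (K * (1.0 - K) / z ** 2) * (K * (1.0 - K) / (P * (1.0 - P)))

# Illustrative inputs: a disease with 1% prevalence sampled with equal cases and controls
print(liability_scale_h2(0.3, K=0.01, P=0.5))
```

When K = P = 0.5 the transformation reduces to multiplying the observed-scale heritability by π/2, which offers a quick sanity check of an implementation.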

Monte Carlo simulation results

In the Monte Carlo simulation, we will investigate the precision of the derived equations.

Simulation I: one marker and one QTL [evaluation of Equation (7)]

This simulation investigated the accuracy of Equation (7) for a single-marker application. One thousand unrelated individuals were simulated. One marker and one QTL were simulated, both of which were equifrequent and biallelic. The heritability of the QTL was 0.5. The LD between the marker and the QTL was set at three levels: ρ = 0.25, ρ = 0.5, and ρ = 0.75. The single marker was used to construct the genetic relatedness, Ω. Then a single-marker-based HE regression was conducted. After standardizing the phenotype, the negative half of the regression coefficient returned the unbiased heritability estimate. As indicated by Equation (7), given ρ = 0.25, ρ = 0.5, and ρ = 0.75, the expected regression coefficient was −0.062, −0.25, and −0.56, respectively. After 100 rounds of simulation, the derived expectation of the regression coefficient, as well as the sampling variance (Table 6), was in good agreement with the simulation results listed in Table 7. This simulation indicates that the single-marker HE regression is a competitive tool for QTL mapping.

Table 7

Simulation evaluations of Equation (7).

LD	Analytical results^a	Simulation results^b
ρ = 0.25	−0.062 (0.004)	−0.062 (0.0039)
ρ = 0.5	−0.25 (0.004)	−0.25 (0.0039)
ρ = 0.75	−0.56 (0.004)	−0.56 (0.0039)

The standard error was calculated as σ_b ≈ 4√M_e/N. Here N = 1000 and M_e = 1.

The standard errors in parentheses indicate the mean of the standard error from 100 simulation replications.


Simulation II: statistical power of the single-marker HE regression

For the single-marker HE regression, as the expectation and the sampling variance of the regression coefficient were already derived, a t-test could be constructed as $t = E(b)/\sigma_b = -2\rho_{kl}^2h_l^2/(4/N) = -N\rho_{kl}^2h_l^2/2$, in which the linkage disequilibrium between the $k$th marker and the $l$th QTL is $\rho_{kl}$ and $h_l^2$ is the heritability of the QTL for a standardized phenotype. When the sample size is sufficiently large, the t-test approaches the z-score distribution, and the non-centrality parameter of $\chi_1^2$ is consequently $N^2\rho_{kl}^4h_l^4/4$, a $\chi^2$-test with one degree of freedom. Table 8 lists the sample size required to detect a QTL through an associated SNP in a GWAS (type-I error rate of 10^−8).

Table 8

The sample size required for the single-marker HE regression to detect a QTL associated with the target marker.

h²	ρ_kl
	0.25	0.5	0.75
0.005	33,276	8,319	3,697
0.01	16,638	4,159	1,849
0.025	6,655	1,664	739
0.05	3,327	832	370

Here the p-value cutoff was 10.

In contrast, for a conventional one-marker association linear regression, $y_i = \mu + bs_{ik} + e_i$, if the phenotype and the genotypes are both standardized, $E(b) = \rho_{kl}\beta_l$, and its standard error is approximately $1/\sqrt{N}$, so a t-test can be constructed as $t = \rho_{kl}\beta_l\sqrt{N}$. Taking the square of the t statistic gives the non-centrality parameter $N\rho_{kl}^2h_l^2$, because $\beta_l^2 = h_l^2$ on the standardized scale. These two $\chi^2$ tests differ by the factor $N\rho_{kl}^2h_l^2/4$. Once $N\rho_{kl}^2h_l^2/4 > 1$, the single-marker HE regression is more powerful than the conventional linear regression; otherwise, the conventional linear regression method is more powerful. As listed in Table 9, given that the heritability of a QTL is 0.01, if the LD between the QTL and the target marker is low (ρ = 0.25), medium (ρ = 0.5), or high (ρ = 0.75), the sample size required to allow the HE regression to outperform the linear regression is 6400, 1600, and 712, respectively. If the heritability is even smaller, say h² = 0.005, the required sample size is 12,800, 3200, and 1423 to make the HE regression more powerful under low, medium, and high LD, respectively.

Table 9

The required sample size that makes the HE regression more powerful than the conventional single-marker linear regression.

h²	ρ_kl
	0.25	0.5	0.75
0.005	12,800	3,200	1,423
0.01	6,400	1,600	712
0.025	2,560	640	285
0.05	1,280	320	143

Depending on the sample size, heritability, and LD patterns between the QTL and the target marker, the power of the HE regression may or may not be greater than that of the conventional linear regression. However, when the sample size is large, or the heritability of the QTL is large, the HE regression is a more powerful tool for association studies. These results are based on the assumption that the real sampling variance agrees with the derived theoretical result.
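The entries of Table 9 appear to follow a simple closed form: each tabulated value equals 4/(ρ²h²) rounded up to the next integer. This threshold is inferred here from the table itself rather than quoted from the derivation, so the sketch below should be read as a consistency check only; the function name is illustrative.

```python
from fractions import Fraction
from math import ceil


def he_beats_lr_sample_size(h2: str, rho: str) -> int:
    """Smallest integer N with N >= 4 / (rho^2 * h2).

    The constant 4 is an assumption inferred from Table 9, not a formula
    quoted from the text. Inputs are decimal strings so that Fraction
    keeps the arithmetic exact before rounding up.
    """
    return ceil(Fraction(4) / (Fraction(rho) ** 2 * Fraction(h2)))


# Values copied from Table 9 (columns: rho = 0.25, 0.5, 0.75).
table9 = {
    "0.005": (12800, 3200, 1423),
    "0.01": (6400, 1600, 712),
    "0.025": (2560, 640, 285),
    "0.05": (1280, 320, 143),
}

for h2, tabulated in table9.items():
    computed = tuple(he_beats_lr_sample_size(h2, rho)
                     for rho in ("0.25", "0.5", "0.75"))
    print(h2, computed, "matches Table 9:", computed == tabulated)
```

Under this reading, the required sample size scales inversely with both the heritability of the QTL and the squared LD, which is the pattern visible across the rows and columns of the table.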

Simulation III: the all-marker HE regression and the mixed linear model are equivalent [Δ = 0 in Equation (11)]

In this simulation, 100 equifrequent and biallelic QTLs were simulated, and the additive effect of each QTL was sampled from N(0, 1). Four LD levels (ρ_l1,l2 = 0, 0.25, 0.5, and 0.75) were adopted for each pair of consecutive QTLs, and the effective number of markers decreased correspondingly (M ≈ 100, 90, 61, and 29). One thousand unrelated individuals were simulated, and the genetic relatedness of each pair of individuals was estimated from these 100 QTLs. The heritability of the simulated polygenic model was 0.5. Both the HE regression and the mixed linear model were employed to estimate the additive variance component. The mixed linear model (Yang et al., 2010) can be expressed as y_i = μ + Σ_l x_il β_l + e_i, where y_i is the phenotype of the ith individual, μ is the mean, x_il is the indicator variable with values of 0, 1, or 2 depending on the reference allele count at the lth locus, β_l is the effect of that locus, and e_i is the residual. Restricted maximum likelihood (REML) was employed to estimate the variance components of the mixed linear model (Yang et al., 2010). As shown in Table 10, the heritability estimates from the HE regression and the mixed linear model were nearly identical and unbiased, demonstrating the equivalence between the HE regression and the mixed linear model when the QTLs are randomly distributed regardless of their pairwise LD.
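The all-marker HE regression in this setting can be sketched in a few lines. The example below is a minimal illustration, not the GEAR implementation: it assumes the squared-difference form of the regression, a relatedness matrix built from genotypes standardized by sample allele frequencies (which may differ in detail from Equation (3)), and QTLs in linkage equilibrium (the ρ = 0 row of Table 10). All function and variable names are illustrative.

```python
import numpy as np


def he_regression(y, grm):
    """Least-squares fit of (y_i - y_j)^2 on pairwise relatedness, i < j.

    Returns (intercept, slope). Under the squared-difference form assumed
    here, the slope estimates -2 * (additive genetic variance).
    """
    upper = np.triu_indices(len(y), k=1)  # each pair counted once
    sq_diff = (y[:, None] - y[None, :])[upper] ** 2
    slope, intercept = np.polyfit(grm[upper], sq_diff, 1)
    return intercept, slope


def rescale(v, variance):
    """Centre v and scale it to the requested sample variance."""
    return (v - v.mean()) / v.std() * np.sqrt(variance)


rng = np.random.default_rng(2014)
n_ind, n_qtl, h2_true = 1000, 100, 0.5

# Equifrequent biallelic QTLs in linkage equilibrium (rho = 0).
geno = rng.binomial(2, 0.5, size=(n_ind, n_qtl)).astype(float)
freq = geno.mean(axis=0) / 2.0
z = (geno - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
grm = z @ z.T / n_qtl  # assumed form of the relatedness matrix

# Additive effects from N(0, 1); phenotype scaled to unit variance.
genetic = rescale(z @ rng.normal(size=n_qtl), h2_true)
residual = rescale(rng.normal(size=n_ind), 1.0 - h2_true)
y = rescale(genetic + residual, 1.0)

intercept, slope = he_regression(y, grm)
h2_hat = -slope / 2.0
print(f"HE estimate of heritability: {h2_hat:.3f} (simulated value {h2_true})")
```

A single replicate with only 100 QTLs can deviate noticeably from 0.5; it is the average over replicates, as reported in Table 10, that speaks to bias.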

Table 10

Simulation evaluations of Equation (11) and comparison between the HE regression and the mixed linear model method (Δ = 0).

LD (ρ)	Equation (11)^a	HE results^b	Mixed model results^c
ρ = 0	0.5 (0.020)	0.499 (0.020)	0.499 (0.041)
ρ = 0.25	0.5 (0.019)	0.500 (0.019)	0.501 (0.042)
ρ = 0.5	0.5 (0.016)	0.502 (0.015)	0.491 (0.043)
ρ = 0.75	0.5 (0.011)	0.488 (0.011)	0.508 (0.048)

Calculated given Δ = 0.

The standard errors in parentheses indicate the mean of the standard error from 100 simulation replications.

E(b) = −2σ²Λ (ignoring Δ) sheds light on the inference of the general LD pattern between the tagged markers and the causal variants. Here, the ρ² among markers can be estimated from the markers themselves. If the heritability of the trait is known (not likely though), it is possible to estimate the ρ² between QTLs and markers. For example, the heritability of height is estimated at around 0.8 (Visscher et al., 2006; Perola et al., 2007) in linkage studies, but at 0.4 in an association study (Yang et al., 2010). If the estimate from linkage is considered to be the true heritability, the ratio of the two estimates is 0.4/0.8 = 0.5. Assuming the effective number of markers is M = 10,000, the ρ² among markers is 0.0001, and the ρ² between QTLs and markers is 0.5 × 0.0001 = 0.00005. The average absolute value of the LD between a QTL and a marker is then about 0.007.
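The arithmetic of the height example can be retraced step by step. The sketch below assumes one particular reading of the numbers in the text: that 0.0001 is the reciprocal of the effective number of markers, that 0.00005 is that value multiplied by the ratio of the association-based to the linkage-based heritability, and that 0.007 is its square root. The variable names are illustrative.

```python
from math import sqrt

h2_linkage = 0.8       # height, linkage studies (taken as the true value)
h2_association = 0.4   # height, association study
m_effective = 10_000   # effective number of markers assumed in the text

# Ratio of the two heritability estimates.
ratio = h2_association / h2_linkage

# Assumption: the average squared LD among markers is read as 1 / M.
mean_r2_markers = 1.0 / m_effective

# Average squared LD between QTLs and markers under this reading.
mean_r2_qtl_marker = ratio * mean_r2_markers

# Average absolute LD between a QTL and a marker.
mean_abs_ld = sqrt(mean_r2_qtl_marker)

print(ratio, mean_r2_markers, mean_r2_qtl_marker, round(mean_abs_ld, 3))
```

Each intermediate value matches a number quoted in the example, which supports this reading without confirming it.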

Simulation IV: when the HE regression and the mixed linear model are not equivalent [when Δ ≠ 0 in Equation (11)]

The general setting for this simulation was similar to the last one, but the QTL effects were sorted such that the additive effects increased along the simulated chromosomal segment. With this ordering, the covariance between any two QTLs can be predicted and, unlike in the last simulation, the covariance summation is no longer zero (Δ ≠ 0). With this set-up, which is not likely to be true in practice but illustrates an extreme case, the HE regression and the mixed model gave very different estimates. With increased correlation between markers, the HE regression gave inflated estimates and the mixed model gave deflated estimates. Although both methods gave biased estimates, Equation (11) could still predict the results of the HE regression correctly (see Table 11).

Table 11

Simulation evaluations of Equation (11) when the covariance summation is not zero (Δ ≠ 0).

LD (ρ)	Equation (11)^a	HE results^b	Mixed model results^c
ρ = 0	0.500 (0.020)	0.497 (0.020)	0.499 (0.041)
ρ = 0.25	0.715 (0.019)	0.712 (0.019)	0.414 (0.041)
ρ = 0.5	0.853 (0.015)	0.850 (0.015)	0.347 (0.043)
ρ = 0.75	0.878 (0.011)	0.881 (0.011)	0.291 (0.048)

Calculated given Δ ≠ 0.

The standard errors in parentheses indicate the mean of the standard error from 100 simulation replications.


Simulation V: application to case-control data

The HE regression was applied to case-control data. A polygenic model of L equifrequent biallelic QTLs was simulated; each locus was in Hardy–Weinberg equilibrium and any pair of QTLs was in linkage equilibrium. The heritability on the liability scale was denoted h². The effect of each QTL was sampled from N(0, σ²), with σ² = h²/[2 × p × (1 − p) × L], in which p = 0.5. The phenotype of each individual on the liability scale was scaled to unit variance. Cases were ascertained on the liability scale according to the disease prevalence K. Individuals were sampled from the described reference population until 1000 cases and 1000 controls were recruited. The heritability on the liability scale was 0.5. In order to cover a broad range of scenarios, three levels of QTL number, L = 100, 1000, and 10,000, and three levels of disease prevalence at the population level, K = 0.1, 0.01, and 0.001, were adopted. Nine scenarios were simulated in total, and 30 independent simulation replications were implemented for each scenario. The genetic relationship matrix was constructed using all individuals, and the allele frequencies were estimated from the sample. The genetic additive variance components were estimated with the HE regression and the mixed model method. As the directly estimated variance component was on the observed scale and could be greater than 1, we employed both REML and non-constrained REML for the mixed model methods, the latter of which allowed the heritability to be greater than 1. As illustrated in Figure 1, the estimated h² was compared across all three methods. In general, the HE regression resulted in a more precise estimate than either REML or non-constrained REML. For the mixed model methods, either with or without constraints, REML often underestimated the variance components. The bias depended on two factors: the number of QTLs (in each row panel) and the prevalence of the disease (in each column panel). With fewer QTLs, a lower prevalence could exacerbate underestimation by the mixed model.
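Estimates obtained directly from 0/1 case-control phenotypes are on the observed scale and must be converted to the liability scale before they can be compared with the simulated h². A commonly used conversion is the one described by Lee et al. (2011); the text above does not spell out which formula was applied, so the sketch below is an assumption about that step. K is the population prevalence and P is the proportion of cases in the sample (0.5 for 1000 cases and 1000 controls); the function name is illustrative.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed: float, K: float, P: float) -> float:
    """Convert an observed-scale heritability to the liability scale.

    Assumed formula (Lee et al., 2011):
        h2_liab = h2_obs * [K (1 - K)]^2 / [z^2 * P (1 - P)]
    where z is the standard normal density at the liability threshold
    that truncates the upper proportion K of the population.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - K)
    z = std_normal.pdf(threshold)
    return h2_observed * (K * (1.0 - K)) ** 2 / (z ** 2 * P * (1.0 - P))


# Conversion factors for the three simulated prevalences with a
# balanced case-control sample (P = 0.5).
for K in (0.1, 0.01, 0.001):
    factor = observed_to_liability(1.0, K, 0.5)
    print(f"K = {K}: liability-scale h2 = {factor:.3f} x observed-scale h2")
```

Under this conversion the factor falls below 1 for rare diseases in a balanced sample, so an observed-scale estimate can exceed 1 even when the liability-scale heritability is 0.5, which is consistent with the need for a non-constrained REML noted above.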

Figure 1

Estimation of heritability on the liability scale using the HE regression and mixed linear model methods. In each row, from left to right, each panel represents the case-control sample simulated under the same heritability on the liability scale (h2) but with different prevalence. In each panel, the vertical axis indicates the estimated heritability on the liability scale (h2), whereas the horizontal axis indicates which of the three methods (REML, non-constrained REML, and HE regression [least square estimate]) was used. The standard error of the mean (SEM) is indicated at the top of each bar.

Conclusion

The analytical results summarized in Table 6 were evaluated using Monte Carlo simulation, and were highly precise in general. The single-marker HE regression is a competitive tool for QTL mapping, particularly with a large sample size (Simulations I, II). The HE regression and the mixed model method were equivalent, with both providing a precise heritability estimate for a typical polygenic trait (Simulation III). However, if QTL effects are correlated, neither the HE regression nor the mixed model method gave an unbiased estimate (Simulation IV). For case-control studies, the HE regression should be preferred in general (Simulation V).

Genetic analysis repository (GEAR)

In order to facilitate application of the HE regression method to estimate complex trait heritability, the GEAR software was developed. GEAR is written in Java and runs on many operating systems, such as Windows, Mac, and Linux/Unix, as long as a Java virtual machine is available. GEAR has been demonstrated to function in the following situations. It can generate the genetic relatedness of unrelated individuals, as formulated in Equation (3), based on whole-genome markers. It can estimate the effective number of markers based on a genetic-relatedness matrix. It can estimate heritability with the HE regression. GEAR can read genotype data saved in PLINK binary format (Purcell et al., 2007). GEAR can be downloaded from https://sourceforge.net/projects/gbchen/files/GEAR/ and the online GEAR manual can be found at https://sourceforge.net/p/gbchen/wiki/GEAR/

Discussion

Historically, linkage had been the major tool for QTL mapping of complex traits since the 1970s, and it was gradually replaced by association analysis when GWAS became popular (The Wellcome Trust Consortium, 2007). The transmission/disequilibrium test (TDT; Ott, 1989; Spielman et al., 1993) triggered the transition from linkage to association for family-based studies. In 2000, the generalized TDT was proposed (Laird et al., 2000), which is robust to population stratification. Shortly after that, population-based designs became the main source of genetic data, and GWAS has remained the leading method for estimating heritability up until now. Extension of the original HE regression to association studies can be seen as an effort to increase the diversity of GWAS analysis tools.

In this study, we established a theory for a modified HE regression, in which IBS scores replace IBD scores. Although IBS has been used to detect IBD in linkage studies (Lange, 1986a,b; Bishop and Williamson, 1990), there it is considered to be a way of inferring IBD for relatives, such as sib pairs, when founder genotypes are unavailable. In this study, IBS served as the key concept to detect association in unrelated samples rather than in relatives.

Linkage and association have both been proposed to estimate heritability of complex traits. For example, for height, the heritability estimated from linkage studies is around 0.8 (Visscher et al., 2006; Perola et al., 2007), but around 0.4 from association studies (Yang et al., 2010). Thus far, there is no clear conclusion regarding the fundamental difference between the heritability estimated from these two kinds of methods. Despite their mathematical similarity, differences in application and interpretation should be appreciated.

Under various scenarios, the mathematical expectations of the regression coefficients, as well as the sampling variances, were derived. There is substantial mathematical similarity between the IBD HE regression and the IBS HE regression.
For example, for both models under the single-marker scenario, the regression coefficients can be expressed in a unified form, E(b) = −2ρ²σ². As these two models are based on different genetic mechanisms, the interpretations of their respective regression coefficients are naturally different. In the IBD-based HE regression, E(b) = −2(1 − 2c)²σ², in which 1 − 2c ranges from 0 to 1, whereas in the IBS-based HE regression, E(b) = −2τ²σ², in which τ ranges from −1 to 1. As the values of r and R rely upon the allele frequencies of the biallelic marker and the biallelic QTL, they reach either −1 or 1 only if the marker has the same allele frequency as the QTL. However, after squaring, both (1 − 2c)² and τ² lie between 0 and 1, inclusive.

Equation (11), E(b) = −2σ²Λ (ignoring Δ), provides a possible way to estimate the LD pattern between causal loci and markers. If the true heritability is known, it is possible to estimate ρ², the average LD between QTLs and markers, as illustrated after Simulation III. However, the causal loci can be in any possible form, such as SNPs, chromatin markers, or methylation markers, and different methods capture genetic variation in different forms. In practice, the obstacle in estimating ρ² lies in how heritability estimated from different methods, such as linkage and association, or from different genotyping platforms, such as SNP markers and methylation markers, can be connected to each other. Equation (11) sheds light on the investigation of how QTLs are distributed along the genome.

Application of the HE regression to heritability estimation of complex traits revealed that the HE regression seems to be equivalent to the mixed model approaches in general (Δ = 0). A similar equivalence was previously established for linkage analysis (Sham and Purcell, 2001). However, for GWAS data, it should be noted that the equivalence is conditional on the genetic architecture of a trait.
As indicated in the simulation, the equivalence holds only for a typical polygenic genetic architecture, which may be true for many traits without ascertainment or selection, such as height (Yang et al., 2010). However, when substantial covariance exists between causal loci, the equivalence does not hold and neither the HE regression nor the mixed model method gives unbiased estimates. In real studies, this kind of covariance may be a result of selection in active regions, such as the HLA loci, which harbor many signals; then, the HE regression and the mixed model estimates may differ. The equivalence may also break down under other circumstances that have not been investigated in this study.

In GWAS, many samples are collected for complex diseases, often in a case-control design. Complex disease prevalence is often low; consequently, the cases are under strong ascertainment, which disrupts the assumptions underlying the mixed linear model. As observed in the simulation studies, the HE regression is more precise in estimating heritability than the mixed linear model for case-control studies across a broad range of scenarios. Use of the HE regression is particularly advantageous when the disease prevalence is low and the causal loci are few. In their original work, Lee et al. (2011) assumed an infinitesimal model of complex diseases. However, when this assumption was violated in simulation (as is likely in practice as well), the mixed linear model method gave biased estimates of heritability. Thus, whenever possible, the HE regression method is preferable for estimating heritability of complex traits.

As derived in this work, the HE regression and the mixed model method are equivalent under polygenic genetic architecture. In other words, when the estimates generated by these two methods differ significantly for the same data, caution is warranted.
As investigated in the simulation, the real heritability may lie between the estimates of these two methods. Speed et al. (2012) previously investigated the assumptions underlying the mixed model method and proposed alternative weighting methods to adjust the heritability estimation. However, as their weighting method depends on genetic architecture, which is often unknown, it is difficult to justify which weighting method is appropriate for given data (Gusev et al., 2013). Thus, simply comparing the estimates from the HE regression and the mixed model method may offer an alternative way of justification.

It should be noted that the HE regression method is based on the least-squares framework rather than the maximum likelihood framework on which many mixed model methods are based (Yang et al., 2010; Speed et al., 2012; Lee et al., 2013). As numerical methods, maximum likelihood methods give estimates that optimize the likelihood under their assumptions, which may break down in practice. Given recent interest in comparing estimates with or without imputation of the genome (Gusev et al., 2013), controversial results have been observed. It is unclear what an increased or decreased estimate of heritability indicates after imputation. A reasonable guess is that the local covariance structure, as indicated in Equation (11), changes and eventually brings about different estimates. The proposed IBS HE regression, which depends on fewer assumptions than maximum likelihood methods, may help resolve the controversy.

In practice, undocumented relatedness may creep into samples and produce suspiciously high relatedness. As discussed previously (Powell et al., 2010), Equation (3) gives a score of 0 for a pair of unrelated individuals, 0.5 for first-degree relatives, and 1 for duplicated individuals or monozygotic twins. It seems easy to eliminate related individuals if a cutoff, say a relatedness of less than 0.05, is applied to the sample.
For association studies, population stratification may increase false positive rates. To reduce the threat of population stratification, phenotypes can be adjusted by principal components (Price et al., 2006) and then fitted in the HE regression. If a sample is admixed, the power of the HE regression may be reduced if, in the ancestral populations, the allele frequency spectra differ from each other or genetic heterogeneity exists in the genetic architecture of the trait in question. More investigation will be required to overcome this challenge.

The variance components have often been estimated via REML (Yang et al., 2010; Lee et al., 2011). Despite its various merits, REML is computationally expensive, particularly for large sample sizes. Its computational complexity is on the scale of O(tN³), that is, cubic in the sample size N for each of t rounds of iteration. The time complexity of the HE regression is far lower, asymptotically O(2N²), given the two parameters included in the model, the intercept and the regression coefficient. Given the large sample sizes often employed in GWAS, the computational burden can be dramatically reduced.

Although the HE regression method is derived for a simple-regression scenario, its extension to a multiple-regression scenario is straightforward. For instance, the genetic relatedness between each pair of individuals can be constructed on each chromosome, and then all chromosome-based relatedness scores can be fitted in the regression framework. In addition, the difference between a pair of phenotypes can also be expressed as a cross product and squared sum (Sham and Purcell, 2001).
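The multiple-regression extension can be illustrated with two simulated "chromosomes". The sketch below is a hedged example rather than the GEAR implementation: it assumes the squared-difference form of the HE regression, relatedness matrices built from genotypes standardized by sample allele frequencies, and QTLs in linkage equilibrium. The per-chromosome heritabilities (0.3 and 0.2) and all names are illustrative choices.

```python
import numpy as np


def standardized_genotypes(rng, n_ind, n_qtl):
    """Simulate equifrequent biallelic loci and standardize each column."""
    geno = rng.binomial(2, 0.5, size=(n_ind, n_qtl)).astype(float)
    freq = geno.mean(axis=0) / 2.0
    return (geno - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))


def rescale(v, variance):
    """Centre v and scale it to the requested sample variance."""
    return (v - v.mean()) / v.std() * np.sqrt(variance)


rng = np.random.default_rng(7)
n_ind, qtl_per_chr = 1000, 50
true_h2 = (0.3, 0.2)  # illustrative variance explained by each chromosome

z_by_chr = [standardized_genotypes(rng, n_ind, qtl_per_chr) for _ in true_h2]
grm_by_chr = [z @ z.T / qtl_per_chr for z in z_by_chr]

# Phenotype: one genetic component per chromosome plus a residual.
y = rescale(rng.normal(size=n_ind), 1.0 - sum(true_h2))
for z, h2 in zip(z_by_chr, true_h2):
    y = y + rescale(z @ rng.normal(size=qtl_per_chr), h2)
y = rescale(y, 1.0)

# Multiple regression of squared phenotype differences on the
# chromosome-specific relatedness scores (pairs i < j only).
upper = np.triu_indices(n_ind, k=1)
sq_diff = (y[:, None] - y[None, :])[upper] ** 2
design = np.column_stack([np.ones(upper[0].size)]
                         + [grm[upper] for grm in grm_by_chr])
coef, *_ = np.linalg.lstsq(design, sq_diff, rcond=None)

# Each slope estimates -2 * (variance attributable to that chromosome).
h2_by_chr = -coef[1:] / 2.0
print("per-chromosome estimates:", np.round(h2_by_chr, 3))
print("total:", round(float(h2_by_chr.sum()), 3))
```

Because the design matrix has only one column per chromosome plus the intercept, the cost of the fit grows with the number of pairs rather than with the cube of the sample size.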

Author contributions

Guo-Bo Chen conceived the study, derived the equations, conducted the simulation studies, and calculated the analytical results. Guo-Bo Chen developed and maintained the GEnetic Analysis Repository (GEAR) software. Guo-Bo Chen wrote the manuscript.

Conflict of interest statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

31 in total

1. Equivalence between Haseman-Elston and variance-components linkage analyses for sib pairs.

Authors: P C Sham; S Purcell
Journal: Am J Hum Genet Date: 2001-05-14 Impact factor: 11.025

2. Principal components analysis corrects for stratification in genome-wide association studies.

Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330

3. Estimation of SNP heritability from dense genotype data.

Authors: S Hong Lee; Jian Yang; Guo-Bo Chen; Stephan Ripke; Eli A Stahl; Christina M Hultman; Pamela Sklar; Peter M Visscher; Patrick F Sullivan; Michael E Goddard; Naomi R Wray
Journal: Am J Hum Genet Date: 2013-12-05 Impact factor: 11.025

4. A test statistic for the affected-sib-set method.

Authors: K Lange
Journal: Ann Hum Genet Date: 1986-07 Impact factor: 1.670

5. The affected sib-pair method using identity by state relations.

Authors: K Lange
Journal: Am J Hum Genet Date: 1986-07 Impact factor: 11.025

6. The investigation of linkage between a quantitative trait and a marker locus.

Authors: J K Haseman; R C Elston
Journal: Behav Genet Date: 1972-03 Impact factor: 2.805

7. Common SNPs explain a large proportion of the heritability for human height.

Authors: Jian Yang; Beben Benyamin; Brian P McEvoy; Scott Gordon; Anjali K Henders; Dale R Nyholt; Pamela A Madden; Andrew C Heath; Nicholas G Martin; Grant W Montgomery; Michael E Goddard; Peter M Visscher
Journal: Nat Genet Date: 2010-06-20 Impact factor: 38.330

8. Estimation and partition of heritability in human populations using whole-genome analysis methods.

Authors: Anna A E Vinkhuyzen; Naomi R Wray; Jian Yang; Michael E Goddard; Peter M Visscher
Journal: Annu Rev Genet Date: 2013-08-22 Impact factor: 16.830

9. Quantifying missing heritability at known GWAS loci.

Authors: Alexander Gusev; Gaurav Bhatia; Noah Zaitlen; Bjarni J Vilhjalmsson; Dorothée Diogo; Eli A Stahl; Peter K Gregersen; Jane Worthington; Lars Klareskog; Soumya Raychaudhuri; Robert M Plenge; Bogdan Pasaniuc; Alkes L Price
Journal: PLoS Genet Date: 2013-12-26 Impact factor: 5.917

10. Combined genome scans for body stature in 6,602 European twins: evidence for common Caucasian loci.

Authors: Markus Perola; Sampo Sammalisto; Tero Hiekkalinna; Nick G Martin; Peter M Visscher; Grant W Montgomery; Beben Benyamin; Jennifer R Harris; Dorret Boomsma; Gonneke Willemsen; Jouke-Jan Hottenga; Kaare Christensen; Kirsten Ohm Kyvik; Thorkild I A Sørensen; Nancy L Pedersen; Patrik K E Magnusson; Tim D Spector; Elisabeth Widen; Karri Silventoinen; Jaakko Kaprio; Aarno Palotie; Leena Peltonen
Journal: PLoS Genet Date: 2007-05-02 Impact factor: 5.917
