Literature DB >> 19236714

Genotyping and inflated type I error rate in genome-wide association case/control studies.

Abstract

BACKGROUND: One common goal of a case/control genome wide association study (GWAS) is to find SNPs associated with a disease. Traditionally, the first step in such studies is to assign a genotype to each SNP in each subject, based on a statistic summarizing fluorescence measurements. When the distributions of the summary statistics are not well separated by genotype, the act of genotype assignment can lead to more potential problems than acknowledged by the literature.
RESULTS: Specifically, we show that the proportions of each called genotype need not equal the true proportions in the population, even as the number of subjects grows infinitely large. The called genotypes for two subjects need not be independent, even when their true genotypes are independent. Consequently, p-values from tests of association can be anti-conservative, even when the distributions of the summary statistic for the cases and controls are identical. To address these problems, we propose two new tests designed to reduce the inflation in the type I error rate caused by these problems. The first algorithm, logiCALL, measures call quality by fully exploring the likelihood profile of intensity measurements, and the second algorithm avoids genotyping by using a likelihood ratio statistic.
CONCLUSION: Genotyping can introduce avoidable false positives in GWAS.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Year: 2009 PMID： 19236714 PMCID： PMC2679732 DOI： 10.1186/1471-2105-10-68

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.169

Background

One common goal of a genome wide association (GWAS) study is to search the entire genome for single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) associated with a disease or some other phenotype. In this article, we focus our analysis on SNPs. The two possible alleles at a SNP are arbitrarily labeled A and B, and association is often tested by measuring and comparing the frequencies of the genotypes AA, AB, and BB, in case and control groups. As technology currently allows close to one million SNPs to be examined simultaneously, there is a need for fast, automated methods to test for association. As only a small minority of SNPs are expected to be associated with the disease, even a modest false positive rate could bury true associations beneath those occurring by chance. Using Affymetrix 500 k GeneChips as an example, each of the 500,000+ SNPs is represented by a series of probes on a pair of arrays. Each probe is an oligonucleotide designed to bind to either the A or B allele. A subject's fluorescently labeled DNA is allowed to hybridize with these probes, and then a spectrometer measures the relative fluorescent levels between the A and B probes. Each genotyping algorithm (see Methods) summarizes the fluorescence information, or the likelihood a subject has allele A at that SNP, by its own statistic. In any population, these statistics usually cluster into three groups, corresponding to the three genotypes. Studies have noted that 1) The mean and variance of these clusters, or the shape of their distribution in general, varies by SNP and 2) For a single SNP, because of differences in processing or duration of storage, the shape of the statistic's distribution can differ between the case and control groups [1-3]. The current test of association requires calling, or assigning a genotype, to each SNP for each study subject and then comparing the called genotypes between the case and control groups. The majority of SNPs are easy to call and any of the available methods will call them correctly. Unfortunately, there is a difficult minority that cannot be easily clustered into three distinct groups. Because there can be as many as 500,000 SNPs, this minority can greatly inflate the type I error rate and cause the large, characteristic, deviations from the x = y line in the qq-plots of test statistics. Most studies assume these consequences are the unavoidable results of population substructure and poor data. In this paper, we dispel the myth that these are the sole issues. In fact, the inflated type I error rate and general misbehavior of the test statistic may also result from the act of genotype assignment and a poor choice of statistical methodology. The goal of this manuscript is two-fold. First, our primary goal is to show that genotyping overlapping clusters can lead to potential problems that we have yet to see fully acknowledged in the literature. The proportions of each called genotype need not equal their true proportions in the population, even as the number of subjects grows infinitely large. As we compare genotype calls, p-values from tests of association will be anti-conservative when the distribution of the summary statistic differs between cases and controls. Moreover, the called genotypes for two subjects need not be independent, even when their true genotypes are independent. Therefore, p-values from tests of association can be anti-conservative, even when the distributions of the summary statistic for the cases and controls are identical, a fact we believe has yet to be fully demonstrated. Although previous studies have examined the effects of genotyping error on tests of association [4-6], studies has neither fully explored the effects caused by case/control differences in distributions nor dependence of error. Second, we discuss two new tests that can circumvent these potential problems. One test compares calls made from a genotyping algorithm designed to minimize the type I error. The second test compares the fluorescence distributions, instead of the called genotypes. We start the Methods section by discussing currently available methods for genotyping SNPs and testing for association. Concurrently, we introduce logiCALL, our new genotyping algorithm, and the likelihood ratio-based test of association. In the Results and Discussion section, we start by showing that the proportion of genotypes called AA, AB, and BB need not converge to the true population proportions. Then we discuss how the called genotypes can be dependent. We conclude the section by comparing the proposed tests of association with the current standards through both simulation studies and real data analysis. Then a short Conclusions section summarizes the key points.

Methods

Calling Genotypes

There is currently a wide variety of programs available for genotyping SNPs. The most popular supporting Affymetrix are RLMM [7], BRLMM [8], CRLMM [9], CHIAMO [10], SNiPer-HD [11], and MAMS [12]. The most popular program for Illumina is their own proprietary software, BeadStudio, but other methods have been recently suggested by Moorhead et. al. [3], Teo et. al. [13], and Dunning et. al. [14] (Table 1). To introduce these methods, we start by defining notation. Let there be n subjects. Let G∈ {AA, AB, BB} be the true genotype of SNP j in subject i, 1 ≤ i ≤ n. For Affymetrix chips, assume there are nprobe quartets representing each SNP on an array, and let be the normalized probe intensities for subject i, SNP j, and probe k. Here, the subscripts PMand MMsignify the perfect match and mismatch probes for allele A. PMand MMare similarly defined for allele B. The log transformed intensities will be . To maintain notational consistency, for Illumina chips, we denote the BeadStudio intensity values of the two SNP alleles by . As we will only discuss a single SNP for the majority of the paper, we will omit the superscript "j" from all future notation.

Table 1

Programs available for genotyping SNPs.

Name	Summary Statistic	MM¹	Data²	Data³	Notes
RLMM	{Θ_A, Θ_B}	No	T	M
BRLMM	1 a s i n h ( 4 ) a s i n h ( 4 ( I A − I B ) I A − I B )	No	E-U	M	Assumes genotypes in "training" data are known. "Training" data only uses high quality SNPs. Incorporates info from other SNPs as a Bayesian Prior.
CRLMM	{Θ_A+- Θ_B+, Θ_A-- Θ_B-}	No	T	L	Corrects for the effect of total intensity level and probe length on {S_ij} through more complex method, and allows corrections to vary by array.
CHIAMO	{ 1 n p ∑ k = 1 n p Y A k , 1 n p ∑ k = 1 n p Y A k }	Yes	E-L, T*	W	CHIAMO is a Bayesian hierarchical mixture model and is greatly simplified by this brief summary
SNiPer-HD	{ R 1 , ... , R n p } , where R k = ( I P M A i j k ) / ( I P M A i j k + I P M B i j k )	No	E-U	W	Assumes genotypes in "training" data are unknown and requires the EM algorithm. "Training" data should only use high quality SNPs.
Moorhead	1 s i n h ( 2 ) s i n h ( I P M A + I P M B I P M B )	N/A	E-U	W	Originally for MIP, but applicable to Affymetrix. Plagnol demonstrated how to link genotype probabilities between cases and controls.
logiCALL	{ 1 s i n h ( 2 ) s i n h ( I P M A + I P M B I P M B ) , I P M A + I P M B }	No	E-L	W-F	Designed to lower false positive rate and assigns calls based on cumulative distribution, not density functions.

1Indicates use of mismatched probes

2Parameters were estimated by Experimental or Training data. For experimental data, under the null, cases and control genotype proportions could be Linked or Unlinked. * indicates optional.

3Distance can be Mahalanobis, W eighted Likelihood, or unweighted Likelihood

Programs available for genotyping SNPs. 1Indicates use of mismatched probes 2Parameters were estimated by Experimental or Training data. For experimental data, under the null, cases and control genotype proportions could be Linked or Unlinked. * indicates optional. 3Distance can be Mahalanobis, W eighted Likelihood, or unweighted Likelihood With the exception of dynamic modeling (DM) [15], all calling algorithms share the same general form, and we exploit this form to summarize their key features. The process of transforming raw signal into genotypes can be divided into four steps: 1) Normalize the intensity values; 2) Describe the normalized intensity values by a single, possibly multivariate, summary statistic; 3) Estimate the mean and variance of the summary statistic for the three possible genotypes, AA, AB, and BB; and 4) Compare the value of the statistic from a subject to the group parameters found in the third step to make the call. The universal first step, normalizing the intensity values, is tangential to our discussion here. The second step is to choose a statistic, Sthat Summarizes the intensities. For example, RLMM models the probe intensities as and . Then, . In the updated version, CRLMM models sense(+) and antisense(-) probes separately, resulting in a 4D statistic, Each method assumes that the distribution, φ(S|θ), of their statistic in a given population q is a Mixture of multivariate normal distributions where θ, to be defined below, are the parameters characterizing the distribution in population q. When it is clear we are discussing only a single population, we will omit the superscript q. Although problems can arise when the distribution is not a true mixture of normals, those complications are beyond the scope of this paper [16]. For completeness, we point out that a minority of programs, including CRLMM, allow these parameters to vary within a group (e.g. to be subject/array specific). Ignoring that some methods allow for an additional null distribution, the general form is where ψ(·) is the multivariate normal density. The three mean vectors, , variance matrices, Σ ≡ {Σ, Σ, Σ}, and probabilities, p ≡ {p, p, p}, correspond to the three possible genotypes, AA, AB, and BB. Define and later, we let Φ(·) and Φ(·) be the cumulative distribution for a normal variable and a mixture of normals. The third step is to estimate θ. Some algorithms, such as RLMM, use a training data set, where the genotypes are known. Other algorithms, such as SNiPer-HD, use no training data, and find the best estimates that describe their experimental values. The fourth step is to assign a genotype, , to SNP j in subject i. Often, for a given value of S, the assigned genotype maximizes a similarity function: = argmaxD(g, S|θ). The similarity function is usually a modified version of one of the following three quantities: Mahalanobis distance: , Unweighted Likelihood: , or the Weighted Likelihood: p. The similarity function is also modified to ensure monotonicity of assignment. When we let S= {M, A}, we force D(AB, S|θ) = D(BB, S|θ) = 0 if M≤ μ, D(BB, S|θ) = 0 if μ≤ M≤ μ, D(AA, S|θ) = 0 if μ≤ M≤ · μ, and D(AB, S|θ) = D(AA, S|θ) = 0 if M≥ μ, where μ, in this case, is the mean of Mwhen G= g. This modification is standard in calling algorithms. As we do not know the true value of the parameters in experiments, we replace D(g, S|θ) by D(g, S|). A subject's SNP may not be called, or assigned a missing value, if the difference or ratio between D(, S|) and D(g2, S|) is not large enough, where g2is the genotype with the second largest value of D(g, S|). A SNP may be omitted from further study if too many values were set to missing. Table 1 describes the details of the four steps for popular methods. For many purposes or to understand the details of the method, especially in handling rare alleles, this table will seem an oversimplification. For our purposes here, it highlights the features of interest.

Tests of Association

The current tests of association start by calling genotypes for a given SNP j in a group of subjects with the disease and in a group of controls. They then compare the resulting proportions, and , from these Affected and Unaffected groups using either a Cochran-Armitage test or logistic regression. Here , where the indicator function is defined by 1(x) = 1 if x is true, 0 otherwise, Q∈ {A, U} is the disease status for individual i, and nis the number of subjects with disease status q. In this manuscript, any 'p-value' from a genotype-based association test will be calculated using ANOVA on the logistic regression model with Qand genotype (unordered grouping) as the dependent and independent variables. Standard tests tend to err anti-conservatively as we will discuss below. We will propose four alterations that can reduce type I error rate, with only a minimal decrease in power. These are the four differences that separate logiCALL from standard methods. The first is based on the observation, which is discussed later, that the likelihood profile of φ(S|θ) will have multiple local maxima near the overall maximum. When estimating θ, the EM algorithm converges to multiple solutions. For many of those solutions, the resulting parameter set, , satisfies . For each parameter set satisfying this inequality, we will make a new group of genotype assignments, . If more than 10% of such assignments disagree with (τ = 0.06), we label that subject's call as questionable. We also continue the practice of marking calls with small values of D(, S|)/D(g2, S|) as questionable. The second alteration is that we do not discard questionable calls, an act which can create false positives. Instead, we assign questionable Sso the proportions of genotypes in the cases and controls are as similar as possible, which is defined as minimizing , with the restriction that the final call for subject i must be either or g2. Here, we let g2be the genotype which is either the runner-up in terms of distance or the most common genotype among the dissenting calls, depending on why the genotype was labeled as questionable. The third alteration, which is already incorporated into other programs is to perform the EM algorithm under the null hypothesis that the genotype proportions in the two populations are identical [3]. The fourth is the use of a weighted Mahalonobis distance, which is defined later. Given these changes, logiCALL then compares the estimated genotype proportions in cases and controls using logistic regression. Note that none of the changes affect calls for the vast majority of SNPs. We also introduce a completely new method for testing association based on a likelihood ratio statistic. For our method, steps 1 and 2, normalization and choice of summary statistic, can mimic any of the previously described methods. As our real data to be analyzed was collected on Illumina chips, we choose the statistic from Moorhead, et al. [3], for exposition. We then assume that Sfollows the mixture model described by equation (1), but allow the parameters to differ by disease status: and θ = {θ, θ}, where Although θ contains 18 parameters, it has only 16 degrees of freedom (df) because for q ∈ {A, U}. Our new test will reject the null hypothesis of no association, when LR(), the likelihood ratio, is large, where Clearly, the restricted parameter space, is a subset of the unrestricted parameter space, Ω. In an ideal scenario, the distribution of 2log(LR) would converge to a chi-squared distribution, , with 2 degrees of freedom. Therefore, the 'p-value' from a likelihood ratio-based test will be calculated as 1 - (2log(LR)).

Data Source

To demonstrate the problems of genotyping and compare the genotype- and likelihood ratio-based tests of association, we use three types of data. First, for discussion, we may assume a hypothetical study measuring a one dimensional summary statistic, S, for a SNP j with only two possible genotypes, G∈ {0, 1}. Furthermore, to show problems can exist even under the best conditions, where model and truth coincide, we assume that Sfollows a normal distribution given Qand G, and that the full distribution can be described by . We compare our two new tests of association to a standard method using simulated data. The standard method mimics the general Bead-Studio approach by a) fitting parameters with the EM algorithm; b) calling genotypes based on the Mahalanobis Distance; c) removing all calls where D(, S|)/D(g2, S|) > 0.5; and d) comparing the two sets of resulting estimates, and . We generated 10 simulated datasets, containing 1000 subjects (500 cases, 500 controls) and 303,100 SNPs for each of 18 scenarios. For each gene j and each subject i, we generated a 2D summary statistic (M, A). The distribution of Mdepended on genotype. If G= AA, then M~ 2X - 1, and if G= BB, then M~1 - 2X, where X ~ beta(α = 3, β = 30). If G= AB, then Malso followed a beta distribution, but the parameters varied by SNP, disease status, and scenario. For all SNPs and all subjects, A~ N (10, 1.5). We generated three types of SNPs, background, shifted, and influential. First, 300,000 background SNPs were generated and included in all 10 × 18 = 180 data sets. For each SNP, a single minor allele frequency was generated from a uniform(0.2, 0.4) distribution and genotype probabilities were generated assuming Hardy-Weinberg Equilibrium. Here, E [M|G= AB] = 0 and was independent of disease status. These SNPs, which formed three distinct clusters, can be easily identified and represent a well-behaved group. For 3,000 shifted SNPs, MAF ~uniform(0.2, 0.4) and genotype probabilities were generated assuming Hardy-Weinberg Equilibrium. Here, E [M|G= AB] ∈ {-0.739 + 0.2, - 0.739 + 0.3 - 0.739 + 0.5}, where we note E [M|G= AA] = -0.739 and E [M|G= AB, Q= A] - E [M|G= AB, Q= U] ∈ {0, 0.2}. This group represents difficult to call SNPs. For 100 influential SNPs, MAF ~uniform(0.2, 0.4) and genotype probabilities were chosen so that, under a disease prevalence of 0.01 and a model of additive effects, the genotype relative risk for subjects homogeneous for the minor allele, P(Q= A|BB)/P (Q= A|AA) ∈ {1.5, 2.0, 2.5}. Combing the degree of shift for poor quality SNPs and the effect size of truly associated SNPs, we have a total of 18 scenarios used in our simulation. The next set of data is from a recent GWAS of Inflammatory Bowel Disease (IBD) that compared 983 subjects with IBD to 1004 subjects without the disease. Using Illumina microarrays, 308,330 SNPs on the autosomal chromosomes were tested. Jewish and non-Jewish cohorts, approximately equal in size, were analyzed separately, a practice continued here. Details have been previously published [17,18]. Because the overwhelming majority of the SNPs are easy to genotype, as any of the summary statistics neatly divide into three clusters, we chose a 3137 difficult SNP subset where at least two clusters overlap (definition below). Because association was tested separately in Jewish and non-Jewish cohorts, a total of 3137 × 2 = 6274 tests were possible. To demonstrate called genotype dependency, we bootstrapped 40 samples, ignoring case/control status, of 500 subjects for each SNP. A sample would be discarded, and replaced by another random selection, if one or both of the groups were lacking an AA (S<-0.7) or a BB (S> 0.7) genotype. We defined difficult SNPs as follows. For SNP j, we first estimated the density, (M) nonparametrically using the R function 'density(adjust = 0.3)'. In theory, (M) is a mixture of three normal distributions, corresponding to the three genotypes. If the SNP is well-behaved, then the three underlying densities will not overlap, and the empirical density (M) will attain minima near 0 in the valleys between μand μand between μand μ. If either of these minima exceeded 0.2, then at least two of underlying densities overlapped, and that SNP was defined as difficult. To speed the process, we found that approximating the center of the peaks (i.e. μ, μ, and μ) by the median values of Min the three windows, {M≤ -0.6; -0.3 ≤ M≤ 0.3, M≥ 0.5}, worked well.

Results and discussion

Two Genotype Example: Parameters

We choose to use the hypothetical, two-genotype, study, to highlight that the estimated parameters can be inconsistent even in the simplest scenario, where the summary statistic is distributed normally and our fitted model is correct. When dealing with only a single population, we define the parameter p0 (p1) to be the probability that a subject's true genotype is 0 (1). We define the parameter to be the probability that a subject's called genotype is 0 (1). Probabilities of called genotypes implicitly depend on the number of subjects in the sample. We then define the parameter c* to be the point which is equidistant to the two genotype groups when θ is known. Therefore, μ0 ≤ c* ≤ μ1 is defined to be a solution to equation 6 For the remainder of the paper, we shall assume such a c* exists. This assumption is safe in practice as genes with extremely rare minor alleles are discarded. When D is the Mahalanobis distance, c* is a solution to We define the final parameter, , to be the probability that a subject's Svalue is less than c* given their true genotype is 0 (1). Our first goal in this Results section is to show that may be true when two clusters, {S:G= 0} and {S:G= 1} overlap. We will refer to - p0 as asymptotic bias, or bias, and we note that it depends on D, p0, and the magnitude of the overlap. Here, we also define C, our estimate for c*, to be the solution to the equation such that . If no such C exists, to be consistent with monotonicity of assignment, we let where D(G= g, S= C) is the smaller of the two measures when . The variable C is the cut-point or threshold value of S which separates 0 and 1 calls. Therefore, by monotonicity of assignment, S C ⇒ = 1. If S= C, we assign the genotype randomly. In the specific example of the Mahalanobis distance, C is usually the solution to Therefore, by convergence of the MLE, we know that which will useful for the next section.

Two Genotype Example:

We start by assigning genotypes according to their Mahalanobis distance, as done in BRLMM. Recall, we assign subject i to genotype 0 if S Therefore, given the genotype, the limiting conditional probabilities, P( = 1 ⋃ ≠ G|G= 0) and P( = 0 ⋃ ≠ G|G= 1) are equal. If p0 > p1, then the unconditional probabilities cannot be equal, specifically Obviously the opposite inequality holds if p0 Clearly, when p0 = p1 = 0.5, for all values of . However, when p0 ≠ p1, the bias is a non-zero function , and therefore depends on the parameter group, (Figure 1). As shown by Figure 1, the bias can be quite large when either and/or |p0 - 0.5 | is large.

Figure 1

(b) The bias, - p0 (y-axis) depends on P(x-axis), and is shown for different values of p0.

(b) The bias, - p0 (y-axis) depends on P(x-axis), and is shown for different values of p0. Next, assume that genotype assignments are based on a likelihood, or weighted likelihood measure. For a value ω ∈ c(0, 1), the probability that φ(S) exceeds ω, or , changes with , where returns a value greater than μ. Therefore, φ0(s) = φ1(s) does not imply anything about the relationship between Φ0(s) = Φ1(s). We illustrate the potential for bias by a simple example where μ0 = 0, = 1, and μ1 = 1. Let p0 = p1 = 0.5. Then, for any value of . Equivalently, when two normal densities intersect at the threshold value, the probability of misclassifying a genotype 0 subject will not equal the probability of misclassifying a genotype 1 subject, and therefore . For a given threshold, c*, we can define the bias by . Unlike the previous example, with the Mahalanobis distance, Figure 2 shows that describing the bias as a function of and p0 can be difficult. In current tests of association, this inequality shows that it possible for even when , which as we will see, will lead to an inflated type I error and too many low p-values. Start by noting that GWAS test a surrogate hypothesis, , not the true hypothesis of interest, . Because need not equal when the distributions for cases and controls differ, , the tested hypothesis, can be false even when H0 is true. Let T* be a standard test statistic for , which is believed to have the following property, . Let us make the reasonable assumption that the difference between and is small, or, in words, when is known to be true, the validity of H0 has little effect on the distribution of T*. Then,

Figure 2

(. After fixing, μ0 = 0, = 1, and μ1 = 1, we plot the bias, - p0 (y-axis), against (x-axis), for different values of p0.

(. After fixing, μ0 = 0, = 1, and μ1 = 1, we plot the bias, - p0 (y-axis), against (x-axis), for different values of p0. Here β is a measure of the power to reject and we assume β > α. Therefore, the current method of rejecting H0 whenever T* > is actually anti-conservative if the stated p-value is α

Two Genotype Example: Inconsistency

As with any GWAS experiment, we can estimate p0 and p1 by For presentation, we will omit the superscript n, writing and as and . As , from equation 5, is equivalent to E [P(S Therefore, by the convergence of and to constants and the convergence of the MLE, we have Having just discussed cases where equation (9) holds, our standard estimates of p0 and p1 are not consistent. Specifically, for these scenarios.

Return of Consistency: Modifying the Mahalanobis Distance

The Mahalanobis Distance, D(g, s|θ), measures the conditional probability of getting a value as extreme as sgiven genotype g. Therefore, we could achieve the same results using where g ∈ {0, 1} and Sis again presumed to be normally distributed. As we saw, the current estimators suffer because they don't account for the genotype probabilities. Borrowing Bayesian terminology, we simply need to weight our distance measure by the prior probability of a subject having each genotype. Therefore, returning to step 4 of our genotyping process, we now define = gwhere gmaximizes the function , a weighted version of the Mahalanobis distance. Let c† be the point such that . Then we are guaranteed consistency as With this change, our estimate would now be consistent.

Dependence/Correlation of Called Genotypes

If we knew {G1,..., G}, the true genotypes for a group of n subjects at SNP j, as opposed to only knowing their called genotypes, , it would easy to construct a test of H0: with a specified α level. Reject H0 if . Here, is the 1-α percentile of a distribution, and The central limit theorem allows us to be confident that we have an α-level test because when {G1,..., Gis a vector of independent Bernoulli random variables. Again, we have returned to the two genotype scenario to simplify our discussion. By statements about the perceived α-levels of , are often implicitly treated as the true genotypes and are assumed to be a vector of independent Bernoulli random variables. The truth, however, is that is not independent of . Specifically if C, the threshold for calls, is relatively small, then both P( = 1|C) and P( = 1|C) are relatively large. Using this common dependence on C, it is simple to show that and are positively correlated. This proof, which uses Jensen's inequality, clearly shows that the two variables, and , are not independent as that would have implied . The consequence of this dependence is that does not follow a distribution, and in turn, that neither follows a distribution nor has P( > t) = α. We next examine the behavior of , and . First, as is common, the dependence increases the variance of these estimates. For any population, The first term, , represents the uncertainty in the true number of subjects with genotype 0, and can be roughly approximated by . The second term, reflects that there will be a subpopulation, of a random size ~n|Φ(C) - Φ(E[C])|, that is assigned the 'non-ideal' genotype, where a call is labeled 'non-ideal' if it would have been different had the threshold been E [C]. We can approximate this second term, the overall increase in the variance, by nvar(Φ(C)) if P( = 0|C) ≈ Φ(C) and the cor(, |C) ≈ 0. In experiments, we can estimate nvar(Φ(C)) by bootstrapping samples of C or (C). To focus on the distributions instead of just the variances, we decompose as Note that and . The appropriate multiple of the first term, n((C) - Φ(C)), should be well approximated by X-E [C], where X ~binomial(n, E [C]). From our own experience, we have seen that the third term, (Φ(E[C]) - E(C)]), is a constant close to 0. The second term, (Φ(C) - Φ(E[C])) is the variable which causes deviation from normality. Again, we could approximate the distribution of this term by bootstrapping C. Next, we offer an example to demonstrate the effect of dependence among called genotypes. Specifically, using the IBD data, we show that a is a poor approximation for the distribution of . Because we will use real data, we have purposely chosen to discuss instead of T . There are many reasons that T may not follow a distribution, including being a poor estimate of p, the distributions of Sbeing far from normal and population substructure. However, for large n, the only reason that will not be approximately normal is if are not independent. Also, we chose to focus on the AB genotype as this is certain to be one of the genotypes with an overlapping cluster. For each of our 3137 SNPs in the IBD data, we bootstrap 40 samples of 500 subjects, and calculate 40 values of , where is estimated by . The qq-plot in Figure 3 compares these 40 × 3137 values with a N(0,1). The distribution is far from normal, which implies that are dependent. Some SNPs were more likely to contribute skewed values than others, but the top and bottom 100 values (200 total) are from 64 different SNPs, indicating that no one SNP, or small number of SNPs, is responsible for the deviation in the qq plot. In contrast, the qq-plot from well-behaved SNPs, where the contributions to the density of S from the three genotypes were separated, was the expected straight line, showing that it was not the normal approximation skewing the results (Figure 3). Because the magnitude of the observed values were larger than predicted by theory, the practical implication is that tests based on the statistic, T, estimated by (T), will be anti-conservative(i.e. too many significant p-values after adjusting for multiple testing) under the null hypothesis. Here we also note that if Swere truly normal, the impact of the dependency would be much less. For more details on the origin of dependency, please see Table 1.

Figure 3

(Dependency of Calls) The density of . The deviation from the Y = X line indicates that is not distributed as a binomial(n, p) variable.

Comparing Tests of Association

Table 2 shows the results from the simulations, and lists the percentage of the influential SNPs that were ranked among the top 100 most significant SNPs, where the rankings were determined by either LogiCALL, a likelihood ratio test, or a standard test, similar to Bead-Studio. As there were 100 influential SNPs in each simulation, an ideal scenario would have 100% of the influential SNPs in the top 100 SNPs. As expected, the percentages increase as the relative risk, comparing the two homogeneous genotypes, increases from 1.5 to 2.5. So long as the densities of φ (M|G= AA) and φ (M|G= AB) were distinct, which was the case when the distance between μand exceeded 0.5 (and was less than 1.239), all three tests performed equally well. As the shifts were decreased in simulations, many of these shifted SNPs started to appear in the top 100 most significant SNPs, when ranked by a standard test. Furthermore, the loss in power was exaggerated when the amount of overlap differed between cases and controls, φ(M|G= AB, Q= A) ≠ φ (M|G= AB, Q= U). In the most extreme case, when E [M|G= AB] = -0.539 and E [M|G= AB, Q= A] - E [M|G= AB, Q= U] = 0.2, the standard test only detected about half as many influential genes as it did when there were no shifted genes. In contrast, LogiCALL almost never ranked any of the shifted SNPs in the top 100. However, had these shifted SNPs been influential, LogiCALL would have had less power to detect them. The performance of the likelihood ratio was in between the two other tests, but performed nearly as well as LogiCALL when φ (M|G= AB, Q= A) = φ (M|G= AB, Q= U).

Table 2

The percentage of influential genes among the top 100 most significant SNPs, as ranked by LogiCALL, a likelihood ratio test, and a standard test.

		RR = 1.5			RR = 2.0			RR = 2.5
Shift	Difference	LogiCALL	LR	Standard	LogiCALL	LR	Standard	LogiCALL	LR	Standard
0.5	0	4.4	4.6	4.5	41.8	41.8	41.3	78.8	79.2	79.3
	0.2	4.4	4.6	4.5	42.0	41.8	41.3	78.8	79.3	79.4

0.3	0	4.4	4.6	4.6	42.0	41.9	41.2	78.8	79.2	79.4
	0.2	4.4	1.9	0.2	41.6	29.5	14.4	78.8	67.3	42.7

0.2	0	4.4	4.4	2.1	42.0	41.4	25.9	78.8	78.9	47.1
	0.2	3.7	0.1	0.0	38.9	3.4	0.0	75.3	21.3	0.0

Simulated data sets are full described in the Methods Section. The shift is the distance between the and μand . The Difference is the distance between and . RR is the genotype relative risk for subjects homogeneous for the minor allele.

The percentage of influential genes among the top 100 most significant SNPs, as ranked by LogiCALL, a likelihood ratio test, and a standard test. Simulated data sets are full described in the Methods Section. The shift is the distance between the and μand . The Difference is the distance between and . RR is the genotype relative risk for subjects homogeneous for the minor allele. In GWAS, each marker is tested for association with the disease. Here, we compare four methods for testing the 3137 chosen SNPs in the IBD study. In the first method, we let Bead-Studio call the genotypes and perform its standard Cochran-Armitage test. Default settings were used to assign calls as missing and remove poor quality SNPs. In the second method, we call the SNPs using logiCALL and test for association using logistic regression, although using Cochran-Armitage would not change our conclusions. In the third method, we calculate the Likelihood Ratio Statistic comparing and from equation (3) using all SNPs, and in the fourth method, we calculate the LR statistic using only those SNPs where and yield identical calls, and Hardy-Weinberg equilibrium, for controls alone, is not violated at a statistical significance level of < 10-16. For each method, we calculated the proportion of 'p-values' that were less than 0.005, 0.001, and 0.0005 (Table 3).

Table 3

The percentage of 'p-values' less than traditional α-levels (0.005,0.001,0.005) are listed for four tests of association.

Method	n	p < 0.005	p < 0.001	p < 0.0005
BeadStudio	3487	0.015	0.005	0.004
logiCALL	5533	0.004	0.002	0.002
LR	5533	0.087	0.061	0.052
LR-(same)	3014	0.006	0.002	0.001

1) BeadStudio (default setting for missing assignements and omitting SNPs).

2) logiCALL.

3) Likelihood Ratio (LR) using all SNP.

4) Likelihood ratio (LR-same) using only SNP where calls were the same for restricted and unrestricted parameter sets.

The percentage of 'p-values' less than traditional α-levels (0.005,0.001,0.005) are listed for four tests of association. 1) BeadStudio (default setting for missing assignements and omitting SNPs). 2) logiCALL. 3) Likelihood Ratio (LR) using all SNP. 4) Likelihood ratio (LR-same) using only SNP where calls were the same for restricted and unrestricted parameter sets. When genotyping all SNPs and all subjects, the proportion of Bead-Studio 'p-values' below the three thresholds far exceeded 0.005, 0.001, and 0.0005. Even after removing low quality SNPs and allowing missing calls, the proportion of 'p-values' below the three thresholds were 0.015, 0.005, and 0.004. In contrast, LogiCALL eliminated nearly all false positives. The proportion below the three thresholds were 0.004, 0.002, and 0.002. If the majority of these SNPs are presumed to be null associations, logiCALL appears to be the superior method, so long as the power loss is minimal. Assigning conservative 'p-values' to problematic SNPs is nearly equivalent to removing them. However, because of the tendency for there to be multiple, nearly equivalent, maximum likelihood estimates, the relative distances to the AA, AB, and BB genotypes using the single set of maximum likelihood estimates may not be adequate in identifying questionable calls. Therefore, logiCALL gains an advantage by combining two methods for identifying suspect calls. Additionally, it avoids false positives caused by differential bias, where the proportion of missing calls differs between cases and controls. This new method simplifies testing by requiring no preprocessing and testing all SNPs. The power loss from logiCALL depends on the quality of the data. When the statistic for a SNP cleanly separates into an AA, AB, and BB group, there is no power loss. In our IBD example, the 'p-values' reported for rs2066843 and rs2076756, the two SNPs that are believed to be truly associated with IBD were similar for the two methods, Bead-Studio (2.9 × 10-9 and 5.1 × 10-10) and logiCALL (1.5 × 10-8 and 1.6 × 10-9). Among those subjects meeting the 96% Bead-Studio call rate, logiCALL found no questionable calls. The likelihood ratio method had mixed results. Clearly, when the distribution of the summary statistic is a mixture of normals, the estimated genotype proportions are asymptotically unbiased. Unfortunately, this method still resulted in an increased number of false positives. However, if we removed those SNPs where at least one call changed when switching from to , the false positive rate decreased to the expected level. In theory, all calls should be identical and only the resulting likelihoods should differ. When assigning genotypes, the cost, in likelihood, incurred from forcing the vectors of genotype proportions to be equal should be far less than the cost of switching calls. When the reverse is true, and calls switch, the statistic for the three groups cannot be well separated, and the p-value is suspect.

Conclusion

In genome-wide association tests, under the null hypothesis, the test statistic rarely follows the expected chi-squared distribution. This deviation tends to result in an excess of false positives. Unfortunately, the investigation into the origin of this deviation has yet to be completed. The problems associated with poor signal quality and population substructure have been thoroughly explored. However, the overlap of fluorescent signals has only been identified as a serious problem, and has yet to be fully explained. In this paper, we have provided two reasons, parameter inconsistency and called genotype dependency, that help explain how overlap causes this deviant behavior. Furthermore, we propose two methods, logiCALL and a method based on the likelihood ratio statistic that better handle the problems of inconsistency and dependency. These methods will perform similarly to the common, genotype-based, test statistics for the well-behaved SNPs and appear to create fewer false positives for the difficult-to-call SNPs. We have also identified a new characteristic of some false positives, that the call differs when using vs. , that will help distinguish which low p-values represent significant disease/marker association. We have demonstrated that increasing sample size alone will not eliminate type I error, as genotyping, in its current form, leads to inconsistent estimates of population parameters. To alleviate this inconsistency, the distance measure used for assignment would need to be switched to , defined in the Results and Discussion Section. Moreover, we have proven that the called genotypes can be dependent under certain conditions, and that tests based on called genotypes need to account for the increased variance caused by dependence. Finally, we illustrated that the likelihood profile of the data can be relatively flat near the MLEs. Therefore, judging the quality of calls from distances based only on the MLE may not provide an adequate means to identify questionable calls. Hence logiCALL, which looks at all locally-maximal likelihood estimates of the parameters can reduce the type I error rate. Testing association by the likelihood ratio statistic is another promising method for addressing the problems associated with overlapping signals.

Availability and requirements

Computer programs are available on author's website, .

Authors' contributions

JS identified the original problem. Both JS and HZ developed the problem. JS drafted the manuscript. Both authors read and approved the final manuscript.

18 in total

1. Detection and integration of genotyping errors in statistical genetics.

Authors: Eric Sobel; Jeanette C Papp; Kenneth Lange
Journal: Am J Hum Genet Date: 2002-01-08 Impact factor: 11.025

2. Allowing for genotyping error in analysis of unmatched case-control studies.

Authors: K M Rice; P Holmans
Journal: Ann Hum Genet Date: 2003-03 Impact factor: 1.670

3. Dynamic model based algorithms for screening and genotyping over 100 K SNPs on oligonucleotide microarrays.

Authors: Xiaojun Di; Hajime Matsuzaki; Teresa A Webster; Earl Hubbell; Guoying Liu; Shoulian Dong; Dan Bartell; Jing Huang; Richard Chiles; Geoffrey Yang; Mei-mei Shen; David Kulp; Giulia C Kennedy; Rui Mei; Keith W Jones; Simon Cawley
Journal: Bioinformatics Date: 2005-01-18 Impact factor: 6.937

4. beadarray: R classes and methods for Illumina bead-based data.

Authors: Mark J Dunning; Mike L Smith; Matthew E Ritchie; Simon Tavaré
Journal: Bioinformatics Date: 2007-06-22 Impact factor: 6.937

Review 5. Genotyping errors and their impact on genetic analysis.

Authors: Michael B Miller; Karen Schwander; D C Rao
Journal: Adv Genet Date: 2008 Impact factor: 1.944

6. Testing of single locus hypotheses where there is incomplete separation of the phenotypes.

Authors: E A Murphy; D R Bolling
Journal: Am J Hum Genet Date: 1967-05 Impact factor: 11.025

7. Assessing batch effects of genotype calling algorithm BRLMM for the Affymetrix GeneChip Human Mapping 500 K array set using 270 HapMap samples.

Authors: Huixiao Hong; Zhenqiang Su; Weigong Ge; Leming Shi; Roger Perkins; Hong Fang; Joshua Xu; James J Chen; Tao Han; Jim Kaput; James C Fuscoe; Weida Tong
Journal: BMC Bioinformatics Date: 2008-08-12 Impact factor: 3.169

8. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.

Authors:
Journal: Nature Date: 2007-06-07 Impact factor: 49.962

9. A genotype calling algorithm for the Illumina BeadArray platform.

Authors: Yik Y Teo; Michael Inouye; Kerrin S Small; Rhian Gwilliam; Panagiotis Deloukas; Dominic P Kwiatkowski; Taane G Clark
Journal: Bioinformatics Date: 2007-09-10 Impact factor: 6.937

10. A method to address differential bias in genotyping in large-scale association studies.

Authors: Vincent Plagnol; Jason D Cooper; John A Todd; David G Clayton
Journal: PLoS Genet Date: 2007-04-05 Impact factor: 5.917

2 in total

1. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies.

Authors: Brian L Browning; Zhaoxia Yu
Journal: Am J Hum Genet Date: 2009-12 Impact factor: 11.025

2. Rare genomic structural variants in complex disease: lessons from the replication of associations with obesity.

Authors: Robin G Walters; Lachlan J M Coin; Aimo Ruokonen; Adam J de Smith; Julia S El-Sayed Moustafa; Sebastien Jacquemont; Paul Elliott; Tõnu Esko; Anna-Liisa Hartikainen; Jaana Laitinen; Katrin Männik; Danielle Martinet; David Meyre; Matthias Nauck; Claudia Schurmann; Rob Sladek; Gudmar Thorleifsson; Unnur Thorsteinsdóttir; Armand Valsesia; Gerard Waeber; Flore Zufferey; Beverley Balkau; François Pattou; Andres Metspalu; Henry Völzke; Peter Vollenweider; Kári Stefansson; Marjo-Riitta Järvelin; Jacques S Beckmann; Philippe Froguel; Alexandra I F Blakemore
Journal: PLoS One Date: 2013-03-12 Impact factor: 3.240

2 in total