Systematic removal of outliers to reduce heterogeneity in case-control association studies.

Yuanyuan Shen, Zhe Liu, Jurg Ott.

Abstract

BACKGROUND/AIMS: In human case-control association studies, population heterogeneity is often present and can lead to increased false-positive results. Various methods have been proposed and are in current use to remedy this situation.
METHODS: We assume that heterogeneity is due to a relatively small number of individuals whose allele frequencies differ from those of the remainder of the sample. For this situation, we propose a new method of handling heterogeneity by removing outliers in a controlled manner. In a coordinate system of the c largest principal components in multidimensional scaling (MDS), we systematically remove one after another of the most extreme outlying individuals and each time recompute the largest association test statistic. The smallest p value obtained within M removals serves as our test statistic whose significance level is assessed in randomization samples.
RESULTS: In power simulations of our method and three methods in current use, averaged over several different scenarios, the best method turned out to be logistic regression analysis (based on all individuals) with MDS components as covariates.
CONCLUSION: Our proposed method ranked closely behind logistic regression analysis with MDS components but ahead of other commonly used approaches. In analyses of real datasets our method performed best.
Copyright © 2010 S. Karger AG, Basel.

Year:  2010        PMID: 20924194      PMCID: PMC2975732          DOI: 10.1159/000320422

Source DB:  PubMed          Journal:  Hum Hered        ISSN: 0001-5652            Impact factor:   0.444


Background

Population admixture (cryptic heterogeneity) represents a potentially serious problem in case-control association studies [1]. Allele frequencies tend to differ between countries and even between different regions in a single country [2,3]. Disregarding such differences tends to inflate the χ2 association statistic [4], which ‘sees’ heterogeneity as a deviation from the null hypothesis of homogeneity. One of the first methods to deal with the deleterious effects of heterogeneity is genomic control (GC) [4], which assesses the extent of inflation of χ2 in terms of the GC factor, λ, and then divides each χ2 by λ. Additional methods have since been introduced, notably principal components analysis [5] and logistic regression with components of multidimensional scaling (MDS) as covariates [6]. Here, we propose a novel approach based on deleting individuals that appear as outliers. This approach specifically addresses the situation of a relatively small number of individuals that do not belong to the main portion of the study sample.

Outlier Removal Method

It is intuitive that one way to deal with heterogeneity is to remove individuals not belonging to a sample. Such an approach might be seen as more appropriate than ‘punishing’ all individuals by rolling back all test statistics, as is done in the GC method. However, removing outliers has to be carried out in a statistically satisfactory manner. To decide how many and which individuals to remove, we proceed as follows: based on the commonly used identity by state (IBS) metric, similarity between two individuals is defined as the IBS between them, averaged over all SNPs. In the coordinate system of the c largest MDS components (here we use c = 4 throughout), each individual is at some distance from the center. The individual with the largest distance from the center is considered a potential outlier. Initially, the Pearson χ2 is computed in a 2 × 2 contingency table for each SNP, where the two rows correspond to cases and controls, and the two columns represent the SNP alleles. After retaining the p value, p0, for the largest χ2 over all the SNPs, the first potential outlier is removed and another largest χ2 is computed (at whatever SNP in the genome it occurs), leading to p1, and so on. We proceed until a predefined maximum number, M, of individuals has been removed. The sequence of p values (p0, p1, …, pM) initially either decreases (p0 > p1) or increases (p0 < p1). In the first case, assume that the smallest p value, pmin, among the M + 1 values occurs at step k, that is, after k outliers have been removed. We then take T = pmin as our overall test statistic. In the latter case, we search for the first (local) minimum p value, T, or, if none occurs, we retain T = pM, with T again being our test statistic.
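The removal steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each individual's distance from the center in MDS space has already been computed and is not recomputed after each removal, that genotypes are coded as counts of one allele (0, 1, 2), and that a tie p0 = p1 is handled like the increasing case. The text does not settle any of these points. All function names are hypothetical.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f.) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table (e.g. monomorphic SNP): no evidence
    return n * (a * d - b * c) ** 2 / denom

def p_from_chi2_1df(x):
    """Upper-tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))

def smallest_p(genotypes, is_case, keep):
    """p value of the largest chi-square over all SNPs among kept individuals.

    genotypes[i][j] = number of A alleles (0, 1, 2) of individual i at SNP j.
    """
    n_case = sum(1 for i in keep if is_case[i])
    n_ctrl = len(keep) - n_case
    best = 0.0
    for j in range(len(genotypes[0])):
        a_case = sum(genotypes[i][j] for i in keep if is_case[i])
        a_ctrl = sum(genotypes[i][j] for i in keep if not is_case[i])
        # Rows: cases, controls. Columns: allele A, allele B.
        best = max(best, chi2_2x2(a_case, 2 * n_case - a_case,
                                  a_ctrl, 2 * n_ctrl - a_ctrl))
    return p_from_chi2_1df(best)

def p_sequence(genotypes, is_case, dist, M):
    """p0, p1, ..., pM after removing the 0..M most distant individuals.

    Assumption: dist (distance from the MDS center) is computed once, up front.
    """
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    return [smallest_p(genotypes, is_case, order[k:]) for k in range(M + 1)]

def select_T(p):
    """Test statistic T from p0..pM (requires M >= 1)."""
    if p[1] < p[0]:                      # sequence initially decreases
        return min(p)                    # T = pmin
    for k in range(1, len(p) - 1):       # otherwise: first local minimum
        if p[k] < p[k - 1] and p[k] <= p[k + 1]:
            return p[k]
    return p[-1]                         # no local minimum: T = pM
```

For example, `select_T(p_sequence(genotypes, is_case, dist, M=20))` would give the observed T for one dataset under these assumptions.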
In each of a sufficiently large set of randomization samples (labels case and control are randomly permuted), the whole approach is repeated, and we obtain the significance level associated with T as the proportion of randomization samples with T values at least as small as the observed T. Note that there may be a different SNP with largest χ2 in different steps of outlier removal. The technique of finding the smallest p value among several model assumptions and obtaining the (genome-wide) significance level associated with this smallest p value is not new. We previously applied this principle in comparing disease association of sets of SNPs, where each set contains different numbers of SNPs. This has led to our Set Association method [7], which is more powerful than SNP by SNP analysis [8,9] and has successfully been applied in various studies [10,11,12]. By design, our approach always removes at least one individual. In this sense, it furnishes trimmed results. Trimming is well known in classical statistics as a procedure for eliminating outliers [13,14]. In particular, such methods have been developed for small numbers of outlying observations [15]. Here we apply this principle to case-control association studies.
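The randomization step at the start of this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: `statistic` is a hypothetical callable that repeats the whole outlier-removal procedure for a given assignment of case/control labels and returns T, and the number of permutations and the seed are arbitrary choices.

```python
import random

def permutation_significance(labels, statistic, n_perm=1000, seed=0):
    """Proportion of label permutations whose T is at least as small as observed.

    labels    : list of booleans (True = case), one per individual
    statistic : hypothetical callable mapping a label list to T
                (smaller T = more extreme)
    """
    rng = random.Random(seed)
    t_obs = statistic(labels)
    shuffled = list(labels)          # work on a copy, keep the input intact
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)        # randomly permute case/control labels
        if statistic(shuffled) <= t_obs:
            hits += 1
    return hits / n_perm
```

Because only the labels are permuted, the genotype data, and hence any ordering of individuals derived from genotypes alone, stay the same across randomization samples.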

Power Simulations

For a simple power comparison, we assume a total of 1,000 independent SNPs, with the last SNP conferring disease susceptibility. We further assume a total sample size of 200 individuals, of which 10 are outliers. The 190 non-outliers are equally divided into cases and controls while we consider 3 scenarios for the 10 outliers: (1) 5 cases and 5 controls, (2) 2 cases and 8 controls, and (3) no cases and 10 controls, where the latter scenario represents the (perhaps common) situation that controls tend to be chosen from a different population segment than that furnishing cases. For the 999 non-disease SNPs with alleles A and B, allele frequencies P(A) are randomly picked between 0.10 and 0.50 for non-outliers, and between 0.10 and 0.90 for outliers (for details, see online suppl. material; for all online suppl. material, see www.karger.com/doi/10.1159/000320422). The disease (functional) SNP has alleles D and d, with the former conferring disease susceptibility. Its allele frequency, P(D), is set to 0.30 in non-outliers and is chosen randomly from 0.10 through 0.90 in outliers. Genotype frequencies are given according to Hardy-Weinberg equilibrium. We consider dominant and recessive inheritance, with h denoting the penetrance for non-susceptibility genotypes, while the penetrance for disease-conferring genotypes is given by rh. Disease prevalence is taken to be 1%. Power to detect the disease SNP is computed as a function of the penetrance ratio, r = rh/h, where r = 1 represents the null hypothesis of no genetic effect. The maximum number of outliers to be removed is set at M = 20 (10% of the sample size of 200). We compare the following 4 test procedures, where each is applied to the disease SNP. The remaining SNPs are independent of the disease SNP.
Pearson-GC: This is the 1 d.f. Pearson χ2 test with GC correction, that is, all χ2 are divided by the GC parameter, λ, where λ = observed median χ2 for all SNPs divided by the median of the χ2 distribution with 1 d.f. (0.456).
Logistic: Logistic regression analysis for each SNP in turn, additive allele test (1 d.f.). This test has no provision for addressing the heterogeneity problem.
Logistic-MDS: Logistic regression analysis (1 d.f.) with the largest 4 MDS components as covariates. The latter are determined on the basis of all SNPs.
Outliers: Our approach for removing individuals that deviate extremely from the center in an MDS coordinate space.
At r = 1, for each of the 4 methods, 5,000 datasets are generated under dominant and recessive inheritance, and critical thresholds for the test statistics are chosen such that the resulting significance level (proportion of significant results) is exactly equal to 0.05. Resulting thresholds are then used to estimate power at penetrance ratios r > 1.
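The GC correction used in the Pearson-GC procedure can be written in a few lines of Python. This is a minimal sketch that follows the definition in the text (λ = observed median χ2 divided by 0.456, then each χ2 divided by λ). The function name is hypothetical and no safeguards, such as handling a median of zero, are included.

```python
import statistics

# Median of the chi-square distribution with 1 d.f., as quoted in the text.
MEDIAN_CHI2_1DF = 0.456

def genomic_control(chi2_values):
    """Return (lambda, corrected chi-squares) for a list of 1 d.f. statistics."""
    lam = statistics.median(chi2_values) / MEDIAN_CHI2_1DF
    return lam, [x / lam for x in chi2_values]
```

With λ > 1, every statistic is scaled down by the same factor, which is the ‘rolling back’ of all test statistics that the Outliers approach is meant to avoid.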

Results

Power of the different analysis methods was somewhat dependent on model assumptions, but the Logistic-MDS method overall did best, followed by our Outliers method. Table 1 shows results for dominant inheritance and outliers consisting of 2 cases and 8 controls (all results of power simulations are given in online suppl. table S1); these results are fairly typical of the overall picture. Figure 1 shows power figures in graphical form.
Table 1

Power of 4 association analysis methods as a function of the penetrance ratio, r, for a dominant disease model

                 Pearson-GC   Logistic   Logistic-MDS   Outliers
r = 1.0          0.050        0.050      0.050          0.050
r = 1.5          0.132        0.149      0.232          0.176
r = 2.0          0.392        0.427      0.557          0.463
r = 2.5          0.615        0.663      0.777          0.685
r = 3.0          0.761        0.804      0.894          0.817
r = 3.5          0.853        0.887      0.941          0.891
r = 4.0          0.906        0.934      0.970          0.938
Average power    0.513        0.512      0.594          0.572
Rank             3            4          1              2

The number of outliers is 10 (2 cases, 8 controls), and the number of non-outliers is 190. The last two rows show average power over 36 model conditions (shown in online suppl. table S1) and resulting ranking.

Fig. 1

Power of 4 analysis methods as a function of the penetrance ratio, r (based on results in table 1).

We combined results for each value of r and 6 model assumptions (dominant/recessive, 3 splits of cases versus controls in outliers) and computed average power over these 36 conditions (online suppl. table S1). As the last two rows of table 1 show, the resulting ranking makes the Logistic-MDS method the winner, closely followed by our Outliers method. This power simulation is rather simple and is mainly designed to demonstrate that our Outliers method is competitive. In particular, only one disease SNP was assumed and any significant result is a true positive. Additional power simulations are provided in the online supplementary material, for example, for a trait influenced by two susceptibility loci and for different population structures. The Pearson-GC method presumably suffers from its potentially overly strict protection against false-positive results. In fact, computing p values from χ2 tables for the Pearson-GC method leads to type I errors much smaller than 0.05 (details not shown) but, as mentioned, in our simulations the type I error was held at 0.05 for all methods.

Analyzing a Published Dataset

To demonstrate our Outliers method, we applied it and the 3 other approaches discussed here to a published dataset on Parkinson disease with approximately 540 case and control individuals and approximately 408,000 SNPs genome wide [16]. To make results comparable and allow for genome-wide correction for multiple testing, p values were estimated in permutation samples. In this analysis, we applied the standard Pearson χ2 test without GC correction. As table 2 shows, the Outliers method furnished the smallest p value of 0.076, which is not formally significant, although nearly so. The smallest nominal p value in the Outliers method was obtained after 3 individuals had been removed as outliers (fig. 2). The significance level associated with this smallest p value is estimated to be 0.076. Without removing outliers, the p value of the largest test statistic (χ2) is equal to 0.120. Thus, the Outliers method resulted in a considerable improvement, although it did not furnish a significant result. If, for argument's sake, we transform p values into χ2 with 2 d.f. [17], we find χ2 of c1 = 5.15 for p = 0.076, and c2 = 4.24 for p = 0.120. As χ2 is proportional to sample size, the ratio, c1/c2 = 1.22, reflects a virtual gain of 22% in sample size obtained by our method. Of course, this argument is artificial since we do not know whether these p values reflect true or false positives.
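The arithmetic behind the ‘virtual gain’ can be checked with a short Python sketch. It relies on the fact that a χ2 variable with 2 d.f. has upper-tail probability exp(−x/2), so a p value corresponds to x = −2 ln p. The function name is hypothetical.

```python
import math

def chi2_2df_from_p(p):
    """Chi-square value (2 d.f.) whose upper-tail probability equals p."""
    return -2.0 * math.log(p)

c1 = chi2_2df_from_p(0.076)   # Outliers method
c2 = chi2_2df_from_p(0.120)   # no outliers removed
ratio = c1 / c2
print(round(c1, 2), round(c2, 2), round(ratio, 2))  # 5.15 4.24 1.22
```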
Table 2

Analysis results for a published dataset of Parkinson Disease

Logistic-MDS

ch    SNP          pos          Pnom       Pperm
4     rs6826751    68262621     1.73E-06   0.232
4     rs2242330    68276015     5.64E-06   0.590
4     rs3775866    68272946     1.03E-05   0.827
4     rs355477     68225291     1.83E-05   0.970
4     rs355461     68209490     1.87E-05   0.972
4     rs355506     68214848     1.87E-05   0.972
10    rs1480597    44481115     1.92E-05   0.973
5     rs10053056   96069176     1.92E-05   0.973
1     rs1887279    180641817    1.94E-05   0.974
16    rs4888984    78066835     1.95E-05   0.974

Logistic

ch    SNP          pos          Pnom       Pperm
4     rs6826751    68262621     2.46E-06   0.258
4     rs2242330    68276015     6.05E-06   0.569
4     rs3775866    68272946     1.11E-05   0.831
16    rs4888984    78066835     1.30E-05   0.877
4     rs355477     68225291     1.63E-05   0.935
10    rs1480597    44481115     1.67E-05   0.940
4     rs355461     68209490     1.73E-05   0.948
4     rs355506     68214848     1.73E-05   0.948
1     rs1887279    180641817    1.83E-05   0.954
4     rs355464     68207890     1.91E-05   0.959

ch = chromosome; pos = position; Pnom = nominal p value; Pperm = p value from permutation samples, corrected for multiple testing, 1,000 permutations.

Fig. 2

For Parkinson disease dataset, minimum p values obtained with given numbers of outliers removed.

Discussion

So-called ‘obvious’ outliers are often removed in an ad hoc manner, and there may not be good statistical justifications for doing so. In particular, if outliers are removed by trial and error, that is, if they are removed only when this leads to a reduction in p value, then such a procedure clearly tends to increase the false-positive rate of results. Here, we developed a statistically rigorous procedure for removing outliers while maintaining correct type I error. We carried out additional power simulations under various conditions and also analyzed one more real dataset. All these results may be found in the online supplementary material. These simulations confirm our conclusions based on results shown in table 1; they also show that the Outliers method often does best with recessive modes of inheritance. In addition, at least in the two real datasets analyzed here, for the best SNPs, the Outliers method yields the smallest p values. As is well known, an alternative to removing outliers is to allow for them in the analysis, which may be done by including principal components as covariates in logistic regression analysis [5]. The two approaches may do equally well in practice, although our power calculations have given the logistic regression approach (with MDS components) a slight advantage.