Michela Panarella, Kelly M Burkett.
Abstract
Extreme phenotype sampling (EPS) is a popular study design used to reduce genotyping or sequencing costs. Assuming continuous phenotype data are available on a large cohort, EPS involves genotyping or sequencing only those individuals with extreme phenotypic values. Although this design has been shown to have high power to detect genetic effects even at smaller sample sizes, little attention has been paid to the effects of confounding variables, and in particular population stratification. Using extensive simulations, we demonstrate that the false positive rate under the EPS design is greatly inflated relative to a random sample of equal size or a "case-control"-like design where the cases are from one phenotypic extreme and the controls randomly sampled. The inflated false positive rate is observed even with allele frequency and phenotype mean differences taken from European population data. We show that the effects of confounding are not reduced by increasing the sample size. We also show that including the top principal components in a logistic regression model is sufficient for controlling the type 1 error rate using data simulated with a population genetics model and using 1,000 Genomes genotype data. Our results suggest that when an EPS study is conducted, it is crucial to adjust for all confounding variables. For genetic association studies this requires genotyping a sufficient number of markers to allow for ancestry estimation. Unfortunately, this could increase the costs of a study if sequencing or genotyping was only planned for candidate genes or pathways; the available genetic data would not be suitable for ancestry correction as many of the variants could have a true association with the trait.
Keywords: Type 1 error; association study; extreme phenotype sampling; population stratification; principal component analysis
Year: 2019 PMID: 31130982 PMCID: PMC6509877 DOI: 10.3389/fgene.2019.00398
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Genetic and phenotypic parameter settings for simulations to assess confounding effect of population stratification.
| N | Full cohort sample size | 5000 |
| n | Sample size for each extreme | 0.1N = 500 |
| | Proportion of full sample from population 1 | 0.3, 0.4, 0.5, 0.6, 0.7 |
| | Proportion of full sample from population 2 | 1 − |
| | Frequency of A allele in population 1 | 0.5 to 0.9, by 0.1 |
| | Frequency of A allele in population 2 | 0.5 to 0.9, by 0.1 |
| μ1 | Phenotype mean in population 1 | 0.1, 0.2 |
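The simulation design summarized by these settings can be illustrated with a small sketch. It is not the authors' code: it assumes a unit-variance normal phenotype, a phenotype mean in population 2 equal to the negative of the mean in population 1, Hardy-Weinberg genotypes, and an additive trend test (sample size times the squared correlation, referred to a chi-square distribution with 1 degree of freedom) in place of the models used in the study. All function and parameter names (`simulate_cohort`, `trend_pvalue`, `false_positive_rates`, `omega`) are hypothetical.

```python
import math
import numpy as np

def simulate_cohort(rng, N=5000, omega=0.5, p1=0.5, p2=0.7, mu1=0.1, mu2=-0.1):
    """Two-population cohort in which the genotype has NO effect on the phenotype.

    Assumptions (not from the source): unit phenotype variance, mu2 = -mu1,
    Hardy-Weinberg proportions within each population.
    """
    pop1 = rng.random(N) < omega                    # True = population 1
    freq = np.where(pop1, p1, p2)                   # per-person A allele frequency
    genotype = rng.binomial(2, freq)                # number of A alleles (0, 1, 2)
    phenotype = rng.normal(np.where(pop1, mu1, mu2), 1.0)
    return genotype, phenotype

def trend_pvalue(x, y):
    """Additive trend test: N * r^2 compared with chi-square on 1 df."""
    if np.std(x) == 0 or np.std(y) == 0:
        return 1.0
    r = np.corrcoef(x, y)[0, 1]
    chi2 = len(x) * r * r
    return math.erfc(math.sqrt(chi2 / 2.0))         # upper tail of chi-square(1)

def false_positive_rates(n_rep=400, n=500, alpha=0.05, seed=1, **cohort_kwargs):
    """Return (EPS rate, random-sample rate) for a locus with no true effect."""
    rng = np.random.default_rng(seed)
    hits_eps = hits_rand = 0
    status = np.concatenate([np.zeros(n), np.ones(n)])   # 0 = low, 1 = high extreme
    for _ in range(n_rep):
        g, y = simulate_cohort(rng, **cohort_kwargs)
        order = np.argsort(y)
        g_eps = np.concatenate([g[order[:n]], g[order[-n:]]])
        hits_eps += trend_pvalue(g_eps, status) < alpha
        # Random sample of the same total size, tested on the continuous phenotype
        idx = rng.choice(len(y), size=2 * n, replace=False)
        hits_rand += trend_pvalue(g[idx], y[idx]) < alpha
    return hits_eps / n_rep, hits_rand / n_rep

if __name__ == "__main__":
    eps_rate, random_rate = false_positive_rates()
    print(f"EPS: {eps_rate:.3f}  random sample: {random_rate:.3f}")
```

Because the locus has no effect on the phenotype, any rejection is a false positive. Under these assumptions the EPS rate should sit well above the nominal 0.05 whenever both the allele frequency and the phenotype mean differ between populations, and should return to roughly 0.05 when `p1` equals `p2`.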
Figure 1. False positive proportion for scenarios with ω and μ = 0.1. The solid, black line gives the false positive rate under EPS. The dashed, red line gives the false positive rate under random sampling. The dotted, blue line gives the rate under case-control sampling. The first column corresponds to the recessive test, the second column corresponds to the additive test and the third column corresponds to the codominant test. The first row corresponds to simulations with p1 = 0.5, the second row corresponds to simulations with p1 = 0.7 and the third row corresponds to simulations with p1 = 0.9.
Figure 2. False positive proportion for scenarios with ω and μ = 0.1. The solid, black line gives the false positive rate under EPS. The dashed, red line gives the false positive rate under random sampling. The dotted, blue line gives the rate under case-control sampling. The first column corresponds to the recessive test, the second column corresponds to the additive test and the third column corresponds to the codominant test. The first row corresponds to simulations with p1 = 0.5, the second row corresponds to simulations with p1 = 0.7 and the third row corresponds to simulations with p1 = 0.9.
Estimated false positive proportions using parameter settings from Italy and France data.
| 0.20 | 0.80 | 0.16 | 0.29 | 0.23 | 0.08 | 0.13 | 0.10 |
| 0.30 | 0.70 | 0.24 | 0.44 | 0.34 | 0.11 | 0.18 | 0.14 |
| 0.40 | 0.60 | 0.30 | 0.53 | 0.43 | 0.13 | 0.21 | 0.16 |
| 0.50 | 0.50 | 0.34 | 0.58 | 0.47 | 0.15 | 0.25 | 0.19 |
| 0.60 | 0.40 | 0.33 | 0.55 | 0.45 | 0.14 | 0.23 | 0.17 |
| 0.70 | 0.30 | 0.29 | 0.47 | 0.37 | 0.14 | 0.19 | 0.15 |
| 0.80 | 0.20 | 0.20 | 0.31 | 0.24 | 0.10 | 0.13 | 0.11 |
Effect of increasing the sample size on the false positive rate due to confounding.
| 10,000 | 2,000 | 0.06 | 0.07 | 0.06 | 0.05 | 0.05 | 0.05 |
| 20,000 | 4,000 | 0.08 | 0.09 | 0.07 | 0.06 | 0.06 | 0.06 |
| 50,000 | 10,000 | 0.13 | 0.14 | 0.11 | 0.07 | 0.08 | 0.07 |
Simulations were run with μ.
Estimated false positive rates before and after adjustment using the top five principal components.
| 0.1 | –0.1 | 0.819 | 0.880 | 0.056 | 0.056 |
| 0.15 | –0.15 | 0.993 | 0.996 | 0.045 | 0.0495 |
| 0.175 | –0.175 | 0.9995 | 0.9995 | 0.049 | 0.055 |
| 0.2 | –0.2 | 1 | 1 | 0.059 | 0.057 |
Logistic regression was used to test the association with the putative disease locus using either a codominant or additive genetic model.
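The idea of adjusting for the top five principal components can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: the ancestry markers are drawn from a simple two-population model invented here (not the population genetics model or the 1,000 Genomes data), and a partial-correlation test on least-squares residuals stands in for the logistic regression. The names `top_pcs`, `residualize`, `one_replicate` and `rates`, and all parameter defaults, are hypothetical.

```python
import math
import numpy as np

def chi2_1df_pvalue(chi2):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(max(chi2, 0.0) / 2.0))

def residualize(v, covars):
    """Residuals of v after least-squares regression on an intercept plus covars."""
    X = np.column_stack([np.ones(len(v)), covars])
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ beta

def top_pcs(markers, k=5):
    """Top-k principal component scores of a column-standardized genotype matrix."""
    sd = markers.std(axis=0)
    keep = sd > 0                                    # drop monomorphic markers
    Z = (markers[:, keep] - markers[:, keep].mean(axis=0)) / sd[keep]
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] * S[:k]

def one_replicate(rng, N=2000, n=200, M=300, p1=0.3, p2=0.7, mu=0.3, k=5):
    """P-values (unadjusted, PC-adjusted) for a locus with no true effect."""
    pop1 = rng.random(N) < 0.5
    y = rng.normal(np.where(pop1, mu, -mu), 1.0)     # phenotype depends on ancestry only
    order = np.argsort(y)
    idx = np.concatenate([order[:n], order[-n:]])    # both phenotypic extremes
    status = np.concatenate([np.zeros(n), np.ones(n)])
    in_pop1 = pop1[idx]
    # Genotypes depend only on ancestry, so drawing them after sampling is equivalent
    g = rng.binomial(2, np.where(in_pop1, p1, p2)).astype(float)
    # Assumed null ancestry markers with modest frequency differences
    f1 = rng.uniform(0.2, 0.8, M)
    f2 = np.clip(f1 + rng.normal(0.0, 0.15, M), 0.05, 0.95)
    markers = rng.binomial(2, np.where(in_pop1[:, None], f1, f2)).astype(float)
    pcs = top_pcs(markers, k)
    r_raw = np.corrcoef(g, status)[0, 1]
    p_raw = chi2_1df_pvalue(len(g) * r_raw ** 2)
    r_adj = np.corrcoef(residualize(g, pcs), residualize(status, pcs))[0, 1]
    p_adj = chi2_1df_pvalue((len(g) - k - 2) * r_adj ** 2)
    return p_raw, p_adj

def rates(n_rep=100, alpha=0.05, seed=7):
    """False positive rates (unadjusted, PC-adjusted) over n_rep replicates."""
    rng = np.random.default_rng(seed)
    pvals = np.array([one_replicate(rng) for _ in range(n_rep)])
    return tuple((pvals < alpha).mean(axis=0))

if __name__ == "__main__":
    raw, adjusted = rates()
    print(f"unadjusted: {raw:.3f}  adjusted for top 5 PCs: {adjusted:.3f}")
```

With the strong stratification assumed here, the unadjusted rate should be close to 1 while the adjusted rate should fall back toward the nominal level, which mirrors the pattern in the table. The sketch uses a linear approximation, so the exact values should not be compared with the logistic regression results reported above.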
Estimated false positive rates before and after adjustment using the top five principal components for the rare variant case.
| 0.01 | 0.05 | 0.1 | –0.1 | 0.124 | 0.143 | 0.044 | 0.048 |
| 0.01 | 0.1 | 0.1 | –0.1 | 0.2788 | 0.339 | 0.068 | 0.055 |
| 0.01 | 0.05 | 0.2 | –0.2 | 0.385 | 0.437 | 0.043 | 0.047 |
| 0.01 | 0.1 | 0.2 | –0.2 | 0.7718 | 0.845 | 0.064 | 0.054 |
Logistic regression was used to test the association with the putative disease locus using either a codominant or additive genetic model.