Abstract
There is strong evidence that rare variants are involved in complex disease etiology. The first step in implicating rare variants in disease etiology is their identification through sequencing in both randomly ascertained samples (e.g., the 1,000 Genomes Project) and samples ascertained according to disease status. We investigated to what extent rare variants will be observed across the genome and in candidate genes in randomly ascertained samples, the magnitude of variant enrichment in diseased individuals, and biases that can occur due to how variants are discovered. Although sequencing cases can enrich for causal variants, when a gene or genes are not involved in disease etiology, limiting variant discovery to cases can lead to association studies with dramatically inflated false positive rates.
Year: 2009 PMID: 19436704 PMCID: PMC2674213 DOI: 10.1371/journal.pgen.1000481
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
The proportion of variants identified in samples randomly ascertained from the population.
| Frequency | N = 100 | N = 200 | N = 500 | N = 1,000 |
| 0.001 | 0.181(0.012) | 0.330(0.015) | 0.632(0.015) | 0.865(0.011) |
| 0.002 | 0.330(0.015) | 0.551(0.016) | 0.865(0.011) | 0.982(0.004) |
| 0.005 | 0.633(0.015) | 0.865(0.011) | 0.993(0.003) | 1.000(2.1E-4) |
| 0.01 | 0.866(0.011) | 0.982(0.004) | 1.000(2.1E-4) | 1.000(1.4E-6) |
The proportions of variants discovered, assuming linkage equilibrium, and their standard deviations (shown in parentheses) are given for samples of N = 100, 200, 500, and 1,000 individuals and for variants with equal population frequencies of 0.001, 0.002, 0.005, and 0.01. Although the mean proportions of variants discovered do not depend on the number of variants in the genome, the standard deviations do. The standard deviations shown are for M = 1,000 variants. All proportions displayed as 1.000 were rounded up; their actual values are greater than 0.999 but less than 1.0.
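The means and standard deviations in this table are consistent with a simple binomial calculation. The sketch below rests on two assumptions that the caption does not state explicitly: a variant of population frequency q counts as discovered when at least one of the 2N sampled chromosomes carries it, and the M variants are discovered independently of one another. The function names are illustrative only.

```python
import math


def discovery_probability(q, n):
    """Chance that a variant of frequency q is seen on at least one of 2n chromosomes."""
    return 1.0 - (1.0 - q) ** (2 * n)


def proportion_sd(q, n, m=1000):
    """Standard deviation of the proportion discovered among m independent variants."""
    p = discovery_probability(q, n)
    return math.sqrt(p * (1.0 - p) / m)


# Print one row per frequency, in the same layout as the table: mean(sd).
for q in (0.001, 0.002, 0.005, 0.01):
    cells = [
        f"{discovery_probability(q, n):.3f}({proportion_sd(q, n):.3g})"
        for n in (100, 200, 500, 1000)
    ]
    print(q, " | ".join(cells))
```

Under these assumptions the mean depends only on q and N, while the standard deviation shrinks with the square root of M, which matches the remark in the caption.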
The probability of identifying rare variants with equal frequencies within a gene in samples of randomly ascertained individuals.
| M | Frequency | N = 100 | | | N = 200 | | | N = 1,000 | | |
| | | 50% | 80% | 100% | 50% | 80% | 100% | 50% | 80% | 100% |
| 10 | 0.001 | 0.0220 | 3.7E-5 | 3.8E-8 | 0.2060 | 0.0032 | 1.5E-5 | 0.9992 | 0.8571 | 0.2340 |
| | 0.005 | 0.8836 | 0.2265 | 0.0103 | 0.9992 | 0.8584 | 0.2354 | 1.0000 | 1.0000 | 0.9996 |
| | 0.01 | 0.9993 | 0.8600 | 0.2373 | 1.0000 | 0.9994 | 0.8343 | 1.0000 | 1.0000 | 1.0000 |
| 20 | 0.001 | 0.0012 | 3.1E-9 | 1.5E-16 | 0.0863 | 2.2E-5 | 2.3E-10 | 1.0000 | 0.8770 | 0.0547 |
| | 0.005 | 0.9266 | 0.0903 | 0.0001 | 1.0000 | 0.8786 | 0.0554 | 1.0000 | 1.0000 | 0.9991 |
| | 0.01 | 1.0000 | 0.8805 | 0.0563 | 1.0000 | 1.0000 | 0.6961 | 1.0000 | 1.0000 | 1.0000 |
The probabilities of discovering at least 50%, 80%, and 100% of the variants within a gene with M = 10 or 20 variants in linkage equilibrium, each with a population frequency of 0.001, 0.005, or 0.01, are displayed for samples of N = 100, 200, and 1,000 individuals. All probabilities shown as 1.0000 were rounded up; their actual values are greater than 0.9999 but less than 1.0.
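The gene-level probabilities can be reproduced as a binomial tail. This is a minimal sketch, assuming that each of the M variants is discovered independently with probability 1 - (1 - q)^(2N), which is not spelled out in the caption; `prob_at_least` is a hypothetical helper name, and k is the smallest number of variants that meets the stated percentage (for example, k = 8 for 80% of M = 10).

```python
from math import comb


def discovery_probability(q, n):
    """Chance that a variant of frequency q is seen on at least one of 2n chromosomes."""
    return 1.0 - (1.0 - q) ** (2 * n)


def prob_at_least(k, m, q, n):
    """Probability that at least k of m equally frequent variants are discovered."""
    p = discovery_probability(q, n)
    return sum(comb(m, j) * p**j * (1.0 - p) ** (m - j) for j in range(k, m + 1))


# M = 10 variants of frequency 0.001 in N = 100 individuals:
# at least 50% (k = 5), at least 80% (k = 8), and all of them (k = 10).
for k in (5, 8, 10):
    print(k, prob_at_least(k, 10, 0.001, 100))
```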
Figure 1. The probabilities of variant discovery for a gene with ten rare variants that have equal population frequency and reside on separate haplotypes.
Panels (A and B) display the probability of variant discovery in randomly ascertained samples when each of the variants has a population frequency of 0.001 (Panel A) or 0.005 (Panel B). Panels (C and D) display the probability of discovering causal variants in samples of cases when each of the ten rare variants has a genotypic RR = 2.0 under a dominant model and population frequency of 0.001 (Panel C) or 0.005 (Panel D).
The relative increase in frequency of causal variants in samples of cases compared to samples of randomly ascertained individuals.
| RR | Frequency | Multiplicative | | | Dominant | | | Additive | | | Recessive | | |
| | | 5 | 10 | 20 | 5 | 10 | 20 | 5 | 10 | 20 | 5 | 10 | 20 |
| 2 | 0.001 | 1.99 | 1.98 | 1.96 | 1.98 | 1.96 | 1.92 | 1.99 | 1.97 | 1.94 | 1.00 | 1.01 | 1.02 |
| | 0.002 | 1.98 | 1.96 | 1.92 | 1.96 | 1.92 | 1.85 | 1.97 | 1.94 | 1.89 | 1.01 | 1.02 | 1.04 |
| | 0.005 | 1.95 | 1.90 | 1.82 | 1.91 | 1.82 | 1.68 | 1.93 | 1.86 | 1.75 | 1.02 | 1.05 | 1.09 |
| 5 | 0.001 | 4.90 | 4.81 | 4.63 | 4.81 | 4.63 | 4.32 | 4.83 | 4.67 | 4.38 | 1.02 | 1.04 | 1.08 |
| | 0.002 | 4.81 | 4.63 | 4.31 | 4.63 | 4.32 | 3.81 | 4.67 | 4.38 | 3.91 | 1.04 | 1.08 | 1.15 |
| | 0.005 | 4.55 | 4.17 | 3.57 | 4.18 | 3.60 | 2.84 | 4.25 | 3.71 | 3.00 | 1.10 | 1.19 | 1.35 |
Frequency of variants within the population.
The relative increase of causal variant frequency is shown for 5, 10, and 20 causal variants, each with equal population frequencies of 0.001, 0.002, and 0.005 and genotypic RR of either 2 or 5 under a multiplicative, dominant, additive, and recessive model. The calculations were carried out under the assumption that the causal variants reside on separate haplotypes.
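The enrichment values can be approximated with a short calculation. The sketch below is a reconstruction, not the authors' stated method: it assumes that each haplotype carries at most one of the M causal variants (so a haplotype is a carrier with probability Mq), that the two haplotypes of an individual are independent, and that the relative risks for 0, 1, and 2 carrier haplotypes are 1, RR, RR² (multiplicative), 1, RR, RR (dominant), 1, RR, 2RR - 1 (additive), and 1, 1, RR (recessive). The function name `relative_increase` is illustrative.

```python
def relative_increase(model, rr, m, q):
    """Ratio of a causal variant's frequency in cases to its population frequency."""
    carrier = m * q  # probability that a haplotype carries any causal variant
    if model == "multiplicative":
        # Risk given one carrier haplotype, averaged over the other haplotype.
        risk_given_variant = rr * (1 + carrier * (rr - 1))
        mean_risk = (1 + carrier * (rr - 1)) ** 2
    elif model == "dominant":
        risk_given_variant = rr
        mean_risk = 1 + (rr - 1) * (1 - (1 - carrier) ** 2)
    elif model == "additive":
        risk_given_variant = rr + carrier * (rr - 1)
        mean_risk = 1 + 2 * carrier * (rr - 1)
    elif model == "recessive":
        risk_given_variant = 1 + carrier * (rr - 1)
        mean_risk = 1 + carrier**2 * (rr - 1)
    else:
        raise ValueError(f"unknown model: {model}")
    return risk_given_variant / mean_risk


# RR = 5, frequency 0.005, M = 20 causal variants, all four models.
for model in ("multiplicative", "dominant", "additive", "recessive"):
    print(model, round(relative_increase(model, 5, 20, 0.005), 2))
```

Under these assumptions the enrichment is always below RR and falls as Mq grows, because the baseline risk in the population rises when more haplotypes carry a causal variant.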
Figure 2. Distribution of the frequencies of rare variants and the average proportion of rare variants discovered in randomly ascertained and case samples.
Data were generated using coalescent simulation under the neutral Wright–Fisher model with a scaled mutation rate θ = 4. Panel (A) displays the distribution of rare variants with frequency ≤0.01 for 100 haplotype pools each with 10,000 haplotypes. Panel (B) displays the mean proportion of variants discovered for randomly ascertained samples and for case samples of N = 100, 200, 500, and 1,000 individuals. Results are based upon 10,000 replicates. Case samples were generated with 50% of rare variants randomly chosen to be causal each with a genotypic RR of 2 or 5 under the additive model.
False positive rates for association studies when variant identification is carried out in only cases or in both cases and controls for gene(s) with a fixed number of neutral variants.
| Discovery Sample | N | M = 10 | | | M = 20 | | | M = 30 | | |
| | | 0.001 | 0.002 | 0.005 | 0.001 | 0.002 | 0.005 | 0.001 | 0.002 | 0.005 |
| Cases Only | 100 | 0.067 | 0.140 | 0.115 | 0.260 | 0.346 | 0.235 | 0.447 | 0.562 | 0.374 |
| | 200 | 0.144 | 0.132 | 0.055 | 0.350 | 0.273 | 0.083 | 0.553 | 0.430 | 0.113 |
| | 500 | 0.107 | 0.055 | 0.048 | 0.219 | 0.079 | 0.049 | 0.334 | 0.104 | 0.049 |
| | 1,000 | 0.057 | 0.047 | 0.050 | 0.078 | 0.048 | 0.050 | 0.102 | 0.048 | 0.050 |
| Cases and Controls | 100 | 0.039 | 0.043 | 0.051 | 0.042 | 0.050 | 0.050 | 0.046 | 0.050 | 0.050 |
| | 200 | 0.045 | 0.049 | 0.049 | 0.049 | 0.048 | 0.050 | 0.048 | 0.050 | 0.050 |
| | 500 | 0.050 | 0.049 | 0.050 | 0.049 | 0.050 | 0.050 | 0.049 | 0.050 | 0.050 |
| | 1,000 | 0.051 | 0.050 | 0.050 | 0.050 | 0.050 | 0.050 | 0.050 | 0.050 | 0.050 |
Results are shown for gene(s) with M = 10, 20, and 30 neutral variants with equal population frequencies of 0.001, 0.002, and 0.005, for N = 100, 200, 500, and 1,000 cases, and an equal number of controls. The assumption is made that the variants reside on separate haplotypes. The upper panel shows the false positive rates when only cases are used for variant discovery and the discovered variants are genotyped in controls. The lower panel shows the false positive rates when both cases and controls are sequenced to discover rare variants. Analyses were carried out using the Cochran–Armitage test for trend (see Methods). The false positive rates were evaluated for an α = 0.05 and based upon 100,000 replicates.
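The source of the inflation can be illustrated with a small Monte Carlo experiment. This is a rough sketch under stated assumptions, not the procedure from the Methods: every haplotype carries at most one of the M neutral variants, each individual is scored by the number of haplotypes carrying a discovered variant (0, 1, or 2), and the Cochran–Armitage trend statistic is computed in its score-test form and compared with the two-sided normal critical value for α = 0.05. All function names, the seed, and the replicate count are illustrative, and the estimates need not match the table exactly.

```python
import math
import random


def draw_sample(rng, n, m, q):
    """For n individuals, draw the variant index (or None) on each of two haplotypes."""
    people = []
    for _ in range(n):
        pair = []
        for _ in range(2):
            u = rng.random()
            # Variant j occupies the interval [j*q, (j+1)*q); the rest carry nothing.
            pair.append(min(int(u / q), m - 1) if u < m * q else None)
        people.append(pair)
    return people


def trend_z(case_scores, control_scores):
    """Cochran-Armitage trend statistic (score-test form) for scores 0, 1, 2."""
    scores = case_scores + control_scores
    mean = sum(scores) / len(scores)
    ss = sum((x - mean) ** 2 for x in scores)
    if ss <= 0:
        return 0.0  # no carriers of discovered variants: nothing to test
    p_case = len(case_scores) / len(scores)
    return sum(x - mean for x in case_scores) / math.sqrt(p_case * (1 - p_case) * ss)


def false_positive_rate(m, q, n, cases_only, replicates=1000, seed=1):
    """Fraction of null replicates rejected at alpha = 0.05 (two-sided)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replicates):
        cases = draw_sample(rng, n, m, q)
        controls = draw_sample(rng, n, m, q)
        found = {v for person in cases for v in person if v is not None}
        if not cases_only:
            found |= {v for person in controls for v in person if v is not None}
        # Only variants in the discovery set are genotyped and counted.
        case_scores = [sum(v in found for v in person) for person in cases]
        control_scores = [sum(v in found for v in person) for person in controls]
        rejections += abs(trend_z(case_scores, control_scores)) > 1.959964
    return rejections / replicates


print("cases only:", false_positive_rate(30, 0.001, 100, cases_only=True))
print("cases and controls:", false_positive_rate(30, 0.001, 100, cases_only=False))
```

In this sketch the bias arises because variants that happen to occur only in controls are never discovered and therefore never counted, whereas every variant seen in cases is counted in full.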
False positive rates for association studies when a definite number of neutral variants are identified only in cases or when variant discovery is carried out in cases and controls.
| Discovery Sample | N | M = 5 | | | M = 10 | | |
| | | 0.001 | 0.002 | 0.005 | 0.001 | 0.002 | 0.005 |
| Cases Only | 100 | 0.389 | 0.211 | 0.087 | 0.864 | 0.584 | 0.186 |
| | 200 | 0.217 | 0.111 | 0.045 | 0.579 | 0.245 | 0.060 |
| | 500 | 0.080 | 0.046 | 0.047 | 0.175 | 0.058 | 0.048 |
| | 1,000 | 0.047 | 0.047 | 0.051 | 0.060 | 0.047 | 0.050 |
| Cases and Controls | 100 | 0.040 | 0.038 | 0.047 | 0.050 | 0.052 | 0.051 |
| | 200 | 0.040 | 0.045 | 0.047 | 0.053 | 0.049 | 0.050 |
| | 500 | 0.042 | 0.051 | 0.048 | 0.05 | 0.048 | 0.049 |
| | 1,000 | 0.051 | 0.052 | 0.050 | 0.052 | 0.050 | 0.050 |
The false positive rates are displayed for the case in which M = 5 or 10 neutral variants with equal population frequencies of 0.001, 0.002, and 0.005 were discovered in cases only (upper panel) or in both cases and controls (lower panel), for N = 100, 200, 500, and 1,000 cases with an equal number of controls. It is assumed that each rare variant resides on a separate haplotype. Analyses were carried out using the Cochran–Armitage test for trend (see Methods). The false positive rates were evaluated for an α = 0.05 and based upon 100,000 replicates.
False positive rates for rare variant association studies when variants are identified only in cases or in both cases and controls for variants with a mixture of frequencies.
| Discovery Sample | θ | Number of variants | Frequency | Number of Cases | | | |
| | | | | 100 | 200 | 500 | 1,000 |
| Cases Only | 4 | 21.0 | 0.0403 | 0.124 | 0.094 | 0.066 | 0.056 |
| | 6 | 31.5 | 0.0628 | 0.178 | 0.122 | 0.076 | 0.062 |
| | 8 | 41.9 | 0.0835 | 0.228 | 0.146 | 0.085 | 0.066 |
| | 12 | 61.7 | 0.1160 | 0.325 | 0.205 | 0.106 | 0.075 |
| Cases and Controls | 4 | 21.0 | 0.0403 | 0.048 | 0.050 | 0.050 | 0.050 |
| | 6 | 31.5 | 0.0628 | 0.050 | 0.047 | 0.049 | 0.049 |
| | 8 | 41.9 | 0.0835 | 0.048 | 0.048 | 0.051 | 0.051 |
| | 12 | 61.7 | 0.1160 | 0.049 | 0.049 | 0.049 | 0.051 |
The number of controls is equal to the number of cases.
Number of rare variants with frequency ≤1% observed per haplotype pool averaged over 100 haplotype pools.
Total frequency of rare variants with frequency ≤1% observed per haplotype pool averaged over 100 haplotype pools.
Coalescent simulations with scaled mutation rates θ ranging between 4 and 12 were used to generate genotype data for rare variants with frequencies between 0.0001 and 0.01 for samples of N = 100, 200, 500, and 1,000 cases. The false positive rates are displayed when variant discovery is carried out in only cases via sequencing and the discovered variants are genotyped in controls (upper panel) and when both cases and controls are sequenced to discover rare variants (lower panel). Analyses were carried out using the Cochran–Armitage test for trend (see Methods). The false positive rates were evaluated for an α = 0.05 and based upon 100,000 replicates.