Stuart Macgregor, Peter M Visscher, Grant Montgomery.
Abstract
Array-based DNA pooling techniques facilitate genome-wide scale genotyping of large samples. We describe a structured analysis method for pooled data using internal replication information in large-scale genotyping sets. The method takes advantage of information from single nucleotide polymorphisms (SNPs) typed in parallel on a high-density array to construct a test statistic with desirable statistical properties. We utilize a general linear model to appropriately account for the structured multiple measurements available with array data. The method does not require the use of additional arrays for the estimation of unequal hybridization rates and hence scales readily to accommodate arrays with several hundred thousand SNPs. Tests for differences between cases and controls can be conducted with very few arrays. We demonstrate the method on 384 endometriosis cases and controls, typed using Affymetrix GeneChip HindIII 50 K arrays. For a subset of these data, accurate measures of hybridization rates were available. Assuming equal hybridization rates is shown to have a negligible effect upon the results. With a total of only six arrays, the method extracted one-third of the information (in terms of equivalent sample size) available with individual genotyping (requiring 768 arrays). With 20 arrays (10 for cases, 10 for controls), over half of the information could be extracted from this sample.
Year: 2006 PMID: 16627870 PMCID: PMC1440945 DOI: 10.1093/nar/gkl136
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Estimation of var(epool) by estimation method
| Estimation method | var(epool−1) | var(epool−2) |
|---|---|---|
| | 0.0241 | 0.0060 |
| Nested GLMM | 0.0241 | 0.0112 |
| Non-nested GLMM | 0.0239 | 0.0152 |
See appendix 1 for the definition of var(epool−1) and appendix 2 for the definition of var(epool−2). By construction var(epool−2) must be smaller than var(epool−1). var(epool−1) gives an approximate estimate of the overall variance associated with pooling; this estimate averages over all SNPs without taking into account the differing precision (i.e. number of functioning probes) available for each SNP.
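As a rough illustration of how a pooling variance such as var(epool) might enter a case-control comparison, the sketch below adds one pooling error term per pool to the binomial sampling variance of an allele frequency difference. The function name, the exact form of the statistic, and the example inputs are assumptions made for illustration; the definitions actually used are those in the appendices.

```python
def pooled_test_statistic(p_case, p_control, n_case, n_control, var_pool):
    """Chi-square-like statistic for a pooled allele frequency difference.

    Assumed form (illustrative only): squared frequency difference divided by
    the binomial sampling variance plus one pooling error variance per pool.
    n_case and n_control count individuals, each contributing two alleles.
    """
    # Pooled allele frequency under the null hypothesis of no difference.
    p_bar = (p_case * n_case + p_control * n_control) / (n_case + n_control)
    # Binomial sampling variance of the frequency difference.
    sampling_var = p_bar * (1 - p_bar) * (1 / (2 * n_case) + 1 / (2 * n_control))
    # One pooling error term for the case pool and one for the control pool.
    total_var = sampling_var + 2 * var_pool
    return (p_case - p_control) ** 2 / total_var


# Hypothetical inputs: frequencies and pooling variance are not from the paper.
without_pooling_error = pooled_test_statistic(0.30, 0.25, 384, 384, 0.0)
with_pooling_error = pooled_test_statistic(0.30, 0.25, 384, 384, 0.0001)
print(without_pooling_error, with_pooling_error)
```

Under this assumed form, any non-zero pooling variance shrinks the statistic relative to individual genotyping, which is one way to think about the loss of equivalent sample size.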
Variance component estimates from nested model (case sample only)
| | Replicate | Strand | Probe |
|---|---|---|---|
| Variance component (as variance) | 0.00026 | 0.00268 | 0.01054 |
| Variance component (as standard deviation) | 0.01612 | 0.05176 | 0.10266 |
| Contribution to variance in allele frequency estimate (as variance) | 0.00009 | 0.00044 | 0.00042 |
| Contribution to variance in allele frequency estimate (as standard deviation) | 0.00931 | 0.02097 | 0.02049 |
The first two rows give the variance component estimates for each source (as variance and as standard deviation). These values are then inserted into Equation 3 to give the contribution to the variance in allele frequency estimate (variance/standard deviation given in last two rows).
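The relationship between the variance and standard deviation rows can be checked directly, since each standard deviation should be the square root of the corresponding variance. The short sketch below does this for the first two rows of the table; the dictionary names are illustrative only.

```python
import math

# Variance components and reported standard deviations, copied from the table.
variance_components = {"Replicate": 0.00026, "Strand": 0.00268, "Probe": 0.01054}
reported_sd = {"Replicate": 0.01612, "Strand": 0.05176, "Probe": 0.10266}

# Recompute each standard deviation as the square root of the variance.
computed_sd = {name: math.sqrt(v) for name, v in variance_components.items()}

for name in variance_components:
    print(name, round(computed_sd[name], 5), reported_sd[name])
```

Small discrepancies in the last digit are expected because the tabulated variances are themselves rounded.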
Test statistic comparison at 1% level
| Test statistic | # SNPs exceeding 1% level | Proportion exceeding 1% level |
|---|---|---|
| | 9370 | 0.16580 |
| | 734 | 0.01299 |
| | 666 | 0.01179 |
| | 620 | 0.01097 |
| | 554 | 0.00981 |
The total number of SNPs is 56 494. The number expected to exceed the 1% level under the null hypothesis of no true associations is 565.
Test statistic comparison at 0.1% level
| Test statistic | # SNPs exceeding 0.1% level | Proportion exceeding 0.1% level |
|---|---|---|
| | 162 | 0.00287 |
| | 91 | 0.00161 |
| | 81 | 0.00143 |
| | 69 | 0.00122 |
The total number of SNPs is 56 494. The number expected to exceed the 0.1% level under the null hypothesis of no true associations is 56.
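The expected counts quoted for the 1% and 0.1% levels follow from multiplying the total number of SNPs by the significance level. The sketch below reproduces that arithmetic and also reports, for each observed count in the two tables, the ratio of observed to expected; the function names are illustrative.

```python
TOTAL_SNPS = 56494

# Observed counts of SNPs exceeding each level, copied from the two tables.
observed_counts = {
    0.01: [9370, 734, 666, 620, 554],
    0.001: [162, 91, 81, 69],
}


def expected_count(total, alpha):
    """Number of SNPs expected to exceed the level if no SNP is associated."""
    return total * alpha


def excess_ratio(observed, total, alpha):
    """Observed count divided by the null expectation (1.0 means no excess)."""
    return observed / expected_count(total, alpha)


for alpha, counts in observed_counts.items():
    print(alpha, round(expected_count(TOTAL_SNPS, alpha)))
    for count in counts:
        print("  ", count, round(excess_ratio(count, TOTAL_SNPS, alpha), 2))
```

A ratio close to 1 indicates a statistic whose tail behaves roughly as expected under the null hypothesis, while a large ratio indicates an inflated false positive rate.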
Figure 1. Comparison of test statistic performance. Results for 56 494 SNPs. Each statistic is plotted against the expected distribution under the hypothesis that there are no (or very few) true positives. A test statistic with distribution exactly equal to the expected asymptotic distribution will lie along the plotted y = x line. The plotted lines exhibit considerable stochastic variation at high values (>18) because there are few data points in this range.
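A plot of this kind can be built by pairing the sorted observed statistics with expected quantiles of the reference distribution. The sketch below assumes the reference distribution is chi-square with one degree of freedom, which the caption does not state explicitly; the function names and the plotting positions (i − 0.5)/n are likewise assumptions.

```python
from statistics import NormalDist


def chi2_1df_quantile(prob):
    """Quantile of a chi-square (1 df) variable, via the squared normal quantile."""
    z = NormalDist().inv_cdf((1 + prob) / 2)
    return z * z


def expected_quantiles(n):
    """Expected ordered values for n statistics under the null distribution."""
    return [chi2_1df_quantile((i - 0.5) / n) for i in range(1, n + 1)]


def qq_pairs(observed):
    """Pair each sorted observed statistic with its expected quantile."""
    ordered = sorted(observed)
    return list(zip(expected_quantiles(len(ordered)), ordered))


# Thresholds corresponding to the 1% and 0.1% levels under this assumption.
print(round(chi2_1df_quantile(0.99), 3), round(chi2_1df_quantile(0.999), 3))
```

Points from `qq_pairs` that fall on the y = x line indicate agreement with the assumed null distribution; with tens of thousands of SNPs only a handful of points lie in the extreme upper tail, which is why that region is noisy.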
kmax values at varying frequencies in cases and controls
| Frequency case/control | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.22 | 0.27 | 0.33 | 0.41 | |||||
| 0.2 | 0.33 | 0.41 | |||||||
| 0.3 | 0.22 | 0.33 | |||||||
| 0.4 | 0.27 | 0.41 | |||||||
| 0.5 | 0.33 | 3 | |||||||
| 0.6 | 0.41 | 2.45 | 3.67 | ||||||
| 0.7 | 3.06 | 4.58 | |||||||
| 0.8 | 2.45 | 3.06 | |||||||
| 0.9 | 2.45 | 3 | 3.67 | 4.58 |
kmax values between 0.5 and 2 are in boldface; kmax values in the ranges 0.2–0.5 and 2–5 are in normal font; and kmax values less than 0.2 or greater than 5 are in italic font.
Figure 2. Comparison of log-transformed P-values calculated assuming k = 1 with log-transformed P-values where k was estimated from ∼3000 individuals. Data from 74 SNPs are shown. The regression line, y = 0.000 + 0.989x, is drawn on the plot. The line y = x is not shown but is virtually indistinguishable from the regression line.
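A regression line of the kind reported in the caption can be obtained by ordinary least squares on the paired log-transformed P-values. The sketch below is a minimal implementation; the function name is hypothetical and the input values are toy numbers chosen only to exercise the code, not data from the study.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and cross-products of x and y about their means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy values only: x mimics log P-values with k = 1, y with k estimated.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.0, 6.0, 8.0]
intercept, slope = fit_line(toy_x, toy_y)
print(intercept, slope)
```

An intercept near 0 and a slope near 1 would indicate that assuming k = 1 changes the P-values very little, which is the pattern the caption describes.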