Rafal Kustra, Xiaofei Shi, Duncan J Murdoch, Celia M T Greenwood, Jagadish Rangrej.
Abstract
We present a new method to efficiently estimate very large numbers of p-values using empirically constructed null distributions of a test statistic. The need to evaluate a very large number of p-values is increasingly common with modern genomic data, and when interaction effects are of interest, the number of tests can easily run into billions. When the asymptotic distribution is not easily available, permutations are typically used to obtain p-values but these can be computationally infeasible in large problems. Our method constructs a prediction model to obtain a first approximation to the p-values and uses Bayesian methods to choose a fraction of these to be refined by permutations. We apply and evaluate our method on the study of association between 2-way interactions of genetic markers and colorectal cancer using the data from the first phase of a large, genome-wide case-control study. The results show enormous computational savings as compared to evaluating a full set of permutations, with little decrease in accuracy.
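To make the permutation baseline concrete, here is a minimal sketch of how an empirical p-value can be estimated by permuting case and control labels. The test statistic (an absolute difference in means), the function name `permutation_p_value`, and the add-one correction are illustrative assumptions; the abstract does not specify the statistic used in the study.

```python
import random

def permutation_p_value(cases, controls, n_perm=1000, seed=0):
    """Estimate a permutation p-value for an absolute difference in means.

    Hypothetical helper: the statistic is chosen only for illustration.
    """
    rng = random.Random(seed)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)

    def stat(values):
        a, b = values[:n_cases], values[n_cases:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(pooled)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign case/control labels at random
        if stat(pooled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)

p = permutation_p_value([5.1, 6.0, 5.8, 6.3, 5.9], [1.0, 1.2, 0.8, 1.1, 0.9], n_perm=500)
print(p)
```

Each p-value costs `n_perm` evaluations of the statistic, which is why running this for every one of billions of tests quickly becomes infeasible.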
Keywords: Bayesian testing; Genome-wide association studies; Interaction effects; Permutation distribution; Random Forest; p-value distribution
Year: 2008; PMID: 18304995; PMCID: PMC2536722; DOI: 10.1093/biostatistics/kxm053
Source DB: PubMed; Journal: Biostatistics; ISSN: 1465-4644; Impact factor: 5.899
Genotypes and potential haplotypes for 3 SNP markers. The 2 true haplotypes are ACG and TCC. However, only the genotype data are observed. Two possible haplotype pairs are consistent with the observed genotypes.
| | Marker 1 | Marker 2 | Marker 3 |
| True haplotype pair | | | |
| Chromosome 1 | A | C | G |
| Chromosome 2 | T | C | C |
| Observed genotypes | AT | CC | CG |
| Second potential haplotype pair | | | |
| Chromosome 1 | A | C | C |
| Chromosome 2 | T | C | G |
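The phase ambiguity in this table can be reproduced by enumerating every way of assigning the two alleles at each marker to the two chromosomes. The sketch below is only an illustration of that enumeration; the function name `compatible_haplotype_pairs` is hypothetical.

```python
from itertools import product

def compatible_haplotype_pairs(genotypes):
    """List unordered haplotype pairs consistent with unphased genotypes.

    `genotypes` holds one two-letter string per marker, e.g. "AT".
    """
    pairs = set()
    # At each marker, either allele may sit on chromosome 1.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((hap1, hap2))))  # ignore chromosome order
    return sorted(pairs)

print(compatible_haplotype_pairs(["AT", "CC", "CG"]))
# → [('ACC', 'TCG'), ('ACG', 'TCC')]
```

Only heterozygous markers contribute ambiguity: with two heterozygous markers (1 and 3) there are two distinct pairs, matching the true pair and the second potential pair in the table.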
Sample table of haplotype-pair counts, rounded to 1 decimal place
| Win1 | Win2 | Case | Control | Total | Win1 | Win2 | Case | Control | Total |
| CCC | CAC | 1.2 | 0.2 | 1.4 | TCC | CAC | 0.0 | 0.1 | 0.1 |
| CCC | CAG | 0.0 | 0.0 | 0.1 | TCC | CAG | 0.0 | 0.0 | 0.1 |
| CCC | CGC | 9.5 | 6.9 | 16.4 | TCC | CGC | 4.6 | 2.9 | 7.4 |
| CCC | CGG | 4.9 | 2.0 | 6.9 | TCC | CGG | 1.7 | 0.9 | 2.6 |
| CCC | TAC | 5.8 | 3.4 | 9.2 | TCC | TAC | 1.5 | 1.7 | 3.3 |
| CCC | TAG | 1.1 | 1.9 | 3.0 | TCC | TAG | 1.0 | 0.5 | 1.5 |
| CCC | TGC | 6.5 | 4.4 | 10.9 | TCC | TGC | 1.6 | 2.2 | 3.8 |
| CCC | TGG | 2.5 | 3.3 | 5.8 | TCC | TGG | 0.4 | 0.4 | 0.8 |
| CCT | CAC | 8.4 | 11.1 | 19.5 | TCT | CAC | 7.6 | 3.7 | 11.3 |
| CCT | CAG | 3.5 | 3.1 | 6.7 | TCT | CAG | 3.3 | 0.4 | 3.7 |
| CCT | CGC | 79.1 | 93.4 | 172.4 | TCT | CGC | 57.8 | 57.2 | 115.0 |
| CCT | CGG | 26.6 | 28.7 | 55.3 | TCT | CGG | 24.9 | 24.1 | 49.0 |
| CCT | TAC | 69.0 | 74.1 | 143.1 | TCT | TAC | 47.0 | 46.5 | 93.5 |
| CCT | TAG | 46.3 | 37.0 | 83.3 | TCT | TAG | 27.1 | 29.0 | 56.1 |
| CCT | TGC | 77.9 | 68.5 | 146.4 | TCT | TGC | 41.4 | 44.6 | 86.1 |
| CCT | TGG | 41.7 | 34.8 | 76.4 | TCT | TGG | 28.9 | 27.9 | 56.8 |
| CGC | CAC | 80.4 | 77.8 | 158.2 | TGC | CAC | 13.1 | 9.6 | 22.7 |
| CGC | CAG | 22.0 | 17.7 | 39.7 | TGC | CAG | 2.9 | 2.2 | 5.1 |
| CGC | CGC | 731.9 | 789.5 | 1521.4 | TGC | CGC | 107.3 | 110.9 | 218.2 |
| CGC | CGG | 293.8 | 305.0 | 598.9 | TGC | CGG | 39.5 | 43.9 | 83.5 |
| CGC | TAC | 705.6 | 660.0 | 1365.6 | TGC | TAC | 90.3 | 95.7 | 186.1 |
| CGC | TAG | 397.9 | 366.6 | 764.5 | TGC | TAG | 52.0 | 48.3 | 100.3 |
| CGC | TGC | 609.2 | 616.0 | 1225.3 | TGC | TGC | 81.3 | 75.5 | 156.8 |
| CGC | TGG | 373.7 | 355.4 | 729.1 | TGC | TGG | 39.3 | 43.7 | 83.0 |
| CGT | CAC | 7.6 | 11.4 | 19.0 | TGT | CAC | 8.6 | 8.8 | 17.4 |
| CGT | CAG | 2.9 | 2.4 | 5.3 | TGT | CAG | 2.6 | 2.8 | 5.4 |
| CGT | CGC | 83.0 | 89.5 | 172.5 | TGT | CGC | 73.5 | 78.3 | 151.8 |
| CGT | CGG | 36.5 | 25.0 | 61.5 | TGT | CGG | 39.2 | 25.8 | 65.0 |
| CGT | TAC | 64.8 | 86.9 | 151.8 | TGT | TAC | 59.7 | 77.8 | 137.5 |
| CGT | TAG | 46.7 | 42.2 | 89.0 | TGT | TAG | 49.8 | 44.5 | 94.3 |
| CGT | TGC | 65.5 | 60.2 | 125.7 | TGT | TGC | 52.9 | 50.8 | 103.7 |
| CGT | TGG | 45.0 | 28.0 | 73.0 | TGT | TGG | 38.0 | 30.3 | 68.3 |
| Total | | 4949.6 | 4897.5 | 9847.1 |
Fig. 1. p-value prediction algorithm.
Fig. 2. p-value Bayesian update algorithm. K is the number of p-values to consider, N is a cap for the number of permutations for each p-value, p0 is a target p-value, b is a batch size for the updates, and n is the number of permutations done for window pair k so far.
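Only the caption of Fig. 2 survives here, so the following is a loose sketch of how a batched Bayesian update with these parameters might look, not the published algorithm. It assumes a uniform Beta(1, 1) prior on each p-value and a stopping rule that abandons a window pair once the posterior probability of p ≤ p0 drops below a threshold; the names `prob_p_below`, `refine`, `draw_exceeds`, and `drop_below` are all hypothetical.

```python
import math
import random

def prob_p_below(x, n, p0):
    """Posterior P(p <= p0) after x exceedances in n permutations.

    Assumes a uniform Beta(1, 1) prior, so the posterior is
    Beta(x + 1, n - x + 1). For integer parameters its CDF equals the
    binomial tail P(Bin(n + 1, p0) >= x + 1). Requires 0 < p0 < 1.
    """
    m = n + 1
    lower = 0.0  # P(Bin(m, p0) <= x), with each term computed in log space
    for j in range(x + 1):
        log_pmf = (math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)
                   + j * math.log(p0) + (m - j) * math.log1p(-p0))
        lower += math.exp(log_pmf)
    return max(0.0, 1.0 - lower)

def refine(draw_exceeds, N, p0, b, drop_below=0.01):
    """Permute in batches of b until the cap N is hit or the p-value is
    unlikely to be as small as p0 (assumed stopping rule)."""
    x = n = 0
    while n < N:
        batch = min(b, N - n)
        x += sum(draw_exceeds() for _ in range(batch))  # count exceedances
        n += batch
        if prob_p_below(x, n, p0) < drop_below:
            break  # stop spending permutations on this window pair
    return (x + 1) / (n + 2), n  # posterior mean, permutations used

# Simulated example with K = 3 window pairs and made-up true p-values.
rng = random.Random(1)
for true_p in (0.5, 0.2, 0.0005):
    estimate, used = refine(lambda: rng.random() < true_p, N=10000, p0=0.001, b=100)
    print(true_p, estimate, used)
```

Under these assumptions, window pairs with clearly large p-values are dropped after the first batch or two, and only candidates near or below p0 consume permutations up to the cap N.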
Fig. 3. p-value estimates based on the reference permutations and as predicted by the RF model on all 410 181 window pairs.
Fig. 4. Sensitivities of different strategies.
Numbers of permutations used by each strategy
| Method | Permutations (millions) |
| Besag(3) | 15.2 |
| Besag(6) | 28.7 |
| BUaP ( | 8.4 |
| BUaP ( | 13.4 |
| BUaP ( | 31.4 |
| Classic | 4101.8 |
| PO | 0.0 |
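The size of the savings can be read off the table with a little arithmetic. The sketch below divides the Classic total by each strategy's total, and also divides the Classic total by the 410 181 window pairs mentioned in the Fig. 3 caption. The BUaP parameter values are truncated in the table, so those rows are identified here only by their order, and PO is omitted because it uses no permutations.

```python
# Permutation totals in millions, copied from the table.
# BUaP rows are labeled by position because their parameters are truncated.
permutations_millions = [
    ("Besag(3)", 15.2),
    ("Besag(6)", 28.7),
    ("BUaP, row 1", 8.4),
    ("BUaP, row 2", 13.4),
    ("BUaP, row 3", 31.4),
]
classic = 4101.8
window_pairs = 410181  # from the Fig. 3 caption

# Reduction factor of each strategy relative to Classic.
savings = {name: classic / used for name, used in permutations_millions}
for name, factor in savings.items():
    print(f"{name}: about {factor:.0f}x fewer permutations than Classic")

# Implied number of permutations per window pair under Classic.
per_pair = classic * 1e6 / window_pairs
print(f"Classic: about {per_pair:.0f} permutations per window pair")
```

Every listed strategy uses more than 100 times fewer permutations than Classic, and the Classic total works out to roughly 10 000 permutations per window pair.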