David M Evans, Marie Jo A Brion, Lavinia Paternoster, John P Kemp, George McMahon, Marcus Munafò, John B Whitfield, Sarah E Medland, Grant W Montgomery, Nicholas J Timpson, Beate St Pourcain, Debbie A Lawlor, Nicholas G Martin, Abbas Dehghan, Joel Hirschhorn, George Davey Smith.
Abstract
It is common practice in genome-wide association studies (GWAS) to focus on the relationship between disease risk and genetic variants one marker at a time. When relevant genes are identified, it is often possible to implicate biological intermediates and pathways likely to be involved in disease aetiology. However, single genetic variants typically explain small amounts of disease risk. Our idea is to construct allelic scores that explain greater proportions of the variance in biological intermediates, and subsequently use these scores to data mine GWAS. To investigate the approach's properties, we indexed three biological intermediates where the results of large GWAS meta-analyses were available: body mass index (BMI), C-reactive protein (CRP) and low density lipoprotein (LDLc) levels. We generated allelic scores in the Avon Longitudinal Study of Parents and Children (ALSPAC), and in publicly available data from the first Wellcome Trust Case Control Consortium (WTCCC). We compared the explanatory ability of allelic scores in terms of their capacity to proxy for the intermediate of interest, and the extent to which they associated with disease. We found that allelic scores derived from known variants and allelic scores derived from hundreds of thousands of genetic markers explained significant portions of the variance in biological intermediates of interest, and many of these scores showed expected correlations with disease. Genome-wide allelic scores, however, tended to lack specificity, suggesting that they should be used with caution and perhaps only to proxy biological intermediates for which there are no known individual variants. Power calculations confirm the feasibility of extending our strategy to the analysis of tens of thousands of molecular phenotypes in large genome-wide meta-analyses.
We conclude that our method represents a simple way in which potentially tens of thousands of molecular phenotypes could be screened for causal relationships with disease without having to expensively measure these variables in individual disease collections.
Year: 2013 PMID: 24204319 PMCID: PMC3814299 DOI: 10.1371/journal.pgen.1003919
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Figure 1. Association between polygene score and BMI measured at age nine in the ALSPAC cohort.
Association between polygene score and BMI measured at age nine using different p-value thresholds for the construction of the score in ALSPAC children (N = 5819). The lines joining the circles display the results for allelic scores calculated by using genotyped variants from across the genome in either a weighted (unbroken line) or an unweighted (dashed line) fashion. The lines joining the triangles display scores calculated similarly but excluding all variants within ±1 Mb of 32 known BMI variants, and using either a weighted (unbroken line) or unweighted (dashed line) strategy. The histogram in the background displays the number of SNPs involved in construction of the allelic score at each corresponding SNP inclusion threshold for the “All variants” condition.
Figure 2. Association between polygene score and CRP measured at age nine in the ALSPAC cohort.
Association between polygene score and CRP measured at age nine using different p-value thresholds for the construction of the score in ALSPAC children (N = 4251). The lines joining the circles display the results for allelic scores calculated by using genotyped variants from across the genome in either a weighted (unbroken line) or an unweighted (dashed line) fashion. The lines joining the triangles display scores calculated similarly but excluding all variants within ±1 Mb of 18 known CRP variants, and using either a weighted (unbroken line) or unweighted (dashed line) strategy. The histogram in the background displays the number of SNPs involved in construction of the allelic score at each corresponding SNP inclusion threshold for the “All variants” condition.
Figure 3. Association between polygene score and LDLc measured at age nine in the ALSPAC cohort.
Association between polygene score and LDLc measured at age nine using different p-value thresholds for the construction of the score in ALSPAC children (N = 4251). The lines joining the circles display the results for allelic scores calculated by using genotyped variants from across the genome in either a weighted (unbroken line) or an unweighted (dashed line) fashion. The lines joining the triangles display scores calculated similarly but excluding all variants within ±1 Mb of 37 known LDLc variants, and using either a weighted (unbroken line) or unweighted (dashed line) strategy. The histogram in the background displays the number of SNPs involved in construction of the allelic score at each corresponding SNP inclusion threshold for the “All variants” condition.
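The weighted and unweighted scores described in the figure captions can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `allelic_score`, the dictionary inputs, and the SNP identifiers are hypothetical, and the unweighted rule used here (counting trait-increasing alleles) is an assumption about one common convention.

```python
from typing import Dict


def allelic_score(
    dosages: Dict[str, int],
    effects: Dict[str, float],
    pvalues: Dict[str, float],
    threshold: float,
    weighted: bool = True,
) -> float:
    """Allelic score for one individual (illustrative sketch).

    dosages:   SNP id -> number of effect alleles carried (0, 1 or 2)
    effects:   SNP id -> per-allele effect estimate from a discovery GWAS
    pvalues:   SNP id -> discovery p-value used for SNP inclusion
    threshold: SNPs with p-value <= threshold enter the score
    """
    score = 0.0
    for snp, dosage in dosages.items():
        if pvalues[snp] > threshold:
            continue  # SNP does not pass the inclusion threshold
        beta = effects[snp]
        if weighted:
            # Weighted score: each allele counts in proportion to its effect.
            score += beta * dosage
        else:
            # Unweighted score (assumed convention): count the alleles
            # associated with an increase in the trait.
            score += dosage if beta >= 0 else 2 - dosage
    return score


if __name__ == "__main__":
    # Hypothetical genotypes and discovery results for a single individual.
    dosages = {"rs_a": 2, "rs_b": 1, "rs_c": 0}
    effects = {"rs_a": 0.10, "rs_b": -0.05, "rs_c": 0.20}
    pvalues = {"rs_a": 1e-9, "rs_b": 0.03, "rs_c": 0.4}
    for threshold in (5e-8, 0.05, 1.0):
        w = allelic_score(dosages, effects, pvalues, threshold, weighted=True)
        u = allelic_score(dosages, effects, pvalues, threshold, weighted=False)
        print(f"p <= {threshold:g}: weighted={w:.3f} unweighted={u:.0f}")
```

Relaxing the threshold admits more SNPs into the score, which corresponds to moving along the horizontal axis of the figures; excluding SNPs near known variants would amount to dropping those identifiers from `dosages` before scoring.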
Association between case-control status in the WTCCC and either a weighted genome-wide score consisting of all SNPs across the genome (“GW Score”), a weighted allelic score consisting of highly significant SNPs (p < 5×10⁻⁸) from known regions only (“Known”), or a weighted genome-wide score consisting of all SNPs across the genome with SNPs from known regions removed from its construction (“Complement”).
| Disease | BMI | | | | | | CRP | | | | | | LDLc | | | | | |
| | GW Score | | Known | | Complement | | GW Score | | Known | | Complement | | GW Score | | Known | | Complement | |
| | Dir | P | Dir | P | Dir | P | Dir | P | Dir | P | Dir | P | Dir | P | Dir | P | Dir | P |
| BD | − | 0.051 | − | 0.62 | − | 0.026 | + | 0.37 | + | 0.11 | + | 0.96 | − | 0.049 | − | 0.88 | − | 0.059 |
| CHD | + | 0.37 | + | 0.17 | + | 0.57 | + | 0.028 | + | 0.80 | + | 0.079 | + | 1.7×10⁻³ | + | 9.2×10⁻³ | + | 0.049 |
| HT | − | 0.76 | − | 0.58 | + | 0.76 | + | 0.20 | + | 0.23 | + | 0.53 | − | 0.011 | − | 0.75 | − | 0.012 |
| CD | − | 0.97 | + | 0.90 | + | 0.99 | + | 2.9×10⁻⁴ | + | 0.051 | + | 0.011 | − | 0.73 | − | 0.76 | − | 0.71 |
| RA | − | 0.18 | + | 0.15 | − | 0.085 | + | 0.17 | + | 0.028 | + | 0.69 | − | 0.26 | − | 0.25 | − | 0.50 |
| T1D | − | 0.97 | + | 0.77 | + | 0.85 | + | 0.020 | + | 0.15 | + | 0.033 | − | 0.018 | + | 0.58 | − | 0.20 |
| T2D | + | <2×10⁻¹⁶ | + | 4.3×10⁻⁷ | + | 1.8×10⁻¹² | + | 7.6×10⁻⁸ | + | 0.50 | + | 2.1×10⁻⁷ | + | 0.66 | − | 0.12 | + | 0.48 |
See Tables S1 through S3 for a complete list of results.
BD = Bipolar Disorder; CHD = Coronary Heart Disease; HT = Hypertension; CD = Crohn's Disease; RA = Rheumatoid Arthritis; T1D = Type 1 Diabetes; T2D = Type 2 Diabetes.
Dir = Direction of effect; P = P value.
Approximate power to detect association between an allelic score indexing a biological exposure and disease.
| σG² | β | σL² | Disease prevalence | Power (2000 cases, 3000 controls) | Power (50000 cases, 50000 controls) |
| 10% | .1 | 0.1% | 1% | 83.8% | 100% |
| 10% | .2 | 0.4% | 1% | 100% | 100% |
| 10% | .5 | 2.5% | 1% | 100% | 100% |
| 10% | .1 | 0.1% | 5% | 66.2% | 100% |
| 10% | .2 | 0.4% | 5% | 99.7% | 100% |
| 10% | .5 | 2.5% | 5% | 100% | 100% |
| 10% | .1 | 0.1% | 10% | 57.0% | 100% |
| 10% | .2 | 0.4% | 10% | 99.0% | 100% |
| 10% | .5 | 2.5% | 10% | 100% | 100% |
| 10% | .1 | 0.1% | 20% | 48.3% | 100% |
| 10% | .2 | 0.4% | 20% | 97.0% | 100% |
| 10% | .5 | 2.5% | 20% | 100% | 100% |
| 5% | .1 | 0.05% | 1% | 55.0% | 100% |
| 5% | .2 | 0.2% | 1% | 98.6% | 100% |
| 5% | .5 | 1.25% | 1% | 100% | 100% |
| 5% | .1 | 0.05% | 5% | 39.1% | 99.1% |
| 5% | .2 | 0.2% | 5% | 92.0% | 100% |
| 5% | .5 | 1.25% | 5% | 100% | 100% |
| 5% | .1 | 0.05% | 10% | 32.7% | 94.3% |
| 5% | .2 | 0.2% | 10% | 85.6% | 100% |
| 5% | .5 | 1.25% | 10% | 100% | 100% |
| 5% | .1 | 0.05% | 20% | 27.3% | 81.0% |
| 5% | .2 | 0.2% | 20% | 77.4% | 100% |
| 5% | .5 | 1.25% | 20% | 100% | 100% |
| 1% | .1 | 0.01% | 1% | 15.4% | 14.6% |
| 1% | .2 | 0.04% | 1% | 46.2% | 99.9% |
| 1% | .5 | 0.25% | 1% | 99.7% | 100% |
| 1% | .1 | 0.01% | 5% | 11.7% | 3.0% |
| 1% | .2 | 0.04% | 5% | 32.5% | 94.0% |
| 1% | .5 | 0.25% | 5% | 96.4% | 100% |
| 1% | .1 | 0.01% | 10% | 10.4% | 1.3% |
| 1% | .2 | 0.04% | 10% | 27.2% | 80.4% |
| 1% | .5 | 0.25% | 10% | 92.2% | 100% |
| 1% | .1 | 0.01% | 20% | 9.3% | 0.5% |
| 1% | .2 | 0.04% | 20% | 22.8% | 58.8% |
| 1% | .5 | 0.25% | 20% | 85.8% | 100% |
| 0.1% | .1 | 0.001% | 1% | 6.0% | 0% |
| 0.1% | .2 | 0.004% | 1% | 9.1% | 0.4% |
| 0.1% | .5 | 0.025% | 1% | 31.4% | 92.2% |
| 0.1% | .1 | 0.001% | 5% | 5.7% | 0% |
| 0.1% | .2 | 0.004% | 5% | 7.6% | 0.1% |
| 0.1% | .5 | 0.025% | 5% | 22.1% | 54.7% |
| 0.1% | .1 | 0.001% | 10% | 5.5% | 0% |
| 0.1% | .2 | 0.004% | 10% | 7.1% | 0% |
| 0.1% | .5 | 0.025% | 10% | 18.7% | 33.1% |
| 0.1% | .1 | 0.001% | 20% | 5.4% | 0% |
| 0.1% | .2 | 0.004% | 20% | 6.7% | 0% |
| 0.1% | .5 | 0.025% | 20% | 16% | 17.5% |
The model is parameterized in terms of the percentage of variance in the biological intermediate explained by the SNP (σG²), the strength of the causal relationship (β) between the biological intermediate and liability to disease, which together determine the amount of variance in disease liability explained by the SNP (σL²), and the prevalence of disease. Estimates of power are presented for 2000 cases and 3000 controls (α = 0.05), and for 50000 cases and 50000 controls (α = 1.1×10⁻⁷).
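The σL² column in the power table is consistent with the relation σL² = β² × σG² (for example, 0.5² × 10% = 2.5%). That relation is inferred here from the tabulated values rather than quoted from the source, and the short sketch below only reproduces that column; the function name `liability_variance_explained` is hypothetical, and the sketch does not attempt the power calculation itself.

```python
def liability_variance_explained(var_g: float, beta: float) -> float:
    """Proportion of variance in disease liability explained by the score.

    var_g: proportion of variance in the intermediate explained (sigma_G^2)
    beta:  causal effect of the intermediate on disease liability

    Assumes sigma_L^2 = beta^2 * sigma_G^2, a relation inferred from the table.
    """
    if not 0.0 <= var_g <= 1.0:
        raise ValueError("var_g must be a proportion between 0 and 1")
    return beta ** 2 * var_g


if __name__ == "__main__":
    # Reproduce the sigma_L^2 column for every (sigma_G^2, beta) pair tabulated.
    for var_g in (0.10, 0.05, 0.01, 0.001):
        for beta in (0.1, 0.2, 0.5):
            var_l = liability_variance_explained(var_g, beta)
            print(f"sigma_G^2={var_g:.1%}  beta={beta}  sigma_L^2={var_l:.4%}")
```

Because β enters as a square, halving the causal effect cuts the variance explained in liability by a factor of four, which helps explain why power falls so steeply between the β = .5 and β = .1 rows.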