Edwin S Iversen, Gary Lipton, Merlise A Clyde, Alvaro N A Monteiro.
Abstract
BACKGROUND: Genetic association studies are conducted to discover genetic loci that contribute to an inherited trait, identify the variants behind these associations and ascertain their functional role in determining the phenotype. To date, functional annotations of the genetic variants have rarely played more than an indirect role in assessing evidence for association. Here, we demonstrate how these data can be systematically integrated into an association study's analysis plan.
Year: 2014 PMID: 24886216 PMCID: PMC4041996 DOI: 10.1186/1471-2164-15-398
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Two–stage procedure for integrating variant–level functional annotation data with subject–level genetic association data. At the first stage, functional annotation data are combined to estimate the prior (to observing the genetic association data) probability of association for each variant. At stage two, these estimates are combined with the Bayes Factor (a metric of association) in favor of genetic association via Bayes’ formula to estimate the posterior (to observing the functional and genetic association data) probability of association for each variant.
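Written out, the stage-two update described in Figure 1 is Bayes' formula on the odds scale. The symbols below are illustrative shorthand rather than notation taken from the paper: $\pi_j$ denotes the stage-one prior probability of association for variant $j$ given its functional annotations, $\mathrm{BF}_j$ the Bayes Factor in favor of association from the genetic data, and $p_j$ the resulting posterior probability.

```latex
\begin{align*}
\frac{p_j}{1 - p_j} &= \mathrm{BF}_j \cdot \frac{\pi_j}{1 - \pi_j}, \\
p_j &= \frac{\mathrm{BF}_j \, \pi_j}{\mathrm{BF}_j \, \pi_j + 1 - \pi_j}, \\
\log \frac{p_j}{1 - p_j} &= \log \mathrm{BF}_j + \log \frac{\pi_j}{1 - \pi_j}.
\end{align*}
```

The last line shows that, on the log scale, the prior log–odds and the log Bayes Factor simply add, provided both use the same logarithm base.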
Figure 2. Construction and evaluation of models for (prior) probability of association given the functional annotation data. The purple arrows represent model construction (‘training’), while the green arrows represent evaluation of the models. Construction of the training set, validation set and functional annotation database are depicted in Additional file 1: Figure S10 and described in Methods. The training data were used to construct a series of models, each distinguished by the coefficients (or ‘weights’) it assigns to the various annotation variables. We chose the best among these by comparing their predictions in the validation set using the concordance index.
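The model comparison step in Figure 2 can be sketched as follows. This is a minimal illustration, assuming the concordance index is the usual binary-outcome version (the fraction of associated/unassociated variant pairs in which the associated variant receives the higher predicted probability, with ties counted as one half). The function names and the toy inputs are hypothetical and do not come from the paper.

```python
from itertools import product


def concordance_index(scores, labels):
    """Fraction of (associated, unassociated) pairs ordered correctly.

    scores: predicted prior probabilities of association, one per variant.
    labels: 1 for an associated variant, 0 for an unassociated one.
    Ties in score contribute 1/2, as in the usual definition.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    total = 0.0
    for p, n in product(pos, neg):
        if p > n:
            total += 1.0
        elif p == n:
            total += 0.5
    return total / (len(pos) * len(neg))


def pick_best_model(predictions_by_model, labels):
    """Return the name of the model with the highest validation concordance."""
    return max(
        predictions_by_model,
        key=lambda name: concordance_index(predictions_by_model[name], labels),
    )


# Toy validation set (hypothetical values, for illustration only).
validation_labels = [1, 0, 1, 0]
toy_predictions = {
    "model_a": [0.9, 0.8, 0.3, 0.1],
    "model_b": [0.9, 0.2, 0.7, 0.1],
}
best = pick_best_model(toy_predictions, validation_labels)
```

The pairwise loop is quadratic in the number of variants, so a validation set of realistic size would call for a rank-based computation instead; the quantity being estimated is the same.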
Annotations used to construct the functional signatures
| Name | Annotation class | Description |
|---|---|---|
| MAF | Minor Allele Frequency | Natural spline basis for MAF |
| funcIntron | dbSNP Function Class | Indicator that variant is intronic. |
| funcNg3 | dbSNP Function Class | Indicator that variant is near-gene-3. |
| funcNg5 | dbSNP Function Class | Indicator that variant is near-gene-5. |
| funcNonsynon | dbSNP Function Class | Indicator that variant is missense or nonsense. |
| funcSynon | dbSNP Function Class | Indicator that variant is synonymous. |
| funcUTR | dbSNP Function Class | Indicator that variant is in the 3′ or 5′ UTR. |
| PhyPC | phyloP Evol. Cons. Score | First 4 PCs for PhyloP data. |
| IndelInd | DGV Regions | Indicator that SNP is in the region of a known in–del. |
| CNVInd | DGV Regions | Indicator that SNP is in the region of a known CNV. |
| InvInd | DGV Regions | Indicator that SNP is in the region of a known inversion. |
| BrPC | ENCODE Super–Track | PCs of Broad promoter/enhancer ChIP–seq data. |
| CalPC | ENCODE Super–Track | PCs of CalTech transcription level RNA–seq data. |
| logDNase | ENCODE Regulatory Super–Track | DNaseI hypersensitivity cluster log(score). |
| TFBSfreq | ENCODE Regulatory Super–Track | SNP in ChIP–seq TFBS region(s) – count. |
| logTFBS | ENCODE Regulatory Super–Track | SNP in ChIP–seq TFBS region(s) – log(TFBS score). |
| ORegInd | Open REGulatory ANNOtation DB | Indicator that SNP is in ORegAnno DB. |
| PPh2Prob | PolyPhen–2 | Probability that SNP is damaging. |
| RegDBcat | RegulomeDB | RegulomeDB category. |
Definitions of the 54 variables appearing in the prior model for association status arranged by type/class of annotation.
Summary of estimates for the model for association status given the functional annotation data
| Coefficient | Mean | SD | Mean/SD | Coefficient | Mean | SD | Mean/SD |
|---|---|---|---|---|---|---|---|
| MAF1 | 0.029 | 0.0272 | 1.051 | CalPC8 | 0.096 | 0.0608 | 1.584 |
| MAF2 | 0.003 | 0.0185 | 0.151 | CalPC9 | 0.009 | 0.0218 | 0.406 |
| MAF3 | 0.018 | 0.0236 | 0.759 | CalPC10 | -0.019 | 0.0234 | -0.809 |
| MAF4 | -0.008 | 0.0199 | -0.425 | CalPC11 | -0.044 | 0.0468 | -0.943 |
| BrPC1 | -0.348 | 0.0388 | -8.983 | PhyPC1 | 0.225 | 0.0452 | 4.982 |
| BrPC2 | 0.174 | 0.0360 | 4.845 | PhyPC2 | -0.023 | 0.0308 | -0.742 |
| BrPC3 | 0.002 | 0.0172 | 0.099 | PhyPC3 | 0.053 | 0.0395 | 1.336 |
| BrPC4 | -0.077 | 0.0301 | -2.561 | PhyPC4 | 0.024 | 0.0289 | 0.839 |
| BrPC5 | 0.149 | 0.0301 | 4.932 | funcIntron | -0.003 | 0.0199 | -0.160 |
| BrPC6 | 0.097 | 0.0329 | 2.961 | funcNg3 | 0.002 | 0.0238 | 0.098 |
| BrPC7 | -0.019 | 0.0229 | -0.825 | funcNg5 | 0.003 | 0.0200 | 0.158 |
| BrPC8 | -0.078 | 0.0312 | -2.498 | funcNonsynon | 0.089 | 0.0387 | 2.308 |
| BrPC9 | -0.012 | 0.0202 | -0.573 | funcSynon | -0.007 | 0.0236 | -0.283 |
| BrPC10 | 0.007 | 0.0182 | 0.407 | funcUTR | 0.002 | 0.0210 | 0.078 |
| BrPC11 | 0.021 | 0.0239 | 0.887 | logDNase | 0.011 | 0.0253 | 0.419 |
| BrPC12 | -0.039 | 0.0290 | -1.343 | TFBSfreq | 0.024 | 0.0267 | 0.885 |
| BrPC13 | -0.094 | 0.0318 | -2.972 | logTFBS | 0.019 | 0.0299 | 0.641 |
| BrPC14 | -0.000 | 0.0175 | -0.015 | ORegInd | 0.027 | 0.0236 | 1.163 |
| BrPC15 | -0.039 | 0.0286 | -1.354 | IndelInd | -0.023 | 0.0337 | -0.693 |
| BrPC16 | 0.009 | 0.0185 | 0.467 | CNVInd | 0.059 | 0.0321 | 1.842 |
| BrPC17 | -0.015 | 0.0213 | -0.696 | InvInd | 0.090 | 0.0305 | 2.938 |
| BrPC18 | -0.014 | 0.0210 | -0.688 | rDBcat1 | 0.014 | 0.0257 | 0.550 |
| CalPC1 | -0.103 | 0.0492 | -2.084 | rDBcat2 | 0.066 | 0.0378 | 1.741 |
| CalPC2 | 0.090 | 0.0441 | 2.030 | rDBcat3 | 0.012 | 0.0264 | 0.467 |
| CalPC3 | -0.019 | 0.0270 | -0.709 | rDBcat4 | 0.116 | 0.0461 | 2.508 |
| CalPC4 | 0.086 | 0.0457 | 1.889 | rDBcat5 | 0.106 | 0.0527 | 2.003 |
| CalPC5 | 0.053 | 0.0395 | 1.350 | rDBcat6 | -0.056 | 0.0552 | -1.018 |
| CalPC6 | -0.012 | 0.0233 | -0.496 | pph2prob | 0.078 | 0.0269 | 2.905 |
| CalPC7 | 0.002 | 0.0207 | 0.078 | | | | |
Estimates of the posterior mean and standard deviation are provided for each coefficient in the model along with the ratio of these quantities, a ‘signal–to–noise’ measure analogous to the Z statistic.
Means and 95% interval estimates of the concordance indices for each of the four models
| Label | Prior | Concordance index (mean) | Concordance index (95% CI) |
|---|---|---|---|
| Normal | N(0, 1) | 0.6348 | (0.6112, 0.6555) |
| NEG1 | NEG(0.834, 0.1610) | 0.6397 | (0.6148, 0.6615) |
| NEG2 | NEG(0.950, 0.0588) | 0.6433 | (0.6208, 0.6675) |
| NEG3 | NEG(0.978, 0.0245) | 0.6487 | (0.6244, 0.6675) |
Functional signatures improve inference for association status in a GWAS of ovarian cancer
| Variant | Locus | MAF | LOF | log(BFA) | RankA | RankA+F |
|---|---|---|---|---|---|---|
| rs2072590 | 2q31 | 0.34 | 1.46 | 8.63 | 65 | 59 |
| rs2665390 | 3q25 | 0.09 | 0.77 | 8.08 | 77 | 73 |
| rs10069690 | 5p15 | 0.23 | 0.91 | -1.38 | 1,549,122 | 651,710 |
| rs11782652 | 8q21 | 0.08 | 0.22 | 2.98 | 5,272 | 6,843 |
| rs7814937 | 8q24 | 0.12 | 1.54 | 14.61 | 21 | 16 |
| rs3814113 | 9p22 | 0.30 | -0.09 | 14.01 | 38 | 38 |
| rs7084454 | 10p12 | 0.31 | 1.44 | 1.19 | 45,616 | 12,221 |
| rs757210 | 17q12 | 0.37 | 1.74 | 2.31 | 11,630 | 2,411 |
| rs2077606 | 17q21 | 0.18 | 0.70 | -0.25 | 339,456 | 200,494 |
| rs9303542 | 17q21 | 0.27 | 0.05 | 3.70 | 2,276 | 3,532 |
| rs8170 | 19p13 | 0.19 | 0.82 | 2.72 | 7,133 | 4,179 |
| Mean | True + | 0.23 | 0.87 | 5.15 | 178,246 | 80,143 |
| Median | True + | 0.23 | 0.82 | 2.98 | 5,272 | 3,532 |
| Mean | True − | 0.35 | 0.11 | 0.37 | 438,664 | 517,810 |
| Median | True − | 0.36 | 0.06 | 0.14 | 181,116 | 244,393 |
Ranks of known associated variants (labeled ‘true +’) tend to improve (i.e. move closer to one) when association and functional data are incorporated in the analysis (RankA+F) relative to when only the association data are used (RankA); these variants are therefore more likely to be studied further. Conversely, ranks of (very likely) unassociated variants (labeled ‘true −’) tend to worsen (i.e. move further from one) with inclusion of the functional data. The functional data for a given variant are summarized by its ‘functional signature’, defined as the prior log–odds of its association given the functional data (LOF). Aggregate (mean and median) values are provided for the true + set and the true − set. Ranks are out of approximately 2.5M variants.
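The re-ranking summarized in this table can be illustrated with a short sketch. It assumes that LOF and log(BFA) are on the same logarithmic scale, so that their sum is the posterior log–odds of association; the excerpt does not state the base, so this is an assumption. The four rs rows are copied from the table, while `hypothetical_null` is an invented unassociated variant added only to show a rank change. Ranks here are within this toy list, not among the roughly 2.5M variants of the study.

```python
# (variant, LOF = prior log-odds, log(BFA)); rs rows are copied from the table.
rows = [
    ("rs2072590", 1.46, 8.63),
    ("rs7814937", 1.54, 14.61),
    ("rs3814113", -0.09, 14.01),
    ("rs757210", 1.74, 2.31),
    # Hypothetical unassociated variant, invented for illustration only.
    ("hypothetical_null", -1.00, 3.00),
]


def posterior_log_odds(lof, log_bf):
    """Posterior log-odds = prior log-odds + log Bayes Factor (same log base assumed)."""
    return lof + log_bf


def ranks(rows, score):
    """Map each variant name to its rank (1 = strongest evidence) under `score`."""
    ordered = sorted(rows, key=score, reverse=True)
    return {name: position for position, (name, _, _) in enumerate(ordered, start=1)}


# Rank using the association data alone, then using association plus function.
rank_a = ranks(rows, score=lambda r: r[2])
rank_a_f = ranks(rows, score=lambda r: posterior_log_odds(r[1], r[2]))

for name, _, _ in rows:
    print(name, rank_a[name], rank_a_f[name])
```

In this toy list, rs757210 overtakes the hypothetical variant once its larger functional signature is added, which mirrors the direction of change the table reports for that variant.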
Figure 3. Correlation of functional signatures between adjacent HapMap II/III SNPs as a function of base pair distance (black line). Cumulative distribution function (CDF) of base pair distances across the genome (red line).