Samuel K Handelman, Michal Seweryn, Ryan M Smith, Katherine Hartmann, Danxin Wang, Maciej Pietrzak, Andrew D Johnson, Andrzej Kloczkowski, Wolfgang Sadee.
Abstract
BACKGROUND: Over the past 50,000 years, shifts in human-environmental or human-human interactions shaped genetic differences within and among human populations, including variants under positive selection. Shaped by environmental factors, such variants influence the genetics of modern health, disease, and treatment outcome. Because evolutionary processes tend to act on gene regulation, we test whether regulatory variants are under positive selection. We introduce a new approach to enhance detection of genetic markers undergoing positive selection, using conditional entropy to capture recent local selection signals.
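As background for the term "conditional entropy", the following is a minimal sketch of the generic Shannon quantity H(Y|X) = H(X,Y) - H(X). It is not the adjusted haplotype statistic (H|H) defined in the paper; the function names and the toy allele pairs are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def conditional_entropy(pairs):
    """Generic H(Y | X) = H(X, Y) - H(X) for a list of (x, y) observations."""
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    return entropy(joint.values()) - entropy(marginal_x.values())

# Hypothetical toy data: (allele at a focal marker, allele at a neighbouring marker).
toy = [("A", "G"), ("A", "G"), ("A", "T"), ("A", "T"),
       ("C", "G"), ("C", "G"), ("C", "G"), ("C", "G")]

h = conditional_entropy(toy)
print(h)  # 0.5 bits: the neighbour is uncertain on "A" chromosomes, fixed on "C"
```

In this toy case, chromosomes carrying "C" show no remaining uncertainty at the neighbouring marker, which is the kind of local loss of diversity that an entropy-based measure is sensitive to.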
Year: 2015 PMID: 26111110 PMCID: PMC4480832 DOI: 10.1186/1471-2164-16-S8-S8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Input to the conditional logistic regression for a single gene (Cholesteryl Ester Transfer Protein, or CETP) from the Zeller 2010 data set. The figure illustrates the prediction problem in a single gene/stratum and showcases the relative degree of serial auto-correlation (smoothness) of the different predictors. At the top of the figure, each SNP is indicated with a symbol reflecting its location within the gene. Only one SNP, rs1532625, is an eQTL in the Zeller data set (indicated with a large orange hourglass symbol), and it happens to lie in an intron; the two major splice isoforms of CETP are illustrated at the bottom of the figure for reference. rs1532625 does NOT show any particular sign of being under positive selection. The conditional logistic regression used in this manuscript is fit to eQTLs such as rs1532625, with genes such as CETP treated as individual strata (equivalent to 1-to-many case-control matching in a clinical trial). In this data set CETP contains only a single eQTL, but this is not always the case. Four predictors used in this manuscript are scaled to empirical Z-scores so that they fit on the same chart; a fifth potential predictor (composite of multiple signals, cms), very powerful in other contexts, is also shown to illustrate the issue with auto-correlation. Conditional logistic models depend on a degree of independence among the predictors. Because the cms score (blue line) has such strong serial auto-correlation (as would any positive selection measure that is smoothed in a window of any size), it is not independent of the within-gene location (symbols at the top), which is used as an independent predictor. Even Fst (the green line) shows too much serial auto-correlation for the model to converge in the Mangravite data set, which was part of the motivation for developing H|H. The other three positive selection measures, including H|H, are highly non-smooth, so they can be fit to logistic models where individual strata contain short regions of DNA.
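The point about window smoothing can be illustrated with a small sketch. It uses simulated white noise rather than any real selection score, and the helper names (`lag1_autocorrelation`, `moving_average`) and the window size of 10 are assumptions made for the example.

```python
import random

def lag1_autocorrelation(values):
    """Sample lag-1 serial auto-correlation of a sequence."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    num = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    return num / denom

def moving_average(values, window):
    """Average each run of `window` consecutive values (simple window smoothing)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

rng = random.Random(0)                        # fixed seed so the sketch is repeatable
raw = [rng.gauss(0, 1) for _ in range(500)]   # stand-in for a non-smooth per-SNP score
smooth = moving_average(raw, 10)              # stand-in for a window-smoothed score

r_raw = lag1_autocorrelation(raw)
r_smooth = lag1_autocorrelation(smooth)
print(f"raw: {r_raw:.2f}  smoothed: {r_smooth:.2f}")
```

The smoothed series is strongly auto-correlated even though the underlying values are independent, which is why a windowed score tends to track position along the gene.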
Chicago QTL browser data sets.
| Dataset | Notes | Number of eQTLs/total | Num. Genes (Regions) | Total GTEx Markers |
|---|---|---|---|---|
| Mangravite 2012 | LCLs; candidate genes | 42,550/49,937 | 2,064 (2,030) | 385,836 |
| Montgomery 2010A | LCLs; RNA-seq; exons | 12,691/13,965 | 2,755 (2,155) | 1,389,595 |
| Montgomery 2010B | LCLs; RNA-seq; transcripts | 3,881/4,748 | 934 (706) | 493,074 |
| Schadt 2007 | Liver | 2,558/2,694 | 1,523 (857) | 862,416 |
| Stranger 2007 | LCLs | 12,751/15,067 | 1,309 (961) | 289,863 |
| Veyrieras 2008 | LCLs | 13,730/15,939 | 1,437 (1,016) | 296,696 |
| Zeller 2010 | Monocytes | 36,371/39,923 | 5,532 (3,931) | 1,398,426 |
For each Dataset with primary reference: Notes on the sample in which eQTLs were measured; the Number of eQTLs that pass the data set's significance threshold and are also present in the GTEx candidate panel (out of the total number of eQTLs passing the threshold in the source data set); the Number of Genes (and intergenic Regions, each of which was analysed as if it were a gene) that contained at least one eQTL present in both data sets; and the Total GTEx Markers present in those genes, whether or not they are eQTLs in the corresponding data set.
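To make the "Number of eQTLs/total" column concrete, the sketch below computes the share of each data set's significant eQTLs that also appear in the GTEx candidate panel. The counts are copied from the table; the variable names are illustrative.

```python
# (eQTLs also present in the GTEx panel, total eQTLs passing the threshold)
datasets = {
    "Mangravite 2012": (42550, 49937),
    "Montgomery 2010A": (12691, 13965),
    "Montgomery 2010B": (3881, 4748),
    "Schadt 2007": (2558, 2694),
    "Stranger 2007": (12751, 15067),
    "Veyrieras 2008": (13730, 15939),
    "Zeller 2010": (36371, 39923),
}

# Fraction of each data set's eQTLs that overlap the GTEx candidate panel.
overlap = {name: in_panel / total for name, (in_panel, total) in datasets.items()}

for name, frac in overlap.items():
    print(f"{name}: {frac:.1%}")
```

On these counts the overlap ranges from roughly 82% (Montgomery 2010B) to roughly 95% (Schadt 2007).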
Conditional logistic models.
| Model | Formula |
|---|---|
| 1 | eQTL ~ H\|H + \|ΔDAF\| + ΔDAF + ΔiHH + Fst + iHS + MAF + location |
| 2 | eQTL ~ H\|H + \|ΔDAF\| + ΔDAF + ΔiHH + iHS + MAF + location |
| 3 | eQTL ~ H\|H + \|ΔDAF\| + ΔDAF + ΔiHH + Fst + iHS + MAF × GO + location |
| 4 | eQTL ~ H\|H + \|ΔDAF\| + ΔDAF + Fst + MAF × GO + location |
| 5 | eQTL ~ H\|H + \|ΔDAF\| + ΔDAF + MAF × GO + location |
| 6 | |
Each of the formulas above is fit to an intercept (α) and to one or more effect sizes (β, one associated with each term).
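As an illustration of how a gene acts as a stratum, the sketch below evaluates the textbook conditional log-likelihood for a stratum with exactly one case (one eQTL among the gene's markers), the 1-to-many matched form. It is a sketch under that assumption, not the authors' fitting code; the function name, the two-predictor rows, and the β values are hypothetical.

```python
import math

def stratum_log_likelihood(case_row, all_rows, beta):
    """Log conditional likelihood of one stratum containing a single case.

    `all_rows` holds the predictor values of every marker in the stratum,
    including the case. Any constant shared by the whole stratum cancels
    between numerator and denominator.
    """
    def score(row):
        return sum(b * x for b, x in zip(beta, row))

    scores = [score(r) for r in all_rows]
    m = max(scores)  # subtract the maximum for numerical stability
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return score(case_row) - log_denom

# Hypothetical stratum: four markers, two predictors each; the first is the eQTL.
markers = [(1.2, 0.1), (0.3, -0.4), (-0.5, 0.0), (0.0, 0.8)]
eqtl = markers[0]

ll_null = stratum_log_likelihood(eqtl, markers, beta=(0.0, 0.0))  # equals -log(4)
ll_fit = stratum_log_likelihood(eqtl, markers, beta=(1.0, 0.0))
print(ll_null, ll_fit)
```

With all effect sizes at zero, every marker in the stratum is equally likely to be the eQTL; a positive weight on the first predictor raises the likelihood here because the eQTL has the largest value of that predictor. Strata with several eQTLs require summing over all subsets of the same size, which this sketch does not cover.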
Summary of t-statistic Z-scores from conditional logistic model fits
| Dataset | Z-score H\|H | Z-score \|ΔDAF\| | Z-score Fst | Z-score ΔiHH | Z-score iHS |
|---|---|---|---|---|---|
| Mangravite 2012 | 22 | 6/NA | NA | 0 | 8 |
| Montgomery 2010A | 13 | 3/5 | -4 | 1 | -2 |
| Montgomery 2010B | 8 | 2/4 | -3 | 3 | -2 |
| Schadt 2007 | 2 | 1/1 | -1 | 1 | 0 |
| Stranger 2007 | 8 | -2/4 | -8 | -3 | -2 |
| Veyrieras 2008 | 8 | 0/0 | -13 | -3 | -1 |
| Zeller 2010 | 19 | 2/7 | -7 | 2 | 0 |
For each Dataset, the average value of the Z-score across all Models (1-5 from Table 2), rounded to the nearest whole number. Two averages are given for |ΔDAF|: Models 2 and 5 (which omit Fst) before the slash; Models 1, 3 and 4 (which include Fst) after the slash. For H|H, a positive number indicates that a positive value is a strong predictor for eQTLs; for the other measures, an extreme negative value indicates that a low log p-value is a strong predictor for eQTLs. ΔiHH and iHS are intended to be measures of positive selection. With the exception of |ΔDAF| (which changes substantially depending on whether Fst is included in the model), these Z-scores do not change greatly among Models 1-6 in Table 2, indicating that these predictors are largely independent.
GO IDs where eQTL effect size for Adjusted Haplotype Conditional Entropy outside of the 90% confidence interval from permutation.
| Gene Ontology | Mangravite 2012: βH\|H ± σH\|H | Montgomery '10 Exon: βH\|H ± σH\|H | Montgomery '10 Trspt: βH\|H ± σH\|H | Schadt 2007: βH\|H ± σH\|H | Stranger 2007: βH\|H ± σH\|H | Veyrieras 2008: βH\|H ± σH\|H | Zeller 2010: βH\|H ± σH\|H |
|---|---|---|---|---|---|---|---|
| Negative Effect Sizes | |||||||
| GO: | -0.402 ± 0.592 | 0.137 ± 0.255 | -0.001 ± 0.172 | -0.048 ± 0.060 | |||
| GO: | -0.0001 ± 0.134 | -0.049 ± 0.074 | |||||
| GO: | -0.013 ± 0.125 | -0.043 ± 0.126 | 0.032 ± 0.043 | ||||
| GO: | -0.006 ± 0.249 | -0.055 ± 0.180 | -0.107 ± 0.123 | -0.068 ± 0.132 | 0.019 ± 0.056 | ||
| Large Positive Effect Sizes | |||||||
| GO: | 0.006 ± 0.127 | 0.196 ± 0.330 | |||||
| GO: | -0.074 ± 0.124 | 0.455 ± 0.676 | 0.066 ± 0.100 | 0.018 ± 0.037 | |||
| GO: | -0.052 ± 0.116 | 0.084 ± 0.109 | 0.010 ± 0.030 | ||||
| GO: | 0.090 ± 0.259 | -0.143 ± 0.231 | -0.067 ± 0.108 | ||||
| GO: | 0.010 ± 0.033 | 0.040 ± 0.061 | 0.084 ± 0.100 | ||||
| GO: | 0.040 ± 0.071 | -0.031 ± 0.0112 | 0.038 ± 0.040 | ||||
| GO: | 0.027 ± 0.086 | 0.048 ± 0.094 | 0.031 ± 0.128 | 0.026 ± 0.026 | |||
| GO: | 0.030 ± 0.055 | -0.006 ± 0.173 | -0.136 ± 0.169 | -0.033 ± 0.039 | |||
Each row corresponds to a GO ID, and each cell gives the logistic-scale effect size (β) and standard error (σ) of the Adjusted Haplotype Conditional Entropy (H|H) for eQTLs in genes within that GO category, from a conditional logistic regression under Model 6. A bold cell indicates a term that is contributing to membership in the corresponding group; a red italicized cell indicates a term that is pointing in the opposite direction.
Whole-blood cis-eQTL conditional logistic regression results
| Dataset | Notes | Number of eQTLs/total | Num. Genes (Regions) | Total GTEx Markers | βH\|H ± σH\|H |
|---|---|---|---|---|---|
| Westra 2012 - Group 1 | GO:0006367, GO:0006396, GO:0008544, GO:0042742 | 11,527 | 324 (441) | 108,372 | -0.003 ± 0.003 |
| Westra 2012 - Group 2 | GO:0001501, GO:0006869, GO:0006936, GO:0007186, GO:0009653, GO:0016567, GO:0018108, GO:0051056 | 63,269 | 1,265 (1,810) | 598,118 | 0.005 ± 0.001 |
| Westra 2012 - Both | In both Group 1 and in Group 2 | 1,673 | 17 (57) | 18,447 | 0.006 ± 0.007 |
| Westra 2012 - TOTAL | RNA extracted from whole blood. | 495,268/923,021 | 15,742 (14,417) | 4,606,410 | |
As Table 1 with two additional columns reporting the effect size (β) and standard error (σ) of Adjusted Conditional Haplotype Entropy (H|H) in the hypothesis-testing phase fit to the statistical model in Eq. 1 using the Westra[11] data set.
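One common way to read a β ± σ cell is as a Wald-type ratio z = β/σ. The sketch below applies that reading to the three groups in the table; treating the ratio as approximately standard normal is an assumption made here for illustration, not a statistic reported in the table.

```python
# (effect size β, standard error σ) of H|H, copied from the table
rows = {
    "Group 1": (-0.003, 0.003),
    "Group 2": (0.005, 0.001),
    "Both": (0.006, 0.007),
}

# Wald-type ratio: how many standard errors the estimate lies from zero.
z = {name: beta / sigma for name, (beta, sigma) in rows.items()}

for name, value in z.items():
    print(f"{name}: z = {value:.2f}")
```

Under this reading, only Group 2 sits several standard errors from zero (about 5), while Group 1 (about -1) and the overlap of both groups (about 0.86) are within roughly one standard error of zero.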
Empirical adjustments in conditional entropy.
| Population | α | βm | βM |
|---|---|---|---|
| ASW | 4.9 | -0.74 | -0.22 |
| CEU | 5.8 | -0.54 | -0.52 |
| CHB | 6.8 | -0.50 | -0.75 |
| CHS | 7.0 | -0.51 | -0.76 |
| CLM | 5.2 | -0.54 | -0.48 |
| FIN | 5.6 | -0.53 | -0.47 |
| GBR | 6.0 | -0.56 | -0.55 |
| IBS | 5.2 | -0.81 | -0.79 |
| JPT | 7.2 | -0.56 | -0.79 |
| LWK | 4.0 | -0.64 | -0.03 |
| MXL | 5.7 | -0.49 | -0.60 |
| PUR | 5.5 | -0.58 | -0.52 |
| TSI | 6.0 | -0.53 | -0.55 |
| YRI | 4.3 | -0.65 | -0.11 |
Each row gives the parameters of a regression fit across all GTEx candidate eQTLs for the Adjusted Haplotype Conditional Entropy (H|H) in the corresponding population: the intercept (α), the contribution from the number of chromosomes carrying the minor variant (βm), and the contribution from the number of chromosomes carrying the major variant (βM).