Ping Zeng, Xiang Zhou, Shuiping Huang.
Abstract
BACKGROUND: Gene expression in human tissues has been shown to be heritable, which makes it possible to predict gene expression using only SNPs. Predicting gene expression can shed light on the genetic architecture of individual functionally associated SNPs and help interpret the molecular basis underlying human diseases.
Keywords: Bayesian sparse linear mixed model; Cis-SNPs; Elastic net; Gene expression; Lasso; Linear mixed model; Prediction model
Year: 2017 PMID: 28490319 PMCID: PMC5425981 DOI: 10.1186/s12864-017-3759-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Comparison of the four methods (i.e. Lasso, ENET, LMM and BSLMM) for predicting gene expression in scenarios I-III. a The results of scenario I, where the BSLMM modeling assumption is satisfied and 15 causal SNPs are included in the sparse part. b The results of scenario II, where the LMM modeling assumption is satisfied. c The results of scenario III, where the sparse modeling assumption is satisfied: there are only 15 causal SNPs and the rest are all neutral. The performance is measured by R². In each panel, from left to right, the results correspond to PVE = 0.1, 0.3 and 0.5, respectively
Computational time (in seconds) of the four models for predicting gene expression measurements
| #SNP | PVE | Lasso | ENET | LMM | BSLMM |
|---|---|---|---|---|---|
| 510 | 0.118 | 4.117 (0.203) | 2.937 (0.073) | 0.159 (0.148) | 1.780 (1.789) |
| 1375 | 0.002 | 6.594 (0.273) | 5.345 (0.110) | 0.560 (0.021) | 3.895 (0.917) |
| 2011 | 0.000 | 5.805 (0.172) | 5.134 (0.100) | 0.727 (0.076) | 1.502 (0.841) |
| 3045 | 0.357 | 8.623 (0.177) | 7.992 (0.234) | 1.097 (0.011) | 8.286 (8.159) |
| 4120 | 0.046 | 8.649 (0.282) | 8.385 (0.227) | 1.412 (0.073) | 16.129 (8.792) |
| 4953 | 0.523 | 10.019 (0.248) | 9.772 (0.285) | 1.621 (0.182) | 7.626 (3.406) |
| 5818 | 0.124 | 13.492 (0.199) | 13.077 (0.237) | 1.957 (0.057) | 2.269 (0.854) |
#SNP denotes the number of cis-SNPs included in the gene; PVE is the proportion of variance of gene expression explained by cis-SNPs; the tuning parameters of Lasso and ENET are selected using 100-fold cross validation; BSLMM uses 10,000 Monte Carlo samples after 2,000 burn-in samples. The times are averaged across 20 replicates, and values in parentheses are the standard deviations
Fig. 2 Comparison of the prediction performance of the four methods (i.e. Lasso, ENET, LMM and BSLMM) for the Geuvadis data. Each panel lists the number of genes where BSLMM performs better and the number of genes where BSLMM performs worse; in the top row (a)-(c), these numbers are computed across all the genes, and in the bottom row (d)-(f), they are computed across only the genes with R² ≥ 0.05
Number of predictive genes passing the given R² threshold in the Geuvadis data and GenoExp data
| R² threshold | Geuvadis: Lasso | Geuvadis: ENET | Geuvadis: LMM | Geuvadis: BSLMM | GenoExp: Lasso | GenoExp: ENET | GenoExp: LMM | GenoExp: BSLMM |
|---|---|---|---|---|---|---|---|---|
| 0.05 | 2252 | 2262 | 2447 | 2567 | 1785 | 1414 | 1560 | 1758 |
| 0.10 | 1144 | 1145 | 1145 | 1266 | 831 | 788 | 734 | 826 |
| 0.20 | 420 | 422 | 383 | 466 | 315 | 309 | 276 | 323 |
| 0.30 | 161 | 162 | 152 | 178 | 156 | 148 | 124 | 160 |
| 0.40 | 75 | 75 | 65 | 76 | 70 | 70 | 56 | 70 |
| 0.50 | 33 | 33 | 25 | 32 | 36 | 32 | 27 | 37 |
| 0.60 | 14 | 14 | 12 | 14 | 25 | 21 | 20 | 24 |
There are 15,810 and 15,427 genes in the Geuvadis data and GenoExp data, respectively. In both data sets, when the R² threshold is large (e.g. ≥ 0.30), fewer predictive genes pass the threshold under LMM than under Lasso, ENET or BSLMM, implying that these highly predictive genes may have a sparse genetic architecture
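The counts in the table above are tallies of genes whose prediction R² meets each threshold. A minimal sketch of that tally follows; the function name `count_passing` and the per-gene R² values are hypothetical and for illustration only, not taken from the paper.

```python
def count_passing(r2_values, thresholds):
    """Return {threshold: number of genes with R^2 >= threshold}."""
    return {t: sum(1 for r in r2_values if r >= t) for t in thresholds}


# Hypothetical per-gene R^2 values for one method (illustration only).
example_r2 = [0.02, 0.07, 0.12, 0.31, 0.64, 0.05]

counts = count_passing(example_r2, [0.05, 0.10, 0.20, 0.30])
for threshold, n_genes in counts.items():
    print(f"R^2 >= {threshold:.2f}: {n_genes} genes")
```

Running the same tally once per method and per data set would give one column of the table each.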
Fig. 3 Distribution of R² of BSLMM for the Geuvadis data. a A Manhattan-type plot of R² against gene position across chromosomes, in which the y-axis is R² for each gene, the x-axis is the gene position and the colors represent different chromosomes. b A barplot of the proportion of predictive genes (R² ≥ 0.05) for each chromosome. c A scatter plot of the proportion of predictive genes against the proportion of genes in each chromosome. d The R² pattern for the MHC region (chr6: 26-34 Mb); there are a total of 179 genes with R² ≥ 0.05 on chromosome 6, among which 45 are located in the MHC region (in red). The total length of chromosome 6 is about 171 Mb, and the length of the MHC region is 8 Mb. The enrichment-fold is therefore 5.37, computed as the ratio of the proportion of predictive genes (i.e. 0.25 = 45/179) to the proportion of the length taken up by the MHC (i.e. 0.05 = 8/171), and is significantly higher (P = 1.79 × 10⁻³) than the typical enrichment-fold (the median is 1.70) of other regions on chromosome 6
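The enrichment-fold arithmetic in the caption can be checked with a short sketch. The function name `enrichment_fold` is an assumption made for illustration; the inputs (45 of 179 predictive genes, 8 Mb of 171 Mb) are the figures quoted in the caption.

```python
def enrichment_fold(n_region, n_total, len_region, len_total):
    """Ratio of the share of predictive genes in a region to the region's share of length."""
    gene_fraction = n_region / n_total        # e.g. 45 / 179
    length_fraction = len_region / len_total  # e.g. 8 Mb / 171 Mb
    return gene_fraction / length_fraction


# MHC region on chromosome 6, using the numbers quoted in the caption.
mhc_fold = enrichment_fold(n_region=45, n_total=179, len_region=8, len_total=171)
print(round(mhc_fold, 2))  # 5.37
```

Note that the rounded fractions shown in the caption (0.25 and 0.05) would give 5.0; the reported 5.37 comes from the unrounded ratio.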
Fig. 4 Enrichment-fold in 1,324 approximately independent LD blocks. a The enrichment-fold distributed across the chromosomes; the reference lines are at 4, 10 and 20, respectively. b The histogram of enrichment-fold in the 1,324 independent LD blocks; the median is 1.49 (indicated by the red reference line) and the maximum is 299.82. The enrichment-fold is computed as the ratio of the proportion of predictive genes (i.e. R² ≥ 0.05) to the proportion of the length taken up by that LD block
Enrichment-fold (≥20) of independent LD blocks in the Geuvadis data
| Enrichment fold | #Identified SNPs | Chromosome | LD block lower bound | LD block upper bound |
|---|---|---|---|---|
| 26.58 | 16 | 2 | 84,687,169 | 84,743,579 |
| 53.20 | 11 | 2 | 152,118,393 | 152,146,571 |
| 22.76 | 17 | 3 | 19,988,517 | 20,053,822 |
| 184.35 | 8 | 3 | 75,713,481 | 75,721,542 |
| 27.75 | 17 | 3 | 161,090,668 | 161,144,215 |
| 36.70 | 15 | 4 | 44,680,444 | 44,728,612 |
| 81.95 | 8 | 4 | 47,465,736 | 47,487,305 |
| 34.18 | 13 | 5 | 107,006,596 | 107,052,542 |
| 32.41 | 25 | 7 | 120,965,421 | 121,036,418 |
| 299.82 | 29 | 10 | 18,940,551 | 18,948,334 |
| 45.61 | 12 | 10 | 131,909,081 | 131,934,663 |
| 28.64 | 8 | 12 | 127,210,816 | 127,256,957 |
| 44.88 | 23 | 12 | 129,308,528 | 129,337,972 |
| 22.28 | 17 | 13 | 101,241,782 | 101,327,347 |
| 29.80 | 13 | 16 | 5,084,142 | 5,147,789 |
| 23.06 | 22 | 18 | 23,671,164 | 23,806,409 |
| 151.19 | 16 | 18 | 61,616,535 | 61,637,159 |
We obtained a total of 18,896 complete records (mainly including the disease/trait, chromosome id and position) of SNPs identified by GWASs from https://www.genome.gov/gwastudies/. We counted the number of related SNPs (given in the second column) within 1 Mb upstream and downstream of each LD block. These identified SNPs are related to about 130 different types of complex diseases and traits. For example, in the first LD block (Chr2: 84,687,169-84,743,579), previous GWASs have discovered 16 associated SNPs, which, according to the catalog of published GWASs, are related to aging traits, protein quantitative trait loci, pulmonary function decline, IgG glycosylation, RR interval heart rate, the response to antipsychotic therapy, coronary artery calcification, prostate cancer, response to cytadine analogues cytosine arabinoside, bilirubin levels, orthostatic hypotension, breast cancer and conduct disorder
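The SNP counting described in the note above can be sketched as a window query. This is only an illustration under stated assumptions: the function name `count_nearby_snps`, the record layout `(chromosome, position)`, and the choice to count SNPs inside the block as well as in the 1 Mb flanks are not from the paper, and the toy records are not real GWAS hits.

```python
FLANK_BP = 1_000_000  # 1 Mb upstream and downstream, as stated in the note


def count_nearby_snps(snps, chrom, lower, upper, flank=FLANK_BP):
    """Count SNPs on `chrom` located in [lower - flank, upper + flank]."""
    start = max(0, lower - flank)
    end = upper + flank
    return sum(1 for c, pos in snps if c == chrom and start <= pos <= end)


# Hypothetical catalog records: (chromosome, position in bp).
toy_catalog = [
    ("2", 84_000_000),  # inside the upstream flank
    ("2", 85_700_000),  # inside the downstream flank
    ("2", 86_000_000),  # beyond the downstream flank
    ("3", 84_700_000),  # right position, wrong chromosome
]

# First LD block in the table: Chr2: 84,687,169-84,743,579.
n_nearby = count_nearby_snps(toy_catalog, "2", 84_687_169, 84_743_579)
print(n_nearby)  # 2
```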