Anna V Mikhaylova, Timothy A Thornton.
Abstract
Using genetic data to predict gene expression has garnered significant attention in recent years. PrediXcan has become one of the most widely used gene-based methods for testing associations between predicted gene expression values and a phenotype, which has facilitated novel insights into the relationship between complex traits and the component of gene expression that can be attributed to genetic variation. The gene expression prediction models for PrediXcan were developed using supervised machine learning methods and training data from the Depression Genes and Networks (DGN) study and the Genotype-Tissue Expression (GTEx) project, where the majority of subjects are of European descent. Many genetic studies, however, include samples from multi-ethnic populations, and in this paper we evaluate the accuracy of PrediXcan for predicting gene expression in diverse populations. Using transcriptomic data from the GEUVADIS (Genetic European Variation in Disease) RNA sequencing project and whole genome sequencing data from the 1000 Genomes project, we evaluate and compare the predictive performance of PrediXcan in an African population (Yoruban) and four European ancestry populations for thousands of genes. We evaluate a range of models from the PrediXcan weight databases and use Pearson's correlation coefficient to assess gene expression prediction accuracy with PrediXcan. We find that the predictive performance of PrediXcan varies substantially among populations from different continents (F-test p-value < 2.2 × 10⁻¹⁶), with prediction accuracy lower in the Yoruban population from West Africa than in the European-ancestry populations. Beyond these differences between continents, we also find highly significant differences in prediction accuracy among the four European ancestry populations considered (F-test p-value < 2.2 × 10⁻¹⁶). Finally, while prediction accuracy varies across PrediXcan weight databases, the qualitative performance of PrediXcan is consistent for the five populations considered, with the African ancestry population having the lowest accuracy across databases.
Keywords: complex traits; expression quantitative trait loci (eQTL); genetic diversity; genetic mapping; transcriptome
Year: 2019 PMID: 31001318 PMCID: PMC6456650 DOI: 10.3389/fgene.2019.00261
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Summary of PrediXcan databases used in analyses.
| Weight database | Sample size | Number of genes | Number of SNPs |
| DGN whole blood | 922 | 13,171 | 249,696 |
| GTEx v6 1KG whole blood | 338 | 6,759 | 185,786 |
| GTEx v6 1KG LCL | 114 | 3,759 | 125,045 |
| GTEx v6 HapMap whole blood | 338 | 6,588 | 136,941 |
| GTEx v6 HapMap LCL | 114 | 3,441 | 91,237 |
| GTEx v7 HapMap whole blood | 315 | 6,297 | 140,931 |
| GTEx v7 HapMap LCL | 96 | 3,045 | 88,143 |
Number of genes for which Pearson correlation coefficients are available by population and by PrediXcan weight database.
| | DGN | GTEx v7 WB | GTEx v7 LCL |
| Genes with observed and predicted expression values | 10,387 | 5,432 | 2,777 |
| By population: | | | |
| CEU | 10,385 | 5,432 | 2,777 |
| FIN | 10,385 | 5,432 | 2,777 |
| GBR | 10,385 | 5,432 | 2,777 |
| TSI | 10,385 | 5,432 | 2,776 |
| YRI | 10,354 | 5,419 | 2,767 |
| Genes before filtering | 10,354 | 5,419 | 2,767 |
| Genes after filtering | 3,493 | 2,288 | 1,699 |
Figure 1. Violin plots of gene expression correlation coefficients by five populations using DGN, GTEx v7 WB, and GTEx v7 LCL weight databases; (A) before and (B) after filtering out poorly predicted genes.
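The correlation coefficients summarized in the figure are Pearson correlations between observed and predicted expression, computed per gene within each population. A minimal sketch of that calculation follows, using only the Python standard library. The function name `pearson_r`, the gene label `GENE_A`, and all expression values are hypothetical and chosen for illustration; they are not taken from the study.

```python
import math

def pearson_r(observed, predicted):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_o * var_p)

# Hypothetical toy data: (population, gene) -> (observed, predicted) expression.
toy = {
    ("CEU", "GENE_A"): ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]),
    ("YRI", "GENE_A"): ([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]),
}

# One correlation coefficient per (population, gene) pair.
correlations = {key: pearson_r(obs, pred) for key, (obs, pred) in toy.items()}
for key, r in sorted(correlations.items()):
    print(key, round(r, 3))
```

In a real analysis each (population, gene) pair would hold one observed and one predicted value per individual, and the resulting coefficients would feed the per-population summaries.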
Gene counts per population, per database, per correlation category for the five populations using DGN, GTEx WB, and GTEx LCL weight databases.
| | Before filtering | | | | | After filtering | | | | |
| Correlation coefficient (r) | CEU | FIN | GBR | TSI | YRI | CEU | FIN | GBR | TSI | YRI |
| DGN | | | | | | | | | | |
| r ≤ 0 | 3,583 | 3,491 | 3,480 | 3,587 | 4,156 | 561 | 547 | 554 | 585 | 911 |
| 0 < r ≤ 0.2 | 5,107 | 4,976 | 4,812 | 4,954 | 5,001 | 1,533 | 1,379 | 1,258 | 1,409 | 1,674 |
| 0.2 < r ≤ 0.4 | 1,359 | 1,480 | 1,589 | 1,434 | 1,016 | 1,097 | 1,162 | 1,209 | 1,121 | 728 |
| 0.4 < r ≤ 0.6 | 239 | 302 | 354 | 290 | 147 | 236 | 300 | 353 | 289 | 146 |
| 0.6 < r ≤ 0.8 | 56 | 93 | 105 | 75 | 31 | 56 | 93 | 105 | 75 | 31 |
| 0.8 < r | 10 | 12 | 14 | 14 | 3 | 10 | 12 | 14 | 14 | 3 |
| GTEx WB | | | | | | | | | | |
| r ≤ 0 | 1,756 | 1,621 | 1,622 | 1,684 | 2,101 | 336 | 309 | 314 | 335 | 590 |
| 0 < r ≤ 0.2 | 2,471 | 2,450 | 2,366 | 2,456 | 2,491 | 877 | 786 | 732 | 820 | 993 |
| 0.2 < r ≤ 0.4 | 902 | 958 | 981 | 901 | 668 | 788 | 804 | 793 | 758 | 546 |
| 0.4 < r ≤ 0.6 | 210 | 282 | 329 | 278 | 117 | 207 | 281 | 328 | 275 | 117 |
| 0.6 < r ≤ 0.8 | 69 | 93 | 100 | 85 | 38 | 69 | 93 | 100 | 85 | 38 |
| 0.8 < r | 11 | 15 | 21 | 15 | 4 | 11 | 15 | 21 | 15 | 4 |
| GTEx LCL | | | | | | | | | | |
| r ≤ 0 | 546 | 488 | 484 | 509 | 774 | 80 | 69 | 55 | 69 | 274 |
| 0 < r ≤ 0.2 | 1,119 | 1,031 | 996 | 1,050 | 1,296 | 560 | 443 | 426 | 477 | 777 |
| 0.2 < r ≤ 0.4 | 718 | 742 | 761 | 736 | 510 | 675 | 681 | 692 | 681 | 461 |
| 0.4 < r ≤ 0.6 | 293 | 361 | 369 | 360 | 145 | 293 | 361 | 369 | 360 | 145 |
| 0.6 < r ≤ 0.8 | 80 | 126 | 137 | 96 | 38 | 80 | 126 | 137 | 96 | 38 |
| 0.8 < r | 11 | 19 | 20 | 16 | 4 | 11 | 19 | 20 | 16 | 4 |
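The table above groups per-gene correlation coefficients into six categories with breakpoints at 0, 0.2, 0.4, 0.6, and 0.8. The sketch below shows one way such counts could be produced, and checks that the first column of the first block sums to the reported gene totals. It assumes each category is open on the left and closed on the right (for example, 0 < r ≤ 0.2), which is implied by the row labels but not stated outright. The names `EDGES`, `LABELS`, and `count_by_category` and the toy coefficients are hypothetical.

```python
from bisect import bisect_left

# Upper edges of the first five categories; the sixth category is r > 0.8.
# Assumption: categories are open on the left and closed on the right.
EDGES = [0.0, 0.2, 0.4, 0.6, 0.8]
LABELS = ["r <= 0", "0 < r <= 0.2", "0.2 < r <= 0.4",
          "0.4 < r <= 0.6", "0.6 < r <= 0.8", "0.8 < r"]

def count_by_category(correlations):
    """Count correlation coefficients falling into each category."""
    counts = [0] * len(LABELS)
    for r in correlations:
        # bisect_left places a value equal to an edge in the lower category.
        counts[bisect_left(EDGES, r)] += 1
    return dict(zip(LABELS, counts))

# Hypothetical coefficients, for illustration only.
toy_r = [-0.1, 0.0, 0.15, 0.2, 0.35, 0.55, 0.61, 0.95]
toy_counts = count_by_category(toy_r)
print(toy_counts)

# Consistency check on the first column of the first block of the table:
# the six category counts should add up to the number of genes analysed.
first_column_before = [3583, 5107, 1359, 239, 56, 10]
first_column_after = [561, 1533, 1097, 236, 56, 10]
total_before = sum(first_column_before)
total_after = sum(first_column_after)
print(total_before, total_after)  # 10354 3493
```

The two totals agree with the first column of the "Genes before filtering" and "Genes after filtering" rows reported earlier (10,354 and 3,493).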
Results from linear mixed models for population category (with CEU as a reference) and change in gene correlation coefficient among filtered genes.
| | DGN | | | GTEx WB | | | GTEx LCL | | |
| Population | Estimate | 95% CI | p-value | Estimate | 95% CI | p-value | Estimate | 95% CI | p-value |
| FIN | 0.019 | (0.014, 0.025) | 1.3 × 10⁻¹¹ | 0.021 | (0.015, 0.028) | 1.3 × 10⁻⁹ | 0.038 | (0.030, 0.046) | < 10⁻¹⁶ |
| GBR | 0.029 | (0.023, 0.034) | < 10⁻¹⁶ | 0.032 | (0.025, 0.039) | < 10⁻¹⁶ | 0.051 | (0.043, 0.059) | < 10⁻¹⁶ |
| TSI | 0.010 | (0.004, 0.016) | 3.9 × 10⁻⁴ | 0.013 | (0.007, 0.020) | 4.6 × 10⁻⁵ | 0.027 | (0.019, 0.035) | 2.9 × 10⁻¹¹ |
| YRI | −0.054 | (−0.059, −0.048) | < 10⁻¹⁶ | −0.070 | (−0.077, −0.063) | < 10⁻¹⁶ | −0.097 | (−0.105, −0.089) | < 10⁻¹⁶ |
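Each set of three values in the table above is an effect estimate relative to CEU, an interval, and a p-value. As a rough reading aid, the sketch below recovers an approximate standard error from the interval and recomputes a two-sided p-value for the first FIN entry (0.019, interval (0.014, 0.025)). It assumes a 95% interval and a Wald-type normal approximation; the mixed-model software used in the study may rely on a different reference distribution, and the published values are rounded, so only approximate agreement should be expected. The function name `wald_check` is hypothetical.

```python
import math

def wald_check(estimate, ci_low, ci_high, z_crit=1.959964):
    """Approximate SE, z statistic, and two-sided p-value from a 95% CI.

    Assumes a symmetric normal-approximation (Wald) interval.
    """
    se = (ci_high - ci_low) / (2 * z_crit)
    z = estimate / se
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p

# First FIN entry in the table above.
se, z, p = wald_check(0.019, 0.014, 0.025)
print(f"SE ~ {se:.4f}, z ~ {z:.2f}, p ~ {p:.1e}")
```

Under these assumptions the recomputed p-value comes out on the order of 10⁻¹¹, in line with the reported 1.3 × 10⁻¹¹.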
Results from linear mixed models for population category (excluding CEU, with FIN as a reference) and change in gene correlation coefficient among filtered genes.
| | DGN | | | GTEx WB | | | GTEx LCL | | |
| Population | Estimate | 95% CI | p-value | Estimate | 95% CI | p-value | Estimate | 95% CI | p-value |
| GBR | 0.010 | (0.004, 0.015) | 9.2 × 10⁻⁴ | 0.011 | (0.004, 0.018) | 3.1 × 10⁻³ | 0.013 | (0.005, 0.021) | 2.0 × 10⁻³ |
| TSI | −0.009 | (−0.015, −0.003) | 1.8 × 10⁻³ | −0.008 | (−0.015, −0.001) | 2.8 × 10⁻² | −0.011 | (−0.019, −0.003) | 8.9 × 10⁻³ |
| YRI | −0.073 | (−0.079, −0.067) | < 10⁻¹⁶ | −0.091 | (−0.098, −0.084) | < 10⁻¹⁶ | −0.134 | (−0.143, −0.126) | < 10⁻¹⁶ |
Figure 2. Scatter plots comparing gene correlation coefficients by population using GTEx v7 LCL vs. GTEx v7 WB databases.