Edward Mountjoy, Ellen M Schmidt, Miguel Carmona, Jeremy Schwartzentruber, Gareth Peat, Alfredo Miranda, Luca Fumis, James Hayhurst, Annalisa Buniello, Mohd Anisul Karim, Daniel Wright, Andrew Hercules, Eliseo Papa, Eric B Fauman, Jeffrey C Barrett, John A Todd, David Ochoa, Ian Dunham, Maya Ghoussaini.
Abstract
Genome-wide association studies (GWASs) have identified many variants associated with complex traits, but identifying the causal gene(s) is a major challenge. Here we present an open resource that provides systematic fine-mapping and gene prioritization across 133,441 published human GWAS loci. We integrate genetics (GWAS Catalog and UK Biobank) with transcriptomic, proteomic and epigenomic data, including systematic disease-disease and disease-molecular trait colocalization results across 92 cell types and tissues. We identify 729 loci fine-mapped to a single coding causal variant and colocalized with a single gene. We trained a machine-learning model using the fine-mapped genetics and functional genomics data and 445 gold-standard curated GWAS loci to distinguish causal genes from neighboring genes, outperforming a naive distance-based model. Our prioritized genes were enriched for known approved drug targets (odds ratio = 8.1, 95% confidence interval = 5.7, 11.5). These results are publicly available through a web portal (http://genetics.opentargets.org), enabling users to easily prioritize genes at disease-associated loci and assess their potential as drug targets.
Year: 2021 PMID: 34711957 PMCID: PMC7611956 DOI: 10.1038/s41588-021-00945-5
Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330
Figure 1. Open Targets Genetics pipeline schematic.
a, Data sources include all available GWAS, as well as variant effect predictions and functional genomic data. b, A number of pipelines are run to perform statistical fine-mapping of GWAS, colocalization of GWAS signals with gene expression quantitative trait locus (QTL) studies and between distinct GWAS traits, and integrative “locus-to-gene” prioritization from both genetic and functional genomic input features. c, Outputs of the pipelines are available in a web portal, via a programmatic API, and as bulk downloads.
Extended Data Figure 1
Extended Data Figure 2
Extended Data Figure 3
Figure 2. Performance of the locus-to-gene (L2G) model.
Colors show metrics calculated on each individual fold of the 5-fold cross-validation. The overall metric, combining all folds, is shown in dark blue. a, Calibration curve showing (top) the fraction of all GSP genes found as positives at different L2G score thresholds (mean predicted value) and (bottom) the count of genes in each L2G score bin. b,c, The precision-recall curve (b) and the receiver operating characteristic curve (c) for identifying GSP genes from among those within 500 kb at each locus. d, The relative importance of each predictor in the L2G model. Blue vertical bars show the mean importance for each feature in cross-validation, while paler bars show the importance obtained in each fold. The vertical dashed lines show the minimum and maximum mean feature importances. "max" denotes that the maximum score for any variant in the 95% credible set was used for each gene; "average" denotes that a score averaged over the 95% credible set, weighted by posterior probability, was used for each gene; "nbh" (neighbourhood) denotes that scores were calculated for each gene relative to the best-scoring gene at the locus. Insets in a-c indicate the chromosomes on which each fold of the data was evaluated in cross-validation, and the average precision (AP) (b) or AUC (c) for that fold.
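The three feature aggregations named in the caption (max, average, nbh) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names are hypothetical. Two details are assumptions rather than statements from the caption: the weighted average is normalized by the summed posterior probabilities, and "relative to the best-scoring gene" is taken to mean a ratio.

```python
from typing import Dict, Sequence


def max_score(scores: Sequence[float]) -> float:
    """'max': the largest score over variants in the credible set."""
    return max(scores)


def weighted_average_score(scores: Sequence[float],
                           posteriors: Sequence[float]) -> float:
    """'average': score averaged over the credible set, weighted by
    each variant's posterior probability.

    Assumption: weights are normalized by their sum, because a 95%
    credible set need not have posteriors that sum to exactly 1.
    """
    if len(scores) != len(posteriors):
        raise ValueError("scores and posteriors must have equal length")
    total = sum(posteriors)
    if total <= 0:
        raise ValueError("posterior probabilities must sum to > 0")
    return sum(s * p for s, p in zip(scores, posteriors)) / total


def neighbourhood_scores(gene_scores: Dict[str, float]) -> Dict[str, float]:
    """'nbh': each gene's score relative to the best-scoring gene at
    the locus.

    Assumption: 'relative' is implemented here as a ratio to the best
    score; the caption does not specify ratio versus difference.
    """
    best = max(gene_scores.values())
    if best == 0:
        return {gene: 0.0 for gene in gene_scores}
    return {gene: score / best for gene, score in gene_scores.items()}


# Illustrative (made-up) per-variant scores for one gene at one locus.
variant_scores = [1.0, 0.0]
variant_posteriors = [0.75, 0.25]
print(max_score(variant_scores))
print(weighted_average_score(variant_scores, variant_posteriors))
print(neighbourhood_scores({"GENE_A": 2.0, "GENE_B": 1.0}))
```

The input values are invented for illustration only and do not come from the study.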
Classification performance for feature groups.
Performance characteristics of the full model are shown at the top, and analyses for individual groups of features are shown in sections below. Counts are shown for true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). *Mean distance aggregates across all the variants in the credible set, weighting each by its posterior probability.
| Features | Average precision | AUC | Precision | Recall | TP | FP | TN | FN | Sensitivity | Specificity | FDR | GSP count | GSN count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full model | 0.65 | 0.93 | 0.73 | 0.53 | 236 | 86 | 6,429 | 209 | 0.53 | 0.99 | 0.27 | 445 | 6,515 |
| | | | | | | | | | | | | | |
| Closest footprint | 0.37 | 0.79 | 0.56 | 0.60 | 268 | 207 | 6,308 | 177 | 0.60 | 0.97 | 0.44 | 445 | 6,515 |
| Closest TSS | 0.34 | 0.76 | 0.56 | 0.55 | 246 | 195 | 6,320 | 199 | 0.55 | 0.97 | 0.44 | 445 | 6,515 |
| | | | | | | | | | | | | | |
| Mean distance* | 0.62 | 0.91 | 0.69 | 0.49 | 219 | 98 | 6,417 | 226 | 0.49 | 0.98 | 0.31 | 445 | 6,515 |
| Interaction | 0.26 | 0.79 | 0.55 | 0.05 | 23 | 19 | 6,496 | 422 | 0.05 | 1.00 | 0.45 | 445 | 6,515 |
| Molecular QTL | 0.36 | 0.85 | 0.62 | 0.18 | 79 | 49 | 6,466 | 366 | 0.18 | 0.99 | 0.38 | 445 | 6,515 |
| Pathogenicity prediction | 0.48 | 0.76 | 0.70 | 0.43 | 191 | 80 | 6,435 | 254 | 0.43 | 0.99 | 0.30 | 445 | 6,515 |
| | | | | | | | | | | | | | |
| Mean distance* | 0.47 | 0.77 | 0.69 | 0.43 | 191 | 84 | 6,431 | 254 | 0.43 | 0.99 | 0.31 | 445 | 6,515 |
| Interaction | 0.65 | 0.93 | 0.73 | 0.53 | 234 | 85 | 6,430 | 211 | 0.53 | 0.99 | 0.27 | 445 | 6,515 |
| Molecular QTL | 0.65 | 0.93 | 0.74 | 0.54 | 239 | 86 | 6,429 | 206 | 0.54 | 0.99 | 0.26 | 445 | 6,515 |
| Pathogenicity prediction | 0.63 | 0.92 | 0.71 | 0.50 | 222 | 91 | 6,424 | 223 | 0.50 | 0.99 | 0.29 | 445 | 6,515 |
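The derived columns in the table above (precision, recall/sensitivity, specificity, FDR, GSP count, GSN count) follow directly from the four confusion-matrix counts. The short Python sketch below applies the standard definitions to the full-model row. The class and attribute names are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one row of the table."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        # Recall and sensitivity are the same quantity.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def fdr(self) -> float:
        # False discovery rate = 1 - precision.
        return self.fp / (self.tp + self.fp)

    @property
    def positives(self) -> int:
        # Gold-standard positive (GSP) count.
        return self.tp + self.fn

    @property
    def negatives(self) -> int:
        # Gold-standard negative (GSN) count.
        return self.tn + self.fp


# Counts copied from the "Full model" row of the table.
full_model = Confusion(tp=236, fp=86, tn=6429, fn=209)

for name in ("precision", "recall", "specificity", "fdr"):
    print(name, round(getattr(full_model, name), 2))
print("GSP", full_model.positives, "GSN", full_model.negatives)
```

Rounded to two decimals, these values match the full-model row. Average precision and AUC cannot be recomputed this way, since they depend on the full ranking of L2G scores rather than on counts at a single threshold.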
Extended Data Figure 4
Extended Data Figure 5
Extended Data Figure 6
Extended Data Figure 7
Extended Data Figure 8