
From GWAS variant to function: A study of ∼148,000 variants for blood cell traits.

Quan Sun¹, Cheynna A Crowley¹, Le Huang², Jia Wen³, Jiawen Chen¹, Erik L Bao^4,5, Paul L Auer⁶, Guillaume Lettre^7,8, Alexander P Reiner^9,10, Vijay G Sankaran^4,5, Laura M Raffield³, Yun Li^1,3,11.

Abstract

Genome-wide association studies (GWASs) have identified hundreds of thousands of genetic variants associated with complex diseases and traits. However, most variants are noncoding and not clearly linked to genes, making it challenging to interpret these GWAS signals. We present a systematic variant-to-function study, prioritizing the most likely functional elements of the genome for experimental follow-up, for >148,000 variants identified for hematological traits. Specifically, we developed VAMPIRE: Variant Annotation Method Pointing to Interesting Regulatory Effects, an interactive web application implemented in R Shiny. This tool efficiently integrates and displays information from multiple complementary sources, including epigenomic signatures from blood-cell-relevant tissues or cells, functional and conservation summary scores, variant impact on protein and gene expression, chromatin conformation information, as well as publicly available GWAS and phenome-wide association study (PheWAS) results. Leveraging data generated from independently performed functional validation experiments, we demonstrate that our prioritized variants, genes, or variant-gene links are significantly more likely to be experimentally validated. This study not only has important implications for systematic and efficient revelation of functional mechanisms underlying GWAS variants for hematological traits but also provides a prototype that can be adapted to many other complex traits, paving the path for efficient variant-to-function (V2F) analyses.


Keywords: blood cell traits; experimental validations; functional annotations; genome-wide association studies; variant to function

Year: 2021 PMID: 35047852 PMCID: PMC8756514 DOI: 10.1016/j.xhgg.2021.100063

Source DB: PubMed Journal: HGG Adv ISSN: 2666-2477

Introduction

Genome-wide association studies (GWASs) have identified thousands of genetic loci and hundreds of thousands of genetic variants associated with various complex human diseases and traits, but the underlying genetic mechanism for the vast majority of these GWAS signals remains elusive. With extensive sequencing and GWAS efforts, there is a pressing need to convert the large and ever-growing number of significant GWAS variant-trait pairs into human-interpretable functional or mechanistic knowledge. Most variants identified through GWASs reside in noncoding regions (e.g., >95% for blood cell traits), and most signals include multiple highly correlated variants, that is, variants in strong linkage disequilibrium (LD). Pinpointing the most likely causal variants within GWAS signals, and linking these variants to their target genes, is challenging, particularly as the number of GWAS loci and variants increases. For hematological traits, for instance, our recent GWAS meta-analyses have revealed over 7,000 loci, with >148,000 variants associated with at least one blood cell index at a stringent genome-wide significance threshold. Comprehensive and computationally efficient annotation and prioritization of such GWAS findings are of ever-increasing interest. Understanding how genetic variants contribute to a phenotype is often referred to as the variant-to-function (V2F) problem. Responding to this problem requires us to determine causal genetic variants, relevant cell types/states, their target genes, and cellular/physiological functions. Functional experiments are needed to fully reveal molecular mechanisms, but we cannot yet afford to perform time-, money-, and labor-consuming experimental validations of thousands of loci involving hundreds of thousands of potentially functional variants or regulatory elements controlling their nearby genes, since each gene is likely regulated by multiple variants, and each variant may regulate multiple genes.
Thus, computational methods are needed to screen potential variants and their effector genes for further experiments. In this study, we focus on hematological traits. Hematological phenotypes (red blood cell, white blood cell, and platelet counts and indices) are critical physiological intermediaries in oxygen transport, immunity, infection, thrombosis, and hemostasis and are associated with autoimmune, allergic, infectious, and cardiovascular diseases. Hematological traits are highly heritable, and recent large GWASs for hematological traits (including nearly 750,000 participants) identified thousands of variant-trait associations. In addition, multiple large-scale functional experiments are already available for hematological traits, as well as fairly comprehensive functional annotation resources relevant to blood tissues. This makes hematological traits an ideal model for this type of V2F computational solution. We have developed VAMPIRE: Variant Annotation Method Pointing to Interesting Regulatory Effects, a tool for the user to explore annotations encompassing epigenomic signatures, variant impact on protein and gene expression, chromatin conformation information from Hi-C and similar technologies, as well as publicly available GWAS and phenome-wide association study (PheWAS) results, creating a comprehensive annotation profile for variants from recent trans-ethnic blood cell trait publications, with a flexible interface for adding future GWAS results. This interactive web application implemented in R Shiny provides a model display mechanism for annotating GWAS variants from diverse complex traits, allowing selection of the most likely causal variants and their effector genes for experimental follow-up. Importantly, using independent functional assessments, we show that variants and genes nominated by VAMPIRE highlight key regulators of blood cell traits, confirming the value of this annotation tool.
While blood cell traits are the focus for VAMPIRE, this framework (including our R Shiny application) is adaptable for annotation of other complex trait GWAS results and will facilitate the connection between variant and function.

Material and methods

Variant annotations

The current version of VAMPIRE (Figure 1) includes GWAS results from two studies (as detailed in the supplemental methods), including all variants in 95% credible sets for fine-mapped hematological-trait-associated loci from Chen et al. (N1 = 148,019 variants) and lead variants (N2 = 2) from a TOPMed imputed GWAS meta-analysis in African American and Hispanic/Latino populations. We plan to extend VAMPIRE as new trans-ethnic blood cell trait genetic analyses are released.

Figure 1

Overall framework of this study

VAMPIRE starts with GWAS variants in the 95% credible sets, integrates different annotations, and assigns them into different prioritization categories. We further demonstrated that our top prioritized category is enriched with variants that were experimentally validated. VAMPIRE provides a prototype that can be adapted to many other complex traits, paving the path for efficient variant-to-function (V2F) analyses.

The sources of the annotation used are stated clearly in the VAMPIRE online application, with links or references to the original data sources. As a brief summary, the annotation categories are split into six types (“variant level,” “1D,” “2D,” “3D,” “PheWAS,” and “GWAS”). First, variant level contains data on phenotypic association from the original publication or preprint (such as the p value for association with a given hematological trait, effect size, and posterior probability of inclusion for fine-mapping credible sets). Second, 1D refers to epigenomic or sequence constraint features. This displays selected output from WGSA, including functional prediction scores, conservation scores, and epigenetic information gathered from GeneHancer, FANTOM5, Roadmap, and ENCODE. ATAC-seq peaks from recent studies for blood cell traits, and key histone chromatin immunoprecipitation sequencing (ChIP-seq) peaks such as H3K9me3, H3K36me3, H3K4me1, H3K4me3, and H3K27Ac generated across blood-cell-related tissues from Roadmap Epigenomics, are also included. We further include information regarding whether each variant resides in any selective sweep region detected from multiple populations in the 1000 Genomes Project using the S/HIC method. Information is displayed based on the tissue relevance to the blood cell phenotype (see Supplemental methods).
All variants have 1D annotation, but for prioritization purposes as described below in the five categories for noncoding variant annotation, we define 1D annotation as FANTOM5_enhancer_robust = Y (yes), or Genehancer_feature = “Promoter” or “Enhancer” or “Promoter/Enhancer,” or coreMarks (for any relevant Roadmap epigenomic category) = “Enhancers” or “Active TSS.” Users can then additionally filter by criteria such as functional prediction and conservation scores. For the “2D” annotations, we included impact on gene expression and splicing ratios (expression quantitative trait locus [eQTL] and splicing QTL [sQTL] information) and impact on protein abundance (protein QTL [pQTL] information) from public sources relevant to blood cell traits. This includes both bulk and cell-type-specific sources from the public domain (eQTLGen, CAGE, BIOS for whole blood, and Raj et al. for purified CD4+ T cells and monocytes). Information available in these sources varies, but at a minimum we generally display the effect size estimate, p value, the allele assessed, and the gene or protein involved. Variants were matched across sources based on chromosome, position, and alleles of each variant. Only significant results (based on false discovery rate [FDR] or other publication-specific thresholds) from the respective sources are displayed in VAMPIRE; we do note that formal co-localization analyses would still need to be performed to determine if blood-cell-related and gene/protein expression QTL signals truly coincide. For the 3D annotations, we include information on 3D genome conformation, linking blood-lineage-specific regulatory elements to target genes from various sources. More specifically, using Hi-C data we incorporated statistically significant long-range chromatin interactions (LRCI) calculated from Fit-Hi-C, loops using the HiCCUPS methodology, and super-FIREs for related tissues.
Two Promoter-Capture Hi-C (PCHi-C) data sources were also incorporated and matched with the 2D results to highlight consistent evidence regarding the affected gene(s) across 2D and 3D annotations. VAMPIRE displays information on the number of loops, LRCI, PCHi-C interactions, FIREs, or super-FIREs, as well as significance measures such as p values, FDR, or CHICAGO scores where applicable. This 3D annotation information can also be visualized via our HUGIn browser. The last two data groups present results from two PheWAS sources and GWAS results of blood cell traits from the GWAS Catalog, allowing the user to evaluate if hematological trait-associated variants may also influence other complex traits. To visualize and leverage these multiple annotation categories for further analysis or prioritization of experimental validations, VAMPIRE efficiently displays and integrates relevant variant information, allowing the user to investigate either all the variants annotated or subsets based on annotation category groupings, searching either by variant or by gene name. The comprehensive annotation for the variants is summarized using a five-category grouping created to highlight the most promising variants, that is, those supported by several types of annotation. Specifically, the five categories for noncoding variants are (1) the most restrictive category, containing variants that have 1D, 2D, and 3D annotation and for which the genes implicated by 2D and 3D evidence are consistent; (2) variants with 1D, 2D, and 3D evidence, but for which the genes implicated from different resources are not consistent; (3) variants with 2D and 3D information and consistent gene evidence between the 2D and 3D annotations; (4) variants with 2D and 3D information and no consistent gene implied; and (5) variants with 1D and 3D evidence.
We also display a predicted high-impact coding variant category, including high-confidence loss-of-function (LoF) variants and likely influential missense variants, in-frame insertions and deletions (indels), and synonymous variants. Variants without strongly compelling variant annotation are still displayed but are not listed in these high-priority categories. The user can further subset results by hematological trait, hematological trait category, or (for the Chen et al. paper) the ancestry-specific grouping in which a given credible set was derived (trans-ethnic, European, East Asian, South Asian, Hispanic/Latino, or African ancestry). In addition, the user can restrict the amount of information presented by selecting which tables are displayed. All tables can be exported in CSV or tab-delimited format.
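The 1D rule and the five noncoding categories described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VAMPIRE implementation (which is an R Shiny application): the `VariantAnnotation` record, its field names, and the function names are hypothetical, and "having 2D or 3D annotation" is simplified to "at least one gene is implicated by that annotation type."

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Values quoted from the 1D definition in the text.
GENEHANCER_REGULATORY = {"Promoter", "Enhancer", "Promoter/Enhancer"}
CORE_MARKS_REGULATORY = {"Enhancers", "Active TSS"}


@dataclass
class VariantAnnotation:
    """Hypothetical per-variant, per-trait record (field names are assumptions)."""
    fantom5_enhancer_robust: bool = False
    genehancer_feature: Optional[str] = None
    core_marks: Set[str] = field(default_factory=set)     # Roadmap states, relevant tissues
    qtl_genes: Set[str] = field(default_factory=set)      # genes implicated by 2D (QTL) evidence
    contact_genes: Set[str] = field(default_factory=set)  # genes implicated by 3D evidence


def has_1d(a: VariantAnnotation) -> bool:
    """1D evidence as defined for prioritization purposes."""
    return (
        a.fantom5_enhancer_robust
        or a.genehancer_feature in GENEHANCER_REGULATORY
        or bool(a.core_marks & CORE_MARKS_REGULATORY)
    )


def assign_category(a: VariantAnnotation) -> Optional[int]:
    """Return 1-5 for a noncoding variant, or None if uncategorized."""
    one_d = has_1d(a)
    two_d = bool(a.qtl_genes)
    three_d = bool(a.contact_genes)
    # Gene-consistent: at least one gene shared by 2D and 3D evidence.
    consistent = bool(a.qtl_genes & a.contact_genes)
    if two_d and three_d:
        if one_d:
            return 1 if consistent else 2
        return 3 if consistent else 4
    if one_d and three_d:
        return 5
    return None
```

For instance, a variant flagged as a GeneHancer enhancer whose QTL gene matches a gene it contacts in 3D data lands in category 1, while the same evidence pointing at two different genes lands in category 2.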

Enrichment analysis

To assess whether the variants prioritized by VAMPIRE are more likely to be functionally impactful, we performed enrichment analysis at three different levels: variant level, gene level, and variant-gene pair level, leveraging data generated from previously published functional experiments. For each set of analyses, we conducted Fisher’s exact test and calculated odds ratios (ORs) and one-sided p values. At the variant level, we assessed the enrichment of variants that modify transcription factor (TF) binding motifs among our annotation category 1 variants. We compared variants in category 1 with both uncategorized variants and variants in other categories. Recently, Vuckovic et al. characterized variants that affect erythropoiesis or hematopoiesis by modifying related TF motifs, such as for KLF1, KLF6, MAFB, and GATA1. We chose these four erythroid TFs as positive control TFs and two non-erythroid TFs (IRF1 and IRF8) as negative controls. At the gene level, we evaluated the genes interrogated by Nandakumar et al. with a pooled short hairpin RNA (shRNA)-based loss-of-function approach. Specifically, Nandakumar et al. assessed 389 genes in the neighborhood of 75 loci associated with red blood cell traits, to identify potential causal genes underlying these GWAS signals. We assessed the enrichment of genes validated by shRNA experiments among those prioritized in VAMPIRE’s category 1. Note that the categories were previously defined at the variant level. Here we extend the variant categories to genes by assigning each gene the most highly prioritized category among the genome-wide significant variants linked to it. Due to the limited sample size of uncategorized genes, especially when overlapping with genes in the shRNA paper (leaving us with only two genes), we compared genes in category 1 to genes in all other categories.
We also performed enrichment analyses at both variant and gene levels for categories 2–5, comparing one category to the others to see if any specifically exhibit a higher level of enrichment than the others. Specifically, we compared category 2 to categories 3–5; compared category 3 to categories 2, 4, and 5; and compared category 5 to categories 2–4. At the variant-gene pair level, we employed the enhancer-gene connections validated via CRISPRi-FlowFISH experiments by Fulco et al. in their activity-by-contact (ABC) paper. Specifically, Fulco et al. tested pairs of candidate cis regulatory elements (CREs, ∼500 bp regions) and their potential effector genes via CRISPRi perturbations of the CREs, in multiple cell lines including K562 cells. Fulco et al. tested 4,124 CRE-gene pairs in total, of which 175 were significant in their experiments. We overlapped their tested CREs with variants in our VAMPIRE annotation database. We define a VAMPIRE variant-gene pair as confirmed if the variant overlaps an ABC-validated CRE and the linked genes in VAMPIRE (from QTL and chromatin conformation capture evidence) overlap the corresponding effector gene for that CRE in ABC’s CRISPRi-FlowFISH experiment. We focused on ABC experiments performed in K562 cells (instead of GM12878 cells, where only a very small number of CREs were tested), as the number of tested CRE-gene pairs was large enough for robust statistical inference. Matching the K562 cell line, we focused only on variants associated with red blood cell traits. Similar to the above two sets of enrichment analyses, we focused on annotations in VAMPIRE’s prioritization category 1. Specifically, we tested whether variant-gene pairs prioritized in VAMPIRE’s category 1 are enriched within ABC’s validated enhancer-gene connections.
Given that the CREs tested in the ABC paper are rather short (∼500 bp), we also performed sensitivity analyses by first extending the CRE regions by ±1 kb and ±5 kb and then overlapping variants with these extended CREs, to ensure robust conclusions.
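To make the enrichment test concrete, the following is a minimal sketch of a one-sided Fisher’s exact test written with the Python standard library only. The function name and argument layout are assumptions made for illustration; the analyses in this study may have been run with different software. The odds ratio returned here is the simple cross-product estimate, which can differ slightly from a conditional maximum-likelihood estimate reported by some statistical packages.

```python
from math import comb
from typing import Tuple


def fisher_one_sided(hits_a: int, total_a: int,
                     hits_b: int, total_b: int) -> Tuple[float, float]:
    """Test whether hits are enriched in group A relative to group B.

    Returns (cross-product odds ratio, one-sided p value). The p value is
    the hypergeometric tail probability P(X >= hits_a) when total_a items
    are drawn from the pooled set of total_a + total_b items.
    """
    n = total_a + total_b   # pooled number of items
    k = hits_a + hits_b     # pooled number of hits
    upper = min(k, total_a)
    tail = sum(comb(k, x) * comb(n - k, total_a - x)
               for x in range(hits_a, upper + 1))
    p_value = tail / comb(n, total_a)

    miss_a = total_a - hits_a
    miss_b = total_b - hits_b
    denom = miss_a * hits_b
    odds_ratio = (hits_a * miss_b) / denom if denom else float("inf")
    return odds_ratio, p_value


# Example with the KLF1 counts reported in Table 2:
# 34 of 5,687 category 1 variants vs. 34 of 21,947 uncategorized variants.
klf1_or, klf1_p = fisher_one_sided(34, 5687, 34, 21947)
```

Because the groups are passed as disjoint sets (group A versus group B), this layout matches comparisons such as category 1 versus uncategorized variants; a comparison against a background that contains group A would need the counts rearranged first.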

Comparison to FUMA

To further assess the capability of VAMPIRE in terms of gene prioritization, we compared the genes prioritized by VAMPIRE to genes prioritized by FUMA for seven red blood cell traits, including hematocrit (HCT), hemoglobin (HGB), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV), red blood cell (RBC) count, and red blood cell distribution width (RDW). We uploaded the GWAS summary statistics for each trait separately to the FUMA website with all default parameters using FUMA’s SNP2GENE function. We then combined the prioritized genes for all seven red blood cell traits to compare the two methods. Similar to the gene-level enrichment analysis described above, we evaluated the number of shRNA-assessed genes and shRNA-validated genes from the shRNA experiments overlapping with the two methods. Venn diagrams were used for better illustration of the results.

Results

Overview of VAMPIRE annotations

The overall framework of VAMPIRE is illustrated in Figure 1. We started with all variants in 95% credible sets from our recent trans-ethnic study for hematological traits (total 148,019 variants) and lead variants (2 variants) from Kowalski et al. We incorporated six types of annotations (detailed in Material and methods): GWAS summary statistics and posterior probability of inclusion from our previous fine-mapping analyses; epigenomic or sequence constraint features (1D); eQTL, sQTL, and pQTL information (2D); information on 3D genome conformation (3D); results from two PheWAS sources (PheWAS); and GWAS results for blood cell traits from the GWAS Catalog (GWAS).

VAMPIRE variant categories

To visualize and prioritize variants along with their corresponding candidate regulatory regions and their potential effector genes, we leverage the aforementioned six types of annotation to group these ∼148,000 blood-cell-trait-associated variants into various prioritization categories. Specifically, for noncoding variants, we classified them into five categories (detailed in Material and methods). Among them, category 1 is the most restrictive category, containing variants that meet all the following criteria: they have 1D, 2D, and 3D annotation, and 2D and 3D evidence supports the same effector genes (i.e., gene-consistent). Variants in category 2 are also required to have 1D, 2D, and 3D annotation simultaneously, but the genes implicated by 2D and 3D evidence are inconsistent. For example, if a variant rsXXX is an eQTL of gene A according to 2D annotations, and it also resides in a region that forms a loop with the promoter region of gene A, we say the 2D and 3D evidence is gene-consistent, and rsXXX will be classified in category 1. However, if rsYYY is an eQTL of gene B according to 2D annotations, but there is only information suggesting the rsYYY-residing region forms a loop with the promoter region of gene C, we say it is not gene-consistent and will classify rsYYY in category 2. Of course, it is possible that a SNP is an eQTL for multiple genes (e.g., gene D and gene E) and its residing region forms loops with promoters of multiple genes (e.g., gene E and gene F). As long as we can find one gene that is shared, we classify the variant as gene-consistent for the shared gene(s). In practice, we are more confident to prioritize functional experiments for rsXXX than for rsYYY, since we have consistent support from three independent sources of information for rsXXX: 1D suggesting it is regulatory, and 2D (i.e., eQTL or pQTL) and 3D (i.e., chromatin conformation) both suggesting it is regulating gene A. Variants in categories 3 and 4 have only 2D and 3D annotation.
Category 3 includes those with consistent target genes suggested by 2D and 3D annotation, while variants in category 4 have 2D and 3D annotations suggesting different/inconsistent target genes. Category 5 includes those with 1D and 3D annotation but no 2D evidence. Variants not falling into any of the five categories are classified as uncategorized. Note that due to tissue or cell type specificity for some 2D (e.g., eQTL) and 3D (e.g., PCHi-C) annotations, such variant-level categorization was performed separately for different traits. For instance, for white blood cell-related indices (e.g., monocyte count), we considered 2D annotations from whole blood, peripheral blood mononuclear cells (PBMCs), and monocytes, while for platelet-related traits (e.g., platelet count), we only considered 2D annotations from whole blood and PBMCs. Suppose a variant has 1D regulatory evidence and forms a loop with gene A from 3D annotations. Furthermore, it is an eQTL for gene A in monocytes, but is an eQTL only for another gene B in whole blood and PBMCs; the variant falls into category 1 for monocyte count, but category 2 for platelet count. In summary, a variant may fall into different categories for different traits. In addition, each gene is categorized according to the prioritization categories of its linked variant(s). When its linked variants fall in multiple categories, the gene is assigned to the most highly prioritized category. The numbers of variants and genes in each category are shown in Table 1.
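The trait-specific behavior and the gene-level roll-up described above can be illustrated with a short Python sketch. Everything here is hypothetical scaffolding: the `TRAIT_TISSUES` table contains only the two traits used as examples in the text, and the function names and tuple layouts are assumptions rather than part of VAMPIRE.

```python
from typing import Dict, Iterable, Set, Tuple

# Tissues whose 2D (eQTL) evidence is considered for each trait.
# Only the two examples given in the text are listed (assumed layout).
TRAIT_TISSUES: Dict[str, Set[str]] = {
    "monocyte count": {"whole blood", "PBMC", "monocyte"},
    "platelet count": {"whole blood", "PBMC"},
}


def qtl_genes_for_trait(eqtls: Iterable[Tuple[str, str]], trait: str) -> Set[str]:
    """Keep eQTL genes observed in a tissue relevant to the given trait.

    `eqtls` is an iterable of (tissue, gene) pairs for one variant.
    """
    tissues = TRAIT_TISSUES[trait]
    return {gene for tissue, gene in eqtls if tissue in tissues}


def category_given_1d_and_3d(qtl_genes: Set[str], contact_genes: Set[str]) -> int:
    """Category for a variant that already has 1D and 3D evidence."""
    if not qtl_genes:
        return 5                      # 1D and 3D, but no 2D evidence
    return 1 if qtl_genes & contact_genes else 2


def gene_categories(links: Iterable[Tuple[int, str]]) -> Dict[str, int]:
    """Assign each gene the most highly prioritized (lowest-numbered)
    category among its linked variants, given (category, gene) pairs."""
    best: Dict[str, int] = {}
    for category, gene in links:
        best[gene] = min(category, best.get(gene, category))
    return best


# The worked example from the text: eQTL for gene A in monocytes only,
# eQTL for gene B in whole blood and PBMCs, and a 3D loop with gene A.
eqtls = [("monocyte", "A"), ("whole blood", "B"), ("PBMC", "B")]
contact = {"A"}
by_trait = {
    trait: category_given_1d_and_3d(qtl_genes_for_trait(eqtls, trait), contact)
    for trait in TRAIT_TISSUES
}
```

Under these assumptions `by_trait` places the variant in category 1 for monocyte count and category 2 for platelet count, mirroring the example in the text.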

Table 1

Numbers of variants and genes in each category

	Explanation	Unique variants (#)	Variant-trait pairs (#)	Genes (#)
Category 1	1D & 2D & 3D & gene-consistent	13,862	19,988	9,857
Category 2	1D & 2D & 3D & not gene-consistent	21,269	30,276	2,735
Category 3	2D & 3D & gene-consistent	14,155	20,192	1,300
Category 4	2D & 3D & not gene-consistent	33,732	48,497	1,621
Category 5	1D & 3D	11,820	14,507	1,578
Uncategorized	others	62,489	78,477	174
Total		148,215	211,937	17,265

Note that the category was defined initially at variant level, separately for each blood cell trait. One variant may fall in category 1 for one trait but in other categories for other traits. In total, we have 148,215 unique variants and 211,937 variant-trait association pairs. For gene-level category, each gene is categorized according to the prioritization categories of its linked variant(s). When its linked variants fall in multiple categories, the gene is assigned to the most highly prioritized category.

Our enrichment analyses, which drew on multiple previously published functional validation experiments at the variant, gene, and variant-gene pair levels, all showed promising results. Specifically, at the variant level, compared to uncategorized variants, we found significant enrichment of variants affecting TF binding motifs among variants prioritized in category 1 of VAMPIRE (Figure 2) for all the erythroid TFs (p ≤ 8.1E−4) except GATA1 (p = 0.18) (Table 2), likely due to a smaller sample size of variants. In contrast, neither of the two negative control TFs (IRF1 and IRF8) showed any significant enrichment (p = 0.62 and 0.22, respectively). A similar pattern holds when comparing category 1 variants to all other categories, but the significance level decreased (Table S1), which suggests that variants in other prioritized categories, although not as enriched at TF binding sites as category 1 variants, tend to exhibit higher levels of enrichment than the uncategorized variants. At the gene level, we focused on two statistics: (1) number of genes selected for shRNA experiments, since genes were more likely to be selected for experiments when they demonstrated some prior evidence of potential causality; and (2) number of genes validated (p < 0.05) by shRNA experiments.
We compared the number of genes in our annotation category 1 and all other categories and found that both shRNA-assessed genes (p = 3.5E−13) and validated genes (p = 3.1E−8) show strong enrichment among those in our annotation category 1 (Table 3), and the estimated enrichment score for validated genes (OR = 4.65) is almost double that for shRNA-assessed genes (OR = 2.37).

Figure 2

Variant-level TF motif enrichment analysis

Each dot represents an enrichment score, with the line depicting 95% confidence interval (CI). All the upper bounds of these CIs are infinity. The p values of the enrichment are reflected by the dot size at the OR point estimate, with a larger dot indicating more significant enrichment.

Table 2

Variant-level transcription factor (TF) motif enrichment analysis

	Category 1	Uncategorized	p value	Odds ratio
All RBCT variants	5,687	21,947
KLF1	34	34	7.10E−08	3.86
KLF6	21	14	4.30E−07	5.79
MAFB	13	13	8.10E−04	3.86
GATA1	8	19	0.18	1.63
IRF1	12	49	0.62	0.95
IRF8	19	58	0.22	1.26

Four erythroid TFs and two non-erythroid TFs were selected. Fisher’s exact test was applied to test for enrichment. Three erythroid TFs show enrichment for our VAMPIRE annotation category 1 (MAFB, KLF6, KLF1, p < 0.05). GATA1 motif variants also have some evidence of enrichment (odds ratio = 1.625), but this enrichment is not significant (p = 0.18), likely due to smaller sample size of variants. Two non-hematopoiesis transcription factors selected as controls do not show significant enrichment with VAMPIRE functional annotation category 1. RBCT, red blood cell trait associated.

Table 3

Gene level enrichment analysis

	Category 1	Other categories	p value	Odds ratio
All category genes	9,857	7,408
shRNA-assessed genes	262	83	3.50E−13	2.37
shRNA-validated genes	68	11	3.10E−08	4.65

Fisher’s exact test was applied to test for enrichment. Both shRNA experiment assessed genes and validated genes show significant enrichment in our most restrictive VAMPIRE annotation category (category 1).

We also conducted similar enrichment analyses at variant and gene levels to compare categories 2–5. Neither the variant-level (Table S1) nor the gene-level (Table S2) enrichment results are significant, except for category 3 (i.e., both 2D and 3D evidence exist and suggest the same gene[s]). Category 3 is significantly (p = 0.037) enriched with KLF1 motif variants (OR = 1.44) and is significantly (p = 0.027) enriched with shRNA-assessed genes (OR = 1.70). These results suggest category 3 may be the next category most worthy of further investigation after category 1, but the evidence is not strong: the significance levels are not very high, other TF motifs are not enriched, and the sample size (i.e., number of genes tested) is small.
However, category 2 (1D/2D/3D but not gene-consistent) is significantly favored over category 4 (2D/3D but not gene-consistent) (Table S3), suggesting that the additional 1D information provides more evidence. Finally, at the variant-gene pair level, we also observed enrichment among variants selected into VAMPIRE’s category 1 (Table 4). When restricting only to variants in category 1 and associated with red blood cell traits and without extending the CRE regions, only 7 of VAMPIRE’s variant-gene pairs can be found in ABC’s CRISPRi-FlowFISH experiments, of which 6 are not significant and 1 is significant. While not significant (p = 0.26), the direction of enrichment is nevertheless encouraging (one of seven, or 14.3%, confirmed by CRISPRi-FlowFISH experiments) and more than 3-fold greater than that among all/background pairs from Fulco et al., where 175 out of 4,124 CRE-gene pairs (4.2%) were confirmed. Note that all the confirmed pairs were linked with variants associated with red blood cell traits. Further generalizing to all VAMPIRE annotation categories and to variants associated with any blood cell trait, the enrichment OR increases to 8.30 with p value 9.0E−5, indicating that variant-gene pairs prioritized by VAMPIRE’s five categories have much higher odds of being functional. To further accommodate causal variants tagged by GWAS variants not falling into the short 500 bp CREs, we extended the CREs by ±1 kb or ±5 kb and performed similar enrichment analysis. Our conclusions remained qualitatively similar (Table 4), but the enrichments increased in significance thanks to the larger sample size (in this context, the larger number of variant-gene pairs contributing to the analysis), suggesting that more liberal windows of cis-regulatory regions can capture a higher rate of functional variant-gene pairs.
For example, the enrichment for category 1 variants associated with red blood cell traits reached an OR of 15.77 (p = 3.8E−6) and 16.68 (p = 3.1E−15) for 1 kb and 5 kb extension, respectively. We thus conclude that such enrichment is significant and robust to the extension of CREs.

Table 4

Variant-gene pair level enrichment analysis

Variant-gene pair set	Not significant	Significant	Significant (%)	p value	Odds ratio
All pairs from Fulco et al.⁷	3,949	175	4.24
Confirmed pairs in category 1 for RBC traits	6	1	14.29	0.26	3.76
Confirmed pairs in category 1 for all traits	6	1	14.29	0.26	3.76
Confirmed pairs in all categories for all traits	19	7	26.92	9.00E−05	8.30
Confirmed pairs in category 1 for RBC traits (±1 kb)	10	7	41.18	3.80E−06	15.77
Confirmed pairs in category 1 for all traits (±1 kb)	21	9	30.00	3.50E−06	9.66
Confirmed pairs in all categories for all traits (±1 kb)	70	21	23.08	4.60E−10	6.76
Confirmed pairs in category 1 for RBC traits (±5 kb)	27	20	42.55	3.10E−15	16.68
Confirmed pairs in category 1 for all traits (±5 kb)	64	23	26.44	3.80E−12	8.1
Confirmed pairs in all categories for all traits (±5 kb)	160	37	18.78	3.10E−13	5.21

We performed analysis for three variant annotation pools (category 1, red blood cell [RBC] trait associated; category 1, any blood cell trait associated; any annotation priority category (1–5), any blood cell trait associated) and three CRE lengths. Fisher’s exact test was applied to test for enrichment. We found enrichment for all three variant annotation pools. These enrichments are also robust to the extension of CREs.
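The enrichment test for a single row of Table 4 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `fisher_exact_greater` is hypothetical, and two things are assumptions because the text does not state them: that the test is one-sided (enrichment only), and that each prioritized set is contrasted against the "All pairs from Fulco et al." row as background. The odds ratio computed here is the simple sample odds ratio, which may differ slightly from the estimator behind the published values.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

    Assumed table layout (rows = pair set, columns = CRISPRi outcome):
                      significant   not significant
        prioritized        a               b
        background         c               d
    Returns (sample odds ratio, P[X >= a]) under the hypergeometric null.
    """
    row1 = a + b                 # prioritized pairs that were tested
    col1 = a + c                 # all significant pairs
    total = a + b + c + d
    upper = min(row1, col1)      # largest possible count in cell `a`
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    p_value = tail / comb(total, row1)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value

# Category 1, RBC traits, no CRE extension: 1 significant, 6 not significant.
# Background row: 175 significant, 3,949 not significant.
odds_ratio, p_value = fisher_exact_greater(1, 6, 175, 3949)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.2f}")
```

Under these assumptions the sketch gives an odds ratio of about 3.76 and a p value of about 0.26 for that row, in line with the table. Exact integer arithmetic via `math.comb` avoids floating-point underflow for the larger rows.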

Application example

Figure 3 shows one example at the CALR locus associated with red blood cell traits. Fulco et al. confirmed by CRISPRi-FlowFISH experiment that CRE chr19: 12,996,905–12,998,745 (hg19) regulates gene CALR (adjusted p value, 1.9E−7). Annotations compiled by VAMPIRE consistently suggest that CALR is linked to rs8110787 (chr19: 12,999,458, hg19) in category 1. rs8110787 is associated with several red blood cell traits, including HCT, MCH, MCV, and red blood cell count. Based on genomic distance alone, CALR is not the nearest gene to rs8110787; several other genes lie closer. However, based on H3K27ac HiChIP data in K562 cells, rs8110787 significantly interacts with the CALR promoter region (p < 1E−120), suggesting that CALR is a potential target gene regulated by the CRE around rs8110787. This variant is also an eQTL of CALR in CAGE (p = 9.4E−16) and BIOS (p = 1.0E−25) and lies in an enhancer in K562 leukemia cells (E123) from Roadmap, providing additional evidence. VAMPIRE successfully highlights this rs8110787-CALR pair in its category 1.
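The coordinates quoted for this locus also illustrate what extending a CRE by ±1 kb or ±5 kb means in practice. The sketch below is only coordinate arithmetic on the hg19 positions given in the text. The helper name `variant_in_cre` is hypothetical, and treating the CRE as a closed, 1-based interval is an assumption, since the overlap rule is not spelled out here.

```python
def variant_in_cre(pos, cre_start, cre_end, flank=0):
    """Return True if a variant position lies inside the CRE after
    extending both ends by `flank` bp (closed interval, an assumption)."""
    return (cre_start - flank) <= pos <= (cre_end + flank)

# hg19 coordinates quoted in the text (all on chr19)
CRE_START, CRE_END = 12_996_905, 12_998_745   # CRE tested by Fulco et al.
RS8110787_POS = 12_999_458

# Distance from the CRE's right-hand boundary to the variant, in bp
distance_to_cre = RS8110787_POS - CRE_END

# Overlap status for the three CRE lengths used in the enrichment analysis
overlaps = {flank: variant_in_cre(RS8110787_POS, CRE_START, CRE_END, flank)
            for flank in (0, 1_000, 5_000)}

print(f"distance past CRE end: {distance_to_cre} bp")
for flank, hit in overlaps.items():
    print(f"flank ±{flank} bp: variant inside CRE = {hit}")
```

With these coordinates, rs8110787 sits 713 bp beyond the CRE boundary, so it falls outside the unextended element but inside both the ±1 kb and ±5 kb windows.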

Figure 3

Variant-gene pair example (rs8110787-CALR) visualization from HUGIn2

Fulco et al. confirmed via CRISPRi experiments that chr19: 12,996,905–12,998,745 (hg19) regulates gene CALR (adjusted p value, 1.9E−7), which is highly expressed in erythroblasts. Based on annotations in VAMPIRE, CALR is linked to rs8110787 (chr19: 12,999,458, hg19) in prioritization category 1, including higher-than-expected physical interactions with the CALR locus from erythroblast pcHiC data, eQTL of CALR in CAGE and BIOS, erythroid ATAC-seq peak, and H3K27ac peak in K562 leukemia cells. rs8110787 is associated with several red blood cell traits (namely hematocrit [HCT], mean corpuscular hemoglobin [MCH], mean corpuscular volume [MCV], and red blood cell count), as reported in Chen et al.

As a further example of the utility of the VAMPIRE application, we present the annotation results for one of the lead genome-wide significant variants from recent trans-ethnic GWAS analyses from Chen et al. For our analysis, we were particularly interested in exploring low-frequency variants and those more common in individuals of non-European ancestry. We were able to quickly rank and prioritize variants for further examination using the annotation categories described above, including the low-frequency variant rs112097551 associated with MCV, MCH, and red blood cell count. This low-frequency intergenic variant rs112097551 (GATA2-RPN1 locus, 0.15% minor allele frequency in Chen et al. trans-ethnic analysis) has no close linkage disequilibrium proxies in African or European populations and thus was not compared to other highly correlated variants. Based on variant frequency, particularly in European ancestry populations, we had no expectation this variant would have eQTL or pQTL evidence (2D annotation), given currently available sample sizes for eQTL and pQTL analysis. For a low-frequency variant of interest like rs112097551, 1D plus 3D annotation is therefore likely the highest attainable annotation category.

The variant is ∼5× more common among African versus non-African samples in gnomAD version 2.1.1. It is the only variant in the credible set in fine-mapping analyses from Chen et al. 1D annotation suggests this variant is highly conserved (CADD Phred score of 20.4, meaning the variant is among the top 1% of deleterious variants in the human genome), and it is rated as deleterious by FATHMM-XF (rank score 0.99169, close to the maximum score of 1). It is also in open chromatin in megakaryocyte-erythroid progenitor cells, based on hematopoietic ATAC-seq data. 3D annotation from PCHi-C data in erythroblasts from Javierre et al. links this variant to the gene RUVBL1 ∼500 kb away, as well as noncoding transcripts RNU2-37P and RUVBL1-AS1. Based on these data, which can be quickly displayed using the VAMPIRE application, we recently validated this candidate functional enhancer variant experimentally via base and nuclease editing.

FUMA is an integrative web-based platform using multiple different sources of biological evidence to facilitate functional annotation of GWAS results, gene prioritization, and interactive visualization. We compared VAMPIRE with FUMA, in terms of the number of genes prioritized, shRNA-assessed genes, and validated genes (Figure 4), for red blood cell traits. FUMA prioritized 4,070 genes (A1 + A2 + A3), where 1,886 genes are also prioritized by VAMPIRE category 1 (A1) with an additional 769 genes in categories 2–5 of VAMPIRE (A2). The total number of genes prioritized by VAMPIRE category 1 (n = 4,832, A1 + A4) is similar to that by FUMA (n = 4,070), but that number is almost twice that of FUMA when considering all the categories of VAMPIRE (n = 7,922, A1 + A2 + A4 + A5).

We evaluated the prioritized genes using data from the shRNA experiments. We first checked genes assessed in the shRNA experiments (Figure 4B) and observed similar proportions of method-specific prioritized genes assessed. Comparing FUMA and VAMPIRE category 1, for example, out of the 2,184 (A2 + A3) FUMA-specific genes, 79 (B2 + B3) are assessed (3.6%); out of the 2,177 (A4) VAMPIRE category 1-specific genes, 84 (B4) are assessed (3.9%). We also found that shRNA-assessed genes exhibit a higher level of sharing than all genes prioritized. Again, comparing FUMA and VAMPIRE category 1, 178 (B1) out of 341 (B1 + B2 + B3 + B4) shRNA-assessed genes (52.2%) are shared between the two methods. In contrast, 1,886 (A1) out of 6,247 (A1 + A2 + A3 + A4) of all genes prioritized (30.2%) are shared. Finally, compared to FUMA, VAMPIRE category 1 led to a larger number (23 [C4] specific to VAMPIRE category 1 versus 16 [C2 + C3] specific to FUMA, Figure 4C) and a larger proportion (27.4% [C4/B4] versus 20.3% [(C2 + C3)/(B2 + B3)], although not statistically significant due to the small number of genes involved) of shRNA-validated genes (Figures 4B and 4C). These results suggest that VAMPIRE is complementary to FUMA, with VAMPIRE category 1 genes more likely to be functional.
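The proportions quoted in the FUMA comparison follow directly from the Venn-region counts given in the text. The short sketch below recomputes them as a sanity check. The variable names mirror the region labels of Figure 4, while the `pct` helper and the `summary` dictionary are illustrative and not part of VAMPIRE.

```python
# Venn-region counts quoted in the text (FUMA vs VAMPIRE category 1)
A1 = 1886             # prioritized by both FUMA and VAMPIRE category 1
A2_PLUS_A3 = 2184     # FUMA genes not in VAMPIRE category 1
A4 = 2177             # VAMPIRE category 1-specific genes
B1, B2_PLUS_B3, B4 = 178, 79, 84    # shRNA-assessed genes per region
C2_PLUS_C3, C4 = 16, 23             # shRNA-validated genes per region

def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)

summary = {
    "FUMA-specific genes assessed (%)": pct(B2_PLUS_B3, A2_PLUS_A3),
    "VAMPIRE cat. 1-specific genes assessed (%)": pct(B4, A4),
    "assessed genes shared by both methods (%)": pct(B1, B1 + B2_PLUS_B3 + B4),
    "all prioritized genes shared (%)": pct(A1, A1 + A2_PLUS_A3 + A4),
    "VAMPIRE cat. 1-specific genes validated (%)": pct(C4, B4),
    "FUMA-specific genes validated (%)": pct(C2_PLUS_C3, B2_PLUS_B3),
}

for label, value in summary.items():
    print(f"{label}: {value}")
```

The recomputed values (3.6, 3.9, 52.2, 30.2, 27.4, and 20.3) agree with the percentages reported above.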

Figure 4

Venn diagrams comparing FUMA and VAMPIRE

(A) All prioritized genes by FUMA and VAMPIRE. (B) shRNA-assessed genes overlap with genes prioritized by FUMA and VAMPIRE. (C) shRNA-validated genes overlap with genes prioritized by FUMA and VAMPIRE. In each panel, numbers are the number of genes belonging to the corresponding category. A1: shared between FUMA and VAMPIRE category 1; A2: shared between FUMA and VAMPIRE other categories (categories 2–5); A3: FUMA-specific genes; A4: VAMPIRE category 1-specific genes; A5: VAMPIRE other category-specific genes. Similar interpretation for B1–B5, C1–C5.

Discussion

As genotyped sample sizes increase and meta-analysis efforts grow ever larger, more variant-trait pairs are identified for complex traits than can be easily annotated on a variant-by-variant basis. New, user-friendly applications are needed for rapid display of functional annotation information and prioritization of variants for further functional follow-up to pave the V2F path. Our VAMPIRE tool provides an example of how the publicly available code can be adapted to accommodate other sources of annotation specific to other complex trait GWAS results or to accommodate future blood cell trait GWASs and annotation resources. Along with the addition of more blood cell trait genetics papers published in the future, VAMPIRE could also be used as written to annotate GWAS results for other blood-related phenotypes, such as recent GWASs of risk of myeloproliferative neoplasm or clonal hematopoiesis.

We group non-coding variants into five categories and draw the following conclusions and observations in terms of variant prioritization. First, category 1 is the most restrictive category, and variants in category 1 are more likely to be functional than those in the other categories. Second, beyond category 1, we found only that category 2 shows enhanced functional potential over category 4, while there are no strong preferences among the other categories. We have performed both variant-level (Table S1) and gene-level (Table S2) enrichment analyses comparing categories 2–5 and found no significant results, except for category 3. This may suggest that category 3 is slightly more likely to contain functional variants than categories 2, 4, and 5. However, the evidence is not strong: the significance levels are not very high, other TF motifs are not enriched, and sample size (i.e., number of genes tested) is small. Third, variant frequency information can also be helpful in interpreting eQTL/pQTL data. For a low-frequency or rare variant, power is low in current eQTL/pQTL studies with small to moderate sample sizes. Thus, the absence of 2D evidence most likely reflects the power issue and should be treated as eQTL/pQTL not having been adequately assessed rather than as the variant truly not being associated with the expression of gene(s) or the abundances of protein(s). Finally, different annotations have different weights depending on the trait of interest. For instance, annotations from megakaryocytes are critically important for platelet-related traits but can be rather safely ignored for red blood cell-related traits. Investigators focusing on different traits should use their discretion to up-/down-weight various annotations.

There are several reasons that a variant may not show up in the current VAMPIRE. First, we only included variants in credible sets from the recent GWAS efforts for blood cell traits. Variants not in those fine-mapping credible sets were not annotated. It is possible that such variants play functional roles but were not detected by GWASs and further missed by subsequent fine mapping. However, the probability tends to be low, particularly for common causal variants, given the >750,000 sample size involved in the generation of the credible sets. Second, for the included variants (i.e., credible set variants for blood cell traits), not falling in the prioritization categories (e.g., uncategorized) means that they are less likely to play functionally important roles compared to variants in categories 1–5, because no regulatory evidence or target genes are suggested based on the functional annotation information we have. Of course, it is possible that some of these uncategorized variants are indeed functional, but their functions are not reflected by the functional data we currently have.

As we accumulate additional functional validation data, including high-throughput massively parallel reporter assays (MPRAs), medium-throughput CRISPRi/CRISPRa, and low-throughput mouse xenotransplant experiments, we will provide statistics summarizing experimental validation results (e.g., number of variants in the category followed up, proportion that show evidence of functional impact in their experiments) for each of the VAMPIRE categories and for user-defined categories. Importantly, we illustrate the value of VAMPIRE using existing independent functional validation and therefore illuminate the value of this type of annotation tool in enabling one to go from variant to function for blood cell traits and other complex phenotypes.

We also note that there are some limitations of VAMPIRE. First, comprehensive annotations specific to various cell types and cell states would further enhance classification and prioritization accuracy of functional variants or regulatory elements and their target genes. Although data are increasingly being generated by us and others, and have been incorporated into VAMPIRE where available, interrogations in a cell-type- or state-specific manner are still much needed. For instance, our recent work has demonstrated that cell-type- or tissue-specific FIREs and super-interactive promoters (SIPs) play key regulatory roles and aid the identification and prioritization of functional regulatory elements and their corresponding genes. As more experimental data are generated, we will update VAMPIRE accordingly. Second, our list of 148,019 variants derives primarily from fine-mapping studies, which may be inaccurate in loci where more than one independent or partially independent signal exists. However, this limitation cannot be resolved before more powerful methods are developed for fine-mapping analysis for trans-ethnic GWASs. Finally, most of the annotations are based on analyses in European ancestry individuals (e.g., eQTL, pQTL, chromatin conformation, etc.). Many ongoing efforts, including ours, are generating resources for non-European ancestry samples. For example, we are involved in several recently funded efforts to generate RNA-sequencing data in non-European ancestry individuals in hematopoietic cell types and anticipate relevant eQTL and sQTL annotations being added to VAMPIRE in upcoming years.

In conclusion, we have built a comprehensive annotation tool, VAMPIRE, which provides characterization and prioritization of blood cell trait-related GWAS signals. Our results using existing functional experiments demonstrate that variants and genes prioritized by VAMPIRE are significantly more likely to be functionally validated at the variant, gene, or variant-gene pair level. Annotation tools like VAMPIRE, which could be easily modified to apply to additional complex traits and diseases, are necessary to translate knowledge of GWAS-significant variants to target genes and biological insights and to guide our decisions to prioritize experimental validations of the most likely functional regulatory variants/elements and their effector genes.

41 in total

1. Polarization of the effects of autoimmune and neurodegenerative risk alleles in leukocytes.

Authors: Towfique Raj; Katie Rothamel; Sara Mostafavi; Chun Ye; Mark N Lee; Joseph M Replogle; Ting Feng; Michelle Lee; Natasha Asinovski; Irene Frohlich; Selina Imboywa; Alina Von Korff; Yukinori Okada; Nikolaos A Patsopoulos; Scott Davis; Cristin McCabe; Hyun-il Paik; Gyan P Srivastava; Soumya Raychaudhuri; David A Hafler; Daphne Koller; Aviv Regev; Nir Hacohen; Diane Mathis; Christophe Benoist; Barbara E Stranger; Philip L De Jager
Journal: Science Date: 2014-05-02 Impact factor: 47.728

2. A Compendium of Chromatin Contact Maps Reveals Spatially Active Regions in the Human Genome.

Authors: Anthony D Schmitt; Ming Hu; Inkyung Jung; Zheng Xu; Yunjiang Qiu; Catherine L Tan; Yun Li; Shin Lin; Yiing Lin; Cathy L Barr; Bing Ren
Journal: Cell Rep Date: 2016-11-15 Impact factor: 9.423

3. Genetic influences on F cells and other hematologic variables: a twin heritability study.

Authors: C Garner; T Tatu; J E Reittie; T Littlewood; J Darley; S Cervino; M Farrall; P Kelly; T D Spector; S L Thein
Journal: Blood Date: 2000-01-01 Impact factor: 22.113

4. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping.

Authors: Suhas S P Rao; Miriam H Huntley; Neva C Durand; Elena K Stamenova; Ivan D Bochkov; James T Robinson; Adrian L Sanborn; Ido Machol; Arina D Omer; Eric S Lander; Erez Lieberman Aiden
Journal: Cell Date: 2014-12-11 Impact factor: 41.582

5. Gateways to the FANTOM5 promoter level mammalian expression atlas.

Authors: Marina Lizio; Jayson Harshbarger; Hisashi Shimoji; Jessica Severin; Takeya Kasukawa; Serkan Sahin; Imad Abugessaisa; Shiro Fukuda; Fumi Hori; Sachi Ishikawa-Kato; Christopher J Mungall; Erik Arner; J Kenneth Baillie; Nicolas Bertin; Hidemasa Bono; Michiel de Hoon; Alexander D Diehl; Emmanuel Dimont; Tom C Freeman; Kaori Fujieda; Winston Hide; Rajaram Kaliyaperumal; Toshiaki Katayama; Timo Lassmann; Terrence F Meehan; Koro Nishikata; Hiromasa Ono; Michael Rehli; Albin Sandelin; Erik A Schultes; Peter A C 't Hoen; Zuotian Tatum; Mark Thompson; Tetsuro Toyoda; Derek W Wright; Carsten O Daub; Masayoshi Itoh; Piero Carninci; Yoshihide Hayashizaki; Alistair R R Forrest; Hideya Kawaji
Journal: Genome Biol Date: 2015-01-05 Impact factor: 13.583

6. Lineage-Specific Genome Architecture Links Enhancers and Non-coding Disease Variants to Target Gene Promoters.

Authors: Biola M Javierre; Oliver S Burren; Steven P Wilder; Roman Kreuzhuber; Steven M Hill; Sven Sewitz; Jonathan Cairns; Steven W Wingett; Csilla Várnai; Michiel J Thiecke; Frances Burden; Samantha Farrow; Antony J Cutler; Karola Rehnström; Kate Downes; Luigi Grassi; Myrto Kostadima; Paula Freire-Pritchett; Fan Wang; Hendrik G Stunnenberg; John A Todd; Daniel R Zerbino; Oliver Stegle; Willem H Ouwehand; Mattia Frontini; Chris Wallace; Mikhail Spivakov; Peter Fraser
Journal: Cell Date: 2016-11-17 Impact factor: 41.582

7. Soft Sweeps Are the Dominant Mode of Adaptation in the Human Genome.

Authors: Daniel R Schrider; Andrew D Kern
Journal: Mol Biol Evol Date: 2017-08-01 Impact factor: 16.240

8. Interrogation of human hematopoiesis at single-cell and single-variant resolution.

Authors: Jacob C Ulirsch; Caleb A Lareau; Erik L Bao; Leif S Ludwig; Michael H Guo; Christian Benner; Ansuman T Satpathy; Vinay K Kartha; Rany M Salem; Joel N Hirschhorn; Hilary K Finucane; Martin J Aryee; Jason D Buenrostro; Vijay G Sankaran
Journal: Nat Genet Date: 2019-03-11 Impact factor: 38.330

9. Common DNA sequence variation influences 3-dimensional conformation of the human genome.

Authors: David U Gorkin; Yunjiang Qiu; Ming Hu; Kipper Fletez-Brant; Tristin Liu; Anthony D Schmitt; Amina Noor; Joshua Chiou; Kyle J Gaulton; Jonathan Sebat; Yun Li; Kasper D Hansen; Bing Ren
Journal: Genome Biol Date: 2019-11-28 Impact factor: 13.583

10. Cell-type-specific 3D epigenomes in the developing human cortex.

Authors: Michael Song; Mark-Phillip Pebworth; Xiaoyu Yang; Ming Hu; Armen Abnousi; Changxu Fan; Jia Wen; Jonathan D Rosen; Mayank N K Choudhary; Xiekui Cui; Ian R Jones; Seth Bergenholtz; Ugomma C Eze; Ivan Juric; Bingkun Li; Lenka Maliskova; Jerry Lee; Weifang Liu; Alex A Pollen; Yun Li; Ting Wang; Arnold R Kriegstein; Yin Shen
Journal: Nature Date: 2020-10-14 Impact factor: 49.962
