Hao Mei1,2, Lianna Li3, Fan Jiang2, Jeannette Simino4, Michael Griswold4, Thomas Mosley5, Shijian Liu6.
Abstract
Genome-wide studies (GWS) of SNP associations and differential gene expression have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/.
Keywords: SNP; gene; gene set; genetic annotation
Year: 2016 PMID: 27807048 PMCID: PMC5144977 DOI: 10.1534/g3.116.034694
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Systematic components of the snpGeneSets package. (A) Local genomic knowledge base: it parses the public NCBI dbSNP, Entrez Gene, and MSigDB databases to generate the SNP map (“snpmap”), gene map (“genemap”), gene sets (“geneSets”), and gene information (“geneInfo”) tables. (B) Three main annotations: (1) genomic mapping annotations for SNPs and genes, and functional annotation for gene sets; (2) relation mapping annotations between SNPs and genes, and between genes and gene sets; and (3) analysis-based annotation for measuring genes from SNP associations and testing gene set enrichment. (C) Auxiliary functions: they aim to support the first two major components (A and B), including identification of SNPs and genes from a defined genomic region, retrieval of genes and gene sets from a particular gene set category, permutation test and p-value calculations for gene set enrichment for genes, computation of the U-score for genes, and creation of a gene set database. MSigDB, Molecular Signatures Database; NCBI, National Center for Biotechnology Information; SNP, single nucleotide polymorphism.
Gene effect measures computed from the SNP association p-values
| Method | Gene Measure | Description |
|---|---|---|
| minP | $p_{(1)}$ | The minimum p-value among SNPs in the gene |
| 2ndP | $p_{(2)}$ | The second smallest p-value of SNPs in the gene |
| simP | $\min_{1 \le i \le K} \{ K p_{(i)} / i \}$ | Simes’ p-value adjusted for the number of SNPs |
| fishP | $P\left(\chi^2_{2K} \ge -2 \sum_{i=1}^{K} \ln p_i\right)$ | Fisher’s combined p-value |
$p_1, p_2, \ldots, p_K$: the association p-values of the K SNPs located in the same gene; $p_{(1)} \le p_{(2)} \le \ldots \le p_{(K)}$: the ordered association p-values of the K SNPs; $\chi^2_{2K}$: a random variable that follows a chi-square distribution with 2K degrees of freedom. SNP, single nucleotide polymorphism.
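The four gene measures can be illustrated with a minimal Python sketch. It assumes the textbook definitions of Simes’ and Fisher’s methods and is not the snpGeneSets implementation; the function name `gene_measures` and the single-SNP fallback for 2ndP are hypothetical choices made for illustration.

```python
import math


def gene_measures(pvals):
    """Return minP, 2ndP, simP and fishP for one gene's SNP p-values.

    Assumes every p-value lies in (0, 1]; this is an illustrative sketch,
    not the snpGeneSets code.
    """
    p = sorted(pvals)
    k = len(p)
    min_p = p[0]
    # Assumption: with a single SNP there is no second p-value, so reuse minP.
    second_p = p[1] if k > 1 else p[0]
    # Simes: the smallest of K * p_(i) / i over the ordered p-values.
    sim_p = min(k * p_i / rank for rank, p_i in enumerate(p, start=1))
    # Fisher: X = -2 * sum(ln p_i) follows chi-square(2K) under the null.
    # For even degrees of freedom the upper tail has the closed form
    # exp(-x/2) * sum_{j=0}^{K-1} (x/2)^j / j!, so no stats library is needed.
    half_x = -sum(math.log(p_i) for p_i in p)
    term, tail = 1.0, 1.0
    for j in range(1, k):
        term *= half_x / j
        tail += term
    fish_p = math.exp(-half_x) * tail
    return {"minP": min_p, "2ndP": second_p, "simP": sim_p, "fishP": fish_p}


measures = gene_measures([0.01, 0.04, 0.20])
```

For a gene with a single SNP, all four measures in this sketch reduce to that SNP’s p-value, which is a convenient sanity check.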
Two types of gene set enrichment tests
| | Type 1 Test (CGEA) | Type 2 Test (USGSA) |
|---|---|---|
| H1 hypothesis | Candidate genes (Φ) are enriched in the tested gene set (Ω) of a particular category ( | GWS genes with high ranked |
| Parameters | | |
| Effect | | |
| SE | | |
| Exact p-value | | |
CGEA, candidate gene enrichment analysis; USGSA, uniform-score gene set analysis; KEGG, Kyoto Encyclopedia of Genes and Genomes; GWS, genome-wide study.
Genomic mapping annotation for T2D genes from the GWAS catalog
| Gene_ID | Gene_name | Full_name | Chr. | Start1 (bp) | End1 (bp) | Start2 (bp) | End2 (bp) | Strand |
|---|---|---|---|---|---|---|---|---|
| 2645 | GCK | Glucokinase | 7 | 44,183,870 | 44,229,022 | 44,144,271 | 44,189,423 | − |
| 3172 | HNF4A | Hepatocyte nuclear factor 4 α | 20 | 42,984,441 | 43,061,485 | 44,355,801 | 44,432,845 | + |
| 6927 | HNF1A | HNF1 homeobox A | 12 | 121,415,861 | 121,440,315 | 120,978,058 | 121,002,512 | + |
| 6928 | HNF1B | HNF1 homeobox B | 17 | 36,046,434 | 36,105,096 | 37,686,431 | 37,745,105 | − |
“Start1” and “End1”, gene transcript start and end position based on GRCh37. “Start2” and “End2”, gene transcript start and end position based on GRCh38. Chr., chromosome.
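As a rough illustration of how a SNP position might be related to genes through such a coordinate table, the sketch below looks up which of the four genes contain a given GRCh37 position. The `GENEMAP` structure, the function `genes_at`, and the optional `flank` parameter are hypothetical and do not reflect the package’s actual functions or table schema.

```python
# GRCh37 coordinates (Start1, End1) copied from the table above.
GENEMAP = [
    {"gene_id": 2645, "name": "GCK", "chr": "7", "start": 44183870, "end": 44229022},
    {"gene_id": 3172, "name": "HNF4A", "chr": "20", "start": 42984441, "end": 43061485},
    {"gene_id": 6927, "name": "HNF1A", "chr": "12", "start": 121415861, "end": 121440315},
    {"gene_id": 6928, "name": "HNF1B", "chr": "17", "start": 36046434, "end": 36105096},
]


def genes_at(chrom, pos, flank=0):
    """Return names of genes whose interval, widened by `flank` bp, contains pos.

    `flank` is an assumed convenience parameter for including nearby genes.
    """
    return [
        g["name"]
        for g in GENEMAP
        if g["chr"] == chrom and g["start"] - flank <= pos <= g["end"] + flank
    ]


inside = genes_at("7", 44200000)               # position within GCK
nearby = genes_at("7", 44230000, flank=2000)   # just downstream of End1
```

A linear scan is adequate for four genes; a genome-wide table would normally call for an indexed or interval-based lookup.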
The strongest gene effects identified in the Finnish T2D-GWAS
| Method | Gene Measure | Gene_ID | Gene_name | Chr. | Strand | Start1 (bp) | End1 (bp) | Start2 (bp) | End2 (bp) |
|---|---|---|---|---|---|---|---|---|---|
| | 2.38E−06 | 57537 | SORCS2 | 4 | + | 7,194,374 | 7,744,564 | 7,192,647 | 7,742,837 |
| | 1.51E−05 | 6934 | TCF7L2 | 10 | + | 114,709,978 | 114,927,437 | 112,950,219 | 113,167,678 |
| | 3.69E−05 | 406914 | MIR127 | 14 | + | 101,349,316 | 101,349,412 | 100,882,979 | 100,883,075 |
| | | 406927 | MIR136 | 14 | + | 101,351,039 | 101,351,120 | 100,884,702 | 100,884,783 |
| | | 574034 | MIR433 | 14 | + | 101,348,223 | 101,348,315 | 100,881,886 | 100,881,978 |
| | | 574451 | MIR432 | 14 | + | 101,350,820 | 101,350,913 | 100,884,483 | 100,884,576 |
| | 7.09E−17 | 4008 | LMO7 | 13 | + | 76,194,570 | 76,434,006 | 75,620,434 | 75,859,870 |
“Start1” and “End1”, gene transcript start and end position based on GRCh37. “Start2” and “End2”, gene transcript start and end position based on GRCh38. Chr., chromosome.
Figure 2. Q-Q plot of the gene effect measures. The observed sample quantiles of the minP, 2ndP, simP, and fishP gene measures against the theoretical quantiles of the uniform distribution.
Figure 3. Gene measures and U-scores for GCK, HNF4A, HNF1A, and HNF1B using the type 2 diabetes genome-wide association study. The gene measure values and transformed U-scores for all four genes using (A) minP, (B) 2ndP, (C) simP, and (D) fishP.
Test of T2D-mapped genes from the GWAS catalog in the T2D-GWAS and T2D-GWES data
| Data | N | Gene Measure | Mean1 | Stat1 | P1 | Mean2 | Stat2 | P2 |
|---|---|---|---|---|---|---|---|---|
| T2D-GWAS | 102 | | 19.61% | 3.70 | 1.77E−04 | 32.35% | 4.80 | 2.73E−06 |
| T2D-GWAS | 102 | | 17.65% | 3.33 | 5.99E−04 | 36.27% | 5.49 | 1.49E−07 |
| T2D-GWAS | 102 | | 9.80% | 1.62 | 0.054 | 15.69% | 1.57 | 0.060 |
| T2D-GWAS | 102 | | 27.45% | 5.06 | 9.58E−07 | 41.18% | 6.37 | 2.88E−09 |
| T2D-GWES | 86 | p-value | 12.79% | 2.15 | 0.017 | 22.09% | 2.69 | 0.004 |
N, the number of T2D-mapped genes from the GWAS catalog that were measured in the T2D-GWAS or T2D-GWES data; Mean1, percent of genes with U-scores ≤ 0.05; Stat1 and P1, the t statistic and p-value for testing Mean1 > 5%; Mean2, percent of genes with U-scores ≤ 0.10; Stat2 and P2, the t statistic and p-value for testing Mean2 > 10%. T2D, type 2 diabetes; GWAS, genome-wide association study; GWES, genome-wide expression study.
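The Stat1 and Stat2 values appear consistent with a one-sample t statistic computed on per-gene 0/1 indicators (U-score at or below the threshold), using the sample standard deviation. The sketch below works under that assumption, which is inferred from the reported numbers rather than taken from the package; the function name `indicator_t_stat` is hypothetical, and the hit counts (20 and 33 of 102) are back-calculated from the reported percentages.

```python
import math


def indicator_t_stat(hits, n, null_rate):
    """One-sample t statistic for H1: mean of 0/1 indicators > null_rate.

    hits: number of genes whose U-score is at or below the threshold
    n: number of genes tested
    null_rate: proportion expected under the null (0.05 or 0.10 here)
    """
    mean = hits / n
    # Sample variance (n - 1 denominator) of a 0/1 variable.
    sample_var = mean * (1 - mean) * n / (n - 1)
    se = math.sqrt(sample_var / n)
    return (mean - null_rate) / se


# First T2D-GWAS row: 19.61% of 102 genes is 20 genes; 32.35% is 33 genes.
stat1 = indicator_t_stat(20, 102, 0.05)
stat2 = indicator_t_stat(33, 102, 0.10)
```

Rounded to two decimals, `stat1` and `stat2` give 3.70 and 4.80, matching the first row of the table. The one-sided p-value would then come from the upper tail of a t distribution with N − 1 degrees of freedom.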
The strongest KEGG pathway identified by the USGSA (type 2) enrichment analysis of the Finnish T2D-GWAS
| Measure | Genes | PID | Size | SetGenes | Effect (%) | SE (%) | p | p_perm | p_table |
|---|---|---|---|---|---|---|---|---|---|
| | 289 | 2901 | 69 | 17 | 17.8 | 3.0 | 4.63E−07 | 0 | 0.0003 |
| | 274 | 2901 | 69 | 19 | 21.0 | 3.0 | 5.80E−09 | 0 | 0 |
| | 214 | 2841 | 50 | 8 | 10.9 | 3.1 | 7.65E−04 | 0.31 | 0.26 |
| | 309 | 2901 | 69 | 21 | 23.1 | 3.1 | 1.28E−09 | 0 | 0 |
Genes, the number of GWAS genes with U-score ≤ 0.05 used for the enrichment analysis; PID, the pathway ID; Size, the number of GWAS genes belonging to a Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway; SetGenes, the number of GWAS genes belonging to a KEGG pathway with U-score ≤ 0.05; p, unadjusted p-value; p_perm, the adjusted p-value based on 1000 permutations; p_table, the adjusted p-value based on the pregenerated distribution table. GWAS, genome-wide association study; GWES, genome-wide expression study.