snpEnrichR: analyzing co-localization of SNPs and their proxies in genomic regions.

Kari Nousiainen¹, Kartiek Kanduri², Isis Ricaño-Ponce³, Cisca Wijmenga^3,4, Riitta Lahesmaa², Vinod Kumar^3,5, Harri Lähdesmäki^1,2.

Abstract

Motivation: Co-localization of trait-associated SNPs with specific transcription-factor binding sites or regulatory regions in the genome can yield profound insight into underlying causal mechanisms. The analysis is complicated because the truly causal SNPs are generally unknown: they can be either the SNPs reported in GWAS or proxy SNPs in linkage disequilibrium with them. Hence, a comprehensive pipeline for SNP co-localization analysis that utilizes all relevant information about both the genotyped SNPs and their proxies is needed.
Results: We developed an R package, snpEnrichR, for SNP co-localization analysis. The software integrates different tools for random SNP generation and genome co-localization analysis, automating the workflow and helping users build custom SNP co-localization analyses. We show via an example that including proxy SNPs in SNP co-localization analysis enhances the sensitivity of co-localization detection.
Availability and implementation: The software is available at https://github.com/kartiek/snpEnrichR.

Year: 2018 PMID: 29878048 PMCID: PMC6247941 DOI: 10.1093/bioinformatics/bty460

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Assessing co-localization of SNPs with given genomic regions requires an empirical hypothesis test. For a given population, SNPs have several quantifiable properties, such as allele frequency, the number of SNPs in linkage disequilibrium (LD), distance to the nearest gene and gene density, which can be used to draw random sets of SNPs whose characteristics are similar to those of the original SNP set. Such an empirical randomization approach provides a calibrated null distribution for co-localization analysis. Genome-wide association studies have successfully linked SNPs to various traits. So-called tag-SNPs are generally considered proxies for causal SNPs. Because it is difficult to pinpoint the actual causal SNPs for a phenotype, taking other SNPs in LD with the tag-SNPs into account may enhance the sensitivity of the co-localization analysis.

2 Materials and methods

The R package snpEnrichR facilitates SNP co-localization analysis by computing the required statistics, and it integrates with several existing tools to enable efficient and automated data management for the analysis. The package consists of five main functions: (i) getSNPs retrieves trait-associated SNPs directly from the NHGRI-EBI GWAS Catalog (MacArthur et al.). Alternatively, users can manually provide custom SNP lists. (ii) clumpSNPs detects linked SNPs in a list, removes the correlated SNPs, and returns a list of (decorrelated) tag-SNPs. Removing correlated SNPs from a SNP list is needed to avoid biases in random SNP set generation. (iii) submitSNPsnap connects to the SNPsnap server (Pers et al.) and sends a retrieval request to generate a specified number of randomly sampled SNP sets. Each set consists of randomly sampled SNPs whose properties are similar to those of the (decorrelated) tag-SNPs. (iv) findProxies expands a list of SNPs with all linked SNPs that lie within a genomic distance d and exceed a correlation level r², both set by the user. (v) analyzeEnrichment computes the overlap between the genomic regions and each of the randomly sampled SNP sets, which are extended to contain all SNPs that are in LD. These overlap scores form an empirical null distribution for the hypothesis test, and the empirical P-value is computed in the standard way by counting the number of times randomly sampled SNP sets have at least as many overlaps with the genomic regions as the original input SNP set (which is also extended with LD SNPs). Empirical P-values are computed for all input SNP lists (e.g. different diseases) separately, and the obtained P-values are corrected for multiple testing by the Benjamini–Hochberg method, providing false discovery rate (FDR) values.

The functions can easily be used as the basis of a SNP co-localization analysis pipeline. External tools are required only to lift over coordinates between genome builds so that they correspond to each other; for example, the GWAS Catalog uses build GRCh38 whereas SNPsnap relies on GRCh37. Because of the dependency on an external server and the resulting time lag in random SNP set generation, we suggest that the pipeline be run in two phases. snpEnrichR requires the R packages RSelenium, readr, dplyr, httr, utils, parallel, rtracklayer and GenomicFeatures, and the external software PLINK version 1.9 (Chang et al.).
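The empirical test and the multiple-testing correction described for analyzeEnrichment can be illustrated with a minimal Python sketch. It is not the package's R implementation: the function names, trait labels and overlap counts below are hypothetical, and the P-value is taken as the plain fraction of random sets with at least as many overlaps as the observed set, as the text describes (some analyses add a pseudocount of one to numerator and denominator instead).

```python
from typing import List, Sequence


def empirical_p_value(observed: int, null_overlaps: Sequence[int]) -> float:
    """Fraction of random SNP sets with at least as many overlaps as observed."""
    if not null_overlaps:
        raise ValueError("need at least one random SNP set")
    hits = sum(1 for n in null_overlaps if n >= observed)
    return hits / len(null_overlaps)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical data: observed overlap count per trait and a tiny null sample.
observed = {"trait_a": 9, "trait_b": 3}
null = {
    "trait_a": [2, 4, 3, 9, 1, 5, 2, 3, 4, 2],
    "trait_b": [2, 4, 3, 9, 1, 5, 2, 3, 4, 2],
}
traits = sorted(observed)
p_values = [empirical_p_value(observed[t], null[t]) for t in traits]
fdr = dict(zip(traits, benjamini_hochberg(p_values)))
```

With N random sets the smallest non-zero P-value this estimator can return is 1/N, so the number of random sets bounds the resolution of the test.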

2.1 Input files

snpEnrichR requires three user-specified data sources: (i) a list of genomic regions, (ii) a list of trait-associated SNPs, (iii) a processed version of 1000 Genomes Project phase 3 SNP data for the studied population in a format supported by PLINK, i.e. a binary biallelic genotype table (.bed), an extended variant information file (.bim) and a sample information file (.fam) (The 1000 Genomes Project Consortium, 2015). In our analyses, the 1000 Genomes data are annotated based on the genome coordinates, long indels and duplicate variants have been removed, and the data are filtered with the same quality control criteria used by SNPsnap, i.e. a minimum minor allele frequency of 0.01, a Hardy–Weinberg equilibrium test P-value threshold of 10⁻⁶ and a maximum missing genotype rate of 0.1. Note that in snpEnrichR the SNP files can be accessed directly from the NHGRI-EBI GWAS Catalog database, and the 1000 Genomes data are preprocessed into a PLINK-compatible format for convenience. All data are mapped to the human genome assembly hg19 and represented in a one-based coordinate system.
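The quality-control thresholds above map onto standard PLINK 1.9 filters. The sketch below only assembles such a command in Python and does not run it; it is an assumption about how the filtering could be reproduced, not the authors' preprocessing script. The executable name `plink` and the file prefixes are hypothetical, and the removal of long indels and duplicate variants is not covered.

```python
from typing import List


def plink_qc_command(bfile: str, out: str, maf: float = 0.01,
                     hwe: float = 1e-6, geno: float = 0.1) -> List[str]:
    """Build a PLINK 1.9 argument list applying the three QC thresholds."""
    return [
        "plink",              # assumed executable name; may differ per install
        "--bfile", bfile,     # prefix of the input .bed/.bim/.fam fileset
        "--maf", str(maf),    # drop variants with minor allele frequency below maf
        "--hwe", str(hwe),    # drop variants with HWE test P-value below hwe
        "--geno", str(geno),  # drop variants with missing call rate above geno
        "--make-bed",         # write a new binary fileset
        "--out", out,         # prefix of the output fileset
    ]


# Hypothetical file prefixes, for illustration only.
cmd = plink_qc_command("1000G_EUR_chr22", "1000G_EUR_chr22_qc")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once PLINK is installed and the input fileset exists.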

3 Example use case

To illustrate the utility and features of the tool, we applied it to study SNP co-localization in transcription factor STAT6 binding sites in human CD4+ T cells during early Th2 cell differentiation (Elo et al.). We downloaded STAT6 binding sites from the Gene Transcription Regulation Database (GTRD), which hosts transcription factor binding sites identified by ChIP-seq experiments (Yevshin et al.). The data consisted of STAT6 binding sites from five samples (EXP000514,…, EXP000518) of one biological replicate. After merging overlapping binding sites, there were 15340 binding sites with a median length of 421 bp. We fetched tag-SNPs of 11 immune-related and three non-immune-related diseases/traits in European ancestry from the NHGRI-EBI GWAS Catalog, removed tag-SNPs in the HLA region and converted the coordinates into the hg19 assembly. We used the snpEnrichR analysis pipeline with LD block parameters d = 100 kb and r² = 0.8, and used 1000 randomly generated SNP sets when computing empirical P-values. We used the tool to implement two analyses. The first pipeline computes the standard co-localizations using only the tag-SNPs, whereas the second considers the proxy SNPs as well. Figure 1 shows that including proxy SNPs enhances the sensitivity of co-localization analysis. When considering the tag-SNPs only, two of the immune-related trait-specific SNP co-localizations were detected, whereas five additional traits were identified as significantly enriched at STAT6 binding sites when proxy SNPs were taken into account. In addition, the inclusion of the proxy SNPs did not cause an artificial co-localization signal for non-immune-related traits, where the tag-SNPs did not co-localize with STAT6 binding sites.
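The difference between the tag-only and proxy-aware analyses can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the package's code: all names and data are hypothetical, everything lies on a single chromosome, regions are already merged and use one-based closed coordinates, and a tag-SNP is counted once if it or any of its proxies falls inside a region (the source does not state how overlaps are tallied).

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class LdPair(NamedTuple):
    tag: str
    proxy: str
    distance_bp: int
    r2: float


def find_proxies(tags: Iterable[str], ld_pairs: Iterable[LdPair],
                 max_distance_bp: int = 100_000,
                 min_r2: float = 0.8) -> Dict[str, Set[str]]:
    """Map each tag-SNP to itself plus its proxies within distance d and above r2."""
    proxies: Dict[str, Set[str]] = {t: {t} for t in tags}
    for pair in ld_pairs:
        if (pair.tag in proxies and pair.distance_bp <= max_distance_bp
                and pair.r2 >= min_r2):
            proxies[pair.tag].add(pair.proxy)
    return proxies


def count_overlapping_loci(proxies: Dict[str, Set[str]],
                           positions: Dict[str, int],
                           regions: List[Tuple[int, int]]) -> int:
    """Count tag-SNPs for which any member of the LD set lies in a region."""
    regions = sorted(regions)  # assumed merged, i.e. non-overlapping
    starts = [s for s, _ in regions]
    ends = [e for _, e in regions]

    def in_region(pos: int) -> bool:
        i = bisect_right(starts, pos) - 1  # last region starting at or before pos
        return i >= 0 and pos <= ends[i]

    return sum(
        any(snp in positions and in_region(positions[snp]) for snp in members)
        for members in proxies.values()
    )


# Hypothetical SNP positions, LD records and binding sites.
positions = {"rs1": 1000, "rs2": 1500, "rs3": 50_000, "rs4": 90_000}
ld = [LdPair("rs1", "rs2", 500, 0.95),
      LdPair("rs1", "rs3", 49_000, 0.50),   # fails the r2 threshold
      LdPair("rs4", "rs3", 40_000, 0.85)]
regions = [(1400, 1600), (49_900, 50_100)]
tags = ["rs1", "rs4"]

tag_only = count_overlapping_loci(find_proxies(tags, []), positions, regions)
with_proxies = count_overlapping_loci(find_proxies(tags, ld), positions, regions)
```

In this toy data neither tag-SNP lies in a binding site, yet both loci overlap once proxies are included, which is the kind of sensitivity gain the use case reports. The same expansion has to be applied to every random SNP set so that the null distribution stays comparable.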

Fig. 1.

Co-localization results for SNPs from 11 immune-related and 3 non-immune-related diseases in STAT6 binding sites in human CD4+ T cells during early differentiation. Dashed line corresponds to the corrected P-value (FDR) of 0.05

4 Discussion and conclusion

We have implemented the R package snpEnrichR to facilitate automated SNP co-localization analysis. The tool provides all major functionalities needed in co-localization analysis: an interface to fetch trait-specific SNPs, a tool for detecting and filtering clumped SNPs, access to a web server that uses best practices in generating random SNP sets that maintain the characteristics of a given input SNP set, and the computation of proxy SNPs as well as co-localization tests. Being an R package, snpEnrichR also integrates flexibly and easily with related analyses a user may have. Additional examples of this approach were recently reported (Tripathi et al.; Ullah et al.).

Funding

This work has been supported by the Academy of Finland [Centre of Excellence in Molecular Systems Immunology and Physiology Research (2012-2017) grant 250114; as well as the project 292832]. R.L. was supported by the Academy of Finland (AoF) grants 292335, 294337, 292482, 31444 and by grants from the JDRF, the Sigrid Jusélius Foundation and the Finnish Cancer Foundation.

Conflict of Interest: none declared.

References (8 in total)

1. SNPsnap: a Web-based tool for identification and annotation of matched SNPs.

Authors: Tune H Pers; Pascal Timshel; Joel N Hirschhorn
Journal: Bioinformatics Date: 2014-10-13 Impact factor: 6.937

2. Genome-wide profiling of interleukin-4 and STAT6 transcription factor regulation of human Th2 cell programming.

Authors: Laura L Elo; Henna Järvenpää; Soile Tuomela; Sunil Raghav; Helena Ahlfors; Kirsti Laurila; Bhawna Gupta; Riikka J Lund; Johanna Tahvanainen; R David Hawkins; Matej Oresic; Harri Lähdesmäki; Omid Rasool; Kanury V Rao; Tero Aittokallio; Riitta Lahesmaa
Journal: Immunity Date: 2010-06-25 Impact factor: 31.745

3. Genome-wide Analysis of STAT3-Mediated Transcription during Early Human Th17 Cell Differentiation.

Authors: Subhash K Tripathi; Zhi Chen; Antti Larjo; Kartiek Kanduri; Kari Nousiainen; Tarmo Äijo; Isis Ricaño-Ponce; Barbara Hrdlickova; Soile Tuomela; Essi Laajala; Verna Salo; Vinod Kumar; Cisca Wijmenga; Harri Lähdesmäki; Riitta Lahesmaa
Journal: Cell Rep Date: 2017-05-30 Impact factor: 9.423

4. Second-generation PLINK: rising to the challenge of larger and richer datasets.

Authors: Christopher C Chang; Carson C Chow; Laurent Cam Tellier; Shashaank Vattikuti; Shaun M Purcell; James J Lee
Journal: Gigascience Date: 2015-02-25 Impact factor: 6.524

5. A global reference for human genetic variation.

Authors: Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

6. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog).

Authors: Jacqueline MacArthur; Emily Bowler; Maria Cerezo; Laurent Gil; Peggy Hall; Emma Hastings; Heather Junkins; Aoife McMahon; Annalisa Milano; Joannella Morales; Zoe May Pendlington; Danielle Welter; Tony Burdett; Lucia Hindorff; Paul Flicek; Fiona Cunningham; Helen Parkinson
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

7. GTRD: a database of transcription factor binding sites identified by ChIP-seq experiments.

Authors: Ivan Yevshin; Ruslan Sharipov; Tagir Valeev; Alexander Kel; Fedor Kolpakov
Journal: Nucleic Acids Res Date: 2016-10-24 Impact factor: 16.971

8. Transcriptional Repressor HIC1 Contributes to Suppressive Function of Human Induced Regulatory T Cells.

Authors: Syed Bilal Ahmad Andrabi; Subhash Kumar Tripathi; Obaiah Dirasantha; Kartiek Kanduri; Sini Rautio; Catharina C Gross; Sari Lehtimäki; Kanchan Bala; Johanna Tuomisto; Urvashi Bhatia; Deepankar Chakroborty; Laura L Elo; Harri Lähdesmäki; Heinz Wiendl; Omid Rasool; Riitta Lahesmaa
Journal: Cell Rep Date: 2018-02-20 Impact factor: 9.423
