Variant Set Enrichment: an R package to identify disease-associated functional genomic regions.

Musaddeque Ahmed^1,2, Richard C Sallari^3, Haiyang Guo^1,2, Jason H Moore^4, Housheng Hansen He^1,2, Mathieu Lupien^1,5.

Abstract

BACKGROUND: Genetic predispositions to diseases populate the noncoding regions of the human genome. Delineating their functional basis can inform on the mechanisms contributing to disease development. However, this remains a challenge due to the poor characterization of the noncoding genome. Here, we propose an R package that can pinpoint which genomic features are etiologically important based on the genetic predispositions.
RESULTS: Variant Set Enrichment (VSE) is an R package to calculate the enrichment of a set of disease-associated variants across functionally annotated genomic regions, consequently highlighting the mechanisms important in the etiology of the disease studied.
CONCLUSIONS: VSE is implemented as an R package and can easily be installed and run on any system with R.

Keywords: AVS; Disease; Enrichment; GWAS; Noncoding region; Regulatory region

Year: 2017 PMID: 28239419 PMCID: PMC5320724 DOI: 10.1186/s13040-017-0129-5

Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522

Background

Over 80% of the genetic predispositions to human diseases identified by Genome Wide Association Studies (GWAS), namely risk-loci populated by Single Nucleotide Polymorphisms (SNPs), map to noncoding DNA [1-3]. In other words, most disease-associated SNPs do not directly alter coding sequences. Over the last decade, the functional annotation of the coding and noncoding genome across a wide collection of cell and tissue types has benefited from the integration of maps of transcriptional activity from both coding and noncoding transcripts, such as miRNAs and long noncoding RNAs (lncRNAs), as well as chromatin-protein binding profiles, inclusive of transcription factors and epigenetic modifications, and open chromatin. This functional annotation provides a unique opportunity to delineate the functional basis of genetic predispositions to disease. Here, we present a computational method, named Variant Set Enrichment (VSE), that computes the enrichment/depletion of the set of genetic predispositions for a disease of interest over functional genomic annotations. We previously used a VSE-based approach to identify the enrichment of Breast Cancer (BCa) genetic predispositions at enhancers bound by FOXA1 and ESR1 in breast cancer cells [1]. VSE relies on the set of genetic predispositions and functional annotations; this renders VSE applicable to the study of any genetically inherited disease for which these data are available.

Implementation

A genetic predisposition (risk-locus) identified by GWAS corresponds to a SNP found on the GWAS array (termed a “tagSNP”) together with all SNPs that are missing from the array but known to be in Linkage Disequilibrium (LD) with the tagSNP (termed “ldSNPs”) [4]. The sum of all genetic predispositions to a particular disease, i.e., all the tagSNPs and their ldSNPs, constitutes the Associated Variant Set (AVS) for that disease. The identity of risk-loci is user defined, as the cut-off for LD determination is subject to study preferences. Occasionally, two or more risk-loci for a particular disease overlap with one another through a common ldSNP. If the common ldSNP overlaps with a functional genomic annotation of interest, the enrichment score calculated by VSE can be inflated because each risk-locus that includes this ldSNP would be counted independently. To correct for this possibility, VSE computes a network of all SNPs in which each SNP is represented as a node and each pairwise LD relationship as an edge. Each cluster in the network represents a disjoint locus, so that an ldSNP is present in only one locus (Additional file 1: Figure S1). VSE then computes the enrichment score of the AVS for each functional genomic annotation of interest in three sequential steps. In the first step, VSE tallies the number of independent risk-loci that overlap with the functional genomic annotation. A risk-locus overlaps an annotation when at least one of its member SNPs lies within it. This preliminary tally may indicate which genomic annotations are functionally related to risk-associated variants, but the overlap can be affected by the size and structure of the AVS. To correct for these biases, VSE, in the second step, computes a null distribution of the overlap tallies based on random permutation of the AVS.
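The locus clustering and the first-step tally described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of the VSE package: the helper names (`cluster_loci`, `tally_overlaps`), the SNP identifiers, the coordinates and the half-open interval convention are all hypothetical and chosen only for the example.

```python
from collections import defaultdict


def cluster_loci(ld_pairs):
    """Group SNPs into disjoint loci (connected components of the LD graph).

    ld_pairs: iterable of (snp_a, snp_b) pairs considered to be in LD.
    A tagSNP without any ldSNP can be passed as a self-pair (tag, tag).
    Hypothetical helper for illustration; not part of the VSE package.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in ld_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    loci = defaultdict(set)
    for snp in list(parent):
        loci[find(snp)].add(snp)
    return list(loci.values())


def tally_overlaps(loci, positions, annotation):
    """Count loci with at least one member SNP inside an annotated interval.

    positions: dict snp -> (chrom, pos)
    annotation: list of (chrom, start, end), treated as half-open
    intervals [start, end) (an assumed convention).
    """
    def in_annotation(snp):
        chrom, pos = positions[snp]
        return any(c == chrom and start <= pos < end
                   for c, start, end in annotation)

    return sum(1 for locus in loci if any(in_annotation(s) for s in locus))


# Toy data: tag1 and tag2 share the ldSNP "ld_b", so they merge into one locus.
ld_pairs = [("tag1", "ld_a"), ("tag1", "ld_b"),
            ("tag2", "ld_b"), ("tag2", "ld_c"),
            ("tag3", "ld_d")]
positions = {"tag1": ("chr1", 50), "ld_a": ("chr1", 60), "ld_b": ("chr1", 150),
             "tag2": ("chr1", 250), "ld_c": ("chr1", 260),
             "tag3": ("chr2", 500), "ld_d": ("chr2", 510)}
annotation = [("chr1", 100, 200)]

loci = cluster_loci(ld_pairs)
n_overlapping = tally_overlaps(loci, positions, annotation)
print(len(loci), n_overlapping)
```

In this toy input, counting each tagSNP separately would credit the annotation twice (once for tag1 and once for tag2, both through ld_b), whereas the clustered loci credit it once.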
The null AVS is computed by randomly sampling SNPs from a comprehensive pool of tagSNPs present on GWAS arrays (Illumina Human OmniExpress) and clustering them with their ldSNPs imputed from the 1000 Genomes Project Phase III data. When building the null AVSs, VSE ensures that each set contains the same total number of null loci as there are risk-loci in the AVS, and that each null locus is matched in size to the corresponding query locus. We refer to each null AVS as a Matched Random Variant Set (MRVS). In the third step, VSE tallies the overlap of each MRVS with the functional genomic annotations of interest. This provides the null distribution used to calculate the enrichment/depletion of the AVS across different functional genomic annotations. To make the enrichment analysis comparable across all functional genomic annotations of interest, the MRVS tallies are centered at the median and scaled to the standard deviation of the null distribution. The enrichment score is then defined as the number of standard deviations by which the AVS overlap tally deviates from the median of the null overlap tallies. VSE calculates an exact P-value for the significance of the enrichment/depletion by fitting a density function to the null distribution derived from the MRVS. The level of significance is corrected for multiple testing using the Bonferroni method. Deviation of the null distribution from normality is tested using the Kolmogorov-Smirnov test; if the distribution deviates, a Box-Cox power transformation is applied to the null to bring it closer to normality.
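The size-matched sampling and the scoring steps can be illustrated with a short Python sketch. It is not the VSE implementation: the function names, the `pool_by_size` structure, sampling with replacement, and the use of a standard normal upper tail in place of the fitted density are all assumptions made for the example, and the normality test and Box-Cox step are omitted.

```python
import random
import statistics


def draw_mrvs(locus_sizes, pool_by_size, rng):
    """Draw one matched random variant set: for each query locus, pick a
    random null locus with the same number of SNPs.

    pool_by_size: dict size -> list of candidate null loci (a hypothetical
    pre-computed structure; sampling with replacement is an assumption).
    """
    return [rng.choice(pool_by_size[size]) for size in locus_sizes]


def enrichment_score(observed, null_tallies):
    """Number of null standard deviations between the observed tally
    and the median of the null tallies."""
    center = statistics.median(null_tallies)
    spread = statistics.stdev(null_tallies)
    if spread == 0:
        raise ValueError("null distribution has zero variance")
    return (observed - center) / spread


def upper_tail_p(score):
    """One-sided P-value for enrichment under an assumed standard normal null."""
    return 1.0 - statistics.NormalDist().cdf(score)


def bonferroni(p_values):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy usage with made-up identifiers and tallies.
rng = random.Random(0)
pool_by_size = {1: [["n1"], ["n2"]], 2: [["n3", "n4"], ["n5", "n6"]]}
mrvs = draw_mrvs([2, 1, 2], pool_by_size, rng)

null_tallies = [2, 3, 4, 5, 6]   # overlap tallies of five hypothetical MRVS
score = enrichment_score(9, null_tallies)
p_value = upper_tail_p(score)
adjusted = bonferroni([p_value, 0.5])
print(round(score, 3), adjusted)
```

A real analysis would use far more null sets than the five shown here, since the median and standard deviation of the null are only as stable as the number of MRVS allows.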

Results

The usefulness and impact of VSE are demonstrated by calculating the enrichment of SNPs associated with four cancer types for DNase I Hypersensitivity Sites (DHS) and a set of histone marks profiled genome-wide in cell lines relevant to each cancer type. We compiled 72, 92, 16 and 36 significantly associated SNPs, or tagSNPs, for Prostate Cancer (PRAD), Breast Cancer (BCa), Lung Cancer (LUAD) and Colorectal Cancer (COAD), respectively, from the NHGRI catalog [5]. We computed the LD structure by finding all SNPs in the European population of the 1000 Genomes Project that are in LD with the tagSNPs at r² ≥ 0.8 [6]. The DNase-seq and ChIP-seq data for H3K4me1 (enhancer), H3K4me3 (promoter), H3K27ac (enhancer and promoter), H3K36me3 (gene body) and H3K27me3 (repressive region) for each of the MCF7 (BCa), LNCaP (PRAD), A549 (LUAD) and HCT-116/Caco-2 (COAD) cell lines were compiled from ENCODE data [7] and complemented by data from independent studies [8, 9] (Additional file 1: Table S1). VSE checks the normality of the background distribution with the Kolmogorov-Smirnov test and applies a transformation if necessary (Additional file 1: Figure S2). The VSE results show that the BCa and PRAD AVS are significantly enriched in DHS and in regions carrying the H3K27ac mark in breast and prostate cancer cells, respectively (Fig. 1; Additional file 1: Figures S3 and S4). In contrast, SNPs associated with LUAD are enriched only in regions carrying the H3K36me3 mark (Fig. 1). In our cross-validation analysis, we observed that the enrichment of an AVS is cancer-type specific; e.g., the BCa AVS is enriched in DHS only in MCF7 cells, not in other cells (Additional file 1: Figure S5). The enrichment of distinct cancer AVS across different functional genomic regions argues for a unique biology affected by genetic predispositions across cancer types.
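For readers unfamiliar with the r² threshold used to define ldSNPs, the following Python sketch computes the standard pairwise r² from phased haplotypes and keeps candidates at or above 0.8. The function names, the 0/1 allele coding and the toy haplotypes are assumptions for illustration only; the analysis above used LD derived from 1000 Genomes Project data, not this code.

```python
def r_squared(haplotypes):
    """Pairwise LD r^2 from phased haplotypes.

    haplotypes: list of (allele_a, allele_b) tuples, alleles coded 0/1.
    r^2 = D^2 / (pA (1 - pA) pB (1 - pB)), with D = pAB - pA * pB.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    d = p_ab - p_a * p_b
    return d * d / denom


def ld_partners(candidates, threshold=0.8):
    """Return candidate SNPs whose r^2 with the tagSNP meets the threshold.

    candidates: dict snp_id -> haplotypes pairing the tagSNP allele with
    that candidate's allele (hypothetical input layout).
    """
    return [snp for snp, haps in candidates.items()
            if r_squared(haps) >= threshold]


# Toy haplotypes: one perfectly linked candidate, one independent candidate.
candidates = {
    "snp_linked": [(1, 1), (1, 1), (0, 0), (0, 0)],
    "snp_unlinked": [(1, 1), (1, 0), (0, 1), (0, 0)],
}
partners = ld_partners(candidates)
print(partners)
```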

Fig. 1

Enrichment of Breast, Prostate, Lung and Colorectal cancer AVS across different genomic maps in cancer-type specific cells. The box and whisker plots show the enrichment score distribution of the matched null sets. The bar inside the box corresponds to the median enrichment score of the null sets. The significantly enriched genomic regions (Bonferroni corrected P-value < 0.01) are marked in red. The histone modifications are profiled in MCF7, LNCaP, A549 and HCT-116/Caco-2 for breast, prostate, lung and colorectal cancer, respectively.

Conclusions

A set of genetic variants that are strongly associated with a particular disease holds clues about the mechanisms underlying the development of that disease. VSE provides an easy approach to delineate such information by pinpointing the genomic features that are most affected by the genetic predispositions to that particular disease. In our preliminary analysis, we demonstrate that the genetic variants associated with prostate cancer and breast cancer are significantly over-represented in regulatory regions, while the variants associated with lung cancer are enriched in coding regions. VSE can easily be run in R on any platform. A usage vignette is available on the VSE web page in the CRAN repository and in Additional file 2.

References

1. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits.

Authors: Lucia A Hindorff; Praveen Sethupathy; Heather A Junkins; Erin M Ramos; Jayashri P Mehta; Francis S Collins; Teri A Manolio
Journal: Proc Natl Acad Sci U S A Date: 2009-05-27 Impact factor: 11.205

2. Systematic localization of common disease-associated variation in regulatory DNA.

Authors: Matthew T Maurano; Richard Humbert; Eric Rynes; Robert E Thurman; Eric Haugen; Hao Wang; Alex P Reynolds; Richard Sandstrom; Hongzhu Qu; Jennifer Brody; Anthony Shafer; Fidencio Neri; Kristen Lee; Tanya Kutyavin; Sandra Stehling-Sun; Audra K Johnson; Theresa K Canfield; Erika Giste; Morgan Diegel; Daniel Bates; R Scott Hansen; Shane Neph; Peter J Sabo; Shelly Heimfeld; Antony Raubitschek; Steven Ziegler; Chris Cotsapas; Nona Sotoodehnia; Ian Glass; Shamil R Sunyaev; Rajinder Kaul; John A Stamatoyannopoulos
Journal: Science Date: 2012-09-05 Impact factor: 47.728

3. Linking disease associations with regulatory information in the human genome.

Authors: Marc A Schaub; Alan P Boyle; Anshul Kundaje; Serafim Batzoglou; Michael Snyder
Journal: Genome Res Date: 2012-09 Impact factor: 9.043

4. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations.

Authors: Danielle Welter; Jacqueline MacArthur; Joannella Morales; Tony Burdett; Peggy Hall; Heather Junkins; Alan Klemm; Paul Flicek; Teri Manolio; Lucia Hindorff; Helen Parkinson
Journal: Nucleic Acids Res Date: 2013-12-06 Impact factor: 16.971

5. An integrated encyclopedia of DNA elements in the human genome.

Authors: ENCODE Project Consortium
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

6. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

7. Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression.

Authors: Richard Cowper-Sal lari; Xiaoyang Zhang; Jason B Wright; Swneke D Bailey; Michael D Cole; Jerome Eeckhoute; Jason H Moore; Mathieu Lupien
Journal: Nat Genet Date: 2012-09-23 Impact factor: 38.330

8. Comprehensive functional annotation of 77 prostate cancer risk loci.

Authors: Dennis J Hazelett; Suhn Kyong Rhie; Malaina Gaddis; Chunli Yan; Daniel L Lakeland; Simon G Coetzee; Brian E Henderson; Houtan Noushmehr; Wendy Cozen; Zsofia Kote-Jarai; Rosalind A Eeles; Douglas F Easton; Christopher A Haiman; Wange Lu; Peggy J Farnham; Gerhard A Coetzee
Journal: PLoS Genet Date: 2014-01-30 Impact factor: 5.917

9. Reconfiguration of nucleosome-depleted regions at distal regulatory elements accompanies DNA methylation of enhancers and insulators in cancer.

Authors: Phillippa C Taberlay; Aaron L Statham; Theresa K Kelly; Susan J Clark; Peter A Jones
Journal: Genome Res Date: 2014-06-10 Impact factor: 9.043
