
SparkINFERNO: a scalable high-throughput pipeline for inferring molecular mechanisms of non-coding genetic variants.

Pavel P Kuksa¹, Chien-Yueh Lee¹, Alexandre Amlie-Wolf¹,², Prabhakaran Gangadharan¹, Elizabeth E Mlynarski¹, Yi-Fan Chou¹, Han-Jen Lin¹, Heather Issen¹, Emily Greenfest-Allen¹,³, Otto Valladares¹, Yuk Yee Leung¹, Li-San Wang¹,³.

Abstract

SUMMARY: We report Spark-based INFERence of the molecular mechanisms of NOn-coding genetic variants (SparkINFERNO), a scalable bioinformatics pipeline for characterizing non-coding genome-wide association study (GWAS) findings. SparkINFERNO prioritizes causal variants underlying GWAS association signals and reports the relevant regulatory elements, tissue contexts and plausible target genes they affect. To achieve this, the SparkINFERNO algorithm integrates GWAS summary statistics with a large-scale collection of functional genomics datasets spanning enhancer activity, transcription factor binding, expression quantitative trait loci and other functional data across more than 400 tissues and cell types. Scalability is achieved by an underlying API implemented using Apache Spark and Giggle-based genomic indexing. We evaluated SparkINFERNO on large GWASs and show that it is more than 60 times more efficient than the original INFERNO implementation and scales with data size and the amount of computational resources.
AVAILABILITY AND IMPLEMENTATION: SparkINFERNO runs on clusters or a single server with Apache Spark environment, and is available at https://bitbucket.org/wanglab-upenn/SparkINFERNO or https://hub.docker.com/r/wanglab/spark-inferno. CONTACT: lswang@pennmedicine.upenn.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2020 PMID: 32330239 PMCID: PMC7320617 DOI: 10.1093/bioinformatics/btaa246

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Genome-wide association studies (GWASs) have identified over 70 000 genetic variants associated with more than 3000 human diseases and phenotypes (Buniello et al.). Interpretation of these associations remains difficult (Amlie-Wolf et al.; Watanabe et al.) because most GWAS hits lie in the non-coding genome. The resolution of GWAS is also limited, as neighboring variants show similar associations due to linkage disequilibrium (LD) (Amlie-Wolf et al.). Our recently developed INFERNO method (Amlie-Wolf et al.) focuses on identifying potentially causal variants underlying observed GWAS associations by integrating them with hundreds of functional genomics datasets. The current INFERNO implementation is not optimized for big data, and systematic large-scale genomic and genetic analyses need a scalable framework for high-throughput annotation of the genetic variants and genomic regions generated by human genetic studies. The scale and heterogeneity of functional genomics datasets and annotations necessitate systematic, integrative analysis and interpretation of GWAS association findings. For example, while INFERNO uses a relatively small set of functional genomics datasets, projects such as GTEx (Aguet et al.), FANTOM5 (Andersson et al.), ENCODE (Bernstein et al.) and Roadmap Epigenomics (Kundaje et al.) have produced >60 000 experimental datasets across >1100 tissues, cell types and biological conditions, each with millions to billions of records across the genome. To pair these functional annotations with modern population-level studies, such as UK Biobank (500 000 individuals with >2500 phenotypes), we need scalable, high-throughput, robust and easy-to-use software that can systematically interpret hundreds of millions of genotypes across millions of participants.
We implemented Spark-based INFERence of the molecular mechanisms of NOn-coding genetic variants (SparkINFERNO) as a scalable, high-throughput automated workflow that integrates a large-scale functional genomics data repository. It processes GWAS results by performing LD analysis, functional evidence evaluation and aggregation, and Bayesian colocalization analysis of GWAS and expression quantitative trait loci (eQTL) signals, and it characterizes the downstream regulatory effects, including the tissue contexts, regulatory elements and target genes that the variants affect. We applied SparkINFERNO to inflammatory bowel disease (IBD) (Liu et al.) and the International Genomics of Alzheimer’s Project (IGAP) Alzheimer’s disease (AD) GWAS datasets (Lambert et al.) and show that this scalable framework is at least 60 times more efficient than the original INFERNO implementation and is able to identify the molecular mechanisms underlying non-coding GWAS signals.

2 Materials and methods

We chose Apache Spark (Zaharia et al.) and Python for a scalable implementation of INFERNO (Amlie-Wolf et al.) (see Supplementary Table S1 and the ‘Comparison with original INFERNO implementation’ section in Supplementary Methods). The new SparkINFERNO is highly scalable, modular and coupled with an integrated functional genomics data repository (Fig. 1 and Supplementary Fig. S1 and Tables S1 and S2). Analysis modules perform various types of genomic data integration to produce functional evidence, including tissue-specific regulatory elements (enhancers), transcription factor (TF) activity, chromatin states and genetic regulation (eQTL) information. SparkINFERNO implements scalable genomic querying (Supplementary Figs S2 and S3) using Spark parallel transformations and Giggle-based genomic indexing (Layer et al.). SparkINFERNO can be extended with additional annotation data and/or customized evaluation modules. Results are reported by individual evaluation modules and as combined summaries (Supplementary Methods).
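
As a rough illustration of what a genomic overlap query involves, the following minimal Python sketch finds the annotation intervals that overlap each variant. It is not SparkINFERNO's code and does not use Giggle or Spark: the `IntervalIndex` class, the `annotate` helper and the toy track labels are all hypothetical, and a simple sorted-list index stands in for a real genomic index. Intervals are assumed to be half-open, BED-style `[start, end)` coordinates.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class IntervalIndex:
    """Toy in-memory index over half-open [start, end) BED-style intervals."""

    def __init__(self, intervals):
        by_chrom = defaultdict(list)
        for chrom, start, end, label in intervals:
            by_chrom[chrom].append((start, end, label))
        self._index = {}
        for chrom, rows in by_chrom.items():
            rows.sort()
            starts = [row[0] for row in rows]
            # Running maximum of interval ends (non-decreasing), so a query can
            # binary-search past leading intervals that end too early.
            max_ends, running = [], 0
            for _, end, _ in rows:
                running = max(running, end)
                max_ends.append(running)
            self._index[chrom] = (rows, starts, max_ends)

    def query(self, chrom, start, end):
        """Return labels of intervals overlapping [start, end) on chrom."""
        if chrom not in self._index:
            return []
        rows, starts, max_ends = self._index[chrom]
        lo = bisect_right(max_ends, start)  # earlier rows all end at or before start
        hi = bisect_left(starts, end)       # later rows all begin at or after end
        return [label for _, row_end, label in rows[lo:hi] if row_end > start]


def annotate(variants, index):
    """Map each (id, chrom, 1-based position) variant to overlapping labels."""
    return {vid: index.query(chrom, pos - 1, pos) for vid, chrom, pos in variants}


# Hypothetical annotation tracks and variants, for illustration only.
tracks = [
    ("chr1", 100, 200, "enhancer_A"),
    ("chr1", 150, 400, "tfbs_B"),
    ("chr1", 500, 600, "enhancer_C"),
    ("chr2", 100, 200, "enhancer_D"),
]
index = IntervalIndex(tracks)
hits = annotate(
    [("rs_a", "chr1", 160), ("rs_b", "chr1", 450), ("rs_c", "chr3", 10)], index
)
print(hits)
```

Because each variant is queried independently, a function like `annotate` could in principle be applied to separate partitions of a variant list in parallel, which is the general idea behind combining an index with parallel transformations.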

Fig. 1. Overview of SparkINFERNO

SparkINFERNO accepts complete GWAS summary statistics or top GWAS association variants as input and generates a list of potentially causal variants, affected tissue-specific enhancers and target gene(s) as output. The entire workflow consists of four phases (Fig. 1): (i) pre-processing and QC of GWAS input; (ii) generating a candidate set of potentially causal variants; (iii) evaluating functional genomic evidence across genomic datasets in a tissue-specific manner, including regulatory elements (enhancers), eQTL colocalization, transcription factor binding sites (TFBSs) and others, for each GWAS locus/signal; and (iv) aggregating evidence to prioritize causal variants, including information on affected tissues/cell types, regulatory elements, TFs and target genes. See Supplementary Methods for technical details.
The pre-processing phase takes raw GWAS summary statistics in tab-separated values format as input, resolves reference and alternative alleles, checks allele frequencies against a reference population (e.g. 1000 Genomes Project) and produces quality control flags. Quality control steps mark GWAS variants with inconsistent alleles that could not be matched with reference genotype data (Supplementary Methods and Fig. S4).
The candidate set construction phase expands genome-wide significant associations into a putative causal variant set by first pruning significant variants into a smaller set of independent variants using publicly available LD data (e.g. 1000 Genomes), and then expanding these independent signals into putative causal sets consisting of nearby variants in LD. The user can specify the reference population used in LD pruning/expansion to match the population underlying the input GWAS. Supplementary Methods and Fig. S4 provide details of the workflow for generating putative variant sets. The evaluation phase executes Spark-based annotation jobs in parallel (Fig. 1).
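
The prune-then-expand logic of the candidate set construction phase can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pipeline's implementation: the function names, the toy P-values and r² values, the r² threshold of 0.7 and the 5e-8 significance cutoff are all chosen for the example, and pairwise LD is assumed to be available as a dictionary keyed by sorted variant-ID pairs.

```python
GENOME_WIDE_P = 5e-8  # conventional significance cutoff, assumed for illustration


def pair_r2(r2, a, b):
    """Look up r^2 for an unordered variant pair; missing pairs count as 0."""
    return r2.get(tuple(sorted((a, b))), 0.0)


def ld_prune(pvalues, r2, threshold):
    """Greedy pruning: walk variants from most to least significant and keep
    a variant only if it is not in LD with any variant already kept."""
    kept = []
    for vid in sorted(pvalues, key=pvalues.get):
        if all(pair_r2(r2, vid, lead) < threshold for lead in kept):
            kept.append(vid)
    return kept


def ld_expand(leads, r2, threshold):
    """Expand each independent lead variant into the set of variants in LD
    with it, whether or not those variants are themselves significant."""
    expanded = {}
    for lead in leads:
        members = {lead}
        for (a, b), value in r2.items():
            if value >= threshold and lead in (a, b):
                members.add(b if a == lead else a)
        expanded[lead] = sorted(members)
    return expanded


def candidate_sets(gwas, r2, threshold=0.7):
    """Significant variants -> independent leads -> LD-expanded candidate sets."""
    significant = {v: p for v, p in gwas.items() if p < GENOME_WIDE_P}
    leads = ld_prune(significant, r2, threshold)
    return ld_expand(leads, r2, threshold)


# Toy inputs (hypothetical): variant -> P-value, and pairwise r^2 values.
gwas = {"rs1": 1e-12, "rs2": 3e-10, "rs3": 2e-9, "rs4": 0.2, "rs5": 1e-3}
r2 = {
    ("rs1", "rs2"): 0.9,
    ("rs1", "rs4"): 0.8,
    ("rs2", "rs3"): 0.1,
    ("rs3", "rs5"): 0.75,
}
sets = candidate_sets(gwas, r2)
print(sets)
```

In this toy example rs2 is pruned because it is in high LD with the more significant rs1, while expansion pulls in rs4 and rs5, which are not significant themselves but are in LD with a lead variant.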
SparkINFERNO uses an integrated repository of annotations for genomic elements (promoters, exons, introns, etc.), non-coding RNAs and regulatory elements, such as enhancers, TFBSs and others (the integrated data and the data repository implementation are described in Supplementary Table S2 and Fig. S1). The current SparkINFERNO implementation contains 3.5 billion genomic intervals from 2342 tracks for 32 tissue categories.
In the final aggregation phase, SparkINFERNO combines functional evidence from individual genomic analyses and produces a list of candidate variants, enhancer elements and their target genes as supported by FANTOM5, Roadmap, GTEx, TF binding and other functional evidence. SparkINFERNO performs colocalization analysis (Supplementary Fig. S5) of the GWAS and eQTL signals across genome-wide significant loci using COLOC (Giambartolomei et al.).
To install SparkINFERNO, users can either install the package (https://bitbucket.org/wanglab-upenn/SparkINFERNO) on their own Spark cluster or use a pre-built Docker image (wanglab/spark-inferno). To run SparkINFERNO, the user first edits the configuration file and provides input GWAS specifications. A complete run of SparkINFERNO produces candidate potentially causal variants, target genes, tissue contexts, regulatory elements and detailed BED files documenting overlaps with functional genomics and annotation datasets.
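
To give a sense of what the Bayesian colocalization step computes, the sketch below follows the general approach described by Giambartolomei et al.: per-variant approximate Bayes factors for each trait are combined into posterior probabilities for five hypotheses (H0: no association; H1/H2: association with only the GWAS trait or only expression; H3: both, with distinct causal variants; H4: both, with a shared causal variant). It is a minimal illustration, not the COLOC package or SparkINFERNO code. The function names, the prior probabilities (1e-4, 1e-4, 1e-5), the prior standard deviations (0.2 and 0.15) and the toy effect sizes are assumptions made for the example.

```python
import math


def log_sum(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_diff(a, b):
    """log(exp(a) - exp(b)); requires a > b."""
    return a + math.log1p(-math.exp(b - a))


def log_abf(beta, se, prior_sd):
    """Log approximate Bayes factor for one variant from its effect size
    estimate and standard error, with a normal prior on the true effect."""
    z = beta / se
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-variant log ABFs of two
    traits measured on the same variants (at least two variants needed)."""
    if len(labf1) != len(labf2) or len(labf1) < 2:
        raise ValueError("need the same set of >= 2 variants for both traits")
    s1, s2 = log_sum(labf1), log_sum(labf2)
    s12 = log_sum([a + b for a, b in zip(labf1, labf2)])
    log_h = [
        0.0,                                                    # H0
        math.log(p1) + s1,                                      # H1
        math.log(p2) + s2,                                      # H2
        math.log(p1) + math.log(p2) + log_diff(s1 + s2, s12),   # H3
        math.log(p12) + s12,                                    # H4
    ]
    total = log_sum(log_h)
    return [math.exp(v - total) for v in log_h]


# Toy locus with three variants: (beta, standard error) per variant.
gwas = [(0.01, 0.05), (0.30, 0.05), (0.02, 0.05)]
eqtl_same = [(0.02, 0.10), (0.80, 0.10), (0.05, 0.10)]   # peak at same variant
eqtl_other = [(0.80, 0.10), (0.02, 0.10), (0.05, 0.10)]  # peak at another variant

gwas_labf = [log_abf(b, s, prior_sd=0.2) for b, s in gwas]
shared = coloc_posteriors(gwas_labf, [log_abf(b, s, 0.15) for b, s in eqtl_same])
distinct = coloc_posteriors(gwas_labf, [log_abf(b, s, 0.15) for b, s in eqtl_other])
print("shared signal, PP(H4):", shared[4])
print("distinct signals, PP(H3):", distinct[3])
```

When both traits peak at the same variant, nearly all posterior mass goes to H4; when they peak at different variants, it shifts to H3. A gene–tissue pair would typically be called colocalized when the H4 posterior exceeds some chosen cutoff.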

3 Results

We evaluated SparkINFERNO on our AWS Spark cluster using IGAP AD and IBD GWAS datasets containing 8 080 502 and 11 555 676 variants, respectively. For the IGAP GWAS dataset, SparkINFERNO took 993 s on a 16-core Linux server to complete the analysis, whereas the original INFERNO took 60 973 s, making SparkINFERNO about 61 times faster (Supplementary Fig. S2). SparkINFERNO scales well with the amount of computational resources in both local and cluster modes (Supplementary Figs S3A and S3B), including parallel Giggle-based genomic querying (Supplementary Fig. S8). SparkINFERNO identified 1418 and 15 343 candidate causal variants and 97 and 317 colocalized target gene–tissue combinations for IGAP and IBD, respectively (Supplementary Table S3). As can be seen from the distribution of identified overlaps across functional genomics datasets and tissue types (Supplementary Figs S6 and S7), SparkINFERNO identifies genes and tissues that are likely important for the disease etiology.
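
The reported speedup follows directly from the two wall-clock times given above for the IGAP dataset:

```latex
\[
  \text{speedup}
  = \frac{t_{\text{INFERNO}}}{t_{\text{SparkINFERNO}}}
  = \frac{60\,973\ \text{s}}{993\ \text{s}}
  \approx 61.4
\]
```

In more familiar units, this corresponds to roughly 16.9 h for the original INFERNO versus about 16.6 min for SparkINFERNO.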

Funding

This work was supported by the National Institute on Aging [U24-AG041689, U54-AG052427, U01-AG032984 and T32-AG00255]; Biomarkers Across Neurodegenerative Diseases (BAND 3) (award number 18062), co-funded by Michael J Fox Foundation, Alzheimer's Association, Alzheimer's Research UK and the Weston Brain Institute.
Conflict of Interest: none declared.

References

1. GIGGLE: a search engine for large-scale integrated genome analysis.

Authors: Ryan M Layer; Brent S Pedersen; Tonya DiSera; Gabor T Marth; Jason Gertz; Aaron R Quinlan
Journal: Nat Methods Date: 2018-01-08 Impact factor: 28.547

2. An atlas of active enhancers across human cell types and tissues.

Authors: Robin Andersson; Claudia Gebhard; Michael Rehli; Albin Sandelin; Irene Miguel-Escalada; Ilka Hoof; Jette Bornholdt; Mette Boyd; Yun Chen; Xiaobei Zhao; Christian Schmidl; Takahiro Suzuki; Evgenia Ntini; Erik Arner; Eivind Valen; Kang Li; Lucia Schwarzfischer; Dagmar Glatz; Johanna Raithel; Berit Lilje; Nicolas Rapin; Frederik Otzen Bagger; Mette Jørgensen; Peter Refsing Andersen; Nicolas Bertin; Owen Rackham; A Maxwell Burroughs; J Kenneth Baillie; Yuri Ishizu; Yuri Shimizu; Erina Furuhata; Shiori Maeda; Yutaka Negishi; Christopher J Mungall; Terrence F Meehan; Timo Lassmann; Masayoshi Itoh; Hideya Kawaji; Naoto Kondo; Jun Kawai; Andreas Lennartsson; Carsten O Daub; Peter Heutink; David A Hume; Torben Heick Jensen; Harukazu Suzuki; Yoshihide Hayashizaki; Ferenc Müller; Alistair R R Forrest; Piero Carninci
Journal: Nature Date: 2014-03-27 Impact factor: 49.962

3. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations.

Authors: Jimmy Z Liu; Suzanne van Sommeren; Hailiang Huang; Siew C Ng; Rudi Alberts; Atsushi Takahashi; Stephan Ripke; James C Lee; Luke Jostins; Tejas Shah; Shifteh Abedian; Jae Hee Cheon; Judy Cho; Naser E Dayani; Lude Franke; Yuta Fuyuno; Ailsa Hart; Ramesh C Juyal; Garima Juyal; Won Ho Kim; Andrew P Morris; Hossein Poustchi; William G Newman; Vandana Midha; Timothy R Orchard; Homayon Vahedi; Ajit Sood; Joseph Y Sung; Reza Malekzadeh; Harm-Jan Westra; Keiko Yamazaki; Suk-Kyun Yang; Jeffrey C Barrett; Behrooz Z Alizadeh; Miles Parkes; Thelma Bk; Mark J Daly; Michiaki Kubo; Carl A Anderson; Rinse K Weersma
Journal: Nat Genet Date: 2015-07-20 Impact factor: 41.307

4. INFERNO: inferring the molecular mechanisms of noncoding genetic variants.

Authors: Alexandre Amlie-Wolf; Mitchell Tang; Elisabeth E Mlynarski; Pavel P Kuksa; Otto Valladares; Zivadin Katanic; Debby Tsuang; Christopher D Brown; Gerard D Schellenberg; Li-San Wang
Journal: Nucleic Acids Res Date: 2018-09-28 Impact factor: 16.971

5. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019.

Authors: Annalisa Buniello; Jacqueline A L MacArthur; Maria Cerezo; Laura W Harris; James Hayhurst; Cinzia Malangone; Aoife McMahon; Joannella Morales; Edward Mountjoy; Elliot Sollis; Daniel Suveges; Olga Vrousgou; Patricia L Whetzel; Ridwan Amode; Jose A Guillen; Harpreet S Riat; Stephen J Trevanion; Peggy Hall; Heather Junkins; Paul Flicek; Tony Burdett; Lucia A Hindorff; Fiona Cunningham; Helen Parkinson
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

6. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease.

Authors: J C Lambert; C A Ibrahim-Verbaas; D Harold; A C Naj; R Sims; C Bellenguez; A L DeStafano; J C Bis; G W Beecham; B Grenier-Boley; G Russo; T A Thorton-Wells; N Jones; A V Smith; V Chouraki; C Thomas; M A Ikram; D Zelenika; B N Vardarajan; Y Kamatani; C F Lin; A Gerrish; H Schmidt; B Kunkle; M L Dunstan; A Ruiz; M T Bihoreau; S H Choi; C Reitz; F Pasquier; C Cruchaga; D Craig; N Amin; C Berr; O L Lopez; P L De Jager; V Deramecourt; J A Johnston; D Evans; S Lovestone; L Letenneur; F J Morón; D C Rubinsztein; G Eiriksdottir; K Sleegers; A M Goate; N Fiévet; M W Huentelman; M Gill; K Brown; M I Kamboh; L Keller; P Barberger-Gateau; B McGuiness; E B Larson; R Green; A J Myers; C Dufouil; S Todd; D Wallon; S Love; E Rogaeva; J Gallacher; P St George-Hyslop; J Clarimon; A Lleo; A Bayer; D W Tsuang; L Yu; M Tsolaki; P Bossù; G Spalletta; P Proitsi; J Collinge; S Sorbi; F Sanchez-Garcia; N C Fox; J Hardy; M C Deniz Naranjo; P Bosco; R Clarke; C Brayne; D Galimberti; M Mancuso; F Matthews; S Moebus; P Mecocci; M Del Zompo; W Maier; H Hampel; A Pilotto; M Bullido; F Panza; P Caffarra; B Nacmias; J R Gilbert; M Mayhaus; L Lannefelt; H Hakonarson; S Pichler; M M Carrasquillo; M Ingelsson; D Beekly; V Alvarez; F Zou; O Valladares; S G Younkin; E Coto; K L Hamilton-Nelson; W Gu; C Razquin; P Pastor; I Mateo; M J Owen; K M Faber; P V Jonsson; O Combarros; M C O'Donovan; L B Cantwell; H Soininen; D Blacker; S Mead; T H Mosley; D A Bennett; T B Harris; L Fratiglioni; C Holmes; R F de Bruijn; P Passmore; T J Montine; K Bettens; J I Rotter; A Brice; K Morgan; T M Foroud; W A Kukull; D Hannequin; J F Powell; M A Nalls; K Ritchie; K L Lunetta; J S Kauwe; E Boerwinkle; M Riemenschneider; M Boada; M Hiltuenen; E R Martin; R Schmidt; D Rujescu; L S Wang; J F Dartigues; R Mayeux; C Tzourio; A Hofman; M M Nöthen; C Graff; B M Psaty; L Jones; J L Haines; P A Holmans; M Lathrop; M A Pericak-Vance; L J Launer; L A Farrer; C M van Duijn; C Van Broeckhoven; V Moskvina; S Seshadri; J Williams; G D Schellenberg; P Amouyel
Journal: Nat Genet Date: 2013-10-27 Impact factor: 38.330

7. An integrated encyclopedia of DNA elements in the human genome.

Authors:
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

8. Integrative analysis of 111 reference human epigenomes.

Authors: Anshul Kundaje; Wouter Meuleman; Jason Ernst; Misha Bilenky; Angela Yen; Alireza Heravi-Moussavi; Pouya Kheradpour; Zhizhuo Zhang; Jianrong Wang; Michael J Ziller; Viren Amin; John W Whitaker; Matthew D Schultz; Lucas D Ward; Abhishek Sarkar; Gerald Quon; Richard S Sandstrom; Matthew L Eaton; Yi-Chieh Wu; Andreas R Pfenning; Xinchen Wang; Melina Claussnitzer; Yaping Liu; Cristian Coarfa; R Alan Harris; Noam Shoresh; Charles B Epstein; Elizabeta Gjoneska; Danny Leung; Wei Xie; R David Hawkins; Ryan Lister; Chibo Hong; Philippe Gascard; Andrew J Mungall; Richard Moore; Eric Chuah; Angela Tam; Theresa K Canfield; R Scott Hansen; Rajinder Kaul; Peter J Sabo; Mukul S Bansal; Annaick Carles; Jesse R Dixon; Kai-How Farh; Soheil Feizi; Rosa Karlic; Ah-Ram Kim; Ashwinikumar Kulkarni; Daofeng Li; Rebecca Lowdon; GiNell Elliott; Tim R Mercer; Shane J Neph; Vitor Onuchic; Paz Polak; Nisha Rajagopal; Pradipta Ray; Richard C Sallari; Kyle T Siebenthall; Nicholas A Sinnott-Armstrong; Michael Stevens; Robert E Thurman; Jie Wu; Bo Zhang; Xin Zhou; Arthur E Beaudet; Laurie A Boyer; Philip L De Jager; Peggy J Farnham; Susan J Fisher; David Haussler; Steven J M Jones; Wei Li; Marco A Marra; Michael T McManus; Shamil Sunyaev; James A Thomson; Thea D Tlsty; Li-Huei Tsai; Wei Wang; Robert A Waterland; Michael Q Zhang; Lisa H Chadwick; Bradley E Bernstein; Joseph F Costello; Joseph R Ecker; Martin Hirst; Alexander Meissner; Aleksandar Milosavljevic; Bing Ren; John A Stamatoyannopoulos; Ting Wang; Manolis Kellis
Journal: Nature Date: 2015-02-19 Impact factor: 69.504

9. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics.

Authors: Claudia Giambartolomei; Damjan Vukcevic; Eric E Schadt; Lude Franke; Aroon D Hingorani; Chris Wallace; Vincent Plagnol
Journal: PLoS Genet Date: 2014-05-15 Impact factor: 5.917

10. Genetic effects on gene expression across human tissues.

Authors: Alexis Battle; Christopher D Brown; Barbara E Engelhardt; Stephen B Montgomery
Journal: Nature Date: 2017-10-11 Impact factor: 49.962
