GeMSTONE: orchestrated prioritization of human germline mutations in the cloud.

Siwei Chen^1,2,3, Juan F Beltrán^1,2, Clara Esteban-Jurado⁴, Sebastià Franch-Expósito⁴, Sergi Castellví-Bel⁴, Steven Lipkin⁵, Xiaomu Wei^2,5, Haiyuan Yu^1,2.

Abstract

Integrative analysis of whole-genome/exome-sequencing data has been challenging, especially for the non-programming research community, as it requires simultaneously managing a large number of computational tools. Even computational biologists find it unexpectedly difficult to reproduce results from others or optimize their strategies in an end-to-end workflow. We introduce Germline Mutation Scoring Tool fOr Next-generation sEquencing data (GeMSTONE), a cloud-based variant prioritization tool with high-level customization and a comprehensive collection of bioinformatics tools and data libraries (http://gemstone.yulab.org/). GeMSTONE generates and readily accepts a shareable 'recipe' file for each run, either to replicate previous results or to analyze new data with identical parameters. It provides a centralized, streamlined workflow for prioritizing germline mutations in human disease rather than a pool of program executions.

Year: 2017 PMID: 28521008 PMCID: PMC5556704 DOI: 10.1093/nar/gkx398

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Next-generation sequencing (NGS) has significantly reduced the cost of obtaining genomic data for increasingly large sample sizes (1), facilitating discovery of causal genes and mutation candidates for various disorders (2), and providing sizable genetic variant datasets (3). As a result, the process of filtering, annotating and prioritizing variants from large-scale studies has grown in complexity and computational burden. It has become increasingly difficult to organize, maintain and standardize variant analysis workflows, increasing the time and monetary investment for less computationally oriented biologists and labs. Some integrative frameworks (4–7) have been developed to enhance the reproducibility and accessibility of NGS studies. This same initiative inspired the framework for GeMSTONE: recording all analysis metadata for reproducible computational experiments, specifically focusing on germline mutation prioritization in human disease. Although other platforms bring together different bioinformatics tools and allow users to schedule their analyses online, none of them are built with an emphasis on streamlined single-run scheduling and automatic fetching of the necessary supplementary public data. Platforms like Galaxy (4), for instance, allow the user to combine many different tools from an impressive catalog, but require the user to reformat their data depending on the particular input format of the database or tool that they want to add to their analysis. A major design goal in the development of GeMSTONE is the ability to maximize customization for studies in a streamlined workflow rather than a pool of program executions. Within the GeMSTONE interface, databases required by the user-selected tools are pre-loaded and the user-input data are automatically reformatted to fit query requirements. Therefore, adding an extra layer of analysis to any workflow requires minimal effort.
There is a large research community studying genetic variation in relation to human disease (8–18). This community often performs its analyses in-house rather than using any of the currently available tools for variant analysis. GeMSTONE facilitates the process of integrating and assessing evidence for causal inferences while automating the whole workflow in a reproducible way. Through its design, GeMSTONE fills a significant gap in the online analysis landscape. GeMSTONE provides centralized workflows that embed key features of variant prioritization for DNA sequencing data, focused on but not limited to germline mutations, with a collection of current bioinformatics tools and data libraries in a highly customizable and reproducible manner. In short, we created GeMSTONE to organize, schedule, document and reproduce our variant analysis workflows from a single interface. We show that the GeMSTONE workflow is consistent with consensus guidelines for interpreting sequence variants in human disease (19,20) (Supplementary Table S1). A demo study is fully described and explained as it is designed, scheduled and analyzed through the chained GeMSTONE functionalities (http://gemstone.yulab.org/manual.html); we also demonstrate its feasibility and efficiency in a proof-of-concept case by recapitulating results of a published variant analysis (9).

MATERIALS AND METHODS

GeMSTONE serves as an online variant prioritization framework that leverages seven popular bioinformatics suites [VT (21), VCFtools (22), BCFtools (23), SnpEff (24), GEMINI (25), dbNSFP (26) and PLINK/SEQ (27)] in connection to 46 meta-information and prediction resources (Figure 1; Supplementary Table S2) to provide a smooth, customizable workflow for variant analysis.

Figure 1.

GeMSTONE pipeline overview. The schematic represents GeMSTONE's central analysis pipeline. The fundamental backbone filter cascade, shown in blue, prioritizes rare and putatively damaging variants and genes. Different libraries, grouped in orange, are used in annotation or filtering steps throughout the workflow as indicated.

Users of the GeMSTONE web portal can customize their analyses of genomic data from Variant Call Format (VCF) files by using tools from a range of different classes (Figure 1). These include (i) variant normalization for unified representation of genetic variants using VT, (ii) variant/genotype quality filters on metrics encoded in the VCF file such as QUAL (Phred-scaled quality score), GQ (genotype quality), DP (read depth) and filter status using VCFtools, (iii) variant type filters on variant consequence and transcript biotype based on SnpEff annotations, (iv) a common variant filter on allele frequency in the general population [ExAC (28), 1000 Genomes (29), ESP6500 (30) and TAGC (31)], (v) variant function filters on predicted damaging effects [18 methods (e.g. PolyPhen-2 (32), SIFT (33), CADD (34)) compiled in dbNSFP (Supplementary Table S2), Rosetta ddG (35)] and protein domains [Pfam (36)], (vi) comprehensive annotations (and filters) on gene and gene product attributes [Gene Ontology (37)], biological pathways [KEGG (38), BioCarta (39) and Reactome (40) compiled in MSigDB (41)], human disease association [HGMD (42), ClinVar (43), OMIM (44)], mouse model knockout phenotypes [MGI (45)], gene-based scores on accumulated mutational damage [GDI (46)] and genic intolerance [RVIS (47)], gene expression [GTEx (48), HPA (49)] and protein–protein interaction networks [IntAct (50), BioGRID (51) and ConsensusPathDB (52) compiled in dbNSFP, and HINT (53)], and (vii) pathway enrichment analysis using Fisher's exact test.
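The pathway enrichment step (vii) can be illustrated with a minimal sketch of a one-sided Fisher's exact test for over-representation. The function name, its arguments and the example counts are assumptions made for illustration; the text does not describe how GeMSTONE implements the test internally.

```python
from math import comb


def fisher_enrichment_pvalue(hits, n_candidates, pathway_size, n_background):
    """One-sided Fisher's exact test p-value for pathway over-representation.

    Returns the probability of observing at least `hits` pathway genes among
    `n_candidates` genes drawn without replacement from `n_background` genes,
    of which `pathway_size` belong to the pathway (hypergeometric upper tail).
    """
    max_hits = min(n_candidates, pathway_size)
    total = comb(n_background, n_candidates)
    tail = sum(
        comb(pathway_size, k) * comb(n_background - pathway_size, n_candidates - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 4 of 30 candidate genes fall in a 150-gene pathway,
# against a background of 20,000 genes.
p_value = fisher_enrichment_pvalue(4, 30, 150, 20000)
```

A small p-value here would suggest that the prioritized genes are concentrated in the pathway more than expected by chance; in practice a multiple-testing correction across pathways would also be needed.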
Users may also choose to include supplementary files, such as a pedigree (PED) file for co-segregation analysis, a list of genes for personalized annotation, or a second VCF file with a control cohort for genetic association tests [BURDEN (27), Calpha (54), vt (55) and SKAT (56) implemented in PLINK/SEQ]. All these options come together to provide a holistic filtering, annotation and prioritization pipeline (Figure 1). The customized pipeline is then scheduled for processing on a protected server, relieving the user of the burden of updating software, parsing data libraries, storing large derivative files and dedicating processing time. The web server and database server run as virtual machines (VMs) on shared physical infrastructure. Both the database and web host VMs can be expanded, moved to an upgraded physical machine or granted more resources on their current machine depending on demand, making the hardware setup easily scalable to more traffic. The average turnaround time is about 11 min for a 1 MB VCF input file containing ∼13,800 variants under default settings, of which querying up to 18 in silico predictions takes a static 8 min to search through the 76 GB dbNSFP database on all chromosomes. Although the processing time will vary depending on the choice of options and the number of concurrent users, GeMSTONE in general can handle a single ∼500M VCF input per run within 1 day. Once the job is finished, the user can log into the GeMSTONE portal to interact with the completed workflow by selectively downloading step-by-step snapshots of their workflow, interactively visualizing their variant statistics and downloading their recipe (JSON) file, which can be uploaded or shared to replicate or modify the same workflow. An essential design element that reinforces GeMSTONE's reproducibility and ensures the sustainability of our web tool is our rigorous versioning system.
We keep static versions of all external resources in our system: all the tools and datasets that GeMSTONE uses are loaded onto our server, so that a run never calls any external program or server. Thus we are able to ensure backward compatibility as we add updated versions of software or new tools. GeMSTONE records the versions of each tool and database used in a job in the recipe file, and if users submit a recipe whose workflow uses older software or datasets, they are asked on the fly whether they want to use the legacy version or the latest version of the resources. GeMSTONE also records the versions in a human-readable summary file for easy access and reference. One important function for germline mutation prioritization in human disease is GeMSTONE's co-segregation analysis, which provides six common inheritance models (autosomal dominant, autosomal recessive, recessive compound heterozygous (via GEMINI (25)), X-linked dominant, X-linked recessive and Y-linked dominant) based on the user-defined pedigree structure in the PED file. GeMSTONE screens sample genotypes (using BCFtools (23)) in each family and searches for variants that co-segregate with disease status under the selected mode of inheritance. Additionally, a recurrence filter constrains the degree to which co-segregation events are allowed across multiple families and the prevalence of the variants in sporadic samples. We found this option to be seldom implemented by previous web tools yet often recommended by the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) (20). The benefits of this analysis are many-fold: (i) increasing segregation data in families or (ii) a high mutation frequency affecting multiple sporadic cases suggests stronger evidence for pathogenicity, whereas (iii) an upper limit on such recurrence can help eliminate potential false positives in large samples.
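The co-segregation and recurrence logic described above can be sketched as follows for the autosomal dominant model only. This is a simplified illustration assuming full penetrance; the function names and data layout are hypothetical, and GeMSTONE itself relies on BCFtools and GEMINI rather than on code like this.

```python
def cosegregates_dominant(genotypes, affected):
    """Check autosomal dominant co-segregation within one family.

    genotypes: sample ID -> alternate allele count (0, 1 or 2); missing
               samples are treated as uninformative in this sketch.
    affected:  sample ID -> True if the family member is affected.
    """
    affected_carriers = 0
    for sample, is_affected in affected.items():
        alt_count = genotypes.get(sample)
        if alt_count is None:
            continue  # no genotype call for this family member
        if is_affected:
            if alt_count == 0:
                return False  # an affected member does not carry the variant
            affected_carriers += 1
        elif alt_count > 0:
            return False  # an unaffected member carries the variant
    return affected_carriers > 0


def passes_recurrence(genotypes, families, max_families):
    """Keep a variant that co-segregates in 1..max_families families."""
    n_families = sum(
        cosegregates_dominant(genotypes, members) for members in families.values()
    )
    return 1 <= n_families <= max_families


# Hypothetical example: two small families and one variant.
families = {
    "F1": {"f1_case": True, "f1_control": False},
    "F2": {"f2_case": True, "f2_control": False},
}
variant_calls = {"f1_case": 1, "f1_control": 0, "f2_case": 1, "f2_control": 0}
keep = passes_recurrence(variant_calls, families, max_families=2)
```

The upper bound in `passes_recurrence` mirrors the idea that a variant seen in too many families of a large cohort is more likely a common polymorphism or an artifact than a causal mutation.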
This process of user-driven development by which GeMSTONE morphs to the community's needs is the key behind GeMSTONE's ability to grow as a knowledge bank with a robust and updated set of functionalities. Small but necessary prioritizing steps like these, now explicitly documented in the GeMSTONE summary and recipe files, can become an active component of study replication. Another line of supporting evidence for disease association comes from in silico predictions of variant functional effects. Predictions from different algorithms are considered as a single piece of evidence in sequence interpretation, in part due to the underlying similarities in the basis of these software suites (19,20). GeMSTONE's variant functional prediction step allows the user to choose up to 19 different in silico predictors (Supplementary Table S2) with customizable thresholds. More specifically, a ‘global deleteriousness filter’ allows users to set a threshold on the number of selected predictors needed for a variant to pass the filter. This set of filters is useful in that it allows users to adjust the stringency of each algorithm while balancing and investigating any inconsistency among different predictions. The availability of these filters and annotations also provides an environment in which users can choose predictive metrics solely based on their relative merit rather than the programming investment that it would take to install, query and customize them for a study. Most options within the GeMSTONE workflow can serve dual purposes, acting as either filters or annotations. For the ‘global deleteriousness filter’ mentioned above, the count of deleterious predictions and their individual scores will be annotated next to each variant, providing information that can be used for variant prioritization without being part of any filter.
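A minimal sketch of how such a ‘global deleteriousness filter’ could work is given below. The three cutoffs are the ones reported for the CRC case study in the Results; the data layout, function names and example scores are assumptions for illustration and do not reflect GeMSTONE's actual code.

```python
import operator

# Predictor -> (comparison, cutoff). A variant counts as deleterious for a
# predictor when comparison(score, cutoff) is true. SIFT is "lower is worse".
THRESHOLDS = {
    "SIFT": (operator.lt, 0.05),
    "PolyPhen-2": (operator.gt, 0.85),
    "GERP++": (operator.gt, 2.0),
}


def count_deleterious(scores, thresholds=THRESHOLDS):
    """Count the predictors that call the variant deleterious."""
    count = 0
    for predictor, (compare, cutoff) in thresholds.items():
        score = scores.get(predictor)
        if score is not None and compare(score, cutoff):
            count += 1
    return count


def annotate_and_filter(variants, min_deleterious, thresholds=THRESHOLDS):
    """Annotate every variant with its count, keep those reaching the minimum."""
    kept = []
    for variant in variants:
        n = count_deleterious(variant["scores"], thresholds)
        annotated = {**variant, "n_deleterious": n}
        if n >= min_deleterious:
            kept.append(annotated)
    return kept


# Hypothetical variants with made-up scores.
variants = [
    {"id": "var1", "scores": {"SIFT": 0.01, "PolyPhen-2": 0.95, "GERP++": 1.2}},
    {"id": "var2", "scores": {"SIFT": 0.30, "GERP++": 4.1}},
]
passing = annotate_and_filter(variants, min_deleterious=2)
```

Setting `min_deleterious=0` turns the step into a pure annotation, which corresponds to the dual filter/annotation role described above.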
We also provide the option to combine information across libraries. For example, we allow known disease gene annotation on candidates to be supplemented with their interaction partners as reported in other databases, asking whether those interactors were previously implicated in the disease of interest. This distribution and coverage of tools (Figure 1) have never been collected and connected in a centralized workflow before. By maintaining an updated set of bioinformatics tools for variant analysis, GeMSTONE lowers the barrier to entry for less computationally oriented research groups and establishes a central bioinformatics hub for researchers who study sequence variants implicated in severe familial diseases as well as rare, large-effect risk variants in complex disease. The options offered by the web interface also serve as a way for users to explore and learn about new tools and data sources while providing developers with an overview of the current variant analysis landscape to fill any gaps in the current tool-space. New tools can be easily added to GeMSTONE and presented to the community through the web interface, removing platform-specific barriers.

RESULTS

As an example of a GeMSTONE use case, we replicated a published analysis of rare pathogenic variants in new predisposition genes for familial colorectal cancer (CRC) (9). A side-by-side demonstration of the study's workflow and GeMSTONE's reimplementation using the same dataset and prioritization criteria is shown in Figure 2. The original analyses were conducted in two sequences of prioritization, progressively looking for predisposing mutations with stronger evidence for causality to CRC as they underwent increasingly stringent criteria (lower allele frequency in general populations; rarer presence among the affected samples; more deleterious molecular impact by in silico predictions; more interesting biological functions of the genes and their protein product, e.g. domains and interactions) (9). While the original study required in-house scripting for co-segregation analysis, in silico analysis and a series of gene function annotations that query and parse several databases, with GeMSTONE the entirety of each prioritization sequence can be performed in a single run through our interactive, lightweight web form.

Figure 2.

Recapitulation of a published colorectal cancer (CRC) study. As a proof-of-concept case study, GeMSTONE recapitulated every step in the original colorectal cancer prioritization workflow (9), recovering 27 out of 28 candidate variants from the ∼30,000 variants in the raw whole exome sequencing dataset and hitting all hereditary CRC and CRC GWAS variants.

Perhaps the most convenient feature within GeMSTONE is its recipe file generator. The recipe file from any given run can be shared and readily uploaded to our site to modify any part of the filtering and annotation pipeline for more stringent prioritization in a follow-up run. Once uploaded, the recipe file (JSON) will populate the web form dynamically, giving the user the ability to modify the run using the same interface that created it. In our CRC case, we lowered the upper bound of the allele frequency filter from 0.5% to 0.1% [in 1000 Genomes (29) and ESP6500 (30)] and that of the recurrence filter from 9 to 4, requiring variants to be present in ≤4 individuals in our dataset. Next, we increased the lower bound of the deleteriousness filter from 4 to 5 without changing the user-defined deleterious thresholds of any single predictor [PhyloP (33) score >0.85, SIFT (57) score <0.05, PolyPhen-2 (32) score >0.85, GERP++ (58) score >2, Mutation Taster (59) score >0.5 and LRT (60) score >0.9]. Finally, we added variant and gene annotations with interesting gene function, interactions and locations in protein domains. This workflow leverages a variety of public databases, including Gene Ontology (37), KEGG (38), Reactome (40), HINT (53), Pfam (36) and HGMD (42), as well as a complementary list of cancer terms collected by the authors. This modified workflow was automatically recorded in a JSON recipe file and packaged with corresponding results and intermediate output files. Through the above two automated runs, GeMSTONE recapitulated every step of the original prioritization workflow.
A total of 27 out of 28 candidate variants were identified (the missing variant was filtered out due to slightly higher allele frequency in a sub-population database from 1000 Genomes), as well as all hereditary CRC and CRC Genome-wide Association Study (GWAS) variants (9) (Figure 2).
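The follow-up run described above amounts to editing three parameters of a recorded recipe. The sketch below shows the idea with a toy recipe; the key names and structure are hypothetical, since the actual GeMSTONE recipe schema is not described here, and only the parameter values (0.5% to 0.1%, 9 to 4, 4 to 5) come from the text.

```python
import copy
import json

# Toy recipe for the first CRC run (hypothetical keys, values from the text).
first_run = {
    "allele_frequency": {
        "max_af": 0.005,
        "populations": ["1000 Genomes", "ESP6500"],
    },
    "recurrence": {"max_individuals": 9},
    "deleteriousness": {"min_predictors": 4},
}


def derive_recipe(recipe, overrides):
    """Return a modified copy of a recipe, leaving the original untouched."""
    derived = copy.deepcopy(recipe)
    for section, params in overrides.items():
        derived.setdefault(section, {}).update(params)
    return derived


# More stringent second run: rarer variants, lower recurrence, more predictors.
second_run = derive_recipe(
    first_run,
    {
        "allele_frequency": {"max_af": 0.001},
        "recurrence": {"max_individuals": 4},
        "deleteriousness": {"min_predictors": 5},
    },
)

# Serialize so that the derived recipe could be shared alongside the results.
recipe_text = json.dumps(second_run, indent=2)
```

Keeping the first recipe immutable and deriving the second from it preserves a record of both runs, which is the property that makes the two-stage prioritization replicable.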

DISCUSSION

GeMSTONE provides a code-free portal for variant filtering, annotation and prioritization, which not only helps standardize genetic variation analyses (Supplementary Table S1) but also offers the means to replicate and share computational protocols easily. From a user's perspective, GeMSTONE is a reliable one-stop shop for variant analysis where they can find a collection of tools spanning a broad range of applications through an intuitive, unified user interface subsuming all general-purpose workflows from comparable toolkits (Figure 3).

Figure 3.

Heatmap comparison of GeMSTONE and other variant prioritization tools. This heatmap compares GeMSTONE with other tools that have similar objectives in terms of (A) raw data inputs and prioritization, (B) knowledge-based annotation from external data resources and libraries, (C) inheritance models for co-segregation analysis and (D) strategy of reproducibility. Each row represents a different tool, while each column represents a specific feature. Dark blue indicates that a tool has similar capacity for a specific function while light blue indicates that a tool has a similar feature but with less powerful functionality than GeMSTONE (see Discussion).

Although most other variant prioritization tools currently accept VCF and pedigree files as inputs and can perform routine filtering on quality control and variant consequence (Figure 3A), GeMSTONE stands out as a more powerful tool by including annotations at the variant, gene, pathway and network level (Figure 3B) and co-segregation analysis using different inheritance models for potential germline mutation prioritization (Figure 3C). We consider certain features in GeMSTONE to be ‘more powerful’ in terms of comprehensiveness and/or flexibility: GeMSTONE often provides more comprehensive options for filtering and annotation linking to external resources than other tools, and most GeMSTONE options flexibly allow for annotation, filtering or both. Detailed reasons are given in Supplementary Table S3. A keystone of GeMSTONE is the recipe file (Figure 3D), which records all workflow parameters in a single file that can be shared and uploaded onto the site to reproduce a previous run. The recipe file can be used to (i) replicate results by rerunning the same workflow on the same dataset, (ii) process new data with a known workflow or (iii) modify parameters in a known workflow to evaluate study design.
This approach has the potential to bring more transparency and openness to the bioinformatics community by enhancing the reproducibility of large-scale genomic studies.

CONCLUSIONS

GeMSTONE allows for accessible, collaborative, replicable and holistic analysis of genetic variants. First, it seamlessly knits together filters and annotations through different tools with either stringent, study-specific parameters or general best-practice settings. Second, it eliminates the time and space burdens associated with modern variant analysis tools, saving users dozens of gigabytes of potential disk space per run for the same workflow on a medium-sized dataset. Third, it significantly lowers the barrier to entry for traditional biologists by eliminating the installation and scripting sinkholes that may dissuade researchers from pursuing large-scale analysis or trying new tools. Fourth, it provides a readable, shareable log, both programmatic and human, to allow other researchers to understand and replicate study results given the same starting data. Finally, GeMSTONE encourages the growth of the genomics research community by maintaining and updating a bank of best-practice bioinformatics methods and tools. We expect that GeMSTONE will greatly aid in automating the (re)analysis of genome-wide genetic variation data and enhance the reproducibility of large-scale genomic studies.

DECLARATIONS

Availability of data and material

Exome sequence data for 43 CRC patients were provided by Esteban-Jurado et al. (9) through private communication.

Ethics approval

Ethics approval was not needed for this study.

58 in total

1. GeneProf: analysis of high-throughput sequencing experiments.

Authors: Florian Halbritter; Harsh J Vaidya; Simon R Tomlinson
Journal: Nat Methods Date: 2011-12-28 Impact factor: 28.547

2. Pooled association tests for rare variants in exon-resequencing studies.

Authors: Alkes L Price; Gregory V Kryukov; Paul I W de Bakker; Shaun M Purcell; Jeff Staples; Lee-Jen Wei; Shamil R Sunyaev
Journal: Am J Hum Genet Date: 2010-05-13 Impact factor: 11.025

3. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

4. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583

5. The BioGRID interaction database: 2015 update.

Authors: Andrew Chatr-Aryamontri; Bobby-Joe Breitkreutz; Rose Oughtred; Lorrie Boucher; Sven Heinicke; Daici Chen; Chris Stark; Ashton Breitkreutz; Nadine Kolas; Lara O'Donnell; Teresa Reguly; Julie Nixon; Lindsay Ramage; Andrew Winter; Adnane Sellam; Christie Chang; Jodi Hirschman; Chandra Theesfeld; Jennifer Rust; Michael S Livstone; Kara Dolinski; Mike Tyers
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 19.160

6. GEMINI: integrative exploration of genetic variation and genome annotations.

Authors: Umadevi Paila; Brad A Chapman; Rory Kirchner; Aaron R Quinlan
Journal: PLoS Comput Biol Date: 2013-07-18 Impact factor: 4.475

7. Exome sequencing identifies novel and recurrent mutations in GJA8 and CRYGD associated with inherited cataract.

Authors: Donna S Mackay; Thomas M Bennett; Susan M Culican; Alan Shiels
Journal: Hum Genomics Date: 2014-11-18 Impact factor: 4.639

Review 8. POLE and POLD1 mutations in 529 kindred with familial colorectal cancer and/or polyposis: review of reported cases and recommendations for genetic testing and surveillance.

Authors: Fernando Bellido; Marta Pineda; Gemma Aiza; Rafael Valdés-Mas; Matilde Navarro; Diana A Puente; Tirso Pons; Sara González; Silvia Iglesias; Esther Darder; Virginia Piñol; José Luís Soto; Alfonso Valencia; Ignacio Blanco; Miguel Urioste; Joan Brunet; Conxi Lázaro; Gabriel Capellá; Xose S Puente; Laura Valle
Journal: Genet Med Date: 2015-07-02 Impact factor: 8.822

9. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

10. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants.

Authors: Wenqing Fu; Timothy D O'Connor; Goo Jun; Hyun Min Kang; Goncalo Abecasis; Suzanne M Leal; Stacey Gabriel; Mark J Rieder; David Altshuler; Jay Shendure; Deborah A Nickerson; Michael J Bamshad; Joshua M Akey
Journal: Nature Date: 2012-11-28 Impact factor: 49.962