Literature DB >> 28522612

GenomeVIP: a cloud platform for genomic variant discovery and interpretation.

R Jay Mashl^1,2, Adam D Scott^1,2, Kuan-Lin Huang^1,2, Matthew A Wyczalkowski¹, Christopher J Yoon^1,2, Beifang Niu¹, Erin DeNardo¹, Venkata D Yellapantula^1,2, Robert E Handsaker^3,4, Ken Chen⁵, Daniel C Koboldt¹, Kai Ye^1,2, David Fenyö⁶, Benjamin J Raphael⁷, Michael C Wendl^1,8,9, Li Ding^1,2,8,10.

Abstract

Identifying genomic variants is a fundamental first step toward the understanding of the role of inherited and acquired variation in disease. The accelerating growth in the corpus of sequencing data that underpins such analysis is making the data-download bottleneck more evident, placing substantial burdens on the research community to keep pace. As a result, the search for alternative approaches to the traditional "download and analyze" paradigm on local computing resources has led to a rapidly growing demand for cloud-computing solutions for genomics analysis. Here, we introduce the Genome Variant Investigation Platform (GenomeVIP), an open-source framework for performing genomics variant discovery and annotation using cloud- or local high-performance computing infrastructure. GenomeVIP orchestrates the analysis of whole-genome and exome sequence data using a set of robust and popular task-specific tools, including VarScan, GATK, Pindel, BreakDancer, Strelka, and Genome STRiP, through a web interface. GenomeVIP has been used for genomic analysis in large-data projects such as the TCGA PanCanAtlas and in other projects, such as the ICGC Pilots, CPTAC, ICGC-TCGA DREAM Challenges, and the 1000 Genomes SV Project. Here, we demonstrate GenomeVIP's ability to provide high-confidence annotated somatic, germline, and de novo variants of potential biological significance using publicly available data sets.

Mesh：

Year: 2017 PMID： 28522612 PMCID： PMC5538560 DOI： 10.1101/gr.211656.116

Source DB: PubMed Journal: Genome Res ISSN： 1088-9051 Impact factor: 9.043

Understanding the relationship between genetics and disease is a central theme of biomedical research. Enabled by increasingly economical next-generation sequencing technologies, many projects have sought to characterize variation within and across populations and disease cohorts. Among these are the 1000 Genomes Project, The Cancer Genome Atlas (TCGA), UK10K, Pediatric Cancer Genome Project (PCGP), and the International Cancer Genome Consortium (ICGC), but numerous smaller-scale projects are also under way. Advances in sequencing technologies and further economies of scale are expected to increase the collective corpus of sequence data dramatically, particularly as clinical diagnostic sequencing becomes more prevalent while expanding across data types (methylation, mRNA, and miRNA) and especially if widespread screening of asymptomatic individuals is implemented. For example, the Precision Medicine Initiative (PMI) envisions a longitudinal collection of genomic data from more than 1 million individuals (Collins and Varmus 2015). The vast amounts of data produced by today's sequencing projects impose logistical challenges in downloading and storing data prior to passing it through a pipeline of bioinformatics tools running on local high-performance computing resources, i.e., the traditional “bringing data to the tools” paradigm of variant analysis (Stein 2010). Cloud computing addresses both the computational and storage challenges associated with large data sets by enabling users to launch on-demand virtualized instances of computer systems with preinstalled tools and scripts that have the ability to import source data from and export processed data to cloud storage. Furthermore, the research community is transitioning to approaches that democratize access to genomic data (Heath et al. 2014; Stein et al. 2015), resulting in the creation of resources such as the National Cancer Institute's Genomic Data Commons (GDC; http://gdc.cancer.gov), a comprehensive cancer genomics repository. Cloud computing, where analysis is conducted by “bringing tools to the data,” is viewed as having an important role in future bioinformatics analysis (Dudley et al. 2010), which is especially critical for independent scientists and clinics that are unable to support a local high-performance computational infrastructure. Here, we present the Genome Variant Investigation Platform (GenomeVIP), a system for performing variant discovery, annotation, and interpretation using cloud resources. Its intuitive, lightweight web interface enables users to detect genomic variants (single- and multinucleotide variants, short insertions and deletions [indels], complex indels, and structural alterations that include translocations, inversions, and tandem duplications) in whole-genome or exome sequence alignment files (BAMs). We highlight important functionalities of GenomeVIP and examine its capabilities within the larger context of genomic data processing systems, demonstrating GenomeVIP as an engine for future cancer and human genetic studies.

Results

GenomeVIP is an open-source, cloud-aware, multiuser platform with a web interface for performing discovery, annotation, and interpretation of genomic variation. Its sophisticated design brings powerful cloud resources to bear for task-specific bioinformatics analyses without any special cloud expertise required by the investigator. Users can direct GenomeVIP to perform germline, somatic, and de novo variant calling by selecting software from its palette of widely used bioinformatics tools and can specify provided best-practices discovery tool parameter sets or design a custom “execution profile” by adjusting online or uploading parameters through the web interface. By providing both high- and low-level access to tools and parameters, GenomeVIP provides flexibility for use by computational biologist power users as well as by users having expert knowledge or particular research needs. Its design and operational aspects aim to promote the reproducibility, transparency, and uniformity of the processing of genomic data. Although we focus on GenomeVIP's cloud capability, it runs equivalently on local high-performance computing clusters with local data with preinstalled versions of the named tools. The design and implementation of GenomeVIP and its usage for modern, high sample count cancer genomics and human genetics research also serves as a starting point for integration of additional tools and capabilities and expansion to other cloud-computing platforms.

GenomeVIP architecture

The functionality of GenomeVIP is provided through coordination among three central components: the user's web browser, the GenomeVIP server, and a cloud-computing resource (Fig. 1).

Figure 1.

GenomeVIP platform. GenomeVIP consists of three components (web browser, server host, cloud), coordinated by various scripting languages (blue) and cloud toolkits (green). Interactive web pages, written in HTML (with CSS elements) and JavaScript, provide front-end functionality. JQuery is a JavaScript library providing methods to modify web page content with cross-browser compatibility. Server-side PHP modules utilize StarCluster and S3 Tools cloud toolkits to access EC2 Compute (gray) and storage resources (yellow) in the cloud. GenomeVIP creates within EC2 a virtual cluster, based on a machine image with preinstalled variant detection tools and supporting software (collectively, “Genomics Tools”) (red), that can access sequence data on S3 and EBS (Elastic Block Storage) resources (yellow). Secure channels using HTTPS and secure shell (SSH) protocols allow communication between various components. Resulting variant call files stored in S3 are accessible via the GenomeVIP interface or the Amazon S3 Console.

Web browser

The GenomeVIP web browser interface (Fig. 1, top) furnishes user controls for many tasks, including the following: loading sample file information; selecting samples; choosing parameters and tools for discovery, filtering, and annotation; selecting genomic regions; and managing cloud computational and storage resources. The interface is implemented using a combination of HTML, JavaScript, and cascading style sheets (CSS) and uses a jQuery JavaScript library to modify web page content and provide cross-browser compatibility. In addition, user-generated events and communications between the browser and the GenomeVIP server are handled by JSON-formatted AJAX requests, a standard jQuery feature.

Server host

The GenomeVIP server (Fig. 1, middle) runs a secure Apache HTTPS web server that may be installed and run locally as a real or virtual machine (VM) or that may be instantiated as a VM on Amazon Web Services (AWS). Server-side scripts written in PHP handle user selections and generate content for client-side interpretation and storage within the user's web browser (Fig. 1, top). To configure a computation to run on the AWS cloud, users provide their previously established AWS login credentials, specify genomic samples, and select one of the predefined analysis pipelines and parameter sets, which may be further customized. Users then specify storage and computing resources required, i.e., a VM instance type (giving a certain number of processors and memory) and the number of nodes or virtual cluster size. The server automatically builds the necessary configuration files for the StarCluster (http://star.mit.edu/cluster) and S3 Tools (http://s3tools.org) cloud toolkits to manage AWS computing and storage resources, respectively. After users submit a computation, the server performs key tasks: instantiating cloud computing resources; generating a master script that creates template-subordinate work scripts to perform the variant discovery, post-discovery processing, and storing of results in parallel fashion across genomic regions and tools; and transmitting the work scripts to the targeted computing host for execution.

The cloud

Cloud resources (Fig. 1, bottom) consist of cloud computational and storage components as specified by users on the GenomeVIP server. Here, Amazon's Elastic Compute Cloud (EC2) computational infrastructure hosts a virtual cluster instantiated by GenomeVIP from a prebuilt run-time virtual machine (VM) image containing well-established variant discovery tools—VarScan (Koboldt et al. 2009, 2012, 2013), Pindel (Ye et al. 2009), GATK (McKenna et al. 2010; DePristo et al. 2011; Van der Auwera et al. 2013), BreakDancer (Chen et al. 2009; Fan et al. 2014), Genome STRiP (Handsaker et al. 2011), and Strelka (Saunders et al. 2012)—with supporting genomics software and the S3 Tools cloud toolkit for interacting with the AWS Simple Storage Service (S3) cloud (Sarna 2011). GenomeVIP automatically mounts volumes from Amazon's Elastic Block Store (EBS) resource containing any user-specified BAM alignment and reference genome files to the cluster. Examples of mountable EBS volumes include encrypted volumes created by users, data providers, or collaborators under the given AWS account and any public, unencrypted volumes. (GenomeVIP does not address access control lists [ACLs] for data, but we note their use at the AWS account level may help to enforce regulatory data access and usage policies.) Files located in S3 are another source of input. When executed, the master script automatically submits work scripts as batch jobs to the local job resource manager (e.g., SGE on AWS and LSF on local clusters) for handling job concurrency and dependencies. As work units are completed, GenomeVIP transmits raw and final results to the specified storage location (i.e., S3 storage on AWS or a results directory on local clusters).

GenomeVIP components

The functional units within GenomeVIP collectively implement variant analysis of genomic data (Fig. 2). They encompass sets of public tools most trusted and most often relied upon by investigators for their respective tasks.

Figure 2.

GenomeVIP workflows. Three variant-discovery pipelines (germline, somatic, and de novo) with predicted variant types, including single-nucleotide variants (SNVs), insertions and deletions (indels), structural variants (SVs); selected filtering features; and post-discovery annotation options provided by third-party software packages having knowledge of catalogs of genetic variation.

Variant detection

GenomeVIP deploys pipelines for germline, somatic, and de novo variant discovery with helper modules for performing filtering and annotation. The platform incorporates well-established, widely used tools, i.e., VarScan, GATK, Strelka, BreakDancer, Pindel, MuTect2, and Genome STRiP, for the detection of single-nucleotide variants (SNVs), small insertions and deletions (indels), and structural variants (SVs) that include inversions, copy-number variation, tandem duplications, and inter- and intra-chromosomal translocations. Results are reported in VCF format.

Variant filtering

GenomeVIP provides several user-adjustable filtering modules, which encapsulate various tool-specific methods or rule-based heuristics, to refine raw variant calls. Such filters include VarScan's own germline and somatic high-confidence filter as well as a new de novo/family trio filter we developed that considers the maximum total number of supporting reads in the parents. The BreakDancer filter we developed removes somatic calls having read support in the normal or de novo calls having read support in either parent. Pindel analysis incorporates a unified germline, somatic, and de novo filter we developed that considers read coverage, variant allele frequency, strandedness, read support, and homopolymer repeat length. Further details on these three new filters are provided in Methods. GenomeVIP also provides dbSNP (NCBI v.142, GRChr37 and GRCh38) and a false-positive filter as high-level filters. These methods are based in part on published (Xu et al. 2014) best practices for these tools, our results in the ICGC-TCGA DREAM Somatic Mutation Calling challenge (http://www.synapse.org/#!Synapse:syn312572/wiki/58893), and our germline and somatic calling in the ICGC-TCGA Pan-Cancer Analysis of Whole Genomes Somatic Mutation Calling Challenge (Pilot-63) Validation project (http://www.synapse.org/#!Synapse:syn2875157). Finally, GenomeVIP can accept user-designated VCF files for filtering against a pan-normal data set.

Variant annotation

GenomeVIP furnishes several annotation methods: dbSNP, which provides various information on all known short sequence variation; the Ensembl Variant Effect Predictor system (VEP) (McLaren et al. 2010), including its interfaces to individual tools like SIFT (Ng and Henikoff 2003) and PolyPhen (Adzhubei et al. 2013), which provides information on the impact of variants on, e.g., genes, transcripts, protein sequences, and regulatory regions; and ClinVar (Landrum et al. 2016), which provides interpretations of the clinical significance of variants.

GenomeVIP dynamic interface

GenomeVIP session configuration spans the spectrum of running “out of the box” almost wholly on defaults to accepting customized user instructions for practically every step. Following modern user application programming practices, the graphical interface is highly menu-driven. A navigational menu summarizes the core steps of the configuration and execution process (Fig. 3, top). Activation of any of these items in the interface (via user mouse click) updates an adjacent display panel to show the corresponding options (Fig. 3A–F), each of which is described in more detail below.

Figure 3.

GenomeVIP screenshots. (A) Accounts. Presentation of the user's valid Amazon Web Services causes GenomeVIP to generate a semipersistent sessionID used to store or recall previous cloud resource configurations. (B) Select Genomes. A user-uploaded file listing sequence alignment, reference, and index files is parsed and displayed for item selection. (C) Quick Setup tab configuration for loading a built-in execution profile with predefined tools and parameters (Step 1, option 1); a profile may alternatively be uploaded via the interface (Step 1, option 2). Predefined genomic regions may be selected or uploaded via the interface (Step 2). Clicking the Apply Profile button (Step 3) configures tools listed under the other tabs (gray) with the current predefined profile and regions, which may be subsequently modified manually under the other tabs. (D) Post-discovery Analysis. Selection of filters and annotation as part of the execution profile, showing the expanded false-positives filter panel (gray) for customization. (E) Submit. Resource management options are provided to create new or reuse existing computing instances and cloud storage location. Buttons to preview, download, or error-check the current execution profile, or to submit it as a computation, are available. (F) Results. An Amazon cloud storage file listing showing folders for tools’ outputs, job status, and results. Files .sh and .ep represent the master script describing the computation's workflow and the execution profile, respectively.

Accounts

First, users provide their login credentials for Amazon Web Services (AWS) or a local high-performance compute cluster, depending on where the input data are located and thus where the computations are to be run (Fig. 3A). GenomeVIP employs semipersistent “sessions” to facilitate reuse of instantiated AWS computing resources. For local clusters, users may submit jobs to specific hosts and queues.

Select genomes

Next, users specify genomic data sources (Fig. 3B). GenomeVIP recognizes sequence alignment files in BAM format and reference genomes in FASTA or compressed FASTA format. On AWS, users may provide cloud storage volumes with prepared file lists or opt to use one of the prepared 1000 Genomes Project donor sample sets (i.e., pilot phase; phase 1 low-coverage or exome; or phase 3 low-coverage, high-coverage, or exome). On local clusters, GenomeVIP can obtain remote directory file listings directly (via SSH secure shell). The server parses the data sources and presents the alignment and reference files for selection. GenomeVIP notes any missing index or dictionary files and generates directives to create them at run time as necessary. This framework is sufficiently general to handle nonhuman genomes wherever supported by the variant detection tools.

Execution profile

In this series of tabbed panels (Fig. 3C,D), users design an “execution profile” consisting of their choice of variant detection, filtering, and annotation tools. The “Quick Setup” tab (Fig. 3C) provides access to several built-in execution profiles, based on best practices and our own experience that can afford high-quality calls. A Run Mode field sets the study type (i.e., germline, somatic, or de novo) and filtering and annotation options, and a Parameters field accounts for the sequence data type (i.e., whole-genome or exome), depth of coverage (i.e., low or targeted), and palette of applied filters. Users may also upload execution profiles through their web browser, enabling them to readily reuse settings from prior computations to ensure consistency across multiple runs. Users can select from predefined chromosome sets or specify a custom list of chromosomes and/or regions by entering details directly or by uploading a list file of regions. Options within the individual tools’ tabs or the Post-discovery Analysis tab allow customization of the current profile (Fig. 3D). For example, the user can include or exclude particular discovery tools from the analysis or modify the more commonly altered discovery parameters. GenomeVIP also utilizes collapsible panels (e.g., Fig. 3D, “Options”) to provide users with access to other command line parameters offered by particular tools. Modules for performing false-positives filtering, including filtering by a user-supplied panel-of-normals VCF file, and annotation by dbSNP, VEP (McLaren et al. 2010), and ClinVar (Landrum et al. 2016) are available.

Submit

In the final steps of preparing a computation, users select the computing resource and finalize the execution profile (Fig. 3E). On AWS, users select whether to launch a new virtual computing cluster or reuse an existing cluster and specify the top-level destination (S3 “bucket”) for storing results and supporting output (see below). On local clusters, users designate the working directory in which the computations are executed. Buttons to preview, validate (or error-check), or download the current execution profile are available. These functions also are available to users throughout the configuration process to assist them in preparing their job so that it executes as intended. Submitting the computation causes a final validation check to be performed and, if successful, a summary of the submission is displayed with a jobID for identifying the computation.

Results

GenomeVIP places various outputs in the location specified by the user during the Submit step. As shown in Figure 3F, the main results are placed in a “Results” folder, and the raw outputs and intermediate results are stored in folders corresponding to individual variant tools. As the time needed to complete an analysis is expected to vary significantly depending on the type of analysis and the complexity of the sample, GenomeVIP uses a “status” subdirectory with sentinel files indicating which discrete jobs in the workflow are unfinished. The main location contains a copy of workflow script (file: *.sh) and the execution profile (file: *.ep) serving as a record of the computation. Execution profiles may be uploaded to GenomeVIP in future computations to ensure uniformity in processing across multiple runs. Users may obtain results and job status information for AWS computations either by providing the S3 storage location and jobID of the computation to a running GenomeVIP server, or by logging into the AWS Console (“dashboard”) to navigating to the S3 storage service. Either of these methods allows users to download files through the browser using secure HTTPS protocols. Access to results and job status for computations run on local clusters can be obtained in an analogous manner to AWS by providing the working directory of the computation to a running GenomeVIP server or via command-line login to the user's cluster account.

Options

Features in this panel (not shown) allow users to manage several advanced features of Amazon cloud operations, such as deactivating the use of encrypted streams for data in transit to and from AWS S3 cloud storage and terminating instantiated AWS clusters. Users may launch an updated GenomeVIP runtime machine image instead of the server's built-in default by providing the corresponding Amazon machine image (AMI) identification tag.

Discussion

Use with big data

In addition to numerous smaller-scale projects, GenomeVIP has been used in several major projects, two of which are listed in Table 1: the International Cancer Genome Consortium (ICGC) Somatic Pilot and the TCGA PanCan Germline projects. In the ICGC project, GenomeVIP invoked the native somatic calling capability of VarScan and Strelka and directed Pindel to generate calls for each matched pair in cohort mode. Generating these calls consumed more than 9000 CPU-hours, with the pairs requiring on average 11.7, 40.8, and 135 CPU-hours for VarScan, Strelka, and Pindel, respectively. In the TCGA PanCan Atlas germline analysis project, more than 8000 samples across 22 tumor types were processed individually with VarScan, GATK, and Pindel using our in-house pipeline. Comparison of these local-based calls to those generated more recently on cloud resources have produced excellent recall rates for insertions and deletions (∼96%) and for single-nucleotide variants (>99%), the difference being largely attributable to the use of more recent tool versions. The tool-specific CPU usage here was 8.2, 5.4, and 10.2 CPU-hours for VarScan, GATK, and Pindel, respectively.

Table 1.

Examples of large-scale projects utilizing GenomeVIP

Comparison to existing cloud pipelines

Genomic sequencing has long been supported by automated processing of raw data (Wendl et al. 1998); as a result, a variety of systems, many of which are deployable on cloud computing resources, have become available with varying levels of capability, user convenience, and sophistication. Early entries only targeted specific parts within the larger process, e.g., applying cloud technologies to the genome alignment process (Langmead et al. 2009; Schatz 2009; Wall et al. 2010), but recent work has progressed toward solutions that are more complete. The positioning of GenomeVIP within this milieu has been guided by its underlying design goal of furnishing an intuitive, graphics-based system to the nonspecialist biomedical investigator for harnessing well-established, task-specific, external tools to analyze WGS/WES data for somatic, germline, and de novo variants using on-demand cloud resources. A comparison of features across a selection of comparable systems is presented in Table 2.

Table 2.

Brief comparison of variant discovery frameworks

Intuitive, web-based control

Special bioinformatics skills related to transferring large files, database creation, programming, or algorithms are required to use systems such as Atlas2 Genboree (Evani et al. 2012), the COSMOS library (Gafni et al. 2014), or the CloudBioLinux tool set (Krampis et al. 2012), and familiarity with Unix/Linux command-line functionality is needed to easily run tools like TREVA (Li et al. 2014), TOGGLE (Monat et al. 2015), HugeSeq (Lam et al. 2012), GotCloud (Jun et al. 2015), or Churchill (Kelly et al. 2015). Many systems also have more subtle aspects of designing a calculation, for example, manually discretizing genomic regions for parallelization (Afgan et al. 2010). GenomeVIP enables complete specification of job execution entirely by web-interface menu prompting, making it easier for users without considerable bioinformatics experience to undertake genomic analyses.

Vetted tools

Some approaches rely on native methods for processing, e.g., Atlas-SNP2 and Atlas-Indel2 in the Atlas2 Cloud and Mercury (Reid et al. 2014) systems, and almost all lack capability for managing the full array of biomedically relevant variant types. For example, most systems (Evani et al. 2012; Li et al. 2014; Jun et al. 2015; Kelly et al. 2015) cannot handle SVs, and none treat any type of complex indels. GenomeVIP follows a UNIX-like philosophy of recruiting only highly vetted, task-specific external tools and porting them for cloud compatibility. GenomeVIP runs each tool with fully specified parameter lists to guard against inadvertent changes of defaults. It also supports the reuse of custom execution profiles for repeating an analysis, which may be helpful for performing longitudinal studies on the same patient or for simply dividing a large data set into smaller computational sets.

Data privacy protection

The GenomeVIP server uses secure HTTPS by default, with encryption specifications consistent with HIPAA standards, to communicate with the user's web browser. The StarCluster toolkit uses secure shell (SSH) encrypted protocol for communicating with and transferring files to/from Amazon. The S3 Tools toolkit is configured by GenomeVIP by default to use HTTPS when transmitting data between computing resources and S3 cloud storage and to request server-side encryption (AES-256 protocol) be applied to new data stored at rest in S3. When used with a local cluster, GenomeVIP uses SSH when accessing the remote file system and when transferring files to/from the cluster.

Example comparison

We compared GenomeVIP directly to another cloud-based analysis system, namely GotCloud (Jun et al. 2015). As the latter does not readily handle somatic or trio analysis, we have limited the comparison to germline variants. In particular, we used GotCloud to repeat the germline analysis of the nine cases from the 1000 Genomes Project discussed above. Downstream analysis revealed that dbSNP concordances of GotCloud-generated calls were 99.0%–99.7% for SNVs and 88.5%–93.0% for indels, both of whose ranges are comparable to the results from GenomeVIP (Fig. 4F).

Figure 4.

Applications of GenomeVIP. (A) Principal component analysis of germline SNV and indel predictions for nonrelated 1000 Genomes Project Phase 1 samples from three populations: (red) CHB; (green) FIN; (blue) YRI. (B) True-positive (TP) and false-positive (FP) rates for somatic SNV calls novel to dbSNP. Performance of VarScan and Strelka callers individually (red, blue) and in combination (green, purple) are evaluated before and after exploratory false-positives filtering using multiple parameter combinations, in which VSR is the minimum number of variant-supporting reads. (C) GenomeVIP performance on ICGC Pan-Cancer Pilot-50 somatic mutation calling for one matched sample pair, in which the colors correspond to the number of pipelines predicting the same variant. (D) Performance statistics. (E) De novo recall performance (blue), as compared to published experimental validation results, and filtered call set size (red) for SNV calling in NA12878 as a function of PVSR, the number of variant-supporting reads in parental genomes NA12891 and NA12892. (F) dbSNP concordances of germline SNVs and indels, as called by GenomeVIP (darker shading) and GotCloud (lighter shading), for the samples described in A.

Methods

GenomeVIP genomic applications

GenomeVIP includes tools to perform germline, somatic, and de novo variant discovery and annotation. We illustrate these capabilities with three examples: (1) germline variant discovery from exome samples from a large cohort; (2) somatic variant detection on a synthetic matched tumor/normal sample pair; and (3) de novo analysis of a well-studied family trio from the 1000 Genomes Project. Parameters used are available in Supplemental Information. In each case, local and cloud deployments of GenomeVIP produced identical raw calls. We also confirmed the parallelizability of genomic regions by performing cloud computations over entire chromosomes and local computations by processing 10-Mb windows spanning the entire genome. Performance and compute statistics are reported in Figure 4D and Supplemental Information.

Germline

We selected nine unrelated donors from the 1000 Genomes Project (Abecasis et al. 2010), three each from three populations (CHB, FIN, YRI) and directed GenomeVIP to launch VarScan and Pindel variant callers to perform SNV and indel discovery on Chromosome 20 using discovery parameters used in a previous germline analysis (Kanchi et al. 2014). Raw calling performance, as measured by dbSNP concordance, revealed SNV concordances ranging from 97.0% to 98.1% and indel concordances ranging from 92.2% to 96.2%. To validate the germline variants, we then conducted a downstream principal component analysis of the SNVs using PLINK (Chang et al. 2015) and found that analysis faithfully recapitulated the population structures of these three ethnic groups (Fig. 4A).

Somatic

We called somatic mutations in the matched tumor/normal synthetic DREAM-3 samples, the most complex of the open-access data sets from the ICGC-TCGA DREAM Somatic Mutation Calling (SMC) Challenge (Ewing et al. 2015). We directed GenomeVIP to generate raw calls using VarScan and Strelka with limited filtering to enable downstream exploration of the effect of selected false-positive filtering parameters, namely, the number of supporting reads and read mapping qualities (Methods). Calculations of true-positive (TP) and false-positive (FP) rates, based on unmasked regions of a known synthetically generated tumor's “truth” set (Ewing et al. 2015), for multiple combinations of parameter sets for the individual, intersected, and combined filtered call sets, is plotted on receiver operating characteristic (ROC) curves (Fig. 4B). Comparison of unfiltered, novel calls reveals comparable TP but significantly different FP rates for VarScan and Strelka callers. Although Strelka's TP rates (ranging from 0.784 to 0.789) and FP rates (ranging from 0.011 to 0.016) did not appreciably change across the false-positives parameter landscape investigated, increasing the number of required variant-supporting reads (VSR) to four for filtering VarScan calls dramatically decreased FP rates while also modestly decreasing TP rates. This contrast in behavior is a result of VarScan having produced manifold more putative calls than Strelka in the “unmasked” genomic regions targeted by the evaluator script. Finally, the intersections of calls from the two callers were found to be largely invariant (TP, ∼62.7%; FP ∼0.94%), supporting the view that combining calls from multiple callers may be an effective strategy to identify a core set of high-quality calls while mitigating against a significant portion of false positives (Cantarel et al. 2014). This exercise also serves as a case study, showing the creation of expert knowledge and experience that investigators may wish to capture for subsequent analyses, such as maximizing sensitivity for discovery or specificity for diagnostic purposes. Users can then provide these optimal parameter values as previously described to perform a complete run. A similar approach was used to process WGS tumor/normal matched pairs from more than 10 different cancer types for the ICGC Pan-Cancer project. In the initial “Pilot 50” project, GenomeVIP produced 4.4 million SNV, 94,000 indel, and 100,000 SV filtered calls (Table 1) that we submitted for evaluation for selection for validation experiments. The comparison between GenomeVIP's predictions and those from eight other standard pipelines showed that GenomeVIP produces results highly concordant with noncloud pipelines (Fig. 4C), suggesting high-quality, reproducible analysis.

De novo SNV mutations in family trios

We analyzed the NA12891-NA12892-NA12878 family trio samples for de novo single-nucleotide variants using VarScan and associated filtering modules provided by GenomeVIP and compared the results to experimentally validated germline and somatic de novo variants (Conrad et al. 2011). GenomeVIP attained 97.8% recall (979 of 1001 experimentally validated sites) after predicted false positives were removed. The additional effect of applying a variant read-support filter for the parental genomes showed that exclusion of variant support in the parents (PVSR = 0) yielded the smallest GenomeVIP call set (N = 2431) with a recall rate of 95.0% (Fig. 4E, blue curve); in this set, 41 calls were validated experimentally as false positives. Increasing PVSR resulted in larger GenomeVIP variant call sets by at least 13% (Fig. 4E, red curve), while improving recall rates only marginally. For example, at PVSR = 2, at which the recall rate has plateaued, GenomeVIP made 3302 final calls, of which 42 validated as false positives. In this example, the highest balanced accuracy is likely to be obtained for values of PVSR at or near zero.

Software availability

The GenomeVIP source code and associated scripts are freely available for academic use and are available through GitHub (https://github.com/ding-lab/GenomeVIP/) and as Supplemental Code. Users having Amazon Web Services (AWS) login credentials can launch a GenomeVIP server on the AWS cloud by instantiating server images located in the AWS public repository. GenomeVIP executes computations using a public run-time AWS image providing a nearly complete set of the required genomics software. Software packages on these images carry their own licensing and usage terms. For example, GenomeVIP requires users to provide the location of their own licensed copy of GATK (version 3.5 and higher supported). Installation of GenomeVIP on a local web server allows users to design and execute computations on a local high-performance compute cluster using local data or on Amazon's cloud using data stored at AWS; furthermore, GenomeVIP's internal configuration files may be edited manually to point to local versions of the tools on which GenomeVIP depends. GenomeVIP orchestrates many tools that are all upgraded independently, and these updates will be passed through to GenomeVIP users in the following ways: Tool updates not requiring user interface modification will be distributed in updated runtime images, but those requiring such modification will require an updated server image. Tool versions will accumulate rather than be replaced to preserve backward compatibility. Database updates will be managed similarly, although the user can alternatively specify the location of public or custom annotation VCFs that GenomeVIP can retrieve via FTP/HTTP or from Amazon cloud storage. Documentation, support, and further information is available through GitHub and the GenomeVIP home page at our Turnkey Variant Analysis Project website (http://tvap.genome.wustl.edu/tools/genomevip/).

41 in total

1. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.

Authors: Daniel C Koboldt; Qunyuan Zhang; David E Larson; Dong Shen; Michael D McLellan; Ling Lin; Christopher A Miller; Elaine R Mardis; Li Ding; Richard K Wilson
Journal: Genome Res Date: 2012-02-02 Impact factor: 9.043

2. VarScan: variant detection in massively parallel sequencing of individual and pooled samples.

Authors: Daniel C Koboldt; Ken Chen; Todd Wylie; David E Larson; Michael D McLellan; Elaine R Mardis; George M Weinstock; Richard K Wilson; Li Ding
Journal: Bioinformatics Date: 2009-06-19 Impact factor: 6.937

3. The case for cloud computing in genome informatics.

Authors: Lincoln D Stein
Journal: Genome Biol Date: 2010-05-05 Impact factor: 13.583

4. Detecting and annotating genetic variations using the HugeSeq pipeline.

Authors: Hugo Y K Lam; Cuiping Pan; Michael J Clark; Phil Lacroute; Rui Chen; Rajini Haraksingh; Maeve O'Huallachain; Mark B Gerstein; Jeffrey M Kidd; Carlos D Bustamante; Michael Snyder
Journal: Nat Biotechnol Date: 2012-03-07 Impact factor: 54.908

5. Variation in genome-wide mutation rates within and between human families.

Authors: Donald F Conrad; Jonathan E M Keebler; Mark A DePristo; Sarah J Lindsay; Yujun Zhang; Ferran Casals; Youssef Idaghdour; Chris L Hartl; Carlos Torroja; Kiran V Garimella; Martine Zilversmit; Reed Cartwright; Guy A Rouleau; Mark Daly; Eric A Stone; Matthew E Hurles; Philip Awadalla
Journal: Nat Genet Date: 2011-06-12 Impact factor: 38.330

6. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline.

Authors: Geraldine A Van der Auwera; Mauricio O Carneiro; Christopher Hartl; Ryan Poplin; Guillermo Del Angel; Ami Levy-Moonshine; Tadeusz Jordan; Khalid Shakir; David Roazen; Joel Thibault; Eric Banks; Kiran V Garimella; David Altshuler; Stacey Gabriel; Mark A DePristo
Journal: Curr Protoc Bioinformatics Date: 2013

7. Predicting functional effect of human missense mutations using PolyPhen-2.

Authors: Ivan Adzhubei; Daniel M Jordan; Shamil R Sunyaev
Journal: Curr Protoc Hum Genet Date: 2013-01

8. Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics.

Authors: Benjamin J Kelly; James R Fitch; Yangqiu Hu; Donald J Corsmeier; Huachun Zhong; Amy N Wetzel; Russell D Nordquist; David L Newsom; Peter White
Journal: Genome Biol Date: 2015-01-20 Impact factor: 13.583

9. Integrated analysis of germline and somatic variants in ovarian cancer.

Authors: Krishna L Kanchi; Kimberly J Johnson; Charles Lu; Michael D McLellan; Mark D M Leiserson; Michael C Wendl; Qunyuan Zhang; Daniel C Koboldt; Mingchao Xie; Cyriac Kandoth; Joshua F McMichael; Matthew A Wyczalkowski; David E Larson; Heather K Schmidt; Christopher A Miller; Robert S Fulton; Paul T Spellman; Elaine R Mardis; Todd E Druley; Timothy A Graubert; Paul J Goodfellow; Benjamin J Raphael; Richard K Wilson; Li Ding
Journal: Nat Commun Date: 2014 Impact factor: 14.919

10. Searching for SNPs with cloud computing.

Authors: Ben Langmead; Michael C Schatz; Jimmy Lin; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-11-20 Impact factor: 13.583

5 in total

Review 1. Technological advances in cancer immunity: from immunogenomics to single-cell analysis and artificial intelligence.

Authors: Ying Xu; Guan-Hua Su; Ding Ma; Yi Xiao; Zhi-Ming Shao; Yi-Zhou Jiang
Journal: Signal Transduct Target Ther Date: 2021-08-20

2. Identification and validation of a novel eight mutant-derived long non-coding RNAs signature as a prognostic biomarker for genome instability in low-grade glioma.

Authors: Aierpati Maimaiti; Xixian Wang; Yinan Pei; Nuerbiye Nuermaimaiti; Abudireheman Tuersunniyazi; Yaeraili Abula; Zhaohai Feng; Lei Jiang; Xin Shi; Maimaitijiang Kasimu
Journal: Aging (Albany NY) Date: 2021-06-03 Impact factor: 5.682

3. DriverDBv3: a multi-omics database for cancer driver gene research.

Authors: Shu-Hsuan Liu; Pei-Chun Shen; Chen-Yang Chen; An-Ni Hsu; Yi-Chun Cho; Yo-Liang Lai; Fang-Hsin Chen; Chia-Yang Li; Shu-Chi Wang; Ming Chen; I-Fang Chung; Wei-Chung Cheng
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

4. A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce.

Authors: Muhammad Tahir; Muhammad Sardaraz
Journal: Genes (Basel) Date: 2020-02-05 Impact factor: 4.096

5. Scalability and cost-effectiveness analysis of whole genome-wide association studies on Google Cloud Platform and Amazon Web Services.

Authors: Inès Krissaane; Carlos De Niz; Alba Gutiérrez-Sacristán; Gabor Korodi; Nneka Ede; Ranjay Kumar; Jessica Lyons; Arjun Manrai; Chirag Patel; Isaac Kohane; Paul Avillach
Journal: J Am Med Inform Assoc Date: 2020-07-27 Impact factor: 4.497

5 in total