R Jay Mashl, Adam D Scott, Kuan-Lin Huang, Matthew A Wyczalkowski, Christopher J Yoon, Beifang Niu, Erin DeNardo, Venkata D Yellapantula, Robert E Handsaker, Ken Chen, Daniel C Koboldt, Kai Ye, David Fenyö, Benjamin J Raphael, Michael C Wendl, Li Ding.
Abstract
Identifying genomic variants is a fundamental first step toward the understanding of the role of inherited and acquired variation in disease. The accelerating growth in the corpus of sequencing data that underpins such analysis is making the data-download bottleneck more evident, placing substantial burdens on the research community to keep pace. As a result, the search for alternative approaches to the traditional "download and analyze" paradigm on local computing resources has led to a rapidly growing demand for cloud-computing solutions for genomics analysis. Here, we introduce the Genome Variant Investigation Platform (GenomeVIP), an open-source framework for performing genomics variant discovery and annotation using cloud- or local high-performance computing infrastructure. GenomeVIP orchestrates the analysis of whole-genome and exome sequence data using a set of robust and popular task-specific tools, including VarScan, GATK, Pindel, BreakDancer, Strelka, and Genome STRiP, through a web interface. GenomeVIP has been used for genomic analysis in large-data projects such as the TCGA PanCanAtlas and in other projects, such as the ICGC Pilots, CPTAC, ICGC-TCGA DREAM Challenges, and the 1000 Genomes SV Project. Here, we demonstrate GenomeVIP's ability to provide high-confidence annotated somatic, germline, and de novo variants of potential biological significance using publicly available data sets.
Year: 2017 PMID: 28522612 PMCID: PMC5538560 DOI: 10.1101/gr.211656.116
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. GenomeVIP platform. GenomeVIP consists of three components (web browser, server host, cloud), coordinated by various scripting languages (blue) and cloud toolkits (green). Interactive web pages, written in HTML (with CSS elements) and JavaScript, provide front-end functionality. JQuery is a JavaScript library providing methods to modify web page content with cross-browser compatibility. Server-side PHP modules utilize StarCluster and S3 Tools cloud toolkits to access EC2 Compute (gray) and storage resources (yellow) in the cloud. GenomeVIP creates within EC2 a virtual cluster, based on a machine image with preinstalled variant detection tools and supporting software (collectively, “Genomics Tools”) (red), that can access sequence data on S3 and EBS (Elastic Block Storage) resources (yellow). Secure channels using HTTPS and secure shell (SSH) protocols allow communication between various components. Resulting variant call files stored in S3 are accessible via the GenomeVIP interface or the Amazon S3 Console.
Figure 2. GenomeVIP workflows. Three variant-discovery pipelines (germline, somatic, and de novo) with predicted variant types, including single-nucleotide variants (SNVs), insertions and deletions (indels), structural variants (SVs); selected filtering features; and post-discovery annotation options provided by third-party software packages having knowledge of catalogs of genetic variation.
Figure 3. GenomeVIP screenshots. (A) Accounts. Presentation of the user's valid Amazon Web Services credentials causes GenomeVIP to generate a semipersistent sessionID used to store or recall previous cloud resource configurations. (B) Select Genomes. A user-uploaded file listing sequence alignment, reference, and index files is parsed and displayed for item selection. (C) Quick Setup tab configuration for loading a built-in execution profile with predefined tools and parameters (Step 1, option 1); a profile may alternatively be uploaded via the interface (Step 1, option 2). Predefined genomic regions may be selected or uploaded via the interface (Step 2). Clicking the Apply Profile button (Step 3) configures tools listed under the other tabs (gray) with the current predefined profile and regions, which may be subsequently modified manually under the other tabs. (D) Post-discovery Analysis. Selection of filters and annotation as part of the execution profile, showing the expanded false-positives filter panel (gray) for customization. (E) Submit. Resource management options are provided to create new or reuse existing computing instances and cloud storage location. Buttons to preview, download, or error-check the current execution profile, or to submit it as a computation, are available. (F) Results. An Amazon cloud storage file listing showing folders for tools’ outputs, job status, and results. Files .sh and .ep represent the master script describing the computation's workflow and the execution profile, respectively.
Examples of large-scale projects utilizing GenomeVIP
Brief comparison of variant discovery frameworks
Figure 4. Applications of GenomeVIP. (A) Principal component analysis of germline SNV and indel predictions for nonrelated 1000 Genomes Project Phase 1 samples from three populations: (red) CHB; (green) FIN; (blue) YRI. (B) True-positive (TP) and false-positive (FP) rates for somatic SNV calls novel to dbSNP. Performance of VarScan and Strelka callers individually (red, blue) and in combination (green, purple) are evaluated before and after exploratory false-positives filtering using multiple parameter combinations, in which VSR is the minimum number of variant-supporting reads. (C) GenomeVIP performance on ICGC Pan-Cancer Pilot-50 somatic mutation calling for one matched sample pair, in which the colors correspond to the number of pipelines predicting the same variant. (D) Performance statistics. (E) De novo recall performance (blue), as compared to published experimental validation results, and filtered call set size (red) for SNV calling in NA12878 as a function of PVSR, the number of variant-supporting reads in parental genomes NA12891 and NA12892. (F) dbSNP concordances of germline SNVs and indels, as called by GenomeVIP (darker shading) and GotCloud (lighter shading), for the samples described in A.
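The quantities in panels B and E of Figure 4 can be illustrated with a minimal Python sketch. It is not GenomeVIP code, and everything in it is an assumption made for illustration: the function names, the toy variant records, and the definitions themselves. In particular, the TP rate is taken as the fraction of truth-set variants recovered and the FP rate as the fraction of calls absent from the truth set, since the caption does not state the denominators. PVSR is treated as a maximum allowed count of variant-supporting reads summed over both parents.

```python
from typing import Dict, Set, Tuple

# A variant is keyed by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def filter_by_vsr(calls: Dict[Variant, int], min_vsr: int) -> Set[Variant]:
    """Keep calls with at least min_vsr variant-supporting reads."""
    return {v for v, vsr in calls.items() if vsr >= min_vsr}


def rates(calls: Set[Variant], truth: Set[Variant]) -> Tuple[float, float]:
    """Return (tp_rate, fp_rate) under the assumed definitions:
    tp_rate = |calls & truth| / |truth|, fp_rate = |calls - truth| / |calls|."""
    tp = len(calls & truth)
    fp = len(calls - truth)
    tp_rate = tp / len(truth) if truth else 0.0
    fp_rate = fp / len(calls) if calls else 0.0
    return tp_rate, fp_rate


def filter_de_novo(child_calls: Set[Variant],
                   parental_vsr: Dict[Variant, int],
                   max_pvsr: int) -> Set[Variant]:
    """Keep child calls whose parents together show at most max_pvsr
    variant-supporting reads (assumed reading of PVSR)."""
    return {v for v in child_calls if parental_vsr.get(v, 0) <= max_pvsr}


# Toy data (hypothetical): supporting-read counts per call for two callers.
truth = {("1", 100, "A", "T"), ("1", 200, "G", "C"), ("2", 50, "C", "T")}
caller_a = {("1", 100, "A", "T"): 8, ("1", 200, "G", "C"): 3, ("3", 10, "T", "G"): 2}
caller_b = {("1", 100, "A", "T"): 9, ("2", 50, "C", "T"): 5, ("3", 10, "T", "G"): 4}

kept_a = filter_by_vsr(caller_a, min_vsr=3)
kept_b = filter_by_vsr(caller_b, min_vsr=3)

# Two ways of combining callers: union (either) and intersection (both).
print("union:       ", rates(kept_a | kept_b, truth))
print("intersection:", rates(kept_a & kept_b, truth))
```

In this toy example the union recovers more of the truth set at the cost of admitting a false call, while the intersection keeps only calls that both callers support. Sweeping min_vsr or max_pvsr over a range of values would trace out curves of the kind the caption describes.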