STORMSeq: an open-source, user-friendly pipeline for processing personal genomics data in the cloud.

Konrad J Karczewski¹, Guy Haskin Fernald¹, Alicia R Martin¹, Michael Snyder², Nicholas P Tatonetti³, Joel T Dudley⁴.

Abstract

The increasing public availability of personal complete genome sequencing data has ushered in an era of democratized genomics. However, read mapping and variant calling software is constantly improving, and individuals with personal genomic data may prefer to customize and update their variant calls. Here, we describe STORMSeq (Scalable Tools for Open-Source Read Mapping), a graphical interface cloud computing solution that does not require a parallel computing environment or extensive technical experience. This customizable and modular system performs read mapping, read cleaning, and variant calling and annotation. At present, STORMSeq costs approximately $2 and takes 5-10 hours to process a full exome sequence, and costs approximately $30 and takes 3-8 days to process a whole genome sequence. We provide this open-access and open-source resource as a user-friendly interface in Amazon EC2.

Year: 2014; PMID: 24454756; PMCID: PMC3893165; DOI: 10.1371/journal.pone.0084860

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240

Introduction

Individuals are now empowered to obtain and explore their full personal genome and exome sequences owing to declining costs in genome sequencing, and direct-to-consumer genetic testing companies have begun to provide sequencing services: in 2011, 23andMe conducted a pilot exome sequencing program for $999, while at the time of this writing, DNADTC provides the service for $895. Software and algorithms for short read mapping and variant calling are an active area of development, and individuals may prefer to customize which software or parameters to use to process their raw genetic data. However, as these programs require significant computational resources, such a task is generally intractable without access to large-scale computing resources. Furthermore, running the required software pipeline demands proficiency in command-line programming or, alternatively, expensive commercial software geared towards experts. These concerns can be ameliorated by use of intuitive open-source software operating in a cloud-computing environment.

A number of solutions enabling researchers to process sequencing data using cloud computing are available. The majority of open-source, cloud-based tools for genomic data are command-line based and require substantial technical skills to use. Notable exceptions are Galaxy, Crossbow, and SIMPLEX. Galaxy aims to provide a reproducible environment for genome informatics accessible to non-technical investigators [1], but offers a vast array of tools beyond those typically used for processing personal genomic data and requires knowledgeable use of its workflow system. Crossbow provides a scalable framework for mapping and variant calling [2], but is limited to the Bowtie suite, while SIMPLEX requires command-line proficiency [3]. Ideally, by our definition, a user-friendly solution would employ a simple, unified graphical user interface for uploading reads, setting parameters, executing analyses, and downloading and visualizing results.

Implementation

We created STORMSeq (Scalable Tools for Open-source Read Mapping) to fill this need for a user-friendly processing pipeline for personal human whole genome and exome sequence data. STORMSeq uses the Amazon Web Services (http://aws.amazon.com) cloud-computing environment for its implementation, and offers an intuitive interface enabling individuals to perform customized read mapping and variant calling with personal genome data. STORMSeq dissociates the backend computational pipeline from the end-user and provides a simplified point-and-click interface for setting high-level parameters. The system initiates with an optimized default configuration using recent versions of BWA (0.7.5a) and GATK Lite (2.1) as of 11/1/13. Users can then access final processed data and visualize summary statistics without having to load the data into a statistical software package. STORMSeq is a highly secure system entirely encapsulated within the user's Amazon account space, thereby ensuring that only the user has the ability to gain or grant access to their genetic data and results.

STORMSeq's cloud-based architecture is illustrated in Figure 1. The user uploads their reads in FASTQ or BAM format to Amazon S3 (Simple Storage Service) through a graphical interface provided by Amazon Web Services. The STORMSeq website (www.stormseq.org) provides instructions for starting the STORMSeq webserver machine image (AMI v1.0: ami-b35b7cda) within the Amazon cloud computing environment. This STORMSeq webserver is then the entry point for the user to choose software packages and set parameters for the analysis (Figure S1). The system currently offers a complete short read processing pipeline, including:

- Read mapping software packages, including BWA [4], BWA-MEM [5], and SNAP [6]
- Read cleaning pipeline with GATK [7]
- Variant (SNP and indel) calling packages, GATK and Samtools [8]
- Annotation using VEP [9]

Figure 1

Overview of the STORMSeq system.

The user uploads short reads to Amazon S3 and starts a webserver on Amazon EC2, which controls the mapping and variant calling pipeline. Progress can be monitored on the webserver and results are uploaded to persistent storage on Amazon S3.

The system backend is modular, and designed to be easily expandable by researchers wishing to add additional functionality or incorporate other software packages. Once the user has set the relevant parameters (or uses the default set provided) and clicked “GO,” the system starts a compute cluster on the Amazon Elastic Compute Cloud (with the number of machines started related to the number of files uploaded and whether exome or genome analyses are selected) and runs the relevant software.

The use of the software is free, and the user simply pays for compute time and storage on the Amazon servers. As of 11/1/13, spot instances cost $0.026 per hour for the (large) systems required for BWA and $0.14 per hour for the (quadruple extra-large) high-memory systems required for SNAP, and persistent storage of reads and variant call results costs $0.095 per GB-month.

As the pipeline progresses, a progress bar is updated on the webserver. Once the pipeline is finished, summary statistics, such as depth of coverage and other variant information, and visualizations using ggbio [10] and d3 [11], are displayed on the webserver (Figure 2). Processing is parallelized where possible using StarCluster (http://mit.edu/star/cluster) and Sun Grid Engine. The results are then uploaded back to Amazon S3 for persistent storage.
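
As a rough illustration of the pricing quoted above, the following Python sketch estimates compute and storage charges from the stated spot and storage rates. The function name, the cluster size, the run length, and the stored volume are hypothetical values chosen for illustration; the sketch also assumes that every instance is billed for the full wall-clock duration, which is a simplification and not a description of how STORMSeq itself sizes or bills a cluster.

```python
# Spot and storage prices quoted in the text (as of 11/1/13), in USD.
SPOT_PRICE_PER_HOUR = {
    "bwa": 0.026,   # "large" instances used for BWA
    "snap": 0.14,   # quadruple extra-large high-memory instances used for SNAP
}
STORAGE_PRICE_PER_GB_MONTH = 0.095  # persistent storage on Amazon S3


def estimate_cost(aligner, instances, hours, stored_gb=0.0, months=0.0):
    """Return (compute_usd, storage_usd) for a hypothetical run.

    Assumption: every instance is billed for the whole run at a fixed
    spot price. Real spot prices fluctuate and clusters may change size.
    """
    compute = SPOT_PRICE_PER_HOUR[aligner] * instances * hours
    storage = STORAGE_PRICE_PER_GB_MONTH * stored_gb * months
    return round(compute, 2), round(storage, 2)


# Hypothetical example: 8 large instances running BWA for 10 hours,
# with 20 GB of reads and results kept on S3 for one month.
compute_usd, storage_usd = estimate_cost(
    "bwa", instances=8, hours=10, stored_gb=20, months=1
)
print(f"compute: ${compute_usd:.2f}, storage: ${storage_usd:.2f}")
```

Separating compute from storage in the return value mirrors the way the reported pipeline costs exclude storage charges.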

Figure 2

Sample output.

STORMSeq provides basic visualization for summary statistics, such as (A) genome-wide SNP density and (B) size distribution of short indels.
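
To make the indel summary in panel (B) concrete, the sketch below tallies signed indel lengths from VCF records by comparing the REF and ALT columns. This is a minimal illustration under stated assumptions, not the code STORMSeq uses to produce its plots: the function name `indel_sizes` and the example records are hypothetical, and symbolic or missing alleles are simply skipped.

```python
from collections import Counter


def indel_sizes(vcf_lines):
    """Count signed indel lengths: positive for insertions, negative for deletions."""
    sizes = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]  # REF and ALT columns
        for alt in alts.split(","):
            if alt in (".", "*") or alt.startswith("<"):
                continue  # skip missing, spanning-deletion, and symbolic alleles
            delta = len(alt) - len(ref)
            if delta != 0:  # equal lengths are substitutions, not indels
                sizes[delta] += 1
    return sizes


# Hypothetical records for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\tPASS\t.",
    "1\t2000\t.\tA\tATT\t50\tPASS\t.",
    "1\t3000\t.\tGCA\tG\t50\tPASS\t.",
    "2\t4000\t.\tC\tCT,CTT\t50\tPASS\t.",
]
print(sorted(indel_sizes(example).items()))
```

The resulting counts per length are the kind of data a histogram of short indel sizes would be drawn from.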

Results and Discussion

We tested the STORMSeq system using two paired-end 100 bp read datasets: a personal genome sequence dataset with 1.1B reads (approximately 38X coverage), and a personal exome sequence dataset with 90M reads (approximately 45X coverage; available in STORMSeq's demo functionality). For the personal exome data, the pipeline cost approximately $2 USD using spot pricing and took 10 hours using BWA and 5 hours using SNAP (Table 1; Figure S2). For personal genome sequence data, BWA and SNAP took 176 and 82 hours for processing, respectively, each at a cost of approximately $30 USD (Table 1; Figure S3). Note that these values do not include storage costs, and are highly dependent on a number of factors, including the number and size of files provided by the user, as the software dynamically determines a cluster size based on this information. Additionally, STORMSeq was developed to take advantage of the current cost savings of spot instances, so on-demand costs for the pipeline are much higher (Table 1).

Table 1

Approximate costs for STORMSeq.

Analysis type	Exome	Exome	Genome	Genome
Pipeline	SNAP	BWA	SNAP	BWA
Cost (spot)	$2.26	$1.90	$26.42	$32.76
Cost (on-demand)	$19.68	$8.16	$254.20	$129.12
Time	5 h	10 h	82 h	176 h

Note that these costs are approximate and may depend on a number of factors related to the input files.
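
The gap between spot and on-demand pricing can be made explicit with a little arithmetic. The sketch below computes the on-demand to spot cost ratio for each column of Table 1 and converts the whole-genome run times given in the text (82 and 176 hours) into days. The dictionary layout and the helper names are illustrative choices, not part of STORMSeq.

```python
# Costs in USD, copied from Table 1.
TABLE_1 = {
    ("exome", "SNAP"): {"spot": 2.26, "on_demand": 19.68},
    ("exome", "BWA"): {"spot": 1.90, "on_demand": 8.16},
    ("genome", "SNAP"): {"spot": 26.42, "on_demand": 254.20},
    ("genome", "BWA"): {"spot": 32.76, "on_demand": 129.12},
}


def on_demand_ratio(costs):
    """How many times more expensive on-demand pricing is than spot pricing."""
    return costs["on_demand"] / costs["spot"]


def hours_to_days(hours):
    """Convert a run time in hours to days."""
    return hours / 24


for (analysis, pipeline), costs in TABLE_1.items():
    ratio = on_demand_ratio(costs)
    print(f"{analysis:6s} {pipeline:4s} on-demand/spot = {ratio:.1f}x")

# Whole-genome run times reported in the text: 82 h (SNAP) and 176 h (BWA).
print(f"genome run time: {hours_to_days(82):.1f} to {hours_to_days(176):.1f} days")
```

On these figures, on-demand pricing is roughly four to ten times the spot price, and the whole-genome run times correspond to the 3-8 day range quoted in the abstract.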

We offer STORMSeq free for public use, where users pay only for compute time on the Amazon cloud. The source code for the STORMSeq software is available for download from www.github.com/konradjk/stormseq under an open-source license. We expect that the majority of STORMSeq users will be individuals from academia and the broader lay public interested in analyzing personal genomic information. In addition, those without access to large computing clusters, such as clinicians wishing to process patient data for clinical studies and small research groups with genome sequence projects, may seek to use the system to process genomic data for their patients and subjects. The system is modular and can be easily expanded and integrated with other tools. In the future, it will be crucial to integrate such tools with genome interpretation services, such as Interpretome [12].

Supporting Information

Figure S1

The STORMSeq webserver allows users to set parameters and start the pipeline using a graphical interface. (PDF)

Figure S2

Time and cost estimates (spot pricing) for a personal exome sequence (90M reads, or 45X coverage) for BWA (red) and SNAP (blue). These figures are estimates only and results may vary. The merged step includes initial aligned BAMs, while final includes cleaned, sorted, and re-calibrated BAMs, as well as annotated variant calls (VCF). The stats step includes GATK's VariantEval and other VCF statistics, and depth marks completion of GATK's DepthOfCoverage process. (PDF)

Figure S3

Time and cost estimates for a personal genome sequence (1.1B reads, or 38X coverage) for BWA (red) and SNAP (blue). These figures are estimates only and results may vary. The merged step includes initial aligned BAMs, while final includes cleaned, sorted, and re-calibrated BAMs, as well as annotated variant calls (VCF). The stats step includes GATK's VariantEval and other VCF statistics, and depth marks completion of GATK's DepthOfCoverage process. (PDF)

References

1. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25

2. Searching for SNPs with cloud computing.

Authors: Ben Langmead; Michael C Schatz; Jimmy Lin; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-11-20

3. SIMPLEX: cloud-enabled pipeline for the comprehensive analysis of exome sequencing data.

Authors: Maria Fischer; Rene Snajder; Stephan Pabinger; Andreas Dander; Anna Schossig; Johannes Zschocke; Zlatko Trajanoski; Gernot Stocker
Journal: PLoS One Date: 2012-08-01

4. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18

7. A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors: Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal: Nat Genet Date: 2011-04-10

8. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08

9. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor.

Authors: William McLaren; Bethan Pritchard; Daniel Rios; Yuan Chen; Paul Flicek; Fiona Cunningham
Journal: Bioinformatics Date: 2010-06-18

10. ggbio: an R package for extending the grammar of graphics for genomic data.

Authors: Tengfei Yin; Dianne Cook; Michael Lawrence
Journal: Genome Biol Date: 2012-08-31

11. D³: Data-Driven Documents.

Authors: Michael Bostock; Vadim Ogievetsky; Jeffrey Heer
Journal: IEEE Trans Vis Comput Graph Date: 2011-12

12. Interpretome: a freely available, modular, and secure personal genome interpretation engine.

Authors: Konrad J Karczewski; Robert P Tirrell; Pablo Cordero; Nicholas P Tatonetti; Joel T Dudley; Keyan Salari; Michael Snyder; Russ B Altman; Stuart K Kim
Journal: Pac Symp Biocomput Date: 2012
