Krithika Bhuvaneshwar1, Dinanath Sulakhe2, Robinder Gauba1, Alex Rodriguez3, Ravi Madduri2, Utpal Dave2, Lukasz Lacinski2, Ian Foster2, Yuriy Gusev1, Subha Madhavan1.
Abstract
Next generation sequencing (NGS) technologies produce massive amounts of data, requiring a powerful computational infrastructure, high-quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte-scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the "Globus Genomics" system, an enhanced Galaxy workflow system made available as a service, which lets users process and transfer data easily, reliably, and quickly to address end-to-end NGS analysis requirements. The Globus Genomics system is built on Amazon's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel, and it also helps meet the scale-out analysis needs of modern translational genomics research.
Keywords: Cloud computing; Galaxy; Next generation sequencing; Translational research
Year: 2014 PMID: 26925205 PMCID: PMC4720014 DOI: 10.1016/j.csbj.2014.11.001
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Supplementary File 1b. Pros and cons of using Galaxy, from online discussion.
Fig. 1. Architecture of the Globus Genomics system. The orange components indicate the three distinct high-level components of the system, and the pink components are additional features added by the Globus Genomics team.
Fig. 2. How to run a ready-made NGS workflow in the Globus Genomics system.
Fig. 3. a. Schematic diagram of the whole-genome and whole-exome analysis workflow.
b. Whole-genome and whole-exome analysis workflow inside the Globus Genomics system.
Fig. 4. a. Schematic diagram of the whole-transcriptome (RNA-seq) analysis workflow.
b. Whole-transcriptome (RNA-seq) analysis workflow inside the Globus Genomics system.
Fig. 5. Summary of the analysis of 78 lung cancer samples through the exome-seq workflow. Execution time was not optimal because the workflow is I/O-intensive.
“Spot price”, as mentioned in the figure key, refers to the price of the AWS spot instance [38].
Fig. 6. Summary of the 78 lung cancer samples in an I/O optimized server.
“Spot price” refers to the price of the AWS spot instance [38].
Fig. 7. Summary of the RNA-Seq analysis of 21 TCGA samples of varying input sizes.
“Spot price” refers to the price of the AWS spot instance [38].
Table 1. Sample workflow run costs, including compute, temporary storage, and outbound I/O.
| Workflow | Input data size | Storage required (GB) | Amazon storage costs | Compute requirement (node hours) | Amazon compute costs | Data download (GB) | Amazon outbound I/O costs | Total Amazon costs |
|---|---|---|---|---|---|---|---|---|
| DNA copy number | 0.070 GB | 0.03 | <$0.01 | 0.15 | $0.05 | 0.003 | <$0.01 | $0.05 |
| microRNA Seq | 0.3 GB | 1 | <$0.01 | 0.5 | $0.17 | 0.1 | $0.01 | $0.18 |
| RNA Seq | 10 GB (~ 5 Gbp) | 70 | $0.12 | 20 | $6.80 | 7 | $0.70 | $7.62 |
| WES | 6 GB (~ 5 Gbp) | 50 | $0.08 | 6 | $2.04 | 5 | $0.50 | $2.62 |
| WGS | 72 GB (~ 35 Gbp) | 320 | $0.53 | 30 | $10.20 | 32 | $3.20 | $13.93 |
The analysis presented in Table 1 was carried out under the following assumptions:
(a) Input data are paired-end Illumina reads compressed in GZ format.
(b) RNA-seq analysis includes variant analysis as well: Sickle QC, RSEM (singleton and paired), sort, rmdup, fixmate, Picard reorder, Picard add or replace groups, GATK Unified Genotyper, GATK recalibration, and GATK variant filtering.
(c) WES analysis includes: BWA, sort, rmdup, fixmate, Picard reorder, Picard add or replace groups, GATK Unified Genotyper, GATK recalibration, and GATK variant filtering.
(d) WGS analysis includes: Bowtie2, sort, rmdup, fixmate, Picard reorder, Picard add or replace groups, and GATK Unified Genotyper.
(e) The reference genome used for all analyses is hg19.
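
The totals in the cost table can be reproduced with simple arithmetic. The sketch below is illustrative only: the per-unit rates are not stated in the source; they are inferred here by dividing each cost column by the matching usage column (every row gives about $0.34 per node hour and $0.10 per GB downloaded). Storage costs are copied from the table as-is because no clean per-GB rate can be recovered from the rounded figures. The names `total_cost`, `COMPUTE_USD_PER_NODE_HOUR`, and `OUTBOUND_USD_PER_GB` are hypothetical.

```python
# Rates inferred from the cost table (assumption, not stated in the source).
COMPUTE_USD_PER_NODE_HOUR = 0.34
OUTBOUND_USD_PER_GB = 0.10

# Usage figures and storage costs copied from the three largest table rows.
WORKFLOWS = {
    "RNA Seq": {"node_hours": 20, "download_gb": 7, "storage_usd": 0.12},
    "WES": {"node_hours": 6, "download_gb": 5, "storage_usd": 0.08},
    "WGS": {"node_hours": 30, "download_gb": 32, "storage_usd": 0.53},
}


def total_cost(node_hours, download_gb, storage_usd):
    """Return storage + compute + outbound I/O cost in USD, rounded to cents."""
    compute_usd = node_hours * COMPUTE_USD_PER_NODE_HOUR
    outbound_usd = download_gb * OUTBOUND_USD_PER_GB
    return round(storage_usd + compute_usd + outbound_usd, 2)


if __name__ == "__main__":
    for name, usage in WORKFLOWS.items():
        print(f"{name}: ${total_cost(**usage):.2f}")
```

Under these inferred rates, the computed totals match the "Total Amazon costs" column for the RNA Seq, WES, and WGS rows. Compute dominates the total in each case, followed by outbound I/O.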