Yassine Souilmi1,2, Alex K Lancaster3,4, Jae-Yoon Jung5, Ettore Rizzo6, Jared B Hawkins7, Ryan Powles8, Saaïd Amzazi9, Hassan Ghazal10, Peter J Tonellato11,12, Dennis P Wall13.
Abstract
BACKGROUND: While next-generation sequencing (NGS) costs have plummeted in recent years, cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at scales of economy in the 10's of dollars.Entities:
Mesh:
Year: 2015 PMID: 26470712 PMCID: PMC4608296 DOI: 10.1186/s12920-015-0134-9
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 GenomeKey workflow and overall benchmarking study design. a GenomeKey workflow implements the GATK 3 best practices for genomic variant calling. Each arrow represents a stage of the workflow, and the level of parallelization for each stage is described in the Methods section under “Workflow”. b Deployment of the workflow on the Amazon Web Services Elastic Compute Cloud (EC2) infrastructure using the COSMOS workflow management engine
GlusterFS configurations used to increase shared disk space
| Configuration (L;M;N) | GlusterFS bricks | Shared disk size (TB) |
|---|---|---|
| Config 1 (1;0;20) | 1 | 3.3 |
| Config 2 (1;1;19) | 2 | 6.6 |
| Config 3 (1;3;16) | 4 | 13.2 |
(L; M; N): L on-demand master nodes, M on-demand worker nodes, N spot-instance worker nodes
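The table values imply that the shared disk size scales linearly with the number of GlusterFS bricks at roughly 3.3 TB per brick. A minimal illustrative check (not from the paper's code; the 3.3 TB/brick constant is inferred from the table):

```python
# Assumed per-brick capacity, inferred from the table above (3.3, 6.6, 13.2 TB).
TB_PER_BRICK = 3.3

def shared_disk_tb(bricks: int) -> float:
    """Shared GlusterFS disk size, assuming capacity scales linearly with bricks."""
    return bricks * TB_PER_BRICK

# Reproduce the three configurations from the table.
for config, bricks in (("Config 1", 1), ("Config 2", 2), ("Config 3", 4)):
    print(f"{config}: {bricks} brick(s) -> {shared_disk_tb(bricks):.1f} TB")
```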
Comparison of variant calls results
| Variant calls | Ti/Tv (all SNPs) | High-quality SNPs | Genotype concordance |
|---|---|---|---|
| GenomeKey | 2.25 | 202290 | 0.97 |
| DePristo | 2.26 | 141618 | - |
Fig. 2 GenomeKey scalability. The GenomeKey workflow scales efficiently with increasing numbers of genomes and exomes. a Wall time and (b) cost as a function of the number of genomes, compared to a linear extrapolation of a single genome. c Wall time and (d) cost as a function of exome batch size, compared to a linear extrapolation of a single exome, on different GlusterFS configurations: the blue curve represents the 1, 3, 5 and 10 exome runs performed on a cluster with one GlusterFS brick; the yellow curve represents scalability on a cluster with four GlusterFS bricks
Fig. 3 Cluster resource usage. Cluster resources are utilized more efficiently as batch size increases. When the number of exomes increases from (a) 5 exomes to (b) 10 exomes, overall cluster CPU usage (shown as the brown “Total” line) is higher across the entire runtime. Percent CPU usage for each job across the entire 20-node cluster was summed within 5-min wall-time windows and then scaled by the total number of cores (20 nodes × 32 cores/node = 1920 cores) to quantify overall system utilization. CPU usage for jobs not fully contained within a 5-min window was pro-rated according to how much they overlapped. The contribution of each stage to the total (brown line) as a function of time further illustrates the parallelization