Hatem Elshazly, Yassine Souilmi, Peter J Tonellato, Dennis P Wall, Mohamed Abouelhoda.
Abstract
BACKGROUND: Next-generation genome sequencing techniques have become affordable for massive sequencing efforts devoted to the clinical characterization of human diseases. However, the cost of cloud-based analysis of the mounting datasets remains a concerning bottleneck for providing cost-effective clinical services. To address this computational problem, it is important to optimize the variant analysis workflow and the analysis tools it uses in order to reduce the overall processing time and, concomitantly, the processing cost. Furthermore, it is important to capitalize on recent developments in the cloud computing market, which has seen more providers competing on products and prices.
Keywords: Cloud computing; Multicloud; Personalized medicine; Sequence analysis; Variant analysis
Year: 2017 PMID: 28107819 PMCID: PMC5248509 DOI: 10.1186/s12859-016-1454-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Comparison of different systems for variant analysis
| | STORMseq | Atlas2 | Simplex | WEP | GenomeKey | MC-GenomeKey |
|---|---|---|---|---|---|---|
| Quality | - | - | fastx-toolkit | ngs-qc toolkit + fastqc | fastx-toolkit | fastx-toolkit |
| Mapping | BWA | - | BWA | BWA | BWA | BWA |
| Variant Calling | GATK | Logistic Regression Model | GATK | GATK | GATK | GATK |
| Annotation | VEP (variant effect predictor) | - | Annovar | Annovar | Annovar | Annovar |
| Deployment | AWS EC2 | AWS EC2 | AWS EC2 | Web Service | AWS EC2 | AWS, Google Cloud, Amazon, OpenStack based |
| Web Interface | Yes | Yes | No | Yes | Yes | Yes |
| Multiple samples in one run | No | No | No | No | Yes | Yes |
| Parallelization technique | split by chromosome | - | - | NA | split by chromosome + split by read group id | split by chromosome + split by read group id + more split by sub-group ID and sub-chromosomes |
| Workflow Engine | Python Scripts | - | JClusterService | Scripts | Cosmos | Cosmos |
| Modularity | No | - | No | No | Yes | Yes |
| Use of heterogeneous cluster (a) | No | No | No | NA | No | Yes |
| Failure-handling mechanisms (+) | No | No | No | No | No | Yes |
| Use of Spot Instances | No | No | No | NA | No | Yes |
(a) A heterogeneous cluster has nodes of different virtual machine types and also from different clouds
(+) Failure handling means responding to the failure of compute nodes in the cloud, as in the case of spot instances
Fig. 1 Variant analysis workflow. The figure shows the major phases and the major steps in each phase
Fig. 2 Screenshots of the MC-GenomeKey website. Screenshots of the web interface: (a) the user enters their credentials for the Amazon and Google clouds; (b) the user sets workflow parameters, e.g., alignment and variant calling parameters; (c) the user defines the cluster size, the node types, and whether to use spot instances; (d) the user sets the job configuration parameters, including the “recovery method” used to respond to the termination of spot instances. The input and output folders are also defined on this screen
Fig. 3 Data flow associated with the parallel execution of the MC-GenomeKey jobs. The data flow between jobs is shown as a directed acyclic graph (DAG). A job is executed only once all of its inputs are available
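The dependency rule described for Fig. 3 (a job runs only once all of its inputs are available) can be illustrated with a small topological scheduler. This is a minimal sketch and not the Cosmos engine that MC-GenomeKey uses; the `ready_order` helper and the job names are hypothetical and chosen only for illustration.

```python
from collections import deque

def ready_order(deps):
    """Return jobs in an order where each job follows all of its inputs.

    deps maps every job name to the list of jobs whose output it needs.
    (Hypothetical helper; it illustrates DAG scheduling only.)
    """
    remaining = {job: set(inputs) for job, inputs in deps.items()}
    consumers = {job: [] for job in deps}
    for job, pending in remaining.items():
        for parent in pending:
            consumers[parent].append(job)

    # Jobs with no pending inputs can start immediately.
    ready = deque(sorted(j for j, pending in remaining.items() if not pending))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in consumers[job]:
            remaining[child].discard(job)
            if not remaining[child]:
                ready.append(child)
    if len(order) != len(deps):
        raise ValueError("dependency cycle: not a DAG")
    return order

# Illustrative job names only: two per-chromosome branches feed one merge.
jobs = {
    "align_chr1": [],
    "align_chr2": [],
    "call_chr1": ["align_chr1"],
    "call_chr2": ["align_chr2"],
    "merge_vcf": ["call_chr1", "call_chr2"],
}
print(ready_order(jobs))
```

In a real engine the jobs in the ready queue would be dispatched to cluster nodes in parallel; the sketch only shows the ordering constraint.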
On-demand prices for Amazon and Google instances
| Amazon | | | | | Google | | | |
|---|---|---|---|---|---|---|---|---|
| Instance type | CPUs | Mem (GB) | Price (per hour) | Spot price (per hour) | Instance type | CPUs | Mem (GB) | Price (per hour) |
| m3.2xlarge | 8 | 30 | $0.53 | $0.07 | n1-standard-8 | 8 | 30 | $0.28 |
| c3.8xlarge | 32 | 60 | $1.68 | $0.25 | n1-standard-16 | 16 | 60 | $0.56 |
| c4.8xlarge | 36 | 60 | $1.76 | $0.27 | n1-highcpu-32 | 32 | 28.8 | $1.216 |
| r3.8xlarge | 32 | 104 | $2.66 | $0.3 | n1-standard-32 | 32 | 120 | $1.12 |
Spot instance prices for Amazon are the minimum prices at which the instance was available for at least one hour (prices were computed over the three-month period from October to December 2015)
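To make the gap between the on-demand and spot prices concrete, the following minimal sketch computes the relative saving for three of the Amazon rows in the table. The `spot_saving` helper is hypothetical, and the figures are copied from the table rather than fetched from any pricing API.

```python
# Hourly prices in USD (on-demand, spot) copied from the price table.
PRICES = {
    "m3.2xlarge": (0.53, 0.07),
    "c3.8xlarge": (1.68, 0.25),
    "r3.8xlarge": (2.66, 0.30),
}

def spot_saving(on_demand, spot):
    """Fraction of the on-demand price saved by paying the spot price."""
    return 1.0 - spot / on_demand

for name, (on_demand, spot) in PRICES.items():
    print(f"{name}: {spot_saving(on_demand, spot):.0%} cheaper than on-demand")
```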
Lifetime of different machine types (in minutes) at different bid prices
| c3-8xlarge | | | | |
|---|---|---|---|---|
| Bid | Average | Minimum | Maximum | Median |
| $0.2 | 0 | 0 | 0 | 0 |
| $0.3 | 89.22 | 1 | 2401 | 8 |
| $0.4 | 201.17 | 1 | 5168 | 14 |
| $0.5 | 442.18 | 1 | 15907 | 15.5 |
| $0.6 | 3311.57 | 1 | 51244 | 29.5 |
| $1.00 | 17272.6 | 1 | 65193 | 12 |
| c4-8xlarge | ||||
| Bid | Average | Minimum | Maximum | Median |
| $0.2 | 0 | 0 | 0 | 0 |
| $0.3 | 185.23 | 1 | 3921 | 11 |
| $0.4 | 386.27 | 1 | 15880 | 14 |
| $0.5 | 825.84 | 1 | 30899 | 10 |
| $0.6 | 1296.8 | 1 | 30899 | 15 |
| $1.0 | 5359.12 | 2 | 32511 | 26.5 |
| r3-8xlarge | ||||
| Bid | Average | Minimum | Maximum | Median |
| $0.2 | 0 | 0 | 0 | 0 |
| $0.3 | 64.58 | 1 | 713 | 16 |
| $0.4 | 139.26 | 1 | 4811 | 14.5 |
| $0.5 | 175.65 | 1 | 8318 | 18.5 |
| $0.6 | 229.79 | 1 | 8344 | 25 |
| $1.0 | 422.08 | 1 | 14441 | 23.5 |
Prices were computed over the three-month period from October to December 2015. Instances of type m3-2xlarge were available for the whole period at a bid price of $0.2
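Lifetime statistics of this kind could be derived from a spot price history. The sketch below assumes one price sample per minute and assumes that an instance survives for as long as the spot price stays at or below the bid; both are simplifying assumptions, and the trace is made up rather than taken from the paper's data.

```python
from statistics import mean, median

def lifetimes(prices_per_minute, bid):
    """Lengths (in minutes) of maximal runs where the spot price <= bid."""
    runs, current = [], 0
    for price in prices_per_minute:
        if price <= bid:
            current += 1
        elif current:
            runs.append(current)  # price rose above the bid: instance lost
            current = 0
    if current:
        runs.append(current)
    return runs

def summarize(prices_per_minute, bid):
    """Average, minimum, maximum and median lifetime for one bid price."""
    runs = lifetimes(prices_per_minute, bid) or [0]  # never available -> 0
    return {"average": mean(runs), "minimum": min(runs),
            "maximum": max(runs), "median": median(runs)}

# Made-up trace, one price sample per minute (not data from the paper).
trace = [0.25, 0.28, 0.45, 0.29, 0.27, 0.26, 0.55, 0.31, 0.29]
for bid in (0.2, 0.3, 0.5):
    print(bid, summarize(trace, bid))
```

A bid below every observed price yields zeros in all four columns, which matches the shape of the $0.2 rows in the table.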
Fig. 4 Scenarios for handling sudden termination of spot instances. Sequence diagrams showing Scenario 2 (a), where the computation continues in the same Amazon (AWS) cloud, and Scenario 3 (b), where the failed jobs on terminated spot instances are migrated to the Google (GCE) cloud
Running times of the variant analysis workflow in different clouds
| Exome (9.2 GB) | | | |
|---|---|---|---|
| | Amazon | Google | Azure |
| Alignment (BWA) | 0:12:20 | 0:18:40 | 00:26:00 |
| IndelRealigner | 0:14:39 | 0:20:10 | 00:28:00 |
| MarkDuplicates | 0:15:29 | 0:23:06 | 00:35:00 |
| BQSR | 0:28:06 | 0:34:15 | 00:55:00 |
| HaplotypeCaller | 1:08:46 | 0:58:40 | 01:28:00 |
| GenotypeGVCFs | 0:14:23 | 0:12:40 | 00:17:00 |
| VQSR | 0:10:07 | 0:10:14 | 00:12:00 |
| Merge VCF | 0:05:07 | 0:04:55 | 00:10:00 |
| Convert VCF to Annovar | 0:00:13 | 0:00:10 | 00:00:15 |
| Annotate | 0:05:16 | 0:05:36 | 00:09:00 |
| Merge Annotation | 0:03:06 | 0:04:01 | 00:06:00 |
| Total | | 2:06:28 ($ | 3:12:00 ($31.24) |
| Whole Genome (113 GB) | | | |
| | Amazon | Google | Azure |
| Alignment (BWA) | 8:03:53 | 8:10:13 | 11:18:20 |
| IndelRealigner | 3:04:03 | 3:09:34 | 04:15:58 |
| MarkDuplicates | 3:15:04 | 3:22:41 | 04:39:26 |
| BQSR | 4:11:14 | 4:17:23 | 06:35:44 |
| HaplotypeCaller | 9:05:43 | 9:15:49 | 14:29:22 |
| GenotypeGVCFs | 1:45:01 | 1:46:44 | 02:2:14 |
| VQSR | 1:33:29 | 1:33:36 | 02:05:13 |
| Merge VCF | 0:05:55 | 0:06:05 | 0:15:08 |
| Convert VCF to Annovar | 0:06:14 | 0:06:17 | 0:15:29 |
| Annotate | 0:15:01 | 0:15:21 | 0:24:39 |
| Merge Annotation | 0:10:06 | 0:11:01 | 0:23:07 |
| Total | | 32:13:06 ( | 46:44:40 ($462) |
Running times (hours:minutes:seconds) of the variant analysis workflow in different clouds using a cluster of 4 nodes, broken down by step. The total time and cost (in USD) are given in the rows titled “Total”. The best running times and options are underlined
Total running times for MC-GenomeKey on the whole genome dataset using clusters with an increasing number of nodes
| | Nodes | | | |
|---|---|---|---|---|
| | 4 | 8 | 16 | 32 |
| Alignment (BWA) | 8:03:53 | 4:45:51 | 2:55:34 | 1:30:22 |
| IndelRealigner | 3:04:03 | 1:25:44 | 0:50:51 | 0:28:44 |
| MarkDuplicates | 3:15:04 | 1:35:01 | 0:48:15 | 0:33:10 |
| BQSR | 4:11:14 | 2:55:16 | 1:35:02 | 0:18:41 |
| HaplotypeCaller | 9:05:43 | 6:04:45 | 4:18:30 | 2:15:01 |
| GenotypeGVCFs | 1:45:01 | 1:01:10 | 0:37:14 | 0:15:09 |
| VQSR | 1:33:29 | 0:55:45 | 0:30:56 | 0:11:30 |
| Merge VCF | 0:05:55 | 0:08:39 | 0:11:20 | 0:13:15 |
| Convert VCF to Annovar | 0:06:14 | 0:07:11 | 0:06:59 | 0:07:01 |
| Annotate | 0:15:01 | 0:14:55 | 0:15:10 | 0:14:32 |
| Merge Annotation | 0:04:06 | 0:03:49 | 0:04:01 | 0:04:05 |
| Total time | 31:29:43 | 19:18:06 | 12:13:52 | 6:11:30 |
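The scaling behaviour in this table can be summarised as speedup and parallel efficiency relative to the 4-node cluster. The following minimal sketch uses only the "Total time" row; the helper names are hypothetical, and efficiency is defined here as the speedup divided by the increase in node count.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# "Total time" row of the table, keyed by number of nodes.
TOTALS = {4: "31:29:43", 8: "19:18:06", 16: "12:13:52", 32: "6:11:30"}

def scaling(totals, baseline=4):
    """Speedup and parallel efficiency relative to the baseline cluster."""
    base = to_seconds(totals[baseline])
    rows = {}
    for nodes, hms in sorted(totals.items()):
        speedup = base / to_seconds(hms)
        efficiency = speedup / (nodes / baseline)
        rows[nodes] = (speedup, efficiency)
    return rows

for nodes, (speedup, efficiency) in scaling(TOTALS).items():
    print(f"{nodes:>2} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

The steps whose times do not shrink with more nodes (the merge, conversion and annotation rows) are what keep the efficiency below 100%.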
The cost of using spot instances for Case 1, where all spot instances are terminated
| | Bid = $0.2 (1 min) | Bid = $0.3 (65 min) | Bid = $0.35 (100 min) | Bid = $0.4 (140 min) | Bid = $0.5 (176 min) | Bid = $0.6 (230 min) | Bid = $1 (422 min) | Average data transfer time (cost) | Total time |
|---|---|---|---|---|---|---|---|---|---|
| Exome | |||||||||
| No Failure | $1.734 | $2.534 | | | | | | 0 | |
| Failure Step 1 (Mapping) | | 16.227 | 16.45 | 16.627 | 17.027 | 17.427 | 19.027 | 12 min (0.96$) | 02:08:23 |
| Failure Step 2 (Variant Calling) | 11.471 | | $12.1 | 12.271 | 12.671 | 10.937 | 12.537 | 11 min (0.87$) | 02:06:10 |
| Failure Step 3 (Annotation) | 7.402 | 7.802 | $8.1 | 8.202 | 8.602 | 9.002 | 10.602 | 6 s (0.00036$) | 01:55:48 |
| Whole Genome (4 Nodes) | |||||||||
| No Failure | $27.744 | $40.544 | $46.9 | $53.344 | $66.144 | $78.944 | $130.144 | 0 | |
| Failure Step 1 (Mapping) | | | | | | | | 230 min (18.4$) | 35:08:46 |
| Failure Step 2 (VC) | $127.84 | $135.84 | $139.85 | $143.85 | $151.84 | $159.84 | $191.84 | 100 min (7.9$) | 32:47:17 |
| Failure Step 3 (Annotation) | $55.25 | $84.59 | $89.79 | $94.99 | $105.39 | $115.79 | $157.39 | 12 s (0.00432$) | 32:14:01 |
| Whole Genome (8 Nodes) | |||||||||
| No Failure | $32.23 | $47.70 | $55.42 | $63.16 | $78.63 | $94.10 | $155.96 | 0 | |
| Failure Step 1 (Mapping) | $243.75 | | | | | $250.15 | $256.55 | 230 min (18.4$) | 23:10:1 |
| Failure Step 2 (Variant Calling) | $123.05 | $133.45 | $138.65 | $143.85 | $154.25 | | | 100 min (7.9$) | 20:52:08 |
| Failure Step 3 (Annotation) | $54.99 | $68.59 | $75.39 | $82.19 | $95.79 | $109.39 | $163.79 | 12 s (0.00432$) | 19:20:01 |
| Whole Genome (32 nodes) | |||||||||
| No Failure | $39.88 | $59.61 | $69.50 | $79.35 | $99.08 | $118.81 | $197.75 | 0 | |
| Failure Step 1 (Mapping) | $291.75 | | | $298.15 | $301.35 | $304.55 | $317.35 | 230 min (18.4$) | 10:45:12 |
| Failure Step 2 (VC) | $172.65 | $185.45 | $191.85 | | | $223.85 | $275.05 | 100 min (7.9$) | 7:56:03 |
| Failure Step 3 (Annotation) | $85.39 | $101.39 | $109.39 | $117.39 | $133.39 | | | 12 s (0.00432$) | 6:20:10 |
The cost of using spot instances at different bid prices and failure time points for Case 1, where all spot instances are terminated, for the Exome and Whole Genome datasets. GCE cluster setup time is about 7 min. For every bid, the average lifetime of the cluster is given in brackets. Costs in bold are the most likely ones for the respective bid price and its most likely time of failure. The best expected costs for a given experiment are underlined. The best expected cost for the exome is $3.334 with a bid of $0.4 (underlined), and for the whole genome it is $149.39 (underlined) with 32 nodes and a bid of $0.6
The cost of using spot instances for Case 2, where some spot instances are terminated
| | Bid = $0.2 (1 min) | Bid = $0.3 (62 min) | Bid = $0.35 (100 min) | Bid = $0.4 (135 min) | Bid = $0.5 (172 min) | Bid = $0.6 (224 min) | Bid = $1 (419 min) | Average data transfer time (cost) | Total time |
|---|---|---|---|---|---|---|---|---|---|
| Exome | |||||||||
| No Failure | $1.734 | $2.534 | $2.934 | | | | | 0 | |
| Failure Step 1 (Mapping) | $8.78 | | $9.51 | $9.98 | $10.58 | $11.18 | $13.58 | 7 min (0.6$) | 02:08:23 |
| Failure Step 2 (Variant Calling) | | $9.73 | $9.91 | $10.33 | $10.93 | $11.53 | $13.93 | 6 min (0.47$) | 02:06:10 |
| Failure Step 3 (Annotation) | $4.06 | $4.66 | | $5.26 | $5.86 | $6.46 | $8.86 | 2 s (0.05$) | 01:55:48 |
| Whole Genome (4 Nodes) | |||||||||
| No Failure | $27.744 | $40.544 | $46.9 | $53.344 | $66.144 | $78.944 | $130.144 | 0 | |
| Failure Step 1 (Mapping) | $114.43 | | | $128.83 | $136.03 | $143.23 | $172.03 | 123 min (4.87$) | 33:08:46 |
| Failure Step 2 (VC) | $64.52 | $74.92 | $80.12 | | | | $147.72 | min (0.96$) | 33:47:17 |
| Failure Step 3 (Annotation) | $45.32 | $56.92 | $62.72 | $68.52 | $80.12 | $91.72 | | 30 s (0.6) | 32:14:01 |
| Whole Genome (8 Nodes) | |||||||||
| No Failure | $32.23 | $47.70 | $55.42 | $63.16 | $78.63 | $94.10 | $155.96 | 0 | |
| Failure Step 1 (Mapping) | $137.64 | | | $154.44 | $162.84 | $171.24 | $204.84 | 123 min (4.87$) | 21:30:1 |
| Failure Step 2 (VC) | $77.33 | $90.13 | $96.53 | | | | $179.73 | 101 min (0.96$) | 21:12:08 |
| Failure Step 3 (Annotation) | $48.26 | $62.66 | $69.86 | $77.06 | $91.46 | $105.86 | | 30 s (0.585$) | 19:55:01 |
| Whole Genome (32 nodes) | |||||||||
| No Failure | $39.88 | $59.61 | $69.50 | $79.35 | $99.08 | $118.81 | $197.75 | 0 | |
| Failure Step 1 (Mapping) | $165.64 | | | $188.04 | $199.24 | $210.44 | $255.24 | 123 min (4.87$) | 8:15:12 |
| Failure Step 2 (VC) | $93.33 | $109.33 | $117.33 | | | | $221.33 | 101 min (0.96$) | 7:56:03 |
| Failure Step 3 (Annotation) | $83.46 | $101.06 | $109.86 | $118.66 | $136.26 | $153.86 | | 30 s (0.585$) | 6:55:12 |
The cost of using spot instances at different bid prices and failure time points for Case 2, where some spot instances are terminated (we assume half of the initial number), for the Exome and Whole Genome datasets. GCE cluster setup time is about 7 min. For every bid, the average lifetime of the cluster is given in brackets. Costs in bold are the most likely ones for the respective bid price and its most likely time of failure. The best expected costs for a given experiment are underlined. The best expected cost for the exome is $3.334 with a bid of $0.4, and for the whole genome it is $106.12 with a bid of $0.6, finishing in 33 h. If the run has to finish in less than 10 h, the best price is $125.33 with 32 nodes and a bid of $0.4
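The trade-off stated in the caption (cheapest option overall versus cheapest option under a 10 h deadline) amounts to a constrained minimum. The sketch below is illustrative only: the `Option` type and `cheapest` helper are hypothetical, the costs and bids are the two quoted in the caption, and the node count of the first option and both running times are inferred from the matching "Failure Step 2 (VC)" rows (33:47:17 and 7:56:03), which is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    nodes: int
    bid: float    # USD per hour
    cost: float   # expected total cost in USD
    hours: float  # total running time in hours

def cheapest(options, deadline_hours=None):
    """Cheapest option, optionally restricted to those meeting a deadline."""
    feasible = [o for o in options
                if deadline_hours is None or o.hours <= deadline_hours]
    if not feasible:
        raise ValueError("no option meets the deadline")
    return min(feasible, key=lambda o: o.cost)

# Costs and bids from the caption; node count (first entry) and rounded
# running times are inferred from the table rows, not stated in the caption.
options = [
    Option(nodes=4, bid=0.6, cost=106.12, hours=33.8),
    Option(nodes=32, bid=0.4, cost=125.33, hours=7.9),
]
print(cheapest(options))
print(cheapest(options, deadline_hours=10))
```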