Yiqi Wang, Gen Li, Mark Ma, Fazhong He, Zhuo Song, Wei Zhang, Chengkun Wu.
Abstract
BACKGROUND: Whole-genome sequencing (WGS) plays an increasingly important role in clinical practice and public health. Because of the large data size, WGS data analysis is usually compute-intensive and IO-intensive. A 50× WGS analysis task currently takes 30 to 40 h to finish, which is far from the speed the industry requires. Furthermore, the high-end infrastructure required for WGS computing is costly in both time and money. In this paper, we aim to improve the time efficiency of WGS analysis and minimize its cost through elastic cloud computing.
Keywords: AWS; Parallel and distributed computing; Whole-genome sequencing
Year: 2018 PMID: 29363427 PMCID: PMC5780748 DOI: 10.1186/s12864-017-4334-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Comparison with previous benchmarks of time and cost for WGS data analysis based on different pipelines and hardware
| Tool | Aligner + Variant Caller | Depth | Time | Cost | Depth (a) | Time (a) | Cost (a) | Hardware |
|---|---|---|---|---|---|---|---|---|
| Genomekey + COSMOS | BWA + GATK HaplotypeCaller | 37× | 4.9 h | $48.5 | 55× | 7.3 h | $72.1 | 20× AWS c2.8xlarge |
| Churchill | BWA + GATK UnifiedGenotyper | 30× | 1.7 h | – | 55× | 3.1 h | – | 16× AWS r3.8xlarge |
| STORMseq | BWA + GATK lite | 38× | 176 h | $32.8 | 55× | 255 h | $47.5 | – |
| Crossbow | Bowtie + SOAPsnp | 38× | 4.5 h | $71.4 | 55× | 6.5 h | $103.3 | 20× AWS c1.xlarge |
| Crossbow | Bowtie + SOAPsnp | 38× | 2.5 h | $83.6 | 55× | 3.6 h | $121 | 40× AWS c1.xlarge |
| PEMapper / PECaller | PEMapper + PECaller | 30× | 29.3 h | – | 55× | 53.7 h | – | – |
| Globus | Bowtie2 + GATK | 30× | 12 h | – | 55× | 22 h | – | 1× AWS cr1.8xlarge |
| SevenBridges | BWA + GATK | 15× | 8 h | $14.1 | 55× | 29.3 h | $51.7 | – |
| BGI-online (BALSA) | BALSA | 50× | 5.5 h | – | 55× | 6 h | – | 6-core CPU, 64 GB RAM, GPU GTX680 |
| Average | | | | | | 42.9 h | $79.1 | |
‘a’ indicates that time and cost reported at other depths are normalized to 55×, assuming a linear relationship with depth. ‘–’ means not reported.
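The linear normalization described in the footnote can be sketched as follows. This is a minimal illustration that assumes time and cost scale proportionally with sequencing depth; the function name `normalize_to_depth` is hypothetical, and the inputs are copied from the table above.

```python
def normalize_to_depth(value, depth, target_depth=55.0):
    """Scale a time or cost reported at `depth` to `target_depth`,
    assuming a linear relationship between depth and the value."""
    return value * target_depth / depth


# Genomekey + COSMOS row: 4.9 h and $48.5 reported at 37x depth.
time_55 = normalize_to_depth(4.9, 37)
cost_55 = normalize_to_depth(48.5, 37)
print(f"{time_55:.1f} h, ${cost_55:.1f}")

# The "Average" row is the mean over the rows that report a value.
norm_times_h = [7.3, 3.1, 255, 6.5, 3.6, 53.7, 22, 29.3, 6]
norm_costs_usd = [72.1, 47.5, 103.3, 121, 51.7]
avg_time = sum(norm_times_h) / len(norm_times_h)
avg_cost = sum(norm_costs_usd) / len(norm_costs_usd)
print(f"average: {avg_time:.1f} h, ${avg_cost:.1f}")
```

Under this assumption the sketch reproduces the normalized Genomekey + COSMOS values (7.3 h, $72.1) and the table averages (42.9 h, $79.1).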
The configuration information of r3.8xlarge and m4.4xlarge
| Instance Type | vCPU | Memory (GB) | Storage (GB) | Networking performance | Physical processor | Clock speed (GHz) |
|---|---|---|---|---|---|---|
| r3.8xlarge | 32 | 244 | 2 × 320 SSD | 10 Gigabit | Intel Xeon E5-2670 v2 | 2.5 |
| m4.4xlarge | 16 | 64 | EBS Only | High | Intel Xeon E5-2676 v3 | 2.4 |
Time cost and AWS expenditure for 55× WGS
| Overall time for the 55× WGS | Cost per m4.4xlarge instance | Cost per r3.8xlarge instance | Overall expenditure for the 55× WGS |
|---|---|---|---|
| 18.4 min | $0.1287 | $0.4386 | $16.50 |
Time cost for each step in 55× WGS
| No. | Step | Time cost |
|---|---|---|
| 1 | Mapping | 4.7 min |
| 2 | BAM Merging and Sorting | 3.6 min |
| 3 | Variant calling | 8.9 min |
| 4 | VCF Merging | 23.2 s |
Comparison of overall time cost between GT-WGS and Churchill
| Method | Overall time (min) | Number of CPU Cores |
|---|---|---|
| GT-WGS | 18.4 | 250 × 16 = 4000 |
| Churchill | 191 | 16 × 32 = 512 |
Results comparison between GT-WGS and BWA + GATK
| Mutation type | Unique to GT-WGS: number | Unique to GT-WGS: proportion | Unique to BWA + GATK best practice: number | Unique to BWA + GATK best practice: proportion | Common sites: number | Common sites: proportion | Same position, different genotype: number | Same position, different genotype: proportion |
|---|---|---|---|---|---|---|---|---|
| SNP | 3928 | 0.10% | 4443 | 0.11% | 4,067,370 | (99.89%, 99.88%) | 643 | (0.016%, 0.016%) |
| INDEL | 646 | 0.08% | 675 | 0.08% | 823,871 | (99.90%, 99.89%) | 197 | (0.024%, 0.024%) |
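The proportions in this table can be recomputed from the site counts. The sketch below is an illustration only: it assumes each proportion is taken relative to the total number of sites called by the respective pipeline (unique + common + different-genotype sites), with paired values given as (GT-WGS, BWA + GATK). The function name `concordance` is hypothetical.

```python
def concordance(unique_a, unique_b, common, diff_genotype):
    """Return percentages relative to each pipeline's total site count.

    Pipeline A is GT-WGS, pipeline B is BWA + GATK best practice.
    Assumption: total = unique + common + same-position/different-genotype.
    """
    total_a = unique_a + common + diff_genotype
    total_b = unique_b + common + diff_genotype
    return {
        "unique_a_pct": 100 * unique_a / total_a,
        "unique_b_pct": 100 * unique_b / total_b,
        "common_pct": (100 * common / total_a, 100 * common / total_b),
        "diff_genotype_pct": (
            100 * diff_genotype / total_a,
            100 * diff_genotype / total_b,
        ),
    }


# Counts copied from the table above.
snp = concordance(3928, 4443, 4_067_370, 643)
indel = concordance(646, 675, 823_871, 197)

for name, result in (("SNP", snp), ("INDEL", indel)):
    common_a, common_b = result["common_pct"]
    print(f"{name}: common sites {common_a:.2f}% / {common_b:.2f}%")
```

Under this assumption the recomputed values agree with the table after rounding, for example 99.89% and 99.88% common SNP sites.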
Comparison of results for different numbers of computation instances
| No. | Number of computation instances (m4.4xlarge) | Time cost |
|---|---|---|
| 1 | 4 | 888.7 min |
| 2 | 16 | 238.0 min |
| 3 | 64 | 67.9 min |
| 4 | 250 | 18.4 min |
Fig. 1 Speedup of GT-WGS
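The scaling behaviour in the instance-count table can be summarized as speedup and parallel efficiency. The sketch below is a rough illustration, not necessarily the definition used for Fig. 1: since no single-instance run is reported, it assumes the 4-instance run as the baseline, and the function name `scaling_summary` is hypothetical.

```python
def scaling_summary(runs):
    """Compute speedup and parallel efficiency relative to the first run.

    `runs` is a list of (instance_count, minutes) pairs, smallest first.
    Speedup = baseline time / time; ideal speedup = instances / baseline
    instances; efficiency = speedup / ideal speedup.
    """
    base_n, base_t = runs[0]
    summary = []
    for n, minutes in runs:
        speedup = base_t / minutes
        ideal = n / base_n
        summary.append((n, speedup, speedup / ideal))
    return summary


# (m4.4xlarge instances, time in minutes), copied from the table above.
runs = [(4, 888.7), (16, 238.0), (64, 67.9), (250, 18.4)]
summary = scaling_summary(runs)

for n, speedup, efficiency in summary:
    print(f"{n:>3} instances: speedup {speedup:5.1f}x, efficiency {efficiency:.0%}")
```

Relative to the 4-instance baseline, going to 250 instances (62.5 times the resources) gives roughly a 48-fold speedup, so efficiency falls gradually as instances are added.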
AWS expenditure for 5× WGS
| Cost per r4.4xlarge instance | Cost per r3.8xlarge instance | Overall expenditure for 500 5× WGS | Average expenditure for 5× WGS |
|---|---|---|---|
| $0.24 | $0.61 | $1810.0 | $3.62 |
The configuration information of r3.8xlarge instance and r4.4xlarge instance
| Instance type | vCPU | Memory (GiB) | Storage (GB) | Networking performance | Physical processor | Clock speed (GHz) |
|---|---|---|---|---|---|---|
| r3.8xlarge | 32 | 244 | 2 × 320 SSD | 10 Gigabit | Intel Xeon E5-2670 v2 | 2.5 |
| r4.4xlarge | 16 | 122 | EBS Only | Up to 10 Gigabit | Intel Xeon E5-2686 v4 | 2.3 |
Time cost breakdown for 5× WGS, on average and in total
| No. | Step | Time cost | Total time | Time per 5× WGS |
|---|---|---|---|---|
| 1 | Mapping | 1.80 min | 1199.74 min | 2.39 min |
| 2 | BAM Merging and Sorting | 0.40 min | | |
| 3 | Variant calling | 5.06 min | | |
| 4 | VCF Merging | 33.6 s | | |
Fig. 2 Two IO walls in the process of distributed WGS
Fig. 3 GT-WGS architecture
Fig. 4 Structure of MicroService
Fig. 5 WGS analyzing process of GT-WGS
Fig. 6 Dynamic task scheduling