| Literature DB >> 27612449 |
Zhuoyi Huang¹, Navin Rustagi¹, Narayanan Veeraraghavan¹, Andrew Carroll², Richard Gibbs¹, Eric Boerwinkle¹³, Manjunath Gorentla Venkata⁴, Fuli Yu⁵.
Abstract
BACKGROUND: The decreasing cost of sequencing is driving the need for cost-effective, real-time variant calling of whole genome sequencing data. The scale of these projects is far beyond the capacity of the typical computing resources available to most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, have limitations of their own that make large-scale joint variant calling infeasible: infrastructure-specific variant calling strategies either fail to scale to large datasets or abandon joint calling altogether.
Keywords: Big data; Cloud AWS; Ensemble calling; Joint calling; SNV; Scalable; Supercomputer; Variant calling; WGS
Year: 2016 PMID: 27612449 PMCID: PMC5018196 DOI: 10.1186/s12859-016-1211-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Summary performance metrics of the goSNAP pipeline
| Stage | # of core hours | Time (days) | Data generated (TB) | Data uploaded/downloaded (TB) | Median execution time/unit | # of parallel execution threads | Optimal instance (cores/mem) |
|---|---|---|---|---|---|---|---|
| Slicing/Repack | ~48 k / ~26 k | 5 | 360 | 180/0 | ~1.15 h/sample (slicing) / ~22 h/bin (repacking) | 5297 (slicing) / 300 (repacking) | 8 cores, 16 GB / 4 cores, 4 GB |
| Calling | ~1.4 million | 14 | 120 | 0/2 | ~60 h/bin | 2797 | 8 cores, 16 GB |
| Genotype Likelihood | ~15.6 k | 1 | 6 | 0/2 | ~1.5 h/BAM | 5297 | 2 cores, 8GB |
| Imputation and Phasing | ~3.7 million | 30 | 2 | 2/2 | ~30 h/bin @ Rhea ~13 h/bin @ Blue BioU | 265 k | 32 cores |
| Total | ~5.2 million | 50 | 488 | 182/6 | -- | -- | -- |
The pipeline finished in 50 days, used ~5.2 million core hours, and transferred only 6 TB of data in total, starting from a raw data footprint of 180 TB. The 360 TB of cache data was live for only 14 days. Intermediate results amount to 120 TB and are archived for future use.
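The totals above can be cross-checked against the per-stage figures in the table with simple arithmetic (the per-stage values below are the approximate numbers listed in the table, not independent measurements):

```python
# Approximate per-stage core hours from the goSNAP summary table.
core_hours = {
    "slicing/repack": 48_000 + 26_000,   # ~48 k slicing + ~26 k repacking
    "calling": 1_400_000,
    "genotype_likelihood": 15_600,
    "imputation_phasing": 3_700_000,
}
# Per-stage wall-clock time in days.
days = {
    "slicing/repack": 5,
    "calling": 14,
    "genotype_likelihood": 1,
    "imputation_phasing": 30,
}

total_core_hours = sum(core_hours.values())   # 5,189,600 ≈ 5.2 million
total_days = sum(days.values())               # 50
print(f"~{total_core_hours / 1e6:.1f} million core hours over {total_days} days")
```

Running this prints `~5.2 million core hours over 50 days`, matching the "Total" row of the table.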
Fig. 1 (a) A resource-constraint analysis diagram of computing resources with respect to the three available architectures; feasibility is measured in terms of cost, time, and resource bounds. (b) A resource-constraint analysis diagram of the variant calling stages for the CHARGES-F3 dataset; feasibility is measured in terms of cost, time, and the limits of the AWS cloud environment, supercomputer, and LHPC, respectively. The feasibility measure is for illustration only and is not drawn from measured data
Fig. 2 The goSNAP pipeline workflow minimizes egress charges. Variant calling (Stage A) and genotype likelihood calling (Stage C) are done on the AWS cloud, consensus filtering and imputation preprocessing are done on the LHPC (Stage B), and imputation and phasing (Stage D) are done on the supercomputers at Rice University and Oak Ridge National Laboratory
Variant calling sensitivity and specificity: the consensus 3of4 approach achieves high specificity and a low FDR without a substantial loss of sensitivity
| Metric | Consensus 3of4 | Consensus 2of4 | GATK-HC | GATK-UG | GotCloud | SNPTools |
|---|---|---|---|---|---|---|
| # SNVs | 72,945,834 | 86,233,412 | 103,439,411 | 104,649,069 | 78,483,824 | 66,290,585 |
| Ti/Tv | 2.12 | 2.08 | 2.00 | 2.00 | 2.09 | 1.99 |
| % in 1000G | 50.22 % | 43.75 % | 36.35 % | 36.54 % | 46.91 % | 51.17 % |
| % in dbSNP | 40.25 % | 35.31 % | 28.88 % | 29.53 % | 37.93 % | 41.91 % |
| Sensitivity | 63.80 % | 68.98 % | 68.51 % | 69.99 % | 64.17 % | 51.26 % |
| Specificity | 99.92 % | 99.70 % | 99.30 % | 99.54 % | 99.86 % | 99.13 % |
| FDR | 3.34 % | 11.29 % | 22.91 % | 16.16 % | 6.12 % | 33.11 % |
The gold standard dataset consists of 4612 samples with 80–100× coverage. All four callers are necessary for increasing the yield of SNVs
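The consensus 3of4 rule retains a candidate SNV site only if at least three of the four callers (GATK-HC, GATK-UG, GotCloud, SNPTools) report it. A minimal sketch of this vote, with callsets modeled as sets of site keys (the function name, data layout, and toy sites are illustrative, not the goSNAP implementation, which operates on full VCF output):

```python
# Illustrative consensus filter: keep sites called by >= min_support callers.
# Each caller's output is modeled as a set of (chrom, pos, alt) site keys;
# a real pipeline would parse per-caller VCF files instead.
from collections import Counter

def consensus_sites(callsets, min_support=3):
    """Return site keys reported by at least `min_support` callsets."""
    votes = Counter(site for calls in callsets for site in calls)
    return {site for site, n in votes.items() if n >= min_support}

# Toy callsets standing in for the four callers' outputs.
gatk_hc  = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G")}
gatk_ug  = {("chr1", 100, "A"), ("chr1", 200, "T")}
gotcloud = {("chr1", 100, "A"), ("chr2", 50, "G")}
snptools = {("chr1", 100, "A"), ("chr1", 200, "T")}

kept = consensus_sites([gatk_hc, gatk_ug, gotcloud, snptools])
# ("chr2", 50, "G") has only 2 supporting callers, so it is filtered out.
print(sorted(kept))
```

Raising `min_support` trades sensitivity for specificity, which is the pattern visible in the table: 3of4 has fewer SNVs and lower sensitivity than 2of4, but much higher specificity and a far lower FDR.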