| Literature DB >> 24949238 |
Ruibang Luo1, Yiu-Lun Wong1, Wai-Chun Law1, Lap-Kei Lee1,2, Jeanno Cheung1, Chi-Man Liu1, Tak-Wah Lam1.
Abstract
This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of GPU and an intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 h to process 50-fold whole genome sequencing (∼750 million 100 bp paired-end reads), or just 25 min for 210-fold whole exome sequencing. BALSA's speed is rooted at its parallel algorithms to effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the App development of downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa.Entities:
Keywords: GPU; Genomics; HPC; NGS; Secondary analysis; Variant calling; Whole-exome sequencing; Whole-genome seqeuncing
Year: 2014 PMID: 24949238 PMCID: PMC4060040 DOI: 10.7717/peerj.421
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1BALSA, based on SOAP3-dp, performs the whole secondary analysis (raw reads to variants) in memory with most of the modules accelerated with GPU.
Figure 2A flowchart of the pipeline of BALSA.
BQSR denotes “base quality score recalibration”.
Aligners, post-processing tools, variant callers and integrated pipelines used for comparison with BALSA.
| Step | Tool | Citation |
|---|---|---|
| Aligner | BWA-bwaaln |
|
| BWA-bwamem |
| |
| SOAP3-dp |
| |
| Post-processing | GATK |
|
| Picard |
| |
| Variant caller | Atlas2 |
|
| Freebayes |
| |
| GATK HaplotypeCaller |
| |
| GATK UnifiedGenotyper |
| |
| Samtools |
| |
| Mutect |
| |
| Varscan |
| |
| Pipeline | ISAAC |
|
| Somatic Caller | Mutect |
|
| SomaticSniper |
|
Figure 3Time consumption comparison between pipelines analyzing YH 50-fold 100 bp paired-end WGS data.
Time consumption of different pipelines.
All number in hours.
| Step | BWAaln | BWAmem | SOAP3-dp | ISAAC | BALSA |
|---|---|---|---|---|---|
| Alignment | 46.16 | 14.56 | 4.12 | 9.89 | 5.24 |
| Sort and merge | 1.40 | 1.70 | 1.74 | ||
| Mark duplicate | 6.84 | 6.25 | 5.50 | ||
| Realigner target creator | 0.93 | 0.77 | 1.06 | ||
| Indel realigner | 10.89 | 7.37 | 15.70 | ||
| Base score recalibration | 5.20 | 4.75 | 4.91 | ||
| PrintReads | 12.17 | 9.92 | 9.47 | ||
| Variant calling | 4.41 | 3.37 | 3.77 | 2.03 | 0.24 |
| Total | 88.00 | 48.68 | 46.27 | 11.92 | 5.49 |
Statistics of variants called by different pipelines.
“VQSR LowQual” means variants passed GATK VQSR but with (i) QUAL<50 for pipelines using UnifiedGenotyper and (ii) QUAL<30 for BALSA. “RandomForest LowQual” means variants with probability ≥0.95 using random forest classification but with QUAL<30 for BALSA. Please refer to Supplemental Information 8.4.1 for the details of the variant QUAL profile of BALSA.
| Variant type | Metric | BWAaln | BWAmem | SOAP3-dp | ISAAC | BALSA |
|---|---|---|---|---|---|---|
| SNP | Raw | 4,175,654 | 4,267,377 | 4,978,914 | 3,429,162 | 5,239,864 |
| VQSR PASS | 3,324,891 | 3,307,619 | 3,383,853 | – | 3,444,915 | |
| VQSR LowQual | 151,933 | 136,392 | 308,321 | – | 877,964 | |
| RandomForest PASS | – | – | – | – | 3,433,397 | |
| RandomForest LowQual | – | – | – | – | 871,422 | |
| Ti/Tv | 2.08 | 2.07 | 2.05 | 2.08 | 2.04 | |
| dbSNP v137 | 99.62% | 99.47% | 98.60% | 99.29% | 98.51% | |
| Ref Hets | 54.40% | 54.40% | 55.40% | 57.20% | 58.20% | |
| Indel | Raw (Indel) | 605,966 | 615,351 | 685,541 | 455,103 | 974,033 |
| VQSR PASS | 576,889 | 615,351 | 624,629 | – | 671,914 | |
| RandomForest PASS | – | – | – | – | 630,827 | |
| dbSNP v137 | 90.70% | 90.49% | 87.80% | 93.38% | 89.01% |
Figure 4Correlation plot between the RandomForest Probability generated by BALSA and the VQSLOD value generated by GATK’s VQSR on YH 50-fold 100 bp paired-end WGS data.
Figure 5Venn graphs illustrating the overlaps between (1) BALSA, (2) the Ensemble call set, and (3) the known variants on both SNP and Indel.
AAF denotes “alternative allele frequency”, i.e., percentage of reads supporting the alternative allele among all simulated reads covering a variant. DP represents the number reads simulated covering a variant. Qual means the variant score assigned by BALSA.
Figure 6Venn graphs illustrating the overlaps between (1) BALSA, (2) ISAAC, and (3) the known variants on both SNP and Indel.
Run time, number of SNPs passing filter (with PASS tag), union of SNP sites and total number of SNP conflicts of BALSA and BWA + GATK for the NA12877, NA12878 and NA12882 family.
Union is the SNP sites called in all samples or called in any sample.
| Pipeline | Sample | Time (Hour) | SNPs (PASS) | Union | Conflicts | Conflict rate |
|---|---|---|---|---|---|---|
| BALSA | NA12877 | 6.98 | 3,522,647 | 4,556,818 | 209,552 | 4.60% |
| NA12878 | 6.24 | 3,439,917 | ||||
| NA12882 | 6.65 | 3,428,070 | ||||
| BWA + GATK | NA12877 | 87.81 | 3,125,185 | 4,327,046 | 236,608 | 5.47% |
| NA12878 | 91.42 | 3,158,382 | ||||
| NA12882 | 87.01 | 3,183,451 |