Riyue Bao, Kyle Hernandez, Lei Huang, Wenjun Kang, Elizabeth Bartom, Kenan Onel, Samuel Volchenboum, Jorge Andrade.
Abstract
Whole exome sequencing has facilitated the discovery of causal genetic variants associated with human diseases at deep coverage and low cost. In particular, the detection of somatic mutations from tumor/normal pairs has provided insights into the cancer genome. Although there is an abundance of publicly-available software for the detection of germline and somatic variants, concordance is generally limited among variant callers and alignment algorithms. Successful integration of variants detected by multiple methods requires in-depth knowledge of the software, access to high-performance computing resources, and advanced programming techniques. We present ExScalibur, a set of fully automated, highly scalable and modulated pipelines for whole exome data analysis. The suite integrates multiple alignment and variant calling algorithms for the accurate detection of germline and somatic mutations with close to 99% sensitivity and specificity. ExScalibur implements streamlined execution of analytical modules, real-time monitoring of pipeline progress, robust handling of errors and intuitive documentation that allows for increased reproducibility and sharing of results and workflows. It runs on local computers, high-performance computing clusters and cloud environments. In addition, we provide a data analysis report utility to facilitate visualization of the results that offers interactive exploration of quality control files, read alignment and variant calls, assisting downstream customization of potential disease-causing mutations. ExScalibur is open-source and is also available as a public image on Amazon cloud.
Year: 2015 PMID: 26271043 PMCID: PMC4535852 DOI: 10.1371/journal.pone.0135800
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Highly modulated architecture of ExScalibur.
The pipelines contain seven major analysis steps. First, the pipeline checks the quality of the sequencing reads, performs adapter trimming (for both SE and PE reads), and merges 3’ overlapping PE reads (for PE reads only). Then the reads are aligned to the reference genome and filtered, duplicates are removed, and the alignment is refined. The pipelines calculate exon coverage and collect callable loci from the alignment. Afterwards, the pipelines detect, filter, and annotate variants for each aligner+caller combination. Finally, the pipelines archive the results, integrate the metrics and all variant sets, and generate a project data analysis report for visualization in ExScaliburViz. On completion, a runtime report is generated to illustrate the timeline of the analysis, with a detailed description of the commands, inputs, outputs, and dependencies. Intermediate reports are generated if the pipelines terminate prematurely due to software or hardware failure.
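The "aligner+caller combination" step can be illustrated with a minimal Python sketch. The aligner and germline caller names are taken from the evaluation tables in this record; the helper names (`aligner_caller_labels`, `merge_by_support`) and the support-count merging rule are hypothetical, since the caption does not state how ExScalibur integrates the variant sets.

```python
from collections import Counter
from itertools import product

# Names as they appear in the germline evaluation table.
ALIGNERS = ["BWA-mem", "Novoalign"]
GERMLINE_CALLERS = ["GATKHaplotypeCaller", "FreeBayes",
                    "IsaacVariantCaller", "SAMtools"]


def aligner_caller_labels(aligners, callers):
    """Return one label such as 'BWA-mem+FreeBayes' per aligner/caller pair."""
    return [f"{aligner}+{caller}" for aligner, caller in product(aligners, callers)]


def merge_by_support(call_sets, min_support=2):
    """Keep variants reported by at least `min_support` call sets.

    `call_sets` maps a combination label to a set of
    (chrom, pos, ref, alt) tuples. This voting rule is an assumption
    made for illustration, not the documented ExScalibur behavior.
    """
    counts = Counter()
    for variants in call_sets.values():
        counts.update(variants)  # each set contributes one vote per variant
    return {variant for variant, n in counts.items() if n >= min_support}
```

With two aligners and four germline callers, `aligner_caller_labels(ALIGNERS, GERMLINE_CALLERS)` yields the eight combinations listed in the germline table.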
Evaluation of GMD germline SNV detection in the AML dataset.
| Variant Set | TP | FP | TN | FN | Sensitivity (SD) % | Specificity (SD) % | Precision (SD) % |
|---|---|---|---|---|---|---|---|
| BWA-mem+GATKHaplotypeCaller | 15,856 | 33 | 22,564 | 516 | 97.17 (1.24) | 99.86 (0.34) | 99.80 (0.48) |
| Novoalign+GATKHaplotypeCaller | 15,833 | 33 | 22,564 | 538 | 97.03 (1.24) | 99.86 (0.34) | 99.80 (0.48) |
| BWA-mem+FreeBayes | 14,982 | 94 | 22,503 | 1,390 | 91.66 (0.90) | 99.54 (0.42) | 99.31 (0.62) |
| Novoalign+FreeBayes | 15,009 | 80 | 22,517 | 1,363 | 91.80 (0.87) | 99.62 (0.35) | 99.43 (0.52) |
| BWA-mem+IsaacVariantCaller | 11,560 | 11 | 22,586 | 4,812 | 73.53 (8.30) | 99.95 (0.05) | 99.90 (0.11) |
| Novoalign+IsaacVariantCaller | 11,288 | 11 | 22,586 | 5,083 | 71.77 (8.09) | 99.95 (0.05) | 99.90 (0.10) |
| BWA-mem+SAMtools | 15,153 | 60 | 22,537 | 1,219 | 91.82 (2.19) | 99.75 (0.50) | 99.62 (0.73) |
| Novoalign+SAMtools | 14,210 | 57 | 22,540 | 2,161 | 85.26 (4.21) | 99.76 (0.45) | 99.62 (0.71) |
| | 16,057 | 47 | 22,550 | 315 | 98.23 (0.99) | 99.80 (0.40) | 99.72 (0.56) |
Counts and percentages are shown as the average across 30 AML normal samples. SD: Standard Deviation.
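The three percentage columns follow from the standard confusion-matrix definitions. The sketch below, with a hypothetical `Confusion` helper class, computes them from one row of counts. Because the table reports the mean of per-sample percentages across 30 samples, values computed from the averaged counts need not match the table exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one variant set (hypothetical helper)."""
    tp: float
    fp: float
    tn: float
    fn: float

    @property
    def sensitivity(self) -> float:
        # Fraction of true variants that were called.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Fraction of non-variant sites correctly left uncalled.
        return self.tn / (self.tn + self.fp)

    @property
    def precision(self) -> float:
        # Fraction of called variants that are true.
        return self.tp / (self.tp + self.fp)


# Averaged counts from the BWA-mem+GATKHaplotypeCaller row.
bwa_gatk = Confusion(tp=15856, fp=33, tn=22564, fn=516)
for name in ("sensitivity", "specificity", "precision"):
    print(f"{name}: {100 * getattr(bwa_gatk, name):.2f}%")
```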
Evaluation of SMD somatic SNV detection in the simulation datasets.
| Variant Set | Dataset 1: TP | Dataset 1: FN | Dataset 1: Sensitivity % | Dataset 1: FNR | Dataset 2: FP | Dataset 2: TN | Dataset 2: Specificity % | Dataset 2: FPR |
|---|---|---|---|---|---|---|---|---|
| BWA-mem+MuTect | 690 | 52 | 92.99 | 7.01E-02 | 16 | 47,301,677 | 99.99997 | 3.38E-07 |
| Novoalign+MuTect | 684 | 58 | 92.18 | 7.82E-02 | 23 | 47,301,670 | 99.99995 | 4.86E-07 |
| BWA-mem+Shimmer | 550 | 192 | 74.12 | 2.59E-01 | 0 | 47,301,693 | 100.00000 | 0.00 |
| Novoalign+Shimmer | 536 | 206 | 72.24 | 2.78E-01 | 0 | 47,301,693 | 100.00000 | 0.00 |
| BWA-mem+SomaticSniper | 707 | 35 | 95.28 | 4.72E-02 | 110 | 47,301,583 | 99.99977 | 2.33E-06 |
| Novoalign+SomaticSniper | 697 | 45 | 93.94 | 6.06E-02 | 109 | 47,301,584 | 99.99977 | 2.30E-06 |
| BWA-mem+Strelka | 597 | 145 | 80.46 | 1.95E-01 | 16 | 47,301,677 | 99.99997 | 3.38E-07 |
| Novoalign+Strelka | 596 | 146 | 80.32 | 1.97E-01 | 19 | 47,301,674 | 99.99996 | 4.02E-07 |
| BWA-mem+VarScan2 | 708 | 34 | 95.42 | 4.58E-02 | 27 | 47,301,666 | 99.99994 | 5.71E-07 |
| Novoalign+VarScan2 | 705 | 37 | 95.01 | 4.99E-02 | 25 | 47,301,668 | 99.99995 | 5.29E-07 |
| BWA-mem+Virmid | 678 | 64 | 91.37 | 8.63E-02 | 0 | 47,301,693 | 100.00000 | 0.00 |
| Novoalign+Virmid | 690 | 52 | 92.99 | 7.01E-02 | 4 | 47,301,689 | 99.99999 | 8.46E-08 |
| | 713 | 29 | 96.09 | 3.91E-02 | 1 | 47,301,692 | 100.00000 | 0.00 |
Results are shown for high-quality variants that passed all quality filters. Additional precision digits were kept for Specificity to infer small differences.
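The rate columns can be recomputed from the counts: FNR = FN / (TP + FN) on Dataset 1 and FPR = FP / (FP + TN) on Dataset 2. The sketch below uses hypothetical function names and the same number formats as the table, including the five decimal places kept for specificity.

```python
def error_rates(tp, fn, fp, tn):
    """Return sensitivity/FNR from (tp, fn) and specificity/FPR from (fp, tn)."""
    positives = tp + fn   # spiked-in true mutations (Dataset 1)
    negatives = fp + tn   # sites with no true mutation (Dataset 2)
    return {
        "sensitivity_pct": 100 * tp / positives,
        "fnr": fn / positives,
        "specificity_pct": 100 * tn / negatives,
        "fpr": fp / negatives,
    }


def format_row(rates):
    """Format the rates as in the table: 2 decimals, sci notation, 5 decimals."""
    return (
        f"{rates['sensitivity_pct']:.2f}",
        f"{rates['fnr']:.2E}",
        f"{rates['specificity_pct']:.5f}",
        f"{rates['fpr']:.2E}",
    )


# Counts from the BWA-mem+MuTect row.
print(format_row(error_rates(tp=690, fn=52, fp=16, tn=47_301_677)))
```

For the BWA-mem+MuTect counts this prints `('92.99', '7.01E-02', '99.99997', '3.38E-07')`, matching that row. With tens of millions of true negatives, the extra specificity digits are what separate callers with 0, 16, or 110 false positives.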