Matthew A Field, Vicky Cho, T Daniel Andrews, Chris C Goodnow.
Abstract
A diversity of tools is available for identification of variants from genome sequence data. Given the current complexity of incorporating external software into a genome analysis infrastructure, a tendency exists to rely on the results from a single tool alone. The quality of the output variant calls is highly variable, however, depending on factors such as sequence library quality as well as the choice of short-read aligner, variant caller, and variant caller filtering strategy. Here we present a two-part study. First, we use the high quality 'genome in a bottle' (GIAB) reference set to demonstrate the significant impact that the choice of aligner, variant caller, and variant caller filtering strategy has on overall variant call quality, and further how certain variant callers outperform others with increased sample contamination, an important consideration when analyzing sequenced cancer samples. This analysis confirms previous work showing that combining the variant calls of multiple tools yields the best quality variant set, for either specificity or sensitivity, depending on whether the intersection or the union of all variant calls is used, respectively. Second, we analyze a melanoma cell line derived from a control lymphocyte sample to determine whether software choices affect the detection of clinically important melanoma risk-factor variants, finding that only one of the three such variants is unanimously detected under all conditions. Finally, we describe a cogent strategy for implementing a clinical variant detection pipeline; a strategy that requires careful software selection, optimization of variant caller filtering, and combined variant calls in order to effectively minimize false negative variants. While implementing such features represents an increase in complexity and computation, the results offer indisputable improvements in data quality.
Year: 2015 PMID: 26600436 PMCID: PMC4658170 DOI: 10.1371/journal.pone.0143199
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Analysis Workflow.
BAM files from BWA, Isaac aligner, and Bowtie2 were paired with each of GATK, Isaac variant caller, and SAMtools, each run both with and without additional filtering (VQSR, LowGQX, and BAQ respectively). Output VCF files were regularized using custom code, and variants from GIAB high quality regions were taken forward to generate false positive and false negative rates.
Fig 2. Software concordance Venn diagrams.
Merged variant calls for each tool were overlapped with other tools of the same variety for each variant type. Aligners are compared in row 1, variant callers without filtering in row 2 and variant callers with filtering in row 3.
Fig 3. Software concordance ROC curves.
Merged variant calls for each tool were calculated and ROC curves generated using the genome quality score. Aligners are compared in row 1, variant callers without filtering in row 2, and variant callers with filtering in row 3.
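The caption does not say how the ROC points were derived, so the following is only a minimal sketch of one common approach: rank calls by quality score, lower the threshold step by step, and record the cumulative fractions of true and false calls. The function name `roc_points`, the `(quality, is_true)` input layout, and the normalization by the totals within the call set are all assumptions made for illustration.

```python
from typing import List, Tuple


def roc_points(calls: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (false positive fraction, true positive fraction) points.

    `calls` is a hypothetical list of (quality score, is_in_truth_set) pairs.
    The quality threshold is swept from the highest score to the lowest.
    """
    total_true = sum(1 for _, is_true in calls if is_true)
    total_false = len(calls) - total_true
    ranked = sorted(calls, key=lambda c: c[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        quality = ranked[i][0]
        # Consume every call that shares this score so ties yield one point.
        while i < len(ranked) and ranked[i][0] == quality:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((
            fp / total_false if total_false else 0.0,
            tp / total_true if total_true else 0.0,
        ))
    return points


# Toy data only; not taken from the study.
example = [(99.0, True), (80.0, True), (80.0, False), (10.0, False)]
curve = roc_points(example)
```

Grouping tied scores into a single point avoids an arbitrary ordering among calls that no threshold could separate.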
Total SNV calls and tool-specific SNV calls for each aligner and variant caller run with and without filtering.
| Software | Filter | Total SNV Calls | False Positive Rate | False Negative Rate | Tool-specific SNVs | Tool-specific FP Rate |
|---|---|---|---|---|---|---|
| Bowtie2 | N/A | 238829 | 8.44% | 3.15% | 2050 | 53.41% |
| bwa | N/A | 241513 | 8.55% | 2.96% | 3860 | 39.27% |
| Isaac_align | N/A | 183241 | 4.02% | 3.31% | 1030 | 36.12% |
| GATK | None | 229215 | 7.43% | 2.91% | 59275 | 15.19% |
| GATK | VQSR | 228239 | 7.06% | 2.99% | N/A | N/A |
| Isaac_VC | None | 168348 | 6.35% | 4.27% | 7815 | 32.50% |
| Isaac_VC | LowGQX | 79366 | 4.56% | 7.47% | N/A | N/A |
| Samtools | None | 169582 | 6.12% | 3.68% | 554 | 86.64% |
| Samtools | BAQ | 107243 | 3.84% | 6.52% | N/A | N/A |
The union of variant calls for each tool was calculated and false positive and false negative rates determined relative to high quality GIAB variants. Tool-specific calls were also calculated, defined as SNVs specific to a single tool.
From bases overlapping ENSEMBL v75 canonical transcripts.
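The rates in this table are reported relative to the GIAB high quality variants, but the exact formulas are not given here. As a hedged illustration, the sketch below assumes the false positive rate is the fraction of calls absent from the truth set and the false negative rate is the fraction of truth variants that were not called. The function name `error_rates` and the `(chrom, pos, ref, alt)` variant key are hypothetical.

```python
from typing import Iterable, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); assumed key


def error_rates(calls: Iterable[Variant],
                truth: Iterable[Variant]) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) under the
    assumed definitions: FP rate = |calls - truth| / |calls| and
    FN rate = |truth - calls| / |truth|."""
    call_set, truth_set = set(calls), set(truth)
    false_pos = len(call_set - truth_set)
    false_neg = len(truth_set - call_set)
    fp_rate = false_pos / len(call_set) if call_set else 0.0
    fn_rate = false_neg / len(truth_set) if truth_set else 0.0
    return fp_rate, fn_rate


# Toy data only; coordinates and alleles are made up.
truth = [("1", 100, "A", "G"), ("1", 200, "C", "T"),
         ("2", 300, "G", "A"), ("2", 400, "T", "C")]
calls = [("1", 100, "A", "G"), ("1", 200, "C", "T"),
         ("2", 300, "G", "A"), ("3", 500, "A", "T")]
fp_rate, fn_rate = error_rates(calls, truth)
```

Both call sets would need to be restricted to the same high quality regions before comparison; that step is omitted here.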
Total deletion calls and tool-specific deletion calls for each aligner and variant caller run with and without filtering.
| Software | Filter | Total Deletion Calls | False Positive Rate | False Negative Rate | Tool-specific Dels | Tool-specific FP Rate |
|---|---|---|---|---|---|---|
| Bowtie2 | N/A | 3617 | 33.26% | 23.73% | 379 | 29.29% |
| bwa | N/A | 3676 | 30.88% | 23.73% | 252 | 46.03% |
| Isaac_align | N/A | 2976 | 25.5% | 27.12% | 157 | 43.95% |
| GATK | None | 2084 | 28.69% | 27.12% | 260 | 17.31% |
| GATK | VQSR | 2049 | 27.57% | 28.18% | N/A | N/A |
| Isaac_VC | None | 2964 | 26.05% | 23.73% | 414 | 45.17% |
| Isaac_VC | LowGQX | 1602 | 31.34% | 35.59% | N/A | N/A |
| Samtools | None | 3160 | 29.4% | 27.12% | 958 | 50.21% |
| Samtools | BAQ | 3247 | 30.49% | 27.12% | N/A | N/A |
The union of variant calls for each tool was calculated and false positive and false negative rates determined relative to high quality GIAB variants. Tool-specific calls were also calculated, defined as deletions specific to a single tool.
From bases overlapping ENSEMBL v75 canonical transcripts.
Total insertion calls and tool-specific insertion calls for each aligner and variant caller run with and without filtering.
| Software | Filter | Total Insertion Calls | False Positive Rate | False Negative Rate | Tool-specific Ins | Tool-specific FP Rate |
|---|---|---|---|---|---|---|
| Bowtie2 | N/A | 4367 | 27.11% | 22.39% | 565 | 35.04% |
| bwa | N/A | 4073 | 20.94% | 31.34% | 215 | 61.40% |
| Isaac_align | N/A | 4180 | 31.27% | 31.34% | 843 | 14.95% |
| GATK | None | 2140 | 20.56% | 35.82% | 74 | 14.86% |
| GATK | VQSR | 2105 | 19.62% | 37.31% | N/A | N/A |
| Isaac_VC | None | 3334 | 19.26% | 25.37% | 323 | 56.97% |
| Isaac_VC | LowGQX | 1634 | 23.32% | 34.33% | N/A | N/A |
| Samtools | None | 4856 | 34.1% | 22.39% | 2102 | 41.29% |
| Samtools | BAQ | 4998 | 34.87% | 20.90% | N/A | N/A |
The union of variant calls for each tool was calculated and false positive and false negative rates determined relative to high quality GIAB variants. Tool-specific calls were also calculated, defined as insertions specific to a single tool.
From bases overlapping ENSEMBL v75 canonical transcripts.
Variant calls grouped by frequency of detection.
| Variant Type | Number of times variant detected (out of 9 total software pairs) | Total Variant Calls | False Positive Rate | False Negative Rate |
|---|---|---|---|---|
| SNV | 9 (Intersection) | 106594 | 3.29% | 6.78% |
| SNV | ≥ 1 (Union) | 245360 | 9.09% | 2.90% |
| SNV | 2–8 | 132989 | 12.18% | N/A |
| SNV | 1 | 5777 | 46.43% | N/A |
| Deletion | 9 (Intersection) | 1158 | 14.68% | 35.59% |
| Deletion | ≥ 1 (Union) | 4194 | 35.13% | 23.72% |
| Deletion | 2–8 | 2326 | 36.54% | N/A |
| Deletion | 1 | 710 | 63.80% | N/A |
| Insertion | 9 (Intersection) | 1415 | 12.16% | 46.26% |
| Insertion | ≥ 1 (Union) | 5524 | 35.36% | 19.40% |
| Insertion | 2–8 | 2612 | 26.03% | N/A |
| Insertion | 1 | 1497 | 73.55% | N/A |
Unfiltered variants were grouped based on the frequency of detection within nine possible aligner/variant caller pairs and segregated into four bins; variants in all 9 pairs, variants in at least 1 pair, variants in 2–8 pairs, and variants unique to 1 pair.
From bases overlapping ENSEMBL v75 canonical transcripts.
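The binning described for this table can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `bin_by_detection`, the bin labels, and the representation of each aligner/variant caller pair as a set of variant keys are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Set


def bin_by_detection(call_sets: Dict[str, Set[Hashable]]) -> Dict[str, Set[Hashable]]:
    """Group variants by how many software pairs detected them.

    `call_sets` maps a hypothetical pair name (e.g. "bwa/GATK") to the
    set of variants that pair called.
    """
    n_pairs = len(call_sets)
    # Count, for every variant, the number of pairs that reported it.
    counts = Counter(v for calls in call_sets.values() for v in calls)
    return {
        "intersection": {v for v, c in counts.items() if c == n_pairs},
        "union": set(counts),
        "partial": {v for v, c in counts.items() if 1 < c < n_pairs},
        "unique": {v for v, c in counts.items() if c == 1},
    }


# Toy data with three pairs instead of nine; variant names are made up.
toy = {
    "pairA": {"v1", "v2", "v3"},
    "pairB": {"v1", "v2"},
    "pairC": {"v1", "v4"},
}
bins = bin_by_detection(toy)
```

The intersection, partial, and unique bins are disjoint and together make up the union, which matches the table: for SNVs, 106594 + 132989 + 5777 = 245360.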
SNV calls for GIAB data at simulated contamination levels.
| Variant Caller | Simulated Contamination Level | Variant Calls | False Positive Rate | False Negative Rate |
|---|---|---|---|---|
| GATK | 0% | 228239 | 6.62% | 3.23% |
| GATK | 25% | 163041 | 6.22% | 3.76% |
| GATK | 50% | 113263 | 5.50% | 5.54% |
| GATK | 75% | 69416 | 4.77% | 13.45% |
| GATK | 90% | 34018 | 4.13% | 40.74% |
| GATK | 95% | 17321 | 4.08% | 67.55% |
| GATK | 98% | 5662 | 4.06% | 88.87% |
| GATK | 99% | 1696 | 4.25% | 96.46% |
| Isaac VC | 0% | 79366 | 4.51% | 7.90% |
| Isaac VC | 25% | 73259 | 4.26% | 12.84% |
| Isaac VC | 50% | 61458 | 4.01% | 18.56% |
| Isaac VC | 75% | 43091 | 3.90% | 31.59% |
| Isaac VC | 90% | 23146 | 3.67% | 54.76% |
| Isaac VC | 95% | 10888 | 3.77% | 76.29% |
| Isaac VC | 98% | 3246 | 4.13% | 92.60% |
| Isaac VC | 99% | 887 | 5.41% | 97.96% |
| SAMtools | 0% | 106332 | 3.78% | 6.39% |
| SAMtools | 25% | 83337 | 3.66% | 9.17% |
| SAMtools | 50% | 63382 | 3.46% | 14.76% |
| SAMtools | 75% | 39853 | 3.29% | 30.92% |
| SAMtools | 90% | 16660 | 3.16% | 64.83% |
| SAMtools | 95% | 6365 | 3.03% | 85.64% |
| SAMtools | 98% | 1534 | 3.46% | 96.45% |
| SAMtools | 99% | 371 | 6.47% | 99.18% |
BWA alignments were used to generate filtered SNV lists for GATK, Isaac variant caller, and SAMtools at simulated contamination levels of 0%, 25%, 50%, 75%, 90%, 95%, 98%, and 99%. Variant lists were overlapped to GIAB high quality variants to determine false positive and false negative rates.
From bases overlapping ENSEMBL v75 canonical transcripts.
Melanoma cell line control variant calls overlapping annotated ClinVar melanoma risk factors.
| GRCh37 Coordinate (dbSNP) | dbSNP id | Missing Aligner / Variant Caller Pair (F = filtered, U = unfiltered) | ClinVar Annotation |
|---|---|---|---|
| 5:33951693 | rs16891982 | None | Malignant melanoma of skin |
| 11:89017961 | rs1126809 | Bowtie2/Isaac_vc (F) | Increased risk of cutaneous melanoma |
| 14:104165753 | rs861539 | Isaac_aligner/GATK (U, F); Isaac_aligner/Isaac_vc (U, F); Isaac_aligner/SAMtools (U, F) | Increased risk of cutaneous melanoma |
Variant calls from melanoma cell line C001 were overlapped to ClinVar and all annotated melanoma risk factors examined. Software pairs failing to detect these variants are reported in column 3 with variants listed as unfiltered (U) or filtered (F) to reflect whether variant caller filtering was applied.
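A check of this kind, listing which software pairs fail to report each annotated risk site, might look like the sketch below. The function name `missing_pairs`, the pair labels, and the toy call sets are hypothetical; only the three coordinates and dbSNP ids come from the table above, and the toy results are not meant to reproduce column 3.

```python
from typing import Dict, List, Set, Tuple

Site = Tuple[str, int]  # (chromosome, GRCh37 position)


def missing_pairs(risk_sites: Dict[str, Site],
                  call_sets: Dict[str, Set[Site]]) -> Dict[str, List[str]]:
    """For each risk site, list the software pairs whose calls lack it."""
    return {
        rsid: sorted(name for name, calls in call_sets.items()
                     if site not in calls)
        for rsid, site in risk_sites.items()
    }


# Coordinates and ids as listed in the table.
risk_sites = {
    "rs16891982": ("5", 33951693),
    "rs1126809": ("11", 89017961),
    "rs861539": ("14", 104165753),
}

# Hypothetical call sets for two pairs; illustrative only.
call_sets = {
    "pairA (U)": {("5", 33951693), ("11", 89017961), ("14", 104165753)},
    "pairB (F)": {("5", 33951693), ("14", 104165753)},
}
report = missing_pairs(risk_sites, call_sets)
```

An empty list for a site corresponds to the "None" entry in the table, meaning every pair detected the variant.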