Xiaopeng Bian, Bin Zhu, Mingyi Wang, Ying Hu, Qingrong Chen, Cu Nguyen, Belynda Hicks, Daoud Meerzaman.
Abstract
BACKGROUND: High-throughput sequencing has rapidly become an essential part of precision cancer medicine, but validating the results obtained from analyzing and interpreting genomic data remains a rate-limiting factor. The gold standard remains manual validation by expert panels, which has two weaknesses: it is costly in both funding and time, and it is necessarily selective. More economical, complementary means of validation may be possible. In this study we employed four synthetic data sets of increasing complexity (variants with known mutations spiked into specific genomic locations) to assess the sensitivity, specificity, and balanced accuracy of five open-source variant callers: FreeBayes v1.0, VarDict v11.5.1, MuTect v1.1.7, MuTect2, and MuSE v1.0rc. FreeBayes, VarDict, and MuTect were run in bcbio-nextgen, and their results were integrated into a single Ensemble call set. The known mutations provided a level of "ground truth" against which we evaluated variant-caller performance. To facilitate the comparison and evaluation, we segmented the whole genome into 10,000,000 base-pair fragments, which yielded 316 segments.
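The segmentation and the three metrics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names and the toy chromosome lengths are hypothetical, and the definition of a true negative (a position in a segment that is neither in the truth set nor called) is an assumption, since the abstract does not state how negatives were counted.

```python
from math import ceil

SEGMENT_SIZE = 10_000_000  # 10,000,000 bp, as stated in the abstract


def segment_index(pos, size=SEGMENT_SIZE):
    """Return the 0-based segment index for a 1-based genomic position."""
    return (pos - 1) // size


def count_segments(chrom_lengths, size=SEGMENT_SIZE):
    """Count fixed-size segments; the last segment of a chromosome may be shorter."""
    return sum(ceil(length / size) for length in chrom_lengths.values())


def segment_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, balanced accuracy) for one segment."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Toy chromosome lengths (hypothetical, not a real reference genome).
toy_genome = {"chrA": 25_000_000, "chrB": 10_000_000}
print(count_segments(toy_genome))
print(segment_metrics(tp=90, fp=10, fn=10, tn=890))
```

Balanced accuracy is the mean of sensitivity and specificity, so it is not dominated by the very large number of non-variant positions in each segment.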
Year: 2018 PMID: 30453880 PMCID: PMC6245711 DOI: 10.1186/s12859-018-2440-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
A summary of the complexity levels of the four synthetic data sets
| | Set 1 | Set 2 | Set 3 | Set 4 |
|---|---|---|---|---|
| Mutation types | SNV, structural variation (SV) (deletions, duplications, inversions) | SNV, SV (deletions, duplications, insertions, inversions) | SNV, SV (deletions, duplications, insertions, inversions) & INDEL | SNV, SV (deletions, duplications, inversions) & INDEL |
| Number of Somatic SNVs | 3535 | 4322 | 7903 | 16,268 |
| Cellularity | 100% | 80% | 100% | 80% |
| Subclone variant allele frequencies (VAFs) | N/A | N/A | 50%, 33%, 20% | 50%, 35% (effectively 30% and 15% due to cellularity) |
Fig. 2. In the resampling process, N substitutions were randomly selected and used to resample each data set 10,000 times. Here, N is the observed overlap number between methods. The random overlap number was calculated from the N substitutions for each selection
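The caption does not spell out the resampling procedure in full, so the following is only a generic sketch of a resampling-based overlap test, not the authors' method. It assumes a hypothetical universe of candidate sites, draws two random call sets of fixed sizes 10,000 times, and compares the observed overlap with the resulting random-overlap distribution. All names and parameters are illustrative.

```python
import random


def random_overlap_distribution(universe, size_a, size_b, n_iter=10_000, seed=0):
    """Overlap counts between two random subsets drawn n_iter times."""
    rng = random.Random(seed)
    sites = list(universe)
    overlaps = []
    for _ in range(n_iter):
        a = set(rng.sample(sites, size_a))
        b = set(rng.sample(sites, size_b))
        overlaps.append(len(a & b))
    return overlaps


def empirical_p_value(observed, overlaps):
    """Fraction of random overlaps at least as large as the observed one
    (with the usual +1 correction so the estimate is never exactly zero)."""
    exceed = sum(1 for o in overlaps if o >= observed)
    return (exceed + 1) / (len(overlaps) + 1)


# Hypothetical example: 1,000 candidate sites, two callers with 50 calls each.
dist = random_overlap_distribution(range(1000), 50, 50, n_iter=1000)
print(empirical_p_value(observed=20, overlaps=dist))
```

A small empirical p-value would indicate that two callers agree more often than random selection of sites would explain.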
The number of inserted “truth” SNPs in each data set and the numbers detected by each caller. The numbers listed in the first row next to “truth” indicate the number of inserted SNPs in the set
| Caller | Set 1 (truth: 3535) True positive | Set 1 False positive | Set 2 (truth: 4322) True positive | Set 2 False positive | Set 3 (truth: 7903) True positive | Set 3 False positive | Set 4 (truth: 16,268) True positive | Set 4 False positive |
|---|---|---|---|---|---|---|---|---|
| FreeBayes | 3379 | 6212 | 4150 | 6533 | 6989 | 5558 | 8762 | 9521 |
| VarDict | 3443 | 6988 | 4224 | 6094 | 7366 | 5388 | 12,598 | 4456 |
| MuTect | 3303 | 755 | 4081 | 1176 | 7100 | 1098 | 11,708 | 1030 |
| Ensemble | 3202 | 199 | 4085 | 278 | 7031 | 135 | 10,666 | 450 |
| MuTect2 | 3334 | 384 | 4087 | 966 | 7131 | 817 | 11,876 | 535 |
| MuSE | 3447 | 583 | 4197 | 432 | 6952 | 299 | 8941 | 1353 |
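The counts in this table are enough to derive per-caller sensitivity (true positives divided by the number of inserted truth variants) and precision (true positives divided by all calls). The sketch below does this for Sets 1 and 4 using the values transcribed from the table. Specificity is not computed because the table does not report true negatives. The dictionary layout and function names are illustrative assumptions.

```python
# (true positives, false positives) per caller, transcribed from the table.
TRUTH = {"Set 1": 3535, "Set 4": 16268}
CALLS = {
    "Set 1": {
        "FreeBayes": (3379, 6212), "VarDict": (3443, 6988),
        "MuTect": (3303, 755), "Ensemble": (3202, 199),
        "MuTect2": (3334, 384), "MuSE": (3447, 583),
    },
    "Set 4": {
        "FreeBayes": (8762, 9521), "VarDict": (12598, 4456),
        "MuTect": (11708, 1030), "Ensemble": (10666, 450),
        "MuTect2": (11876, 535), "MuSE": (8941, 1353),
    },
}


def summarize(set_name):
    """Return {caller: (sensitivity, precision)} for one data set."""
    truth = TRUTH[set_name]
    return {
        caller: (tp / truth, tp / (tp + fp))
        for caller, (tp, fp) in CALLS[set_name].items()
    }


for set_name in CALLS:
    for caller, (sens, prec) in summarize(set_name).items():
        print(f"{set_name} {caller:10s} sensitivity={sens:.3f} precision={prec:.3f}")
```

Reading the two numbers together shows the trade-off the raw counts imply: a caller can recover most inserted variants while still reporting many false positives.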
Fig. 1. Caller performance comparison: the number of SNPs each caller detected in each data set, compared with the corresponding number in the ground-truth file
Fig. 3. Similarities between callers are represented by the width of the connecting lines
Comparison of outcomes using data sets 1 (3a) and 4 (3b). The background color reflects the degree of agreement between pairs of callers, with greater intensities indicating higher degrees of agreement
Fig. 4. Example plots showing caller performance comparisons of the sequence of segments for chromosomes 1 and 8. SNPs detected in each segment were compared with SNPs in the corresponding segment of the truth file
Fig. 5. Correlations between sensitivity, specificity and balanced-accuracy scores of three callers to G-C content (a, b, c), and gene density (d, e, f) in set one
Fig. 6. Correlations between true positives, false positives and the total number of SNPs to G-C content (a, b, c) and gene density (d, e, f) in set one
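The correlations in Figs. 5 and 6 relate a per-segment score to a per-segment property such as G-C content. The sketch below shows one way such a relationship could be computed, assuming Pearson correlation (the captions do not name the coefficient used) and hypothetical helper names and toy inputs.

```python
from math import sqrt


def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (N and other symbols ignored)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy per-segment values (hypothetical, not taken from the study).
segment_gc = [gc_content(s) for s in ("AATTAT", "GGCATA", "GCGCGA", "GGCCGC")]
segment_sensitivity = [0.97, 0.95, 0.92, 0.90]
print(pearson(segment_gc, segment_sensitivity))
```

The same `pearson` helper applies unchanged to gene density or to raw true-positive and false-positive counts per segment.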