Christopher I Cooper1, Delia Yao1, Dorota H Sendorek1, Takafumi N Yamaguchi1, Christine P'ng1, Kathleen E Houlahan1,2, Cristian Caloian1, Michael Fraser3, Kyle Ellrott4,5,6, Adam A Margolin4,5,7, Robert G Bristow2,3, Joshua M Stuart6, Paul C Boutros8,9,10,11,12,13.
Abstract
BACKGROUND: Platform-specific error profiles necessitate confirmatory studies where predictions made on data generated using one technology are additionally verified by processing the same samples on an orthogonal technology. However, verifying all predictions can be costly and redundant, and testing a subset of findings is often used to estimate the true error profile.
Keywords: Candidate-selection; DNA sequencing; Validation; Verification
Year: 2018 PMID: 30253747 PMCID: PMC6157051 DOI: 10.1186/s12859-018-2391-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Valection Candidate-Selection Strategies. (a) A hypothetical scenario where we have results from three callers available. Each call is represented using a dot. SNV calls that are shared by multiple callers are represented with matching dot colours. (b) The ‘random rows’ method, where all unique calls across all callers are sampled with equal probability. (c) The ‘directed-sampling’ method, where a ‘call overlap-by-caller’ matrix is constructed and the selection budget is distributed equally across all cells. (d) The ‘equal per caller’ method, where the selection budget is distributed evenly across all callers. (e) The ‘equal per overlap’ method, where the selection budget is distributed evenly across all levels of overlap (i.e. call recurrence across callers). (f) The ‘increasing with overlap’ method, where the selection budget is distributed across overlap levels in proportion to the level of overlap. (g) The ‘decreasing with overlap’ method, where the selection budget is distributed across overlap levels in inverse proportion to the level of overlap.
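The strategies in Fig. 1 can be illustrated with a minimal Python sketch. This is not the Valection implementation: the function names, the `calls_by_caller` dictionary layout, the handling of shared calls, and the `weight` function used for the overlap-based strategies are all assumptions made for illustration. Budget shares are rounded down, so the sketch may spend slightly less than the full budget.

```python
import random
from collections import defaultdict

# Hypothetical input layout: {caller_name: set_of_call_ids}.

def random_rows(calls_by_caller, budget, seed=0):
    """'Random rows': sample unique calls across all callers uniformly."""
    rng = random.Random(seed)
    unique_calls = sorted(set().union(*calls_by_caller.values()))
    return set(rng.sample(unique_calls, min(budget, len(unique_calls))))

def equal_per_caller(calls_by_caller, budget, seed=0):
    """'Equal per caller': give each caller the same share of the budget.

    Assumption: a call already picked for an earlier caller is not picked
    again, so each caller contributes up to its share of new calls.
    """
    rng = random.Random(seed)
    share = budget // len(calls_by_caller)
    selected = set()
    for caller in sorted(calls_by_caller):
        pool = sorted(calls_by_caller[caller] - selected)
        selected.update(rng.sample(pool, min(share, len(pool))))
    return selected

def overlap_levels(calls_by_caller):
    """Group calls by overlap level (number of callers reporting them)."""
    counts = defaultdict(int)
    for calls in calls_by_caller.values():
        for call in calls:
            counts[call] += 1
    levels = defaultdict(list)
    for call, n in counts.items():
        levels[n].append(call)
    return levels

def by_overlap(calls_by_caller, budget, weight, seed=0):
    """Distribute the budget across overlap levels according to weight(n).

    Assumed weights: lambda n: 1 ('equal per overlap'), lambda n: n
    ('increasing with overlap'), lambda n: 1 / n ('decreasing with overlap').
    """
    rng = random.Random(seed)
    levels = overlap_levels(calls_by_caller)
    total = sum(weight(n) for n in levels)
    selected = set()
    for n in sorted(levels):
        share = int(budget * weight(n) / total)  # rounded down
        pool = sorted(levels[n])
        selected.update(rng.sample(pool, min(share, len(pool))))
    return selected
```

In this sketch a caller with very few calls (for example three calls against another caller's hundred) is always represented under `equal_per_caller`, whereas `random_rows` may miss it entirely.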
Fig. 2 Verification Selection Experimental Design. Verification candidates were selected from somatic mutation calling results of multiple algorithms run on three in silico tumours (IS1, IS2, and IS3). Candidate selection was performed separately on each tumour’s set of results using all combinations of five different verification budgets (i.e. number of calls selected) and six different selection strategies. F1 scores were calculated for each set of selected calls and compared to F1 scores calculated from the full prediction set. To compare the effect of the numbers of algorithms used, datasets were further subset using four different metrics.
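The comparison described in Fig. 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' evaluation code: `f1_score` and `estimated_f1` are hypothetical helpers, calls and truth are represented as Python sets, and the estimate simply restricts both sets to the selected (verified) calls.

```python
def f1_score(predicted, truth):
    """Harmonic mean of precision and recall for two sets of calls."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def estimated_f1(caller_calls, truth, selected):
    """Estimate a caller's F1 using only the selected (verified) calls.

    Returns None when none of the caller's calls were selected, since no
    estimate is possible for that caller in that case.
    """
    verified_calls = caller_calls & selected
    if not verified_calls:
        return None
    return f1_score(verified_calls, truth & selected)

# Toy example: compare the subset estimate with the full-set score.
caller_calls = {1, 2, 3, 4}
truth = {1, 2, 5}
selected = {1, 3, 5}
full = f1_score(caller_calls, truth)
estimate = estimated_f1(caller_calls, truth, selected)
error = abs(full - estimate)
```

The absolute difference `error` is one simple way to score how well a selection strategy recovers the full-set F1. The precision adjustment mentioned in Fig. 4 is not modelled here.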
Fig. 3 All Synthetic Data Simulation Results for Selection Strategy Parameter Combinations. Overall, the best results are obtained using the ‘equal per caller’ method. The ‘random rows’ approach scores comparably except in cases where there is high variability in prediction set sizes across callers. Calls from low-call callers are less likely to be sampled at random and, in cases where none are sampled, it is not possible to get performance estimates for those callers. Failed estimate runs are displayed in grey.
Fig. 4 F1 Scores for All Synthetic Dataset Replicate Runs. Top selection strategies perform consistently across replicate runs. Strategies are ordered by median scores. The adjustment step in precision calculations improves the ‘equal per caller’ method, but shows little effect on ‘random rows’.