O Pipek, D Ribli, J Molnár, Á Póti, M Krzystanek, A Bodor, G E Tusnády, Z Szallasi, I Csabai, D Szüts.
Abstract
BACKGROUND: Detection of somatic mutations is one of the main goals of next-generation DNA sequencing. A wide range of experimental systems are available for the study of spontaneous or environmentally induced mutagenic processes. However, most of the routinely used mutation calling algorithms are not optimised for the simultaneous analysis of multiple samples, or for non-human experimental model systems that lack reliable databases of common genetic variation. Most standard tools either require numerous in-house post-filtering steps with scarce documentation or take an impractically long time to run. To overcome these problems, we designed the streamlined IsoMut tool, which can be readily adapted to experimental scenarios where the goal is the identification of experimentally induced mutations in multiple isogenic samples.
Keywords: Demonstrative algorithm; Low false positive rate; Multiple isogenic samples; Mutagenesis; Next generation sequencing; Somatic mutation detection
Year: 2017 PMID: 28143617 PMCID: PMC5282906 DOI: 10.1186/s12859-017-1492-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 An overview of the testing and optimisation of the mutation detection method
Fig. 2 Test set detection for WT and Mutant 1 samples. a, b Plots of mean reference nucleotide frequency values in the samples of the two genotypes; a whole genome, b diploid chromosomes only. Insets are zoomed-in regions of the underlying plot. Dashed rectangles mark the clusters identified as test cohorts. c The same plots generated for different sample numbers. Percentages in purple show the ratio of lost test set positions, while percentages in orange show the ratio of gained positions in the area within the dashed rectangle
Fig. 3 Quality components resulting from different parameter settings and different datasets. a Effects of varying other_rnf_min (different curves) and sample_mut_freq_min (along the curves) with constant sample_cov_min = 7. The inset contains maximal achievable TPRs for given FPR thresholds with the optimal parameter settings. b Effects of changing sample_cov_min (different curves) and sample_mut_freq_min (along the curves) with fixed other_rnf_min = 0.93. c Effects of varying other_rnf_min (different curves) and the S score parameter (along the curves) with sample_cov_min = 5 and sample_mut_freq_min = 0.21. d Effects of varying the size of the dataset. Measurement points correspond to the parameter settings of the inset of (a). Mean values and standard deviation of three randomly chosen datasets are shown (see Additional file 2). e Effects of decreased sample coverage. Measurement points correspond to the parameter settings of the inset of (a). Mean values and standard deviation of three randomly down-sampled measurements are shown (see Additional file 2)
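The three threshold parameters named in the Fig. 3 caption can be read as a per-position filter. The following is a minimal sketch of that reading, not IsoMut's actual implementation: it assumes that sample_cov_min is a minimum read coverage in the candidate sample, sample_mut_freq_min a minimum mutant allele frequency in that sample, and other_rnf_min a minimum reference nucleotide frequency required in every other sample. The function name and its inputs are hypothetical, the default values are taken from panel c of the caption, and the S score is left out.

```python
def is_unique_mutation(sample_cov, sample_mut_freq, other_rnfs,
                       sample_cov_min=5, sample_mut_freq_min=0.21,
                       other_rnf_min=0.93):
    """Return True if a position passes all three (assumed) thresholds.

    sample_cov      -- read coverage of the candidate sample at the position
    sample_mut_freq -- mutant allele frequency in the candidate sample
    other_rnfs      -- reference nucleotide frequencies of all other samples
    """
    # Assumption: the candidate sample must be covered well enough to trust.
    covered = sample_cov >= sample_cov_min
    # Assumption: the mutant allele must be frequent enough in that sample.
    mutated = sample_mut_freq >= sample_mut_freq_min
    # Assumption: every other isogenic sample must look like the reference.
    others_clean = all(rnf >= other_rnf_min for rnf in other_rnfs)
    return covered and mutated and others_clean


# A well-covered heterozygous call with clean background samples passes;
# the same call fails if one other sample also shows non-reference reads.
print(is_unique_mutation(30, 0.5, [1.0, 0.98]))
print(is_unique_mutation(30, 0.5, [1.0, 0.90]))
```

Under this reading, raising other_rnf_min or sample_mut_freq_min trades true positives for fewer false positives, which is the kind of trade-off the curves in Fig. 3 describe.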
Fig. 4 Results of running IsoMut without tuning the S score value. a SNV counts for each sample, grouped by genotype. Colours indicate the treatment of the given sample. b Indel counts for each sample, grouped by genotype. Colours indicate the treatment of the given sample, with darker bars representing insertions and lighter ones deletions
Comparison of runtimes of different tools with all available resources
| Tool | Number of parallel processes (12 cores) | Runtime (12 cores) | Equivalent runtime on 1 Gb genome (12 cores) | Runtime relative to IsoMut (12 cores) | Runtime (single core) | Runtime relative to IsoMut (single core) |
|---|---|---|---|---|---|---|
| IsoMut | 12 | 1 min 24 s | 4 h 56 min | 1 | 7 min | 1 |
| VarScan 2 | 5–6 | 16 min | 2 days 8 h | 11 | 1 h 20 min | 11 |
| MuTect | 6–7 | 1 h 7 min | 9 days 20 h | 48 | 4 h 55 min | 42 |
| MuTect2 | 4–5 | 4 h | 35 days 5 h | 171 | 21 h 6 min | 178 |
Runtime comparison of different mutation detection tools on a computer with 23 GB memory, using either 12 cores or a single core. The tools were run on the 4.735 Mb chicken chromosome 28 using the 30-sample dataset used throughout this study
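The "equivalent runtime on 1 Gb genome" and "runtime relative to IsoMut" columns can be reproduced from the measured 12-core runtimes. The sketch below assumes that runtime scales linearly with genome size, which is an inference from the table values rather than something the source states; the function names are illustrative.

```python
MEASURED_MB = 4.735   # size of chicken chromosome 28, from the table note
TARGET_MB = 1000.0    # 1 Gb genome


def scale_runtime(seconds, measured_mb=MEASURED_MB, target_mb=TARGET_MB):
    """Extrapolate a runtime to a larger genome, assuming linear scaling."""
    return seconds * target_mb / measured_mb


def format_duration(seconds):
    """Format seconds as days, hours and minutes, rounded to the minute."""
    total_minutes = round(seconds / 60)
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    parts = []
    if days:
        parts.append(f"{days} days")
    if hours:
        parts.append(f"{hours} h")
    parts.append(f"{minutes} min")
    return " ".join(parts)


# Measured 12-core runtimes from the table, in seconds.
runtimes = {
    "IsoMut": 1 * 60 + 24,
    "VarScan 2": 16 * 60,
    "MuTect": 67 * 60,
    "MuTect2": 4 * 3600,
}

for tool, secs in runtimes.items():
    relative = round(secs / runtimes["IsoMut"])
    print(tool, format_duration(scale_runtime(secs)), relative)
```

For IsoMut this gives 4 h 56 min, matching the table. The other tools come out within rounding of the tabulated values, which appear to be rounded to the nearest hour.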