Nicolas Dierckxsens, Tong Li, Joris R Vermeesch, Zhi Xie
Abstract
Accurate simulations of structural variation distributions and sequencing data are crucial for the development and benchmarking of new tools. We develop Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations reveal the strengths and weaknesses of currently available structural variation callers and long-read sequencing platforms. Based on these findings, we develop a new method (combiSV) that combines the results from structural variation callers into a superior call set with increased recall and precision, an improvement that is also observed for the latest structural variation benchmark set developed by the GIAB Consortium.
Keywords: Benchmark; Long-read sequencing; Simulated model; Structural variation
Year: 2021 PMID: 34911553 PMCID: PMC8672642 DOI: 10.1186/s13059-021-02551-4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Available features and system requirements of structural variation simulators
*SCNVsim and SURVIVOR were excluded from the benchmark
Fig. 1 Context-specific error patterns for mismatches and indels. A Context-specific error patterns for real data of Nanopore (9.4.1) and simulated data from Sim-it. B Context-specific error patterns for real data of PacBio Sequel II and simulated data from Sim-it. C Deletion lengths for real Nanopore data and the simulations of the benchmarked tools. D Insertion lengths for real Nanopore data and the simulations of the benchmarked tools
Benchmark statistics on three simulated datasets of 24,600 SVs for 6 existing SV callers and combiSV (combiSV (6): all 6 tools combined)
Fig. 2 Structural variance detection stats for a series of Nanopore and PacBio HiFi datasets with increasing sequencing depths
Precision and recall statistics for each type of SV from the Nanopore 20x dataset. (combiSV (6): all 6 tools combined)
Fig. 3 Structural variance detection stats for a series of Nanopore datasets with increasing sequencing lengths
Comparison between combiSV and SURVIVOR for 9 combinations of three SV callers on a simulated Nanopore dataset of 20x and the GIAB reference dataset (Nanopore). The highest scores between combiSV and SURVIVOR are indicated in gray
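The precision and recall scores referred to in the captions above follow the standard definitions for comparing a call set against a truth set. A minimal sketch, assuming the calls have already been matched to the truth set and counted; the function name `precision_recall` and the counts below are hypothetical and are not taken from the paper or from combiSV:

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) from raw SV call counts.

    precision = TP / (TP + FP): fraction of reported calls that are real.
    recall    = TP / (TP + FN): fraction of real SVs that were reported.
    """
    called = true_positives + false_positives      # everything the caller reported
    expected = true_positives + false_negatives    # everything in the truth set
    precision = true_positives / called if called else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not values from the benchmark).
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={p:.3f} recall={r:.3f}")
```

How a call is matched to a truth-set SV (for example, the allowed breakpoint distance or length difference) is not specified here and would need to be fixed before counting.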