Shifu Chen, Yanqing Zhou, Yaru Chen, Tanxiao Huang, Wenting Liao, Yun Xu, Zhicheng Li, Jia Gu.
Abstract
BACKGROUND: Removing duplicates might be considered a well-resolved problem in next-generation sequencing (NGS) data processing. However, as NGS technology gains recognition in clinical applications, researchers are paying more attention to its sequencing errors and prefer to remove these errors while performing deduplication. Recently, a technology called the unique molecular identifier (UMI) has been developed to better identify sequencing reads derived from different DNA fragments. Most existing duplicate-removal tools cannot handle UMI-integrated data. Some modern tools can work with UMIs but are usually slow and use too much memory. Furthermore, existing tools rarely report rich statistical results, which are very important for quality control and downstream analysis. These unmet requirements drove us to develop an ultra-fast, simple, lightweight but powerful tool for duplicate removal and sequencing error suppression, with support for UMIs and informative reporting.
Keywords: Consensus reads; Deduplication; Next-generation sequencing; Unique molecular identifier
Year: 2019 PMID: 31881822 PMCID: PMC6933617 DOI: 10.1186/s12859-019-3280-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Features comparison of different deduplication or consensus read generating tools
| SAMtools | Picard | Picard | UMI-tools | |||
|---|---|---|---|---|---|---|
| Non-UMI mode | UMI mode | |||||
| No need to sort by read name | + | + | + | + | ||
| No need to fix mate information | + | + | + | + | ||
| No need to add UMI tag | + | + | + | + | + | |
| No need to sort by position again | + | + | + | + | ||
| JSON/Text Metrics | + | + | + | + | + | |
| HTML Report | + | + | ||||
Fig. 1 The brief workflow of gencore
Fig. 2 The coverage statistics figures in the HTML report
Fig. 3 Comparison of speed, peak memory, and processing results of different tools in both UMI and non-UMI modes. a Peak memory and execution time of different tools. SAMtools and Picard (in UMI mode) need to prepare the data before performing deduplication, whereas gencore and UMI-tools do not. b Average depth of the output BAM. For the cfDNA samples (1801, 1802 and 1803), the depths of the UMI-mode results are much higher than those of the non-UMI mode, indicating that over-deduplication may happen when deduplicating ultra-deep sequencing data without UMIs. c Specificity of downstream variant calling results compared with the gold standard results provided by NCCL
Fig. 4 Comparison of the alignment files before and after gencore processing. The position marked by double lines is the NM_005228.3(EGFR): c.2369C>T, p.T790M variant. a The mapped reads of the original alignment file. b The mapped reads after gencore processing. The false positive mismatches, which appear randomly in the original alignment file, are corrected by gencore
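
The over-deduplication effect noted in the Fig. 3 caption (panel b) can be illustrated with a minimal sketch. It is not gencore's algorithm: the read tuples, the UMI strings, and the helper `count_groups` are all hypothetical, and real tools also handle details such as UMI mismatches, strands, and paired-end coordinates. The sketch only shows why grouping reads by mapping position alone may collapse distinct molecules that a position-plus-UMI key keeps apart.

```python
from collections import defaultdict

# Toy reads: (chromosome, start, end, umi). All values are made up for illustration.
reads = [
    ("chr7", 100, 250, "AACG"),
    ("chr7", 100, 250, "AACG"),  # PCR duplicate of the first read
    ("chr7", 100, 250, "TTGC"),  # same coordinates, different original molecule
    ("chr7", 100, 250, "GGAT"),  # same coordinates, different original molecule
    ("chr7", 300, 450, "AACG"),
]

def count_groups(reads, use_umi):
    """Count duplicate groups; each group would collapse to one output read."""
    groups = defaultdict(list)
    for chrom, start, end, umi in reads:
        # Hypothetical grouping key: position only, or position plus UMI.
        key = (chrom, start, end, umi) if use_umi else (chrom, start, end)
        groups[key].append(umi)
    return len(groups)

without_umi = count_groups(reads, use_umi=False)
with_umi = count_groups(reads, use_umi=True)
print(f"reads kept without UMI: {without_umi}, with UMI: {with_umi}")
```

In this toy input, the position-only key keeps fewer reads than the UMI-aware key, which mirrors the lower output depth reported for non-UMI mode on the ultra-deep cfDNA samples.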