Yuansheng Liu (1), Xiaocai Zhang (2), Quan Zou (3), Xiangxiang Zeng (1)
(1) College of Information Science and Engineering, Hunan University, Changsha, Hunan 410012, China.
(2) Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW 2007, Australia.
(3) Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China.
Abstract
SUMMARY: Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources required by downstream applications. Here we develop minirmd, a de novo tool that removes duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. AVAILABILITY AND IMPLEMENTATION: https://github.com/yuansliu/minirmd. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
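
To make the idea of multi-round, minimizer-based clustering concrete, the following is a minimal Python sketch. It is not minirmd's implementation. The function names, the minimizer lengths in `ks`, the `max_mismatches` threshold, the use of Hamming distance, and the greedy keep-the-first-read policy are all assumptions chosen for illustration. Reads are bucketed by a strand-independent minimizer (the lexicographically smallest k-mer over the read and its reverse complement), and a read is dropped when it is within the mismatch threshold of a read already kept in the same bucket, in either orientation.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def minimizer(seq, k):
    """Smallest k-mer over the read and its reverse complement.

    Taking both strands makes the key identical for a read and its
    reverse complement, so they fall into the same bucket.
    """
    rc = reverse_complement(seq)
    return min(s[i:i + k] for s in (seq, rc) for i in range(len(s) - k + 1))


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(a, b, max_mismatches):
    """True if b matches a, on either strand, within max_mismatches."""
    if len(a) != len(b):
        return False
    return (hamming(a, b) <= max_mismatches
            or hamming(a, reverse_complement(b)) <= max_mismatches)


def remove_near_duplicates(reads, ks=(8, 6, 4), max_mismatches=1):
    """Greedy de-duplication, one clustering round per minimizer length.

    ks and max_mismatches are illustrative values (assumptions), not
    defaults taken from minirmd.
    """
    kept = list(reads)
    for k in ks:
        buckets = defaultdict(list)  # minimizer -> reads kept this round
        survivors = []
        for read in kept:
            if len(read) < k:
                survivors.append(read)  # too short to have a k-mer
                continue
            key = minimizer(read, k)
            if any(is_near_duplicate(rep, read, max_mismatches)
                   for rep in buckets[key]):
                continue  # near-duplicate of an already kept read
            buckets[key].append(read)
            survivors.append(read)
        kept = survivors
    return kept
```

In this sketch, a single round can miss a near-duplicate whose mismatch falls inside the minimizer itself, because the two reads then land in different buckets. A later round with a shorter minimizer gives such pairs another chance to share a bucket, which is one plausible motivation for using several minimizer lengths.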