Zexuan Zhu1, Linsen Li1, Yongpeng Zhang1, Yanli Yang1, Xiao Yang1. 1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China and Infectious Disease Initiative, The Broad Institute, Cambridge, MA 02142, USA.
Abstract
SUMMARY: Exhaustive mapping of next-generation sequencing data to a set of relevant reference sequences has become an important task in pathogen discovery and metagenomic classification. However, runtime and memory usage increase with the number of reference sequences and the repeat content among them. In many applications, read mapping dominates the total running time. We developed CompMap, a reference-based compression program, to speed up this process. CompMap generates a non-redundant representative sequence for the input sequences. We demonstrate that reads can be mapped to this representative sequence with much reduced time and memory usage, and that the mappings to the original reference sequences can be recovered with high accuracy. AVAILABILITY AND IMPLEMENTATION: CompMap is implemented in C and freely available at http://csse.szu.edu.cn/staff/zhuzx/CompMap/. CONTACT: xiaoyang@broadinstitute.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
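The idea of mapping reads to a representative sequence and then recovering positions on the original references can be illustrated with a small sketch. This is not CompMap's actual data structure or algorithm, which the abstract does not describe; it assumes, purely for illustration, that the representative sequence is a concatenation of segments and that each segment records every original reference and offset it was derived from. The names `Segment` and `lift_back` are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of the representative sequence (hypothetical layout)."""
    rep_start: int   # 0-based start on the representative sequence
    length: int      # segment length in bases
    origins: list    # (reference_name, 0-based start on that reference)


def lift_back(segments, rep_pos, read_len):
    """Translate a hit on the representative sequence to the original references.

    `segments` must be sorted by `rep_start`. Returns a list of
    (reference_name, position) pairs, or an empty list when the read
    falls outside all segments or crosses a segment boundary
    (a case this toy sketch deliberately does not handle).
    """
    starts = [s.rep_start for s in segments]
    i = bisect_right(starts, rep_pos) - 1  # last segment starting at or before rep_pos
    if i < 0:
        return []
    seg = segments[i]
    offset = rep_pos - seg.rep_start
    if offset + read_len > seg.length:
        return []
    return [(name, start + offset) for name, start in seg.origins]


# Toy layout: refA and refB share their first 100 bases, stored once.
segments = [
    Segment(0, 100, [("refA", 0), ("refB", 0)]),   # shared region
    Segment(100, 50, [("refA", 100)]),             # unique to refA
    Segment(150, 40, [("refB", 100)]),             # unique to refB
]

shared_hits = lift_back(segments, 20, 30)
unique_hits = lift_back(segments, 160, 30)
```

In this toy layout, a read placed in the shared segment is reported once by the mapper but expands to one position per original reference, which is where the time and memory savings of mapping against a non-redundant sequence would come from under these assumptions.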