Min Li (1), Binbin Wu (2), Xiaodong Yan (2), Junwei Luo (2), Yi Pan (3), Fang-Xiang Wu (4), Jianxin Wang (2).
1. School of Information Science and Engineering, Central South University, Changsha 410083, China. Electronic address: limin@mail.csu.edu.cn.
2. School of Information Science and Engineering, Central South University, Changsha 410083, China.
3. School of Information Science and Engineering, Central South University, Changsha 410083, China; Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA.
4. School of Information Science and Engineering, Central South University, Changsha 410083, China; Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada.
Abstract
MOTIVATION: Cheap and fast next-generation sequencing (NGS) technologies have greatly facilitated research on de novo assembly. Reliable contigs are critical for constructing reliable scaffolds. However, contigs generated by most assemblers contain errors because of limitations in assembly strategy and computational complexity. Among these errors, misassembly is one of the most harmful types.
RESULTS: In this paper, we propose a new method named "PECC" to identify and correct misassembly errors in contigs based on the paired-end read distribution. PECC extracts sequence regions with lower paired-end read support and verifies them based on the distribution of paired-end support. To validate its effectiveness, we applied PECC to the contigs produced by five popular assemblers on four real datasets, and we also carried out experiments to analyze the influence of PECC on scaffolding. The results show that PECC can reduce misassembly errors and improve scaffolding results, which demonstrates the promising applications of PECC in de novo assembly.
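The abstract does not give PECC's exact statistics, so the following is only a minimal sketch of the general idea of finding contig regions with low paired-end support. It assumes read pairs have already been mapped to a single contig and are given as half-open fragment intervals. The function names `paired_end_support` and `low_support_regions`, the `margin` parameter, and the "fraction of the median" threshold are all hypothetical choices for illustration, not the authors' method.

```python
from statistics import median


def paired_end_support(contig_len, pairs):
    """Count, for each contig position, how many read pairs span it.

    pairs: iterable of (left_start, right_end) fragment intervals,
    0-based and half-open, with both mates mapped to this contig
    (an assumed input format).
    """
    diff = [0] * (contig_len + 1)  # difference array
    for left_start, right_end in pairs:
        lo = max(0, left_start)
        hi = min(contig_len, right_end)
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    support, running = [], 0
    for delta in diff[:contig_len]:
        running += delta
        support.append(running)
    return support


def low_support_regions(support, margin, fraction=0.2):
    """Return half-open (start, end) runs whose support is below
    fraction * median of the interior support.

    The first and last `margin` positions are skipped, because support
    drops near contig ends even without any misassembly.
    The threshold rule is an illustrative assumption.
    """
    interior = support[margin:len(support) - margin]
    if not interior:
        return []
    threshold = fraction * median(interior)
    regions, start = [], None
    for i in range(margin, len(support) - margin):
        if support[i] < threshold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(support) - margin))
    return regions


# Toy example: two groups of fragments that never bridge position 49,
# as might happen at a misjoin between two unrelated sequences.
pairs = [(i, i + 30) for i in range(0, 20)] + [(i, i + 30) for i in range(50, 70)]
support = paired_end_support(100, pairs)
candidates = low_support_regions(support, margin=10)
```

In this toy input, no fragment spans position 49, so the flagged run brackets that position. A real pipeline would then need a verification step on each candidate region, which the abstract describes only as being based on the distribution of paired-end support.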