Haibin Xu, Xiang Luo, Jun Qian, Xiaohui Pang, Jingyuan Song, Guangrui Qian, Jinhui Chen, Shilin Chen.
Abstract
The presence of duplicates introduced by PCR amplification is a major issue in paired short reads from next-generation sequencing platforms. These duplicates might have a serious impact on research applications, such as scaffolding in whole-genome sequencing and discovering large-scale genome variations, and are usually removed. We present FastUniq as a fast de novo tool for removal of duplicates in paired short reads. FastUniq identifies duplicates by comparing sequences between read pairs and does not require complete genome sequences as prerequisites. FastUniq is capable of simultaneously handling reads with different lengths and results in highly efficient running time, which increases linearly at an average speed of 87 million reads per 10 minutes. FastUniq is freely available at http://sourceforge.net/projects/fastuniq/.
Year: 2012 PMID: 23284954 PMCID: PMC3527383 DOI: 10.1371/journal.pone.0052249
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The processing flow chart for FastUniq.
Step 1: import all read pairs into memory; Step 2: sort read pairs based on nucleotide sequences; Step 3: identify duplicates in sorted read pairs and output the unique sequences.
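The three steps above can be illustrated with a minimal Python sketch. It is not FastUniq's own implementation: the function name `remove_duplicate_pairs` is hypothetical, read pairs are reduced to tuples of two sequence strings, and two pairs count as duplicates only when both sequences match exactly. How FastUniq compares reads of different lengths is not described here, so that case is left out.

```python
from typing import List, Tuple

# Assumption: a read pair is represented only by its two nucleotide sequences.
ReadPair = Tuple[str, str]


def remove_duplicate_pairs(pairs: List[ReadPair]) -> List[ReadPair]:
    """Return the unique read pairs, in sorted order (hypothetical helper)."""
    # Step 1 is implicit: all pairs are already in memory as a list.
    # Step 2: sort by sequence so that identical pairs become adjacent.
    ordered = sorted(pairs)

    # Step 3: one linear pass; keep a pair only if it differs from the
    # most recently kept pair.
    unique: List[ReadPair] = []
    for pair in ordered:
        if not unique or pair != unique[-1]:
            unique.append(pair)
    return unique


pairs = [
    ("ACGT", "TTGA"),
    ("GGCA", "CCAT"),
    ("ACGT", "TTGA"),  # exact duplicate of the first pair
    ("ACGT", "TTGC"),  # same first read, different second read: kept
]
unique_pairs = remove_duplicate_pairs(pairs)
```

Sorting first means duplicates can be found by comparing neighbours only, so the scan in step 3 touches each pair once.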
Figure 2. FastUniq three-tier architecture for storage of read pairs.
The high tier is designed to store hundreds of millions or more of paired reads. Each read pair, composed of two reads, is stored in a middle-tier ‘fastq_pair’ object, and each read is stored in a basic-tier ‘fastq’ object.
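A rough Python sketch of this three-tier layout follows. Only the tier names ‘fastq’ and ‘fastq_pair’ come from the caption; the class `PairStore`, all field names, and the `sort_key` method are assumptions made for illustration and do not reflect FastUniq's actual source code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Fastq:
    """Basic tier: one read (field names are assumed)."""
    header: str
    sequence: str
    quality: str


@dataclass
class FastqPair:
    """Middle tier: one read pair made of two reads."""
    left: Fastq
    right: Fastq

    def sort_key(self) -> Tuple[str, str]:
        # Hypothetical helper: order pairs by their nucleotide sequences.
        return (self.left.sequence, self.right.sequence)


@dataclass
class PairStore:
    """High tier: container for all read pairs (hypothetical name)."""
    pairs: List[FastqPair] = field(default_factory=list)

    def add(self, left: Fastq, right: Fastq) -> None:
        self.pairs.append(FastqPair(left, right))

    def __len__(self) -> int:
        return len(self.pairs)


store = PairStore()
store.add(Fastq("@r1/1", "ACGT", "IIII"), Fastq("@r1/2", "TTGA", "IIII"))
store.add(Fastq("@r2/1", "AAGT", "IIII"), Fastq("@r2/2", "CCGA", "IIII"))
store.pairs.sort(key=FastqPair.sort_key)
```

Keeping each read in its own object lets the two reads of a pair have different lengths without any change to the container.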
Figure 3. Results of duplicates removal for Illumina sequencing libraries from Acropora digitifera corresponding to multiple insert sizes.
(A) The number of read pairs before and after duplicates removal using FastUniq or the mapping-based pipeline for each library. (B) The percentage of duplicates in the results of the mapping-based pipeline identified using FastUniq or fastx_collapser for each library.
Figure 4. Running time performance of FastUniq.
The running time is measured by the ‘time’ command in the Linux operating system.