Tongwu Zhang, Yingfeng Luo, Kan Liu, Linlin Pan, Bing Zhang, Jun Yu, Songnian Hu.
Abstract
The emergence of next-generation sequencing (NGS) technologies has significantly improved sequencing throughput and reduced costs. However, the short read length, duplicate reads and massive volume of data make data processing much more difficult and complicated than with first-generation sequencing technology. Although some software packages have been developed to assess data quality, those packages are either not easily available to users or require bioinformatics skills and computing resources. Moreover, almost all the quality assessment software currently available does not take sequencing errors into account when assessing duplicates in NGS data. Here, we present a new user-friendly quality assessment software package called BIGpre, which works for both Illumina and 454 platforms. BIGpre contains all the functions of other quality assessment software, such as the correlation between forward and reverse reads, read GC-content distribution, and base Ns quality. More importantly, BIGpre incorporates associated programs to detect and remove duplicate reads after taking sequencing errors into account, and to trim low-quality reads from raw data. BIGpre is primarily written in Perl and integrates graphical capability from the statistics package R. This package produces both tabular and graphical summaries of data quality for sequencing datasets from Illumina and 454 platforms. Processing hundreds of millions of reads within minutes, this package provides immediate diagnostic information for users to manipulate sequencing data for downstream analyses. BIGpre is freely available at http://bigpre.sourceforge.net.
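The idea of calling duplicates while tolerating sequencing errors can be illustrated with a minimal Python sketch. This is not BIGpre's implementation (which the abstract describes as primarily Perl, without algorithmic detail); the function names, the greedy grouping strategy, and the `compare_len` and `max_mismatches` parameters are all assumptions made for illustration.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def group_duplicates(reads, compare_len=50, max_mismatches=2):
    """Greedily group reads whose prefixes are near-identical.

    Hypothetical approach: a read joins the first existing group whose
    representative prefix differs from its own prefix by at most
    `max_mismatches` bases; otherwise it starts a new group.
    Returns a list of groups, each a list of read indices.
    """
    groups = []  # each entry: (representative_prefix, [read indices])
    for index, read in enumerate(reads):
        prefix = read[:compare_len]
        for representative, members in groups:
            same_length = len(representative) == len(prefix)
            if same_length and hamming(representative, prefix) <= max_mismatches:
                members.append(index)
                break
        else:
            # No sufficiently similar group was found.
            groups.append((prefix, [index]))
    return [members for _, members in groups]


reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT", "ACGAACGTAC"]
print(group_duplicates(reads, compare_len=10, max_mismatches=1))
# prints [[0, 1, 3], [2]]
```

With `max_mismatches=0` the same input yields four singleton groups, which shows how exact matching can miss duplicates that differ only by a sequencing error. The pairwise comparison here is quadratic in the number of groups, so a tool handling hundreds of millions of reads would need a more scalable strategy than this sketch.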
Year: 2011 PMID: 22289480 PMCID: PMC5054156 DOI: 10.1016/S1672-0229(11)60027-2
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1. The main read quality control in the Illumina sequencing platform. A. Per base sequence quality distribution. B. Per sequence quality score distribution. C. Per base “N” quality score distribution. D. Correlation between forward and reverse sequences. E. Reads quality score distribution in one Illumina tile. Q>20 in green and Q<=20 in red. F. Per sequence GC content distribution. G. The distribution of data production in all Illumina tiles. Q>30 in green, 20<Q<=30 in blue and Q<=20 in red.
Figure 2. The main read quality control in the 454 sequencing platform. A. Per base “N” sequence quality and number distributions. B. Per poly(N) sequence quality distribution at different positions. Forward Qmean indicates the poly(N) mean quality excluding the last base; Last Qmean indicates the last base quality in poly(N). C. Per poly(N) sequence quality distribution for different lengths. D. Distribution of the length of poly(N). E. Last base quality distribution in poly(N) at different positions.
Figure 3. Duplicate analysis in Illumina paired-end library. A. Cumulative scaffolding usage ratio (also cumulative useful reads ratio) in Illumina mate-pair library (insert size: 5 kb) with different duplicate length sets. B. New lane usage ratio with different length sets.
Figure 4. Frequency of duplicate read depth in 454 transcriptome sequencing. The read depth is indicated by the number of reads in each duplicate group.
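The three quality classes used in Figure 1G (Q>30, Q between 20 and 30, and Q<=20) can be reproduced with a short Python sketch. The captions do not say how a read is assigned a single Q value or which quality encoding is used, so this sketch assumes the mean Phred score per read and a Phred+33 ASCII offset by default; the function names and the `offset` parameter are hypothetical.

```python
from collections import Counter


def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into integer Phred scores.

    Assumes Phred+33 by default; pass offset=64 for older
    Illumina-style encodings.
    """
    return [ord(char) - offset for char in quality_string]


def quality_class(quality_string, offset=33):
    """Assign a read to one of three classes by its mean Phred score."""
    scores = phred_scores(quality_string, offset)
    if not scores:
        raise ValueError("empty quality string")
    mean_q = sum(scores) / len(scores)
    if mean_q > 30:
        return "Q>30"
    if mean_q > 20:
        return "20<Q<=30"
    return "Q<=20"


def tally_classes(quality_strings, offset=33):
    """Count how many reads fall into each quality class."""
    return Counter(quality_class(q, offset) for q in quality_strings)


counts = tally_classes(["IIII", "????", "5555", "IIII"])
print(counts["Q>30"], counts["20<Q<=30"], counts["Q<=20"])
# prints 2 1 1
```

Grouping these counts by tile identifier would give per-tile totals of the kind a stacked plot like panel G summarizes, though the grouping key depends on the read header format and is left out here.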