BIGpre: a quality assessment package for next-generation sequencing data.

Tongwu Zhang, Yingfeng Luo, Kan Liu, Linlin Pan, Bing Zhang, Jun Yu, Songnian Hu.

Abstract

The emergence of next-generation sequencing (NGS) technologies has significantly improved sequencing throughput and reduced costs. However, the short read length, duplicate reads and massive volume of data make data processing much more difficult and complicated than with first-generation sequencing technology. Although some software packages have been developed to assess data quality, they are either not easily available to users or require bioinformatics skills and computing resources. Moreover, almost none of the quality assessment software currently available takes sequencing errors into account when assessing duplicates in NGS data. Here, we present a new user-friendly quality assessment software package called BIGpre, which works for both Illumina and 454 platforms. BIGpre contains all the functions of other quality assessment software, such as the correlation between forward and reverse reads, read GC-content distribution, and the quality of undefined bases (N). More importantly, BIGpre incorporates associated programs that detect and remove duplicate reads after taking sequencing errors into account, and that trim low-quality reads from raw data. BIGpre is primarily written in Perl and integrates graphical capability from the statistics package R. The package produces both tabular and graphical summaries of data quality for sequencing datasets from Illumina and 454 platforms. Processing hundreds of millions of reads within minutes, it provides immediate diagnostic information that users can apply when preparing sequencing data for downstream analyses. BIGpre is freely available at http://bigpre.sourceforge.net.

Year: 2011 PMID: 22289480 PMCID: PMC5054156 DOI: 10.1016/S1672-0229(11)60027-2

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

Next-generation sequencing (NGS) technology has demonstrated its capacity to produce an enormous volume of data cheaply and at unprecedented speed. The variety of their features allows the NGS platforms to coexist in the marketplace, with some having clear advantages over others for particular applications. At present, five NGS platforms are available: the Roche GS-FLX 454 Genome Sequencer (also referred to as 454 sequencing), the Illumina Genome Analyzer (also referred to as Solexa sequencing), the ABI SOLiD analyzer, the Polonator G.007 and the Helicos HeliScope [1, 2, 3, 4]. The new Illumina HiSeq 2000 Genome Analyzer is capable of producing reads of 2×100 base pairs (bp) and generates about 200 gigabases (Gb) of short sequences per run. The current SOLiD 4.0 analyzer has a read length of up to 50 bp and can produce 80-100 Gb of mappable sequences per run. Additional platforms with faster sequencing speed and lower reagent costs have become available recently (such as Ion Torrent and PacBio). The giant datasets generated by the NGS platforms present big challenges and opportunities for software and algorithm development. During the past several years, a large number of new software applications and algorithms have been developed for sequence alignment or assembly, but only a few for quality assessment and visualization, such as TileQC, SolexaQA and PIQA. Meanwhile, among the concerns about NGS raw data, duplicate reads are a major problem for subsequent analysis on both Illumina and 454 platforms [10, 11]. Duplicate reads are mainly caused by emulsion PCR, especially when the sample is insufficient. To date, little software has been designed to detect or preprocess duplicate reads. Most of it is implemented within the pipelines of other software (e.g. SAMtools and GATK) and cannot be applied directly to raw reads.
Two widely used tools for duplicate read removal are rmdup in SAMtools and FASTQ/A Collapser in the FASTX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). The rmdup command in SAMtools removes potential PCR duplicates in short-read mapping projects and takes read mapping data (in SAM or BAM format), not raw reads, as input. Moreover, it recognizes duplicates by identical external coordinates, so sequencing errors, such as homopolymer errors in 454 reads, are not taken into account. FASTQ/A Collapser simply collapses identical sequences (in a FASTA file) into a single sequence and allows neither sequencing errors nor differences in read length. Because sequencing errors are common on NGS platforms and read length varies after low-quality trimming or on the 454 platform, BIGpre uses a mismatch setting to detect duplicates more precisely and takes into account how sequencing errors differ along the read length on different platforms. Here, we present BIGpre, a user-friendly and free software package that provides rapid, simple and comprehensive assessment of read quality and duplication rate for both Illumina and 454 platforms.
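The difference between exact collapsing and a mismatch-tolerant comparison can be illustrated with a minimal Python sketch. It is not BIGpre's code: the function names, the greedy grouping strategy and the `prefix_len` and `max_mismatches` parameters are assumptions made for illustration only.

```python
from collections import Counter


def hamming_within(a: str, b: str, max_mismatches: int) -> bool:
    """True if equal-length strings differ at no more than max_mismatches positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def collapse_exact(reads):
    """Exact collapsing: only identical reads are counted together."""
    return Counter(reads)


def group_with_mismatches(reads, prefix_len, max_mismatches):
    """Greedy grouping (hypothetical strategy): a read joins the first group whose
    representative prefix matches its own prefix within max_mismatches."""
    groups = []  # list of (representative_prefix, member_reads)
    for read in reads:
        prefix = read[:prefix_len]
        for representative, members in groups:
            if hamming_within(representative, prefix, max_mismatches):
                members.append(read)
                break
        else:
            groups.append((prefix, [read]))
    return groups


reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACCTAC", "ACGTACGTACGG", "TTTTACGTAC"]
exact = collapse_exact(reads)
tolerant = group_with_mismatches(reads, prefix_len=10, max_mismatches=1)
print(len(exact), "distinct sequences by exact collapsing")
print(len(tolerant), "groups when one mismatch and length differences are tolerated")
```

In this toy input, exact collapsing treats the read with one substitution and the longer read as unique, whereas the tolerant comparison places them in the same group. The pairwise comparison is quadratic in the number of groups and is only meant to show the idea.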

Implementation

BIGpre is a command-line package written in Perl and available as source code for multiple Linux platforms (e.g. Fedora and Ubuntu). The current version (BIGpre 2.02) has been tested on Linux (2.6.18-164.el5, x86_64); a later version, BIGpre 3.0, is under test and will be available next year. BIGpre, along with an implementation file of the described method and an example, can be downloaded from the project website http://bigpre.sourceforge.net/. The package requires the R programming environment (version 2.5 or higher), which is available from “The R Project for Statistical Computing” (http://www.r-project.org/). The memory requirement is equal to the size of the input data.

Results and Discussion

Low-quality reads often result in a high mis-assembly rate in de novo sequencing and a high rate of false SNP calls in re-sequencing, whereas the presence of duplicate reads often leads to overestimation of sequencing coverage. Quality control of the raw data is therefore very important to ensure data reliability for further analysis. BIGpre comprises several programs designed to assess read quality in a general and flexible way for both Illumina and 454 sequencing data. All the programs are written as Perl scripts, and the package displays the results graphically with the statistics package R. To avoid exhausting memory, BIGpre reads data from disk on request instead of keeping the data cached. Here, we introduce the key features of BIGpre. Detailed documentation, including screenshots, is also available at the software website listed above. The following datasets are used as examples. Datasets from one lane (paired-end) and three lanes (mate-paired) of whole-genome sequencing of Litopenaeus vannamei were used for Illumina quality control and duplication analysis. The dataset used for 454 quality control and duplication analysis is a quarter run of transcriptome sequencing of Bemisia tabaci. All datasets were sequenced at the Beijing Institute of Genomics (Zhang, T., et al. unpublished).
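Reading data from disk on request, as mentioned above, can be sketched as a generator that yields one FASTQ record at a time, so that only running totals stay in memory. This is a Python illustration under the assumption of plain four-line FASTQ records; the function names are hypothetical and do not come from BIGpre, which is written in Perl.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) one record at a time from a four-line FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:].rstrip("\n"), seq, qual


def count_reads_and_bases(handle: TextIO) -> Tuple[int, int]:
    """Accumulate totals while streaming; no read is kept after it is counted."""
    n_reads = n_bases = 0
    for _name, seq, _qual in iter_fastq(handle):
        n_reads += 1
        n_bases += len(seq)
    return n_reads, n_bases


# An in-memory stream stands in for a file opened with open(path).
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTN\n+\nIIII#\n")
totals = count_reads_and_bases(demo)
print(totals)
```

The same generator can feed any per-read statistic, which keeps memory use independent of file size for statistics that only need counters.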

Sequencing quality assessment

The program solqs in BIGpre assesses the quality of Illumina sequencing data. In Illumina sequencing, read quality is assessed for each cycle along the read, including undefined bases (denoted “N”) (Figure 1A-C). The Phred value Q20 is an important quality threshold for evaluating the sequencing error rate. Generally, the mean GC content of all forward/reverse reads coincides with the genome GC content (gGC) in genome sequencing (Figure 1D), but the distribution of mean read quality often shows bias in regions with extremely high or low GC content. The correlation between forward and reverse reads is also affected by the machine running time (Figure 1E). Solqs evaluates the correlation between forward and reverse reads, analyses the ratio of the four nucleotides in each cycle along the read, and counts undefined bases. In addition, solqs can show boundary effects [8, 15] and the data production of each tile (Figure 1F, G).
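The per-cycle measures described above can be sketched in Python as follows. The sketch relies on the standard Phred definition, Q = -10 log10(P), under which Q20 corresponds to an error probability of 1%. The function names, the output layout and the default quality offset of 33 are assumptions for illustration (older Illumina pipelines used an offset of 64), and the Q20 count here uses Q >= 20; none of this is taken from solqs itself.

```python
def per_cycle_stats(records, offset=33):
    """records: iterable of (sequence, quality_string) pairs.
    Returns one dict per cycle: mean quality, fraction of bases with Q >= 20,
    count of undefined bases and the fraction of each nucleotide."""
    cycles = []
    for seq, qual in records:
        for i, (base, qchar) in enumerate(zip(seq, qual)):
            if i == len(cycles):
                cycles.append({"depth": 0, "qsum": 0, "q20": 0, "bases": {}})
            c = cycles[i]
            q = ord(qchar) - offset  # decode the ASCII-encoded Phred score
            c["depth"] += 1
            c["qsum"] += q
            c["q20"] += q >= 20
            c["bases"][base] = c["bases"].get(base, 0) + 1
    return [
        {
            "mean_q": c["qsum"] / c["depth"],
            "q20_fraction": c["q20"] / c["depth"],
            "n_count": c["bases"].get("N", 0),
            "base_fraction": {b: n / c["depth"] for b, n in c["bases"].items()},
        }
        for c in cycles
    ]


def gc_content(seq):
    """GC fraction of a read, ignoring undefined bases."""
    called = [b for b in seq if b in "ACGT"]
    return sum(b in "GC" for b in called) / len(called) if called else 0.0


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# 'I' decodes to Q40, '5' to Q20 and '#' to Q2 with offset 33.
records = [("ACGN", "II5#"), ("ACGT", "I5I#")]
stats = per_cycle_stats(records)
for cycle, s in enumerate(stats, start=1):
    print(cycle, s["mean_q"], s["q20_fraction"], s["n_count"])
```

Correlating forward and reverse reads would then amount to comparing the two per-cycle (or per-read) quality series produced this way for each end.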

Figure 1

The main read quality control in Illumina sequencing platform. A. Per base sequence quality distribution. B. Per sequence quality score distribution. C. Per base “N” quality score distribution. D. Correlation between forward and reverse sequences. E. Reads quality score distribution in one Illumina tile. Q>20 in green and Q<=20 in red. F. Per sequence GC content distribution. G. The distribution of data production in all Illumina tiles. Q>30 in green, Q>20 && Q<=30 in blue and Q<=20 in red.

The program 454QC in BIGpre evaluates the quality of reads sequenced on the 454 platform. Like solqs, 454QC reports read length and base quality distributions, GC content statistics, and an analysis of undefined bases (“N”) (Figure 2A). In addition, it introduces a new assessment of poly(N), the main quality concern in 454 data, covering poly(N) length and the base quality at different positions within poly(N) (Figure 2B-E).

Figure 2

The main read quality control in 454 sequencing platform. A. Per base “N” sequence quality and number distributions. B. Per poly(N) sequence quality distribution at different positions. Forward Qmean indicates the poly(N) mean quality excluding last base; Last Qmean indicates the last base quality in poly(N). C. Per poly(N) sequence quality distribution for different length. D. Distribution of the length of poly(N). E. Last base quality distribution in poly(N) at different positions.
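The poly(N) measures named in the Figure 2 caption (run length, Forward Qmean and Last Qmean) can be computed per read as in the Python sketch below. The function name, the tuple layout and the minimum run length of 3 are assumptions for illustration; the text does not state the threshold 454QC uses.

```python
from itertools import groupby


def homopolymer_runs(seq, quals, min_len=3):
    """Return (start, base, length, forward_qmean, last_q) for each run of identical
    bases of at least min_len. forward_qmean is the mean quality of the run excluding
    its last base; last_q is the quality of the last base in the run."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= max(min_len, 2):  # at least 2 so the forward part is non-empty
            run_q = quals[pos:pos + length]
            forward = run_q[:-1]
            runs.append((pos, base, length, sum(forward) / len(forward), run_q[-1]))
        pos += length
    return runs


seq = "ACAAAAGTTT"
quals = [30, 30, 32, 30, 28, 10, 30, 25, 20, 8]
runs = homopolymer_runs(seq, quals)
for run in runs:
    print(run)
```

In this toy read the last base of each run has a much lower score than the bases before it, which is the pattern that the Forward Qmean and Last Qmean panels are designed to expose.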

Duplicate read identification

Two programs in BIGpre, soldup and 454dup, identify and count duplicate reads by comparing read consistency. Duplicate reads in NGS data are usually caused either by the failure of the emulsion PCR step to match one sequence to one bead, or by insufficient starting material. Large-insert libraries, such as a 20 kb mate-paired library on the 454 platform, are prone to generating duplicate reads. With insufficient starting material, PCR copies of the same target are sequenced multiple times. Removing duplicates from raw reads is very important, because duplicates can cause genomic and transcriptomic mis-assembly, mislead variant base-calling and alignment algorithms, and lead to incorrect interpretation of the abundance of species and genes in metagenomic studies. Duplicate detection should follow several principles. First, detection should be mapping-free, that is, independent of a reference genome. Second, to reduce false duplicates, detection should be done after low-quality reads have been trimmed. Third, high raw cluster density can generate artificial reads and result in a higher duplication ratio; such reads should be removed from the raw data as duplicates. Fourth, for data generated from paired-end or mate-paired libraries, both ends should be taken into consideration. For Illumina data, the program soldup analyses the duplication ratio under different duplicate-length settings in each sequencing library and reports the ratio of useful unique reads (Figure 3A) and the ratio of useful data gained when a new lane is added (Figure 3B). Moreover, the user can select a maximal length as the criterion for duplicate length and remove all duplicates, leaving either one selected copy or a number of copies set by a coverage value.
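A mapping-free check that considers both ends of a pair, evaluated at several duplicate-length settings, might look like the Python sketch below. It keys each pair on the first `dup_len` bases of both reads with exact matching; the function names and the `dup_len` and `keep` parameters are hypothetical, and soldup's actual comparison (including its mismatch handling) is not reproduced here.

```python
from collections import Counter


def duplication_ratio(pairs, dup_len):
    """Fraction of read pairs that are redundant when each pair is keyed by the
    first dup_len bases of both ends."""
    keys = Counter((r1[:dup_len], r2[:dup_len]) for r1, r2 in pairs)
    total = sum(keys.values())
    return (total - len(keys)) / total if total else 0.0


def remove_duplicates(pairs, dup_len, keep=1):
    """Keep at most `keep` pairs per key, in input order."""
    seen = Counter()
    kept = []
    for r1, r2 in pairs:
        key = (r1[:dup_len], r2[:dup_len])
        if seen[key] < keep:
            kept.append((r1, r2))
        seen[key] += 1
    return kept


pairs = [
    ("ACGTACGT", "TTGACCAA"),
    ("ACGTACGA", "TTGACCAT"),  # same first 6 bases at both ends as the pair above
    ("ACGTACGT", "GGGACCAA"),  # same forward read, different reverse read
    ("CCCCACGT", "TTGACCAA"),
]
ratios = {k: duplication_ratio(pairs, k) for k in (6, 8)}
unique_pairs = remove_duplicates(pairs, dup_len=6)
print(ratios)
print(len(unique_pairs), "pairs kept")
```

A shorter duplicate length flags more pairs as duplicates, while requiring both ends to agree prevents pairs that share only one end from being merged. Running the ratio over several lengths gives the kind of curve summarized in Figure 3.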

Figure 3

Duplicate analysis in Illumina paired-end library. A. Cumulative scaffolding usage ratio (also cumulative useful reads ratio) in Illumina mate-pair library (insert size: 5 kb) with different duplicate length sets. B. New lane usage ratio with different length sets.

For 454 data, the situation is much more complicated. The 454 duplicate reads include both artificial and natural duplicates, and can make up 4-44% of total reads in metagenomic samples. Separating artificial duplicates from natural duplicates is very difficult. Removing all duplicates will result in underestimation of sequencing coverage, since the majority of duplicates on 454 platforms are of low depth, with two or three copies (Figure 4). The program 454dup groups reads as duplicates by constructing consensus sequences. Within each duplicate group, reads are sorted by read length and quality, and poly(N) mismatches are treated as sequencing errors. In the tested dataset, about 5% of raw reads were grouped as duplicates, and most of the duplicates were artificial owing to the low sequencing coverage. Because the reads in the same group have the same sequence, the user can choose the criterion used to filter the duplicates in each group.
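One simple way to treat poly(N) length differences as sequencing errors is to compare reads after collapsing every homopolymer run to a single base. The Python sketch below groups reads on such a collapsed prefix, sorts each group by read length and mean quality, and tallies group depth as in Figure 4. This is a swapped-in simplification, not 454dup's consensus construction; the function names and the `key_len` parameter are assumptions.

```python
from collections import Counter, defaultdict
from itertools import groupby


def compress_homopolymers(seq):
    """Collapse each run of identical bases to one base, e.g. ACGGGT -> ACGT."""
    return "".join(base for base, _ in groupby(seq))


def group_454_duplicates(reads, key_len=20):
    """reads: iterable of (name, sequence, mean_quality).
    Reads sharing the first key_len bases of their homopolymer-compressed sequence
    are grouped; each group is sorted longest and highest quality first."""
    groups = defaultdict(list)
    for name, seq, mean_q in reads:
        key = compress_homopolymers(seq)[:key_len]
        groups[key].append((name, seq, mean_q))
    for members in groups.values():
        members.sort(key=lambda r: (len(r[1]), r[2]), reverse=True)
    return dict(groups)


def depth_histogram(groups):
    """Number of groups at each depth (reads per group)."""
    return Counter(len(members) for members in groups.values())


reads = [
    ("a", "ACGGGTTACA", 30.0),
    ("b", "ACGGTTACAGG", 35.0),  # one G fewer in the homopolymer, and a longer read
    ("c", "ACGGGTTACA", 25.0),
    ("d", "TTGCATGCAA", 30.0),
]
groups = group_454_duplicates(reads, key_len=6)
hist = depth_histogram(groups)
print({key: [r[0] for r in members] for key, members in groups.items()})
print(dict(hist))
```

Keeping the first member of each group retains the longest, highest-quality read, while a user-chosen rule (for example, keeping groups of depth two or three intact) would correspond to the filtering criterion mentioned above.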

Figure 4

Frequency of duplicate read depth in 454 transcriptome sequencing. The read depth is indicated by the number of reads in each duplicate group.

Additional preprocess tools for Illumina data

BIGpre also includes tools for preprocessing Illumina data, all organized into one program called solapt. The program solsize analyses the true library insert size after the reads have been mapped to a reference, and can be used to evaluate library quality. The program soljoin takes the sequencing direction of the insert library into account and is designed specifically to join paired-end reads into a single longer read when the library insert size is smaller than the total length of the paired-end reads. These artificially extended reads are very helpful for de novo genome and transcriptome assembly. The program solfiter filters and trims raw data into high-quality FASTQ reads by removing low-quality bases, while the program solin removes the internal adapter from a mate-paired read and produces two new paired-end reads.
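Joining overlapping paired-end reads and trimming low-quality bases can be sketched in Python as follows. The sketch assumes the reverse read is given as sequenced (so it is reverse-complemented before joining), searches for the longest overlap, and resolves disagreements in the overlap in favour of the forward read. The function names and the `min_overlap`, `max_mismatches` and `min_q` parameters are hypothetical and are not options of soljoin or solfiter.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string made of A, C, G, T and N."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_pair(forward, reverse, min_overlap=10, max_mismatches=0):
    """Join a forward read with the reverse complement of its mate.
    Tries the longest overlap first; returns None if no overlap qualifies."""
    mate = reverse_complement(reverse)
    longest = min(len(forward), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        tail, head = forward[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            return forward + mate[overlap:]
    return None


def trim_3prime(seq, quals, min_q=20):
    """Drop trailing bases whose quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


# A 22 bp fragment sequenced with two 15 bp reads: the reads overlap by 8 bp.
fragment = "ACGTTGCAAGGCTTACCGATGA"
forward = fragment[:15]
reverse = reverse_complement(fragment[-15:])
joined = join_pair(forward, reverse, min_overlap=5)
trimmed = trim_3prime("ACGTAC", [30, 30, 25, 20, 10, 5])
print(joined == fragment)
print(trimmed)
```

Trimming before joining would normally be preferable, since low-quality 3' ends are where the two reads overlap and mismatches there can prevent a valid join.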

Conclusion

The BIGpre package provides both tabular and graphical summaries of data quality for both Illumina and 454 platforms. Compared with other available tools, BIGpre is designed for data preprocessing and quality control, is easy to access and provides more information. It integrates several functions into one package to ensure that only high-quality reads are used for subsequent analysis: (i) it assesses read quality with rapid, simple and effective measures for the two platforms; (ii) it detects and analyses duplicate reads and duplication depth; and (iii) it preprocesses the sequencing data, for example by joining short paired-end reads into longer single reads, removing internal adapter sequences, and trimming raw data into high-quality sequences. The package produces standardized outputs within minutes, which facilitates comparison of read quality between machine runs, and provides library quality and complexity assessment in a very intuitive manner.

Authors’ contributions

TZ drafted the manuscript and developed the software. YL and KL participated in the software design and manuscript writing. LP and BZ participated in the initial design. JY and SH proposed the idea of the software and revised the manuscript. All authors have read and approved the final manuscript.

Competing interests

The authors have no competing interests to declare.

References (17 in total)

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. Pyrobayes: an improved base caller for SNP discovery in pyrosequences.

Authors: Aaron R Quinlan; Donald A Stewart; Michael P Strömberg; Gábor T Marth
Journal: Nat Methods Date: 2008-01-13 Impact factor: 28.547

Review 3. Next-generation sequencing transforms today's biology.

Authors: Stephan C Schuster
Journal: Nat Methods Date: 2007-12-19 Impact factor: 28.547

4. Bioinformatics for next generation sequencing.

Authors: Alex Bateman; John Quackenbush
Journal: Bioinformatics Date: 2009-02-08 Impact factor: 6.937

Review 5. Sequencing technologies - the next generation.

Authors: Michael L Metzker
Journal: Nat Rev Genet Date: 2009-12-08 Impact factor: 53.242

6. The sequence and de novo assembly of the giant panda genome.

Authors: Ruiqiang Li; Wei Fan; Geng Tian; Hongmei Zhu; Lin He; Jing Cai; Quanfei Huang; Qingle Cai; Bo Li; Yinqi Bai; Zhihe Zhang; Yaping Zhang; Wen Wang; Jun Li; Fuwen Wei; Heng Li; Min Jian; Jianwen Li; Zhaolei Zhang; Rasmus Nielsen; Dawei Li; Wanjun Gu; Zhentao Yang; Zhaoling Xuan; Oliver A Ryder; Frederick Chi-Ching Leung; Yan Zhou; Jianjun Cao; Xiao Sun; Yonggui Fu; Xiaodong Fang; Xiaosen Guo; Bo Wang; Rong Hou; Fujun Shen; Bo Mu; Peixiang Ni; Runmao Lin; Wubin Qian; Guodong Wang; Chang Yu; Wenhui Nie; Jinhuan Wang; Zhigang Wu; Huiqing Liang; Jiumeng Min; Qi Wu; Shifeng Cheng; Jue Ruan; Mingwei Wang; Zhongbin Shi; Ming Wen; Binghang Liu; Xiaoli Ren; Huisong Zheng; Dong Dong; Kathleen Cook; Gao Shan; Hao Zhang; Carolin Kosiol; Xueying Xie; Zuhong Lu; Hancheng Zheng; Yingrui Li; Cynthia C Steiner; Tommy Tsan-Yuk Lam; Siyuan Lin; Qinghui Zhang; Guoqing Li; Jing Tian; Timing Gong; Hongde Liu; Dejin Zhang; Lin Fang; Chen Ye; Juanbin Zhang; Wenbo Hu; Anlong Xu; Yuanyuan Ren; Guojie Zhang; Michael W Bruford; Qibin Li; Lijia Ma; Yiran Guo; Na An; Yujie Hu; Yang Zheng; Yongyong Shi; Zhiqiang Li; Qing Liu; Yanling Chen; Jing Zhao; Ning Qu; Shancen Zhao; Feng Tian; Xiaoling Wang; Haiyin Wang; Lizhi Xu; Xiao Liu; Tomas Vinar; Yajun Wang; Tak-Wah Lam; Siu-Ming Yiu; Shiping Liu; Hemin Zhang; Desheng Li; Yan Huang; Xia Wang; Guohua Yang; Zhi Jiang; Junyi Wang; Nan Qin; Li Li; Jingxiang Li; Lars Bolund; Karsten Kristiansen; Gane Ka-Shu Wong; Maynard Olson; Xiuqing Zhang; Songgang Li; Huanming Yang; Jian Wang; Jun Wang
Journal: Nature Date: 2009-12-13 Impact factor: 49.962

7. PIQA: pipeline for Illumina G1 genome analyzer data quality assessment.

Authors: A Martínez-Alcántara; E Ballesteros; C Feng; M Rojas; H Koshinsky; V Y Fofanov; P Havlak; Y Fofanov
Journal: Bioinformatics Date: 2009-07-14 Impact factor: 6.937

8. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

9. TileQC: a system for tile-based quality control of Solexa data.

Authors: Peter C Dolan; Dee R Denver
Journal: BMC Bioinformatics Date: 2008-05-28 Impact factor: 3.169

10. Probabilistic base calling of Solexa sequencing data.

Authors: Jacques Rougemont; Arnaud Amzallag; Christian Iseli; Laurent Farinelli; Ioannis Xenarios; Felix Naef
Journal: BMC Bioinformatics Date: 2008-10-13 Impact factor: 3.169
