Mohan A V S K Katta, Aamir W Khan, Dadakhalandar Doddamani, Mahendar Thudi, Rajeev K Varshney.
Abstract
The rapid popularity and adoption of next generation sequencing (NGS) approaches have generated huge volumes of data. High-throughput platforms like Illumina HiSeq produce terabytes of raw data that require quick processing. Quality control of the data is an important step prior to downstream analyses. To address these issues, we have developed a quality control pipeline, NGS-QCbox, that scales up to process hundreds or thousands of samples. Raspberry is an in-house tool, developed in C utilizing HTSlib (v1.2.1) (http://htslib.org), for computing read/base level statistics. It can be used as a stand-alone application and can process both compressed and uncompressed FASTQ files. NGS-QCbox integrates Raspberry with other open-source tools for alignment (Bowtie2), SNP calling (SAMtools) and other utilities (bedtools) to analyze raw NGS data efficiently and in a high-throughput manner. The pipeline implements batch processing of jobs in parallel using Bpipe (https://github.com/ssadedin/bpipe) and, internally, fine-grained task parallelization using OpenMP. It reports read and base statistics along with genome coverage and variants in a user-friendly format. The pipeline presents a simple menu-driven interface and can be used in either quick or complete mode. In addition, the pipeline in quick mode is faster than other similar existing QC pipelines/tools. The NGS-QCbox pipeline, the Raspberry tool and associated scripts are available at https://github.com/CEG-ICRISAT/NGS-QCbox and https://github.com/CEG-ICRISAT/Raspberry for rapid quality control analysis of large-scale next generation sequencing (Illumina) data.
Year: 2015 PMID: 26460497 PMCID: PMC4604202 DOI: 10.1371/journal.pone.0139868
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Flowchart of the NGS-QCbox pipeline illustrating its two modes of usage, quick and complete.
NGS-QCbox comprises two workflow modes, quick and complete. In quick mode, read/base level metrics are computed in parallel using Raspberry, an in-house tool, both before and after quality trimming. Complete mode is a full-fledged quality control and variant calling pipeline that integrates quick mode and additionally generates genome coverage information in parallel. The quality of the generated data can be assessed using this information.
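To make "read/base level metrics" concrete, here is a minimal Python sketch of the kind of statistics such a step might compute for one FASTQ file, compressed or not. It is an illustration only: Raspberry itself is written in C on top of HTSlib, and the function names, the chosen metrics, the Phred+33 offset and the Q20 threshold below are assumptions, not the tool's actual implementation or output.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a FASTQ file as text, handling gzip compression by file extension."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def fastq_stats(path, phred_offset=33, q_cutoff=20):
    """Return basic read/base level statistics for one FASTQ file.

    phred_offset and q_cutoff are hypothetical defaults chosen for this sketch.
    """
    reads = bases = bases_at_or_above_q = 0
    base_counts = Counter()
    with open_fastq(path) as handle:
        while True:
            header = handle.readline()
            if not header:  # end of file
                break
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().strip()
            reads += 1
            bases += len(seq)
            base_counts.update(seq.upper())
            bases_at_or_above_q += sum(
                1 for char in qual if ord(char) - phred_offset >= q_cutoff
            )
    gc = base_counts["G"] + base_counts["C"]
    return {
        "reads": reads,
        "bases": bases,
        "mean_read_length": bases / reads if reads else 0.0,
        "gc_percent": 100.0 * gc / bases if bases else 0.0,
        "percent_bases_ge_q": 100.0 * bases_at_or_above_q / bases if bases else 0.0,
    }
```

Running the same function on a file before and after trimming would give the before/after comparison that quick mode reports; processing many samples at once would additionally need a job scheduler or worker pool, which this sketch leaves out.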
Fig 2. The menu-driven interface of NGS-QCbox for quick and complete mode, respectively.
(a) The prompt and the parameters for quick QC mode: cut-off Phred score, minimum read length after trimming, data source and number of processors to be used. (b) Complete QC mode adds genome-related parameters to those of quick mode (Bowtie index, genome size, number of processors used by Bowtie).
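The two trimming parameters in the menu, cut-off Phred score and minimum read length after trimming, can be illustrated with a short Python sketch. The trimming tool and rule used by the pipeline are not described here, so this assumes a simple 3'-end rule: drop trailing bases whose quality is below the cut-off, then discard the read if it becomes too short. The function name and default values are hypothetical.

```python
def trim_read(seq, qual, cutoff=20, min_len=50, phred_offset=33):
    """Trim low-quality bases from the 3' end of one read.

    Returns (trimmed_seq, trimmed_qual), or None if the trimmed read is
    shorter than min_len. All defaults are assumptions for this sketch.
    """
    end = len(seq)
    # Walk back from the 3' end while base quality is below the cut-off.
    while end > 0 and ord(qual[end - 1]) - phred_offset < cutoff:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]
```

With Phred+33 encoding, `I` encodes quality 40 and `!` encodes quality 0, so `trim_read("ACGTAC", "IIII!!", cutoff=20, min_len=3)` keeps the first four bases, while the same read with `min_len=5` is discarded.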
A comparative account of the features of the NGS-QCbox pipeline and five similar pipelines/tools.
| Feature | NGS-QCbox | Prinseq-lite | NGS QC Toolkit | HTSeq | FastQC | FastX |
|---|---|---|---|---|---|---|
| Compressed FASTQ (input) | Y | N | N | N | N | N |
| Batch job processing | Y | N | N | N | Y/N | N |
| Genome coverage | Y | N | N | N | N | Y |
| Variations (SNP/INDEL) | Y | N | N | N | N | N |
| Menu interface | Y | N | N | N | Y | N |
| Feature richness | Y | Y | Y | N | Y | N |
| Task parallelization | Y | N | N | N | N | N |
The symbols Y and N denote Yes and No, respectively, indicating the presence or absence of the feature.
Parallel performance comparison of NGS-QCbox with other tools/pipelines.
Quick QC mode (run times in seconds)

| Samples | Processors | NGS-QCbox | Prinseq-lite | NGS QC Toolkit | HTSeq | FastQC | FastX |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 217 | 600 | 28,618 | 630 | 360 | 1,073 |
| 100 | 1 | 21,849 | 60,116 | 2,861,800* | 63,513 | 36,121 | 107,300* |
| 200 | 1 | 43,741 | 120,232* | 5,523,600* | 127,026* | 72,318 | 214,600* |
| 300 | 1 | 65,319 | 180,348* | 8,285,400* | 190,539* | 108,477* | 321,900* |
| 1 | 20 | 217 | NA | 10,020 | NA | NA | NA |
| 100 | 20 | 3,472 | NA | 954,221 | NA | NA | NA |
| 200 | 20 | 4,636 | NA | 1,908,442* | NA | NA | NA |
| 300 | 20 | 7,189 | NA | 2,862,663* | NA | NA | NA |

Complete QC mode (run times in seconds)

| Samples | Processors | NGS-QCbox | NGS-QCbox commands run sequentially |
|---|---|---|---|
| 1 | 1 | 2,800 | 2,838 |
| 100 | 20 | 31,723 | NA |
The tools were evaluated by comparing performance with 1 processor against 20 processors (parallel). To process one sample of size 4.38 Gb with one processor, NGS-QCbox takes 217 seconds. This is a speedup of 2.76X over Prinseq-lite, 132X over NGS QC Toolkit, 2.9X over HTSeq, 1.65X over FastQC and 4.9X over FastX. With one processor, the run time of all the programs scales linearly with the number of samples, so with 100, 200 and 300 samples the speedups are of the same order because the samples are processed serially.

When 100 samples are processed in parallel with 20 processors, the speedup obtained is 6.25X over the one-processor run. Speedups of 9.4X and 9X were observed for 200 and 300 samples, respectively. The run time per sample is thus reduced to 23–34 seconds with parallelization, a large gain over running the samples serially.

The “*” symbol indicates values extrapolated assuming linear run time; extrapolation is necessary because the actual run times in such cases would extend over days to months. NA indicates that the program does not support parallelization. For complete QC mode, we also executed the pipeline's commands in sequential order with one processor instead of in parallel mode. Running the NGS-QCbox steps sequentially cost an additional 38 seconds per sample. When complete QC mode was tested on 100 samples, parallel processing gave a speedup of 8.82X.
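The speedups quoted in this note can be rechecked from the table values with a few lines of Python. The sketch assumes that the serial baseline for N samples is the single-sample run time multiplied by N; this assumption reproduces the quoted figures to within rounding. Variable and function names are illustrative.

```python
# Single-sample, single-processor run times in seconds (quick QC mode).
single_sample = {
    "NGS-QCbox": 217,
    "Prinseq-lite": 600,
    "NGS QC Toolkit": 28618,
    "HTSeq": 630,
    "FastQC": 360,
    "FastX": 1073,
}

# NGS-QCbox run times in seconds with 20 processors, keyed by sample count.
quick_parallel = {100: 3472, 200: 4636, 300: 7189}

# Complete QC mode run times in seconds.
complete_single = 2800        # 1 sample, 1 processor, pipeline
complete_sequential = 2838    # 1 sample, 1 processor, commands run sequentially
complete_parallel_100 = 31723  # 100 samples, 20 processors


def speedup(baseline_seconds, improved_seconds):
    """Ratio of baseline run time to improved run time."""
    return baseline_seconds / improved_seconds


# Speedup of NGS-QCbox over each other tool on one sample.
tool_speedups = {
    tool: speedup(seconds, single_sample["NGS-QCbox"])
    for tool, seconds in single_sample.items()
    if tool != "NGS-QCbox"
}

# Speedup of the 20-processor run over N serial single-sample runs (assumed baseline).
parallel_speedups = {
    n: speedup(single_sample["NGS-QCbox"] * n, seconds)
    for n, seconds in quick_parallel.items()
}

# Effective run time per sample under parallel processing.
seconds_per_sample = {n: seconds / n for n, seconds in quick_parallel.items()}

# Complete QC mode: sequential overhead per sample and parallel speedup.
sequential_loss = complete_sequential - complete_single
complete_speedup = speedup(complete_single * 100, complete_parallel_100)

for tool, value in tool_speedups.items():
    print(f"{tool}: {value:.2f}X")
for n in sorted(parallel_speedups):
    print(f"{n} samples: {parallel_speedups[n]:.2f}X, {seconds_per_sample[n]:.1f} s/sample")
print(f"complete mode: {sequential_loss} s lost sequentially, {complete_speedup:.2f}X with 100 samples")
```

Small differences in the last digit relative to the quoted values, for example for FastQC and for complete QC mode, come from rounding versus truncation.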