
FQStat: a parallel architecture for very high-speed assessment of sequencing quality metrics.

Sree K Chanumolu, Mustafa Albahrani, Hasan H Otu.

Abstract

BACKGROUND: High-throughput DNA/RNA sequencing has revolutionized biological and clinical research. Owing mainly to reduced cost and advanced technologies, sequencing is widely used and generates very large amounts of data. Quickly assessing the quality of giga- to terabase-scale sequencing data has become a routine but important task. Identifying and eliminating low-quality sequence data is crucial for the reliability of downstream analysis results. There is a need for a high-speed tool that uses optimized parallel programming for batch processing and simply gauges the quality of sequencing data from multiple datasets, independent of any other processing steps.
RESULTS: FQStat is a stand-alone, platform-independent software tool that assesses the quality of FASTQ files using parallel programming. Based on the machine architecture and input data, FQStat automatically determines the number of cores and the amount of memory to allocate per file for optimum performance. Our results indicate that in a core-limited case, core assignment overhead exceeds the benefit of additional cores. In a core-unlimited case, performance reaches a saturation point as additional cores are assigned per file. We also show that memory allocation per file has a lower priority for performance than the allocation of cores. FQStat summarizes its output as an HTML web page, a tab-delimited text file, and high-resolution images. FQStat calculates and plots read count, read length, quality score, and high-quality base statistics. FQStat identifies and marks low-quality sequencing data to suggest removal from downstream analysis. We applied FQStat to real sequencing data to optimize performance and to demonstrate its capabilities. We also compared FQStat's performance to similar quality control (QC) tools that utilize parallel programming and attained improvements in run time.
CONCLUSIONS: FQStat is a user-friendly tool with a graphical interface that employs a parallel programming architecture and automatically optimizes its performance to generate quality control statistics for sequencing data. Unlike existing tools, FQStat calculates these statistics for multiple datasets and separately at the "lane," "sample," and "experiment" level to identify subsets of the samples with low quality, thereby preventing the loss of complete samples when reliable data can still be obtained.
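
ILLUSTRATION: The kind of per-file statistics described above (read count, read length, quality score, and high-quality base fraction) can be sketched in a few lines of Python. This is a minimal sketch and not FQStat's implementation. It assumes four-line FASTQ records and Phred+33 quality encoding. The threshold values and the names `fastq_stats` and `assess` are hypothetical, chosen only for illustration. Files are distributed over worker processes, one file per task, which is a much simpler scheme than the per-file core and memory allocation the abstract describes.

```python
import gzip
from concurrent.futures import ProcessPoolExecutor

PHRED_OFFSET = 33       # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding
HQ_THRESHOLD = 30       # assumption: a base with Q >= 30 counts as high quality
MIN_MEAN_QUALITY = 20   # assumption: files below this mean Q are flagged


def fastq_stats(path):
    """Compute simple QC statistics for one FASTQ file (plain or gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    reads = bases = qual_sum = hq_bases = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 != 3:  # the quality string is the 4th line of each record
                continue
            quals = [ord(c) - PHRED_OFFSET for c in line.rstrip("\r\n")]
            reads += 1
            bases += len(quals)
            qual_sum += sum(quals)
            hq_bases += sum(q >= HQ_THRESHOLD for q in quals)
    mean_quality = qual_sum / bases if bases else 0.0
    return {
        "file": str(path),
        "reads": reads,
        "mean_length": bases / reads if reads else 0.0,
        "mean_quality": mean_quality,
        "hq_fraction": hq_bases / bases if bases else 0.0,
        "low_quality": mean_quality < MIN_MEAN_QUALITY,
    }


def assess(paths, workers=None):
    """Process several FASTQ files in parallel, one file per worker task."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fastq_stats, paths))
```

Calling `assess(["lane1.fastq.gz", "lane2.fastq.gz"])` from a script guarded by `if __name__ == "__main__":` would return one dictionary per file; the file names here are placeholders. Grouping those dictionaries by lane, sample, or experiment would then give the multi-level view the abstract mentions.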

Keywords:  FASTQ; Parallel programming; Sequence quality

Year:  2019        PMID: 31416440      PMCID: PMC6694608          DOI: 10.1186/s12859-019-3015-y

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


