
fqtools: an efficient software suite for modern FASTQ file manipulation.

Alastair P Droop

Abstract

Many Next Generation Sequencing (NGS) analyses involve the basic manipulation of input sequence data before downstream processing (e.g. searching for specific sequences, format conversion or basic file statistics). The rapidly increasing data volumes involved in NGS make any dataset manipulation a time-consuming and error-prone process. I have developed fqtools, a fast and reliable FASTQ file manipulation suite that can process the full set of valid FASTQ files, including those with multi-line sequences, whilst identifying invalid files. The fqtools suite is faster than similar tools, and is designed for use in automatic processing pipelines.
AVAILABILITY AND IMPLEMENTATION: fqtools is open source and is available at https://github.com/alastair-droop/fqtools
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
CONTACT: a.p.droop@leeds.ac.uk
© The Author 2016. Published by Oxford University Press.

Year:  2016        PMID: 27153699      PMCID: PMC4908325          DOI: 10.1093/bioinformatics/btw088

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

The FASTQ format has become the de facto standard for storage of next-generation sequencing read data (Cock et al.). Based originally upon the FASTA sequence format (Pearson and Lipman, 1988), FASTQ stores nucleotide sequences and associated base qualities (Ewing and Green, 1998) for multiple named reads in a four-field human-readable ASCII format. Although there is no defined standard for FASTQ files, Cock et al. provide a good overview of the format, and provide as close to a ‘standard’ as is available.

Many analysis pipelines involve initial data manipulation (e.g. reformatting, viewing or overview statistics) before downstream processing (e.g. quality control, adapter removal and alignment). Seemingly simple tasks like viewing the first few reads in a file or checking the distribution of read lengths often require scripting, or loading the data into tools that are slow for large datasets. These file manipulations are much more frequent when data are being re-used in novel analyses. Frequently, individual researchers will write scripts (e.g. in Python, Perl or AWK) to perform these tasks.

Many tools are available for FASTQ processing, such as the fastx-toolkit, bioawk, fastq-tools, fast, seqmagick and seqtk (see the Supplementary Materials for the URLs of these tools). None of these provides a comprehensive set of the common manipulations required for most analyses. Most FASTQ processing tools fail to process reads with sequence data split across multiple lines. As read lengths from modern sequencing technologies are constantly increasing (Schneider and Dekker, 2012), this is likely to become problematic, as human readability is vastly reduced by extremely long lines.

Detecting invalid input is extremely important, as bioinformatics pipelines are often automated; significant computation and analysis time can be wasted if input errors are not detected early. For this reason, a trustworthy FASTQ manipulation tool should report invalid files. Similarly, tools should be able to correctly process the full range of valid inputs.

The fqtools suite was written to address this need for efficient and reliable viewing, manipulation and summarization of FASTQ data before it is pre-processed, e.g. with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) or cutadapt (Martin, 2011). Both compressed and plain FASTQ can be processed, as can SAM and BAM-formatted data. Paired-end sequence data are handled either as file pairs or in interleaved format (Table 1).
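As an illustration of accepting both plain and gzip-compressed input on the same stream, the following Python sketch inspects the first two bytes for the gzip magic number (0x1f 0x8b) and wraps the stream accordingly. It is not taken from fqtools; the function name open_fastq_stream is hypothetical, and the sketch assumes that the first peek returns at least two bytes, which may not hold for every pipe.

```python
import gzip
import io
from typing import BinaryIO, TextIO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def open_fastq_stream(raw: BinaryIO) -> TextIO:
    """Return a text stream over plain or gzip-compressed FASTQ bytes.

    Hypothetical helper for illustration; not part of fqtools.
    """
    buffered = io.BufferedReader(raw)
    # peek() looks ahead without consuming, so the data can still be read.
    head = buffered.peek(2)[:2]
    if head == GZIP_MAGIC:
        unzipped = gzip.GzipFile(fileobj=buffered, mode="rb")
        return io.TextIOWrapper(unzipped, encoding="ascii")
    return io.TextIOWrapper(buffered, encoding="ascii")
```

Because the helper only needs a binary stream, the same call could in principle be given a file opened in "rb" mode or sys.stdin.buffer.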
Table 1. Commands present in the fqtools suite

Command     Description
view        View FASTQ files
head        View the first reads in FASTQ files
count       Count FASTQ file reads
header      View FASTQ file header data
sequence    View FASTQ file sequence data
quality     View FASTQ file quality data
header2     View FASTQ file secondary header data
fasta       Convert FASTQ files to FASTA format
basetab     Tabulate FASTQ base frequencies
qualtab     Tabulate FASTQ quality character frequencies
lengthtab   Tabulate FASTQ read lengths
type        Attempt to guess the FASTQ quality encoding type
validate    Validate FASTQ files
find        Find FASTQ reads containing specific sequences
trim        Trim reads in a FASTQ file
qualmap     Translate quality values using a mapping file

The supplementary information contains a full description of each command.
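To make the tabulation and encoding-guess entries more concrete, the Python sketch below shows one way such summaries could be computed. It is an illustration only, not the fqtools implementation: the function names are hypothetical, and the encoding guess is a common heuristic based on the lowest quality character seen, using the ASCII offsets described by Cock et al. (33 for Sanger, 59 as the lowest Solexa character, 64 for Illumina 1.3+). High-quality Sanger data cannot be told apart from the other variants this way, which is why the result is only a guess.

```python
from collections import Counter
from typing import Iterable, Optional


def tabulate_lengths(sequences: Iterable[str]) -> Counter:
    """Count how many reads have each length."""
    return Counter(len(seq) for seq in sequences)


def tabulate_symbols(lines: Iterable[str]) -> Counter:
    """Count every character across the lines (bases or quality characters)."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line)
    return counts


def guess_quality_encoding(qualities: Iterable[str]) -> Optional[str]:
    """Guess the FASTQ variant from the lowest quality character (heuristic)."""
    lowest = min((ord(c) for q in qualities for c in q), default=None)
    if lowest is None:
        return None  # no quality characters seen, nothing to guess from
    if lowest < 59:
        return "fastq-sanger"    # characters below ';' only occur with offset 33
    if lowest < 64:
        return "fastq-solexa"    # ';' to '?' only occur in Solexa-scaled data
    return "fastq-illumina"      # consistent with offset 64, but not proven
```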


2 Implementation

I have developed a fast and memory-efficient state machine for parsing FASTQ files. The use of a state machine (as opposed to a line-based approach) obviates the difficulties with line breaks in sequence and quality data. As read order must be identical in both files of a paired-end dataset, manipulations that re-order reads process both files of the pair simultaneously. The fqtools suite has been written to allow input and output from either files or standard streams. Both files and streams can contain either plain or gzip-compressed data. By using streams, the fqtools suite can be easily incorporated into computational pipelines. The commands contained in the fqtools suite are listed in Table 1, along with a brief description of their purpose.
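The idea of a character-driven state machine can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the parser used in fqtools: the state names, the Read tuple and parse_fastq are all hypothetical, and only a few error conditions are checked. The key point is that the quality block ends when it reaches the length of the sequence, so line breaks inside either block, and quality lines beginning with '@' or '+', do not confuse the parser.

```python
from enum import Enum, auto
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    header: str
    sequence: str
    header2: str
    quality: str


class State(Enum):
    START = auto()     # expecting '@' to open a record
    HEADER = auto()    # inside the '@' header line
    SEQUENCE = auto()  # inside (possibly multi-line) sequence data
    HEADER2 = auto()   # inside the '+' secondary header line
    QUALITY = auto()   # inside (possibly multi-line) quality data


def parse_fastq(chars: Iterable[str]) -> Iterator[Read]:
    """Yield reads from a stream of characters; raise ValueError if invalid."""
    state, prev = State.START, "\n"
    header, seq, header2, qual = [], [], [], []
    for ch in chars:
        if ch == "\r":
            continue  # ignore carriage returns from CRLF line endings
        line_start, prev = prev == "\n", ch
        if state is State.START:
            if ch == "\n":
                continue  # tolerate blank lines between records
            if ch != "@":
                raise ValueError("record does not start with '@'")
            state = State.HEADER
        elif state is State.HEADER:
            if ch == "\n":
                state = State.SEQUENCE
            else:
                header.append(ch)
        elif state is State.SEQUENCE:
            if ch == "\n":
                continue  # a line break inside the sequence is allowed
            if ch == "+":
                if not line_start:
                    raise ValueError("'+' found inside sequence data")
                state = State.HEADER2
            else:
                seq.append(ch)
        elif state is State.HEADER2:
            if ch == "\n":
                state = State.QUALITY
            else:
                header2.append(ch)
        else:  # State.QUALITY
            if ch != "\n":
                qual.append(ch)
                if len(qual) > len(seq):
                    raise ValueError("quality longer than sequence")
            elif len(qual) == len(seq):
                # Quality is complete: emit the read and reset for the next one.
                yield Read("".join(header), "".join(seq),
                           "".join(header2), "".join(qual))
                header, seq, header2, qual = [], [], [], []
                state = State.START
    if state is State.QUALITY and len(qual) == len(seq):
        # Final record without a trailing newline.
        yield Read("".join(header), "".join(seq),
                   "".join(header2), "".join(qual))
    elif state is not State.START:
        raise ValueError("truncated FASTQ record")
```

A string can be passed directly, since iterating it yields characters; for an open text file, itertools.chain.from_iterable(handle) gives the same character stream without loading the file into memory.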
Table 2. FASTQ processing tools overview

Tool           Valid  Invalid  Process .gz  Plain (reads/s)  Compressed (reads/s)
fqtools        Y      Y        R+W          701 375          444 648
bash           -      -        R+W          2 605 421        934 331
bioawk         Y      N        R            434 632          312 708
seqtk          Y      N        R            1 122 355        545 865
fast           Y      Y        -            2984             -
fastx-toolkit  N      N        -            69 762           -
seqmagick      Y      Y        R+W          25 325           4000

Benchmark data for various FASTQ processing tools. All tools were installed locally, and run against the complete test set (Cock et al.). Valid shows whether all the valid files in the test set were processed correctly. Invalid shows whether the tool identified all the invalid files. Process .gz shows whether the tool can natively read (R) and write (W) gzip-compressed files. The speed columns show the speed in reads per second.


3 Performance

I tested several common sequence manipulation tools against four criteria: (i) the ability to process the full range of valid FASTQ files; (ii) the ability to detect the full range of FASTQ errors; (iii) the ability to read and write compressed data; and (iv) the processing speed. To evaluate the ability of the fqtools suite to correctly process valid files and to reject invalid ones, I used the test set provided by Cock et al. The performance of the fqtools suite was tested against several similar tools using a sample file containing 100 000 reads generated using ART (Huang et al.). Table 2 shows these results. For all tools, the closest option to parsing the file without further processing was used. The lowest time over 50 repeats was taken. Speed results for printing the files on the bash terminal are supplied as a ‘maximum speed’ reference, although these commands make no attempt to parse the FASTQ data within the file. Full data are available in the supplementary information.

Of the tools tested, only three (fqtools, fast and seqmagick) correctly processed the full test set. Of these three, fqtools is by far the fastest when processing both uncompressed and compressed files. Although several of these tools are designed to be extendable, data processing speed for basic FASTQ manipulation is becoming increasingly important. For these tasks, the flexibility of interpreted tools (such as seqmagick and fast) is unnecessary.
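The timing scheme described above (take the lowest time over a fixed number of repeats, then convert it to reads per second) can be sketched in Python as below. This is an illustration rather than the benchmark harness used for Table 2; the function name and its parameters are hypothetical, and the clock argument exists only so that the calculation can be checked with a deterministic clock.

```python
import time
from typing import Callable


def best_reads_per_second(
    task: Callable[[], object],
    n_reads: int,
    repeats: int = 50,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run task repeatedly and report n_reads divided by the fastest run."""
    best = float("inf")
    for _ in range(repeats):
        start = clock()
        task()  # e.g. parse the whole sample file once
        best = min(best, clock() - start)
    if best <= 0:
        raise ValueError("elapsed time too small to measure")
    return n_reads / best
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity, since interference from other processes can only make a run slower.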

4 Summary

Here, I describe fqtools, a suite of FASTQ manipulation tools that efficiently handles the full range of valid FASTQ files, whilst detecting invalid files. The suite can process both compressed and uncompressed files. fqtools is freely available on GitHub at https://github.com/alastair-droop/fqtools.

References

1.  Grégory F Schneider; Cees Dekker. DNA sequencing with nanopores. Nat Biotechnol. 2012-04-10.

2.  B Ewing; P Green. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998-03.

3.  W R Pearson; D J Lipman. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988-04.

4.  Weichun Huang; Leping Li; Jason R Myers; Gabor T Marth. ART: a next-generation sequencing read simulator. Bioinformatics. 2011-12-23.

5.  Peter J A Cock; Christopher J Fields; Naohisa Goto; Michael L Heuer; Peter M Rice. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2009-12-16.