The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.

Peter J A Cock, Christopher J Fields, Naohisa Goto, Michael L Heuer, Peter M Rice.

Abstract

FASTQ has emerged as a common file format for sharing sequencing read data combining both the sequence and an associated per base quality score, despite lacking any formal definition to date, and existing in at least three incompatible variants. This article defines the FASTQ format, covering the original Sanger standard, the Solexa/Illumina variants and conversion between them, based on publicly available information such as the MAQ documentation and conventions recently agreed by the Open Bioinformatics Foundation projects Biopython, BioPerl, BioRuby, BioJava and EMBOSS. Being an open access publication, it is hoped that this description, with the example files provided as Supplementary Data, will serve in future as a reference for this important file format.

Year: 2009 PMID: 20015970 PMCID: PMC2847217 DOI: 10.1093/nar/gkp1137

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

One of the core issues of Bioinformatics is dealing with a profusion of (often poorly defined or ambiguous) file formats. Some ad hoc simple human readable formats have over time attained the status of de facto standards. A ubiquitous example of this is the ‘FASTA sequence file format’, originally invented by Bill Pearson as an input format for his FASTA suite of tools (1). Over time, this format has evolved by consensus; however, in the absence of an explicit standard some parsers will fail to cope with very long ‘>’ title lines or very long sequences without line wrapping. There is also no standardization for record identifiers. In the area of DNA sequencing, the FASTQ file format has emerged as another de facto common format for data exchange between tools. It provides a simple extension to the FASTA format: the ability to store a numeric quality score associated with each nucleotide in a sequence. This is a very minimal representation of a sequencing read—nothing about the relative levels of the four nucleotides is captured [e.g. from Sanger capillary sequencing or Solexa/Illumina sequencing (2)] nor did this in any way attempt to deal with flow or colour space data [e.g. Roche 454 (3) or ABI SOLiD (4)]. No doubt because of its simplicity, the FASTQ format has become widely used as a simple interchange file format. Unfortunately, history has repeated itself, and the FASTQ format suffers from the absence of a clear definition (which we hope this manuscript will address), and several incompatible variants. We discuss the history of the FASTQ format, describing key variants, and conventions adopted by the Open Bioinformatics Foundation (OBF, http://www.open-bio.org) projects Biopython (5), BioPerl (6), BioRuby (http://www.bioruby.org), BioJava (7), and EMBOSS (8) (each represented here by an author) for reading, writing and converting between them. 
This is intended to provide a public, open access and citable definition of this community consensus of the FASTQ format specification.

PHRED SCORES AND THE QUAL FORMAT

The PHRED software reads DNA sequencing trace files, calls bases and assigns a quality value to each base called (9,10). This introduced the PHRED quality score of a base call, defined in terms of the estimated probability of error $P_e$:

$$Q_{\mathrm{PHRED}} = -10 \log_{10}(P_e) \qquad (1)$$

PHRED also introduced a new file format, known as the QUAL format after its default file extension, to hold these quality scores. QUAL files are FASTA-like, holding PHRED scores as space-separated plain text integers, and supplement a corresponding FASTA file containing the associated sequences. For example, here is a single read from the NCBI sequence read archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgih) presented as a FASTA entry: and as a QUAL entry holding the PHRED scores:

PHRED scores are now a de facto standard for representing sequencing read base qualities. For example, the Roche 454 ‘off instrument’ applications allow conversion from a binary Standard Flowgram Format (SFF) file to FASTA and QUAL files. PHRED scores are also used in SAM (Sequence Alignment/Map, http://samtools.sourceforge.net/), Staden Experiment (11), ACE (12), and FASTQ files.
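As an illustration, the PHRED definition Q = −10 log10(P_e) and the space-separated layout of a QUAL score line can be sketched in a few lines of Python. This is a minimal sketch, not code from PHRED itself; the function names are hypothetical and chosen only for this example.

```python
import math
from typing import List


def phred_from_error_probability(p_error: float) -> float:
    """PHRED quality for an estimated error probability: Q = -10 * log10(P_e)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability_from_phred(q_phred: float) -> float:
    """Inverse of the definition above: P_e = 10 ** (-Q / 10)."""
    return 10.0 ** (-q_phred / 10.0)


def parse_qual_scores(line: str) -> List[int]:
    """Split one QUAL score line (space-separated integers) into PHRED scores."""
    return [int(token) for token in line.split()]
```

Under this definition a PHRED score of 20 corresponds to an estimated error probability of 1 in 100, and a score of 30 to 1 in 1000.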

SANGER FASTQ FORMAT

The FASTQ format was invented at the turn of the century at the Wellcome Trust Sanger Institute by Jim Mullikin, gradually disseminated, but never formally documented (Antony V. Cox, Sanger Institute, personal communication 2009). The closest thing to an official description from Sanger can be found on the MAQ/BWA website (13,14), but even this is incomplete. Full details of the file format, describing the read title, sequence and quality scores are given later. Here, we concentrate on how the quality scores were encoded into a simple string. Early FASTQ files were used for Sanger capillary sequencing, and it was natural to use PHRED quality scores (described above). Storing PHRED scores as single characters (or bytes) gave a simple but reasonably space efficient encoding. In order that the file be human readable and easily edited, this restricted the choices to the ASCII printable characters 32–126 (decimal), and since ASCII 32 is the space character, Sanger FASTQ files use ASCII 33–126 to encode PHRED qualities from 0 to 93 (i.e. PHRED scores with an ASCII offset of 33). This gives a very broad range of error probabilities, from 1.0 (a wrong base) through to $10^{-9.3}$ (an extremely accurate read) and so the Sanger FASTQ format is useful both for raw sequencing reads and post-processed assemblies where higher qualities occur. The OBF projects refer to this, the original or standard FASTQ format, as the Sanger variant, using the format name ‘fastq-sanger’ (Table 1).
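The offset-33 encoding described here amounts to adding 33 to each PHRED score and taking the ASCII character with that code. The following Python sketch shows one way to encode and decode a Sanger quality string; the constant and function names are illustrative assumptions, not part of any OBF library.

```python
SANGER_OFFSET = 33      # ASCII offset of the Sanger variant
SANGER_MAX_PHRED = 93   # ASCII 126 minus the offset


def encode_sanger(phred_scores):
    """Turn a list of integer PHRED scores (0 to 93) into a quality string."""
    chars = []
    for q in phred_scores:
        if not 0 <= q <= SANGER_MAX_PHRED:
            raise ValueError("PHRED score %r is outside 0 to 93" % q)
        chars.append(chr(q + SANGER_OFFSET))
    return "".join(chars)


def decode_sanger(quality_string):
    """Turn a Sanger quality string back into a list of integer PHRED scores."""
    scores = []
    for ch in quality_string:
        q = ord(ch) - SANGER_OFFSET
        if not 0 <= q <= SANGER_MAX_PHRED:
            raise ValueError("character %r is outside ASCII 33 to 126" % ch)
        scores.append(q)
    return scores
```

For instance, PHRED 0 is stored as ‘!’ (ASCII 33), PHRED 40 as ‘I’ (ASCII 73) and PHRED 93 as ‘~’ (ASCII 126).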

Table 1.

The three described FASTQ variants, with columns giving the description, format name used in OBF projects, range of ASCII characters permitted in the quality string (in decimal notation), ASCII encoding offset, type of quality score encoded and the possible range of scores

Description	OBF name	ASCII range	ASCII offset	Quality score type	Quality score range
Sanger standard	fastq-sanger	33–126	33	PHRED	0 to 93
Solexa/early Illumina	fastq-solexa	59–126	64	Solexa	−5 to 62
Illumina 1.3+	fastq-illumina	64–126	64	PHRED	0 to 62
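The ranges in Table 1 can be written down as a small lookup structure. The Python sketch below does so and adds a hypothetical helper, consistent_variants, that reports which variants a given quality string could legally belong to. It can only rule variants out: a string of high-quality characters is valid in all three, so this is not a way to identify the variant.

```python
from typing import Dict, List, NamedTuple


class FastqVariant(NamedTuple):
    min_ascii: int    # lowest permitted character code
    max_ascii: int    # highest permitted character code
    offset: int       # ASCII encoding offset
    score_type: str   # "PHRED" or "Solexa"


# Values taken from Table 1; the dictionary keys are the OBF format names.
VARIANTS: Dict[str, FastqVariant] = {
    "fastq-sanger": FastqVariant(33, 126, 33, "PHRED"),
    "fastq-solexa": FastqVariant(59, 126, 64, "Solexa"),
    "fastq-illumina": FastqVariant(64, 126, 64, "PHRED"),
}


def consistent_variants(quality_string: str) -> List[str]:
    """Names of the variants whose permitted ASCII range covers every character."""
    codes = [ord(ch) for ch in quality_string]
    return [
        name
        for name, variant in VARIANTS.items()
        if all(variant.min_ascii <= code <= variant.max_ascii for code in codes)
    ]
```

A string containing ‘!’ (ASCII 33) can only be ‘fastq-sanger’, whereas a string such as ‘hhh’ is consistent with all three variants.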

SOLEXA FASTQ FORMAT

In 2004, Solexa, Inc. introduced their own incompatible (and indistinguishable) version of the FASTQ format (2). Although the FASTQ format only records a single quality score per letter, Solexa also produced other files with quality scores for all four bases, and in order to represent low-quality information more fully an alternative logarithmic mapping was used (15). Solexa quality scores are defined as:

$$Q_{\mathrm{Solexa}} = -10 \log_{10}\left(\frac{P_e}{1 - P_e}\right) \qquad (2)$$

Although different sequencing systems estimate their error rates using different methodologies, simply rearranging these two equations and equating the error estimates allows a straightforward mapping between the two:

$$Q_{\mathrm{PHRED}} = 10 \log_{10}\left(10^{Q_{\mathrm{Solexa}}/10} + 1\right) \qquad (3)$$

$$Q_{\mathrm{Solexa}} = 10 \log_{10}\left(10^{Q_{\mathrm{PHRED}}/10} - 1\right) \qquad (4)$$

This conversion has gained widespread usage through MAQ (13). An important consequence of these equations is that for high values the two scores are asymptotically equal, and after rounding to the nearest integer scores of ≥10 are interchangeable (Figure 1). However, Solexa scores go down to −5 (approximating a random read error probability of 0.75). The Sanger offset of 33 can, therefore, no longer be used. Rather, an offset of 64 was chosen, meaning ASCII 59 to 126 can be used, allowing Solexa scores from −5 to 62 inclusive.
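The Solexa score, −10 log10(P_e / (1 − P_e)), and the mapping obtained by equating its error estimate with that of the PHRED score can be sketched in Python as follows. This is a minimal sketch with hypothetical function names; it returns unrounded values and leaves rounding and range limits to the caller.

```python
import math


def solexa_from_error_probability(p_error: float) -> float:
    """Solexa quality: Q = -10 * log10(P_e / (1 - P_e))."""
    if not 0.0 < p_error < 1.0:
        raise ValueError("error probability must be strictly between 0 and 1")
    return -10.0 * math.log10(p_error / (1.0 - p_error))


def phred_from_solexa(q_solexa: float) -> float:
    """Solexa to PHRED: Q_PHRED = 10 * log10(10 ** (Q_Solexa / 10) + 1)."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def solexa_from_phred(q_phred: float) -> float:
    """PHRED to Solexa: Q_Solexa = 10 * log10(10 ** (Q_PHRED / 10) - 1)."""
    if q_phred <= 0:
        # The logarithm is undefined here (PHRED 0 means P_e = 1).
        raise ValueError("mapping is only defined for PHRED scores above 0")
    return 10.0 * math.log10(10.0 ** (q_phred / 10.0) - 1.0)
```

With these functions an error probability of 0.75 gives a Solexa score of about −4.8, and at a score of 40 the two scales agree to within a fraction of a unit.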

Figure 1.

Visual representation of the mapping between PHRED and Solexa quality scores. Vertical layout represents the probability of error on a log scale, therefore the PHRED points are equally spaced (black circles on left), while the Solexa points are not (white circles on right). Solid black lines are reciprocal mappings between scores, and grey arrows are lossy mappings. Near the top of the figure, the black lines are almost horizontal because the two scores are almost equal. The straightforward mapping of higher scores is omitted due to space.

ILLUMINA 1.3+ FASTQ FORMAT

Although Illumina initially continued to use the Solexa FASTQ variant, from Genome Analyzer Pipeline version 1.3 onwards (16), PHRED quality scores rather than Solexa scores were used. However, rather than adopt the original Sanger format, Illumina introduced a third incompatible FASTQ variant designed to be interchangeable with their earlier ‘Solexa FASTQ’ files for good quality reads. The Illumina 1.3+ FASTQ variant encodes PHRED scores with an ASCII offset of 64, and so can hold PHRED scores from 0 to 62 (ASCII 64–126), although currently raw Illumina data quality scores are only expected in the range 0–40. The OBF projects refer to this variant as the Illumina 1.3+ FASTQ format, under the format name ‘fastq-illumina’ (Table 1).

ABI SOLID COLOUR SPACE FASTQ

ABI SOLiD sequencing works in colour space not sequence space (4), leading ABI to introduce Color Space FASTA (CSFASTA) files with matching QUAL files, and also Color Space FASTQ (CSFASTQ) files. These use the digits 0–3 to encode the colour calls (base transitions), but are not considered herein where we focus solely on sequence space FASTQ files.

FASTQ DEFINITION

Here is a Sanger FASTQ read from the NCBI SRA (shown earlier in the FASTA and QUAL formats):

There are four line types in the FASTQ format. First, a ‘@’ title line which often holds just a record identifier. This is a free format field with no length limit, allowing arbitrary annotation or comments to be included, as in the example above where the NCBI have included an alternative ID and the sequence length. Some sequencing centers encode paired end read information here (alternatively two matched FASTQ files are often used).

Second comes the sequence line(s), which as in the FASTA format can be line wrapped. Also like FASTA format, there is no explicit limitation on the characters expected, but restriction to the IUPAC single letter codes for (ambiguous) DNA or RNA is wise, and upper case is conventional. In some contexts, the use of lower or mixed case or the inclusion of a gap character may make sense. White space such as tabs or spaces is not permitted.

Third, to signal the end of the sequence lines and the start of the quality string, comes the ‘+’ line. Originally this also included a full repeat of the title line text (as shown in the NCBI example above); however, by common usage and the MAQ tool convention, this is optional and the ‘+’ line can contain just this one character, reducing the file size significantly. The OBF tools follow this MAQ convention on output, and omit the optional repeated title text.

Finally come the quality line(s), which again can be wrapped. As discussed above, these use a subset of the ASCII printable characters (at most ASCII 33–126 inclusive) with a simple offset mapping. Crucially, after concatenation (removing line breaks), the quality string must be equal in length to the sequence string.

It is vital to note that the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string, including at the start of any of the quality lines. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence. Because of this complication, most tools output FASTQ files without line wrapping of the sequence and quality string. This means each read consists of exactly four lines (sometimes very long lines), ideal for a very simple parser to deal with. The OBF tools follow this convention on output, as does the MAQ conversion script. We recommend this for maximum compatibility with (simplistic) parsers.

Because FASTQ files (like FASTA files) are plain text, the new line characters will normally follow the operating system convention. However, as data are shared between machines, any parser should cope with both Unix style new lines (line feed only, ASCII 10) and DOS/Windows style (carriage return and line feed, ASCII 13 then 10).
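The parsing rules above can be sketched as a small Python generator. This is an illustrative sketch rather than the implementation used by any OBF project; the name parse_fastq is hypothetical, and skipping blank lines between records is an assumption of this example. The key point is that the end of the quality string is decided by its length, never by a leading ‘@’.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (title, sequence, quality) tuples from FASTQ text lines."""
    it = iter(lines)
    for raw in it:
        line = raw.rstrip("\r\n")  # copes with Unix and DOS/Windows new lines
        if not line:
            continue  # assumption: blank lines between records are skipped
        if not line.startswith("@"):
            raise ValueError("expected '@' title line, got %r" % line)
        title = line[1:]

        # Sequence line(s) run until the '+' line.
        seq_parts = []
        plus_line = None
        for raw in it:
            line = raw.rstrip("\r\n")
            if line.startswith("+"):
                plus_line = line
                break
            seq_parts.append(line)
        if plus_line is None:
            raise ValueError("truncated record %r: no '+' line" % title)
        sequence = "".join(seq_parts)
        if any(ch.isspace() for ch in sequence):
            raise ValueError("white space in sequence of %r" % title)
        # The '+' line may be bare or may repeat the title text.
        if plus_line[1:] and plus_line[1:] != title:
            raise ValueError("'+' line does not repeat title of %r" % title)

        # Quality line(s): keep reading until the quality string is as long
        # as the sequence. A leading '@' here is data, not a new record.
        qual_parts = []
        qual_len = 0
        while qual_len < len(sequence):
            raw = next(it, None)
            if raw is None:
                raise ValueError("truncated quality for %r" % title)
            part = raw.rstrip("\r\n")
            qual_parts.append(part)
            qual_len += len(part)
        quality = "".join(qual_parts)
        if len(quality) != len(sequence):
            raise ValueError("sequence and quality lengths differ in %r" % title)
        if any(not 33 <= ord(ch) <= 126 for ch in quality):
            raise ValueError("invalid quality character in %r" % title)
        yield title, sequence, quality
```

Passing an open file handle, for example list(parse_fastq(open(path))) with a path of your own, yields one tuple per read, whether or not the sequence and quality lines are wrapped.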

CONVERTING FASTQ FILES

Conversion from ‘fastq-illumina’ to ‘fastq-sanger’ will be a common operation, and is very straightforward since both variants use PHRED scores but with different offsets. All that is required is to decrease the quality character codes by 31 (the difference between the offsets of 64 and 33). The opposite conversion is unlikely to be required, but in this situation the ‘fastq-illumina’ format can only hold PHRED scores from 0–62, compared with 0–93 in ‘fastq-sanger’. The OBF projects will all apply 62 as a maximum PHRED score (giving ASCII 126) with a warning message for values outside of this range. Conversion from ‘fastq-solexa’ to ‘fastq-sanger’ (or ‘fastq-illumina’) requires conversion of Solexa scores to PHRED scores using Equation (3) and rounding to the nearest integer. This mapping is lossy for poor quality reads, for example Solexa scores 9 and 10 both give PHRED score 10 (Figure 1). The reverse conversion uses Equation (4) instead. Taken literally, this maps PHRED 0 to Solexa −∞, but the minimum Solexa score is taken, −5 (corresponding to a random base call). Thus, both PHRED 0 and 1 map to Solexa −5 (Figure 1). A maximum limit of Solexa score 62 applies (giving ASCII 126). Biopython (version 1.51 or later), BioPerl (version 1.6.1 or later), BioRuby (version 1.4.0 or later), BioJava (version 1.7.1 or later) and the seqret tool from EMBOSS (version 6.1.0 patch 1 or later) are all able to inter-convert between any of the three FASTQ variants (Table 1).
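The conversions described in this section can be sketched directly on quality strings. The Python below is a minimal illustration under stated assumptions: the function names are hypothetical, the mappings between Solexa and PHRED scores are written out as 10 log10(10^(Q/10) ± 1), and rounding uses Python's built-in round. It is not the code of any OBF project.

```python
import math
import warnings


def illumina_to_sanger(quality: str) -> str:
    """PHRED offset 64 to PHRED offset 33: lower each character code by 31."""
    for ch in quality:
        if not 64 <= ord(ch) <= 126:
            raise ValueError("character %r is outside ASCII 64 to 126" % ch)
    return "".join(chr(ord(ch) - 31) for ch in quality)


def sanger_to_illumina(quality: str) -> str:
    """PHRED offset 33 to PHRED offset 64, capping scores at 62 with a warning."""
    out = []
    for ch in quality:
        q = ord(ch) - 33
        if not 0 <= q <= 93:
            raise ValueError("character %r is outside ASCII 33 to 126" % ch)
        if q > 62:
            warnings.warn("PHRED score %d capped at 62" % q)
            q = 62
        out.append(chr(q + 64))
    return "".join(out)


def solexa_to_sanger(quality: str) -> str:
    """Solexa offset 64 to PHRED offset 33, rounding to the nearest integer."""
    out = []
    for ch in quality:
        q_solexa = ord(ch) - 64
        if not -5 <= q_solexa <= 62:
            raise ValueError("character %r is outside ASCII 59 to 126" % ch)
        q_phred = round(10 * math.log10(10 ** (q_solexa / 10) + 1))
        out.append(chr(q_phred + 33))
    return "".join(out)


def sanger_to_solexa(quality: str) -> str:
    """PHRED offset 33 to Solexa offset 64, limited to the range -5 to 62."""
    out = []
    for ch in quality:
        q_phred = ord(ch) - 33
        if not 0 <= q_phred <= 93:
            raise ValueError("character %r is outside ASCII 33 to 126" % ch)
        if q_phred == 0:
            q_solexa = -5  # the formula diverges here; use the minimum score
        else:
            q_solexa = round(10 * math.log10(10 ** (q_phred / 10) - 1))
            q_solexa = max(-5, min(62, q_solexa))
        out.append(chr(q_solexa + 64))
    return "".join(out)
```

As a check against the text, the Solexa characters for scores 9 and 10 (‘I’ and ‘J’) both convert to the Sanger character for PHRED 10 (‘+’), and the Sanger characters for PHRED 0 and 1 both convert to ‘;’ (Solexa −5).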

TEST CASES

Two classes of example files are provided as Supplementary Data. First, a number of invalid files which any parser should reject, including truncated reads, examples where the sequence and quality lengths differ, and invalid ASCII characters in the quality lines. Secondly, a set of valid but challenging FASTQ files together with a standardized version of the same data, plus how that file should be converted to other FASTQ variants. These examples are used in the OBF unit tests.

DISCUSSION

The original Sanger FASTQ format was by no means perfect. The ‘@’ and ‘+’ characters have dual usage as line markers or anywhere within the quality string. Simple indexing of the file looking for lines starting with ‘@’ is therefore not possible.

The lack of ownership of this emerging standard by the Sanger Institute contributed greatly to later confusion, which can mostly be attributed to Solexa/Illumina, who not once but twice have invented their own ‘FASTQ’ format. With hindsight, we may ask why Solexa used their own scoring system for FASTQ output, given Illumina have since switched to the PHRED convention. Furthermore, as part of this switch for GAPipeline 1.3, Illumina could have adopted the original Sanger format. This would have still caused disruption in the short term, but would have unified the FASTQ format. While the Illumina 1.3+ FASTQ variant is interchangeable with the earlier Solexa FASTQ format for good quality reads, as a result of these choices, we now have three incompatible FASTQ variants that cannot be reliably distinguished. A simple measure such as the inclusion of header lines like ‘#Solexa FASTQ 1.0’ or ‘#Illumina FASTQ 1.3’ would have imposed a trivial overhead on the file size and allowed automatic determination of the file format and thus the quality encoding. Currently, the onus is on the bioinformatician to determine provenance, which now requires finding out which version of the Solexa/Illumina pipeline was used!

Even reading the literature can be confusing, for example Huang et al. (17) wrote ‘ … using Illumina GA processing pipeline V0.2.2.6 … MAQ was used to convert Illumina FASTQ to Sanger standard FASTQ format’. At the time of writing, MAQ does not convert ‘fastq-illumina’ to ‘fastq-sanger’ format, so this group could have potentially mis-converted their data. However, as the Illumina pipeline version is given, we can infer they actually started with what we have christened the ‘fastq-solexa’ FASTQ format, and there is no problem.

Despite this confusion, the Sanger version of the FASTQ format has found the broadest acceptance, supported by many assembly and read mapping tools, for example SSAHA2 (18), MAQ (13), Velvet (19), BWA (14) and BowTie (20). Although some of these tools can convert from the Solexa (and in some cases also the Illumina 1.3+) FASTQ variant, support for the standard Sanger FASTQ files is most common. Therefore, most users will do this conversion very early in their workflows (perhaps using OBF software). We also note that the NCBI SRA makes all its data available as standard Sanger FASTQ files (even if originally from a Solexa/Illumina machine). We hope that this trend will lead to Illumina themselves switching to the original FASTQ convention at a later date, which would eventually relegate this confusion of incompatible variants to a historical concern. A further suggestion is for Roche to extend their SFF tools to produce Sanger FASTQ files in addition to the existing options of FASTA and QUAL files.

In addition to simple conversion between FASTQ variants, other common steps in a sequencing pipeline include quality and adaptor trimming, and contaminant or quality-based filtering. A set of interchangeable tools like the OBF projects, based on a common FASTQ standard, will be of great benefit here.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

BBSRC (grants BBR/G02264X/1 and BB/D018358/1 to EMBOSS); Scottish Government Rural and Environment Research and Analysis Directorate, UK (to P.J.A.C.). Funding for open access charge: P.M.R.’s EBI group budget.

Conflict of interest statement. None declared.

18 in total

1. EMBOSS: the European Molecular Biology Open Software Suite.

Authors: P Rice; I Longden; A Bleasby
Journal: Trends Genet Date: 2000-06 Impact factor: 11.639

2. SSAHA: a fast search method for large DNA databases.

Authors: Z Ning; A J Cox; J C Mullikin
Journal: Genome Res Date: 2001-10 Impact factor: 9.043

3. Solexa Ltd.

Authors: Simon Bennett
Journal: Pharmacogenomics Date: 2004-06 Impact factor: 2.533

4. Genome sequencing in microfabricated high-density picolitre reactors.

Authors: Marcel Margulies; Michael Egholm; William E Altman; Said Attiya; Joel S Bader; Lisa A Bemben; Jan Berka; Michael S Braverman; Yi-Ju Chen; Zhoutao Chen; Scott B Dewell; Lei Du; Joseph M Fierro; Xavier V Gomes; Brian C Godwin; Wen He; Scott Helgesen; Chun Heen Ho; Chun He Ho; Gerard P Irzyk; Szilveszter C Jando; Maria L I Alenquer; Thomas P Jarvie; Kshama B Jirage; Jong-Bum Kim; James R Knight; Janna R Lanza; John H Leamon; Steven M Lefkowitz; Ming Lei; Jing Li; Kenton L Lohman; Hong Lu; Vinod B Makhijani; Keith E McDade; Michael P McKenna; Eugene W Myers; Elizabeth Nickerson; John R Nobile; Ramona Plant; Bernard P Puc; Michael T Ronan; George T Roth; Gary J Sarkis; Jan Fredrik Simons; John W Simpson; Maithreyan Srinivasan; Karrie R Tartaro; Alexander Tomasz; Kari A Vogt; Greg A Volkmer; Shally H Wang; Yong Wang; Michael P Weiner; Pengguang Yu; Richard F Begley; Jonathan M Rothberg
Journal: Nature Date: 2005-07-31 Impact factor: 49.962

5. Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Authors: Daniel R Zerbino; Ewan Birney
Journal: Genome Res Date: 2008-03-18 Impact factor: 9.043

6. Base-calling of automated sequencer traces using phred. I. Accuracy assessment.

Authors: B Ewing; L Hillier; M C Wendl; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

7. Base-calling of automated sequencer traces using phred. II. Error probabilities.

Authors: B Ewing; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

8. Consed: a graphical tool for sequence finishing.

Authors: D Gordon; C Abajian; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

9. Experiment files and their application during large-scale sequencing projects.

Authors: J K Bonfield; R Staden
Journal: DNA Seq Date: 1996

10. Improved tools for biological sequence comparison.

Authors: W R Pearson; D J Lipman
Journal: Proc Natl Acad Sci U S A Date: 1988-04 Impact factor: 11.205
