Literature DB >> 21615897

Assessment of replicate bias in 454 pyrosequencing and a multi-purpose read-filtering tool.

Mariette Jérôme¹, Céline Noirot, Christophe Klopp.

Abstract

BACKGROUND: Roche 454 pyrosequencing platform is often considered the most versatile of the Next Generation Sequencing technology platforms, permitting the sequencing of large genomes, the analysis of variations or the study of transcriptomes. A recent reported bias leads to the production of multiple reads for a unique DNA fragment in a random manner within a run. This bias has a direct impact on the quality of the measurement of the representation of the fragments using the reads. Other cleaning steps are usually performed on the reads before assembly or alignment.
FINDINGS: PyroCleaner is a software module intended to clean 454 pyrosequencing reads in order to ease the assembly process. This program is a free software and is distributed under the terms of the GNU General Public License as published by the Free Software Foundation. It implements several filters using criteria such as read duplication, length, complexity, base-pair quality and number of undetermined bases. It also permits to clean flowgram files (.sff) of paired-end sequences generating on one hand validated paired-ends file and the other hand single read file.
CONCLUSIONS: Read cleaning has always been an important step in sequence analysis. The pyrocleaner python module is a Swiss knife dedicated to 454 reads cleaning. It includes commonly used filters as well as specialised ones such as duplicated read removal and paired-end read verification.

Entities: Chemical Disease Species

Year: 2011 PMID： 21615897 PMCID： PMC3117718 DOI： 10.1186/1756-0500-4-149

Source DB: PubMed Journal: BMC Res Notes ISSN： 1756-0500

Findings

NGS platforms are now well implanted in sequencing centers and some laboratories. The hundred fold decrease of sequencing costs widens the scientific population accessing this type of data. The reads produced by NGS platforms are prone to errors, some of which are random and others specific to a given platform often called biases. Cleaning bad quality or biased reads is often the first step before any other analysis. Several pieces of software exist but they are specialised in some filters. Tools such as SeqClean [2], lucy [3] or Figaro [4] gather common but useful filters based on different criteria such as length, quality, polya tails, complexity, vector presence, contamination. Another like cd-hit-454 [5] specifically discards artificial duplicated reads [6] observed in pyrosequencing runs. The purpose of pyrocleaner is to offer a tool as complete as possible to clean and assess 454 reads.

Implementation

PyroCleaner aims to clean duplicated reads generated by the Roche 454 platform in a controlled way as close as possible to a random read selection process. Replicate seeking (--clean-duplicated-reads option) relies on an alignment of all sequences against themselves, using megablast [7]. The result is used to build a graph connecting similar reads. In this graph, each read is represented by a vertex and edges represent the similarities between reads. Edges are created only if the similarity between read starts at the first nucleic position of both reads, has a score higher than 100 and if both reads have the same strand. Due to the homopolymer bias and sequencing errors, duplicated reads can offer slight differences:insertions, deletions or substitutions and they can stop at different positions. Two options are available regarding the aggressiveness of the duplicated read removal step. Using the aggressive option (--aggressive), two reads are connected disregarding the read length. Without the aggressive option, the algorithm will connect reads only if their length differences are lower than a given threshold value (default is 70 bases). Connected components are then extracted using the igraph library [8]. Each component represents a duplication cluster of which only the longest read is kept in the result file. The underlying idea is that artificial duplicates will be much more alike than reads coming from two different fragments starting at the same position. The module also provides an option to filter paired-end reads:--clean-pairends. A 454 paired-end read should be composed of the sequence of one end of the DNA fragment, a linker sequence and the sequence of the other end of the DNA fragment. Unfortunately in some cases the linker is missing. In other cases the linker is too close to the end of the read and therefore the mate-pair cannot be used to bridge contigs in an assembly process. Cleaning paired-end reads relies on seeking this linker. The Roche platform uses three different linkers depending on the chemistry, one for GSFLX and two others for Titanium. Using the option generates a local similarity search which is performed between input sequences and 454 linkers using cross_match [9]. It leads to the generation of two output files using the strategy presented in Figure 1. The first file will contain all good quality paired-end reads. The second one gathers all reads in which the linker was missing, the linker location too close to one end or the linker sequence quality too low. In the last two cases the reads are clipped in order to keep the longest subsequence without linker. Thus, all reads from the second file can be used as single reads in the assembly.

Figure 1

Paired-end cleaning strategy. Reads having no linker (a) are retained as single reads. If multiple linkers are present (b) in the same read, the read is discarded. In cases where the linker is partially found, meaning that the number of mismatches is lower than a threshold, only reads where the linker is located at the beginning or at the end (c) are saved as single reads, others (d) are deleted. Reads where the entire linker is present and not to closely located to one end (e) are saved as paired-end reads. In other cases, sequences are saved as single reads only if the linker is located far enough from one end (g), while others (f) are deleted. More basic but useful cleaning options are also provided. Discarding reads using their length can be done by setting min/max values (--clean-length-win option) or by using the standard deviation (--clean-length-std option) computed on all input reads. Reads can be filtered based on their complexity, which is computed using the compressed string length (library zip) on the complete sequence (--clean-complexity-full option) or on several sub-sequences generated using a sliding window approach (--clean-complexity-win option). In the latter last case, the read is flagged as complex if at least one sub-sequence complexity is higher than the given threshold. Reads can also be discarded if none of its base pairs has a quality value above the given threshold (--clean-quality-full option) or if the rate of undetermined bases is higher than a specified value (--clean-ns option). Pyrocleaner produces several output files. The result files containing the reads can be written into several formats such as sff, fastq or fasta. Exporting reads into sff format is convenient as it is now widely used by assemblers. However, exporting reads in this format depends on the sfffile script provided by Roche:if it is missing, the output will be written in the fastq format. The log file gives precise information about the cleaned reads and the reason of cleaning. The last lines of the generated log file contain cleaning summary and duplication profile figures.

Results and Discussion

In order to analyse the PyroCleaner efficiency on duplication cleaning, we used a technical validation run produced in collaboration with Roche (data are available on the ecoli demonstration runs of the ng6 web site http://ng6.toulouse.inra.fr/). It was a whole genome sequencing of E coli K12 Titanium run where one half plate was prepared by a PlaGe sequencing platform [9] technician (Run1, see Figure 2) and the second half plate by a Roche technician, both using the same protocol (Run2, see Figure 2). The aim was to compare the duplication profile between real and simulated data. The simulated reads were picked randomly from both strands in the E coli K12 genome. Two simulations were performed. Both simulations have demonstrated a replication rate close to 8%, which is far from the 31% and 18% obtained respectively for the first and seconds runs.

Figure 2

Duplication profile before pyrocleaning and after. Simulated dataset were produced using the E coli K12 genome. Sequences of 500 bp were picked randomly along the genome using both strands. The number of simulated sequences (Sim run1/Sim run2) equals the number of sequences produced in the experimental runs (454 run1/454 run2). For both 454 and simulated datasets the structure of duplication was produced using PyroCleaner log files. The probability of appearance of large clusters of duplicated reads is much higher with the 454 platform than with random selection. Figure 2a shows the distribution of duplicated read clusters with cluster size going from 2 to 30. With simulated data the largest cluster contains only five reads whereas clusters with as many as 27 reads are present in the experimental data. Figure 2b presents the structure of duplication calculated on the experimental runs once cleaned by PyroCleaner. It clearly reduces the number of large clusters and produces a read set having a structure which is close to random selection. It took 123 minutes to process 671 856 E coli sequences on a quad-core machine with 32Gb of memory. 28 624 sequences were discarded because of their length (4.3%), 65 because of the number of undetermined bases, 21 807 because of their low complexity (3.2%), 663 because of their poor quality (0.1%) and 156 222 because they have been flagged as duplicate (23.3%). The multiple read bias impact is closely linked to the type of analysis performed. As long as the fragmentation protocol produces a random selection of fragment start positions, the reads can easily be cleaned using the PyroCleaner. In case of non-random selection of sequence end positions like AFLP, 3' mRNA tags, and reduced representation libraries there is no solution to distinguish between artificially duplicated reads and sequences coming from multiple fragments starting at the same position.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JM implemented, designed the tools and wrote the manuscript. CN participated in the implementation and the design. CK participated in its coordination and helped to draft the manuscript. All authors read and approved the final manuscript.

6 in total

1. A greedy algorithm for aligning DNA sequences.

Authors: Z Zhang; S Schwartz; L Wagner; W Miller
Journal: J Comput Biol Date: 2000 Feb-Apr Impact factor: 1.479

2. Figaro: a novel statistical method for vector sequence removal.

Authors: James Robert White; Michael Roberts; James A Yorke; Mihai Pop
Journal: Bioinformatics Date: 2008-01-17 Impact factor: 6.937

3. DNA sequence quality trimming and vector removal.

Authors: H H Chou; M H Holmes
Journal: Bioinformatics Date: 2001-12 Impact factor: 6.937

4. Systematic artifacts in metagenomes from complex microbial communities.

Authors: Vicente Gomez-Alvarez; Tracy K Teal; Thomas M Schmidt
Journal: ISME J Date: 2009-07-09 Impact factor: 10.302

5. Base-calling of automated sequencer traces using phred. II. Error probabilities.

Authors: B Ewing; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

6. Artificial and natural duplicates in pyrosequencing reads of metagenomic data.

Authors: Beifang Niu; Limin Fu; Shulei Sun; Weizhong Li
Journal: BMC Bioinformatics Date: 2010-04-13 Impact factor: 3.169

6 in total

24 in total

1. Pyrosequencing-based transcriptomic resources in the pond snail Lymnaea stagnalis, with a focus on genes involved in molecular response to diquat-induced stress.

Authors: Anthony Bouétard; Céline Noirot; Anne-Laure Besnard; Olivier Bouchez; Damien Choisne; Eugénie Robe; Christophe Klopp; Laurent Lagadic; Marie-Agnès Coutellec
Journal: Ecotoxicology Date: 2012-07-20 Impact factor: 2.823

2. Field monitoring of avian influenza viruses: whole-genome sequencing and tracking of neuraminidase evolution using 454 pyrosequencing.

Authors: Guillaume Croville; Sébastien Mathieu Soubies; Johanna Barbieri; Christophe Klopp; Jérôme Mariette; Olivier Bouchez; Christelle Camus-Bouclainville; Jean-Luc Guérin
Journal: J Clin Microbiol Date: 2012-06-20 Impact factor: 5.948

3. NG6: Integrated next generation sequencing storage and processing environment.

Authors: Jérôme Mariette; Frédéric Escudié; Nicolas Allias; Gérald Salin; Céline Noirot; Sylvain Thomas; Christophe Klopp
Journal: BMC Genomics Date: 2012-09-09 Impact factor: 3.969

4. Transcriptional profiling of bud dormancy induction and release in oak by next-generation sequencing.

Authors: Saneyoshi Ueno; Christophe Klopp; Jean Charles Leplé; Jérémy Derory; Céline Noirot; Valérie Léger; Elodie Prince; Antoine Kremer; Christophe Plomion; Grégoire Le Provost
Journal: BMC Genomics Date: 2013-04-10 Impact factor: 3.969

5. Sequencing platform and library preparation choices impact viral metagenomes.

Authors: Sergei A Solonenko; J César Ignacio-Espinoza; Adriana Alberti; Corinne Cruaud; Steven Hallam; Kostas Konstantinidis; Gene Tyson; Patrick Wincker; Matthew B Sullivan
Journal: BMC Genomics Date: 2013-05-10 Impact factor: 3.969

6. Teleost fish mount complex clonal IgM and IgT responses in spleen upon systemic viral infection.

Authors: Rosario Castro; Luc Jouneau; Hang-Phuong Pham; Olivier Bouchez; Véronique Giudicelli; Marie-Paule Lefranc; Edwige Quillet; Abdenour Benmansour; Frédéric Cazals; Adrien Six; Simon Fillatreau; Oriol Sunyer; Pierre Boudinot
Journal: PLoS Pathog Date: 2013-01-10 Impact factor: 6.823

7. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data.

Authors: Lauren M Bragg; Glenn Stone; Margaret K Butler; Philip Hugenholtz; Gene W Tyson
Journal: PLoS Comput Biol Date: 2013-04-11 Impact factor: 4.475

8. High-density linkage mapping in a pine tree reveals a genomic region associated with inbreeding depression and provides clues to the extent and distribution of meiotic recombination.

Authors: Emilie Chancerel; Jean-Baptiste Lamy; Isabelle Lesur; Céline Noirot; Christophe Klopp; François Ehrenmann; Christophe Boury; Grégoire Le Provost; Philippe Label; Céline Lalanne; Valérie Léger; Franck Salin; Jean-Marc Gion; Christophe Plomion
Journal: BMC Biol Date: 2013-04-18 Impact factor: 7.431

9. Filtering duplicate reads from 454 pyrosequencing data.

Authors: Susanne Balzer; Ketil Malde; Markus A Grohme; Inge Jonassen
Journal: Bioinformatics Date: 2013-02-01 Impact factor: 6.937

10. Fine mapping of complex traits in non-model species: using next generation sequencing and advanced intercross lines in Japanese quail.

Authors: Laure Frésard; Sophie Leroux; Patrice Dehais; Bertrand Servin; Hélène Gilbert; Olivier Bouchez; Christophe Klopp; Cédric Cabau; Florence Vignoles; Katia Feve; Amélie Ricros; David Gourichon; Christian Diot; Sabine Richard; Christine Leterrier; Catherine Beaumont; Alain Vignal; Francis Minvielle; Frédérique Pitel
Journal: BMC Genomics Date: 2012-10-15 Impact factor: 3.969