Literature DB >> 19737799

TagDust--a program to eliminate artifacts from next generation sequencing data.

Timo Lassmann¹, Yoshihide Hayashizaki, Carsten O Daub.

Abstract

MOTIVATION: Next-generation parallel sequencing technologies produce large quantities of short sequence reads. Due to experimental procedures various types of artifacts are commonly sequenced alongside the targeted RNA or DNA sequences. Identification of such artifacts is important during the development of novel sequencing assays and for the downstream analysis of the sequenced libraries.
RESULTS: Here we present TagDust, a program identifying artifactual sequences in large sequencing runs. Given a user-defined cutoff for the false discovery rate, TagDust identifies all reads explainable by combinations and partial matches to known sequences used during library preparation. We demonstrate the quality of our method on sequencing runs performed on Illumina's Genome Analyzer platform. AVAILABILITY: Executables and documentation are available from http://genome.gsc.riken.jp/osc/english/software/. CONTACT: timolassmann@gmail.com.

Entities: Disease Species

Mesh：

Year: 2009 PMID： 19737799 PMCID： PMC2781754 DOI： 10.1093/bioinformatics/btp527

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Next-generation sequencing is applied to address a whole range of biological questions (Mardis, 2008; von Bubnoff, 2008). A widely recognizable challenge lies in the computational treatment of the huge volumes of data being generated. An initial step is to verify whether a sequencing run was successful. A low mapping rate to a reference genome is commonly a good indicator of the run quality, however, it fails to explain the source of the unmapped sequences. From experience we know that large fractions of the unmapped sequences often correspond to artifacts arising from linker and adaptor sequences used in the library construction. Such artifacts are comparable with vector sequences found in traditional Sanger sequencing (White et al., 2008). The identification of these artifacts is important during the development of novel sequencing assays. More importantly, a fraction of artifacts commonly maps to reference genomes and can thus influence the biological interpretation of the libraries. The situation is particularly problematic when comparing two RNA samples sequenced at different biological states. If the total number of sequences from states A and B is the same but the fraction of artifacts is increased in state B, it may appear that non-artifactual sequences are downregulated compared with state A. Identification of known library sequences in sequenced reads should be trivial. However, sequencing errors, PCR errors, short read lengths, combinations of several fragmented sequences and their reverse complements complicate this task dramatically. To resolve this basic issue we developed TagDust, a program employing a fast, fuzzy string-matching algorithm to identify partial matches to library sequences in the reads. A read is annotated as an artifact if a large fraction of its residues can be explained by matches to library sequences.

2 METHODS

We previously employed the Muth–Manber algorithm (Muth and Manber, 1996) in the context of multiple alignments to quickly assess sequence similarity (Lassmann et al., 2008). It allows for multiple string matching with up to one error (mismatch, insertion or deletion). The latter is achieved by creating libraries of k-mers from both query and target strings. The library is then extended to patterns of length k−1 by deleting each character in all the original k-mers in turn. For example, the 4mer ACGT will be converted into CGT, AGT, ACT and CGT. We will refer to these extended patterns as lk-mers. A comparison of these libraries via fast exact string matching reveals all matches with up to one error. For example, a mismatch in the original sequences is detected with an exact match of the lk-mers lacking the mismatched residues. As a default, TagDust used a k-mer length of 12. For detecting artifacts, we are not really interested in the individual matches to a read but instead whether a large proportion of a read can be labeled as matching library sequences. Hence, we altered the default Muth–Manber algorithm to return the percentage of nucleotides involved in matches to library sequences and to run efficiently on very large datasets. Briefly, we record all lk-mers derived from the library sequences in a bitfield, scan all reads and identify matches with quick bit-lookups. Since the sequenced reads are currently short, between 30–50 nt in length, spurious hits often occur. Discarding reads based on these matches is obviously undesirable. Therefore, it is crucial to select a suitable cutoff on the percentage of residues covered by library sequences. We approach this problem in a manner analogous to recent work by Zhang et al. (2008) relating to the interpretation of ChIP-sequencing data. Initially, we simulate a sequencing dataset with the same length distribution and nucleotide composition as the input dataset. Secondly, we apply the modified Muth–Manber algorithm to the simulated reads to derive a distribution of the number of reads labeled as 5%, 10%,…, 100% library sequences. The distribution reflects how often we expect reads to be labeled as X% library sequence by chance. Finally, we obtain P-values from this null distribution and adjust them using the Benjamini–Hochberg method to reflect the controlled false discovery rate (FDR; Benjamini and Hochberg, 1995). The lowest sequence coverage that gives the requested FDR is then used as the cutoff value. For efficiency, TagDust is implemented in the C programming language. TagDust uses <5 MB of memory since only single reads are read into memory at a time for processing. Hence, it is applicable to current datasets and the large volume of data expected with future next-generation sequencing instruments. A computational bottleneck is the calculation of the adjusted P-values since this step, in principle, requires sorting of millions of P-values. However, since sequence lengths are natural numbers, only a selection of coverage cutoffs and associated P-values is possible. For example, a 20-nt sequence can be 95% or 100% labeled as library sequences but not by 97%. We take advantage of this and use a bit-sort-like algorithm to perform this step in linear memory and time. TagDust is freely available from the OMICS software repository or by request from the author.

3 RESULTS AND DISCUSSION

Obtaining suitable datasets for benchmarking our method is not trivial since partially failed sequencing runs are commonly not deposited in public databases. Nevertheless, we obtained five datasets sequenced by the Illumina Genome Analyzer from the NCBI short read archive. We used the standard Illumina adaptors and primers used in the different sequencing assays as target sequences to be filtered out from the reads. As expected, only a relatively small percentage of the deposited reads can be explained by library sequences (Table 1). To determine whether the same sequences could be filtered out by simply mapping to the reference genome, we mapped all artifactual sequences with up to two mismatches to the human genome (hg18 assembly) using nexalign (T.Lassmann, manuscript in preparation). Evidently, a varying percentage of the artifactual sequences map to the genome. In the absence of replicates it is difficult to determine whether such tags are actual artifacts and hence we recommend users to merely flag such reads and their mapping positions.

Table 1.

Percentages of reads identified as artifacts in five sequencing runs at varying FDR thresholds

Description	Accession	Sequences	FDR 0.05 (%)	FDR 0.01 (%)	FDR 0.001 (%)	CPU sec.
Genomic PE (18 nt)	ERR000017	6 381 596	1.4 (98.79)	0.4 (98.91)	0.1 (98.61)	28
Genomic PE (36 nt)	ERR000130	10 209 914	3.2 (84.05)	0.8 (52.72)	0.4 (11.44)	84
Genomic (25 nt)	SRR000723	7 230 975	1.7 (57.64)	0.5 (54.26)	0.1 (36.44)	45
Chip-Seq (25 nt)	SRR000731	6 011 079	3.7 (29.15)	2.5 (12.81)	2.0 (1.73)	37
RNA-Seq (33 nt)	SRR002052	12 099 833	1.8 (23.32)	0.6 (22.30)	0.1 (20.38)	103

The mapping rates of the artifactual sequences to the human genome are indicated in brackets. The last column lists the runtime of TagDust in CPU seconds for the 0.05 FDR cutoff.

Percentages of reads identified as artifacts in five sequencing runs at varying FDR thresholds The mapping rates of the artifactual sequences to the human genome are indicated in brackets. The last column lists the runtime of TagDust in CPU seconds for the 0.05 FDR cutoff. TagDust processes even the largest dataset here in <2 min on a standard desktop PC while using <5 MB of memory. Conceivably, the time it takes to map libraries can be reduced by using TagDust to filter out artifacts before the mapping. The two main applications for TagDust are to troubleshoot failed large-scaled sequencing runs and to filter out artifactual sequences from successful ones. The latter may affect the biological interpretation of the produced data since some artifactual sequences map to the respective reference genomes. Funding: Research Grant for the RIKEN Omics Science Center from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government (MEXT to Y.H.); a grant of the Genome Network Project from the Ministry of Education, Culture, Sports, Science and Technology, Japan; the Strategic Programs for R&D of RIKEN Grant for the RIKEN Frontier Research System, Functional RNA research program. Conflict of Interest: none declared.

5 in total

1. Next-generation sequencing: the race is on.

Authors: Andreas von Bubnoff
Journal: Cell Date: 2008-03-07 Impact factor: 41.582

Review 2. The impact of next-generation sequencing technology on genetics.

Authors: Elaine R Mardis
Journal: Trends Genet Date: 2008-02-11 Impact factor: 11.639

3. Figaro: a novel statistical method for vector sequence removal.

Authors: James Robert White; Michael Roberts; James A Yorke; Mihai Pop
Journal: Bioinformatics Date: 2008-01-17 Impact factor: 6.937

4. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features.

Authors: Timo Lassmann; Oliver Frings; Erik L L Sonnhammer
Journal: Nucleic Acids Res Date: 2008-12-22 Impact factor: 16.971

5. Modeling ChIP sequencing in silico with applications.

Authors: Zhengdong D Zhang; Joel Rozowsky; Michael Snyder; Joseph Chang; Mark Gerstein
Journal: PLoS Comput Biol Date: 2008-08-22 Impact factor: 4.475

5 in total

120 in total

1. Using formaldehyde-assisted isolation of regulatory elements (FAIRE) to isolate active regulatory DNA.

Authors: Jeremy M Simon; Paul G Giresi; Ian J Davis; Jason D Lieb
Journal: Nat Protoc Date: 2012-01-19 Impact factor: 13.491

2. A Role for Widely Interspaced Zinc Finger (WIZ) in Retention of the G9a Methyltransferase on Chromatin.

Authors: Jeremy M Simon; Joel S Parker; Feng Liu; Scott B Rothbart; Slimane Ait-Si-Ali; Brian D Strahl; Jian Jin; Ian J Davis; Amber L Mosley; Samantha G Pattenden
Journal: J Biol Chem Date: 2015-09-03 Impact factor: 5.157

3. Deep-sequencing of human Argonaute-associated small RNAs provides insight into miRNA sorting and reveals Argonaute association with RNA fragments of diverse origin.

Authors: Alexander Maxwell Burroughs; Yoshinari Ando; Michiel Jan Laurens de Hoon; Yasuhiro Tomaru; Harukazu Suzuki; Yoshihide Hayashizaki; Carsten Olivier Daub
Journal: RNA Biol Date: 2011-01-01 Impact factor: 4.652

4. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data.

Authors: Ravi K Patel; Mukesh Jain
Journal: PLoS One Date: 2012-02-01 Impact factor: 3.240

5. Tumor-homing cytotoxic human induced neural stem cells for cancer therapy.

Authors: Juli R Bagó; Onyi Okolie; Raluca Dumitru; Matthew G Ewend; Joel S Parker; Ryan Vander Werff; T Michael Underhill; Ralf S Schmid; C Ryan Miller; Shawn D Hingtgen
Journal: Sci Transl Med Date: 2017-02-01 Impact factor: 17.956

6. H3K36 Methylation Regulates Nutrient Stress Response in Saccharomyces cerevisiae by Enforcing Transcriptional Fidelity.

Authors: Stephen L McDaniel; Austin J Hepperla; Jie Huang; Raghuvar Dronamraju; Alexander T Adams; Vidyadhar G Kulkarni; Ian J Davis; Brian D Strahl
Journal: Cell Rep Date: 2017-06-13 Impact factor: 9.423

Review 7. Next-generation transcriptome assembly.

Authors: Jeffrey A Martin; Zhong Wang
Journal: Nat Rev Genet Date: 2011-09-07 Impact factor: 53.242

8. Predictable transcriptome evolution in the convergent and complex bioluminescent organs of squid.

Authors: M Sabrina Pankey; Vladimir N Minin; Greg C Imholte; Marc A Suchard; Todd H Oakley
Journal: Proc Natl Acad Sci U S A Date: 2014-10-21 Impact factor: 11.205

9. Deep transcriptome profiling of mammalian stem cells supports a regulatory role for retrotransposons in pluripotency maintenance.

Authors: Alexandre Fort; Kosuke Hashimoto; Daisuke Yamada; Md Salimullah; Chaman A Keya; Alka Saxena; Alessandro Bonetti; Irina Voineagu; Nicolas Bertin; Anton Kratz; Yukihiko Noro; Chee-Hong Wong; Michiel de Hoon; Robin Andersson; Albin Sandelin; Harukazu Suzuki; Chia-Lin Wei; Haruhiko Koseki; Yuki Hasegawa; Alistair R R Forrest; Piero Carninci
Journal: Nat Genet Date: 2014-04-28 Impact factor: 38.330

10. EvoPipes.net: Bioinformatic Tools for Ecological and Evolutionary Genomics.

Authors: Michael S Barker; Katrina M Dlugosch; Louie Dinh; R Sashikiran Challa; Nolan C Kane; Matthew G King; Loren H Rieseberg
Journal: Evol Bioinform Online Date: 2010-10-20 Impact factor: 1.625