BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters.

Justin Chu¹, Sara Sadeghi¹, Anthony Raymond¹, Shaun D Jackman¹, Ka Ming Nip¹, Richard Mar¹, Hamid Mohamadi¹, Yaron S Butterfield¹, A Gordon Robertson¹, Inanç Birol¹.

Abstract

Large datasets can be screened for sequences from a specific organism, quickly and with low memory requirements, by a data structure that supports time- and memory-efficient set membership queries. Bloom filters offer such queries but require that false positives be controlled. We present BioBloom Tools, a Bloom filter-based sequence-screening tool that is faster than BWA, Bowtie 2 (popular alignment algorithms) and FACS (a membership query algorithm). It delivers accuracies comparable with these tools, controls false positives and has low memory requirements. Availability and implementation: www.bcgsc.ca/platform/bioinfo/software/biobloomtools.

Year: 2014 PMID: 25143290 PMCID: PMC4816029 DOI: 10.1093/bioinformatics/btu558

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Pipelines that detect pathogens and contamination screen for host sequences so that these do not interfere with downstream analysis (Castellarin et al., 2011; Kostic et al., 2011; Tang et al., 2013; Xu et al., 2014). The alignment-based algorithms that these pipelines use provide mapping locations that are irrelevant for classification, and thus perform more computation than is needed. To address this, we have developed BioBloom Tools (BBT). BBT uses Bloom filters, which are probabilistic, constant-time-access data structures that identify whether elements belong to a set (Bloom, 1970). Bloom filters are similar to hash tables but do not store the elements themselves; instead, they set a fixed number of bits for every element in a common bit array. Thus, they use less memory, but queries to the filter may return false membership (false hits) because of hash collisions in the common bit array. The false-positive rate (FPR) resulting from these false hits can be managed by increasing the size of the filter (Supplementary Material). Using Bloom filters for sequence categorization was pioneered by the program FACS (Stranneheim et al., 2010). Here, we describe a Bloom filter implementation that includes heuristics to control false positives and increase speed.
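As an illustration of the data structure described above, the following is a minimal Python sketch of a Bloom filter: a shared bit array in which each inserted element sets a fixed number of bits. The class name, the SHA-256-based double hashing and the parameter values are assumptions made for this example; they are not BBT's implementation or hash functions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: one shared bit array, several hash positions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from two base hashes (double hashing).
        # This hashing scheme is an illustrative assumption, not BBT's.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Only bits are set; the element itself is never stored.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True if every position is set; collisions can cause false positives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("ACGTACGTAC")
print("ACGTACGTAC" in bf)  # True: an inserted element is always reported present
```

A query for an element that was never inserted usually returns False, but can return True when all of its bit positions happen to have been set by other elements; a larger bit array makes such collisions rarer.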

2 METHODS

We first build filters from a set of reference sequences by dividing the sequences into all possible k-mers (substrings of length k). We compare the forward and reverse complement of every k-mer, and include the alphanumerically smaller sequence in the filter. We calculate the bit signature of a k-mer by mapping the sequence to a set of integer values using a fixed number of hash functions (Broder and Mitzenmacher, 2004; Supplementary Materials). The bitwise union of the signatures of all the k-mers constitutes a Bloom filter for the corresponding reference sequences. To test whether a query sequence of length l is present in the target reference(s), we use a sliding window of k-mers. Starting at one end of the query sequence, and shifting one base pair at a time along this sequence, we check each k-mer against each reference’s Bloom filter. When a k-mer matches a filter, we incrementally calculate a score s from c, the number of contiguous stretches of adjacent filter-matching k-mers up to the current position in the query, and a_i, the length of the i-th stretch. This heuristic penalizes likely false-positive hits, which tend to be isolated rather than adjacent. We evaluate k-mers this way until we reach either a specified score threshold (s*) or the end of the query sequence. If at any point we reach s*, we categorize the query as belonging to the reference, and terminate the process for that query. Further, we use a jumping k-mer heuristic that skips k k-mers when a miss is detected after a long series of adjacent hits. This efficiently handles cases in which the query has a single (or a few) base mismatch(es) with the target.
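The canonical k-mer selection and the sliding-window bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions: a plain Python set stands in for the Bloom filter, the function names are hypothetical, and neither the early termination at s* nor the jumping k-mer heuristic is modelled. It stops short of the score itself and only derives the stretch lengths that the score is built from.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int):
    """Yield, for each window, the smaller of the k-mer and its reverse complement."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, reverse_complement(kmer))


def hit_stretches(query: str, k: int, reference_kmers) -> list:
    """Return the lengths of contiguous runs of adjacent matching k-mers.

    The number of runs corresponds to c and the run lengths to the stretch
    lengths in the text. reference_kmers is any container supporting `in`
    (here a set; a Bloom filter would be used in practice).
    """
    stretches = []
    run = 0
    for kmer in canonical_kmers(query, k):
        if kmer in reference_kmers:
            run += 1
        elif run:
            stretches.append(run)
            run = 0
    if run:
        stretches.append(run)
    return stretches


reference = "ACGTTGCAAC"
ref_kmers = set(canonical_kmers(reference, 4))

# One substitution (G -> A at position 5) breaks the run of hits in two.
print(hit_stretches("ACGTTACAAC", 4, ref_kmers))  # [2, 1]
```

Because both strands map to the same canonical k-mer, the reverse complement of the reference matches the filter just as well as the reference itself.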

3 BENCHMARKING

We compared BBT against two widely used Burrows–Wheeler transform-based alignment tools that have low memory usage and high accuracy, BWA (Li and Durbin, 2009) and Bowtie 2 (BT2; Langmead and Salzberg, 2012), and against the C++ implementation of FACS (https://github.com/SciLifeLab/facs). Tool versions and other details are provided in the Supplementary Materials.

3.1 Benchmarking on simulated data

We used dwgsim (https://github.com/nh13/DWGSIM) to generate simulated Illumina reads from human, mouse and Escherichia coli reference genomes. For each genome, we generated 1 million 2 × 150 bp paired-end (PE) reads and 1 million 100 bp single-end (SE) reads. We used E.coli because it is a common contaminant and is genetically distant from human. With mouse, which is commonly used in xenograft studies, we tested categorization accuracy for species that are closely related genetically. Because FACS does not support PE reads, we used the 100 bp SE reads to compare the false- and true-positive rates (FPR and TPR, respectively) of BBT and FACS. We tested a range of scoring thresholds for both tools. Using a k-mer size of 25 bp, BBT generally matched or outperformed FACS (Fig. 1A and B). We note that, for shorter k-mers, performance of BBT and FACS algorithms would deteriorate, especially in distinguishing sequences from closely related references. For both tools, longer k-mers gave lower FPR but also lower maximum TPR (Supplementary Figs S1 and S2), with BBT performing increasingly better than FACS for longer k-mers.

Fig. 1.

Performance comparisons of BBT against FACS, BWA and BT2. Receiver operating characteristic curves of BBT and FACS using simulated 100 bp SE reads from Homo sapiens mixed with (A) E.coli and (B) Mus musculus, filtered against an H.sapiens Bloom filter using a k-mer size of 25 bp; (C) CPU time benchmark comparing BT2 (for a range of built-in settings), BWA (using aln and mem settings), FACS and BBT, on one lane of human 2 × 150 bp PE Illumina HiSeq 2500 reads.

To compare BBT and FACS to BWA and BT2, we used 2 × 150 bp PE reads. In our tests, overall, BBT performed comparably with the aligners and outperformed the ‘fast’ and ‘very fast’ settings of BT2 in both false-negative rate (FNR) and false-discovery rate (FDR; Table 1).

Table 1.

Benchmarking results using simulated paired end 2 × 150 bp reads

Tool and settings	FNR (H.sapiens)	FDR (M.musculus)	FDR (E.coli)
BT2 very sensitive	1.40 × 10⁻⁵	2.03 × 10⁻²	0
BT2 sensitive	7.52 × 10⁻⁴	9.08 × 10⁻³	0
BT2 fast	1.26 × 10⁻²	5.90 × 10⁻³	0
BT2 very fast	1.34 × 10⁻²	5.65 × 10⁻³	0
BWA aln	3.26 × 10⁻³	8.14 × 10⁻⁴	0
BWA mem	0	1.92 × 10⁻¹	1.00 × 10⁻⁴
FACS	1.22 × 10⁻¹	9.88 × 10⁻³	0
BBT (s* = 0.1)	8.42 × 10⁻³	3.78 × 10⁻³	0

Note: All reads were treated as SE reads for FACS.
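
For readers who want to recompute rates of this kind from read counts, the sketch below assumes the conventional definitions FNR = FN / (TP + FN) and FDR = FP / (TP + FP). The function names and the counts are invented for illustration; they are not the data behind Table 1, and the exact definitions used for the table may differ.

```python
def false_negative_rate(true_pos: int, false_neg: int) -> float:
    """Fraction of reads from the target species that were not categorized as such."""
    return false_neg / (true_pos + false_neg)


def false_discovery_rate(true_pos: int, false_pos: int) -> float:
    """Fraction of reads categorized as the target species that came from elsewhere."""
    return false_pos / (true_pos + false_pos)


# Hypothetical counts: 1,000,000 human reads, of which 8,000 are missed,
# and 4,000 mouse reads wrongly categorized as human.
tp, fn, fp = 992_000, 8_000, 4_000
print(f"FNR = {false_negative_rate(tp, fn):.2e}")
print(f"FDR = {false_discovery_rate(tp, fp):.2e}")
```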

3.2 Benchmarking on experimental data

We used a single lane of 2 × 150 bp PE human DNA reads (https://basespace.illumina.com/run/716717/2x150-HiSeq-2500-demo-NA12878) generated with an Illumina HiSeq 2500 sequencer to benchmark computational performance. For a controlled comparison, we ran at least eight replicates for each tool, and we measured CPU time, with all applications using a single thread. We ran BBT with s* = 0.1 and compared it with FACS, BWA and BT2, using a range of run modes for the latter two tools. BBT was faster than the fastest aligner/settings combination (BT2 very fast) by at least an order of magnitude (Fig. 1C). The mapping rates (categorization rates for BBT and FACS) of each tool were comparable, at 96.69 (BT2 very sensitive), 96.57 (BT2 sensitive), 96.18 (BT2 fast), 95.97 (BT2 very fast), 99.76 (BWA mem), 95.12 (BWA aln), 95.81 (FACS) and 97.27% (BBT).

3.3 Memory usage

For categorization, using the human reference and simulated reads, the peak memory usage (GB) for each tool was 3.8 (BBT), 4.8 (FACS), 3.1 (BWA aln), 5.2 (BWA mem) and 3.4 (BT2). These figures are for categorization only and do not include the memory usage for creating the FM-indexes or Bloom filters. Unless slower disk-based methods are used, creating an FM-index takes at least O(n log n) bits of memory, where n is the size of the reference sequence (Ferragina ). In contrast, Bloom filter memory usage is the same for the creation and categorization stages, and takes O(−n log f) bits of memory, where f is the FPR and n is the number of input sequences. We created filters using 3.2 GB of memory for both FACS and BBT. Assuming optimal numbers of hash functions are used, filters with the same size should have similar FPRs. However, in practice, we had to use different FPR settings in creating these filters (FPR of 0.5% for FACS and 0.75% for BBT). We note that the tools would differ from theoretical estimates because of implementation-specific calculation differences. Finally, to demonstrate the scalability of BBT, we built a filter for 5182 bacterial sequences (representing 6 × 10¹⁰ unique 25-mers), using 6.8 GB of memory, corresponding to an FPR of 0.75%.
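
The relationship between filter size and FPR mentioned above can be made concrete with the standard textbook formulas for a Bloom filter that uses an optimal number of hash functions: m = −n ln f / (ln 2)² bits and h = (m / n) ln 2 hash functions. The sketch below applies these formulas; the function names and the element count are illustrative assumptions, and, as noted in the text, actual tools can deviate from such theoretical estimates.

```python
import math


def bloom_bits(num_elements: int, fpr: float) -> int:
    """Bits needed for a target FPR, assuming an optimal number of hash functions."""
    return math.ceil(-num_elements * math.log(fpr) / math.log(2) ** 2)


def optimal_num_hashes(num_bits: int, num_elements: int) -> int:
    """Number of hash functions that minimizes the FPR for a given filter size."""
    return max(1, round(num_bits / num_elements * math.log(2)))


def expected_fpr(num_bits: int, num_elements: int, num_hashes: int) -> float:
    """Approximate FPR of a filter holding num_elements elements."""
    return (1 - math.exp(-num_hashes * num_elements / num_bits)) ** num_hashes


n = 1_000_000   # hypothetical number of k-mers
f = 0.0075      # the 0.75% FPR setting quoted for BBT
m = bloom_bits(n, f)
h = optimal_num_hashes(m, n)
print(f"{m / n:.2f} bits per element, {h} hash functions")
print(f"expected FPR: {expected_fpr(m, n, h):.4f}")
```

Under these formulas a target FPR of 0.75% costs roughly 10 bits per element regardless of k-mer length, which is why the memory requirement grows linearly with the number of distinct k-mers in the reference.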

REFERENCES

1. Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma.

Authors: Mauro Castellarin; René L Warren; J Douglas Freeman; Lisa Dreolini; Martin Krzywinski; Jaclyn Strauss; Rebecca Barnes; Peter Watson; Emma Allen-Vercoe; Richard A Moore; Robert A Holt
Journal: Genome Res Date: 2011-10-18 Impact factor: 9.043

2. PathSeq: software to identify or discover microbes by deep sequencing of human tissue.

Authors: Aleksandar D Kostic; Akinyemi I Ojesina; Chandra Sekhar Pedamallu; Joonil Jung; Roel G W Verhaak; Gad Getz; Matthew Meyerson
Journal: Nat Biotechnol Date: 2011-05 Impact factor: 54.908

3. Classification of DNA sequences using Bloom filters.

Authors: Henrik Stranneheim; Max Käller; Tobias Allander; Björn Andersson; Lars Arvestad; Joakim Lundeberg
Journal: Bioinformatics Date: 2010-05-13 Impact factor: 6.937

4. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

5. The landscape of viral expression and host gene fusion and adaptation in human cancer.

Authors: Ka-Wei Tang; Babak Alaei-Mahabadi; Tore Samuelsson; Magnus Lindh; Erik Larsson
Journal: Nat Commun Date: 2013 Impact factor: 14.919

6. RNA CoMPASS: a dual approach for pathogen and host transcriptome analysis of RNA-seq datasets.

Authors: Guorong Xu; Michael J Strong; Michelle R Lacey; Carl Baribault; Erik K Flemington; Christopher M Taylor
Journal: PLoS One Date: 2014-02-25 Impact factor: 3.240
