
READSCAN: a fast and scalable pathogen discovery program with accurate genome relative abundance estimation.

Raeece Naeem, Mamoon Rashid, Arnab Pain.

Abstract

READSCAN is a highly scalable parallel program that identifies non-host sequences (of potential pathogen origin) and estimates their genome relative abundance in high-throughput sequence datasets. READSCAN accurately classified human and viral sequences in a simulated dataset of 20.1 million reads in <27 min using a small Beowulf compute cluster with 16 nodes (Supplementary Material).

AVAILABILITY: http://cbrc.kaust.edu.sa/readscan


Year: 2012 PMID: 23193222 PMCID: PMC3562070 DOI: 10.1093/bioinformatics/bts684

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The idea of computational subtraction of human host sequences to identify microbial sequences was first implemented on the Amazon EC2 (Elastic Compute Cloud) environment in the form of the software PathSeq (Kostic et al.). An alternative open-source workflow that runs on desktop computers was recently provided by Rapid Identification of Non-human Sequences (RINS) (Bhaduri et al.). More recently, a platform called CaPSID (Borozan et al.) was described for storing and visualizing the identified non-human sequences. We present READSCAN, a highly scalable and efficient tool for analyzing the ultra-high volumes of data produced by the latest sequencers, such as the Illumina HiSeq (http://www.illumina.com/systems/hiseq_systems.ilmn), which can produce 3–6 billion short reads in a single run. READSCAN exploits the data parallelism inherent in the sequenced reads and distributes the processing across multiple central processing units (CPUs). READSCAN's core alignment procedure against multiple known references is based on SMALT (H. Ponstingl 2012, personal communication; http://www.sanger.ac.uk/resources/software/smalt/), a fast and accurate short read mapper that works for a range of sequencing platforms (e.g. Illumina, Roche-454, Applied Biosystems-SOLiD). READSCAN is highly portable: it runs on anything from a dual-core laptop with as little as 2 GB of memory to a large Beowulf cluster with hundreds of compute nodes. READSCAN reports the genome relative abundance (GRA) of the identified non-host microbial sequences, computed with a proven finite mixture model and expectation maximization algorithm (Xia et al.). The results are ranked from most to least abundant species and grouped by the National Center for Biotechnology Information (NCBI) taxonomy tree. The software also performs an alignment-based assembly to report the length of the region covered by the reads and the weighted mean length of the resulting contigs. This serves as a useful metric for assessing true-positive results and also eliminates the need for an assembly program for microbial sequences with known reference genomes.
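The GRA estimation mentioned above can be illustrated with a small expectation maximization loop over a finite mixture of genomes. This is only a minimal sketch and not READSCAN's implementation: the function name `estimate_gra`, the per-read likelihood matrix and the final length normalization (mixing proportion divided by genome length) are assumptions made for illustration.

```python
from typing import List, Sequence


def estimate_gra(
    likelihoods: Sequence[Sequence[float]],
    genome_lengths: Sequence[float],
    n_iter: int = 100,
    tol: float = 1e-9,
) -> List[float]:
    """Sketch of EM for a finite mixture of genomes (hypothetical helper).

    likelihoods[i][j] is an assumed, non-negative likelihood that read i
    originates from genome j (0 when the read does not map to genome j).
    """
    n_genomes = len(genome_lengths)
    # Reads that map to no genome carry no information about the mixture.
    reads = [row for row in likelihoods if sum(row) > 0]
    if not reads:
        raise ValueError("no mapped reads")

    pi = [1.0 / n_genomes] * n_genomes  # initial mixing proportions
    for _ in range(n_iter):
        counts = [0.0] * n_genomes
        # E-step: fractional assignment of each read to its candidate genomes.
        for row in reads:
            weights = [p * w for p, w in zip(pi, row)]
            total = sum(weights)
            for j, w in enumerate(weights):
                counts[j] += w / total
        # M-step: new mixing proportions are the mean responsibilities.
        new_pi = [c / len(reads) for c in counts]
        converged = max(abs(a - b) for a, b in zip(pi, new_pi)) < tol
        pi = new_pi
        if converged:
            break

    # Assumed normalization: longer genomes yield more reads per genome copy,
    # so divide the read proportion by genome length and renormalize.
    density = [p / length for p, length in zip(pi, genome_lengths)]
    norm = sum(density)
    return [d / norm for d in density]
```

With reads that map uniquely, the loop converges after the first update and the estimate reduces to length-normalized read counts; the iterations only matter when reads map to several references.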

2 METHODS

The software first indexes the host and pathogen database sequences on a chosen k-mer value r, based on the principle discussed in Baeza-Yates and Perleberg (BYP) (Baeza-Yates and Perleberg, 1996). This k-mer value r allows us to detect mutated sequences with at most k errors or mutations in a string of length m. The search phase, as described in Figure 1, divides the input sequences into manageable chunks, and each chunk is processed in parallel. Each chunk is mapped against the host and pathogen references simultaneously using the SMALT aligner. The mapping results are filtered by a per cent identity cut-off. The reads are then classified into four bins, namely host, pathogen, ambiguous and unmapped, based on the alignment score reported by SMALT.
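The two ideas in this paragraph can be sketched in a few lines. The seed length below follows the usual pigeonhole reading of the BYP principle (a read of length m with at most k errors, cut into k + 1 pieces, must contain at least one error-free piece), which is assumed here to be the rule behind r. The binning rule, including treating equal scores as ambiguous, is likewise an assumption, since the text only says that classification uses the SMALT alignment score. Both function names are hypothetical.

```python
from typing import Optional


def byp_kmer_length(m: int, k: int) -> int:
    """Seed length r = floor(m / (k + 1)).

    A read of length m with at most k errors, split into k + 1 pieces,
    has at least one piece of length >= r that matches exactly.
    """
    if m <= 0 or k < 0:
        raise ValueError("m must be positive and k non-negative")
    return m // (k + 1)


def classify_read(host_score: Optional[int], pathogen_score: Optional[int]) -> str:
    """Assign a read to a bin from its best alignment scores.

    None means the read had no alignment passing the identity cut-off
    against that reference set (assumed convention for this sketch).
    """
    if host_score is None and pathogen_score is None:
        return "unmapped"
    if pathogen_score is None:
        return "host"
    if host_score is None:
        return "pathogen"
    if host_score > pathogen_score:
        return "host"
    if pathogen_score > host_score:
        return "pathogen"
    return "ambiguous"  # equal scores: origin cannot be decided


if __name__ == "__main__":
    print(byp_kmer_length(100, 4))
    print(classify_read(80, 95))
```

A smaller r tolerates more mutations per read but makes seed hits less specific, which is consistent with the sensitivity versus time trade-off reported in the results.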

Fig. 1.

Schematic of READSCAN algorithm

The directed acyclic graph representing the set of tasks and their dependencies is abstracted out, and the result is passed to GNU make on a desktop computer or to Makeflow (Yu et al.) on a multicore cluster, which executes the tasks in parallel to improve overall throughput. The Makeflow abstractions are what make the program highly portable: it runs without modification on Load Sharing Facility (LSF), Sun Grid Engine and various other load levelers. The memory and resource requirements for the alignment tasks are computed using the formula provided by the SMALT aligner, and these values are passed to the appropriate job scheduler. SMALT, like other short read aligners, has an inherent maximum limit (Martin et al.) on the size of the database that can be indexed. This limitation is overcome by splitting the database into manageable parts, such that no part exceeds the random access memory limit of a particular compute node. This allows the workflow to accommodate multiple human references, improving the accuracy of human read removal, as well as multiple pathogen references grouped by taxon, such as bacteria, viruses, protozoa and fungi. The chunk size controls the speed of the entire search phase, so it should be chosen appropriately. Because of sequence similarity between reference sequences in the pathogen database, the same read may map non-uniquely to multiple references. Hence, the resulting statistics file is clustered by the NCBI taxonomy tree, and the GRA for a particular species is reported as the sum of the GRA of all reference sequences of that species.
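The species-level summation described above amounts to a group-and-sum over reference sequences. The sketch below assumes two hypothetical inputs, a mapping from reference identifier to GRA and a mapping from reference identifier to species name; the actual statistics file format used by READSCAN is not described here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def gra_by_species(
    reference_gra: Dict[str, float],
    reference_to_species: Dict[str, str],
) -> List[Tuple[str, float]]:
    """Sum per-reference GRA values by species, most abundant first."""
    totals: Dict[str, float] = defaultdict(float)
    for ref_id, gra in reference_gra.items():
        totals[reference_to_species[ref_id]] += gra
    # Rank species from most to least abundant.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder identifiers, not real accession numbers or results.
    gra = {"ref_a1": 0.5, "ref_a2": 0.25, "ref_b1": 0.25}
    species = {"ref_a1": "species_a", "ref_a2": "species_a", "ref_b1": "species_b"}
    print(gra_by_species(gra, species))
```

Summing at the species level means a read that maps ambiguously between two references of the same species still contributes to a single, stable species total.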

3 RESULTS

3.1 Performance of READSCAN on real dataset

We tested the performance of READSCAN on a real RNA sequencing dataset of 11 pairs of control and matched colorectal carcinoma samples (Castellarin et al.). READSCAN detected the microbial flora present in the colorectal carcinoma and matched healthy tissues. The GRA values of different microbes in tumor and non-tumor samples are shown as a heatmap (Supplementary Fig. S1), which clearly depicts the enrichment of Fusobacterium nucleatum sequences in nine tumor samples compared with their normal counterparts, one of the key findings presented in Castellarin et al. The prostate cancer cell line SRR073726 (Prensner et al.) was also analyzed, and READSCAN accurately reported human papilloma virus (HPV) serotype 18 as the most abundant organism in the sample, with a GRA of 68% and a contig length of 953 bp, in 39 min. RINS also matched HPV serotype 18, with a contig of length 923 bp, in 105 min. RINS matched 12 HPV reference sequences, whereas READSCAN reported 18 HPV reference sequences grouped by HPV at the taxon level. The comparison was made on the same computer with exactly the same viral and human references.

3.2 Performance comparison—READSCAN, RINS and PathSeq on simulated dataset

The simulated dataset was generated from the human transcriptome and 12 viral genomes (Supplementary Methods). READSCAN outperformed RINS and PathSeq in recovering viral reads at different mutation rates (Fig. 2A). In its default mode, READSCAN is also much faster than RINS and PathSeq (Fig. 2B). By tuning the alignment indexing and min-identity parameters, READSCAN's high-sensitivity mode achieved higher sensitivity, with a trade-off in specificity and time (Fig. 2). PathSeq achieved 100% specificity (the ability to remove human reads), closely followed by READSCAN and RINS with 99.99% specificity (Supplementary Fig. S2). To benchmark the scalability of READSCAN with added compute power, the same dataset was run on 1, 2, 4, 8 and 16 compute nodes; READSCAN scaled linearly and completed the run in <27 min using 16 nodes (Supplementary Fig. S3).

Fig. 2.

Performance comparison of READSCAN, RINS and PathSeq on simulated dataset

4 CONCLUSIONS

READSCAN is a fast and accurate sequence search tool, available on a variety of clusters and workstations, designed to handle large next-generation sequencing datasets and to detect non-target or pathogenic sequences.

Funding: KAUST faculty funding to A.P. supports this work.

Conflict of Interest: none declared.

REFERENCES

1. Aparna Bhaduri; Kun Qu; Carolyn S Lee; Alexander Ungewickell; Paul A Khavari. Rapid identification of non-human sequences in high-throughput sequencing datasets. Bioinformatics, 2012-02-28.

2. Mauro Castellarin; René L Warren; J Douglas Freeman; Lisa Dreolini; Martin Krzywinski; Jaclyn Strauss; Rebecca Barnes; Peter Watson; Emma Allen-Vercoe; Richard A Moore; Robert A Holt. Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. Genome Res, 2011-10-18.

3. Aleksandar D Kostic; Akinyemi I Ojesina; Chandra Sekhar Pedamallu; Joonil Jung; Roel G W Verhaak; Gad Getz; Matthew Meyerson. PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat Biotechnol, 2011-05.

4. John R Prensner; Matthew K Iyer; O Alejandro Balbin; Saravana M Dhanasekaran; Qi Cao; J Chad Brenner; Bharathi Laxman; Irfan A Asangani; Catherine S Grasso; Hal D Kominsky; Xuhong Cao; Xiaojun Jing; Xiaoju Wang; Javed Siddiqui; John T Wei; Daniel Robinson; Hari K Iyer; Nallasivam Palanisamy; Christopher A Maher; Arul M Chinnaiyan. Transcriptome sequencing across a prostate cancer cohort identifies PCAT-1, an unannotated lincRNA implicated in disease progression. Nat Biotechnol, 2011-07-31.

5. Li C Xia; Jacob A Cram; Ting Chen; Jed A Fuhrman; Fengzhu Sun. Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS One, 2011-12-06.

6. John Martin; Sean Sykes; Sarah Young; Karthik Kota; Ravi Sanka; Nihar Sheth; Joshua Orvis; Erica Sodergren; Zhengyuan Wang; George M Weinstock; Makedonka Mitreva. Optimizing read mapping to reference genomes to determine composition and species prevalence in microbial communities. PLoS One, 2012-06-13.

7. Ivan Borozan; Shane Wilson; Paola Blanchette; Philippe Laflamme; Stuart N Watt; Paul M Krzyzanowski; Fabrice Sircoulomb; Robert Rottapel; Philip E Branton; Vincent Ferretti. CaPSID: a bioinformatics platform for computational pathogen sequence identification in human genomes and transcriptomes. BMC Bioinformatics, 2012-08-17.
