OrfM: a fast open reading frame predictor for metagenomic data.

Ben J Woodcroft¹, Joel A Boyd¹, Gene W Tyson¹.

Abstract

Finding and translating stretches of DNA lacking stop codons is a common task in the analysis of sequence data. However, the computational tools for finding open reading frames are sufficiently slow that they are becoming a bottleneck as the volume of sequence data grows. This bottleneck is especially problematic in metagenomics when searching unassembled reads or screening assembled contigs for genes of interest. Here, we present OrfM, a tool to rapidly identify open reading frames (ORFs) in sequence data by applying the Aho-Corasick algorithm to find regions uninterrupted by stop codons. Benchmarking revealed that OrfM finds identical ORFs to similar tools ('GetOrf' and 'Translate') but is four to five times faster. While OrfM is sequencing platform-agnostic, it is best suited to large, high quality datasets such as those produced by Illumina sequencers.
AVAILABILITY AND IMPLEMENTATION: Source code and binaries are freely available for download at http://github.com/wwood/OrfM or through GNU Guix under the LGPL 3+ license. OrfM is implemented in C and supported on GNU/Linux and OSX.
CONTACT: b.woodcroft@uq.edu.au
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2016 PMID: 27153669 PMCID: PMC5013905 DOI: 10.1093/bioinformatics/btw241

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

In genomics, stretches of DNA uninterrupted by stop codons are known as open reading frames (ORFs). The TAG (‘amber’), TAA (‘ochre’) and TGA (‘opal’) stop codons signal the ribosomal machinery to cease translation, with few exceptions. An extended stretch of DNA free of in-frame stop codons is evidence that a gene may be encoded in that region. ORF prediction in metagenomics can be performed on finished population genomes, draft population genomes, assembled contigs or individual reads. Searching for genes in individual metagenomic reads (‘gene-centric analysis’) is useful when reference genomes are unavailable and assembly of reads is either computationally prohibitive or unsuccessful because the microbial community is too complex (Howe and Chain, 2015). In long assembled sequences, conventional gene predictors use information such as codon usage to predict genes more accurately, but these signals become unreliable in the limited genomic context of short read data. In bacterial and archaeal genomes, genes are not interrupted by introns and intergenic space is minimal, so short read sequences derived from these genomes are more likely to encode a fragment of a gene uninterrupted by a stop codon.

ORF prediction directly on reads from early next-generation sequencing platforms (e.g. Roche 454) was difficult because those reads were prone to insertion and deletion (indel) errors. In contrast, newer Illumina-based sequencers generate higher quality reads in which indel errors are rare and the errors that do occur are chiefly substitutions (Jünemann et al., 2013). The current widespread use of Illumina sequencing in metagenomics (Bragg and Tyson, 2014) presents an opportunity to find ORFs in microbial reads directly. Identification of ORFs in short read data simplifies downstream comparative analysis and allows use of tools that require protein sequence as input, e.g. searching for protein families with HMMER (Camacho ).

Using ORFs instead of six-frame translations as input to downstream sequence comparison tools, e.g. BLAST (Camacho et al., 2009), reduces the impact of multiple hypothesis testing, so results may be more significant. While finding ORFs in short read data provides advantages over gene prediction and six-frame translation, current ORF finders do not scale to the size of modern metagenomes, which can exceed 500 Gb (e.g. He et al., 2015). Here, we present OrfM, a tool to rapidly identify ORFs in metagenomic datasets.

2 Inputs and outputs of OrfM

OrfM takes FASTA or FASTQ sequences (gzip-compressed or uncompressed) as input, and can accept other formats if they are converted to FASTA and streamed via the UNIX STDIN pipe. Input parsing is handled by kseq.h (http://lh3lh3.users.sourceforge.net/kseq.shtml). By default, the minimum ORF length reported by OrfM is 96 bp (32 amino acids). This threshold was driven by the current prevalence of 100 bp Illumina HiSeq reads: 96 bp is the largest cutoff at which an ORF can still be reported from each of the six reading frames of a 100 bp read. All ORFs at least as long as the threshold are reported, even if they overlap. As well as the standard translation table, OrfM can use the 18 alternative translation tables. OrfM outputs amino acid FASTA sequences whose headers match those of the input sequences, with a string ‘_X_Y_Z’ appended to the first word, where X is the start position, Y is the frame number and Z is the ORF number. This naming scheme allows ORFs to be located in the original sequence and ensures that ORF names are unique. OrfM can also output the corresponding nucleotide sequences of the ORFs, if desired.
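The reasoning behind the 96 bp default and the ‘_X_Y_Z’ naming scheme can be illustrated with a short Python sketch. The function names are hypothetical and are not part of OrfM. The header parser assumes that the three appended fields are integers and that the header is passed without the leading ‘>’; the text above does not state whether positions and frames are counted from 0 or 1, so the values are returned as written.

```python
def max_orf_length(read_length, frame_offset):
    """Longest run of whole codons (in bp) readable from a given frame offset."""
    return (read_length - frame_offset) // 3 * 3


def parse_orf_header(header):
    """Split a header of the form '<name>_X_Y_Z [description]' into its parts.

    Hypothetical helper: assumes X, Y and Z are integers appended to the
    first word, as described in the text.
    """
    first_word = header.split()[0]
    # rsplit keeps any underscores that belong to the original read name.
    name, start, frame, orf_number = first_word.rsplit("_", 3)
    return name, int(start), int(frame), int(orf_number)


# A 100 bp read offers 99, 99 and 96 bp of whole codons in its three forward
# frames; the reverse strand gives the same three values by symmetry.
lengths = [max_orf_length(100, offset) for offset in range(3)]
print(lengths)       # [99, 99, 96]
print(min(lengths))  # 96
```

Under these assumptions, any cutoff above 96 bp would make it impossible to report an ORF from at least one frame of a 100 bp read, which matches the stated rationale for the default.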

3 Algorithm

In contrast with previous methods which first translate the input sequence into 6 frames and then scan through these translated strings looking for stop codons, OrfM identifies stop codons in nucleotide sequences directly, using an Aho–Corasick search dictionary (Aho and Corasick, 1975). Further details can be found in Supplementary Text S1.
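As an illustration of the general idea, the following Python sketch builds a small Aho-Corasick automaton over the three stop codons and their reverse complements, then reports every stop in a single pass over the nucleotide sequence. It is not OrfM's implementation (which is written in C and described in Supplementary Text S1). The function names, the use of position modulo 3 as a frame label, and the assumption of upper-case input are choices made only for this example.

```python
from collections import deque

FORWARD_STOPS = ("TAA", "TAG", "TGA")
# Reverse complements of the stop codons mark stops on the reverse strand.
REVERSE_STOPS = ("TTA", "CTA", "TCA")


def build_automaton(patterns):
    """Build goto table, failure links and outputs for Aho-Corasick matching."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first traversal sets failure links one trie level at a time.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_stops(seq, automaton):
    """Return (start, codon) for every dictionary match, in one pass over seq."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(seq):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for codon in out[state]:
            hits.append((i - len(codon) + 1, codon))
    return hits


def stops_by_frame(seq, automaton):
    """Group stop positions by (strand, start % 3); the labels are illustrative."""
    frames = {}
    for start, codon in find_stops(seq, automaton):
        strand = "+" if codon in FORWARD_STOPS else "-"
        frames.setdefault((strand, start % 3), []).append(start)
    return frames


automaton = build_automaton(FORWARD_STOPS + REVERSE_STOPS)
print(find_stops("ATGTAAGGCTAGTTTTGA", automaton))
# [(3, 'TAA'), (8, 'CTA'), (9, 'TAG'), (15, 'TGA')]
```

Once stops are grouped per frame, the stretches between consecutive stops in the same frame are the candidate ORFs, so no six-frame translation is needed before the length filter is applied.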

4 Benchmarking

OrfM was compared (Supplementary Text S2) with ‘GetOrf’ from the EMBOSS suite (Tringe ) (version 6.6.0) and the ‘Translate’ tool from the biosquid package version 1.9g+cvs20050121 (Eddy, unpublished http://eddylab.org/software.html). The tools were compared using three public datasets on a single core of a 20 core 2.3 GHz Intel Xeon E5-2650 running Linux 3.2.0. The benchmark datasets were (i) the forward 100 bp reads of a HiSeq 2000 metagenome (5.5 Gb) (Shakya et al., 2013) in gzip-compressed FASTQ format, (ii) the same reads transformed into uncompressed FASTA format and (iii) a collection of 1000 microbial genomes selected randomly from the Integrated Microbial Genomes (IMG) 4.1 database (Markowitz et al., 2012) in FASTA format (Supplementary Table S1). For the compressed FASTQ benchmark, reads were converted to FASTA and streamed into GetOrf using the UNIX STDIN pipe (using gzip for decompression and awk for conversion to FASTA). Translate does not accept streamed sequences, so it was not run on the compressed FASTQ benchmark. Translate was run with a minimum ORF size of 32 (-l 31), and GetOrf with a minimum nucleotide size of 96 (-minsize 96), in order to constrain the minimum output ORF length to the default cutoff of OrfM. In all cases OrfM was the fastest, taking 20% and 21% of the time required by Translate and GetOrf, respectively (Fig. 1). The sets of ORFs produced by the three methods were identical when reads containing ambiguous nucleotides were omitted from the comparison.
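The streaming conversion step used for the compressed FASTQ benchmark can be sketched in Python. This is an analogue of the gzip and awk pipeline mentioned above, not the authors' actual command (which is given in Supplementary Text S2). The function names are hypothetical, and the sketch assumes strict four-line FASTQ records.

```python
import gzip
import io


def fastq_to_fasta(lines):
    """Yield FASTA lines from an iterable of four-line FASTQ records."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            # Header line: replace the leading '@' with '>'.
            yield ">" + line[1:]
        elif i % 4 == 1:
            # Sequence line: pass through unchanged.
            yield line
        # The '+' separator and quality lines are dropped.


def convert_gzipped_fastq(stream):
    """Decompress a gzip stream on the fly and yield FASTA lines."""
    with gzip.open(stream, "rt") as handle:
        yield from fastq_to_fasta(handle)


# Self-contained demonstration using an in-memory gzip stream.
demo = gzip.compress(b"@r1 desc\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
for fasta_line in convert_gzipped_fastq(io.BytesIO(demo)):
    print(fasta_line)
```

Because the records are yielded one line at a time, the output can be written to standard output and piped into any tool that reads FASTA from STDIN without holding the dataset in memory.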

Fig. 1.

Time taken (wall time) by each program for the benchmark datasets. GetOrf and Translate take significantly more time than OrfM to call ORFs. Translate cannot run on compressed reads, so its wall time was not measured for the first dataset. Error bars indicate the standard error of the mean among triplicate runs.

8 in total

1. Comparative metagenomics of microbial communities.

Authors: Susannah Green Tringe; Christian von Mering; Arthur Kobayashi; Asaf A Salamov; Kevin Chen; Hwai W Chang; Mircea Podar; Jay M Short; Eric J Mathur; John C Detter; Peer Bork; Philip Hugenholtz; Edward M Rubin
Journal: Science Date: 2005-04-22 Impact factor: 47.728

2. Updating benchtop sequencing performance comparison.

Authors: Sebastian Jünemann; Fritz Joachim Sedlazeck; Karola Prior; Andreas Albersmeier; Uwe John; Jörn Kalinowski; Alexander Mellmann; Alexander Goesmann; Arndt von Haeseler; Jens Stoye; Dag Harmsen
Journal: Nat Biotechnol Date: 2013-04 Impact factor: 54.908

3. Metagenomics using next-generation sequencing.

Authors: Lauren Bragg; Gene W Tyson
Journal: Methods Mol Biol Date: 2014

4. Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities.

Authors: Migun Shakya; Christopher Quince; James H Campbell; Zamin K Yang; Christopher W Schadt; Mircea Podar
Journal: Environ Microbiol Date: 2013-02-06 Impact factor: 5.491

5. BLAST+: architecture and applications.

Authors: Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden
Journal: BMC Bioinformatics Date: 2009-12-15 Impact factor: 3.169

6. IMG: the Integrated Microbial Genomes database and comparative analysis system.

Authors: Victor M Markowitz; I-Min A Chen; Krishna Palaniappan; Ken Chu; Ernest Szeto; Yuri Grechkin; Anna Ratner; Biju Jacob; Jinghua Huang; Peter Williams; Marcel Huntemann; Iain Anderson; Konstantinos Mavromatis; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2012-01 Impact factor: 16.971

7. Patterns in wetland microbial community composition and functional gene repertoire associated with methane emissions.

Authors: Shaomei He; Stephanie A Malfatti; Jack W McFarland; Frank E Anderson; Amrita Pati; Marcel Huntemann; Julien Tremblay; Tijana Glavina del Rio; Mark P Waldrop; Lisamarie Windham-Myers; Susannah G Tringe
Journal: MBio Date: 2015-05-19 Impact factor: 7.867

8. Challenges and opportunities in understanding microbial communities with metagenome assembly (accompanied by IPython Notebook tutorial).

Authors: Adina Howe; Patrick S G Chain
Journal: Front Microbiol Date: 2015-07-09 Impact factor: 5.640
