Infernal 1.1: 100-fold faster RNA homology searches.

Eric P Nawrocki, Sean R Eddy.

Abstract

SUMMARY: Infernal builds probabilistic profiles of the sequence and secondary structure of an RNA family called covariance models (CMs) from structurally annotated multiple sequence alignments given as input. Infernal uses CMs to search for new family members in sequence databases and to create potentially large multiple sequence alignments. Version 1.1 of Infernal introduces a new filter pipeline for RNA homology search based on accelerated profile hidden Markov model (HMM) methods and HMM-banded CM alignment methods. This enables ∼100-fold acceleration over the previous version and ∼10 000-fold acceleration over exhaustive non-filtered CM searches. AVAILABILITY: Source code, documentation and the benchmark are downloadable from http://infernal.janelia.org. Infernal is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X. Documentation includes a user's guide with a tutorial, a discussion of file formats and user options and additional details on methods implemented in the software. CONTACT: nawrockie@janelia.hhmi.org

Year:  2013        PMID: 24008419      PMCID: PMC3810854          DOI: 10.1093/bioinformatics/btt509

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Many structural RNAs conserve their sequence and secondary structure, and the most effective RNA homology search and alignment tools incorporate both types of conservation into their scoring systems. Covariance models (CMs) are profile stochastic context-free grammars (Durbin et al.), probabilistic models of the conserved sequence and secondary structure of an RNA family, analogous to sequence-based profile hidden Markov models (HMMs) commonly used for protein sequence analysis, with added complexity necessary for modeling RNA secondary structure. Infernal implements methods for constructing CMs from input structurally annotated RNA alignments or single sequences and for using those models to search for and align homologous RNAs. Compared with the previous version 1.0.2, Infernal 1.1 accelerates typical RNA homology searches ∼100-fold using a filter pipeline based on accelerated profile HMM methods [the HMMER3 project (Eddy, 2008, 2011)] and constrained CM alignment algorithms (Brown, 2000; Nawrocki, 2009). The increased speed comes at a negligible cost to sensitivity (Fig. 1). Additionally, version 1.1 implements specialized algorithms for structural alignment of truncated RNA sequences (Kolbe and Eddy, 2009) commonly found in sequencing reads, which were prone to misalignment in previous versions.
Fig. 1.

ROC-like curves for the benchmark. Plots are shown for the new Infernal 1.1 with and without filters, for the old Infernal 1.0.2, for profile HMM searches with nhmmer (from the HMMER package included in Infernal 1.1, default parameters) and for family-pairwise-searches with BLASTN (ncbi-blast-2.2.28+, default parameters). The maximum sensitivity (not shown) for default Infernal 1.1 is 0.81 (629 of 780 true positives found), which is achieved at a false-positive rate of 0.19/Mb/query. For non-filtered Infernal, maximum sensitivity is 0.87 at 2.9 false positives per Mb per query. This indicates that at high false-positive rates the filters prevent some true positives from being found, but prevent many more false positives from being found. CPU times are total times for all 106 family searches measured for single execution threads on 3.0 GHz Intel Xeon processors. The Infernal times do not include time required for model calibration.

2 APPROACH

Exhaustive dynamic programming (DP) CM algorithms are impractically slow (Fig. 1). Several types of sequence-based filters have been developed for acceleration, including a BLAST-based filtering scheme used by Rfam since its inception (Griffiths-Jones et al.) and several profile HMM-based methods (Weinberg and Ruzzo, 2004, 2006). Infernal version 1.0.2 and version 1.1 both use profile HMM filters: version 1.0.2’s filters are derived from the HMMER2 package (Eddy, 2003), whereas version 1.1 co-opts HMMER3’s dramatically accelerated search algorithms, which take advantage of single-instruction multiple-data vector instructions to parallelize the core steps of the HMM DP algorithms (Eddy, 2011). Version 1.1 uses four separate profile HMM-based filter stages, each one successively slower and stricter than the previous stage. The new filter stages are sufficiently fast that the post-HMM-filtering CM DP algorithms as implemented in the previous version (1.0.2) became the clear computational bottleneck. To accelerate these, constraints, or bands, derived from an HMM alignment of the sequence are imposed on the DP matrices to significantly reduce the number of required calculations (Brown, 2000; Nawrocki, 2009). Both the new filters and the banded CM methods are vital for the improved search speed. In the benchmark described later in the text, for default Infernal searches, the profile HMM stages take about one-third of the total running time and the remaining time is spent on the subsequent CM DP calculations.
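The idea of a filter cascade, in which cheap stages discard most candidates before expensive stages run, can be sketched in a few lines of Python. This is a toy illustration only, not Infernal's implementation: the stage names, the scoring functions (`gc_fraction`, `stem_score`) and the thresholds are hypothetical stand-ins for the profile HMM and CM stages described above.

```python
from typing import Callable, List, Sequence, Tuple

# A stage is (name, scoring function, minimum score needed to survive).
# Every stage defined below is a hypothetical toy, not an Infernal filter.
Stage = Tuple[str, Callable[[str], float], float]

# Canonical and G-U wobble base pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def gc_fraction(seq: str) -> float:
    """Cheap sequence-only score: fraction of G and C residues."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def stem_score(seq: str, k: int = 3) -> float:
    """Costlier toy 'structure' score: how many of the k outermost
    positions could base-pair with their mirror position."""
    return float(sum((seq[i], seq[-1 - i]) in PAIRS for i in range(k)))


def run_filter_cascade(
    candidates: Sequence[str], stages: List[Stage]
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Apply stages in order; each stage only scores the survivors of the
    previous one. Returns the final survivors and, per stage, how many
    candidates that stage had to score."""
    survivors = list(candidates)
    workload = []
    for name, score, threshold in stages:
        workload.append((name, len(survivors)))
        survivors = [c for c in survivors if score(c) >= threshold]
    return survivors, workload


stages = [("composition", gc_fraction, 0.5), ("stem", stem_score, 3.0)]
candidates = ["GGCAAAGCC", "AAAAAAAAA", "GGGAAAGGG", "GCGUUUCGC"]
hits, workload = run_filter_cascade(candidates, stages)
print(hits)
print(workload)
```

The point of the design is visible in `workload`: the more expensive stage is evaluated on fewer candidates than the cheap one, which is why ordering stages from fastest to slowest pays off.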

3 USAGE

There are two major applications of Infernal: to search for structural RNAs in a sequence dataset (e.g. to perform genome annotation of RNAs) and to create multiple sequence- and structure-based alignments of RNA homologs [e.g. 16S small subunit ribosomal RNA alignment for environmental survey studies (Cole et al.)]. Both applications begin with a CM file, which can either be downloaded from the Rfam database of >2000 RNA families (Burge et al.) or created by the user with Infernal’s cmbuild program from a structurally annotated single sequence or multiple sequence alignment. Before a CM can be used to search a sequence database, it must first be calibrated by the cmcalibrate program, which performs a simulated search against random sequence to determine model-specific parameters for assigning E-values to database hits. (Rfam CM files come pre-calibrated.) The cmsearch program takes a calibrated CM file, searches it against a sequence database and outputs a ranked list of top scoring hits and hit alignments. The cmalign program takes a CM file (calibrated or not), aligns all sequences to the model and outputs a structurally annotated multiple sequence alignment in Stockholm format. Version 1.1 introduces the cmscan program for determining whether a given sequence contains homologies to any known RNA families in a CM library like Rfam. Before running cmscan, the CM database must be converted to a special format using cmpress, which enables faster scanning.
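The order of the programs described above can be summarized as a small Python sketch that only assembles the command lines; it does not run Infernal. The file names are hypothetical, no options are shown, and the positional argument order reflects our reading of the user's guide, so it should be checked against each program's `-h` output before use.

```python
import shlex
from typing import List


def search_workflow(alignment: str, cm_file: str, seq_db: str) -> List[List[str]]:
    """Build a CM from an alignment, calibrate it, then search a database."""
    return [
        ["cmbuild", cm_file, alignment],  # write the new CM to cm_file
        ["cmcalibrate", cm_file],         # required before cmsearch can report E-values
        ["cmsearch", cm_file, seq_db],
    ]


def align_workflow(cm_file: str, seq_file: str) -> List[List[str]]:
    """Align sequences to a CM; calibration is not needed for cmalign."""
    return [["cmalign", cm_file, seq_file]]


def scan_workflow(cm_library: str, seq_file: str) -> List[List[str]]:
    """Prepare a CM library with cmpress, then scan a sequence against it."""
    return [
        ["cmpress", cm_library],
        ["cmscan", cm_library, seq_file],
    ]


# Hypothetical file names, for illustration only.
for cmd in search_workflow("family.sto", "family.cm", "genome.fa"):
    print(shlex.join(cmd))
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` on a machine where Infernal is installed; keeping the commands as lists avoids shell quoting problems with unusual file names.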

4 PERFORMANCE

An independent benchmark of RNA homology search (Freyhult et al.) found covariance model-based programs, including a previous version of Infernal, to be the most specific and sensitive of the tools tested. We present here results from an updated version of our previously published internal RMARK benchmark (Nawrocki et al.), mainly to indicate the relative performance of Infernal 1.1 and the previous version 1.0.2. The RMARK3 benchmark was constructed from the seed alignments of the Rfam 10.0 database as previously described (Nawrocki et al.). It is composed of a set of 106 families, each represented by a training alignment of ≥5 aligned sequences and a test set of ≥1 sequences. No two test sequences are >70% identical, and no train/test sequence pair is >60% identical. The 780 test sequences were embedded into ten 1 Mb genome-like sequences, to create a benchmark ‘pseudo-genome’ of 10.16 Mb. For each included family, a model was built from the training set using the Rfam alignment, calibrated and used to search the pseudo-genome. The resulting hits from all searches were then sorted by E-value and a sensitivity versus false-positive rate ROC-like curve was generated from the results (Fig. 1). Figure 1 shows that default Infernal 1.1 performs the benchmark searches in 0.44 h and is ∼100 times faster than the previous version 1.0.2 (49.31 h) and ∼10 000 times faster than exhaustive non-filtered 1.1 search (4359 h); yet all three search methods have similar sensitivity at the low false-positive rates necessary for large database searches. We also tested two sequence-only methods: profile HMMs implemented in HMMER3 (Eddy, 2008, 2011) and family-pairwise (Grundy, 1998) single-sequence BLASTN queries (Altschul et al.), which were faster (0.02 and 0.01 h, respectively), but significantly less sensitive than CMs, indicating the benefit of secondary structure modeling.

The relatively fast speed of default version 1.1 on the benchmark is maintained on real genomic sequences. The average speed is 1.5 s/Mb/query on the benchmark and 0.6 s/Mb/query on a several gigabase database that includes a sampling of 15 genomes (five each of archaea, bacteria and eukarya) using the same query models from the benchmark. As database size increases, Infernal increases filter stringency, resulting in faster search rates without sacrificing appreciable sensitivity at low false-positive rates, based on further RMARK benchmarking (results not shown).

Infernal is now a more practical tool for RNA homology search. The increased speed should enable its incorporation into automated sequence annotation pipelines and obviate the need for additional filtering schemes for large-scale CM searches, such as the BLAST-based filter paradigm used by Rfam (Griffiths-Jones et al.). Rfam-based annotation of one typical bacterial or archaeal genome (i.e. searching all 2208 Rfam 11.0 models against a 2–5 Mb target) now takes ∼1 h on a single quad-core desktop computer. Analysis of larger datasets, however, such as vertebrate genomes or all reads from a high-throughput sequencing run, still requires a compute cluster. As an example, a search of all Rfam models against the 1 Gb chicken genome would require ∼3 h on a 100-CPU compute cluster. The most expensive programs (cmalign, cmcalibrate, cmscan and cmsearch) are implemented for use with multiple threads on multi-core machines and in coarse-grained MPI versions for clusters.
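The timing figures quoted in this section can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text; the helper names are ours, and the cluster estimate assumes perfect parallel scaling and 1 Gb = 1000 Mb, which are simplifications rather than claims from the benchmark.

```python
# Total CPU hours for the 106-family benchmark, as quoted in the text.
BENCH_HOURS = {
    "infernal_1.1_default": 0.44,
    "infernal_1.0.2": 49.31,
    "infernal_1.1_nonfiltered": 4359.0,
}
PSEUDO_GENOME_MB = 10.16
N_QUERIES = 106


def speedup(slow_hours: float, fast_hours: float) -> float:
    """How many times faster the fast method is."""
    return slow_hours / fast_hours


def seconds_per_mb_per_query(hours: float, mb: float, n_queries: int) -> float:
    """Average search rate normalized by database size and query count."""
    return hours * 3600.0 / (mb * n_queries)


def cluster_hours(n_models: int, target_mb: float, rate: float, n_cpus: int) -> float:
    """Wall-clock hours assuming ideal scaling across n_cpus (an assumption)."""
    return n_models * target_mb * rate / n_cpus / 3600.0


fast = BENCH_HOURS["infernal_1.1_default"]
print(round(speedup(BENCH_HOURS["infernal_1.0.2"], fast)))             # about 100-fold
print(round(speedup(BENCH_HOURS["infernal_1.1_nonfiltered"], fast)))   # about 10 000-fold
print(round(seconds_per_mb_per_query(fast, PSEUDO_GENOME_MB, N_QUERIES), 2))
print(round(cluster_hours(2208, 1000.0, 0.6, 100), 1))
```

Under these assumptions the benchmark rate comes out close to the quoted 1.5 s/Mb/query, and the chicken genome estimate lands at a few hours on 100 CPUs, the same range as the figure given above.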
REFERENCES (10 of 13 listed)

1. M P Brown. Small subunit ribosomal RNA modeling using stochastic context-free grammars. Proc Int Conf Intell Syst Mol Biol, 2000.
2. Sam Griffiths-Jones; Alex Bateman; Mhairi Marshall; Ajay Khanna; Sean R Eddy. Rfam: an RNA family database. Nucleic Acids Res, 2003-01-01.
3. Zasha Weinberg; Walter L Ruzzo. Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics, 2004-08-04.
4. Eva K Freyhult; Jonathan P Bollback; Paul P Gardner. Exploring genomic dark matter: a critical assessment of the performance of homology search methods on noncoding RNA. Genome Res, 2006-12-06.
5. W N Grundy. Homology detection via family pairwise search. J Comput Biol, 1998.
6. S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997-09-01.
7. Sean R Eddy. Accelerated Profile HMM Searches. PLoS Comput Biol, 2011-10-20.
8. J R Cole; Q Wang; E Cardenas; J Fish; B Chai; R J Farris; A S Kulam-Syed-Mohideen; D M McGarrell; T Marsh; G M Garrity; J M Tiedje. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res, 2008-11-12.
9. Sarah W Burge; Jennifer Daub; Ruth Eberhardt; John Tate; Lars Barquist; Eric P Nawrocki; Sean R Eddy; Paul P Gardner; Alex Bateman. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res, 2012-11-03.
10. Sean R Eddy. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol, 2008-05-30.