nhmmer: DNA homology search with profile HMMs.

Abstract

SUMMARY: Sequence database searches are an essential part of molecular biology, providing information about the function and evolutionary history of proteins, RNA molecules and DNA sequence elements. We present a tool for DNA/DNA sequence comparison that is built on the HMMER framework, which applies probabilistic inference methods based on hidden Markov models to the problem of homology search. This tool, called nhmmer, enables improved detection of remote DNA homologs, and has been used in combination with Dfam and RepeatMasker to improve annotation of transposable elements in the human genome. AVAILABILITY: nhmmer is a part of the new HMMER3.1 release. Source code and documentation can be downloaded from http://hmmer.org. HMMER3.1 is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X.

Year: 2013; PMID: 23842809; PMCID: PMC3777106; DOI: 10.1093/bioinformatics/btt403

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 INTRODUCTION

A widely used general purpose tool for DNA/DNA sequence comparison is blastn (Altschul ; Camacho ), which heuristically approximates the Smith–Waterman algorithm (Smith and Waterman, 1981) for recognizing local regions of similarity between two sequences. In recent years, most advances in DNA/DNA comparison have related to accelerating search for near-exact matches (Kent, 2002; Langmead ; Li and Durbin, 2009), and to improving whole-genome alignment (Kurtz ; Schwartz ). Another area that deserves attention is the development of methods that maximize the power of computational sequence comparison tools to detect remote homologies. Profile hidden Markov models (profile HMMs) (Durbin ; Krogh ) represent an important advance in terms of sensitivity of sequence searches for remote homology. They provide a formal probabilistic framework for sequence comparison and improve detection of remote homologs by (i) enabling position-specific residue and gap scoring based on a query profile, and (ii) calculating the signal of homology based on the more powerful ‘Forward/Backward’ HMM algorithm that computes not just one best-scoring alignment, but a sum of support over all possible alignments. In the past, this improved sensitivity came at a significant computational cost, but recent advances in HMMER3 have increased speed for protein search by ∼100-fold, reaching blastp-like speed through a combination of filtering heuristics (Eddy, 2008) and computer engineering (Eddy, 2011; Farrar, 2007). Tools based on profile HMMs (Eddy, 2009; Karplus ) have historically focused on protein search, with little concentration on the challenges presented by (i) chromosome-length target sequences, and (ii) the extreme composition bias often seen in genomic DNA. With attention to the details of DNA search, nhmmer builds upon the speed advances of HMMER3, bringing the power of profile HMMs to DNA homology search, at speeds nearly as fast as blastn with sensitive settings. 
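The difference between a single best-scoring alignment and a sum over all alignments can be illustrated with a small sketch. It uses a generic two-state HMM rather than a profile HMM, and the states, probabilities and function names are illustrative assumptions, not HMMER's model or code. The same recurrence gives a Viterbi score when paths are combined with a maximum and a Forward score when they are combined with a log-sum.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def score(seq, start, trans, emit, combine):
    """Log-space HMM recurrence.

    combine=max gives the Viterbi score (single best path);
    combine=logsumexp gives the Forward score (sum over all paths).
    """
    states = list(start)
    prev = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    for x in seq[1:]:
        prev = {
            t: math.log(emit[t][x])
            + combine([prev[s] + math.log(trans[s][t]) for s in states])
            for t in states
        }
    return combine(list(prev.values()))

# Hypothetical toy model: 'H' favours A/T, 'B' is uniform background.
START = {"H": 0.5, "B": 0.5}
TRANS = {"H": {"H": 0.9, "B": 0.1}, "B": {"H": 0.1, "B": 0.9}}
EMIT = {
    "H": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

viterbi = score("ATTAGC", START, TRANS, EMIT, max)
forward = score("ATTAGC", START, TRANS, EMIT, logsumexp)
print(f"Viterbi log-prob: {viterbi:.3f}  Forward log-prob: {forward:.3f}")
```

Because the Forward score adds the probability of every path, it is never lower than the Viterbi score for the same sequence and model.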
An example of a biological problem requiring sensitive detection of remote DNA homologs is the annotation of genomic sequence derived from ancient transposable element (TE) expansions. A prerelease version of the nhmmer tools has recently been shown to provide increased sensitivity over blastn and other single-sequence search methods, with reduced false discovery rate and reasonable runtime, in searching for TEs (Wheeler ). For example, when nhmmer was used within the recently released RepeatMasker 4.0 (Smit and Hubley, 2013), an additional 150 Mb (5%) of the human genome was reliably annotated as derived from TEs.

2 USAGE AND PERFORMANCE

Usage. The program nhmmer is used to search one or more nucleotide queries against a nucleotide sequence database. For each query, nhmmer searches the target database and outputs a ranked list of the hits with the most significant matches to the query. A query may consist of a single sequence, a multiple sequence alignment, or a profile HMM built using the HMMER program hmmbuild. Each hit represents a region of local similarity between a portion of the query and a subsequence of the full target database sequence, and is assigned a similarity score S in bits, along with an E-value (Eddy, 2008) indicating the expected number of false positives at a threshold of score S. Each hit is also accompanied by an alignment of the matched sequence to the model, with values indicating the confidence with which each position is aligned. The final score, boundaries and alignment of a hit are computed based on filling in a Forward/Backward dynamic programming matrix, but the computational burden of doing this for the full target database is prohibitive. Therefore, nhmmer uses a series of acceleration filters that depend on simpler approximations of the final Forward score of a hit. These filters are based on those used in the HMMER3 protein search tools (Eddy, 2011), but have been modified to work in the context of long (potentially chromosome length) target sequences. The initial filter, called ‘single segment ungapped Viterbi’, scans along the target sequence with a fast ungapped Viterbi alignment using a reduced-precision, 16-way vector-parallel approach (Farrar, 2007). Windows around high-scoring ungapped alignments are subjected to a full-gapped Viterbi alignment to the model. Candidate alignments passing this filter then endure the full rigor of a Forward/Backward alignment to the query, including application of a context-dependent null model to account for composition bias shared by the query and target. 
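The idea behind the first filter stage can be sketched in plain Python, under simplifying assumptions. The sketch finds the single best ungapped local segment between a position-specific score table and a target sequence by scanning each diagonal. The score values, the `consensus_profile` helper and the function names are hypothetical; the actual filter uses reduced-precision vector-parallel code and calibrated score thresholds, which are not reproduced here.

```python
def consensus_profile(seq, match=2.0, mismatch=-3.0):
    """Build a toy position-specific score table from one sequence (assumed scores)."""
    return [{r: (match if r == c else mismatch) for r in "ACGT"} for c in seq]

def best_ungapped_segment(profile, target):
    """Return (score, query_start, target_start, length) of the best ungapped local match.

    profile: list of dicts, one per model position, mapping residue -> score.
    Returns (0.0, 0, 0, 0) when no positive-scoring segment exists.
    """
    best = (0.0, 0, 0, 0)
    m, n = len(profile), len(target)
    # Each diagonal d fixes the offset between target and query positions.
    for d in range(-(m - 1), n):
        run, run_start = 0.0, None
        q = max(0, -d)
        t = q + d
        while q < m and t < n:
            s = profile[q][target[t]]
            if run <= 0.0:
                # A non-positive running score cannot help; restart the segment here.
                run, run_start = s, (q, t)
            else:
                run += s
            if run > best[0]:
                best = (run, run_start[0], run_start[1], q - run_start[0] + 1)
            q += 1
            t += 1
    return best

profile = consensus_profile("ACGT")
print(best_ungapped_segment(profile, "TTTTACGTTTTT"))
```

In a filter cascade, only a window around a segment whose score clears a threshold would be passed to the more expensive gapped stages.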
For more details on the full acceleration pipeline, see Eddy and Wheeler (2013).

Performance. In Figure 1 we consider the performance of nhmmer on a benchmark called Rmark3 that has been used previously to test the RNA homology search tool Infernal (Nawrocki ). The benchmark consists of 106 families from Rfam that could be divided into two groups such that no sequence in one group is >60% identical to any sequence in the other group [Rfam 10.0, Gardner ]. One group was used as the query alignment for the family, and sequences from the other group (780 sequences in total) were embedded in 10 Mb of sequence simulated using a 15-state HMM trained on genomic sequence from a variety of organisms. A positive was defined as an embedded sequence with >50% length covered by a query from the same family; a negative was defined as any hit that mostly covers simulated sequence. For more details on construction of the benchmark, see Nawrocki and Eddy (2007).
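The positive and negative definitions above can be expressed as a short classification routine. This is a sketch under stated assumptions: intervals are half-open, embedded sequences do not overlap one another, "mostly" is read as more than half of the hit length, and the third "ignore" outcome for hits that fit neither definition is an assumption of this example rather than something stated in the text.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def classify_hit(hit, embedded):
    """Classify one hit against the embedded benchmark sequences.

    hit: (family, start, end)
    embedded: list of (family, start, end) for the embedded true sequences
    Returns 'true', 'false' or 'ignore'.
    """
    fam, hs, he = hit
    covered_real = 0
    for efam, es, ee in embedded:
        ov = overlap(hs, he, es, ee)
        covered_real += ov
        # Positive: more than 50% of an embedded same-family sequence is covered.
        if efam == fam and ov > 0.5 * (ee - es):
            return "true"
    # Negative: more than half of the hit lies on simulated sequence.
    if covered_real < 0.5 * (he - hs):
        return "false"
    return "ignore"

embedded = [("tRNA", 100, 200)]
for hit in [("tRNA", 120, 190), ("tRNA", 500, 600), ("SRP", 110, 190)]:
    print(hit, classify_hit(hit, embedded))
```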

Fig. 1.

Benchmark of search sensitivity and specificity. Searches were performed against the Rmark3 benchmark either by constructing a single profile HMM from the query alignment (nhmmer profile), constructing a consensus sequence from the query alignment (consensus), or by using family pairwise search (fpw). The aborted lines for two nhmmer variants indicate that the list of all hits found by each search variant was exhausted before reaching 1 false positive per Mb per search. The nhmmer parameters were default, except setting the E-value threshold, ‘-E 100’ for profile and consensus variants, to extend the hit list. Higher E-values have no effect, as further hits (true and false) are filtered by the default acceleration heuristics. Many parameters were tested for NCBI blastn 2.2.28+, with the best-performing variant shown here (‘-word_size 7 -penalty -3 -reward 2 -gapopen 4 -gapextend 2’). For each combination of program and method, hits for all families were collected and ranked by E-value, and true and false hits were defined as described in the text. The Y-axis is the fraction of 780 true positives detected with an E-value sufficient to achieve the false-positive rate specified on the X-axis. Runtime was collected on a single thread on a 2.66 GHz Intel Gainestown (X5550) processor. The benchmark can be downloaded from http://selab.janelia.org/publications.html
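The curve described in the caption can be computed from a ranked hit list roughly as follows. This is an illustrative sketch, not the benchmark's own scripts: the function name and input layout are assumptions, and it presumes that each true embedded sequence contributes at most one hit to the list.

```python
def sensitivity_curve(hits, n_true, total_mb, n_searches):
    """Walk down hits ranked by E-value and record the curve.

    hits: list of (evalue, is_true) pairs pooled over all families
    Returns a list of (false positives per Mb per search, fraction of true positives found).
    """
    curve = []
    tp = fp = 0
    for evalue, is_true in sorted(hits, key=lambda h: h[0]):
        if is_true:
            tp += 1
        else:
            fp += 1
        curve.append((fp / (total_mb * n_searches), tp / n_true))
    return curve

# Hypothetical pooled hits: three true hits and one false hit.
hits = [(1e-10, True), (1e-5, True), (0.1, False), (1.0, True)]
for x, y in sensitivity_curve(hits, n_true=4, total_mb=10, n_searches=1):
    print(f"FP/Mb/search = {x:.2f}  sensitivity = {y:.2f}")
```

A curve that stops before reaching a given false-positive rate, as noted for two nhmmer variants, corresponds to the hit list running out before enough false hits have accumulated.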

In this benchmark, we begin with an alignment of multiple members of a DNA sequence family and aim to find more instances of the family in the target sequence database. The standard methods for this homology search problem (e.g. using blastn) involve searching the target database with a single query sequence, either (i) producing a consensus sequence to represent the sequence family, then using the consensus as query to search the database, or (ii) using the family pairwise (fpw) search method, in which each individual sequence from the family alignment is used as a query, the hit lists are merged, and overlapping hits are adjudicated by recording the hit with the best E-value (Grundy, 1998). Using both of these single-sequence query approaches on Rmark3, nhmmer achieves better sensitivity than blastn. These single-sequence query methods do not, however, take full advantage of the information contained within the query alignment. In nhmmer, a profile HMM is built from the alignment, and represents the residue and indel distributions for each position, modeling the conservation patterns of the family in a way that is not possible with single-sequence queries. The benefits of profile search are two-fold: (i) search power is much greater than even with fpw, and (ii) search speed is roughly equivalent to that of searching with a single consensus sequence, as only one search is performed for the entire family, rather than one for each sequence in the query alignment as in fpw. In addition to being more sensitive than blastn, nhmmer represents a nearly 100-fold increase in speed over previous implementations of DNA homology search with profile HMMs. For example, using the seed alignment for Dfam entry DF0000789 (a 338 position-long DNA transposon) to search against the human genome with a single thread took nhmmer 12 min to complete, whereas HMMER 1.8.5 completed in 782 min and SAM 3.5 (Hughey and Krogh, 1995; Karplus ) required 844 min. 
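The merging step of the family pairwise approach can be sketched as a greedy selection. The hit tuple layout, the function name and the rule that any overlap on the same target counts as a clash are assumptions made for this example; the text specifies only that overlapping hits are resolved in favour of the best E-value.

```python
def merge_fpw_hits(hit_lists):
    """Merge per-query hit lists, keeping the best E-value among overlapping hits.

    hit_lists: iterable of lists of (target, start, end, evalue), half-open coordinates
    Returns the surviving hits ordered by increasing E-value.
    """
    all_hits = sorted((h for hits in hit_lists for h in hits), key=lambda h: h[3])
    kept = []
    for target, start, end, evalue in all_hits:
        # A hit is dropped if a better-scoring kept hit overlaps it on the same target.
        clash = any(t == target and start < e and s < end for t, s, e, _ in kept)
        if not clash:
            kept.append((target, start, end, evalue))
    return kept

per_query = [
    [("chr1", 100, 200, 1e-5)],
    [("chr1", 150, 250, 1e-9), ("chr1", 400, 500, 1e-3)],
]
print(merge_fpw_hits(per_query))
```

The sketch also makes the cost difference visible: fpw needs one search per family member before merging, whereas a profile search produces a single hit list.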
Other applications. HMMER3.1’s nhmmer has recently been adopted as a search engine within the TE annotation tool, RepeatMasker 4.0 (Smit and Hubley, 2013), where in conjunction with Dfam, it supports a substantial boost in sensitivity in human DNA repeat annotation with better speed than the previous most sensitive method (Wheeler ). The core pipeline of nhmmer has also been incorporated as an acceleration filter for the RNA homology search tool Infernal, where it supports fast filtering with negligible loss in Infernal sensitivity (E.Nawrocki and S.R.Eddy, unpublished data).
We anticipate that nhmmer will similarly benefit other domains of DNA sequence comparison that depend on discriminative detection of remote homologs.

18 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

3. Striped Smith-Waterman speeds database searches six times over other SIMD implementations.

Authors: Michael Farrar
Journal: Bioinformatics Date: 2006-11-16 Impact factor: 6.937

4. Hidden Markov models for detecting remote protein homologies.

Authors: K Karplus; C Barrett; R Hughey
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

5. Hidden Markov models in computational biology. Applications to protein modeling.

Authors: A Krogh; M Brown; I S Mian; K Sjölander; D Haussler
Journal: J Mol Biol Date: 1994-02-04 Impact factor: 5.469

6. Identification of common molecular subsequences.

Authors: T F Smith; M S Waterman
Journal: J Mol Biol Date: 1981-03-25 Impact factor: 5.469

7. Versatile and open software for comparing large genomes.

Authors: Stefan Kurtz; Adam Phillippy; Arthur L Delcher; Michael Smoot; Martin Shumway; Corina Antonescu; Steven L Salzberg
Journal: Genome Biol Date: 2004-01-30 Impact factor: 13.583

8. Dfam: a database of repetitive DNA based on profile hidden Markov models.

Authors: Travis J Wheeler; Jody Clements; Sean R Eddy; Robert Hubley; Thomas A Jones; Jerzy Jurka; Arian F A Smit; Robert D Finn
Journal: Nucleic Acids Res Date: 2012-11-30 Impact factor: 16.971

9. Query-dependent banding (QDB) for faster RNA similarity searches.

Authors: Eric P Nawrocki; Sean R Eddy
Journal: PLoS Comput Biol Date: 2007-02-07 Impact factor: 4.475

10. A probabilistic model of local sequence alignment that simplifies statistical significance estimation.

Authors: Sean R Eddy
Journal: PLoS Comput Biol Date: 2008-05-30 Impact factor: 4.475
