| Literature DB >> 23842809 |
Travis J Wheeler1, Sean R Eddy.
Abstract
SUMMARY: Sequence database searches are an essential part of molecular biology, providing information about the function and evolutionary history of proteins, RNA molecules and DNA sequence elements. We present a tool for DNA/DNA sequence comparison that is built on the HMMER framework, which applies probabilistic inference methods based on hidden Markov models to the problem of homology search. This tool, called nhmmer, enables improved detection of remote DNA homologs, and has been used in combination with Dfam and RepeatMasker to improve annotation of transposable elements in the human genome. AVAILABILITY: nhmmer is a part of the new HMMER3.1 release. Source code and documentation can be downloaded from http://hmmer.org. HMMER3.1 is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X.Entities:
Mesh:
Substances:
Year: 2013 PMID: 23842809 PMCID: PMC3777106 DOI: 10.1093/bioinformatics/btt403
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1.Benchmark of search sensitivity and specificity. Searches were performed against the Rmark3 benchmark either by constructing a single profile HMM from the query alignment (nhmmer profile), constructing a consensus sequence from the query alignment (consensus), or by using family pairwise search (fpw). The aborted lines for two nhmmer variants indicate that the list of all hits found by each search variant was exhausted before reaching 1 false positive per Mb per search. The nhmmer parameters were default, except setting the E-value threshold, ‘-E 100’ for profile and consensus variants, to extend the hit list. Higher E-values have no effect, as further hits (true and false) are filtered by the default acceleration heuristics. Many parameters were tested for NCBI blastn 2.2.28+, with the best-performing variant shown here (‘-word_size 7 -penalty -3 -reward 2 -gapopen 4 -gapextend 2’). For each combination of program and method, hits for all families were collected and ranked by E-value, and true and false hits were defined as described in the text. The Y-axis is the fraction of 780 true positives detected with an E-value sufficient to achieve the false-positive rate specified on the X-axis. Runtime was collected on a single thread on a 2.66 GHz Intel Gainestown (X5550) processor. The benchmark can be downloaded from http://selab.janelia.org/publications.html