Tin Yin Lam, Irmtraud M. Meyer.
Abstract
Hidden Markov models (HMMs) and their variants are widely used in Bioinformatics applications that analyze and compare biological sequences. Designing a novel application requires the insight of a human expert to define the model's architecture. The implementation of prediction algorithms and algorithms to train the model's parameters, however, can be a time-consuming and error-prone task. We here present HMMConverter, a software package for setting up probabilistic HMMs, pair-HMMs as well as generalized HMMs and pair-HMMs. The user defines the model itself and the algorithms to be used via an XML file which is then directly translated into efficient C++ code. The software package provides linear-memory prediction algorithms, such as the Hirschberg algorithm, banding and the integration of prior probabilities, and is the first to present computationally efficient linear-memory algorithms for automatic parameter training. Users of HMMConverter can thus set up complex applications with a minimum of effort and also perform parameter training and data analyses for large data sets.
Year: 2009 PMID: 19740770 PMCID: PMC2790874 DOI: 10.1093/nar/gkp662
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Example of an XML model definition. This example shows how the model of the dishonest casino (19), its states and transitions are defined in HMMConverter using XML. The emission probabilities are listed in the included flat-text file emission_dishonest_hmm2.txt. This is one of the examples included in the HMMConverter software package. The HMMConverter manual explains in detail how to define a variety of models, how to invoke parameter training algorithms and how to define different types of sequence analysis.
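As a concrete illustration of the kind of model Figure 1 defines, the following is a minimal, self-contained C++ sketch of the dishonest casino HMM together with Viterbi decoding. The transition and emission values are the usual textbook ones for this example, and the uniform 0.5 start probabilities are an assumption; this is a hand-written illustration, not the C++ code that HMMConverter generates from the XML file.

```cpp
#include <array>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

// Dishonest casino: state 0 = Fair, state 1 = Loaded, emitting die
// faces 1..6 (stored as 0..5). Probabilities are the textbook values;
// start probabilities of 0.5 each are an assumption for this sketch.
int main() {
    const int M = 2;
    const double trans[2][2] = {{0.95, 0.05},   // Fair  -> Fair/Loaded
                                {0.10, 0.90}};  // Loaded -> Fair/Loaded
    double emit[2][6];
    for (int f = 0; f < 6; ++f) emit[0][f] = 1.0 / 6.0;  // fair die
    for (int f = 0; f < 5; ++f) emit[1][f] = 0.1;        // loaded die
    emit[1][5] = 0.5;                                    // loaded six

    std::vector<int> rolls = {0, 2, 5, 5, 5, 1, 5, 5, 3, 5};
    const int L = (int)rolls.size();

    // v[i][s]: log-probability of the best state path ending in s at i;
    // ptr[i][s]: the predecessor state that achieves it.
    std::vector<std::array<double, 2>> v(L);
    std::vector<std::array<int, 2>> ptr(L);
    for (int s = 0; s < M; ++s)
        v[0][s] = std::log(0.5) + std::log(emit[s][rolls[0]]);

    for (int i = 1; i < L; ++i)
        for (int s = 0; s < M; ++s) {
            double best = -1e300;
            int arg = 0;
            for (int t = 0; t < M; ++t) {
                double cand = v[i - 1][t] + std::log(trans[t][s]);
                if (cand > best) { best = cand; arg = t; }
            }
            v[i][s] = best + std::log(emit[s][rolls[i]]);
            ptr[i][s] = arg;
        }

    // Trace back the most probable state path (F = Fair, L = Loaded).
    int s = (v[L - 1][0] > v[L - 1][1]) ? 0 : 1;
    std::string path(L, '?');
    for (int i = L - 1; i >= 0; --i) {
        path[i] = (s == 0) ? 'F' : 'L';
        s = ptr[i][s];
    }
    std::cout << path << "\n";
}
```

Note that the Viterbi table v and traceback pointers ptr occupy the 𝒪(LM) memory quoted in the table below; the Hirschberg algorithm avoids storing them in full.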
Figure 2. Banding. Projection of the three-dimensional search space for a pair-HMM onto the plane spanned by the two input sequences. The tube inside the large rectangle significantly reduces the three-dimensional search space. The two narrow vertical strips in the left figure correspond to the amount of memory allocated by the first iteration of the Hirschberg algorithm. The tube can be either user defined (left) or derived from Blast matches (right). The two thick lines in the right figure correspond to the set of matches selected by the dynamic programming routine as the highest-scoring subset of mutually compatible Blast matches (the discarded Blast matches are not shown here). In this example, the radius is specified as 30 by the user.
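To make the geometry of the tube concrete, here is a small C++ sketch that computes, for a user-defined radius (30 in the figure's example), the interval of positions in sequence Y that the dynamic programming visits for each position of sequence X. This reconstructs only the user-defined variant (left panel); a tube derived from Blast matches (right panel) would interpolate between the selected anchor matches instead. The function name band and its signature are hypothetical, not HMMConverter's API.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// For each position i in X (length Lx), return the interval [lo, hi]
// of positions in Y (length Ly) inside a tube of radius r around the
// main diagonal. Cells outside these intervals are never evaluated,
// so the cell count drops from (Lx+1)*(Ly+1) to roughly Lx*(2r+1).
std::vector<std::pair<int, int>> band(int Lx, int Ly, int r) {
    std::vector<std::pair<int, int>> range(Lx + 1);
    for (int i = 0; i <= Lx; ++i) {
        int center = (int)((long long)i * Ly / std::max(Lx, 1));
        range[i] = {std::max(0, center - r),
                    std::min(Ly, center + r)};
    }
    return range;
}
```

With Lx = Ly = 370 and r = 30, for instance, this visits roughly 370 × 61 cells instead of the full 371 × 371 rectangle.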
Figure 3. Prior information. Example of prior information for input sequence X of length L for an annotation label set S = {Exon, Intron, Intergenic}. For sequence interval [1,89], no prior information on the annotation of the sequence is available. This stretch of the sequence would thus be analyzed with a model whose nominal emission probabilities are not biased by any prior probabilities. For sequence interval [90,184], there is prior information on the likelihood of different sequence positions being Exon, Intron or Intergenic. For the rest of the input sequence, i.e. for sequence interval [185,370], we know with certainty that the sequence positions in [185,214] are exonic and that the remainder of the sequence is intergenic. Note that the prior probabilities add up to 1 for every sequence position for which any prior information is supplied, reflecting the fact that the three labels Exon, Intron and Intergenic are mutually exclusive and that each sequence position has to fall into exactly one of these three categories.
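The C++ sketch below shows one plausible way such positional priors can be folded into the emission scores during dynamic programming: positions without prior information keep their nominal emission probability, while positions with priors are biased by the prior of the state's annotation label. The function and parameter names are hypothetical, and the exact combination rule HMMConverter uses is not reproduced here.

```cpp
#include <map>
#include <string>

// Combine a state's nominal emission probability with the prior for
// its annotation label at one sequence position. An empty map means
// "no prior information", as in interval [1,89] of Figure 3; a known
// annotation, as in [185,214], has prior 1 for Exon and 0 otherwise.
// Illustrative sketch only; names are not HMMConverter's API.
double scored_emission(double nominal_emission,
                       const std::map<std::string, double>& prior_at_i,
                       const std::string& state_label) {
    if (prior_at_i.empty())
        return nominal_emission;          // unbiased, nominal model
    auto it = prior_at_i.find(state_label);
    double p = (it == prior_at_i.end()) ? 0.0 : it->second;
    return nominal_emission * p;          // bias by the label's prior
}
```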
Computational requirements
| Feature | Implemented algorithm | Time requirement | Memory requirement | Reference |
|---|---|---|---|---|
| Viterbi algorithm | Viterbi algorithm | 𝒪( | 𝒪( | ( |
| Hirschberg algorithm | Hirschberg algorithm | 𝒪( | 𝒪( | ( |
| Viterbi training | Lam–Meyer algorithm | 𝒪( | 𝒪( | ( |
| Baum–Welch training | Miklós–Meyer algorithm | 𝒪( | 𝒪( | ( |
| Posterior sampling training | Lam–Meyer algorithm | 𝒪( | 𝒪( | ( |
Overview of the time and memory requirements of the different prediction and parameter training algorithms in HMMConverter. The requirements are given for an HMM with M states and a connectivity of Tmax, where L is the length of the input sequence and K is the number of state paths sampled per training sequence in each iteration of the posterior sampling algorithm. Note that the requirements quoted for the training algorithms are per iteration.
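The linear-memory figures in the table rest on the observation that the forward and backward recursions only ever consult the values at the previous sequence position. The C++ sketch below illustrates this for the plain forward algorithm: two buffers of length M replace the full L-by-M table. It works in raw probabilities for readability (a real implementation would use log space or scaling to avoid underflow), and it illustrates the general principle only; it is not the Lam–Meyer or Miklós–Meyer training code, which additionally carries expected transition and emission counts forward in the same single pass.

```cpp
#include <cstddef>
#include <vector>

// Forward algorithm with two length-M buffers instead of an L-by-M
// table: memory independent of the sequence length L. Sketch under
// the table's notation (M states, sequence of length L).
double forward_probability(const std::vector<int>& seq,
                           const std::vector<std::vector<double>>& trans,
                           const std::vector<std::vector<double>>& emit,
                           const std::vector<double>& init) {
    const int M = (int)init.size();
    std::vector<double> prev(M), cur(M);
    for (int s = 0; s < M; ++s)
        prev[s] = init[s] * emit[s][seq[0]];   // position 0
    for (std::size_t i = 1; i < seq.size(); ++i) {
        for (int s = 0; s < M; ++s) {
            double sum = 0.0;
            for (int t = 0; t < M; ++t)        // with a sparse model only
                sum += prev[t] * trans[t][s];  // Tmax predecessors matter
            cur[s] = sum * emit[s][seq[i]];
        }
        prev.swap(cur);                        // discard position i-1
    }
    double total = 0.0;
    for (int s = 0; s < M; ++s) total += prev[s];
    return total;                              // P(sequence | model)
}
```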