André Yoshiaki Kashiwabara, Igor Bonadio, Vitor Onuchic, Felipe Amado, Rafael Mathias, Alan Mitchell Durham.
Abstract
Discrete Markovian models can be used to characterize patterns in sequences of values and have many applications in biological sequence analysis, including gene prediction, CpG island detection, alignment, and protein profiling. We present ToPS, a computational framework that can be used to implement different applications in bioinformatics analysis by combining eight kinds of models: (i) independent and identically distributed process; (ii) variable-length Markov chain; (iii) inhomogeneous Markov chain; (iv) hidden Markov model; (v) profile hidden Markov model; (vi) pair hidden Markov model; (vii) generalized hidden Markov model; and (viii) similarity-based sequence weighting. The framework includes functionality for training, simulation and decoding of the models. Additionally, it provides two methods to help parameter setting: the Akaike and Bayesian information criteria (AIC and BIC). The models can be used stand-alone, combined in Bayesian classifiers, or included in more complex, multi-model, probabilistic architectures using GHMMs. In particular, the framework provides a novel, flexible implementation of decoding in GHMMs that detects when the architecture can be traversed efficiently.
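The AIC and BIC mentioned in the abstract trade goodness of fit against the number of free parameters when choosing among models, e.g. Markov chains of different orders. As a rough illustration (not ToPS code; all names here are hypothetical), the sketch below scores maximum-likelihood Markov chains of increasing order on a DNA sequence:

```python
import math
from collections import Counter

def markov_log_likelihood(seq, order):
    """Log-likelihood of seq under a maximum-likelihood Markov chain of the given order."""
    ctx_counts = Counter()     # counts of (context, next symbol) pairs
    prefix_counts = Counter()  # counts of contexts
    for i in range(order, len(seq)):
        ctx = seq[i - order:i]
        ctx_counts[(ctx, seq[i])] += 1
        prefix_counts[ctx] += 1
    # MLE of P(symbol | context) is n(ctx, sym) / n(ctx)
    return sum(n * math.log(n / prefix_counts[ctx])
               for (ctx, _), n in ctx_counts.items())

def aic_bic(seq, order, alphabet_size=4):
    """AIC and BIC for a Markov chain of the given order.
    Free parameters: |A|^order * (|A| - 1) conditional probabilities."""
    ll = markov_log_likelihood(seq, order)
    k = (alphabet_size ** order) * (alphabet_size - 1)
    n = len(seq) - order  # number of modeled symbols
    return 2 * k - 2 * ll, k * math.log(n) - 2 * ll

seq = "ACGTACGTAACCGGTTACGT" * 5
for order in (0, 1, 2):
    aic, bic = aic_bic(seq, order)
    print(order, round(aic, 1), round(bic, 1))
```

The order minimizing AIC or BIC would then be selected; BIC penalizes extra parameters more heavily on long sequences.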
Year: 2013 PMID: 24098098 PMCID: PMC3789777 DOI: 10.1371/journal.pcbi.1003234
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Comparison of ToPS with other Markov model toolkits.
| Program | Input Format | Probabilistic Models | Simulation | Distinguishing Characteristics |
| HMMConverter | XML | HMM, pair-HMM, generalized HMM | NO | memory-efficient Viterbi, forward, backward |
| HMMoC | XML, C language | HMM, pair-HMM, triple-HMM, quad-HMM, generalized HMM | YES | memory-efficient Viterbi, forward, backward |
| gHMM | XML | HMM, inhomogeneous Markov chain, pair-HMM, mixture of probability density functions | YES | continuous emission, graphical user interface |
| HTK | XML | HMM | NO | continuous emission |
| Tigrscan | own language | GHMM | NO | does not provide Baum-Welch training |
| N-SCAN | XML | GHMM | NO | does not provide Baum-Welch training |
| ToPS | own language | HMM, pair-HMM, GHMM, variable-length Markov chain, inhomogeneous Markov chains, discrete i.i.d. models, SBSW | YES | model selection criteria (AIC and BIC), builds profile-HMM from alignment, efficient and general GHMMs |
The generalized version of HMMs in HMMoC and HMMConverter differs from the GHMMs as defined by Kulp [14]. Specifically, they only allow the emission of whole words within a state, and neither allows sub-models nor the characterization of duration with a non-geometric distribution.
Tigrscan and N-SCAN implement GHMMs containing as sub-models weight arrays, maximum dependence decomposition, smoothed histograms, three-periodic Markov chains, and interpolated Markov models.
However, these models cannot be used individually, as the state architecture of the GHMM is hard-coded in these systems.
Figure 1. A diagram of examples of ToPS usage.
Square boxes represent data files; rounded boxes represent programs or manual processes. Each model may be described manually by editing a text file (1), or the train program can be used to estimate the parameters and automatically generate such a file from a training set (2). The files that contain the model parameters (in our example model1.txt, model2.txt and model3.txt) are used by the programs evaluate (3), simulate (4), bayes_classifier (5) and viterbi_decoding (6). The evaluate program calculates the likelihood of a set of input sequences given a model, the simulate program samples new sequences, the viterbi_decoding program decodes input sequences using the Viterbi algorithm, and the bayes_classifier classifies input sequences given a set of probabilistic models.
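The decision rule behind a Bayesian classifier like the bayes_classifier program is to pick the model with the highest posterior probability given the sequence. A minimal Python sketch of that rule (not the ToPS implementation; model names and log-likelihood values are hypothetical):

```python
import math

def bayes_classify(log_likelihoods, priors):
    """Pick the model with the highest posterior P(model | seq),
    where P(model | seq) is proportional to P(seq | model) * P(model)."""
    scores = {m: ll + math.log(priors[m]) for m, ll in log_likelihoods.items()}
    shift = max(scores.values())  # log-sum-exp shift for a stable normalizer
    total = shift + math.log(sum(math.exp(s - shift) for s in scores.values()))
    posteriors = {m: math.exp(s - total) for m, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical log-likelihoods of one sequence under two trained models.
best, post = bayes_classify({"cpg": -120.0, "noncpg": -125.0},
                            {"cpg": 0.5, "noncpg": 0.5})
print(best, round(post[best], 3))
```

With equal priors, the classification reduces to comparing log-likelihoods; unequal priors shift the decision boundary.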
Figure 2. The implemented GHMM for the CpG island detector.
In this GHMM we used IMMs as emission sub-models and tested different values for the exit probability of the NONCPG state to generate the sensitivity analysis. The mean length of the CPG state emission was estimated from the training data.
Figure 3. Sensitivity associated with the combined length of the predicted CGIs.
In this experiment the points on the curve correspond to different values of the exit probability of the NONCPG state of the GHMM. For comparison, the results for the CGI list from the UCSC Genome Browser and for the CGI list obtained using the HMM [2] are shown as a blue square and a green triangle, respectively.
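The decoding underlying this kind of two-state CpG-island HMM can be sketched with a log-space Viterbi pass. The toy model below (illustrative parameters, not those used in the paper) labels C/G-rich stretches as CPG:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state path for obs under an HMM (log-space Viterbi)."""
    v = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for x in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev, score = max(((p, v[-1][p] + log_trans[p][s]) for p in states),
                              key=lambda t: t[1])
            row[s] = score + log_emit[s][x]
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):      # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    path.reverse()
    return path

lg = math.log
states = ("CPG", "NONCPG")
start = {"CPG": lg(0.5), "NONCPG": lg(0.5)}
trans = {"CPG":    {"CPG": lg(0.9), "NONCPG": lg(0.1)},
         "NONCPG": {"CPG": lg(0.1), "NONCPG": lg(0.9)}}
emit = {"CPG":    {"A": lg(0.1), "C": lg(0.4), "G": lg(0.4), "T": lg(0.1)},
        "NONCPG": {"A": lg(0.4), "C": lg(0.1), "G": lg(0.1), "T": lg(0.4)}}
path = viterbi("ATATCGCGCGATAT", states, start, trans, emit)
print(path)
```

The sticky self-transitions (0.9) keep the path from flipping state on single symbols, so only the sustained CG run in the middle is labeled CPG.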
Comparison between CGI lists.
| CGI List | Total number of CGI regions | Percentage of confirmed TSSs contained in the CGI predictions (“sensitivity”) | Total of nucleotides in CGI list (“specificity”) |
| UCSC Genome Browser | | | |
| HMM | | | |
| GHMM (this work) | | | |
| GHMM (this work) | | | |
This table shows a comparison between four distinct CGI lists: the UCSC Genome Browser list, the list produced by the HMM designed by Wu and collaborators [2], and the lists produced by our GHMM approach using two distinct exit probabilities for the NONCPG state. The GHMM exit probabilities selected were those that produced lists with the same sensitivity as the list from the UCSC Genome Browser and the list from the HMM by Wu and collaborators, respectively.
Figure 4. GHMM architecture for eukaryotic protein-coding gene prediction.
The architecture includes states representing initial exons (indexed by the phase at which they end), internal exons (indexed by the phases at which they begin and end), terminal exons (indexed by the phase at which they begin), introns at each phase, intergenic regions, the start codon signal, the stop codon signal, acceptor splice site signals at each phase, and donor splice site signals at each phase. To model the reverse strand, we used the states that begin with the prefix ‘r-’. Squares with a self-transition represent states with a geometric duration distribution; squares without a self-transition represent states with a non-geometric duration distribution; ellipses represent states with fixed-length durations.
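The link between self-transitions and geometric durations can be made concrete: a state with self-transition probability p dwells for a geometrically distributed number of symbols with mean 1/(1−p), which is why the exit probability of a state controls the mean length of the segments it emits. A small illustrative check (not ToPS code):

```python
import random

def mean_geometric_duration(p_self):
    """Expected dwell time of a state with self-transition probability p:
    the duration is geometric with mean 1 / (1 - p)."""
    return 1.0 / (1.0 - p_self)

def sample_duration(p_self, rng):
    """Flip the self-transition coin until the state is left."""
    d = 1
    while rng.random() < p_self:
        d += 1
    return d

rng = random.Random(42)
p = 0.99  # exit probability 0.01 -> mean duration 100 symbols
empirical = sum(sample_duration(p, rng) for _ in range(20000)) / 20000
print(mean_geometric_duration(p), empirical)
```

Non-geometric durations (smoothed histograms, fixed lengths) cannot be produced this way, which is why the GHMM states without self-transitions need explicit duration models.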
States of the GHMM for the gene prediction problem.
| State Name | Description | Emission Model | Duration Model |
| | start codon | start codon initial motif (20 nt), start codon model (3 nt), initial pattern model (4 nt) | fixed-length (27 nt) |
| | stop codon | stop codon model (3 nt) | fixed-length (3 nt) |
| | single exon | protein-coding model | smoothed histogram |
| | initial exons | protein-coding model | smoothed histogram |
| | terminal exons | protein-coding model | smoothed histogram |
| | internal exon | protein-coding model | smoothed histogram |
| | intron | non-coding model | geometrically distributed |
| | donor splice site | donor initial pattern (4 nt), donor splice site model (9 nt) | fixed-length (13 nt) |
| | acceptor splice site | branch point model (32 nt), acceptor splice site model (6 nt), acceptor initial pattern model (4 nt) | fixed-length (42 nt) |
| | intergenic state | non-coding model | geometrically distributed |
| | final state | non-coding model | self-transition probability is one |
This table shows a summary of the configuration we used in each state of the GHMM for the gene-prediction problem. The start codon, donor splice site, and acceptor splice site states are composed of two or more individual sub-models. The reverse-strand states are symmetric and were omitted from this table.
Accuracy of the gene predictions.
| | Gene | | | Exon | | | Nucleotide | | |
| Predictor | PPV | S | F-score | PPV | S | F-score | PPV | S | F-score |
| GENSCAN | 9.7±1.1 | 19.6±0.7 | 12.9±1.1 | 54.3±2.2 | 55.0±4.7 | | 69.9±3.7 | | |
| ToPS | | | | 55.9±1.7 | 57.4±1.6 | | 87.1±2.4 | | |
This table shows the accuracy of ToPS in the 5-fold cross-validation experiment. GENSCAN was tested using the “HumanIso.smat” parameters and the same test set used in each individual validation run. PPV: positive predictive value; S: sensitivity.
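PPV, sensitivity, and F-score are all derived from counts of true positives, false positives, and false negatives at the gene, exon, or nucleotide level. A small helper showing the definitions (the counts below are made up for illustration):

```python
def accuracy_metrics(tp, fp, fn):
    """PPV (precision), sensitivity (recall) and F-score from counts of
    true positives, false positives and false negatives."""
    ppv = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    # F-score is the harmonic mean of PPV and sensitivity
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    return ppv, sensitivity, f_score

# Made-up counts for illustration only.
ppv, sens, f = accuracy_metrics(tp=90, fp=10, fn=60)
print(ppv, sens, round(f, 2))
```

The harmonic mean makes the F-score sit closer to the weaker of the two component measures, so a predictor cannot score well by trading one for the other.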