Taikai Takeda1, Michiaki Hamada1,2,3,4,5, John Hancock. 1. Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, Tokyo 169-8555, Japan. 2. Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan. 3. Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan. 4. Institute for Medical-Oriented Structural Biology, Waseda University, Tokyo 162-8480, Japan. 5. Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan.
Abstract
Motivation: Pair Hidden Markov Models (PHMMs) are probabilistic models used for pairwise sequence alignment, a quintessential problem in bioinformatics. PHMMs include three types of hidden states: match, insertion and deletion. Most previous studies have used one or two hidden states for each PHMM state type. However, few studies have examined the number of states suitable for representing sequence data or improving alignment accuracy. Results: We developed a novel method to select superior models (including the number of hidden states) for PHMM. Our method selects models with the highest posterior probability using Factorized Information Criterion, which is widely utilized in model selection for probabilistic models with hidden variables. Our simulations indicated that this method has excellent model selection capabilities with slightly improved alignment accuracy. We applied our method to DNA datasets from 5 and 28 species, ultimately selecting more complex models than those used in previous studies. Availability and implementation: The software is available at https://github.com/bigsea-t/fab-phmm. Contact: mhamada@waseda.jp. Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Pair Hidden Markov Models (PHMMs) are probabilistic models used for pairwise sequence alignment, a quintessential problem in bioinformatics. PHMMs include three types of hidden states: match, insertion and deletion. Most previous studies have used one or two hidden states for each PHMM state type. However, few studies have examined the number of states suitable for representing sequence data or improving alignment accuracy. Results: We developed a novel method to select superior models (including the number of hidden states) for PHMM. Our method selects models with the highest posterior probability using Factorized Information Criterion, which is widely utilized in model selection for probabilistic models with hidden variables. Our simulations indicated that this method has excellent model selection capabilities with slightly improved alignment accuracy. We applied our method to DNA datasets from 5 and 28 species, ultimately selecting more complex models than those used in previous studies. Availability and implementation: The software is available at https://github.com/bigsea-t/fab-phmm. Contact: mhamada@waseda.jp. Supplementary information: Supplementary data are available at Bioinformatics online.
The alignment of biological sequences (e.g. DNA, RNA and proteins) is one of the most classical and important problems in the field of bioinformatics. Sequence alignment permits the assessment of the functional relationships among biological sequences by quantifying sequence similarity. Since similar nucleotides or amino acids sequences are often functionally related, the development of quantitative evaluations of sequence similarity has been of great interest. This high demand for similarity evaluations has driven the development of a variety of alignment programs (Altschul ; Frith ; Thompson ). Moreover, sequence alignments are essential for analyzing the huge amounts of sequence data produced by high-throughput sequencers in computational tasks such as mapping read sequences onto reference genomes (Hamada ; Li and Homer, 2010).For this alignment task, probabilistic approaches are widely recognized. These probabilistic approaches include Pair Hidden Markov Models (PHMMs) (Durbin ), which handle indels and substitutions that occur throughout molecular evolution by using sequentially dependent unobserved hidden states, specifically the match, insertion and deletion states as well as their corresponding probabilistic symbol emissions (cf. Fig. 1).
Fig. 1.
Transition diagram of hidden states in a PHMM. The match states, which emit a pair of characters (), are connected to all the other states, whereas the X- and Y-insertion states, which emit a pair of a character and a gap symbol ‘-’ ( and ), are only connected to the match states
Transition diagram of hidden states in a PHMM. The match states, which emit a pair of characters (), are connected to all the other states, whereas the X- and Y-insertion states, which emit a pair of a character and a gap symbol ‘-’ ( and ), are only connected to the match statesThere have been several attempts to construct slightly more complex PHMMs. Bradley ), Lunter ) and Paten used PHMMs with two insertion and two deletion states, and Cartwright (2009) proposed a general version of PHMMs that employs a zeta power-law model of indel lengths. Additionally, a few generalizations of PHMMs have been proposed (such as Pachter ), which introduced generalized PHMMs for DNA–DNA, DNA–cDNA and DNA–protein alignments. However, to the best of our knowledge, no previous study has focused on determining the suitable number of states for representing the biological models that describe sequence evolution or for achieving better alignment accuracy.Bayesian model selection provides a sophisticated approach for selecting the best model by maximizing model evidence. In this, some parameters are marginalized out, and so a preference for simpler models is inherent to the method. When the model prior is uniform, maximizing model evidence is equivalent to maximizing the posterior probability of model given data, so we can choose the model with the largest posterior by maximizing the model evidence.The well-known difficulty of Bayesian model selection is that the model evidence is analytically intractable in general, including for PHMMs. Markov Chain Monte Carlo (MCMC) (Hastings, 1970) and Variational Inference (VI) (Beal, 2003; Blei ; Jordan ) enable approximation of the difficult-to-compute model evidence, but both approaches have a drawback: high computational cost. In contrast, the Factorized Asymptotic Bayesian (FAB) algorithm (Fujimaki and Hayashi, 2012; Fujimaki and Morinaga, 2012; Hayashi ) is a promising alternative model selection technique based on the Factorized Information Criterion (FIC). One advantage of the FAB algorithm is its simultaneous optimization of the model structure and the parameters, which makes the FAB algorithm more scalable than VI and MCMC. The advantages are further discussed in Section 2.2.The contributions of this study are summarized as follows:We developed a novel FIC-based model selection algorithm for PHMMs and demonstrate the reasonably good accuracy in model selection using a synthetic dataset. To the best of our knowledge, this is the first attempt in the literature to apply a model selection method to PHMMs.The model selection method slightly improved evaluation metrics on the same synthetic dataset.We conducted experiments on real DNA sequences and found that our method selects a more complex probabilistic structure than the ones that have been traditionally used for pairwise alignment of these species.
2 Materials and methods
PHMMs are a type of probabilistic generative model for sequence alignment (Durbin ) with three types of hidden states: a match-type state M, an X-insertion-type state X and a Y-insertion-type state Y (Fig. 1). The insertion states model the molecular evolution of indels, and the emission probability of the match states characterizes the substitution rates.In this study, we employed the FAB algorithm (Fujimaki and Morinaga, 2012; Hayashi ) to select the best model structure for a PHMM. The FAB algorithm is an information criterion-based technique that enables the simultaneous optimization of both the parameters and the model structure. The properties of the FAB algorithm are explained in Section 2.2 in more detail. Note that we modified the standard formalization of PHMM (e.g. Durbin ) because it does not use hidden variables explicitly, which is inappropriate for the FAB algorithm. In this section, we introduce our formalization of the PHMM with explicit hidden variables (Section 2.1) and then develop a proposed model selection method using the FAB algorithm (Section 2.2).In the following, we denote the number of match-type states, X-insertion-type states and Y-insertion-type states as K, K and K, respectively. Additionally, K represents the total number of hidden states, that is . Formally, we regard model selection as selecting the number of hidden states .
2.1 Pair Hidden Markov Model
Let observed sequences be and , where N is the number of sequence pairs. The nth sequences are and , where and are the lengths of and , respectively. We abbreviate all the observed sequences as . Unlike normal HMMs, PHMMs have hidden variables , which are two-dimensional and the nth of which is (Fig. 2). Note that these two-dimensional hidden variables are not a common formalization and include a zero-state, introduced below. The value corresponds to the hidden state where, for match states, and are matched. For insertion states, represents that x corresponds to the gap ‘–’ and that the last used symbol in is and vice versa for Y-insertion states. The hidden state is a 1-of-K representation, but slightly modified to allow a zero-state, where for all k is zero and does not emit any symbols from that variable (an example is shown in Fig. 3). This is because of the unique characteristics of PHMMs (in comparison with conventional HMMs); only a subset of the hidden variables emit symbols, that is only the hidden variables corresponding to aligned positions emit symbols (Fig. 3). The set is a parameter set, where and represent the initial probability, transition probability and emission probability parameters, respectively. Also, each hidden state k corresponds to one of the state types {M, X, Y}, which is given by a function where .
Fig. 2.
Graphical model representation of (a) HMMs and (b) PHMMs. The hidden states, denoted by z, are one-dimensional in the normal HMM, whereas they, denoted by , are two-dimensional in the PHMM. In the PHMM, a pair of symbol emissions (x, y) is an emission from the hidden state , describing a pair of (aligned) nucleotides in the case of DNA alignments, for example
Fig. 3.
An alignment example for a pair of DNA sequences x and y as well as the corresponding two-dimensional hidden states, where , illustrating how hidden states are encoded. is the zero-state that does not emit any symbols, and and are the 1-of-K coding corresponding to the M-type, X-type and Y-type states, respectively. This hidden state encoding allows us to generate the alignment from the sequence pair GG and AAGG
Graphical model representation of (a) HMMs and (b) PHMMs. The hidden states, denoted by z, are one-dimensional in the normal HMM, whereas they, denoted by , are two-dimensional in the PHMM. In the PHMM, a pair of symbol emissions (x, y) is an emission from the hidden state , describing a pair of (aligned) nucleotides in the case of DNA alignments, for exampleAn alignment example for a pair of DNA sequences x and y as well as the corresponding two-dimensional hidden states, where , illustrating how hidden states are encoded. is the zero-state that does not emit any symbols, and and are the 1-of-K coding corresponding to the M-type, X-type and Y-type states, respectively. This hidden state encoding allows us to generate the alignment from the sequence pair GG and AAGGNow we can write the complete log-likelihood of PHMM as
where is a set of hidden variables corresponding to the initial states. The initial hidden variable varies with the type of hidden state because each hidden state’s type uses a different number of original sequences; an M-type state uses both x1 and y1, whereas X-type and Y-type states each use one of them. For this reason, the initial hidden variable of an M-type state is , whereas it is and for X-type and Y-type states, respectively. For the transition probability, we use as a set of (previous) hidden variables from which this can transit to . We further denote the emission probability as a categorical distribution using new variables as
where represents the categorical emission probability of the kth hidden state. Note that the X-type and Y-type states emit the gap ‘–’ instead of the normal symbols (e.g. A, T, G or C in the case of DNA alignments). Thus, the dimensionality of the parameter of the emission probability differs with the type of hidden states, namely, for the match states and L – 1 for the insertion states, where L is the number of symbols (L = 4 in the case of DNA sequences). Using this notation, we can rewrite the complete likelihood in an explicit form.
where and is a transition direction defined as
Again, it should be noted that the representation in Equation (3) is essential for a derivation of our model selection algorithm in the following section.
2.2 PHMM model selection algorithm: FAB-PHMM
We formalize the model selection problem for PHMM as a maximization of the model evidence.
where the evidence is given by . Note that the model size is parameterized by the number of hidden states of each state type. However, the model evidence is difficult to compute; thus, we generally need approximations. In this study, we use FIC as an asymptotically accurate approximation.FIC has following three appealing properties:
In the following, we will start with the derivation of (Section 2.2.1), then take a lower bound to derive FICLB (Section 2.2.2) for optimization via expectation maximization (EM). We iteratively optimize the target function FICLB with respect to a variational distribution q and parameters in the E step (Section 2.2.3) and M step (Section 2.2.4), respectively. The model is tuned via model pruning (Section 2.2.5).Asymptotic equivalence to marginal likelihood. Although Bayesian Information Criterion (BIC) is a widely used and simple information criterion, it lacks theoretical justification because of the non-regularity of the latent variable models (Watanabe, 2009). PHMM is not an exception to this, so the BIC’s approximation is invalid for PHMMs. Unlike BIC, FIC is consistent with the marginal likelihood for latent variable models. Practically speaking, Fujimaki and Hayashi (2012) empirically showed that BIC-HMM tends to choose overly complicated models, while FIC-HMM chooses optimal models more often.Simultaneous optimization of model and parameters. VI is closely related to FIC. Both of them perform similar approximation using variational distribution. One advantage of FIC is that it can optimize parameters and models simultaneously. This makes FIC-based optimization computationally more efficient.Prior free. Unlike VI, FIC does not require prior distributions because it treats priors as . Thus, FIC is hyper-parameter tuning free and easier to optimize.
2.2.1 Factorized information criterion
Let be a set of local (component-dependent) parameters and be a set of global (component-independent) parameters. Additionally, we denote all the parameters as (in the case of PHMM, local parameters and global parameters ). Hayashi have shown that the model evidence can be approximated as an asymptotically accurate information criterion, FIC, which can be expressed as
where is the number of free parameters in is a maximum joint likelihood estimators (MJLEs), is the Hessian matrix of with respect to is the marginal posterior and is the entropy of . In FIC, the penalty term is given by the volume of the Fisher information matrix , which penalizes complexity in the model.Here we derive FIC for PHMM, . Since the local parameters do not interact with each other, the Fisher information matrix is a block diagonal matrix whose blocks are , thus . Here, using the Equation (3), we can write these Fisher information matrices as
and
where both and are with respect to the number of samples N. Thus, the penalty term is
The newly introduced symbols and are effective samples of, respectively, transition and emission probability for the kth latent variable. The values and are the dimensionalities of parameters and , respectively.Finally, ignoring the term, we derive FIC for PHMM as
The penalty terms are now sums of parameter dimensionality weighted by the corresponding effective samples. For example, the dimensionality of the kth emission probability is weighted by . When the effective sample of the kth component is small, the penalty term for the kth latent variable also becomes small. In this case, is degenerate and we can safely prune the kth latent component. This model pruning is further discussed in Section 2.2.5.
2.2.2 FIC lower bound
We employ an EM algorithm to optimize the parameters. To make the EM algorithm tractable, we further take the lower bound of and derive FICLB. We use three approximations to construct the lower bound. (i) Since the MJLEs is unavailable in practice, we replace it by the arbitrary parameter , which is optimized in the M step. (ii) Instead of the marginal posterior , we use a variational distribution q, which is optimized in the E step. (iii) We take a lower bound of the negative logarithm as , where L is linear approximation of the logarithm function and is any distribution over . During the optimization procedure, is set to be the variational distribution q of the previous time step. Using these approximations, we now get the lower bound
Here, we introduced the auxiliary variable for simplicity. The full algorithm including model pruning (the model selection mechanism) is explained in Section 2.2.5.
2.2.3 E-step updates
We need to obtain the distribution that maximizes FICLB (see Algorithm 2.1 for details). This can be done using a modified forward–backward algorithm as follows:
Using these forward–backward variables, the optimal variational distribution is obtained as follows:
2.2.4 M-step updates
Now, we want to find the that maximizes FICLB for fixed q (see Algorithm 2.1 for details). For those parameters, we have the update function:
For calculation of β, out-of-range indexing is treated as zero, that is if or .
2.2.5 Pruning degenerated components
In contrast to VI, the FAB algorithm enables simultaneous optimization of a model and its parameters via model pruning (Hayashi ). Let us call degenerated when there exists an equivalent likelihood for a smaller model , that is where . In such cases, is overcomplete so we can transform the model to obtain a new smaller and equivalent model. This transformation is called model pruning. In the case of FAB-PHMM, we can prune the components k with effective samples, which are beneath some threshold . Starting from a sufficiently large model, we can prune redundant components while optimizing parameters. This model pruning algorithm is shown in Algorithm 2.1.We observed that this model pruning algorithm sometimes fails by being captured within poor local optima. In such cases, degenerated components are not pruned. To avoid this problem, we incorporate greedy pruning (Algorithm 2.2). When the algorithm converged, we append the current model to model candidates. Then, delete the component with the fewest effective samples (greedy_pruning) and restart the algorithm. After finding the smallest possible model , we choose the model with the largest FIC from among the model candidates.
2.2.6 Computational complexity
For each iteration in Algorithm 2.1, the computational complexity is for the E and M steps, and for model pruning. Therefore, the overall complexity for each step is . Note that this complexity is exactly the same as ordinary parameter learning for PHMM using the Baum–Welch algorithm (Durbin ).The FAB-PHMM algorithmInput: data , initial model , initial variational distribution q, initial parameter , stopping threshold η and pruning thresholdloop▹ E-stepfor allk that satisfy dodelete the kth hidden state of model ▹ Pruningend for▹ M-stepifthenend loopend ifend loopThe greedy FAB-PHMM algorithmInput: data , initial model , initial variational distribution q, initial parameter , stopping threshold η and pruning thresholdInitialize candidates with an empty listloop← FABPHMM-algorithm() ▹ Alg. 2.1append to candidatesifthenend loopend if() ← greedy_pruning()▹ Delete the component with fewest effective samplesend loopchoose model with highest FIC from candidates
3 Experiments
We performed three types of experiments to answer the following three questions. First, how accurate is the proposed method in selecting the optimal model (Section 3.1)? Next, how much does the proposed method contribute to alignment accuracy relative to fixed-size models (Section 3.2)? Finally, what kinds of models are selected when we train our proposed method against real DNA data (Section 3.3)?In this study, because of the high computational cost of parameter learning, we concentrated on short alignments (up to 200 bp). Additionally, we considered only global alignments. Extensions for the alignment of longer sequences and local alignment will be discussed in Section 4. Moreover, following previous research (Bradley ; Lunter ), we here concentrate on models with a single match state and multiple insertion states, although our method is potentially applicable to multiple match states (cf. Fig. 1 and Section 4). For all the experiments, we set the stopping threshold and pruning threshold .
3.1 Model selection capability
We first investigated the model selection capability of the proposed method. We used synthetic data because the true model is not available for real dataset. We defined parameters of PHMMs of different sizes manually and generated alignments from them. The true model size and their names are provided in Table 1. From the manual models, we generated N alignments of fixed length 100. After removing the gaps from each alignment, we fed the sequences to Algorithm 2.2 and estimated the optimal model only from the data (pairs of sequences). We then determined if it can recover the true model. We set the initial model size to be . We ran experiments for sample sizes . Parameters for each model can be found in Supplementary Section S3 and Figures S1–S7.
Table 1.
Model sizes of hand-crafted models
small
med
large
imb
imb_large
huge
imb_huge
(1, 1, 1)
(1, 2, 2)
(1, 4, 4)
(1, 2, 1)
(1, 4, 2)
(1, 6, 6)
(1, 6, 3)
Note: The first, second and third values in each triplet are the numbers of match states (K), X-insertion states (K) and Y-insertion states (K), respectively. Model names including the term ‘imb’ indicate an imbalanced model with unequal values of K and K. See Supplementary Figures S1–S7 for details of the parameters.
Model sizes of hand-crafted modelsNote: The first, second and third values in each triplet are the numbers of match states (K), X-insertion states (K) and Y-insertion states (K), respectively. Model names including the term ‘imb’ indicate an imbalanced model with unequal values of K and K. See Supplementary Figures S1–S7 for details of the parameters.Table 2 shows the fraction of models that were correctly predicted—prediction is correct when the true model size and predicted model size are exactly the same. As the number of samples N grew, the approximation became more precise. Also, some models tend to require more samples for precise model selection. With sufficient samples, FAB-PHMM successfully recovered models almost perfectly except in some cases [i.e. in experiments (med, N = 1000) and (small, N = 1000)]. Even in these cases, the inaccurately selected models have smaller FIC values than accurately selected models, and we assumed that the algorithm was trapped in a poor local optimum. However, we can easily avoid this problem by using multiple runs and choosing the model with the largest FIC. Indeed, when we picked such a model out of 10 replicates generated with a different random seed, the predictions were always correct when the sample size was greater than or equal to 700 (see bold-face text in Table 2; the bold-face font indicates that the model with the highest FIC value successfully predicted the correct model).
Table 2.
Fraction of precise model selections from multiple estimations for data produced with different model sizes
No. of samples
Precise model selection for different model sizes
small
med
large
imb
imb_large
huge
imb_huge
100
10/10
0/10
0/10
10/10
0/10
0/10
0/10
200
10/10
0/10
0/10
9/10
0/10
0/10
0/10
300
10/10
0/10
0/10
4/10
7/10
0/10
0/10
400
10/10
0/10
0/10
9/10
10/10
0/10
10/10
500
10/10
0/10
0/10
7/10
9/10
0/10
5/10
600
10/10
5/10
0/10
9/10
10/10
5/10
9/10
700
10/10
10/10
3/10
5/10
10/10
10/10
10/10
800
10/10
10/10
10/10
7/10
10/10
10/10
10/10
900
9/10
10/10
10/10
8/10
10/10
10/10
10/10
1000
9/10
8/10
10/10
7/10
10/10
10/10
10/10
Note: We ran the model selection algorithm 10 times for each combination of sample number and model size (shown in Table 1) with random initial parameters. Each table entry indicates the fraction of runs in which the correct optimal model was selected. The bold-face font indicates precise prediction of the model size, that is when the method selected the model with the largest FIC correctly.
Fraction of precise model selections from multiple estimations for data produced with different model sizesNote: We ran the model selection algorithm 10 times for each combination of sample number and model size (shown in Table 1) with random initial parameters. Each table entry indicates the fraction of runs in which the correct optimal model was selected. The bold-face font indicates precise prediction of the model size, that is when the method selected the model with the largest FIC correctly.Note that this experimental setting is much simpler than the real setting, where the true model is not in the PHMM class (i.e. the real DNA sequences are not PHMM generated). However, we assume that FAB-PHMM can select the optimal model, in the sense of choosing the closest possible models.
3.2 Alignment accuracy
We also explored the alignment accuracy of the proposed method. Since it is difficult to obtain true genome alignments, we reused the generated data from the manual models in the model selection experiment. For alignment accuracy assessments, we had five alignment datasets for each different model size. We first trained multiple models on each of those datasets and then performed alignments. For training, we used PHMM (with a fixed model size) of those eight different model sizes with random parameter initializations in addition to FAB-PHMM, which automatically chose the optimal model size. In total, we used nine PHMMs, including one with the true model sizes as well as one FAB-PHMM. In order to avoid poor local optima, we ran five trainings for each setting and selected the model with the best score (FIC for FAB-PHMM and likelihood for PHMM).As a measure of performance in terms of accuracy, we used the f1 score, which is the harmonic mean of precision and recall:
For example, when the true alignment is and the inferred alignment is , the true and inferred positions are and , respectively. The correctly inferred position is simply the intersection of the true positions and inferred positions: . In this case, precision is 2/3 and recall is 2/3; thus, the f1 score is 2/3.We report the result of the number of sequences where N = 1000 in Table 3. For every dataset, the proposed method performed on par with the true model while all the other fixed-size models performed relatively poorly in some cases: for example for the large model dataset experiment, FAB-PHMM and the large model (i.e. the same model size as the one that produced dataset) performed better than others, whereas the large model performed relatively poorly for smaller datasets.
Table 3.
Alignment f1 score
Simulated models
Training models
small
med
large
imb
imb_large
huge
imb_huge
fab
small
0.9286
0.9286
0.9255
0.9283
0.9281
0.9247
0.9262
0.9287
med
0.8337
0.8327
0.8318
0.8342
0.8324
0.8258
0.8279
0.8328
large
0.8196
0.8255
0.8309
0.8238
0.8278
0.8277
0.8313
0.8315
imb
0.8882
0.8963
0.8927
0.8965
0.8925
0.8919
0.8928
0.8968
imb_large
0.8602
0.8694
0.8681
0.8659
0.8686
0.8654
0.8678
0.8688
huge
0.9419
0.9473
0.9506
0.9449
0.9490
0.9513
0.9510
0.9515
imb_huge
0.9681
0.9688
0.9717
0.9690
0.9715
0.9717
0.9719
0.9717
Average
0.9035
0.9066
0.9071
0.9061
0.9069
0.9052
0.9068
0.9083
Note: For each model of data generation (i.e. the simulated models shown in Table 1), we trained models of fixed sizes including the true model and proposed model (fab) before the f1 score evaluation. The italic and bold values indicate the result of training with the true model size and the best score obtained without the true model size, respectively. See Table 1 for the details of the models.
Alignment f1 scoreNote: For each model of data generation (i.e. the simulated models shown in Table 1), we trained models of fixed sizes including the true model and proposed model (fab) before the f1 score evaluation. The italic and bold values indicate the result of training with the true model size and the best score obtained without the true model size, respectively. See Table 1 for the details of the models.Although this alignment accuracy measure is widely used (e.g. Rivas and Eddy, 2015), this approach only considers aligned bases that correspond to match states and ignores all those corresponding to insertion states. For this reason, we also evaluated the insertion counterparts of the f1 score (see Supplementary Section S1 for further details).In addition to assessing alignment accuracy, we also calculated the perplexity of each trained model in order to show how well the models explain the data. Refer to Supplementary Section S2 for further details.
3.3 Model selection from real DNA sequences
We used real DNA data to explore the resulting models selected by FAB-PHMM. Since this study concentrates on global alignments of short sequences, proper analyses require homologous DNA sequence pairs. For this reason, we used locally aligned DNA data produced by Frith and Kawaguchi (2015) (https://zenodo.org/record/17436##.WA3REpOLQYM) and multiple-aligned DNA data produced by MULTIZ (http://hgdownload-test.cse.ucsc.edu/goldenPath/hg38/multiz20way/).
3.3.1 LAST dataset
The LAST dataset (Frith and Kawaguchi, 2015) contains pairwise alignments of human sequences to those of four other species (dog, orangutan, mouse and chimpanzee). For our purposes, we selected the alignments with lengths between 100 and 200. Additionally, we removed the alignments with a ‘missmap’ probability (an alignment ambiguity measure provided with the dataset) of less than because we wanted to use only highly reliable homologous pairs. This resulted in 1640 to 94 904 remaining alignments for each of the alignments between the human sequences and those of the four species. We extracted overlapping regions from across all four species and cropped alignments to include only those overlapping regions. Then, we randomly sampled 1000 alignments and removed gaps from them in order to input them into FAB-PHMM. We ran 10 trainings with different random seeds (which determine the initial parameters in PHMM) and selected the best model with the highest FIC value. We set the initial model size to be .Figure 4 shows the selected model size for each species. For species more closely related to humans (i.e. orangutan and chimpanzee), the algorithm selected a similar model and vice versa for species more distantly related to humans (i.e. dog and mouse). Specifically, in the case of chimpanzee and orangutan alignments, the simplest model was chosen, whereas many more X- and Y-insertion states were estimated in the case of dog and mouse alignments. Additionally, it was observed that more X-insertion states were predicted than Y-insertion states for dog and mouse alignments.
Fig. 4.
Trained model sizes for the LAST dataset shown on a phylogenetic tree generated using phyloT and the ETE Toolkit. The model sizes are to the right of the species names, for example ‘Mouse—1 6 3’ means the selected model for the human–mouse alignment is
Trained model sizes for the LAST dataset shown on a phylogenetic tree generated using phyloT and the ETE Toolkit. The model sizes are to the right of the species names, for example ‘Mouse—1 6 3’ means the selected model for the human–mouse alignment isFigures 5 and 6, Supplementary Figures S8 and S9 provide the detailed parameters of the selected models from human–chimpanzee, human–mouse, human–orangutan and human–dog sequence alignments, respectively. Overall, all the trained substitution matrixes are almost symmetric. We have the following observations about the human–mouse alignment: (i) two X-insertion states, X3 and X4, with similar and relatively high transition probabilities from the match state (M) were predicted and (ii) X3 had higher emission probabilities of A and C while X4 has higher emission probabilities of T and G. In contrast, the human–dog sequence alignment provides the following observations: (i) the self-transition probabilities of X1, X3 and X5 were similar (about 0.8), though the emission probability of X1 had an almost uniform distribution while those of X3 and X5 were skewed with different profiles; (ii) the self-transition probability of X2 and X4 were smaller (about 0.4) than those of the others, meaning the states corresponded to shorter gaps; (iii) we obtained two long insertion states, Y1 and Y3, and one short insertion state, Y2, while Y2 and Y3 had similar emission profiles.
Fig. 5.
Trained PHMM for human–chimpanzee alignments for sequences from the LAST dataset using the proposed method. The resulting model is the simplest one, . (a) Trained initial and transition probabilities and (b) emission probabilities
Fig. 6.
Trained PHMM and its parameters for alignments between human and mouse sequences. Panel (a) shows X1–X5 and Y1–Y3 have five X-insertion states and three Y-insertion states, respectively, where each value is an estimated initial/transition probability. For example, the value of 0.4793 in cell is equal to the transition probability from Y3 to M. Panel (b) shows emission probabilities of each hidden states. For example, the value 0.0173 in cell (T, A) in the left most panel is equal to the probability of a match state emitting a nucleotide pair
Trained PHMM for human–chimpanzee alignments for sequences from the LAST dataset using the proposed method. The resulting model is the simplest one, . (a) Trained initial and transition probabilities and (b) emission probabilitiesTrained PHMM and its parameters for alignments between human and mouse sequences. Panel (a) shows X1–X5 and Y1–Y3 have five X-insertion states and three Y-insertion states, respectively, where each value is an estimated initial/transition probability. For example, the value of 0.4793 in cell is equal to the transition probability from Y3 to M. Panel (b) shows emission probabilities of each hidden states. For example, the value 0.0173 in cell (T, A) in the left most panel is equal to the probability of a match state emitting a nucleotide pair
3.3.2 MULTIZ dataset
We also used the multiz20way dataset, which consists of multiple alignments of 19 different species’ genome assemblies to the human genome. As with the previous experiment, we randomly selected 1000 alignments of human sequences to sequences from each of the 19 species where each was restricted to have lengths of 90–110 bp. These alignments were then used as a training set in FAB-PHMM. We set the initial model size to be .The selected model sizes are shown on a phylogenetic tree in Figure 7. In general, we can see species that are more distantly related to humans tend to have alignments with humans that are described by large models, for example the human–bushbaby alignment model has a model size of (1, 9, 7) and the human–dog alignment model is (1, 9, 9). However, for species that are more closely related to humans, somewhat random model sizes were selected to describe alignments, for example the human–gorilla alignment model size is (1, 3, 3), whereas the human–chimpanzee model (a closer species’ alignment model) has a model size of (1, 5, 4).
Fig. 7.
Trained model sizes for the MULTIZ dataset using the proposed method with an inferred phylogenetic tree. See the caption of Figure 4
Trained model sizes for the MULTIZ dataset using the proposed method with an inferred phylogenetic tree. See the caption of Figure 4
4 Discussion
Although our proposed method may be potentially adapted to have multiple match states (cf. Fig. 1), we have concentrated on a method with a single match state and multiple X- and Y-insertion states. This is because some previous studies focused only on multiple insertion states, and learning multiple match states in addition to multiple insertion states would lead to more complex models that are not interpretable. (Also, the use of multiple match states incurs more computational cost due to the increase in trained parameters.) Still, it might be interesting to investigate multiple match states for further improvement of alignment accuracy or making novel inferences about sequence evolution, because multiple match states could correspond to the substitution rate of regions.Our experiments in Section 3.2 show that models trained by our proposed method achieved better alignment accuracy than other models, but the improvement was only marginal. We assume that this is because maximizing model evidence not always result in maximizing alignment accuracy: higher model evidence might contribute to better sequence modelling, but it might not affect alignment accuracy metric. This result is consistent with previous research by Lunter in which the authors indicated that modifications of insertion states can result in only small improvements in alignment accuracy.In our analyses using real data (Section 3.3), we utilized limited datasets owing to the high computational cost of our proposed method. Even with these limited datasets, we observed that much more complex models than traditionally used models are selected as optimal. This result implies a possibility that much more complex probabilistic structures exist behind the probabilistic alignments than previously believed.As well as improving alignment accuracy, the selected model structure may provide interesting insight about biological sequences because the selected probabilistic model structure contains latent information of input data. In other words, the selected model may provide insights into the biological functions of sequences. Similarly, decoded hidden states of PHMMs may reveal (e.g. in DNA alignments) that regions with different hidden states correspond to different functions, such as exon versus intron sequences, non-coding RNAs and regulatory elements. This information can be useful for inferring novel biological insights from sequences.For the best model selection for real data, however, a comprehensive dataset (including, e.g. coding, non-coding and repetitive regions) is required because the trained model structures depend on input data. Indeed, our analyses led to different model selections based on the LAST and MULTIZ datasets. This could be because the homologous pairs taken for the LAST dataset were more similar than those taken for the MULTIZ dataset due to the protocols of generating input sequences (cf. Section 3.3). In our future work, we will utilize larger and unbiased datasets in order to select more reliable models. The main bottleneck is the high computational cost of our algorithm. To address this, we will attempt to accelerate the training process, such as by parallelization of the algorithms, stochastic optimization (Hoffman ; Liu ; Robbins and Monro, 1951) and seed-extension heuristics in forward and backward algorithms (Hamada ).In this study, we focused on genomic DNA sequences, but our method is also applicable to RNA or protein sequences. The number of characters in protein sequences greatly exceeds those in DNA sequences, which would lead to more complex models that may provide interesting biological insights. These applications will be included in our future research as well.
5 Conclusions
In this study, we proposed a novel method to develop PHMMs based on FIC and demonstrated the model selection capability of the proposed model using a synthetic dataset. We believe this is the first study that focuses on model selection of PHMM. On the same synthetic dataset, we observed slight improvement of evaluation metrics of sequence alignments. Additionally, we conducted experiments on real DNA sequences and found that they are best handled with a more complex probabilistic structure than the ones that have been traditionally used for pairwise alignment of these species. This result implies a possibility that more complex probabilistic structures exist behind probabilistic alignments than previously believed.Click here for additional data file.
Authors: Robert K Bradley; Adam Roberts; Michael Smoot; Sudeep Juvekar; Jaeyoung Do; Colin Dewey; Ian Holmes; Lior Pachter Journal: PLoS Comput Biol Date: 2009-05-29 Impact factor: 4.475