Kiyoshi Ezawa.
Abstract
BACKGROUND: Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of sequence evolution through indel processes. Recent indel probabilistic models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either probabilities of column-to-column transitions or block-wise contributions along the alignment. However, it is not a priori clear how these models are related to any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence along the time axis. Moreover, none of these models can currently fully accommodate biologically realistic features, such as overlapping indels, power-law indel-length distributions, and indel rate variation across regions.
Keywords: Biological realism; Factorability; Insertion/deletion (indel); Non-equilibrium evolution; Power-law distribution; Rate variation; Sequence alignment probability; Stochastic evolutionary model
Year: 2016 PMID: 27638547 PMCID: PMC5026781 DOI: 10.1186/s12859-016-1105-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Genuine stochastic evolutionary model vs. HMM (or transducer). a Probability density calculation via a genuine stochastic evolutionary model. Each sequence state is represented as an array of sites (boxes). Sites to be deleted are shaded in red or magenta. Inserted sites are shaded in blue or cyan. The s and s , respectively, denote the initial and final states. The s (ν = 1, 2, 3) is an intermediate state. (The “P[…]” denotes a probability, and the “p[…]” denotes a probability density.) b A pairwise alignment (PWA) between the initial (I) and final (F) states resulting from the indel process in panel a. The C (i = 1, …, 10) labels the alignment column below. c Probability calculation via an HMM (or a transducer). It is a priori unclear whether or how the methods in panels a and c are related to each other. For clarity, residue states and substitutions were omitted. (Note that the equation in panel a is merely a rough expression to give a broad idea of the issue. Rigorous expressions will be given in Results and discussion.) Panels a and b of this figure were adapted from panels B and F of Fig. 1 of [32]
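The HMM-style calculation of panel c can be sketched as a product of column-to-column transition probabilities along a PWA. The following is a minimal illustration only: the state labels (`S`, `M`, `I`, `D`, `E`), the transition probabilities in `TRANS`, and the function names are all hypothetical and are not taken from any published HMM or from this paper.

```python
from math import prod

# Hypothetical transition probabilities between column types
# (S = start, M = match, I = insertion, D = deletion, E = end).
# The numbers are illustrative only; each row sums to 1.
TRANS = {
    "S": {"M": 0.90, "I": 0.05, "D": 0.05},
    "M": {"M": 0.85, "I": 0.05, "D": 0.05, "E": 0.05},
    "I": {"M": 0.55, "I": 0.40, "E": 0.05},
    "D": {"M": 0.55, "D": 0.40, "E": 0.05},
}


def column_types(initial_row, final_row):
    """Classify each PWA column as M, I, or D from the two gapped rows."""
    if len(initial_row) != len(final_row):
        raise ValueError("alignment rows must have equal length")
    types = []
    for a, d in zip(initial_row, final_row):
        if a != "-" and d != "-":
            types.append("M")
        elif a == "-" and d != "-":
            types.append("I")  # site present only in the final sequence
        elif a != "-" and d == "-":
            types.append("D")  # site present only in the initial sequence
        else:
            raise ValueError("a column cannot consist of gaps only")
    return types


def hmm_indel_probability(initial_row, final_row):
    """Indel component of the PWA probability as a product over adjacent columns."""
    path = ["S"] + column_types(initial_row, final_row) + ["E"]
    # Transitions absent from TRANS are treated as having probability 0.
    return prod(TRANS[u].get(v, 0.0) for u, v in zip(path, path[1:]))


# A PWA with a single two-site deletion: columns M D D M M.
p = hmm_indel_probability("ACCGT", "A--GT")
```

In this sketch the probability depends only on adjacent column pairs, which is exactly the property whose relation to a genuine evolutionary model is in question.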
Key concepts and results in this paper
| Concept/result | Description | Main location |
|---|---|---|
| Ancestry index | An ancestry index is assigned to each site. Sharing of an ancestry index among sites indicates the sites’ mutual homology. As a fringe benefit, the indices enable the mutation rates to vary across regions (or sites) beyond the mere dependence on the residue state of the sequence. | Section |
| Operator representation of mutations | This enables the intuitively clear and yet mathematically precise description of mutations, especially insertions/deletions, on sequence states. This is a core tool in our | Section |
| Rate operator | An operator version of the rate matrix, which specifies the rates of the instantaneous transitions between the states in our evolutionary model. | Section |
| Finite-time transition operator | An operator version of the finite-time transition matrix, each element of which gives the probability of transition from a state to another after a finite time-lapse. This results from the cumulative effects of the rate operator during a finite time-interval. | Section |
| Defining equations (differential) | 1st-order time differential equations (forward and backward) that define our indel evolutionary model. They are operator versions of the standard defining equations of a continuous-time Markov model. | Section |
| Defining equations (integral) | Two integral equations (forward and backward) that are equivalent to the aforementioned differential equations defining our indel evolutionary model. They play an essential role when deriving the perturbation expansion of the finite-time transition operator. | Section |
| Perturbation expansion (finite-time transition operator) | The perturbation expansion of the finite-time transition operator. It was derived in an intuitively clear yet mathematically precise manner, by using the aforementioned defining integral equations. | Section |
| Perturbation expansion ( | The perturbation expansion of the | Section |
| Binary equivalence relation | An equivalence relation between the products of two indel operators each. The relations play key roles when defining LHS equivalence classes. | Section |
| LHS equivalence class | An equivalence class consisting of global indel histories that share all local history components. The classes play an essential role when proving the factorability of a given PWA probability. | Section |
| | We proved that, under | Section |
| Perturbation expansion ( | The “perturbation expansion” of the | Section |
| | We proved that, under | Section |
| | Such a model gives factorable PWA probabilities, because the exit rate is an affine function of the sequence length (regardless of whether indel rates are time-dependent or not). The indel model of Dawg [ | Subsection |
| | We showed that the “chop-zone” method in [ | Subsection |
| Model with simple insertion rate variation | If the deletion rates are space-homogeneous and the insertion rates depend only on the insertions’ flanking sites, the PWA probabilities are still factorable. | Subsection |
| | This kind of model is the simplest example of an indel model whose | Subsection |
| Degree of non-factorability | The “difference of exit-rate differences” (Eq. ( | Subsection |
| Space-heterogeneous model with factorable PWA probability | We found that a class of indel models with rate-heterogeneity across regions (Eqs. ( | Subsection |
NOTE: Especially important things are in boldface
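The relation between the rate operator and the finite-time transition operator listed above can be illustrated with a finite-state, time-independent analogue, for which the transition matrix is P(t) = exp(Qt) and satisfies both the forward equation dP/dt = PQ and the backward equation dP/dt = QP with P(0) = I. The sketch below is only that analogue: the 3-state matrix `Q` and the function names are hypothetical, whereas the model in this paper acts on an infinite space of sequence states and allows time-dependent rates.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=40):
    """P(t) = exp(Q t) via a truncated Taylor series (adequate for small |Q t|)."""
    n = len(q)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term_k = term_{k-1} (Q t) / k, i.e. (Q t)^k / k!
        term = [[x * t / k for x in row] for row in mat_mul(term, q)]
        result = [[r + x for r, x in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result


# Hypothetical 3-state rate matrix: off-diagonal entries are transition
# rates, and each diagonal entry is minus the exit rate of that state,
# so every row sums to zero.
Q = [[-0.3, 0.2, 0.1],
     [0.4, -0.5, 0.1],
     [0.0, 0.6, -0.6]]

P_one = transition_matrix(Q, 1.0)
P_half = transition_matrix(Q, 0.5)
```

Each row of `P_one` sums to 1, and `mat_mul(P_half, P_half)` reproduces `P_one` up to rounding, which is the finite-time multiplicativity expected of a continuous-time Markov model.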
Fig. 2 Sequence states. a A sequence state in the SID models [21]. Each site (cell) is assigned a residue state (A, T, G or C). b The corresponding extended sequence state in our evolutionary model. Each site is assigned an ancestry index (number) in addition to a residue state. c The corresponding basic sequence state. Each site is assigned an ancestry index alone. The represents the set of residue states assigned to all sites. The represents the set of ancestry indices assigned to all sites. (Note that the identical symbols (s’s) used in panels a and c represent different types of states (of the same sequence))
Fig. 3 Operator representation of mutations. a A substitution operator, . The residues before and after the substitution are in boldface in blue and red, respectively. b An insertion operator, , and a fill-in operator, . The inserted sites are shaded in cyan. (Note that “A” at the top of the rightmost inserted column means the ancestry index of 10, not the residue state of A). c A deletion operator, . The sites to be deleted are shaded in magenta. In this figure, the extended sequence states were used for illustration. The bra-vector below each array denotes the state. The extended state, , is identical to that in Fig. 2b. Each vertical arrow indicates the action of the mutation operator beside it. Note that the first arguments of all operators and the second argument of the deletion operator specify positions along the sequence, and not ancestries (specified at the top of the sites)
Fig. 4 Example indel history and resulting alignments. a An example indel history in terms of the bra-vectors of sequence states and indel operators. b The graphical illustration of the history using basic sequence states. Each sequence state in panel a is horizontally aligned with its graphical representation in panel b. c The resulting MSA among the sequence states that the indel history went through. d The resulting PWA between the initial and final sequences. In both c and d, the bold italicized characters in the leftmost column are the suffixes indicating the sequence states in panel a. In panels b, c, and d, the number in each site (cell) represents its ancestry, but not necessarily its position along the sequence. The ‘A’ in the final sequence abbreviates 10. The same shading scheme as in Fig. 1 is used. The figure was adapted from Fig. 1 of [32]
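How an indel history acting on basic sequence states yields a PWA can be sketched as follows. A basic state is taken to be a list of ancestry indices, indel events take positions (not ancestries) as arguments, and homologous columns are recovered from shared ancestry indices. The function names, the example history, and the convention of placing deleted sites before inserted sites within a gapped region are assumptions made for illustration; the history is not the one drawn in Fig. 4.

```python
def insert(state, x, length, next_index):
    """Insert `length` new sites between positions x and x+1 (1-based; x=0 is the left end).

    New sites receive fresh ancestry indices starting at `next_index`.
    Returns the new state and the next unused ancestry index.
    """
    new_sites = list(range(next_index, next_index + length))
    return state[:x] + new_sites + state[x:], next_index + length


def delete(state, x_begin, x_end):
    """Delete the sites at positions x_begin..x_end (1-based, inclusive)."""
    return state[:x_begin - 1] + state[x_end:]


def pairwise_alignment(initial, final):
    """Build PWA columns as (initial ancestry, final ancestry) pairs; None marks a gap."""
    shared = set(initial) & set(final)
    columns, i, j = [], 0, 0
    while i < len(initial) or j < len(final):
        if i < len(initial) and initial[i] not in shared:
            columns.append((initial[i], None))  # deleted site
            i += 1
        elif j < len(final) and final[j] not in shared:
            columns.append((None, final[j]))    # inserted site
            j += 1
        else:
            assert initial[i] == final[j]       # same ancestry: homologous column
            columns.append((initial[i], final[j]))
            i += 1
            j += 1
    return columns


# Hypothetical history on an initial state of 7 sites.
s0, next_index = [1, 2, 3, 4, 5, 6, 7], 8
s1 = delete(s0, 2, 3)                         # [1, 4, 5, 6, 7]
s2, next_index = insert(s1, 3, 2, next_index) # [1, 4, 5, 8, 9, 6, 7]
s3 = delete(s2, 5, 6)                         # [1, 4, 5, 8, 7]
pwa = pairwise_alignment(s0, s3)
```

The last deletion removes one inserted site (ancestry 9) together with one original site (ancestry 6), so the history contains overlapping indels, yet the PWA only records that sites 2, 3 and 6 were lost and site 8 was gained.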
Fig. 5 Binary equivalence relation and LHS equivalence class. a An indel history, . b Another indel history, . These histories result in the same final state (〈s | (= 〈s |)). Thus, their total effects are equivalent. c Their equivalent local history set (LHS), , is represented by the isolated actions of local indel histories on the initial state (each enclosed in a dashed box)
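The idea that two indel histories can have equivalent total effects may be illustrated with a toy example in which a deletion and an insertion act on separate regions. Because event arguments are positions along the current sequence, the insertion position must be shifted by the deletion length when the order of the two events is swapped. The event encoding and the function `apply_history` are hypothetical, and the two histories are not those shown in panels a and b.

```python
def apply_history(state, history):
    """Apply indel events in temporal order to a list of ancestry indices."""
    state = list(state)
    next_index = max(state, default=0) + 1  # fresh ancestry indices for insertions
    for kind, a, b in history:
        if kind == "ins":    # insert b new sites between positions a and a+1
            state[a:a] = range(next_index, next_index + b)
            next_index += b
        elif kind == "del":  # delete positions a..b (1-based, inclusive)
            del state[a - 1:b]
        else:
            raise ValueError(f"unknown event type: {kind}")
    return state


initial = [1, 2, 3, 4, 5, 6, 7]
history_1 = [("del", 2, 3), ("ins", 3, 2)]  # delete first, then insert
history_2 = [("ins", 5, 2), ("del", 2, 3)]  # insert first, then delete

final_1 = apply_history(initial, history_1)
final_2 = apply_history(initial, history_2)
```

Both histories end in the same state, `[1, 4, 5, 8, 9, 6, 7]`, so they would fall into the same equivalence class under this toy notion of equivalence.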