Massimo Maiolo, Xiaolei Zhang, Manuel Gil, Maria Anisimova.
Abstract
BACKGROUND: Sequence alignment is crucial in genomics studies. However, optimal multiple sequence alignment (MSA) is NP-hard. Thus, modern MSA methods employ progressive heuristics, breaking the problem into a series of pairwise alignments guided by a phylogeny. Changes between homologous characters are typically modelled by a Markov substitution model. In contrast, the dynamics of indels are not modelled explicitly, because computing the marginal likelihood under such models has exponential time complexity in the number of taxa. But failing to model indel evolution may lead to artificially short alignments, because biased indel placement is inconsistent with the phylogenetic relationships.
Keywords: Dynamic programming; Indel; Phylogeny; Poisson process; Sequence alignment
Year: 2018 PMID: 30241460 PMCID: PMC6151001 DOI: 10.1186/s12859-018-2357-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. An example of φ(|m|) (Eq. 2), i.e. the marginal likelihood of all non-observable histories, as a function of MSA length |m|. The parameters are: τ=1, λ=10, μ=1, p(c)=0.5.
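The shape of this curve can be reproduced with a short script. Eq. 2 itself is not shown in this record, so the sketch below assumes the form usually given for the Poisson indel process (PIP), φ(|m|) = ‖ν‖^|m| · exp(‖ν‖ · (p − 1)) / |m|!, with ‖ν‖ = λ(τ + 1/μ). It also reads the caption's p(c) as the likelihood of an empty, non-observable column. Both the formula and that reading are assumptions, and the function and parameter names are illustrative only.

```python
import math


def phi(msa_len, tau=1.0, lam=10.0, mu=1.0, p_empty=0.5):
    """Assumed form of phi(|m|); not taken from this record.

    phi = ||nu||^|m| * exp(||nu|| * (p_empty - 1)) / |m|!
    with ||nu|| = lam * (tau + 1 / mu).
    """
    nu = lam * (tau + 1.0 / mu)
    # Work in log space so that large |m| does not overflow the factorial.
    log_phi = (
        msa_len * math.log(nu)
        + nu * (p_empty - 1.0)
        - math.lgamma(msa_len + 1)
    )
    return math.exp(log_phi)


# Evaluate phi on a range of MSA lengths using the caption's parameters.
values = [phi(m) for m in range(0, 61)]
peak = max(range(len(values)), key=values.__getitem__)
```

Under these assumptions ‖ν‖ = 20, so the ratio φ(|m|+1)/φ(|m|) equals 20/(|m|+1): the curve rises up to |m| = 19, is flat between 19 and 20, and then falls. The values are therefore not monotonic in the MSA length.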
Fig. 2. Overview of the progressive algorithm. The algorithm traverses a guide tree (indicated by the shadow in Panel a) in postorder. At each internal node, the evolutionary paths from the two children down to the leaves (dotted lines in Panel a) are aligned by full maximum likelihood under the PIP model, using a dynamic programming (DP) approach. Since the likelihood function does not increase monotonically in the MSA length (see Fig. 1), the DP accommodates the MSA length along a third dimension (indicated by k in Panels a, b); thus, it works with cubic matrices (in contrast to the traditional quadratic DP alignment). The forward phase of the DP stores likelihood values in three sparse matrices (Panel b: SM for matching columns; SX and SY to introduce new indel events). Further, matrix TR (Panel a) at position (i,j,k) records the name of the DP matrix (either “SM”, “SX”, or “SY”) with the highest likelihood at (i,j,k). An optimal alignment is determined by backtracking along TR (indicated in Panel a by the arrows in the projection of TR onto the plane). Note that the likelihood function marginalises over all indel scenarios compatible with putative homology (Panel c).
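The three-dimensional recursion described in this caption can be illustrated with a toy pairwise aligner. The sketch below is not the authors' implementation. It uses simple additive match, mismatch and gap scores in place of PIP log-likelihoods, dense lists instead of sparse matrices, and a user-supplied `length_term(k)` as a stand-in for a length-dependent factor such as φ. The function name `align_3d` and all parameter names are hypothetical.

```python
NEG = float("-inf")


def align_3d(x, y, match=1.0, mismatch=-1.0, gap=-2.0,
             length_term=lambda k: 0.0):
    """Toy pairwise alignment with the alignment length k as a third DP axis."""
    n, m = len(x), len(y)
    max_len = n + m  # an alignment of x and y has at most n + m columns

    def cube():
        return [[[NEG] * (max_len + 1) for _ in range(m + 1)]
                for _ in range(n + 1)]

    # S["M"]: match columns; S["X"] / S["Y"]: gap in y / gap in x.
    S = {"M": cube(), "X": cube(), "Y": cube()}
    best = cube()  # best[i][j][k] = max over the three states
    TR = [[[None] * (max_len + 1) for _ in range(m + 1)]
          for _ in range(n + 1)]
    best[0][0][0] = 0.0

    # Forward phase: every cell at length k depends only on length k - 1.
    for k in range(1, max_len + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                if i and j:
                    sub = match if x[i - 1] == y[j - 1] else mismatch
                    S["M"][i][j][k] = best[i - 1][j - 1][k - 1] + sub
                if i:
                    S["X"][i][j][k] = best[i - 1][j][k - 1] + gap
                if j:
                    S["Y"][i][j][k] = best[i][j - 1][k - 1] + gap
                state = max("MXY", key=lambda s: S[s][i][j][k])
                if S[state][i][j][k] > NEG:
                    best[i][j][k] = S[state][i][j][k]
                    TR[i][j][k] = state  # name of the winning matrix

    # The length term need not be monotone in k, so scan all final lengths.
    k_best = max(range(max_len + 1),
                 key=lambda k: best[n][m][k] + length_term(k))
    score = best[n][m][k_best] + length_term(k_best)

    # Backtracking along TR, one column per step.
    i, j, k = n, m, k_best
    row_x, row_y = [], []
    while k > 0:
        state = TR[i][j][k]
        if state == "M":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif state == "X":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
        k -= 1
    return "".join(reversed(row_x)), "".join(reversed(row_y)), score
```

For example, `align_3d("AC", "A")` aligns the shared A and places a gap opposite C. Passing a `length_term` that strongly rewards long alignments shifts the optimum to a larger k, which is why the final length cannot be fixed in advance and has to be tracked along the third axis.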
Fig. 3. MSAs inferred with PRANK+F (top), our algorithm (middle, denoted by P-PIP) and MAFFT (bottom) from 23 strains of gp120 human and simian immunodeficiency virus (always using the same guide tree). a. The total MSA lengths are 669, 661 and 579 columns, respectively. The three methods show good agreement in the conserved regions. Substantial differences are observed in regions 1–4, highlighted by colours. b. Magnification of Region 4. MAFFT over-aligns the sequences. Depicted on the left: the tree in black is the original guide tree. The trees depicted in colour are the same guide tree but with re-estimated branch lengths. A detailed view of regions 1–3 is given in Additional file 1: Figures S1-S3.