Kristóf Takács, Vince Grolmusz.
Abstract
Multiple sequence alignment (MSA) is an increasingly important task in bioinformatics, as we have to deal with constantly growing gene and protein sequence databases. MSA is applied in phylogenetic analysis, in discovering conserved protein domains, in assigning secondary and tertiary structural features in proteins, and in metagenomic sample analysis and gene discovery. Usually, the focus is on the MSA of long sequences, since these tasks appear most frequently in practice. However, the rigorous analysis of the optimal MSA of short sequences is a neglected area, and findings there may contribute to better and faster algorithms for the multiple alignment of long sequences. In the present contribution, we examine length-1 sequences under an arbitrary metric and length-2 sequences under the unit metric, and we show that the optimum of the MSA problem is achieved by the trivial alignment in both cases.
Year: 2021 PMID: 34258521 PMCID: PMC8255854 DOI: 10.1096/fba.2020-00118
Source DB: PubMed Journal: FASEB Bioadv ISSN: 2573-9832
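The abstract's length-2 claim can be checked by exhaustive search on a tiny instance. The following is a minimal sketch, not the authors' proof: it assumes a sum-of-pairs cost, a unit metric in which any two different symbols (including a character and a gap) cost 1 and equal symbols cost 0, and the two-letter alphabet {C, G}. The function names are illustrative.

```python
from itertools import combinations, combinations_with_replacement, product

GAP = "-"  # gap symbol (ASCII stand-in for the en dash used in the tables)

def sp_cost(rows):
    """Sum-of-pairs cost under the assumed unit metric: 1 per differing pair in a column."""
    return sum(a != b for col in zip(*rows) for a, b in combinations(col, 2))

def placements(seq, width):
    """All rows of the given width that hold seq's characters in order, padded with gaps."""
    rows = []
    for cols in combinations(range(width), len(seq)):
        row = [GAP] * width
        for ch, col in zip(seq, cols):
            row[col] = ch
        rows.append("".join(row))
    return rows

def optimal_cost(seqs):
    """Exhaustive minimum cost. All-gap columns cost 0 here, so widths up to
    the total number of characters are enough."""
    widths = range(max(map(len, seqs)), sum(map(len, seqs)) + 1)
    return min(
        sp_cost(rows)
        for width in widths
        for rows in product(*(placements(s, width) for s in seqs))
    )

# Every multiset of three length-2 sequences over {C, G}.
length2 = ["".join(p) for p in product("CG", repeat=2)]
triples = list(combinations_with_replacement(length2, 3))

# The trivial alignment is the sequences stacked without gaps.
print(all(optimal_cost(t) == sp_cost(t) for t in triples))
```

The search only covers three sequences over two letters, so it illustrates the statement rather than establishing it.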
A multiple alignment for length‐1 sequences on columns
|   | – | … | – |
| … | … | … | … |
|   | – | – | – |
| – |   | … | – |
| … | … | … | … |
| – |   | … | – |
| … | … | … | … |
| – | – | … |   |
| … | … | … | … |
| – | – | … |   |
A multiple alignment for k length‐1 sequences in two columns
|   | – |
|   | – |
| … | … |
|   | – |
| – |   |
| – |   |
| … | … |
| – |   |
The first two columns of
|   | – |
|   | – |
| … | … |
|   | – |
| – |   |
| – |   |
| … | … |
| – |   |
| – | – |
| … | … |
| – | – |
The structure of after permuting its rows and making its block setting with . Number 1 denotes the first characters, and number 2 the second letters. During the proof, an upper bound is given for the cost of aligning letters with the same order that are not aligned in by using character‐gap alignment costs that are included in cost
| 1 | 2 | – | – |
| … | … | … | … |
| 1 | 2 | – | – |
| 1 | – | 2 | – |
| … | … | … | … |
| 1 | – | 2 | – |
| 1 | – | – | 2 |
| … | … | … | … |
| 1 | – | – | 2 |
| – | 1 | 2 | – |
| … | … | … | … |
| – | 1 | 2 | – |
| – | 1 | – | 2 |
| … | … | … | … |
| – | 1 | – | 2 |
| – | – | 1 | 2 |
| … | … | … | … |
| – | – | 1 | 2 |
The block setting of if , denoting only that an element is the first/second character of its aligned sequence or a gap. For example, the first element of the first row in the block setting and the second element of the fourth row (which are denoting the first characters of some sequences) are not aligned in , so the cost of their alignment with each other, which is a part of cost but not a part of cost , must be estimated from above with a part of cost . Namely, with the cost of aligning the block setting's first element of the first row with the gaps in the first element of the fourth row
| 1 | 2 | – | – |
| 1 | – | 2 | – |
| 1 | – | – | 2 |
| – | 1 | 2 | – |
| – | 1 | – | 2 |
| – | – | 1 | 2 |
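The six rows above are exactly the ways of placing the first and second character of a length-2 sequence, in order, into four columns. A minimal sketch that enumerates them, assuming four columns as in the table; the function name `row_patterns` is illustrative and the ASCII hyphen stands in for the gap symbol:

```python
from itertools import combinations

def row_patterns(num_columns, seq_len=2, gap="-"):
    """Enumerate every way to place characters 1..seq_len, in order, into num_columns cells."""
    patterns = []
    for positions in combinations(range(num_columns), seq_len):
        row = [gap] * num_columns
        for order, col in enumerate(positions, start=1):
            row[col] = str(order)  # "1" = first character, "2" = second character
        patterns.append(row)
    return patterns

for row in row_patterns(4):
    print("| " + " | ".join(row) + " |")
```

With four columns this yields C(4, 2) = 6 row types, in the same order as the table.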
The trivial and an optimal alignment of S
Trivial alignment:
| C | G |
| G | C |
| G | G |
An optimal alignment:
| – | C | G |
| G | C | – |
| G | – | G |
The trivial and an optimal alignment of S
Trivial alignment:
| C | C | G |
| G | C | G |
| C | G | C |
An optimal alignment:
| C | C | G | – |
| G | C | G | – |
| – | C | G | C |
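The costs of the two alignments of CCG, GCG and CGC can be compared directly. This is a minimal sketch that assumes a sum-of-pairs cost and the unit metric (cost 1 for any two different symbols, including a character against a gap, and 0 for equal symbols); the excerpt does not state the scoring for this example, so both are assumptions.

```python
from itertools import combinations

def unit_cost(a, b):
    """Assumed unit metric: 0 for identical symbols (gap-gap included), 1 otherwise."""
    return 0 if a == b else 1

def sp_cost(alignment, cost=unit_cost):
    """Sum-of-pairs cost: add cost(a, b) over every pair of rows in every column."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += cost(a, b)
    return total

trivial = ["CCG", "GCG", "CGC"]     # sequences stacked without gaps
gapped = ["CCG-", "GCG-", "-CGC"]   # the four-column alignment from the table

print(sp_cost(trivial), sp_cost(gapped))  # prints: 6 5
```

Under these assumptions the gapped alignment is cheaper than the trivial one, which fits the abstract limiting its claim to sequences of length 1 and 2.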
|   | C | G | – |
| C | 0 | 2 | 1 |
| G | 2 | 0 | 1 |
| – | 1 | 1 | 0 |
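The cost table above (C against G costs 2, any character against a gap costs 1) can be applied to the sequences CG, GC and GG from the first alignment example. The sketch below assumes a sum-of-pairs cost and assumes that this table is the metric meant for that example; neither is stated in this excerpt. It also checks that the table satisfies the triangle inequality.

```python
from itertools import combinations, product

SYMBOLS = "CG-"  # "-" is the gap (ASCII stand-in for the en dash in the table)

# Cost table as given: rows and columns in the order C, G, gap.
ROWS = {"C": (0, 2, 1), "G": (2, 0, 1), "-": (1, 1, 0)}
COST = {(r, c): v for r, values in ROWS.items() for c, v in zip(SYMBOLS, values)}

def satisfies_triangle_inequality():
    """True if COST[x, z] <= COST[x, y] + COST[y, z] for all symbol triples."""
    return all(
        COST[x, z] <= COST[x, y] + COST[y, z]
        for x, y, z in product(SYMBOLS, repeat=3)
    )

def sp_cost(alignment):
    """Sum-of-pairs cost of an alignment under the table above."""
    return sum(
        COST[a, b]
        for column in zip(*alignment)
        for a, b in combinations(column, 2)
    )

trivial = ["CG", "GC", "GG"]     # sequences stacked without gaps
gapped = ["-CG", "GC-", "G-G"]   # the three-column alignment from the table

print(satisfies_triangle_inequality())    # prints: True
print(sp_cost(trivial), sp_cost(gapped))  # prints: 8 6
```

With this table the gapped alignment costs less than the trivial one, so the unit-metric condition in the length-2 statement appears to matter.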