Xuhua Xia.
Abstract
Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, making it suitable for a large number of long viral genomic sequences.
Keywords: PWM; automation; codon-based alignment; inconsistency; phylogenetics; position weight matrix; sequence alignment; sum-of-pairs score
Year: 2021 PMID: 34828415 PMCID: PMC8623120 DOI: 10.3390/genes12111809
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Multiple sequence alignment of 11 mammalian ACE2 proteins. Only 25 amino acid sites from the N-terminus are shown, taken from Wei et al. [14].
Figure 2. Suboptimal alignment of codon sequences. (A) Two unaligned codon sequences. (B) Alignment from codon-based alignment methods. (C) A better alignment based on alignment scores.
Sum-of-pairs scores for Alignment 1 (Figure 1) and an alternative Alignment 2 with “T-” occupying sites 20 and 21 in N. procyonoides. Only sites 20 and 21 in Figure 1 are considered.
| | T/-(1) | T/T(1) | T/I(1) | I/-(1) | SPS |
|---|---|---|---|---|---|
| Score(2) | −6 | 5 | −1 | −6 | |
| Alignment 1 | 10 | 6 | 4 | 0 | −34 + C(3) |
| Alignment 2 | 6 | 10 | 0 | 4 | −10 + C(3) |
(1) Amino acid pairs relevant for the calculation of SPS. (2) The gap penalty is −6; the T/T match and T/I mismatch scores are 5 and −1, respectively. (3) C is a constant representing the sum of pairwise scores among all sequences other than N. procyonoides.
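The arithmetic behind the two SPS values can be checked with a minimal sketch. The pair scores come from footnote (2) and the pair counts from the table rows; the pair type that does not occur in a given alignment is assumed to have a count of zero, which is consistent with the reported totals. The function name `partial_sps` is hypothetical, and the constant C is left out because it is the same for both alignments.

```python
# Pair scores from footnote (2): gap penalty -6, T/T match 5, T/I mismatch -1.
SCORES = {"T/-": -6, "T/T": 5, "T/I": -1, "I/-": -6}

# Number of pairs of each type involving N. procyonoides at sites 20 and 21.
# Assumption: a pair type absent from an alignment has a count of 0.
ALIGNMENT_1 = {"T/-": 10, "T/T": 6, "T/I": 4, "I/-": 0}
ALIGNMENT_2 = {"T/-": 6, "T/T": 10, "T/I": 0, "I/-": 4}


def partial_sps(pair_counts, scores=SCORES):
    """Sum count * score over all pair types, excluding the shared constant C."""
    return sum(count * scores[pair] for pair, count in pair_counts.items())


print(partial_sps(ALIGNMENT_1))  # -34
print(partial_sps(ALIGNMENT_2))  # -10
```

Because C cancels when the two alignments are compared, the difference of 24 in favour of Alignment 2 depends only on the pairs that involve N. procyonoides.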
Partial position weight matrix for 11 aligned ACE2 sequences, generated from DAMBE [40] using default options for pseudocounts and background frequencies. Only sites 20 and 21 are included.
| AA | Alignment 1, Site 20 | Alignment 1, Site 21 | Alignment 2, Site 20 | Alignment 2, Site 21 |
|---|---|---|---|---|
| A | −3.4621 | −3.4621 | −3.4621 | −3.4621 |
| R | −3.4632 | −3.4632 | −3.4632 | −3.4632 |
| N | −3.4620 | −3.4620 | −3.4620 | −3.4620 |
| D | −3.4625 | −3.4625 | −3.4625 | −3.4625 |
| C | −3.4757 | −3.4757 | −3.4757 | −3.4757 |
| Q | −3.4632 | −3.4632 | −3.4632 | −3.4632 |
| E | −3.4616 | −3.4616 | −3.4616 | −3.4616 |
| G | −3.4625 | −3.4625 | −3.4625 | −3.4625 |
| H | −3.4673 | −3.4673 | −3.4673 | −3.4673 |
| I | −3.4628 | 2.9353 | −3.4628 | 2.9353 |
| L | −3.4612 | −3.4612 | −3.4612 | −3.4612 |
| K | −3.4624 | −3.4624 | −3.4624 | −3.4624 |
| M | −3.4645 | −3.4645 | −3.4645 | −3.4645 |
| F | −3.4629 | −3.4629 | −3.4629 | −3.4629 |
| P | −3.4628 | −3.4628 | −3.4628 | −3.4628 |
| S | −3.4619 | −3.4619 | −3.4619 | −3.4619 |
| T | 4.1089 | 3.5976 | 4.2457 | 3.3770 |
| W | −3.4649 | −3.4649 | −3.4649 | −3.4649 |
| Y | −3.4632 | −3.4632 | −3.4632 | −3.4632 |
| V | −3.4621 | −3.4621 | −3.4621 | −3.4621 |
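For readers unfamiliar with how such a matrix is built, the following is a minimal sketch of a generic log-odds position weight matrix. It is not DAMBE's implementation: the pseudocount value, the uniform background frequencies, and the function name `position_weight_matrix` are all assumptions, and the toy input is invented for illustration, so the output will not reproduce the values in the table.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def position_weight_matrix(aligned, background=None, pseudocount=0.01):
    """Return one {residue: log2 odds} dict per aligned site.

    Assumptions (not taken from the source): uniform background frequencies
    unless supplied, and a pseudocount spread in proportion to the background.
    Gap characters are ignored when counting.
    """
    n_sites = len(aligned[0])
    if any(len(seq) != n_sites for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

    pwm = []
    for j in range(n_sites):
        counts = Counter(seq[j] for seq in aligned if seq[j] in background)
        total = sum(counts.values())
        column = {}
        for aa, bg in background.items():
            # Site-specific frequency, smoothed so unobserved residues stay finite.
            p = (counts[aa] + pseudocount * bg) / (total + pseudocount)
            column[aa] = math.log2(p / bg)
        pwm.append(column)
    return pwm


# Hypothetical two-site toy alignment of 11 sequences, one of them with a gap.
toy = ["TT"] * 6 + ["TI"] * 4 + ["-T"]
pwm = position_weight_matrix(toy)
print(round(pwm[0]["T"], 4), round(pwm[1]["T"], 4), round(pwm[1]["I"], 4))
```

Residues observed at a site receive positive entries and unobserved residues receive strongly negative ones, which is the same qualitative pattern as in the table.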
Figure 3. N-terminus of 20 aligned HTT sequences, with the site numbering in the middle. (A) Alignment from MAFFT [7] with optimized options. (B) One of the alternative alignments refined with the PWMD criterion.
Tree log-likelihood values for the two multiple sequence alignments in Figure 3, obtained with PhyML and three different substitution matrices.
| Alignment | LG | JTT | BLOSUM62 |
|---|---|---|---|
| Figure 3A | −126.6903 | −122.6004 | −126.7423 |
| Figure 3B | −106.7703 | −105.2280 | −106.9387 |
Part of the 20 × 3156 position weight matrix obtained with default options for pseudocounts and background frequencies in DAMBE [40]. Only sites 18 to 44 from 3156 aligned sites are shown, with only two amino acids (Q and P) out of 20. Site numbers are as in the alignment in Figure 3A.
| Site | Q | P |
|---|---|---|
| 18 | −0.0374 | −4.3223 |
| 19 | −0.0374 | −4.3223 |
| 20 | −0.0374 | −4.3223 |
| 21 | −0.0374 | −4.3223 |
| 22 | −0.0374 | −4.3223 |
| 23 | −0.0374 | −4.3223 |
| 24 | −0.0374 | −4.3223 |
| 25 | 1.4974 | −4.3223 |
| 26 | 1.4974 | −4.3223 |
| 27 | 2.2241 | −4.3223 |
| 28 | 4.2125 | −4.3223 |
| 29 | 4.2125 | −4.3223 |
| 30 | 4.2125 | −4.3223 |
| 31 | 4.2125 | −4.3223 |
| 32 | 4.2125 | −4.3223 |
| 33 | 4.2125 | −4.3223 |
| 34 | 4.1387 | −4.3223 |
| 35 | 4.0609 | −4.3223 |
| 36 | 4.0609 | −4.3223 |
| 37 | 4.0609 | −4.3223 |
| 38 | 2.8964 | 2.5727 |
| 39 | −0.0374 | −4.3223 |
| 40 | −4.3224 | −0.1637 |
| 41 | −4.3224 | −0.1637 |
| 42 | −4.3224 | 0.7953 |
| 43 | −4.3224 | 0.7953 |
| 44 | 0.9251 | 3.4601 |
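One way to read these entries is as per-site scores that can be summed to compare alternative placements of residues. The sketch below uses four rows copied from the table; the helper `pwm_score`, the two candidate placements, and the choice to let gaps contribute nothing are hypothetical illustrations, not the PWMD refinement procedure itself.

```python
# Q and P entries for sites 37 to 40, copied from the table above.
PWM_EXCERPT = {
    37: {"Q": 4.0609, "P": -4.3223},
    38: {"Q": 2.8964, "P": 2.5727},
    39: {"Q": -0.0374, "P": -4.3223},
    40: {"Q": -4.3224, "P": -0.1637},
}


def pwm_score(placement, pwm=PWM_EXCERPT):
    """Sum the PWM entries of the residues placed at each site.

    Assumption: a gap ('-') contributes nothing to the score.
    """
    return sum(pwm[site][res] for site, res in placement.items() if res != "-")


# Two hypothetical ways of placing the residues Q, P, P over sites 37 to 40.
candidate_a = {37: "Q", 38: "P", 39: "-", 40: "P"}
candidate_b = {37: "Q", 38: "-", 39: "P", 40: "P"}

print(round(pwm_score(candidate_a), 4))  # 6.4699
print(round(pwm_score(candidate_b), 4))  # -0.4251
```

Under these assumptions the matrix favours putting the first P at site 38, where P is common, rather than at site 39, where it is not observed.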
Figure 4. Illustration of PWM-based refinement of sequence alignment based on the N-terminus of 20 aligned HTT sequences, with the site numbering in the middle. (A) Alignment after Step 1 refinement. (B) Alignment after Step 2 refinement, except that the shared gap at site 39 has not yet been deleted.