Dimitrii O Kostenko, Eugene V Korotkov.
Abstract
The aim of this work was to compare the multiple alignment methods MAHDS, T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK in their ability to align highly divergent amino acid sequences. To this end, we created 81 sets of test amino acid sequences with an average number of substitutions per amino acid (x) ranging from 0.6 to 5.6. Comparison of the alignments constructed by MAHDS and by the previously developed algorithms, using the CS and Z score criteria and the benchmark alignment database (BAliBASE), indicated that although the quality of the alignments built with MAHDS was somewhat lower than that of the other algorithms, this was compensated by greater statistical significance. MAHDS could construct statistically significant alignments of artificial sequences with x ≤ 4.8, whereas the other algorithms (T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK) could not do so at x > 2.4. Applying MAHDS to 21 families of highly diverged proteins (identity < 20%) from the Pfam and HOMSTRAD databases showed that it could calculate statistically significant alignments in cases where the other methods failed. Thus, MAHDS can be used to construct statistically significant multiple alignments of highly divergent protein sequences that have accumulated multiple mutations during evolution.
Keywords: amino acid sequence; dynamic programming; multiple alignment
Year: 2022 PMID: 35409125 PMCID: PMC8998981 DOI: 10.3390/ijms23073764
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
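
The abstract describes test sets in which each amino acid has accumulated, on average, x substitutions. The record does not say how those sets were generated, so the following is only a minimal sketch under stated assumptions: the number of substitutions per position is drawn from a Poisson distribution with mean x, every substitution picks a different residue uniformly at random, and indels are ignored. The names `mutate`, `poisson`, and `identity` are hypothetical helpers, not part of MAHDS.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def poisson(rng, lam):
    """Draw a Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def mutate(seq, x, rng):
    """Apply on average x substitutions per position (assumption: uniform model)."""
    out = []
    for residue in seq:
        for _ in range(poisson(rng, x)):
            # Each substitution replaces the residue with a different one.
            residue = rng.choice([a for a in AMINO_ACIDS if a != residue])
        out.append(residue)
    return "".join(out)


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(p == q for p, q in zip(a, b)) / len(a)
```

Under this toy model, a sequence mutated with x near the upper end of the range studied retains little more than chance-level identity with its ancestor, which illustrates why such sets are hard to align.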
Mean Z and CS values for BAliBASE protein families’ alignments produced by MAHDS with different K, d, and e parameters.
| | | | | Z | | |
|---|---|---|---|---|---|---|
| 5.0 | −1.0 | 40.0 | 5.0 | 178.02 | 0.32 | 0.43 |
| | −1.0 | 40.0 | 4.0 | 180.08 | 0.33 | 0.45 |
| | −1.0 | 40.0 | 2.0 | 182.26 | 0.39 | 0.49 |
| | −1.0 | 40.0 | 1.0 | 180.95 | 0.43 | 0.52 |
| | −1.0 | 40.0 | 0.2 | 146.40 | 0.44 | 0.53 |
| | −2.0 | 40.0 | 2.0 | 172.64 | 0.32 | 0.41 |
| | −2.0 | 28.0 | 2.8 | 178.77 | 0.35 | 0.45 |
| | −2.0 | 28.0 | 0.7 | 126.46 | 0.44 | 0.52 |
Mean Z and CS values for BAliBASE protein families’ alignments produced by different methods.
| Methods | Z | | |
|---|---|---|---|
| MAHDS | 180.95 | 0.43 | 0.52 |
| T-Coffee | 115.31 | 0.81 | 0.87 |
| MUSCLE | 158.60 | 0.73 | 0.80 |
| PRANK | 65.50 | 0.64 | 0.70 |
| Clustal Omega | 116.95 | 0.81 | 0.85 |
| Kalign | 131.51 | 0.75 | 0.82 |
| MAFFT | 125.32 | 0.80 | 0.85 |
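
The two criteria reported in these tables can be illustrated with a short sketch. It assumes the generic definitions: Z is the distance of an observed alignment score from the mean of a null distribution (for example, scores obtained on shuffled sequences) in units of its standard deviation, and CS is taken to be a column score, the fraction of reference-alignment columns reproduced exactly. The source record does not spell out its formulas, so `z_score`, `columns`, and `column_score` are illustrative names rather than the authors' implementation.

```python
import statistics


def z_score(observed, null_scores):
    """Standardize an observed score against a sample of null scores."""
    mu = statistics.mean(null_scores)
    sigma = statistics.stdev(null_scores)
    return (observed - mu) / sigma


def columns(alignment):
    """Convert aligned rows ('-' marks a gap) into tuples of residue indices."""
    counters = [0] * len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, row in enumerate(alignment):
            if row[j] == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def column_score(test_columns, reference_columns):
    """Fraction of reference columns that also occur in the test alignment."""
    test_set = set(test_columns)
    hits = sum(1 for col in reference_columns if col in test_set)
    return hits / len(reference_columns)
```

A typical use would compute `columns` for both a benchmark alignment and a method's output, then pass the two lists to `column_score`; `z_score` would be fed the method's alignment score together with scores obtained for randomized inputs.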
Z values for multiple alignments of 81 Des sets built using MAHDS (Z values > 10.0 are in bold).
| Indel Count | Indel Length | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | | | | | | | | | 7.2 |
| 2 | 5 | | | | | | | | | 3.4 |
| 2 | 20 | | | | | | | | | 8.1 |
| 5 | 1 | | | | | | | | | 8.4 |
| 5 | 5 | | | | | | | | 8.7 | 6.8 |
| 5 | 20 | | | | | | | | 8.8 | 6.4 |
| 10 | 1 | | | | | | | | 9.7 | 6.4 |
| 10 | 5 | | | | | | | | 7.6 | 6.0 |
| 10 | 20 | | | | | | | 8.6 | 7.0 | 5.2 |
Z values calculated for multiple alignments of 81 Des sets constructed using Clustal Omega and Kalign (Z values > 10.0 are in bold).
| Indel Number | Indel Length | Clustal | Kalign | |||
|---|---|---|---|---|---|---|
| 2 | 1 | | | | | |
| 2 | 5 | | | | | |
| 2 | 20 | | | | | |
| 5 | 1 | | | −46.8 | | |
| 5 | 5 | | | −69.7 | | |
| 5 | 20 | | | −460.1 | | |
| 10 | 1 | | | −223.1 | | |
| 10 | 5 | | | −480.6 | | −73.3 |
| 10 | 20 | | −209.9 | −493.2 | | −489.0 |
Z values calculated for multiple alignments of 81 Des sets constructed using MAFFT and MUSCLE (Z values > 10.0 are in bold).
| Indel Count | Indel Length | MAFFT | MUSCLE | ||||
|---|---|---|---|---|---|---|---|
| 2 | 1 | | | | | | |
| 2 | 5 | | | | | | |
| 2 | 20 | | | | | | |
| 5 | 1 | | | | | | |
| 5 | 5 | | | | | | |
| 5 | 20 | | −67.5 | | | | −73.1 |
| 10 | 1 | | | | | | |
| 10 | 5 | | −52.5 | | | | −65.5 |
| 10 | 20 | | −406.8 | | | −77.9 | −90.5 |
Z values calculated for multiple alignments of 81 Des sets built using PRANK and T-Coffee (Z values > 10.0 are in bold).
| Indel Count | Indel Length | PRANK | T-Coffee | |||
|---|---|---|---|---|---|---|
| 2 | 1 | | | | | |
| 2 | 5 | | | | | |
| 2 | 20 | | | | | |
| 5 | 1 | | | | | |
| 5 | 5 | | | | | −167.5 |
| 5 | 20 | | | | −78.8 | −355.1 |
| 10 | 1 | | | | | |
| 10 | 5 | | | −18.4 | −260.7 | −387.6 |
| 10 | 20 | | −225.7 | −133.27 | −415.8 | −905.0 |
Performance of MAHDS, T-Coffee, and MUSCLE in building MSAs of low-identity protein families (Z values > 10.0 are in bold). The first two families were taken from https://mizuguchilab.org/cgi-bin/homstrad/browse.cgi (accessed on 5 June 2021). The remaining 19 were obtained from https://pfam.xfam.org/ (accessed on 7 June 2021). The names of the protein families obtained from the Pfam database are shown in Table 8.
| Name/ | Number of Sequences | Average Length | Average % Identity | MAHDS Z | MAHDS Gap Openings | MAHDS Gaps | T-Coffee Z | T-Coffee Gap Openings | T-Coffee Gaps | MUSCLE Z | MUSCLE Gap Openings | MUSCLE Gaps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fibronectin type 3 domain | 13 | 122 | 17% | −7.3 | 63 | 1465 | −12.5 | 325 | 1004 | −10.9 | 154 | 598 |
| PH domain | 14 | 98.0 | 16% | −12.9 | 71 | 2035 | −17.6 | 221 | 2011 | −7.7 | 172 | 1387 |
| PF00915 | 44 | 234.9 | 17% | | 2962 | 182,664 | −88.4 | 7841 | 802,423 | −15.9 | 2775 | 124,031 |
| PF02950 | 9 | 76.0 | 14% | −6.93 | 33 | 1061 | −13.4 | 103 | 296 | −2.9 | 45 | 188 |
| PF06653 | 210 | 162.2 | 18% | | 5666 | 393,226 | −124.7 | 13,698 | 648,158 | | 6557 | 380,093 |
| PF07611 | 97 | 300.1 | 18% | | 5651 | 163,051 | −11.7 | 10,943 | 197,366 | | 3743 | 97,553 |
| PF07622 | 97 | 273.6 | 19% | | 6896 | 343,751 | −33.5 | 15,437 | 616,470 | | 5865 | 273,614 |
| PF08928 | 182 | 120.9 | 18% | | 9934 | 235,705 | | 11,885 | 374,657 | | 7699 | 210,857 |
| PF09624 | 101 | 144.6 | 17% | | 2198 | 43,877 | −0.2 | 3278 | 58,118 | | 1459 | 27,515 |
| PF09987 | 22 | 223.7 | 14% | | 192 | 14,967 | −14.8 | 747 | 8978 | | 290 | 4748 |
| PF10734 | 219 | 80.5 | 19% | | 2284 | 172,735 | −77.6 | 8520 | 390,912 | | 4601 | 301,122 |
| PF10805 | 181 | 96.9 | 16% | | 1602 | 67,600 | −34.7 | 6415 | 147,729 | | 3807 | 75,691 |
| PF10846 | 285 | 226.6 | 12% | | 13,166 | 607,607 | −91.6 | 55,603 | >5 × 10⁶ | −6.5 | 14,604 | 778,013 |
| PF10895 | 33 | 184.2 | 17% | | 585 | 17,420 | −52.7 | 1487 | 29,766 | 3.8 | 712 | 12,837 |
| PF11368 | 178 | 228.4 | 17% | | 2862 | 33,214 | −91.6 | 15,145 | 80,954 | | 4108 | 29,868 |
| PF13944 | 185 | 124.2 | 18% | | 9220 | 381,058 | −132.5 | 18,133 | >1 × 10⁶ | −22.5 | 5512 | 268,531 |
| PF16506 | 28 | 282.4 | 14% | −2.6 | 265 | 20,329 | −79.6 | 3052 | 51,549 | −3.5 | 1224 | 13,805 |
| PF18406 | 166 | 87.3 | 19% | | 4073 | 142,973 | −45.1 | 12,397 | 460,552 | | 5286 | 160,922 |
| PF18709 | 91 | 257.8 | 16% | | 3724 | 101,986 | | 8700 | 232,070 | | 3280 | 97,117 |
| PF19443 | 216 | 216.7 | 17% | | 22,203 | 533,793 | −56.5 | 39,783 | >1 × 10⁶ | | 15,753 | 349,836 |
| PF19975 | 121 | 229.4 | 19% | | 6716 | 179,105 | −96.6 | 12,219 | 503,784 | | 5233 | 205,882 |
Names of the Pfam protein families whose accession numbers are used in Table 7.
| Accession Number | Name |
|---|---|
| PF00915 | Calicivirus coat proteins |
| PF02950 | Conotoxins |
| PF06653 | Tight junction proteins |
| PF07611, PF07622, PF08928, PF09624, PF10734, PF10805, PF10846, PF10895, PF11368 | Proteins of unknown function |
| PF09987 | Uncharacterized protein conserved in archaea |
| PF13944 | Calycin-like beta-barrel domain |
| PF16506 | Putative virion glycoprotein of insect viruses |
| PF18406 | Ferredoxin-like domain in Api92-like protein |
| PF18709 | Dynamin-like helical domain |
| PF19443 | DAHL domain |
| PF19975 | Double-GTPase 1 |
Figure 1. Diagram of the algorithm for multiple alignment of amino acid sequences described in Section 4.1. The set of N sequences to be aligned is denoted as SI. Q is a set of random PWMs (see Section 4.2). PWMm is the matrix that gives the maximum value of the similarity function when aligning sequences from the set SI (see Section 4.3 and Formula (4)). MA is the multiple alignment built for the sequences from the set SI using PWMm (see Section 4.5 and Formula (5)).
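
The flow in the caption (a set Q of random PWMs, selection of PWMm by a similarity function over SI, then construction of MA from PWMm) can be mimicked with a deliberately simplified toy. This is not the MAHDS procedure: the sketch assumes an ungapped sliding-window score in place of the dynamic-programming similarity function of Formula (4), performs no optimization of the matrices, and reports only the best-matching block of each sequence. All function names are hypothetical, and every sequence is assumed to be at least as long as the PWM.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_pwm(width, rng):
    """One random position weight matrix: a weight per residue per column."""
    return [{a: rng.uniform(-1.0, 1.0) for a in AMINO_ACIDS} for _ in range(width)]


def best_placement(seq, pwm):
    """Return (score, offset) of the best ungapped placement of pwm on seq."""
    width = len(pwm)
    best = None
    for offset in range(len(seq) - width + 1):
        score = sum(pwm[j][seq[offset + j]] for j in range(width))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def similarity(seqs, pwm):
    """Toy similarity function: sum of best placement scores over all sequences."""
    return sum(best_placement(s, pwm)[0] for s in seqs)


def select_pwm(seqs, candidates):
    """Pick the candidate playing the role of PWMm (maximum similarity)."""
    return max(candidates, key=lambda pwm: similarity(seqs, pwm))


def build_alignment(seqs, pwm):
    """Toy stand-in for MA: the block of each sequence that best matches pwm."""
    blocks = []
    for s in seqs:
        _, offset = best_placement(s, pwm)
        blocks.append(s[offset:offset + len(pwm)])
    return blocks
```

In use, one would build Q as a list of `random_pwm` calls, pass it with the sequences to `select_pwm`, and hand the winner to `build_alignment`. Handling insertions and deletions, which this sketch omits entirely, is where a dynamic-programming alignment against the PWM would be needed.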