Shinsuke Yamada, Osamu Gotoh, Hayato Yamana.
Abstract
BACKGROUND: Multiple sequence alignment (MSA) is a useful tool in bioinformatics. Although many MSA algorithms have been developed, there is still room for improvement in accuracy and speed. In the alignment of a family of protein sequences, global MSA algorithms perform better than local ones in many cases, while local ones perform better than global ones when some sequences have long insertions or deletions (indels) relative to others. Many recent leading MSA algorithms have incorporated pairwise alignment information obtained from a mixture of sources into their scoring system to improve accuracy of alignment containing long indels.
Year: 2006 PMID: 17137519 PMCID: PMC1769516 DOI: 10.1186/1471-2105-7-524
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Example of gap extension penalty calculation. This figure shows an example of columns a8 and b12 being aligned. '*', '·', and '-' denote a residue, a static null, and a dynamic null, respectively. We assume that piecewise linear gap cost g(x) is max{-(ux + v)} and critical gap length x (= ⌊(v2 - v1)/(u1 - u2)⌋) is 4. Gap extension penalty is u1 if x ≤ 4, otherwise u2. Running gap profile vectors and are {(1, )} and {(1, + + ), (3, + ), (11, )}, respectively. Dynamic gap information (A) is {(0, 1), (2, 2), (7, 1)}. Segment profile is {(1, ), (3, + ), (5, + + )}. Similarly, the profile vectors of B are defined: = {(7, )}, = {(0, )}, (B) is empty, and = {(1, 0), (9, )}. In what follows, we consider the non-trivial calculation of the gap extension penalty with respect to the gap of B1, the target gap. By using and (A), we find that the two dynamic gaps specified by (2, 2) and (7, 1) in (A) are partially and completely aligned with the target gap, respectively. Consequently, the total number of nulls aligned with null columns of dynamic gaps to be removed is 2. Therefore, the number of columns of B1 is 5 (= 7 - 2). By subtracting 5 from 8 (the end position of the segment), the starting position of A, 3, is obtained. Then, the gap extension penalty with respect to the gap of B1 is (F1·u1 + F2·u2) where F1 = + and F2 = . Note that A1 is not involved in the gap extension penalty because a1,8 is a null.
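The two-piece gap cost described in the caption can be sketched as follows. This is a minimal illustration, not the PRIME implementation: the parameter values `U1`, `V1`, `U2`, `V2` are hypothetical and were chosen only so that the critical gap length comes out as 4, as in the figure, and the weights passed to `mixed_extension_penalty` stand in for F1 and F2, whose definitions are not recoverable from the caption.

```python
from math import floor

# Hypothetical parameters (not taken from the source), chosen so that
# the critical gap length equals 4 as in the caption's example.
U1, V1 = 2.0, 9.0    # steep line, used for short gaps
U2, V2 = 1.0, 13.0   # shallow line, used for long gaps


def gap_cost(x):
    """Piecewise linear gap cost: the larger of the two linear costs -(u*x + v)."""
    return max(-(U1 * x + V1), -(U2 * x + V2))


def critical_gap_length():
    """Gap length at which the two lines cross: floor((v2 - v1) / (u1 - u2))."""
    return floor((V2 - V1) / (U1 - U2))


def extension_penalty(x):
    """Per-column extension penalty: u1 up to the critical length, u2 beyond it."""
    return U1 if x <= critical_gap_length() else U2


def mixed_extension_penalty(f1, f2):
    """Weighted penalty F1*u1 + F2*u2 for a profile column.

    f1 and f2 are assumed to be the summed weights of the gaps that are
    still short (f1) and already long (f2); this reading is an assumption.
    """
    return f1 * U1 + f2 * U2


if __name__ == "__main__":
    print(critical_gap_length())
    print(gap_cost(4), gap_cost(5))
    print(extension_penalty(4), extension_penalty(5))
```

With these assumed values both lines give the same cost at length 4, and the shallow line takes over from length 5 onward.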
List of evaluated programs
| program | version | option |
| PRIME | blosum62, | |
| PRIME | blosum62, | |
| Prrn [9] | 3.4 | -b2 -mblosum62 -u1 -v9 |
| MAFFT* [15] | 5.662 | --maxiterate 1000 --localpair (L-INS-i) |
| ProbCons* [16] | 1.09 | default |
| T-Coffee* [27] | 2.02 | default |
| MUSCLE [8] | 3.52 | default |
| DIALIGN-T [28] | 0.2.1 | default |
| POA [5] | 2 | -do_global -do_progressive blosum80_trunc.mat |
| ClustalW [7] | 1.83 | default |
Programs with * employ pairwise alignment information when calculating multiple alignment. Parameters or options of each program other than the gap penalty ones are chosen to obtain as accurate an alignment as possible.
BAliBASE version 3.0 contents
| reference | no. of alignments | characteristic of alignment |
| Reference 1.1 | 37 | phylogenetically equidistant (less than 20% identity) |
| Reference 1.2 | 42 | phylogenetically equidistant (20 to 40% identity) |
| Reference 2 | 39 | families including orphan sequences |
| Reference 3 | 29 | equidistant families (less than 25% identity) |
| Reference 4 | 48 | long N/C terminal extensions (excluded from homologous region set) |
| Reference 5 | 14 | long internal insertions |
Average sum-of-pairs scores of full length set
| program | Ref. 1.1 | Ref. 1.2 | Ref. 2 | Ref. 3 | Ref. 4 | Ref. 5 | Overall | Ranksum |
| PRIME | 0.643 | 0.933 | 0.922 | 0.859 | 0.910 | 0.882 | 0.861 | 809 |
| PRIME | 0.635 | 0.931 | 0.898 | 0.851 | 0.882 | 0.871 | 0.846 | 912 |
| Prrn | 0.574 | 0.923 | 0.901 | 0.820 | 0.859 | 0.821 | 0.821 | 1055 |
| MAFFT | 0.671 | 0.938 | 0.923 | 0.852 | 0.918 | 0.892 | 0.868 | 656 |
| ProbCons | 0.648 | 0.942 | 0.905 | 0.835 | 0.887 | 0.879 | 0.851 | 764 |
| T-Coffee | 0.613 | 0.933 | 0.916 | 0.826 | 0.900 | 0.858 | 0.846 | 884 |
| MUSCLE | 0.570 | 0.909 | 0.888 | 0.808 | 0.857 | 0.839 | 0.815 | 1260 |
| DIALIGN-T | 0.489 | 0.888 | 0.859 | 0.744 | 0.817 | 0.780 | 0.768 | 1668 |
| POA | 0.474 | 0.857 | 0.857 | 0.733 | 0.805 | 0.754 | 0.753 | 1804 |
| ClustalW | 0.497 | 0.864 | 0.848 | 0.722 | 0.786 | 0.713 | 0.748 | 1682 |
Each column shows the average sum-of-pairs score over all alignments of each reference of the full length set. The Overall and Ranksum columns show the average sum-of-pairs score and the rank sum of the Friedman test over all alignments of the whole full length set, respectively. A smaller rank sum means better accuracy.
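For readers unfamiliar with the metric, a sum-of-pairs score can be sketched as the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The code below is a simplified illustration under that assumption; it scores every reference column rather than only annotated core blocks, so it is not a substitute for the scoring program distributed with BAliBASE. The function names and toy sequences are hypothetical.

```python
from itertools import combinations


def residue_pairs(alignment):
    """Return the set of aligned residue pairs in an alignment.

    alignment is a list of equal-length gapped strings ('-' marks a gap).
    A residue is identified as (sequence index, position in the ungapped sequence).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        # 'present' is ordered by sequence index, so pairs are canonical.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref = residue_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & residue_pairs(test)) / len(ref)


if __name__ == "__main__":
    reference = ["AC-GT", "ACAGT"]
    test = ["ACG-T", "ACAGT"]
    print(sum_of_pairs_score(test, reference))  # 0.75
```

In the toy example the test alignment misplaces one G, so three of the four reference pairs are recovered.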
Average column scores of full length set
| program | Ref. 1.1 | Ref. 1.2 | Ref. 2 | Ref. 3 | Ref. 4 | Ref. 5 | Overall | Ranksum |
| PRIME | 0.416 | 0.839 | 0.445 | 0.566 | 0.573 | 0.552 | 0.572 | 846 |
| PRIME | 0.391 | 0.826 | 0.413 | 0.539 | 0.483 | 0.496 | 0.531 | 958 |
| Prrn | 0.334 | 0.791 | 0.406 | 0.469 | 0.491 | 0.411 | 0.499 | 1080 |
| MAFFT | 0.449 | 0.839 | 0.436 | 0.560 | 0.607 | 0.544 | 0.583 | 759 |
| ProbCons | 0.401 | 0.851 | 0.374 | 0.462 | 0.530 | 0.509 | 0.532 | 847 |
| T-Coffee | 0.324 | 0.832 | 0.384 | 0.459 | 0.563 | 0.534 | 0.525 | 1017 |
| MUSCLE | 0.313 | 0.795 | 0.343 | 0.380 | 0.460 | 0.408 | 0.465 | 1246 |
| DIALIGN-T | 0.246 | 0.723 | 0.290 | 0.347 | 0.462 | 0.389 | 0.423 | 1554 |
| POA | 0.224 | 0.678 | 0.265 | 0.343 | 0.413 | 0.323 | 0.389 | 1690 |
| ClustalW | 0.221 | 0.707 | 0.219 | 0.271 | 0.404 | 0.237 | 0.368 | 1497 |
Each column shows the average column score over all alignments of each reference of the full length set. The Overall and Ranksum columns show the average column score and the rank sum of the Friedman test over all alignments of the whole full length set, respectively. A smaller rank sum means better accuracy.
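A column score is stricter than a sum-of-pairs score: a reference column counts only if the test alignment reproduces it in full. The sketch below assumes the simplest reading, in which a column matches only when every sequence contributes the same residue (or a gap) in both alignments; the benchmark's own scoring program may treat gaps and core blocks differently. Function names and sequences are hypothetical.

```python
def alignment_columns(alignment):
    """Describe each column as a tuple of residue positions (None for a gap).

    alignment is a list of equal-length gapped strings ('-' marks a gap).
    """
    counters = [0] * len(alignment)
    cols = []
    for col in range(len(alignment[0])):
        key = []
        for s, row in enumerate(alignment):
            if row[col] == "-":
                key.append(None)
            else:
                key.append(counters[s])
                counters[s] += 1
        cols.append(tuple(key))
    return cols


def column_score(test, reference):
    """Fraction of reference columns found unchanged in the test alignment."""
    ref_cols = [c for c in alignment_columns(reference)
                if any(p is not None for p in c)]
    if not ref_cols:
        return 0.0
    test_cols = set(alignment_columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)


if __name__ == "__main__":
    reference = ["AC-GT", "ACAGT"]
    test = ["ACG-T", "ACAGT"]
    print(column_score(test, reference))  # 0.6
```

Here one misplaced residue breaks two of the five reference columns, which shows why column scores in the table are consistently lower than the corresponding sum-of-pairs scores.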
Figure 2. Score differences between PRIME and PRIME on full length set. The horizontal axis denotes the reference alignment ID, and the vertical axis denotes the difference in sum-of-pairs or column scores on the respective alignments of the full length set using PRIME and PRIME. A positive difference for an alignment indicates that PRIME performs better than PRIME on that alignment, and vice versa.
Average sum-of-pairs scores of homologous region set
| program | Ref. 1.1 | Ref. 1.2 | Ref. 2 | Ref. 3 | Ref. 5 | Overall | Ranksum |
| PRIME | 0.772 | 0.940 | 0.955 | 0.903 | 0.891 | 0.894 | 613 |
| PRIME | 0.781 | 0.938 | 0.954 | 0.907 | 0.896 | 0.897 | 634 |
| Prrn | 0.763 | 0.936 | 0.954 | 0.894 | 0.887 | 0.889 | 698 |
| MAFFT | 0.753 | 0.940 | 0.946 | 0.890 | 0.897 | 0.886 | 654 |
| ProbCons | 0.788 | 0.953 | 0.953 | 0.910 | 0.907 | 0.904 | 489 |
| T-Coffee | 0.704 | 0.939 | 0.940 | 0.878 | 0.888 | 0.870 | 821 |
| MUSCLE | 0.735 | 0.931 | 0.943 | 0.882 | 0.870 | 0.875 | 907 |
| DIALIGN-T | 0.573 | 0.901 | 0.897 | 0.793 | 0.821 | 0.798 | 1406 |
| POA | 0.634 | 0.877 | 0.923 | 0.822 | 0.800 | 0.816 | 1370 |
| ClustalW | 0.664 | 0.905 | 0.922 | 0.816 | 0.788 | 0.827 | 1263 |
Each column shows the average sum-of-pairs score over all alignments of each reference of the homologous region set. The Overall and Ranksum columns show the average sum-of-pairs score and the rank sum of the Friedman test over all alignments of the whole homologous region set, respectively. A smaller rank sum means better accuracy.
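The Ranksum columns can be read as follows: for each alignment the programs are ranked by score (rank 1 for the best), and each program's ranks are summed over all alignments. The sketch below assumes that tied programs share the average of the ranks they occupy, which is the usual convention for the Friedman test; whether the source handled ties this way is not stated. The program names and scores are hypothetical.

```python
def rank_sums(scores):
    """Sum per-alignment ranks for each program (rank 1 = highest score).

    scores maps a program name to a list of per-alignment scores; all lists
    must have the same length. Ties receive the average of the tied ranks.
    """
    programs = list(scores)
    n_alignments = len(scores[programs[0]])
    totals = {p: 0.0 for p in programs}
    for i in range(n_alignments):
        ordered = sorted(programs, key=lambda p: scores[p][i], reverse=True)
        pos = 0
        while pos < len(ordered):
            # Extend 'end' over the run of programs tied with ordered[pos].
            end = pos
            while (end + 1 < len(ordered)
                   and scores[ordered[end + 1]][i] == scores[ordered[pos]][i]):
                end += 1
            average_rank = (pos + end) / 2 + 1
            for p in ordered[pos:end + 1]:
                totals[p] += average_rank
            pos = end + 1
    return totals


if __name__ == "__main__":
    toy = {
        "prog_a": [0.9, 0.8, 0.7],
        "prog_b": [0.8, 0.8, 0.9],
        "prog_c": [0.5, 0.6, 0.6],
    }
    print(rank_sums(toy))  # {'prog_a': 4.5, 'prog_b': 4.5, 'prog_c': 9.0}
```

Because ranks rather than raw scores are summed, a program can have a higher average score yet a worse rank sum if its advantage comes from a few alignments.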
Average column scores of homologous region set
| program | Ref. 1.1 | Ref. 1.2 | Ref. 2 | Ref. 3 | Ref. 5 | Overall | Ranksum |
| PRIME | 0.588 | 0.849 | 0.595 | 0.636 | 0.575 | 0.665 | 640 |
| PRIME | 0.589 | 0.847 | 0.582 | 0.648 | 0.593 | 0.666 | 640 |
| Prrn | 0.561 | 0.834 | 0.601 | 0.630 | 0.558 | 0.654 | 708 |
| MAFFT | 0.552 | 0.846 | 0.532 | 0.631 | 0.578 | 0.640 | 670 |
| ProbCons | 0.591 | 0.875 | 0.540 | 0.625 | 0.583 | 0.658 | 548 |
| T-Coffee | 0.476 | 0.840 | 0.491 | 0.625 | 0.540 | 0.607 | 822 |
| MUSCLE | 0.496 | 0.823 | 0.496 | 0.574 | 0.501 | 0.596 | 946 |
| DIALIGN-T | 0.338 | 0.761 | 0.370 | 0.452 | 0.429 | 0.485 | 1339 |
| POA | 0.390 | 0.712 | 0.424 | 0.459 | 0.371 | 0.493 | 1348 |
| ClustalW | 0.416 | 0.791 | 0.443 | 0.475 | 0.394 | 0.529 | 1193 |
Each column shows the average column score over all alignments of each reference of the homologous region set. The Overall and Ranksum columns show the average column score and the rank sum of the Friedman test over all alignments of the whole homologous region set, respectively. A smaller rank sum means better accuracy.
Figure 3. Score differences between PRIME and PRIME on homologous region set. The horizontal axis denotes the reference alignment ID, and the vertical axis denotes the difference in sum-of-pairs or column scores on the respective alignments of the homologous region set using PRIME and PRIME. A positive difference for an alignment indicates that PRIME performs better than PRIME on that alignment, and vice versa.
Figure 4. Average sum-of-pairs score differences between full length and homologous region sets. Each point shows the average sum-of-pairs score difference over the respective alignments of each reference of the full length and homologous region sets. PRIME denotes PRIME, and PRIME, PRIME. A smaller absolute value indicates that the introduction of long terminal indels affects the alignment accuracy of a program less.
Figure 5. Average column score differences between full length and homologous region sets. Each point shows the average column score difference over the respective alignments of each reference of the full length and homologous region sets. PRIME denotes PRIME, and PRIME, PRIME. A smaller absolute value indicates that the introduction of long terminal indels affects the alignment accuracy of a program less.
Average quality scores of PREFAB
| | Main | | Weighting | | Long gap | |
| program | QS | Ranksum | QS | Ranksum | QS | Ranksum |
| PRIME | 0.721 | 8151 | 0.649 | 588 | 0.658 | 1408 |
| PRIME | 0.718 | 8355 | 0.637 | 617 | 0.651 | 1504 |
| Prrn | 0.722 | 8120 | 0.624 | 621 | 0.653 | 1455 |
| MAFFT | 0.722 | 7744 | 0.639 | 585 | 0.660 | 1352 |
| ProbCons | 0.705 | 8659 | 0.620 | 594 | 0.637 | 1443 |
| T-Coffee | 0.700 | 9126 | 0.627 | 584 | 0.631 | 1640 |
| MUSCLE | 0.680 | 10446 | 0.607 | 642 | 0.596 | 1918 |
| DIALIGN-T | 0.621 | 13277 | 0.587 | 754 | 0.541 | 2506 |
| POA | 0.603 | 14662 | 0.554 | 868 | 0.513 | 2789 |
| ClustalW | 0.617 | 12952 | 0.603 | 650 | 0.519 | 2583 |
| PSA | 0.591 | 14525 | 0.638 | 627 | 0.498 | 2804 |
| PSA | 0.581 | 14789 | 0.621 | 670 | 0.489 | 2856 |
The QS and Ranksum columns show the average quality score and the rank sum of the Friedman test over the quality scores of all alignments of each reference set, respectively. A smaller rank sum means better accuracy.
Computation time
| program | BAliBASE (sec.) | PREFAB (sec.) |
| PRIME | 9.4 × 10^5 | 5.5 × 10^5 |
| PRIME | 4.9 × 10^5 | 4.3 × 10^5 |
| Prrn | 6.8 × 10^5 | 1.9 × 10^5 |
| MAFFT | 1.9 × 10^4 | 2.7 × 10^4 |
| ProbCons | 6.4 × 10^4 | 1.9 × 10^5 |
| T-Coffee | 7.2 × 10^5 | 2.0 × 10^6 |
| MUSCLE | 7.9 × 10^3 | 1.6 × 10^4 |
| DIALIGN-T | 3.0 × 10^4 | 1.2 × 10^5 |
| POA | 1.0 × 10^4 | 2.6 × 10^4 |
| ClustalW | 8.3 × 10^3 | 2.7 × 10^4 |
The BAliBASE column shows the total time (sec.) each program took to construct all alignments of the full length and homologous region sets; the PREFAB column shows the total time to compute all alignments of the main and weighting sets only.