Kazutaka Katoh, Kei-ichi Kuma, Hiroyuki Toh, Takashi Miyata.
Abstract
The accuracy of the multiple sequence alignment program MAFFT has been improved. The new version (5.3) of MAFFT offers new iterative refinement options, H-INS-i, F-INS-i and G-INS-i, in which pairwise alignment information is incorporated into the objective function. These new options showed higher accuracy than currently available methods, including TCoffee version 2 and CLUSTAL W, in benchmark tests consisting of alignments of >50 sequences. Like the previously available options, the new options can handle hundreds of sequences on a standard desktop computer. We also examined the effect of the number of homologues included in an alignment. For a multiple alignment consisting of approximately 8 sequences with low similarity, the accuracy improved by 2-10 percentage points when the sequences were aligned together with dozens of their close homologues (E-value < 10^-5 to 10^-20) collected from a database. Such improvement was generally observed for most methods, but was remarkably large for the new options of MAFFT proposed here. We therefore made a Ruby script, mafftE.rb, which aligns the input sequences together with their close homologues collected from SwissProt using NCBI-BLAST.
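The homologue-collection step described in the abstract can be illustrated with a minimal sketch. It is not the mafftE.rb script itself; it only assumes that BLAST hits are available in the standard 12-column tabular format (subject ID in column 2, E-value in column 11). The function name `select_homologues`, the `limit` parameter and the `HYPO_*` identifiers are hypothetical and exist only for this illustration.

```python
from typing import Dict, List


def select_homologues(blast_tabular: str,
                      max_evalue: float = 1e-5,
                      limit: int = 50) -> List[str]:
    """Return up to `limit` distinct subject IDs with E-value below `max_evalue`.

    Subjects are ranked by their best (smallest) E-value over all queries.
    """
    best: Dict[str, float] = {}  # subject ID -> smallest E-value seen so far
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject, evalue = fields[1], float(fields[10])
        if evalue < max_evalue and evalue < best.get(subject, float("inf")):
            best[subject] = evalue
    ranked = sorted(best, key=lambda s: best[s])
    return ranked[:limit]


# Hypothetical hits, invented only to exercise the function.
example_hits = "\n".join([
    "query1\tsp|HYPO_A\t62.0\t300\t100\t4\t1\t300\t1\t300\t1e-40\t250",
    "query1\tsp|HYPO_B\t35.0\t280\t170\t9\t5\t280\t3\t282\t3e-08\t90",
    "query1\tsp|HYPO_C\t28.0\t120\t80\t6\t10\t130\t20\t140\t0.002\t40",
    "query2\tsp|HYPO_A\t55.0\t290\t120\t6\t1\t290\t1\t290\t1e-30\t200",
])

print(select_homologues(example_hits, max_evalue=1e-5))
# prints ['sp|HYPO_A', 'sp|HYPO_B']
```

The selected sequences would then be appended to the input set before alignment. Tightening `max_evalue` toward 10^-20 keeps only the closest homologues, which corresponds to the stricter end of the threshold range quoted in the abstract.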
Year: 2005 PMID: 15661851 PMCID: PMC548345 DOI: 10.1093/nar/gki198
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Table 1. Options of version 5.3 (upper) and the previous version (lower) of MAFFT
(a) All pairwise alignments are computed by global alignment with an FFT approximation. The FFT approximation is disabled in the progressive alignment stage.
(b) All pairwise alignments are computed with FASTA (25) with the Smith–Waterman optimization.
(c) All pairwise alignments are computed with FASTA (25) without the Smith–Waterman optimization.
(d) Distance matrix is calculated based on the number of 6-tuples shared by two sequences (1,10).
(e) UPGMA tree with a modified linkage (for details see Supplementary Material).
(f) Guide tree is recalculated based on the first alignment and progressive alignment is re-performed (1,13).
(g) ‘Importance’ (I) value is considered as described in the text.
(h) WSP score is optimized through the iterative refinement (14).
Figure 1. The CPU times required for various sizes of alignments. Sequences were generated using the ROSE program (29). (A and B) Average length (L) of input sequences versus CPU time. The number of sequences is 40. Average distance among input sequences is 100 PAM (A) (percentage identity ∼ 35–85) or 250 PAM (B) (percentage identity ∼ 15–65). (C and D) The number of input sequences (N) versus CPU time. Average sequence length is 300. Average distance among input sequences is 100 PAM (C) or 250 PAM (D). See Table 1 for command-line options for each strategy in MAFFT. Options of other programs are as follows:
TCoffee, default;
PROBCONS, default;
CLUSTAL W, default;
MUSCLE-i, muscle -maxiters 16;
MUSCLE-2, muscle -maxiters 1;
MUSCLE-fast, muscle -sv -maxiters 1 -diags1 -distance1 kbit20_3.
Table 2. Comparison of the performance of several methods based on 55 alignments in the HOM tests
| Method | Dataset | Accuracy (%) | Improvement | CPU time (s) |
|---|---|---|---|---|
| G-INS-i | HOM+0 | 42.58 | — | 44.31 |
| G-INS-i | HOM+20 | **52.06** | +9.48 | 182.3 |
| G-INS-i | HOM+50 | **53.85** | +11.2 | 514.2 |
| G-INS-i | HOM+100 | **54.61** | +12.0 | 1405 |
| H-INS-i | HOM+0 | 43.20 | — | 38.68 |
| H-INS-i | HOM+20 | 49.56 | +6.36 | 151.2 |
| H-INS-i | HOM+50 | 53.37 | +10.2 | 426.8 |
| H-INS-i | HOM+100 | 53.29 | +10.1 | 1110 |
| F-INS-i | HOM+0 | 43.14 | — | 32.06 |
| F-INS-i | HOM+20 | 51.26 | +8.12 | 122.0 |
| F-INS-i | HOM+50 | 53.72 | +10.6 | 342.0 |
| F-INS-i | HOM+100 | 53.57 | +10.4 | 758.4 |
| H-INS-1 | HOM+0 | 38.55 | — | 14.30 |
| H-INS-1 | HOM+20 | 46.00 | +7.45 | 73.81 |
| H-INS-1 | HOM+50 | 48.80 | +10.3 | 237.9 |
| H-INS-1 | HOM+100 | 48.35 | +9.80 | 636.6 |
| FFT-NS-i | HOM+0 | 43.57 | — | 32.57 |
| FFT-NS-i | HOM+20 | 49.57 | +6.00 | 73.84 |
| FFT-NS-i | HOM+50 | 50.68 | +7.11 | 155.87 |
| FFT-NS-i | HOM+100 | 50.73 | +7.16 | 365.8 |
| FFT-NS-2 | HOM+0 | 35.94 | — | 6.22 |
| FFT-NS-2 | HOM+20 | 45.06 | +9.12 | 15.23 |
| FFT-NS-2 | HOM+50 | 44.42 | +8.48 | 26.46 |
| FFT-NS-2 | HOM+100 | 43.61 | +7.67 | 43.46 |
| PROBCONS 1.06 | HOM+0 | **47.95** | — | 91.13 |
| PROBCONS 1.06 | HOM+20 | 51.78 | +3.83 | 590.1 |
| PROBCONS 1.06 | HOM+50 | 51.59 | +3.64 | 2237 |
| PROBCONS 1.06 | HOM+100 | 51.81 | +3.86 | 7634 |
| MUSCLE-i 3.41 | HOM+0 | 43.44 | — | 37.20 |
| MUSCLE-i 3.41 | HOM+20 | 45.94 | +2.50 | 113.6 |
| MUSCLE-i 3.41 | HOM+50 | 46.90 | +3.46 | 403.7 |
| MUSCLE-i 3.41 | HOM+100 | 48.07 | +4.63 | 719.4 |
| TCoffee 2.02 | HOM+0 | 43.49 | — | 486.4 |
| TCoffee 2.02 | HOM+20 | 48.26 | +4.77 | 5007 |
| TCoffee 2.02 | HOM+50 | 49.71 | +6.22 | 28 250 |
| TCoffee 2.02 | HOM+100 | 49.94 | +6.45 | 71 390 |
| CLUSTAL W 1.83 | HOM+0 | 36.77 | — | 16.29 |
| CLUSTAL W 1.83 | HOM+20 | 36.57 | −0.20 | 87.98 |
| CLUSTAL W 1.83 | HOM+50 | 37.33 | +0.56 | 242.5 |
| CLUSTAL W 1.83 | HOM+100 | 36.77 | +0.00 | 620.6 |
The highest accuracy value within each dataset is in boldface.
(a) The difference from the highest accuracy was shown to be significant (P < 0.01) by both the Wilcoxon test and the Friedman test.
(b) The Wilcoxon test showed a significant difference but the Friedman test did not.
(c) The improvement of score from HOM+0 was shown to be significant by both the Wilcoxon test and the Friedman test.
(d) The Wilcoxon test showed a significant improvement but the Friedman test did not.
See Table 1 for command-line options for each method in MAFFT. The command-line option for MUSCLE-i is muscle -maxiters 1000.
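As a worked example of how the Improvement column relates to the accuracy values, the following sketch recomputes it for G-INS-i as the difference from the HOM+0 baseline, and also reports how much the CPU time grows. The numbers are copied from the HOM table; the function name `summarize` is an assumption made for this illustration, and the CPU-time ratio is an extra quantity not printed in the table.

```python
# Accuracy (%) and CPU time (s) for G-INS-i, copied from the HOM table.
g_ins_i = {
    "HOM+0": (42.58, 44.31),
    "HOM+20": (52.06, 182.3),
    "HOM+50": (53.85, 514.2),
    "HOM+100": (54.61, 1405.0),
}


def summarize(rows, baseline="HOM+0"):
    """Map each non-baseline dataset to (accuracy gain in points, CPU-time ratio)."""
    base_acc, base_time = rows[baseline]
    summary = {}
    for dataset, (acc, seconds) in rows.items():
        if dataset == baseline:
            continue
        gain = round(acc - base_acc, 2)          # percentage points over baseline
        slowdown = round(seconds / base_time, 1)  # how many times more CPU time
        summary[dataset] = (gain, slowdown)
    return summary


for dataset, (gain, slowdown) in summarize(g_ins_i).items():
    print(f"{dataset}: +{gain:.2f} points, {slowdown:.1f}x CPU time")
```

Differences computed this way from the rounded accuracies can deviate slightly in the last digit from the printed Improvement values (for HOM+50 the sketch gives 11.27 where the table lists +11.2), presumably because the published column was derived from unrounded scores.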
Table 3. Comparison of the performance of several methods based on 209 alignments in the TWI tests
| Method | Dataset | TWIs | TWIs improvement | TWIf | TWIf improvement | CPU time (s) |
|---|---|---|---|---|---|---|
| G-INS-i | TWI+0 | 20.73 | — | 41.68 | — | 232.1 |
| G-INS-i | TWI+20 | **27.38** | +6.65 | 47.00 | +5.32 | 747.4 |
| G-INS-i | TWI+50 | **29.58** | +8.85 | **51.11** | +9.43 | 1724 |
| H-INS-i | TWI+0 | **23.36** | — | 42.78 | — | 154.1 |
| H-INS-i | TWI+20 | 26.30 | +2.94 | **47.40** | +4.62 | 467.0 |
| H-INS-i | TWI+50 | 27.87 | +4.51 | 50.29 | +7.51 | 1102 |
| F-INS-i | TWI+0 | 22.03 | — | 43.21 | — | 155.8 |
| F-INS-i | TWI+20 | 25.80 | +3.77 | 47.12 | +3.91 | 405.1 |
| F-INS-i | TWI+50 | 27.25 | +5.22 | 47.59 | +4.38 | 882.8 |
| H-INS-1 | TWI+0 | 18.28 | — | 38.20 | — | 30.29 |
| H-INS-1 | TWI+20 | 22.48 | +4.20 | 43.81 | +5.61 | 144.6 |
| H-INS-1 | TWI+50 | 24.77 | +6.49 | 45.76 | +7.56 | 460.6 |
| FFT-NS-i | TWI+0 | 18.16 | — | 37.46 | — | 124.1 |
| FFT-NS-i | TWI+20 | 21.64 | +3.48 | 40.88 | +2.29 | 303.8 |
| FFT-NS-i | TWI+50 | 22.76 | +4.60 | 44.85 | +7.49 | 565.6 |
| FFT-NS-2 | TWI+0 | 12.89 | — | 30.27 | — | 19.41 |
| FFT-NS-2 | TWI+20 | 16.14 | +3.25 | 33.59 | +3.32 | 44.54 |
| FFT-NS-2 | TWI+50 | 17.49 | +4.60 | 37.08 | +6.87 | 77.36 |
| PROBCONS 1.06 | TWI+0 | 22.06 | — | **44.48** | — | 234.0 |
| PROBCONS 1.06 | TWI+20 | 22.79 | +0.73 | 43.81 | −0.67 | 1747 |
| PROBCONS 1.06 | TWI+50 | 22.53 | +0.47 | 44.86 | +0.38 | 6889 |
| MUSCLE-i 3.41 | TWI+0 | 15.67 | — | 36.38 | — | 382.3 |
| MUSCLE-i 3.41 | TWI+20 | 17.98 | +2.31 | 36.68 | +0.30 | 999.9 |
| MUSCLE-i 3.41 | TWI+50 | 19.61 | +3.94 | 38.17 | +1.79 | 2152 |
| TCoffee 2.02 | TWI+0 | 21.80 | — | 44.20 | — | 1378 |
| TCoffee 2.02 | TWI+20 | 22.81 | +1.01 | 44.56 | +0.36 | 13 900 |
| TCoffee 2.02 | TWI+50 | 21.85 | +0.05 | 45.18 | +0.98 | 82 200 |
| CLUSTAL W 1.83 | TWI+0 | 12.76 | — | 34.28 | — | 31.52 |
| CLUSTAL W 1.83 | TWI+20 | 11.72 | −1.04 | 33.59 | −0.69 | 152.7 |
| CLUSTAL W 1.83 | TWI+50 | 12.91 | +0.15 | 34.95 | +0.67 | 458.8 |
See the footnote of Table 2.
Table 4. Accuracy values of several methods for the PREFAB tests
| Method | Identity 0–20% | Identity 20–40% | Identity 40–70% | Identity 70–100% | All | CPU time (s) |
|---|---|---|---|---|---|---|
| G-INS-i | 46.75 | 82.77 | **96.30** | 98.60 | 68.85 | 16 030 |
| H-INS-i | **48.22** | 83.32 | 95.83 | **98.64** | **69.70** | 15 060 |
| F-INS-i | 48.00 | **83.35** | 95.83 | 98.62 | 69.61 | 9007 |
| H-INS-1 | 46.13 | 82.16 | 95.42 | 98.60 | 68.27 | 9910 |
| FFT-NS-i | 45.72 | 80.95 | 93.83 | 98.55 | 67.48 | 4176 |
| FFT-NS-2 | 43.19 | 79.52 | 93.23 | 98.47 | 65.74 | 930.4 |
| FFT-NS-1 | 40.90 | 77.50 | 93.46 | 98.59 | 63.92 | 666.2 |
| TCoffee 2.02 | 45.30 | 82.36 | 95.20 | 98.62 | 67.96 | 973 600 |
| PROBCONS 1.06 | 45.63 | 82.10 | 95.01 | 98.18 | 67.95 | 142 200 |
| MUSCLE-i | 42.77 | 80.43 | 95.43 | 98.28 | 66.05 | 13 260 |
| MUSCLE-fast | 38.44 | 76.65 | 93.42 | 97.91 | 62.44 | 544.1 |
| CLUSTAL W (default) | 33.96 | 74.14 | 93.54 | 97.85 | 59.45 | 12 970 |
The highest accuracy value within each percent identity range is in bold letters.
(a) The difference from the highest accuracy was found to be significant (P < 0.01) by both the Wilcoxon test and the Friedman test.
(b) The Wilcoxon test showed a significant difference but the Friedman test did not.
See Table 1 for command-line options for each strategy in MAFFT. Options of other programs are as follows:
TCoffee, default;
PROBCONS, default;
MUSCLE-i, muscle -maxiters 1000;
MUSCLE-fast, muscle -sv -maxiters 1 -diags1 -distance1 kbit20_3;
CLUSTAL W (default), default.
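One way to read the PREFAB results is to ask which methods are not beaten on both overall accuracy and CPU time by any other method. The sketch below does this using the "All" accuracy and CPU time values copied from the table. This comparison is not part of the original analysis, and the function name `undominated_methods` is an assumption made for this illustration.

```python
# (overall accuracy %, CPU time in seconds), copied from the PREFAB table.
prefab = {
    "G-INS-i": (68.85, 16030.0),
    "H-INS-i": (69.70, 15060.0),
    "F-INS-i": (69.61, 9007.0),
    "H-INS-1": (68.27, 9910.0),
    "FFT-NS-i": (67.48, 4176.0),
    "FFT-NS-2": (65.74, 930.4),
    "FFT-NS-1": (63.92, 666.2),
    "TCoffee 2.02": (67.96, 973600.0),
    "PROBCONS 1.06": (67.95, 142200.0),
    "MUSCLE-i": (66.05, 13260.0),
    "MUSCLE-fast": (62.44, 544.1),
    "CLUSTAL W (default)": (59.45, 12970.0),
}


def undominated_methods(results):
    """Return methods, fastest first, that no faster method matches in accuracy."""
    # Sort by CPU time (ascending); break ties by accuracy (descending).
    ordered = sorted(results.items(), key=lambda kv: (kv[1][1], -kv[1][0]))
    kept, best_accuracy = [], float("-inf")
    for name, (accuracy, _seconds) in ordered:
        if accuracy > best_accuracy:  # strictly better than every faster method
            kept.append(name)
            best_accuracy = accuracy
    return kept


print(undominated_methods(prefab))
# prints ['MUSCLE-fast', 'FFT-NS-1', 'FFT-NS-2', 'FFT-NS-i', 'F-INS-i', 'H-INS-i']
```

Under this simple criterion, every other entry in the table is both slower and less accurate overall than at least one of the listed methods. The criterion ignores statistical significance and the per-identity-range columns, so it is only a rough summary.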