Kazunori D Yamada, Kentaro Tomii, Kazutaka Katoh.
Abstract
MOTIVATION: Large multiple sequence alignments (MSAs), consisting of thousands of sequences, are becoming more and more common, due to advances in sequencing technologies. The MAFFT MSA program has several options for building large MSAs, but their performances have not been sufficiently assessed yet, because realistic benchmarking of large MSAs has been difficult. Recently, such assessments have been made possible through the HomFam and ContTest benchmark protein datasets. Along with the development of these datasets, an interesting theory was proposed: chained guide trees increase the accuracy of MSAs of structurally conserved regions. This theory challenges the basis of progressive alignment methods and needs to be examined by being compared with other known methods including computationally intensive ones.
Year: 2016 PMID: 27378296 PMCID: PMC5079479 DOI: 10.1093/bioinformatics/btw412
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Performance of various options of MAFFT and other methods on HomFam and OXFam
| | HomFam Small | HomFam Medium | HomFam Large | HomFam All | OXFam Small | OXFam Medium | OXFam Large | OXFam All |
|---|---|---|---|---|---|---|---|---|
| Number of sequences in an entry | <3000 | 3000–10 000 | >10 000 | | <3000 | 3000–10 000 | >10 000 | |
| Number of entries | 38 | 32 | 19 | | 74 | 59 | 32 | |
| Mean SP score | | | | | | | | |
| MAFFT—Normaltree (a) | 0.9074 | 0.8957 | 0.7795 | | 0.9300 | 0.8853 | 0.8207 | |
| MAFFT—Randomchain (b) | 0.8699 | 0.8671 | 0.7106 | | 0.9010 | 0.8571 | 0.7674 | |
| MAFFT—Iterative (c) | 0.9105 | 0.9004 | 0.7945 | | 0.9438 | 0.9058 | 0.8364 | |
| MAFFT—Iterative (d) | 0.9267 | 0.9167 | 0.8045 | | 0.9441 | 0.9228 | 0.8391 | |
| MAFFT—Iterative (e) | 0.9405 | 0.9228 | 0.8159 | | 0.9533 | 0.9257 | 0.8427 | |
| Clustal Omega (f) | 0.9148 | 0.8693 | 0.6871 | | 0.9257 | 0.8735 | 0.7409 | |
| Clustal Omega—Full (g) | 0.9088 | 0.8806 | 0.6692 | | 0.9244 | 0.8595 | 0.7440 | |
| Clustal Omega—Randomchain (h) | 0.8798 | 0.8309 | – | – | 0.8905 | 0.8477 | – | – |
| UPP—Fast (i) | 0.8616 | 0.8407 | 0.7700 | | 0.9327 | 0.8940 | 0.7878 | |
| UPP—Default (j) | 0.8678 | 0.8708 | 0.7956 | | 0.9415 | 0.9068 | 0.8211 | |
| MAFFT—G-INS-1 (k) | 0.9358 | 0.9520 | 0.8844 | | 0.9572 | 0.9485 | 0.8749 | |
| Mean TC score | | | | | | | | |
| MAFFT—Normaltree (a) | 0.7806 | 0.7298 | 0.5646 | | 0.9016 | 0.8253 | 0.7404 | |
| MAFFT—Randomchain (b) | 0.7315 | 0.6967 | 0.4932 | | 0.8622 | 0.7989 | 0.6827 | |
| MAFFT—Iterative (c) | 0.7939 | 0.7275 | 0.5943 | | 0.9143 | 0.8546 | 0.7613 | |
| MAFFT—Iterative (d) | 0.8315 | 0.7573 | 0.6148 | | 0.9155 | 0.8807 | 0.7645 | |
| MAFFT—Iterative (e) | 0.8628 | 0.7746 | 0.6283 | | 0.9319 | 0.8897 | 0.7707 | |
| Clustal Omega (f) | 0.8057 | 0.7152 | 0.4449 | | 0.8842 | 0.8118 | 0.6408 | |
| Clustal Omega—Full (g) | 0.8086 | 0.7365 | 0.4386 | | 0.8886 | 0.7839 | 0.6688 | |
| Clustal Omega—Randomchain (h) | 0.7580 | 0.6918 | – | – | 0.8452 | 0.7888 | – | – |
| UPP—Fast (i) | 0.7466 | 0.7087 | 0.5853 | | 0.9028 | 0.8535 | 0.7196 | |
| UPP—Default (j) | 0.7492 | 0.7570 | 0.6330 | | 0.9138 | 0.8676 | 0.7601 | |
| MAFFT—G-INS-1 (k) | 0.8549 | 0.8480 | 0.7441 | | 0.9358 | 0.9147 | 0.8212 | |
| Total CPU time (min) | | | | | | | | |
| MAFFT—Normaltree (a) | 2.9 | 36 | 420 | | 5.5 | 81 | 470 | |
| MAFFT—Randomchain (b) | 2.0 | 15 | 71 | | 3.4 | 30 | 96 | |
| MAFFT—Iterative (c) | 7.4 | 61 | 580 | | 11 | 130 | 660 | |
| MAFFT—Iterative (d) | 160 | 390 | 790 | | 190 | 640 | 1000 | |
| MAFFT—Iterative (e) | 810 | 2100 | 1500 | | 1100 | 3500 | 2800 | |
| Clustal Omega (f) | 21 | 160 | 300 | | 27 | 220 | 590 | |
| Clustal Omega—Full (g) | 44 | 570 | 5400 | | 82 | 1300 | 1700 | |
| Clustal Omega—Randomchain (h) | 130 | 18 000 | – | – | 170 | 4600 | – | – |
| UPP—Fast (i) | 53 | 190 | 260 | | 92 | 380 | 540 | |
| UPP—Default (j) | 360 | 1600 | 2400 | | 660 | 3300 | 4900 | |
| MAFFT—G-INS-1 (k) | (370) | (5200) | (44 000) | | (760) | (13 000) | (71 000) | |
Commands are: a, mafft input; b, mafft --randomchain --randomseed seed input; c–e, mafft-sparsecore.rb -s seed -p p -i input; f, clustalo -i input; g, clustalo --full -i input; h, clustalo --pileup -i input (sequence order randomized); i, run_upp.py -m amino -B 100 -s input; j, run_upp.py -m amino -s input; k, mafft --globalpair --thread 10 input. The order of input sequences was randomized in every sequence set. The results of “MAFFT—Randomchain” (b) and “MAFFT—Iterative” (c–e) were averaged over 100 and 10 replications, respectively, with different random-number seeds. The results of “UPP” (i, j) were averaged over 10 different runs. For “Clustal Omega—Randomchain” (h), only the results for the Small and Medium subsets are shown, because the calculation for the Large subset had not finished when this manuscript was compiled. The CPU time of “MAFFT—G-INS-1” (k) is shown in parentheses because it was run on different computer systems from the others.
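The randomization of input order and the averaging over seeded replications described in the note above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' scripts: the helper names `read_fasta`, `shuffled_fasta`, and `mean_over_replicates` are hypothetical, and `score_for_seed` stands in for whatever pipeline runs an aligner with a given seed and returns its benchmark score.

```python
import random
from statistics import mean


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def shuffled_fasta(text, seed):
    """Return FASTA text with the record order shuffled reproducibly."""
    records = read_fasta(text)
    random.Random(seed).shuffle(records)  # same seed, same order
    return "".join(f">{h}\n{s}\n" for h, s in records)


def mean_over_replicates(score_for_seed, seeds):
    """Average a score over runs that differ only in their random seed.

    score_for_seed is a hypothetical callable: seed -> float.
    """
    return mean(score_for_seed(seed) for seed in seeds)
```

Using a private `random.Random(seed)` instance rather than the module-level generator keeps each replication reproducible and independent of any other random draws in the same process.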
Performance of various options of MAFFT and other methods on ContTest
| | Small | Medium | Large | All |
|---|---|---|---|---|
| Number of sequences in an entry | <3000 | 3000–10 000 | >10 000 | |
| Number of entries | 15 | 70 | 51 | |
| Mean ContTest score | | | | |
| MAFFT—Normaltree (a) | 0.4081 | 0.4874 | 0.5439 | |
| MAFFT—Randomchain (b) | 0.4406 | 0.5227 | 0.5997 | |
| MAFFT—Iterative (c) | 0.3747 | 0.5005 | 0.5771 | |
| MAFFT—Iterative (d) | 0.3883 | 0.5180 | 0.6046 | |
| MAFFT—Iterative (e) | 0.3808 | 0.5237 | 0.6198 | |
| Clustal Omega (f) | 0.3039 | 0.4291 | 0.4262 | |
| Clustal Omega—Full (g) | 0.3080 | 0.4585 | 0.4640 | |
| Clustal Omega—Randomchain (h) | 0.4328 | 0.5324 | 0.5703 | |
| UPP—Fast (i) | 0.3515 | 0.5139 | 0.5744 | |
| UPP—Default (j) | 0.3555 | 0.5254 | 0.5936 | |
| MAFFT—G-INS-1 (k) | 0.3853 | 0.5445 | 0.6582 | |
| Total CPU time (min) | | | | |
| MAFFT—Normaltree (a) | 1.2 | 54 | 440 | |
| MAFFT—Randomchain (b) | 0.56 | 16 | 88 | |
| MAFFT—Iterative (c) | 2.2 | 77 | 650 | |
| MAFFT—Iterative (d) | 30 | 250 | 930 | |
| MAFFT—Iterative (e) | 160 | 1100 | 2100 | |
| Clustal Omega (f) | 5.0 | 130 | 460 | |
| Clustal Omega—Full (g) | 16 | 830 | 5600 | |
| Clustal Omega—Randomchain (h) | 28 | 6400 | 110 000 | |
| UPP—Fast (i) | 17 | 230 | 500 | |
| UPP—Default (j) | 130 | 2000 | 4600 | |
| MAFFT—G-INS-1 (k) | (110) | (7100) | (48 000) | |
a–k, the same commands were used as in Table 1. The input sequence order was not changed from that in the original data, because it appears to be already randomized.
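The Small, Medium, and Large subsets in the tables are defined by the number of sequences in an entry (<3000, 3000–10 000, >10 000). A minimal Python sketch of that binning and of a per-subset mean follows. The function names and any numbers passed to them are hypothetical, and treating exactly 3000 and exactly 10 000 sequences as Medium is an assumption read off the column headers.

```python
from collections import defaultdict
from statistics import mean


def size_class(n_sequences):
    """Assign an entry to a size subset by its number of sequences."""
    if n_sequences < 3000:
        return "Small"
    if n_sequences <= 10000:  # assumption: both boundaries count as Medium
        return "Medium"
    return "Large"


def mean_score_by_class(entries):
    """Average scores per subset.

    entries: iterable of (n_sequences, score) pairs, one per benchmark entry.
    Returns a dict mapping subset name to the mean score of its entries.
    """
    grouped = defaultdict(list)
    for n_sequences, score in entries:
        grouped[size_class(n_sequences)].append(score)
    return {name: mean(scores) for name, scores in grouped.items()}
```

Subsets with no entries are simply absent from the returned dictionary, which avoids taking the mean of an empty list.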