Timo Lassmann, Oliver Frings, Erik L L Sonnhammer.
Abstract
In the growing field of genomics, multiple alignment programs are confronted with ever increasing amounts of data. To address this growing issue we have dramatically improved the running time and memory requirement of Kalign, while maintaining its high alignment accuracy. Kalign version 2 also supports nucleotide alignment, and a newly introduced extension allows external sequence annotation to be included in the alignment procedure. We demonstrate that Kalign2 is exceptionally fast and memory-efficient, permitting accurate alignment of very large numbers of sequences. The accuracy of Kalign2 compares well to the best methods in the case of protein alignments, while its accuracy on nucleotide alignments is generally superior. In addition, we demonstrate the potential of using known or predicted sequence annotation to improve the alignment accuracy. Kalign2 is freely available for download from the Kalign web site (http://msa.sbc.su.se/).
Year: 2008 PMID: 19103665 PMCID: PMC2647288 DOI: 10.1093/nar/gkn1006
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Running time of several multiple alignment methods on four scenarios with simulated alignments of varying evolutionary distance (PAM = 100 and PAM = 250), increasing sequence length (L = 10–2000), and increasing number of sequences (N = 10–1500). In each case one parameter was varied (x-axis) while the other two were kept constant (plot heading). Kalign2 scales much better than most of the methods, especially with an increasing number of sequences. All tests were carried out on an AMD64 3200+ processor with 2 GB of RAM running Linux.
Memory requirement in megabytes for several alignment programs as a function of the number of sequences (N)
| Method | 10 | 50 | 100 | 150 | 200 | 250 | 300 | 350 | 400 | 450 | 500 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Kalign2 | 6.6 | 7.1 | 7.4 | 7.7 | 7.8 | 7.9 | 8.3 | 8.5 | 8.5 | 8.8 | 9.3 |
| Kalign1 | 8.6 | 20.6 | 38.0 | 60.6 | 80.5 | 100.3 | 127.1 | 151.6 | 176.0 | 193.5 | 225.6 |
| ClustalW | 7.9 | 8.1 | 8.3 | 8.8 | 9.2 | 9.5 | 10.4 | 10.9 | 11.4 | 12.2 | 13.1 |
| ClustalWqt2 | 7.9 | 8.1 | 8.4 | 8.8 | 9.2 | 9.5 | 10.4 | 10.9 | 11.4 | 12.2 | 13.1 |
| Muscle | 17.6 | 25.1 | 34.3 | 43.0 | 52.7 | 60.2 | 70.4 | 79.8 | 89.2 | 98.3 | 108.2 |
| Muscle_fast | 17.3 | 24.8 | 33.6 | 42.8 | 52.3 | 59.8 | 69.7 | 80.0 | 89.0 | 97.7 | 107.2 |
| T_Coffee_fast | 17.5 | 25.1 | 34.3 | 43.0 | 52.7 | 60.2 | 70.4 | 79.8 | 89.2 | 98.2 | 108.2 |
| G_INS_i | 146.9 | 149.6 | 153.4 | 161.7 | 169.9 | 179.0 | 199.2 | 206.4 | 213.8 | 246.4 | 266.2 |
| L_INS_i | 146.9 | 149.5 | 153.2 | 161.0 | 168.6 | 177.8 | 195.7 | 203.8 | 211.8 | 241.4 | 260.8 |
| FFT_NS_2 | 140.1 | 140.4 | 141.2 | 141.7 | 142.8 | 142.7 | 144.5 | 144.9 | 145.3 | 145.7 | 146.9 |
| FFT_NS_i | 146.0 | 147.5 | 149.8 | 153.2 | 157.0 | 159.1 | 168.1 | 171.2 | 174.0 | 181.0 | 188.7 |
| Parttree-1-1000 | 140.2 | 140.5 | 141.5 | 142.0 | 143.0 | 143.4 | 144.8 | 144.8 | 145.3 | 145.8 | 147.4 |
| Parttree-2-1000 | 140.2 | 140.5 | 141.5 | 142.0 | 143.0 | 143.4 | 144.8 | 145.3 | 145.8 | 146.5 | 147.5 |
| Dialign | 6.7 | 13.1 | 30.5 | 59.6 | 99.7 | 146.0 | – | – | – | – | – |
| Dialign_fast | 8.7 | 13.0 | 30.8 | 60.5 | 101.4 | 147.0 | – | – | – | – | – |
| Probcons_fast | 12.2 | 21.4 | 58.1 | 120.6 | 207.6 | 315.6 | – | – | – | – | – |
| Probcons | 12.2 | 21.3 | 58.1 | 120.6 | 207.6 | – | – | – | – | – | – |
| T_Coffee | 20.9 | 129.1 | 464.5 | – | – | – | – | – | – | – | – |
Dashes mark method/test-set combinations for which no measurement could be obtained because of excessive time or memory requirements. Kalign2 requires the least memory, followed by ClustalW.
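
To make the scaling in the memory table concrete, the sketch below computes the average memory growth per additional sequence between N = 10 and N = 500 for a few of the methods. The values are copied from the table; the function names and the use of a simple two-point slope are illustrative assumptions and not part of the original analysis.

```python
# Memory use in MB at N = 10 and N = 500, copied from the memory table.
MEMORY_MB = {
    "Kalign2": (6.6, 9.3),
    "Kalign1": (8.6, 225.6),
    "ClustalW": (7.9, 13.1),
    "Muscle": (17.6, 108.2),
    "FFT_NS_2": (140.1, 146.9),
}

N_MIN, N_MAX = 10, 500


def growth_per_sequence(mem_at_min, mem_at_max):
    """Average increase in MB per added sequence (two-point slope)."""
    return (mem_at_max - mem_at_min) / (N_MAX - N_MIN)


def rank_by_growth(table):
    """Method names ordered from slowest to fastest memory growth."""
    return sorted(table, key=lambda name: growth_per_sequence(*table[name]))


if __name__ == "__main__":
    for method in rank_by_growth(MEMORY_MB):
        slope_mb = growth_per_sequence(*MEMORY_MB[method])
        # 1 MB is treated as 1000 kB here, purely for display.
        print(f"{method:10s} {slope_mb * 1000:8.1f} kB per sequence")
```

A two-point slope hides curvature (Kalign1, for example, grows faster than linearly across the table), so it is best read as a rough summary. It also ignores the baseline: FFT_NS_2 grows slowly but starts from about 140 MB.
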
Command-line arguments for each alignment method tested
| Method | Command |
|---|---|
| ClustalW | clustalw |
| ClustalWqt2 | clustalw -quicktree |
| Dialign | dialign |
| Dialign_fast | dialign -o |
| FFT_NS_2 | mafft --retree 2 |
| FFT_NS_i | mafft --maxiterate 1000 |
| G_INS_i | mafft --globalpair --maxiterate 1000 |
| Kalign2 | kalign2 |
| Kalign1 | kalign1 |
| L_INS_i | mafft --localpair --maxiterate 1000 |
| Muscle | muscle |
| Muscle_fast | muscle -maxiters 1 -diags -sv -distance1 kbit20_3 |
| Parttree-1-1000 | mafft --retree 1 --parttree --partsize 1000 |
| Parttree-2-1000 | mafft --retree 2 --parttree --partsize 1000 |
| Probcons_fast | probcons -ir 0 |
| Probcons | probcons |
| ProbconsRNA | probconsRNA |
| T_Coffee | t_coffee |
| T_Coffee_fast | t_coffee -special_mode quickaln |
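
A running-time comparison of this kind can be scripted with a small timing harness. The sketch below is a minimal illustration and not the authors' benchmark code. The command strings follow the table, written with ASCII hyphens as a shell requires. The table does not state how each program receives its input file, so the caller supplies those arguments explicitly. Measuring wall-clock time (rather than CPU time) is also an assumption.

```python
import shlex
import subprocess
import sys
import time
from typing import List, Sequence

# A few command prefixes from the command-line table (ASCII hyphens).
COMMANDS = {
    "Kalign2": "kalign2",
    "Muscle_fast": "muscle -maxiters 1 -diags -sv -distance1 kbit20_3",
    "FFT_NS_2": "mafft --retree 2",
}


def build_command(method: str, input_args: Sequence[str]) -> List[str]:
    """Split the table's command string and append caller-supplied input arguments."""
    return shlex.split(COMMANDS[method]) + list(input_args)


def time_command(argv: Sequence[str]) -> float:
    """Run a command, discard its output, and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        list(argv),
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


if __name__ == "__main__":
    # Harmless demonstration that needs no alignment program installed:
    # time a Python interpreter that does nothing.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"no-op interpreter start-up: {elapsed:.3f} s")
```

With an aligner installed, a call such as `time_command(build_command("FFT_NS_2", ["input.fasta"]))` would time one run; `input.fasta` is a hypothetical file name.
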
Figure 2. Accuracy on RNA alignments using the SPS score. Boxplots for the accuracy measured using the BRAliBase 2.1 benchmark set. (A) Alignments with an average pairwise sequence identity (APSI) <40%. (B) Alignments with an APSI >40%. Kalign2 was the most accurate method, especially in regions with low APSI.
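
The caption does not spell out how the SPS was computed. Under the commonly used sum-of-pairs definition, it is the fraction of residue pairs aligned in the reference alignment that are also aligned in the test alignment. The sketch below implements that definition as an assumption. The function names, the use of `-` as the only gap character, and the scoring of all columns (benchmarks often restrict scoring to annotated core regions) are illustrative choices.

```python
from itertools import combinations

GAP = "-"  # assumed gap character


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, residue index within
    the ungapped sequence), so pairs are comparable across alignments.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != GAP:
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sps(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned residue pairs")
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


if __name__ == "__main__":
    reference = ["ACGU", "AC-U"]
    test = ["ACGU", "A-CU"]
    print(sps(test, reference))
```

In this toy case the reference contains three aligned pairs and the test alignment reproduces two of them, so the score is 2/3.
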
Figure 3. External feature alignment using protein secondary structure generally improves accuracy on the BAliBASE benchmark. An increase in the SPS score is seen mostly for cases with high structural coverage.