
Kalign 3: multiple sequence alignment of large data sets.

Abstract

MOTIVATION: Kalign is an efficient multiple sequence alignment (MSA) program capable of aligning thousands of protein or nucleotide sequences. However, current alignment problems involving large numbers of sequences are exceeding Kalign's original design specifications. Here we present a completely re-written and updated version to meet current and future alignment challenges.
RESULTS: Kalign now uses a SIMD-accelerated version of Gene Myers' bit-parallel algorithm to estimate pairwise distances, and adopts a sequence embedding strategy and the bi-secting k-means algorithm to rapidly construct guide trees for thousands of sequences. The new version maintains high alignment accuracy on both protein and nucleotide alignments and scales better than other MSA tools.
AVAILABILITY: The source code of Kalign and code to reproduce the results are found here: https://github.com/timolassmann/kalign.

Year: 2019 PMID: 31665271 PMCID: PMC7703769 DOI: 10.1093/bioinformatics/btz795

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Multiple sequence alignment (MSA) remains an important task in biological sequence analysis. MSA programs can be divided into consistency-based and progressive methods. The latter estimate pairwise sequence distances, construct a guide tree and align sequences following the order of the guide tree. Consistency-based methods tend to be more accurate than progressive methods but are orders of magnitude slower and therefore not practical when aligning thousands of sequences. Kalign (Lassmann et al., 2008) is a progressive alignment method that strikes a good balance between accuracy and speed compared with other alignment programs on a range of popular benchmark datasets (see e.g. Sievers ). Despite having aged well, Kalign was not designed to handle the tens of thousands of sequences frequently encountered today. In particular, the original Kalign program uses the unweighted pair group method with arithmetic mean (UPGMA) algorithm to construct a guide tree, resulting in time complexity that is quadratic in the number of sequences. More recent alignment programs have overcome this hurdle by implementing heuristics to construct guide trees (Blackshields ; Katoh and Toh, 2006). Here we present a new version of Kalign, introducing a SIMD (single instruction, multiple data) accelerated version of Gene Myers’ bit-parallel algorithm (Myers, 1999) to estimate pairwise sequence distances and adopting the sequence embedding strategy introduced by Blackshields to speed up the construction of guide trees.

2 Materials and methods

We replaced the fast string matching algorithm used in Kalign2 (Muth and Manber, 1996) with a new implementation of Gene Myers’ approximate string matching algorithm. The algorithm calculates the exact edit distance between two strings using bit-parallel instructions. In the standard implementation, the maximum length of a query is equal to the size of a computer word (64 characters on 64-bit architectures). However, the algorithm lends itself to further parallelization using SIMD instructions, including the AVX and AVX2 instructions available on most modern x86 processors. Using these instructions, it becomes possible to compare sequences of length 256. Although implementing the Gene Myers algorithm with AVX instructions is fairly straightforward, some operations are absent from the AVX instruction set and had to be implemented separately. A stand-alone implementation of the algorithm is distributed together with Kalign to facilitate downstream adoption and development. To estimate pairwise sequence distances, Kalign scans the first 256 characters of the shorter sequence across the longer sequence. The distance is defined as the number of edits required to turn this segment of the shorter sequence into an exact match in the longer sequence. For distantly related protein sequences, the sequence similarity is too low for the algorithm to detect meaningful distances. Therefore, following the method by Steinegger and Söding (2018), Kalign converts all protein sequences into a reduced alphabet by merging (L, M), (I, V), (K, R), (E, Q), (A, S, T), (N, D) and (F, Y) for the purpose of the distance calculation. Kalign adopts the guide tree construction methods used in Clustal Omega (Sievers ). A number of seed sequences are selected and all sequences are compared against them, yielding for each sequence a vector of distances to all seeds.
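The distance estimate described above can be illustrated with a short Python sketch. It is not Kalign's implementation: Python's arbitrary-precision integers stand in for the 64-bit words or AVX registers, the function names (`reduce_alphabet`, `myers_min_distance`, `estimate_distance`) are hypothetical, and representing each merged group by its first letter is an assumption made for illustration. The bit-vector recurrence follows the approximate-matching form of Myers' algorithm, in which the pattern may start and end anywhere in the text.

```python
# Sketch only: Python integers stand in for machine words or SIMD registers.
# All names below are hypothetical and chosen for illustration.

# Reduced alphabet from the text; each group collapses to its first letter
# (the choice of representative letter is an assumption).
GROUPS = ["LM", "IV", "KR", "EQ", "AST", "ND", "FY"]
REDUCE = {ord(c): g[0] for g in GROUPS for c in g}


def reduce_alphabet(seq):
    """Map a protein sequence onto the reduced alphabet."""
    return seq.upper().translate(REDUCE)


def myers_min_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    m = len(pattern)
    if m == 0:
        return 0
    # One bit mask per character: bit i is set where pattern[i] == c.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    mask = (1 << m) - 1          # keeps every vector m bits wide
    high = 1 << (m - 1)          # bit of the last pattern row
    pv, mv = mask, 0             # positive / negative vertical deltas
    score = best = m
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)
        mh = pv & xh
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shifting in 0 lets a match start at any text position.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
        best = min(best, score)
    return best


def estimate_distance(a, b, window=256, protein=True):
    """Scan the first `window` characters of the shorter sequence across the longer."""
    shorter, longer = sorted((a, b), key=len)
    if protein:
        shorter, longer = reduce_alphabet(shorter), reduce_alphabet(longer)
    return myers_min_distance(shorter[:window], longer)
```

With these definitions, `estimate_distance("MKV", "AALRIAA")` evaluates to 0, because both sequences contain the same segment once M, V and R are merged with L, I and K. A SIMD version would keep the same recurrence but hold each bit vector in a fixed-width register instead of an unbounded integer.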
The bi-secting k-means algorithm is used to cluster sequences based on the Euclidean distance between these vectors until clusters containing fewer than 100 sequences are found. Here, Kalign again uses AVX instructions to accelerate the distance calculation. Finally, the UPGMA method is used to cluster the remaining sequences. Since the bi-secting k-means algorithm is not guaranteed to discover the optimal split of sequences into two clusters, Kalign runs the algorithm 50 times using randomly selected sequences to seed the calculation.
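The embedding and clustering steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than Kalign's code: the function names are hypothetical, seed selection is left to the caller, the `dist` argument is any pairwise distance function, and keeping the restart with the lowest within-cluster squared distance is an assumed selection criterion that the text does not specify. The thresholds of 100 sequences and 50 restarts are taken from the text.

```python
import random


def embed(seqs, seed_ids, dist):
    """Represent each sequence by its vector of distances to the seed sequences."""
    return [[dist(s, seqs[k]) for k in seed_ids] for s in seqs]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(vectors, ids, rng, iters=20):
    """One 2-means run seeded with two randomly chosen members of the cluster."""
    s0, s1 = rng.sample(ids, 2)
    centers = [vectors[s0], vectors[s1]]
    left, right = [], []
    for _ in range(iters):
        left, right = [], []
        for i in ids:
            nearer_first = sq_dist(vectors[i], centers[0]) <= sq_dist(vectors[i], centers[1])
            (left if nearer_first else right).append(i)
        if not left or not right:
            break
        new_centers = [centroid([vectors[i] for i in left]),
                       centroid([vectors[i] for i in right])]
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(min(sq_dist(vectors[i], c) for c in centers) for i in ids)
    return left, right, cost


def bisect(vectors, ids, rng, max_leaf=100, restarts=50):
    """Split recursively until every cluster has fewer than max_leaf members.

    Leaves are lists of sequence indices; internal nodes are (left, right) tuples.
    """
    if len(ids) < max_leaf:
        return ids
    # Assumption: keep the restart with the lowest within-cluster cost.
    left, right, _ = min((two_means(vectors, ids, rng) for _ in range(restarts)),
                         key=lambda run: run[2])
    if not left or not right:  # degenerate case, e.g. identical vectors
        left, right = ids[:len(ids) // 2], ids[len(ids) // 2:]
    return (bisect(vectors, left, rng, max_leaf, restarts),
            bisect(vectors, right, rng, max_leaf, restarts))


def leaves(tree):
    """Collect the leaf clusters of the nested tree in left-to-right order."""
    if isinstance(tree, list):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])
```

Calling `bisect(vectors, list(range(len(vectors))), random.Random(0))` returns the upper part of a guide tree; in the scheme described above, a conventional method such as UPGMA would then resolve the sequences inside each leaf.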

3 Results

We compared the performance of Kalign against two other popular progressive alignment methods, MUSCLE (Edgar, 2004) and Clustal Omega (Sievers ). We used the Balibase (Thompson ), Quantest2 (Sievers and Higgins, 2019), Bralibase (Gardner ) and HomFam benchmark datasets (Fig. 1). Clustal Omega and MUSCLE were run with parameters recommended for large alignments on the BaliFam dataset (Clustal Omega: --threads=8 --MAC-RAM=48000 --iterations=2; MUSCLE: -maxiters 2), but otherwise default parameters were used.

Fig. 1. Benchmark results. (a) Sum of pairs scores (SP) of all tested alignment programs on Balibase protein alignment datasets. (b) SP scores of RNA Bralibase alignments. (c) Computational performance assessed on the HomFam dataset.

Kalign’s performance on all six Balibase categories is statistically indistinguishable from the other two programs (two-sample t-test, corrected P < 0.05). Likewise, there is no statistical difference in alignment accuracy on the Quantest2 benchmark dataset (results not shown). Kalign’s mean performance is significantly better than that of the other two programs in two out of the six Bralibase alignment categories. However, we note that the performance of all algorithms can vary dramatically depending on the specific alignment case (see Fig. 1, box plot error bars and outliers). Therefore, we do not assume that good performance on MSA benchmark sets generalizes, and we recommend that users manually inspect their alignments and compare the results of different alignment programs. Kalign compares favorably to the other two programs in terms of running times and scalability on the BaliFam dataset (Fig. 1c). In all alignment cases Kalign is one to two orders of magnitude quicker and, in contrast to Clustal Omega, uses only a single CPU core.
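For readers unfamiliar with the sum-of-pairs (SP) score reported in Fig. 1, the following Python sketch shows one common way to compute it: the fraction of residue pairs aligned in a reference alignment that are also aligned in the test alignment. The function names are hypothetical, and the sketch assumes that both alignments contain the same sequences in the same order with `-` as the gap character. Whether the benchmarks restrict scoring to particular columns is not covered here.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, residue index), so the
    result does not depend on where the gaps are placed.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for s, ch in enumerate(column):
            if ch != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test)) / len(ref)


reference = ["AC-GT",
             "ACTGT"]
test = ["ACG-T",
        "ACTGT"]
print(sp_score(test, reference))  # 3 of the 4 reference pairs are recovered: 0.75
```

In this toy case the test alignment misplaces a single G, so one of the four reference pairs is lost.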

4 Conclusion

We present a new version of Kalign that outperforms other programs in terms of running times while sacrificing little in terms of accuracy. This combination makes Kalign especially attractive in large alignment problems.

References

1. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs.

Authors: J D Thompson; F Plewniak; O Poch
Journal: Bioinformatics Date: 1999-01 Impact factor: 6.937

2. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

3. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences.

Authors: Kazutaka Katoh; Hiroyuki Toh
Journal: Bioinformatics Date: 2006-11-21 Impact factor: 6.937

4. QuanTest2: benchmarking multiple sequence alignments using secondary structure prediction.

Authors: Fabian Sievers; Desmond G Higgins
Journal: Bioinformatics Date: 2020-01-01 Impact factor: 6.937

5. A benchmark of multiple sequence alignment programs upon structural RNAs.

Authors: Paul P Gardner; Andreas Wilm; Stefan Washietl
Journal: Nucleic Acids Res Date: 2005-04-28 Impact factor: 16.971

6. Sequence embedding for fast construction of guide trees for multiple sequence alignment.

Authors: Gordon Blackshields; Fabian Sievers; Weifeng Shi; Andreas Wilm; Desmond G Higgins
Journal: Algorithms Mol Biol Date: 2010-05-14 Impact factor: 1.405

7. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Authors: Fabian Sievers; Andreas Wilm; David Dineen; Toby J Gibson; Kevin Karplus; Weizhong Li; Rodrigo Lopez; Hamish McWilliam; Michael Remmert; Johannes Söding; Julie D Thompson; Desmond G Higgins
Journal: Mol Syst Biol Date: 2011-10-11 Impact factor: 11.429

8. Clustering huge protein sequence sets in linear time.

Authors: Martin Steinegger; Johannes Söding
Journal: Nat Commun Date: 2018-06-29 Impact factor: 14.919

9. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features.

Authors: Timo Lassmann; Oliver Frings; Erik L L Sonnhammer
Journal: Nucleic Acids Res Date: 2008-12-22 Impact factor: 16.971
