Literature DB >> 35915051

HAlign 3: Fast Multiple Alignment of Ultra-Large Numbers of Similar DNA/RNA Sequences.

Furong Tang1,2, Jiannan Chao1,3, Yanming Wei4, Fenglong Yang5, Yixiao Zhai3, Lei Xu2, Quan Zou1,3.   

Abstract

HAlign is a cross-platform program that performs multiple sequence alignments based on the center star strategy. Here we present two major updates of HAlign 3, which helped improve the time efficiency and the alignment quality, and made HAlign 3 a specialized program to process ultra-large numbers of similar DNA/RNA sequences, such as closely related viral or prokaryotic genomes. HAlign 3 can be easily installed via the Anaconda and Java release package on macOS, Linux, Windows subsystem for Linux, and Windows systems, and the source code is available on GitHub (https://github.com/malabz/HAlign-3).
© The Author(s) 2022. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.

Entities:  

Keywords:  center star strategy; common substring; multiple sequence alignment; substring selection; suffix tree

Mesh:

Substances:

Year:  2022        PMID: 35915051      PMCID: PMC9372455          DOI: 10.1093/molbev/msac166

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   8.800


HAlign, which is implemented with Java (Zou et al. 2015), is a multiple sequence alignment (MSA) tool that uses the center star strategy to speed up the alignment of sequences with high similarity, such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes from different strains within the same species (Wu et al. 2020). It first chooses a center sequence from the input, aligns it with the other sequences, and then merges all pairwise alignments into a final MSA (supplementary fig. S1, Supplementary Material online). Furthermore, a suffix tree—a data structure containing all the suffixes of a given string which allows the rapid implementation of many essential string operations (Ukkonen 1995; Baeza-Yates and Gonnet 1996)—is applied to divide and conquer the pairwise alignment. The center star strategy and suffix tree greatly accelerate the alignment procedure, allowing HAlign to process large data sets within a few minutes. In contrast, other state-of-the-art tools, which mainly adopt a progressive method (Feng and Doolittle 1987), take several days to complete the alignment or even fail because of the memory limitations of the hardware or the unacceptable amount of computation time required.

K-ary Tree Modification

In the updated HAlign version, the left-child right-sibling (LCRS) tree was replaced by a k-ary tree to build the suffix tree. The LCRS representation of a suffix tree is space efficient, but it takes several redundant steps when searching for the correct nodes. The suffix tree node structure modified with the k-ary tree considers only four nucleobase types in DNA or RNA sequences. Five pointers in each node were assigned to store the A, C, T/U, G, and N branches, respectively (the fifth pointer represented the unknown base N; fig. 1). Thus, each visit to a node’s particular child would be much more efficient. However, as each node of the k-ary tree stored five pointers regardless of the existence of a branch or not, this tree would incur extra space compared with the LCRS tree (see Supplementary material online). In the test experiments, the k-ary tree modification almost doubled the searching efficiency while taking up around 12.5% more space across the data sets with different similarities to center sequences (fig. 1). High-quality SARS-CoV-2 genome sequences all have a 99% similarity to the reference genome. It was established that using 1 million SARS-CoV-2 sequences to search for common substrings through the k-ary tree-modified suffix tree would save nearly 500,000 s (0.5 s for each sequence, as shown in fig. 1, top panel). Overall, the doubled search efficiency would significantly improve the processing speed when the data sets contain millions of sequences.

Illustration of the fast multiple alignment of ultra-large numbers of similar DNA/RNA sequences in HAlign 3. (A) Diagram of the k-ary tree, LCRS tree, and global substring selection algorithm. (i) Suffixes of the center sequence (AGAGC). (ii) The suffixes are listed alphabetically to obtain the suffix array. The k-ary tree (iii) was used instead of the LCRS tree (iii′) to construct the suffix tree. Coral pink blocks represent the k-ary tree’s nodes. Except for the leaf nodes, all the other nodes store five pointers, that is, 1, 2, 3, 4, and 5, which are specifically designed to store the branch of A, C, T, G, and N, respectively. Even if one or more branches do not exist in some cases, they still take up space. The yellow pies represent the nodes of the LCRS tree. Each node stores two pointers: its first child and the next sibling. The solid arrows represent the branches or leaves of trees (iii) and (iii’), while the dashed arrows indicate the parent–child relationship simplified by the representation of the LCRS tree. The green arrows show that the GC substring in green can be found by two steps in the k-ary tree (iii), whereas five steps are necessary for the LCRS tree. (iv) The green bars represent the common substrings between the center and query sequences. Together, arrows (directed edges: only the ones denoting the connectivity from the end of one common substring to the end of another that ends at the right of the former substring are accepted) and bars (nodes) form a DAG. During the DP, the longest path ending at node j was calculated as the maximum of the longest path from the first node to node i plus their directed edge from the node i to node j. The dark green bars lining the longest path (black arrows) are the final selected common substrings, whose total length (without double counting the overlapping part) is the longest to cover the query sequence. (B) Comparison of running time and memory between the k-ary and LCRS methods. Fourteen star-tree simulated data sets (containing 1,000 sequences each) with different similarities were tested. Each center sequence was used to build suffix trees. The running time and memory during the search for the common substrings (relative to the center sequence) of the other 999 query sequences were recorded. Upper panel: the running time consumptions (quartile distribution) for 14 data sets in the order of 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 70%, and 60%. Middle panel: the running time ratio (mean ± SD) of k-ary to LCRS methods. Both time values were divided by the LCRS consumption time for each sequence. Bottom panel: the running memory ratio (mean ± SD) of k-ary to LCRS methods. The values were normalized by the consumption memory of the LCRS method for each sequence. (C) Coverage ratio of the identical bases by selected common substrings of global to local selection algorithms (quartile distribution in boxplot). The total lengths of selected common substrings were first normalized by the known number of identical bases in each group to obtain the percentage of coverage. They were then divided by the mean coverage of the LCRS method. The results of data sets with similarities of 70% and 60% were not shown because there was hardly any common substring between the center and query sequences. (D) Performance comparison of HAlign 2 and 3, MAFFT v7.490, MUSCLE v3.8.31, and ClustalΩ v1.2.4 based on star-tree simulated data splits from the data sets used above (nine splits for each data set with similarity of 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 70%, and 60%). Left panel: comparison of alignment time and memory (mean). The results of data sets with a 60% similarity were not shown because the low similarity sharply increased the running time and memory. Middle panel: comparison of Q and TC scores (mean ± SD). Right panel: comparison of alignment length (mean ± SD). The length of alignments obtained from the five programs was divided by the reference alignment length. The inset shows the zoomed-in details with only one group of HAlign 2 results because the others were out of range. (E) Performance comparison of the five programs based on hierarchical tree simulated SARS-CoV-2-like genome data sets with various mean similarities. Fourteen data sets (nine replicates per data set with 100 sequences per replicate) with mean similarities of 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 75%, and 70% were used. Left panel: comparison of alignment time and memory (mean). Middle panel: comparison of Q and TC scores (mean ± SD). Right panel: comparison of the alignment length (mean ± SD). The length of alignments obtained from the five programs was divided by the reference alignment length. Default parameters running on single-thread were set for all programs in the experiments (D and E): MAFFT built the guide tree and aligned twice (FFT-NS-2); MUSCLE did the same and then refined the alignment 14 times maximum; ClustalΩ built the guide tree and aligned once; HAlign 2 and 3 built the star-tree and aligned once. The -localMSA mode and suffix tree algorithm were used for HAlign 2, which randomly selected the center sequences (no function to specify the center sequence). HAlign 3 picked the longest simulated sequence as the center sequence.

Illustration of the fast multiple alignment of ultra-large numbers of similar DNA/RNA sequences in HAlign 3. (A) Diagram of the k-ary tree, LCRS tree, and global substring selection algorithm. (i) Suffixes of the center sequence (AGAGC). (ii) The suffixes are listed alphabetically to obtain the suffix array. The k-ary tree (iii) was used instead of the LCRS tree (iii′) to construct the suffix tree. Coral pink blocks represent the k-ary tree’s nodes. Except for the leaf nodes, all the other nodes store five pointers, that is, 1, 2, 3, 4, and 5, which are specifically designed to store the branch of A, C, T, G, and N, respectively. Even if one or more branches do not exist in some cases, they still take up space. The yellow pies represent the nodes of the LCRS tree. Each node stores two pointers: its first child and the next sibling. The solid arrows represent the branches or leaves of trees (iii) and (iii’), while the dashed arrows indicate the parent–child relationship simplified by the representation of the LCRS tree. The green arrows show that the GC substring in green can be found by two steps in the k-ary tree (iii), whereas five steps are necessary for the LCRS tree. (iv) The green bars represent the common substrings between the center and query sequences. Together, arrows (directed edges: only the ones denoting the connectivity from the end of one common substring to the end of another that ends at the right of the former substring are accepted) and bars (nodes) form a DAG. During the DP, the longest path ending at node j was calculated as the maximum of the longest path from the first node to node i plus their directed edge from the node i to node j. The dark green bars lining the longest path (black arrows) are the final selected common substrings, whose total length (without double counting the overlapping part) is the longest to cover the query sequence. (B) Comparison of running time and memory between the k-ary and LCRS methods. Fourteen star-tree simulated data sets (containing 1,000 sequences each) with different similarities were tested. Each center sequence was used to build suffix trees. The running time and memory during the search for the common substrings (relative to the center sequence) of the other 999 query sequences were recorded. Upper panel: the running time consumptions (quartile distribution) for 14 data sets in the order of 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 70%, and 60%. Middle panel: the running time ratio (mean ± SD) of k-ary to LCRS methods. Both time values were divided by the LCRS consumption time for each sequence. Bottom panel: the running memory ratio (mean ± SD) of k-ary to LCRS methods. The values were normalized by the consumption memory of the LCRS method for each sequence. (C) Coverage ratio of the identical bases by selected common substrings of global to local selection algorithms (quartile distribution in boxplot). The total lengths of selected common substrings were first normalized by the known number of identical bases in each group to obtain the percentage of coverage. They were then divided by the mean coverage of the LCRS method. The results of data sets with similarities of 70% and 60% were not shown because there was hardly any common substring between the center and query sequences. (D) Performance comparison of HAlign 2 and 3, MAFFT v7.490, MUSCLE v3.8.31, and ClustalΩ v1.2.4 based on star-tree simulated data splits from the data sets used above (nine splits for each data set with similarity of 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 70%, and 60%). Left panel: comparison of alignment time and memory (mean). The results of data sets with a 60% similarity were not shown because the low similarity sharply increased the running time and memory. Middle panel: comparison of Q and TC scores (mean ± SD). Right panel: comparison of alignment length (mean ± SD). The length of alignments obtained from the five programs was divided by the reference alignment length. The inset shows the zoomed-in details with only one group of HAlign 2 results because the others were out of range. (E) Performance comparison of the five programs based on hierarchical tree simulated SARS-CoV-2-like genome data sets with various mean similarities. Fourteen data sets (nine replicates per data set with 100 sequences per replicate) with mean similarities of 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 75%, and 70% were used. Left panel: comparison of alignment time and memory (mean). Middle panel: comparison of Q and TC scores (mean ± SD). Right panel: comparison of the alignment length (mean ± SD). The length of alignments obtained from the five programs was divided by the reference alignment length. Default parameters running on single-thread were set for all programs in the experiments (D and E): MAFFT built the guide tree and aligned twice (FFT-NS-2); MUSCLE did the same and then refined the alignment 14 times maximum; ClustalΩ built the guide tree and aligned once; HAlign 2 and 3 built the star-tree and aligned once. The -localMSA mode and suffix tree algorithm were used for HAlign 2, which randomly selected the center sequences (no function to specify the center sequence). HAlign 3 picked the longest simulated sequence as the center sequence.

Global Substring Selection Algorithm

Searching for common substrings between center and query sequences by suffix tree always results in many redundant substrings. The previous HAlign version was limited to solving the optimal common substrings by locally selecting the longest substrings and the adjacent substrings with the shortest distances to them. In the updated version, a substring selection algorithm combining the directed acyclic graph (DAG) with dynamic programming (DP) was applied to the global searching for a series of optimal common substrings, which compose the longest path to cover the center/query sequence maximally (fig. 1, Supplementary material online). This improvement increased the final total length of common substrings between the center and query sequences, especially for data sets with a lower similarity (15% and 150% increase in length for data sets with similarities of 85% and 80%, respectively) compared with the former method (fig. 1). Also, this improvement further decreased the overall consumption as less query sequence lengths needed to be processed by pairwise alignment, which is the most time- and space-consuming procedure. For 80% similarity data sets, around 2,500 bp of pairwise alignment was saved for each sequence (supplementary fig. S2, Supplementary Material online).

Affine Gap Penalty, k-Banded Dynamic Program and Center Sequence Selection

Several other improvements were achieved in this new version of HAlign. The affine gap penalty (Gotoh 1982) was added to the scoring system to make the pairwise alignment more reliable on biological sequences, and the k-band method (Zou et al. 2012) was employed to decrease the time cost of DP. The previous version of HAlign randomly selected a sequence as the center (Su et al. 2017), while the updated version uses the longest input sequence, assuming that this is most likely to retain the information that other sequences might lose by mutation. The new method to select the center sequence was inspired by the greedy strategy used in the CD-HIT (Li et al. 2001) program.

Performance Comparison

Compared with the previous version, in HAlign 3, memory consumption was vastly decreased (fig. 1, left panel) when processing the star-tree simulated data sets with different similarities to the center sequence. When comparing the Q score (the ratio of correctly aligned pairs of letters to aligned pairs in reference) and TC score (the ratio of correctly aligned columns to aligned columns in reference) among five programs, the quality of HAlign 3’s alignments was similar to that of MUSCLE’s (Edgar 2004) alignments, and these rank as the top two programs (fig. 1, middle panel). The lengths of HAlign 3’s alignments of data sets with high similarities (≥92%) were much closer to reference alignment lengths than those of other programs’ alignments (fig. 1, right panel). For the alignment of data sets with long sequence length and high similarity to the center sequence (≥92%), HAlign 3 was the best among the five programs in terms of alignment time, memory, quality, and length. Three real data sets were also used for further comparisons and it was shown that: (1) 672 human mitochondrial genomes were highly similar to each other (>99%, supplementary fig. S3, Supplementary Material online), ranging from 16,555 to 16,578 bp (supplementary fig. S3, Supplementary Material online); (2) 500 SARS-CoV-2 genomes were less similar to each other (>97%, supplementary fig. S3, Supplementary Material online) ranging from 29,283 to 29,891 bp (supplementary fig. S3, Supplementary Material online); (3) 647 Mycobacterium 23S rRNA sequences were more different from each other (>30%, supplementary fig. S3, Supplementary Material online) ranging from 1,909 to 3,485 bp (supplementary fig. S3, Supplementary Material online). The average sum of pairs score (known as aSPscore and corresponding to the sum of pairs score divided by the number of sequence pairs, Supplementary material online) was used to measure the alignment quality in the absence of reference alignment. The memory consumption and alignment length of HAlign 3 were significantly reduced compared with those of HAlign 2 in all three data sets. The alignment quality of HAlign 3 was comparable with that of MAFFT (Katoh and Standley 2013), MUSCLE, and ClustalΩ (Sievers et al. 2011) in the data sets with an even relatively lower similarity, such as those containing SARS-CoV-2 and 23S rRNA (supplementary fig. S3, S3, and S3, Supplementary Material online). Therefore, in the real data experiment, HAlign 3 can also be comparable with state-of-the-art tools. To ensure the performance of HAlign 3, SARS-CoV-2-like and mitochondrial-like genomes were further simulated on a hierarchical tree by INDELible v1.03 (Fletcher and Yang 2009; Supplementary material online). As shown in fig. 1 and supplementary fig. S4, Supplementary Material online, for the hierarchical tree simulated data sets with various mean similarities (i.e., the mean of the similarities between any two sequences in a given data set), the memory consumption, alignment quality, and length were significantly improved in HAlign 3 compared with the previous version. HAlign 3 could also achieve quite closed alignment quality and length to the top two ranked programs (MAFFT and MUSCLE) for data sets with a higher mean similarity (≥96%) while taking much less time. Similar results were observed in the alignments of simulated genome data sets with preset branch lengths based on real cases but varying scales of tree length (sum of branch lengths) (supplementary fig. S4,, and table S1, Supplementary Material online).

Conclusion

Overall, HAlign 3 was much more efficient in processing speed while with minor accuracy loss, which further declined with the increase of sequence similarity. The advantages were more apparent when the number of similar sequences was larger: running on a single-node server, HAlign 3 was the only program that could align 1 million high-quality SARS-CoV-2 genomes successfully in 30 min (supplementary fig. S5, Supplementary Material online). The recently added updates made HAlign a specialized program to deal with ultra-large numbers of similar DNA/RNA sequences, which will be an increasingly common sequence analysis problem as the sequencing technology accelerates the accumulation of intraspecific sequences. Click here for additional data file.
  10 in total

1.  Clustering of highly homologous sequences to reduce the size of large protein databases.

Authors:  W Li; L Jaroszewski; A Godzik
Journal:  Bioinformatics       Date:  2001-03       Impact factor: 6.937

2.  MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors:  Robert C Edgar
Journal:  Nucleic Acids Res       Date:  2004-03-19       Impact factor: 16.971

3.  HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy.

Authors:  Quan Zou; Qinghua Hu; Maozu Guo; Guohua Wang
Journal:  Bioinformatics       Date:  2015-03-25       Impact factor: 6.937

4.  Multiple Sequence Alignment Based on a Suffix Tree and Center-Star Strategy: A Linear Method for Multiple Nucleotide Sequence Alignment on Spark Parallel Framework.

Authors:  Wenhe Su; Xiangke Liao; Yutong Lu; Quan Zou; Shaoliang Peng
Journal:  J Comput Biol       Date:  2017-11-08       Impact factor: 1.479

5.  Progressive sequence alignment as a prerequisite to correct phylogenetic trees.

Authors:  D F Feng; R F Doolittle
Journal:  J Mol Evol       Date:  1987       Impact factor: 2.395

6.  An improved algorithm for matching biological sequences.

Authors:  O Gotoh
Journal:  J Mol Biol       Date:  1982-12-15       Impact factor: 5.469

7.  MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors:  Kazutaka Katoh; Daron M Standley
Journal:  Mol Biol Evol       Date:  2013-01-16       Impact factor: 16.240

8.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Authors:  Fabian Sievers; Andreas Wilm; David Dineen; Toby J Gibson; Kevin Karplus; Weizhong Li; Rodrigo Lopez; Hamish McWilliam; Michael Remmert; Johannes Söding; Julie D Thompson; Desmond G Higgins
Journal:  Mol Syst Biol       Date:  2011-10-11       Impact factor: 11.429

9.  A new coronavirus associated with human respiratory disease in China.

Authors:  Fan Wu; Su Zhao; Bin Yu; Yan-Mei Chen; Wen Wang; Zhi-Gang Song; Yi Hu; Zhao-Wu Tao; Jun-Hua Tian; Yuan-Yuan Pei; Ming-Li Yuan; Yu-Ling Zhang; Fa-Hui Dai; Yi Liu; Qi-Min Wang; Jiao-Jiao Zheng; Lin Xu; Edward C Holmes; Yong-Zhen Zhang
Journal:  Nature       Date:  2020-02-03       Impact factor: 49.962

10.  INDELible: a flexible simulator of biological sequence evolution.

Authors:  William Fletcher; Ziheng Yang
Journal:  Mol Biol Evol       Date:  2009-05-07       Impact factor: 16.240

  10 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.