Furong Tang, Jiannan Chao, Yanming Wei, Fenglong Yang, Yixiao Zhai, Lei Xu, Quan Zou.
Abstract
HAlign is a cross-platform program that performs multiple sequence alignments based on the center star strategy. Here we present two major updates in HAlign 3, which improve time efficiency and alignment quality and make HAlign 3 a specialized program for processing ultra-large numbers of similar DNA/RNA sequences, such as closely related viral or prokaryotic genomes. HAlign 3 can be easily installed via Anaconda or the Java release package on macOS, Linux, Windows Subsystem for Linux, and Windows systems, and the source code is available on GitHub (https://github.com/malabz/HAlign-3).
Keywords: center star strategy; common substring; multiple sequence alignment; substring selection; suffix tree
Year: 2022; PMID: 35915051; PMCID: PMC9372455; DOI: 10.1093/molbev/msac166
Source DB: PubMed; Journal: Mol Biol Evol; ISSN: 0737-4038; Impact factor: 8.800
Illustration of the fast multiple alignment of ultra-large numbers of similar DNA/RNA sequences in HAlign 3.
(A) Diagram of the k-ary tree, the left-child right-sibling (LCRS) tree, and the global substring selection algorithm. (i) Suffixes of the center sequence (AGAGC). (ii) The suffixes are listed alphabetically to obtain the suffix array. The k-ary tree (iii) was used instead of the LCRS tree (iii′) to construct the suffix tree. Coral pink blocks represent the nodes of the k-ary tree. Every node except the leaf nodes stores five pointers (1, 2, 3, 4, and 5), which are reserved for the A, C, T, G, and N branches, respectively. A pointer slot takes up space even when its branch does not exist. The yellow pies represent the nodes of the LCRS tree. Each node stores two pointers: its first child and its next sibling. The solid arrows represent the branches or leaves of trees (iii) and (iii′), while the dashed arrows indicate the parent–child relationships that the LCRS representation simplifies. The green arrows show that the GC substring (in green) can be found in two steps in the k-ary tree (iii), whereas five steps are necessary in the LCRS tree (iii′). (iv) The green bars represent the common substrings between the center and query sequences. Together, the bars (nodes) and arrows (directed edges) form a directed acyclic graph (DAG). An edge is accepted only if it connects the end of one common substring to the end of another substring that ends to the right of the former. During dynamic programming (DP), the longest path ending at node j was calculated as the maximum, over all nodes i with an edge to node j, of the longest path from the first node to node i plus the directed edge from node i to node j. The dark green bars along the longest path (black arrows) are the final selected common substrings, whose total length (without double counting the overlapping parts) covers the largest portion of the query sequence.
(B) Comparison of running time and memory between the k-ary and LCRS methods.
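The trade-off between the two node layouts in panel A (iii and iii′) can be illustrated with a minimal Python sketch. It is not HAlign 3's implementation: for brevity it builds an uncompressed suffix trie rather than a suffix tree, and the names `KaryNode`, `LcrsNode`, `build_kary`, `build_lcrs`, `find_kary`, and `find_lcrs` are hypothetical. A "step" here means one pointer dereference, so the counts need not match those drawn in the figure.

```python
ALPHABET = "ACTGN"  # slot order taken from the caption: pointers 1..5 hold A, C, T, G, N


class KaryNode:
    """Node with one fixed slot per symbol; empty slots still occupy space."""
    __slots__ = ("children",)

    def __init__(self):
        self.children = [None] * len(ALPHABET)


class LcrsNode:
    """Node with only two pointers: first child and next sibling."""
    __slots__ = ("symbol", "first_child", "next_sibling")

    def __init__(self, symbol=None):
        self.symbol = symbol
        self.first_child = None
        self.next_sibling = None


def build_kary(text):
    """Insert every suffix of text into a k-ary trie (assumption: trie, not compressed tree)."""
    root = KaryNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            slot = ALPHABET.index(ch)
            if node.children[slot] is None:
                node.children[slot] = KaryNode()
            node = node.children[slot]
    return root


def build_lcrs(text):
    """Insert every suffix of text into an LCRS trie; siblings are kept in insertion order."""
    root = LcrsNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            prev, child = None, node.first_child
            while child is not None and child.symbol != ch:
                prev, child = child, child.next_sibling
            if child is None:
                child = LcrsNode(ch)
                if prev is None:
                    node.first_child = child
                else:
                    prev.next_sibling = child
            node = child
    return root


def find_kary(root, pattern):
    """Return (found, steps): one direct slot lookup per pattern symbol."""
    node, steps = root, 0
    for ch in pattern:
        node = node.children[ALPHABET.index(ch)]
        steps += 1
        if node is None:
            return False, steps
    return True, steps


def find_lcrs(root, pattern):
    """Return (found, steps): each symbol may require walking the sibling list."""
    node, steps = root, 0
    for ch in pattern:
        child = node.first_child
        steps += 1
        while child is not None and child.symbol != ch:
            child = child.next_sibling
            steps += 1
        if child is None:
            return False, steps
        node = child
    return True, steps


kary_root = build_kary("AGAGC")
lcrs_root = build_lcrs("AGAGC")
print(find_kary(kary_root, "GC"))
print(find_lcrs(lcrs_root, "GC"))
```

The k-ary lookup costs one step per symbol at the price of five pointer slots per internal node, whereas the LCRS lookup saves space but may traverse several siblings per symbol.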
Fourteen star-tree simulated data sets (containing 1,000 sequences each) with different similarities were tested. Each center sequence was used to build the suffix trees. The running time and memory used while searching for the common substrings (relative to the center sequence) of the other 999 query sequences were recorded. Upper panel: running time (quartile distribution) for the 14 data sets, in the order of 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 70%, and 60% similarity. Middle panel: running time ratio (mean ± SD) of the k-ary to the LCRS method. Both times were divided by the LCRS running time for each sequence. Bottom panel: memory ratio (mean ± SD) of the k-ary to the LCRS method. The values were normalized by the memory consumption of the LCRS method for each sequence.
(C) Coverage ratio of the identical bases by the selected common substrings of the global to the local selection algorithm (quartile distribution in boxplot). The total lengths of the selected common substrings were first normalized by the known number of identical bases in each group to obtain the percentage of coverage. They were then divided by the mean coverage of the LCRS method. The results for the data sets with similarities of 70% and 60% are not shown because there was hardly any common substring between the center and query sequences.
(D) Performance comparison of HAlign 2 and 3, MAFFT v7.490, MUSCLE v3.8.31, and ClustalΩ v1.2.4 based on star-tree simulated data splits from the data sets used above (nine splits for each data set, with similarities of 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 70%, and 60%). Left panel: comparison of alignment time and memory (mean). The results for the data sets with 60% similarity are not shown because the low similarity sharply increased the running time and memory. Middle panel: comparison of Q and TC scores (mean ± SD). Right panel: comparison of alignment length (mean ± SD).
The length of the alignments obtained from the five programs was divided by the reference alignment length. The inset shows the zoomed-in details with only one group of HAlign 2 results because the others were out of range.
(E) Performance comparison of the five programs based on hierarchical-tree simulated SARS-CoV-2-like genome data sets with various mean similarities. Fourteen data sets (nine replicates per data set with 100 sequences per replicate) with mean similarities of 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 75%, and 70% were used. Left panel: comparison of alignment time and memory (mean). Middle panel: comparison of Q and TC scores (mean ± SD). Right panel: comparison of alignment length (mean ± SD). The length of the alignments obtained from the five programs was divided by the reference alignment length.
All programs in experiments D and E were run on a single thread with default parameters: MAFFT built the guide tree and aligned twice (FFT-NS-2); MUSCLE did the same and then refined the alignment up to 14 times; ClustalΩ built the guide tree and aligned once; HAlign 2 and 3 built the star tree and aligned once. The -localMSA mode and the suffix tree algorithm were used for HAlign 2, which selected the center sequence randomly (it has no function to specify the center sequence). HAlign 3 picked the longest simulated sequence as the center sequence.
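The global substring selection described in panel A (iv) can be sketched as a longest-path computation over a DAG of common substrings. The Python below is a minimal illustration, not the HAlign 3 code. The `Match` record, the `select_substrings` function, and the edge rule are assumptions made for this sketch: an edge from i to j is accepted when j still contributes at least one base after the part overlapping i (in either sequence) is trimmed, and the edge weight is that trimmed length. The quadratic loop is kept for clarity only.

```python
from typing import List, NamedTuple, Tuple


class Match(NamedTuple):
    """A common substring (hypothetical record): start in center, start in query, length."""
    center_start: int
    query_start: int
    length: int


def select_substrings(matches: List[Match]) -> Tuple[int, List[Match]]:
    """Return (covered_length, chain) for the longest path through the match DAG."""
    if not matches:
        return 0, []

    # Sorting by end position gives a topological order, because every
    # accepted edge i -> j requires j to end strictly to the right of i.
    nodes = sorted(
        matches,
        key=lambda m: (m.query_start + m.length, m.center_start + m.length),
    )
    best = [m.length for m in nodes]   # longest path ending at each node
    prev = [-1] * len(nodes)           # predecessor on that path

    for j, mj in enumerate(nodes):
        for i in range(j):
            mi = nodes[i]
            # Bases of j that would be double counted if j followed i.
            overlap = max(
                0,
                mi.query_start + mi.length - mj.query_start,
                mi.center_start + mi.length - mj.center_start,
            )
            gain = mj.length - overlap  # edge weight (assumed definition)
            if gain > 0 and best[i] + gain > best[j]:
                best[j] = best[i] + gain
                prev[j] = i

    # Backtrack from the node with the largest covered length.
    end = max(range(len(nodes)), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(nodes[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain


candidates = [Match(0, 0, 5), Match(3, 3, 6), Match(20, 6, 4), Match(10, 10, 5)]
covered, chain = select_substrings(candidates)
print(covered, chain)
```

In this toy input the match starting at center position 20 is out of order relative to the others, so the selected chain skips it and the overlap between the first two matches is counted only once.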