Ziyuan Wang, Junjie Tan, Yanling Long, Yijia Liu, Wenyan Lei, Jing Cai, Yi Yang, Zhibin Liu.
Abstract
Multiple DNA/RNA sequence alignment is a fundamental tool in bioinformatics, especially for phylogenetic tree construction. As DNA sequencing improves, the amount of bioinformatics data keeps increasing, and alignment tools must be updated continually to keep pace. Mitochondrial genome analyses of multiple individuals and species depend on such software, so its performance needs to be optimized. To improve the alignment of ultra-large datasets and ultra-long sequences, we optimized a dynamic programming algorithm using longest common substring methods. On ultra-large test DNA datasets containing sequences of different lengths, some over 300 kb (kilobases), the Multiple DNA/RNA Sequence Alignment Tool Based on Suffix Tree (SaAlign) saved time and computational space. It outperformed existing tools, including MAFFT and HAlign-II. For mitochondrial genome datasets with limited numbers of sequences, MAFFT performed the required tasks, but it could not handle ultra-large mitochondrial genome datasets because of core dump errors. We implemented a multiple DNA/RNA sequence alignment tool based on the Center Star strategy and used a suffix array algorithm to optimize its space and time efficiency. Whole-genome research and next-generation sequencing (NGS) technology are becoming more popular, and laboratories need to save computational resources. This software is of great significance in these respects, especially for the study of whole mitochondrial genomes of plants.
Keywords: Alignment; DP, Dynamic programming; LCS, Longest common subsequence; MSA, Multiple sequence alignment; Phylogenetic tree; SA, Suffix array; Sequence analysis
Year: 2022 PMID: 35422971 PMCID: PMC8976100 DOI: 10.1016/j.csbj.2022.03.018
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Original dataset in the experiment.
| Dataset | Max length (bp) | Min length (bp) | Average length (bp) | Number of sequences | File size |
|---|---|---|---|---|---|
| ITS (1X) | 738 | 459 | 572.83 | 100 | 65 KB |
| ITS (10X) | 902 | 346 | 587.40 | 1,000 | 659 KB |
| ITS (100X) | 993 | 363 | 584.23 | 10,000 | 6.42 MB |
| Virus genome (1X) | 29,873 | 29,510 | 29,811.20 | 100 | 3 MB |
| Virus genome (10X) | 29,901 | 29,407 | 29,800.80 | 1,000 | 26.75 MB |
| Mitochondrion genome (small) | 363,329 | 208,098 | 268,591.10 | 10 | 2.60 MB |
| Mitochondrion genome (large) | 362,070 | 232,242 | 298,624.64 | 100 | 29.18 MB |
Running time with genome MSA.
| Tool | ITS sequences (1X) | ITS sequences (10X) | ITS sequences (100X) | Virus genome (1X) | Virus genome (10X) | Mitochondrion genome (small) | Mitochondrion genome (large) |
|---|---|---|---|---|---|---|---|
| MAFFT | 3.2 ± 0.1 s | 14.32 ± 1.2 min | ∼ | 43.60 ± 2.8 min | ∼ | 7.3 h | ∼ |
| HAlign2.1 | 4.8 ± 0.3 s | 8.41 ± 0.8 min | 49.03 ± 2.2 min | ∼ | ∼ | ∼ | ∼ |
| SaAlign | 23 ± 3.1 s | 7.85 ± 1.3 min | 57.25 ± 5.4 min | 75.21 ± 4.9 min | 11.6 ± 0.2 h | 1.81 h | 20.2 h |
Fig. 1. Phylogenetic tree of the fungi ITS 1X dataset.
Average SPS with genome MSA.
| Avg SPS | ITS sequences (1X) | ITS sequences (10X) | ITS sequences (100X) | Virus genome (1X) | Virus genome (10X) | Mitochondrion genome (small) | Mitochondrion genome (large) |
|---|---|---|---|---|---|---|---|
| MAFFT | 0.826 | 0.851 | ∼ | 0.815 | ∼ | 0.926 | ∼ |
| HAlign2.1 | 0.722 | 0.723 | 0.735 | ∼ | ∼ | ∼ | ∼ |
| SaAlign | 0.722 | 0.723 | 0.735 | 0.631 | 0.637 | 0.695 | 0.716 |
Fig. 2. Running time with increasing numbers of worker nodes on datasets containing different numbers of sequences (5X, 10X and 20X denote datasets of 500, 1000 and 2000 sequences, respectively).
Fig. 3. Running time with increasing numbers of worker nodes on datasets containing sequences of different lengths (500X, 30,000X and 300,000X denote datasets whose mean sequence lengths are approximately 500, 30,000 and 300,000 bp, respectively).
Fig. 4. A simple Spark workflow.
Fig. 5. Optimization of the Needleman-Wunsch alignment procedure using longest common substrings.
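The idea of using longest common substrings to reduce the work done by Needleman-Wunsch can be illustrated with a minimal Python sketch. It is not SaAlign's implementation: the function names `needleman_wunsch` and `anchored_align`, the `min_anchor` threshold, and the scoring values are all assumptions made for illustration, and the longest common substring is found with the standard library's `difflib.SequenceMatcher` rather than the suffix array the abstract describes. The shared substring is kept as a fixed anchor, and dynamic programming runs only on the shorter flanking regions.

```python
from difflib import SequenceMatcher


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Plain global alignment; returns the two gapped strings.

    The scoring values are illustrative assumptions, not the paper's.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def anchored_align(a, b, min_anchor=8):
    """Align a and b, fixing long common substrings as anchors (hypothetical helper).

    Stand-in technique: difflib finds the longest common substring here;
    a suffix array would serve the same role on long sequences.
    """
    if min_anchor < 1:
        raise ValueError("min_anchor must be at least 1")
    hit = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    if hit.size < min_anchor:
        # No sufficiently long shared block: fall back to full DP.
        return needleman_wunsch(a, b)
    # Recurse on the regions left and right of the anchor.
    left_a, left_b = anchored_align(a[:hit.a], b[:hit.b], min_anchor)
    right_a, right_b = anchored_align(a[hit.a + hit.size:],
                                      b[hit.b + hit.size:], min_anchor)
    anchor = a[hit.a:hit.a + hit.size]
    return left_a + anchor + right_a, left_b + anchor + right_b


if __name__ == "__main__":
    x, y = anchored_align("GATTACAGATTACA", "GATTACATTACA", min_anchor=4)
    print(x)
    print(y)
```

Anchoring is a heuristic: the result is a valid global alignment, but it is not guaranteed to have the optimal Needleman-Wunsch score, because the anchor is never reconsidered. The benefit is that the quadratic dynamic programming table is built only for the unanchored segments, which is where the time and memory savings on highly similar long sequences would come from.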