V Vineetha, C L Biji, Achuthsankar S Nair.
Abstract
Multiple sequence alignment (MSA) is an integral part of molecular biology, but handling a massive number of large sequences is still a bottleneck for most state-of-the-art software tools. Knowledge-driven algorithms that exploit features of the input sequences, such as the high similarity typical of DNA sequences, can improve the efficiency of DNA MSA and so assist in phylogenetic tree construction, comparative genomics, etc. This article showcases the benefit of utilizing similarity features while performing the alignment. The algorithm uses a suffix tree to identify common substrings and a modified Needleman-Wunsch algorithm for pairwise alignments. To improve the efficiency of pairwise alignments, a knowledge base is created and supervised learning with a nearest neighbor algorithm is used to guide the alignment. The algorithm achieved linear complexity O(m), compared to O(m^2). Compared with state-of-the-art algorithms (e.g., HAlign II), SPARK-MSNA provided a 50% improvement in memory utilization when processing human mitochondrial genomes (mt genomes, 100x, 1.1 GB), with better alignment accuracy in terms of average SP score and comparable execution time. The algorithm is implemented on the big data framework Apache Spark in order to improve scalability. The source code and test data are available at: https://sourceforge.net/projects/spark-msna/
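One way to read "modified Needleman-Wunsch" together with the "percentage of diagonals filled" column of the knowledge base is a banded alignment, in which only cells close to the main diagonal of the scoring matrix are computed. The sketch below illustrates that idea only; it is not the authors' implementation. The function name `banded_nw_score`, the `band` parameter, and the scoring values are assumptions made for illustration.

```python
NEG_INF = float("-inf")


def banded_nw_score(a, b, band, match=1, mismatch=-1, gap=-2):
    """Global alignment score computed only for cells with |i - j| <= band.

    Illustrative sketch: the scoring values are arbitrary, and the full
    matrix is allocated for clarity even though only the band is filled.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        # The final cell (n, m) would lie outside the band.
        raise ValueError("band must be at least the length difference")

    # Cells outside the band keep -inf and therefore never win a max().
    score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0

    for i in range(n + 1):
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                step = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + step)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)  # gap in b
            if j > 0:
                best = max(best, score[i][j - 1] + gap)  # gap in a
            score[i][j] = best
    return score[n][m]


print(banded_nw_score("ACGT", "ACGT", band=1))
print(banded_nw_score("ACGT", "AGT", band=1))
```

With a fixed band width, the number of filled cells grows roughly as (2 × band + 1) × m rather than m², which is the kind of saving a narrow band offers for highly similar sequences.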
Year: 2019 PMID: 31036850 PMCID: PMC6488671 DOI: 10.1038/s41598-019-42966-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Sample flow of SPARK-MSNA algorithm.
Figure 2. Flow chart of SPARK-MSNA algorithm.
Sample knowledge base constructed for testing.
| Difference in length of sequences (%) | Similarity (%) | Diagonals filled in the 2 × 2 matrix (%) |
|---|---|---|
| 0.30 | 99.20 | 0.15 |
| 0.20 | 99.30 | 0.14 |
| 0.40 | 98.00 | 0.16 |
| 0.30 | 98.10 | 0.15 |
| 0.10 | 95.90 | 0.18 |
| 0.23 | 97.30 | 0.17 |
| 0.34 | 98.20 | 0.16 |
| 0.35 | 96.40 | 0.18 |
| 0.70 | 99.00 | 0.4 |
| 1.20 | 99.00 | 0.3 |
| 5.80 | 75.00 | 6.2 |
| 25 | 50.00 | 20 |
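The knowledge base above can be queried with a nearest neighbor lookup: given the length difference and similarity of a new sequence pair, return the diagonal percentage stored for the closest known pair. The following is a minimal sketch using the rows of the table. The function name `predict_diagonal_percentage` and the use of unscaled Euclidean distance are assumptions; the source does not state which distance metric or feature scaling SPARK-MSNA uses.

```python
import math

# Rows copied from the sample knowledge base:
# (length difference %, similarity %, diagonals filled %)
KNOWLEDGE_BASE = [
    (0.30, 99.20, 0.15),
    (0.20, 99.30, 0.14),
    (0.40, 98.00, 0.16),
    (0.30, 98.10, 0.15),
    (0.10, 95.90, 0.18),
    (0.23, 97.30, 0.17),
    (0.34, 98.20, 0.16),
    (0.35, 96.40, 0.18),
    (0.70, 99.00, 0.4),
    (1.20, 99.00, 0.3),
    (5.80, 75.00, 6.2),
    (25.0, 50.00, 20.0),
]


def predict_diagonal_percentage(length_diff, similarity, kb=KNOWLEDGE_BASE):
    """Return the diagonal percentage of the nearest knowledge-base row.

    Assumption: plain Euclidean distance on the two unscaled features.
    """
    query = (length_diff, similarity)
    nearest = min(kb, key=lambda row: math.dist(query, row[:2]))
    return nearest[2]


print(predict_diagonal_percentage(5.0, 76.0))  # nearest row is (5.80, 75.00)
```

Because the similarity column spans a much wider numeric range than the length difference, an unscaled distance lets similarity dominate the match; normalizing the features would be a natural refinement.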
Execution time taken by SPARK-MSNA for datasets with different similarity. Datasets were of equal size (3.75 MB).
| Dataset | Similarity (%) | Execution time (without knowledge base) | Execution time (with knowledge base) |
|---|---|---|---|
| Dataset 1 | 95 | 1 min 11 sec | 50 sec |
| Dataset 2 | 70 | 1 min 31 sec | 1 min 4 sec |
| Dataset 3 | 45 | 1 min 47 sec | 1 min 14 sec |
| Dataset 4 | 35 | 2 min 5 sec | 1 min 29 sec |
| Dataset 5 | 20 | 2 min 43 sec | 1 min 55 sec |
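To make the effect of the knowledge base easier to compare across datasets, the timings in the table can be converted to seconds and expressed as a relative reduction. The short script below does this arithmetic on the reported values only; the helper names `to_seconds` and `percent_reduction` are illustrative.

```python
# (similarity %, time without knowledge base, time with knowledge base)
TIMINGS = {
    "Dataset 1": (95, "1 min 11 sec", "50 sec"),
    "Dataset 2": (70, "1 min 31 sec", "1 min 4 sec"),
    "Dataset 3": (45, "1 min 47 sec", "1 min 14 sec"),
    "Dataset 4": (35, "2 min 5 sec", "1 min 29 sec"),
    "Dataset 5": (20, "2 min 43 sec", "1 min 55 sec"),
}


def to_seconds(text):
    """Convert strings such as '1 min 11 sec' or '50 sec' to seconds."""
    parts = text.split()
    total = 0
    for value, unit in zip(parts[::2], parts[1::2]):
        total += int(value) * (60 if unit == "min" else 1)
    return total


def percent_reduction(before, after):
    """Relative saving in execution time, in percent."""
    return 100 * (before - after) / before


for name, (similarity, without_kb, with_kb) in TIMINGS.items():
    before, after = to_seconds(without_kb), to_seconds(with_kb)
    saving = percent_reduction(before, after)
    print(f"{name}: {before} s -> {after} s ({saving:.1f}% less)")
```

On these figures the knowledge base reduces execution time by roughly 29 to 31 percent for every dataset, regardless of similarity.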
Figure 3. Execution time of SPARK-MSNA decreases as the similarity of input sequences increases.
Figure 4. Improvement in execution time of SPARK-MSNA with more nodes.
Figure 5. Speedup in execution time due to additional compute nodes.
Figure 6. Weak scalability of SPARK-MSNA.
Figure 7. Performance comparison of SPARK-MSNA with other algorithms.