Zhan Peng, Yuping Wang.
Abstract
Finding the multiple longest common subsequences (MLCS) of a set of sequences is a classical NP-hard problem with many applications. One of the most effective exact approaches to the MLCS problem is based on the dominant point graph, a kind of directed acyclic graph (DAG). However, the time and space efficiency of the leading dominant point graph based approaches is still unsatisfactory: constructing the dominant point graph requires a huge amount of time and space, which hinders the application of these approaches to large-scale and long sequences. To address this issue, we propose a new time- and space-efficient graph model for the MLCS problem called the Leveled-DAG. During construction, the Leveled-DAG promptly eliminates every node that cannot contribute to the MLCS. At any moment, only the current level and some previously generated nodes need to be kept in memory, which greatly reduces memory consumption. Moreover, the final graph contains only one node, in which all of the MLCS are saved, so no additional search for the MLCS is needed. Experiments were conducted on real biological sequences of different numbers and lengths, and the proposed algorithm was compared with three state-of-the-art algorithms. The results show that the Leveled-DAG approach needs less time and space than the compared algorithms, especially on large-scale and long sequences.
Keywords: biological sequence alignment; directed acyclic graph; dominant point method; longest common subsequence; multiple longest common subsequences
Year: 2017 PMID: 28848600 PMCID: PMC5552671 DOI: 10.3389/fgene.2017.00104
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1 (A) The score table of two DNA sequences ACTAGCTA and TCAGGTAT. (B) Constructing the MLCS from the score table, where the shaded cells correspond to dominant points.
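The score table in Figure 1 can be reproduced with the textbook dynamic-programming recurrence for two sequences. The following is a minimal sketch under that assumption; the function name `lcs_score_table` is illustrative and does not come from the paper.

```python
def lcs_score_table(a: str, b: str) -> list[list[int]]:
    """Return the (len(a)+1) x (len(b)+1) LCS score table.

    table[i][j] is the LCS length of a[:i] and b[:j];
    row 0 and column 0 are the all-zero border.
    """
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                # A match extends the best subsequence of the two prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better of "skip x" and "skip y".
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


table = lcs_score_table("ACTAGCTA", "TCAGGTAT")
print(table[-1][-1])  # LCS length of the two example sequences: 5
```

The table has O(n^d) cells for d sequences of length n, which is why this direct approach does not scale to many sequences and why dominant point methods work on match points instead.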
Figure 2 The DAG of two sequences ACTAGCTA and TCAGGTAT constructed by the general dominant point based algorithms, in which the black and gray nodes will be eliminated by the Minima operation.
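The Minima operation mentioned in the caption filters a set of match points down to the non-dominated ones. The sketch below assumes the common definition in which a point p dominates q when p is no larger than q in every coordinate and differs from it in at least one; the names `dominates` and `minima` are illustrative, and the quadratic pairwise scan is chosen for clarity rather than speed.

```python
Point = tuple[int, ...]


def dominates(p: Point, q: Point) -> bool:
    """Assumed definition: p <= q in every coordinate and p != q."""
    return p != q and all(pi <= qi for pi, qi in zip(p, q))


def minima(points: set[Point]) -> set[Point]:
    """Keep only the points that no other point in the set dominates."""
    return {q for q in points if not any(dominates(p, q) for p in points)}


# (3, 4) is dominated by both (1, 3) and (2, 2), so it is dropped.
print(sorted(minima({(1, 3), (2, 2), (3, 4)})))
```

Because every candidate is compared against every other candidate, this step is a natural bottleneck when a level holds many points.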
Figure 3 (A) The successor table of sequence ACTAGCTA. (B) The successor table of sequence TCAGGTAT.
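A successor table like those in Figure 3 can be built in one right-to-left pass per symbol. This sketch assumes the usual convention that the entry for symbol σ at index j is the 1-based position of the first occurrence of σ strictly after position j, with `None` when there is none; the function name and the `None` marker are illustrative choices, not taken from the paper.

```python
from typing import Optional


def successor_table(seq: str, alphabet: str) -> dict[str, list[Optional[int]]]:
    """Map each symbol to a row of len(seq)+1 entries.

    row[j] is the smallest 1-based position k > j with seq[k-1] == symbol,
    or None if the symbol does not occur after position j.
    """
    n = len(seq)
    table: dict[str, list[Optional[int]]] = {}
    for sym in alphabet:
        row: list[Optional[int]] = [None] * (n + 1)
        nxt: Optional[int] = None
        for j in range(n - 1, -1, -1):
            if seq[j] == sym:  # seq[j] sits at 1-based position j + 1
                nxt = j + 1
            row[j] = nxt
        table[sym] = row
    return table


st = successor_table("ACTAGCTA", "ACGT")
print(st["A"])  # [1, 4, 4, 4, 8, 8, 8, 8, None]
```

With one such table per sequence, the successor of a match point for a given symbol is found by a single lookup in each table.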
Figure 4 The Leveled-DAG constructed for sequences ACTAGCTA and TCAGGTAT. The match point and the corresponding symbol are shown in each node. The partial LCSs are shown by red strings near the nodes. The white nodes are newly created and will be expanded later. The green ones are outdated and will be removed right away. The red ones with incoming edges are left from the previous levels and cannot be removed at present. (A) Generate the first level of nodes. (B) Generate the second level of nodes. (C) Generate the third level of nodes. (D) No new node is created any more. (E) Delete the remaining outdated nodes. (F) Only the end node is left.
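The level-by-level idea in Figure 4 can be illustrated with a much simplified sketch. This is not the authors' Leveled-DAG algorithm: it keeps only the current level in memory and attaches partial subsequences to each node, but it omits the paper's handling of outdated and retained nodes, and each node stores every partial subsequence of the current length, which can grow exponentially on long inputs. All names (`successor_tables`, `level_wise_mlcs`) are hypothetical.

```python
def successor_tables(seqs, alphabet):
    """One successor table per sequence: tables[i][sym][j] is the first
    1-based position of sym after position j in seqs[i], or None."""
    tables = []
    for seq in seqs:
        table = {}
        for sym in alphabet:
            row = [None] * (len(seq) + 1)
            nxt = None
            for j in range(len(seq) - 1, -1, -1):
                if seq[j] == sym:
                    nxt = j + 1
                row[j] = nxt
            table[sym] = row
        tables.append(table)
    return tables


def level_wise_mlcs(seqs):
    """Expand match points level by level; the last non-empty level
    holds exactly the longest common subsequences."""
    alphabet = sorted(set.intersection(*(set(s) for s in seqs)))
    tables = successor_tables(seqs, alphabet)
    # Level 0 holds only the source point, carrying the empty string.
    level = {tuple(0 for _ in seqs): {""}}
    while True:
        next_level = {}
        for point, partials in level.items():
            for sym in alphabet:
                succ = tuple(t[sym][p] for t, p in zip(tables, point))
                if None in succ:
                    continue  # sym is exhausted in at least one sequence
                next_level.setdefault(succ, set()).update(
                    s + sym for s in partials
                )
        if not next_level:
            return set().union(*level.values())
        level = next_level  # the previous level is discarded here


print(sorted(level_wise_mlcs(["ACTAGCTA", "TCAGGTAT"])))
# ['CAGTA', 'TAGTA']
```

The sketch is correct because the leftmost embedding of any common subsequence follows immediate-successor edges, so a node reached at level k always carries common subsequences of length k. The paper's model avoids the redundancy of this version by removing nodes that can no longer contribute.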
The average running times (in seconds) of the test algorithms on different numbers of DNA and protein sequences with the length of sequences fixed to 100. (Using 32 threads).
| 3 | 0.052 (0.003) | 0.041 (0.001) | 0.031 (0.002) | 0.018 (0.001) | 0.021 (0.001) | 0.034 (0.002) | 0.027 (0.001) | 0.016 (0.001) |
| 4 | 0.255 (0.01) | 0.203 (0.02) | 0.071 (0.004) | 0.053 (0.003) | 0.183 (0.02) | 0.152 (0.03) | 0.051 (0.003) | 0.037 (0.002) |
| 5 | 2.9 (0.1) | 1.5 (0.09) | 0.12 (0.008) | 0.082 (0.004) | 2.1 (0.1) | 1.0 (0.08) | 0.098 (0.008) | 0.077 (0.006) |
| 6 | 26.5 (1.7) | 10.3 (0.8) | 1.3 (0.09) | 1.1 (0.1) | 20.8 (1.2) | 6.9 (0.6) | 0.94 (0.03) | 0.75 (0.02) |
| 7 | 151.8 (10.0) | 32.8 (1.9) | 3.6 (0.1) | 2.7 (0.2) | 116.5 (10.3) | 21.8 (1.3) | 2.8 (0.2) | 1.9 (0.1) |
| 8 | 834.9 (43.8) | 147.6 (8.8) | 8.5 (0.6) | 6.9 (0.4) | 746.5 (58.8) | 107.0 (8.7) | 6.3 (0.3) | 4.7 (0.2) |
| 9 | 4,174.6 (408.7) | 738.5 (51.5) | 16.4 (0.9) | 13.7 (0.6) | 3,059.2 (301.4) | 585.6 (40.2) | 12.5 (0.8) | 9.9 (0.4) |
| 10 | 25,671.4 (2,433.8) | 3,385.4 (326.4) | 30.0 (2.3) | 25.3 (0.9) | 22,751.9 (1,596.7) | 2,645.3 (215.7) | 24.7 (1.5) | 20.6 (1.1) |
| 20 | – | – | 64.8 (4.6) | 51.2 (2.1) | – | – | 57.7 (3.3) | 45.3 (2.3) |
| 30 | – | – | 136.7 (6.7) | 96.7 (3.7) | – | – | 124.3 (8.9) | 89.2 (4.7) |
| 40 | – | – | 250.3 (9.4) | 191.4 (7.5) | – | – | 223.4 (14.7) | 180.8 (10.5) |
| 50 | – | – | 463.2 (17.5) | 380.1 (12.3) | – | – | 432.7 (22.4) | 366.4 (16.0) |
| 60 | – | – | 665.4 (42.2) | 530.3 (27.4) | – | – | 590.6 (31.2) | 509.5 (20.4) |
| 70 | – | – | 1,088.1 (76.7) | 875.5 (39.4) | – | – | 967.8 (76.4) | 848.6 (39.6) |
| 80 | – | – | 1,684.6 (127.5) | 1,233.2 (62.3) | – | – | 1,432.6 (105.2) | 1,167.0 (66.8) |
| 90 | – | – | 2,217.9 (188.3) | 1,764.6 (85.5) | – | – | 2,053.5 (127.1) | 1,715.1 (84.1) |
| 100 | – | – | 3,041.5 (220.9) | 2,417.8 (174.9) | – | – | 2,320.2 (144.8) | 2,056.2 (101.5) |
| 200 | – | – | 3,398.3 (241.4) | 2,778.2 (190.2) | – | – | 2,492.6 (152.4) | 2,118.5 (114.9) |
| 300 | – | – | 3,665.0 (263.8) | 2,962.5 (206.7) | – | – | 2,614.2 (165.7) | 2,214.3 (138.8) |
| 400 | – | – | 3,981.6 (285.0) | 3,191.2 (218.0) | – | – | 2,745.3 (172.8) | 2,375.4 (152.9) |
| 500 | – | – | 4,237.2 (310.3) | 3,384.0 (231.4) | – | – | 2,862.4 (181.1) | 2,435.1 (164.3) |
| 600 | – | – | 4,555.9 (336.9) | 3,547.2 (243.7) | – | – | 2,947.9 (193.4) | 2,479.2 (170.7) |
| 700 | – | – | 4,880.3 (362.7) | 3,854.7 (266.2) | – | – | 3,174.8 (204.5) | 2,511.9 (183.2) |
The standard deviations of the running times are shown in parentheses.
The memory consumption (in MB) of the test algorithms on different numbers of DNA and protein sequences with the length of sequences fixed to 100.
| 3 | 28 | 31 | 8 | 5 | 25 | 28 | 7 | 4 |
| 4 | 373 | 447 | 23 | 17 | 330 | 403 | 19 | 14 |
| 5 | 1,358 | 1,485 | 93 | 85 | 1,167 | 1,304 | 77 | 62 |
| 6 | 3,315 | 3,490 | 297 | 223 | 2,718 | 2,960 | 203 | 190 |
| 7 | 5,190 | 5,862 | 534 | 489 | 4,152 | 4,706 | 469 | 418 |
| 8 | 11,057 | 12,051 | 1,211 | 1,124 | 8,513 | 9,871 | 1,017 | 943 |
| 9 | 20,634 | 21,183 | 3,058 | 2,765 | 15,138 | 16,062 | 2,538 | 2,238 |
| 10 | 35,769 | 36,934 | 5,813 | 5,232 | 25,637 | 26,048 | 4,766 | 4,251 |
| 20 | – | – | 32,329 | 28,126 | – | – | 24,246 | 18,045 |
| 30 | – | – | 48,765 | 39,291 | – | – | 36,824 | 26,713 |
| 40 | – | – | 67,813 | 52,607 | – | – | 49,503 | 35,182 |
| 50 | – | – | 91,128 | 68,103 | – | – | 64,292 | 46,137 |
| 60 | – | – | 121,268 | 87,359 | – | – | 81,379 | 58,174 |
| 70 | – | – | 156,470 | 118,600 | – | – | 98,541 | 61,036 |
| 80 | – | – | 197,387 | 141,859 | – | – | 117,390 | 76,283 |
| 90 | – | – | 209,145 | 146,402 | – | – | 120,833 | 81,429 |
| 100 | – | – | 229,372 | 151,386 | – | – | 124,124 | 84,069 |
| 200 | – | – | 252,247 | 163,948 | – | – | 131,920 | 88,132 |
| 300 | – | – | 261,963 | 167,085 | – | – | 138,255 | 92,025 |
| 400 | – | – | 268,993 | 170,811 | – | – | 144,213 | 96,044 |
| 500 | – | – | 276,103 | 173,945 | – | – | 151,318 | 101,250 |
| 600 | – | – | 290,398 | 177,140 | – | – | 157,986 | 104,986 |
| 700 | – | – | 299,498 | 179,846 | – | – | 162,298 | 108,107 |
The average running times (in seconds) of the test algorithms under different lengths of DNA and protein sequences with the number of sequences fixed to 5. (Using 32 threads).
| 50 | 0.57 (0.03) | 0.13 (0.01) | 0.038 (0.002) | 0.026 (0.001) | 0.06 (0.001) | 0.018 (0.001) | 0.004 (0) | 0.001 (0) |
| 100 | 2.7 (0.2) | 1.4 (0.08) | 0.23 (0.03) | 0.96 (0.04) | 0.3 (0.02) | 0.16 (0.01) | 0.077 (0.003) | 0.058 (0.006) |
| 200 | 244.1 (10.4) | 10.6 (0.2) | 8.5 (0.3) | 6.8 (0.2) | 28.5 (1.5) | 1.2 (0.1) | 0.96 (0.102) | 0.77 (0.05) |
| 300 | 4,064.8 (312.6) | 95.3 (4.7) | 38.7 (2.2) | 32.6 (2.7) | 467.1 (14.4) | 11.4 (1.1) | 4.4 (0.2) | 3.1 (0.2) |
| 400 | – | 312.4 (11.5) | 77.8 (4.9) | 59.5 (3.8) | 3,659.2 (363.8) | 36.7 (1.8) | 8.9 (0.8) | 7.2 (0.6) |
| 500 | – | 1,566.9 (128.9) | 132.6 (7.8) | 112.2 (5.6) | – | 180 (6.2) | 15.2 (1.1) | 12.7 (0.9) |
| 600 | – | 4,384.1 (297.4) | 201.1 (11.7) | 165.3 (8.3) | – | 533.8 (19.4) | 23.1 (2.0) | 18.9 (1.0) |
| 700 | – | 10,347.5 (913.2) | 287.3 (12.1) | 223.4 (10.1) | – | 1,075.5 (32.7) | 32.6 (2.5) | 24.6 (1.03) |
| 800 | – | 27,489.2 (2,351.3) | 373.2 (14.5) | 313.8 (13.3) | – | 2,958.1 (61.5) | 43.9 (3.8) | 35.3 (1.1) |
| 900 | – | – | 487.3 (21.6) | 399.1 (15.5) | – | 6,709.0 (221.2) | 54.1 (4.2) | 44.8 (1.2) |
| 1,000 | – | – | 644.7 (29.3) | 513.5 (19.8) | – | 11,508.6 (1,258.9) | 70.8 (5.7) | 57.0 (1.8) |
| 2,000 | – | – | 4,240.5 (251.2) | 3,017.6 (87.2) | – | – | 469.3 (13.7) | 355.4 (10.2) |
| 3,000 | – | – | 9,915.1 (673.8) | 7,922.0 (297.3) | – | – | 1,168.5 (30.5) | 873.1 (22.4) |
| 4,000 | – | – | 16,963.4 (1,553.3) | 13,762.3 (1,433.4) | – | – | 1,843.0 (41.3) | 1,532.5 (32.4) |
| 5,000 | – | – | 24,672.9 (2,104.3) | 19,074.7 (1,658.1) | – | – | 2,788.3 (55.6) | 2,065.2 (46.3) |
The standard deviations of the running times are shown in parentheses.
The memory consumption (in MB) of the test algorithms under different lengths of DNA and protein sequences with the number of sequences fixed to 5.
| 50 | 47 | 56 | 21 | 17 | 40 | 42 | 18 | 11 |
| 100 | 1,352 | 1,481 | 99 | 82 | 1,163 | 1,296 | 81 | 56 |
| 200 | 8,331 | 8,652 | 2,353 | 1,469 | 6,249 | 7,963 | 1,894 | 988 |
| 300 | 16,874 | 16,993 | 4,050 | 3,051 | 11,047 | 12,735 | 3,251 | 1,864 |
| 400 | – | 27,355 | 5,866 | 4,787 | 113,665 | 20,586 | 4,819 | 3,012 |
| 500 | – | 41,257 | 8,297 | 6,654 | – | 32,771 | 6,770 | 4,351 |
| 600 | – | 60,912 | 12,063 | 8,598 | – | 46,009 | 9,023 | 5,806 |
| 700 | – | 85,733 | 18,550 | 10,163 | – | 65,574 | 11,652 | 7,513 |
| 800 | – | 126,483 | 26,070 | 14,250 | – | 86,684 | 14,725 | 9,426 |
| 900 | – | – | 36,341 | 20,539 | – | 111,748 | 18,380 | 11,573 |
| 1,000 | – | – | 49,442 | 27,985 | – | 140,457 | 22,507 | 13,690 |
| 2,000 | – | – | 95,784 | 55,549 | – | – | 45,633 | 27,811 |
| 3,000 | – | – | 152,178 | 86,732 | – | – | 71,058 | 42,669 |
| 4,000 | – | – | 224,135 | 125,454 | – | – | 99,564 | 58,937 |
| 5,000 | – | – | 301,375 | 165,756 | – | – | 134,568 | 77,502 |
Pseudocode
Input: The successor tables of the input sequences.
Output: The MLCS of the input sequences.
[Algorithm body not recoverable from the source. Surviving fragments: "δ ← the corresponding symbol of …", "Append δ to …", "Delete …".]