Oscar Torreno, Oswaldo Trelles.
Abstract
BACKGROUND: Conventional pairwise sequence comparison software algorithms are being used to process much larger datasets than they were originally designed for. This can result in processing bottlenecks that limit software capabilities or prevent full use of the available hardware resources. Overcoming the barriers that limit the efficient computational analysis of large biological sequence datasets by retrofitting existing algorithms or by creating new applications represents a major challenge for the bioinformatics community.
Year: 2015 PMID: 26260162 PMCID: PMC4531504 DOI: 10.1186/s12859-015-0679-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Summary of GECKO’s modular design. The branches at the top represent dictionary computation using a binary tree for each sequence. Once the dictionaries are calculated, perfect matches between words produce a set of seed points (hits). The hits are then sorted (by diagonal and by offset within the diagonal) and filtered. Finally, the hits are extended to generate a set of HSPs (FragHits). An additional figure with a real example is provided in Section 2.1 of Additional file 1
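The stages named in the caption (dictionary, hits, sorting by diagonal, filtering, extension) can be illustrated with a minimal Python sketch. It is not GECKO's implementation: a hash map stands in for the binary tree, the extension is exact-match only (no scoring scheme), the filter simply skips hits already covered by a fragment on the same diagonal, and all function names and the word length `k` are hypothetical choices made for this illustration.

```python
from collections import defaultdict


def build_dictionary(seq, k):
    """Map every word (k-mer) of seq to the list of positions where it starts."""
    words = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        words[seq[pos:pos + k]].append(pos)
    return words


def find_hits(dict_x, dict_y):
    """Return perfect word matches as (diagonal, x, y) tuples.

    Sorting the tuples orders hits by diagonal first and by offset
    inside the diagonal second.
    """
    hits = [
        (x - y, x, y)
        for word, xs in dict_x.items()
        for x in xs
        for y in dict_y.get(word, ())
    ]
    hits.sort()
    return hits


def extend_hit(seq_x, seq_y, x, y, k):
    """Grow an exact seed in both directions while the bases keep matching."""
    start_x, start_y = x, y
    while start_x > 0 and start_y > 0 and seq_x[start_x - 1] == seq_y[start_y - 1]:
        start_x -= 1
        start_y -= 1
    end_x, end_y = x + k, y + k
    while end_x < len(seq_x) and end_y < len(seq_y) and seq_x[end_x] == seq_y[end_y]:
        end_x += 1
        end_y += 1
    return start_x, start_y, end_x - start_x


def compare(seq_x, seq_y, k=4):
    """Run dictionary -> hits -> sort -> filter -> extend and return fragments."""
    hits = find_hits(build_dictionary(seq_x, k), build_dictionary(seq_y, k))
    fragments = []
    covered_until = {}  # diagonal -> x coordinate where the last fragment ends
    for diagonal, x, y in hits:
        if x < covered_until.get(diagonal, -1):
            continue  # filtered: this hit lies inside an existing fragment
        start_x, start_y, length = extend_hit(seq_x, seq_y, x, y, k)
        covered_until[diagonal] = start_x + length
        fragments.append((start_x, start_y, length))
    return fragments


# Each fragment is (start in X, start in Y, length).
print(compare("ACGTACGTTT", "GGACGTACGA", k=4))  # [(0, 2, 7), (4, 2, 4)]
```

Sorting by diagonal first is what makes the filter cheap: all hits belonging to one fragment arrive consecutively, so a single running coordinate per diagonal is enough to discard them.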
Table 1 Dataset information. From left to right: type of comparison in which the sequence is used, species name, strain and/or chromosome of origin, GenBank accession number and size in Mbp
| Test type | Species | Strain / Chromosome | Accession number | Mbp |
|---|---|---|---|---|
| Pairwise comparison | Tomato Yellow Leaf Curl Virus | TYLCV | GenBank:AM409201.1 | 0.004 |
| | Tomato Yellow Leaf Curl Virus | TYLCV-lr2 | GenBank:EU085423.2 | 0.004 |
| | Buchnera aphidicola | APS (Acyrthosiphon pisum) | GenBank:NC_002528.1 | 0.636 |
| | Buchnera aphidicola | 5A (Acyrthosiphon pisum) | GenBank:NC_011833.1 | 0.640 |
| | Escherichia coli | K-12 | GenBank:NC_000913.2 | 4.596 |
| | Escherichia coli | O157:H7 Sakai | GenBank:NC_002695.1 | 5.448 |
| | Drosophila melanogaster | chromosome 2R | GenBank:NT_033778.3 | 20.948 |
| | Drosophila pseudoobscura | strain MV2–25 chromosome 3 | GenBank:NC_009006.2 | 19.604 |
| Multiple comparison | Homo sapiens | chromosome 1 | GenBank:NC_000001.11 | 246.600 |
| | Pan troglodytes | chromosome 1 | GenBank:NC_006468.3 | 226.172 |
| | Macaca mulatta | chromosome 1 | GenBank:NC_007858.1 | 226.092 |
| | Pongo abelii | chromosome 1 | GenBank:NC_012591.1 | 227.768 |
| | Gorilla gorilla | chromosome 1 | GenBank:NC_018424.1 | 227.336 |
| | Mus musculus | chromosome 1 | GenBank:NC_000067.6 | 193.624 |
| | Rattus norvegicus | chromosome 1 | GenBank:NC_005100.3 | 287.344 |
| | Bos taurus breed Hereford | chromosome 1 | GenBank:AC_000158.1 | 156.840 |
| | Canis lupus familiaris breed Boxer | chromosome 1 | GenBank:NC_006583.3 | 121.516 |
| | Sus scrofa breed mixed | chromosome 1 | GenBank:NC_010443.4 | 312.336 |
Table 2 Execution time (in seconds) and memory consumption for the comparison of the sequences listed in Table 1 under “pairwise comparison” (the lowest execution time and memory consumption of each row are highlighted in bold). The comparison of mammalian chromosomes was also included to test the ability of GECKO and the reference software packages to handle very large datasets. The dictionary calculation time is included in the reported times, since the dictionaries were not pre-calculated. “n.a.” indicates that resource problems prevented the analysis from running, and (*1) after an execution time indicates that it was measured on a larger machine because the run used more than 8 GB of memory (more details of these cases are given in Additional file 1, Section 3.3)
| Comparison | Gepard time | Gepard memory | MUMmer time | MUMmer memory | Mauve time | Mauve memory |
|---|---|---|---|---|---|---|
| TYLCV-TYLCV-lr2 | 0.84 | 52824 | | 2944 | 0.06 | |
| BuchneraAPS-BuchneraBp | 2.56 | 74808 | | | 6.73 | 14304 |
| E.colik12-E.coliO157 | 33.12 | 378412 | 10.63 | | 45.92 | 99880 |
| D.Melanogaster-D.Pseudoobscura | 238.34 | 716244 | 45.99 | 355272 | 294.92 | 379912 |
| H.Sapiens-Chr1-P.Troglodytes-chr1 | | 49788208 | 23226.00 (*1) | 15747168 | >604800.00 | n.a. |

| Comparison | LASTZ time | LASTZ memory | LAST time | LAST memory | GECKO time | GECKO memory |
|---|---|---|---|---|---|---|
| TYLCV-TYLCV-lr2 | 0.04 | 67388 | | 3024 | 0.36 | 1564816 |
| BuchneraAPS-BuchneraBp | 0.46 | 71244 | 46.20 | 475912 | 1.60 | 1564816 |
| E.colik12-E.coliO157 | | 95884 | 109.00 | 1972028 | 17.20 | 1564816 |
| D.Melanogaster-D.Pseudoobscura | | | 1593.00 | 5436716 | 48.72 | 1564816 |
| H.Sapiens-Chr1-P.Troglodytes-chr1 | 78360.00 | 5782352 | n.a. | 312065840 | | |
Fig. 2 Separate dotplot-like representations of human chromosome 1 (X-axis) compared with the equivalent chromosomes of several other mammalian species: (1) Pan troglodytes, (2) Macaca mulatta, (3) Pongo abelii, (4) Gorilla gorilla, (5) Mus musculus, (6) Rattus norvegicus, (7) Bos taurus, (8) Canis familiaris and (9) Sus scrofa. Red indicates forward-strand fragments and black reverse-strand ones. The plots show both closely related (1 to 5) and remotely related (6 to 9) sequences. The remote relationships arise because chromosomes are numbered according to their length, not their content. For example, human chromosome 1 is spread across several chromosomes of Bos taurus (but not the first one, as can be deduced from sub-figure 7). An image with the first five sub-plots projected over one sequence is provided in Additional file 1
Table 3 Dictionary calculation execution time in seconds for the sequences listed in Table 1 under the multiple comparison test type
| Sequence (chr1) | Time |
|---|---|
| Homo sapiens (HS) | 747.53 |
| Pan troglodytes (PT) | 630.07 |
| Macaca mulatta (MM) | 649.26 |
| Pongo abelii (PA) | 628.81 |
| Gorilla gorilla (GG) | 712.68 |
| Mus musculus (MMu) | 537.45 |
| Rattus norvegicus (RN) | 857.28 |
| Bos taurus breed Hereford (BT) | 451.83 |
| Canis lupus familiaris breed Boxer (CF) | 293.36 |
| Sus scrofa breed mixed (SS) | 980.99 |
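As a worked example, the dictionary times above can be divided into the chromosome sizes from the dataset table to get a rough throughput per sequence. The figures in Mbp per second are derived here for illustration and are not reported in the source; the variable names are arbitrary.

```python
# Chromosome 1 sizes in Mbp (dataset table) and dictionary times in seconds
# (table above), keyed by the acronyms used in that table.
size_mbp = {
    "HS": 246.600, "PT": 226.172, "MM": 226.092, "PA": 227.768, "GG": 227.336,
    "MMu": 193.624, "RN": 287.344, "BT": 156.840, "CF": 121.516, "SS": 312.336,
}
dict_time_s = {
    "HS": 747.53, "PT": 630.07, "MM": 649.26, "PA": 628.81, "GG": 712.68,
    "MMu": 537.45, "RN": 857.28, "BT": 451.83, "CF": 293.36, "SS": 980.99,
}

# Throughput of the dictionary stage: sequence length processed per second.
throughput = {name: size_mbp[name] / dict_time_s[name] for name in size_mbp}

for name, rate in throughput.items():
    print(f"{name}: {rate:.3f} Mbp/s")
```

On these numbers every sequence lands between roughly 0.32 and 0.41 Mbp/s, which suggests that dictionary time grows close to linearly with sequence length over this size range.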
Table 4 The numbers above the diagonal are the combined execution times, in seconds, for total HSP calculation, hit sorting and all-vs-all comparison of both strands (forward and reverse); acronyms are as described in Table 3. The charts in the bottom part are symmetric visual representations of the corresponding cells above the diagonal (bar colour legend: blue = GECKO; orange = Gepard; grey = Lastz; yellow = MUMmer). The total execution times (in seconds) for all the comparisons were: GECKO 318591, Gepard 576889, Lastz 4752315 and MUMmer 558360. The total time for GECKO corresponds to a dummy execution; the actual execution time (performing the dictionary calculation once) was 142954
| Sequence | Method | PT | MM | PA | GG | MMu | RN | BT | CF | SS |
|---|---|---|---|---|---|---|---|---|---|---|
| HS | GECKO | 19190 | 2438 | 11282 | 11433 | 9358 | 11367 | 3768 | 2944 | 5875 |
| | Gepard | 6152 | 2581 | 12973 | 2861 | 8644 | 11478 | 5850 | 5540 | 14880 |
| | Lastz | 158874 | 140255 | 117398 | 105912 | 108312 | 96593 | 63294 | 70904 | 157619 |
| | MUMmer | 13891 | 2536 | 7519 | 11083 | 31164 | 10566 | 2127 | 3277 | 10170 |
| PT | GECKO | | 2932 | 10567 | 9287 | 9686 | 10800 | 3766 | 2939 | 5909 |
| | Gepard | | 15662 | 22242 | 26400 | 27394 | 9113 | 11362 | 10445 | 12856 |
| | Lastz | | 135093 | 191012 | 181386 | 210949 | 164312 | 171510 | 146315 | 115160 |
| | MUMmer | | 6214 | 23434 | 43301 | 27051 | 9640 | 1594 | 2836 | 10384 |
| MM | GECKO | | | 3322 | 5432 | 5306 | 7558 | 4461 | 3294 | 6387 |
| | Gepard | | | 16356 | 16675 | 15573 | 9349 | 8663 | 7111 | 13517 |
| | Lastz | | | 141032 | 128324 | 136632 | 128874 | 72552 | 57393 | 133667 |
| | MUMmer | | | 4512 | 6669 | 17013 | 25387 | 1736 | 6005 | 10423 |
| PA | GECKO | | | | 31137 | 10012 | 5907 | 3770 | 3081 | 5879 |
| | Gepard | | | | 28282 | 25929 | 11305 | 11727 | 11680 | 14778 |
| | Lastz | | | | 148768 | 167206 | 135357 | 82444 | 63000 | 157305 |
| | MUMmer | | | | 15115 | 36458 | 19321 | 3330 | 3170 | 11564 |
| GG | GECKO | | | | | 9703 | 5957 | 5206 | 5294 | 7895 |
| | Gepard | | | | | 25819 | 9960 | 13355 | 13250 | 13244 |
| | Lastz | | | | | 137351 | 63414 | 44411 | 28089 | 66732 |
| | MUMmer | | | | | 36614 | 11729 | 6869 | 30429 | 45431 |
| MMu | GECKO | | | | | | 5908 | 5159 | 5170 | 5873 |
| | Gepard | | | | | | 10229 | 12219 | 11360 | 13493 |
| | Lastz | | | | | | 58823 | 44641 | 32046 | 92761 |
| | MUMmer | | | | | | 8546 | 1307 | 2756 | 10128 |
| RN | GECKO | | | | | | | 5930 | 5894 | 5935 |
| | Gepard | | | | | | | 8143 | 6458 | 17278 |
| | Lastz | | | | | | | 47869 | 39163 | 79895 |
| | MUMmer | | | | | | | 1979 | 4702 | 15277 |
| BT | GECKO | | | | | | | | 3777 | 5914 |
| | Gepard | | | | | | | | 6574 | 9827 |
| | Lastz | | | | | | | | 21861 | 56029 |
| | MUMmer | | | | | | | | 717 | 1608 |
| CF | GECKO | | | | | | | | | 5889 |
| | Gepard | | | | | | | | | 8302 |
| | Lastz | | | | | | | | | 51778 |
| | MUMmer | | | | | | | | | 2777 |
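The totals quoted in the caption can be cross-checked and turned into relative figures with a few lines of Python. The sketch below sums the GECKO cells of the table and divides each tool's total by GECKO's actual time of 142954 s. The ratios are derived here for illustration and are not stated in the source, and they do not control for differences in what each tool computes.

```python
# GECKO cells of the all-vs-all table, one list per row (HS, PT, ..., CF).
gecko_rows = [
    [19190, 2438, 11282, 11433, 9358, 11367, 3768, 2944, 5875],
    [2932, 10567, 9287, 9686, 10800, 3766, 2939, 5909],
    [3322, 5432, 5306, 7558, 4461, 3294, 6387],
    [31137, 10012, 5907, 3770, 3081, 5879],
    [9703, 5957, 5206, 5294, 7895],
    [5908, 5159, 5170, 5873],
    [5930, 5894, 5935],
    [3777, 5914],
    [5889],
]
gecko_sum = sum(sum(row) for row in gecko_rows)
print(gecko_sum)  # 318591, the "dummy execution" total in the caption

# Totals from the caption, in seconds.
totals_s = {"Gepard": 576889, "Lastz": 4752315, "MUMmer": 558360}
gecko_actual_s = 142954  # dictionaries calculated once

# How many times longer each tool took than GECKO's actual run.
ratios = {tool: total / gecko_actual_s for tool, total in totals_s.items()}
for tool, ratio in ratios.items():
    print(f"{tool}: {ratio:.1f}x")
```

On these totals, Gepard and MUMmer take about four times as long as GECKO's actual run and Lastz about 33 times as long.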