Todd J Treangen, Xavier Messeguer.
Abstract
BACKGROUND: Due to recent advances in whole genome shotgun sequencing and assembly technologies, the financial cost of decoding an organism's DNA has been drastically reduced, resulting in a recent explosion of genomic sequencing projects. This increase in related genomic data will allow for in-depth studies of evolution in closely related species through multiple whole genome comparisons.
Year: 2006 PMID: 17022809 PMCID: PMC1629028 DOI: 10.1186/1471-2105-7-433
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
A summary of experimental results for 23 distinct sets of sequences
| # | Sequence set | Size (MB) | Min. anchor size | Anchors | Multi-MUMs | Clusters | Time (anchors) | Time (multi-MUMs) | Time (total) | Mem (MB) | Cov. (%) |
| 1 | | 1.5 | 22 | 264 | 6325 | 649 | 2s | 5s | 8s | 52 | 72.8 |
| 2 | | 3.5 | 23 | 1159 | 3229 | 484 | 6s | 3s | 9s | 153 | 62.5 |
| 3 | | 9.6 | 27 | 470 | 516 | 39 | 16s | 1s | 17s | 419 | 98.9 |
| 4 | | 8.7 | 24 | 13101 | 45940 | 722 | 16s | 114s | 143s | 283 | 94.3 |
| 5 | | 15.3 | 27 | 15843 | 37702 | 2441 | 25s | 151s | 181s | 487 | 74.8 |
| 6 | P. | 18.4 | 27 | 11232 | 39753 | 1527 | 35s | 252s | 294s | 573 | 72.8 |
| 7 | | 4.9 | 21 | 770 | 0 | 7 | 6s | 0s | 6s | 156 | 98.5 |
| 8 | | 21.4 | 25 | 14049 | 6 | 400 | 24s | 1s | 25s | 488 | 94.0 |
| 9 | | 23.1 | 23 | 37596 | 1285 | 564 | 32s | 2s | 38s | 548 | 76.7 |
| 10 | | 23.8 | 23 | 46336 | 983 | 328 | 35s | 1s | 39s | 567 | 93.1 |
| 11 | | 25.3 | 23 | 47221 | 5543 | 704 | 38s | 9s | 57s | 553 | 84.2 |
| 12 | | 13.1 | 21 | 15446 | 84 | 121 | 17s | 1s | 18s | 258 | 88.3 |
| 13 | | 19.6 | 24 | 23216 | 132 | 260 | 24s | 2s | 26s | 390 | 92.6 |
| 14 | | 36.8 | 25 | 27731 | 4149 | 468 | 54s | 7s | 62s | 713 | 93.2 |
| 15 | | 48.4 | 23 | 39979 | 3753 | 418 | 63s | 12s | 78s | 740 | 73.3 |
| 16 | | 72.2 | 23 | 5802 | 8136 | 1218 | 95s | 84s | 181s | 991 | 54.9 |
| 17 | | 93.6 | 18 | 1132 | 637 | 907 | 161s | 99s | 261s | 1174 | 15.6 |
| 18 | Bacilli 14 (12&13) | 22.7 | 19 | 251 | 3801 | 2721 | 43s | 41s | 93s | 414 | 14.3 |
| 19 | Bacilli 14 (13&14) | 56.4 | 19 | 431 | 5250 | 2718 | 79s | 100s | 185s | 654 | 26.3 |
| 20 | | 62.3 | 15 | 597 | 1691 | 2045 | 100s | 54s | 155s | 638 | 4.1 |
A selection of results for 20 independent sets of closely related sequence comparisons conducted with M-GCAT. Size and memory usage are listed in megabytes (MB). All experiments were performed, and running times (CPU time) measured, on a 2 GHz Pentium processor with 2 GB of main memory, running Windows XP Professional. Size is the total size (MB) of the set of sequences. The table also reports the number of multi-MUM anchors found, the configured minimum size of multi-MUM anchors, the number of multi-MUMs found, and the number of multi-MUM clusters. Three running times are given: the time needed to find the set of multi-MUM anchors, the time needed to find the initial set of multi-MUMs, and the time required to perform the entire comparison. Mem is the peak usage of system memory (MB), and Cov. is the percentage of each sequence that was aligned. The percentage that was not aligned corresponds to regions where no multi-MUMs were found. A p value of 10,000,000 and a q value of 100 were used for all experiments. The d value was set to the length of the longest sequence in each example to emphasize the global alignment framework. For a complete listing of the sequences used in these comparisons, refer to Additional file 2.
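The Cov. column can be read as the fraction of a sequence that falls inside aligned regions. The sketch below shows one way such a percentage could be computed; it is not M-GCAT's code. The function name `coverage_percent` and the representation of aligned regions as half-open `(start, end)` intervals are assumptions made for illustration.

```python
def coverage_percent(aligned, seq_len):
    """Percentage of a sequence covered by aligned regions.

    `aligned` is a list of half-open (start, end) intervals; this
    representation is an assumption, not M-GCAT's internal format.
    Overlapping intervals are merged so no base is counted twice.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    covered = 0
    cur_start, cur_end = None, None
    for start, end in sorted(aligned):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval before starting a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return 100.0 * covered / seq_len
```

For example, the intervals (0, 40), (30, 60) and (80, 100) on a sequence of length 200 cover 80 positions, which gives 40% coverage.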
Comparison of M-GCAT & Mauve alignment frameworks.
| Sequence set | M-GCAT: MUM Clusters | M-GCAT: Coverage | Mauve: LCBs | Mauve: Coverage |
| | 126 | 80.1% | 126 | 86.4% |
| | 72 | 81.0% | 91 | 85.2% |
| | 85 | 75.5% | 113 | 82.0% |
Figure 5. Analysis of multiple genome comparison framework efficiency and memory usage. This experiment was run exclusively on a 2 GHz Pentium M processor with 2 GB of main memory, running Windows XP Professional. Memory usage is reported as the peak memory usage during the comparison. Time is reported as total CPU time.
Verifying reliability of selected Alignment frameworks
| # | Sequence set | Clusters | Identified | Missed | Known | Unknown | Total | Accuracy |
| 1 | | 649 | 1188 | 244 | 1432 | 576 | 2008 | 82.0% |
| 2 | | 484 | 1971 | 585 | 2556 | 675 | 3231 | 77.0% |
| 13 | | 328 | 12108 | 3823 | 15931 | 4953 | 20884 | 76.0% |
| 18 | | 418 | 28428 | 1757 | 30185 | 5375 | 35560 | 94.2% |
| 19 | | 1218 | 42617 | 2883 | 45550 | 7291 | 52791 | 93.7% |
Testing the reliability of five of the alignment frameworks generated in Table 1. The first data column is the number of multi-MUM clusters analyzed for orthologs. Identified is the total number of genes with one or more identified orthologs in their corresponding multi-MUM cluster, Missed is the total number of proteins in multi-MUM clusters with no orthologs, Unknown is the total number of genes that have yet to be fully classified, Total is the total number of genes in all of the multi-MUM clusters in all genomes, and Accuracy is the number of Identified orthologs divided by the total Known (Identified + Missed).
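The Accuracy definition above is a simple ratio, sketched here in Python. The function name `ortholog_accuracy` is hypothetical, and the sample values are the Identified and Missed counts from the row labelled 18 in the table.

```python
def ortholog_accuracy(identified, missed):
    """Accuracy as defined in the caption: Identified / (Identified + Missed).

    Returns a percentage. Unknown genes are excluded from the denominator,
    since only classified (Known) genes count.
    """
    known = identified + missed
    if known == 0:
        raise ValueError("no known genes to evaluate")
    return 100.0 * identified / known


# Row 18 of the table: Identified = 28428, Missed = 1757.
print(round(ortholog_accuracy(28428, 1757), 1))  # prints 94.2
```

Because Unknown genes are left out of the denominator, the ratio measures accuracy only over genes whose ortholog status has been classified.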
Multi-MUM search comparison
| | M-GCAT | | M-GCAT w/Inversions | | EMAGEN-DM | | MGA | |
| Running time(s) | 32+190 | 11+26 | 32+210 | 11+32 | 178+223 | 70+15 | 535+441 | 338+382 |
| Number of LIS-MUMs | 34844 | 5568 | 36154 | 10238 | 34612 | 3781 | 34922 | 5503 |
| Total Length of LIS-MUMs | 3592285 | 631663 | 4012435 | 1540505 | 3484053 | 425309 | 3547621 | 626112 |
Comparison of multi-MUM search efficiency. MGA and EMAGEN-DM results were obtained on a Sun Blade 1000 workstation with 1 GB RAM, as reported in [14]. M-GCAT results were obtained on a Sun Ultra-250 with a 400 MHz processor and 512 MB RAM.