Chris-André Leimeister, Thomas Dencker, Burkhard Morgenstern.
Abstract
Motivation: Most methods for pairwise and multiple genome alignment use fast local homology search tools to identify anchor points, i.e. high-scoring local alignments of the input sequences. Sequence segments between those anchor points are then aligned with slower, more sensitive methods. Finding suitable anchor points is therefore crucial for genome sequence comparison; speed and sensitivity of genome alignment depend on the underlying anchoring methods.
Year: 2019 PMID: 29992260 PMCID: PMC6330006 DOI: 10.1093/bioinformatics/bty592
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Spaced-words histogram for a comparison of two bacterial genomes, Phaeobacter gallaeciensis 2.10 and Rhodobacterales bacterium Y4I. All possible spaced-word matches with respect to a given binary pattern P are identified, and their scores are calculated as explained in the main text. The number of spaced-word matches with a score s is plotted against s. Two peaks are visible: an approximately normally distributed peak for background spaced-word matches, and a more complex peak for spaced-word matches representing homologies. With a cut-off value of zero, background and homologous spaced-word matches can be reliably separated.
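The filtering idea behind Fig. 1 can be illustrated with a minimal Python sketch. It is not the FSWM implementation: the function name `spaced_word_matches`, the pattern encoding as a string of `1` (match position) and `0` (don't-care position), and the +1/-1 scoring of don't-care positions are all assumptions made for illustration, and the actual scoring scheme is the one described in the main text.

```python
from collections import defaultdict


def spaced_word_matches(seq1, seq2, pattern, match=1, mismatch=-1, cutoff=0):
    """Return (i, j, score) for spaced-word matches scoring above `cutoff`.

    `pattern` is a binary string: '1' marks a match position, '0' a
    don't-care position. The +1/-1 scoring is an illustrative assumption.
    """
    care = [k for k, c in enumerate(pattern) if c == "1"]
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    span = len(pattern)

    # Index every window of seq1 by its characters at the match positions.
    index = defaultdict(list)
    for i in range(len(seq1) - span + 1):
        key = "".join(seq1[i + k] for k in care)
        index[key].append(i)

    hits = []
    for j in range(len(seq2) - span + 1):
        key = "".join(seq2[j + k] for k in care)
        for i in index.get(key, []):
            # Score only the don't-care positions; match positions agree by construction.
            score = sum(
                match if seq1[i + k] == seq2[j + k] else mismatch
                for k in dont_care
            )
            if score > cutoff:
                hits.append((i, j, score))
    return hits


if __name__ == "__main__":
    print(spaced_word_matches("ACGTACGT", "ACCTACGA", "1101"))
```

Plotting the number of matches per score for two real genomes would give a histogram of the kind shown in Fig. 1; the `cutoff` parameter plays the role of the cut-off value of zero mentioned in the caption.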
Fig. 2. Recall values for Mugsy using anchor points generated with FSWM and with MUMmer, respectively, as well as for Cactus. Test data were simulated genomic sequences generated with ALF; see main text for details. FSWM was run with the default weight w = 10, i.e. with 10 match positions in the underlying pattern, and with w = 8.
Fig. 3. Precision values for Mugsy with FSWM and MUMmer anchor points, respectively, and for Cactus. Test data and parameter values as in Figure 2.
Fig. 4. F-Score values for Mugsy with FSWM and MUMmer anchor points, respectively, and for Cactus. Test data and parameter values as in Figure 2.
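For readers unfamiliar with the three measures in Figs. 2 to 4, the sketch below shows how recall, precision and F-score are commonly computed when a predicted alignment is compared with a reference alignment. It assumes that both alignments are represented as sets of aligned position pairs and that the F-score is the harmonic mean of precision and recall; the exact evaluation protocol is the one given in the main text, and the function name `precision_recall_f` is hypothetical.

```python
def precision_recall_f(predicted, reference):
    """Compare two alignments given as collections of aligned position pairs.

    Assumption: a pair counts as correct if it also occurs in the reference.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)

    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score as the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    pred = {(0, 0), (1, 1), (2, 3), (3, 4)}
    ref = {(0, 0), (1, 1), (2, 2)}
    print(precision_recall_f(pred, ref))
```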
Evaluation of multiple alignments of 29 E. coli/Shigella genomes, 32 Roseobacter genomes and 9 fungal genomes, obtained with Mugsy, using anchor points calculated with FSWM and with MUMmer, respectively
| | Core LCBs | Aligned pairs | Core col. | LCBs |
|---|---|---|---|---|
| 29 E. coli/Shigella genomes | | | | |
| | 539 | 1.61E+09 | 2,827,115 | 4,138 |
| | 664 | 1.63E+09 | 2,867,432 | 5,906 |
| | 20,163 | 1.48E+09 | 2,663,750 | 56,592 |
| 32 Roseobacter genomes | | | | |
| | 39 | 3.63E+08 | 13,654 | 13,501 |
| | 859 | 7.15E+08 | 824,054 | 30,836 |
| | 5,984 | 4.95E+08 | 280,085 | 337,320 |
| 9 fungal genomes | | | | |
| | 9 | 5.88E+06 | 2,097 | 4,252 |
| | 2,590 | 1.18E+08 | 718,176 | 89,555 |
| | 31,589 | 1.33E+08 | 828,680 | 848,242 |
Note: For comparison, the table also contains the results obtained with Cactus. 'Core LCBs' is the number of core Locally Collinear Blocks (LCBs), i.e. the number of LCBs that involve all of the aligned genomes. 'Aligned pairs' is the total number of aligned pairs of positions in the alignment. 'Core col.' is the number of core columns, i.e. the number of columns in the multiple alignments that do not contain gaps. 'LCBs' is the total number of LCBs.
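The two column-based measures in the note can be made concrete with a short sketch. It assumes a single alignment block given as equal-length strings with `-` as the gap character, and it counts aligned pairs per column as the number of unordered pairs of non-gap characters; the function name `alignment_stats` is hypothetical, and this is not the evaluation code used for the table.

```python
from math import comb


def alignment_stats(rows, gap="-"):
    """Return (core_columns, aligned_pairs) for one alignment block.

    A core column contains no gap; a column with k non-gap characters
    contributes k * (k - 1) / 2 aligned pairs of positions.
    """
    if not rows:
        return 0, 0
    if any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("all alignment rows must have the same length")

    core_columns = 0
    aligned_pairs = 0
    for column in zip(*rows):
        residues = sum(1 for c in column if c != gap)
        if residues == len(rows):
            core_columns += 1
        aligned_pairs += comb(residues, 2)
    return core_columns, aligned_pairs


if __name__ == "__main__":
    block = ["ACG-T", "A-GCT", "ACGCT"]
    print(alignment_stats(block))
```

For a whole-genome alignment made of many blocks, the per-block values would simply be summed.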
Run time in minutes for three different multiple genome-alignment methods applied to the three test datasets that we used in our program evaluation
| 59 | 83 | 110 |
| 638 | 6428 | 1488 |
| 73 | 63 | 43 |
| 286 | 1099 | 63 |
| 714 | 1775 | 775 |