
Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points.

Chris-André Leimeister¹, Thomas Dencker¹, Burkhard Morgenstern¹,².

Abstract

Motivation: Most methods for pairwise and multiple genome alignment use fast local homology search tools to identify anchor points, i.e. high-scoring local alignments of the input sequences. Sequence segments between those anchor points are then aligned with slower, more sensitive methods. Finding suitable anchor points is therefore crucial for genome sequence comparison; speed and sensitivity of genome alignment depend on the underlying anchoring methods.
Results: In this article, we use filtered spaced word matches to generate anchor points for genome alignment. For a given binary pattern representing match and don't-care positions, we first search for spaced-word matches, i.e. ungapped local pairwise alignments with matching nucleotides at the match positions of the pattern and possible mismatches at the don't-care positions. Those spaced-word matches that have similarity scores above some threshold value are then extended using a standard X-drop algorithm; the resulting local alignments are used as anchor points. To evaluate this approach, we used the popular multiple-genome-alignment pipeline Mugsy and replaced the exact word matches that Mugsy uses as anchor points with our spaced-word-based anchor points. For closely related genome sequences, the two anchoring procedures lead to multiple alignments of similar quality. For distantly related genomes, however, alignments calculated with our filtered-spaced-word matches are superior to alignments produced with the original Mugsy program where exact word matches are used to find anchor points.
Availability and implementation: http://spacedanchor.gobics.de.
Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 29992260 PMCID: PMC6330006 DOI: 10.1093/bioinformatics/bty592

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The most fundamental task in biological sequence analysis is to align two or several nucleic-acid or protein sequences—either globally, over their entire length, or locally, by restricting the alignment to a single region of homology. Standard approaches to global alignment assume that the input sequences derived from a common ancestor, and that evolutionary events are limited to substitutions and small insertions and deletions. In this case, sequence homologies can be represented by global sequence alignments, that is, by inserting gap characters into the sequences such that evolutionarily related sequence positions are arranged on top of each other. Under most scoring schemes, calculating an optimal alignment of two sequences takes time proportional to the product of their lengths and is therefore limited to rather short sequences (Durbin ; Gotoh, 1982; Morgenstern, 2002; Needleman and Wunsch, 1970; Smith and Waterman, 1981). With the rapidly increasing number of partially or fully sequenced genomes, alignment of genomic sequences has become an important field of research in bioinformatics, see Earl for a recent review and evaluation of some of the most popular approaches. Here, the first challenge is the sheer size of the input sequences which makes it impossible to use traditional algorithms with quadratic run time. A second challenge is the fact that related genomes often share multiple local homologies, interrupted by non-conserved parts of the sequences where no significant similarities can be detected. This means that neither global alignment methods (Needleman and Wunsch, 1970) nor strictly local methods (Altschul ; Smith and Waterman, 1981) are appropriate to represent the homologies between entire genomes. Finally, homologies do not generally occur in the same relative order in different genomes, because of duplications and large-scale genome rearrangements. 
Since it is not possible, in general, to represent homologies among genomes in one single alignment, advanced genome aligners return alignments of so-called Locally Collinear Blocks, i.e. blocks of segments of the input sequences where orthologous genes appear in the same linear order. Since the late 1990s, efforts have been made to address the above issues, and many approaches to genome-sequence alignment have been published. One of the first multiple-alignment programs that could be applied to genomic sequences was DIALIGN (Morgenstern , 2002). This program composes multiple alignments from chains of local pairwise alignments, and it does not penalize gaps; it is therefore able to align sequences where local homologies are separated by non-homologous regions. The program was initially not designed for large genomic sequences, though, and it is limited to sequences up to around 10 kb. Moreover, DIALIGN is not able to deal with duplications, rearrangements or homologies on different strands of the DNA double helix. To align longer sequences, most programs for genomic alignment rely on some sort of anchoring (Huang ; Morgenstern ). In a first step, they use a fast local alignment method to identify high-scoring local homologies, so-called anchor points. Next, chains of such local alignments are calculated and, finally, sequence segments between the selected anchor points are aligned with a slower but more sensitive alignment method. For multiple sequence sets, either pairwise or multiple local alignments can be used as anchor points. A pioneering tool to find anchor points for genomic alignment is MUMmer (Delcher ); the current version of the program is considered the state of the art in alignment anchoring (Kurtz ). MUMmer uses maximal unique matches as pairwise anchor points. The genome aligner MGA, by contrast, uses maximal exact matches involving all input sequences (Höhl ). 
Both MUMmer and MGA use suffix trees (Kurtz, 1999) and related data structures to rapidly identify the pairwise or multiple word matches. MUMmer and MGA can rapidly align entire bacterial genomes; MUMmer was also used in the A. thaliana genome project (The Arabidopsis Genome Initiative, 2000). However, since the number of exact word matches decreases with increasing evolutionary distances, these approaches are most useful if closely related genomes are to be compared, such as different strains of E. coli. Other approaches to genome alignment are OWEN (Ogurtsov ), AVID (Bray ), MAVID (Bray and Pachter, 2003), LAGAN and Multi-LAGAN (Brudno ), CHAOS/DIALIGN (Brudno ), the VISTA genome pipeline (Dubchak ), TBA (Blanchette ) and Mauve (Darling ), see Dewey and Pachter (2006) and Batzoglou (2005) for reviews. All of these methods use anchor points, and most of them are able to deal with duplications and genome rearrangements. Some genome aligners use statistical properties of the sequences (Bradley ; Darling ); other methods are based on graphs, for example on A-Bruijn graphs (Raphael ) or on cactus graphs (Paten ). A further development of Mauve, called progressiveMauve (Darling ), uses palindromic spaced seeds (Darling ) instead of exact word matches as anchor points. Spaced seeds are used for sequence-analysis tasks such as database searching (Choi ; Ma ; Noé, 2017; Xu ), read mapping (Břinda ; David ; Langmead ; Noé ; Ounit and Lonardi, 2015), alignment-free sequence comparison (Leimeister ) or pathogen detection (Deneke ). Such pattern-based approaches are often superior to methods based on contiguous words or word matches, see for example Li . In Mauve, palindromic patterns are used to cover both DNA strands of the input sequences. Mugsy (Angiuoli and Salzberg, 2011) is a popular software pipeline for multiple genome alignment. In a first step, this program uses nucmer (Kurtz ) to construct all pairwise alignments of the input sequences. 
Nucmer, in turn, uses MUMmer to find exact unique word matches which are used as alignment anchor points. An alignment graph is constructed from these pairwise alignments using the SeqAn software (Döring ), and Locally Collinear Blocks are constructed. Finally, a multiple alignment is calculated using SeqAn::TCoffee (Rausch ). Mugsy has been designed to rapidly align closely related genomes, such as different strains of a bacterium. Here, it produces alignments of high quality. On more distantly related genomes, however, the program is often outperformed by other multiple aligners (Earl ). Finding anchor points is the most important step in whole-genome sequence alignment. Here, a trade-off between speed, sensitivity and precision has to be made. A sufficient number of anchor points is necessary to reduce the run time of the subsequent, more sensitive alignment routine. Wrongly chosen anchor points, on the other hand, can substantially deteriorate the quality of the final output alignment. They may not only lead to misalignments of non-homologous parts of the sequences but may also prevent biologically relevant, true homologies from being aligned. Also, if the number of anchor points is too large, finding optimal chains of anchor points can become computationally expensive. In this article, we apply the filtered spaced word matches (FSWM) approach (Leimeister ) to find pairwise anchor points for genomic alignment. We use a hit-and-extend approach where high-scoring spaced-word matches are used as seeds. More precisely, for a given binary pattern of length ℓ representing match and don’t-care positions, we identify spaced-word matches, i.e. pairs of length-ℓ segments from the input sequences with matching nucleotides at the match positions and possible mismatches at the don’t-care positions. For each such spaced-word match, we then calculate a similarity score, and we keep only those spaced-word matches that have a score above a certain threshold. 
These matches are then extended to gap-free alignments, similarly to BLAST (Altschul ). To evaluate the anchor points generated by our approach, we modified the Mugsy pipeline by using our anchoring procedure instead of the original anchor points in Mugsy that are based on exact word matches. For closely related input sequences, these two different anchoring procedures lead to alignments of similar quality. Our anchor points are clearly superior, however, if distantly related sequences are to be aligned, where most other alignment approaches either fail to produce meaningful alignments or require an unacceptable amount of time. Through our website at http://spacedanchor.gobics.de, we provide the modified Mugsy pipeline with our anchoring approach, as a pipeline for genome-sequence alignment that can be readily installed. In addition, we provide a stand-alone version of our software, such that software developers can integrate our anchor points into their own sequence-analysis pipelines.

2 Results

2.1 Filtered spaced word matches

For a sequence S of length L over an alphabet Σ and $0 < i \le L$, $S[i]$ denotes the ith symbol of S, and $|S|$ denotes the length of S. Throughout this article, a pattern is a word over {0, 1}. For a pattern P, a position i is called a match position if $P[i] = 1$ and a don’t-care position otherwise. The number of match positions in a pattern P is called the weight of P. For an alphabet Σ, a pattern P, and a wildcard character ‘*’ not contained in Σ, a spaced word with respect to P is a word w over $\Sigma \cup \{*\}$ of length $|P|$, such that $w[k] = *$ if and only if k is a don’t-care position, see also Leimeister and Horwege . We say that a spaced word w with respect to a pattern P occurs in a sequence S at some position i, if $i + |P| - 1 \le |S|$, and if $S[i + k - 1] = w[k]$ for all match positions k of P. For sequences S1 and S2, a pattern P, and positions i and j, we say that there is a spaced-word match between S1 and S2 at (i, j) with respect to P if the same spaced word occurs at i in S1 and at j in S2; in other words, if for all match positions k in P, one has $S_1[i + k - 1] = S_2[j + k - 1]$. For the two sequences S1 and S2 below, for example, there is a spaced-word match with respect to the pattern P = 1100101 at (5, 2): as the same spaced word ‘’ occurs at position 5 in S1 and at position 2 in S2. In a previous article, we used spaced-word matches to estimate phylogenetic distances between genomic sequences, by considering the nucleotides aligned to each other at the don’t-care positions of selected spaced-word matches (Leimeister ). To remove spurious random spaced-word matches, we applied a simple filtering procedure. Based on the following substitution matrix (Chiaromonte ) we calculated for each spaced-word match the sum of substitution scores of the nucleotide pairs aligned at the don’t-care positions, and we removed all spaced-word matches with a score below zero; compare also Brejova . A graphical representation of the spaced-word matches between two sequences shows that this procedure can clearly separate random spaced-word matches from true homologies. 
If we plot for each possible score value s the number of spaced-word matches with score equal to s, we obtain a bimodal distribution with one peak for random matches and a second peak for true homologies. We call such a plot a spaced-words histogram, see Figure 1 for an example. For simulated sequence pairs under a simple model of evolution, and with a sufficient number of don’t-care positions in the underlying pattern, both peaks are approximately normally distributed. For real-world sequences, the random peak is still normally distributed, but the ‘homologous’ peak is more complex. Even so, using a suitable cut-off value, one can easily distinguish between random matches and true homologies; for the above matrix, a cut-off of zero works well. More examples for spaced-words histograms are given in Leimeister .
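The definitions and the filtering step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, positions are 0-based (the text uses 1-based positions), sequences are assumed to be upper-case strings over A, C, G, T, and the score values are an assumption, namely the HOXD70-style nucleotide scores usually attributed to Chiaromonte et al., since the matrix itself is not reproduced in this copy of the text.

```python
from collections import defaultdict

# Assumption: HOXD70-style nucleotide substitution scores; the matrix
# referred to in the text is not reproduced here, so treat these values
# as illustrative.
_SCORES = {
    ("A", "A"): 91, ("C", "C"): 100, ("G", "G"): 100, ("T", "T"): 91,
    ("A", "C"): -114, ("A", "G"): -31, ("A", "T"): -123,
    ("C", "G"): -125, ("C", "T"): -31, ("G", "T"): -114,
}


def sub_score(a, b):
    """Symmetric lookup of the substitution score for two nucleotides."""
    return _SCORES[(a, b)] if (a, b) in _SCORES else _SCORES[(b, a)]


def spaced_words(seq, pattern):
    """Map each spaced word (symbols at the match positions) to the
    0-based start positions at which it occurs in seq."""
    match_pos = [k for k, c in enumerate(pattern) if c == "1"]
    index = defaultdict(list)
    for i in range(len(seq) - len(pattern) + 1):
        key = "".join(seq[i + k] for k in match_pos)
        index[key].append(i)
    return index


def filtered_matches(s1, s2, pattern, threshold=0):
    """Return (i, j, score) for every spaced-word match whose score,
    summed over the don't-care positions, is not below the threshold."""
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    idx1, idx2 = spaced_words(s1, pattern), spaced_words(s2, pattern)
    hits = []
    for word, starts1 in idx1.items():
        for i in starts1:
            for j in idx2.get(word, []):
                score = sum(sub_score(s1[i + k], s2[j + k]) for k in dont_care)
                if score >= threshold:
                    hits.append((i, j, score))
    return hits
```

Counting how many matches fall on each score value, before the threshold is applied, would give the data for a spaced-words histogram; a real implementation would also avoid enumerating all n × m pairs for highly repeated spaced words.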

Fig. 1.

Spaced-words histogram for a comparison of two bacterial genomes, Phaeobacter gallaeciensis 2.10 and Rhodobacterales bacterium Y4I. All possible spaced-word matches with respect to a given binary pattern P are identified, and their scores are calculated as explained in the main text. The number of spaced-word matches with a score s is plotted against s. Two peaks are visible, an approximately normally distributed peak for background spaced-word matches, and a more complex peak for spaced-word matches representing homologies. With a cut-off value of zero, background and homologous spaced-word matches can be reliably separated.

Herein, we propose to use spaced-word matches to calculate anchor points for pairwise alignment of genomic sequences. To distinguish between spaced-word matches representing true homologies and random background matches, we use the above filtering criterion. More precisely, our approach to find anchor points for genomic alignment is as follows. For given parameters ℓ and w, we first calculate a pattern P with length ℓ and weight w (i.e. with w match positions) using our recently developed software rasbhari (Hahn ). We then identify all spaced-word matches with respect to P. Based on the above substitution matrix, we calculate the score of each spaced-word match, and we discard all spaced-word matches with a score below zero, as we did in our previous article (Leimeister ). By default, our program uses only unique spaced-word matches. That is, if a spaced word w occurs n times in one sequence and m times in a second sequence, we only use the best-scoring of the n × m resulting spaced-word matches. But as an alternative, it is also possible to use all spaced-word matches with a score above zero. To find homologies even for distantly related sequences, we use patterns with a low weight; by default, we use a weight of w = 10. 
On the other hand, we use a large number of don’t-care positions, since this makes it easier to distinguish true homologies from random spaced-word matches. By default, we use a pattern length of ℓ = 110, so our patterns contain 10 match positions and 100 don’t-care positions. Next, we do gap-free extensions of the identified local similarities in both directions using a standard X-drop approach. As starting points for these extensions, we do not use the full spaced-word matches, but their midpoints. The reason for this is that, with our long patterns, even high-scoring spaced-word matches may not represent true homologies over their entire length. It often happens that part of a spaced-word match aligns homologous nucleotides, but one or both ends of the aligned segments extend into non-homologous regions. There is a high probability, however, that the midpoint of a long, high-scoring spaced-word match is located within a region of true homology. As a result, it is possible that an ‘extended’ match in our approach is shorter than the initial spaced-word match that was used to define the starting point for the X-drop extension. Also, it can happen that a spaced-word match is located within the ‘extension’ of a previously processed match. Such matches are redundant and are therefore discarded by our algorithm. Finally, we use the extended gap-free alignments as anchor points for alignment.
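The midpoint-seeded, gap-free X-drop extension described above might look roughly as follows. This is a hedged sketch rather than the program's actual code: the names `xdrop_extend`, `anchor_from_match` and `unit_score`, the 0-based coordinates, the choice of `pattern_len // 2` as the midpoint and the simple +1/-1 scoring used for the illustration are all assumptions.

```python
def unit_score(a, b):
    """Toy scoring used only for illustration: +1 match, -1 mismatch."""
    return 1 if a == b else -1


def xdrop_extend(s1, s2, i, j, x_drop, score_fn):
    """Gap-free extension in both directions around the aligned pair
    (s1[i], s2[j]). Extension in one direction stops once the running
    score has dropped more than x_drop below the best score seen so far.
    Returns (start1, start2, length, score) of the retained segment."""

    def one_direction(step):
        best = running = 0
        best_len = 0
        k = 1
        while 0 <= i + step * k < len(s1) and 0 <= j + step * k < len(s2):
            running += score_fn(s1[i + step * k], s2[j + step * k])
            if running > best:
                best, best_len = running, k
            elif best - running > x_drop:
                break
            k += 1
        return best, best_len

    right_score, right_len = one_direction(+1)
    left_score, left_len = one_direction(-1)
    total = score_fn(s1[i], s2[j]) + left_score + right_score
    return i - left_len, j - left_len, left_len + 1 + right_len, total


def anchor_from_match(s1, s2, i, j, pattern_len, x_drop, score_fn):
    """Seed the extension at the midpoint of a spaced-word match that
    starts at (i, j), instead of at its ends."""
    mid = pattern_len // 2
    return xdrop_extend(s1, s2, i + mid, j + mid, x_drop, score_fn)
```

For two 13-nt segments that agree only in their central 7 nt, for example `anchor_from_match("GGGACGTACGCCC", "TTTACGTACGAAA", 0, 0, 13, 2, unit_score)`, the extension keeps just the central block, which illustrates how an extended match can end up shorter than the spaced-word match that seeded it.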

2.2 Evaluation

To evaluate FSWM and to compare it to a state-of-the-art approach to alignment anchoring, we used the Mugsy software system. Here, we used the default version of FSWM with unique matches, i.e. for each distinct spaced word, only the highest-scoring spaced-word match is used. As mentioned above, the original Mugsy uses MUMmer to find pairwise anchor points. We replaced MUMmer in the Mugsy pipeline by our FSWM-based anchor points and evaluated the resulting multiple alignments. In addition, we compared these alignments to alignments produced by the multiple genome aligner Cactus (Paten ). Cactus is known to be one of the best existing tools for multiple genome alignment; it performed excellently in the Alignathon study (Earl ). To measure the performance of the compared methods, we used simulated genomic sequences as well as three sets of real genomes. To make MUMmer directly comparable to FSWM, we used a minimum length of 10 nt for maximum unique matches, corresponding to the default weight (sum of match positions) used in Spaced Words. Note that, by default, MUMmer uses a minimum length of 15 nt. With this default value, however, we obtained alignments of much lower quality. Cactus was run with default values.

2.2.1 Simulated genomic sequences

To simulate genomic sequences, we used the artificial life framework (ALF) developed by Dalquen . ALF generates artificial gene families along a randomly generated tree, according to a probabilistic model of evolution. During this process, evolutionary events are logged so the true MSA is known for each simulated gene family and can be used as reference to assess the quality of automatically generated alignments. We generated a series of 14 datasets, each one based on a randomly generated tree with 30 leaves, representing different species. Each dataset consists of 750 simulated gene families, evolved along the respective tree, such that exactly one gene from each family is present in each of the 30 ‘species’. Within each dataset, we used a fixed mutation rate for all gene families, but we used different mutation rates for different datasets. For all other parameters in ALF, we used the default settings. We varied the mutation rates between an average of 0.1013 substitutions per position for the first dataset to an average of 0.8349 substitutions per position for the 14th dataset. Here, the average is taken over all pairs of ‘species’ within the respective dataset. The maximal pairwise distance between all pairs of sequences within a dataset ranges from 0.1640 for the first to 1.0923 for the 14th dataset. The simulated genes have an average length of about 1500 bp, summing up to a total size of about 32 MB per dataset. For simplicity, we did not concatenate the 750 genes in one ‘species’. Instead, we applied the alignment programs that we evaluated to compare all genes from one ‘species’ to all genes from all other ‘species’ within the same dataset. Concatenating the sequences would have led to the same results. To assess the quality of the produced alignments, we calculated recall and precision values in the usual way. 
If, for one given dataset, S is the set of all positions of the 30 × 750 simulated gene sequences, we denote by $A \subset S \times S$ the set of all pairs of positions aligned to each other by the alignment that is to be evaluated, while $R \subset S \times S$ denotes the set of all pairs of positions aligned to each other in the reference alignment. Recall and precision are then defined as $\mathrm{recall} = |A \cap R| / |R|$ and $\mathrm{precision} = |A \cap R| / |A|$. The harmonic mean of recall and precision is called the balanced F-score and is often used as an overall measure of accuracy; it is thus defined as $F = 2 \cdot \mathrm{recall} \cdot \mathrm{precision} / (\mathrm{recall} + \mathrm{precision})$. To estimate these three values, we used the tool mafComparator which was also used in the Alignathon study (Earl ). Since it is prohibitive to consider all pairs of positions of the test sequences, we sampled 10 million pairs of positions for each dataset. This corresponds to the evaluation procedure used in Alignathon. For the simulated sequence sets, recall and precision values are shown in Figures 2 and 3. For datasets with smaller mutation rates, the quality of alignments obtained with FSWM and MUMmer is comparable (Fig. 4). However, if the mutation rate increases, our spaced-words approach clearly outperforms the original version of Mugsy where exact word matches are used to find anchor points. With FSWM, not only are more homologies detected than with MUMmer, but the precision of Mugsy is also slightly improved.
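As a small worked illustration of these three measures, the following sketch computes them exactly on explicit sets of aligned position pairs. It is not mafComparator, which estimates the values from sampled pairs; the function names and the representation of a position as a `(sequence_id, offset)` tuple are assumptions made for the example.

```python
def pair(p, q):
    """Order-independent representation of two aligned positions."""
    return tuple(sorted((p, q)))


def alignment_accuracy(predicted, reference):
    """Recall, precision and balanced F-score of a set of predicted
    aligned pairs against a set of reference aligned pairs."""
    predicted, reference = set(predicted), set(reference)
    shared = len(predicted & reference)
    recall = shared / len(reference) if reference else 0.0
    precision = shared / len(predicted) if predicted else 0.0
    if shared == 0:
        return recall, precision, 0.0
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score


# Four reference pairs; the evaluated alignment reports three pairs,
# two of which are correct.
reference = {pair(("a", k), ("b", k)) for k in range(4)}
predicted = {
    pair(("b", 0), ("a", 0)),
    pair(("a", 1), ("b", 1)),
    pair(("a", 2), ("b", 3)),
}
recall, precision, f_score = alignment_accuracy(predicted, reference)
```

Here recall is 2/4, precision is 2/3 and the F-score is 4/7, so a method can lose accuracy either by missing homologous pairs or by reporting wrong ones.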

Fig. 2.

Recall values for Mugsy using anchor points generated with FSWM and with MUMmer, respectively, as well as for Cactus. Test data were simulated genomic sequences generated with ALF, see main text for details. FSWM was run with the default weight w = 10, i.e. with 10 match positions in the underlying pattern, and with w = 8

Fig. 3.

Precision values for Mugsy with FSWM and MUMmer anchor points respectively, and for Cactus. Test data and parameter values as in Figure 2

Fig. 4.

F-Score values for Mugsy with FSWM and MUMmer anchor points, respectively, and for Cactus. Test data and parameter values as in Figure 2

2.2.2 Real-world genome sequences

For real-world genome families, it is usually not possible to calculate the precision of MSA programs because it is, in general, not known which sequence positions exactly are homologous to each other and which ones are not. If there are core blocks of the sequences for which biologically correct alignments are known, at least recall values can be calculated for these core blocks. For most genome sequences, however, not even such core blocks are available. To evaluate Mugsy, the authors of the program therefore used the number of core columns of the produced alignments as a criterion for alignment quality (Angiuoli and Salzberg, 2011). Here, a core column is defined as a column that does not contain gaps, i.e. a column in which nucleotides from all of the input sequences are aligned. In addition, the authors of Mugsy used the number of pairs of aligned positions of the aligned sequences as an indicator of alignment quality. In this article, we use the same criteria to evaluate multiple alignments of real-world genomes. As a first real-world example, we used a set of 29 E. coli/Shigella genomes that has been used in the original Mugsy paper, see Supplementary Material for details; these sequences have also been used to evaluate alignment-free methods (Haubold ; Morgenstern ; Yi and Jin, 2013). The total size of this dataset is about 141 MB. As a second test set, we used another prokaryotic dataset, namely a set of 32 complete Roseobacter genomes (details in the Supplementary Material); these genomes are more distantly related than the E. coli/Shigella strains. The total size of this dataset is about 135 MB. 
To test our approach on eukaryotic genomes, we used as a third test case a set of nine fungal genomes, namely Coprinopsis cinerea, Neurospora crassa, Aspergillus terreus, Aspergillus nidulans, Histoplasma capsulatum, Paracoccidioides brasiliensis, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Ustilago maydis (GenBank accession numbers are given in the Supplementary Material). The total size of this third dataset is about 253 MB. The results of Mugsy with MUMmer and FSWM, respectively, for the three real-world datasets are shown in Table 1, together with the results obtained with Cactus. In addition to the number of core columns and the number of aligned pairs of positions, the table contains the number of core Locally Collinear Blocks, i.e. the number of Locally Collinear Blocks involving all of the input sequences, and the total number of Locally Collinear Blocks returned by the alignment programs. For the E. coli/Shigella sequences, the two anchoring methods, MUMmer and FSWM, led to alignments of comparable quality when used with Mugsy; the genome sequences in this dataset are very similar to each other. For the Roseobacter and fungal genomes, however, the FSWM anchor points led to much better alignments than the default anchor points generated with MUMmer. The sequences in these sets are far more distant from each other than the sequences in the E. coli/Shigella set, so the results on these three datasets confirm our above results on simulated sequences.
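The two quality criteria used for real-world genomes, core columns and aligned pairs of positions, are simple to compute for a single alignment block. The sketch below assumes that a block is given as equal-length strings with `-` as the gap character and that every pair of non-gap symbols in a column counts as one aligned pair; both the function names and this counting convention are assumptions, not taken from Mugsy.

```python
def core_columns(rows):
    """Number of alignment columns that contain no gap character."""
    return sum(1 for column in zip(*rows) if "-" not in column)


def aligned_pairs(rows):
    """Number of pairs of positions aligned to each other: a column
    with n non-gap symbols contributes n * (n - 1) / 2 pairs."""
    total = 0
    for column in zip(*rows):
        n = sum(1 for symbol in column if symbol != "-")
        total += n * (n - 1) // 2
    return total


block = ["ACG-T", "A-GCT", "ACGCT"]
n_core = core_columns(block)
n_pairs = aligned_pairs(block)
```

For the toy block above, three of the five columns are gap-free and the columns contribute 3 + 1 + 3 + 1 + 3 = 11 aligned pairs. For a whole-genome alignment, the totals would be summed over all blocks.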

Table 1.

Evaluation of multiple alignments of 29 E. coli/Shigella genomes, 32 Roseobacter genomes and 9 fungal genomes, obtained with Mugsy, using anchor points calculated with FSWM and with MUMmer, respectively

	Core LCBs	Aligned pairs	Core col.	LCBs
29 E. coli/Shigella genomes
Mugsy + MUMmer	539	1.61E+09	2,827,115	4138
Mugsy + FSWM	664	1.63E+09	2,867,432	5906
Cactus	20,163	1.48E+09	2,663,750	56,592
32 Roseobacter genomes
Mugsy + MUMmer	39	3.63E+08	13,654	13,501
Mugsy + FSWM	859	7.15E+08	824,054	30,836
Cactus	5984	4.95E+08	280,085	337,320
9 fungal genomes
Mugsy + MUMmer	9	5.88E+06	2097	4252
Mugsy + FSWM	2590	1.18E+08	718,176	89,555
Cactus	31,589	1.33E+08	828,680	848,242

Note: As a comparison, the table contains the results obtained with Cactus. The first column contains the number of core Locally Collinear Blocks (LCBs), i.e. the number of LCBs that involve all of the aligned genomes; the second column contains the total number of aligned pairs of positions in the alignment. The third column contains the number of core columns, i.e. the number of columns in the multiple alignments that do not contain gaps, while the last column contains the total number of LCBs.

2.2.3 Program run time

Table 2 reports the program run times of Mugsy with FSWM, Mugsy with MUMmer and Cactus on the above three real-world sequence sets. In addition, the table contains the run times for FSWM and MUMmer alone. A program run of Mugsy with FSWM on a set of five mammalian sequences of length 200 Mb each from Earl took around 7 days and 5 h with k = 10, and two days with k = 12.

Table 2.

Run time in minutes for three different multiple genome-alignment methods applied to the three test datasets that we used in our program evaluation

	E. coli/Shigella	Roseobacter	fungal genomes
FSWM	59	83	110
FSWM + Mugsy	638	6428	1488
MUMmer	73	63	43
MUMmer + Mugsy	286	1099	63
Cactus	714	1775	775

3 Discussion

In this article, we proposed a novel approach to calculate anchor points for genome alignment. Finding suitable anchor points is a critical step in all methods for genome alignment, since the selected anchor points determine which regions of the sequences can be aligned to each other in the final alignment. A sufficient number of anchor points is necessary to keep the search space and run time of the main alignment procedure manageable, so sensitive methods are needed to find anchor points. Wrongly selected anchor points, on the other hand, can seriously deteriorate the quality of the final alignments, so anchoring procedures must also be highly specific. Earlier approaches to genomic alignment used exact word matches as anchor points (Delcher ; Höhl ), since such matches can be easily found using suffix trees and related indexing structures. These approaches are limited, however, to situations where closely related genomes are to be aligned, for example different strains of a bacterium. In modern approaches to database searching, spaced seeds are used to find potential sequence homologies (Buchfink ; Hauswedell ; Li ). Here, binary patterns of match and don’t care positions are used, and two sequence segments of the corresponding length are considered to match if identical residues are aligned at the match positions, while mismatches are allowed at the don’t care positions. Such pattern-based approaches are more sensitive than previous methods that relied on exact word matches. We previously proposed to apply the ‘spaced-seeds’ idea to alignment-free sequence comparison, by replacing contiguous words by so-called spaced words, i.e. by words that contain wildcard characters at certain pre-defined positions (Leimeister ). More recently, we introduced FSWM (Leimeister ) to estimate the average number of substitutions per sequence position between two genomes. 
In the latter approach, we first identify spaced-word matches using relatively long patterns with only few match positions. For the identified matching segments, we look at the nucleotides that are aligned to each other at the don’t-care positions, and we discard spaced-word matches for which the similarity at the don’t-care positions is below a threshold. Substitution frequencies are then estimated based on the aligned nucleotides at the don’t-care positions of the remaining spaced-word matches. We showed that this procedure is fast and highly sensitive, and it can reliably distinguish between true homologies and spurious sequence similarities. In the present study, we used FSWM to calculate anchor points for genomic sequence alignment. Instead of using the selected spaced-word matches directly as anchor points, we extend the identified hits in both directions, similar to the hit-and-extend approach to database searching. In terms of speed and accuracy, this approach is somewhere between exact word matching and gapped local alignment. As in our previous paper on filtered spaced words (Leimeister ), we use binary patterns with a large number of don’t-care positions. This way, the ‘homologous’ and ‘background’ peaks in the spaced-word histograms (Fig. 1) are far enough apart, since the distance between them is proportional to the number of don’t-care positions in the underlying patterns. With a large number of don’t-care positions, it is therefore easier to distinguish between homologous and background spaced-word matches. One might think that, with our long patterns, we might miss too many shorter local homologies. We do not see this as a problem, though. Our goal is not to find all local homologies between two sequences, but to output a sufficient number of anchor points to make the final alignment procedure feasible. 
Moreover, our algorithm is well able to find gap-free homologies that are shorter than the specified pattern length, as long as the sequence similarity between these homologies is strong enough. As explained above, we do not start the X-drop extension at the end positions of the identified hits, but in the middle; this way we can find spaced-word matches that cover short homologies, but reach into gapped or non-homologous sequence regions to the left and to the right. In such cases, it can happen that the ‘extended’ hits are shorter than the respective initial spaced-word matches. To evaluate these anchor points, we integrated them into the popular genome-alignment pipeline Mugsy. Test runs on simulated genome sequences show that, for closely related sequences, Mugsy produces alignments of high quality with both types of anchor points. For more distantly related sequences, however, the recall values of the program drop dramatically if anchor points are calculated with MUMmer while, with our spaced-word matches, one observes recall values close to 100% for distances up to around 0.7 substitutions per position. For real-world genomes, it is more difficult to evaluate the performance of genome aligners since there is only limited information available on which positions are homologous to each other and which ones are not. Angiuoli and Salzberg (2011) therefore used the number of aligned pairs of positions as an indicator of alignment quality, together with the size of the ‘core alignment’, i.e. the number of alignment columns that do not contain gaps. At first glance, these criteria might seem questionable; it would be trivial to maximize these values, simply by aligning sequences without internal gaps, by adding gaps only at the ends of the shorter sequences. However, as shown in Figure 3, all MSA programs in our study have high precision values, i.e. positions aligned by these programs are likely to be true homologs. 
In this situation, the number of aligned position pairs and the size of the ‘core alignment’ can be considered as a proxy for the recall of the applied methods, i.e. the proportion of homologies that are correctly aligned.
As shown in Table 2, the program run time to generate anchor points is comparable for FSWM and MUMmer. For distantly related sequence sets, however, the total run time of Mugsy is much higher with our FSWM anchoring approach than with anchor points from MUMmer. A possible explanation for the difference in run time is that FSWM is more sensitive, so a larger number of anchor points are produced. Table 1 shows that, with our FSWM, more Locally Collinear Blocks are found than with the exact word matches that are found with MUMmer, especially for distantly related sequences where exact word matching is not very sensitive. One way of reducing the program run time would be to apply a cut-off value to reduce the number of Locally Collinear Blocks that are to be aligned in the main alignment procedure. Further research efforts are necessary to balance speed and accuracy of multiple genome alignment algorithms.
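The relationship between aligned position pairs, core-alignment size, precision and recall discussed above can be made concrete with a small sketch. The helper names and the toy alignments are assumptions for illustration; this is not the evaluation code used in the study.

```python
def aligned_pairs(row_a, row_b):
    """Set of (pos_a, pos_b) residue pairs aligned in two gapped rows."""
    pairs = set()
    pos_a = pos_b = 0
    for ca, cb in zip(row_a, row_b):
        if ca != "-" and cb != "-":
            pairs.add((pos_a, pos_b))
        if ca != "-":
            pos_a += 1
        if cb != "-":
            pos_b += 1
    return pairs


def precision_recall(test_pairs, reference_pairs):
    """Precision and recall of test pairs against reference (true) pairs."""
    true_positives = len(test_pairs & reference_pairs)
    precision = true_positives / len(test_pairs) if test_pairs else 0.0
    recall = true_positives / len(reference_pairs) if reference_pairs else 0.0
    return precision, recall


def core_size(rows):
    """Number of alignment columns that contain no gap character."""
    return sum(1 for column in zip(*rows) if "-" not in column)


# Toy reference alignment versus a trivial gap-free alignment.
reference = aligned_pairs("ACGT-A", "AC-TTA")
gap_free = aligned_pairs("ACGTA", "ACTTA")
print(core_size(["ACGT-A", "AC-TTA"]), core_size(["ACGTA", "ACTTA"]))  # → 4 5
print(precision_recall(gap_free, reference))  # → (0.6, 0.75)
```

The gap-free alignment has the larger core (5 columns versus 4) but a precision of only 0.6, which is why the number of aligned pairs is a meaningful proxy for recall only when precision is known to be high.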
