Michael Brudno, Michael Chapman, Berthold Göttgens, Serafim Batzoglou, Burkhard Morgenstern.
Abstract
BACKGROUND: Genomic sequence alignment is a powerful method for genome analysis and annotation, as alignments are routinely used to identify functional sites such as genes or regulatory elements. With a growing number of partially or completely sequenced genomes, multiple alignment is playing an increasingly important role in these studies. In recent years, various tools for pair-wise and multiple genomic alignment have been proposed. Some of them are extremely fast, but often efficiency is achieved at the expense of sensitivity. One way of combining speed and sensitivity is to use an anchored-alignment approach. In a first step, a fast search program identifies a chain of strong local sequence similarities. In a second step, regions between these anchor points are aligned using a slower but more accurate method.
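The two-step idea can be illustrated with a minimal sketch. It assumes the fast first step has already produced a colinear chain of anchors, each given as a hypothetical `(start_a, start_b, length)` triple of 0-based positions, and it only derives the regions that the slower second step would still have to align. The function names and the anchor representation are illustrative assumptions, not the interface of CHAOS or DIALIGN.

```python
def regions_between_anchors(anchors, len_a, len_b):
    """Return the unaligned regions left between a chain of anchors.

    anchors: list of (start_a, start_b, length) triples (assumed format),
             describing gap-free matches between sequences A and B.
    Each region is ((a_from, a_to), (b_from, b_to)), half-open intervals.
    """
    regions = []
    pos_a, pos_b = 0, 0  # first position not yet covered by an anchor
    for start_a, start_b, length in sorted(anchors):
        if start_a < pos_a or start_b < pos_b:
            raise ValueError("anchors overlap or are not colinear")
        if start_a > pos_a or start_b > pos_b:
            regions.append(((pos_a, start_a), (pos_b, start_b)))
        pos_a, pos_b = start_a + length, start_b + length
    if pos_a < len_a or pos_b < len_b:
        regions.append(((pos_a, len_a), (pos_b, len_b)))
    return regions


def search_space(regions):
    """Total number of matrix cells the second step has to consider."""
    return sum((a_to - a_from) * (b_to - b_from)
               for (a_from, a_to), (b_from, b_to) in regions)


if __name__ == "__main__":
    anchors = [(10, 12, 5), (30, 40, 10)]  # made-up illustrative anchors
    regions = regions_between_anchors(anchors, len_a=50, len_b=60)
    print(regions)
    print(search_space(regions), "cells instead of", 50 * 60)
```

The more anchors the first step supplies, the smaller the remaining regions become, which is where the speed-up of an anchored approach comes from.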
Year: 2003 PMID: 14693042 PMCID: PMC521198 DOI: 10.1186/1471-2105-4-66
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Matrix representation of sequence alignment. The seed shown can be chained to any seed that lies inside the search box. All seeds located less than distance bp from the current location are stored in a skip list, in which a range query finds the seeds located within a gap cutoff of the diagonal on which the current seed lies. Seeds located in the grey areas are not available for chaining, which makes the algorithm independent of sequence order.
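The search-box lookup described in the Figure 1 caption can be sketched as follows. This is a simplified illustration, not the CHAOS implementation: it assumes seeds are hypothetical `(x, y, score)` triples, scores a chain as the plain sum of its seed scores, and uses a sorted Python list with `bisect` as a stand-in for the skip list. The grey-area rule from the figure is approximated by requiring a predecessor to lie strictly before the current seed in both sequences.

```python
from bisect import bisect_left, bisect_right, insort
from collections import deque


def chain_seeds(seeds, distance, gap):
    """Chain seeds (x, y, score) and return (best_score, best_chain).

    A seed may follow an earlier seed that lies less than `distance` bp
    back in x and whose diagonal (x - y) differs by at most `gap`.
    """
    if not seeds:
        return 0, []
    order = sorted(range(len(seeds)), key=lambda i: seeds[i][:2])
    best = [0] * len(seeds)     # best chain score ending at each seed
    prev = [None] * len(seeds)  # predecessor index for traceback
    active = []                 # (diagonal, index), kept sorted by diagonal
    window = deque()            # active indices in order of increasing x

    for i in order:
        x, y, score = seeds[i]
        # Drop seeds that are no longer within `distance` of the current x.
        while window and x - seeds[window[0]][0] >= distance:
            j = window.popleft()
            active.remove((seeds[j][0] - seeds[j][1], j))
        # Range query: all active seeds within `gap` of the current diagonal.
        diag = x - y
        lo = bisect_left(active, (diag - gap, -1))
        hi = bisect_right(active, (diag + gap, len(seeds)))
        best[i] = score
        for _, j in active[lo:hi]:
            xj, yj, _ = seeds[j]
            if xj < x and yj < y and best[j] + score > best[i]:
                best[i] = best[j] + score
                prev[i] = j
        insort(active, (diag, i))
        window.append(i)

    # Trace back from the highest-scoring chain end.
    end = max(range(len(seeds)), key=best.__getitem__)
    chain = []
    while end is not None:
        chain.append(seeds[end])
        end = prev[end]
    return max(best), chain[::-1]


if __name__ == "__main__":
    demo = [(10, 12, 5), (20, 23, 5), (30, 31, 5), (25, 200, 5)]
    print(chain_seeds(demo, distance=50, gap=5))
```

A skip list supports the same range query with logarithmic-time insertion and deletion; the sorted list is used here only because it is available in the standard library.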
Total CPU time and alignment quality for DIALIGN (D) and DIALIGN anchored with CHAOS (C+D) applied to a set of 42 pairs of genomic sequences from human and mouse [12]. CHAOS was run with varying cutoff parameters. Lower cutoff values for CHAOS produced higher numbers of anchor points resulting in a decreased search space for the final DIALIGN alignment procedure thus leading to improved running time but slightly decreased alignment quality. The average number of anchor points per kilobase is shown (anc./kb). Score is the total numerical score of all produced DIALIGN alignments, i.e. the sum of the scores of the segment pairs in the alignments. As a rough measure of the biological quality of the produced alignments, we compared local sequence similarities identified by DIALIGN and CHAOS to known protein-coding regions. Here, Sn, Sp and AC are sensitivity, specificity and approximate correlation, respectively. For the D and C+D results, DIALIGN was evaluated by comparing all segment pairs contained in the alignment to annotated exons.
| program | cutoff | anc./kb | CPU | %CPU | score | %score | Sn | Sp | AC |
| D | | | 179,001 | 100.0 | 54,214 | 100.0 | 83 | 40 | 57 |
| C+D | 35 | 1.4 | 14,334 | 8.0 | 53,839 | 99.3 | 83 | 40 | 57 |
| C+D | 30 | 1.7 | 11,717 | 6.5 | 53,820 | 99.2 | 83 | 40 | 57 |
| C+D | 25 | 2.1 | 11,485 | 6.4 | 53,654 | 98.9 | 83 | 40 | 57 |
| C+D | 20 | 2.8 | 8,964 | 5.0 | 53,642 | 98.9 | 83 | 40 | 57 |
| C+D | 15 | 4.2 | 7,404 | 4.1 | 53,208 | 98.1 | 82 | 41 | 57 |
| C+D | 10 | 6.5 | 6,696 | 3.7 | 52,684 | 97.1 | 82 | 41 | 57 |
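The table caption does not spell out how Sn, Sp and AC are computed. The sketch below assumes the definitions commonly used in gene-prediction evaluation (Burset and Guigó), where Sp is the fraction of predicted positions that are truly coding and AC is a rescaled average of four conditional probabilities. The counts `tp`, `fp`, `tn` and `fn` are hypothetical nucleotide-level counts and are assumed to give non-zero denominators.

```python
def accuracy_measures(tp, fp, tn, fn):
    """Return (Sn, Sp, AC) from nucleotide-level counts.

    Assumed definitions (not stated in the source):
      Sn = TP / (TP + FN)
      Sp = TP / (TP + FP)
      AC = 2 * (ACP - 0.5), with ACP the mean of
           TP/(TP+FN), TP/(TP+FP), TN/(TN+FP), TN/(TN+FN)
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    acp = (sn + sp + tn / (tn + fp) + tn / (tn + fn)) / 4
    ac = 2 * (acp - 0.5)
    return sn, sp, ac


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    sn, sp, ac = accuracy_measures(tp=40, fp=60, tn=890, fn=10)
    print(f"Sn={sn:.2f} Sp={sp:.2f} AC={ac:.2f}")
```

Under these definitions, AC ranges from -1 to 1 and equals 0 when predictions are no better than chance.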
Figure 2. CHAOS-DIALIGN correctly aligns the SCL promoter and a conserved non-coding sequence in exon 1. The alignment was extracted from the CHAOS-DIALIGN global alignment of SCL sequences from human, mouse, chicken, zebrafish, and pufferfish. Consensus binding motifs are labelled. All except YY1 have been previously demonstrated to be essential for the appropriate pattern or level of SCL expression. The factors binding conserved sequence (CS) 1 and 2 are unknown. Shading of bases is at (grey) and (black) conservation.
Figure 3. Relative improvement in program running time for 42 pairs of genomic sequences of different lengths from human and mouse. Each point represents one sequence pair. The x-axis is the mean sequence length of each sequence pair, and the y-axis is the running time of the anchored-alignment procedure relative to that of the non-anchored procedure.