Abstract
We describe a multiple alignment program named MAP2 based on a generalized pairwise global alignment algorithm for handling long, different intergenic and intragenic regions in genomic sequences. The MAP2 program produces an ordered list of local multiple alignments of similar regions among sequences, where different regions between local alignments are indicated by reporting only similar regions. We propose two similarity measures for the evaluation of the performance of MAP2 and existing multiple alignment programs. Experimental results produced by MAP2 on four real sets of orthologous genomic sequences show that MAP2 rarely missed a block of transitively similar regions and that MAP2 never produced a block of regions that are not transitively similar. Experimental results by MAP2 on six simulated data sets show that MAP2 found the boundaries between similar and different regions precisely. This feature is useful for finding conserved functional elements in genomic sequences. The MAP2 program is freely available in source code form at http://bioinformatics.iastate.edu/aat/sas.html for academic use.
Year: 2005 PMID: 15640451 PMCID: PMC546147 DOI: 10.1093/nar/gki159
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. An example alignment of four sequences. Each different section is indicated by a rectangle, which consists of the start and end positions of each region in the difference section.
Figure 2. The partitions and the weakest-link percent identity of a similarity block. Each region in the block is represented by a node. For each pair of regions, the induced alignment of the two regions is represented by an edge between the nodes, with the percent identity of the alignment shown next to the edge. Each partition link is indicated by a thick edge. The complete graph is on the upper left, a maximum-weight spanning tree of the graph is on the lower right, and the remaining panels show every partition together with the edges linking the partition.
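The relationship between partitions and the spanning tree in this caption can be illustrated with a small sketch. It assumes one reading of the caption: the link of a two-way partition is the highest-identity edge crossing it, and the weakest-link percent identity is the lowest such link over all partitions. Under that assumption the value equals the smallest edge of a maximum-weight spanning tree. The function names, region labels and percent identities below are hypothetical and are not taken from MAP2.

```python
from itertools import combinations


def weakest_link_by_partitions(nodes, identity):
    """Brute force: lowest, over all two-way partitions, of the best crossing edge."""
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    # Keep `first` on side A so every partition is enumerated exactly once;
    # k < len(rest) guarantees side B is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            side_a = {first, *extra}
            side_b = [n for n in nodes if n not in side_a]
            link = max(identity[frozenset((a, b))] for a in side_a for b in side_b)
            best = link if best is None else min(best, link)
    return best


def weakest_link_by_spanning_tree(nodes, identity):
    """Prim's algorithm for a maximum-weight spanning tree; return its smallest edge."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    weakest = None
    while len(in_tree) < len(nodes):
        # Pick the heaviest edge that leaves the current tree.
        weight, new_node = max(
            ((identity[frozenset((a, b))], b)
             for a in in_tree for b in nodes if b not in in_tree),
            key=lambda edge: edge[0],
        )
        in_tree.add(new_node)
        weakest = weight if weakest is None else min(weakest, weight)
    return weakest


# Hypothetical pairwise percent identities for a block of four regions.
regions = ["r1", "r2", "r3", "r4"]
identity = {
    frozenset(("r1", "r2")): 90, frozenset(("r1", "r3")): 60,
    frozenset(("r1", "r4")): 55, frozenset(("r2", "r3")): 70,
    frozenset(("r2", "r4")): 50, frozenset(("r3", "r4")): 80,
}

print(weakest_link_by_partitions(regions, identity))     # 70
print(weakest_link_by_spanning_tree(regions, identity))  # 70
```

The brute-force version enumerates all seven partitions of four regions and is only practical for small blocks. The spanning-tree version gives the same value with far fewer comparisons.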
Average sum-of-pairs costs for sets of intersection sub-blocks of multiple alignments from five programs on the CFTR, SCL, MET and ST7 data sets
| Program | CFTR S50 | CFTR S65 | CFTR S67 | CFTR S70 | SCL S33 | SCL S36 | SCL S37 | SCL S39 | MET S223 | MET S258 | MET S283 | MET S315 | ST7 S260 | ST7 S295 | ST7 S320 | ST7 S353 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T-COFFEE | 1.574 | 1.579 | 1.601 | 1.610 | 1.560 | 1.582 | 1.579 | 1.595 | 1.545 | 1.553 | 1.555 | 1.560 | 1.509 | 1.524 | 1.528 | 1.537 |
| MAP2 | 1.595 | 1.597 | 1.619 | 1.627 | 1.583 | 1.605 | 1.602 | 1.618 | 1.565 | 1.574 | 1.574 | 1.579 | 1.531 | 1.546 | 1.550 | 1.559 |
| MLAGAN | 1.649 | 1.648 | N/A | N/A | 1.604 | 1.624 | 1.622 | N/A | 1.601 | 1.610 | N/A | N/A | 1.573 | 1.586 | 1.590 | N/A |
| CHAOS/DIALIGN | 1.698 | 1.702 | 1.722 | N/A | 1.665 | 1.687 | N/A | N/A | 1.686 | 1.693 | 1.696 | N/A | 1.631 | 1.645 | N/A | N/A |
| MAVID | 1.762 | N/A | N/A | N/A | 1.810 | N/A | N/A | N/A | 1.719 | N/A | N/A | N/A | 1.701 | N/A | N/A | N/A |
a. The name S50 means that a set of 50 intersection sub-blocks was selected from every multiple alignment. The number 1.649 on row MLAGAN and column S50 is the average sum-of-pairs cost of 50 intersection sub-blocks of blocks of the MLAGAN alignment that have an intersection with blocks of the MAP2 alignment. The mark N/A on row MLAGAN and column S70 means that the MLAGAN alignment does not contain 70 blocks that have an intersection with blocks of the MAP2 alignment.
b. The set of intersection sub-blocks for T-COFFEE was generated from the corresponding set of blocks for MAP2 by running T-COFFEE, once for each MAP2 block, on the regions of the MAP2 block.
Total length (bp) of MAP2 blocks and additional blocks from each of three existing programs on the CFTR, SCL, MET and ST7 data sets
| Program | CFTR | SCL | MET | ST7 |
|---|---|---|---|---|
| MAP2 blocks | 24 825 | 16 488 | 159 893 | 152 023 |
| MLAGAN (a-blocks) | 38 583 | 43 399 | 147 194 | 221 645 |
| CHAOS/DIALIGN (a-blocks) | 30 337 | 38 595 | 134 125 | 170 338 |
| MAVID (a-blocks) | 19 950 | 27 977 | 74 756 | 120 301 |
a. The notation ‘a-blocks’ represents additional blocks, which have no intersection with any MAP2 block.
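The selection of additional blocks described in this footnote can be sketched as a simple filter. The text here does not define when two blocks intersect, so the sketch assumes a rule: two blocks intersect if their regions overlap on at least one sequence they share. The block representation (a dictionary mapping a sequence name to an inclusive start and end position), the function names and the coordinates are all hypothetical.

```python
def regions_overlap(a, b):
    """True if the inclusive intervals a = (start, end) and b = (start, end) share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def blocks_intersect(block, other):
    """Assumed rule: blocks intersect if their regions overlap on any shared sequence."""
    shared = block.keys() & other.keys()
    return any(regions_overlap(block[seq], other[seq]) for seq in shared)


def additional_blocks(candidate_blocks, map2_blocks):
    """Keep the candidate blocks that intersect none of the MAP2 blocks."""
    return [
        block for block in candidate_blocks
        if not any(blocks_intersect(block, m) for m in map2_blocks)
    ]


# Hypothetical coordinates, for illustration only.
map2_blocks = [{"human": (100, 200), "mouse": (150, 250)}]
other_program_blocks = [
    {"human": (180, 260), "mouse": (400, 480)},  # overlaps the MAP2 block on "human"
    {"human": (300, 380), "mouse": (300, 380)},  # overlaps no MAP2 block
]

print(additional_blocks(other_program_blocks, map2_blocks))
```

A stricter rule, such as requiring overlap on every shared sequence, would change only `blocks_intersect`.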
Figure 3. The distribution of the weakest-link percent identities of MAP2 blocks and additional blocks from MLAGAN, CHAOS/DIALIGN and MAVID. A block from the three existing programs is an additional block if the block is not covered by any of the MAP2 blocks. (A) The distribution for the CFTR data set. (B) The distribution for the SCL data set. (C) The distribution for the MET data set. (D) The distribution for the ST7 data set. (E) The distribution for the simulated data sets.
Figure 4. The distribution of the distances between the exon boundaries and block boundaries on the simulated data sets for MAP2, MLAGAN, MAVID and CHAOS/DIALIGN. Exponential scales are used for both directions.
The actual or estimated running times (in minutes) of the programs
| Data set | MAP2 | MLAGAN | CHAOS/DIALIGN | MAVID |
|---|---|---|---|---|
| CFTR | 298.45 | 2.68 | 81.47 | ∼6 |
| SCL | 60.2 | 0.65 | 16.5 | ∼2 |
| MET | 2337.07 | 21.92 | 526.93 | ∼12 |
| ST7 | 2789.78 | 22.75 | 648.83 | ∼12 |