Abstract
BACKGROUND: Alignments of homologous DNA sequences are crucial for comparative genomics and phylogenetic analysis. However, multiple alignment represents a computationally difficult problem. For protein-coding DNA sequences, it is more advantageous in terms of both speed and accuracy to align the amino-acid sequences specified by the DNA sequences rather than the DNA sequences themselves. Many implementations making use of this concept of "translated alignments" are incomplete in the sense that they require the user to manually translate the DNA sequences and to perform the amino-acid alignment. As such, they are not well suited to large-scale automated alignments of large and/or numerous DNA data sets.
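The idea of a translated alignment can be illustrated with a minimal Python sketch. It is not the transAlign implementation: the function names `translate` and `back_translate` are hypothetical, the standard genetic code is assumed, the sequences are assumed to be in frame, and the amino-acid alignment itself (done by an external aligner in practice) is supplied by hand here.

```python
# Standard genetic code in the conventional TCAG ordering (assumed here;
# other genetic codes would need a different amino-acid string).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA sequence; unrecognised codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def back_translate(aligned_protein, dna):
    """Thread the gaps of an aligned protein back onto its coding DNA.

    Each gap character in the protein becomes a three-base gap in the DNA,
    and each residue consumes the next codon of the original sequence.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    out = []
    k = 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)


# Two toy sequences; the protein alignment below is written by hand.
dna_a = "ATGGCTAAA"  # translates to MAK
dna_b = "ATGAAA"     # translates to MK
aligned_proteins = {"a": "MAK", "b": "M-K"}

print(back_translate(aligned_proteins["a"], dna_a))  # ATGGCTAAA
print(back_translate(aligned_proteins["b"], dna_b))  # ATG---AAA
```

Because gaps are only ever inserted as whole codons, the reading frame of every sequence is preserved in the resulting DNA alignment.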
Year: 2005 PMID: 15969769 PMCID: PMC1175081 DOI: 10.1186/1471-2105-6-156
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Theoretical gain in speed from performing a translated alignment. The figure shows that there is always a performance advantage in aligning any given proportion of the protein-coding DNA sequences in a data set via their amino-acid translations, with the remaining DNA sequences subsequently profile-aligned to them. The curve as shown is based on the assumption that the translated alignment is 9x faster, on average, than the respective DNA alignment; other values produce nearly identical curves of different scales.
Benchmark data for the comparative performance of a translated alignment. Six mammalian protein-coding genes were aligned either as DNA (using ClustalW; default parameters) or via their translations as amino acids (using transAlign; genetic code specified, otherwise default parameters). All analyses used ClustalW v1.83 on an 800-MHz dual-processor Macintosh G4 running OS 10.3.5. The alignment score is taken relative to the corresponding sequence from a manually aligned data set and is the complement of the Hamming distance, i.e., the number of matching positions (matching bases score +1, mismatches score 0). The alignment score was calculated for each individual sequence and then averaged over all sequences in each data set. Gene symbols follow the HUGO Gene Nomenclature Committee (HGNC; [21]).
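The scoring rule described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and two details the caption does not specify are assumptions here, namely that test and reference rows have equal length and that positions where both rows carry a gap do not count as matching bases.

```python
GAP = "-"


def alignment_score(test_seq, reference_seq):
    """Number of positions where the test row matches the reference row.

    Matching bases score +1 and mismatches score 0, as in the caption.
    Assumption: a gap aligned with a gap is not counted as a matching base.
    """
    if len(test_seq) != len(reference_seq):
        raise ValueError("rows must have the same aligned length")
    return sum(1 for a, b in zip(test_seq, reference_seq) if a == b and a != GAP)


def average_alignment_score(test_alignment, reference_alignment):
    """Average the per-sequence scores over all sequences in a data set.

    Both arguments map a sequence name to its aligned row.
    """
    scores = [
        alignment_score(test_alignment[name], reference_alignment[name])
        for name in reference_alignment
    ]
    return sum(scores) / len(scores)


# Toy data: a hand-made reference and an automated alignment to compare.
reference = {"s1": "ATGGCA", "s2": "ATGGCA"}
automated = {"s1": "ATG-CA", "s2": "ATGGCA"}
print(average_alignment_score(automated, reference))  # 5.5
```

Since the score counts matches rather than differences, higher values indicate closer agreement with the manual alignment, and the maximum for a sequence is its number of bases.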
| Data set | No. of sequences | Unaligned sequence length | DNA alignment: time (sec) | DNA alignment: average alignment score | Amino-acid alignment: amino-acid alignment time (sec) | Amino-acid alignment: DNA profile alignment time (sec) | Amino-acid alignment: transAlign processing time (sec) | Amino-acid alignment: total time (sec) | Amino-acid alignment: average alignment score |
| | 100 | 256-768 | 475 | 579.28 | 52 | 14 | 0 | 66 | 774.61 |
| | 2484 | 388-1200 | 1216963 | 437.54 | 127309 | 13823 | 34 | 141166 | 860.75 |
| | 128 | 543-3141 | 2804 | 2346.46 | 307 | n/a | 3 | 310 | 2345.13 |
| | 196 | 326-1584 | 6492 | 1583.85 | 733 | n/a | 3 | 736 | 1583.95 |
| | 484 | 627-1292 | 45122 | 598.26 | 4004 | 10636 | 9 | 14649 | 579.71 |
| | 182 | 711-1310 | 8384 | 862.06 | 921 | n/a | 4 | 925 | 1002.16 |
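The timing columns can be turned into speed-up ratios with simple arithmetic, sketched below in Python. The numbers are copied from the table; the variable and function names are hypothetical, and because the gene symbols are missing from the rows, each data set is keyed by its number of sequences.

```python
# (number of sequences) -> (DNA alignment time,
#                           amino-acid alignment step time,
#                           total translated-alignment time), all in seconds.
timings = {
    100: (475, 52, 66),
    2484: (1216963, 127309, 141166),
    128: (2804, 307, 310),
    196: (6492, 733, 736),
    484: (45122, 4004, 14649),
    182: (8384, 921, 925),
}


def speedups(dna_time, aa_step_time, total_time):
    """Return (speed-up of the amino-acid step alone, overall speed-up)."""
    return dna_time / aa_step_time, dna_time / total_time


ratios = {n: speedups(*times) for n, times in timings.items()}

for n, (aa_only, overall) in ratios.items():
    print(f"{n:>5} sequences: amino-acid step {aa_only:5.2f}x, overall {overall:5.2f}x")
```

On these figures the overall translated alignment is faster than the DNA alignment for every data set. The gain is smallest for the 484-sequence set, where the DNA profile alignment step accounts for most of the total time.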