Pierre M Durand, Scott Hazelhurst, Theresa L Coetzer.
Abstract
BACKGROUND: Sequence alignments form part of many investigations in molecular biology, including the determination of phylogenetic relationships, the prediction of protein structure and function, and the measurement of evolutionary rates. However, to obtain meaningful results, a significant degree of sequence similarity is required to ensure that the alignments are accurate and the inferences correct. Limitations arise when sequence similarity is low, which is particularly problematic when working with fast-evolving genes, evolutionarily distant taxa, genomes with nucleotide biases, and cases of convergent evolution.
Year: 2010 PMID: 20334658 PMCID: PMC2851608 DOI: 10.1186/1471-2105-11-151
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
FIRE scores
| Set | Data sets aligned | Range of ω MLEs at codon sites (#) | FIRE score |
|---|---|---|---|
| 1 | *metazoan MYB1 and protozoan MYB1 | ω ≤ 0.2 | 0.93 |
| 2 | *metazoan MYB1 and metazoan MYB2 | ω ≤ 0.3 | 0.94 |
| 3 | protozoan MYB1 and metazoan MYB2 | ω ≤ 0.3 | 0.99 |
| 4 | *metazoan GK and protozoan GK | ω ≤ 1.3 | 0.62 (+) |
| 5 | €κ light chain VR and κ light chain VR | ω ≤ 7.0 | 0.66 |
| 6 | *κ light chain VR and λ light chain VR | ω ≤ 8.2 | 0.65 |
| 7 | *metazoan MYB1 and metazoan p53 | ω ≤ 1.3 | 0.45 |
| 8 | metazoan MYB1 and metazoan GK | ω ≤ 1.3 | 0.09 |
| 9 | κ light chain VR and metazoan p53 | ω ≤ 7.0 | 0.29 |
| 10 | *metazoan p53 and λ light chain VR | ω ≤ 8.2 | 0.32 |
The results of 10 FIRE alignments of the ω MLEs derived from two sequence sets are shown. The range of ω MLEs at codon sites (#) includes values for both data sets and was taken from model M3 results. FIRE plots and alignments for sets marked with an asterisk (*) are provided in Figure 1 and additional file 3, respectively. The two κ data sets (labeled €) represent different κ sequences. The metazoan and protozoan GK data sets differed by >100 codons and therefore produced a relatively low FIRE alignment score (+). DNA-binding domains were used for MYB and p53 alignments. Sets 7-10 are negative controls. GK = glycerol kinase; VR = variable region.
Figure 1. FIRE plots. Plots represent the pairwise alignment of ω MLEs at codon sites with FIRE, recorded as a percent similarity between the two values. Corresponding FIRE scores and alignments are in Table 1 and additional file 3, respectively. A sliding window of 16 codons was used and the percent similarity is the average over the window. (A) conserved orthologous metazoan and protozoan MYB1 DBDs; (B) conserved paralogous metazoan MYB1 and MYB2 DBDs; (C) conserved metazoan and protozoan GK; (D) κ and λ light chain antibodies; (E) metazoan MYB1 and p53 DBDs; and (F) p53 DBD and κ light chain antibody. The sequence sets used in plots E and F have no functional similarity and represent negative controls. The 60% similarity cut-off value is indicated by a solid line. DBD = DNA-binding domain; GK = glycerol kinase.
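The windowed averaging described in this caption can be sketched in a few lines of Python. This is only an illustration, not the FIRE implementation: the caption does not say how the percent similarity between two ω values is computed, so the `percent_similarity` function below (the ratio of the smaller to the larger value) is a hypothetical stand-in, and the function names are invented for this example. Only the window size of 16 codons and the 60% cut-off come from the caption.

```python
def percent_similarity(a, b):
    """Hypothetical similarity between two non-negative omega values.

    ASSUMPTION: defined here as 100 * min/max; the source does not give
    the formula FIRE actually uses. Two zeros are treated as identical.
    """
    larger = max(a, b)
    if larger == 0:
        return 100.0
    return 100.0 * min(a, b) / larger


def windowed_similarity(omega_x, omega_y, window=16):
    """Average per-site percent similarity over a sliding window.

    omega_x and omega_y are equal-length lists of omega estimates for
    codon sites that have already been paired. Returns one value per
    window position (empty if there are fewer sites than the window).
    """
    if len(omega_x) != len(omega_y):
        raise ValueError("omega series must have the same length")
    sims = [percent_similarity(a, b) for a, b in zip(omega_x, omega_y)]
    return [
        sum(sims[i:i + window]) / window
        for i in range(len(sims) - window + 1)
    ]


def above_cutoff(window_values, cutoff=60.0):
    """Flag window positions whose average similarity meets the cut-off."""
    return [value >= cutoff for value in window_values]


# Toy example with a window of 2 instead of 16, for readability.
profile = windowed_similarity([1.0, 2.0, 0.0], [2.0, 2.0, 0.0], window=2)
flags = above_cutoff(profile)
```

In the toy example the per-site similarities are 50, 100 and 100, so the two window averages are 75 and 100 and both meet the 60% cut-off.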
Figure 2. FIRE, T-Coffee, ClustalW and MAFFT MSAs. The alignments generated by (A) FATCAT, (B) FIRE, (C) T-Coffee, (D) ClustalW and (E) MAFFT algorithms for kappa and lambda antibody variable regions (data set 6 in Table 1) are displayed. Only sequences corresponding to the two structure files in the FATCAT alignment and the representative sequences from the two clades aligned by FIRE are shown for each of the other three MSAs. Using the FATCAT alignment as an independent standard-of-truth reference, correctly aligned residue pairs in the other four MSAs were identified (shaded regions). Overall, T-Coffee and MAFFT produced the most accurate alignments. However, FIRE performed better than ClustalW, demonstrating the viability of using an evolutionary rates-based approach to sequence analysis when sequence similarity is low. In addition, the short stretches of conserved amino acids (indicated by +) inflate the performances of the three homology-based methods relative to FIRE (see text for discussion).
FIRE, T-Coffee, ClustalW and MAFFT performances
| Set | FATCAT | FIRE performance | T-Coffee performance | ClustalW performance | MAFFT performance |
|---|---|---|---|---|---|
| 1 | 54 | 0.87 | 1.00 | 0.99 | 1.00 |
| 2 | 94 | 0.57 | 0.96 | 0.84 | 0.97 |
| 3 | 137 | 0.83 | 0.97 | 0.83 | 0.96 |
| 4 | - | 0.69 | 1.00 | 0.68 | 0.94 |
| 5 | 103 | 0.83 | 0.98 | 0.71 | 0.88 |
| 6 | 108 | 0.87 | 0.97 | 0.73 | 0.87 |
Performances of the FIRE, T-Coffee, ClustalW and MAFFT algorithms were measured by determining the proportion of correctly aligned residue pairs using FATCAT and DALI structure-based alignments as a reference. FIRE is independent of homology and performed better than ClustalW for data sets 5 and 6 (antibody variable regions), illustrating the value of using this approach when sequence similarities are low. This independence from residues in the sequence may also lead to relatively poor FIRE performance when sequence similarity is high, for example set 2. T-Coffee and MAFFT performed best overall (although see text for further discussion). The PDB structure files included in FATCAT, DALI and T-Coffee algorithms are set 1: 2DIM, 2K9N; set 2: 2DIM, 2DIN; set 3: 2YUM, 2K9N; set 4: 1BO5; set 5: 5LVE, 1QP1; set 6: 1LVE, 1NC4. *Due to a lack of structural data for set 4, the FUGUE threading algorithm [23] was used to generate a reference structure alignment from the E. histolytica sequence (XM_650121.1) using E. coli glycerol kinase (PDB ID: 1BO5) as a template. For all alignments, FATCAT and DALI produced similar results and only the FATCAT data are shown.
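As a rough illustration of the performance measure described above, the following Python sketch computes the proportion of residue pairs in a reference pairwise alignment that a test alignment reproduces. It makes several assumptions that are not taken from the source: alignments are represented as two equal-length gapped strings with `-` as the gap character, the denominator is the number of aligned pairs in the reference, and the function names are hypothetical. It is not the scoring code used in the study.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (index_in_a, index_in_b) residue pairs that
    share a column in a pairwise alignment given as two gapped strings."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    pairs = set()
    i = j = 0  # ungapped residue positions in each sequence
    for char_a, char_b in zip(row_a, row_b):
        if char_a != gap and char_b != gap:
            pairs.add((i, j))
        if char_a != gap:
            i += 1
        if char_b != gap:
            j += 1
    return pairs


def pair_accuracy(test, reference):
    """Proportion of reference residue pairs also present in the test.

    ASSUMPTION: the score is normalised by the number of aligned pairs
    in the reference (structure-based) alignment.
    """
    reference_pairs = aligned_pairs(*reference)
    if not reference_pairs:
        raise ValueError("reference alignment contains no aligned pairs")
    test_pairs = aligned_pairs(*test)
    return len(reference_pairs & test_pairs) / len(reference_pairs)


# Toy example: the test alignment misplaces one of four reference pairs.
reference = ("ACDE", "ACDE")
test = ("AC-DE", "ACD-E")
score = pair_accuracy(test, reference)
```

In this toy case the test alignment recovers three of the four reference pairs, giving a score of 0.75. A score of 1.00, as reported for some of the methods in the table, would mean every reference pair was reproduced.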