Erik L L Sonnhammer, Volker Hollich.
Abstract
BACKGROUND: Distance-based methods are popular for reconstructing evolutionary trees thanks to their speed and generality. A number of methods exist for estimating distances from sequence alignments, and these often involve some sort of correction for multiple substitutions. The problem is to accurately estimate the number of true substitutions given an observed alignment. So far, the most accurate protein distance estimators have looked for the optimal matrix in a series of transition probability matrices, e.g. the Dayhoff series. The evolutionary distance between two aligned sequences is then estimated as the evolutionary distance of the optimal matrix. The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance. As a consequence, these methods are more complex to implement and computationally heavier than correction-based methods. Another problem is that the result may vary substantially depending on the evolutionary model used for the matrices. An ideal distance estimator should produce consistent and accurate distances independent of the evolutionary model used.
Year: 2005 PMID: 15857510 PMCID: PMC1131889 DOI: 10.1186/1471-2105-6-108
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Accuracy as average RMSD values for combinations of data models and estimators
| Estimator | Dayhoff testset | MV testset | JTT testset | WAG testset | Average |
| Scoredist – Dayhoff | 12.68 | 20.85 | 13.67 | 12.81 | 15.00 |
| ML – Dayhoff | 12.70 | 28.40 | 14.75 | 15.15 | 17.75 |
| ED – Dayhoff | 13.57 | 31.36 | 16.10 | 16.63 | 19.41 |
| Scoredist – MV | 19.28 | 13.15 | 16.29 | 18.73 | 16.86 |
| ML – MV | 19.96 | 13.44 | 19.36 | 19.21 | 17.99 |
| ED – MV | 15.68 | 13.35 | 13.95 | 14.75 | 14.43 |
| Scoredist – JTT | 13.67 | 17.16 | 12.89 | 13.47 | 14.30 |
| ML – JTT | 12.15 | 25.07 | 12.10 | 13.44 | 15.69 |
| ED – JTT | 12.56 | 27.71 | 12.70 | 14.37 | 16.84 |
| Jukes-Cantor | 23.92 | 16.28 | 19.88 | 22.48 | 20.64 |
| Kimura | 16.24 | 29.81 | 22.36 | 19.16 | 21.89 |
For each testset and method, the average root mean square deviation (RMSD) from the true distance was calculated for 2,000 alignment samples in the interval 1–200 PAM units. Lower RMSD values indicate higher accuracy on a single testset. The column 'average' gives the mean over the four evaluated testsets. A low value in this column indicates a robust estimator, since it measures accuracy over all four models (including "wrong" data models). Scoredist was more robust than ML: for every training set, its average RMSD was lower. The ED estimator gave good results when trained with MV, but was poor in all other cases (see Discussion for details). Scoredist, Jukes-Cantor, and Kimura distances were calculated with the Belvu alignment viewer. The Maximum Likelihood (ML) and Expected Distance (ED) estimates were produced by lapd (L. Arvestad, unpublished).
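The accuracy measure described above can be sketched in a few lines of Python. This is only an illustration: the function names `rmsd` and `robustness` are hypothetical and are not taken from Belvu or lapd, and the two dictionaries simply copy the per-testset values of the Scoredist – Dayhoff and ML – Dayhoff rows.

```python
import math


def rmsd(estimated, true):
    """Root mean square deviation between paired estimated and true distances."""
    if len(estimated) != len(true) or not estimated:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimated, true)]
    return math.sqrt(sum(squared) / len(squared))


def robustness(rmsd_by_testset):
    """Mean RMSD over all testsets, i.e. the 'average' column of the table."""
    return sum(rmsd_by_testset.values()) / len(rmsd_by_testset)


# Per-testset RMSD values copied from two rows of the table (not recomputed).
scoredist_dayhoff = {"Dayhoff": 12.68, "MV": 20.85, "JTT": 13.67, "WAG": 12.81}
ml_dayhoff = {"Dayhoff": 12.70, "MV": 28.40, "JTT": 14.75, "WAG": 15.15}

print(round(robustness(scoredist_dayhoff), 2))
print(round(robustness(ml_dayhoff), 2))
```

Averaging the four testset values of each row reproduces the 'average' column (15.00 and 17.75 for these two rows), which is the quantity used to compare robustness.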
Figure 1. Stratified accuracy analysis. To illustrate how the estimated distance depends on the model, the average deviation is plotted as a function of true distance for two evolutionary models, Dayhoff and Müller-Vingron. For each evolutionary distance between 1 and 200 PAM, 10 alignments were generated. For each alignment, the deviation was calculated as the difference between the estimated distance and the true distance used for data generation by ROSE [16]. The average of the 10 deviations was plotted using a running average with a window of 10 residues. Note that positive and negative deviations at the same true distance can cancel each other out; the curve shows only the average deviation, not the variability. The values in Table 1 measure the accuracy more correctly by using the RMSD of every datapoint. The testset data was created with the matrices given by Dayhoff (A) or Müller-Vingron (B). In both cases, the estimators using the same evolutionary model as the testset data perform well. However, when the model in the estimator is switched, Scoredist diverges less than ML, indicating that Scoredist is more robust. The curves show that ML-MV differs more from ML-Dayhoff than Scoredist-MV differs from Scoredist-Dayhoff, particularly for the MV dataset in (B). The smaller the difference between estimates using different models, the more robust the method.
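The two averaging steps behind the curves can be sketched as follows. The caption does not say how the running-average window is positioned, so this sketch assumes a plain sliding window over consecutive values that only emits full windows; `mean_deviation` and `running_average` are hypothetical names, not functions from Belvu or ROSE.

```python
def mean_deviation(estimates, true_distance):
    """Average signed deviation of several estimates from one true distance."""
    return sum(e - true_distance for e in estimates) / len(estimates)


def running_average(values, window=10):
    """Sliding-window mean over consecutive values (full windows only).

    Assumption: the window placement is not specified in the caption,
    so only positions where a complete window fits are returned.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    smoothed = [total / window]
    for i in range(window, len(values)):
        # Slide the window one step: add the new value, drop the oldest one.
        total += values[i] - values[i - window]
        smoothed.append(total / window)
    return smoothed


# Signed deviations cancel: one estimate 5 PAM too low, one 5 PAM too high.
print(mean_deviation([95.0, 105.0], 100.0))
print(running_average([1.0, 2.0, 3.0, 4.0], window=2))
```

The first call illustrates why the curves hide variability: an underestimate and an overestimate of equal size average to zero, whereas an RMSD over the same two points would not.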
Figure 2. The Belvu multiple sequence alignment viewer. Belvu is a multiple sequence alignment viewer that implements the Scoredist distance estimator. The alignment window (A) shows a subset of the Pfam family DNA_pol_A (PF00476). Uniprot IDs are shown throughout. A sequence with known structure is included (DPO1_ECOLI), with the SA line showing surface accessibility and the SS line showing secondary structure. The neighbour-joining tree in (B) used uncorrected distances (observed differences), while the tree in (C) used Scoredist correction. Belvu assigns a colour to each species if provided with species markup information. The distance correction mainly affects the longer branches, and affects the tree topology in some cases, e.g. the placement of DPOQ_HUMAN. Structural markup and taxonomic information were embedded in the Stockholm format alignment provided by the Pfam database.
Figure 3. Estimation of the calibration factor c. This factor rescales the raw distance d to optimally fit true evolutionary distances. The plot shows how c is estimated by least-squares fitting of raw distances d to true distances for 2000 artificially produced sequence alignments, using the Dayhoff matrix series. The linear relationship between the raw distance d and the true distance of the sequence samples justifies the introduction of the calibration factor c, which was here determined to be c = 1.3370 (see Table 2).
Calibration factors for three evolutionary models
| Evolutionary model | Calibration factor c |
| Dayhoff | 1.3370 |
| JTT | 1.2873 |
| MV | 1.1775 |
The raw distance d is scaled by the calibration factor c, which was obtained by least-squares fitting to 2000 artificial protein sequence alignments generated for the matrices given by Dayhoff, JTT (Jones-Taylor-Thornton), and MV (Müller-Vingron).
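A least-squares fit of this kind can be sketched as below. The sketch assumes a fit through the origin (true distance ≈ c · raw distance, with no intercept), which is consistent with the raw distance being scaled by a single factor but is not stated explicitly here. The names `fit_calibration_factor` and `calibrate` are hypothetical, and the sample data are made up for illustration only.

```python
def fit_calibration_factor(raw, true):
    """Least-squares slope c minimising sum((t - c * r) ** 2), no intercept."""
    if len(raw) != len(true) or not raw:
        raise ValueError("need two non-empty sequences of equal length")
    denominator = sum(r * r for r in raw)
    if denominator == 0:
        raise ValueError("raw distances are all zero; c is undefined")
    return sum(r * t for r, t in zip(raw, true)) / denominator


# Calibration factors copied from the table above.
CALIBRATION = {"Dayhoff": 1.3370, "JTT": 1.2873, "MV": 1.1775}


def calibrate(raw_distance, model="Dayhoff"):
    """Scale a raw distance by the tabulated factor for the chosen model."""
    return CALIBRATION[model] * raw_distance


# Made-up (raw, true) pairs lying exactly on a line of slope 2.
print(fit_calibration_factor([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(calibrate(100.0, model="MV"))
```

With real data, the raw and true distances would come from the simulated alignments; the closed-form slope avoids any dependency on a numerical library.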