Emmanuel D Levy, Christos A Ouzounis, Walter R Gilks, Benjamin Audit.
Abstract
BACKGROUND: One of the most evident achievements of bioinformatics is the development of methods that transfer biological knowledge from characterised proteins to uncharacterised sequences. This mode of protein function assignment is mostly based on the detection of sequence similarity and the premise that functional properties are conserved during evolution. Most automatic approaches developed to date rely on the identification of clusters of homologous proteins and the mapping of new proteins onto these clusters, which are expected to share functional characteristics.
Year: 2005 PMID: 16354297 PMCID: PMC1361783 DOI: 10.1186/1471-2105-6-302
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1. Performance of the univariate Bayesian annotation approach. Re-annotation of the filtered ENZYME database with the univariate Bayesian approach. Since we systematically sample 10 enzymes to calculate the probabilities for a protein to belong to each functional class (see Different strategies of annotation), probabilities can only take one of the following eleven values: 0, 0.1, ..., 0.9, 1. We report, for each assignment probability level and globally, the number of correct annotations, the number of annotation errors, and the corresponding error rate and coverage of the database.
| Univariate Bayesian approach | | | | | | | | | | | Global |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Correct annotations | 84 | 109 | 103 | 99 | 119 | 177 | 252 | 302 | 437 | 726 | 27795 |
| Annotation errors | 27 | 15 | 5 | 11 | 13 | 23 | 41 | 29 | 31 | 45 | 293 |
| Error rate (%) | 24.3 | 12.1 | 4.6 | 10.0 | 9.8 | 11.5 | 14.0 | 8.8 | 6.6 | 5.8 | 1.04 |
| Coverage (%) | 0.4 | 0.4 | 0.4 | 0.4 | 0.5 | 0.7 | 1.0 | 1.2 | 1.7 | 2.7 | 100.0 |
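The error-rate and coverage rows can be recomputed from the two count rows. The sketch below assumes that the error rate is the share of errors among all assignments in a column, and that coverage is the share of the 28088 enzymes of the filtered ENZYME database that received an assignment; the function name is hypothetical, and the counts are the first and last columns of the univariate table.

```python
TOTAL_ENZYMES = 28088  # size of the filtered ENZYME database, from the captions


def error_rate_and_coverage(correct, errors, total=TOTAL_ENZYMES):
    """Return (error rate %, coverage %) for one column of the table.

    Assumed definitions: error rate = errors / (correct + errors),
    coverage = (correct + errors) / total.
    """
    assigned = correct + errors
    if assigned == 0:
        return None, 0.0  # error rate is undefined when nothing is assigned
    return 100.0 * errors / assigned, 100.0 * assigned / total


# First column and global column of the univariate table.
first = error_rate_and_coverage(84, 27)
overall = error_rate_and_coverage(27795, 293)
print(f"{first[0]:.1f} {first[1]:.1f}")      # 24.3 0.4
print(f"{overall[0]:.2f} {overall[1]:.1f}")  # 1.04 100.0
```

Both results agree with the tabulated values, which supports reading the four rows as correct annotations, errors, error rate and coverage.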
Figure 1. Number of re-annotation errors. Number of annotation errors E(α,S0) made during the re-annotation of the 28088 enzymes of the filtered ENZYME database (see Methods) using the best CI strategy (see Different strategies of annotation) as a function of the parameter α and cut-off S0 (Eq. (1)).
Figure 2. Re-annotation error rate. Rate of annotation error as a function of the coverage for the re-annotation of the 28088 enzymes of the filtered ENZYME database (see Methods). The full line corresponds to the best-hit strategy (see Determining the optimal correspondence indicator); the curve was obtained by performing the re-annotation for different values of the threshold S0 between 45 (100% coverage by definition of the filtered ENZYME database) and 841. (∇,∆) correspond to the univariate and multivariate Bayesian methods at the highest confidence level (P = 1, Tables 1 and 2). (◊) corresponds to the "clear cases" identified by the univariate Bayesian method (P = 1 and second-highest probability equal to 0; see Re-annotation with the univariate Bayesian approach).
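A curve of error rate against coverage, like the best-hit line described in this caption, can be traced by sweeping the threshold S0. The following is a minimal sketch under stated assumptions: each protein is reduced to a hypothetical pair (best-hit score, whether the best hit's class is correct), a protein is annotated when its score is at least S0, and the toy scores are invented for illustration rather than taken from the paper.

```python
def best_hit_curve(hits, thresholds):
    """Trace (threshold, coverage %, error rate %) for a best-hit strategy.

    hits: list of (best_hit_score, is_correct) pairs, one per protein.
    A protein counts as annotated when its score is >= the threshold
    (an assumption; the source does not state the comparison used).
    """
    total = len(hits)
    curve = []
    for s0 in thresholds:
        kept = [ok for score, ok in hits if score >= s0]
        if not kept:
            continue  # no protein is annotated at this threshold
        errors = sum(1 for ok in kept if not ok)
        curve.append((s0, 100.0 * len(kept) / total, 100.0 * errors / len(kept)))
    return curve


# Toy data, invented for illustration only.
toy_hits = [(45, False), (60, True), (120, False), (300, True), (841, True)]
for point in best_hit_curve(toy_hits, [45, 100, 500]):
    print(point)
```

Raising S0 lowers coverage, and the error rate among the remaining assignments typically falls with it, which is the trade-off the full line displays.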
Table 2. Performance of the multivariate Bayesian annotation method. Re-annotation of the filtered ENZYME database with the multivariate Bayesian method. Since we systematically sample 10 enzymes to calculate the probabilities for a protein to belong to each functional class (see Different strategies of annotation), probabilities can only take one of the following eleven values: 0, 0.1, ..., 0.9, 1. We report, for each assignment probability level and globally, the number of correct annotations, the number of annotation errors, and the corresponding error rate and coverage of the database.
| Multivariate Bayesian method | | | | | | | | | | | Global |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Correct annotations | 0 | 0 | 0 | 0 | 9 | 35 | 109 | 116 | 188 | 511 | 27866 |
| Annotation errors | 0 | 0 | 0 | 5 | 10 | 37 | 34 | 37 | 29 | 17 | 222 |
| Error rate (%) | - | - | - | 100.0 | 52.6 | 51.4 | 23.4 | 24.2 | 13.4 | 3.2 | 0.79 |
| Coverage (%) | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.3 | 0.5 | 0.5 | 0.8 | 1.9 | 100 |
Figure 3. Examples of topology in the CI space. In four cases where there is a strong cross-similarity between sequences belonging to two different EC classes, we plot for each protein of these classes a point whose two coordinates are the CIs of its sequence with the two functional classes (BLAST best-hit with the corresponding EC class). (a): EC 2.3.1.61 (black circles) and EC 2.3.1.12 (grey triangles); crosses at the top and on the right correspond to the projections onto the CI axes; the dotted circle (boxes on top and to the right) marks the limit of the sampling regions used to annotate O31550 with the multivariate (univariate) Bayesian method [See Additional file 1, Fig. S1]. (b): EC 1.6.5.3 (black triangles) and EC 1.6.99.5 (grey circles). (c): EC 1.2.1.59 (black triangles) and EC 1.2.1.12 (grey circles). (d): EC 1.4.1.3 (black triangles) and EC 1.4.1.4 (grey circles); the arrow shows the change of position of protein P94598 in the CI space when the annotation of P95544 is corrected from EC 1.4.1.4 to EC 1.4.1.3 (see Analysing the origins of annotation errors).
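The sampling region drawn in panel (a) suggests how class probabilities restricted to multiples of 0.1 can arise. The sketch below is an illustration only, not the authors' procedure: it assumes that the 10 labelled proteins closest to the query in a two-dimensional CI space (by Euclidean distance) are sampled and that each class probability is the fraction of those samples carrying that class. The function name, class labels and coordinates are all hypothetical.

```python
import math
from collections import Counter


def class_probabilities(query, labelled_points, k=10):
    """Estimate class probabilities from the k nearest labelled points.

    query: (ci_a, ci_b) coordinates of the protein to annotate.
    labelled_points: list of ((ci_a, ci_b), class_label) pairs.
    Euclidean distance and k-nearest sampling are assumptions made here.
    """
    nearest = sorted(labelled_points, key=lambda p: math.dist(query, p[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / len(nearest) for label, n in counts.items()}


# Toy CI coordinates, invented for illustration only.
toy = (
    [((100 + i, 50), "class A") for i in range(7)]
    + [((100, 60 + i), "class B") for i in range(3)]
    + [((500, 500), "class B")] * 5
)
print(class_probabilities((100, 52), toy))  # {'class A': 0.7, 'class B': 0.3}
```

With exactly 10 samples, every probability is a count divided by 10, which matches the eleven possible values listed in the table captions.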