Sabrina de Azevedo Silveira, Raquel Cardoso de Melo-Minardi, Carlos Henrique da Silveira, Marcelo Matos Santoro, Wagner Meira.
Abstract
The volume and diversity of biological data are increasing at very high rates. Vast amounts of protein sequences and structures, protein and genetic interactions, and phenotype studies have been produced. Most data generated by high-throughput devices are annotated automatically because manual annotation is not feasible. Efficient and precise automatic annotation methods are therefore required to ensure the quality and reliability of both the biological data and the associated annotations. We propose ENZYMatic Annotation Predictor (ENZYMAP), a technique that characterizes and predicts EC number changes based on annotations from UniProt/Swiss-Prot using a supervised learning approach. We evaluated ENZYMAP experimentally, using test data sets from both UniProt/Swiss-Prot and UniProt/TrEMBL, and showed that predicting EC changes using selected types of annotation is possible. Finally, we compared the predictions of ENZYMAP and DETECT and checked both against the UniProt/Swiss-Prot annotations. ENZYMAP was more accurate than DETECT, coming closer to the actual changes in UniProt/Swiss-Prot. Our proposal is intended as an automatic complementary method, usable together with other techniques such as those based on protein sequence and structure. It helps to improve the quality and reliability of enzyme annotations over time by suggesting possible corrections, anticipating annotation changes and propagating the implicit knowledge to the whole dataset.
Year: 2014 PMID: 24586563 PMCID: PMC3929618 DOI: 10.1371/journal.pone.0089162
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fragment of an occurrence matrix.
| Id | F1 | F2 | F3 | F4 | F5 | Class |
| Q8TUG3 | 1 | 1 | 0 | 1 | 0 | |
| O67004 | 1 | 1 | 0 | 1 | 0 | |
| P34724 | 0 | 0 | 1 | 0 | 1 | |
| P44009 | 0 | 1 | 0 | 1 | 1 | |
This fragment of an occurrence matrix shows an EC change that occurred from release 5 to 6, and its control. F1 = nucleotide-binding, F2 = magnesium, F3 = eukaryota, F4 = metal-binding, F5 = signal.
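As an illustration of how such a binary occurrence matrix can be derived, the sketch below maps a set of annotation terms per entry onto the five features named in the caption. The annotation sets are assumptions chosen only to reproduce the rows of the fragment, and the helper name `occurrence_row` is hypothetical; the Class column is left out because its values are not given here.

```python
# Feature order taken from the caption (F1..F5).
FEATURES = ["nucleotide-binding", "magnesium", "eukaryota", "metal-binding", "signal"]

# Assumed annotation sets, chosen only to reproduce the rows of the fragment.
annotations = {
    "Q8TUG3": {"nucleotide-binding", "magnesium", "metal-binding"},
    "O67004": {"nucleotide-binding", "magnesium", "metal-binding"},
    "P34724": {"eukaryota", "signal"},
    "P44009": {"magnesium", "metal-binding", "signal"},
}

def occurrence_row(terms, features=FEATURES):
    """Return 1 where the entry carries the feature and 0 otherwise."""
    return [1 if feature in terms else 0 for feature in features]

# One binary row per entry, in the same column order as FEATURES.
matrix = {entry: occurrence_row(terms) for entry, terms in annotations.items()}

for entry, row in matrix.items():
    print(entry, row)
```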
Best results for the Descriptive and Predictive Multiclass Experiments.
| Multiclass experiment | Algorithm | # of features | FPR | Prec. | Rec. | F-measure | AUC |
| Descriptive | KNN_K1 | 38 | 0.01 | 0.74 | 0.74 | 0.74 | 0.95 |
| Predictive | KNN_K1 | 13 | 0.08 | 0.41 | 0.32 | 0.25 | 0.65 |
In this table, # of features refers to the number of features or attributes (in the matrix that resulted in the best classification model). TPR corresponds to recall and was omitted.
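For readers less familiar with these measures, the following sketch computes TPR, FPR, precision and recall from the four cells of a binary confusion matrix and shows why TPR and recall coincide. The counts are invented for illustration and are not taken from the experiments; how the per-class values were aggregated in the multiclass setting is not stated here.

```python
def binary_rates(tp, fp, tn, fn):
    """Compute rates from the four cells of a binary confusion matrix."""
    return {
        "tpr": tp / (tp + fn),        # true positive rate
        "fpr": fp / (fp + tn),        # false positive rate
        "precision": tp / (tp + fp),  # fraction of predicted positives that are correct
        "recall": tp / (tp + fn),     # same formula as TPR
    }

# Hypothetical counts, for illustration only.
rates = binary_rates(tp=8, fp=2, tn=88, fn=2)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```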
Results of the Common Source Experiment with Swiss-Prot test data.
| Source | FPR | Prec. | Rec. | F-measure | AUC | Algorithm | # of features | # of classes |
| -.-.-.- | 0.10 | 0.66 | 0.34 | 0.31 | 0.66 | KNN_K1 | 1 | 36 |
| 1.1.1.- | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | KNN_K1 | 11 | 2 |
| 1.10.2.2 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | KNN_K5 | 2 | 2 |
| 1.9.3.1 | 0.33 | 0.70 | 0.70 | 0.70 | 0.68 | KNN_K10 | 2 | 2 |
| 2.-.-.- | 0.31 | 0.77 | 0.42 | 0.32 | 0.62 | N. Bayes | 1 | 3 |
| 2.1.1.- | 0.24 | 0.91 | 0.90 | 0.91 | 0.93 | KNN_K7 | 74 | 3 |
| 2.3.1.- | 0.96 | 0.93 | 0.96 | 0.95 | 0.91 | KNN_K10 | 100 | 2 |
| 2.4.-.- | 0.00 | 0.98 | 0.97 | 0.97 | 0.98 | J48 | 13 | 2 |
| 2.7.1.- | 0.03 | 0.93 | 0.88 | 0.89 | 0.89 | KNN_K3 | 89 | 2 |
| 2.7.3.- | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | J48 | 30 | 2 |
| 2.7.7.48 | 0.30 | 0.70 | 0.66 | 0.66 | 0.55 | KNN_K3 | 40 | 2 |
| 2.7.7.6 | 0.01 | 0.96 | 0.93 | 0.94 | 0.96 | N. Bayes | 32 | 2 |
| 3.-.-.- | 0.01 | 0.95 | 0.90 | 0.91 | 0.94 | KNN_K1 | 5 | 2 |
| 3.1.-.- | 0.96 | 0.93 | 0.96 | 0.95 | 0.61 | KNN_K1 | 100 | 2 |
| 3.1.13.- | 0.06 | 0.95 | 0.95 | 0.95 | 0.91 | KNN_K10 | 65 | 2 |
| 3.1.2.15 | 0.00 | 1.00 | 0.96 | 0.98 | 0.00 | KNN_K10 | 100 | 2 |
| 3.2.1.18 | 0.93 | 0.87 | 0.93 | 0.90 | 0.50 | J48 | 10 | 2 |
| 3.4.22.- | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | KNN_K10 | 100 | 2 |
| 3.4.25.- | 0.33 | 1.00 | 1.00 | 1.00 | 0.97 | KNN_K10 | 41 | 2 |
| 3.6.3.14 | 0.05 | 0.94 | 0.94 | 0.94 | 0.95 | N. Bayes | 12 | 2 |
| 4.2.2.- | 0.64 | 0.80 | 0.72 | 0.62 | 0.80 | KNN_K1 | 2 | 2 |
| 5.-.-.- | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | KNN_K1 | 4 | 2 |
| 6.-.-.- | 0.90 | 0.81 | 0.90 | 0.85 | 0.50 | N. Bayes | 100 | 2 |
| 6.4.1.2 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | KNN_K1 | 10 | 2 |
Each row shows the best result (classifier) obtained for each source. Training and test data with 1 to 100 features after SVD processing were used with the classification techniques Naïve Bayes, J48 and KNN with several values of K. The last two columns give the number of features or attributes (in the occurrence matrix that resulted in the best classification model) and the number of classes in each classifier. The TPR corresponds to the recall and was omitted.
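The combination of SVD-based feature reduction and nearest-neighbour classification described in this caption can be sketched as follows. This is a minimal illustration assuming NumPy, a toy occurrence matrix and hypothetical class labels; it is not the authors' implementation, and the function names `fit_projection` and `knn_predict` are invented for the example.

```python
from collections import Counter

import numpy as np

# Toy binary occurrence matrix (rows = entries, columns = features).
train = np.array(
    [[1, 1, 0, 1, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 1, 1]],
    dtype=float,
)
labels = ["change", "change", "control", "control"]  # hypothetical labels

def fit_projection(matrix, n_features):
    """Return a matrix that maps rows onto the top right singular vectors."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[:n_features].T

def knn_predict(train_x, train_y, query, k=1):
    """Majority vote among the k training rows closest to the query."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Reduce to 2 features, then classify a query in the reduced space.
projection = fit_projection(train, n_features=2)
reduced = train @ projection
query = np.array([0, 0, 1, 0, 1], dtype=float) @ projection
print(knn_predict(reduced, labels, query, k=1))
```

The projection is fitted on the training matrix only and then applied unchanged to the query, so the test data never influence the reduced feature space. Repeating the last step for each number of features and each classifier, and keeping the best score, would correspond to the selection described in the caption.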
Result of the search for higher level classes of EC number hierarchy.
| EC Class | Scholar (absolute value) | Scholar (%) | PDB (absolute value) | PDB (%) | PubMed (absolute value) | PubMed (%) |
| oxidoreductase | 122,000 | 6.5 | 7,731 | 18.8 | 499,969 | 20.2 |
| transferase | 942,000 | 50.0 | 10,897 | 26.5 | 712,758 | 28.8 |
| hydrolase | 215,000 | 11.4 | 16,054 | 39.1 | 1,040,771 | 42.1 |
| lyase | 154,000 | 8.2 | 3,202 | 7.8 | 118,865 | 4.8 |
| isomerase | 177,000 | 9.4 | 1,655 | 4.0 | 47,984 | 1.9 |
| ligase | 273,000 | 14.5 | 1,517 | 3.7 | 52,562 | 2.1 |
We performed a simple search for the names of the higher-level classes of the EC number hierarchy in February 2012 in the Google Scholar, PDB and PubMed repositories (absolute value and percentage).
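The percentage columns appear to be each class's share of the column total. The short check below recomputes the PubMed percentages from the absolute values in the table, under that assumption.

```python
# PubMed absolute values copied from the table above.
pubmed_hits = {
    "oxidoreductase": 499_969,
    "transferase": 712_758,
    "hydrolase": 1_040_771,
    "lyase": 118_865,
    "isomerase": 47_984,
    "ligase": 52_562,
}

total = sum(pubmed_hits.values())

# Share of each EC class in the column total, rounded to one decimal place.
percentages = {
    name: round(100.0 * hits / total, 1) for name, hits in pubmed_hits.items()
}

for name, share in percentages.items():
    print(f"{name}: {share}%")
```

The recomputed shares match the PubMed (%) column, which supports reading the percentages as column-wise proportions.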
Results of the Common Source Experiment with TrEMBL test data.
| Source | FPR | Prec. | Rec. | F-measure | AUC | Algorithm | # of features | # of classes |
| -.-.-.- | 0.13 | 0.82 | 0.68 | 0.74 | 0.80 | N. Bayes | 81 | 36 |
| 2.3.1.- | 0.07 | 0.91 | 0.88 | 0.89 | 0.87 | J48 | 5 | 2 |
| 2.7.7.6 | 0.96 | 0.93 | 0.96 | 0.95 | 0.50 | KNN_K10 | 1 | 2 |
| 3.1.-.- | 0.93 | 0.87 | 0.93 | 0.90 | 0.54 | KNN_K1 | 100 | 2 |
| 3.6.3.14 | 0.58 | 0.92 | 0.91 | 0.89 | 0.79 | N. Bayes | 43 | 2 |
| 6.4.1.2 | 0.96 | 0.93 | 0.96 | 0.95 | 0.50 | J48 | 10 | 2 |
Each row shows the best result (classifier) obtained for each source. Training and test data with 1 to 100 features after SVD processing were used with the classification techniques Naïve Bayes, J48 and KNN with several values of K. The last two columns give the number of features or attributes (in the occurrence matrix that resulted in the best classification model) and the number of classes in each classifier. The TPR corresponds to the recall and was omitted.
Figure 1. Comparison of ENZYMAP, DETECT and Swiss-Prot.
We compared the EC number predictions made by ENZYMAP and DETECT and checked both against the UniProt/Swiss-Prot annotations. The diagrams show the number of predictions on which the techniques agree or disagree. In (a), only the first level of the EC number annotation is compared; in (b), (c) and (d), the annotations are compared up to the second, third and fourth levels, respectively.
ENZYMAP and DETECT predictions that agree with UniProt/Swiss-Prot.
| | Level 1 | Level 2 | Level 3 | Level 4 |
| ENZYMAP (%) | 56 | 53 | 49 | 49 |
| DETECT (%) | 49 | 48 | 45 | 32 |
| Coverage (%) | 72 | 70 | 65 | 64 |
The ENZYMAP and DETECT rows give the percentage of predictions made by our approach and by DETECT, respectively, that agree with the UniProt/Swiss-Prot annotations. Coverage is the percentage of database annotations covered when the two techniques are used in a complementary manner. The comparison considers from 1 to 4 levels of the EC number.
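One way to read the level-wise comparison is as a prefix match on the dot-separated EC number. The sketch below illustrates that reading with invented predictions; the function names are hypothetical, coverage is interpreted as "at least one technique agrees with the reference", and the treatment of incomplete EC numbers (fields written as "-") is not specified here and is not handled.

```python
def ec_prefix(ec_number, level):
    """Return the first `level` fields of an EC number such as '2.7.7.6'."""
    return tuple(ec_number.split(".")[:level])

def agrees(predicted, reference, level):
    """True when both EC numbers share the same first `level` fields."""
    return ec_prefix(predicted, level) == ec_prefix(reference, level)

def agreement_rate(predictions, references, level):
    """Percentage of predictions matching the reference up to `level`."""
    hits = sum(agrees(p, r, level) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

def coverage(pred_a, pred_b, references, level):
    """Percentage of references matched by at least one of two techniques."""
    hits = sum(
        agrees(a, r, level) or agrees(b, r, level)
        for a, b, r in zip(pred_a, pred_b, references)
    )
    return 100.0 * hits / len(references)

# Invented toy data, for illustration only.
reference = ["2.7.7.48", "3.1.2.15", "1.1.1.1", "6.4.1.2"]
technique_a = ["2.7.7.6", "3.1.13.4", "1.1.1.1", "6.4.1.2"]
technique_b = ["2.7.7.48", "3.4.22.1", "1.9.3.1", "6.4.1.2"]

for level in range(1, 5):
    print(
        level,
        agreement_rate(technique_a, reference, level),
        agreement_rate(technique_b, reference, level),
        coverage(technique_a, technique_b, reference, level),
    )
```

Because a match at level n requires a match at every lower level, agreement can only stay equal or fall as more levels are compared, which is the trend visible in the table.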