Neetika Nath, John B O Mitchell.
Abstract
BACKGROUND: We investigate the relationships between the EC (Enzyme Commission) class, the associated chemical reaction, and the reaction mechanism by building predictive models using Support Vector Machine (SVM), Random Forest (RF) and k-Nearest Neighbours (kNN). We consider two ways of encoding the reaction mechanism in descriptors, and three approaches that encode only the overall chemical reaction. Both cross-validation and an external test set are used to assess the models.
Year: 2012 PMID: 22530800 PMCID: PMC3368749 DOI: 10.1186/1471-2105-13-60
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Linear separating hyperplane for binary classification. The solid line shows the maximum-margin hyperplane separating the red and green classes. The dotted lines show the margins, and the highlighted points are the support vectors.
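The geometry described in the Figure 1 caption corresponds to the standard hard-margin SVM formulation, sketched here for reference. The symbols are assumptions introduced for illustration and do not come from the source: $x_i$ are descriptor vectors, $y_i \in \{-1, +1\}$ are the two class labels, and $w$, $b$ define the hyperplane.

```latex
% Separating hyperplane (solid line) and the two margins (dotted lines)
w^{\top} x + b = 0, \qquad w^{\top} x + b = \pm 1

% Maximising the margin width 2 / \lVert w \rVert is equivalent to
\min_{w,\, b} \ \tfrac{1}{2} \lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 \quad \text{for all } i

% Support vectors are the training points lying on a margin
y_i \left( w^{\top} x_i + b \right) = 1
```

This is the linearly separable case only; practical SVM implementations typically add slack variables and a kernel, which the caption does not cover.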
Figure 2. Workflow of the cross-validation exercise. Flow chart illustrating the workflow used in the cross-validation part of this study.
Cross-validation accuracy
| Package | CARET, Train | randomForest | |||
|---|---|---|---|---|---|
| 0.865 | 0.883 | 0.787 | 0.907 | 0.910 | |
| 0.681 | 0.640 | 0.618 | 0.682 | 0.682 | |
| 0.714 | 0.703 | 0.666 | 0.708 | 0.707 | |
| 0.623 | 0.611 | 0.557 | 0.614 | 0.616 | |
| 0.598 | 0.574 | 0.515 | 0.567 | 0.566 | |
Average cross-validated accuracy over 10 repetitions of 10-fold cross-validation, as shown in Figure 2, for four methods and five descriptor sets.
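As a rough illustration of how an accuracy averaged over 10 repetitions of 10-fold cross-validation can be computed, here is a minimal pure-Python sketch. It is not the authors' R workflow from Figure 2: the function names, the unstratified shuffling, the pooling of predictions within each repetition, and the toy 1-nearest-neighbour classifier are all assumptions made for illustration.

```python
import random


def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv_accuracy(X, y, fit_predict, k=10, repeats=10, seed=0):
    """Mean accuracy over `repeats` independent k-fold cross-validations.

    Assumption: within one repetition, predictions from all folds are
    pooled into a single accuracy; the repetitions are then averaged.
    """
    rng = random.Random(seed)
    n = len(y)
    per_repeat = []
    for _ in range(repeats):
        correct = 0
        for test_idx in kfold_indices(n, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            preds = fit_predict(
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in test_idx],
            )
            correct += sum(p == y[i] for p, i in zip(preds, test_idx))
        per_repeat.append(correct / n)
    return sum(per_repeat) / repeats


def one_nn(train_X, train_y, test_X):
    """Toy 1-nearest-neighbour classifier (hypothetical stand-in model)."""
    preds = []
    for x in test_X:
        nearest = min(
            range(len(train_X)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(train_X[j], x)),
        )
        preds.append(train_y[nearest])
    return preds


# Two well-separated synthetic clusters, 20 points each (made-up data).
X = [(i * 0.01, 0.0) for i in range(20)] + [(10 + i * 0.01, 0.0) for i in range(20)]
y = [0] * 20 + [1] * 20
mean_acc = repeated_cv_accuracy(X, y, one_nn, k=10, repeats=10, seed=0)
```

Any model can be plugged in through `fit_predict`, as long as it takes training data, training labels and test data and returns one predicted label per test item.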
Cross-validated values of Gorodkin's RK
| Package | CARET, Train | randomForest | ||
|---|---|---|---|---|
| 0.831 | 0.853 | 0.737 | 0.884 | |
| 0.596 | 0.547 | 0.525 | 0.600 | |
| 0.639 | 0.625 | 0.579 | 0.631 | |
| 0.522 | 0.509 | 0.443 | 0.510 | |
| 0.489 | 0.457 | 0.379 | 0.447 | |
Average cross-validated value of Gorodkin's K-category correlation coefficient over 10 repetitions of 10-fold cross-validation, as shown in Figure 2, for four methods and five descriptor sets.
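Gorodkin's K-category correlation coefficient generalises the Matthews correlation coefficient to more than two classes and can be computed directly from a confusion matrix. The sketch below uses the usual confusion-matrix form of the statistic; the function name, the row/column convention (rows are true classes, columns are predicted classes) and the choice to return 0.0 when the denominator vanishes are assumptions for illustration, not details from the source.

```python
from math import sqrt


def gorodkin_rk(confusion):
    """Gorodkin's R_K from a square confusion matrix.

    confusion[i][j] = number of items of true class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    true_counts = [sum(confusion[i]) for i in range(k)]
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    cov_xy = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    cov_xx = total ** 2 - sum(p * p for p in pred_counts)
    cov_yy = total ** 2 - sum(t * t for t in true_counts)

    denom = sqrt(cov_xx * cov_yy)
    # Assumed convention: undefined cases (e.g. one class only) map to 0.0.
    return cov_xy / denom if denom else 0.0


# Perfect three-class prediction gives the maximum value of 1.
rk_perfect = gorodkin_rk([[3, 0, 0], [0, 4, 0], [0, 0, 5]])

# For two classes the value coincides with the binary Matthews coefficient.
rk_binary = gorodkin_rk([[5, 1], [2, 4]])
```

Unlike raw accuracy, this coefficient takes class imbalance into account, which is one reason the RK values in the table sit below the corresponding accuracies.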
Figure 3. Performance of different classifiers in cross-validation. The figure shows the accuracy achieved by each of the four classifiers for each of the five descriptor sets in the cross-validation.
Prediction accuracies by EC class
| Package & Method | randomForest, Random Forest | |||||
|---|---|---|---|---|---|---|
| 0.961 | 0.816 | 0.948 | 0.835 | 0.952 | 0.877 | |
| 0.823 | 0.394 | 0.849 | 0.605 | 0.500 | 0.731 | |
| 0.865 | 0.500 | 0.828 | 0.568 | 0.628 | 0.600 | |
| 0.870 | 0.406 | 0.680 | 0.495 | 0.276 | 0.654 | |
| 0.817 | 0.363 | 0.722 | 0.334 | 0.333 | 0.315 | |
Prediction accuracies by EC class and descriptor type. These data are for the RF method as implemented in the R package randomForest [48].
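A per-class breakdown of this kind can be sketched as follows, under the assumption that the accuracy for an EC class means the fraction of entries truly belonging to that class that are predicted correctly. The source does not spell out the definition, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted items within each true class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in sorted(totals)}


# Made-up top-level EC class labels for six entries.
true_ec = [1, 1, 2, 2, 2, 3]
pred_ec = [1, 2, 2, 2, 3, 3]
by_class = per_class_accuracy(true_ec, pred_ec)
```

Here `by_class` maps each class label to its accuracy, so classes with few members can be inspected separately instead of being hidden inside the overall average.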
External test set accuracy
| Package | CARET, Train | randomForest | ||
|---|---|---|---|---|
| 0.744 | 0.744 | 0.674 | 0.837 | |
| 0.721 | 0.744 | 0.581 | 0.791 | |
| 0.744 | 0.721 | 0.581 | 0.698 | |
| 0.698 | 0.767 | 0.512 | 0.791 | |
| 0.581 | 0.488 | 0.605 | 0.581 | |
Prediction accuracy on the 43-entry external test set for four methods and five descriptor sets.
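Because the external test set has only 43 entries, each reported accuracy corresponds to a whole number of correct predictions. The short check below recovers those counts from the distinct values in the table; the helper name and tolerance are assumptions, and the counts are inferred from the rounded accuracies rather than stated in the source.

```python
N_TEST = 43  # size of the external test set, from the table caption

# Distinct accuracy values appearing in the external test set table.
reported = [0.837, 0.791, 0.767, 0.744, 0.721, 0.698,
            0.674, 0.605, 0.581, 0.512, 0.488]


def implied_correct(acc, n=N_TEST):
    """Return (count, consistent): the integer number of correct
    predictions implied by a 3-decimal accuracy, and whether count / n
    rounds back to the reported value."""
    count = round(acc * n)
    return count, abs(count / n - acc) < 0.0005


counts = {acc: implied_correct(acc) for acc in reported}
for acc, (count, ok) in counts.items():
    print(f"{acc:.3f} -> {count}/{N_TEST} correct, consistent={ok}")
```

One consequence is that a single additional correct or incorrect prediction shifts the accuracy by about 0.023, so small differences between methods on this test set should be read with caution.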
Figure 4. Performance of different classifiers for the external test set. The figure shows the accuracy achieved by each of the four classifiers for each of the five descriptor sets for the external test set.