Shervine Amidi1, Afshine Amidi1, Dimitrios Vlachakis2, Nikos Paragios1,3, Evangelia I Zacharaki1,3.
Abstract
The number of protein structures in the PDB database has increased more than 15-fold since 1999. Computational models that predict enzymatic function are of major importance because they provide the means to better understand how newly discovered enzymes behave when catalyzing chemical reactions. Until now, enzymatic function has mostly been predicted with single-label classification, which limits the application to enzymes performing a single reaction and introduces errors when multi-functional enzymes are examined. Indeed, some enzymes perform different reactions and can hence be directly associated with multiple enzymatic functions. In the present work, we propose a multi-label enzymatic function classification scheme that combines structural and amino acid sequence information. We investigate two fusion approaches (at the feature level and at the decision level) and assess the methodology for general enzymatic function prediction, indicated by the first digit of the enzyme commission (EC) code (six main classes), on 40,034 enzymes from the PDB database. The proposed single-label and multi-label models correctly predict the actual functional activities in 97.8% and 95.5% (based on Hamming-loss) of the cases, respectively. In addition, the multi-label model predicts all possible enzymatic reactions in 85.4% of the multi-labeled enzymes when the number of reactions is unknown. Code and datasets are available at https://figshare.com/s/a63e0bafa9b71fc7cbd7.
Keywords: Amino acid sequence; Enzyme classification; Multi-label; Single-label; Smith-Waterman algorithm; Structural information
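The keywords name the Smith-Waterman algorithm, which scores the best local alignment between two sequences. The following is a minimal sketch of that scoring recursion only. The match, mismatch and linear gap values are illustrative assumptions, not the parameters or substitution matrix used in the paper, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions (simple match/mismatch
    scores and a linear gap penalty), not the paper's settings.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the table are kept in memory because the sketch returns the score and not the alignment itself.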
Year: 2017 PMID: 28367366 PMCID: PMC5374972 DOI: 10.7717/peerj.3095
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Comparative table of several alignment-free approaches.
| No. proteins | Information | Parameters | Classification method | Level | Work | ||
|---|---|---|---|---|---|---|---|
| 1,371 | 3D structure | 3D-HINT potential | LDA | QSAR | ANN | 0–1 | |
| 4,755 | Moments, entropy, electrostatic, HINT potential | MLP | |||||
| 2,276 | 3D-QSAR | ||||||
| 26,632 | Global binding descriptors | SVM | 1–3 | ||||
| 211,658 | Structural | GRAVY | 1 | ||||
| 3,095 | Sequence | PseAAC, SAAC, GM | ML-kNN | ||||
| 9,832 | FunD, PSSM | OET-kNN | 1–2 | ||||
| 300,747 | Interpro signatures | BR-kNN | 1–4 | ||||
Figure 1: Overview of feature-level fusion.
Figure 2: Decision-level fusion for single- and multi-label classification.
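The two fusion approaches can be contrasted with a small sketch. It assumes that feature-level fusion concatenates the structural (SI) and amino acid (AA) descriptors before a single classifier, and that decision-level fusion blends the per-class scores of two separate classifiers with a weight alpha between 0 and 1. The convex-combination form, the 0.5 threshold and all function names are assumptions made for illustration; the record itself does not spell out the exact rule.

```python
def feature_level_fusion(si_features, aa_features):
    """Concatenate both descriptors so that one classifier sees them jointly."""
    return list(si_features) + list(aa_features)


def decision_level_fusion(si_scores, aa_scores, alpha):
    """Blend per-class scores of two classifiers (assumed convex combination)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(si_scores) != len(aa_scores):
        raise ValueError("both classifiers must score the same classes")
    return [alpha * s + (1.0 - alpha) * a for s, a in zip(si_scores, aa_scores)]


def predict_single_label(scores):
    """Return the EC class (1-based) with the highest fused score."""
    return max(range(len(scores)), key=scores.__getitem__) + 1


def predict_multi_label(scores, threshold=0.5):
    """Return every EC class (1-based) whose fused score reaches the threshold."""
    return [i + 1 for i, s in enumerate(scores) if s >= threshold]
```

In the single-label case one class is selected, whereas in the multi-label case any number of classes may pass the threshold, which is what allows the number of reactions to remain unknown at prediction time.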
Dataset I: 39,251 single-labeled enzymes.
| Class | EC 1 | EC 2 | EC 3 | EC 4 | EC 5 | EC 6 |
|---|---|---|---|---|---|---|
| Name | Oxidoreductase | Transferase | Hydrolase | Lyase | Isomerase | Ligase |
| Number | 7,256 | 10,665 | 15,451 | 2,694 | 1,642 | 1,543 |
Dataset II: 783 multi-labeled enzymes.
| Number of classes | EC numbers | Number of enzymes |
|---|---|---|
| 2 | 1, 2 | 62 |
| 2 | 1, 3 | 44 |
| 2 | 1, 4 | 14 |
| 2 | 1, 5 | 2 |
| 2 | 2, 3 | 217 |
| 2 | 2, 4 | 160 |
| 2 | 2, 5 | 45 |
| 2 | 2, 6 | 15 |
| 2 | 3, 4 | 82 |
| 2 | 3, 5 | 23 |
| 2 | 3, 6 | 73 |
| 2 | 4, 5 | 28 |
| 3 | 1, 2, 3 | 1 |
| 3 | 1, 2, 4 | 7 |
| 3 | 1, 4, 5 | 6 |
| 4 | 1, 2, 4, 5 | 4 |
Note:
The total numbers of enzymes with 2, 3 and 4 labels are 765, 14 and 4, respectively.
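For multi-label classification, each enzyme of Dataset II can be represented as a binary indicator vector over the six main EC classes. The sketch below shows one common encoding; the function names are hypothetical and the representation is an assumption, not necessarily the one used in the released code.

```python
EC_CLASSES = (1, 2, 3, 4, 5, 6)  # first digit of the EC code


def encode_labels(ec_first_digits):
    """Map a collection of EC first digits to a 6-element 0/1 vector."""
    labels = set(ec_first_digits)
    unknown = labels - set(EC_CLASSES)
    if unknown:
        raise ValueError(f"unknown EC classes: {sorted(unknown)}")
    return [1 if c in labels else 0 for c in EC_CLASSES]


def decode_labels(indicator):
    """Map a 6-element 0/1 vector back to the sorted list of EC classes."""
    return [c for c, flag in zip(EC_CLASSES, indicator) if flag]
```

A single-labeled enzyme is then simply the special case in which exactly one entry of the vector is set.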
Testing performance of dataset I.
| Type | SI | SI | AA | Decision fusion | Decision fusion | Feature fusion | Feature fusion |
|---|---|---|---|---|---|---|---|
| Classifier | SVM | NN | NN | SVM | NN | SVM | NN |
| Overall accuracy | 0.830 | 0.828 | 0.976 | 0.977 | 0.978 | 0.942 | 0.878 |
| Balanced accuracy | 0.755 | 0.788 | 0.968 | 0.966 | 0.968 | 0.910 | 0.856 |
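Because the six classes of Dataset I differ strongly in size, overall and balanced accuracy can diverge. The sketch below assumes the usual definitions, namely the fraction of correct predictions for overall accuracy and the unweighted mean of per-class recall for balanced accuracy; the record does not state the formulas, so this is an assumption.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (assumed definition)."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```

With four hydrolases (EC 3) all predicted correctly and two ligases (EC 6) of which one is misclassified, overall accuracy is 5/6 while balanced accuracy is 0.75, since the smaller class weighs as much as the larger one.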
Figure 3: Testing subset accuracy for dataset II.
Figure 4: Repartition of correctly predicted enzymes with respect to subset accuracy.
Figure 5: Testing 1-Hamming-loss for dataset II.
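Subset accuracy and Hamming-loss measure different things: the former counts an enzyme as correct only if its whole label set is predicted exactly, while the latter counts individual label errors. A minimal sketch under the standard definitions, assuming labels are given as 0/1 indicator vectors of equal length (function names are illustrative):

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label decisions that are wrong."""
    n_labels = len(Y_true[0])
    wrong = sum(
        t != p
        for y_t, y_p in zip(Y_true, Y_pred)
        for t, p in zip(y_t, y_p)
    )
    return wrong / (len(Y_true) * n_labels)


def subset_accuracy(Y_true, Y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(list(y_t) == list(y_p) for y_t, y_p in zip(Y_true, Y_pred))
    return exact / len(Y_true)
```

For two enzymes with six labels each, one missed label gives a Hamming-loss of 1/12 (so 1-Hamming-loss is about 0.917) but a subset accuracy of only 0.5, which explains why the two scores reported for the multi-label model differ.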
Comparison of 1-Hamming-loss per class with SVM–SVM.
| Classifier | 1-Hamming-loss per class | |||||
|---|---|---|---|---|---|---|
| EC 1 | EC 2 | EC 3 | EC 4 | EC 5 | EC 6 | |
| SI SVM only | 0.962 | 0.834 | 0.860 | 0.822 | 0.943 | 0.962 |
| AA NN only | 0.962 | 0.898 | 0.885 | 0.962 | ||
| Decision fusion SVM–SVM | 0.917 | |||||
Note:
The best classification performance is indicated in bold for each class.
Testing performance of dataset II.
| Type | SI | AA | Decision fusion | Feature fusion | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Classifier | SVM | NN | SVM | NN | SVM | NN | SVM | NN | |||
| SVM | NN | SVM | NN | ||||||||
| Alpha | 0.73 | 0.69 | 0.80 | 0.76 | |||||||
| Hamming-loss | 0.103 | 0.119 | 0.064 | 0.063 | 0.054 | 0.054 | 0.063 | 0.083 | 0.098 | ||
| Accuracy | 0.790 | 0.800 | 0.883 | 0.885 | 0.898 | 0.879 | 0.889 | 0.823 | 0.831 | ||
| Precision | 0.857 | 0.829 | 0.901 | 0.906 | 0.942 | 0.918 | 0.907 | 0.889 | 0.856 | ||
| Recall | 0.825 | 0.831 | 0.908 | 0.908 | 0.920 | 0.885 | 0.911 | 0.847 | 0.856 | ||
| F1 score | 0.835 | 0.829 | 0.904 | 0.906 | 0.919 | 0.893 | 0.908 | 0.859 | 0.855 | ||
| Subset accuracy | 0.688 | 0.739 | 0.834 | 0.841 | 0.847 | 0.841 | 0.847 | 0.726 | 0.783 | ||
| Macro | Precision | 0.921 | 0.744 | 0.940 | 0.941 | 0.962 | 0.945 | 0.903 | 0.927 | 0.806 | |
| Recall | 0.741 | 0.777 | 0.881 | 0.871 | 0.879 | 0.854 | 0.881 | 0.791 | 0.787 | ||
| F1 | 0.801 | 0.758 | 0.902 | 0.897 | 0.905 | 0.905 | 0.889 | 0.844 | 0.794 | ||
| Micro | Precision | 0.864 | 0.822 | 0.904 | 0.907 | 0.943 | 0.919 | 0.904 | 0.901 | 0.857 | |
| Recall | 0.829 | 0.832 | 0.910 | 0.910 | 0.922 | 0.885 | 0.913 | 0.850 | 0.857 | ||
| F1 | 0.846 | 0.827 | 0.907 | 0.908 | 0.921 | 0.918 | 0.909 | 0.875 | 0.857 | ||
Note:
The best classification performance (based on different criteria) is indicated in bold for each technique.
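The macro and micro rows of the table average the same per-label counts in two different ways. The sketch below assumes the common conventions: micro scores pool true positives, false positives and false negatives over all labels before computing the ratios, and macro scores compute the ratios per label and then take the unweighted mean. Treating an undefined ratio as 0 is a further assumption, and the function names are hypothetical.

```python
def _prf(tp, fp, fn):
    """Precision, recall and F1 from counts; undefined ratios become 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def per_label_counts(Y_true, Y_pred):
    """Return [tp, fp, fn] for each label from 0/1 indicator vectors."""
    counts = [[0, 0, 0] for _ in range(len(Y_true[0]))]
    for y_t, y_p in zip(Y_true, Y_pred):
        for j, (t, p) in enumerate(zip(y_t, y_p)):
            if t and p:
                counts[j][0] += 1  # true positive
            elif p:
                counts[j][1] += 1  # false positive
            elif t:
                counts[j][2] += 1  # false negative
    return counts


def micro_scores(Y_true, Y_pred):
    """Pool the counts over all labels, then compute the ratios once."""
    counts = per_label_counts(Y_true, Y_pred)
    tp, fp, fn = (sum(c[k] for c in counts) for k in range(3))
    return _prf(tp, fp, fn)


def macro_scores(Y_true, Y_pred):
    """Compute the ratios per label, then average them with equal weight."""
    per_label = [_prf(*c) for c in per_label_counts(Y_true, Y_pred)]
    return tuple(sum(s[k] for s in per_label) / len(per_label) for k in range(3))
```

Macro averaging gives rare classes such as EC 5 and EC 6 the same weight as frequent ones, whereas micro averaging is dominated by the frequent classes.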