Alessio Martino, Enrico De Santis, Alessandro Giuliani, Antonello Rizzi.
Abstract
Multiple kernel learning is a paradigm which employs a properly constructed chain of kernel functions able to simultaneously analyse different data or different representations of the same data. In this paper, we propose a hybrid classification system based on a linear combination of multiple kernels defined over multiple dissimilarity spaces. The core of the training procedure is the joint optimisation of kernel weights and representatives selection in the dissimilarity spaces. This equips the system with a two-fold knowledge discovery phase: by analysing the weights, it is possible to check which representations are more suitable for solving the classification problem, whereas the pivotal patterns selected as representatives can give further insights into the modelled system, possibly with the help of field experts. The proposed classification system is tested on real proteomic data in order to predict proteins' functional role starting from their folded structure: specifically, a set of eight representations is drawn from the graph-based protein folded description. The proposed multiple kernel-based system has also been benchmarked against a clustering-based classification system that can likewise exploit multiple dissimilarities simultaneously. Computational results show remarkable classification capabilities and the knowledge discovery analysis is in line with current biological knowledge, suggesting the reliability of the proposed system.
Keywords: computational biology; dissimilarity spaces; kernel methods; protein contact networks; support vector machines; systems biology
Year: 2020 PMID: 33286565 PMCID: PMC7517365 DOI: 10.3390/e22070794
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
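The abstract's central construction, a linear combination of kernels defined over several dissimilarity spaces, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy patterns, the absolute-difference dissimilarities, the Gaussian kernel, and the hand-fixed weights and representatives (which the abstract says are jointly optimised during training). All function names are hypothetical.

```python
import math

def dissimilarity_embedding(patterns, representatives, dissimilarity):
    """Represent each pattern by its dissimilarities to the chosen representatives."""
    return [[dissimilarity(p, r) for r in representatives] for p in patterns]

def rbf_kernel(vectors, gamma=1.0):
    """Gaussian kernel matrix over embedded vectors (kernel choice is an assumption)."""
    def k(u, v):
        return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))
    return [[k(u, v) for v in vectors] for u in vectors]

def combine_kernels(kernels, weights):
    """Non-negative linear combination of same-sized kernel matrices."""
    if len(kernels) != len(weights):
        raise ValueError("one weight per kernel is required")
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be non-negative")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]

# Toy patterns, each carrying two hypothetical representations.
patterns = [(0.1, 5.0), (0.2, 4.0), (0.9, 1.0), (0.8, 0.5)]

# One dissimilarity measure per representation (illustrative choices).
dissimilarities = [
    lambda p, q: abs(p[0] - q[0]),
    lambda p, q: abs(p[1] - q[1]),
]

# Hand-picked representatives (indices into `patterns`) and kernel weights;
# in the described system these would come out of the optimisation.
representative_ids = [[0, 2], [1, 3]]
weights = [0.7, 0.3]

kernels = []
for dissimilarity, ids in zip(dissimilarities, representative_ids):
    reps = [patterns[i] for i in ids]
    embedded = dissimilarity_embedding(patterns, reps, dissimilarity)
    kernels.append(rbf_kernel(embedded))

K = combine_kernels(kernels, weights)
```

A matrix such as `K` could then be handed to any kernel machine that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Inspecting `weights` and `representative_ids` after training is what the abstract refers to as the two-fold knowledge discovery phase.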
Figure 1. Distributions within the original set of 6685 proteins.
Class distribution within the filtered set of 4957 proteins.
| | EC1 | EC2 | EC3 | EC4 | EC5 | EC6 | not-enzymes | Total |
|---|---|---|---|---|---|---|---|---|
| Proteins | 540 | 1017 | 919 | 329 | 182 | 244 | 1726 | 4957 |
| Share (%) | 10.89 | 20.52 | 18.54 | 6.64 | 3.67 | 4.92 | 34.82 | 100 |
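The percentages in the class distribution table follow directly from the counts. As a quick consistency check, the short sketch below recomputes them; the variable names are illustrative only.

```python
# Class counts as reported for the filtered set of 4957 proteins.
counts = {
    "EC1": 540,
    "EC2": 1017,
    "EC3": 919,
    "EC4": 329,
    "EC5": 182,
    "EC6": 244,
    "not-enzymes": 1726,
}

total = sum(counts.values())

# Share of each class, as a percentage rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(f"total: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```

The recomputed shares match the reported row and make the class imbalance explicit: the not-enzymes class is by far the largest, while EC5 is the smallest.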
Genetic algorithm parameters description.
| Parameter | Bounds | Constraints |
|---|---|---|
Test set performances with fitness function .
| Class | Accuracy | Precision | Recall | Informedness (normalised) | AUC | Sparsity (complexity) |
|---|---|---|---|---|---|---|
| 1 (EC1) | 0.95 | 0.87 | 0.68 | 0.83 | 0.92 | 49.43 |
| 2 (EC2) | 0.91 | 0.88 | 0.66 | 0.82 | 0.90 | 49.62 |
| 3 (EC3) | 0.90 | 0.84 | 0.58 | 0.78 | 0.88 | 49.48 |
| 4 (EC4) | 0.97 | 0.90 | 0.56 | 0.78 | 0.88 | 49.42 |
| 5 (EC5) | 0.98 | 0.83 | 0.44 | 0.72 | 0.78 | 50.78 |
| 6 (EC6) | 0.99 | 0.94 | 0.76 | 0.88 | 0.95 | 49.28 |
| 7 (not-enzymes) | 0.82 | 0.77 | 0.70 | 0.79 | 0.89 | 50.52 |
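For reference, the performance columns can be derived from the entries of a one-vs-rest confusion matrix. The sketch below assumes that the normalised informedness is the usual affine rescaling of Youden's J from [-1, 1] to [0, 1]; this is an assumption, although it is consistent with the reported values, which exceed the recall (something the raw informedness can never do). The confusion-matrix counts are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return one-vs-rest metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)                  # true positive rate
    specificity = tn / (tn + fp)             # true negative rate
    informedness = recall + specificity - 1  # Youden's J, in [-1, 1]
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": recall,
        "informedness": informedness,
        # Assumed normalisation: affine map from [-1, 1] to [0, 1].
        "informedness_normalised": (informedness + 1) / 2,
    }

# Hypothetical counts for a single positive class (illustration only).
metrics = binary_metrics(tp=68, fp=10, tn=890, fn=32)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under this normalisation, a value of 0.5 corresponds to chance-level discrimination and 1 to perfect discrimination, which makes the column directly comparable with the AUC.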
Figure 2. ROC curves with fitness function . In brackets, the respective AUC values.
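The AUC values quoted with the ROC curves can be obtained from real-valued classifier scores without drawing the curve, using the equivalent pairwise-ranking definition. The following is a minimal sketch with hypothetical scores and labels; it is not the authors' evaluation code.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical decision scores and ground-truth labels (1 = positive class).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

The quadratic loop keeps the definition explicit; for large test sets a sort-based implementation, such as the one behind scikit-learn's `roc_auc_score`, would be preferable.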
Test set performances with fitness function and .
| Class | Accuracy | Precision | Recall | Informedness (normalised) | AUC | Sparsity (complexity) |
|---|---|---|---|---|---|---|
| 1 (EC1) | 0.95 | 0.86 | 0.69 | 0.84 | 0.92 | 33.08 |
| 2 (EC2) | 0.91 | 0.88 | 0.67 | 0.82 | 0.90 | 32.48 |
| 3 (EC3) | 0.90 | 0.83 | 0.57 | 0.77 | 0.87 | 29.94 |
| 4 (EC4) | 0.97 | 0.88 | 0.54 | 0.77 | 0.88 | 33.89 |
| 5 (EC5) | 0.98 | 0.85 | 0.45 | 0.73 | 0.79 | 35.54 |
| 6 (EC6) | 0.98 | 0.91 | 0.76 | 0.88 | 0.95 | 35.38 |
| 7 (not-enzymes) | 0.82 | 0.77 | 0.69 | 0.79 | 0.88 | 33.37 |
Figure 3. ROC curves with fitness function and . In brackets, the respective AUC values.
Figure 4. Schematic of the classification system able to learn a classification model for each positive class. The model provides the crisp decision as well as a score (a real number) encoding the decision reliability.
Test set performances with the one-class classifier.
| Class | Classifier | Accuracy | Precision | Recall | Informedness (normalised) | AUC |
|---|---|---|---|---|---|---|
| 1 (EC1) | OCC | 0.92 | | 0.35 | 0.67 | 0.85 |
| | MKMD | | 0.88 | | | |
| 2 (EC2) | OCC | 0.83 | 0.87 | 0.45 | 0.69 | 0.76 |
| | MKMD | | | | | |
| 3 (EC3) | OCC | 0.83 | | 0.49 | 0.70 | 0.77 |
| | MKMD | | 0.84 | | | |
| 4 (EC4) | OCC | 0.68 | 0.60 | | 0.61 | 0.72 |
| | MKMD | | | 0.53 | | |
| 5 (EC5) | OCC | 0.85 | 0.75 | 0.37 | 0.62 | 0.69 |
| | MKMD | | | | | |
| 6 (EC6) | OCC | 0.97 | 0.96 | 0.57 | 0.78 | 0.88 |
| | MKMD | | | | | |
| 7 (not-enzymes) | OCC | 0.68 | 0.60 | | 0.61 | 0.72 |
| | MKMD | | | 0.68 | | |
Figure 5. ROC curves comparison (best run for all classes). In brackets, the respective AUC values.
Comparison (in terms of AUC) between the proposed MKMD approach and previous studies.
| Approach | EC1 | EC2 | EC3 | EC4 | EC5 | EC6 | Not-Enzymes |
|---|---|---|---|---|---|---|---|
| DME + Logistic Regression [ | – | – | – | – | – | – | 0.62 |
| DME + SVM [ | – | – | – | – | – | – | 0.64 |
| DME + Naïve Bayes [ | – | – | – | – | – | – | 0.62 |
| DME + Decision Tree [ | – | – | – | – | – | – | 0.60 |
| DME + Neural Network [ | – | – | – | – | – | – | 0.63 |
| OCC [ | – | – | – | – | – | – | 0.63 |
| Feature Generation via Betti Numbers + SVM [ | 0.79 | 0.75 | 0.73 | 0.73 | 0.46 | 0.77 | 0.77 |
| Feature Generation via Spectral Density + SVM [ | 0.85 | 0.82 | 0.85 | 0.81 | 0.59 | 0.81 | 0.82 |
| MKMD with | 0.92 | 0.90 | 0.88 | 0.88 | 0.78 | 0.95 | 0.89 |
| MKMD with | 0.92 | 0.90 | 0.87 | 0.88 | 0.79 | 0.95 | 0.88 |
| MKMD with no representative selection ( | 0.91 | 0.91 | 0.88 | 0.87 | 0.78 | 0.95 | 0.88 |
| OCC ( | 0.85 | 0.76 | 0.77 | 0.72 | 0.69 | 0.88 | 0.72 |
Selected proteins in order to discriminate EC 1 (oxidoreductases) vs. all the rest.
| PDB ID | Notes/Description |
|---|---|
| 1KOF | Transferase |
| 1XFG | Transferase |
| 3E2R | Oxidoreductase |
| 4TS9 | Transferase |
| 1ZDM | Signalling Protein |
| 1MPG | Hydrolase |
| 1QQQ | Transferase |
Selected proteins in order to discriminate EC 2 (transferases) vs. all the rest.
| PDB ID | Notes/Description |
|---|---|
| 3EDC | LAC repressor (signalling protein) |
| 1DKL | Hydrolase |
| 1JKJ | Ligase |
| 2DBI | Unknown function |
| 3UCS | Chaperone |
| 1LX7 | Transferase |
| 2GAR | Transferase |
| 3ILI | Transferase |
| 1S08 | Transferase |
| 4IXM | Hydrolase |
| 4XTJ | Isomerase |
| 1KW1 | Lyase |
| 1BDH | Transcription factor (DNA-binding) |
| 4PC3 | Elongation factor (RNA-binding) |
| 5G1L | Isomerase |
Selected proteins in order to discriminate EC 3 (hydrolases) vs. all the rest.
| PDB ID | Notes/Description |
|---|---|
| 4RZS | Transcription factor (signalling protein) |
| 1ZDM | Signalling protein |
| 3I7R | Lyase |
| 1HW5 | Signalling protein |
| 1SO5 | Lyase |
Selected proteins in order to discriminate EC 4 (lyases) vs. all the rest.
| PDB ID | Notes/Description |
|---|---|
| 2BWX | Hydrolase |
| 3UWM | Oxidoreductase |
| 2H71 | Electron transport |
| 1D7A | Lyase |
| 4DAP | DNA-binding |
| 1SPV | Structural genomics, unknown function |
| 1EXD | Ligase + RNA-binding |
| 1X83 | Isomerase |
| 3ILJ | Transferase |
| 2D4U | Signalling protein |
| 1JNW | Oxidoreductase |
| 1TRE | Oxidoreductase |
| 1ZPT | Oxidoreductase |
| 3LGU | Hydrolase |
| 1IB6 | Oxidoreductase |
| 3C0U | Structural genomics, unknown function |
| 5GT2 | Oxidoreductase |
| 2RN2 | Hydrolase |
| 4L4Z | Transcription regulator |
| 3CMR | Hydrolase |
| 1NQF | Transport protein |
| 1GPQ | Hydrolase |
| 4ODM | Isomerase + chaperone |
| 2NPG | Transport protein |
| 2UAG | Ligase |
| 1OVG | Transferase |
| 3AVU | Transferase |
| 1RBV | Hydrolase |
| 5AB1 | Cell adhesion |
| 1TMM | Transferase |
| 4NIY | Hydrolase |
| 4WR3 | Isomerase |
Selected proteins in order to discriminate EC 5 (isomerases) vs. all the rest.
| PDB ID | Notes/Description |
|---|---|
| 4ITX | Lyase |
| 2BWW | Hydrolase |
| 5IU6 | Transferase |
| 1ODD | Gene regulatory |
| 5G5G | Oxidoreductase |
| 1G7X | Transferase |
| 2E0Y | Transferase |
| 2SCU | Ligase |
| 1HO4 | Hydrolase |
| 3RGM | Transport Protein |
| 1OAC | Oxidoreductase |
| 5MUC | Oxidoreductase |
| 3OGD | Hydrolase + DNA binding |
| 4K34 | Membrane protein |
| 1Q0L | Oxidoreductase |
| 1G58 | Isomerase |
| 5M3B | Transport protein |
| 2WOH | Oxidoreductase |
| 2PJP | Translation regulation (RNA-binding) |
Selected proteins in order to discriminate EC 6 (ligases) vs. all the rest.
| PDB ID | Notes/Description |
|---|---|
| 2OLQ | Lyase |
| 1JDI | Isomerase |
| 4NIG | Oxidoreductase + DNA-binding |
| 5T03 | Transferase |
| 5FNN | Oxidoreductase |
| 2Z9D | Oxidoreductase |
| 2V3Z | Hydrolase |
| 4ARI | Ligase + RNA-binding |
| 3LBS | Transport protein |
| 4QGS | Oxidoreductase |
| 5B7F | Oxidoreductase |
| 2ABH | Transferase |
Selected proteins in order to discriminate not-enzymes vs. all the rest.
| PDB ID | Notes/Description |
|---|---|
| 1SPA | Transferase |
| 2YH9 | Membrane protein |
| 1NQF | Transport protein |
| 1LDI | Transport protein |
| 1TIK | Hydrolase |
| 1MWI | Hydrolase + DNA-binding |
| 1GEW | Transferase |
| 5CKH | Hydrolase |
| 3ABQ | Lyase |
| 3B6M | Oxidoreductase |
Figure 6. Three basic patterns in protein 3D structures. (a) Transferase (PDB ID 1KOF), (b) Proline dehydrogenase, an oxidoreductase (PDB ID 3E2R), (c) Transport protein, a non-enzyme (PDB ID 3RGM).
Figure 7. Average kernel weights vectors .