Katarina Elez, Alexandre M J J Bonvin, Anna Vangone.
Abstract
BACKGROUND: Study of macromolecular assemblies is fundamental to understand functions in cells. X-ray crystallography is the most common technique to solve their 3D structure at atomic resolution. In a crystal, however, both biologically-relevant interfaces and non-specific interfaces resulting from crystallographic packing are observed. Due to the complexity of the biological assemblies currently tackled, classifying those interfaces, i.e. distinguishing biological from crystal lattice interfaces, is not trivial and often prone to errors. In this context, analyzing the physico-chemical characteristics of biological/crystal interfaces can help researchers identify possible features that distinguish them and gain a better understanding of the systems.Entities:
Keywords: Biological interface; Classification; Crystal interfaces; EPPIC; Intermolecular contacts; PISA; Predictor; Protein-protein interface; Residue contacts
Year: 2018 PMID: 30497368 PMCID: PMC6266931 DOI: 10.1186/s12859-018-2414-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Boxplot of the structural and energetic properties. Boxplots of the interfacial residue contacts (RCs) (panel A), the buried surface area (BSA) (panel B) and energetic values (panels C, D, E) are reported for the BioMany and XtalMany entries in pink and blue, respectively. The electrostatic (Eelec), van der Waals (Evdw) and desolvation (Edes) energies were calculated with the HADDOCK refinement server. The black line in the middle of each box corresponds to the median, the lower and upper hinges correspond to the 25th and 75th percentiles, respectively, and the whiskers extend no further than 1.5 times the interquartile range from the hinges. Points beyond this range are considered outliers and drawn as black points
Fig. 2 Boxplot of the structural properties divided by physico-chemical character. Boxplots of the interfacial residue contacts (RCs) and of the non-interacting surface (NIS), classified by the charged/polar/apolar character of the residues (top left and top right panels, respectively). The number of RCs for each of the 20 standard amino acids is reported in the bottom panel. BioMany and XtalMany entries are shown in pink and blue, respectively. The black line in the middle of each box corresponds to the median, the lower and upper hinges correspond to the 25th and 75th percentiles, respectively, and the whiskers extend no further than 1.5 times the interquartile range from the hinges. Points beyond this range are considered outliers and drawn as black points
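The whisker convention used in the two figure captions above (box from the 25th to the 75th percentile with the median inside, whiskers no further than 1.5 times the interquartile range from the hinges, anything beyond flagged as an outlier) can be sketched numerically. This is an illustrative NumPy helper, not code from the paper:

```python
import numpy as np

def box_stats(values):
    """Boxplot summary statistics following the convention of Figs. 1-2:
    box from the 25th to the 75th percentile with the median inside,
    whiskers reaching the most extreme points within 1.5 * IQR of the
    hinges, and points beyond the whiskers flagged as outliers."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points inside the fences.
    whiskers = (float(v[v >= lo_fence].min()), float(v[v <= hi_fence].max()))
    outliers = v[(v < lo_fence) | (v > hi_fence)].tolist()
    return {"median": float(median), "q1": float(q1), "q3": float(q3),
            "whiskers": whiskers, "outliers": outliers}

stats = box_stats([1, 2, 3, 4, 5, 100])
print(stats["whiskers"], stats["outliers"])  # → (1.0, 5.0) [100.0]
```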
Performance of classification models based on different features and training algorithms
| Set | Training features | Bagging | Random Forest | Adaptive Boosting | Gradient Boosting | Neural Network | Average |
|---|---|---|---|---|---|---|---|
| S1 | BSA | 0.74 | 0.74 | 0.81 | 0.81 | 0.55 | 0.73 |
| S2 | RCs | 0.86 | 0.86 | 0.85 | 0.86 | 0.85 | 0.86 |
| S3 | CC, CP, CA, PP, AP, AA | 0.89 | 0.90 | 0.89 | 0.89 | 0.89 | 0.89 |
| S4 | CC, CP, CA, PP, AP, AA, ANIS, CNIS, PNIS | 0.90 | 0.90 | 0.89 | 0.89 | 0.89 | 0.89 |
| S5 | CC, CP, CA, PP, AP, AA, LD, G, A, L, M, F, W, K, Q, E, S, P, V, I, C, Y, H, R, N, D, T | 0.92 | 0.92 | 0.91 | 0.92 | 0.91 | 0.92 |
| S6 | CC, CP, CA, PP, AP, AA, ANIS, CNIS, PNIS, LD, G, A, L, M, F, W, K, Q, E, S, P, V, I, C, Y, H, R, N, D, T | 0.92 | 0.92 | 0.91 | 0.93 | 0.92 | 0.92 |
| E1 | HS | 0.76 | 0.76 | 0.83 | 0.82 | 0.82 | 0.80 |
| E2 | Eelec, Evdw, Edes | 0.87 | 0.87 | 0.87 | 0.87 | 0.85 | 0.87 |
| C | CC, CP, CA, PP, AP, AA, ANIS, CNIS, PNIS, LD, G, A, L, M, F, W, K, Q, E, S, P, V, I, C, Y, H, R, N, D, T, Eelec, Evdw, Edes | 0.92 | 0.93 | 0.92 | 0.93 | 0.90 | 0.92 |
Accuracy values calculated according to Eq. 2 in “Methods”
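Eq. 2 itself is not reproduced in this excerpt; assuming it is the standard classification accuracy (the fraction of correctly labelled interfaces), a minimal helper would be:

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly classified interfaces: (TP + TN) / N.
    Assumed to correspond to the standard accuracy of Eq. 2, which is
    not reproduced in this excerpt."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 1 = biological interface, 0 = crystal contact (illustrative labels)
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # → 0.75
```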
Predictive accuracies are reported for the classification models tested. Nine sets of features were used to train new predictive models, based on structural properties (S1, S2, S3, S4, S5, S6), energetics (E1, E2) and a combination of structure and energetics (C). For each set of training features, five machine-learning algorithms were used for training (Bagging, Random Forest, Adaptive Boosting, Gradient Boosting and Neural Network). For the trained models, accuracies on the Many [34] and the independent DC [15] datasets are reported; the accuracy on the Many dataset is the average over a 10-fold cross-validation
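The training protocol described above (five algorithms, accuracy averaged over a 10-fold cross-validation) can be sketched with scikit-learn. This is a hedged illustration using a synthetic feature matrix in place of the Many dataset; it is not the paper's actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for an interface feature matrix (the real models
# were trained on residue-contact, NIS and energy features).
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

classifiers = {
    "Bagging": BaggingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Adaptive Boosting": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
}

# Accuracy reported as the mean over a 10-fold cross-validation,
# mirroring how Table 1 summarises each feature set.
scores = {name: cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
          for name, clf in classifiers.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.2f}")
```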
Fig. 3 Accuracy of machine learning classifiers. Prediction accuracies (y-axis) of the various predictors as a function of the feature set used for training (x-axis). The training sets consist of structural properties (S1, S2, S3, S4, S5, S6), energetics (E1, E2) and a combination of both (C). Refer to Table 1 for the detailed list of features included in each set. Five machine learning algorithms were used for training: Bagging, Random Forest, Adaptive Boosting, Gradient Boosting and Neural Network, shown in blue, purple, green, red and brown, respectively
Optimization of the machine learning classifiers
| Algorithm | Classification accuracy on the Many dataset | Classification accuracy on the independent DC dataset |
|---|---|---|
| Bagging | 0.92 | 0.73 |
| Random Forest | 0.92 | 0.74 |
| Adaptive Boosting | 0.92 | 0.74 |
| Gradient Boosting | 0.93 | 0.74 |
| Neural Network | 0.91 | 0.75 |
The maximum accuracy reached by optimizing the settings is reported for each classifier, for the Many dataset (as the average over the 10-fold cross-validation) and for the independent DC dataset
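The optimization of classifier settings reported above can be illustrated with a grid search over hyperparameters. The parameter grid below is hypothetical (the settings actually explored in the paper are not listed in this excerpt), and a synthetic feature matrix again stands in for the Many dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the interface feature matrix.
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

# Hypothetical grid -- not the settings explored in the paper.
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.1, 0.2]}

# Each combination is scored by 10-fold cross-validated accuracy,
# matching how the optimized accuracies in the table are summarised.
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

The same pattern applies to the other four classifiers, each with its own grid.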