Wei Xiao, Juhui Ren, Jutao Hao, Haoyu Wang, Yuhao Li, Liangzhao Lin.
Abstract
Water molecules play an important role in many biological processes by stabilizing protein structures, assisting protein folding, and improving binding affinity. Owing to various environmental factors, it is difficult to distinguish conserved water molecules (CWMs) from free water molecules (FWMs) directly, as CWMs are normally deeply embedded in proteins and form strong hydrogen bonds with surrounding polar groups. The abundance of spatial structure information and physicochemical properties of water molecules in proteins motivates the use of machine learning methods to circumvent this difficulty. This study therefore presents a machine learning framework for identifying CWMs in the binding sites of proteins. First, by analyzing the physicochemical properties and spatial structure information of water molecules, six features (i.e., atom density, hydrophilicity, hydrophobicity, solvent-accessible surface area, temperature B-factors, and mobility) were extracted. These features were then analyzed and combined to reach a higher CWM identification rate, and an optimal feature combination was determined. Based on this optimal combination, seven machine learning models (support vector machine (SVM), K-nearest neighbor (KNN), decision tree (DT), logistic regression (LR), discriminant analysis (DA), naïve Bayes (NB), and ensemble learning (EL)) were evaluated for their ability to classify water molecules into two categories, i.e., CWMs and FWMs. The EL model was chosen as the prediction model because of its overall advantages. Furthermore, the methodology was validated through a case study of crystal structure 3skh and compared extensively with Dowser++.
The prediction performance showed that the optimal feature combination and the desired EL model in our method could achieve satisfactory prediction accuracy in identifying CWMs from FWMs in the proteins' binding sites.
Year: 2022 PMID: 36226242 PMCID: PMC9550495 DOI: 10.1155/2022/5104464
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.809
Figure 1. The training process of the dataset. The crystal structure 1D7R is marked in magenta. Its homologous protein, the crystal structure 1M0Q (i.e., the crystal structure of dialkylglycine decarboxylase complexed with S-1-aminoethanephosphonate [43]), is marked in green. The ligand in the conformation of the crystal structure 1D7R is shown as an orange ball-and-stick model. The magenta and green spheres represent the water molecules in the crystal structures 1D7R and 1M0Q, respectively, while the yellow and cyan spheres represent the CWMs and FWMs in the crystal structure 1D7R, respectively. The distances between pairs of water molecules are indicated in red.
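The caption suggests that water molecules are labelled by comparing their positions with those in a superimposed homologous structure. The following is only a minimal sketch of that idea, assuming a water counts as conserved when some water of the homolog lies within a distance cutoff. The cutoff value, the function name `label_waters`, and the toy coordinates are hypothetical and are not taken from the source.

```python
import math

# Hypothetical cutoff in angstroms; the source does not state the value used.
CUTOFF = 1.0


def label_waters(target_waters, homolog_waters, cutoff=CUTOFF):
    """Label each target water as 'CWM' or 'FWM'.

    A target water is labelled 'CWM' if at least one homolog water lies
    within `cutoff` of it, otherwise 'FWM'. Coordinates are (x, y, z)
    tuples and are assumed to be already superimposed.
    """
    labels = []
    for water in target_waters:
        # Distance to the closest homolog water (infinite if there are none).
        nearest = min(
            (math.dist(water, other) for other in homolog_waters),
            default=math.inf,
        )
        labels.append("CWM" if nearest <= cutoff else "FWM")
    return labels


# Toy coordinates for illustration only (not from 1D7R or 1M0Q).
target = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
homolog = [(0.3, 0.4, 0.0), (9.0, 9.0, 9.0)]
labels = label_waters(target, homolog)
print(labels)
```

In this toy case the first water has a partner 0.5 Å away and the second has none within the assumed cutoff.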
Figure 2. Distributions of different features: (a) atom density; (b) atomic hydrophilicity; (c) atomic hydrophobicity; (d) solvent-accessible surface area; (e) temperature B-factors; (f) mobility. The blue and red curves represent the distributions of the features for the CWMs and FWMs, respectively.
The minimum, maximum, and average values of the features.
| Categories of water molecules | Values | Atom density | Atomic hydrophilicity | Atomic hydrophobicity | SASA (Å²) | BFs | Mobility |
|---|---|---|---|---|---|---|---|
| FWMs | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| FWMs | Max | 7.000 | 0.147 | 0.249 | 84.949 | 99.930 | 8.992 |
| FWMs | Mean | 1.146 | 0.029 | 0.049 | 21.877 | 36.371 | 1.673 |
| CWMs | Min | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.027 |
| CWMs | Max | 12.000 | 0.243 | 0.505 | 40.877 | 94.67 | 12.289 |
| CWMs | Mean | 3.033 | 0.071 | 0.116 | 6.683 | 23.740 | 1.099 |
Averaged performance indices under different feature combinations.
| No. | Combination | | | | | |
|---|---|---|---|---|---|---|
| 1 | ABCDEF∗ | | 0.809 | | | |
| 2 | ABCDE | 0.717 | 0.799 | 0.749 | 0.773 | 0.774 |
| 3 | ABCDF | 0.723 | 0.811 | 0.751 | | |
| 4 | ABCEF | 0.723 | 0.810 | 0.751 | 0.778 | 0.787 |
| 5 | ABDEF | 0.710 | 0.814 | 0.733 | 0.771 | 0.770 |
| 6 | ACDEF | 0.720 | 0.815 | 0.744 | 0.778 | 0.780 |
| 7 | BCDEF | | 0.807 | | 0.778 | 0.786 |
| 8 | ABCD | 0.719 | 0.804 | 0.749 | 0.774 | 0.771 |
| 9 | ABCE | 0.718 | 0.798 | 0.751 | 0.773 | 0.771 |
| 10 | ABCF | | 0.808 | | | 0.786 |
| 11 | ABDE | 0.698 | 0.810 | 0.722 | 0.763 | 0.753 |
| 12 | ABDF | 0.711 | 0.817 | 0.734 | 0.773 | 0.771 |
| 13 | ABEF | 0.708 | 0.807 | 0.735 | 0.769 | 0.771 |
| 14 | ACDE | 0.711 | 0.813 | 0.735 | 0.771 | 0.766 |
| 15 | ACDF | 0.722 | 0.821 | 0.744 | | 0.782 |
| 16 | ACEF | 0.719 | 0.814 | 0.743 | 0.777 | 0.780 |
| 17 | ADEF | 0.703 | 0.847 | 0.714 | 0.775 | 0.745 |
| 18 | BCDE | 0.718 | 0.797 | 0.751 | 0.773 | 0.771 |
| 19 | BCDF | 0.723 | 0.806 | | 0.778 | 0.787 |
| 20 | BCEF | 0.719 | 0.799 | 0.752 | 0.774 | 0.783 |
| 21 | BDEF | 0.710 | 0.811 | 0.735 | 0.771 | 0.773 |
| 22 | CDEF | 0.720 | 0.812 | 0.746 | 0.777 | 0.780 |
| 23 | ABC | 0.718 | 0.802 | 0.749 | 0.774 | 0.771 |
| 24 | ABD | 0.699 | 0.818 | 0.720 | 0.765 | 0.753 |
| 25 | ABE | 0.697 | 0.810 | 0.721 | 0.763 | 0.754 |
| 26 | ABF | 0.709 | 0.810 | 0.735 | 0.770 | 0.773 |
| 27 | ACD | 0.712 | 0.816 | 0.735 | 0.773 | 0.762 |
| 28 | ACE | 0.707 | 0.807 | 0.734 | 0.768 | 0.760 |
| 29 | ACF | 0.720 | 0.819 | 0.743 | | 0.783 |
| 30 | ADE | 0.679 | 0.827 | 0.696 | 0.755 | 0.705 |
| 31 | ADF | 0.703 | 0.844 | 0.715 | 0.774 | 0.745 |
| 32 | AEF | 0.701 | 0.836 | 0.716 | 0.771 | 0.739 |
| 33 | BCD | 0.718 | 0.799 | 0.750 | 0.773 | 0.773 |
| 34 | BCE | 0.713 | 0.785 | 0.751 | 0.767 | 0.767 |
| 35 | BCF | 0.716 | 0.796 | 0.749 | 0.771 | 0.777 |
| 36 | BDE | 0.699 | 0.811 | 0.723 | 0.764 | 0.754 |
| 37 | BDF | 0.712 | 0.816 | 0.735 | 0.773 | 0.773 |
| 38 | BEF | 0.705 | 0.798 | 0.735 | 0.765 | 0.771 |
| 39 | CDE | 0.712 | 0.808 | 0.738 | 0.771 | 0.767 |
| 40 | CDF | 0.722 | 0.819 | 0.745 | | 0.783 |
| 41 | CEF | 0.721 | 0.817 | 0.745 | | 0.780 |
| 42 | DEF | 0.703 | 0.835 | 0.718 | 0.772 | 0.748 |
| 43 | AB | 0.695 | 0.815 | 0.717 | 0.763 | 0.740 |
| 44 | AC | 0.708 | 0.820 | 0.729 | 0.772 | 0.760 |
| 45 | AD | 0.673 | 0.848 | 0.684 | 0.757 | 0.691 |
| 46 | AE | 0.676 | 0.846 | 0.688 | 0.759 | 0.693 |
| 47 | AF | 0.700 | 0.847 | 0.711 | 0.773 | 0.739 |
| 48 | BC | 0.707 | 0.789 | 0.742 | 0.764 | 0.755 |
| 49 | BD | 0.699 | 0.815 | 0.721 | 0.765 | 0.756 |
| 50 | BE | 0.694 | 0.802 | 0.722 | 0.759 | 0.750 |
| 51 | BF | 0.702 | 0.793 | 0.734 | 0.761 | 0.766 |
| 52 | CD | 0.713 | 0.817 | 0.736 | 0.774 | 0.764 |
| 53 | CE | 0.704 | 0.795 | 0.736 | 0.764 | 0.759 |
| 54 | CF | 0.713 | 0.813 | 0.737 | 0.772 | 0.773 |
| 55 | DE | 0.677 | 0.830 | 0.693 | 0.755 | 0.703 |
| 56 | DF | 0.702 | 0.840 | 0.716 | 0.772 | 0.746 |
| 57 | EF | 0.697 | 0.828 | 0.715 | 0.767 | 0.740 |
| 58 | A | 0.664 | | 0.667 | 0.759 | 0.662 |
| 59 | B | 0.674 | 0.796 | 0.703 | 0.745 | 0.729 |
| 60 | C | 0.691 | 0.811 | 0.714 | 0.759 | 0.738 |
| 61 | D | 0.670 | 0.842 | 0.684 | 0.754 | 0.693 |
| 62 | E | 0.670 | 0.838 | 0.685 | 0.754 | 0.690 |
| 63 | F | 0.685 | 0.842 | 0.697 | 0.763 | 0.726 |
∗A, B, C, D, E, and F represent atom density, mobility, temperature B-factors, atomic hydrophilicity, atomic hydrophobicity, and SASA, respectively. ∗∗In each category, the best values are highlighted in bold, allowing a deviation of ±0.001. For example, 0.725 and 0.724 are both highlighted as the best performance.
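The 63 rows of the table correspond to every non-empty subset of the six features. As a minimal sketch, the enumeration can be reproduced with the standard library; the function name `feature_combinations` and the dictionary `FEATURES` are illustrative choices, and the ordering (largest subsets first, lexicographic within each size) simply mirrors the table.

```python
from itertools import combinations

# Letter codes as defined in the table footnote.
FEATURES = {
    "A": "atom density",
    "B": "mobility",
    "C": "temperature B-factors",
    "D": "atomic hydrophilicity",
    "E": "atomic hydrophobicity",
    "F": "SASA",
}


def feature_combinations(letters="ABCDEF"):
    """Return all non-empty subsets of `letters`, largest subsets first."""
    combos = []
    for size in range(len(letters), 0, -1):
        # combinations() yields subsets of a given size in lexicographic order.
        combos.extend("".join(subset) for subset in combinations(letters, size))
    return combos


combos = feature_combinations("".join(FEATURES))
print(len(combos))            # number of candidate feature combinations
print(combos[:3], combos[-1])
```

Each combination would then be scored with the chosen classifiers and performance indices; that evaluation step is not shown here.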
Performance comparison of seven machine learning models in identifying water molecules in the binding sites of proteins using the optimal feature combination.
| Prediction models | ACC | SN | PPV | | AUC | Average performance∗∗ |
|---|---|---|---|---|---|---|
| SVM | 0.809 | 0.889 | 0.793 | 0.838 | 0.880 | 0.842 |
| KNN | 0.805 | 0.873 | 0.797 | 0.833 | 0.890 | 0.840 |
| DT | 0.805 | 0.838 | | 0.827 | 0.900 | 0.837 |
| LR | 0.795 | 0.831 | 0.807 | 0.819 | 0.870 | 0.824 |
| DA | 0.793 | 0.836 | 0.801 | 0.818 | 0.870 | 0.824 |
| NB | 0.798 | 0.828 | 0.812 | 0.820 | 0.890 | 0.830 |
| EL | | | 0.803 | | | |
∗Bold values indicate the highest performance values. ∗∗For each model, the average performance is the mean of the values of the five criteria.
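The averaging described in the footnote can be checked against the tabulated SVM row. The sketch below also tests an inference that is not stated in the source: that the criterion listed between PPV and AUC is the harmonic mean of SN and PPV. The function names and the key `f` are illustrative.

```python
def f_measure(sn, ppv):
    """Harmonic mean of sensitivity and positive predictive value.

    Assumption: this is the criterion listed between PPV and AUC; the
    tabulated values are consistent with it, but the source does not say so.
    """
    return 2 * sn * ppv / (sn + ppv)


def average_performance(acc, sn, ppv, f, auc):
    """Mean of the five criteria, as described in the table footnote."""
    return (acc + sn + ppv + f + auc) / 5


# SVM row as tabulated.
svm = {"acc": 0.809, "sn": 0.889, "ppv": 0.793, "f": 0.838, "auc": 0.880}

svm_avg = round(average_performance(**svm), 3)
svm_f = round(f_measure(svm["sn"], svm["ppv"]), 3)
print(svm_avg, svm_f)
```

Both rounded results agree with the SVM row (0.842 and 0.838).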
Figure 3. (a) All water molecules in the binding site of Chain B of the crystal structure 3skh, where the yellow and cyan spheres represent the CWMs and FWM, respectively. (b) The predicted results using the EL model, where the yellow spheres represent the correctly identified CWMs, and the magenta sphere represents the mispredicted FWM. Note that the ligands of Chain B in these conformations are shown as orange ball-and-stick models.