Wei Wang1,2, Keliang Li1, Hehe Lv1, Hongjun Zhang3, Shixun Wang1, Junwei Huang1.
Abstract
The analysis and prediction of small-molecule binding sites are very important for drug discovery and drug design. Traditional experimental methods for detecting small-molecule binding sites are usually expensive and time consuming, and the tools for studying a single species of small molecule are equally inefficient. In recent years, several algorithms for predicting protein-small molecule binding sites have been developed based on the geometric and sequence characteristics of proteins. In this paper, we propose SmoPSI, a classification model based on the XGBoost algorithm that predicts the binding sites of small molecules from protein sequence information. The model achieved an average AUC of 0.918 and an average ACC of 0.913. The experimental results demonstrate that our method achieves high performance and outperforms many existing predictors. In addition, we analyzed the binding residues and nonbinding residues and found that the PSSM, hydrophilicity, hydrophobicity, charge, and hydrogen bonding have clearly different effects on the binding-site predictions.
Year: 2019 PMID: 31814842 PMCID: PMC6877956 DOI: 10.1155/2019/1926156
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Detailed compositions of the 14 types of small molecules for binding sites.
| Small molecule type (HET ID) | Name (HETNAM) | Formula | Number of sequences | Number of binding domain residues | Number of nonbinding domain residues |
|---|---|---|---|---|---|
| ACO | Acetyl coenzyme A | C23 H38 N7 O17 P3 S | 167 | 86 | 4256 |
| ADP | Adenosine-5′-diphosphate | C10 H15 N5 O10 P2 | 807 | 576 | 20406 |
| ANP | Phosphoaminophosphonic acid-adenylate ester | C10 H17 N6 O12 P3 | 386 | 223 | 10073 |
| ATP | Adenosine-5′-triphosphate | C10 H16 N5 O13 P3 | 549 | 574 | 13700 |
| COA | Coenzyme A | C21 H36 N7 O16 P3 S | 350 | 245 | 8855 |
| FAD | Flavin-adenine-dinucleotide | C27 H33 N9 O15 P2 | 876 | 1417 | 21359 |
| FMN | Flavin mononucleotide | C17 H21 N4 O9 P | 425 | 552 | 10498 |
| GDP | Guanosine-5′-diphosphate | C10 H15 N5 O11 P2 | 214 | 294 | 5270 |
| GNP | Phosphoaminophosphonic acid-guanylate ester | C10 H17 N6 O13 P3 | 187 | 344 | 4518 |
| NAD | Nicotinamide-adenine-dinucleotide | C21 H27 N7 O14 P2 | 1053 | 1305 | 26073 |
| NAP | NADP nicotinamide-adenine-dinucleotide phosphate | C21 H28 N7 O17 P3 | 529 | 806 | 12948 |
| NDP | NADPH dihydro-nicotinamide-adenine-dinucleotide | C21 H30 N7 O17 P3 | 334 | 462 | 8222 |
| SAH | S-adenosyl-L-homocysteine | C14 H20 N6 O5 S | 465 | 371 | 11719 |
| SAM | S-adenosylmethionine | C15 H22 N6 O5 S | 240 | 186 | 6054 |
Figure 1. Sliding-window sampling of protein sequences. Each residue in the protein sequence is taken in turn as the center of a window of size w that extends (w − 1)/2 residues on either side, and the matrix extracted from that window is the feature of the central residue. Sliding the window along the entire protein sequence yields the features for every residue.
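The sliding-window sampling described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sliding_windows`, the `pad_value` parameter, and the choice to pad positions beyond the sequence termini with a constant are all assumptions, since the record does not say how terminal residues are handled.

```python
def sliding_windows(features, w, pad_value=0.0):
    """Return one flattened window (w * d values) per residue.

    features: list of per-residue feature rows (each a list of d numbers),
              for example the rows of a PSSM.
    w:        odd window size; the window spans (w - 1) / 2 residues
              on each side of the central residue.
    """
    if w % 2 == 0:
        raise ValueError("window size w must be odd")
    if not features:
        return []
    half = (w - 1) // 2
    d = len(features[0])
    pad_row = [pad_value] * d
    # Assumption: positions beyond either terminus are filled with pad_row.
    padded = [pad_row] * half + list(features) + [pad_row] * half
    windows = []
    for i in range(len(features)):
        block = padded[i:i + w]  # w rows centered on residue i
        windows.append([x for row in block for x in row])
    return windows


# Four residues with one feature each, window size w = 3.
example = sliding_windows([[1.0], [2.0], [3.0], [4.0]], w=3)
print(example)
# [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 0.0]]
```

With a window of w = 15, as in Figure 4, and d features per residue, each residue would be represented by a vector of 15 × d values under this scheme.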
Figure 2. Distribution of the 20 amino acids between binding residues and nonbinding residues in the protein sequences.
Figure 3. Distribution of four physicochemical properties of binding domain residues and nonbinding domain residues in the protein sequences.
Figure 4. Relationship between the sampling window size w and prediction performance. (a) As the length of the sampling window increases, the AUC values for the 14 small-molecule types increase, reach a peak at w = 15, and then decrease. (b) As the length of the sampling window increases, the ACC values for the 14 small-molecule types increase, reach a peak at w = 15, and then decrease.
Figure 5. Feature significance scores calculated using (a) mean decrease accuracy and (b) a random forest-based feature importance scoring system.
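The idea behind a mean decrease accuracy score can be illustrated with a generic permutation sketch: shuffle one feature column at a time and record how much the accuracy drops. This is not the authors' procedure (random forest implementations typically compute it on out-of-bag samples); the function names, the toy threshold model, and the synthetic data below are all hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which model(row) equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def mean_decrease_accuracy(model, X, y, n_repeats=10, seed=0):
    """Average accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    scores = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        scores.append(sum(drops) / n_repeats)
    return scores


# Synthetic data: the label depends only on feature 0; feature 1 is noise.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def toy_model(row):
    """Hypothetical fixed classifier standing in for a trained model."""
    return int(row[0] > 0.5)


scores = mean_decrease_accuracy(toy_model, X, y)
print(scores)  # large drop for feature 0, no drop for feature 1
```

A feature whose shuffling leaves accuracy unchanged contributes nothing to the model's predictions, which is the sense in which a larger score marks a more significant feature.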
Classification results of the SmoPSI model for the 14 small-molecule types.
| Small molecules | ACC | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|
| ACO | 0.900 | 0.886 | 0.921 | 0.904 | 0.891 |
| ADP | 0.910 | 0.891 | 0.934 | 0.912 | 0.927 |
| ANP | 0.900 | 0.863 | 0.940 | 0.900 | 0.897 |
| ATP | 0.927 | 0.914 | 0.940 | 0.927 | 0.935 |
| COA | 0.928 | 0.918 | 0.940 | 0.929 | 0.931 |
| FAD | 0.903 | 0.875 | 0.932 | 0.903 | 0.917 |
| FMN | 0.926 | 0.911 | 0.940 | 0.925 | 0.932 |
| GDP | 0.929 | 0.917 | 0.940 | 0.928 | 0.937 |
| GNP | 0.926 | 0.914 | 0.940 | 0.927 | 0.925 |
| NAD | 0.916 | 0.898 | 0.937 | 0.917 | 0.922 |
| NAP | 0.915 | 0.893 | 0.940 | 0.916 | 0.912 |
| NDP | 0.906 | 0.875 | 0.940 | 0.907 | 0.908 |
| SAH | 0.875 | 0.830 | 0.930 | 0.877 | 0.894 |
| SAM | 0.926 | 0.914 | 0.940 | 0.927 | 0.920 |
| Average | 0.913 | 0.893 | 0.937 | 0.914 | 0.918 |
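As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it for a few rows; the helper name `f1_score` is illustrative. Because the published precision and recall are themselves rounded to three decimals, a recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "ADP": (0.891, 0.934, 0.912),
    "ATP": (0.914, 0.940, 0.927),
    "Average": (0.893, 0.937, 0.914),
}

for name, (p, r, reported) in rows.items():
    print(name, round(f1_score(p, r), 3), "reported:", reported)
```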
Comparison with existing predictors on the training sets of the three ligand types they have in common.
| Predictor | Small molecule | ACC | Recall | AUC |
|---|---|---|---|---|
| SmoPSI | ADP | 0.910 | 0.934 | 0.927 |
| SmoPSI | ATP | 0.927 | 0.940 | 0.935 |
| SmoPSI | GDP | 0.929 | 0.940 | 0.937 |
| SmoPSI | Average | 0.922 | 0.938 | 0.933 |
| TargetS | ADP | 0.972 | 0.561 | 0.907 |
| TargetS | ATP | 0.962 | 0.484 | 0.887 |
| TargetS | GDP | 0.972 | 0.639 | 0.908 |
| TargetS | Average | 0.969 | 0.561 | 0.901 |
| EC-RUS | ADP | 0.973 | 0.622 | 0.939 |
| EC-RUS | ATP | 0.964 | 0.586 | 0.912 |
| EC-RUS | GDP | 0.976 | 0.672 | 0.937 |
| EC-RUS | Average | 0.971 | 0.627 | 0.929 |
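When reading the comparison above, it helps to remember that binding residues are a small minority, so ACC and recall can diverge sharply. The sketch below uses the residue counts listed in the composition table for ADP, ATP, and GDP to compute the accuracy of a trivial predictor that labels every residue as the majority class. It is only an illustration of class imbalance: the function name is hypothetical, and the record does not state that the compared predictors were evaluated on exactly these counts.

```python
def majority_baseline(n_binding, n_nonbinding):
    """Accuracy and binding-site recall of always predicting the majority class."""
    total = n_binding + n_nonbinding
    acc = max(n_binding, n_nonbinding) / total
    # Recall on binding residues is 1 only if "binding" is the majority class.
    recall = 1.0 if n_binding > n_nonbinding else 0.0
    return acc, recall


# (binding residues, nonbinding residues) from the composition table.
counts = {"ADP": (576, 20406), "ATP": (574, 13700), "GDP": (294, 5270)}

for name, (pos, neg) in counts.items():
    acc, recall = majority_baseline(pos, neg)
    print(f"{name}: ACC={acc:.3f}, recall={recall:.1f}")
```

Under these counts, a predictor that never reports a binding site would already score an ACC above 0.94 with a recall of 0, which is why recall and AUC are reported alongside ACC.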