Yusuke Komiyama, Masaki Banno, Kokoro Ueki, Gul Saad, Kentaro Shimizu.
Abstract
MOTIVATION: Predictive tools that model protein-ligand binding on demand are needed to promote ligand research in an innovative drug-design environment. However, it takes considerable time and effort to develop predictive tools that can be applied to individual ligands. An automated production pipeline that can rapidly and efficiently develop user-friendly protein-ligand binding predictive tools would be useful.
Year: 2015 PMID: 26545824 PMCID: PMC4803387 DOI: 10.1093/bioinformatics/btv593
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The pipeline workflow for automatic generation of ligand-binding site predictive tools. A user specifies the name of the ligand (chemical compound) for which a binding-site prediction tool is desired. The pipeline constructs the dataset for training, extracts the sequence features, and automatically performs machine learning and parameter tuning. As the backend of the pipeline system, we developed the PLBSP Residue database based on Semantic Web (linked open data) technologies. The pipeline output is a protein–ligand-binding prediction tool.
Fig. 2. Construction process of the PLBSP Residue database. We use ligand-binding PDB structures determined by X-ray crystallography. Ligand-binding residues are defined as those that contain at least one atom within n Å of any ligand atom. The atomic distance is calculated for all pairs of ligand atoms and ligand-binding residue atoms. Ligands connected by covalent bonds are excluded. We modeled the graph database schema of PLBSP in RDF using a public RDF dataset and ontology from PDB, PDB CCD and UniProt. Ligand information is added from RDF ChEBI, RDF SIFTS and the UniChem API.
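The distance criterion in the caption above can be illustrated with a minimal sketch. It assumes atoms are given as plain (x, y, z) tuples in Å and treats "within n Å" as a less-than-or-equal test; the function name `binding_residues`, the residue labels and the coordinates are hypothetical and are not taken from the PLBSP code base.

```python
from math import dist


def binding_residues(residue_atoms, ligand_atoms, cutoff):
    """Return IDs of residues with at least one atom within `cutoff` of any ligand atom.

    residue_atoms: dict mapping a residue ID to a list of (x, y, z) coordinates.
    ligand_atoms:  list of (x, y, z) coordinates of the ligand.
    cutoff:        the distance threshold n, in the same unit as the coordinates.
    """
    hits = []
    for res_id, atoms in residue_atoms.items():
        # Check every residue-atom / ligand-atom pair; one close pair is enough.
        if any(dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            hits.append(res_id)
    return hits


# Toy coordinates (hypothetical): one residue near the ligand, one far away.
residues = {
    "HIS42": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "GLY43": [(10.0, 0.0, 0.0)],
}
ligand = [(3.0, 0.0, 0.0)]
print(binding_residues(residues, ligand, cutoff=2.0))  # prints ['HIS42']
```

The all-pairs loop mirrors the caption's description. For real PDB entries, a spatial index would usually replace it for speed.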
Fig. 3. Feature extraction and machine learning. We generate a PSSM profile by performing two iterations of PSI-BLAST against the NCBI non-redundant (nr) protein database. The feature vector is the PSSM profile of w consecutive residues, centered on the target residue for which the prediction is being made. The dimension of the feature vector is w × 21, where 21 is the number of amino acid types (20) plus one N- or C-terminal spacer. The feature vector is used as input to the three machine learning algorithms.
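The windowed feature vector described in the caption above can be sketched as follows. The caption does not say how the terminal spacer is encoded, so this sketch assumes a 21st column that is 1 for positions beyond either terminus and 0 otherwise, with zero scores as padding. The function name `window_features` and the toy PSSM are hypothetical.

```python
def window_features(pssm, center, w):
    """Flatten the PSSM rows of the w residues centered on `center` into w * 21 values.

    pssm:   list of rows, one per residue, each holding 20 substitution scores.
    center: index of the target residue.
    w:      odd window size.
    """
    if w % 2 == 0:
        raise ValueError("w must be odd so that the window has a center residue")
    half = w // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])  # 20 scores of a real residue
            features.append(0)          # spacer flag off (assumed encoding)
        else:
            features.extend([0] * 20)   # padding beyond the N- or C-terminus
            features.append(1)          # spacer flag on (assumed encoding)
    return features


# Toy PSSM (hypothetical): 5 residues, residue i has all 20 scores equal to i + 1.
toy_pssm = [[i + 1] * 20 for i in range(5)]
vector = window_features(toy_pssm, center=0, w=9)
print(len(vector))  # prints 189, i.e. 9 * 21
```

With w = 9 and the first residue as the target, the four leading window positions fall before the N-terminus and are filled by the spacer.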
Numbers of ligand types, ligand-binding proteins and ligand-binding residues in the dataset extracted from the PLBSP Residue database
| Ligand name | No. of ligand types | No. of ligand-binding proteins | No. of ligand-binding residues |
|---|---|---|---|
| Purine nucleotide | 58 | 521 | 10,564 |
| Lipid | 117 | 224 | 4,737 |
| Fe | 2 | 130 | 1,005 |
| Zn | 2 | 576 | 5,128 |
| Mn | 1 | 230 | 1,772 |
| FAD | 1 | 123 | 4,168 |
| AMP | 1 | 54 | 1,013 |
| SF4 | 1 | 71 | 1,392 |
Performance of the SVM-based prediction tools generated for the eight ligand-binding proteins listed in Table 1
| Ligand name | Sens. (%) | Spec. (%) | MCC | AUC |
|---|---|---|---|---|
| Purine nucleotide | 37.4 | 98.0 | 0.484 | 0.850 |
| Lipid | 24.0 | 97.4 | 0.331 | 0.798 |
| Fe | 49.3 | 99.3 | 0.615 | 0.904 |
| Zn | 40.6 | 99.2 | 0.555 | 0.835 |
| Mn | 34.7 | 99.1 | 0.484 | 0.869 |
| FAD | 43.2 | 96.8 | 0.630 | 0.906 |
| AMP | 20.8 | 98.3 | 0.320 | 0.808 |
| SF4 | 75.5 | 97.7 | 0.781 | 0.952 |
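For reference, the sensitivity, specificity and MCC columns follow the standard confusion-matrix definitions, sketched below. The counts in the example are invented for illustration and are not the paper's data. The function name `binary_metrics` is hypothetical, and the sketch assumes that both classes are present in the test set.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of binding residues recovered
    specificity = tn / (tn + fp)  # fraction of non-binding residues rejected
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Invented counts for an imbalanced test set: 100 binding, 900 non-binding residues.
sens, spec, mcc = binary_metrics(tp=40, fp=20, tn=880, fn=60)
print(f"Sens. {sens:.1%}  Spec. {spec:.1%}  MCC {mcc:.3f}")
```

In the invented example the negative class dominates, so specificity is high while sensitivity stays modest. MCC summarizes both in a single value.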
Fig. 4. Predictive accuracy of predictors automatically generated using the pipeline workflow. Accuracies of the prediction tools are automatically calculated and are indicated as AUC values. The x-axis shows the false positive rate (FPR) and the y-axis shows the true positive rate (TPR). The ROC plot shows curves for the eight ligand-binding proteins listed in Tables 1 and 2.
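The FPR/TPR curve and its AUC can be computed from per-residue scores as in the sketch below. This is a generic threshold sweep with trapezoidal integration, not necessarily the routine used by the pipeline. The function names `roc_points` and `auc` and the toy labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold step by step.

    labels: 1 for binding residues, 0 for non-binding residues.
    scores: classifier outputs, higher meaning more likely to bind.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Residues with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example (hypothetical): three binding and three non-binding residues.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(toy_labels, toy_scores)), 3))  # about 0.889 (8 of 9 pairs ranked correctly)
```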
Average AUC and cumulative run time at each GA generation when the GA is used as the parameter-tuning method for predicting iron cation-binding proteins
| GA generation | Run time (s) | SVM param. cost | SVM param. sigma | Window size | SVM AUC | Std error (AUC) (%) |
|---|---|---|---|---|---|---|
| 0 | | 17.78 | 1.99 | 17 | 0.691 | 1.76 |
| 1 | 1,753 | 0.54 | 1.26 | 5 | 0.827 | 1.22 |
| 2 | 2,134 | 29.42 | 1.29 | 9 | 0.799 | 1.36 |
| 3 | 2,217 | 29.42 | 0.37 | 9 | 0.881 | 1.04 |
| 4 | 2,255 | 13.60 | 2.79 | 9 | 0.741 | 1.57 |
| 5 | 2,333 | 13.60 | 1.83 | 9 | 0.766 | 1.42 |
| 10 | 2,584 | 25.23 | 0.32 | 9 | 0.894 | 0.94 |
| 20 | 3,122 | 25.23 | 0.32 | 9 | 0.894 | 0.94 |
| 30 | 3,643 | 25.23 | 0.32 | 9 | 0.894 | 0.94 |
| 40 | 3,964 | 25.23 | 0.32 | 9 | 0.894 | 0.94 |
| 50 | 4,271 | 25.23 | 0.32 | 9 | 0.894 | 0.94 |
| 60 | 4,674 | 25.23 | 0.32 | 9 | 0.894 | 0.94 |
| 70 | 5,013 | 25.23 | 0.32 | 9 | 0.894 | 0.94 |
| 80 | 5,389 | 25.23 | 0.32 | 9 | 0.894 | 0.94 |
| 90 | 5,507 | 17.84 | 0.32 | 9 | 0.894 | 0.94 |
| 100 | 5,960 | 17.84 | 0.15 | 9 | 0.934 | 0.47 |
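The source does not describe the GA operators, so the following is only a generic sketch of how a GA could tune the three parameters in the table above (SVM cost, SVM sigma and window size). The search bounds, the operators, the elitism step and the `toy_fitness` function are all assumptions made for illustration. In the pipeline, the fitness would presumably be the measured AUC of a trained SVM.

```python
import random

# Hypothetical search space; the bounds used by the authors are not given in the source.
BOUNDS = {"cost": (0.1, 30.0), "sigma": (0.01, 3.0)}
WINDOWS = [5, 7, 9, 11, 13, 15, 17]


def random_individual(rng):
    return {"cost": rng.uniform(*BOUNDS["cost"]),
            "sigma": rng.uniform(*BOUNDS["sigma"]),
            "window": rng.choice(WINDOWS)}


def crossover(a, b, rng):
    # Uniform crossover: each parameter is inherited from one parent at random.
    return {key: rng.choice((a[key], b[key])) for key in a}


def mutate(ind, rng, rate=0.2):
    child = dict(ind)
    for key, (lo, hi) in BOUNDS.items():
        if rng.random() < rate:
            # Gaussian perturbation, clipped to the allowed range.
            child[key] = min(hi, max(lo, child[key] + rng.gauss(0, 0.1 * (hi - lo))))
    if rng.random() < rate:
        child["window"] = rng.choice(WINDOWS)
    return child


def run_ga(fitness, generations=20, pop_size=12, seed=0):
    """Return (best individual, best fitness recorded at each generation)."""
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]  # truncation selection
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - 1)]
        pop = [pop[0]] + children       # elitism: the current best always survives
    return max(pop, key=fitness), history


def toy_fitness(ind):
    # Stand-in for a measured AUC; it peaks at an arbitrary point of the search space.
    return (-(ind["sigma"] - 0.3) ** 2
            - 0.001 * (ind["cost"] - 20.0) ** 2
            - 0.01 * abs(ind["window"] - 9))


best, history = run_ga(toy_fitness, generations=15, pop_size=10, seed=1)
print(best, history[-1])
```

Because of the elitism step, the best fitness in this sketch never decreases between generations. The AUC in the table above drops between some early generations, so the scheme used by the authors evidently differs in this respect.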
Average AUCs of three machine learning methods for grid search parameter optimization
| Ligand name | Algorithm | Sensitivity (%) | MCC | AUC |
|---|---|---|---|---|
| Purine nucleotide | SVM | 44.5 | 0.554 | 0.869 |
| | NN | 40.3 | 0.374 | 0.765 |
| | RF | 27.3 | 0.445 | 0.858 |
| Lipid | SVM | 45.7 | 0.516 | 0.863 |
| | NN | 42.6 | 0.338 | 0.753 |
| | RF | 24.8 | 0.418 | 0.851 |
| Iron cation | SVM | 61.2 | 0.718 | 0.940 |
| | NN | 59.9 | 0.641 | 0.894 |
| | RF | 46.9 | 0.635 | 0.940 |
A window size of nine was used in all cases. For SVM, sigma was 0.1 and cost was 1.0. For NN, the number of nodes was 25 and the learning rate was 0.1. For RF, the number of trees was 1501 and the sampling size per tree was 20. For each ligand, the test dataset was 15% of the full dataset, randomly sampled and removed from it; the remaining 85% was used for training.
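The 15% hold-out described in the note above can be sketched as follows. The source does not say whether sampling was done per residue or per protein, nor which random generator was used. This sketch simply samples items uniformly with a fixed seed, and the function name `holdout_split` is hypothetical.

```python
import random


def holdout_split(items, test_fraction=0.15, seed=0):
    """Randomly remove `test_fraction` of the items as a test set; the rest is for training."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = round(len(items) * test_fraction)
    test_idx = set(indices[:n_test])
    # Preserve the original order within each subset.
    train = [item for i, item in enumerate(items) if i not in test_idx]
    test = [item for i, item in enumerate(items) if i in test_idx]
    return train, test


samples = list(range(100))  # stand-in for feature vectors with labels
train_set, test_set = holdout_split(samples)
print(len(train_set), len(test_set))  # prints 85 15
```

Fixing the seed keeps the test set identical across runs, so different algorithms or tuning methods can be compared on the same held-out data.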
Average AUCs of three machine learning methods for GA parameter optimization
| Ligand name | Algorithm | Parameters | Sensitivity (%) | MCC | AUC |
|---|---|---|---|---|---|
| Purine nucleotide | SVM | Sigma = 0.16, Cost = 22.5 | 41.3 | 0.213 | 0.834 |
| | NN | #Nodes = 44, Learning rate = 3.26 | 20.8 | 0.295 | 0.636 |
| | RF | #Trees = 2023, #Iterations = 20 | 25.9 | 0.450 | 0.849 |
| Lipid | SVM | Sigma = 0.78, Cost = 25.27 | 13.5 | 0.289 | 0.803 |
| | NN | #Nodes = 49, Learning rate = 0.66 | 44.6 | 0.381 | 0.775 |
| | RF | #Trees = 1611, #Iterations = 3 | 19.5 | 0.387 | 0.854 |
| Iron cation | SVM | Sigma = 0.32, Cost = 17.66 | 45.6 | 0.654 | 0.911 |
| | NN | #Nodes = 46, Learning rate = 3.26 | 32.0 | 0.458 | 0.803 |
| | RF | #Trees = 991, #Iterations = 29 | 35.0 | 0.579 | 0.943 |
The best parameters found by the GA are listed in this table. For each ligand, the test dataset was 15% of the full dataset and was the same as that used in Table 4.