Nobuyoshi Sugaya, Kazuyoshi Ikeda.
Abstract
BACKGROUND: Protein-protein interactions (PPIs) are challenging but attractive targets for small-molecule drugs in the treatment of human diseases. As PPI data accumulate rapidly, there is a great need for a methodology that can efficiently select drug-target PPIs by holistically assessing their druggability. To address this need, we propose a novel approach based on a supervised machine-learning method, the support vector machine (SVM).
Year: 2009 PMID: 19703312 PMCID: PMC2739204 DOI: 10.1186/1471-2105-10-263
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Schematic representation of the SVM-based method for assessing the druggability of PPIs. For details, see the text.
Table 1. Positive set PPIs for the SVM-based method. a
| PPI | | PPI | |
| Protein 1 | Protein 2 | Protein 1 | Protein 2 |
| ARF1 | CYTH1 b | GRB2 | EGFR |
| ARF1 | CYTH1 b | GRB2 | MET |
| ARF1 | CYTH2 c | HOXB1 | PBX1 |
| ARF1 | CYTH2 c | IL1B | IL1R1 |
| BCL2 | BAK1 | IL2 | IL2RA |
| BCL2L1 | BAK1 | MAGI3 | PTEN |
| BIRC4 | CASP3 | MDM2 | TP53 |
| BIRC4 | CASP9 | PIK3R1 | PDGFRB |
| BIRC4 | DIABLO | RAC1 | TIAM1 |
| BIRC5 | BIRC5 | RAC1 | TRIO |
| CALM1 | CAMK1 | STAT3 | STAT3 |
| CALM1 | MYLK | TCF7L1 | CTNNB1 |
| CALM1 | PDE1A | TCF7L2 | CTNNB1 |
| CD4 | HLA-DQB1 | THRB | NCOA2 |
| ESR1 | NCOA2 | TNF | TNF |
| FKBP1A | TGFBR1 | ZAP70 | CD247 |
a For details, see Additional file 1: Table S1.
b, c The ligand-binding pockets considered differ from each other.
Table 2. Attributes of the PPIs used in the SVM-based method. a
| No. | Attribute |
| Structural information | |
| 1 | Pocket volume |
| 2 | Accessible surface area of pocket |
| 3 | Percentage of accessible surface area of pocket to that of total surface of protein |
| 4 | Pocket compactness |
| 5 | Pocket planarity |
| 6 | |
| 7 | Pocket narrowness |
| 8 | |
| 9 | Ratio of Ala frequency on pocket surface to that on total surface b |
| 10 | Ratio of Cys frequency on pocket surface to that on total surface b |
| 11 | Ratio of Asp frequency on pocket surface to that on total surface b |
| 12 | Ratio of Glu frequency on pocket surface to that on total surface b |
| 13 | Ratio of Phe frequency on pocket surface to that on total surface b |
| 14 | Ratio of Gly frequency on pocket surface to that on total surface b |
| 15 | Ratio of His frequency on pocket surface to that on total surface b |
| 16 | Ratio of Ile frequency on pocket surface to that on total surface b |
| 17 | Ratio of Lys frequency on pocket surface to that on total surface b |
| 18 | Ratio of Leu frequency on pocket surface to that on total surface b |
| 19 | Ratio of Met frequency on pocket surface to that on total surface b |
| 20 | Ratio of Asn frequency on pocket surface to that on total surface b |
| 21 | Ratio of Pro frequency on pocket surface to that on total surface b |
| 22 | Ratio of Gln frequency on pocket surface to that on total surface b |
| 23 | Ratio of Arg frequency on pocket surface to that on total surface b |
| 24 | Ratio of Ser frequency on pocket surface to that on total surface b |
| 25 | Ratio of Thr frequency on pocket surface to that on total surface b |
| 26 | Ratio of Val frequency on pocket surface to that on total surface b |
| 27 | Ratio of Trp frequency on pocket surface to that on total surface b |
| 28 | Ratio of Tyr frequency on pocket surface to that on total surface b |
| Drug and chemical information | |
| 29 | Number of small chemical drugs ( |
| 30 | Number of small chemical drugs ( |
| 31 | Number of biotech drugs ( |
| 32 | Number of biotech drugs ( |
| 33 | Number of approved drugs ( |
| 34 | Number of approved drugs ( |
| 35 | Number of experimental drugs ( |
| 36 | Number of experimental drugs ( |
| 37 | Number of investigational drugs ( |
| 38 | Number of investigational drugs ( |
| 39 | Number of nutraceutical drugs ( |
| 40 | Number of nutraceutical drugs ( |
| 41 | Number of withdrawn drugs ( |
| 42 | Number of withdrawn drugs ( |
| 43 | Number of illicit drugs ( |
| 44 | Number of illicit drugs ( |
| Functional information | |
| 45 | Both proteins are related to OMIM-registered diseases (1) or not (0) |
| 46 | Number of interacting proteins ( |
| 47 | Number of interacting proteins ( |
| 48 | Number of biological pathways in which either protein is involved ( |
| 49 | Number of biological pathways in which either protein is involved ( |
| 50 | Number of biological pathways in which both interacting proteins are involved |
| 51 | Identity scores of the GO terms in the Cellular Component category |
| 52 | Identity scores of the GO terms in the Molecular Function category |
| 53 | Identity scores of the GO terms in the Biological Process category |
| 54 | Number of paralogs in the KEGG ( |
| 55 | Number of paralogs in the KEGG ( |
| 56 | Number of paralogs in the PIRSF ( |
| 57 | Number of paralogs in the PIRSF ( |
| 58 | Number of gene-expressing health states ( |
| 59 | Number of gene-expressing health states ( |
| 60 | Number of health states in which both genes are expressed |
| 61 | Number of gene-expressing body sites ( |
| 62 | Number of gene-expressing body sites ( |
| 63 | Number of body sites in which both genes are expressed |
| 64 | Number of gene-expressing developmental stages ( |
| 65 | Number of gene-expressing developmental stages ( |
| 66 | Number of developmental stages in which both genes are expressed |
| 67 | Similarity scores of gene expression profiles in the Health State category |
| 68 | Similarity scores of gene expression profiles in the Body Sites category |
| 69 | Similarity scores of gene expression profiles in the Developmental Stage category |
a For details of the definitions and calculation methods, see Additional file 4: Supplementary Methods.
b Abbreviations: Ala, alanine; Cys, cysteine; Asp, aspartic acid; Glu, glutamic acid; Phe, phenylalanine; Gly, glycine; His, histidine; Ile, isoleucine; Lys, lysine; Leu, leucine; Met, methionine; Asn, asparagine; Pro, proline; Gln, glutamine; Arg, arginine; Ser, serine; Thr, threonine; Val, valine; Trp, tryptophan; Tyr, tyrosine.
d Defined as the larger of the two values for the two interacting proteins in a PPI.
e Defined as the smaller of the two values for the two interacting proteins in a PPI.
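Footnotes d and e describe how a per-protein count becomes two pair-level attributes. A minimal sketch of that step follows; the function name and the example counts are hypothetical and are not taken from the paper.

```python
def pair_attributes(value_1, value_2):
    """Return (larger, smaller) of two per-protein values for one PPI.

    value_1 and value_2 are the same kind of count (for example, the number
    of interacting proteins) measured for each protein in the pair.
    """
    return max(value_1, value_2), min(value_1, value_2)


# Hypothetical counts for the two proteins of a single PPI.
larger, smaller = pair_attributes(12, 47)
print(larger, smaller)
```

Ordering the two values this way makes the attribute vector independent of which protein is listed first.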
Table 3. Summary of the results of the cross-validation tests.
| Kernel function | Measure | All attributes | | | Top 10 attributes by F-score |
| | Positives:negatives | 1:1 | 1:2 | 1:3 | 1:1 |
| Linear | Accuracy | 72.05 ± 6.40 | 75.37 ± 4.75 | 79.22 ± 3.78 | 74.91 ± 5.96 |
| | Sensitivity | 71.54 ± 8.97 | 65.73 ± 7.80 | 60.21 ± 8.48 | 75.34 ± 8.19 |
| | Specificity | 72.56 ± 8.31 | 80.19 ± 4.96 | 85.56 ± 3.96 | 74.47 ± 8.14 |
| Polynomial | Accuracy | 70.86 ± 8.83 | 76.18 ± 6.06 | 81.18 ± 3.98 | 71.74 ± 7.73 |
| | Sensitivity | 79.85 ± 9.15 | 53.35 ± 28.74 | 52.38 ± 25.58 | 83.29 ± 10.58 |
| | Specificity | 61.87 ± 18.47 | 87.60 ± 8.01 | 90.78 ± 5.49 | 60.19 ± 20.23 |
| Radial basis function | Accuracy | 80.50 ± 4.33 | 83.43 ± 3.22 | 86.37 ± 2.36 | 81.53 ± 4.36 |
| | Sensitivity | 81.61 ± 5.84 | 65.18 ± 9.37 | 58.67 ± 10.09 | 82.76 ± 6.09 |
| | Specificity | 79.40 ± 6.64 | 92.55 ± 3.61 | 95.61 ± 2.46 | 80.29 ± 6.51 |
| Sigmoid | Accuracy | 63.79 ± 10.87 | 69.68 ± 7.73 | 73.30 ± 6.97 | 63.32 ± 14.62 |
| | Sensitivity | 62.62 ± 16.32 | 31.69 ± 23.08 | 23.51 ± 19.63 | 61.37 ± 18.06 |
| | Specificity | 64.96 ± 16.95 | 88.67 ± 10.62 | 89.90 ± 8.93 | 65.27 ± 17.23 |
Numbers shown are average percentage ± standard deviation.
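The three measures in the table can be computed from true and predicted labels as sketched below. This is an illustration under stated assumptions, not the authors' code: labels are coded 1 (positive) and 0 (negative), both classes are present in each test fold, and the sample standard deviation is used, although the source does not say which form of standard deviation it reports.

```python
from statistics import mean, stdev


def classification_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) in percent for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = 100.0 * (tp + tn) / len(pairs)
    sensitivity = 100.0 * tp / (tp + fn)  # fraction of positives recovered
    specificity = 100.0 * tn / (tn + fp)  # fraction of negatives recovered
    return accuracy, sensitivity, specificity


def summarize(values):
    """Return (mean, sample standard deviation) over repeated test runs."""
    return mean(values), stdev(values)


# Hypothetical labels for one cross-validation fold.
print(classification_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

Sensitivity and specificity are reported separately because, as the 1:2 and 1:3 columns show, accuracy alone can rise with class imbalance while sensitivity falls.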
Figure 2. ROC curves of the training data with the SVM model using all 69 PPI attributes (1:1 positives:negatives ratio). ROC curves with the linear (orange), polynomial (magenta), RBF (green), and sigmoid (blue) kernels were calculated for the 10,000 random training data sets, and the average true positive rate at each false positive rate is plotted. AUCs ± standard deviations of the ROC curves with the linear, polynomial, RBF, and sigmoid kernels are 0.76 ± 0.09, 0.67 ± 0.20, 0.78 ± 0.13, and 0.64 ± 0.17, respectively.
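For readers unfamiliar with the AUC, the sketch below computes it through its rank interpretation: the probability that a randomly chosen positive instance receives a higher decision value than a randomly chosen negative one, with ties counted as one half. The function name and the scores are hypothetical; the source does not state how its AUCs were computed.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve from classifier scores of each class.

    Counts, over all positive/negative pairs, how often the positive
    instance scores higher (ties count as half a win).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical SVM decision values for two positives and two negatives.
print(roc_auc([0.9, 0.3], [0.5, 0.1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for the kernel comparison reported in the caption.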
Figure 3. F-scores of the (A) structural, (B) drug and chemical, and (C) functional attributes. Average values (black squares) and standard deviations (vertical lines passing through the squares) are shown. For descriptions of the attributes, see Table 2.
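The text here does not define the F-score. The sketch below assumes the definition commonly used for feature ranking alongside LIBSVM (Chen and Lin): the squared deviations of the class means from the overall mean, divided by the sum of the within-class sample variances. Whether the authors used exactly this form is an assumption.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """F-score of one attribute given its values in each class.

    Assumes at least two values per class and a non-zero total
    within-class variance; larger scores mean better class separation.
    """
    overall = mean(list(pos_values) + list(neg_values))
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    within = variance(pos_values) + variance(neg_values)  # sample variances (n - 1)
    return between / within


# Hypothetical attribute values for positive and negative instances.
print(f_score([2, 4], [0, 2]))
```

Each attribute is scored independently, so the ranking does not account for information shared between attributes.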
Figure 4. Frequency distributions of (A) the number of interacting proteins (. For both attributes, the difference between the frequency distributions of the positive and test instances is statistically significant (P < 10⁻¹⁵) by the two-sample Kolmogorov-Smirnov test.
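The two-sample Kolmogorov-Smirnov test compares two empirical cumulative distribution functions through their largest vertical gap. The sketch below computes only that statistic, not the P value, and uses hypothetical attribute values.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical
    cumulative distribution functions of the two samples.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)  # fraction of a that is <= v
        cdf_b = bisect_right(b, v) / len(b)  # fraction of b that is <= v
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical attribute values for positive and test instances.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

In practice a statistics library would also supply the P value; this sketch only shows what quantity the test is based on.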
Figure 5. Frequency distributions of the druggability scores (the number of times an instance was judged to be positive) by the SVM models using (A) all attributes and (B) the top 10 attributes by F-score.
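The druggability score defined in this caption is a vote count over repeated models. A minimal sketch, assuming each trained model yields a mapping from a test-instance identifier to a 0/1 prediction (the identifiers and data layout are hypothetical):

```python
from collections import Counter


def druggability_scores(predictions_per_model):
    """Count how many models judged each instance to be positive.

    predictions_per_model is an iterable of dicts, one per trained model,
    mapping an instance identifier to a predicted label (1 or 0).
    """
    scores = Counter()
    for predictions in predictions_per_model:
        for instance, label in predictions.items():
            scores[instance] += 1 if label == 1 else 0
    return dict(scores)


# Hypothetical predictions from three models on two test instances.
models = [{"ppi_A": 1, "ppi_B": 0}, {"ppi_A": 1, "ppi_B": 0}, {"ppi_A": 0, "ppi_B": 1}]
print(druggability_scores(models))
```

With one model per random training set, the score ranges from 0 to the number of training sets.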
Table 4. Potentially druggable PPIs predicted by the SVM-based method. a
| PPI | | PPI | |
| Protein 1 | Protein 2 | Protein 1 | Protein 2 |
| APC | CTNNB1 | CTNNB1 | CTNNBIP1 |
| ARHGAP1 | CDC42 | E2F2 | RB1 |
| ARHGDIA | CDC42 | EGFR | ERRFI1 |
| ARHGDIA | RAC1 | EP300 | CITED2 |
| ARHGDIA | RAC2 | EP300 | HIF1A |
| BCL2L1 | BECN1 | EP300 | MYB |
| BCL9 | CTNNB1 | GSK3B | AXIN1 |
| CALM1 | KCNN2 | HRAS | RALGDS |
| CALM1 | RYR1 | HRAS | RASA1 |
| CALM2 | MARCKS | MAX | MYC |
| CD247 | SHC1 | NCF2 | RAC1 |
| CDC42 | ITSN1 | NFKB1 | TXN |
| CDC42 | MCF2L | NFKBIB | RELA |
| CDC42 | WAS | RAC1 | ARFIP2 |
| CDH1 | CTNNB1 | RAF1 | RAP1A |
| CREBBP | CITED2 | RPA1 | TP53 |
| CREBBP | HIF1A | S100B | TP53 |
| CREBBP | IRF3 | SMAD2 | ZFYVE9 |
| CREBBP | MYB | SMAD4 | SKI |
| CTNNA1 | JUP | TP53 | TP53BP1 |
| CTNNB1 | BTRC | TP53 | TP53BP2 |
a For details, see Additional file 1: Table S3. A PPI is listed if one of its instances had a druggability score of >9,000 by the SVM model using all attributes and >6,500 by the model using the top 10 attributes by F-score.
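The listing criterion in footnote a combines two thresholds. The sketch below applies them to two score dictionaries; the function name, the instance identifiers, and the example scores are hypothetical, and only the cut-offs of 9,000 and 6,500 come from the footnote.

```python
def select_candidates(scores_all, scores_top10, min_all=9000, min_top10=6500):
    """Return instances whose scores exceed both thresholds.

    scores_all: druggability scores from the model using all attributes.
    scores_top10: scores from the model using the top 10 attributes by F-score.
    Both comparisons are strict, matching the ">" in the footnote.
    """
    return sorted(
        instance
        for instance, score in scores_all.items()
        if score > min_all and scores_top10.get(instance, 0) > min_top10
    )


# Hypothetical scores for three test instances.
all_attr = {"ppi_A": 9500, "ppi_B": 9900, "ppi_C": 4000}
top10 = {"ppi_A": 7000, "ppi_B": 6000, "ppi_C": 8000}
print(select_candidates(all_attr, top10))
```

Requiring both models to agree is stricter than either threshold alone, which keeps the candidate list short.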