Hakime Öztürk, Elif Ozkirimli, Arzucan Özgür.
Abstract
BACKGROUND: Molecular structures can be represented as strings of special characters using SMILES. Since each molecule is represented as a string, the similarity between compounds can be computed using SMILES-based string similarity functions. Most previous studies on drug-target interaction prediction use 2D-based compound similarity kernels such as SIMCOMP. To the best of our knowledge, SMILES-based similarity functions, which are computationally more efficient than the 2D-based kernels, have not been investigated for this task before.

RESULTS: In this study, we adapt and evaluate various SMILES-based similarity methods for drug-target interaction prediction. In addition, inspired by the vector space model of Information Retrieval, we propose cosine similarity based SMILES kernels that use the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) weighting approaches. We also investigate composite kernels that combine our best SMILES-based similarity functions with the SIMCOMP kernel. In total, we compare 13 different ligand similarity functions, each of which uses the SMILES string representation of the molecule.

CONCLUSION: The more efficient SMILES-based similarity functions performed similarly to the more complex 2D-based SIMCOMP kernel in terms of AUC-ROC scores. The TF-IDF based cosine similarity obtained a better AUC-PR score than the SIMCOMP kernel on the GPCR benchmark data set. The composite kernel of TF-IDF based cosine similarity and SIMCOMP achieved the best AUC-PR scores for all data sets.
Keywords: Chemoinformatics; Drug-target interaction prediction; SMILES; SMILES based drug similarity
Year: 2016 PMID: 26987649 PMCID: PMC4797122 DOI: 10.1186/s12859-016-0977-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Number of components included in the drug-target interaction data sets of Yamanishi et al. [5]
| Dataset | Drugs | Targets | Interactions |
|---|---|---|---|
| Enzyme | 445 | 664 | 2926 |
| Ion Channels | 210 | 204 | 1476 |
| GPCR | 223 | 95 | 635 |
| Nuclear Receptor | 54 | 26 | 90 |
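The counts in this table imply that known interactions are a small fraction of all possible drug-target pairs, which is one reason AUC-PR is commonly reported alongside AUC-ROC for such imbalanced data. The following is a minimal sketch of that arithmetic; the function name `interaction_density` is an illustrative choice, not something taken from the study.

```python
# Counts copied from the table above: (drugs, targets, known interactions).
DATASETS = {
    "Enzyme": (445, 664, 2926),
    "Ion Channels": (210, 204, 1476),
    "GPCR": (223, 95, 635),
    "Nuclear Receptor": (54, 26, 90),
}


def interaction_density(drugs, targets, interactions):
    """Fraction of all drug-target pairs that are known interactions."""
    return interactions / (drugs * targets)


for name, (d, t, k) in DATASETS.items():
    # d * t is the number of candidate pairs in the interaction matrix.
    print(f"{name}: {d * t} pairs, density = {interaction_density(d, t, k):.4f}")
```

In every data set, fewer than 7% of the candidate pairs are known interactions.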
LINGOs with their corresponding frequencies in the sample SMILES strings SMI1 and SMI2
| SMI1 | | SMI2 | |
|---|---|---|---|
| LINGO | Freq | LINGO | Freq |
| OC(O | 1 | CCCC | 2 |
| C(O) | 1 | CCC( | 1 |
| (O)= | 1 | CC(O | 1 |
| O)=O | 1 | C(O) | 1 |
| | | (O)= | 1 |
| | | O)=C | 1 |
| | | )=C0 | 1 |
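The LINGOs in this table are overlapping substrings of length 4, so they can be produced with a sliding window over the SMILES string. The sketch below illustrates this under one stated assumption: the two sample strings are not given in the text, so `SMI1` and `SMI2` here are strings chosen because their LINGOs reproduce the table. The function name `lingos` is also illustrative.

```python
from collections import Counter


def lingos(smiles, q=4):
    """Count every overlapping substring of length q (a LINGO) in a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


# Assumed sample strings (not stated in the source), picked so that their
# LINGO counts match the table. Ring-closure digits are already written as 0.
SMI1 = "OC(O)=O"
SMI2 = "CCCCC(O)=C0"

for name, smi in (("SMI1", SMI1), ("SMI2", SMI2)):
    print(name, dict(lingos(smi)))
```

A string of length n yields n - q + 1 LINGOs, so the 11-character `SMI2` yields 8 LINGOs, of which `CCCC` occurs twice.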
The IDF scores for the LINGOs in the imaginary sample SMILES strings SMI1 and SMI2. The IDF scores are computed by assuming that SMI1 and SMI2 are compounds in the enzyme data set, which consists of 445 compounds in total
| LINGO Dictionary | IDF (log10(N/df)) |
|---|---|
| OC(O | log10(445/2) |
| C(O) | log10(445/113) |
| (O)= | log10(445/105) |
| O)=O | log10(445/143) |
| CCCC | log10(445/61) |
| CCC( | log10(445/49) |
| CC(O | log10(445/36) |
| O)=C | log10(445/4) |
| )=C0 | log10(445/5) |
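To make the IDF column concrete, the sketch below evaluates log10(N/df) for each LINGO and then computes TF and TF-IDF cosine similarities between the two sample strings. The document frequencies and term frequencies are copied from the tables. The weighting used here (raw term count times IDF) is an assumption for illustration; the study may use a different TF variant.

```python
import math
from collections import Counter

N = 445  # compounds in the enzyme data set, as stated in the caption

# Document frequencies from the table: number of compounds containing each LINGO.
DF = {"OC(O": 2, "C(O)": 113, "(O)=": 105, "O)=O": 143, "CCCC": 61,
      "CCC(": 49, "CC(O": 36, "O)=C": 4, ")=C0": 5}
IDF = {lingo: math.log10(N / df) for lingo, df in DF.items()}

# Term frequencies of the two sample strings, from the LINGO frequency table.
TF1 = Counter({"OC(O": 1, "C(O)": 1, "(O)=": 1, "O)=O": 1})
TF2 = Counter({"CCCC": 2, "CCC(": 1, "CC(O": 1, "C(O)": 1,
               "(O)=": 1, "O)=C": 1, ")=C0": 1})


def tfidf(tf):
    """Weight each raw count by the IDF of its LINGO (assumed weighting)."""
    return {lingo: count * IDF[lingo] for lingo, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(lingo, 0.0) for lingo, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


tf_sim = cosine(TF1, TF2)
tfidf_sim = cosine(tfidf(TF1), tfidf(TF2))
print(f"TF cosine = {tf_sim:.4f}, TF-IDF cosine = {tfidf_sim:.4f}")
```

The two strings share only `C(O)` and `(O)=`, both of which are frequent in the data set and therefore receive low IDF weights, so the TF-IDF similarity comes out lower than the plain TF similarity.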
Average AUC-ROC and AUC-PR scores over 5 repetitions of 5-fold cross-validation on each of the four data sets. Standard deviations are given in parentheses
| Method | AUC-ROC (std) | AUC-PR (std) | Time (sec) |
|---|---|---|---|
| Enzyme | | | |
| SIMCOMP | | 0.303 (0.027) | 413.7 min |
| Edit | 0.833 (0.016) | 0.178 (0.004) | 6 |
| NLCS | 0.837 (0.014) | 0.228 (0.013) | 4 |
| CLCS | 0.834 (0.013) | 0.234 (0.019) | 331 |
| SMILES-based substring | 0.752 (0.006) | 0.169 (0.010) | 133 |
| SMIfp CBD (34D) | 0.846 (0.009) | 0.199 (0.008) | 1 |
| SMIfp Tanimoto (34D) | 0.832 (0.012) | 0.191 (0.012) | 1 |
| SMIfp CBD (38D) | 0.852 (0.009) | 0.205 (0.009) | 1 |
| SMIfp Tanimoto (38D) | 0.844 (0.012) | 0.201 (0.006) | 1 |
| LINGOsim (q = 3) | 0.846 (0.013) | 0.290 (0.013) | 3 |
| LINGOsim (q = 4) | 0.823 (0.010) | 0.294 (0.006) | 3 |
| LINGOsim (q = 5) | 0.819 (0.015) | 0.264 (0.013) | 3 |
| LINGO-based TF | 0.811 (0.017) | 0.259 (0.008) | 19 |
| LINGO-based TF-IDF | 0.822 (0.012) | 0.292 (0.031) | 47 |
| TF-IDF+SIMCOMP | 0.852 (0.010) | | |
| LINGOsim+SIMCOMP | 0.852 (0.016) | 0.318 (0.019) | |
| Ion Channels | | | |
| SIMCOMP | | 0.224 (0.032) | 48.7 min |
| Edit | 0.754 (0.013) | 0.199 (0.025) | 1 |
| NLCS | 0.753 (0.007) | 0.189 (0.037) | 0.9 |
| CLCS | 0.755 (0.018) | 0.185 (0.028) | 47 |
| SMILES-based substring | 0.743 (0.004) | 0.197 (0.031) | 21 |
| SMIfp CBD (34D) | 0.717 (0.019) | 0.136 (0.036) | 0.3 |
| SMIfp Tanimoto (34D) | 0.698 (0.015) | 0.125 (0.022) | 0.3 |
| SMIfp CBD (38D) | 0.722 (0.012) | 0.137 (0.024) | 0.3 |
| SMIfp Tanimoto (38D) | 0.699 (0.028) | 0.156 (0.028) | 0.4 |
| LINGOsim (q = 3) | 0.737 (0.015) | 0.192 (0.046) | 0.8 |
| LINGOsim (q = 4) | 0.737 (0.011) | 0.197 (0.037) | 1 |
| LINGOsim (q = 5) | 0.727 (0.009) | 0.188 (0.026) | 1 |
| LINGO-based TF | 0.738 (0.018) | 0.204 (0.024) | 3 |
| LINGO-based TF-IDF | 0.712 (0.014) | 0.178 (0.029) | 7 |
| TF-IDF+SIMCOMP | 0.763 (0.010) | | |
| LINGOsim+SIMCOMP | 0.773 (0.012) | 0.229 (0.018) | |
| GPCR | | | |
| SIMCOMP | 0.867 (0.009) | 0.307 (0.018) | 71.2 min |
| Edit | 0.844 (0.015) | 0.248 (0.030) | 1 |
| NLCS | 0.853 (0.006) | 0.247 (0.013) | 1 |
| CLCS | 0.855 (0.014) | 0.279 (0.030) | 52 |
| SMILES-based substring | 0.782 (0.019) | 0.205 (0.032) | 21 |
| SMIfp CBD (34D) | 0.852 (0.014) | 0.209 (0.018) | 0.3 |
| SMIfp Tanimoto (34D) | 0.847 (0.006) | 0.213 (0.016) | 0.3 |
| SMIfp Tanimoto (38D) | 0.856 (0.009) | 0.228 (0.015) | 0.3 |
| LINGOsim (q = 3) | 0.875 (0.003) | 0.317 (0.015) | 1 |
| LINGOsim (q = 4) | 0.876 (0.004) | 0.333 (0.020) | 1 |
| LINGOsim (q = 5) | 0.874 (0.006) | 0.337 (0.019) | 1 |
| LINGO-based TF | 0.872 (0.004) | 0.335 (0.012) | 3 |
| LINGO-based TF-IDF | 0.871 (0.007) | 0.348 (0.018) | 9 |
| TF-IDF+SIMCOMP | | | |
| LINGOsim+SIMCOMP | 0.879 (0.009) | 0.335 (0.016) | |
| Nuclear Receptor | | | |
| SIMCOMP | 0.856 (0.015) | 0.435 (0.008) | 2.9 min |
| Edit | 0.828 (0.009) | 0.305 (0.029) | 0.2 |
| NLCS | 0.815 (0.018) | 0.302 (0.032) | 0.2 |
| CLCS | 0.813 (0.037) | 0.319 (0.039) | 10 |
| SMILES-based substring | 0.766 (0.028) | 0.335 (0.035) | 2 |
| SMIfp CBD (34D) | 0.809 (0.026) | 0.296 (0.015) | 0.1 |
| SMIfp Tanimoto (34D) | 0.784 (0.031) | 0.281 (0.020) | 0.1 |
| SMIfp CBD (38D) | 0.815 (0.017) | 0.307 (0.024) | 0.1 |
| SMIfp Tanimoto (38D) | 0.787 (0.030) | 0.322 (0.034) | 0.1 |
| LINGOsim (q = 3) | 0.800 (0.013) | 0.351 (0.036) | 0.2 |
| LINGOsim (q = 4) | 0.829 (0.013) | 0.414 (0.031) | 0.2 |
| LINGOsim (q = 5) | 0.834 (0.013) | 0.389 (0.023) | 0.2 |
| LINGO-based TF | 0.820 (0.013) | 0.373 (0.035) | 0.4 |
| LINGO-based TF-IDF | 0.855 (0.022) | 0.418 (0.016) | 0.8 |
| TF-IDF+SIMCOMP | | | |
| LINGOsim+SIMCOMP | 0.840 (0.015) | 0.399 (0.031) | |
The best AUC-ROC and AUC-PR results for each data set are indicated in bold. The results that are significantly better than SIMCOMP according to the paired t-test (α = 0.05) are indicated with . The p-values range between 0.0004 and 0.0329, and they are provided in the Additional file 1: Table S1.
Top 10 most common LINGOs of each compound data set
| Enzyme | | Ion Channels | |
|---|---|---|---|
| LINGO | Num. of drugs | LINGO | Num. of drugs |
| c0cc | 321 | c0cc | 180 |
| (=O) | 300 | 0ccc | 170 |
| 0ccc | 279 | (=O) | 117 |
| C(=O | 228 | cccc | 108 |
| ccc0 | 197 | ccc0 | 107 |
| cccc | 171 | ccc( | 94 |
| )c0c | 155 | C(=O | 87 |
| @H]( | 149 | )c0c | 84 |
| ccc( | 144 | Cc0c | 78 |
| [C@H | 144 | C(O) | 72 |

| GPCR | | Nuclear Receptor | |
|---|---|---|---|
| LINGO | Num. of drugs | LINGO | Num. of drugs |
| c0cc | 165 | (=O) | 37 |
| 0ccc | 148 | [C@H | 35 |
| (=O) | 130 | C@H] | 35 |
| ccc0 | 116 | C(=O | 35 |
| cccc | 105 | H]0C | 35 |
| C(=O | 101 | [C@@ | 35 |
| )c0c | 94 | C@@H | 35 |
| ccc( | 72 | @@H] | 35 |
| O)c0 | 56 | @H]0 | 35 |
| =O)c | 54 | )[C@ | 34 |
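A ranking like the one in this table can be obtained by counting, for each LINGO, the number of compounds that contain it at least once. The sketch below shows one way to do this; the function name `top_lingos` and the toy SMILES list are hypothetical and are not data from the study. Ring-closure digits in the toy strings are already written as 0, as the LINGOs in the table suggest.

```python
from collections import Counter


def top_lingos(smiles_list, q=4, k=10):
    """Rank LINGOs by the number of compounds that contain them."""
    doc_freq = Counter()
    for smi in smiles_list:
        # A set counts each LINGO once per compound (document frequency),
        # regardless of how often it repeats inside that compound.
        doc_freq.update({smi[i:i + q] for i in range(len(smi) - q + 1)})
    return doc_freq.most_common(k)


# Hypothetical toy input for illustration only.
toy = ["c0ccccc0", "Cc0ccccc0", "OC(=O)c0ccccc0"]
for lingo, n_drugs in top_lingos(toy, k=4):
    print(lingo, n_drugs)
```

The same document frequencies are what the IDF weighting uses as df, so this count serves both the ranking and the TF-IDF kernel.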