Paul T Kim, Robin Winter, Djork-Arné Clevert.
Abstract
In silico protein-ligand binding prediction is an ongoing area of research in computational chemistry and machine-learning-based drug discovery, as an accurate predictive model could greatly reduce the time and resources necessary for the detection and prioritization of possible drug candidates. Proteochemometric modeling (PCM) attempts to create an accurate model of the protein-ligand interaction space by combining explicit protein and ligand descriptors. This requires the creation of information-rich, uniform and computer-interpretable representations of proteins and ligands. Previous PCM studies rely on pre-defined, handcrafted feature extraction methods, and many methods use protein descriptors that require alignment or are otherwise specific to a particular group of related proteins. However, recent advances in representation learning have shown that unsupervised machine learning can be used to generate embeddings that outperform complex, human-engineered representations. Several different embedding methods for proteins and molecules have been developed based on various language-modeling methods. Here, we demonstrate the utility of these unsupervised representations and compare three protein embeddings and two compound embeddings in a fair manner. We evaluate performance on various splits of a benchmark dataset, as well as on an internal dataset of protein-ligand binding activities, and find that unsupervised-learned representations significantly outperform handcrafted representations.
Keywords: computational biology; protein–ligand binding prediction; unsupervised representation learning
Year: 2021 PMID: 34884688 PMCID: PMC8657702 DOI: 10.3390/ijms222312882
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Comparison of single-task, multi-task and proteochemometric modeling strategies for protein–ligand binding prediction.
Marginal performance (average of all models containing this descriptor on this task) and additional information for different descriptors.
| | Marginal Performance | | | Model Info | | |
|---|---|---|---|---|---|---|
| Descriptor | Random MCC | LCCO MCC | LPO MCC | Repr Size | # Params | Type |
| CDDD | 0.610 | 0.483 | 0.309 | 512 | 26 M | GRU-Autoencoder |
| MolBert | | | 0.306 | 768 | 85 M | Transformer |
| UniRep | | | 0.306 | 256 | 1.8 M | m-LSTM |
| SeqVec | 0.591 | 0.481 | 0.317 | 1024 | 93 M | Stacked LSTM |
| ESM | 0.629 | 0.492 | 0.296 | 1280 | 650 M | Transformer |
Abbreviations: See Table 2. Random MCC: MCC on Random Split; LCCO MCC: MCC on LCCO Split; LPO MCC: MCC on LPO Split; Repr Size: Representation Size; # Params: Number of model parameters; GRU: Gated Recurrent Unit [39]; LSTM: Long Short-Term Memory network [40]. Bold indicates statistically significant performance improvement by Wilcoxon signed-rank test over data splits.
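The marginal performance defined in the caption (the average over all models containing a given descriptor) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `marginal_performance` and the dictionary layout are assumptions, and the input values are the random-split MCC scores taken from the "Full" column of the table comparing the full and No-Interaction-Terms models.

```python
from statistics import mean

# Full-model MCC on the random split for each descriptor pair,
# copied from the "Full" column of the No-Int vs. Full comparison table.
random_split_mcc = {
    ("CDDD", "UniRep"): 0.645,
    ("CDDD", "SeqVec"): 0.575,
    ("CDDD", "ESM"): 0.609,
    ("MolBert", "UniRep"): 0.654,
    ("MolBert", "SeqVec"): 0.607,
    ("MolBert", "ESM"): 0.630,
}


def marginal_performance(scores, descriptor):
    """Average the score of every model whose descriptor pair contains `descriptor`."""
    values = [score for pair, score in scores.items() if descriptor in pair]
    if not values:
        raise ValueError(f"no model uses descriptor {descriptor!r}")
    return mean(values)


cddd_marginal = marginal_performance(random_split_mcc, "CDDD")
seqvec_marginal = marginal_performance(random_split_mcc, "SeqVec")
print(f"CDDD: {cddd_marginal:.3f}, SeqVec: {seqvec_marginal:.3f}")
```

For CDDD and SeqVec this reproduces the Random MCC entries listed above (0.610 and 0.591). Because the inputs are already rounded to three decimals, other entries recomputed this way may not match the published values exactly.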
Test-set results for models with different descriptor combinations on the benchmark ChEMBL dataset. Best results are denoted in bold. Standard deviations of metrics are shown in parentheses. Raw results for each split can be found in Supplementary Tables S1–S3.
| | Random | | LCCO | | LPO | |
|---|---|---|---|---|---|---|
| Descriptors | MCC | BedROC | MCC | BedROC | MCC | BedROC |
| CDDD + UniRep | 0.645 (0.004) | 0.979 (0.002) | 0.490 (0.061) | 0.941 (0.017) | 0.307 (0.031) | 0.847 (0.038) |
| CDDD + SeqVec | 0.575 (0.079) | 0.967 (0.012) | 0.475 (0.060) | 0.930 (0.021) | | |
| CDDD + ESM | 0.609 (0.014) | 0.974 (0.003) | 0.484 (0.054) | 0.930 (0.023) | 0.297 (0.093) | 0.834 (0.093) |
| MolBert + UniRep | | | | | 0.312 (0.024) | 0.847 (0.040) |
| MolBert + SeqVec | 0.607 (0.030) | 0.973 (0.007) | 0.487 (0.062) | 0.938 (0.017) | 0.311 (0.035) | 0.842 (0.049) |
| MolBert + ESM | 0.630 (0.009) | 0.977 (0.002) | 0.499 (0.053) | 0.937 (0.022) | 0.294 (0.118) | 0.832 (0.090) |
| Handcrafted | 0.337 (0.003) | 0.819 (0.007) | 0.276 (0.024) | 0.753 (0.058) | 0.132 (0.051) | 0.655 (0.061) |
Abbreviations: CDDD: Continuous and Data Driven Descriptors; UniRep: [34]; SeqVec: [35]; ESM: [36]; MolBert: [33]; MCC: Matthews Correlation Coefficient; BedROC: Boltzmann-Enhanced Discrimination of ROC; Random: Random Split; LCCO: Leave-Compound-Cluster-Out; LPO: Leave-Protein-Out. Bold indicates statistically significant performance improvement by Wilcoxon signed-rank test over data splits.
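For readers unfamiliar with the MCC columns, the following minimal sketch computes the Matthews Correlation Coefficient from binary confusion-matrix counts using its standard definition. The function name and the example counts are hypothetical and chosen only for illustration; returning 0.0 when the denominator vanishes is a common convention, not something stated in the source.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention),
    i.e. when any row or column of the confusion matrix is empty.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, not taken from the study.
example_mcc = matthews_corrcoef(tp=40, tn=45, fp=5, fn=10)
print(round(example_mcc, 3))
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect prediction), which makes it less sensitive to class imbalance than plain accuracy.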
Test-set results for models with different descriptor combinations on the LCCO split of a large internal bioactivity dataset. Best results are denoted in bold. Standard deviations of metrics are shown in parentheses.
| Descriptors | MCC | BedROC |
|---|---|---|
| CDDD + UniRep | 0.633 (0.021) | 0.931 (0.023) |
| CDDD + SeqVec | 0.626 (0.016) | 0.922 (0.022) |
| CDDD + ESM | 0.634 (0.023) | 0.927 (0.022) |
| MolBert + UniRep | | |
| MolBert + SeqVec | 0.639 (0.012) | 0.928 (0.021) |
| MolBert + ESM | 0.646 (0.021) | 0.932 (0.021) |
Abbreviations: see Table 2.
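Both the Leave-Compound-Cluster-Out (LCCO) and Leave-Protein-Out (LPO) splits hold out entire groups, so that no compound cluster (or protein) in the test set is seen during training. The sketch below illustrates that idea only; it assumes cluster assignments already exist, and the helper `leave_group_out`, the field names and the toy records are all hypothetical rather than the authors' pipeline.

```python
def leave_group_out(records, group_key, held_out_groups):
    """Split records so that every held-out group appears only in the test set."""
    held_out = set(held_out_groups)
    train = [r for r in records if r[group_key] not in held_out]
    test = [r for r in records if r[group_key] in held_out]
    return train, test


# Hypothetical toy records; identifiers and labels are made up for illustration.
records = [
    {"compound_cluster": 0, "protein": "P1", "active": 1},
    {"compound_cluster": 0, "protein": "P2", "active": 0},
    {"compound_cluster": 1, "protein": "P1", "active": 0},
    {"compound_cluster": 2, "protein": "P3", "active": 1},
]

# LCCO-style split: hold out every record from compound cluster 0.
lcco_train, lcco_test = leave_group_out(records, "compound_cluster", [0])

# LPO-style split: hold out every record measured against protein P1.
lpo_train, lpo_test = leave_group_out(records, "protein", ["P1"])
```

Splitting by group rather than by individual record is what makes these evaluations harder than a random split: the model is scored on chemistry or proteins it has not encountered during training.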
Comparison of the test-set performance of the full model and the No-Interaction-Terms model on various splits of the ChEMBL dataset. Raw results for each split can be found in Supplementary Tables S1–S3.
| | Random | | | LCCO | | | LPO | | |
|---|---|---|---|---|---|---|---|---|---|
| Descriptors | No-Int | Full | % Imp | No-Int | Full | % Imp | No-Int | Full | % Imp |
| CDDD + UniRep | 0.565 | 0.645 | 14.3 | 0.424 | 0.490 | 15.6 | 0.281 | 0.307 | 9.3 |
| CDDD + SeqVec | 0.548 | 0.575 | 4.9 | 0.411 | 0.475 | 15.6 | 0.287 | 0.322 | 12.2 |
| CDDD + ESM | 0.557 | 0.609 | 9.4 | 0.416 | 0.484 | 16.3 | 0.287 | 0.297 | 3.5 |
| MolBert + UniRep | 0.574 | 0.654 | 13.8 | 0.439 | 0.505 | 15.0 | 0.283 | 0.312 | 10.2 |
| MolBert + SeqVec | 0.558 | 0.607 | 8.7 | 0.430 | 0.487 | 13.3 | 0.290 | 0.311 | 7.2 |
| MolBert + ESM | 0.567 | 0.630 | 11.2 | 0.434 | 0.499 | 15.0 | 0.292 | 0.294 | 0.7 |
Abbreviations: see Table 2; No-Int: No-Interaction-Terms Model; Full: Full Model; % Imp: Percentage improvement of Full Model over No-Interaction-Terms Model.
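The "% Imp" column can be read as the relative gain of the full model over the No-Interaction-Terms model. A minimal sketch of that calculation, assuming the usual definition 100 × (Full − No-Int) / No-Int (the function name is hypothetical), is:

```python
def percent_improvement(no_int, full):
    """Relative improvement of the full model over the No-Int model, in percent."""
    if no_int == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (full - no_int) / no_int


# CDDD + SeqVec on the random split: No-Int = 0.548, Full = 0.575.
imp = percent_improvement(0.548, 0.575)
print(f"{imp:.1f}")
```

For this row the result rounds to the tabulated 4.9. The published percentages were presumably computed from unrounded scores, so recomputing them from the three-decimal values shown here can differ in the last digit for some rows.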