Enrico Gandini1, Gilles Marcou2, Fanny Bonachera2, Alexandre Varnek2, Stefano Pieraccini1, Maurizio Sironi1.
Abstract
Molecular similarity is a broad topic with implications in many areas of chemistry. Its roots lie in the paradigm that 'similar molecules have similar properties'. For this reason, methods for determining molecular similarity are widely applied in pharmaceutical companies, e.g., in the context of structure-activity relationships. Similarity evaluation is also used in chemical legislation, specifically in the procedure that decides whether a new molecule can obtain orphan drug status, with the consequent financial benefits. For this procedure, the European Medicines Agency relies on experts' judgments. Because the perception of similarity depends on the observer, models that reproduce human perception are useful. In this paper, we built models using both 2D fingerprints and 3D descriptors, i.e., molecular shape and pharmacophore descriptors. The proposed models were also evaluated on a newly constructed dataset of molecular pairs, which was submitted to a group of experts for similarity judgment. The proposed machine-learning models can help reduce or assist human effort in future evaluations. For this reason, the new dataset of molecules and an online tool for molecular similarity estimation have been made freely available.
Keywords: chemical data set; machine learning; molecular similarity; similarity perception
Year: 2022 PMID: 35682792 PMCID: PMC9181189 DOI: 10.3390/ijms23116114
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 6.208
Figure 1. 2D (t)/3D (t) similarity plots of molecular pairs (MP) (top) included in the survey and (bottom) studied by Franco et al. [11]; 2D structures of some representative molecular pairs are shown (top).
Figure 2. Distribution of molecular pairs (MP) according to human-assessed similarity (horizontal axis) in each selected subset.
Table 1. Models built on the training set collected here and validated on the Franco data set using single-feature (Equation (2)) and double-feature (Equation (4)) logistic regression (LogReg) models. The collected set and the Franco set each contain 100 molecular pairs.
| Model Type | Variables | Fit Ncorrect | Fit ROC AUC | Validation Ncorrect | Validation ROC AUC |
|---|---|---|---|---|---|
| single-feature | | 81 | 0.920 | 92 | 0.988 |
| single-feature | | 70 | 0.845 | 92 | 0.970 |
| double-feature | | 84 | 0.924 | 95 | 0.988 |
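The two metrics reported in the table can be computed from a model's predicted probabilities and the experts' binary labels. The following is a minimal sketch, not the authors' code: the function names, the 0.5 decision threshold, and the toy data are assumptions made for illustration. ROC AUC is computed through its rank interpretation, i.e., the fraction of (similar, dissimilar) pairs of examples that the model orders correctly, with ties counted as one half.

```python
def n_correct(probabilities, labels, threshold=0.5):
    """Count predictions that match the binary labels.

    The 0.5 threshold is an assumption of this sketch.
    """
    return sum((p >= threshold) == bool(y) for p, y in zip(probabilities, labels))


def roc_auc(probabilities, labels):
    """ROC AUC as the probability that a positive example outranks a negative one."""
    positives = [p for p, y in zip(probabilities, labels) if y]
    negatives = [p for p, y in zip(probabilities, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute ROC AUC")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties contribute one half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = judged similar, 0 = judged dissimilar.
    toy_probabilities = [0.9, 0.4, 0.3, 0.6]
    toy_labels = [1, 1, 0, 0]
    print(n_correct(toy_probabilities, toy_labels))
    print(roc_auc(toy_probabilities, toy_labels))
```

Because ROC AUC depends only on the ranking of the predictions, it is unaffected by the choice of threshold, whereas the count of correct predictions is not.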
Figure 3. Percentage of predictions by the Tanimoto CDK Extended (t) and TanimotoCombo (t) models in the four calculated similarity subsets of the collected set. Both models correctly predicted 100% of the molecular pairs in the sim2D,sim3D and dis2D,dis3D subsets. All prediction errors occurred in the sim2D,dis3D and dis2D,sim3D subsets.
Table 2. Models built on the Franco training set and validated on the dataset collected here a.
| Model Type | Variables | Fit Ncorrect | Fit ROC AUC | Validation Ncorrect | Validation ROC AUC |
|---|---|---|---|---|---|
| single-feature | | 93 | 0.988 | 81 | 0.920 |
| single-feature | | 91 | 0.970 | 69 | 0.845 |
| double-feature | | 95 | 0.988 | 81 | 0.916 |
a See caption for Table 1.
Table 3. Coefficients of Equations (2) and (3) for the models built on the collected set.
| Model | | | |
|---|---|---|---|
| Equation (2), | −4.860 | 8.449 | - |
| Equation (2), | −4.464 | 3.554 | - |
| Equation (3) | −5.605 | 5.214 | 2.009 |
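To illustrate how such coefficients turn a similarity value into a probability, here is a minimal sketch. It assumes that Equations (2) and (3) have the standard logistic-regression form and that, in each row, the first number is the intercept and the remaining numbers are the feature coefficients. Neither assumption is confirmed by the table itself, and the function and constant names are hypothetical.

```python
import math


def similarity_probability(intercept, coefficients, features):
    """Standard logistic model: p = 1 / (1 + exp(-(intercept + sum(c_i * x_i))))."""
    if len(coefficients) != len(features):
        raise ValueError("one coefficient is needed per feature")
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def midpoint(intercept, coefficient):
    """Feature value at which a single-feature model returns p = 0.5."""
    return -intercept / coefficient


# Values copied from the table; reading each row as
# (intercept, coefficients) is an assumption of this sketch.
SINGLE_FEATURE_ROW_1 = (-4.860, (8.449,))
SINGLE_FEATURE_ROW_2 = (-4.464, (3.554,))
DOUBLE_FEATURE_ROW = (-5.605, (5.214, 2.009))

if __name__ == "__main__":
    for intercept, coefficients in (SINGLE_FEATURE_ROW_1, SINGLE_FEATURE_ROW_2):
        x_half = midpoint(intercept, coefficients[0])
        p = similarity_probability(intercept, coefficients, (x_half,))
        print(f"p = {p:.3f} at feature value {x_half:.3f}")
```

Under these assumptions, the midpoint of a single-feature model is the similarity value above which a pair is more likely than not to be judged similar.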
Figure 4. The web service ReadySim. The user can enter a molecular pair either as two SMILES strings or by using the chemical sketchers below. The server standardizes the structures, computes the similarity, and converts it into the probability that a panel of experts would consider the pair similar.
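The similarity computed in such a workflow is commonly a Tanimoto coefficient between fingerprints. As a minimal sketch (not the ReadySim implementation), the coefficient for two fingerprints represented as sets of on-bit indices is the size of their intersection divided by the size of their union. The fingerprints below are made-up bit sets, not real CDK Extended fingerprints, and the function name is hypothetical.

```python
def tanimoto(fingerprint_a, fingerprint_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(fingerprint_a), set(fingerprint_b)
    union = len(a | b)
    if union == 0:
        # Both fingerprints are empty; returning 0.0 is a convention of this sketch.
        return 0.0
    return len(a & b) / union


if __name__ == "__main__":
    # Hypothetical on-bit indices for two molecules.
    molecule_1 = {1, 2, 3}
    molecule_2 = {2, 3, 4}
    print(tanimoto(molecule_1, molecule_2))  # 2 shared bits / 4 distinct bits = 0.5
```

The coefficient ranges from 0 (no shared bits) to 1 (identical fingerprints).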