Pedro Franco, Nuria Porta, John D Holliday, Peter Willett.
Abstract
BACKGROUND: In the European Union, medicines are authorised for a rare disease only if they are judged to be dissimilar to authorised orphan drugs for that disease. This paper describes the use of 2D fingerprints to show the extent of the relationship between computed levels of structural similarity for pairs of molecules and expert judgments of the similarities of those pairs. The resulting relationship can be used to provide input to the assessment of new active compounds for which orphan drug authorisation is being sought.
Year: 2014 PMID: 24485002 PMCID: PMC3923256 DOI: 10.1186/1758-2946-6-5
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. Three training-set molecule-pairs with the corresponding fractions (n = 143) of Yes/No responses to the question “Are these molecules similar?” The similarity values in the right-hand column are those obtained using the Tanimoto coefficient and ECFP4 fingerprints.
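The Tanimoto coefficient mentioned in this caption can be illustrated with a minimal sketch. It assumes each fingerprint is represented as a set of "on" bit positions; the function name `tanimoto` and the bit positions below are hypothetical and are not taken from the paper or from any real molecule.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient c / (a + b - c) for two sets of 'on' bit positions.

    a and b are the numbers of bits set in each fingerprint and c is the
    number of bits set in both. Two empty fingerprints are given a
    similarity of 0.0 here, which is a convention chosen for this sketch.
    """
    common = len(bits_a & bits_b)
    total = len(bits_a) + len(bits_b) - common
    return common / total if total else 0.0


# Hypothetical bit positions, for illustration only.
mol_a = {1, 4, 7, 9, 15}
mol_b = {1, 4, 8, 9, 20, 31}

# 3 shared bits out of 8 distinct bits
print(round(tanimoto(mol_a, mol_b), 3))  # 0.375
```

In practice the bit sets would come from a fingerprinting tool such as one producing ECFP4-style circular fingerprints; that step is omitted here.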
Figure 2. Distribution of expert assessments on the training-set.
Figure 3. Plot of the proportion of experts who assessed a training-set molecule-pair as being similar against the ECFP4 similarity for that molecule-pair. The figure also shows the computed logistic regression curve (and 95% confidence limits) for this fingerprint.
Logistic regression to predict the similarity, or otherwise, of training-set molecule-pairs using different types of fingerprint

| Fingerprint | β0 | β1 | Nagelkerke R² | t | AUC |
|---|---|---|---|---|---|
| BCI | −12.758 | 2.128 | 0.906 | 0.599 | 0.990 |
| Daylight | −10.677 | 1.850 | 0.884 | 0.577 | 0.986 |
| ECFC4 | −9.207 | 2.438 | 0.878 | 0.378 | 0.983 |
| ECFP4 | −12.754 | 2.524 | 0.894 | 0.505 | 0.988 |
| MDL | −9.022 | 1.380 | 0.812 | 0.654 | 0.973 |
| Unity | −12.347 | 1.956 | 0.884 | 0.631 | 0.987 |

The columns contain the β0 and β1 values for the logistic regression model, the Nagelkerke R² value, the computed value for t, and the AUC for the ROC curve.
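As a rough illustration of how β0, β1 and t in this table relate, the sketch below evaluates a logistic model and solves for the similarity at which the predicted probability is 0.5. One point is an assumption rather than something stated here: the coefficients are treated as applying to the similarity multiplied by 10 (the `SCALE` constant). That scaling is inferred only because it reproduces the tabulated t values to within rounding. The function names are hypothetical.

```python
import math

# (beta0, beta1) pairs copied from the table above.
COEFFS = {
    "BCI": (-12.758, 2.128),
    "Daylight": (-10.677, 1.850),
    "ECFC4": (-9.207, 2.438),
    "ECFP4": (-12.754, 2.524),
    "MDL": (-9.022, 1.380),
    "Unity": (-12.347, 1.956),
}

# ASSUMPTION: a similarity in [0, 1] is multiplied by 10 before it enters
# the model. This is inferred from the numbers, not stated in the source.
SCALE = 10.0


def p_similar(fingerprint: str, similarity: float) -> float:
    """Logistic probability that a pair is judged similar."""
    b0, b1 = COEFFS[fingerprint]
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * SCALE * similarity)))


def threshold(fingerprint: str) -> float:
    """Similarity at which the predicted probability equals 0.5."""
    b0, b1 = COEFFS[fingerprint]
    return -b0 / (b1 * SCALE)


for name in COEFFS:
    print(name, round(threshold(name), 3))
```

Because the published coefficients are rounded, the recomputed thresholds can differ from the tabulated t in the last decimal place.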
Figure 4. ROC curve for ECFP4 fingerprints.
Optimal levels of performance using ROC curves

| Fingerprint | t | Fitted probability at t | Sensitivity | Specificity | Precision | Accuracy | F | Youden | Matthews |
|---|---|---|---|---|---|---|---|---|---|
| BCI | 0.606 | 0.534 | 0.980 | 0.941 | 0.941 | | | | |
| Daylight | 0.510 | 0.225 | 1.000 | 0.882 | 0.891 | 0.940 | 0.942 | 0.882 | 0.8866 |
| ECFP4 | 0.490 | 0.406 | 0.980 | 0.922 | 0.923 | 0.950 | 0.951 | 0.901 | 0.9017 |
| ECFC4 | 0.364 | 0.415 | 0.980 | 0.882 | 0.889 | 0.930 | 0.932 | 0.862 | 0.8645 |
| MDL | 0.650 | 0.487 | 0.939 | 0.882 | 0.885 | 0.910 | 0.911 | 0.821 | 0.8216 |
| Unity | 0.639 | 0.537 | 0.938 | 0.961 | | 0.950 | 0.947 | 0.898 | 0.8990 |

t is the similarity threshold that gives the best level of performance, that is, the similarity value that maximises the precision, the accuracy, the F index, the Youden index and the Matthews coefficient whilst maintaining acceptable values of the sensitivity and specificity. The largest values of the precision, the accuracy, the F index, the Youden index and the Matthews coefficient are bold-faced in the table.
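The performance measures named above can all be computed from the four cells of a confusion matrix. The sketch below uses the standard definitions. The counts in the usage example are hypothetical: they were chosen only because they give values close to the ECFP4 row, and the underlying counts are not stated in this record.

```python
import math


def roc_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary-classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f_index = 2 * precision * sensitivity / (precision + sensitivity)
    youden = sensitivity + specificity - 1
    matthews = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f": f_index,
        "youden": youden,
        "matthews": matthews,
    }


# HYPOTHETICAL counts (100 pairs, 49 judged similar), for illustration only.
metrics = roc_metrics(tp=48, fn=1, tn=47, fp=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Sweeping the threshold over the observed similarity values and recomputing these measures at each step is one way to locate a best-performing t; the selection procedure actually used is not described in this record.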
Characteristics of the 163 training-set and 51 test-set molecules

| Property | Training set | Test set |
|---|---|---|
| Molecular weight | 301 (100–500) | 392 (150–1950) |
| Number of carbons | 16 (5–26) | 22 (0–86) |
| Number of heteroatoms | 5 (1–11) | 9 (3–52) |
| Number of rings | 2 (0–5) | 3 (0–11) |
| Number of aromatic rings | 2 (0–4) | 1 (0–3) |
| Number of stereocentres | 1 (0–9) | 1 (0–15) |

Each element of the table lists the median value, together with the corresponding range in brackets.
Numbers of test-set molecule-pairs predicted correctly using and
| BCI | 97 | 97 |
| Daylight | 97 | 98 |
| ECFP4 | 96 | 97 |
| ECFC4 | 71 | 71 |
| MDL | 92 | 92 |
| Unity | 97 | 97 |
| Consensus | 98 | 98 |
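This record does not say how the Consensus row was derived. One common way to combine several fingerprints is a majority vote, sketched below purely as an assumption. The thresholds are the t values from the ROC table; the similarity values, the function name and the tie rule (a strict majority is required) are hypothetical.

```python
# Thresholds t copied from the "Optimal levels of performance" table.
THRESHOLDS = {
    "BCI": 0.606,
    "Daylight": 0.510,
    "ECFP4": 0.490,
    "ECFC4": 0.364,
    "MDL": 0.650,
    "Unity": 0.639,
}


def consensus_similar(similarities: dict) -> bool:
    """Majority vote: True if more than half of the fingerprints
    score the pair at or above their own threshold.

    ASSUMPTION: majority voting and the strict-majority tie rule are
    illustrative choices, not the method reported in the source.
    """
    votes = sum(
        1 for name, t in THRESHOLDS.items() if similarities[name] >= t
    )
    return votes > len(THRESHOLDS) / 2


# HYPOTHETICAL similarity values for one molecule-pair.
pair = {
    "BCI": 0.70,
    "Daylight": 0.55,
    "ECFP4": 0.45,
    "ECFC4": 0.30,
    "MDL": 0.72,
    "Unity": 0.66,
}
print(consensus_similar(pair))  # four of six fingerprints vote "similar" -> True
```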