C Škuta1, I Cortés-Ciriano2, W Dehaen1,3, P Kříž4, G J P van Westen5, I V Tetko6, A Bender2, D Svozil7,8.
Abstract
An affinity fingerprint is a vector of a compound's affinities or potencies against a reference panel of protein targets. Here, we present the QAFFP fingerprint, a 440-element in silico QSAR-based affinity fingerprint whose components are predicted by Random Forest regression models trained on bioactivity data from the ChEMBL database. Both real-valued (rv-QAFFP) and binary (b-QAFFP) versions of the QAFFP fingerprint were implemented, and their performance in similarity searching, biological activity classification and scaffold hopping was assessed and compared to that of the 1024-bit Morgan2 fingerprint (the RDKit implementation of the ECFP4 fingerprint). In both similarity searching and biological activity classification, the QAFFP fingerprint yields retrieval rates, measured by AUC (~ 0.65 and ~ 0.70 for similarity searching depending on data sets, and ~ 0.85 for classification) and EF5 (~ 4.67 and ~ 5.82 for similarity searching depending on data sets, and ~ 2.10 for classification), comparable to those of the Morgan2 fingerprint (similarity searching AUC of ~ 0.57 and ~ 0.66, and EF5 of ~ 4.09 and ~ 6.41, depending on data sets; classification AUC of ~ 0.87 and EF5 of ~ 2.16). However, the QAFFP fingerprint outperforms the Morgan2 fingerprint in scaffold hopping: it retrieves 1146 of the 1749 existing scaffolds, whereas the Morgan2 fingerprint retrieves only 864.
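The two retrieval measures named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions: AUC as the probability that a randomly chosen active is ranked above a randomly chosen inactive, and EF5 as the hit rate among the top-scored 5% of compounds divided by the hit rate in the whole set. The function names, the rounding of the top-5% size, and the toy data are illustrative assumptions, not taken from the paper.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive count as half a correct pair.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


def enrichment_factor(scores, labels, fraction=0.05):
    """Hit rate in the top `fraction` of the ranking relative to the overall hit rate.

    Assumption: the size of the top slice is rounded to the nearest integer
    and is at least one compound.
    """
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need a non-empty set with at least one active")
    n_top = max(1, int(round(fraction * n)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    actives_in_top = sum(y for _, y in ranked[:n_top])
    return (actives_in_top / n_top) / (total_actives / n)


# Hypothetical toy ranking: 20 compounds, actives at ranks 1 and 3.
toy_scores = [float(s) for s in range(20, 0, -1)]
toy_labels = [1, 0, 1] + [0] * 17
toy_auc = roc_auc(toy_scores, toy_labels)
toy_ef5 = enrichment_factor(toy_scores, toy_labels)
```

With these definitions a random ranking gives an AUC near 0.5 and an EF5 near 1, which is the baseline against which the values quoted above would be read.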
Keywords: Affinity fingerprint; Bioactivity modeling; Biological fingerprint; QSAR; Scaffold hopping; Similarity searching
Year: 2020 PMID: 33431038 PMCID: PMC7260783 DOI: 10.1186/s13321-020-00443-6
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 The workflow for the calculation of the rv-QAFFP fingerprint. 1360 ligand sets (Additional file 1) assayed against various molecular targets were extracted from the ChEMBL19 database [50, 51]. For each ligand set, a Random Forest model was built using 80% of the data for training and 20% for testing. Each QSAR model was validated using both internal (i.e., cross-validated) and external (i.e., test set) error measures, and only models that satisfied stringent quality criteria were used for the construction of the rv-QAFFP fingerprint. The applicability domain of individual QSAR models was estimated using inductive conformal prediction [53–56]. The rv-QAFFP fingerprint is composed of 440 affinities predicted for the panel of assays covering 376 distinct molecular targets
Fig. 2 The representation of 12 target classes for all 1360 models and for the 440 models selected for the construction of QAFFP
The comparison of the performance of the Morgan2 (ECFP4) and b-QAFFP fingerprints for similarity searching for 69 HET data sets
| FP | Morgan2 | b-QAFFP | b-QAFFP | b-QAFFP | b-QAFFP | b-QAFFP | b-QAFFP | b-QAFFP |
|---|---|---|---|---|---|---|---|---|
| AD | – | No | No | No | No | Yes | Yes | Yes |
| Cutoff | – | 5 | 6 | 7 | 8 | 6 | 7 | 8 |
| AUC | 0.66 ± 0.01 | 0.63 ± 0.01 | 0.63 ± 0.01 | 0.65 ± 0.01 | 0.58 ± 0.01 | 0.62 ± 0.01 | 0.63 ± 0.01 | 0.56 ± 0.01 |
| EF5 | 6.41 ± 0.40 | 3.67 ± 0.25 | 4.52 ± 0.33 | 4.50 ± 0.30 | 2.27 ± 0.16 | 4.65 ± 0.32 | 3.97 ± 0.26 | 1.76 ± 0.12 |
Model AD was estimated by an ICP. Affinities predicted to lie outside model AD were encoded by zeros. Various affinity cutoffs were used to construct the b-QAFFP fingerprint. Best results are shown in a column in italic. Data shown are averages over all HET data sets with their standard errors of the mean. The b-QAFFP fingerprint is 384 bits long
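The construction of b-QAFFP described in this note can be sketched as follows. This is a minimal illustration under stated assumptions: the predicted affinities are taken to be on a negative-log scale (the cutoffs 5 to 8 suggest this), a bit is set when the prediction is greater than or equal to the cutoff (the direction and inclusiveness of the comparison are assumptions), and `in_domain` is a hypothetical per-model flag standing in for the ICP-based applicability domain decision.

```python
def binarize_affinities(predicted, cutoff, in_domain=None):
    """Turn predicted affinities into a binary affinity fingerprint.

    predicted : list of predicted affinities, one per QSAR model
    cutoff    : affinity threshold (assumed: bit is ON when value >= cutoff)
    in_domain : optional list of booleans, hypothetical flags marking whether
                each prediction lies inside the model's applicability domain;
                predictions outside the domain are encoded as 0
    """
    if in_domain is None:
        in_domain = [True] * len(predicted)
    if len(in_domain) != len(predicted):
        raise ValueError("in_domain must have one flag per prediction")
    return [
        1 if inside and value >= cutoff else 0
        for value, inside in zip(predicted, in_domain)
    ]


# Hypothetical predictions for four models.
example_predictions = [5.2, 6.8, 7.4, 8.1]
bits_no_ad = binarize_affinities(example_predictions, cutoff=7)
bits_with_ad = binarize_affinities(
    example_predictions, cutoff=7, in_domain=[True, True, False, True]
)
```

Raising the cutoff or enforcing the applicability domain can only switch bits off, so both make the fingerprint sparser.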
The comparison of the performance of the Morgan2 (ECFP4) and b-QAFFP fingerprints for similarity searching for 37 HOM data sets
| FP | Morgan2 | b-QAFFP | b-QAFFP | b-QAFFP | b-QAFFP | b-QAFFP | b-QAFFP | b-QAFFP |
|---|---|---|---|---|---|---|---|---|
| AD | – | No | No | No | No | Yes | Yes | Yes |
| Cutoff | – | 5 | 6 | 7 | 8 | 6 | 7 | 8 |
| AUC | 0.57 ± 0.02 | 0.61 ± 0.02 | 0.58 ± 0.03 | 0.61 ± 0.02 | 0.57 ± 0.02 | 0.59 ± 0.02 | 0.61 ± 0.02 | 0.56 ± 0.02 |
| EF5 | 4.09 ± 0.42 | 3.44 ± 0.30 | 3.52 ± 0.47 | 3.88 ± 0.54 | 2.33 ± 0.24 | 3.56 ± 0.51 | 3.39 ± 0.53 | 1.81 ± 0.21 |
Model AD was estimated by an ICP. Affinities predicted to lie outside model AD were encoded by zeros. Various affinity cutoffs were used to construct the b-QAFFP fingerprint. Best results are shown in a column in italic. Data shown are averages over all HOM data sets with their standard errors of the mean. The b-QAFFP fingerprint is 402 bits long
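The "average ± standard error of the mean" entries in these tables can be reproduced from per-data-set values along the following lines. The sketch assumes the usual definition of the SEM, the sample standard deviation (n − 1 denominator) divided by the square root of the number of data sets. The paper does not state which estimator was used, and the input values below are hypothetical.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for per-data-set results.

    Assumption: SEM = sample standard deviation / sqrt(n).
    """
    if len(values) < 2:
        raise ValueError("need at least two values to estimate the SEM")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical AUC values obtained on four data sets.
example_mean, example_sem = mean_and_sem([0.5, 0.6, 0.7, 0.6])
```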
The comparison of the performance of the Morgan2 (ECFP4), rv-QAFFP and b-QAFFP fingerprints for biological activity classification of 23 CLASS data sets
| FP | Morgan2 | rv-QAFFP | b-QAFFP | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| AD | – | No | Yes | No | Yes | ||||||
| Cutoff | – | – | – | 5 | 6 | 7 | 8 | 6 | 7 | 8 | |
| AUC | 0.87 ± 0.01 | 0.86 ± 0.02 | 0.83 ± 0.01 | 0.85 ± 0.01 | 0.84 ± 0.02 | 0.77 ± 0.01 | 0.85 ± 0.02 | 0.83 ± 0.02 | 0.73 ± 0.01 | ||
| EF5 | 2.16 ± 0.16 | 2.08 ± 0.13 | 2.03 ± 0.13 | 2.09 ± 0.14 | 2.08 ± 0.14 | 1.89 ± 0.11 | 2.08 ± 0.14 | 2.04 ± 0.13 | 1.78 ± 0.10 | ||
Model AD was estimated by an ICP with the confidence level of 90%. rv-QAFFP models were trained using raw data. Considering AD for rv-QAFFP means that if the prediction interval width was larger than ± 2.0, the prediction was regarded as unreliable and was replaced by the average of all reliably predicted affinities. Various affinity cutoffs were used to construct the b-QAFFP fingerprint. Affinities predicted to lie outside model AD were encoded by zeros. Best results are shown in columns in italic. Data shown are averages over all CLASS data sets with their standard errors of the mean. Both rv-QAFFP and b-QAFFP fingerprints are 440 bits long
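The applicability domain handling for rv-QAFFP described in this note might look like the sketch below. Two points are assumptions rather than facts from the source: "± 2.0" is read as a limit on the half-width of the prediction interval, and "the average of all reliably predicted affinities" is read as the mean over the reliable elements of the same compound's fingerprint. The function and parameter names are hypothetical.

```python
def apply_domain_to_rv(predicted, half_widths, max_half_width=2.0):
    """Replace unreliable predicted affinities by the mean of the reliable ones.

    predicted      : predicted affinities for one compound, one per model
    half_widths    : half-width of each prediction interval (assumed reading
                     of the "± 2.0" criterion)
    max_half_width : predictions with a wider interval are treated as
                     lying outside the applicability domain
    """
    if len(predicted) != len(half_widths):
        raise ValueError("one interval half-width is needed per prediction")
    reliable = [p for p, w in zip(predicted, half_widths) if w <= max_half_width]
    if not reliable:
        raise ValueError("no reliable prediction to average over")
    fill_value = sum(reliable) / len(reliable)
    return [
        p if w <= max_half_width else fill_value
        for p, w in zip(predicted, half_widths)
    ]


# Hypothetical compound: the second prediction has too wide an interval.
example_rv = apply_domain_to_rv([6.0, 9.0, 8.0], [1.0, 3.5, 1.5])
```

Unlike the binary case, where out-of-domain elements become zeros, this keeps the real-valued vector on the same scale as the reliable predictions.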
Fig. 3 The number of ACSKs identified by the Morgan2, b-QAFFP and rv-QAFFP fingerprints. The total number of ACSKs in the CLASS data sets is 1749
Fig. 4 ACSK recall using the b-QAFFP (a) and rv-QAFFP (b) fingerprints and their combination rv+b-QAFFP (c). Recall is the percentage of ACSKs revealed out of all ACSKs existing in the given data set
The average number of ACSKs per assay (and its standard error of the mean, SEM) in 22 CLASS sets revealed by the Morgan2, rv-QAFFP and b-QAFFP fingerprints
| FP | Morgan2 | rv-QAFFP | b-QAFFP | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| AD | No | Yes | No | Yes | |||||||
| Cutoff | – | – | – | 5 | 6 | 7 | 8 | 6 | 7 | 8 | |
| Average | 39.27 | 41.45 | 48.40 | 47.80 | 48.89 | 66.14 | 47.89 | 48.54 | 67.58 | ||
| SEM | 8.25 | 8.68 | 10.26 | 10.47 | 11.18 | 16.02 | 10.54 | 10.88 | 16.84 | ||
Model AD was estimated by an ICP with the confidence level of 90%. rv-QAFFP models were trained using raw data. Considering AD for rv-QAFFP means that if the prediction interval width was larger than ± 2.0, the prediction was regarded as unreliable and was replaced by the average of all reliably predicted affinities. Various affinity cutoffs were used to construct the b-QAFFP fingerprint. Affinities predicted to lie outside model AD were encoded by zeros. Data shown are averages over 22 CLASS data sets with their standard errors of the mean (SEM). Both rv-QAFFP and b-QAFFP fingerprints are 440 bits long. The recommended settings are shown in columns in italic
The average number of ON bits in b-QAFFPs calculated for HET set compounds
| AD | No | No | No | No | Yes | Yes | Yes | Yes |
|---|---|---|---|---|---|---|---|---|
| Cutoff | 5 | 6 | 7 | 8 | 5 | 6 | 7 | 8 |
| Average [%] | 92.5 | 53.5 | 14.5 | 1.6 | 71.1 | 39.4 | 10.4 | 1.2 |
| Average [count] | 407 | 235 | 64 | 7 | 313 | 174 | 46 | 5 |
Model AD was estimated by an ICP with the confidence level of 90%, and the maximum prediction interval width used to decide whether a prediction is reliable was set to ± 2.0. Affinities predicted to lie outside model AD were encoded by zeros. b-QAFFP is 440 bits long
The average number of ACSKs per assay revealed by the Morgan2, rv-QAFFP and b-QAFFP fingerprints in 22 CLASS sets
| | Morgan2 | rv-QAFFP | b-QAFFP | rv+b-QAFFP |
|---|---|---|---|---|
| Average # of ACSKs | 39.27 ± 8.25 | 41.41 ± 8.51 | 48.41 ± 10.37 | 52.10 ± 11.12 |
In addition, the union of ACSKs revealed by both rv-QAFFP and b-QAFFP is reported. Averages are shown together with their standard errors of the mean. Additional file 4 contains detailed information about the number of revealed ACSKs for individual assays
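The rv+b-QAFFP column and the recall measure used for ACSKs can be illustrated with a short sketch. It assumes each fingerprint's result for an assay is available as a set of scaffold identifiers. The identifiers and the function name are hypothetical, and recall is computed as defined in the Fig. 4 caption (percentage of existing ACSKs that were revealed).

```python
def recall_percent(revealed, existing):
    """Percentage of existing scaffolds that were revealed."""
    existing = set(existing)
    if not existing:
        raise ValueError("the data set must contain at least one scaffold")
    return 100.0 * len(set(revealed) & existing) / len(existing)


# Hypothetical scaffold identifiers for one assay.
existing_scaffolds = {"s1", "s2", "s3", "s4"}
revealed_by_rv = {"s1", "s2"}
revealed_by_b = {"s2", "s3"}

# The combined result is the set union of what either fingerprint revealed.
revealed_by_union = revealed_by_rv | revealed_by_b

recall_rv = recall_percent(revealed_by_rv, existing_scaffolds)
recall_union = recall_percent(revealed_by_union, existing_scaffolds)
```

Because the union can never be smaller than either of its parts, the combined count is at least as large as the rv-QAFFP and b-QAFFP counts taken separately. For scale, the totals quoted in the abstract correspond to about 65.5% (1146 of 1749) for QAFFP and about 49.4% (864 of 1749) for Morgan2.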