Norberto Sánchez-Cruz, José L. Medina-Franco.
Abstract
BACKGROUND: Simplified representations of compound databases have several applications in cheminformatics. Herein, we introduce an alternative and general method to build single-fingerprint representations of compound databases. The approach is inspired by the previously published modal fingerprints, which aim to capture the most significant bits of a fingerprint representation for a compound data set. The novelty of the proposed statistical-based database fingerprint (SB-DFP) is that it is generated from binomial proportion comparisons, taking as reference the distribution of "1" bits in a large representative set of the chemical space.
Keywords: Chemical space; Epi-informatics; Molecular fingerprints; Representation; Similarity searching
Year: 2018 PMID: 30467740 PMCID: PMC6755589 DOI: 10.1186/s13321-018-0311-x
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Schematic representation of single fingerprints for a compound database and a hypothetical 20-bit fingerprint. The upper part of each chart shows the binary representation of the generated single fingerprint: (a) database fingerprint (DFP) and (b) statistical-based database fingerprint (SB-DFP)
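To make the difference between the two representations in Fig. 1 concrete, the following Python sketch builds both from a toy set of binary fingerprints. It is an illustration under stated assumptions, not the authors' implementation: the 0.5 frequency cutoff for the DFP, the one-sided one-proportion z-test for the SB-DFP, the critical value `z_crit` (2.326, roughly a one-sided 1% level), the toy data, and all function names are assumptions introduced here.

```python
import math


def bit_frequencies(fps):
    """Fraction of fingerprints in the set that have each bit switched on."""
    n = len(fps)
    return [sum(column) / n for column in zip(*fps)]


def dfp(fps, threshold=0.5):
    """Database fingerprint: a bit is 1 if it is frequent within the set.

    The 0.5 cutoff is an assumption made for this sketch.
    """
    return [1 if f > threshold else 0 for f in bit_frequencies(fps)]


def sb_dfp(fps, ref_freqs, z_crit=2.326):
    """Statistical-based database fingerprint (sketch).

    A bit is 1 if its proportion in the set is significantly larger than its
    proportion in the reference set. The one-sided one-proportion z-test and
    the critical value are assumptions made for this sketch.
    """
    n = len(fps)
    result = []
    for p_hat, p0 in zip(bit_frequencies(fps), ref_freqs):
        se = math.sqrt(p0 * (1.0 - p0) / n)
        if se == 0.0:
            # Reference proportion is exactly 0 or 1: fall back to a plain comparison.
            result.append(1 if p_hat > p0 else 0)
        else:
            result.append(1 if (p_hat - p0) / se > z_crit else 0)
    return result


# Hypothetical 4-bit fingerprints for a data set of four compounds.
toy_set = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
# Hypothetical per-bit frequencies in a large reference collection.
reference_freqs = [0.9, 0.1, 0.0, 0.5]

toy_dfp = dfp(toy_set)
toy_sb_dfp = sb_dfp(toy_set, reference_freqs)
```

In this toy case the first bit is on in every compound of the set but is almost as common in the reference, so the DFP keeps it and the SB-DFP drops it; the second bit is rare in the reference and enriched in the set, so both representations keep it.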
Selected datasets from the epigenomic database
| Dataset | Number of compounds | Intra-set similarity median (Tc) | | Average “1” bits | | Number of “1” bits in DFP | | Number of “1” bits in SB-DFP | |
|---|---|---|---|---|---|---|---|---|---|
| | | MACCS^a | ECFP4^b | MACCS^a | ECFP4^b | MACCS^a | ECFP4^b | MACCS^a | ECFP4^b |
| BRD2 | 234 | 0.569 | 0.152 | 56.0 | 54.3 | 53 | 27 | 67 | 229 |
| BRD3 | 246 | 0.573 | 0.153 | 56.6 | 54.6 | 53 | 26 | 73 | 231 |
| BRD4 | 477 | 0.486 | 0.133 | 55.9 | 52.8 | 47 | 14 | 71 | 333 |
| CREBBP | 105 | 0.694 | 0.276 | 56.1 | 53.9 | 52 | 36 | 50 | 185 |
| DNMT1 | 127 | 0.403 | 0.115 | 55.4 | 51.7 | 50 | 13 | 62 | 281 |
| EHMT2 | 61 | 0.636 | 0.228 | 62.4 | 55.7 | 62 | 41 | 56 | 167 |
| EP300 | 57 | 0.425 | 0.106 | 58.2 | 55.7 | 53 | 11 | 56 | 285 |
| HDAC10 | 190 | 0.514 | 0.165 | 53.2 | 50.6 | 50 | 17 | 46 | 272 |
| HDAC11 | 137 | 0.494 | 0.156 | 51.2 | 50.8 | 48 | 16 | 42 | 229 |
| HDAC1 | 2740 | 0.453 | 0.149 | 53.2 | 51.4 | 51 | 15 | 63 | 499 |
| HDAC2 | 767 | 0.447 | 0.149 | 50.3 | 48.4 | 46 | 13 | 53 | 336 |
| HDAC3 | 669 | 0.474 | 0.147 | 52.6 | 50.3 | 49 | 13 | 54 | 356 |
| HDAC4 | 452 | 0.427 | 0.135 | 50.4 | 46.4 | 42 | 10 | 49 | 248 |
| HDAC5 | 112 | 0.455 | 0.153 | 47.3 | 44.1 | 39 | 13 | 26 | 176 |
| HDAC6 | 1374 | 0.474 | 0.149 | 54.3 | 49.8 | 48 | 13 | 62 | 415 |
| HDAC7 | 112 | 0.489 | 0.165 | 50.4 | 45.8 | 43 | 12 | 28 | 197 |
| HDAC8 | 864 | 0.500 | 0.153 | 54.9 | 51.2 | 50 | 12 | 52 | 398 |
| HDAC9 | 102 | 0.494 | 0.169 | 52.6 | 47.4 | 46 | 13 | 29 | 190 |
| KAT2B | 55 | 0.583 | 0.179 | 50.8 | 37.3 | 46 | 13 | 44 | 99 |
| KDM1A | 241 | 0.380 | 0.143 | 44.8 | 46.2 | 31 | 21 | 31 | 216 |
| KDM4C | 88 | 0.359 | 0.101 | 48.8 | 40.3 | 41 | 10 | 38 | 158 |
| L3MBTL1 | 50 | 0.804 | 0.551 | 42.2 | 36.8 | 37 | 27 | 37 | 56 |
| L3MBTL3 | 89 | 0.731 | 0.404 | 40.4 | 36.6 | 37 | 26 | 35 | 83 |
| MAP3K7 | 96 | 0.539 | 0.137 | 57.1 | 60.5 | 59 | 35 | 45 | 190 |
| MGEA5 | 67 | 0.683 | 0.316 | 54.2 | 39.6 | 48 | 19 | 42 | 126 |
| NCOA1 | 51 | 0.350 | 0.105 | 45.5 | 43.3 | 34 | 11 | 18 | 132 |
| NCOA3 | 157 | 0.368 | 0.109 | 47.7 | 44.6 | 39 | 10 | 26 | 166 |
| PRMT1 | 61 | 0.395 | 0.076 | 53.0 | 53.5 | 41 | 9 | 40 | 239 |
| Average | 350 | 0.507 | 0.178 | 52 | 48 | 46 | 18 | 46 | 232 |
^a MACCS keys, 166-bit
^b ECFP4, 2048-bit
Range of Tanimoto similarity values in similarity matrices
| Representation | MACCS keys (166-bit) | | | | ECFP4 (2048-bit) | | | |
|---|---|---|---|---|---|---|---|---|
| | Minimum | Average | Maximum | Range | Minimum | Average | Maximum | Range |
| All compounds^a | 0.293 | 0.407 | 0.804 | 0.511 | 0.059 | 0.114 | 0.553 | 0.494 |
| DFP | 0.254 | 0.540 | 1.000 | 0.746 | 0.070 | 0.408 | 1.000 | 0.930 |
| SB-DFP | 0.050 | 0.342 | 1.000 | 0.950 | 0.011 | 0.185 | 1.000 | 0.989 |
^a In the all-compound comparisons, the self-similarity of a data set does not reach a value of 1, and in some cases it is not even the highest value in its matrix row. This could be misread as some pairs of databases being more similar to each other than to themselves, which makes no sense. The matrices built with DFP or SB-DFP do not have this problem: because each pair of data sets is compared through a single fingerprint comparison, a maximum of 1 is guaranteed on the diagonal of the matrix
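The contrast described in this footnote can be illustrated with a short Python sketch. It assumes that an all-compound comparison is the mean of all pairwise Tanimoto coefficients between two sets, which may differ from the exact protocol used in the study; the function names and the toy fingerprints are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two binary fingerprints of equal length."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    # Convention chosen for this sketch: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0


def mean_pairwise_tanimoto(set_a, set_b):
    """All-compound comparison (assumed here): mean over every cross pair."""
    sims = [tanimoto(a, b) for a in set_a for b in set_b]
    return sum(sims) / len(sims)


# Hypothetical data set with two compounds and 4-bit fingerprints.
toy_set = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
# Hypothetical single-fingerprint representation of the same data set.
toy_db_fp = [1, 1, 1, 0]

self_sim_all_compounds = mean_pairwise_tanimoto(toy_set, toy_set)
self_sim_single_fp = tanimoto(toy_db_fp, toy_db_fp)
```

Comparing the set with itself compound by compound averages in the dissimilar pairs and stays below 1, whereas a single non-empty fingerprint compared with itself is always exactly 1.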
Fig. 2 Dendrograms for hierarchical clustering of targets computed with different approaches based on two molecular fingerprints, MACCS keys and ECFP4. (a) The ground truth; (b, e) all-compound comparisons (ACC); (c, f) database fingerprint (DFP); (d, g) statistical-based database fingerprint (SB-DFP). The Adjusted Rand Index (ARI) of each clustering is indicated in each panel. See main text for details
Fig. 3 Early enrichment performance of similarity searches. Average recovery rates (selection set size equal to the number of ADCs) for three search strategies over 28 epigenetic data sets are shown as histograms for (a) MACCS keys and (b) ECFP4. Standard deviations are displayed as error bars
Fig. 4 General performance of similarity searches. Average AUCs for three search strategies over 28 epigenetic data sets are shown as histograms for (a) MACCS keys and (b) ECFP4. Standard deviations are displayed as error bars
Average recovery rates
| Dataset | MACCS keys (166-bit) | | | ECFP4 (2048-bit) | | |
|---|---|---|---|---|---|---|
| | 1-NN | DFP | SB-DFP | 1-NN | DFP | SB-DFP |
| BRD2 | 43.7 (5.0) | 29.9 (13.7) | 13.8 (12.8) | | 28.4 (24.2) | 68.0 (7.1) |
| BRD3 | 43.5 (4.8) | 32.0 (12.3) | 10.6 (11.3) | | 31.9 (23.8) | 68.7 (7.1) |
| BRD4 | 30.0 (5.4) | 7.6 (7.7) | 4.5 (4.3) | | 2.7 (4.7) | 52.6 (8.1) |
| CREBBP | 52.7 (4.7) | 45.5 (7.8) | 16.5 (16.2) | | 55.6 (25.0) | 73.7 (4.2) |
| DNMT1 | 9.9 (5.2) | 0.5 (1.5) | 3.8 (3.9) | 12.9 (5.7) | 0.0 (0.0) | |
| EHMT2 | 66.3 (7.1) | 40.9 (12.6) | 28.1 (17.8) | | 40.2 (23.5) | 78.4 (8.3) |
| EP300 | 34.6 (7.5) | 5.5 (5.8) | 1.4 (2.7) | | 0.7 (2.8) | 37.0 (10.8) |
| HDAC10 | 37.1 (8.6) | 34.2 (15.1) | | 36.5 (8.0) | 15.4 (12.3) | |
| HDAC11 | 34.7 (8.3) | 22.5 (12.4) | 43.7 (12.1) | 39.6 (8.8) | 6.6 (6.4) | |
| HDAC1 | 18.2 (6.1) | 15.8 (13.5) | | 30.9 (6.7) | 6.3 (5.1) | 51.1 (9.0) |
| HDAC2 | 20.9 (7.0) | 20.1 (16.1) | | 31.3 (6.5) | 9.1 (6.0) | 44.7 (10.6) |
| HDAC3 | 27.5 (8.7) | 27.3 (13.1) | | 32.0 (6.2) | 10.4 (6.6) | 45.4 (9.6) |
| HDAC4 | 19.2 (4.7) | 9.1 (7.3) | 29.6 (11.0) | | 7.9 (11.2) | |
| HDAC5 | 20.7 (9.6) | 30.2 (12.1) | | 23.1 (6.4) | 10.0 (4.3) | 32.0 (12.1) |
| HDAC6 | 22.8 (6.7) | 32.0 (15.1) | | 25.7 (5.8) | 9.3 (9.1) | 44.6 (9.0) |
| HDAC7 | 25.6 (8.4) | 36.6 (11.7) | | 28.4 (6.7) | 11.0 (4.9) | 38.6 (10.4) |
| HDAC8 | 27.4 (7.0) | 33.9 (11.9) | | 29.6 (6.9) | 9.5 (3.9) | 46.2 (9.8) |
| HDAC9 | 25.4 (9.0) | 34.9 (11.9) | | 27.7 (7.5) | 9.6 (8.7) | 38.4 (13.0) |
| KAT2B | 55.3 (12.7) | 41.0 (8.7) | 37.6 (13.5) | | 35.3 (14.1) | |
| KDM1A | 24.6 (5.1) | 13.3 (8.0) | 6.8 (5.6) | 53.3 (8.4) | 18.3 (15.1) | |
| KDM4C | 12.2 (5.1) | 0.4 (1.0) | 11.5 (8.7) | | 0.1 (0.3) | 17.1 (5.8) |
| L3MBTL1 | 62.2 (8.5) | 68.8 (4.6) | 66.0 (11.1) | 91.1 (4.6) | 94.5 (1.8) | |
| L3MBTL3 | 59.5 (8.5) | 49.7 (4.2) | 37.4 (11.2) | | 71.1 (4.5) | 81.1 (6.8) |
| MAP3K7 | 41.2 (6.0) | 19.8 (14.3) | 2.2 (3.1) | 56.6 (5.2) | 31.1 (23.8) | |
| MGEA5 | 58.5 (25.6) | 84.8 (4.9) | 84.6 (1.7) | 86.3 (3.5) | 86.4 (2.0) | |
| NCOA1 | 2.7 (2.1) | 0.0 (0.2) | | | 0.1 (0.3) | |
| NCOA3 | 1.1 (0.9) | 0.1 (0.2) | | 2.6 (1.4) | 0.1 (0.3) | |
| PRMT1 | 48.8 (8.7) | 2.8 (5.7) | 2.7 (4.3) | 52.8 (10.5) | 1.0 (3.8) | |
| Average | | | | | | |
The best performing methods for each dataset are shown in bold. If there was no significant difference between two or more methods, all of them are marked. Standard deviations are shown in parentheses
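As a rough illustration of how a recovery rate of this kind can be computed, the Python sketch below ranks a screening pool by similarity score and counts the actives found in the top of the list. It assumes that the selection set size mentioned in the Fig. 3 caption equals the number of actives hidden in the pool, and that scores are similarities to a query (a nearest neighbour or a database fingerprint); the function name and the toy data are hypothetical.

```python
def recovery_rate(scores, is_active):
    """Percentage of actives found among the top-k ranked compounds.

    k is set to the number of actives in the pool (an assumption of this
    sketch). Higher scores are ranked first; ties keep their input order.
    """
    k = sum(is_active)
    if k == 0:
        raise ValueError("the screening pool contains no actives")
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits = sum(active for _, active in ranked[:k])
    return 100.0 * hits / k


# Hypothetical similarity scores and activity labels (1 = active, 0 = decoy).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 0]

toy_recovery = recovery_rate(toy_scores, toy_labels)
```

With two actives in the toy pool, the selection set holds the two top-ranked compounds; one of them is active, so the recovery rate is 50%.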
Average areas under ROC curves
| Dataset | MACCS keys (166-bit) | | | ECFP4 | | |
|---|---|---|---|---|---|---|
| | 1-NN | DFP | SB-DFP | 1-NN | DFP | SB-DFP |
| BRD2 | 0.938 (0.035) | 0.875 (0.019) | 0.911 (0.031) | | 0.865 (0.037) | 0.970 (0.030) |
| BRD3 | 0.940 (0.041) | 0.873 (0.015) | 0.905 (0.029) | | 0.861 (0.056) | |
| BRD4 | 0.880 (0.038) | 0.821 (0.036) | 0.871 (0.040) | 0.927 (0.037) | 0.740 (0.082) | |
| CREBBP | 0.953 (0.025) | 0.924 (0.008) | 0.963 (0.009) | 0.956 (0.027) | 0.913 (0.016) | |
| DNMT1 | 0.652 (0.045) | 0.652 (0.049) | | 0.711 (0.058) | 0.484 (0.060) | 0.834 (0.042) |
| EHMT2 | | 0.897 (0.027) | 0.965 (0.023) | 0.951 (0.050) | 0.860 (0.042) | 0.947 (0.036) |
| EP300 | 0.874 (0.041) | 0.810 (0.052) | | 0.843 (0.066) | 0.592 (0.076) | 0.873 (0.052) |
| HDAC10 | 0.932 (0.022) | 0.916 (0.043) | 0.946 (0.025) | 0.934 (0.032) | 0.821 (0.063) | |
| HDAC11 | 0.939 (0.024) | 0.899 (0.073) | 0.940 (0.034) | 0.948 (0.035) | 0.786 (0.065) | |
| HDAC1 | 0.797 (0.036) | 0.755 (0.085) | 0.886 (0.041) | 0.884 (0.035) | 0.688 (0.073) | |
| HDAC2 | 0.847 (0.035) | 0.808 (0.081) | 0.895 (0.042) | 0.905 (0.032) | 0.750 (0.048) | |
| HDAC3 | 0.875 (0.032) | 0.862 (0.059) | 0.888 (0.032) | 0.892 (0.035) | 0.725 (0.062) | |
| HDAC4 | 0.841 (0.039) | 0.781 (0.067) | 0.888 (0.021) | 0.890 (0.039) | 0.672 (0.060) | |
| HDAC5 | 0.866 (0.066) | 0.838 (0.036) | | 0.917 (0.030) | 0.840 (0.049) | |
| HDAC6 | 0.828 (0.028) | 0.825 (0.042) | 0.895 (0.011) | 0.868 (0.026) | 0.743 (0.072) | |
| HDAC7 | 0.907 (0.037) | 0.925 (0.037) | | 0.913 (0.027) | 0.864 (0.020) | 0.934 (0.024) |
| HDAC8 | 0.878 (0.024) | 0.883 (0.054) | 0.937 (0.011) | 0.896 (0.028) | 0.762 (0.043) | |
| HDAC9 | 0.901 (0.028) | 0.933 (0.031) | 0.943 (0.012) | 0.942 (0.019) | 0.885 (0.026) | |
| KAT2B | 0.926 (0.039) | 0.893 (0.022) | 0.947 (0.033) | 0.935 (0.027) | 0.928 (0.022) | |
| KDM1A | 0.745 (0.048) | 0.701 (0.051) | 0.860 (0.055) | 0.885 (0.038) | 0.721 (0.058) | |
| KDM4C | 0.677 (0.067) | 0.608 (0.069) | | 0.653 (0.052) | 0.527 (0.045) | 0.823 (0.048) |
| L3MBTL1 | 0.997 (0.001) | 0.999 (0.000) | | 1.000 (0.000) | 1.000 (0.000) | |
| L3MBTL3 | 0.990 (0.003) | | | 0.989 (0.005) | 0.985 (0.004) | |
| MAP3K7 | 0.860 (0.042) | 0.791 (0.028) | 0.861 (0.027) | 0.858 (0.042) | 0.738 (0.079) | |
| MGEA5 | 0.985 (0.005) | 0.985 (0.006) | 0.979 (0.009) | 0.979 (0.009) | | 0.992 (0.007) |
| NCOA1 | 0.491 (0.074) | 0.572 (0.073) | 0.682 (0.056) | 0.618 (0.047) | 0.519 (0.060) | |
| NCOA3 | 0.530 (0.057) | 0.577 (0.071) | 0.680 (0.064) | 0.590 (0.045) | 0.503 (0.059) | |
| PRMT1 | 0.867 (0.058) | 0.673 (0.072) | 0.843 (0.078) | 0.881 (0.081) | 0.365 (0.081) | |
| Average | | | | | | |
The best performing methods for each dataset are shown in bold. If there was no significant difference between two or more methods, all of them are marked. Standard deviations are shown in parentheses
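An area under the ROC curve summarizes the whole ranking rather than only its top. A minimal Python sketch, assuming the standard rank-based (Mann-Whitney) definition and not necessarily the exact procedure used to produce the table: the AUC is the fraction of active/decoy pairs in which the active receives the higher score, with ties counted as one half. The function name and the toy data are hypothetical.

```python
def roc_auc(scores, is_active):
    """Area under the ROC curve from raw scores and binary labels.

    Computed as the probability that a randomly chosen active outscores a
    randomly chosen decoy; tied pairs contribute 0.5.
    """
    actives = [s for s, a in zip(scores, is_active) if a]
    decoys = [s for s, a in zip(scores, is_active) if not a]
    if not actives or not decoys:
        raise ValueError("both actives and decoys are required")
    wins = 0.0
    for pos in actives:
        for neg in decoys:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


# Hypothetical similarity scores and activity labels (1 = active, 0 = decoy).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 0]

toy_auc = roc_auc(toy_scores, toy_labels)
```

In the toy ranking, five of the six active/decoy pairs are ordered correctly, so the AUC is 5/6; a value of 0.5 corresponds to a random ranking, which is a useful reference when reading the lowest entries in the table.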