Marek Śmieja, Dawid Warszycki.
Abstract
Fingerprints, bit representations of compound chemical structure, have been widely used in cheminformatics for many years. Although fingerprints with the highest resolution display satisfactory performance in virtual screening campaigns, the presence of a relatively high number of irrelevant bits introduces noise into data and makes their application more time-consuming. In this study, we present a new method of hybrid reduced fingerprint construction, the Average Information Content Maximization algorithm (AIC-Max algorithm), which selects the most informative bits from a collection of fingerprints. This methodology, applied to the ligands of five cognate serotonin receptors (5-HT2A, 5-HT2B, 5-HT2C, 5-HT5A, 5-HT6), proved that 100 bits selected from four non-hashed fingerprints reflect almost all structural information required for a successful in silico discrimination test. A classification experiment indicated that a reduced representation is able to achieve even slightly better performance than the state-of-the-art 10-times-longer fingerprints and in a significantly shorter time.
Year: 2016 PMID: 26784447 PMCID: PMC4718645 DOI: 10.1371/journal.pone.0146666
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Exemplary hashed (A) and non-hashed (B) fingerprints.
A "1" or "0" corresponds to the presence or absence of a particular pattern, respectively. For the hashed fingerprint (A), the bit collision phenomenon is shown: one bit encodes more than one motif.
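The difference between the two fingerprint types can be illustrated with a toy sketch. The pattern strings, the use of CRC32 as the hash, and the function names below are all illustrative assumptions; they do not reproduce the hashing scheme of any real fingerprint implementation. Because five patterns are hashed into four bits, at least one collision is guaranteed.

```python
import zlib

def hashed_fingerprint(patterns, n_bits):
    """Toy hashed fingerprint: each pattern is hashed to one bit position.

    Returns the bit vector and a map from bit index to the patterns that set it,
    so collisions (several patterns on one bit) become visible.
    """
    bits = [0] * n_bits
    owners = {}
    for pattern in patterns:
        idx = zlib.crc32(pattern.encode("utf-8")) % n_bits  # assumed hash, illustration only
        bits[idx] = 1
        owners.setdefault(idx, []).append(pattern)
    return bits, owners

def keyed_fingerprint(patterns, dictionary):
    """Toy non-hashed fingerprint: one fixed bit per dictionary pattern."""
    present = set(patterns)
    return [1 if key in present else 0 for key in dictionary]

# Hypothetical structural patterns, used only as string keys here.
dictionary = ["C-C", "C=O", "C-N", "O-H", "c1ccccc1"]

# Five patterns into four bits: by the pigeonhole principle some bit
# must encode more than one pattern.
bits, owners = hashed_fingerprint(dictionary, n_bits=4)
collisions = {idx: pats for idx, pats in owners.items() if len(pats) > 1}

# The keyed variant has no collisions: each position means exactly one pattern.
keyed = keyed_fingerprint(["C-C", "C=O"], dictionary)

print("hashed bits:", bits)
print("colliding bits:", collisions)
print("keyed bits:", keyed)
```

In the keyed representation a set bit can always be traced back to a single motif, which is what makes bit-level selection interpretable for non-hashed fingerprints.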
Fig 2. The relationship between the number of bits selected by the AIC-Max algorithm and the information related to activity.
The information, measured by AIC (Eq 1), was averaged over all datasets used in the underlying study.
Minimal and maximal values of AIC.
The 3-bit fingerprint representation X1 X2 X3 of eight compounds and their activity labels Y1, Y2, Y3 for three biological targets, as listed in the table. Since the activity at the i-th receptor is fully determined by the single feature Xi, the AIC of Xi with respect to Yi equals 1 for i = 1, 2, 3. In contrast, the AIC of Xi with respect to Yj equals 0 for i ≠ j, because Yj is independent of Xi. Finally, the AIC of two bits, e.g. (X1, X2), with respect to all three targets equals 2/3, since the activity of two out of three receptors is fully reflected by two bits.
| compound no. | X1 | X2 | X3 | Y1 | Y2 | Y3 |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 1 | 0 | 1 | 1 |
| 5 | 1 | 0 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 1 | 1 | 0 | 1 |
| 7 | 1 | 1 | 0 | 1 | 1 | 0 |
| 8 | 1 | 1 | 1 | 1 | 1 | 1 |
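As a worked check of the extreme values, assume that AIC is the mutual information between the selected bits and each target, normalised by the target entropy and averaged over the n targets. This form is an assumption, since Eq (1) is not reproduced here, but it agrees with the values quoted in the captions. In the table above every Y_i is balanced, so H(Y_i) = 1 bit, and Y_i = X_i:

```latex
\mathrm{AIC}(X) \;=\; \frac{1}{n}\sum_{i=1}^{n}\frac{I(X;\,Y_i)}{H(Y_i)}
\qquad \text{(assumed form of Eq (1))}

% Two bits, three targets: X_1 and X_2 determine Y_1 and Y_2, and are independent of Y_3.
\mathrm{AIC}(X_1, X_2)
  \;=\; \frac{1}{3}\left(
        \frac{I(X_1,X_2;\,Y_1)}{H(Y_1)}
      + \frac{I(X_1,X_2;\,Y_2)}{H(Y_2)}
      + \frac{I(X_1,X_2;\,Y_3)}{H(Y_3)}
    \right)
  \;=\; \frac{1}{3}\,(1 + 1 + 0) \;=\; \frac{2}{3}
```

The minimum, 0, is reached when the selected bits are independent of every target, and the maximum, 1, when they determine all of them.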
Influence of dependent and independent bits on AIC.
The activity of a given receptor depends only on two out of four features: X1 and X2. Adding feature X3 to X1 does not change AIC because X3 is independent of Y, which results in AIC(X1) = AIC(X1, X3) = 0.38. The same holds for X4, which is completely correlated with X1, so AIC(X1) = AIC(X1, X4) = 0.38.
| compound no. | X1 | X2 | X3 | X4 | Y |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 |
| 3 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 1 | 1 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 |
| 6 | 1 | 0 | 1 | 0 | 0 |
| 7 | 1 | 1 | 0 | 0 | 1 |
| 8 | 1 | 1 | 1 | 0 | 1 |
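The values in this table can be checked numerically. The sketch below assumes that AIC is the mutual information between the joint value of the selected bits and each target, divided by the target entropy and averaged over targets. It also assumes that bit selection is a greedy forward search; that is one plausible reading of "selects the most informative bits" and is not confirmed by the text here. The function names `entropy`, `aic` and `greedy_select` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable outcomes."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def aic(rows, bit_idx, targets):
    """Assumed AIC: mean over targets of I(X_S; Y) / H(Y),
    where X_S is the joint value of the bits listed in bit_idx."""
    xs = [tuple(row[i] for i in bit_idx) for row in rows]
    total = 0.0
    for y in targets:
        h_y = entropy(y)
        # I(X; Y) = H(X) + H(Y) - H(X, Y)
        mutual_info = entropy(xs) + h_y - entropy(list(zip(xs, y)))
        total += mutual_info / h_y
    return total / len(targets)

def greedy_select(rows, targets, k):
    """Assumed selection strategy: repeatedly add the bit that maximises AIC."""
    selected = []
    for _ in range(k):
        remaining = [i for i in range(len(rows[0])) if i not in selected]
        best = max(remaining, key=lambda i: aic(rows, selected + [i], targets))
        selected.append(best)
    return selected

# Bits X1..X4 for the eight compounds in the table, and the single target Y.
rows = [
    (0, 0, 0, 1), (0, 0, 1, 1), (0, 1, 0, 1), (0, 1, 1, 1),
    (1, 0, 0, 0), (1, 0, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0),
]
y = [0, 0, 0, 0, 0, 0, 1, 1]

aic_x1 = aic(rows, [0], [y])        # the caption reports 0.38
aic_x1_x3 = aic(rows, [0, 2], [y])  # X3 is independent of Y
aic_x1_x4 = aic(rows, [0, 3], [y])  # X4 is fully correlated with X1
aic_x1_x2 = aic(rows, [0, 1], [y])  # X1 and X2 together determine Y
chosen = greedy_select(rows, [y], 2)

print(round(aic_x1, 2), round(aic_x1_x3, 2), round(aic_x1_x4, 2), round(aic_x1_x2, 2))
print("greedy choice (0-based bit indices):", chosen)
```

Under these assumptions, neither the independent bit nor the redundant bit raises the score, so a greedy search has no reason to pick them before a bit that carries new information about activity.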
Fingerprints generated in PaDEL software [18].
| Fingerprint | Abbreviation | Hashed | Length |
|---|---|---|---|
| EState fingerprint | estate | NO | 79 |
| MACCS fingerprint | maccs | NO | 166 |
| PubChem fingerprint | pubchem | NO | 881 |
| Substructure fingerprint | substructure | NO | 308 |
| Klekota Roth fingerprint | KRFP | NO | 4860 |
| Fingerprint | fingerprint | YES | 1024 |
| Extended fingerprint | extended | YES | 1024 |
| Graph-only fingerprint | graph only | YES | 1024 |
The summary of datasets used in the selection process.
| Receptor | Actives | Inactives | ZINC |
|---|---|---|---|
| 5-HT2A | 2060 | 1081 | 18540 |
| 5-HT2B | 428 | 341 | 3852 |
| 5-HT2C | 1303 | 1050 | 11727 |
| 5-HT5A | 69 | 146 | 621 |
| 5-HT6 | 1626 | 426 | 14634 |
| 5-HT1 | 4427 | 1230 | 39843 |
Fig 3. The relationship between the number of bits selected by the AIC-Max algorithm and the associated information about activity.
The information score was measured by the normalized mutual information calculated for the constructed representations of every receptor, averaged over all folds and reported on the test set.
Fig 4. Classification performance.
The relationship between the number of bits selected by the AIC-Max algorithm and the associated MCC score for every receptor, averaged over all folds and reported on the test set.
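For reference, the MCC (Matthews correlation coefficient) used as the performance score can be computed from the four confusion-matrix counts. The sketch below is a generic implementation of the standard formula, not the authors' evaluation code; the function names and the example labels are illustrative assumptions.

```python
from math import sqrt

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Made-up labels for six compounds, for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
score = mcc(*confusion_counts(y_true, y_pred))
print(score)
```

MCC ranges from -1 to 1 and stays informative when the classes are imbalanced, as they are in the datasets of actives against a much larger pool of putative inactives.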
Classification performance on a dataset containing actives and inactives.
| fingerprint | 5-HT2A | 5-HT2B | 5-HT2C | 5-HT5A | 5-HT6 | mean |
|---|---|---|---|---|---|---|
| reduced(25) | 0.679 | 0.521 | 0.708 | 0.698 | 0.737 | 0.669 |
| reduced(50) | 0.731 | 0.558 | 0.743 | 0.724 | 0.746 | 0.701 |
| reduced(100) | 0.736 | 0.761 | 0.759 | 0.778 | ||
| estate | 0.425 | 0.448 | 0.501 | 0.614 | 0.584 | 0.514 |
| maccs | 0.713 | 0.607 | 0.741 | 0.760 | 0.755 | 0.715 |
| pubchem | 0.730 | 0.545 | 0.739 | 0.739 | 0.709 | |
| substructure | 0.500 | 0.483 | 0.551 | 0.647 | 0.595 | 0.555 |
| KRFP | 0.697 | 0.565 | 0.707 | 0.766 | 0.742 | 0.695 |
| extended | 0.596 | 0.736 | 0.803 | 0.730 | ||
| fingerprinter | 0.733 | 0.591 | 0.773 | 0.745 | 0.730 | |
| graphonly | 0.703 | 0.559 | 0.716 | 0.788 | 0.774 | 0.708 |
Classification performance on a dataset containing actives and putative inactives.
| fingerprint | 5-HT2A | 5-HT2B | 5-HT2C | 5-HT5A | 5-HT6 | mean |
|---|---|---|---|---|---|---|
| reduced(25) | 0.889 | 0.828 | 0.887 | 0.876 | 0.933 | 0.883 |
| reduced(50) | 0.939 | 0.878 | 0.939 | 0.966 | 0.929 | |
| reduced(100) | 0.919 | |||||
| estate | 0.604 | 0.503 | 0.563 | 0.725 | 0.844 | 0.648 |
| maccs | 0.936 | 0.877 | 0.932 | 0.894 | 0.970 | 0.922 |
| pubchem | 0.931 | 0.839 | 0.916 | 0.886 | 0.967 | 0.908 |
| substructure | 0.820 | 0.660 | 0.743 | 0.783 | 0.906 | 0.782 |
| KRFP | 0.932 | 0.841 | 0.925 | 0.862 | 0.965 | 0.905 |
| extended | 0.936 | 0.858 | 0.920 | 0.884 | 0.967 | 0.913 |
| fingerprinter | 0.932 | 0.852 | 0.918 | 0.868 | 0.966 | 0.907 |
| graphonly | 0.916 | 0.823 | 0.896 | 0.888 | 0.954 | 0.895 |
Fig 5. Classification times.
Mean training times of a random forest classifier for various fingerprint representations averaged over all data sets of active and inactive compounds.
Classification performance on a dataset containing active and inactive compounds of the 5-HT1 receptor (middle column) and on a dataset containing actives and putative inactives (last column).
The reduced representation was constructed from four non-hashed fingerprints based on five biological targets (first three rows). The reduced representation built from all fingerprints (except KRFP) was also evaluated (last row).
| fingerprint | inactives | ZINC |
|---|---|---|
| reduced(25) | 0.553 | 0.893 |
| reduced(50) | 0.632 | 0.950 |
| reduced(100) | 0.663 | |
| estate | 0.250 | 0.566 |
| maccs | 0.630 | 0.961 |
| pubchem | 0.659 | 0.948 |
| substructure | 0.332 | 0.886 |
| KRFP | 0.650 | 0.958 |
| extended | 0.960 | |
| fingerprinter | 0.713 | 0.957 |
| graphonly | 0.627 | 0.933 |
| reduced (100) formed from all fingerprints | 0.998 | 0.961 |