Wojciech M Czarnecki, Sabina Podlewska, Andrzej J Bojarski.
Abstract
BACKGROUND: The Support Vector Machine (SVM) has become one of the most popular machine learning tools used in virtual screening campaigns aimed at finding new drug candidates. Although it can be extremely effective in finding new potentially active compounds, its application requires the optimization of the hyperparameters with which the assessment is run, particularly the C and γ values. This requirement, in turn, establishes the need for fast and effective approaches to the optimization procedure that provide the best predictive power of the constructed model.
Keywords: Bayesian optimization; Compounds classification; Parameters optimization; Support Vector Machine; Virtual screening
Year: 2015 PMID: 26273325 PMCID: PMC4534515 DOI: 10.1186/s13321-015-0088-0
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Details of the classification experiments performed
| Targets | Fingerprints | Optimization method | No. of iterations |
|---|---|---|---|
| 5-HT | EstateFP | Bayes | |
| 5-HT | ExtFP | Random | |
| 5-HT | KlekFP | Grid search | 20, 30, 50, 75, 100, 150 |
| 5-HT | MACCSFP | Small grid | |
| CDK2 | PubchemFP | SVMlight | |
| M | SubFP | libSVM | |
| ERK2 | | | |
| AChE | | | |
| A | | | |
| alpha2AR | | | |
| beta1AR | | | |
| beta3AR | | | |
| CB1 | | | |
| DOR | | | |
| D | | | |
| H | | | |
| H | | | |
| HIVi | | | |
| IR | | | |
| ABL | | | |
| HLE | | | |
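The grid and random strategies listed in the table can be illustrated with a minimal, standard-library sketch. The exponent ranges, the 36-point budget, and the `toy_score` function are assumptions made for illustration only; in a real experiment the score would be the cross-validated accuracy of an SVM trained with the given C and γ, and the record above does not state the actual search ranges. The Bayesian strategy is not reproduced here.

```python
import itertools
import math
import random


def grid_candidates(exponents_c, exponents_gamma):
    """All (C, gamma) pairs on a log10 grid."""
    return [(10.0 ** a, 10.0 ** b)
            for a, b in itertools.product(exponents_c, exponents_gamma)]


def random_candidates(n, c_range, gamma_range, seed=0):
    """n (C, gamma) pairs drawn log-uniformly from the given exponent ranges."""
    rng = random.Random(seed)
    return [(10.0 ** rng.uniform(*c_range), 10.0 ** rng.uniform(*gamma_range))
            for _ in range(n)]


def best_candidate(candidates, score):
    """Return the highest-scoring candidate together with its score."""
    best = max(candidates, key=score)
    return best, score(best)


def toy_score(params):
    """Hypothetical stand-in for cross-validated accuracy; peaks at C=10, gamma=0.01."""
    c, gamma = params
    return 1.0 / (1.0 + (math.log10(c) - 1.0) ** 2 + (math.log10(gamma) + 2.0) ** 2)


# Same evaluation budget for both strategies (assumed ranges).
grid = grid_candidates(range(-2, 4), range(-4, 2))        # 6 x 6 = 36 pairs
rand = random_candidates(36, (-2, 3), (-4, 1), seed=0)    # 36 random pairs

grid_best, grid_score = best_candidate(grid, toy_score)
rand_best, rand_score = best_candidate(rand, toy_score)
```

Both strategies spend the same number of evaluations; they differ only in where the candidate points are placed, which is the aspect the comparison in this study is concerned with.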
Fig. 1 Global analysis of classification accuracy obtained for different methods for SVM parameters optimization expressed as the number of experiments in which a particular strategy provided the highest accuracy values.
Fig. 2 Analysis of the effectiveness of different SVM optimization strategies with respect to various fingerprints expressed as the number of experiments in which a particular strategy provided the highest accuracy values for a given compounds representation.
Fig. 3 Analysis of effectiveness of different SVM optimization strategies with respect to various targets expressed as the number of experiments in which a particular strategy provided the highest accuracy values for a given protein target.
A comparison of the number of highest accuracies obtained with the Bayesian optimization and grid search
| Comparison | Bayes | Grid search |
|---|---|---|
| Global | 96 | 34 |
| EstateFP | 15 | 7 |
| ExtFP | 16 | 6 |
| KlekFP | 16 | 5 |
| MACCSFP | 18 | 4 |
| PubchemFP | 16 | 6 |
| SubFP | 15 | 6 |
| 5-HT | 5 | 1 |
| 5-HT | 5 | 1 |
| 5-HT | 4 | 3 |
| 5-HT | 3 | 3 |
| CDK2 | 6 | 0 |
| M | 6 | 1 |
| ERK2 | 5 | 1 |
| AChE | 5 | 1 |
| A | 5 | 1 |
| alpha2AR | 5 | 1 |
| beta1AR | 3 | 3 |
| beta3AR | 3 | 4 |
| CB1 | 5 | 1 |
| DOR | 4 | 2 |
| D | 5 | 1 |
| H | 6 | 0 |
| H | 5 | 1 |
| HIVi | 1 | 5 |
| IR | 5 | 1 |
| ABL | 6 | 0 |
| HLE | 4 | 3 |
Fig. 4 Analysis of the changes in accuracy during the SVM optimization procedure for the subsequent optimization steps.
AUC values of the curves illustrating changes in accuracy over time, and the final optimal accuracy values, obtained for the 5-HT, ExtFP experiment
| Optimization method | AUC | Final accuracy |
|---|---|---|
| Bayes | 0.892* | 0.896* |
| Random | 0.885 | 0.887 |
| Grid search | 0.802 | 0.881 |
| SVMlight | 0.683 | 0.683 |
| libSVM | 0.847 | 0.847 |
The highest values obtained among all strategies tested are marked with an asterisk sign
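One plausible way to obtain a single AUC number from an accuracy-over-time curve is sketched below. This is an assumption rather than the authors' stated definition: the curve is taken to be the best accuracy found so far at each iteration, and the trapezoidal area is divided by the curve's width so that a flat curve has an AUC equal to its accuracy. That normalization is consistent with the SVMlight and libSVM rows above, where AUC and final accuracy coincide. The function names and the example accuracies are hypothetical.

```python
def best_so_far(accuracies):
    """Running maximum: the best accuracy found up to each iteration."""
    out, best = [], float("-inf")
    for acc in accuracies:
        best = max(best, acc)
        out.append(best)
    return out


def normalized_auc(curve):
    """Trapezoidal area under the curve divided by its width (unit step spacing)."""
    if not curve:
        raise ValueError("curve must contain at least one point")
    if len(curve) == 1:
        return float(curve[0])
    area = sum((curve[i] + curve[i + 1]) / 2.0 for i in range(len(curve) - 1))
    return area / (len(curve) - 1)


# Hypothetical accuracies recorded at four consecutive optimization steps.
curve = best_so_far([0.70, 0.80, 0.75, 0.90])
auc = normalized_auc(curve)

# A method evaluated once with fixed parameters gives a flat curve.
flat_auc = normalized_auc([0.683] * 5)
```

Under this definition, a strategy that reaches high accuracy early scores a higher AUC than one that reaches the same final accuracy late, which is why AUC and final accuracy are reported side by side.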
The average AUC values–global, obtained for a particular fingerprint and particular target
| Fingerprint/target | Bayes | Random | Grid search | SVMlight | libSVM |
|---|---|---|---|---|---|
| global | 0.883* | 0.870 | 0.799 | 0.676 | 0.792 |
| EstateFP | 0.847* | 0.829 | 0.774 | 0.690 | 0.763 |
| ExtFP | 0.902* | 0.891 | 0.806 | 0.669 | 0.874 |
| KlekFP | 0.899* | 0.889 | 0.812 | 0.669 | 0.730 |
| MACCSFP | 0.890* | 0.876 | 0.798 | 0.683 | 0.828 |
| PubchemFP | 0.898* | 0.885 | 0.816 | 0.669 | 0.808 |
| SubFP | 0.864* | 0.854 | 0.787 | 0.677 | 0.749 |
| 5-HT | 0.860* | 0.850 | 0.780 | 0.683 | 0.743 |
| 5-HT | 0.848* | 0.821 | 0.702 | 0.568 | 0.717 |
| 5-HT | 0.913* | 0.910 | 0.886 | 0.814 | 0.862 |
| 5-HT | 0.830* | 0.816 | 0.748 | 0.675 | 0.714 |
| CDK2 | 0.876* | 0.875 | 0.796 | 0.664 | 0.768 |
| M | 0.850* | 0.843 | 0.778 | 0.557 | 0.748 |
| ERK2 | 0.958 | 0.961* | 0.949 | 0.931 | 0.942 |
| AChE | 0.884* | 0.854 | 0.788 | 0.611 | 0.764 |
| A | 0.843* | 0.835 | 0.764 | 0.564 | 0.720 |
| alpha2AR | 0.875* | 0.874 | 0.773 | 0.563 | 0.725 |
| beta1AR | 0.910* | 0.864 | 0.798 | 0.710 | 0.828 |
| beta3AR | 0.874* | 0.823 | 0.826 | 0.545 | 0.722 |
| CB1 | 0.874* | 0.854 | 0.782 | 0.622 | 0.793 |
| DOR | 0.888* | 0.880 | 0.734 | 0.599 | 0.814 |
| D | 0.841* | 0.837 | 0.759 | 0.698 | 0.745 |
| H | 0.898* | 0.880 | 0.638 | 0.548 | 0.801 |
| H | 0.937* | 0.926 | 0.906 | 0.897 | 0.905 |
| HIVi | 0.939 | 0.945* | 0.934 | 0.901 | 0.911 |
| IR | 0.936* | 0.936* | 0.925 | 0.886 | 0.897 |
| ABL | 0.850* | 0.831 | 0.748 | 0.587 | 0.733 |
| HLE | 0.867* | 0.865 | 0.763 | 0.578 | 0.779 |
The highest values obtained among all strategies tested are marked with an asterisk sign
The average final accuracy values—global, obtained for a particular fingerprint and particular target
| Fingerprint/target | Bayes | Random | Grid search | SVMlight | libSVM |
|---|---|---|---|---|---|
| Global | 0.889* | 0.873 | 0.876 | 0.676 | 0.792 |
| EstateFP | 0.852* | 0.832 | 0.833 | 0.690 | 0.763 |
| ExtFP | 0.907* | 0.896 | 0.892 | 0.669 | 0.874 |
| KlekFP | 0.907* | 0.890 | 0.891 | 0.669 | 0.730 |
| MACCSFP | 0.898* | 0.878 | 0.880 | 0.683 | 0.828 |
| PubchemFP | 0.901* | 0.886 | 0.894 | 0.669 | 0.808 |
| SubFP | 0.869* | 0.856 | 0.864 | 0.677 | 0.749 |
| 5-HT | 0.871* | 0.848 | 0.860 | 0.683 | 0.743 |
| 5-HT | 0.855* | 0.825 | 0.772 | 0.568 | 0.717 |
| 5-HT | 0.916* | 0.915 | 0.933 | 0.814 | 0.862 |
| 5-HT | 0.833* | 0.819 | 0.819 | 0.675 | 0.714 |
| CDK2 | 0.885* | 0.881 | 0.870 | 0.664 | 0.768 |
| M | 0.858 | 0.846 | 0.897* | 0.557 | 0.748 |
| ERK2 | 0.959 | 0.961* | 0.961* | 0.931 | 0.942 |
| AChE | 0.889* | 0.857 | 0.872 | 0.611 | 0.764 |
| A | 0.856 | 0.838 | 0.882* | 0.564 | 0.720 |
| alpha2AR | 0.880* | 0.873 | 0.872 | 0.563 | 0.725 |
| beta1AR | 0.914* | 0.870 | 0.864 | 0.710 | 0.828 |
| beta3AR | 0.879 | 0.825 | 0.972* | 0.545 | 0.722 |
| CB1 | 0.881* | 0.857 | 0.868 | 0.622 | 0.793 |
| DOR | 0.897* | 0.884 | 0.872 | 0.599 | 0.814 |
| D | 0.849* | 0.838 | 0.837 | 0.698 | 0.745 |
| H | 0.904* | 0.879 | 0.691 | 0.548 | 0.801 |
| H | 0.938* | 0.926 | 0.919 | 0.897 | 0.905 |
| HIVi | 0.938 | 0.946 | 0.967* | 0.901 | 0.911 |
| IR | 0.939 | 0.937 | 0.956* | 0.886 | 0.897 |
| ABL | 0.857* | 0.836 | 0.840 | 0.587 | 0.733 |
| HLE | 0.867* | 0.871 | 0.864 | 0.578 | 0.779 |
The highest values obtained among all strategies tested are marked with an asterisk sign
Fig. 5 Analysis of the number of iterations of the optimization procedure required to achieve the highest accuracy. The figure presents the number of iterations required for a particular optimization strategy to achieve optimal performance for the predictive model.
Fig. 6 Analysis of the changes in accuracy for different steps during the SVM optimization procedure.
The number of active and inactive compounds in the dataset
| Protein | Actives | Inactives |
|---|---|---|
| 5-HT | 1836 | 852 |
| 5-HT | 1211 | 927 |
| 5-HT | 1491 | 342 |
| 5-HT | 705 | 340 |
| CDK2 | 741 | 1462 |
| M | 760 | 939 |
| ERK2 | 72 | 958 |
| AChE | 1147 | 1804 |
| A | 1789 | 2286 |
| alpha2AR | 364 | 283 |
| beta1AR | 195 | 477 |
| beta3AR | 111 | 133 |
| CB1 | 1964 | 1714 |
| DOR | 2535 | 1992 |
| D | 1034 | 449 |
| H | 636 | 546 |
| H | 2706 | 313 |
| HIVi | 102 | 915 |
| IR | 147 | 1139 |
| ABL | 409 | 582 |
| HLE | 820 | 610 |
Fingerprints used for compounds representation
| Fingerprint | Abbreviation | Length | Short description |
|---|---|---|---|
| E-State fingerprint | EStateFP | 79 | Computes the electrotopological state (E-state) index for each atom, describing its electronic state while taking into account the influence of the other atoms in the structure |
| Extended fingerprint | ExtFP | 1024 | A hashed fingerprint in which each atom of the given structure is the starting point of paths no longer than six atoms. A hash code is produced for every such path, and these codes form the basis of a bit string representing the whole structure |
| Klekota and Roth fingerprint | KlekFP | 4860 | Fingerprint analyzing the occurrence of particular chemical substructures in the given compound. Developed by Klekota and Roth |
| MACCS fingerprint | MACCSFP | 166 | Fingerprint using the MACCS keys in its bits definition |
| Pubchem fingerprint | PubchemFP | 881 | Substructure fingerprint with bits divided into several sections: hierarchic element counts, rings, simple atom pairs, simple atom nearest neighbours, detailed atom neighbourhoods, simple SMARTS patterns, complex SMARTS patterns |
| Substructure fingerprint | SubFP | 308 | Substructure fingerprint based on the SMARTS patterns developed by Christian Laggner |
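The path-hashing idea behind ExtFP, as described in the table, can be sketched as a toy example. It is not the fingerprint implementation used in the study: bond orders, aromaticity, and ring information are ignored, atoms are represented only by their element symbols, and the `path_fingerprint` function, its parameters, and the choice of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib


def path_fingerprint(atoms, bonds, max_atoms=6, n_bits=1024):
    """Return the set of bit positions set by a toy hashed path fingerprint.

    atoms: list of element symbols; bonds: list of (i, j) atom-index pairs.
    """
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    bits = set()

    def extend(path):
        labels = [atoms[k] for k in path]
        # A path and its reverse describe the same substructure.
        canonical = min("-".join(labels), "-".join(reversed(labels)))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        bits.add(int(digest, 16) % n_bits)
        # Grow the path one atom at a time, never revisiting an atom.
        if len(path) < max_atoms:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return bits


# Heavy-atom skeleton of ethanol: C-C-O.
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

Because every path is hashed into a fixed number of positions, different paths can collide on the same bit; this is the usual trade-off of hashed fingerprints compared with the key-based fingerprints in the table (MACCSFP, PubchemFP, SubFP, KlekFP), where each bit corresponds to one predefined substructure.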