Anita Rácz, Dávid Bajusz, Károly Héberger.
Abstract
QSAR/QSPR (quantitative structure-activity/property relationship) modeling has been a prevalent approach in various overlapping sub-fields of computational, medicinal and environmental chemistry for decades. The generation and selection of molecular descriptors is an essential part of this process. In typical QSAR workflows, the starting pool of molecular descriptors is rationalized by filtering out descriptors that are (i) constant throughout the whole dataset, or (ii) very strongly correlated with another descriptor. While the former is fairly straightforward, the latter involves a level of subjectivity in deciding what exactly counts as a strong correlation. Despite that, most QSAR modeling studies do not report on this step. In this study, we examine in detail the effect of various possible descriptor intercorrelation limits on the resulting QSAR models. Statistical comparisons are carried out based on four case studies from contemporary QSAR literature, using a combined methodology based on sum of ranking differences (SRD) and analysis of variance (ANOVA).
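The two filtering steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pearson_r` and `filter_descriptors`, the toy descriptor values, and the greedy rule (keep the first descriptor of a correlated pair, in input order) are all assumptions made for illustration. The abstract does not state which member of a correlated pair is discarded.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def filter_descriptors(columns, limit=0.95):
    """Return the names of the descriptors that survive both filters.

    columns: dict mapping descriptor name -> list of values (one per compound).
    limit:   intercorrelation limit on |r|; None disables step (ii).
    Assumption: a greedy pass in input order, keeping the earlier descriptor.
    """
    kept = []
    for name, values in columns.items():
        # (i) drop descriptors that are constant over the whole dataset
        if max(values) == min(values):
            continue
        # (ii) drop descriptors too strongly correlated with one already kept
        if limit is not None and any(
            abs(pearson_r(values, columns[k])) > limit for k in kept
        ):
            continue
        kept.append(name)
    return kept


# Hypothetical toy data: four descriptors for five compounds.
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [3.0, 5.0, 7.0, 9.0, 11.0],  # exactly linear in d1
    "d3": [7.0, 7.0, 7.0, 7.0, 7.0],   # constant
    "d4": [2.0, 1.0, 4.0, 3.0, 6.0],   # moderately correlated with d1
}

for limit in (0.80, 0.95, None):
    print(limit, filter_descriptors(descriptors, limit=limit))
```

Running the loop with several limits shows how the size of the retained descriptor pool depends on the chosen threshold, which is the subjective choice the study examines.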
Keywords: QSAR; analysis of variance; correlation; descriptor; regression; sum of ranking differences
Year: 2019; PMID: 30945814; PMCID: PMC6767540; DOI: 10.1002/minf.201800154
Source DB: PubMed; Journal: Mol Inform; ISSN: 1868-1743; Impact factor: 3.353
Number of compounds in the training and test sets with endpoints and references, for the four case studies.
| Case study | Endpoint | Applicability domain | No. training | No. test | Ref. |
|---|---|---|---|---|---|
| 1 | pIC50 | N‐benzoyl‐L‐biphenylalanine derivatives | 99 | 43 | [24] |
| 2 | logBB | Diverse compounds | 287 | 81 | [25] |
| 3 | pLC50 | Benzene derivatives | 51 | 18 | [26] |
| 4 | pIC50 | N‐substituted maleimides | 48 | 14 | [19,27] |
Figure 1. Numbers of selected descriptors for the four datasets, for each intercorrelation limit. (“none” means that no correlation limit was used.)
Figure 2. Workflow of the applied procedure from descriptor generation to ANOVA.
Figure 3. Distribution of the examined performance parameter values. The blue dashed line shows the R² distribution, and the orange dashed‐dotted line shows the Q² distribution.
Figure 4. An example SRD result with the full plot (above) and a magnified part (below). Vertical bars denote the models with different intercorrelation limits. The black curve corresponds to the cumulative distribution of SRD values based on random rankings. Normalized SRD [%] values are plotted on the X axis and the left Y axis, while the right Y axis shows the percentages for the distribution of random rankings.
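For readers unfamiliar with the normalized SRD values plotted in Figure 4, the following is a minimal sketch of how such a value can be computed for one ranked column against a reference column. The function names `average_ranks` and `normalized_srd` and the toy numbers are illustrative assumptions, and the choice of reference (for example a row-wise average or maximum) is left to the caller because this record does not specify it.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # find the end of the run of tied values starting at position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def normalized_srd(column, reference):
    """Sum of absolute rank differences, scaled to 0-100 % of the maximum SRD."""
    n = len(column)
    r_col = average_ranks(column)
    r_ref = average_ranks(reference)
    srd = sum(abs(a - b) for a, b in zip(r_col, r_ref))
    # Maximum possible SRD: n^2 / 2 for even n, (n^2 - 1) / 2 for odd n
    srd_max = (n * n) // 2
    return 100.0 * srd / srd_max


# Hypothetical toy values for five ranked objects.
reference = [0.10, 0.20, 0.30, 0.40, 0.50]
print(normalized_srd([0.10, 0.20, 0.30, 0.40, 0.50], reference))  # identical ranking
print(normalized_srd([0.50, 0.40, 0.30, 0.20, 0.10], reference))  # reversed ranking
print(normalized_srd([0.20, 0.10, 0.30, 0.40, 0.50], reference))  # one swapped pair
```

A smaller normalized SRD means the column ranks the objects more like the reference does. The comparison with random rankings shown by the black curve is not reproduced in this sketch.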
Figure 5. Average normalized SRD values plotted against the intercorrelation limits. Vertical error bars are calculated from the standard deviations, using the law of error propagation.
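One common reading of the error bars in Figure 5, given here only as an assumption because this record does not state the formula, is the propagated uncertainty of a mean of $k$ independent normalized SRD values $s_i$ with standard deviations $\sigma_i$ (the symbols $k$, $s_i$ and $\sigma_i$ are introduced here for illustration):

$$
\bar{s} = \frac{1}{k}\sum_{i=1}^{k} s_i,
\qquad
\sigma_{\bar{s}} = \sqrt{\sum_{i=1}^{k}\left(\frac{\partial \bar{s}}{\partial s_i}\right)^{2}\sigma_i^{2}}
= \frac{1}{k}\sqrt{\sum_{i=1}^{k}\sigma_i^{2}}
$$

For example, averaging two values with $\sigma_1 = 3$ and $\sigma_2 = 4$ would give $\sigma_{\bar{s}} = \tfrac{1}{2}\sqrt{9 + 16} = 2.5$.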