Linlin Zhao, Wenyi Wang, Alexander Sedykh, Hao Zhu.
Abstract
Numerous chemical data sets have become available for quantitative structure-activity relationship (QSAR) modeling studies. However, the quality of data sources varies with the nature of the experimental protocols. Potential experimental errors in the modeling sets may therefore lead to poor QSAR models and degrade the predictions of new compounds. In this study, we explored the relationship between the ratio of questionable data in the modeling sets, which was obtained by simulating experimental errors, and QSAR modeling performance. To this end, we used eight data sets (four continuous endpoints and four categorical endpoints) that have been extensively curated both in-house and by our collaborators to create over 1800 QSAR models. Each data set was duplicated to create several new modeling sets with different ratios of simulated experimental errors (i.e., randomizing the activities of part of the compounds). A fivefold cross-validation process was used to evaluate the modeling performance, which deteriorated as the ratio of experimental errors increased. All of the resulting models were also used to predict external sets of new compounds, which were excluded at the beginning of the modeling process. The modeling results showed that the compounds with relatively large prediction errors in cross-validation are likely to be those with simulated experimental errors. However, after removing a certain number of compounds with large prediction errors in cross-validation, the external predictions of new compounds did not improve. Our conclusion is that QSAR predictions, especially consensus predictions, can identify compounds with potential experimental errors, but removing those compounds on the basis of cross-validation is not a reasonable means to improve model predictivity because of overfitting.
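The error simulation and cross-validation ranking described in the abstract can be sketched as follows. This is a minimal, standard-library Python illustration under stated assumptions, not the authors' implementation: a k-nearest-neighbor classifier stands in for the QSAR models, flipping a binary label stands in for "randomizing the activities" of a categorical endpoint, and all function names are hypothetical.

```python
import random


def simulate_errors(labels, ratio, rng):
    """Flip the binary activity of a random fraction of compounds.

    Returns the noisy labels and the set of corrupted indices.
    Label flipping is an assumption standing in for activity randomization.
    """
    n_bad = round(ratio * len(labels))
    corrupted = set(rng.sample(range(len(labels)), n_bad))
    noisy = [1 - y if i in corrupted else y for i, y in enumerate(labels)]
    return noisy, corrupted


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]


def knn_probability(train_X, train_y, query, n_neighbors):
    """Fraction of actives among the nearest training compounds."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_X, train_y)
    )
    nearest = dists[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)


def cv_errors(X, y, k=5, n_neighbors=5, seed=0):
    """Out-of-fold absolute prediction error for every compound."""
    rng = random.Random(seed)
    errors = [0.0] * len(y)
    for fold in kfold_indices(len(y), k, rng):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in fold:
            p = knn_probability(train_X, train_y, X[i], n_neighbors)
            errors[i] = abs(y[i] - p)
    return errors


def rank_by_error(errors):
    """Compound indices ordered from largest to smallest CV error."""
    return sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
```

Comparing the top of `rank_by_error(cv_errors(X, noisy_y))` with the `corrupted` set returned by `simulate_errors` indicates how well large cross-validation errors flag the compounds whose activities were altered.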
Year: 2017 PMID: 28691113 PMCID: PMC5494643 DOI: 10.1021/acsomega.7b00274
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Figure 1. ROC AUC and ROC enrichment plots for each data set.
Figure 2. Comparison of external prediction results for each model from different modeling sets. In each heatmap, the x axis represents modeling sets with the top-ranked 5, 10, 15, and 20% of compounds removed by cross-validation, and the y axis represents modeling sets with different ratios of simulated experimental errors.
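The reduced modeling sets along the x axis of Figure 2 can be illustrated with a short Python sketch. It assumes a ranking of compound indices from largest to smallest cross-validation prediction error is already available; the function name is hypothetical, and rounding the number of removed compounds down is an assumption, since the source does not state how fractional counts were handled.

```python
def modeling_sets_after_removal(ranking, percents=(5, 10, 15, 20)):
    """Map each removal percentage to the compound indices that remain.

    `ranking` lists compound indices from largest to smallest
    cross-validation prediction error.
    """
    reduced = {}
    for percent in percents:
        # Integer count of compounds to drop, rounded down (assumption).
        n_drop = len(ranking) * percent // 100
        reduced[percent] = sorted(ranking[n_drop:])
    return reduced
```

Each reduced set would then be used to rebuild a model and predict the external set, giving one heatmap cell per combination of removal percentage and simulated error ratio.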
Information on Chemical Data Sets Used in This Study
| data set | size | actives | inactives | description | source |
|---|---|---|---|---|---|
| categorical sets | | | | | |
| BCRP | 395 | 178 | 217 | inhibition of membrane transporters at 10 μM | Sedykh et al. |
| BSEP | 725 | 303 | 422 | bile salt efflux pump inhibition at 100 μM | Metrabase database |
| MDR1 | 1585 | 750 | 835 | inhibition of membrane transporters at 10 μM | Sedykh et al. |
| AMES | 3979 | 1718 | 2261 | bacterial mutagenicity Ames test | CCRIS database |
Continuous activity values were negative log10-transformed.
Denotes proprietary data sets provided by Multicase Inc. (http://www.multicase.com/case-ultra-models) as accessed in 2015. For these, the “Source” column provides direct precursor publications.
Figure 3. Modeling workflow.