Weida Tong, Qian Xie, Huixiao Hong, Leming Shi, Hong Fang, Roger Perkins.
Abstract
Quantitative structure-activity relationship (QSAR) methods have been widely applied in drug discovery, lead optimization, toxicity prediction, and regulatory decisions. Despite major advances in algorithms and software, QSAR models have inherent limitations associated with the size and chemical-structure diversity of the training set, experimental error, and many characteristics of structure representation and correlation algorithms. Whereas excellent fit to the training data may be readily attainable, models often fail to accurately predict chemicals outside their domain of applicability. A QSAR's utility and, in the case of regulatory decisions, justification for usage increasingly depend on the ability to quantify a model's potential for predicting unknown chemicals with some known degree of certainty. It is never possible to predict an unknown chemical with absolute certainty. Here we report on two QSAR models based on different data sets for classification of chemicals according to their ability to bind to the estrogen receptor. The models were developed by using a novel QSAR method, Decision Forest, which combines the results of multiple heterogeneous but comparable Decision Tree models to produce a consensus prediction. We used an extensive cross-validation process to define an applicability domain for model predictions based on two quantitative measures: prediction confidence and domain extrapolation. Together, these measures quantify the accuracy of each prediction within and outside of the training domain. Despite being based on large and diverse training sets, both QSAR models had poor accuracy for chemicals within the domain of low confidence, whereas good accuracy was obtained for those within the domain of high confidence. For prediction in the high confidence domain, accuracy was inversely proportional to the degree of domain extrapolation.
The model with the larger training set (1,092 chemicals, compared with 232 for the other) was more accurate in predicting chemicals at larger domain extrapolation and could be particularly useful for rapidly prioritizing potential endocrine disruptors from a large chemical universe.
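The consensus idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the combination rule or the confidence formula, so both are assumptions here: the per-tree probabilities are simply averaged, and confidence is taken as the scaled distance of that average from the 0.5 decision threshold. The function name `consensus_predict` and the example probabilities are hypothetical.

```python
from statistics import mean


def consensus_predict(tree_probabilities):
    """Combine per-tree probabilities of ER binding into one consensus call.

    Assumption: the forest averages the tree outputs; the source does not
    state the exact combination rule.
    """
    p = mean(tree_probabilities)
    label = "active" if p > 0.5 else "inactive"
    # Assumed confidence scale: 0 at the decision threshold (p = 0.5),
    # 1 when every tree agrees with certainty (p = 0 or p = 1).
    confidence = abs(p - 0.5) / 0.5
    return p, label, confidence


# Hypothetical outputs of four trees for one chemical.
p, label, confidence = consensus_predict([0.9, 0.8, 1.0, 0.7])
print(label, round(p, 2), round(confidence, 2))
```

Under these assumptions, a chemical whose averaged probability sits near 0.5 would fall in the low confidence (LC) region, and one near 0 or 1 in the high confidence (HC) region; where the paper draws that boundary is not stated in this record.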
Year: 2004 PMID: 15345371 PMCID: PMC1277118 DOI: 10.1289/txg.7125
Source DB: PubMed Journal: Environ Health Perspect ISSN: 0091-6765 Impact factor: 9.031
Figure 1. Comparison of structural diversity of ER232 and ER1092 in a chemistry space defined by three principal components of over 270 2D structural descriptors.
Figure 2. Schematic illustration for defining the training domain of a tree. For an unknown chemical predicted by the tree, its classification is determined by the terminal node (dark circle) to which it belongs. There are three descriptors used in the path (bold line) from the root to the terminal node and the range of these three descriptors across all chemicals in the training set determines the training domain.
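The training-domain definition in the Figure 2 caption can be sketched as follows. The range computation follows the caption; the extrapolation measure (largest distance outside any range, as a percentage of that range's width) is an assumption, since this record does not define it. The descriptor names and values are hypothetical.

```python
def training_domain(training_rows, path_descriptors):
    """Min/max of each descriptor on the root-to-leaf path, over the training set."""
    return {
        d: (min(row[d] for row in training_rows), max(row[d] for row in training_rows))
        for d in path_descriptors
    }


def extrapolation_percent(chemical, domain):
    """Assumed measure: worst overshoot beyond any range, as % of the range width."""
    worst = 0.0
    for d, (lo, hi) in domain.items():
        outside = max(lo - chemical[d], chemical[d] - hi, 0.0)
        if outside == 0.0:
            continue  # this descriptor is inside the training range
        width = hi - lo
        worst = max(worst, 100.0 * outside / width if width > 0 else float("inf"))
    return worst


# Hypothetical training set with the three descriptors used on one path.
training = [
    {"logP": 1.0, "mol_weight": 100.0, "n_rings": 1},
    {"logP": 5.0, "mol_weight": 300.0, "n_rings": 4},
    {"logP": 3.0, "mol_weight": 200.0, "n_rings": 2},
]
domain = training_domain(training, ["logP", "mol_weight", "n_rings"])

inside = extrapolation_percent({"logP": 2.0, "mol_weight": 150.0, "n_rings": 3}, domain)
outside = extrapolation_percent({"logP": 5.5, "mol_weight": 150.0, "n_rings": 3}, domain)
print(inside, outside)
```

A chemical with every path descriptor inside the training ranges gets an extrapolation of 0 in this sketch, which would correspond to the "0" rows of the extrapolation table.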
Statistics of the forest models based on ER232 and ER1092.
| | ER232 | ER1092 |
|---|---|---|
| Number of chemicals | 232 | 1092 |
| Number (%) of misclassifications | 5 (2.16%) | 50 (4.58%) |
| Number of trees combined | 6 | 4 |
| Number of descriptors used | 79 | 138 |
| Accuracy | 96.6% | 95.4% |
| Specificity | 96.0% | 91.0% |
| Sensitivity | 96.9% | 97.6% |
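For readers less familiar with the three summary statistics in the table, the sketch below shows their standard definitions from a 2x2 confusion matrix. The counts are hypothetical and are not taken from the paper; treating ER binders as the positive class is an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard definitions, with ER binders assumed to be the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),  # all correct calls
        "sensitivity": tp / (tp + fn),                # binders correctly flagged
        "specificity": tn / (tn + fp),                # non-binders correctly cleared
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print(metrics)
```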
Figure 3. Prediction accuracy versus confidence level. Data were calculated from 2,000 runs of 10-fold cross-validation for (A) ER232 and (B) ER1092.
The HC and LC predictions from 2,000 runs of 10-fold cross-validation for ER232 and ER1092.
| Confidence regions | ER232: Accuracy (%) | ER232: Percentage of chemicals | ER1092: Accuracy (%) | ER1092: Percentage of chemicals |
|---|---|---|---|---|
| HC | 86.6 | 79.2 | 86.3 | 69.9 |
| LC | 63.8 | 20.8 | 64.7 | 30.1 |
| All | 81.9 | 100 | 79.7 | 100 |
Abbreviations: HC, high confidence; LC, low confidence.
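One way to read the "All" row is as the average of the HC and LC accuracies, weighted by the percentage of chemicals in each region. The short check below uses only the values from the table; the weighting interpretation itself is an assumption, and small differences from the reported figures are expected because the inputs are rounded.

```python
# (accuracy %, percentage of chemicals) per confidence region, from the table.
regions = {
    "ER232": {"HC": (86.6, 79.2), "LC": (63.8, 20.8)},
    "ER1092": {"HC": (86.3, 69.9), "LC": (64.7, 30.1)},
}


def overall_accuracy(by_region):
    """Share-weighted mean accuracy across confidence regions."""
    total_share = sum(share for _, share in by_region.values())
    return sum(acc * share for acc, share in by_region.values()) / total_share


overall = {name: overall_accuracy(r) for name, r in regions.items()}
for name, value in overall.items():
    print(name, round(value, 1))
```

The results land within about a tenth of a percentage point of the reported 81.9% and 79.7%.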
Figure 4. Prediction accuracy in different domains of extrapolation for ER232 and ER1092 from 2,000 runs of 10-fold cross-validation.
Prediction accuracy in different regions of confidence and extrapolation derived from 2,000 runs of 10-fold cross-validation for ER232 and ER1092.
| Data set | Extrapolation | HC region: Accuracy (%) | HC region: No. of predictions | LC region: Accuracy (%) | LC region: No. of predictions |
|---|---|---|---|---|---|
| ER232 | 0 | 87.8 | 349,595 | 64.7 | 83,393 |
| ER232 | 0–10 | 79.7 | 10,442 | 55.8 | 6,216 |
| ER232 | 10–20 | 61.1 | 1,325 | 50.5 | 2,651 |
| ER232 | 20–30 | 65.3 | 853 | 63.9 | 1,086 |
| ER232 | > 30 | 39.7 | 5,614 | 65.9 | 2,825 |
| ER1092 | 0 | 86.4 | 1,511,180 | 64.4 | 645,177 |
| ER1092 | 0–10 | 89.4 | 6,896 | 61.5 | 5,135 |
| ER1092 | 10–20 | 88.5 | 3,914 | 68.1 | 3,453 |
| ER1092 | 20–30 | 96.8 | 1,209 | 75.3 | 959 |
| ER1092 | > 30 | 48.9 | 3,560 | 54.1 | 2,517 |
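The prediction counts in this table can be cross-checked against the rest of the record. The sketch below sums the counts per data set and pools the HC accuracies, weighting each extrapolation bin by its number of predictions. It assumes that every chemical is predicted exactly once per cross-validation run, so that the total should equal the number of chemicals times 2,000; the pooling by count is likewise an assumption about how the per-region figures relate.

```python
RUNS = 2000  # cross-validation runs, as stated in the table caption

# (accuracy %, number of predictions) per extrapolation bin, from the table.
table = {
    "ER232": {
        "n_chemicals": 232,
        "HC": [(87.8, 349595), (79.7, 10442), (61.1, 1325), (65.3, 853), (39.7, 5614)],
        "LC": [(64.7, 83393), (55.8, 6216), (50.5, 2651), (63.9, 1086), (65.9, 2825)],
    },
    "ER1092": {
        "n_chemicals": 1092,
        "HC": [(86.4, 1511180), (89.4, 6896), (88.5, 3914), (96.8, 1209), (48.9, 3560)],
        "LC": [(64.4, 645177), (61.5, 5135), (68.1, 3453), (75.3, 959), (54.1, 2517)],
    },
}


def pooled(rows):
    """Count-weighted accuracy and total number of predictions for one region."""
    n = sum(count for _, count in rows)
    return sum(acc * count for acc, count in rows) / n, n


summary = {}
for name, data in table.items():
    hc_acc, hc_n = pooled(data["HC"])
    _, lc_n = pooled(data["LC"])
    summary[name] = {
        "total": hc_n + lc_n,
        "expected_total": data["n_chemicals"] * RUNS,
        "hc_share": 100.0 * hc_n / (hc_n + lc_n),
        "hc_accuracy": hc_acc,
    }

for name, s in summary.items():
    print(name, s["total"], s["expected_total"],
          round(s["hc_share"], 1), round(s["hc_accuracy"], 1))
```

The totals come to 464,000 for ER232 and 2,184,000 for ER1092, matching chemicals times runs, and the pooled HC shares and accuracies fall within a few tenths of a percentage point of the HC row in the confidence-region table.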