Kamel Mansouri, Chris M Grulke, Richard S Judson, Antony J Williams.
Abstract
The collection of chemical structure information and associated experimental data for quantitative structure-activity/property relationship (QSAR/QSPR) modeling is facilitated by an increasing number of public databases containing large amounts of useful data. However, the performance of QSAR models depends heavily on the quality of the data and on the modeling methodology used. This study aims to develop robust QSAR/QSPR models for chemical properties of environmental interest that can be used for regulatory purposes. It primarily uses data from the publicly available PHYSPROP database, covering a set of 13 common physicochemical and environmental fate properties. These datasets underwent extensive curation using an automated workflow to select only high-quality data, and the chemical structures were standardized prior to calculation of the molecular descriptors. The modeling procedure was developed based on the five Organization for Economic Cooperation and Development (OECD) principles for QSAR models. A weighted k-nearest neighbor approach was adopted using a minimum number of required descriptors calculated using PaDEL, an open-source software package. Genetic algorithms selected only the most pertinent and mechanistically interpretable descriptors (2-15, with an average of 11 descriptors). The sizes of the modeled datasets varied from 150 chemicals for biodegradability half-life to 14,050 chemicals for logP, with an average of 3222 chemicals across all endpoints. The optimal models were built on randomly selected training sets (75%) and validated using fivefold cross-validation (CV) and test sets (25%). The CV Q2 of the models varied from 0.72 to 0.95, with an average of 0.86, and the test-set R2 varied from 0.71 to 0.96, with an average of 0.82. Modeling and performance details are described in the QSAR model reporting format (QMRF) and were validated by the European Commission's Joint Research Centre to be OECD compliant.
All models are freely available as an open-source, command-line application called OPEn structure-activity/property Relationship App (OPERA). OPERA models were applied to more than 750,000 chemicals to produce freely available predicted data on the U.S. Environmental Protection Agency's CompTox Chemistry Dashboard.
Keywords: Environmental fate; Model validation; OECD principles; OPERA; Open data; Open source; Physicochemical properties; QMRF; QSAR/QSPR
Year: 2018 PMID: 29520515 PMCID: PMC5843579 DOI: 10.1186/s13321-018-0263-1
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
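The validation scheme summarized in the abstract (a distance-weighted k-nearest neighbor model, a random 75/25 split, and fivefold CV on the training portion) can be sketched as follows. This is only an illustration under stated assumptions, not the OPERA implementation: the inverse-distance weighting, the value of k, the synthetic data, and all function names are hypothetical, and the genetic-algorithm descriptor selection is left out.

```python
import math
import random


def weighted_knn_predict(train_X, train_y, query, k=5):
    """Predict a value as the distance-weighted mean of the k nearest neighbors."""
    # Euclidean distance from the query to every training point, closest first
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    # Inverse-distance weights (an assumed scheme); epsilon avoids division by zero
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def determination_coefficient(y_true, y_pred):
    """1 - SS_res / SS_tot, used here for both CV Q2 and test-set R2."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cross_validated_q2(X, y, folds=5, k=5):
    """Fivefold CV: each sample is predicted by a model that never saw it."""
    preds = [0.0] * len(X)
    for f in range(folds):
        held_out = set(range(f, len(X), folds))
        fit_X = [x for i, x in enumerate(X) if i not in held_out]
        fit_y = [v for i, v in enumerate(y) if i not in held_out]
        for i in held_out:
            preds[i] = weighted_knn_predict(fit_X, fit_y, X[i], k)
    return determination_coefficient(y, preds)


# Synthetic data standing in for two descriptors and a measured property
rng = random.Random(0)
X = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
y = [2.0 * a - 0.5 * b + rng.gauss(0, 0.3) for a, b in X]

# Random 75/25 split into training and test sets
idx = list(range(len(X)))
rng.shuffle(idx)
cut = int(0.75 * len(idx))
train, test = idx[:cut], idx[cut:]
train_X, train_y = [X[i] for i in train], [y[i] for i in train]
test_X, test_y = [X[i] for i in test], [y[i] for i in test]

cv_q2 = cross_validated_q2(train_X, train_y)
test_pred = [weighted_knn_predict(train_X, train_y, x) for x in test_X]
test_r2 = determination_coefficient(test_y, test_pred)
print(f"CV Q2 = {cv_q2:.2f}, test R2 = {test_r2:.2f}")
```

Reporting the CV statistic next to the test-set statistic, as the abstract does, makes it easier to spot a model that fits its training data well but does not generalize.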
Endpoint datasets in the PHYSPROP database
| Property abbreviation | Property | Source SD file |
|---|---|---|
| AOH | Atmospheric hydroxylation rate | EPI_AOP_Data_SDF.sdf |
| BCF | Bioconcentration factor | EPI_BCF_Data_SDF.sdf |
| BioHL | Biodegradability half-life | EPI_BioHC_Data_SDF.sdf |
| BP | Boiling point | EPI_Boil_Pt_Data_SDF.sdf |
| HL | Henry’s Law constant | EPI_Henry_Data_SDF.sdf |
| KM | Fish biotransformation half-life | EPI_KM_Data_SDF.sdf |
| KOA | Octanol–air partition coefficient | EPI_KOA_Data_SDF.sdf |
| KOC | Soil adsorption coefficient | EPI_PCKOC_Data_SDF.sdf |
| logP | Octanol–water partition coefficient | EPI_Kowwin_Data_SDF.sdf |
| MP | Melting point | EPI_Melt_Pt_Data_SDF.sdf |
| RB | Readily biodegradable | EPI_Biowin_Data_SDF.sdf |
| VP | Vapor pressure | EPI_VP_Data_SDF.sdf |
| WS | Water solubility | EPI_Wskowwin_Data_SDF.sdf |
Numbers of chemicals associated with PHYSPROP datasets before and after curation and QSAR-ready standardization workflows
| Property | No. of chemicals in dataset | No. of top-quality chemicals (a) | No. of QSAR-ready chemicals (a) |
|---|---|---|---|
| AOH | 818 | 818 (100%) | 745 (91.1%) |
| BCF | 685 | 618 (90.2%) | 608 (88.7%) |
| BioHL | 175 | 151 (86.3%) | 150 (85.7%) |
| BP | 5890 | 5591 (94.9%) | 5436 (92.3%) |
| HL | 1829 | 1758 (96.1%) | 1711 (93.5%) |
| KM | 631 | 548 (86.8%) | 541 (85.7%) |
| KOA | 308 | 277 (90%) | 270 (87.7%) |
| KOC | 788 | 750 (95.2%) | 735 (93.3%) |
| LogP | 15,806 | 14,544 (92%) | 14,041 (88.8%) |
| MP | 10,051 | 9120 (90.7%) | 8656 (86.1%) |
| RB | 1265 | 1196 (94.5%) | 1171 (92.5%) |
| VP | 3037 | 2840 (93.5%) | 2716 (89.4%) |
| WS | 5764 | 4372 (75.8%) | 4224 (73.3%) |
(a) Percentages relative to the original dataset are shown in parentheses; only 2D descriptors were used.
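The percentages in parentheses are each count divided by the size of the original dataset. A minimal sketch of that arithmetic for three rows of the table (the `retained_pct` helper is a hypothetical name, and rounding to one decimal is an assumption about how the table was prepared):

```python
# (original, top-quality, QSAR-ready) counts copied from the table above
counts = {
    "AOH": (818, 818, 745),
    "BP": (5890, 5591, 5436),
    "LogP": (15806, 14544, 14041),
}


def retained_pct(kept, original):
    """Share of the original dataset that survives a step, in percent."""
    return round(100.0 * kept / original, 1)


for prop, (original, top_quality, qsar_ready) in counts.items():
    print(prop, retained_pct(top_quality, original), retained_pct(qsar_ready, original))
```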
Fig. 1 Distribution of experimental logP values between training and test sets
Performance of the selected models in fitting, CV, and on the test sets
| Property | No. of descriptors | Fivefold CV (75%) Q2 | Fivefold CV (75%) RMSE | Training (75%) dataset size | Training (75%) R2 | Training (75%) RMSE | Test (25%) dataset size | Test (25%) R2 | Test (25%) RMSEP |
|---|---|---|---|---|---|---|---|---|---|
| AOH | 13 | 0.85 | 1.14 | 516 | 0.85 | 1.12 | 176 | 0.83 | 1.23 |
| BCF | 10 | 0.84 | 0.55 | 469 | 0.85 | 0.53 | 157 | 0.83 | 0.64 |
| BioHL | 6 | 0.89 | 0.25 | 112 | 0.88 | 0.26 | 38 | 0.75 | 0.38 |
| BP | 13 | 0.93 | 22.46 | 4077 | 0.93 | 22.06 | 1358 | 0.93 | 22.08 |
| HL | 9 | 0.84 | 1.96 | 441 | 0.84 | 1.91 | 150 | 0.85 | 1.82 |
| KM | 12 | 0.83 | 0.49 | 405 | 0.82 | 0.5 | 136 | 0.73 | 0.62 |
| KOA | 2 | 0.95 | 0.69 | 202 | 0.95 | 0.65 | 68 | 0.96 | 0.68 |
| KOC | 12 | 0.81 | 0.55 | 545 | 0.81 | 0.54 | 184 | 0.71 | 0.61 |
| LogP | 9 | 0.86 | 0.69 | 10,537 | 0.86 | 0.67 | 3513 | 0.86 | 0.78 |
| MP | 16 | 0.74 | 50.20 | 6486 | 0.75 | 49.12 | 2167 | 0.74 | 52.27 |
| VP | 12 | 0.91 | 1.08 | 2034 | 0.91 | 1.08 | 679 | 0.92 | 1 |
| WS | 11 | 0.87 | 0.81 | 3158 | 0.87 | 0.82 | 1066 | 0.86 | 0.86 |
The QMRF reports published online
| Property | JRC report ID | DOI |
|---|---|---|
| AOH | Q17-22b-0024 | 10.13140/RG.2.2.24685.59368/2 |
| BCF | Q17-24a-0023 | 10.13140/RG.2.2.17974.70722/1 |
| BioHL | Q17-23b-0022 | 10.13140/RG.2.2.34751.92320/1 |
| BP | Q17-12-0021 | 10.13140/rg.2.2.33074.20160/1 |
| HL | Q17-19-0020 | 10.13140/rg.2.2.17764.99201/1 |
| KM | Q17-66-0019 | 10.13140/rg.2.2.31186.76482/1 |
| KOA | Q17-18-0018 | 10.13140/rg.2.2.14409.54883/1 |
| KOC | Q17-26-0017 | 10.13140/rg.2.2.27831.32163/1 |
| LogP | Q17-16-0016 | 10.13140/rg.2.2.12731.82723/1 |
| MP | Q17-11-0015 | 10.13140/rg.2.2.26153.60003/1 |
| RB | Q17-23a-0014 | 10.13140/rg.2.2.19442.71369/1 |
| VP | Q17-14-0013 | 10.13140/rg.2.2.32864.48641/1 |
| WS | Q17-13-0012 | 10.13140/rg.2.2.16087.27041/1 |
Fig. 2 Results search header for atrazine on the CompTox Chemistry Dashboard
Fig. 3 Summary view of experimental and predicted physicochemical properties
Fig. 4 Melting Point (MP) experimental and predicted values from different sources
Fig. 5 OPERA prediction calculation report for the melting point of bisphenol A
Fig. 6 Experimental and predicted values for training and test set of OPERA logP model
Fig. 7 LogP predictions for KOWWIN model. The overestimated cluster selected for comparison is highlighted in a red ellipse
Fig. 8 LogP predictions for KOWWIN model in purple stars compared to OPERA model in green circles
Local comparison of OPERA logP and KOWWIN
| Model | R2 | RMSE |
|---|---|---|
| OPERA logP | 0.75 | 1.19 |
| KOWWIN | −0.35 | 2.79 |
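A negative R2 is possible when R2 is computed as 1 - SS_res/SS_tot: it means the predictions fit the selected chemicals worse than simply predicting their mean value, which is what a systematic overestimate over a narrow range of values produces. The sketch below uses made-up numbers, not data from the study, and assumes this common definition of R2.

```python
import math


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical cluster: correct trend, but every prediction is 2 log units too high
observed = [4.0, 4.5, 5.0, 5.5, 6.0]
overestimated = [v + 2.0 for v in observed]

print(r2(observed, overestimated), rmse(observed, overestimated))  # prints -7.0 2.0
```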
Newly added data for PBDEs and resulting OPERA model predicted logP values
| DTXSID | Name | CASRN | OPERA logP (old) | Newly added data | OPERA logP (new) |
|---|---|---|---|---|---|
| DTXSID40872703 | BDE-17 | 147217-75-2 | 5.13 | 5.74 ± 0.22 | 5.80 |
| DTXSID4052710 | BDE-28 | 41318-75-6 | 4.17 | 5.94 ± 0.15 | 5.97 |
| DTXSID3030056 | BDE-47 | 5436-43-1 | 5.65 | 6.81 ± 0.08 | 6.56 |
| DTXSID4052685 | BDE-85 | 182346-21-0 | 6.00 | 7.37 ± 0.12 | 7.38 |
| DTXSID9030048 | BDE-99 | 60348-60-9 | 6.03 | 7.32 ± 0.14 | 7.38 |
| DTXSID4052689 | BDE-100 | 189084-64-8 | 6.04 | 7.24 ± 0.16 | 7.26 |
| DTXSID4030047 | BDE-153 | 68631-49-2 | 6.00 | 7.90 ± 0.14 | 7.72 |
| DTXSID3052692 | BDE-154 | 207122-15-4 | 5.94 | 7.82 ± 0.16 | 7.72 |
| DTXSID8052693 | BDE-183 | 207122-16-5 | 6.09 | 8.27 ± 0.26 | 8.19 |
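One way to summarize the table above is to compare the RMSE of the old and new predictions against the newly added values. The sketch below does this with the numbers copied from the table; it ignores the reported ± uncertainties, and because the new predictions come from a model that already includes these data, the second figure is not an independent test of accuracy.

```python
import math

# (name, OPERA logP old, newly added value, OPERA logP new), copied from the table above
pbde = [
    ("BDE-17", 5.13, 5.74, 5.80),
    ("BDE-28", 4.17, 5.94, 5.97),
    ("BDE-47", 5.65, 6.81, 6.56),
    ("BDE-85", 6.00, 7.37, 7.38),
    ("BDE-99", 6.03, 7.32, 7.38),
    ("BDE-100", 6.04, 7.24, 7.26),
    ("BDE-153", 6.00, 7.90, 7.72),
    ("BDE-154", 5.94, 7.82, 7.72),
    ("BDE-183", 6.09, 8.27, 8.19),
]


def rmse(predicted, reference):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


measured = [row[2] for row in pbde]
rmse_old = rmse([row[1] for row in pbde], measured)
rmse_new = rmse([row[3] for row in pbde], measured)
print(f"RMSE old: {rmse_old:.2f}, RMSE new: {rmse_new:.2f}")
```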
OPERA model prediction performance for MP with and without salt information
| Mode | Variables | Fivefold CV (75%) Q2 | Fivefold CV (75%) RMSE (°C) | Training (75%) R2 | Training (75%) RMSE (°C) | Test (25%) R2 | Test (25%) RMSEP (°C) |
|---|---|---|---|---|---|---|---|
| No salts | 15 | 0.72 | 51.8 | 0.74 | 50.27 | 0.73 | 52.72 |
| With salts | 16 | 0.74 | 50.2 | 0.75 | 49.12 | 0.74 | 52.27 |
OPERA and EPI Suite MP prediction statistics for chemicals with salts
| Dataset | Chemicals with salts | RMSE OPERA, no salts (°C) | RMSE OPERA, with salts (°C) | RMSE EPI Suite (°C) |
|---|---|---|---|---|
| Training set | 117 | 102.18 | 81.56 | 154.78 |
| Test set | 38 | 98.73 | 88.68 | 154.42 |