Craig Warren Davis, Louise Camenzuli, Aaron D Redman.
Abstract
Quantitative structure-property relationship (QSPR) models for predicting primary biodegradation of petroleum hydrocarbons have been previously developed. These models use experimental data generated under widely varied conditions, the effects of which are not captured adequately within model formalisms. As a result, they exhibit variable predictive performance and are unable to incorporate the role of study design and test conditions on the assessment of environmental persistence. To address these limitations, a novel machine-learning System-Integrated Model (HC-BioSIM) is presented, which integrates chemical structure and test system variability, leading to improved prediction of primary disappearance time (DT50) values for petroleum hydrocarbons in fresh and marine water. An expanded, highly curated database of 728 experimental DT50 values (181 unique hydrocarbon structures compiled from 13 primary sources) was used to develop and validate a supervised model tree machine-learning model. Using relatively few parameters (6 system and 25 structural parameters), the model demonstrated significant improvement in predictive performance (root mean square error = 0.26, R² = 0.67) over existing QSPR models. The model also demonstrated improved accuracy of persistence (P) categorization (i.e., "Not P/P/vP"), with an accuracy of 96.8%, and false-positive and -negative categorization rates of 0.4% and 2.7%, respectively. This significant improvement in DT50 prediction, and subsequent persistence categorization, validates the need for models that integrate experimental design and environmental system parameters into biodegradation and persistence assessment. Environ Toxicol Chem 2022;41:1359-1369.
Keywords: Biodegradation; Environmental modeling; Hydrocarbon; Machine learning; Persistent compounds; Quantitative structure-property relationship
Year: 2022 PMID: 35262215 PMCID: PMC9320815 DOI: 10.1002/etc.5328
Source DB: PubMed Journal: Environ Toxicol Chem ISSN: 0730-7268 Impact factor: 4.218
Summary of studies that met screening criteria, including relevant study information used in model development and number of data points (No.)
| Test media/inoculum source | Hydrocarbon source | Test temperature (°C) | Dosing method | Use of dispersant | No. | References |
|---|---|---|---|---|---|---|
| Freshwater | Gasoline | 21 | Direct | N | 110 | Prince et al. |
| Freshwater | B20 diesel | 21 | Direct | N | 72 | Prince et al. |
| Freshwater | Defined mixture | 20 | P.D. | N | 33, 22 | Prosser et al. |
| Seawater | Crude oil | 2 | Direct | N | 44 | McFarlin et al. |
| Seawater | Crude oil | 5 | Direct | Y | 80, 14, 32 | Brakstad, Ribicic, et al. |
| Seawater | Crude oil | 5–13 | Direct | Y | 107 | Ribicic et al. |
| Seawater | Crude oil | 8 | Direct | Y | 24 | Prince et al. |
| Seawater | Crude oil | 21 | Direct | Y | 69 | Prince et al. |
| Seawater | Defined mixture | 20 | P.D. | N | 29, 25 | Prosser et al. |
| Seawater | Defined mixture | 20 | P.D. | N | 18 | Comber et al. |
| Seawater | Produced water | 13 | P.D. | N | 10 | Lofthus et al. |
| Activated sludge | Defined mixture | 20 | P.D. | N | 39 | Birch et al. |
| Total | | | | | 728 | |
Several crude oil datasets were compiled; a complete characterization of the test substances and the experimental designs is available in the Excel File in the Supporting Information.
Test material: Produced water containing oil droplets and oil‐coated particulates collected from offshore drilling operations in the North Sea.
Complete documentation of screening criteria, study designs, test system parameters, and additional notes are provided in Section A2 of the Supporting Information.
P.D. = passively dosed.
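As a quick consistency check on the study summary, the per-study counts in the "No." column can be added up by test medium. The sketch below only transcribes the counts listed in the table; the dictionary layout and variable names are illustrative.

```python
# Data-point counts per study, transcribed from the "No." column of the summary table.
counts_by_medium = {
    "Freshwater": [110, 72, 33, 22],
    "Seawater": [44, 80, 14, 32, 107, 24, 69, 29, 25, 18, 10],
    "Activated sludge": [39],
}

# Subtotal per test medium, then the grand total across all media.
subtotals = {medium: sum(counts) for medium, counts in counts_by_medium.items()}
total = sum(subtotals.values())

print(subtotals)
print(total)
```

The grand total agrees with the 728 DT50 values reported in the abstract and in the table's "Total" row.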
Figure 1. Schematic diagram of the System‐Integrated Model (HC‐BioSIM) cubist decision tree machine‐learning workflow. User‐defined input includes experimental disappearance time (DT50) values (labels used to train the model), chemical structure (C), and system (S) parameters. Model‐defined rules (R) for parsing the dataset are indicated by white circles, and the blue box indicates terminal subset "nodes," where multiple linear regressions (MLRs) are applied, resulting in a prediction of DT50 values for that subset. Example rules are included for illustrative purposes.
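The workflow in the caption (rules that partition the data, followed by a linear regression in each terminal subset) can be sketched in a few lines. This is a minimal illustration and not the published model: the feature names (`temperature_c`, `pahnl`, `carbon_number`), the thresholds, and all coefficients are hypothetical, and it assumes the regressions predict log10(DT50) and that predictions from all matching rules are averaged.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]


@dataclass
class RuleNode:
    """One terminal subset: a conjunction of rule conditions plus a linear model."""
    conditions: List[Callable[[Features], bool]]
    intercept: float
    coefficients: Dict[str, float]

    def matches(self, x: Features) -> bool:
        # A sample belongs to the subset only if every rule condition holds.
        return all(cond(x) for cond in self.conditions)

    def predict_log_dt50(self, x: Features) -> float:
        # Multiple linear regression applied within the subset.
        return self.intercept + sum(c * x[name] for name, c in self.coefficients.items())


def predict_dt50(nodes: List[RuleNode], x: Features) -> float:
    """Average log10(DT50) over all matching subsets, then back-transform to days."""
    preds = [n.predict_log_dt50(x) for n in nodes if n.matches(x)]
    if not preds:
        raise ValueError("no rule covers this sample")
    return 10 ** (sum(preds) / len(preds))


# Hypothetical rules and coefficients, for illustration only.
nodes = [
    RuleNode([lambda x: x["temperature_c"] > 10, lambda x: x["pahnl"] == 0],
             intercept=0.2, coefficients={"carbon_number": 0.05}),
    RuleNode([lambda x: x["temperature_c"] <= 10, lambda x: x["pahnl"] == 0],
             intercept=0.6, coefficients={"carbon_number": 0.05}),
]
sample = {"temperature_c": 5.0, "pahnl": 0.0, "carbon_number": 8.0}
print(predict_dt50(nodes, sample))
```

Keeping the system parameters (here, temperature) in the rule conditions is what lets the same chemical structure receive different DT50 predictions under different test conditions.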
Figure 2. Predicted versus observed experimental disappearance time (DT50; in days) for the (A) BioHCwin and (B) System‐Integrated Model (HC‐BioSIM) models. Solid line represents 1:1 agreement, and semidashed and dashed lines represent 3× and 10× errors in predictions, respectively. Colors correspond to hydrocarbon classes: n‐paraffins (nP), iso‐paraffins (iP), mono‐naphthenics (MN), di‐naphthenics (DN), polynaphthenics (PN), mono‐aromatics (MAr), naphthenic mono‐aromatics (NMAH), di‐aromatics (DAH), polyaromatics (PAH), naphthenic di‐aromatics (NDAH), and naphthenic polyaromatics (NPAH).
Comparison of HC‐BioSIM and BioHCWin model performance for training and validation sets (including k‐fold cross‐validation)
| Dataset | No. | HC‐BioSIM RMSE | HC‐BioSIM R² | BioHCWin RMSE | BioHCWin R² |
|---|---|---|---|---|---|
| Training | 582 | 0.23 | 0.71 | 0.76 | 0.16 |
| Validation | 146 | 0.34 | 0.52 | 0.72 | 0.18 |
| All | 728 | 0.26 | 0.67 | 0.75 | 0.17 |
| CV test fold | 146 | 0.30 ± 0.01 (3.2%) | 0.51 ± 0.08 (16%) | 0.75 ± 0.05 (6.0%) | 0.16 ± 0.03 (18%) |
Mean ± standard deviation (SD) RMSE and R² values for the individual test folds (k = 5). Coefficients of variation (%) are included in parentheses.
RMSE and R² values are reported for both models. A complete description of the cross‐validation technique, individual fold statistics, and parameters is presented in Section A7 of the Supporting Information.
CV = cross validation; RMSE = root mean square error.
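The statistics in this table can be computed as in the sketch below. It assumes RMSE and R² are evaluated on log10(DT50) values and that R² is the coefficient of determination; the source does not state which R² definition was used. The per-fold RMSE values are hypothetical placeholders, not the fold statistics from the Supporting Information.

```python
import math
from statistics import mean, stdev


def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    squared_errors = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(mean(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical log10(DT50) values for a handful of compounds.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(rmse(observed, predicted), r_squared(observed, predicted))

# Hypothetical per-fold RMSE values for k = 5 folds.
fold_rmse = [0.29, 0.30, 0.31, 0.30, 0.30]
print(mean(fold_rmse), stdev(fold_rmse), coefficient_of_variation(fold_rmse))
```

A low coefficient of variation across folds indicates that performance does not depend strongly on which fifth of the data is held out.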
Figure 3. Boxplots of logarithmic model residuals (predicted minus experimental log(DT50)) as a function of test system parameters (A–D), carbon number (E), and hydrocarbon class (F). Semidashed lines represent a 2‐fold predicted error (0.3 log units), and dashed lines represent a 10‐fold predicted error (1.0 log unit). Box widths are proportional to the square root of the number of observations. For abbreviations, see Figure 2 legend.
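The correspondence between log-unit residuals and fold errors in the caption follows from back-transforming the residual. The short sketch below assumes base-10 logarithms; the function names are illustrative.

```python
import math


def fold_error(log_residual):
    """Multiplicative error implied by a log10 residual of either sign."""
    return 10 ** abs(log_residual)


def log_units(fold):
    """log10 residual magnitude implied by a given fold error."""
    return math.log10(fold)


print(fold_error(0.3))   # close to 2-fold
print(fold_error(1.0))   # 10-fold
print(log_units(3.0))    # log units for the 3x lines used in Figure 2
```

On the same assumption, the overall RMSE values of 0.26 and 0.75 log units correspond to typical errors of roughly 1.8‐fold and 5.6‐fold, respectively.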
Summary of HC‐BioSIM model subsets (S), rules (R), number of data points (No.), logarithmic average prediction error (E), and brief descriptions of data subsets
| Subset (S) | Rules (R) | No. | Average predictive error (E) | Description of data subset |
|---|---|---|---|---|
| 1 | …; PAHNL = 0 | 51 | 0.12 | Dispersed, low loading, mid‐high temperature, no PAHNL |
| 2 | … | 162 | 0.14 | High loading |
| 3 | …; PAHNL = 1 | 16 | 0.17 | Dispersed, mid‐high temperature, PAHNL |
| 4 | …; PAHNL = 0 | 129 | 0.37 | Dispersed, low loading, high temperature, no PAHNL |
| 5 | …; PAHNL = 0 | 78 | 0.10 | Low‐viscosity HC substrate, low temperature, no PAHNL |
| 6 | …; PAHNL = 0 | 54 | 0.16 | Nondispersed, mid‐low temperature, no PAHNL |
| 7 | …; PAHNL = 0 | 51 | 0.09 | High‐viscosity HC substrate, dispersed, low temperature, no PAHNL |
| 8 | …; PAHNL = 1 | 36 | 0.18 | Low temperature, PAHNL |
| 9 | …; PAHNL = 1 | 24 | 0.38 | Nondispersed, low loading, mid‐high temperature, PAHNL |
PAHNL = presence or absence of non‐linear 3‐ring PAH structural fragment.
Prediction matrix of persistence categorization based on European Chemicals Agency freshwater and marine compartmental half‐life criteria
| System | Prediction | BioHCWin (%) | HC‐BioSIM (%) |
|---|---|---|---|
| Freshwater | FN (type II) | 0.4 | 1.3 |
| Freshwater | Correct | 93.7 | 97.9 |
| Freshwater | FP (type I) | 5.9 | 0.8 |
| Seawater | FN (type II) | 1.1 | 3.3 |
| Seawater | Correct | 87.6 | 96.2 |
| Seawater | FP (type I) | 11.3 | 0.4 |
| Total | FN (type II) | 0.9 | 2.6 |
| Total | Correct | 89.7 | 96.8 |
| Total | FP (type I) | 9.43 | 0.6 |
For freshwater, the European Union Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) P and vP criteria of 40 and 60 days are used, respectively.
For marine water, the European Union REACH singular P and vP criteria of 60 days are used.
Activated sludge primary disappearance time (DT50) values (n = 39) were excluded from this evaluation, because their applicability in comparing against either freshwater or marine DT50 criteria is not clear.
Prediction matrices for the SI‐BioHCWin and bio‐pp‐LFER models are presented in Section A8 of the Supporting Information.
FN = false negative; FP = false positive.
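The categorization behind this prediction matrix can be sketched from the criteria in the footnotes (freshwater P and vP at 40 and 60 days; marine P and vP both at 60 days). Several details are assumptions rather than statements from the source: a DT50 strictly greater than the criterion counts as exceeding it, a false positive means the predicted category is more persistent than the observed one, and the function and dictionary names are illustrative.

```python
# (P criterion, vP criterion) in days, taken from the table footnotes.
THRESHOLDS = {"freshwater": (40.0, 60.0), "marine": (60.0, 60.0)}
RANK = {"Not P": 0, "P": 1, "vP": 2}


def categorize(dt50_days, system):
    """Assign Not P / P / vP; strict '>' comparison is an assumption."""
    p_limit, vp_limit = THRESHOLDS[system]
    if dt50_days > vp_limit:
        return "vP"
    if dt50_days > p_limit:
        return "P"
    return "Not P"


def outcome(predicted_dt50, observed_dt50, system):
    """Compare predicted and observed categories for one data point."""
    pred = categorize(predicted_dt50, system)
    obs = categorize(observed_dt50, system)
    if pred == obs:
        return "Correct"
    # Over-predicting persistence is treated as a false positive (type I).
    return "FP" if RANK[pred] > RANK[obs] else "FN"


print(outcome(70.0, 30.0, "freshwater"))
print(outcome(30.0, 70.0, "marine"))
print(outcome(45.0, 50.0, "freshwater"))
```

Because the marine P and vP criteria coincide at 60 days, this sketch never returns "P" for marine data: any value above 60 days is assigned "vP".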