Thomas H Miller, Matteo D Gallidabino, James I MacRae, Stewart F Owen, Nicolas R Bury, Leon P Barron.
Abstract
The application of machine learning has recently gained interest in ecotoxicology for its ability to model and predict chemical and/or biological processes, such as bioconcentration. However, the comparison of different models and the prediction of bioconcentration in invertebrates have not previously been evaluated. A comparison of 24 linear and machine learning models is presented herein for the prediction of bioconcentration in fish, and important factors that influenced accumulation are identified. R² and root mean square error (RMSE) for the test data (n = 110 cases) ranged from 0.23 to 0.73 and from 0.34 to 1.20, respectively. Model performance was critically assessed, with neural networks and tree-based learners showing the best performance. An optimised 4-layer multi-layer perceptron (14 descriptors) was selected for further testing. The model was applied for cross-species prediction of bioconcentration in a freshwater invertebrate, Gammarus pulex. The model for G. pulex showed good performance, with R² of 0.99 and 0.93 for the verification and test data, respectively. Important molecular descriptors determined to influence bioconcentration were molecular mass (MW), octanol-water distribution coefficient (logD), topological polar surface area (TPSA) and number of nitrogen atoms (nN), among others. Modelling of hazard criteria such as PBT showed potential to replace the need for animal testing. However, the use of machine learning models in the regulatory context has been minimal to date and is critically discussed herein. The movement away from experimental estimations of accumulation to in silico modelling would enable rapid prioritisation of contaminants that may pose a risk to environmental health and the food chain.
Keywords: BCF; Bioconcentration; Machine learning; Modelling; PBT; Pharmaceutical
Year: 2018; PMID: 30114591; PMCID: PMC6234108; DOI: 10.1016/j.scitotenv.2018.08.122
Source DB: PubMed; Journal: Sci Total Environ; ISSN: 0048-9697; Impact factor: 7.963
Comparison of model performance for the prediction of BCF in Cyprinus carpio. MAE is the mean absolute error and NA indicates the metric was not applicable.
| Software | RMSE (training) | RMSE (verification) | RMSE (test) | R² (training) | R² (verification) | R² (test) | MAE (training) | MAE (verification) | MAE (test) |
|---|---|---|---|---|---|---|---|---|---|
| Trajan | 0.785 | 1.052 | 0.832 | 0.532 | 0.390 | 0.521 | 0.619 | 0.835 | 0.608 |
| Trajan | 0.830 | 0.893 | 0.873 | 0.673 | 0.400 | 0.569 | 0.664 | 0.893 | 0.718 |
| Trajan | 0.723 | 0.689 | 0.584 | 0.651 | 0.635 | 0.725 | 0.565 | 1.600 | 0.450 |
| Trajan | 0.689 | 0.538 | 0.337 | 0.675 | 0.770 | 0.659 | 0.548 | 1.608 | 0.553 |
| Trajan | 0.403 | 0.524 | 0.644 | 0.887 | 0.819 | 0.702 | 0.313 | 0.380 | 0.530 |

| Software | RMSE (training) | RMSE (cross-validation) | RMSE (test) | R² (training) | R² (cross-validation) | R² (test) | MAE (training) | MAE (cross-validation) | MAE (test) |
|---|---|---|---|---|---|---|---|---|---|
| R | 0.719 | 0.771 | 1.203 | 0.621 | 0.570 | 0.234 | 0.560 | NA | 0.778 |
| R | 0.722 | 0.769 | 1.164 | 0.618 | 0.571 | 0.254 | 0.564 | NA | 0.765 |
| R | 0.725 | 0.766 | 1.083 | 0.614 | 0.576 | 0.304 | 0.568 | NA | 0.753 |
| R | 0.729 | 0.760 | 1.054 | 0.612 | 0.582 | 0.314 | 0.577 | NA | 0.754 |
| R | 0.733 | 0.757 | 1.112 | 0.607 | 0.585 | 0.284 | 0.562 | NA | 0.770 |
| R | 0.517 | 0.683 | 0.902 | 0.807 | 0.665 | 0.468 | 0.404 | NA | 0.648 |
| R | 0.673 | 0.756 | 1.014 | 0.668 | 0.593 | 0.346 | 0.529 | NA | 0.768 |
| R | 0.596 | 0.751 | 0.877 | 0.739 | 0.597 | 0.505 | 0.462 | NA | 0.620 |
| R | 0.395 | 0.672 | 0.859 | 0.888 | 0.678 | 0.518 | 0.319 | NA | 0.612 |
| R | 0.232 | 0.834 | 1.022 | 0.962 | 0.560 | 0.370 | 0.174 | NA | 0.680 |
| R | 0.454 | 0.795 | 0.880 | 0.860 | 0.582 | 0.520 | 0.345 | NA | 0.624 |
| R | 0.539 | 0.730 | 1.014 | 0.787 | 0.632 | 0.390 | 0.425 | NA | 0.696 |
| R | 0.500 | 0.681 | 0.899 | 0.819 | 0.673 | 0.479 | 0.395 | NA | 0.633 |
| R | 0.383 | 0.644 | 0.841 | 0.893 | 0.704 | 0.537 | 0.261 | NA | 0.590 |
| R | 0.699 | 0.747 | 1.029 | 0.643 | 0.594 | 0.340 | 0.539 | NA | 0.729 |
| R | 0.292 | 0.675 | 0.771 | 0.956 | 0.688 | 0.633 | 0.231 | NA | 0.589 |
| R | 0.605 | 0.739 | 0.821 | 0.762 | 0.630 | 0.586 | 0.485 | NA | 0.652 |
| R | 0.249 | 0.660 | 0.789 | 0.957 | 0.687 | 0.593 | 0.187 | NA | 0.587 |
| R | 0.353 | 0.678 | 0.973 | 0.910 | 0.673 | 0.431 | 0.282 | NA | 0.628 |
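The three metrics reported in the table (RMSE, R² and MAE) can be computed from paired observed and predicted logBCF values. The following is a minimal sketch, not the authors' code. It assumes R² is the coefficient of determination (1 − SS_res/SS_tot); the record does not say whether the squared correlation coefficient was used instead. The function names and the toy values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error of the predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error of the predictions."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only; these are not data from the study.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print(rmse(observed, predicted))
print(mae(observed, predicted))
print(r_squared(observed, predicted))
```

Reporting all three together is useful because RMSE penalises large residuals more heavily than MAE, while R² expresses error relative to the spread of the observed values.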
Fig. 1. (a) Linear regression of the predicted logBCF values versus the observed logBCF values in fish using the 4-MLP developed in approach 1: training data (crosses, n = 242), verification data (circles, n = 55) and test data (triangles, n = 55). (b) Raw residuals of the predicted logBCF data in fish for the verification and test data only.
Fig. 2. (a) Principal component analysis used to visualise case similarity based on the 14 modelled descriptors (i.e. the applicability domain). (b) Distances between cases in the PCA space, with the applied threshold (0.975 quantile of the χ² distribution) shown by the red line. (c) Distribution of cases based on distance in the PCA space.
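The thresholding idea in this caption can be illustrated with a short sketch. It is a sketch under stated assumptions rather than the authors' procedure: it assumes the PCA scores are already computed, that two uncorrelated components are retained, and that the distance is the variance-scaled squared distance from the centre of the score space. With two components the χ² quantile has a closed form, so no statistics library is needed. All names and the toy scores are hypothetical.

```python
import math

def chi2_quantile_2df(p):
    """Quantile of the chi-squared distribution with 2 degrees of freedom.

    For 2 d.f. the CDF is 1 - exp(-x / 2), so the quantile is closed-form.
    """
    return -2.0 * math.log(1.0 - p)

def squared_score_distances(scores):
    """Squared distance of each case from the centre of the score space,
    with each component scaled by its sample variance (assumed definition)."""
    n = len(scores)
    k = len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(k)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in scores) / (n - 1) for j in range(k)
    ]
    return [
        sum((row[j] - means[j]) ** 2 / variances[j] for j in range(k))
        for row in scores
    ]

def outside_domain(scores, p=0.975):
    """Flag cases whose squared distance exceeds the chi-squared threshold."""
    threshold = chi2_quantile_2df(p)
    return [d2 > threshold for d2 in squared_score_distances(scores)]

# Toy two-component scores (not study data): 20 clustered cases and 1 distant case.
toy_scores = [(1, 1), (1, -1), (-1, 1), (-1, -1)] * 5 + [(10, 0)]
flags = outside_domain(toy_scores)
print(chi2_quantile_2df(0.975))  # threshold for two components
print(flags.index(True))         # position of the flagged case
```

Cases flagged in this way would fall outside the region of descriptor space covered by the training data, so predictions for them should be treated with more caution.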
Fig. 3. (a) Comparison of the predicted logBCF data versus the observed logBCF in invertebrates using the fish-based 4-layer MLP. (b) Regression of a separately developed and optimised model trained with the invertebrate BCF data (Gammarus pulex): training set (crosses, n = 24), verification set (circles, n = 5) and test set (triangles, n = 5).
Fig. 4. Descriptor sensitivity analysis performed by removing a descriptor from the model and assessing the effect on performance. Increased error ratios indicate more important descriptors. (a) Descriptor sensitivity for the fish-based model and (b) for the invertebrate-based model.
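The error-ratio idea in this caption can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the error ratio is taken to be the test RMSE with one descriptor removed divided by the test RMSE with all descriptors, and a one-nearest-neighbour predictor stands in for the neural network. The function names, descriptor names and toy data are hypothetical.

```python
import math

def knn1_predict(train_X, train_y, x):
    """Predict with the single nearest training case (toy stand-in model)."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[nearest]

def rmse(observed, predicted):
    """Root mean square error of the predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def error_ratios(train_X, train_y, test_X, test_y, names):
    """Error ratio per descriptor: RMSE without it / RMSE with all descriptors."""
    def score(cols):
        tr = [[row[c] for c in cols] for row in train_X]
        te = [[row[c] for c in cols] for row in test_X]
        preds = [knn1_predict(tr, train_y, x) for x in te]
        return rmse(test_y, preds)

    all_cols = list(range(len(names)))
    baseline = score(all_cols)
    return {
        name: score([c for c in all_cols if c != j]) / baseline
        for j, name in enumerate(names)
    }

# Toy data (not study data): the response depends only on the first descriptor.
names = ["informative", "irrelevant"]
train_X = [[0, 5], [1, 1], [2, 4], [3, 0]]
train_y = [0, 2, 4, 6]
test_X = [[0.4, 3], [1.6, 2], [2.6, 1]]
test_y = [0.8, 3.2, 5.2]

ratios = error_ratios(train_X, train_y, test_X, test_y, names)
print(ratios)
```

In this toy case, removing the informative descriptor gives a ratio above 1, while removing the irrelevant one gives a ratio below 1. This matches the reading of the caption that larger ratios mark more important descriptors.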