Jang-Sik Choi1, My Kieu Ha2, Tung Xuan Trinh2, Tae Hyun Yoon2, Hyung-Gi Byun3.
Abstract
This study presents a generalized toxicity classification model for seven oxide nanomaterials. The model was developed from a data set extracted from multiple literature sources and screened by quality scores based on physicochemical properties. Preprocessing techniques, including the synthetic minority over-sampling technique (SMOTE), were applied to address the class imbalance in the data set. Classification models were then built with four algorithms (generalized linear model, support vector machine, random forest, and neural network), and their performances were compared to find the best-performing preprocessing methods and algorithms. The neural network model built on the balanced data set showed the best predictive performance, and its applicability domain was defined using the k-nearest neighbours algorithm. Analysis of relative attribute importance for this neural network model identified dose, formation enthalpy, exposure time, and hydrodynamic size as the four most important attributes. Because the model predicts nanomaterial toxicity while accounting for various experimental conditions, it has a broader and more general applicability domain than existing quantitative structure-activity relationship (QSAR) models.
Year: 2018 PMID: 29666463 PMCID: PMC5904177 DOI: 10.1038/s41598-018-24483-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Model development workflow.
Attributes used in the model development.
| PChem attributes | | QM attributes | Tox attributes | |
|---|---|---|---|---|
| Core size (nm) | Surface charge (mV) | Formation enthalpy | Assay method (AM) | Cell type (CT) |
| | | Conduction band energy | Cell name (CN) | Exposure time |
| Hydrodynamic size (nm) | Specific surface area (m²/g) | Valence band energy | Cell species (CS) | Cell viability (%) |
| | | Electronegativity | Cell origin (CO) | Dose (mg/mL) |
Skewness for each numeric attribute.
| Attribute | Skewness (before) | Method | Skewness (after) |
|---|---|---|---|
| Core size | 3.35 | Log10 | 0.15 |
| Surface charge | 1.68 | Log10 | −0.04 |
| Hydrodynamic size | 0.46 | Z-score | 0.46 |
| Surface area | 3.24 | Log10 | 0.34 |
| ΔHsf | −0.35 | Z-score | −0.35 |
| Ec | 1.75 | Log10 | 0.09 |
| Ev | −2.85 | Min–max | −2.85 |
| χMeO | 2.28 | Min–max | 2.28 |
| ET | 1.39 | Log10 | −0.60 |
| Dose | 11.30 | Log10 | −0.14 |
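
The effect of a log10 transform on a right-skewed attribute, as summarized in the skewness table above, can be illustrated with a minimal sketch. The moment-based skewness formula and the `doses` values below are assumptions chosen for illustration; the source does not state which skewness estimator was used, and the numbers are not taken from the study's data set.

```python
import math

def skewness(values):
    """Moment-based (population) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log10_transform(values):
    """Log10 transform; only valid for strictly positive values."""
    return [math.log10(v) for v in values]

# Hypothetical, strongly right-skewed dose values (not from the paper).
doses = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0, 100.0]

skew_before = skewness(doses)
skew_after = skewness(log10_transform(doses))
print(f"before: {skew_before:.2f}, after: {skew_after:.2f}")
```

For data like this, the skewness after the transform is much closer to zero than before, which is the pattern the table reports for the Log10 rows.
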
Model performance for each normalization method.
| Model | Normalization method | True positive | False positive | False negative | True negative | Sensitivity | Specificity | Balanced accuracy |
|---|---|---|---|---|---|---|---|---|
| GLM | Min-Max | 39 | 12 | 16 | 278 | 71% | 96% | 83% |
| GLM | z-score | 39 | 12 | 16 | 278 | 71% | 96% | 83% |
| GLM | Log | 46 | 11 | 9 | 279 | 84% | 96% | 90% |
| GLM | combination | 45 | 8 | 10 | 282 | 82% | 97% | 90% |
| SVM | Min-Max | 28 | 6 | 27 | 284 | 51% | 98% | 74% |
| SVM | z-score | 29 | 7 | 26 | 283 | 53% | 98% | 75% |
| SVM | Log | 40 | 5 | 15 | 285 | 73% | 98% | 86% |
| SVM | combination | 41 | 5 | 14 | 285 | 75% | 98% | 86% |
| RF | Min-Max | 45 | 5 | 10 | 285 | 82% | 98% | 90% |
| RF | z-score | 44 | 5 | 11 | 285 | 80% | 98% | 89% |
| RF | Log | 45 | 5 | 10 | 285 | 82% | 98% | 90% |
| RF | combination | 45 | 5 | 10 | 285 | 82% | 98% | 90% |
| NNET | Min-Max | 38 | 15 | 17 | 275 | 69% | 95% | 82% |
| NNET | z-score | 40 | 6 | 15 | 284 | 73% | 98% | 85% |
| NNET | Log | 43 | 8 | 12 | 282 | 78% | 97% | 88% |
| NNET | combination | 48 | 8 | 7 | 282 | 87% | 97% | 92% |
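
The sensitivity, specificity, and balanced accuracy columns follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions, with the toxic class treated as positive (an assumption consistent with the counts), and applies them to the NNET "combination" row. The function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, balanced accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)          # fraction of toxic samples detected
    specificity = tn / (tn + fp)          # fraction of non-toxic samples detected
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy

# NNET, "combination" normalization row from the table above.
sens, spec, bal_acc = classification_metrics(tp=48, fp=8, fn=7, tn=282)
print(f"{sens:.0%} {spec:.0%} {bal_acc:.0%}")  # prints "87% 97% 92%"
```

Balanced accuracy is the more informative summary here because the classes are imbalanced (55 toxic versus 290 non-toxic samples in these rows), so plain accuracy would be dominated by the non-toxic class.
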
WMW analysis of attributes in data set.
| Attribute | nonToxic mean rank | nonToxic sum of ranks | Toxic mean rank | Toxic sum of ranks | z | p-value |
|---|---|---|---|---|---|---|
| ΔHsf | 263.65 | 129188.50 | 426.63 | 35836.50 | −8.93 | 4.28E-19 |
| χMeO | 306.09 | 149985.50 | 179.04 | 15039.50 | 7.03 | 2.07E-12 |
| Dose | 267.43 | 131041.50 | 404.57 | 33983.50 | −7.02 | 2.23E-12 |
| Surface area | 307.31 | 150582.00 | 171.94 | 14443.00 | 6.92 | 4.37E-12 |
| Ev | 269.86 | 132233.50 | 390.38 | 32791.50 | −6.60 | 4.04E-11 |
| Exposure time | 273.44 | 133986.50 | 369.51 | 31038.50 | −5.23 | 1.70E-07 |
| Ec | 275.76 | 135124.50 | 355.96 | 29900.50 | −4.39 | 1.11E-05 |
| Core size | 275.48 | 134987.00 | 357.60 | 30038.00 | −4.21 | 2.58E-05 |
| Surface charge | 277.10 | 135778.00 | 348.18 | 29247.00 | −3.64 | 2.70E-04 |
| Hydrodynamic size | 281.45 | 137910.50 | 322.79 | 27114.50 | −2.11 | 3.45E-02 |
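
As a rough cross-check of the z column, the Wilcoxon-Mann-Whitney z statistic can be approximated from a group's rank sum with the normal approximation. The sketch below assumes no tie correction and no continuity correction, and takes the sign from the nonToxic group; the source does not say which corrections or software were used. Group sizes are recovered by dividing each rank sum by its mean rank.

```python
import math

def wmw_z_from_rank_sum(rank_sum, n_group, n_other):
    """Normal-approximation z for a rank sum (no tie or continuity correction)."""
    n_total = n_group + n_other
    expected = n_group * (n_total + 1) / 2
    variance = n_group * n_other * (n_total + 1) / 12
    return (rank_sum - expected) / math.sqrt(variance)

def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# ΔHsf row: group sizes recovered from sum of ranks / mean rank.
n_nontoxic = round(129188.50 / 263.65)
n_toxic = round(35836.50 / 426.63)

z = wmw_z_from_rank_sum(129188.50, n_nontoxic, n_toxic)
p = two_sided_p(z)
print(n_nontoxic, n_toxic, round(z, 2))
```

Under these assumptions the ΔHsf row gives a z of roughly −8.3, the same sign and order of magnitude as the tabulated −8.93 but not identical. A tie correction would shrink the variance and enlarge |z|, which is one possible explanation for the gap, though that is a guess rather than something the source states.
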
Internal validation result.
| Model | Training data | True positive | False positive | False negative | True negative | Sensitivity | Specificity | Balanced accuracy |
|---|---|---|---|---|---|---|---|---|
| GLM | ID | 46 | 11 | 9 | 279 | 84% | 96% | 90% |
| GLM | BD | 255 | 30 | 20 | 245 | 93% | 89% | 91% |
| SVM | ID | 41 | 5 | 14 | 285 | 75% | 98% | 86% |
| SVM | BD | 269 | 5 | 6 | 270 | 98% | 98% | 98% |
| RF | ID | 45 | 5 | 10 | 285 | 82% | 98% | 90% |
| RF | BD | 268 | 2 | 7 | 273 | 97% | 99% | 98% |
| NNET | ID | 48 | 8 | 7 | 282 | 87% | 97% | 92% |
| NNET | BD | 272 | 6 | 3 | 269 | 99% | 98% | 98% |
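
In the BD rows each class totals 275 samples (for GLM, 255 + 20 and 30 + 245), while the ID rows total 55 toxic and 290 non-toxic samples, which suggests that ID and BD denote the imbalanced and balanced training sets. The abstract names SMOTE as one of the balancing techniques. The following is a minimal sketch of the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the authors' implementation: it picks base samples at random rather than cycling through them, handles numeric attributes only, and uses made-up `toxic_samples`.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating towards neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical normalized (dose, exposure time) pairs for the toxic class.
toxic_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_samples = smote(toxic_samples, n_synthetic=10, k=2)
print(len(new_samples))
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by the minority class. Categorical attributes such as cell type would need a variant of the method that this sketch does not cover.
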
External validation result.
| Model | Training data | True positive | False positive | False negative | True negative | Sensitivity | Specificity | Balanced accuracy |
|---|---|---|---|---|---|---|---|---|
| GLM | ID | 25 | 6 | 4 | 194 | 86% | 97% | 92% |
| GLM | BD | 26 | 21 | 3 | 179 | 90% | 90% | 90% |
| SVM | ID | 22 | 5 | 7 | 195 | 76% | 98% | 87% |
| SVM | BD | 25 | 10 | 4 | 190 | 86% | 95% | 91% |
| RF | ID | 24 | 3 | 5 | 197 | 83% | 99% | 91% |
| RF | BD | 25 | 9 | 4 | 191 | 86% | 96% | 91% |
| NNET | ID | 23 | 4 | 6 | 196 | 79% | 98% | 89% |
| NNET | BD | 27 | 13 | 2 | 187 | 93% | 94% | 93% |
Reliability validation result.
| True positive | False positive | False negative | True negative | Sensitivity | Specificity | Balanced accuracy |
|---|---|---|---|---|---|---|
| 9 | 8 | 5 | 122 | 64% | 93% | 79% |
Relative importance.
| Attribute | Relative importance |
|---|---|
| Dose | 11.10 |
| ΔHsf | 7.34 |
| Exposure time | 5.33 |
| Hydrodynamic size | 4.43 |
| Ec | 3.66 |
| Surface area | 3.58 |
| Core size | 3.57 |
| Cell species | 2.88 |
| χMeO | 2.44 |
| Cell type | 2.44 |
| Surface charge | 1.90 |
| Assay method | 1.82 |
| Ev | 1.45 |
| Cell name | 1.33 |
| Cell origin | 1.06 |
Applicability domain of the best model.
| Split | Sensitivity | Specificity | Balanced accuracy | | | | |
|---|---|---|---|---|---|---|---|
| 1 | 91% | 94% | 93% | 22 | 1.39 | 0.79 | 2.10 |
| 2 | 88% | 92% | 90% | 22 | 1.40 | 0.82 | 0.80 |
| 3 | 95% | 98% | 96% | 23 | 1.40 | 0.80 | 0.40 |
| 4 | 94% | 95% | 95% | 24 | 1.42 | 0.79 | 0.50 |
| 5 | 82% | 97% | 89% | 22 | 1.40 | 0.81 | 0.30 |
| 6 | 96% | 97% | 97% | 21 | 1.35 | 0.79 | 0.40 |
| 7 | 100% | 94% | 97% | 22 | 1.40 | 0.79 | 0.50 |
| 8 | 86% | 95% | 90% | 22 | 1.39 | 0.78 | 1.60 |
| 9 | 95% | 97% | 96% | 25 | 1.47 | 0.80 | 0.60 |
| 10 | 100% | 96% | 98% | 25 | 1.49 | 0.80 | 0.50 |
| avg. | 93% | 96% | 94% | 22.8 | 1.41 | 0.80 | 0.77 |
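
The abstract states that the applicability domain was defined with a k-nearest neighbours algorithm, but the exact procedure is not given here and four column headers of the table above were lost. The sketch below shows one common distance-based formulation, offered purely as an assumption: a query is inside the domain if its mean distance to its k nearest training samples does not exceed a threshold derived from the training set (mean plus `z` times the standard deviation of the training samples' own k-NN distances). The function names, the `z` parameter, and the grid data are all hypothetical.

```python
import math

def mean_knn_distance(point, training, k):
    """Mean Euclidean distance from point to its k nearest training samples."""
    distances = sorted(math.dist(point, t) for t in training)
    return sum(distances[:k]) / k

def fit_threshold(training, k, z=0.5):
    """Threshold = mean + z * std of each training sample's k-NN distance."""
    scores = []
    for i, p in enumerate(training):
        others = training[:i] + training[i + 1:]  # leave the sample itself out
        scores.append(mean_knn_distance(p, others, k))
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return mean + z * std

def in_domain(point, training, k, threshold):
    """True if the query is close enough to the training data."""
    return mean_knn_distance(point, training, k) <= threshold

# Hypothetical training set: a 5 x 5 grid in a two-attribute space.
training = [(float(x), float(y)) for x in range(5) for y in range(5)]
threshold = fit_threshold(training, k=3)

inside = in_domain((2.5, 2.5), training, k=3, threshold=threshold)
outside = in_domain((10.0, 10.0), training, k=3, threshold=threshold)
print(inside, outside)
```

A prediction for a query flagged as outside the domain would be treated as unreliable rather than reported with the same confidence as in-domain predictions.
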