Marianna Kotzabasaki, Iason Sotiropoulos, Costas Charitidis, Haralambos Sarimveis.
Abstract
Multi-walled carbon nanotubes (MWCNTs) consist of multiple single-walled carbon nanotubes (SWCNTs) nested inside one another to form concentric cylinders. These nanomaterials are widely used in industrial and biomedical applications due to their unique physicochemical characteristics. However, previous studies have shown that exposure to MWCNTs may lead to toxicity, and that some of their physicochemical properties can influence their toxicological profiles. In silico modelling can be applied as a faster and less costly alternative to experimental (in vivo and in vitro) testing for the hazard characterization of MWCNTs. This study aims to develop a fully validated predictive nanoinformatics model, based on statistical and machine learning approaches, for the accurate prediction of the genotoxicity of different types of MWCNTs. Towards this goal, several computational workflows were designed, combining an unsupervised technique (Principal Component Analysis, "PCA"), supervised classification techniques (Support Vector Machine, "SVM"; Random Forest, "RF"; Logistic Regression, "LR"; and Naïve Bayes, "NB") and Bayesian optimization. The Recursive Feature Elimination (RFE) method was applied to select the most important variables. An RF model using only three features was selected as the most efficient for predicting the genotoxicity of MWCNTs, exhibiting 80% accuracy on external validation and high classification probabilities. The most informative features selected by the model were "Length", "Zeta average" and "Purity". This journal is © The Royal Society of Chemistry.
Year: 2021 PMID: 36133654 PMCID: PMC9417168 DOI: 10.1039/d0na00600a
Source DB: PubMed Journal: Nanoscale Adv ISSN: 2516-0230
Fig. 1 Workflow of data collection, pre-processing, model development, validation and analysis.
The characteristics (features) of MWCNTs used in QSAR modelling
| Features |
|---|
| Carbon purity (%) |
| Minimum length (nm) |
| Maximum length (nm) |
| Average length (nm) |
| Minimum diameter (nm) |
| Maximum diameter (nm) |
| Average diameter (nm) |
| Specific surface area (SSA) measured by BET (m2 g−1) |
| % Impurities (Fe2O3, CoO, NiO, MgO, MnO) |
| Concentration of endotoxins (Eu ml−1) |
| Combustion elemental analysis (CEA), C, H, N, O (wt%) |
| Surface coatings OH, COOH, NH2 (mmol g−1) |
| Zeta average ( |
| Polydispersity index (PdI) batch and PdI at 12.5 and 200 μg ml−1 (ref. |
| Reactive oxygen species (ROS) and respective peak concentrations (μg ml−1) |
Overview of the MWCNTs dataset. “Diameter” and “Length” are average values measured in nanometers (nm), “Endotoxins” is given in EU mg−1, “BET” in m2 g−1, and “Impurity”, “Purity” and “CEA” in percentages (%). “Impurity” was calculated as the percentage of total impurities. “Genotoxicity” is the binary endpoint indicating whether an MWCNT was considered genotoxic (value “1”) or non-genotoxic (value “0”)
| Group | Code | Type | Length | Diameter | BET | Impurity | Purity | CEA | Endotoxins | Genotoxicity |
|---|---|---|---|---|---|---|---|---|---|---|
| Group I | NRCWE-040 | Pristine | 518.9 | 22.1 | 150 | 0.773 | 98.60 | 96.00 | 0.18 | 0 |
| Group I | NRCWE-041 | –OH | 1005.0 | 26.9 | 152 | 0.462 | 99.20 | 97.00 | 0.22 | 0 |
| Group I | NRCWE-042 | –COOH | 723.2 | 30.2 | 141 | 0.321 | 99.20 | 96.00 | 0.26 | 0 |
| Group II | NRCWE-043 | Pristine | 771.3 | 55.6 | 82 | 1.219 | 98.50 | 96.00 | 0.25 | 0 |
| Group II | NRCWE-044 | –OH | 1330.0 | 32.7 | 74 | 1.006 | 98.60 | 97.00 | 0.27 | 1 |
| Group II | NRCWE-045 | –COOH | 1553.0 | 30.2 | 119 | 2.782 | 96.30 | 93.00 | 0.34 | 1 |
| Group III | NRCWE-046 | Pristine | 717.2 | 29.1 | 223 | 0.783 | 98.70 | 96.00 | 0.19 | 0 |
| Group III | NRCWE-047 | –OH | 532.5 | 22.6 | 216 | 0.781 | 98.70 | 97.00 | 0.01 | 0 |
| Group III | NRCWE-048 | –COOH | 1604.0 | 17.9 | 185 | 0.721 | 98.80 | 96.00 | 0.03 | 0 |
| Group III | NRCWE-049 | –NH2 | 731.4 | 14.9 | 199 | 0.738 | 98.80 | 97.00 | 0.05 | 0 |
| Standard Materials | NM-400 | Pristine | 847.0 | 11.0 | 254 | 0.368 | 90.00 | 88.00 | 0.24 | 1 |
| Standard Materials | NM-401 | Pristine | 4048.0 | 67.0 | 18 | 0.065 | 99.19 | 98.00 | 0.42 | 0 |
| Standard Materials | NM-402 | Pristine | 1372.0 | 11.0 | 226 | 1.313 | 92.97 | 92.00 | 0.01 | 1 |
| Standard Materials | NM-403 | Pristine | 1373.0 | 13.0 | 227 | 5.313 | 90.00 | 97.00 | 0.01 | 1 |
| Standard Materials | NRCWE-006 | Pristine | 5700.0 | 65.0 | 26 | 0.680 | 99.00 | 98.00 | 0.51 | 1 |
Hyperparameters that were tuned by the Bayesian optimization method
| Model | Hyperparameter | Description |
|---|---|---|
| RF | n_estimators | Number of trees in the forest |
| RF | min_samples_split | Minimum number of observations required to split a node |
| RF | max_features | Maximum number of features to consider when looking for the best split of a node |
| LR | C | Inverse of regularization strength |
| LR | Penalty | Norm used in the penalization |
| SVM | C | Regularization parameter |
| SVM | Gamma | Kernel coefficient |
| SVM | Kernel | Kernel type |
Fig. 2 Diagram representing the percentage of the explained variance as a function of the principal components.
The optimal hyperparameter values that were extracted from the Bayesian optimization technique,[40] for the models that were trained on the reduced PCA-dataset
| Model | Hyperparameter | Optimal values |
|---|---|---|
| RF | n_estimators | 9 |
| RF | min_samples_split | 0.1025 |
| RF | max_features | 0.6983 |
| LR | C | 43.71 |
| LR | Penalty | L2 |
| SVM | C | 4.37 |
| SVM | Gamma | – |
| SVM | Kernel | Linear |
Validation metrics for the models that were trained on the reduced PCA-dataset
| Metric | RF | LR | SVM | NB |
|---|---|---|---|---|
| Accuracy | 0.800 | 0.800 | 0.800 | 0.600 |
| Precision | 0.000 | 0.500 | 0.500 | 0.000 |
| Sensitivity | 0.000 | 1.000 | 1.000 | 0.000 |
| Specificity | 1.000 | 0.750 | 0.750 | 0.750 |
| F1-score | 0.000 | 0.667 | 0.667 | 0.000 |
| MCC | 0.000 | 0.612 | 0.612 | −0.250 |
| Cross-validation | 0.833 | 0.542 | 0.708 | 0.625 |
| Confusion matrix, row 1 | 4, 0 | 3, 1 | 3, 1 | 3, 1 |
| Confusion matrix, row 2 | 1, 0 | 0, 1 | 0, 1 | 1, 0 |
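The metrics in this table can be recomputed from the confusion matrices alone. The sketch below is a minimal illustration, not the authors' code. It assumes each matrix is laid out as [[TN, FP], [FN, TP]] (rows are the actual class, columns the predicted class) and that undefined ratios are reported as 0; both are assumptions, although they reproduce the tabulated values. The helper name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tn, fp, fn, tp):
    """Return the table's metrics from the four confusion-matrix counts."""

    def ratio(num, den):
        # Assumed convention: an undefined ratio (zero denominator) is 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tn + fp + fn + tp),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Confusion matrices copied from the table, read as [[TN, FP], [FN, TP]]
# (assumed layout).
matrices = {
    "RF": [[4, 0], [1, 0]],
    "LR": [[3, 1], [0, 1]],
    "SVM": [[3, 1], [0, 1]],
    "NB": [[3, 1], [1, 0]],
}

results = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    results[name] = binary_metrics(tn, fp, fn, tp)
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

With only five test samples, a single misclassification moves accuracy by 0.2, which is why several models share identical scores.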
The optimal hyperparameter values that were extracted from the Bayesian optimization technique,[40] for the models that were trained on the full dataset
| Model | Hyperparameter | Optimal values |
|---|---|---|
| RF | n_estimators | 10 |
| RF | min_samples_split | 0.5 |
| RF | max_features | 0.1 |
| LR | C | 43.71 |
| LR | Penalty | L2 |
| SVM | C | 9.98 |
| SVM | Gamma | 0.254 |
| SVM | Kernel | Sigmoid |
Validation metrics for the models that were trained on the full dataset
| Metric | RF | LR | SVM | NB |
|---|---|---|---|---|
| Accuracy | 0.600 | 0.800 | 0.800 | 0.600 |
| Precision | 0.333 | 0.500 | 0.333 | 0.333 |
| Sensitivity | 1.000 | 1.000 | 1.000 | 0.500 |
| Specificity | 0.500 | 0.500 | 0.750 | 0.500 |
| F1-score | 0.500 | 0.667 | 0.500 | 0.500 |
| MCC | 0.666 | 0.612 | 0.612 | 0.408 |
| Cross-validation | 0.708 | 0.458 | 0.917 | 0.458 |
| Confusion matrix, row 1 | 2, 2 | 3, 1 | 3, 1 | 2, 2 |
| Confusion matrix, row 2 | 0, 1 | 0, 1 | 0, 1 | 0, 1 |
Fig. 3 Scaled significance of features in the LR and RF models.
Most significant features of the LR and RF models after the application of the RFE method
| Features for the LR model | Features for the RF model |
|---|---|
| Zeta average at 12.5 μg ml−1 | Zeta average at 12.5 μg ml−1 |
| Length (average) | Length (average) |
| Polydispersity index (batch) | Purity (%) |
| Purity (%) | — |
The optimal hyperparameter values that were extracted from the Bayesian optimization method,[40] for the LR and RF models that were trained on the four most significant features
| Model | Hyperparameter | Optimal values |
|---|---|---|
| LR | C | 4.37 |
| LR | Penalty | L2 |
| RF | n_estimators | 19 |
| RF | min_samples_split | 0.116 |
| RF | max_features | 0.666 |
Validation metrics for the models that were trained on the most significant features
| Metric | RF | LR |
|---|---|---|
| Accuracy | 0.800 | 0.800 |
| Precision | 0.500 | 0.500 |
| Sensitivity | 1.000 | 1.000 |
| Specificity | 0.750 | 0.750 |
| F1-score | 0.666 | 0.666 |
| MCC | 0.612 | 0.612 |
| Cross-validation | 0.917 | 1.000 |
| Confusion matrix, row 1 | 3, 1 | 3, 1 |
| Confusion matrix, row 2 | 0, 1 | 0, 1 |
Classification probabilities of the RF and LR models after applying the RFE method,[41] for the testing samples
| MWCNT | RF: prob. of “0” class | RF: prob. of “1” class | LR: prob. of “0” class | LR: prob. of “1” class |
|---|---|---|---|---|
| NRCWE-040 | 0.74 | 0.26 | 0.74 | 0.26 |
| NRCWE-041 | 0.74 | 0.26 | 0.82 | 0.18 |
| NRCWE-048 | 0.68 | 0.32 | 0.73 | 0.27 |
| NM-401 | 0.31 | 0.69 | 0.36 | 0.64 |
| NM-402 | 0.00 | 1.00 | 0.31 | 0.69 |
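These probabilities can be cross-checked against the confusion matrices reported for the models trained on the most significant features. The sketch below is illustrative only. It assumes a sample is assigned to class “1” when its class-“1” probability is at least 0.5 (the threshold is not stated in the source), takes the true labels from the dataset overview table, and spells sample codes as in that table. The function name `confusion_from_probabilities` is hypothetical.

```python
# Probability of class "1" (genotoxic) for each testing sample, from the table.
prob_genotoxic = {
    "RF": {"NRCWE-040": 0.26, "NRCWE-041": 0.26, "NRCWE-048": 0.32,
           "NM-401": 0.69, "NM-402": 1.00},
    "LR": {"NRCWE-040": 0.26, "NRCWE-041": 0.18, "NRCWE-048": 0.27,
           "NM-401": 0.64, "NM-402": 0.69},
}

# Experimental genotoxicity labels from the dataset overview table.
true_label = {"NRCWE-040": 0, "NRCWE-041": 0, "NRCWE-048": 0,
              "NM-401": 0, "NM-402": 1}


def confusion_from_probabilities(probs, labels, threshold=0.5):
    """Build [[TN, FP], [FN, TP]] by thresholding class-1 probabilities.

    The 0.5 threshold is an assumption, not a value given in the source.
    """
    matrix = [[0, 0], [0, 0]]
    for code, p in probs.items():
        predicted = 1 if p >= threshold else 0
        matrix[labels[code]][predicted] += 1  # row: actual, column: predicted
    return matrix


rf_matrix = confusion_from_probabilities(prob_genotoxic["RF"], true_label)
lr_matrix = confusion_from_probabilities(prob_genotoxic["LR"], true_label)
print("RF:", rf_matrix)
print("LR:", lr_matrix)
```

Under these assumptions both models misclassify only NM-401, which is labelled non-genotoxic in the dataset but receives a class-“1” probability above 0.5.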
Leverage values of the testing samples
| Name | RF: leverage value | RF: reliability | LR: leverage value | LR: reliability |
|---|---|---|---|---|
| NRCWE-040 | 0.20 | Reliable | 0.20 | Reliable |
| NRCWE-041 | 0.18 | Reliable | 0.36 | Reliable |
| NRCWE-048 | 0.33 | Reliable | 0.43 | Reliable |
| NM-401 | 0.48 | Reliable | 0.51 | Reliable |
| NM-402 | 0.13 | Reliable | 0.19 | Reliable |
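For readers unfamiliar with leverage-based applicability domains, the following sketch shows how such a value is commonly computed: h = xᵀ(XᵀX)⁻¹x, where X is the training descriptor matrix and x the query sample. It is not the authors' implementation. The training matrix is toy data, the cut-off h* = 3p/n is a commonly used convention assumed here (the source does not state its threshold), and the function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def leverage(x_train, x):
    """Leverage of sample x with respect to the training matrix x_train."""
    p = len(x)
    xtx = [[sum(row[i] * row[j] for row in x_train) for j in range(p)]
           for i in range(p)]
    z = solve(xtx, x)  # z = (X^T X)^-1 x
    return sum(xi * zi for xi, zi in zip(x, z))


# Toy training descriptors (NOT the MWCNT data): n = 3 samples, p = 2 features.
x_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_star = 3 * len(x_train[0]) / len(x_train)  # assumed cut-off h* = 3p/n

train_leverages = [leverage(x_train, row) for row in x_train]
h_inside = leverage(x_train, [0.5, 0.5])
h_outside = leverage(x_train, [3.0, -3.0])

for name, h in [("inside", h_inside), ("outside", h_outside)]:
    print(name, round(h, 3), "Reliable" if h <= h_star else "Unreliable")
```

A prediction is flagged as reliable when the query's leverage does not exceed the cut-off, which is the sense in which the table labels every testing sample "Reliable".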