| Literature DB >> 35323641 |
Nikiforos Alygizakis1,2, Vasileios Konstantakos3, Grigoris Bouziotopoulos4, Evangelos Kormentzas5, Jaroslav Slobodnik2, Nikolaos S Thomaidis1,2.
Abstract
Liquid chromatography-high resolution mass spectrometry (LC-HRMS) and gas chromatography-high resolution mass spectrometry (GC-HRMS) have revolutionized analytical chemistry, among many other disciplines. These advanced instruments make it possible, in theory, to capture the entire chemical universe contained in a sample, offering unprecedented opportunities to the scientific community. Laboratories equipped with such instruments produce large volumes of data daily, which can be digitally archived. Digital storage of data opens up the opportunity for retrospective suspect screening of the stored chromatograms for the occurrence of chemicals. The first step of this approach is predicting which data are most appropriate to search. In this study, we built an optimized multi-label classifier for predicting the most appropriate instrumental method (LC-HRMS, GC-HRMS, or both) for the analysis of chemicals in digital specimens. The approach involved generating a baseline model encoding the knowledge that an expert would use and generating an optimized machine learning model. A multi-step feature selection approach, a model selection strategy, and optimization of the classifier's hyperparameters led to a model whose accuracy outperformed the baseline implementation. The models were used to predict the most appropriate instrumental technique for new substances. The scripts are available on GitHub and the dataset on Zenodo.
Keywords: contaminants of emerging concern; gas chromatography; liquid chromatography; retrospective suspect screening
Year: 2022 PMID: 35323641 PMCID: PMC8949148 DOI: 10.3390/metabo12030199
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
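The multi-label setup described in the abstract (one binary "amenable" label per instrumental technique) can be sketched with scikit-learn. This is an illustrative sketch, not the authors' exact pipeline: the descriptor matrix and labels below are synthetic placeholders.

```python
# Minimal sketch of a multi-label (GC-amenable, LC-amenable) classifier.
# Features and labels are synthetic, not the paper's molecular descriptors.
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))           # e.g. 8 selected descriptors per compound
y = np.column_stack([                   # two binary labels: GC-amenable, LC-amenable
    (X[:, 0] > 0).astype(int),
    (X[:, 1] > 0).astype(int),
])

# One decision tree per label; a single multi-output tree would also work.
clf = MultiOutputClassifier(DecisionTreeClassifier(max_depth=5, random_state=0))
clf.fit(X, y)
pred = clf.predict(X[:5])               # one GC and one LC prediction per compound
print(pred.shape)
```

A single `DecisionTreeClassifier` also accepts a 2D `y` directly; `MultiOutputClassifier` is used here only to make the one-model-per-label structure explicit.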
Performance comparison of feature-selection strategies over ten repetitions of 10-fold cross-validation (accuracy, % ± SD). The number of features in each set is given in parentheses. The best performance in each 10-fold CV is highlighted in bold.
| | Initial (1446) | Variance (1074) | Correlation (439) | RF Importance (64) | RFECV (57) | Final (8) |
|---|---|---|---|---|---|---|
| 1st 10-Fold | 80.06 ± 1.49 | 80.25 ± 2.17 | 80.18 ± 1.67 | 80.34 ± 1.05 | 80.2 ± 0.89 | |
| 2nd 10-Fold | 80.67 ± 2.08 | 80.26 ± 1.87 | 80.56 ± 2.08 | 80.2 ± 3.01 | 80.58 ± 2.7 | |
| 3rd 10-Fold | 80.81 ± 1.17 | 80.64 ± 1.83 | 79.48 ± 0.96 | 80.54 ± 2.04 | 80.33 ± 2.34 | |
| 4th 10-Fold | 80.14 ± 1.67 | 79.64 ± 1.51 | 80.35 ± 1.22 | 80.12 ± 1.52 | 80.52 ± 1.7 | |
| 5th 10-Fold | 80.66 ± 1.18 | 81.17 ± 1.23 | 79.55 ± 1.33 | 81.13 ± 1.28 | 80.98 ± 1.59 | |
| 6th 10-Fold | 80.41 ± 1.12 | 80.13 ± 1.22 | 80.82 ± 1.59 | 81.15 ± 1.3 | 80.64 ± 1.05 | |
| 7th 10-Fold | 80.92 ± 1.61 | 80.67 ± 1.73 | 80.12 ± 1.55 | 80.85 ± 1.59 | 81.18 ± 1.19 | |
| 8th 10-Fold | 80.63 ± 1.53 | 80.95 ± 0.87 | 80.58 ± 1.61 | 80.61 ± 2.34 | 81.01 ± 1.6 | |
| 9th 10-Fold | 80.25 ± 1.43 | 80.43 ± 1.44 | 79.86 ± 1.11 | 79.76 ± 1.1 | 81.04 ± 0.96 | |
| 10th 10-Fold | 80.54 ± 1.08 | 80.24 ± 1.74 | 80.52 ± 0.92 | 80.68 ± 1.29 | 80.68 ± 1.32 | |
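The "10-time 10-fold" evaluation summarized above can be reproduced in outline with scikit-learn's `RepeatedKFold`: run 10-fold CV ten times with different shuffles and report mean ± SD per repetition. The data and model below are synthetic stand-ins for the paper's descriptors and classifier.

```python
# Sketch of repeated 10-fold cross-validation (10 repeats x 10 folds).
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)   # synthetic label

cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# RepeatedKFold yields folds repeat by repeat, so each row is one repetition.
per_repeat = scores.reshape(10, 10)
for i, rep in enumerate(per_repeat, 1):
    print(f"repetition {i}: {rep.mean() * 100:.2f} ± {rep.std() * 100:.2f}")
```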
The p-values for each pairwise comparison using the Nemenyi post hoc test. The number of features in each set is given in parentheses. Statistically significant differences (p < 0.05) are highlighted in bold.
| | Initial (1446) | Variance (1074) | Correlation (439) | RF Importance (64) | RFECV (57) | Final (8) |
|---|---|---|---|---|---|---|
| Initial | 1.000 | 0.658 | | 0.386 | 0.636 | 0.458 |
| Variance | 0.658 | 1.000 | 0.679 | 0.900 | 0.900 | |
| Correlation | | 0.679 | 1.000 | 0.900 | 0.701 | |
| RF Importance | 0.386 | 0.900 | 0.900 | 1.000 | 0.900 | |
| RFECV | 0.636 | 0.900 | 0.701 | 0.900 | 1.000 | |
| Final | 0.458 | | | | | 1.000 |
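Pairwise Nemenyi comparisons are normally preceded by a Friedman omnibus test across the repeated-CV accuracies of all feature sets. A minimal sketch with SciPy, using synthetic per-repetition accuracy columns (not the paper's values):

```python
# Friedman omnibus test across three feature-set accuracy columns
# (10 CV repetitions each). All values here are synthetic.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(2)
base = 80 + rng.normal(scale=0.5, size=10)             # 10 repetitions
initial = base
variance = base + rng.normal(scale=0.3, size=10)
rfecv = base + rng.normal(scale=0.3, size=10)

stat, p = friedmanchisquare(initial, variance, rfecv)
print(f"Friedman chi2 = {stat:.3f}, p = {p:.3f}")
# If p < 0.05, pairwise Nemenyi p-values can then be obtained, e.g. with the
# third-party scikit-posthocs package (posthoc_nemenyi_friedman).
```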
Classification report of the rule-based classifier (%).
| | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| GC class | 78.79 | 85.57 | 82.04 | 71.56 |
| LC class | 61.87 | 95.84 | 75.2 | 60.99 |
| Micro average | 69.71 | 90.18 | 78.63 | |
| Macro average | 70.33 | 90.71 | 78.62 | |
| Weighted average | 71.21 | 90.18 | 78.97 | |
| Samples average | 69.77 | 90.13 | 75.34 | |
Classification report of the decision tree classifier (%).
| | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| GC class | 82.25 | 88.23 | 85.14 | 76.61 |
| LC class | 76.44 | 95.59 | 84.95 | 79.10 |
| Micro average | 79.42 | 91.53 | 85.05 | |
| Macro average | 79.34 | 91.91 | 85.04 | |
| Weighted average | 79.64 | 91.53 | 85.05 | |
| Samples average | 81.86 | 92.35 | 84.02 | |
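Per-class metrics and the micro/macro/weighted/samples averages reported above can be generated for a multi-label problem with scikit-learn's `classification_report`. The labels below are synthetic, so the numbers will not match the paper's.

```python
# Multi-label classification report: two binary labels (GC, LC).
# y_true/y_pred are synthetic; y_pred agrees with y_true ~80% of the time.
import numpy as np
from sklearn.metrics import classification_report

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=(100, 2))
y_pred = np.where(rng.random((100, 2)) < 0.8, y_true, 1 - y_true)

report = classification_report(
    y_true, y_pred,
    target_names=["GC class", "LC class"],
    output_dict=True,       # dict form; omit for a printable text table
    zero_division=0,
)
print(report["micro avg"]["f1-score"], report["samples avg"]["recall"])
```

With multi-label indicator input, `classification_report` emits the same four averages as the tables above (micro, macro, weighted, samples) instead of a single accuracy row.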
Figure 1. Confusion matrix of the rule-based classifier. “Y” stands for “Yes”, indicating that a compound is amenable, while “N” stands for “No”, indicating that a compound is not amenable.
Figure 2. Confusion matrix of the decision tree classifier. “Y” stands for “Yes”, indicating that a compound is amenable, while “N” stands for “No”, indicating that a compound is not amenable.
Figure 3. Receiver operating characteristic (ROC) curve analysis.
Figure 4. Aggregated results of LC-MS and GC-MS amenability.
Figure 5. RFECV score vs. number of features.
Figure 6. Validation curve for the decision tree classifier.