David Chardin1,2, Olivier Humbert1,2, Caroline Bailleux1,3, Fanny Burel-Vandenbos4, Valerie Rigau5,6, Thierry Pourcher1, Michel Barlaud7.
Abstract
BACKGROUND: Supervised classification methods have been used for many years for feature selection in metabolomics and other omics studies. We developed a novel primal-dual based classification method (PD-CR) that can perform classification with rejection and feature selection on high-dimensional datasets. PD-CR projects data onto a low-dimensional space and performs classification by minimizing an appropriate quadratic cost. It simultaneously optimizes the selected features and the prediction accuracy with a new tailored, constrained primal-dual method. The primal-dual framework is general enough to encompass various robust losses and to allow for convergence analysis. Here, we compare PD-CR to three commonly used methods: partial least squares discriminant analysis (PLS-DA), random forests and support vector machines (SVM). We analyzed two metabolomics datasets: a urinary metabolomics dataset from lung cancer patients and healthy controls, and a metabolomics dataset obtained from frozen glial tumor samples with mutated or wild-type isocitrate dehydrogenase (IDH).
Year: 2021 PMID: 34911437 PMCID: PMC8672607 DOI: 10.1186/s12859-021-04478-w
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Overview of the datasets
| Dataset | No. of samples | No. of features | Sample type |
|---|---|---|---|
| LUNG | 1005 | 2944 | Urine |
| BRAIN | 88 | 25,286 | Glial tumor tissue |
LUNG dataset: mean accuracy using 3 seeds and 4-fold cross validation: comparison with PLS-DA, random forests and best SVM
| LUNG | PD-CR | PD-CR | PLS-DA | RF (100 trees) | RF (400 trees) | SVM |
|---|---|---|---|---|---|---|
| Accuracy | 79.44 | 78.3 | 76.56 | 71.31 | 72.44 | 76.25 |
| AUC | 79.97 | – | 74.05 | 73.38 | 74.50 | 76.64 |
| Time (s) | 0.11 | 0.11 | 0.09 | 0.89 | 3.47 | 85.6 |
Fig. 1 Distribution of the confidence score for the prediction (CSP) on the LUNG dataset and impact of using CSP for classification with rejection on the false discovery rate (FDR). From left to right and top to bottom: histogram of the CSP; kernel density estimation; FDR as a function of CSP after classification with rejection; rate of rejected samples as a function of CSP after classification with rejection. As expected for a pertinent confidence score, the FDR diminishes when a higher CSP threshold is used for classification with rejection.
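The two lower panels described in this caption can be illustrated with a minimal sketch. It assumes that samples whose confidence score falls below the threshold are rejected, and that the FDR is computed as FP / (TP + FP) among the accepted samples predicted as Label 2; the record does not state the exact definition used by the authors. The function name `rejection_curve` and the toy data are hypothetical.

```python
def rejection_curve(scores, y_true, y_pred, thresholds, positive=2):
    """Return (threshold, fdr, rejection_rate) for each threshold.

    Samples with a score below the threshold are rejected (left unclassified).
    Assumption: FDR = FP / (TP + FP) among accepted samples predicted positive.
    """
    n = len(scores)
    curve = []
    for t in thresholds:
        kept = [(yt, yp) for s, yt, yp in zip(scores, y_true, y_pred) if s >= t]
        tp = sum(1 for yt, yp in kept if yp == positive and yt == positive)
        fp = sum(1 for yt, yp in kept if yp == positive and yt != positive)
        fdr = fp / (tp + fp) if (tp + fp) else 0.0
        curve.append((t, fdr, 1 - len(kept) / n))
    return curve


# Toy data invented for illustration; not taken from the study.
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30, 0.20, 0.10]
y_true = [2, 1, 2, 1, 2, 1, 1, 2]
y_pred = [2, 1, 2, 1, 2, 2, 2, 1]

curve = rejection_curve(scores, y_true, y_pred, thresholds=[0.0, 0.25, 0.5])
for threshold, fdr, rejected in curve:
    print(f"threshold={threshold:.2f}  FDR={fdr:.2f}  rejected={rejected:.3f}")
```

On this toy input the FDR falls as the threshold rises while the share of rejected samples grows, which is the trade-off the caption describes.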
Top 10 features selected by random forests, PLS-DA, PD-CR and SVM in the LUNG dataset
| RF | PLS-DA | PD-CR | SVM |
|---|---|---|---|
| MZ 264.1215224 | MZ 264.1215224 | MZ 264.1215224 | MZ 264.1215224 |
| MZ 656.2017529 | MZ 126.9069343 | MZ 308.0984878 | MZ 308.0984878 |
| MZ 441.1613664 | MZ 170.0605916 | MZ 126.9069343 | MZ 247.0970455 |
| MZ 584.2670695 | MZ 613.3595637 | MZ 613.3595637 | MZ 613.3595637 |
| MZ 247.0970455 | MZ 243.1004849 | MZ 243.1004849 | MZ 615.0353192 |
| MZ 486.2571336 | MZ 486.2571336 | MZ 247.0970455 | MZ 372.9232556 |
| MZ 308.0984878 | MZ 308.0984878 | MZ 332.0963401 | MZ 441.1613664 |
| MZ 204.1345526 | MZ 561.3432022 | MZ 441.1613664 | MZ 370.0525988 |
| MZ 247.1384435 | MZ 94.06574518 | MZ 94.06574518 | MZ 423.0084949 |
| MZ 447.10803 | MZ 269.1280232 | MZ 561.3432022 | MZ 332.0963401 |
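One way to read this table is to check which features the four methods agree on. The sketch below copies the m/z values from the table and computes set overlaps; the variable names are illustrative, and exact string matching is an assumption that only works because the table reports identical feature labels across methods.

```python
# m/z values copied from the LUNG top-10 table above, one list per method.
top10 = {
    "RF": """264.1215224 656.2017529 441.1613664 584.2670695 247.0970455
             486.2571336 308.0984878 204.1345526 247.1384435 447.10803""".split(),
    "PLS-DA": """264.1215224 126.9069343 170.0605916 613.3595637 243.1004849
                 486.2571336 308.0984878 561.3432022 94.06574518 269.1280232""".split(),
    "PD-CR": """264.1215224 308.0984878 126.9069343 613.3595637 243.1004849
                247.0970455 332.0963401 441.1613664 94.06574518 561.3432022""".split(),
    "SVM": """264.1215224 308.0984878 247.0970455 613.3595637 615.0353192
              372.9232556 441.1613664 370.0525988 423.0084949 332.0963401""".split(),
}

# Features ranked in the top 10 by every method.
common = set.intersection(*(set(features) for features in top10.values()))

# Number of top-10 features PD-CR shares with each other method.
shared_with_pdcr = {
    method: len(set(top10["PD-CR"]) & set(features))
    for method, features in top10.items()
    if method != "PD-CR"
}

print(sorted(common))
print(shared_with_pdcr)
```

With the values as listed, the only features selected by all four methods are MZ 264.1215224 and MZ 308.0984878.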
Fig. 2 Boxplots of the relative abundances of features MZ 264.1215224 and MZ 308.0984878 of the LUNG dataset, most likely corresponding to creatine riboside and N-acetylneuraminic acid respectively. Fold changes: 2.57 and 1.43 respectively. Label 1 indicates urine samples of patients without lung cancer. Label 2 indicates urine samples of patients with lung cancer.
BRAIN dataset: accuracy using 3 seeds and 4-fold cross validation: comparison with PLS-DA, random forests and best SVM
| BRAIN | PD-CR | PD-CR | PLS-DA | RF (100 trees) | RF (400 trees) | SVM |
|---|---|---|---|---|---|---|
| Accuracy | 92.04 | 90.9 | 84.09 | 88.63 | 89.39 | 87.78 |
| AUC | 92.08 | – | 84.33 | 88.70 | 89.02 | 88.53 |
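The evaluation protocol named in the table title (4-fold cross validation repeated with 3 seeds, then averaged) can be sketched as follows. This is only an outline under assumptions: the folds are shuffled but not stratified, the seeds 0, 1 and 2 are arbitrary, and `nearest_centroid` is a hypothetical stand-in classifier on one-dimensional toy data, not PD-CR or any of the compared methods.

```python
import random


def kfold_indices(n, n_folds, seed):
    """Shuffle 0..n-1 with the given seed and deal the indices into n_folds folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def repeated_cv_accuracy(X, y, fit_predict, n_folds=4, seeds=(0, 1, 2)):
    """Mean accuracy (in %) over every fold of every seed."""
    accuracies = []
    for seed in seeds:
        for test_idx in kfold_indices(len(X), n_folds, seed):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(X)) if i not in held_out]
            preds = fit_predict(
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in test_idx],
            )
            hits = sum(p == y[i] for p, i in zip(preds, test_idx))
            accuracies.append(hits / len(test_idx))
    return 100 * sum(accuracies) / len(accuracies)


def nearest_centroid(X_train, y_train, X_test):
    """Hypothetical stand-in classifier: assign each value to the closest class mean."""
    centroids = {}
    for label in set(y_train):
        values = [x for x, l in zip(X_train, y_train) if l == label]
        centroids[label] = sum(values) / len(values)
    return [min(centroids, key=lambda c: abs(x - centroids[c])) for x in X_test]


# Toy, perfectly separable data invented for illustration.
X = [0.1, 0.2, 0.3, 0.4, 0.0, 0.25, 5.1, 5.2, 5.3, 5.4, 5.0, 5.25]
y = [1] * 6 + [2] * 6
mean_acc = repeated_cv_accuracy(X, y, nearest_centroid)
print(f"mean accuracy over 3 seeds x 4 folds: {mean_acc:.2f}%")
```

Averaging over several seeds reduces the dependence of the reported accuracy on one particular split, which matters for a dataset as small as BRAIN (88 samples).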
Fig. 3 Distribution of the confidence score for the prediction (CSP) on the BRAIN dataset and impact of using CSP for classification with rejection on the false discovery rate (FDR). From left to right and top to bottom: histogram of the CSP; kernel density estimation; FDR as a function of CSP after classification with rejection; rate of rejected samples as a function of CSP after classification with rejection. As expected for a pertinent confidence score, the FDR diminishes when a higher CSP threshold is used for classification with rejection.
Top 10 features selected by random forests, PLS-DA, PD-CR and SVM on the BRAIN dataset with 25,286 features
| Random forests | PLS-DA | PD-CR | SVM |
|---|---|---|---|
| NEG_MZ147.0867 | POS_MZ131.0342 | POS_MZ131.0342 | POS_MZ131.0342 |
| POS_MZ133.0384 | POS_MZ132.0375 | POS_MZ132.0375 | POS_MZ132.0375 |
| POS_MZ166.0713 | POS_MZ166.0713 | POS_MZ243.9903 | POS_MZ166.0713 |
| POS_MZ228.0182 | NEG_MZ147.0288 | POS_MZ166.0712 | NEG_MZ147.0288 |
| POS_MZ132.5234 | NEG_MZ148.0321 | NEG_MZ147.0288 | NEG_MZ148.0321 |
| POS_MZ173.0306 | NEG_MZ149.0329 | NEG_MZ148.0321 | POS_MZ171.0265 |
| POS_MZ219.0082 | POS_MZ171.0265 | POS_MZ123.5181 | POS_MZ132.0375 |
| NEG_MZ215.0168 | POS_MZ132.0375 | POS_MZ171.0265 | POS_MZ247.9616 |
| POS_MZ171.0265 | POS_MZ243.9903 | NEG_MZ149.0329 | POS_MZ243.9903 |
| POS_MZ319.0510 | POS_MZ123.5181 | POS_MZ133.0384 | NEG_MZ149.0329 |
Fig. 4 Boxplots of the relative abundances of features POS_131.0342, POS_132.0375, POS_243.9903 and POS_166.0712 of the BRAIN dataset, most likely corresponding to different adducts of 2-hydroxyglutarate. Fold changes: 32.9, 35.6, 14.6 and 33.7 respectively. Label 1: samples of tumors with wild-type IDH. Label 2: samples of tumors with mutated IDH.
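For readers unfamiliar with the fold changes quoted in the caption, a minimal sketch follows. It assumes the fold change is the mean relative abundance in Label 2 samples divided by the mean in Label 1 samples; the record does not say whether means or medians were used, so this is an assumption, and the function name and toy values are hypothetical.

```python
def fold_change(abundances, labels, case=2, control=1):
    """Ratio of the mean abundance in the case group to that in the control group.

    Assumption: group means are used; the source does not specify the statistic.
    """
    case_values = [a for a, l in zip(abundances, labels) if l == case]
    control_values = [a for a, l in zip(abundances, labels) if l == control]
    case_mean = sum(case_values) / len(case_values)
    control_mean = sum(control_values) / len(control_values)
    return case_mean / control_mean


# Toy abundances invented for illustration; not taken from the study.
abundances = [1.0, 3.0, 6.0, 10.0]
labels = [1, 1, 2, 2]
fc = fold_change(abundances, labels)
print(f"fold change (Label 2 / Label 1): {fc:.2f}")
```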
Mean accuracy using 4-fold cross validation with 3 different seeds: comparison of methods on the highly filtered BRAIN dataset
| | PD-CR | PD-CR | PLS-DA | Random Forests | SVM |
|---|---|---|---|---|---|
| Accuracy | 94.31 | 92.8 | 93.18 | 92.04 | 89.20 |
Top 10 features selected by PD-CR in the highly and minimally filtered versions of the BRAIN dataset
| Identified (495 features) | Large (25,287 features) |
|---|---|
| POS_M131.0342 | POS_MZ131.0342 |
| NEG_M147.02882 | POS_MZ132.0375 |
| POS_M85.0291 | POS_MZ243.9903 |
| POS_M149.0450 | POS_MZ166.0713 |
| NEG_M112.0220 | NEG_MZ147.0288 |
| POS_M154.0864 | NEG_MZ148.0320 |
| NEG_M171.0847 | POS_MZ123.518 |
| NEG_M320.0627 | POS_MZ171.0265 |
| POS_M113.0350 | NEG_MZ149.0329 |
| POS_M147.1170 | POS_MZ133.0384 |