Yingxia Li, Ulrich Mansmann, Shangming Du, Roman Hornung.
Abstract
BACKGROUND: In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies have focused on feature selection methods for omics data, but to our knowledge, none has compared these methods for the special case of multi-omics data. Given that these data have specific structures that differentiate them from single-omics data, it is unclear whether different feature selection strategies may be optimal for such data. In this paper, using 15 cancer multi-omics datasets, we compared four filter methods, two embedded methods, and two wrapper methods with respect to their performance in the prediction of a binary outcome in several situations that may affect the prediction results. As classifiers, we used support vector machines and random forests. The methods were compared using repeated fivefold cross-validation. The accuracy, the AUC, and the Brier score served as performance metrics.
Keywords: Benchmark; Classification; Feature selection; Multi-omics data; TCGA
Year: 2022 PMID: 36199022 PMCID: PMC9533501 DOI: 10.1186/s12859-022-04962-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Prediction performance using RF after feature selection. Panels a and b show the distributions of the mean cross-validated AUC values across the datasets for all rank and subset evaluation methods, respectively. The p-values show the results of the Friedman tests
Fig. 2 Prediction performance using SVM after feature selection. Panels a and b show the distributions of the mean cross-validated AUC values across the datasets for all rank and subset evaluation methods, respectively. The p-values show the results of the Friedman tests
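
The Friedman tests mentioned in the captions of Figs. 1 and 2 compare several methods across the same datasets by ranking the methods within each dataset. The following Python sketch illustrates the test statistic only; it is not the code used in the study (which relied on R), it omits the tie correction, and the AUC values in `toy_auc` are hypothetical numbers chosen for illustration.

```python
from math import fsum


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(table):
    """table[d][m] is the mean cross-validated AUC of method m on dataset d.

    Returns (chi2, df) for the uncorrected Friedman statistic.
    """
    n = len(table)       # number of datasets (blocks)
    k = len(table[0])    # number of methods (treatments)
    rank_sums = [0.0] * k
    for row in table:
        for m, r in enumerate(average_ranks(row)):
            rank_sums[m] += r
    chi2 = 12.0 / (n * k * (k + 1)) * fsum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, k - 1


# Hypothetical AUC values: 4 datasets (rows) x 3 methods (columns).
toy_auc = [
    [0.70, 0.75, 0.80],
    [0.60, 0.65, 0.90],
    [0.71, 0.72, 0.73],
    [0.50, 0.60, 0.70],
]
chi2, df = friedman_statistic(toy_auc)
print(chi2, df)
```

The statistic is referred to a chi-squared distribution with `df` degrees of freedom to obtain a p-value. In the toy table every dataset orders the three methods identically, so the statistic reaches its maximum of n(k - 1).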
The best performing methods (according to the AUC) per setting
| nvar | selsep | clivar | Selector | AUC | Brier | accuracy |
|---|---|---|---|---|---|---|
| 10 | Yes | Yes | mRMR | 0.8299 | 0.1347 | 0.8217 |
| 10 | Yes | No | mRMR | 0.8266 | 0.1357 | 0.8189 |
| 10 | No | Yes | mRMR | 0.8263 | 0.1323 | 0.8281 |
| 10 | No | No | mRMR | 0.8247 | 0.1331 | 0.8261 |
| 100 | Yes | Yes | mRMR | 0.8405 | 0.1287 | 0.8359 |
| 100 | Yes | No | mRMR | 0.8406 | 0.1286 | 0.8363 |
| 100 | No | Yes | mRMR | 0.8345 | 0.1307 | 0.8311 |
| 100 | No | No | mRMR | 0.8354 | 0.1307 | 0.8290 |
| 1000 | Yes | Yes | mRMR | 0.8374 | 0.1342 | 0.8196 |
| 1000 | Yes | No | mRMR | 0.8376 | 0.1339 | 0.8200 |
| 1000 | No | Yes | mRMR | 0.8290 | 0.1364 | 0.8171 |
| 1000 | No | No | mRMR | 0.8274 | 0.1366 | 0.8172 |
| 5000 | Yes | Yes | mRMR | 0.8264 | 0.1383 | 0.8148 |
| 5000 | Yes | No | mRMR | 0.8260 | 0.1384 | 0.8128 |
| 5000 | No | Yes | mRMR | 0.8227 | 0.1401 | 0.8111 |
| 5000 | No | No | mRMR | 0.8215 | 0.1402 | 0.8107 |
| - | Yes | Yes | Lasso | 0.8387 | 0.1335 | 0.8219 |
| - | Yes | No | Lasso | 0.8413 | 0.1330 | 0.8219 |
| - | No | Yes | Lasso | 0.8190 | 0.1374 | 0.8205 |
| - | No | No | Lasso | 0.8185 | 0.1386 | 0.8213 |
The values of the performance metrics were obtained by averaging over the cross-validation repetitions and datasets; ‘nvar’ denotes the number of selected features, ‘selsep’ whether the features were selected separately by data type, and ‘clivar’ whether clinical variables were included or not
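
As an illustration of how the three metrics in the table can be computed and averaged, the following Python sketch evaluates accuracy, Brier score, and AUC from predicted class probabilities. It is a minimal sketch under stated assumptions, not the study's code: the 0.5 classification threshold, the function names, and the order of averaging (first within each dataset, then across datasets) are assumptions made here.

```python
def accuracy(y_true, p_hat, threshold=0.5):
    """Share of correct class predictions; the 0.5 threshold is an assumption."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_hat))
    return hits / len(y_true)


def brier_score(y_true, p_hat):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


def auc(y_true, p_hat):
    """Probability that a random positive is scored above a random negative.

    Ties count one half (the Mann-Whitney formulation of the AUC).
    """
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def mean(values):
    return sum(values) / len(values)


def average_metric(results, metric):
    """results[dataset] is a list of (y_true, p_hat) pairs, one per test fold
    and repetition. Averages within each dataset first, then across datasets
    (an assumed order of averaging)."""
    per_dataset = {
        name: mean([metric(y, p) for y, p in folds])
        for name, folds in results.items()
    }
    return mean(list(per_dataset.values())), per_dataset


# Hypothetical predictions for one test fold.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.7]
print(accuracy(y, p), brier_score(y, p), auc(y, p))
```

Lower Brier scores and higher AUC and accuracy values indicate better predictions, which is how the columns of the table should be read.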
The best performing methods and settings (according to the AUC) per dataset
| Dat | Selector | nvar | selsep | clivar |
|---|---|---|---|---|
| BLCA | mRMR | 100 | Yes | Yes |
| BRCA | Lasso | – | No | Yes |
| COAD | mRMR | 10 | No | Yes |
| ESCA | infor | 1000 | No | No |
| HNSC | mRMR | 10 | Yes | Yes |
| LGG | Lasso | – | No | No |
| LIHC | mRMR | 100 | Yes | Yes |
| LUAD | mRMR | 100 | No | No |
| LUSC | Rfe | – | No | No |
| PAAD | mRMR | 10 | Yes | Yes |
| PRAD | mRMR | 100 | Yes | Yes |
| SARC | GA | – | Yes | Yes |
| SKCM | mRMR | 100 | No | No |
| STAD | RF-VI | 100 | Yes | No |
| UCEC | Lasso | – | Yes | No |
Here, ‘nvar’ denotes the number of selected features, ‘selsep’ whether the features were selected separately by data type, and ‘clivar’ whether clinical variables were included or not
Fig. 3 Mean computation times of feature selection methods averaged across the different datasets. The red and the blue lines indicate the results obtained when selecting from all data types concurrently and separately, respectively
Fig. 4 Mean computation times of feature selection methods for the different datasets. Panel a shows the results obtained for separate selection and panel b, those obtained for concurrent selection from all data types
Summary of the datasets used for the benchmark experiment
| Dataset | Cancer | Clin | cnv | mirna | mutation | rna | f | n | m | r_m |
|---|---|---|---|---|---|---|---|---|---|---|
| BLCA | Bladder urothelial | 5 | 57,964 | 825 | 18,577 | 23,081 | 100,455 | 382 | 186 | 0.49 |
| BRCA | Breast invasive C | 8 | 57,964 | 835 | 17,975 | 22,694 | 99,479 | 735 | 255 | 0.35 |
| COAD | Colon AC | 7 | 57,964 | 802 | 18,538 | 22,210 | 99,524 | 191 | 106 | 0.55 |
| ESCA | Esophageal C | 6 | 57,964 | 763 | 12,628 | 25,494 | 96,858 | 106 | 83 | 0.78 |
| HNSC | Head–neck squamous CC | 11 | 57,964 | 793 | 17,248 | 21,520 | 97,539 | 443 | 307 | 0.69 |
| LGG | Low grade glioma | 10 | 57,964 | 645 | 9235 | 22,297 | 90,154 | 419 | 195 | 0.47 |
| LIHC | Liver hepatocellular C | 11 | 57,964 | 776 | 11,821 | 20,994 | 91,569 | 159 | 44 | 0.28 |
| LUAD | Lung AC | 9 | 57,964 | 799 | 18,388 | 23,681 | 100,844 | 426 | 212 | 0.50 |
| LUSC | Lung squamous CC | 9 | 57,964 | 895 | 18,500 | 23,524 | 100,895 | 418 | 346 | 0.83 |
| PAAD | Pancreatic AC | 10 | 57,964 | 612 | 12,392 | 22,348 | 93,329 | 124 | 78 | 0.63 |
| PRAD | Prostate AC | 4 | 57,925 | 585 | 11,702 | 21,769 | 91,981 | 407 | 48 | 0.12 |
| SARC | Sarcoma | 11 | 57,964 | 778 | 10,001 | 22,842 | 91,599 | 126 | 48 | 0.38 |
| SKCM | Skin cutaneous M | 9 | 57,964 | 1002 | 18,593 | 22,248 | 99,819 | 249 | 39 | 0.16 |
| STAD | Stomach AC | 7 | 57,964 | 787 | 18,581 | 26,027 | 103,369 | 295 | 139 | 0.47 |
| UCEC | Uterine corpus EC | 11 | 57,447 | 866 | 21,053 | 23,978 | 103,358 | 405 | 144 | 0.36 |
C indicates carcinoma, AC adenocarcinoma, CC cell carcinoma, M melanoma, and EC endometrial carcinoma.
The third to the seventh columns show the numbers of features in the respective feature groups, and the eighth column shows the total number of features (f). The last three columns show the numbers of observations (n), the numbers of TP53 mutation cases (m), and the ratio between the numbers of mutation events and the numbers of observations (r_m), in that order
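
The ratio r_m follows directly from the last columns of the table as m divided by n. The short Python sketch below recomputes it for three rows; the (n, m) values are copied from the table, while the function name `mutation_ratio` is introduced here for illustration only.

```python
def mutation_ratio(n, m):
    """Ratio of TP53 mutation cases (m) to observations (n)."""
    return m / n


# (n, m) pairs taken from the dataset summary table.
datasets = {"BLCA": (382, 186), "PRAD": (407, 48), "LUSC": (418, 346)}

for name, (n, m) in datasets.items():
    print(name, round(mutation_ratio(n, m), 2))
# BLCA 0.49
# PRAD 0.12
# LUSC 0.83
```

Values far from 0.5, such as those for PRAD and LUSC, indicate a strongly imbalanced binary outcome.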
Summary of methods compared in the benchmark experiment
| Method | Selector | R package::function |
|---|---|---|
| Filter | :: | |
| | Information gain (infor) | FSelector::information.gain |
| | ReliefF | FSelector::relief |
| | The Minimum Redundancy Maximum Relevance (mRMR) | mRMRe::mRMR.ensemble |
| Wrapper | Recursive feature elimination (Rfe) | caret::rfeControl and rfe |
| | Genetic algorithm (GA) | caret::gafsControl and gafs |
| Embedded | The least absolute shrinkage and selection operator (Lasso) | glmnet::cv.glmnet |
| | The permutation importance of random forests (RF-VI) | ranger::ranger |
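
To illustrate the idea behind the mRMR filter listed above, the following Python sketch greedily selects features that are relevant to the outcome while penalising redundancy with features already selected. It is a simplified sketch and not the `mRMRe::mRMR.ensemble` implementation used in the study: absolute Pearson correlation stands in for mutual information, the difference criterion (relevance minus mean redundancy) is an assumption, and the feature names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def mrmr_select(features, outcome, nvar):
    """Greedy mRMR-style selection.

    features: dict mapping feature name -> list of values
    outcome:  list of 0/1 labels
    nvar:     number of features to select
    Returns the selected feature names in selection order.
    """
    relevance = {name: abs(pearson(col, outcome)) for name, col in features.items()}
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            abs(pearson(features[name], features[s])) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < nvar:
        # sorted() makes the choice deterministic if two scores are equal.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: rna_1_dup is a near copy of rna_1.
outcome = [0, 0, 0, 1, 1, 1]
features = {
    "rna_1": [1, 2, 3, 4, 5, 6],
    "rna_1_dup": [1, 2, 3, 4, 5, 7],
    "mirna_1": [3, 1, 2, 5, 3, 4],
}
print(mrmr_select(features, outcome, 2))
# ['rna_1', 'mirna_1']
```

A pure relevance ranking would pick `rna_1` and `rna_1_dup` here, whereas the redundancy penalty makes the less correlated `mirna_1` the second choice.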