Stepan Nersisyan1, Victor Novosad1,2, Alexei Galatenko3,4, Andrey Sokolov3,4, Grigoriy Bokov3,4, Alexander Konovalov3,4, Dmitry Alekseev3,4, Alexander Tonevitsky1,2,5.
Abstract
Feature selection is one of the main techniques used to prevent overfitting in machine learning applications. The most straightforward approach to feature selection is an exhaustive search: one can go over all possible feature combinations and pick the model with the highest accuracy. This method, together with its optimizations, has been actively used in biomedical research; however, a publicly available implementation is missing. We present ExhauFS, a user-friendly command-line implementation of the exhaustive search approach for classification and survival regression. Aside from the tool description, we include three application examples in the manuscript to comprehensively review the implemented functionality. First, we executed ExhauFS on a toy cervical cancer dataset to illustrate basic concepts. Then, multi-cohort microarray breast cancer datasets were used to construct gene signatures for 5-year recurrence classification. The vast majority of signatures constructed by ExhauFS passed the 0.65 threshold of sensitivity and specificity on all datasets, including the validation one. Moreover, a number of gene signatures demonstrated reliable performance on an independent RNA-seq dataset without any coefficient re-tuning, i.e., turned out to be cross-platform. Finally, Cox survival regression models were used to fit isomiR signatures for overall survival prediction in patients with colorectal cancer. Similarly to the previous example, the majority of models passed the pre-defined concordance index threshold of 0.65 on all datasets. In both real-world scenarios (breast and colorectal cancer datasets), ExhauFS was benchmarked against state-of-the-art feature selection models, including L1-regularized sparse models. In the case of breast cancer, we were unable to construct reliable cross-platform classifiers using alternative feature selection approaches. In the case of colorectal cancer, not a single model built with these alternative approaches passed the same 0.65 threshold.
Source code and documentation of ExhauFS are available on GitHub: https://github.com/s-a-nersisyan/ExhauFS. ©2022 Nersisyan et al.
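The exhaustive search idea described in the abstract can be illustrated with a minimal Python sketch. It is not the ExhauFS implementation or its API. The nearest-centroid classifier is a simple stand-in for the models used in the paper, and all function names and the toy data are hypothetical. The sketch enumerates every k-feature subset, fits a model on the training set, and keeps the subsets whose sensitivity and specificity reach the threshold on both the training and validation sets.

```python
from itertools import combinations


def fit_centroids(X, y, features):
    """Mean vector of each class (0/1), restricted to the chosen feature indices."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    return centroids


def predict(centroids, x, features):
    """Assign x to the class with the closer centroid (squared Euclidean distance)."""
    def dist(label):
        return sum((x[f] - c) ** 2 for f, c in zip(features, centroids[label]))
    return 1 if dist(1) < dist(0) else 0


def sens_spec(centroids, X, y, features):
    """Return (sensitivity, specificity); assumes both classes are present in y."""
    preds = [predict(centroids, x, features) for x in X]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y))
    tn = sum(p == 0 and t == 0 for p, t in zip(preds, y))
    return tp / sum(y), tn / (len(y) - sum(y))


def exhaustive_search(train, validation, n_features, k, threshold=0.65):
    """Evaluate every k-feature subset; keep those passing the threshold everywhere."""
    X_tr, y_tr = train
    X_val, y_val = validation
    passed = []
    for features in combinations(range(n_features), k):
        centroids = fit_centroids(X_tr, y_tr, features)
        scores = (sens_spec(centroids, X_tr, y_tr, features)
                  + sens_spec(centroids, X_val, y_val, features))
        worst = min(scores)  # weakest of the four metrics decides pass/fail
        if worst >= threshold:
            passed.append((features, worst))
    return sorted(passed, key=lambda item: -item[1])


# Hypothetical toy data: features 0 and 2 separate the classes, feature 1 is noise.
train = ([[0.1, 5.0, 1.0], [0.2, 1.0, 1.2], [0.9, 1.0, 3.0], [1.0, 5.0, 3.1]],
         [0, 0, 1, 1])
validation = ([[0.0, 1.0, 0.9], [0.3, 5.0, 1.1], [0.8, 1.1, 2.9], [1.1, 4.9, 3.2]],
              [0, 0, 1, 1])

signatures = exhaustive_search(train, validation, n_features=3, k=1)
print(signatures)
```

The number of candidate subsets is the binomial coefficient C(n, k), which grows quickly with n and k. Any practical use of this approach therefore has to keep the candidate feature pool and the subset size small.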
Keywords: Classification; ExhauFS; Exhaustive search; Feature selection; Survival regression
Year: 2022 PMID: 35378930 PMCID: PMC8976470 DOI: 10.7717/peerj.13200
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. ExhauFS workflow.
Dashed lines represent optional relations (e.g., training/filtration datasets can be used in some feature pre-selection methods).
Figure 2. ROC and Kaplan–Meier plots for the ten-gene signature (TRIP13, ZWINT, EPN3, ECHDC2, CX3CR1, STARD13, MB, SLC7A5, ABAT, CCNL2) evaluated on GSE1456 (microarray) and TCGA-BRCA (RNA-seq) datasets.
Red points on ROC curves stand for the actual SVM threshold values.
Figure 3. Kaplan–Meier plots for 5′-isomiR signatures in colorectal cancer.
(A) The seven-isomiR signature (hsa-miR-200a-5p—0, hsa-miR-26b-5p—0, hsa-miR-21-3p—0, hsa-miR-126-3p—0, hsa-let-7e-5p—0, hsa-miR-374a-3p—0, hsa-miR-141-5p—0). (B) The four-gene isomiR signature (hsa-miR-126-3p—0, hsa-miR-374a-3p—0, hsa-miR-182-5p—0, hsa-let-7e-5p—0).