
Evaluating feature selection strategies for high dimensional, small sample size datasets.

Abhishek Golugula, George Lee, Anant Madabhushi.

Abstract

In this work, we analyze and evaluate different strategies for comparing Feature Selection (FS) schemes on High Dimensional (HD) biomedical datasets (e.g., gene and protein expression studies) with a small sample size (SSS). Additionally, we define a new measure, Robustness, specifically for comparing the ability of an FS scheme to be invariant to changes in its training data. While classifier accuracy has been the de facto method for evaluating FS schemes, on account of the curse of dimensionality problem, it might not always be the appropriate measure for HD/SSS datasets. SSS lends the dataset a higher probability of containing data that is not representative of the true distribution of the whole population. An ideal FS scheme must therefore be robust enough to produce the same results even when the training data change. In this study, we employed the robustness performance measure in conjunction with classifier accuracy (measured via the K-Nearest Neighbor and Random Forest classifiers) to quantitatively compare five different FS schemes (T-test, F-test, Kolmogorov-Smirnov Test, Wilks' Lambda Test and Wilcoxon Rank Sum Test) on five HD/SSS gene and protein expression datasets corresponding to ovarian cancer, lung cancer, bone lesions, celiac disease, and coronary heart disease. Of the five FS schemes compared, the Wilcoxon Rank Sum Test was found to outperform the other FS schemes in terms of classification accuracy and robustness. Our results suggest that both classifier accuracy and robustness should be considered when deciding on the appropriate FS scheme for HD/SSS datasets.
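The abstract does not give the formula behind the Robustness measure, so the following is only a minimal sketch of one plausible way to quantify how stable an FS scheme is under changes to its training data. It assumes robustness can be approximated by the mean pairwise Jaccard overlap of the feature sets selected on random stratified subsamples; that definition, the toy data, and every function name (`rank_sum_z`, `select_top_k`, `robustness`, `make_toy_data`) are illustrative assumptions, not the authors' method. Features are ranked with a Wilcoxon rank-sum z-score computed by normal approximation.

```python
import random
from itertools import combinations


def rank_sum_z(a, b):
    """Absolute z-score of the Wilcoxon rank-sum statistic.

    Uses the normal approximation with average ranks for ties
    and no tie correction of the variance.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(a), len(b)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    return abs(w - mean) / sd


def select_top_k(X, y, k):
    """Return the set of indices of the k features with the largest z-score."""
    scores = []
    for f in range(len(X[0])):
        a = [row[f] for row, label in zip(X, y) if label == 0]
        b = [row[f] for row, label in zip(X, y) if label == 1]
        scores.append((rank_sum_z(a, b), f))
    scores.sort(reverse=True)
    return {f for _, f in scores[:k]}


def robustness(X, y, k, n_rounds=20, frac=0.8, seed=0):
    """Assumed stability score: mean pairwise Jaccard overlap of the
    feature sets selected on random stratified training subsets."""
    rng = random.Random(seed)
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    chosen = []
    for _ in range(n_rounds):
        # Subsample each class separately so both are always present.
        sub = (rng.sample(idx0, max(2, int(frac * len(idx0))))
               + rng.sample(idx1, max(2, int(frac * len(idx1)))))
        chosen.append(select_top_k([X[i] for i in sub], [y[i] for i in sub], k))
    pairs = list(combinations(chosen, 2))
    return sum(len(s & t) / len(s | t) for s, t in pairs) / len(pairs)


def make_toy_data(n_per_class=15, n_features=200, n_informative=5,
                  shift=2.0, seed=1):
    """Synthetic HD/SSS data: few samples, many features, a few informative."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for f in range(n_informative):
                row[f] += shift * label  # class 1 is shifted on these features
            X.append(row)
            y.append(label)
    return X, y


X, y = make_toy_data()
score = robustness(X, y, k=5)
print(f"robustness (mean Jaccard) = {score:.2f}")
```

A score near 1 means the scheme picks nearly the same features on every subsample, and a score near 0 means the selection depends heavily on which samples were drawn. In the spirit of the abstract, such a score would be reported alongside classifier accuracy rather than in place of it.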

Year: 2011
PMID: 22254468
DOI: 10.1109/IEMBS.2011.6090214

Source DB: PubMed
Journal: Conf Proc IEEE Eng Med Biol Soc
ISSN: 1557-170X


