| Literature DB >> 30125332 |
Zifa Li, Weibo Xie, Tao Liu.
Abstract
Abstract
Feature selection and classification are the main topics in microarray data analysis. Although many feature selection methods have been proposed in this field, SVM-RFE (Support Vector Machine with Recursive Feature Elimination) has proved to be one of the best: it ranks the features (genes) by training a support vector machine classification model and selects key genes by combining this ranking with a recursive feature elimination strategy. The principal drawback of SVM-RFE is its huge time consumption. To overcome this limitation, we introduce a more efficient implementation of linear support vector machines, improve the recursive feature elimination strategy, and combine the two to select informative genes. In addition, we propose a simple resampling method to preprocess the datasets, which balances the information distribution across sample classes and makes the classification results more credible. Moreover, the applicability of four common classifiers is also studied. Extensive experiments on six of the most frequently used microarray datasets in this field show that the proposed methods not only reduce the time consumption greatly but also obtain comparable classification performance.
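The abstract's description of SVM-RFE (rank features by trained SVM weights, then recursively eliminate the weakest) can be sketched as a generic loop. This is an illustrative skeleton, not the paper's implementation; `weight_fn` is a hypothetical stand-in for training a linear SVM on the surviving features and returning one weight per feature:

```python
def svm_rfe(features, weight_fn, n_select=1):
    """Generic SVM-RFE loop (sketch, not the paper's code).

    weight_fn(features) stands in for training a linear SVM and
    returning one weight per surviving feature; each round the
    feature with the smallest squared weight is eliminated.
    """
    ranking = []                 # features in elimination order (least informative first)
    features = list(features)
    while len(features) > n_select:
        w = weight_fn(features)
        worst = min(range(len(features)), key=lambda j: w[j] ** 2)
        ranking.append(features.pop(worst))
    return features, ranking
```

The classic formulation eliminates one feature per round, which is what makes SVM-RFE so slow on datasets with thousands of genes; the paper's VSSRFE variant below attacks exactly this cost.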
Year: 2018 PMID: 30125332 PMCID: PMC6101392 DOI: 10.1371/journal.pone.0202167
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The overview of RVOS.
Algorithm 1. Random value-based oversampling (RVOS) method for the class-imbalance issue
1. Given data matrix …
2. While k ≥ 1:
   (1) For j = 1, 2, …, n (n denotes the column size of …):
       • Randomly choose a value V from …;
   (2) Save the new sample to …;
   (3) k = k − 1;
3. Return …
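The matrix symbols in the extracted pseudocode were lost, but the idea is recoverable: each synthetic minority-class sample gets every feature value drawn at random from the values that the minority class already exhibits for that feature. A minimal Python sketch under that reading (function and variable names are ours, not the paper's):

```python
import random

def rvos(minority_rows, k, rng=None):
    """Random value-based oversampling (RVOS) sketch.

    For each of the k synthetic samples, draw every feature value
    independently from the observed values of that feature (column)
    in the minority-class data matrix.
    """
    rng = rng or random.Random()
    n = len(minority_rows[0])          # column size of the minority matrix
    synthetic = []
    while k >= 1:
        sample = [rng.choice([row[j] for row in minority_rows])  # random value V from column j
                  for j in range(n)]
        synthetic.append(sample)       # save the new sample
        k -= 1
    return synthetic
```

In the balancing setup the paper describes, k would be the difference between the class sizes, so appending the returned samples restores a 1:1 ratio.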
The overview of VSSRFE.
Algorithm 2. Recursive Feature Elimination with variable step size (VSSRFE)
1. Given the set of genes …
2. Get the total quantity of genes n_total from …
3. temp = n_total; N = n_total; S = s_initial
4. While N > n_selected:
   (1) N = N − S;
   (2) If temp / N ≥ 2 and S > 1:
       • temp = N; …
   (3) Train LLSVM with X and Y and get the sorted weight vector …
   (4) Delete features according to …
5. Return …
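The elimination schedule above can be simulated on its own. One line of the pseudocode body did not survive extraction; the sketch below assumes it halves the step size S (consistent with the "variable step size" name and the `S > 1` guard), which is our reading, not a verbatim reconstruction:

```python
def vssrfe_schedule(n_total, n_selected, s_initial):
    """Sketch of the VSSRFE elimination schedule.

    Removes S features per round; whenever the number of remaining
    features N has halved since the step was last reduced, S is
    halved too (assumed: the halving line was lost in extraction).
    Returns the sequence of feature counts kept after each round.
    """
    temp, N, S = n_total, n_total, s_initial
    counts = []
    while N > n_selected:
        N = max(N - S, n_selected)       # never drop below the target count
        if N and temp / N >= 2 and S > 1:
            temp = N
            S = max(S // 2, 1)           # halve the step size (assumption)
        counts.append(N)
    return counts
```

Because the step starts large and shrinks only as the feature pool shrinks, the number of (expensive) SVM trainings drops from O(n_total) for classic one-at-a-time RFE to roughly logarithmic in practice, which is where the reported speedup comes from.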
The overview of cyclic coordinate descent method for LLSVM.
Algorithm 3. Cyclic coordinate descent method for LLSVM
1. Given …
2. For k = 1, 2, 3, …, m:
   (1) …
   (2) For j = 1, 2, …, n:
       • Obtain Z* by solving the sub-problem (6);
3. Return …
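Sub-problem (6) did not survive extraction, so the paper's exact per-coordinate update cannot be reproduced here. As an illustration of the cyclic coordinate descent idea for a linear SVM, the sketch below follows the standard dual coordinate descent update for the L1-loss linear SVM (one closed-form, clipped update per coordinate, with the weight vector maintained incrementally); treat it as a generic stand-in, not the paper's LLSVM solver:

```python
def dual_cd_linear_svm(X, y, C=1.0, m=10):
    """Illustrative cyclic (dual) coordinate descent for a linear SVM.

    Standard update for the L1-loss dual: for each coordinate i,
    compute the gradient, take a clipped Newton step in [0, C],
    and update w incrementally so each pass stays cheap.
    """
    l, n = len(X), len(X[0])
    alpha = [0.0] * l
    w = [0.0] * n                        # maintained as w = sum_i alpha_i * y_i * x_i
    Q = [sum(v * v for v in X[i]) for i in range(l)]   # diagonal of the Gram matrix
    for _ in range(m):                   # outer iterations k = 1..m
        for i in range(l):               # cycle over the dual coordinates
            if Q[i] == 0:
                continue
            g = y[i] * sum(w[j] * X[i][j] for j in range(n)) - 1.0   # gradient at coordinate i
            new_a = min(max(alpha[i] - g / Q[i], 0.0), C)            # clip to the box [0, C]
            d = new_a - alpha[i]
            if d != 0.0:
                alpha[i] = new_a
                for j in range(n):       # incremental update of w
                    w[j] += d * y[i] * X[i][j]
    return w
```

Each coordinate update touches only one training sample, which is what makes this family of solvers fast on the wide, sample-poor matrices typical of microarray data.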
The characteristics of the raw datasets (SDR: sample-to-dimension ratio; IR: imbalance ratio).
| Dataset | # Class 1 | # Class 2 | # Features | SDR | IR | Original Ref. |
|---|---|---|---|---|---|---|
| Colon | 22 | 40 | 2000 | 3.1% | 1.82 | [ |
| CNS | 21 | 39 | 7129 | 0.84% | 1.86 | [ |
| Leukemia | 25 | 47 | 7129 | 1.01% | 1.88 | [ |
| Ovarian | 91 | 162 | 15154 | 1.67% | 1.78 | [ |
| Prostate | 59 | 77 | 12600 | 1.08% | 1.31 | [ |
| Breast | 46 | 51 | 24481 | 0.40% | 1.11 | [ |
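The two derived columns in the table are simple ratios. Assuming SDR = total samples / feature count and IR = majority class size / minority class size (both readings reproduce every row above), they can be computed as:

```python
def sdr(n_class1, n_class2, n_features):
    """Sample-to-dimension ratio: total samples divided by feature count."""
    return (n_class1 + n_class2) / n_features

def ir(n_class1, n_class2):
    """Imbalance ratio: majority class size over minority class size."""
    return max(n_class1, n_class2) / min(n_class1, n_class2)
```

For the Colon row: sdr(22, 40, 2000) = 0.031 = 3.1% and ir(22, 40) ≈ 1.82, matching the table.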
Parameters for feature selectors and classifiers on balanced datasets.
| Method | Parameter | Leukemia | Prostate | Ovarian | Breast | Colon | CNS |
|---|---|---|---|---|---|---|---|
| LLSVM | C | 0.1 | 0.7 | 0.3 | 0.3 | 0.9 | 0.5 |
| SVM | C | 0.1 | 0.9 | 0.5 | 0.1 | 0.1 | 0.05 |
| RF | N | 100 | 800 | 100 | 400 | 200 | 400 |
| LLSVM | S | 600 | 1000 | 1000 | 800 | 100 | 200 |
| SVM | S | 400 | 400 | 1000 | 800 | 200 | 100 |
| SVM | C | 9 | 1 | 3 | 0.09 | 7 | 0.1 |
| kNN | K | 1 | 5 | 7 | 7 | 6 | 4 |
| LR | C | 19 | 7 | 7 | 3 | 9 | 3 |
Parameters for SVM-VSSRFE on raw datasets.
| Feature selector | Parameter | Leukemia | Prostate | Ovarian | Breast | Colon | CNS |
|---|---|---|---|---|---|---|---|
| SVM-VSSRFE | C | 0.3 | 0.9 | 0.5 | 0.1 | 0.5 | 0.9 |
| SVM-VSSRFE | S | 100 | 800 | 1000 | 1000 | 60 | 100 |
Fig 1. The comparison of ACC obtained on six balanced and raw datasets.
Fig 3. The comparison of MCC obtained on six balanced and raw datasets.
The comparison of performance and time consumption between SVM-RFE and SVM-VSSRFE.
| Dataset | ACC | AUC | MCC | Time (s) | ACC | AUC | MCC | Time (s) |
|---|---|---|---|---|---|---|---|---|
| | 10468.94 | 0.9021 | 0.9538 | 0.8096 | | | | |
| | 20518.44 | 0.8318 | 0.9003 | 0.6685 | | | | |
| | 0.7696 | 0.8765 | 0.5504 | 1009.01 | | | | |
| | 76.69 | 0.8875 | 0.9516 | 0.7820 | | | | |
| | 0.99 | 0.9809 | 1435.81 | | | | | |
| | 13897.13 | | | | | | | |
The comparison of time consumption (s) between SVM-VSSRFE and LLSVM-VSSRFE.
| Dataset | SVM-VSSRFE | LLSVM-VSSRFE |
|---|---|---|
| | 4775.05 | |
| | 4541.82 | |
| | 1890.59 | |
| | 100.04 | |
| | 2256.16 | |
| | 741.62 | |
Fig 4. The comparison of ACC obtained by five feature selectors.
Fig 6. The comparison of MCC obtained by five feature selectors.
Fig 7. The comparison of ACC obtained by four common classifiers.
Fig 9. The comparison of MCC obtained by four common classifiers.
Fig 10. The classification model evaluation.
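The metrics reported throughout (ACC and MCC; AUC is computed analogously from the ROC curve) follow directly from a binary confusion matrix. A self-contained sketch of the two, not taken from the paper's code:

```python
import math

def acc_mcc(tp, fp, tn, fn):
    """Accuracy and Matthews correlation coefficient from a 2x2 confusion matrix."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0   # MCC defined as 0 when denominator vanishes
    return acc, mcc
```

MCC is the more informative metric on imbalanced data, which is why the paper reports it alongside ACC: a classifier that always predicts the majority class can still score a high ACC, but its MCC is 0.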