Bingtao Zhang1,2, Peng Cao3.
Abstract
High-dimensional biomedical data contain tens of thousands of features. Accurately and efficiently identifying the core features in these data can assist in diagnosing related diseases. However, biomedical data often contain a large number of irrelevant or redundant features, which seriously degrade subsequent classification accuracy and machine learning efficiency. To solve this problem, this paper proposes a novel filter feature selection algorithm based on redundancy removal (FSBRR) for classifying high-dimensional biomedical data. First, two redundancy criteria are determined by vertical relevance (the relationship between a feature and the class attribute) and horizontal relevance (the relationship between one feature and another). Second, to quantify the redundancy criteria, an approximate redundancy feature framework based on mutual information (MI) is defined to remove redundant and irrelevant features. To evaluate the effectiveness of the proposed algorithm, controlled trials against typical feature selection algorithms are conducted using three different classifiers. The experimental results indicate that the FSBRR algorithm can effectively reduce the feature dimension and improve classification accuracy. In addition, an experiment on a small-sample dataset is designed and conducted in the discussion and analysis section to illustrate the specific implementation process of the FSBRR algorithm.
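The two kinds of relevance described in the abstract can be illustrated with a small mutual-information filter. This is a minimal sketch and not the paper's FSBRR procedure: the function names `mutual_information` and `select_features`, the thresholds `min_relevance` and `max_redundancy`, and the toy data are all assumptions introduced here for illustration, and features are assumed to be already discretized.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum over observed pairs of p(a,b) * log2(p(a,b) / (p(a) * p(b))).
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def select_features(columns, labels, min_relevance, max_redundancy):
    """Greedy MI filter (illustrative only; thresholds are hypothetical).

    columns: dict mapping feature name -> list of discrete values.
    "Vertical" check: drop a feature whose MI with the labels is below
    min_relevance. "Horizontal" check: drop a feature whose MI with an
    already kept feature exceeds max_redundancy.
    """
    relevance = {f: mutual_information(v, labels) for f, v in columns.items()}
    ranked = sorted(
        (f for f in columns if relevance[f] >= min_relevance),
        key=lambda f: relevance[f],
        reverse=True,
    )
    kept = []
    for f in ranked:
        if all(
            mutual_information(columns[f], columns[g]) <= max_redundancy
            for g in kept
        ):
            kept.append(f)
    return kept


# Toy data, invented for this example (not taken from the paper).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 1],  # perfectly predicts the class
    "f2": [1, 1, 1, 1, 0, 0, 0, 0],  # inverted copy of f1: redundant
    "f3": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the class: irrelevant
    "f4": [0, 0, 0, 1, 1, 1, 1, 0],  # weakly informative, not redundant
}
print(select_features(columns, labels, min_relevance=0.1, max_redundancy=0.5))
# prints ['f1', 'f4']
```

In this toy run `f3` fails the feature-to-class check and `f2` fails the feature-to-feature check against `f1`. How FSBRR actually combines the two criteria is defined in the full paper, not in this record.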
Year: 2019 PMID: 30964868 PMCID: PMC6456288 DOI: 10.1371/journal.pone.0214406
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Different cases of extreme value.
| large | large | |
| large | small | |
| small | large | |
| small | small |
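High-dimensional biomedical datasets.
| Dataset | Features | Samples | Classes |
|---|---|---|---|
| ColonTumor | 2000 | 62 | 2 |
| Nervous-System | 7129 | 60 | 2 |
| DLBCL-Stanford | 4026 | 47 | 2 |
| p53 Mutants | 5409 | 16772 | 2 |
| Arcene | 10000 | 200 | 2 |
| BRCA | 21548 | 1097 | 2 |
| GBM | 18348 | 528 | 2 |
| TSP | 319 | 163 | 2 |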
Hardware and software configuration of the experimental environment.
| CPU | Intel i7-8550U 4.0GHz | |
| RAM | 16G | |
| Operating system (OS) | Windows 7 | |
| Software platform | Matlab 2017a |
Experimental results based on eight data sets.
| Classifier | Dataset | Algorithm | Mean (%) | Max(%) | Min(%) | Std | MeanFN | RT(s) |
|---|---|---|---|---|---|---|---|---|
| ColonTumor | Full Set | 78.21 | 82.32 | 71.46 | 3.51 | 2000 | ‒ | |
| FSBRR | 95.45 | 91.21 | ||||||
| Relief | 85.49 | 89.94 | 81.23 | 3.16 | 38 | 2.19 | ||
| mRmR | 86.61 | 90.72 | 82.77 | 3.85 | 42 | 2.75 | ||
| GA | 89.13 | 95.71 | 84.12 | 3.91 | 36 | 19.18 | ||
| Nervous-System | Full Set | 61.33 | 74.61 | 54.86 | 5.68 | 7129 | ‒ | |
| FSBRR | 84.02 | 74.64 | ||||||
| Relief | 71.56 | 81.24 | 68.88 | 3.97 | 44 | 14.72 | ||
| mRmR | 72.42 | 76.43 | 69.61 | 3.24 | 42 | 16.37 | ||
| GA | 75.76 | 79.21 | 69.56 | 3.87 | 48 | 90.01 | ||
| DLBCL-Stanford | Full Set | 76.60 | 80.21 | 73.79 | 2.87 | 4026 | ‒ | |
| FSBRR | 84.16 | 79.30 | ||||||
| Relief | 75.48 | 81.64 | 71.63 | 3.34 | 45 | 3.84 | ||
| mRmR | 76.58 | 84.61 | 69.15 | 6.42 | 54 | 4.10 | ||
| GA | 79.46 | 82.46 | 75.16 | 3.12 | 91 | 31.80 | ||
| p53 Mutants | Full Set | 89.66 | 92.18 | 81.61 | 3.42 | 5409 | ‒ | |
| FSBRR | 98.00 | 90.56 | 3.19 | 39 | ||||
| Relief | 90.42 | 94.25 | 84.25 | 4.06 | 54 | 4.14 | ||
| mRmR | 87.91 | 91.14 | 79.79 | 5.48 | 4.52 | |||
| GA | 94.16 | 98.31 | 91.21 | 64 | 37.44 | |||
| Arcene | Full Set | 73.14 | 80.60 | 66.60 | 4.87 | 10000 | ‒ | |
| FSBRR | 87.30 | 82.20 | 2.11 | |||||
| Relief | 81.01 | 84.01 | 77.78 | 73 | 24.94 | |||
| mRmR | 78.80 | 84.02 | 72.02 | 4.82 | 71 | 20.34 | ||
| GA | 77.14 | 80.02 | 74.31 | 3.16 | 91 | 128.67 | ||
| BRCA | Full Set | 80.57 | 85.51 | 76.99 | 3.29 | 21548 | ‒ | |
| FSBRR | 89.24 | 81.14 | ||||||
| Relief | 85.22 | 90.10 | 79.44 | 4.01 | 241 | 26.25 | ||
| mRmR | 83.54 | 85.92 | 80.36 | 2.21 | 189 | 24.01 | ||
| GA | 85.16 | 88.90 | 81.12 | 2.62 | 246 | 29.60 | ||
| GBM | Full Set | 69.92 | 80.16 | 62.77 | 8.45 | 18348 | ‒ | |
| FSBRR | 80.95 | 86.78 | 75.49 | 61 | ||||
| Relief | 76.14 | 82.15 | 70.65 | 4.29 | 4.87 | |||
| mRmR | 74.90 | 79.12 | 69.44 | 3.96 | 68 | 5.42 | ||
| GA | 90.25 | 75.02 | 5.48 | 93 | 12.84 | |||
| TSP | Full Set | 68.77 | 75.87 | 63.21 | 4.26 | 319 | ‒ | |
| FSBRR | 81.23 | 76.99 | ||||||
| Relief | 67.01 | 69.89 | 61.32 | 3.01 | 135 | 0.75 | ||
| mRmR | 56.96 | 62.42 | 50.42 | 4.12 | 124 | 0.50 | ||
| GA | 67.26 | 70.51 | 60.23 | 3.96 | 137 | 0.80 | ||
| ColonTumor | Full Set | 75.80 | 81.32 | 71.02 | 3.69 | 2000 | ‒ | |
| FSBRR | 95.01 | 88.09 | ||||||
| Relief | 83.41 | 88.91 | 78.13 | 3.76 | 38 | 1.87 | ||
| mRmR | 84.69 | 89.12 | 79.58 | 3.88 | 42 | 2.04 | ||
| GA | 86.67 | 88.33 | 82.33 | 2.53 | 38 | 9.16 | ||
| Nervous-System | Full Set | 56.86 | 69.66 | 53.94 | 4.17 | 7129 | ‒ | |
| FSBRR | 81.25 | 70.11 | 3.22 | |||||
| Relief | 65.69 | 70.89 | 63.28 | 3.36 | 44 | 9.14 | ||
| mRmR | 65.77 | 69.12 | 62.14 | 3.28 | 42 | 8.31 | ||
| GA | 73.62 | 76.58 | 69.14 | 56 | 73.82 | |||
| DLBCL-Stanford | Full Set | 78.24 | 82.46 | 62.36 | 5.83 | 4026 | ‒ | |
| FSBRR | 88.78 | 79.47 | ||||||
| Relief | 76.32 | 84.47 | 71.95 | 4.73 | 45 | 2.43 | ||
| mRmR | 76.98 | 87.71 | 71.54 | 5.86 | 54 | 3.01 | ||
| GA | 80.06 | 82.78 | 72.83 | 3.79 | 97 | 24.31 | ||
| p53 Mutants | Full Set | 84.93 | 90.15 | 80.61 | 4.01 | 5409 | ‒ | |
| FSBRR | 88.30 | 92.01 | 85.50 | |||||
| Relief | 85.20 | 89.23 | 81.99 | 2.45 | 54 | 3.78 | ||
| mRmR | 84.27 | 87.50 | 80.09 | 2.29 | 37 | 4.14 | ||
| GA | 95.14 | 87.04 | 2.99 | 56 | 39.77 | |||
| Arcene | Full Set | 67.60 | 77.38 | 62.57 | 4.38 | 10000 | ‒ | |
| FSBRR | 89.30 | 79.20 | 3.02 | |||||
| Relief | 82.01 | 87.01 | 79.04 | 73 | 9.04 | |||
| mRmR | 78.93 | 86.33 | 75.18 | 4.34 | 71 | 9.79 | ||
| GA | 79.13 | 84.78 | 75.90 | 3.41 | 87 | 84.46 | ||
| BRCA | Full Set | 78.51 | 83.42 | 72.81 | 4.87 | 21548 | ‒ | |
| FSBRR | 87.24 | 80.02 | ||||||
| Relief | 83.75 | 87.06 | 74.59 | 5.23 | 241 | 19.10 | ||
| mRmR | 82.43 | 87.88 | 75.31 | 4.85 | 189 | 20.21 | ||
| GA | 83.01 | 87.11 | 78.01 | 3.02 | 251 | 62.62 | ||
| GBM | Full Set | 68.82 | 80.64 | 65.55 | 4.58 | 18348 | ‒ | |
| FSBRR | 80.12 | 87.07 | 78.30 | 61 | ||||
| Relief | 74.85 | 80.02 | 70.88 | 3.81 | 2.42 | |||
| mRmR | 74.82 | 80.25 | 66.40 | 5.56 | 68 | 2.87 | ||
| GA | 88.51 | 75.24 | 3.95 | 88 | 5.50 | |||
| TSP | Full Set | 62.72 | 74.87 | 61.21 | 4.86 | 319 | ‒ | |
| FSBRR | 82.31 | 76.46 | ||||||
| Relief | 61.75 | 69.23 | 60.86 | 3.23 | 135 | 0.69 | ||
| mRmR | 57.87 | 63.79 | 53.18 | 4.27 | 124 | 0.51 | ||
| GA | 69.56 | 71.88 | 66.81 | 2.26 | 139 | 0.81 | ||
| ColonTumor | Full Set | 73.86 | 80.63 | 70.66 | 3.98 | 2000 | ‒ | |
| FSBRR | 91.46 | 83.21 | ||||||
| Relief | 80.46 | 84.18 | 75.44 | 4.23 | 38 | 3.22 | ||
| mRmR | 81.78 | 85.71 | 75.81 | 4.58 | 42 | 3.96 | ||
| GA | 84.16 | 89.66 | 81.62 | 4.40 | 48 | 28.86 | ||
| Nervous-System | Full Set | 53.34 | 65.15 | 50.86 | 6.58 | 7129 | ‒ | |
| FSBRR | 78.12 | 70.69 | ||||||
| Relief | 68.56 | 72.61 | 62.81 | 3.92 | 44 | 17.89 | ||
| mRmR | 65.78 | 72.48 | 61.15 | 5.02 | 42 | 19.96 | ||
| GA | 68.16 | 72.11 | 63.56 | 3.52 | 45 | 42.11 | ||
| DLBCL-Stanford | Full Set | 70.16 | 79.21 | 60.79 | 5.86 | 4026 | ‒ | |
| FSBRR | 83.45 | 74.41 | 4.31 | |||||
| Relief | 72.96 | 74.61 | 62.69 | 5.95 | 45 | 4.41 | ||
| mRmR | 72.15 | 77.65 | 69.15 | 54 | 3.25 | |||
| GA | 72.85 | 77.45 | 68.06 | 3.85 | 91 | 12.54 | ||
| p53 Mutants | Full Set | 85.56 | 87.76 | 80.13 | 4.11 | 5409 | ‒ | |
| FSBRR | 90.02 | 82.15 | 3.11 | 39 | ||||
| Relief | 84.15 | 87.26 | 80.36 | 54 | 2.85 | |||
| mRmR | 82.15 | 87.58 | 76.05 | 4.85 | 3.91 | |||
| GA | 84.55 | 88.12 | 78.51 | 3.98 | 61 | 20.51 | ||
| Arcene | Full Set | 70.56 | 74.15 | 66.58 | 3.55 | 10000 | ‒ | |
| FSBRR | 84.52 | 78.15 | ||||||
| Relief | 76.59 | 80.12 | 70.54 | 3.85 | 73 | 8.52 | ||
| mRmR | 74.75 | 79.65 | 70.25 | 4.65 | 71 | 8.15 | ||
| GA | 71.35 | 75.85 | 65.35 | 4.05 | 115 | 55.85 | ||
| BRCA | Full Set | 77.25 | 80.24 | 74.21 | 3.17 | 21548 | ‒ | |
| FSBRR | 86.55 | 80.85 | ||||||
| Relief | 79.05 | 85.89 | 76.19 | 3.29 | 241 | 8.58 | ||
| mRmR | 78.49 | 86.01 | 71.71 | 5.85 | 189 | 9.77 | ||
| GA | 79.96 | 86.36 | 75.45 | 3.25 | 262 | 8.40 | ||
| GBM | Full Set | 65.12 | 71.55 | 61.10 | 3.40 | 18348 | ‒ | |
| FSBRR | 85.55 | 77.35 | 61 | |||||
| Relief | 74.01 | 78.23 | 71.08 | 2.75 | 7.88 | |||
| mRmR | 73.99 | 79.00 | 70.59 | 3.44 | 68 | 8.58 | ||
| GA | 80.03 | 87.10 | 76.40 | 3.36 | 101 | 9.51 | ||
| TSP | Full Set | 64.45 | 72.14 | 60.45 | 5.84 | 319 | ‒ | |
| FSBRR | 76.18 | 69.98 | ||||||
| Relief | 62.65 | 67.18 | 58.10 | 3.94 | 135 | 0.70 | ||
| mRmR | 56.90 | 61.54 | 51.95 | 4.75 | 124 | 0.52 | ||
| GA | 64.67 | 72.09 | 60.46 | 5.18 | 140 | 0.75 |
Note: (1) Full Set: all features, with no feature selection algorithm applied. (2) Mean (%): mean performance. (3) Max(%): highest performance. (4) Min(%): lowest performance. (5) Std: standard deviation. (6) MeanFN: mean number of selected features. (7) RT(s): running time in seconds.
Boldface indicates the best experimental result.
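The summary columns defined in the note can be computed from per-run accuracies as in the following minimal sketch. The helper name `summarize` and the sample accuracies are hypothetical and are not taken from the paper. The record does not say whether Std is the population or the sample standard deviation; the population form is assumed here.

```python
from statistics import mean, pstdev


def summarize(accuracies):
    """Return Mean, Max, Min and Std for a list of per-run accuracies (%)."""
    return {
        "mean": mean(accuracies),
        "max": max(accuracies),
        "min": min(accuracies),
        # Population standard deviation (assumption); use statistics.stdev
        # instead if the sample standard deviation is intended.
        "std": pstdev(accuracies),
    }


# Invented accuracies from three hypothetical runs.
print(summarize([80.0, 85.0, 90.0]))
```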
Fig 1. The average attributes value of four feature selection algorithms based on eight datasets.
Fig 2. The relationship between parameter δ and classification accuracy.
Fig 3. The relationship between parameter α and classification accuracy.
Fig 4. The range of parameter values when the classification accuracy is located in the top 20% based on FSBRR algorithm.
The relationship between relevance and redundant features based on the FSBRR algorithm.
| {5} | 83.30 | ‒ | 0.9157 | ‒ | ‒ | |
| {5,7} | 87.12 | ↑ | 0.9011 | 0.9454 | ||
| {5,21} | 83.16 | ↓ | 0.8045 | 0.9101 | ||
| {5,7,18} | 87.64 | ↑ | 0.4562 | 0.2131/0.1983 | ||
| {8,3} | 84.50 | ‒ | 0.8934/0.9013 | ‒ | ‒ | |
| {8,3,30} | 87.86 | ↑ | 0.7120 | 0.3010/0.2139 |