Xinping Xiao, Dian Fu, Yu Shi, Jianghui Wen.
Abstract
The Mahalanobis-Taguchi system (MTS) is a multivariate data diagnosis and prediction technology that is widely used to optimize large-sample or unbalanced data but is rarely applied to high-dimensional small-sample data. In this paper, the MTS is optimized for the classification of high-dimensional small-sample data from two aspects: the instability of the inverse of the covariance matrix and the instability of feature selection. First, based on regularization and smoothing techniques, a modified Mahalanobis metric is proposed for computing the Mahalanobis distance, aimed at reducing the influence of inverse-matrix instability under small-sample conditions. Second, the minimum redundancy-maximum relevance (mRMR) algorithm is introduced into the MTS to address the instability of feature selection. Combining the mRMR algorithm with the signal-to-noise ratio (SNR), a two-stage feature selection method is proposed: the mRMR algorithm first removes noisy and redundant variables; the orthogonal table and SNR then screen for the combination of variables that contributes most to classification. The feasibility and simplicity of the optimized MTS are demonstrated on five datasets from the UCI database. The Mahalanobis distance based on regularization and smoothing techniques (RS-MD) is more robust than the traditional Mahalanobis distance, and the two-stage feature selection method improves the effectiveness of feature selection for the MTS. Finally, the optimized MTS is applied to email classification on the Spambase dataset; the results show that it outperforms the classical MTS and three other machine learning algorithms.
Year: 2020 PMID: 32405295 PMCID: PMC7199641 DOI: 10.1155/2020/4609423
Source DB: PubMed Journal: Comput Intell Neurosci
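The record does not reproduce the paper's exact RS-MD formula, but the idea the abstract describes — stabilizing the inverse of the covariance matrix under small-sample conditions via regularization — can be sketched as covariance shrinkage toward a scaled identity. The shrinkage weight `lam` below is a hypothetical parameter for illustration, not the paper's notation:

```python
import numpy as np

def regularized_mahalanobis(X_normal, x, lam=0.1):
    """Mahalanobis distance with a shrinkage-regularized covariance.

    The inverse of the sample covariance is unstable (or undefined) when
    the number of samples is small relative to the dimension; shrinking
    the covariance toward a scaled identity, controlled by the
    hypothetical weight `lam`, keeps the inverse well conditioned.
    """
    mu = X_normal.mean(axis=0)
    S = np.cov(X_normal, rowvar=False)
    p = S.shape[0]
    # Shrink toward the identity scaled by the average variance.
    S_reg = (1.0 - lam) * S + lam * (np.trace(S) / p) * np.eye(p)
    d = x - mu
    # Scaled squared distance, divided by the dimension as in MTS.
    return float(d @ np.linalg.solve(S_reg, d) / p)
```

With `lam = 0` this reduces to the classical Mahalanobis distance, whose covariance inverse fails when the sample count is below the dimension; any `lam > 0` makes the regularized covariance positive definite.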
Algorithm 1: The algorithm flow of the optimized Mahalanobis–Taguchi system.
Description of the datasets.
| Dataset name | Number of variables | Number of samples | Positive class | Negative class |
|---|---|---|---|---|
| Ionosphere | 33 | 351 | Good/225 | Bad/126 |
| Z-Alizadeh Sani | 48 | 303 | Normal/87 | CAD/216 |
| Parkinson dataset with replicated acoustic features | 46 | 240 | Healthy/120 | PD/120 |
| Breast Cancer Wisconsin (prognostic) | 34 | 194 | Recurrent/148 | Nonrecurrent/46 |
| Connectionist Bench (sonar, mines vs. rocks) | 60 | 161 | R/50 | M/111 |
Figure 1The distribution of RS-MD for normal observations. (a) Ionosphere. (b) Z-Alizadeh Sani. (c) Parkinson dataset with replicated acoustic features. (d) Breast Cancer Wisconsin (prognostic).
The variance of the RS-MD.
| Dataset name | | | | |
|---|---|---|---|---|
| Ionosphere | 0.1555 | 0.1182 | 0.0978 | 0.2463 |
| Z-Alizadeh Sani | 0.1127 | 0.1228 | 0.1483 | 0.2633 |
| Parkinson dataset with replicated acoustic features | 0.0650 | 0.0589 | 0.0552 | 0.1207 |
| Breast Cancer Wisconsin (prognostic) | 0.1419 | 0.1359 | 0.1455 | 0.3499 |
Figure 2Comparison of the distributions between MD and RS-MD. (a) Ionosphere. (b) Z-Alizadeh Sani. (c) Parkinson dataset with replicated acoustic features. (d) Breast Cancer Wisconsin (prognostic).
Figure 3Comparison between RS-MD and GS-MD. (a) GS-MD. (b) RS-MD.
Figure 4Stability of feature selection results for each dataset.
Figure 5Classification accuracy after feature selection of each dataset. (a) Ionosphere. (b) Z-Alizadeh Sani. (c) Parkinson dataset with replicated acoustic features. (d) Breast Cancer Wisconsin (prognostic).
Selected features and their scores from the mRMR algorithm.
| Feature | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 0.000 | −0.094 | −0.019 | −0.007 | −0.014 | −0.027 | −0.026 | −0.019 | −0.016 | −0.017 | −0.030 |
| Feature | | | | | | | | | | | |
| Score | −0.022 | −0.028 | −0.027 | −0.033 | −0.041 | −0.050 | −0.052 | −0.060 | −0.057 | −0.061 | −0.066 |
| Feature | | | | | | | | | | | |
| Score | −0.068 | −0.063 | −0.076 | −0.072 | −0.085 | −0.092 | −0.094 | −0.096 | −0.098 | | |
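The mostly negative scores above are consistent with the difference form of mRMR, where each candidate feature's relevance to the class label is penalized by its mean redundancy with the features already selected. A minimal sketch, assuming a histogram estimate of mutual information (the helper names and the bin count are illustrative, not from the paper):

```python
import numpy as np

def mutual_info(a, b, bins=8):
    """Histogram estimate of mutual information (in nats) between two 1-D arrays."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mrmr(X, y, k):
    """Greedy mRMR (difference form): relevance minus mean redundancy.

    Returns the indices of the k selected features and their scores;
    the first feature (highest relevance) carries no redundancy term.
    """
    n_feat = X.shape[1]
    relevance = np.array([mutual_info(X[:, j], y) for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]
    scores = [float(relevance[selected[0]])]
    while len(selected) < k:
        best_j, best_s = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info(X[:, j], X[:, s]) for s in selected])
            s = relevance[j] - redundancy
            if s > best_s:
                best_j, best_s = j, s
        selected.append(best_j)
        scores.append(float(best_s))
    return selected, scores
```

Because later candidates pay a redundancy penalty, their scores drift negative as selection proceeds, matching the pattern in the table.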
Signal-to-noise ratio values under each test.
| Test | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SNR | 1.513 | −1.271 | −4.010 | 0.692 | −7.809 | −2.428 | −3.104 | −9.281 | −5.210 | −6.468 | −5.422 |
| Test | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
| SNR | −5.690 | −5.506 | −4.236 | −3.931 | −2.749 | −4.634 | −4.227 | −4.107 | −6.069 | −0.658 | −2.976 |
| Test | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | |
| SNR | −2.747 | 0.260 | −3.678 | −4.094 | −4.759 | −3.800 | −4.712 | −3.734 | −5.076 | −4.008 | |
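The record does not give the paper's exact SNR formula; a common larger-the-better form used in MTS screening (assumed here) aggregates the Mahalanobis distances of the abnormal samples for one orthogonal-array run:

```python
import numpy as np

def larger_the_better_snr(md_abnormal):
    """Larger-the-better signal-to-noise ratio used in Taguchi screening.

    `md_abnormal` holds the Mahalanobis distances of the abnormal samples
    under one orthogonal-array run; a larger SNR means that run's feature
    subset pushes abnormal samples further from the normal space.
    """
    d = np.asarray(md_abnormal, dtype=float)
    return float(-10.0 * np.log10(np.mean(1.0 / d**2)))
```

A feature's contribution is then judged by comparing the mean SNR of the runs that include it against the mean SNR of the runs that exclude it; features with a positive gain are retained.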
Figure 6ROC curve corresponding to the Mahalanobis distance of the training set after feature selection.
Comparison of results between optimized MTS and the classification methods for high-dimensional small sample data.
| Method | Feature selection | Number of features | Training set | Test set |
|---|---|---|---|---|
| Optimized MTS | — | 15 | 0.9194 | 0.9067 |
| Classical MTS | — | 20 | 0.8722 | 0.8633 |
| Decision tree | Relief | 15 | 0.8333 | 0.8433 |
| Decision tree | SVM-RFE | 15 | 0.8444 | 0.8533 |
| SVM | Relief | 15 | 0.8722 | 0.8733 |
| SVM | SVM-RFE | 15 | 0.8944 | 0.8967 |
| kNN | Relief | 15 | 0.8583 | 0.8600 |
| kNN | SVM-RFE | 15 | 0.8750 | 0.8867 |