Leon Bobrowski, Tomasz Łukaszuk, Bengt Lindholm, Peter Stenvinkel, Olof Heimburger, Jonas Axelsson, Peter Bárány, Juan Jesus Carrero, Abdul Rashid Qureshi, Karin Luttropp, Malgorzata Debowska, Louise Nordfors, Martin Schalling, Jacek Waniewski.
Abstract
Risk factors in patients with a particular disease can be identified in clinical data sets by using feature selection procedures from pattern recognition and data mining. The applicability of the relaxed linear separability (RLS) method of feature subset selection was evaluated on high-dimensional, mixed-type (genetic and phenotypic) clinical data of patients with end-stage renal disease. The RLS method allowed a substantial reduction of dimensionality by omitting redundant features while maintaining the linear separability of the data sets of patients with high and low levels of an inflammatory biomarker. The synergy between genetic and phenotypic features in differentiating between these two subgroups was demonstrated.
Year: 2014 PMID: 24489753 PMCID: PMC3904924 DOI: 10.1371/journal.pone.0086630
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. AE and CVE - phenotypic space.
The apparent error rate (AE) and the cross-validation error (CVE) in different feature subspaces of the phenotypic space.
Figure 3. AE and CVE - phenotypic and genetic space.
The apparent error rate (AE) and the cross-validation error (CVE) in different feature subspaces of the phenotypic and genetic space.
Features that define the optimal phenotypic subspace characterized by the lowest cross-validation error (CVE), their factor coefficients in the minimal value of the criterion function (see Appendix S1, equation 5) and their correlation coefficients with CRP plasma concentrations.
| Feature | Factor | Pearson’s correlation: coefficient | Pearson’s correlation: p-value |
|---|---|---|---|
| | 1.478 | 0.483 | 0.000 |
| | 1.066 | −0.389 | 0.000 |
| | 1.023 | 0.238 | 0.000 |
| | 0.841 | 0.098 | 0.141 |
| | 0.806 | 0.396 | 0.000 |
| | −0.778 | −0.070 | 0.298 |
| | 0.758 | 0.351 | 0.000 |
| | 0.754 | 0.106 | 0.114 |
| | −0.740 | 0.017 | 0.796 |
| | −0.657 | −0.085 | 0.201 |
| | −0.493 | −0.084 | 0.212 |
| | 0.493 | 0.225 | 0.001 |
| | −0.433 | −0.039 | 0.559 |
| | 0.404 | −0.064 | 0.336 |
| | −0.393 | −0.219 | 0.001 |
| | 0.301 | 0.093 | 0.165 |
| | 0.289 | 0.323 | 0.000 |
| | −0.278 | −0.120 | 0.071 |
| | 0.237 | 0.225 | 0.001 |
| | 0.237 | 0.075 | 0.264 |
| | −0.153 | −0.088 | 0.189 |
Figure 2. AE and CVE - genetic space.
The apparent error rate (AE) and the cross-validation error (CVE) in different feature subspaces of the genetic space.
The confusion matrices (see Appendix S1, equation 11) for combined phenotypic and genetic subspaces of four different dimensionalities.

| 89 | 23 |
| 24 | 89 |

| 104 | 8 |
| 8 | 105 |

| 110 | 2 |
| 2 | 111 |

| 95 | 17 |
| 20 | 93 |
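As a rough illustration of how such matrices translate into an error rate, the sketch below sums the off-diagonal counts and divides by the total. It assumes that the main diagonal holds the correctly classified patients (the row and column labels are not shown above, so this is an assumption), and the names `CONFUSION_MATRICES` and `error_rate` are made up for this example.

```python
# Confusion matrices in the order listed above. The class labels are not
# available here, so none are assumed; only the diagonal/off-diagonal split
# is used.
CONFUSION_MATRICES = [
    [[89, 23], [24, 89]],
    [[104, 8], [8, 105]],
    [[110, 2], [2, 111]],
    [[95, 17], [20, 93]],
]


def error_rate(matrix):
    """Fraction of counts off the main diagonal of a square confusion matrix.

    Assumes the diagonal entries are the correctly classified cases.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


totals = [sum(sum(row) for row in m) for m in CONFUSION_MATRICES]
rates = [round(error_rate(m), 3) for m in CONFUSION_MATRICES]
print(totals)  # [225, 225, 225, 225]
print(rates)   # [0.209, 0.071, 0.018, 0.164]
```

Two of the resulting rates, 0.209 and 0.018, coincide with CPL entries in the table for the combined phenotypic and genetic space (no selection and RLS, respectively), which is consistent with reading the diagonal as correct classifications.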
Figure 4. The diagnostic map.
Linear separation of the high CRP from the low CRP patients for the cohort of incident dialysis patients in the optimal feature subspace of the phenotypic and genetic space.
The cross validation error CVE (mean ± SD) for different classifiers in the phenotypic space and its subspaces obtained by using five feature selection methods (RLS, ReliefF, CFS-FS, mSVM-RFE, MRMR) and five classifiers (RF, KNN, SVM, NBC, CPL); see Section “Alternative methods for feature selection and classification”.

| Feature selection method | Number of features | RF | KNN | SVM | NBC | CPL |
|---|---|---|---|---|---|---|
| No selection | 57 | 0.231 | 0.329 | 0.258 | 0.302 | 0.258 |
| ReliefF | | 0.173 (25) | 0.240 (28) | 0.160 (26) | 0.240 (3) | 0.156 (26) |
| CFS-FS | 15 | 0.218 | 0.196 | 0.178 | 0.267 | 0.191 |
| mSVM-RFE | 26 | 0.200 | 0.338 | 0.151 | 0.231 | 0.178 |
| MRMR | | 0.182 (30) | 0.182 (12) | 0.169 (11) | 0.240 (21) | 0.173 (8) |
| RLS | 21 | 0.191 | 0.311 | 0.156 | 0.280 | 0.138 |

ReliefF and MRMR are ranking procedures. The optimal sets of features for these two methods were determined for each classifier separately; the number of features (shown in parentheses) corresponds to the size of the subset of features characterized by the smallest cross validation error for the specific classifier.
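The note above describes choosing, for a ranking procedure, the subset size with the smallest cross-validation error for a given classifier. A minimal sketch of that selection step follows. The function name `best_top_k`, the tie-breaking rule, the feature names, and the toy error values are all hypothetical and serve only to illustrate the idea; a real analysis would plug in an actual cross-validation routine for the chosen classifier.

```python
from typing import Callable, Sequence, Tuple


def best_top_k(
    ranking: Sequence[str],
    cv_error: Callable[[Sequence[str]], float],
) -> Tuple[int, float]:
    """Return (k, error) for the top-k prefix of `ranking` with the lowest error.

    Ties are resolved in favour of the smaller subset (an assumption made
    for this sketch, not something stated in the text).
    """
    best_k, best_err = 0, float("inf")
    for k in range(1, len(ranking) + 1):
        err = cv_error(ranking[:k])
        if err < best_err:  # strict '<' keeps the smallest k on ties
            best_k, best_err = k, err
    return best_k, best_err


# Hypothetical ranked feature names and a toy stand-in for cross-validation:
# the error depends only on the subset size and is invented for illustration.
ranking = ["f1", "f2", "f3", "f4", "f5"]
toy_errors = {1: 0.40, 2: 0.31, 3: 0.24, 4: 0.24, 5: 0.29}

k, err = best_top_k(ranking, lambda subset: toy_errors[len(subset)])
print(k, err)  # 3 0.24
```

Running this once per classifier would give a separate subset size for each one, which is how the parenthesised counts in the table can differ across columns.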
The cross validation error CVE (mean ± SD) for different classifiers in the phenotypic and genetic space and its subspaces obtained by using five feature selection methods (RLS, ReliefF, CFS-FS, mSVM-RFE, MRMR) and five classifiers (RF, KNN, SVM, NBC, CPL); see Section “Alternative methods for feature selection and classification”.

| Feature selection method | Number of features | RF | KNN | SVM | NBC | CPL |
|---|---|---|---|---|---|---|
| No selection | 285 | 0.293 | 0.382 | 0.218 | 0.293 | 0.209 |
| ReliefF | | 0.191 (80) | 0.240 (2) | 0.187 (54) | 0.200 (16) | 0.213 (61) |
| CFS-FS | 15 | 0.218 | 0.196 | 0.178 | 0.267 | 0.191 |
| mSVM-RFE | 153 | 0.262 | 0.382 | 0.156 | 0.302 | 0.182 |
| MRMR | | 0.160 (25) | 0.267 (1) | 0.129 (44) | 0.213 (27) | 0.156 (39) |
| RLS | 60 | 0.231 | 0.378 | 0.018 | 0.258 | 0.018 |

ReliefF and MRMR are ranking procedures. The optimal sets of features for these two methods were determined for each classifier separately; the number of features (shown in parentheses) corresponds to the size of the subset of features characterized by the smallest cross validation error for the specific classifier.
The cross validation error CVE (mean ± SD) for different classifiers in the genetic space and its subspaces obtained by using five feature selection methods (RLS, ReliefF, CFS-FS, mSVM-RFE, MRMR) and five classifiers (RF, KNN, SVM, NBC, CPL); see Section “Alternative methods for feature selection and classification”.

| Feature selection method | Number of features | RF | KNN | SVM | NBC | CPL |
|---|---|---|---|---|---|---|
| No selection | 228 | 0.502 | 0.436 | 0.444 | 0.493 | 0.462 |
| ReliefF | | 0.338 (22) | 0.293 (76) | 0.347 (82) | 0.369 (26) | 0.369 (39) |
| CFS-FS | 3 | 0.458 | 0.427 | 0.427 | 0.422 | 0.427 |
| mSVM-RFE | 140 | 0.48 | 0.342 | 0.356 | 0.458 | 0.378 |
| MRMR | | 0.347 (21) | 0.333 (70) | 0.280 (38) | 0.280 (21) | 0.276 (25) |
| RLS | 81 | 0.489 | 0.418 | 0.338 | 0.418 | 0.169 |

ReliefF and MRMR are ranking procedures. The optimal sets of features for these two methods were determined for each classifier separately; the number of features (shown in parentheses) corresponds to the size of the subset of features characterized by the smallest cross validation error for the specific classifier.