Talayeh Razzaghi, Oleg Roderick, Ilya Safro, Nicholas Marko.
Abstract
This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: it is noisy, has missing entries, and is imbalanced in the classes of interest, all of which lead to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data from health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results.
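As a rough illustration of what imputation by iterated regression can look like, the Python sketch below fills the missing entries of one variable by repeatedly regressing it on a fully observed one. It is a deliberately simplified, single-predictor stand-in and not the imputation method evaluated in the paper: a full expectation maximization treatment would also track the uncertainty of the imputed values, which is omitted here. The function names `fit_line` and `impute_by_iterated_regression` and the toy data are assumptions made for this example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def impute_by_iterated_regression(xs, ys, n_iter=50):
    """Fill the None entries of ys by alternating regression and re-imputation.

    Simplified sketch: one fully observed predictor (xs), one partially
    observed variable (ys), no regularization, no covariance correction.
    """
    missing = [i for i, y in enumerate(ys) if y is None]
    # Start from the mean of the observed values.
    start = mean([y for y in ys if y is not None])
    filled = [start if y is None else y for y in ys]
    for _ in range(n_iter):
        a, b = fit_line(xs, filled)          # regress on the completed data
        for i in missing:
            filled[i] = a + b * xs[i]        # refresh only the missing entries
    return filled


# Hypothetical toy data: the observed points lie on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 3.0, 5.0, None, None, 11.0, 13.0, 15.0]
completed = impute_by_iterated_regression(xs, ys)
print([round(v, 3) for v in completed])
```

On this toy input the two imputed entries should approach 7 and 9, the values implied by the line through the observed points, while the observed entries are left untouched.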
Year: 2016 PMID: 27195952 PMCID: PMC4873242 DOI: 10.1371/journal.pone.0155119
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The multilevel SVM framework consists of three phases: gradual training set coarsening, coarsest support vectors’ learning, and gradual support vectors’ refinement (uncoarsening).
Pairs of AkNN graphs correspond to two classes of learning.
Confusion matrix.
| | Actual positive class | Actual negative class |
|---|---|---|
| Predicted positive class | True Positive (TP) | False Positive (FP) |
| Predicted negative class | False Negative (FN) | True Negative (TN) |
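The measures reported in the result tables (SN, SP, G-mean, ACC) can be derived from these four counts. The sketch below uses the standard definitions of sensitivity, specificity, their geometric mean, and accuracy; the function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def g_mean(sn, sp):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sn * sp)


def classification_metrics(tp, fp, fn, tn):
    """Return SN, SP, G-mean and ACC from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sn = tp / (tp + fn)                      # sensitivity: recall on positives
    sp = tn / (tn + fp)                      # specificity: recall on negatives
    acc = (tp + tn) / (tp + fp + fn + tn)    # overall accuracy
    return {"SN": sn, "SP": sp, "G-mean": g_mean(sn, sp), "ACC": acc}


# Hypothetical imbalanced problem: 100 positives, 900 negatives.
balanced_model = classification_metrics(tp=90, fp=270, fn=10, tn=630)
# A degenerate model that labels everything as negative.
trivial_model = classification_metrics(tp=0, fp=0, fn=100, tn=900)

print(balanced_model)
print(trivial_model)
```

The degenerate model reaches an accuracy of 0.9 while its G-mean is 0, which is why G-mean is the more informative measure when classes are imbalanced. As a consistency check against a reported row, `g_mean(0.8903, 0.6345)` rounds to 0.7516, the G-mean listed for Adaptive LR.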
Public data sets.
| Dataset | Imbalance ratio | No. of features | No. of points | \|C+\| | \|C−\| |
|---|---|---|---|---|---|
| Twonorm | 0.50 | 20 | 7400 | 3703 | 3697 |
| Letter26 | 0.96 | 16 | 20000 | 734 | 19266 |
| Ringnorm | 0.50 | 20 | 7400 | 3664 | 3736 |
| Cod-rna | 0.67 | 8 | 59535 | 19845 | 39690 |
| Clean (Musk) | 0.85 | 166 | 6598 | 1017 | 5581 |
| Advertisement | 0.86 | 1558 | 3279 | 459 | 2820 |
| Nursery | 0.67 | 8 | 12960 | 4320 | 8640 |
| Hypothyroid | 0.94 | 21 | 3919 | 240 | 3679 |
| Buzz | 0.80 | 77 | 140707 | 27775 | 112932 |
| Forest | 0.98 | 54 | 581012 | 9493 | 571519 |
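The first numeric column of this table appears to be the share of the larger class among all points. That reading is an assumption, but it reproduces every listed value to two decimals, as the following short check suggests. The helper name `majority_fraction` is hypothetical.

```python
# (|C+|, |C-|, ratio as listed in the table) for each public dataset
DATASETS = {
    "Twonorm": (3703, 3697, 0.50),
    "Letter26": (734, 19266, 0.96),
    "Ringnorm": (3664, 3736, 0.50),
    "Cod-rna": (19845, 39690, 0.67),
    "Clean (Musk)": (1017, 5581, 0.85),
    "Advertisement": (459, 2820, 0.86),
    "Nursery": (4320, 8640, 0.67),
    "Hypothyroid": (240, 3679, 0.94),
    "Buzz": (27775, 112932, 0.80),
    "Forest": (9493, 571519, 0.98),
}


def majority_fraction(n_pos, n_neg):
    """Share of the larger class among all points (assumed definition)."""
    return max(n_pos, n_neg) / (n_pos + n_neg)


for name, (n_pos, n_neg, listed) in DATASETS.items():
    computed = majority_fraction(n_pos, n_neg)
    print(f"{name:14s} computed={computed:.4f} listed={listed:.2f}")
```

A value near 0.5 indicates balanced classes (Twonorm, Ringnorm), while values close to 1 (Letter26, Hypothyroid, Forest) indicate that the minority class is only a small fraction of the data.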
Comparative G-mean results for ML(W)SVM against the regular SVM, WSVM, NB, C4.5, 5NN, and LR on academic datasets for different fractions of missing values (r) using the REM imputation method.
| Dataset | MLSVM | MLWSVM | SVM | WSVM | C4.5 | 5NN | NB | LR | |
|---|---|---|---|---|---|---|---|---|---|
| Twonorm | 5% | 0.86 | 0.97 | ||||||
| 10% | 0.97 | 0.97 | 0.87 | 0.97 | 0.97 | 0.97 | |||
| 20% | 0.88 | 0.97 | 0.97 | ||||||
| 40% | 0.97 | 0.97 | 0.97 | 0.97 | 0.89 | 0.97 | |||
| Letter | 5% | 0.97 | 0.99 | 0.99 | 0.97 | 0.98 | 0.86 | 0.81 | |
| 10% | 0.98 | 0.98 | 0.99 | 0.98 | 0.98 | 0.86 | 0.80 | ||
| 20% | 0.99 | 0.99 | 0.97 | 0.98 | 0.87 | 0.80 | |||
| 40% | 0.95 | 0.97 | 0.96 | 0.97 | 0.98 | 0.88 | 0.83 | ||
| Ringnorm | 5% | 0.97 | 0.98 | 0.97 | 0.98 | 0.91 | 0.61 | 0.76 | |
| 10% | 0.98 | 0.98 | 0.91 | 0.62 | 0.98 | 0.76 | |||
| 20% | 0.97 | 0.91 | 0.62 | 0.76 | |||||
| 40% | 0.97 | 0.91 | 0.62 | 0.76 | |||||
| Cod-rna | 5% | 0.95 | 0.95 | 0.92 | 0.66 | 0.93 | |||
| 10% | 0.95 | 0.95 | 0.95 | 0.91 | 0.66 | 0.92 | |||
| 20% | 0.95 | 0.95 | 0.95 | 0.94 | 0.91 | 0.67 | 0.92 | ||
| 40% | 0.93 | 0.90 | 0.68 | 0.91 | |||||
| Clean | 5% | 0.99 | 0.98 | 0.83 | 0.92 | 0.79 | 0.89 | ||
| 10% | 0.99 | 0.99 | 0.83 | 0.91 | 0.79 | 0.89 | |||
| 20% | 0.83 | 0.91 | 0.79 | 0.89 | |||||
| 40% | 0.82 | 0.92 | 0.79 | 0.89 | |||||
| Advertisement | 5% | 0.87 | 0.87 | 0.87 | 0.87 | 0.81 | 0.60 | 0.82 | |
| 10% | 0.86 | 0.86 | 0.86 | 0.85 | 0.62 | 0.82 | |||
| 20% | 0.83 | 0.85 | 0.83 | 0.85 | 0.83 | 0.61 | 0.83 | ||
| 40% | 0.84 | 0.86 | 0.87 | 0.81 | 0.85 | 0.62 | 0.82 | ||
| Nursery | 5% | 0.99 | 0.99 | 0.00 | |||||
| 10% | 0.99 | 0.99 | 0.00 | ||||||
| 20% | 0.96 | 0.96 | 0.00 | ||||||
| 40% | 0.92 | 0.92 | 0.99 | 0.46 | |||||
| Hypothyroid | 5% | 0.83 | 0.87 | 0.81 | 0.87 | 0.96 | 0.76 | 0.88 | |
| 10% | 0.85 | 0.86 | 0.78 | 0.86 | 0.76 | 0.89 | |||
| 20% | 0.84 | 0.86 | 0.72 | 0.86 | 0.96 | 0.75 | 0.90 | ||
| 40% | 0.86 | 0.88 | 0.84 | 0.88 | 0.96 | 0.76 | 0.89 | ||
| Buzz | 5% | 0.93 | 0.89 | ||||||
| 10% | 0.93 | 0.89 | |||||||
| 20% | 0.92 | 0.93 | 0.93 | 0.88 | 0.93 | ||||
| 40% | 0.93 | 0.93 | 0.93 | 0.93 | 0.86 | ||||
| Forest | 5% | 0.90 | 0.90 | 0.87 | 0.80 | 0.00 | |||
| 10% | 0.92 | 0.91 | 0.92 | 0.88 | 0.85 | 0.78 | 0.00 | ||
| 20% | 0.91 | 0.90 | 0.91 | 0.89 | 0.84 | 0.77 | 0.00 | ||
| 40% | 0.88 | 0.89 | 0.88 | 0.85 | 0.82 | 0.73 | 0.00 | ||
| # of bold values | 13 | 22 | 13 | 22 | 13 | 4 | 9 | 10 |
Computational time in seconds (not including the REM method).
| Dataset | MLSVM | SVM | MLWSVM | WSVM |
|---|---|---|---|---|
| Twonorm | 5 | 28 | 5 | 28 |
| Letter | 30 | 138 | 32 | 139 |
| Ringnorm | 4 | 25 | 4 | 26 |
| Cod-rna | 266 | 1831 | 281 | 1857 |
| Clean | 17 | 95 | 15 | 82 |
| Advertisement | 98 | 227 | 100 | 231 |
| Nursery | 25 | 187 | 31 | 192 |
| Hypothyroid | 2 | 3 | 2 | 3 |
| Buzz | 2209 | 25257 | 2999 | 26026 |
| Forest | 13328 | 352500 | 13360 | 353210 |
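To make the timing comparison easier to read, the sketch below computes the speedup of each multilevel variant over its single-level counterpart directly from the seconds listed above. The dictionary layout and the function name `speedups` are assumptions for this example; only the timings come from the table.

```python
# Seconds from the table, per dataset: (MLSVM, SVM, MLWSVM, WSVM)
TIMES = {
    "Twonorm": (5, 28, 5, 28),
    "Letter": (30, 138, 32, 139),
    "Ringnorm": (4, 25, 4, 26),
    "Cod-rna": (266, 1831, 281, 1857),
    "Clean": (17, 95, 15, 82),
    "Advertisement": (98, 227, 100, 231),
    "Nursery": (25, 187, 31, 192),
    "Hypothyroid": (2, 3, 2, 3),
    "Buzz": (2209, 25257, 2999, 26026),
    "Forest": (13328, 352500, 13360, 353210),
}


def speedups(times):
    """Return {dataset: (SVM / MLSVM, WSVM / MLWSVM)} as time ratios."""
    return {
        name: (svm / mlsvm, wsvm / mlwsvm)
        for name, (mlsvm, svm, mlwsvm, wsvm) in times.items()
    }


for name, (plain, weighted) in speedups(TIMES).items():
    print(f"{name:14s} SVM/MLSVM={plain:6.1f}x  WSVM/MLWSVM={weighted:6.1f}x")
```

On these numbers the ratio grows with dataset size: it is 1.5 for Hypothyroid, the smallest set, and above 26 for Forest, the largest.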
Healthcare datasets.
The set “Example 1” has 10000 observations in each class. In the set “Example 2”, the majority and minority classes contain 50400 and 33600 observations, respectively. For details about the data, see [8].
| Data | No. of features | No. of observations | No. of classes |
|---|---|---|---|
| Example 1 | 16 | 50000 | 5 |
| Example 2 | 13 | 84000 | 2 |
Accuracy on the financial risk problem with five risk classes (Example 1) using the REM imputation method.
| Class | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| LR | 0.58 | 0.54 | 0.53 | 0.51 | 0.59 |
| MLSVM | 0.83 | 0.78 | 0.77 | 0.78 | 0.90 |
| MLWSVM | 0.86 | 0.76 | 0.76 | 0.77 | 0.91 |
Comparison of Multilevel WSVM against Multilevel SVM and Adaptive Logistic Regression (LR) using the REM imputation method.
Improved results are in bold.
| Method | G-mean | SN | SP | ACC |
|---|---|---|---|---|
| Adaptive LR | 0.7516 | 0.8903 | 0.6345 | 0.7619 |
| MLSVM | ||||
| MLWSVM |
Sensitivity, specificity, and G-mean on the financial risk problem with five risk classes (Example 1) using ML(W)SVM and the REM imputation method.
| Class | Multilevel SVM: SN | Multilevel SVM: SP | Multilevel SVM: G-mean | Multilevel WSVM: SN | Multilevel WSVM: SP | Multilevel WSVM: G-mean |
|---|---|---|---|---|---|---|
| Class 1 | 0.86 | 0.73 | 0.79 | 0.89 | 0.74 | 0.81 |
| Class 2 | 0.89 | 0.34 | 0.55 | 0.86 | 0.36 | 0.56 |
| Class 3 | 0.89 | 0.28 | 0.50 | 0.88 | 0.29 | 0.50 |
| Class 4 | 0.88 | 0.40 | 0.60 | 0.87 | 0.40 | 0.58 |
| Class 5 | 0.96 | 0.69 | 0.81 | 0.96 | 0.70 | 0.82 |