Runzhi Li, Wei Liu, Yusong Lin, Hongling Zhao, Chaoyang Zhang.
Abstract
It is important to identify and prevent disease risks as early as possible through regular physical examinations. We formulate disease risk prediction as a multilabel classification problem and propose a novel Ensemble Label Power-set Pruned datasets Joint Decomposition (ELPPJD) method. First, we transform the multilabel classification into a multiclass classification. Then, we propose pruned-dataset and joint-decomposition methods to address the imbalanced learning problem. Two strategies, size balanced (SB) and label similarity (LS), are designed to decompose the training dataset. In the experiments, the dataset comes from real physical examination records. We compare the performance of the ELPPJD method under the two decomposition strategies, and also compare ELPPJD against the classic multilabel classification methods RAkEL and HOMER. The experimental results show that the ELPPJD method with the label similarity strategy achieves outstanding performance.
Year: 2017 PMID: 29065647 PMCID: PMC5494772 DOI: 10.1155/2017/8051673
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
Multilabel physical examination records description.
| Records | Disease a | Disease b | Disease c | Disease d |
|---|---|---|---|---|
| | ∗ | ∗ | ∗ | |
| | ∗ | ∗ | | |
| | ∗ | ∗ | | |
| … | | | | |
| | ∗ | ∗ | ∗ | |

∗ marks the diseases associated with each physical examination record.
Figure 1: Enumeration for reassembling labels.
Algorithm 1: Combination label transformation.
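Algorithm 1 appears only as an image in the source. A minimal Python sketch of the label power-set idea it implements follows: each record's set of disease labels is mapped to a single combination-label string, so every distinct label combination becomes one class of a multiclass problem. Function and variable names here are illustrative, not from the paper.

```python
def powerset_transform(label_sets):
    """Map each record's set of disease labels to one combination-label
    string (label power-set), turning multilabel into multiclass."""
    # Sorting inside each set gives a canonical class name per combination.
    return [''.join(sorted(s)) for s in label_sets]

# Records mirroring the "Example of label combination" table.
records = [{'a', 'c', 'd'}, {'a', 'b'}, {'b', 'c'}, {'b', 'c', 'd'}]
print(powerset_transform(records))  # ['acd', 'ab', 'bc', 'bcd']
```

Any standard multiclass learner (LIBSVM, random forest) can then be trained on these combination labels, which is the starting point of ELPPJD.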
Notation of variables and parameters.
| Notation | Denotation |
|---|---|
| | The training dataset |
| | Set of class labels |
| | The dataset associated with the combination labels |
| | The association for the |
| | The |
| | Datasets associated with class label |
| | Threshold for infrequent records |
| | Training subdatasets obtained by decomposition |
| | A hash function |
| | The similarity matrix |
| | The similarity threshold |
| | The number of label sets |
| | The partition of label sets |
Example of label combination.
| Physical records | Combination labels |
|---|---|
| | acd |
| | ab |
| | bc |
| … | … |
| | bcd |
Example of label decomposition.
| Physical records | Combination labels | Decomposition labels |
|---|---|---|
| | acd | ac |
Algorithm 2: Decomposition of datasets.
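Algorithm 2 is likewise an image in the source. The sketch below illustrates one plausible reading of pruning with decomposition, consistent with the label-decomposition example above (acd → ac): combination labels rarer than a frequency threshold are not discarded but remapped to their largest, most frequent subset. The helper names and the tie-breaking rule are assumptions, not the paper's exact procedure.

```python
from collections import Counter
from itertools import combinations

def prune_and_decompose(comb_labels, threshold):
    """Replace infrequent combination labels with their largest frequent
    subset instead of dropping the records (pruned-dataset decomposition)."""
    freq = Counter(comb_labels)
    frequent = {c for c, n in freq.items() if n >= threshold}

    def best_subset(label):
        # Try subsets from largest to smallest; prefer the most frequent.
        for size in range(len(label) - 1, 0, -1):
            candidates = [''.join(c) for c in combinations(label, size)
                          if ''.join(c) in frequent]
            if candidates:
                return max(candidates, key=lambda c: freq[c])
        return None  # no frequent subset exists; the record would be dropped

    return [c if c in frequent else best_subset(c) for c in comb_labels]

data = ['ab'] * 5 + ['ac'] * 4 + ['acd']           # 'acd' occurs only once
print(prune_and_decompose(data, threshold=2)[-1])  # 'acd' is remapped to 'ac'
```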
Figure 2: Example of label sets partition.
Algorithm 3: Label sets dividing algorithm.
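Algorithm 3 is also rendered as an image. A rough sketch of the label-similarity (LS) idea follows: compute pairwise similarity between combination labels viewed as label sets, and group labels whose similarity reaches a threshold into the same training subdataset. Jaccard similarity and the greedy seed-based grouping are assumptions standing in for the paper's similarity matrix and dividing procedure.

```python
def jaccard(a, b):
    """Similarity between two combination labels viewed as label sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def partition_by_similarity(labels, threshold):
    """Greedily place each combination label into the first group whose
    seed (first) label is similar enough; otherwise start a new group."""
    groups = []
    for lab in labels:
        for g in groups:
            if jaccard(lab, g[0]) >= threshold:
                g.append(lab)
                break
        else:
            groups.append([lab])
    return groups

# 'acd', 'ac', and 'bcd' share enough labels to land in one group.
print(partition_by_similarity(['acd', 'ac', 'ab', 'bc', 'bcd'], threshold=0.5))
```

A higher threshold yields more, smaller subdatasets; the size-balanced (SB) strategy would instead split purely by subset size.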
Description of the multilabel training dataset in the experiments.
| Data set | Training records | Test records | Attributes | Single labels | Combination labels | Label density | Label cardinality |
|---|---|---|---|---|---|---|---|
| Physical records | 99,270 | 11,030 | 62 | 6 | 64 | 0.336 | 2.015 |
Figure 3: Optimal hyperparameters selected in LIBSVM.
Random forest classifier parameter tuning on the partitioned training subsets.
| NumFeatures | NumTrees | Accuracy | Out-of-bag error | Time (s) |
|---|---|---|---|---|
| 10 | 30 | 0.9116 | 0.1483 | 2.22 |
| 10 | 40 | 0.9118 | 0.1473 | 2.94 |
| 10 | 50 | 0.9126 | 0.1466 | 3.78 |
| 10 | 60 | 0.9136 | 0.146 | 4.51 |
| 15 | 30 | 0.9175 | 0.1394 | 2.88 |
| 15 | 40 | 0.9167 | 0.1387 | 3.88 |
| 15 | 50 | 0.9185 | 0.138 | 4.99 |
| 15 | 60 | 0.9185 | 0.1377 | 5.84 |
| 15 | 70 | 0.9195 | 0.1373 | 6.84 |
| 15 | 80 | 0.9197 | 0.1373 | 7.83 |
| 15 | 100 | 0.9185 | 0.1369 | 9.99 |
| 20 | 30 | 0.9150 | 0.137 | 3.69 |
| 20 | 50 | 0.9189 | 0.1354 | 6.25 |
| 20 | 70 | 0.9186 | 0.1346 | 8.76 |
| 30 | 40 | 0.9173 | 0.1347 | 7.27 |
| 30 | 60 | 0.9170 | 0.1342 | 10.93 |
| 40 | 40 | 0.9190 | 0.1343 | 9.09 |
| 40 | 50 | 0.9195 | 0.1338 | 11.36 |
| 40 | 60 | 0.9202 | 0.1334 | 13.84 |
Confusion matrix of ELPPJD_LS based on LIBSVM (rows: real class; columns: prediction).
| Real class \ Prediction | 001011 | 001100 | 010010 | 100001 | 100110 | 111000 | 111111 |
|---|---|---|---|---|---|---|---|
| 001011 | 4048 | 105 | 61 | 127 | 4 | 2 | 89 |
| 001100 | 24 | 2622 | 167 | 8 | 79 | 66 | 33 |
| 010010 | 22 | 97 | 407 | 2 | 5 | 30 | 7 |
| 100001 | 63 | 3 | 7 | 1038 | 31 | 4 | 36 |
| 100110 | 0 | 30 | 16 | 7 | 707 | 19 | 0 |
| 111000 | 0 | 22 | 26 | 3 | 10 | 374 | 5 |
| 111111 | 15 | 0 | 6 | 19 | 1 | 6 | 405 |
RAkEL parameter tuning.
| Metrics | RAkEL_k3_m15 | RAkEL_k4_m10 | RAkEL_k5_m4 |
|---|---|---|---|
| Avg accuracy | 0.583 | 0.484 | 0.547 |
| Precision (micro) | 0.543 | 0.544 | 0.577 |
| Recall (micro) | NaN | NaN | NaN |
| F1 (micro) | 0.744 | 0.689 | 0.712 |
| Precision (macro) | NaN | NaN | NaN |
| Recall (macro) | NaN | NaN | NaN |
| F1 (macro) | 0.575 | 0.578 | 0.558 |
HOMER parameter tuning.
| Metrics | HOMER_RF_k2 | HOMER_RF_k3 | HOMER_RF_k4 | HOMER_RF_k5 | HOMER_RF_k6 |
|---|---|---|---|---|---|
| Avg accuracy | 0.4755 | 0.5043 | 0.5079 | 0.5152 | 0.5128 |
| Precision (micro) | 0.5483 | 0.5987 | 0.6336 | 0.6639 | 0.6694 |
| Recall (micro) | 0.758 | 0.7761 | 0.7030 | 0.6767 | 0.6639 |
| F1 (micro) | 0.6363 | 0.6759 | 0.6665 | 0.6702 | 0.6666 |
| Precision (macro) | 0.5337 | 0.6042 | 0.596 | 0.6045 | 0.6125 |
| Recall (macro) | 0.7442 | 0.7688 | 0.6706 | 0.6409 | 0.6318 |
| F1 (macro) | 0.5957 | 0.6489 | 0.6070 | 0.5936 | 0.5927 |
Performance evaluation for different multilabel methods.
| Metrics | ELPPJD_SB | ELPPJD_LS | RAkEL_C4.5 | HOMER_RF |
|---|---|---|---|---|
| Avg accuracy | 0.516 | 0.8859 | 0.583 | 0.5152 |
| Precision (micro) | 0.516 | 0.8859 | 0.543 | 0.6639 |
| Recall (micro) | 0.516 | 0.8859 | NaN | 0.6767 |
| F1 (micro) | 0.516 | 0.8859 | 0.744 | 0.6702 |
| Precision (macro) | 0.52 | 0.8082 | NaN | 0.6045 |
| Recall (macro) | 0.5046 | 0.8603 | NaN | 0.6409 |
| F1 (macro) | 0.5122 | 0.8334 | 0.575 | 0.5936 |