Belisario Panay, Nelson Baloian, José A Pino, Sergio Peñafiel, Horacio Sanson, Nicolas Bersano.
Abstract
Although many authors have highlighted the importance of predicting people's health costs to improve healthcare budget management, most do not address the frequent need to know the reasons behind a prediction, i.e., the factors that influence it. This knowledge helps avoid arbitrariness and discrimination against individuals. However, black box methods (those that do not allow this analysis, e.g., methods based on deep learning) are often more accurate than methods whose results can be interpreted. For this reason, in this work we aim to develop a method for predicting health costs that achieves performance similar to that of black box methods while still allowing the results to be interpreted. This interpretable regression method is based on Dempster-Shafer theory, using Evidential Regression (EVREG) and a discount function based on the contribution of each dimension. The method "learns" the optimal weight for each feature using gradient descent, and it uses the k-nearest neighbors (k-NN) algorithm to accelerate calculations. With this approach and the transparency of the Evidential Regression model, it is possible to select the most relevant features for predicting a patient's health care costs, and the k-NN approach provides a reason for each prediction. We tested our method on Japanese health records from Tsuyama Chuo Hospital, which included medical examinations, test results, and billing information from 2013 to 2018. We compared our model to methods based on an Artificial Neural Network, Gradient Boosting, Regression Tree, and Weighted k-Nearest Neighbors. Our transparent model performed similarly to the Artificial Neural Network and Gradient Boosting, with an R² of 0.44.
Keywords: dempster–shafer theory; evidential regression; feature selection; health care costs; interpretable prediction; regression; supervised learning
Year: 2020 PMID: 32781680 PMCID: PMC7472302 DOI: 10.3390/s20164392
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
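The abstract describes a prediction built from the evidence of the k nearest neighbors, with a learned weight per feature and a distance-based discount. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a Gaussian discount `alpha * exp(-gamma * d**2)`, a feature-weighted Euclidean distance, neighbor targets that are all distinct, and a prediction that places the leftover "ignorance" mass on the midpoint of the training target range. The names `weighted_distance`, `evidential_predict`, `alpha`, and `gamma` are hypothetical, and the gradient descent step that would learn `weights` is omitted.

```python
import math

def weighted_distance(a, b, weights):
    """Feature-weighted Euclidean distance (the weights would be learned)."""
    return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, a, b)))

def evidential_predict(query, X, y, weights, k=3, gamma=1.0, alpha=0.95):
    """Combine the evidence of the k nearest neighbours into one prediction.

    Assumption: neighbour targets are distinct, so two neighbours that both
    commit to their own target are in conflict (Dempster's rule renormalises).
    """
    # Keep only the k nearest training points.
    neighbours = sorted(
        (weighted_distance(query, xi, weights), yi) for xi, yi in zip(X, y)
    )[:k]
    # Discount: a neighbour supports its own target with mass phi < 1 and
    # leaves 1 - phi on "any value in the target range".
    phis = [alpha * math.exp(-gamma * d ** 2) for d, _ in neighbours]
    doubt = [1.0 - p for p in phis]
    # Dempster's rule for these simple mass functions: neighbour i wins when
    # it commits and every other neighbour stays uncommitted.
    masses = [
        p * math.prod(doubt[:i] + doubt[i + 1:]) for i, p in enumerate(phis)
    ]
    ignorance = math.prod(doubt)
    total = sum(masses) + ignorance  # one minus the conflicting mass
    masses = [m / total for m in masses]
    ignorance /= total
    # Point prediction: neighbour targets weighted by their mass; the
    # ignorance mass is placed on the midpoint of the training target range.
    midpoint = (min(y) + max(y)) / 2.0
    targets = [yi for _, yi in neighbours]
    prediction = sum(m * t for m, t in zip(masses, targets)) + ignorance * midpoint
    # The (mass, target) pairs are the "reason" behind the prediction.
    return prediction, list(zip(masses, targets)), ignorance

# Toy usage: the second feature has weight 0, so it is ignored.
X = [[0.0, 7.0], [1.0, 2.0], [10.0, 5.0]]
y = [1.0, 3.0, 100.0]
pred, evidence, ignorance = evidential_predict([0.0, 0.0], X, y, [1.0, 0.0], k=2)
print(pred, evidence, ignorance)
```

The returned `(mass, target)` pairs show how much each neighbor contributed, which is one way to read the interpretability claim. A feature whose weight is driven toward zero no longer affects the distance, which is how weights could double as a feature-selection signal.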
Insurance claims.
| Header | Name | Description |
|---|---|---|
| IR | Medical institution | Details of the medical institution. |
| RE | Insured details | Patient details with dates and demographics. |
| HO | Insurer details | Patient insurer information. |
| KO | Public expenses | Patient public expense information. |
| KH | Special information | Patient special information (free text). |
| SY | Diagnosis | Patient diagnosis in MHLW coding. |
| SI | Procedure | Details for a patient treatment. |
| IY | Medications | Details for the medicines given. |
| TO | Specific equipment | Details of specific equipment used for a patient. |
| CO | Comment | Comments for diagnoses or symptoms (free text). |
| SJ | Symptoms | Patient symptoms. |
Statistics of patients’ records in each scenario.
| Statistics | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| Total number of patients | 71,001 | 33,646 | 8,810 |
| Mean costs | 11,030 | 11,536 | 12,420 |
| Mean age | 54.00 | 58.00 | 63.00 |
| % Male | 48.81 | 48.54 | 48.56 |
| % Female | 51.19 | 51.46 | 51.44 |
Figure 1. Radial Basis Function for different values.
Figure 2. Time complexity of a single prediction.
Figure 3. Feature transformation.
Synthetic regression datasets.
| Name | Samples | Relevant Features | Total Features |
|---|---|---|---|
| Linear Regression | 200 | 5 | 500 |
| Friedman | 200 | 5 | 500 |
| Linear Regression 5k | 5000 | 5 | 500 |
| Friedman 5k | 5000 | 5 | 500 |
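As an illustration of how a dataset with 5 relevant features out of 500 can be built, here is a minimal sketch. It assumes that "Friedman" refers to the Friedman #1 benchmark function, which the source does not confirm. The noise level, the seed, and the function name `make_friedman` are assumptions.

```python
import math
import random

def make_friedman(n_samples=200, n_features=500, noise=1.0, seed=0):
    """Friedman #1 data: only the first 5 features influence the target."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.random() for _ in range(n_features)]  # uniform on [0, 1)
        target = (
            10 * math.sin(math.pi * row[0] * row[1])
            + 20 * (row[2] - 0.5) ** 2
            + 10 * row[3]
            + 5 * row[4]
            + rng.gauss(0.0, noise)  # assumed Gaussian noise
        )
        X.append(row)
        y.append(target)
    return X, y

# Sizes follow the "Friedman" row of the table above.
X, y = make_friedman(n_samples=200, n_features=500)
print(len(X), len(X[0]), len(y))
```

Because the remaining 495 columns are pure noise, a feature-weighting method can be checked by whether it assigns its largest weights to the first five columns.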
MAE on synthetic datasets (the lower the better).
| Name | RT | WkNN | WEVREG | GB |
|---|---|---|---|---|
| Linear Regression | | | | |
| Friedman | | | | |
| Linear Regression 5k | | | | |
| Friedman 5k | | | | |
Figure 4. Model performance with different numbers of features.
Figure 5. Distribution of patients' costs.
Number of variables by type in patient encoding.
| Description | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| Demographics | 2 | 2 | 2 |
| Health checkups | 27 | 54 | 135 |
| Chronic diseases | 46 | 92 | 230 |
| Medication info | 2 | 4 | 10 |
| Previous costs | 1 | 2 | 5 |
| Actual cost | 1 | 1 | 1 |
Figure 6. Distribution of patients' logarithmic costs.
Model performance with all features (lower is better for Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE); higher is better for R²).
| Model | Scenario 1 | | | Scenario 2 | | | Scenario 3 | | |
|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | |
| RT | | | | | | | | | |
| WkNN | | | | | | | | | |
| ANN | | | | | | | | | |
| WEVREG | | | | | | | | | |
| GB | | | | | | | | | |
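The three metrics named in the caption above can be computed as in the following sketch. It uses the standard definitions. Reporting MAPE as a percentage and requiring non-zero true values are assumptions, and the toy numbers are illustrative rather than taken from the paper.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute difference (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent (lower is better).

    Assumes no true value is zero.
    """
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination (higher is better, at most 1)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]
print(round(mae(y_true, y_pred), 2))   # 16.67
print(round(mape(y_true, y_pred), 2))  # 8.33
print(round(r2(y_true, y_pred), 3))    # 0.945
```

MAE is in the units of the cost itself, MAPE is scale-free, and R² compares the model against always predicting the mean cost.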
Top 5 features for each scenario.
| Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|
| 65.06 | 19.67 | 19.66 |
| 6.19 | 16.47 | 19.66 |
| 1.09 | 3.09 | 19.66 |
| 1.03 | 1.70 | 19.66 |
| 1.03 | 1.59 | 2.43 |
Number of features selected by scenario.
| Scenario | Total Features | Selected Features |
|---|---|---|
| 1 | 76 | 5 |
| 2 | 154 | 10 |
| 3 | 382 | 33 |
Model performance with selected features (lower is better for MAE and MAPE; higher is better for R²).
| Model | Scenario 1 | | | Scenario 2 | | | Scenario 3 | | |
|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | |
| RT | | | | | | | | | |
| WkNN | | | | | | | | | |
| ANN | | | | | | | | | |
| WEVREG | | | | | | | | | |
| GB | | | | | | | | | |