Zhixu Hu, Hang Qiu, Liya Wang, Minghui Shen.
Abstract
BACKGROUND: An aging population with a burden of chronic diseases puts increasing pressure on health care systems. Early prediction of the hospital length of stay (LOS) can help optimize the allocation of medical resources and improve healthcare quality. However, the data available at the point of admission (PoA) are limited, making it difficult to forecast the LOS accurately.
Keywords: Length of stay; Machine learning; Multimorbidity network; Network analysis; Patient similarity network; Point of admission
Year: 2022 PMID: 35272654 PMCID: PMC8915508 DOI: 10.1186/s12911-022-01802-z
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Flowchart of the proposed predictive model
Descriptive statistics of main variables in our dataset
| Category | Subgroup | Counts (proportion) | Mean (std) of the LOS |
|---|---|---|---|
| Total | – | 2,543,758 (100.00%) | 12.3 (7.5) |
| Gender | Male | 1,237,624 (48.65%) | 12.4 (7.6) |
| | Female | 1,306,134 (51.35%) | 12.2 (7.4) |
| Years | 2015 | 260,745 (10.25%) | 13.2 (8.0) |
| | 2016 | 423,586 (16.65%) | 12.6 (7.6) |
| | 2017 | 550,686 (21.65%) | 12.4 (7.5) |
| | 2018 | 610,795 (24.01%) | 12.2 (7.5) |
| | 2019 | 697,946 (27.44%) | 11.9 (7.4) |
| Age group | 65–69 | 594,045 (23.35%) | 11.8 (7.3) |
| | 70–74 | 645,257 (25.37%) | 12.1 (7.3) |
| | 75–79 | 575,939 (22.64%) | 12.3 (7.3) |
| | 80–84 | 416,941 (16.39%) | 12.6 (7.5) |
| | 85–89 | 228,725 (8.99%) | 13.4 (8.4) |
| | 90+ | 82,851 (3.26%) | 13.9 (9.1) |
| Ethnic group | Han | 2,533,984 (99.62%) | 12.3 (7.5) |
| | Minority | 9,774 (0.38%) | 12.6 (7.8) |
Fig. 2 Distribution and descriptive statistics of the LOS
Fig. 3 Patient-disease matrix (entries indicate whether a patient has a disease)
Fig. 4 Visualization of the MN (only edges with RR greater than 30 are retained for visualization purposes). The nodes represent chronic diseases; the node colors represent the 18 disease chapters of the ICD-10, and node size is positively correlated with node degree. The edges represent the co-occurrence relationship between pairs of diseases
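The RR-based edge filtering mentioned in the Fig. 4 caption can be illustrated with a minimal sketch. It assumes the relative risk definition commonly used for multimorbidity networks, RR_ij = C_ij * N / (P_i * P_j), where C_ij is the number of patients with both diseases, P_i and P_j are the prevalences, and N is the number of patients; this page does not state the exact formula the authors used. The function name `relative_risk_edges` and the toy records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def relative_risk_edges(patients, threshold=0.0):
    """Return {(disease_i, disease_j): RR} for co-occurring disease pairs.

    Assumed definition (not confirmed by the source):
        RR_ij = C_ij * N / (P_i * P_j)
    Only pairs with RR strictly greater than `threshold` are kept.
    """
    n = len(patients)
    prevalence = Counter()     # P_i: patients with disease i
    cooccurrence = Counter()   # C_ij: patients with both i and j
    for diseases in patients:
        unique = sorted(set(diseases))
        prevalence.update(unique)
        cooccurrence.update(combinations(unique, 2))

    edges = {}
    for (i, j), c_ij in cooccurrence.items():
        rr = c_ij * n / (prevalence[i] * prevalence[j])
        if rr > threshold:
            edges[(i, j)] = rr
    return edges


# Toy records with hypothetical ICD-10 codes (not data from the study).
patients = [
    {"I10", "E11"},
    {"I10", "E11"},
    {"I10"},
    {"J44"},
]
edges = relative_risk_edges(patients, threshold=1.0)
print(edges)
```

Raising `threshold` (the caption uses 30 for display) only prunes edges from the picture; it does not change the RR values themselves.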
Fig. 5 The average Jaccard index similarity in each group. Similarity among female patients is generally lower than among male patients. Similarity increases with age, because older patients have more diseases
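As a rough illustration of the Jaccard index in Fig. 5, the sketch below compares patients by their disease sets (the rows of the patient-disease matrix) and averages over all pairs in a group. Averaging over every pair is an assumption; this page does not say how the group averages were computed. The names `jaccard` and `average_pairwise_jaccard` and the toy group are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two disease sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def average_pairwise_jaccard(group):
    """Mean Jaccard index over all unordered patient pairs in a group."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy group with hypothetical ICD-10 codes (not data from the study).
group = [
    {"I10", "E11"},
    {"I10", "E11", "J44"},
    {"J44"},
]
print(average_pairwise_jaccard(group))
```

With millions of admissions, an all-pairs loop like this would be far too slow; it is meant only to make the measure concrete.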
Feature descriptions
| Feature name | Descriptions | Typesᵃ | Number |
|---|---|---|---|
| Baseline features | | | 69 |
| Date features | The year, month, and day of the week of admission | N | 3 |
| Gender | Male or female | D | 2 |
| Age | Age of the patient | N | 1 |
| Hospital affiliation | The affiliation of the hospital | N | 1 |
| Admission status | 1. Danger 2. Urgent 3. General | N | 1 |
| Patient's and hospital's address code | The smaller the value, the closer to the city center | N | 2 |
| Address flag | Whether the patient's address code is equal to the hospital's address code | N | 1 |
| Hospital levels | Measures of hospital quality | N | 2 |
| Number of diseases | Number of diseases at the PoA | N | 1 |
| Hospital admission source | 1. Emergency treatment 2. Outpatient service 3. Transferred from other medical institutions 4. Others | D | 4 |
| Ethnic group | Han or minority | D | 2 |
| Job | The occupation of the patient | D | 13 |
| Marital status | 1. Unmarried 2. Married 3. Divorced 4. Missing | D | 4 |
| Elixhauser comorbidity index [ | Includes AIDS/HIV, alcohol abuse, blood loss anemia, and so on | D | 31 |
| Elixhauser comorbidity score [ | A mapping score that represents a patient's health condition | N | 1 |
| Historical features | | | 8 |
| Descriptive statistics of historical LOS | The count, mean, standard deviation, median, minimum, and maximum of the patient's historical LOS | N | 6 |
| Last discharge interval | The number of days between the last discharge date and the current admission date | N | 1 |
| Last LOS | The LOS of the last hospital admission | N | 1 |
| MN features | | | 657 |
| Eigenvector centrality features | The eigenvector centrality value of each chronic disease in the MN | N | 653 |
| Disease risk features | The count, maximum, mean, and sum of disease risk scores | N | 4 |
| PSN features | | | 5 |
| Descriptive statistics of neighbors' LOS | The mean, standard deviation, median, minimum, and maximum of the neighbors' LOS | N | 5 |

ᵃ N and D denote numerical and discrete features, respectively. Discrete features are one-hot encoded
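Two of the feature constructions in the table above can be sketched briefly: the descriptive statistics of a list of LOS values (used for both historical and neighbor LOS) and one-hot encoding of a discrete feature. This is a minimal sketch under stated assumptions, not the authors' implementation: the population standard deviation and the zero fill for patients without prior admissions are choices made here, and `los_summary` and `one_hot` are hypothetical names.

```python
import statistics


def los_summary(los_values):
    """Count, mean, std, median, min, and max of a list of LOS values (days).

    Assumptions (not from the source): population standard deviation,
    and all-zero features when the list is empty.
    """
    if not los_values:
        return {"count": 0, "mean": 0.0, "std": 0.0,
                "median": 0.0, "min": 0.0, "max": 0.0}
    return {
        "count": len(los_values),
        "mean": statistics.fmean(los_values),
        "std": statistics.pstdev(los_values),
        "median": statistics.median(los_values),
        "min": min(los_values),
        "max": max(los_values),
    }


def one_hot(value, categories):
    """Encode a discrete value as a 0/1 vector over a fixed category list."""
    return [1 if value == category else 0 for category in categories]


# Categories taken from the "Hospital admission source" row of the table.
ADMISSION_SOURCES = [
    "Emergency treatment",
    "Outpatient service",
    "Transferred from other medical institutions",
    "Others",
]

history = los_summary([10, 14, 12])  # toy LOS history, not study data
source = one_hot("Outpatient service", ADMISSION_SOURCES)
print(history, source)
```

The four-element vector matches the "Number" column for that feature: a discrete feature with k categories contributes k columns after encoding.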
The comparison of predictive performance of XGBoost, GBDT, RF, LinearSVM, and DNN on different feature subsets
| Models | Metrics | Baseline | Baseline + History | Baseline + MN | Baseline + PSN | Baseline + History + MN + PSN |
|---|---|---|---|---|---|---|
| XGBoost | MAE | 4.528 ± 0.006 | 4.276 ± 0.007 | 4.300 ± 0.007 | 4.241 ± 0.007 | 4.024 ± 0.006 |
| | RMSE | 6.419 ± 0.013 | 6.130 ± 0.015 | 6.182 ± 0.013 | 6.128 ± 0.013 | 5.859 ± 0.013 |
| | R² | 0.250 ± 0.002 | 0.316 ± 0.002 | 0.304 ± 0.001 | 0.316 ± 0.001 | 0.375 ± 0.002 |
| GBDT | MAE | 4.531 ± 0.007 | 4.280 ± 0.006 | 4.306 ± 0.006 | 4.251 ± 0.009 | 4.026 ± 0.006 |
| | RMSE | 6.422 ± 0.014 | 6.136 ± 0.013 | 6.189 ± 0.012 | 6.139 ± 0.013 | 5.861 ± 0.011 |
| | R² | 0.249 ± 0.002 | 0.314 ± 0.002 | 0.302 ± 0.001 | 0.314 ± 0.001 | 0.374 ± 0.001 |
| RF | MAE | 4.553 ± 0.008 | 4.343 ± 0.007 | 4.343 ± 0.006 | 4.297 ± 0.008 | 4.106 ± 0.007 |
| | RMSE | 6.468 ± 0.014 | 6.229 ± 0.015 | 6.256 ± 0.013 | 6.226 ± 0.014 | 5.987 ± 0.015 |
| | R² | 0.238 ± 0.002 | 0.293 ± 0.002 | 0.287 ± 0.002 | 0.294 ± 0.001 | 0.347 ± 0.002 |
| Linear SVM | MAE | 4.982 ± 0.007 | 4.697 ± 0.006 | 4.714 ± 0.006 | 4.571 ± 0.007 | 4.366 ± 0.006 |
| | RMSE | 7.004 ± 0.011 | 6.622 ± 0.013 | 6.710 ± 0.011 | 6.549 ± 0.012 | 6.265 ± 0.013 |
| | R² | 0.107 ± 0.001 | 0.201 ± 0.002 | 0.180 ± 0.001 | 0.219 ± 0.001 | 0.285 ± 0.001 |
| DNN | MAE | 4.595 ± 0.053 | 4.371 ± 0.043 | 4.390 ± 0.036 | 4.302 ± 0.043 | 4.152 ± 0.046 |
| | RMSE | 6.518 ± 0.022 | 6.250 ± 0.020 | 6.343 ± 0.015 | 6.223 ± 0.025 | 6.066 ± 0.034 |
| | R² | 0.226 ± 0.004 | 0.289 ± 0.004 | 0.267 ± 0.003 | 0.295 ± 0.004 | 0.330 ± 0.006 |
The experiment was repeated ten times, and the mean and standard deviation were calculated
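For readers less familiar with the three metrics in the table, the sketch below gives their standard definitions and shows how a "mean ± standard deviation" entry could be obtained from repeated runs. Whether the authors report the sample or the population standard deviation is not stated here, so the use of the sample form is an assumption; the function names and toy values are hypothetical.

```python
import math
import statistics


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_and_std(run_scores):
    """Mean and sample standard deviation of one metric across runs."""
    return statistics.fmean(run_scores), statistics.stdev(run_scores)


# Toy LOS values in days (not data from the study).
y_true = [10, 12, 14, 8]
y_pred = [11, 12, 12, 9]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

Each table cell would then correspond to `mean_and_std` applied to the ten per-run values of one metric for one model and feature subset.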
Fig. 6 The distribution of feature importance in XGBoost, GBDT, and RF across the four feature subsets
Top ten features in tree-based models
| XGBoost | RIᵃ | GBDT | RI | RF | RI |
|---|---|---|---|---|---|
| mean of neighbors’ LOS | 1 | mean of neighbors’ LOS | 1 | mean of neighbors’ LOS | 1 |
| median of historical LOS | 0.56 | mean of historical LOS | 0.74 | mean of historical LOS | 0.57 |
| max of historical LOS | 0.38 | last LOS | 0.45 | median of neighbors’ LOS | 0.41 |
| mean of historical LOS | 0.28 | median of neighbors’ LOS | 0.3 | median of historical LOS | 0.39 |
| LDA-1 | 0.24 | std of neighbors’ LOS | 0.25 | max of historical LOS | 0.16 |
| std of neighbors’ LOS | 0.23 | last discharge interval | 0.21 | last LOS | 0.14 |
| last LOS | 0.21 | median of historical LOS | 0.19 | last discharge interval | 0.13 |
| median of neighbors’ LOS | 0.2 | max of historical LOS | 0.19 | LDA-1 | 0.13 |
| last discharge interval | 0.17 | hospital address | 0.16 | std of neighbors’ LOS | 0.12 |
| LDA-2 | 0.16 | LDA-1 | 0.13 | hospital address | 0.1 |

ᵃ RI is the relative importance after min–max normalization. LDA-1 denotes the first component after LDA reduction of the network features
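The min–max normalization behind the RI column can be sketched as follows. Under this scaling the most important feature maps to 1 and the least important to 0, which is consistent with the top row of the table. The function name `min_max_normalize` and the raw importance values are hypothetical; they are not taken from the study.

```python
def min_max_normalize(importances):
    """Scale raw feature importances to [0, 1]: (x - min) / (max - min)."""
    low = min(importances.values())
    high = max(importances.values())
    if high == low:
        # All features equally important; return zeros to avoid dividing by 0.
        return {name: 0.0 for name in importances}
    return {
        name: (value - low) / (high - low)
        for name, value in importances.items()
    }


# Hypothetical raw importances from a tree-based model.
raw = {"feature_a": 0.40, "feature_b": 0.22, "feature_c": 0.04}
relative = min_max_normalize(raw)
print(relative)
```

Because the minimum is taken over all features of a model, the values in a top-ten list never reach 0 unless the weakest feature overall is among them.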
Fig. 7 The distribution of the MAE of XGBoost using Baseline + History + MN + PSN on the testing set. The green bars show the LOS distribution on the testing set. Results are shown for different subgroups: (a) full data and gender subgroups, (b) age subgroups
Comparison of the results with prior related studies
| Study | Condition | Size of dataset | Algorithm | Mean of LOS | Features: History | Features: MN | Features: PSN | Metrics: MAE | Metrics: R² |
|---|---|---|---|---|---|---|---|---|---|
| This study | All chronic diseases | 1,308,041 | XGBoost | 12.31 | Y | Y | Y | 4.024 | 0.375 |
| Xie et al | All diseases | 242,075 | RF | None | N | N | N | None | 0.15 |
| Liu et al | All diseases | 155,474 | LR | 4.50 | N | N | N | None | 0.146 |
| Turgeman et al | HF | 20,321 | Cubist model | None | Y | N | N | 1 | 0.79 |
| Zolbanin et al | COPD | 86,338 | ANN | 5.15 | Y | N | N | 1.239 | 0.613 |
| Chang et al. [ | Ischemic stroke | 330 | LR | 11 | N | N | N | None | 0.369 |
| Tsai et al | Heart diseases | 2,377 | ANN | 5.73 | N | N | N | 3.76 | None |