Abstract
The influx of hospital patients has become common in recent years, and hospital management departments need to redeploy healthcare resources to meet patients' massive medical needs. In this process, the hospital length of stay (LOS) of different patients is a crucial reference for the management department, so building a model to predict LOS is of great significance. Five machine learning (ML) algorithms, namely Lasso regression (LR), ridge regression (RR), random forest regression (RFR), light gradient boosting machine (LightGBM), and extreme gradient boosting regression (XGBR), and six feature encoding methods, namely label encoding, count encoding, one-hot encoding, target encoding, leave-one-out encoding, and the proposed encoding method, are used to construct the regression prediction model. The prediction model is built with the Scikit-Learn toolbox on the Python platform. The input is the Hospital Inpatient Discharges (SPARCS De-Identified) 2017 dataset with 2,343,569 instances, provided by the New York State Department of Health, which is used to verify the model after removing the 2.2% of records with missing data. Mean squared error (MSE) and the coefficient of determination (R²) serve as the performance measures. The results show that the model with the LightGBM algorithm and the proposed encoding method achieves the best R² (96.0%) and MSE (2.231).
Year: 2022 PMID: 36082346 PMCID: PMC9448550 DOI: 10.1155/2022/9517029
Source DB: PubMed Journal: Comput Intell Neurosci
Feature description of the dataset.
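| Feature name | Type | Description |
|---|---|---|
| Hospital Service Area | Random categorical | Describes the location of the hospital |
| Hospital County | Random categorical | Describes the location of the hospital |
| Permanent Facility ID | Random categorical | Hospital service information |
| Facility Name | Random categorical | Hospital service information |
| Operating Certificate Number | Random categorical | Patient diagnostic information |
| Type of Admission | Random categorical | Patient diagnostic information |
| CCS Diagnosis Code | Random categorical | Patient diagnostic information |
| CCS Diagnosis Description | Random categorical | Patient diagnostic information |
| CCS Procedure Code | Random categorical | Patient diagnostic information |
| CCS Procedure Description | Random categorical | Patient diagnostic information |
| APR DRG Code | Random categorical | Patient diagnostic information |
| APR DRG Description | Random categorical | Patient diagnostic information |
| APR MDC Code | Random categorical | Patient diagnostic information |
| APR MDC Description | Random categorical | Patient diagnostic information |
| APR Severity of Illness Code | Random categorical | Patient diagnostic information |
| APR Severity of Illness Description | Random categorical | Patient diagnostic information |
| Payment Typology 1 | Random categorical | Patient cost information |
| Payment Typology 2 | Random categorical | Patient cost information |
| Payment Typology 3 | Random categorical | Patient cost information |
| Zip Code - 3 digits | Random categorical | Patient personal information |
| Race | Random categorical | Patient personal information |
| Ethnicity | Random categorical | Patient personal information |
| Patient Disposition | Random categorical | Patient personal information |
| Birth Weight | Random categorical | Patient personal information |
| Age Group | Ordered categorical | Patient personal information |
| APR Risk of Mortality | Ordered categorical | Patient diagnostic information |
| APR Medical Surgical Description | Three classes | Patient diagnostic information |
| Gender | Three classes | Patient personal information |
| Discharge Year | One class | Patient treatment information |
| Abortion Edit Indicator | Binary classes | Patient treatment information |
| Emergency Department Indicator | Binary classes | Patient service information |
| Length of Stay | Continuous | Target feature |
| Total Charges | Continuous | Patient cost information |
| Total Costs | Continuous | Patient cost information |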
Figure 1. Visualization of the proposed framework.
Figure 2. Density plot of length of stay.
Figure 3. Length of stay distribution of two (three) class features.
Figure 4. Density plot of Total Costs and Total Charges.
Figure 5. Length of stay in Age Group and APR Risk of Mortality.
Numerical conversion details of ordered categorical features.
| Feature name | Original string data | Converted numerical data |
|---|---|---|
| Age Group | “0 to 17” | 0 |
| Age Group | “18 to 29” | 1 |
| Age Group | “30 to 49” | 2 |
| Age Group | “50 to 69” | 3 |
| Age Group | “70 or older” | 4 |
| APR Risk of Mortality | “Minor” | 0 |
| APR Risk of Mortality | “Moderate” | 1 |
| APR Risk of Mortality | “Major” | 2 |
| APR Risk of Mortality | “Extreme” | 3 |
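The conversion in the table above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the paper's own code; the names `ordinal_map` and `encode_record` are hypothetical helpers introduced here.

```python
# Level orderings copied from the table above (lowest to highest).
AGE_GROUP_ORDER = ["0 to 17", "18 to 29", "30 to 49", "50 to 69", "70 or older"]
RISK_ORDER = ["Minor", "Moderate", "Major", "Extreme"]


def ordinal_map(ordered_levels):
    """Map each level to its rank, 0 .. N-1, following the given order."""
    return {level: rank for rank, level in enumerate(ordered_levels)}


AGE_GROUP_MAP = ordinal_map(AGE_GROUP_ORDER)
RISK_MAP = ordinal_map(RISK_ORDER)


def encode_record(record):
    """Return a copy of the record with both ordered features replaced by ranks."""
    encoded = dict(record)
    encoded["Age Group"] = AGE_GROUP_MAP[record["Age Group"]]
    encoded["APR Risk of Mortality"] = RISK_MAP[record["APR Risk of Mortality"]]
    return encoded


example = encode_record({"Age Group": "50 to 69", "APR Risk of Mortality": "Major"})
print(example)  # {'Age Group': 3, 'APR Risk of Mortality': 2}
```

An unknown level raises `KeyError`, which is usually preferable to silently assigning a rank to an unexpected string.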
Feature importance and selection results.
| Feature name | Correlation or | Retain feature |
|---|---|---|
| Gender | 0.053 | Yes |
| APR Medical Surgical Description | 0.043 | |
| Emergency Department Indicator | 0.051 | |
| Hospital County | | |
| Operating Certificate Number | | |
| Permanent Facility Id | | |
| Facility Name | | |
| Zip Code - 3 digits | | |
| CCS Diagnosis Code | | |
| CCS Procedure Code | | |
| APR DRG Code | | |
| APR MDC Code | | |
| Patient Disposition | | |
| Hospital Service Area | | |
| Ethnicity | | |
| Type of Admission | | |
| Payment Typology 1 | | |
| Race | | |
| APR Severity of Illness Code | | |
| APR Risk of Mortality | 0.376 | |
| Age Group | 0.228 | |
| Total Charges | 0.602 | |
| Total Costs | 0.651 | |
Label encoding example for the “Patient Disposition” feature.
| Raw feature values | Sorted feature values | Encoded numerical values |
|---|---|---|
| Home or self-care | Short-term hospital | 0 |
| Skilled nursing home | Expired | 1 |
| Court/law enforcement | Hospice - medical facility | 2 |
| Skilled nursing home | Home or self-care | 3 |
| Court/law enforcement | Home or self-care | 3 |
| Short-term hospital | Skilled nursing home | 4 |
| Court/law enforcement | Skilled nursing home | 4 |
| Home or self-care | Court/law enforcement | 5 |
| Expired | Court/law enforcement | 5 |
| Hospice - medical facility | Court/law enforcement | 5 |
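The example above is consistent with ordering the categories by ascending frequency, breaking ties by first appearance, and then numbering them from 0. That reading is an assumption inferred from the table, not a rule stated in the source, and the function name `frequency_label_codes` is hypothetical. A minimal standard-library sketch under that assumption:

```python
from collections import Counter

# Raw "Patient Disposition" values, in the order shown in the first column above.
raw = [
    "Home or self-care", "Skilled nursing home", "Court/law enforcement",
    "Skilled nursing home", "Court/law enforcement", "Short-term hospital",
    "Court/law enforcement", "Home or self-care", "Expired",
    "Hospice - medical facility",
]


def frequency_label_codes(values):
    """Assign integer codes by ascending frequency (assumed ordering rule).

    Ties are broken by the position of each category's first appearance.
    """
    counts = Counter(values)
    first_seen = {}
    for index, value in enumerate(values):
        first_seen.setdefault(value, index)
    ordered = sorted(counts, key=lambda v: (counts[v], first_seen[v]))
    return {value: code for code, value in enumerate(ordered)}


codes = frequency_label_codes(raw)
encoded = [codes[value] for value in raw]
print(codes)
print(encoded)
```

Note that an alphabetical ordering, which some off-the-shelf label encoders use, would assign different codes to the same data.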
One-hot encoding example for the “Race” feature.
| Raw feature | New features after numerical encoding | ||
|---|---|---|---|
| Race | Race-White | Race-Black/African American | Race-other race |
| White | 1 | 0 | 0 |
| White | 1 | 0 | 0 |
| White | 1 | 0 | 0 |
| Black/African American | 0 | 1 | 0 |
| Black/African American | 0 | 1 | 0 |
| Black/African American | 0 | 1 | 0 |
| Black/African American | 0 | 1 | 0 |
| Other race | 0 | 0 | 1 |
| White | 1 | 0 | 0 |
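The table above can be reproduced with a few lines of plain Python. This is an illustrative sketch, not the paper's implementation; the helper `one_hot` and its column-naming scheme (`prefix-category`) are assumptions made here.

```python
# "Race" values in the order shown in the table above.
races = [
    "White", "White", "White",
    "Black/African American", "Black/African American",
    "Black/African American", "Black/African American",
    "Other race", "White",
]


def one_hot(values, prefix):
    """Expand one categorical column into one 0/1 column per category.

    Categories are kept in order of first appearance.
    """
    categories = list(dict.fromkeys(values))
    columns = [f"{prefix}-{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


columns, rows = one_hot(races, "Race")
print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1, so the number of new columns equals the number of distinct categories. This is why one-hot encoding is typically reserved for features with few distinct values.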
The proposed encoding method.
| Feature name | Encoding method |
|---|---|
| Gender | Label encoding |
| APR Medical Surgical Description | Label encoding |
| Emergency Department Indicator | Label encoding |
| Hospital County | Label encoding |
| Operating Certificate Number | Label encoding |
| Permanent Facility Id | Label encoding |
| Facility Name | Label encoding |
| Zip Code - 3 digits | Label encoding |
| CCS Diagnosis Code | Label encoding |
| CCS Procedure Code | Label encoding |
| APR DRG Code | Label encoding |
| APR MDC Code | Label encoding |
| Patient Disposition | Label encoding |
| Hospital Service Area | One-hot encoding |
| Ethnicity | One-hot encoding |
| Type of Admission | One-hot encoding |
| Payment Typology 1 | One-hot encoding |
| Race | One-hot encoding |
| APR Severity of Illness Code | One-hot encoding |
| APR Risk of Mortality | Sort the feature values from low to high and then encode them from 0 to N-1. |
| Age Group | Sort the feature values from low to high and then encode them from 0 to N-1. |
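One way to carry the per-feature assignment above into code is a lookup from feature name to encoding method. The sketch below only records the assignment; the dictionary layout, the method keys (`"label"`, `"one_hot"`, `"ordinal"`), and the helper `build_plan` are hypothetical conveniences, not part of the source.

```python
# Feature groups copied from the table above.
ENCODING_GROUPS = {
    "label": [
        "Gender", "APR Medical Surgical Description",
        "Emergency Department Indicator", "Hospital County",
        "Operating Certificate Number", "Permanent Facility Id",
        "Facility Name", "Zip Code - 3 digits", "CCS Diagnosis Code",
        "CCS Procedure Code", "APR DRG Code", "APR MDC Code",
        "Patient Disposition",
    ],
    "one_hot": [
        "Hospital Service Area", "Ethnicity", "Type of Admission",
        "Payment Typology 1", "Race", "APR Severity of Illness Code",
    ],
    # Ordered features: values sorted low to high, then coded 0 .. N-1.
    "ordinal": ["APR Risk of Mortality", "Age Group"],
}


def build_plan(groups):
    """Invert {method: [features]} into {feature: method}, rejecting duplicates."""
    plan = {}
    for method, features in groups.items():
        for feature in features:
            if feature in plan:
                raise ValueError(f"{feature!r} is assigned to two encoders")
            plan[feature] = method
    return plan


ENCODING_PLAN = build_plan(ENCODING_GROUPS)
print(ENCODING_PLAN["Race"])       # one_hot
print(ENCODING_PLAN["Age Group"])  # ordinal
```

A preprocessing step could then dispatch on `ENCODING_PLAN[feature]` to pick the encoder for each column.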
Model performance in this and related study.
Model performance in this study

| Metric | Split | LR | RR | RFR | XGBR | LightGBM |
|---|---|---|---|---|---|---|
| MSE | Training | 5.626 | 5.400 | 0.848 | 1.938 | 1.116 |
| MSE | Test | 5.882 | 5.680 | 2.295 | 2.287 | 2.231 |
| R² | Training | 0.697 | 0.726 | 0.994 | 0.969 | 0.990 |
| R² | Test | 0.675 | 0.702 | 0.958 | 0.958 | 0.960 |
| Hyper-parameters | | Alpha ( | Alpha ( | n_estimators = 100 | n_estimators = 500 | n_estimators = 25000, feature_fraction = 0.6 |
| One-fold fitting time (s) | | 3.654 | 1.653 | 946.465 | 900.799 | 874.331 |

Model performance in related study [

| Metric | Split | LR | RR | RFR | XGBR | MLP | DTR |
|---|---|---|---|---|---|---|---|
| MSE | Training | 42.58 | 39 | 0.76 | 5.30 | 39 | 0.002 |
| MSE | Test | 42.19 | 38.49 | 5 | 5.62 | 38.49 | 5.93 |
| R² | Training | 0.31 | 0.37 | 0.987 | 0.914 | 0.37 | 0.999 |
| R² | Test | 0.31 | 0.3711 | 0.92 | 0.908 | 0.371 | 0.903 |
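For reference, the two metrics reported above follow the standard definitions: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements those textbook formulas in plain Python. The toy values are made up for illustration and are not taken from the dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical lengths of stay (days) and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, r2)
```

Because MSE is expressed in squared target units while R² is unitless, the two are best read together: a large gap between training and test values of either metric, as in the RFR column, indicates overfitting.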
The performance changes in different encoding methods.
| Encoding method | MSE (training) | MSE (test) | R² (training) | R² (test) |
|---|---|---|---|---|
| Label encoding | 1.120 | 2.248 | 0.990 | 0.959 |
| Count encoding | 1.129 | 2.252 | 0.990 | 0.959 |
| Target encoding | 1.129 | 2.252 | 0.990 | 0.959 |
| Leave-one-out encoding | 0.023 | 7.777 | 0.999 | 0.221 |
| Proposed encoding method | 1.116 | 2.231 | 0.990 | 0.960 |
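The leave-one-out row stands out: a near-perfect training score paired with a poor test score. The source does not state the cause, but the pattern is consistent with target leakage, since each training row's encoded value is computed from the targets of the other rows in the same category and therefore varies with the row's own target. The sketch below shows the usual leave-one-out formula in plain Python. The function name `leave_one_out_encode`, the fallback to the global mean for singleton categories, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the mean target of the OTHER rows in its category.

    Rows whose category appears only once fall back to the global mean
    (an assumed convention; implementations differ on this point).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for category, target in zip(categories, targets):
        if counts[category] > 1:
            encoded.append((sums[category] - target) / (counts[category] - 1))
        else:
            encoded.append(global_mean)
    return encoded


# Hypothetical categories and lengths of stay, for illustration only.
categories = ["A", "A", "A", "B", "B"]
targets = [2.0, 4.0, 6.0, 10.0, 20.0]
encoded = leave_one_out_encode(categories, targets)
print(encoded)  # [5.0, 4.0, 3.0, 20.0, 10.0]
```

Within category "A" the encoded value falls exactly as the target rises, so a flexible model can recover the training target from the encoding. At test time no target is available, the encoding reverts to a plain category mean, and that learned relationship no longer holds.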