Yan Zhang1, Jie Zhang1, Min Tao1, Jian Shu2, Degang Zhu1.
Abstract
Overcrowding in emergency departments (EDs) is a serious problem in many countries. Accurate ED patient arrival forecasts can serve as a management baseline for better allocating ED personnel and medical resources. We combined calendar and meteorological information and used ten modern machine learning methods to forecast patient arrivals. For daily patient arrival forecasting, we propose two feature selection methods. The first applies kernel principal component analysis (KPCA) to reduce the dimensionality of all features; the second first uses the maximal information coefficient (MIC) to select the features related to the daily data and then applies KPCA dimensionality reduction. The study focuses on a public hospital ED in Hefei, China. We used data from November 1, 2019 to August 31, 2020 for model training and patient arrival data from September 1, 2020 to November 30, 2020 for model validation. The results show that for hourly patient arrival forecasting, every machine learning model forecasts better than the traditional autoregressive integrated moving average (ARIMA) model, especially the long short-term memory (LSTM) model. For daily patient arrival forecasting, the MIC-KPCA feature selection method gives better forecasts, and the simpler models outperform the ensemble models. The proposed method could be used for better planning of ED personnel resources.
Keywords: Calendar and meteorological information; Emergency department; Kernel principal component analysis; Maximal information coefficient; Patient arrivals
Year: 2022 PMID: 35079202 PMCID: PMC8776398 DOI: 10.1007/s10489-021-03085-9
Source DB: PubMed Journal: Appl Intell (Dordr) ISSN: 0924-669X Impact factor: 5.086
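The KPCA step described in the abstract can be sketched with NumPy alone. This is a minimal sketch, not the authors' implementation: the RBF kernel, the `gamma` value, the number of components, and the random stand-in feature matrix are all assumptions, since the abstract does not state them.

```python
import numpy as np


def rbf_kernel_pca(X, n_components, gamma):
    """Project the rows of X onto the leading kernel principal components.

    Assumes an RBF kernel k(a, b) = exp(-gamma * ||a - b||^2); the source
    does not say which kernel or hyperparameters were used.
    """
    # Pairwise squared Euclidean distances (clipped to avoid tiny negatives).
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * sq_dists)

    # Center the kernel matrix in feature space.
    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    K_centered = K - ones @ K - K @ ones + ones @ K @ ones

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Training-point projections are eigenvectors scaled by sqrt(eigenvalue).
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))


# Hypothetical stand-in for a standardized daily feature matrix
# (60 days x 28 features, both sizes arbitrary); real data would replace this.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((60, 28))
Z_demo = rbf_kernel_pca(X_demo, n_components=5, gamma=1.0 / 28)
print(Z_demo.shape)
```

The reduced matrix `Z_demo` would then take the place of the raw features as input to a regressor. For the MIC-KPCA variant, the same function would be applied only to the columns retained by the MIC screening.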
Fig. 1 Scatter plot of ED patient arrival hourly data (November 1, 2019 - November 30, 2020)
Fig. 2 Daily patient arrival flows (November 1, 2019 - November 30, 2020)
Original feature data details
| Feature type | Feature No. | Description | Value meaning |
|---|---|---|---|
| Calendar data | X1 | Hour of day | Continuous variable |
| | X2 | Day of week | Continuous variable |
| | X3 | Month of year | Continuous variable |
| | X4 | Season of year | 0 = Spring, 1 = Summer, 2 = Autumn, 3 = Winter |
| | X5 | Public holiday | 0 = No public holiday, 1 = Public holiday |
| Meteorological data | X6 | Air temperature (max) | Continuous variable (°C) |
| | X7 | Air temperature (mean) | Continuous variable (°C) |
| | X8 | Air temperature (min) | Continuous variable (°C) |
| | X9 | Mean wind speed level | Continuous variable |
| | X10 | Weather | 0 = Sunny, 1 = Cloudy, 2 = Overcast, 3 = Light rain, 4 = Heavy rain, 5 = Snow |
| | X11-X18 | Air quality index related data | Continuous variable |
| Newly constructed data | X19-X29 | Change in X6 to X9, X11, and X13 to X18 compared with one day before | Continuous variable |
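The newly constructed features X19-X29 are day-over-day changes of existing meteorological and air quality features. A minimal sketch of that construction follows; the function name and the sample temperatures are hypothetical, and how the authors treated the first day (which has no predecessor) is not stated, so `None` is used here as an assumption.

```python
def day_over_day_change(values):
    """Return each day's change from the previous day.

    The first day has no predecessor, so its change is None (an assumption;
    the source does not say how the first day was handled).
    """
    if not values:
        return []
    return [None] + [today - yesterday for yesterday, today in zip(values, values[1:])]


# Hypothetical daily maximum temperatures (deg. C) for four consecutive days.
max_temp = [12, 15, 11, 11]
print(day_over_day_change(max_temp))  # [None, 3, -4, 0]
```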
Advantages and disadvantages of the algorithms used
| Algorithm | Advantage | Disadvantage |
|---|---|---|
| Linear Regression | Conceptually simple, easy to implement, especially effective for small data volumes | Difficult to fit nonlinear data, even with polynomial regression |
| KNN | The algorithm is simple, and the training time complexity is O(n) | Computationally expensive when there are many features |
| SVR | Suitable for small sample data, can solve high-dimensional problems | Sensitive to missing values; high memory consumption |
| Ridge | The penalty reduces overfitting | No feature selection function |
| Xgboost | (1) Distributed processing of high-dimensional features; (2) feature importance can be output; (3) adds a regularization term to reduce overfitting | Iterating over the data consumes more space |
| Random Forest | Can handle high-dimensional data without feature selection | Different attribute division methods have a greater impact on the forecast effect |
| AdaBoost | High accuracy without overfitting | Long training time; easily disturbed by noise |
| Gradient Boosting | The forecast effect is stable and robust | High computational complexity; not easy to parallelize |
| Bagging | Integrates multiple regressors, with better prediction results | Consumes more space |
| LSTM | Suitable for problems that are highly related to time series | Computation is heavy and time-consuming |
Fig. 3 Box plots of ED patient arrivals by (a) hour of the day, (b) day of the week, and (c) season of the year
Parameter settings of the data forecast
| Model | Parameter settings |
|---|---|
| Linear regression | Default |
| KNN | N_neighbors = 3 |
| SVR | Kernel = RBF, C = 1e3, epsilon = 0.1 |
| Ridge | Alpha = 1.0, normalize = False |
| Xgboost | Eta = 0.1, max_depth = 9, gamma = 0.1 |
| Random forest | N_jobs = 1, random_state = 12, n_estimators = 100 |
| AdaBoost | Default |
| Gradient boosting | Max_depth = 9, min_sample_split = 200 |
| Bagging | Base_estimator = ‘decision tree’ |
| LSTM | Step_size = 4, epochs = 300 |
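Most of the parameter names in the table resemble scikit-learn arguments, so the settings can be sketched as scikit-learn regressors. This mapping is an assumption: the excerpt does not name the library used. Xgboost and LSTM are left out because they need separate libraries, and `min_sample_split` is read here as scikit-learn's `min_samples_split`.

```python
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def build_models():
    """Return scikit-learn regressors configured as in the table (assumed mapping)."""
    return {
        "Linear regression": LinearRegression(),
        "KNN": KNeighborsRegressor(n_neighbors=3),
        "SVR": SVR(kernel="rbf", C=1e3, epsilon=0.1),
        # normalize=False was the default; the argument is omitted because
        # recent scikit-learn releases no longer accept it.
        "Ridge": Ridge(alpha=1.0),
        "Random forest": RandomForestRegressor(
            n_estimators=100, n_jobs=1, random_state=12
        ),
        "AdaBoost": AdaBoostRegressor(),
        "Gradient boosting": GradientBoostingRegressor(
            max_depth=9, min_samples_split=200
        ),
        # Base estimator passed positionally so the call works whether the
        # keyword is named base_estimator (older) or estimator (newer).
        "Bagging": BaggingRegressor(DecisionTreeRegressor()),
    }


models = build_models()
for name, model in models.items():
    print(name, type(model).__name__)
```

Each regressor exposes the same `fit(X, y)` and `predict(X)` interface, so one training and evaluation loop can cover all of them.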
Fig. 4 The results of ten hourly forecast models and ARIMA evaluated by (a) RMSE, (b) MAE, and (c) MAPE
Fig. 5 The results of nine daily forecasts with the KPCA models and the ARIMA model evaluated by (a) RMSE, (b) MAE, and (c) MAPE
MIC coefficients between each feature and the daily patient arrivals
| Feature No. | MIC coefficients | Feature No. | MIC coefficients |
|---|---|---|---|
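| X2 | 0.02 | X17 | 0.35 |
| X4 | 0.7 | X18 | 0.16 |
| X5 | 0.03 | X19 | 0.13 |
| X6 | 0.05 | X20 | 0.11 |
| X7 | 0.08 | X21 | 0.01 |
| X8 | 0.11 | X22 | 0.03 |
| X9 | 0.17 | X23 | 0.18 |
| X10 | 0.09 | X24 | 0.18 |
| X11 | 0.35 | X25 | 0.05 |
| X12 | 0.3 | X26 | 0.01 |
| X13 | 0.28 | X27 | 0.23 |
| X14 | 0.2 | X28 | 0.16 |
| X15 | 0.06 | X29 | 0.04 |
| X16 | 0.36 | | |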
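The MIC screening step can be illustrated with the coefficients in this table. This is only a sketch: the cutoff of 0.2 is hypothetical, because the excerpt does not state which threshold or ranking rule the authors applied before KPCA.

```python
# MIC values between each feature and daily patient arrivals, copied from the table.
MIC = {
    "X2": 0.02, "X4": 0.7, "X5": 0.03, "X6": 0.05, "X7": 0.08, "X8": 0.11,
    "X9": 0.17, "X10": 0.09, "X11": 0.35, "X12": 0.3, "X13": 0.28, "X14": 0.2,
    "X15": 0.06, "X16": 0.36, "X17": 0.35, "X18": 0.16, "X19": 0.13,
    "X20": 0.11, "X21": 0.01, "X22": 0.03, "X23": 0.18, "X24": 0.18,
    "X25": 0.05, "X26": 0.01, "X27": 0.23, "X28": 0.16, "X29": 0.04,
}


def select_by_mic(mic, threshold):
    """Keep features whose MIC with the target is at least `threshold`."""
    return [name for name, score in mic.items() if score >= threshold]


# 0.2 is a hypothetical cutoff chosen for illustration only.
selected = select_by_mic(MIC, threshold=0.2)
print(selected)
# ['X4', 'X11', 'X12', 'X13', 'X14', 'X16', 'X17', 'X27']
```

With this cutoff, season of year (X4) and several air quality features survive, and only those columns would be passed on to KPCA.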
Fig. 6 The results of nine daily forecasts with the MIC-KPCA models and the ARIMA model evaluated by (a) RMSE, (b) MAE, and (c) MAPE
Daily forecast results of the two feature selection methods
| Model | KPCA RMSE | KPCA MAE | KPCA MAPE (%) | MIC-KPCA RMSE | MIC-KPCA MAE | MIC-KPCA MAPE (%) |
|---|---|---|---|---|---|---|
| Linear Regression | 30.16 | 24.4 | 9.81 | 31.52 | 25.39 | 10.13 |
| ARIMA | 35.72 | 28.83 | 12.93 | 35.72 | 28.83 | 12.93 |
| KNN | 31.61 | 24.62 | 10.09 | 30.23 | 23.79 | 9.63 |
| SVR | 26.84 | 21.52 | 8.81 | 26.84 | 21.53 | 8.81 |
| Ridge | 29.6 | 23.85 | 9.6 | 30.21 | 24.31 | 9.74 |
| Xgboost | 32.51 | 26.42 | 10.71 | 30.6 | 24.75 | 10.03 |
| Random Forest | 36.73 | 29.22 | 11.98 | 27.98 | 22.06 | 9.02 |
| AdaBoost | 33.7 | 27.7 | 11.02 | 34.5 | 27.26 | 10.86 |
| Gradient Boosting | 32.07 | 25.88 | 10.53 | 29.73 | 23.69 | 9.6 |
| Bagging | 33.7 | 26.73 | 10.53 | 31.94 | 25.17 | 9.6 |
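The three error measures in the table can be computed as follows, assuming the usual definitions of RMSE, MAE, and MAPE (the excerpt does not spell out its formulas). The sample arrival counts are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical daily arrival counts and forecasts for two days.
actual = [100, 200]
forecast = [110, 180]
print(rmse(actual, forecast), mae(actual, forecast), mape(actual, forecast))
```

RMSE penalizes large misses more heavily than MAE, while MAPE expresses the error relative to the arrival volume, which is why the three can rank models differently.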