Delin Meng, Jun Xu, Jijun Zhao.
Abstract
Hand, foot and mouth disease (HFMD) is an increasingly serious public health problem that has caused outbreaks in China every year since 2008. Predicting HFMD incidence and analyzing its influential factors are of great significance for prevention. Machine learning has shown advantages in infectious disease modeling, but few machine learning studies of HFMD incidence cover all provinces in mainland China. In this study, we used two machine learning algorithms, Random Forest and eXtreme Gradient Boosting (XGBoost), for analysis and prediction. We first used Random Forest to examine the association between HFMD incidence and potential influential factors for the 31 provinces in mainland China. Next, we established Random Forest and XGBoost prediction models using meteorological and social factors as predictors. Finally, we applied the prediction models to four different regions of mainland China and evaluated their performance. Our results show that: 1) Meteorological and social factors jointly affect the incidence of HFMD in mainland China, and average temperature and population density are the two most influential factors; 2) Population flux affects HFMD incidence with different delays in different regions, and at the national level the model using population flux data delayed by one month has better prediction performance; 3) Overall, the XGBoost model predicted better than the Random Forest model and is more suitable for predicting the incidence of HFMD in mainland China.
Year: 2021 PMID: 34936688 PMCID: PMC8694472 DOI: 10.1371/journal.pone.0261629
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Datasets in this study.
| Datasets | Names | Duration | Temporal resolution | Spatial resolution | Sources |
|---|---|---|---|---|---|
| I | Incidence dataset | 2009–2017 | Month | 31 provinces | The Data-Center of China Public Health Science |
| II | Meteorological dataset | 2009–2017 | Month | 31 provinces | China Meteorological Data Service Centre |
| III | Population flux dataset | 2017 | Day | 31 provinces | Tencent Location Big Data |
| IV | Passenger traffic dataset | 2009–2017 | Year | 31 provinces | The 2018 China Statistical Yearbook |
| V | Demography dataset | 2009–2017 | Year | 31 provinces | The 2018 China Statistical Yearbook |
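The datasets above have different temporal resolutions: population flux is daily, while incidence is monthly. The following is a minimal sketch of one way to align them and to apply a delay of a given number of months. The source does not say how the daily data were aggregated, so the monthly mean used here is an assumption, and the function names and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_mean(daily):
    """Average daily values into one value per (year, month) key."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        buckets[(day.year, day.month)].append(value)
    return {ym: sum(v) / len(v) for ym, v in sorted(buckets.items())}


def delay_months(monthly, k):
    """Pair each month with the value observed k months earlier.

    Assumes the months in `monthly` are contiguous. The first k months
    have no earlier value and are dropped.
    """
    keys = sorted(monthly)
    return {keys[i]: monthly[keys[i - k]] for i in range(k, len(keys))}


# Hypothetical daily flux values, for illustration only.
daily_flux = {
    date(2017, 1, 1): 10.0, date(2017, 1, 2): 14.0,
    date(2017, 2, 1): 20.0, date(2017, 2, 2): 22.0,
    date(2017, 3, 1): 30.0,
}
monthly = monthly_mean(daily_flux)
lagged = delay_months(monthly, 1)  # flux from the previous month
```

With a one-month delay, the February row of a model input would carry the January flux value, so the first month of the series has no predictor and drops out.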
Fig 1. Time series of HFMD incidence and meteorological factors in Beijing from January 2009 to December 2017.
Time series are available for every province in our datasets.
Fig 2. Framework of Random Forest algorithm.
Fig 3. Determination of the optimal number of clusters.
Fig 4. Clusters for 31 provinces in mainland China, through color coding.
Ranking of %IncMSE values for influential factors based on the national dataset.
| Factors | %IncMSE |
|---|---|
| Temperature | 45.195 |
| Population density | 34.252 |
| Population flux | 26.533 |
| Wind speed | 24.791 |
| Sunshine hours | 24.398 |
| Relative humidity | 19.798 |
| Precipitation | 17.726 |
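%IncMSE is conventionally a permutation-based importance: it measures how much the prediction error grows when the values of one factor are shuffled. The sketch below illustrates that idea only. Random forest implementations typically compute it on out-of-bag samples for each tree, which this sketch does not reproduce, and the toy predictor, data, and function names are all hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pct_increase_in_mse(predict, rows, y, col, n_repeats=20, seed=0):
    """Average percent increase in MSE after shuffling column `col`.

    Requires a non-zero baseline MSE.
    """
    rng = random.Random(seed)
    base = mse(y, [predict(r) for r in rows])
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column replaced.
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        increases.append(mse(y, [predict(r) for r in permuted]) - base)
    return 100.0 * (sum(increases) / n_repeats) / base


# Toy data (hypothetical): the target depends on column 0 only.
rows = [[float(i), float((i * 7) % 5)] for i in range(10)]
y = [2.0 * r[0] + (1.0 if i % 2 else -1.0) for i, r in enumerate(rows)]


def toy_predict(row):
    """Stand-in for a fitted model; it ignores column 1."""
    return 2.0 * row[0]


imp_used = pct_increase_in_mse(toy_predict, rows, y, col=0)
imp_unused = pct_increase_in_mse(toy_predict, rows, y, col=1)
```

A factor the model relies on yields a large increase, while a factor it ignores yields none, which is the sense in which temperature ranks above precipitation in the table.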
Ranking of influential factors affecting HFMD incidence in 31 provinces of mainland China.
| Province | Temperature | Population flux | Precipitation | Relative humidity | Wind speed | Sunshine hours |
|---|---|---|---|---|---|---|
| Beijing | 1 | 2 | 3 | 4 | 5 | 6 |
| Jiangsu | 1 | 2 | 3 | 4 | 5 | 6 |
| Anhui | 1 | 2 | 3 | 4 | 5 | 6 |
| Shaanxi | 1 | 2 | 3 | 4 | 5 | 6 |
| Qinghai | 1 | 2 | 3 | 4 | 5 | 6 |
| Zhejiang | 1 | 2 | 3 | 4 | 6 | 5 |
| Yunnan | 1 | 2 | 3 | 5 | 4 | 6 |
| Gansu | 1 | 2 | 3 | 5 | 4 | 6 |
| Ningxia | 1 | 2 | 3 | 5 | 4 | 6 |
| Shandong | 1 | 2 | 3 | 5 | 6 | 4 |
| Inner Mongolia | 1 | 2 | 3 | 6 | 4 | 5 |
| Tianjin | 1 | 2 | 3 | 6 | 5 | 4 |
| Jiangxi | 1 | 3 | 2 | 4 | 6 | 5 |
| Jilin | 1 | 4 | 2 | 3 | 5 | 6 |
| Shanxi | 1 | 4 | 3 | 2 | 5 | 6 |
| Fujian | 1 | 4 | 5 | 2 | 6 | 3 |
| Hunan | 1 | 5 | 2 | 3 | 6 | 4 |
| Guangdong | 1 | 5 | 2 | 3 | 6 | 4 |
| Liaoning | 1 | 5 | 3 | 2 | 4 | 6 |
| Hainan | 1 | 5 | 4 | 3 | 6 | 2 |
| Heilongjiang | 1 | 6 | 2 | 3 | 5 | 4 |
| Guangxi | 1 | 6 | 2 | 3 | 5 | 4 |
| Hebei | 1 | 6 | 2 | 4 | 3 | 5 |
| Henan | 1 | 6 | 2 | 5 | 3 | 4 |
| Hubei | 2 | 1 | 3 | 5 | 6 | 4 |
| Guizhou | 2 | 1 | 3 | 6 | 5 | 4 |
| Xinjiang | 2 | 1 | 4 | 6 | 5 | 3 |
| Shanghai | 2 | 1 | 5 | 3 | 6 | 4 |
| Tibet | 2 | 3 | 5 | 1 | 4 | 6 |
| Chongqing | 4 | 1 | 5 | 2 | 3 | 6 |
| Sichuan | 4 | 2 | 5 | 3 | 1 | 6 |
| Percentage of 1 | 0.78 (24/31) | 0.16 (5/31) | 0 | 0.03 (1/31) | 0.03 (1/31) | 0 |
| Percentage of 2 | 0.16 (5/31) | 0.42 (13/31) | 0.26 (8/31) | 0.13 (4/31) | 0 | 0.03 (1/31) |
The smaller the number, the greater the influence of this factor on HFMD incidence. 1 represents the greatest influence on HFMD incidence, and 6 represents the least.
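The two summary rows of the table can be checked by tallying the ranks directly. The sketch below transcribes the ranks from the table and counts, for each factor, how many provinces assign it a given rank. The function and variable names are illustrative.

```python
from collections import Counter

FACTORS = ("Temperature", "Population flux", "Precipitation",
           "Relative humidity", "Wind speed", "Sunshine hours")

# Ranks transcribed from the table, in the order given by FACTORS.
RANKS = {
    "Beijing": (1, 2, 3, 4, 5, 6), "Jiangsu": (1, 2, 3, 4, 5, 6),
    "Anhui": (1, 2, 3, 4, 5, 6), "Shaanxi": (1, 2, 3, 4, 5, 6),
    "Qinghai": (1, 2, 3, 4, 5, 6), "Zhejiang": (1, 2, 3, 4, 6, 5),
    "Yunnan": (1, 2, 3, 5, 4, 6), "Gansu": (1, 2, 3, 5, 4, 6),
    "Ningxia": (1, 2, 3, 5, 4, 6), "Shandong": (1, 2, 3, 5, 6, 4),
    "Inner Mongolia": (1, 2, 3, 6, 4, 5), "Tianjin": (1, 2, 3, 6, 5, 4),
    "Jiangxi": (1, 3, 2, 4, 6, 5), "Jilin": (1, 4, 2, 3, 5, 6),
    "Shanxi": (1, 4, 3, 2, 5, 6), "Fujian": (1, 4, 5, 2, 6, 3),
    "Hunan": (1, 5, 2, 3, 6, 4), "Guangdong": (1, 5, 2, 3, 6, 4),
    "Liaoning": (1, 5, 3, 2, 4, 6), "Hainan": (1, 5, 4, 3, 6, 2),
    "Heilongjiang": (1, 6, 2, 3, 5, 4), "Guangxi": (1, 6, 2, 3, 5, 4),
    "Hebei": (1, 6, 2, 4, 3, 5), "Henan": (1, 6, 2, 5, 3, 4),
    "Hubei": (2, 1, 3, 5, 6, 4), "Guizhou": (2, 1, 3, 6, 5, 4),
    "Xinjiang": (2, 1, 4, 6, 5, 3), "Shanghai": (2, 1, 5, 3, 6, 4),
    "Tibet": (2, 3, 5, 1, 4, 6), "Chongqing": (4, 1, 5, 2, 3, 6),
    "Sichuan": (4, 2, 5, 3, 1, 6),
}


def count_with_rank(ranks, rank):
    """Return {factor: number of provinces giving it `rank`}."""
    counts = Counter()
    for row in ranks.values():
        for factor, r in zip(FACTORS, row):
            if r == rank:
                counts[factor] += 1
    return {factor: counts[factor] for factor in FACTORS}


first = count_with_rank(RANKS, 1)
second = count_with_rank(RANKS, 2)
n_provinces = len(RANKS)
```

Dividing each count by `n_provinces` gives the proportions reported in the "Percentage of 1" and "Percentage of 2" rows.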
Fig 5. The error of Random Forest model against the number of decision trees.
Fig 6. Predicted incidence against observed HFMD incidence using Random Forest model.
Fig 7. Predicted incidence against observed HFMD incidence using XGBoost model.
Regional division for 31 provinces in mainland China.
| Clusters | Provinces |
|---|---|
| 1 | Tibet, Qinghai, Ningxia, Xinjiang, Gansu, Inner Mongolia |
| 2 | Beijing, Shaanxi, Shanxi, Jilin, Hebei, Heilongjiang, Liaoning, Tianjin, Shandong |
| 3 | Anhui, Hunan, Fujian, Guangdong, Jiangsu, Zhejiang, Hainan |
| 4 | Sichuan, Chongqing, Guizhou, Hubei, Yunnan, Henan, Shanghai, Guangxi, Jiangxi |
MSE and EVS in different regions with different time delay using Random Forest and XGBoost.
| Region | Model | MSE, no delay | MSE, one-month delay | MSE, two-month delay | EVS, no delay | EVS, one-month delay | EVS, two-month delay |
|---|---|---|---|---|---|---|---|
| Country level | Random Forest | 103.51 | 104.41 | 104.14 | 0.5429 | 0.5390 | 0.5403 |
| Country level | XGBoost | 74.88 | | 87.72 | 0.5593 | | 0.5572 |
| Cluster 1 | Random Forest | 13.75 | 14.80 | 15.08 | 0.5991 | 0.5688 | 0.5612 |
| Cluster 1 | XGBoost | 6.74 | | 10.62 | 0.6422 | | 0.6269 |
| Cluster 2 | Random Forest | 36.99 | 38.82 | 38.71 | 0.6534 | 0.6364 | 0.6375 |
| Cluster 2 | XGBoost | | 39.17 | 50.36 | | 0.6627 | 0.5982 |
| Cluster 3 | Random Forest | 208.94 | 203.51 | 206.42 | 0.4385 | | 0.4453 |
| Cluster 3 | XGBoost | 242.31 | 191.78 | | 0.3927 | 0.3905 | 0.4350 |
| Cluster 4 | Random Forest | 130.10 | 134.81 | 133.95 | 0.4924 | 0.4744 | 0.4780 |
| Cluster 4 | XGBoost | 76.29 | | 91.23 | 0.5708 | | 0.5981 |

The best value of each evaluation criterion in each region was marked in bold in the source table. Those bold values are absent from this copy, and their cells are left empty.
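For reference, the two criteria can be computed as follows. This is a minimal sketch that assumes the standard definitions, which the text here does not spell out: MSE is the mean of squared prediction errors, and EVS (explained variance score) is 1 - Var(y - ŷ) / Var(y). The sample values are hypothetical.

```python
def mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)


def variance(xs):
    """Population variance (divides by n)."""
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])


def mean_squared_error(y_true, y_pred):
    """Mean of squared differences; lower is better."""
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def explained_variance_score(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true); higher is better, at most 1."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - variance(residuals) / variance(y_true)


# Hypothetical values: every prediction is too high by exactly 1.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [4.0, 6.0, 8.0, 10.0]
error = mean_squared_error(observed, predicted)
evs = explained_variance_score(observed, predicted)
```

In this example the constant bias gives an MSE of 1 but an EVS of 1, because EVS depends only on the variance of the residuals. Under these definitions the two criteria can therefore favor different models, as they do for Cluster 3 in the table.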