Literature DB >> 35065932

Impact analysis of environmental and social factors on early-stage COVID-19 transmission in China by machine learning.

Yifei Han1, Jinliang Huang2, Rendong Li2, Qihui Shao1, Dongfeng Han1, Xiyue Luo3, Juan Qiu4.   

Abstract

As a highly contagious disease, COVID-19 caused a worldwide pandemic and it is still ongoing. However, the infection in China has been successfully controlled although its initial transmission was also nationwide and has caused a serious public health crisis. The analysis on the early-stage COVID-19 transmission in China is worth investigating for its guiding significance on prevention to other countries and regions. In this study, we conducted the experiments from the perspectives of COVID-19 occurrence and intensity. We eliminated unimportant factors from 113 variables and applied four machine learning-based classification and regression models to predict COVID-19 occurrence and intensity, respectively. The influence of each important factor was analysed when applicable. Our optimal model on COVID-19 occurrence prediction presented an accuracy of 91.91% and the best R2 of intensity prediction reached 0.778. Linear regression-based model was identified as unable to fit and predict the intensity, and thus only the variable influence on COVID-19 occurrence can be explained. We found that (1) CO VID-19 was more likely to occur in prosperous cities closer to the epicentre and located on higher altitudes, (2) and the occurrence was higher under extreme weather and high minimum relative humidity. (3) Most air pollutants increased the risk of COVID-19 occurrence except NO2 and O3, and there existed a lag effect of 6-7 days. (4) NPIs (non-pharmaceutical interventions) did not show apparent effect until two weeks after.
Copyright © 2022 The Authors. Published by Elsevier Inc. All rights reserved.

Entities:  

Keywords:  Air pollutants; COVID-19; Machine learning; Meteorology; Non-pharmaceutical interventions; Social data

Mesh:

Substances:

Year:  2022        PMID: 35065932      PMCID: PMC8776626          DOI: 10.1016/j.envres.2022.112761

Source DB:  PubMed          Journal:  Environ Res        ISSN: 0013-9351            Impact factor:   8.431


Introduction

The novel coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus-2 (SARS CoV-2) has become a global pandemic. As of June 15, 2021, there have been over 175 million confirmed patients and over 3.8 million deaths due to COVID-19 (WHO, 2021). On the initial stage of COVID-19 transmission, WHO confirmed on January 12, 2020, that the infection in Wuhan belongs to coronaviruses (Bashir et al., 2020). The virus spread to surrounding provinces in China rapidly during January, and thus Wuhan, Hubei province restricted personnel movement on Jan 23 to prevent more severe transmission (Yu et al., 2020; Andersen et al., 2021). All provinces of mainland China were on high alert and responded swiftly (Ren et al., 2021). Comprehensive and stringent epidemic prevention and control measures were taken nationwide (The State Council Information Office of the People's Republic of China, 2020). It was proved that the feedback of these measures was encouraging, only sporadic cases have been reported on mainland China since April 29, 2020 (The State Council Information Office of the People's Republic of China, 2020). The recovery of COVID-19 in mainland China cost less than half a year and has provided a valuable background for scholars to investigate the correlation between the COVID-19 transmission and meteorological, social, demographic and geographical factors (Liu et al., 2020; Sun et al., 2020b; Wang et al., 2021; Xiao et al., 2021). Currently, it is a popular methodology to statistically analyse how various parameters influence the spreading of COVID-19 from both global and regional perspectives. For example, related studies showed that some environmental and economic factors such as PM2.5 and GDP have a positive effect on preventing COVID-19 infection (Ahmed et al., 2021). A study in the United States verified that meteorological parameters (higher temperature) and social background (physical distancing intervention) together contribute to a lower risk of COVID-19 transmission (Guo et al., 2021b). However, most studies on COVID-19 transmission in early 2020 were focused on the impact of environmental quality rather than socio-economic, governance, transportation and other elements (Sharifi and Khavarian-Garmsir, 2020). Previous studies on COVID-19 transmission factors were usually conducted via statistical methods. Varies of models have been proved to be significantly effective in revealing the impact of different factors and thus provided recommendations in prevention (Babuna et al., 2021; Coşkun et al., 2021; Lorenzo et al., 2021; Xiao et al., 2021). Apart from conventional statistical analysis, spatial statistics has also shown its reliability and priority in digging out the complex correlation from geographical data (Sun et al., 2020a; Andersen et al., 2021; Dupre et al., 2021). Furthermore, some of these statistical models showed the potential to forecast the trend of the COVID-19 pandemic (Guo et al., 2021a). As a popular technology, machine learning has also been widely applied and thus brought significant findings when it comes to the analysis of the COVID-19 pandemic. Kuo and Fu (2021) integrated 8 types of basic machine learning techniques to build up a robust model to predict the number of patients based on environmental, demographic and mobility data. Li et al. (2021) built a ridge logistic model and found out how numerous factors may affect COVID-19 transmission and fatality. Nevertheless, most of these machine learning-based studies focused on the influence factors of worldwide transmission instead of the early-stage pandemic in China. Therefore, we applied several machine learning methods on a wide range of environmental, social, economic, meteorological and geographical data during the early-stage COVID-19 pandemic in China, and thus revealed the impact of all these factors. Meanwhile, we evaluated and compared the machine learning-based models to work out an optimal solution for estimating the transmission of COVID-19.

Materials and methods

Data source

In this study, we collected prefecture-level data in mainland China and all the data obtained were daily data between Jan 10 and Jun 11 when the number of daily new patients decreased to less than 300 (Johns Hopkins University, 2021). Each record represented the parameters of each prefecture per day. The data of confirmed and newly confirmed patients were obtained from the National Health Commission of the People's Republic of China (2021) and provincial or prefectural health commissions. 13,332 newly confirmed cases of novel coronavirus pneumonia (NCP) were officially released on Feb 12, such a numerous increase was due to the change of diagnosis standard, which was previously relying on nucleic acid testing but then changed to clinical diagnosis (Chinadaily, 2020). We evenly distributed the increment of patients on Feb 12 to each of the previous days to assure our data reflected the true transmission. Since the relation between an independent factor and disease incidence counts may fail to indicate the association with disease transmission (Zhao, 2020). It should be noticed that COVID-19 was not balanced distributed since the COVID-19 transmission varied a lot among different prefectures. Therefore, our study was conducted from two separate perspectives of COVID-19 occurrence (COV_O) and COVID-19 intensity (COV_I). COV_O were binarized to 1 and 0 depending on whether COVID-19 is over 0. COV_I inherited the natural logarithm of COVID-19 to smooth the unbalanced distribution. The meteorological data we collected included daily minimum temperature (MinT), maximum temperature (MaxT), mean temperature (MeanT), relative humidity (Rh) and minimum relative humidity (MinRh). These data were downloaded from the China Meteorological Data Service Center (2021). The atmospheric environmental quality data, including daily CO, NO2, O3, PM2.5, PM10, SO2 and AQI (air quality index), were downloaded from The Data Center of The Ministry of Ecology And Environment of The People's Republic of China (2021). We applied linear interpolation to replace missing values. Meanwhile, considering the delay effect of atmospheric environmental indicators, we defined CO_1 as the CO content of one day before, and CO_9 was the CO content of 9 days before. The same naming rule was used on all the other atmospheric environmental indicators. The geographical data used in this study was a 30m digital elevation model (DEM) product called ‘ASTER GDEM’, which was downloaded from Geospatial Data Cloud (2021). The mean DEM (MeanDEM) of each prefecture was calculated. The socio-economic data we used consisted of three parts. The first part is human-related economic data. We collected the household population (Pop), gross domestic product (GDP), and population density (PD) of the year 2019, which were the latest accessible version. The first two were collected from all provincial statistical yearbooks of China. The population density was downloaded from Ministry Of Housing and Urban-Rural Development of the People’s Republic of China (2021). The second part is migration-related data. We collected destination proportion in population flow from Wuhan (WH) and migration scale (MS) from Jan 1 to Jan 23 (Wuhan lockdown), 2020 (Baidu, 2020). Then we calculated the destination migration scale flow from Wuhan (Popmob) according to the following equation: Meanwhile, considering the delay effect, we defined Popmob1 as Popmob of one day before, and Popmob9 was defined as Popmob of 9 days before, and so on naming Popmob0 to Popmob9. Besides, the cumulative Popmob (Popmobsum) from Jan 1 to each corresponding date was also calculated. The distances from Wuhan to the geometric centre of each prefecture (DisWH) were calculated. The third part is the level of the adopted control measures (Reslevel) in 366 prefectures. These data were collected from the National Health Commission of the People's Republic of China (2021) and provincial or municipal health commissions. China's public health alert system is categorized into four levels in terms of the nature of the incidents, the extent of harm and scope: Level-I (extremely significant), Level-II (significant), Level-III (major) and Level-IV (normal) (The Central People's Government of the People's Republic of China, 2020). We quantified these response levels that no response, Level-IV, Level-III, Level-II and Level-I were ranked 0, 1, 2, 3 and 4, respectively. Meanwhile, considering the delay effect, Reslevel1 was defined as the response level one day before, Reslevel20 was defined as the response level 20 days before, and so was the others. Days from Jan 10 (Time) of each record was also extracted and regarded as a potential influence parameter considering that the temporal aspect may influence a lot in the early-stage transmission. More specific information of all the aforementioned data was shown in Appendix A.

Factor filtering

In this study, we collected 113 factors. Such high dimensional data would massively increase the difficulty of machine learning. Besides, some factors may not have a strong correlation to COVID-19 occurrence or COVID-19 intensity where such interruption might bring unexpected noise to our analysis. Therefore, a preprocess of factor filtering was necessary.

COVID-19 occurrence

We built Gradient Boosting Decision Tree (GBDT) and Random Forest (RF) classification models on COV_O to find out the importance of each factor. These two machine learning-based classification models were built and modified based on the Python module ‘Scikit-learn’ (Pedregosa et al., 2012). RF fits a certain number of decision trees trained on randomly selected subsamples of the training data by the bagging approach and the output of the RF classification model is the class selected by most trees (Kulkarni and Lowe, 2016; Kuo and Fu, 2021). GBDT is also in the form of ensemble decision trees but uses boosting approach and it usually overperforms RF (Piryonesi and El-Diraby, 2020; Kuo and Fu, 2021). In this preprocessing, we set the number of trees to 1000 for both two models. Each type of data was normalized to the interval of [0, 1] based on the maximum and the minimum. The influence of each factor was calculated by both these two models. The influence was computed via Gini importance, which was derived from the Gini index (Breiman, 2001). Since our classification was binary according to two categories of COV_O, the Gini index was then computed by the following equation (Qi, 2012):where is the Gini index of node , and represents the fraction of the positive example assigned to this node. The Gini importance of each feature in a tree was the sum of the Gini index reduction and the feature importance was the sum of Gini importance in all trees (Qi, 2012). The higher feature importance means a stronger correlation. For each model, we recorded 50 factors the influence of which ranked the least and deleted the factors which were listed in both these two results. In this way, the least related variables were all removed to avoid noise interruption in machine learning.

COVID-19 intensity

Similar to the preprocess on COVID-19 occurrence, GBDT and RF were also adopted to build regression models to explore the importance of each factor for COVID-19 intensity and we also applied the same normalization to each type of data. The number of trees was set to 1100 and 1500 for GBDT and RF regression models respectively. The last 50 influence factors were also recorded and the repetitive factors were dropped as well.

Estimation models

Since we have 51,776 records the COV_O of which were 0 and only 4588 records were 1, we randomly select 10% from the former COV_O records and thus the distribution of these two categories was more balanced to avoid overfitting problem. Then we randomly split the training and test data for the following classification models by the proportion of 70% and 30%. Four machine learning models were adopted to estimate the COVID-19 occurrence. We retained identical GBDT and RF classification applied in factor selection considering they were among the most popular realizations of ensemble learning (Piryonesi and El-Diraby, 2020). Moreover, we have built a ridge classification model and a 3-layer artificial neural network (ANN), then compared them with the aforementioned two models to find the optimal solution. The ridge classifier is referred to a least-squares support vector machine (LSSVM) with a linear kernel that transforms the binary targets into (−1, 1) and then does a regression task (Scikit-learn, 2021). To increase the reliability of COV_O estimation, we applied 10-fold cross-validation to our ridge classifier. The ANN included three hidden layers where the neural nodes were 64, 48, and 16. All records were segmented into training, validation, and test sets by 60%, 20%, and 20%, respectively. The results of all four classification models were evaluated based on their accuracy, precision and recall calculated by the following equations: where refers to the number of correct estimations of class , refers to the true record number of the class , is the number of class records estimated by the model, and is the total amount of all records. Considering the algorithm of the ridge classifier is based on linear regression, the coefficients of all selected factors were finally computed. The factors with an absolute coefficient over 0.1 were then listed and analysed. The overall workflow on COVID-19 occurrence is shown in Fig. 1 .
Fig. 1

Overall workflow on COVID-19 occurrence classification.

Overall workflow on COVID-19 occurrence classification. Based on 4588 records that presented non-zero COVID-19 intensity, we kept them all and also split them into training and test sets by 70% and 30%. Consistently, we employed four machine learning techniques to estimate the COVID-19 intensity. GBDT and RF were still preserved due to their good performances. However, instead of ridge regression, in this experiment, we used elastic net (EN) with 10-fold cross-validation. Compared to ridge regression, EN can discard unimportant variables but keeps the stability, and it usually gives a significantly accurate prediction (Zou and Hastie, 2005). Besides, we also constructed a 2-layer ANN which had two hidden layers individually consisting of 96 and 10 nodes. The results were finally assessed by the coefficient of determination (R2) and mean squared error (MSE) of all estimations for both training and test sets. R2 and MSE were calculated by the following equations: where refers to the records of all non-zero COV_I, refers to the true COV_I of the record, refers to the estimated COV_I of the record, and is the mean COV_I of all records. The EN model is also a linear regression-based model and thus its coefficient analysis of all input variables was then conducted to explain how each factor would impact the COVID-19 intensity. The overall workflow on COVID-19 intensity is shown in Fig. 2 .
Fig. 2

Overall workflow on COVID-19 intensity estimation.

Overall workflow on COVID-19 intensity estimation.

Results

The last 50 COV_O influence factors calculated by GBDT and RF classification are shown in Table 1 . Based on these results, 26 factors (Table 2 ) were finally cross-selected and dropped before fitting the estimation model.
Table 1

The last 50 COV_O influence factors calculated by GBDT and RF.

GBDT
RF
FactorInfluenceFactorInfluenceFactorInfluenceFactorInfluence
Reslevel17<0.0001AQI_2<0.0001Popmob10.0011SO20.0026
Reslevel6<0.0001SO2_6<0.0001Popmob00.0013PM10_80.0026
Reslevel14<0.0001CO_10.0001Popmob20.0015AQI_20.0026
Reslevel7<0.0001Popmob30.0001Popmob30.0015PM250.0026
Reslevel8<0.0001PM10_20.0001CO_10.0020Popmob40.0026
Reslevel11<0.0001AQI_30.0001Popmob50.0020AQI_60.0027
Reslevel9<0.0001PM25_40.0001Popmob70.0020AQI_30.0027
AQI_4<0.0001Reslevel200.0001CO_30.0021SO2_10.0027
AQI_9<0.0001CO_40.0001CO_90.0021AQI_90.0027
Popmob6<0.0001PM25_60.0001CO0.0021PM10_70.0027
Popmob5<0.0001AQI_10.0001SO2_30.0022Popmob90.0027
Popmob2<0.0001PM100.0001CO_60.0022PM10_20.0027
Reslevel3<0.0001AQI0.0001CO_20.0023PM10_30.0027
Reslevel13<0.0001PM10_10.0001CO_50.0023AQI_70.0027
Reslevel5<0.0001PM10_50.0001SO2_50.0023PM10_60.0027
Reslevel12<0.0001SO20.0001SO2_60.0024SO2_90.0028
Reslevel16<0.0001CO_90.0001SO2_20.0024AQI_40.0028
Reslevel10<0.0001NO2_90.0001SO2_70.0024AQI_10.0028
CO_8<0.0001O3_80.0001CO_80.0024PM10_90.0028
Reslevel4<0.0001NO2_10.0001SO2_40.0024PM10_10.0028
Rh<0.0001PM25_80.0001Popmob60.0025Popmob80.0028
PM10_8<0.0001Popmob10.0001CO_40.0025PM10_50.0028
AQI_6<0.0001AQI_50.0001AQI_80.0025PM25_60.0029
AQI_8<0.0001PM25_90.0001CO_70.0026AQI_50.0029
Popmob8<0.0001CO_30.0001SO2_80.0026PM10_40.0029
Table 2

26 cross-selected COV_O influence factors to be deleted.

AQI_1AQI_8CO_9Popmob1SO2
AQI_2AQI_9PM10_1Popmob2SO2_6
AQI_3CO_1PM10_2Popmob3
AQI_4CO_3PM10_5Popmob5
AQI_5CO_4PM10_8Popmob6
AQI_6CO_8PM25_6Popmob8
The last 50 COV_O influence factors calculated by GBDT and RF. 26 cross-selected COV_O influence factors to be deleted. Apart from the AQI with a lag of 7 days, all the other AQI factors were proved to be unrelated to the number of patients and deleted. Some other air pollutants such as SO2, CO, PM10, PM2.5 and also the migration scale flow from Wuhan with the lag of various days were identified lack of influence and then removed. As for the last 50 COV_I influence factors (Table 3 ), some Air pollutants such as AQI, CO, NO2, PM2.5 and SO2 were removed based on the cross-selection results (Table 4 ). Besides, two social factors – the migration scale from Wuhan and the response level – with the lag of certain days were also removed.
Table 3

The last 50 COV_I influence factors calculated by GBDT and RF.

GBDT
RF
FactorInfluenceFactorInfluenceFactorInfluenceFactorInfluence
Reslevel6<0.0001SO20.0002Reslevel120.0000CO_30.0008
PM25_9<0.0001Reslevel70.0002Reslevel140.0000SO2_50.0008
Reslevel20<0.0001Popmob50.0002Reslevel100.0001SO2_40.0008
Reslevel5<0.0001NO2_20.0002Reslevel130.0001SO2_30.0008
Reslevel13<0.0001SO2_60.0002Reslevel110.0001CO_60.0009
SO2_2<0.0001Popmob70.0002Reslevel90.0001PM25_60.0009
CO_2<0.0001PM10_10.0002Reslevel60.0001PM25_10.0009
CO_5<0.0001Reslevel110.0002Reslevel150.0001AQI_10.0009
Reslevel12<0.0001PM25_60.0003Reslevel80.0002PM250.0009
CO<0.0001AQI_80.0003Reslevel170.0002SO2_90.0009
CO_8<0.0001PM10_90.0003Reslevel70.0002Reslevel200.0010
Reslevel1<0.0001PM100.0003Reslevel50.0002CO_90.0010
SO2_8<0.0001SO2_40.0003Reslevel180.0002PM25_70.0010
CO_10.0001AQI_20.0003Popmob80.0003AQI_30.0010
AQI_30.0001NO2_80.0003Popmob90.0003PM25_20.0010
NO2_40.0001PM25_80.0003Reslevel160.0003SO2_80.0010
Reslevel150.0001PM25_20.0003Popmob70.0004SO2_20.0010
Reslevel100.0001Reslevel180.0003Popmob60.0004NO2_90.0011
AQI_10.0001NO2_30.0003Popmob50.0005PM25_50.0011
SO2_30.0001O3_60.0003CO_40.0006PM25_30.0011
Popmob60.0002SO2_50.0003CO_50.0006AQI_40.0012
Reslevel00.0002NO2_10.0003SO2_60.0007Reslevel10.0012
CO_70.0002CO_90.0004AQI_20.0007PM25_40.0012
CO_60.0002PM25_10.0004SO2_70.0007AQI_60.0012
NO2_90.0002MaxT0.0004AQI_50.0008PM10_20.0013
Table 4

30 cross-selected COV_I influence factors to be deleted.

AQI_1NO2_9Popmob7Reslevel15SO2_2
AQI_2PM25_1Reslevel1Reslevel18SO2_3
AQI_3PM25_2Reslevel10Reslevel20SO2_4
CO_5PM25_6Reslevel11Reslevel5SO2_5
CO_6Popmob5Reslevel12Reslevel6SO2_6
CO_9Popmob6Reslevel13Reslevel7SO2_8
The last 50 COV_I influence factors calculated by GBDT and RF. 30 cross-selected COV_I influence factors to be deleted. The final performances of four machine learning models on the test sets for COVID-19 occurrence classification are shown in Fig. 3 as the format of confusion matrixes. The accuracy, precision, and recall were also calculated and shown in Table 5 .
Fig. 3

Confusion matrixes of COV_O classification results by Ridge classifier (a), Gradient Boosting Decision Tree (b), Random Forest (c), and 3-layer Artificial Neural network.

Table 5

Assessment indexes of COV_O classification results by four models.

RidgeGBDTRF3-layer ANN
Accuracy88.02%91.91%91.54%88.40%
Precision89.28%95.01%94.60%88.95%
Recall87.96%89.44%89.12%89.18%
Confusion matrixes of COV_O classification results by Ridge classifier (a), Gradient Boosting Decision Tree (b), Random Forest (c), and 3-layer Artificial Neural network. Assessment indexes of COV_O classification results by four models. All these four techniques showed spectacular performances on the classification of test sets that their accuracies were identically over 88%, and GBDT presented the highest accuracy of 91.91%. Meanwhile, all the high precisions and recalls also indicated that these four classifiers were similarly convincing. Considering that our Ridge classifier presented excellent performance, 38 factors that showed high influence with the absolute coefficient over 0.1 were retained and listed in Table 6 .
Table 6

38 COVID-19 occurrence factors that have high coefficients.

FactorCoefficientFactorCoefficient
AQI_70.5675PM25_10.8735
CO_20.1655PM25_20.1621
CO_5−0.3688PM25_30.7964
DEM−0.1584PM25_50.9568
DisWH−0.4398PM25_7−0.7619
GDP1.1081PD0.3344
NO2_1−0.8857Popmob40.6645
NO2_4−0.1488Popmobsum0.5297
NO2_5−0.1183Reslevel10.7163
NO2_6−0.2154Reslevel20.1711
NO2_80.1998Reslevel15−0.1343
O3_1−0.1249Reslevel19−0.2020
O3_6−0.1668Reslevel20−0.4461
O3_7−0.1183MinRh0.1623
O3_8−0.1001SO2_50.5294
PM10_40.3215SO2_7−0.1303
PM10_60.2028MaxT0.3439
PM10_7−1.0103MinT−0.1128
PM10_9−0.2273TIME−1.1932
38 COVID-19 occurrence factors that have high coefficients. As for the prediction on COVID-19 intensity, the results of all four regression models are presented in Fig. 4 and their assessment metrics are shown in Table 7 .
Fig. 4

Estimated values and residuals of COV_I by Elastic Net (a), Gradient Boosting Decision Tree (b), Random Forest (c), and 2-layer Artificial Neural Network (d).

Table 7

Assessment metrics of COV_I regression model on training and test set.

EN
GBDT
RF
2-layer ANN
Training setTest setTraining setTest setTraining setTest setTraining setTest set
MSE1.1821.1260.3320.4150.2950.4310.3070.454
R20.3400.3970.8150.7780.8350.7700.8290.757
Estimated values and residuals of COV_I by Elastic Net (a), Gradient Boosting Decision Tree (b), Random Forest (c), and 2-layer Artificial Neural Network (d). Assessment metrics of COV_I regression model on training and test set. The R2 of the EN model was below 0.4 and its MSE reached over 1.0, which was not feasible enough when the dependent variable (natural logarithm of COV_I) ranged from −4 to 6. In this case, the coefficient analysis was not applicable. The other three regression models (GBDT, RF, and 2-layer ANN) all performed well that their R2 on both training and test sets were over 0.75.

Discussion

Most of our proposed machine learning classification or regression models were tested to be reliable. All four classification models provided convincing accuracies and GBDT was tested to be the optimal solution. GBDT performed the best on all three assessment indices compared to the other models. Considering the good performance of our ridge classification model, the coefficient analysis of all the selected factors on COVID-19 occurrence, therefore, brought reasonable and inspiring findings: The analysis of our meteorological data showed that the maximum temperature had a positive impact on the occurrence of COVID-19 cases while the minimum temperature showed a negative impact. Although it was believed that the higher temperature would reduce the possibility of COVID-19 transmission (Liu et al., 2020), our findings indicated that the COVID-19 was more likely to occur where the temperature was extreme and such connection was stronger where the maximum temperature is higher. However, the meteorological impacts could be very complex considering that human interruptions such as the frequencies of outdoor activities were highly related to the weather. Our study also indicated that the minimum relative humidity had a positive impact on COVID-19 occurrence and this could be supported by current studies (Qiu et al., 2021). The analysis of the atmospheric environmental quality data explained the correlation from different perspectives. It was believed that CO, NO2, O3, PM10, and PM2.5 synergistically impact the COVID-19 occurrence with a lag of 8–9 days (Qiu et al., 2021). Our study verified that such lag did exist but such duration should be 6–7 days, except CO and O3. CO at 4–5 days in advance was found to be the inflection point that its lag effect was shorter than the other pollutants. Besides, O3 did not show a significant inflection point and thus its lag effect was still yet to be explored. Another surprising finding was that NO2 and O3 both showed a negative correlation with COVID-19 occurrence, which was not aligned with the conclusion that most scholars thought they were supposed to increase the severity of transmission (Barnett-Itzhaki and Levi, 2021). However, we applied the lag effect to avoid the possible impacts of COVID-19 on the air pollutants which can be confusing when fitting the prediction model. Therefore, our study should be more reliable compared to the former researches. In this study, it was found that COVID-19 patients were more likely to occur in the prefectures with higher GDP, denser population, more intensive population mobility and closer to the epicentre, which was aligned with some studies (Coccia, 2020; Qiu et al., 2021). Nevertheless, for population density, some studies stated that it was not significantly associated with the spreading of COVID-19 (Sun et al., 2020b). Considering this inconsistency, an explanation for our result was that the transportation capacity in prosperous cities was usually larger and thus increased the frequency of human interaction, which made it vulnerable to the COVID-19 epidemic. Whereas the medical facilities were usually more accessible and sufficient in big cities to prevent a further internal epidemic, and thus these factors might negatively influence the COVID-19 intensity instead of occurrence. However, this needed to be proved by further exploring. Our result also presented that the advance control measure response level was not positively influencing the nationwide spreading until the lag of 15 days, where it was found to be the turning point from the negative coefficient to the positive. This finding was consistent with the widely acknowledged viewpoint that the NPIs (non-pharmaceutical interventions) continuing for two weeks can reduce the spreading effectively (Vardavas et al., 2021). In our study, the finding on the geographical data revealed that COVID-19 occurrence is higher where the altitude is lower, which was consistent with the finding of Sun et al. (2020b). The landscape and the prefectural development status in China mainland, the prefectures located on higher elevation were closely associated with both the meteorological data as well as the socio-economic status. Hence, such a negative correlation was assumed partly due to the synergy of these two causes. As for the prediction on the number of COVID-19 patients, RF presented the lowest MSE and the best R2 on the training set but GBDT outperformed the others with its best estimation results on the test set. The performance of ANN on COVID-19 intensity estimation was brilliant even though it was not as feasible as GBDT. This could be explained by two reasons: (1) Given that all records were at the prefectural level, data of the same prefecture more or less had a homogeneity problem that most variables were similar values, and thus the neural network failed to train on enough effective data. (2) The distribution of COVID-19 intensity records was still a little skewed even after the transformation of the natural logarithm. Although these three models were slightly overfitted, such results were still good enough considering the extremely unbalanced distribution of COVID-19 intensity. However, the poor fitting performance of the EN model indicated that this prediction on the number of patients is very complex rather than a single linear regression model can express. While most COVID-19 spreading prediction models using machine learning focused on the number of patients-only statistics instead of taking multifactor as the parameters (Ardabili et al., 2020; Pinter et al., 2020; Rustam et al., 2020), the models proposed in our study counted a series of variables to ensure convincingness. Besides, the brilliant performances of these models verified that a reliable forecast on the regional early-stage COVID-19 transmission could be accessible and it is worth further researches. Moreover, the early-stage infection of more contagious diseases rather than COVID-19 is also believed to be achievable based on our study. Potential improvement on the analysis of early-stage COVID-19 transmission is worth more endeavour from different perspectives: (1) A wider range of variables such as vaccination may have indirect impacts which have not been fully understood, and they might contribute to a more comprehensive model, especially for complex COVID-19 intensity prediction; (2) The quantification of the control measures in other countries is yet to be studied for better use of our optimal model; (3) More specific correlation and the synergistic effect of different factors are worth further exploration.

Conclusion

Various factors synergistically influence the early-stage COVID-19 transmission in China: (1) COVID-19 was more likely to occur in prosperous cities closer to the epicentre and located on higher altitudes; (2) The extreme weather and higher minimum relative humidity increased the vulnerability to COVID-19 occurrence; (3) More air pollutants increased the risk of COVID-19 occurrence except for NO2 and O3, and most of the pollutants did not show the impact until 6–7 days after; (4) The control effects of NPIs on COVID-19 occurrence became significant after two weeks. The accuracy of our optimal model used to predict the COVID-19 occurrence reached 91.91%. Besides, our optimal regression model presented the best R2 of 0.778 on the prediction. However, the specific connections between the number of COVID-19 patients and each factor were complex and work further exploration. Overall, this is the first study that analysed multifactor early-stage COVID-19 transmission in China and precisely estimated the pandemic trend by machine learning. Our study brings guiding significance to the forecast of the early-stage regional COVID-19 epidemic and enables quicker response to prevent regional outbreaks from becoming a nationwide pandemic.

Funding

This work was supported by Multidisciplinary Cross-cultivation project of Innovation Academy of Precision Measurement Science and Technology, [grant number S21S3202], the of China [grant number 2020CFA048], and the [grant numbers 62071457, 81803297].

Authors’ contributions

Juan Qiu and Rendong Li conceptualized the study, Yifei Han directed the study's implementation, drafted and finished the manuscript, Jinliang Huang review the drafts and supervised the study, Qihui Shao, Dongfeng Han and Xiyue Luo collected resources and helped to interpret the findings.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data typeVariableAbbreviationUnitsScaleData source
COVID-19 dataCase incidence ratios (cases per 10,000,000 persons)CIR0.00001%DailyNational Health Commission of the People's Republic of China (http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml) and provincial or municipal health commissions
Meteorological dataMinimum temperatureMinT0.1 °CDailyhttp://data.cma.cn
Maximum temperatureMaxT0.1 °CDaily
Mean temperatureMeanT0.1 °CDaily
Relative humidityRh%Daily
Minimum relative humidityMinRh%Daily
Atmospheric environmental quality dataCarbon monoxide, COCO, CO_1, …, CO_9mg/m3Dailyhttps://datacenter.mee.gov.cn/
Nitrogen dioxide, NO2NO2, NO2_1, …, NO2_9μg/m3Daily
Ozone, O3O3, O3_1, …, O3_9μg/m3Daily
Fine particles, PM2.5PM25, PM25_1, …, PM25_9μg/m3Daily
Inhalable coarse particles, PM10PM10, PM10_1, …, PM10_9μg/m3Daily
Sulfur dioxide, SO2SO2, SO2_1, …, SO2_9μg/m3Daily
Air Quality IndexAQI, AQI_1, …, AQI_9Daily
Geographical dataMean DEMMeanDEMmDailyhttp://www.gscloud.cn
Socio-economic dataHousehold populationPop10,000 Person2019, Fixed valuehttp://tjj.shandong.gov.cn/tjnj/nj2020/zk/indexch.htm, etc.
Population densityPDPerson/km22019, Fixed valuehttp://www.mohurd.gov.cn/xytj/index.html
GDP per capitaGDP100 million RMBYearlyhttp://tjj.shandong.gov.cn/tjnj/nj2020/zk/indexch.htm, etc.
Destination migration scale flow from Wuhan = destination proportion in population flow from Wuhan * migration scalePopmob, Popmob1, …, Popmob9, PopmobsumDailyhttp://qianxi.baidu.com/
The distance of each city from WuhanDisWHkmFixed valueDistance measurement based on GIS
National emergency responseReslevel, Reslevel1, …, Reslevel20DailyNational Health Commission of the People's Republic of China (http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml) and provincial or municipal health commissions
Temporal dataDays from Jan 10, 2020TimeDailyCorresponding to each record
  23 in total

1.  Correlation between environmental pollution indicators and COVID-19 pandemic: A brief study in Californian context.

Authors:  Muhammad Farhan Bashir; Ben Jiang Ma; Bushra Komal; Muhammad Adnan Bashir; Taimoor Hassan Farooq; Najaf Iqbal; Madiha Bashir
Journal:  Environ Res       Date:  2020-05-13       Impact factor: 6.498

2.  Analyzing the spatial determinants of local Covid-19 transmission in the United States.

Authors:  Lauren M Andersen; Stella R Harden; Margaret M Sugg; Jennifer D Runkle; Taylor E Lundquist
Journal:  Sci Total Environ       Date:  2020-09-18       Impact factor: 7.963

3.  Meteorological factors and COVID-19 incidence in 190 countries: An observational study.

Authors:  Cui Guo; Yacong Bo; Changqing Lin; Hao Bi Li; Yiqian Zeng; Yumiao Zhang; Md Shakhaoat Hossain; Jimmy W M Chan; David W Yeung; Kin-On Kwok; Samuel Y S Wong; Alexis K H Lau; Xiang Qian Lao
Journal:  Sci Total Environ       Date:  2020-11-23       Impact factor: 7.963

4.  Effect of Environmental and Socio-economic factors on the spreading of COVID-19 at 70 cities/provinces.

Authors:  Jishan Ahmed; Md Hasnat Jaman; Goutam Saha; Pratyya Ghosh
Journal:  Heliyon       Date:  2021-05-05

5.  Temporal and spatial analysis of COVID-19 transmission in China and its influencing factors.

Authors:  Qian Wang; Wen Dong; Kun Yang; Zhongda Ren; Dongqing Huang; Peng Zhang; Jie Wang
Journal:  Int J Infect Dis       Date:  2021-03-09       Impact factor: 3.623

6.  County-level demographic, social, economic, and lifestyle correlates of COVID-19 infection and death trajectories during the first wave of the pandemic in the United States.

Authors:  Natalie C DuPre; Seyed Karimi; Charlie H Zhang; Lyndsey Blair; Arushi Gupta; Lamyaa Mousa A Alharbi; Mariyam Alluhibi; Riten Mitra; W Paul McKinney; Bert Little
Journal:  Sci Total Environ       Date:  2021-05-03       Impact factor: 7.963

7.  Association between air quality, meteorological factors and COVID-19 infection case numbers.

Authors:  Jason Sam Leo Lorenzo; Wilson Wai San Tam; Wei Jie Seow
Journal:  Environ Res       Date:  2021-03-17       Impact factor: 8.431

8.  Effects of chronic exposure to ambient air pollutants on COVID-19 morbidity and mortality - A lesson from OECD countries.

Authors:  Zohar Barnett-Itzhaki; Adi Levi
Journal:  Environ Res       Date:  2021-01-21       Impact factor: 6.498

9.  Physical distancing implementation, ambient temperature and Covid-19 containment: An observational study in the United States.

Authors:  Cui Guo; Shin Heng Teresa Chan; Changqing Lin; Yiqian Zeng; Yacong Bo; Yumiao Zhang; Shakhaoat Hossain; Jimmy W M Chan; David W Yeung; Alexis K H Lau; Xiang Qian Lao
Journal:  Sci Total Environ       Date:  2021-05-21       Impact factor: 7.963

10.  Identifying novel factors associated with COVID-19 transmission and fatality using the machine learning approach.

Authors:  Mengyuan Li; Zhilan Zhang; Wenxiu Cao; Yijing Liu; Beibei Du; Canping Chen; Qian Liu; Md Nazim Uddin; Shanmei Jiang; Cai Chen; Yue Zhang; Xiaosheng Wang
Journal:  Sci Total Environ       Date:  2020-10-13       Impact factor: 7.963

View more
  2 in total

1.  Study of the Economic, Environmental, and Social Factors Affecting Chinese Residents' Health Based on Machine Learning.

Authors:  Hui Xu; Wei Pan; Meng Xin; Wulin Pan; Cheng Hu; Dai Wanqiang; Ge Huang
Journal:  Front Public Health       Date:  2022-06-14

2.  The environment, epidemics, and human health.

Authors:  Avelino Núñez-Delgado; Warish Ahmed; Elza Bontempi; José L Domingo
Journal:  Environ Res       Date:  2022-07-31       Impact factor: 8.431

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.