Predicting the hotspots of age-adjusted mortality rates of lower respiratory infection across the continental United States: Integration of GIS, spatial statistics and machine learning algorithms.

Abolfazl Mollalo¹, Behrooz Vahedi², Shreejana Bhattarai³, Laura C Hopkins⁴, Swagata Banik⁵, Behzad Vahedi⁶.

Abstract

OBJECTIVE: Although lower respiratory infections (LRI) are among the leading causes of mortality in the US, their association with underlying factors and geographic variation have not been adequately examined.
METHODS: In this study, explanatory variables (n = 46) including climatic, topographic, socio-economic, and demographic factors were compiled at the county level across the continental US. Machine learning algorithms - logistic regression (LR), random forest (RF), gradient boosting decision trees (GBDT), k-nearest neighbors (KNN), and support vector machine (SVM) - were employed to predict the presence/absence of hotspots (P < 0.05) for elevated age-adjusted LRI mortality rates in a geographic information system framework.
RESULTS: Overall, there was a historical shift in hotspots away from the western US into the southeastern parts of the country and they were highly localized in a few counties. The two decision tree methods (GBDT and RF) outperformed the other algorithms (accuracies: 0.92; F1-scores: 0.85 and 0.84; area under the precision-recall curve: 0.84 and 0.83, respectively). Moreover, the results of the RF and GBDT indicated that higher spring minimum temperature, increased winter precipitation, and higher annual median household income were among the most substantial factors in predicting the hotspots.
CONCLUSIONS: This study helps raise awareness of public health decision-makers to develop and target LRI prevention programs.

Keywords: Accuracy assessment; Decision trees; GIS; Hotspots; Lower respiratory infections; US

Year: 2020; PMID: 32871492; PMCID: PMC7442929; DOI: 10.1016/j.ijmedinf.2020.104248

Source DB: PubMed; Journal: Int J Med Inform; ISSN: 1386-5056; Impact factor: 4.046

Introduction

Lower respiratory infections (LRI) are diseases of the lower respiratory tract and include bronchitis, bronchiolitis, pneumonia, and the recently emerged coronavirus disease (COVID-19). LRI are major public health concerns across the world ([1], [2], [3]) and are among the leading causes of mortality and morbidity in children and adults [4,5]. In 2016, LRI caused nearly 2.38 million deaths worldwide, including 652,572 children under five years old and 1,080,958 adults over 70 years old, making it the sixth leading cause of death for all ages [6]. LRI also account for a significant number of hospitalizations in developed countries [7]. In the US, LRI have been classified as the 7th leading cause of death and years of life lost [8]. In this country, bronchiolitis is the leading diagnosis of LRI in children younger than two years old, causing almost 150,000 annual hospitalizations [9]. Similarly, pneumonia is one of the most common reasons for hospital admission in the US and the most common severe bacterial infection in children [10]. However, with the success of childhood vaccination programs such as the 7-valent and 13-valent pneumococcal conjugate vaccines, the proportion of elderly affected by LRI in the US has significantly declined [11]. Previous studies have shown that many socio-economic factors such as education level, income, and poverty [12] and environmental factors such as climate and air pollution ([13], [14]) were significantly associated with LRI prevalence. Further, demographic factors such as age, gender, and race [15] and behavioral factors such as cigarette smoking [16] were correlated with LRI prevalence. Only a few studies have examined the spatial variation of LRI, and these were limited to small geographic regions. For example, Beamer et al. [17] identified distinct patterns of significant spatial clusters for each LRI phenotype within Tucson, Arizona. 
Those clusters were associated with various community-level risk factors such as increased air pollution, poor housing conditions, and low socio-economic status. Beck et al. [18] conducted a study in Cincinnati, Ohio, to examine geographic variation of LRI hospitalization rates across Hamilton county using Getis-Ord Gi* statistic. They also examined whether such variation was correlated with socio-economic status using the non-parametric Kruskal-Wallis test. The results indicated a significant alteration in the median hospitalization rates by census tract quintile for both bronchiolitis and pneumonia. Further, socio-economic conditions had substantial influences on those hospitalization rates, and hotspots were located in the impoverished neighborhoods in the urban core. In recent decades, the use of novel modeling techniques such as machine learning algorithms in public health studies, in particular, respiratory disease research has increased [19]. For instance, Heckerling et al. [20] trained a back-propagation artificial neural network (ANN) optimized by genetic algorithm to predict pneumonia among patients (n = 1044) with respiratory complaints from the University of Illinois and the University of Nebraska. A multitude of variables, such as demographics, symptoms, signs, and comorbidity with other respiratory diseases, including asthma and lung disease, were compiled to predict the presence or absence of pneumonia among the patients. The ANN model successfully predicted pneumonia on the test dataset with 93 % accuracy. In a case-control study in Taiwan, Kuo et al. [21] compared the performance of seven machine learning classifiers, including random forest and logistic regression, to predict hospital-acquired pneumonia among schizophrenic patients. Among the employed algorithms, random forest had the highest accuracy (93 %) in predicting pneumonia. Further, the significant predictors were clozapine use, clozapine prescription, and prescription duration. 
While several studies have been conducted in smaller geographic regions, to our knowledge, no previous nationwide study has examined geographic variations of LRI mortality rates and their association with underlying factors across the US. Identifying hotspot(s) of LRI mortality rates (i.e., counties with higher than expected mortality) and predicting their presence or absence from population-level underlying factors can help public health decision makers target interventions at the national level. Thus, in this ecological study, we investigated the geographic variation of age-adjusted LRI mortality rates across the continental US from 1980 to 2014 using spatial statistics. Further, we employed several machine learning algorithms to predict hotspot occurrence from potential risk factors in a geographic information system (GIS) framework.

Material and methods

Data collection and preparation

Continental US age-adjusted mortality rates of LRI were obtained at the county level from the Global Health Data Exchange (http://ghdx.healthdata.org/record/ihme-data/united-states-mortality-rates-county-1980-2014). The data were available for eight years: 1980, 1985, 1990, 1995, 2000, 2005, 2010, and 2014. The disease data were then spatialized at the county level in ArcGIS 10.7 (ESRI, Redlands, CA). The ESRI shapefile of the administrative boundaries of US counties was obtained from the US Census Bureau Topologically Integrated Geographic Encoding and Referencing (TIGER)/Line products for the year 2018 (http://www.census.gov/). Explanatory variables (n = 46) including climatic, topographic, socio-economic, and demographic factors were compiled at the county level across the continental US and stored in a file geodatabase in ArcGIS 10.7. The variables were selected according to either the previously published literature or domain knowledge. Both low and high air temperatures can aggravate respiratory symptoms, particularly among individuals with preexisting conditions. Low air temperature can adversely affect the epithelium by narrowing the respiratory airways and reducing lung function. In contrast, high air temperature can increase allergic illnesses, possibly by increasing pollen production or extending the length of the pollen season, which in turn can worsen respiratory symptoms. Increased precipitation may facilitate the spread of respiratory diseases. Vitamin D, which the body produces on exposure to sunlight, may protect against respiratory diseases. We obtained climate data including daily air temperature (°C), daily precipitation (mm), and daily sunlight (kJ/m²) from the Centers for Disease Control and Prevention Wide-Ranging Online Data for Epidemiologic Research (CDC WONDER) database (http://wonder.cdc.gov/). 
Then, we aggregated the daily climate data for the spring (March 19-June 20), summer (June 20-September 22), autumn (September 22-December 21), and winter (December 21-March 20) seasons (i.e., seasonal minimum and maximum temperature, seasonal average precipitation, and seasonal average sunlight). Fine particulate matter (PM 2.5), which may contain soot, smoke, and dust, can penetrate deep into human lungs and enter the bloodstream. According to Bowe et al. [22], exposure to high levels of PM 2.5 is associated with almost 200,000 deaths in the US. Moreover, cigarette smoking can damage human airways and the small air sacs in the lungs. Daily PM 2.5 air quality data were obtained from the CDC WONDER database, and the mean values of PM 2.5 for the four seasons were computed for each county. Also, data on cigarette smoking prevalence in the US for men and women were obtained from Dwyer-Lindgren et al. [23]. Respiratory infections are more complicated in infants and children living at high altitudes. During acute LRI, hypoxemia occurs more frequently in children at high altitudes, which may result in increased mortality [24]. Therefore, the topographic data (i.e., median altitude and slope) of US counties were also incorporated as explanatory variables. The Shuttle Radar Topography Mission (SRTM) digital elevation model with 30 m spatial resolution was obtained from the National Map website (http://nationalmap.gov/). The altitude and slope values for counties were then quantified using the zonal statistics function in the ArcGIS Spatial Analyst extension. Lower socio-economic status can be associated with unequal access to health care, which in turn can lead to elevated disease mortality. 
A broad range of socio-economic and demographic variables, including the proportion of the white and black population, median household income, poverty, unemployment rate, (lack of) health insurance, and the number of physicians per county, were obtained from the US Census Bureau's American FactFinder (https://factfinder.census.gov/) and included in the file geodatabase. All data used in this study are publicly available from the above sources.
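The seasonal aggregation of daily climate records described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the authors' workflow: the quoted season boundaries overlap on their end dates, so the sketch assigns each boundary date to the season that starts on it; it also reads "seasonal minimum/maximum temperature" as the lowest daily minimum and highest daily maximum in the season. The function names and the sample records are hypothetical.

```python
from datetime import date
from statistics import mean


def season_of(day):
    """Map a calendar date to a season using the boundaries quoted in the text.

    Assumption: a boundary date belongs to the season that starts on it,
    so winter here covers December 21 through March 18.
    """
    md = (day.month, day.day)
    if (3, 19) <= md < (6, 20):
        return "spring"
    if (6, 20) <= md < (9, 22):
        return "summer"
    if (9, 22) <= md < (12, 21):
        return "autumn"
    return "winter"


def seasonal_summary(daily_records):
    """Aggregate (date, t_min, t_max, precip) tuples into per-season values."""
    buckets = {}
    for day, t_min, t_max, precip in daily_records:
        buckets.setdefault(season_of(day), []).append((t_min, t_max, precip))
    return {
        season: {
            "min_temp": min(r[0] for r in rows),      # lowest daily minimum
            "max_temp": max(r[1] for r in rows),      # highest daily maximum
            "mean_precip": mean(r[2] for r in rows),  # average daily precipitation
        }
        for season, rows in buckets.items()
    }


# Hypothetical daily records for one county: (date, t_min in °C, t_max in °C, precip in mm).
records = [
    (date(2014, 1, 15), -5.0, 2.0, 1.0),
    (date(2014, 4, 10), 6.0, 18.0, 3.0),
    (date(2014, 4, 11), 4.0, 16.0, 5.0),
    (date(2014, 7, 4), 19.0, 33.0, 0.0),
]
summary = seasonal_summary(records)
print(summary["spring"])
```

A different reading of the boundaries (for example, closing each season on its end date) only requires changing the comparisons in `season_of`.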

Spatial statistics

The spatial pattern of age-adjusted LRI mortality rates (i.e., clustered, dispersed, or random) across the continental US was examined with global and local indices of spatial autocorrelation for each of the eight study years. Moran’s I and Getis-Ord General G were employed to investigate the extent to which nearby counties had similar LRI rates. Moran’s I is calculated using the following formula:

$$I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} z_i z_j}{\sum_{i=1}^{n} z_i^2} \tag{1}$$

where $z_i$ and $z_j$ are the deviations of LRI mortality rates from the average mortality rate for county $i$ and county $j$, respectively; $w_{ij}$ is an element of the binary weight matrix between county $i$ and county $j$ based on first-order Queen contiguity (i.e., each element in the weight matrix is non-zero when the counties share borders of non-zero length); and $n$ is the aggregate number of counties. The value of $I$ ranges between -1 (negative spatial autocorrelation) and +1 (positive spatial autocorrelation), while values close to 0 indicate no spatial autocorrelation ([25], [26]). Using the same notation as for Eq. (1), with $x_i$ and $x_j$ denoting the LRI mortality rates of county $i$ and county $j$, Getis-Ord General G is computed as:

$$G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} x_i x_j}{\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j}, \quad j \neq i$$

A significant value of G indicates spatial clustering of LRI mortality rates. Both Moran’s I and Getis-Ord General G statistics were calculated in ArcGIS 10.7. Local measures of spatial autocorrelation such as Getis-Ord Gi* were also applied to locate the identified spatial autocorrelations of LRI mortality rates (P < 0.05) as follows [27,28]:

$$G_i^* = \frac{\sum_{j=1}^{n} w_{ij} x_j - \bar{X} \sum_{j=1}^{n} w_{ij}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^2 - \left(\sum_{j=1}^{n} w_{ij}\right)^2}{n - 1}}}, \qquad \bar{X} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad S = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{X}^2}$$

A high positive and a high negative value of $G_i^*$ imply a hotspot and a coldspot, respectively. However, the focus of this study is on mapping and analyzing the identified hotspots of LRI mortality rates for further modeling. More detailed information about the clustering and hotspot detection techniques has been published elsewhere ([29], [30]).
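To make the global and local statistics concrete, the following pure-Python sketch computes Moran's I and a Gi* z-score from their standard definitions. It is not the ArcGIS implementation used in the study; the five-unit "county" layout, the rates, and the function names are hypothetical, and the Gi* weights include each unit itself, as is usual for that statistic.

```python
import math


def morans_i(values, weights):
    """Global Moran's I for values x_i and a weight matrix w_ij (list of lists)."""
    n = len(values)
    avg = sum(values) / n
    z = [v - avg for v in values]               # deviations from the mean
    s0 = sum(sum(row) for row in weights)       # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in z)


def getis_ord_gi_star(values, weights, i):
    """Gi* z-score for unit i; row i of the weights should include the unit itself."""
    n = len(values)
    avg = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - avg ** 2)
    w_sum = sum(weights[i])
    w_sq = sum(w * w for w in weights[i])
    numerator = sum(weights[i][j] * values[j] for j in range(n)) - avg * w_sum
    denominator = s * math.sqrt((n * w_sq - w_sum ** 2) / (n - 1))
    return numerator / denominator


# Hypothetical toy data: five units in a row, each adjacent to its immediate neighbours.
rates = [10.0, 12.0, 11.0, 30.0, 32.0]
adjacency = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
with_self = [[1 if abs(i - j) <= 1 else 0 for j in range(5)] for i in range(5)]

print("Moran's I:", round(morans_i(rates, adjacency), 3))
print("Gi* (last unit):", round(getis_ord_gi_star(rates, with_self, 4), 3))
print("Gi* (first unit):", round(getis_ord_gi_star(rates, with_self, 0), 3))
```

In this toy layout the high rates sit next to each other, so Moran's I is positive, the last unit gets a positive Gi* (hotspot-like), and the first unit gets a negative one (coldspot-like). Significance testing, which the study relies on, is not covered by this sketch.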

Machine learning modeling

Five different machine learning classifiers were employed to identify hotspot locations (P < 0.05) of the LRI age-adjusted mortality rates. The LRI mortality rate for the year 2014 was used as the dependent variable. The classifiers were standard logistic regression (LR), random forest (RF), gradient boosting decision trees (GBDT), k-nearest neighbors (KNN), and support vector machine (SVM). These classifiers were selected due to their successful performance in identifying intricate patterns in many binary classification applications ([31], [32]). The scikit-learn Python package was used to develop the classifiers.

Logistic regression

LR, a linear model for binary classification, applies maximum likelihood estimation to fit its coefficients after transforming the presence or absence of LRI hotspots into a logit variable [33]. The output of LR is the likelihood of LRI hotspot occurrence as a function of several explanatory variables and can be expressed as:

$$P = \frac{1}{1 + e^{-z}}$$

where $P$ is the predicted likelihood of LRI hotspot occurrence, bounded between 0 and 1, and $z$ is a linear combination of the variables whose value varies between $-\infty$ and $+\infty$. More precisely:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where $\beta_0$ is the intercept and $\beta_1, \dots, \beta_n$ are the coefficients associated with the variables $x_1, \dots, x_n$. Detailed information about LR is provided by Hosmer and Lemeshow [34].
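As a small worked example of the logistic transformation, the sketch below maps a linear combination of variables to a probability between 0 and 1. The intercept, coefficients, and inputs are made-up illustrative numbers, not fitted values from the study, and `hotspot_probability` is a hypothetical helper name.

```python
import math


def hotspot_probability(x, intercept, coefficients):
    """Logistic model: P = 1 / (1 + exp(-z)), with z = b0 + sum(b_k * x_k)."""
    z = intercept + sum(b * xk for b, xk in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative (not fitted) parameters for two standardized variables.
intercept = -1.0
coefficients = [0.8, 0.5]

for x in ([0.0, 0.0], [1.0, 1.0], [3.0, 2.0]):
    print(x, round(hotspot_probability(x, intercept, coefficients), 3))
```

With positive coefficients, larger inputs push $z$ upward and the predicted probability toward 1; a threshold (commonly 0.5) then converts the probability into a presence/absence label.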

Random forest

RF, developed by Breiman [35], is an ensemble learning method in which a large number of decision trees are grown on bootstrap samples of the input data. Each tree repeatedly splits its sample, and the final decision is the class that receives the maximum number of ‘votes’ from the individual trees ([36], [37,38]). In this study, the number of trees was set to 1000. Also, the optimal tree depth (the number of layers from the root to the deepest node) was chosen by cross-validation from the set {2, 3, 4}.
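The depth selection described above can be sketched with scikit-learn, which the study reports using. The code below is a minimal illustration on synthetic data, not the authors' script: the data set, the `tune_random_forest` helper, the F1 scoring choice, and the number of folds are assumptions, and the demo call uses fewer trees than the 1000 reported in the text purely to run quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


def tune_random_forest(X, y, n_trees=1000, cv=5):
    """Pick max_depth from {2, 3, 4} by cross-validation for a fixed tree count."""
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=n_trees, random_state=0),
        param_grid={"max_depth": [2, 3, 4]},
        cv=cv,
        scoring="f1",  # assumption: the selection metric is not stated in the text
    )
    return search.fit(X, y)


# Synthetic stand-in for the county table: 46 features, imbalanced classes.
X, y = make_classification(n_samples=300, n_features=46, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# n_trees is reduced here only to keep the demo fast; the text reports 1000.
search = tune_random_forest(X_train, y_train, n_trees=50, cv=3)
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Restricting the depth to a few levels keeps individual trees shallow, which limits overfitting on a data set of a few thousand counties.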

Gradient boosting decision trees

Similar to RF, GBDT is an ensemble method based on bootstrap sampling, which generates many decision trees. While RF uses the bagging method (i.e., equal probability of sample selection in each iteration), GBDT uses a boosting method (i.e., weighted (unequal) sample selection in each run). After each iteration, the weights are adjusted so that higher weights are assigned to the models with good performance (Friedman [39]). Suppose $x_i$ is a training sample, $y_i$ is the associated label of $x_i$, and $N$ is the number of training samples. For any training sample $x_i$, $F(x_i)$ is the classification of $x_i$ produced by the model, and $L(y_i, F(x_i))$ is the loss between $F(x_i)$ and $y_i$. GBDT determines an optimal model $F^*(x)$ such that $\sum_{i=1}^{N} L(y_i, F(x_i))$ is minimized. In the first step, GBDT initializes the decision tree $F_0(x)$, then iteratively constructs new trees. For each iteration, a negative gradient is computed and a new tree is added to reduce the residuals. The optimal model can be calculated as follows:

$$F^*(x) = F_0(x) + v \sum_{t=1}^{T} \rho_t h_t(x)$$

where $T$ is the number of iterations; $v$ controls the learning rate; $\rho_t$ is the weight of $h_t(x)$; and $h_t(x)$ is the trained decision tree in the $t$th iteration [39].
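The additive, stage-by-stage nature of gradient boosting can be observed with scikit-learn's `GradientBoostingClassifier`, where `n_estimators` plays the role of the number of iterations and `learning_rate` the role of the shrinkage factor. The sketch below uses synthetic data and illustrative settings (100 stages, learning rate 0.1, depth 3); none of these values are taken from the study. Note that scikit-learn fits each tree on all training rows by default (`subsample=1.0`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption, not the study's data).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

n_stages = 100       # number of boosting iterations (illustrative)
shrinkage = 0.1      # learning rate applied to each new tree (illustrative)

gbdt = GradientBoostingClassifier(n_estimators=n_stages, learning_rate=shrinkage,
                                  max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)

# staged_predict yields the prediction of the partial model after each added tree,
# which shows how accuracy evolves as trees are accumulated.
staged_accuracy = [float((pred == y_test).mean())
                   for pred in gbdt.staged_predict(X_test)]
print("after 1 tree:", round(staged_accuracy[0], 3))
print("after all trees:", round(staged_accuracy[-1], 3))
```

A smaller learning rate usually needs more stages to reach the same fit, so the two settings are typically tuned together.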

K-nearest neighbors

The k-nearest neighbors classifier (where k is a positive integer) is a non-parametric, distance-based algorithm that assigns a test sample to the class that is most common among its k nearest training samples. In other words, a county is classified as a hotspot of LRI if a majority of its neighboring counties are hotspots (Peterson [40]). Using a random search algorithm, k = 10 was selected as the optimal number of nearest neighbors. Also, the explanatory variables are not involved in this algorithm. The distance can be calculated in a variety of ways, including Euclidean, Hamming, Manhattan, and Minkowski distance. We used the Manhattan distance, which yielded better results and is calculated as:

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$$

where $\mathbf{x}$ and $\mathbf{y}$ are $n$-dimensional vectors such that $\mathbf{x} = (x_1, x_2, \dots, x_n)$ and $\mathbf{y} = (y_1, y_2, \dots, y_n)$.
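A minimal sketch of majority voting with the Manhattan distance is shown below. It works on generic n-dimensional vectors; which inputs the authors supplied to KNN is not specified here, so the toy points, labels, and the choice k = 3 are purely illustrative (the study reports k = 10).

```python
from collections import Counter


def manhattan(x, y):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn_predict(train_X, train_y, query, k):
    """Return the majority label among the k training points closest to query."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: manhattan(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: label 1 = hotspot, label 0 = not a hotspot.
train_X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
train_y = [0, 0, 0, 1, 1, 1]

print(manhattan([1, 2], [4, 6]))
print(knn_predict(train_X, train_y, [5.5, 5.5], k=3))
print(knn_predict(train_X, train_y, [0.2, 0.1], k=3))
```

An odd k avoids tied votes in a two-class problem; with an even k such as 10, a tie-breaking rule is needed.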

Support vector machine

The SVM classifier, first proposed by Vapnik [41], is grounded in statistical learning theory. Consider a dataset of high-dimensional points, viewed as vectors $\mathbf{x}_i$, where each point belongs to one of two classes defined by $y_i \in \{-1, +1\}$. Here, $y_i$ corresponds to the presence/absence of LRI hotspots. If we assume these points to be linearly separable (i.e., they can be separated by a linear boundary), the goal of SVM is to find the d-dimensional hyperplane that maximizes the margin (i.e., the distance between the closest points of the two classes, or support vectors), as illustrated in Fig. 1 [42].

Fig. 1

Principle of linearly separable SVM using maximum margin.

For training points $\mathbf{x}_i$ with labels $y_i \in \{-1, +1\}$, $i = 1, \dots, N$, the hyperplane can be expressed as $f(\mathbf{x}) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$, where $\mathbf{w}$ is the orientation of the hyperplane, $b$ is the offset of the hyperplane from the origin, and $\operatorname{sgn}$ is the sign function (i.e., sgn = +1 for presence and sgn = -1 for absence of an LRI hotspot). SVM can also work when the points are not linearly separable by using a soft margin. A soft margin allows a trade-off between the margin of separation and the misclassification penalty, one form of which is the aggregated distance of the misclassified points to the separating hyperplane. The optimal separating hyperplane can be found using Lagrange multipliers from:

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C$$

where $\alpha_i$ are the Lagrange multipliers and the value of $C$, or regularization, reflects the trade-off between maximizing the margin and minimizing the errors. Finally, $\mathbf{w}$ and $b$ can be obtained as follows:

$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i, \qquad b = \frac{1}{N_s} \sum_{s=1}^{N_s} \left( y_s - \mathbf{w} \cdot \mathbf{x}_s \right)$$

where $N_s$ is the number of support vectors placed on the margin lines. Many real-world problems are nonlinear. In this case, SVM utilizes kernel functions to map the input data onto a feature space of higher dimension than the original, in which the data can be separated by a linear boundary [43], as shown in Fig. 2. The decision function is modified as:

$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right)$$

where $K$ is a Gaussian radial basis function kernel:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 \right)$$

Fig. 2

A non-linear boundary in the input space (left) and a maximum margin hyperplane in feature space (right).

Appropriate results highly depend on the selection of the regularization parameter $C$ and the kernel parameter $\gamma$. Here, we used a grid search to find the optimum values for the two parameters. This method checks various combinations of $C$ and $\gamma$ in a range of pre-defined values ($C$ between 0.5 and 20 with increments of 0.5 and $\gamma$ between 0.005 and 1.0 with increments of 0.1). It should be noted that these ranges are the boundaries of the search space and were chosen to cover a large enough space. For example, in our case, 20 is numerically large enough for C.
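The grid search over the two SVM parameters can be sketched with scikit-learn's `GridSearchCV`. The grids below follow one reading of the ranges quoted above (the kernel-width grid starts at 0.005 and steps by 0.1, so it stops at 0.905); the synthetic data, the 3-fold cross-validation, and the default accuracy scoring are assumptions rather than details from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values: regularization 0.5, 1.0, ..., 20.0 and
# kernel width 0.005, 0.105, ..., 0.905 (one reading of the stated ranges).
c_grid = np.arange(0.5, 20.5, 0.5)
gamma_grid = np.arange(0.005, 1.0, 0.1)

# Small synthetic data set so that all grid combinations fit quickly.
X, y = make_classification(n_samples=120, n_features=6, n_informative=4,
                           random_state=0)

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": c_grid.tolist(), "gamma": gamma_grid.tolist()},
    cv=3,  # assumption: the number of folds is not stated in the text
)
search.fit(X, y)
print(len(c_grid) * len(gamma_grid), "combinations tried")
print(search.best_params_, round(search.best_score_, 3))
```

Because the RBF kernel depends on distances between points, real explanatory variables measured in different units would normally be standardized before a search like this; the synthetic features here are already on comparable scales.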

Accuracy assessment

To train and evaluate the algorithms, 70 % and 30 % of the dataset were randomly selected as the training and test datasets, respectively. A randomized search algorithm was used to tune the hyper-parameters of each classification algorithm. L1 regularization (LASSO) was used to reduce the complexity of the model and to avoid overfitting. This is done by shrinking small weights to zero, leading to a sparser model. The performances of the classifiers were assessed with several metrics: overall accuracy ($\frac{TP + TN}{TP + TN + FP + FN}$), precision ($\frac{TP}{TP + FP}$), recall ($\frac{TP}{TP + FN}$), F1-score ($2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$), false positive rate or FPR ($\frac{FP}{FP + TN}$), and area under the ROC (receiver operating characteristic) curve (ROC AUC). In the above formulas, $TP$, $TN$, $FP$, and $FN$ represent the number of true positives, true negatives, false positives, and false negatives, respectively. The area under the precision-recall curve (PR AUC), which shows the tradeoff between precision and recall at different thresholds, was also measured because the classes were imbalanced (Goutte & Gaussier [44]). All evaluation metrics were computed on the test dataset.
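The threshold-based metrics listed above follow directly from the four confusion-matrix counts, as the short sketch below shows. The counts are hypothetical and chosen only to make the arithmetic easy to follow; `classification_metrics` is an illustrative helper, not part of the study's code.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, F1-score, and FPR from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical test-set counts: 50 true hotspots and 150 non-hotspots.
metrics = classification_metrics(tp=40, tn=140, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, accuracy can stay high even when many hotspots are missed, which is why precision, recall, and the precision-recall curve are reported alongside it.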

Results

The null hypothesis of complete spatial randomness was rejected for all study years based on Moran’s I (range: 0.36 – 0.61; p-values < 0.001) and General G (range: 0.0018 – 0.0019; p-values < 0.001) statistics. The z-scores of both statistics increased almost consistently to large values from 1980 to 2014, indicating highly significant clustering (Table 1). Clustering was weakest from 1980 to 1990 but increased sharply and consistently thereafter.

Table 1

Results of the global Moran’s I and General G statistic of age-adjusted LRI mortality rates, continental US, 1980-2014.

Year	Moran’s I (index)	General G (index)	Moran’s I (z-score)	General G (z-score)	Type of distribution	P-value
1980	0.38	0.0019	36.31	8.27	Clustered	∼ 0
1985	0.36	0.0019	34.59	8.40	Clustered	∼ 0
1990	0.37	0.0019	35.04	9.57	Clustered	∼ 0
1995	0.41	0.0018	39.50	12.10	Clustered	∼ 0
2000	0.49	0.0018	47.00	15.50	Clustered	∼ 0
2005	0.53	0.0018	51.06	18.81	Clustered	∼ 0
2010	0.58	0.0018	55.79	22.24	Clustered	∼ 0
2014	0.61	0.0018	58.35	24.68	Clustered	∼ 0

In the earlier years of the study period (1980–1985), the hotspots of LRI mortality rates identified by the Getis-Ord Gi* hotspot detection technique were mostly concentrated in the western US. In contrast, from 1990 to 2000, these hotspots became less prominent, while LRI hotspots shifted toward the southeastern parts of the US (Fig. 3). These counties continued to represent hotspots through the remaining periods.

Fig. 3

Location of hotspots of LRI mortality rates in the continental US using Getis-Ord Gi* hotspot detection technique, 1980-2014.

In total, 118 counties (3.8 % of US counties) were persistently identified as part of LRI hotspots (Fig. 4). Counties in Georgia (n = 49), Kentucky (n = 25), and Virginia (n = 22) accounted for 81.3 % of all persistent hotspot counties.

Fig. 4

Location of counties that were persistently identified as hotspots of LRI mortality rates by Getis-Ord Gi* hotspot detection technique, 1980-2014.

All the classification algorithms predicted the hotspots of LRI mortality rates with relatively high accuracy (≥ 0.84); however, GBDT and RF were the most accurate models (0.92) (Table 2). Precision-recall plots of the employed models (Fig. 5) showed that GBDT had the highest PR AUC, indicating the best combined precision and recall across different cut-off values.

Table 2

Evaluation metrics associated with each of the employed machine learning classifiers.

Classifier	Accuracy	Precision	Recall	F1-Score	ROC AUC	PR AUC	FPR
LR	0.84	0.75	0.87	0.78	0.86	0.72	0.17
RF	0.92	0.87	0.82	0.84	0.82	0.83	0.03
GBDT	0.92	0.87	0.83	0.85	0.83	0.84	0.04
KNN	0.90	0.84	0.80	0.82	0.80	0.82	0.05
SVM	0.91	0.83	0.86	0.84	0.86	0.82	0.07

Fig. 5

Results of the precision-recall curve for the employed machine learning classifiers. The dashed orange line marks the average precision.

GBDT achieved the highest F1-score (85 %) and PR AUC (84 %) compared to the other models, while the LR model had the worst performance (Table 2). Also, the results of RF were slightly better than those of KNN and SVM. Overall, of the employed machine learning algorithms, the decision trees (i.e., GBDT and RF) yielded more accurate predictions. The contributions of variables were analyzed for the GBDT and RF models (Fig. 6). The results of the GBDT model indicated that spring minimum temperature, winter precipitation, and median household income had the greatest positive influence in predicting the hotspots.

Fig. 6

Relative variable importance analysis using the gradient boosting and random forest decision trees. A detailed description of x-axis codes is provided in Supplementary Material.

Discussion

In this study, we integrated spatial statistical tools with machine learning classifiers in a GIS platform to identify hotspots of the LRI mortality rates across the continental US and to identify the most substantial LRI-associated environmental and socio-economic factors. Given the lack of nationwide spatial analysis and modeling of LRI, our modeling framework can be applied as a general protocol to other prevalent respiratory diseases in the US, such as asthma, chronic obstructive pulmonary disease, pneumonia, and COVID-19, to support public health decision making at the national level. Overall, there was a historical shift in hotspots away from the western US into the southeastern parts of the country, and the hotspots were highly localized in a few counties. Environmental factors contributed most strongly to these hotspots, while economic and social factors seem to be of secondary significance. According to Fischer et al. [45], advanced computational models can translate the occurrence of infectious diseases into decision-support tools. Unlike traditional models, machine learning algorithms can quantify the association between infectious disease and explanatory variables, even with incomplete or noisy data [26], in less time and at lower cost. Moran’s I and General G statistics confirmed that LRI mortality rates are spatially clustered (P < 0.001) across the continental US. Counties with high mortality rates tend to be located closer together than expected by chance. Using Getis-Ord Gi*, we identified several hotspots across the continental US. Additionally, spatio-temporal analysis of the clusters found a notable geographic shift in the location of hotspots from the west coast to the southeast of the US during the study period. 
The spatial pattern and shift in the locations of hotspots over time may partially reflect the vast differences in LRI mortality rates by drivers of geographic patterns, including environmental, socio-economic, and behavioral factors. It may also be attributed to health disparities or to improved health care quality, such as the PCV7 and PCV13 vaccination programs, during the study period. The latter is consistent with the substantial global decline of Streptococcus pneumoniae - the leading cause of LRI mortality - as estimated by GBD 2016 Lower Respiratory Infections Collaborators [46]. Moreover, some states (including Georgia, Kentucky, and Virginia) contained counties that were persistent hotspots, suggesting that resources and policy interventions should be targeted to these areas. All the classifiers showed considerable accuracy; however, due to the imbalanced dataset, in general, ensemble decision trees outperformed the (complex) SVM and the traditional and frequently applied LR. Additionally, SVM was slightly less accurate than the decision trees and is also less interpretable, slower to run, and more susceptible to overfitting. Allyn et al. [47] developed LR, RF, GBDT, SVM, and naïve Bayes models to predict the mortality of 4676 patients after elective cardiac surgery from December 2005 to December 2012. Their results showed that RF outperformed the other classifiers (AUC = 0.788). Our results are also in agreement with the findings of Churpek et al. [48], who compared LR, tree-based models, KNN, SVM, and neural networks. Their findings showed that RF was the most accurate classifier (AUC = 0.801), followed by the gradient boosting machine (AUC = 0.794). The findings of decision trees indicated that higher spring temperature and increased precipitation during winter are among the most substantial predictors of the presence or absence of the hotspots. 
The contribution of these environmental factors is most likely due to changes in the epidemiology of weather-sensitive pathogens and in host immune response, which can, in turn, lead to respiratory infections [49]. Other studies show that respiratory infections are seasonal and peak especially during winter and rainy months. Seasonality may play a role because people stay close together in enclosed environments during cold weather, which can facilitate the spread of infections during those seasons. For example, Thomas et al. [50] found that RSV infection was more prevalent in children during the winter months in Canada. In Malaysia, LRI was positively correlated with the monthly number of rainy days but negatively associated with the monthly mean temperature [51]. A study conducted in Pakistan showed that LRI cases were more frequent in months when the minimum temperature was lower [52]; however, in Brazil, statistically significant associations were found between viral LRI and increasing temperature and decreasing humidity [53]. Inconsistent findings may be due to different studied organisms or different spatial units of analysis. For example, conclusions at the individual level cannot be drawn from county-level studies because of the ecological fallacy. Moreover, age is a potential confounder that needs to be adjusted for, particularly in studying mortality rates of diseases, to avoid distorting the relationship. The findings of decision trees also implied that economic status, such as median household income and a higher proportion of the population living below the poverty line (according to the definition of the US Census Bureau, https://www.census.gov/), was among the substantial socio-economic factors in describing LRI hotspots. Although we cannot provide an explicit explanation for economic factors, poor access to basic treatments is a plausible explanation. The findings were consistent with a large body of literature worldwide. 
LRI was found predominantly in the disadvantaged populations in South Auckland, New Zealand [54]. These populations were living in areas in the bottom quintile for socio-economic deprivation and with high rates of smoke exposure and poor living conditions. Similarly, impoverished children living in informal households without electricity and running water had approximately four times higher LRI mortality rates in South Africa [55]. The current study has several limitations. First, the variables incorporated in the machine learning models undergo several transformations and are susceptible to measurement or analysis errors. Also, neglecting the role of spatial autocorrelation, especially in sparse data, may produce biased estimates of the importance of variables. Another limitation is attributed to the selection of spatial scale. The values within each county are treated as uniform, although there might be sharp contrasts between neighboring sub-county areas; however, the choice of the spatial unit was dictated by the available data. Future studies should analyze and predict hotspots of LRI at the sub-county level, such as zip code or census tract levels, for targeted human interventions, particularly for Virginia, Kentucky, and Georgia, which were persistently identified as LRI hotspots. Additionally, future LRI studies should incorporate the concentration of other criteria air pollutants, such as ground-level ozone, sulfur oxides, lead, carbon monoxide, and nitrogen oxides, as they may cause serious damage to internal organs, especially the lungs, which can lead to higher LRI mortality. To our knowledge, this is the first study that incorporated national datasets on the LRI mortality rate using machine learning algorithms. Despite the above limitations, these findings have important public health implications. Understanding why the counties with high LRI mortality rates cluster geographically can help to further reduce mortality in these regions. 
Moreover, the results of decision tree modeling can provide insight for future research geared toward identifying factors, such as median household income and climate, that contribute to elevated LRI mortality rates. Despite significant efforts to mitigate LRI mortality, there are many clustered counties, particularly in Georgia, Kentucky, and Virginia, where LRI mortality rates have remained elevated for the past 35 years.
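The spatial-autocorrelation limitation noted above can be made concrete with a small sketch. The Python function below computes global Moran's I, a standard index of spatial autocorrelation, for values such as model residuals observed on neighboring areal units. The function name `morans_i`, the four-county chain, and the residual values are illustrative assumptions and are not taken from the study's data or workflow. Values well above zero would suggest that neighboring counties share similar residuals, in which case variable-importance estimates from a non-spatial model should be read with caution.

```python
from typing import Sequence


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Global Moran's I for values observed on n areal units.

    weights[i][j] > 0 marks units i and j as neighbors; the diagonal is 0.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]            # deviations from the mean
    s0 = sum(sum(row) for row in weights)       # sum of all spatial weights
    den = sum(d * d for d in dev)               # total squared deviation
    if s0 == 0 or den == 0:
        raise ValueError("Moran's I is undefined for zero weights or constant values")
    # Cross-product of deviations over all neighboring pairs
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * (num / den)


# Hypothetical example: four counties in a row, each adjacent to the next.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered_residuals = [1.0, 1.0, 0.0, 0.0]    # similar values sit side by side
alternating_residuals = [1.0, 0.0, 1.0, 0.0]  # neighbors always differ

i_clustered = morans_i(clustered_residuals, chain_weights)
i_alternating = morans_i(alternating_residuals, chain_weights)
print(i_clustered, i_alternating)
```

In this toy case the clustered pattern gives a positive index and the alternating pattern a negative one. A real analysis would build the weights from county adjacency and assess significance, for example with a permutation test.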

CRediT authorship contribution statement

Abolfazl Mollalo: Conceptualization, Writing - original draft, Data curation, Formal analysis, Writing - review & editing. Behrooz Vahedi: Formal analysis. Shreejana Bhattarai: Writing - review & editing. Laura C. Hopkins: Writing - review & editing. Swagata Banik: Writing - review & editing. Behzad Vahedi: Conceptualization, Writing - review & editing.

Declaration of Competing Interest

The authors report no declarations of interest.

39 in total

1. Spatial clusters of child lower respiratory illnesses associated with community-level risk factors.

Authors: Paloma I Beamer; Nathan Lothrop; Zhenqiang Lu; Rebecca Ascher; Kacey Ernst; Debra A Stern; Dean Billheimer; Anne L Wright; Fernando D Martinez
Journal: Pediatr Pulmonol Date: 2015-10-05

2. The State of US Health, 1990-2016: Burden of Diseases, Injuries, and Risk Factors Among US States.

Authors: Ali H Mokdad; Katherine Ballestros; Michelle Echko; Scott Glenn; Helen E Olsen; Erin Mullany; Alex Lee; Abdur Rahman Khan; Alireza Ahmadi; Alize J Ferrari; Amir Kasaeian; Andrea Werdecker; Austin Carter; Ben Zipkin; Benn Sartorius; Berrin Serdar; Bryan L Sykes; Chris Troeger; Christina Fitzmaurice; Colin D Rehm; Damian Santomauro; Daniel Kim; Danny Colombara; David C Schwebel; Derrick Tsoi; Dhaval Kolte; Elaine Nsoesie; Emma Nichols; Eyal Oren; Fiona J Charlson; George C Patton; Gregory A Roth; H Dean Hosgood; Harvey A Whiteford; Hmwe Kyu; Holly E Erskine; Hsiang Huang; Ira Martopullo; Jasvinder A Singh; Jean B Nachega; Juan R Sanabria; Kaja Abbas; Kanyin Ong; Karen Tabb; Kristopher J Krohn; Leslie Cornaby; Louisa Degenhardt; Mark Moses; Maryam Farvid; Max Griswold; Michael Criqui; Michelle Bell; Minh Nguyen; Mitch Wallin; Mojde Mirarefin; Mostafa Qorbani; Mustafa Younis; Nancy Fullman; Patrick Liu; Paul Briant; Philimon Gona; Rasmus Havmoller; Ricky Leung; Ruth Kimokoti; Shahrzad Bazargan-Hejazi; Simon I Hay; Simon Yadgir; Stan Biryukov; Stein Emil Vollset; Tahiya Alam; Tahvi Frank; Talha Farid; Ted Miller; Theo Vos; Till Bärnighausen; Tsegaye Telwelde Gebrehiwot; Yuichiro Yano; Ziyad Al-Aly; Alem Mehari; Alexis Handal; Amit Kandel; Ben Anderson; Brian Biroscak; Dariush Mozaffarian; E Ray Dorsey; Eric L Ding; Eun-Kee Park; Gregory Wagner; Guoqing Hu; Honglei Chen; Jacob E Sunshine; Jagdish Khubchandani; Janet Leasher; Janni Leung; Joshua Salomon; Jurgen Unutzer; Leah Cahill; Leslie Cooper; Masako Horino; Michael Brauer; Nicholas Breitborde; Peter Hotez; Roman Topor-Madry; Samir Soneji; Saverio Stranges; Spencer James; Stephen Amrock; Sudha Jayaraman; Tejas Patel; Tomi Akinyemiju; Vegard Skirbekk; Yohannes Kinfu; Zulfiqar Bhutta; Jost B Jonas; Christopher J L Murray
Journal: JAMA Date: 2018-04-10 Impact factor: 56.272

3. Clinical Features and Outcome of Children with Severe Lower Respiratory Tract Infection Admitted to a Pediatric Intensive Care Unit in South Africa.

Authors: Hayley K Hutton; Heather J Zar; Andrew C Argent
Journal: J Trop Pediatr Date: 2019-02-01 Impact factor: 1.165

4. Respiratory syncytial virus subgroup B dominance during one winter season between 1987 and 1992 in Vancouver, Canada.

Authors: E Thomas; M J Margach; C Orvell; B Morrison; E Wilson
Journal: J Clin Microbiol Date: 1994-01 Impact factor: 5.948

5. Trends in bronchiolitis hospitalizations in the United States, 2000-2009.

Authors: Kohei Hasegawa; Yusuke Tsugawa; David F M Brown; Jonathan M Mansbach; Carlos A Camargo
Journal: Pediatrics Date: 2013-06-03 Impact factor: 7.124

Review 6. Child health and living at high altitude.

Authors: S Niermeyer; P Andrade Mollinedo; L Huicho
Journal: Arch Dis Child Date: 2008-12-09 Impact factor: 3.791

7. Differential respiratory health effects from the 2008 northern California wildfires: A spatiotemporal approach.

Authors: Colleen E Reid; Michael Jerrett; Ira B Tager; Maya L Petersen; Jennifer K Mann; John R Balmes
Journal: Environ Res Date: 2016-06-15 Impact factor: 6.498

8. Comparison of Random Forest, k-Nearest Neighbor, and Support Vector Machine Classifiers for Land Cover Classification Using Sentinel-2 Imagery.

Authors: Phan Thanh Noi; Martin Kappas
Journal: Sensors (Basel) Date: 2017-12-22 Impact factor: 3.576

Review 9. The risk of lower respiratory tract infection following influenza virus infection: A systematic and narrative review.

Authors: Ryan E Malosh; Emily T Martin; Justin R Ortiz; Arnold S Monto
Journal: Vaccine Date: 2017-11-20 Impact factor: 3.641

10. Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory infections in 195 countries, 1990-2016: a systematic analysis for the Global Burden of Disease Study 2016.

Authors:
Journal: Lancet Infect Dis Date: 2018-09-19 Impact factor: 71.421

4 in total

Review 1. A review of GIS methodologies to analyze the dynamics of COVID-19 in the second half of 2020.

Authors: Ivan Franch-Pardo; Michael R Desjardins; Isabel Barea-Navarro; Artemi Cerdà
Journal: Trans GIS Date: 2021-07-11

2. Leveraging data analytics to understand the relationship between restaurants' safety violations and COVID-19 transmission.

Authors: Arthur Huang; Efrén de la Mora Velasco; Ashkan Farhangi; Anil Bilgihan; Melissa Farboudi Jahromi
Journal: Int J Hosp Manag Date: 2022-05-11

3. Spatial statistical analysis of pre-existing mortalities of 20 diseases with COVID-19 mortalities in the continental United States.

Authors: Abolfazl Mollalo; Kiara M Rivera; Nasim Vahabi
Journal: Sustain Cities Soc Date: 2021-01-28 Impact factor: 7.587

4. Burden of Respiratory Infection and Tuberculosis Among US States from 1990 to 2019.

Authors: Wen Zhong; Nicola Luigi Bragazzi; Jude Dzevela Kong; Saeid Safiri; Masoud Behzadifar; Jun Liu; Xinyao Liu; Weijun Wang
Journal: Clin Epidemiol Date: 2021-06-29 Impact factor: 4.790
