| Literature DB >> 32905083 |
Louis Anthony Cox1, Douglas A Popken1.
Abstract
In the first half of 2020, much excitement in news media and some peer reviewed scientific articles was generated by the discovery that fine particulate matter (PM2.5) concentrations and COVID-19 mortality rates are statistically significantly positively associated in some regression models. This article points out that they are non-significantly negatively associated in other regression models, once omitted confounders (such as latitude and longitude) are included. More importantly, positive regression coefficients can and do arise when (generalized) linear regression models are applied to data with strong nonlinearities, including data on PM2.5, population density, and COVID-19 mortality rates, due to model specification errors. In general, statistical modeling accompanied by judgments about causal interpretations of statistical associations and regression coefficients - the current weight-of-evidence (WoE) approach favored in much current regulatory risk analysis for air pollutants - is not a valid basis for determining whether or to what extent risk of harm to human health would be reduced by reducing exposure. The traditional scientific method based on testing predictive generalizations against data remains a more reliable paradigm for risk analysis and risk management.Entities:
Keywords: Air pollution; Bayesian networks; CART trees; COVID-19 mortality risk; Causation; Health effects; Machine learning; Model specification error; PM2.5; Random forest; Regression; Scientific method
Year: 2020 PMID: 32905083 PMCID: PMC7462829 DOI: 10.1016/j.gloepi.2020.100033
Source DB: PubMed Journal: Glob Epidemiol ISSN: 2590-1133
Fig. 1Importance plots for several variables as predictors of COVID-19 mortality (left) and case rates (right) per 100,000 people. The plots are generated by random forest nonparametric model ensembles that explain about 48% of the variance in mortality rates and 40% of the variability in case rates among counties in the United States as of April 2020. Appendix A provides details and data. “Importance” is measured as the percentage increase in mean squared prediction error (“%IncMSE”) if a variable is dropped as a predictor. Variable labels are defined in the text and in Appendix A for the most important variables; see data sources in Appendix A for all variables.
“%IncMSE” is the percentage increase in mean squared prediction error from dropping a variable.
Fig. 2Bayesian network for COVID-19 mortality (deaths per 100,000 people) showing statistical dependencies among variables. An arrow between two variables indicates that they are informative about each other (i.e., not statistically independent).
Fig. 3Bayesian network for COVID-19 case rate (cases reported per 100,000 people).
Fig. 4A classification and regression tree (CART) tree for COVID-19 mortality (DeathRate100K). The tree was generated by the rpart package in R.
Response: DeathRate100K.
Inputs: X2000.2016AveragePM25, PCT_BLACK, PCT_ED_HS, Latitude, Longitude, FebMinTmp2000.2019, FirstCaseDays, PCT_HISP, WINTERAVGTMP, PopDensity.
Mutiple linear regression model for COVID-19 mortality rate. The columns give standardized regression coefficients (b*) and their standard errors; unstandardized regression coefficients (b) and their standard errors; and t-test values and significance levels (p-values) for each coefficient.
| Regression Summary for Dependent Variable: DeathRate100K | ||||||
|---|---|---|---|---|---|---|
| b* | Std.Err. | b | Std.Err. | t(3000) | p-value | |
| Intercept | −15.81 | 9.28 | −1.7 | 0.0885 | ||
| 2000-2016AveragePM25 | 0.00 | 0.028 | −0.04 | 0.23 | −0.2 | 0.8718 |
| PCT_HISP | 0.09 | 0.021 | 13.17 | 3.19 | 4.1 | 0.0000 |
| PCT_BLACK | 0.28 | 0.021 | 40.75 | 2.99 | 13.6 | 0.0000 |
| PopDensity | 0.04 | 0.018 | 0.00 | 0.00 | 2.2 | 0.0261 |
| WINTERAVGTMP | 0.09 | 0.034 | 0.17 | 0.06 | 2.7 | 0.0063 |
| FirstCaseDays | 0.17 | 0.018 | 0.21 | 0.02 | 9.6 | 0.0000 |
| Latitude | 0.18 | 0.037 | 0.77 | 0.16 | 4.8 | 0.0000 |
| Longitude | 0.15 | 0.025 | 0.27 | 0.05 | 5.9 | 0.0000 |
Mutiple linear regression model for COVID-19 mortality rate with Longitude omitted.
| Regression Summary for Dependent Variable: DeathRate100K | ||||||
|---|---|---|---|---|---|---|
| b* | Std.Err. | b | Std.Err. | t(3001) | p-value | |
| Intercept | −38.08 | 8.52 | −4.5 | 0.0000 | ||
| 2000-2016AveragePM25 | 0.09 | 0.02 | 0.77 | 0.19 | 4.0 | 0.0001 |
| PCT_HISP | 0.06 | 0.02 | 9.05 | 3.13 | 2.9 | 0.0038 |
| PCT_BLACK | 0.29 | 0.02 | 41.08 | 3.00 | 13.7 | 0.0000 |
| PopDensity | 0.04 | 0.02 | 0.00 | 0.00 | 2.5 | 0.0109 |
| WINTERAVGTMP | 0.04 | 0.03 | 0.08 | 0.06 | 1.3 | 0.1876 |
| FirstCaseDays | 0.17 | 0.02 | 0.21 | 0.02 | 9.6 | 0.0000 |
| Latitude | 0.15 | 0.04 | 0.63 | 0.16 | 4.0 | 0.0001 |
Fig. 5Scatter plots of average estimated historical PM2.5 concentrations (in micrograms per cubic meter) (red squares) and COVID-19 deaths per 100,000 (blue circles) vs. Longitude.
Fig. 6Scatter plot of COVID-19 deaths per 100,000 (DeathRate100k) against PopDensityLog. A non-parametric (lowess) smoothing regression curve is superimposed to aid visual interpretation.
A multiple linear regression model for the dependent variable PopDensityLog2.
| Regression Summary for Dependent Variable: Risk = PopDensityLog^2 | ||||||
|---|---|---|---|---|---|---|
| b* | Std.Err. | b | Std.Err. | t(3000) | p-value | |
| Intercept | −47.30 | 4.19 | −11.3 | 0.000000 | ||
| 2000-2016AveragePM25 | 0.35 | 0.02 | 1.99 | 0.11 | 18.9 | 0.000000 |
| PCT_BLACK | −0.01 | 0.01 | −1.32 | 1.35 | −1.0 | 0.326789 |
| PCT_HISP | 0.16 | 0.01 | 17.09 | 1.44 | 11.9 | 0.000000 |
| Latitude | 0.37 | 0.02 | 1.09 | 0.07 | 15.0 | 0.000000 |
| Longitude | 0.17 | 0.02 | 0.21 | 0.02 | 10.1 | 0.000000 |
| WINTERAVGTMP | 0.22 | 0.02 | 0.28 | 0.03 | 9.9 | 0.000000 |
| FirstCaseDays | 0.39 | 0.01 | 0.33 | 0.01 | 33.5 | 0.000000 |
| PopDensity | 0.38 | 0.01 | 0.00 | 0.00 | 33.1 | 0.000000 |
Fig. 7A CART tree model for the dependent variable PopDensityLog2.
Data sources and variable overview.
| Data Category | Source | Comments |
|---|---|---|
| PM2.5 | Pm2.5 annual average data from the Atmospheric Composition Analysis Group ( | We averaged across grid cells in each county, and produced a 2000–2016 average, as well as separate values for each year 2000–2018. |
| County boundaries | U. S. Census | Used to provide county boundaries for PM2.5 attribution, land area for popdensity, and centroids lat/lons. |
| Demographic | U.S. Census Bureau online API. 2018 ACS5 (American Community Survey 5-year data ending in 2018). County level data. | List of variables in table below. |
| Temperatures | NOAA. County level annual data | We averaged across years 2000–2019 for long term averages by month. We also extracted monthly averages for Jan-Apr 2020. |
| Humidity | Humidity averages by U.S. weather station (city) through 2018. | For each county centroid, the humidity data from the closest (based on lat/lon coordinates) weather station is obtained. |
| Hospital beds | Hospital level data with county identifier from Homeland Infrastructure Foundation-Level Data (HIFLD) | Aggregated hospitals over counties. Converted to beds per 100 K population. Log version also. |
| Mitigation policies | State level governmental COVID-19 policies compiled byRaifman et al., Boston University School of Public Health, COVID-19 United States state policy database ( | Used to compute days since stay-at-home order and days since closure of non-essential businesses (from 5/11/2020) |
| Behavioral | County level data from Robert Wood Johnson | Smoking, Obesity, and overall health |
| County Population | Used to scale various variables | |
| County attributes | Selected variables from the COVID Severity Forecasting project. UC Berkeley Departments of Statistics, EECS led by Professor Bin Yu | List of variables in table below. |
| Economic characteristics | County level data from USDA - | 3 county coding schemes described in table below. |
| Outcomes (deaths, cases, days since first case) | Cumulative values by date downloaded from | 5/11 values for cases and deaths extracted. Converted to per 100 K. Days since first case computed by using first case date. |
Additional variable details.
| Variable | Category | Description |
|---|---|---|
| PCT_POVERTY | Demographic | % below poverty |
| PCT_OWNEDHOM | Demographic | % owning home |
| PCT_ED_HS | Demographic | % with high school education |
| PCT_BLACK | Demographic | % black |
| PCT_HISP | Demographic | % hispanic |
| MED_INCOMERATIO | Demographic | Median income, converted to ratio relative to mean over counties |
| MED_HOMERATIO | Demographic | Median home value, converted to ratio relative to mean over counties |
| PCT_65PLUS | Demographic | % 65+ years |
| PCT_45TO64 | Demographic | % 45–64 years |
| PCT_15TO44 | Demographic | % 15–44 years |
| Rural-urban_ContinuumCode_2013 | Economic characteristics | 1–9 code indicating county degree of urbanization. |
| Urban_Influence_Code_2013 | Economic characteristics | 1–12 code indicating county degree of urban influence. |
| Economic_typology_2015 | Economic characteristics | 1–6 code indicating county economic condition. |
| PopDensity[Log] | County Population | 2018 Population estimate divided by land area (square miles) from shape files. Log version also. |
| CensusRegionName | County Attributes | Created binary column for each level. |
| CensusDivisionName | County Attributes | Created binary column for each level. |
| StateName | County Attributes | Created binary column for each level. |
| dem_to_rep_ratio | County Attributes | Ratio of registered democrats to republicans in county |
| #ICU_beds100K[Log] | County Attributes | Number of ICU beds per 100 K population. Log version also. |