Arnold Kamis, Rui Cao, Yifan He, Yuan Tian, Chuyue Wu.
Abstract
In this research, we take a multivariate, multi-method approach to predicting the incidence of lung cancer in the United States. We obtain public health and ambient emission data from multiple sources for 2000–2013 to model lung cancer incidence in the period 2013–2017. We compare several models using four sources of predictor variables: adult smoking, state, environmental quality index, and ambient emissions. The environmental quality index variables pertain to macro-level domains: air, land, water, socio-demographic, and built environment. The ambient emissions consist of Cyanide compounds, Carbon Monoxide, Carbon Disulfide, Diesel Exhaust, Nitrogen Dioxide, Tropospheric Ozone, Coarse Particulate Matter, Fine Particulate Matter, and Sulfur Dioxide. We find that the best regression model explains 62 percent of the variance, whereas the best machine learning model explains 64 percent of the variance with 10 percent less error. The most hazardous ambient emissions are Coarse Particulate Matter, Fine Particulate Matter, Sulfur Dioxide, Carbon Monoxide, and Tropospheric Ozone. These ambient emissions could be curtailed to improve air quality, thus reducing the incidence of lung cancer. We interpret and discuss the implications of the model results, including the tradeoff between transparency and accuracy. We also review the limitations of the current models and directions for extending and refining them.
Keywords: adult smoking; ambient emissions; environmental quality index; iterative modeling; lung cancer; machine learning; regression; transparency; united states
Year: 2021 PMID: 34204140 PMCID: PMC8201047 DOI: 10.3390/ijerph18116127
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. Cancer of the Lung and Bronchus, United States, 2017, Rate per 100,000 people, All ages, all races/ethnicities, Male and Female. Source: Centers for Disease Control and Prevention.
Figure 2. Cigarette Use by Adults, United States, 2018. Source: Centers for Disease Control and Prevention.
Literature Review.
| ID | Variables | Methods/Data Source(s) | Findings |
|---|---|---|---|
| [ | Fine particulates, including sulfates | Regression, 14–16 year mortality follow-up of 8111 adults in 6 US cities/Prospective cohort study | After adjusting for smoking, mortality strongly associated with air pollution with fine particulates |
| [ | Race and socioeconomic and gender predictors of early-stage non-small cell lung cancer | Regressions/SEER | Higher socioeconomic status helps survival, as does being Caucasian or female. |
| [ | PM10, SO4, SO2, O3, and NO2 checked for lung cancer | 6,338 nonsmoking, non-Hispanic white SDA residents of California were enrolled in 1977/Adventist Health Study (AHS) | Levels of PM10, SO4, SO2, O3, NO2 far higher for those with lung cancer, especially in males. |
| [ | PM2.5 and SO2, lung cancer, lung cancer mortality | Cox Proportional hazards model/American cancer society, part of cancer prevention study (CPS-II), ongoing prospective mortality study of 1.2 M adults | PM2.5 and SO2 associated with lung cancer; each 10 microgram/m3 increase associated with 8% increase in lung cancer mortality |
| [ | Race, gender, SE class, chemicals, not just smoking | Datasets from SEER and NPCR/National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Program and the Centers for Disease Control and Prevention’s National Program of Cancer Registries (NPCR) | Epidemiological and biological studies show strong causation between smoking and cellular mutations; racial disparities: Black worst, then white, then other races; lower socioeconomic class is strongly associated with lung cancer; gender is not; race seems to be a proxy for socioeconomic class |
| [ | Carbon dioxide, ozone, cancer, Ozone Mortality, Ozone Hospitalization, Ozone Emergency Room Visits, and Particulate Matter Mortality | Mathematical model/NASA, EPA, and California air resources | A climate-air pollution model showed by cause-and-effect analysis that fossil-fuel CO2 increases U.S. surface ozone, carcinogens, and Coarse Particulate Matter, increasing cancer rates |
| [ | asbestos fibers and ambient Coarse Particulate Matter PM10, PM2.5 and diesel exhaust particles | Chemicals purchased and combined with smoke, passed through filters/experiments | Synergistic effects in the generation of hydroxyl radicals in smoke with environmental asbestos fibers and ambient PM10, PM2.5 and diesel exhaust particles (DEP). The highest synergistic effects were observed with the asbestos fibers, PM2.5 and DEP, producing redox recycling and oxidative action. |
| [ | Ozone and PM2.5 to predict premature (excess) mortality | Simulations of preindustrial and present-day (2000) concentrations included rural areas/epidemiology literature | Tropospheric O3 and PM2.5 contribute substantially to global premature mortality from lung cancer, which is 14% higher than baseline. |
| [ | Socioeconomic, Rural-Urban, and Racial Inequalities in US Cancer Mortality | Stats (regression)/three national data sources: the national mortality database, the decennial census, and the 2009–2010 Area Resource File | Blacks experienced higher mortality from each cancer than whites within each deprivation group. Socioeconomic gradients in mortality were steeper in non-metropolitan than in metropolitan areas. Mortality disparities may reflect inequalities in smoking and other cancer-risk factors, screening, and treatment. |
| [ | All of them | Statistics | Intersectionality of all the variables |
| [ | PM2.5 and O3 | 80,285 AHSMOG-2 participants were followed for an average of 7.5 years; Logistic regression/Adventist Health and Smog Study-2 (AHSMOG-2), a cohort of health-conscious nonsmokers, where 81% have never smoked. | Lung cancer is associated with PM2.5 in never smokers, and the association is slightly stronger for those spending 1+ hours/day outdoors or 5+ years at their residence. |
| [ | Cancer risk index (CRI) Incidence of cancer risk from air toxics | Statistical modelling of San Antonio, Texas; racial disparities found/Data for CRI from National Air Toxics Assessment [ | Cancer risk index is positively correlated with ambient diesel coarse particulate matter. Institutional transformations are essential to mitigate the social-ecological divide. |
| [ | Radon, Lung cancer | Meta-analysis of 8 case-control studies of indoor radon, where | Relative risk is 14% greater for those exposed to indoor radon versus the controls |
| [ | Occupational lung cancer, asbestos, arsenic, chromium, radon, silica, beryllium, nickel, cadmium, diesel exhaust | Review of many studies of workers in the U.S. | Conservative estimates are that relative risk of occupational lung cancer is 1.31 for diesel fumes, 2.0 for asbestos, and 3.69 for arsenic; several million exposed workers in early 1980s |
| [ | 24 experts in a working group | Review of many studies: human, occupational, outdoor, indoor, animal. | From many sources, respirable PM10, PM2.5, NO2, SO2, and O3 are frequently and substantially above safe levels. Consistency in studies shows cellular damage, as well as genetic and epigenetic effects. |
| [ | Demographics, cancer types, cigarette features all lead to mutations and other changes in the genes | Review of smoking: all epidemiological and biological studies show strong causation, and lung cancer incidence parallels the rise and fall of cigarette smoking/Many sources | Prevention and cessation are important because smoking causes cancer in all demographics. Smoking is the most important cause of lung cancer. |
| [ | Incidence and survival of Small-Cell Lung Cancer among all lung cancers by Gender and Smoking and Stage of cancer | Analysis of the Surveillance, Epidemiology, and End Results (SEER) database | Proportion of SCLC has diminished, and survival has increased slightly, attributed to decreasing smoking and increased proportion of low-tar cigarettes |
Variables and their Descriptions, Timeframe, Data source.
| Var. | Description | Time | Data Source |
|---|---|---|---|
| New Case of Lung Cancer | Cancer of the Lung or Bronchus, All Ages, All Races/Ethnicities, Male and Female. Rate per 100,000 people | 2013–2017 (mean) | CDC United States Cancer Statistics |
| Adult Smoking | Percentage of adults who are current smokers (county level) | 2011–2013 (mean) | County Health Ranking Organization |
| Land EQI | Environmental Quality Index–Land Domain | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| SocioD EQI | Environmental Quality Index–Socio-Demographic Domain | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| Built EQI | Environmental Quality Index–Built Environment Domain | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| Air EQI | Environmental Quality Index–Air Domain | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| Water EQI | Environmental Quality Index–Water Domain | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| PM2.5_T1 | Fine Particulate Matter (2.5 micrometers or smaller) Mean of 24 h period | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| PM10_T1 | Coarse Particulate Matter (10 micrometers or smaller) based on Mean of 24 h period | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| SO2_T1 | Sulfur Dioxide | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| NO2_T1 | Nitrogen Dioxide | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| CO_T1 | Carbon Monoxide | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| O3_T1 | Tropospheric (ground level) Ozone | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| CN_T1 | Cyanide compounds | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| Diesel | Gaseous exhaust produced by a diesel type of internal combustion engine | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| CS2 | Carbon Disulfide | 2000–2005 (mean) | Air Quality-Lung Cancer Data |
| PM2.5_T2 | Fine Particulate Matter (2.5 micrometers or smaller), weighted annual mean (mean weighted by calendar quarter), based on weighted mean 24 h | 2006–2010 (mean) | EPA Outdoor Air Quality Data |
| PM10_T2 | Coarse Particulate Matter (10 micrometers or smaller), weighted annual mean (mean weighted by calendar quarter), based on weighted mean 24 h | 2006–2010 (mean) | EPA Outdoor Air Quality Data |
| SO2_T2 | Sulfur Dioxide Mean 1 h (the annual mean of all the 1-h measurements in the year) | 2006–2010 (mean) | EPA Outdoor Air Quality Data |
| NO2_T2 | Nitrogen Dioxide Mean 1 h (the annual mean of all the 1-h measurements in the year) | 2006–2010 (mean) | EPA Outdoor Air Quality Data |
| CO_T2 | Carbon Monoxide 2nd Max 8 h (the 2nd highest non-overlapping 8-h avg in the year) | 2006–2010 (mean) | EPA Outdoor Air Quality Data |
| O3_T2 | Tropospheric Ozone 4th Max 8 h, the 4th highest daily max 8-h average in the year | 2006–2010 (mean) | EPA Outdoor Air Quality Data |
Variables and Data Cleaning.
| Variable | Description | Imputation | Transformation |
|---|---|---|---|
| New Cases of Lung Cancer | Cancer of the Lung/Bronchus, Rate per 100,000 people | none | none |
| Adult Smoking | Percentage of adults who are current smokers | none | none |
| PM2.5_T1 | Particulate Matter 2.5 in Time 1 | none | none |
| PM10_T1_log | Particulate Matter 10 in Time 1 | none | Logarithm |
| SO2_T1_log | Sulfur Dioxide in Time 1 | none | Logarithm |
| NO2_T1_log | Nitrogen Dioxide in Time 1 | none | Logarithm |
| CO_T1_log | Carbon Monoxide in Time 1 | median | Logarithm |
| EQI_Land | Environmental Quality Index, Land Domain | median | none |
| EQI_SocioD | Environmental Quality Index, Socio-Demographic Domain | none | none |
| EQI_Built | Environmental Quality Index, Built Domain | none | none |
| O3_T1_log | Tropospheric Ozone in Time 1 | none | Logarithm |
| CN_log | Cyanide compounds | none | Logarithm |
| Diesel_log | Diesel Exhaust | none | Logarithm |
| CS2_log | Carbon Disulfide | none | Logarithm |
| EQI_Air | Environmental Quality Index, Air Domain | none | none |
| EQI_Water | Environmental Quality Index, Water Domain | none | none |
| PM2.5_T2 | Particulate Matter 2.5 in Time 2 | none | none |
| PM10_T2 | Particulate Matter 10 in Time 2 | median | none |
| SO2_T2_log | Sulfur Dioxide in Time 2 | none | Logarithm |
| CO_T2 | Carbon Monoxide in Time 2 | none | none |
| O3_T2 | Tropospheric Ozone in Time 2 | median | none |
| NO2_T2 | Nitrogen Dioxide in Time 2 | --------- | --------- |
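The two cleaning steps named in the table above, median imputation and a logarithmic transformation, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names and sample values are hypothetical, and the natural logarithm is an assumption, since the table does not state the base.

```python
import math
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def log_transform(values):
    """Apply the natural logarithm (assumed base) to strictly positive values."""
    return [math.log(v) for v in values]

# Made-up county-level readings with one missing value (not from the paper).
co_t1 = [12.0, None, 9.5, 19.0, 7.0]

co_imputed = impute_median(co_t1)      # the missing value becomes the median, 10.75
co_t1_log = log_transform(co_imputed)  # analogous to the CO_T1_log variable
print(co_imputed)
print([round(v, 3) for v in co_t1_log])
```

Imputing before transforming keeps the fill value on the original measurement scale; because the logarithm is monotone, the order matters little for a median but would matter for a mean.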
Descriptive Statistics.
| Var. Type | Variable | Description | Min. | 1 Q | Median | Mean | 3 Q | Max. |
|---|---|---|---|---|---|---|---|---|
| Target | Lung Cancer | Lung/Bronchus Cancer Rate | 14.600 | 56.800 | 65.360 | 66.220 | 75.700 | 132.400 |
| Baseline | Adult Smoking | Current Adult Smokers (%) | 0.000 | 0.173 | 0.207 | 0.210 | 0.243 | 0.425 |
| Macro | EQI_Air | Environmental Quality Index, Air Domain | −2.532 | −0.349 | 0.177 | 0.147 | 0.692 | 2.508 |
| Macro | EQI_Built | Environmental Quality Index, Built Domain | −3.993 | −0.408 | 0.177 | 0.119 | 0.672 | 7.283 |
| Macro | EQI_Land | Environmental Quality Index, Land Domain | −3.149 | −0.395 | 0.207 | 0.078 | 0.672 | 2.095 |
| Macro | EQI_SocioD | Environmental Quality Index, Socio-Demographic Domain | −3.331 | −0.584 | 0.022 | 0.027 | 0.570 | 3.979 |
| Macro | EQI_Water | Environmental Quality Index, Water Domain | −1.701 | −0.614 | 0.359 | 0.063 | 0.889 | 1.478 |
| Micro | CN_log | Cyanide compounds | −3.743 | −2.118 | −1.812 | −1.842 | −1.523 | −0.022 |
| Micro | CO_T1_log | Carbon Monoxide | 0.650 | 2.248 | 2.555 | 2.503 | 2.944 | 3.800 |
| Micro | CO_T2 | Carbon Monoxide | 0.267 | 1.191 | 1.558 | 1.691 | 1.900 | 7.020 |
| Micro | CS2_log | Carbon Disulfide | −6.900 | −3.875 | −3.436 | −3.429 | −2.975 | 0.361 |
| Micro | Diesel_log | Diesel Exhaust | −1.773 | −0.711 | −0.526 | −0.539 | −0.356 | 0.495 |
| Micro | NO2_T1_log | Nitrogen Dioxide | 1.306 | 2.383 | 2.657 | 2.632 | 2.905 | 3.818 |
| Micro | NO2_T2 | Nitrogen Dioxide | 1.000 | 7.811 | 8.700 | 9.231 | 11.125 | 24.400 |
| Micro | O3_T1_log | Tropospheric Ozone | 2.341 | 3.456 | 3.641 | 3.598 | 3.810 | 4.876 |
| Micro | O3_T2 | Tropospheric Ozone | 0.053 | 0.069 | 0.072 | 0.071 | 0.075 | 0.090 |
| Micro | PM10_T1_log | Particulate Matter 10 | 1.030 | 2.129 | 2.452 | 2.406 | 2.692 | 3.678 |
| Micro | PM10_T2 | Particulate Matter 10 | 10.000 | 19.420 | 21.990 | 22.210 | 23.700 | 40.200 |
| Micro | PM2.5_T1 | Particulate Matter 2.5 | 2.167 | 7.940 | 10.417 | 9.941 | 11.782 | 16.912 |
| Micro | PM2.5_T2 | Particulate Matter 2.5 | 4.500 | 9.743 | 11.171 | 10.855 | 12.419 | 17.150 |
| Micro | SO2_T1_log | Sulfur Dioxide | 0.251 | 1.679 | 2.154 | 2.035 | 2.478 | 3.569 |
| Micro | SO2_T2_log | Sulfur Dioxide | 1.000 | 22.000 | 33.000 | 36.980 | 49.000 | 98.000 |
Figure 3. Matrix Plot of Lung Cancer, Adult Smoking, and Environmental Quality Index, all domains. Significance codes: 0 ‘***’ 0.001 ‘.’ 0.1 ‘ ’ 1.
Figure 4. Matrix Plot of Lung Cancer with Variables in Time 1: Particulate Matter 2.5 and 10, Carbon Disulfide, Cyanide compounds, Carbon Monoxide, Diesel Exhaust, Nitrogen Dioxide, Tropospheric Ozone, Sulfur Dioxide. Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.
Figure 5. Matrix Plot of Lung Cancer with Micro Variables in Time 2: Particulate Matter 2.5, Particulate Matter 10, Carbon Monoxide, Tropospheric Ozone, Sulfur Dioxide. Significance codes: 0 ‘***’ 0.001 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.
Figure 6. All Regression Models. Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.
Figure 7. Testing for Interaction.
Figure 8. Foundation + Environmental Quality Index; Residual standard error: 10.5 on 2236 degrees of freedom; Multiple R-squared: 0.6197, Adjusted R-squared: 0.6114; F-statistic: 74.36 on 49 and 2236 DF, p-value: < 2.2 × 10^−16; Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.
Figure 9. Foundation + Ambient Emissions; Residual standard error: 10.62 on 2228 degrees of freedom; Multiple R-squared: 0.613, Adjusted R-squared: 0.6026; F-statistic: 58.82 on 60 and 2228 DF, p-value: < 2.2 × 10^−16; Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.
Figure 10. Foundation + EQI + Ambient Emissions; Residual standard error: 10.41 on 2221 degrees of freedom; Multiple R-squared: 0.6285, Adjusted R-squared: 0.6178; F-statistic: 58.71 on 64 and 2221 DF, p-value: < 2.2 × 10^−16; Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.
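The adjusted R-squared and F-statistic in the captions of Figures 8 to 10 follow from the multiple R-squared and the two degrees-of-freedom values by the standard ordinary least squares formulas. The sketch below recomputes them as a consistency check. The function names are illustrative, and the inputs are the rounded figures from the captions, so agreement is only expected to the printed precision.

```python
def adjusted_r2(r2, residual_df, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / residual_df, with n - 1 = residual_df + p."""
    return 1.0 - (1.0 - r2) * (residual_df + n_predictors) / residual_df

def f_statistic(r2, residual_df, n_predictors):
    """Overall F = (R^2 / p) / ((1 - R^2) / residual_df)."""
    return (r2 / n_predictors) / ((1.0 - r2) / residual_df)

# (multiple R^2, residual DF, numerator DF) as reported in the captions.
reported = {
    "Figure 8": (0.6197, 2236, 49),
    "Figure 9": (0.6130, 2228, 60),
    "Figure 10": (0.6285, 2221, 64),
}

for name, (r2, df, p) in reported.items():
    print(name, round(adjusted_r2(r2, df, p), 4), round(f_statistic(r2, df, p), 2))
```

Each recomputed value matches its caption, which suggests that the residual standard errors and R-squared figures reported there are internally consistent.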
LR: Linear Regression; RMSE: Root Mean Squared Error; MAE: Mean Absolute Error; MAPE: Mean Absolute Percentage Error.
| ID | Meth. | Variable Groups | Train (80%) adj. R2 | Train RMSE | Train MAE | Train MAPE | Test (20%) RMSE | Test MAE | Test MAPE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | LR | smoking + state + EQI + emissions | 0.617 | | | | 11.155 | 8.281 | 13.901 |
| 2 | LR | smoking + state + EQI | 0.612 | 10.168 | 7.556 | 12.241 | 11.167 | | |
| 3 | LR | smoking + state + emissions | 0.602 | 10.273 | 7.697 | 12.478 | 11.416 | 8.579 | 14.332 |
| 4 | LR | state | 0.527 | 11.239 | 8.198 | 13.324 | 11.792 | 8.664 | 14.414 |
| 5 | LR | emissions | 0.322 | 13.543 | 10.494 | 17.089 | 13.996 | 10.818 | 18.316 |
| 6 | LR | smoking | 0.308 | 13.724 | 10.367 | 17.401 | 14.777 | 11.083 | 19.289 |
| 7 | LR | EQI | 0.211 | 14.639 | 11.098 | 18.633 | 15.297 | 11.429 | 20.338 |
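The three error metrics reported in these model tables have standard definitions, sketched below in plain Python. The sample rates are made up for illustration and are not taken from the study; the source does not state how MAPE was scaled, so expressing it in percent is an assumption based on the magnitudes in the table.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared residual."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean Absolute Error: mean of the absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical incidence rates per 100,000 people and model predictions.
actual = [50.0, 80.0, 100.0]
predicted = [55.0, 72.0, 100.0]

print(round(rmse(actual, predicted), 3))
print(round(mae(actual, predicted), 3))
print(round(mape(actual, predicted), 3))
```

RMSE penalizes large misses more heavily than MAE, which is why RMSE is always at least as large as MAE in the tables.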
RF: Random Forest; GBT: Gradient Boosted Tree; RR: Ridge Regression; SVM: Support Vector Machine; RMSE: Root Mean Squared Error; MAE: Mean Absolute Error; MAPE: Mean Absolute Percentage Error.
| ID | Meth. | Variable Groups | Train (80%) adj. R2 | Train RMSE | Train MAE | Train MAPE | Test (20%) RMSE | Test MAE | Test MAPE |
|---|---|---|---|---|---|---|---|---|---|
| 8 | GBT | smoking + state + EQI | 0.611 | 10.340 | 7.611 | 12.334 | 9.976 | 7.377 | 12.054 |
| 9 | SVM | smoking + state + EQI + emissions | 0.634 | 10.026 | 7.335 | | 10.027 | 7.401 | 12.063 |
| 10 | RF | smoking + state + EQI + emissions | 0.639 | | | 11.926 | 10.068 | | |
| 11 | GBT | smoking + state + EQI + emissions | 0.625 | 10.151 | 7.445 | 12.132 | 10.239 | 7.535 | 12.252 |
| 12 | RR | smoking + state + EQI + emissions | 0.600 | 10.486 | 7.741 | 12.667 | 10.314 | 7.822 | 12.881 |
| 13 | RR | smoking + state + EQI | 0.598 | 10.507 | 7.758 | 12.688 | 10.322 | 7.793 | 12.784 |
| 14 | RF | smoking + state + EQI | 0.584 | 10.684 | 7.814 | 12.932 | 10.383 | 7.627 | 12.570 |
Figure 11. Feature Impact for Model 8: GBT (least squares loss, early stopping), excluding states having no impact.
Figure 12. Feature Impact for Model 10: Random Forest (500 trees, terminal node size = 5), excluding states having no impact.
Best performing Models: Variance Explained, Root Mean Squared Error.
| Predictor Variables | Linear Model (adj. R2, RMSE) | Non-Linear Model (adj. R2, RMSE) |
|---|---|---|
| smoking + state + EQI | Linear Regression (0.612, 11.167) | Gradient Boosted Tree (0.611, 9.976) |
| smoking + state + EQI + Emissions | Linear Regression (0.617, 11.155) | Random Forest (0.639, 10.068) |
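The "10% less error" comparison in the abstract can be checked against the RMSE values in this table. The short calculation below is a sketch that assumes the error in question is the second number in each pair (RMSE) and that the comparison is between the linear and non-linear model within each predictor set; the function name is illustrative.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# RMSE values from the table of best performing models.
lr_full_rmse, rf_full_rmse = 11.155, 10.068  # smoking + state + EQI + emissions
lr_eqi_rmse, gbt_eqi_rmse = 11.167, 9.976    # smoking + state + EQI

full_reduction = relative_reduction(lr_full_rmse, rf_full_rmse)
eqi_reduction = relative_reduction(lr_eqi_rmse, gbt_eqi_rmse)
print(round(full_reduction, 1), round(eqi_reduction, 1))
```

Both reductions come out at roughly ten percent, in line with the figure quoted in the abstract.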
Anthropogenic Sources of the Highest Impact Ambient Emissions.
| Ambient Emission | Anthropogenic Sources |
|---|---|
| Particulate Matter | Combustion of carbon-based fuels. Smokestacks; power plants, automobiles. Diesel- and gasoline-powered motor vehicles and equipment; burning wood in residential fireplaces, wood stoves, wildfires, agricultural and other fires. Cement dust, fly ash, oil smoke, and smog from construction sites, unpaved roads and fields [ |
| Sulfur Dioxide | Fuel combustion in mobile sources, e.g., automobiles, locomotives, ships, and other equipment; burning of fossil fuels (coal, oil, and diesel) or other materials that contain sulfur at power plants and other industrial facilities. Smelting of mineral ores (aluminum, copper, zinc, lead, and iron) that contain sulfur. Eastern states have more sulfate particles than the West, mostly because of sulfur dioxide emitted by large, coal-fired power plants [ |
| Carbon Monoxide | Burning of fossil fuels (gasoline, natural gas, oil, coal, and wood) in vehicles or machinery. Poorly vented gas appliances (furnaces, ranges, ovens, water heaters, clothes dryers, etc.), many in the home: fireplaces, wood and gas stoves; coal or oil furnaces; space heaters or oil or kerosene heaters; charcoal grills and camp stoves; gas-powered lawn mowers and power tools; automobile exhaust fumes; portable generators; leaking chimneys; cigarettes, pipes, and cigars smoked in the home. |