Kung-Min Wang1, Kun-Huang Chen2, Chrestella Ayu Hernanda1, Shih-Hsien Tseng1, Kung-Jeng Wang1.
Abstract
The threat of lung cancer has become a critical public health issue. Much research has been devoted to its clinical study, but only a few studies have addressed the issue from a holistic perspective that includes social, economic, and environmental dimensions. Therefore, in this study, risk factors (features) such as air pollution, tobacco use, socioeconomic status, employment status, marital status, and living environment were considered together when constructing a predictive model. These risk factors were analyzed and selected using stepwise regression and the variance inflation factor to reduce multicollinearity. To build efficient and informative prediction models of lung cancer incidence rates, several machine learning algorithms were evaluated with cross-validation, namely linear regression, support vector regression, random forest, K-nearest neighbor, and cubist model tree. A case study in Taiwan showed that the cubist model tree with feature selection was the best model, with an RMSE of 3.310 and an R-squared of 0.960. Through these predictive models, we also found that, apart from smoking, the average NO2 concentration, the employment percentage, and the number of factories were important factors with significant impacts on the incidence of lung cancer. In addition, the random forest models, both with and without feature selection, could support the interpretation of the most contributing variables. The predictive model proposed in the present study can help to analyze and estimate lung cancer incidence rates precisely so that effective preventative measures can be developed. Furthermore, the risk factors involved in the predictive model can help with the future analysis of lung cancer incidence rates from a holistic perspective.
Keywords: cubist model tree; feature selection; lung cancer incidence rate; machine learning algorithm; predictive model; random forest; variable importance
Year: 2022 PMID: 35886298 PMCID: PMC9316771 DOI: 10.3390/ijerph19148445
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 4.614
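The abstract describes evaluating each algorithm with cross-validation. The following is a minimal sketch of that idea, assuming a 5-fold split and using K-nearest neighbor regression as the example learner. The data are synthetic, and the helper names (`k_fold_indices`, `knn_predict`, `cross_validate`), the shuffle seed, and the choice of three neighbors are illustrative assumptions rather than the authors' settings.

```python
import random
from math import sqrt


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_x, train_y, query, k=3):
    """Predict the mean target of the k training rows closest to the query."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), y)
        for row, y in zip(train_x, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_validate(xs, ys, k_folds=5, k_neighbors=3):
    """Return one RMSE per fold, each computed on the held-out rows only."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k_folds):
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(ys)) if i not in held_out]
        sq_err = [
            (knn_predict(train_x, train_y, xs[i], k_neighbors) - ys[i]) ** 2
            for i in test_idx
        ]
        scores.append(sqrt(sum(sq_err) / len(sq_err)))
    return scores


# Synthetic data for illustration only (not the study's data).
xs = [(float(i), float(i % 4)) for i in range(40)]
ys = [2.0 * a + b for a, b in xs]

fold_rmse = cross_validate(xs, ys)
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print([round(s, 3) for s in fold_rmse], round(mean_rmse, 3))
```

Each row is used for testing exactly once, so the mean of the per-fold scores summarizes performance on data the model did not see during fitting.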
Figure 1. The trachea, bronchus, and lung cancer incidence rate in Taiwan.
Description of the predictive model variables.
| Factor | Variable (Notation) | Description | Data Type |
|---|---|---|---|
| Air pollution | Carbon monoxide (CO) | Average CO concentration (ppm) | Continuous |
| | Nitrogen dioxide (NO2) | Average NO2 concentration (ppb) | Continuous |
| | Sulfur dioxide (SO2) | Average SO2 concentration (ppb) | Continuous |
| | Ozone (O3) | Average O3 concentration (ppb) | Continuous |
| | Particulate matter 10 (PM10) | Average PM10 concentration (μg/m3) | Continuous |
| | Registered vehicles (VEHICLES) | Total number of registered vehicles, including buses, heavy trucks, sedans, light trucks, specially constructed vehicles, and motorcycles | Discrete |
| | Factories (FACTORIES) | Total number of factories | Discrete |
| Tobacco use | Tobacco consumption per capita (TOBACCO) | Consumption of tobacco per capita aged 18 and over (pieces/year) | Discrete |
| | Smokers rate (SMOKERS) | Percentage of smokers from population aged 18 and over | Continuous |
| Socioeconomic status | Rate of low-income persons (LI) | Percentage of low-income persons from total population | Continuous |
| Employment status | Percent employed (EMPLOYED) | Percentage of employed from civilian population aged 15 and over | Continuous |
| | Unemployment rate (UNEMPLOYMENT) | Total unemployment rate | Continuous |
| Marital status | Divorce status (DIVORCE) | Divorce status of population aged 15 and over | Continuous |
| Living environment | Rate of one-story buildings (ONE) | Number of households living in one-story buildings | Continuous |
| | Rate of apartments six stories or over (APARTMENTS) | Number of households living in apartments six stories or over | Continuous |
| | Rate of days with PSI > 100 (PSI) | Percentage of days measured with PSI > 100 | Continuous |
| | Availability rate of public sanitary sewers (SANITARY) | Percentage of public sanitary sewer availability | Continuous |
| | Rate of heavily polluted sections (POLLUTED) | Percentage of heavily polluted sections in the total length of major rivers | Continuous |
| | Rate of unqualified drinking water (UNQDRINK) | Percentage of unqualified drinking water as tested | Continuous |
| | Rate of proper refuse disposal (DISPOSAL) | Percentage of proper refuse disposal | Continuous |
| Dependent variable | Lung cancer incidence rate (LC) | Trachea, bronchus, and lung cancer (C33–C34) incidence rates per 100,000 in Taiwan | Continuous |
Figure 2. Research process.
Figure 3. Correlation plot.
Selected variables from the stepwise regression and feature selection.
| Factor | Predictor Variable | Stepwise Regression | Feature Selection |
|---|---|---|---|
| Air pollution | CO | ✓ | |
| | NO2 | ✓ | ✓ |
| | SO2 | ✓ | |
| | O3 | | |
| | PM10 | | |
| | VEHICLES | ✓ | ✓ |
| | FACTORIES | ✓ | ✓ |
| Tobacco use | TOBACCO | ✓ | ✓ |
| | SMOKERS | ✓ | ✓ |
| Socioeconomic status | LI | ✓ | |
| Employment status | EMPLOYED | ✓ | ✓ |
| | UNEMPLOYMENT | | |
| Marital status | DIVORCE | | |
| Living environment | ONE | | |
| | APARTMENTS | ✓ | |
| | PSI | ✓ | ✓ |
| | SANITARY | ✓ | |
| | POLLUTED | ✓ | |
| | UNQDRINK | ✓ | |
| | DISPOSAL | ✓ | ✓ |
| Total number of variables | | 15 | 8 |
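The abstract states that the variance inflation factor (VIF) was used alongside stepwise regression to limit multicollinearity. As a hedged illustration of what a VIF measures, the sketch below uses the standard identity that the VIF of predictor j equals the j-th diagonal entry of the inverse correlation matrix, which is equivalent to 1 / (1 - R_j^2) where R_j^2 comes from regressing predictor j on the others. The column names `x1` to `x3`, the toy values, and the cutoff of 10 are assumptions for illustration; the source does not state the threshold the authors used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Map each predictor name to its variance inflation factor."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


# Toy predictors for illustration only: x2 is almost exactly 2 * x1.
toy = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
    "x3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
scores = vif(toy)
THRESHOLD = 10.0  # assumed rule-of-thumb cutoff, not taken from the source
flagged = [name for name, value in scores.items() if value > THRESHOLD]
print(scores, flagged)
```

In practice, a common approach is to drop the predictor with the largest VIF, recompute, and repeat until every remaining value falls below the chosen cutoff.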
Performance results of the machine learning models.
| Algorithm | Fold | RMSE (without feature selection) | R-Squared (without feature selection) | RMSE (with feature selection) | R-Squared (with feature selection) |
|---|---|---|---|---|---|
| Linear regression | 1 | 17.612 | 0.632 | 22.122 | 0.682 |
| | 2 | 2.341 | 0.980 | 5.279 | 0.875 |
| | 3 | 134.232 | 0.532 | 24.519 | 0.827 |
| | 4 | 13.419 | 0.080 | 6.846 | 0.960 |
| | 5 | 4.911 | 0.849 | 10.789 | 0.374 |
| | Average | 34.503 | 0.615 | 13.911 | 0.743 |
| Support vector | 1 | 2.144 | 0.971 | 1.617 | 0.994 |
| | 2 | 3.712 | 0.978 | 5.296 | 0.919 |
| | 3 | 2.447 | 0.996 | 5.223 | 0.941 |
| | 4 | 4.055 | 0.922 | 4.244 | 0.984 |
| | 5 | 9.489 | 0.173 | 9.758 | 0.182 |
| | Average | 4.369 | 0.808 | 5.228 | 0.804 |
| Random forest | 1 | 5.402 | 0.853 | 4.532 | 0.885 |
| | 2 | 4.599 | 0.905 | 5.067 | 0.895 |
| | 3 | 1.732 | 0.969 | 2.448 | 0.935 |
| | 4 | 5.086 | 0.897 | 4.996 | 0.885 |
| | 5 | 7.365 | 0.853 | 7.570 | 0.868 |
| | Average | 4.837 | 0.895 | 4.922 | 0.894 |
| K-nearest neighbor | 1 | 2.562 | 0.946 | 7.215 | 0.974 |
| | 2 | 6.008 | 0.749 | 6.008 | 0.842 |
| | 3 | 3.925 | 0.875 | 3.516 | 0.923 |
| | 4 | 4.282 | 0.913 | 6.862 | 0.669 |
| | 5 | 10.792 | 0.590 | 6.393 | 0.660 |
| | Average | 5.514 | 0.814 | 5.999 | 0.814 |
| Cubist model tree | 1 | 5.817 | 0.831 | 6.524 | 0.853 |
| | 2 | 3.508 | 0.910 | 2.712 | 0.971 |
| | 3 | 5.615 | 0.869 | 2.607 | 0.988 |
| | 4 | 7.451 | 0.550 | 2.897 | 0.998 |
| | 5 | 2.007 | 0.987 | 1.808 | 0.990 |
| | Average | 4.880 | 0.829 | 3.310 | 0.960 |
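The table reports RMSE and R-squared per fold together with their averages. The sketch below shows one common way to define the two metrics and checks that the reported cubist model tree averages with feature selection follow from the five fold values listed in the table. The function names are hypothetical, and defining R-squared as 1 - SS_res / SS_tot is an assumption: some toolkits report the squared correlation between predictions and observations instead, and the record does not say which definition the authors used.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination, defined as 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def fold_average(values):
    """Arithmetic mean of the per-fold scores."""
    return sum(values) / len(values)


# Per-fold values copied from the cubist model tree rows (with feature selection).
cubist_fs_rmse = [6.524, 2.712, 2.607, 2.897, 1.808]
cubist_fs_r2 = [0.853, 0.971, 0.988, 0.998, 0.990]

avg_rmse = round(fold_average(cubist_fs_rmse), 3)
avg_r2 = round(fold_average(cubist_fs_r2), 3)
print(avg_rmse, avg_r2)
```

The rounded averages agree with the 3.310 and 0.960 reported in the table, which indicates that the "Average" rows are plain means over the five folds.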
Figure 4. Cubist model tree with feature selection.
Figure 5. Variable importance of the random forest models.