Colin Bellinger1, Mohomed Shazan Mohomed Jabbar2, Osmar Zaïane2, Alvaro Osornio-Vargas3.
Abstract
BACKGROUND: Data measuring airborne pollutants, public health and environmental factors are increasingly being stored and merged. These big datasets offer great potential, but also challenge traditional epidemiological methods. This has motivated the exploration of alternative methods to make predictions, find patterns and extract information. To this end, data mining and machine learning algorithms are increasingly being applied to air pollution epidemiology.
Keywords: Air pollution; Association mining; Big data; Data mining; Epidemiology; Exposure; Machine learning
Year: 2017 PMID: 29179711 PMCID: PMC5704396 DOI: 10.1186/s12889-017-4914-3
Source DB: PubMed Journal: BMC Public Health ISSN: 1471-2458 Impact factor: 3.295
The following queries were applied to the databases
| Query | Results |
|---|---|
| (“data mining”) AND ((Environment AND health) OR (exposure)) | 252 |
| (“data mining”) AND (“air pollution”) | 10 |
| (“geo-spatial”) AND ((“air pollution”)) | 3 |
| (“clustering”) AND (“air pollution”) | 119 |
| (“machine learning”) AND (“air pollution”) | 16 |
| (“association mining”) AND (“air pollution”) | 0 |
Categorization of articles organized by the application setting
| Setting | References | n (%) |
|---|---|---|
| Outdoor | [ | 87 |
| Indoor | [ | 8 |
| General | [ | 5 |
The final column (n(%)) is the percentage of articles in each category
Categorization of articles organized by the study objective
| Objective | References | n (%) |
|---|---|---|
| Forecasting | [ | 60 |
| Source apportionment | [ | 10 |
| Hypothesis generation | [ | 30 |
The final column (n(%)) is the percentage of articles in each category
Fig. 1. PRISMA flow diagram. Overview of the PRISMA results from our search process
Fig. 2. Publications per country. The number of publications per country shows a predominance in the field of European countries, the USA and China
Fig. 3. Publications per year. Number of articles per year between January 2000 and October 2017. The number of publications on data mining and epidemiology appears to have increased in recent years
Categorization of articles organized by the data mining approach
| Approach | References | n (%) |
|---|---|---|
| Prediction | [ | 59 |
| Clustering | [ | 26 |
| Association Mining | [ | 15 |
The final column (n(%)) is the percentage of articles in each category
Summary of air pollution source apportionment studies using data mining techniques
| Author | Year | Sub-field | Environmental agent of interest | Data mining techniques | Objective |
|---|---|---|---|---|---|
| Chen et al. [ | 2010 | Outdoor air pollution | Inorganic acids & basic air pollutants | Hierarchical Clustering | Explore relationship between climate and air pollutants |
| Singh et al. [ | 2013 | Outdoor air pollution | AQI | PCA, SVM, DT | Predicting air quality and identifying air pollution sources. |
| Fernández-Camacho et al. [ | 2015 | Urban air and noise pollution by traffic | NOx, O3, SO2, Black Carbon | Fuzzy Clustering | Find the relationship of noise to the traffic emission |
| Chen et al. [ | 2015 | Outdoor air pollution | Multiple air pollutants | Clustering | Source apportionment for air pollutants |
| Li et al. [ | 2017 | Outdoor air pollution | PM | Trajectory clustering | Use clustering to understand how seasonality and meteorology affect pollution sources for Beijing |
Chemical abbreviations: AQI air quality index, NOx nitrogen oxides, O3 ozone, SO2 sulfur dioxide, PM particulate matter. Data mining abbreviations: PCA principal component analysis, SVM support vector machine and DT decision tree
Summary of studies forecasting air pollution distributions and related variables using data mining methods
| Author | Year | Sub-field | Environmental agent | Data mining techniques | Objective |
|---|---|---|---|---|---|
| Kolehmainen et al. [ | 2001 | Outdoor air pollution | NO2 | ANN | Comparing two Neural Nets for their suitability in forecasting Air Quality |
| Kukkonen et al. [ | 2003 | Outdoor air pollution | PM10, NO2 | ANN | Machine learning model comparison for forecasting NO2 and PM10 concentrations |
| Niska et al. [ | 2004 | Outdoor air pollution | NO2 | Genetic Algorithms, ANN | Investigate the use of GA to find a better ANN model to forecast air quality |
| Ghanem et al. [ | 2004 | Outdoor air pollution | SO2, C6H6, NO, NO2, O3 | Hierarchical clustering | Monitor chemicals and outline challenges related to collection and processing. |
| Corani [ | 2005 | Outdoor air pollution | Ozone, PM10 | ANN, Lazy Learning | Predict levels of air pollutants from meteorological and other local variables. |
| Dominici et al. [ | 2006 | Outdoor air pollution | PM2.5 | Bayesian Hierarchical Models | Assess the association of air pollution levels with the number of deaths per day |
| Ma et al. [ | 2008 | Outdoor air pollution | SO2, O3, NOx, C6H6 | k-means | Developing a distributed air pollution monitoring system & use data mining to find patterns of pollutant distribution |
| Pegoretti et al. [ | 2009 | Indoor air pollution | Rn | Geostatistical Models, KNN | Forecasting the indoor Radon concentrations |
| Aquilina et al. [ | 2010 | Outdoor air pollution | particle-associated PAH | DT, ANN | Predict personal exposure to particle-associated polycyclic aromatic hydrocarbons (PAH) |
| Padula et al. [ | 2012 | Outdoor air pollution | Traffic-related pollution | Targeted maximum likelihood estimation | Estimate the probability of low birth weight among full-term infants based on the mother’s exposure to traffic-related air pollution |
| Zhu et al. [ | 2012 | Urban outdoor air pollution | SO2, NO2, PM10, Respiratory diseases | ARM, GMDH | Forecasting the number of respiratory patients based on the seasonal effects of air pollution |
| Singh et al. [ | 2013 | Outdoor air pollution | AQI | PCA, Ensemble Decision DT, SVM | Predicting the Air Quality and identifying major sources of air pollution |
| Beckerman et al. [ | 2013 | Outdoor air pollution | NO2, PM2.5 | GLM | Develop a better land use regression model using machine learning methods |
| Pandy et al. [ | 2013 | Outdoor air pollution | UFP, PM | DT, RF | Test machine learning classifiers for predicting air quality and assess the impact of weather and traffic related variables on UFP and PM. |
| Philibert et al. [ | 2013 | Setting | | RF | Predict NO2 emissions using variables related to chemical fertilizer treatments applied to agricultural plots. |
| Chen et al. [ | 2014 | Outdoor air pollution | Smog | ANN, Social Network Analysis | Predicting Smog based Health Hazardous regions |
| Dias et al. [ | 2014 | Outdoor air pollution | PM2.5 | Density-based Clustering | Quantification of human exposure to traffic related air pollution |
| Lary et al. [ | 2014 | Outdoor air pollution | PM2.5 | Ensemble Algorithms RF, SVM, ANN | Estimating the daily distributions of PM2.5 |
| Jiang et al. [ | 2015 | Outdoor air quality | AQI | Correlation Analysis | Monitoring the dynamics of air quality in large cities based on social media |
| Wang et al. [ | 2015 | Outdoor air pollution | Generic | Topic Models LDA, NLP | Evaluating the use of social media data to estimate air pollution and public response |
| Reid et al. [ | 2015 | Outdoor air pollution | PM2.5 | Generalized boosting model, GAM, RF, SVM, KNN Regression, etc. | Predicting PM2.5 during wildfire |
| Lary et al. [ | 2015 | Outdoor air pollution | PM2.5 | Ensemble regression models | Estimating PM2.5 distribution and relationship of such air pollutants with mental health |
| Lewis et al. [ | 2016 | Outdoor air quality | NOx, O3, SO2, CO, VOCs, PM | Boosted regression DT, Gaussian process emulation | Improve the accuracy of common low cost air pollution sensors |
| Hu et al. [ | 2016 | In/Outdoor air pollution | Generic | RF | Understanding exposure to air pollution by predicting time-activity tracking of individuals |
| Challoner et al. [ | 2015 | Indoor air pollution | PM, NO2 | ANN | Predicting the indoor air quality from outdoor monitors |
| Mirto et al. [ | 2016 | Outdoor air pollution and climate | Generic | Spatial data mining, hot spot analysis | Finding correlations between diseases and air pollution due to climatic factors |
| Xu et al. [ | 2017 | Outdoor air pollution | PM, CO, O3, SO2, NO2 | SVM, Fuzzy Evaluation, Empirical Mode Decomposition | Air quality forecasting and evaluation |
| Min et al. [ | 2017 | Outdoor air pollution | PM2.5 | K-Means | Apply K-Means to identify potential new monitoring sites by considering a larger set of 313 variables in their models. Traffic and urbanicity are found to be useful to guide site selection |
| Keller et al. [ | 2017 | Outdoor air pollution | PM2.5 | Modified K-Means | A clustering method to assess exposure to air pollution in health-related studies. They consider the multivariate nature of the exposure and spatial misalignment likely to occur when using data from central monitoring stations and the actual location of the cases |
| Liu et al. [ | 2017 | Outdoor air pollution | PM, SO2, CO, NO2, O3 | SVM Regression | Apply support vector regression for air pollution forecasting using six criteria pollutants, five meteorological conditions and the Air Quality Index |
Chemical abbreviations: NO nitric oxide, NO2 nitrogen dioxide, NOx nitrogen oxides, UFP ultra fine particulate matter, PM particulate matter, SO2 sulfur dioxide, C6H6 benzene, O3 ozone, Rn radon, AQI air quality index, VOCs volatile organic compounds. Data mining abbreviations: ANN artificial neural network, DT decision trees, ARM association rule mining, GMDH group method of data handling, PCA principal component analysis, SVM support vector machine, GLM generalized linear model, RF random forest, LDA latent Dirichlet allocation, NLP natural language processing, GAM generalized additive models, KNN k-nearest neighbors. Note: k is a constant specifying the number of nearest neighbors in KNN and the number of clusters in k-means
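The note on k covers two different roles, which a small sketch may make concrete. The pure-Python example below is illustrative only: the function names `knn_predict` and `k_means_1d`, the deterministic initialisation and the one-dimensional toy values are all assumptions made for this sketch, and none of it is taken from the reviewed studies.

```python
# Toy 1-D data, invented purely for illustration (not from any cited study).

def knn_predict(train, query, k):
    """KNN regression: average the targets of the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k


def k_means_1d(values, k, iterations=20):
    """Lloyd's algorithm in one dimension; returns the k cluster centres, sorted."""
    ordered = sorted(values)
    # Deterministic start (an assumption of this sketch): spread centres over the sorted data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            closest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[closest].append(v)
        # Keep the previous centre if a cluster ends up empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)


# Here k is the number of neighbours consulted for one prediction.
train = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0), (10.0, 40.0)]
prediction = knn_predict(train, query=2.5, k=2)

# Here k is the number of clusters the data are partitioned into.
readings = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centres = k_means_1d(readings, k=2)
```

In the first call k controls how local the prediction is; in the second it fixes how many groups are returned, so the two values are chosen independently.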
Summary of hypothesis-generating studies that use data mining methods to better understand the relationship between air pollution and health conditions
| Author | Year | Sub-field | Environmental agents | Data mining techniques | Objective |
|---|---|---|---|---|---|
| Chen et al. [ | 2010 | Outdoor air pollution | Inorganic acids & basic air pollutants | Hierarchical Clustering | Explore relationship between climate and air pollutants |
| Zhu et al. [ | 2012 | Urban outdoor air pollution | SO2, NO2, PM10, Respiratory diseases | ARM, GMDH | Forecasting the number of respiratory patients based on the seasonal effects of air pollution |
| Pandy et al. [ | 2013 | Outdoor air pollution | UFP, PM | DT, RF | Test machine learning classifiers for predicting air quality and assess the impact of weather and traffic related variables on UFP and PM. |
| Payus et al. [ | 2013 | Outdoor air pollution | SO2, NO2, PM10, CO, O3 | ARM | Find associations between combinations of air pollutants with respiratory illness. |
| Bobb et al. [ | 2014 | Mixture of chemicals | Multiple chemicals, neurodevelopment, hemodynamics | Bayesian kernel machine regression (BKMR) | Identifying mixtures (e.g., metals) and components responsible for various health effects (e.g., neurodevelopment) |
| Gass et al. [ | 2014 | Outdoor air pollution | CO, NO2, O3, PM | Classification and regression trees | Apply classification and regression trees to generate hypotheses about exposure to mixtures of pollutants and health effects. They work with children’s asthma emergency visits |
| Fernández-Camacho et al. [ | 2015 | Urban air and noise pollution by traffic | NOx, O3, SO2, Black Carbon | Fuzzy clustering | Find the relationship of noise to the traffic emission |
| Bell et al. [ | 2015 | General chemical exposure | 219 chemicals | ARM | Find relationships between chemicals and health biomarkers or diseases |
| Qin et al. [ | 2015 | Outdoor air pollution | PM | ARM | Exploring relationships of PM spatial-temporal variations and how cities influence each other |
| Reid et al. [ | 2016 | Outdoor air quality with wildfire | PM2.5, respiratory diseases | Generalized estimating equation and generalized boosting model | Finding how wildfires and the associated increase in PM2.5 affect people with respiratory diseases |
| Toti et al. [ | 2016 | Outdoor air pollution, pediatric asthma | SO2, NO, PM, NO2 | ARM | Exploring relationships of Air Pollution Exposure on Asthma |
| Mirto et al. [ | 2016 | Outdoor air pollution & climate changes | Generic | Spatial data mining, hot spot analysis | Finding correlations between diseases (e.g. respiratory and cardiovascular diseases, cancer, male human infertility) and air pollution due to climatic factors |
| Li et al. [ | 2017 | Outdoor air pollution | PM | Trajectory clustering | Apply clustering to identify transport pathways, sources and seasonal variations of particulate matter (PM2.5 and PM10) in Beijing for regulation purposes |
| Stingone et al. [ | 2017 | Outdoor air pollution | National air toxics assessment | DT | Apply machine learning to identify air pollutants exposure profiles when exploring multiple pollutants (104 ambient air toxics) and then estimate the magnitude of the profile’s effect on math scores in kindergarten children |
| Ghanem et al. [ | 2004 | Outdoor air pollution | SO2, C6H6, NO, NO2, O3 | Hierarchical clustering | Monitor chemicals and outline challenges related to collection and processing. |
Chemical abbreviations: SO2 sulfur dioxide, NO nitric oxide, NOx nitrogen oxides, NO2 nitrogen dioxide, UFP ultra fine particulate matter, PM particulate matter, O3 ozone and C6H6 benzene. Data mining abbreviations: ARM association rule mining, GMDH group method of data handling, DT decision tree and RF random forest
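Several studies in the table above rely on association rule mining. As a rough illustration of the two standard quantities behind a rule, support and confidence, the sketch below scores one rule on made-up records. The item names, the `days` list and the helper functions are hypothetical, and no values here come from the reviewed studies.

```python
# Hypothetical daily records, invented for illustration only: each set lists
# the conditions flagged on one day.
days = [
    {"high_PM10", "high_NO2", "asthma_visits_up"},
    {"high_PM10", "asthma_visits_up"},
    {"high_NO2"},
    {"high_PM10", "high_NO2", "asthma_visits_up"},
    {"high_PM10"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)


# Score the rule {high_PM10} -> {asthma_visits_up}.
rule_support = support({"high_PM10", "asthma_visits_up"}, days)
rule_confidence = confidence({"high_PM10"}, {"asthma_visits_up"}, days)
```

A full miner such as Apriori would enumerate candidate itemsets and keep only rules above chosen support and confidence thresholds; this sketch scores a single rule.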