Abstract
Human survival cannot be imagined without air. Continuous development across almost all realms of modern society has adversely affected air quality. Daily industrial, transport, and domestic activities release hazardous pollutants into the environment. Monitoring and predicting air quality have therefore become essential, especially in developing countries such as India. In contrast to traditional methods, prediction techniques based on machine learning have proved to be among the most efficient tools for studying such modern hazards. The present work investigates six years of air pollution data from 23 Indian cities for air quality analysis and prediction. The dataset is preprocessed, and key features are selected through correlation analysis. Exploratory data analysis is performed to uncover hidden patterns in the dataset, and the pollutants directly affecting the air quality index are identified. A significant fall in almost all pollutants is observed in the pandemic year, 2020. The data imbalance problem is addressed with a resampling technique, and five machine learning models are employed to predict air quality. The models are evaluated and compared through established performance metrics. The Gaussian Naive Bayes model achieves the highest accuracy, while the Support Vector Machine model exhibits the lowest. Overall, the XGBoost model performs the best among the models and attains the highest linearity between predicted and actual data.
Keywords: Air quality index; Box plot; Correlation-based feature selection; Exploratory data analysis; Indian air quality data; Machine learning; Synthetic minority oversampling technique
Year: 2022 PMID: 35603096 PMCID: PMC9107909 DOI: 10.1007/s13762-022-04241-5
Source DB: PubMed Journal: Int J Environ Sci Technol (Tehran) ISSN: 1735-1472 Impact factor: 3.519
Research works on AQI prediction through ML technology
| S. No | Author(s) and year | Dataset | ML/DL algorithms applied | Pollutant(s) studied | Pre-processing/Feature selection/other technique(s) applied | Performance parameter(s) studied | Tool(s)/hardware employed | Result(s) |
|---|---|---|---|---|---|---|---|---|
| 1 | Gopalakrishnan | Google Street View and Environmental Defense Fund (EDF) | LR, Ridge Regression (RR), Elastic Net (EN), RF, and Gradient Boosting (XGBoost) | Black carbon (BC) and NO2 | Correlation, feature engineering | – | Jupyter Notebook | The proposed model predicts concentrations of BC and NO2 in the entire Oakland area |
| 2 | Rybarczyk and Zalakeviciute | Quito air quality dataset | XGBoost | NO2, SO2, CO, PM2.5 | Cross-validation | Root Mean Squared Error (RMSE), and Pearson Coefficient of Correlation (PCC) | R, MS Excel, and Igor Pro | The proposed model exhibited the highest accuracy in high-traffic areas |
| 3 | Sanjeev | Some datasets of pollutants’ concentrations and meteorological factors | RF, ANN, and SVM | NO2, O3, CO, SO2, NH3, PM10, and PM2.5 | Cleaning, attribute selection, and normalization | Accuracy | – | RF performed the best with 99.4% accuracy |
| 4 | Castelli et al. | US Environmental Protection Agency (US EPA) | SVR | CO, NO2, SO2, O3, and PM2.5 | Missing data imputation, removal of outliers, nonlinear data transformation; time series analysis, radial basis function (RBF), and principal component analysis (PCA) | Pearson correlation, mean absolute error (MAE), RMSE, and normalized RMSE (nRMSE) | Python 3.6 with Pandas and Scikit-learn | Accuracy PCA SVR-RBF: 88% (training set) and 92.7% (validation set); Accuracy SVR-RBF: 90.02% (training set) and 94.1% (validation set) |
| 5 | Doreswamy et al. | Taiwan Air Quality Monitoring Network (TAQMN) | LR, RF, XGBoost, K-Nearest Neighbors (KNN), DT, ANN | PM2.5 | Cross-validation | MAE, RMSE, Mean squared error (MSE), and R-squared (R2) | – | The XGBoost regressor model performed the best |
| 6 | Liang et al. | Taiwan’s Environmental Protection Administration (EPA), and Taiwan’s Central Weather Bureau (CWB) | SVM, RF, AdaBoost, ANN, LR, and Stacking Ensemble (SE) | CO, NO2, SO2, O3, PM10, and PM2.5 | Missing value imputation, data normalization, and numeric conversions | MAE, RMSE, and R2 | Orange | AdaBoost exhibited the best MAE and SVM yielded the worst results |
| 7 | Madan et al. | Kaggle | LR, DT, RF, ANN, SVM, etc. | SO2, NO2, O3, CO, PM10, and PM2.5 | – | R2, RMSE, and MAE | – | NN and boosting models were found to be superior |
| 8 | Madhuri et al. | Data collected from sensors | LR, SVM, DT, and RF | CO, NO, C6H6, and SnO2 | Normalization, attribute selection, and discretization | MSE and RMSE | – | RF achieved the highest accuracy |
| 9 | Monisri et al. | Data collected from sensors and IoT devices | RF, DT, and SVM | C6H6, CO2, CO, NO2, NO3 | Removal of missing values, imputation, normalization | Accuracy | Python, Jupyter Notebook IDE | The mixed model exhibits high precision |
| 10 | Nahar et al. | Dataset maintained by the Ministry of Environment, Jordan | DT, SVM, k-Nearest Neighbor (k-NN), RF, and LR | NO2, SO2, O3, CO, H2S, and PM10 | Filling missing values with averages | Accuracy | KNIME | The proposed model predicted the pollutant factor with 92% accuracy |
| 11 | Bhalgat et al. | Kaggle | Integration of ANN and Kriging | SO2 and PM2.5 | Null value removal, elimination of redundant data, Auto-Regressive (AR), and Auto-Regressive Integrated Moving Average (ARIMA) | MSE | MATLAB and R | Concentration of SO2 is at dangerous levels in Nagpur and is increasing in Pune and Mumbai |
| 12 | Mahalingam et al. | Central Pollution Control Board (CPCB), India dataset | NNs and SVM | PM10, PM2.5, NO2, O3, CO, SO2, NH3, and Pb | – | Mean, Standard deviation (SD) | Python: Pandas and NumPy | Accuracy NN: 91.62%; Accuracy SVM: 97.3% |
| 13 | Soundari et al. | Central Pollution Control Board (CPCB), India dataset | NNs | NO2, SO2, Respirable Suspended Particulate Matter (RSPM), and Suspended Particulate Matter (SPM) | Boundary value analysis (BVA), cost estimation, linear regression, and gradient boosting | Moving average, Box plot | Python | The proposed model achieved 95% accuracy in the prediction of AQI |
| 14 | Zhu et al. | US Environmental Protection Agency (US EPA) | Multi-Task Learning (MTL) framework | O3, PM2.5, and SO2 | Missing value imputation | RMSE | – | The proposed light formulation model performed the best |
| 15 | Rybarczyk and Zalakeviciute | Traffic data extracted from Google Maps | Multiple regression, Regression Modal Tree (RMT), and multiple models | PM2.5, SO2, CO, NO2, and O3 | Background subtraction | Correlation coefficient (r) and RMSE | Python | RMT exhibited better predictions than the linear regression model |
Statistics of various pollutants and AQI in the CPCB dataset
| Statistics ↓ / Pollutants → | PM2.5 | PM10 | NO | NO2 | NOX | NH3 | CO | SO2 | O3 | Benzene | Toluene |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Count | 24,933 | 18,391 | 25,949 | 25,946 | 25,346 | 19,203 | 27,472 | 25,677 | 25,509 | 23,908 | 21,490 |
| Mean | 67.450 | 118.127 | 17.574 | 28.560 | 32.309 | 23.483 | 2.248 | 14.531 | 34.491 | 3.280 | 8.700 |
| Std | 64.661 | 90.605 | 22.785 | 24.474 | 31.646 | 25.684 | 6.962 | 18.133 | 21.694 | 15.811 | 19.969 |
| Min | 0.040 | 0.010 | 0.020 | 0.010 | 0.078 | 0.010 | 0.253 | 0.010 | 0.010 | 0.063 | 0.238 |
| 25% | 28.820 | 56.255 | 5.630 | 11.750 | 12.820 | 8.580 | 0.510 | 5.670 | 18.860 | 0.120 | 0.600 |
| 50% | 48.570 | 95.680 | 9.890 | 21.690 | 23.520 | 15.850 | 0.890 | 9.160 | 30.840 | 1.070 | 2.970 |
| 75% | 80.590 | 149.745 | 19.950 | 37.620 | 40.127 | 30.020 | 1.450 | 15.220 | 45.570 | 3.080 | 9.150 |
| Max | 949.990 | 1000.102 | 390.680 | 362.210 | 467.630 | 352.890 | 175.818 | 193.860 | 257.730 | 455.03 | 454.85 |
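The summary rows above (count, mean, std, min, quartiles, max) are standard describe-style statistics. A minimal pure-Python sketch of how they can be derived for one pollutant column; the readings below are toy values, not the CPCB data:

```python
import statistics

# Toy PM2.5 readings standing in for one column of the CPCB dataset.
pm25 = [28.82, 48.57, 80.59, 67.45, 12.30]

count = len(pm25)                                # "Count" row
mean = statistics.mean(pm25)                     # "Mean" row
std = statistics.stdev(pm25)                     # sample std, as in "Std"
q25, q50, q75 = statistics.quantiles(pm25, n=4)  # 25% / 50% / 75% rows
lo, hi = min(pm25), max(pm25)                    # "Min" / "Max" rows
print(count, round(mean, 3))  # → 5 47.546
```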
Fig. 1 Flowchart of the proposed model
Fig. 2 Missing values of the features and their percentages
Fig. 3 Correlation heatmap of AQI with other pollutants (threshold: 0.4)
Correlation between AQI and pollutants
| S. No | Features | Correlation value | S. No | Features | Correlation value |
|---|---|---|---|---|---|
| 1 | PM10 | 0.80331 | 7 | NO | 0.452191 |
| 2 | CO | 0.68334 | 8 | Toluene | 0.279992 |
| 3 | PM2.5 | 0.65918 | 9 | NH3 | 0.252019 |
| 4 | NO2 | 0.53707 | 10 | O3 | 0.198991 |
| 5 | SO2 | 0.52586 | 11 | Xylene | 0.165532 |
| 6 | NOx | 0.486450 | 12 | Benzene | 0.044407 |
Fig. 4 Skewness present in dataset features
Fig. 5 Intensities of various pollutants from 2015 to 2020
Fig. 6 The six most polluted Indian cities with their average AQI values from 2015 to 2020
Fig. 7 Pollutants governing AQI directly
Fig. 8 Timeline graph of AQI with respect to specific pollutants
Fig. 9 Variation analysis of pollutants through box plots
Comparison of model results in the training set
| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Training time (s) |
|---|---|---|---|---|---|
| KNN | 89 | 94 | 90 | 96 | 0.104 |
| GNB | 85 | 91 | 94 | 88 | 0.110 |
| SVM | 90 | 93 | 88 | – | 0.258 |
| RF | 88 | 93 | 88 | 92 | 0.102 |
| XGBoost | 95 | 95 | 91 | – | 0.532 |
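A sketch of how the four score columns above can be computed. Macro averaging over AQI classes is an assumption (the source does not state its averaging scheme), and the labels are toy values, not the models' actual output:

```python
def classification_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precs, recs, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        n_pred = sum(p == c for p in y_pred)   # predicted as class c
        n_true = sum(t == c for t in y_true)   # actually class c
        prec = tp / n_pred if n_pred else 0.0
        rec = tp / n_true if n_true else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    n = len(labels)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Toy AQI-bucket labels for six samples.
y_true = ["Good", "Moderate", "Poor", "Moderate", "Good", "Poor"]
y_pred = ["Good", "Moderate", "Poor", "Good", "Good", "Poor"]
acc, prec, rec, f1 = classification_scores(y_true, y_pred)
print(round(acc * 100, 1))  # → 83.3
```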
Results of ML algorithms for AQI Prediction with and without SMOTE (training set)
| Models | Without SMOTE | | | | With SMOTE | | | |
|---|---|---|---|---|---|---|---|---|
| | MAE | RMSE | RMSLE | R2 | MAE | RMSE | RMSLE | R2 |
| KNN | 0.627 | 3.834 | 0.153 | 0.913 | 0.023 | 1.003 | 0.063 | 0.864 |
| GNB | 0.622 | 2.454 | 0.164 | 0.856 | 0.027 | 1.212 | 0.045 | 0.801 |
| SVM | 0.537 | 2.238 | 0.078 | 0.820 | 0.026 | 1.003 | 0.043 | 0.772 |
| RF | 0.331 | 1.973 | 0.082 | 0.643 | 0.022 | 0.583 | – | – |
| XGBoost | 0.963 | 0.062 | – | – | – | – | – | – |
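The SMOTE resampling referenced in these tables generates synthetic minority samples by interpolating between a minority point and one of its k nearest minority neighbours. A minimal hand-rolled sketch of that idea (not the imbalanced-learn implementation the authors may have used; `smote_like_oversample` is an illustrative helper):

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """SMOTE-style sketch: each synthetic sample lies on the line segment
    between a random minority point and one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen base point
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, nb)))
    return synthetic

# Toy 2-D minority class (e.g. a rare AQI bucket after vectorisation).
minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8)]
new_points = smote_like_oversample(minority, n_new=3)
print(len(new_points))  # → 3
```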
Comparison of model results in the testing set
| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Prediction time (s) |
|---|---|---|---|---|---|
| KNN | 85 | 92 | 85 | 94 | 0.018 |
| GNB | 83 | 88 | 89 | 92 | 0.016 |
| SVM | 91 | 90 | 83 | – | 0.027 |
| RF | 86 | 92 | 91 | 90 | 0.023 |
| XGBoost | 90 | 96 | 95 | 91 | 0.041 |
Results of ML algorithms for AQI prediction with and without SMOTE (testing set)
| Models | Without SMOTE | | | | With SMOTE | | | |
|---|---|---|---|---|---|---|---|---|
| | MAE | RMSE | RMSLE | R2 | MAE | RMSE | RMSLE | R2 |
| KNN | 0.834 | 4.023 | 0.620 | 0.453 | 0.067 | 2.880 | 0.018 | 0.272 |
| GNB | 0.564 | 3.487 | 0.236 | 0.174 | 2.316 | 0.016 | – | – |
| SVM | 0.634 | 3.803 | 0.623 | 0.153 | 2.098 | 0.032 | 0.512 | – |
| RF | 0.627 | 2.220 | 0.198 | 0.643 | 0.076 | 1.458 | 0.410 | – |
| XGBoost | 0.156 | 0.834 | 0.174 | 0.026 | – | – | – | – |
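The error metrics reported in these tables can be computed as follows; a self-contained sketch with toy AQI values, not the paper's predictions:

```python
import math

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def rmsle(y, yhat):
    """Root mean squared logarithmic error; emphasises relative rather
    than absolute differences, so large AQI values dominate less."""
    return math.sqrt(
        sum((math.log1p(a) - math.log1p(b)) ** 2 for a, b in zip(y, yhat))
        / len(y)
    )

# Toy actual vs predicted AQI values.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 310.0]
print(round(mae(y_true, y_pred), 3))  # → 10.0
```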