Kai Fan1,2,3, Ranil Dhammapala4, Kyle Harrington5, Ryan Lamastro6, Brian Lamb3, Yunha Lee1,2,3.
Abstract
Chemical transport models (CTMs) are widely used for air quality forecasts, but these models require large computational resources and often suffer from a systematic bias that leads to missed poor air pollution events. For example, a CTM-based operational forecasting system for air quality over the Pacific Northwest, called AIRPACT, uses over 100 processors for several hours to provide 48-h forecasts daily, but struggles to capture unhealthy O3 episodes during the summer and early fall, especially over Kennewick, WA. This research developed machine learning (ML) based O3 forecasts for Kennewick, WA to demonstrate an improved forecast capability. We used the 2017-2020 simulated meteorology and O3 observation data from Kennewick as training datasets. The meteorology datasets are from the Weather Research and Forecasting (WRF) meteorological model forecasts produced daily by the University of Washington. Our ozone forecasting system consists of two ML models, ML1 and ML2, to improve predictability: ML1 uses the random forest (RF) classifier and multiple linear regression (MLR) models, and ML2 uses a two-phase RF regression model with best-fit weighting factors. To avoid overfitting, we evaluate the ML forecasting system with the 10-time, 10-fold, and walk-forward cross-validation analysis. Compared to AIRPACT, ML1 improved forecast skill for high-O3 events and captured 5 out of 10 unhealthy O3 events, while AIRPACT and ML2 missed all the unhealthy events. ML2 showed better forecast skill for less elevated-O3 events. Based on this result, we set up our ML modeling framework to use ML1 for high-O3 events and ML2 for less elevated O3 events. Since May 2019, the ML modeling framework has been used to produce daily 72-h O3 forecasts and has provided forecasts via the web for clean air agency and public use: http://ozonematters.com/. Compared to the testing period, the operational forecasting period has not had unhealthy O3 events. 
Nevertheless, the ML modeling framework demonstrated a reliable forecasting capability at a selected location with far fewer computational resources: the ML system uses a single processor for minutes, whereas the CTM-based forecasting system uses more than 100 processors for hours.
Keywords: air quality forecasts; machine learning; multiple linear regression; ozone; random forest
Year: 2022 PMID: 35237751 PMCID: PMC8883518 DOI: 10.3389/fdata.2022.781309
Source DB: PubMed Journal: Front Big Data ISSN: 2624-909X
Figure 1. (A) ML1 model based on random forest (RF) classifier and multiple linear regression (MLR) models; (B) ML2 model based on a two-phase RF regression and weighting factors (MDA8 O3: the maximum daily 8-hour moving average O3).
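To make the MDA8 O3 definition in the caption concrete, the sketch below takes one day of hourly O3 values and returns the largest 8-hour running mean. The function name `mda8` and the input layout (a plain list of hourly values in ppb) are assumptions made for this example. Regulatory calculations add data-completeness and window-assignment rules that are omitted here.

```python
def mda8(hourly):
    """Return the maximum 8-hour running mean of one day's hourly O3 values.

    `hourly` is assumed to be a list of hourly concentrations (ppb) with no
    missing values; this simplification is not taken from the paper.
    """
    if len(hourly) < 8:
        raise ValueError("need at least 8 hourly values")
    # One running mean per 8-hour window that fits entirely inside the list.
    running_means = [sum(hourly[i:i + 8]) / 8 for i in range(len(hourly) - 7)]
    return max(running_means)


# Hypothetical day: clean night, elevated afternoon plateau.
example_day = [20] * 10 + [70] * 8 + [30] * 6
peak = mda8(example_day)
```

Because the 8-hour window averages out short spikes, a single high hourly value raises MDA8 O3 far less than a sustained afternoon plateau does.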
A 2 × 2 contingency table for forecast skill.
| Forecast \ Observed | Unhealthy | Good | Total |
|---|---|---|---|
| Unhealthy | a = hits | b = false alarms | a + b |
| Good | c = misses | d = correct negatives | c + d |
| Total | a + c | b + d | a + b + c + d |
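The table entries a, b, c, and d are the inputs to the HSS, KSS, and CSI scores reported in the later tables. The sketch below uses the standard textbook definitions of the Heidke skill score, the Hanssen-Kuipers (Peirce) skill score, and the critical success index. The function name and the example counts are hypothetical, and the paper may apply additional conventions not shown here.

```python
def skill_scores_2x2(a, b, c, d):
    """Compute HSS, KSS and CSI from a 2 x 2 contingency table.

    a = hits, b = false alarms, c = misses, d = correct negatives.
    Zero denominators (e.g. no observed events) are not handled here.
    """
    n = a + b + c + d
    # Number of correct forecasts expected by chance, from the marginals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n
    hss = (a + d - expected) / (n - expected)
    # Hit rate minus false alarm rate.
    kss = a / (a + c) - b / (b + d)
    # Hits divided by all cases where the event was forecast or observed.
    csi = a / (a + b + c)
    return {"HSS": hss, "KSS": kss, "CSI": csi}


# Hypothetical counts, not taken from the paper.
scores = skill_scores_2x2(a=4, b=2, c=6, d=88)
```

CSI ignores the correct negatives d, which makes it the more demanding score when unhealthy days are rare and d dominates the table.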
A 3 × 3 contingency table for forecast skill.
| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | n11 | n12 | n13 |
| 2 | n21 | n22 | n23 |
| 3 | n31 | n32 | n33 |
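The same scores generalize to the 3 × 3 table. The sketch below uses the standard multi-category forms of HSS and KSS and a per-category CSI. It assumes that the first index of n_ij is the forecast category and the second the observed category. That orientation is not stated in the text above, and KSS (unlike HSS) changes if it is reversed. The function name and the example counts are hypothetical.

```python
def multicategory_scores(table):
    """HSS, KSS and per-category CSI for a K x K contingency table.

    table[i][j] is assumed to count cases forecast in category i and
    observed in category j (an assumption, see the lead-in).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    forecast_tot = [sum(table[i]) for i in range(k)]
    observed_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    correct = sum(table[i][i] for i in range(k)) / n
    chance = sum(forecast_tot[i] * observed_tot[i] for i in range(k)) / n ** 2
    hss = (correct - chance) / (1 - chance)
    kss = (correct - chance) / (1 - sum((o / n) ** 2 for o in observed_tot))
    csi = []
    for i in range(k):
        denom = forecast_tot[i] + observed_tot[i] - table[i][i]
        # None marks a category that was never forecast and never observed.
        csi.append(table[i][i] / denom if denom else None)
    return hss, kss, csi


# Hypothetical counts, not taken from the paper.
hss, kss, csi = multicategory_scores([[80, 6, 0], [10, 12, 2], [0, 3, 2]])
```

Returning `None` for an empty category keeps a year with no events in that category from being scored as a zero.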
Figure 2. Boxplot of observed (obs) MDA8 O3 from May to September in 2017–2020.
Summary of historical air quality information from May to September in 2017–2020.
| Year | Days | | | | | AQI 1 | AQI 2 | AQI 3 | AQI 3 (%) |
|---|---|---|---|---|---|---|---|---|---|
| 2017 | 100 | 51 | 50 | 42 | 58 | 65 | 29 | 6 | 6.0% |
| 2018 | 148 | 46 | 44 | 39 | 52 | 119 | 25 | 4 | 2.7% |
| 2019 | 136 | 44 | 43 | 38 | 50 | 121 | 15 | 0 | 0 |
| 2020 | 142 | 42 | 42 | 35 | 48 | 132 | 10 | 0 | 0 |
| Total | 526 | 45 | 44 | 38 | 51 | 437 | 79 | 10 | 1.9% |
The AQI categories are based on O3.
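The per-category day counts in a summary like this can be derived from daily MDA8 O3 values. The sketch below is one possible way to do so, under two assumptions that are not confirmed by the text above: that the category boundaries follow the US EPA 8-hour ozone AQI breakpoints (up to 54 ppb, 55 to 70 ppb, above 70 ppb), and that every category above the second is pooled into a third. The function names and example values are hypothetical.

```python
from collections import Counter


def aqi_category(mda8_ppb):
    """Map an MDA8 O3 value (ppb) to category 1, 2 or 3.

    Thresholds are assumed from the US EPA 8-hour ozone AQI breakpoints;
    pooling everything above 70 ppb into category 3 is also an assumption.
    """
    if mda8_ppb < 55:
        return 1
    if mda8_ppb < 71:
        return 2
    return 3


def summarize(mda8_values):
    """Count days per category and the percentage of days in category 3."""
    counts = Counter(aqi_category(v) for v in mda8_values)
    days = len(mda8_values)
    return {
        "days": days,
        "counts": [counts[1], counts[2], counts[3]],
        "pct_cat3": 100.0 * counts[3] / days,
    }


# Hypothetical MDA8 O3 values (ppb), not taken from the paper.
summary = summarize([40, 54, 55, 70, 71, 90, 30, 45, 60, 50])
```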
Figure 3. Boxplot of feature weights from (A) the RF classifier model in ML1, (B) the first and (C) the second RF regression model in ML2. The blue lines show the mean of the feature weights (0.1).
Figure 4. Diagram of 10-time, 10-fold cross-validation.
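A minimal sketch of the two splitting schemes named in the abstract follows: 10-time, 10-fold cross-validation (ten reshuffled repetitions of a 10-fold split) and walk-forward validation. The function names, the random seed, and the choice to walk forward one year at a time are assumptions made for illustration, not a description of the authors' code.

```python
import random


def repeated_kfold(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh shuffle for every repetition
        folds = [idx[f::n_splits] for f in range(n_splits)]
        for f in range(n_splits):
            test = sorted(folds[f])
            train = sorted(i for g in range(n_splits) if g != f for i in folds[g])
            yield rep, f, train, test


def walk_forward(groups):
    """Yield (group, train_idx, test_idx), training only on earlier groups.

    `groups` holds one sortable label per sample, e.g. the year (assumed).
    """
    order = sorted(set(groups))
    for k in range(1, len(order)):
        train = [i for i, g in enumerate(groups) if g < order[k]]
        test = [i for i, g in enumerate(groups) if g == order[k]]
        yield order[k], train, test


# 526 is the total number of days in the 2017-2020 summary table.
splits = list(repeated_kfold(526))
```

Repeated k-fold shuffles days across years, so it measures average skill, whereas walk-forward keeps the time order and is closer to how an operational forecast is actually used.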
Statistics and forecast verifications of the 10-time, 10-fold cross-validations of the simulated O3 at Kennewick, WA during 2017–2020.
| | | | | | |
|---|---|---|---|---|---|
| | | 0.070 | 0.38 | 0.43 | 0.54 |
| NMB (%) | | 1.1 | −2.2 | 5.2 | −0.22 |
| NME (%) | | 17 | 14 | 16 | 12 |
| HSS | | 0.34 | 0.34 | 0.42 | 0.4 |
| KSS | | 0.30 | 0.30 | 0.61 | 0.33 |
| CSI | 1 | 0.85 | 0.85 | 0.74 | 0.87 |
| | 2 | 0.24 | 0.24 | 0.34 | 0.27 |
| | 3 | 0 | 0 | 0.28 | 0 |
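The NMB and NME rows are normalized mean bias and normalized mean error. A minimal sketch of their usual definitions is given below, assuming paired lists of predicted and observed MDA8 O3 values; the function names and the example numbers are hypothetical.

```python
def nmb(pred, obs):
    """Normalized mean bias in percent: sum(pred - obs) / sum(obs) * 100."""
    return 100.0 * sum(p - o for p, o in zip(pred, obs)) / sum(obs)


def nme(pred, obs):
    """Normalized mean error in percent: sum(|pred - obs|) / sum(obs) * 100."""
    return 100.0 * sum(abs(p - o) for p, o in zip(pred, obs)) / sum(obs)


# Hypothetical predictions and observations (ppb), not taken from the paper.
observed = [40, 50, 60]
predicted = [44, 45, 66]
bias = nmb(predicted, observed)
error = nme(predicted, observed)
```

Positive and negative errors cancel in NMB but not in NME, so a model can show a near-zero NMB while still having a sizeable NME.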
Figure 5. Ratio plots of model predictions to observations vs. observations for three models: (A) AIRPACT, (B) ML1, and (C) ML2.
Annual statistics and forecast verifications of the 10-time, 10-fold cross-validations at Kennewick, WA.
| | | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | 0.0053 | 0.46 | 0.43 | 0.029 | 0.44 | 0.46 | 0.34 | 0.33 | 0.58 | 0.64 | 0.43 | 0.44 | 0.57 | 0.58 |
| NMB (%) | | −1.7 | −7.5 | 2.5 | 12 | 2.3 | 4.3 | 6.6 | 7.1 | −6.3 | −1.8 | 1.4 | 5.3 | −4.5 | −0.36 |
| NME (%) | | 25 | 14 | 12 | 19 | 15 | 15 | 16 | 18 | 12 | 10 | 11 | 14 | 13 | 11 |
| HSS | | 0.30 | 0.31 | 0.47 | 0.28 | 0.41 | 0.55 | 0.31 | 0.31 | 0.30 | 0.51 | 0.32 | 0.35 | 0.30 | 0.52 |
| KSS | | 0.26 | 0.22 | 0.43 | 0.45 | 0.45 | 0.73 | 0.59 | 0.77 | 0.25 | 0.43 | 0.24 | 0.35 | 0.27 | 0.44 |
| CSI | 1 | 0.72 | 0.86 | 0.90 | 0.86 | 0.65 | 0.81 | 0.72 | 0.77 | 0.73 | 0.89 | 0.89 | 0.91 | 0.73 | 0.89 |
| | 2 | 0.24 | 0.17 | 0.35 | 0.23 | 0.37 | 0.45 | 0.27 | 0.24 | 0.24 | 0.35 | 0.22 | 0.25 | 0.14 | 0.32 |
| | 3 | 0 | 0 | – | – | 0.30 | 0.25 | – | – | 0 | 0 | – | – | 0.30 | 0.25 |
Combined refers to using the ML1 predicted MDA8 O3.
Figure 6. Time series of MDA8 O3 from observation, AIRPACT and combined ML model predictions from May to September in 2017–2020.
Statistics and forecast verifications in 2019–2020.
| | | | | | |
|---|---|---|---|---|---|
| | | 0.33 | 0.35 | 0.49 | 0.48 |
| NMB (%) | | 6.9 | 8.0 | 5.2 | 5.7 |
| NME (%) | | 17 | 18 | 12 | 13 |
| HSS | | 0.31 | 0.28 | 0.41 | 0.47 |
| KSS | | 0.64 | 0.66 | 0.39 | 0.44 |
| CSI | 1 | 0.75 | 0.70 | 0.90 | 0.91 |
| | 2 | 0.26 | 0.24 | 0.30 | 0.34 |
Note that mean is the ensemble mean of the MDA8 O3.
Figure 7. Distributions of observed and (A) ML1, (B) ML2 model predicted MDA8 O3 in 2019 and 2020.