Literature DB >> 36156995

Multi-stage ensemble-learning-based model fusion for surface ozone simulations: A focus on CMIP6 models.

Abstract

Accurately simulating the geographical distribution and temporal variability of global surface ozone has long been one of the principal components of chemistry-climate modelling. However, the simulation outcomes have been reported to vary significantly as a result of the complex mixture of uncertain factors that control the tropospheric ozone budget. Settling the cross-model discrepancies to achieve higher accuracy predictions of surface ozone is thus a task of priority, and methods that overcome structural biases in models going beyond naïve averaging of model simulations are urgently required. Building on the Coupled Model Intercomparison Project Phase 6 (CMIP6), we have transplanted a conventional ensemble learning approach, and also constructed an innovative 2-stage enhanced space-time Bayesian neural network to fuse an ensemble of 57 simulations together with a prescribed ozone dataset, both of which have realised outstanding performances (R2 > 0.95, RMSE < 2.12 ppbv). The conventional ensemble learning approach is computationally cheaper and results in higher overall performance, but at the expense of oceanic ozone being overestimated and the learning process being uninterpretable. The Bayesian approach performs better in spatial generalisation and enables perceivable interpretability, but induces heavier computational burdens. Both of these multi-stage machine learning-based approaches provide frameworks for improving the fidelity of composition-climate model outputs for uses in future impact studies.

Entities: Chemical

Keywords: CCM; CMIP6; Data fusion; Model ensemble; Space-time Bayesian neural network; Surface ozone

Year: 2021 PMID： 36156995 PMCID： PMC9488062 DOI： 10.1016/j.ese.2021.100124

Source DB: PubMed Journal: Environ Sci Ecotechnol ISSN： 2666-4984

Introduction

Tropospheric ozone (O3) is a trace-gas, near-term climate forcer with global mean lifetime ∼23 days, and a major air pollutant with detrimental effects on human and ecosystem health [[1], [2], [3]]. Besides warming the atmosphere as a greenhouse gas, ground-level O3 also reduces crop yields [[4], [5], [6]]. Laboratory experiments have confirmed O3 exposure to cause oxidative stress, inflammatory responses and immunologic diseases [7]. Epidemiological studies report that short-term exposures to high-level O3 are significantly associated with the exacerbation of asthma [8] and have increased hospitalisations among children [9], while long-term ozone exposure is linked to respiratory diseases including chronic obstructive pulmonary disease, cardiovascular diseases, preterm delivery and even premature deaths [[10], [11], [12], [13], [14], [15]]. Global Burden of Disease (GBD) reported over 0.36 million premature deaths globally in 2019 from exposure to ambient O3 [16]; high O3 exposure may also exacerbate PM2.5-mortality risk associations [17]. These results underscore the pressing need for research which links population exposure assessment to surface O3 and its impacts on human health. Satellite-based observations are limited in their ability to provide accurate measurements for O3 at the surface, since ambient air O3 is obscured by increased O3 abundance in the upper atmosphere which prevents direct measurement by remote-sensing. Ground-level station-based observation sites are excellent tools for recording and monitoring surface O3 but still suffered from limited spatial coverage [18,19]. The demand for full-coverage surface O3 concentrations have promoted the application of model simulations, which have improved alongside our mechanistic understanding of tropospheric O3 [[20], [21], [22]]. However, model simulations are also limited, due to imperfect O3 chemistry mechanisms built into models, biases and errors in the underlying emissions, and uncertainties caused by the discretisation and numerical treatment of a non-linear complex system. Archibald et al. have shown that for future evolution projections of tropospheric column O3, model differences are a leading order term of uncertainty over decadal scales [22]. There are various types of models used to simulate surface O3. Chemical transport models (CTM) perform satisfactorily, especially in regional-level simulations [[23], [24], [25], [26]], and are considered to be free of biases in meteorological dynamics due to the use of prescribed weather features. However, these models lack important atmospheric composition feedbacks to the modelled meteorology and climate, hence development of atmospheric composition-climate models (CCM); when coupled with land, sea, and sea-ice modules into earth system models (ESM), it is feasible to simulate multi-decadal or even centennial scale changes of the atmosphere [[27], [28], [29], [30]]. To evaluate and compare the coupled models, a number of research institutes have contributed to the Coupled Model Inter-comparison Project Phase 6 (CMIP6) with a range of experiments conducted by a series of state-of-the-art coupled CCMs and ESMs. The same inputs are used, including emission inventories and land properties [[31], [32], [33], [34]]. CMIP6 has endorsed a total of 23 MIPs to answer a wide range of scientific questions in atmospheric chemistry and climate, among which the Aerosols and Chemistry Model Intercomparison Project (AerChemMIP) involves a collection of simulations targeted at reactive gases and aerosols, including tropospheric O3. [35] Large discrepancies have been detected across models; in addition to identifying the mechanistic causes for these differences [32,36], an urgent challenge is the calibration and utilisation of the simulation ensemble. Another prominent strength of free-running CCMs is their capacity for decadal simulations without nudging by observations, so the harmonisation of cross-model disagreements, in future years beyond observations, will benefit further long-term scenario projection studies. Applying frontier machine learning algorithms to “assimilate” the outputs from multi-source modelling activities like MIPs and observation databases is an important part of environmental research in the big data era. However, unlike retrospective analysis (abbreviated as re-analysis) products like Modern-Era Retrospective analysis for Research and Applications (MERRA), which involves observational nudging in the simulation processes as a representative type of data assimilation [37], there are opportunities for post-simulation data mining by integrating various relevant databases, which is customarily named “data fusion” [[38], [39], [40]]. Data fusion can be thought of an aspect of bias correction, but we treat it here as a specific machine learning task. Data fusion studies, which use ensemble learning to enhance the prediction accuracy of ambient air pollution concentrations beyond observation sites, have been emerging in recent years [[41], [42], [43], [44]]. However, these studies have used no more than one model simulation, integrated with predictor variables contributing to the budget of O3, without fusing multiple simulation ensembles such as CMIP6. In addition, the conventional machine- or deep-learning approaches aim purely at brute-force fitting to high accuracy while sacrificing the interpretability of training processes, so have long been criticised as “black-box” and contradict the nature of mechanism-driven sciences like atmospheric modelling [[45], [46], [47]]. Under these circumstances, a performance-interpretability balance for multi-source data fusion following credible observations is of high value in atmospheric research. Our current study is an innovative exploration on this issue, emphasising the development of innovative ensemble-learning frameworks to assimilate the multiple CMIP6 model simulation ensembles and TOAR observations to obtain a single surface O3 dataset which captures spatiotemporal variabilities as accurately as possible. Our ultimate goal is to establish data assimilation approaches which can be projected onto future scenario forecasts, hence historical observational products only serve as labels for supervised training rather than inputs, since such inputs for future years are counterfactual. Fusing a collection of simulation ensembles, rather than using outputs from one simulation, gives more prominence to the mechanism-driven models in order to avoid brute-force overfitting resulting from external predictor variables, especially when any given model simulation may be largely biased. The primary innovation of this study lies in transplanting the conventional ensemble-learning methodology onto multi-source data fusion, and optimising an enhanced 2-stage space-time Bayesian neural network to assimilate the CMIP6 simulation ensemble. Advantages of the conventional approach include a much lower computation burden and higher accuracy in observation-covered regions, while the innovative Bayesian approach offers better spatial generalisability and intuitive perception of spatiotemporal model weighting. In either case, the multi-model fused surface O3 concentration can fill observational gaps and enable further relevant research. As an example, we show that using a Fourier-series function to fit temporal surface O3 variability provides a feasible way to effectively summarise periodical changes in air pollutant concentrations. Detailed evaluations and comparisons on the CMIP6 model ensemble, and deeper discussions on model revision insights from deep learning-based calibration processes are beyond the scope of this study.

Methodology and data sources

CMIP6 simulation ensemble

We collect 14 coupled earth system models having finished the “historical” simulations (1850–2014) of tropospheric O3 as listed in Table 1, of which 8 models use interactive chemistry schemes. A prescribed O3 concentration dataset is used for all 4 non-interactive chemistry models (AWI-ESM [48], BCC-CSM2 [[49], [50], [51]], IPSL-CM6A [52,53], and MPI-M-ESM1.2 [[54], [55], [56], [57]]) and 2 CNRM models are not considered for fusion due to the simplified treatment of O3 chemistry [[58], [59], [60], [61], [62]]. The prescribed O3 is an average of 2 earlier generation atmospheric models, CESM1-WACCM and Canadian Middle Atmosphere Model (CMAM), under the auspices of the IGAC/SPARC Chemistry-Climate Model Initiative (CCMI) [63]. A total of 8 models, including BCC-ESM1 [64,65], MPI-ESM1.2-HAM [66], MRI-ESM2.0 [[67], [68], [69]], NASA-GISS-E2.1 [[70], [71], [72]], NCAR-CESM2-WACCM6 [73,74], NCC-NorESM [75], NOAA-GFDL-ESM4 [76,77], and UKESM1-0-LL [20,[78], [79], [80], [81], [82]], consisting of 57 individual simulation experiments (i.e. realisations in terms of CCM simulation labelled as rninpnfn) and 1 prescribed input dataset (from Inputs4MIPs) [63] are recruited for data fusion. The multiple ensemble members under one model allow for capturing the uncertainties in the chaotic coupled chemistry-climate system; and because of the free-running nature of the simulations, each of the 57 individual simulations is treated separately with no cross-ensemble averaging clustering into each model involved. All simulation outputs are averaged to monthly time frequency for assimilation with observations. Detailed information of the participant research institutes, design of atmosphere module settings, and experiment labelling rules are illustrated in the Appendix.

Table 1

Institute	Model	Trop	Strat	Grids	Realisations	Physics	Forcing	Fusion	Refs
AWI	ESM^∥	P^#	P	192 × 96 × 47	r₁^⊥	p₁	f₁		[48]
BCC	ESM1	I	P	128 × 64 × 26	r₁, r₂, r₃	p₁	f₁	✓	[65,95]
	CSM2	P	P	320 × 160 × 19	r₁	p₁	f₁		[[96], [97], [98]]
CNRM^∗	CM6.1	N	I	256 × 128 × 91	r_1-5	p₁	f₂		[[58], [59], [60]]
	ESM2.1	N	I	256 × 128 × 91	r₁, r₂, r₃	p₁	f₂		[59,61,62]
HAMMOZ^§	MPI-ESM1.2-HAM	I	P	192 × 96 × 47	r₁, r₂	p₁	f₁	✓	[66]
IPSL	CM6A	P	P	144 × 143 × 79	r_1-10	p₁	f₁		[52,53]
MOHC	UKESM1-0-LL^†	I	I	192 × 144 × 85	r_10-12, r_14-19	p₁	f₂	✓	[80,81,[99], [100], [101], [102]]
	UKESM1-0-LL	I	I	192 × 144 × 85	r_5-7	p₁	f₃	✓
MO-NERC	UKESM1-0-LL	I	I	192 × 144 × 85	r_1-4,r_8-9	p₁	f₂	✓
MPI-M	ESM1.2-HR	P	P	384 × 192 × 95	r_1-10	p₁	f₁		[[54], [55], [56], [57], [103]]
MRI	ESM2.0	I	I	128 × 64 × 80	r_1-5	p₁	f₁	✓	[69,104,105]
NASA-GISS	E2.1-G	I	I	144 × 90 × 40	r_1-10	p₃	f₁	✓	[70,106,107]
	E2.1-G	I	I	144 × 90 × 40	r₁, r₂, r₃	p₅	f₁	✓
	E2.1-H	I	I	144 × 90 × 40	r_1-5	p₃	f₁	✓
	E2.1-H	I	I	144 × 90 × 40	r₁, r₂, r₃	p₅	f₁	✓
NCAR	CESM2-WACCM6	I	I	288 × 192 × 70	r₁, r₂, r₃	p₁	f₁	✓	[74,108]
NCC	NorESM-MM^‡	I	P	288 × 192 × 32	r₁, r₂, r₃	p₁	f₁	✓	[75]
NIMS-KMA	UKESM1-0-LL	I	I	192 × 144 × 85	r₁₃	p₁	f₂	✓	[109]
NOAA-GFDL	ESM4	I	I	288 × 180 × 49	r₁	p₁	f₁	✓	[76,77]

∥ The earth system models are unique for each institute, but coincidently are named the same as ESM with version numbers, thus are named by institute + model name hereafter in this paper for distinguishment (i.e. CNRM-ESM2.1 is not an updated version of BCC-ESM1, but a new version of CNRM-ESM1) [110].

# AWI-ESM, BCC-CSM2, IPSL-CM6A, and MPI-M-ESM1.2-HR use the same prescribed ozone for the whole earth system modelling instead of simulating the ozone, so that the surface ozone concentrations reported by these 4 models are essentially the same. In this sense, the single prescribed ozone (input4MIPs) [63] is used in place of the 4 models to avoid duplication.

⊥ All the realisations of the climate equilibrium started since 1850, so that are marked with the same initialisation index, i1. The ensemble experiment variant serial numbers are defined by a combination of realisation, initialisation, physics, and forcing, e.g. r1i1p1f1.

∗ The 2 CNRM models are not considered for surface ozone multi-model fusion as they do not include tropospheric ozone module.

§ Full name as HAMMOZ-Consortium, marked as HAM in model name.

† MOHC, MO-NERC and NIMS-KMA ran the same UKESM1 model with same configuration, but contributed different ensemble experiments, so that are referred collectively as UKESM1-0-LL hereafter in this paper.

‡ NCC ran the NorESM in two different coupling resolutions, as low atmospheric-medium ocean resolution (LM) and median atmospheric-medium ocean resolution (MM). In order to achieve higher performance in multi-model fusion, only the higher spatial-resolution simulation, MM, is considered so as to avoid duplication.

Summarisation of CMIP6 historical project participant institutes and models with chemistry schemes, spatial gridding, and experiment realisation, physics, and forcing settings. The names of institutes and coupled earth system models are listed in abbreviation. The three-dimensional spatial resolutions are represented in longitudinal-latitudinal-vertical grids. The tropospheric and stratospheric chemistry schemes are denoted as interactive (I), prescribed (P) and none (N) in “Trop” and “Strat” columns. The realisation, physics and forcing indices identify ensemble experiment members. The “Fusion” column indicates whether the simulation experiments are included into multi-model fusion. Full names of the CMIP6 participant research institutes are listed in the Appendix. ∥ The earth system models are unique for each institute, but coincidently are named the same as ESM with version numbers, thus are named by institute + model name hereafter in this paper for distinguishment (i.e. CNRM-ESM2.1 is not an updated version of BCC-ESM1, but a new version of CNRM-ESM1) [110]. # AWI-ESM, BCC-CSM2, IPSL-CM6A, and MPI-M-ESM1.2-HR use the same prescribed ozone for the whole earth system modelling instead of simulating the ozone, so that the surface ozone concentrations reported by these 4 models are essentially the same. In this sense, the single prescribed ozone (input4MIPs) [63] is used in place of the 4 models to avoid duplication. ⊥ All the realisations of the climate equilibrium started since 1850, so that are marked with the same initialisation index, i1. The ensemble experiment variant serial numbers are defined by a combination of realisation, initialisation, physics, and forcing, e.g. r1i1p1f1. ∗ The 2 CNRM models are not considered for surface ozone multi-model fusion as they do not include tropospheric ozone module. § Full name as HAMMOZ-Consortium, marked as HAM in model name. † MOHC, MO-NERC and NIMS-KMA ran the same UKESM1 model with same configuration, but contributed different ensemble experiments, so that are referred collectively as UKESM1-0-LL hereafter in this paper. ‡ NCC ran the NorESM in two different coupling resolutions, as low atmospheric-medium ocean resolution (LM) and median atmospheric-medium ocean resolution (MM). In order to achieve higher performance in multi-model fusion, only the higher spatial-resolution simulation, MM, is considered so as to avoid duplication.

Observations

The tropospheric ozone assessment report (TOAR) programme has archived high-quality ground-level O3 measurements over the period 1990–2014 [17], which are used as “standard” for physical and statistical model evaluation; our study period is thus selected as 1990–2014. To support analyses at the planar spatial resolution of the CCMs involved in this study, TOAR sites are aggregated into 2° × 2° latitude-longitude grid as plotted in Fig. S1, including 585 spatial grids with a total of 5,322 different observational sites; and averaged to monthly temporal interval for the robustness of model-observation evaluation. Such spatiotemporal aggregations can also strengthen the stability of grid-level observation-simulation evaluation, and to some extent abate the statistical compromises by excluding the observation missing records for some certain sites in the early years of the dataset (ca. 1990s). Only spatial grids in which there is at least one observation site are used. Throughout the study, the gridded TOAR observations are used solely as supervised learning labels to train models that can be applied onto wider temporal-range CMIP6 CCMs (e.g. 2015–2100), rather than as inputs.

Additional auxiliary predictors

Higher prediction accuracy can be achieved when integrating additional features into statistical models, which are nominated as auxiliary variables, covariates, predictor variables, or assistant features exchangeably in terminology reported by literature [42,43,83]. Comprehensively considering the O3 budget mechanisms, experiences from previous relevant studies, and statistical correlations with surface O3 using generalised linear model (GLM) stepwise backward selection, we screen out 13 variables as assistant predictors as: CMIP6 simulated concentrations of surface PM2.5, NO2, higher layers of O3 (vertical O3 column), and ambient air temperature obtained from the World Climate Research Programme (WCRP) Earth System Grid Federation (ESGF) CMIP6 database (https://esgf-node.llnl.gov/search/cmip6); emissions of biogenic VOCs, NOx, CO, black carbon (BC) and organic carbon (OC) together with urbanised land proportions, collected from input datasets for Model Intercomparison Projects (https://esgf-node.llnl.gov/search/input4mips); surface elevation downloaded from the Global Multi-resolution Terrain Elevation Data (GMTED) [84]; and gridded urban and rural populations linearly interpolated with corrections towards the actual annual world total populations into year-precision from United Nation's World Population Prospects (UN WPP) Adjusted Population Density and Gridded Population of the World (GPW) operated by NASA Socioeconomic Data and Applications Centre (SEDAC) [85]. The excluded auxiliary variables might be of mechanistic relationships with O3 budget (e.g. concentrations of VOCs, humidity), but post-simulation data fusion studies respect the GLM stepwise screening results more, in line with the parsimony principle when enhancing prediction accuracy.

Multi-model fusion frameworks

We use “physical model” to refer to the CMIP6 mechanism-driven atmospheric models, and “statistical model” for the data-oriented machine- or deep-learning frameworks to avoid confusion in terminology. No transformations are made for either the observations or model simulations as they follow the Gaussian distribution well with slight temporal imbalance. Following literatures [42,43,83], an adjusted ensemble learning-based multi-model fusion framework is constructed as presented in the upper panel of Fig. 1. In this approach, raw simulations (i.e. 57 CMIP6 historical simulations and 1 prescribed O3 dataset, noted as “57 + 1 ensemble” hereafter) together with the normalised additional auxiliary predictor variables are first re-gridded onto the 2° × 2° TOAR observation grids as the first stage, following procedures graphically presented in Fig. S2. In the second stage, all the model simulation ensembles, external predictors, and 6 space-time indices (i.e. 3 Euclidean spherical coordinates in analytic geometry, and 3 helix-shape trigonometricised month sequence t as [cos(2πtT−1), sin(2πtT−1), t] where T is prescribed as 1 year) [40] are mixed together as inputs for random forest, gradient boosting decision tree, and convolutional neural network regression models separately; and in the third stage, outputs from the 3 algorithms are finally blended by L2-regularisation-based weighting (ridge regression). This approach is entitled as “aggressive” approach because this methodology respects the observations (i.e. labels for supervision) more than the physical models, hence during the process of training, the concentrations in each grid are treated individually accompanied by compromising the spatiotemporal continuous structure of the original physical model simulations, leading to inexplicability. The aggressive approach involves at least two stages of ensemble: the first CMIP6 multi-model ensemble and second multi-algorithm ensemble, where the random forest regressor essentially is another layer of ensemble learning. The random forest regressor is a large collection of separate decision trees with individual of which generating a single prediction and the final prediction given by averaging all trees, thus the random forest is perceived as an ensemble learning method [86]. Combination of random forest, gradient boosting decision tree, and convolutional neural network follows the design of previous studies [43,44].

Fig. 1

Abbreviations and denotations: RFR, random forest regression; GBR, gradient boosting decision tree regression; CNN, convolutional neural network regression; SFP, semi-final product; BNN, Bayesian neural network regression; , re-scaling factor; , systematic bias corrector; , individual model weight; , bias corrector; , physical model identifier; , location index; , temporal index; , random noise.

Schematic diagram of machine learning-based multi-model fusion by aggressive and conservative approaches. The stacking of source data layers refers to the collections of datasets with the same level in training models; the ellipses indicate elemental machine learning methodologies; and the rectangles represent the raw outputs from machine learning treatments. A total of 57 physical model simulations and 1 prescribed O3 concentration dataset (Inputs4MIPs) are considered. Abbreviations and denotations: RFR, random forest regression; GBR, gradient boosting decision tree regression; CNN, convolutional neural network regression; SFP, semi-final product; BNN, Bayesian neural network regression; , re-scaling factor; , systematic bias corrector; , individual model weight; , bias corrector; , physical model identifier; , location index; , temporal index; , random noise. Contrarily, in order to maintain the interpretability of the deep learning processes, we also adopt an enhanced 2-stage space-time Bayesian neural network (BNN) framework as illustrated in the lower panel of Fig. 1. Space-time indices and additional predictors are put into a 10-layer 1024-node at maximum BNN to generate spatiotemporal variant re-scaling factors (k), bias correctors (b) and the randomised noises (σ), under the supervision of TOAR observations to pre-calibrate the raw re-gridded CMIP6 simulations. Then, spatiotemporal variant model weights (α) are estimated by 5-layer 256-node at maximum BNN merely from the 6 space-time indices, to finally reach the weighted average ensemble surface O3 concentration predictions. The numbers of hidden layers and nodes are determined by pilot tuning experiments. This approach is named as the “conservative” approach as throughout the process of prediction enhancement, all parameters are clamped by space-time indices with presumed distributions, thus this framework respects the raw simulations more and might be highly biased on extreme observations. All involved parameters can be thoroughly separated from the framework and presented intuitively by mapping, so that the whole process of assimilation is traceable and interpretable. We construct the two-stage BNN instead of single-stage because the divergences still exist among the calibrated CMIP6 models in the first-stage and hence further mixing is required. Directly using the second-stage BNN will lose the chance to observe the calibration features for individual physical models; and different degrees of initial biases will cast higher weights onto the smaller biased models, possibly leading to undesirable feature monopolisation. Statistical principles of naïve space-time BNN (i.e. single-stage space-time BNN) are illustrated in details by a recent report [40]. Mathematically speaking, solutions of the spatiotemporal parameters (i.e. k, b, and α) are not unique, but it is reasonable to assume the observation covered and uncovered regions are of homogeneity in distribution of these parameters, which requires a Bayesian method to replace the single value of parameters with a distribution. The 6 space-time indices can assist in capturing the spatiotemporal autocorrelation of the surface O3. 10,000 times of Markov-Chain Monte-Carlo (MCMC) simulation ensembles are applied to approximate the distribution, so as to guarantee the robustness of BNN estimation, thence the conservative approach involves a total of 3-stage ensembles: first in multi-model ensemble and the latter two in the 2-stage Bayesian parameter generation. For the final predictions based on the optimised distribution parameters trained through the BNN, 69.2% fall into 1 standard deviation (σ) range, 96.2% into 2σ and 99.9% into 3σ, conforming to the regularity of Gaussian distribution and thus justifying our Bayesian model presumption. To evaluate the performance of 2 approaches, 10-fold cross-validation (CV) assessment is applied, and 7:3 training-test split is used through the full dataset during 1990–2014. An additional temporal extrapolation test is conducted by manually setting the 1990–2009 TOAR observations with grid-corresponding physical model simulations as training set and 2010–2014 as test set. Three manual cross-validation tests are conducted by splitting the whole dataset into training-testing sets with regional integrity as i) Europe-training for North-America-testing; ii) North-America-training for Europe-testing; and iii) Europe-North-America-training for East-Asia-testing, so as to evaluate the spatial extrapolation capability of the 2 statistical models. Decomposition of model-observation errors follow a previous research [87]. The neural network trainings are accomplished by Adam stochastic optimisation algorithm, setting the initial anchor values from observations and the learning rate as 10−4 after centric normalisation. Including space-time indices as inputs enables the two frameworks to achieve space- and time-specific simulation calibrations, rather than a simple homogeneous correction. The complex machine learning frameworks are constructed instead of using simple statistical models owing to their limitations in handling the i) similarities across multiple physical models (i.e. collinearity in statistical term); ii) interaction effects between the input variables; iii) spatiotemporal auto-correlations and discrepancies in calibration parameters; and iv) propensity of overfitting when introducing high-order polynomial terms. Additionally, this cross-disciplinary study closely follows the trends of applying the cutting-edge data sciences onto environmental studies, hence only machine- and deep-learning approaches are transplanted, enhanced and discussed here.

Other relevant statistics

Fourier-series sinusoid functions theoretically can fit any periodical variables [88], so are used to capture the location-specific long-term periodic variations of surface O3 in this study to parametrically interpret the final assimilated surface O3 concentrations by revealing the intra- and inter-year variability quantitatively with perceivable mapping. The aggressive and conservative approaches are post-simulation data fusion frameworks without any influences onto physical modelling mechanisms, and Fourier fittings are post-fusion descriptive statistics not independent from CCM simulations and fusion processes. Akaike Information Criteria (AIC) is used for statistical model selection, taking the realistic explicability altogether into consideration as listed in Table S1. Given TOAR observations and model outputs are monthly averaged, the final Fourier function is chosen aswhere t represents the month-sequence; a0 as starting-point surface O3 concentration (January 1990); 12a1 as log-transformed annual average change rates (given the temporal interval is defined as month, 12 should be multiplied for converting into yearly metric); 2b0 as the baseline and 24b1 as annual change of seasonal variation amplitude (i.e. peak-valley difference; since b0 refers to the peak-centricity or centricity-valley gaps, the peak-valley disparities need to be doubled); and c0 as the fine-tuning parameter which can modify the sinusoidal shape, but usually the absolute values are rather small, thus not considered for interpretation. An exponential term for the annual average surface O3 is applied instead of linear term as the long-term simulations have reported exponential increasing trend of the tropospheric O3 over centennial scales [32], regardless of the fact that the AIC values vote for the linear model.

Results

Raw simulation evaluations

Raw CMIP6 surface O3 simulations generally perform fairly well across all TOAR covered areas in terms of synchronicity (Fig. 2), as the correlations between observations and the 57 + 1 ensemble averages are 0.74 ± 0.18 (Inter-Quartile Range, IQR: [0.67, 0.87], Range: [-0.58, 0.96]). Overestimations are observed at 4.1 ± 2.0 (IQR: [5.1, 13.1], Range: [-22.2, 31.1]) ppbv across all TOAR covered spatial grids, hence the normalised mean biases (NMB) are high at 9.7 ± 6.3 (IQR: [4.2, 13.5], Range: [-28.1, 48.9]) %. Some regions like west Australia coastline even report negative correlations (Pearson's ρ = −0.58).

Fig. 2

Model-observation evaluation for the raw CMIP6 surface ozone simulation-ensemble and multi-model fusion by both aggressive and conservative approaches. a-c: Simulation-observation synchronicity, absolute and relative biases for 57 + 1 CMIP6 simulation ensemble. Model evaluations are conducted on TOAR observation covered sites across 1990–2014. d-g: Evaluations of aggressively and conservatively integrated surface ozone concentrations in terms of the overall model-observation synchronicity and bias. h-i: Multi-model and TOAR-observation assimilated historical global surface ozone concentrations by aggressive and conservative approaches. The 25-year average surface ozone concentrations during 1990–2014 are mapped as summary. All spatial resolutions are set as 2° × 2°, and the temporal interval is set to month. The synchronicity and bias for realisation-ensembled model outputs are also evaluated in Fig. S3 and Fig. S4. NASA-GISS-E2.1 reports negative synchronicity in the USA-Canada border, while NCC-NorESM fails to reproduce the temporal variabilities in most of the studied sites. UKESM1-0-LL predicts closely to the measurements, but underestimates the surface O3 around the USA-Canada border; while all the rest models present overestimations. Divergences are found between the individual models (Fig. S5), and the high simulation discrepancies are mainly aggregated in the intertropical convergence zone (ITCZ) and eastern China, where the standard deviations exceed 20% of the ensemble means. The barely satisfactory synchronicities and high overestimation biases indicate that the raw surface O3 simulation might not be suitable for direct application in health impact studies, verifying the necessity of calibrations, at least statistically. In addition, since the model simulation losses are of spatiotemporal variant patterns, simple calibration approaches like subtracting a constant overestimation bias from multi-model average might be of limited use. On this condition, more complicated post-simulation statistical models are required, using a series of full spatiotemporal coverage variables to capture the patterns from observation-covered sites and project the patterns onto regions beyond observation.

Performance of multi-model ensemble fusion

Both aggressive and conservative multi-model fusion perform well in prediction enhancement (Fig. 2). The model-observation correlations are high at 0.98 ± 0.01 (IQR: [0.97, 0.99]) and 0.95 ± 0.08 (IQR: [0.95, 0.98]) for the aggressive and conservative approach, respectively; and NMBs of the aggressive model are 0.29 ± 3.06 (IQR: [-1.22, 1.54]) %, marginally smaller than the conservative model at 0.40 ± 3.57 (IQR: [-1.72, 1.93]) %. The general overestimation issues of the raw CMIP6 simulations have been handled well, but there are still some sporadic high NMBs detected in Asia, Africa, and South America, where the ground-based monitoring sites are rare and spatially scarce. The full-range fitting R2 (Table 2) of the aggressive and conservative approaches are 0.96 and 0.95, respectively, both indicating plausibility of the multi-model fusion with calibration; while the conservative predictions follow more loosely to the observations, especially in the low-concentration ranges (Fig. S6), resulting in relatively higher root mean squared error (RMSE) as 2.12 ppbv compared with 1.81 ppbv for the aggressive approach. However, the conservative approach performs better in 1:1 model-observation calibration criteria according to the closer-to-one slope factor (k−1 < k, 0.97−1 < 1.05) and closer-to-zero systematic bias (|b| < |b|, |0.71| < |-1.35|). This is because directly involving additional auxiliary features (i.e. the aggressive approach) can possibly introduce noise into the calibration, as their association with surface O3 are not simply linear, especially in higher concentration ranges, so that the 1:1 model-calibration line is deviated. The crude planar and longitudinal resolutions can smooth the observational noises and extreme cases, resulting in higher prediction accuracies than other similar data fusion studies with finer spatiotemporal precisions [44,89,90].

Table 2

	Aggressive Approach							Conservative Approach
	CV-R²	test-R²	full-R²	RMSE	NMB	k	b	CV-R²	test-R²	full-R²	RMSE	NMB	k	b
Period
1990–1994	0.91	0.90	0.94	2.00	3.41	1.11	−1.62	0.92	0.91	0.93	2.00	0.02	0.98	0.59
1995–1999	0.90	0.90	0.94	1.74	1.71	1.09	−1.26	0.92	0.91	0.92	2.10	0.84	0.97	0.66
2000–2004	0.91	0.91	0.95	1.71	0.88	1.09	−1.16	0.91	0.91	0.93	2.28	0.71	0.97	0.95
2005–2009	0.91	0.91	0.96	1.68	1.11	1.09	−1.17	0.91	0.91	0.91	2.22	0.83	0.97	0.82
2010–2014	0.94	0.93	0.96	1.71	0.88	1.09	−1.16	0.92	0.91	0.94	2.28	0.71	0.97	0.95
Region
Europe	0.91	0.91	0.94	1.94	2.40	1.12	−1.61	0.92	0.91	0.92	2.02	1.27	0.98	0.37
North America	0.93	0.93	0.96	1.61	1.27	1.08	−1.19	0.91	0.91	0.93	1.96	−0.04	0.97	0.94
South America	0.90	0.87	0.95	1.22	3.12	1.10	−0.89	0.83	0.81	0.83	2.55	3.06	0.92	1.51
Asia	0.92	0.92	0.95	2.14	4.03	1.12	−1.65	0.90	0.90	0.92	2.96	1.85	0.96	0.90
Africa	0.90	0.86	0.90	2.13	2.82	1.19	−2.33	0.82	0.80	0.84	3.69	−3.81	0.93	2.88
Oceania	0.94	0.91	0.96	0.91	0.68	1.08	−0.78	0.83	0.81	0.84	2.13	−1.05	0.88	2.65
Season
March–May	0.93	0.90	0.97	1.91	0.84	1.13	−0.65	0.94	0.91	0.96	2.06	0.89	0.99	0.97
June–August	0.94	0.92	0.98	1.78	1.12	1.09	−0.86	0.94	0.92	0.95	2.14	0.74	0.97	0.75
September–November	0.93	0.89	0.98	1.75	3.09	1.12	−0.57	0.93	0.90	0.95	2.07	0.10	0.98	0.69
December–February	0.93	0.90	0.98	1.80	3.05	1.14	−0.60	0.93	0.90	0.95	2.19	0.54	0.98	0.51
TOAR	0.94	0.89	0.96	1.81	2.01	1.05	−1.35	0.90	0.88	0.95	2.12	0.57	0.97	0.71

Evaluation summary of aggressive and conservative multi-model fusion for surface ozone. The model evaluation metrics include the cross-validation (CV), test and full dataset overall coefficient of determination (R), the root mean squared error (RMSE), the normalised mean bias (NMB), and the linear regression slope (k) and intercept (b). Both two statistical models are evaluated separately for each 5-year period, season and continent to assess the spatiotemporal performances. Both approaches calibrate the physical models effectively, with the conventional aggressive approach performing slightly better than the innovatively established conservative model, which however, is already good. The spatiotemporal stability of the two approaches are also assessed in Table 2, concluding that the aggressive approach performs better in the later years of the dataset, while the conservative approach performs consistently well across the 25-year period. This is because the aggressive approach depends so largely on the observations that defects of observation coverage in early years will compromise the learning effects. However, the aggressive approach performs well across different continents (R2 > 0.90), but the conservative approach performs slightly worse in the southern hemisphere (R2 > 0.83), as a result of insufficient observations. This data sparsity results in the inter model-spread in the raw simulations being, to some extent, retained, as this could not be addressed by the BNN-based weighted linear combination; instead, additional features in the prediction-oriented aggressive approach brute-forcedly correct the large observation-simulation gaps. Both approaches perform well across seasons. The extreme cases of observed surface O3 are defined as outliers exceeding 1st-99th percentiles, equally 8.3–50.6 ppbv. Both 2 approaches perform closely well on the low-concentration extreme O3 as RMSE <1.92 ppbv, but the conservative approach fails in the high-concentrations owing to substantial low biases (RMSE = 6.16 ppbv, NMB = −7.67%), inferior to the aggressive approach (RMSE = 4.64 ppbv, NMB = −5.08%). This is because the Bayesian approach will “restrict” predictions into the prior probabilistic distribution, hence labelled as “conservative approach”.

Extrapolation generalisability

Due to the limitations of lacking systematic observations in China, India, Africa and oceanic regions during 1990–2014, there are no means to verify the simulations in these areas directly; but this problem can be explored indirectly by checking the extrapolation potential on the observation-uncovered locations. Three regional cross-validation tests are graphically summarised in Fig. S7, all of which reveal better generalisation capability of the conservative approach than aggressive. Neither underfitting nor overfitting issues are detected on the conservative approaches (i.e. CV and test scores are quite close); while underfitting is apparent for the aggressive approach in these regions, mainly reflected by failures in capturing extreme O3 concentrations. The temporal extrapolation tests of two statistical models reveal high generalisability on the most recent 5-year test sets during 2010–2014 as R2 = 0.91 (CV-R2 = 0.88, test-R2 = 0.82) for the aggressive approach and R2 = 0.92 (CV-R2 = 0.89, test-R2 = 0.85) for the conservative approach. The temporal extrapolation performances are better than spatial generalisation, because the temporal periodic variations of surface O3 are of a more stable pattern than regional divergences. In a nutshell, the conservative BNN approach wins over towards spatial and temporal generalisability, and we thus regard the conservative BNN results as “standard” for further interpretation.

Differences between ensemble approaches

Comparisons between the “standard” and aggressive approach outcomes are graphically summarised in Fig. S8, revealing most of the global regions are of high congruity (ρ = 0.85 ± 0.17, IQR: [0.81, 0.96]), while the divergences mostly occur on the ITCZ and Arabian-African areas (ρ < 0.02). Small relative biases have also justified the similarity between the aggressive and conservative approaches, as the NMBs (defined as aggressive minus conservative) are 1.38 ± 4.61 (IQR: [-1.59, 3.77]) %. The positive differences mainly aggregate in Africa, Antarctica, Oceania and most of the oceanic basins, while the negative differences cluster in Asia, Europe and America. The simplest fusion, the arithmetic average, of CMIP6 simulation ensemble would be used as a compromise were there no ground-based observations as used by precedent studies [32], which factually could lead to high biases if the real surface O3 exposure assessment is the main research interest. This study aims to develop innovative approaches to fuse both model simulations and observations, and by comparing with the simplest fusion, advantages of new methods can be highlighted. The conservatively ensembled surface O3 concentrations are of higher synchronicity (ρ = 0.97 ± 0.06, IQR: [0.97, 0.99]) with the simple ensemble average than the aggressive approach (ρ = 0.87 ± 0.14, IQR: [0.83, 0.96]), as the BNN is essentially an enhanced linear combination of multiple model simulations without substantial changes to the spatiotemporal auto-correlation. The ensemble average exceeds the aggressive fusion by 5.9 ± 9.7 (IQR: [-7.9, 14.3]) %, and the overestimations cluster regularly on land surface, especially the high-population-density regions; but surpass the conservative fusion by 9.6 ± 10.5 (IQR: [0.81, 20.2]) %, with the overestimations mainly detected in the wide-coverage northern-hemisphere without apparent land-ocean distinguishment. In conclusion, the simple ensemble average can lead to overestimations, especially in the northern hemispheric land surface; and the differences also reveal that the aggressive fusion model has modified the spatial auto-correlation of the raw CMIP6 simulation to a larger extent than the conservative approach.

Bayesian spatiotemporal weights

The differences between the two approaches can also be partially attributed to the different weighting schemes of the raw individual simulations. The 57 + 1 ensembles occupy 93.9% weights in the aggressive approach while the additional assistant variables only contribute 6.1%. Generally, for the aggressive approach, 4 among the 58 simulations contribute dominantly by over 10%, as UKESM1-0-LL-r3i1p1f2 (18.6%), the prescribed O3 (17.4%), NASA-GISS-E2.1-G-r1i1p3f1 (14.7%) and NCAR-CESM2-WACCM6-r1i1p1f1 (14.1%), while 36 ensemble members contribute less than 0.1%, as graphical presented in Fig. S9. On the contrary, the conservative approach results in relatively more even weights, where the prescribed O3 (2.1%), UKESM1-0-LL (1.9%) and NASA-GISS-E2.1 (1.8%). Besides the physical model weights, the space-time BNN also generates spatiotemporal variant weights, which can reflect the regions of skill for each individual physical model as presented in Fig. S10: UKESM1-0-LL and NCAR-CESM2-WACCM6 are weighted higher in northern hemisphere over land, while the prescribed O3 dataset, NASA-GISS-E2.1, and NOAA-GFDL-ESM4 contribute more in southern hemisphere over land. The temporal variations of the spatial weights are generally small and of regular regional clustering trends, indicating that the physical models have captured the seasonal variability well. BNN-based multi-model fusion treats the assistant variables independently with the CMIP6 model simulations, so that the weights of these additional features are not at the same level as the physical models like in the aggressive approach. Direct comparisons of the weights of the assistant variables between the two approaches reveal quite similar patterns of using these additional features for model calibration as shown in Fig. S11 which indicates that urban-rural populations, ambient air temperature and elevation are important factors. We suggest further work pay more attention to the role of model surface temperature, which is not fixed in these free-running simulations. High contributions of the space-time indices also indicate that more additional features need to be included for further consideration.

Long-term surface ozone variations

Spatiotemporal variabilities of the BNN-fused surface O3 are summarised parametrically using Fourier-series functions (Fig. 3). The fitting quality R2 has reached 0.81 ± 0.12 (IQR: [0.77, 0.87]), where the poor performances (R2 < 0.50) concentrate in ITCZ and the coastlines. The global annual average increasing rate of the surface O3 is estimated to be 0.23 (95% CI: [0.21, 0.25]) % yr−1, and the highest increasing rates are detected in south Asia, South America, and continental Europe. Decreasing trends are also discovered in eastern China and eastern US. The average intra-year seasonal variation is 13.9 (IQR: [2.1, 49.5]) ppbv, and the highest amplitude differences cluster in eastern US, Africa, Europe, and eastern China. The annual changes of seasonal variations also demonstrate regional variabilities: widening in eastern China by maximum as 1.8 ppbv per year while narrowing in western countries by extreme to −0.8 ppbv per year. The intra-year peak and valley concentrations are generally ascending, as the peaks increase by 8.8 ± 1.1 (IQR: [-6.8, 16.1]) ppbv per year, and the valleys ascend by 0.6 ± 0.8 (IQR: [-7.0, 8.3]) ppbv per year.

Fig. 3

Spatiotemporal variability parametrisation for CMIP6 multi-model ensemble assimilated surface ozone concentrations during 1990–2014 by the conservative approach. The ensemble-learning predicted concentrations are clustered by month. a: Fourier-series function-based curve-fitting quality for grid-specific surface ozone variabilities against temporal sequence, quantified by R. b: Annual increasing ratio for yearly average surface ozone concentrations, estimated by exp(12a1)-1. c: Annual average intra-year variation amplitude as the peak-valley gaps, estimated by 2b0. d: Annual average linear change rates of the intra-year variation amplitudes, estimated by 24b1. e-f: Averaged annual change rates of peak and valley concentrations, deduced from the fitted second-order Fourier-series function.

Discussion

Multi-model fusion improvement potential

Decomposition of model-observation errors (Fig. S12) can assist in evaluating the potential optimisations of statistical models, as it is worthwhile to analyse the sources of prediction losses and how these may be theoretically reduced [91]. The overall RMSE for the aggressive approach is 1.81 ppbv, among which the irreducible square root noise is 1.42 ± 0.47 ppbv, occupying 66.1 ± 16.7% of the total errors; while the averaged error of the conservative approach is 2.58 ppbv, where the square root noise is 1.87 ± 0.70 ppbv, accounting for 62.2 ± 25.4%. The noises together with the biases by conservative approach are generally higher than the aggressive approach, while their proportions are close except for the African regions, as listed in Table S2. Most of the unsolvable noises take over more than half of the errors, indicating that both fusion approaches have well approached the realistic observations. The variances, also known as cross-model divergences, are comparable or even greater than the biases for the aggressive approach, while conservative approach variances are several folds lower than biases, accounting for less than 10% except for South America (17%). This indicates the conservative fusion model is more robust. The model variances can be statistically perceived as discrepancies of model construction by random draws from the training subset, so that higher model variances represent severe dependence on training inputs, revealing higher sensitivity and lower generalisability. The current crux of the conservative fusion model falls on the biases, suggesting higher optimisation potentials than the aggressive approach. The biases originate from the inherent systematic biases in physical models, and the insufficient inclusion of assistant features to enhance the prediction statistically. Comparatively, due to the relatively higher statistical model variances, the aggressive approach should not be the prevalent stream for multi-model fusion, as changes in observation coverage (i.e. labels for supervision in machine learning) will substantially affect the stability of the statistical model.

Differences in spatial extrapolation

Divergences exist in the multi-model fused and calibrated surface O3 by two approaches, especially in observation-uncovered areas. The better spatial generalisation capacity of the conservative space-time BNN multi-model fusion is an advantage over the aggressive approach. Paradoxically, the aggressive approach actually performs well in capturing extreme values. This is attributed to overfitting on the assistant features added directly into the fusion processes, so that the predictions are excessively reliant on these external variables. However, due to the complexity of the mechanisms controlling O3, the statistical associations between physical models, auxiliary predictors, and realistic concentrations recognised by the aggressive approach are superfluous and of localised boundedness so that they may drastically differ across regions. Excluding these features from aggressive multi-model fusion alleviates the poor performance in spatial extrapolation, as for each regional cross-validation test, R2 rises to 0.81, 0.83, 0.74, and RMSE declines to 3.64, 3.97, 5.95 ppbv, for North America, Europe and East Asia respectively. To put it briefly, the external assistant features can increase the fitting quality in statistical training, but also serve as the limiting factors for model generalisation. This presents an issue towards understanding the processes of aggressive multi-model fusion, since conservative predictions manifested as underfitting by the aggressive approach should be ascribed to overfitting in the additional feature-assisted aggressive pathway. It suggests that conventional ensemble deep-learning approaches respect the observations as supervision and link the input variables only statistically, rather than respecting that the physical and chemical mechanisms are of limited use. Hence, this is the second reason that the novel conservative multi-model fusion approach by space-time BNN is preferred.

Cross-approach divergences

Most discrepancies between the two fusion approaches and the simple ensemble average are located in the tropics (Fig. S8). This is primarily attributable to the lack of observations as training data, and the variations in raw simulations (Fig. S5) which result from the difficulty in capturing O3 in this region due to complexity in the precursor emissions, including biogenic VOCs, soil NOx, lightning NOx, etc. [32] We highlight in particular the need for long-term continuous ground-based measurements of O3 in the tropics as a research priority. The differences between the simple ensemble average and the aggressive fusion approach (Fig. S8) indicate that the aggressive approach only addressed the systematic overestimations on the land surface; the additional variables lead to a land-ocean contrast (e.g. the population, ambient air temperature, O3 precursor emissions), which are used as key nodes in the tree-structure regressions, so that the calibrations are only effective over land rather than the whole global surface. The conservative approach respects the raw simulations more by calibrating uniformly for both land and oceans, so that the average-conservative differences are more spatially uniform (Fig. S8).

Systematic overestimation

Direct averaging the raw CMIP6 surface O3 simulation ensembles, as commonly used in the literature [2,32,36], causes positive biases around 5–10%, equal to 3.6 ± 4.4 ppbv, with some regions like India high-biased by +40% (+22.7 ppbv), consistent with recent multi-model ensemble studies in this region [92]. Subtracting a constant systematic bias-offset still cannot handle the regional variant biases. Such large biases have detrimental influences on the use of raw ensemble mean data for work related to public health and pollution control policy studies in these regions, reiterating the necessity of observation-supervised calibration. The systematic overestimations across CMIP6 simulations speculate the major cause to be the inadequate vertical stratification in atmospheric modules. Essentially speaking, the lowest layers of CMIP6 model simulations are used to approximate surface O3, but the layer actually refers to a vertical average. Tropospheric O3 concentration rises with altitude [32], thus resulting in overestimation. UKESM1-0-LL stratifies 85 vertical layers [20], which is the most among 8 interactive chemistry CMIP6 models (Table 1), and lowest overestimations are found, with underestimations observed in a few regions (Fig. S4). Further experiments to adjust vertical stratifications and observe the changes in surface O3 simulation performance are suggested to rigorously check this speculation.

Rationality of enhanced space-time BNN

We design our enhanced 2-stage space-time BNN optimised from a classical naïve space-time BNN which does not consider additional feature involvement [40]. The enhancement in part stems from overcoming the inconsistence between the overall and location-specific observation-simulation linear relationships: each simulation cell at different time points requires a unique set of k-b parameters for calibration as , where the subscripts l and t represent location and time indices, so that using a fixed slope k and intercept b to calibrate all simulation cells is of limited use. However, the calculated sets of parameters are spatially limited to observations, thus a naïve space-time BNN framework is required for spatial extrapolation onto the full global space. The BNN generates space-time variant calibration slopes and intercepts for each CMIP6 model in the pilot attempts, with which the assistant features are significantly correlated Fig. S13, indicating these additional factors can contribute to the calibration parameters. To increase prediction accuracy, the enhanced 2-stage Bayesian neural network regression-based multi-model fusion framework is constructed firstly by incorporating assistant features into the multi-layer perceptron structure to generate the calibrated individual simulations, and secondly by fusing these using another naïve space-time BNN without involving any external features.

Comparisons with relevant studies

There are 2 other recent studies exploring possible means to fuse multiple CMIP6 simulations for surface O3 [38,39]. Chang et al. (2019) developed an M3Fusion method which combined a convolutional neural network (CNN) to capture the spatial auto-correlations and a recurrent neural network (RNN) to recognise the temporal dependences, so that both spatiotemporal variabilities can be reproduced in multi-scale, multi-temporal and multi-modality weighting-based data fusion with bias correction, which is a praiseworthy leap in data-driven environmental studies [39]. The main weaknesses are its opaque model training processes and lack of direct evaluation of spatiotemporal auto-correlated residuals. DeLang et al. (2021) made an improvement by adhering Bayesian Maximum Entropy (BME), a pure posterior statistical algorithm without involving machine learning, as the second-stage operation to calibrate the spatiotemporal auto-correlated residuals after M3Fusion [38]. BME is of high statistical interpretability and efficient computation, but the prediction accuracies are not comparable with machine learning models because heavy reliance on the priori of spatiotemporal autocorrelation patterns may over-smooth the final predictions [93]. Our 2-stage space-time BNN in the conservative approach leverages the high prediction capacities of deep learning frameworks and core principles of maximum entropy information theory to achieve comprehensive interpretability, with the cost of extremely high computational burdens as the BNN uses probabilistic distribution estimation to replace deterministic calculations. Additionally, the space-time BNN does not perform well in extreme cases like the aggressive approach, which computes much more quickly at the expense of ignoring residual reproduction and interpretability. Comprehensively considering the pros and cons, the aggressive approach is preferable for regional data fusion with sufficient observation coverage, while the conservative approach is of higher value for longer-term larger-scale studies which require more extrapolations, if computation expense permits.

Sensitivity analysis

Considering that cross-realisation variations (0.5 ± 0.1 ppbv) are much lower than cross-model deviations (4.6 ± 1.7 ppbv, Fig. S5), we conduct an additional sensitivity analysis by firstly averaging the multi-realisations within each model, then putting the 8 realisation-averaged model simulations together with the prescribed O3 (hereafter noted as 8 + 1 models) into the aggressive and conservative model as the input stacking layer, so that potential influences from imbalances in realisation, physics and forcing numbers can be thoroughly eliminated. The results of these new fused data are very similar to the previous calculations, with R2 = 0.94, RMSE = 2.24 ppbv for the aggressive approach, and R2 = 0.93, RMSE = 2.67 ppbv for the conservative approach. This sensitivity analysis experiment shows that unequal numbers of intra-model realisations do not significantly affect fusion performance, indicating that disparity in the number of realisations for a given model (e.g. 21 realisations for NASA-GISS-E2.1 while only a single realisation for NOAA-GFDL-ESM4) is not a significant issue when the research target is post-simulation data fusion. It also suggests that averaging the multi-realisation ensemble before the multi-model fusion takes place will still result in accurate results. This is particularly important if the model-data fusion approach is computationally expensive, as is the case for the conservative approach we have used. One-dropout sensitivity analysis shows that removing one model (with all its realisations) can achieve impressive accuracy, with R2 = 0.91–0.93, RMSE = 2.49–2.82 ppbv using the aggressive approach, and R2 = 0.89–0.93, RMSE = 2.97–3.46 ppbv by the conservative approach, both insignificantly lower than using all 8 + 1 CMIP6 models. This evidence also corroborates the limited interference from inequality of realisation numbers towards data fusion. However, the multi-model fusion performances are substantially reduced when only 2 models are kept (keeping a single model would be inappropriate for the basic idea of multi-model fusion), with R2 = 0.83–0.87, RMSE = 3.68–5.14 ppbv using the aggressive approach, and R2 = 0.71–0.78, RMSE = 4.79–8.02 ppbv with the conservative approach. The aggressive-conservative performance gap converges when fusing >9 realisations, or >4 realisation-averaged models. It exposes the critical limitation of the conservative approach and that the innovative enhanced space-time BNN will not perform satisfactorily when only a few models are used for fusion, because different models have used different chemistry mechanisms, or simplifications, or have other physical differences [94], so that limited numbers of models cannot capture the full variations of the realistic surface O3 by BNN-based linear-combination. It also further justifies the necessity of the CMIP6 multi-model study from the perspective of raising the signal-noise ratio and enabling more credible surface O3 datasets (the more models used in the fusion process the better the performance). We keep the aggressively and conservatively-fused outcomes separately as 2 ultimate achievements of this study, rather than combine into a single dataset, because of our aim to maintain the interpretability of the BNN-fusion processes instead of purely focusing on brute-force fitting.

Merits and limitations

Five major merits of our study are highlighted. First, we establish an enhanced 2-stage space-time Bayesian neural network regression-based deep-learning framework to fuse multi-ensemble surface O3 simulation, which is verified to be of high accuracy and accessible interpretability in spatiotemporal weighting. Second, we verify the improved spatial extrapolation generalisability of our newly developed approach compared to the conventional method; and owing to the commendable spatial and temporal extrapolation potentials, our ensemble learning frameworks can be applied to a wide temporal range of surface O3 studies. Third, to the best of our knowledge, this shall be the first study to fuse CMIP6 model simulations for surface O3 over the 25-year historical period of 1990–2014 using machine learning techniques, and such long-term global studies continue to be rare. Fourth, the fused and calibrated surface O3 concentration dataset can be used further for cross-disciplinary studies. Finally, we innovatively apply Fourier-series functions for the purpose of parametrising and visualising the complex temporal periodical variations of surface O3. However, our studies have several limitations. First, the model evaluation-calibration resolution is coarse at 2° × 2°, and some heavily polluted regions including China, India and Africa still lack observations. Second, the additional assistant features used to enhance the statistical model prediction are still limited, and more variables shall be considered in further studies. Third, more detailed and deeper discussions concerning the parametric model calibration by 2-stage space-time BNN regression, and mechanistic influences from different physics and forcing settings, could have been replenished and excavated, but this is beyond the scope of the current study. We aim to address some of these issues in future research.

Conclusion

To explore the possibility of harmonising the cross-model simulation discrepancies and more accurately predicting the surface O3 concentrations in decadal scales, two parallel multi-stage ensemble-learning frameworks have been developed: i) an aggressive approach using ensemble learning of random forest, gradient boosting decision tree, and convolutional neural network, and ii) a conservative approach constructing 2-stage space-time Bayesian neural networks. Both the aggressive and conservative approaches perform satisfactorily in fusing multiple CMIP6 free-running CCM surface O3 simulations under supervision of observations, assisted with auxiliary datasets. The innovative Bayesian neural network framework is of better interpretability and higher spatiotemporal extrapolation capacity than the conventional ensemble learning model at the expense of high computation burdens. The Bayesian method is also able to present the parametric calibration and weighting layers intuitively, which can inspire further mechanistic model revisions and help improve surface O3 modelling with CCMs in the future. Besides the development of the two machine learning frameworks as methodological frameworks for post-simulation data assimilation research, the multi-model fused surface O3 concentrations with bias calibration also contribute to the literature for further impact and policy studies.

Declaration of competing interest

We have no conflicts of interest to disclose.

26 in total

1. Asthma in exercising children exposed to ozone: a cohort study.

Authors: Rob McConnell; Kiros Berhane; Frank Gilliland; Stephanie J London; Talat Islam; W James Gauderman; Edward Avol; Helene G Margolis; John M Peters
Journal: Lancet Date: 2002-02-02 Impact factor: 79.321

2. An ensemble-based model of PM_2.5 concentration across the contiguous United States with high spatiotemporal resolution.

Authors: Qian Di; Heresh Amini; Liuhua Shi; Itai Kloog; Rachel Silvern; James Kelly; M Benjamin Sabath; Christine Choirat; Petros Koutrakis; Alexei Lyapustin; Yujie Wang; Loretta J Mickley; Joel Schwartz
Journal: Environ Int Date: 2019-07-01 Impact factor: 9.621

3. Comparison of Machine Learning and Land Use Regression for fine scale spatiotemporal estimation of ambient air pollution: Modeling ozone concentrations across the contiguous United States.

Authors: Xiang Ren; Zhongyuan Mi; Panos G Georgopoulos
Journal: Environ Int Date: 2020-06-25 Impact factor: 9.621

4. Spatiotemporal continuous estimates of PM_2.5 concentrations in China, 2000-2016: A machine learning method with inputs from satellites, chemical transport model, and ground observations.

Authors: Tao Xue; Yixuan Zheng; Dan Tong; Bo Zheng; Xin Li; Tong Zhu; Qiang Zhang
Journal: Environ Int Date: 2018-12-18 Impact factor: 9.621

5. Inverse probability weighted distributed lag effects of short-term exposure to PM_2.5 and ozone on CVD hospitalizations in New England Medicare participants - Exploring the causal effects.

Authors: Xinye Qiu; Yaguang Wei; Yan Wang; Qian Di; Tamar Sofer; Yara Abu Awad; Joel Schwartz
Journal: Environ Res Date: 2019-12-30 Impact factor: 6.498

6. Maternal ambient air pollution exposure with spatial-temporal variations and preterm birth risk assessment during 2013-2017 in Zhejiang Province, China.

Authors: Zhe Sun; Liyang Yang; Xiaoxia Bai; Wei Du; Guofeng Shen; Jie Fei; Yonghui Wang; An Chen; Yuanchen Chen; Meirong Zhao
Journal: Environ Int Date: 2019-10-26 Impact factor: 9.621

7. Mapping Yearly Fine Resolution Global Surface Ozone through the Bayesian Maximum Entropy Data Fusion of Observations and Model Output for 1990-2017.

Authors: Marissa N DeLang; Jacob S Becker; Kai-Lan Chang; Marc L Serre; Owen R Cooper; Martin G Schultz; Sabine Schröder; Xiao Lu; Lin Zhang; Makoto Deushi; Beatrice Josse; Christoph A Keller; Jean-François Lamarque; Meiyun Lin; Junhua Liu; Virginie Marécal; Sarah A Strode; Kengo Sudo; Simone Tilmes; Li Zhang; Stephanie E Cleland; Elyssa L Collins; Michael Brauer; J Jason West
Journal: Environ Sci Technol Date: 2021-03-08 Impact factor: 9.028

8. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.

Authors: Cynthia Rudin
Journal: Nat Mach Intell Date: 2019-05-13

9. Estimating Spatiotemporal Variation in Ambient Ozone Exposure during 2013-2017 Using a Data-Fusion Model.

Authors: Tao Xue; Yixuan Zheng; Guannan Geng; Qingyang Xiao; Xia Meng; Meng Wang; Xin Li; Nana Wu; Qiang Zhang; Tong Zhu
Journal: Environ Sci Technol Date: 2020-11-11 Impact factor: 9.028

10. Impact of Oxidant Gases on the Relationship between Outdoor Fine Particulate Air Pollution and Nonaccidental, Cardiovascular, and Respiratory Mortality.

Authors: Scott Weichenthal; Lauren L Pinault; Richard T Burnett
Journal: Sci Rep Date: 2017-11-27 Impact factor: 4.379