B B Cael1,2, Kelsey Bisson3, Christopher L Follett1. 1. Massachusetts Institute of Technology Cambridge MA USA. 2. Woods Hole Oceanographic Institution Woods Hole MA USA. 3. University of California Santa Barbara CA USA.
Abstract
We describe the basis of a theory for interpreting measurements of two key biogeochemical fluxes-primary production by phytoplankton (p, μg C · L-1 · day-1) and biological carbon export from the surface ocean by sinking particles (f, mg C · m-2 · day-1)-in terms of their probability distributions. Given that p and f are mechanistically linked but variable and effectively measured on different scales, we hypothesize that a quantitative relationship emerges between collections of the two measurements. Motivated by the many subprocesses driving production and export, we take as a null model that large-scale distributions of p and f are lognormal. We then show that compilations of p and f measurements are consistent with this hypothesis. The compilation of p measurements is extensive enough to subregion by biome, basin, depth, or season; these subsets are also well described by lognormals, whose log-moments sort predictably. Informed by the lognormality of both p and f we infer a statistical scaling relationship between the two quantities and derive a linear relationship between the log-moments of their distributions. We find agreement between two independent estimates of the slope and intercept of this line and show that the distribution of f measurements is consistent with predictions made from the moments of the p distribution. These results illustrate the utility of a distributional approach to biogeochemical fluxes. We close by describing potential uses and challenges for the further development of such an approach.
We describe the basis of a theory for interpreting measurements of two key biogeochemical fluxes-primary production by phytoplankton (p, μg C · L-1 · day-1) and biological carbon export from the surface ocean by sinking particles (f, mg C · m-2 · day-1)-in terms of their probability distributions. Given that p and f are mechanistically linked but variable and effectively measured on different scales, we hypothesize that a quantitative relationship emerges between collections of the two measurements. Motivated by the many subprocesses driving production and export, we take as a null model that large-scale distributions of p and f are lognormal. We then show that compilations of p and f measurements are consistent with this hypothesis. The compilation of p measurements is extensive enough to subregion by biome, basin, depth, or season; these subsets are also well described by lognormals, whose log-moments sort predictably. Informed by the lognormality of both p and f we infer a statistical scaling relationship between the two quantities and derive a linear relationship between the log-moments of their distributions. We find agreement between two independent estimates of the slope and intercept of this line and show that the distribution of f measurements is consistent with predictions made from the moments of the p distribution. These results illustrate the utility of a distributional approach to biogeochemical fluxes. We close by describing potential uses and challenges for the further development of such an approach.
Two of the most important processes in the carbon cycle are net primary production by phytoplankton (NPP; here we use p [μg C · L−1 · day−1]) and the export flux of sinking biogenic particles out of the upper ocean (f [mg C · m−2 · day−1]; Williams & Follows, 2011). p is typically defined either as the amount of photosynthetically fixed carbon available to first‐level heterotrophs in an ecosystem (Osmond, 1989) or as the difference between autotrophic photosynthesis and respiration (Lindeman, 1942). f is preferably defined as the downward flux of particulate organic carbon (POC) at the base of the euphotic layer z
eu (Buesseler & Boyd, 2009) but operationally is often defined as the flux though a particular depth, for example, 100 m, because most measurements of POC flux historically have not been taken at z
eu.Much remains unknown about the variability of p and f. Both p and f are influenced by a myriad of biological, physical, and chemical processes. Measurements of each range over several orders of magnitude (Buesseler & Boyd, 2009; Buitenhuis et al., 2013) and exhibit substantial variability. The interpretation of the most common method of measuring p, based on radiocarbon uptake, is debated (e.g., Marra, 2009), and f measurements are notorious for having large uncertainties (e.g., Buesseler, 1991). However, in order to gain a thorough understanding of production and flux we must characterize their variability.The relationship between f and p is also of interest; mass conservation suggests an inextricable dependence of f on p as all of the sinking material that comprises f must previously be fixed via p. However, because many other fluxes, for example, heterotrophic respiration, affect the organic carbon balance of the surface ocean, their quantitative relationship is not immediate (Williams & Follows, 2011). Multiple relationships have been proposed specifying f as a function of local depth‐integrated production
(mg C · m−2 · day−1; e.g., Eppley & Peterson, 1979; Dunne et al., 2005; Laws et al., 2011; Maiti et al., 2013), often a scaling relationship with an error term which accounts for processes the model does not resolve. However, direct comparison of these two measurements is challenging; these relationships have indicated
has little predictive power for f. Figure 1 shows Southern Ocean f and
data compiled by Maiti et al. (2013), whose correlation is nonsignificant at the 0.05 level (Figure 1), illustrating the difficulty in understanding the relationship between f and p—though other data sets have suggested stronger relationships (e.g., Dunne et al., 2005; supporting information S1).
Figure 1
Scatterplot of particulate organic carbon flux (f) and depth‐integrated net primary production by phytoplankton (
) data from Maiti et al. (2013). Notice the lack of a clear correlation between the two. Metrics reported are from Pearson's correlation: The fraction of variance of explained (R
2) and the p value of the correlation. Compare to Figure S2.
Scatterplot of particulate organic carbon flux (f) and depth‐integrated net primary production by phytoplankton (
) data from Maiti et al. (2013). Notice the lack of a clear correlation between the two. Metrics reported are from Pearson's correlation: The fraction of variance of explained (R
2) and the p value of the correlation. Compare to Figure S2.This lack of predictive power is in no small part because p and f can become locally decoupled, that is, not consistently comparable pointwise (Buesseler, 1998). p is measured on the scales of liters and days; f is effectively measured on the spatial scale of kilometers and/or the time scale of weeks (which due to mixing also implies far larger spatial scales; Buesseler, 1991; Marra, 2009; Siegel & Deuser, 1997; sections 3.1 and 3.3). The variable nature of both p and f therefore implies that a given
measurement can be a poor measure of the productivity occurring in the waters sampled by a colocated f measurement.Given the mechanistic relationship between p and f, it is nonetheless reasonable to expect that a quantitative relationship between measurements of the two should emerge at larger scales. In other words, even if p and f can become locally decoupled, it is still plausible that the p characteristics of a region could be predictive of the f characteristics of that region. Here we adapt previous approaches quantitatively relating p and f to larger scales by analyzing collections of measurements—describing these in terms of their probability distributions and then inferring a relationship between the two quantities via their distributions' moments.One way to understand variable processes such as p and f is through their probability distributions. Rather than trying to predict the value of every measurement, can the underlying distribution that the values are “drawn” from be understood? How does the distribution (or its parameters) differ for different environmental conditions? Can two quantities be related not by a pointwise function but via their probability distributions? Here we develop the basics of such an approach for p and f (Figure 2). We hope our efforts can provide direction toward a more complete theory.
Figure 2
Schematic of approach: p and f data from the global ocean are grouped into probability distributions and the probability distributions are then analyzed. Production and flux are then related to each other (and to other quantities) via the moments of their probability distributions.
Schematic of approach: p and f data from the global ocean are grouped into probability distributions and the probability distributions are then analyzed. Production and flux are then related to each other (and to other quantities) via the moments of their probability distributions.We argue that the lognormal distribution is a natural null model for both p and f on large scales. We then show that an extensive compilation of 14C measurements for p (Buitenhuis et al., 2013) and a large compilation of sediment trap and 234Th measurements for f (supporting information S2) are well described by lognormal distributions. Subregions of the p compilation are also well described by lognormals, whose moments sort according to oceanographic intuition, indicating that the lognormal is a general and robust feature of large‐scale p variability. In agreement with recent models for the relationship between f and
(Britten et al., 2017; Laws et al., 2011; Maiti et al., 2013), we posit a statistical scaling relationship between f and p, noting that it is the only relationship mapping one lognormal into another (Campbell, 1995); we then demonstrate that the moments of the two distributions should be linearly related via this scaling relationship. We then find agreement between two independent approaches to estimate the scaling relationship's parameters—one an out‐of‐sample test utilizing the log‐means of p and f from three open ocean time series stations, and the other subregioning the p and f data into three biomes, following Banse (1992). We show that the prediction made by this scaling relationship agrees with the log‐moments of the global distributions p and f and with the log‐standard deviations of the time series' and biomes' distributions. We close by discussing advantages, potential uses, and limitations of a distributional approach.
Theory
Lognormal Distribution as a Null Model
What distribution might be expected to underlie p or f? The processes that affect the distributions are both structured and stochastic. Structured processes are governed by large‐scale geophysical fluid dynamics and climate, while stochastic processes are inherently variable. For instance, the average nutrient supply to the euphotic layer over a region should directly shift the mean of the distribution for p or f, while turbulence can drive local fluctuations in nutrient supply (Falkowski & Ziemann, 1991). Here we explore what is expected when the distributions for p and f are driven by a combination of many processes, both structured and stochastic.We begin with the observation that both p and f are complex processes, occurring as the result of a number of subprocesses. In the case of p, a single carbon fixation event (i.e., carbon being fixed by phytoplankton at a particular place and time) requires the occurrence of a number of subevents: the presence of a phytoplankton, the absence of a predator that would consume the phytoplankton, the presence of light and various nutrients at suitable concentrations, and the presence of appropriate temperature and physical conditions (Durham et al., 2013; Eppley, 1972; Marra, 1978; Sverdrup, 1953).In the case of f, a single export event (in the microscopic sense, i.e., a particle sinking out of the upper ocean) requires that a particle be generated by p (and thus all of the above subprocesses), that it not be consumed or remineralized by heterotrophic processes (Steinberg et al., 2008), that it not be deflected by flow (Siegel & Deuser, 1997), that it not be disintegrated by shear processes (Alldredge et al., 1990), and that it be of sufficient density and size to overcome the viscosity of water and sink (Alldredge & Silver, 1988; Jackson & Burd, 1998; Smayda, 1970). Thus, we can consider the completion of a p or f event as dependent on the completion of many subevents.Given the general description of p and f as being composed of many subprocesses, the natural candidate for the probability distribution of both is the lognormal distribution. By the Central Limit Theorem (Montroll & Shlesinger, 1982; Shockley, 1957), the probability of completing any task that relies on the successful completion of many subtasks is lognormal. To illustrate, we adapt an example from Shockley, (1957; Montroll & Shlesinger, 1982), who used a similar model to explain the observed lognormal distribution of papers published by researchers. Let the fixation of carbon require the occurrence of n subprocesses (in a general sense, including the absence of predation), each of which occur at a given place and time with probabilities q
1,q
2,…,q
. The probability Q that a phytoplankton fixes carbon at that place and time is then their product:
. Therefore, taking the log of both sides,
so by the Central Limit Theorem, the distribution function of
(and therefore the sum, that is,
), is the normal distribution, irrespective of the individual subprocesses. Hence, given a few weak conditions (Montroll & Shlesinger, 1982; supporting information S3), the distribution function for p should be lognormal.Equivalently, lognormal distributions emerge from a large number of variable, multiplicative factors (Limpert et al., 2001), so the lognormal may be thought of as resulting as the product of variable environmental and compositional effects. The lognormal distribution, which has the form
is agnostic of the underlying subprocesses and reduces all of the complexity into two parameters—the log‐mean μ and the log‐standard deviation σ—providing a simple way to describe the quantity in question.Thus, we use the lognormal as a null hypothesis that p and f occur as the result of many distinct requirements or equivalently that they are influenced by many variable compositional and environmental factors. We note that f involving additional factors does not necessarily imply the σ of f's distribution will be larger because of possible correlations between these factors; for instance, one might expect a higher abundance of detritivores (who consume material that could otherwise contribute to f) in areas where sinking particles are more abundant.The lognormal distribution has enjoyed some utility in ocean ecology and carbon cycling. Some optical variables like chlorophyll appear to be lognormally distributed, and it has been recognized that these distributions could be useful for estimating primary productivity from satellite data (Campbell, 1995). In a terrestrial system Forney and Rothman (2012) showed decomposition rates of organic carbon followed a lognormal distribution. In the surface ocean, ecological abundance data are also lognormal (Luo et al., 2012). When biology is involved, one often encounters the lognormal distribution (Koch, 1966).
Relating f and p by Their Distributions' Moments
Relating f to p via their probability distributions can circumvent the decoupling problem discussed in section 1, because measurements are no longer related pointwise; the question becomes how the distribution of f values is related to the distribution of p values. If we hypothesize that p and f are lognormally distributed, can we describe the parameters of f's distribution as a function of the parameters of p's distribution?As previous studies have posited a scaling relationship between export efficiency e
f and depth‐integrated production
(Laws et al., 2011; Maiti et al., 2013), and a scaling relationship is consistent with p and f being lognormally distributed (supporting information S4), we hypothesize a scaling relationship between f and p, that is,
where C and α are constants, e is the base of the natural logarithm, and the equality holds in a statistical sense. If p is lognormally distributed according to
and f scales with p according to equation (3), this implies that f is lognormally distributed according to
which therefore implies the log‐moments of f and p are related according toA full derivation of the above is given in supporting information S4. Estimates of C and α can thus be used to predict the log‐moments of f (and hence the distribution of f) from those of p. Because μ is a lower‐order moment than σ, it can be estimated more accurately (Flannery et al., 1992); therefore, in section 4.3 we estimate C and α via equation (6).
Materials and Methods
Compilation of p Measurements
The most common in situ measurements of p are those taken by the “14C method,” which form much of the basis of our understanding of p. The method is discussed comprehensively elsewhere (Marra, 2009; Pei & Laws, 2013; Peterson, 1980); in brief, p is inferred from differences in light and dark bottles incubated with isotopically labeled (14C) carbon dioxide. The principal issues discussed in the literature are the interpretation of metabolic terms and the accuracy with which in vitro measurements reflect in situ conditions. Individual sample errors are less concerning when considering logarithmic probability distributions. Additive errors would yield significant measurement artifacts for small p samples but increasingly negligible ones for larger p values; multiplicative errors would affect estimates of σ but not the shape of the probability distribution itself. In any case, as long as natural variability is larger than measurement error, it is a plausible assumption that the statistics of 14C measurements reflect those of p itself.Recently, Buitenhuis et al. (2013) compiled an extensive database of 50,050 measurements of p via the 14C method, spanning the ocean in time, depth, and latitude‐longitude. This provides an excellent database with which to test the lognormal null model. See Buitenhuis et al. (2013) for a complete description of the compilation. We augment the Buitenhuis et al. (2013) compilation with data from the CARIACO time series (available at http://www.imars.usf.edu/cariaco; accessed 7 June 2017) in order to compare with the f data from that time series in section 4.3. To minimize the effect of additive errors, we compare the lognormal distribution to the empirical distribution of p 10% of the distribution's peak height in log‐space, which we determine by a relative residual error analysis (supporting information S5). This operation excludes 16% of the observations where p > 0, but the excluded observations account for <0.01% of the total primary production in the database. For consistency we use the same threshold for all analyses of p data throughout the manuscript. Figure 3 shows the locations of these p measurements. Our results are not sensitive to factor of two changes in this threshold.
Figure 3
Top: Locations of p samples > 0.5 μg C · L−1 · day−1 from the compilation by Buitenhuis et al. (2013); compare to their Figure 2a. The solid black boundary line denotes the separation of data into ocean basins (section 3.2). The time series are indicated by “crosses.” Middle: Locations of f samples >3.6 mg C · m−2 · day−1 from the compilation. The time series are indicated by “crosses.” Bottom: Map of biomes into which p and f data are subregioned; compare to Figure 1 of Banse (1992). Biomes are defined as B1: low‐seasonality, nutrient‐depleted (oligotrophic), B2: low‐seasonality, nutrient‐replete (eutrophic), B3: high‐seasonality (seasonal), using a depth‐integrated net primary production climatology from the Carbon‐based Productivity Model (Westberry et al., 2008); see main text for details.
Top: Locations of p samples > 0.5 μg C · L−1 · day−1 from the compilation by Buitenhuis et al. (2013); compare to their Figure 2a. The solid black boundary line denotes the separation of data into ocean basins (section 3.2). The time series are indicated by “crosses.” Middle: Locations of f samples >3.6 mg C · m−2 · day−1 from the compilation. The time series are indicated by “crosses.” Bottom: Map of biomes into which p and f data are subregioned; compare to Figure 1 of Banse (1992). Biomes are defined as B1: low‐seasonality, nutrient‐depleted (oligotrophic), B2: low‐seasonality, nutrient‐replete (eutrophic), B3: high‐seasonality (seasonal), using a depth‐integrated net primary production climatology from the Carbon‐based Productivity Model (Westberry et al., 2008); see main text for details.
Subregions of p
This total set of p measurements is large enough to subregion in time, depth, and latitude/longitude. We test four ways of coarsely subregioning the p data, noting that many other, more sophisticated subregioning schemes are possible (e.g., Longhurst, 1998). We do so to test for robustness of the lognormal and to address the issue of nonrandom sampling. If the lognormality of the total p distribution is a coincident result of the spatiotemporal structure of p measurements, subregions' p distributions should not be lognormal. We also do this to test whether the log‐moments of the subregions' p distributions sort predictably (section 4).Biome. Following Banse (1992), who defined three ocean “Domains”—respectively low‐seasonality, nutrient depleted (oligotrophic), low‐seasonality, nutrient‐replete (eutrophic), and high‐seasonality (seasonal)—we generate latitude‐longitude subregions by objectively identifying three biomes using a
climatology from the Carbon‐based Productivity Model (Westberry et al., 2008; available at http://www.science.oregonstate.edu/ocean.productivity/standard.product.php). The “Seasonal (B3)” biome includes all locations where the log‐variance of climatological
is >0.3; the “Eutrophic (B2)” biome includes all remaining locations where the log‐mean of climatological
is >6.5; the “Oligotrophic (B1)” biome includes all other locations. Threshold values were chosen to correspond with the “Domains” drawn in Banse (1992); biomes are shown in Figure 3 (cf. Figure 1 of Banse, 1992). Our results are not sensitive to the values of these thresholds.Basin. We generate three latitude‐longitude subregions by splitting the data by basin into Atlantic, Southern Ocean, and Indo‐Pacific (Figure 3).Depth. We generate three‐depth subregions by sorting p values by the depth at which they were sampled and splitting the data evenly into <6, 6–20.5, and >20.5 m (i.e., 6 m is the 33.3rd percentile of the sampling depths, and 20.5 m is the 66.7th percentile).Time. We generate four seasonal subregions by taking all data from the Northern Hemisphere and splitting them into boreal “winter” (DJF), “spring” (MAM), “summer” (JJA), and “autumn” (SON), where, for example, DJF refers to December, January, and February.
Compilation of f Measurements
The most common measurements of f are those using sediment traps (Honjo et al., 2008), which collect sinking material, and those using the “234Th method” (Buesseler et al., 2006), which measures disequilibrium of the particle reactive 234Th isotope as a chemical signature of particulate flux.We compiled a database of 1,770 shallow (≤200 m) POC export flux measurements by trap and 234Th spanning the global ocean; see Figure 3 (supporting information S2). The database comprises 49% 234Th measurements. This database is the largest available compilation of shallow POC export flux measurements of which we are aware. For accessibility, it is organized according to the principles of “tidy data” (Wickham, 2014) and is available via the supporting information. Note that the f compilation is smaller than even any of the subregions of the p compilation because f measurements are far less abundant. Because resolving a distribution with large dynamic range requires a large sample size, we include as many measurements as possible, across two methods and multiple depths (supporting information S2); as with p, we assume that the distribution of measurement values is reflective of the distribution of the process being measured. About one third of our f data comes from three time series—HOT (the Hawaii Ocean Time series; Karl & Lukas, 1996), BATS (the Bermuda Atlantic Time series Study; Michaels & Knap, 1996), and CARIACO (CArbon Retention In A Colored Ocean; http://imars.usf.edu/cariaco; Thurnell, 2013)—and a large number of p samples (>1,000) are also available from each of these time series; we therefore exclude these from the global distribution in order to make an out‐of‐sample estimate for C and α in section 4.3.To minimize the effect of additive errors and to analyze both p and f data the same way, we compare the lognormal distribution to the empirical distribution of f measurements above a noise threshold of 10% of peak height in log‐space, or 3.6 mg C · m−2 · day−1. As with the p data, for consistency we use the same noise threshold for all analyses of f data throughout the manuscript, and our results were not sensitive to factor of two changes in this threshold.Figure 3 indicates that many of the f samples are taken at distinct locations from the p samples. As we are comparing collections of measurements, our approach requires the assumption that the variability sampled by a collection of measurements is large compared to the bias generated by those measurements' spatiotemporal structure. This assumption is supported for large‐scale sets of measurements by the robustness of p's lognormality (section 4.1) but means that a lognormal is not expected to emerge when this assumption breaks down, for example, when looking at measurements from a single location or when sampling across boundary currents and other areas with large physical gradients.
Statistical Methods
In total we have 14 sets of p measurements and one set of f measurements. We fit a lognormal distribution to each of these, estimating the log‐moments by minimizing Kuiper's statistic, which measures deviations between the hypothesized lognormal distribution and the empirical cumulative distribution function (CDF) of the measurements. Kuiper's statistic is preferred in many applications because it balances ease of interpretation with sensitivity to tails (Flannery et al., 1992), though our results are not sensitive to this choice of statistic as compared to the Kolmogorov‐Smirnov and Anderson‐Darling statistics (supporting information S6).We estimate uncertainties in the estimated moments by bootstrap sampling the p and f subsets 10,000 times and repeating the procedure in each case (Efron, 1979). Note that this procedure underestimates uncertainty as it only accounts for statistical errors. Systematic uncertainty including measurement errors, interpretation errors, and nonrandomness of the sampling are not accounted for. To be conservative we use a uniform uncertainty of 0.04 for μ
and 0.03 for σ
corresponding to the largest 95% bootstrap confidence interval across all subsets that of the winter data.
Results
p and f Are Lognormally Distributed
We find that the lognormal null model is a good description of the distribution for p globally; Figure 4 shows good correspondence between the data and hypothesized distribution. We find moments of μ
=2.23 ± 0.04 and σ
=1.63 ± 0.03. We also find that the lognormal null model is a good description of the distribution for f globally; see Figure 4. We find moments of μ
=3.96 ± 0.07 and σ
=1.20 ± 0.05. While the distribution's fit to the f data is less visually compelling, this is more than accounted for by sample size (supporting information S6).
Figure 4
Top: PDF of global p data versus lognormal fit. n = 38,334 refers to the number of samples included in the PDF, which excludes data from HOT, BATS, CARIACO, as well as data below the 10% of peak height threshold <0.5 μg C · L−1 · day−1. Green curve is the probability density function of the lognormal which minimizes the Kuiper statistic as compared to the data. Bottom: PDF of global f data versus lognormal fit. n = 1,033 refers to the number of samples included in the PDF, which excludes data from HOT, BATS, and CARIACO time series, as well as data below the 10% of peak height threshold of 3.6 mg C · m−2 · day−1. Green curve is the probability density function of the lognormal which minimizes the Kuiper statistic as compared to the data. HOT = Hawaii Ocean Time; BATS = Bermuda Atlantic Time series Study; CARIACO = CArbon Retention In A Colored Ocean time series; PDF = probability density function.
Top: PDF of global p data versus lognormal fit. n = 38,334 refers to the number of samples included in the PDF, which excludes data from HOT, BATS, CARIACO, as well as data below the 10% of peak height threshold <0.5 μg C · L−1 · day−1. Green curve is the probability density function of the lognormal which minimizes the Kuiper statistic as compared to the data. Bottom: PDF of global f data versus lognormal fit. n = 1,033 refers to the number of samples included in the PDF, which excludes data from HOT, BATS, and CARIACO time series, as well as data below the 10% of peak height threshold of 3.6 mg C · m−2 · day−1. Green curve is the probability density function of the lognormal which minimizes the Kuiper statistic as compared to the data. HOT = Hawaii Ocean Time; BATS = Bermuda Atlantic Time series Study; CARIACO = CArbon Retention In A Colored Ocean time series; PDF = probability density function.Additionally, we find that for the subregions' p distributions, the lognormal model is a good description of the distributions; see Figure 5. We find μ
ranges from 1.13 to 2.86 (±0.04), and σ
ranges from 1.20 to 1.69 (±0.03). The robustness of the lognormal across the various subregions strongly supports its applicability to modeling p from a large‐scale perspective and indicates that the lognormality of p is robust to the spatiotemporal structure of measurements.
Figure 5
Probability density functions (PDFs) of p data for different subregions described in the text. In each case the green curves are the PDF of the lognormal which minimizes the Kuiper statistic as compared to the data. (μ,σ) is plotted in Figure 6.
Probability density functions (PDFs) of p data for different subregions described in the text. In each case the green curves are the PDF of the lognormal which minimizes the Kuiper statistic as compared to the data. (μ,σ) is plotted in Figure 6.
Figure 6
Log‐moments for each of the p subregions, estimated as described in the main text. Error bars are 95% confidence intervals estimated from bootstrapping; contours are lines of constant mean for a lognormal distribution, equal to
.
Log‐Moments of Subregions' p Distributions Sort Predictably
Figure 6 shows the μ
and σ
of the lognormal distributions shown in Figure 5. Because of the additional uncertainties mentioned at the end of section 3.4, we restrict ourselves to the sorting of moments across subregions. Variation in σ
is 7% of the variation in μ
across subregions, and σ
is a higher‐order moment and therefore more difficult to estimate accurately, so we focus on variation in μ
.Log‐moments for each of the p subregions, estimated as described in the main text. Error bars are 95% confidence intervals estimated from bootstrapping; contours are lines of constant mean for a lognormal distribution, equal to
.The ordering of μ
across subregions is sensible in terms of light and nutrient availability. μ
is lower in the oligotrophic biome (B1) than in the eutrophic and seasonal biomes (B2 and B3), whose μ
is not significantly different. μ
is lower in the Indo‐Pacific, the basin having the largest proportion of low‐productivity waters (Williams & Follows, 2011), than in the Atlantic or Southern Ocean. The large μ
for the Southern Ocean reflects that the Southern Ocean p data are dominantly sampled from, and therefore only representative of, high‐productivity times of year in that basin. μ
is highest in the shallowest depth subregion, where light is highest, and is lowest in the deepest subregion, where light is lowest. μ
is higher in spring and summer, when the Northern Hemisphere experiences more sunlight.The lognormality of p has implications for the interpretation of variability in p measurements, because mean and variance are controlled both by σ and μ. Note the contours in Figure 6; the mean (n.b. not the log‐mean) of a lognormal distribution is given by
, so the two moments can compensate to achieve the same mean, for example, the mean for the Atlantic (μ
,σ
) is closer to the mean for the Southern Ocean (μ
,σ
) than the individual moments are; the same goes for summer and autumn. Variance (n.b. not the log‐variance) of a lognormal is given by
, so two distributions with the same σ can have different variances; for example, σ
=1.50 ± 0.03 for both the Indo‐Pacific and the Southern Ocean, but because μ
is larger for the Southern Ocean, its variance is also higher, the difference driven by μ.The sorting of σ
is also consistent with intuition. σ
is largest in the seasonal biome (Biome 3) and lowest in the oligotrophic biome (Biome 2). σ
is largest in the Atlantic, the basin with the most pronounced seasonality (Silsbe et al., 2016). σ
is larger in boreal autumn and spring, when broadly speaking the Northern Hemisphere's oceans experience larger changes, than in boreal winter and summer. However, because understanding variability in log‐space is no trivial matter, and these variations are small compared to those in μ
, we caution against making too much of the sorting of σ
barring a more quantitative theory for these moments (section 5).
f Scales Sublinearly With p
That f and p are well described by lognormal distributions implies the statistical scaling relationship f = e
p
(section 2.2). This suggests that estimates of C and α should allow one to predict μ
from μ
(via equation (6) and σ
from σ
(via equation (7). We use two procedures to estimate C and α based on equation (6) and the μ's of independent data collections —the biomes and the time series—then compare these estimates and evaluate (i) their ability to predict μ
from μ
globally and (ii) whether the estimates for α are consistent with those for σ
and σ
.First, we subregion the global f data into the three biomes described in section 3.2. These data have the advantage of being parsed into and sampled across objective biogeographic regions but are not colocalized with the p data from the same biomes and are not independent of the global p and f data. We take the (μ
,σ
) values from Figure 6, and then estimate (μ
,σ
) by the log‐mean and log‐standard deviation of the f data (above the 10% peak height threshold) rather than by fitting a lognormal distribution as for Figures 4 and 5, though our results are not sensitive to this choice. We then estimate uncertainty in each log‐moment for each biome by bootstrap sampling as in section 3.4, repeating the procedure 10,000 times and using the 95% bootstrap confidence interval as an uncertainty estimate. We then estimate C and α by fitting equation (6) to the three μ
, μ
pairs from the three biomes. We use type II Least Squares Cubic regression (York, 1966) of μ
on μ
because each moment estimate has a different, nonnegligible uncertainty. We use the subscripts C
and α
to indicate the estimates derived from the biomes' data.Second, we make an out‐of‐sample estimate with the three time series' data excluded from the global distributions, which provide an alternative set of μ
, μ
pairs to fit C and α. These data have the advantage of being colocalized, and independent of the global p and f data so as not to confound the prediction, but are sampled only across three total locations. As above, we estimate (μ
,σ
,μ
,σ
) by the log‐moments of the time series' data, estimate uncertainty via bootstrap sampling, and then make estimates C
and α
from the time series' μ's and type II Least Squares Cubic Regression in equation (6).Figure 7 shows the histograms of the time series' p and f data and the biomes' f data. We note that the time series' p distributions are visibly (and quantitatively [supporting information S6]) not lognormal, consistent with the idea that the lognormal holds only for large‐scale variability, that is, as one considers larger and/or more variable spatiotemporal regions and therefore samples over more variability in the processes influencing p and f (section 5). The f data in Figure 7 are consistent with a lognormal distribution, though the statistical power to compare each collection of data with a hypothesized distribution is limited due to sample size (supporting information S6).
Figure 7
Histograms of time series' p and f data and biomes' f data. (μ, σ) is plotted in Figure 8; p data for the biomes are plotted in Figure 5.
Histograms of time series' p and f data and biomes' f data. (μ, σ) is plotted in Figure 8; p data for the biomes are plotted in Figure 5.
Figure 8
μ
versus μ
for the three time series, the three biomes, and the global ocean. Dotted blue line is the estimated line from the regression on the biomes (μ
=α
μ
+C
); dotted green line is the estimated line from the regression on the time series (μ
=α
μ
+C
); parameter estimates from the regressions are reported in the top‐left of the figure, along with their estimated standard errors. Solid black lines represent 95% bootstrap confidence intervals. The global distributions' log‐means (yellow) are not included in the regressions that produce the dashed lines. Inset: σ
plotted versus σ
for the three time series, the three biomes, and the global ocean. Dashed line is the estimate σ
=α
σ
; gray shading corresponds to α
's standard error. None of the σ's are included in the regressions that produce the dashed line. HOT = Hawaii Ocean Time; BATS = Bermuda Atlantic Time series Study; CARIACO = CArbon Retention In A Colored Ocean time series.
Nonetheless, as each data set in Figure 7 is unimodally distributed in log‐space and we focus here on the first moment, it is sensible to approximate each distribution with a lognormal, estimating μ and σ for each by the data's log‐mean and log‐standard deviation. Probability distributions are defined by the series of their moments, and discrepancies from this approximation will arise only for higher‐order moments. That this estimation procedure ultimately results in a good prediction, despite the time series' distributions p data not being lognormally distributed, evinces the utility and robustness of a distributional approach.Figure 8 shows the results of the regressions, which find α
=0.64 ± 0.13,α
=0.65 ± 0.14,C
=2.47 ± 0.17, and C
=2.59± 0.30 (where here the ± is the standard error as estimated by the regression, not a 95% confidence interval). Thus, the parameter estimates are in good agreement. The biomes' parameter estimates result in a prediction for the global μ
=C
+α
μ
=4.04 ± 0.43, and the time series' parameter estimates result in a prediction of μ
=C
+α
μ
=3.90 ± 0.34 (where the ± is the standard error from propagating uncertainty in α and C). Both of these compare well with the estimated value for the global distributions' μ
=3.96 ± 0.07.μ
versus μ
for the three time series, the three biomes, and the global ocean. Dotted blue line is the estimated line from the regression on the biomes (μ
=α
μ
+C
); dotted green line is the estimated line from the regression on the time series (μ
=α
μ
+C
); parameter estimates from the regressions are reported in the top‐left of the figure, along with their estimated standard errors. Solid black lines represent 95% bootstrap confidence intervals. The global distributions' log‐means (yellow) are not included in the regressions that produce the dashed lines. Inset: σ
plotted versus σ
for the three time series, the three biomes, and the global ocean. Dashed line is the estimate σ
=α
σ
; gray shading corresponds to α
's standard error. None of the σ's are included in the regressions that produce the dashed line. HOT = Hawaii Ocean Time; BATS = Bermuda Atlantic Time series Study; CARIACO = CArbon Retention In A Colored Ocean time series.The inset of Figure 8 shows the results of the second prediction—whether σ
=α
σ
for the time series, the biomes, and globally. For the global distributions, the eutrophic and seasonal biomes (B2 and B3), BATS, and CARIACO, σ
/σ
is within 1 standard error of both α
and α
; σ
/σ
for HOT is less than α
and α
by more than 1 standard error, and σ
/σ
for the oligotrophic biome (B1) is more than α
by more than 1 standard error. As the σ values are not used to estimate the α's, and the σ
values estimated from the biomes' and the time series' data have larger uncertainty, this correspondence is quite good, corroborating that f ∼ p
and therefore that μ
and σ
are predictable from μ
and σ
. Variation in σ
/σ
could be due to biogeochemical differences between time series, to measurement uncertainties, or to other factors.
Discussion
What Are the Uses and Requirements of a More Complete Theory?
As stated previously, the objective herein is only to develop the foundations of a distributional theory, as a more complete theory involves a good deal more sophistication and would be benefitted by more measurements than are available at present, particularly for f. How to subdivide the ocean is a key question. We have only done so coarsely here as proof of concept and to test the robustness of the lognormal to subregioning. The refinement of this theoretical framework requires first developing a more considered subregioning of the ocean in space and time, that is, determining an answer to the ubiquitous question in oceanography of what spatiotemporal region a given set of observations represents.A predictive theory for the subregions μ
and σ
must also be developed. We have only attempted to demonstrate that sorting of the μ and σ of subregions' p distributions is consistent with oceanographic intuition; we have not developed a quantitative theory for either. p being lognormally distributed implies exp(μ
) = median(p), so it may be possible to describe μ
in terms of that which controls median productivity; σ
may possibly be described in terms of the variances of the subprocesses distributions. As the lognormal distribution is controlled by these two parameters, a complete distributional theory for p requires the ability to predict these two parameters, that is, how the distribution shifts in response to the environment. Such a refinement could constitute a highly simplified yet accurate description of p globally.Due to its relative sparsity, subregioning f data substantially reduces the statistical power one has to test f's lognormality. While an f database of comparable size and methodological consistency to that for p compiled by Buitenhuis et al. (2013) is far from reach, a large number of additional measurements are expected to be available in the near future from EXPORTS (Siegel et al., 2016) and other programs, which may be leveraged to evaluate the robustness of f's lognormality to subregioning and the consistency of the scaling relationship inferred here. A distinct limitation of the distributional approach presented here is that it requires a large number of measurements.Finally, given that the two procedures for estimating C and α result in very similar values despite using distinct data, this raises the interesting question of what physical, chemical, and biological processes might control these parameters.
Can the Probabilistic Approach Developed Here Be Used to Test Models?
Satellite ocean color models and numerical biogeochemical models of p and f do not contain spatiotemporal information explicitly—if they use such information, it is to determine some other variable, for example, solar irradiance. In principle, the probability distribution of a variable contains all of the nonspatiotemporal information about that variable. Thus, the strictest possible nonspatiotemporal criterion by which one can evaluate numerical biogeochemical models of p or f is by evaluating the probability distributions they generate. Rather than comparing model values and data measurements pointwise, one can sample a model at the same places and times as a data set and compare the resulting probability distributions. For models of p, the null model described herein can be utilized as a strict test. From model output, (i) objectively generate regions based on biogeochemically relevant parameters, (ii) determine the probability distributions of production (and export) in these regions, and (iii) analyze the forms of the distributions and the behavior of their log‐moments. Do spatiotemporal subregions of the model produce lognormal distributions, and do the moments of these subregions sort predictably? Such approaches are underutilized in Earth sciences and may be especially useful in oceanic contexts because slight misfits in the locations of fronts and other features would not be overpenalized. In this manuscript we have not analyzed measurements of depth‐integrated production, the quantity of most interest for satellite ocean color algorithms (Behrenfeld & Falkowski, 1997; Silsbe et al., 2016; Westberry et al., 2008); depending on how a given algorithm incorporates depth integration, comparison may require augmenting the lognormal null model to account for integration in depth. Nonetheless, it appears that a distributional approach could be useful not only for the analysis of p and f measurements but also as a more rigorous methodology of model‐data comparison.
What About the Spatiotemporal Structure of the Surface Ocean?
Although the processes underlying production and flux suggest a lognormal distribution, the spatiotemporal structure of the surface ocean suggests otherwise. If p and f were not at all stochastic quantities and were entirely determined by the mean state when and where they were measured with negligible variation, or if p and f measurements were strongly correlated in space or time, variability would not necessarily be lognormal; the observed distribution of p and f would then instead be wholly determined by the places and times at which those measurements were made. Even if p and f values in the ocean were truly lognormal, available measurements of p and f are far from a truly random sample (Figure 3), which could affect their distribution similarly. However, given sufficient underlying complexity and variability in the processes that result in p or f, a lognormal distribution will emerge; μ and σ of that distribution will then be reflective of the sample locations and times, for example, p data from higher productivity areas will have a larger μ. We therefore consider the lognormal distribution to characterize large‐scale variability in p and f, to hold when the spatiotemporal region being sampled contains enough variability in the factors p and f that a lognormal distribution emerges. The robustness of the lognormal distribution across subsets of p measurements, despite large spatial patterns in p across the ocean and the highly nonrandom nature of global p sampling, suggests that substantial multiplicative variability is a fundamental characteristic of p.
Conclusion
Understanding the relationships between biogeochemical fluxes like primary production and export is difficult because there are many distinct subprocesses which play a role. Under some circumstances systems become complex enough that they can become simple again if viewed through their probability distributions rather than sets of colocalized measurements. We suggest that this is the case for primary production and export. We demonstrate that measurements of these two quantities are lognormally distributed and that the moments of the distributions are a linearly related to one another. With additional data and a clearer understanding of how to define biogeochemical provinces we believe this approach will help us build a mechanistic understanding of these, and other biogeochemical fluxes in the sea.Supporting Information S1Click here for additional data file.Supporting Information S2Click here for additional data file.Data Set S1Click here for additional data file.
Authors: Chris M Marsay; Richard J Sanders; Stephanie A Henson; Katsiaryna Pabortsava; Eric P Achterberg; Richard S Lampitt Journal: Proc Natl Acad Sci U S A Date: 2015-01-05 Impact factor: 11.205
Authors: William M Durham; Eric Climent; Michael Barry; Filippo De Lillo; Guido Boffetta; Massimo Cencini; Roman Stocker Journal: Nat Commun Date: 2013 Impact factor: 14.919
Authors: Rachel E Szabo; Sammy Pontrelli; Jacopo Grilli; Julia A Schwartzman; Shaul Pollak; Uwe Sauer; Otto X Cordero Journal: Proc Natl Acad Sci U S A Date: 2022-07-21 Impact factor: 12.779
Authors: Sangwon Hyun; Aditya Mishra; Christopher L Follett; Bror Jonsson; Gemma Kulk; Gael Forget; Marie-Fanny Racault; Thomas Jackson; Stephanie Dutkiewicz; Christian L Müller; Jacob Bien Journal: Proc Math Phys Eng Sci Date: 2022-06-22 Impact factor: 3.213