Literature DB >> 36161961

Divisive normalization is an efficient code for multivariate Pareto-distributed environments.

Stefan F Bucher^1,2,3, Adam M Brandenburger^4,5,6.

Abstract

Divisive normalization is a canonical computation in the brain, observed across neural systems, that is often considered to be an implementation of the efficient coding principle. We provide a theoretical result that makes the conditions under which divisive normalization is an efficient code analytically precise: We show that, in a low-noise regime, encoding an n-dimensional stimulus via divisive normalization is efficient if and only if its prevalence in the environment is described by a multivariate Pareto distribution. We generalize this multivariate analog of histogram equalization to allow for arbitrary metabolic costs of the representation, and show how different assumptions on costs are associated with different shapes of the distributions that divisive normalization efficiently encodes. Our result suggests that divisive normalization may have evolved to efficiently represent stimuli with Pareto distributions. We demonstrate that this efficiently encoded distribution is consistent with stylized features of naturalistic stimulus distributions such as their characteristic conditional variance dependence, and we provide empirical evidence suggesting that it may capture the statistics of filter responses to naturalistic images. Our theoretical finding also yields empirically testable predictions across sensory domains on how the divisive normalization parameters should be tuned to features of the input distribution.

Entities: Chemical

Keywords: Pareto distribution; divisive normalization; efficient coding; histogram equalization; natural stimulus statistics

Mesh：

Year: 2022 PMID： 36161961 PMCID： PMC9546555 DOI： 10.1073/pnas.2120581119

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN： 0027-8424 Impact factor: 12.779

The brain has to make efficient use of its limited resources to represent and respond to the wide range of stimuli in its environment. An important mechanism by which this can be achieved is divisive normalization (1, 2), which is thought to be a canonical computation in the brain (3). This gain control mechanism (according to which the response of a neuron to its preferred stimulus is suppressed by the intensity of nonpreferred stimuli) permits the representation of potentially unbounded stimuli by biophysically feasible bounded firing rates. Originally proposed for individual neurons in the primary visual cortex (1, 4, 5), this computation has since also been observed at the population level in the primary visual cortex (6–8) and throughout the visual hierarchy (9, 10), as well as in several other neural systems including olfactory pathways (11), the middle temporal area (12, 13), the inferotemporal cortex (14), the hippocampus (15), and in multisensory integration (16). In addition, divisive normalization has been shown to play an important role in value representations (17, 18) and for choice behavior, where it has been proposed to account for violations of the independence of irrelevant alternatives (IIA) axiom of rational choice (19–23; but see refs. 24 and 25). The nonlinear computation has also been suggested to play a role in attentional modulation (12, 26, 27), the modulation of response variability (28), the representation of visual uncertainty (29), and probabilistic inference (30, 31). It is further used in neural network models of the visual system (32, 33) as well as in computer vision and image compression (34). This ubiquitous array of functions begs the question of what overarching objective the divisive normalization computation achieves. In this paper, we consider this computation’s information-theoretic properties and provide testable conditions for its efficiency that are both simple and general, making them applicable across many of the aforementioned settings. Since Schwartz and Simoncelli (35) showed empirically that divisive normalization reduces the statistical redundancy present in natural images, a common answer (36) has been that divisive normalization is an implementation of the efficient coding principle (37–41). This principle has been central to our understanding of the visual and other sensory systems (42–44), and it has also provided an account of biases in perception (45) and choice (46–49). Of course, divisive normalization has benefits beyond coding efficiency and redundancy reduction, such as permitting tuning curves that are invariant with respect to “nuisance” dimensions (e.g., maintaining discriminability of orientations regardless of contrast) or ensuring that population responses are easily decodable (e.g., by a linear classifier or winner-take-all competition), among other features (3). Its widespread implementation in the nervous system may thus simultaneously achieve a number of purposes. Here, we focus on the question of whether divisive normalization is indeed an efficient computation, which arises naturally in both the sensory and choice domains. Despite significant progress (50), an answer to this question in terms of testable conditions for efficiency has remained elusive, since formally relating neural computations to stimulus distributions has proved difficult: “The establishment of a precise quantitative relationship between environmental statistics and neural processing is important [but] it has been surprisingly difficult to make the link quantitatively precise [and] specification of a probability distribution over the space of input signals is a difficult problem in its own right” (ref. 51, p. 1194). We close this gap with a theoretical result that makes precise the conditions on the input distribution under which divisive normalization encodes a stimulus efficiently. Existing analytical work in the domain of vision has demonstrated that divisive normalization approximately (but not entirely) removes the statistical dependence in models of filter responses to natural images (52–54) such as the conditional normal (35) or lognormal (55) distributions. Moreover, divisive normalization can be viewed as an approximation of the nonlinear radial Gaussianization transformation that removes the statistical dependence of non-Gaussian elliptically symmetric distributions (56), yet divisive normalization itself can do so only imperfectly owing to its bounded range (57, 58). Lyu (50, 59) has quantified the extent to which divisive normalization reduces the statistical dependence of one such elliptical distribution: the multivariate Student’s t distribution, which is in the class of Gaussian scale mixture models of natural images (60). He showed that even though divisive normalization approximates the transformation that eliminates this model’s statistical dependencies, it can also increase them in low-dimensional settings. This literature typically assumes a model of empirical stimulus statistics and derives the predictions of an (approximately) optimal code. It thus represents the first of two common approaches for testing the efficient coding hypothesis (51). The second approach is to examine the statistics of actual neural responses to naturalistic stimuli, in the spirit of Laughlin (40). Here, we pursue instead a third approach that consists of deriving analytically what stimulus distribution a given computation efficiently represents. This is in contrast to Malo and Laparra (54) or Lyu (50), for example, who use similar techniques but who start by assuming a given model of stimulus statistics. Instead, our approach is similar in spirit to that of Ballé et al. (61), who obtain a density model on images by inverting a generalized divisive normalization transform, except that we obtain the input density in analytical closed form. Without making any a priori assumptions about the stimulus statistics, the input distribution we find to be efficiently encoded captures many important features of naturalistic stimulus statistics, as we demonstrate in an analysis of image statistics. Our approach thus provides an additional perspective on the efficiency properties of divisive normalization. We consider a setting in which an n-dimensional input is to be encoded by the divisively normalized firing rates of n neurons. The input can be either a stimulus or a representation of a stimulus coming from another neural system upstream. In the context of visual stimuli, the multivariate input could arise, for instance, as the responses of a population of linear filters convolved with the stimulus (35). At least two conditions have to be satisfied for the resulting multivariate representation to be efficient in a low-noise regime. First, it ought to adhere to histogram equalization (40) along each input dimension, which ensures that each output is used equally often. Second, maximizing the Shannon entropy of the output distribution requires—in the absence of constraints—that any statistical dependence across dimensions be removed. We use a formulation of the efficient coding principle that, in a low-noise regime, implies both of these desiderata and thus gives rise to a multivariate analog of the classic criterion of histogram equalization. Specifically, we consider a neural code to be efficient if and only if it maximizes the Shannon mutual information (62, 63) between the n-dimensional input and its representation. Since, for sufficiently small noise, this criterion can be approximated arbitrarily well by the requirement that the output distribution is entropy-maximizing, divisive normalization is then efficient whenever it transforms the input distribution into an output distribution that is uniform over the range of values divisive normalization can attain (39, 64).* This allows us to characterize the class of input distributions that are efficiently encoded in a low-noise regime. We prove that divisive normalization maximizes the entropy of the output distribution if and only if the distribution of inputs in the environment is multivariate Pareto. This suggests that divisive normalization may have evolved as an efficient encoding strategy for heavy-tailed, scale-invariant power-law distributions of the kind that occur in many ecological contexts (66); see also Discussion. The statistical dependence in the multivariate Pareto distribution is also consistent with the conditional variance dependence observed in natural image statistics (35), and it has a representation as a Gamma mixture of independent exponential random variables (providing a link to ref. 60). In an empirical analysis of naturalistic images, we demonstrate that the efficiently encoded Pareto distribution indeed captures the statistics of filter responses to natural images just as well as a common model of natural image statistics does. Divisive normalization may thus be an adaptation, in evolution or development, to various natural contexts with physical quantities whose distributions are characterized by heavy-tailed marginals and an empirically important form of statistical dependence. We generalize our result by allowing for a representation to come at an arbitrary metabolic cost, which affects the shape of an efficient code (67, 68), and we show how this impacts the optimally encoded input distribution. For example, if costs are linear in the total number of spikes (which constrains the average firing rate), then the entropy-maximizing output distribution is exponential, and the associated input distribution changes accordingly. We provide necessary and sufficient conditions on the stimulus distribution for divisive normalization to be efficient under any member of a large family of cost functions. Beyond providing a testable prediction on the shape of stimulus distributions that divisive normalization efficiently encodes, our theoretical result also yields empirically testable predictions across sensory domains on how the parameters of the divisive normalization transformation should be tuned to the parameters of the stimulus distribution (35). Specifically, the power index in the normalization function matches the shape parameter of the stimulus distribution, while the normalization weights are the inverses of the scale parameters of the stimulus distribution. Our theoretical predictions thus open the door to systematic experimental tests of the efficiency properties of empirically observed divisive normalization.

Results

We consider a multivariate stimulus (or input from another neural system) modeled as an n-dimensional random vector taking values , where, for all i, for some . The distribution of this stimulus is described by a continuous probability density function (pdf) . The support of this density is semi-infinite, bounded below by the constant , so that shifting the stimulus by gives rise to the positive input . This input is encoded by a population of n neurons whose (mean) firing rates are given by the divisive normalization function (3)for and finite parameters , b > 0, and . The encoding gives rise to an n-dimensional representation. The parameter γ is often interpreted as the maximal firing rate, the parameter b as a semisaturation constant, and λ are normalization weights. One possible interpretation of α in the context of vision neuroscience is that models the activation resulting from a linear filter response that is then normalized. While numerous functional forms have been proposed to describe divisive normalization (59, 69), our formulation generalizes the normalization equation in Carandini and Heeger (ref. 3, equation ) to include different weights λ in the normalization pool. For analytical tractability, we assume that these normalization weights can differ across input dimensions but not across output dimensions.

Constraints and Metabolic Costs.

To assess the efficiency of the divisive normalization transform, we have to specify the class of permissible codes with which to compare it, as well as the efficiency criterion. We start by defining the former. Specifically, we compare divisive normalization with any encoding whose codomain is restricted toso that the output resulting from any feasible input respects an upper bound γ on the -weighted sum of firing rates . In addition to this constraint, we allow for a metabolic cost of a representation (70–72). We assume that a representation comes at a cost , where c is an arbitrary cost function that is continuous and bounded on Δ. The constraint is motivated by the fact that the range of values attained by divisive normalization is bounded by a linear constraint. The following proposition shows that Δ is exactly the range of values the divisive normalization function (Eq. ) attains: The divisive normalization function is invertible and its image is given by the simplex Δ. The proof of is based on an application of the matrix determinant lemma (e.g., ref. 73, theorem 18.1.1); it is given in along with all other proofs. Note that the bound γ also implies an upper bound on the firing rate of each individual neuron, but not all neurons can simultaneously spike at their maximal firing rate. We take this constraint, which may arise from physiological limitations, as given and thus evaluate the efficiency of divisive normalization relative to the class of encodings that respect this same upper bound on the (weighted) sum of (nonnegative) firing rates. Note that a bounded codomain is not only empirically plausible, but also mathematically necessary for considering “lossy compression,” since an unbounded range of firing rates amounts to perfect coding capacity. Moreover, efficiency among all encodings respecting the constraint Δ is a necessary condition for efficiency among an even larger class of bounded representations with which divisive normalization could potentially be compared, because divisive normalization does adhere to the constraint Δ. Whether neural codes with image Δ are efficient even among those with more general representations is beyond the scope of this paper, but an interesting question for future research. Imposing suitable additional structure on the metabolic cost function, for example, would likely result in such a constraint.

Efficiency Criterion.

To state the conditions under which a neural code is efficient, we employ a formalization of the efficient coding hypothesis whose criterion is based on the mutual information between the input and its noisy representationwhere the additive noise is a random vector in and is independent of . (The divisive normalization function is one example of an encoding , so that the noise applies to the output of the transformation.) We assume a low-noise regime in which the entropy of the noise is sufficiently small that the mutual information between the input and its noisy representation can be approximated by the entropy of the distribution of outputs . Given independent random vectors and in and a map , the mutual information can be approximated arbitrarily closely by the entropy , given sufficiently small entropy . According to this result, the efficient coding hypothesis amounts to maximization of the entropy of the output distribution (74, 75) net of metabolic costs. We thus consider a stimulus encoding to be efficient if the resulting output distribution maximizes, among all distributions with support equal to Δ, the entropy of net of the expected cost .

Characterizing the Optimal Output Distribution.

Ignoring metabolic costs, our efficiency criterion amounts to a multivariate version of histogram equalization and requires that all feasible outputs occur equally often. Among all distributions attaining the same range of values as divisive normalization, the entropy-maximizing distribution of mean firing rates is uniform over its simplex support Δ. Note that, constrained to such a support, the uniform distribution is not statistically independent across dimensions (as would be the case in absence of constraints), yet this distribution does result from the optimal encoding. The following proposition (cf. 76) characterizes the optimal output distribution taking into account any metabolic costs, allowing for potentially more empirically realistic firing-rate distributions. Fix a random vector (the representation) with bounded support in and a continuous and bounded cost function . The pdf of the distribution that maximizes the entropy of net of the expected cost, i.e., that maximizes , among all distributions with support , is given byfor in . For example, assuming that the cost of a representation is linear in the sum of all firing rates gives rise to a truncated exponential output distribution , which is in line with empirically observed firing rates in response to natural scenes (77, 78). Using high firing rates less frequently, it optimally balances the informational benefit of a wide range of firing rates with the cost of relying on metabolically costly high firing rates. Other special cases of interest include costs that take the quadratic form , in which case is truncated normal on , and costs that are constant in , in which case is uniform on .

Characterizing the Efficiently Encoded Input Distribution.

By mapping inputs to their representations , divisive normalization transforms any stimulus distribution into a corresponding distribution of outputs or firing rates. Our main mathematical result describes this transformation by relating any stimulus distribution to the output distribution that results under divisive normalization and vice versa. Let in be a random vector with continuous pdf and be the divisive normalization mapping. Let and denote the output of the divisive normalization transformation, with pdf . Then the two pdfs satisfy, for any , The proof uses the change-of-variables formula for random vectors (e.g., ref. 79, theorem 8.1.7) and again relies on the matrix determinant lemma to compute the determinant of the Jacobian of in closed analytical form. Given and the shape of the optimal output distribution provided by , lets us characterize, for any metabolic cost function, the stimulus distribution for which divisive normalization is an efficient encoding in a low-noise regime. Fig. 1 illustrates this for the univariate special case (n = 1) absent metabolic costs. Histogram equalization on requires that the output distribution is uniform over this range. A univariate Pareto distribution with pdf is thus mapped into an entropy-maximizing uniform distribution by a transformation whose derivative is as in (with and ), which indeed integrates to the univariate divisive normalization formula .

Fig. 1.

Univariate example of how divisive normalization transforms a stimulus distribution into a distribution over outputs. If a stimulus s follows a Pareto distribution (Left column), with shape parameter β (and μ = 0, σ = 1), the divisive normalization transform of x = s (Center column), with appropriately chosen parameter α, ensures that the resulting representation follows the entropy-maximizing uniform distribution (Right column). For the general case, the following result states the condition on the stimulus distribution for divisive normalization to maximize, among all distributions with support equal to Δ, the entropy of the resulting output distribution net of expected costs: Divisive normalization of is an efficient encoding of a stimulus for a cost function c if and only if 1) the stimulus distribution has joint pdf, for ,and 2) the parameters satisfy and . Note that the efficiently encoded distribution depends on the metabolic cost function c and also on the parameters of the divisive normalization transform. The efficiently encoded input distribution shares the parameters γ and b with the divisive normalization transform. Additionally, efficiency requires that the exponent α of the divisive normalization transform matches the shape parameter of the input distribution and that the normalization weights λ are inversely proportional to the scale parameters . This reflects the fact that scaling inputs requires adjusting normalization weights to maintain the efficiency condition of . Of course, other parameterizations of this distribution are possible, but the scale parameters σ reflect the extent to which λ and b are exchangeable, and they will be helpful below. Fig. 2 illustrates for the bivariate case and metabolic costs that are constant (Fig. 2) or linear (Fig. 2) in the sum of the firing rates of all neurons. Note that the input distribution that divisive normalization efficiently encodes has a higher density for small values for linear metabolic costs than it does absent metabolic costs. This reflects the fact that, in the presence of metabolic costs, higher firing rates should be used less frequently.

Fig. 2.

Divisive normalization transforms the distribution of stimulus into a distribution over outputs . In this bivariate example with , b = 1, and two values of α and β, joint probability densities are plotted with darker color representing higher density. Marginal densities are shown in the adjacent plots. The Pareto distributions in A are transformed into the uniform distribution over the simplex Δ (black triangle) that is efficient absent metabolic costs (). B shows the qualitatively similar stimulus distributions that are transformed into the truncated exponential distribution that is efficient under linear metabolic costs ().

Constant Metabolic Costs.

An important special case of interest is constant metabolic costs, since the problem then reduces to maximizing information transmission, which is a common formulation of the efficient coding hypothesis. states the result for this special case, which makes a particularly sharp prediction that can be viewed as a benchmark generalized by . Moreover, Fig. 2 demonstrates that the qualitative shape of the efficiently encoded stimulus distribution appears to be robust to the exact assumption on metabolic costs. Recall that the joint survival (or complementary cumulative distribution) function of a random vector is defined as Divisive normalization of is an efficient encoding of a stimulus under constant metabolic costs if and only if 1) the stimulus distribution is a multivariate Pareto type III distribution with joint survival functionand joint pdffor , and 2) the parameters satisfy and . is a corollary of for the special case of constant (or zero) metabolic costs. According to , which is illustrated for the bivariate case in Fig. 2, the efficiently encoded stimulus distribution is a particular multivariate Pareto type III distribution (80, 81). Note that this particular Pareto type III distribution is parameterized by a homogeneous shape parameter along with location parameters μ and scale parameters . As above, efficiency requires that the parameters λ and α of the normalization formula should be tuned to the stimulus distribution as follows: The exponent α should be set to the shape parameter β of the distribution, and the normalization weights should be inversely proportional to the (transformed) scales of the distribution. Note that Eq. is obtained from Eq. by imposing constant metabolic costs and evaluating the resulting integral in the denominator, which results in the cancellation of γ and b (see proof in ). Under constant metabolic costs this leaves γ and b as free parameters. As mentioned above, the parameter γ can be interpreted as the constraint on the sum of firing rates, and b is a semisaturation constant. The joint and conditional densities of this Pareto type III distribution are plotted in for a range of parameter values. Its marginal distribution is a univariate Pareto type III with cumulative distribution function (cdf)for , whose mean is, for , given by contains expressions for the variance and further moments. The marginal Pareto distribution is heavy tailed and for it is an exact power lawfor . This observation is interesting given that many naturally occurring quantities exhibit approximate power-law characteristics (82–84). For , the marginal distribution is a univariate log-logistic distribution, sometimes also referred to as a Fisk (85) distribution.

Empirical Analysis of Natural Stimulus Statistics.

If divisive normalization has evolved as an efficient computation, our result suggests that it may have adapted to environments whose stimulus statistics are well described by multivariate Pareto (type III) distributions. We examine this hypothesis in the visual domain where there is considerable empirical evidence on stimulus statistics to which we can relate our result. In particular, we ask whether the Pareto distribution we have found to be efficiently encoded exhibits the kind of statistical dependence that is commonly found in filter responses to natural images (35, 51, 56, 86). We first note that the pairwise covariance of the efficiently encoded Pareto distribution (Eq. ) is, for and , given by ref. 80, equation 6.1.29:where Γ is the gamma function. To assess whether the Pareto distribution and its correlation structure are an empirically relevant model of natural image statistics, we performed a simple exploratory analysis (Fig. 3 ) using images from the van Hateren image dataset (87). We obtained the joint histogram of the responses of a pair of filters differing in orientation (shown in Fig. 3 for the example image of Fig. 3), as well as the corresponding conditional histogram (Fig. 3). The characteristic “bow-tie” shape of the conditional histogram is an empirical regularity observed in numerous naturalistic stimuli (35, 86).

Fig. 3.

Fitting natural stimulus statistics (see for details). (A) Scatterplot showing the log-likelihood (llh) of the best-fitting Pareto (x axis) and multivariate-t (y axis) models to the statistics of naturalistic images from the van Hateren dataset (87). Each dot corresponds to one image and the histograms show the corresponding marginal distributions. (B) An (additional) example image from the van Hateren dataset, with log-transformed brightness for better visibility. (C) Joint histogram for the example image of the responses of two filters with orientations of 45∘ and 135∘ counterclockwise, respectively (brightness on log scale). (D) Corresponding conditional histogram of the same filter responses, showing the distribution of responses of one filter (y axis) conditional on the response of the other filter (x axis). The brightness is rescaled in each column to use the full range of intensities, in line with the literature (35). Using the filter response data, we obtained maximum-likelihood estimates, for each image separately, of the parameters of a bivariate Pareto distribution (extended to ; see ). For comparison, we also fitted a bivariate t-distribution (with the same number of parameters), as has been used to model natural image statistics (59, 88–90). Fig. 3 shows the distribution of the resulting log-likelihoods. The fact that the log-likelihoods of the Pareto model are greater than those of the multivariate t model (means: for Pareto and for t-distribution) suggests that the Pareto distribution is a strong contender as a model of natural image statistics. However, we do not consider our analysis to be definitive, and further research is required to determine how well the Pareto distribution describes image statistics quantitatively and qualitatively. But this analysis does demonstrate that the Pareto distribution, which we derived from first principles, is also likely to be an empirically relevant description of natural stimulus statistics. Fig. 4 illustrates why the Pareto distribution may describe natural stimulus statistics well. The depicted conditional histogram of a bivariate Pareto type III distribution with β = 1 (extended to ) closely matches the empirically observed statistical dependence in filter responses to natural images. We conclude that the efficiently encoded Pareto distribution exhibits key features of naturalistic stimulus distributions, including this kind of statistical dependence as well as heavy-tailed marginal distributions. We further note that the average estimate for β was 1.2 (), which is comparable to estimates of the divisive normalization exponent (α) from neural data (6). This may suggest that the divisive normalization parameter α is indeed tuned to the shape parameter β of the Pareto distribution.

Fig. 4.

Conditional histogram of a bivariate Pareto distribution extended to , with , and β = 1. Brightness is proportional to the probability of s2 conditional on s1, rescaled in each column to use the full range of intensities, as is customary in the literature. The statistical dependence closely resembles the bow-tie shape observed empirically for many naturalistic stimuli (35).

Relation to Existing Models of Natural Stimulus Statistics.

This section explores the connections to existing models of natural stimulus statistics, with the goal of providing a foundation for future research. We first show how the statistical dependence of the Pareto distribution can capture the conditional variance dependence that is commonly observed in bow-tie plots (86). To see this, assume that are filter responses that are distributed according to a multivariate Pareto type III distribution. The variance of a filter response S conditional on all other filter responses is then, for β = 1 (assumed for tractability) and n > 2, given by The conditional variance dependence is thus quadratic, in accordance with Schwartz and Simoncelli (35) who use a similar quadratic model of the variance dependence observed in conditional histograms. Unlike in their model, the conditional distribution giving rise to this variance dependence is not Gaussian. The conditional distribution of the Pareto distribution in Eq. has cdffor , as we show in . For suitable parameter values this reduces to a log-logistic distribution, closely related to the conditionally lognormal distribution used by Wainwright et al. (55). Next, we show that the Pareto distribution also has a connection with models of image statistics based on Gamma-weighted scale mixtures of Gaussian random variables (60). It turns out that the Pareto type III distribution can be expressed as a Gamma mixture of transformed exponential (or Weibull) random variables (ref. 80, chap. 6.2). The following proposition shows that the efficiently coded distribution of Eq. is a particularly simple mixture of transformed standard exponential random variables: Let for , and let independently of all U withis a Pareto type III distribution with the joint survival function of Eq. This result not only facilitates comparisons with existing models of naturalistic stimuli, it is also of practical importance, since it provides a simple way to draw samples from the multivariate Pareto type III distribution.

Discussion

We have characterized the family of stimulus distributions for which divisive normalization maximizes mutual information in the low-noise limit. Absent metabolic costs, this family consists of particular multivariate Pareto type III distributions, which divisive normalization transforms into an entropy-maximizing output distribution. Taking into account metabolic costs of an arbitrary shape, the efficiently encoded input distributions take a generalized form in which inputs resulting in costlier outputs occur less frequently. We note that our result does not imply that divisive normalization is the only neural encoding to satisfy the necessary and sufficient condition for efficiency, even with a Pareto stimulus distribution. Rather, any encoding whose Jacobian satisfies, given some input distribution, the equivalent of Eq. in attains the same output distribution and will thus be equally efficient. Yet for divisive normalization to be efficient, the stimulus distribution must necessarily be of the prescribed type. We further demonstrated that the Pareto distribution is consistent with empirical findings on naturalistic stimulus statistics. In the context of vision, the Pareto distribution captures key features observed in the statistics of natural images that are commonly modeled with Gaussian scale mixtures (60, 91). Our empirical analysis demonstrated that the Pareto distribution may be a similarly good model of natural-image statistics. identifies the Pareto distribution as a mixture of transformed exponential random variables and may thus prove helpful in further exploring relations to existing Gaussian mixture models of natural stimuli. While sharing some features with such models, the Pareto distribution differs in other respects. For example, unlike Gaussian scale mixtures, it is not elliptically symmetric. This means that it can be skewed but is not invariant to rotations of the coordinate system and may thus capture joint histograms resembling a diamond or rhombus rather than an ellipse. Furthermore, the Pareto distribution also captures features of typical luminance and contrast distributions (92). We leave exploring these issues for future research. also opens the door to future work exploring more general coding mechanisms that are efficient for a wider class of input distributions. The distribution we find to be efficiently encoded is a special case of a Pareto type III distribution that is restrictive in the sense that a single shape parameter β governs the distribution’s dependence structure as well as the shape of its marginals. suggests that another fruitful direction, beyond considering more general Pareto distributions, may be to consider mixtures of an exponential random vector whose components are not independent, allowing for a more general covariance structure. It would also be desirable to generalize our result to allow normalization weights to differ across pairs of neurons. Ruling out the case where different neurons in a population interact with different weights eliminates the possibility that neurons representing more proximal inputs normalize each other more strongly than those representing more distant inputs. Such distance-dependent normalization would be an efficient code of stimuli whose statistics reflect a notion of proximity. For example, this is the case for contrast in images, where correlation is decreasing as a function of distance. Our current framework does not incorporate such a distance notion for reasons of analytical tractability, but an extension would be of interest. Our results provide a sharp testable prediction that can potentially be used to test empirically the efficient coding hypothesis. In a neural system whose response function is well described by divisive normalization, testing whether the inputs are Pareto distributed amounts to testing a necessary condition for the efficient coding hypothesis to hold. A more rigorous and demanding test would also estimate the normalization parameters as well as the parameters of the distribution and test whether they satisfy the conditions imposed by Theorems 2 and 3. Formally, if the efficient coding hypothesis holds, then the hypothesis that such a system’s inputs are Pareto distributed with and must be true (under the maintained hypothesis of constant metabolic costs). Rejecting this null hypothesis would thus result in the rejection of the efficient coding hypothesis. Experiments that systematically manipulate the distribution of sensory stimuli (or choice sets) in a subject’s environment could even examine whether the divisive normalization parameters (such as α) optimally adapt to different stimulus distribution contexts (e.g., Pareto distributions with different values for β). Of course, less demanding tests are possible and informative. It is important to stress that according to , testing the input distribution is a test of the joint hypothesis that coding is efficient and metabolic costs are of a particular form. For instance, Pareto-distributed inputs are only necessary for the efficient coding hypothesis under the maintained assumption of constant metabolic costs. This is a feature, not a bug. can be viewed as a representation theorem, analogous to the economics tradition of testing whether choice behavior is consistent with maximizing some utility function. This provides additional degrees of freedom: If the input distribution adheres to Eq. for some metabolic cost function, then the efficient coding hypothesis would not be rejected. Our work raises the possibility that divisive normalization is ubiquitous because it is an adaptation to Pareto distributions that are themselves widespread—in stimulus statistics but also, perhaps, in the statistics of other quantities. One example of such a quantity is firing rates of neurons, which in many neural systems appear to follow lognormal distributions (93–95), whose heavy-tailed shape closely resembles that of Pareto distributions. If firing rates of an upstream system are approximately Pareto distributed, it is conceivable that divisive normalization may be an adaptation to the firing-rate distribution of neurons from which it receives input. But, of course, the resulting outputs would then not be Pareto distributed, raising the question of whether the observed cascade of normalization-like modules in the brain is to some extent redundant. If each stage in a series of divisive normalization levels in the brain receives the preceding stage’s output as its input, then each stage faces a different distributional structure. Understanding how several stages of divisive normalization may work together to produce an efficient code is an open issue that is beyond the scope of this paper. Perhaps, applying our result to distributions of firing rates will help shed light on such issues. The fact that divisive normalization has been implicated in settings beyond sensory processing and perception, particularly in value representations (17, 18) and choice (19, 21), raises the question of whether the Pareto distribution is also an ecologically relevant description of value-based choice environments. The univariate Pareto distribution of Eq. (with μ = 0) is often used to model wealth and income distributions in economics, which is where it originated (96, 97). Moreover, the Pareto distribution is intimately related to Zipf’s law and is observed in many complex systems (82). Interestingly, it is also remarkably consistent (98) with Benford’s law (99, 100), according to which the leading digits in many naturally arising sets of numbers are likely to be small. The heavy-tailed power-law characteristic of the Pareto distribution is relevant not only in sensory contexts, but also in many economic contexts (101, 102) ranging from city-size distributions to stock returns and trading volumes (103). While empirical evidence from choice environments is lacking, the widespread occurrence of Pareto distributions in economic contexts hints that divisive normalization might be observed in value representations and choice behavior because it is an efficient code of Pareto-distributed values within choice sets. An important caveat to this hypothesis is that mutual information is less relevant as an objective function for a decision maker than it is in the sensory domain. The efficient code for a utility-maximizing decision maker differs from the information-maximizing code (104–106). Conditions under which divisive normalization is efficient for choice remain to be determined. In conclusion, our theoretical results make a simple and sharp prediction about why divisive normalization may be observed across a wide range of settings. This prediction can be tested in different domains, ranging from electrophysiological studies to empirical studies of value distributions and experiments on choice behavior. Our findings provide a framework for future research to test empirically and experimentally whether divisive normalization occurs in environments in which it is an efficient computation.

Materials and Methods

All proofs are given in . The empirical analysis examined the statistics of filter responses to naturalistic images. For a set of 100 images from the (linear) van Hateren image dataset (87), we computed filter responses using a steerable pyramid (107) with four levels and four bands differing in orientation, using the matlabPyrTools package. (See https://github.com/LabForComputationalVision/matlabPyrTools and www.cns.nyu.edu/~eero/STEERPYR/ for more information.) We examined the statistical dependency between a pair of filters of the pyramid’s second level with orientations of 45∘ and 135∘ counterclockwise, respectively (108, figure 1.9). For each image separately, we fitted the joint distribution of these filter responses by obtaining maximum-likelihood estimates of the parameters of two bivariate distributions. The first distribution is a Pareto type III distribution extended to , in the sense that the density at any is given by , where is as in Eq. . We imposed the restriction to ensure that the number of free parameters is not larger than for the second distribution, which was a bivariate t-distribution with correlation parameter c and degrees of freedom df. The estimation was performed using the mle function of Matlab R2022a.

71 in total

Divisive normalization is an efficient code for multivariate Pareto-distributed environments.

Results

Constraints and Metabolic Costs.

Efficiency Criterion.

Characterizing the Optimal Output Distribution.

Characterizing the Efficiently Encoded Input Distribution.

Constant Metabolic Costs.

Empirical Analysis of Natural Stimulus Statistics.

Relation to Existing Models of Natural Stimulus Statistics.

Discussion

Materials and Methods

1. Reading population codes: a neural implementation of ideal observers.

2. Natural signal statistics and sensory gain control.

3. Some informational aspects of visual perception.

Review 4. Efficient coding and the neural representation of value.

5. Visual perception. An efficient code in V1?

6. Marginalization in neural circuits with divisive normalization.

Review 7. The normalization model of attention.

Review 8. Visual perception and the statistical properties of natural scenes.

9. An excitatory basis for divisive normalization in visual cortex.

10. Normalization governs attentional modulation within human visual cortex.

1. Divisive normalization is an efficient code for multivariate Pareto-distributed environments.