Neurons in the primary visual cortex are selective to orientation with various degrees of selectivity to the spatial phase, from high selectivity in simple cells to low selectivity in complex cells. Various computational models have suggested a possible link between the presence of phase invariant cells and the existence of orientation maps in higher mammals' V1. These models, however, do not explain the emergence of complex cells in animals that do not show orientation maps. In this study, we build a theoretical model based on a convolutional network called Sparse Deep Predictive Coding (SDPC) and show that a single computational mechanism, pooling, allows the SDPC model to account for the emergence in V1 of complex cells with or without that of orientation maps, as observed in distinct species of mammals. In particular, we observed that pooling in the feature space is directly related to the orientation map formation while pooling in the retinotopic space is responsible for the emergence of a complex cells population. Introducing different forms of pooling in a predictive model of early visual processing as implemented in SDPC can therefore be viewed as a theoretical framework that explains the diversity of structural and functional phenomena observed in V1.
Neurons in the primary visual cortex are selective to orientation with various degrees of selectivity to the spatial phase, from high selectivity in simple cells to low selectivity in complex cells. Various computational models have suggested a possible link between the presence of phase invariant cells and the existence of orientation maps in higher mammals' V1. These models, however, do not explain the emergence of complex cells in animals that do not show orientation maps. In this study, we build a theoretical model based on a convolutional network called Sparse Deep Predictive Coding (SDPC) and show that a single computational mechanism, pooling, allows the SDPC model to account for the emergence in V1 of complex cells with or without that of orientation maps, as observed in distinct species of mammals. In particular, we observed that pooling in the feature space is directly related to the orientation map formation while pooling in the retinotopic space is responsible for the emergence of a complex cells population. Introducing different forms of pooling in a predictive model of early visual processing as implemented in SDPC can therefore be viewed as a theoretical framework that explains the diversity of structural and functional phenomena observed in V1.
Cells in the primary visual cortex of higher mammals (V1) are sensitive to oriented localized visual patterns and these cells have classically been divided into two classes: simple and complex [1]. Simple cells show in particular a dynamic response to a drifting sinusoidal grating which is linearly modulated by the rectified temporal component of the sinusoidal grating: Simple cells are maximally activated when the stimulus matches a specific spatial phase inside their Receptive Field (RF) [1-3]. On the other hand, complex cells remain relatively invariant to the phase of the stimulus [1, 2, 4, 5]. As such, it is assumed that the primary visual cortex extracts elementary orientation features and produces scene representations that are both selective and invariant to the phase component [6]. Another remarkable property of the visual cortex is the hierarchical organization of the areas downstream to V1 along the ventral visual stream. Each of these areas is sensitive to increasingly complex features: simple edges for the primary visual cortex (V1), shape and textures for V4 [7], and specific objects in infero-temporal region (IT) [8]. These properties are in line with the challenging tasks of object recognition [9]: a stimulus is decomposed into simpler features throughout the hierarchical organization to build a representation that is specific enough to allow accurate recognition (selectivity) while being invariant to properties that do not affect the object identity (invariance) [10]. Historically, those properties of the visual cortex have been modeled with hierarchical networks which alternate layers of linear filters to describe simple cells with non-linear layers to account for complex cells invariance. In particular, Sakai and Tanaka [11] demonstrated that spatial pooling of simple cells units was necessary to reproduce complex cells behavior. Spatial pooling functions, exemplified by the sum of squared simple cells responses (i.e. energy pooling) or with a winner-take-all mechanism (i.e. max-pooling) [12], account for both spatial and phase invariance as it is observed in biological complex-cells.Another phenomenon observed in the cortex of higher mammals is the possible presence of orientation maps. An orientation map is a functional structure of the selectivity of cells on the topography of the cortical surface for which neighboring cortical cells are preferentially tuned to similar orientations. Orientation preference varies smoothly but also shows local singularities, called pinwheels [13-15]. Both the formation mechanism [16-18] and the function [19-21] of such a topology have been thoroughly discussed in the literature. For example, [19] has proposed that orientation maps were optimal to ensure the uniform coverage of features over the visual space and [20] suggested that the presence of pinwheels could facilitate contour extraction. While some frameworks have successfully modeled orientation maps [22, 23], only a few of them have been proposed to make the link between complex cells and topographical organization. Hyvarinen et al. [24] observed that maximizing the sparseness of locally pooled simple-cells results in complex cells that were topographically organized in orientation maps. Similar results were obtained with a type of independent component analysis (ICA) [25]. Antolik et al. [26] managed to demonstrate that complex cells organize themselves in orientation maps if one includes a lateral inhibition mechanism. Further in that direction, clustering orientation preference within orientation maps could also facilitate the emergence of orientation-tuned complex receptive fields. Indeed, to keep orientation tuning while gaining invariance, complex receptive fields must be able to pool locally from all different phase selectivities within a set of similarly orientation tuned neurons. The emergence of orientation maps would facilitate such an operation. However, there is no neurophysiological evidence of a systematic link between the emergence of complex cells and orientation maps. A topographic organization lacking an orientation map is commonly referred to as salt-and-pepper. For example, it is observed that there is no spatial structure for orientation preference in rodents (most notably mice, rats, and squirrels), although they do possess complex cells in V1 [14, 15]. The models listed above, cannot explain the presence of these complex cells in animals that do not present orientation maps in V1. This questions the actual links that can exist between orientation maps and complex cells. In particular, can the same framework be used to describe both salt-and-pepper and orientation maps structures while accounting for the emergence of complex cells?In this article, we study the link between the emergence of topographical organization of orientation preference along with the emergence of complex cells by varying the pooling strategies between layers, using the Sparse Deep Predictive Coding (SDPC) as a model of V1 [27, 28]. The SDPC model integrates three key computational properties observed in the visual cortex: selectivity, invariance, and hierarchical architecture. First, selectivity in the SDPC is achieved with Sparse Coding (SC). Besides being justified by the efficient coding hypothesis [29], SC has successfully accounted for V1 simple cell’s responses and Receptive Fields (RFs) [30, 31]. In addition, the SDPC leverages the Predictive Coding (PC) framework to describe the hierarchical relationship between the network’s layers. PC suggests that every cortical area predicts at best the upstream sensory information. The mismatch between the prediction and the lower-level activity elicits a prediction error that is used to adjust the neural response until an equilibrium is reached [32-34]. The SDPC also includes convolutional layers that allow us to define distinct pooling functions acting in different subspaces: the spatial retinotopic space, and the feature space. In this work, we study the combined impact of these pooling strategies on both the emergence of complex cells and topographical maps. We demonstrate that pooling across the 2D feature space leads to a topographical organization similar to orientation maps, with the presence of phase maps but that does not account for the emergence of complex cells. In contrast, pooling across the spatial dimension leads to cells with complex-like behavior but without orientation maps. Only the combined pooling on both feature and spatial dimensions allows the common emergence of orientation maps, complex cells, and no phase maps. More generally, we argue that the combination of these different pooling strategies allows us to describe the structural and functional diversity across species. Then, we present the SDPC as a unifying theoretical model of the early visual cortex, which building principles can explain the link between topographical structures (retinotopic maps and orientation maps) and the receptive field structure (complex cells). The paper is organized as follows: First, we describe our model by detailing the SDPC as well as the pooling strategies. Next, we analyze the complex-like behavior and the topographical organization of the neurons for different pooling strategies. Then, we evaluate the impact of the network size on the pinwheels’ density. Finally, we discuss our results and detail the implication of this work with respect to the state of the art.
Results
Brief description of Sparse Deep Predictive Coding (SDPC)
The Sparse Deep Predictive Coding (SDPC) framework solves a series of hierarchical inverse problems with sparsity constraints. A group of neurons γ predicts by an optimization procedure the pooled activity from the previous cortical layer p(γ) and that was obtained through a set of synaptic weights W. Given a network with N layers, we can define the generative model [27, 28] as:
Where x represents the input stimulus and γ are the rate-based neural responses for layer i (see Fig 1b). Sparsity constraints are introduced using the ℓ0 pseudo-norm which computes the number of active elements in each activity map γ. The W matrices represent the network’s weights (convolutional kernels), and ϵ is the prediction error associated to each layer. In Eq 1, p denotes the pooling operator we use to model the complex cells. We have tested different type and combination of pooling functions based on max-pooling (see the section ‘Pooling functions’ for a detailed description). Henceforth, we use a 2 layer version of the SDPC to model V1 (i.e. N = 2 in Eq 1). To tighten the parallel with biology, we index all the first and second layer variables with the letter S and C to denote the simple and complex cell’s layer, respectively.
Fig 1
The SDPC network.
a. Update scheme of the SDPC network used in this study: x is the input image, γ and γ represent simple and complex cells response maps, respectively. W and W are convolutional kernels that encode for the RFs of the simple and complex cell layers, where each synaptic weight matrix is composed of M and M neurons (kernels) respectively. p(γ) is the pooling function used to generate position and feature invariance in the response of the second layer. Feed-forward connections carry information on the prediction errors (ϵ and ϵ) that are used to refine the neural activities. Feedback information is carried through the unpooling function , that approximates the derivative of the pooling function. The circles with arrows inside are error nodes such that the output signal is equal to the difference of the input signal. b. Here, we show a representation of γ and p(γ). Each pixel represents a model neuron and the color code indicates the amplitude of the neural response (lighter for no response and darker blue for the maximal response, here normalized to 1). In this figure, we illustrate three possible outputs p(γ), used to generate different network structures: MaxPool 2D selects the maximum activity over spatial (retinotopic) positions, in each plane independently; MaxPool 2D acts in the feature space by selecting the maximum activity across planes in γ. When the network integrates the two pooling functions in sequence, we refer to them as a unique max-pooling operator called MaxPool 2D + 2D.
The SDPC network.
a. Update scheme of the SDPC network used in this study: x is the input image, γ and γ represent simple and complex cells response maps, respectively. W and W are convolutional kernels that encode for the RFs of the simple and complex cell layers, where each synaptic weight matrix is composed of M and M neurons (kernels) respectively. p(γ) is the pooling function used to generate position and feature invariance in the response of the second layer. Feed-forward connections carry information on the prediction errors (ϵ and ϵ) that are used to refine the neural activities. Feedback information is carried through the unpooling function , that approximates the derivative of the pooling function. The circles with arrows inside are error nodes such that the output signal is equal to the difference of the input signal. b. Here, we show a representation of γ and p(γ). Each pixel represents a model neuron and the color code indicates the amplitude of the neural response (lighter for no response and darker blue for the maximal response, here normalized to 1). In this figure, we illustrate three possible outputs p(γ), used to generate different network structures: MaxPool 2D selects the maximum activity over spatial (retinotopic) positions, in each plane independently; MaxPool 2D acts in the feature space by selecting the maximum activity across planes in γ. When the network integrates the two pooling functions in sequence, we refer to them as a unique max-pooling operator called MaxPool 2D + 2D.One possibility to solve the generative problem defined in Eq 1 is by minimizing the following loss function [27]:
In Eq 2, ‖⋅‖2 and ‖⋅‖1 denote the ℓ2 and ℓ1-norm, respectively. Note that the ℓ0 constraint in Eq 1 has been relaxed with a ℓ1-penalty term in Eq 2 as it leads to a more convenient optimization problem [35]. The minimization of F in Eq 2 is performed using an alternation of an inference and a learning step. The inference step involves finding the optimal neural responses (i.e. γ) with a gradient descent on the loss function F [36]:
Once the inference process has converged, the optimization of W and W is performed through the learning process:
In Eqs 3 and 4, , , and denote the neural activity and the synaptic weights at the gradient step number k, respectively. η is the step size of the inference process and ω is the step size of the learning. Additionally, λ and λ are the sparsity-inducing regularization parameters of the simple and complex cell’s layer, respectively. is the non-negative soft-thresholding operator (see Eq 7 in section ‘Detailed description of SDPC’ for a mathematical description of the soft-thresholding operator). In Eq 3, denotes the approximation of the derivative of the max pooling function. Note that max-pooling is not differentiable but its derivative could be approximated by an unpooling function, i.e. an operator composed of a matrix filled with ones in the position of the local maximum elements and with zeros everywhere else. Fig 1a shows the update scheme of the SDPC network used in this study. In this study, we aim to model simple and complex cells in V1.
Pooling functions
All pooling functions are based on max-pooling: a function that selects the maximum response within a group of cells (here, γ and γ for an example):
The max-pooling performs a winner-takes-all computation rather than calculating the energy of the firing of a pool of neurons [11]. We decided to use max-pooling as a way of generating invariance for two reasons: First, the winner-take-all mechanism can account for numerous nonlinear responses in the sensory cortex, and it is regarded as one of the fundamental components of sensory processing in the brain [10, 11, 37, 38] (see section ‘About the bio-plausibility of SDPC’ for more details on the bio-plausibility of max-pooling); Second, the local competition enforced by max-pooling is echoing the global competition mechanism introduced with Sparse Coding (SC). Indeed, while max-pooling performs a winner-takes-all computation in cells that belong to the same neighborhood, SC forces all neurons in a given layer to compete with each other to best predict the input. Using both SC and max-pooling to model respectively simple and complex cells allows us to reduce the core computation of V1 to a single mathematical competition mechanism.Given a neural response map, γ, max-pooling can be seen as a nonlinear convolution whose output, p(γ), is the maximum value in each pooling region (see Fig 1b). Just as convolution, max-pooling is defined by its kernel size and stride. In this study, by varying these parameters, we introduce three types of pooling functions:MaxPool 2D acts in the spatial dimension, independently for each feature. In this study, we use a pooling kernel of 2 × 2 neurons with a stride of 2.MaxPool 1D selects the maximum across a neighborhood of adjacent planes of γ along with a 1D circular space. This function acts in the feature space, leaving the spatial encoding unchanged. For this function, we used a linear kernel of size 4 and a stride equal to 1.MaxPool 2D arranges a feature space composed of M neurons on a grid of grid, for each spatial location. Then it selects the maximum activity across 2D pooling regions, with a size of 2 × 2 neurons with a stride of 1. This pooling function only acts in the feature space, just as the one we defined above.Since the pooling functions operate in two different sub-spaces, we can apply them in sequence to combine the effect of the two strategies (see Fig 1b). We call these functions MaxPool 2D + 1D and MaxPool 2D + 2D.
Pooling in a predictive coding network
Besides reaching a stable point in terms of the loss function (see Eq 2), the network efficiently developed edge-like Receptive Fields (RFs) in the first layer (see Fig 1b and S1 Fig top left panel) and second layer (S1 Fig bottom left panel). Note that even if the convergence of the SDPC is guaranteed due to the convexity of the loss function, nothing prevents the SDPC to converge towards a trivial solution. This first observation offers then a sanity check and confirms that the SDPC is converging towards a meaningful solution. This result holds for the different network sizes that we tested (36, 49, 64, 81, 100 and 121 neurons for each layer) and the different combinations of pooling functions: MaxPool 2D, MaxPool 2D, MaxPool 2D + 1D, and MaxPool 2D + 2D. Note that inferring the shape of the second layer RFs is not trivial, because the pooling function p(γ) makes it challenging to project the learned filters back into the space of the input image (see [28] for details). In S1 Fig, we show V, a linear approximation for the second layer kernels (see section ‘Log-Gabor fitting’ for more detail on this linear approximation). From these linear projections, we can see that the W’ RFs also have the shape of localized edge-like filters (see S1 Fig bottom left panel). This is interesting to observe that the second layer RFs are similar to the first layer RF even if the second layer response is strongly non-linear (due to the max-pooling operation). In the following section, we evaluate the difference between the first and second layer response.
Phase invariance and complex behavior
In Fig 2 we show some typical responses of complex and simple cells when varying both orientation and phase. While all of the first layer neurons are sensitive to both orientation and phase (i.e. simple cells), we observe that some of the second layer neurons are invariant to phase variation (i.e. complex cells, see Fig 2). To quantify phase invariance at the population level we use the modulation ratio, denoted and introduced by [4] (see section ‘Modulation ratio and complex behavior’ for mathematical details on this index). A cell is classified as simple if > 1 and as complex if < 1. We test the phase invariant behavior of the network when trained using 4 different settings corresponding to the following pooling strategies: p = MaxPool 2D, p = MaxPool 2D, p = MaxPool 2D + 1D and p = MaxPool 2D + 2D. In all these settings, the distribution of shows that the first layer develops exclusively simple-like neurons, while the second layer shows a distribution that depends on the pooling strategy (Fig 3, fourth column). When spatial pooling and circular feature pooling are combined together (for p = MaxPool 2D + 1D), we observe a sharp high peak of in 0 in the second layer (see Fig 3c). In contrast, the distribution is much broader when the spatial pooling is combined with 2D feature pooling (for p = MaxPool 2D + 2D, see Fig 3d). For these 2 previous settings, the second layer neurons are strongly invariant to phase (high peak at 0 in Fig 3c and 3d). When p = MaxPool 2D, our model does not develop complex cells, i.e. the is high for both layers (see Fig 3b). When we impose only a spatial pooling (i.e. p = MaxPool 2D), the second layer of our network exhibits as many simple cells as complex cells (see Fig 3a).
Fig 2
Response of simple-like and complex-like model neurons.
Example of two model neurons exhibiting a simple (left) and complex (right) behavior. The black lines indicate the model neuron’s response when its receptive field contains a drifting (top) or rotating (bottom) stimulus; the red lines indicate the response modeled according to a half-rectified sinusoidal model as in [39]. The modulation ratio, , is greater than 1 for simple neurons that are tuned to a specific phase. A complex response, that is partially or completely independent to phase, is quantified by a low modulation ratio (less than 1). Importantly, both neurons remain tuned to orientation.
Fig 3
Emergence of orientation maps in the first layer and complex response in the second layer.
We show the different properties learned by our model depending on which pooling functions we used (a—d). All networks showed here are trained with M = 100 and M = 100. First column. Representation of the simple cells Receptive Fields (i.e. the synaptic weights of the first layer W). The red rectangles for (b—d) indicate the size of the feature pooling, the red arrows illustrate the circular (c) or toroidal (b and d) structure of the feature space. Second column. Orientation preference of each neuron represented by the filters in W. The luminance of the map is modulated by the orientation selectivity (HWHH) of each neuron. Third column. Phase preference of each neuron represented by the filters in W. Fourth column. Distribution of the modulation ratio () for the first and second layers of the network. Fifth column. Local homogeneity index (LHI) associated with each element is W. White stars indicate the position of pinwheels. Sixth column. Linear relationship between the LHI and the orientation selectivity, measured by the half-width at half-height (HWHH) of the response of cells from the first layer.
Response of simple-like and complex-like model neurons.
Example of two model neurons exhibiting a simple (left) and complex (right) behavior. The black lines indicate the model neuron’s response when its receptive field contains a drifting (top) or rotating (bottom) stimulus; the red lines indicate the response modeled according to a half-rectified sinusoidal model as in [39]. The modulation ratio, , is greater than 1 for simple neurons that are tuned to a specific phase. A complex response, that is partially or completely independent to phase, is quantified by a low modulation ratio (less than 1). Importantly, both neurons remain tuned to orientation.
Emergence of orientation maps in the first layer and complex response in the second layer.
We show the different properties learned by our model depending on which pooling functions we used (a—d). All networks showed here are trained with M = 100 and M = 100. First column. Representation of the simple cells Receptive Fields (i.e. the synaptic weights of the first layer W). The red rectangles for (b—d) indicate the size of the feature pooling, the red arrows illustrate the circular (c) or toroidal (b and d) structure of the feature space. Second column. Orientation preference of each neuron represented by the filters in W. The luminance of the map is modulated by the orientation selectivity (HWHH) of each neuron. Third column. Phase preference of each neuron represented by the filters in W. Fourth column. Distribution of the modulation ratio () for the first and second layers of the network. Fifth column. Local homogeneity index (LHI) associated with each element is W. White stars indicate the position of pinwheels. Sixth column. Linear relationship between the LHI and the orientation selectivity, measured by the half-width at half-height (HWHH) of the response of cells from the first layer.
Learning topographic orientation maps in the SDPC network
Networks trained using pooling functions that act in the feature space (MaxPool 2D, MaxPool 2D + 2D, and MaxPool 2D + 1D) showed the emergence of differential topographic structures for W and, by consequence, γ. Interestingly, as we introduce pooling in the feature space (see section ‘Pooling functions’ for more details on pooling functions), the arrangement of filters on W converges to form a topographic orientation map where neighboring neurons become tuned to similar orientations (see Fig 3b–3d, second column). To compare quantitatively the orientation maps learned by our model with the ones observed in neurophysiological experiments, we used the local homogeneity index (LHI), introduced by [40] (for mathematical details on the LHI index see section ‘Local homogeneity index (LHI)’). We observed in the fifth column of Fig 3b–3d that the orientation maps developed by the SDPC contain regions where orientation preference varies smoothly (high LHI) combined with local discontinuities (low LHI), in analogy with pinwheels observed in higher mammals [13-15]. One may wonder if the existence of these topographic maps has functional implications for the properties of neurons from the first layer. Indeed, electrophysiological experiments have shown a clear link between a neuron’s position in the cortical map and its tuning properties [20, 40] (but see [13]). Specifically, neurons located near pinwheels tend to have a broader orientation tuning, while neurons located in iso-orientation regions display a narrower, more selective tuning. We evaluated the relationship between the LHI of single neurons in the first layer and their orientation tuning, evaluated as the half-width at half-height (HWHH) of their response to a rotating grating. We found a linear relationship similar to one of [40] and [20]. This relationship is valid for all networks with a 2D topographic map (p = MaxPool 2D + 2D) and M ≥ 64 (p < 0.01). The linear relationship is significant (p < 0.01) for few configurations of p = MaxPool 2D and p = MaxPool 2D + 1D, with no clear dependence on M.Interestingly, the networks trained with a combination of spatial and feature pooling (MaxPool 2D + 1D and MaxPool 2D + 2D) appear to develop a topographic map when we look at the preferred orientations θ of neurons in the first layer (Fig 3c and 3d, second column). However, this structure is not present if we look at the preferred phase map ϕ, where there appears to be no particular structure (Fig 3c and 3d, third column), in line with the known neurophysiology [41]. This is not true in the case of MaxPool 2D, where both maps appear to be organized in groups of cells sensitive to similar orientations and phases, in contrast with neurophysiological evidence (see Fig 3b, second and third column). To further analyze the orientation map structures, we used the LHI. The LHI quantifies the number of pinwheels generated by our model in different tested conditions (for mathematical details on the pinwheels density calculation see section ‘Local homogeneity index (LHI)’). We found that the tested pooling functions, MaxPool 2D, MaxPool 2D + 2D, and MaxPool 2D + 1D, generated a comparable number of pinwheels centers even for networks about 4 times bigger than the smallest network tested: M = 36 and M = 121 (see Fig 4). Thus, the number of pinwheels singularities remarkably does not depend on the size of the first layer M. A similar effect has been reported across many species [42].
Fig 4
Pinwheel density does not depend on network size.
Here we show that there is no relationship between the tested first layer’s network size (M) and the density of pinwheels (ρ). Pinwheel density is defined as the average number of pinwheels per area unit (see section ‘Local homogeneity index (LHI)’). For all networks showing orientation maps, we found no significant trend between the size of the map, M, and the pinwheel density, in line with neurophysiological observations.
Pinwheel density does not depend on network size.
Here we show that there is no relationship between the tested first layer’s network size (M) and the density of pinwheels (ρ). Pinwheel density is defined as the average number of pinwheels per area unit (see section ‘Local homogeneity index (LHI)’). For all networks showing orientation maps, we found no significant trend between the size of the map, M, and the pinwheel density, in line with neurophysiological observations.
Invariance to phase vs. invariance to orientation
We want to further explore why the complex cells developed by the second layer of our model show invariance to the stimulus phase while remaining tuned to orientation (see Fig 5). A highly nonlinear network could respond unspecifically to a broad set of different stimuli. Invariance to multiple properties of the stimuli (e.g. both to phase and orientation) would be in contradiction with neurophysiological experiments. In fact, it is well established that complex cells exhibit phase invariance but show orientation tuning similar to simple cells [4, 5]. To assess whether the network’s response is orientation-selective or not, we tested the network’s invariance property to both phase and orientation of gratings. We used drifting gratings as stimuli that change phase over time. Conversely, we used rotating gratings that change orientation, to modulate the response of orientation-selective neurons. This was performed for each cell of the second layer of our network at its optimal spatial frequency (see section ‘Drifting grating vs. Rotating grating’). For quantification purposes, we define R as the percentage of cells that show a complex-like response for phase, or that are unselective to orientation ( < 1). We report the results relative to the second layer of the networks, as in all the tested conditions, the first layer of the network exclusively exhibited simple-like responses (R = 0, see Fig 3, fourth column). In Fig 5, the black and green lines R and R correspond to the dependence on phase (drifting) and orientation (rotating), respectively. We analyze these results as a function of the number of channels in the first layer, M to evaluate the impact of the network structure on the second layer’s responses. For p = MaxPool 2D, the networks showed high R (up to 60%) and low R (at most 20%), independently to M. This suggests that the SDPC network efficiently develops phase-invariant responses while maintaining a relatively narrow tuning to the stimulus’s orientation (see Fig 5, top-right). When p = MaxPool 2D + 1D, that is, a spatial pooling and a feature pooling organized on a linear structure, R is much higher (up to 100%, see Fig 5, bottom-right) and a smaller R, also independent on M; this can be explained by the combined action of pooling across spatial locations and across features with same orientation but different phases (see Fig 3). Interestingly, only when we impose p = MaxPool 2D + 2D, the network’s behavior appears to vary as a function of M (see Fig 5, bottom-left). For low dimensions (up to M = 64), the tested networks show a high fraction of complex-like cells. For M > 64, R decreases as M increases (see S1 Text for statistical tests). The same effect can be observed for R, which appears to be high for M = 36, and then decreases with increasing M.
Fig 5
Complex cells population analysis.
Here we show how the type of pooling function used and the number of cells type present in the first and second layers of the network (M and M, respectively) affect the complex cells population’s properties developed by the model. Top-left. We tested the network’s second layer invariance to phase (drifting) and orientation (rotation) in sinusoidal gratings (see section ‘Drifting grating vs. Rotating grating’). The ratios of cells R (complex) and R (orientation invariant), are defined as the percentage for which < 1 in the second layer of the network. While the network’s second layer shows a strong invariance to a drifting grating (black lines), the same effect is not present for the rotating grating (green lines). This result indicates that the network’s invariant response is specific to the stimulus’s phase and not to its orientation. Overall R is much higher when the max-pooling acts also in the feature space (p = MaxPool 2D + 1D and p = MaxPool 2D + 2D). Interestingly, for p = MaxPool 2D + 2D, the number of complex-like cells seems to depend on the size of the network’s first layer M.
Complex cells population analysis.
Here we show how the type of pooling function used and the number of cells type present in the first and second layers of the network (M and M, respectively) affect the complex cells population’s properties developed by the model. Top-left. We tested the network’s second layer invariance to phase (drifting) and orientation (rotation) in sinusoidal gratings (see section ‘Drifting grating vs. Rotating grating’). The ratios of cells R (complex) and R (orientation invariant), are defined as the percentage for which < 1 in the second layer of the network. While the network’s second layer shows a strong invariance to a drifting grating (black lines), the same effect is not present for the rotating grating (green lines). This result indicates that the network’s invariant response is specific to the stimulus’s phase and not to its orientation. Overall R is much higher when the max-pooling acts also in the feature space (p = MaxPool 2D + 1D and p = MaxPool 2D + 2D). Interestingly, for p = MaxPool 2D + 2D, the number of complex-like cells seems to depend on the size of the network’s first layer M.
Discussion
In previous work [28], we introduced the SDPC algorithm, and we used it to model local interactions in the early visual cortex (V1/V2). We showed SDPC can predict how strong intra-cortical feedback connectivity (from V2 to V1) shapes the response of neurons in V1 according to the Gestalt principle of good continuation. In this article, we have extended the SDPC with different pooling operations such that it is now including this non-linear computation. Our 2-layer SDPC model of V1 allows us to test different types of pooling operation: MaxPool 2D, MaxPool 1D and MaxPool 2D and to assess their impact on the cell’s response and on the emergence of an orientation map.
Summary of results
We have shown that when we introduce a 2D spatial pooling only (i.e. MaxPool 2D), our model efficiently develops complex cells, but we do not observe the emergence of an orientation map (see Figs 3a and 5). In the case of MaxPool 2D, the network only pools in the feature space. In this condition, the model learns a topographic map in the first layer for both orientation and phase (Fig 3b) but no complex cells are present in the second layer (R ≈ 0%). For MaxPool 2D + 1D, the network pools in both retinotopic and feature space (circular topology). In this case, the first layer of the network develops a topographic structure for orientation preference but not for phase (Fig 3c). The second layer presents a high fraction of complex cells (up to R = 100%) in all tested conditions irrespective of M. For MaxPool 2D + 2D, the network pools in both retinotopic and feature space (toroidal topology). Similar to the previous case, the network develops, in the first layer, a topographic map for orientation but not for phase (Fig 3d). In this case, however, R appears to depend on M, with smaller dimensions for the first layer (low M) appearing to produce more complex cells (Fig 5). Consequently, the different combinations of pooling functions, allow the SDPC to account for a high diversity of cells types and cortical maps. In the next section, we interpret our computational findings in light of current neuroscientific knowledge and link our work with other state-of-the-art models to discuss the bio-plausibility of the SDPC model.
SDPC models the diversity of V1 cells’ type and topological maps across species
We observed that when the pooling function used in the model acts in the feature space (MaxPool 2D, MaxPool 2D + 1D, and MaxPool 2D + 2D), the SDPC model develops an orientation map in the first layer. These orientation maps show quantitatively strong similarities with those observed in higher mammals: First, we found that a linear relationship exists between the local homogeneity index (LHI) of a model neuron (i.e. its position relative to pinwheels in the map) and its orientation tuning as observed in neurophysiological studies [20, 40]. In particular, neurons near pinwheels (low LHI) have a broader tuning than those in iso-oriented regions (see section ‘Learning topographic orientation maps in the SDPC network’). Second, we found the number of pinwheels for each of the conditions listed above to be independent of the network first layer’s size, M (Fig 4). To sum up, our model can predict, without any supervision mechanism, a key property of cortical orientation maps across different species such as carnivores and primates, that is, the emergence of pinwheels and orientation domains with a constant pinwheel density even for a large diversity of V1 sizes [14, 18, 42].According to our model, complex cells emerge by pooling in the retinotopic space, even in absence of orientation maps, thanks to hierarchical pooling in V1. Pooling across retinotopic positions (MaxPool 2D) can explain, by itself, the emergence of complex cells by enforcing position invariance, a fundamental mechanism observed in complex cells [43, 44]. In this case, the absence of cortical orientation maps suggests that the emergence of complex cells depends solely on pooling across different positions in the retinotopic space. This type of network could be the one preferably implemented in animals that do not exhibit orientation maps [14, 15]. This is in line with the results from [45], showing that orientation selectivity can emerge even in networks that connect locally neurons with a random preferred orientation, as in a salt-and-pepper map; in analogy with the random maps that the network produces in the case of MaxPool 2D [45, 46]. On the other hand, in the case of MaxPool 2D + 1D and MaxPool 2D + 2D, our model can generate feature invariance by pooling in the feature space. As a consequence, the model converges to a configuration where complex cells can be generated by pooling across the same orientations and different phases (Fig 3c and 3d).We have shown that networks that develop topographic maps, on top of classical spatial pooling, exhibit more complex-like cells and, in general, more phase invariant response (see section ‘Phase invariance and complex behavior’ and Fig 3c and 3d). These findings suggest that rodents should show a lower fraction of complex cells in V1, compared with other mammals. This prediction is, indeed, in line with neurophysiological findings in mouse [47, 48] and the rabbit [49], although squirrels (that are highly visual rodents) show a fraction of complex cells comparable to other mammals [46]. The most remarkable fact is that these results are obtained solely by changing a single computational mechanism, that is the type of nonlinear pooling used, and by learning neural responses directly from natural images.
Predictive Coding as a mechanism for orientation map formation
In SDPC, the orientation maps are naturally emerging thanks to the combined action of prediction error minimization and pooling. We have observed that there is no orientation map in the first layer when we remove the feedback connection (i.e. the orange connection in Fig 1a). This observation suggests that the feedback from the second layer mediates the formation of cortical maps in the first layer. Interestingly, a similar relationship between simple and complex cells has been suggested by Kayser & Miller [50]. We can then hypothesize that the orientation map (in the first layer) represents the best possible organization to minimize the prediction error with a representation (in the second layer) that is locally feature-invariant (because of the pooling in the feature space).
Relation to the state-of-the-art
One of the first models to account for both complex cells and orientation maps is the Topographic ICA from Hyvarinen et al. [25]. The topographic ICA describes the topology in V1 as a quantification of the residual dependence between the components of an ICA (the closer the features in the topological space, the more dependent they are). In addition, Hyvarinen et al have also observed orientation maps when maximizing the sparsity of a feedforward network with an energy pooling layer [24]. In contrast, the topology is naturally emerging from the combined action of feedback connection and pooling in the SDPC (see section Predictive Coding as a mechanism for orientation map formation). To the best of our knowledge, SDPC is the first model that links the emergence of the orientation map with the predictive coding framework. Another difference is that SDPC is convolutional while [24] and [25] have fully-connected neural layers. Fully-connected architectures do not disentangle the role of retinotopy from the one of orientation map in building complex cell responses, as the two structures are merged in the same map. The SDPC being convolutional, the feature space is by construction dissociated from the retinotopic space. This particular property allows the SDPC to describe not only the primate and carnivore orientation maps, but also the salt-and-pepper configuration observed in rodents (see section SDPC models the diversity of V1 cells’ type and topological maps across species).Other frameworks have been proposed to compute the optimal topography of features in natural image representation. For example, [51] propose a model based on strong dimension reduction (using PCA) that learns to ignore fine-grained structure from the signals of simple cells and discover a linear pooling of correlated units. Interestingly, [22] adopted a similar approach in which pooling emerges to model the geometry of the manifold of a sparse code. Different from the SDPC, these models have not studied the combination of complex cells with the emergence of different types of topographical maps (orientation map and salt and pepper).
About the bio-plausibility of SDPC
In this article, we have modeled complex cells using the max-pooling function that relies on the max operation. As demonstrated by [38], a cortical circuitry could implement a max-like operation. Interestingly, the proposed canonical circuit could also be used to model energy pooling.The SDPC model is relying on convolution operation. Convolutions assume that the synaptic weights are repeated across the image to tile the whole visual field. This weight-sharing mechanism is unlikely to be implemented in biology. However, the same retinotopic architecture can be achieved with local untied connections to keep the same structure without replicating the set of weights at each spatial position. In particular, [52] showed that such a model converges to a network similar to a convolutional network. These results suggest that convolutions can be regarded as a good model of retinotopic processing, even if it is not strictly biologically plausible.The SDPC satisfies computational constraints that are thought to occur in the brain. As illustrated by Eq 3, all computations involved in the neural response update are local. Indeed, the new state of the neural population (i.e. ) only depends on its previous state (i.e. ), the states of the adjacent layers ( and ) and the associated synaptic weights (W and W). In addition, one can observe that the weight update equation (see Eq 4) is a product of monotonically increasing functions of pre-synaptic (γ) and post-synaptic activity (γ). It could then be interpreted as an Hebbian rule [53] that have strong grounding in biology.
Concluding remarks
In this study, we have shown that SDPC including different variations of max-pooling could be regarded as an unsupervised and simple computational framework to model the diversity of mammals’ V1. It provides a plausible explanation for the emergence of complex cells in different types of topographical structures (salt-and-pepper and orientation maps) as observed in different species. One interesting perspective would be to extend to SDPC such that the pooling function can be learned rather than being fixed. To do so, one might leverage the more general parametrization of the pooling function that was proposed in [38]. Such a formalization would allow the SDPC to learn a continuum of pooling functions spanning from energy pooling to max-pooling. In addition, the SDPC allows us to analyze the effect of the feedback. In previous work, we have shown that more feedback strength in the SDPC, 1—reshapes neural organization to improve contour integration [28], 2—modulates the shape of the V1 RFs [27] and 3—improves generalization abilities [54]. Further work could then be conducted to assess the impact of feedback strength on the topology of features in the first layer.
Methods
Detailed description of SDPC
In the section ‘Brief description of Sparse Deep Predictive Coding (SDPC)’, we have described the generative problem solved by the SDPC model (see Eq 1). For the sake of concision, in Eq 2, we have given the loss function only for a 2-layered SDPC network. Here we showcase a more general formulation of this loss function, applicable to a N-layered network:In this equation, ϵ encodes for the prediction error, γ represents the neural response at the layer i, and W denotes the synaptic weights (which has a convolutional structure) between the layer i − 1 and i. The λ coefficient controls the strength of the sparse regularization constraint on the neural response γ. Convolution is an efficient mathematical framework to model retinotopic activity in the visual system: synaptic weights are repeated across the image to tile the whole input image. This brings the advantage of having a model that is translation invariant: the synaptic weights (also called features or kernels) encoded by W are translated at each position of the input image (for more details on the bio-plausibility of the convolution, see section ‘About the bio-plausibility of SDPC’). The resulting vector γ can thus be viewed as a neural response map, whose pixels encode for neurons sensitive to a specific feature and at a specific position in the input image, in perfect analogy with a retinotopic map (see Fig 1b). The other parameters of this convolution are the kernel size and the stride, as detailed in section ‘Training’. The minimization of Eq 6 is performed using an alternation of inference (see Eq 3) and learning (see Eq 4) steps. The soft-thresholding operator in Eq 3 is enforcing that the majority of neurons from γ are inactive (it thus has a sparsifying effect). The mathematical definition of the soft thresholding operator is:
Training
We built different SDPC networks to model simple and complex cells in V1. Each network has convolutions with kernel sizes of 7 × 7 pixels for the first layer (W), and 4 × 4 for the second layer (W). For both layers, we used a stride of 1. We tested the network with 4 different pooling functions: MaxPool 2D, MaxPool 2D, MaxPool 2D + 1D and MaxPool 2D + 2D. Importantly, we introduced a zero padding before the pooling to make sure that first and second layer cells have the same receptive field size with respect to the input image (14 × 14 pixels). Each network was also trained with different neural populations sizes: with M and M equal to 36, 49, 64, 81, 100 and 121 neurons for the first and second layer respectively, leaving us with 36 networks for each tested pooling function. The kernel values W and W were initialized as random noise and were normalized during training such that the energy (Euclidean norm) of each kernel was set to 1. All the networks were trained using grayscale natural images from the STL-10 dataset (96 × 96 pixels per image) [55] for 9 epochs (28125 iterations with mini-batches of 32 images). The input images were pre-processed with a whitening filter similar to the one used in [30] to model retinal processing. Each image was then bounded between the values −1 and 1. The synaptic weight W and W were updated using stochastic gradient descent with a learning rate, ω, of 0.01 and a momentum, β, of 0.9. Additionally, the sparsity parameters λ and λ were gradually incremented during training up to a value of 0.1.
Modulation ratio and complex behavior
In this study, in order to quantify the simple and complex behavior of model neurons in the SDPC, we use the classical measure, also known as the modulation ratio [4]. Given the response of a neuron to a grating drifting at temporal frequency f, the measure represents the ratio between the first harmonic of the response (f = F1) and the mean spiking rate F0. Intuitively, a high modulation ratio (between 1 and 2) indicates that the cell is sharply tuned to a specific phase and it is thus regarded as simple. If a cell shows a low modulation ratio (between 0 and 1) it is then regarded as complex, showing a broad tuning to the stimulus’ phase. For a neural population containing simple and complex cells, the distribution of the modulation ratios will appear to be bi-modal, with two peaks roughly in the regions described above. In [39], authors showed that the modulation ratio can be derived analytically from a half-rectified model of the a neuron’s response to a drifting grating. In particular, depends non-linearly on χ defined as:
Where V and V represent respectively the mean membrane potential and the threshold for spiking generation and A is the maximal amplitude of the modulation. Here we use the simplified rate based model:
With γ being response of the model neuron, ϕ is the stimulus’ phase and ϕ0 the phase at which the response is maximal. Finally, ( )+ indicates an half-rectified function that only output zero or positive values. In this case and the modulation ratio can be obtained through the nonlinear mapping (see [39] for details):
In [39], authors further suggested that the classification of V1 cells as simple or complex might be caused uniquely by the nonlinear relationship between and χ rather than reflecting an actual neuro-physiological difference. Here, for simplicity, we refer to the same values conventionally used in literature: we consider model neurons as simple-like if > 1 and as complex-like if ≤1. We use the modulation ratio also to evaluate the response to rotating stimuli in order to compare it to the response in the case of drifting phase. Although this measure is not in standard use to evaluate orientation tuning, it can be seen as a spectral analysis of the orientation response, similar to the work of Wörgötter and Eysel [56]. Nevertheless, to perform statistical tests on the different conditions, we analyze the distribution of χ (see S1 Text). Indeed, the χ index is easier to analyze as it does not show the typical bimodal distribution of and is linked to through a nonlinear, monotonous, and invertible function (Eq 10).
Local homogeneity index (LHI)
To evaluate the functional implication of the orientation maps learned by our model, we used the local homogeneity index (LHI) as defined by [40]. The LHI measures the similarity of the preferred orientation in neighboring cells:
With m being a location on the orientation map, n a set of neighboring locations and θ the preferred orientation at n. Then, j is the imaginary unit, so that and | | represents the module of a complex number. The constant k normalizes the measure and changes if m and n are bi-dimensional or mono-dimensional, for example in the case of MaxPool 2D or MaxPool 1D. Finally, σ is proportional to the width of the Gaussian window in which the LHI is calculated, here we set σ = 1 pixel. The LHI is bounded between 0 and 1, with high values corresponding to iso-orientation domains and low values to pinwheels. We define pinwheels as local minima in local neighborhoods of 3 × 3 pixels with LHI under a threshold of 0.2. In order to calculate the pinwheel density we define the average size of a cortical column as the ratio , for each tested network. The pinwheel density is then defined as the ratio:
Log-Gabor fitting
To estimate the optimal frequency and orientation for each model neuron, we fitted the synaptic weights of each cell with log-Gabor wavelets [31, 57]. This fit allows us to identify the preferred stimuli for each neuron in the first and second layers. Specifically, we can extract: the preferred orientation (θ), phase (ϕ), frequency (f0) as well as the orientation tuning width (half-width at half-height, HWHH) [58]. Since second layer cells in the model project into the first layer of the network through a nonlinearity (pooling), we defined an approximated linear mapping for the second layer such that:
Specifically, is used to back project (and visualize) the second layer’s weights in the input visual space. The kernels in were then fitted using log-Gabor functions to assess for the best orientation and spatial frequency to stimulate the second layer of the network.
Drifting grating vs. Rotating grating
One goal of our study is to assess the ability of our network to predict complex behavior in V1, that is, to show phase-invariant responses. We test the network’s invariance to drifting and rotating gratings. We do this to make sure that the networks we model indeed exhibit responses that are invariant to phase, yet that they remain tuned to orientation, as observed in neurophysiological experiments. Using the preferred orientation, θ, and frequency, f0, for each neuron in the first and second layer, we created sinusoidal grating with optimal parameters for each neuron. All stimuli were masked by a circular window of 14 pixels in diameter, that is, the size of the receptive field of model neurons. Finally, we evaluated the response of the network at different phases and orientations of the same grating (see Fig 5).
Analysis on χ.
(PDF)Click here for additional data file.
Sparse neural activity maps.
Left. Example of synaptic weights learned from natural images W and W, for the first and second layer, respectively. Here we show V, a linear approximation of W (see section ‘Drifting grating vs. Rotating grating’). Each kernel corresponds to a channel in the neural activity maps. Right. Representation of 3 channels from the neural activity maps (γ and γ) elicited by the input x. Here, each pixel represents a model neuron and the color code indicates the amplitude of the neural response (lighter for no response and darker blue for the maximal response, here normalized to 1). The kernel in the top-left corner indicates the preferred stimulus of each channel.(TIFF)Click here for additional data file.
Emergence of topographic maps during learning.
We show the evolution of the topographic organization learned by W during training when the SDPC network embeds the MaxPool 2D + 2D function. Here, for M = 100. The weights are initialized to random values and the network gradually learns from input data. At first, we observe the emergence of edge detectors similar to the ones observed in the V1 of mammals. Gradually, thanks to the combined action of the feedback coming from the second layer of the network and the pooling function in the forward stream, neighboring cells in the topography become tuned to stimuli of similar orientations but different phases, generating a topographically organized map.(TIFF)Click here for additional data file.
Complex cells population as a function of the network size.
We analyze the same results of Fig 5 in terms of the unimodal variable χ as defined in [39] (see section ‘Modulation ratio and complex behavior’). In the top-left graph we illustrate the nonlinear relationship between χ and . Here the results for M = 100 are shown. To avoid the assumption of normally distributed variables, we represent χ using the median ± MAD (median absolute deviation). Black (χ) and green (χ) lines indicate distributions obtained using drifting and rotating grating, respectively. The dashed lines indicate the value χ = −1 for which = 1, the threshold value for which a V1 cell is typically considered either simple or complex. Using drifting gratings as stimuli significantly generated lower values of χ (more complex-like cells) than rotating gratings, in all tested settings (one-tailed Wilcoxon signed-rank test). This result confirms that the network’s invariance is linked to the stimulus’ phase and that model cells remain narrowly tuned to orientation. For p = MaxPool 2D and p = MaxPool 2D + 1D the distribution of χ and χ do not vary significantly as a function of the network size. When the network shows a functional topographic map, for p = MaxPool 2D + 2D, we can see a clear dependency between χ and the number of features in the first layer of the network M.(TIFF)Click here for additional data file.20 Nov 2021Dear Mr Angelo,Thank you very much for submitting your manuscript "Pooling in a predictive model of early visual processing explains functional and structural diversity in V1 across species." for consideration at PLOS Computational Biology.As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Saad JbabdiAssociate EditorPLOS Computational BiologyWolfgang EinhäuserDeputy EditorPLOS Computational Biology***********************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: Based on (partially their own) previous work, the authors extend the predictive coding framework in a hierarchical network setting towards non-linear (max) pooling. They show that simple and complex cells can emerge in the first and second layer, respectively, of a two-layer network. Moreover, depending on whether the pooling is over adjacent units in retinotopic space or in feature space, the simple cells are arranged in a salt and pepper, or in smooth orientation map-type layout, respectively.The paper for most parts is in good shape and the formalism and implementation seem sound. The main results seem interesting and important and I would like to understand them better. However, the explanations lack clarity at times. Below I point out some of the parts where I got stuck, hoping the authors can improve those parts, as this would surely facilitate the impact of their ideas.Fig. 3 and text: In case there is a smooth orientation map, do also complex cells form such a map? With preferred orientations matched to those of the simple cells?3.2 along with Fig. 3: What exactly is shown in the left column and how does this link to Fig. 1? Is it all layers, or a subset, or just one layer? If several layers, are the neurons whose weights are shown all located at the same position across layers?3.1.: Here the authors say convergence is the first important result, but then in the conclusions they argue that the problem is convex and convergence therefore guaranteed?Please explain better p_i^-1 after Eq. 5.Fig. 1: what does the cross in the circle mean? Why is the black down-pointing error targeting the neuron layers, and not a point between the Relu and p(gamma)? What would ‘a’ correspond to in this fig?Eq. 9: I don’t see the Hebb rule. Please make this more obvious.4.31 “..it only depends on the type of pooling function used.” I suppose, it does also depend on the input data, or not?The results seem interesting, but I still remain a bit confused when trying to appreciate the proposed mechanism for map formation. Could the author please provide a better intuition here? Why does max pooling in layer 2 across nearby feature channels give rise to a continuous variation in the preferred orientation across retinotopic space (in layer 2, I suppose)?Reviewer #2: The paper elaborates on the authors' earlier work on hierarchical sparse predictive coding, but here incorporating a pooling mechanism between layers. The model is shown to account for the structure of orientation maps in V1, and provides an account that ties this structure together with complex cells.Overall I find this paper very interesting. The model is novel and potentially has interesting properties that go beyond previous work. My main concern is the suitability for PLOS, as it is currently written. The computational neuroscience literature is by now fully saturated with models that claim to account for orientation maps and pinwheels in various ways. Against this backdrop, it is not very clear what this paper contributes to the story. On the other hand, it seems the motivation from the introduction is about invariance, and how to tie this idea into the sparse predictive coding framework, which is a very rich and interesting question begging for a good solution. The main result though is about orientation maps and pinwheels, so I am left scratching my head.I would encourage the authors to rethink the narrative of this paper. It also needs to be shortened. The writing in the intro and discussion is at times long-winded. It is difficult to follow the thread, and the main point of the paper and its central contribution gets lost. Focus on what is missing from the orientation map/pinwheel story, and how does this paper fill the gap? (assuming that is the story) If the story is about invariance and complex cells, then focus on what is missing from that and how his paper contributes. Or if the story is about the relation between the two, then make that the focus and explain it clearly.Specific comments:This work seems very similar in spirit to Hyvarinen & Hoyer's Topographic ICA, as the authors point out. Some of the differences are discussed, but I am having a hard time seeing a fundamental distinction, at least as it relates to orientation maps and pinwheels. In my mind that model does a remarkable job explaining these phenomena.Using a convolutional neural network seems implausible as a neurobiological model. In a convolutional model you are dictating the stride of the filters - i.e., how they tile space. By contrast in a fully connected model, the system must figure out not only how to tile space, but all the other feature dimensions such as orientation, spatial frequency, color, disparity, motion, etc. Presumably biology had to confront a similar problem, and the solution is what we see is in the various maps found in V1. Perhaps I do not fully understand how the convolutional model is being utilized, but I think the authors need to do a better job justifying this choice, as it seems to presume part of the structure that you are attempting to explain in the biology.The pooling operation is introduced in Section 2.1, but not defined until Section 2.3. I was very confused until I finally got there. And then, I don't understand why max pooling. Part of the justification seems to be that it is popular, but that is sociology. The other argument refers to winner take all, but is that right? Sparse coding already induces a competition - a kind of multiple winner take all. So if that is already happening, why do you need the extra max step in the pooling function? this did not seem well motivated.Also, wouldn't you want to learn the pooling function? Isn't there a way to incorporate that into your learning rule?Back to Section 2.1, the definition of g_i in eq. 3 comes before how it is used in eq. 4. I would recommend swapping the order of these.In Figure 1, panel b, the filters shown in the different layers of the convnet - along the M_s dimension - are all the same. Shouldn't they be different? this causes confusion.You p^-1 to denote the derivative of p and refer to it as an "unpooling" function. Is that right? I am confused by this. Is it because you can't take the derivative of p since it is a max pooling function? how would such a function be implemented?The second layer weights W_c - are they learned as well? what do they look like? I didn't seem them shown in any of the figures. It seems like this is a potentially very interesting aspect of the model that is not brought out.Other comments/impressions:Overall what interests me about this paper the most is the idea of incorporating invariance into a sparse predictive coding hierarchical model. So I am surprised that aspect is almost played down, and the paper instead focuses on orientation maps and pinwheels. Yes some animals have them, some don't. Presumably there is some topographic organization the system is pooling over, but that may not be overt when you look at the cortex. Just because cortex is organized as a two-dimensional sheet does not imply that we should restrict ourselves to 2D topographic models. For example the intrinsic structure may 4D, but packed into a 2D sheet and using the thickness of that sheet to interweave the other dimensions. It would seem to me that one of the more interesting predictions of this model would be to show what the W_c weights look like, and illustrate how the top-down connections from the second layer influence the first layer. That is truly novel, and there are few models that make concrete statements about this in terms of a normative representation principle. BTW hierarchical sparse coding was also the basis of the famous first deep network model out of Google brain that learned "cat neurons," which used a kind of topographic ICA, see https://ieeexplore.ieee.org/abstract/document/6639343 - although they did not have any feedback since it was not formulated as a generative model. Yours is, which makes it potentially more interesting.Some other related works to consider:The sparse manifold transform of Chen et al. (2018), and this SFN 2019 poster:"A geometric theory for complex cells" https://www.abstractsonline.com/pp8/#!/7883/presentation/44286provide another interpretation of pooling as part of a transformation to discover the underlying geometry in a sparse code. See also this related paper by Hosoya & Hyvarinen on learning the pooling function:Hosoya, H., & Hyvärinen, A. (2016). Learning visual spatial pooling by strong PCA dimension reduction. Neural computation, 28(7), 1249-1264.Finally, there is this related work by Osindero & HintonOsindero, S., Welling, M., & Hinton, G. E. (2006). Topographic product models applied to natural scene statistics. Neural Computation, 18(2), 381-414.Spelling/typos:therm --> term**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: NoneReviewer #2: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: Yes: Bruno A. OlshausenFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at .Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols22 Apr 2022Submitted filename: response_to_reviewers.pdfClick here for additional data file.1 Jun 2022Dear Mr Boutin,We are pleased to inform you that your manuscript 'Pooling strategies in V1 can account for the functional and structural diversity across species.' has been provisionally accepted for publication in PLOS Computational Biology.Please note the reviewer's remaining comments, which may help improve the clarity of the paper.Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.Best regards,Saad JbabdiAssociate EditorPLOS Computational BiologyWolfgang EinhäuserDeputy EditorPLOS Computational Biology***********************************************************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: The authors have addressed most of my previous concerns and the manuscript has improved in clarity. The effect seen in Fig. 3a is still a bit confusing to me. This figure part is never really explained anywhere in detail. What is the intuition? In the text, Fig. 3a is mentioned only briefly, and only after the figure parts b-d were discussed, which seems to be a strange ordering.Reviewer #2: The paper has been improved, but there remain many points of confusion. I would encourage the authors to put more work into making the paper clear, as is it seems rather sloppily put together, and that will not encourage the reader to take the work seriously.The equations and description of the model appear under the "Results" section - this seems an odd choice, because these things are not results, they are the authors specification of the model. So perhaps put it in its own section called "Model" or something like that.Equation 1 specifies a general N layer architecture, and then it is stated that for the rest of the paper N=2. Equation 2 uses the notation W_1 and W_2 etc. which is consistent with that. But then eqs. 3 and 4 slip back to the general N layer notation, which is confusing. Pick one and stick with it."max-pooling is not derivable" --> "max-pooling is not differentiable"Figure 1 panel (a) refers to the W_2 layer (here labeled W_c - note more confusion) as "complex cells." But it would seem that the output of the pooling units that feed into the W_2 layer would constitute what most people would consider complex cells. The W_2 layer which takes linear combinations of these would correspond more to "hypercomplex" cells (as hubel and wiesel called them). This whole treatment I find rather confusing and sloppy.Regarding the topographic organization - which seems to be one of the main points of the paper - it appears there are two forms to this organization: 1) the organization along the feature dimension, for example arranging pooling into sqrt(M)xsqrt(M) blocks, and 2) the pooling over the stride of the convolution, which by definition is over the spatial dimension. If you were a neurophysiologist probing this system, how would you know which one you are looking at? does it matter? This setup seems rather complicated and contrived to me.Overall I think the authors are doing something interesting here, but the presentation is messy and complicated and difficult to read. I would encourage the authors to put more work into making these things clear.**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: No:**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: No18 Jul 2022PCOMPBIOL-D-21-01332R1Pooling strategies in V1 can account for the functional and structural diversity across species.Dear Dr Boutin,I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!With kind regards,Agnes PapPLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol