
FP-nets as novel deep networks inspired by vision.

Philipp Grüning^1,2, Thomas Martinetz^1,3, Erhardt Barth^1,4.

Abstract

Feature-product networks (FP-nets) are inspired by end-stopped cortical cells with FP-units that multiply the outputs of two filters. We enhance state-of-the-art deep networks, such as the ResNet and MobileNet, with FP-units and show that the resulting FP-nets perform better on the Cifar-10 and ImageNet benchmarks. Moreover, we analyze the hyperselectivity of the FP-net model neurons and show that this property makes FP-nets less sensitive to adversarial attacks and JPEG artifacts. We then show that the learned model neurons are end-stopped to different degrees and that they provide sparse representations with an entropy that decreases with hyperselectivity.


Year: 2022 PMID: 35024759 PMCID: PMC8762712 DOI: 10.1167/jov.22.1.8

Source DB: PubMed Journal: J Vis ISSN: 1534-7362 Impact factor: 2.240

Introduction

For machine learning to work, one needs appropriate biases to constrain the solution for the problem at hand. Deep convolutional neural networks (CNNs), for example, are successful due to two constraints that specialize them relative to more general networks such as the multilayer perceptron (MLP): sparse connections and shared weights. It is well known that biases cannot be learned from the data or derived by logical deduction (Watanabe, 1985). In computer vision, appropriate biases can be obtained, as in the case of the CNNs, by studying biological vision (LeCun et al., 2015; Majaj & Pelli, 2018). Besides inspiring the use of localized (oriented) filters (the two CNN biases above) followed by a pointwise nonlinearity, biological vision can provide additional insight, an issue that currently receives somewhat limited attention in the deep-learning community (Majaj & Pelli, 2018; Paiton et al., 2020). We here focus on the principle of efficient coding (Barlow, 1961; Simoncelli & Olshausen, 2001) and the related neural phenomenon of end-stopping (Hubel & Wiesel, 1965). Statistical analysis shows that oriented linear filters reduce the entropy of natural images by encoding oriented straight patterns (one-dimensional [1D] regions) such as vertical and horizontal edges (Zetzsche et al., 1993). In cortical area V2, however, the majority of cells are end-stopped to different degrees (Hubel & Wiesel, 1965). End-stopped cells are thought to detect two-dimensional (2D) regions such as junctions and corners. Since 2D regions are unique and sparse in natural images (Barth & Watson, 2000; Mota & Barth, 2000; Zetzsche et al., 1993), they represent images efficiently, that is, with a high degree of sparseness and minimal information loss. A standard way of modeling end-stopped cells is to multiply outputs of orientation-selective cells, resulting in an AND-combination of simple-cell outputs (Zetzsche & Barth, 1990). 
For example, a corner can be detected by the logical combination of “horizontal edge AND vertical edge.” In Paiton et al. (2020), the authors argue convincingly that principles adopted from vision should be beneficial for deep networks and that the exploitation of multiplicative interactions between neurons has not been sufficiently explored in this specific context. There is, nevertheless, a vast literature on sigma-pi networks in general (e.g., Mel & Koch, 1990; Rumelhart et al., 1986), which is not surprising since such networks define a large class of possible systems. It has been shown that end-stopping can emerge from the principle of predictive coding based on recursive connections (Rao & Ballard, 1999); the latter has also been observed in Barth and Zetzsche (1998). Note that in Rao and Ballard (1999), end-stopping emerges based on unsupervised learning with natural images, whereas in our case it emerges from task-driven supervised learning in a natural vision task. Feature-product networks (FP-nets) implement a network architecture that contains explicit multiplications of the feature maps obtained with pairs of linear filters. The main feature of these networks is that they learn the appropriate filter pairs to be multiplied based on the task at hand. An early FP-net architecture has been presented as a preprint (Grüning et al., 2020b), and it has been shown in Grüning and Barth (2021) that a similar network can predict subjective image quality well. Of course, we do not assume that neurons would compute ideal multiplications; the AND terms could be created in alternative ways, for example, by using logarithms (Grüning et al., 2020b) or the minimum operation (Grüning & Barth, 2021a) instead of multiplications. 
AND terms could also be generated by traditional CNNs with linear filters followed by simple ReLU nonlinearities (Barth & Zetzsche, 1998), but this would require larger networks and would be limited in terms of the possible tuning properties of the resulting nonlinear functions (see also Paiton et al., 2020, regarding the limits of pointwise nonlinearities). Here, we present a novel FP-net architecture that is closer to vision models than the ones introduced previously in Grüning and Barth (2021b) and Grüning et al. (2020b). We first demonstrate its performance and then analyze the learned units by relating them to biological vision. Regarding the use of multiplicative terms in CNNs, Zoumpourlis et al. (2017) have shown that quadratic forms added to the first layer of a CNN can improve generalization. An FP-net can be interpreted as a special case of a network with an additional second-order Volterra kernel, but it has far fewer parameters. However, CNNs are also special cases of MLPs and, as we have argued above, the challenge is to find the right biases that can take us from the general to the more special case. For more comprehensive overviews on how FP-nets relate to various deep-network architectures, especially to bilinear CNNs (Li et al., 2017), see Grüning et al. (2020a) and Grüning and Barth (2021b). In addition, we would like to mention recent work of Chrysos et al. (2020), which illustrates that the Hadamard product of layers in deep networks and the resulting higher-order polynomial representation can improve classification performance. Finally, in recurrent networks, multiplications are used to implement useful gating mechanisms (Collins et al., 2016).

FP-nets as competitive deep networks

With FP-nets, we denote a deep-network architecture that contains one or several FP-blocks. Each block of a deep network implements a sequence of layers and operations that transforms an input tensor $T_0$ to an output tensor $T_4$. A tensor consists of a number (e.g., $d_{in}$, $d_{out}$) of feature maps, each with spatial width $w$ and height $h$ that may be altered by a factor $s$. The typical input tensor for a CNN is an image, the three color channels being the feature maps. The sequence of operations in an FP-block is shown in Figure 1 and consists of three steps: (a) a first linear combination, (b) the feature product, (c) a second linear combination.

In the first step, the feature maps of an input tensor $T_0$ are linearly combined, followed by a ReLU, to yield the tensor $T_1$ with $q\,d_{out}$ feature maps:

$$T_1[i, j, m] = \mathrm{ReLU}\Big(\sum_{n=1}^{d_{in}} w_n^m \, T_0[i, j, n]\Big) \qquad (1)$$

$T_1[i, j, m]$ is the value of $T_1$ at pixel position $(i, j)$ and feature map $m$; $w_n^m$ are learned weights and $q$ is an expansion factor that controls the block size. By $T_1^m$, we denote the $m$th feature map of $T_1$.

The second step is the computation of feature products, the centerpiece of the FP-block. Each feature map $T_1^m$ is convolved with two learned filters $v^m$ and $g^m$. Filtering is followed by instance normalization (IN) (Ulyanov et al., 2016) and ReLU nonlinearity yielding two new feature maps. Subsequently, the product of the two filter outputs is computed. For any particular image patch $x$, with the center pixel being $(i, j)$, of a particular feature map $T_1^m$, the filter operation for the vectorized image patch is the scalar product of the image patch with the vectorized filters $v^m$ and $g^m$:

$$T_2[i, j, m] = \mathrm{ReLU}\Big(\frac{x^T v^m - \mu_v^m}{\sigma_v^m}\Big) \cdot \mathrm{ReLU}\Big(\frac{x^T g^m - \mu_g^m}{\sigma_g^m}\Big) \qquad (2)$$

$T_2$ is the resulting tensor and $s$ the stride of the filter operation. If $s$ is greater than 1, $T_2$'s width and height are subsampled. $\mu$ and $\sigma$ are the mean value and standard deviation of $T_1^m$ after convolution with either $v^m$ or $g^m$:

$$\mu = \frac{1}{K} \sum_{k=1}^{K} y_k \qquad (3)$$

$$\sigma = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (y_k - \mu)^2} \qquad (4)$$

with $y_k$ being the $k$th pixel of the filter result and $K$ the number of pixels. In the third step, a second linear combination transforms $T_2$ into $T_3$. To comply with the baseline architectures ResNet and MobileNet, a residual connection defines the final output as:

$$T_4 = T_0 + T_3 \qquad (5)$$
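The feature-product step (b) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it handles a single feature map, uses stride 1 without padding, omits any learned affine parameters of the instance normalization, and the helper names (`conv2d_valid`, `instance_norm`, `feature_product`) as well as the two hand-picked edge filters are assumptions chosen for illustration.

```python
import math

def conv2d_valid(fmap, kernel):
    """Correlate a 2D feature map with a square kernel (stride 1, no padding)."""
    k = len(kernel)
    h, w = len(fmap), len(fmap[0])
    return [[sum(kernel[a][b] * fmap[i + a][j + b]
                 for a in range(k) for b in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)]

def instance_norm(fmap, eps=1e-5):
    """Subtract the map's mean and divide by its standard deviation."""
    values = [v for row in fmap for v in row]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    sigma = math.sqrt(var + eps)  # eps avoids division by zero (assumption)
    return [[(v - mu) / sigma for v in row] for row in fmap]

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def feature_product(fmap, filt_v, filt_g):
    """Two branches (filter -> IN -> ReLU), multiplied element-wise."""
    branch_v = relu(instance_norm(conv2d_valid(fmap, filt_v)))
    branch_g = relu(instance_norm(conv2d_valid(fmap, filt_g)))
    return [[a * b for a, b in zip(row_v, row_g)]
            for row_v, row_g in zip(branch_v, branch_g)]

# Toy input: a 7x7 map whose lower-right quadrant is bright (one corner).
image = [[1.0 if i >= 3 and j >= 3 else 0.0 for j in range(7)]
         for i in range(7)]

# Hand-picked (not learned) filters: one per edge orientation.
vertical_edge = [[-1.0, 0.0, 1.0]] * 3                  # left-to-right increase
horizontal_edge = [[-1.0] * 3, [0.0] * 3, [1.0] * 3]    # top-to-bottom increase

fp_map = feature_product(image, vertical_edge, horizontal_edge)
```

With these filters, `fp_map` is positive only for patches that contain the corner (for example `fp_map[2][2]`) and zero for patches that contain just a straight edge (for example `fp_map[4][1]`), which is the AND-like behavior the text attributes to the multiplication.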

Figure 1.

The structure of an FP-block is illustrated with rectangles and circles for the various operations applied to the input tensor, gradually transforming it into the output tensor. The first row within each rectangle denotes which operations are applied in sequence. In the second row, the number of feature maps is given, indicating how the number of input feature maps changes. The arrows in the figure indicate the inputs to the different operations and are labeled with the tensors defined in the equations (see text). Note that the output of the first linear combination is input to two different depthwise-separable convolutions (DWS, middle rectangles) that are learned. Convolutions are followed by instance normalization (IN) and ReLU nonlinearity, resulting in two different tensors. The feature-product tensor is the result of element-wise multiplication of these two tensors (see Equation 2). A second linear combination, depicted by the bottom rectangle, yields a further tensor. For the final output, a residual connection adds the input tensor to the result of the second linear combination (see Equation 5).

Using the above FP-block, we designed four different FP-nets based on different baseline architectures: (a) the original ResNet and (b) the PyrBlockNet, both trained on Cifar-10, and (c) a ResNet-50 and (d) a MobileNet-V2, both trained on ImageNet. A stack is a larger segment of the network, consisting of several blocks. Except for the first stack that may have a stride of 1, each new stack starts with a block with a stride of 2 that reduces the size of each feature map. Within a stack, all blocks operate on feature maps of the same size. Different network architectures may have different numbers and types of blocks. In our case, basic blocks, pyramid blocks, bottleneck blocks, and inverted residual blocks define the ResNet-Cifar, PyrBlockNet, ResNet-50, and MobileNet-V2 architecture, respectively. The block is the core module of an architecture and contains several layers. Layers are the smallest network building units such as convolution layers and max-pooling layers. Figure 2 shows an example of a ResNet-Cifar architecture that has three stacks with five blocks each. Each first block of the second and third stacks contains a convolution layer with stride 2 that downsamples the input. The two other architectures that we used are similar: The ResNet-50 has four stacks with varying numbers of bottleneck blocks. The MobileNet-V2 has six stacks consisting of inverted-residual blocks.

Figure 2.

Architecture of the ResNet-32 used on Cifar-10: The network contains three stacks with five blocks each. Each block contains several layers such as convolution layers with a kernel of size 3 × 3 pixels, batch normalization (BN) layers, and ReLU and Softmax nonlinearities. Convolution layers with a stride larger than 1 subsample the input, for example, from 32 × 32 pixels to 16 × 16 or 8 × 8 pixels. The number of feature maps can change within a block, for example, from 16 to 32 feature maps. The FP-net has the same baseline architecture, but each first block in a stack (colored in red) is replaced with an FP-block.

We transform the four baseline architectures defined above into FP-nets using a simple design rule: Substitute each stack's first block with an FP-block. The input and output dimensions of the block are kept equal; only the internal operations differ. We developed this design rule to improve upon already well-established architectures, making FP-nets practical since only a few changes need to be made to create an FP-net. To be compatible with state-of-the-art architectures, the FP-block has a structure similar to the MobileNet-V2 block (Sandler et al., 2018). We found that combinations of convolution blocks and FP-blocks work best and that larger kernel sizes do not improve performance. One way to view a stack is that it constitutes a visual processing chain for a specific image scale. One would expect end-stopping to be more useful at the beginning of this chain. Thus, we replaced the first block of each stack. Note, however, that later stacks, for example, the second and third stack in the Cifar-10 networks, already work with highly processed inputs coming from the previous stacks. Therefore, one would expect that there is a lower necessity of extracting 2D regions in later stacks. Indeed, we will show, when analyzing the angles between the learned filter pairs of FP-blocks, that highly selective neurons are more common in earlier stacks. 
We train and test several FP-nets on the two well-known benchmarks Cifar-10 (Krizhevsky et al., 2021) and ImageNet (Deng et al., 2009). Due to the moderate size of the data set, Cifar-10 is often used to evaluate the potential of new architectures and designs. For our experiments on this data set, we used ResNets (He et al., 2016) as baseline; see Figure 2 for an example. These networks have three stacks, each consisting of $N$ blocks. We evaluated two types of the ResNet-20, ResNet-32, ResNet-44, and ResNet-56, with $N = 3$, 5, 7, and 9 blocks, respectively (the numbers after the names indicate the number of convolution or linear layers). Since the first publication of the ResNet architecture, several additional blocks were proposed; see Han et al. (2017) for an overview. As two baselines on Cifar-10, we used the original ResNet and a variant using the pyramid block that we denote PyrBlockNet. For both variants, we created FP-nets by replacing baseline blocks with FP-blocks according to our design rule. We used the same number of blocks, but note that each FP-block contains one additional convolution layer. The FP-net-23, FP-net-35, FP-net-47, and FP-net-59 are based on the PyrBlockNet: Each stack's first block is an FP-block, and all other blocks are pyramid blocks. Analogously, FP-net (basic) denotes an FP-net based on the original ResNet: Each stack's first block is an FP-block, and the remaining blocks are basic blocks.

Next, we evaluated the performance of FP-nets with the larger ImageNet data set that contains over 1.2 million training examples and 50,000 validation examples (we tested on the publicly available validation set). With an input size of at least 224 × 224 pixels and 1,000 classes, ImageNet poses a greater challenge than Cifar-10. We compared the ResNet-50 to two FP-net-50: one smaller net with an expansion factor $q = 0.8$ and a slightly larger network with $q = 1$. In both cases, for each of the four stacks of the ResNet-50, the first block was replaced by an FP-block to obtain the FP-net-50. Note that, if not explicitly mentioned, the term FP-net-50 refers to the $q = 1$ variant. To further validate our approach, we evaluated an FP-net based on the popular MobileNet-V2 architecture. As with the ResNet, we replaced the first block of each stack with an FP-block.

The results of the Cifar-10 experiments are shown in Figure 3: The left side compares the original ResNet to the FP-net (basic), and the right side compares the PyrBlockNet to the FP-net. Each point of the two curves shows the best possible test error occurring over all training epochs averaged over five runs and for one particular network (i.e., one particular number of blocks). The black line shows the baseline network, the green line the resulting FP-net when substituting the first blocks of the baseline's stacks. The x-axis displays the number of parameters, a number that increases with the number of blocks. Note, however, that the inclusion of FP-blocks reduces the number of parameters. Overall, the FP-nets are more compact and perform better with a lower test error and only a small overlap in the standard deviations.
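The design rule and the naming of the Cifar-10 networks described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the layer counts rest on the assumption that a baseline block (basic or pyramid) contributes two convolution layers, an FP-block contributes three (one additional convolution layer), and the network has one initial convolution plus one final linear layer.

```python
def baseline_layout(n_blocks_per_stack, n_stacks=3, block="basic"):
    """Baseline network: every stack holds N blocks of the same type."""
    return [[block] * n_blocks_per_stack for _ in range(n_stacks)]

def to_fp_net(layout):
    """Design rule: substitute each stack's first block with an FP-block."""
    return [["fp"] + stack[1:] for stack in layout]

def count_layers(layout):
    """Count convolution and linear layers under the stated assumptions."""
    layers_per_block = {"fp": 3}        # assumed: one more conv than baseline
    default_layers = 2                  # assumed: two convs per baseline block
    stem_and_classifier = 2             # first convolution + final linear layer
    return stem_and_classifier + sum(
        layers_per_block.get(block, default_layers)
        for stack in layout for block in stack)

baseline_depths = [count_layers(baseline_layout(n)) for n in (3, 5, 7, 9)]
fp_depths = [count_layers(to_fp_net(baseline_layout(n, block="pyramid")))
             for n in (3, 5, 7, 9)]
```

Under these assumptions, `baseline_depths` reproduces the ResNet-20, -32, -44, and -56 names and `fp_depths` reproduces FP-net-23, -35, -47, and -59, which suggests the counting convention is consistent with the text.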

Figure 3.

The y-axis displays the best test score on Cifar-10 averaged over five runs, and the bars indicate the standard deviations. The transparent area indicates the range from the minimum to the maximum. Each diamond represents one network having a specific number of parameters (in thousands) denoted on the x-axis. On the left, the black solid line shows the baseline ResNet results with 20, 32, 44, and 56 layers, and the green solid line the results for the corresponding FP-nets (basic). On the right, the black solid line shows the baseline PyrBlockNet and the green solid line the results for the FP-nets. Substituting each stack's first block with an FP-block yielded, in all but one case, a significantly better performance with a reduced number of parameters.

Table 1 shows the results on ImageNet. Note that the FP-net-50 ($q = 1$) performs better than the baseline ResNet-50, and the validation error is reduced by almost 0.4 (from 23.61 to 23.24). When considering the already compact MobileNet architecture, the FP-net performs better than the MobileNet with an error decreased by about 0.2 (from 28.71 to 28.53). We trained the MobileNet-V2 baseline network ourselves to obtain its validation error. For the ResNet-50, we report the value from the Tensorpack repository (Wu, 2016). The performance depending on the number of parameters for the ResNet and FP-variants is illustrated in Figure 4.

Table 1.

ImageNet validation errors for different FP-nets and baselines: We transformed two baseline network architectures, the ResNet-50 and the MobileNet-V2, into FP-nets, here denoted as FP-net-50 and FP-MobileNet. The transformations are done by substituting specific blocks of the baseline networks with an FP-block (see text). Additionally, by choosing different expansion factors $q$, we created one FP-net that is smaller than the baseline ($q = 0.8$) and one larger network ($q = 1$). Note that FP-nets perform better than the baseline models if there is only a slight increase in the number of parameters (shown in millions).

Model	No. of parameters (M)	Error
ResNet-50 (baseline)	25.6	23.61
FP-net-50 (q=0.8)	24.3	23.80
FP-net-50 (q=1)	26.0	23.24
MobileNet-V2 (baseline)	3.5	28.71
FP-MobileNet	3.5	28.53
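The differences quoted in the text can be checked directly against the rows of Table 1. The snippet below only redoes this arithmetic; the dictionary layout and the helper name `error_reduction` are illustrative choices, and the numbers are copied from the table.

```python
# (parameters in millions, validation error) copied from Table 1
table_1 = {
    "ResNet-50": (25.6, 23.61),
    "FP-net-50 (q=0.8)": (24.3, 23.80),
    "FP-net-50 (q=1)": (26.0, 23.24),
    "MobileNet-V2": (3.5, 28.71),
    "FP-MobileNet": (3.5, 28.53),
}

def error_reduction(model, baseline):
    """Baseline error minus model error (positive means the model is better)."""
    return round(table_1[baseline][1] - table_1[model][1], 2)

resnet_gain_q1 = error_reduction("FP-net-50 (q=1)", "ResNet-50")
resnet_gain_q08 = error_reduction("FP-net-50 (q=0.8)", "ResNet-50")
mobilenet_gain = error_reduction("FP-MobileNet", "MobileNet-V2")
```

The larger FP-net-50 improves on the ResNet-50 by 0.37, the FP-MobileNet improves on the MobileNet-V2 by 0.18, and the smaller FP-net-50 ($q = 0.8$) is 0.19 worse than its baseline while using fewer parameters.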

Figure 4.

Number of parameters vs. ImageNet validation error for the ResNet-50 (black diamond) and two FP-nets (green dots) with different expansion factors $q$.


FP-nets and visual coding

Hyperselectivity of FP-units

Vilankar and Field (2017) used the term hyperselectivity to quantify how strongly a neuron is tuned to its optimal stimulus, that is, how quickly the response drops when the optimal stimulus changes. In the context of deep learning, hyperselectivity is relevant because it can increase robustness, for example, robustness against adversarial attacks (Paiton et al., 2020). One way to quantify hyperselectivity is to measure the curvature of iso-response contours. Given an $n$-dimensional input $x$ to a function $f$, an $(n-1)$-dimensional surface may exist such that for all points on the surface, the output $f(x)$ is a constant. As $n$ can be a high dimension, 2D projections are used to analyze such iso-surfaces, which in two dimensions become iso-response contours.

The typical linear-nonlinear (LN) model neuron used in CNNs is a function $f(x) = \varphi(w^T x)$ that involves a linear projection on a weight vector $w$ followed by a pointwise nonlinearity $\varphi$. To analyze the iso-response contour of such a neuron, one first projects the input on $w$, the axis corresponding to the optimal stimulus $x_{opt}$. To find a second axis, one searches for a vector $p$ orthogonal to $w$, for example, by picking random values and using the Gram–Schmidt process (see Equation 16) to transform the random vector to one that is orthogonal to $w$. When looking at the output of an LN-neuron for $x_{opt}$ perturbed by any orthogonal vector $p$ with $w^T p = 0$, the iso-response contour is always a straight line parallel to $p$, because $f(x_{opt} + p) = \varphi(w^T x_{opt} + w^T p) = f(x_{opt})$. Thus, for LN-neurons, the iso-response contours have zero curvature. For hyperselective neurons, there exist vectors $p$ that are orthogonal to $x_{opt}$ and decrease the neuron's optimal response such that $f(x_{opt} + p) < f(x_{opt})$. In this case, the exo-origin iso-response contour bends away from the origin of the basis defined by $x_{opt}$ and $p$. A higher curvature of this bend indicates a more significant activation dropoff in regions that are different from the optimal stimulus (i.e., a greater hyperselectivity). One way to quantify the curvature is to use the coefficient of the quadratic term obtained by fitting a second-order polynomial to the iso-response contour.

FP-nets contain FP-blocks that consist of FP-units, or FP-neurons, which yield the feature-product output for a pixel in a feature map as defined by Equation 2. As shown in the Appendix, FP-neurons exhibit curved exo-origin iso-response contours with a curvature that depends on the angle between the two filters. Iso-response contours are shown in Figure 5 for different values of this angle. Note that curvature, and thus hyperselectivity, increases with the angle. Accordingly, a large angle leads to a lower entropy of the resulting feature maps; see Figure 6.
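A small numerical sketch can make the difference between LN-neurons and FP-neurons concrete. It assumes a two-dimensional input, unit-norm filters placed symmetrically around the horizontal axis, and ReLU as the pointwise nonlinearity; the function names and the perturbation size `t` are hypothetical and not taken from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonalize(r, w):
    """One Gram-Schmidt step: remove from r its component along w."""
    scale = dot(r, w) / dot(w, w)
    return [ri - scale * wi for ri, wi in zip(r, w)]

def ln_neuron(x, w):
    """Linear projection followed by a ReLU."""
    return max(0.0, dot(x, w))

def fp_neuron(x, v, g):
    """Product of two rectified filter responses."""
    return max(0.0, dot(x, v)) * max(0.0, dot(x, g))

def filter_pair(angle):
    """Two unit filters separated by `angle` (radians), symmetric about x."""
    half = angle / 2.0
    return (math.cos(half), math.sin(half)), (math.cos(half), -math.sin(half))

def fp_response_drop(angle, t=0.5):
    """Drop of the FP response when the optimal stimulus (1, 0) is
    perturbed by t along the orthogonal direction (0, 1)."""
    v, g = filter_pair(angle)
    return fp_neuron((1.0, 0.0), v, g) - fp_neuron((1.0, t), v, g)

# LN-neuron: an orthogonal perturbation leaves the output unchanged.
w = (1.0, 0.0)
ln_drop = ln_neuron((1.0, 0.0), w) - ln_neuron((1.0, 0.5), w)

# FP-neuron: the drop grows with the angle between the two filters.
drops = [fp_response_drop(a) for a in (0.0, math.pi / 3, math.pi / 2)]

# Gram-Schmidt example in three dimensions.
p = orthogonalize([0.3, -1.2, 0.7], [1.0, 2.0, -1.0])
```

For this symmetric setup the drop equals $t^2 \sin^2(\text{angle}/2)$ as long as both projections stay positive, so it vanishes for an angle of zero (the FP-neuron then behaves like an LN-neuron) and increases with the angle.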

Figure 5.

Iso-response contour plots for different values of the angle between the two filters of an FP-neuron. Each plot shows values that were determined by using Equation 23; furthermore, normalization and quantization to six bins were applied. The horizontal axis points in the direction of the optimal stimulus, and the vertical axis is orthogonal to the optimal stimulus; both axes are indexed by the corresponding values in Equation 23. The black lines indicate the zero contour.

Figure 6.

Scatterplots of entropy over hyperselectivity (indicated by the angle between the two filters of an FP-neuron). Each dot corresponds to an FP-neuron. The color codes indicate the position of each neuron in the network (i.e., the number of convolution layers). The entropy of a particular FP-neuron's feature map is estimated as described in the Appendix and plotted against this angle. The left panel shows results for the FP-net-50 trained on ImageNet after 2, 11, 23, and 41 convolution layers, and the right panel shows results for the FP-net-59 trained on Cifar-10 after 2, 21, and 40 convolution layers. Note the correlation between entropy and the angle. Hyperselectivity is directly linked to the angle, as illustrated in Figure 5.

Entropy and degree of end-stopping

To further support the view that FP-neurons are hyperselective depending on the angle between their two filters, we analyzed the entropy of the feature maps generated by different FP-neurons. The results in Figure 6 show that the learned filter pairs tend to have an angle larger than zero, that is, the majority of FP-neurons are hyperselective, and that a large angle leads to a lower entropy. Details of how the entropy is computed are given in the Appendix. In order to analyze the end-stopping behavior of the model neurons that are learned in the FP-nets trained on Cifar-10 and ImageNet, we needed to quantify the degree of end-stopping. In order to relate to physiological measurements, we started by analyzing the response of FP-neurons to straight lines and line ends, but this turned out to be problematic because the FP-nets use small filters and subsample the input. To keep the analogy, but with a more robust measure, we used a square as input and quantified the average responses to the uniform zero-dimensional (0D) regions, the straight 1D edges, and the 2D corners. The degree of end-stopping is then defined by the relation between 1D and 2D responses. In order to account for ON/OFF-type responses, we used both a bright and a dark square. The results are shown in Figure 7, and the details of the algorithm are given in the Appendix.
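The idea of probing a unit with a square and averaging its responses over 0D, 1D, and 2D regions can be sketched as follows. This is not the algorithm from the Appendix: the region labels are derived here from central differences, the two toy units are hand-built rather than learned, and all names (`make_square`, `central_diffs`, `region_means`) are hypothetical.

```python
def make_square(size=12, lo=3, hi=8, bright=True):
    """Square test image; bright-on-dark or dark-on-bright (ON/OFF)."""
    inside, outside = (1.0, 0.0) if bright else (0.0, 1.0)
    return [[inside if lo <= i <= hi and lo <= j <= hi else outside
             for j in range(size)] for i in range(size)]

def central_diffs(img):
    """Horizontal (dx) and vertical (dy) central differences."""
    n = len(img)
    dx = [[0.0] * n for _ in range(n)]
    dy = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            dx[i][j] = img[i][j + 1] - img[i][j - 1]
            dy[i][j] = img[i + 1][j] - img[i - 1][j]
    return dx, dy

def region_means(response, dx, dy):
    """Average response over 0D (uniform), 1D (edge), and 2D (corner) pixels."""
    sums = {"0D": 0.0, "1D": 0.0, "2D": 0.0}
    counts = {"0D": 0, "1D": 0, "2D": 0}
    for i, row in enumerate(response):
        for j, value in enumerate(row):
            n_dirs = int(dx[i][j] != 0) + int(dy[i][j] != 0)
            key = ("0D", "1D", "2D")[n_dirs]
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def relu(x):
    return max(0.0, x)

image = make_square(bright=True)
dx, dy = central_diffs(image)

# Toy units: a rectified edge detector and a product of two such detectors.
edge_unit = [[relu(a) for a in row] for row in dx]
fp_unit = [[relu(a) * relu(b) for a, b in zip(row_x, row_y)]
           for row_x, row_y in zip(dx, dy)]

edge_means = region_means(edge_unit, dx, dy)
fp_means = region_means(fp_unit, dx, dy)
```

In this toy setting the edge unit responds in both the 1D and the 2D regions, whereas the product unit responds only in the 2D region. Repeating the measurement with `make_square(bright=False)` would cover the dark square mentioned in the text.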

Figure 7.

Distribution of neurons plotted over the degree of end-stopping. Distributions are shown for the first block of the first stack for different models. Left: the activation of the first convolution, after batch normalization and ReLU, of a pyramid block in a PyrBlockNet (nine blocks per stack). Middle: the FP-neurons of an FP-block for an FP-net trained on Cifar-10 (nine blocks per stack). Right: the FP-neurons of an FP-block for the FP-net-50 trained on ImageNet. Blue bars show normalized histograms for the ratio that quantifies the relation between responses to straight edges (1D) and corners (2D); see Appendix. Neurons that respond to 0D regions (the center of a square) are excluded from the blue histogram and shown separately as orange bars. Neurons that do not respond at all (0D, 1D, and 2D responses are all zero) are also excluded from the blue histogram and are shown as green bars.

Note that, like the real neurons in cortical areas V1 and V2, the model neurons in the FP-net are end-stopped to different degrees. Thus, end-stopping seems to be beneficial for both the ImageNet and Cifar-10 tasks, since the emergence of end-stopping is here driven by the classification error. As expected, the multiplication in the FP-block shifts the distribution toward a higher degree of end-stopping. However, the network could have learned filter pairs that do not lead to end-stopped FP-neurons. The bias that we introduce (i.e., the multiplication) just makes it easier for the network to learn end-stopped representations. The angle distributions in Figure 8 show that linear FP-neurons are indeed learned as well, since more than 15% of FP-neurons have an angle near zero. With increasing network depth, the number of linear FP-neurons increases, indicating that hyperselectivity and especially end-stopping are more frequent in earlier stages of the visual processing chain.

Figure 8.

Distribution of FP-neurons as a function of hyperselectivity (indicated by the angle between the two filters of an FP-neuron) and for different positions in the network. Note that the majority of neurons are hyperselective to different degrees and that hyperselectivity is reduced later in the network. The left panel shows results for the FP-net-50 trained on ImageNet after 2, 11, 23, and 41 convolution layers. The right panel shows results for the FP-net-59 trained on Cifar-10 after 2, 21, and 40 convolution layers.

FP-neurons are more robust against adversarial attacks

Although outperforming almost all alternative approaches on many vision tasks, CNNs are surprisingly sensitive to barely visible perturbations of the input images (Szegedy et al., 2013). An adversarial attack on a classifier function $f$ adds a noise pattern $\delta$ to an input image $x$ so that $f(x + \delta)$ does not return the correct class $y$. Furthermore, the attacker ensures that some $p$-norm of $\delta$ does not exceed $\epsilon$. In many cases, including this work, the infinity-norm is chosen, and the $\epsilon$ values are in the set $\{1/255, 2/255, 4/255, 8/255, 16/255\}$. Thus, for example, for $\epsilon = 1/255$, each 8-bit pixel value is at most altered by adding or subtracting the value 1.

Goodfellow et al. (2014) argue that the main reason for the sensitivity to adversarial examples is the linearity of CNNs: With a high-dimensional input, one can substantially change a linear neuron's output, even with small perturbations. Consider the output of an LN-neuron for an input $x$ with dimension $n$ perturbed by $\delta$. We choose $\delta$ to be the sign function of the weight vector $w$ multiplied with $\epsilon$: $\delta = \epsilon \, \mathrm{sign}(w)$. Thus, $\delta$ roughly points in the direction of the optimal stimulus (which is also the gradient), but its infinity-norm does not exceed $\epsilon$. Assuming that the mean absolute value of $w$ is $m$, $w^T \delta$ is approximately equal to $\epsilon m n$. Accordingly, a significant change of the LN-neuron's output can be achieved by a small value $\epsilon$ if the input dimension $n$ is large, which is the case for many vision-related tasks. This gradient-ascent method can also be applied to nonlinear neurons. Within a local region, the output of almost any function can be approximated by a linear function. To optimally increase the output, the input needs to be moved along the gradient direction. The fast gradient sign method (FGSM; Goodfellow et al., 2014) perturbs the original input image by adding $\delta = \epsilon \, \mathrm{sign}(\nabla_x J(x, y))$, where $J$ is the classification loss. Another approach is to define $\delta$ to be the gradient times a positive step size $\alpha$ followed by clipping to $[-\epsilon, \epsilon]$. The clipped iterative gradient ascent (CIGA) greedily moves along the direction of the highest linear increase,

$$\tilde{\delta}^{(i)} = \delta^{(i-1)} + \alpha \, \nabla f\big(x_0 + \delta^{(i-1)}\big), \qquad \delta^{(i)}_j = \min\big(\epsilon, \max(-\epsilon, \tilde{\delta}^{(i)}_j)\big) \qquad (6)$$

with $\tilde{\delta}^{(i)}_j$ being the $j$th entry of the unbounded result at the $i$th iteration step. In the following, we use CIGA in our illustrations of the principle, and in our experiments, we employ FGSM as it is a widely recognized adversarial attack method.

When regarding an iso-response contour plot, one can easily spot the direction of the gradient, which is orthogonal to an iso-response contour (Paiton et al., 2020). In Figure 9 on the left, the gradient for an LN-neuron is parallel to the optimal stimulus (black line). As long as the initial input $x_0$ yields a nonzero gradient, each step of CIGA maximally increases the LN-neuron output. Thus, the algorithm's effectiveness is only bounded by $\epsilon$ but widely independent of the initial input $x_0$. For a sufficiently large step size, CIGA finds the optimal solution in one step. We now investigate the effects of CIGA on a simplified version of an FP-neuron with the filter pair $v$ and $g$:

$$F(x) = (x^T v)(x^T g) \qquad (7)$$

Note that in the following particular example, the input is chosen to yield nonnegative projections on $v$ and $g$; thus, we can remove the ReLUs. The resulting gradient is

$$\nabla F(x) = (x^T g)\, v + (x^T v)\, g \qquad (8)$$

The effectiveness of an iteration step strongly depends on the current position. The highest possible increase would be obtained along the line defined by the optimal stimulus. In Figure 9 on the right, this is the black line. If the initial input is located on this line, any step in the gradient direction yields an optimal increase of the FP-neuron output. However, for any other position with a nonzero gradient, an unbounded iteration step would move toward the optimal stimulus line. The blue curve in Figure 9 shows the path for several iterations of CIGA: Starting above the optimal stimulus line, each step slowly converges to the optimal stimulus line, eventually moving almost parallel to it. Once the threshold of 1 is reached in the horizontal dimension, the (now bounded) path runs parallel to the vertical dimension to increase the neuron output further. The optimal solution is found once the bound is also reached in the vertical dimension. The important difference when comparing with LN-neurons is that there are numerous conditions (depending on the two filters, the initial input, and the step size) where CIGA would need several steps to find an optimal solution. This reduced effectiveness of the gradient ascent illustrates why hyperselective neurons are more robust against adversarial attacks; for example, if the step size is too small, or the initial input is chosen poorly, or with too few iterations, an attack might not increase the FP-neuron output by much. Note that single neurons are usually not the target of adversarial attacks; instead, the gradient is determined on the classification loss function. Still, the argument holds that hyperselective neurons are harder to activate than LN-neurons, resulting in an increased robustness.
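The argument above can be illustrated numerically with a minimal sketch. It is not the experiment behind Figure 9: the weight vector `W`, the filter pair `V` and `G`, the two initial inputs, and the step size are made-up values, and the helper names are hypothetical. The first part checks the sign perturbation for a linear unit; the second compares the gain of one clipped gradient-ascent step for an LN-neuron and for the simplified FP-neuron without ReLUs.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign(value):
    return (value > 0) - (value < 0)

def sign_perturbation(w, eps):
    """Perturbation eps * sign(w) for a linear unit with weights w."""
    return [eps * sign(wi) for wi in w]

def ciga(grad_fn, x0, eps, step, n_iter):
    """Clipped iterative gradient ascent on the perturbation delta."""
    delta = [0.0] * len(x0)
    for _ in range(n_iter):
        x = [xi + di for xi, di in zip(x0, delta)]
        grad = grad_fn(x)
        delta = [max(-eps, min(eps, di + step * gi))
                 for di, gi in zip(delta, grad)]
    return delta

# Hypothetical 2D units.
W = (1.0, 0.0)
V = (0.5 ** 0.5, 0.5 ** 0.5)
G = (0.5 ** 0.5, -(0.5 ** 0.5))

def ln_unit(x):
    return max(0.0, dot(W, x))

def ln_grad(x):
    return list(W) if dot(W, x) > 0 else [0.0, 0.0]

def fp_unit(x):
    return dot(x, V) * dot(x, G)

def fp_grad(x):
    proj_v, proj_g = dot(x, V), dot(x, G)
    return [proj_g * vi + proj_v * gi for vi, gi in zip(V, G)]

def gain_after_one_step(unit, grad_fn, x0, eps=1.0, step=0.1):
    delta = ciga(grad_fn, x0, eps, step, n_iter=1)
    x1 = [xi + di for xi, di in zip(x0, delta)]
    return unit(x1) - unit(x0)

# Linear unit: the output change equals eps * n * m (n = 4, m = 0.5 here).
w_example = [0.5, -0.5, 0.25, -0.75]
linear_change = dot(w_example, sign_perturbation(w_example, eps=0.1))

strong_input, weak_input = (1.0, 0.5), (0.2, 0.1)
ln_gains = [gain_after_one_step(ln_unit, ln_grad, x)
            for x in (strong_input, weak_input)]
fp_gains = [gain_after_one_step(fp_unit, fp_grad, x)
            for x in (strong_input, weak_input)]
```

In this toy setting the LN-neuron gains the same amount from one step regardless of the starting point, while the FP-neuron's gain depends strongly on where the step starts, in line with the statement that the effectiveness of an iteration step depends on the current position.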

Figure 9.

Iso-response contours and iteration path of CIGA for (left) an LN-neuron and (right) an FP-neuron: The LN-neuron is shown in the basis given by its weight vector and an orthogonal vector; the FP-neuron is defined by its filter pair. The black lines point in the directions of the respective optimal stimulus. The blue dashed line shows the iteration path of the CIGA (see Equation 6). All other colored solid lines show iso-response contours; the number on each line shows the function value of the contour. For each neuron, CIGA aims to find a perturbation with a bounded infinity-norm that maximally increases the output. Both runs start from a fixed initial input and use a fixed step size, and a total of 10,000 iterations were computed. CIGA quickly finds an optimal solution for the LN-neuron since any step along the positive gradient (parallel to the optimal stimulus, orthogonal to the iso-response contours) optimally increases the function value. For the FP-neuron, the iteration path first moves toward the optimal stimulus, then almost parallel to it, and finally moves upward once the bound on the perturbation is reached along the horizontal axis. This longer, more complex optimization path shows that CIGA is less effective for a hyperselective FP-neuron, indicating that FP-neurons are more robust against adversarial examples.

To test this hypothesis, we created new Cifar-10 test sets derived from the original test set $S$. Here, we focused on the most subtle adversarial attacks: we created one test set where each test image was perturbed by using FGSM with $\epsilon = 1/255$. Results for larger $\epsilon$-values are shown in the Appendix (see Table 2 and Table 3). To exclude the hypothesis that the better accuracy (with perturbations) is due to the fact that the FP-nets already generalize better, we present results where we measure the percentage of changed predictions of the classifier $f$:

$$C(\epsilon) = \frac{100}{|S|} \sum_{x \in S} \mathbb{1}\big[f(x) \neq f(P(x, \epsilon))\big]$$

$\mathbb{1}[\cdot]$ is the indicator function returning a 1 for a true statement and a zero otherwise. $P$ is some function (here, FGSM) that perturbs the original image based on some parameter $\epsilon$. 
We evaluated this metric for each of the four architectures that we trained on the original Cifar-10 training set (see Section “FP-nets as competitive deep networks”); no additional adversarial training scheme was employed. As shown in Figure 10, 40% to 50% of the predictions did change. However, for both baseline models, substituting some of the LN-neurons with FP-neurons increased the robustness against FGSM attacks.
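As a rough illustration of the evaluation described above, the following sketch applies FGSM to a toy linear softmax classifier and computes the percentage of changed predictions alongside the error rate. The classifier, the random data, the choice of ε, and all function names are assumptions made for the example; the networks and data pipeline of the article are not reproduced.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict(x, W, b):
    """Class prediction of a toy linear classifier (stand-in for a trained network)."""
    return np.argmax(x @ W + b, axis=1)

def fgsm(x, y, W, b, eps):
    """FGSM step: x + eps * sign(gradient of the cross-entropy loss w.r.t. x)."""
    p = softmax(x @ W + b)
    p[np.arange(len(y)), y] -= 1.0        # d(loss)/d(logits) = softmax - one_hot(y)
    grad_x = p @ W.T                      # chain rule through the linear layer
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)  # keep pixels in [0, 1]

def percent_changed(pred_clean, pred_perturbed):
    """Percentage of inputs whose predicted class differs after perturbation."""
    return 100.0 * np.mean(pred_clean != pred_perturbed)

def percent_error(pred, labels):
    """Percentage of inputs whose predicted class differs from the label."""
    return 100.0 * np.mean(pred != labels)

rng = np.random.default_rng(0)
x = rng.random((200, 12))                 # hypothetical "images" with values in [0, 1]
W, b = rng.normal(size=(12, 3)), np.zeros(3)
y = predict(x, W, b)                      # toy labels: the classifier's own clean predictions

eps = 8 / 255
x_adv = fgsm(x, y, W, b, eps)
pred_clean, pred_adv = predict(x, W, b), predict(x_adv, W, b)
print("changed predictions (%):", percent_changed(pred_clean, pred_adv))
print("error on perturbed set (%):", percent_error(pred_adv, y))
```

Unlike the error rate, the changed-prediction percentage ignores the labels. That makes it insensitive to how accurate the classifier was before the perturbation, which is why the text uses it.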

Table 2.

Robust error values, in percentages, when using FGSM perturbations for all Cifar-10 models and different ε values. We report the mean values averaged over five different training runs.

Model; ε=	1/255	2/255	4/255	8/255	16/255
FP-net (N=3)	58.460	72.766	82.036	86.180	86.444
FP-net (basic) (N=3)	56.662	73.634	82.296	86.406	89.108
PyrBlockNet (N=3)	60.458	75.566	83.992	88.150	89.762
ResNet (N=3)	57.284	71.912	80.378	84.338	88.016
FP-net (N=5)	53.634	70.694	81.088	85.020	88.152
FP-net (basic) (N=5)	51.112	68.510	79.036	84.034	86.830
PyrBlockNet (N=5)	54.620	70.018	79.748	84.392	87.666
ResNet (N=5)	52.434	69.652	79.678	84.166	86.956
FP-net (N=7)	51.082	67.468	78.728	83.510	86.814
FP-net (basic) (N=7)	48.360	66.104	77.236	82.486	86.256
PyrBlockNet (N=7)	53.244	69.966	80.764	85.880	88.662
ResNet (N=7)	49.788	67.852	78.590	83.290	86.544
FP-net (N=9)	49.226	66.902	78.634	83.524	87.216
FP-net (basic) (N=9)	45.896	64.284	76.076	81.652	85.484
PyrBlockNet (N=9)	52.048	68.442	78.932	84.590	87.620
ResNet (N=9)	47.888	66.056	77.066	81.992	86.410

Table 3.

Percentage of changed predictions when using the FGSM perturbations for all Cifar-10 models and different ε values.

Model; ε=	1/255	2/255	4/255	8/255	16/255
FP-net (N = 3)	50.766	65.140	74.596	79.320	82.030
FP-net (basic) (N = 3)	48.858	65.928	74.906	80.128	85.654
PyrBlockNet (N = 3)	52.560	67.818	76.752	82.592	87.562
ResNet (N = 3)	49.526	64.228	73.038	78.480	85.148
FP-net (N = 5)	47.030	64.132	74.858	80.258	86.418
FP-net (basic) (N = 5)	44.482	61.960	72.662	78.526	83.700
PyrBlockNet (N = 5)	47.808	63.416	73.736	80.064	85.782
ResNet (N = 5)	44.996	62.298	72.546	78.148	83.916
FP-net (N = 7)	44.968	61.452	73.000	78.932	84.772
FP-net (basic) (N = 7)	42.102	59.922	71.294	77.402	83.376
PyrBlockNet (N = 7)	47.054	63.978	75.342	81.884	86.844
ResNet (N = 7)	42.820	60.948	71.898	77.538	83.390
FP-net (N = 9)	43.536	61.314	73.444	79.506	85.424
FP-net (basic) (N = 9)	39.804	58.240	70.196	76.448	82.376
PyrBlockNet (N = 9)	45.988	62.592	73.590	80.620	85.988
ResNet (N = 9)	41.206	59.426	70.662	76.534	83.808

Figure 10.

Percentage of changed predictions (see Equation 9) on the adversarial example test set. A lower value indicates that a network is more robust against attacks created by the FGSM.

The results reiterate that CNN predictions can be significantly altered by deliberate and subtle attacks (we show some example images in the Appendix). Unfortunately, this lack of robustness creates problems of practical relevance beyond such attacks. For example, JPEG compression can create artifacts that have similar effects. To evaluate robustness against JPEG artifacts, we created Cifar-10 test sets in which each image is the JPEG-compressed version of the original image at a given quality rate Q, with Q = 100 corresponding to the original image. A low quality indicates a high compression with stronger artifacts (example images are given in the Appendix). In Figure 11, we show the results for the low-compression test set; further results are given in the Appendix (see Tables 4 and 5).
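A JPEG-compressed test image of the kind described above could be produced as in the following sketch. It assumes the Pillow library and uses a random 32 × 32 RGB image as a stand-in for a Cifar-10 image; the helper name `jpeg_compress` and the chosen quality values are illustrative, and the article does not state which tool was used.

```python
import io
import random
from PIL import Image

def jpeg_compress(img, quality):
    """Encode img as JPEG in memory at the given quality and decode it again."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    n_bytes = len(buf.getvalue())          # size of the encoded file
    buf.seek(0)
    decoded = Image.open(buf).convert("RGB")  # convert() forces the pixel data to load
    return decoded, n_bytes

# Hypothetical stand-in for one 32x32 Cifar-10 image: seeded random pixels.
rng = random.Random(0)
pixels = bytes(rng.randrange(256) for _ in range(32 * 32 * 3))
original = Image.frombytes("RGB", (32, 32), pixels)

# One compressed variant per quality value; lower quality means stronger artifacts.
variants = {q: jpeg_compress(original, q) for q in (90, 50, 10)}
for q, (img, n_bytes) in variants.items():
    print(f"Q={q}: {img.size}, {n_bytes} bytes")
```

The decoded images keep the original size, so they can be fed to the same classifier and compared with the clean predictions using the changed-prediction percentage.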

Figure 11.

Percentage of changed predictions (see Equation 9) for the JPEG-compressed Cifar-10 test set. A lower value indicates that a network is more robust against JPEG artifacts.

Table 4.

Error values in percentages when using JPEG-compression for all Cifar-10 models and different quality (Q) values.

Model; Q=	90	80	70	60	50	40	30	20	10
FP-net (N = 3)	12.412	16.468	19.686	22.624	25.230	28.116	32.920	41.044	57.940
FP-net (basic) (N = 3)	12.288	16.338	19.636	22.734	25.306	27.890	32.744	40.138	56.260
PyrBlockNet (N = 3)	13.080	17.356	20.676	23.790	27.108	30.324	35.326	42.798	57.336
ResNet (N = 3)	12.910	17.322	20.830	24.416	27.308	30.404	35.252	43.356	58.722
FP-net (N = 5)	11.040	15.154	18.742	22.076	24.676	27.864	32.772	41.240	58.466
FP-net (basic) (N = 5)	10.908	14.866	18.142	21.282	23.792	26.760	31.736	40.024	58.488
PyrBlockNet (N = 5)	11.918	16.412	20.342	23.898	27.078	30.694	35.888	44.424	60.176
ResNet (N = 5)	11.592	15.654	19.114	22.330	24.590	27.462	32.100	40.726	56.878
FP-net (N = 7)	10.456	14.596	18.214	21.292	24.112	26.720	31.768	40.582	58.022
FP-net (basic) (N = 7)	10.624	14.274	17.590	20.384	23.020	26.014	30.646	39.382	56.966
PyrBlockNet (N = 7)	11.396	15.928	19.344	23.356	26.518	30.456	35.908	44.706	59.660
ResNet (N = 7)	11.338	15.294	18.288	21.254	23.968	27.044	31.534	40.168	56.540
FP-net (N = 9)	10.118	14.228	18.020	21.116	23.820	26.894	32.382	40.830	57.930
FP-net (basic) (N = 9)	10.128	13.984	16.996	19.912	22.116	24.956	29.710	38.096	56.040
PyrBlockNet (N = 9)	10.974	15.290	19.088	22.640	25.504	29.244	34.520	43.218	59.616
ResNet (N = 9)	10.788	14.830	18.162	21.558	24.200	27.728	32.272	40.942	57.842

Table 5.

Percentage of changed predictions when using JPEG-compression for all Cifar-10 models and different quality (Q) values.

Model; Q=	90	80	70	60	50	40	30	20	10
FP-net (N = 3)	9.024	14.124	17.676	21.014	23.714	26.688	31.580	40.116	57.502
FP-net (basic) (N = 3)	8.868	13.914	17.602	21.064	23.636	26.590	31.540	39.208	55.918
PyrBlockNet (N = 3)	10.112	15.086	18.864	22.176	25.670	29.102	34.168	41.842	56.930
ResNet (N = 3)	9.996	15.192	19.108	22.826	25.972	29.366	34.340	42.522	58.384
FP-net (N = 5)	8.374	13.158	17.104	20.644	23.232	26.690	31.766	40.412	58.090
FP-net (basic) (N = 5)	8.146	12.820	16.488	19.986	22.540	25.586	30.744	39.296	58.002
PyrBlockNet (N = 5)	9.322	14.590	18.746	22.488	25.770	29.656	35.006	43.668	59.840
ResNet (N = 5)	8.650	13.584	17.396	20.806	23.168	26.280	31.188	39.802	56.370
FP-net (N = 7)	8.058	12.868	16.780	19.898	22.830	25.714	30.818	39.848	57.782
FP-net (basic) (N = 7)	7.832	12.222	15.932	18.960	21.662	24.748	29.682	38.688	56.742
PyrBlockNet (N = 7)	8.804	13.980	17.920	22.114	25.410	29.432	34.964	44.002	59.232
ResNet (N = 7)	8.478	13.066	16.464	19.640	22.540	25.884	30.492	39.142	55.986
FP-net (N = 9)	7.694	12.482	16.704	19.866	22.700	25.886	31.402	40.064	57.774
FP-net (basic) (N = 9)	7.632	12.124	15.522	18.564	21.014	23.992	28.952	37.426	55.786
PyrBlockNet (N = 9)	8.740	13.498	17.594	21.258	24.294	28.190	33.596	42.560	59.332
ResNet (N = 9)	8.180	12.760	16.728	20.056	22.892	26.606	31.298	40.196	57.468

Again, using FP-neurons increased the robustness against artifacts. However, even a moderate compression alters up to 10% of the CNNs’ predictions.

Example FP-unit

As shown above, the learned FP-neurons are hyperselective and end-stopped to different degrees. However, these two properties do not fully specify an FP-neuron. When analyzing the individual FP-neurons in more detail, it is difficult to further specify them according to simple properties such as orientation or phase. Nevertheless, some FP-neurons look as if they were taken from a textbook on “how to model end-stopped neurons,” and we show one example in Figure 12.

Figure 12.

An example of a learned filter pair. The top row shows the two learned filters (arrows indicate orientation) and the row below the corresponding Fourier spectra. The third row depicts the responses of the two filters to the image shown in the bottom left (this image is used to illustrate the selectivity of the FP-unit). The bottom-right panel shows the response of the FP-unit (the product of the filter responses).

Such textbook units, however, are rather rare. This particular unit emerged in the FP-net-59 trained on Cifar-10 without instance normalization and without the ReLU in Equation 2.

Discussion and conclusions

We have presented a novel FP-net architecture and have demonstrated its competitive performance. To do so, we designed experiments with state-of-the-art deep networks and showed that we could improve their performance by substituting original blocks in the network architecture with FP-blocks that implement an explicit multiplication of feature maps. Given this simple design rule, we can expect our approach to be of practical use, since any traditional network can easily be transformed into an FP-net that will most likely perform better. We did not employ any hyperparameter tuning specific to the FP-nets but just used the hyperparameters of the original networks; one may thus expect even better performance with additional tuning.

We believe that the improvement that comes with FP-nets is due to an appropriate bias, which allows the network to learn efficient representations based on units (model neurons) that are end-stopped to different degrees. The multiplications that we introduce allow for AND rather than OR combinations and thus make the resulting units more selective than linear filters with pointwise nonlinearities. Note that the key feature of FP-nets is that one learns pairs of linear filters, which are then AND-combined. In the case of FP-nets, the AND is implemented by multiplications. We could, however, show that logarithms (Grüning et al., 2020b) and the minimum operation (Grüning & Barth, 2021a) can also serve as AND operations.

We consider the improvements that bio-inspired FP-nets achieve over the baseline networks to be the main contribution of our article. Moreover, we have analyzed the selectivity of the FP-units in an attempt to relate them to what is known about visual neurons. We could show that FP-units are indeed end-stopped to different degrees. The emergence of end-stopping in a network that learns based on only the classification error demonstrates that end-stopping is beneficial for the task of object recognition.
This finding is supported by previously known mathematical results, according to which (a) 2D features such as corners and junctions are statistically rare in natural images, leading to sparse representations (Zetzsche et al., 1993), and (b) 2D features are still unique since there exists a mathematical proof that 0D (uniform) and 1D (straight) regions in images are redundant (Mota & Barth, 2000), although being statistically frequent. Of course, the considerations above cannot be taken to imply that biological vision implements an FP-net architecture, especially as the FP-nets implement additional and typical deep-network operations such as linear recombinations that increase the entropy of the representation. In other words, much of what well-performing deep networks do is not something one would necessarily consider to be optimal. It is known that sparse-coding units are more selective than typical CNN units, that is, than linear neurons with pointwise nonlinearities (Paiton et al., 2020), and thus less prone to certain adversarial attacks. This increased selectivity has been quantified with the curvature of the iso-response contours. We could show that the iso-response contours of the FP-units are curved, with the degree of curvature depending on the angle between the multiplied feature vectors, and that a large number of hyperselective units emerge in FP-nets trained for object recognition. Furthermore, our results show that FP-nets are indeed more robust against adversarial attacks and compression artifacts, and this is, again, due to the vision-inspired FP-units.
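The AND-versus-OR distinction drawn in this discussion can be made concrete with a toy comparison. The sketch below is illustrative only: the two orthogonal filters, the placement of the ReLU before the combination, and the function names are assumptions, and the summed response is used merely as a simple OR-like reference. The logarithm-based variant mentioned in the text is not sketched because its exact form is not given here.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def fp_product(x, v, g):
    """AND-like: product of two rectified filter responses."""
    return relu(v @ x) * relu(g @ x)

def fp_minimum(x, v, g):
    """AND-like alternative: minimum of two rectified filter responses."""
    return np.minimum(relu(v @ x), relu(g @ x))

def or_like_sum(x, v, g):
    """OR-like reference: responds if either filter responds."""
    return relu(v @ x) + relu(g @ x)

v, g = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # hypothetical filter pair
only_v = np.array([1.0, 0.0])                        # stimulus matching v only
both = np.array([1.0, 1.0])                          # stimulus matching v and g

for name, x in (("only v", only_v), ("both", both)):
    print(name, fp_product(x, v, g), fp_minimum(x, v, g), or_like_sum(x, v, g))
```

With these inputs, the product and the minimum are zero when only one filter is driven and positive only when both are, whereas the sum responds in both cases. This is the sense in which the AND-combined units are more selective.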

8 in total

1. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects.

Authors: R P Rao; D H Ballard
Journal: Nat Neurosci Date: 1999-01 Impact factor: 24.884

2. RECEPTIVE FIELDS AND FUNCTIONAL ARCHITECTURE IN TWO NONSTRIATE VISUAL AREAS (18 AND 19) OF THE CAT.

Authors: D H HUBEL; T N WIESEL
Journal: J Neurophysiol Date: 1965-03 Impact factor: 2.714

3. A geometric framework for nonlinear visual coding.

Authors: E Barth; A Watson
Journal: Opt Express Date: 2000-08-14 Impact factor: 3.894

4. Fundamental limits of linear filters in the visual processing of two-dimensional signals.

Authors: C Zetzsche; E Barth
Journal: Vision Res Date: 1990 Impact factor: 1.886

Review 5. Deep learning.

Authors: Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal: Nature Date: 2015-05-28 Impact factor: 49.962