FP-nets as novel deep networks inspired by vision.

Philipp Grüning1,2, Thomas Martinetz1,3, Erhardt Barth1,4.   

Abstract

Feature-product networks (FP-nets) are inspired by end-stopped cortical cells with FP-units that multiply the outputs of two filters. We enhance state-of-the-art deep networks, such as the ResNet and MobileNet, with FP-units and show that the resulting FP-nets perform better on the Cifar-10 and ImageNet benchmarks. Moreover, we analyze the hyperselectivity of the FP-net model neurons and show that this property makes FP-nets less sensitive to adversarial attacks and JPEG artifacts. We then show that the learned model neurons are end-stopped to different degrees and that they provide sparse representations with an entropy that decreases with hyperselectivity.

Year:  2022        PMID: 35024759      PMCID: PMC8762712          DOI: 10.1167/jov.22.1.8

Source DB:  PubMed          Journal:  J Vis        ISSN: 1534-7362            Impact factor:   2.240


Introduction

For machine learning to work, one needs appropriate biases to constrain the solution for the problem at hand. Deep convolutional neural networks (CNNs), for example, are successful due to two constraints that specialize them relative to more general networks such as the multilayer perceptron (MLP): sparse connections and shared weights. It is well known that biases cannot be learned from the data or derived by logical deduction (Watanabe, 1985). In computer vision, appropriate biases can be obtained, as in the case of the CNNs, by studying biological vision (LeCun et al., 2015; Majaj & Pelli, 2018). Besides inspiring the use of localized (oriented) filters (the two CNN biases above) followed by a pointwise nonlinearity, biological vision can provide additional insight, an issue that currently receives somewhat limited attention in the deep-learning community (Majaj & Pelli, 2018; Paiton et al., 2020). We here focus on the principle of efficient coding (Barlow, 1961; Simoncelli & Olshausen, 2001) and the related neural phenomenon of end-stopping (Hubel & Wiesel, 1965). Statistical analysis shows that oriented linear filters reduce the entropy of natural images by encoding oriented straight patterns (one-dimensional [1D] regions) such as vertical and horizontal edges (Zetzsche et al., 1993). In cortical area V2, however, the majority of cells are end-stopped to different degrees (Hubel & Wiesel, 1965). End-stopped cells are thought to detect two-dimensional (2D) regions such as junctions and corners. Since 2D regions are unique and sparse in natural images (Barth & Watson, 2000; Mota & Barth, 2000; Zetzsche et al., 1993), they represent images efficiently, that is, with a high degree of sparseness and minimal information loss. A standard way of modeling end-stopped cells is to multiply outputs of orientation-selective cells, resulting in an AND-combination of simple-cell outputs (Zetzsche & Barth, 1990). 
For example, a corner can be detected by the logical combination of “horizontal edge AND vertical edge.” Paiton et al. (2020) argue convincingly that principles adopted from vision should be beneficial for deep networks and that the exploitation of multiplicative interactions between neurons has not been sufficiently explored in this specific context. There is, nevertheless, a vast literature on sigma-pi networks in general (e.g., Mel & Koch, 1990; Rumelhart et al., 1986), which is not surprising since such networks define a large class of possible systems. It has been shown that end-stopping can emerge from the principle of predictive coding based on recursive connections (Rao & Ballard, 1999); the latter has also been observed in Barth and Zetzsche (1998). Note that in Rao and Ballard (1999), end-stopping emerges from unsupervised learning with natural images, whereas in our case it emerges from task-driven supervised learning in a natural vision task.

Feature-product networks (FP-nets) implement a network architecture that contains explicit multiplications of the feature maps obtained with pairs of linear filters. The main feature of these networks is that they learn the appropriate filter pairs to be multiplied based on the task at hand. An early FP-net architecture has been presented as a preprint (Grüning et al., 2020b), and it has been shown (Grüning & Barth, 2021) that a similar network can predict subjective image quality well. Of course, we do not assume that neurons would compute ideal multiplications; the AND terms could be created in alternative ways, for example, by using logarithms (Grüning et al., 2020b) or the minimum operation (Grüning & Barth, 2021a) instead of multiplications. 
AND terms could also be generated by traditional CNNs with linear filters followed by simple ReLU nonlinearities (Barth & Zetzsche, 1998), but this would require larger networks and would be limited in terms of the possible tuning properties of the resulting nonlinear functions (see also Paiton et al., 2020, regarding the limits of pointwise nonlinearities). Here, we present a novel FP-net architecture that is closer to vision models than the ones introduced previously in Grüning and Barth (2021b) and Grüning et al. (2020b). We first demonstrate its performance and then analyze the learned units by relating them to biological vision.

Regarding the use of multiplicative terms in CNNs, Zoumpourlis et al. (2017) have shown that quadratic forms added to the first layer of a CNN can improve generalization. An FP-net can be interpreted as a special case of a network with an additional second-order Volterra kernel, but it has far fewer parameters. However, CNNs are also special cases of MLPs and, as we have argued above, the challenge is to find the right biases that can take us from the general to the more special case. For more comprehensive overviews on how FP-nets relate to various deep-network architectures, especially to bilinear CNNs (Li et al., 2017), see Grüning et al. (2020a) and Grüning and Barth (2021b). In addition, we would like to mention recent work of Chrysos et al. (2020), which illustrates that the Hadamard product of layers in deep networks and the resulting higher-order polynomial representation can improve classification performance. Finally, in recurrent networks, multiplications are used to implement useful gating mechanisms (Collins et al., 2016).

FP-nets as competitive deep networks

With FP-nets, we denote a deep-network architecture that contains one or several FP-blocks. Each block of a deep network implements a sequence of layers and operations that transforms an input tensor $T_0 \in \mathbb{R}^{h \times w \times d_{in}}$ to an output tensor $T_{out} \in \mathbb{R}^{h/s \times w/s \times d_{out}}$. A tensor consists of a number (e.g., $d_{in}$, $d_{out}$) of feature maps, each with spatial width $w$ and height $h$ that may be altered by a factor $s$. The typical input tensor for a CNN is an image, the three color channels being the feature maps. The sequence of operations in an FP-block is shown in Figure 1 and consists of three steps: (a) a first linear combination, (b) the feature product, and (c) a second linear combination.

In the first step, the feature maps of an input tensor $T_0$ are linearly combined, followed by a ReLU, to yield the tensor $T_1$ with $q\,d_{out}$ feature maps:

$$T_1[i, j, m] = \mathrm{ReLU}\Big(\sum_{n=1}^{d_{in}} w^m_n \, T_0[i, j, n]\Big), \quad m = 1, \dots, q\,d_{out}. \qquad (1)$$

$T_1[i, j, m]$ is the value of $T_1$ at pixel position $(i, j)$ and feature map $m$; $w^m_n$ are learned weights and $q$ is an expansion factor that controls the block size. By $T_1^m$, we denote the $m$th feature map of $T_1$.

The second step is the computation of feature products, the centerpiece of the FP-block. Each feature map $T_1^m$, $m = 1, \dots, q\,d_{out}$, is convolved with two learned filters $V^m$ and $G^m$. Filtering is followed by instance normalization (IN) (Ulyanov et al., 2016) and ReLU nonlinearity yielding two new feature maps. Subsequently, the product of the two filter outputs is computed. For any particular image patch, with the center pixel being $(s\,i, s\,j)$, of a particular feature map $T_1^m$, the filter operation for the vectorized image patch $x$ is the scalar product of the image patch with the vectorized filters $v^m$ and $g^m$:

$$T_2[i, j, m] = \mathrm{ReLU}\Big(\frac{x^T v^m - \mu^m_v}{\sigma^m_v}\Big) \cdot \mathrm{ReLU}\Big(\frac{x^T g^m - \mu^m_g}{\sigma^m_g}\Big). \qquad (2)$$

$T_2$ is the resulting tensor and $s$ the stride of the filter operation. If $s$ is greater than 1, $T_2$'s width and height are subsampled. $\mu^m$ and $\sigma^m$ are the mean value and standard deviation of $T_1^m$ after convolution with either $v^m$ or $g^m$:

$$\mu^m = \frac{1}{K} \sum_{k=1}^{K} y_k, \qquad (3)$$

$$\sigma^m = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (y_k - \mu^m)^2}, \qquad (4)$$

with $y_k$ being the $k$th pixel of the filter result and $K$ the number of pixels. In the third step, a second linear combination transforms $T_2$ into $T_3$. To comply with the baseline architectures ResNet and MobileNet, a residual connection defines the final output as:

$$T_{out} = T_0 + T_3. \qquad (5)$$
Figure 1.

The structure of an FP-block is illustrated with rectangles and circles for the various operations applied to the input tensor $T_0$, gradually transforming it into $T_{out}$. The first row within each rectangle denotes which operations are applied in sequence. In the second row, the number of feature maps is given; $d_{in} \rightarrow q\,d_{out}$ indicates that the input number of feature maps $d_{in}$ changes to $q\,d_{out}$. The arrows in the figure indicate the inputs to the different operations and are labeled with the tensors defined in the equations (see text). Note that $T_1$ is input to two different depthwise-separable convolutions (DWS, middle rectangles) that are learned. Convolutions are followed by instance normalization (IN) and ReLU nonlinearity, resulting in two different tensors. $T_2$ is the result of element-wise multiplication of these two tensors (see Equation 2). A second linear combination, depicted by the bottom rectangle, yields $T_3$. For the final output $T_{out}$, a residual connection adds the input tensor $T_0$ to $T_3$ (see Equation 5).
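The three steps of the FP-block can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the names `fp_block`, `depthwise_conv`, and `instance_norm`, the zero padding, the placement of the normalization after subsampling, and the decision to skip the residual connection when input and output shapes differ are all assumptions made for this example.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def depthwise_conv(t, kernels, stride=1):
    """Filter each feature map t[:, :, m] with its own k x k kernel (zero padding)."""
    h, w, _ = t.shape
    k = kernels.shape[0]
    p = k // 2
    padded = np.pad(t, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(t, dtype=float)
    for di in range(k):
        for dj in range(k):
            out += padded[di:di + h, dj:dj + w, :] * kernels[di, dj, :]
    return out[::stride, ::stride, :]  # subsample if stride > 1

def instance_norm(t, eps=1e-5):
    """Normalize every feature map by its own spatial mean and standard deviation."""
    mu = t.mean(axis=(0, 1), keepdims=True)
    sigma = t.std(axis=(0, 1), keepdims=True)
    return (t - mu) / (sigma + eps)

def fp_block(t0, w1, v, g, w2, stride=1):
    # (a) first linear combination across feature maps, followed by ReLU
    t1 = relu(t0 @ w1)
    # (b) feature product: two filters per feature map, IN, ReLU, then multiply
    a = relu(instance_norm(depthwise_conv(t1, v, stride)))
    b = relu(instance_norm(depthwise_conv(t1, g, stride)))
    t2 = a * b
    # (c) second linear combination
    t3 = t2 @ w2
    # Residual connection; skipped here when shapes differ (an assumption of this sketch)
    return t0 + t3 if t3.shape == t0.shape else t3

rng = np.random.default_rng(0)
d_in, d_out, q = 4, 4, 2
t0 = rng.normal(size=(8, 8, d_in))
w1 = rng.normal(size=(d_in, q * d_out))
v = rng.normal(size=(3, 3, q * d_out))
g = rng.normal(size=(3, 3, q * d_out))
w2 = rng.normal(size=(q * d_out, d_out))

out_same = fp_block(t0, w1, v, g, w2, stride=1)
out_half = fp_block(t0, w1, v, g, w2, stride=2)
print(out_same.shape, out_half.shape)
```

With a stride of 2, the spatial size is halved, which mirrors the role of the first block of a stack described in the text.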

Using the above FP-block, we designed four different FP-nets based on different baseline architectures: an FP-net based on (a) the original ResNet and (b) the PyrBlockNet, both trained on Cifar-10, and an FP-net based on (c) a ResNet-50 and (d) a MobileNet-V2, both trained on ImageNet. A stack is a larger segment of the network, consisting of several blocks. Except for the first stack that may have a stride of 1, each new stack starts with a block with a stride of 2 that reduces the size of each feature map. Within a stack, all blocks operate on feature maps of the same size. Different network architectures may have different numbers and types of blocks. In our case, basic blocks, pyramid blocks, bottleneck blocks, and inverted residual blocks define the ResNet-Cifar, PyrBlockNet, ResNet-50, and MobileNet-V2 architecture, respectively. The block is the core module of an architecture and contains several layers. Layers are the smallest network building units such as convolution layers and max-pooling layers. 
Figure 2 shows an example of a ResNet-Cifar architecture that has three stacks with five blocks each. Each first block of the second and third stacks contains a convolution layer with a stride of 2 that downsamples the input. The two other architectures that we used are similar: The ResNet-50 has four stacks with varying numbers of bottleneck blocks. The MobileNet-V2 has six stacks consisting of inverted-residual blocks.
Figure 2.

Architecture of the ResNet-32 used on Cifar-10: The network contains three stacks with five blocks each. Each block contains several layers such as convolution layers with a kernel of size $3 \times 3$ pixels, batch normalization (BN) layers, and ReLU and Softmax nonlinearities. Convolution layers with a stride larger than 1 subsample the input, for example, from $32 \times 32$ pixels to $16 \times 16$ or $8 \times 8$ pixels. The number of feature maps can change within a block; for example, $16 \rightarrow 32$ indicates an increase from 16 to 32 feature maps. The FP-net has the same baseline architecture, but each first block in a stack (colored in red) is replaced with an FP-block.

We transform the four baseline architectures defined above into FP-nets using a simple design rule: Substitute each stack's first block with an FP-block. The input and output dimensions of the block are kept equal; only the internal operations differ. We developed this design rule to improve upon already well-established architectures, making FP-nets practical since only a few changes need to be made to create an FP-net. To be compatible with state-of-the-art architectures, the FP-block has a structure similar to the MobileNet-V2 block (Sandler et al., 2018). We found that combinations of convolution blocks and FP-blocks work best and that larger kernel sizes do not improve performance.

One way to view a stack is that it constitutes a visual processing chain for a specific image scale. One would expect end-stopping to be more useful at the beginning of this chain. Thus, we replaced the first block of each stack. Note, however, that later stacks, for example, the second and third stack in the Cifar-10 networks, already work with highly processed inputs coming from the previous stacks. Therefore, one would expect that there is a lower necessity of extracting 2D regions in later stacks. Indeed, we will show, when analyzing the $\gamma$ values of FP-blocks, that highly selective neurons are more common in earlier stacks. 
We train and test several FP-nets on the two well-known benchmarks Cifar-10 (Krizhevsky et al., 2021) and ImageNet (Deng et al., 2009). Due to the moderate size of the data set, Cifar-10 is often used to evaluate the potential of new architectures and designs. For our experiments on this data set, we used ResNets (He et al., 2016) as baseline; see Figure 2 for an example. These networks have three stacks, each consisting of $N$ blocks. We evaluated two types of the ResNet-20, ResNet-32, ResNet-44, and ResNet-56, with $N = 3$, 5, 7, and 9 blocks, respectively (the numbers after the names indicate the number of convolution or linear layers). Since the first publication of the ResNet architecture, several additional blocks were proposed; see Han et al. (2017) for an overview. As two baselines on Cifar-10, we used the original ResNet and a variant using the pyramid block that we denote PyrBlockNet. For both variants, we created FP-nets by replacing baseline blocks with FP-blocks according to our design rule. We used the same number of blocks, but note that each FP-block contains one additional convolution layer. The FP-net-23, FP-net-35, FP-net-47, and FP-net-59 are based on the PyrBlockNet: Each stack's first block is an FP-block, and all other blocks are pyramid blocks. Analogously, FP-net (basic) denotes an FP-net based on the original ResNet: Each stack's first block is an FP-block, and the remaining blocks are basic blocks.

Next, we evaluated the performance of FP-nets with the larger ImageNet data set that contains over 1.2 million training examples and 50,000 validation examples (we tested on the publicly available validation set). With an input size of at least $224 \times 224$ pixels and 1,000 classes, ImageNet poses a greater challenge than Cifar-10. We compared the ResNet-50 to two FP-net-50: one smaller net with an expansion factor $q = 0.8$ and a slightly larger network with $q = 1$. 
In both cases, for each of the four stacks of the ResNet-50, the first block was replaced by an FP-block to obtain the FP-net-50. Note that, if not explicitly mentioned, the term FP-net-50 refers to the variant. To further validate our approach, we evaluated an FP-net based on the popular MobileNet-V2 architecture. As with the ResNet, we replaced the first block of each stack with an FP-block, using .

The results of the Cifar-10 experiments are shown in Figure 3: The left side compares the original ResNet to the FP-net (basic), and the right side compares the PyrBlockNet to the FP-net. Each point of the two curves shows the best test error occurring over all training epochs, averaged over five runs, for one particular network (i.e., one particular number of blocks). The black line shows the baseline network, the green line the resulting FP-net when substituting the first blocks of the baseline's stacks. The x-axis displays the number of parameters, a number that increases with the number of blocks. Note, however, that the inclusion of FP-blocks reduces the number of parameters. Overall, the FP-nets are more compact and perform better, with a lower test error and only a small overlap in the standard deviations.
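The layer counts in the Cifar-10 network names follow from simple arithmetic, sketched below. The sketch assumes the usual ResNet-Cifar layout (one initial convolution, two convolutions per block, one final linear layer) and one extra convolution per FP-block as stated in the text; the function names are hypothetical.

```python
def resnet_cifar_depth(blocks_per_stack, stacks=3, convs_per_block=2):
    """Convolution/linear layers: initial conv + all block convs + final linear layer."""
    return stacks * blocks_per_stack * convs_per_block + 2

def fp_net_depth(blocks_per_stack, stacks=3):
    """One FP-block per stack, each adding one convolution layer to the baseline count."""
    return resnet_cifar_depth(blocks_per_stack, stacks) + stacks

def fp_net_layout(blocks_per_stack, stacks=3, baseline_block="pyramid"):
    """Design rule: the first block of every stack is an FP-block."""
    return [["FP"] + [baseline_block] * (blocks_per_stack - 1) for _ in range(stacks)]

for n in (3, 5, 7, 9):
    print(n, resnet_cifar_depth(n), fp_net_depth(n))

print(fp_net_layout(3))
```

Under these assumptions, the counts reproduce the names used in the text: ResNet-20/32/44/56 and FP-net-23/35/47/59.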
Figure 3.

The y-axis displays the best test score on Cifar-10 averaged over five runs, and the bars indicate the standard deviations. The transparent area indicates the range from the minimum to the maximum. Each diamond represents one network having a specific number of parameters (in thousands) denoted on the x-axis. On the left, the black solid line shows the baseline ResNet results with 20, 32, 44, and 56 layers, and the green solid line the results for the corresponding FP-nets (basic). On the right, the black solid line shows the baseline PyrBlockNet and the green solid line the results for the FP-nets. Substituting each stack's first block with an FP-block yielded, in all but one case, a significantly better performance with a reduced number of parameters.

Table 1 shows the results on ImageNet. Note that the FP-net ($q = 1$) performs better than the baseline ResNet-50, and the validation error is reduced by almost 0.4. When considering the already compact MobileNet architecture, the FP-net performs better than the MobileNet with an error decreased by 0.2. We trained the MobileNet-V2 baseline network ourselves to obtain its validation error. For the ResNet-50, we report the value from the Tensorpack repository (Wu, 2016). The performance depending on the number of parameters for the ResNet and FP-variants is illustrated in Figure 4.
Table 1.

ImageNet validation errors for different FP-nets and baselines: We transformed two baseline network architectures, the ResNet-50 and the MobileNet-V2, into FP-nets, here denoted as FP-net-50 and FP-MobileNet. The transformations are done by substituting specific blocks of the baseline networks with an FP-block (see text). Additionally, by choosing different expansion factors $q$, we created one FP-net that is smaller than the baseline ($q = 0.8$) and one larger network ($q = 1$). Note that FP-nets perform better than the baseline models if there is only a slight increase in the number of parameters (shown in millions).

Model                      No. of parameters (M)    Error
ResNet-50 (baseline)       25.6                     23.61
FP-net-50 (q=0.8)          24.3                     23.80
FP-net-50 (q=1)            26.0                     23.24
MobileNet-V2 (baseline)    3.5                      28.71
FP-MobileNet               3.5                      28.53
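The differences quoted in the text can be checked directly against the table values. The snippet below is only an arithmetic check; the dictionary layout and variable names are chosen for this example.

```python
# (parameters in millions, validation error) copied from Table 1
results = {
    "ResNet-50 (baseline)": (25.6, 23.61),
    "FP-net-50 (q=0.8)": (24.3, 23.80),
    "FP-net-50 (q=1)": (26.0, 23.24),
    "MobileNet-V2 (baseline)": (3.5, 28.71),
    "FP-MobileNet": (3.5, 28.53),
}

def error_reduction(baseline, variant):
    """Positive values mean the variant has a lower validation error."""
    return round(results[baseline][1] - results[variant][1], 2)

resnet_gain = error_reduction("ResNet-50 (baseline)", "FP-net-50 (q=1)")
small_gain = error_reduction("ResNet-50 (baseline)", "FP-net-50 (q=0.8)")
mobile_gain = error_reduction("MobileNet-V2 (baseline)", "FP-MobileNet")
print(resnet_gain, small_gain, mobile_gain)
```

The first and third values correspond to the "almost 0.4" and "0.2" reductions mentioned in the text; the smaller FP-net-50 has a slightly higher error than the baseline.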
Figure 4.

Number of parameters vs. ImageNet validation error for the ResNet-50 (black diamond) and two FP-nets (green dots) with different expansion factors $q$.


FP-nets and visual coding

Hyperselectivity of FP-units

Vilankar and Field (2017) used the term hyperselectivity to quantify how strongly a neuron is tuned to its optimal stimulus, that is, how quickly the response drops when the optimal stimulus changes. In the context of deep learning, hyperselectivity is relevant because it can increase robustness, for example, robustness against adversarial attacks (Paiton et al., 2020). One way to quantify hyperselectivity is to measure the curvature of iso-response contours. Given an $n$-dimensional input $x$ to a function $F$, an $(n-1)$-dimensional surface may exist such that for all points on the surface, the output $F(x)$ is a constant. As $n$ can be a high dimension, 2D projections are used to analyze such iso-surfaces, which in two dimensions become iso-response contours.

The typical linear-nonlinear (LN) model neuron used in CNNs is a function $F_{LN}(x) = f(w^T x)$ that involves a linear projection on a weight vector $w$ followed by a pointwise nonlinearity $f$. To analyze the iso-response contour of such a neuron, one first projects the input on $w$, the axis corresponding to the optimal stimulus $x_{opt}$. To find a second axis, one searches for a vector $u$ orthogonal to $w$, for example, by picking random values and using the Gram–Schmidt process (see Equation 16) to transform the random vector to one that is orthogonal to $w$. When looking at the output of an LN-neuron for $x_{opt}$ perturbed by any orthogonal vector $u$ with $w^T u = 0$, the iso-response contour is always a straight line parallel to $u$, because $F_{LN}(x_{opt} + u) = f(w^T x_{opt} + w^T u) = F_{LN}(x_{opt})$. Thus, for LN-neurons, the iso-response contours have zero curvature.

For hyperselective neurons ($F_{HS}$), there exist vectors $u$ that are orthogonal to $x_{opt}$ and decrease the neuron's optimal response such that $F_{HS}(x_{opt} + u) < F_{HS}(x_{opt})$. In this case, the exo-origin iso-response contour bends away from the origin of the basis defined by $x_{opt}$ and $u$. A higher curvature of this bend indicates a more significant activation dropoff in regions that are different from the optimal stimulus (i.e., a greater hyperselectivity). 
One way to quantify the curvature is to use the coefficient of the quadratic term obtained by fitting a second-order polynomial to the iso-response contour. FP-nets contain FP-blocks that consist of FP-units, or FP-neurons, which yield the feature-product output for a pixel in a feature map as defined by Equation 2. As shown in the Appendix, FP-neurons exhibit curved exo-origin iso-response contours with a curvature that depends on the angle $\gamma$. Iso-response contours are shown in Figure 5 for different values of $\gamma$. Note that curvature, and thus hyperselectivity, increases with $\gamma$. Accordingly, a large $\gamma$ leads to a lower entropy of the resulting feature maps; see Figure 6.
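The relation between the angle and the contour curvature can be illustrated numerically. The sketch below assumes a simplified two-dimensional FP-neuron without normalization, and it assumes that the angle is the one between two unit-norm filter vectors placed symmetrically around the optimal stimulus direction; the Appendix derivation is not reproduced here, and all names are hypothetical.

```python
import numpy as np

def fp_neuron(x, v, g):
    """Simplified FP-neuron: product of two rectified filter responses."""
    return np.maximum(x @ v, 0.0) * np.maximum(x @ g, 0.0)

def contour_curvature(gamma, level=1.0, half_width=0.5, n_points=101):
    """Quadratic coefficient of a second-order fit to one iso-response contour.

    Coordinates: a along the optimal stimulus, b along the orthogonal axis.
    """
    half = gamma / 2.0
    v = np.array([np.cos(half), np.sin(half)])
    g = np.array([np.cos(half), -np.sin(half)])
    b = np.linspace(-half_width, half_width, n_points)
    # For these filters F(a, b) = a^2 cos^2(half) - b^2 sin^2(half); solve F = level for a
    a = np.sqrt(level + (b * np.sin(half)) ** 2) / np.cos(half)
    points = np.stack([a, b], axis=1)
    assert np.allclose(fp_neuron(points, v, g), level)  # points lie on the contour
    return np.polyfit(b, a, 2)[0]

angles_deg = (0, 30, 60, 90, 120)
curvatures = [contour_curvature(np.deg2rad(d)) for d in angles_deg]
for d, c in zip(angles_deg, curvatures):
    print(d, c)
```

In this toy setting, an angle of zero gives a straight contour, as for an LN-neuron, and the fitted quadratic coefficient grows with the angle.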
Figure 5.

Iso-response contour plots for different values of the angle $\gamma$. Each plot shows values that were determined by using Equation 23; furthermore, normalization and quantization to six bins were applied. The horizontal axis points in the direction of the optimal stimulus, and the vertical axis is orthogonal to the optimal stimulus; both are indexed by the corresponding coordinates in Equation 23. The black lines indicate the zero contour.

Figure 6.

Scatterplots of entropy over hyperselectivity (indicated by the angle $\gamma$). Each dot corresponds to an FP-neuron. The color codes indicate the position of each neuron in the network (i.e., the number of convolution layers). The entropy of a particular FP-neuron's feature map is estimated as described in the Appendix and plotted against $\gamma$. The left panel shows results for the FP-net-50 trained on ImageNet after 2, 11, 23, and 41 convolution layers, and the right panel shows results for the FP-net-59 trained on Cifar-10 after 2, 21, and 40 convolution layers. Note the correlation between entropy and $\gamma$. Hyperselectivity is directly linked to $\gamma$, as illustrated in Figure 5.


Entropy and degree of end-stopping

To further support the view that FP-neurons are hyperselective depending on $\gamma$, we analyzed the entropy of the feature maps generated by different FP-neurons. The results in Figure 6 show that the learned filters tend to have a $\gamma$ larger than zero, that is, the majority of FP-neurons are hyperselective, and that a high $\gamma$-value leads to a lower entropy. Details of how the entropy is computed are given in the Appendix.

To analyze the end-stopping behavior of the model neurons that are learned in the FP-nets trained on Cifar-10 and ImageNet, we needed to quantify the degree of end-stopping. To relate to physiological measurements, we started by analyzing the response of FP-neurons to straight lines and line ends, but this turned out to be problematic because the FP-nets use small filters and subsample the input. To keep the analogy, but with a more robust measure, we used a square as input and quantified the average responses to the uniform zero-dimensional (0D) regions, the straight 1D edges, and the 2D corners. The degree of end-stopping is then defined by the relation between 1D and 2D responses. To account for ON/OFF-type responses, we used both a bright and a dark square. The results are shown in Figure 7, and the details of the algorithm are given in the Appendix.
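The idea behind the square stimulus can be illustrated with hand-picked filters. The sketch below is not the algorithm from the Appendix: it uses fixed Sobel filters instead of learned ones, a single bright square, and region masks chosen by hand, all of which are assumptions of this example. It shows that a rectified edge filter responds along a straight edge, whereas the product of a horizontal-edge and a vertical-edge response is nonzero only near a corner.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def relu(t):
    return np.maximum(t, 0.0)

def filter2d(image, kernel):
    """Same-size 2D correlation with zero padding."""
    padded = np.pad(image, kernel.shape[0] // 2)
    windows = sliding_window_view(padded, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Bright square on a dark background
image = np.zeros((16, 16))
image[5:11, 5:11] = 1.0

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

edge_left = relu(filter2d(image, sobel_x))  # rectified response to the left edge
edge_top = relu(filter2d(image, sobel_y))   # rectified response to the top edge
product = edge_left * edge_top              # AND-like combination of both responses

# Hand-picked regions: center (0D), middle of the left edge (1D), top-left corner (2D)
region_0d = (slice(7, 9), slice(7, 9))
region_1d = (slice(7, 9), slice(3, 7))
region_2d = (slice(3, 7), slice(3, 7))

for name, region in (("0D", region_0d), ("1D", region_1d), ("2D", region_2d)):
    print(name, edge_left[region].mean(), product[region].mean())
```

Because the filter outputs are rectified, only one corner of the bright square activates this particular product; the text accounts for such polarity effects by using both a bright and a dark square.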
Figure 7.

Distribution of neurons plotted over the degree of end-stopping. Distributions are shown for the first block of the first stack for different models. Left: the activation of the first convolution, after batch normalization and ReLU, of a pyramid block in a PyrResNet (nine blocks per stack). Middle: the FP-neurons of an FP-block for an FP-net trained on Cifar-10 (nine blocks per stack). Right: the FP-neurons of an FP-block for the FP-net-50 trained on ImageNet. Blue bars show normalized histograms for the ratio that quantifies the relation between responses to straight edges (1D) and corners (2D); see Appendix. Neurons that respond to 0D regions (the center of a square) are excluded from the blue histogram and shown separately as orange bars. Neurons that do not respond at all (0D, 1D, and 2D responses are all zero) are also excluded from the blue histogram and are shown as green bars.

Note that, like the real neurons in cortical areas V1 and V2, the model neurons in the FP-net are end-stopped to different degrees. Thus, end-stopping seems to be beneficial for both the ImageNet and Cifar-10 tasks, since the emergence of end-stopping is here driven by the classification error. As expected, the multiplication in the FP-block shifts the distribution toward a higher degree of end-stopping. However, the network could have learned filter pairs that do not lead to end-stopped FP-neurons. The bias that we introduce (i.e., the multiplication) just makes it easier for the network to learn end-stopped representations. The angle distributions in Figure 8 show that linear FP-neurons are indeed learned as well, since more than 15% of FP-neurons have a $\gamma$-value near zero. With increasing network depth, the number of linear FP-neurons increases, indicating that hyperselectivity and especially end-stopping are more frequent in earlier stages of the visual processing chain.
Figure 8.

Distribution of FP-neurons as a function of hyperselectivity (indicated by the angle $\gamma$) and for different positions in the network. Note that the majority of neurons are hyperselective to different degrees and that hyperselectivity is reduced later in the network. The left panel shows results for the FP-net-50 trained on ImageNet after 2, 11, 23, and 41 convolution layers. The right panel shows results for the FP-net-59 trained on Cifar-10 after 2, 21, and 40 convolution layers.


FP-neurons are more robust against adversarial attacks

Although outperforming almost all alternative approaches on many vision tasks, CNNs are surprisingly sensitive to barely visible perturbations of the input images (Szegedy et al., 2013). An adversarial attack on a classifier function $C$ adds a noise pattern $\eta$ to an input image $x$ so that $C(x + \eta)$ does not return the correct class $y$. Furthermore, the attacker ensures that some $p$-norm of $\eta$ does not exceed $\epsilon$. In many cases, including this work, the infinity-norm is chosen, and $\epsilon$ is taken from a small set of values. Thus, for example, for an $\epsilon$ that corresponds to one 8-bit intensity level, each 8-bit pixel value is at most altered by adding or subtracting the value 1.

Goodfellow et al. (2014) argue that the main reason for the sensitivity to adversarial examples is the linearity of CNNs: With a high-dimensional input, one can substantially change a linear neuron's output, even with small perturbations. Consider the output of an LN-neuron for an input $x$ with dimension $n$ perturbed by $\eta$. We choose $\eta$ to be the sign function of the weight vector $w$ multiplied with $\epsilon$: $\eta = \epsilon\,\mathrm{sign}(w)$. Thus, $\eta$ roughly points in the direction of the optimal stimulus (which is also the gradient), but its infinity-norm does not exceed $\epsilon$. Assuming that the mean absolute value of $w$ is $m$, $w^T \eta$ is approximately equal to $\epsilon m n$. Accordingly, a significant change of the LN-neuron's output can be achieved by a small value $\epsilon$ if the input dimension $n$ is large, which is the case for many vision-related tasks.

This gradient-ascent method can also be applied to nonlinear neurons. Within a local region, the output of almost any function can be approximated by a linear function. To optimally increase the output, the input needs to be moved along the gradient direction. The fast gradient sign method (FGSM; Goodfellow et al., 2014) perturbs the original input image by adding $\eta = \epsilon\,\mathrm{sign}(\nabla_x J(x, y))$, where $J$ denotes the classification loss. Another approach is to define $\eta$ to be the gradient times a positive step size $\alpha$ followed by clipping to $[-\epsilon, \epsilon]$.
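The linearity argument can be checked with a few lines of NumPy. This is a small numerical illustration with random weights, not an experiment from the paper; the function name and the chosen dimensions are assumptions of this example.

```python
import numpy as np

def sign_perturbation_gain(n, eps, rng):
    """Output change of a linear neuron under the perturbation eps * sign(w)."""
    w = rng.normal(size=n)       # random weight vector of dimension n
    eta = eps * np.sign(w)       # infinity-norm of eta is eps
    gain = w @ eta               # change of the linear response w^T x
    m = np.abs(w).mean()         # mean absolute weight value
    return gain, eps * m * n

rng = np.random.default_rng(0)
eps = 1.0 / 255.0
gains = {}
for n in (100, 10_000, 1_000_000):
    gain, predicted = sign_perturbation_gain(n, eps, rng)
    gains[n] = (gain, predicted)
    print(n, gain, predicted)
```

For a fixed bound on the per-pixel change, the output change grows linearly with the input dimension, which is the effect described by Goodfellow et al. (2014).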
With a step size $\alpha > 0$, an initial input $x_0$, a neuron output $F$, and $\eta^{0} = 0$, the iteration reads

$$\tilde{\eta}^{i} = \eta^{i-1} + \alpha \, \nabla F(x_0 + \eta^{i-1}), \qquad \eta^{i}_j = \max\left(-\epsilon, \min\left(\epsilon, \tilde{\eta}^{i}_j\right)\right). \quad (6)$$

The clipped iterative gradient ascent (CIGA) greedily moves along the direction of the highest linear increase, with $\tilde{\eta}^{i}_j$ being the $j$th entry of the unbounded result at the $i$th iteration step. In the following, we use CIGA in our illustrations of the principle, and in our experiments, we employ FGSM as it is a widely recognized adversarial attack method. When regarding an iso-response contour plot, one can easily spot the direction of the gradient, which is orthogonal to an iso-response contour (Paiton et al., 2020). In Figure 9 on the left, the gradient for an LN-neuron is parallel to the optimal stimulus (black line). As long as the initial input $x_0$ yields a nonzero gradient, each step of CIGA maximally increases the LN-neuron output. Thus, the algorithm's effectiveness is only bounded by $\epsilon$ but widely independent of the initial input $x_0$. For a step size larger than $\epsilon / \min_j |w_j|$, CIGA finds the optimal solution in one step. We now investigate the effects of CIGA on a simplified version of an FP-neuron with the filter pair $v$ and $g$:

$$F(x) = (v^T x)(g^T x). \quad (7)$$

Note that in the following particular example, the input is chosen to yield nonnegative projections on $v$ and $g$; thus, we can remove the ReLUs. The resulting gradient is

$$\nabla F(x) = (g^T x) \, v + (v^T x) \, g. \quad (8)$$

The effectiveness of an iteration step strongly depends on the current position. The highest possible increase would be obtained along the line defined by the optimal stimulus. In Figure 9 on the right, this is the black line. If the initial input is located on this line, any step in the gradient direction yields an optimal increase of the FP-neuron output. However, for any other position with a nonzero gradient, an unbounded iteration step would move toward the optimal stimulus line. The blue curve in Figure 9 shows the path for several iterations of CIGA: Starting above the optimal stimulus line, each step slowly converges to the optimal stimulus line, eventually moving almost parallel to it.
Once the threshold of 1 is reached in the horizontal dimension, the (now bounded) path runs parallel to the vertical dimension to increase the neuron output further. The optimal solution is found once the bound is also reached in the vertical dimension. The important difference when comparing with LN-neurons is that there are numerous conditions (depending on the initial input, the filters, and the attack parameters) where CIGA would need several steps to find an optimal solution. This reduced effectiveness of the gradient ascent illustrates why hyperselective neurons are more robust against adversarial attacks; for example, if the attack parameters are too small or chosen poorly, or if too few iterations are used, an attack might not increase the FP-neuron output by much. Note that single neurons are usually not the target of adversarial attacks; instead, the gradient is determined on the classification loss function. Still, the argument holds that hyperselective neurons are harder to activate than LN-neurons, resulting in an increased robustness.
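The behavior of CIGA described above can be reproduced with a toy sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the weight vector, the filter pair, the starting points, the bound, and the step size are hypothetical values chosen for readability and differ from those used for Figure 9. It counts how many iterations the clipped gradient ascent needs until the perturbation stops changing.

```python
def clip(value, low, high):
    """Restrict value to the interval [low, high]."""
    return max(low, min(high, value))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def ln_gradient(w):
    """Gradient of the linear response w^T x (constant in x)."""
    return lambda x: list(w)


def fp_gradient(v, g):
    """Gradient of the simplified FP-neuron (v^T x)(g^T x), i.e., without ReLUs."""
    return lambda x: [dot(g, x) * vj + dot(v, x) * gj for vj, gj in zip(v, g)]


def ciga(gradient, x0, eps, step, max_iters=10_000):
    """Clipped iterative gradient ascent.

    Returns the final perturbation and the number of iterations after which
    the perturbation stopped changing.
    """
    eta = [0.0] * len(x0)
    for i in range(1, max_iters + 1):
        x = [xj + ej for xj, ej in zip(x0, eta)]
        grad = gradient(x)
        new_eta = [clip(ej + step * gj, -eps, eps) for ej, gj in zip(eta, grad)]
        if new_eta == eta:
            return eta, i - 1
        eta = new_eta
    return eta, max_iters


if __name__ == "__main__":
    eps, step = 1.0, 2.0        # hypothetical bound and step size
    w = [1.0, 0.5]              # hypothetical LN-neuron weights
    v, g = [1.0, 0.0], [0.0, 1.0]  # hypothetical FP-neuron filter pair
    for x0 in ([0.1, 0.5], [0.01, 0.02]):
        print("LN", x0, ciga(ln_gradient(w), x0, eps, step))
        print("FP", x0, ciga(fp_gradient(v, g), x0, eps, step))
```

With these toy values, the LN-neuron saturates the bound after a single iteration from either starting point, whereas the number of iterations for the FP-neuron depends on the starting point. This mirrors the qualitative argument only; it says nothing about full networks.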
Figure 9.

Iso-response contours and iteration path of CIGA for (left) an LN-neuron and (right) an FP-neuron: The LN-neuron's weight vector is and is an orthogonal vector. The FP-neuron's filter-pair is and . The black lines point in the directions of the respective optimal stimulus. The blue dashed line shows the iteration path of the CIGA (see Equation 6). All other colored solid lines show iso-response contours; the number on each line shows the function value of the contour. For each neuron, CIGA aims to find a perturbation with that maximally increases the output . is the initial input to the neurons; is the step size, and a total of 10,000 iterations were computed. CIGA quickly finds an optimal solution for the LN-neuron since any step along the positive gradient (parallel to the optimal stimulus, orthogonal to the iso-response contours) optimally increases the function value. For the FP-neuron, the iteration path first moves toward the optimal stimulus, then almost parallel to it, and finally, moves upward once the bound on is reached along the -axis. This longer, more complex optimization path shows that CIGA is less effective for a hyperselective FP-neuron, indicating that FP-neurons are more robust against adversarial examples.

To test this hypothesis, we created new Cifar-10 test sets derived from the original test set $D$. Here, we focused on the most subtle adversarial attacks: we created one test set, where each test image was perturbed by using FGSM with $\epsilon = 1/255$. Results for larger $\epsilon$-values are shown in the Appendix (see Table 2 and Table 3). To exclude the hypothesis that the better accuracy (with perturbations) is due to the fact that the FP-nets already generalize better, we present results where we measure the percentage of changed predictions of the classifier $f$:

$$P = \frac{100}{|D|} \sum_{x \in D} \mathbb{1}\left[f(x) \neq f(\pi(x, \theta))\right]. \quad (9)$$

$\mathbb{1}[\cdot]$ is the indicator function returning a 1 for a true statement and a zero otherwise. $\pi$ is some function (here, FGSM) that perturbs the original image $x$ based on some parameter $\theta$.
We evaluated this metric for each of the four architectures that we trained on the original Cifar-10 training set (see Section “FP-nets as competitive deep networks”); no additional adversarial training scheme was employed. As shown in Figure 10, 40% to 50% of the predictions did change. However, for both baseline models, substituting some of the LN-neurons with FP-neurons increased the robustness against FGSM attacks.
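The percentage of changed predictions (Equation 9) can be sketched in a few lines. This is a minimal illustration, assuming a classifier and a perturbation function are available as plain Python callables; the toy threshold classifier and the additive perturbation used in the example are hypothetical stand-ins for a trained network and for FGSM.

```python
def changed_predictions_percent(classify, perturb, images, param):
    """Percentage of images whose predicted class changes after perturbation.

    classify: callable mapping an image to a class label
    perturb:  callable mapping (image, param) to a perturbed image
    """
    changed = sum(
        1 for image in images if classify(image) != classify(perturb(image, param))
    )
    return 100.0 * changed / len(images)


def toy_classifier(image):
    """Hypothetical stand-in for a network: class 1 if the mean pixel exceeds 0.5."""
    return int(sum(image) / len(image) > 0.5)


def toy_perturbation(image, shift):
    """Hypothetical stand-in for an attack: add a constant to every pixel."""
    return [pixel + shift for pixel in image]


if __name__ == "__main__":
    images = [[0.1], [0.49], [0.6], [0.9]]
    print(changed_predictions_percent(toy_classifier, toy_perturbation, images, 0.02))
```

Note that the metric compares the prediction on the perturbed image with the prediction on the original image rather than with the ground-truth label, which is what separates it from the robust error in Table 2.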
Table 2.

Robust error values, in percentages, when using FGSM perturbations for all Cifar-10 models and different ε values. We report the mean values averaged over five different training runs.

| Model | ε = 1/255 | ε = 2/255 | ε = 4/255 | ε = 8/255 | ε = 16/255 |
|---|---|---|---|---|---|
| FP-net (N = 3) | 58.460 | 72.766 | 82.036 | 86.180 | 86.444 |
| FP-net (basic) (N = 3) | 56.662 | 73.634 | 82.296 | 86.406 | 89.108 |
| PyrBlockNet (N = 3) | 60.458 | 75.566 | 83.992 | 88.150 | 89.762 |
| ResNet (N = 3) | 57.284 | 71.912 | 80.378 | 84.338 | 88.016 |
| FP-net (N = 5) | 53.634 | 70.694 | 81.088 | 85.020 | 88.152 |
| FP-net (basic) (N = 5) | 51.112 | 68.510 | 79.036 | 84.034 | 86.830 |
| PyrBlockNet (N = 5) | 54.620 | 70.018 | 79.748 | 84.392 | 87.666 |
| ResNet (N = 5) | 52.434 | 69.652 | 79.678 | 84.166 | 86.956 |
| FP-net (N = 7) | 51.082 | 67.468 | 78.728 | 83.510 | 86.814 |
| FP-net (basic) (N = 7) | 48.360 | 66.104 | 77.236 | 82.486 | 86.256 |
| PyrBlockNet (N = 7) | 53.244 | 69.966 | 80.764 | 85.880 | 88.662 |
| ResNet (N = 7) | 49.788 | 67.852 | 78.590 | 83.290 | 86.544 |
| FP-net (N = 9) | 49.226 | 66.902 | 78.634 | 83.524 | 87.216 |
| FP-net (basic) (N = 9) | 45.896 | 64.284 | 76.076 | 81.652 | 85.484 |
| PyrBlockNet (N = 9) | 52.048 | 68.442 | 78.932 | 84.590 | 87.620 |
| ResNet (N = 9) | 47.888 | 66.056 | 77.066 | 81.992 | 86.410 |
Table 3.

Percentage of changed predictions when using the FGSM perturbations for all Cifar-10 models and different ε values.

| Model | ε = 1/255 | ε = 2/255 | ε = 4/255 | ε = 8/255 | ε = 16/255 |
|---|---|---|---|---|---|
| FP-net (N = 3) | 50.766 | 65.140 | 74.596 | 79.320 | 82.030 |
| FP-net (basic) (N = 3) | 48.858 | 65.928 | 74.906 | 80.128 | 85.654 |
| PyrBlockNet (N = 3) | 52.560 | 67.818 | 76.752 | 82.592 | 87.562 |
| ResNet (N = 3) | 49.526 | 64.228 | 73.038 | 78.480 | 85.148 |
| FP-net (N = 5) | 47.030 | 64.132 | 74.858 | 80.258 | 86.418 |
| FP-net (basic) (N = 5) | 44.482 | 61.960 | 72.662 | 78.526 | 83.700 |
| PyrBlockNet (N = 5) | 47.808 | 63.416 | 73.736 | 80.064 | 85.782 |
| ResNet (N = 5) | 44.996 | 62.298 | 72.546 | 78.148 | 83.916 |
| FP-net (N = 7) | 44.968 | 61.452 | 73.000 | 78.932 | 84.772 |
| FP-net (basic) (N = 7) | 42.102 | 59.922 | 71.294 | 77.402 | 83.376 |
| PyrBlockNet (N = 7) | 47.054 | 63.978 | 75.342 | 81.884 | 86.844 |
| ResNet (N = 7) | 42.820 | 60.948 | 71.898 | 77.538 | 83.390 |
| FP-net (N = 9) | 43.536 | 61.314 | 73.444 | 79.506 | 85.424 |
| FP-net (basic) (N = 9) | 39.804 | 58.240 | 70.196 | 76.448 | 82.376 |
| PyrBlockNet (N = 9) | 45.988 | 62.592 | 73.590 | 80.620 | 85.988 |
| ResNet (N = 9) | 41.206 | 59.426 | 70.662 | 76.534 | 83.808 |
Figure 10.

Percentage of changed predictions (see Equation 9) on the adversarial example test set. A lower value indicates that a network is more robust against attacks created by the FGSM.

The results reiterate that CNN predictions can be significantly altered by deliberate and subtle attacks (we show some example images in the Appendix). Unfortunately, this lack of robustness creates problems of practical relevance beyond such attacks. For example, JPEG compression can create artifacts that have similar effects. To evaluate robustness against JPEG artifacts, we created the Cifar-10 test sets $D_Q$, with each image in $D_Q$ being the JPEG-compressed version of the original image with a quality rate $Q \in \{90, 80, \ldots, 10\}$, 100 being the original image. A low quality indicates a high compression with stronger artifacts (example images are given in the Appendix). In Figure 11, we show the results for the low-compression test set ($Q = 90$) and further results in the Appendix (see Tables 4 and 5).
Figure 11.

Percentage of changed predictions (see Equation 9) for the JPEG-compressed Cifar-10 test set. A lower value indicates that a network is more robust against JPEG artifacts.

Table 4.

Error values in percentages when using JPEG-compression for all Cifar-10 models and different quality (Q) values.

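| Model | Q = 90 | Q = 80 | Q = 70 | Q = 60 | Q = 50 | Q = 40 | Q = 30 | Q = 20 | Q = 10 |
|---|---|---|---|---|---|---|---|---|---|
| FP-net (N = 3) | 12.412 | 16.468 | 19.686 | 22.624 | 25.230 | 28.116 | 32.920 | 41.044 | 57.940 |
| FP-net (basic) (N = 3) | 12.288 | 16.338 | 19.636 | 22.734 | 25.306 | 27.890 | 32.744 | 40.138 | 56.260 |
| PyrBlockNet (N = 3) | 13.080 | 17.356 | 20.676 | 23.790 | 27.108 | 30.324 | 35.326 | 42.798 | 57.336 |
| ResNet (N = 3) | 12.910 | 17.322 | 20.830 | 24.416 | 27.308 | 30.404 | 35.252 | 43.356 | 58.722 |
| FP-net (N = 5) | 11.040 | 15.154 | 18.742 | 22.076 | 24.676 | 27.864 | 32.772 | 41.240 | 58.466 |
| FP-net (basic) (N = 5) | 10.908 | 14.866 | 18.142 | 21.282 | 23.792 | 26.760 | 31.736 | 40.024 | 58.488 |
| PyrBlockNet (N = 5) | 11.918 | 16.412 | 20.342 | 23.898 | 27.078 | 30.694 | 35.888 | 44.424 | 60.176 |
| ResNet (N = 5) | 11.592 | 15.654 | 19.114 | 22.330 | 24.590 | 27.462 | 32.100 | 40.726 | 56.878 |
| FP-net (N = 7) | 10.456 | 14.596 | 18.214 | 21.292 | 24.112 | 26.720 | 31.768 | 40.582 | 58.022 |
| FP-net (basic) (N = 7) | 10.624 | 14.274 | 17.590 | 20.384 | 23.020 | 26.014 | 30.646 | 39.382 | 56.966 |
| PyrBlockNet (N = 7) | 11.396 | 15.928 | 19.344 | 23.356 | 26.518 | 30.456 | 35.908 | 44.706 | 59.660 |
| ResNet (N = 7) | 11.338 | 15.294 | 18.288 | 21.254 | 23.968 | 27.044 | 31.534 | 40.168 | 56.540 |
| FP-net (N = 9) | 10.118 | 14.228 | 18.020 | 21.116 | 23.820 | 26.894 | 32.382 | 40.830 | 57.930 |
| FP-net (basic) (N = 9) | 10.128 | 13.984 | 16.996 | 19.912 | 22.116 | 24.956 | 29.710 | 38.096 | 56.040 |
| PyrBlockNet (N = 9) | 10.974 | 15.290 | 19.088 | 22.640 | 25.504 | 29.244 | 34.520 | 43.218 | 59.616 |
| ResNet (N = 9) | 10.788 | 14.830 | 18.162 | 21.558 | 24.200 | 27.728 | 32.272 | 40.942 | 57.842 |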
Table 5.

Percentage of changed predictions when using JPEG-compression for all Cifar-10 models and different quality (Q) values.

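| Model | Q = 90 | Q = 80 | Q = 70 | Q = 60 | Q = 50 | Q = 40 | Q = 30 | Q = 20 | Q = 10 |
|---|---|---|---|---|---|---|---|---|---|
| FP-net (N = 3) | 9.024 | 14.124 | 17.676 | 21.014 | 23.714 | 26.688 | 31.580 | 40.116 | 57.502 |
| FP-net (basic) (N = 3) | 8.868 | 13.914 | 17.602 | 21.064 | 23.636 | 26.590 | 31.540 | 39.208 | 55.918 |
| PyrBlockNet (N = 3) | 10.112 | 15.086 | 18.864 | 22.176 | 25.670 | 29.102 | 34.168 | 41.842 | 56.930 |
| ResNet (N = 3) | 9.996 | 15.192 | 19.108 | 22.826 | 25.972 | 29.366 | 34.340 | 42.522 | 58.384 |
| FP-net (N = 5) | 8.374 | 13.158 | 17.104 | 20.644 | 23.232 | 26.690 | 31.766 | 40.412 | 58.090 |
| FP-net (basic) (N = 5) | 8.146 | 12.820 | 16.488 | 19.986 | 22.540 | 25.586 | 30.744 | 39.296 | 58.002 |
| PyrBlockNet (N = 5) | 9.322 | 14.590 | 18.746 | 22.488 | 25.770 | 29.656 | 35.006 | 43.668 | 59.840 |
| ResNet (N = 5) | 8.650 | 13.584 | 17.396 | 20.806 | 23.168 | 26.280 | 31.188 | 39.802 | 56.370 |
| FP-net (N = 7) | 8.058 | 12.868 | 16.780 | 19.898 | 22.830 | 25.714 | 30.818 | 39.848 | 57.782 |
| FP-net (basic) (N = 7) | 7.832 | 12.222 | 15.932 | 18.960 | 21.662 | 24.748 | 29.682 | 38.688 | 56.742 |
| PyrBlockNet (N = 7) | 8.804 | 13.980 | 17.920 | 22.114 | 25.410 | 29.432 | 34.964 | 44.002 | 59.232 |
| ResNet (N = 7) | 8.478 | 13.066 | 16.464 | 19.640 | 22.540 | 25.884 | 30.492 | 39.142 | 55.986 |
| FP-net (N = 9) | 7.694 | 12.482 | 16.704 | 19.866 | 22.700 | 25.886 | 31.402 | 40.064 | 57.774 |
| FP-net (basic) (N = 9) | 7.632 | 12.124 | 15.522 | 18.564 | 21.014 | 23.992 | 28.952 | 37.426 | 55.786 |
| PyrBlockNet (N = 9) | 8.740 | 13.498 | 17.594 | 21.258 | 24.294 | 28.190 | 33.596 | 42.560 | 59.332 |
| ResNet (N = 9) | 8.180 | 12.760 | 16.728 | 20.056 | 22.892 | 26.606 | 31.298 | 40.196 | 57.468 |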
Again, using FP-neurons increased the robustness against artifacts. However, even a moderate compression alters up to 10% of the CNNs’ predictions.

Example FP-unit

As shown above, the learned FP-neurons are hyperselective and end-stopped to different degrees. However, these two properties do not fully specify an FP-neuron. When analyzing the individual FP-neurons in more detail, it is difficult to further specify them according to simple properties such as orientation or phase. Nevertheless, some FP-neurons look as if they were taken from a textbook on “how to model end-stopped neurons,” and we show one example in Figure 12.
Figure 12.

An example of a learned filter pair. The top row shows the two learned filters (arrows indicate orientation) and the row below the corresponding Fourier spectra. The third row depicts the responses of the two filters to the image shown in the bottom left (image used to illustrate the selectivity of the FP-unit). The bottom-right panel shows the response of the FP-unit (the product of the filter responses). Such textbook units, however, are rather rare. This particular unit has emerged in the FP-net-59 trained on Cifar-10 without instance normalization and without the ReLU in Equation 2.
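The AND-like behavior of such a unit can be illustrated with a toy sketch. The code below is not the learned unit from Figure 12: the two 2×2 step filters, the 6×6 test image, and all function names are hypothetical choices made for this illustration. It multiplies the rectified responses of two differently oriented filters and shows that the product is nonzero only at the corner, while each single filter also responds along a straight edge.

```python
def correlate_valid(image, kernel):
    """Slide the kernel over the image without padding (valid cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(value):
    return max(0.0, value)


def fp_unit(image, filter_a, filter_b):
    """Product of the two rectified filter responses at every position."""
    resp_a = correlate_valid(image, filter_a)
    resp_b = correlate_valid(image, filter_b)
    return [[relu(a) * relu(b) for a, b in zip(ra, rb)] for ra, rb in zip(resp_a, resp_b)]


def make_corner_image(size=6, start=3):
    """Dark image with a bright lower-right quadrant, i.e., a single corner."""
    return [[1 if r >= start and c >= start else 0 for c in range(size)] for r in range(size)]


# Hypothetical oriented filters: a horizontal and a vertical intensity step.
HORIZONTAL_STEP = [[-1, 1], [-1, 1]]
VERTICAL_STEP = [[-1, -1], [1, 1]]

if __name__ == "__main__":
    image = make_corner_image()
    for row in correlate_valid(image, HORIZONTAL_STEP):
        print(row)  # responds along the whole vertical edge
    for row in fp_unit(image, HORIZONTAL_STEP, VERTICAL_STEP):
        print(row)  # responds only at the corner
```

The product vanishes wherever one of the two filters is silent, which is the case along straight edges and in uniform regions.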


Discussion and conclusions

We have presented a novel FP-net architecture and have demonstrated its competitive performance. To do so, we have designed experiments with state-of-the-art deep networks and showed that we could improve their performance by substituting original blocks in the network architecture with FP-blocks that implement an explicit multiplication of feature maps. Given this simple design rule, we can expect our approach to be of practical use, since any traditional network can easily be transformed into an FP-net that will most likely perform better. We did not employ any hyperparameter tuning specific to the FP-nets but just used the hyperparameters of the original networks; one may thus expect even better performance with additional tuning. We believe that the improvement that comes with FP-nets is due to an appropriate bias, which allows the network to learn efficient representations based on units (model neurons) that are end-stopped to different degrees. The multiplications that we introduce allow for AND rather than OR combinations and thus make the resulting units more selective than linear filters with pointwise nonlinearities. Note that the key feature of FP-nets is that one learns pairs of linear filters, which are then AND combined. In case of FP-nets, the AND is implemented by multiplications. We could, however, show that logarithms (Grüning et al., 2020b) and the minimum operation (Grüning & Barth, 2021a) can also work as AND operation. We consider the improvements that bio-inspired FP-nets achieve over the baseline networks to be the main contribution of our article. Moreover, we have analyzed the selectivity of the FP-units in an attempt to relate them to what is known about visual neurons. We could show that FP-units are indeed end-stopped to different degrees. The emergence of end-stopping in a network that learns based on only the classification error demonstrates that end-stopping is beneficial for the task of object recognition. 
This finding is supported by previously known mathematical results, according to which (a) 2D features such as corners and junctions are statistically rare in natural images, leading to sparse representations (Zetzsche et al., 1993), and (b) 2D features are still unique since there exists a mathematical proof that 0D (uniform) and 1D (straight) regions in images are redundant (Mota & Barth, 2000), although being statistically frequent. Of course, the considerations above cannot be taken to imply that biological vision implements an FP-net architecture, especially as the FP-nets implement additional and typical deep-network operations such as linear recombinations that increase the entropy of the representation. In other words, much of what well-performing deep networks do is not something one would necessarily consider to be optimal. It is known that sparse-coding units are more selective than typical CNN units, that is, than linear neurons with pointwise nonlinearities (Paiton et al., 2020), and thus less prone to certain adversarial attacks. This increased selectivity has been quantified with the curvature of the iso-response contours. We could show that the iso-response contours of the FP-units are curved, with the degree of curvature depending on the angle between the multiplied feature vectors, and that a large number of hyperselective units emerge in FP-nets trained for object recognition. Furthermore, our results show that FP-nets are indeed more robust against adversarial attacks and compression artifacts, and this is, again, due to the vision-inspired FP-units.
  8 in total

1.  Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects.

Authors:  R P Rao; D H Ballard
Journal:  Nat Neurosci       Date:  1999-01       Impact factor: 24.884

2.  RECEPTIVE FIELDS AND FUNCTIONAL ARCHITECTURE IN TWO NONSTRIATE VISUAL AREAS (18 AND 19) OF THE CAT.

Authors:  D H HUBEL; T N WIESEL
Journal:  J Neurophysiol       Date:  1965-03       Impact factor: 2.714

3.  A geometric framework for nonlinear visual coding.

Authors:  E Barth; A Watson
Journal:  Opt Express       Date:  2000-08-14       Impact factor: 3.894

4.  Fundamental limits of linear filters in the visual processing of two-dimensional signals.

Authors:  C Zetzsche; E Barth
Journal:  Vision Res       Date:  1990       Impact factor: 1.886

Review 5.  Deep learning.

Authors:  Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal:  Nature       Date:  2015-05-28       Impact factor: 49.962

Review 6.  Selectivity, hyperselectivity, and the tuning of V1 neurons.

Authors:  Kedarnath P Vilankar; David J Field
Journal:  J Vis       Date:  2017-08-01       Impact factor: 2.240

7.  Deep learning-Using machine learning to study biological vision.

Authors:  Najib J Majaj; Denis G Pelli
Journal:  J Vis       Date:  2018-12-03       Impact factor: 2.240

8.  Selectivity and robustness of sparse coding networks.

Authors:  Dylan M Paiton; Charles G Frye; Sheng Y Lundquist; Joel D Bowen; Ryan Zarcone; Bruno A Olshausen
Journal:  J Vis       Date:  2020-11-02       Impact factor: 2.240
