Philipp Grüning1,2, Thomas Martinetz1,3, Erhardt Barth1,4. 1. Institute for Neuro- and Bioinformatics, University of Lübeck, Lübeck, Germany. 2. gruening@inb.uni-luebeck.de https://www.inb.uni-luebeck.de/mitarbeiter/mitarbeiter/wissenschaftliche-mitarbeiter/philipp-gruening.html. 3. martinet@inb.uni-luebeck.de https://www.inb.uni-luebeck.de/mitarbeiter/mitarbeiter/professoren/107.html. 4. barth@inb.uni-luebeck.de https://www.inb.uni-luebeck.de/mitarbeiter/mitarbeiter/professoren/erhardt-barth.html.
Abstract
Feature-product networks (FP-nets) are inspired by end-stopped cortical cells with FP-units that multiply the outputs of two filters. We enhance state-of-the-art deep networks, such as the ResNet and MobileNet, with FP-units and show that the resulting FP-nets perform better on the Cifar-10 and ImageNet benchmarks. Moreover, we analyze the hyperselectivity of the FP-net model neurons and show that this property makes FP-nets less sensitive to adversarial attacks and JPEG artifacts. We then show that the learned model neurons are end-stopped to different degrees and that they provide sparse representations with an entropy that decreases with hyperselectivity.
For machine learning to work, one needs appropriate biases to constrain the solution for the problem at hand. Deep convolutional neural networks (CNNs), for example, are successful due to two constraints that specialize them relative to more general networks such as the multilayer perceptron (MLP): sparse connections and shared weights. It is well known that biases cannot be learned from the data or derived by logical deduction (Watanabe, 1985). In computer vision, appropriate biases can be obtained, as in the case of the CNNs, by studying biological vision (LeCun et al., 2015; Majaj & Pelli, 2018). Besides inspiring the use of localized (oriented) filters (the two CNN biases above) followed by a pointwise nonlinearity, biological vision can provide additional insight, an issue that currently receives somewhat limited attention in the deep-learning community (Majaj & Pelli, 2018; Paiton et al., 2020).
We here focus on the principle of efficient coding (Barlow, 1961; Simoncelli & Olshausen, 2001) and the related neural phenomenon of end-stopping (Hubel & Wiesel, 1965). Statistical analysis shows that oriented linear filters reduce the entropy of natural images by encoding oriented straight patterns (one-dimensional [1D] regions) such as vertical and horizontal edges (Zetzsche et al., 1993). In cortical area V2, however, the majority of cells are end-stopped to different degrees (Hubel & Wiesel, 1965). End-stopped cells are thought to detect two-dimensional (2D) regions such as junctions and corners. Since 2D regions are unique and sparse in natural images (Barth & Watson, 2000; Mota & Barth, 2000; Zetzsche et al., 1993), they represent images efficiently, that is, with a high degree of sparseness and minimal information loss. A standard way of modeling end-stopped cells is to multiply outputs of orientation-selective cells, resulting in an AND-combination of simple-cell outputs (Zetzsche & Barth, 1990).
For example, a corner can be detected by the logical combination of “horizontal edge AND vertical edge.” In Paiton et al. (2020), the authors argue convincingly that principles adopted from vision should be beneficial for deep networks and that the exploitation of multiplicative interactions between neurons has not been sufficiently explored in this specific context. There is, nevertheless, a vast literature on sigma-pi networks in general (e.g., Mel & Koch, 1990; Rumelhart et al., 1986), which is not surprising since such networks define a large class of possible systems.
It has been shown that end-stopping can emerge from the principle of predictive coding based on recursive connections (Rao & Ballard, 1999); the latter has also been observed in Barth and Zetzsche (1998). Note that in Rao and Ballard (1999), end-stopping emerges from unsupervised learning with natural images, whereas in our case, it emerges from task-driven supervised learning in a natural vision task.
Feature-product networks (FP-nets) implement a network architecture that contains explicit multiplications of the feature maps obtained with pairs of linear filters. The main feature of these networks is that they learn the appropriate filter pairs to be multiplied based on the task at hand. An early FP-net architecture has been presented as a preprint (Grüning et al., 2020b), and it has been shown in Grüning and Barth (2021) that a similar network can predict subjective image quality well. Of course, we do not assume that neurons would compute ideal multiplications; the AND terms could be created in alternative ways, for example, by using logarithms (Grüning et al., 2020b) or the minimum operation (Grüning & Barth, 2021a) instead of multiplications.
AND terms could also be generated by traditional CNNs with linear filters followed by simple ReLU nonlinearities (Barth & Zetzsche, 1998), but this would require larger networks and would be limited in terms of the possible tuning properties of the resulting nonlinear functions (see also Paiton et al., 2020, regarding the limits of pointwise nonlinearities). Here, we present a novel FP-net architecture that is closer to vision models than the ones introduced previously in Grüning and Barth (2021b) and Grüning et al. (2020b). We first demonstrate its performance and then analyze the learned units by relating them to biological vision.
Regarding the use of multiplicative terms in CNNs, Zoumpourlis et al. (2017) have shown that quadratic forms added to the first layer of a CNN can improve generalization. An FP-net can be interpreted as a special case of a network with an additional second-order Volterra kernel, but it has far fewer parameters. However, CNNs are also special cases of MLPs and, as we have argued above, the challenge is to find the right biases that can take us from the general to the more special case. For more comprehensive overviews on how FP-nets relate to various deep-network architectures, especially to bilinear CNNs (Li et al., 2017), see Grüning et al. (2020a) and Grüning and Barth (2021b). In addition, we would like to mention recent work of Chrysos et al. (2020), which illustrates that the Hadamard product of layers in deep networks and the resulting higher-order polynomial representation can improve classification performance. Finally, in recurrent networks, multiplications are used to implement useful gating mechanisms (Collins et al., 2016).
FP-nets as competitive deep networks
With FP-nets, we denote a deep-network architecture that contains one or several FP-blocks. Each block of a deep network implements a sequence of layers and operations that transforms an input tensor into an output tensor. A tensor consists of a number of feature maps, each with a spatial width and height that may be altered by the stride of the block. The typical input tensor for a CNN is an image, the three color channels being the feature maps. The sequence of operations in an FP-block is shown in Figure 1 and consists of three steps: (a) a first linear combination, (b) the feature product, and (c) a second linear combination.
In the first step (Equation 1), the feature maps of the input tensor are linearly combined at each pixel position with learned weights, followed by a ReLU. The number of resulting feature maps is set by the expansion factor q, which controls the block size.
The second step is the computation of feature products, the centerpiece of the FP-block (Equation 2). Each feature map obtained in the first step is convolved with two learned filters. Each filtering is followed by instance normalization (IN) (Ulyanov et al., 2016) and a ReLU nonlinearity, yielding two new feature maps. Subsequently, the product of the two filter outputs is computed. For any particular image patch of a feature map, the filter operation is the scalar product of the vectorized image patch with each of the two vectorized filters. If the stride of the filter operation is greater than 1, the width and height of the resulting tensor are subsampled. Instance normalization uses the mean value and standard deviation of the feature map after convolution with either of the two filters (Equations 3 and 4), computed over all pixels of the filter result.
In the third step, a second linear combination transforms the feature-product tensor. To comply with the baseline architectures ResNet and MobileNet, a residual connection defines the final output as the sum of the input tensor and the result of the second linear combination (Equation 5).
Figure 1.
The structure of an FP-block is illustrated with rectangles and circles for the various operations that gradually transform the input tensor into the output tensor. The first row within each rectangle denotes which operations are applied in sequence. In the second row, the change in the number of feature maps is given. The arrows in the figure indicate the inputs to the different operations and are labeled with the tensors defined in the equations (see text). Note that the output of the first linear combination is input to two different depthwise-separable convolutions (DWS, middle rectangles) that are learned. Convolutions are followed by instance normalization (IN) and ReLU nonlinearity, resulting in two different tensors. The feature-product tensor is the result of element-wise multiplication of these two tensors (see Equation 2). A second linear combination, depicted by the bottom rectangle, follows. For the final output, a residual connection adds the input tensor to the result of the second linear combination (see Equation 5).
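The feature-product step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the hand-set 3×3 edge filters, the zero padding, and the small constant `eps` are assumptions made for illustration, and the two linear combinations and the residual connection of the full FP-block are omitted. The sketch filters one feature map with two filters, applies instance normalization and ReLU to each result, and multiplies the two outputs element-wise.

```python
import math


def correlate_same(fmap, kernel):
    """Filter a 2D feature map with a k x k kernel (stride 1, zero padding)."""
    h, w = len(fmap), len(fmap[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    y, x = i + a - r, j + b - r
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[a][b] * fmap[y][x]
            out[i][j] = acc
    return out


def instance_norm_relu(fmap, eps=1e-5):
    """Normalize one feature map by its own mean and standard deviation, then apply ReLU."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var + eps)  # eps is an assumed stabilizer, not taken from the text
    return [[max(0.0, (v - mean) / std) for v in row] for row in fmap]


def feature_product(fmap, filter_a, filter_b):
    """Element-wise product of two filtered, normalized, and rectified feature maps."""
    out_a = instance_norm_relu(correlate_same(fmap, filter_a))
    out_b = instance_norm_relu(correlate_same(fmap, filter_b))
    return [[p * q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(out_a, out_b)]


# Toy input: a bright 4 x 4 square on a dark 8 x 8 background.
image = [[1.0 if 2 <= i <= 5 and 2 <= j <= 5 else 0.0 for j in range(8)] for i in range(8)]

# Hand-set filters (hypothetical; in an FP-net both filters are learned).
horizontal_edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

response = feature_product(image, horizontal_edge, vertical_edge)
nonzero = [(i, j) for i in range(8) for j in range(8) if response[i][j] > 0]
print(nonzero)
```

With these particular filters, the product is nonzero only around the top-left corner of the square and vanishes on straight edges and in the uniform interior, which is the AND-combination of a horizontal and a vertical edge. Only one corner responds because the ReLU keeps a single polarity of each filter output.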
Using the above FP-block, we designed four different FP-nets based on different baseline architectures: (a) the original ResNet and (b) the PyrBlockNet, both trained on Cifar-10, and (c) a ResNet-50 and (d) a MobileNet-V2, both trained on ImageNet. A stack is a larger segment of the network, consisting of several blocks. Except for the first stack that may have a stride of 1, each new stack starts with a block with a stride of 2 that reduces the size of each feature map. Within a stack, all blocks operate on feature maps of the same size. Different network architectures may have different numbers and types of blocks. In our case, basic blocks, pyramid blocks, bottleneck blocks, and inverted residual blocks define the ResNet-Cifar, PyrBlockNet, ResNet-50, and MobileNet-V2 architecture, respectively. The block is the core module of an architecture and contains several layers. Layers are the smallest network building units such as convolution layers and max-pooling layers.
Figure 2 shows an example of a ResNet-Cifar architecture that has three stacks with five blocks each. Each first block of the second and third stacks contains a convolution layer with a stride of 2 that downsamples the input. The two other architectures that we used are similar: The ResNet-50 has four stacks with varying numbers of bottleneck blocks. The MobileNet-V2 has six stacks consisting of inverted-residual blocks.
Figure 2.
Architecture of the ResNet-32 used on Cifar-10: The network contains three stacks with five blocks each. Each block contains several layers, such as convolution layers, batch normalization (BN) layers, and ReLU and Softmax nonlinearities. Convolution layers with a stride larger than 1 subsample the input, reducing the width and height of the feature maps. The number of feature maps can change within a block, for example, from 16 to 32. The FP-net has the same baseline architecture, but each first block in a stack (colored in red) is replaced with an FP-block.
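The substitution of each stack's first block can be sketched as a small list transformation. The helper names and the string labels are hypothetical. The layer count assumes two convolution layers per baseline block, three per FP-block (one additional convolution layer), and one first convolution plus one final linear layer per network, a convention that reproduces the network names used for the Cifar-10 experiments.

```python
def resnet_cifar(blocks_per_stack, block="basic", num_stacks=3):
    """Describe a Cifar-style network as a list of stacks, each a list of block labels."""
    return [[block] * blocks_per_stack for _ in range(num_stacks)]


def to_fp_net(stacks):
    """Design rule: substitute each stack's first block with an FP-block."""
    return [["fp"] + list(stack[1:]) for stack in stacks]


def count_layers(stacks, convs_per_block=2, convs_per_fp_block=3):
    """Count convolution and linear layers under the assumed convention."""
    total = 2  # first convolution layer + final linear layer (assumed)
    for stack in stacks:
        for block in stack:
            total += convs_per_fp_block if block == "fp" else convs_per_block
    return total


for n in (3, 5, 7, 9):
    baseline = resnet_cifar(n)
    fp_net = to_fp_net(baseline)
    print(n, count_layers(baseline), count_layers(fp_net))
```

For 3, 5, 7, and 9 blocks per stack, this yields 20, 32, 44, and 56 layers for the baseline and 23, 35, 47, and 59 layers after the substitution.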
We transform the four baseline architectures defined above into FP-nets using a simple design rule: Substitute each stack's first block with an FP-block. The input and output dimensions of the block are kept equal; only the internal operations differ.
We developed this design rule to improve upon already well-established architectures, making FP-nets practical since only a few changes need to be made to create an FP-net. To be compatible with state-of-the-art architectures, the FP-block has a structure similar to the MobileNet-V2 block (Sandler et al., 2018). We found that combinations of convolution blocks and FP-blocks work best and that larger kernel sizes do not improve performance. One way to view a stack is that it constitutes a visual processing chain for a specific image scale. One would expect end-stopping to be more useful at the beginning of this chain. Thus, we replaced the first block of each stack. Note, however, that later stacks, for example, the second and third stack in the Cifar-10 networks, already work with highly processed inputs coming from the previous stacks. Therefore, one would expect that there is a lower necessity of extracting 2D regions in later stacks.
Indeed, we will show, when analyzing the angle distributions of the FP-blocks, that highly selective neurons are more common in earlier stacks.
We train and test several FP-nets on the two well-known benchmarks Cifar-10 (Krizhevsky et al., 2021) and ImageNet (Deng et al., 2009).
Due to the moderate size of the data set, Cifar-10 is often used to evaluate the potential of new architectures and designs. For our experiments on this data set, we used ResNets (He et al., 2016) as baseline; see Figure 2 for an example. These networks have three stacks, each consisting of the same number of blocks. We evaluated two types of the ResNet-20, ResNet-32, ResNet-44, and ResNet-56, with 3, 5, 7, and 9 blocks per stack, respectively (the numbers after the names indicate the number of convolution or linear layers). Since the first publication of the ResNet architecture, several additional blocks were proposed; see Han et al. (2017) for an overview. As two baselines on Cifar-10, we used the original ResNet and a variant using the pyramid block that we denote PyrBlockNet. For both variants, we created FP-nets by replacing baseline blocks with FP-blocks according to our design rule. We used the same number of blocks, but note that each FP-block contains one additional convolution layer. The FP-net-23, FP-net-35, FP-net-47, and FP-net-59 are based on the PyrBlockNet: Each stack's first block is an FP-block, and all other blocks are pyramid blocks. Analogously, FP-net (basic) denotes an FP-net based on the original ResNet: Each stack's first block is an FP-block, and the remaining blocks are basic blocks.
Next, we evaluated the performance of FP-nets with the larger ImageNet data set that contains over 1.2 million training examples and 50,000 validation examples (we tested on the publicly available validation set). With larger input images and 1,000 classes, ImageNet poses a greater challenge than Cifar-10.
We compared the ResNet-50 to two FP-net-50: one smaller net with an expansion factor q = 0.8 and a slightly larger network with q = 1. In both cases, for each of the four stacks of the ResNet-50, the first block was replaced by an FP-block to obtain the FP-net-50. Note that, if not explicitly mentioned, the term FP-net-50 refers to the variant.
To further validate our approach, we evaluated an FP-net based on the popular MobileNet-V2 architecture. As with the ResNet, we replaced the first block of each stack with an FP-block, using .
The results of the Cifar-10 experiments are shown in Figure 3: The left side compares the original ResNet to the FP-net (basic), and the right side compares the PyrBlockNet to the FP-net. Each point of the two curves shows, for one particular network (i.e., one particular number of blocks), the best test error over all training epochs, averaged over five runs. The black line shows the baseline network, the green line the resulting FP-net when substituting the first blocks of the baseline's stacks. The x-axis displays the number of parameters, a number that increases with the number of blocks. Note, however, that the inclusion of FP-blocks reduces the number of parameters. Overall, the FP-nets are more compact and perform better with a lower test error and only a small overlap in the standard deviations.
Figure 3.
The y-axis displays the best test score on Cifar-10 averaged over five runs, and the bars indicate the standard deviations. The transparent area indicates the range from the minimum to the maximum. Each diamond represents one network having a specific number of parameters (in thousands) denoted on the x-axis. On the left, the black solid line shows the baseline ResNet results with 20, 32, 44, and 56 layers, and the green solid line the results for the corresponding FP-nets (basic). On the right, the black solid line shows the baseline PyrBlockNet and the green solid line the results for the FP-nets. Substituting each stack's first block with an FP-block yielded, in all but one case, a significantly better performance with a reduced number of parameters.
Table 1 shows the results on ImageNet. Note that the FP-net-50 (q = 1) performs better than the baseline ResNet-50, and the validation error is reduced by almost 0.4. When considering the already compact MobileNet architecture, the FP-net performs better than the MobileNet with an error decreased by 0.2. We trained the MobileNet-V2 baseline network ourselves to obtain its validation error. For the ResNet-50, we report the value from the Tensorpack repository (Wu, 2016). The performance depending on the number of parameters for the ResNet and FP-variants is illustrated in Figure 4.
Table 1.
ImageNet validation errors for different FP-nets and baselines: We transformed two baseline network architectures, the ResNet-50 and the MobileNet-V2, into FP-nets, here denoted as FP-net-50 and FP-MobileNet. The transformations are done by substituting specific blocks of the baseline networks with an FP-block (see text). Additionally, by choosing different expansion factors q, we created one FP-net that is smaller than the baseline (q = 0.8) and one larger network (q = 1). Note that FP-nets perform better than the baseline models if there is only a slight increase in the number of parameters (shown in millions).
Model                      No. of parameters (M)   Error
ResNet-50 (baseline)       25.6                    23.61
FP-net-50 (q=0.8)          24.3                    23.80
FP-net-50 (q=1)            26.0                    23.24
MobileNet-V2 (baseline)    3.5                     28.71
FP-MobileNet               3.5                     28.53
Figure 4.
Number of parameters vs. ImageNet validation error for the ResNet-50 (black diamond) and two FP-nets (green dots) with different expansion factors q.
FP-nets and visual coding
Hyperselectivity of FP-units
Vilankar and Field (2017) used the term hyperselectivity to quantify how strongly a neuron is tuned to its optimal stimulus, that is, how quickly the response drops when the stimulus deviates from the optimal one. In the context of deep learning, hyperselectivity is relevant because it can increase robustness, for example, robustness against adversarial attacks (Paiton et al., 2020). One way to quantify hyperselectivity is to measure the curvature of iso-response contours. Given a high-dimensional input to a function, a surface in the input space may exist such that, for all points on the surface, the output of the function is constant. As the input dimension can be high, 2D projections are used to analyze such iso-surfaces, which in two dimensions become iso-response contours.
The typical linear-nonlinear (LN) model neuron used in CNNs is a function that involves a linear projection on a weight vector followed by a pointwise nonlinearity. To analyze the iso-response contour of such a neuron, one first projects the input on the axis corresponding to the optimal stimulus. To find a second axis, one searches for a vector orthogonal to the first axis, for example, by picking random values and using the Gram–Schmidt process (see Equation 16) to transform the random vector to one that is orthogonal to the first axis. When looking at the output of an LN-neuron for the optimal stimulus perturbed by any orthogonal vector, the iso-response contour is always a straight line parallel to the orthogonal axis, because the orthogonal perturbation does not change the projection on the weight vector. Thus, for LN-neurons, the iso-response contours have zero curvature. For hyperselective neurons, there exist vectors that are orthogonal to the optimal stimulus and decrease the neuron's optimal response. In this case, the exo-origin iso-response contour bends away from the origin of the basis defined by the optimal stimulus and the orthogonal vector. A higher curvature of this bend indicates a more significant activation dropoff in regions that are different from the optimal stimulus (i.e., a greater hyperselectivity).
One way to quantify the curvature is to use the coefficient of the quadratic term obtained by fitting a second-order polynomial to the iso-response contour. FP-nets contain FP-blocks that consist of FP-units, or FP-neurons, which yield the feature-product output for a pixel in a feature map as defined by Equation 2. As shown in the Appendix, FP-neurons exhibit curved exo-origin iso-response contours with a curvature that depends on an angle (see Appendix). Iso-response contours are shown in Figure 5 for different values of this angle. Note that curvature, and thus hyperselectivity, increases with the angle. Accordingly, a large angle leads to a lower entropy of the resulting feature maps; see Figure 6.
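The difference between an LN-neuron and an FP-neuron can be illustrated with a two-dimensional toy example. The sketch below rests on several assumptions that are not taken from the text: the angle is taken to be the angle between the two unit-norm filter vectors, the optimal stimulus is placed on their bisector, instance normalization is omitted, and all function names are hypothetical. Under these assumptions, an input with coordinate a along the optimal stimulus and b along the orthogonal axis yields the product a² cos²(angle/2) − b² sin²(angle/2), so the relative response at a = 1 is 1 − b² tan²(angle/2).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relu(t):
    return max(0.0, t)


def orthogonalize(u, r):
    """One Gram-Schmidt step: remove from r its component along u, then normalize."""
    scale = dot(r, u) / dot(u, u)
    residual = [ri - scale * ui for ri, ui in zip(r, u)]
    norm = math.sqrt(dot(residual, residual))
    return [c / norm for c in residual]


def ln_neuron(x, w):
    """Linear projection on a weight vector followed by a pointwise ReLU."""
    return relu(dot(x, w))


def fp_neuron(x, v, g):
    """Product of two rectified filter outputs (normalization omitted in this sketch)."""
    return relu(dot(x, v)) * relu(dot(x, g))


def filter_pair(angle):
    """Two unit-norm filters placed symmetrically around the first axis (assumed setup)."""
    half = angle / 2.0
    return [math.cos(half), math.sin(half)], [math.cos(half), -math.sin(half)]


x_opt = [1.0, 0.0]                       # optimal stimulus on the bisector
orth = orthogonalize(x_opt, [0.3, 0.7])  # arbitrary start vector made orthogonal to x_opt
x_pert = [a + 0.5 * b for a, b in zip(x_opt, orth)]

# LN-neuron: the orthogonal perturbation leaves the response unchanged.
ln_ratio = ln_neuron(x_pert, x_opt) / ln_neuron(x_opt, x_opt)

# FP-neuron: the relative response drops, and drops more for larger angles.
fp_ratios = []
for angle in (0.0, math.pi / 4, math.pi / 2):
    v, g = filter_pair(angle)
    fp_ratios.append(fp_neuron(x_pert, v, g) / fp_neuron(x_opt, v, g))

print(ln_ratio, fp_ratios)
```

For an angle of zero, the two filters coincide and the FP-neuron behaves like a squared linear neuron with straight iso-response contours; for larger angles, the same orthogonal offset reduces the response more strongly.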
Figure 5.
Iso-response contour plots for different values of the angle. Each plot shows values that were determined by using Equation 23; furthermore, normalization and quantization to six bins were applied. The horizontal axis points in the direction of the optimal stimulus, and the vertical axis is orthogonal to the optimal stimulus; both are indexed by the corresponding coefficients in Equation 23. The black lines indicate the zero contour.
Figure 6.
Scatterplots of entropy over hyperselectivity (indicated by the angle). Each dot corresponds to an FP-neuron. The color codes indicate the position of each neuron in the network (i.e., the number of convolution layers). The entropy of a particular FP-neuron's feature map is estimated as described in the Appendix and plotted against the angle. The left panel shows results for the FP-net-50 trained on ImageNet after 2, 11, 23, and 41 convolution layers, and the right panel shows results for the FP-net-59 trained on Cifar-10 after 2, 21, and 40 convolution layers. Note the correlation between entropy and the angle. Hyperselectivity is directly linked to the angle, as illustrated in Figure 5.
Entropy and degree of end-stopping
To further support the view that FP-neurons are hyperselective depending on the angle, we analyzed the entropy of the feature maps generated by different FP-neurons. The results in Figure 6 show that the learned filters tend to have an angle larger than zero, that is, the majority of FP-neurons are hyperselective, and that a large angle leads to a lower entropy. Details of how the entropy is computed are given in the Appendix.
In order to analyze the end-stopping behavior of the model neurons that are learned in the FP-nets trained on Cifar-10 and ImageNet, we needed to quantify the degree of end-stopping. In order to relate to physiological measurements, we started by analyzing the response of FP-neurons to straight lines and line ends, but this turned out to be problematic because the FP-nets use small filters and subsample the input. To keep the analogy, but with a more robust measure, we used a square as input and quantified the average responses to the uniform zero-dimensional (0D) regions, the straight 1D edges, and the 2D corners. The degree of end-stopping is then defined by the relation between 1D and 2D responses. In order to account for ON/OFF-type responses, we used both a bright and a dark square. The results are shown in Figure 7, and the details of the algorithm are given in the Appendix.
Figure 7.
Distribution of neurons plotted over the degree of end-stopping. Distributions are shown for the first block of the first stack for different models. Left: the activation of the first convolution, after batch normalization and ReLU, of a pyramid block in a PyrBlockNet (nine blocks per stack). Middle: the FP-neurons of an FP-block for an FP-net trained on Cifar-10 (nine blocks per stack). Right: the FP-neurons of an FP-block for the FP-net-50 trained on ImageNet. Blue bars show normalized histograms for the ratio that quantifies the relation between responses to straight edges (1D) and corners (2D); see Appendix. Neurons that respond to 0D regions (the center of a square) are excluded from the blue histogram and shown separately as orange bars. Neurons that do not respond at all (0D, 1D, and 2D responses are all zero) are also excluded from the blue histogram and are shown as green bars.
Note that, like the real neurons in cortical areas V1 and V2, the model neurons in the FP-net are end-stopped to different degrees. Thus, end-stopping seems to be beneficial for both the ImageNet and Cifar-10 tasks, since the emergence of end-stopping is here driven by the classification error. As expected, the multiplication in the FP-block shifts the distribution toward a higher degree of end-stopping. However, the network could have learned filter pairs that do not lead to end-stopped FP-neurons. The bias that we introduce (i.e., the multiplication) just makes it easier for the network to learn end-stopped representations.
The angle distributions in Figure 8 show that indeed linear FP-neurons are learned as well since more than 15% of FP-neurons have an angle near zero. With increasing network depth, the number of linear FP-neurons increases, indicating that hyperselectivity and especially end-stopping are more frequent in earlier stages of the visual processing chain.
Figure 8.
Distribution of FP-neurons as a function of hyperselectivity (indicated by the angle) and for different positions in the network. Note that the majority of neurons are hyperselective to different degrees and that hyperselectivity is reduced later in the network. The left panel shows results for the FP-net-50 trained on ImageNet after 2, 11, 23, and 41 convolution layers. The right panel shows results for the FP-net-59 trained on Cifar-10 after 2, 21, and 40 convolution layers.
FP-neurons are more robust against adversarial attacks
Although they outperform almost all alternative approaches on many vision tasks, CNNs are surprisingly sensitive to barely visible perturbations of the input images (Szegedy et al., 2013). An adversarial attack on a classifier function $f$ adds a noise pattern $\eta$ to an input image $x$ so that $f(x + \eta)$ does not return the correct class $y$. Furthermore, the attacker ensures that some $p$-norm of $\eta$ does not exceed $\epsilon$. In many cases, including this work, the infinity-norm is chosen, and the $\epsilon$ values are in the set $\{1/255, 2/255, 4/255, \ldots\}$. Thus, for example, for $\epsilon = 1/255$, each 8-bit pixel value is altered at most by adding or subtracting the value 1.
Goodfellow et al. (2014) argue that the main reason for the sensitivity to adversarial examples is the linearity of CNNs: With a high-dimensional input, one can substantially change a linear neuron's output, even with small perturbations. Consider the output $w^\top (x + \eta) = w^\top x + w^\top \eta$ of an LN-neuron for an input $x$ with dimension $n$ perturbed by $\eta$. We choose $\eta$ to be the sign function of the weight vector $w$ multiplied with $\epsilon$: $\eta = \epsilon\,\mathrm{sign}(w)$. Thus, $\eta$ roughly points in the direction of the optimal stimulus $w$ (which is also the gradient), but its infinity-norm does not exceed $\epsilon$. Assuming that the mean absolute value of the entries of $w$ is $m$, $w^\top \eta$ is approximately equal to $\epsilon m n$. Accordingly, a significant change of the LN-neuron's output can be achieved by a small $\epsilon$ value if the input dimension $n$ is large, which is the case for many vision-related tasks.
This gradient-ascent method can also be applied to nonlinear neurons. Within a local region, the output of almost any function can be approximated by a linear function. To optimally increase the output, the input needs to be moved along the gradient direction. The fast gradient sign method (FGSM; Goodfellow et al., 2014) perturbs the original input image by adding $\epsilon$ times the sign of the gradient of the classification loss with respect to the input. Another approach is to define $\eta$ to be the gradient times a positive step size followed by clipping to $[-\epsilon, \epsilon]$. The clipped iterative gradient ascent (CIGA) greedily moves along the direction of the highest linear increase (Equation 6): at each iteration, an unbounded gradient step is taken, and each entry of the unbounded result is then clipped to $[-\epsilon, \epsilon]$. In the following, we use CIGA in our illustrations of the principle, and in our experiments, we employ FGSM as it is a widely recognized adversarial attack method.
In an iso-response contour plot, one can easily spot the direction of the gradient, which is orthogonal to the iso-response contours (Paiton et al., 2020). In Figure 9 on the left, the gradient for an LN-neuron is parallel to the optimal stimulus (black line). As long as the initial input yields a nonzero gradient, each step of CIGA maximally increases the LN-neuron output. Thus, the algorithm's effectiveness is bounded only by $\epsilon$ and is largely independent of the initial input. For a sufficiently large step size, CIGA finds the optimal solution in one step. We now investigate the effects of CIGA on a simplified version of an FP-neuron, the product of the rectified projections of the input onto the two filters of a filter pair.
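Note that in the following particular example, the input is chosen to yield nonnegative projections on both filters; thus, we can remove the ReLUs. By the product rule, the resulting gradient is the sum of the two filters, each weighted by the projection of the input onto the other filter.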
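In symbols, the simplified FP-neuron, its gradient, and one CIGA step could be written as in the sketch below. The symbol names are assumptions chosen for illustration: $x$ is the input, $v$ and $g$ are the two filters of the pair, $x_0$ is the initial input, $\eta^{i}$ is the perturbation after iteration $i$, $\tilde{\eta}^{i}$ is the unbounded result before clipping, $\lambda$ is the step size, and $f$ is the neuron output that is maximized.

```latex
% Simplified FP-neuron without ReLUs (valid for nonnegative projections)
% and its gradient, obtained with the product rule.
F(x) = (v^\top x)\,(g^\top x), \qquad
\nabla_x F(x) = (g^\top x)\, v + (v^\top x)\, g

% One CIGA iteration: unbounded gradient step, then entrywise clipping.
\tilde{\eta}^{\,i} = \eta^{\,i-1} + \lambda\, \nabla_x f\!\left(x_0 + \eta^{\,i-1}\right), \qquad
\eta^{\,i}_j = \min\!\bigl(\epsilon,\ \max\bigl(-\epsilon,\ \tilde{\eta}^{\,i}_j\bigr)\bigr)
```

The gradient of the FP-neuron depends on the current input through both projections, whereas the gradient of a linear neuron is the constant weight vector.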
The effectiveness of an iteration step strongly depends on the current position. The highest possible increase would be obtained along the line defined by the optimal stimulus. In Figure 9 on the right, this is the black line. If the initial input is located on this line, any step in the gradient direction yields an optimal increase of the FP-neuron output. However, for any other position with a nonzero gradient, an unbounded iteration step would move toward the optimal stimulus line. The blue curve in Figure 9 shows the path for several iterations of CIGA: Starting above the optimal stimulus line, the path slowly converges to the optimal stimulus line, eventually moving almost parallel to it. Once the threshold of 1 is reached in the horizontal dimension, the (now bounded) path runs parallel to the vertical dimension to increase the neuron output further. The optimal solution is found once the bound is also reached in the vertical dimension. The important difference when comparing with LN-neurons is that there are numerous conditions (depending, among others, on the initial input, the bound $\epsilon$, and the step size) where CIGA would need several steps to find an optimal solution. This reduced effectiveness of the gradient ascent illustrates why hyperselective neurons are more robust against adversarial attacks; for example, if $\epsilon$ is too small, the step size is chosen poorly, or too few iterations are used, an attack might not increase the FP-neuron output by much. Note that single neurons are usually not the target of adversarial attacks; instead, the gradient is determined on the classification loss function. Still, the argument holds that hyperselective neurons are harder to activate than LN-neurons, resulting in an increased robustness.
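The behavior described above can be reproduced with a small sketch of clipped iterative gradient ascent in two dimensions. All numerical values (weights, filter pair, initial input, bound, step size) are hypothetical and are not the ones used in Figure 9; the helper names are likewise assumptions made for illustration.

```python
def clip(value, bound):
    """Clip a scalar to the interval [-bound, bound]."""
    return max(-bound, min(bound, value))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def ciga(grad, x0, eps, step, iterations):
    """Clipped iterative gradient ascent; returns the perturbation after each step."""
    eta = [0.0] * len(x0)
    path = []
    for _ in range(iterations):
        x = [a + b for a, b in zip(x0, eta)]
        gradient = grad(x)
        # Unbounded gradient step, then clip every entry to [-eps, eps].
        eta = [clip(e + step * gj, eps) for e, gj in zip(eta, gradient)]
        path.append(list(eta))
    return path

# Hypothetical 2D example values (assumptions, not taken from the article).
w = [1.0, 1.0]                  # LN-neuron weight vector
v, g = [1.0, 0.0], [0.0, 1.0]   # FP-neuron filter pair
x0 = [0.1, 1.0]                 # initial input with nonnegative projections
eps, step = 1.0, 0.05

def ln_grad(x):
    # Gradient of the linear response w.x is the constant weight vector.
    return w

def fp_grad(x):
    # Product rule for (v.x) * (g.x).
    return [dot(g, x) * a + dot(v, x) * b for a, b in zip(v, g)]

def steps_to_optimum(path, optimum):
    """Index of the first iteration at which the optimum is reached."""
    for i, eta in enumerate(path, start=1):
        if all(abs(e - o) < 1e-9 for e, o in zip(eta, optimum)):
            return i
    return None

# For both neurons the best bounded perturbation is the corner (eps, eps).
ln_steps = steps_to_optimum(ciga(ln_grad, x0, eps, step, 1000), [eps, eps])
fp_steps = steps_to_optimum(ciga(fp_grad, x0, eps, step, 1000), [eps, eps])
```

In this toy setting, the path for the FP-neuron hits the bound in the horizontal dimension first and only later in the vertical one, so `fp_steps` is larger than `ln_steps`. Other choices of the example values would give different step counts.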
Figure 9.
Iso-response contours and iteration path of CIGA for (left) an LN-neuron and (right) an FP-neuron. The LN-neuron is shown with its weight vector and an orthogonal vector, the FP-neuron with its filter pair. The black lines point in the directions of the respective optimal stimulus. The blue dashed line shows the iteration path of CIGA (see Equation 6). All other colored solid lines show iso-response contours; the number on each line shows the function value of the contour. For each neuron, CIGA aims to find a bounded perturbation (infinity-norm of at most 1) that maximally increases the output. Starting from the initial input, CIGA takes steps of a fixed step size; a total of 10,000 iterations were computed. CIGA quickly finds an optimal solution for the LN-neuron since any step along the positive gradient (parallel to the optimal stimulus, orthogonal to the iso-response contours) optimally increases the function value. For the FP-neuron, the iteration path first moves toward the optimal stimulus, then almost parallel to it, and finally moves upward once the bound on the perturbation is reached along the horizontal axis. This longer, more complex optimization path shows that CIGA is less effective for a hyperselective FP-neuron, indicating that FP-neurons are more robust against adversarial examples.
To test this hypothesis, we created new Cifar-10 test sets derived from the original test set. Here, we focused on the most subtle adversarial attacks: we created one test set in which each test image was perturbed using FGSM with $\epsilon = 1/255$. Results for larger $\epsilon$-values are shown in the Appendix (see Table 2 and Table 3). To exclude the hypothesis that the better accuracy (with perturbations) is due to the fact that the FP-nets already generalize better, we present results where we measure the percentage of changed predictions of the classifier (Equation 9), that is, the percentage of test images for which the prediction on the perturbed image differs from the prediction on the original image.
In Equation 9, the indicator function returns 1 for a true statement and 0 otherwise, and the perturbation function (here, FGSM) perturbs the original image based on some parameter (here, $\epsilon$). We evaluated this metric for each of the four architectures that we trained on the original Cifar-10 training set (see Section “FP-nets as competitive deep networks”); no additional adversarial training scheme was employed. As shown in Figure 10, 40% to 50% of the predictions did change. However, for both baseline models, substituting some of the LN-neurons with FP-neurons increased the robustness against FGSM attacks.
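A minimal sketch of this metric is given below, assuming the classifier and the perturbation are available as plain callables. The function name `changed_prediction_rate` and the toy classifier, perturbation, and data are hypothetical stand-ins; they are not FGSM and not the networks evaluated here.

```python
def changed_prediction_rate(classify, perturb, images, param):
    """Percentage of images whose predicted class changes under perturbation."""
    if not images:
        raise ValueError("need at least one image")
    changed = sum(
        1 for image in images
        if classify(perturb(image, param)) != classify(image)
    )
    return 100.0 * changed / len(images)

# Toy stand-ins (assumptions): each "image" is a single number, the
# classifier thresholds at 0.5, and the perturbation adds a constant offset.
toy_images = [0.1, 0.45, 0.49, 0.7, 0.9]

def toy_classify(x):
    return int(x >= 0.5)

def toy_perturb(x, offset):
    return x + offset

rate = changed_prediction_rate(toy_classify, toy_perturb, toy_images, 0.06)
```

Because the metric compares predictions with and without perturbation rather than with the true labels, it does not reward a model merely for having a lower clean test error.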
Table 2.
Robust error values, in percentages, when using FGSM perturbations for all Cifar-10 models and different ε values. We report the mean values averaged over five different training runs.
| Model | ε = 1/255 | ε = 2/255 | ε = 4/255 | ε = 8/255 | ε = 16/255 |
| --- | --- | --- | --- | --- | --- |
| FP-net (N = 3) | 58.460 | 72.766 | 82.036 | 86.180 | 86.444 |
| FP-net (basic) (N = 3) | 56.662 | 73.634 | 82.296 | 86.406 | 89.108 |
| PyrBlockNet (N = 3) | 60.458 | 75.566 | 83.992 | 88.150 | 89.762 |
| ResNet (N = 3) | 57.284 | 71.912 | 80.378 | 84.338 | 88.016 |
| FP-net (N = 5) | 53.634 | 70.694 | 81.088 | 85.020 | 88.152 |
| FP-net (basic) (N = 5) | 51.112 | 68.510 | 79.036 | 84.034 | 86.830 |
| PyrBlockNet (N = 5) | 54.620 | 70.018 | 79.748 | 84.392 | 87.666 |
| ResNet (N = 5) | 52.434 | 69.652 | 79.678 | 84.166 | 86.956 |
| FP-net (N = 7) | 51.082 | 67.468 | 78.728 | 83.510 | 86.814 |
| FP-net (basic) (N = 7) | 48.360 | 66.104 | 77.236 | 82.486 | 86.256 |
| PyrBlockNet (N = 7) | 53.244 | 69.966 | 80.764 | 85.880 | 88.662 |
| ResNet (N = 7) | 49.788 | 67.852 | 78.590 | 83.290 | 86.544 |
| FP-net (N = 9) | 49.226 | 66.902 | 78.634 | 83.524 | 87.216 |
| FP-net (basic) (N = 9) | 45.896 | 64.284 | 76.076 | 81.652 | 85.484 |
| PyrBlockNet (N = 9) | 52.048 | 68.442 | 78.932 | 84.590 | 87.620 |
| ResNet (N = 9) | 47.888 | 66.056 | 77.066 | 81.992 | 86.410 |
Table 3.
Percentage of changed predictions when using the FGSM perturbations for all Cifar-10 models and different ε values.
| Model | ε = 1/255 | ε = 2/255 | ε = 4/255 | ε = 8/255 | ε = 16/255 |
| --- | --- | --- | --- | --- | --- |
| FP-net (N = 3) | 50.766 | 65.140 | 74.596 | 79.320 | 82.030 |
| FP-net (basic) (N = 3) | 48.858 | 65.928 | 74.906 | 80.128 | 85.654 |
| PyrBlockNet (N = 3) | 52.560 | 67.818 | 76.752 | 82.592 | 87.562 |
| ResNet (N = 3) | 49.526 | 64.228 | 73.038 | 78.480 | 85.148 |
| FP-net (N = 5) | 47.030 | 64.132 | 74.858 | 80.258 | 86.418 |
| FP-net (basic) (N = 5) | 44.482 | 61.960 | 72.662 | 78.526 | 83.700 |
| PyrBlockNet (N = 5) | 47.808 | 63.416 | 73.736 | 80.064 | 85.782 |
| ResNet (N = 5) | 44.996 | 62.298 | 72.546 | 78.148 | 83.916 |
| FP-net (N = 7) | 44.968 | 61.452 | 73.000 | 78.932 | 84.772 |
| FP-net (basic) (N = 7) | 42.102 | 59.922 | 71.294 | 77.402 | 83.376 |
| PyrBlockNet (N = 7) | 47.054 | 63.978 | 75.342 | 81.884 | 86.844 |
| ResNet (N = 7) | 42.820 | 60.948 | 71.898 | 77.538 | 83.390 |
| FP-net (N = 9) | 43.536 | 61.314 | 73.444 | 79.506 | 85.424 |
| FP-net (basic) (N = 9) | 39.804 | 58.240 | 70.196 | 76.448 | 82.376 |
| PyrBlockNet (N = 9) | 45.988 | 62.592 | 73.590 | 80.620 | 85.988 |
| ResNet (N = 9) | 41.206 | 59.426 | 70.662 | 76.534 | 83.808 |
Figure 10.
Percentage of changed predictions (see Equation 9) on the adversarial example test set. A lower value indicates that a network is more robust against attacks created by the FGSM.
The results reiterate that CNN predictions can be significantly altered by deliberate and subtle attacks (we show some example images in the Appendix). Unfortunately, this lack of robustness creates problems of practical relevance beyond such attacks. For example, JPEG compression can create artifacts that have similar effects. To evaluate robustness against JPEG artifacts, we created JPEG-compressed versions of the Cifar-10 test set, one for each quality rate $Q \in \{90, 80, \ldots, 10\}$, with a quality of 100 corresponding to the original image. A low quality indicates a high compression with stronger artifacts (example images are given in the Appendix). In Figure 11, we show the results for the low-compression test set and further results in the Appendix (see Tables 4 and 5).
Figure 11.
Percentage of changed predictions (see Equation 9) for the JPEG-compressed Cifar-10 test set. A lower value indicates that a network is more robust against JPEG artifacts.
Table 4.
Error values in percentages when using JPEG-compression for all Cifar-10 models and different quality (Q) values.
| Model | Q = 90 | Q = 80 | Q = 70 | Q = 60 | Q = 50 | Q = 40 | Q = 30 | Q = 20 | Q = 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FP-net (N = 3) | 12.412 | 16.468 | 19.686 | 22.624 | 25.230 | 28.116 | 32.920 | 41.044 | 57.940 |
| FP-net (basic) (N = 3) | 12.288 | 16.338 | 19.636 | 22.734 | 25.306 | 27.890 | 32.744 | 40.138 | 56.260 |
| PyrBlockNet (N = 3) | 13.080 | 17.356 | 20.676 | 23.790 | 27.108 | 30.324 | 35.326 | 42.798 | 57.336 |
| ResNet (N = 3) | 12.910 | 17.322 | 20.830 | 24.416 | 27.308 | 30.404 | 35.252 | 43.356 | 58.722 |
| FP-net (N = 5) | 11.040 | 15.154 | 18.742 | 22.076 | 24.676 | 27.864 | 32.772 | 41.240 | 58.466 |
| FP-net (basic) (N = 5) | 10.908 | 14.866 | 18.142 | 21.282 | 23.792 | 26.760 | 31.736 | 40.024 | 58.488 |
| PyrBlockNet (N = 5) | 11.918 | 16.412 | 20.342 | 23.898 | 27.078 | 30.694 | 35.888 | 44.424 | 60.176 |
| ResNet (N = 5) | 11.592 | 15.654 | 19.114 | 22.330 | 24.590 | 27.462 | 32.100 | 40.726 | 56.878 |
| FP-net (N = 7) | 10.456 | 14.596 | 18.214 | 21.292 | 24.112 | 26.720 | 31.768 | 40.582 | 58.022 |
| FP-net (basic) (N = 7) | 10.624 | 14.274 | 17.590 | 20.384 | 23.020 | 26.014 | 30.646 | 39.382 | 56.966 |
| PyrBlockNet (N = 7) | 11.396 | 15.928 | 19.344 | 23.356 | 26.518 | 30.456 | 35.908 | 44.706 | 59.660 |
| ResNet (N = 7) | 11.338 | 15.294 | 18.288 | 21.254 | 23.968 | 27.044 | 31.534 | 40.168 | 56.540 |
| FP-net (N = 9) | 10.118 | 14.228 | 18.020 | 21.116 | 23.820 | 26.894 | 32.382 | 40.830 | 57.930 |
| FP-net (basic) (N = 9) | 10.128 | 13.984 | 16.996 | 19.912 | 22.116 | 24.956 | 29.710 | 38.096 | 56.040 |
| PyrBlockNet (N = 9) | 10.974 | 15.290 | 19.088 | 22.640 | 25.504 | 29.244 | 34.520 | 43.218 | 59.616 |
| ResNet (N = 9) | 10.788 | 14.830 | 18.162 | 21.558 | 24.200 | 27.728 | 32.272 | 40.942 | 57.842 |
Table 5.
Percentage of changed predictions when using JPEG-compression for all Cifar-10 models and different quality (Q) values.
| Model | Q = 90 | Q = 80 | Q = 70 | Q = 60 | Q = 50 | Q = 40 | Q = 30 | Q = 20 | Q = 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FP-net (N = 3) | 9.024 | 14.124 | 17.676 | 21.014 | 23.714 | 26.688 | 31.580 | 40.116 | 57.502 |
| FP-net (basic) (N = 3) | 8.868 | 13.914 | 17.602 | 21.064 | 23.636 | 26.590 | 31.540 | 39.208 | 55.918 |
| PyrBlockNet (N = 3) | 10.112 | 15.086 | 18.864 | 22.176 | 25.670 | 29.102 | 34.168 | 41.842 | 56.930 |
| ResNet (N = 3) | 9.996 | 15.192 | 19.108 | 22.826 | 25.972 | 29.366 | 34.340 | 42.522 | 58.384 |
| FP-net (N = 5) | 8.374 | 13.158 | 17.104 | 20.644 | 23.232 | 26.690 | 31.766 | 40.412 | 58.090 |
| FP-net (basic) (N = 5) | 8.146 | 12.820 | 16.488 | 19.986 | 22.540 | 25.586 | 30.744 | 39.296 | 58.002 |
| PyrBlockNet (N = 5) | 9.322 | 14.590 | 18.746 | 22.488 | 25.770 | 29.656 | 35.006 | 43.668 | 59.840 |
| ResNet (N = 5) | 8.650 | 13.584 | 17.396 | 20.806 | 23.168 | 26.280 | 31.188 | 39.802 | 56.370 |
| FP-net (N = 7) | 8.058 | 12.868 | 16.780 | 19.898 | 22.830 | 25.714 | 30.818 | 39.848 | 57.782 |
| FP-net (basic) (N = 7) | 7.832 | 12.222 | 15.932 | 18.960 | 21.662 | 24.748 | 29.682 | 38.688 | 56.742 |
| PyrBlockNet (N = 7) | 8.804 | 13.980 | 17.920 | 22.114 | 25.410 | 29.432 | 34.964 | 44.002 | 59.232 |
| ResNet (N = 7) | 8.478 | 13.066 | 16.464 | 19.640 | 22.540 | 25.884 | 30.492 | 39.142 | 55.986 |
| FP-net (N = 9) | 7.694 | 12.482 | 16.704 | 19.866 | 22.700 | 25.886 | 31.402 | 40.064 | 57.774 |
| FP-net (basic) (N = 9) | 7.632 | 12.124 | 15.522 | 18.564 | 21.014 | 23.992 | 28.952 | 37.426 | 55.786 |
| PyrBlockNet (N = 9) | 8.740 | 13.498 | 17.594 | 21.258 | 24.294 | 28.190 | 33.596 | 42.560 | 59.332 |
| ResNet (N = 9) | 8.180 | 12.760 | 16.728 | 20.056 | 22.892 | 26.606 | 31.298 | 40.196 | 57.468 |
Again, using FP-neurons increased the robustness against artifacts. However, even a moderate compression alters up to 10% of the CNNs’ predictions.
Example FP-unit
As shown above, the learned FP-neurons are hyperselective and end-stopped to different degrees. However, these two properties do not fully specify an FP-neuron. When analyzing the individual FP-neurons in more detail, it is difficult to further specify them according to simple properties such as orientation or phase. Nevertheless, some FP-neurons look as if they were taken from a textbook on “how to model end-stopped neurons,” and we show one example in Figure 12.
Figure 12.
An example of a learned filter pair. The top row shows the two learned filters (arrows indicate orientation) and the row below the corresponding Fourier spectra. The third row depicts the responses of the two filters to the image shown in the bottom left (the image used to illustrate the selectivity of the FP-unit). The bottom-right panel shows the response of the FP-unit (the product of the filter responses). Such textbook units, however, are rather rare. This particular unit has emerged in the FP-net-59 trained on Cifar-10 without instance normalization and without the ReLU in Equation 2.
Discussion and conclusions
We have presented a novel FP-net architecture and have demonstrated its competitive performance. To do so, we have designed experiments with state-of-the-art deep networks and shown that we could improve their performance by substituting original blocks in the network architecture with FP-blocks that implement an explicit multiplication of feature maps. Given this simple design rule, we can expect our approach to be of practical use, since any traditional network can easily be transformed into an FP-net that will most likely perform better. We did not employ any hyperparameter tuning specific to the FP-nets but just used the hyperparameters of the original networks; one may thus expect even better performance with additional tuning. We believe that the improvement that comes with FP-nets is due to an appropriate bias, which allows the network to learn efficient representations based on units (model neurons) that are end-stopped to different degrees. The multiplications that we introduce allow for AND rather than OR combinations and thus make the resulting units more selective than linear filters with pointwise nonlinearities. Note that the key feature of FP-nets is that one learns pairs of linear filters, which are then AND combined. In the case of FP-nets, the AND is implemented by multiplications. We could, however, show that logarithms (Grüning et al., 2020b) and the minimum operation (Grüning & Barth, 2021a) can also work as AND operations. We consider the improvements that bio-inspired FP-nets achieve over the baseline networks to be the main contribution of our article.
Moreover, we have analyzed the selectivity of the FP-units in an attempt to relate them to what is known about visual neurons. We could show that FP-units are indeed end-stopped to different degrees. The emergence of end-stopping in a network that learns based on only the classification error demonstrates that end-stopping is beneficial for the task of object recognition. This finding is supported by previously known mathematical results, according to which (a) 2D features such as corners and junctions are statistically rare in natural images, leading to sparse representations (Zetzsche et al., 1993), and (b) 2D features are still unique since there exists a mathematical proof that 0D (uniform) and 1D (straight) regions in images are redundant (Mota & Barth, 2000), although being statistically frequent.
Of course, the considerations above cannot be taken to imply that biological vision implements an FP-net architecture, especially as the FP-nets implement additional and typical deep-network operations such as linear recombinations that increase the entropy of the representation. In other words, much of what well-performing deep networks do is not something one would necessarily consider to be optimal.
It is known that sparse-coding units are more selective than typical CNN units, that is, than linear neurons with pointwise nonlinearities (Paiton et al., 2020), and thus less prone to certain adversarial attacks. This increased selectivity has been quantified with the curvature of the iso-response contours. We could show that the iso-response contours of the FP-units are curved, with the degree of curvature depending on the angle between the multiplied feature vectors, and that a large number of hyperselective units emerge in FP-nets trained for object recognition. Furthermore, our results show that FP-nets are indeed more robust against adversarial attacks and compression artifacts, and this is, again, due to the vision-inspired FP-units.
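The difference between AND and OR combinations of two filters discussed in this section can be illustrated with a toy sketch. It assumes that an OR-like unit sums the two filter responses before a ReLU, while AND-like units take the product or the minimum of the rectified responses; the filter pair and the inputs are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Hypothetical orthogonal filter pair (an assumption for illustration).
v, g = [1.0, 0.0], [0.0, 1.0]

def ln_unit(x):
    # Sum of both filter responses followed by a ReLU: OR-like behavior.
    return relu(dot(v, x) + dot(g, x))

def fp_unit(x):
    # Product of the two rectified filter responses: AND-like behavior.
    return relu(dot(v, x)) * relu(dot(g, x))

def min_unit(x):
    # Minimum of the two rectified responses: an alternative AND.
    return min(relu(dot(v, x)), relu(dot(g, x)))

only_v = [1.0, 0.0]   # input that matches the first filter only
both = [1.0, 1.0]     # input that matches both filters
```

For `only_v`, the OR-like unit responds while both AND-like units stay silent; for `both`, all three units respond.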