James B Isbister1, Akihiro Eguchi1, Nasir Ahmad1, Juan M Galeazzi2, Mark J Buckley2, Simon Stringer1. 1. Oxford Centre for Theoretical Neuroscience and Artificial Intelligence, University of Oxford, Oxford OX2 6GG, UK. 2. Oxford Brain and Behaviour Group, Department of Experimental Psychology, University of Oxford, Oxford OX2 6GG, UK.
Abstract
We discuss a recently proposed approach to solve the classic feature-binding problem in primate vision that uses neural dynamics known to be present within the visual cortex. Broadly, the feature-binding problem in the visual context concerns not only how a hierarchy of features such as edges and objects within a scene are represented, but also the hierarchical relationships between these features at every spatial scale across the visual field. This is necessary for the visual brain to be able to make sense of its visuospatial world. Solving this problem is an important step towards the development of artificial general intelligence. In neural network simulation studies, it has been found that neurons encoding the binding relations between visual features, known as binding neurons, emerge during visual training when key properties of the visual cortex are incorporated into the models. These biological network properties include (i) bottom-up, lateral and top-down synaptic connections, (ii) spiking neuronal dynamics, (iii) spike timing-dependent plasticity, and (iv) a random distribution of axonal transmission delays (of the order of several milliseconds) in the propagation of spikes between neurons. After training the network on a set of visual stimuli, modelling studies have reported observing the gradual emergence of polychronization through successive layers of the network, in which subpopulations of neurons have learned to emit their spikes in regularly repeating spatio-temporal patterns in response to specific visual stimuli. Such a subpopulation of neurons is known as a polychronous neuronal group (PNG). Some neurons embedded within these PNGs receive convergent inputs from neurons representing lower- and higher-level visual features, and thus appear to encode the hierarchical binding relationship between features. 
Neural activity with this kind of spatio-temporal structure robustly emerges in the higher network layers even when neurons in the input layer represent visual stimuli with spike timings that are randomized according to a Poisson distribution. The resulting hierarchical representation of visual scenes in such models, including the representation of hierarchical binding relations between lower- and higher-level visual features, is consistent with the hierarchical phenomenology or subjective experience of primate vision and is distinct from approaches interested in segmenting a visual scene into a finite set of objects.
The feature-binding problem concerns how the visual system represents the hierarchical relationships between features (such as edges and objects). For example, at an object level, how does the visual system represent which low-level features belong to a particular object? If two letters T and L are seen together, how does the visual system represent which horizontal and vertical bars are part of which letter? Moreover, the visual system must represent hierarchical-binding relations across the entire visual field at every spatial scale and level in the hierarchy of visual primitives. Representing the binding relations between visual features is necessary in order for the visual brain to make sense of its visuospatial world. Furthermore, the binding of subfeatures to their parent object would provide rich representations if applied not only to the visual but also to the auditory and behavioural systems of the brain. Consequently, solving this problem would be an important step towards the development of what is commonly termed artificial general intelligence (AGI). This refers to machines that may one day be able to perceive and comprehend their visuospatial environment with a similar semantic richness to the brain, and exploit this semantically rich representation of the world to guide general intelligent behaviour within complex real environments.

One simple example of the feature-binding problem from a connectionist perspective was discussed by Rosenblatt [1] and further elaborated by von der Malsburg [2]. The example is illustrated in figure 1. Consider a neural network with four output neurons A, B, C and D. The first two neurons, A and B, represent the triangle and square, respectively. These neurons have location-invariant responses in that the neurons respond to their preferred objects in both the top and bottom locations. 
The second pair of output neurons, C and D, represent object location, and respond when any one of the objects is presented in either the top or bottom location, respectively. If the network is presented with a single object, the responses of the output neurons are sufficient to determine the identity of the object and its location. However, when both objects are presented to the network in different positions simultaneously, then all of the output neurons respond and their combined activity is insufficient to determine which object is in which location. This has been called the superposition catastrophe [2]. So how might the visual brain represent which features or objects are in which retinal locations when multiple objects are presented together within a scene?
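The superposition catastrophe described above can be made concrete in a few lines of code. The sketch below is our own illustrative construction (the scene encoding and function names are hypothetical, not from Rosenblatt's original formulation): four binary output units respond exactly as in figure 1, and the two possible two-object scenes produce identical responses.

```python
# Minimal sketch of Rosenblatt's example: four binary output units,
# A (triangle anywhere), B (square anywhere), C (anything in the top
# location), D (anything in the bottom location).
def output_responses(scene):
    """scene: a set of (object, location) pairs, e.g. {("triangle", "top")}."""
    return {
        "A": any(obj == "triangle" for obj, _ in scene),
        "B": any(obj == "square" for obj, _ in scene),
        "C": any(loc == "top" for _, loc in scene),
        "D": any(loc == "bottom" for _, loc in scene),
    }

# A single object is unambiguous: only A and C respond.
single = output_responses({("triangle", "top")})

# With both objects present, the two possible scenes are indistinguishable:
both_a = output_responses({("triangle", "top"), ("square", "bottom")})
both_b = output_responses({("triangle", "bottom"), ("square", "top")})
assert both_a == both_b  # the superposition catastrophe
```

All four units fire in both two-object scenes, so the binding information (which object is where) is lost at the output layer.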
Figure 1.
A connectionist example of the feature-binding problem proposed by Rosenblatt [1]. Top row: A neural network receives input from a simple visual scene, in which a triangle or a square can appear in either the top or bottom location. The network has four output neurons A, B, C and D that respond to the following kinds of visual inputs: (A) triangle in either location, (B) square in either location, (C) either object in the top location and (D) either object in the bottom location. Bottom row: The responses of the output neurons to four different visual scenes. It is evident that when a single object is presented, then the combined activity among the output neurons is sufficient to determine the identity of the object and its location. However, when both objects are presented together in different locations, then it is not possible to determine the locations of each of the objects from the responses of the output neurons. Reproduced with permission from Rosenblatt [1]. (Online version in colour.)
One approach that has been proposed for solving the feature-binding problem is known as feature integration theory (FIT) [3]. This theory makes the assumption that there is only a single spatial locus of attention within the visual field where features are bound together. This implies that visual tasks requiring feature binding would need to be carried out in a time-consuming serial manner as the visual brain processes the visual field sequentially. However, feature binding would be far more adaptive for an animal if it could be simultaneously performed across the entire visual field in parallel. Moreover, can the feature-binding problem, in which the brain must represent the hierarchical relations between visual features at different spatial scales, really be solved by trying to reduce the size of the spatial region in which it is performed? 
In fact, an experimental study carried out by Duncan & Humphreys [4] on human participants did not observe a clear dichotomy between serial and parallel modes of visual search. Instead, search efficiency was related to factors affecting the intrinsic difficulty of the task. For example, search efficiency decreased as the targets and non-targets became more similar, or as the non-targets became more dissimilar to each other. These experimental observations are inconsistent with the assumption of FIT that visual binding is performed sequentially as a spatial locus of attention shifts across the visual field.

Another mechanism that has been proposed for solving feature binding is synchronization of neuronal firing. Real neurons in the brain communicate with each other by emitting electrical pulses known as action potentials or ‘spikes’. The binding by synchrony hypothesis suggests that the subpopulation of neurons encoding the visual features that are part of the same object will emit their spikes close together in time, but not at the same time as those neurons encoding features associated with different objects [5-7]. In this way, it is suggested that synchronization may be used to segment a visual scene into several discrete object regions. It is important to note that synchronization and oscillations are often discussed interchangeably in the literature with overlapping definitions. In this paper, we use the terms synchronization and synchrony to refer to events in which multiple neurons fire spikes effectively simultaneously. Oscillations instead refer to longer-timescale waves of excitatory and inhibitory activity in a network, with widths on the order of tens of milliseconds.

Under the hypothesis of binding by synchrony, simultaneous firing of neurons binds together the visual features that they represent. Attempts to find such a relationship have been unsuccessful [8]. 
Furthermore, if neural network models incorporate randomized distributions of axonal transmission delays of the order of several milliseconds, as found in the brain, then this has the effect of degrading the emergence of synchrony in these simulations. Meanwhile, the question remains: can decomposing natural scenes into a few discrete object regions really capture the semantic richness of primate vision? Duncan & Humphreys [4] describe the hierarchical nature of primate vision as follows:

A fully hierarchical representation is created by repeating segmentation at different levels of scale. Each structural unit, contained by its own boundary, is further subdivided into parts by the major boundaries within it. Thus, a human body may be subdivided into head, torso, and limbs, and a hand into palm and fingers. Such subdivision serves two purposes. The description of a structural unit at one level of scale (animal, letter, etc.) must depend heavily on the relations between the parts defined within it (as well as on properties such as colour or movement that may be common to the parts). Then, at the next level down, each part becomes a new structural unit to be further described with its own properties, defined among other things by the relations between its own subparts. At the top of the hierarchy may be a structural unit corresponding to the whole input scene, described with a rough set of properties (e.g. division into light sky above and dark ground below).

How might the visual cortex represent such a hierarchy of visual features, as well as the hierarchical binding relations between these features, at every spatial scale and across the entire visual field? Eguchi et al. 
[9] have recently shown how this may be achieved within a biologically realistic hierarchical neural network model of the primate ventral visual system with the following properties.

(1) The model is a ‘spiking’ neural network, in which the timings of the spikes emitted by neurons are explicitly represented.

(2) The synaptic connections are modified during visual training by spike timing-dependent plasticity (STDP). Specifically, a synapse is strengthened through long-term potentiation (LTP) if a spike from the presynaptic neuron arrives at the postsynaptic neuron just before the postsynaptic neuron emits a spike. The synapse is weakened through long-term depression (LTD) if the spike from the presynaptic neuron arrives at the postsynaptic neuron just after the postsynaptic neuron has emitted its spike [10,11].

(3) The network architecture incorporates bottom-up, top-down and lateral synaptic connections. This kind of synaptic connectivity is consistent with the primate visual cortex.

(4) There is an axonal transmission delay of a few milliseconds in the time it takes for an action potential or spike to pass from one neuron to another. The axonal transmission delay between each pair of pre- and postsynaptic neurons has a fixed value that does not alter through time. However, different axonal connections have different random transmission delays, which can range from a few milliseconds to tens of milliseconds.

(5) The network may incorporate multiple synaptic connections between each pair of pre- and postsynaptic neurons, where these connections have different axonal transmission delays. Eguchi et al. [9] showed that this allows the STDP to selectively strengthen specific synaptic connections with particular axonal transmission delays.

Using a neural network model with the above architectural components, Eguchi et al. 
[9] reported that training the model on visual stimuli led to the emergence of repeating spatio-temporal patterns of spikes in the higher layers of the network. A subpopulation of such neurons that emit their spikes in a regularly repeating spatio-temporal chain is referred to as a polychronous neuronal group (PNG). Figure 2 illustrates two examples of simple network connectivities that could underlie basic polychronous groups. The phenomenon of network responses with spatio-temporal patterns of neural activity is known as polychronization [12]. This is in contrast with synchronization, in which the spikes of subpopulations of neurons are clustered very close together in time (synchronized). A key factor in pushing the network from synchronous to polychronous activity is the incorporation of axonal transmission delays, which force neurons to spike at different times relative to one another. The original study also reported that these PNGs responded selectively to particular visual stimuli on which the network had been trained. These stimulus-specific PNGs were found to emerge even though the input neurons representing the visual stimuli had entirely randomized spike times set according to a Poisson distribution.
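The STDP rule described in property (2) above can be sketched in a few lines. The sketch below is illustrative only: the learning-rate amplitudes and time constants are hypothetical placeholder values, not the parameters used by Eguchi et al. [9]. Note that, consistent with property (4), the rule is driven by the spike's arrival time at the postsynaptic neuron (emission time plus axonal delay), not its emission time.

```python
import math

# Illustrative pair-based STDP rule. The amplitudes and time constants
# below are hypothetical values chosen for demonstration.
A_PLUS, A_MINUS = 0.01, 0.012      # LTP/LTD amplitudes
TAU_PLUS, TAU_MINUS = 20.0, 20.0   # time constants (ms)

def stdp_dw(t_pre_arrival, t_post):
    """Weight change for one pre/post spike pairing.

    t_pre_arrival: time the presynaptic spike arrives at the postsynaptic
                   neuron (i.e. emission time plus the axonal delay).
    t_post: time the postsynaptic neuron emits its spike.
    """
    dt = t_post - t_pre_arrival
    if dt > 0:   # spike arrives just before the postsynaptic spike -> LTP
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    if dt < 0:   # spike arrives just after the postsynaptic spike -> LTD
        return -A_MINUS * math.exp(dt / TAU_MINUS)
    return 0.0
```

The exponential windows mean that pairings separated by only a few milliseconds dominate learning, which is why the rule can selectively strengthen connections whose axonal delays line up with a repeating spatio-temporal spike pattern.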
The emergence of polychronization within a biological spiking neural network model of the primate visual system
Eguchi et al. [9] hypothesized that the kind of spiking neural network architecture with properties (i)–(iv) described above, especially including randomized distributions of axonal transmission delays, would develop regularly repeating spatio-temporal patterns of spiking activity in the higher network layers after training on a set of visual stimuli, i.e. polychronization [12]. This hypothesis was originally inspired by the modelling study of Diesmann et al. [19], who showed that a hierarchical spiking neural network consisting of a series of successive layers could develop synchronous activity in the higher layers even when the spikes in the input layer were widely dispersed (i.e. unstructured) in time. This is an example of the development of a synfire chain. Synfire chains were originally proposed by Abeles [20] as networks within which such synchronous activity could propagate. They are defined as a hierarchical series of pools (or layers) of neurons in which, when a given pool fires sufficiently synchronously, the resulting volley of spikes propagates from each pool to the next, causing the pools to fire synchronously one after another in sequence. The conditions for the stability of synfire chains have been explored and characterized, and a major requirement within these networks is the existence of single-valued synaptic transmission delays [19,21]. This work was influential because feature binding was posited at the time to be linked to synchronized neuronal activity, whereby the spikes emitted by neurons representing visual features that are part of the same object would be clustered very closely together in time. The simulations of Diesmann et al. [19] showed how such synchronized activity could emerge naturally within a biological spiking neural network. However, in order for synchrony to emerge in their simulations, the model assumed either no axonal delays or axonal delays all of the same fixed length. 
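The contrast between uniform and randomized axonal delays can be illustrated with a deliberately stripped-down toy model (it is not the spiking model of Eguchi et al. or Diesmann et al.): each neuron fires exactly once, at the mean arrival time of the spikes from the previous pool, with no membrane dynamics. With identical delays everywhere, every pool fires synchronously, as in a synfire chain; with a random but fixed delay per connection, the firing times disperse into a spatio-temporal pattern that repeats exactly on every presentation, which is the signature of a polychronous group.

```python
import random

def propagate(delays, input_times):
    """Toy feedforward propagation of single spikes through pools of neurons.

    delays[l][j][i]: fixed axonal delay from neuron i in pool l to
    neuron j in pool l+1. Each neuron fires at the mean arrival time
    of its inputs (a crude stand-in for integrate-and-fire dynamics).
    """
    times = [list(input_times)]
    for layer in delays:
        prev = times[-1]
        times.append([
            sum(prev[i] + row[i] for i in range(len(prev))) / len(prev)
            for row in layer
        ])
    return times

random.seed(0)
n = 5
# Single-valued delays (synfire chain) vs. random but fixed delays (PNG).
uniform = [[[3.0] * n for _ in range(n)] for _ in range(3)]
randomized = [[[random.uniform(1.0, 10.0) for _ in range(n)] for _ in range(n)]
              for _ in range(3)]

t_sync = propagate(uniform, [0.0] * n)
t_poly = propagate(randomized, [0.0] * n)
assert len(set(t_sync[-1])) == 1   # synchronous volleys propagate
assert len(set(t_poly[-1])) > 1    # dispersed, polychronous timing
assert propagate(randomized, [0.0] * n) == t_poly  # fixed delays -> repeatable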
Bienenstock [22] proposed that it might be possible to relax this constraint by incorporating non-uniform axonal transmission delays, as actually found in the brain. It was hypothesized that synchronous waves could still emerge if pairs of neurons in the network were connected by multiple polysynaptic pathways with the same overall length. This was referred to as a synfire braid. Nevertheless, Bienenstock [22] was still concerned with the emergence of synchronized activity to solve feature binding. As discussed above, our view was that synchrony could not offer a solution to the binding problem that accorded with the rich hierarchical phenomenology of primate vision [4]. In this paper, we shall be discussing the alternative notion of polychronization proposed by Izhikevich [12], and its potential role in solving feature binding. Although synfire chains/braids and polychronization both involve spatio-temporal patterns of spiking activity, the latter is quite distinct from the former. Specifically, polychronization is far less constrained than synfire chains/braids in that it does not require subpopulations of neurons to emit their spikes in a synchronized manner. We have found that this greater freedom can lead to the emergence of representations of the hierarchical binding relations between lower- and higher-level features.

Building closely on the work of Diesmann et al. [19], it was hypothesized by Eguchi et al. [9] that including randomized distributions of axonal transmission delays, e.g. spread uniformly in the range 0–10 ms, into such hierarchical spiking network models would force neurons to emit their spikes separated in time, thereby creating spatio-temporal spike sequences (PNGs). Moreover, as Diesmann et al. [19] showed how synchronous neural activity could gradually emerge through successive layers even when there was no such temporal structure among spikes in the input layer, Eguchi et al. 
[9] hypothesized that input patterns with randomized spike times could lead to the emergence of polychronous activity in the higher layers of networks incorporating randomized distributions of axonal delays. In the simulation study reported in Eguchi et al. [9], the spike patterns representing the stimuli in the input layer had no regular temporal structure, except that the average firing rates of the input neurons were set in accordance with the outputs of Gabor filters that simulated the responses of simple cells in visual area V1. Eguchi et al. [9] also hypothesized that training the network on visual stimuli using STDP to modify the synaptic connections would enhance the emergence of PNGs in the higher network layers, where individual PNGs would learn to respond to a particular preferred visual stimulus. The study reported that these predictions were confirmed in their simulations. Moreover, these authors reported that many more stimulus-specific PNGs emerged in the highest (output) layer than individual neurons tuned to specific visual stimuli. This strongly hints at such PNGs playing an important role in stimulus representations in the brain. This is supported by experimental observations from multi-unit recording studies in monkeys, which have reported the existence of such spatio-temporal spike patterns in the primate cortex in response to the presentation of visual stimuli [23,24].

Given the reported emergence of stimulus-specific polychronous activity in the spiking network simulations performed by Eguchi et al. [9], and the observed presence of these kinds of PNGs in the primate brain, what role might such polychronous activity play in solving the feature-binding problem in a way that reflects the hierarchical subjective experience of primate vision?
How the emergence of polychronization may offer an approach to solving feature binding in primate vision
Eguchi et al. [9] proposed that the emergence of polychronous activity within a hierarchical spiking neural network may provide an understanding of how the primate brain solves the feature-binding problem. To address the hierarchical phenomenology of primate vision, as described by Duncan & Humphreys [4], consider two higher-level features or stimuli such as the alphabetical letters T and L, each of which may be located anywhere on the retina. The letters T and L both comprise a horizontal bar and a vertical bar, which are the lower-level features. If the letters T and L are presented together at some random locations on the retina, how might the visual cortex represent which horizontal and vertical bars (lower-level features) are part of which letters (higher-level features or stimuli)? The ability to represent such hierarchical binding relations between lower- and higher-level features is fundamental to the ability of the visual brain to produce an integrated representation of a visual scene, and consequently make sense of its visuospatial world.

Consider training a spiking network with properties (i)–(iv) described above on the letter T presented everywhere across the retina. Eguchi et al. [9] hypothesized that this would lead to the emergence of a T-specific PNG which is activated regardless of the location of the T on the retina. Within this T-specific PNG, it was hypothesized that binding neurons could exist which encode the hierarchical binding relations between lower- and higher-level features. Specifically, such binding neurons were posited to fire as part of the PNG if, and only if, a neuron or subset of neurons representing a lower-level feature, such as a horizontal bar at a particular retinal location, was participating in driving the neuron or subset of neurons representing a higher-level feature, such as the letter T. 
In this case, the binding neurons would carry measurable information that the lower-level feature was part of the higher-level feature or stimulus.

The simplest example of how such binding neurons might operate is shown in figure 3a. There are three neurons forming a binding circuit. Neuron 1 in a lower layer represents the low-level feature; neuron 2 in the higher layer represents the high-level feature; and neuron 3 is a binding neuron that encodes the hierarchical binding relation between the low- and high-level features. The connections between the neurons have axonal transmission delays associated with them, where Δ(i,j) denotes the delay from presynaptic neuron j to postsynaptic neuron i. Given the existence of the axonal transmission delays shown in figure 3a, it is evident that neuron 1 will be participating in driving neuron 2 only if a spike emitted by neuron 2 occurs approximately Δ(2,1) after a spike emitted by neuron 1. Moreover, if the three axonal delays shown in figure 3a have the relationship

Δ(3,1) = Δ(2,1) + Δ(3,2),

then the spikes emitted by neurons 1 and 2 will arrive simultaneously at the binding neuron 3 if and only if neuron 1 (representing the low-level feature) is participating in driving neuron 2 (representing the high-level feature). This is of critical importance because it is assumed that all neurons have relatively fast synaptic time constants, as well as synaptic weights appropriately scaled to the synaptic time constant, so that postsynaptic neurons only fire when presynaptic spikes arrive simultaneously. Under this assumption, binding neuron 3 may fire if and only if neuron 1 is participating in driving neuron 2. In other words, the binding neuron 3 may fire if and only if the low-level feature encoded by neuron 1 is part of the high-level feature or stimulus encoded by neuron 2. In this case, the binding neuron 3 will carry measurable information about the hierarchical binding relationship between the lower- and higher-level visual features.
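The coincidence-detection logic of this three-neuron circuit can be sketched directly. The delay values and coincidence window below are hypothetical; the circuit only requires the delay relationship Δ(3,1) = Δ(2,1) + Δ(3,2) and a coincidence window that models the fast synaptic time constant.

```python
# Sketch of the three-neuron binding circuit of figure 3a.
# D21: delay from neuron 1 to neuron 2; D31: from 1 to 3; D32: from 2 to 3.
D21, D31, D32 = 3.0, 6.0, 3.0    # ms; chosen so that D31 == D21 + D32
COINCIDENCE_WINDOW = 0.5         # ms; models the fast synaptic time constant

def binding_neuron_fires(t1, t2):
    """t1, t2: spike times of neurons 1 and 2. True if their spikes arrive
    at binding neuron 3 close enough together for it to fire."""
    return abs((t1 + D31) - (t2 + D32)) <= COINCIDENCE_WINDOW

# Neuron 1 drives neuron 2, so neuron 2 fires ~D21 after neuron 1:
assert binding_neuron_fires(0.0, 0.0 + D21)
# An unrelated neuron-2 spike misses the coincidence window at neuron 3:
assert not binding_neuron_fires(0.0, 8.0)
```

The binding neuron thus acts as a detector of the causal relationship "neuron 1 drove neuron 2", rather than of either feature alone.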
Bottom-up projection of visual information about low-level elemental features to higher network layers
Eguchi et al. [9] also hypothesized that the mechanisms described above could lead to visual information at smaller spatial scales being projected up to the higher layers of the network, which they called the holographic principle. The traditional view of processing in the primate ventral visual pathway is that the scale and complexity of the visual features that are represented grow as one ascends the hierarchy of processing stages or layers. For example, it is widely thought that neurons in early cortical visual areas such as V1 and V2 represent local oriented bars and edges, while neurons in higher cortical areas such as the anterior inferior temporal cortex (TE) encode whole objects and faces. However, surely subsequent brain areas such as the prefrontal cortex (PFC) that are responsible for decision-making and behaviour must be guided by visuospatial representations at every spatial scale? If such behavioural brain areas only receive inputs from the later stages of the visual system, then there must be some way in which information about visual features at every spatial scale, including the binding relations between these features, is projected up to the higher visual layers for readout by later behavioural brain systems. Neurophysiology experiments on primates seem to support this proposal.

For example, Rainer et al. [18] recorded the responses of neurons in the PFC, a brain area that receives inputs from the higher cortical visual stages and which is involved in decision-making. It was found that individual PFC neurons responded differentially depending on the location of the target stimulus, which is analogous to different sets of low-level features driving the target representation. 
This indicates that the PFC encodes the spatial configuration of visual objects rather than just the identity of the whole objects themselves.

A very simple mechanism that can lead to information about a lower-level visual feature, including its hierarchical binding relationship with a higher-level visual feature, being projected up to a higher network layer is illustrated in figure 4a. This is similar to the network architecture shown in figure 3a. However, the binding neuron 3, which represents that the lower-level feature (such as a vertical bar) is part of the higher-level feature (such as the letter T), is now located in the higher layer along with neuron 2 encoding the higher-level feature. In this situation, information about the lower-level feature, and its hierarchical binding relationship with the higher-level feature, has now been projected up to the higher layer. In fact, Eguchi et al. [9] found that a large proportion of the three-neuron binding circuits that developed in their simulations were of this general form, with the binding neuron situated in the higher layer.
Binding neuron activation through local increases in spike density
The original hypothesis of Eguchi et al. [9] was that polychronous activity depended on precise spatio-temporal patterns of individual spikes emitted at specific times with millisecond precision. For example, in the binding circuit illustrated in figure 3a, it was assumed that neuron 1 would emit a single spike at time zero, then neuron 2 would emit a single spike 3 ms after neuron 1 and then binding neuron 3 would emit a single spike 3 ms after neuron 2. However, we now propose that the binding mechanism could still operate in a somewhat similar manner but instead use local increases in spike density at appropriate moments in time. That is, instead of neuron 2 emitting a single spike exactly 3 ms after neuron 1, the kind of binding mechanism illustrated in figure 3a could still operate even if neuron 2 simply displays an increase in the number of spikes emitted around that time, i.e. a temporally localized increase in the spike rate. Examples of how this might look are shown in figure 5, which shows a number of spike raster plots recorded from the PFC of an awake behaving monkey as the animal was presented with a visual stimulus. It is evident that each spike raster plot shows fluctuations in the local spike density through time, with some localized clusters of spikes. This kind of neuronal behaviour, in which there appears to be some kind of regular underlying temporal variation in spike rate, is quite typical in visually responsive neurons in monkey cortex. We hypothesize that these localized variations in spike density through time reflect underlying spatio-temporal activity patterns across subpopulations of neurons, which include neurons carrying information about the (hierarchical) binding relations between visual features. This broader concept, based on local variations in spike density, represents a generalization of the original notion of polychronization in which neurons had to emit single spikes at particular times. 
However, even with this more flexible form of polychronization, the binding mechanism illustrated in figure 3a should still operate in a similar manner as long as the increases in neuronal spike rate are sufficiently temporally localized. That is, neuron 2 displays a localized increase in spike rate around 3 ms (approx. 2–4 ms) after a localized increase in the spike rate of neuron 1, and binding neuron 3 displays a localized increase in spike rate around 3 ms (approx. 2–4 ms) after a localized increase in the spike rate of neuron 2. We therefore suggest that the temporal structure displayed in the spike rasters shown in figure 5 reflects this more generalized form of polychronous activity within the PFC and other reciprocally connected brain areas. In future work, we will use multi-unit recording techniques in monkeys, in which the spiking activity of many neurons is recorded simultaneously, to look for neurons that use this more general form of polychronization to carry measurable information about the (hierarchical) binding relations between visual features.
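To make the notion of a temporally localized increase in spike rate concrete, the short sketch below smooths a spike train with a Gaussian kernel and locates the resulting density peak. This is an illustrative analysis only, not part of the original model; the kernel width (5 ms) and the example spike times are hypothetical choices.

```python
import numpy as np

def local_spike_density(spike_times, t_max, sigma=5.0, dt=1.0):
    """Estimate a smoothed spike-density function by convolving a spike
    train with a Gaussian kernel of width sigma (ms)."""
    t = np.arange(0.0, t_max, dt)
    density = np.zeros_like(t)
    for s in spike_times:
        density += np.exp(-0.5 * ((t - s) / sigma) ** 2)
    # normalize the kernel so the density integrates to the spike count
    density /= sigma * np.sqrt(2.0 * np.pi)
    return t, density

# Example: a cluster of spikes around 40 ms plus a lone spike at 120 ms
t, d = local_spike_density([38, 40, 41, 43, 120], t_max=200)
print(t[np.argmax(d)])  # the density peak lies inside the 38-43 ms cluster
```

Under this view, it is the timing of such density peaks across neurons, rather than the timing of individual spikes, that carries the polychronous structure.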
Figure 5.
Neurophysiological evidence from single unit recording carried out in monkey prefrontal cortex (PFC) using chronically implanted multi-electrode arrays. This figure presents three spike raster plots recorded from the PFC of an awake behaving monkey as the animal looked at a visual stimulus. That is, each row shows the series of spikes emitted by a different individual unit through 1000 ms as the monkey viewed the stimulus. It can be seen that each spike raster plot shows fluctuations in the local spike density through time, with some localized clusters of spikes. We hypothesize that these localized variations in spike density reflect underlying polychronous activity within this brain region and other reciprocally connected areas. (Online version in colour.)
Neural network model and analysis of network performance
In this section, we describe the original neural network model and performance analyses employed in the simulation study by Eguchi et al. [9]. Then, in §4.1, we present some novel simulation results investigating the emergence of polychronization through successive network layers using a simplified version of this model. Finally, in §4.2, we review some of the original simulation results of Eguchi et al. [9] showing the emergence of feature-binding representations within PNGs.
Neural network model
Network architecture
The neural network model investigated by Eguchi et al. [9] is shown in figure 6. It simulates successive stages of processing within the primate ventral visual pathway. Specifically, it consists of four hierarchical layers of neurons that correspond to cortical visual areas V2, V4, posterior inferior temporal cortex (TEO) and anterior inferior temporal cortex (TE). Within each network layer, there are subpopulations of interconnected excitatory and inhibitory neurons. There are plastic (modifiable) bottom-up (feedforward) and top-down (feedback) synaptic connections between excitatory neurons in successive layers, as well as plastic lateral synapses between excitatory neurons within each layer. The inhibitory neurons within each layer have non-plastic connections to and from the excitatory neurons. The inhibitory interneurons effectively implement lateral competition between the excitatory neurons within a layer. This supports competitive learning among the excitatory neurons within each layer, whereby individual excitatory neurons learn to respond to particular visual features or stimuli presented during training. There were 64 × 64 = 4096 excitatory neurons and 32 × 32 = 1024 inhibitory neurons within each layer. The excitatory connectivity between layers was topographical, with excitatory neurons in each layer receiving connections from excitatory neurons within a corresponding local region of the lower or higher layer. Table 1a shows the different numbers of afferent connections onto each postsynaptic neuron, as well as the fan-in radius of these connections, for the different types of excitatory–excitatory, excitatory–inhibitory and inhibitory–excitatory connections between and within the four layers of the network.
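The topographic fan-in connectivity described above can be sketched as follows: for each postsynaptic neuron, afferents are sampled from within a disc of the given fan-in radius around its topographically corresponding position. This is a simplified illustration rather than the authors' code; the wrap-around boundary handling and the random seed are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_fan_in(post_xy, grid_size, n_afferents, radius):
    """Sample afferent presynaptic grid coordinates for one postsynaptic
    neuron, drawn uniformly from the disc of the given fan-in radius
    around its topographically corresponding position (with wrap-around
    at the grid edges for simplicity)."""
    pre = []
    while len(pre) < n_afferents:
        dx, dy = rng.uniform(-radius, radius, size=2)
        if dx * dx + dy * dy <= radius * radius:   # rejection sampling
            x = int(round(post_xy[0] + dx)) % grid_size
            y = int(round(post_xy[1] + dy)) % grid_size
            pre.append((x, y))
    return pre

# e.g. a layer 2 excitatory neuron at (10, 20) receiving 100 feedforward
# afferents within a fan-in radius of 8 grid units (values from table 1a)
afferents = sample_fan_in((10, 20), grid_size=64, n_afferents=100, radius=8.0)
print(len(afferents))  # 100
```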
Figure 6.
The four-layer neural network model of the primate ventral visual pathway investigated by Eguchi et al. [9]. The network architecture consists of a hierarchy of four layers of neurons 1–4 that correspond to cortical visual areas V2, V4, posterior inferior temporal cortex (TEO) and anterior inferior temporal cortex (TE). Within each of these four network layers, there are subpopulations of interconnected excitatory and inhibitory neurons. Layer 0 contains a layer of excitatory neurons, whose firing rates reflect the outputs of Gabor filters that mimic the responses of bar/edge-detecting V1 simple cells after convolution with the visual input image. Although the firing rates of the layer 0 neurons are set according to the outputs of the Gabor filters, their actual spike times are randomized according to a Poisson distribution. Thus, there is no spatio-temporal structure imposed on the spiking activity of the input layer; this has to emerge gradually as visual signals propagate through the hierarchy of higher layers 1–4. Layer 0 neurons have purely bottom-up (feedforward) connections to layer 1. Each of the following layers 1–4 consists of 64 × 64 = 4096 excitatory neurons and 32 × 32 = 1024 inhibitory neurons. The excitatory plastic (modifiable) synaptic connections (shown in red) in the model include bottom-up (feedforward) and top-down (feedback) connections between excitatory neurons in successive layers, and lateral connections between excitatory neurons within the same layer (shown by the curved red arrows). Within each layer, the subpopulation of excitatory neurons send non-modifiable projections to the subpopulation of inhibitory neurons, which in turn send non-modifiable connections back to the excitatory neurons. The inhibitory interneurons effectively implement lateral competition between the subpopulation of excitatory neurons within a layer.
Table 1.
Model parameters. Most integrate and fire parameters were taken from Troyer et al. [30] (derived originally from McCormick et al. [31]) as indicated by §. Plasticity parameters (denoted by †) are taken from Perrinet et al. [29]. Parameters marked with * were tuned for the reported simulations.
                                                                        layer 1   layer 2   layer 3   layer 4

(a) network parameters
number of excit. neurons within each layer                              64 × 64   64 × 64   64 × 64   64 × 64
number of inhib. neurons within each layer                              32 × 32   32 × 32   32 × 32   32 × 32
number of feedforward (FF) afferent excit. connections
  per excit. neuron (EfE)                                               30        100       100       100
fan-in radius for FF afferent excit. connections
  to each excit. neuron (EfE)                                           1.0       8.0       12.0      16.0
number of feedback (FB) afferent excit. connections
  per excit. neuron (EbE)                                               {0,10}    {0,10}    {0,10}    —
fan-in radius for FB afferent excit. connections
  to each excit. neuron (EbE)                                           8.0       8.0       8.0       —
number of lateral (LAT) afferent excit. connections
  per excit. neuron (ElE)                                               {0,10}    {0,10}    {0,10}    {0,10}
fan-in radius for LAT afferent excit. connections
  to each excit. neuron (ElE)                                           4.0       4.0       4.0       4.0
number of LAT afferent excit. connections per inhib. neuron (ElI)       30        30        30        30
fan-in radius for LAT afferent excit. connections
  to each inhib. neuron (ElI)                                           1.0       1.0       1.0       1.0
number of LAT afferent inhib. connections per excit. neuron (IlE)       30        30        30        30
fan-in radius for LAT afferent inhib. connections
  to each excit. neuron (IlE)                                           8.0       8.0       8.0       8.0

(b) parameters for Gabor filtering of visual images
synaptic neurotransmitter concentration (αC) and proportion
  of unblocked NMDA receptors (αD)                                      0.5                 †
presynaptic STDP time constant (τC) and postsynaptic
  STDP time constant (τD)                                               {5, 25, 125} ms     †
synaptic learning rate (ρ)                                              0.1                 †
range of synaptic conductance delay                                     [0.1, 10.0] ms      †
synaptic conductance scaling factor for FF excit. connections
  from Gabor filters to layer 1 excit. cells (λGfE · ΔgGfE)             [0, 0.4] nS         *
synaptic conductance scaling factor for FF excit. connections
  to excit. cells in layers 2, 3 or 4 (λEfE · ΔgEfE)                    [0, 1.6] nS         *
synaptic conductance scaling factor for FB excit. connections
  to excit. cells in layers 1, 2 or 3 (λEbE · ΔgEbE)                    [0, 1.6] nS         *
synaptic conductance scaling factor for LAT excit. connections
  to excit. cells in layers 1, 2, 3 or 4 (λElE · ΔgElE)                 [0, 1.6] nS         *
synaptic conductance scaling factor for LAT connections from
  excit. cells to inhib. cells in layers 1, 2, 3 or 4 (λElI · ΔgElI)    40 nS               *
synaptic conductance scaling factor for LAT connections from
  inhib. cells to excit. cells in layers 1, 2, 3 or 4 (λIlE · ΔgIlE)    80 nS               *
excitatory reversal potential                                           0 mV                §
inhibitory reversal potential                                           −70 mV              §
synaptic time constant for all FF, FB and LAT connections from
  Gabor filters and excit. cells to excit. cells
  (τGfE, τEfE, τEbE, τElE)                                              150 ms              *
synaptic time constant for LAT connections from excit. cells
  to inhib. cells (τElI)                                                2 ms                §
synaptic time constant for LAT connections from inhib. cells
  to excit. cells (τIlE)                                                5 ms                §

(e) parameters for numerical simulation by forward Euler timestepping scheme
numerical step size (Δt)                                                0.02 ms
Differential equations
The following subsections describe the cell, synapse and plasticity equations used in the simulations of [9], as well as in the additional simulations described in this paper.

Cell equations

In the model developed by Eguchi et al. [9], each neuron is modelled as a conductance-based leaky integrate and fire (LIF) neuron. A neuron's membrane potential is updated according to

    τm dVi(t)/dt = −(Vi(t) − V0) + R Ii(t).    (3.1)

The cell membrane potential Vi(t) of a given neuron i is driven up by current from excitatory conductance-based synapses, and down towards the inhibitory reversal potential by current from inhibitory conductance-based synapses. Neurons decay back to their resting state over a time course determined by the properties of their membrane. Here τm represents the membrane time constant, defined as τm = Cm/g0, where Cm is the membrane capacitance, g0 is the membrane leakage conductance and R is the membrane resistance (R = 1/g0). V0 denotes the resting potential of the cell. Class-specific values (excitatory and inhibitory) are indexed by γ for the above neuron parameters. Ii(t) represents the total current input from the afferent synapses, given by

    Ii(t) = Σj gij(t) (V̂ij − Vi(t)).    (3.2)

That is, the total synaptic current injected into a neuron is given by the sum of the conductances of all afferent synapses (excitatory and inhibitory), each multiplied by the difference between the reversal potential of that synapse class (V̂ij) and the neuron membrane potential (Vi(t)). The conductance of a given synapse is gij, where j and i are the indices of the pre- and postsynaptic neurons, respectively.

Synaptic conductance equations

The synaptic conductance of a particular synapse, gij(t), is governed by a decay term τ and a Dirac delta function-based change (equation (3.4)) when spikes arrive from the presynaptic neuron j, as follows:

    dgij(t)/dt = −gij(t)/τ + λ Δgij Σl δ(t − (tj(l) + Δtij)).    (3.3)

The conduction delay for a particular synapse is denoted by Δtij, which ranges from 0.1 to 10.0 ms, and each presynaptic neuron spike is indexed by l.
A biological scaling constant λ has been introduced to scale the synaptic efficacy Δgij, which lies between zero and unity. The Dirac delta function is defined as follows:

    δ(x) = +∞ if x = 0, and 0 otherwise, with ∫ δ(x) dx = 1.    (3.4)

Synaptic learning equations

In the model investigated by Eguchi et al. [9], STDP operates at all of the bottom-up, top-down and lateral connections from excitatory cells to excitatory cells throughout layers 1–4. The equations for STDP at these excitatory–excitatory synapses were adapted from [29]. The form of STDP operating at a synaptic connection from presynaptic cell j to postsynaptic cell i is as follows.

The recent presynaptic activity, Cij(t), is modelled by

    dCij(t)/dt = −Cij(t)/τC + αC (1 − Cij(t)) Σl δ(t − (tj(l) + Δtij)).    (3.5)

The variable C(t) represents the concentration of neurotransmitter (glutamate) released into the synaptic cleft [29] and is bounded by [0, 1] for 0 ≤ αC ≤ 1. C(t) is governed by a decay term τC and is driven up by presynaptic spikes according to the model parameter αC. The inclusion of the axonal transmission delay Δtij from presynaptic neuron j to postsynaptic neuron i in equation (3.5) ensures that C(t) is driven up when the spike from presynaptic neuron j actually arrives at the postsynaptic neuron i.

The recent postsynaptic activity, Di(t), is governed by

    dDi(t)/dt = −Di(t)/τD + αD (1 − Di(t)) Σk δ(t − ti(k)).    (3.6)

The variable D(t) represents the proportion of N-methyl-D-aspartate (NMDA) receptors unblocked by recent depolarization from back-propagated action potentials [29]. D(t) is governed by a decay term τD and is driven up by postsynaptic spikes according to the model parameter αD. Postsynaptic neuron spikes are indexed by k.

The strength of the synaptic weight, Δgij(t), is governed by

    τΔ dΔgij(t)/dt = Cij(t) (1 − Δgij(t)) Σk δ(t − ti(k)) − Di(t) Δgij(t) Σl δ(t − (tj(l) + Δtij)),    (3.7)

with time constant τΔ.

The above STDP model operates as follows. If the variable representing presynaptic activity, C, is high (due to recent presynaptic spikes having arrived at the postsynaptic neuron) at the time of a postsynaptic spike, then the synaptic weight is increased (LTP). Alternatively, if the variable representing postsynaptic activity, D, is high (from recent postsynaptic spikes) at the time of a presynaptic spike arriving at the postsynaptic neuron, then the weight is decreased (LTD).

The model parameters used in the simulations performed by Eguchi et al. [9] were chosen to be as biologically accurate as possible and are given in table 1.
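The cell, conductance and STDP dynamics described above can be sketched with a minimal forward-Euler integration of a single LIF neuron driven by one plastic excitatory synapse. The spike threshold, reset value, membrane resistance and weight time constant below are illustrative assumptions, not the tuned values of table 1.

```python
# Forward-Euler sketch of a conductance-based LIF neuron driven by one
# plastic excitatory synapse. Threshold, reset, membrane resistance and
# the weight time constant are assumed values for illustration.
dt = 0.02                                   # ms, numerical step size
tau_m, V0, V_th, V_reset = 20.0, -74.0, -53.0, -74.0   # ms / mV (assumed)
E_exc = 0.0                                 # mV, excitatory reversal potential
R = 40.0                                    # membrane resistance (assumed units)
tau_g = 150.0                               # ms, synaptic time constant
tau_C = tau_D = 25.0                        # ms, STDP time constants
alpha_C = alpha_D = 0.5
tau_delta = 10.0                            # weight time constant (assumed)

V, g, C, D, dg = V0, 0.0, 0.0, 0.0, 0.5     # dg is the synaptic efficacy
pre_arrivals = [5.0, 6.0, 7.0, 8.0]         # spike arrival times (ms)
post_spikes = []

for step in range(int(60.0 / dt)):
    t = step * dt
    I = g * (E_exc - V)                     # synaptic current, eq. (3.2)
    V += dt * (-(V - V0) + R * I) / tau_m   # membrane update, eq. (3.1)
    g += dt * (-g / tau_g)                  # conductance decay, eq. (3.3)
    C += dt * (-C / tau_C)                  # presynaptic trace decay
    D += dt * (-D / tau_D)                  # postsynaptic trace decay
    if any(abs(t - a) < dt / 2 for a in pre_arrivals):
        g += dg                             # conductance jump (lambda = 1)
        C += alpha_C * (1.0 - C)            # drive C up, eq. (3.5)
        dg -= D * dg / tau_delta            # LTD term of eq. (3.7)
    if V >= V_th:                           # threshold crossing -> spike
        V = V_reset
        post_spikes.append(t)
        D += alpha_D * (1.0 - D)            # drive D up, eq. (3.6)
        dg += C * (1.0 - dg) / tau_delta    # LTP term of eq. (3.7)

print(len(post_spikes) > 0, 0.0 <= dg <= 1.0)  # True True
```

Note that the delta-function terms in equations (3.3), (3.5), (3.6) and (3.7) become discrete jumps at spike (arrival) times under this timestepping scheme, and the efficacy Δg remains bounded in [0, 1] by construction.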
Training the network on visual stimuli
In the simulations carried out by Eguchi et al. [9], visual images were first preprocessed by a set of Gabor filters that mimicked the responses of simple cells in V1 [32-34]. That is, individual Gabor filters responded to a bar or edge element with a particular orientation and retinal location. The outputs of the Gabor filters were used to set the firing rates of excitatory input neurons in layer 0. However, the actual spikes of the input cells were set to occur at randomized timings according to a Poisson distribution. So the original study did not impose any initial spatio-temporal structure on the spiking activity in layer 0.

The Gabor input filters used were computed by the following equation:

    g(x, y; λ, θ, ψ, σ, γ) = exp(−(x′² + γ²y′²)/(2σ²)) cos(2πx′/λ + ψ),    (3.8)

with the following definitions:

    x′ = x cos θ + y sin θ  and  y′ = −x sin θ + y cos θ,    (3.9)

where x and y specify the position of a light impulse in the visual field [35]. The parameter λ is the wavelength (1/λ is the spatial frequency), σ controls the number of such periods inside the Gaussian window based on λ and the spatial bandwidth b, θ defines the orientation of the feature, ψ defines the phase and γ sets the aspect ratio that determines the shape of the receptive field. In the experiments carried out by Eguchi et al. [9], an array of Gabor filters was generated at each of 128 × 128 retinal locations with the parameters given in table 1.

The outputs of the Gabor filters were used as the basis to generate Poisson spike trains as follows:

    ratef = outputf × max_rate_scaling_factor,    (3.10)

where f is the index of a Gabor filter used for the simulation, outputf is the output of that filter, and max_rate_scaling_factor is the maximum input neuron firing rate (set to 100 in the simulation studies). The outputs of the Gabor filters are thus used to set the firing rates of layer 0 excitatory input neurons, while the spike times of the layer 0 neurons are randomized according to a Poisson distribution with these rates. The layer 0 neurons then propagate activity to the layer 1 excitatory neurons according to the synaptic connectivity given in table 1.
That is, each layer 1 neuron receives connections from 30 randomly chosen layer 0 neurons localized within a topologically corresponding region of the retina. These connection distributions are defined by a radius shown in table 1.
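The Gabor filtering and Poisson spike generation stages can be sketched as follows. The mapping from filter response to firing rate below (squared response, clipped normalization) is an assumption for illustration; the original study simply scales filter outputs by max_rate_scaling_factor = 100.

```python
import numpy as np

rng = np.random.default_rng(1)

def gabor(size, lam, theta, psi, sigma, gamma):
    """Sample a Gabor filter (equation (3.8)) on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)     # x' of equation (3.9)
    yp = -x * np.sin(theta) + y * np.cos(theta)    # y' of equation (3.9)
    return np.exp(-(xp**2 + gamma**2 * yp**2) / (2 * sigma**2)) \
         * np.cos(2 * np.pi * xp / lam + psi)

def poisson_spike_train(rate_hz, duration_ms):
    """Homogeneous Poisson spike times (ms) at the given rate."""
    n = rng.poisson(rate_hz * duration_ms / 1000.0)
    return np.sort(rng.uniform(0.0, duration_ms, size=n))

# Rate of a layer 0 input neuron: filter energy mapped onto [0, 100] Hz
# (the normalization constant 50.0 is an arbitrary illustrative choice)
response = float(np.sum(gabor(15, lam=8.0, theta=0.0, psi=0.0,
                              sigma=4.0, gamma=0.5) ** 2))
rate = min(1.0, response / 50.0) * 100.0
spikes = poisson_spike_train(rate, duration_ms=1000.0)
print(rate, len(spikes))
```

Because the spike times are drawn independently on every presentation, repeated presentations of the same image produce different input spike trains, which is exactly why each presentation is treated as a distinct 'transform' in the analyses below.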
Analysis of network performance
Information analysis of average firing rate responses of single cells
Eguchi et al. [9] measured how much information is carried in the firing rates of cells in the fourth (output) layer of the network about the identity of visual stimuli presented to the model. If a neuron responds selectively with a high firing rate to only one particular stimulus, and responds to that stimulus across all transforms, then the firing rate response of the neuron carries maximal information about the presence of that visual stimulus. In the simulations performed in the original study, each presentation of a stimulus was considered a different transform, because each stimulus presentation caused the input layer 0 neurons to emit a different randomized sequence of spikes according to the Poisson distribution. That is, the exact timings of the input neuron spikes were different for each presentation of the same stimulus. Hence, different presentations of the same visual stimulus to the network were treated as 'transforms' of that stimulus.

The amount of stimulus-specific information that a specific cell carries is calculated using the following formula, with details given by Rolls & Milward [36]:

    I(s, R) = Σr∈R P(r|s) log2 [P(r|s)/P(r)],    (3.11)

where s is a particular stimulus, r is the response of a cell to a single stimulus and R is the set of responses of the cell to the set of stimuli.

The maximum information that a cell could carry in its firing rate response is given by the formula

    Imax = log2(n),    (3.12)

where n is the number of different visual stimuli.
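The information measure of equation (3.11) can be computed directly from a table of response probabilities P(r|s), with one row per stimulus and one column per (binned) response. The uniform prior over stimuli used to obtain P(r) is an assumption, consistent with equal numbers of presentations per stimulus.

```python
import numpy as np

def stimulus_specific_information(response_table):
    """Compute I(s, R) = sum_r P(r|s) log2[P(r|s)/P(r)] (equation (3.11))
    for each stimulus s. response_table has one row per stimulus and one
    column per binned response; rows are normalized to give P(r|s)."""
    p_r_given_s = response_table / response_table.sum(axis=1, keepdims=True)
    p_r = p_r_given_s.mean(axis=0)          # uniform prior over stimuli
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p_r_given_s * np.log2(p_r_given_s / p_r)
    return np.nansum(terms, axis=1)         # zero-probability terms drop out

# A perfectly selective cell: a distinct response bin for each of 3 stimuli
perfect = np.eye(3)
print(stimulus_specific_information(perfect))  # approx. [1.585 1.585 1.585]

# A non-selective cell: identical responses to all stimuli -> 0 bits
flat = np.ones((3, 4))
print(stimulus_specific_information(flat))
```

The perfectly selective cell reaches the maximum of equation (3.12), log2(3) ≈ 1.58 bits, for every stimulus.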
Information analysis of temporal spike patterns emitted by polychronous neuronal groups
Eguchi et al. [9] also applied information theory to quantify the amount of information carried by PNGs about the identity of visual stimuli presented to the network. However, to simplify the analysis, the authors only investigated the information carried by simple PNGs consisting of two spikes emitted by a pair of neurons.

The original study used the spike trains recorded during multiple stimulus presentations to the network to compute the probabilities that a given spike-pair will occur in response to the presentation of each of the stimuli s. These probabilities are based on the frequency of occurrence of the spike-pair across multiple transforms (presentations) of each stimulus s. From these frequency distributions, the following probability table was constructed for each stimulus category s:

    P(i, j, t, d | s),    (3.13)

where i and j are the indices of the two neurons under consideration, t is the time at which cell i emits a spike and d is the time interval by which neuron i's spike follows a spike of neuron j. Eguchi et al. [9] considered values of d within the range [0, 10 ms], where this time interval was divided into 10 equal bins of 1 ms.

The original study then applied the information analysis methodology of §3.2.1 to the information carried by spike-pair PNGs, and in doing so regarded the probability table given by equation (3.13) as R, the set of responses to the set of stimuli used in equation (3.11). Thus, equation (3.11) was used to compute the information carried by spike-pair PNGs about the presence of a particular stimulus s. Using this approach, the authors were able to quantify how selective such spike-pair PNGs were for particular stimuli. If a particular spike-pair PNG responds invariantly to the transforms (presentations) of a particular stimulus s but not to the other stimuli, then the spike-pair PNG carries maximum information about the presence of its preferred stimulus.
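A sketch of how the interspike-interval component d of such a probability table might be estimated from recorded spike trains is shown below. The data structures, stimulus names and example spike times are hypothetical, and only the d dimension of equation (3.13) is binned here.

```python
import numpy as np

def spike_pair_table(trials_by_stimulus, i, j, n_bins=10, bin_ms=1.0):
    """Estimate, for each stimulus s, the probability that neuron i fires
    d ms (binned into n_bins bins of bin_ms) after neuron j, across
    transforms (presentations) of s.
    trials_by_stimulus: {s: [trial, ...]}, trial = {neuron: [spike times]}."""
    table = {}
    for s, trials in trials_by_stimulus.items():
        counts = np.zeros(n_bins)
        for trial in trials:
            for tj in trial.get(j, []):
                for ti in trial.get(i, []):
                    d = ti - tj
                    if 0.0 <= d < n_bins * bin_ms:
                        counts[int(d // bin_ms)] += 1
        table[s] = counts / max(counts.sum(), 1.0)   # normalize per stimulus
    return table

# Two stimuli, two transforms each: under "circle" neuron 1 reliably fires
# about 3 ms after neuron 0; under "star" the pair is silent
trials = {
    "circle": [{0: [10.0], 1: [13.0]}, {0: [25.0], 1: [28.2]}],
    "star":   [{0: [], 1: []}, {0: [], 1: []}],
}
tab = spike_pair_table(trials, i=1, j=0)
print(tab["circle"])   # probability mass concentrated in the 3 ms bin
```

A spike-pair whose probability mass sits in one d bin for one stimulus only, as here, would carry maximal stimulus information under equation (3.11).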
Performance of spiking neural network models
The emergence of polychronization through successive network layers
We begin by presenting some new simulation results from a simplified two-layer bottom-up (feedforward) spiking neural network model. These simulation results take a more detailed look at the gradual emergence of polychronization through successive layers than was carried out by Eguchi et al. [9]. The results presented can be contrasted with the simulations carried out by Diesmann et al. [19], which demonstrated the emergence of synchronization through successive layers of spiking neurons. For synchronization to emerge, it was necessary for Diesmann et al. [19] to incorporate either no axonal transmission delays or axonal delays all of the same length. In the new simulations presented here, we show that incorporating randomized distributions of axonal delays, say spread in the interval [1, 30] ms, into the bottom-up connections flips the model from synchronous to polychronous behaviour. This important mechanism, in turn, permits the emergence of binding neurons embedded within these polychronous stimulus representations, as described elsewhere in this paper.

The two-layer neural network model simulated is shown in figure 7. The model consists of a one-dimensional input layer of 1000 excitatory neurons. The spike times of active input layer neurons are drawn from a Poisson distribution (equation (3.10)) with a mean firing rate of 50 Hz. The input layer sends bottom-up synaptic connections to layer 1, which in turn sends connections to the output layer 2. Layers 1 and 2 each consist of 1000 LIF excitatory spiking neurons. The bottom-up connections to layers 1 and 2 are modified during learning according to the STDP rule implemented by Eguchi et al. [9]. The equations governing the cellular and synaptic dynamics, including synaptic plasticity, are given in §3.1.2.
Neurons in the first LIF layer receive connections from the input layer neurons with a connection probability of 0.2, while neurons in the second LIF layer receive connections from neurons in the first LIF layer with a probability of 0.02. Axonal transmission delays between the input layer and the first LIF layer are uniformly distributed between 1 and 10 ms, while axonal delays between the first LIF layer and the second LIF layer are uniformly distributed between 1 and 30 ms. All neuron, synapse and learning parameters that are not described in this subsection are the same as the values originally used by Eguchi et al. [9] shown in table 1. We refer to this model as a two-layer model because there are two layers of LIF spiking neurons that receive plastic bottom-up connections, which are modified during training.
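The effect of the delay distribution on spike arrival times can be illustrated with a short sketch: the same presynaptic volley arrives temporally dispersed when delays are drawn uniformly from an interval, but near-synchronously when all delays are fixed at 1 ms. The connection probability follows the two-layer model above; the layer sizes shown and the random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# A single synchronous input volley reaches postsynaptic neurons at times
# set by the axonal delays; the spread of arrival times is what a broad
# delay distribution contributes.
n_pre, n_post = 1000, 50
conn = rng.random((n_post, n_pre)) < 0.2          # connection probability 0.2

random_delays = rng.uniform(1.0, 10.0, size=(n_post, n_pre))
fixed_delays = np.full((n_post, n_pre), 1.0)

input_spike_time = 0.0
for name, delays in [("random", random_delays), ("fixed", fixed_delays)]:
    arrivals = delays[conn] + input_spike_time    # arrival times of the volley
    print(name, round(float(arrivals.std()), 2))  # temporal spread in ms
```

With fixed delays the spread is exactly zero, so downstream neurons can only fire synchronously; with uniformly distributed delays the volley is smeared over several milliseconds, providing the temporal raw material for polychronous firing.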
Histogram showing the frequency distributions of standard deviations in the first spike times of neurons in layers 1 and 2 in response to a stimulus presentation. This figure represents the same simulation data as in figure 8. For each neuron, we compute the standard deviation in its first spike across all 10 simulations in which a stimulus is presented. The following four sets of simulation results are presented: pretraining layer 1 (blue line), pretraining layer 2 (orange line), post-training layer 1 (green line) and post-training layer 2 (red line). In these simulations, the axonal transmission delays between the input layer and layer 1 are uniformly distributed between 1 and 10 ms, while axonal delays between layer 1 and layer 2 are uniformly distributed between 1 and 30 ms. As in figure 8, it can be seen that training the network using STDP reduces the degree of temporal variation in the first spike times, and that successive layers of processing also reduce the degree of temporal variation.
Figure 10 shows the performance of the network with uniform axonal delays of 1 ms in all connections from layer 1 to layer 2. The plot is styled as in figure 8. It is evident that the mean spike times in layer 2 are much more synchronous (clustered close together in time) than the polychronous behaviour (spread out in time) seen in figure 8. Hence, comparing these results shows that increasing the range of axonal transmission delays between layer 1 and layer 2 makes the spike times of the layer 2 neurons polychronous rather than synchronous.
Figure 10.
Performance of the network with 1 ms axonal transmission delays between layer 1 and layer 2. Plot styled as in figure 8. It is evident that with uniform axonal delays of 1 ms in all connections from layer 1 to layer 2 the mean spike times in layer 2 are much more synchronous (clustered close together in time) than the behaviour (spread out in time) seen in figure 8 with a broad distribution of axonal delays in the interval [1, 30] ms. Hence, comparing these results shows that incorporating a broad distribution of axonal delays drives the system to reliable spiking over a greater temporal range.
Selected simulation results from Eguchi et al. [9]
In this section, we review some of the simulation results from Eguchi et al. [9]. The simulation study carried out by these authors demonstrated the emergence of stimulus-specific PNGs, binding neurons, and the bottom-up projection of visual information about lower-level features to the highest network layers.
Training and testing the network model on a set of visual stimuli
Eguchi et al. [9] trained and tested their network on the three visual stimuli shown in figure 11, which included a circle, a heart and a star.
Figure 11.
The set of three visual stimuli presented to the network during training and testing in the simulation study of Eguchi et al. [9]. The stimulus set included a circle, a heart and a star. Reproduced with permission from Eguchi et al. [9].
During the initial training phase, the three stimuli were repeatedly presented in turn to the network. At each stimulus presentation, the Gabor filters (equation (3.8)), which mimic the responses of bar/edge-detecting simple cells in cortical visual area V1, were convolved with the image of the stimulus. The outputs of the Gabor filters were used to set the firing rates of the input neurons in layer 0. Crucially, the spike times of the input neurons were randomized according to a Poisson distribution (equation (3.10)). These spikes were then propagated up through the network layers according to the model equations described in §3.1. As the visual signals propagated through the network, the plastic excitatory connections (which included the connections from the Gabor filters to layer 1 excitatory neurons, as well as the bottom-up, top-down and lateral connections between excitatory neurons across layers 1–4) were modified according to the STDP learning rule (equation (3.7)).

When testing the model, the same three stimuli were presented to the network with STDP turned off. For each stimulus presentation, the spike train responses of all fourth (output) layer neurons were recorded.
Stimulus information carried by the average firing rates of neurons and spike-pair PNGs in the output layer of the network
Eguchi et al. [9] analysed the stimulus information carried either in the average firing rates of fourth (output) layer neurons or in the spike-pair PNGs in the output layer. The maximum amount of information that can be carried by a single neuron or spike-pair PNG is log2(n), where n is the number of stimuli. In these simulations, there were three stimuli, as shown in figure 11. Therefore, the maximum possible information is log2(3) ≈ 1.58 bits. Network performance was investigated with different combinations of feedforward (FF), feedback (FB) and lateral (LAT) connectivity between excitatory neurons in layers 1–4. Specifically, the original study presented results for the full network architecture FF + FB + LAT before training, and results after training with the following different forms of connectivity: FF, FF + FB, FF + LAT, FF + FB + LAT.

Figure 12a shows the stimulus information carried in the average firing rates of fourth (output) layer neurons. For each plot, the single cell information carried by 300 output neurons is shown, where the neurons are arranged along the abscissa in rank order. It is evident that very few output neurons in the FF + FB + LAT model reached the maximal information of 1.58 bits before training. However, after training, all four network architectures developed 50–100 neurons with maximal stimulus information. Nevertheless, the network incorporating all three categories of connections, which is closest to the connectivity observed in the visual cortex, gave the lowest performance when analysing the information carried by the average firing rates of neurons.
How the stimulus information carried by spike-pair PNGs in the output layer is affected by key model parameters
Eguchi et al. [9] explored how the stimulus information carried by spike-pair PNGs in the output layer is affected by varying two important model parameters: the STDP time constants and the number of synaptic contacts between each pair of pre- and postsynaptic neurons. This part of their investigation used the full network architecture with all three kinds of synaptic connectivity, that is, FF + FB + LAT.

Figure 13a shows how the stimulus information carried by spike-pair PNGs in the output layer was affected by varying the STDP time constants. In the results shown, the STDP time constants were varied over the values τC = τD = 125 ms, 25 ms or 5 ms. It was found that many more spike-pair PNGs carrying maximal information about stimulus identity emerged in the network when the STDP time constants were shortest. Short-duration STDP time constants are needed to maintain the temporal precision of the STDP, which is in turn required to promote the development of stimulus-specific PNGs. However, as the STDP time constants are increased, the synaptic plasticity becomes less dependent on the precise timings of spikes, and begins to operate more like a classical rate-coded Hebbian learning rule. In this case, the emergence of stimulus-specific PNGs is degraded.
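The effect of the STDP time constants can be seen in a standard exponential STDP window. The rule below is illustrative (the exact rule and amplitudes in the original model may differ); only the role of τC and τD follows the text.

```python
import numpy as np

def stdp_dw(dt_ms, a_plus=0.1, a_minus=0.1, tau_c=5.0, tau_d=5.0):
    """Weight change for a spike-time difference dt = t_post - t_pre.

    tau_c / tau_d are the potentiation / depression time constants.
    Small values (e.g. 5 ms) keep plasticity sensitive to precise spike
    timing; large values (e.g. 125 ms) make the rule approach a
    rate-based Hebbian rule, since almost any pre/post pairing within a
    presentation then produces a comparable weight change.
    """
    if dt_ms >= 0:   # pre before post: potentiation
        return a_plus * np.exp(-dt_ms / tau_c)
    else:            # post before pre: depression
        return -a_minus * np.exp(dt_ms / tau_d)

# With tau = 5 ms, a 20 ms pre-to-post gap barely changes the weight;
# with tau = 125 ms it is almost as effective as exact coincidence.
print(stdp_dw(20.0, tau_c=5.0))    # ~0.0018
print(stdp_dw(20.0, tau_c=125.0))  # ~0.085
```

This is why short time constants favour stimulus-specific PNGs: only spike pairs with a precisely repeated temporal relationship are reliably reinforced.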
Examples of how STDP selectively strengthens connections with particular delays between pairs of pre- and postsynaptic neurons with multiple synaptic contacts. Eguchi et al. [9] showed results for a model in which each pair of pre- and postsynaptic neurons had two connections with different randomly assigned transmission delays in the interval [0, 10] ms. The figure shows how the strengths of the two synaptic connections with different delays between four example pairs of pre- and postsynaptic neurons are modified by STDP during visual training. It can be seen that, in each of the four cases, one connection is selectively strengthened, while the other is weakened. In this way, STDP is able to effectively choose which transmission delay to strengthen in the connectivity between the two neurons in order to promote the emergence of polychronization within the network. Reproduced with permission from Eguchi et al. [9].
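The delay selection illustrated above can be sketched with a toy STDP loop. The rule, rates, delays and post-spike timing below are all invented for the illustration; only the qualitative outcome (one synaptic contact strengthened, the other weakened) follows the study.

```python
import numpy as np

def delta_w(dt_ms, a=0.05, tau=5.0):
    """Exponential STDP: potentiate if the spike arrives before the
    postsynaptic spike (dt >= 0), otherwise depress."""
    return a * np.exp(-abs(dt_ms) / tau) * (1 if dt_ms >= 0 else -1)

# Two synaptic contacts between one pre/post pair, with different
# randomly assigned axonal delays in [0, 10] ms.
delays = [2.0, 8.0]
weights = [0.5, 0.5]

t_pre, t_post = 0.0, 2.5   # the pair repeatedly fires with this timing
for _ in range(50):
    for i, d in enumerate(delays):
        dt = t_post - (t_pre + d)   # post time minus spike-arrival time
        weights[i] = min(1.0, max(0.0, weights[i] + delta_w(dt)))

# The 2 ms contact (arrival just before the post spike) saturates at 1,
# while the 8 ms contact (arrival after the post spike) decays to 0:
# STDP has effectively selected one transmission delay.
print(weights)  # → [1.0, 0.0]
```

The surviving delay is the one consistent with the repeated spatio-temporal firing pattern, which is exactly what polychronization requires.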
The development of binding neurons during visual training
Eguchi et al. [9] demonstrated that embedded within the stimulus-specific PNGs that emerged in the full network architecture during training on the circle, heart and star shown in figure 11 were binding neurons of the kind illustrated in figures 3a and 4a. Simulation results are presented in figure 15. Each row shows an example of a stimulus-selective PNG, where the PNGs shown in rows (a)–(c) respond selectively to the circle, heart and star, respectively. Figure 15a(i),b(i),c(i) shows the neurons in the PNG, where the neurons are represented by circles and the strengthened connections between the neurons are represented by lines. The neurons are plotted along the abscissa according to the relative timings of their spikes within the PNGs, which were determined by the axonal transmission delays of the strengthened connections between the neurons. The right plots show the patterns of input Gabor filters with the strongest bottom-up connectivity to the lower- and higher-level feature neurons shown in figure 15a(i),b(i),c(i).
Figure 15.
Examples of binding neurons that develop during visual training. Eguchi et al. [9] showed the emergence of three-neuron binding circuits in the full network architecture, including FF+FB+LAT connections, after training on the circle, heart and star shown in figure 11. Each row shows an example of a stimulus-selective PNG. (a(i),b(i),c(i)) The neurons in the PNG, where the neurons are represented by circles and the strengthened connections between the neurons are represented by lines. The neurons are plotted along the abscissa according to the relative timings of their spikes within the PNGs, which were determined by the axonal transmission delays of the strengthened connections between the neurons. (a(ii),b(ii),c(ii)) The patterns of input Gabor filters with the strongest bottom-up connectivity to the lower- and higher-level feature neurons shown in (a(i),b(i),c(i)). It can be seen that rows (a) and (c) represent examples of the hypothesized three-neuron binding circuits illustrated in figures 3a and 4a. Reproduced with permission from Eguchi et al. [9].
Rows (a) and (c) of figure 15 represent examples of the hypothesized three-neuron binding circuits illustrated in figures 3a and 4a. For example, consider the three-neuron binding circuit illustrated in row (a). Neuron 12686 is situated in layer 3 and represents the lower-level feature, neuron 18657 is situated in the output layer 4 and represents the higher-level feature, and neuron 18396 is a binding neuron that represents the binding relationship between the lower- and higher-level features. It can be seen that the axonal transmission delay from the lower-level feature neuron 12686 to binding neuron 18396 is equal to the transmission delay from the lower-level feature neuron 12686 to higher-level feature neuron 18657 plus the transmission delay from the higher-level feature neuron 18657 to binding neuron 18396.
Given this pattern of axonal transmission delays between the three neurons, the spikes emitted by the lower-level feature neuron 12686 and higher-level feature neuron 18657 will arrive simultaneously at, and hence fire, binding neuron 18396 if and only if the lower-level feature neuron 12686 is actually participating in firing the higher-level feature neuron 18657. Row (c) presents another similar example of a three-neuron binding circuit. Figure 15a(ii),c(ii) confirm that the layer 3 neurons (left) represent lower-level features, while the layer 4 neurons (right) represent higher-level features of their preferred visual stimuli.
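The coincidence condition described above reduces to a simple relationship between the three delays, which can be checked with a few lines of arithmetic. The delay values below are illustrative, not taken from the study; the neuron labels follow figure 15.

```python
# Three-neuron binding circuit (L = lower-level feature neuron 12686,
# H = higher-level feature neuron 18657, B = binding neuron 18396).
# Illustrative axonal transmission delays satisfying d_L_B = d_L_H + d_H_B.
d_L_B = 9.0   # ms, direct delay: lower-level -> binding
d_L_H = 4.0   # ms, lower-level -> higher-level
d_H_B = 5.0   # ms, higher-level -> binding

t_L = 0.0                      # lower-level neuron spikes
t_H = t_L + d_L_H              # higher-level neuron fires, driven by L
arrival_direct = t_L + d_L_B   # L's spike reaches the binding neuron
arrival_via_H = t_H + d_H_B    # H's spike reaches the binding neuron

# Both spikes arrive at the binding neuron simultaneously, so it fires
# precisely when L participates in firing H.
print(arrival_direct, arrival_via_H)  # → 9.0 9.0
```

If instead H were fired by some other input (so t_H were not t_L + d_L_H), the two arrivals would no longer coincide and the binding neuron would stay silent.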
Bottom-up projection of visual information about lower-level features to higher layers of the network
The simulations of Eguchi et al. [9] demonstrated the bottom-up propagation of visual information about lower-level features to the higher network layers according to the holographic principle described in §2.3 and illustrated in figure 4a. In the examples of three-neuron binding circuits shown in figure 15a,c, the lower-level feature neuron is situated in layer 3, the higher-level feature neuron is situated in layer 4 and the binding neuron is located in layer 4. Thus, the binding neuron is located in the same layer as the higher-level feature neuron. The simulation results shown in figure 15 are examples of the kind of bottom-up projection of visual information shown in figure 4a. This kind of bottom-up projection of visual information about lower-level features to the higher network layers could make more fine-grained visuospatial information available at the end of the visual pathway for readout by subsequent brain areas involved in decision-making and behaviour.
Discussion
In this paper, we have discussed a new approach to solving the feature-binding problem in visual neuroscience that relies on the emergence of polychronization within biological spiking neural networks. This problem is described by authors in different ways, but broadly refers to the ability of the visual brain to represent the hierarchical relationships between lower- and higher-level visual features within a scene. Solving this problem is essential for understanding how the brain builds an integrated and coherent representation of the visual world. We suggest that solving how feature binding is accomplished by the brain will be necessary for the future development of artificial general intelligence and machine consciousness.

Simulation studies carried out by Eguchi et al. [9] have reported that fixed spatio-temporal patterns of spikes emerge automatically within the higher layers of a spiking neural network, and are repeated across different presentations of the same stimulus, even when the stimulus input representations have entirely randomized spike timings. These authors investigated the emergence of both large-scale PNGs consisting of many neurons and spike-pair PNGs consisting of just two neurons that carried high levels of stimulus-specific information. However, as discussed in §1, there is a potential issue with the latter results. That is, if two neurons respond with high firing rates to a preferred stimulus, but do not respond to any other stimuli, then it would still be possible to find spike-pair PNGs that carry high levels of stimulus-specific information in a random spike train. Consequently, given this possibility, in §4.1 we have presented some new simulation results that take a closer look at the emergence of polychronization through successive network layers.
These simulation results show how precise spatio-temporal spike patterns may emerge naturally and automatically in the higher layers even though the input stimulus patterns have randomized spike times.

The hypothesis that such PNGs might develop was strongly inspired by the work of Diesmann et al. [19], which showed the emergence of synchronization in the higher layers of a hierarchical feedforward spiking neural network. To demonstrate the emergence of synchrony, Diesmann et al. [19] had to implement either no axonal transmission delays or axonal delays of identical duration. However, axonal transmission delays within the visual cortex are not all of the same duration. The key result introduced by Eguchi et al. [9] and further investigated in this paper is that incorporating distributions of axonal delays, say in the interval [0, 10] ms, flips the network behaviour from the emergence of synchronization in the higher layers to the emergence of polychronization. Consistent with this theoretical result, neurophysiology studies have observed the presence of polychronous activity in the brain [13,23,24].

Why is polychronization important? When Eguchi et al. [9] trained their spiking neural network model of the primate ventral visual pathway using STDP to modify the synaptic connections, they reported seeing the emergence of polychronous stimulus representations. In particular, embedded within these PNGs were feature-binding neurons that represented the hierarchical binding relationships between lower- and higher-level visual features. These authors reported the emergence of three-neuron binding circuits in the general form illustrated in figure 3a. These kinds of feature-binding representations could emerge simultaneously at every level of the hierarchy of network layers, which encode visual features at different spatial scales, and everywhere across the visual field.
However, the three-neuron binding circuits shown in figure 3a are only the simplest possible realization of the basic approach to feature binding using polychronization. For example, many other kinds of more complex feature-binding representations may emerge such as those illustrated in figure 16. Moreover, as illustrated in figure 3b, the representations of the lower- and higher-level features, as well as the feature-binding representations, may also take the form of PNGs. Furthermore, the connectivity between these features and feature-binding representations could be poly-synaptic instead of the simple mono-synaptic connectivity shown in figure 3a. It is quite clear, then, that we are at the beginning of exploring the nature of the polychronous representations of features and feature-binding relationships that may emerge within spiking neural networks.
Figure 16.
More complex kinds of feature-binding representation. This figure shows three examples of binding circuits that encode more complex forms of hierarchical binding relationship between features. (a) The binding neuron responds when a lower-level feature such as a vertical bar is part of an intermediate-level feature such as the letter T, which is in turn part of a higher-level feature such as the word CAT. (b) The binding neuron responds when a low-level feature such as a vertical bar is simultaneously part of two different higher-level features such as the letter T and the word CAT. (c) The binding neuron responds when two lower-level features such as a vertical bar and a horizontal bar are both part of a higher-level feature such as the letter T. Reproduced with permission from Eguchi et al. [9].
Eguchi et al. [9] also proposed that information about lower-level visual features could be projected upwards to the higher network layers, where it would be available for subsequent brain systems involved in decision-making and behaviour. The simplest way in which this was hypothesized to occur was illustrated in figure 4a. This proposed mechanism was demonstrated in the simulation results shown in figure 15a,c. Here it could be seen that the binding neuron representing the hierarchical relationship between a lower-level feature and a higher-level feature emerged in the same higher layer as the neuron representing the higher-level feature. In this way, information about the lower-level feature, including its binding relationship to the higher-level feature, was projected up to the same higher layer representing the higher-level feature. However, many other more complex circuit architectures could develop during visual training that could project visual information about lower-level features upward to higher network layers as shown in figure 4b.
Experimental evidence for the upward projection of fine-grained visuospatial information to higher brain areas has been provided by neurophysiology studies in monkeys. The PFC is a brain area that is strongly implicated in decision-making and behaviour. It receives inputs from the end of the ventral visual pathway. Rainer et al. [18] showed that information about the location of a target object was encoded in the responses of neurons in the PFC. This observation implies that visual neurons in the PFC encode the spatial configuration of objects rather than just the identity of the whole objects themselves.

An outstanding question is how later decision-making areas of the brain, such as the PFC, might read out and use visual information encoded by PNGs in the visual cortex. Given fast synaptic time constants, the responses of real neurons in the brain will be sensitive to the timings of incoming spikes. In particular, a postsynaptic neuron will be more likely to fire if the afferent spikes from a subpopulation of presynaptic neurons arrive at the postsynaptic neuron near simultaneously. Given the presence of random axonal transmission delays between neurons, say within the interval [0, 10] ms, the postsynaptic neuron will have the greatest probability of firing when the presynaptic neurons emit their spikes in a specific spatio-temporal sequence that ensures the spikes arrive at the postsynaptic neuron together. Thus, PNGs would appear to be the natural way in which neurons should be expected to encode information in the visual brain for subsequent readout by decision-making brain areas. Furthermore, the holographic principle proposes that information about visual features at all spatial scales, including the binding relations between these features, is projected upwards to such decision-making areas. Evidence for this emerged in the neural network simulations of Eguchi et al. [9].
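The readout argument can be made concrete with a toy coincidence-detection model. Below, a postsynaptic neuron sums exponential EPSPs from 20 presynaptic neurons whose axonal delays are drawn uniformly from [0, 10] ms; all numerical values (kernel time constant, delay range, timings) are assumptions for the illustration, not parameters from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pre = 20
delays = rng.uniform(0.0, 10.0, n_pre)   # axonal delays in [0, 10] ms

def peak_depolarization(spike_times, delays, tau=2.0, dt=0.1, t_max=40.0):
    """Peak of a summed train of exponential EPSPs (time constant tau ms)."""
    t = np.arange(0.0, t_max, dt)
    v = np.zeros_like(t)
    for a in spike_times + delays:        # spike arrival times at the soma
        v += np.where(t >= a, np.exp(-(t - a) / tau), 0.0)
    return v.max()

# Polychronous input: each presynaptic neuron fires at just the right
# moment for its spike to arrive at t = 15 ms despite its delay.
poly = 15.0 - delays
# Synchronous input: all neurons fire together, so the random delays
# smear the arrivals across a 10 ms window.
sync = np.full(n_pre, 5.0)

print(peak_depolarization(poly, delays) >
      peak_depolarization(sync, delays))  # → True
```

Given heterogeneous delays, the spatio-temporal sequence (the polychronous pattern) drives the readout neuron far more strongly than synchronous firing, which is the intuition behind treating PNGs as the natural code for downstream readout.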
Future experimental studies may investigate whether such an upward projection of visual information occurs in the brain by analysing the visuospatial information present in the PFC about not only the identity of visual objects but also the detailed spatial structure of these objects. For example, single/multi-unit recording studies in the PFC area of the monkey brain could test for the presence of visual neurons that encode the parts of objects as well as their spatial relationships with the whole object. The upward projection of such detailed visuospatial information to brain systems that produce behavioural responses is consistent with the hierarchical phenomenology of human vision described by Duncan & Humphreys [4]. Obviously, such hierarchical visual representations are useful, and in fact essential, for guiding behaviour in natural spatial environments. Moreover, it has been known since the early experimental studies of Edward Tolman [37] that even non-primates such as rats naturally learn about the structure of their environment and produce behaviour that seems to draw upon this knowledge [37,38]. We posit that the kind of hierarchical visual representations that develop in our brain-inspired models, which encode not only visual features at every spatial scale but also the binding relations between these features, are necessary to enable the brain to learn a sufficiently rich model of causal relations in the world for guiding decision-making and behaviour.

An extraordinary aspect of the hierarchical feature-binding hypothesis of Eguchi et al. [9] is that this theory proposes a key functional role for axonal transmission delays, which theoretical neuroscientists and engineers may have previously considered to be merely sources of noise or processing delay in the primate visual system. Instead, these axonal delays are essential to the emergence of polychronization and feature-binding representations.
It is therefore highly interesting to note that other simulation studies have found that axonal transmission delays may play an important functional role in quite a different aspect of brain function, that is, path integration of allocentric spatial representations in the brain. Specifically, Walters et al. [39] found that incorporating axonal transmission delays into their model of the head direction system allowed the model to learn to update its internal representation of head direction using vestibular angular head velocity signals at approximately the correct speed during head rotations in the dark. Taken together, these varied simulation studies indicate that axonal transmission delays may play an important role in information processing across a variety of different brain areas and functions.

However, the simulation study carried out by Eguchi et al. [9] was limited by the use of a relatively impoverished set of visual stimuli used to train and test the network as shown in figure 11. In particular, these authors did not test the firing responses of three-neuron binding circuits that emerged in their model on a large set of more realistic visual stimuli translating across different retinal locations. Nor did they present multiple stimuli at the same time to the network during testing, which is a further important test of feature binding as discussed by [2] and illustrated in figure 1. In such richer visual test environments, sometimes the low-level feature neuron 1 may fire without stimulating the high-level feature neuron 2 because the lower-level feature is part of a different visual object, or the high-level feature neuron 2 (with a larger receptive field) may fire without the low-level feature neuron 1 (with a smaller receptive field) being activated because the visual object is presented at a different retinal location.
These kinds of more realistic simulation are needed to enable a proper test of whether such binding neurons consistently fire if and only if the low-level feature neuron 1 is participating in firing the high-level feature neuron 2. Hence this remains an important property to test for in future simulation studies with more ecologically realistic visual test scenes containing multiple objects that undergo natural transformations such as changes in retinal location, orientation or scale.

Furthermore, the role of population-wide oscillations in the coding of information is left unaddressed in this paper. Population oscillations may emerge naturally in the cortex through interactions between populations of excitatory and inhibitory neurons. Moreover, the literature indicates important functional roles for population oscillations [14] within cortical neural networks. In particular, there is experimental evidence that spatio-temporal patterns of spiking activity may occur in fixed temporal relationships to underlying population oscillations, where the timings of spikes relative to the population oscillation carry stimulus information [13]. In this case, the PNGs may sit on top of, and in fact be organized by, the underlying population oscillation. In future simulation work, we will investigate the interaction between population oscillations and both precise input spike timing and emergent polychronization.

The simulation results discussed in this paper show how representations of visual features at every spatial scale, as well as the hierarchical binding relations between these features, may develop through the emergent polychronization within biological spiking neural networks and be projected up to the higher network layers for readout by later behavioural brain systems. These theoretical findings, which are supported by neurophysiology studies such as Abeles et al. [23], Prut et al. [24] and Rainer et al.
[18], are consistent with the rich hierarchical phenomenology of primate vision as described by Duncan & Humphreys [4] in §1. We claim that such a semantically rich hierarchical visuospatial representation is essential to the ability of the brain to make sense of its sensory world and behave intelligently within it. Understanding this ability of biological vision is therefore a key step towards the development of machines that can also perceive and understand their environment and behave flexibly within it—i.e. what is commonly referred to as artificial general intelligence.