To investigate the integration of features, we have developed a paradigm in which an element is rendered invisible by visual masking. Still, the features of the element are visible as part of other display elements presented at different locations and times (sequential metacontrast). In this sense, we can "transport" features non-retinotopically across space and time. The features of the invisible element integrate with features of other elements if and only if the elements belong to the same spatio-temporal group. The mechanisms of this kind of feature integration seem to be quite different from classical mechanisms proposed for feature binding. We propose that feature processing, binding, and integration occur concurrently during processes that group elements into wholes.
Does binding operate on pre-processed features, or are feature processing and binding concurrent operations? What is the relationship between features and their carriers? What happens to features whose carriers become invisible? How do the inhibitory processes that operate on carriers affect feature processing and binding? Why and when are features segregated or integrated? Does attention play a role in these processes?
Introduction
To make sense of the world surrounding us, the brain has to extract and interpret information from the vast amount of photons impinging on our photoreceptors. The interpretation of information requires the establishment of spatio-temporal relations between different elements. How these processes of information extraction and interpretation lead to perception, learning, development, and knowledge have been fundamental problems in philosophy, psychology, neuroscience, and artificial intelligence. For example, empiricism and behaviorism are based on the principle of association. Elements that co-occur repetitively or persistently in spatial and/or temporal proximity become associated, i.e., relations are established among them so as to bind them into more complex entities. The Hebbian postulate offers a possible mechanism whereby such associations can be implemented in neural systems (Hebb, 1949). In contrast to these hierarchical approaches that build more complex entities from combinations of simpler entities, Gestalt psychologists suggested that stimuli become organized into wholes, or Gestalts, that cannot be reduced to associative combinations of their parts. Both associationist and Gestaltist views are still prevalent today as one considers the binding problem at its various levels, from perception to knowledge.
In visual perception, most approaches to the binding problem are guided by the parallel and hierarchical organization of the early visual system. Information is carried by parallel pathways from the retina to higher levels of cortex, for example, by retino-cortical magnocellular and parvocellular pathways, and cortico-cortical dorsal and ventral pathways. Neurons in different visual areas generate distinctive responses to different stimulus attributes. For example, neurons in area MT are sensitive to motion whereas neurons in the blob regions of V1 are particularly sensitive to color. There appears to be a hierarchy within pathways; for example, complex shape selectivity appears to result from a hierarchy in the ventral pathway, starting with orientation selectivity, leading to curvature selectivity, and finally to complex shape selectivity (Connor et al., 2007). This hierarchy has been suggested to be accompanied by a shift in reference frames, from retinotopic reference frames in early areas to object-centered reference frames in higher areas (Connor et al., 2007). If different attributes of a stimulus are processed in different parts of the brain according to different reference frames, how are they associated with each other to underlie the unified percepts that we experience?
The hierarchy in the visual system is often interpreted to support the associationist view. It is assumed that the early visual system computes a set of stimulus attributes (e.g., oriented boundary segments, color, texture) and the binding consists of selectively associating different attributes with each other by, for example, hierarchical convergence (e.g., Riesenhuber and Poggio, 1999b), neural synchrony (Singer, 1999), or by an attentional scanning mechanism (Treisman and Gelade, 1980; Treisman, 1998).
In analyzing the binding problem in its broader context, one has to recognize that there are several stimulus attributes that need to be bound together, thereby leading to a variety of binding problems. Treisman (1996) pointed out the existence of at least seven types of binding, including “property binding” (e.g., how color and shape of the same object are bound together), “part binding” (how different parts of an object, such as boundary segments, are segregated from the background and bound together), “location binding” (how shape and location information, believed to be represented in ventral and dorsal pathways, respectively, are bound together) and “temporal binding” (how binding operates across time when an object moves). 
It is highly likely that these different types of binding operations are not independent from each other but work in an interactive way. Furthermore, while most theoretical approaches assume as a starting point simple “features,” such as oriented line segments and color patches, it is highly likely that the computation of even these basic features is not independent from their binding operations. To appreciate this last point, one needs to first recognize that the computation of features is not instantaneous, but takes time. Second, under normal viewing conditions, our eyes undergo complex movements. Many objects in the environment are also in motion and thereby cause dynamic occlusions. As a result, the representation of the stimulus in retinotopic areas is highly dynamic, transient, and intermingled. Under these conditions, one cannot assume that features are already computed and ready for binding operations; instead, one needs to address the problem of how to simultaneously compute and bind features through interactive processes. Consider for example a moving object. Due to occlusions, the features of the moving object will overlap with those of the background or with those of other occluding objects. The receptive fields of neurons in retinotopic areas will receive a succession of brief and transient excitations from a variety of features, some belonging to the same object, some belonging to different objects. To compute features, the visual system should be able to decide whether to segregate information (when it belongs to different objects) or integrate information (when it belongs to the same object). The object file theory (Kahneman et al., 1992) assumes that an object file is opened and indexed by location and features are inserted to this file over time to allow processing. 
However, this poses a “chicken-and-egg” problem: In order to decide which objects are distinct, one needs to have access to their features; but unambiguous processing of features, in turn, requires the opening of distinct object files. This vicious circle suggests, again, that the processing of features needs to co-occur with their binding.
In this paper, we summarize our recent findings from studies where we examined the spatio-temporal dynamics of feature processing and integration. In order to assess the temporal interval during which the stimulus is processed, we used brief presentations of features (a vernier offset presented for 20 ms).
The Sequential Metacontrast Paradigm
We presented a vernier stimulus that comprises a vertical line with a small gap in the middle. The vernier was presented for 20 ms, followed by a blank screen (inter-stimulus interval, ISI) for 30 ms, and then by a pair of lines neighboring the vernier. The central vernier stimulus is rendered invisible because the flanking lines exert a metacontrast effect (Figure 1A; Stigler, 1910; Alpern, 1953; Bachmann, 1994; Breitmeyer and Ögmen, 2006).
In all the experiments reported up to here, observers attended to one pre-determined motion stream and reported the vernier offset that they perceived within this motion stream. Thus, as the attended stream and the stream selected for perceptual report were always the same, the results cannot clarify whether attention plays a role in these binding and integration effects. In order to study the role of attention, we used a cueing paradigm. We modified the experiment shown in Figure 4 by keeping the length of the sequence to four flanking lines and by introducing an auditory cue (Figure 5).
The results so far showed that features remain segregated according to motion grouping relations and that their processing and integration take place within each motion stream. As we have mentioned at the beginning of the article, under normal viewing conditions, moving objects overlap and occlude each other. The visual system needs to decide whether to integrate or segregate overlapping features. To study this problem, we presented two sequential metacontrast sequences next to each other so that two of the four motion streams merged at a common point (Figures 6A–C).
An Ecological Framework: The Problems of Motion Blur and Moving Ghosts
The studies outlined above were motivated by the observations that under normal viewing conditions, the visual system needs to compute features at the same time as it binds them. This is because the computation of a feature requires decisions regarding whether transient stimulations generated come from the same or different objects. Our results show that the carrier of a stimulus can be rendered invisible and the corresponding feature can be integrated with features presented at retinotopic locations different than the retinotopic location of its carrier. Why is the perception of the carrier inhibited and why is the feature integrated with other features in a non-retinotopic manner?
Under normal viewing conditions, a briefly presented stimulus can remain visible for more than 100 ms, a phenomenon known as visible persistence (Haber and Standing, 1970; Coltheart, 1980). This should imply that moving objects appear extensively blurred; however, in general we do not experience motion blur (e.g., Ramachandran et al., 1974; Burr, 1980; Hogben and Dilollo, 1985; Farrell et al., 1990; Castet, 1994; Bex et al., 1995; Chen et al., 1995; Westerink and Teunissen, 1995; Bedell and Lott, 1996; Burr and Morgan, 1997; Hammett, 1997). Another problem associated with object motion is the problem of “moving ghosts” (Ögmen, 2007). Since a moving object stimulates each retinotopically localized neuron only for a brief time period, no retinotopic neuron by itself will receive sufficient stimulation to extract features of the stimulus. Thus, moving stimuli should appear as “ghosts,” i.e., blurred and quasi-uniform in character, devoid of specific featural qualities. This happens when stimuli move at excessively high speeds but not for ecologically observed speeds. We suggest that the visual system solves motion blur and moving ghosts problems by two complementary mechanisms. The carriers and features are first registered in retinotopic representations. 
The spatial extent of motion blur is curtailed by inhibitory mechanisms that make stimuli, as the central vernier in our displays, invisible on a retinotopic basis (Ögmen, 1993, 2007; Chen et al., 1995; Purushothaman et al., 1998). However, the features of the retinotopically inhibited stimuli are not destroyed; instead, based on prevailing motion grouping relations, they are attributed to motion streams where they are processed and bound (Otto et al., 2006, 2009; Ögmen et al., 2006; Ögmen, 2007; Breitmeyer et al., 2008). This non-retinotopic, motion grouping based feature processing provides the solution to the moving ghosts problem. While features of an object activate retinotopically organized cells momentarily, they remain relatively invariant along the motion path of the object. This allows sufficient time for non-retinotopic mechanisms to receive and process incoming stimuli as they become segregated according to prevailing motion grouping relations. Thus, we suggest that features that become dissociated from their carriers are mapped into non-retinotopic representations following spatio-temporal grouping relations.
When and why are features integrated? Most vision problems are ill-posed (Poggio et al., 1985). For example, the light that shines on a photoreceptor is always the product of the illuminance (e.g., sunlight) and the reflectance (properties of the object): luminance = illuminance × reflectance. Hence, the luminance value is not sufficient to determine reflectance. Solving such ill-posed problems can take substantial amounts of time and needs to take contextual information into account, making a short-term retinotopic analysis impossible. Consider the following situation. A car drives through a street. Because of shadows and reflecting lights, the car elicits a series of very different luminance and chromaticity signals on the retina. For example, the red of the car may be almost invisible when driving through a dark shadow but bright and well visible when in sunlight. The brain usually discounts the illuminance (color constancy). However, for the fast-moving car, processing time is too short when computed at each retinotopic location. Moreover, it is not necessary to compute the reflectance of the car at each location and instant given the knowledge that car colors do not change. Hence, averaging across the features along a motion trajectory may be a first step toward a good estimate of the car color. Vernier offset integration is just a toy version of such a scenario. For this and other reasons, we would like to argue that most visual processes, including feature processing, binding, and integration, in fact occur in non-retinotopic frames of reference. Using a different approach than the sequential metacontrast paradigm, we have shown evidence for non-retinotopic processes of vernier offsets (as used here; Ögmen et al., 2006; Aydin et al., 2011b), motion, form, and attention in visual search (Boi et al., 2009, 2011). In addition, perceptual learning in the sequential metacontrast paradigm occurs within non-retinotopic rather than retinotopic coordinates (Otto et al., 2010b).
Where does feature integration occur? We used high density EEG and inverse solutions. We found that the insula showed enhanced activity when vernier offsets are integrated (Plomp et al., 2009). The insula is one of the areas involved in all sorts of integration processes and consciousness (e.g., Craig, 2009).
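The car example above (luminance = illuminance × reflectance, with reflectance constant along the motion path) can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not a model proposed in this article: the function names, the numerical values, the noisy per-location illuminance guesses, and the use of a plain mean as the pooling rule are all hypothetical.

```python
import random
from statistics import mean

def luminance(illuminance, reflectance):
    """Luminance reaching the eye: illuminance times surface reflectance."""
    return illuminance * reflectance

def local_reflectance_estimates(luminances, illuminance_guesses):
    """One reflectance estimate per retinotopic location, each from a brief sample."""
    return [lum / guess for lum, guess in zip(luminances, illuminance_guesses)]

def trajectory_estimate(luminances, illuminance_guesses):
    """Pool local estimates along the motion path (assumed rule: plain mean)."""
    return mean(local_reflectance_estimates(luminances, illuminance_guesses))

# Ill-posedness: two different scenes produce the same luminance.
same_luminance = luminance(80.0, 0.25) == luminance(40.0, 0.5)

# A car with constant reflectance passes through shadow and sunlight.
rng = random.Random(1)  # fixed seed so the sketch is repeatable
true_reflectance = 0.5
illuminances = [10.0, 200.0, 50.0, 140.0, 20.0, 180.0]  # hypothetical values
luminances = [luminance(e, true_reflectance) for e in illuminances]

# Assumption: context gives only a noisy guess of the illuminance at each location.
guesses = [e * rng.uniform(0.7, 1.3) for e in illuminances]

local = local_reflectance_estimates(luminances, guesses)
pooled = trajectory_estimate(luminances, guesses)
worst_local_error = max(abs(r - true_reflectance) for r in local)
pooled_error = abs(pooled - true_reflectance)
```

In this sketch the luminance signal varies 20-fold along the path although the reflectance never changes, and the pooled estimate can never be worse than the worst single-location estimate.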
Implications for Models of Binding
In classical models of binding by synchrony, features are bound together when their neural representations fire simultaneously or with a common frequency and phase relation (e.g., Singer, 1999). For example, when a red square and a green disk are presented, neurons coding for red and squareness fire synchronously and similarly neurons coding for green and diskness. When the combination of colors and form changes, the synchronization changes accordingly. However, synchronization is not a mechanism per se for computing binding but may be a way of communicating information. Therefore, the crucial question that remains is how grouping, feature processing, integration, and binding take place in our stimuli. Synchronization may be an outcome of computational mechanisms underlying these processes; however, it does not provide, in itself, a causal explanation for the outcome. As a result, to test whether synchronization can explain our results necessitates models that would be able to carry out the aforementioned processes and produce synchronization as an emergent property.
Can our results be explained by the association principle and the related convergent coding models? Particularly, averaging of features is a classical property of many models of grandmother cell coding to avoid the curse of dimensionality (Riesenhuber and Poggio, 1999a). The sequential metacontrast paradigm is quite robust to substantial changes, i.e., changes in ISI, spacing between lines, number and orientation of lines, and contrast polarity (see Figure 4; Otto et al., 2009). On the other hand, small spatio-temporal details do matter when they change the grouping (Figure 3). Hence, it is hard to explain with most convergent coding schemes how for each conceivable motion stream, there are hard wired detectors binding offsets together. 
Moreover, sequential metacontrast is not limited to vernier offsets; hence, the number of possible motion groups and feature bindings is virtually unlimited (see also Footnote 1).
Often it is proposed that a master map of attention binds features of retinotopic, basic feature maps together, particularly, to solve the property binding problem (Treisman, 1998). However, the role of attention in our dynamic stimuli appears to be different. Within a given stream, vernier offset integration occurs automatically without focused attention. When attention is focused on the stream, only the integrated sum of the vernier offsets, rather than individual offsets, can be read out. On the other hand, attention can play a major role when it comes to combining different, independent motion trajectories into more complex motion structures (Figures 6 and 7).
As a path toward the solution, we propose the following non-retinotopic processing scheme shown in Figure 8 (Ögmen, 2007; Ögmen and Herzog, 2010).
Figure 8
Schematic depiction of the proposed approach to conceptualize non-retinotopic representations wherein feature processing, binding, and integration take place. The two-dimensional plane at the bottom of the figure depicts the retinotopic space in early vision. In this example, a group of dots (shown in red) moves rightward and a second group of dots (shown in orange) moves upward. A fast motion segregation and grouping operation establishes two distinct local neighborhoods, which are mapped into two different non-retinotopic representations (for clarity, the figure shows only the non-retinotopic representation for rightward moving dots). A vector decomposition takes place (e.g., Johansson, 1973) and a common vector for the neighborhood (dashed green vector) serves as the reference frame for the neighborhood. The stimulus in the local neighborhood is mapped on a non-retinotopic manifold (for depiction purposes a sphere is used). This allows the processing and integration of features in a manner that remains invariant to their global motion. Features that are mapped to common manifolds become candidates for binding into groups (from Ögmen and Herzog, 2010).
The retinotopic space is depicted at the bottom of the figure as a two-dimensional plane. A group of dots moves rightward (highlighted in red) while another group of dots moves upward (highlighted in orange). Based on differences in motion vectors, the two local neighborhoods are mapped into two different non-retinotopic representations; for clarity the figure shows only the non-retinotopic representation for the rightward moving dots. Each feature, visible or not, is attributed to a motion group. The invisibility of the carrier of the feature indicates the inhibition of its retinotopic activity. A common vector for each neighborhood is determined (dashed green vector) and serves as the reference frame for that neighborhood. 
All motion vectors are decomposed into a sum of the reference motion and a residual motion vector. The stimulus in the local neighborhood is mapped on a manifold (in Figure 8, for depiction purposes a sphere is used), i.e., a geometric structure that preserves local neighborhood relations. However, the surface can be stretched and deformed. The residual motion vectors, or relative motion components with respect to the reference frame, are then applied to the manifold so as to deform it to induce transformations that the shape undergoes during motion. Features that are mapped into this manifold within a pre-determined spatio-temporal window become integrated. Thus, according to this approach, feature processing and binding occur largely in non-retinotopic representations that are built from ongoing motion grouping relations in the retinotopic space. Two different motion streams are mapped into two different manifolds and remain segregated in agreement with our results. When the streams merge, a common point in the retinotopic space signals occlusion. We suggest that observers can read out information about different motion streams by accessing their distinct manifold representations and resolve the occlusion in a flexible way by attributing to the common point the feature information associated with the attended stream. This is illustrated in Figure 9.
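The vector decomposition step described above can be sketched as follows. This is only an illustration: the text does not specify how the common vector is computed, so taking it to be the mean motion vector of the neighborhood is an assumption, and the function names and example velocities are hypothetical.

```python
def common_vector(vectors):
    """Reference motion of a local neighborhood (assumed here: the mean vector)."""
    n = len(vectors)
    return (sum(vx for vx, _ in vectors) / n,
            sum(vy for _, vy in vectors) / n)

def decompose(vectors):
    """Split each motion vector into the reference motion plus a residual."""
    ref = common_vector(vectors)
    residuals = [(vx - ref[0], vy - ref[1]) for vx, vy in vectors]
    return ref, residuals

# Hypothetical neighborhood: three dots drifting rightward, two of them also
# moving relative to the group (one slightly up, one slightly down).
rightward_group = [(4.0, 0.0), (4.0, 1.0), (4.0, -1.0)]
reference, residuals = decompose(rightward_group)

# Adding the reference back to each residual recovers the original vectors.
reconstructed = [(rx + reference[0], ry + reference[1]) for rx, ry in residuals]
```

Here `reference` plays the role of the dashed green vector in Figure 8, and `residuals` are the relative motion components that, in the proposed scheme, would deform the manifold.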
Figure 9
Depiction of how occluded objects are represented and processed. According to this approach, retinotopic areas serve as a relay where features are transferred to non-retinotopic areas according to spatio-temporal grouping relations. A second role of retinotopic areas is to resolve depth order and occlusion relations. While the entire shapes of objects can be accessed from their non-retinotopic representations, visibility of the parts is dictated by retinotopic activities. In this example, observers can recognize a complete triangle (amodal completion) but only those parts that are un-occluded in retinotopic representations become visible.
In this example, a square and a triangle move and according to motion grouping relations, a non-retinotopic representation is created for each motion stream. The retinotopic information is conveyed to the appropriate non-retinotopic representations where processing of features takes place. Thus, according to our theory, the first major role of retinotopic processes is to establish grouping relations and convey feature information to non-retinotopic areas according to these grouping relations. Grouping and attention are independent but interactive processes (Aydin et al., 2011a). A second role of retinotopic representations is to resolve depth order and occlusion relations and thereby determine those features that will gain visibility. Figure 9 shows a time instant when the rectangle and the triangle occlude each other. The reciprocal relationships between retinotopic and non-retinotopic activities reveal occlusion properties and establish visibility based on this information. In the example shown in Figure 9, the rectangle is in the foreground and becomes fully visible; only the un-occluded parts of the triangle become visible. However, since the triangle is stored and computed in non-retinotopic representations, the percept is not that of two disjoint segments, but instead a single triangle (amodal completion). 
Applying this concept to the merging streams, one can see that an observer can access the vernier information of the streams independently because they are stored in separate representations. The point where the streams merge constitutes an ambiguous occlusion point because, unlike the square-triangle example of Figure 9, the shape at the point where the two streams merge (line) can belong to either stream. Thus, based on attentional cueing, the offset of either stream can be attributed to the point of occlusion.
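The read-out scheme for merging streams can be illustrated with a small sketch. It assumes, purely for illustration, that each vernier offset is a signed number, that integration within a stream is a simple sum, and that attention selects which stream's sum is attributed to the merge point; the stream labels, offset values, and function names are hypothetical.

```python
from collections import defaultdict

def integrate_by_stream(events):
    """Sum signed vernier offsets separately within each motion stream.

    events: iterable of (stream_id, offset) pairs.
    """
    totals = defaultdict(float)
    for stream_id, offset in events:
        totals[stream_id] += offset
    return dict(totals)

def read_out_at_merge(totals, attended_stream):
    """Attribute the attended stream's integrated offset to the merge point."""
    return totals[attended_stream]

# Hypothetical display: the left stream carries two opposite offsets,
# the right stream carries two offsets in the same direction.
events = [("left", 1.0), ("right", 1.0), ("left", -1.0), ("right", 1.0)]
totals = integrate_by_stream(events)

percept_if_left_cued = read_out_at_merge(totals, "left")
percept_if_right_cued = read_out_at_merge(totals, "right")
```

Because the two streams are kept in separate representations, the same retinotopic merge point yields different reports depending on the cue; only the per-stream sum is available, not the individual offsets.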
Conclusion
In summary, the sequential metacontrast paradigm is a versatile tool to investigate many aspects of vision including consciousness, spatio-temporal grouping, attention, and feature integration. We have shown how features of invisible elements can still become visible at other elements and even integrated with other features. Feature integration occurs only when elements belong to one spatio-temporal group. Our findings show how the human brain integrates even very briefly presented information at a very subtle spatial scale.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.