
High-acuity vision from retinal image motion.

Alexander G. Anderson, Kavitha Ratnam, Austin Roorda, Bruno A. Olshausen.

Abstract

A mathematical model and a possible neural mechanism are proposed to account for how fixational drift motion in the retina confers a benefit for the discrimination of high-acuity targets. We show that by simultaneously estimating object shape and eye motion, neurons in visual cortex can compute a higher quality representation of an object by averaging out non-uniformities in the retinal sampling lattice. The model proposes that this is accomplished by two separate populations of cortical neurons - one providing a representation of object shape and another representing eye position or motion - which are coupled through specific multiplicative connections. Combined with recent experimental findings, our model suggests that the visual system may utilize principles not unlike those used in computational imaging for achieving "super-resolution" via camera motion.


Year:  2020        PMID: 32735342      PMCID: PMC7424138          DOI: 10.1167/jov.20.7.34

Source DB:  PubMed          Journal:  J Vis        ISSN: 1534-7362            Impact factor:   2.240


Introduction

During visual fixation, humans have a stable, high-acuity perception of the world despite substantial drifting movements of the eyes. Recent experiments demonstrate the benefit of these movements for the discrimination of a small letter whose stroke spacing is near the sampling limit of the cone photoreceptor array (Ratnam, Domdei, Harmening, & Roorda, 2017). Subjects are shown a diffraction-limited letter E in one of four orientations (strokes pointing up, down, left, or right) during natural drift movements of the eye, and are asked to report the letter's orientation. The stimulus size is chosen to challenge the subject to the point that the orientation is discriminated correctly 40% to 60% of the time. In a second condition, the image of the E is stabilized on the retina by a real-time eye tracker with cone-level precision. Here, subjects’ performance decreases. In a third condition, the stimulus moves on the retina with the same statistics as natural eye motion, but incongruent (uncorrelated) with the eye's true motion. Surprisingly, although subjects are aware of the incongruent motion of the stimulus, their task performance is the same as the natural condition in which there is no perception of motion. Taken together, these results are remarkable because the visual features defining the object span just a few photoreceptors, yet the eye's own motion spreads these features over many photoreceptors within the presumed temporal integration window of downstream cortical neurons. Thus, there must be a neural mechanism that makes use of the movement of the stimulus relative to the retina, independent of whether or not the motion is generated by the eye, for improving task performance. Our goal here is to elucidate the neural mechanisms that could underlie these experimental results with a mathematical model capable of exhibiting the same behavior. 
Previous efforts to model perceptual stability in the face of fixational eye movements proposed specific neural computations to build up invariant representations of sensory signals using shifter circuits (Anderson & Van Essen, 1987) or map-seeking circuits (Arathorn, Stevenson, Yang, Tiruveedhula, & Roorda, 2013). Other investigators have approached the problem in the framework of Bayesian inference and proposed models that decode retinal ganglion cell (RGC) spikes generated from a stimulus moving owing to fixational eye movements (Pitkow, Sompolinsky, & Meister, 2007; Burak, Rokni, Meister, & Sompolinsky, 2010). Burak et al. (2010) showed that, for stimuli with binary-valued pixels, a decoder of these spikes must take into account the motion of the eye (under reasonable assumptions about the size of the eye motions and the firing rates of RGCs); otherwise, the reconstructed pattern is a blur. They showed how this blur may be mitigated by simultaneously estimating form and motion in a Bayesian-optimal manner: the estimated motion is used to dynamically reroute the incoming spikes onto a population of cortical neurons so as to build up an unblurred estimate of the underlying pattern on the retina. Although this model took an important step in demonstrating a computational mechanism that can account for how high acuity is preserved under fixational eye movements, it primarily aimed to mitigate blur, viewing the drift in eye position as a hindrance. Inspired by the results of Burak et al., we sought to show not only how blur could be mitigated, but how retinal image drift could confer a benefit, because it can potentially improve visual acuity by averaging over inhomogeneities in the retinal sampling lattice. Doing so requires generalizing the model to allow for spatially continuous eye movements and gray-valued image stimuli, as opposed to the discretized eye movements and binarized stimuli assumed by Burak et al.
Generalizing the model in this way is both scientifically important and technically difficult. Although the structure of our generative model is closely related to that of Burak et al., the methods for inferring the spatial pattern from the spikes are completely different because the position and pixel values can no longer be discretely enumerated. Furthermore, their mean-field approximation of the image does not allow for non-trivial priors on the spatial pattern, such as in the sparse coding model of V1 (Olshausen & Field, 1997). Thus, we developed a novel, approximate Bayesian inference method based on an online approximation of the expectation maximization (EM) algorithm.
The general idea that motion is beneficial for an image sensor has been considered in a variety of disciplines. In the computational imaging community, the problem of combining a sequence of low-resolution images to form a single high-resolution image has well-developed solutions (e.g., Farsiu, Robinson, Elad, & Milanfar, 2004). In the field of active perception, Rucci and colleagues (Rucci, Iovin, Poletti, & Santini, 2007; Kuang, Poletti, Victor, & Rucci, 2012; Aytekin, Victor, & Rucci, 2014; Rucci & Victor, 2015; Boi, Poletti, Victor, & Rucci, 2017; Rucci, Ahissar, & Burr, 2018; Casile, Victor, & Rucci, 2019; Intoy & Rucci, 2020) have studied the benefits that could arise from small eye motions due to the spreading of signal power from the spatial domain into the temporal domain. They show that the 1/f^2 spatial power spectrum of natural images, when combined with the statistics of eye motion, results in a flattening of the power spectrum over the joint spatiotemporal frequency domain. They further show that, when this signal is sent through the temporal filtering properties of RGCs, high spatial frequency details are amplified and more global spatial structures such as contours could be detected from spike synchrony.
Their theory is complementary to ours in that they address limitations imposed by postreceptoral mechanisms (e.g., the limited dynamic range and bandwidth of the optic nerve) and subsequent processes of feature extraction, assuming the image signal has been adequately sampled by the cones such that it can be treated as a continuous function of space and time, I(x, y, t). The focus of our work, by contrast, is to understand how spatial detail at the very highest spatial frequencies (50 cycles/deg) can be perceived and discriminated despite the fact that spatial information at these scales is compromised owing to the punctate nature of cone sampling, inhomogeneities in the retinal cone mosaic, and variations among the cones themselves. We also take into account the punctate encoding in time by RGCs; that is, signals are conveyed to the brain not as continuous waveforms, but as a sequence of spikes. We propose a computational mechanism for decoding images that have been sampled and temporally encoded in this way, and we quantitatively evaluate its performance, corroborating the psychophysical measurements of Ratnam et al. In what follows, we first describe our model for estimating form and motion, with more complete details given in the Appendix. We then use the model to decode simulated spikes generated by the same letter E stimulus used in the experiments of Ratnam et al. We show that it is possible to resolve the fine spatial structure of the letter E that would otherwise be impossible to resolve in a statically viewed presentation of the stimulus on the cone lattice. We also demonstrate the ability to resolve the stimulus given a retina with holes in the cone lattice, consistent with the finding that observers with retinal degeneration can exhibit normal visual acuity.
Finally, we generalize the model to the case of natural image stimuli, using a sparse latent variable model as the image prior, resulting in a model that is consistent with the known feature representations in V1 (i.e., neurons with localized, oriented, and bandpass receptive fields). We conclude by discussing neurobiological and technological implications of the model.

Methods

The simulations in this article proceed by first generating spikes from a spatial array of simplified RGCs in response to a spatial pattern (either an E or a natural scene patch) as it drifts over the retina, as shown in Figure 1A. These spikes are then decoded by our proposed model — an approximate inference procedure that assumes knowledge of the process by which the spikes were generated — to infer the spatial pattern and its motion, as shown in Figure 1B to 1D.
Figure 1.

Model Overview: (A) An upright letter E (stroke width = 0.8 arcmin) projected onto a simulated cone lattice (average spacing 1.09 arcmin) with a 500 ms eye drift trajectory (Ratnam et al., 2017) superimposed (green trace). RGC spikes are generated using a linear-nonlinear-Poisson model with ON and OFF cells. The ON and OFF RGC response functions are symmetrical, so the presence of a stimulus for an ON cell gives an equivalent response to the absence of a stimulus for an OFF cell. (B) Probabilistic model for inferring stimulus shape S (encoded by latent variables A) and position X from retinal spikes R. Arrows indicate causal relationships between variables. The spikes R are observed, and the latent factors encoding shape A and position X must be simultaneously inferred. (C, D) The spike decoder repeatedly alternates between two steps: (C) In the first step (Equation 5), the estimate of the pattern is fixed (S = Ŝ) and new evidence from the next set of incoming spikes R_t is incorporated to obtain an updated posterior distribution over eye position P(X_t | R_0:t) (shown as a probability cloud). This update is computed by multiplying the probability distribution over the predicted position P(X_t | R_0:t-1) (computed from the diffusion model applied to the previous position estimate) by the likelihood P(R_t | X_t, S = Ŝ) (computed by cross-correlating the current estimate of the pattern with the spatial array of incoming spikes). (D) In the second step (Equations 8, 10), the neurons representing the internal position estimate X act to dynamically route incoming spikes by multiplicatively gating their connections to the internal pattern estimate, thus updating S.


Simulating RGC responses to drifting stimuli

Each RGC is assumed to receive input from a single cone (one ON and one OFF RGC per foveal cone; Ahmad, Klug, Herr, Sterling, & Schein, 2003) and is modeled as having a Gaussian receptive field with a full width at half maximum of 0.48 times the cone spacing (Macleod, Williams, & Makous, 1992). For the present purposes, we leave out the lateral inhibition and temporal filtering properties of RGCs, focusing mainly on the spatial resolution provided by the retina. The retinal cone lattice is specified by generating a hexagonal grid (with random orientation) with a spacing of 1.09 arcmin and then randomly jittering the position of each cone by adding noise uniformly distributed within ±25% of the spacing to the horizontal and vertical coordinates. Although jittering the centers of the cones adds realism to the simulations and demonstrates the flexibility of the inference model, our experiments showed that it does not impact the reconstruction error as a function of time. Eye movement trajectories are generated either as a diffusive random walk or from the drift eye movement recordings of Ratnam et al. (2017). The eye motion traces were obtained using an adaptive optics scanning laser ophthalmoscope (AOSLO; Roorda et al., 2002). Trials with microsaccades are discarded. The raw data are cleaned by using interpolation to replace single-timestep outliers, and trials with longer sections of invalid data are discarded. Finally, a Kalman filter with a diffusion motion prior is used to smooth the data. Because the error between the smoothed path and the true path has roughly double the standard deviation of the AOSLO's tracking error (Stevenson, Roorda, & Kumar, 2010), one-half of the difference between the data and the smoothed path is added back to the smoothed path to retain some of the non-smooth component of the eye motion (i.e., tremor).
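The lattice construction described above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name and grid dimensions are our choices, and the random grid orientation is omitted for brevity.

```python
import numpy as np

def make_cone_lattice(n_rows=20, n_cols=20, spacing=1.09, jitter=0.25, seed=0):
    """Hexagonal cone lattice (coordinates in arcmin) with positional jitter.

    Each cone's x and y coordinates are perturbed by noise drawn uniformly
    from +/- `jitter` * `spacing`, as described in the text.
    """
    rng = np.random.default_rng(seed)
    xx, yy = np.meshgrid(np.arange(n_cols, dtype=float),
                         np.arange(n_rows, dtype=float))
    xx[1::2] += 0.5                # offset every other row -> hexagonal packing
    yy *= np.sqrt(3) / 2           # vertical spacing of a hexagonal grid
    pts = np.stack([xx.ravel(), yy.ravel()], axis=1) * spacing
    pts += rng.uniform(-jitter * spacing, jitter * spacing, size=pts.shape)
    return pts

cones = make_cone_lattice()
print(cones.shape)   # (400, 2)
```

Setting `jitter=0.0` recovers the undistorted hexagonal grid, which is convenient for checking that the perturbations stay within the stated ±25% bound.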
Spiking responses of RGCs are generated using a linear-nonlinear-Poisson model (Paninski, Simoncelli, & Pillow, 2004) without any spike history dependencies. The instantaneous rate parameter for each RGC is set to a baseline of 10 Hz and increases exponentially with the inner product of the RGC's receptive field and the retinal image translated by the current eye position, scaled so that the maximum rate is 100 Hz, as specified in Appendix Equations 11–17.
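A minimal sketch of such a spike generator follows. The exact nonlinearity and scaling are specified in Appendix Equations 11–17, which are not reproduced here; this block simply realizes "baseline 10 Hz, exponential growth, maximum 100 Hz" for a linear drive assumed normalized to [0, 1].

```python
import numpy as np

def lnp_spikes(drive, dt=0.001, r0=10.0, rmax=100.0, rng=None):
    """Linear-nonlinear-Poisson spike counts for one timestep.

    drive : linear filter outputs (RGC receptive field . translated image),
            assumed normalized to [0, 1] -- an assumption of this sketch.
    The rate grows exponentially from the baseline r0 (at drive = 0) to the
    maximum rmax (at drive = 1); counts are Poisson in a bin of width dt.
    """
    rng = rng or np.random.default_rng()
    rates = r0 * (rmax / r0) ** np.asarray(drive)   # exponential nonlinearity
    return rng.poisson(rates * dt)

drive = np.linspace(0.0, 1.0, 8)
counts = lnp_spikes(drive, rng=np.random.default_rng(0))
```

With a 1 ms bin, the expected count per bin ranges from 0.01 to 0.1, so most bins contain zero or one spike, as in the simulations described above.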

Joint inference of object shape and eye position from RGC spike trains

Our hypothesis is that the visual cortex seeks to infer the spatial stimulus pattern, S, given the incoming spikes, R, where the trajectory of the eye, X, is an unknown variable. If both X and R were known, S could be easily estimated by accumulating evidence from spikes after the motion is used to correct for the translation of the eye. Likewise, if both S and R were known, X could be estimated by finding the translation of the stimulus pattern, S, that provides the best spatial alignment with the spike patterns, R, across time. In the case where only R is known, X and S must be jointly inferred, because each variable is needed to estimate the other. To solve this problem from a principled perspective, we impose priors on S and X. The prior on the eye trajectory, p(X), is a diffusive random walk with a fixed diffusion constant. The prior on the stimulus pattern, p(S), is constructed by constraining S to be given by S = DA, where D is a "dictionary matrix" whose columns are elementary spatial patterns, and the vector A is a set of latent variables that specify how much of each pattern is present. The spatial structure in S can then be modeled with a simple, factorial prior over A, p(A) = ∏_i p(A_i). The relationships between R, X, S, and A are described by the probabilistic graphical model shown in Figure 1B. The joint distribution of the nodes in the graph, N_i, is p(N) = ∏_i p(N_i | N_π(i)), where π(i) denotes the parents of node i in the graph defining the model (parent-child relationships are denoted by arrows in the diagram). All quantities of interest are computed by marginalizing the joint distribution. In an ideal Bayesian framework, one would compute the full posterior distribution over the latent variables encoding object shape, p(A | R) ∝ p(A) ∑_X p(X) p(R | X, S = DA), where p(R | X, S) reflects the probabilistic (Poisson) model used in generating the spikes (Appendix Equation 11).
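Written out for this model's variables, the factorization just described takes the following form. This is our transcription, consistent with the graphical model in Figure 1B and the diffusion prior on X, not the paper's numbered equation:

```latex
p(R_{0:T}, X_{0:T}, A)
  \;=\; p(A)\, p(X_0) \prod_{t=1}^{T} p(X_t \mid X_{t-1})
        \prod_{t=0}^{T} p(R_t \mid X_t, S),
  \qquad S = DA, \quad p(A) = \prod_i p(A_i),
```

where $p(X_t \mid X_{t-1})$ is the Gaussian transition density of the diffusive random walk and $p(R_t \mid X_t, S)$ is the Poisson spike likelihood.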
The posterior p(A | R) assigns a probability to every possible stimulus pattern S = DA given the spikes R coming from the retina, taking into account all possible eye movement trajectories weighted by their probability. We use a series of approximations to derive a computationally tractable, causal, and online computation for estimating A (see the Appendix for details). The first is to consider only the most probable set of latent shape variables (a point estimate of A rather than a full posterior). The second is to deal with the intractable sum over all possible eye trajectories by using an online approximation of the EM algorithm. The EM algorithm maximizes log P(A | R) in an iterative manner by alternating between two steps: one for estimating X, which comes from introducing a variational distribution q(X), and the other for estimating A. To make time explicit in X and R, we henceforth rewrite them as X_0:T = (X_0, X_1, …, X_T) and R_0:T = (R_0, R_1, …, R_T), where T is the total number of time steps in the simulation. R_t denotes the number of spikes emitted by each RGC in the time interval [t, t + Δt]. Because R_t depends only on the current eye position, X_t, and the stimulus, S, we can derive a set of EM update equations (Equations 2 and 3); a full derivation is given in Appendix Equations 31–34. Equation 2 estimates the eye position at time t, X_t, given the spikes R_0:T and the current estimate of the spatial pattern A′, while Equation 3 estimates A given the spikes R_0:T and estimated eye positions X_0:T. The traditional EM algorithm repeatedly applies these equations for some number of iterations, with A initialized to zero for simplicity. Note that although these update equations are guaranteed to converge to a critical point of log P(A | R) when applied repeatedly, they are still non-causal (requiring spikes from the future to estimate quantities at the current time t), and Equation 3 is not amenable to online processing because it requires optimizing over a batch of quantities from t = 0 to T.
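The display equations did not survive extraction here. In standard EM form, the two alternating updates described in this paragraph read approximately as follows; this is a hedged reconstruction consistent with the surrounding text, not the paper's exact Equations 2 and 3:

```latex
\text{E-step:}\quad q(X_{0:T}) \;=\; p\!\left(X_{0:T} \mid R_{0:T},\, S = DA'\right),
\qquad
\text{M-step:}\quad A \;\leftarrow\; \arg\max_{A}\;\Big[\log p(A)
  \;+\; \mathbb{E}_{q(X_{0:T})}\!\sum_{t=0}^{T} \log p\!\left(R_t \mid X_t,\, S = DA\right)\Big].
```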
To obtain a causal position estimator for Equation 2, the distribution over eye position at time t is approximated by replacing it with a filtering estimate that takes into account only the spikes up to time t. This estimate is then updated at each subsequent timestep via Equation 5, where Â_t is the current estimate of A given the spikes from 0 to t (computed via Equation 8). The steps involved in this calculation are shown graphically in Figure 1C. A particle filter with resampling (Doucet & Johansen, 2009) is used to represent and propagate q(X_t) from one timestep to the next (see Equations 48, 49). The optimization for A in Equation 3 is also modified to be causal and online. First, we define the negative expected log-likelihood of A at time t, which can be thought of as an energy (to be minimized) that measures how well A agrees with the position estimate and spikes at time t. A causal approximation to the update for A at time t may be obtained by considering the sum of these energies only up to time t, along with the log-prior; we now minimize rather than maximize owing to the change in sign. To make the computation online (so that the entire sum over time need not be re-minimized at each time step), the sum of the energy terms up to time t is replaced by a quadratic approximation, resulting in an update for the next time step (Equation 8), where Â_t is the current estimate of A given the spikes from 0 to t and H parameterizes the quadratic approximation (updated via Equation 9). The contribution of each of the terms in this expression may be understood as follows: the first term is a running estimate of the accumulated energies up to time t, the second term corresponds with the energy coming from the new set of incoming spikes at time t + 1, and the last two terms correspond with the log-prior on A. The quadratic approximation of the accumulated energy corresponds with a Gaussian approximation in probability; as H grows over time, the uncertainty shrinks, meaning that this term has increasing influence in determining the optimal value of A.
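The filtering update can be sketched as a bootstrap particle filter. This is our illustration of the approach cited from Doucet & Johansen (2009), not the paper's Equations 48–49; the diffusion constant, resampling threshold, and the toy likelihood below are all assumptions of the sketch.

```python
import numpy as np

def particle_filter_step(particles, weights, log_lik, D_c=20.0, dt=0.001,
                         n_eff_frac=0.5, rng=None):
    """One filtering update of the position belief q(X_t).

    particles : (N, 2) candidate eye positions (arcmin)
    weights   : (N,) normalized importance weights
    log_lik   : maps an (N, 2) array of positions to log p(R_t | X_t, S_hat)
    """
    rng = rng or np.random.default_rng()
    n = len(particles)
    # Predict: propagate each particle under the diffusive random-walk prior.
    particles = particles + rng.normal(0.0, np.sqrt(2.0 * D_c * dt), particles.shape)
    # Update: reweight each particle by the likelihood of the new spikes.
    logw = np.log(weights + 1e-300) + log_lik(particles)
    weights = np.exp(logw - logw.max())
    weights /= weights.sum()
    # Resample when the effective sample size collapses (degeneracy control).
    if 1.0 / np.sum(weights ** 2) < n_eff_frac * n:
        idx = rng.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights

# Toy usage: a Gaussian stand-in for the spike log-likelihood, peaked at the origin.
rng = np.random.default_rng(0)
parts = rng.normal(0.0, 1.0, (200, 2))
w = np.full(200, 1.0 / 200)
parts, w = particle_filter_step(parts, w, lambda p: -0.5 * np.sum(p ** 2, axis=1), rng=rng)
```

In the full model the likelihood would come from cross-correlating the current pattern estimate with the incoming spikes, as described for Figure 1C.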
The log-prior, log p(A), is either proportional to the sum of absolute values of the elements of A (to encourage sparsity) or a quadratic function of A. The minimization of Equation 8 is done using the FISTA algorithm (Beck & Teboulle, 2009), a version of gradient descent modified to handle the situation where the expression to be minimized contains an L1 loss term. The basic computations required to compute the gradient are specified in Equations 50–56 of the Appendix. Figure 1D graphically illustrates the computation arising from the gradient of the second term of Equation 8, which updates A according to each new set of incoming spikes. This results in a "dynamic routing" circuit (Olshausen, Anderson, & Van Essen, 1993), in which RGC spikes R are routed into different elements of the internal shape estimate A via another set of units representing the internal position estimate X, which multiplicatively gate the RGC inputs. To summarize, the full algorithm computes three updates at each timestep. First, the internal estimate of eye position at time t is updated based on the current estimate of the stimulus pattern and the incoming spikes R_t (Equation 5). Second, a new estimate of the stimulus pattern (represented by latent factors A) is generated by minimizing Equation 8, which takes into account the new spikes and the updated estimate of eye position. Third, the estimate of the uncertainty of the latent factors, H, is updated (Equation 9).
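The FISTA minimization can be sketched for a generic sparse objective of the form 0.5*||y - D a||^2 + lam*||a||_1. This is a stand-in for Equation 8, which additionally carries the running quadratic term in H; the function names and parameters here are ours.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista(D, y, lam=0.1, n_iter=200):
    """FISTA (Beck & Teboulle, 2009) for 0.5*||y - D a||^2 + lam*||a||_1.

    Gradient steps on the smooth quadratic part, soft-thresholding for the
    L1 term, plus Nesterov-style momentum on the iterates.
    """
    L = np.linalg.norm(D, ord=2) ** 2          # Lipschitz constant of the gradient
    a = z = np.zeros(D.shape[1])
    t = 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ z - y)
        a_next = soft_threshold(z - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = a_next + (t - 1.0) / t_next * (a_next - a)
        a, t = a_next, t_next
    return a

# Usage: recover a sparse coefficient vector from a noiseless observation.
rng = np.random.default_rng(0)
D = rng.normal(size=(50, 20))
a_true = np.zeros(20); a_true[3], a_true[7] = 1.0, -2.0
a_hat = fista(D, D @ a_true, lam=0.01, n_iter=500)
```

With a small L1 weight the recovered coefficients land very close to the true sparse vector, which is the behavior the online pattern update relies on.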

Results

A moving retina averages out spatial inhomogeneities

Much like looking through a broken window, viewing the world through a stationary, inhomogeneous retina results in a belief about the world that is precise in some places and uncertain in others. The key idea of this work is that this detrimental, nonuniform uncertainty can be alleviated by the eye's natural drift movements. Our main result, shown in Figure 2, is that the signal generated by a moving retina, when properly processed by downstream neural circuitry that jointly estimates the eye's motion and the stimulus, results in a higher quality representation of the stimulus as compared with the signal generated by a stationary retina. Specifically, for a stimulus duration of 700 ms, our model achieves a 50% improvement in the average signal-to-noise ratio (SNR) when the retina drifts (average SNR = 5.9) as compared with when it is held stationary (average SNR = 3.9). SNR is computed as the power of the ground-truth signal divided by the squared error between the ground-truth pattern and the estimated pattern (see Appendix, section SNR, for details).
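The SNR metric as defined here can be computed directly. This is a one-line sketch; any masking or normalization details from the Appendix are not reproduced, and the function name is ours.

```python
import numpy as np

def snr(true_pattern, estimate):
    """Signal-to-noise ratio: ground-truth signal power divided by the
    squared error between the ground-truth and estimated patterns."""
    true_pattern = np.asarray(true_pattern, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    err = np.sum((true_pattern - estimate) ** 2)
    return np.sum(true_pattern ** 2) / err

# Example: an estimate at half the true amplitude.
val = snr(np.ones(4), 0.5 * np.ones(4))   # power 4 / error 1 -> 4.0
```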
Figure 2.

Benefits of motion for the discrimination of high-acuity targets: (A) Stimulus (S) to be recovered. The entire pattern is defined on a 20 × 20 pixel array subtending 8 arcmin. The width of each leg of the E is 2 pixels (0.8 arcmin). The cone lattice and eye trajectories are the same as in Figure 1A. (B) SNR of the reconstruction of the E as a function of time. The shaded region shows 95% confidence intervals of the mean over 40 trials. Either the stimulus is moved relative to the retina (S:M = Motion) or not (S:NM = No Motion). For each of these cases, the stimulus pattern is inferred using either the approximate EM algorithm (D:EM) or an optimal decoder assuming no motion (D:NM). Note that D:EM > D:NM even when there is no stimulus motion (S:NM), because the uncertainty over position implicitly smooths the pattern. The difference between the two best methods is statistically significant (S:M | D:EM > S:NM | D:EM with p = 0.002 at t = 700 ms). (C) Typical reconstructions of the pattern with and without motion after 700 ms. (D) Reconstruction over time in the case of motion using the EM algorithm. (E) Reconstruction over time in the case of motion assuming no motion. (F) Estimated versus true eye position as a function of time. The red curve shows the estimated horizontal and vertical eye position using the EM algorithm (width reflects ±1 standard deviation). The blue curve shows the true eye position. The timestep of the simulation is 1 ms.

The parameters of the stimulus, cone sampling lattice, and eye motion trajectories used in these simulations correspond directly with the experiments of Ratnam et al. (2017). The strokes of the E have a width of 0.8 arcmin, and the cone lattice has an average spacing of 1.09 arcmin. The diffusion constant used in the prior for inferring motion, P(X), is set to match that of the recorded eye motions.
Although the subject in the experiment is asked to report which of four orientations the E is in, our task requires estimating the entire shape. The prior used to infer the shape, P(S), uses a simple dictionary of non-overlapping square blocks of size 0.8 arcmin × 0.8 arcmin, with no sparsity imposed. Because the cones' Gaussian receptive fields have a full width at half maximum of only half the distance between the cones (Macleod et al., 1992), the strokes of the E can fall between the cones. In other words, even a retina with uniformly tiled cones has spatially nonuniform sensitivity to the diffraction-limited stimuli in the experiments of Ratnam et al. (2017). It is remarkable that both the mathematical model and human subjects can recover the stimulus given the gaps and irregularities in sensitivity in the retinal cone lattice (Harmening, Tuten, Roorda, & Sincich, 2014). In additional experiments, we examined how performance changes as a function of stimulus size (Figure 5, SI). When the stimulus is very small, there is no benefit from eye motion: the stimulus cannot be well decoded in either condition (static or moving) because the features are too small relative to the cone receptive field size. When the stimulus is sufficiently large that its features are large relative to the gaps between the cones, the stimulus is accurately estimated in both conditions; even though the SNR is higher with eye motion, there is effectively no perceptually noticeable gain because both reconstructions are near perfect. There is a nontrivial motion benefit only when the strokes of the E are on the order of the spacing between the cones. Varying the magnitude of the eye motion (gain) shows that the maximum benefit from eye motion is obtained for gains between 0.5 and 1.0. Performance drops off significantly for zero motion or for motion gains of roughly 1.5 and above.
Figure 5.

Extended Tuning Plots: (A) The SNR as a function of motion gain (n = 40 for each value of the motion gain). The experimentally measured eye trajectories are used, except that the overall position is multiplied by the gain factor. (B) The SNR as a function of stimulus size (n = 20). Both plots use the same parameters as in Figure 2, and the error bars in both plots show the standard error. (C–H) Example reconstructions for the stimulus size experiments with stroke width (w), and motion (S:M) or no motion (S:NM); the horizontal and vertical axes are in arcmin. (C, D) For small stimuli, the orientation of the stimulus is unrecognizable in both cases. (E, F) For stimuli with a stroke width on the order of the cone spacing, the orientation of the stimulus is barely recognizable. (G, H) For larger stimuli, the orientation of the stimulus is unambiguous despite a large difference in the SNR.

Beyond the punctate sensitivity of the cones, there are other sources of inhomogeneities in the retina that can compromise the accurate recovery of the luminance pattern of the retinal image, including variable cone gain factors (Li et al., 2014), different spectral sensitivities (Hofer, Carroll, Neitz, Neitz, & Williams, 2005), and disruptions in the cone mosaic caused by retinal degeneration (Duncan et al., 2007). Even in extreme cases, where retinal degeneration results in a fovea with 52% fewer cones than normal, patients still have normal visual acuity (Ratnam, Carroll, Porco, Duncan, & Roorda, 2013). Our model illustrates how these limitations can be compensated for by eye movements. Figure 3 shows the results of a simulation where a variable percentage of the cones are dropped out (besides the cone lattice, all other parameters are the same as the experiments in Figure 2). The quality of stimulus reconstruction enabled by a moving retina is dramatically improved over that with a stationary retina under conditions of cone loss.
Figure 3.

Motion benefit during cone loss. (A) Letter E stimulus sampled by a retinal cone lattice that has 30% of the cones dropped out randomly (cone loss, eye trajectories, and RGC spikes are resampled each trial). The same stimulus size, cone spacing, eye trajectories, and diffusion constant for inference were used as in Figure 2. (B) SNR at t = 700 ms as a function of cone loss for a moving and a stationary retina with n = 21 for each motion condition and cone loss value. The error bars correspond with plus or minus one standard error of the mean. (C and D) Examples of the reconstructed stimulus in the case of retinal drift motion and no motion for 30% cone loss.

Compounding the challenge of inferring spatial patterns defined by luminance, the visual system must additionally infer the spatial distribution of the color of objects (Sabesan, Schmidt, Tuten, & Roorda, 2016). The randomly placed cones tend to form clumps, and the three cone types vary widely in their proportions (Hofer et al., 2005), which raises the question of how the joint spatiochromatic structure of small objects can be correctly inferred. Drift motion may also play a role here by averaging color appearance as an object is swept over different spectral swaths of the retina, and this merits further investigation.

Inferring natural image patterns

To infer more complex spatial patterns such as would occur in natural scenes, it is desirable to use a richer prior p(S) to capture this structure. For this we turn to the sparse coding model of V1 (Olshausen & Field, 1997), which uses the generative model S = DA, where D is a dictionary of features learned from the statistics of natural images and A is a set of latent variables with a sparse prior p(A). The goal in this case is to infer the latent factors (or image features), A, rather than a pictorial description of the pattern, S, from the incoming spikes, R. The equations for inferring A given S are usually interpreted as describing the dynamics of a neural network where the elements of A correspond with the activations of cortical neurons that have "Gabor-like" receptive fields similar to neurons in V1 (given by the dictionary, D) (Rozell, Johnson, Baraniuk, & Olshausen, 2008). In this case, we infer A given only the spikes R, which change as patterns drift over the retina. The resulting Equations (5, 8) can be interpreted as describing the interactions between two separate populations of neurons that work together to jointly infer the eye position X and the latent factors A. The neurons representing the latent factors A will appear to have dynamic, Gabor-like receptive fields that track features as they drift across the retina rather than remaining locked in retinotopic coordinates. Our experiments simulating this model on whitened natural scene patches (whitening the stimulus serves as an approximation to the center-surround receptive field structure of RGCs) demonstrate that the sparse prior improves the inference of spatial patterns drawn from natural images (Figure 4).
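The sparse inference step can be sketched with plain ISTA, a simpler relative of the FISTA and LCA dynamics referenced in this work. The dictionary below is random rather than learned from natural scenes, so this is a minimal illustration of the sparse coding objective, not a reproduction of the model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal sketch of sparse inference for S = DA (dictionary assumed given;
# here it is random rather than learned from natural images).
m, k = 64, 128                       # pixels, dictionary elements
D = rng.normal(size=(m, k))
D /= np.linalg.norm(D, axis=0)       # unit-norm features

# Ground-truth sparse code and the image it generates
a_true = np.zeros(k)
a_true[rng.choice(k, 5, replace=False)] = rng.normal(size=5)
S = D @ a_true

def ista(S, D, lam=0.05, steps=500):
    """Iterative shrinkage-thresholding for min_A ||S - DA||^2/2 + lam*||A||_1."""
    L = np.linalg.norm(D, 2) ** 2    # Lipschitz constant of the gradient
    A = np.zeros(D.shape[1])
    for _ in range(steps):
        g = D.T @ (D @ A - S)        # gradient of the quadratic term
        A = A - g / L
        A = np.sign(A) * np.maximum(np.abs(A) - lam / L, 0.0)  # soft threshold
    return A

A = ista(S, D)
print("active coefficients:", int((np.abs(A) > 1e-3).sum()))
print("reconstruction error:", float(np.linalg.norm(D @ A - S)))
```

The soft-thresholding step is what gives the Gabor-like units their sparse activations; in the full model the quadratic data term is replaced by the Poisson spike likelihood averaged over the eye position estimate.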
Figure 4.

Neurons with structured receptive fields improve inference. (A) A whitened 32 × 32 pixel natural scene patch scaled to subtend a square with side length 24 arcmin is projected onto a simulated cone lattice with an average spacing of 1 arcmin. The retinal drift motion in this case is generated by a random walk. (B) SNR of the decoded image at t = 600 ms. RGC spikes are decoded using three pattern priors. The SNR is plotted relative to PCA averaged over 15 trials (different natural scene patches and eye trajectories). Error bars show 95% confidence intervals. The p-values are calculated between the uniform prior and PCA, and between the sparse coding prior and PCA (**** p < 0.0001; *** p < 0.001). (C) A random set of 25 elements from the learned sparse coding dictionary, D. Sparse coding seeks to describe any given image pattern as a sparse linear combination of these features. (D–F) Example reconstructed image patterns for each method after 600 ms. IND, independent pixel prior; PCA, Gaussian prior; SP, dictionary trained with sparse coding, with both L1 and L2 priors.

Extended tuning plots. (A) The SNR as a function of motion gain (n = 40 for each value of the motion gain). The experimentally measured eye trajectories are used, except that the overall position is multiplied by the gain factor. (B) The SNR as a function of stimulus size (n = 20). Both plots use the same parameters as in Figure 2. The error bars in both plots show the standard error. (C–H) Example reconstructions for the stimulus size experiments with stroke width (w), and motion (S:M) or no motion (S:NM). The horizontal and vertical axes are in arcmin. (C, D) For small stimuli, the orientation of the stimulus is unrecognizable in both cases. (E, F) For stimuli with a stroke width on the order of the cone spacing, the orientation of the stimulus is barely recognizable. (G, H) For larger stimuli, the orientation of the stimulus is unambiguous, despite a large difference in SNR between the two conditions.

Discussion

The drift motions that occur during fixation create a problem, but also an opportunity, for downstream neural circuits tasked with inferring the structure of high-acuity targets. The prior work of Burak et al. showed how the problem may be solved by a Bayesian decoder that factorizes the time-varying spikes arriving from the retina into separate representations of form and motion. Our contribution here is to take this work a step further to realize the opportunity provided by retinal drift to obtain a higher quality visual representation than would otherwise be available given the inhomogeneities of the retinal sampling lattice. The model proposed here should be seen as a first step to establish the basic neural computations that would need to occur for a causal, online system to perform approximate Bayesian inference that could account for the improvement in acuity observed in the experiments of Ratnam et al. There are obviously many important neurobiological elements missing from our model — the temporal filtering known to occur in RGCs, Magno versus Parvo streams, wavelength selectivity of cones and color-opponency of RGCs, and so on. For this first step, we sought to include the most important biophysical factors that make the recovery of fine spatial detail a challenge — namely, cone sampling properties and the spiking nature of neural activity, which requires temporal integration by neurons downstream. Further work is needed to realize a more accurate neurobiological implementation of the model to demonstrate its true feasibility, and in the meantime any conclusions from our results should be tempered accordingly. In this section, we discuss some of the considerations that arise in mapping different elements of our inference model onto neural circuits in the brain, as well as further modeling and experimental efforts suggested by this work.

Neural implementation

The update Equations (5 and 8) can be interpreted as describing the interactions between two separate populations of neurons — one representing hypotheses about eye position, X, and another representing the stimulus pattern, A. We hypothesize these two populations to reside in area V1. The incoming spikes R would be carried by the LGN afferents innervating layer 4 of V1 (assuming LGN to be a simple relay of RGCs). The neurons representing A would likely be those in layer 4, or possibly layers 2 and 3. The hypotheses about eye position X would be represented by a population of neurons corresponding to the particles supporting q(X). Such a scheme for neurally representing and updating probability distributions was proposed previously by Lee and Mumford (2003). Importantly, the neural representations of A and X are not computed independently from the input, but rather jointly by multiplicative interactions between the two populations. The neurons representing X essentially compute a cross-correlation between the spatial pattern of incoming spikes R and the current estimate of the pattern represented by A (Figure 1C). Conversely, the neurons representing A are computed (in part) by dynamically routing the incoming spikes R via multiplicative gating by neurons representing X (Figure 1D). The idea of dynamic routing (i.e., shifter circuits) was proposed more than 30 years ago as a model for stabilizing the cortical image representation in the face of retinal drift (Anderson & Van Essen, 1987). Here, rather than proposing a routing circuit a priori, the routing dynamics emerge from the principled objective of doing optimal (Bayesian) estimation of a moving spatial pattern using a log-linear Poisson observation model. To see this, consider the gradient of the second term in the cost function of Equation 8, the expected negative log-likelihood 〈−log p(R | X, S = DA)〉, which is minimized using the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA).
Ignoring the slight modifications from using FISTA instead of gradient descent, the update equation for the kth element of A is

ΔA_k ∝ Σ_j R_j 〈g_jk(X)〉 − Δt Σ_j 〈λ_j(X) g_jk(X)〉,   (10)

where R_j is the number of spikes arriving from RGC j in the time interval [t, t + Δt] and λ_j is the corresponding rate parameter of its Poisson distribution p(R|X, S = DA) (a full derivation is given in Appendix Equations 50–56). g_jk(X) corresponds to a dynamically controlled connection strength between RGC j and latent factor k that is determined by the eye position estimate X. 〈 · 〉 denotes averaging with respect to q(X). The first term of Equation 10 corresponds with a multiplicative gating of the incoming spikes by the internal position estimate. The second term is a homeostatic correction that corresponds with the expected number of incoming spikes given the internal estimate of the spatial pattern. The precise mathematical form of g_jk(X) is determined by the parameters of the spiking model and the receptive fields of the latent factors (Appendix Equation 56). An interesting future direction will be to reformulate the model to directly estimate motion rather than position, and to use this to update the pattern estimate. The model currently assumes RGCs with receptive fields that are static in time, and the inference model effectively updates its position estimate via spatial cross-correlation between the current pattern estimate and the image features. Alternatively, a shift signal relative to the current position could be estimated via spatiotemporal correlation and then used to dynamically route spikes into the latent representation of shape. Reformulating the model in this way could allow for a more direct correspondence with the temporal filtering and direction-selective properties of RGCs and V1 neurons, respectively.
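In schematic form, the gated update can be written as follows (toy sizes; the gating function here is a hypothetical stand-in for the form derived in the Appendix, and particles with weights represent q(X)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Schematic sketch of the gated update: spikes R_j are routed into latent
# factors A_k through position-dependent weights g_jk(x), averaged over a
# particle representation of q(X), minus a homeostatic correction.
n_rgc, n_latent, n_particles = 32, 8, 16
base = rng.normal(size=(n_rgc, n_latent)) * 0.1       # fixed projection
positions = rng.integers(0, 4, size=n_particles)      # particle eye positions
weights = np.full(n_particles, 1.0 / n_particles)     # q(X) particle weights

def gating(x):
    """Toy position-dependent connection strengths g_jk(x): the fixed
    projection circularly shifted along the RGC axis by eye position x."""
    return np.roll(base, x, axis=0)

def update_A(A, R, dt=0.01, eta=0.1):
    """One ascent step on <log p(R | X, S = DA)> under q(X): incoming
    spikes gated by the position estimate, minus the expected spike count."""
    dA = np.zeros_like(A)
    for x, w in zip(positions, weights):
        G = gating(x)                                 # (n_rgc, n_latent)
        lam = np.exp(G @ A)                           # log-linear Poisson rates
        dA += w * (G.T @ (R - lam * dt))              # gated spikes - expectation
    return A + eta * dA

A = np.zeros(n_latent)
R = rng.poisson(1.0, size=n_rgc).astype(float)        # spike counts in [t, t+dt]
A = update_A(A, R)
print("updated latent vector is finite:", bool(np.isfinite(A).all()))
```

The loop over particles makes the multiplicative structure explicit: each hypothesized eye position routes the same spikes differently, and the pattern estimate is the weighted combination of these routings.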

Neurobiological implications and questions

A key question that arises from this model is whether neurons in the foveal region of V1 form a locally stabilized, object-centered representation or a dynamically changing representation that moves in the presence of fixational drift. This question has been investigated previously, with conflicting results from different laboratories (Motter & Poggio, 1990; Motter, 1995; Gur & Snodderly, 1997). In the case of microsaccades, Meirovithz, Ayzenshtat, Werner-Reiss, Shamir, and Slovin (2011) observe that a local population response evoked by a small stimulus is shifted over the V1 retinotopic map after each microsaccade. Recent experimental work on mapping the receptive fields of V1 neurons while compensating for eye motion is a promising approach to resolve this question (McFarland, Bondy, Cumming, & Butts, 2014). Another promising direction is to use an adaptive optics scanning laser ophthalmoscope with targeted stimulus delivery combined with V1 electrophysiological measurements (or two-photon imaging) to study V1 activity in response to motion-controlled stimuli presented to the fovea (Sincich, Zhang, Tiruveedhula, Horton, & Roorda, 2009). It should be noted that, although our model recovers an explicit stabilized representation of the object, it is also possible that these computations could be done in a nonstabilized representation that still integrates information efficiently (Appendix: Alternative Representations). It is also possible that visual cortex contains instances of both types of cells. In addition to neurons sensitive to the shape of the stimulus, we also predict that there is a population of neurons in visual cortex that tracks the position of the eye to high precision. Although the computations to integrate information in our simulations are handled by a particle filter, the same computations could be executed by an integrator circuit that tracks the position of the left and right eyes.
In line with this, Snodderly, Kagan, and Gur (2001) find that some V1 cells respond differentially to drift and microsaccades (e.g., tuned to one or the other, or a combination). More generally, there is good reason to believe that the neural computations associated with the fovea are fundamentally different from those in the periphery. Fixational drift is large relative to the receptive field sizes of RGCs in the fovea (but not in the periphery), and there is an additional factor of cortical amplification in the fovea: there are four times more LGN cells per RGC and 10 times more striate cells per LGN projection in the fovea than in the periphery (Connolly & Van Essen, 1984). How exactly this over-representation is used for high-acuity vision merits further attention.
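The integrator alternative mentioned above can be sketched in a few lines (our illustration, not a biophysical circuit model): a position estimate is maintained by accumulating velocity signals over time.

```python
# Minimal sketch of a neural integrator tracking eye position by
# accumulating velocity signals (units and values are illustrative).
def integrate(velocities, dt=0.001, leak=0.0):
    """Accumulate velocity inputs (arcmin/s) into a position estimate."""
    x, trace = 0.0, []
    for v in velocities:
        x = (1.0 - leak) * x + v * dt   # leak = 0 gives a perfect integrator
        trace.append(x)
    return trace

# A constant drift of 30 arcmin/s over 100 ms accumulates to ~3 arcmin.
trace = integrate([30.0] * 100)
print(round(trace[-1], 2))              # -> 3.0
```

A nonzero leak term would model the gradual decay of such an estimate, which could be probed experimentally by comparing position tracking over short versus long fixations.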

Future directions

Beyond understanding the neural computations associated with the fovea, there are many important ways to extend the model and the associated experiments. First, more work needs to be done to understand the way in which spike history dependencies in RGC firing contribute to the perception of high-acuity stimuli. On one hand, the spatial pattern is moving so fast that there may be an effect akin to motion blur, where there is less spatial information content available in the RGC spikes owing to the duration of the temporal integration of light on a particular RGC. On the other hand, the temporal filtering may serve as a preprocessing step that whitens the stimulus and decreases the impact of RGC noise on the final estimate of the stimulus. Regardless, new methods of approximate Bayesian inference need to be developed to extend the inference model to the case where the RGCs have spike history dependencies. Second, there are many unanswered questions about our ability to infer the spatial color profile of small objects. For instance, how is our ability to infer color impacted by the nonuniform placement of the different cone types? Furthermore, to what extent does the natural motion of the retina help to alleviate these nonuniformities? Finally, although our psychophysical experiments and mathematical model probe the inner workings of our retinal circuitry, more work toward understanding simultaneous estimation of form and motion given high-acuity stimuli presented without adaptive optics is warranted.

Conclusions

The role of eye movements in visual perception is an important and long-studied problem. We use psychophysical experiments and mathematical modeling to identify a novel principle by which one can understand the benefits of drift eye movements for the perception of high-acuity targets: eye movements carry the stimulus across the retina to acquire a higher acuity representation of the spatial structure in the world than would otherwise be possible owing to inhomogeneities in retinal sampling. This principle has far-reaching consequences, both for understanding biological sensory systems and for the design of new sensors. From the biological side, this principle informs future experiments on the high-acuity perception of color and on active perception for vision and other sensory modalities. From the technological side, the algorithms developed in this work motivate the design of imaging systems that exploit (rather than avoid) image motion in order to infer high-quality images from cheap non-uniform or noisy sensors.
References (10 of 39 shown)

1.  Fast and robust multiframe super resolution.

Authors:  Sina Farsiu; M Dirk Robinson; Michael Elad; Peyman Milanfar
Journal:  IEEE Trans Image Process       Date:  2004-10       Impact factor: 10.856

2.  Dynamic stabilization of receptive fields of cortical neurons (VI) during fixation of gaze in the macaque.

Authors:  B C Motter; G F Poggio
Journal:  Exp Brain Res       Date:  1990       Impact factor: 1.972

3.  Miniature eye movements enhance fine spatial detail.

Authors:  Michele Rucci; Ramon Iovin; Martina Poletti; Fabrizio Santini
Journal:  Nature       Date:  2007-06-14       Impact factor: 49.962

4.  Adaptive optics scanning laser ophthalmoscopy.

Authors:  Austin Roorda; Fernando Romero-Borja; William Donnelly III; Hope Queener; Thomas Hebert; Melanie Campbell
Journal:  Opt Express       Date:  2002-05-06       Impact factor: 3.894

5.  How the unstable eye sees a stable and moving world.

Authors:  David W Arathorn; Scott B Stevenson; Qiang Yang; Pavan Tiruveedhula; Austin Roorda
Journal:  J Vis       Date:  2013-08-29       Impact factor: 2.240

6.  Spatiotemporal effects of microsaccades on population activity in the visual cortex of monkeys during fixation.

Authors:  Elhanan Meirovithz; Inbal Ayzenshtat; Uri Werner-Reiss; Itay Shamir; Hamutal Slovin
Journal:  Cereb Cortex       Date:  2011-06-07       Impact factor: 5.357

7.  Visual receptive fields of neurons in primary visual cortex (V1) move in space with the eye movements of fixation.

Authors:  M Gur; D M Snodderly
Journal:  Vision Res       Date:  1997-02       Impact factor: 1.886

8.  Temporal Coding of Visual Space.

Authors:  Michele Rucci; Ehud Ahissar; David Burr
Journal:  Trends Cogn Sci       Date:  2018-10       Impact factor: 20.229

9.  High-resolution imaging with adaptive optics in patients with inherited retinal degeneration.

Authors:  Jacque L Duncan; Yuhua Zhang; Jarel Gandhi; Chiaki Nakanishi; Mohammad Othman; Kari E H Branham; Anand Swaroop; Austin Roorda
Journal:  Invest Ophthalmol Vis Sci       Date:  2007-07       Impact factor: 4.799

10.  The elementary representation of spatial and color vision in the human retina.

Authors:  Ramkumar Sabesan; Brian P Schmidt; William S Tuten; Austin Roorda
Journal:  Sci Adv       Date:  2016-09-14       Impact factor: 14.136

