Is there a serial bottleneck in visual object recognition?

Dina V Popovkina^1,2, John Palmer^1,3, Cathleen M Moore^4,5, Geoffrey M Boynton^1,6.

Abstract

Divided attention has little effect for simple tasks, such as luminance detection, but it has large effects for complex tasks, such as semantic categorization of masked words. Here, we asked whether the semantic categorization of visual objects shows divided attention effects as large as those observed for words, or as small as those observed for simple feature judgments. Using a dual-task paradigm with nameable object stimuli, performance was compared with the predictions of serial and parallel models. At the extreme, parallel processes with unlimited capacity predict no effect of divided attention; alternatively, an all-or-none serial process makes two predictions: a large divided attention effect (lower accuracy for dual-task trials, compared to single-task trials) and a negative response correlation in dual-task trials (a given response is more likely to be incorrect when the response about the other stimulus is correct). These predictions were tested in two experiments examining object judgments. In both experiments, there was a large divided attention effect and a small negative correlation in responses. The magnitude of these effects was larger than for simple features, but smaller than for words. These effects were consistent with serial models, and rule out some but not all parallel models. More broadly, the results help establish one of the first examples of likely serial processing in perception.

Year: 2021; PMID: 33704373; PMCID: PMC7961120; DOI: 10.1167/jov.21.3.15

Source DB: PubMed; Journal: J Vis; ISSN: 1534-7362; Impact factor: 2.240

Introduction

Visual tasks can produce a variety of divided attention effects for making multiple judgments, ranging from little or no effect, to large effects. One way to measure divided attention effects is to use dual tasks, in which participants perform the same task in two locations. Some simple judgments, such as detecting luminance increments, can be performed in two locations as well as in one, as if there are independent parallel processes for each location (Bonnel, Stein, & Bertucci, 1992). Additional evidence of independent parallel processing is found in experiments on summary statistics (Attarha, Moore, & Vecera, 2014; Sun, Chubb, Wright, & Sperling, 2016). Other judgments, such as semantic categorization of masked words, have large divided attention effects, as if they can be carried out in only one location at a time (White, Palmer, & Boynton, 2018). Our question is whether processing of multiple visual objects is subject to large divided attention effects, like words, or little or no divided attention effects, like simple features. Specifically, we consider semantic categorization of nameable visual objects. The anatomic organization of the early stages of the visual system is capable of parallel processing. For example, early cortical areas have retinotopic organization, and stimuli presented in the left and right visual hemifields are processed by primary visual cortices in opposite hemispheres. This separation can support parallel processing of visual input. Indeed, there is behavioral and physiological evidence consistent with parallel processing of multiple stimuli for simple judgments, such as contrast detection (Scharff, Palmer, & Moore, 2011; Chen & Seidemann, 2012; White, Runeson, Palmer, Ernst, & Boynton, 2017). For judgments of simple features, there may be little or no effect of divided attention if the information about each stimulus can be represented by a distinct neuronal subpopulation in early retinotopic areas. 
In contrast, semantic categorization of simultaneously presented masked words results in a severe processing limit, as if participants can recognize only one word at a time (White, Palmer, & Boynton, 2018; White, Palmer, & Boynton, 2020). This large effect of divided attention is consistent with a serial processing bottleneck. Indeed, processing of words appears to be mediated by the visual word form area, which is less retinotopically organized (Rauschecker, Bowen, Parvizi, & Wandell, 2012; Le, Witthoft, Ben-Shachar, & Wandell, 2017). Because this higher-level visual area might be unable to represent two words at once, this could explain the serial processing for reading words (White, Palmer, Boynton, & Yeatman, 2019). Might judgments of multiple visual objects be similarly constrained?

Previous studies of object perception

In an early study, Biederman and colleagues (Biederman, Blickle, Teitelbaum, & Klatsky, 1988) examined whether object perception is limited by divided attention. The stimuli were line drawings of objects associated with basic-level category names (e.g. “traffic light” or “file cabinet”). The task was to search for a specific object among a display of one to six objects. Response time to find the target increased with the addition of more distractor objects to the display, suggesting that perceptual processing of objects is limited in some way. However, such response time set-size effects do not clearly distinguish serial and parallel processing, because for this measurement parallel processes can mimic serial processing (Townsend, 1971; Townsend, 1990; Palmer, Verghese, & Pavel, 2000). Potter and Fox (2009) used the simultaneous-sequential paradigm (Shiffrin & Gardner, 1972) to measure how object categorization is limited by divided attention. Stimuli were pictures of objects in scenes with associated verbal descriptions, presented as rapid serial visual presentation (RSVP) sequences of eight displays with one to four stimuli simultaneously displayed. The task was to search for the presence of a picture matching a target verbal description (e.g. “balloons” or “cut up fruit”). The target picture could either appear together with one or more distractors in the same presentation interval (simultaneous), or in the other presentation interval (sequential). The key comparison was between simultaneous and sequential presentations. Performance was worse for targets presented simultaneously with one or more distractors, compared with performance for targets and distractors presented sequentially. This sequential advantage is consistent with limited-capacity processing in perception. 
Although Potter and Fox showed that object processing is limited in capacity, they did not distinguish whether it was serial or parallel, because both serial and limited-capacity parallel processes produce similar results in these experiments. Scharff and colleagues (Scharff, Palmer, & Moore, 2011) also examined whether object processing has limited capacity using the simultaneous-sequential paradigm. The stimuli were pictures of animals that were members of categories (e.g. fox or deer). The task was to determine which of two categories of animal was present in a multi-object display. For example, the target might be a fox or a deer presented among distractors that were a squirrel and a moose. Scharff and colleagues used simpler displays than the relatively long RSVP sequences used by Potter and Fox (2009). The simultaneous condition presented four objects on the display at a time, whereas the sequential condition presented two objects on the display at a time over two intervals. The results showed a sequential advantage in this task, consistent with limited-capacity perceptual processing. However, this study was also unable to distinguish between a serial process and a limited-capacity parallel process, because both could account for the sequential advantage.

Alternative hypotheses

Our focus is the hypothesis that the perceptual processing of objects, like words, is severely limited under divided attention. For example, if there are large divided attention effects for the semantic categorization of nameable objects that are similar to those observed for words, there might be a serial bottleneck in object processing. In that case, one possibility is that words and objects share serial processes beyond retinotopic cortex. Such a bottleneck might restrict the extraction of the meaning of the word or object (Broadbent, 1958). Another possibility is that a serial bottleneck might constrain a certain type of higher-level process that is similar for objects and words, but need not be the same. For example, the bottleneck may constrain the formation of an object representation (Kahneman, Treisman, & Gibbs, 1992). An alternative possibility is the hypothesis that object judgments depend on only parallel processes, either similar to simple feature judgments or intermediate between results for features and words. This predicts that object judgments are not subject to large divided attention effects. This idea is consistent with the argument that some visual processing, such as the discounting of distractors, can occur in parallel for multiple stimuli, and limitations in processing constrain only judgments of multiple simultaneous targets (Duncan, 1980). According to this hypothesis, the decreased performance during divided attention tasks is due to limitations beyond perception, such as in sensorimotor processes, which link percepts to actions (Allport, 1987). This view stands in contrast to a serial bottleneck in object perception. As described above, although there is consistent evidence that object tasks do show effects of divided attention, unlike some simple feature tasks that show no such effects, this evidence is ambiguous with respect to whether object tasks show divided attention effects as large as those caused by a serial bottleneck. 
Specifically, the studies above present evidence only for limited capacity, which is consistent with either serial or parallel processes. Our experimental approach follows the studies by White and colleagues that have uncovered evidence for a serial bottleneck for word categorization. These studies take advantage of a dual-task paradigm that can distinguish between specific serial and parallel models.

Benchmark models of perceptual dual tasks

Here, we consider three models of serial and parallel processes as benchmarks for judging experimental evidence in favor of or against the existence of a serial processing bottleneck. These models were implemented as in White et al. (2020).

The independent parallel model allows for two stimuli to be processed independently (with unlimited capacity). Because processing is not affected by the number of objects to be processed, this model predicts no divided attention effect: judging two stimuli will be as accurate as judging one. This prediction has been satisfied for detecting simple features such as luminance increments (Graham, Kramer, & Haber, 1985; Bonnel, Stein, & Bertucci, 1992).

The fixed-capacity parallel model is a special case of a limited-capacity parallel model. It assumes that parallel processing is limited such that the total amount of information obtained from a display is constant (Taylor, Lindsay, & Forbes, 1967). Hence the name: “fixed capacity.” One way to implement such a constraint is to use the metaphor of statistical sampling as done in Shaw's (1980) sample size model. This theory starts with a signal detection theory framework in which the quality of a percept, and therefore the probability of its detection, is a function of a single random variable. In particular, the quality of a percept is assumed to correspond to the variability of estimates based upon a set of samples of the underlying random variable. When one object is relevant, all of the available samples can be directed to this one object. When there are two objects, the samples must be shared among the objects. Consequently, each object is sampled less often, and the quality of the percept per object is lower. For the case of equally sampling two objects instead of one object, the standard deviation of the mean of the samples increases by a factor of the square root of two, so sensitivity in d’ units falls by the same factor. This prediction has been satisfied for discriminating some simple features (Miller & Bonnel, 1994) and for some simple visual memory tasks (Smith, Lilburn, Corbett, Sewell, & Kyllingsbaek, 2016).

The all-or-none serial model represents a processing bottleneck, which allows for only one stimulus to be processed at a time. For this “all-or-none” model, we also assume that there is no time to process a second stimulus: no switching of a single serial process between two stimuli. The model thus predicts the largest effect of divided attention. In addition, because only one stimulus out of two is processed, there is a negative correlation in the accuracy of the two responses: correct responses for one stimulus co-occur with incorrect (or chance) responses for the other. These predictions have been satisfied for letter-digit tasks with conflicting S-R mapping (Sperling & Melchner, 1978), for certain multiple object dual tasks (Bonnel & Prinzmetal, 1998), and for masked words (White et al., 2018; White et al., 2020). In summary, the all-or-none serial model predicts both a large magnitude effect of divided attention and a negative correlation between dual-task responses.

Effects of intermediate magnitudes can be predicted with generalizations of each model. For example, a fixed-capacity parallel process can predict a larger dual-task deficit when target detection uses discretized states than when target detection uses continuous information (Swagman, Province, & Rouder, 2015; and see Appendix). Similarly, a serial process can produce a smaller dual-task deficit if there is enough time within one trial to complete processing one stimulus and switch to processing a second stimulus (White et al., 2020). To interpret our findings, we consider these generalized versions of these models alongside the three benchmark models.
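
The three benchmark predictions can be illustrated numerically. The Python sketch below is not the implementation of White et al. (2020). It rests on two assumptions that are not stated in the text: an equal-variance Gaussian signal detection model, under which area under the ROC equals Φ(d’/√2), and a simple mixture for the all-or-none serial observer, who processes a given location on half of the trials and guesses (50%) otherwise. All function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)
_SQRT2 = 2 ** 0.5


def area_to_dprime(area):
    """Convert area under the ROC (between 0.5 and 1) to d'.

    Assumption: equal-variance Gaussian signal detection theory.
    """
    return _SQRT2 * _STD_NORMAL.inv_cdf(area)


def dprime_to_area(dprime):
    """Convert d' back to area under the ROC under the same assumption."""
    return _STD_NORMAL.cdf(dprime / _SQRT2)


def independent_parallel(single_area):
    """Unlimited capacity: dual-task accuracy equals single-task accuracy."""
    return single_area


def fixed_capacity_parallel(single_area):
    """Sample-size model with equal sharing: d' falls by a factor of sqrt(2)."""
    return dprime_to_area(area_to_dprime(single_area) / _SQRT2)


def all_or_none_serial(single_area, p_processed=0.5):
    """A location is processed with probability p_processed, else guessed (0.5).

    Assumption: accuracy is a simple mixture of the two cases.
    """
    return p_processed * single_area + (1 - p_processed) * 0.5


if __name__ == "__main__":
    single = 0.821  # single-task accuracy reported for Experiment 1
    for model in (independent_parallel, fixed_capacity_parallel, all_or_none_serial):
        print(f"{model.__name__}: {100 * model(single):.1f}%")
```

Under these assumptions, a single-task accuracy of 82.1% gives a fixed-capacity value of roughly 74% and an all-or-none serial value of roughly 66% for equal allocation to the two locations.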

Overview of experiments

In this article, we ask whether semantic categorization of visual objects shows large divided attention effects, consistent with that predicted by a serial bottleneck. The observed accuracy was compared with predictions of the three benchmark models to interpret the effects of divided attention. Critically, we use brief stimulus presentations and masking to minimize the opportunity for the switching of any serial processes. Without time constraints, a serial process could completely process a stimulus in one location, and start processing a stimulus in another location, reducing the observed effect of divided attention. This brief timing is implemented in two ways: in Experiment 1, multiple stimuli were shown using RSVP; in Experiment 2, single stimuli were shown with pre- and post-masks.

Experiment 1: RSVP

In the first experiment, stimuli were presented using brief durations and RSVP to limit the time available to process stimuli, and thus help distinguish serial and parallel processing predictions (Forster, 1970; Potter & Hagmann, 2015; Robinson, Grootswagers, & Carlson, 2019). The task was semantic object categorization, similar to the task used with words by White and colleagues (White et al., 2018).

Methods

Participants

For Experiment 1, 12 paid participants (6 men and 6 women) were recruited from the University of Washington and greater Seattle community; author D.V.P. was one of the participants for Experiment 1. Participants had normal or corrected-to-normal visual acuity. All participants gave written and informed consent in accordance with the Declaration of Helsinki and the human subjects Institutional Review Board at the University of Washington.

Apparatus and eyetracking

Stimuli were presented on a linearized CRT monitor (Sony GDM-FW900) with a resolution of 1024 × 640 pixels and a 120 Hz refresh rate. The monitor was viewed from a 60 cm distance and had a peak luminance of 90 cd/m². Presentation of stimuli was controlled using MATLAB (MathWorks, Natick, MA) and the Psychophysics Toolbox (Brainard, 1997). An Eyelink 1000 (SR Research, Ontario, Canada) and the Eyelink Toolbox (Cornelissen, Peters, & Palmer, 2002) were used to monitor and enforce fixation during the experiment. A trial was terminated if the participant blinked or moved their eyes outside of a 2 degree window while stimuli were present on the screen. On average over all experiments, 0.7% ± 0.1% of trials were terminated due to blinks or apparent eye movements.

Stimuli

The stimuli were photographs of nameable objects removed from the background context. Stimuli were hand-selected from an internet image search and from the Massive Memory Object Categories image set (Konkle, Brady, Alvarez, & Oliva, 2010). Each image was adjusted to maximize contrast and remove color, and was resized to a 100 pixel × 100 pixel square (4.2 degrees × 4.2 degrees). Stimuli were from eight categories: plants, food, clothing, animals, furniture, household devices, transport, and musical instruments. Two judges confirmed that all examples were easy to identify and clearly belonged to the assigned category, and not the other categories. Each category had 50 exemplar objects; Figure 1 shows the 50 objects in the category “animals.” With eight categories, the stimulus set had a total of 400 objects.

Figure 1.

Example stimuli used in both experiments. All images from the category “animals” are shown.

Procedure

Figure 2 shows a schematic of the RSVP task, which was similar to Experiment 1 in White et al. (2018). On each trial, the participants saw a category word, followed by briefly presented visual objects, and a response prompt. Participants reported with a button press whether an object from the target category had appeared in the cued location. For example, for the trial in Figure 2, participants were looking for food objects and a target object (bread) was presented in the top location. The relevant location(s) were cued both before presentation and during the response prompt. Red and blue colored lines were used as cues, with one color assigned as a relevant cue for each participant and the other color serving as the irrelevant cue. The assignment was balanced such that for half of the participants, the relevant cue was red, and for the other half, the relevant cue was blue (in Figure 2, the relevant cue is red).

Figure 2.

Rapid serial visual presentation (RSVP) procedure in Experiment 1. Trial sequences for the single task (A, top location cued) and dual task (B, both locations cued) are shown. Ellipses indicate more intervals of the same duration, for a total of seven intervals containing objects and six intervening blank intervals. In this example, the observer cue color is red. Mean stimulus and blank ISI durations shown; these were adjusted separately for each observer to produce approximately 80% accuracy in the single task.

Stimuli were presented above and below a 0.5 degree fixation cross (top-side and bottom-side, respectively) and were centered at 4 degrees away from fixation. In Experiment 1, stimuli were presented as an RSVP sequence. Figure 2 shows a schematic of an example trial sequence, with each box representing a time interval. The RSVP sequence contained seven object presentations separated by equal duration intervals with a blank screen (only 3 object presentations are shown in the figure). The first and last object presentations never contained an object from the target category (serving as pre- and post-masks). In the second to sixth intervals, one object from the target category could appear amid a stream of objects from other categories. Over the course of the entire sequence, there was a 50% chance of a target object appearing within the stream at a given stimulus location. This probability was independent for the two locations: that is, the presence of a target object in one location gives no information about the presence of a target object in the other location. The only dependency between locations was that in trials with a target present in both locations, the targets appeared in the same interval to make switching ineffective. All other stimuli, including masks, were randomly chosen from nontarget categories. The post-mask stayed on the screen for 700 ms, at which time a brief tone accompanied the response prompt. The post-mask was replaced by a blank as soon as there was a response.
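
The constraints on the RSVP streams described above can be sketched in Python. This is an illustration of the stated rules only, not the authors' MATLAB code: it draws category labels rather than individual images, it assumes the target interval is chosen uniformly from the second to sixth intervals (the text does not give the distribution), and the function and variable names are hypothetical.

```python
import random

N_INTERVALS = 7                       # object presentations per stream
TARGET_INTERVALS = list(range(1, 6))  # zero-based second to sixth intervals


def make_trial(rng, target_category, other_categories):
    """Return (streams, present) for one trial.

    streams maps "top"/"bottom" to a list of seven category labels;
    present maps each location to whether it contains a target.
    """
    # Target presence is drawn independently for each location (50% each).
    present = {loc: rng.random() < 0.5 for loc in ("top", "bottom")}
    # One interval is drawn per trial, so two targets always coincide in time.
    # Assumption: uniform choice among the five eligible intervals.
    target_interval = rng.choice(TARGET_INTERVALS)
    streams = {}
    for loc in ("top", "bottom"):
        # Every slot starts as a nontarget; the first and last act as masks.
        stream = [rng.choice(other_categories) for _ in range(N_INTERVALS)]
        if present[loc]:
            stream[target_interval] = target_category
        streams[loc] = stream
    return streams, present


if __name__ == "__main__":
    categories = ["plants", "food", "clothing", "animals", "furniture",
                  "household devices", "transport", "musical instruments"]
    target = "food"
    others = [c for c in categories if c != target]
    streams, present = make_trial(random.Random(0), target, others)
    print(present)
    print(streams)
```

Drawing a single target interval per trial is one simple way to guarantee that two targets appear at the same time, which is the property the procedure relies on to make switching ineffective.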

Conditions

Stimuli were presented in three different conditions, which were blocked.

In the single-task condition, there was a single task to perform on each trial. Objects were presented in two locations, but only one location was relevant. A label at the beginning of a block indicated whether the relevant location was on the top or bottom side of the display; the relevant location stayed the same for the duration of the block. Participants judged the object in the cued location only.

In the dual-task condition, there were two tasks to perform on each trial. Again, objects were presented in two locations, but both locations were relevant and participants judged the objects separately for each location. If a target object was present in both locations, the target objects were shown at the same time in the sequence to make switching strategies ineffective. The order of testing the two locations was randomized.

In the control single-stimulus condition, there was a single task to perform on each trial. Participants saw an object in only one location and judged the object in that location. The relevant location stayed the same for the duration of the block. The only difference from the single-task condition was the absence of the irrelevant stimulus. This condition was included to check for crowding and similar interference effects.

Timing

Before the main experiment, the RSVP timing was adjusted for each participant to achieve approximately 80% accuracy in the single-task condition by manipulating the duration of the stimulus and blank intervals. The stimulus and interval durations were always identical and adjusted together. The mean stimulus and interval duration across 12 participants was 42 ms (individual timings ranged from 33 ms to 58 ms), and the resulting mean accuracy was 82% ± 1% in the single-task condition. For the main experiment, the same customized timing was used in all conditions. It is possible for factors other than timing to affect accuracy: for example, stimuli could be inherently difficult to discriminate, or the RSVP paradigm could be limited by a memory requirement. In a control experiment, we verified that timing was a primary factor limiting accuracy by increasing all interval durations to 150 ms (for 2 early participants, the duration was 100 ms and 125 ms, respectively). Over 128 trials in the single-task condition, average accuracy was 93.8% ± 1.4% (n = 12), only about 6% below perfect. Thus, other phenomena, such as discrimination difficulty and memory processes, limited accuracy only slightly.

Responses

Participants made unspeeded responses using one of four buttons. They reported “yes” / “no” answers to the core question: “on the prompted side, did any object belong to the target category?” Participants also gave a confidence rating (“likely” or “guess”) associated with their report. Specifically, the four buttons represented the following responses: “likely no,” “guess no,” “guess yes,” and “likely yes.” Button layout was horizontal, orthogonal to the vertical stimulus layout. Responses about the top location were arranged along the top row of a keypad; responses about the bottom location were arranged along the bottom row of a second keypad. After the response, feedback was given in the form of a high- or low-frequency tone for correct and incorrect responses, respectively. Feedback for the responses in the dual-task condition was provided only after both responses were given.

Design

The experiment was carried out in sessions of six blocks of 16 trials: two blocks of dual task; two blocks of single task, one cued to the top location and one cued to the bottom location; two blocks of single stimulus, one cued to the top location and one cued to the bottom location. Trials within each block had the same target category and the order of blocks was randomized for each session. Each session took about 15 minutes to complete. A complete experiment included at least 38 sessions for a total of at least 1188 trials per task condition.

Analysis

Accuracy was measured as the percentage of area under the receiver operating characteristic (ROC). This metric has properties similar to two traditional accuracy measures: like percent correct, it is bounded by 50% (chance accuracy) and 100% (perfect accuracy); and like d’, it is an unbiased measure of accuracy. The ROC curves were constructed using the confidence ratings reported by the participants. All accuracy results are reported as mean ± standard error of the mean. For significance testing, all alpha levels were set to 0.05 and all t-tests were two-tailed.
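
As an illustration of this measure, the Python sketch below computes the area under an empirical ROC from four-level confidence ratings using the trapezoidal rule. The text does not say whether the authors used the trapezoidal rule or a fitted ROC, so this is one common choice rather than a description of their analysis. The rating coding (1 = “likely no” through 4 = “likely yes”), the function name, and the example data are assumptions.

```python
def roc_area(target_ratings, nontarget_ratings, n_levels=4):
    """Area under the empirical ROC from confidence ratings (trapezoidal rule).

    Ratings are integers from 1 ("likely no") to n_levels ("likely yes").
    """
    n_target = len(target_ratings)
    n_nontarget = len(nontarget_ratings)
    points = [(0.0, 0.0)]
    # Sweep the criterion from strict to lenient: "yes" means rating >= criterion.
    for criterion in range(n_levels, 0, -1):
        hit_rate = sum(r >= criterion for r in target_ratings) / n_target
        fa_rate = sum(r >= criterion for r in nontarget_ratings) / n_nontarget
        points.append((fa_rate, hit_rate))
    # The most lenient criterion always yields the point (1, 1).
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between adjacent points
    return area


if __name__ == "__main__":
    # Hypothetical ratings, for illustration only.
    target_present = [4, 4, 3, 2, 4, 1]
    target_absent = [1, 1, 2, 3, 1, 4]
    print(f"area under ROC: {100 * roc_area(target_present, target_absent):.1f}%")
```

Identical rating distributions for target-present and target-absent trials give an area of 50%, and perfectly separated ratings give 100%, matching the bounds described above.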

Number of participants

To determine the appropriate sample size (number of participants), we examined data from four previous dual-task experiments using RSVP and masked word stimuli (White et al., 2018; White et al., 2020). In each, participants (n = 10) performed judgments of words using methods similar to those of the current study. A power analysis was conducted to determine the sample size needed to distinguish the predictions of the fixed-capacity parallel model and the all-or-none serial model. This was done for the dual-task deficit and a conditional accuracy measure of response correlation. Our calculations assumed alpha and beta errors of 0.05 (power of 95%). The estimated minimum sample size was five for the dual-task deficit and eight for the conditional accuracy measure. To be conservative, we used a minimum sample size of 10. In practice, we collected data from a few additional participants, for a total of 12 participants in Experiment 1 and 11 participants in Experiment 2.

Main results

Dual-task deficit

Accuracy in the semantic categorization task was worse for categorizing two objects (dual-task accuracy: 70.0% ± 0.8%) compared with categorizing one object (single-task accuracy: 82.1% ± 1.1%). The difference is the dual-task deficit: 12.1% ± 1.0% (significantly different from zero, t(11) = 12.3, p << 0.001). Figure 3 shows average accuracy in the form of an attention operating characteristic (Sperling & Melchner, 1978). Accuracy (measured as area under ROC, see Methods section) for the task in the top location (y-axis) is plotted against accuracy for the task in the bottom location (x-axis). The blue circles on the axes indicate the single-task accuracy for the respective locations; the red square indicates accuracy for each of the locations in the dual-task condition. The overlaid lines correspond to the predictions of the three benchmark models: the independent parallel model (solid line); the all-or-none serial model (dashed line); and the fixed-capacity parallel model (dotted line). The observed results are inconsistent with the independent parallel model and the fixed-capacity model because the dual-task deficit is larger than the deficit predicted by either of these models. The results are also inconsistent with an all-or-none serial model because the dual-task deficit is smaller than the deficit it predicts. In summary, there was a large dual-task deficit in the semantic categorization task; but the magnitude of the observed deficit was smaller than predicted by an all-or-none serial model.
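
The significance test on the dual-task deficit is a one-sample t-test on the per-participant differences between single-task and dual-task accuracy. A minimal Python sketch of that computation follows. The per-participant values are hypothetical, because individual data are not given in the text, and the function name is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(differences):
    """Return (mean, SEM, t, df) for testing whether the mean difference is zero."""
    n = len(differences)
    m = mean(differences)
    sem = stdev(differences) / sqrt(n)  # stdev uses the n - 1 denominator
    return m, sem, m / sem, n - 1


if __name__ == "__main__":
    # Hypothetical dual-task deficits (percentage points) for 12 participants.
    deficits = [12.5, 9.8, 14.1, 11.0, 16.3, 8.7, 13.2, 10.4, 15.0, 12.9, 7.9, 13.4]
    m, sem, t, df = one_sample_t(deficits)
    print(f"deficit = {m:.1f} ± {sem:.1f}, t({df}) = {t:.2f}")
```

With 12 participants the test has 11 degrees of freedom, as in the reported t(11). The t statistic is the mean divided by its standard error, so the reported 12.1% ± 1.0% implies a value close to the reported 12.3; the small gap is presumably due to rounding of the summary values.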

Figure 3.

Attention operating characteristic for Experiment 1. Observed behavioral accuracy, measured as percent area under the ROC curve, in single (blue) and dual (red) tasks. Error bars: standard error of the mean. Solid line: prediction of the independent parallel model. Dashed line: prediction of the all-or-none serial model. Dotted curve: prediction of the fixed-capacity parallel model.

Response correlation

In the presence of a serial bottleneck, the observer in a dual-task trial can perform the task for the stimulus in one location, and not in the other. To extract this hallmark of a bottleneck, we use a trial-by-trial analysis of response correlation. Specifically, an all-or-none serial process predicts a negative correlation between the accuracy of responses. Accuracy should be higher for a response in one location if the response in the other location was wrong, rather than correct. One way to quantify such a response correlation is to use a conditional accuracy measure (see White et al., 2018; White et al., 2020). Conditional accuracy can be calculated only on dual-task trials. Responses are separated into two sets of trials: one set where the response about the other stimulus in the same trial was correct, and another set where the response about the other stimulus in the same trial was wrong. Then, accuracy is calculated separately for each set. Figure 4 shows accuracy conditioned on whether the response about the stimulus in the other location was correct (ordinate) or wrong (abscissa). The dashed line shows the conditional accuracy predicted by the all-or-none serial model: higher accuracy when the response about the other stimulus was wrong than when the response about the other stimulus was correct (this prediction was generated using simulated dual-task trials from an all-or-none serial model; for details, see White et al., 2018). Neither of the parallel models predicts any difference in conditional accuracy (solid line). In dual-task trials, the observed conditional accuracy was higher when the response on the other side was wrong (70.7%) than when the response on the other side was correct (68.1%), a difference of −2.5% ± 1.1% (significantly different from zero, t(11) = −2.24, p = 0.046). 
This negative correlation had a smaller magnitude than predicted by the all-or-none serial model, but it was reliably different from zero, the prediction of the parallel models. In summary, there was a negative conditional accuracy difference that is often considered to be a signature of serial processing.
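
The conditional accuracy analysis can be sketched in Python as follows. For simplicity the sketch uses proportion correct, whereas the article measures accuracy as area under the ROC, so it illustrates the logic rather than reproducing the analysis. The second function gives the expected values for an all-or-none serial observer under the assumption that the processed location is answered correctly with a fixed probability and the other location is a pure guess (50%). Both function names are hypothetical.

```python
def conditional_accuracy(trials):
    """Accuracy at one location, split by whether the other response was correct.

    trials: iterable of (top_correct, bottom_correct) booleans from dual-task trials.
    Returns (accuracy given other correct, accuracy given other wrong).
    """
    given_correct, given_wrong = [], []
    for top, bottom in trials:
        # Each trial contributes two observations, one per location.
        for this, other in ((top, bottom), (bottom, top)):
            (given_correct if other else given_wrong).append(this)
    return (sum(given_correct) / len(given_correct),
            sum(given_wrong) / len(given_wrong))


def serial_prediction(processed_accuracy):
    """Expected conditional accuracies for an all-or-none serial observer.

    Assumption: one location is processed (correct with probability
    processed_accuracy) and the other response is a guess (correct half the time).
    """
    a = processed_accuracy
    p_both_correct = 0.5 * a        # processed correct and guess correct
    p_both_wrong = 0.5 * (1 - a)    # processed wrong and guess wrong
    p_one_correct = 1 - p_both_correct - p_both_wrong
    given_correct = 2 * p_both_correct / (2 * p_both_correct + p_one_correct)
    given_wrong = p_one_correct / (p_one_correct + 2 * p_both_wrong)
    return given_correct, given_wrong


if __name__ == "__main__":
    correct, wrong = serial_prediction(0.9)
    print(f"given other correct: {correct:.3f}, given other wrong: {wrong:.3f}")
```

For any processed accuracy above chance, this serial observer is less accurate when the other response was correct than when it was wrong, which is the negative correlation described above. Independent responses would give equal values in the two sets.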

Figure 4.

Conditional accuracy in Experiment 1. Observed behavioral accuracy, measured as percent area under the ROC curve, in dual-task trials conditioned on whether the response about the other side was wrong (abscissa) or correct (ordinate). Error bars: standard error of the mean. Solid line: prediction of both parallel models. Dashed line: prediction of the all-or-none serial model.

Secondary results

Effect of crowding from the second stimulus

In the single-stimulus condition, participants performed the single task with stimuli presented only in the relevant location. Accuracy in the single-stimulus condition (83.0% ± 0.9%) was slightly better, but similar to accuracy in the single-task condition (82.1% ± 1.1%). The difference was 0.9% ± 0.5% (not significantly different from zero, t(11) = 1.98, p = 0.073). Thus, there was no evidence of crowding in this experiment.

Effect of response order in the dual task

In the dual task, one of the two locations was randomly chosen as the first response, and the other as the second response. Accuracy for the first response (70.0% ± 0.8%) was similar to accuracy for the second response (70.0% ± 1.0%). The difference was 0.04% ± 0.8% (not significantly different from zero, t(11) = 0.049, p = 0.96). Thus, neither memory nor response interference appeared to differentially affect the second response.

Effect of stimulus order in the RSVP sequence

Across all trials and participants, the accuracy in each interval that could contain a target was: 73.3%, 74.2%, 73.4%, 70.2%, and 70.9% (listed in chronological order). There was a small advantage for detecting a target object in the first possible stimulus interval (73.3% ± 1.0%) compared with the last possible stimulus interval (70.9% ± 1.4%). This difference was small (2.4% ± 1.3%) and not significant (t(11) = 1.83, p = 0.094). Such small “primacy” effects are often reported for RSVP procedures (Coltheart, 1999).

Two-target effects

In some cases, target detection can be affected by the presence of another target in the display (Duncan, 1980). The difference between trials where the other stimulus was a distractor and trials where the other stimulus was a target was 3.4% ± 1.7% in the single-task condition (not significantly different from zero, t(11) = 1.95, p = 0.08), and 3.8% ± 0.8% in the dual-task condition (significantly different from zero, t(11) = 4.64, p < 0.001). These differences are consistent with a performance deficit in the presence of another target in the display. However, both serial and parallel models can give rise to such effects; see Appendix and General Discussion.

Discussion

In summary, in the first experiment, participants could not categorize two objects as well as they could categorize one. The large dual-task deficit and the negative correlation between responses were generally consistent with, but smaller than, the predictions of an all-or-none serial model. Our findings also reject the fixed-capacity parallel model. There was a two-target effect, but the results were not mediated by other stimulus- and task-related factors, such as crowding or response order effects (see Appendix for similar analyses of response bias and stimulus location). Before considering the implications of these results more deeply, we present a second version of the experiment to test the generality and reliability of these results.

Experiment 2: Masking

In the first experiment, we used brief stimulus durations and RSVP to differentiate predictions of serial and parallel models in a semantic object categorization task. In this second experiment, we asked whether removing the RSVP component of the task can produce the same results. Specifically, brief masked presentation of a single object was used, similar to Experiment 2 of White et al. (2018) with words. This change helps address potential confounds arising from the temporal uncertainty of target appearance in the RSVP stream, or some effect of interference or overload in short-term memory (Akyurek & Hommel, 2005), whereas the remaining masks continue to make the task challenging.

The methods were the same as in Experiment 1, with the exception of differences described below. For Experiment 2, 11 paid participants (5 men and 6 women) were recruited from the University of Washington and greater Seattle community. Seven of these participants also completed Experiment 1. In Experiment 2, the participants completed a minimum of 1129 trials per condition.

In Experiment 2, a single stimulus display was presented with pre- and post-masks, rather than an RSVP sequence. Figure 5 shows a schematic of an example trial sequence, with each box representing a time interval. In this experiment, the display sequence contained three object presentation intervals separated by intervals with a blank screen. The first and last object presentation intervals never contained an object from the target category (serving as pre- and post-masks). In the second interval, one object from the target category could appear. There was a 50% chance of a target object appearing, independently at each stimulus location: that is, the presence of a target object in one location gives no information about the presence of a target object in the other location. The stimuli shown in mask intervals were randomly chosen from nontarget categories.

Figure 5.

Masking procedure in Experiment 2. Trial sequences for the single task (A, top location cued) and dual task (B, both locations cued) are shown. Unlike in Experiment 1, the target can occur in only one time interval. In this example, the observer cue color is red. Mean ISI durations shown; these were adjusted separately for each observer to produce approximately 80% accuracy in the single task.

The object presentations were fixed in duration: pre-mask = 66 ms, stimulus interval = 33 ms, and post-mask = 66 ms. Timing of both intervening blank intervals was adjusted for each participant to achieve approximately 80% accuracy in the single-task condition. The mean interval duration across 11 participants was 48 ms (range: 25 ms – 91 ms), and the resulting mean accuracy was 80% ± 1% in the single-task condition. In a control experiment, we verified that timing was a primary factor limiting accuracy by setting the blank interval duration to 400 ms in a short session of 128 trials. With this longer blank interval duration, average accuracy in the single task was 95.2% ± 1.3% (n = 8 participants). Thus, as in Experiment 1, other phenomena, such as discrimination difficulty, did not limit accuracy with longer intervals.

Accuracy in the semantic categorization task was worse for categorizing two objects (dual-task accuracy: 68.2% ± 1.1%) compared with categorizing one object (single-task accuracy: 80.2% ± 1.3%). The dual-task deficit was 11.9% ± 1.2% (t(10) = 9.85, p << 0.001). Figure 6 shows average accuracy in Experiment 2 in the form of an attention operating characteristic. As in Figure 3, accuracy for the task in the top location is plotted against accuracy for the task in the bottom location. The blue circles on the axes indicate the single-task accuracy for the respective locations; the red square indicates accuracy for each of the locations in the dual-task condition. 
The overlaid lines correspond to predictions of three theoretical models: the independent parallel model (solid line); the all-or-none serial model (dashed line); and the fixed-capacity parallel model (dotted line). The results are inconsistent with the independent parallel model and the fixed-capacity parallel model because the dual-task deficit is larger than the deficit predicted by either of these models. The results are also inconsistent with an all-or-none serial model because the dual-task deficit is smaller than the deficit it predicts. In summary, there was a large dual-task deficit, but it was smaller than that predicted by the all-or-none serial model.

Figure 6.

Attention operating characteristic for Experiment 2. Observed behavioral accuracy, measured as percent area under the ROC curve, in single (blue) and dual (red) tasks. Error bars: standard error of the mean. Solid line: prediction of the independent parallel model. Dashed line: prediction of the all-or-none serial model. Dotted curve: prediction of the fixed-capacity parallel model.

Figure 7 shows accuracy conditioned on whether the response about the stimulus in the other location was correct (ordinate) or wrong (abscissa). The dashed line shows the prediction of the all-or-none serial model: higher accuracy when the response about the other stimulus was wrong. Neither of the parallel models predicts any difference in conditional accuracy (solid line). Accuracy in the dual-task condition was higher when the response on the other side was wrong (68.1%) than when the response on the other side was correct (66.4%). This difference of −1.6% ± 1.3% was not reliable (t(10) = −1.26, p = 0.25). In summary, the difference in conditional accuracy was consistent in sign with the prediction of the all-or-none serial model, but not reliably different from zero.

Figure 7.

Conditional accuracy in Experiment 2. Observed behavioral accuracy, measured as percent area under the ROC curve, in dual-task trials conditioned on whether the response about the other side was wrong (abscissa) or correct (ordinate). Error bars: standard error of the mean. Solid line: prediction of both parallel models. Dashed line: prediction of the all-or-none serial model.

Accuracy in the single-stimulus condition (83.0% ± 1.4%) was slightly higher than accuracy in the single-task condition (80.2% ± 1.3%). The difference was 2.8% ± 0.6% (significantly different from zero, t(10) = 4.47, p = 0.0012). Thus, displaying two stimuli had a small effect in this experiment.

In the dual task, accuracy for the first response (68.9% ± 1.2%) was slightly higher than accuracy for the second response (67.7% ± 1.1%). The difference was 1.3% ± 0.5% (significantly different from zero, t(10) = 2.58, p = 0.027). Thus, memory or interference had a small effect on the second response; however, it was much smaller than the dual-task deficit.

The difference between trials where the other stimulus was a distractor and trials where the other stimulus was a target was 3.4% ± 2.3% in the single-task condition (not significantly different from zero, t(10) = 1.46, p = 0.17) and 3.6% ± 1.3% in the dual-task condition (significantly different from zero, t(10) = 2.67, p = 0.02). These differences are consistent with a performance deficit in the presence of another target in the display. However, both serial and parallel models can give rise to such effects; see Appendix and General Discussion.

In Experiment 2, we found that participants categorized two visual objects worse than they categorized one. The large dual-task deficit was consistent with, but less than predicted by, an all-or-none serial model. Although the response correlation was negative (as predicted by serial models) it was not significantly different from the parallel processing prediction of zero correlation. 
There was a two-target effect, but the divided attention effects were not mediated by any other stimulus- and task-related factors observed in this study, such as crowding or response order (see Appendix for similar analyses of response bias and stimulus location).

The magnitude of the dual-task deficit was similar in Experiments 1 and 2: both were larger than the fixed-capacity parallel model prediction, and smaller than predicted by an all-or-none serial model. This result is distinct from the observations for both simple feature judgments and word judgments (see Summary of Results below). The sign of the response correlation was the same in Experiments 1 and 2, consistent with a serial bottleneck; but for Experiment 2, it was not statistically different from the prediction of parallel models. In sum, three of four lines of evidence reject the fixed-capacity parallel model for visual object processing.

General discussion

Summary of results

In this study, we asked whether the semantic categorization of nameable objects shows large effects of divided attention, like those observed for categorization of words. We performed the experiment using two presentation paradigms, RSVP and masking, and found similar divided attention effects in both. Specifically, there was a large dual-task deficit, and a negative correlation in responses. Both findings reject the fixed-capacity parallel model, and are smaller than the prediction of an all-or-none serial model. Figure 8 summarizes these two metrics for seven studies of objects, words, and simple features (White et al., 2018; White et al., 2020; all studies used the same metrics). In Figure 8A, the x- and y-axes show single- and dual-task accuracy, respectively. The crossed squares represent results from experiments involving judgments of a simple feature (color); the open diamonds represent results from experiments involving semantic categorization of words; the closed circles represent results from the experiments in the current study; and the lines represent the predictions of the benchmark models. Results from our Experiments 1 and 2 fall nearest the results from word judgments, indicating that words and objects show a similar large dual-task deficit under divided attention. In Figure 8B, the x-axis shows the magnitude of the dual-task deficit and the y-axis shows the response correlation for the same studies. Results from our Experiments 1 and 2 fall in between the results from simple feature judgments and the results from word judgments. Thus, while object judgments show a large dual-task deficit, overall, the divided attention effects are smaller than found with words.

Figure 8.

Summary of divided attention effects in object, feature, and word judgments. (A) Relationship between single- and dual-task performance. (B) Relationship between dual-task deficit and conditional accuracy. Solid circles: object judgments (present study). Crossed squares: color judgments (White et al., 2018; White et al., 2020). Open diamonds: word judgments (White et al., 2018; White et al., 2020). Dotted line: no difference in conditional accuracy predicted by parallel models. Error bars: standard error of the mean.

Working hypothesis

Our working hypothesis is that semantic categorization of nameable objects is constrained by a serial bottleneck in perceptual processing. Because the results fell short of the benchmark all-or-none serial model prediction, we considered a more general model that can predict any magnitude of dual-task deficit or negative correlation. The serial model with partial switching represents a bottleneck where processing can occur for only one stimulus at a time. Moreover, on some proportion of trials there is enough time to switch to processing a second stimulus. For example, if 100% of trials allow processing of the second stimulus, the prediction becomes identical to the independent parallel model (no dual-task deficit). Conversely, if 0% of trials allow processing of the second stimulus, the prediction becomes identical to the all-or-none model with no switching (a large dual-task deficit). Intermediate proportions of trials with processing of the second stimulus produce intermediate dual-task deficits. Changing the specific proportion of trials in which only one stimulus is processed allows the model to predict any magnitude of the dual-task deficit; and we can interpret our results in this context by estimating this proportion from the model using the observed performance. The dual-task deficit magnitude in both experiments can be described by switching on about 20% of trials, and processing of only one stimulus on the remaining 80% of trials. The observed negative correlation was consistent with the predictions of a partial switching model where only one stimulus was processed on about 60% of trials (Experiment 1) or 50% of trials (Experiment 2). More generally, the results are consistent with a model where in at least half of the trials only one stimulus can be processed. By this interpretation, object processing does not rely on only parallel processes, like simple features, and instead is constrained by a serial bottleneck. 
Objects differ from words in that more than one object can be processed on some trials despite the brief masked displays. This working hypothesis can be further extended to account for the observed two-target effect. Sometimes this effect is taken as a sign of parallel processing and “late selection” (Duncan, 1980). In fact, a small modification of the partial switching model can account for the two-target effect: a reduction in the proportion of trials in which the second stimulus is processed after a target has been processed, compared with after a distractor. In the extreme, a second stimulus might be processed only on trials in which a distractor is processed first. An alternative account is to adapt the two-stage models proposed by Duncan (1980) or Corbett and Smith (2017) by appending them to a first stage consisting of the partial switching model. When the first stage processes two targets, information about the targets is subject to target-specific interference at the second stage (e.g. limited memory encoding), producing the two-target effect. Without two targets, there would be no effect of the second stage. Either of these accounts can yield the modest 3% two-target effects found in the current experiments.
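
The relationship between the switching proportion and the dual-task deficit can be illustrated with a simple linear mixture. This is a rough sketch and not the fitting procedure used in the study: it assumes that accuracy on trials with switching equals single-task accuracy, that accuracy on trials without switching is the average of single-task accuracy and chance (50%), and that these combine linearly, which is only an approximation for accuracy measured as area under the ROC curve. All function names are hypothetical.

```python
def serial_all_or_none(single_acc, chance=50.0):
    """Dual-task accuracy (%) if only one of the two stimuli is ever processed."""
    return (single_acc + chance) / 2.0


def partial_switching_prediction(single_acc, p_switch, chance=50.0):
    """Dual-task accuracy (%) when both stimuli are processed on p_switch of trials."""
    no_switch = serial_all_or_none(single_acc, chance)
    return p_switch * single_acc + (1.0 - p_switch) * no_switch


def estimate_switch_proportion(single_acc, dual_acc, chance=50.0):
    """Invert the linear mixture; assumes single_acc is above chance."""
    lower = serial_all_or_none(single_acc, chance)
    return (dual_acc - lower) / (single_acc - lower)


# Experiment 2 accuracies as reported in the text (single task, dual task).
p_switch = estimate_switch_proportion(80.2, 68.2)
print(f"estimated switching proportion: {p_switch:.2f}")
```

With the Experiment 2 accuracies, this simplified inversion gives a switching proportion of about 0.2, in line with the estimate described above. Setting `p_switch` to 1 recovers the no-deficit prediction and setting it to 0 recovers the all-or-none prediction.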

An alternative hypothesis

An alternative to our working hypothesis is that semantic categorization of objects is mediated by some kind of limited-capacity parallel process. While the divided attention effects for objects are larger than the predictions of the two parallel models we used as benchmarks, a discrete fixed-capacity parallel model can capture the magnitude of the dual-task deficit. This model is similar to the fixed-capacity parallel model, but information from the stimulus informs two discrete states: “detect target” or “detect no target” (Luce, 1963; Swagman et al., 2015; see Appendix for details). The discrete model predicts a dual-task deficit magnitude similar to what we observed in this study. However, like other parallel models, the discrete model predicts no response correlation. To predict a negative correlation between the two responses, one can add parameter variability to the attention parameter that assigns the relative number of samples to one task or the other task (see the section at the end of the Appendix on response correlation). If this parameter varies from trial to trial, then some trials have higher performance for one response and other trials have higher performance for the other response. This modification can predict the observed small negative correlation between the two responses. But unlike the other changes to the model, this change is ad hoc. This alternative hypothesis can also be extended to account for the two-target effect. One can adapt the two-stage models proposed by Duncan (1980) or Corbett and Smith (2017). These models add a second stage (e.g. limited memory encoding) to the simple parallel perception models described here. For example, after a target is processed, there is a memory encoding process that delays processing of additional targets, but not distractors. 
Alternatively, the two-target effect can be accounted for by attenuating the gain of a stimulus in a target context, relative to a distractor context (analogous to crosstalk, but opposite in sign). In summary, there are many ways to modify a parallel model to yield the two-target effects found in the current studies. Our larger point is that by themselves, two-target effects are not diagnostic for distinguishing between parallel and serial models. Figure 9 summarizes the specific models considered and tested here in the context of other general models of processing. In the present study, we provide evidence rejecting all of the common specific models suggested for object processing: the independent parallel model, the fixed-capacity parallel model, and the all-or-none serial model. Our working hypothesis consists of a serial model with partial switching. This model allows switching on some proportion of trials. It provides the most parsimonious explanation of the results. However, we cannot yet rule out an alternative: a discrete fixed-capacity parallel model, with an ad hoc addition such that on some trials attention is allocated unequally between the two stimulus locations. Although the current data cannot be used to definitively distinguish these possibilities, future studies could be targeted to accomplish this: for example, a redundant-target experiment could provide further evidence for or against serial processing (Mullin, Egeth, & Mordkoff, 1988; Mullin & Egeth, 1989; Shepherdson & Miller, 2014).

Figure 9.

Summary of the relevant models. Arrows lead from more general to more specific models.

Stepping back from the details of the models, this study informs a larger set of questions about perception within a single fixation (or single brief display). Although there are many examples of likely parallel processing (e.g. contrast detection; Bonnel et al., 1992; Scharff et al., 2011), there are few examples of likely serial processing. Early work suggesting serial processing in visual search has been rejected both because of mimicry between serial and parallel processes (e.g. Townsend, 1990; Palmer et al., 2000) and because the detailed predictions of serial models have not been satisfied (e.g. Ward & McClelland, 1989). In contrast, White et al. (2018) proposed that the perception of words is a good example of serial processing. Evidence from dual-task paradigms in that study builds on the important existing evidence from the redundant target experiments of Mullin and Egeth (1989). The research in this article extends this example to nameable objects. We intend future studies to further evaluate the case of words and objects as an example of perception limited by a serial process.

Relationship to other behavioral studies

The results of our study are compatible with previous literature suggesting that processes governing object perception have limited capacity (Potter & Fox, 2009; Scharff et al., 2011). Our results are similar to but smaller than the effects of divided attention reported for the semantic categorization of words. White et al. (2018, 2020) found a large dual-task deficit and a negative correlation consistent with the prediction of the all-or-none serial model. The methods used in the present study closely match those of White and colleagues, which allows a direct comparison of effect magnitudes, as summarized in Figure 8.

We see our results as broadly compatible with the literature on the automaticity of perceptual processes (Shiffrin & Schneider, 1977). Our experiments use familiar objects, but the specific images are not particularly familiar. While one might find parallel processing of a particular familiar feature, it is less likely that there would be parallel processing of the diverse images of a familiar object. Therefore, we see no conflict between finding evidence of serial processing for recognition of familiar objects and theories of automaticity. A different point of contact with this literature is Cousineau and Shiffrin (2004). In this study, the authors created a task with difficult discriminations involving the relative position of multiple features (4 spokes around a central circle). Their goal was to find a task that required serial processing in visual search. Obtained with a response time paradigm, their results are among the most convincing for serial processing in visual search. Thus, that paper is one of the closest response time analogues to the current study.

In contrast to these studies, recent work with natural scene stimuli has proposed that processing in complex visual recognition tasks might be parallel. 
For example, Thorpe and colleagues (Thorpe, Fize, & Marlot, 1996) investigated the speed of visual recognition using objects in natural scenes. A natural scene stimulus was presented briefly for 20 ms; and the task was to report the presence of an animal in that scene by releasing a button. Participants were very accurate while maintaining fast reaction times, which the authors suggest reveals an underlying rapid and efficient process for recognition of objects in complex scenes. We point out that although objects might be highly discriminable, this observation does not necessarily imply a reduced effect of divided attention. Relevant to divided attention, Rousselet and colleagues used the same task to ask whether this processing occurs in parallel when two or four scenes are presented simultaneously (Rousselet, Thorpe, & Fabre-Thorpe, 2004). The authors observed set-size effects in accuracy and response time when multiple scenes were presented simultaneously. They argued that these divided attention effects were consistent with an unlimited-capacity parallel model with late selection, but the authors did not consider, nor rule out, alternative serial models. In summary, Rousselet and colleagues make the case that parallel processes can plausibly underlie object recognition in their experiment, but do not distinguish parallel and serial model predictions for their task. Indeed, that remains a largely unsolved problem for such visual search tasks that use response time.

The neural basis of divided attention effects

Visual word form area (VWFA) has been proposed as the locus of the serial bottleneck for words (White et al., 2019). Although simple features can be processed using information in earlier visual cortex (from V1 to V4), word processing depends on VWFA. White and colleagues present evidence that signals in the anterior part of VWFA can represent only one word at a time. This is a candidate neural correlate of a serial bottleneck, but does not rule out additional bottlenecks elsewhere in processing, for example, in semantic processing within the language areas. Analogous to the VWFA, the candidate locus for a serial bottleneck for objects is likely to be in the ventral pathway beyond retinotopic cortex. One candidate is lateral occipital complex (LOC). In this area, signals represent complex object characteristics such as category, and thus underlie more complex judgments than simple features (Kourtzi & Kanwisher, 2000; Eger, Ashburner, Haynes, Dolan, & Rees, 2008). Retinotopy is weaker and receptive field sizes are larger in this area than in early visual cortex, opening up the possibility that two objects cannot be represented at the same time; that is, neural responses carry information about only one of the two objects (Larsson & Heeger, 2006; Cichy, Chen, & Haynes, 2011). Another candidate is anterior or anteromedial temporal cortex. These regions are downstream of LOC, and are known to be involved in object-based and semantic processing (Moss, Rodd, Stamatakis, Bright, & Tyler, 2005; Patterson, Nestor, & Rogers, 2007).

Comparison of processing for words and objects

Is processing different for words and objects? Several aspects of word processing may be unique; for example, word judgments require lexical access, whereas many object judgments do not (although there may be some overlap for object naming tasks; Biggs & Marmurek, 1990). Word processing also appears to have a visual hemifield effect: English words shown in the right hemifield are recognized more effectively than in the left (Chiarello, 1988; Bub & Lewine, 1988; Brysbaert, Vitu, & Schroyens, 1996; Simola, Holmqvist, & Lindgren, 2009). Nameable objects generally do not show such asymmetry (Biederman & Cooper, 1991; but see McAuliffe & Knowlton, 2001). Word and object recognition processes also appear to be nonoverlapping at the conceptual processing level. For example, Endress and Potter (2012) asked participants to recognize a scene or its verbal description while performing a simultaneous secondary task: scenes were presented together with words, or non-word stimuli. Scene understanding was impaired only when the secondary task involved non-word stimuli, suggesting that the processing of words is distinct from the processing of scenes.

Conclusion

In this study, we examined whether object perception showed large divided attention effects like those found for words, or little or no divided attention effects like those found for simple features. We used a semantic categorization task with nameable object stimuli and two presentation paradigms: RSVP and brief masking. There was a large divided attention effect and a negative correlation in responses for the semantic categorization of two nameable objects. These results are consistent with a serial model in which only one stimulus is processed on most, but not all, trials. The results are also consistent with an alternative discrete fixed-capacity parallel process with differential allocation of attention. In conclusion, the effects of divided attention for objects are smaller than observed for words, but might still reflect a serial bottleneck in object processing.

Table 1.

Experiment 1: Percent hit and correct rejection by target or distractor context.

	Irrelevant target context (T)	Irrelevant distractor context (D)	T–D
Relevant target (hits)	57.0 ± 1.3%	62.8 ± 1.6%	−5.8 ± 1.3%
Relevant distractor (correct rejections)	72.5 ± 1.8%	75.7 ± 2.4%	−3.1 ± 1.7%

Table 2.

Experiment 2: Percent hit and correct rejection by target or distractor context.

	Irrelevant target context (T)	Irrelevant distractor context (D)	T–D
Relevant target (hits)	57.6 ± 2.2%	59.5 ± 2.5%	−2.0 ± 1.8%
Relevant distractor (correct rejections)	68.7 ± 2.8%	78.7 ± 2.9%	−10.1 ± 3.0%
