
The Scene Perception & Event Comprehension Theory (SPECT) Applied to Visual Narratives.

Lester C. Loschky, Adam M. Larson, Tim J. Smith, Joseph P. Magliano.

Abstract

Understanding how people comprehend visual narratives (including picture stories, comics, and film) requires the combination of traditionally separate theories that span the initial sensory and perceptual processing of complex visual scenes, the perception of events over time, and comprehension of narratives. Existing piecemeal approaches fail to capture the interplay between these levels of processing. Here, we propose the Scene Perception & Event Comprehension Theory (SPECT), as applied to visual narratives, which distinguishes between front-end and back-end cognitive processes. Front-end processes occur during single eye fixations and are comprised of attentional selection and information extraction. Back-end processes occur across multiple fixations and support the construction of event models, which reflect understanding of what is happening now in a narrative (stored in working memory) and over the course of the entire narrative (stored in long-term episodic memory). We describe relationships between front- and back-end processes, and medium-specific differences that likely produce variation in front-end and back-end processes across media (e.g., picture stories vs. film). We describe several novel research questions derived from SPECT that we have explored. By addressing these questions, we provide greater insight into how attention, information extraction, and event model processes are dynamically coordinated to perceive and understand complex naturalistic visual events in narratives and the real world.
© 2019 The Authors. Topics in Cognitive Science published by Wiley Periodicals, Inc. on behalf of Cognitive Science Society.

Keywords:  Attention; Comics; Event perception; Eye movements; Film; Narrative comprehension; Scene perception; Visual narratives

Year:  2019        PMID: 31486277      PMCID: PMC9328418          DOI: 10.1111/tops.12455

Source DB:  PubMed          Journal:  Top Cogn Sci        ISSN: 1756-8757


Introduction

How do viewers both perceive and understand visual narratives? This is a difficult and complex question that has not previously been addressed in any comprehensive theory (but see Cohn, 2013a; Cohn, 2019b). It involves coordinating perceptual and comprehension processes that operate over multiple images and produce a durable mental model of a narrative (Loschky, Hutson, Smith, Smith, & Magliano, 2018). Consider Fig. 1A, which shows three images from Mercer Mayer’s (1967) visual narrative, A Boy, a Dog, and a Frog. In the first image, the viewer sees a boy carrying a net and a pail, running down a wooded hill with his dog, toward a frog on a lily pad, in a pond at the bottom of the hill. The viewer also sees a tree branch close to the ground about halfway down the hill. Several things are worth noting here. First, the viewer needs to recognize the boy, net, pail, dog, frog, lily pad, pond, hill, and tree branch as such. Likewise, the viewer needs to recognize that the boy and dog are running down the hill, and that the boy is carrying the net and pail. Research suggests that all of these things can be recognized very rapidly, with the boy, dog, tree branch, and hill potentially being recognized within the time frame of a single eye fixation (Fei‐Fei, Iyer, Koch, & Perona, 2007). The fact that the boy and dog are running may also be recognized within the first fixation, or require a further fixation to extract the necessary visual detail (Glanemann, 2008; Larson, Hendry, & Loschky, 2012). The frog may also be too small to detect peripherally and require an additional fixation (Nelson & Loftus, 1980). Similarly, the fact that the boy is carrying a net and a pail will likely require one or two further fixations. Importantly, each additional fixation requires the viewer to attentionally select part of the image for further processing (Deubel & Schneider, 1996), though these selections are usually made preconsciously (Belopolsky, Kramer, & Theeuwes, 2008; Memmert, 2006). 
All of these processes can be thought of as basic perceptual building blocks of the scene.
Figure 1

Experimental conditions used to elicit bridging inferences while viewers read a picture story (Hutson, Magliano, & Loschky, 2018; Magliano, Larson, Higgs, & Loschky, 2016). (A) Complete target episode from “A Boy, a Dog, and a Frog” (Mayer, 1967), including beginning‐state, bridging‐event, and end‐state images. (B) The target episode missing the bridging‐event image, which requires the viewer to generate a bridging inference when viewing the end‐state image to maintain coherence with the beginning‐state image.

But the viewer must also make sense of how these agents, their actions, objects, and scene background elements depict events in a narrative. For example, to understand the narrative, viewers must infer the goal of the boy (Graesser & Clark, 1985; Long, Golding, & Graesser, 1992; Suh & Trabasso, 1993), which is to catch the frog. Inferences of this sort can be generated very quickly in the context of reading texts (Long et al., 1992). When viewing such a picture story, viewers can generate an inference within two extra fixations on details of the scene that suggest the inference (e.g., the direction of the boy’s eye gaze relative to the location of the frog, and the position of the boy’s net, suggest his goal of catching the frog: Hutson et al., 2018). The second image shows the boy and dog tripping over the tree branch, with the boy having let go of his net and pail, and shows the frog noticing these events. Understanding this picture also requires the same processes of scene perception and attentional selection described for the first picture, but those processes should be supported by representations of the prior narrative context held in working memory and episodic memory (Graesser, Millis, & Zwaan, 1997). Importantly, the boy and dog tripping on the tree branch is inconsistent with their inferred goal of catching the frog from the first picture; thus, it reflects a failure of that goal (Trabasso, van den Broek, & Suh, 1989). 
Understanding how this picture fits into the narrative involves inferring the causal relationship between it and the prior narrative context (e.g., the boy tripped because he was running down the hill and there was a tree branch blocking his way; the boy failed to achieve his goal) (Trabasso & Suh, 1993). The final image shows the same scene from a slightly zoomed‐in view. This image illustrates that viewers need to understand how the products of scene perception are related across images. The viewer sees boots sticking out of the water and must recognize that those are the boy’s boots. Establishing this relationship has implications at the level of the narrative event model because it implies that the boy has fallen in the water, and that this is a result of his tripping on the tree branch (as depicted in the previous image). Clearly, this illustrates the coordination of visual perception and narrative comprehension in the cognitive processing of visual narratives. Thus, it is no surprise that scholars who study these different levels of cognitive processes have become interested in how they are involved in visual narratives. Specifically, research on visual narrative processing has been rapidly expanding in the domains of visual scene perception (e.g., Hutson, Smith, Magliano, & Loschky, 2017), event perception/cognition (e.g., Zacks, Speer, & Reynolds, 2009), psycholinguistics (e.g., Cohn, 2013a), and narrative comprehension (e.g., Magliano, Kopp, McNerney, Radvansky, & Zacks, 2012). While these research areas are seemingly disparate, the above example illustrates that comprehensive theoretical frameworks are needed to explain how these processes are coordinated to support the perception and understanding of visual narratives. 
Additionally, such theoretical frameworks may be necessary to prevent fragmentation of the visual narrative research field, as occurred, for example, in reading research, where multiple models accounted for aspects of reading, such as word identification, syntactic parsing, discourse representations, and the roles of the reader’s eye movements (Rayner & Reichle, 2010). Thus, the novel contribution of our theoretical framework lies in integrating processes from the scene perception literature with processes from the event perception and narrative comprehension literatures, raising interesting research questions concerning interactions between them. In this way, research on visual narratives within our framework is an example of complex cognition that can inform our broader understanding of naturalistic visual processing and transcend the currently compartmentalized research on visual narrative processing in the separate, minimally interacting research fields. Below, we outline a theoretical framework, the Scene Perception & Event Comprehension Theory (SPECT: Loschky et al., 2018), that describes how perceptual processes and event model construction processes are coordinated during visual narrative processing. SPECT’s novel contribution lies in being an integrative framework that identifies important interactions between perceptual and event model processes. SPECT allows researchers to identify the core perceptual and cognitive processes for perceiving and comprehending visual media. Critically, these core processes are also utilized in non‐narrative contexts, such as real‐world scenes. In formulating SPECT, we demonstrate how visual narratives are an example of complex cognition of broader interest to the cognitive sciences in general.

The SPECT framework

SPECT builds on decades of theoretical developments in general cognition and its subsystems (e.g., working memory, attentional control). Thus, SPECT is the application of general models of visual cognition to visual narratives, and many of SPECT’s assumptions apply equally to real‐world scene perception. SPECT bridges theories of scene perception (Henderson & Hollingworth, 1999; Irwin, 1996), event cognition (Radvansky & Zacks, 2011, 2014), and narrative comprehension (Gernsbacher, 1990; Zwaan & Radvansky, 1998). SPECT pertains specifically to processing visual content; that is, it does not specify processes involved in processing either language narrowly defined or non‐linguistic audio. The basic architecture of SPECT distinguishes between stimulus features and front‐end and back‐end cognitive processes involved in visual event and narrative cognition, as illustrated in Fig. 2. Note that the front‐end versus back‐end distinction is not equivalent to the distinction between bottom‐up and top‐down processes. We will clarify these distinctions below. We will briefly overview how these processes are conceptualized in SPECT before outlining each component in more detail with supporting evidence.
Figure 2

Model of the Scene Perception & Event Comprehension Theory (SPECT) theoretical framework. The eye icon denotes the position of viewer gaze on the stimulus during a particular fixation. A further walkthrough of the framework is provided in the text below.

SPECT’s starting point is the stimulus. All visual narratives are composed of either static (e.g., in Fig. 1) or dynamic (in the case of film, theater, or virtual reality) visual images of varying degrees of complexity and realism composed in sequence. Some stimulus properties constrain later processes within SPECT via medium‐agnostic mechanisms such as the salience of primitive visual features (e.g., luminance, contrast, or motion; Itti & Koch, 2001). For example, Fig. 3A shows the computed saliency of the “Beginning State” image from Fig. 1. For this saliency algorithm (AWS) (Garcia‐Diaz, Fdez‐Vidal, Pardo, & Dosil, 2009), the highest computed saliency regions (i.e., most likely to capture a viewer’s attention) are the Boy’s head and the Frog’s legs. This is based on analyzing the orientations of image elements at numerous size scales, and finding the local regions that are the most different from the rest of the image. Fig. 3B shows an actual fixation heat map, based on 39 viewers’ fixations while viewing this image within the context of the entire visual narrative. The computed saliency is very close to the empirical fixation probabilities. Other stimulus properties are medium‐specific, such as the panels, layout, and action lines in comics, which are assumedly learned, rather than universal, in contrast to visual saliency (Cohn, 2013b). The three‐panel layout of the images in Fig. 1A is familiar to comic readers, and it is meant to be read from left to right. In film, camera movements, cuts, and the predetermined pace of the moving images are similarly meant to guide viewers’ attention (Bordwell & Thompson, 2003). 
Thus, the combination of medium‐agnostic and medium‐specific stimulus features shape what potential information is available to the viewer, and likely influence how front‐end and back‐end processes interact in processing this information.
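To make the medium‐agnostic notion of saliency concrete, a toy center‐surround feature‐contrast map can be sketched as below. This is a simplified stand‐in, not the AWS algorithm of Garcia‐Diaz et al. (2009), which additionally decorrelates oriented filter responses; the function name and scale values are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def feature_contrast_saliency(image_gray, scales=(4, 8, 16)):
    """Toy saliency map: highlight local regions that differ most from
    their surroundings, pooled over several spatial scales. A simplified
    stand-in for multi-scale saliency algorithms such as AWS."""
    saliency = np.zeros_like(image_gray, dtype=float)
    for sigma in scales:
        center = ndimage.gaussian_filter(image_gray, sigma)
        surround = ndimage.gaussian_filter(image_gray, sigma * 3)
        saliency += np.abs(center - surround)  # center-surround contrast
    saliency -= saliency.min()
    if saliency.max() > 0:
        saliency /= saliency.max()  # normalize to [0, 1]
    return saliency

# Usage: regions with values near 1 are most likely to capture attention.
img = np.random.rand(64, 64)   # stand-in for a grayscale story image
smap = feature_contrast_saliency(img)
```

Regions where the map approaches 1 correspond to the red regions in saliency visualizations like Fig. 3A.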
Figure 3

(A) Example of computationally predicted visual saliency of regions in the Beginning‐State image of Fig. 1, using the AWS saliency algorithm (Garcia‐Diaz et al., 2009). (B) Fixation heat map from 39 viewers reading the wordless visual narrative. In both images, red = highest saliency/fixation probability. (Saliency and fixation heat map images courtesy of Maverick E. Smith.)
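A fixation heat map like the one in Fig. 3B can be approximated by accumulating viewers’ fixation coordinates into an image‐sized grid and smoothing with a Gaussian kernel. The sketch below uses hypothetical fixation coordinates, and the smoothing bandwidth is an illustrative assumption.

```python
import numpy as np
from scipy import ndimage

def fixation_heatmap(fixations, height, width, sigma=20):
    """Build a fixation heat map from (x, y) fixation coordinates pooled
    across viewers, smoothed with a Gaussian kernel (sigma in pixels)."""
    counts = np.zeros((height, width))
    for x, y in fixations:
        if 0 <= y < height and 0 <= x < width:
            counts[int(y), int(x)] += 1    # accumulate fixation counts
    heat = ndimage.gaussian_filter(counts, sigma)
    return heat / heat.max() if heat.max() > 0 else heat

# Hypothetical fixations from several viewers on a 480x640 image:
# three cluster on one region, one falls elsewhere.
fixations = [(320, 120), (325, 118), (150, 400), (318, 125)]
heat = fixation_heatmap(fixations, height=480, width=640)
```

The peak of the smoothed map falls where fixations cluster, analogous to the red regions in Fig. 3B.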

Front‐end processes are involved in extracting content from the image, and back‐end processes operate on their output to support the construction of an event model. Front‐end processes occur during single eye fixations. These processes during a fixation extend from the earliest perceptual processes to activated semantic representations that are sent to working memory (WM). The front‐end involves two key processes that occur during each fixation: information extraction and attentional selection. Information extraction is further subdivided between broad (the gist of the whole scene, e.g., woods) and narrow (detailed information from animate or inanimate entities, e.g., boy, dog, frog, net, pail, etc., in Fig. 1). Information extraction includes both entities and events, with event information extraction also producing both broad categorizations of what is seen (e.g., “trying to catch”) and narrow categorizations (e.g., “running” in Fig. 1). The information extracted during each fixation is fed to the back‐end. Attentional selection determines what information to process during single fixations, and where the eyes will be sent for the next fixation, and is influenced by both exogenous and endogenous factors. Note that the above definition of front‐end processes is far more specific (i.e., occurring during single fixations) and limited (i.e., information extraction and attentional selection) than the term bottom‐up processes, and thus the two terms are not synonymous. Back‐end processes occur in memory across multiple eye fixations, specifically WM and long‐term memory (LTM). 
The information represented in the back‐end is accumulated over multiple eye fixations spanning durations extending from milliseconds to minutes. A key back‐end process is the construction and maintenance of the current event model in WM, which represents what is happening now (e.g., a boy and his dog, trying to catch a frog in the woods, in Fig. 1). An event model is a particular type of mental model that captures a sequenced event. This representation is maintained until perceptual and conceptual content specifies that it is no longer relevant or valid because content has changed over time (e.g., the boy’s falling in the pond indicates the failure of his attempt to catch the frog, in Fig. 1). At that point, back‐end processes encode the event model into episodic LTM, which we call a stored event model (e.g., the boy and dog tried to catch a frog in the woods, but fell in the pond, in Fig. 1). From these stored event models, more semantic event schemas can be derived in semantic LTM by averaging across multiple event model instances (e.g., Hintzman, 1988). The recently stored event models in episodic LTM will feed back to and influence the new current event model in WM (e.g., the expectation that the boy may make another attempt to catch the frog; see information arrows back from Episodic Memory to the Event Model in Fig. 2). The current event model is also influenced by schemas in semantic LTM (e.g., “little boys,” “catching animals,” etc.), and by executive functions, like goal setting, attention control, and inhibition. Note too that the above definition of back‐end processes is also more specific (i.e., in memory across multiple fixations) and limited (i.e., to the event model building processes in WM, the stored event model in episodic LTM, and stored knowledge in semantic LTM) than the term top‐down processes. 
An underlying assumption of SPECT is that front‐end and back‐end processes iteratively support the creation of the current event model in WM, and the management of stored event models in episodic LTM. Importantly, front‐end attentional selection and information extraction guide the moment‐to‐moment knowledge retrieval from semantic LTM that supports the back‐end processes of creating the current event model in WM (McKoon & Ratcliff, 1998; Myers & O'Brien, 1998). Thus, we cannot understand how knowledge is retrieved from LTM in the moment without understanding the role of these front‐end processes. Similarly, a key theoretical issue raised by SPECT is whether and how back‐end processes, including the current event model in WM, the stored event models in episodic LTM, and schemas and scripts in semantic LTM, influence the front‐end information extraction and attentional selection processes. Thus, SPECT provides a theoretical framework to explore and explain the relationships between front‐ and back‐end processes during visual narrative processing.

Theoretical foundations for front‐end processes

When looking at real‐world scenes, comics, or videos, visual information extraction only occurs during periods in which the eyes are stabilized relative to fixed points in space (fixations) or slowly moving objects (smooth pursuit or the vestibulo‐ocular reflex). This is because processing of visual detail is suppressed during the rapid shifts (saccadic eye movements) between locations (Matin, 1974; Ross, Morrone, Goldberg, & Burr, 2001). Thus, we can consider eye fixations to be the spatio‐temporal input units of vision. Furthermore, any extracted information maintained across multiple fixations is in short‐term memory or WM (Irwin, 1996; Zelinsky & Loschky, 2005), which is strongly constrained in terms of capacity (i.e., 3–4 items without rehearsal or chunking: Cowan, 2001) and encodable information (i.e., post‐perceptual information: Hollingworth, 2009; Irwin, 1996). This key insight provides the rationale for distinguishing between front‐end processes, occurring during single fixations, and back‐end processes, occurring in memory across multiple fixations. Furthermore, these constraints from eye movements necessarily shape how events in the environment, comics, or films are understood and become long‐term episodic memories.
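The fixation/saccade distinction that defines SPECT’s input units can be operationalized with a standard velocity‐threshold algorithm (I‐VT): gaze samples moving slower than a velocity threshold are labeled fixation samples, faster ones saccade samples. The threshold value and the toy gaze trace below are illustrative assumptions, not parameters from the theory.

```python
import numpy as np

def detect_fixations_ivt(x, y, t, velocity_threshold=1000.0):
    """Velocity-threshold identification (I-VT): intervals whose
    point-to-point velocity (pixels/s) falls below the threshold are
    labeled 'fixation'; faster intervals are 'saccade'. The 1000 px/s
    threshold is illustrative and depends on viewing geometry."""
    x, y, t = map(np.asarray, (x, y, t))
    dt = np.diff(t)
    velocity = np.hypot(np.diff(x), np.diff(y)) / dt
    return np.where(velocity < velocity_threshold, "fixation", "saccade")

# Hypothetical 100 Hz gaze trace: stable, one rapid jump, stable again
t = np.arange(6) * 0.01
x = [100, 101, 100, 400, 401, 400]   # large jump = saccade
y = [200, 200, 201, 250, 250, 251]
labels = detect_fixations_ivt(x, y, t)
```

Runs of "fixation" labels would then be grouped into the discrete fixations that serve as SPECT’s front‐end processing units.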

Information extraction

What types of information are extracted during a single eye fixation? SPECT distinguishes broad versus narrow information extraction (Loschky et al., 2018). Broad extraction is from all or most of an entire scene, producing holistic semantic information called scene gist (Oliva, 2005). This includes the basic level category of a scene (e.g., woods, a pond, in Fig. 1) (Fei‐Fei et al., 2007; Greene & Oliva, 2009; Loschky & Larson, 2010), detecting animals or people (Fletcher‐Watson, Findlay, Leekam, & Benson, 2008; Thorpe, Fize, & Marlot, 1996), the scene’s emotional valence (Calvo, Nummenmaa, & Hyönä, 2007; Maljkovic & Martini, 2005), some rather rudimentary information about basic level actions (e.g., running vs. falling in Fig. 1) (Larson, 2012), and both the agent and patient of an action (e.g., the boy [agent] trying to catch the frog [patient], in Fig. 1) (Dobel, Gumnior, Bölte, & Zwitserlood, 2007; Hafri, Papafragou, & Trueswell, 2013). Narrow extraction operates on a particular entity (object, animal, or person), providing details such as colors, shapes, and sizes of object parts (Hollingworth, 2009; Pertzov, Avidan, & Zohary, 2009). Such broadly and narrowly extracted information in WM is used for comprehending events by back‐end processes in the current event model. Importantly, despite the wide range of information extracted during a single eye fixation, the total amount of consciously available information from a single fixation remains limited, and thus increasingly detailed information from a scene or image must accrue in WM over multiple fixations (Hollingworth & Henderson, 2002; Pertzov et al., 2009) in the back‐end. One slight caveat to this assumption of SPECT is that fixations are actually made up of micromovements (e.g., microsaccades, drift, etc.), which may constitute phases of slight attentional shifts and changes in perceived information within a single fixation (Otero‐Millan, Troncoso, Macknik, Serrano‐Pedraza, & Martinez‐Conde, 2008), and that the phases of attending to and processing a specific object could also be made up of multiple fixations dwelling within the object (Nuthmann & Henderson, 2010). Since both of these behaviors can still be considered fixations at different spatiotemporal scales, we will use the all‐encompassing term fixation within SPECT.

Attentional selection

The other key front‐end process during each fixation is attentional selection, which is the gateway to WM, comprehension, and explicit LTM for events. On each fixation, before moving the eyes, attention covertly shifts to the next to‐be‐fixated object (Deubel & Schneider, 1996; Hoffman & Subramaniam, 1995; Kowler, Anderson, Dosher, & Blaser, 1995). Attentional selection is affected by both exogenous, bottom‐up, stimulus saliency, as described above (Borji & Itti, 2013; Wolfe & Horowitz, 2004), and endogenous, top‐down, cognitive processes (DeAngelus & Pelz, 2009; Eckstein, Drescher, & Shimozaki, 2006; Findlay & Walker, 1999). Specifically, stimulus saliency is determined by visual feature contrast in terms of motion, brightness, color, orientation, and size (Mital, Smith, Hill, & Henderson, 2010; Peters, Iyer, Itti, & Koch, 2005). However, top‐down, task‐driven goals, such as searching for specific information, more strongly affect viewers’ attention than saliency in pictures (Foulsham & Underwood, 2007; Henderson, Brockmole, Castelhano, & Mack, 2007), and some evidence of saliency‐override by task has been demonstrated in film viewing, although this is believed to be more difficult (Hutson et al., 2017; Smith & Mital, 2013). More specifically, there are volitional (consciously controlled) versus mandatory (unconscious prior knowledge‐based) top‐down effects on attentional selection (Baluch & Itti, 2011). These can interact in tasks, such as visual search, in which the volitional top‐down goal of finding a specific target (e.g., a chimney) is facilitated by mandatory top‐down knowledge of likely target locations (e.g., at the top of a house: Eckstein et al., 2006; Torralba, Oliva, Castelhano, & Henderson, 2006). In SPECT, volitional top‐down attentional control occurs in WM, using executive processes (Moss, Schunn, Schneider, McNamara, & VanLehn, 2011). Mandatory top‐down processes can come from the event model or relevant world knowledge (i.e., schemas). 
Attentional selection during single fixations can be narrowly focused, for example at the point of fixation, or broadly spread across a large portion of the visual field, also known as attentional breadth, or a person’s useful field of view (Ball, Beard, Roenker, Miller, & Griggs, 1988; Eriksen & Yeh, 1985; Larson, Freeman, Ringer, & Loschky, 2014). Importantly, this can change dynamically based on the viewer’s processing demands (Ringer, Throneburg, Johnson, Kramer, & Loschky, 2016; Williams, 1988). A viewer’s breadth of attention also changes over the course of ~12–24 fixations in the first 4–6 s of viewing an image in the ambient‐to‐focal shift of eye movements (Pannasch, Helmert, Roth, Herbold, & Walter, 2008; Smith & Mital, 2013). Specifically, during the first 2 s of viewing an image, viewers tend to make long saccades, indicating broad attention, and short fixations, indicating shallow processing. Then, from 4 to 6 s of viewing, viewers shift to making short saccades, indicating narrowly focused attention, and long fixations, indicating deeper processing. Because this ambient‐to‐focal shift occurs across multiple fixations, back‐end processes could possibly influence the front‐end process of attentional selection. While information extraction and attentional selection are considered independent within SPECT, and strong empirical and theoretical evidence supports their separation (Smith, Lamont, & Henderson, 2012; Triesch, Ballard, Hayhoe, & Sullivan, 2003), in active processing of scenes, these processes often operate in conjunction (Williams, Henderson, & Zacks, 2005). For example, when discussing how a viewer’s fixation of an object influences his or her memory for it, we implicitly combine both attentional selection (i.e., choosing which object to send your eyes to) and information extraction (i.e., visual processing during the viewer’s fixation on the object, as implicated by their later memory of it). 
As such, for parsimony we will sometimes refer to both processes together in later discussions.
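The ambient‐to‐focal shift described above is typically quantified by binning fixation durations and outgoing‐saccade amplitudes over viewing time: early bins show long saccades with short fixations (ambient viewing), later bins short saccades with long fixations (focal viewing). A minimal sketch, using hypothetical values and an assumed 2 s bin width:

```python
import numpy as np

def ambient_focal_profile(fix_onsets, fix_durations, saccade_amps, bin_s=2.0):
    """Mean fixation duration and mean outgoing-saccade amplitude per
    time bin. Long saccades + short fixations suggest ambient viewing;
    short saccades + long fixations suggest focal viewing
    (cf. Pannasch et al., 2008). Bin width is an illustrative choice."""
    bins = (np.asarray(fix_onsets) // bin_s).astype(int)
    durations = np.asarray(fix_durations)
    amps = np.asarray(saccade_amps)
    profile = {}
    for b in np.unique(bins):
        mask = bins == b
        profile[int(b)] = (durations[mask].mean(), amps[mask].mean())
    return profile  # {bin: (mean fixation dur in s, mean amplitude in deg)}

# Hypothetical trace: first seconds ambient, later seconds focal
onsets     = [0.1, 0.4, 0.7, 4.2, 4.8, 5.5]     # fixation onsets (s)
durations  = [0.15, 0.18, 0.16, 0.35, 0.40, 0.38]
amplitudes = [8.0, 9.5, 7.5, 2.0, 1.5, 2.5]     # saccade amplitudes (deg)
profile = ambient_focal_profile(onsets, durations, amplitudes)
```

In this toy trace the later bin shows longer fixations and shorter saccades than the first, the signature of the ambient‐to‐focal shift.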

Theoretical foundations for back‐end processes

Back‐end processes support the construction of a coherent current event model in WM, which later becomes a stored event model in episodic LTM (Magliano et al., 2012). A coherent event model contains information about the time and place in which the events unfold (the spatio‐temporal framework), the entities in the event (people, animals, objects), the properties of those entities (e.g., colors, sizes, emotions, goals), the actions of the agents, the unintentional events that occur (e.g., acts of nature), and relational information (spatial, temporal, causal, ownership, kinship, social, etc.) (Magliano, Miller, & Zwaan, 2001; Zwaan, Magliano, & Graesser, 1995; Zwaan & Radvansky, 1998). As shown in Fig. 2, SPECT describes three key back‐end processes involved in constructing the current event model: laying the foundation for a new event model, mapping incoming information to the current event model, and shifting to create a new event model (Gernsbacher, 1990).

Laying the foundation

Laying the foundation is the process of constructing the first nodes in an event model, where a node reflects a basic unit of representation (e.g., proposition, simple grounded simulation). These nodes then become memory structures to which subsequent information is connected or not (Gernsbacher, 1990, 1997). When a new event model is created, the viewer must lay the foundation for it. In the context of a visual narrative, the foundation will likely involve a representation of the spatial‐temporal information that is extracted through gist processing, and any agents and actions recognized in the first fixation of the images. As noted above, the information extraction process can gather some rudimentary information about basic level actions, including the agent and the patient, within a single eye fixation (Glanemann, 2008; Hafri et al., 2013). However, due to the limits of information processing within the time span of a single fixation (e.g., 330 ms), it takes at least two fixations to reach peak accuracy for identifying an action (Hafri et al., 2013; Larson, 2012). Thus, the information required to lay the foundation for the current event model, namely recognizing a basic action, requires integrating information across at least two fixations in WM.

Mapping incoming information

With each subsequent fixation, the viewer builds upon the foundation by mapping incoming information to WM, but only if it is coherent with the event model (Gernsbacher, 1990, 1997). This process involves monitoring continuities in the event indices of time, space, entities, causality, and goals (Gernsbacher, 1997; Zwaan & Radvansky, 1998). Specifically, situational information extracted by front‐end processes serves as LTM retrieval cues, thus activating semantically related information in WM (Myers & O’Brien, 1998). Viewers assess the coherence of the event indices within the current event model and the newly activated information from LTM. Changes along any event index that are coherent with the current event model will lead viewers to incrementally update, or map, that change (Kurby & Zacks, 2012). In this way, the current event model becomes gradually elaborated as more information is extracted on each eye fixation. Mapping is supported by inference generation (Graesser, Singer, & Trabasso, 1994), particularly bridging inferences (Magliano, Zwaan, & Graesser, 1999). Bridging inferences connect two or more story events and are considered necessary for maintaining a coherent mental model (Graesser et al., 1994). Virtually all comprehension models consider bridging inferences important (McNamara & Magliano, 2009) because they are required when comprehenders perceive a gap in the narrative events (e.g., Magliano et al., 2016), or when two narrative events are causally related (e.g., Suh & Trabasso, 1993). For example, in Fig. 1B, the Bridging‐Event image shown in Fig. 1A is missing. Thus, viewers seeing only the Beginning‐State and End‐State images in Fig. 1B would need to generate a bridging inference to coherently map the information from the End‐State image (the boy and dog fell in the pond) onto the foundation of the event model created from the Beginning‐State image (the boy and dog running down the hill to catch a frog).
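The mapping process can be caricatured as a coherence check over the event indices. In the sketch below, all names and the shift criterion are illustrative assumptions rather than part of SPECT’s formal specification: incoming information that changes few indices is mapped incrementally, while discontinuities along multiple indices signal a shift to a new event model.

```python
# Illustrative event indices monitored during mapping (Zwaan & Radvansky, 1998)
INDICES = ("time", "space", "entities", "causality", "goals")

def map_or_shift(current_model, incoming, shift_criterion=2):
    """Compare an incoming fixation's event indices with the current
    event model. Coherent changes are mapped (incremental update);
    discontinuities along >= shift_criterion indices signal a shift.
    The criterion value is an illustrative assumption."""
    discontinuities = [i for i in INDICES
                       if i in incoming and incoming[i] != current_model.get(i)]
    if len(discontinuities) >= shift_criterion:
        return "shift", discontinuities        # event boundary: new model needed
    return "map", dict(current_model, **incoming)  # incremental mapping

model = {"time": "day1", "space": "hill",
         "entities": {"boy", "dog", "frog"},
         "causality": "running", "goals": "catch frog"}

# Bridging-event image: same scene, new causal information -> mapped
action, result = map_or_shift(model, {"causality": "tripped on branch"})
```

A second call with discontinuities in both time and space (e.g., a new day at a new location) would instead return "shift", corresponding to an event boundary.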

Shifting

When mapping is no longer possible, the viewer shifts to create a new event model. This occurs when new incoming information produces a trigger signal, resulting in event segmentation, which parses continuous activity into discrete events (Kurby & Zacks, 2008; Magliano et al., 2012). For example, when watching someone making breakfast, we recognize the discrete actions of taking a slice of bread out of a loaf, putting the slice in a toaster, toasting it, taking it out of the toaster, and putting it on a plate (Newtson, 1973; Newtson, Engquist, & Bois, 1977). Segmentation is critical for understanding and remembering complex events (Magliano et al., 2012; Radvansky & Zacks, 2011; Sargent et al., 2013). Segmentation also occurs when we experience narratives, and triggers can be either perceptual or more conceptual in nature. For example, visual motion is strongly associated with event segmentation (Zacks, Swallow, Vettel, & McAvoy, 2006). Other important triggers are when viewers perceive shifts in situational continuities, such as shifts in time and space, causal discontinuities, the introduction of new characters, or changes in characters’ goal‐plans (Magliano et al., 2012; Zacks et al., 2009; Zwaan & Radvansky, 1998). If such changes are important enough, they indicate an event boundary, also known in older story grammar theories as a boundary between narrative episodes (Baggett, 1979; Gernsbacher, 1985; Thorndyke, 1977). For example, most readers of the visual narrative fragment in Fig. 1 will perceive an event boundary to have occurred on the End‐State image, assumedly because the Boy’s attempt to achieve his goal of catching the Frog has failed. Perceiving an event boundary means the current event has ended, which triggers a shift (Kurby & Zacks, 2012), and leads to storing the current event model in LTM as a global update to the previously stored event models in episodic LTM (Gernsbacher, 1985). 
Once this boundary has been perceived and the stored event model updated, information from the previous event model becomes less accessible (Gernsbacher, 1985; Swallow, Zacks, & Abrams, 2009). Once shifting is complete, the cycle begins again with laying the foundation for a new event model.
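As a rough illustration (not a formal part of SPECT), the trigger logic described above can be sketched as a rule that signals an event boundary when enough event indices become discontinuous. The index values, threshold, and "Boy, Dog, Frog" event models below are purely illustrative placeholders, not stimuli or parameters from any of the cited studies.

```python
EVENT_INDICES = ("time", "space", "entities", "causality", "goals")

def is_event_boundary(current, incoming, threshold=2):
    """Toy segmentation rule: signal a boundary when the number of
    discontinuous event indices meets a threshold.

    `current` and `incoming` map each event index to a value; any
    mismatch counts as a discontinuity. The threshold is illustrative;
    SPECT does not specify a numeric rule.
    """
    changes = sum(1 for i in EVENT_INDICES
                  if current.get(i) != incoming.get(i))
    return changes >= threshold

# Hypothetical event models inspired by the Fig. 1 example.
beginning = {"time": "day", "space": "wooded hill",
             "entities": ("boy", "dog"), "causality": "chasing frog",
             "goals": "catch the frog"}
end_state = {"time": "day", "space": "pond",
             "entities": ("boy", "dog"), "causality": "fell in pond",
             "goals": "goal failed"}

# Space, causality, and goals all change, so a boundary is signaled.
boundary = is_event_boundary(beginning, end_state)
```

A continuous-activity stream would apply this check at each new fixation's worth of extracted information, shifting to a new event model whenever the rule fires and mapping otherwise.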

Executive processes

The back‐end comprehension processes discussed above occur by default without the viewer’s volition. Yet viewers can exert volitional control over their mental processes when they feel the need to do so. This likely happens when the viewer is given a task unrelated to understanding the story while viewing a visual narrative (Hutson et al., 2017; Lahnakoski et al., 2014). This seems relatively uncommon when people read comics or watch movies for pleasure, but it is very common when students are given educational tasks in school settings (Britt, Rouet, & Durik, 2018; McCrudden, Magliano, & Schraw, 2010). Such volitional strategic comprehension processes are more cognitively demanding (Kaakinen, Hyönä, & Keenan, 2003), and they engage frontal and prefrontal brain regions known to be involved in executive processes (Moss et al., 2011). This suggests that volitional control of comprehension processes involves executive processes, such as goal setting (i.e., deciding to carry out a specified task), attentional control (i.e., paying attention to task‐relevant information), and inhibition (i.e., intentionally ignoring irrelevant information), as indicated in Fig. 2. For example, in Hutson et al. (2017, Exp 2B), prior to watching a film clip from Touch of Evil, viewers were told that after watching the clip, they would be asked to draw a map of all landmarks and their relative locations from memory. Presumably, doing this task successfully would involve setting the goal of memorizing the landmarks and their locations, volitionally controlling one’s attention to meet this goal (e.g., by fixating background buildings, street signs, etc.), and inhibiting attending to the protagonists of the narrative (which would conflict with the spatial memorization goal).
We assume that such executive processes are available to viewers of visual narratives, but they only use them when necessary, and only if they have the required WM resources, given their cognitive load, and the processing demands of the stimulus (e.g., rapidly edited film sequences may overload cognitive resources: Andreu‐Sánchez, Martín‐Pascual, Gruart, & Delgado‐García, 2018; Lang, 2000).

Differences between static and dynamic media

There are potential differences in the complexities of processing narratives across media (Loschky et al., 2018; Magliano, Loschky, Clinton, & Larson, 2013). First, a growing literature indicates that fluency in processing visual narratives requires exposure and learning (Cohn & Kutas, 2017; Fussell & Haaland, 1978; Ildirar & Schwan, 2014; Liddell, 1997). This may explain why proficiency in comprehending text is weakly correlated with proficiency in comprehending visual narratives in children (Pezdek, Lehrer, & Simon, 1984), whereas they are robustly correlated in adults (Gernsbacher, Varner, & Faust, 1990). Second, there are non‐trivial differences in the structure of visual narratives across cultures, which in turn produce non‐trivial differences in comprehension (Cohn & Kutas, 2017). Finally, the extent to which attentional selection can support event segmentation, inference generation, and model updating may be affected by whether consumption of the visual narrative is self‐paced or externally controlled (Hutson et al., 2018; Magliano et al., 2013), a difference which SPECT is intended to describe. Attentional selection and subsequent processing will also be shaped by medium‐specific differences between reading comics or text and viewing films (Magliano et al., 2013). The static versus dynamic nature of visual narratives has a strong effect on the front‐end process of attentional selection, as shown in Fig. 4, since viewers show greater attentional synchrony (i.e., looking at the same places at the same times) during a video clip in comparison to individual frames from the same video (Dorr, Martinetz, Gegenfurtner, & Barth, 2010; Smith & Mital, 2013). Viewers may tend to fixate similar locations in static scene perception (Mannan, Ruddock, & Wooding, 1997) but not at the same time, suggesting a larger influence of individual differences in both front‐ and back‐end processes for static images (Hayes & Henderson, 2018; Le Meur, Le Callet, & Barba, 2007).
Consistent with the finding of greater attentional synchrony during film viewing, motion is perhaps the most salient stimulus feature in guiding attention (Carmi & Itti, 2006; Le Meur et al., 2007; Mital et al., 2010). Another important medium‐specific difference between comics and film, which will affect attentional selection, is the typical way each is viewed. When people read comics, their eyes move from panel to panel of a page layout, which stereotypically follows a “Z‐path” of left‐to‐right and top‐to‐bottom, consistent with text (though the specific pattern varies by language, such as Hebrew and Japanese being read from right to left, and top to bottom). This is entirely different from film viewing, which instead shows a viewing pattern oriented toward the center of the screen due, in part, to the temporal presentation of visual information (Dorr et al., 2010; Le Meur et al., 2007; Mital et al., 2010).
Figure 4

The difference in gaze exploration of a scene (represented as a fixation heatmap of multiple viewers’ gaze locations) across static (top row) and dynamic versions (bottom row), and free‐viewing (left column) versus a spot‐the‐location task, which prioritizes background details (right column). Note that the most tightly clustered gaze is in the dynamic free‐viewing condition, which contains motion. Note also, though, that this gaze clustering due to motion is somewhat reduced by giving viewers an explicit task (i.e., spot‐the‐location). (Reproduced with permission from Smith and Mital [2013].)

Perhaps just as importantly, in reading both text and comics, comprehension differences are evident in the duration of fixations and the frequency of regressive saccades (Foulsham, Wybrow, & Cohn, 2016; Hutson et al., 2018; Laubrock, Hohenstein, & Kummerer, 2018; Rayner, 1998). Conversely, the predetermined pace of film does not provide viewers much time to look around and refixate things, when they have difficulty understanding what they saw (although this may occur more while watching digital video, where viewers can pause, stop, or rewind the video). Consider Fig. 5, which shows two different versions of the same scene from the graphic novel Watchmen (Moore & Gibbons, 1987) and the film adaptation of it (Snyder, 2009). The events depicted in both versions are identical. One would expect the general products of information extraction to be similar when processing either version. However, the graphic narrative affords more endogenous control of attentional selection because it is a static representation of the events.
In contrast, the dynamic presentation of the film version has cinematic features that provide stronger exogenous influences on attentional selection such as dynamic framing through camera movements (e.g., the single zoom shot pulling back from the Comedian’s badge in the film version versus the three frames depicting the same change in viewpoint in the comic; Fig. 5, middle row), lighting and focal depth changes, and choreography of actor motion within the frame (Hutson et al., 2017; Loschky, Larson, Magliano, & Smith, 2015; Smith, 2012a).
Figure 5

Sample panels/stills from the original Watchmen (Moore & Gibbons, 1987) graphic novel and movie (Snyder, 2009) depicting the same actions. Note the images are not presented in their original sequence.

The scene depicted in Fig. 5 is predominantly action based, which means that the movie stills (essentially a storyboard) roughly convey the same content as the original comic version (as was intended for this particular film). The comic version may place more demands on back‐end processes to guide attention and actively construct the event model than when viewing the movie version, for which the events are self‐evident in the actions depicted. But in a scene like this it would seem reasonable to assume the resulting narrative event models would be medium‐agnostic. However, for scenes involving richer characterization and dialogue, the formal decisions comic artists and film directors make when composing their scenes may result in very different event models. For example, Fig. 6 depicts a later scene from Watchmen in which Rorschach meets with his old partner, Nite Owl. The comic version uses four panels each containing multiple visual centers of interest.3 The reader likely must perform multiple fixations within each panel to extract information about the characters, their actions, and Nite Owl’s emotional response to Rorschach (Laubrock et al., 2018). By comparison, the film version uses 12 shots varying widely in shot scale to convey the same information. By conveying each action serially (i.e., one per shot) this likely reduces the need for multiple fixations per shot, which raises interesting questions about whether back‐end WM processes would differ depending on the number of front‐end attentional shifts and fixations across media. The effects of medium differences in front‐end attentional selection and information extraction on back‐end processes have been largely unexplored (Magliano, Clinton, O'Brien, & Rapp, 2018).
However, the SPECT framework specifies the importance of exploring these issues.
Figure 6

Rorschach surprises Nite Owl in his kitchen and reveals the Comedian has died. Top row: panels taken from the original Watchmen (Moore & Gibbons, 1987) graphic novel. Bottom three rows: shots depicting the same action from the movie version (Snyder, 2009).


Research questions raised by SPECT and their investigation

We have been using this framework to guide our program of research on the processing of visual narratives (both static and dynamic) for the past 8 years. Most of these studies have directly involved narratives, but a few have involved non‐narrative content that mirrors important features of visual narratives. In those cases, we have adopted that approach because it afforded the experimental control needed to ask and answer the central questions raised by SPECT regarding processes involved in visual narratives. In addition, SPECT suggests that the distinction between static narratives and film may be important for attentional selection because the two media differ in terms of their degree of visual salience (e.g., due to film having motion, but not comics). Nevertheless, to date, we have not made within‐study comparisons between film and comics using the same content. With those caveats, in this section, we will illustrate SPECT’s utility as a theoretical framework to guide research on the coordination of information extraction, attentional selection, basic event cognition, and event model construction in processing visual narratives. To date, we have investigated two key lines of research, one regarding information extraction, and the other regarding attentional selection. Regarding information extraction, we have previously carried out studies on the discourse comprehension of visual narratives, which show how viewers monitor situation indices in their current event model, which then affects their event segmentation. However, those studies raise the question, what information is extracted during single fixations in the front‐end? SPECT provides a framework for asking questions about how information extraction affects event model construction.
Thus, we have investigated how extracting information on the viewer’s first fixation on a new scene allows the viewer to lay the foundation of a new event model (Larson et al., 2012), and how that newly laid foundation primes extraction of further event indices on further fixations (Larson & Lee, 2015). We have also investigated how laying the foundation of the event model can, in turn, allow the viewer to predict what spatiotemporal context he or she will see next, which influences further information extraction on subsequent eye fixations (Smith & Loschky, in press). This can lead to further questions, such as how subsequent event indices are either mapped onto an existing event model or instead signal an event model shift. Regarding attentional selection, we have investigated how it is influenced by event model construction while viewing visual narratives, including both static picture stories and film (Hutson et al., 2017, 2018; Loschky et al., 2015). Specifically, we have studied how mapping incoming information to the current event model guides attentional selection in visual narratives with static images (picture stories) (Hutson et al., 2018). More specifically, how does the mapping process in the event model, and its subprocess of bridging inference generation, affect attentional selection, as measured by what viewers fixate on in a given picture in a visual narrative? We have also studied how the current and stored event models guide attentional selection in dynamic visual narratives (film clips) (Hutson et al., 2017; Loschky et al., 2015). More specifically, how does the mapping process in the event model, and its subprocess of predictive inference generation (forward mapping), affect attentional selection, as measured by what viewers look at from moment to moment while watching a narrative film? Below, we discuss these studies, what they have shown that speaks to the SPECT framework, and a non‐exhaustive sample of other relevant work that speaks to the same issues.
We describe these studies in sections below, first on The Relationship between Information Extraction and Event Model Construction, and second on The Relationship between Event Model Construction and Attentional Selection.

The relationship between information extraction and event model construction

According to SPECT, the first stage of creating an event model is laying its foundation. This iteratively operates as one processes each picture (or frame) in a visual narrative, in a manner akin to the processes that support reading sentences in the context of narrative text (Magliano et al., 2013). For example, when viewing the first image of Fig. 1A (labeled “Beginning State”), the reader needs to quickly perceive who is doing what, when, and where. SPECT raises critical questions about how the process of information extraction on each eye fixation enables the viewer to lay the foundation over the course of the first few eye fixations. Is there a temporal order in which the viewer recognizes that the scene takes place on a wooded hill, that there is a boy and a dog, and that they are both running? Perhaps the viewer recognizes the boy, the dog, and the wooded hill on the first fixation and stores that information as event indices in the foundation of the new event model.4 And perhaps the viewer recognizes that both boy and dog are running down the hill on the second fixation and maps that onto the foundation of the event model. If so, could this temporal order of information extraction imply that recognizing the spatiotemporal context (“wooded hill”) on the first fixation facilitates recognizing the event (“running down the hill”) on the second fixation? Alternatively, since comprehending a narrative requires recognizing the main character and his or her actions, perhaps the foundation of the event model requires recognizing this event information within the first eye fixation (Dobel et al., 2007; Hafri et al., 2013). Furthermore, attention is strongly biased to people in scene images within a single fixation (Fletcher‐Watson et al., 2008; Humphrey & Underwood, 2010; Zwickel & Võ, 2010). These two points strongly suggest that people and their actions form the basis of an event model.
Larson (2012) explored the above issues within a non‐narrative context, in order to gain an understanding of the processes involved when looking at the very first image in a narrative. Specifically, Larson (2012) examined the rapid categorization of locations and actions in static photographic scenes both within single eye fixations, and across multiple fixations. Larson found that viewers were able to rapidly categorize locations within a single fixation, but that actions required a second fixation. This suggests that laying the foundation consists of first recognizing the spatiotemporal framework, and then recognizing and mapping the actions that entities carry out within it. In a further study, Larson and Lee (2015) found that recognizing an action was facilitated by seeing it within the context of a recognizable scene. Importantly, however, this facilitation was only found after viewers had processed the image long enough to recognize the scene context relatively accurately (about 100 ms). Nevertheless, we have yet to explore whether this hierarchy of recognizing the spatiotemporal framework first and then the action occurs across pictures in a visual narrative. SPECT raises further questions about the relationship between information extraction, laying the foundation, and mapping to the current event model, in the context of visual narrative sequences. For example, in Fig. 1A, according to SPECT, the boy’s spatiotemporal context (i.e., “wooded hill”) will be extracted while viewing the first picture, and stored as the foundation of the event model in working memory. A key question raised by SPECT is whether that foundation should facilitate information extraction of the spatiotemporal context on the subsequently viewed second and third pictures in Fig. 1A. Similar to anticipatory processes in language processing, we would expect to find priming of the upcoming spatiotemporal contexts, whether they remain the same, as in Fig. 1A (all showing a “wooded hill”), or they are different but spatiotemporally related (e.g., a transition from the wooded area into a field), but not if they clash with expectations (e.g., a transition from the wooded area into a bustling city street).
Smith and Loschky (in press) have investigated the above questions using simple first‐person visual narratives of traveling from one location to another (e.g., going from an office to a parking lot) akin to the short narratives often used by discourse psychologists. As shown in Fig. 7, that study presented viewers with short narrative sequences of 0–9 scene priming images, each briefly flashed for enough time to both recognize and store them in working memory (about 300 ms), followed by a single target image that was briefly flashed and immediately masked to limit processing time (for 24 ms), after which the viewer was asked to categorize the target scene. The key manipulation was to present the image sequences in coherent versus randomized order. The results showed that viewers were much more accurate at categorizing the scenes shown in coherent sequences than in randomized sequences, showing clear priming of the current spatiotemporal context by the preceding context. Furthermore, priming was greater when the prime was from the same category as the target (e.g., the second of two hallways in a row) than from a different but spatiotemporally related category to the target (e.g., a hallway seen immediately after one or more office images). This is consistent with the above hypothetical scenario for processing the image sequence in Fig. 1A, in which recognition of the second and third “wooded hill” images would be primed by recognizing the first such image. However, Smith and Loschky’s (in press) results also showed that expectations about upcoming different spatiotemporal contexts also produced priming, but to a lesser degree.
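The coherent versus randomized sequence manipulation can be sketched in a few lines of code. The image names, route, and trial structure below are simplified placeholders, not the actual stimuli or trial-generation procedure from Smith and Loschky (in press); the sketch only illustrates the logic of drawing primes from route order versus elsewhere in the route.

```python
import random

# Hypothetical ordered route of scene photos (office -> parking lot).
route = ["office_1", "office_2", "hallway_1", "stairwell", "parking_lot"]

def make_trial(route_images, target_index, n_primes, coherent=True, rng=random):
    """Build one priming trial: prime images followed by a fixed target.

    Coherent trials use the images immediately preceding the target in
    route order; randomized trials draw primes from elsewhere in the
    route, so the target no longer follows from its primes.
    """
    target = route_images[target_index]
    if coherent:
        primes = route_images[target_index - n_primes:target_index]
    else:
        pool = (route_images[:target_index - n_primes]
                + route_images[target_index + 1:])
        primes = rng.sample(pool, n_primes)
    return primes, target

# Coherent trial: two office primes, then the hallway target (cf. Fig. 7A).
primes, target = make_trial(route, target_index=2, n_primes=2)
```

In the coherent trial above the primes are the two office images and the target is the hallway, mirroring the Fig. 7 example; the randomized variant instead yields primes such as the parking lot and stairwell before the same hallway target.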
Figure 7

Experimental conditions used by Smith and Loschky (in press) to investigate priming of the current spatiotemporal context by the preceding context. Viewers saw either spatiotemporally coherent or randomized image sequences ending with a briefly flashed and visually masked target image. Participants then identified the scene category of the target from a list of all possible scene categories in that sequence. The coherent spatiotemporal sequence shows two office images followed by a hallway image, taken from a route from office to parking lot. The randomized sequence shows a parking lot, a stairwell, and then the target hallway image. Participants found the target images more predictable and were more accurate at identifying them, when presented in coherent sequences. (A) shows the beginnings of two sequences, including 2 primes, the target, and the response screen, in the (i) coherent, and (ii) randomized conditions. (B) shows a more complete representation of each sequence of 10 images, including those images that appeared after the participant's response.

Finally, Smith and Loschky (in press) investigated whether the spatiotemporal priming shown in coherent image sequences was simply due to response biases (i.e., guessing the scene category at the time of being tested), or was actually due to facilitation of perceptual sensitivity. To tease apart those possibilities, they showed participants the exact same coherent and randomized image sequences, but participants’ task was changed from (1) identifying the scene category of the target to (2) visually discriminating whether the target was a real scene image or a noise image (with a 50/50 mix of both types of target images).
Note that a viewer’s ability to predict the category of the next scene should not bias him or her to respond either “real scene” or “noise image.” Importantly, the results showed that participants were more perceptually sensitive to targets in the coherent than the randomized scene sequences, while their response bias was neutral (i.e., they responded “real scene” and “noise image” equally often) and did not differ between the coherent and randomized sequences. Thus, a viewer’s expectations about the upcoming spatiotemporal context can facilitate their perception of that context.
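The logic of separating perceptual sensitivity from response bias follows standard signal detection theory: sensitivity (d′) and criterion (c) are computed from hit and false-alarm rates. The rates below are hypothetical, chosen only to illustrate the reported pattern (higher sensitivity in coherent sequences, with near-neutral bias in both conditions); they are not Smith and Loschky's data.

```python
from statistics import NormalDist

def dprime_and_criterion(hit_rate, false_alarm_rate):
    """Standard signal-detection measures.

    d' = z(H) - z(FA): sensitivity to real scenes vs. noise images.
    c  = -(z(H) + z(FA)) / 2: response bias; c = 0 is neutral.
    """
    z = NormalDist().inv_cdf  # convert proportions to z-scores
    return (z(hit_rate) - z(false_alarm_rate),
            -(z(hit_rate) + z(false_alarm_rate)) / 2)

# Hypothetical hit/false-alarm rates for the scene-vs-noise task.
d_coherent, c_coherent = dprime_and_criterion(0.80, 0.20)
d_random, c_random = dprime_and_criterion(0.69, 0.31)
# Coherent sequences yield higher d', while both conditions show a
# near-zero criterion, i.e., no tendency to favor either response.
```

Because hit and false-alarm rates in each hypothetical condition are symmetric around 0.5, the criterion comes out neutral while d′ differs, which is exactly the dissociation the discrimination task was designed to reveal.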

Further research on the relationship between information extraction, attentional selection, and event model construction

A limitation of the above studies is that they either used only single images, or minimal visual narratives, rather than the more naturalistic ones found in comics, picture stories, and film. As previously discussed, the perceptual processing demands of static versus dynamic visual narratives differ greatly and these may alter the degree to which back‐end processing influences front‐end attentional selection and information extraction. For example, Smith and colleagues (Smith & Henderson, 2008; Smith & Martin‐Portugues Santacreu, 2017) have demonstrated that continuity of a basic level action percept across a cut can obscure viewer awareness of the cut (i.e., edit blindness), even though the cut involves a global change in viewpoint on the spatiotemporal context. Object features and even actor identity can also change across cuts without viewers noticing (Levin & Simons, 1997). Diminished awareness of the shot change only occurs if sufficient action motion is present across the cut (hence the film technique name Match‐On‐Action: Smith & Martin‐Portugues Santacreu, 2017), suggesting that viewers may often lack the capacity (e.g., attentional resources, working memory, or executive resources) to encode detailed surface information that does not change key event indices (Sampanes, Tseng, & Bridgeman, 2008) or that such information is obscured by the image motion (Smith, 2012a). Similar failures to notice differences between two different versions of the same static image are well known from “spot the difference” tasks. Presumably, such changes are also missed in comics across pairs of adjacent busy panels sharing much of the same background. Within the Attentional Theory of Cinematic Continuity (AToCC) (Smith, 2012a, 2012b), these effects are explained as postdiction, the backwards inference of the details of the event after it has begun rather than predictive inference (Smith & Martin‐Portugues Santacreu, 2017).
Whether this absence of predictive inference is specific to the fast‐paced sequences used in these studies is not currently known. In fact, such postdictive inferences are very similar to the bridging inferences that have been shown to be commonly drawn in picture story studies (Hutson et al., 2018; Magliano et al., 2016). Predictive inferences do obviously occur in film, with many having been intentionally targeted by the filmmakers through filmmaking techniques (Magliano, Dijkstra, & Zwaan, 1996). Thus, it is possible that postdictive bridging inferences are more commonly generated during film viewing than predictive inferences, which appears to also be the case with narrative text (Graesser et al., 1994; Magliano et al., 1996). However, not all cuts are missed: their rate of detection is proportional to the number of spatiotemporal and semantic features changed across the cut (Smith & Henderson, 2008; Smith & Martin‐Portugues Santacreu, 2017), and object changes will be noticed if they change meaning, even if the changes are relatively small (Sampanes et al., 2008). In support of this, recent eye movement evidence indicates that low‐level visual salience does not entirely account for gaze biases toward continued scene content across cuts; instead, memory‐guided attention facilitates the deployment of attention, but only if the viewer is actively tracking scene content (Valuch, König, & Ansorge, 2017). Whether such active tracking occurs automatically during visual narrative viewing is currently unknown. However, it is worth noting that drawn American visual narratives often circumvent this processing by first introducing an environment early on in a sequence, and then leaving out the background entirely in later panels, though this intuition should be tested with corpus analyses.
Within the SPECT framework, we would suggest that important event indices are tracked by viewers across shots and cuts, interacting with visual salience to guide attention and gaze, and allowing changes to important semantic features of a scene to be detected (e.g., entities or actions that could change the goals of a protagonist in the visual narrative) but allowing unimportant features to pass unnoticed. Indeed, what constitutes an important event index has been the subject of much study in the event perception literature. Studies analyzing the likelihood of discontinuities in particular feature dimensions being perceived as event boundaries during film viewing have revealed that discontinuities in characters’ goals trump discontinuities in space and time (Magliano & Zacks, 2011). Exactly what information is used to construct and maintain a representation of action, or to detect changes to it, is currently unclear and will require further study in terms of the stages of processing outlined by SPECT.

Conclusions regarding information extraction and event model construction

Thus far, the studies by Larson (Larson, 2012; Larson & Lee, 2015) and Smith and Loschky (in press) have shown how rapid scene categorization processes, typically investigated by scene perception researchers, interact with higher‐level event model processes, such as laying the foundation and mapping, typically studied by discourse comprehension researchers. These studies have shown evidence for a temporal order of processing event indices in which the spatiotemporal context is processed earlier than actions, with the former priming the latter (Larson, 2012; Larson & Lee, 2015). They have also shown that such spatiotemporal contexts can prime each other when encountered in sequential visual narratives (Smith & Loschky, in press). Further research is needed to investigate the temporal order of information extraction of the full range of key event indices across multiple fixations while viewing visual narratives. Other studies of change blindness and edit blindness while people watch films, however, raise questions about how much information viewers encode while viewing visual narratives (Levin & Simons, 1997; Smith & Henderson, 2008; Smith & Martin‐Portugues Santacreu, 2017). A testable hypothesis consistent with SPECT is that viewers will detect those changes that change important event indices in the current event model or, to a lesser degree, recently stored event models (Sampanes et al., 2008).

The relationship between event model construction and attentional selection

SPECT assumes that not only do front‐end information extraction and attentional selection processes affect back‐end event model building, but also that back‐end event model building processes affect front‐end processes, such as attentional selection. We have conducted a series of studies that have been motivated by this general assumption and have explored whether and how event model building affects attentional selection. However, we have found evidence suggesting that the nature of this relation may vary as a function of whether narratives are static (comics, picture stories) or dynamic (TV shows, videos, and films). Specifically, it seems that dynamic visual narratives, such as films, exert quite a bit of exogenous control over attentional selection, as measured by eye movements, and thus they may not afford much influence of the event model. This may be due to the fact that dynamic visual narratives (by definition) include motion, which is the single strongest stimulus feature for predicting eye movements and guiding attentional selection (Carmi & Itti, 2006; Mital et al., 2010). Conversely, because static narratives lack motion, and reading is self‐paced, it seems that they may afford more endogenous influences on attentional selection via the back‐end event model. First consider static sequential picture stories. Magliano et al. (2016) had viewers read six wordless “Boy, Dog, Frog” stories. In each story, as illustrated in Fig. 1A, the authors identified three‐image sequences that showed a beginning‐state (e.g., Boy running down a hill), a bridging‐event (e.g., Boy tripping over tree branch), and an end‐state (e.g., Boy face first in the pond). As shown in Fig. 1A versus 1B, Magliano et al. (2016) manipulated whether the bridging‐event image was present or not.
When the bridging‐event image was absent, viewers would need to generate a bridging inference in order to map event indices from the end‐state picture onto their event model based on the beginning‐state image. Magliano et al. (2016) found direct evidence of this in a pilot study in which they asked viewers to read the wordless picture stories on a computer screen, one image at a time, and think aloud after each end‐state image. As predicted if an inferred bridging event was more highly activated in WM than an actually viewed bridging event, the authors found that participants were more likely to mention the bridging event in the absent condition than in the present condition. In a follow‐up study, the authors dropped the think‐aloud task and simply had viewers read the wordless picture stories at their own pace while their viewing times were recorded. Consistent with the hypothesis that viewers were generating bridging inferences, viewing times were longer when the bridging‐event images were absent than when they were present. Hutson et al. (2018) carried out a follow‐up study that investigated more precisely why viewing times were longer in the bridging‐event absent condition. They measured viewers' eye movements and asked whether the viewing time differences were due to differences in mean fixation durations or in the mean number of fixations. They found no differences in mean fixation durations, but approximately 20% more fixations in the bridging‐event absent condition than in the bridging‐event present condition. This suggested that, rather than requiring further internal processing (during fixations), generating the bridging event may have required gathering additional information (in extra fixations). Hutson et al. (2018) therefore empirically identified regions of the pictures that were informative for generating the bridging inference when the bridging‐event picture was absent.
Consistent with the hypothesis that viewers would preferentially fixate image regions that were more informative for generating the bridging inference, they found that the inferential informativeness of image regions was more strongly correlated with the likelihood of eye fixations falling within them in the bridging‐event absent condition. These data demonstrate that processes supporting event model construction can influence attentional selection in scenes. Specifically, when visual narrative readers detect that they need to generate an inference to support the mapping process, their attentional selection system is engaged to support constructing that inference. Presumably, each fixation supporting a bridging inference engages information extraction, and that process continues until either (a) sufficient knowledge in semantic LTM is activated to support generating the inference, or (b) the viewer decides that the information is insufficient. The coordination of information extraction and attentional selection to support bridging inference generation warrants further investigation. The story is quite different in film, likely because, as noted above, stronger exogenous features attract attention. SPECT assumes that the event model will have less of an impact on attentional selection under such conditions. We have conducted a series of studies showing that the nature of the event model has a real but relatively small impact on attentional selection. Consider the film clip narrative sequence from James Bond Moonraker used in Loschky et al. (2015), illustrated in Fig. 8. This clip was chosen because Magliano et al. (1996) found that the use of cross‐cutting in shots 3–6 (alternating shots between two locations, in this case, a man in free fall and a circus tent) engendered a similar predictive inference, namely "the man will fall on the circus tent," across most viewers. Loschky et al.
(2015) varied whether participants saw the prior 2 minutes of movie context leading up to this scene, and found that participants in the "No‐context" condition were less likely to generate the predictive inference than those in the prior‐exposure ("Context") condition. Thus, this manipulation changed viewers' event models. However, when we measured viewers' eye movements as they watched the film clips, their gaze behavior indicated a high level of attentional synchrony both within and across the Context and No‐context conditions. Only in a shot with essentially no motion (Fig. 8, Shot 4), in which viewers were free to explore the shot of the circus tent, did we find gaze differences across the two context conditions. Thus, the nature of the event model appeared to have only a small effect on attentional selection, at least in the context of this film clip. We dubbed this phenomenon the tyranny of film: despite large differences in viewers' understanding, there were only small differences in attentional selection, presumably due to the power of the film stimulus in guiding their attention.
Figure 8

Drawings of six frames from six sequential shots from James Bond Moonraker (Broccoli & Gilbert, 1979). (Reproduced with permission of Loschky et al., 2015.)
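Attentional synchrony of the kind reported in these gaze studies is commonly quantified as the per‐frame dispersion of gaze points across viewers (cf. Mital et al., 2010): the more tightly viewers' gaze clusters, the higher the synchrony. A minimal sketch of one such dispersion measure, using entirely fabricated gaze coordinates (the array shapes, viewer counts, and noise levels are hypothetical, not data from the studies described here):

```python
import numpy as np

def gaze_dispersion(gaze_xy):
    """Per-frame gaze dispersion across viewers: mean Euclidean distance
    of each viewer's gaze point from that frame's gaze centroid.
    Lower dispersion indicates higher attentional synchrony.

    gaze_xy: array of shape (n_viewers, n_frames, 2), in pixels.
    """
    centroid = gaze_xy.mean(axis=0)                     # (n_frames, 2)
    dists = np.linalg.norm(gaze_xy - centroid, axis=2)  # (n_viewers, n_frames)
    return dists.mean(axis=0)                           # (n_frames,)

# Toy illustration with fabricated gaze data for two viewing conditions:
# "synchronous" viewers cluster tightly; "exploring" viewers scatter widely.
rng = np.random.default_rng(0)
screen_center = np.array([640.0, 360.0])
tight = screen_center + rng.normal(0, 20, size=(12, 100, 2))
loose = screen_center + rng.normal(0, 150, size=(12, 100, 2))

assert gaze_dispersion(tight).mean() < gaze_dispersion(loose).mean()
```

Comparing such dispersion curves between context conditions, frame by frame, is one way the "tyranny of film" claim (high synchrony regardless of event model differences) can be tested quantitatively.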

Hutson et al. (2017) further investigated whether the tyranny of film over attentional selection, found in a highly edited film clip, would also operate in a film clip with no editing. Given that editing practices are designed to influence attentional selection (Smith, 2012a), perhaps a lack of editing would minimize the tyranny of film. Hutson et al. (2017) explored this possibility using the opening scene from Touch of Evil (Welles & Zugsmith, 1958), which consists of a single continuous long shot (i.e., no cuts) showing two couples navigating the streets of a Mexico/US border town. As shown in Fig. 9, the opening segment shows a man setting a time bomb and putting it in the trunk of a car. Soon after, the couple who owns the car unwittingly gets into the car and drives away. The couple in the car then passes a walking couple on the street. Hutson et al. reasoned that, since the bomb has tremendous causal power in the event models of viewers who know about it, viewers with no knowledge of the bomb should be less likely to fixate the car. Thus, in Experiment 1, Hutson et al. manipulated whether participants saw the bomb placed in the car trunk (Context condition) or not (No‐context condition). Similarly to Loschky et al. (2015), this context manipulation strongly affected participants' predictions of what would happen next at the end of the clip (e.g., either "the car will explode" or "the two couples will have dinner together"). This showed that the heavy‐handed context manipulation indeed dramatically changed the nature of viewers' event models for the movie clip. Surprisingly, however, Hutson et al. (2017) found equal proportions of fixations on the car in both the Context (bomb‐present) and No‐context (bomb‐absent) conditions.
Thus, the tyranny of film still operated even without film editing. Apparently, the structure of the long shot was such that the movement of the car exerted exogenous control over attentional selection.
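The car‐fixation analysis just described reduces to computing the proportion of fixations falling inside an area of interest (AOI) per condition. A minimal sketch, in which the AOI coordinates and fixation data are invented purely for illustration:

```python
import numpy as np

def aoi_fixation_proportion(fixations, aoi):
    """Proportion of fixations landing inside a rectangular area of interest.

    fixations: (n, 2) array of fixation x,y coordinates in pixels.
    aoi: (x0, y0, x1, y1) rectangle, with x0 <= x1 and y0 <= y1.
    """
    x0, y0, x1, y1 = aoi
    inside = (
        (fixations[:, 0] >= x0) & (fixations[:, 0] <= x1)
        & (fixations[:, 1] >= y0) & (fixations[:, 1] <= y1)
    )
    return inside.mean()

# Toy data: four fixations from one hypothetical viewer; the "car" AOI
# rectangle is made up for this example.
car_aoi = (400, 200, 700, 400)
fixations = np.array([[450, 250], [650, 300], [100, 100], [500, 350]])
assert aoi_fixation_proportion(fixations, car_aoi) == 0.75
```

In practice the AOI would be updated frame by frame as the car moves, and the per‐viewer proportions would then be compared statistically across the Context and No‐context conditions.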
Figure 9

Nine frames from the opening long shot of Touch of Evil (Welles & Zugsmith, 1958) and the experimental conditions used in Hutson et al. (2017). The blue dashed outline indicates the video starting point in the Context Condition (Experiments 1 and 2); the orange outline shows the starting point for the No‐context condition (Experiment 1); the green outline shows the starting point for the No‐context condition (Experiment 2). (Published with permission of Hutson et al. [2017]).

In Experiment 2 of Hutson et al. (2017), the No‐context condition began the clip when only the walking couple was on screen; thus, viewers would not consider the couple in the car as protagonists. When the walking couple passed the temporarily parked car, this was the first time viewers in the No‐context condition saw it, and they were much less likely to fixate it than those in the Context condition. Presumably, this was because the No‐context viewers perceived the car as background, whereas the Context condition viewers knew about the bomb and also treated the couple in the car as protagonists/agents. Hutson et al. called this the agent effect. However, once the car began to move again, viewers in both context conditions fixated the car equally, regardless of knowledge of the bomb. Thus, as in Loschky et al. (2015), the effect of the event model on attentional selection was real, but small. In a further control experiment, Hutson et al. (2017) found they could reduce the tyranny of film by using a task that directed viewers' volitional attention away from the narrative events in the shot, namely asking viewers to prepare to draw a map from memory of the spatial environment in the film clip. As noted earlier, SPECT assumes that this requires effortful volitional executive processes (see Fig. 2). Additionally, Hutson et al. (2017) compared the levels of attentional synchrony found in the highly edited shot sequence of James Bond Moonraker used in Loschky et al.
(2015) versus the continuous long shot from Touch of Evil. As predicted, attentional synchrony was lower in the continuous long shot than in the highly edited sequence. This analysis suggests that features of dynamic visual narratives may differ in the extent to which they modulate the relationship between the event model and attentional selection, and more research is warranted to address this issue.

Conclusions regarding the relationship between event model construction and attentional selection

The studies described in this section have shown effects of event model building processes on attentional selection in visual narratives, including both static picture stories (Hutson et al., 2018) and movie clips (Hutson et al., 2017; Loschky et al., 2015). However, these effects appeared stronger in the static picture stories than in the film clips. This has led us to modify SPECT's assumption that back‐end and front‐end processes have bidirectional influences. Specifically, we have added the further assumption that the influence of event model building processes on attentional selection is moderated by whether a visual narrative is static or dynamic. Nevertheless, this conjecture needs more direct tests. More generally, the implications of differences between media in affording control over attentional selection need to be carefully explored. Relatedly, there are likely trade‐offs involved in the tyranny of film. Filmmakers can use the properties of film to direct viewers' attention to specific portions of the screen, which should affect information extraction, in turn affecting passive knowledge activation and, ultimately, back‐end event model building processes (e.g., Kintsch, 1988). However, the price paid for this tyranny and the lack of self‐paced control is the absence of regressive eye movements that might support comprehension repair. SPECT provides a motivation for research that addresses these important issues.

Discussion

The intent of SPECT is to explain how visual narratives are processed and understood from early perceptual processes to relatively late processes that support event model building. In doing so, SPECT integrates previously separate research domains for visual narrative perception and comprehension, which has rarely occurred in research on text comprehension (for exceptions, see the computational models of reading, e.g., SWIFT: Engbert, Nuthmann, Richter, & Kliegl, 2005; EZ‐Reader: Reichle, Rayner, & Pollatsek, 1999). Our intent in this article was to inspire future research on the processing and comprehension of visual narratives that explores the interplay between multiple levels of front‐ and back‐end cognitive processing. We have made a case that the framework has been invaluable in guiding our program of research on visual narrative processing, and we contend that new research questions are afforded by it.

Future research questions

As noted above, we have been using this framework to guide a program of research. However, that program is by no means exhaustive in addressing the important research questions that can be derived from SPECT. In this section, we discuss pressing questions that we believe should be addressed, in order to further illustrate the utility of SPECT as a theoretical framework.

An important unanswered question raised by SPECT regarding front‐end information extraction is the temporal order of information extraction for event indices (e.g., spatiotemporal framework, agents, objects, actions, goals of agents) across multiple eye movements. As noted earlier, Larson (Larson, 2012; Larson & Lee, 2015) has begun to answer this question by showing that the spatiotemporal event index is extracted prior to the action underlying an event. However, further, more detailed investigations are needed to determine when entities, their inferred goals, and inferred causal relationships are extracted across multiple eye fixations. It seems likely that not only the spatiotemporal context and actions, but also the entities of agents and patients, are among the first event indices to be extracted in a new event model. Furthermore, given that identifying goals and causal relationships requires more inferential processing, these event indices are likely extracted and generated in the event model later. However, tests of these hypotheses are needed. Doing so would elucidate the role of front‐end information extraction during single fixations in event model building across multiple fixations in WM.

As noted above, a key research question suggested by SPECT is whether static versus dynamic visual narratives differ in the degree to which the event model influences attentional selection.
Answering this question will require at least two things: (a) visual narratives in which manipulations of viewers' event models influence attentional selection, and (b) versions of those narratives that differ primarily in the static versus dynamic distinction. Meeting both criteria is non‐trivial. However, answering this question will more broadly help clarify the conditions under which the viewer's event model influences attentional selection in visual narratives.

A further key unanswered research question suggested by SPECT is whether and how the back‐end process of shifting to build a new event model affects the front‐end process of attentional selection (but see Huff, Papenmeier, & Zacks, 2012). Research has shown better memory for event boundaries than for event middles (Huff, Meitz, & Papenmeier, 2014; Swallow et al., 2009), suggesting that attentional selection is affected by shifting. Interestingly, it is possible that attention is heightened at event boundaries (Huff et al., 2014; Swallow et al., 2009) or, conversely, that it is diminished (Huff et al., 2012). This apparent contradiction may be resolved by other results showing that gaze patterns change just before and after event boundaries (Eisenberg & Zacks, 2016; Smith, Whitwell, & Lee, 2006), consistent with the ambient‐to‐focal eye movement shift; namely, attention may expand and contract over time near event boundaries (Ringer, 2016; Ringer, 2018). Further research is warranted to clarify these relationships.

We invite the reader to identify questions that have not yet been pursued. Such efforts are essential for revisions to SPECT that would allow it to become a formalized and implementable model. Moreover, in the pursuit of such research, we acknowledge that alternative and perhaps contradictory frameworks could emerge. We see that possibility as healthy and indicative of the study of visual narratives being a vibrant and growing area of research.

Future computational and neurophysiological tests of SPECT

SPECT is not yet a formally complete cognitive model of visual narrative processing, primarily because many assumptions of the model remain to be empirically validated, as laid out above. However, while it has not yet been computationally implemented, our goal is to refine the model such that it eventually can be. Thus, future studies should develop and test computational approximations of key elements of SPECT. For front‐end mechanisms, there are already deep neural networks that can extract from video the event indices needed for laying the foundation of an event model (e.g., locations, people, animals, objects, and actions) (Du, El‐Khamy, Lee, & Davis, 2017; Hoai, Lan, & De la Torre, 2011; LeCun, Bengio, & Hinton, 2015; Manohar, Sharath Kumar, Kumar, & Rani, 2019; Zhou, Lapedriza, Khosla, Oliva, & Torralba, 2018). There are also neural networks for attentional selection (Adeli & Zelinsky, 2018; Huang, Shen, Boix, & Zhao, 2015). For the back‐end, formal ontologies use techniques such as description logics to represent events (Baader & Nutt, 2003; Neumann & Möller, 2008). Inferential processes based on event representations can be modeled in terms of Bayesian weights for likely inferences (Bateman & Wildfeuer, 2014; Grosz & Gordon, 1999). A key challenge is to link the front‐end event index outputs in ways that are usable by the back‐end ontologies.

The neurophysiological foundations of SPECT are based on numerous related, but non‐visual‐narrative‐based, studies. The distinction between the front‐end processes of information extraction and attentional selection is strongly supported by their implementation within different functional brain networks and by their differentiable time courses.
Front‐end information extraction of foundation event indices (i.e., locations, people, animals, actions, and objects) is extremely rapid, with perceptual decisions occurring within 150–225 ms post‐stimulus, as shown by EEG and MEG studies (Cichy, Khosla, Pantazis, Torralba, & Oliva, 2016; Greene & Hansen, 2018; Ramkumar, Hansen, Pannasch, & Loschky, 2016; VanRullen & Thorpe, 2001). Such event indices can be decoded from fMRI brain activity within functionally defined areas for locations (Walther, Caddigan, Fei‐Fei, & Beck, 2009), objects (Majaj, Hong, Solomon, & DiCarlo, 2015), and actions (Gallivan & Culham, 2015). Front‐end attentional selection is also extremely fast, with ventral stream neurons activating roughly 100 ms before the eyes fixate an object of interest (Sheinberg & Logothetis, 2001). Very fast stimulus saliency effects on attentional selection are controlled by the superior colliculus (Boehnke & Munoz, 2008), and slower back‐end influences are likely controlled by the fronto‐parietal and fronto‐temporal networks (Baldauf & Desimone, 2014). Knowing these basic facts can guide research to test front‐end hypotheses of SPECT. The functional distinction of back‐end from front‐end processes is also strongly supported by their having different time courses and involving different brain networks. EEG research has shown that mapping processes, such as when the same people and/or locations recur across multiple panels in a visual narrative, elicit decreased N400 amplitude (roughly 300–500 ms post‐stimulus) (Cohn, Paczynski, Jackendoff, Holcomb, & Kuperberg, 2012). Importantly, consistent with SPECT, this time course is later than the 150–225 ms needed to extract an event index (e.g., the person, the location) and is operating over multiple items in WM. 
Other EEG studies have shown that mapping processes such as updating the event model with new event indices and bridging inference generation occur even later, eliciting the P600 (roughly 400–900 ms post‐stimulus) (Cohn & Kutas, 2015). Likewise, consistent with SPECT's assumption that mapping and shifting are separate processes, fMRI studies have shown that they involve separate brain regions (Ezzyat & Davachi, 2011). SPECT further argues that shifting at event boundaries leads the event model in WM to be stored in LTM. Consistent with this claim, fMRI studies have shown that event boundaries lead to activity in parietal and posterior medial cortex becoming temporarily synchronized (Baldassano et al., 2017; Ezzyat & Davachi, 2011). Such interactions between brain areas involved in front‐end and back‐end processes are critical predictions of SPECT, but the bidirectionality of these predicted interactions requires considerable further neuroimaging support. Additionally, SPECT assumes that event segmentation is similar across different representational formats, which has been supported by fMRI studies showing the same posterior‐medial network (Inhoff & Ranganath, 2017) being engaged when reading written narratives (Baldassano et al., 2017; Speer, Zacks, & Reynolds, 2007) and watching visual narratives (Baldassano et al., 2017; Kurby & Zacks, 2018; Zacks et al., 2001). This further supports the relevance of research from outside the context of visual narratives for establishing the neurophysiological bases of SPECT. Indeed, processing visual narratives likely involves a complex coordination of neurophysiological systems that are both domain‐specific (Cohn & Maher, 2015) and domain‐general (Cohn, 2019a, 2019b). Future studies should use behavioral and computational modeling methods together with neuroimaging methods to test hypotheses of SPECT in terms of their time course and functional differentiation.
Furthermore, while research using non‐visual‐narrative materials is valuable for understanding the neurophysiological bases of SPECT, there is behavioral evidence, and some neurophysiological evidence, that visual narratives require processing unique to each medium (Cohn & Ehly, 2016; Cohn & Maher, 2015; Smith, 2012a; Smith, Levin, & Cutting, 2012) and may require specialized literacy skills (Cohn, 2019b; Cohn & Magliano, in press; Schwan & Ildirar, 2010). Thus, future tests of the neurophysiological bases of SPECT should prioritize visual narratives.
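The Bayesian weighting of likely inferences mentioned above (Bateman & Wildfeuer, 2014) could, in its simplest form, be approximated as a posterior over candidate inferences given observed event indices. A toy sketch using the Moonraker‐style cross‐cutting sequence; the candidate inferences, priors, and likelihoods are all invented for illustration and are not parameters from any published model:

```python
import numpy as np

# Candidate predictive inferences a viewer might entertain; the prior
# reflects expectations before the cross-cutting shots are seen.
candidates = ["man lands on circus tent", "man lands elsewhere"]
prior = np.array([0.5, 0.5])

# P(observed event indices | inference): cross-cutting between the falling
# man and the circus tent is far more expected under the first inference.
likelihood = np.array([0.9, 0.2])

# Bayes' rule: posterior ∝ prior × likelihood, then normalize.
posterior = prior * likelihood
posterior /= posterior.sum()

best = candidates[int(np.argmax(posterior))]
assert best == "man lands on circus tent"
```

A computational approximation of SPECT's back end would, of course, need to derive such likelihoods from front‐end event index outputs rather than stipulate them, which is precisely the linking challenge noted above.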

Limitations of SPECT

What is missing from SPECT? One obvious limitation is that SPECT does not specify how prior world knowledge supports the comprehension of visual narratives. In contrast, theories of text comprehension focus on how semantic knowledge is activated and integrated into a mental model for a text (McNamara & Magliano, 2009). The complexities of exploring how front‐end processes support mental model construction are such that, at this juncture, we deem this a necessary omission. As described here, SPECT principally applies to traditional non‐interactive media (though reading comics and picture stories allows self‐pacing). SPECT does not account for visual narrative experiences in which the viewer is also an active participant, such as video game and virtual reality experiences. Given that first‐person experiences are processed in a similar fashion to narrative experiences (Magliano, Radvansky, Forsythe, & Copeland, 2014), SPECT should be able to accommodate these experiences. However, the fact that one is an active agent in many of these contexts will obviously have implications for attentional selection. SPECT also neglects the rich and important social aspects (e.g., communal viewing at a cinema, or a parent reading a picture book to their child) and emotional aspects (i.e., the affective profile of joy and despair so important to narrative arcs) of visual narratives. This is a systemic issue with many theories of comprehension, but it does not imply that these processes are unimportant for comprehension. Probably the most important current omission is that SPECT specifically describes the relationship between visual processing and event model construction, but it does not describe how written or auditory information (linguistic and non‐linguistic) contributes to the understanding of visual narratives.
Cohn similarly does not specify how linguistic information is processed in his theory of visual narrative processing (Cohn, 2013a), though he acknowledges the importance of understanding how the relationships between text and images (Cohn, 2016; Manfredi, Cohn, & Kutas, 2017) and between sounds and images (Manfredi, Cohn, De Araújo Andreoli, & Boggio, 2018) convey meaning in sequential visual narratives. Auditory information is vital to the practices of storytelling in filmmaking (Batten & Smith, 2018; Bordwell, 1985), and representations of speech, thought, narration, and sound effects are vital to storytelling in comics (Cohn, 2013a). Moreover, when comic panels contain a large amount of text, readers allocate considerable attentional resources to processing the text, and there is some suggestion that image content may be processed in parafoveal vision (Laubrock et al., 2018). Furthermore, in film, auditory and linguistic content support inference processes (Magliano et al., 1996). However, given the complexities of understanding the relationship between visual perception and event cognition, we argue that this is a necessary omission at this juncture.

Conclusion

With SPECT, we have taken the first steps toward outlining a comprehensive cognitive framework for visual narrative processing which extends from momentary attentional selection and information extraction from visual images to the longer‐scale creation and maintenance of event models in WM and LTM. This theoretical framework incorporates contemporary theories of all of these stages of visual scene perception, event perception, and narrative comprehension, but by applying SPECT to complex visual narratives, a number of important ruptures, inconsistencies, and gaps in our understanding have emerged. Most important, as previously stated in relation to film (Smith, Levin, et al., 2012), by theorizing about and studying how we process visual narratives, we learn more about how we perceive and make sense of the real world.