We tested whether viewers have cognitive control over their eye movements after cuts in videos of real-world scenes. In the critical conditions, scene cuts constituted panoramic view shifts: Half of the view following a cut matched the view on the same scene before the cut. We manipulated the viewing task between two groups of participants. The main experimental group judged whether the scene following a cut was a continuation of the scene before the cut. Results showed that following view shifts, fixations were determined by the task from 250 ms until 1.5 s: Participants made more and earlier fixations on scene regions that matched across cuts, compared to nonmatching scene regions. This was evident in comparison to a control group of participants that performed a task that did not require judging scene continuity across cuts, and did not show the preference for matching scene regions. Our results illustrate that viewing intentions can have robust and consistent effects on gaze behavior in dynamic scenes, immediately after cuts.
Keywords:
attention; continuity; dynamic scenes; editing; eye tracking; fixations; movies
Edited dynamic scenes, such as films, newscasts, television shows, and all sorts of
edited videos, are highly prevalent in human environments. Such edited videos (as we
might call them) contain frequent cuts, which are abrupt global changes of video
content, occurring every few seconds and connecting different video takes. Despite
the high prevalence of edited videos, only a few studies have systematically explored
how eye movements might be affected by cuts (e.g., Carmi & Itti, 2006a; Germeys &
D’Ydewalle, 2007). As outlined in the Background section below,
previous studies suggested that in early time periods following scene cuts,
cognitive top-down influences on eye movements are rather limited. Here, we chose an
experimental approach to uncover such cognitive top-down influences on eye movements
following scene cuts: We looked at how spatio-temporal eye movement patterns
following scene cuts are influenced by specific task goals (Yarbus, 1967). Our study illustrates that humans exert robust
cognitive control over their eye movements. Immediately after cuts, viewers can
selectively fixate on scene regions that contain the most task-relevant information.

To start with, eye fixations enable humans to perceive visual information using
highly accurate foveal vision. Yet in each moment, foveal vision
captures only a small portion of the available information. To overcome this
inherent selectivity, humans make several fixations per second and select or sample
visual information from different spatial locations (Rayner, 2009). Thus, mechanisms of selective visual attention are
tightly connected to fixation location selection (Deubel & Schneider, 1996) and ensure that relevant information is
sampled and available for cognitive processing and behavior (Land & Tatler, 2009). In edited dynamic scenes, accurate
and timely fixations on behaviorally relevant content are of particular importance
because specific information is only transiently accessible and the video content
undergoes frequent dynamic changes. A more solid understanding of the degree to
which human viewers have cognitive control over their fixations in edited dynamic
scenes would be an important step towards the improvement of technological
applications that rely on videos and human attention. Examples are the design of
viewer-acceptable video-coding standards (Adzic,
Kalva, & Furht, 2013; Salomon,
2004) or, more generally, display devices that are aware of and optimized
for the viewer’s attention (Ferscha,
Paradiso, & Whitaker, 2014). Moreover, the principles that determine
attention and eye movements in edited dynamic scenes could inspire the development
of graphical user interfaces (May, Dean, &
Barnard, 2003; Valuch, Ansorge,
Buchinger, Patrone, & Scherzer, 2014).
Background
Recognition tasks uncover cognitive control
Cognitive influences on eye movements can be uncovered by manipulating the
viewing task between different groups of experimental participants (Smith & Mital, 2013; Yarbus, 1967). In static images,
recognition tasks have proven particularly useful for this purpose (Castelhano, Mack, & Henderson,
2009). For example, one group of participants can be asked to first
memorize a series of images in a learning block and then discriminate
between novel and familiar images in a transfer block (Foulsham & Kingstone, 2013; Valuch, Becker, & Ansorge, 2013). To identify how
task goals modulate gaze behavior, a second group of participants can be
presented with the identical series of images but with different task
instructions (Valuch et al., 2013).
Differences in eye movement measures between the groups of participants can
then be attributed to cognitive influences (Castelhano et al., 2009).

A recent study compared fixation patterns between two groups of
participants, which were both shown the same photographs of real-world
scenes (Valuch et al., 2013). The
main experimental group was instructed to memorize the photographs in a
learning block, and then discriminate between familiar and novel scenes in a
transfer block. The control group was instructed to freely view the
photographs in both blocks, without the need to recognize familiar scenes.
Crucially, some of the scenes that were repeated in the transfer block
underwent a panoramic view shift relative to the learning block. In these
shifted views, either the left or the right half of the photograph matched
with the view from the learning block. In other words, the image content in
the transfer block overlapped by exactly 50% with the image content that was
previously presented in the learning block. The central result was that
participants from the experimental group, actively recognizing familiar
images, fixated significantly more often and longer on the matching scene
regions than on the nonmatching, novel scene regions. In contrast, the
control group, which viewed exactly the same scene photographs in both
blocks but was not required to actively recognize the images, did not show
this effect in their fixation locations. Other experiments have delivered a
tentative explanation for the tendency to fixate on matching scene regions
during recognition: In order to accurately recognize whether a scene is
familiar or not, humans must direct their foveal vision to scene details
that were present and fixated during learning of the scene (Foulsham & Kingstone, 2013; Valuch et al., 2013). Hence, the bias
in the spatial fixation distribution in the recognition group towards
overlapping scene regions reflected the degree to which viewers exerted
cognitive top-down control over their eye movements.
Cognitive top-down influences in dynamic scenes
Cognitive top-down influences are well established for static scenes, but
only a few studies have attempted to test them in the context of dynamic scenes
(e.g., Germeys & D’Ydewalle,
2007; Loschky, Larson, Magliano,
& Smith, 2015; Smith &
Mital, 2013). The majority of research suggests that cognitive
top-down factors play a negligible to minor role in explaining eye
movements in videos (Carmi & Itti,
2006a; Mital, Smith, Hill, &
Henderson, 2011). Previous studies suggest that a large part of
the spatial variance in fixations in videos could be explained by a strong
generic viewing bias towards the center regions of a video (Tseng, Carmi, Cameron, Munoz, & Itti,
2009). In addition, fixations seem to correlate substantially
with salient visual image features, such as strong motion (Mital et al., 2011), luminance and
color contrasts (Carmi & Itti,
2006b), or spatio-temporal novelty (Itti & Baldi, 2009). Of particular importance for
our present study, research suggests that any residual cognitive top-down
influences are muted during the very first second following scene cuts in
edited dynamic scenes (Carmi & Itti,
2006a; Smith & Mital,
2013): After cuts, studies reported particularly high
correlations between fixation locations and salient visual characteristics
(Carmi & Itti, 2006a), or
generic biases towards the screen center, with cognitive top-down influences
only slowly taking over gaze control as the video progresses (Smith & Mital, 2013).

One explanation for the lack of evidence for cognitive top-down influences
on early fixation selection after cuts is the choice of stimuli and viewing
tasks of previous studies. To start with, not all types of videos are
equally suited to study differences due to cognitive top-down influences.
Hollywood-like video material minimizes interobserver viewing variability
and elicits particularly strong clustering of fixations in the center of the
image area (Dorr, Martinetz, Gegenfurtner,
& Barth, 2010; Goldstein,
Woods, & Peli, 2007). This is possibly due to tailored
editing that aims to draw the gaze and attention of most viewers, in a
similar way, to the most important content in the images. In contrast, more
naturalistic videos of real-world scenes are known to invite higher spatial
fixation variability and leave more room for detecting differences caused by
cognitive top-down influences (cf. Dorr et
al., 2010). Also, if participants are asked to “freely
view” a series of videos, they often orient their gaze towards
salient visual features (Mital et al.,
2011), but this does not indicate a causal effect of visual
salience on gaze control (e.g., Nuthmann
& Henderson, 2010), and specific task goals could drastically
change such relationships (e.g., Acik, Onat,
Schumann, Einhäuser, & König, 2009; Fuchs, Ansorge, Redies, & Leder,
2011). To date, there is a general shortage of studies that would
include specific task instructions, as well as suitable control conditions
that could justify conclusions about causal influences of visual saliency,
independent of cognitive influences, on fixation selection (Tatler, Hayhoe, Land, & Ballard,
2011).

For example, if an experiment includes only cuts between completely
unrelated scenes, fixations appear to correlate more strongly with visual
characteristics in the very first second following the cut (Carmi & Itti, 2006a). Notably, in
the absence of an important comparison condition (cuts between related
scene images), conclusions about the general absence of cognitive
top-down influences are difficult to draw. This is problematic because edited
material very often includes cuts between related scenes. A common example
is viewpoint shifts, where the same scene is shown from two different camera
perspectives before and after a cut. With such cuts, cognitive top-down
influences can be expected, because viewers might recognize or actively
search for familiar (remembered) previous visual scene content to understand
how the two different scene views relate to one another (Ansorge, Buchinger, Valuch, Patrone, &
Scherzer, 2014; Hochberg &
Brooks, 1996). Indeed, using a novel type of recognition task,
two recent eye tracking studies suggested that eye movements might be
differently affected by cuts that connect visually unrelated scenes as
opposed to cuts that connect two visually related views on the same scene
(Valuch et al., 2014; Valuch & Ansorge, 2015). Viewers
recognize movie continuations faster after cuts between related
scenes relative to cuts between visually unrelated scenes (Valuch et al., 2014). Moreover, if
viewers do not know at which of two alternative locations a movie will
continue after a cut, they make faster eye movements to the correct location
after cuts between related scenes than after cuts between visually unrelated
scenes (Valuch & Ansorge, 2015).
While these studies looked at the temporal properties of the initial gaze
orientation after cuts, they did not explore whether cuts between related
scene views entail systematic cognitive top-down influences on
spatio-temporal fixation distributions within the post-cut scene and how
these develop over the course of the first seconds following a cut. Relatedly,
these previous studies did not manipulate the viewing task between separate
groups of participants, leaving it unclear whether the observed attentional
effects were due to cognitive task-dependent top-down influences or whether
they could be explained by some form of task-independent stimulus-driven
repetition priming effect (cf. Maljkovic
& Nakayama, 1994; Theeuwes,
2013). The aim of the present study was to address these open
questions.
The Present Study
We used a large set of naturalistic video recordings of real-world scenes to test
if human viewers can exert cognitive control over their eye movements during the
very first seconds following scene cuts. In each trial of our experiments,
participants saw two video takes in succession, separated by a single cut. All
takes were spatial segments cropped from originally larger wide-screen source
videos and showed a city scene that did or did not continue across the cut. In
the main Experiment 1, we asked participants to recognize the post-cut takes as
continuations or discontinuations of the immediately preceding pre-cut takes.
Among the continuous cuts, we used shifted conditions in which the view on the
scene underwent a panoramic shift from the pre- to the post-cut take: The takes
presented before the cut were cropped from the left or right side of the
original (panoramic) source video, and the takes following the cut showed a
leftward or rightward shifted view that was cropped from the same source video.
In these conditions, either the left or the right 50% of the image content in
the post-cut take was visually related to and, therefore, matched with the view
in the pre-cut take (see Figure 1).
Figure 1.
Example images of a source video (A) and the two alternative cropped
views created from this video that were used for panoramic view shifts
in the shifted conditions (B). As can be seen, each cropped view
corresponded to one horizontal side of the source video and, as depicted
within the dotted rectangle, there was an area of 50% spatial overlap
between the two different views taken from the same source video.
In the control Experiment 2, we used the same set of stimuli, but we changed the
viewing task. Different from Experiment 1, we did not ask participants to
recognize whether the post-cut take was a continuation of the pre-cut scene.
Instead, we implemented an alternative recognition task as a control. Crucially,
this control task did not require the participants to directly compare the two
immediately succeeding takes within a trial. Before starting the experimental
trials, participants in this control group were shown 16 videos that were also
presented as to-be-recognized videos among the experimental trials. After each
experimental trial, they were asked to report whether any of these 16 videos was
identical to the pre-cut or the post-cut take in this trial.

In addition to these critical shifted conditions, both experiments included two
further control conditions. The first control condition consisted of
discontinuous cuts, where the post-cut take was completely unrelated to the
pre-cut take. The second control condition consisted of full continuations,
where the post-cut take was a continuation of the same scene from exactly the
same view as the pre-cut take. This was only possible by inserting a blank
screen (with only a central fixation cross) between all pre- and post-cut takes.
In addition to allowing the inclusion of the full continuations as a control
condition, this ensured that all participants started viewing the post-cut take
from the same neutral central position in all conditions and trials.

We predicted that in the shifted conditions, the main experimental group
(Experiment 1) would be more likely to fixate on visually related, matching
scene regions compared to participants in the control Experiment 2. This is
because matching scene regions contained critical information to solve the task
of recognizing whether or not the post-cut take was a continuation of the
pre-cut take. Only the matching scene regions were informative about whether
this was the same scene or a different, potentially similar scene, without
requiring an exhaustive inspection of the whole post-cut images. In contrast,
participants in Experiment 2 did not need to establish a relation between the
two takes across the cut. Hence, we predicted that the control group should not
show a particular preference for the matching scene regions, provided that such
a preference in Experiment 1 would be solely due to the cognitive top-down
demands imposed by the specific recognition task used. Across participants, we
balanced the assignment of the individual video clips to the three cut
conditions (see Figure 2). This allowed us
to rule out any possibility that the clustering of fixations in matching scene
areas of post-cut takes of Experiment 1 resulted from a higher occurrence of
interesting scene content compared to the nonmatching areas. Thus, the full
continuations and the discontinuous cuts served as control conditions for the
shifted conditions in both experiments.
Figure 2.
Depicted is the same source video (in the post-cut take), assigned to the
three different cut conditions: discontinuous (on the left), shifted
condition (in the middle), or full continuation (on the right). In
Experiment 1, each trial consisted of a 10 s pre-cut take followed by a
fixation cross (cut) for 500 ms, and a 10 s post-cut take. In different
versions of the experiment, the same post-cut take was used in
discontinuous, shifted, or continuous cut conditions, but each
participant saw only one of these versions. In Experiment 2, instead of
10, only 5 s of each take were shown.
Eye Tracking Experiments
Methods and Materials
Participants
Forty-eight students took part in the experiments in exchange for partial
course credit. Half of the participants (age 18-23 years, M = 19.8)
took part in the main Experiment 1 and the other half (age 19-32
years, M = 24.6) took part in the control Experiment 2. All
participants had normal or fully-corrected vision and gave informed consent
prior to participation.
Dynamic scene stimuli
We recorded 240 different landscape videos of street, park, or interior
scenes around the city of Vienna (see Figure
3). All videos were recorded using a tripod from fixed positions,
without any camera or lens movements, but movement was present within the
videos at several locations in each frame. This movement was mostly caused
by people walking or working, animals moving, cars passing, trees moving in
the wind, or reflections on water surfaces. Videos were recorded with a wide
angle lens in daylight conditions using narrow apertures to ensure high
depth of field such that all image areas remained homogeneously sharp. In
Experiment 1, we cut two immediately succeeding shorter video takes out of
the source videos, henceforth referred to as Takes 1 and 2, each with a
length of 10 s. In Experiment 2, the same Takes 1 and 2 were used, but
further shortened to 5 s each (i.e., the last 5 s before each cut, and the
first 5 s following each cut) because, after Experiment 1, it was clear that
even 5 s are more than sufficient for understanding gaze behavior around the
time of the cuts. For the creation of altogether 320 (plus a few
demonstration) takes, we cropped spatially smaller frames (with a resolution
of 1,280 × 1,024 pixels; 5:4 ratio) corresponding approximately to two
thirds of the high definition source Takes 1 and 2 (with an original
resolution of 1,920 × 1,088 pixels) (see Figure 1). The two alternative cropped views of each take
depicted either the left or the right two thirds of the source takes and
overlapped by precisely 50%.
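The crop geometry described above can be checked with a short calculation. The following is only a minimal sketch: it assumes that the left and right views were flush with the left and right edges of the 1,920-pixel-wide source frame, which the text implies but does not state explicitly, and the function names are hypothetical. Only the horizontal dimension is modeled.

```python
def crop_windows(source_width=1920, crop_width=1280):
    """Return the (left, right) crop windows as (x_start, x_end) pixel ranges.

    Assumption: the left view is flush with the left source edge and the
    right view is flush with the right source edge.
    """
    left = (0, crop_width)
    right = (source_width - crop_width, source_width)
    return left, right


def overlap_fraction(a, b):
    """Fraction of window a's width that is also covered by window b."""
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return shared / (a[1] - a[0])


left, right = crop_windows()
print(left, right, overlap_fraction(left, right))
```

Under this assumption, the two views share the central 640 pixel columns of the source, which is half of each 1,280-pixel-wide crop.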
Figure 3.
Example still images from the videos that were used in the current
study.
Apparatus
Eye movements were recorded using an EyeLink Desktop Mount eye tracker (SR
Research Ltd.) at a sampling rate of 1,000 Hz. The system was calibrated to
each participant’s dominant eye using a standard 9-point calibration
procedure. Every time the takes started or stopped, the exact timestamp was
saved to the eye tracking data file, which allowed analyzing fixation
latencies, durations, and frequencies with millisecond precision relative to
the onset of each stimulus. After every tenth trial, calibration was checked
using a standard drift check procedure and, if necessary, recalibrated. The
videos were displayed on a 19-in. color CRT monitor (Sony Multiscan G400) at
a resolution of 1,280 × 1,024 pixels and a refresh rate of 60 Hz. The
experimental procedure was implemented in MATLAB (MathWorks) using the
Psychophysics toolbox and the Eyelink toolbox (Kleiner et al., 2007). Viewing distance to the monitor
was 64 cm, supported by chin and forehead rests, resulting in an apparent
size of the full screen videos of 31 × 24.2°.
Procedure and design
Following six demo trials, every participant saw 160 experimental trials,
each of them consisting of two takes—one pre-cut take of 10 s
(Experiment 1) or 5 s (Experiment 2) and one post-cut take of 10 s
(Experiment 1) or 5 s (Experiment 2)—and a cut between them (here, a
short break of 500 ms). All takes were presented in full screen and in
color. Prior to each trial and during the cut between the takes within each
trial, the screen went grey for 500 ms, with the exception of a black
fixation cross at screen center. Only after the post-cut take had finished
were participants shown a grey response screen, which remained until they responded. In
Experiment 1, the response screen contained the question: “Was the
post-cut take a continuation of the scene shown in the pre-cut take?”
To implement the control task in Experiment 2, participants saw and learned
16 clips for later recognition in advance of the actual experimental trials.
These clips included both pre- and post-cut takes of full continuations.
During the experimental trials of Experiment 2, in four instances of each of
the four possible conditions (continuous cuts, discontinuous cuts,
left-shifted, and right-shifted conditions), either the pre- or the post-cut
clip contained a pre- or a post-cut clip of the initially learned videos,
and each post-cut screen read: “Was there one of the 16 initial clips
among the two takes that you just saw?” For those trials of the
control task in which the participants in Experiment 2 indicated that one of
the clips was part of the initially learned memory set, participants
additionally had to indicate whether the first or the second take was among
the initially presented clips. In this task, any “yes” (or
recognition) answer was counted as correct when either the pre- or the
post-cut take was from the initially learned memory set.

Throughout the experiments, participants fixated on the central fixation
cross whenever it was present (i.e., before a trial started, and in between
the end of the first take and the beginning of the second take).
Participants pressed the 8 or 2 keys on
the numerical keypad of a standard USB keyboard for their different
judgments (e.g., 8 for the same scene vs.
2 for different scenes in pre- and post-cut takes of
Experiment 1). Only after incorrect responses did participants see an
additional feedback screen for another 2 s, indicating that the wrong
response had been given.

One half of all trials (80 trials) were discontinuous cuts in which the
post-cut take showed a novel, previously unseen take (see Figure 2). The other half of all trials
were continuations. Among the continuations, half (40) of the trials were
full continuations, with the pre- and post-cut takes depicting the same view
on the same scene. The other half of all continuous trials were shifted
conditions. In shifted conditions, the cut constituted a panoramic view
shift, with the view in the post-cut take shifted either to the right (20
trials) or to the left (20 trials) border of the original panoramic source
video. Note that all of the take sequences, including the full continuations
and the shifted conditions, were presented in the correct temporal order and
presented 20 s (Experiment 1) or 10 s (Experiment 2) of immediately
succeeding video content, without any temporal omissions, repetitions, or
reversals. Also, the 5 s before the cut and the 5 s following the cut were
exactly the same in Experiments 1 and 2. In Experiments 1 and 2, all
different conditions were presented in a randomized order. Each trial took
about 25 s (Experiment 1) or 14 s (Experiment 2), and the total test time
was about 80 min (Experiment 1) or 60 min (Experiment 2).
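To make the trial composition concrete, the sketch below builds a randomized list of the 160 trials from the condition counts given above and computes the fixed presentation time of one trial (two 500 ms fixation screens plus two takes). It is an illustration only: the experiment itself was implemented in MATLAB, the condition labels and function names here are hypothetical, and response and feedback time are not modeled.

```python
import random
from collections import Counter

# Trial counts per condition as stated in the text (160 trials in total).
CONDITION_COUNTS = {
    "discontinuous": 80,
    "full_continuation": 40,
    "right_shifted": 20,
    "left_shifted": 20,
}


def build_trial_order(seed=None):
    """Return a shuffled list of condition labels, one entry per trial."""
    rng = random.Random(seed)
    trials = [cond for cond, n in CONDITION_COUNTS.items() for _ in range(n)]
    rng.shuffle(trials)
    return trials


def presentation_time_s(take_s, fixation_s=0.5):
    """Fixed stimulus time per trial: fixation, take, fixation, take."""
    return 2 * fixation_s + 2 * take_s


order = build_trial_order(seed=1)
print(len(order), Counter(order))
print(presentation_time_s(10), presentation_time_s(5))
```

The fixed presentation time is 21 s for 10 s takes and 11 s for 5 s takes, so the reported trial durations of about 25 s and 14 s leave a few seconds per trial for responding and feedback.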
Data analysis
Areas of interest (AOIs) for the major analyses were the left and right sides
of the post- and pre-cut takes. These AOIs started at two degrees
eccentricity from the vertical meridian and extended to the respective
image borders. Fixations were detected from the recorded gaze coordinates
using the SR Research detection algorithm, as the average gaze position
during periods with gaze position changes by less than 0.1°, eye
movement velocity below 30°/s, and acceleration below
8000°/s². Eye movement data were preprocessed in MATLAB and
statistically analyzed in R (R Core Team, 2016). Statistical significance
was assumed at an α level of .01 or below. (A slightly more liberal
criterion of an α level of .05 would have yielded identical
conclusions.) We limited our analysis of the viewing behavior in the
post-cut takes to the first three seconds following the cut because after
this time we observed no preferences for fixating one of the two alternative
AOIs between the different conditions. All statistical tests were based on
144 out of the 160 trials because the 16 trials that contained a clip of the
control task in Experiment 2 were excluded from all analyses of both
experiments.
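The AOI definition can be illustrated with a small classifier for horizontal fixation positions. This is a sketch under stated assumptions, not the authors' preprocessing code: it converts degrees to pixels linearly from the reported screen width (1,280 pixels spanning 31°), treats the AOI boundaries as inclusive, and uses hypothetical names throughout.

```python
SCREEN_WIDTH_PX = 1280    # horizontal resolution reported in the Apparatus section
SCREEN_WIDTH_DEG = 31.0   # horizontal extent of the full screen videos
AOI_MARGIN_DEG = 2.0      # AOIs start at 2 degrees from the vertical meridian

# Assumption: a linear pixel-to-degree conversion across the whole screen.
PX_PER_DEG = SCREEN_WIDTH_PX / SCREEN_WIDTH_DEG


def classify_fixation(x_px):
    """Return 'left', 'right', or None for a horizontal gaze position in pixels."""
    if not 0 <= x_px <= SCREEN_WIDTH_PX:
        return None  # off-screen sample
    center = SCREEN_WIDTH_PX / 2
    margin = AOI_MARGIN_DEG * PX_PER_DEG
    if x_px <= center - margin:
        return "left"
    if x_px >= center + margin:
        return "right"
    return None  # central strip that belongs to neither AOI


print(round(PX_PER_DEG, 2), classify_fixation(100), classify_fixation(640))
```

With these values, one degree corresponds to roughly 41 pixels, so the excluded central strip is about 165 pixels wide.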
Results of Experiment 1
Behavioral task
Participants made 1.46% errors (SD = 1.18) in the scene
continuity judgment task.
Fixation frequencies
Within the first 3 s of the post-cut takes, participants made 8.21 fixations
on average (SD = 1.31). Figure 4 gives an impression of how the spatial distribution of
fixations of our participants developed across six 250 ms time bins from 0
to 1.5 s following the onset of the post-cut take. Fixations starting and
ending in different bins were assigned to all bins in which they were
measured. The first column of Figure 4 shows that in right shifted
conditions, participants preferentially fixated on the across-cut matching
left side of the post-cut take and fewer fixations were made on the
nonmatching right side of the post-cut take. One can also see that this
preference for one side over the other was reversed in left shifted
conditions (third column of Figure 4).
In contrast, no strong preferences for either side were observed in the two
control conditions—that is, in the fully continuous (see second
column of Figure 4) and in the
discontinuous takes (see fourth column of Figure 4), in which both sides of the post-cut takes were
equally matching or nonmatching across the cut.
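The rule that a fixation spanning several time bins is assigned to every bin it overlaps can be sketched as follows. The function name, the half-open treatment of the fixation interval, and the clipping to the analysis window are assumptions made for illustration.

```python
import math


def bins_for_fixation(start_ms, end_ms, bin_ms=250, window_ms=1500):
    """Return the indices of all time bins that a fixation overlaps.

    The fixation is treated as the half-open interval [start_ms, end_ms),
    measured from the onset of the post-cut take. Bin 0 covers 0 to 250 ms.
    """
    n_bins = window_ms // bin_ms
    if end_ms <= start_ms or end_ms <= 0 or start_ms >= window_ms:
        return []  # empty fixation or entirely outside the analysis window
    first = max(0, math.floor(start_ms / bin_ms))
    last = min(n_bins - 1, math.ceil(end_ms / bin_ms) - 1)
    return list(range(first, last + 1))


# A fixation from 200 ms to 600 ms counts in the first three bins.
print(bins_for_fixation(200, 600))
```

One consequence of this rule is that long fixations contribute to several bins, so bin counts summed over time can exceed the number of fixations.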
Figure 4.
Heat maps (across participants and images) of fixations within the
first 1.5 s in the post-cut takes for the two shifted conditions
(Columns 1 and 3) and for the two control conditions (Columns 2 and
4) of Experiment 1. Here, red, orange and yellow depict areas of
relatively higher numbers of fixations, while green, blue, and white
depict areas of lower numbers of fixations. The horizontal and
vertical coordinates of each subplot correspond to the screen
coordinates of the full screen post-cut takes (1,280 × 1,024
pixels). In the first column, fixation data from the right shifted
conditions show more fixations on the left side, with its across-cut
matching content. In Columns 2 to 4, fixations are shown for full
continuations, left shifted conditions, and discontinuous cuts,
respectively. The time bins into the post-cut takes are given in the
rows from early at the top to further into the post-cut take at the
bottom. From Rows 2 to 6, in the shifted conditions, a clustering of
fixations in areas that match across the cut is evident.
For the statistical analysis, we split the first 3 s of fixations on the
post-cut takes into time bins of a length of 250 ms. We used
t tests with Holm-Bonferroni correction to test for
statistical differences between the frequencies of fixations on the left
versus the right sides (see Table 1).
Since each participant saw each post-cut take only as either left shifted or
right shifted, a direct within-participant comparison between the two would
have been confounded by differences in visual content. To compare both
of these experimental conditions separately with their respective control
conditions (i.e., the full continuations and the discontinuous takes), the
averages for each condition and time bin were taken for each participant and
compared afterwards. The control conditions showed exactly the same post-cut
takes as were used in the respective shifted conditions. As can be seen in
Table 1, from at least 250 ms
until about 1.5 s into the post-cut takes, fixation frequencies on
across-cut matching regions of the post-cut takes in the shifted conditions
differed significantly from fixation frequencies in the corresponding areas
of the same post-cut takes under the control conditions. The upper three
rows of Table 1 show that
t tests confirmed that more fixations were made on the
left side of the post-cut takes of the right shifted conditions than on the
left side of the same post-cut takes in full continuations and discontinuous
cuts. The t tests also showed that there were no such
differences between the two control conditions (full continuations and
discontinuous cuts). The lower three rows of Table 1 show that t tests also confirmed more
fixations were made on the right side of the post-cut takes of the left
shifted conditions than on the right side of the same post-cut takes in full
continuations and discontinuous cuts. And again, the t
tests also demonstrated that there were no such differences between the two
control conditions (full continuations and discontinuous cuts).
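For readers unfamiliar with the Holm-Bonferroni procedure, the following sketch shows how adjusted p-values are obtained from a family of uncorrected p-values. It is a generic illustration of the step-down method, not the authors' code; the function name `holm_bonferroni` and the example values are hypothetical, and the text does not state which comparisons formed one family.

```python
from typing import List


def holm_bonferroni(p_values: List[float]) -> List[float]:
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Step-down multiplier: m for the smallest p, m - 1 for the next, ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.0004, 0.03, 0.2, 0.01]  # hypothetical uncorrected p-values
adj = holm_bonferroni(raw)
print([round(p, 4) for p in adj])
```

Because adjusted values are capped at 1, a procedure of this kind would be consistent with the many entries of exactly 1 in Table 1.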
Table 1.
p-Values of Pairwise Comparisons (t Tests With Holm-Bonferroni
Correction) Between Different Cut Conditions for Different Times Into
the Post-Cut Takes of Experiment 1
Column headers give the time after the cut in ms, in timespans of 250 ms.
| Condition | Comparison | 1-250 | -500 | -750 | -1000 | -1250 | -1500 | -1750 | -2000 | -2250 | -2500 | -2750 | -3000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| right shifted | shifted vs. discont. | 1 | .001* | .001* | .001* | .001* | .001* | .011 | .104 | .109 | .877 | 1 | 1 |
| right shifted | shifted vs. fully cont. | .959 | .001* | .001* | .001* | .001* | .005* | .185 | 1 | 1 | 1 | .972 | 1 |
| right shifted | discont. vs. fully cont. | 1 | .366 | .860 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| left shifted | shifted vs. discont. | .511 | .001* | .001* | .001* | .001* | .001* | .035 | .138 | .026 | .289 | .070 | .022 |
| left shifted | shifted vs. fully cont. | 1 | .001* | .001* | .001* | .001* | .001* | .022 | .536 | .094 | .063 | .059 | .001* |
| left shifted | discont. vs. fully cont. | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Note. * = significant at p < .01.
df = 23. discont. = discontinuous; cont. =
continuous.
Figure 5 shows the mean horizontal
locations of the participants’ fixations. From 250 ms to 1.5 s into
the post-cut take, the AOI on the right side attracted more fixations in the
left shifted conditions (blue line), and the AOI on the left side attracted
more fixations in the right shifted conditions (red line). These increases in
fixation frequency were relative to the respective control
conditions: In the full continuations (dotted lines) and the discontinuous
cuts (dashed lines), the mean fixation locations were less lateralized and
more consistently within 2° of the take center, showing spatially more
balanced fixations on the left and right. To estimate the effect sizes of
the fixation preference for the matching regions in the particular time bins
with significant differences in the shifted conditions, we calculated
Pearson’s r correlation coefficients across
participants between the horizontal axis positions in the post-cut takes.
The rationale for this test is that a high preference for one side should
lead to a high correlation of the horizontal fixation locations. In the
shifted conditions, these correlations were of medium size
(r = 0.29 to 0.35) for fixations from 250 ms to 750 ms,
and small for the other significant time segments (r = 0.14
to 0.26).
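As a reminder of how Pearson's r is computed, a minimal sketch follows. The text does not specify exactly which pairs of horizontal positions were correlated, so the two input vectors below are purely hypothetical placeholders, and the function name `pearson_r` is chosen only for illustration.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired values, not data from the experiment.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(round(r, 3))
```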
Figure 5.
Mean horizontal deviations of all fixations (on the abscissa) as a
function of the time into the post-cut take on the ordinate,
separately for different conditions of Experiment 1. One can see
that in the shifted conditions (continuous lines), participants more
frequently fixated locations in the across-cut matching regions of
the post-cut take (left sides for right shifted, right sides for
left shifted conditions) than under both control conditions (full
continuations [dotted lines] and discontinuous takes [dashed
lines]).
Latencies of first fixations
As a second dependent variable, the latencies of the first fixations on
either AOI of the post-cut takes were analyzed. We discarded trials in which
participants did not fixate inside the AOIs within 5 s following the start
of the post-cut take and outliers that exceeded a criterion of 1.5 times the
interquartile range of the latency distribution. As a result, 20.9% of the
fixations on the post-cut takes were excluded. This resulted in 969 trials
(27.3%) being excluded due to fixations outside the AOIs. Another 32 trials
(0.9%) exceeding the range for outliers were excluded. On average, it took
the participants 865 ms to fixate at least once on locations inside both
lateral AOIs, left and right.
Table 2 shows that for post-cut takes
of the shifted conditions, latencies were significantly shorter for the
first fixations on across-cut matching sides than for the sides of the
identical post-cut takes in the respective two control conditions (i.e.,
full continuations and discontinuous cuts). In addition, fixations on the
across-cut matching sides were of significantly lower latency than fixations
on nonmatching regions (p < .01). Finally, at least for
the right shifted conditions, latencies of fixations on the nonmatching side
were significantly higher than latencies of fixations on the same (right)
side in the control conditions. Mean fixation latencies are also plotted in
Figure 6. The error bars represent
the standard errors of the means.
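The two exclusion steps for first-fixation latencies can be sketched as follows. This is an illustrative reading, not the authors' procedure: the text states only a criterion of 1.5 times the interquartile range, so applying it as fences at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR is an assumption, as are the function name `filter_latencies` and the use of `None` for trials without an AOI fixation.

```python
import statistics
from typing import List, Optional


def filter_latencies(
    latencies_ms: List[Optional[float]],
    max_latency_ms: float = 5000.0,  # 5 s limit reported in the text
    iqr_factor: float = 1.5,         # criterion reported in the text
) -> List[float]:
    """Drop trials without an AOI fixation within 5 s, then drop IQR outliers."""
    # Step 1: keep only trials with a first AOI fixation within the time limit.
    valid = [t for t in latencies_ms if t is not None and t <= max_latency_ms]
    if len(valid) < 2:
        return valid  # too few values to estimate quartiles
    # Step 2: assumed fences around the quartiles (Tukey-style).
    q1, _, q3 = statistics.quantiles(valid, n=4)
    iqr = q3 - q1
    lower = q1 - iqr_factor * iqr
    upper = q3 + iqr_factor * iqr
    return [t for t in valid if lower <= t <= upper]


# Hypothetical latencies in ms; None marks a trial without any AOI fixation.
trials = [400, 500, 600, 700, 800, 4000, None, 6000]
kept = filter_latencies(trials)
print(kept)
```

Note that quartile estimates depend on the interpolation method; `statistics.quantiles` uses its default "exclusive" method here.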
Table 2.
Means and Standard Deviations (in Parentheses) of Fixation
Latencies on the Left and Right Sides (or in the Respective Areas of
Interest, AOI) of the Post-Cut Takes as a Function of the Different
Cut Conditions in Experiment 1
| Condition | AOI | Shifted conditions | Full continuations | Discontinuous cuts |
|---|---|---|---|---|
| right shifted | left AOI (matching) | 617 ms (500) | 976 ms (759)* | 885 ms (696)* |
| right shifted | right AOI | 976 ms (703) | 715 ms (621)* | 730 ms (531)* |
| left shifted | left AOI | 1,088 ms (766) | 853 ms (687)* | 826 ms (682) |
| left shifted | right AOI (matching) | 600 ms (451) | 955 ms (714)* | 980 ms (681)* |
Note. * = Wilcoxon signed-rank test significant at α
< .01 between shifted conditions and full continuations; and
between shifted conditions and discontinuous cuts. Rows featuring
across-cut matching AOIs of the shifted conditions are labeled
"(matching)".
Figure 6.
Mean latencies of first fixations on the left side (area of interest,
AOI; in yellow) and on the right side (AOI; in red) of the
post-cut takes in Experiment 1. Left panel: performance in left
shifted conditions and in the corresponding post-cut takes from the
two control conditions (discontinuous and full continuations). Right
panel: performance in right shifted conditions and in the
corresponding post-cut takes from the two control conditions.
Results of Experiment 2
The mean percentage of incorrect answers to the control task was 13%
(SD = 3.9).
Participants made an average of 8.38 fixations (SD = 0.89)
within the first 3 s of the post-cut takes. As in Experiment 1, the first 3
s were split into bins of 250 ms and t tests with
Holm-Bonferroni correction were used to determine differences between
conditions (see Table 3 and Figure 7).
Table 3.
p-Values of Pairwise Comparisons (t Tests With Holm-Bonferroni
Correction) Between Different Cut Conditions for Different Times
Into the Post-Cut Takes of Experiment 2
Column headers give the time after the cut in ms, in timespans of 250 ms.
| Condition | Comparison | 1-250 | -500 | -750 | -1000 | -1250 | -1500 | -1750 | -2000 | -2250 | -2500 | -2750 | -3000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| right shifted | shifted vs. discont. | 1 | 1 | .008* | .004* | .002* | .028 | .202 | .467 | 1 | 1 | 1 | 1 |
| right shifted | shifted vs. fully cont. | 1 | 1 | .891 | .671 | .030 | .014 | .016 | .915 | 1 | 1 | 1 | 1 |
| right shifted | discont. vs. fully cont. | 1 | 1 | .891 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| left shifted | shifted vs. discont. | 1 | 1 | .034 | .001* | .006* | .135 | 1 | 1 | 1 | 1 | 1 | 1 |
| left shifted | shifted vs. fully cont. | 1 | 1 | .166 | .129 | .288 | .135 | .647 | 1 | 1 | 1 | 1 | 1 |
| left shifted | discont. vs. fully cont. | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Note. * = significant at p < .01.
df = 23. discont. = discontinuous; cont. =
continuous.
Figure 7.
Mean horizontal deviations of all fixations (on the abscissa) as a
function of the time into the post-cut take on the ordinate,
separately for different conditions of Experiments 1 and 2. One can
see that in the shifted conditions of Experiment 2 (continuous lines
in red and blue), participants showed a slight fixation preference
on nonmatching regions (right of center for right shifted, left of
center for left shifted conditions) compared to both control
conditions (full continuations [dotted lines] and discontinuous
[dashed lines] cuts). This is in contrast to Experiment 1, where a
strong preference for matching regions was found. For the sake of an
easier comparison, the corresponding performances of Experiment 1
have also been included (pale lines).
In the upper three rows of Table 3,
t tests showed that more fixations were made on the
right side in right shifted conditions than on the right side of the same
post-cut takes in discontinuous cuts. (Numerically, the same difference was
found between the right shifted and the fully continuous conditions.) This
is the opposite of the tendency observed in Experiment 1, where
the participants made more fixations on across-cut matching regions in the
shifted conditions compared to the control conditions. For the sake of an
easier comparison, we plotted the data from both experiments in Figure 7 (data from Experiment 1 are
rendered as pastel red and blue continuous lines). The lower three rows of t
tests in Table 3 show that more
fixations were made on the left side in left shifted conditions than on the
left side of the same post-cut takes under discontinuous conditions. Again,
these effects were in the opposite direction of the differences that we saw
in Experiment 1.
In Experiment 2, there were no significant differences between the cut
conditions for the latencies of the first fixations on either AOI at
all.
Discussion
Our study presents new evidence that viewers of edited dynamic scenes have robust
cognitive top-down control over their eye movements immediately after scene cuts.
This was doubtful in light of previous studies that had suggested cognitive top-down
influences were minimal after cuts (Carmi & Itti,
2006a) and needed more time to take effect (Smith & Mital, 2013). Here, we uncovered early task-dependent,
top-down controlled gaze behavior by comparing spatio-temporal fixation
distributions after standardized view shifts in a large set of videos of complex
real-world scenes.
In each trial of Experiment 1, participants judged whether or not the second of two
takes was a continuation of the pre-cut scene. In the critical view shifted
conditions, we found robust and early task effects on eye movements: From 250 ms up
to 1.5 s, the participants’ fixations systematically clustered in scene
regions that matched with the familiar view from before the cut (the very first time
window of 0 to 250 ms did not show an effect, as it was determined by the central
fixation cross that was present in the interval between the two takes). Statistical
tests confirmed that participants made an overall higher number of fixations on
scene regions that matched across cuts compared to nonmatching regions. This result
conceptually replicates a previous study using a similar view shift manipulation
with static images (Valuch et al., 2013).
Moreover, in Experiment 1, fixations on matching regions had a lower mean latency
compared to fixations on nonmatching regions.
We can exclude the possibility that these differences in fixations between matching
and nonmatching scene regions of Experiment 1 resulted from particular visual
characteristics of the videos in the shifted conditions (e.g., matching regions
containing more interesting image content, hence, attracting fixations independent
of the task): Assignment of the videos to the three different experimental
conditions was balanced across participants, and the two control conditions of
discontinuous cuts and full continuations did not result in any systematic
off-center gaze clustering. Note also that in the shifted conditions, participants
could not have expected the direction of the upcoming view shift because shifts
occurred only in 25% of trials and with equal probability to the left and to the
right. The results thus illustrate that viewers exerted immediate cognitive top-down
control over their fixations in the post-cut takes, and quickly oriented their eyes
towards matching scene regions, which contained the most task-relevant
information.
In Experiment 2, we tested an independent group of participants with a control task.
Importantly, all video stimuli and the three cut conditions were the same as in
Experiment 1, but participants did not need to make judgments about scene continuity
from the pre-cut take into the post-cut take. They performed a control task by
indicating after each trial whether either of the two successive takes in a trial
was part of a set of 16 clips that were shown before the experimental trials. Given
this task, there was no utility in making fixations on scene regions that matched
across view shifts because participants did not need to directly relate the pre-cut
and the post-cut take to each other. In line with our prediction, in this control
group, fixation latencies were not decreased and fixation frequencies were not
increased on across-cut matching scene regions. If anything, there was a slight but
less pronounced tendency to fixate more frequently on the nonmatching regions of the
post-cut takes. This slightly opposite tendency occurred relative to the
discontinuous cut conditions of Experiment 2 and relative to the shifted conditions
in Experiment 1. One possible explanation could be a tendency to visually explore
scenes, in the absence of the need to identify the connecting elements. This could
be similar to a preference for spatio-temporal novelty that was reported in past
research using free-viewing tasks (Itti & Baldi,
2009). A tendency toward novel information in Experiment 2 might also have
reflected a task-specific influence because, when the post-cut clips were presented,
pre-cut scenes had already been compared to the memory content, so that only the
novel information in the post-cut clips had to be evaluated for its similarity to
the initially learned videos. However, this result should not be over-interpreted
because it was far weaker than the strong, task-driven effects in
Experiment 1.
In any case, Experiment 2 rules out the possibility that the effects observed in
Experiment 1 could be explained by an involuntary tendency to automatically look at
whatever content repeats across viewpoint changes. Had we found the same
preference for matching areas in Experiment 2, this would have suggested that the
effect stems from repetition priming, which is sometimes believed to influence
attentional selection in a stimulus-driven way, irrespective of the task (Theeuwes, 2013). The data from Experiment 2
thus further support the notion that the behavior observed in Experiment 1 was
truly driven by the requirements imposed by the task.
The present results extend the literature in several respects. Previous studies
reasoned that following scene cuts, the contribution of cognitive top-down factors
to gaze behavior is minimized (Carmi & Itti,
2006a; Loschky et al., 2015). This
judgment was based on increased correlations between fixation locations and salient
local features (Carmi & Itti, 2006a) or an
increased tendency to look at the image center following cuts (Dorr et al., 2010; Tseng et al.,
2009). In Experiment 1, we clearly showed a robust and systematic
off-center deviation in spatial fixation distributions towards peripheral scene
regions that contained the task-relevant information very early after scene cuts.
The discrepancy between our results and previous reports could partly be explained
by methodological differences. Previous studies were mostly conducted in a
free-viewing context, without manipulating the task between groups of participants
(for an exception see, e.g., Smith & Mital,
2013). Moreover, previous studies partly relied on professionally
produced video material that is known to elicit strong center biases and high
inter-observer correlations in gaze direction, due to more strongly constrained
visual content (Goldstein et al., 2007; Loschky et al., 2015). Perhaps most importantly,
studies of eye movements in edited videos sometimes used only cuts between different
scenes, where the pre- and post-cut takes were visually completely unrelated (e.g.,
Carmi & Itti, 2006a, 2006b; Itti
& Baldi, 2009). Under such conditions, it is impossible to
discriminate cognitive top-down influences from stimulus-driven factors, such as
visual salience (Carmi & Itti, 2006b). In
contrast, our study was purposely tailored to identify the contribution of cognitive
top-down control to gaze guidance early after scene cuts by comparing two groups of
participants under different task instructions, and it included cuts between two
related views on the same scene, such that the pre-to-post-cut view change followed
a well-defined relationship in all trials of this condition. We also used videos of
real-world scenes because these enabled us to implement the view shift manipulation
across a large set of videos in an automated manner while retaining the potential to
visually explore a complex dynamic scene.
One limitation of our present findings might be that the explicit recognition task
in Experiment 1 was realized in a setting that is quite different from more everyday
video viewing situations (where viewers of edited movies usually simply attentively
view one edited video and follow its narrative). However, we believe that the
setting we created in Experiment 1 is actually quite similar to a more implicit
viewing task that viewers are usually engaged in whenever they attentively follow an
edited video. Indeed, cognitive film theory suggests that following each cut, film
viewers must recognize how a new take relates to what they saw before the cut
(Hochberg & Brooks, 1996). One caveat
here certainly is the different nature of the edited material in our study compared
to professional footage (Dorr et al., 2010).
Replications of our study with spatially cropped outtakes of feature films would,
therefore, be informative about whether there is something particular about
professionally produced feature films that mesmerizes the audience and causes all
viewers to fixate on the same content (Goldstein et
al., 2007) or whether such observations could reflect implicit task goals
of the viewers dedicated to understanding how successive takes relate to each
other. As such, our results could inspire further theorizing about why certain
standard editing practices of film and media professionals work particularly well.
In continuity editing, for example, editors take care to facilitate the narrative
connection between pre- and post-cut take (Bordwell
& Thompson, 2001). At least some of the subjectively perceived
smoothness of continuity editing (Shimamura,
Cohn-Sheehy, & Shimamura, 2014) might be explained by the degree to
which visual similarities across cuts facilitate recognition of familiar content
after view changes (Valuch & Ansorge,
2015; Valuch, König, & Ansorge,
2017).
Finally, our findings have applications beyond video editing. First, the improvement
of video coding standards benefits from a better understanding of the determinants
of gaze behavior in edited videos (Adzic et al.,
2013). Second, based on our results, computational models of human eye
movements can include memory components that allow modeling gaze behavior in
contexts where cuts frequently occur between related scene views (Ansorge et al., 2014). Apart from edited videos,
there are other situations where viewers visually explore complex dynamic scenes
across view changes while being engaged in a specific task. For example, in
radiographic applications and ultrasound imaging, highly complex and visually
cluttered dynamic scenes have to be searched for anomalies. Recognizing the visual
content that repeats across view changes is key to efficient visual orienting. In
these applications, understanding the spatio-temporal limits and properties of
voluntary top-down control over eye movements after cuts is of obvious relevance.
More broadly, graphical user interfaces regularly include shifts between complex
screens, which could be conceptually similar to scene cuts in edited dynamic scenes
(May et al., 2003; Valuch et al., 2014). Knowing the degree to which humans possess
immediate cognitive control over their gaze direction after scene cuts could
help improve usability and user experience across a wide variety of computer
applications.
Conclusion
We showed that human eye movements in edited videos are sensitive to task-specific
cognitive top-down control immediately after view changes across cuts. Contrary to
what the existing research literature suggests, task goals can override influences
of stimulus characteristics and generic viewing biases as early as the first
second following a cut. We described implications for understanding common video
editing practices, and for the improvement of technological applications.