Matthew B Winn1, Dorothea Wendt2,3, Thomas Koelewijn4, Stefanie E Kuchinsky5. 1. 1 Speech-Language-Hearing Sciences, University of Minnesota, Minneapolis, MN, USA. 2. 2 Eriksholm Research Centre, Snekkersten, Denmark. 3. 3 Hearing Systems, Department of Electrical Engineering, Technical University of Denmark, Kongens Lyngby, Denmark. 4. 4 Section Ear & Hearing, Department of Otolaryngology-Head and Neck Surgery, Amsterdam Public Health Research Institute, VU University Medical Center, the Netherlands. 5. 5 National Military Audiology and Speech Pathology Center, Walter Reed National Military Medical Center, Bethesda, MD, USA.
Abstract
Within the field of hearing science, pupillometry is a widely used method for quantifying listening effort. Its use in research is growing exponentially, and many labs are (considering) applying pupillometry for the first time. Hence, there is a growing need for a methods paper on pupillometry covering topics spanning from experiment logistics and timing to data cleaning and what parameters to analyze. This article contains the basic information and considerations needed to plan, set up, and interpret a pupillometry experiment, as well as commentary about how to interpret the response. Included are practicalities like minimal system requirements for recording a pupil response and specifications for peripheral, equipment, experiment logistics and constraints, and different kinds of data processing. Additional details include participant inclusion and exclusion criteria and some methodological considerations that might not be necessary in other auditory experiments. We discuss what data should be recorded and how to monitor the data quality during recording in order to minimize artifacts. Data processing and analysis are considered as well. Finally, we share insights from the collective experience of the authors and discuss some of the challenges that still lie ahead.
Within the field of hearing science, pupillometry is a widely used method for quantifying listening effort. Its use in research is growing exponentially, and many labs are (considering) applying pupillometry for the first time. Hence, there is a growing need for a methods paper on pupillometry covering topics spanning from experiment logistics and timing to data cleaning and what parameters to analyze. This article contains the basic information and considerations needed to plan, set up, and interpret a pupillometry experiment, as well as commentary about how to interpret the response. Included are practicalities like minimal system requirements for recording a pupil response and specifications for peripheral, equipment, experiment logistics and constraints, and different kinds of data processing. Additional details include participant inclusion and exclusion criteria and some methodological considerations that might not be necessary in other auditory experiments. We discuss what data should be recorded and how to monitor the data quality during recording in order to minimize artifacts. Data processing and analysis are considered as well. Finally, we share insights from the collective experience of the authors and discuss some of the challenges that still lie ahead.
In this introductory article, we offer advice on how to understand and
incorporate pupillometry (the measurement of pupil size) as a measure of
listening effort. The target audience includes researchers who have considered
using pupillometry but might not be familiar with the technical or logistical
challenges that are involved. For the purpose of having a standard set of
recommendations in place, the authors have collected their shared
experiences—both good practices as well as pitfalls—in this article. Original
hypothesis-driven research can be found in numerous other publications and
elsewhere in this special issue. But the story of how this
research is done is sometimes hidden out of sight. The point of this article is
to familiarize the reader with the challenges one could come up against when
conducting pupillometry research for measuring listening effort.The attraction of pupillometry is that changes in pupil dilation appear to
distinguish cognitive tasks that are more or less effortful across a wide
variety of domains (Beatty,
1982), including those that do not involve speech intelligibility.
Pupil dilation scales with mathematical ability (Ahern & Beatty, 1979), short-term
memory capacity (Klingner,
Tversky, & Hanrahan, 2011; Zekveld, Kramer, & Festen, 2011),
Stroop task interference (Laeng, Ørbo, Holmlund, & Miozzo, 2011), and resolving ambiguity
in language (Vogelzang,
Hendriks, & van Rijn, 2016). It therefore has the potential to
add value to assessments of speech perception especially where there is reason
to believe that there could be different amounts of cognitive load exerted for
tasks that are not clearly distinguished by task accuracy.One of the most important things to note about pupillometry is that pupil size is
not a monotonic direct index of effort but rather a complicated mixture that
reflects the combined contributions of the autonomic nervous system (ANS; Zekveld, Koelewijn, &
Kramer, 2018). For cognitive-evoked dilations, the response is
nonlinear (Ohlenforst
et al., 2017; Wendt, Koelewijn, Książek, Kramer, & Lunner, 2018); dilations
are small for easy tasks but also small for very hard tasks
(where effort might be withdrawn because of lack of task success). Pupil
dilation therefore appears to be an index of a person’s willingness to exert
more effort because it is worth the exercise of greater mental
resources to achieve a goal. This important concept will return numerous times
in this article, especially as it relates to fatigue and capacity. If a person
is overly fatigued, there is increased likelihood that effort will be reduced
because of less engagement—leading to reduced pupil size (Wang, Zekveld, Lunner, & Kramer,
2018). For the purpose of interpreting pupil size data, this means
that the experimenter should be cognizant of whether changes in pupil dilation
are truly indicative of changes in task-related effort or unintended participantfatigue or disengagement.In the following sections, we begin by introducing the complicated term
effort and the connection that pupil dilation might have
with effort. We then discuss experimental design and planning, including task
selection, constraints of the method, logistics for hardware and data
collection, and considerations for participant inclusion. The article continues
with a review of some essential components of data processing and some recent
insights on physiology of the pupil response. We then contribute some advice on
helpful practices for the experimenter and conclude with some recommended
further reading.
What Do We Mean by Effort?
The Framework for Understanding Effortful Listening (FUEL; Pichora-Fuller et al., 2016) defines
listening effort as the “deliberate allocation of mental resources to overcome
obstacles in goal pursuit when carrying out a [listening] task” (p. 10 S). This
definition highlights that effort arises not only as a result of the difficulty
of the task itself (i.e., the intelligibility of the stimuli) but also of a
result of the active application of the individual’s mental capacities to
overcome an obstacle. Another essential component is the participant’s
engagement or motivation to succeed in a task, which may vary widely across
individuals. The FUEL has its roots in the classical capacity model described by
Kahneman (1973),
who emphasized the role of attention (and arguably used effort
and attention interchangeably; Bruya & Tang, 2018). Kahneman
suggested that effort is “a special case of arousal,” characterizing it as
effort invested in what one is doing, rather than arousal in response to what is
happening to a person (e.g., from loud sounds, drugs, etc.). According to FUEL,
the level of arousal related to the processing of speech in adverse conditions
is reflected by activation of the ANS, which can be measured by the pupil
dilation response.
Why Measure Listening Effort and Not Just Intelligibility?
Listening effort is increasingly recognized as an important aspect of hearing
loss. Hearing difficulties and increased listening effort are reportedly
connected with numerous medical, financial, and occupational challenges (Kramer, Kapteyn, &
Houtgast, 2006; Nachtegaal et al., 2009) as well as feeling of social connectedness
(Hughes, Hutchings,
Rapport, McMahon, & Boisvert, 2018). Hétu, Riverin, Lalande, Getty, and St-Cyr
(1988) reported interviews with individuals with hearing impairment
who mentioned that fatigue related to their hearing difficulties and coping
mechanisms was severe enough that they would be “… too tired for normal
activities” after finishing work.Two listeners might achieve the same intelligibility score but exert different
amounts of effort to do so; pupil dilation appears consistent with subjective
notions of relative difficulty even in these equally intelligible cases (Koelewijn, Zekveld, Festen,
& Kramer, 2012). Because speech perception can involve a variety
of cognitive linguistic skills in addition to auditory processing (Bronkhorst, 2015; Mattys, Davis, Bradlow, &
Scott, 2012), the same intelligibility score can be obtained by a
listener with moderate hearing loss exerting great focus, or by a person with
typical hearing who is listening with less effort (Ohlenforst et al., 2017). Where
audibility fails, cognitive compensation strategies (suppressing irrelevant
information, relying on context, etc.) can compensate (Peelle, 2017; Rönnberg, Lunner, & Zekveld, 2013)
as is often seen in the case of individuals with hearing impairment who show
reliance on context in speech perception (Pichora-Fuller, Schneider, & Daneman,
1995). This greater reliance on top-down mechanisms appears to come
at a cost of decline in other cognitive and physical tasks, or memory of words
heard (e.g., Koeritzer,
Rogers, Van Engen, & Peelle, 2018; McCoy et al., 2005). Listeners with
normal hearing also engage in potentially effortful cognitive processes when
listening to acoustically challenging speech, even when the speech is highly
intelligible and supported by linguistic context (Koeritzer et al., 2018). Thus, an
experimenter or clinician might be interested not only in the ultimate accuracy
in a task, but also the mechanisms used to accomplish the task, and how
effortful it was to complete the task.In addition to the reasons stated earlier, it is also useful to remember that not
all aspects of speech perception are gauged by whether the words are correctly
identified. Other aspects include analyzing and updating a talker’s intention
(Snedeker &
Trueswell, 2004; Tanenhaus, Spivey, Eberhard, & Sedivy, 1995), predicting
upcoming information (Altmann
& Kamide, 1999; Tavano & Scharinger, 2015),
identifying a talker (Best
et al., 2018), perceiving prosodic emphasis (Dahan, Tanenhaus, & Chambers,
2001), translating speech into a different language (Hyönä, Tommola, & Alaja,
1995), and judging whether an utterance makes sense (Best, Streeter, Roverud,
Mason, & Kidd, 2016). These would all be essential components of
speech communication that would not be adequately quantified by a score of
whether words were correctly repeated. Apart from pupillometry, other classic
experimental measures like eye tracking and brain imaging show that there is
value in granular responses that scale with task demands even when
intelligibility is not the outcome measure of primary interest.From the perspective of the audiologist, listening effort is arguably a
worthwhile measurement even in the absence of intelligibility
scores because effort is often the direct complaint of the patient. Gatehouse and Noble
(2004) found that the disability-handicap relationship was governed
by sound identification, attention, spatial aspects hearing, and “effort
problems,” but not “intelligibility of speech.” Although it would not be
controversial to say that speech intelligibility plays a role in increasing
effort, we argue it is not that a word was repeated incorrectly that makes an
event effortful, especially since the listener might not be aware that the
perception was incorrect. Instead, effort likely arises from the related
cognitive processes engaged to correct that error if the listener suspects that
it might have been a mistake, or perhaps to use cognitive strategies to restore
a word that was completely masked by noise.Apart from examining the relationship between effort measures and performance
accuracy measures, it is also worthwhile to consider any sign of effort as an
indication of task engagement, which could be a useful outcome measure in
itself. For example, Teubner-Rhodes, Vaden, Dubno, and Eckert (2017) proposed an
assessment of executive function that they call “Cognitive persistence.”
Individuals who face listening difficulties might avoid challenging auditory
environments (cf. Wu
et al., 2018); tasks that evoke consistent signs of effort could
indicate that the individual is at least willing to attempt the task.Despite the showcase of pupillometry in this article, we remind the reader that
it has not been conclusively established that the laboratory-based pupillometric
measures of effort are directly related to symptoms such as everyday listening
difficulties, susceptibility to fatigue, and poor recognition memory. Hornsby and Kipp (2016)
showcase the need for systematic investigations into this connection and also
highlight the concept of fatigue separately from episodic effort. However,
pupillometry and other measures of effort likely play a fractional role in
establishing those connections through converging sets of evidence and
associations.
The Unique Value of Pupillometry
Considering the success of other measures of effort, such as dual-task paradigms
(Gagné, Besser, &
Lemke, 2017) and reaction times, which might not need such a
complicated set of guidelines; what value is added by pupillometry? There are
multiple benefits that we highlight here, which expand on the commentary on
methodology by McGarrigle
et al. (2014). First, pupillometry is a time-series
measurement. Timing is an essential part of understanding listening effort
because speech demands rapid auditory encoding as well as cognitive processing
distributed over time, rather than being deployed all at once at the end of a
stimulus. Effort might not be uniformly distributed over a perceptual event, and
pupillometric measures have the benefit of showing change in dilation at
different time landmarks. McCloy, Lau, Larson, Pratt, and Lee (2017) showed changes in pupil
dilation in anticipation of a difficult task. Vogelzang et al. (2016) similarly
showed changes in timing of pupil dilations based on pronoun ambiguity in
sentences, followed by anticipatory dilations in preparation for follow-up
questions. These are examples where pupillometry revealed changes in cognitive
load during the test trial, as it related to linguistic
processing during and after perception.Measures of effort each have their own advantages and limitations. Reaction times
are subject to variations in manual dexterity and speech, which might change
with age or physical abilities not related to the experimental task. Pupil size
and range of dilation are also affected by age (Bitsios, Prettyman, & Szabadi,
1996; Kim,
Beversdorf, & Heilman, 2000; Winn, Whitaker, Elliott, & Phillips,
1994) although there are published normalization methods (discussed
later) that capitalize on the reliable pupillary light reflex as a standard of
dynamic range (Piquado,
Isaacowitz, & Wingfield, 2010). Pupillometry is arguably a more
sensitive measure than dual-task cost, which does not provide temporal
information. Compare, for example, measures of spectrally degraded speech
perception by Winn,
Edwards, and Litovsky (2015) using pupillometry and by Pals, Sarampalis, and Baskent
(2013) using dual-task cost. We note, however, that dual-task
measures can be logistically more feasible to conduct and are less affected by
the methodological constraints outlined in this article. Functional magnetic
resonance imaging studies have aimed to reveal the mechanisms that underlie
listening effort via linking pupil dilation to the engagement of both
domain-general attention and sensory-specific brain regions during speech
comprehension (e.g., Kuchinsky et al., 2016; Zekveld, Heslenfeld, Johnsrude, Versfeld,
& Kramer, 2014), during other cognitive tasks (e.g., Siegle, Steinhauer, Stenger,
Konecky, & Carter, 2003), and during spontaneous fluctuations in
alertness (e.g., Murphy,
O’Connell, O’Sullivan, Robertson, & Balsters, 2014; Schneider et al.,
2016).Other neuroimaging methods with faster temporal resolutions, such as
magnetoencephalography and electroencephalogram (EEG), have similarly sought to
establish a neural basis for pupillary indices of listening effort. In fact,
pupil dilation and EEG have been simultaneously registered in multiple studies.
McMahon et al.
(2016) showed that EEG alpha level was comodulated with pupil
dilation for 16-channel vocoded speech, but for conditions of more-difficult
six-channel vocoded speech, the relationship was much less clear. Miles et al. (2017)
followed up with a related study aimed at discerning effects of intelligibility,
finding that unlike EEG results, pupil dilation was related to intelligibility
scores. Interestingly, the two measurements were not correlated with each other,
suggesting that they tap into potentially different cognitive mechanisms.
Further investigations are needed to better understand the potential connection
of different measures.There is a benefit of pupillometry in the context of testing participants who use
assistive devices such as hearing aids and cochlear implants (CIs), which is
that the experimenter can avoid problematic interference of the device with
electrical or magnetic imaging techniques (Friesen & Picton, 2010; Gilley et al., 2006; L.
Wagner, Maurits, Maat,
Baskent, & Wagner, 2018). Similarly, functional magnetic
resonance imaging can provide precise spatial information about the neural
systems engaged during effortful speech processing (Lee, Min, Wingfield, Grossman, & Peelle,
2016; Obleser,
Wise, Dresner, & Scott, 2007) but is not well suited to
individuals with electronic implants, vascular disorders, or for presenting
speech in relative quiet (because of machine noise). Functional near-infrared
spectroscopy is unaffected by such interference and compatible with the use of
implants (McKay et al.,
2016), but it is slower than pupillometry (i.e., it cannot capture
rapid changes), and considerably more expensive than EEG or pupillometry at the
time of this writing. In all, pupillometry is not free from limitations, but is
relatively easy and fast to set up, has a sufficient temporal resolution, is
free from electrical artifact, and is comparatively inexpensive compared with
some other imaging techniques.
Experimental Design and Planning
What Does Pupil Dilation Reflect?
Ranging between sizes of roughly 3 mm and 7 mm (Laeng, Sirous, & Gredebäck, 2012),
the pupil dilates and contracts for multiple reasons (see Zekveld et al., 2018). In normal
circumstances, the largest changes in pupil dilation occur in response to
changes in luminance. When changing from light to dark environments, pupil
diameter can increase by as much as 3 to 4 mm, or roughly 120% (Laeng et al., 2012).
Conversely, the cognitive task-evoked pupil dilations that are central to this
article are much smaller by comparison, on the order of 0.1 to 0.5 mm, depending
on testing conditions and task. Because of these factors, one must manage the
sources of dilation and constriction factors apart from the experimental task so
that an evoked response can be a reliable indicator of the effort exerted during
the task. In addition, the amount of pupil dilation evoked by a task can be
modulated by the participant’s motivation and arousal state (Stanners et al.,
1979), as will be discussed in detail in this article.It is reasonable to consider task-evoked pupil dilation to reflect not a simply
unitary concept of effort but rather some amalgamation of attention, engagement,
arousal, anxiety, and effort (Nunnally, Knott, Duchnowski, & Parker,
1967; Pichora-Fuller et al., 2016). While it is not within the scope of
this article to clarify the distinctions between these interrelated concepts,
they all have been invoked in numerous explanations of the pupillary response
over the years. We use the term “Listening effort” as a useful shorthand tool
that can be understood to capture a union of these concepts as they relate to
hearing (difficulties), but there could be valid reasons to unpack each of these
concepts individually.In agreement with the frameworks described by Kahneman (1973) and Pichora-Fuller et al.
(2016), we highlight the critical role of intentional
attentional engagement in the study of effort. When a person has
motivation to exercise more cognitive resources to a task, it can be understood
in the context of goal-directed behavior, where attention not only has a target
but also an intensity. Attention and effort are highly related and sometimes
studied in tandem. For example, in a study conducted by Koenig, Uengoer, and Lachnit (2017),
there was increased pupil dilation in early stages of attention to consistently
reinforced learning cues, while in later stages of learning when those cues did
not demand as much attention, relatively larger pupil dilations were observed
for ambiguous or unreinforced cues. The pupillary response was associated with a
strategic shift in attention in a goal-directed task. Karatekin, Couperous, and Marcus (2004)
measured significantly larger pupil dilations in conditions of divided attention
in a dual-task experiment conducted to distinguish performance accuracy and
efficiency (stated as “the costs of that performance in mental effort”).Since Kahneman’s
(1973) influential monograph, examination of effort has historically
been tied with the concepts of attention and arousal. Bruya and Tang (2018) are critical of
Kahneman’s binding of attention and effort, suggesting that instead of
characterizing attention as the use of cognitive or metabolic resources, we
ought to instead consider it as the “readying” of metabolic resources in the
form of adaptive gain modulation. Considering the physiological evidence in
studies by Reimer et al.
(2016) and McGinley, David, and McCormick (2015) and some of the speech
perception work described later in this article and elsewhere in this issue,
Bruya and Tang’s suggestion cannot be dismissed. It should be noted however,
that even in Kahneman’s original book, the concept of effort as
preparation is clearly mentioned in the introductory chapter (p.
4).The persistent tradition is to consider larger pupil dilation to be a sign of
increased listening effort, and therefore a negative outcome, compared with
smaller pupil dilation. However, we should not assume that more effort (or
larger pupil response) is always a negative thing. Engagement in speech
communication can be a very productive and satisfying process, but only with
sufficient effort or attention devoted to the input. Increased pupil dilation is
a signal that a listener is at least willing to engage in a
task and therefore could be a positive sign. Take, for example, studies of pupil
dilation across a range of intelligibility. Listener will show larger pupil
dilation for speech that is perceived with 70% accuracy compared with speech
with 25% accuracy (Ohlenforst et al., 2017; Wendt et al., 2018). We should not
conclude that the 25%-intelligible speech was easier to listen
to, but rather that the listener was less engaged in the 25%-correct task
because it was so hard that more engagement would be unlikely to return any
value to the listener. This concept could help the experimenter interpret pupil
dilation not as the effort demanded by a task, but rather the effort actually
exerted by the participants, modulated by the perceived cost or benefit of
expending more metabolic resources. We emphasize this aspect of the measurement
not only to highlight nonlinearities that are less well known but also to
encourage the idea that pupillometry could play a role in exploring the finding
that people with hearing loss appear to select against environments with poor
signal-to-noise ratio (SNR; Wu et al., 2018). Perhaps pupillometric measures of effort or
engagement could reveal that a person is more capable of handling such
situations with a clinical intervention, and therefore an increased dilation
would be a sign of progress and increased confidence to face a wider range of
communication environments.
Task Selection—What Task Properties Will Evoke Pupil Dilation?
The experimental task should ideally demand that a listener exert intentional
effort beyond passive awareness of sounds in the environment. Ideally, there
would be multiple experimental conditions where the participant is motivated to
exert more effort in at least one condition because it will produce better
results. In the following sections, we review some relevant considerations for
guiding task selection.
Stimulus difficulty and listener interest
For reliable and interpretable pupillometry results, there is a balance of
making the stimuli not so easy as to demand too little cognitive effort and
also not so difficult as to make cognitive effort futile (see previous
section, and also Wendt
et al., 2018 and Eckert, Teubner-Rhodes, & Vaden,
2016 for supporting data and discussion). In addition to stimulus
difficulty, the experimenter should also consider stimulus
value to the participant. For example, Eckert et al.
(2016) illustrated how a conversation with grandchildren would
yield higher value than watching a documentary about lint. There will likely
be more engagement (and therefore likely larger pupil dilations) when
listening to the grandchildren, even if the speech is equally intelligible
in both situations. Furthermore, Eckert et al. note that the more valuable
conversation would likely retain its value more strongly through
communication barriers, invoking extra activity from cortical regions
involved in executive attentional control where boring tasks might not,
since they are not worth the metabolic cost.
Basic psychoacoustics
Some basic tasks of auditory detection or discrimination might not demand
cognitive resources sufficient to evoke a strong or consistent evoked pupil
response although some reports do exist. For example, pitch discrimination
elicits smaller pupil dilation in musicians than nonmusicians (Bianchi, Santurette,
Wendt, & Dau, 2016), despite comparable peripheral
sensitivity. Although no consistent pattern of pupil dilation would be
expected if a participant simply hears different sounds coming from
different locations, task-evoked changes do emerge in a task of explicit
sound localization (Bala,
Spitzer, & Takahashi, 2007). Beatty (1982) showed data from a
study of selective attention to individual pure tones, revealing a dilation
pattern that was detectable (and detectably different when tones were
targets or distractors), but the dilations were on the order of 0.01 mm,
which is one tenth the size of those normally reported in the easiest
conditions in many other articles. Without sufficiently powered experiments
with a large number of trials, it is unlikely that such small effects would
emerge in a consistent fashion. Beatty (1982) also notes that
experimenters should take caution to distinguish between tasks of signal
detection, in which pupil dilation increases with
increased certainty (Hakerem & Sutton, 1966) and signal
discrimination, in which pupil size increases with
increased uncertainty (Kahneman & Beatty, 1967).
Because of these complications, much of the literature on task-evoked pupil
dilation concerns more tasks that are more complicated and demanding than
detection of a signal, such as sentence perception, mental manipulation of
input, mathematical problems, and so on.
Speech perception in quiet
For listeners with normal hearing, speech perception in quiet can be
automatic or effortless if it does not come coupled demands no particular
challenge (e.g., syntactic structure, auditory distortion, etc.). It
therefore might not demand substantial cognitive resources to complete,
producing pupil dilations that do not always reliably emerge from the noise
of random pupillary oscillations. Data from Zekveld and Kramer (2014) show
pupil dilations to quiet speech that hover around the baseline levels
although their data were clean enough to illustrate clear interpretable
morphology. In a number of published studies, speech in quiet is presented
with some kind of extra cognitive demand, such as memory load (Johnson, Singley, Peckham,
Johnson, & Bunge, 2014), spectral degradation (Winn et al.,
2015), anomalous semantic content (Beatty, 1982), lexical competition
(A. Wagner, Toffanin,
& Baskent, 2016), competition from a second language (Schmidtke, 2014),
conflict of prosody and syntactic structure (Engelhardt, Ferreira, & Patsenko,
2010), object-focused syntactic structure (Wendt, Dau, & Hjortkjær,
2016), and pronoun ambiguity (Vogelzang et al., 2016). In these
cases, it is crucial to emphasize that the evoked dilations are likely in
response to language processing and not simply auditory encoding.
Linguistic aspects of effort
In each of the aforementioned examples of speech perception studies, some
aspect of language processing was the focal point of investigation. These
studies establish conclusively that the cognitive activity indexed by pupil
dilation does not follow merely from audition alone, but also from language
processing. In another example, Hyönä et al. (1995) found increased
pupil dilation in a task of sentence translation compared
with verbatim repetition of sentences, suggesting that the pupil response
reflects general processing load, not just effort in
listening to the auditory stimulus.Not all speech stimuli demand the same kinds of language or cognitive
processing, and therefore experimenters should guard against the notion of a
unitary category of “speech perception.” In other words, just because
stimuli are speech sounds, they might not elicit typical patterns of pupil
dilation because they do not necessarily entail cognitive processes that
relate to processing of natural speech. For example, it is possible that the
popular style of “matrix” sentences where each word in a sentence is drawn
from a closed set of choices elicits less effort, since most digits and
colors can be distinguished by vowel alone (in English) and therefore might
not reflect the effort needed to understand normal speech. Other sentence
materials might be preferable to examine speech perception with a richer set
of linguistic processes in play. Several studies have successfully used
traditional speech-in-noise tests (such as the Dutch Versefeld sentences,
Danish HINT test, English R-SPIN test, IEEE sentences, and others) and
applied the pupillometry method.Some linguistic stimuli might demand such limited amounts of cognitive
processing that they do not elicit expected effects on pupil dilation. For
example, auditory spectral degradation affects the pupillary response to
sentence-length materials (Winn et al., 2015) but not
recognition of individual spoken letters (McCloy et al., 2017). It is
therefore worthwhile for the experimenter to consider whether the speech
perception task involves some kind of linguistic computation or minimal
auditory detection.
Increasing Motivation and Avoiding Boredom
Motivation will affect the pupillary response (Kahneman & Peavler, 1969). Left
without a goal-directed task, a person’s pupil will change size as the mind
wanders (Franklin,
Broadway, Mrazek, Smallwood, & Schooler, 2013), in a way that
will not be aligned with stimulus presentation. If the task does not give
enough reason for the participant to engage, the pupil size will likely not
give useful results. Monetary incentives have been shown by Heitz, Schrock, Payne, and
Engle (2008) to increase the magnitude of pupillary responses.
When people are curious about the answers to trivia questions, their pupils
dilate more (Kang et al.
2009)—by a small (8% vs. 4%) but detectable amount.Although boredom is to be avoided in order to elicit pupil dilation reliably,
experimenters should also consider avoiding emotional stimuli that evoke
pleasure, disgust, or an otherwise strong physiological response unrelated
to the planned task. For example, sentence materials can be chosen to avoid
notions of violence, sexuality, or trauma. Pupillary responses to
emotionally toned or arousing stimuli were reported by Hess and Polt (1960) in an early
influential paper. More recently, Partala and Surakka (2003) showed
that compared with neutral stimuli, negative-valence stimuli evoked larger
pupil responses, with largest dilations evoked by positive stimuli. If
emotional response is not the target of investigation, then these kinds of
stimuli could be avoided to reduce unwanted variability in the data.
Behavioral Considerations
In most pupillometry studies of listening effort, there is a behavioral
component such as a spoken response or a button press, which can increase
the measured pupil response by as much as 400% (Privitera, Renninger, Carney, Klein,
& Aguilar, 2010) and this amplified response can sustain for
several seconds. When the behavioral contribution is removed through
deconvolution, the task-evoked pupil response is still present but is more
modest and short lasting (cf. Hoeks & Levelt, 1993; McCloy, Larson, Lau, &
Lee, 2016). Similar behavioral contributions to pupil size can be
seen in studies of sentence recognition involving verbal responses. Winn et al. (2015)
and Winn (2016)
showed that pupil dilations from the verbal response were typically larger
than those elicited by the listening task itself. Papesh and Goldinger (2012)
carefully illustrated the effect of motor speech planning (as well as
lexical frequency) on pupil dilations in a study involving cued response
options that alternated between verbatim repetition or substituting “blah”
in place of syllables. In numerous pupillometry studies of sentence
perception, the timing of the behavioral response is so far separated from
the listening response that the auditory-evoked pupil dilations recover
almost completely back to baseline levels, and the behavioral-induced
dilations are thus often not illustrated on published figures (Koelewijn, de Kluiver,
Shinn-Cunningham, Zekveld, & Kramer, 2015; Koelewijn et al.,
2012; Wendt
et al., 2018; Zekveld, Kramer, & Festen,
2010).We recommend letting pupil size return to baseline levels following a trial
(though see studies that have employed deconvolution to tease apart
pupillary effects arising from closely spaced visual stimuli, e.g., Wierda, Van Rijn,
Taatgen, & Martens, 2012). In typical speech-in-noise
testing, it is normally sufficient to wait 4 to 6 s after the completion of
the participant’s verbal response, but for other experiments without
extensive precedent in the literature, we recommend pilot testing involving
extended recording time (e.g., 10 s beyond the stimulus or response) and
inspecting the data to see when the aggregated data return to baseline
levels.Experimenters should be aware of all task events that would invoke
intentional attention, including physical motion. McGinley et al. (2015) found that
20% of variance in pupil size in mice was explained by locomotion; increases
in pupil dilation were substantial and long lasting during motion. In
addition, locomotion has been found to suppress sensory-evoked responses
(Williamson,
Hancock, Shinn-Cunningham, & Polley, 2015).
Experiment Logistics and Constraints
Task selection for pupillometry is somewhat constrained by the measurement technique,
specifically because of the timing of the response and the challenge of avoiding
changes in pupil size that are unrelated to the target task. It is therefore not
advisable to simply add pupil dilation measures to an existing behavioral procedure
that was not designed for pupillometry. Instead, there should be deliberate planning
to design testing methods to suit the nature of the measurement technique. A
compelling reason to measure pupil size should justify the cost and effort of
possibly altering the experimental procedure, based on the desire to obtain
information not available in behavioral methods. The pupil dilation response has
complicated innervation and is affected by a wide range of experiences and stimuli,
so there is an unfortunate amount of noise inherent in any pupil measurement.
However, this noise can be addressed if the experimenter is careful with the
experimental setup and judicious with the monitoring of factors that would affect
physiological measures for any unique testing condition. Absence of these
considerations will undoubtedly weaken the measurement and potentially cause
distrust of the method altogether, undermining the field’s confidence in the
carefully produced studies that do exist.
Number of Trials
In the end, the number of trials (and participants) needed in any experiment will
depend on the effect size of interest and power of the analytical approach.
Generally, the experimenter will want to have at least 16 to 18 good recordings
of pupil size for each condition. In any pupillometry experiment, there will be
missing data because some trials will be dropped due to mistracking,
contamination, or other reasons (e.g., scratch an itch or exercise a sore muscle
can show up as surprisingly dramatic changes in pupil size that is unrelated to
the listening task itself). Hence, it is wise to record a sufficient number of
trials so that the estimation of the task-evoked response will stabilize. For
sentence-perception tasks, 20 to 25 trials are normally a safe starting number.
Fewer trials might be sufficient for listeners who are highly engaged in
demanding tasks, where the effect is expected to be very large.Number of trials for testing can be considered to be inversely proportional to
the difficulty of the experimental task. For a very difficult task, a reliable
large pupil dilation response (i.e., with a large-effect size) can be achieved
with as few as 10 trials. For distinguishing more subtle differences between
similar conditions (e.g., vocoders with different number of channels, small
changes in SNR, linguistic content such as semantic context), a larger number of
trials is advisable. This consideration highlights the importance of authors
publishing measures of effect size along with their statistical tests.
Trial Events and Timing
Trials should have consistent timing of events, for example, the onset of noise,
an alerting sound, the stimulus itself, any cue to prompt a behavioral response,
or any other relevant trial landmark. An illustration of an example trial
timeline is given in Figure
1. Ideally, each trial should start with a drift-correction phase, in
which the participant is required to look at a central fixation symbol before
moving on (though this is not always possible when combining pupillometry with
certain imaging modalities). Timing of each event should be planned carefully
and intentionally. It is advisable to consider separating these events in time,
because the pupillary responses to two events could sum together, obscuring
dilations that arise from listening as opposed to those that arise from
behavioral responses. Specifically, the listening portion of a
trial could elicit a peak dilation, and a second peak in dilation could appear
during the verbal response or button press
portion. Sentence-repetition studies have varied in the duration of this
retention interval, with times ranging from 5 s (Ohlenforst et al., 2017; Zekveld, Festen, &
Kramer, 2013), 4 s (Koelewijn et al., 2012, 2015), 3.5 s (Koelewijn, Versfeld, &
Kramer, 2017), 3 s (Koelewijn et al., 2012; Piquado et al., 2010),
2 s (Winn, 2016),
1.5 s (Winn et al.,
2015), and some studies with variable interval durations (Zekveld et al., 2010;
intervals ranged between 2.1 and 3.5 s), or no reported retention interval
enforced (McMahon et al.,
2016). In addition to potentially convolving the auditory and
behavioral portions of pupil responses, a long retention interval might demand
that a listener use short-term memory (for long intervals) or perhaps create
pressure to rush to complete cognitive processing during a short interval.
Shallower slopes of pupil dilation have been obtained by Zekveld et al. (2010) and Koelewijn et al. (2012,
2015) in numerous
studies with longer retention intervals. A relatively shorter retention interval
of 1.5 s used by Winn
et al. (2015) yielded a relatively steeper slope and larger magnitude
of dilation responses, perhaps because of increased pressure to respond quickly.
In that study, prolonged dilations in difficult conditions of auditory
degradations extended from the auditory-response peak all the way to the
behavioral response peak with little recovery, while responses in easier
conditions yielded quick recovery during the retention interval.
Figure 1.
Events in a basic pupillometry experiment for measuring listening
effort. There are other experimental paradigms that are possible,
this illustrates a commonly used sequence of events.
Events in a basic pupillometry experiment for measuring listening
effort. There are other experimental paradigms that are possible,
this illustrates a commonly used sequence of events.It has become more common for experimenters to introduce a cue that indicates the
timing of an upcoming stimulus. For example, in tests of speech recognition in
noise, there could be leading noise that lasts for 2 to 3 s before the onset of
the speech (cf. Koelewijn
et al., 2012, 2015, 2017; Wendt,
Hietkamp, & Lunner, 2017; Wendt et al., 2018; Zekveld et al., 2010,
2013). There are
at least two benefits of this practice. First, it alleviates the problem of
target-masker separation, whereby simultaneous onset of speech and noise
increases the difficulty of hearing the target signal. In addition, although the
onset of sound could elicit a brief pupillary response, it could orient the
listener so that the target signal of interest does not come as a surprise.
However, the presence (or continuation) of noise after a signal, though common
in published studies, could interfere with language processing, as shown by
Winn and Moore
(2018). As the pupillary response can be slow and long lasting, it is
worthwhile to consider that the analysis window can be after
stimulus delivery.
Stimulus Duration
Most of the literature reviewed in this article features multiple trials of
relatively short duration (2–6 s). For pupillometry, similar to other evoked
measurements like EEG, magnetoencephalography, or auditory brainstem response,
multiple stimuli of the same (or similar) type and duration are played in a
testing block, and the responses are averaged in time. As long as the
stimulus-driven portion of the overall evoked response is time-aligned, the part
that is unrelated to the stimulus should be averaged out, leaving behind only
the relevant task-evoked response. There will likely be cleaner patterns of data
for these time-constrained stimuli compared with untimed stimuli, longer
passages, or entire conversations, where one could not assume the same
progression of cognitive processing landmarks trial-to-trial. Longer passages
might not produce consistent patterns in dilation across stimuli (because of
varying landmarks for parsing, resolution, or chunking) and therefore might have
relevant phasic peaks neutralized by cross-trial averaging of data. The current
article will focus on phasic responses to short stimuli.
Controlling the Visual Field
The amount of pupil dilation or constriction seen in response to changes in
luminance far surpasses the amount of pupil dilation measured for cognitive
tasks. Therefore, it is of critical importance to control the visual field when
measuring task-evoked pupil dilation. Typically, the participant is stationary
and visually fixated on an image that is either completely blank or with minimal
stimulation. This is not to say that other protocols are impossible, but they
would be subject to a higher amount of potentially confounding noise from
movement, luminance effects on pupil size, and so on.The setup in most labs includes a uniform solid color visual field that is
neither too bright nor too dark. The visual field could be a plain wall, or a
computer screen. Bright colors—especially white backgrounds on a computer
screen—are problematic for multiple reasons. First, they could cause excessive
pupil constriction; the cognitive response might not be strong enough to emerge.
Second, they might cause discomfort for the participant, which we have noticed
could result in a larger number of blinks and need for additional breaks during
testing. Task-evoked pupil dilations have been observed reliably in dark-adapted
conditions (McCloy et al.,
2017; Steinhauer
& Zubin, 1982). However, there are at least two cautions against
testing in dark conditions. First, the pupils will dilate to accommodate low
light, leaving less head room for task-evoked dilation. In addition, inspired by
previous work by Steinhauer, Seigle, Condray, and Pless (2004), Wang et al. (2018) has recently shown
that testing with brighter luminance elicits more reliable dilation because the
parasympathetic nervous system releases its “grip” on the sympathetic nervous
system’s dilation-inducing projections to the pupil dilator muscles.
Combining Pupillometry With Eye Tracking
Despite risk of contamination by changes in luminance and gaze position, there
are published studies where pupillometry has been used in studies of visual
search or other visual recognition tasks (described in the next paragraph).
These experiments offer the value of introducing the well-documented effects of
lexical competition and sentence processing that have been studied with the
“visual-world” paradigm, which is notable for providing precise timing
information and insight on perceptual competition. Cavanaugh, Wiecki, Kochar, and Frank
(2014) used a drift-diffusion model to suggest that eye tracking and
pupillometry shed light on dissociable factors relating to decision-making. They
found that gaze fixation time corresponds to rate of evidence accumulation,
while increasing pupil size corresponds to increasing decision threshold (i.e.,
willingness to commit to a decision).Visual aspects of stimuli in a gaze-tracking experiment could affect pupil size
and therefore deserve extra scrutiny in the context of pupillometry. Engelhardt et al.
(2010) used images in conjunction with pupillometry in a sentence
comprehension task but did not publish examples of the images used. Wagner et al. (2016)
used black and white line drawings in a picture-gazing task where lexical
disambiguation led to changes in pupil dilation. It should be noted that in that
study, concurrent gaze changes during pupillometry might have led to unknown
effects on pupil size due to changes in gaze location and changes in the local
luminance of the image on the retina. Using a variation of this method, Wendt et al. (2016)
also used picture stimuli that were controlled to have equal luminance, and
perhaps more importantly were presented before acoustic
stimulus representation so that a pupillary response to the auditory stimuli
would be measured independent of any visually driven changes in pupil size.Although the aforementioned studies demonstrate that pupillometry could be
combined with “visual-world”-style testing paradigms, there are special
considerations to be made, in light of the influence of gaze position and
luminance on pupil size. Kun, Palinko, and Razumenić (2012) reported that even for small
targets (angular radius of 2.5°) changes in luminance can result in changes in
pupil size that can obscure cognitive load-related pupil dilations. However,
Palinko and Kun
(2011) have also demonstrated that when the experimenter has rigorous
control over the placement and luminance of objects in a visual scene, it is
possible to disentangle luminance and task-evoked changes in pupil size. In
realistic everyday conditions, it might not be possible to exert such control.
Kuchinsky et al.
(2013) identified systematic changes in pupil size relating to gaze
position, which were ultimately modeled and regressed out of the data.Minimizing eye movement will likely lead to cleaner estimation of
cognitive-evoked pupil size when using remote eye trackers, because gaze away
from a remote stationary camera can cause a distorted estimation of pupil size,
depending on the algorithm used. Systems that use the long axis of the ellipse
fit to the pupil or that dynamically take into account the rotation of the eye
away from the camera are unaffected by this issue (although one should always
check their data to be sure). Methods for estimating and regressing out the
degree to which a dataset is impacted by gaze position, including the proper
design of a control viewing-only condition, have been described in detail by
Gagl, Hawelka, and
Huzler (2011) and others (e.g., Brisson et al., 2013; Hayes & Petrov,
2016; Kuchinsky
et al., 2013).Another reason to be cautious of eye movements in pupillometry tasks is that the
luminance of visual field will change depending on what the participant is
looking at, at any moment. If they shift gaze from a location with higher to
lower luminance, pupil dilation might increase because of luminance instead of
cognitive activity. Pupil size for a person shifting gaze around a room (or even
around different areas of a screen) would be intractably convoluted with pupil
size from luminance changes (and perhaps also with locomotion). Even if the
visual scenes used are counterbalanced across the conditions of interest, one
could not ensure a priori that participants would look at the displays in a
consistent fashion across trials. In the best-case scenario, in which viewing
patterns were relatively consistent, the added source of noise stemming from
unpredictable changes in local luminance with fixations may minimize one’s
ability to detect differences across conditions.
Data Collection
Data Quality Monitoring
Data contamination should be detected as soon as possible—during testing.
Real-time monitoring of the eye or the recorded pupil diameter shows blinks or
other dropouts of data. If real-time monitoring is not an option, the estimation
of pupil size could be displayed for the experimenter at the end of every trial,
to see if something is amiss. Even in the absence of clear problems like head
movement and shuffling posture, the pupil response can fatigue after several
trials or can show a pattern of fluctuation—called hippus (see
Figure 2). When
hippus is observed, it is advisable to delay advancing to the next trial until
the pupil size has stabilized. When it persists for over 10 s, this process can
be aided by breaking the monotony of trials with a quick break to chat with the
participant, or a brief irrelevant task (e.g., “look up to the corner of the
room … now look back”). If the experimenter is not able to examine the time
series of pupil size for each trial, it is at least recommended to monitor the
eyes using a video stream of the pupil (as it is provided by most of the
traditional cameras). Pupil size changes with mind wandering (Franklin et al., 2013),
and the participant’s mind might wander during a long and monotonous testing
session. For that reason, it can be beneficial to introduce some variety or
challenge to keep the participant alert.
Figure 2.
Pupillary hippus, or small ongoing fluctuations in
pupil size that are unrelated to an external stimulus.
Pupillary hippus, or small ongoing fluctuations in
pupil size that are unrelated to an external stimulus.Data quality will likely change over the course of a long-testing session. It is
common to observe a general decrease in pupil dilation over time, both in terms
of baseline level and magnitude of dilation response. For experiments up to 1 to
1.5 h, these effects do not show up as significant. However, McGarrigle, Dawes, Stewart,
Kuchinsky, and Munro (2017a) have shown an effect of task-related
fatigue in pupil response during a longer sustained listening task. We have
found that in typical sentence-perception experiments (with noise, or some other
auditory distortion), fatigue is avoidable for most listeners if testing blocks
are 2 h or shorter. Participants vary in how long they can engage and also their
willingness to communicate their need for a break. Experimenters should remain
vigilant for changes in participant alertness so that they can initiate breaks
and avoid unwanted fatigue. Monitoring of data can reveal that the test is long
enough that the participant is changing physiological state or alertness. After
some reasonable number of trials (e.g., 25 trials in a sentence-recognition
task), a break of a few minutes can refresh the listener.Longer experiments could be split into different testing sessions although
experimenters should be careful about splitting different compared conditions
across different days, in case there are sizeable differences in pupil size or
dynamic range for an individual across days. Be mindful that performance in a
task can be situationally dependent and can vary by the day (Veneman, Gordon-Salant,
Matthews, & Dubno, 2013). When possible, trials for different
conditions could be interspersed or presented in alternating short blocks in the
same testing period. The experimenter wants to ensure that the participant is in
the same physiological state for each tested condition, so that any differences
in pupil dilation are due to the task and not other unintended differences.Pupillometry experiment setup and delivery improves with tester experience (just
as for other methods such as EEG, where one detects when data are too noisy,
develops criteria for removing an electrode, applying gel, etc.). One becomes
more familiar with troubleshooting calibration and other unique situations over
time, so early struggles with the method should not necessarily be taken as a
sign that it will not be fruitful. Based on prior experiments and guidelines
collected in this article, one could identify aspects of the testing procedure
that would deviate from traditional psychoacoustics, like the increased
interstimulus interval and extra attention to test difficulty and likelihood of
participant disengagement.The quality of pupil dilation measurements improves with attention to participantfatigue, comfort, readiness, and head movement. Although these factors might be
noticeable in other types of behavioral psychoacoustic experiments, their
effects might be even more damaging to a physiological measure like
pupillometry. Examination of raw data (rather than aggregated smoothed averages)
gives the experimenter a chance to identify situations that indicate that
corrective action should be taken to the test protocol. For example, although
blinks are normally not a problem (because they can be removed and smoothed over
in postprocessing), an unusually large amount of blinks might indicate fatigue
or a too-bright screen. Participants might also give long and tense blinks just
at the moment of response, potentially erasing an important piece of the data.
Consistently high variability in pupil baseline level before each stimulus might
indicate that not enough time has passed since the last stimulus or response, as
the pupil size might still be coming down from an evoked dilation.Because attention to the aforementioned factors will likely improve with
experience, we recommend that the testing procedure be at least as consistent
and regimented as one would be in any other scientific procedure. It is also
advisable to have testing be performed by those who have at least some practical
experience with the method, perhaps by repeated practice and shadowing of more
experienced lab members first.
Participant Inclusion and Exclusion Criteria and Other Considerations
Eye color
Most eye trackers are robust to differences in eye iris color, but there are
occasional difficulties with very dark irises (for light-detecting systems)
and light irises (for dark-detecting systems).
Makeup
Participants should be encouraged to avoid the use of mascara and eye-liner,
as it can be erroneously detected as the pupil.
Age
Older listeners show generally weaker pupil dilation responses to light
(Winn et al.,
1994). Following Piquado et al. (2010), a control
task that measures dynamic range is recommended when comparing younger and
older adults.
Hearing status
Smaller amounts of pupil dilation are routinely observed in listeners with
hearing loss and older listeners compared with young control groups with
typical hearing (Koelewijn, Shinn-Cunningham, Zekveld, & Kramer, 2014). There
is likely more than one reason for this, including listening fatigue
draining a listener’s cognitive resources, age-related atrophy of pupillary
dilator muscles, or some other factors. It does not necessarily mean that
the tasks performed by older or hearing-impaired listeners are regarded as
less effortful. It could mean that they are devoting less intentional
attentional engagement because they are conserving energy in a continuously
exhausting task.
Pharmacological effects
Drugs can impact the ANS, which will affect the pupil dilation response.
Steinhauer et al.
(2004) report that blocking the sympathetically mediated
alpha-adrenergic receptor of the dilator enables targeted measurement of the
parasympathetic branch, while blocking of the muscarinic receptor of the
sphincter muscles allows only contributions of the sympathetic branch. They
showed that tropicamide (a parasympathetic ANS activity blocker) eliminated
differences in the task-evoked response, while dapiprazole (a sympathetic
ANS blocker) merely decreased pupil size while maintaining the phasic
task-evoked response. It could therefore be especially important to guard
against drugs that affect the parasympathetic nervous system. Common
muscarinic antagonists that are used to treat for Parkinson’s disease,
peptic ulcers, incontinence, and motion sickness are all likely to inhibit
the pupillary response.
Caffeine
Pupil dilations are larger after ingestion of caffeine (Abokyi, Oqusu-Mensah, & Osei,
2017). Caffeine has been shown to affect the pupil response for
up to about 6 h, particularly in people who do not routinely consume it
(Wilhelm, Stuiber,
Lüdtke, & Wilhelm, 2014).
Eye diseases
Some conditions might affect the biological function or appearance of the
eye, such as cataracts (lowers contrast between iris and pupil), nystagmus,
amblyopia (“lazy eye”), and macular degeneration.
Anything that affects visual fixation and tracking ability
Tracking can be compromised by attention deficit problems, severe fatigue.
Tracking quality is sometimes affected by hard contacts and glasses
(especially bifocals where refraction will change depending on eye position
with respect to the lenses) although glasses do not always pose a problem
and can usually be discarded in situations where there are no visual
stimuli.
Head injury or any history of neurological problems
These issues can affect gaze stability, congruence of eye movements (Samadani et al.,
2015), and pupil dilation (Marmarou et al., 2007).
General hearing ability (avoiding floor-level intelligibility)
Participants who are unable to complete a task successfully will likely show
reduced pupil dilation, because they might be more likely to abandon effort
on the task.
Native language
When completing a task in a nonnative language, greater pupil dilation is
observed, and some effects of language processing will deviate from those
observed in native listeners (Schmidtke, 2014).
Fatigue
Although fatigue is obviously related to the study of effort, it can actually
be a barrier to measurement of short-term task-evoked pupil dilation.
Fatigued listeners will show a weakened pupillary response. McGinley et al.
(2015) provide a clear and physiologically grounded explanation
for the preference to test participants in a quiet and alert state, avoiding
both fatigued and hyper-aroused states. Task-induced fatigue might be
reflected in the baseline value of the pupillary response over the course of
the experiment (i.e., lower baseline toward the end of the experiments).
Chronic fatigue (need for recovery) effects the pupillary response as well
(see Wang et al.,
2018).
Measuring Pupil Dilation in Children
Relatively few published studies have used pupillometry to measure listening
effort in children. Of those that have (e.g., Johnson et al., 2014; McGarrigle, Dawes, Stewart,
Kuchinsky, & Munro, 2017b; Steel, Papsin, & Gordon, 2015),
the age range appears to begin at 7 or 8 years. It is possible that the
intentional attention mechanisms employed by adults and older school-aged
children reflect cognitive activity that would simply not be invoked reliably by
younger children. Furthermore, logistical constraints such as stabilized-head
position, sustained attention, and patience for a very plain unstimulating
visual field would certainly make pupil measurements in young children very
difficult, even if their cognition and language skills were mature. It is
therefore possible that pupillometry is not the ideal effort measurement to use
with very young children. Later, we describe some studies of school-aged
children and some related work on pupillometry in other young populations.McGarrigle et al.
(2017b) tested school-aged children (age 8–11 years old) and
successfully measured differences in pupil dilation related to SNR. Notably,
these SNRs did not produce differences in intelligibility, suggesting that
children, like adults, can achieve the same score using different amounts of
effort. Furthermore, behavioral response time did not distinguish the two
listening conditions. McGarrigle et al.’s data demonstrate that it is feasible
to use pupillometry for children of an age where attention and engagement are
dependable, for at least 40 minutes. Incidentally, measurements in school-aged
children with hearing loss might be more feasible, given their
experience of annual (or more frequent) hearing tests that require the sustained
attention and behavior that is somewhat reminiscent of pupillometry tasks.Johnson et al. (2014)
measured pupil dilation in children aged 7.5 to 14 years and obtained results
that indicated reliable differences between children and adults on a short-term
memory overload task. Specifically, dilation magnitude grew as memory demands
increased, up to a plateau; adults’ dilations continued to grow up to a higher
plateau (eight items), while children showed a reversal of dilation patterns
after a smaller number of items (6) had been reached.Steel et al. (2015)
measured pupil dilation in 11 - to 15-year-old children, but the experimental
design was in some ways not optimal for pupillometry as much as it was for tests
of binaural fusion. They measured peak pupil diameter for a 2-s window following
stimulus onset, in an experiment where average reaction times spanned a range of
2 to 3.5 s, possibly resulting in the exclusion of true peak dilation which
likely occurred after the pupil data recording period. Correlations between
binaural hearing and pupil dilation in that study were reported but appear to be
driven by overall group differences rather than within-group binaural hearing
ability and also were affected by ceiling effects and general effects of
age.Pupillometry in children younger than 8 years is rare and is typically used for
purposes other than listening effort tasks. Recovery latency of pupil dilations
has been used as a biomarker for children at risk for autism spectrum disorder
(ASD; Martineau et al.,
2011; Lynch,
James, & VanDam, 2017). Pupil size was also reported to be a
biomarker for ASD by Anderson
and Columbo (2009) although that study included a small number of
participants, and, despite statistically detectable differences, data for the
ASD group fell within the range of the control group.Measuring pupil diameter in young children during listening tasks is a
substantial challenge, for both theoretical and logistical reasons. Changes in
pupil size can be measured in 8-month old infants in reaction to surprising
physical events (Jackson
& Sirois, 2009), and both 6 - and 12-month old infants show
increased pupil dilation to odd social behaviors (Gredebäck & Melinder, 2010). Thus,
the pupil response can be measured; for the purpose of this article, in question
is whether the assumptions that we make about the nature of language processing
and goal-directed task engagement used by adults in speech recognition tasks
could generalize to very young listeners.
Hardware
Trackers
It is not within the scope of this article to recommend a particular product,
especially because products continue to be improved with each year. A
majority of pupillometry articles in the area of listening effort have used
traditional eye trackers that might more commonly be used to track eye-gaze
direction. They come in many varieties, including remote cameras (that sit
on a desk beneath a monitor display), tower stands (which record a
reflection of the eyes akin to a teleprompter in reverse), and eyeglasses
outfitted with cameras. Many of these instruments also report an estimate of
pupil size, with some degree of error. Quality of the camera and quality of
the software algorithms for calculating pupil size are of extreme
importance, for three main reasons. First, the pupil is small, so the amount
of noise in the pupil size estimation must be limited. Second, the time it
takes for the system to recover from losing track of the pupils (in the case
of a blink, or a look off-screen) can result in the loss of valuable data.
Finally, a change in pupil size can be indistinguishable from a change in
distance to the camera unless head position is stabilized, or if there are
supporting measurements made, like distance. Trackers that report absolute
pupil size (in millimeters) necessarily must complete such a calculation,
albeit sometimes without transparency in how it is done. Some trackers
instead report pupil size in arbitrary units, akin to the number of pixels
that the pupil occupies on a camera image. In addition, while some eye
trackers model the rotation of the eye away from center or correct pupil
size for gaze position in other ways, other trackers do not, and thus extra
caution (such as applying correction factors; see Brisson et al., 2013; Gagl et al., 2011;
Hayes & Petrov,
2016) must be taken into account when measuring pupil size in
experiments that also feature eye movements.
Clinical pupillometers
Hand-held clinical pupillometers—as used for neurology, ophthalmology, and
emergency medicine—have the advantage of being user friendly (via automated
routines), less expensive than some full-fledged video-based eye trackers,
and designed specifically for accuracy in measuring pupil size. However,
they might not have been designed for research, which could result in
limitations on recording time, lack of connectivity with popular experiment
delivery software, or lack of synchronized event tagging.
Chin rests or other head stabilizers
Pupil size can be estimated more reliably if the distance from the eyes to
the camera remains constant (particularly for trackers that do not
automatically attempt to correct for distance). It is customary to use a
stabilizer such as a chin rest, akin to what could be used at an
optometrist’s office. However, chin rests are not always comfortable for
participants, especially when they are giving verbal responses, or if it
requires them to lean forward unnaturally. An alternative solution is to
have the participant lean back to have her or his head position stabilized
on the top of a sturdy and stationary chair.
Seating
Sturdy stationary (not rolling) chairs will make measurement easier. The
participant’s comfort should be taken into consideration even more than for
a traditional psychoacoustic experiment, because the act of shifting posture
or tensing muscles will show up as changes in pupil dilation. A
height-adjustable chair akin to a hairdresser’s chair (or height-adjustable
table for the camera) is advisable to maintain a constant viewing angle and
comfort for all participants. There are also chairs used for EEG recordings
that have adjustable headrests to avoid muscle tension in the neck.
Room lighting
Light should be homogeneous in the whole room so that if a participant looks
around, it won’t cause a reflexive dilation in response to changing
luminance on the retina. Soft lighting is best, especially if it is
adjustable for individuals (Zekveld et al., 2010). A range of
10 to 200 lux, with a median for older adults around 30 lux and for younger
adults around 110 lux depending on the dynamic range of their pupil. As a
reference, a normal in offices is around 300 to 500 lux.Brighter luminance produces more reliable dilations than dark settings (Steinhauer et al.,
2004; Wang
et al., 2018) but take caution that too-bright lighting
(especially projected directly at a participant from a computer screen)
might also cause discomfort and high number of blinks. A moderate mid-range
gray color background on a computer monitor or a plainly lit wall target
avoids these issues of discomfort.
Handling of Raw Data
Sampling rate
The pupil changes size slowly, so a sampling frequency of 30 Hz or higher is
sufficient. Very high sampling frequencies (e.g., above 120 Hz) of some
trackers would be beneficial for studies of precise saccade timing but are
not necessary for most pupillometry studies.
Data transfer
To ensure that stimulus timing landmarks are recorded and synchronized with
corresponding timestamps in the eye tracking or pupillometry data, the
experimenter should be sure that time tracking would not be compromised by
the use of a single computer to handle all of the processing. There are
two-computer solutions that use physically separate computers for experiment
delivery and tracker data collection, with ethernet or USB links for data
transfer. Timing is not as delicate an issue as it is for other methods such
as EEG; there are also single-computer pupillometry solutions, which can be
sufficient since pupillary responses are slow enough that a drift of 30 ms
(less than the duration of one sample at 30 Hz) should not affect the
quality of data.
Monocular and binocular tracking
The pupils should show congruent dilation patterns (Purves et al., 2004), so binocular
tracking might not offer any substantial advantage over monocular tracking,
apart from the opportunity to pick the eye that produces the fewest missing
data samples.
Stimulus Timing
Of critical importance is waiting for the pupil to return to baseline size before
the next trial. The duration of this interval will depend on the experimental
task. Heitz et al.
(2008) found that larger dilations on difficult test trials affected
baseline levels for subsequent trials, even with interstimulus intervals of
3.5 s. Sentence repetition tasks might require nearly 4 to 6 s following the end
of the participant’s verbal response (discussed in further detail later).
Response Timing
The pupillary response takes up to 1 s to emerge, with estimates ranging from
roughly 500 ms to 1.5 s (Hoeks & Levelt, 1993; Verney, Granholm, & Marshall,
2004). McGinley
et al. (2015) found that the derivative of the pupil was correlated
to the pupil diameter 1.3 ± 0.7 s after corresponding cortical oscillations.
Peak timing in sentence-recognition experiments appears to follow the same time
course, emerging typically 0.7 to 1 .2 s following stimulus offset (Winn, 2016; Winn et al., 2015).
Systematically longer stimuli elicit longer latency to peak in situations where
duration differences are known by the participant before the trials begin (Borghini, 2017; Winn & Moore,
2018).
What Data to Record
In addition to pupil dilation, the experimenter should record accurate timestamps
of the onset and offset of a stimulus, the timing of behavioral response (if
any), and the horizontal and vertical gaze positions of the eye. Timestamps will
be used to aggregate and align data, and the gaze coordinates can be used to
ensure fixation at a target, as well as to covary gaze position with pupil size
estimation.
Data Processing
Raw pupil data must be processed in several steps before analysis and visualization.
Figure 3 illustrates
common steps in treating pupil data, described later.
Figure 3.
Sequential steps of data processing. Raw data (black, marked no. 1)
contain blinks that appear as transient changes in pupil dilation
separated by a blank stretch of missing data. De-blinked data (no. 2, in
red) expands the gap of missing data to remove the transient excursions.
The gaps are interpolated (no. 3 in blue, interpolations in dashed
lines). Finally, the data are low-pass filtered (no. 4, green).
Sequential steps of data processing. Raw data (black, marked no. 1)
contain blinks that appear as transient changes in pupil dilation
separated by a blank stretch of missing data. De-blinked data (no. 2, in
red) expands the gap of missing data to remove the transient excursions.
The gaps are interpolated (no. 3 in blue, interpolations in dashed
lines). Finally, the data are low-pass filtered (no. 4, green).
De-blinking
Blinks are generally not a problem if they are quick
(<125 ms) and uncorrelated with stimulus timing landmarks. They are typically
brief enough that they can be identified, removed, and interpolated in the data
without substantial change to the overall pattern. It is therefore typical to
not give any explicit acknowledgment of blinks during testing, to avoid
conscious awareness or patterned blinks. In the case that participants exhibit
blinks that are time-locked to trial events (e.g., always blinking at stimulus
offset, at onset of verbal response, etc.), then there is risk of data
distortion. A. Wagner
et al. (2016) addressed this issue by incorporating a Blink
now event at the end of each trial.Klingner et al. (2011)
performed a thorough analysis of 20,000 blinks to estimate the expected
perturbation of pupil size. They report that the pupil size before the blink
undergoes a very brief dilation of about 0.04 mm, followed by a contraction of
about 0.1 mm and then a gradual recovery to preblink diameter over the next 2 s.
They also reported that the difference in statistical results did not change
with the incorporation of a blink correction algorithm. It is likely the case
that interpolating across blinks is a safe practice that will not affect results
in any meaningful way. However, we recommend that interpolation begin roughly
50 ms before the blink and end at least 150 ms after the blink in order to avoid
task-uncorrelated high-frequency changes in pupil size. Note that when a pupil
trace consists of a larger percentage of blinks (>15%–25% of the relevant
recorded time), interpolation might result in a flat trace that no longer shows
a pupil dilation response. These traces should not be used for further
analysis.
Low-Pass Filtering
Klingner et al. (2011)
analyzed binocular pupil measurements and found that changes faster than 10 Hz
are uncorrelated across the eyes, thus justifying low-pass filtering at 10 Hz.
High sampling rates are therefore not essential for pupillometry. However, it
would be advisable to maintain a sampling rate high enough to distinguish
between fast and slow pupil responses (in terms of derivative or rate), as they
have been shown to be driven by different neural systems (Reimer et al., 2016). Numerous studies
report using an n-point smoothing average filter in place of an explicit
low-pass filter.
Baseline Correction
The most common method of quantifying pupil dilation is not to report absolute
pupil size but instead to report change in pupil size relative
to the time immediately before the stimulus (Beatty & Lucero-Wagner, 2000).
Collecting literature from a variety of studies (Bradshaw, 1969, 1970; Kahneman & Beatty, 1967), Beatty (1982) argued
that task-evoked changes in pupil size are independent of baseline pupil size,
at least for intermediate tonic sizes. Reporting baseline-subtracted absolute
pupil size is common (Beatty,
1982; Kramer et al., 1997; Zekveld et al., 2010) but we are
unaware of any empirical verification of Beatty’s claim. Since baseline pupil
size typically will vary across people, vary within people across time, and will
gradually diminish over the course of a testing session, this is a topic
deserving of exploration. One approach is to drop the first few trials because
baseline levels are substantially higher during the onset of a testing session
but quickly stabilize after roughly five trials (Wendt et al., 2016; 2018). Apart from
discarding trials at the onset of a session, it appears that the common practice
is to handle these unknown sources of variability as one would handle other
population-level and trial-level sources of noise, by recruiting a sufficiently
large number of participants and presenting a large number of trials.Duration of baseline intervals ranges from 100 ms (Karatekin et al., 2004) to 2 s (Ayasse, Lash, & Wingfield,
2017). Figure
4 illustrates how variation in the absolute baseline duration should
play no substantial role in reporting pupil dilation. For baseline durations
extending from 100 to 3000 ms, the baseline-corrected data are virtually
identical. However, there are some factors that probably contribute to the
common practice of using 1 s. First, it is a duration long enough that a single
blink would not eliminate all data from the baseline window. However, it is
short enough that to hopefully minimize influence of pupil dilations from a
previous trial, as long as there is a sufficient intertrial interval.
Figure 4.
Different baseline intervals end at the onset of the stimulus and
extend backwards by variable durations (highlighted in each panel by
a shaded vertical area). Comparison of the resulting
baseline-corrected data is shown on the far-right panel, revealing
negligible differences across baseline durations.
Different baseline intervals end at the onset of the stimulus and
extend backwards by variable durations (highlighted in each panel by
a shaded vertical area). Comparison of the resulting
baseline-corrected data is shown on the far-right panel, revealing
negligible differences across baseline durations.Baseline is typically computed for each trial, rather than for a whole test
session, since the baseline level typically will drift downward over the course
of a session (which, when using a single baseline average, would result in
underestimation of dilations for early trials, and overestimation of dilation
for late trials). To avoid single-trial erratic deviations in baseline that are
unrelated to the stimulus (e.g., random large excursions in baseline size, or a
long blink), one could also compute each trial’s baseline as a rolling average
over a number of adjacent trials.To elicit a more consistent baseline level across trials within a participant,
one could use consistent timing landmarks and alerting stimuli to signal for
trial onsets. For experiments of speech in noise, this could mean having a
consistent duration of noise before stimulus onset so that the participant knows
when to listen. For speech in quiet, Winn and Moore (2018) have used an
alerting beep denoting trial onset, with the intention of avoiding any
surprise responses to the target speech. Unexpected
patterns on baseline levels should be explored by viewing raw trial-level
data.Baseline pupil size, while not always reported in many published studies of
listening effort, has been a subject of investigation. Some authors have
suggested that tonic pupil size is a measure of global arousal levels (Granholm & Steinhauer,
2004), with alternative viewpoints framing tonic pupil size as an
indicator of attentional capacity (Kahneman, 1973). Based on a review of
various physiological studies done with animals, Laeng et al. (2012) suggest that tonic
activity (i.e., not stimulus-time-locked) of the locus coeruleus (LC; indexed by
pupil dilation, to be discussed further later) indicates the likelihood of
abandoning a current task for another, while phasic activity signals the
processing of attended task-relevant events. Expanding and updating this idea,
physiological work with mice (McGinley et al., 2015) suggests that
tonic pupil size is related to moment-to-moment readiness for sensory detection,
with intermediate sizes measured during trials with better task performance and
reduced response variability. Conversely, very low tonic size was interpreted as
a sign of indicating drowsiness, and very high tonic size was also found to be
suboptimal, consistent with hyper-activity that would cause asynchronous
cortical activity.Figure 5 illustrates two
approaches to baseline correction methods that have been observed in the
literature: absolute subtraction and proportional transformation (which can be
considered an additional follow-up step following baseline subtraction). In the
top panel, changes in raw pupil size are shown from two hypothetical
participants. Participant 1 shows greater pupil size and greater apparent
difference between dilation sizes in response to two different stimulus
conditions (indicated by line color). However, the apparent lack of condition
effect for Participant 2 is simply hidden by baseline differences. When
subtracting baseline size to yield an absolute change in pupil size, Participant
1 retains a higher overall change in dilation, but the differences between
conditions are now apparent for Participant 2 as well. When analyzing
proportional differences relative to baseline, the two participants’ responses
are transformed to look virtually identical. The consequences of these and other
methods should be considered in situations where participants in different
comparison groups have different overall dynamic range of pupil reactivity or
substantial changes in baseline (as a result of, e.g., age differences, or
testing at different times). It is worthwhile to consider Beatty’s (1982) aforementioned assertion
that absolute dilation is independent of baseline size (which would undermine
the proportional method) but to also consider data from Piquado et al. (2010) who normalized
dynamic range to overcome substantial differences in pupil reactivity across
younger and older participants. Systematic study and replication will hopefully
discern the optimal way to treat data with variable baseline size.
Figure 5.
Illustration of baseline correction and proportionalization of data
for two hypothetical individuals each participating in two testing
conditions indicated by line color. Raw pupil size is shown on the
upper panels, absolute change (mm) in pupil size is shown in the
middle panels, and proportional change in shown in the lower panels.
Baseline intervals consisted of the 1-s of data prior to stimulus
onset indicated at time 0.
Illustration of baseline correction and proportionalization of data
for two hypothetical individuals each participating in two testing
conditions indicated by line color. Raw pupil size is shown on the
upper panels, absolute change (mm) in pupil size is shown in the
middle panels, and proportional change in shown in the lower panels.
Baseline intervals consisted of the 1-s of data prior to stimulus
onset indicated at time 0.
Normalization
An equivalent amount of pupil dilation across two participants could be more
meaningful for the one whose pupil has a smaller dynamic range. One such known
contributor to overall dynamic range is aging. Older individuals tend to have
pupils that are smaller in size, with a more restricted range of dilation, and
which take longer to reach maximum dilation or constriction (Bitsios et al., 1996).
In light of interindividual differences in pupil dynamic range, normalization
methods are sometimes applied. Following baseline correction (e.g., subtraction
of baseline level), local deviation from baseline can then be expressed in
numerous ways, including percent or proportional change from baseline (Hess & Polt, 1964;
Johnson et al.,
2014), percent change of average experimental trial value versus
average control trial values (Payne, Parry, & Harasymiw, 1968),
within-trial mean scaling (Kuchinsky et al., 2013), grand-mean scaling (Engelhardt et al., 2010),
z-score transformation (McCloy et al., 2016), expressing
dilation as a proportion of the individual’s dynamic range of pupil size (Piquado et al., 2010),
or as a proportion of a reference peak dilation within the individual (Winn, 2016).Each of these methods has advantages that might suit some experimenters’ needs.
For example, the percent-of-range and z-score calculations
address interindividual differences in variability in dilation
(which might be useful in situations where engagement is different, such as for
responses by normal-hearing and hearing-impaired listeners), whereas the others
can correct for average differences and the method used by
Piquado addresses reactivity or dynamic range, which could be
important for experiments pertaining to aging. It remains unknown whether the
percent or proportional methods are immune to overall differences in baseline
magnitude, but these methods—as well as completely nonnormalized methods used
for instance by Zekveld
et al. (2010) have still been found to be effective in identifying
task-related differences in pupil size within participants even when averaged
across a number of participants who likely differ in pupil reactivity.Figure 6 illustrates a
hypothetical situation that illustrates an application of Piquado et al.’s (2010) method, in
which change in pupil size is expressed as a proportion of the full dynamic
range elicited by the pupillary light reflex. The apparently larger change in
amount of evoked pupil dilation in “Participant 3” is rendered equivalent to
that obtained in “Participant 4”; twice the change was observed for a pupil with
twice the dynamic range. An alternative method has been used by Winn (2016) and Winn and Moore (2018),
who compared the pupil dilation in one condition to the peak dilation obtained
in a reference condition. Proportional change between the two conditions was
considered to be normalized within individuals and therefore free from
individual differences in pupillary reactivity. An intriguing area of future
research could be to create a control task that determines the cognitive dynamic
range of the pupil (e.g., a working memory task ranging from one item until
cognitive overload). The pupil response could then be reported as a percentage
of this cognitive-evoked range rather than the
light-evoked range.
Figure 6.
Different amounts of change in pupil dilation for two hypothetical
individuals who have different dynamic range of pupil size. In each
panel, the difference in peak pupil dilation occupies the same
proportion of the overall dynamic range.
Different amounts of change in pupil dilation for two hypothetical
individuals who have different dynamic range of pupil size. In each
panel, the difference in peak pupil dilation occupies the same
proportion of the overall dynamic range.Although there is no consensus gold standard method of normalization, this
problem has been commonly addressed (or rather avoided) by (a)
analysis of differences within the same participants across conditions, with the
assumption that individual effects would be common to all tested conditions or
(b) the use of sample sizes large enough to overcome the variability, with the
assumption that individual differences in pupil reactivity would be
approximately equally represented in comparison groups.Contrary to the aforementioned attempts to normalize differences in pupil
reactivity, these differences might be a meaningful outcome measure. For
example, Koelewijn, van
Haastrecht, and Kramer (2018) found that participants with history of
traumatic brain injury showed reduced phasic pupil dilations, despite reporting
generally higher subjective effort ratings. In addition, peak pupil dilation
correlated negatively with participants’ speech reception threshold, suggesting
a decrease in phasic activity with a decrease in hearing acuity. Another example
is given in by Jensen et al. (2018) who examined the impact of tinnitus and a
noise-reduction scheme on the pupil response. They found that participants with
tinnitus generally reported a greater need for recovery and showed smaller
task-evoked pupil dilations (in contrast to the hypothesis of greater
effort—greater dilations). The results of these studies suggest that there is a
possibility of interpreting reduced pupil dilation not merely as reduced effort,
but potentially as lower capacity (consistent with Kahneman’s [1973]
framework), or perhaps as reduced ability to maintain engagement. The multitude
of possible interpretations highlights the need to design experiments that
differentiate related concepts such as effort, attention, engagement, and
fatigue.
Baseline Analysis
Numerous studies propose that the baseline (or tonic) pupil size
is not simply noise to be discarded or normalized but rather inspected for its
own sake. Gilzenrat,
Nieuwenhuis, Jepma, and Cohen (2010) suggest that increases in
baseline pupil diameter reflects disengagement from a task (specifically, to
explore more useful tasks), whereas smaller baseline diameter would indicate
increased engagement, and also reveal relatively larger task-evoked dilations.
McGinley et al.
(2015) have observed that tonic pupil size is related to the accuracy
of performance in psychoacoustic tasks by mice. They further provide a framework
for using tonic pupil size (i.e., pupil size not evoked by a specific task or
stimulus) as an index of general arousal (or “brain state”), with medium-level
arousal yielding optimal performance for sensory-cognitive tasks. Heitz et al. (2008)
corroborated this finding in humans, finding larger prestimulus tonic pupil size
for participants with higher working memory span compared with those with
shorter memory spans. Even researchers primarily interested in the task-evoked
pupil response should examine baseline epochs to ensure differences in peaks are
not due to systematic, condition-specific biases in the baseline window. If such
patterns emerge, then the baseline-corrected and normalized dilations should be
(a) interpreted with extra caution and (b) motivate inspection of the
experimental protocol to discover any unintended bias. We do not necessarily
recommend discarding these data, since the baseline pattern might be indicative
of an effect that is worth interpreting or using to motivate a follow-up
experiment. For example, if experiments are to be conducted at very different
times of day, differences in arousal or alertness might affect results (Veneman et al.,
2013).
Data Alignment
Time alignment is standard in pupillometric analysis, because task-irrelevant
changes in pupil size are likely to neutralize when averaged across multiple
trials, leaving only the evoked changes relevant to the experiment. In cases
where stimuli are of variable duration (like sentences in a standard corpus),
alignment can be done by onset or offset. Alignment by stimulus onset has been
used to examine the effect of intelligibility (Zekveld et al., 2010) and masker type
(Koelewijn et al.,
2012) on pupil size. Klingner et al. (2011) aligned their
data to stimulus offset for purposes of looking at peak pupil dilation resulting
from stimuli of systematically longer durations. Winn et al. (2015) and Winn (2016) aligned
data to stimulus offset for the purpose of separating listening responses from
speaking responses and serendipitously found temporally specific effects of
prolonged effort following challenging stimuli, which would have been otherwise
partially obscured by onset alignment. Figure 7 shows a series of trials with
stimulus onset and offset marked, and Figure 8 shows the aggregate of these
trials, aligned by both onset and offset. There is arguably a more distinct
onset slope for the onset-aligned data, and more distinct shape of dilation at
offset for the offset-aligned data, but the data take the same general shape
regardless of alignment position.
Figure 7.
Sixteen individual trials of pupil data, with baseline period marked
as thin gray vertical bar, and retention interval marked as thick
gray vertical bar. The stimulus is played between these two bars.
Baseline level for each trial is marked as the horizontal blue line.
Lines plotted in color are low-pass filtered data overlaid on gray
raw data that include transient vertical displacements that indicate
blinks. Data in red are marked to be dropped from the data set due
to excessive data loss or contamination (excessive nontask-evoked
dilation) during baseline interval.
Figure 8.
Aggregation of trials displayed in Figure 7, excluding “dropped”
trials. The left and right panels display data aligned to stimulus
onset and offset, respectively. The thin gray bar (to the left in
each panel) corresponds to the baseline interval, and the thicker
gray bar (to the right) corresponds to the retention interval.
Because stimuli were of variable duration, average values were used
for the offset time for onset-aligned data, and for onset in
offset-aligned data.
Sixteen individual trials of pupil data, with baseline period marked
as thin gray vertical bar, and retention interval marked as thick
gray vertical bar. The stimulus is played between these two bars.
Baseline level for each trial is marked as the horizontal blue line.
Lines plotted in color are low-pass filtered data overlaid on gray
raw data that include transient vertical displacements that indicate
blinks. Data in red are marked to be dropped from the data set due
to excessive data loss or contamination (excessive nontask-evoked
dilation) during baseline interval.Aggregation of trials displayed in Figure 7, excluding “dropped”
trials. The left and right panels display data aligned to stimulus
onset and offset, respectively. The thin gray bar (to the left in
each panel) corresponds to the baseline interval, and the thicker
gray bar (to the right) corresponds to the retention interval.
Because stimuli were of variable duration, average values were used
for the offset time for onset-aligned data, and for onset in
offset-aligned data.On a technical note, it is important to consider that there could be some timing
drift or numerical rounding that occur with timestamps. For example, for 60 Hz
data collection, some timestamps might be multiples of 16 while others are
multiples of 17, even if they refer to the same ordered index (i.e., 1st, 2nd,
3rd, etc.) of sampled time across trials. To ensure that such data points are
numerically aligned when aggregating across trials, the experimenter might need
to transform the timing data to maintain consistent timestamps.
Identifying and Dropping Contaminated Trials
Trial-level data that are contaminated (e.g., by gross excursions during
baseline) should be removed from analysis using automatic rules that are
consistent and motivated by realistic constraints of task-evoked pupil
responses. There is no substitute for visualizing and becoming familiar with the
trial-level data so that unexpected deviations can be
detected and used to design algorithms to filter data. The nature of these
deviations can vary among individuals and among eye-tracking systems. The
strategy taken to drop trials should be consistent and blind to test condition,
to avoid experimenter bias. There are some useful heuristics that can drive an
automated trial exclusion process. For example, when a considerable amount of
data samples are lost (e.g., because of long blinks), a trial should be dropped.
Exact missing data criteria vary across studies, but generally range from 10% to
30%, and might be especially important for parts of the trial where the peak
dilation would occur; the criteria for trial dropping might therefore be
weighted by trial event time.Often, the first few trials are excluded from analysis because the pupillary
responses look considerably different from those in the rest of the block (Wendt et al., 2018).
When pupil dilation slopes steeply downward during stimulus playback, it is a
good time to consider dropping the trial, because the “peak” dilation will
likely be more than three standard deviations from the mean. Figure 7 illustrates an
example of this pattern, as well as some other examples of contaminated or
mis-tracked trials that could be dropped from a data set. We recommend being
very conservative with dropping trials, especially if there is no obvious
artifact like an eye movement or something explaining a deviating response. All
remaining unexplained noise should be addressed by event-related averaging akin
to other evoked responses.Just as for other evoked-response methods, outliers should be detected and
removed from the data set. In Figure 7, data for Trial 11 are marked to be dropped because an
event-unrelated brief and excessive dilation that occurs during baseline driving
all subsequent data downward following baseline correction. This contamination
can be detected by identifying the average or peak dilation as an outlier, or by
detecting deviation of the baseline value relative to adjacent baselines. Trial
15 is dropped because it contains a large amount of missing data due to blinks
or tracker error. Trial 4 is marked to be dropped because of excessive
distortion precisely during the time where there would be valuable dilation
information. The range and rate of dilation observed in this trial is
uncharacteristic of task-evoked changes but likely to detrimentally affect data
aggregation. Trial 6 is marked as “questionable” because it contains excessive
distortion during baseline (and later), which propagate forward to affect the
calculation of subsequent data samples in the trial (just as for Trial 11).
Standard criteria (such as absolute value of peak dilation or baseline levels
being more than 3 standard deviations from the mean) could help to identify such
situations. Measures of dissimilarity of individual trials might be applied in
the future to characterize such deviant trials and evaluate them for
contamination.Although most of the group-averaged data from publications cited in this article
take the general form illustrated in the figure earlier, it is important to note
that morphology of responses will vary substantially across individuals. Lõo, van Rij, Järvikivi, and
Baayen (2016) describe distinctive group patterns including
differences in the number of peaks, the timing of peaks relative to trial
landmarks, as well as the tendency for pupil size to either rise or
fall after stimulus onset.
Analysis Techniques and Time Windows
As mentioned earlier, the pupil will start to dilate between roughly 0.5 to 1.3 s
following the stimulus onset, and the peak dilation occurs typically roughly
700 ms to 1 s following the end of the stimulus (at least for
sentence-recognition experiments where stimulus duration was relatively constant
and therefore easy to predict by the listener). It is customary to measure peak
pupil dilation, peak pupil latency, and mean pupil dilation in a fixed window of
time around stimulus presentation. The popularity of these measurements (and
peak dilation in particular) should not limit the creativity of experimenters to
develop and use measurements that relate to specific hypotheses regarding the
timing of the response. Some investigators have also measured the shape of the
pupillary response over time (Kuchinsky et al., 2013; Wendt et al., 2018;
Winn, 2016;
Winn et al.,
2015) using growth-curve analysis (Mirman, 2014) or generalized additive
models (van Rij,
2012; van Rij,
Hendriks, van Rijn, Baayen, & Wood, 2018). Alternative approaches
include principal components analysis used in pupillometry experiments by Schluroff et al.
(1986) with seven factors and Verney et al. (2004) with three
factors.In the case of stimuli that do not have any specific internal landmarks (such as
conventional speech-in-noise tasks), mean dilation, peak dilation, and latency
should suffice. However, sometimes stimuli should elicit cognitive load at
specific times, and this should be approached with time-series methods or
carefully chosen time windows. For example, in an experiment by Bradshaw (1968),
latency to peak dilation depending on task instructions; in an open-ended
problem-solving condition, latency was locked to the time of solution and
reflected stimulus difficulty, but in a condition where the solution was
elicited with a predictably timed probe, latency to peak was rather consistent
despite differences in stimulus difficulty.Analysis windows can vary according to the design of the test procedure. In many
cases, a single large analysis window including stimulus and response
preparation can be used to detect main effects such as speech intelligibility
(Zekveld et al.,
2010) and masker type (Koelewijn et al., 2012). In some cases,
a briefer analysis window can reveal how the growth from baseline to peak can
reveal differences that can be subsequently neutralized as the participant
prepares a response (Winn
et al., 2015). Two explicit analysis windows have been used to
separate listening and rehearsal phases of sentence-repetition tests (Winn, 2016). Wendt et al. (2016)
used three separate analysis windows designed to track listening, linguistic
processing, and decisions. Alternatively, analysis of pupil dilation could be
locked to stimulus events such as disambiguating information during a stimulus
(Wagner et al.,
2016).Various language-related and aptitude-related factors can affect pupil dilation
in ways that might be best described by factors other than mean, peak, and
latency. For example, in a study of mathematical problem solving by Ahern and Beatty (1979),
there were two groups of listeners that were found to have equivalent peak
dilation and latency, but different slopes of recovery after peak; there was
quicker recovery from peak for higher aptitude students. Similar patterns were
described in a pitch perception task by Bianchi et al. (2016), where musicians
had quicker recovery than nonmusicians. Bradshaw (1968) showed roughly
equivalent peak dilations more difficult math problems regardless of whether
participants obtained a solution, but prolonged dilation was
observed when problems remained unsolved, suggesting a useful role for
late-stage dilation analysis. In cases of using semantic context, the timing of
changes in pupil dilation after stimulus delivery has been
proposed as a potentially meaningful distinction across listener groups (Winn, 2016; Winn & Moore,
2018). Previous work by Verney et al. (2004) proposed using
principal components analysis to identify early, middle, and late contributions
to dilation, specifically mentioning a late factor responsible for dilation
magnitude following the peak.If the experimenter is interested in a listener’s anticipation and online
prediction, analysis could target the pupil response during or
before the stimulus. For example, anticipation of difficult
trials elicits larger pupil dilation even before stimuli are presented (McCloy et al., 2017).
If a listener already knows the spatial location of the target signal, pupil
dilation will grow at a slower rate than if the location is unknown (Koelewijn et al.,
2017). Anticipation of longer stimuli brings shallower growth of
dilation, presumably to distribute cognitive resources over the entire stimulus,
both for digit spans (Kahneman & Beatty, 1966; Klingner et al., 2011) and speech
signals (Winn & Moore,
2018). Studies by Kahneman and Beatty (1966) and Piquado et al. (2010) reported that in
conditions where listeners expected longer stimulus length (based on a
blocked-condition design), larger overall pupil dilation was
recorded during the baseline period, as if to indicate a greater tonic state of
arousal in anticipation of a more challenging task. Sentences that afford the
opportunity to predict upcoming words elicit shallower pupil dilation than
sentences that are unpredictable, in terms of semantic content and syntactic
structure (Schluroff et
al., 1986).In addition to examining stimulus onset or offset, more precise methods can be
used, which track stimulus-related information or participant behavior. Ayasse et al. (2017)
used pupillometry along with a concurrent eye-tracking paradigm to verify the
moment of sentence comprehension (e.g., a participant would gaze at an image
that related to the sentence). They examined pupil size at moment of
comprehension and found significant combined effects of age and hearing loss.
Notably, they found that speed of comprehension was not statistically different
across the groups, but the amount of pupil dilation was reliably different.
Wagner et al.
(2016) similarly aligned their pupil dilation analysis to the onset
of target words under lexical competition in a study that examined how spectral
degradation affected lexical access. In some of these aforementioned examples,
it is probable that conventional measures of overall mean, peak, and latency
would identify consistent differences across conditions or participant groups,
but leave out other interesting layers that are deserving of interpretation.Stimulus difficulty will also affect the precise timing of dilations and the
clarity of the overall morphology of the response. In an unpublished pilot
study, Winn and colleagues presented listeners with a slowly spoken five-word
sentence consisting of a name, verb, number, adjective, and plural noun (e.g.,
“Bill sold four red hats”). Either all words were unpredictable, or the second
word (the verb) was always the same word “found.” The premise was to examine
whether the predictable second word would reduce pupil dilation in the moments
just after that word. Overall, the pupil dilations of listeners with CIs were
far more precisely time locked to the stimuli compared with dilations measured
in listeners with normal hearing, for whom the task was trivially easy. The CI
group showed distinct dilation responses for individual words (see Figure 9), consistent with
greater demand on auditory processing in these listeners. While there was an
effect of word predictability in both groups, the effect was more time
constrained for the CI group, where reduction in dilation began just following
the second local peak in the time series response. For the listeners with normal
hearing, the reduction was spread across a larger portion of time rather than
being constrained to a particular moment. The results of this small pilot study
are consistent with the aforementioned capacity model and FUEL; in situations
where the task-evoked boredom or low cognitive demand (i.e., the normal-hearing
listeners in this slow-paced easy speech task), the pupil response should be
less reliable or more difficult to interpret because the participant’s attention
is not locked to the stimulus (or perhaps is divided among the task and anything
else on their mind). Conversely, in the case of the CI listeners for whom even
recognition of words in quiet can be challenging, a steep and reliable
time-locked dilation response was strong, likely because the task demanded more
of their attentional capacity, and pupil size was therefore dictated primarily
by the stimuli.
Figure 9.
Temporal precision of pupil dilation morphology is related to
difficulty of a task. Listeners with cochlear implants (for whom
auditory perception can be quite challenging) show individuated
dilations to slowly spoken words that are time locked across trials.
Affecting the predictability of the second word in the sequence
causes a reduced response shortly after the corresponding word. Such
patterns do not emerge clearly for listeners with normal hearing,
for whom the task is very easy.
Temporal precision of pupil dilation morphology is related to
difficulty of a task. Listeners with cochlear implants (for whom
auditory perception can be quite challenging) show individuated
dilations to slowly spoken words that are time locked across trials.
Affecting the predictability of the second word in the sequence
causes a reduced response shortly after the corresponding word. Such
patterns do not emerge clearly for listeners with normal hearing,
for whom the task is very easy.
Effort Does Not Stop When the Stimulus Is Over
There are numerous interesting effects of cognitive load that are found in the
pupillary response as it continues past the end of a stimulus. The retention
interval—the time between the end of a stimulus and the behavioral response
prompt—has been examined in detail in only a few studies. Piquado et al. (2010) argued that the
retention interval is a good reflection of the cumulative memory load. Indeed,
when participants in a digit-span memory task report their responses, pupil size
appears to decrease in a stepwise fashion with each reported digit, as if the
memory is “unloaded” (Beatty,
1982).The retention interval appears to show differences relating to intelligibility or
confidence in speech perception. Winn et al. (2015) found that when
participants heard severely degraded sentences, pupils remained dilated, but
quickly constricted in cases of reduced degradation, and also in the case of the
severely degraded condition when intelligibility was perfect. That study
suggested that the auditory processing demands were reflected
by the slope of pupil dilation to its peak value, whereas the continued
linguistic processing needed to repair misperceptions in the difficult condition
was reflected in the dilation during the retention window. Further exploration
into the retention window—and its susceptibility to interference—is presented by
Winn and Moore
(2018).
Insight From Behavioral Economics
Humans are not machines, and their exact motivations moment-to-moment will affect
their willingness to put forth extra effort in listening tasks. As that
willingness—that intentional engagement—appears to be the central
driver of the pupillary response, it should be given special consideration in
experimental design. Pupil dilation should be observed during tasks that reward the
listener for putting in more effort. It is no coincidence that some of the
pioneering work in pupillometry was done by a psychologist-turned-behavioral
economist—Daniel Kahneman—who contributed the Capacity Model of
attention and effort that eventually gave rise to the more recent FUEL (Pichora-Fuller et al.,
2016). Appreciation of the basic concepts of these models will likely aid
in the planning and execution of listening effort experiments.Eckert et al. (2016) adopt
the framework of behavioral economics—the study of choice relative to the value of
options—to suggest that the level of effort exerted during speech communication
reflects expected value of return on effort. In other words, effort might not be
exerted if the effort is not worth it. This is an important
principle for experimenters to bear in mind as they choose their stimuli and testing
conditions, as nonmonotoniticies in the difficulty-effort function could result in
very complicated data that is not easy to interpret. Using this framework for a
series of brain imaging studies, Eckert and colleagues have placed listening effort
in the domain of neuroeconomics.At some level of difficulty, listeners do not continue to exert more effort; they
begin to disengage (Granholm,
Asarnow, Sarkin, & Dykes, 1996; Peavler, 1974), because there is no value
obtained by expending more effort. The pupils will therefore dilate less when the
task is too hard to complete successfully. At intelligibility levels below 40%,
pupil dilation tends to decrease as listeners disengage. Similar patterns of
nonlinearity also appear in other measures of listening effort, such as dual-task
cost (Wu, Stangl, Zhang,
Perkins, & Eilers, 2016), notably with the same cutoff of performance
level. In digit-span memory tasks, sequences longer than 7 to 9 items are generally
not attainable by typical listeners. Johnson et al. (2014) found that for adults
in such a task, the pupil typically does not dilate any more for 11 items than for 7
items. Furthermore, pupillary responses for longer digit sequences were actually
smaller (especially in children), possibly reflecting
abandonment of the task.In particularly difficult laboratory experiments, it is reasonable to suspect that
people might either (a) actively strategize to change their performance in ways that
would not be necessary in easy conditions or (b) decide to withdraw their effort
either because it does not seem worth it (Teubner-Rhodes et al., 2017), or because
the task has been rendered unrealistic. Consider also that testing in very difficult
conditions (e.g., at 30% to 40% accuracy) is likely not a useful situation to test
because people might rarely find themselves in that situation. Individuals with
hearing loss, who would certainly struggle in noisy complex listening environments,
might instead change their own behavior and social activities rather than
participate in those environments (Demorest & Erdman, 1986; Weinstein & Ventry,
1982; Wu et al.,
2018).
Understanding the Physiology of Pupil Dilation
The physiology of the pupillary response implicates an interesting psychological
framework for understanding human behavior in pupillometry experiments. Changes in
pupil size are correlated to changes in activity in neurons of the LC (Rajkowski, Kubiak, &
Aston-Jones, 1993; Rajkowski, Majczynski, Clayton, & Aston-Jones, 2004). This
physiological connection corroborates the connection between pupil diameter and
phasic attention or effort. Following from theories about the role of the (LC) in
the modulation of attention, individuals will continue to perform a behavior so long
it is rewarding or provides utility. Importantly, the valuation of performing a
particular task may vary across groups or individuals. During such focused
attention, the LC exhibits a phasic burst of activity (tied to the task-evoked pupil
response) to support making a behavioral response. As task utility declines,
scanning attention takes over as the individual searches for new rewards in the
environment and the LC enters a high-tonic mode (tied to the baseline pupil size).
Reflecting this pattern in cortex, the anterior cingulate cortex (ACC), which is a
primary input to and receives feedback from the LC (Aston-Jones & Cohen, 2005; Gilzenrat et al., 2010),
has been associated with effortful attention. In particular, the ACC is part of the
cingulo-opercular performance-monitoring network, which includes bilateral frontal
opercula or anterior insulae. This network is engaged across a wide range of tasks
(e.g., Dosenbach et al.,
2006; Eckert et al.,
2009) to actively monitor for response uncertainty and errors, so that
other systems can be brought online to adapt and improve performance (i.e.,
fronto-parietal, cognitive-control network; Kerns et al., 2004). Eckert et al. (2016) thus argue for the
increasing engagement of the cingulo-opercular network as an indicator of greater
value or utility in engaging in speech recognition (cf. the earlier section on
Behavioral economics). Interestingly, inactivation of cingulate
neurons is associated with impairment of adjusting to errors in a task (Vaden, Kuchinsky, Ahlstrom,
Dubno, & Eckert, 2015). Pupil size has been linked with activity of
the cingulo-opercular system (Schneider et al., 2016; Zekveld et al., 2014). In light of this
converging evidence, it seems reasonable to interpret the phasic pupillary response
as a sign that a listener feels the need to “take action” mentally. Circling back to
studies of pupil dilation and intelligibility, this framework helps to explain why
there is increased dilation in moderately challenging conditions (where “taking
action” could overcome acoustic challenges), but little dilation when the stimulus
is so easy as to not demand action and also little dilation where no action is taken
because the task is too difficult for performance to be successful.Recent work suggests that the notion of understanding pupil size as an index of LC
activity should be expanded and updated. For years, many pupillometry publications
have cited Aston-Jones and Cohen
(2005) to link pupil dilation with the activity of the LC-norepinephrine
system. However, as McGinley
et al. (2015) point out, the compelling figure from that publication
reflected the recording of a single neuron and was from the abstract of an
unpublished study. McGinley et al. suggest that while LC activity is related to
pupil size, other mechanisms and neural substrates linked with pupil dilation have
not been ruled out. Furthermore, they characterize pupil size as an indicator of
“brain state,” possibly intentionally avoiding linkage with a specific local
structure. Consistent with this, Reimer et al. (2016) found that pupil
dilation was comodulated with cortical activity in general,
complicating the process of using pupillometry to assess any specific function. A
study in nonhuman primates also found that although LC stimulation reliably proceeds
changes in pupil dilation, that similar, though weaker, patterns can be observed in
other regions connected with the LC (i.e., ACC and colliculi; Joshi, Li, Kalwani, & Gold, 2016).
Thus, given the extensive connectivity within the LC-norepinephrine system, these
results suggest that, rather than a single brain area controlling pupil dilation,
the LC acts as a hub that coordinates attention-related neural activity.Reimer et al. (2016)
showed evidence that rapid pupil dilations (roughly 0.25 Hz) are
associated with phasic norandrenergic activity, while slower long-lasting dilations
are linked more closely with cholinergic activity. Furthermore, the
rate of change in pupil size was linked with norandrenergic
activity, while overall diameter was linked with both, but more strongly with
cholinergic activity. Pupil dilation was observed to reliably lag behind neural
activity by roughly 1 s, consistent with behavioral studies where onset of pupil
dilation and peak dilation follow stimulus onset and offset by about 1 s,
respectively (Hoeks &
Levelt, 1993). Steinhauer et al. (2004) suggest that the rapid early component of the
dilation response is driven by parasympathetic activity while the later-occurring
component is driven by the sympathetic system.
What Is Still Beyond Our Reach at This Time?
Individualized Measures or Cross-Person Comparisons
The range of variability observed in most physiological recordings will carry
over to pupillometry. Data from a single person are rarely as clean as the
grouped data displayed in popular pupillometry articles. Given a sufficient
number of trials, more confidence can be obtained for single-participant test
sessions, but data should be interpreted with caution. The absolute size of the
pupil response varies across people, and there are people for whom measurements
will not be reliable, sometimes for unknown reasons.Unsurprisingly, individualized measures of listening effort are the goal of many
audiologists and experimenters alike, for the purposes of tracking progress with
clinical intervention or to compare treatment approaches. This goal might become
attainable as techniques continue to be refined and appropriate baseline tasks
can be developed. However, though there will still be some difficulties. In
particular, the use of certain medications that affect that sympathetic or
parasympathetic nervous system will render pupillometry unreliable (cf. Steinhauer et al.,
2004). In addition, some participants exhibit markedly different
morphologies in their pupillary responses over time, leaving experimenters
unsure how to make a fair comparison. It is possible that comparison of
within-subjects conditions (e.g., proportional differences between responses in
two conditions within an individual) will be the key to drive reliable
individualized analysis.
Single-Trial Analysis and Adaptive Tracking
As for other evoked physiological measures, single trials are not sufficient to
draw reliable conclusions. The reason for the recommendation of ∼25 trial is
that there is a wide range of variability at the individual trial level. Each
individual trial will produce a time series (“trace”) of pupil dilation that is
the result of many factors, some of which are beyond the experimenter’s control,
and perhaps unrelated to the task itself. Substantial deviations from the
expected pattern of the task-evoked pupillary response are common in any series
of trials and are not necessarily indicators of a specific level of cognitive
load. Only when averaging together multiple trials can the experimenter get a
reliable estimate of the pupillary response separate from any idiosyncratic
effects on an individual trial.The pupillary response alone should not be used as a criterion for adaptive
tracking in speech perception experiments, for the reasons outlined earlier.
Although it would be ideal to dynamically adjust SNR (or speech rate, spectral
resolution, bandwidth, or any other signal property) to arrive at a target level
of pupil dilation, it is not feasible using the standard analysis techniques
used in behavioral experiments. The problem is that each unique condition (SNR,
etc.) requires a sufficiently large number of trials to estimate the pupillary
response, rendering the tracking anything but adaptive. Contrary to this
suggestion, Marshall
(2002) described a framework for virtually real-time assessment of
cognitive activity using pupil dilation. However, in that article, cognitive
activity was discretized as low,
medium, or high; we suggest that these
categories might lead to oversimplifications of important factors that
experimenters might want to explore in finer detail.At least one report exists of using pupillometry as an adaptive tracking
mechanism for real-time feedback. Choi et al. (2017) used the pupil
response to govern feedback in a visual scanning task in individuals at risk for
psychosis. However, instead of measuring listening effort, they sought to elicit
a broad indication of how much a person was actively engaged in the task, to
monitor for lapses of attention or cognitive overload.
Consensus Technique for Automatic Trial Dropping
As discussed earlier, at this time, there is no universally used algorithm for
detecting and removing aberrant trials. Hands-on experience with your own raw
data and consultation with previous literature will inform the most appropriate
filtering used to identify contaminated trials. Perhaps machine learning or
other modern approaches can be used to devise algorithms to identify
questionable trials (Książek, Wendt, Alickovic, & Lunner, 2018), but at the time of
this writing, there is no consensus approach.
Testing in Unconstrained or Conversational Situations
As the task-evoked pupillary response is a time-locked, aggregated response, it
is not conducive to conversational situations or other situations that are
unplanned or unconstrained by trial timing. The problem is
that during a conversation, it could be hard to find regular timing landmarks
upon which trial data could be aligned and aggregated.
Reducing Listening Effort
The measurement of pupil size is not the same as the reduction of listening
effort. Readers should be cautioned that studies designed to
measure effort are not in themselves a tool to alleviate
effort. Furthermore, changes in pupil size might not yield clear actionable
conclusions about how to alleviate effort.
Linking Subjective and Objective Measures of Effort
Both subjective report and a variety of objective measures have been used to
track changes in listening effort (for reviews see McGarrigle et al., 2014; Ohlenforst et al.,
2017). However, to the extent that pupillometry and subjective report
have been collected in the same experiment, researchers have generally observed
weak to no correlations between these measures of effort (e.g., Wendt et al., 2016;
Zekveld & Kramer,
2014; Zekveld
et al., 2011; though cf. Koelewijn et al., 2015). Currently, it
is unclear whether this pattern is due to methodological limitations (e.g.,
comparing offline vs. online measures, biases that are inherent in self-report),
or whether conscious awareness of effort engages reflects underlying mechanisms
(Mulert, Menzinger,
Leicht, Pogarell, & Hegerl, 2005) that may not be tracked by
objective measures. Francis,
MacPherson, Chandrasekeran, and Alvar (2016) suggest that
physiological measures in general likely reflect constructs other than
subjective or perceived effort and can represent the listener overcoming signal
distortion, the separation of target and noise, or the affective response to the
difficulty of the situation. Hornsby and Kipp (2016) report that reports of fatigue in people
with hearing impairment are related not to degree of hearing loss but to
perceived difficulty or handicap. These relationships will continue to demand
further exploration as the topic of hearing, effort, and fatigue continue to
flourish in the literature.
A Caution About Equating Pupil Size With “Effort”
Pupil size reflects multiple things, and there is no consensus on what percentage
of change in pupil size corresponds to a particular change in proportion of
effort capacity. In addition, note that smaller pupil dilation is routinely
observed in listeners who are older, and listeners with hearing impairment
(Koelewijn et al.,
2017), and listeners with traumatic brain injury (Koelewijn et al.,
2018), despite common reports of elevated effort in these
populations.
Summary of Helpful Practices and Advice
As a start, it is a smart thing to allow your design to replicate an effect shown in
a previous study. Klingner
et al. (2011) provide an excellent case study of replicating known
results in the midst of exploring a new problem. By doing this, the experimenter can
avoid the mystery of unclear or null results by verifying that her or his data
collection system is working.Keep the luminance of the testing area as steady as possible, with few changes in
visual input. When recording, either avoid eye moments or control for their effects
on the estimation of pupil size. Limit physical locomotion. A normal amount of
random blinking is okay but beware of stimulus-timed blinks; consider having a
“blink” instruction at the end of trials. Make sure the participants have enough
breaks and avoid long fatiguing sessions. Make sure data are well annotated with the
appropriate time stamps. Check if you can read and processes the data after you have
recorded the first few participants, to ensure that tracking and annotation are
sound. Use a consistent amount of time for each trial and leave a sufficient amount
of time between trials to allow the pupil to return to baseline. Make sure
participants can anticipate when a trial will start; consider using an alerting
signal or a consistent amount of leading noise time for each trial.Avoid conditions of very low intelligibility or task performance, as participants are
likely to disengage from the task. In line with this, consider a participant’s
anxiety about failure, and whether anxiety is central to the research question or
just a byproduct of the testing environment. Be mindful of various sources of
individual variability (cf. Tryon, 1975), and also that sometimes pupil dilation is influenced more
strongly by behavioral response rather than by listening. In line with this,
deliberately plan the amount of time between auditory stimulus and behavioral
response.When possible, it is advisable to test within-group effects to control for individual
differences in pupillary reactivity. Use a sufficient number of trials (20–25) for
each condition and leave out the first few trials when participants are still easing
in to the task. Expect some amount of missing or contaminated data. Engaging tasks
keep participants motivated and have shown prominent pupil responses, but be aware
that emotional stimuli evoke a big response that might lead to unplanned distortion
of the results. Finally, be aware of age effects when designing your study and
analysis techniques.
Review Articles or Recommended Reading
General reviews of pupillometry to assess cognitive load have been written by Beatty (1982) and Laeng et al. (2012). For a
more general review of the pupillary system as a whole, the chapter by Beatty and Lucero-Wagoner
(2000) is especially helpful. Detailed information about the physiology
of pupil dilation can be found in articles by McGinley et al. (2015), Reimer et al. (2016), and
Eckstein, Guerra-Carrillo,
Miller Singley, and Bunge (2017). A review of pupillometry to assess
listening effort in particular is found elsewhere in this issue, by Zekveld et al. (2018). In
addition to these articles, there are also other introductory reviews for new
experimenters (like the current article), for the fields of second-language
acquisition (Schmidtke,
2017) and for assessment of brainstem function by anesthesiologists
(Larson & Behrends,
2015).
Summary and Concluding Remarks
In this article, we have aimed to provide a series of recommendations on best
practices (or at least common practices) for research on conducting pupillometry
studies of listening effort. We have highlighted the importance of conducting
theory-driven research and employing careful research design and analytical
approaches informed by well-established ideas of attention, motivation, and economy
of effort, as well as the constraints of the measurement technique itself.
Pupillometry has particular strengths as a noninvasive physiological measurement
that can track changes in effort with some temporal precision. It is also compatible
with electronic hearing devices used by many listeners who are the focus of hearing
research. Although implementing any new methodology can seem daunting, we hope that
by providing this (nonexhaustive) guide, researchers will be encouraged to get
started and will attain success quickly. Addressing the outstanding questions in the
field will require a multidisciplinary approach, involving both clinicians and basic
researchers as well as perceptual, linguistic, cognitive, and neuroeconomic
perspectives. It will require studies examining variability in a variety of task
conditions, task goals, populations, and within and across individuals. Given the
ultimate goal of improving individuals’ quality of life through better
communication, we hope we have convinced readers that pupillometry is worth the
effort.
Authors: Peter R Murphy; Redmond G O'Connell; Michael O'Sullivan; Ian H Robertson; Joshua H Balsters Journal: Hum Brain Mapp Date: 2014-02-07 Impact factor: 5.038
Authors: Molly Winston; Kritika Nayar; Abigail L Hogan; Jamie Barstein; Chelsea La Valle; Kevin Sharp; Elizabeth Berry-Kravis; Molly Losh Journal: Physiol Behav Date: 2019-11-22
Authors: Nicole M Amichetti; Jonathan Neukam; Alexander J Kinney; Nicole Capach; Samantha U March; Mario A Svirsky; Arthur Wingfield Journal: J Acoust Soc Am Date: 2021-12 Impact factor: 1.840