We examine the use of multiple subphonemic differences distinguishing homophones in production and perception, through a case study focusing on the distinction between two polysemous senses of the English word "sorry" (apology vs. attention-seeking). An analysis of production data from voice actors revealed significant and substantial durational differences between the two meanings. Tokens expressing an apology were longer than attention-seeking tokens, and the situational intensity of the context also independently affected duration. When asked to identify the meaning in a two-way forced-choice task after hearing each token spliced out of its context, listeners were above chance (64.7% accuracy) in identifying the intended meaning, and their responses were significantly correlated with the duration, intensity, and intonation contour (but not mean F0) of the productions. In a second perception task, listeners heard tokens of "sorry" that had been systematically manipulated to vary in duration, intensity, and intonation contour, with responses indicating that each of these dimensions played an independent role in listeners' judgments. The results highlight the importance of broadening the scope of research on the use of subphonemic detail during lexical access and considering a wider range of lexical and non-lexical factors that condition variability on multiple acoustic dimensions, in order to work toward a more accurate picture of the systematic variability available in the input and tracked by listeners.
Homophones,
by definition, are expected to be pronounced identically; however, they
can be characterized by small but systematic differences in their phonetic
realization. For example, the more frequent members of a given homophone pair
tend to be of shorter duration than their less-frequent homophone twins (e.g.,
English “fair” is on average shorter than the less-frequent “fare”) (Gahl, 2008; Lohmann, 2018b). A
variety of factors, including lexical frequency, underlying phonological
representation, and contextual predictability, have been shown to condition
systematic variation in homophones, and this variation occurs across multiple
acoustic dimensions, as will be reviewed below. Furthermore, listeners can, in
some circumstances, identify the intended meaning at greater-than-chance levels,
and listeners’ responses have been found to be independently influenced by
specific acoustic properties of the sound, such as duration (e.g., Bond, 1973; Warner et al., 2004).
These sorts of findings suggest not only that subphonemic differences
corresponding to the different meanings or senses must be perceptible in the
signal, but also that the relevant information must be encoded in listeners’
speech perception systems at some level of representation and/or processing.

This case study provides an in-depth investigation of listeners’ use of multiple
phonetic cues (duration, intensity, and pitch) in differentiating between two
meanings of the English word “sorry”: an apology (i.e., “Sorry! I didn’t mean to
hurt you,” henceforth Apology) versus an attention-seeking mechanism
(i.e., “Sorry—can you move so I can pass by?”, henceforth ExcuseMe). We
first lay out the acoustic characteristics of productions from a set of four
voice actors producing the word “sorry” in different contexts designed to elicit
the two meanings. We then examine whether listeners are sensitive to the
intended form of the word, as well as which phonetic cues are used to inform
their decisions. We test these questions using two perception tasks: the first
examines listeners’ use of cues when making judgments about actors’ productions
of “sorry,” and the second uses a subset of these stimuli, systematically
manipulated across the relevant dimensions, allowing us to assess the
independent contribution of each cue.
1.2 Subphonemic variation in production of homophones
Previous investigations of subphonemic variation in homophones have primarily
focused on durational differences. Higher-frequency words are, all else being
equal, produced with shorter word durations than lower-frequency words, and this
is also the case with pairs of homophones (at least for content words; studies
involving only function words have not found similar effects, e.g., Jurafsky et al., 2002).
In the first large-scale study on this topic, Gahl (2008) examined the durations of
344 homophonous but heterographic word pairs (e.g., “fair” vs. “fare”) in the
Switchboard corpus of American English telephone conversations, and found that
higher-frequency items were shorter in duration than their lower-frequency
counterparts. This result persisted after controlling for factors including
speaking rate, contextual predictability, and syntactic category (see Lohmann, 2018b, for a
re-analysis and confirmation of this result). Similar patterns were found in a
comparison of children and adults (Conwell, 2018) and in Mandarin (Sherr-Ziarko, 2015).
The generalization also holds in homographic homophones, as demonstrated by
Lohmann (2018a),
who examined the durations of 63 homographic noun–verb pair homophones (e.g.,
“cut” (N) vs. “cut” (V)) in the Buckeye Corpus of spontaneous speech.

Durational differences across members of homophone pairs have also been shown to
be conditioned by other factors, including the morphological status of the word
(Seyfarth et al.,
2018), the underlying phonological form of the word (“incomplete
neutralization,” Port &
O’Dell (1985) among many others), and orthography (Warner et al., 2004).
Additional factors systematically affecting duration that have not been tested
specifically with homophones could be expected to play a role as well
(e.g., contextual predictability: Seyfarth, 2014; Tang & Shaw, 2020). Bell et al. (2009), in a
corpus study of conversational English, showed that three separate
factors—frequency, contextual predictability, and repetition—all contribute
independently to durations of spoken words, even though they are all
correlated.

Though duration is the most well-studied property, similar influences might be
expected on other acoustic dimensions, particularly those which, like duration,
are heavily influenced by the extent of articulatory reduction. A few studies
have given evidence for other types of differences: Wright (2004) shows that properties of
the lexical neighborhood affect spectral characteristics of vowels, with “easy”
lexical items (i.e., those with low neighborhood density and high relative
frequency) showing systematically less dispersion in the F1-F2 space than “hard”
lexical items, and Tang and
Shaw (2020), using a large telephone corpus of Mandarin Chinese, show
that contextual predictability has significant effects on maximum pitch and
intensity, dimensions related to prosodic prominence.

While all of the phonetic patterns discussed so far are at least indirectly
attributable to the extent of articulatory reduction (with greater reduction
corresponding to shorter durations, smaller pitch range, lower intensity, and
less vowel space dispersion), there
are many other sources of phonetic variability that could result in systematic
differences across members of a homophone pair. For example, if the two meanings
tend to be produced in different prosodic environments, they would show
different distributions along many phonetic dimensions in the input,
particularly for suprasegmental features like F0 or duration. There are a wide
range of factors systematically affecting both the phonological (e.g., whether a
word is accented or not) and phonetic detail (e.g., F0 alignment and range) of
prosody, including discourse status, or topic structure (e.g., whether a topic
is given or new; see Hirschberg, 2002, and references therein). Furthermore, intonational
patterns may be idiomatically related to certain usages or meanings (Calhoun & Schweitzer,
2012).

Another potential source of differences in a homophone pair is a different type of
context: the emotion or affect used for each meaning. There is a large body of
literature documenting the acoustic correlates of emotion (e.g., Williams & Stevens,
1972; Sauter et
al., 2010, among many others; see Bachorowski & Owren, 2008, for a
review). One example is work done by Banse and Scherer (1996), who recorded
12 actors asked to portray 14 different emotions in meaningless utterances
consisting of non-words. They performed detailed acoustic analyses involving
dimensions hypothesized to be relevant to emotion, including multiple measures
of F0, intensity, speaking rate, and spectral measures corresponding to voice
quality (e.g., spectral tilt). Almost all of the 29 features were statistically
significant in explaining some of the emotion-based variance, with pitch and
intensity measures capturing the most variance, and discriminant-based
classification accuracy was well above chance (40% accuracy given a choice of 14
emotions, with chance being 1/14, or 7%). Relatedly, Ohala (1983) proposes that a
biologically based “frequency code” underlies cross-language (and cross-species)
regularities in use of higher F0 to signal more submissive, less assertive
stances. While the acoustic correlates of emotion or attitude are far from
deterministic, it is well established that there are some regularities, and if
there are systematic differences in the contexts in which two members of a
homophone pair are used, then it follows that there will also be systematic
differences in their pronunciation.

Given the many factors at play simultaneously, it is likely that any given pair
of words will differ along multiple dimensions for multiple reasons. In a case
study, Drager (2011)
examined variation across different grammatical uses of the word “like” in a
corpus of sociolinguistic interviews in a New Zealand high school. Quotative
“like” (as in, “I was like ‘yeah, okay’”) had on average more monophthongal
vowels, shorter initial-consonant-to-vowel ratios, and higher mean pitch than
when it was used as a discourse particle or a lexical verb. Differences in
intonational patterns were proposed to underlie the pitch differences:
“Impressionistically, lexical verb like seemed to be produced
in conjunction with a dip in the intonation contour, whereas quotative like
rarely was and was sometimes part of a rising contour that raised more steeply
after the verb” (Drager,
2011: 702). The different lemmas tend to occur in different prosodic
positions, and while Drager argues that prosody is unlikely to account for all
of the variation found in the dataset, it does likely have some effect. Though
only focusing on a single wordform, this case study demonstrated that a given
pair of homophones will be characterized by differences across multiple acoustic
dimensions. Whether listeners track these sorts of differences, which dimensions
are tracked by listeners, and whether the tracking is different depending on the
conditioning factor (e.g., lexical frequency, prosody, emotion) are open
questions.
1.3 Listeners’ use of acoustic variation in the perception of
homophones
The presence of the systematic acoustic variation discussed above brings up the
question of whether listeners make use of this information to inform their
perceptual strategies. Listeners have been found to distinguish between the two
members of homophone pairs at above-chance accuracy, although these effects tend
to be very small, indicating that the cues are far from deterministic. For
example, Sanker
(2019) showed that listeners were significantly above chance in
identifying members of a homophone pair that had been spliced out of sentences
(e.g., “a doe is a deer” vs. “a dough is a
mixture”) in a forced-choice task. However, accuracy was only very slightly
above chance (50.8%), and this effect only held for words that had been produced
in a contextually predictable sentence, and not in a different condition where
the words were produced in isolation.

Listeners have also been shown to be sensitive to incomplete neutralization:
Port and O’Dell
(1985) showed that German listeners identified minimal pairs
differing in underlying final consonant voicing, like “rat” and “rad” (which are
both phonologically devoiced word-finally, such that the broad transcription of
both is [ʁɑt]), at 59% accuracy. Looking at the same phenomenon in Dutch, Warner et al. (2004)
found that listeners performed above chance in identifying the intended
production, but only for stimuli drawn from certain speakers—specifically, those
who showed greater durational distinctions between final underlying voiced
versus voiceless consonants—suggesting that these durational differences were in
fact what listeners were using to inform their decisions. The idea that these
sorts of phonetic differences may have only limited usefulness for listeners is
supported by the fact that the word-final voicing distinction, which is
represented orthographically, is a common source of spelling mistakes in Dutch
(Sandra et al.,
2001). Finally, there may be cases where these sorts of cues are not
used by listeners at all. In Mandarin, Tone 3 (low) is pronounced similarly to
Tone 2 (rising) in certain phonological contexts, but systematic phonetic
differences indicate that this T3/T2 neutralization is incomplete; however,
listeners do not reliably distinguish the two categories (Zhang & Peng, 2013).

Prosodic variation has also been shown to be useful to listeners. For example,
listeners differentiate short words from longer words containing them (e.g.,
“cap” vs. “captain”)
based on their duration (e.g., Davis et al., 2002; Salverda et al., 2003,
for English and Dutch listeners, respectively), and Spinelli et al. (2010) showed that
listeners use F0 contour as an independent cue to word segmentation in French.
Prosodic patterns may also be responsible for listeners’ demonstrated
sensitivity to emotion or affect. Multiple studies have shown listeners to be
consistently above chance at choosing a speaker’s intended emotion in
forced-choice tasks (see Johnstone & Scherer, 2000, for a review). Furthermore,
listeners’ perception of different emotions or affects is systematically
influenced by specific acoustic properties. For example, intonation contour has
been found to independently influence listeners’ assessment of speaker certainty
(Gravano et al.,
2008), and listeners tend to associate higher intensity and lower F0
with dominance (e.g., Scherer et al., 1973; Puts et al., 2007). Banse and Scherer (1996)
found correlations between listeners’ identification of emotions and acoustic
properties of the utterances (see also Sobin & Alpert, 1999). The
correspondence is far from deterministic, and listeners’ use of cues does not
always reflect the apparent importance of cues in productions (see Johnstone & Scherer,
2000, and Bachorowski & Owren, 2008, for discussion). Nevertheless, as
with the other subphonemic differences discussed in this section, they do appear
to be able to influence listeners’ perception of emotion in at least some
situations.

Finally, the prosody and/or affective tone of an utterance may influence lexical
selection or processing. In a study by Nygaard and Lunders (2002), listeners
heard words spoken in happy, sad, or neutral voices and were asked to transcribe
them. Critical stimuli were members of homophone pairs in which one member had
emotional connotations and the other did not (e.g., sad “die” vs.
neutral “dye”). Listeners were more likely to transcribe the
emotional token when it was heard in an emotionally congruent tone of voice,
indicating that the emotional information was integrated at some level of
linguistic processing, influencing lexical selection.

Taken together, a large number of studies demonstrate listeners’
sensitivity to the different types of subphonemic variation discussed above. It
is important to remember that many of these effects are very small,
inconsistent, or task-dependent. Particularly given the fact that positive
results are more likely to be published (“publication bias,” e.g., de Bruin et al., 2015),
it should not be concluded that listeners are always able to use this
information. Instead, which sources of variation and which acoustic dimensions
are tracked by listeners remains an open question.
1.4 Implications for representation, storage, and processing
The examination of subphonemic differences between different meanings of
homophones in production, and listeners’ sensitivity to those differences in
perception, has been of interest in large part because of its relevance to
models of lexical representation. Systematic phonetic variability is pervasive
in speech, governed by “contexts” such as the properties of the speaker, the
interlocutor, the social/discourse/pragmatic context, and the
syntactic/prosodic/semantic context. Since any two meanings are likely to be
used in systematically different contexts, they will likely show different
distributions of phonetic properties in production. How this variation is
tracked, stored, and used by listeners is a question that has implications for
the architecture of the lexicon and the structure of the speech processing
system.

As discussed above, listeners are sensitive to subphonemic differences between
meanings and use them to inform perception, at least in some situations, and this
indicates that distinct information must be linked to each meaning. However,
this link could take different forms. First, each of the two meanings could be
directly linked to separate phonetic representations, either in the form of
acoustic “exemplars” or as a summary of distributional statistics of the
acoustics. In this situation, listeners would determine the meaning by
evaluating the incoming signal against the two possible pronunciations.
Alternatively, both meanings could be linked to the same phonetic
representation, but each meaning has distinct information about the context in
which it tends to be produced. These contexts are in turn linked to general
phonetic regularities which also form part of listeners’ knowledge. In this
case, the link between meaning and pronunciation is indirect and modulated by
context: the phonetic information would suggest a certain context to listeners,
and knowledge of the context would in turn inform the decision about the meaning
of a word.

There is convincing evidence that both abstraction and acoustic specificity play
a role in phonetic representations (e.g., McQueen et al., 2006; Ernestus, 2014; Pierrehumbert, 2016),
and teasing apart the relative contributions is not possible with a case study.
However, if listeners can make use of phonetic differences, it is necessary for
there to be independent representations of meanings at some level. If the
phonetic representations are shared rather than separately specified, then the
relevant contexts must be tracked independently and stored
separately for the different meanings. There is evidence that homophones (with
unrelated meanings) and polysemes (with related meanings) are represented
differently by listeners: specifically, there is psycholinguistic evidence that
while the different meanings of a homophone pair have independent semantic
representations, the different “senses” of polysemes may fall under the same
representation (Rabagliati
& Snedeker, 2013; Rodd et al., 2002). If senses do share
a representation, it is difficult to see how either option above (separate
pronunciations or separate contexts) could be stored.
1.5 The current study
We present a case study examining listeners’ use of multiple phonetic cues in
distinguishing between two meanings of “sorry”: when it is used as an apology
versus when it is used as an attention-seeking mechanism (ExcuseMe).
Restricting the domain to a single word pair allows for an in-depth look at
multiple dimensions, including pitch, intensity, and durational measurements.
Previous work has shown that listeners make use of phonetic detail to
inform perception of homophones; this has primarily been focused on one
dimension at a time, focusing either on durational cues or (in a few cases)
intonation contour. However, as we have seen above, any pair of words will
likely differ on multiple dimensions for multiple reasons, including frequency,
contextual predictability, and the prosodic, emotional, and pragmatic contexts
they are likely to occur in. In this work, we test whether these other
dimensions are used by listeners, and we assess their relative role in cuing the
distinction for listeners. Our study is structured around three questions:

1. Which, if any, acoustic characteristics distinguish the Apology vs.
ExcuseMe meanings of the English word “sorry”?
2. Do listeners accurately perceive the difference between the two uses of
“sorry,” when hearing isolated tokens removed from their original context?
3. Which phonetic cues do listeners use when making their judgments, and what
is their relative reliance on each cue?

We address these goals in a series of three experiments. In Experiment 1, we
analyze productions of “sorry” recorded by voice actors in contexts designed to
elicit the two different meanings in different levels of situational intensity,
and examine the patterning of several acoustic cues shown to be relevant to
homophone distinctions and affect/emotional distinctions in previous work:
duration, pitch, and intensity. In Experiment 2, we test listeners’ accuracy in
identifying the intended meaning of tokens of “sorry” produced in Experiment 1
and spliced out of their context. We also look at the correlation of listeners’
responses with the acoustic properties of the stimuli to begin to establish
which of the cues, if any, might play a role in listeners’ decisions. Based on
significant correlations between listeners’ responses and multiple acoustic
dimensions in Experiment 2, Experiment 3 examines listeners’ perception of
tokens of “sorry” that have been systematically manipulated across three
dimensions, to assess whether these cues play an independent role in perception
and to quantify listeners’ relative reliance on each cue. Prior to running the
experiment, we did not have specific hypotheses about the use of individual
cues. However, given the fact that apologies are likely to be said in more
submissive, less assertive attitudes, we might expect them to have higher F0 and
lower intensity than ExcuseMe. It is also possible that the two
meanings are associated with different intonation contours, whether for
structural reasons or because they are idiomatically specified (e.g., Calhoun & Schweitzer,
2012).

If listeners associate the two senses with different phonetic realizations, it
indicates two things. First, the different representations of the polysemous
“sorry” must be independent at some level of processing, whether they are
linked directly to distinct pronunciations or associated with distinct
contexts, which in turn are associated with systematic phonetic patterns.
Second, listeners must be sensitive to this acoustic detail and actively make
use of it during speech perception. An additional contribution of this study is
to examine perception of tokens systematically manipulated along multiple
dimensions, as compared to perception of naturally produced tokens, where the
dimensions may covary. This allows us to evaluate the extent to which listeners
use each of these dimensions independently.
2 Experiment 1—Production
2.1 Overview
The purpose of Experiment 1 was to provide stimuli for perception experiments, as
well as to examine the question of which phonetic cues, if any, distinguish
productions of the word “sorry” used as an Apology from those meaning
ExcuseMe in this dataset. We use recordings from four voice actors
producing sentences containing the two types in context as the basis for our
analysis. We used scripted dialogues in order to maintain control over the
phonetic content of the utterances, and we recruited voice actors because we
expected them to be comfortable reading scripted dialogues in a relatively
naturalistic way (see Banse
& Scherer, 1996, for discussion of the ecological validity of
using voice actors to simulate emotional communication).
2.2 Methods
2.2.1 Participants, materials, and procedure
Two male and two female American English voice actors recruited from
fiverr.com (a platform for freelance workers) were paid to
take part in the experiment. Eighteen scenarios containing the word
sorry were used for the production study. Table 1 shows six
sample scenarios, and the full set is provided in the supplementary
material. Each of these 18 scenarios fit into one of six general situations,
half of which included someone apologizing, and the other half of which
included someone needing to say “excuse me.” From each of these general
scenarios, three specific situations, varying on a three-tier scale of
“situational intensity,” were created, resulting in the total of 18
scenarios. For example, the Apology example in Table 1 belongs to
the “Book” base scenario grouping. In the low-intensity
version of this scenario, only the cover of the book was damaged, while in
the medium-intensity version the entire book was significantly damaged,
and in the high-intensity setting the book was the friend’s most treasured
possession and was completely ruined. The ExcuseMe example belongs
to the “Coffee Shop” base scenario. In the low-intensity
version of this scenario, the subject needs to pick up coffee while running
ahead of schedule, while in the medium and high intensities, this changes to
running slightly late in the former situation and running 30 minutes late in
the latter.
Table 1.
Six of the scenarios used in the production experiment. The “context”
was read silently, while the two lines of dialogue, including the
target sentence, spoken by “Alex,” were read out loud.
Apology (“Book” scenario), Intensity Level 1
[Context]: [Alex’s friend has this new book
they’ve been interested in reading but haven’t had
time yet. Alex wanted to read the book as well so
his/her friend lent them their copy. Alex really
enjoyed the book and goes to return it and encourage
his/her friend to make time to read it so they can
discuss it. When s/he leaves her/his apartment it’s
raining pretty hard, so to protect the book s/he
puts it in a plastic bag; however, on the way there
s/he bumps into someone and the bag rips, dropping
the book in a puddle and damaging the cover. Alex
feels somewhat responsible and goes to
apologize.]

Alex: Sorry! Please don’t be mad!
It was an accident! I really didn’t mean for it to
happen!

Friend: It’s fine, the
cover art was ugly anyway.
ExcuseMe (“Coffee Shop” scenario), Intensity Level 1

[Context]: [Alex was on his/her way in to work
when s/he got a text from his/her co-worker asking
him/her to pick up the coffee orders for the office
since it was his/her turn. Alex had no problem with
this as s/he left home with enough time to make a
small detour. Once Alex collects the orders s/he
heads over to the self-serve table to add the
appropriate amounts of cream and sugar to the
various cups. At the counter there’s a wo/man
putting milk in her/his tea who is blocking Alex’s
access to the sugar. S/he’s not in a rush but if
s/he doesn’t leave soon s/he might be pushing it so
s/he decides to reach past the wo/man to grab the
sugar.]

Alex: Sorry, just gonna reach past
you to grab the sugar real quick, hope you don’t
mind.

Wo/man: Oh, feel
free.
Apology (“Book” scenario), Intensity Level 2
[Context]: [Alex’s friend has this new book
they’ve been raving about and insisting Alex read.
Alex agrees to give the book a go so his/her friend
lent them their copy. Alex really enjoyed the book
and goes to return it and discuss the story with
their friend. When s/he leaves her/his apartment
it’s raining pretty hard, so to protect the book
s/he puts it in a plastic bag; however, on the way
there s/he bumps into someone and the bag rips,
dropping the book in a puddle and soaking through
all the pages, warping it to twice its normal size
even though the text is still readable. Alex feels
pretty bad about what happened and goes to
apologize.]

Alex: Sorry! Please don’t be mad!
It was an accident! I really didn’t mean for it to
happen!

Friend: That really
sucks, but if you buy me a replacement I’ll forgive
you.
ExcuseMe (“Coffee Shop” scenario), Intensity Level 2

[Context]: [Alex was on his/her way in to work
when s/he got a text from his/her co-worker asking
him/her to pick up the coffee orders for the office
since it was his/her turn. Alex was a little behind
schedule but figured s/he probably had time if s/he
was quick. Once Alex collects the orders s/he heads
over to the self-serve table to add the appropriate
amounts of cream and sugar to the various cups. At
the counter there’s a wo/man putting milk in her/his
tea who is blocking Alex’s access to the sugar.
S/he’s in a bit of a rush at this point and if s/he
doesn’t leave soon s/he’ll definitely be late so
s/he decides to reach past the wo/man to grab the
sugar.]

Alex: Sorry, just gonna reach past
you to grab the sugar real quick, hope you don’t
mind.

Wo/man: Sure.
Apology (“Book” scenario), Intensity Level 3
[Context]: [Alex’s friend has been going on and
on about their favorite book for all the years Alex
has known them, their mom gave it to them when they
were 14 just before she passed away and they have
this book memorized cover to cover. It’s their most
treasured possession and if you ever asked them what
one thing they would save in a fire, their answer
would be the book. They’ve been insisting Alex read
it for years so they can talk about the story with
each other and as a last resort to get Alex to read
it, they lent him/her their copy. After reading the
book Alex found s/he really did enjoy it and is
going to meet up with his/her friend to discuss the
story. When s/he leaves her/his apartment it’s
raining pretty hard, so to protect the book s/he
puts it in a plastic bag; however, on the way there
s/he bumps into someone and the bag rips, dropping
the book in a huge puddle and utterly ruining it.
S/he feels awful and immediately goes to apologize
to their friend.]

Alex: Sorry! Please don’t be mad!
It was an accident! I really didn’t mean for it to
happen!

Friend: I don’t know
what to say.
ExcuseMe (“Coffee Shop” scenario), Intensity Level 3

[Context]: [It was already looking like Alex
would be late to work when s/he got a text from
his/her co-worker asking him/her to pick up the
coffee orders for the office since it was his/her
turn. Alex cursed his/her bad luck but knew that it
was in fact his/her turn. Once Alex collects the
orders s/he speed walks over to the self-serve table
to add the appropriate amounts of cream and sugar to
the various cups. At the counter there’s a wo/man
putting milk in her/his tea who is blocking Alex’s
access to the sugar. S/he starts tapping her/his foot
but realizes unless s/he says something s/he could
be waiting awhile and s/he’s already gonna be at
least 30 minutes late. S/he decides to reach past
the wo/man to grab the sugar.]

Alex: Sorry, just gonna reach past
you to grab the sugar real quick, hope you don’t
mind.

Wo/man: Go ahead.
The script given to the participants included all scenarios. As shown in
Table 1,
each scenario included a first paragraph to provide context, which was not
read aloud, followed by two lines of dialogue, which the participants were
asked to read aloud. The first line of dialogue (spoken by the character
named Alex) was the sentence containing the target word “sorry,” and it
remained identical, including in punctuation, across each of the three
intensity levels within each base scenario group. In all sentences, “sorry”
was in utterance-initial position, followed by either a stop or affricate
(this was done to facilitate clean splicing of the word for use in the
subsequent perception experiments). The order of scenarios was
pseudo-randomized such that no scenarios of the same base group, or of the
same intensity, appeared immediately next to one another.

Participants were sent the script described above and given instructions
about the recording procedure. Recordings took place in a quiet environment
using personal audio-recording equipment. Participants were instructed to
read the context silently to themselves, then to read both lines of dialogue
out loud using their regular voices. They were asked not to whisper; this
instruction was included in an attempt to avoid breathy voice, which could
pose problems for pitch measurement and/or manipulation. Participants
recorded two repetitions of the 18 dialogues from the scenarios above,
resulting in 144 tokens of “sorry” in context (18 scenarios × 2 repetitions
× 4 actors).

In addition to these “contextual” dialogues containing the word “sorry,”
participants were also asked to record 20 instances of “sorry” in isolation,
10 as if they were apologizing, and 10 as if they were saying “excuse me.”
They were asked to indicate orally when they switched from one set to the
other and to feel free to vary the mood however they saw fit. No further
specific instructions were given. Five tokens of each Sorry Type from each
speaker were chosen for the analysis.
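One way to generate the pseudo-randomized scenario order described above is by
rejection sampling. The R sketch below is a schematic illustration rather than
the script actually used, and the base-group labels other than “Book” and
“Coffee Shop” are placeholders.

```r
# Schematic sketch: generate a scenario order in which no two adjacent
# scenarios share a base group or an intensity level (placeholder labels).
scenarios <- expand.grid(base = c("Book", "CoffeeShop", "S3", "S4", "S5", "S6"),
                         intensity = 1:3)

valid_order <- function(d) {
  same_base      <- head(d$base, -1)      == tail(d$base, -1)
  same_intensity <- head(d$intensity, -1) == tail(d$intensity, -1)
  !any(same_base | same_intensity)        # TRUE if no adjacent clash
}

set.seed(42)
repeat {
  ord <- scenarios[sample(nrow(scenarios)), ]  # random permutation of rows
  if (valid_order(ord)) break
}
ord  # one ordering satisfying both adjacency constraints
```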
2.2.2 Annotations and measurements
We focused on phonetic dimensions shown in previous work to be relevant to
phonetically differentiating between homophones, and/or differentiating
expression of emotion: pitch (mean and contour) (ex: Bänziger & Scherer, 2005; Sauter et al.,
2010), duration (Gahl, 2008; Lohmann, 2018a; etc.) and intensity (Pereira & Watson, 1998).All annotations and measurements were done using Praat (Boersma & Weenink, 2018). First,
the following landmarks were manually annotated: fricative onset (marked by
the onset of frication in /s/), fricative offset (marked by the beginning of
visible formants in the vowel following /s/), and word end (the end of
stable F2).

F0 (in Hz) was measured at seven time points (0%, 10%, 25%, 50%, 75%, 90%,
100%) across the periodic portion of the word (i.e., everything except the
initial fricative). In order to minimize measurement errors, we first
determined speaker-specific floors and ceilings based on manual inspection
of the data and used these as the basis for automatic pitch measurements.
Visible outliers were checked and manually corrected.

Mean intensity (in dB) was measured across the full word.

One additional source of variability that we thought might be relevant was
the intonation used to produce the target word. Based on initial inspection
of the data, tokens varied in whether they had a final rising or falling
contour. An example of each is given in Figure 1, and the corresponding audio
files are available in the supplementary materials. Data were coded based on
the perceptual judgments of a native speaker of English familiar with the
ToBI (Tones and Break Indices) system (Beckman & Ayers, 1997). Final
rise tokens fell into the HLH% or LH% categories, and final fall tokens were
HLL% or LL%; all tokens were then coded into a binary choice of final rise
versus final fall for ease of analysis. A second annotator with no knowledge of the study
provided independent judgments, and there was 76% agreement. Only tokens
which were consistently annotated (n = 140) are included in
the intonation contour plots and analyses below.
Figure 1.
Spectrograms overlaid with pitch contours (white line) for a final
rising (a) vs. falling (b) contour. Both of these were productions
of an ExcuseMe token from participant M2.
Based on the landmarks and measurements above, we use the following measures
as dependent variables in the statistical analyses below:

- Total word duration: fricative onset to word end;
- Mean F0: the average across the seven measured points;
- Mean intensity: the average across the full word;
- Pitch contour: final rise versus final fall.
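For concreteness, the sketch below illustrates how these measures can be
assembled in R from the annotated landmarks and pitch samples exported from
Praat. It is a schematic illustration rather than the analysis script, and all
object and column names are hypothetical.

```r
# Schematic sketch: dependent measures for one token. `pitch` is a data
# frame of F0 samples over the voiced portion (hypothetical columns:
# time in s, f0 in Hz); landmark times are in seconds; `mean_db` is
# Praat's mean intensity for the full word.
f0_at_points <- function(pitch, props = c(0, .10, .25, .50, .75, .90, 1)) {
  t0 <- min(pitch$time); t1 <- max(pitch$time)
  # linearly interpolate F0 at the seven proportional time points
  approx(pitch$time, pitch$f0, xout = t0 + props * (t1 - t0))$y
}

token_measures <- function(pitch, fric_onset, word_end, mean_db) {
  data.frame(
    duration_ms = (word_end - fric_onset) * 1000,  # total word duration
    mean_f0     = mean(f0_at_points(pitch), na.rm = TRUE),
    mean_int_db = mean_db
  )
}
```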
2.2.3 Statistical analysis
Our primary question for the production experiment was to determine whether
there are phonetic differences between the two types of “sorry”
(Apology vs. ExcuseMe). To evaluate this, we used
mixed-effects regression, using the lme4 (Bates et al., 2015) package in R
(R Core Team,
2019) to assess the effect of Sorry Type on each dimension shown
above. For each continuous dimension, we used a linear mixed-effects model
(and, for the binary pitch contour variable, a logistic mixed-effects model)
predicting the value of that dimension from Sorry Type.
Sorry Type was simple-coded (apology: −0.5, excuse
me: 0.5). The model included random intercepts for speaker and
item (where each of the 18 scenarios represented an item), as well as a
by-speaker slope for Sorry Type. P-values were computed using the lmerTest
package (Kuznetsova et
al., 2017), and an alpha-level of 0.05 was used as the threshold
for significance. Although not our primary question of interest, we also
tested whether the degree of situational intensity influenced each acoustic
dimension (model details described below).
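For concreteness, a schematic version of the duration model is sketched below;
the data frame `prod` and its column names are illustrative stand-ins, not the
actual analysis objects.

```r
# Schematic sketch of the production model for duration (illustrative names).
library(lme4)
library(lmerTest)  # adds Satterthwaite p-values to lmer summaries

# simple coding: the intercept is the grand mean and the Sorry Type
# coefficient is the ExcuseMe-minus-Apology difference
prod$sorry_type <- ifelse(prod$meaning == "excuse_me", 0.5, -0.5)

m_dur <- lmer(duration ~ sorry_type + (sorry_type | speaker) + (1 | item),
              data = prod)
summary(m_dur)
```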
2.3 Results
This experiment sought to determine whether “sorry” as an apology is phonetically
distinct from a “sorry” meaning “excuse me,” and if so, which phonetic cues
distinguish them. We examined four dimensions that were expected, based on
previous work, to be potentially relevant: word duration, mean F0, intensity,
and pitch contour. A total of 184 tokens were analyzed (144 “contextual” tokens plus 40
spoken in isolation). Tokens for which more than half of the pitch contour was
undefined (n = 9) were omitted from the F0 analysis. We discuss
results for each dimension in turn in the following paragraphs. Graphs of all
dimensions are provided in Figure 2, and statistical results in Table 2. For continuous dimensions, the
graphs show the overall distribution of all tokens (violin plots) for each Sorry
Type, as well as speaker-specific means and 95% confidence intervals. For the
graph showing the binary variable pitch contour, the proportion of tokens
produced with a final rising (vs. falling) tone for each speaker is given.
Figure 2.
Graphs of values for each acoustic dimension measured in the production
study for tokens intended as Apology vs. ExcuseMe.
Violin plots show the distribution of values for continuous variables, with
speaker-specific means and confidence intervals shown in different shades of
gray.
Table 2.
Results of linear mixed-effects regression models for duration, pitch,
and intensity and a logistic mixed-effects regression model for pitch
contour. The model structure (example given for duration) is
lmer(duration ~ Sorry Type + (Sorry Type | speaker) + (1|item)).
Significant (p < .05) results are bolded.
                         β        SE       t/z         p
Duration
    Intercept         565.44    24.82     22.77    < 0.001
    Sorry Type       –126.46    26.60     –4.75      0.011
Mean F0
    Intercept         216.61    41.41      5.23      0.013
    Sorry Type        –13.68    17.59     –0.78      0.489
Mean intensity
    Intercept          70.22     0.64    109.55    < 0.001
    Sorry Type         –0.32     1.27     –0.25      0.811
Final pitch contour
    Intercept           0.27     0.23      1.15      0.250
    Sorry Type          0.94     0.83      1.14      0.253
Duration: The first panel of Figure 2 shows the distribution of
duration values for Apology vs. ExcuseMe tokens. Overall,
ExcuseMe tokens are shorter than Apology tokens, and this
pattern is consistent across speakers. As indicated in the statistical results
(Table 2),
tokens are on average 565 ms long (as indicated by the estimate for Intercept), and
Apology tokens are on average 126 ms longer than those intended as
ExcuseMe (as indicated by the estimate for Sorry Type). This effect
of Sorry Type is significant.

Mean F0: The second panel of Figure 2 shows the distribution of mean
F0 values across Sorry Types. There is no clear difference, nor is there a
consistent pattern across speakers. This is reflected in the statistical
results: while ExcuseMe tokens are on average 13.68 Hz lower than
Apology tokens (as indicated by the estimate for Sorry Type), this
difference is not significant.

Mean Intensity: The third panel of Figure 2 shows the distribution of mean
intensity values across Sorry Types. There is again no consistent pattern, and
this is again reflected in the statistical results; numerically,
ExcuseMe tokens are on average 0.32 dB lower than Apology
tokens, but this difference is not significant.

Pitch contour: As shown in the final panel of Figure 2, the speakers
differed in their distribution of pitch contours across the Sorry Types, with
three speakers producing more final-rising-tone tokens for ExcuseMe
than Apology, and one showing the opposite pattern. Since the pitch
contour variable was binary (vs. the three continuous factors discussed above),
the statistical model was a mixed-effects logistic regression model. The
estimates therefore represent the log odds of a final rising tone (vs. final
falling tone). The positive estimate for the intercept indicates that overall,
final rising tone is numerically more common than final falling tone, but this
difference is not significant. The positive
estimate for Sorry Type indicates that ExcuseMe tokens are
characterized by rising tone more frequently than Apology tokens, but
this difference is again not significant.
2.3.1 Situational intensity
Although not our primary question of interest, we examined whether the degree
of situational intensity influenced each acoustic dimension. In order to
test this, we created four models, one for each dimension, which differed
from the models described above in that (1) the 40 tokens produced in
isolation were omitted from analysis, and (2) the model structure included
an additional predictor variable of Situational Intensity, as well as its
interaction with Sorry Type. Situational Intensity was simple-coded as a
three-level factor, with Level 1 as the reference level. The only
significant effects were found in the model for duration: as above, there
was a main effect of Sorry Type, showing longer durations for
Apology than ExcuseMe tokens (β = −123.22,
SE = 25.63, t = −4.81,
p = 0.017), and there was also a main effect of
Situational Intensity, where tokens produced in the Level 3 intensity
scenarios (across both Sorry Types) were on average 46 ms longer than those
produced in Level 1 (β = 46.22, SE = 19.72,
t = 2.34, p = 0.026). This difference
is shown in Figure
3. No other main effects or interactions for any of the models were
significant.
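A schematic version of these models (shown for the duration dimension) is
sketched below, again with illustrative object names; the simple coding of the
three-level intensity factor is the treatment contrast matrix shifted by −1/3.

```r
# Schematic sketch of the Situational Intensity model for duration, fit to
# contextual tokens only (continues the sketch above: lme4/lmerTest loaded,
# `prod` defined; all names illustrative).
ctx <- subset(prod, !is.na(intensity))   # drop the 40 isolated tokens
ctx$intensity <- factor(ctx$intensity)   # levels 1, 2, 3

# simple coding: each contrast compares a level to Level 1, while the
# intercept remains the grand mean
contrasts(ctx$intensity) <- contr.treatment(3) - 1/3

m_dur_int <- lmer(duration ~ sorry_type * intensity +
                    (sorry_type | speaker) + (1 | item), data = ctx)
summary(m_dur_int)
```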
Figure 3.
Distribution of durations by Sorry Type and Situational Intensity
(for contextual “sorry” tokens only).
2.4 Production: Interim summary
The results above show that Apology tokens in this dataset are
significantly longer than ExcuseMe tokens. Numerically, they are also
characterized by higher intensity, higher F0, and a lower likelihood of a final
rising tone, but these differences were not significant. Durations were also
slightly longer for situations with higher emotional intensity (across both
Sorry Types).

This is a very small-scale study; we therefore do not attempt to generalize
these findings to a larger population, but rather aim to provide
an overall analysis of the productions that form the basis of the perception
experiment. Nevertheless, we can make some observations. First, we found a
significant effect of duration, consistent across all speakers. Second, while
the other three dimensions were not significant, looking at the by-speaker
results suggests that this lack of significance may have different underlying
causes for the different dimensions. For intensity and mean F0, none of the
speakers showed clear differences; for pitch contour, by contrast, we found
fairly robust differences within each speaker but different strategies across
speakers. A
larger sample size is necessary to determine whether this observation is
correct.
3 Experiment 2—Perception of “sorry”
3.1 Overview
In this experiment, listeners completed a forced-choice identification task,
classifying tokens of “sorry” produced by voice actors in Experiment 1 as either
an Apology or ExcuseMe. The primary purpose of this experiment
was to determine whether listeners accurately perceive the speakers’ intended
meaning in the absence of context. By examining the acoustic correlates of
listeners’ responses, this experiment also allowed for a preliminary
investigation of the question of which phonetic cues listeners use in their
classification decisions.
3.2 Methods
3.2.1 Participants
Forty-seven listeners residing in Ontario, Canada (7 males and 40 females, age range
18 to 69, mean age 27.5) participated in this experiment. All had learned
English as a child (before the age of 10).
3.2.2 Materials
Target stimuli consisted of the 184 tokens of “sorry” analyzed in the
production study. These were made up of 144 contextual “sorry” tokens,
spliced out of the first line of dialogue shown in Table 1 (recall that the “sorry”
was always utterance-initial and followed by a stop or affricate such that
the target word was surrounded by silence), as well as 40 tokens produced in
isolation.

In addition to these target stimuli, the experiment contained four practice
trials and 18 filler trials. Both the practice and filler trials were full
sentences with context making it explicit whether the speaker was intending
an Apology or ExcuseMe (e.g., Apology: “I’m so
sorry for taking advantage of you like that, can you forgive
me?” Excuse me: “Sorry, can I just reach past you there
for a sec?”). The four practice trials were played at the
beginning of the experiment, and the 18 filler trials were interspersed at
regular intervals throughout the experiment both in order to ensure that
participants were not responding randomly and to break up the monotony of
the task.
3.2.3 Procedure
The experiment was run using PsychoPy (Peirce, 2007). Participants were
given headphones and were seated in either a soundproof booth (Toronto) or a
quiet room (Ottawa). Before the experiment, participants were given an oral
explanation of the nature of the experiment. They were first given examples
of how the word “sorry” could be used in different ways: to apologize for
something, or to get someone’s attention in order to get by (i.e., in place
of “excuse me”). Then, they were told that they would be listening to
multiple repetitions of the word “sorry” in isolation and asked to decide
whether it sounded more like an apology or “excuse me,” indicating their
choice via keypress. Short written instructions were also provided at the
beginning of the experiment.

The experiment consisted of four practice trials, followed by the 184 target
items, randomized by participant, interspersed at regular intervals with the
18 filler trials. The relevant response keys (“a” for Apology and
“l” for ExcuseMe) were indicated by stickers on the keyboard, as
well as on the experiment presentation screen. Responses could not be
given until the full sound file had played. The experiment took
approximately 15 minutes.
3.2.4 Analysis
We coded listeners’ responses (Apology vs. ExcuseMe), as
well as their accuracy in identifying the speaker’s intended meaning. Two
analyses were performed. First, logistic mixed-effects models were used to
determine whether listeners identified the speakers’ intended meaning at
greater than chance level, and whether this differed across Sorry Types or
Situational Intensity levels. Second, we examined which acoustic properties
of the stimuli were predictive of listeners’ responses, using the acoustic
measurements discussed in the production experiment. The lme4 package (Bates et al., 2015)
in R (R Core Team,
2019) was used for the statistical analysis.
3.3 Results
3.3.1 Accuracy
Figure 4(a) shows the
percentage of time participants accurately classified tokens as
Apology vs. ExcuseMe, broken down by intended Sorry
Type and by speaker, and Figure 4(b) shows listeners’ responses broken down by Sorry Type
and Situational Intensity (excluding the tokens produced in isolation, which
did not have a value for Situational Intensity).
Figure 4.
(a) Percentage correct (i.e., listeners’ choice matched the speaker’s
intention) identification of “sorry” tokens, broken down by the
intended meaning (Apology vs. ExcuseMe) and by
speaker, and (b) Percentage Apology responses by Sorry Type
and Situational Intensity. Error bars show 95% confidence intervals
based on the distribution of by-participant means.
Sorry Type: We tested whether listeners performed
significantly above chance, and whether performance varied by Sorry Type and
speaker. We used a mixed-effects logistic regression model with Accuracy as
the response variable and Sorry Type as a fixed predictor (sum-coded,
reference level Apology). We also included random intercepts for
participant, word, and speaker, as well as by-participant and by-speaker
random slopes for Sorry Type. The model estimate for the intercept was
significant (β = 0.751, SE = 0.136, z =
5.536, p < .001), with the positive estimate indicating
that listeners were significantly above chance in this task (64.7% accuracy
overall). The main effect for Sorry Type was not significant (β = −0.098,
SE = 0.486, z = −0.202,
p = 0.84), as reflected by a lack of clear difference
across Sorry Types in Figure 4(a).

Situational Intensity: In order to test whether Situational
Intensity affected listeners’ responses, we ran a model with only the
Contextual stimuli, since the tokens produced in isolation did not have a
value for Situational Intensity. The model predicted listeners’ choice of
Apology (vs. ExcuseMe) based on the predictor
variables of Sorry Type and Situational Intensity (simple-coded, reference
level = 1), with random by-participant and by-speaker intercepts and slopes
for Sorry Type and Situational Intensity, as well as a random by-word
intercept. There was again a significant effect for Sorry Type, reflecting
the results above that listeners’ choices matched the actors’ intent (β =
−1.312, SE = 0.276, z = −4.760,
p < .001), but Situational Intensity did not have a
significant effect on listeners’ responses (Level 2 vs. Level 1: β = 0.055,
SE = 0.209, z = 0.264,
p = 0.792; Level 3 vs. Level 1: β = 0.171,
SE = 0.197, z = 0.866,
p = 0.386). The intercept of this model was also not
significant (β = −0.233, SE = 0.249, z =
−0.935, p = 0.350), indicating that there was no overall
bias for either Sorry Type.In sum, listeners were above chance in classifying Apology vs.
ExcuseMe tokens, although performance was still well below
ceiling, and there were no systematic differences in accuracy across the two
Sorry Types. There was no overall bias in choice of Sorry Type, with
listeners choosing Apology 48% of the time.
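For reference, the accuracy model can be written out as follows; this is a
schematic sketch, with `resp` and its column names as illustrative stand-ins
rather than the actual analysis objects.

```r
# Schematic sketch of the accuracy model (illustrative names); `correct`
# is 1 when the listener's choice matched the speaker's intended meaning.
library(lme4)

resp$sorry_type <- factor(resp$sorry_type)
contrasts(resp$sorry_type) <- contr.sum(2)   # sum coding, as in the text

m_acc <- glmer(correct ~ sorry_type +
                 (sorry_type | participant) + (sorry_type | speaker) +
                 (1 | word),
               data = resp, family = binomial)

# the intercept is in log odds; back-transforming the reported estimate
# gives overall accuracy on the probability scale
plogis(0.751)  # ~0.68, in line with the observed 64.7%
```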
3.3.2 Correlations between responses and acoustic properties
We now turn to the question of which cues listeners use when differentiating
between the two Sorry Types, returning to the four acoustic values
(duration, mean F0, mean intensity, and pitch contour) described in the
production study. For the perception analysis, we used normalized F0 values,
scaled to z-scores for each speaker, instead of raw values,
since we expect listeners to normalize by speaker. Figure 5 shows how often listeners
classified each token as an Apology (vs. ExcuseMe), as a
function of each acoustic dimension.
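A schematic version of this scaling step in R, with `stim` and its columns as
illustrative names:

```r
# Schematic sketch of predictor scaling (illustrative names). F0 is z-scored
# within speaker; the other continuous predictors are z-scored over all
# tokens, as in the models below.
stim$f0_z  <- ave(stim$mean_f0, stim$speaker,
                  FUN = function(x) as.numeric(scale(x)))
stim$dur_z <- as.numeric(scale(stim$duration))
stim$int_z <- as.numeric(scale(stim$mean_int))
```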
Figure 5.
Percentage Apology (vs. ExcuseMe) choice, as a
function of duration, F0 (normed), mean intensity, and final pitch
contour. In the first three panels, each point represents one
stimulus, and in the final panel, boxplots for pitch contour show
the distribution of responses for falling versus rising tokens.
Best-fit linear regression lines are shown for the continuous
dimensions.
The patterns in Figure
5 suggest that listeners are more likely to perceive tokens as
indicating an apology when they are longer in duration, lower in F0, lower
in intensity, and characterized by a falling (vs. rising) pitch contour.

In order to determine whether these patterns are significant, we used
logistic regression models predicting listeners’ responses from each
dimension. Prior to the regression analysis, we examined the
interdependencies between the dimensions, testing whether all pairwise
relationships were significant using linear regression models, with an
alpha-level of 0.05. Three of the six pairwise comparisons showed a
significant relationship (F0~intensity: t = 8.56, p <
.001; F0~pitch contour: t = 2.59, p = 0.010;
duration~intensity: t = −3.64, p < .001) while the other
three did not (F0~duration: t = −0.93, p = 0.355;
duration~pitch contour: t = −1.06, p = 0.288;
intensity~pitch contour: t = 1.19, p = 0.235).

We then analyzed the relationship between acoustics and listener responses
using four separate logistic regression models, one for each dimension,
parallel to the production analysis. For each model, the binary response
variable was participants’ choice of Apology (vs.
ExcuseMe), and the predictor variable was one of the four
dimensions shown in the graphs above. Continuous variables (duration,
normalized F0, and mean intensity) were scaled to z-scores
prior to analysis, and pitch contour was simple-coded, with “falling” as the
reference level. Each model also included a random by-participant and
by-speaker intercept, as well as random by-participant and by-speaker slopes
for the fixed predictor (e.g., duration). Statistical results are shown in
Table 3.
Table 3.
Results of logistic mixed-effects regression models predicting
Apology choice from four acoustic dimensions. The model
structure (example given for duration) is glmer(choice ~ duration +
(duration | participant) + (duration | speaker), family=binomial).
Significant (p < .05) results are bolded.
                            β      SE       z        p
    Duration
      Intercept           –0.04    0.24    –0.18     0.859
      Duration             0.99    0.08    12.40    < .001
    Mean F0
      Intercept           –0.11    0.18    –0.62     0.538
      F0 (normed)         –0.26    0.16    –1.62     0.105
    Mean intensity
      Intercept           –0.02    0.15    –0.14     0.888
      Intensity           –0.47    0.12    –3.85    < .001
    Final pitch contour
      Intercept           –0.30    0.24    –1.22     0.224
      Pitch contour       –0.53    0.18    –2.91     0.004
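For concreteness, these analyses can be sketched in R with lme4. The per-dimension model follows the structure given in the Table 3 caption; the data frames (stims for per-token acoustics, resp for trial-level responses) and column names are illustrative assumptions, not the original scripts.

    library(lme4)

    # One of the six pairwise interdependency checks between dimensions
    # (alpha = .05), run on the per-token acoustic measurements:
    summary(lm(f0.normed ~ intensity, data = stims))

    # Per-dimension logistic mixed-effects model, following the structure in
    # the Table 3 caption (duration shown; the other models are parallel):
    resp$duration.z <- as.numeric(scale(resp$duration))  # z-scored predictor
    m.dur <- glmer(choice ~ duration.z +
                     (duration.z | participant) + (duration.z | speaker),
                   data = resp, family = binomial)
    summary(m.dur)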
The estimates for each main effect represent the change in log odds of an
Apology response for a one-unit change in the predictor
variable: in the case of the continuous predictor variables, which were
scaled to z-scores, this represents a
one-standard-deviation change, and for pitch contour, this represents the
difference between falling and rising pitch contours. The results show that duration, mean intensity, and pitch contour were significantly predictive of listeners' responses, but mean F0 was not; for example, the duration estimate (β = 0.99) corresponds to an odds ratio of e^0.99 ≈ 2.7, meaning a one-standard-deviation increase in duration almost tripled the odds of an Apology response.

These results suggest that multiple acoustic dimensions may inform listeners'
decisions about the meaning of “sorry.” Given that these dimensions were
correlated with one another in the natural productions, it is not possible
to make claims about the independent role of any single dimension; for
example, since duration and intensity are correlated, it could be that listeners use only duration, with the apparent effect of intensity arising because lower-intensity tokens (which elicited more Apology responses) also tend to be longer (and longer tokens, in turn, elicited more Apology responses than shorter ones). Nevertheless, this analysis
suggests that one or more of these dimensions is used by listeners; in the
final experiment, we manipulate the dimensions independently in order to
tease apart their individual contributions to listeners’ decisions.
4 Experiment 3—Perception of systematically manipulated tokens of “sorry”
4.1 Overview
The results of Experiment 2 showed that listeners’ perception of a speaker’s
intended meaning of “sorry” is affected by one or more of the acoustic
dimensions measured in Experiment 1. The goal of this experiment was to tease
apart the independent role of these dimensions: which cues are used, and what is
the relative reliance on each cue? We approach this question by examining
listeners’ perception of “sorry” tokens that have been systematically
manipulated to vary along each dimension.
4.2 Methods
4.2.1 Participants
Forty-seven listeners residing in Ontario, Canada (17 male and 30 female, age range 18 to 73, mean age 27.2) participated in this experiment. All participants had learned English as a child (before the age of 10).
4.2.2 Materials
The stimuli for this experiment again consisted of individual tokens of
“sorry” in isolation. Stimuli were created from 12 baseline tokens that had
been used as stimuli in Experiment 2. These baseline tokens were
subsequently manipulated on the three acoustic dimensions found in
Experiment 2 to play a role in predicting listeners’ responses: duration,
intensity, and pitch contour.

Baseline tokens: A subset of the stimuli from Experiment 2 was selected to serve as the basis for subsequent
manipulations. To increase generalizability and to allow for the fact that
there are almost certainly other dimensions beyond those we were
manipulating that affect listeners’ perceptions, we selected 12 baseline
tokens, three from each of the four speakers in Experiment 1. In order to
maximize the range of variability in the natural tokens, we chose the token
from each speaker that had elicited the highest, lowest, and most ambiguous
Apology responses in Experiment 2 (across the four speakers,
these tokens averaged 94%, 9%, and 51% Apology responses
respectively). Each of these 12 tokens then served as the baseline for the
manipulations described below. A summary of the stimulus set used in this
experiment is given in Table 4.
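A minimal sketch of this selection logic, assuming a hypothetical data frame tokens with one row per Experiment 2 stimulus and a column apol.rate holding its proportion of Apology responses:

    # For each speaker, keep the token with the highest, the lowest, and the
    # most ambiguous (closest to 50%) Apology rate from Experiment 2:
    pick3 <- function(d) d[c(which.max(d$apol.rate),
                             which.min(d$apol.rate),
                             which.min(abs(d$apol.rate - 0.5))), ]
    baselines <- do.call(rbind, by(tokens, tokens$speaker, pick3))
    nrow(baselines)  # 12 baseline tokens: 3 from each of the 4 speakers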
Table 4.
Summary of factors varying in the stimuli for Experiment 3.
    Dimension          Num. levels   Levels
    Baseline tokens
      Speaker          4             The four speakers from Experiment 1
      Base token       3             Heard in Exp. 2 as Apology vs. ExcuseMe vs. ambiguous
    Manipulations
      Pitch contour    2             Falling, rising
      Intensity        2             65 dB, 75 dB
      Duration         5             350 ms, 475 ms, 600 ms, 725 ms, 850 ms
    Total tokens: 240 (12 baseline tokens × 2 contours × 2 intensities × 5 durations)
Pitch contour manipulation: We created two stylized pitch
contours, one with a falling and one with a rising tone. These two contours (which approximate HLL% and HLH% contours, respectively, in the ToBI system; Beckman & Ayers, 1997) were chosen as models for manipulation because they were
the most common contours seen in the data. The parameters for the stylized
contours were chosen by trial and error, with the goal of creating contours
that (1) approximated the patterns seen in the production data and (2)
sounded natural to native speakers of English. An example of the
manipulation is shown in Figure 6.
Figure 6.
Examples of falling (a) and rising (b) stylized pitch contours. The
visible pitch range is 150 to 400 Hz (audio files available in the
supplementary materials).
Pitch contour manipulations were done using the “Manipulation” interface in
Praat. Each stylized pitch contour was created by setting pitch points at
three landmarks: the first at the end of the first vowel of “sorry,” the
second at the onset of the second vowel, and the final one at the end of the
word. For the falling contour, the pitch fell eight semitones from the first to the second landmark, then fell an additional two semitones to the third landmark. For the rising contour, the pitch
fell five semitones from the first to the second landmark, then rose five
semitones to the third landmark. Identical contours were superimposed on all
baseline tokens, but the raw pitch values were speaker-specific: the first
landmark was always set as the speaker’s average F0 value at the 10% point
across the full production dataset. The pitch contour manipulation resulted
in two pitch files for each of the 12 baseline tokens (24 stimuli).

Intensity manipulation: Using the “Scale Intensity” function
in Praat, we created two levels of intensity: 65 and 75 dB for each of the
previously manipulated tokens, resulting in 48 tokens. These values were
chosen to be centered around 70 dB and to allow for a perceptible difference
in volume while remaining within a natural-sounding range so as not to be
overly salient or distracting.

Duration manipulation: Since results from both Experiments 1
and 2 suggested an important role for duration, we chose to create a
five-step continuum to allow for more precision in the analysis. The
endpoints were set at 350 ms and 850 ms, based on the range of duration in
the production dataset (after removing some apparent outliers). Using the
PSOLA-based manipulation algorithm in Praat (Moulines & Charpentier, 1990),
we manipulated each of the previously created 48 tokens to have each of the
five duration values (350, 475, 600, 725, and 850 ms), resulting in a final
total of 240 stimuli.
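The semitone arithmetic behind the pitch landmarks, and the full crossing of manipulations, can be sketched as follows (the resynthesis itself was done in Praat; the starting F0 here is an arbitrary placeholder for a speaker's mean F0 at the 10% point):

    # Moving down st semitones multiplies F0 by 2^(-st/12):
    st_down <- function(hz, st) hz * 2^(-st / 12)
    f0.start <- 250  # placeholder (Hz); speaker-specific in the experiment
    falling <- c(f0.start, st_down(f0.start, 8), st_down(f0.start, 10))  # 8 st down, then 2 more
    rising  <- c(f0.start, st_down(f0.start, 5), f0.start)               # 5 st down, then 5 st back up

    # Crossing the 12 baseline tokens with all manipulations gives 240 stimuli:
    grid <- expand.grid(base      = 1:12,
                        contour   = c("falling", "rising"),
                        intensity = c(65, 75),                   # dB
                        duration  = c(350, 475, 600, 725, 850))  # ms
    nrow(grid)  # 240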
4.2.3 Procedure
The procedure was identical to that of Experiment 2, except the 240
manipulated tokens were used as target stimuli in lieu of the 184 naturally
produced target items from Experiment 2. As in Experiment 2, 18 filler
trials consisting of complete sentences with context were interspersed at
regular intervals, and the order of the 240 target tokens was randomized by
participant.
4.2.4 Analysis
We used a logistic mixed-effects model to model listeners’ response
(Apology vs. ExcuseMe) as a function of the
manipulated acoustic dimensions. Since the dimensions were manipulated
independently and therefore uncorrelated, we were able to analyze all
predictors in a single model. The response variable was listeners’ choice of
Apology (vs. ExcuseMe), with fixed predictors of Pitch
Contour (falling vs. rising), Duration (350–850 ms),
Intensity (65 dB vs. 75 dB), and Base Type (elicited mostly ExcuseMe vs. mostly Apology vs. ambiguous responses in Experiment 2), along with random by-participant
intercepts and slopes for all the fixed predictors, and random by-speaker
intercepts. No interactions were included. Duration was centered and analyzed as a continuous predictor. Categorical predictors were sum-coded, with ExcuseMe (Base Type), falling (Pitch Contour), and 65 dB (Intensity) as reference levels. As above, the lme4 package (Bates et al., 2015)
in R (R Core Team,
2019) was used for the statistical analysis.
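A minimal sketch of this model in R, again with illustrative data-frame and column names; factor levels are ordered so that the stated reference level is the one omitted under sum coding:

    library(lme4)
    # Assumes trial-level responses in resp3, with intensity stored as a
    # character/factor; reference levels placed last for contr.sum:
    resp3$base          <- factor(resp3$base, levels = c("Ambiguous", "Apology", "ExcuseMe"))
    resp3$pitch.contour <- factor(resp3$pitch.contour, levels = c("rising", "falling"))
    resp3$intensity     <- factor(resp3$intensity, levels = c("75", "65"))
    contrasts(resp3$base)          <- contr.sum(3)
    contrasts(resp3$pitch.contour) <- contr.sum(2)
    contrasts(resp3$intensity)     <- contr.sum(2)
    resp3$duration.c <- as.numeric(scale(resp3$duration, scale = FALSE))  # centered, in ms
    m3 <- glmer(choice ~ base + duration.c + intensity + pitch.contour +
                  (base + duration.c + intensity + pitch.contour | participant) +
                  (1 | speaker),
                data = resp3, family = binomial)
    summary(m3)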
4.3 Results
Listeners’ responses as a function of each manipulated dimension are shown in
Figure 7, and
statistical results are given in Table 5. As shown by the
non-significant intercept in the model, there was no overall bias for
Apology vs. ExcuseMe response in this dataset. We discuss
the effect of each dimension in turn in the following paragraphs.
Figure 7.
Listeners’ choice of Apology (vs. ExcuseMe) as a
function of (a) base token, (b) duration, (c) mean intensity, and (d)
pitch contour. Violin plots are based on the distribution of
by-participant means; error bars represent 95% confidence intervals of
these by-participant means.
Table 5.
Results of logistic mixed-effects regression models predicting
Apology choice from baseline token, duration, intensity,
and pitch contour. Reference levels are in italics. The model structure
is glmer(choice ~ base + duration + intensity + pitch.contour + (base +
duration + intensity + pitch.contour | participant) + (1 | speaker),
family=binomial). Significant (p < .05) results are bolded.
                                           β      SE       z        p
    Intercept                            –0.22    0.44    –0.50     0.617
    Base (AMBIG. vs. EXC.)                0.14    0.08     1.80     0.072
    Base (APOL. vs. EXC.)                 0.75    0.09     8.00    < .001
    Duration                              1.00    0.09    11.59    < .001
    Intensity (75 dB vs. 65 dB)          –0.15    0.06    –2.63     0.008
    Pitch contour (rising vs. falling)   –0.72    0.17    –4.28    < .001
Base token: As shown in Figure 7, listeners were more likely to
identify a token as an apology when it was created from a baseline that had elicited more Apology responses in Experiment 2, even though all of the dimensions considered in this work were controlled by applying identical manipulations to every baseline token. The statistical results support this, showing that baseline
Apology tokens elicited more Apology responses (53%) than
baseline ExcuseMe tokens (41%). Baseline ambiguous tokens were
intermediate (43%) but were not significantly different from
ExcuseMe.

Duration: Reflecting the results of Experiment 2, as well as the
production study, there is a remarkably systematic relationship between duration
and listeners’ Apology responses, with longer tokens eliciting
significantly more Apology responses: looking at the endpoints, the
shortest duration (350 ms) elicited on average 20% Apology responses,
whereas the longest duration (850 ms) elicited 66% Apology
responses.

Intensity: As in Experiment 2, tokens with lower intensity
(i.e., softer tokens) elicited more Apology responses (47%) than louder
tokens (44%), and although small, this difference was significant.

Pitch contour: As in Experiment 2, tokens with a falling contour
were more likely to be considered as an Apology than those with a
rising contour (52% for falling vs. 39% for rising contour), and this difference
was significant.

Overall, significant differences were found for all factors tested, and these paralleled the results found in Experiment 2. This suggests an independent role
for each of these dimensions: tokens that were longer in duration, lower in
intensity, and with a final falling pitch tended to be classified more often as
Apology than ExcuseMe. This was also the case for tokens
that elicited more Apology responses in Experiment 2, suggesting there
are cues other than those manipulated here that inform listeners’ decisions.

We also wanted to assess the relative importance of these cues in predicting
listeners’ responses. While the question of how to quantify cue use is complex
(see Schertz & Clare,
2020, for discussion), one simple metric of how predictive a cue is in this sort of paradigm is the difference in predicted responses across the range of each dimension. These differences are summarized in Table 6. Based on this
metric, duration is the most predictive of the factors examined in this work, followed by pitch contour, then base token, then intensity (which elicited only a 3% difference in responses). However, it should
be kept in mind that this can only be interpreted within the stimulus set used
in this experiment. With different parameter values (e.g., a larger difference
between the two intensity values), the apparent “use” of the cue could
differ.
Table 6.
Comparison of differences in percentage Apology response across
the levels (or extreme values, in the case of continuous factors or
factors with more than two levels) of each factor.
    Factor           Levels       % Apology response   Difference
    Duration         350 ms       20%                  46%
                     850 ms       66%
    Pitch contour    falling      52%                  13%
                     rising       39%
    Base token       ExcuseMe     41%                  12%
                     Apology      53%
    Intensity        65 dB        47%                   3%
                     75 dB        44%
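The metric itself is straightforward to compute; a sketch, assuming the trial-level responses of this experiment in the hypothetical data frame resp3 used above:

    # Spread of mean Apology response across a factor's levels,
    # in percentage points:
    cue.range <- function(x) {
      rates <- tapply(resp3$choice == "Apology", x, mean)
      diff(range(rates)) * 100
    }
    cue.range(resp3$duration)       # ~46 (350 ms vs. 850 ms endpoints)
    cue.range(resp3$pitch.contour)  # ~13
    cue.range(resp3$intensity)      # ~3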
Overall, the results of Experiment 3 strengthen the findings of Experiment 2,
suggesting that multiple cues inform listeners’ perception of Sorry Type: using
manipulated stimuli that varied orthogonally on the dimensions of interest, this
experiment provided evidence for an independent role for each of these
dimensions. Specifically, tokens with longer duration, lower intensity, and falling pitch were more likely to be classified as apologies, and, at least within the acoustic space of the stimulus set, duration exerted the greatest influence. Furthermore, the baseline token played a role, with tokens created
from baselines that had been perceived as Apology in their natural form
eliciting more Apology responses even with duration, intensity, and
pitch controlled, indicating the presence of other information in the signal
that is relevant to the perception of this distinction.
5 Discussion
5.1 Summary of results
This work examined the use of multiple acoustic dimensions to cue different
meanings of the word “sorry” in production and perception. In Experiment 1,
tokens of “sorry” produced by voice actors differed in duration based on whether
they were in the contexts suggesting an Apology versus an
ExcuseMe, with longer durations for Apology tokens.
Duration was also longer in contexts of greater situational intensity for both
Sorry Types. No significant differences in other dimensions (mean F0, intonation
contour, intensity) were found, although given the small number of talkers,
these null results cannot be interpreted as a strong claim for a lack of
difference in the general population. In Experiment 2, listeners were able to
identify the actors’ intended meaning (Apology vs. ExcuseMe)
at above-chance levels (64.7% accuracy overall), and their responses were
correlated with duration, intonation contour, and intensity (but not mean F0).
Finally, we tested whether each of these cues contributed independently to
listeners’ judgments in Experiment 3, where listeners identified tokens drawn
from a controlled acoustic space varying systematically along the relevant
dimensions. All three manipulated dimensions did contribute independently to
listeners’ response patterns: duration had the largest effect of the three
manipulated cues, with longer duration cuing Apology, and with pitch
contour and intensity (where a final F0 fall and lower intensity cued
Apology) playing smaller but still significant roles. The baseline
token from which the tokens were created also influenced listeners’ responses in
the expected direction, indicating that dimensions other than those manipulated also play a role. Not only were all manipulated dimensions
significant, but they also explained a fairly large amount of the variability:
tokens with short duration, rising final tone, and high intensity (collapsing
over all baseline tokens) elicited 15% Apology response, whereas those
tokens with long duration, falling final tone, and low intensity elicited 73%
Apology response.

While listeners’ use of duration reflected the distribution of durations in the
production dataset, two of the other cues (intonation contour and intensity)
were found to influence listeners’ judgments, despite the fact that there was no
significant difference in the values of these two dimensions across the two
Sorry Types in production. On the surface, this may appear to be a case of
acoustic cues being used in perception even when they are not informative in
production (e.g., Warner et
al., 2004). However, we think it is more plausible that these cues
are indeed informative in the input, but that our production study simply did
not have enough power, or was not representative enough, to reflect the true
input. Our sample size was quite small, and given the usually very small size of
subphonemic differences, we expect that a larger sample size would be necessary
to detect an effect. Furthermore, we used a nonrepresentative sample of talkers
(voice actors) and a non-naturalistic production setting (asking actors to read
a script), such that it is difficult to assess the extent to which this reflects
the distribution of productions that listeners would hear in real life. Further
studies using corpus work, or more naturalistic production experiments with
larger, more representative samples of speakers, are necessary to answer
questions about the relationship between the distribution of cues in the input
and listeners’ use of these cues in perception.
5.2 Use of individual cues
In the following sections, we discuss the findings for each acoustic dimension in
the context of previous work, and we lay out potential explanations for their
use.
5.2.1 Duration
Given the many factors shown to influence duration, it is perhaps
unsurprising to find a durational difference between the two forms in
production, with Apology tokens being on average 126 ms longer than
ExcuseMe tokens. We do not have independent lexical statistics
for the two meanings of “sorry,” so we did not have specific predictions
about durational differences based on factors shown to affect duration in
previous work (e.g., frequency: Gahl, 2008; Lohmann, 2018b; or contextual
predictability: Seyfarth, 2014). Given that these effects are well established,
we expect that they could contribute to the difference found in these
productions. Systematic differences in prosodic phrasing may provide another
explanation for the difference: as pointed out by a reviewer, all of our
Apology tokens were followed by an exclamation point, while
ExcuseMe tokens were followed by a comma. This may have made it
more likely for speakers to produce the apology tokens as an independent
intonational phrase, and therefore lengthened. While this was not an
intentional element of our design, we speculate that this difference
reflects the typical prosodic environments of these meanings in real-world
discourse, and that these results are not simply an artifact of punctuation.
The perception results showing listeners’ use of duration are consistent
with the idea that these prosodic regularities are linked to each sense,
either via a direct link between each sense and its typical prosodic
realization, or via an indirect link in which each sense is linked to an
abstract prosodic context.

We also found systematic durational variability within each meaning as a
function of situational intensity. The fact that higher-intensity situations
(with identical utterances) elicited longer durations provides evidence that
factors unrelated to lexical statistics or prosodic structure must play a
role, and it suggests that emotion/affect exerts an independent influence on
production.
5.2.2 Pitch contour
There were no statistically significant differences between the pitch
contours of the two Sorry Types in production, but this lack of result needs
to be interpreted with particular caution for several reasons. First, our
analyses were based on perceptual judgments using the ToBI coding system.
These judgments can differ substantially across coders, even in “best-case”
scenarios with highly trained coders and clear speech (e.g., Syrdal & McGory,
2000). We reported only tokens that were agreed upon by our two
coders, which reduces the amount of data (although the results were the same
when using the full dataset of each coder individually). Second, the
different speakers in our study showed different patterns, with three of the
four speakers showing more falling tones for Apology productions,
and the other showing the opposite. These speaker-specific production
strategies may have obscured the group-level results, and point to the need
to look at a wider range of speakers, as well as the specific phonetic
realizations of the production patterns, to get an accurate view of how this
difference is manifested in production.

While it is not possible to make strong generalizations about how this
contrast is realized in production, we did see a clear influence of
intonation contour in both perception experiments, with final falling tone
eliciting more Apology responses. We again consider potential
sources of this difference. Intonation contour has been found to be an
acoustic correlate for different emotions: for example, Pereira and Watson
(1998) found that falling contours are acoustic correlates of sad
utterances. On a linguistic level, F0 as a source of subphonemic variation
has been less well studied than duration, but Tang and Shaw (2020) found F0
differences paralleling the contextual-predictability effects on duration
discussed above: items that tend to be more predictable have lower F0 peaks.
While it is possible that this factor is at play, this is a different
question than categorically different intonation contours, which is what we
were testing. We think the most likely explanation is that there may be
systematic differences in the contour used for Apology vs.
ExcuseMe tokens, and this may be due to pragmatic and/or
linguistic context. A corpus or larger-scale production experiment is
necessary to test whether this is the case.

It might seem surprising that the overall F0 level (mean F0) was not found to
play a role in either production or perception, given the fact that it has
been a relatively consistent feature found to vary based on emotion (Banse & Scherer,
1996, among others). Furthermore, we might expect
Apology tokens to be lower in F0, as they presumably tend to be
more submissive and less assertive than attention-seeking
ExcuseMes, based on Ohala’s (1983) frequency code.
However, when there are overall differences in intonation contour, as we saw
in the current set of tokens analyzed in production and used for perception,
the intonation-based variability in F0 would make it difficult to detect a
difference in overall mean F0 even if it existed. Furthermore, we did not
test an independent effect of mean F0 in Experiment 3. More sensitive tests
would be necessary to make strong claims about the absence of an overall F0
effect.
5.2.3 Intensity
Intensity was found to play a small but significant role in both perception
experiments, with lower intensity eliciting more Apology responses,
but no effect was found in production. As with F0, Tang and Shaw (2020) found
decreased intensity corresponding to greater contextual predictability, and
based on the fact that lexical frequency is also expected to result in
overall lower prominence, decreased intensity might also be expected in
lower-frequency forms. At the same time, lower intensity has been shown to
be associated with sadness (Pereira & Watson, 1998) and
lack of assertiveness/dominance (Puts et al., 2007), both of which
may be expected in Apology compared to ExcuseMe
productions. Therefore, as with the other dimensions, both lexical and
non-lexical factors potentially contribute to the current results.
5.3 Implications for models of speech perception and processing
Listeners’ systematic use of multiple phonetic dimensions to differentiate two
polysemous uses of “sorry” (Apology vs. ExcuseMe) indicates
that these two senses must be associated with different representations at some
level of processing. Previous work has presented evidence that the different
senses of polysemes share a semantic representation, whereas the different
(unrelated) meanings of homophones have independent semantic representations
(Rabagliati &
Snedeker, 2013; Rodd et al., 2002). Although our results point to separate
representations for the different senses of the polysemous “sorry,” we do not
think that they are inconsistent with these previous findings. The evidence for
shared senses in Rabagliati
and Snedeker (2013) was for very closely related, “regular” polysemes
(e.g., “chicken” the animal vs. “chicken” the food); in a separate condition
testing less closely related, “irregular” polysemes (e.g., “sheet of
glass” vs. “drinking glass”), results
suggested separately stored meanings. The two meanings of “sorry” used in this
case study fall toward the less-related end of the continuum (see Moldovan, 2019, for
definitions and discussion of the gradient nature of the polysemy–homophony
continuum). Although we are not aware of any previous work along these lines,
this brings up the interesting possibility that evidence of listeners’
systematic phonetic differentiation of different senses/meanings could
potentially be used as a supporting diagnostic for separate semantic
representations (taking into account the extent of phonetic differences that
actually exist in the production of a given word-pair).

It is clear from our results that the two senses of “sorry” must be independent
at some level of representation. However, this study cannot provide a definitive
answer to how sense-specific phonetic information is represented and linked to
the two meanings. The fact that listeners differentiate the two senses based on
acoustic information could be explained straightforwardly by each sense having
an independently specified phonetic representation (e.g., a separate phonetic
prototype or exemplars associated with each sense). Under this view, listeners’
choice of Apology vs. ExcuseMe would be determined via direct
comparison with the phonetic representations of each sense (e.g., “This
production had a long duration, and Apology tokens are characterized by
longer duration”). However, another possibility is that the two senses could
also be linked to a single, shared phonetic representation. Under this model,
each sense could also be associated with distinct “contexts,” which are in turn
associated with systematic phonetic properties, as discussed in the
Introduction. When distinguishing between the two senses, the phonetic
information would inform listeners’ judgment about the context, which would then
inform their choice of category (e.g., “This production had a long duration,
which means it’s likely to have been produced in a single prosodic phrase, and
Apology tokens are more likely to be produced in a single prosodic
phrase”).

The finding that listeners use a given dimension is consistent with either of
these scenarios. More broadly, an accurate model of speech perception likely
includes both components as a means to represent pronunciation variation. As
discussed in Pierrehumbert
(2016), there is evidence that (at least some) subphonemic detail
must be able to be stored independently for (at least some) lexical
representations. However, more abstract knowledge of phonetic patterns also
influences lexical decisions. For example, Shatzman and McQueen (2006) found that
listeners used duration information to distinguish short versus embedded novel
words (e.g., “bap” vs. a longer novel word containing “bap”) just as they did for real
words, even though durational information across the two syllables was identical
during an exposure phase. If the representations of these new words were simply
made up of acoustic exemplars, duration would not be an informative cue for
listeners, and the fact that listeners were using it indicates that listeners
were drawing on more general or abstract information. The fact that listeners
are sensitive to phonetic properties of emotion or affect, even in non-words,
provides further support for use of non-lexically-specific phonetic detail.
Taken together, it appears that the appropriate question is not whether lexical
representations are purely abstract or fully phonetically specified, but rather
how word-specific and abstract components are integrated during perception and
processing (see Ernestus,
2014, for discussion).
5.4 Lessons from a case study
As discussed in the sections above, there are several plausible explanations for
the effects found in this work. It is likely that differences in prosodic
phrasing play a large role across several of the dimensions examined here; this
was not one of our primary considerations when designing the study, but its
importance was highlighted by reviewers of an initial version of this work. For
example, an Apology may be more likely to be produced in a single
intonational phrase than an ExcuseMe, affecting both duration and F0,
and/or the two senses might be associated with different intonation contours. In
either case, listeners’ use of these dimensions could then be explained by a
model of perception which incorporates prosodic knowledge as part of the process
of lexical competition (e.g., a “Prosody Analyzer,” as proposed by Cho et al., 2007), or
by independently specified intonation contours for each sense (e.g., Calhoun & Schweitzer,
2012; Tang &
Shaw, 2020). Emotion or affect could also be a contributing factor:
tokens identified as Apology by listeners were characterized by
features found to correspond to sadness (falling F0 contour, lower intensity),
and the fact that the situational intensity exerted an independent influence on
duration suggests a role for emotion/affect (though it is not necessarily
sadness per se). Finally, although difficult to evaluate given the lack of frequency measures for the two different senses, frequency also likely plays a role, given the findings of previous work (e.g., Lohmann, 2018b). We
think it most likely that our results indicate listeners’ tracking and use of
the systematic variation conditioned by both linguistic and affective factors.
However, in the presence of multiple possible interpretations, it is not
possible to decide between them in a case study using a single word. Instead, the relative roles of these factors need to be investigated separately in studies with multiple words.

The effects found in this case study were quite large in comparison to previous
work. For example, consider the 126 ms average difference between
Apology and ExcuseMe tokens in the production study,
compared to a vowel duration difference of 3.5 or 10 ms for differences based on
underlying voicing in Dutch and German (Port & O’Dell, 1985; Warner et al., 2004),
or average differences of 15 ms and 21 ms between the high- versus low-frequency homophones analyzed in Gahl (2008) and the noun–verb homophone pairs in Lohmann (2018b). Listeners’ accuracy in
correctly identifying the intended sense for naturally produced tokens was also
quite high compared to very small effects in previous work (just over 50% in
Sanker, 2019),
and in our manipulated task, a large amount of the variance was explained by our
manipulations, as opposed to very weak correlations between acoustic features
and listeners’ responses in previous work (Sanker, 2019; Drager, 2010).

There are multiple possible reasons for the relatively large effects found here.
First, the studies about durational differences cited above were examining
homophonous contrasts which differed along scalar variables (e.g., frequency,
contextual predictability) or which were actually expected to be phonologically
neutralized. In contrast, the senses of “sorry” used in the current study are
commonly used in social routines. Their frequency and the social context of
their use may form the basis of a robust contrast in production, and/or greater
sensitivity to these differences by listeners; indeed, these are the types of
words proposed by Calhoun and
Schweitzer (2012) to be the most likely to have lexicalized or
idiomatic intonational contours. If so, this would account for the particularly
strong listener judgments. Second, as discussed at the beginning of this
section, there are multiple factors, including prosodic regularities,
emotion/affect, and lexical statistics, that could potentially play additive
roles. Determining the relative roles of each of these factors is only possible
through larger-scale studies examining multiple words; however, the strength of
the effects in the current work highlights the importance of examining
word-specific effects in these larger-scale studies. This is crucial because a
particularly robust contrast for a single word could skew group-level effects,
but also because examining the properties of words which have stronger/weaker
differences could shed light on which factors are most important.

Finally, our findings from Experiment 3 show an independent contribution of
different acoustic dimensions, and the pattern of responses to the five-step
duration continuum indicates a gradient use of this cue (the other dimensions
only had two levels, so we are unable to make observations about gradience).
This suggests that dimensions are stored and tracked independently. The question
of which dimensions are tracked, and the details of their use, is another factor
that needs to be considered in models of speech perception.
5.5 Limitations and future directions
This case study of a single word pair allows for simultaneous investigation of
listeners’ use of multiple acoustic dimensions, and for a comparison of
perception of naturally produced and manipulated tokens. The results highlight
the fact that listeners rely on several acoustic dimensions to inform their
perception of these words, and while the design of this study does not allow for
a definitive answer about why these cues are used, it is plausible that the
regularities underlying the use of these cues stem from both lemma-specific and
general contextual factors.

Any affective/emotive elements of the utterances produced in our production study
are simulated, since they are based on voice actors reading scripted productions. We assume, following Banse and Scherer (1996)
and many other researchers, that these utterances contain properties of
“real-life” speech, and we believe this assumption is supported by the finding that listeners did indeed identify the intended meanings at above-chance levels. Nevertheless, as with any laboratory-based study, we do
expect that there will be systematic differences between lab-based and
spontaneous productions. In this case, we might expect phonetic distinctions to
be exaggerated, because of the read speech, because of the use of voice actors,
and because the word of interest was likely clear to them. In the future, corpus
work could be used to test to what extent the patterns found here are reflected
in naturalistic speech.

As with any case study, this is a starting point for formulating and testing more
general hypotheses about models of speech processing and cue use. One question,
in terms of the perception–production interface, is how faithfully cue use
mirrors input distributions. This has been examined quite frequently in terms of
perceptual cue-weighting for phonetic categorization (see Schertz & Clare, 2020, for
discussion), the question being to what extent the relative informativity of
various cues in distinguishing two members of a phonetic contrast predicts
listeners’ relative reliance on these cues in perception. Expanding the scope of
this inquiry to lemma-level lexical contrasts can help provide information about
the level on which listeners are computing input statistics, and how these
statistics are used. A second, related question is which factors underlie the
use of these cues. As discussed above, in the perception of a given word
meaning, listeners likely draw on both lemma-specific knowledge (i.e.,
information encoded in the lexical entry) and their knowledge of the contexts in
which the word meaning is likely to occur, as well as the phonetic regularities
associated with those contexts. The range of potential “contexts” is vast,
encompassing syntactic, prosodic, discourse, and emotional domains. The extent
to which listeners track statistics along these dimensions, and whether they do
it in similar ways for different contexts, has, to our knowledge, been largely
unexplored.

Moving forward in these areas will require work on two methodological fronts:
first, corpus work and/or larger-scale elicitation-based production studies
should be done to get a better idea of the distribution of cues in the input as
a function of a broader range of factors. Second, perception patterns should be explored in controlled experimental work using methodologies comparable to previous work, to tease apart and quantify the relative roles of word-specific
versus abstract information, considering a broader range of contexts over which
listeners might generalize (e.g., emotion/affect). While these are broad and
complex questions, answering them will help build accurate and computationally
viable models of how information is stored and used.