Literature DB >> 35238259

The Effects of Uncertainty in Level on Speech-on-Speech Masking.

Andrew J Byrne¹, Christopher Conroy¹, Gerald Kidd^1,2.

Abstract

Identification of speech from a "target" talker was measured in a speech-on-speech masking task with two simultaneous "masker" talkers. The overall level of each talker was either fixed or randomized throughout each stimulus presentation to investigate the effectiveness of level as a cue for segregating competing talkers and attending to the target. Experimental manipulations included varying the level difference between talkers and imposing three types of target level uncertainty: 1) fixed target level across trials, 2) random target level across trials, or 3) random target levels on a word-by-word basis within a trial. When the target level was predictable performance was better than corresponding conditions when the target level was uncertain. Masker confusions were consistent with a high degree of informational masking (IM). Furthermore, evidence was found for "tuning" in level and a level "release" from IM. These findings suggest that conforming to listener expectation about relative level, in addition to cues signaling talker identity, facilitates segregation of, and maintaining focus of attention on, a specific talker in multiple-talker communication situations.

Entities: Chemical

Keywords: cocktail party; competing talkers; informational masking; speech recognition

Mesh：

Year: 2022 PMID： 35238259 PMCID： PMC8902181 DOI： 10.1177/23312165221077555

Source DB: PubMed Journal: Trends Hear ISSN： 2331-2165 Impact factor: 3.293

Introduction

The difficulty experienced by a human listener attempting to focus attention on the voice of one particular talker (the "target") and recognize the message conveyed by the target talker's speech in the presence of multiple competing talkers ("maskers") is known as the “cocktail party problem” (CPP; see Middlebrooks et al., 2017, for a series of recent reviews). In typical multiple-talker communication situations, the listener has many different cues that may assist with solving the CPP by enhancing the perceptual segregation of sound sources and aiding in the focus of attention on the target while ignoring competing masker speech or other unwanted sources of environmental "noise." These cues include both auditory and visual information (e.g., lip reading, Woodhouse et al., 2009), as well as higher-level linguistic factors such as syntactic and semantic context (e.g., Brouwer et al., 2012; Kidd et al., 2014; review in Mattys et al., 2012). Spatial separation of sound sources may also afford a substantial improvement in speech recognition performance in multiple-source sound fields compared to conditions where competing sources appear to emanate from the same spatial location (i.e., "colocated" with the target source; e.g., Best et al., 2013a, 2013b; Hawley et al., 2004; Marrone et al., 2008). Furthermore, multiple factors may combine to facilitate the maintenance of attentional focus on the target speech stream under masked conditions (e.g., Rennies et al., 2019; Rodriguez et al., 2021; Swaminathan et al., 2015). Because natural speech is inherently a dynamic stimulus, following one specific talker among competing talkers involves using the patterns of variation of the target talker's voice to anticipate that talker's speech. Prosodic information, which involves the plausible variation in vocal pitch, intensity, and timing, may be particularly useful in that regard (e.g., Calandruccio et al., 2019; Kim & Sumner, 2017; Zekveld et al., 2014). Of the cues that comprise prosody, variation in level has received less attention as a potential cue in CPP communication situations than other factors, such as pitch and intonation. This is due in part to the obvious strength of vocal pitch as a cue: pitch may signal the sex/age of the talker (Hazan & Markham, 2004), indicate declarative/interrogative statements, and aid in talker identification. Furthermore, because speech is intelligible over a wide range of absolute levels, once it is above detection threshold it could be assumed that level differences play a secondary, even insignificant, role in maintaining speech stream integrity because each individual unit (i.e., word) is fully audible/identifiable. Although the effect of talker intensity or loudness (i.e., talker "level"), has been a variable in the aforementioned work, it most often is used either for measuring psychometric functions for speech recognition, or for quantifying performance when level is adapted to measure a speech reception threshold. In either case, the strength of the relative intensity of the target as a cue in multiple-source, CPP communication situations either is implicit or is easily explained by energetic masking (EM). That is, when the target speech is more intense than the masker(s), its spectral energy dominates the stimulus representation in the auditory system, presumably causing it to be more salient perceptually and easier to segregate and understand than when it is less intense than the masker(s). Furthermore, the relative levels of target and masker speech produce varying "glimpses" of target energy that must be integrated by the listener to accurately identify the message conveyed by the target (e.g., Brungart et al., 2006; Cooke, 2006). Broadly speaking, as the relative intensity of the target increases the extent to which glimpses of target information are accessible tends to increase, allowing it to be segregated from competitors and perceived as perceptually distinct. There is evidence, however, that recognition is not always a monotonic function of overall level. For instance, with colocated speech-on-speech (SOS) masking, well-trained observers can produce psychometric functions with a relative improvement of recognition on either side of a 0-dB target-to-masker ratio (TMR; e.g., Brungart, 2001; Dirks & Bower, 1969; MacPherson & Akeroyd, 2014). One explanation for this distinct dip in performance of the psychometric function is that the target and masker words are more easily confused when there is no consistent relative level cue available between the speech streams. This does not necessarily reflect a failure to perceive specific words, but rather indicates a failure to focus attention on the correct talker while ignoring others, a characteristic of informational masking (IM; see discussion in Kidd & Colburn, 2017). Brungart (2001) found that the amount of IM could be varied with different types of maskers, and that psychometric functions were the most orderly (monotonic) for low-IM maskers. Conversely, under high-IM conditions the psychometric functions could flatten or even become “U” shaped. When the target and the masker were the same talker (the configuration with the greatest IM), level became the only cue to distinguish the target speech stream, and there was convincing evidence that a listener could effectively attend to the less intense of two talkers. This improvement in speech recognition at negative TMR values suggests that a listener may actively focus attention on a lower-level speech stream, despite the presumably greater EM that it creates, compared to higher TMRs. However, evidence in support of level as an effective segregation cue separate from its relation to EM is quite limited. Given that relative level differences between a target and masker(s) may exert an influence on the success of selective attention to the target speech stream under conditions high in IM, the present study aimed to evaluate the effectiveness of level as a cue for speech segregation more directly. Specifically, we pose the question of whether a known target level (i.e., low target level uncertainty) can be used to focus attention at that level, ignoring other voices which are either louder or softer than expected. Conversely, any benefit that might be obtained from attentional focus on level could be mitigated by listener uncertainty about where in level attention should be focused. The present study investigated target speech identification in the presence of competing speech maskers with various amounts of level separation between the talkers, as well as with varying degrees of talker level uncertainty, for sentences having identical syntactic and semantic structure. Moreover, the use of a matrix identification task in which both the target and masker sentences comprised individually concatenated words (see Kidd et al., 2008) mitigated the influence of prosodic information so that the effectiveness of level as a cue could be ascertained more directly. It is hypothesized that a priori knowledge and consistency of talker level will result in better speech identification than when level cues are uncertain. Although there has been little evidence to date supporting the proposition that selective attention in level produces a "tuned" response (i.e., enhanced performance at the point of attentional focus along the given stimulus dimension) in SOS masking, past studies have not examined the issue under different degrees of stimulus uncertainty. Tuning at the focus of attention can be most pronounced when the conditions produce a high degree of uncertainty in the listener, and this idea forms a part of the rationale for the experimental design in the work that follows.

Methods

Subjects

Eleven listeners participated in the experiment (two male and nine female, 20–40 years old, mean age: 25.4), two of which were the first and second authors (designated as S1 and S2, respectively, in a later figure), while the others were students from Boston University who were paid to participate. All participants had normal hearing based on pure-tone thresholds of 20 dB hearing level (HL) or better at octave frequencies from 250 to 8000 Hz. (Listener audiograms were measured in the lab prior to online testing.) Each participant reported U.S. English as their first and primary language, and all had previous experience with other psychoacoustics tasks, including the speech corpus and response method described below. All procedures were approved by the institutional review board of Boston University, and all participants gave written informed consent.

Stimuli and Procedure

The speech identification task used was an established matrix identification design and speech corpus (e.g., Helfer et al., 2020; Holmes et al., 2021; Kidd et al., 2008, 2020; Puschmann et al., 2019) and consisted of three simultaneous co-located talkers. Each talker spoke (artificially constructed) five-word sentences, one word from each of the five syntactic categories of name, verb, number, adjective, and object (in that fixed order), e.g., “Sue took ten small toys”. Natural production of eight exemplar words in each syntactic category of the corpus had been previously recorded from twelve different female talkers (refer to Kidd et al., 2008). Non-individualized head-related transfer functions (HRTFs; Gardner & Martin, 1994) were used to spatialize all speech to be at 0° azimuth and elevation for binaural presentation over headphones. On each experimental trial, one of the twelve talkers was randomly selected to be the target talker, and the first word of the target sentence was always the name "Sue," which served to cue the listener to the target voice. The other four words of the target sentence were randomly selected from the eight exemplars available in each of the remaining four syntactic categories. Two different masker talkers were then randomly selected from the remaining talkers, and two different masker sentences were constructed from additional random selection of words (without replacement) from each category. The sentences were constructed such that the first word of each talker began simultaneously; however, all following words were concatenated without any attempt at time-aligning the words within each category across talkers. This design was chosen to accentuate the “multimasker penalty” (Iyer et al., 2010) with not only a simultaneous multi-talker masker, but also with all maskers being contextually relevant; therefore, increasing the potential for effects (both improvements in performance as well as deficits) to occur. The task of the listener was to focus on, and respond to, only the words spoken by the target talker on each trial. These responses were obtained by the listener clicking on the target words that were heard, as shown on a graphical user interface (GUI) displaying the matrix of possible words in syntactic columns and exemplar rows. The cue word “Sue” was always pre-selected, and the listener was required to make a selection in each of the other categories in the syntactic order that they were presented. Correct answer feedback (displaying and highlighting the correct words) was provided only after all four responses for that trial were selected. The controlled variable across conditions was the level of the target and maskers, as defined by the root-mean-square of the waveform of each individual word. Due to remote testing (described in the next section), absolute values of sound pressure level (SPL) are not specified; instead, the relative level of the target to the masker is given. However, as an approximate reference based on the remote calibration routine, a relative level of 0 dB roughly corresponds to 60 dB SPL. For the "Fixed Level" (FL) conditions, the target and each masker were presented at constant levels within each trial and throughout each block of 12 trials. Across blocks, the level difference (ΔL) between the three talkers varied from 0 to 9 dB, in 3 dB steps, but that spacing in level was fixed for each block of trials. One talker was always at 0 dB, while the other two talkers were at + /-ΔL. Note that the ΔL value refers to adjacent talker levels and thus the total range of levels is twice the value (e.g., a level spacing of 9 dB between adjacent talkers equals an 18-dB range between loudest and softest talkers). The level of the target will be designated as “Loud”, “Mid”, or “Soft” relative to the other talkers, given levels of + ΔL, 0, or -ΔL, respectively. Prior to each block, the designation "Fixed Level", followed by the relative target level designation, was displayed on the GUI, so that the listener could try to focus attention at the appropriate intensity. Figure 1 (upper row) illustrates the FL condition for a “loud” target level.

Figure 1.

Illustrations of two possible trials (each column) for the Fixed Level (FL; top panels), Random Sentence Level (RSL; middle panels), and Random Word Level (RWL; bottom panels) conditions. Overall level difference by word (y-axis) is shown as a function of time along the x-axis, with different colors/fonts depicting the different simultaneous talkers. The target sentence always begins with the name "Sue”. In another set of conditions, the “Random Sentence Level” (RSL) conditions, ΔL was fixed for a block of trials, but the level of the target relative to the maskers was varied. Level was randomized from trial to trial, but held constant within a trial, and there was a new random permutation order chosen for each block, such that there were always four targets of each relative level in each block. Prior to each block, the designation "Random Sentence Level" was displayed on the GUI, so that the listener would be informed of the inconsistent target level across trials and instead would be required to use the cue word “Sue” to designate the target speech stream. An illustration of the RSL condition is shown in the middle panels of Figure 1. In contrast to the FL condition (upper panels), the target level is uncertain from trial to trial but is consistent within a trial. The final set of conditions involved changes in level across words within a sentence: the “Random Word Level” (RWL) condition. The level of each word was randomly selected from the set of three relative levels for that block, with the stipulation that no two consecutive target words were ever at the same level, and no sentence contained more than three target words at a given level. Given the randomization strategy utilized, there were 24 different level variations possible for the target sentence on each trial, with 75% of those variations including a subsequent target word being presented at the cue level. For both of the other (non-cue) levels, there was only a 62.5% chance of a variation including a double presentation at the given level. This resulted in the level of the “average” target sentence (across all possible variations) being shifted slightly towards that of the cue. An illustration of a sample trial of the RWL condition is shown in the lower panels of Figure 1. (Note that the jumbled/overlapping appearance of the words illustrated for this condition is simply due to their display after sorting by level; the words in all three conditions shown in Figure 1 temporally overlap in roughly the same way.) In addition, as with the RSL condition, the level of the cue word for the target sentence was randomized on each trial such that there were equal numbers of trials starting with the cue word at the Loud, Mid, and Soft levels. Again, the listener was informed prior to the experimental block with the label “Random Word Level” and that, for these conditions, the only reasonable strategy would be to listen for the cue word and attempt to follow the vocal characteristics of the target talker while ignoring intensity fluctuations. (The difference between all three types of conditions was emphasized in the instructions to the listeners prior to data collection.) The full experiment consisted of 112 blocks, with four blocks of each of the four FL conditions (four different ΔL's), and 16 blocks of each RSL and RWL condition at each of the three ΔL possibilities. A one-quarter complete set of conditions (28 blocks) was obtained before the next repetition, and the condition order within each set was randomized. Listeners ran the experiment self-paced, typically with 1-h sessions (including brief breaks), and completed the full experiment in 5–6 hours total.

Remote Testing Considerations

All data included in the present study were gathered remotely. Thus, listeners were allowed to perform the task at home using their own computers and headphones. However, prior to participating in the experiment, each participant was required to sign an online “Attestation of Low-Noise Distraction-Free Testing” form, agreeing to do the experiment under the quietest and most distraction-free testing environment that they could create at home. A MATLAB (MathWorks, Natick, MA) experiment was built and compiled as a “Web App” and run on a Microsoft (Redmond, WA) Windows 2019 Virtual Machine (VM) with MATLAB Web App Server installed, which was open to the Internet. The GUI was accessed and displayed within an Internet browser window. Audio presentation consisted of a trial-by-trial uncompressed WAV file (44100-Hz sampling rate) generated by the application on the VM and presented through the computer audio hardware of the listener. Each listener used a different model of personal headphone, with both standard and earbud, as well as wired and wireless, headphones being used. The headphones used were: Apple (Cupertino, CA) AirPods Pro (three listeners), Apple EarPods (two listeners), Beats Electronics (Culver City, CA) Flex, Jabra GN (Copenhagen, Denmark) Elite 65t, Samsung (Seoul, South Korea) Level On, Sennheiser HD300, Sennheiser HD380 Pro, and Sony (Tokyo, Japan) MDR-V6. The exact hardware configurations (e.g., sound cards, audio enhancement settings, wireless compression protocols) of each remote setup were unknown. Calibration for remote testing was done prior to the start of each experimental session. The listener was presented with sample speech (from the experimental stimuli) over a 30-dB range. The audio samples were labeled: “loud voice (75 dB)”, “normal conversation (60 dB)”, and “quiet voice (45 dB)”. The listener then adjusted the overall computer volume until the speech subjectively matched the given labels. By no means should it be assumed that each listener was able to set “60 dB” speech to an overall level of 60 dB SPL, but by setting the softest speech to be both audible and understandable, and the loudest speech to be still “comfortable”, the dynamic range of the stimuli presented in the present task (18 dB, with 0-dB ΔL calibrated as 60 dB “SPL”) was considered to be within an acceptable range of values for testing relative level effects in this speech identification paradigm.

Results

Speech identification performance was defined as the proportion of words that the listener identified correctly in the final four words of the target sentence (i.e., excluding the initial target cue word “Sue”). Broadly, and as expected, performance varied with changes in the level of the target words and as the level difference between the target and maskers was altered. Furthermore, in support of the underlying premise of the study, performance also depended on the degree of uncertainty with respect to the target level. The mean results across the eleven listeners are shown in Figure 2. Each plotted point represents the average proportion correct of 48 trials for each listener, with the group means across listeners and standard error of the means (SEM) plotted in the figure. A 3-by-3 repeated-measures analysis of variance (ANOVA) performed on the FL data (not including the 0-dB ΔL reference point) revealed that both cue level (Loud, Mid, Soft) [F(2,20) = 115.12, p < 0.001] and level difference (3, 6, and 9 dB) [F(2,20) = 49.04, p < 0.001] were significant main effects, as was the interaction of the two factors [F(2,40) = 18.22, p < 0.001]. For the FL conditions (top panel), performance improved from 0.53 with no level difference across the three talkers (which is the reference in each panel for gauging the effects of relative level) to nearly perfect performance with a 9-dB separation for the Loud Cue condition (solid squares). This result would be expected simply due to the increased salience of the more intense target voice over the masker voices (cf., Brungart, 2001). For both the Mid and Low Cue conditions (shaded and open squares, respectively), group mean performance was relatively constant and near the value obtained for 0 dB (i.e., equal talker level) across the range of talker level differences, with some separation between Mid and Soft values apparent for the 9 dB level difference.

Figure 2.

Group mean speech identification results for the three different amounts of target talker level uncertainty, shown in separate panels. The level difference between the three simultaneous talkers forms the x-axis in each panel, while the y-axis is the mean proportion correct performance, with SEM error bars. Within each panel, the separate functions indicate the conditions where the target cue word was at the Loud, Mid, or Soft relative levels (solid, shaded, and open symbols, respectively). For the Random Word Level condition, each function contains target words from all levels with the percent correct values computed based on the cue word level of each trial (see text). The dashed horizontal line represents chance performance. It should be noted that a purely EM-based explanation of the results would not seem to be compatible with the trends in performance found for the Soft cue condition. In that case, performance remained constant as both the target voice was made less intense and one of the maskers was made more intense. In order to evaluate this impression quantitatively, a glimpsing analysis based on "ideal time-frequency segregation" (e.g., Best et al., 2017; Brungart et al., 2006; Conroy et al., 2020; Kidd et al., 2016) was performed on these stimuli and is described more fully in the Appendix. The analysis confirmed that this finding could not be attributed to changes in EM (which is increasing, while performance was stable) and was, instead, evidence of a release from IM, presumably from level cues facilitating attentional focus on the target. The results for the RSL condition (bottom left panel) were qualitatively quite similar to those from the FL condition and the trends were roughly the same. A 3-by-3 repeated-measures ANOVA for the RSL condition yielded significant main effects of both cue level [F(2,20) = 150.50, p < 0.001] and level difference [F(2,20) = 13.55, p < 0.001], as well as a significant interaction [F(4,40) = 15.56, p < 0.001] of the two factors. The similarity of the findings for FL and RSL conditions suggested that a constant level for each talker throughout the trial interval was generally sufficient for the listener to focus on the target speech. The three functions for the RWL condition were not that different from each other and all showed a moderate decrease in performance of roughly 0.12 proportion correct relative to the reference condition as the level difference was increased. A 3-by-3 repeated-measures ANOVA for the RWL conditions revealed that the main effect of cue level was significant [F(2,20) = 4.91, p = 0.002], as well as level difference [F(2,20) = 20.80, p < 0.001], but with no significant interaction [F(4,40) = 0.36, p = 0.97]. Recall that the designation of the functions as Loud, Mid, or Soft for this condition refers only to the cue word, given that the levels of the remaining target words were chosen pseudo-randomly (see the description of the RWL condition in the Methods above). This explains the similar performance shown for each function. If the salience of the cue word, as implied by its relative level, facilitated extracting the subsequent target words then we would have expected some degree of ordering of the functions from Loud to Soft. Although the Soft Cue function does appear slightly worse than the others, simply providing a more intense cue word was not sufficient to compensate for the level variation of the subsequent words in the target sentence. To examine the role of target level uncertainty explicitly, the group mean data for the FL, RSL, and RWL conditions were transformed and plotted as psychometric functions in Figure 3. The values along the x-axis are the relative levels of the individual target words while the values on the y-axis are the proportion correct performance. Note that, for the FL and RSL conditions, the relative levels of the target words were the same as the cue word level (as illustrated in Figure 1). However, for the RWL condition, the relative levels of the target words corresponded to the actual levels of the words independent of the level of the cue word. Referring back to the lower panels of Figure 1, to compute proportion correct for the + ΔL word level for the RWL condition, identification performance for only the highest-level words (upper row) from both of the two trials illustrated would be counted, even though the target cue level was different across trials. The same is true across all trials and for all cue word levels. Thus, the proportion correct represents a word-by-word computation of proportion correct according to target word level regardless of the levels of the cue or the other words in the sentence.

Figure 3.

Group mean speech identification performance, as a function of the relative level of the individual target words. The psychometric functions shown are for the three types of level uncertainty: the Fixed Level, Random Sentence Level, and Random Word Level conditions, which are plotted together in each panel (squares, diamonds, and triangles, respectively). The error bars represent the SEM. A 3-by-7 repeated-measures ANOVA revealed that both the degree of uncertainty [F(2,20) = 189.73, p < 0.001] and the relative level [F(6,60) = 106.88, p < 0.001] were significant main effects, as was the interaction between the two factors [F(12,120) = 53.85, p < 0.001]. [A post hoc statistical power analysis was performed using G*Power (Faul et al., 2007) indicated that the N = 11 used in the present study was sufficiently powered (greater than 0.95, with α = 0.05) to rule out the null hypothesis that the degree of uncertainty (i.e., the main effect between conditions) was not significantly different.] Performance for the FL (blue square) condition improved steadily as the relative level cue was increased above the 0-dB reference but was roughly constant below 0 dB. (Refer to the Appendix for the pattern expected from only energetic masking.) This flattening of the psychometric function below the mid-point was likely due to the ability of listeners to focus on the softest voice during the FL blocks. A similar effect was shown in some of the studies reviewed by MacPherson and Akeroyd (2014) and suggested that prior knowledge of target level can improve speech identification under conditions high in IM. For the RSL condition, the function was quite similar to the FL function above 0 dB, but declined somewhat more below 0 dB. (Post hoc t-tests, with the Bonferroni correction applied, confirmed that the FL function was significantly greater than RSL function at both the −6 and −9 relative levels: −6 dB, t(10) = 3.30, p = 0.004, −9 dB, t(10) = 5.79, p < 0.001.) This suggested that, as a group, listeners were less able to use the softer target voice as a cue than was the case in the FL condition. Here, apparently, the reduced variability across trials in the FL condition yielded an additional benefit for focusing on the target stream. For the RWL condition, performance varied less with relative level and, below 0 dB, declined with increasing level difference to 0.33 proportion correct for the largest level difference (but still substantially above chance performance). To highlight the individual differences found, and given how common across-subject variability can be in IM tasks (see Lutfi et al., 2021), the individual listener data are plotted in Figure 4 in the same manner as Figure 3. Post hoc between-subjects effects were found to be significant: main effect of listener [F(10,45) = 12.67, p < 0.001], degree of uncertainty by listener interaction [F(20,120) = 2.97, p < 0.001], and relative level by listener interaction [F(60,120) = 3.41, p < 0.001]. Although the general trends were broadly similar across subjects, there were a few noteworthy differences. Some subjects (e.g., S1 and S5) showed an improvement at the lower relative levels for the FL condition over the RSL and RWL conditions with the psychometric function appearing flat or even "U"-shaped. In contrast, the performance of other subjects (e.g., S3 and S4) at the low levels continued to decline steadily with decreasing relative level, thus creating more typical monotonic psychometric functions. These individual differences may suggest an inability of some listeners to focus attention on only the softest talker (when explicitly told to), i.e., attentional tuning in level. For the RSL case, the function at lower levels was generally diminished relative to the FL case for most listeners, indicating that the greater uncertainty about the target level on a trial-by-trial basis affected the ability to focus on the softer talker during the trial. There was less variability across subjects in the shapes of the psychometric functions for the RWL condition where all listeners produced shallow, somewhat convex psychometric functions indicating little benefit, and indeed occasionally some cost, for higher or lower relative levels for the target words.

Figure 4.

Speech identification performance for each listener (separate panels) is plotted as a function of the relative level of the individual target words, with error bars representing the standard error of the proportions.

Additional Evidence of Tuning in Level

Level Release from Masking

It was also of interest to examine the extent to which focusing attention at the target's value along the physical dimension of level led to a "tuned" response pattern (e.g., analogous to what has been found for the dimensions of frequency/fundamental frequency and spatial location; cf. Arbogast & Kidd, 2000). The Mid Cue functions shown in Figure 5 (replotted from Figure 2) revealed some support for this proposition: There was an increase in performance for the FL and RSL conditions at the largest level difference across talkers, despite the overall level of the target voice being held constant (at 0 dB relative level) across all ΔL's in the Mid Cue condition, while the two masker talkers varied symmetrically in level above and below the target. Post hoc paired-samples t-tests for the FL and RSL functions confirmed that the improvement in performance from the 0-dB ΔL condition was significant at 9-dB ΔL: FL, t(10) = -7.91, p < 0.001; RSL, t(10) = -4.92, p = 0.002.

Figure 5.

Mid Cue functions, replotted from Figure 2, highlighting the increase in performance at the largest level difference for the Fixed Level and Random Sentence Level conditions. The error bars represent the SEM. This effect would be analogous to the improvement in performance seen when increasing the difference between the target and masker along different dimensions using other psychophysical tuning paradigms (e.g., spatial release from masking, Marrone et al., 2008). This pattern could suggest that, with a 9-dB separation in level between talkers, listeners could begin to more easily differentiate all three talkers as separate streams based on relative location along the level dimension; an example of a level “release” from IM. The apparent benefit of a 9-dB level separation was not due to reduced EM, however; a point supported by the glimpsing analysis described in the Appendix. This trend was not seen for the RWL condition, however, as target words were presented at all relative levels and only the cue word was guaranteed to be at the middle level.

Attentional Tuning to the Cue Level

In order to solve either of the random target speech identification (RSL and RWL) conditions, the listener must have recognized the target cue word ("Sue") at the beginning of the sentence and then followed that talker's voice throughout the remainder of the stimulus while ignoring the competing speech streams. The FL and RSL functions discussed above and shown in Figure 5 suggested that, in an attempt to solve the speech identification task, the listeners used the cue word on each trial to form an attentional filter centered at the location (in level) of the cue and reported the words on that trial that were consistent with that level. An additional analysis of the listener's responses relative to the cue level in the RWL condition suggested that the listeners did indeed employ such a strategy, despite level being uninformative for that condition. In Figure 6, the RWL condition was analyzed with each target word in the sentence being compared to the cue word level on a given trial. That is, the target words were sorted according to the difference in level from the cue word and proportion correct performance was computed for each level difference separately. Thus, the x-axis in Figure 6 is the difference in level between subsequent words in the target stream and the target cue word, computed over all RWL trials (a positive value indicates the target word was more intense than the cue word). The top panel plots functions obtained for the three relative level separations (i.e., ΔL's of 3, 6, and 9 dB), each presented in separate blocks of trials, with the heavy black line as a fit to all trials (the mean across functions). The results strongly suggested a tuned pattern of responses for relative level: listeners performed the best (53% correct) when the word was presented at, or very near, the same level as the initial cue word and performance declined (down to 34% and 26%, at + /-18 dB, respectively) as the level was changed from that of the cue. Post hoc paired-samples t-tests comparing the 0-dB level difference to the + /-18 dB difference endpoints confirmed that the decrease in performance was significant: −18 dB, t(10) = 9.26, p < 0.001, + 18 dB, t(10) = 5.63, p < 0.001.

Figure 6.

Group mean performance based on the difference in target word level from the current trial's cue level for the Random Word Level condition. The different functions indicate either the difference in level between the three talkers (top panel) which was fixed for a block of trials, or the cue level (bottom panel) on a given individual trial. The solid black line is the average across all trials, and the error bars represent the SEM. The bottom panel of Figure 6 sorts the same data set by the level of the cue (as “Loud," "Mid," or "Soft”) on each trial to show any effect that the salience of the cue may have had, something which was not clearly evident in Figure 2. With no level difference (the 0-dB points, the only value all three cue levels share), the effect of cue level was quite clear and ordered as would be predicted. For the Loud Cue function (solid symbols), performance simply improved as the target word approached the cue level, likely due to the increased intensity/salience of the target words; however, both the Mid and Soft Cue functions (shaded and open symbols, respectively) were non-monotonic. In both cases, the function does not simply increase with the intensity of the target word, another trend which does not follow an EM pattern (see the Appendix).

Masker Confusions

Analyzing “masker confusions” for matrix identification speech-on-speech tasks can also provide considerable insight into the factors underlying the incorrect responses that listeners make (e.g., Brungart, 2001; Brungart et al., 2001; Iyer et al., 2010; Kidd et al., 2005). High proportions of masker confusions (i.e., selecting words from masker sentences rather than words from the target sentence) is interpreted as evidence for IM, whereas selection of words not present in the stimulus (i.e., random choices) is consistent with performance limited by EM. This is because listeners tend to choose more masker words when the competing masker streams are both intelligible and similar (e.g., same gender talker) to the target, rather than making random word selections (Brungart, 2001). This can be due to weak segregation cues and/or because the masker will frequently intrude into a listener's focus of attention. Thus, masker confusions can provide insights into the types of cues that listeners use to segregate competing talkers and attend to a target under high IM conditions. If listeners rely on level cues, situations in which level differences among talkers are minimal should produce more masker confusions. Moreover, if listeners are attending to a range of levels surrounding the target, masker confusions should be more frequent for maskers that fall within that particular range. An analysis of masker confusions for the current study is presented in Figure 7 according to the relative level of the target cue for all three conditions: FL, RSL and RWL. To obtain the data plotted in Figure 7, incorrect responses were compared to the two masker sentences that were presented on each trial. Incorrect responses that corresponded to the louder (more intense) masker sentence (on a word-by-word basis) are plotted as the (blue) upward triangles, while responses that corresponded to the softer (less intense) masker are plotted as the (red) downward triangles (in this context "louder" and "softer" refer to the masker talkers relative to each other, not relative to the target word level; the 0-dB ΔL condition was not included). An incorrect response that was consistent with neither masker word was labeled as a random choice.

Figure 7.

Group mean masker confusion functions for each type of condition (different panels), with SEM error bars. The proportion of incorrect responses that were either consistent with the louder masker word (blue upward triangles), the softer masker word (red downward triangles), or a random word choice that was not presented (black squares), as a function of the relative level of the target cue word. When the relative cue level was positive for the FL condition, the listeners tended to have a higher proportion of confusions with the more intense masker. (For the largest relative cue level, 9 dB, performance was at or near perfect for all listeners, so there were insufficient incorrect responses for the masker confusion analysis.) Conversely, confusions with the less intense masker were made at negative target cue levels. (Random choice confusions were not substantially different from the horizontal chance line.) The functions for these two types of errors were uniformly rising/falling over the range of target word levels; thus, there was a clear level proximity effect between the masker confusions and the levels of the target cue words. A post hoc 2-by-6 repeated-measures ANOVA performed for only the louder- and softer-level masker confusion functions revealed that there was no main effect of Loud versus Soft confusions [F(1,10) = 0.08, p = 0.78] (this is not unexpected given the symmetry of the functions), but relative cue level was significant [F(5,50) = 4.16, p = 0.03], and there was a significant interaction [F(5,50) = 35.35, p < 0.001]. For the lowest relative level (-9 dB), the less intense masker was in fact selected significantly more often than the more intense masker [t(10) = -12.95, p < 0.001]. This finding may be considered as yet another indication of tuning to level because a listener was more likely to respond to a softer masker after a soft cue, rather than the (likely more salient) louder masker. A similar trend was seen for the RSL condition (middle panel), although the functions were somewhat flatter, possibly due to the listener needing to initially focus on a broader range of levels within a trial, rather than having a priori knowledge of target level before the block began. The post hoc 2-by-7 repeated-measures ANOVA performed for the RSL condition had no significant main effects, but a significant interaction was found [Loud versus Soft: F(1,10) = 0.64, p = 0.44, relative cue level: F(6,60) = 1.64, p = 0.15, interaction F(6,60) = 10.48, p < 0.001]. Similar to the FL results at a −9 dB relative level, the less intense masker of the RSL condition was selected significantly more often than the more intense masker [t(10) = -8.50, p < 0.001]. Finally, for the RWL condition (bottom panel), there was a pronounced bias toward the louder masker during loud cue trials, while at negative relative cue levels, masker confusions were about equal between the two maskers. For the RWL condition, both main effects and the interaction were significant [Loud versus Soft: F(1,10) = 34.00, p < 0.001, relative level: F(6,60) = 4.96, p < 0.001, interaction F(6,60) = 15.53, p < 0.001]. This trend may have been caused by a soft cue being less salient and the listener not always knowing which talker to focus on, and thus arbitrarily choosing a voice to follow. (Note that the masker confusions were sorted according to the level of the target cue only, since the levels of the words following the cue were uncertain.)

DISCUSSION

The present study manipulated the degree of a priori knowledge about the relative level of a target talker in a speech-on-speech masking task. The goal was to evaluate the benefit of certainty about target level over time and the effectiveness of level as a cue in overcoming IM, while controlling the influence of other source segregation cues which are known to affect listener performance in CPP situations. Overall speech identification performance was much better when the level of a talker's voice was held constant over the duration of a sentence compared to when the level changed unpredictably from word to word. This benefit of level certainty is clearly shown in Figure 8, where performance for the FL and RSL conditions is compared to performance for the RWL condition (derived from the psychometric functions from Figure 3 and 4). It is apparent from Figure 8 that increasing certainty about target level exerts a strong effect on listener performance with advantages of 0.5 in proportion correct at high relative levels. Also, it is of interest to note that group mean performance improvement was similar for FL and RSL conditions for relative levels of −3 dB and above, with only a slight FL benefit. At −6 and −9 dB, however, the benefit of the increased certainty associated with the FL condition was significantly greater than for the RSL condition. This implies that the ability to focus on the softer target varies substantially across subjects (as indicated by the error bars) but can be improved when the target level is stable throughout a block of trials.

Figure 8.

The improvement in proportion correct for the Fixed Level and Random Sentence Level conditions relative to the Random Word Level condition is plotted as a function of the relative level of the target speech. The values are the group mean differences in proportion correct and SEM error bars. Although performance was still above chance for the RWL condition, in general, (from Figures 3 and 4), it was substantially poorer than for either of the stable speech level conditions, suggesting that listeners benefitted significantly from a known and constant level even when the consistency was only within a trial (e.g., the similarity of results between FL and RSL conditions). This highlights the importance of speech streams conforming to expectations over time for target "stream maintenance" under competing speech (see Kidd et al., 2014, for analogous evidence of the benefit of conforming to a known syntax). For the stable sentence level conditions, tuning to level and some effect of level uncertainty also were apparent. The ability of (at least some) listeners to obtain an improvement in target speech identification for negative TMRs (cf., Brungart, 2001), suggests that a priori knowledge of level can foster attentional focus on a speech stream to reduce the IM that is present, a possible level cueing effect similar to the spatial cueing reported by Kidd et al. (2005). This is consistent with previous research (e.g., Brungart, 2001; Brungart et al., 2001; Dirks & Bower, 1969; MacPherson & Akeroyd, 2014) that has suggested that, under high-IM situations, having a level difference between multiple talkers can improve a listener's ability to segregate and attend to a target, even if that level difference creates greater EM of the target. In addition, masker confusions also suggested tuning to level as more incorrect responses were consistent with the masker talker that was closer in level to the target cue, not simply which talker was louder and likely more salient. Varying the level of two symmetrically placed maskers around a center target level also produced some evidence for tuning and a level release from masking, but only at the widest level range (9 dB separation between talkers) and only for the conditions with sentence-level certainty about target level (Figure 5). This evidence nonetheless was consistent with the view that listeners may concentrate processing resources at a point along the level continuum, resulting in better performance at the point of focus. This was apparent for both FL and RSL conditions where there was a high degree of target level certainty within trials, but was not apparent for the RWL condition where there was little certainty about target word levels following the cue. The finding that there is some minimum separation in level needed to yield an enhancement in identification of a stream of target speech flanked by competing speech streams is reminiscent of past findings for streams of speech that are separated in spatial location rather than in level (e.g., Marrone et al., 2008; Srinivasan et al., 2016). Much like the benefit of binaural cues that occur from certain separations in spatial location of speech streams, the overall level differences in the present experiment were sufficient to reduce the IM caused by the masker speech and enhance the perceptual segregation of sources, allowing better focus of the listener's attention on the target source. This effect may also be related to a reduction of the multimasker penalty (Iyer et al., 2010), as the soft masker may no longer be fully extracted from the speech mixture, or be more easily ignored, with the largest separation of levels, thus allowing performance to improve to that which might occur with only the loud masker present. In addition, when the data from the RWL condition were sorted by level relative to the target cue (based on the premise that the listener would focus attention on the level of the cue word and maintain it at that point for the remainder of the sentence), a tuned response was found (Figure 6). This too was taken as evidence for attentional tuning along the level dimension and suggested that the listener focused on the cue word's level, enhancing identification performance for subsequent words near to the cue level, while effectively attenuating off-axis target words. Thus, a tuned response was found only when performance was calculated with respect to a marker in level (the cue) to which they were obliged to attend in order to solve the task. Given how generally predictable and relatively constant the overall level of speech tends to be in normal conversation (Byrne et al., 1994), it may be very difficult for listeners to suppress level as a contributing factor in the focus of attention, even when it is unreliable. It should be noted that, although word by word level variation is an inherent component of prosody in natural speech, the RWL condition violates our expectations about normal prosodic relationships. Even in the FL and RSL conditions, the artificially constructed sentences, with equal overall levels for each word, produces unnatural prosody. Co-articulated sentences, however, have been shown to only produce a small improvement in speech recognition, compared to concatenated sentences (Jett et al., 2021). The present pattern of results, which is consistent with attention-based tuning in level, fits within a substantial body of work in the auditory domain revealing similar effects for a variety of stimuli and tasks. Among the first and most influential of these studies was Greenberg and Larkin (1968), who manipulated attention along the frequency dimension by varying the probability of occurrence of "probe" tones masked by noise. In the probe-signal experiment, detectability was higher at the most likely frequency of a pure-tone signal, which presumably was where the listener chose to focus attention, and decreased above and below that frequency for less-likely signals resulting in a bandpass filter-like response. Other studies using adaptations of the probe-signal method also have revealed selective responses for a variety of stimuli and tasks including amplitude modulation detection (Wright & Dai, 1998), detection of a tone of uncertain duration (Wright & Dai, 1994), spatial separation of sources for frequency sweep discrimination (Arbogast & Kidd, 2000), among others. In the present context, tuning to level, or “attentional bands” in the amplitude domain, have been previously studied psychophysically (Luce & Green, 1978) as well as physiologically, with evidence of neurons tuned to level in the marmoset primary auditory cortex (Watkins & Barbour, 2011). The implementation of a remote testing procedure for the present study, with different hardware used by each listener, should not have affected the results. If unintended level compression did occur, it would have likely reduced the magnitude of the effects reported, not affected overall trends across conditions. The 30-dB range presented during the calibration procedure, and certainly the smaller 18-dB range of the experimental stimuli itself, should have been reasonably produced by personal computer/headphone setups. There were also no individual listener results that would lead one to believe that accurate (relative) levels were not being presented.

Summary

Varying the degree of certainty about the target level had a significant effect on speech identification performance. This effect was confined to negative relative target levels for sentence-level variation, but was present (and much larger) at both positive and negative relative levels for word-to-word level variation within sentences. Higher certainty about target sentence level across trials enhanced the ability of listeners to segregate the target stream when it was at negative relative levels, compared to trial-by-trial variation in target sentence level (although both were superior to word-by-word level variation). This finding indicates that focusing attention on the lower-level speech source in a mixture requires a high degree of a priori knowledge and, based on the large inter-subject differences observed, varies significantly across listeners. Evidence for a tuned response in level, and a subsequent level release from IM with a sufficient separation in level, was found in multiple analyses and depended crucially on both a priori knowledge provided to the listener and the context in which variability occurred (e.g., across sentences or words, or relative to the level of the cue). This evidence comprised both differences across conditions in speech identification performance, masker confusions, and a pattern of results similar to a tuning curve for the within-sentence variability condition where uncertainty was the greatest. In conclusion, the predictability of speech, whether that be syntactic structure, logical semantic meaning, or the spectral and spatial properties of the speech sounds themselves, assists with our ability to understand one talker among many. The intensity of a person's voice, and consistency of that level, is important in more ways than simply overcoming the EM of a noisy environment. The relative talker level, and spacing between talker levels, must also be considered.

40 in total

1. Effects of target-masker contextual similarity on the multimasker penalty in a three-talker diotic listening task.

Authors: Nandini Iyer; Douglas S Brungart; Brian D Simpson
Journal: J Acoust Soc Am Date: 2010-11 Impact factor: 1.840

2. Use of a glimpsing model to understand the performance of listeners with and without hearing loss in spatialized speech mixtures.

Authors: Virginia Best; Christine R Mason; Jayaganesh Swaminathan; Elin Roverud; Gerald Kidd
Journal: J Acoust Soc Am Date: 2017-01 Impact factor: 1.840

3. The role of syntax in maintaining the integrity of streams of speech.

Authors: Gerald Kidd; Christine R Mason; Virginia Best
Journal: J Acoust Soc Am Date: 2014-02 Impact factor: 1.840

4. Musicians at the Cocktail Party: Neural Substrates of Musical Training During Selective Listening in Multispeaker Situations.

Authors: Sebastian Puschmann; Sylvain Baillet; Robert J Zatorre
Journal: Cereb Cortex Date: 2019-07-22 Impact factor: 5.357