Le Wang1, Virginia Best2, Barbara G Shinn-Cunningham3. 1. Department of Biomedical Engineering, Boston University, Boston, MA, USA. 2. Department of Speech, Language and Hearing Sciences, Boston University, Boston, MA, USA. 3. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA.
Abstract
A study was conducted to examine the benefits afforded by a signal-processing strategy that imposes the binaural cues present in a natural signal, calculated locally in time and frequency, on the output of a beamforming microphone array. Such a strategy has the potential to combine the signal-to-noise ratio advantage of beamforming with the perceptual benefit of spatialization to enhance performance in multitalker mixtures. Participants with normal hearing and with hearing loss were tested on both speech localization and speech-on-speech masking tasks. Performance for the spatialized beamformer was compared with that for three other conditions: a reference condition with no processing, a beamformer with no spatialization, and a hybrid beamformer that operates only in the high frequencies to preserve natural binaural cues in the low frequencies. Beamforming with full-bandwidth spatialization supported speech localization and produced better speech reception thresholds than the other conditions.
Keywords: binaural hearing; cocktail party; hearing aids; hearing loss
In many realistic communication settings, a fundamental task of the listener is
to perceptually segregate the various sources of sound, select one source
upon which to focus attention, and then receive and process the information
coming from the chosen source. The ability of listeners to succeed in this
task varies widely and depends not only on the properties of the acoustic
environment and types of competing sound sources that are present but also
on a range of factors specific to individual listeners. For the task of
understanding speech in the presence of competing talkers, some factors that
have been shown to adversely influence performance include advanced age and
hearing loss (e.g., Gallun, Diedesch, Kampel, & Jakien, 2013; Glyde, Cameron, Dillon,
Hickson, & Seeto, 2013; Marrone, Mason, & Kidd, 2008)
although even young normal-hearing (NH) listeners may vary widely in their
abilities (e.g., Kidd
et al., 2016; Ruggles & Shinn-Cunningham,
2011).

Many front-end signal-processing approaches have been proposed to improve
speech recognition in noisy situations. The most successful of these
approaches use directionality to improve the signal-to-noise ratio (SNR) and
are a common feature of commercial assistive listening devices such as
hearing aids (Launer,
Zakis, & Moore, 2016) and cochlear implants (Loizou, 2006).
Directional systems make use of multiple microphones and beamforming to
emphasize sound sources from one direction and attenuate sound sources from
other directions (see reviews in Doclo, Gannot, Moonen, & Spriet,
2010; Greenberg & Zurek, 2001). Such systems can make use of the
microphones available in current hearing aids (i.e., two in a single hearing
aid or four across a pair of hearing aids) or an array of microphones that
may be mounted on the head or on eyeglasses (e.g., Anderson et al., 2018; Greenberg, Desloge, &
Zurek, 2003; Kidd, 2017). The spatial tuning
can be extremely narrow in systems based on a large number of microphones,
which can dramatically improve the SNR for a sound located in the focus of
the beamformer. Under relatively simple conditions with a frontal speech
source and one or more spatially separated noise sources, reported
improvements in speech reception thresholds (SRTs) for beamformers relative
to omnidirectional microphones range from around 5 to 12 dB (e.g., Luts, Maj, Soede, &
Wouters, 2004; Saunders & Kates, 1997; Soede, Bilsen, &
Berkhout, 1993).

A drawback of many beamforming strategies is that they combine the microphone
signals to produce a single-channel output that conveys no binaural
information. This obviously compromises the ability to localize sounds and
may impede the segregation of competing sounds based on differences in
spatial position as well as the ability to selectively attend to (or
suppress) different sounds. To mitigate this problem, a variety of
strategies have been proposed to preserve or restore spatial cues in
beamformer systems (see reviews in Doclo et al., 2010; Kollmeier &
Kiessling, 2016). Most of these strategies involve the
combination of processed with unprocessed signals or the selective
application of beamforming to some parts of the signal. Generally,
beamformer systems designed for hearing-aid applications try to reach a
balance between SNR improvement and spatial-cue preservation (Van den Bogaert, Doclo,
Wouters, & Moonen, 2008, 2009) and thus may not provide
speech-in-noise benefits as large as those that are theoretically possible.
Indeed, recent studies that evaluated beamforming hearing aids under
relatively complex listening situations found rather modest improvements in
speech intelligibility relative to standard directional microphones (e.g.,
Picou, Aspell,
& Ricketts, 2014; Best, Mejia, Freeston, van Hoesel, &
Dillon, 2015; Picou & Ricketts, 2019; Völker, Warzybok, &
Ernst, 2015; Wu et al., 2019).

By using experimental systems, it is possible to explore the trade-off between
SNR improvement and spatial-cue preservation systematically while bypassing
some of the constraints associated with hearing-aid applications. We have
previously used such a system to evaluate a spatial-cue preservation
strategy in which the beamforming is restricted to frequencies above a
specific cutoff, and natural binaural signals are allowed to pass through
below that cutoff (Desloge, Rabinowitz, & Zurek, 1997). This hybrid approach
improves the SNR while maintaining the perceived spatial separation of and
the ability to localize sources in the scene. A drawback of this approach,
in addition to the loss of any SNR advantage in the low frequencies, is that
potentially useful spatial information in the higher frequencies is
discarded. Moreover, in this scheme, spatial cues in the high and low
frequencies are inconsistent (with natural location information at low
frequencies, but spatial information consistent with a source in the median
plane for high frequencies). Despite these issues, our experimental results
show that both NH and hearing-impaired (HI) listeners benefit from the
preservation of the low-frequency spatial cues in situations involving
speech-on-speech masking, where differences in the perceived locations of
the competing sounds are thought to be critical for their segregation (Best, Roverud, Mason,
& Kidd, 2017). Notably, under these conditions, if no
binaural cues are preserved, some listeners perform more poorly with
beamforming than without, indicating that the improvement in SNR is more
than counteracted by the loss of spatial information (Best et al., 2017; Kidd, Mason, Best, &
Swaminathan, 2015; see also Neher, Wagener, & Latzel,
2017).

Here, we describe an approach that extracts full-bandwidth spatial information
in an acoustic mixture prior to beamformer processing and then reapplies
this spatial information to the output of the beamformer. In contrast to the
hybrid approach, the final stimulus includes SNR improvements provided by
beamforming as well as full-bandwidth spatial information gleaned from the
original sound mixture. Like the hybrid approach (but unlike other
spatial-cue preservation approaches), the scheme presented here has the very
useful characteristic that the spatial cues are applied to all sources, with
no need to estimate which source is the target and which is the noise. The
approach is based on the observation that in a mixture of spectrotemporally
sparse, competing sounds (such as speech), different sources tend to
dominate at any given time and frequency; as a result, spatial information
in the original mixture at each time and frequency largely reflects the
spatial attributes of the sound source dominating that time–frequency
“tile.” Indeed, it has been shown that the spatial cues of the nondominant
source in a tile contribute very little to the perception and
intelligibility of the mixture (Schoenmaker, Brand, & van de Par,
2016). This observation suggests the possibility of
resynthesizing a binaural sound from the single-channel beamformer output
signal by (re)imposing the binaural differences present in the original
sound mixture on the appropriate points in time and frequency of the
resynthesized sound. In those time–frequency tiles dominated by the target
sound, the spatial cues will be consistent with the target’s location; in
the tiles dominated by a distracting sound, the spatial cues will be
consistent with the spatial location of that sound. As a result, distractor
energy that is not perfectly suppressed by the beamformer should still be
perceptually separable from target energy, which may improve the efficacy of
neural suppression of the distractor through spatial selective
attention.

The following study was designed to compare this full-bandwidth spatialization
approach with the hybrid approach and with a full-bandwidth beamformer with no
spatialization. These three strategies were compared with a reference
condition in which the listener received natural, unprocessed binaural
signals. The main task of interest was a speech-on-speech masking task similar
to one we have used previously. A localization experiment was included to
confirm that natural binaural cues were conveyed by the full-bandwidth
spatialization approach. For each task, both broadband and high-pass speech
conditions were included to confirm that useful spatial cues were provided
at high frequencies as well as at low frequencies. The hypothesis was that
full-bandwidth spatialization would support accurate horizontal
localization, which would in turn support the ability of listeners to focus
spatial attention effectively on the target, and thus maximize the advantage
of beamforming under speech-on-speech masking conditions.
Methods
Participants
Fourteen adults participated in the study, seven with normal hearing
(aged 18–40 years, mean age 23 years) and seven with bilateral
sensorineural hearing impairment (aged 20–56 years, mean age
36 years). There was no significant age difference between the NH and
HI groups, t(12) = 2.02, p = .07.
The NH participants had pure-tone averages (PTAs; mean threshold
across both ears at 0.5, 1, and 2 kHz) that ranged from 0 to 6.7 dB
hearing level (HL; mean 3.8 dB HL). The HI participants had a range of
losses with PTAs from 2.5 to 73.3 dB HL (mean 35.8 dB HL). The losses
were relatively symmetric, with a PTA difference between the ears of
no more than 10 dB. Participants were paid for their participation,
gave informed consent, and all procedures were approved by the Boston
University Institutional Review Board. Total testing time for the
localization and speech intelligibility experiments was approximately
2.5 hr. We note that one NH participant was unable to complete the
high-pass condition for both localization and speech intelligibility
experiments, and another was unable to complete the broadband speech
intelligibility experiment. Results are based on only six NH
participants in these cases.
Beamforming and Spatialization
The different listening conditions were tested using a headphone
simulation. Impulse responses were measured on an acoustic manikin
(KEMAR) seated in a large sound-treated booth (IAC Acoustics). The
inner dimensions of the booth were approximately 3.75 m × 4 m × 2.25 m
(length × width × height). The manikin was seated halfway along one of
the walls with a distance of about 0.6 m between its back and the
wall. An array of loudspeakers (Acoustic Research 215PS) was arranged
in front of the manikin at ear-height, at a distance of about 1.5 m.
Loudspeakers were positioned between −90° and +90° azimuth at 7.5°
intervals for the impulse response recordings, although only a subset
of five positions was used for this study (see later). The manikin was
fitted with a flexible headband that ran from ear to ear across the
top of the head. The headband housed a microphone array (Sensimetrics
Corporation), which consisted of 16 omnidirectional microphones
arranged in four front-back-oriented rows. The rows were evenly spaced
with a separation of 66.67 mm, for a total array length of 200 mm.
More details about the array, including images of the microphone
layout, can be found elsewhere (Kidd, 2017, Figure 6; Roverud, Best,
Mason, Streeter, & Kidd, 2018, Figure 2).
Figure 1 provides
an overview of the processing steps used to create stimuli for each
condition. Two sets of impulse responses were recorded. One set of
impulse responses, which captured the signals received by the
manikin’s in-ear microphones, were used to simulate a natural binaural
listening situation (“KEMAR” condition). The other set of impulse
responses captured the 16-channel output of the microphone array for
each source location. These outputs were weighted and combined
according to the optimal-directivity algorithm of Stadler and
Rabinowitz (1993) for a look direction of 0° azimuth. The
single-channel output for this condition (“BEAM”) was presented
diotically. The hybrid condition described earlier was simulated by
combining low-pass-filtered binaural KEMAR impulse responses with
high-pass-filtered BEAM impulse responses and is referred to as
“BEAMAR.” The crossover frequency was chosen to be 800 Hz, which was
shown previously to be optimal (Best et al., 2017; Desloge et al.,
1997).
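To make the hybrid construction concrete, the sketch below shows one way the BEAMAR impulse responses could be assembled around the 800-Hz crossover. It is written in Python with SciPy (the study's own processing was done in MATLAB), and the sample rate, filter type and order, and variable names are illustrative assumptions rather than details of the original implementation.

```python
# Sketch: hybrid BEAMAR impulse responses (low-pass KEMAR + high-pass BEAM).
# FS, the Butterworth filters, and the zero-phase filtering are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 44100          # sample rate in Hz (assumed)
F_CROSS = 800.0     # crossover frequency from the text (Hz)

def beamar_irs(kemar_ir_left, kemar_ir_right, beam_ir):
    """Combine low-pass-filtered KEMAR IRs with the high-pass BEAM IR."""
    b_lo, a_lo = butter(4, F_CROSS / (FS / 2), btype="low")
    b_hi, a_hi = butter(4, F_CROSS / (FS / 2), btype="high")
    lo_l = filtfilt(b_lo, a_lo, kemar_ir_left)   # natural binaural lows
    lo_r = filtfilt(b_lo, a_lo, kemar_ir_right)
    hi = filtfilt(b_hi, a_hi, beam_ir)           # directional highs (diotic)
    return lo_l + hi, lo_r + hi                  # left and right BEAMAR IRs
```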
Figure 1.
Overview of the steps used to create stimuli for each
condition. The speech mixture was filtered with IRs
measured in KEMAR's ear canals to create the KEMAR stimuli
and with IRs obtained from the microphone array to create
the BEAM stimuli. IRs for the BEAMAR condition were
created by combining LP-filtered KEMAR IRs and HP-filtered
microphone array IRs. For spatial-BEAM stimuli, IPDs and
ILDs were extracted from the KEMAR signals and applied to
the BEAM signal. LP = low-pass; HP = high-pass;
IPD = interaural phase difference; ILD = interaural level
difference; IR = impulse response.
The full-bandwidth spatialization strategy, referred to as "spatial
BEAM,” combined the spatial information from the KEMAR condition and
the noise suppression from the BEAM condition across all frequencies.
Specifically, for each spatial configuration, two spatial cues
(interaural phase difference [IPD] and interaural level difference
[ILD]) in each time–frequency tile were extracted from the binaural
KEMAR signals. To do this, the KEMAR signal was broken down into time
frames using a 92.9-ms Hamming window (4,096 samples) that shifted by
23.2 ms (1,024 samples) for each frame. Within each frame, the
spectrum of the left signal and the right signal was computed
(frequency resolution = 10.8 Hz). The IPD and ILD were then calculated
as the phase difference and the magnitude difference between the left
and right signal for each frequency bin in the spectrum. These
frequency-dependent IPDs and ILDs were then imposed on the
corresponding time slice in the BEAM signal, creating a left and right
signal per time frame. To do this, half of each IPD and ILD value was
applied to the BEAM signal to create a left-ear signal, and the
polarity-inverted half was applied to another copy of the BEAM signal
to create a right-ear signal. The resynthesized
binaural signals in each time frame were then summed to create a
continuous output without additional temporal smoothing.
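In outline, this resynthesis amounts to a short STFT loop. The sketch below, in Python with NumPy (the study's own code was written in MATLAB), follows the frame parameters given above; the function and variable names are illustrative, and the constant gain introduced by summing 75%-overlapped Hamming windows is left uncompensated, consistent with the absence of additional temporal smoothing described in the text.

```python
# Sketch: spatial-BEAM resynthesis. Per STFT tile, extract the IPD and ILD
# from the binaural KEMAR signals and impose +/- half of each cue on two
# copies of the single-channel BEAM signal. Names are illustrative.
import numpy as np

N_FFT, HOP = 4096, 1024          # 92.9-ms window, 23.2-ms hop at 44.1 kHz
WIN = np.hamming(N_FFT)

def spatialize_beam(kemar_l, kemar_r, beam):
    n = min(len(kemar_l), len(kemar_r), len(beam))
    out_l, out_r = np.zeros(n), np.zeros(n)
    eps = 1e-12
    for start in range(0, n - N_FFT + 1, HOP):
        sl = np.fft.rfft(WIN * kemar_l[start:start + N_FFT])
        sr = np.fft.rfft(WIN * kemar_r[start:start + N_FFT])
        sb = np.fft.rfft(WIN * beam[start:start + N_FFT])
        ipd = np.angle(sl) - np.angle(sr)                             # radians
        ild = 20 * np.log10((np.abs(sl) + eps) / (np.abs(sr) + eps))  # dB
        g = 10 ** (ild / 40)                    # half the ILD as a linear gain
        left = sb * g * np.exp(1j * ipd / 2)    # +half IPD, +half ILD
        right = sb / g * np.exp(-1j * ipd / 2)  # -half IPD, -half ILD
        out_l[start:start + N_FFT] += np.fft.irfft(left, N_FFT)
        out_r[start:start + N_FFT] += np.fft.irfft(right, N_FFT)
    return out_l, out_r   # frames overlap-added, no further smoothing
```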
Stimuli
Target stimuli were taken from a 40-word corpus containing eight
monosyllabic words in each of five distinct word categories (Kidd, Best, &
Mason, 2008). Eight female voices were used in this
study. Two speech bandwidth conditions were tested: broadband speech
and high-pass speech. The broadband speech condition used the full
spectrum speech signals in each processing condition, while the
high-pass speech condition removed the frequency content below 800 Hz.
The high-pass speech condition served as a control condition to test
whether the high-frequency spatial information
preserved in the spatial-BEAM condition offers useful information for
localization and speech understanding, or whether benefits can only be
obtained from the salient low-frequency spatial
information. In the high-pass speech condition, the BEAMAR condition
became identical to the BEAM condition because the low-frequency KEMAR
portion of the BEAMAR signal was filtered out.

Stimuli were generated using MATLAB software (MathWorks Inc.) and
presented via a 24-bit soundcard (RME HDSP 9632) through a pair of
circumaural headphones (Sennheiser HD280 Pro). The participant was
seated in a small sound-treated booth fitted with a computer monitor
and mouse. For HI participants, individualized linear amplification
according to the National Acoustic Laboratories' Revised, Profound
(NAL-RP) prescription (Dillon, 2012) was applied to
each stimulus just prior to presentation. A linear prescription was
chosen to avoid potentially complicating interactions between
nonlinear amplification and the different processing strategies.
Localization Experiment
A localization experiment was conducted to confirm that the spatial BEAM
provided appropriate spatial information and produced lateral percepts
in line with those produced by natural binaural stimuli (KEMAR). The
stimuli were single words drawn at random from the speech corpus,
presented at random from one of the five locations: −60°, −30°, 0°,
+30°, and +60° azimuth. The nominal level of each word was 55 dB sound
pressure level. Each word was processed according to one of the four
conditions described earlier (KEMAR, BEAM, BEAMAR, and spatial BEAM),
with the look direction of the beamformer always fixed at 0°. Trials
were organized into blocks of 100 (five repetitions for each
combination of processing condition and location), presented in a
different random order for each participant. One block was completed
for each of the two speech bandwidth conditions. Participants reported
the perceived location of each stimulus by clicking on a graphical
user interface showing a continuous arc representing the azimuthal
plane from −90° to +90°. Before the experiment, each participant was
given a training demo containing example trials from the KEMAR
condition until they were familiar with the procedure.
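For reference, the localization scores reported in the Results reduce to a simple pooled computation over the response data; a minimal illustration with made-up numbers:

```python
# Sketch: mean absolute localization error pooled across trials
# (response values here are invented for illustration).
import numpy as np

true_az = np.array([-60, -30, 0, 30, 60], dtype=float)     # degrees
responses = np.array([-55, -38, 4, 26, 68], dtype=float)   # perceived locations
print(f"mean absolute error = {np.mean(np.abs(responses - true_az)):.1f} deg")
```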
Speech Intelligibility Experiment
The second experiment tested speech intelligibility in the presence of
spatially separated competing talkers. Target stimuli were five-word
sentences created by concatenating words from the speech corpus (with
no added gaps between words). Each sentence had the form
name-verb-number-adjective-noun (e.g., “Sue bought two red shoes”).
The target sentence was spoken by one voice (chosen randomly on each
trial from the set of eight) and was identified on the basis of the
first word (which was always “Sue”). The target was presented
simultaneously with four speech maskers. The speech maskers were also
five-word sentences, assembled in the same manner as the target
sentence. The five presented sentences were spoken by different
talkers and had no words in common. The target was located at 0°
azimuth, and the four maskers were located at −60°, −30°, +30°, and
+60° azimuth. Each masker was presented at 55 dB sound pressure level
and the level of the target was varied to set the target-to-masker
ratio (TMR; note that this is relative to each masker,
not to the sum of the maskers) to −20, −15, −10, −5, or 0 dB.

Participants completed three blocks of trials in each of the eight
conditions (two bandwidths × four processing conditions) for a total
of 24 blocks. Within each block, five trials were presented at
each of the five TMRs (25 trials total). Responses were given by
clicking on a graphical user interface containing a grid of the 40
possible key words. Psychometric functions were generated for each
participant in each condition by plotting percentage correct
(calculated across all four key words in all trials) as a function of
TMR and fitting a logistic function. SRTs corresponding to the TMR at
50% correct were extracted from each function. Participants were given
a training demo of the speech intelligibility experiment, which
contained example trials from the KEMAR condition. The TMR in the
example trials was set at 0 dB to make sure all participants were able
to perform the task and follow the instructions before they moved on
to the main experiment.
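The fitting step can be sketched as follows. The text specifies only a logistic fit and a 50%-correct point, so the particular two-parameter form, SciPy routine, and example data below are assumptions:

```python
# Sketch: fit a logistic psychometric function to percent-correct scores
# and read off the SRT (the TMR at 50% correct). Example data are invented.
import numpy as np
from scipy.optimize import curve_fit

def logistic(tmr, srt, slope):
    return 100.0 / (1.0 + np.exp(-slope * (tmr - srt)))

tmrs = np.array([-20.0, -15.0, -10.0, -5.0, 0.0])   # tested TMRs (dB)
pct = np.array([12.0, 28.0, 54.0, 79.0, 91.0])      # illustrative scores

(srt, slope), _ = curve_fit(logistic, tmrs, pct, p0=(-10.0, 0.5))
print(f"SRT = {srt:.1f} dB TMR")  # 50% point falls at 'srt' by construction
```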
Acoustic Analyses
Before examining the results of the behavioral experiments, acoustic analyses
are presented to describe the performance of the spatial-BEAM algorithm for
the speech materials used in this study.

One concern with the approach is that there may be a number of time–frequency
tiles that are dominated by an interferer before beamforming, but that flip
to being dominated by the target after beamforming. In these cases, binaural
cues extracted from the original mixture may be inappropriately assigned to
the target and lead to a distorted spatial representation. To assess how
often this would have occurred for the stimuli in our speech intelligibility
experiment, a simulation was conducted using 100 random stimuli configured
in the same way as the experiment with a target at 0° and four maskers at
±60° and ±30°. The analysis was conducted for a range of TMRs from −20 to
+20 dB. For each time–frequency tile, the SNR was compared before and after
beamforming, and the number of tiles in which the SNR “flipped” from below
0 dB (masker-dominated) in the original signal to above 0 dB
(target-dominated) in the BEAM signal was counted. Figure 2(a) shows this flip rate as
a percentage of the total number of tiles. The flip rate ranged from around
4% at the lowest TMR to around 22% at a TMR of +10 dB (solid line) and
covered the range 4% to 19% over the TMRs tested in the experiment. However,
we speculated that many of these SNR sign flips did not lead to drastic
changes in the value of the SNR but rather represented small absolute
changes in SNR from slightly negative to slightly positive. Using a more
conservative definition of what constitutes a flipped tile, by counting only
those in which the SNR was <−10 dB before and >+10 dB after beamforming,
the flip rate never exceeded 4% (dashed line). We can conclude from this
analysis that the number of tiles that flip from being masker-dominated to
target-dominated was relatively low but not negligible. Thus, we considered
it important to gain some intuition about the effect of these flipped tiles
on the spatial representation of the target.

Figure 2.
(a) Proportion of time–frequency tiles that were masker-dominated in the natural stimulus and flipped to being target-dominated after beamforming. Flip rates are shown for a wide range of nominal TMRs and for two criteria for defining flipped tiles: 0 dB (any tile whose local SNR changed from <0 dB to >0 dB) and 10 dB (any tile whose local SNR changed from <−10 dB to >+10 dB). (b) ITD histograms for target-dominated tiles before and after beamforming. (c) ILD histograms for target-dominated tiles before and after beamforming. ILD = interaural level difference; ITD = interaural time difference; TMR = target-to-masker ratio.

To this end, another analysis was conducted to estimate the binaural cues
associated with the target signal before and after the spatial-BEAM
processing. First, time–frequency tiles were identified that were dominated
by the target in the original mixture. For these tiles, IPDs and ILDs were
calculated from the KEMAR stimuli, and IPDs were transformed to interaural
time differences (ITDs). Histograms of ITDs and ILDs are plotted in Figure 2(b) and (c)
(black lines) for a mixture with a TMR of 0 dB. As expected for a centrally
located source, the histograms are centered on 0-µs ITD and 0-dB ILD (any
asymmetries are due to asymmetries in the impulse responses, including minor
effects of the room and the alignment of KEMAR within it, etc.). Second,
time–frequency tiles were identified that were dominated by the target after
beamforming. For these tiles, ITD and ILD distributions were again
calculated from the KEMAR stimuli (gray lines in Figure 2(b) and (c)). These
distributions capture the binaural cues that would be applied to the target
in the spatial-BEAM condition. The first thing to notice is that there are
more target-dominated tiles after beamforming. Moreover, these tiles
continue to be centered on 0-µs ITD and 0-dB ILD, although some spread in
the histograms can be observed. In general, we can conclude that the
binaural cues associated with the target talker are not drastically
distorted by the tiles that were previously masker-dominated. A similar
pattern was found when the same analysis was applied to each of the four
masker talkers, although in those cases fewer tiles (and not more) were
available after beamforming.
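As an illustration of the tile-based analyses in this section, the sketch below counts "flipped" tiles from per-tile SNRs computed before and after beamforming. It assumes that magnitude STFTs of the separate target and masker signals are available; the names and structure are illustrative rather than taken from the original analysis code.

```python
# Sketch: percentage of time-frequency tiles that flip from masker-dominated
# to target-dominated after beamforming. Inputs are magnitude STFTs of the
# target (t_*) and summed maskers (m_*) before ("nat") and after ("beam")
# beamforming; access to these separated signals is assumed.
import numpy as np

def flip_rate(t_nat, m_nat, t_beam, m_beam, criterion_db=0.0):
    eps = 1e-12
    snr_nat = 20 * np.log10((t_nat + eps) / (m_nat + eps))
    snr_beam = 20 * np.log10((t_beam + eps) / (m_beam + eps))
    flipped = (snr_nat < -criterion_db) & (snr_beam > criterion_db)
    return 100.0 * flipped.sum() / flipped.size  # percent of all tiles

# criterion_db=0 gives the <0 dB to >0 dB criterion; criterion_db=10 gives
# the stricter <-10 dB to >+10 dB criterion in Figure 2(a).
# For Figure 2(b), IPDs in target-dominated tiles were converted to ITDs;
# the standard conversion is itd = ipd / (2 * pi * f) at frequency f (Hz).
```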
Results
Figure 3 shows
the group average localization responses as a function of the true
location in the broadband speech condition (a and c) and the high-pass
speech condition (b and d) for NH participants (a and b) and HI participants (c and d). The gray line indicates a slope of one, or
perfect performance, where the perceived location is equal to the true
location. Shallower slopes indicate a weaker spatial percept with a
slope of zero indicating that the responses were not at all related to
the true location.
Figure 3.
Mean perceived location plotted against true location for the
broadband speech stimuli (a and c) and high-pass speech
stimuli (b and d) for the NH group (a and b) and the HI
group (c and d). Different symbols and line styles
represent the different signal-processing schemes. The
gray line represents the case when the perceived location
is equal to the true location. Error bars show
across-subject standard deviations. NH = normal-hearing;
HI = hearing-impaired.
When tested with broadband speech, both NH and HI groups were able to
localize the stimulus with reasonable accuracy in all conditions
except the BEAM condition. A mixed analysis of variance (ANOVA) model
using true location and processing condition as within-subjects
factors and group as a between-subjects factor found a significant
main effect of true location, F(4, 48) = 287.61,
p < .001, and a significant interaction
between true location and processing condition, F(12,
144) = 84.67, p < .001. No main effect of group
was identified, F(1, 12) = 0.53,
p = .48, and none of the interactions involving group
were significant. Follow-up ANOVAs comparing individual processing
conditions indicated that the effect of true location in the
spatial-BEAM condition was not significantly different from that in
the KEMAR condition, F(4, 52) = 0.68,
p = .61, while the effect of true location in
the BEAMAR condition was significantly different than in the KEMAR
condition, F(4, 52) = 6.81,
p < .001. In terms of absolute localization error,
when pooled across all trials for all locations for all participants,
the mean values were 10° (KEMAR), 12° (spatial BEAM), 13° (BEAMAR),
and 40° (BEAM).

With high-pass speech, because the spatial information in the BEAMAR
condition was absent, participants were able to perceive lateral
sounds only in the KEMAR and spatial-BEAM conditions. Similar to the
broadband speech condition, there was a significant main effect of
true location, F(4, 44) = 173.08,
p < .001, and a significant interaction between
true location and processing condition, F(12,
132) = 74.67, p < .001. No group effect was found,
F(1, 11) = 0.06, p = .81, and
no interactions involving group were significant. An ANOVA comparing
the spatial-BEAM and KEMAR conditions found no significant difference
in the effect of true location, F(4, 48) = 0.07,
p = .99. Mean absolute localization errors in
the high-pass condition were 13° (KEMAR), 12° (spatial BEAM), 35°
(BEAMAR), and 37° (BEAM).

Figure 4(a) and
(b) shows group average SRTs for the different listening
conditions, and Figure 4(c) and (d) shows the same data but after each
participant’s SRT was normalized by their KEMAR SRT so that the
ordinate represents the benefit provided by each
beamforming condition relative to a natural binaural listening
condition. As expected, the SRTs were generally higher in the HI group
than in the NH group, for both the broadband speech (mean SRT −6 dB
vs. −12 dB) and high-pass speech (mean SRT −4 dB vs. −9 dB).
Figure 4.
Mean SRTs (a and b) and benefits relative to KEMAR (c and d)
for the broadband speech stimuli (a and c) and high-pass
speech stimuli (b and d). The different processing schemes
are shown along the abscissa, and NH and HI groups are
shown in black and gray, respectively. Error bars show
across-subject standard deviations. NH = normal-hearing;
HI = hearing-impaired; SRT = speech reception
threshold.
A mixed ANOVA on the SRTs in the broadband speech condition identified a
significant main effect of processing condition,
F(3,33) = 38.5, p < .001, and a
significant main effect of group, F(1, 11) = 10.44,
p = .008, but no significant interaction,
F(3, 33) = 1.14, p = .35. Post
hoc comparisons (paired t tests with Bonferroni
correction) confirmed that SRTs in the spatial-BEAM condition were
significantly lower than in all other conditions at −12 dB on average,
while the SRTs in the KEMAR condition were the highest at −5 dB. In
the high-pass speech condition, the effect of processing condition was
again significant, F(3, 33) = 55.75,
p < .001, as was the effect of group,
F(1, 11) = 11.00, p = .007,
and there was no significant interaction, F(3,
33) = 1.21, p = .32. Post hoc comparisons confirmed
that SRTs in the spatial-BEAM condition were significantly lower than
in all other conditions at −11 dB on average, while the SRTs in the
KEMAR condition were the highest at −2 dB.

Consistent with these findings, the spatial BEAM provided larger benefits
for speech understanding (8 dB on average) than both BEAM and BEAMAR
(4 dB on average; Figure 4(c) and (d)). When pooled across NH and HI participants, the SRT benefit in the spatial-BEAM condition was not
significantly correlated with age or with PTA for either speech
condition (p > .05). However, the spatial-BEAM
benefit was positively correlated with the BEAM benefit (broadband:
r = .62; high-pass: r = .71)
and with the BEAMAR benefit (broadband: r = .84;
high-pass: r = .62).
Overall Performance
Figure 5
illustrates the overall performance of individual participants,
combining the results from both the localization and speech
intelligibility experiments. The four panels show data for the
broadband speech condition (a and c) and the high-pass speech
condition (b and d) for NH participants (a and b) and HI participants
(c and d). Within each panel, individuals with good performance in
both experiments are represented as points in the lower left corner,
corresponding to low localization errors and low SRTs. For broadband
speech, all processing schemes except the BEAM show a small average
localization error, and the spatial BEAM shows generally lower SRTs
than KEMAR and BEAMAR. Thus, the spatial BEAM provided the most
benefit in speech intelligibility among all conditions without
degrading localization accuracy. For high-pass speech, results for the
BEAMAR condition look essentially the same as the BEAM condition,
confirming its limitation in situations when only high-frequency
information is available. On the other hand, the spatial-BEAM
condition preserves reliable spatial information at high frequencies
and thus leads to low localization errors. Because the spatial BEAM
also inherits the improved SNR from beamforming, overall performance
was better than in the KEMAR condition.
Figure 5.
Individual localization errors plotted against individual
SRTs for the broadband speech stimuli (a and c) and
high-pass speech stimuli (b and d) for NH participants (a
and b) and HI participants (c and d). Different symbols
represent the different signal-processing schemes.
NH = normal-hearing; HI = hearing-impaired; SRT = speech
reception threshold.
Discussion
This study provided behavioral data to assess the benefits of a
signal-processing strategy that reimposes natural, full-bandwidth, binaural
information on the output of a highly directional beamformer. This
spatial-BEAM strategy may be a promising option for assistive hearing
devices because it has the potential to combine the SNR advantage of
beamforming with the perceptual benefit of spatialization. Groups of
participants with and without hearing loss were tested using this approach
on both sound localization and speech intelligibility tasks. As anticipated,
the spatial-BEAM strategy supported horizontal localization performance (as
measured with single speech sources) that was equivalent to that observed in
the natural binaural condition. Moreover, the spatial-BEAM strategy
significantly improved speech understanding in the presence of competing
speech relative to other implementations of the beamformer.

While the HI group in our study had a poorer mean SRT than the NH group, we
found no interaction between group and processing condition, suggesting that
the benefit of spatialization was achieved by both groups. Indeed, all
participants had lower SRTs for the spatial BEAM than for BEAM (or for
BEAMAR) in both bandwidth conditions. As pointed out by Neher et al.
(2017), however, a benefit of spatialization may only apply to
listeners who are sufficiently sensitive to spatial information; for
listeners with poor sensitivity, it may have no effect. We note though that
as the spatial-BEAM strategy does not sacrifice beamforming for spatial-cue
preservation, it is hard to imagine a case in which it would be
detrimental to performance.

A particular strength of the spatial-BEAM strategy is that it provides
appropriate binaural cues across all frequencies rather than in a restricted
frequency range (cf., Best et al., 2017; Desloge et al., 1997). The
full-bandwidth approach has several specific advantages. First, the
consistency of cues across frequency provides a coherent spatial perception
of sounds without the risk of “split images.” Second, spatial cues are
available across the spectrum, which increases the versatility of the
strategy. For example, this strategy will be robust in situations where
low-frequency spatial information is degraded or lost due to masking noise
or reverberation. Moreover, the inclusion of both IPDs and ILDs increases
the chances that some sense of spatialization will be preserved even if one
of the cues is unavailable or inaccessible to a listener. This issue may be
important for potential applications in listeners with poor sensitivity to
IPDs (such as older listeners; Moore, 2014) or in bilateral
cochlear-implant users who are almost entirely dependent on ILDs (e.g.,
van Hoesel,
2004).

It is worth noting that this strategy is ideally suited for situations
containing spectrotemporally sparse targets and distractors, such as the
speech mixtures tested here. However, it is likely to be less effective for
conditions in which the component sounds overlap heavily in the
spectrotemporal plane. For example, for a speech target in the presence of a
continuous intense noise, there may be few tiles in which the target is very
dominant and thus the location of the target may not be clearly represented.
Moreover, there may be many tiles in which both the target and noise sources
contribute significant and near-equal amounts of energy and the spatial cues
are inconsistent with either source. Under such conditions, reconstituting
the spatial information may be of little benefit. An important next step in
this line of work will be to compare the spatial-BEAM approach with other
kinds of beamforming strategies using a wide variety of stimuli and
tasks.

Finally, while the experimental system we used here allowed us to explore the
potential benefits of the spatial-BEAM strategy while retaining a good
degree of experimental control, this approach comes with some clear
limitations. First, we tested only a “fixed-head” situation in which the
received signals were unaffected by head movements. It is possible that this
scenario underestimates the performance that is possible in the BEAM
condition and hence overestimates the benefit of spatialization. Second, the
spatial-BEAM strategy relies on binaural cues contained in signals as they
occur in the listener’s ear canals, and these were captured in our system
via the KEMAR impulse responses. It is likely, however, that equally useful
binaural cues could be extracted from microphones in other locations (such
as the outermost microphones of the microphone array). Third, it is not
clear whether similarly robust effects of spatialization would be achieved
within the constraints of a real hearing-aid system. For example, it may not
be feasible to implement the spatial-BEAM processing in real time,
especially the ideal version implemented in this study that had very fine
spectral and temporal resolution. Systematic evaluations of lower resolution
versions of the spatialization would be very informative in this case. It
would also be critical to determine how well the preserved binaural cues are
transmitted to the wearer for different hearing-aid styles, ear pieces, vent
sizes, and so on.