Alain de Cheveigné1,2,3. 1. Laboratoire des systèmes perceptifs27051, CNRS, Paris, France. 2. Département d'études cognitives, École normale supérieure, PSL University, Paris, France. 3. UCL Ear Institute, London, UK.
Abstract
This paper reviews the hypothesis of harmonic cancellation according to which an interfering sound is suppressed or canceled on the basis of its harmonicity (or periodicity in the time domain) for the purpose of Auditory Scene Analysis. It defines the concept, discusses theoretical arguments in its favor, and reviews experimental results that support it, or not. If correct, the hypothesis may draw on time-domain processing of temporally accurate neural representations within the brainstem, as required also by the classic equalization-cancellation model of binaural unmasking. The hypothesis predicts that a target sound corrupted by interference will be easier to hear if the interference is harmonic than inharmonic, all else being equal. This prediction is borne out in a number of behavioral studies, but not all. The paper reviews those results, with the aim to understand the inconsistencies and come up with a reliable conclusion for, or against, the hypothesis of harmonic cancellation within the auditory system.
This paper reviews the hypothesis of harmonic cancellation according to which an interfering sound is suppressed or canceled on the basis of its harmonicity (or periodicity in the time domain) for the purpose of Auditory Scene Analysis. It defines the concept, discusses theoretical arguments in its favor, and reviews experimental results that support it, or not. If correct, the hypothesis may draw on time-domain processing of temporally accurate neural representations within the brainstem, as required also by the classic equalization-cancellation model of binaural unmasking. The hypothesis predicts that a target sound corrupted by interference will be easier to hear if the interference is harmonic than inharmonic, all else being equal. This prediction is borne out in a number of behavioral studies, but not all. The paper reviews those results, with the aim to understand the inconsistencies and come up with a reliable conclusion for, or against, the hypothesis of harmonic cancellation within the auditory system.
Entities:
Keywords:
auditory scene analysis; harmonic cancellation; harmonicity; pitch perception; segregation
Our environment is cluttered with sound sources, but to act effectively we must focus
on one or a few and ignore the others. This is hard because the mixing process, by
which sounds from the various sources add up before entering the ears, cannot be
undone. We usually do not know the mixing matrix (i.e., the delays and gains applied
to each source before adding) and, even if we did, that matrix is generally not
invertible. Recovering individual sources is thus impossible except
in very simple cases. Nonetheless, we sometimes feel that we can follow an
individual source, for example, a voice within a conversation, or an instrument
within an ensemble, as if it were alone. The ability to make sense of a complex
acoustic scene in terms of individual sources is known as Auditory Scene Analysis
(Bregman, 1990).Auditory Scene Analysis is sometimes discussed as a process of “grouping” elements
(e.g., partials) to form sources or objects (Bregman, 1990), for example, according to Gestalt
principles. However, such “elements” are conceptual rather than operational. While
sinusoids and clicks serve well as synthesis parameters, it may not be possible to
extract them from the sound due to theoretical limits (e.g., time–frequency
uncertainty tradeoff, Gábor, 1947)
and physiological limits (e.g., temporal and frequency resolution of cochlear
analysis, Moore &
Glasberg, 1983; Plack &
Moore, 1990). If they cannot be accessed, postulating that they can be grouped is
perhaps misleading.Fortunately, perfect isolation of each source is usually not necessary. According to
the principle of unconscious inference (Helmholtz, 1867; Kersten et al., 2004), we need only to recover enough
information to infer the presence or nature of a target. Regularities within the
world, internalized as models within the perceptual system, allow us to fill in
missing parts. This process, which manipulates incomplete information “under the
hood,” provides us with the illusion of perceiving each object just as if true
unmixing had taken place. Information about the source is partial but, thanks to
inference, it appears to us that it is complete (al Haytham, 1030; Hatfield, 2002; Imbert, 2020).For this to work, it is essential that the sensory representation be stripped of the
influence of background objects. If not, a different background might lead to a
different percept, defeating the goal of perceiving the target as if it were in
isolation. In other words, the sensory representation should be made
invariant to the presence of interfering sources. This is
analogous to invariance with respect to intra-class variability in pattern
classification (Duda et
al., 2012).Several aspects of auditory processing might contribute to this goal. If target and
background differ by their spectral content, cochlear filtering can
be used to split sensory input into channels dominated by the target, distinct from
those that reflect the background. Discarding the latter then yields a
representation that is invariant to the presence of the background—albeit incomplete
because of the missing channels. Likewise, if target and background occur at
different points in time, temporal resolution properties of the
auditory system (Moore et
al., 1988;
Plack & Moore,
1990) can be used
to discard time intervals contaminated by the background.Putting both elements together, the target can be “glimpsed” within spectro-temporal
gaps of the background (Cooke, 2006). The glimpsed “pixels” of the time-frequency representation are handed
over to subsequent processing together with a mask to indicate their position.
Discarded pixels are not merely set to zero: they are given zero
weight (Cooke et
al., 1997).
Spectro-temporal glimpsing has been proposed in speech processing applications
(Wang & Brown,
2006; Wang, 2008), and to account for
human perceptual abilities and derive predictive measures of intelligibility (e.g.,
Best et al., 2019; Josupeit et al., 2020).Binaural disparity is another potentially useful cue. In addition to head shadow
effects that produce favorable target-to-masker ratios within certain frequency
channels at either ear (Grange
& Culling, 2016), perception benefits from binaural interaction, which is commonly
understood to follow the well-known equalization cancellation (EC) model (Durlach, 1963), and its extensions
(e.g. Culling &
Summerfield, 1994; Breebaart et
al., 2001;
Akeroyd, 2004). Signals at each ear
are differentially time-shifted and scaled (“equalization”), and then subtracted one
from the other (“cancellation”) to suppress interaurally coherent sound from a
competing source. The internal time shift and scale factor are tuned to match the
interfering source. The EC model is assumed to involve temporally accurate neural
patterns processed by specialized neural circuitry within the auditory brainstem
(Tollin & Yin,
2005; Joris & van der
Heijden, 2019).To summarize this viewpoint, Auditory Scene Analysis entails canceling and/or
ignoring irrelevant features of the sensory input, and matching the remainder to an
internal model to produce a reliable percept. The process draws on spectro-temporal
analysis within the cochlea, complemented by neural time-domain signal processing
within the brain, to provide the brain with a rich—albeit incomplete—representation
within which a target can be “glimpsed.” The glimpses are then interpreted according
to a Helmholtzian inference process.The remainder of this paper asks whether this process can be extended to include, as
a cue, the harmonic (periodic) structure of interference such as a competing talker.
So-called “double-vowel” experiments found that vowels mixed in pairs are easier to
identify if their fundamental frequencies (F0s) differ (Brokx & Nooteboom, 1982; McKeown, 1992; Culling & Darwin,
1993; Assmann & Summerfield,
1994), suggesting
that harmonic structure somehow assists segregation. Furthermore, it appears that
this effect is driven mainly by the harmonicity of the background,
for example, the competing vowel (Lea, 1992; Summerfield & Culling, 1992; de Cheveigné et al., 1997). This is the
harmonic cancellation hypothesis.To set the stage, I assume a “segregation module” that works hand in hand with a
“pattern-matching” module (Figure 1). The segregated sensory pattern (dark red arrow) is
accompanied by a “reliability mask” (gray arrow) to assist matching of a pattern
that is incomplete or distorted by the segregation process. Sensory representations
might consist of a spectral profile (e.g., place-rate representation), or a
temporal, or place-time pattern. Examples of the latter are a matrix of
autocorrelation functions (ACFs), one per channel (autocorrelogram), or the sum over
channels of these ACFs (summary ACF, SACF) (Licklider, 1959; Lyon, 1984; Meddis & Hewitt, 1992). The flow of
sensory information in this figure is purely bottom-up: the only top-down influence
is attentional control (dotted arrow). Top-down transfer of a sensory-like pattern
is also conceivable (“schema-driven” segregation), but not considered here.
Figure 1.
Segregation and matching. Sensory input is stripped of correlates of
interfering sources, and the selected pattern, possibly incomplete, is
passed on for pattern-matching (or model-fitting), together with a mask that
indicates which parts are missing or unreliable. Initial stages are under
attentional control.
Segregation and matching. Sensory input is stripped of correlates of
interfering sources, and the selected pattern, possibly incomplete, is
passed on for pattern-matching (or model-fitting), together with a mask that
indicates which parts are missing or unreliable. Initial stages are under
attentional control.We want to know whether harmonic cancellation is instantiated in the auditory system,
but it is often easier to reason in terms of the acoustic waveform, for clarity and
to distinguish theoretical from implementation limits: if a principle fails in
abstract terms, consideration of biological constraints is premature. That said,
references to “cochlear filtering” or “neural processing” will sometimes creep into
the discussion without warning. I beg your patience when this occurs.
Harmonic Cancellation—Possible Mechanisms
How might harmonic cancellation be implemented? This section investigates several
hypotheses, including frequency-domain, time-domain, and hybrid models. A later
section will ask which—if any—is used by the auditory system. The busy reader might
want to read about frequency domain and time
domain models, then skip to the Psychophysics section and come back for
details as needed. There are also interesting things to be found in the
Appendix.
Frequency Domain
Conceptually, harmonic cancellation is straightforward: just zero all spectral
components at multiples of
, where
is the period of the background, as shown in Figure 2 (Parsons, 1976; Stubbs &
Summerfield, 1988). Target components emerge intact (right panel), except in the
event, vanishingly unlikely in this idealized world, that a target component
falls on the harmonic series of the background.
Figure 2.
Harmonic cancellation in the idealized frequency domain. Left: line
spectra of a “target” sound (red) and a “background” (blue). Next to
left: mixture. Next to right: harmonic mask with zeros at all harmonics
of background. Right: recovered target.
Harmonic cancellation in the idealized frequency domain. Left: line
spectra of a “target” sound (red) and a “background” (blue). Next to
left: mixture. Next to right: harmonic mask with zeros at all harmonics
of background. Right: recovered target.A practical implementation, however, needs to deal with two issues: one is
limited frequency resolution of the spectral representation, the other is the
spectral widening expected when analyzing a time-limited or otherwise
non-stationary signal. Figure 3(a) shows short-term amplitude spectra of two harmonic
sounds, a 200 Hz “background” with a flat spectral envelope (blue), and a weaker
238 Hz “target” with a broad peak centered at 1 kHz (red).
Figure 3.
Harmonic cancellation in the frequency domain using a short-term Fourier
representation, or a filter bank. (a) 238 Hz target (red) and 200 Hz
background (blue) analysed by a filter bank with 100 Hz resolution, (b)
mixture, (c) harmonic mask, (d) target recovered from mixture (green),
and same in the absence of the background (thin red), (e) same analysis
but using a filter bank with non-uniform frequency resolution. Filter
bandwidth depends on center frequency (CF) according to estimates of
cochlear frequency resolution from Moore and Glasberg 1983 as
implemented by Slaney (1993).
Harmonic cancellation in the frequency domain using a short-term Fourier
representation, or a filter bank. (a) 238 Hz target (red) and 200 Hz
background (blue) analysed by a filter bank with 100 Hz resolution, (b)
mixture, (c) harmonic mask, (d) target recovered from mixture (green),
and same in the absence of the background (thin red), (e) same analysis
but using a filter bank with non-uniform frequency resolution. Filter
bandwidth depends on center frequency (CF) according to estimates of
cochlear frequency resolution from Moore and Glasberg 1983 as
implemented by Slaney (1993).This spectral transform has limited frequency resolution (or, equivalently,
infinite resolution but the signals are time-limited, in this case eight cycles
of a 200 Hz fundamental, shaped with a Hanning window). When target and masker
are mixed, here with a target-to-masker ratio (TMR) of
12 dB, the spectrum of the mix (Figure 3(b), black) is almost entirely
dominated by the background (Figure 3(a), blue). This differs radically from the idealized
picture of Figure 2.If we multiply the spectrum of the mix with a harmonic mask with zeros at the
harmonics of the background (Figure 3(c)), we obtain a “recovered” spectral pattern (d, green)
very different from the true target (a, red). Two terms contribute to this
difference. One is multiplicative distortion from the masking procedure (compare
d, red to a, red), the other is additive distortion due to the incompletely
canceled background (compare d, green to d, red). The former can, in principle,
be taken into account by a pattern-matching stage if it has access to the nature
of that distortion, for example, via the gray arrow in Figure 1. The latter is more serious
because it is unknown and cannot be compensated for, and because it implies that
we miss our goal of invariance with respect to the background. The shape of the
harmonic mask (Figure 3(c)) affects the balance between error terms but a different
mask would not yield a radically different result. The contrast between Figure 2 (conceptual
model) and Figure 3
(feasible implementation) is sobering.Spectral resolution is critical. Cochlear filters are narrower, on a linear
frequency scale, at low than at high CFs (Figure 3(e)). From this figure, it would
seem that low-frequency target features might be recovered, but perhaps not
high-frequency (compare green and thin red). This illustration used a bank of
gammatone filters (Slaney, 1993) with equivalent rectangular bandwidths (ERBs) from
psychophysical estimates (Moore & Glasberg, 1983). If cochlear filters were
narrower (e.g., Shera et
al., 2002; Sumner
et al., 2018) a wider frequency range might be recoverable (not shown), but
resolution would still be limited if the stimulus were short or
non-stationary.In summary, frequency-domain cancellation requires (a) a spectral representation
with resolution sufficient to cancel background partials while retaining enough
of the target to support pattern matching, (b) an estimate of the background
period
, and (c) a pattern-matching process that tolerates distortion
of target spectral patterns. How to estimate the background period is discussed
in the Appendix (Period Estimation).
Time Domain
Harmonic cancellation can also be implemented in the time domain by a simple
filter with impulse response
where
is the period of the interfering sound and
is the Kronecker delta function translated to
(Figure 4(a), left). The filtered version of a signal
is simply
. The magnitude transfer function of this filter has deep dips
at all harmonics of 1/
(Figure 4(a), right).
Figure 4.
Harmonic cancellation in the time domain. (a) Impulse response of the
cancellation filter (left) and corresponding magnitude transfer function
(right). (b) Input (left) and output (right) of the cancellation filter
for the background 100 Hz vowel /a/ (top), target 132 Hz vowel /e/
(middle), and mixture at TMR=
12 dB (bottom). (c) Schematic diagram of a circuit
implementing the cancellation filter (Equation (1)) (left) and neural circuit with similar function (right).
A spike on the direct pathway (black) is transmitted unless it coincides
with a spike on the delayed pathway (red). The delay can be applied to
the positive/excitatory input, instead of negative/inhibitory, with
equivalent results.
Harmonic cancellation in the time domain. (a) Impulse response of the
cancellation filter (left) and corresponding magnitude transfer function
(right). (b) Input (left) and output (right) of the cancellation filter
for the background 100 Hz vowel /a/ (top), target 132 Hz vowel /e/
(middle), and mixture at TMR=
12 dB (bottom). (c) Schematic diagram of a circuit
implementing the cancellation filter (Equation (1)) (left) and neural circuit with similar function (right).
A spike on the direct pathway (black) is transmitted unless it coincides
with a spike on the delayed pathway (red). The delay can be applied to
the positive/excitatory input, instead of negative/inhibitory, with
equivalent results.Figure 4(b) shows a
background vowel stimulus /a/ with fundamental 100 Hz (top), a weaker target
vowel /i/ with fundamental 132 Hz (middle), and their mixture (bottom), before
(left) and after (right) filtering with a cancellation filter with lag
equal to the period of the background vowel. The response
consists of initial and final one-period glitches, separated by a short
steady-state portion, in red. The steady-state portion is zero
for the background (top). For the target, it is a distorted version of the
target waveform (compare middle right, red, to middle left). For the mixture, it
is the same as for the target alone (compare middle right, red, to bottom right,
red). In other words, this part of the pattern is invariant with respect
to the presence of a background of period
, which is what we need. This contrasts with frequency-domain
cancellation for which none of the recovered pattern was
background-invariant.In summary, time-domain cancellation requires (a) a time-domain signal
representation such that Equation (1) can be implemented, (b) an
estimate of the background period
(see Appendix, Period Estimation), (c) a
pattern matching process capable of selecting the intervals of perfect
cancellation, and compensating for distortion of the target within these
intervals.
Hybrid Models
A hybrid model combines spectral and temporal processing, for example, cochlear
filter bank analysis followed by time-domain harmonic cancellation within the
brainstem. There is a rich literature based on this idea for the purpose of
auditory modeling and sound processing applications (e.g., Lyon, 1983, 1988; Weintraub, 1985; Meddis & Hewitt, 1992; Assmann &
Summerfield, 1990). A benefit of the filter bank is that TMR varies across
channels, some favoring the target and others the background (Figure 5(a)), which may
be useful if the dynamic range of temporal processing is limited.
Figure 5.
(a) TMR within each channel of a model cochlear filter bank for an input
consisting of a 124 Hz harmonic target mixed with a 100 Hz harmonic
background with overall TMR=0 dB (black),
12 dB (dotted blue), or +12 dB (dotted red). Thanks to
the filter bank, the TMR is enhanced in certain channels within which
the target can be “glimpsed.”(b) Linear operations can be swapped.
Filtering the signal before the filter bank is equivalent to applying
the same filter to each channel after the filter bank.
(a) TMR within each channel of a model cochlear filter bank for an input
consisting of a 124 Hz harmonic target mixed with a 100 Hz harmonic
background with overall TMR=0 dB (black),
12 dB (dotted blue), or +12 dB (dotted red). Thanks to
the filter bank, the TMR is enhanced in certain channels within which
the target can be “glimpsed.”(b) Linear operations can be swapped.
Filtering the signal before the filter bank is equivalent to applying
the same filter to each channel after the filter bank.It is worth remembering that linear, time-invariant operators can be swapped: a
time-domain cancellation filter applied to the acoustic waveform can instead be
applied to each channel after filtering: the result is the same (Figure 5(b)). Cochlear
filtering and transduction are both non-linear and non-stationary (e.g.,
adaptation), but the “equivalence” of Figure 5(b) may nonetheless be useful
conceptually. I review briefly here a selection of hybrid schemes for harmonic
cancellation, described in detail in the Appendix (Hybrid
Models). In brief: These examples illustrate how peripheral filtering and temporal
processing might work hand-in-hand to enhance a spectral model (Hybrid Model 1)
or a temporal model (Hybrid Models 2–6) of harmonic cancellation. To summarize,
a wide variety of mechanisms can implement harmonic cancellation: spectral,
temporal, and hybrid.Hybrid Model 1: Cancellation-enhanced spectral
patterns. A time-domain cancellation filter is applied
to each channel of the cochlear filter bank, resulting in are
cleaner spectral patterns for pattern matching.Hybrid Model 2: Channel rejection on the basis of
periodicity. Channels dominated by the background
periodicity are discarded, and the remaining channels are used to
form a time-domain pattern for pattern matching, as in the
concurrent vowel identification model of Meddis and Hewitt (1992).Hybrid Model 3: Cancellation filtering of selected
channels. As in Hybrid Model 2, channels dominated by
the background are discarded, and channels dominated by the target
are left intact. In contrast to Hybrid Model 2, channels with
intermediate TMR are processed by a cancellation filter. The result
is used for time-domain pattern matching.Hybrid Model 4: Channel-specific cancellation
filter. The parameter
of the cancellation filter can differ between
channels, in contrast to other models that use the same
for all channels. The result is used for
time-domain pattern matching.Hybrid Model 5: Synthetic delays. The “synthetic
delay” mechanism of de Cheveigné and Pressnitzer (2006) is
used to implement the relatively long delays
required by the temporal model of harmonic
cancellation. The result is used for time-domain pattern
matching.Hybrid Model 6: Logan’s theorem. This is not a
specific model but a processing principle. A narrowband signal can
be reconstructed perfectly from its zero crossings (and hence also
from its half-wave rectified version) (Logan, 1977).
This implies that, despite the non-linearities, the temporal model
can be implemented after transduction as if it were applied to the
acoustic waveform (the theorem does not say how).
Alternatives to Harmonic Cancellation
It is important to consider alternatives: to the extent that they are viable, the
case for harmonic cancellation is weaker. Other aspects of the spectral
structure of the target or background might support segregation, even in
situations that seem to implicate harmonic cancellation.
Harmonic Enhancement
According to this hypothesis, the harmonic structure of a target
sound allows its extraction from a background. The idea is attractive: it fits
with the Auditory Scene Analysis credo that components of a sound are “grouped”
together, here on the basis of harmonicity, to form a coherent “object” that can
be distinguished from other parts of the scene (Bregman, 1990). It is satisfying to hypothesize
that voiced speech might be “engineered” for this purpose through evolution
(e.g., Popham et
al., 2018).The mechanisms just reviewed can be re-purposed for enhancement. For example, the
mask in Figure 2 can be
made to select target harmonics rather than reject background
harmonics. Likewise, replacing the minus by a plus in Equation (1), and
setting
to the period of the target, yields a harmonic
enhancement filter:
Enhancement and cancellation seem symmetric one of the other, but
they have rather different properties. Enhancement requires the period of the
target, but this is hard to estimate when TMR is small, which is unfortunately
when segregation is most necessary. Cancellation works well in that situation.
An enhancement filter provides only a limited boost in TMR (6 dB for the simple
filter of Equation (2)) in contrast to cancellation
that can reject the masker perfectly, at least in principle. A larger boost
would require a longer impulse response (as explained in Appendix A of de Cheveigné, 1993, courtesy of Jean
Laroche), but this might not be practical for a non-stationary signal such as
speech. Anticipating, behavioral results also don’t favor the enhancement
hypothesis.Incidentally, the term “harmonic enhancement” appears in other contexts with a
different meaning: perceptual enhancement of one harmonic of a complex when it
is turned on or off (e.g., Hartmann & Goupell, 2006). Hopefully no confusion will
result from this overloading of the terminology.
Spectral Glimpsing
Between the lines of a harmonic spectrum are gaps where target components might
be glimpsed (Deroche et
al., 2013; Guest &
Oxenham, 2019), and this might conceivably account for the benefit observed
when a background is harmonic rather than inharmonic. Figure 5(a) shows how individual
channels in the low-frequency region can preferentially reflect one source or
the other, as long as partials are not too close. The spectral-glimpsing
hypothesis glosses over the question of how target channels are distinguished
from background channels. In that, it differs from Hybrid Model 2 above.
Waveform Interactions
The sinusoidal waveforms of two or more partials can interact within a channel of
a filter bank to produce a complex “beat” pattern. This can occur between
partials of the same sound (with a rate equal to the fundamental if the sound is
harmonic) or partials of different sounds. The patterns that result are quite
diverse (static summation, slow fluctuations, rapid beats, etc.), and they
depend in a complex way on several parameters (frequencies, levels, filter
shapes). The “waveform interactions” hypothesis is thus ill-defined unless
further specified.From slow to fast: phase-dependent summation of same-frequency
partials constitutes a potential confound in experiments that include a “zero
F
” condition (de Cheveigné, 1999c). Slow beats
between closely-spaced partials from different sounds cause the short-term
spectrum to cycle between shapes that might favor perception of one or the other
sound, either because it momentarily resembles that of one of the sounds in
isolation, or because temporal contrast effects enhance important spectral
features (Summerfield et
al., 1981; Assmann
& Summerfield, 1994; Culling & Darwin, 1994). Faster
beats might evoke a sensation of roughness signaling the presence
of a target (Treurniet
& Boucher, 2001), or the spectral location of such beats might provide cues to
its spectral features (e.g., the location of a formant peak, or the boundary
between formants of different sounds). Conversely, the lack of
beats at a rate slower than F
(or the perceptual correlate of this lack, “smoothness”) could
signal the absence of a target, or the spectral location of channels dominated
by harmonics of a single sound. Finally, the absence of any modulation at
F
implies that the channel is dominated by a single partial, as
in the phenomenon of “synchrony capture” which might signal the position of a
formant peak of a successfully isolated sound (Carney et al., 2015; Maxwell et al., 2020).Interaction of more than two harmonics produces a phase-dependent beat pattern
that is more deeply sculpted for certain phase relations, such as cosine, or
“Klatt” phase that approximates natural phonation with a glottal pulse within
each period. Valleys between pulses might then allow a target to be glimpsed for
a favorable alignment, as might occur if sounds of different F
are mixed (the pitch period asynchrony hypothesis, PPA, Summerfield &
Assmann, 1991).Beat patterns might be exploited to group channels by correlation (Hall et al., 1984; Sinex et al., 2002; Sinex & Li, 2007; Fishman &
Steinschneider, 2010; Shamma
et al., 2011) or, alternatively, beat rates in the F
range might be compared across channels (Roberts & Bregman, 1991; Treurniet &
Boucher, 2001; Roberts
& Brunstrom, 2003). This requires the existence of
some mechanism to analyze beat patterns and quantify their rates (see
Modulation Filter Bank below).Beat amplitude depends non-monotonically on the amplitude of sources within the
stimulus, and the shape of the beat pattern is phase-dependent (for three or
more partials). Beat rate affects perceptual salience (e.g., roughness)
non-monotonically, and the rate itself may depend non-monotonically on
F
difference, depending on which partials happen to be close.
Finally, each channel has its own pattern of beats. For these reasons, a
“waveform interaction hypothesis” is hard to delineate and test (which does not
imply that it is incorrect).
Modulation Filter Bank
An influential idea is that cochlear filtering and transduction are followed by
analysis by a modulation filter bank within the auditory system
(Kay &
Matthews, 1972; Viemeister, 1979; Dau et
al., 1997; Joris et
al., 2004; Stein et
al., 2005; Jepsen
et al., 2008). Conceptually, this seems rather like reproducing internally
an operation (spectral analysis) that is already carried out in the cochlea. A
major difference, however, is that it occurs after demodulation
of each output of the peripheral filter bank (non-linearity followed by
smoothing), which makes it primarily sensitive to features of the waveform
envelope, and less sensitive to carrier phase. The concept makes most sense when
applied to slow fluctuations (e.g., below
30 Hz), but models have been proposed with channels up to
500 Hz, capitalizing on the smooth transition between neural
coding of fine structure at low frequencies and of envelope at higher
frequencies (Joris et
al., 2004). A modulation filter bank applied to each peripheral channel
results in a center frequency
best modulation frequency pattern that can be collapsed across
channels to obtain a “summary modulation spectrum.” One could imagine a
frequency-domain harmonic cancellation model applied to this “internal
spectrum.” However, most estimates of modulation filter width are rather wide
(quality factor
1), which makes this idea unlikely to work given the issues
mentioned earlier.Alternatively, the 2D pattern could be used to tag channels for the purpose of
segregation (Ewert &
Dau, 2000; Meyer et
al., 1997). One might consider implementing this modulation filter bank
using cancellation filters, which would result in a model similar to the hybrid
models reviewed previously, a major difference being the demodulation step which
renders the model sensitive to envelope periodicity rather than (or in addition
to) waveform periodicity.
In Summary
Multiple models have been put forward to explain how the harmonic structure of
sounds within an acoustic scene can be used to analyze the scene and attend to
particular sources. Some fit the definition of harmonic cancellation, others do
not. The next section reviews psychophysical evidence in favor—or against—this
hypothesis and its alternatives.
Psychophysics
Detection Benefits from ΔF0
When presented with a mixture of two vowels, subjects more often report that they
hear two vowels if the F0s differ (de Cheveigné et al., 1997; Arehart et al., 2005, 2011; McPherson et al.,
2020).
Likewise, when presented with a harmonic tone with one partial mistuned, they
may detect the partial as “standing out” as a separate sound (Moore et al., 1985, 1986). Such a
mistuned target tone can be detected at
15 dB relative to a harmonic masker, whereas against a noise
background the threshold is
15 dB higher (Micheyl et al., 2006). In each of these examples,
background harmonicity seems to affect how many sources are heard. An
interpretation, in the context of harmonic cancellation, is that a single entity
is perceived if cancellation is perfect, and multiple entities if it leaves a
residual.
Discrimination and Identification Benefit from
F
Mistuning one partial of a harmonic complex allows it to be matched to a pure
tone (Hartmann et
al., 1990), implying not only that this “second sound” is detectable, but
also that its frequency can be accessed. Subjects are more likely to identify
both vowels of a concurrent pair if their
s differ (Brokx & Nooteboom, 1982; Scheffers, 1983; Zwicker, 1984; Summerfield & Assmann, 1991; McKeown, 1992; Chalikia &
Bregman, 1993; Culling &
Darwin, 1993; Assmann
& Summerfield, 1994; Shackleton et al., 1994; Arehart et al., 2011). The pattern of
results is similar across studies: poor performance (albeit well above chance)
for
F
=0, rapid improvement up to about one semitone, followed by a
plateau and possibly a dip at the octave. To create the
F
=0 condition with continuous speech, the voices must be
re-synthesized on a monotone, or one voice given the same F
track as the other, so that
F
s remain the same throughout the presentation. With that
manipulation, a similar benefit of non-zero
F
is obtained (Brokx & Nooteboom, 1982; Leclère et al., 2017).Improved performance with
F
0 is taken to reflect a harmonicity-based segregation mechanism
that fails when F0s are the same, and indeed, identification is
less good if both voices are whispered (Lea, 1992), or inharmonic (de Cheveigné et al.,
1997). This
brings up the question as to whether each voice benefits from its harmonic
structure, that of its competitor, or both. To answer that question, voices must
be parametrized individually, and responses tallied separately. It cannot be
answered if the performance metric is “both correct” (Brokx & Nooteboom, 1982; Scheffers, 1983; Summerfield &
Assmann, 1991), or if both voices are made inharmonic at the same time (Popham et al., 2018).
Background Harmonicity is Important
In “double vowel experiments,” listeners give two answers on each trial, but it
has been noted that one constituent (the “dominant” vowel) is usually identified
regardless of
F
, whereas identification of the other depends on
F
(Zwicker, 1984; McKeown, 1992; McKeown
& Patterson, 1995). “Dominance” is phoneme- and
subject-dependent, but this can be overridden by changing the relative level of
the vowels, in which case the
F
effects are mainly observed for the weaker
(smaller amplitude) vowel (McKeown, 1992; de
Cheveigné et al., 1997; Arehart et al., 2005). This is congruent with the
harmonic cancellation hypothesis, in that estimation of the harmonic structure
of the background should be easy when the target is weak. However, it could also
simply result from a reduced ceiling effect for the more challenging, weaker
vowel.With the
F
0 condition as a starting point, performance degrades if the
competing vowel is whispered (Lea, 1992) or made inharmonic (de Cheveigné et al.,
1997),
regardless of whether the target is harmonic or not. This too is consistent with
the harmonic cancellation hypothesis. Similar results are reported for connected
speech: Steinmetzger and Rosen (2015) found that speech reception
thresholds (SRTs) were up to 11 dB lower for periodic than aperiodic maskers,
while Deroche et al.
(2014b) reported a 4 dB elevation in SRT for inharmonic versus
harmonic maskers. Incorporating harmonic cancellation within a predictive model
of speech intelligibility improved its fit to experimental data (Prud’homme et al.,
2020).Gockel et al. (2002)
found that the threshold for detecting noise in a harmonic masker was 11–14 dB
lower than the converse, and Gockel et al. (2003) found a similar
result for loudness. This suggests that a harmonic masker might be less potent
than a noise masker, as expected from harmonic cancellation. As mentioned
earlier, Micheyl et al.
(2006) found that a harmonic complex tone (HCT) was easier to detect
within a background consisting of another HCT than within noise, and Klinge et al. (2011)
found a lower threshold for detection of a tone embedded in (but mistuned from)
a harmonic rather than inharmonic or noise background (see also Oh & Lutfi, 2000).All these results are consistent with harmonic cancellation. However, harmonic
cancellation is not exclusive of other mechanisms, and one might expect the
auditory system to use several or all if they are effective. The next section
reviews evidence for harmonic enhancement.
Target Harmonicity is Less Important
The idea that harmonicity ensures that a sound does not “fall apart into a sea of
individual harmonics” is seducing (Popham et al., 2018), but studies that tried to
demonstrate an advantage of target harmonicity for segregation have met with
mixed results. As noted earlier, in double-vowel experiments the benefit of a
F
is greatest for weak targets, and measurable for TMR as low as
25 dB (McKeown, 1992; de
Cheveigné et al., 1997; Arehart et al., 2005). Estimating the
F
of a target that weak would be challenging. Replacing a voiced
target by a whispered target does not impair intelligibility, regardless of
whether the competitor is voiced or whispered (Lea, 1992), nor does randomly perturbing
its harmonics to make it inharmonic (de Cheveigné et al., 1997). Modulating the
F
of target speech in the presence of reverberation disrupts its
periodicity, but Culling et
al. (1994) found no effect on SRTs (see also Deroche & Culling, 2011b).For continuous speech, it has been hypothesized that target harmonicity (one
aspect of “temporal fine structure,” TFS) could aid glimpsing within a
spectro-temporally modulated noise, by tagging time–frequency regions that are
voiced. However, a direct test of this hypothesis gave negative results (Shen & Pearson,
2019). There is
however some evidence that continuity of target F
helps to connect information over time, or reduce
informational masking if target and masker F
ranges are non-overlapping (Darwin & Bethell-Fox, 1977).A difficulty in testing the enhancement hypothesis is that manipulation of the
target might affect its intelligibility independently of any segregation effect.
Whispered speech is reportedly less intelligible than voiced speech (Ruggles et al., 2014), and
reverberation, which disrupts harmonicity of an intonated target, also degrades
intelligibility (Deroche
& Culling, 2011b). Manipulating F
(monotonizing, transposing, or inverting the
F
track) may also affect intrinsic intelligibility (Binns & Culling,
2007; Deroche et al., 2014a; Guest & Oxenham,
2019). Such
effects might conceivably offset the benefits of harmonic enhancement, making
them unmeasurable, so the best we can say is that we lack strong evidence in
favor of harmonic enhancement.
An Intriguing Exception: Target Pitch
In contrast to results just reviewed, a target within a noise background
is easier to detect if it is harmonic than inharmonic
(McPherson et
al., 2020). This inconsistency is resolved if we reflect that a harmonic
target is likely detected in noise on the basis of its pitch (Scheffers, 1984; Hafter & Saberi,
2001; Gockel et al., 2006), which is
probably more salient if the sound is harmonic. If frequency discrimination in
noise relies on a pitch percept, it too should benefit from target harmonicity,
as found by McPherson et
al. (2020). Thus, we cannot with confidence attribute such benefits
to enhanced segregation as opposed to an enhanced pitch
percept.It is also intriguing that the pitch of a target is easier to discriminate if
mixed with a noise background rather than a harmonic background (Micheyl et al., 2006), opposite to
what we expect of harmonic cancellation (indeed, in that study the same sounds
were easier to detect within a harmonic background than a noise
background). It would seem that background harmonicity interferes with target
pitch, possibly in a way similar to the phenomenon of pitch discrimination
interference (PDI) (Gockel
et al., 2009; Micheyl
et al., 2010). That interference is not absolute: the pitch of a mistuned
partial may be heard within a harmonic background (Hartmann et al., 1990; Hartmann & Doty, 1996), and individual
tones may be heard within a chord (Graves & Oxenham, 2019), consistent with
skills found in competent musicians.
Is the Benefit Explained by Spectral Glimpsing?
Several results seem consistent with this hypothesis. The benefit of
F
to vowel identification is mainly limited to the region of
resolved partials (Culling
& Darwin, 1993), and it improves with a higher background F
at which partials are more widely spaced (Deroche et al., 2013, 2014a). Guest and
Oxenham (2019) found
that removing the even harmonics of a masker reduced masking of a target placed
one octave above, also consistent with glimpsing within the large gaps between
background partials of odd rank.However, Deroche et al.
(2013, 2014a, 2014b) argued that the larger gaps that arise when a masker is made
inharmonic should reduce masking, contrary to their results. A
possible explanation is that cancellation and glimpsing are both involved (Deroche et al., 2014b), consistent
with Hybrid Models 2 or 3.
Is the Benefit Explained by Waveform Interactions?
As pointed out earlier, waveform interaction comes in multiple forms, and it is
not always clear which version of the hypothesis is implied when it is invoked.
One difficulty, common to many versions, is that the non-monotonic dependency of
beat amplitude on component amplitudes implies that the magnitude (and spectral
locus) of beat-dependent cues should show non-monotonic variations with level,
whereas identification usually varies monotonically with TMR. Another challenge
is that F
-based segregation seems to benefit mostly partials of low
rank, for which, thanks to resolvability, the distribution over channels of
high-amplitude beats is likely sparse (Deroche et al., 2014).Phase effects attributable to PPA were found at 50 Hz, but not at 100 Hz or
higher (Summerfield &
Assmann, 1991; de
Cheveigné et al., 1997; Deroche et al., 2013, 2014; Green & Rosen, 2013, but see Summers & Leek
1998). Furthermore, reverberation should scramble the phase relations
required by PPA, whereas it does not affect segregation unless
F
is modulated (Culling et al., 1994, 2003; Deroche & Culling, 2011b).Culling and Darwin (1994) attributed effects of small
F
to the ability to shop for favorable spectral patterns among
those offered by slow beats. Random starting phase should reduce this benefit
due to the haphazard temporal alignment of beat patterns, but, de Cheveigné et al.
(1997) found that the
F
benefit did not depend on the phase pattern (random vs sine)
of either target or background. The slow-beat hypothesis was further tested by
de Cheveigné
(1999c), again with limited support. The reader should refer to those
two papers for a detailed discussion of several forms of the waveform
interactions hypothesis. Given the diversity, it is hard to rule out that some
form of waveform interaction contributes to segregation. Indeed, harmonic
cancellation itself could be construed as a mechanism to exploit a particular
form of waveform interaction specific to harmonically-related partials.
The Special Case of Maskers With Frequency-Shifted or Odd-Order
Harmonics
In experiments that require detecting (or matching the pitch of) a mistuned
partial of rank
within a harmonic complex of fundamental F
, the subject likely attends to channels with a center
frequency close to
. The task might then be hampered by the presence, within those
channels, of neighboring harmonics, in particular harmonics of rank
and
. A cancellation filter tuned to F
would suppress those unwanted harmonics, but it would also
suppress the target unless it is mistuned. We would thus expect
performance to improve with mistuning, as indeed is observed (Moore et al., 1986; Hartmann et al., 1990).However, Roberts and Brunstrom (1998) found a similar result when the
background series had been made inharmonic by shifting all partials by the same
amount
, in which case partials are regularly spaced but harmonicity
is disrupted. This suggests that spectral regularity, rather
than harmonicity, might be the driving factor, which would put in doubt the
harmonic cancellation account. However, that proposal hinges on the existence of
a mechanism to detect spectral regularity: Roberts and Brunstrom (2001) doubted the
existence of a dictionary of shifted-harmonic templates.An alternative is that harmonic cancellation is applied locally
within peripheral channels, for example based on Hybrid Model 4 (analogous to
what has been proposed for the binaural EC model, Culling & Summerfield, 1994; Akeroyd, 2004; Breebaart et al.,
2001).
Specifically: the shifted partials
and
can be approximated with harmonics of rank
and
of a harmonic series of fundamental
. A cancellation filter tuned to that series would
approximately cancel the closest offending background partials (more distant
ones are attenuated by cochlear filtering). The
th zero of that filter falls at
, that is, it fits the “spectral regularity” template invoked
by Roberts and Brunstrom (1998), which would explain why they found that “mistuning” a partial
from that position makes it easier to detect or match. An array of such
CF-dependent cancellation filters, each tuned to an “equivalent
F
” equal to
would attenuate a shifted-harmonic complex across all
channels, allowing “mistuning” relative to that spectrally regular (but
inharmonic) pattern to be detected.This reasoning can be extended to the case of a background harmonic complex with
only odd harmonics of F
, as it is equivalent to a series of harmonics of
2F
each shifted by
. This series can be canceled perfectly by a
cancellation filter tuned to F
, or approximately, within each peripheral
channel, by a cancellation filter tuned near 2F
as just described. The reason for considering the latter is
that it requires a shorter delay, which is relevant if there is a penalty on
longer delays as has been suggested in the context of pitch perception (Moore, 2003; de Cheveigné &
Pressnitzer, 2006; Bernstein
& Oxenham, 2008). An array of cancellation filters, each tuned to
, would spare anything that does not fit the series of odd
harmonics, in particular an even-numbered harmonic. If so, it might explain why
a single even-numbered harmonic embedded among odd-numbered harmonics is “heard
out” more easily than any of the odd-numbered partials (Roberts & Bregman, 1991), and similar
explanation might underlie the benefit for identification of a speech target of
removing even harmonics of the masker (Guest & Oxenham, 2019) mentioned
earlier. This question is revisited in the Discussion.A body of evidence agrees with the hypothesis that harmonic cancellation assists
auditory scene analysis, complementing the well-known benefits of peripheral
frequency analysis. Dissenting results are sparse. The alternative hypothesis of
harmonic enhancement, while attractive, garners little experimental support.
Harmonic cancellation raises a number of issues that are discussed further in
the Appendix. These include period estimation (necessary to
apply cancellation), the relations between correlation and
cancellation, analogies with the well-known EC
model of Durlach, pattern matching with missing
data, potential anatomical and physiological substrates, and
the possible synergy between cochlear filtering and
neural filtering.
Discussion
Periodicity (or harmonicity)—and its perceptual correlate, pitch—have long captured
the attention and imagination of thinkers and scientists (Micheyl & Oxenham, 2010). A periodic sound
within the right parameter range evokes a salient percept that is long-lasting in
memory (McPherson et
al., 2020),
is robust to masking by noise (Hafter & Saberi, 2001; McPherson et al., 2020), and supports fine discrimination
(e.g., Micheyl &
Oxenham, 2010). However, the idea that a sound “falls apart” unless it is
harmonic does not withstand a bit of reflection. A one-period tone pulse seems
unitary without the aid of harmonicity, meaningless at that duration. A harmonic
tone of longer duration may sound unitary, but so does noise which lacks
harmonicity. An alternative proposition is that the percept evoked by a sound is
unitary by default, and that “multiplicity” is inferred from the accumulation of
evidence in favor of additional sources. A complex with a mistuned harmonic
initially sounds like a single object but, given time and encouragement, a subject
might detect something amiss and interpret it as an additional source. The process
requires time (Moore et
al., 1985;
Hartmann et al.,
1990; McKeown & Patterson,
1995), and is
harder if the background is made inharmonic (Roberts & Brunstrom, 2003; Roberts & Holmes,
2006). Thus, one
could argue, the harmonic nature of one part of the stimulus makes it easier to
detect the presence of other parts. From this perspective, harmonicity of a source
may contribute to a percept of multiplicity for mixtures in which
it participates, rather than to its own unity.That background harmonicity is crucial comes as a surprise, as it suggests that
segregation must rely on an adventitious quality of the environment. Also surprising
is that target harmonicity has only a minor role, as it goes against the attractive
idea that communication sounds are “engineered” through evolution to be harmonic for
resilience. It does make sense, however, when one realizes that cancellation works
well (and enhancement poorly) at low TMR, which is when segregation is most needed.
Infinite TMR improvement can be achieved, in principle, for very short stimuli for
which enhancement offers more limited benefit. Cancellation meshes well with the
concept that perception involves a quest for invariance to irrelevant
dimensions.
Cancellation as a Model of Sound
The ability to cancel unwanted sounds is clearly useful for perception, but one
might take a step further and argue that it is, in part,
constitutive of perception. As a predictive model, a
harmonic cancellation filter characterizes the part of input that it can cancel,
just as an autoregressive model characterizes its spectral envelope, or a
binaural EC model its spatial position. The residual, which by definition does
not fit that model, informs us about “what else is out there.” It too can be
characterized by recursively applying the same model or, alternatively, a
compound model can be applied to the original sound to estimate parameters
jointly (as in the multiple F
model described in the Appendix, Period
Estimation). This is related to concepts of predictive coding
(Friston, 2018) and compression
(Schmidhuber,
2009).Like pattern classification (Duda et al., 2012), cancellation seeks invariance with respect to irrelevant
dimensions of the input, specifically those that reflect the background. In
contrast to classifiers that involve non-linear transforms, cancellation as
described here is purely linear, which makes sense given that the acoustic
mixing process itself is linear.
How Useful is it in Practice?
Auditory Scene Analysis benefits from multiple cues and regularities, of which
harmonicity is but one. Harmonic cancellation is likely to be useful in
situations where neither temporal separation, nor spectral separation, nor
binaural disparities are effective to suppress interfering sources, and then
only if the interference is harmonic. Thus, at best, it is one tool among many,
beneficial in a restricted set of circumstances.Measured in terms of TMR at threshold performance, the harmonicity benefit can
reach
17 dB for identifying synthetic vowels, although most studies
report smaller effects (Summerfield et al., 1992; Culling et al., 1994; de Cheveigné et al., 1997). This is of the
same order of magnitude as reported for binaural unmasking (Colburn & Durlach,
1965; Jelfs et al., 2011). In terms of
proportion of tokens recognized, the benefit appears maximal for TMR around
15 dB and vanishes below
30 dB or above +15 dB (McKeown, 1992; de Cheveigné et al., 1997; de Cheveigné, 1999b). Thanks in part
to harmonicity-based segregation, a target (wide-band harmonic or noise) mixed
with a harmonic background can be detected at TMRs down to
20 dB (Gockel et al., 2002; Micheyl
et al., 2006), or
32 dB for a narrowband noise target (Deroche & Culling, 2011a). The benefit
relative to a noise or inharmonic masker is on the order of 5–15 dB (Micheyl et al., 2006; Deroche & Culling,
2011a; Deroche et al., 2014). Overall,
harmonic cancellation mainly benefits weak targets.For vowel identification, the benefit is measurable for
F
s as small as 0.4% but not less (de Cheveigné, 1997b), and plateaus for
F
s beyond
6%. It is greater for longer stimuli (200 ms) than shorter
stimuli (50 ms) (Assmann
& Summerfield, 1994), but measurable for stimuli as
short as four cycles of the lower F
(23 ms at 175 Hz, McKeown & Patterson, 1995). It is reduced
but not abolished if the masker’s F
is modulated at rates as fast as 5 Hz (200 ms period) (Summerfield et al.,
1992; de Cheveigné, 1997b; Deroche & Culling,
2011b),
suggesting a remarkable ability to track F
variations. However, this breaks down in the presence of
reverberation, whereas a similar degradation is not observed if the masker
F
is steady-state (Culling et al., 1994; Sayles et al., 2015). Data from mistuned harmonic
experiments suggest that the benefit might be limited to the spectral region
below
2–3 kHz (Hartmann et al., 1990). Indeed, in concurrent vowel
experiments the benefit appears to stem mainly from the region below 1 kHz that
includes a vowel’s first formant (Culling & Darwin, 1993).Real speech maskers differ from ideal harmonic maskers in that periodic portions
are sparsely distributed over time (Hu & Wang, 2008), the F
varies due to intonation, and periodicity is further degraded
by articulation, irregularities in voice excitation, and added noise including
reverberation. The benefit of a
F
between a monotonized speech target and monotonized masker
(two concurrent voices with the same F
, or harmonic complex with spectral envelope similar to speech)
ranges from 3 to 8 dB (Deroche & Culling, 2013; Deroche et al., 2014a, 2017), which is also on the same order
as binaural effects for similar stimuli (Deroche et al., 2017).
Learning?
Pattern-matching models of pitch perception (de Boer, 1976) postulate some form of harmonic
template, or “sieve” (Schroeder, 1968; Duifhuis
et al., 1982), and the same template is also required for a spectral domain
model of segregation. This is non-trivial: the dictionary of templates must
cover the full range of F0s, there must be some mechanism to
align the templates accurately with the substrate of frequency analysis (e.g.,
cochlea), and each template itself is a complex affair involving multiple slots
with accurate tuning. It has been proposed that templates are learned from
exposure to harmonic sounds such as speech (Terhardt, 1974; Divenyi, 1979; Bowling & Purves, 2015; Saddler et al., 2020) possibly
modulated by cultural preferences (McDermott & Hauser, 2004; McDermott et al.,
2010, 2016; McPherson et al.,
2020). The
demonstration that templates can be learned from noise (Shamma & Klein, 2000; Shamma & Dutta,
2019) makes
that argument more tenuous, and highlights the question of what, exactly, is
being learned. Perhaps that algorithm discovers, rather than learns, the
mathematical property that is exploited more directly by the cancellation
filter.The template-like properties of a time-domain cancellation filter (Equation
(1), Figure 4) stem from mathematics, rather than learning. This is a big
appeal: why jump through hoops when a simple solution is at hand? The organism
may still need to discover that this regularity exists and is worth attending
to, and the mechanism may need tuning, particularly if it involves combining
frequency channels. This leaves ample room for learning, and possibly even
cultural influences.
Is There Time?
In a classic chapter, de
Boer (1976) likened auditory theory to a pendulum moving between
“time” and “place” (spectrum). The pendulum is still swinging, and several
recent papers have strengthened the case for spectral and place-rate accounts
(e.g., Shera et
al., 2002;
Sumner et al.,
2018; Verschooten et al.,
2018; Whiteford et al.,
2020; Su & Delgutte,
2020).
Arguments for time remain (a) evidence for temporal mechanisms of binaural
processing (see section Analogy with Binaural EC of the
Appendix), (b) existence of specialized neural circuitry within the brain (see
section Anatomy and Physiology of the Appendix), and (c) the
simplicity, effectiveness and ease of implementation of a time-domain harmonic
filter, in contrast to a harmonic template or sieve in the frequency domain.Hybrid models offer the best of both worlds, but they may worry scholars who care
about parsimony or falsifiability. As a case in point, if we admit that delay
might arise by cross-channel interaction (de Cheveigné & Pressnitzer, 2006), it is hard to
say anything for, or against, the hypothesis that processing involves neural
delays. On the other hand, it would be unwise to let this blind us to the
possibility that auditory system does rely on a combination of spectral and
time-domain analysis.My personal inclination is that auditory perception involves time-domain
processing within the brain, but the effectiveness of that processing is
enhanced by the peripheral bandpass filter bank that helps overcome the effects
of non-linear transduction and noise (based on principles related to Logan’s
theorem). High-resolution mechanical filtering serves to “pre-calculate” a set
of useful basis functions upon which the brain then operates in the time-domain
(see sections Transforms in Filter Space and
Non-Linearity of the Appendix). In this perspective,
cochlear mechanics are the “last chance” to process acoustical signals with good
resolution, linearity, and low noise, before handing transduced patterns over to
more flexible but less accurate neural processing.
Carving Sound at its Joints
Auditory Scene Analysis is often described as a process of
assembling elements across the spectrum (simultaneous
grouping) or across time (sequential grouping) (Bregman, 1990), mirroring the common process of
additive or concatenative synthesis by which stimuli are created in the lab. It
glosses over the issue of whether these ingredients are recoverable from the
mix, upon which this assumption depends. Once the coins are thrown into the
melting pot, can we pull them out intact? According to classic Auditory Scene
Analysis, we can: spectral analysis reveals “natural kinds” (partials), between
which are found the “joints” at which sounds may be carved (Campbell et al., 2011). Indeed,
according to this view, a grouping mechanism is required for any complex sound
to form a coherent whole, otherwise it might shatter into as many percepts as
partials (although few of us would claim to ever have heard more than a couple
of such percepts within a sound). The wisdom of invoking sinusoidal partials as
“natural kinds” on which Auditory Scene Analysis processes operate is rarely
questioned.In contrast, harmonic cancellation requires no analysis-into-parts or grouping.
Whereas a bandpass filter is defined by what it selects (a frequency band), a
cancellation filter is defined by what it removes (periodic power at period
). This is an example, like a shadow, of what Sorensen (2011) calls
a “para-natural kind.” The process is effective both to characterize a periodic
sound by its parameter
, and to get rid of that sound and search for more. It is an
alternative way to “carve sound at its joints.”
Conclusion
The harmonic cancellation hypothesis states that the harmonic (or periodic) structure
of interfering sounds can be exploited to suppress or ignore them. A large body of
experimental results are consistent with this hypothesis, whereas alternative
hypotheses for F
-based segregation are less well supported. In particular, harmonic
enhancement, according to which harmonicity of a target makes it resilient to
masking, receives little support, which is surprising because counter to our
intuition and inconsistent with textbook explanations of scene analysis involving a
harmonicity-based “grouping” operation. Harmonic cancellation fits well with an
account of perception as seeking invariance with respect to irrelevant dimensions of
the sensory pattern, and with the concept of “unconscious inference” promoted by
Helmholtz. Harmonic cancellation can be implemented in the frequency domain (based
on cochlear analysis) or time domain (based on the temporal processing of neural
discharge patterns). Support for the latter comes from the success of the related EC
model of binaural interactions, from the presence of neural structures apparently
specialized for processing of temporal information, and from theoretical
considerations that suggest that a time-domain implementation might be more
straightforward and effective.
Appendix: Deeper Issues
The harmonic cancellation hypothesis is straightforward and well supported
experimentally, but it raises a number of interesting questions that are worth
considering.
Hybrid models
The hybrid harmonic cancellation models enumerated in the main text are described
here in greater detail.Hybrid Model 1: Cancellation-enhanced spectral
patterns. Each channel of a filter bank is convolved
with a cancellation filter tuned to
. This has the effect of sharpening spectral
analysis so that the outcome is closer to the ideal (Figure 2
right). The pattern of power over channels is then handed over to a
frequency-domain pattern-matching stage. This is illustrated in
Figure 6(a). Two vowels, /a/ and /e/ with fundamentals
100 and 106 Hz, respectively (left), are mixed. Cues to /e/ are
indistinct within the spectrum of the mix (right, black), but can be
enhanced by applying to each channel a cancellation filter tuned to
suppress /a/ (right, red). This model is reminiscent of periodicity
tagging of tonotopic patterns (Keilson et al., 1997), or
of the place-time model of Assmann and Summerfield (1990) in
which a spectral profile for the target vowel was taken by sampling
the ACF at the target’s period. If the spectral profile were derived
from a limited window of cancellation-filtered signal, placing that
window within the background-invariant part (red in Figure 4(b),
right) would make the profile invariant with respect to
backgrounds of period . The pattern would still be distorted
by the cancellation filtering, and spectral pattern-matching would
need to take this into account.
Figure 6.
Two hybrid models of harmonic cancellation. (a) Hybrid Model 1. Left:
power as a function of CF for synthetic vowels /a/,
F
=100 Hz (blue) and /e/, F
=106 Hz (red). Short lines above the plot indicate
the first two formant frequencies of each vowel. Right: power as a
function of CF for the mix before (black) and after (red) applying a
cancellation filter tuned to suppress the period of /a/. (b) Hybrid
Model 3. Black: per-channel TMR of vowel /e/ as a function of CF for
a mixture of /a/+/e/ at overall TMR=0 dB. Channels are divided into
three groups: TMR>12 dB (green, to be left intact),
(black, to be discarded), and
12 dB
TMR
12 dB (red, to be filtered by a cancellation
filter).
Hybrid Model 2: Channel rejection on the basis of
periodicity. Filter bank channels are divided into two
groups based on TMR (estimated based on residual power at the output
of a cancellation filter tuned to
). The first group consists of channels dominated
by the background; these are rejected. The remaining channels are
handed over to the pattern-matching stage to be matched based on
their temporal pattern. This principle was employed in the
concurrent vowel identification model of Meddis and Hewitt (1992),
itself inspired from earlier ideas for binaural or periodicity-based
segregation (Lyon, 1983, 1988; Weintraub, 1985). Spectral
resolution must be sufficient so that enough channels are spared to
represent the target.Hybrid Model 3: Cancellation filtering of selected
channels. Filter bank channels are divided into three
groups based on TMR. Channels with large TMR are left untouched,
channels with small TMR are discarded, and intermediate channels are
processed by the cancellation filter. Keeping the first group intact
reduces target distortion, and discarding the second group avoids
contamination from noise if the cancellation filter is imperfect (as
it might be due to non-linearity or noise). Cancellation filtering
is reserved for channels with intermediate TMR, for which it can be
effective. This model differs from Hybrid Model 2 by the presence of
this third group. A similar suggestion was made by Guest and Oxenham
(2019).Hybrid Model 3 is illustrated in Figure 6(b). The black line
shows the TMR per channel at the output of a filter bank in response
to the mix /a/+/e/ with overall TMR = 0 dB. Channels for which TMR
exceeds some threshold (+12 dB in this example) are left intact
(green), channels for which TMR is below a second threshold (
12 dB in this example) are discarded (black).
Channels with intermediate TMR are processed with a cancellation
filter (red).Hybrid Model 4: Channel-specific cancellation
filter. In contrast to previous models, for which the
parameter
is the same for all channels, here it is allowed
to vary across channels. This is analogous to the channel-dependent
versions of the EC model of binaural unmasking (Culling &
Summerfield, 1994; Akeroyd,
2004;
Breebaart
et al., 2001). This hypothesis may
be useful to explain results found with inharmonic stimuli (e.g.,
Roberts
& Brunstrom, 1998) as discussed in the
main text.Hybrid Model 5: Synthetic delays. The cancellation
filter of Equation (1) requires a delay
equal to the background period (e.g., 20 ms for a 50 Hz
fundamental). The existence of delays of this size in the auditory
system has been questioned (e.g., Laudanski et al., 2014),
and to address this issue it has been suggested that long delays
might arise from cross-channel interaction (de Cheveigné &
Pressnitzer, 2006). According to this
model, the filter bank serves mainly that purpose: to help
synthesize the delay
required by Equation (1).Hybrid Model 6: Logan’s theorem. Rather than a
specific model, this is a processing principle that addresses the
issue of the non-linear transduction that follows cochlear
filtering. Due to half-wave rectification, each transduced signal is
“blind” to one-half of every cycle, and thus one might worry that
some information was lost. Logan’s theorem states instead that a
narrowband signal can be reconstructed perfectly from its zero
crossings, and hence also from its half-wave rectified version
(Logan, 1977; Shamma &
Lorenzi, 2013). To the extent that
it is applicable here, the benefit of cochlear filtering would be to
linearize transduction, so that neural signal
processing has, in effect, full access to the acoustic waveform (see
the section “Non-Linearity” below).Two hybrid models of harmonic cancellation. (a) Hybrid Model 1. Left:
power as a function of CF for synthetic vowels /a/,
F
=100 Hz (blue) and /e/, F
=106 Hz (red). Short lines above the plot indicate
the first two formant frequencies of each vowel. Right: power as a
function of CF for the mix before (black) and after (red) applying a
cancellation filter tuned to suppress the period of /a/. (b) Hybrid
Model 3. Black: per-channel TMR of vowel /e/ as a function of CF for
a mixture of /a/+/e/ at overall TMR=0 dB. Channels are divided into
three groups: TMR>12 dB (green, to be left intact),
(black, to be discarded), and
12 dB
TMR
12 dB (red, to be filtered by a cancellation
filter).
Period Estimation
Harmonic cancellation requires an estimate of the interferer period
. Harmonic cancellation itself can be used for that purpose: an
array of cancellation filters, each tuned to a different delay (lag) covering
the range of expected periods, shows a minimum in output power at a lag equal to
the period. This is equivalent to searching for a peak in the ACF (Licklider, 1951; Meddis & Hewitt,
1991; de Cheveigné, 1998). The relation
between cancellation and correlation is detailed in the next section.From this perspective, cancellation is both an analysis tool (it
cancels part of a signal to reveal the remainder), and an
estimation tool (it estimates the period of the part it
cancels). Applied recursively to a mixture of two sounds, it can reveal two
periods: we first estimate the period of the dominant sound and cancel it, and
then recurse on the remainder. These steps can be performed in parallel by
searching the two-dimensional parameter space of a cascade of cancellation
filters defined as
and
for a minimum in output power. This output is zero when
for integers
,
(de
Cheveigné, 1993; de
Cheveigné & Kawahara, 1999). Interestingly, a neural version
of this model designed to estimate the pitch of a mistuned partial (de Cheveigné, 1999a) accurately
accounted for the subtle shifts observed by Hartmann and Doty (1990), Hartmann et al.
(1996), see also
Holmes and Roberts (2012).Associated with the period is an estimate of the degree to which the sound is, in
fact, periodic. A straightforward measure is output power of a cancellation
filter tuned to the period
, normalized by power at the input (or by output averaged over
other lags, e.g., 1,…, T). A value of zero indicates that the
sound is perfectly periodic, and a small value indicates that it is
“approximately periodic.” This same measure can be used as a criterion to detect
a target in the presence of a harmonic background.The threshold beyond which a sound should be declared “aperiodic” depends on the
application, and more specifically on the distributions of “periodic” and
“aperiodic” sounds as defined by the application’s needs. It is worth noting
that residual aperiodic power at the output of a narrowband filter (e.g., filter
bank channel) takes on relatively low values even if the stimulus is aperiodic.
The threshold needs adjusting accordingly.
Correlation and Cancellation
We can define the running autocorrelation function (ACF) at time
as
(dropping the scaling factor 1/W for
simplicity), where
is the duration of a sliding integration window that serves to
smooth the time course of
. Power at time
can then be defined as
. Likewise, we can define a squared difference
function (SDF) as power at time
of the cancellation filter output
ACF and SDF are then related by
A peak in correlation, cue to the period, maps to a trough in
difference function. It is convenient to normalize ACF and SDF
in which case the normalized functions are related more simply
by
For a periodic sound with period
,
, andEquation (5) is useful to derive the ACF from the SDF or vice-versa. It can
also be extended to more terms, for example to implement a cascade of
cancellation filters in terms of correlation. This allows different modeling
strands to be unified, and justifies some flexibility when speculating about
hypothetical neural implementations (see below).
Analogy with Binaural EC
Durlach’s EC model has been successful in accounting for binaural unmasking
(Durlach, 1963; Culling &
Summerfield, 1994; Culling, 2007) and binaural pitch phenomena (Culling, Summerfield, & Marshall,
1998), and in
predictive models of speech intelligibility (Beutelmann & Brand, 2006; Lavandier et al.,
2012; Cosentino et al.,
2014; Schoenmaker et al.,
2016). Binaural
interaction has also been couched in terms of inter-aural
correlation rather than cancellation (Jeffress, 1948) but, as pointed out by Green (1992), an EC
model can be implemented on the basis of inter-aural correlation, and vice
versa, as the two are related:
, where
and
are sounds at left and right ears, respectively. A
cancellation residue in one model maps to decorrelation in the other.An interesting suggestion is that EC might operate independently within frequency
channels (Culling &
Summerfield, 1994; Akeroyd, 2004; Breebaart
et al., 2001), rather than with parameters common to all channels as in the
original EC model (Durlach, 1963). It has been further suggested that EC parameters can be
estimated and applied within short-time windows (Wan et al., 2014; Hauth & Brand, 2018), which paves the
way for a spectro-temporal form of the EC model that supports “glimpsing” (Beutelmann et al.,
2010).A monaural version of the EC model has been invoked to explain comodulation
masking release (CMR) (Piechowiak et al., 2007).
Anatomy and Physiology
Time-domain and hybrid models entail time-domain signal processing within the
brain. Anatomical and physiological specializations to support such processing
include transduction and coding of acoustic temporal structure in the auditory
nerve (up to 4–5 kHz or possibly higher, Heinz et al., 2001; Hartmann et al., 2019; Carcagno et al., 2019; Verschooten et al., 2019), specialized
synapses in the cochlear nucleus and subsequent relays, and fast excitatory and
inhibitory interaction in the medial and lateral superior olives (MSO and LSO)
(Grothe, 2000; Zheng & Escabí,
2013; Keine et al., 2016; Beiderbeck et al.,
2018; Stasiak et al., 2018) and other
nuclei (Albrecht et
al., 2014;
Caspari et al.,
2015; Felix et al., 2017). Some of these
circuits are interpreted as serving binaural interaction, but presumably could
be borrowed for other needs (see Joris & van der Heijden, 2019; Kandler et al., 2020, for recent
reviews).The time-domain cancellation filter of Figure 4(c, left), Equation (1), can
be approximated by the “neural cancellation filter” of Figure 4(c, right). Spikes arriving via
the direct pathway are suppressed by the coincident arrival of spikes delayed by
. Applied to data recorded from the auditory nerve in response
to a mixture of two vowels with different F
s (Palmer, 1990), that simple circuit was effective in estimating both their
periods and suppressing correlates of one or the other vowel (de Cheveigné, 1993, 1997a; Guest & Oxenham,
2019). Such a
mechanism would require temporally accurate neural representations (excitatory
and inhibitory), delays, and an inhibitory gating or “anticoincidence”
mechanism.Temporally accurate inhibitory transforms of sensory input are created in several
nuclei, including cochlear nucleus (CN) (stellate-D cells), medial and lateral
nuclei of trapezoid body (MNTB and LNTB), and ventral nucleus of the lateral
lemniscus (VNLL) (Arnott et
al., 2004;
Caspari et al.,
2015; Joris & Trussell,
2018). Fast
interaction between direct and delayed neural patterns could in principle occur
as early as the dendritic fields of cells in CN (Shore et al., 1991; Schofield, 1994; Davis & Voigt, 1997; Needham &
Paolini, 2006; Xie
& Manis, 2013), or as late as dendritic fields of the inferior colliculus
(IC) (Caspari et
al., 2015;
Chen et al.,
2019). A recent
study reported evidence for an inhibitory “veto” mechanism at the axon initial
segment of LSO principal neurons, with very narrow tuning to inter-aural time
differences (Franken et
al., 2021). Transmission failure at reputed “secure” synapses in CN and MNTB
might conceivably reflect a similar veto mechanism (Mc Laughlin et al., 2008; Englitz et al., 2009; Stasiak et al., 2018).The cancellation-correlation equivalence discussed earlier implies that fast
interaction might also be excitatory-excitatory, the correlation pattern being
later converted to a cancellation-like statistic by slower inhibitory
interaction along the lines of Equations (5) and (8).
Note, however, that finding a minimum of cancellation would then require
subtraction of two large correlation values, which may be a problem if those
values are coded by a representation (like rate of a Poisson-like process) for
which the noise variance of the value increases with its mean. One might
speculate that the cost of specialized fast inhibitory circuitry is recouped by
the benefit of performing cancellation directly.There is also evidence in favor of accurate rate-place spectral representations
(Larsen et al.,
2008; Fishman et al., 2013, 2014; Su & Delgutte,
2020) that
might support a spectral version of the harmonic cancellation hypothesis,
particularly as it has been argued that tuning might be narrower in humans than
in most model animals (Shera et al., 2002; Verschooten et al., 2018; Sumner et al., 2018; Walker et al., 2019). Narrow tuning might also
benefit a spectro-temporal mechanism, with the caveat that narrower filters are
temporally more sluggish.Sinex et al. (2002),
Sinex and Li (2005), Sinex et al. (2007) report stronger responses in IC
neurons for mistuned partials, consistent with the output of a cancellation
filter, but they explain it by a different model based on cross-channel
interaction of between-partial beat patterns, analogous to the waveform
interaction models described earlier. Their model also accounts for the
particular temporal structure of the response; whether that structure too could
be explained by cancellation remains to be determined.In summary, known neural circuitry might support both temporal and spectral
mechanisms of harmonic cancellation, however I am not aware of evidence as
strong as that reported in favor of the EC model. A rate-frequency response such
as Figure 4(a) might
evade notice if attention is devoted to peaks of activity
rather than dips. It could also elude discovery if the output pattern follows a
latency code rather than rate code (Chase & Young, 2007). The filter
output in Figure 4(b)
is evocative of ON–OFF patterns observed in the superior paraolivary nucleus
(SPON) (Kandler et
al., 2020) but this similarity could be fortuitous, indeed those patterns
have been attributed to gap detection or duration encoding (Kadner & Berrebi,
2008).
Smart Pattern Matching
As discussed in the main text (Harmonic Cancellation—Possible
Mechanisms), each recovered target pattern is affected by two error
terms: imperfect cancellation of the background, and distortion undergone by the
target. In the time-domain model, the first term can be reduced to zero over
part of the pattern (red segment in Figure 4(b), right). This assumes the
ability to locate and isolate reliable intervals, which is commonly granted for
auditory perception (Viemeister & Wakefield, 1991; Moore et al., 1988).There remains the second error term due to filter-induced target distortion. This
can be mitigated if it is known to the pattern matching stage,
for example, by applying the same distortion to each pattern in the dictionary.
Distortion consists of an attenuation factor applied to each target component
depending on how close it falls to the harmonic series of the background, as
quantified by the filter transfer function (Figure 4(a), right). This produces a
“moiré effect” that can be quantified (and thus taken into account) if
F
s of both background and target are known.Target patterns can be further refined if the background is stationary over more
than two periods, as illustrated in Figure 7. Specifically, if the stimulus
is long enough to define
distinct observation intervals temporally separated by
, these intervals can form
distinct pairs from which to infer the target. These
observations are not all strictly independent, but the distortion (Figure 7, right) and
noise patterns differ between pairs and this may assist inference. A perceptual
mechanism operating in this fashion might seem implausibly complex. On the other
hand, we cannot rule out that the trick is discovered by a learning process. The
point made here is that the opportunity exists.
Figure 7.
Left: waveform of the mix of target vowel /e/ (132 Hz) with background
vowel /a/ (100 Hz) at TMR=
12 dB. Given four background cycles, intervals can be
paired over spans of
, 2
, and 3
, with three, two and one repeats, respectively (blue
arrows). Right: spectrum of target vowel /e/ (black line) and
cancellation-filtered estimates obtained for spans
, 2
, and 3
(symbols). Averaging over estimates (or better: taking
their maximum) would yield a more accurate estimate of the target, and
averaging over repeats might further attenuate uncorrelated noise (not
shown).
Left: waveform of the mix of target vowel /e/ (132 Hz) with background
vowel /a/ (100 Hz) at TMR=
12 dB. Given four background cycles, intervals can be
paired over spans of
, 2
, and 3
, with three, two and one repeats, respectively (blue
arrows). Right: spectrum of target vowel /e/ (black line) and
cancellation-filtered estimates obtained for spans
, 2
, and 3
(symbols). Averaging over estimates (or better: taking
their maximum) would yield a more accurate estimate of the target, and
averaging over repeats might further attenuate uncorrelated noise (not
shown).
Transforms in Filter Space
The idea that cochlear filtering works hand in hand with
neural filtering is intriguing. What are the possibilities,
what are the limits? As an example, the bandwidth of cochlear filters is usually
seen as a hard limit on spectral resolution, but it appears that with neural
filtering that limit can be overcome, as exploited by past schemes such as the
“second filter” (Huggins
& Licklider, 1951), stereausis (Shamma et al., 1989), lateral inhibitory network
(LIN) (Shamma,
1985), phase
opponency (Carney et
al., 2002), synthetic delays (de Cheveigné & Pressnitzer, 2006), EC (Durlach, 1963), selectivity
focusing in inferior colliculus (IC) (Chen et al., 2019), and here harmonic
cancellation.This section attempts to make sense of this situation by casting both filtering
stages into a common framework. Any filter can be approximated as a finite
impulse response (FIR) filter of order
, defined by the column vector
of impulse response coefficients. A signal
is filtered by convolving it with this impulse response.
Alternatively, using matrix notation, if
is the
matrix of time-lagged signals, the filtered signal is obtained
as the product
. A useful way to think of it is that the lags [
] create a memory of the past signal, within
which the filter can “shop” for useful information to characterize variations
over time.Extending to a
-channel filter bank, the filters can be defined by a matrix of
impulse responses
of size
, where each column of
represents the impulse response of one channel. The matrix of
filtered signals is then obtained as the product
. To relate this to the context of this paper, picture
as an acoustic signal,
as a bank of “cochlear” filters, and
as a matrix of vibration waveforms at different points along
the basilar membrane.If the matrix
is of rank
, it has a right inverse
such that
, the identity matrix. Why might this be useful? Suppose that
we wish to speculate that the auditory brain implements a particular filter
(defined by its impulse response
applicable to the acoustic waveform). It does not have access
to time-lagged acoustic signals
, so it cannot implement that filter directly, but it does have
access to peripheral filter outputs
. We want to know if our speculation is realistic.We can write
where
is a vector of weights. Applying weights
to
yields the desired filtered signal, exactly as if we had
applied the filter
directly to the acoustic waveform. Whereas the filter was
originally defined by its coordinates
on a basis of time shifts applicable to the acoustic signal,
it is now defined using coordinates
on a basis of filter bank channels. The outcome is the
same.Why is this relevant here? It means that essentially any filter can be
implemented (or its implementation can be complemented) by forming a weighted
sum of cochlear filter outputs, as long as their impulse responses are long
enough to reach the required order
. This is the gist of the “synthetic delay” model of de
Cheveigné and Pressnitzer (2006). According to this view, peripheral filtering and neural
time-domain interaction work hand in hand to perform acoustic signal processing
(subject to limits imposed by noise and non-linearity discussed in the next
section).A matrix of
cancellation filters with lag parameters
ranging from 0 to
-1 is also invertible (if one replaces the degenerate
=0 filter by
), and thus one can treat it as a “basis” similar to the filter
bank basis just described. A filter defined by its coefficients
on a lags basis, or
on a filter bank basis, can therefore also be defined by a set
of coefficients
on this new basis. One can, at least conceptually, transform
the sensory representation back and forth between these three representations:
lagged waveforms, band-pass filter bank channels, and cancellation-filtered
channels, with no loss of information. The cancellation-filtered representation
is reminiscent of the pitch-like “level of representation” invoked by Hafter and
Saberi (2001).There remains one difficult issue: given a periodic sound with period
, how do we find the coefficients
of a cancellation filter (defined over a basis of peripheral
filter outputs) that can cancel it? In the standard formulation (Equation (1))
based on a basis of lags, the filter
consists of all zeros except
=1 and
, so the parameter
can easily be found by scanning a linear array for a minimum.
For
, the situation is more complex because we must find a set of
parameters, rather than one, to obtain the same result. This
is a serious obstacle unless a “smart” way of finding
is found. A full discussion of the problem is beyond the scope
of this paper, but it is worth taking note of three points.The first is that, if principal component analysis (PCA) is applied to the matrix
for a periodic input with period
, at least one column of the PCA transform matrix defines a FIR
filter
that cancels that input. This is because the
th column of
is identical to the 0th column (periodicity), hence
is not of full rank.The second point follows from the first: if PCA is applied to the matrix
of filter bank outputs, at least one column of the PCA
transform matrix defines a set of coefficients
that also cancels its input. This is because rank deficiency
of
implies rank-deficiency of
. Thus, the appropriate coefficients
can be also be found by applying PCA to filter bank outputs
for a periodic input. This data-dependent process can be seen as a form of
data-driven learning, analogous to what we discussed earlier.The third point is that PCA is widely considered as a plausible neural operation
(Oja, 1982; Qiu et al., 2012; Minden et al., 2018). Putting these
pieces together, we can speculate that the hypothesis that Equation (1) is
implemented in the brain as a weighted sum of filter bank outputs, rather than a
simple delay
, is not completely unrealistic. This rough sketch needs
fleshing out, but it suggests a possible direction to model how the auditory
brain might implement complex signal processing tasks, cancellation being one
particular example.Again, such operations might seem implausibly complex for a biological
implementation, but knowing that the option exists, in principle, and
understanding how it works, can guide speculation that something similar is
discoverable by a learning process.
Non-Linearity
Previous sections mostly glossed over the issue of non-linear transduction. The
suggestion that linear operations can be swapped, as shown in Figure 5(b), or linear
transforms inverted as in the previous paragraph, is moot if the systems are not
linear. What can be salvaged from those simple ideas?First, note that any time-invariant transform of a periodic signal is periodic
with the same period (or submultiple of that period), so a canceation filter
tuned to the period would produce zero output as in the linear case. Thus, for
example, Hybrid Model 1 would work as advertized. Second, pattern distortions
due to non-linearity may be compensated for in the pattern-matching stage. Thus,
Hybrid Model 2 might also work. Third, more generally, we can invoke Logan’s
theorem and assume that the deleterious effects of non-linearity, whatever they
are, can be redressed by subsequent processing. The theorem doesn’t say how, but
it is easy to imagine simple situations in which this might pan out. For
example, sampling the steep phase characteristic of the cochlear filter bank at
two points differing by
might give access to both polarities of the signal at that
point, reversing effects of half-wave rectification. Fourth, non-linearity
demodulates the band-pass filtered signal, thus abstracting an informative
temporal envelope from less robust fine structure (Dau et al., 1997). In this respect, non-linearity
is a feature, rather than a bug.In summary, non-linearity does not prevent harmonic cancellation, although it
does make it harder to understand the limits of what can be achieved, and
how.
Authors: Richard A Felix Ii; Boris Gourévitch; Marcelo Gómez-Álvarez; Sara C M Leijon; Enrique Saldaña; Anna K Magnusson Journal: Front Neural Circuits Date: 2017-05-31 Impact factor: 3.492