Matthias J. Sjerps, Neal P. Fox, Keith Johnson, Edward F. Chang.
Abstract
The acoustic dimensions that distinguish speech sounds (like the vowel differences in "boot" and "boat") also differentiate speakers' voices. Therefore, listeners must normalize across speakers without losing linguistic information. Past behavioral work suggests an important role for auditory contrast enhancement in normalization: preceding context affects listeners' perception of subsequent speech sounds. Here, using intracranial electrocorticography in humans, we investigate whether and how such context effects arise in auditory cortex. Participants identified speech sounds that were preceded by phrases from two different speakers whose voices differed along the same acoustic dimension as target words (the lowest resonance of the vocal tract). In every participant, target vowels evoke a speaker-dependent neural response that is consistent with the listener's perception, and which follows from a contrast enhancement model. Auditory cortex processing thus displays a critical feature of normalization, allowing listeners to extract meaningful content from the voices of diverse speakers.
Year: 2019 PMID: 31165733 PMCID: PMC6549175 DOI: 10.1038/s41467-019-10365-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
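The contrast-enhancement account summarized in the abstract can be illustrated with a minimal sketch: the effective F1 of a target vowel is pushed away from the mean F1 of the preceding context. The function, the `gain` and `reference_f1` parameters, and all values below are hypothetical and purely illustrative; the paper's actual model is characterized against neural data, not this toy formula.

```python
def contrast_enhanced_f1(target_f1, context_mean_f1, reference_f1=500.0, gain=0.5):
    """Toy contrast-enhancement rule (hypothetical parameters):
    the perceived F1 of a target is shifted away from the F1 range
    of the preceding context, relative to a neutral reference."""
    return target_f1 + gain * (reference_f1 - context_mean_f1)

# An ambiguous target (F1 = 450 Hz) after a low-F1 context (Speaker A,
# mean F1 = 400 Hz) is effectively raised, favoring a "sofo" percept:
print(contrast_enhanced_f1(450, 400))  # → 500.0

# The same target after a high-F1 context (Speaker B, mean F1 = 600 Hz)
# is effectively lowered, favoring "sufu":
print(contrast_enhanced_f1(450, 600))  # → 400.0
```

Under this sketch, a low-F1 precursor raises the effective target F1 and thus yields more "sofo" responses, which is the direction of the behavioral effect reported in Fig. 1d.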
Fig. 1 Listeners perceive speech sounds relative to their acoustic context. a Target sounds were synthesized to create a six-step continuum ranging from sufu (step 1; low first formant [F1]) to sofo (step 6; high F1). Context sentences were synthesized to sound like two different speakers: a speaker with a long vocal tract (low-F1 range; Speaker A), and a speaker with a short vocal tract (high-F1 range; Speaker B). Context sentences contained only the vowels /e/ and /a/, but not the target vowels /u/ and /o/. b Context sentences preceded the target on each trial (separated by 0.5 s of silence), after which participants responded with a button press to indicate whether they heard “sufu” or “sofo”. c All targets were presented after both speaker contexts. d Listeners more often gave “sofo” responses to target sounds if the preceding context was spoken by Speaker A (low F1) than Speaker B (high F1). Error bars indicate s.e.m.
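The per-condition 50% category boundaries underlying Fig. 1d are typically obtained by fitting a logistic psychometric function to the proportion of "sofo" responses at each continuum step. A sketch using SciPy's `curve_fit`, with made-up response proportions (not the paper's data):

```python
import numpy as np
from scipy.optimize import curve_fit

def psychometric(step, boundary, slope):
    """Logistic function: P("sofo") as a function of continuum step.
    `boundary` is the step at which P("sofo") = 0.5."""
    return 1.0 / (1.0 + np.exp(-slope * (step - boundary)))

steps = np.arange(1, 7)  # six-step sufu-sofo continuum

# Synthetic response proportions per context condition (illustrative only):
p_sofo_A = np.array([0.05, 0.15, 0.45, 0.80, 0.95, 0.99])  # after Speaker A (low F1)
p_sofo_B = np.array([0.02, 0.05, 0.20, 0.55, 0.85, 0.97])  # after Speaker B (high F1)

(bA, _), _ = curve_fit(psychometric, steps, p_sofo_A, p0=[3.5, 1.0])
(bB, _), _ = curve_fit(psychometric, steps, p_sofo_B, p0=[3.5, 1.0])

print(f"50% boundary after Speaker A: {bA:.2f}")
print(f"50% boundary after Speaker B: {bB:.2f}")
print(f"Contrastive boundary shift (B - A): {bB - bA:.2f} steps")
```

With these illustrative data, the boundary sits at an earlier continuum step after Speaker A than after Speaker B, i.e., more of the continuum is heard as "sofo" after the low-F1 voice.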
Fig. 2 The neural response to bottom-up acoustic input is modulated by preceding context. a Target vowel preferences and locations (plotted on a standardized brain) for electrodes from all patients (three with right hemisphere [RH] and two with left hemisphere [LH] grid implants). Only those temporal lobe electrodes where the full omnibus model was significant during the context and/or the target window (F-test; p < 0.05) are displayed. Strong target F1 selectivity is relatively uncommon: electrodes with a black-and-white outline are significant at Bonferroni-corrected p < 0.05 (n = 9, out of 406 temporal lobe electrodes); a single black outline indicates significance at a more liberal threshold (p < 0.05, uncorrected; n = 28). Activity from the indicated electrode (e1) is shown in b and c. b Example of normalization in a single electrode (e1; z-scored high-gamma [high-γ] response averaged across the target window [target window marked in c]). c Activity from e1 across time, separating the endpoint targets (middle panel) or the contexts (bottom panel). The electrode responds more strongly to /o/ stimuli than /u/ stimuli, but also responds more strongly overall after Speaker A (low F1). This effect is analogous to the behavioral normalization (Fig. 1d). Black bars at the bottom of the panels indicate significant time-clusters (cluster-based permutation test of significance). d Among all electrodes with significant target sound selectivity (n = 37 [9 + 28]), a relation exists between the by-electrode context effect and target preference. Both are expressed as a signed t-value, demonstrating that the size and direction of the target preferences predict the size and direction of the context effects. e An LDA classifier was trained on the distributed neural responses elicited by the sufu and sofo endpoint stimuli using all task-related electrodes.
This model was then used to predict classes for neural responses to (held-out) endpoint tokens and to the ambiguous steps in each context condition. Proportions of trials neurally classified as “sofo” (thick lines) display a relative shift between the two context conditions (data from one example patient). Regression curves were fitted to these data for each participant separately to estimate the 50% category boundary per condition for panel f (thin lines). f The neural classification functions display a shift in category boundaries between context conditions for all patients individually. In b and c, error bars indicate ±1 s.e.m.
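The decoding analysis in panel e can be sketched as follows, using synthetic high-gamma features and scikit-learn's `LinearDiscriminantAnalysis`. The data layout, dimensions, and the simulated context-dependent shift are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_electrodes, n_trials = 40, 60

# Synthetic high-gamma features: endpoint "sufu" and "sofo" trials differ
# along a random direction in electrode space (illustrative only).
direction = rng.normal(size=n_electrodes)
X_sufu = rng.normal(size=(n_trials, n_electrodes)) - 0.5 * direction
X_sofo = rng.normal(size=(n_trials, n_electrodes)) + 0.5 * direction

X = np.vstack([X_sufu, X_sofo])
y = np.array([0] * n_trials + [1] * n_trials)  # 0 = "sufu", 1 = "sofo"

# Train the classifier on the endpoint responses.
lda = LinearDiscriminantAnalysis().fit(X, y)

# Ambiguous mid-continuum trials: a context-dependent baseline shift along
# the same direction mimics the contrast-enhancement effect (low-F1 context
# pushes responses toward the "sofo"-like pattern).
X_mid_after_A = rng.normal(size=(n_trials, n_electrodes)) + 0.15 * direction
X_mid_after_B = rng.normal(size=(n_trials, n_electrodes)) - 0.15 * direction

p_sofo_A = lda.predict(X_mid_after_A).mean()
p_sofo_B = lda.predict(X_mid_after_B).mean()
print(f'Proportion classified "sofo" after Speaker A: {p_sofo_A:.2f}')
print(f'Proportion classified "sofo" after Speaker B: {p_sofo_B:.2f}')
```

In this simulation, identical ambiguous stimuli are classified as "sofo" more often after the low-F1 context, reproducing the direction of the relative shift between the thick lines in panel e.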
Fig. 3 Sensitivity to contrast in acoustic-phonetic features. a Electrode preferences for both context F1 (during context window) and target F1 (during target window) from a single example patient. Some populations display both target F1 selectivity and context F1 selectivity (marked with a black-and-white outline), indicating a general preference for higher or lower F1 frequency ranges. Others are only tuned for target F1 or context F1 (marked with a single black outline in their respective panels). Significance assessed at p < 0.05 uncorrected. b Mean (±1 s.e.m.) high-gamma activity at an example electrode (e2) from the example patient in panel a (conditions split as described in Fig. 2c). Activity is displayed for a time window encompassing the full trial duration (both precursor sentence and target word). Black bars represent significant time-clusters (p < 0.05; cluster-based permutation). c A relation exists between the by-electrode context preference and target preference: electrodes that display a preference for either high- or low-target F1 typically also display a preference for the same F1 range during the context