Jessica Jiang, Elia Benhamou, Sheena Waters, Jeremy C S Johnson, Anna Volkmer, Rimona S Weil, Charles R Marshall, Jason D Warren, Chris J D Hardy.
Abstract
The speech we hear every day is typically "degraded" by competing sounds and the idiosyncratic vocal characteristics of individual speakers. While the comprehension of "degraded" speech is normally automatic, it depends on dynamic and adaptive processing across distributed neural networks. This presents the brain with an immense computational challenge, making degraded speech processing vulnerable to a range of brain disorders. Therefore, it is likely to be a sensitive marker of neural circuit dysfunction and an index of retained neural plasticity. Considering experimental methods for studying degraded speech and factors that affect its processing in healthy individuals, we review the evidence for altered degraded speech processing in major neurodegenerative diseases, traumatic brain injury and stroke. We develop a predictive coding framework for understanding deficits of degraded speech processing in these disorders, focussing on the "language-led dementias" (the primary progressive aphasias). We conclude by considering prospects for using degraded speech as a probe of language network pathophysiology, a diagnostic tool and a target for therapeutic intervention.
Keywords: Alzheimer’s disease; Parkinson’s disease; degraded speech processing; dementia; perceptual learning; predictive coding; primary progressive aphasia
Year: 2021 PMID: 33804653 PMCID: PMC8003678 DOI: 10.3390/brainsci11030394
Source DB: PubMed Journal: Brain Sci ISSN: 2076-3425
Figure 1. A predictive coding model of normal degraded speech processing with major anatomical loci for core speech decoding operations and their connections, informed by evidence in the healthy brain. Different kinds of degraded speech manipulation are likely to engage these cognitive operations and connections differentially (see Table 1). Incoming sensory information undergoes “bottom-up” perceptual analysis chiefly in early auditory areas, while higher level brain regions generate predictions about the content of the speech signal. Boxes indicate processors that instantiate core functions; note, however, that processing “levels” are not strictly confined to higher-order predictions or early sensory input: interactions occur at each level. Arrows indicate connections between levels, with reciprocal information flow mediating modulatory influences and dynamic updating/perceptual learning of degraded speech signals. This figure is necessarily an over-simplification; cortical areas that are likely to have separable functional roles are grouped together for clarity of representation, and while they are not shown in this figure, intra-areal recurrences and inhibitions alongside other local circuit effects may also be operating within these regions. aTL, anterior temporal lobe; HG, Heschl’s gyrus; IFG, inferior frontal gyrus; IPL, inferior parietal lobule; STG, superior temporal gyrus; STS, superior temporal sulcus.
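The interplay of top-down predictions and bottom-up sensory evidence described in this caption can be illustrated with a toy numerical sketch. The code below is not the authors' model: it is a generic, textbook-style Gaussian belief update in which a prediction is corrected by a precision-weighted prediction error. All function and parameter names (`update_belief`, `repeated_exposure`, `prior_precision`, `sensory_precision`) are hypothetical and chosen only for illustration.

```python
def update_belief(prior_mean, prior_precision, observation, sensory_precision):
    """One precision-weighted update of a scalar belief.

    The prediction moves toward the sensory input in proportion to how
    reliable the input is relative to the prior (illustrative assumption:
    both prior and input are Gaussian).
    """
    prediction_error = observation - prior_mean
    gain = sensory_precision / (prior_precision + sensory_precision)
    posterior_mean = prior_mean + gain * prediction_error
    posterior_precision = prior_precision + sensory_precision
    return posterior_mean, posterior_precision


def repeated_exposure(prior_mean, prior_precision, observations, sensory_precision):
    """Apply the update once per observation, as a crude stand-in for
    perceptual learning over repeated exposure to a degraded signal."""
    mean, precision = prior_mean, prior_precision
    for obs in observations:
        mean, precision = update_belief(mean, precision, obs, sensory_precision)
    return mean, precision


if __name__ == "__main__":
    # Degraded input is modelled here simply as input with low sensory precision.
    clear, _ = update_belief(0.0, 1.0, 1.0, sensory_precision=1.0)
    degraded, _ = update_belief(0.0, 1.0, 1.0, sensory_precision=0.25)
    print(clear, degraded)
```

In this sketch, lowering `sensory_precision` makes a single update smaller, and raising `prior_precision` makes the belief slower to change over repeated exposure. Whether such a scalar update captures anything about the cortical hierarchy in the figure is an open assumption.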
Figure 2. Examples of degraded speech manipulations used experimentally and their acoustic effects on the speech signal. Broadband time-frequency spectrograms of the same speech token (“tomatoes”), subjected to different forms of speech degradation (all samples apart from 2B were recorded by a native British speaker with a Standard Southern English accent; wavefiles of A–G are in Supplementary Material online). (A) Natural speech token. (B) Same speech token spoken with an American-Californian accent (an accent is a meta-linguistic feature that reveals information about the speaker’s geographical or socio-cultural background [53]; normal listeners make predictions about speakers’ accents that tend to facilitate faster accent processing [54]). (C) Speech in multi-talker babble (speech-in-noise can be adaptively adjusted to find the point at which speech switches from intelligible to unintelligible [55]; background “noise” used experimentally typically comprises either “energetic” masking (e.g., steady-state white noise) or “informational” masking (e.g., multi-talker babble, as illustrated here) [56]). (D) Perceptual (or phonemic) restoration (Warren [57] originally observed that when a key phoneme is artificially excised from a given sentence, control participants are unable to identify the location of the missing phoneme when “filled-in” with a burst of white noise (bottom panel), but are able to identify the location accurately if the gap remains silent (top panel), i.e., they perceptually “restore” the excised phoneme). (E) Noise-vocoded speech (vocoding removes fine spectral detail from speech, whilst preserving temporal cues [58,59]; three bands of modulated noise (i.e., three “channels”; top panel) are the minimum needed for consistent recognition by normal listeners [59]; spectrograms for six (middle panel) and twelve (bottom panel) channels are also shown here).
(F) Time-compressed speech (created by artificially increasing the rate at which a recorded speech stimulus is presented; intelligibility decreases as speech compression rate increases [60,61,62]). (G) Sinewave speech (this transformation reduces speech to a series of “whistles” or sinewave tones that track formant contours [63]). Note that these speech manipulations vary widely in the cognitive process they target, the degree to which they degrade the speech signal and their ecological resonance (see also Table 1); accented speech and speech-in-noise or babble are commonly encountered in daily life through exposure to diverse speakers and noisy environments, perceptual restoration simulates the frequent everyday phenomenon of speech interruption by intermittent extraneous sounds (e.g., a slamming door), whereas sinewave-speech is a drastic impoverishment of the speech signal that sounds highly unnatural but becomes intelligible with exposure due to perceptual learning [64].
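To make the noise-vocoding manipulation in panel E concrete, the following is a minimal sketch of a channel vocoder using only NumPy. It is an illustration under stated assumptions, not the procedure used to generate the figure: the brick-wall FFT filters, logarithmic band spacing, 100–4000 Hz range and 30 Hz envelope cutoff are all choices made here for simplicity, and the names `fft_bandpass` and `noise_vocode` are hypothetical.

```python
import numpy as np


def fft_bandpass(x, fs, lo, hi):
    """Keep only FFT components in [lo, hi) Hz (idealised brick-wall filter)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))


def noise_vocode(signal, fs, n_channels=6, f_min=100.0, f_max=4000.0,
                 env_cutoff=30.0, seed=0):
    """Replace spectral detail in each band with envelope-modulated noise."""
    signal = np.asarray(signal, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed log-spaced band edges; published studies may use other spacings.
    edges = np.geomspace(f_min, f_max, n_channels + 1)
    out = np.zeros(len(signal))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = fft_bandpass(signal, fs, lo, hi)
        # Amplitude envelope: rectify, then low-pass to keep slow modulations.
        envelope = fft_bandpass(np.abs(band), fs, 0.0, env_cutoff)
        envelope = np.clip(envelope, 0.0, None)
        # Carrier: white noise restricted to the same frequency band.
        carrier = fft_bandpass(rng.standard_normal(len(signal)), fs, lo, hi)
        out += envelope * carrier
    return out


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    # Stand-in for a speech token: a 440 Hz tone with a slow 4 Hz modulation.
    token = np.sin(2 * np.pi * 440 * t) * (1 + np.sin(2 * np.pi * 4 * t))
    for n in (3, 6, 12):
        print(n, noise_vocode(token, fs, n_channels=n).shape)
```

Fewer channels leave coarser spectral information, which is the sense in which the three-, six- and twelve-channel versions in the figure differ. A production vocoder would normally use smoother filters and refilter each modulated band; those steps are omitted here.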
Table 1. Summary of major forms of speech degradation with representative experimental studies in healthy listeners.
| Degradation Type | Study | Participants | Methodology | Major Findings |
|---|---|---|---|---|
| Accented speech | Bent and Bradlow [ | 65 healthy participants (age: 19.1) | Participants listened to English sentences spoken by Chinese, Korean, and English native speakers. | Non-native listeners found speech from non-native English speakers as intelligible as from a native speaker. |
| | Clarke and Garrett [ | 164 healthy participants (American English) | Participants listened to English sentences spoken with a Spanish, Chinese, and English accent. | Processing speed initially slower for accented speech, but this deficit diminished with exposure. |
| | Floccia, Butler, Goslin and Ellis [ | 54 healthy participants (age 19.7; Southern British English) | Participants had to say if the last word in a spoken sentence was real or not. | Changing accent caused a delay in word identification, whether accent change was regional or foreign. |
| Altered auditory feedback | Siegel and Pick [ | 20 healthy participants | Participants produced speech whilst hearing amplified feedback of their own voice. | Participants lowered their voices (displaying the sidetone amplification effect) in all conditions. |
| | Jones and Munhall [ | 18 healthy participants (age: 22.4; Canadian English) | Participants produced vowels with altered feedback of F0 shifted up or down. | Participants compensated for change in F0. |
| | Donath et al. [ | 22 healthy participants (age: 23; German) | Participants said a nonsense word with feedback of their frequency randomly shifting downwards. | Participants adjusted their voice F0 after a set period of time due to processing the feedback first. |
| | Stuart et al. [ | 17 healthy participants (age: 32.9; American English) | Participants spoke under DAF at 0, 25, 50, 200 ms at normal and fast rates of speech. | There were more dysfluencies at 200 ms, and more dysfluencies at the fast rate of speech. |
| Dichotic listening | Moray [ | Healthy participants, no other information given | Participants were told to focus on a message played to one ear, with a competing message in the other ear. | Participants did not recognize the content in the unattended message. |
| | Lewis [ | 12 healthy participants | Participants were told to attend to a message presented in one ear, with a competing message in the other. | Participants could not recall the unattended message, but semantic similarity affected reaction times. |
| | Ding and Simon [ | 10 healthy participants (age 19–25) | Under MEG, participants heard competing messages in each ear, and were asked to attend to each in turn. | Auditory cortex tracked temporal modulations of both signals, but tracking was stronger for the attended one. |
| Noise-vocoded speech | Shannon, Zeng, Kamath, Wygonski and Ekelid [ | 8 healthy participants | Participants listened to and repeated simple sentences that had been noise-vocoded to different degrees. | Performance improved with number of channels; high speech recognition was achieved with only 3 channels. |
| | Davis, Johnsrude, Hervais-Adelman, Taylor and McGettigan [ | 12 healthy participants (age 18–25; British English) | Participants listened to and then transcribed 6-channel noise-vocoded sentences. | Participants showed rapid improvement over the course of 30-sentence exposure. |
| | Scott, Rosen, Lang and Wise [ | 7 healthy participants (age 38) | Under PET, participants listened to spoken sentences that were noise-vocoded to various degrees. | Selective response to speech intelligibility in left anterior STS. |
| Perceptual restoration | Warren [ | 20 healthy participants | Participants identified where the gap was in sentences where a phoneme was replaced by silence/white noise. | Participants were more likely to mislocalize a missing phoneme that was replaced by noise. |
| | Samuel [ | 20 healthy participants (English) | Participants heard sentences in which white noise was either “Added” to or “Replaced” a phoneme. | Phonemic restoration was more common for longer words and certain phone classes. |
| | Leonard, Baud, Sjerps and Chang [ | 5 healthy participants (age 38.6; English/Italian) | Subdural electrode arrays recorded while participants listened to words with noise-replaced phonemes. | Electrode responses to words with a phoneme replaced were comparable to responses to intact words. |
| Sinewave speech | Remez, Rubin, Pisoni and Carrell [ | 54 control participants | Naïve listeners heard SWS replicas of spoken sentences and were later asked to transcribe the sentences. | Most listeners did not initially identify the SWS as speech, but were able to transcribe them when told this. |
| | Barker and Cooke [ | 12 control participants | Participants were asked to transcribe SWS or amplitude-comodulated SWS sentences. | Recognition for SWS ranged from 35–90%, and amplitude-comodulated SWS ranged from 50–95%. |
| | Möttönen, Calvert, Jääskeläinen, Matthews, Thesen, Tuomainen and Sams [ | 21 control participants (18–36; English) | Participants underwent two fMRI scans: one before training on SWS, and one post-training. | Activity in left posterior STS was increased after SWS training. |
| Speech-in-noise | Pichora-Fuller et al. [ | 24 participants in three groups (age 23.9; 70.4; 75.8; English) | Participants repeated the last word of sentences in 8-talker babble. Half had predictable endings. | Both groups of older listeners derived more benefit from context than younger listeners. |
| | Parbery-Clark et al. [ | 31 control participants (incl. 16 musicians; age: 23; English) | Participants were assessed via clinical measures of speech perception in noise. | Musicians outperformed the non-musicians on both QuickSIN and HINT. |
| | Anderson et al. [ | 120 control participants (age 63.9) | Peripheral auditory function, cognitive ability, speech-in-noise, and life experience were examined. | Central processing and cognitive function predicted variance in speech-in-noise perception. |
| Time-compressed speech | Dupoux and Green [ | 160 control participants (English) | Participants transcribed spoken sentences that were compressed to 38% and 45% of their original durations. | Participants improved over time. This happened more rapidly for the 45% compressed sentences. |
| | Poldrack et al. [ | 8 control participants (age: 20–29; English) | Participants listened to time-compressed speech. Brain responses were tracked using fMRI. | Activity in bilateral IFG and left STG increased with compression, until speech became incomprehensible. |
| | Peelle et al. [ | 8 control participants (age: 22.6; English) | Participants listened to sentences manipulated for complexity and time-compression in an fMRI study. | Time-compressed sentences recruited AC and premotor cortex, regardless of complexity. |
The table is ordered by type of speech degradation. Information in the Participants column is based on available information from the original papers; age is given as a mean or range and language refers to participants’ native languages. Abbreviations: AC, anterior cingulate; DAF, delayed auditory feedback; F0, fundamental frequency; fMRI, functional magnetic resonance imaging; HINT, Hearing in Noise Test; IFG, inferior frontal gyrus; ms, millisecond; QuickSIN, Quick Speech in Noise Test; PET, positron emission tomography; STG, superior temporal gyrus; STS, superior temporal sulcus; SWS, sinewave speech.
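Several of the speech-in-noise studies summarised above rest on two simple operations: mixing speech with a masker at a chosen signal-to-noise ratio (SNR), and adaptively adjusting that SNR to find the point where intelligibility changes. The sketch below illustrates both in generic form. It does not reproduce QuickSIN, HINT or any cited protocol; the 1-down/1-up rule, step size and simulated listener are assumptions made here, and the names `mix_at_snr` and `staircase_threshold` are hypothetical.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db.

    Assumes the noise is at least as long as the speech.
    Returns the mixture and the gain applied to the noise.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise, gain


def staircase_threshold(respond, start_db=10.0, step_db=2.0, n_reversals=8):
    """Generic 1-down/1-up staircase.

    `respond(snr_db)` returns True for a correct response. The SNR is
    lowered after a correct response and raised after an error; the
    threshold estimate is the mean SNR at the reversal points.
    """
    snr, last_correct, reversals = start_db, None, []
    while len(reversals) < n_reversals:
        correct = bool(respond(snr))
        if last_correct is not None and correct != last_correct:
            reversals.append(snr)
        last_correct = correct
        snr += -step_db if correct else step_db
    return float(np.mean(reversals))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)
    babble = rng.standard_normal(8000)  # white noise standing in for babble
    mixture, gain = mix_at_snr(speech, babble, snr_db=-3.0)
    # Idealised listener who is correct whenever the SNR is at least -3 dB.
    print(staircase_threshold(lambda snr: snr >= -3.0))
```

A 1-down/1-up rule converges on the SNR giving roughly 50% correct responses; other rules target other points on the psychometric function.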
Table 2. Summary of representative studies of degraded speech processing in clinical populations.
| Population | Study, Degradation | Participants | Methodology | Major Findings |
|---|---|---|---|---|
| Traumatic brain injury | Gallun et al. [ | 36 blast-exposed military veterans (age: 32.8); 29 controls (age: 32.1) | Participants went through a battery of standardised behavioural tests of central auditory function: temporal pattern perception, GIN, MLD, DDT, SSW, and QuickSIN. | While no participant performed poorly on all behavioural testing, performance was impaired in central auditory processing for the blast-exposed veterans in comparison to matched controls. |
| | Saunders et al. [ | 99 military veterans (age: 34.1) | Participants went through self-reported measures as well as a battery of standardised behavioural measures: HINT, NA LiSN-S, ATTR, TCST, and SSW. | Participants in this study showed measurable performance deficits on speech-in-noise perception, binaural processing, temporal resolution, and speech segregation. |
| | Gallun et al. [ | 30 blast-exposed military veterans, with at least one blast occurring 10 years prior to study (age: 37.3); 29 controls (age: 39.2) | Participants went through a battery of standardised behavioural tests of central auditory function: GIN, DDT, SSW, FPT, and MLD. | Replicating the findings from Gallun et al., 2012, this study found that the central auditory processing deficits persisted in individuals tested an average of more than 7 years after blast exposure. |
| | Papesh et al. [ | 16 blast-exposed veterans (age 36.9); 13 veteran controls (age 38) with normal peripheral hearing | Participants completed self-reported measures and standardised tests of speech-in-noise perception, DDT, SSW, TCST, plus auditory event-related potential studies. | Impaired cortical sensory gating was primarily influenced by a diagnosis of TBI and reduced habituation by a diagnosis of post-traumatic stress disorder. Cortical sensory gating and habituation to acoustic startle strongly predicted degraded speech perception. |
| Stroke aphasia | Bamiou et al. [ | 8 patients with insular strokes (age: 63); 8 control participants (age: 63) | Participants heard pairs of spoken digits presented simultaneously to each ear, and were asked to repeat all four digits. | Dichotic listening was abnormal in five of the eight stroke patients. |
| | Dunton et al. [ | 16 participants with aphasia (age: 59); 16 controls (age: 59; English) | Participants heard English sentences spoken with a familiar (South-East British English) or unfamiliar (Nigerian) accent. | Aphasia patients made more errors in comprehending sentences spoken in an unfamiliar accent vs. a familiar accent. |
| | Jacks and Haley [ | 10 aphasia patients (age: 53.1); 10 controls (age: 63.1; English) | Participants produced spoken sentences with no feedback, DAF, FAF or noise-masked auditory feedback (MAF). | Speech rate increased under MAF but decreased with DAF and FAF in most participants with aphasia. |
| Parkinson’s disease | Liu et al. [ | 12 PD participants (age: 62.3); 13 control participants (age: 68.7) | Participants sustained a vowel whilst receiving changes in feedback of loudness (±3/4 dB) or pitch (±100 cents). | All participants produced compensatory responses to AAF, but response sizes were larger in PD than controls. |
| | Chen et al. [ | 15 people with PD (age: 61); 15 control participants (age 61; Cantonese) | Participants were asked to vocalize a vowel sound with AAF pitch-shifted upwards or downwards. | PD participants produced larger magnitudes of compensation. |
| Alzheimer’s disease | Gates et al. [ | 17 ADs (age: 84); 64 MCI (age: 82.3); 232 controls (age: 78.8) | Participants listened to 40 numbers presented in pairs to each ear simultaneously. | AD patients scored the worst in the dichotic digits, followed by the MCI group and then the controls. |
| | Golden et al. [ | 13 AD participants (age: 66); 17 control participants (age: 68) | In fMRI, participants listened to their own name interleaved with or superimposed on multi-talker babble. | Significantly enhanced activation of right supramarginal gyrus in the AD vs. control group for the cocktail party effect. |
| | Ranasinghe et al. [ | 19 AD participants; 16 control participants | Participants were asked to produce a spoken vowel in context of AAF, with perturbations of pitch. | AD patients showed enhanced compensatory response and poorer pitch-response persistence vs. controls. |
| Primary progressive aphasia | Hailstone et al. [ | 20 ADs (age: 66.4); 6 nfvPPA (age: 66); 35 controls (age: 65); British English | Accent comprehension and accent recognition were assessed. VBM examined grey matter correlates. | Reduced comprehension for phrases in unfamiliar vs. familiar accents in AD and for words in nfvPPA; in AD group, grey matter associations of accent comprehension and recognition in anterior superior temporal lobe. |
| | Cope et al. [ | 11 nfvPPA (age: 72); 11 control participants (age: 72) | During MEG, participants listened to vocoded words presented with written text that matched/mismatched. | People with nfvPPA compared to controls showed delayed resolution of predictions in temporal lobe, enhanced frontal beta power and top-down fronto-temporal connectivity; precision of predictions correlated with beta power across groups. |
| | Hardy et al. [ | 9 nfvPPA (age: 69.6); 10 svPPA (age: 64.9); 7 lvPPA (age: 66.3); 17 control (age: 67.7) | Participants transcribed SWS of numbers/locations. VBM examined grey matter correlates in combined patient cohort. | Variable task performance across groups; all showed spontaneous perceptual learning effects for SWS numbers; grey matter correlates in a distributed left hemisphere network extending beyond classical speech-processing cortices, perceptual learning effect in left inferior parietal cortex. |
Information in the Participants column is based on available information from the original papers; age is given as a mean or range and language refers to participants’ native languages. Abbreviations: AAF, altered auditory feedback; AD, Alzheimer’s disease; ATTR, Adaptive Tests of Temporal Resolution; DAF, delayed auditory feedback; dB, decibels; DDT, Dichotic Digits Test; FAF, frequency altered feedback; fMRI, functional magnetic resonance imaging; FPT, Frequency Patterns Test; GIN, Gaps-In-Noise test; HINT, Hearing in Noise Test; lvPPA, logopenic variant primary progressive aphasia; MAF, masked/masking auditory feedback; MCI, mild cognitive impairment; MEG, magnetoencephalography; MLD, Masking Level Difference; NA LiSN-S, North American Listening in Spatialised Noise-Sentence test; nfvPPA, nonfluent variant primary progressive aphasia; PD, Parkinson’s disease; PR, perceptual restoration; QuickSIN, Quick Speech in Noise; SSW, Staggered Spondaic Words; SWS, sinewave speech; svPPA, semantic variant primary progressive aphasia; TBI, traumatic brain injury; TCST, Time Compressed Speech Test; VBM, voxel based morphometry.
Figure 3. A simplified model of predictive coding of degraded speech processing in primary progressive aphasia (PPA), referenced to the healthy brain presented in Figure 1. The three major PPA variant syndromes (nonfluent/agrammatic variant PPA, top panel; semantic variant PPA, middle panel; and logopenic variant PPA, bottom panel) are each associated with a specific pattern of regional brain atrophy and/or dysfunction that is critical to the degraded speech processing network, implying that different PPA subtypes may be associated with specific profiles of degraded speech processing (see text for details). Boxes indicate processors that instantiate core speech decoding functions (see Figure 1), and arrows indicate their connections in the predictive coding framework, with the putative direction of information flow. In the case of nfvPPA, the emboldened descending arrow from IFG to STG signifies aberrantly increased precision of inflexible top-down priors (after Cope and colleagues [93]), to date the most secure evidence for a predictive coding mechanism in the PPA spectrum; the status of the IPL locus in this syndrome is more tentative. Implicit in the model is the hypothesis that neurodegenerative pathologies will tend to disrupt stored neural templates (“priors”) and “prune” projections from heavily involved, higher order association cortical areas due to neuronal dropout (promoting inflexible top-down predictions), but also degrade the fidelity of signal traffic through sensory cortices (reducing sensory precision and promoting over-precise prediction errors) [15]. The relative prominence of these mechanisms will depend on the macro-network and local neural circuit anatomy of particular neurodegenerative pathologies. Proposed major loci of disruption caused by each PPA variant are indicated with crosses; dashed arrows arising from these damaged modules indicate disrupted information flow.
aTL, anterior temporal lobe; HG, Heschl’s gyrus; IFG, inferior frontal gyrus; IPL, inferior parietal lobule; lvPPA, logopenic variant primary progressive aphasia; nfvPPA, non-fluent variant primary progressive aphasia; STG, superior temporal gyrus; STS, superior temporal sulcus; svPPA, semantic variant primary progressive aphasia.