Learning to decipher time-compressed speech: Robust acquisition with a slight difficulty in generalization among young adults with developmental dyslexia.

Yafit Gabay^1,2, Avi Karni^2,3, Karen Banai^4.

Abstract

Learning to decipher acoustically distorted speech serves as a test case for the study of language-related skill acquisition in persons with developmental dyslexia (DD). Deciphering this type of input is rarely learned explicitly and does not yield conscious insights. Problems in implicit and procedural skill learning have been proposed as possible causes of DD. Here we examined the learning of time-compressed (accelerated) speech and its generalization to novel materials among young adults with DD compared to typical readers (TD). All participants completed a training session that involved judging the semantic plausibility of sentences, during which the level of time compression was changed using an adaptive (staircase) procedure according to each participant's performance. In the test phase, learning (test on same items) and generalization (test on new items and on same items spoken by a new speaker) were assessed. Both groups showed robust gains after training. Moreover, after training, the initial disadvantage of the DD group was no longer significant. After training, both groups had relative difficulty deciphering learned tokens spoken by a different voice, whereas participants with DD were less able to generalize the gains to deciphering new tokens. Thus, DD individuals benefited from repeated experience with time-compressed speech no less than typical readers, but their evolving skill was apparently more dependent on the specific characteristics of the tokens. Atypical generalization, which indicates that perceptual learning is contingent on lower-level features of the input but does not necessarily point to impaired learning potential per se, may explain some of the contradictory findings in published studies of speech perception in DD.

Year: 2018; PMID: 30356320; PMCID: PMC6200219; DOI: 10.1371/journal.pone.0205110

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240

Introduction

Although usually transparent to listeners, speech perception is quite a challenging task. In particular, it requires mapping the acoustic input onto stable (pre-lexical/lexical) representations even though the speech signal itself varies as a result of between-speaker differences, changes in speech rate [1] and environmental conditions [2]. Speech stimuli constitute a learning challenge for the perceptual system because accurate speech recognition requires generalization across the highly variable acoustic information that underlies the speech signal. Listeners overcome this variability through perceptual learning, a process by which they align their perceptual system with new variations in the speech input [3]. Perceptual learning has been demonstrated across a variety of tasks in which the speech signal is noisy, distorted (e.g., noise-vocoded, spectrally shifted or time-compressed speech) or otherwise unusual (e.g., unfamiliar dialects or accents) [4]. Previous studies suggest that adaptive training procedures that start with relatively little signal distortion ("easy" items, not far removed from standard speech) may be advantageous for learning and its generalization [5, 6].

Implicit and procedural learning in speech perception

Since speech is rarely learned explicitly and perceptual learning does not yield conscious insights that can be easily communicated, the perceptual learning of speech is a case of implicit learning of skills that are essential to human communication [7]. Implicit learning refers to situations in which learning occurs incidentally [8], and the knowledge gained through this process is believed to be implicit because participants find it difficult to conceptualize what has been learned [9]. Implicit and procedural learning has been related to the acquisition and formation of motor skills [10]. An accumulating body of evidence also implicates it in language-related skills, including the acquisition of grammar, syntax, morphology and phonology [11–14]. Research closely related to the present study also implicates implicit learning in the perceptual learning of speech [15].

Implicit and procedural learning in developmental dyslexia

Developmental dyslexia (DD) is one of the most common neurodevelopmental disorders, with prevalence estimated at 5–10% [16]. Despite extensive research, the underlying biological and cognitive causes of DD remain unclear. DD has been thought to arise from phonological impairments [17]. Recent conceptualizations of dyslexia implicate domain-general procedural and/or implicit learning systems in its etiology [13, 18–22]. These views are based on increasing evidence for the role of non-declarative systems in language learning and development [11–15] and on the many findings that individuals with dyslexia often show impairments on procedural and implicit learning tasks [21, 23–31]. Although there is evidence suggesting intact procedural learning in DD [32–34], a recent meta-analysis suggests that compensatory declarative learning mechanisms may mask procedural learning deficits in DD [27]. Perceptual learning has previously been examined in individuals with DD in both the visual [35, 36] and the auditory modalities [37–42], but perceptual learning of speech stimuli has rarely been assessed. Compared with other stimuli, speech poses a different learning challenge for the perceptual system; generalization, for example, may be particularly important for speech perception because the acoustic information underlying the speech signal is highly variable. The goal of the current study was therefore to investigate the perceptual learning of distorted speech in people with DD. Such speech is often particularly challenging for people with dyslexia [43]. Many studies suggest that typically developing individuals can adapt to such speech quite rapidly, especially under favorable learning conditions [44], and recent studies suggest that adaptive protocols that begin with easy tasks provide such conditions [5, 6].
Previous observations suggest that adaptive training conditions yield more perceptual learning and generalization than constant training conditions [5]. Thus, to provide a strict test of the hypothesis that learning may differ between DD and TD participants, we used an adaptive protocol in the current study. Three indices of learning were investigated. First, we asked whether rapid baseline adaptation to time-compressed speech is affected by DD. Second, we compared the effects of adaptive training on the recognition of time-compressed speech between the two groups of readers. Third, we compared the two groups' ability to transfer training-related gains to novel conditions, i.e., conditions not encountered in training. Two types of transfer were studied: (1) transfer to stimuli that share the high-level features of the trained tokens but differ in their low-level features, i.e., the sentences presented during training, produced by a new, unfamiliar speaker; and (2) transfer to stimuli that share the low-level features of the trained stimuli but differ in their high-level features, i.e., novel sentences uttered by the speaker heard in the training phase.

Methods

Participants

Participants were 24 university students (undergraduate or graduate), 12 with dyslexia (5 female) and 12 typical readers (7 female). A similar sample size was sufficient to detect group differences on the same task between native and non-native listeners [45]. Participants were native Hebrew speakers with no history of neurological disorders, psychiatric disorders or attention deficits. In addition, participants were right-handed, had normal or corrected-to-normal vision, and had normal hearing (participants in the DD group were screened for normal hearing; participants in the control group reported no hearing impairment). The DD group was recruited from the Student Support Service at the University of Haifa, a center that provides support for students with learning disabilities. Dyslexia was diagnosed by the University of Haifa Learning Disabilities Diagnostic Center using the MATAL test battery. This battery is designed to assess developmental disabilities (dyslexia, dysgraphia, dyscalculia and attention deficit disorder) in adults who are native Hebrew speakers. The MATAL is a standardized battery developed by the Israeli National Institute for Testing and the Israeli Council for Higher Education [46]. It comprises 20 tests and 54 performance measures, and was validated and normed on a standardization sample of 508 participants. The MATAL has been used in many previous investigations for the assessment of dyslexia [47, 48]. The typical reading (TD) group consisted of participants with no history of learning disabilities. Both the DD and the TD groups completed a battery of cognitive and literacy tests similar to the battery used in the study by […]. The ethics committee of the Faculty of Social Welfare and Health Sciences at the University of Haifa (199/12) approved all aspects of the study, and written informed consent was obtained from all participants.

Cognitive and literacy measures

Intellectual ability

Intelligence was assessed with two subtests of the Wechsler Adult Intelligence Scale [49]. The first is the non-verbal block design subtest, in which participants arrange blocks with different color patterns to reproduce a pattern presented to them on a card. The second is the verbal similarities subtest, in which participants indicate what the two words in a pair have in common (e.g., what do a dog and a cat have in common? Both are animals).

Verbal working memory

Verbal working memory was assessed with the Digit Span subtest of the Wechsler Adult Intelligence Scale [49]. In this test, the examiner reads a list of digits aloud and the examinee repeats the digits either in the same order (forward) or in reverse order (backward). Administration is discontinued after failure on two trials with the same number of digits.

Reading skills

Decoding, reading fluency and reading comprehension tests were administered, as described below. Two tests were used to assess decoding skills: the One Minute Tests of Words [50] and of Non-words [51], which measure the number of words and non-words read aloud accurately within one minute. The first test included 168 non-vowelized words of equal difficulty, listed in columns. The second consisted of 86 vowelized non-words of increasing difficulty, listed in columns. In both tests, accuracy (number of items read correctly per minute) and speed (number of items read per minute) were recorded. Reading fluency was assessed with the Oral Reading Test, taken from the reading comprehension subtest of the Israeli Psychometric Exam. In this test, participants read a 216-word text aloud as quickly and accurately as possible, and the number of words read correctly per minute was calculated.
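The two reading scores above are simple rates. The sketch below shows one way to compute them; the function names and the assumption that the oral-reading time is recorded in seconds for the whole text are illustrative, not taken from the original scoring materials.

```python
def words_correct_per_minute(words_correct: int, seconds: float) -> float:
    """Number of correctly read words, scaled to a one-minute rate (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("reading time must be positive")
    return words_correct * 60.0 / seconds


def one_minute_test_scores(item_correct: list) -> tuple:
    """Score a one-minute list-reading test (hypothetical helper).

    item_correct holds one entry per item read aloud before the minute expired
    (True = read correctly). Returns (accuracy, speed): items read correctly
    and items read, both already per minute because the window is 60 s.
    """
    return sum(item_correct), len(item_correct)
```

For example, a reader who read 200 of the 216 words correctly in 120 s would score 100 words correct per minute.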

Reading-related skills

Phonological awareness was assessed with three tests: Phoneme Deletion, Segmentation and Parsing [52]. The phoneme deletion test consists of 25 non-words: the experimenter reads an item and a phoneme aloud, and the participant says how the item sounds after that phoneme is deleted. The segmentation test includes 16 non-words read aloud by the experimenter; the task is to segment each item into its basic phonological units as quickly as possible. The parsing test [53] contains 46 rows of words, each composed of four words printed without spaces between them; participants identify the words in each row by drawing a line where the spaces should be. For all three tests, both accuracy (number of correct items) and time (the time required to complete the task) were measured. Naming skills were assessed with the RAN Naming Speed Test [54], which includes tests for naming objects, naming letters, and naming alternating objects and letters. In the letter-naming test (RAN letters), five non-final Hebrew letters (ס, א, ד, ג, ל) were presented repeatedly in random order, each letter appearing ten times, and participants read the 50 letters aloud as quickly and accurately as they could. The object-naming test (RAN objects) consists of pictures of five objects (flower, cat, book, watch and flag), each repeated ten times in random order, and participants named the 50 pictures aloud as quickly and accurately as they could. In both tasks, accuracy and the time needed to name the entire list were measured.

TD and DD participants did not differ in chronological age or in intelligence (nonverbal ability measured by the block design subtest; verbal ability measured by the similarities subtest). However, there were significant group differences in reading, naming and phonological skills (see Table 1), confirming group assignment with respect to reading ability.
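A RAN list as described (five items, each shown ten times in random order) can be generated with a plain shuffle. This is a sketch only: the text does not say whether one fixed order was used for everyone or whether immediate repetitions were avoided, so `make_ran_list` (a hypothetical name) draws an unconstrained shuffle.

```python
import random
from typing import List, Optional

# Stimulus sets listed in the text.
RAN_LETTERS = ["ס", "א", "ד", "ג", "ל"]
RAN_OBJECTS = ["flower", "cat", "book", "watch", "flag"]


def make_ran_list(items: List[str], repeats: int = 10, seed: Optional[int] = None) -> List[str]:
    """Return a randomly ordered list in which each item appears `repeats` times."""
    rng = random.Random(seed)
    chart = [item for item in items for _ in range(repeats)]
    rng.shuffle(chart)  # unconstrained shuffle; adjacent repeats are possible
    return chart
```

Passing a fixed `seed` reproduces the same list across participants, which is one way to keep the chart identical for everyone.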

Table 1

Performance of the DD and TD groups on cognitive and literacy measures.

Measure	Dyslexia Mean (SD)	Dyslexia Range	Control Mean (SD)	Control Range	t value	p
Age (in years)	27 (2.21)	24–35	27.75 (2.98)	25–32	-.69	.49
Decoding
Oral word recognition accuracy	63.58 (17.84)	39–98	119.58 (15.24)	98–153	-8.04	.01
Oral word recognition speed	67.77 (16.53)	45–98	119.08 (15.24)	98–153	-8.15	.01
Oral non-words recognition accuracy	26.08 (9.69)	9–38	60.41 (11.42)	45–81	-7.93	.01
Oral non-words recognition speed	39.83 (9.03)	29–64	68 (11.81)	50–86	-6.55	.01
Reading Fluency measures
Oral text fluency- words per min	96.05 (33.15)	14–145	164.01 (20.42)	127–212	-5.84	.01
Naming digits	28.38 (5.13)	20–37	20 (1.95)	16–23	-5.28	.01
Naming objects	43.81 (4.82)	38–51	31.08 (3.39)	27–39	7.46	.01
Phonological processing
Phoneme deletion (time)	194.84 (60.64)	111–293	105.16 (22.5)	81–157	4.8	.01
Phoneme deletion (accuracy)	19.33 (6.3)	3–25	23.66 (2.96)	15–25	19.33	.05
Segmentation (time)	120.8 (59.06)	88–304	94.83 (42.29)	55–205	1.23	n.s.
Segmentation (accuracy)	11 (4.22)	2–16	14.33 (2.14)	9–16	-2.43	.05
Parsing (time)	329.36 (73.88)	192–244	178.25 (32.06)	135–232	6.48	.01
Parsing (accuracy)	42.41 (3.08)	36–46	45.08 (.79)	44–46	-2.89	.01
Digit Span	8.5 (1.83)	6–11	12.33 (2.64)	9–19	-4.13	.01
Intellectual ability
Block design (nonverbal intelligence)	11.66 (3.39)	4–16	12.5 (2.43)	8–16	-0.69	n.s.
Similarities (verbal intelligence)	12.08 (1.56)	10–15	12.16 (1.69)	10–15	-0.12	n.s.
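The table reports t values without naming the test. Assuming independent-samples Student t-tests with pooled variance and 12 participants per group (df = 22), the t values can be recomputed from the summary statistics in the table; the sketch below reproduces the age, block design and similarities rows to within rounding of the reported means and SDs.

```python
import math


def pooled_t(mean1: float, sd1: float, n1: int, mean2: float, sd2: float, n2: int):
    """Student's independent-samples t (pooled variance) from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the mean difference
    return (mean1 - mean2) / se, df


# Dyslexia group first, control group second, as in Table 1.
t_age, df_age = pooled_t(27, 2.21, 12, 27.75, 2.98, 12)      # about -0.70 (reported -.69)
t_sim, _ = pooled_t(12.08, 1.56, 12, 12.16, 1.69, 12)         # about -0.12 (reported -0.12)
```

SciPy users can obtain the same statistic with `scipy.stats.ttest_ind_from_stats(..., equal_var=True)`.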

Experimental procedure

Stimuli

The stimuli and procedure were similar to those used in our previous study [5]. A young male native speaker of Hebrew (the trained speaker) recorded the stimuli, sampled at 44 kHz, using a standard microphone, a PC soundcard and Audacity software. Several additional sentences, used to assess generalization to a new speaker, were recorded by a second native Hebrew speaker. RMS levels of all sentences were normalized after recording and before compression. Stimuli were time-compressed with a WSOLA algorithm [55], which changes speech rate while preserving other qualities such as pitch and timbre. The stimulus set comprised 120 simple active subject-verb-object (SVO) sentences in Hebrew taken from the study by Prior and Bentin [56]. Each sentence contained 5–6 words, with adjectives modifying both the subject and the object. The naturally spoken sentences lasted 2.3–4.2 s (72–144 words/minute), a speech rate similar to that of Israeli newscasters [57]. Sixty sentences were semantically plausible (true; e.g., "The municipal museum purchased the impressionist painting"), whereas the remaining 60 (false) contained a semantic violation that rendered them improbable (e.g., "The municipal museum ate the impressionist painting"). One hundred sentences (50 true) were used for training. Twenty of these trained sentences were presented in the pre-test and test phases to assess learning of the repeated tokens. Likewise, 20 of the trained sentences uttered by a different speaker were used to assess cross-speaker generalization. The remaining 20 sentences were used to assess generalization to untrained tokens.
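The two signal-processing steps named above, RMS normalization and WSOLA time compression, can be sketched in NumPy. This is a simplified, textbook-style WSOLA (Hann-windowed frames at 50% overlap, plain cross-correlation search), not the implementation used in the study; the 882-sample frame (20 ms at an assumed 44,100 Hz), the 441-sample search tolerance and the target RMS of 0.05 are illustrative assumptions.

```python
import numpy as np


def rms_normalize(x: np.ndarray, target_rms: float = 0.05) -> np.ndarray:
    """Scale a waveform so that its root-mean-square level equals target_rms."""
    rms = np.sqrt(np.mean(x ** 2))
    return x if rms == 0 else x * (target_rms / rms)


def wsola(x: np.ndarray, alpha: float, frame_len: int = 882, tolerance: int = 441) -> np.ndarray:
    """Compress x to `alpha` times its duration (0.3 = 30%) without resampling,
    so pitch is preserved. Simplified WSOLA sketch."""
    hop_s = frame_len // 2                 # synthesis (output) hop
    hop_a = hop_s / alpha                  # analysis (input) hop; larger = faster speech
    n_out = int(round(len(x) * alpha))
    n_frames = max((n_out - frame_len) // hop_s + 1, 1)
    xp = np.concatenate([x, np.zeros(2 * (frame_len + tolerance))])  # room for the search
    window = np.hanning(frame_len)
    y = np.zeros(n_frames * hop_s + frame_len)
    norm = np.zeros_like(y)
    prev = 0                               # start of the previously copied input segment
    for k in range(n_frames):
        if k == 0:
            best = 0
        else:
            # The natural continuation of the last copied segment is the template.
            target = xp[prev + hop_s: prev + hop_s + frame_len]
            nominal = int(round(k * hop_a))
            lo, hi = max(nominal - tolerance, 0), nominal + tolerance
            # Pick the candidate near the nominal position that best matches the template.
            scores = np.correlate(xp[lo: hi + frame_len], target, mode="valid")
            best = lo + int(np.argmax(scores))
        out = k * hop_s
        y[out: out + frame_len] += window * xp[best: best + frame_len]
        norm[out: out + frame_len] += window
        prev = best
    y = np.divide(y, norm, out=np.zeros_like(y), where=norm > 1e-8)
    return y[:n_out]
```

Normalizing before compressing, as the text describes, means every compressed token starts from the same RMS level.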

Procedure

Testing took place in a quiet room, with participants seated directly in front of a computer monitor throughout the experiment. Stimulus presentation and time compression were controlled with Matlab. Stimuli were presented binaurally through headphones (Sennheiser HD-215). The experiment was administered in a single session of approximately one hour, and the cognitive and literacy tests were administered in a separate session. The experimental session consisted of three phases: a pre-test phase in which baseline performance was assessed, a training phase and a test phase. During the pre-test and test phases, participants wrote down each sentence as accurately as they could. During the training phase, participants pressed a key to indicate whether each sentence was plausible. The training phase used 100 different time-compressed sentences, presented in five blocks of 60 trials each, on which listeners performed a semantic verification task: after hearing each sentence, they judged whether it was semantically improbable (false) or probable (true). Sentences were selected at random (without replacement) until all 100 had been presented, after which random selection began again. Visual feedback (a smiling or sad face) followed each response. Training followed an adaptive staircase protocol: it started at a compression level of 65% of the naturally spoken duration, after which compression was adapted with a 2-down/1-up staircase in 25 logarithmically equal steps up to a maximal compression of 20% [58]. The considerations behind our choice of compression rates were very similar to those used in many perceptual learning studies (e.g., [59]).
The aim was to start from typical levels of performance and push participants' performance as far as possible into conditions in which untrained individuals would fail to recognize the stimuli correctly. The compression rates were therefore chosen to span speech rates from easily recognizable up to high-speed stimuli that native listeners cannot recognize without specific training [60, 61].
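The training schedule and staircase described above can be sketched as follows. Assumptions, not stated in the text: "25 logarithmically equal steps" is read as 25 geometrically spaced levels from 65% down to 20% of the natural duration, and one staircase track runs across all five blocks. A 2-down/1-up rule converges near 70.7% correct, which matches the 71% verification thresholds used in the analysis. `TwoDownOneUp` and `sentence_order` are hypothetical names.

```python
import random

# Assumption: 25 compression levels with equal log steps from 65% to 20%
# of the natural duration; index 0 is the easiest (least compressed).
N_LEVELS = 25
LEVELS = [0.65 * (0.20 / 0.65) ** (i / (N_LEVELS - 1)) for i in range(N_LEVELS)]


class TwoDownOneUp:
    """2-down/1-up staircase over LEVELS (tracks about 70.7% correct)."""

    def __init__(self) -> None:
        self.index = 0          # start at 65% of the natural duration
        self.correct_run = 0    # consecutive correct responses at the current level

    @property
    def level(self) -> float:
        return LEVELS[self.index]

    def update(self, correct: bool) -> None:
        if correct:
            self.correct_run += 1
            if self.correct_run == 2:      # two correct in a row: compress more
                self.index = min(self.index + 1, N_LEVELS - 1)
                self.correct_run = 0
        else:                              # any error: compress less
            self.index = max(self.index - 1, 0)
            self.correct_run = 0


def sentence_order(n_sentences: int = 100, n_trials: int = 300, seed=None) -> list:
    """Random order without replacement, reshuffled after each full pass."""
    rng = random.Random(seed)
    order = []
    while len(order) < n_trials:
        batch = list(range(n_sentences))
        rng.shuffle(batch)
        order.extend(batch)
    return order[:n_trials]
```

Five blocks of 60 trials give 300 trials, so each of the 100 training sentences is heard three times.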

Test and training tasks

During the pre-test phase, 20 sentences compressed to 30% of their naturally spoken duration were presented. The test phase used the same compression level and consisted of three conditions of 20 trials each (repeated items, new items, and repeated items spoken by a different speaker). In both phases, the participants' task was to write the sentences down as accurately as they could. 1) In the repeated-items test, 20 sentences randomly selected from the training set were uttered by the same male speaker heard during the training phase. 2) In the new-items test, 20 new sentences with a semantic structure similar to that of the training sentences were uttered by the same speaker heard throughout the training phase. 3) In the new-speaker test, 20 sentences selected from the training set were uttered by a different male speaker. The order of the three tests was counterbalanced across participants. No feedback was provided during either the pre-test or the test. See Fig 1 for an illustration of the design.
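The text states only that the order of the three tests was counterbalanced. One possible scheme, shown here as an assumption, rotates participants through all 3! = 6 orders; with 12 participants per group, each order is then used twice per group. `assign_test_orders` is a hypothetical helper.

```python
from itertools import permutations

TEST_CONDITIONS = ("repeated items", "new items", "new speaker")


def assign_test_orders(participant_ids):
    """Assign the six possible test orders in rotation across participants."""
    orders = list(permutations(TEST_CONDITIONS))
    return {pid: orders[i % len(orders)] for i, pid in enumerate(participant_ids)}
```

Calling it separately for each group keeps the orders balanced within, and not just across, groups.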

Fig 1

Procedure.

Participants performed a pre-test and five training blocks (60 trials each), followed by a test with three conditions. Test performance on the same tokens indicates learning; test performance with the new speaker and with new tokens indicates generalization.

Results

Data analysis

Performance during the pre-test and test was quantified as the mean proportion of words correctly identified across all sentences in a given condition. Orthographic errors (e.g., homophone substitutions) were not counted as errors, because the purpose was to assess whether listeners heard the sentences correctly, not to assess their writing skills. Incomplete or incorrect suffixes were counted as errors because Hebrew is an inflected language and suffixes affect the meaning of the sentence (e.g., changing the timing of an event from past to future). The mean proportion of sentences correctly judged (verification) in each block was used to quantify performance during the training phase; to this end, mean verification thresholds were calculated from the final five trials of each block. We first calculated participants' level of performance during the pre-test phase, since previous research has shown that in typical readers rapid learning can be observed even during the pre-test phase [60]. We then estimated training-phase performance in the two groups by calculating each listener's 71%-correct verification threshold (for details see [45]). Group differences after training were then evaluated: test performance on the repeated tokens was compared with pre-test performance as evidence of learning in both groups. Finally, test performance on the trained items was compared with performance on new items and on items produced by a new speaker to test for generalization.
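The two dependent measures can be sketched as follows, under stated assumptions. `proportion_words_correct` credits each target word at most as often as it occurs in the sentence and requires an exact word-form match, so a wrong suffix counts as an error; the manual leniency for homophone spellings described above is not modeled. `block_threshold` assumes the block threshold is the mean compression level over the final five trials. Both names and the English example are illustrative.

```python
from collections import Counter


def proportion_words_correct(target: str, response: str) -> float:
    """Share of target words found in the written response (exact word forms)."""
    target_words = target.split()
    if not target_words:
        return 0.0
    # Multiset intersection: a repeated target word must be written twice to count twice.
    matched = Counter(target_words) & Counter(response.split())
    return sum(matched.values()) / len(target_words)


def block_threshold(levels: list, n_last: int = 5) -> float:
    """Mean compression level over the final n_last trials of a training block."""
    tail = levels[-n_last:]
    return sum(tail) / len(tail)
```

For the target "the municipal museum purchased the impressionist painting" and the response "the municipal museum bought the painting", 5 of the 7 target words are credited.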

Rapid learning during the pre-test

Fig 2 shows mean accuracy over the first and last five trials of the pre-test phase. An analysis of variance was conducted with group (TD vs. DD) as a between-subject factor, learning (first five trials vs. last five trials) as a within-subject factor, and the mean proportion of words correctly identified as the dependent variable. The main effect of group was significant, indicating that DD participants were less able than TD participants to decipher time-compressed speech (F(1, 22) = 8.68, p < .01; η² = .28). The main effect of learning was also significant, indicating that recognition accuracy improved during the pre-test (rapid learning; F(1, 22) = 122.58, p < .01; η² = .28). The group-by-learning interaction was not significant (F < 1), suggesting that both groups improved to a similar extent during this phase.
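Effect sizes of this kind can be recovered from the F ratio and its degrees of freedom. Assuming the reported values are partial eta squared, the standard conversion is ηp² = F·df_effect / (F·df_effect + df_error); for the group effect, F(1, 22) = 8.68 gives .28, matching the reported value.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


group_effect = partial_eta_squared(8.68, 1, 22)        # about .283
learning_effect = partial_eta_squared(122.58, 1, 22)   # about .848
```

Applied to the learning effect, F(1, 22) = 122.58, the same formula gives about .85.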

Fig 2

Pre-test performance (mean of the first five trials vs. mean of the last five trials) as a function of group (TD vs. DD).