Katja Stärk1, Evan Kidd1,2,3, Rebecca L A Frost1.
Abstract
To acquire language, infants must learn to segment words from running speech. A significant body of experimental research shows that infants use multiple cues to do so; however, little research has comprehensively examined the distribution of such cues in naturalistic speech. We conducted a comprehensive corpus analysis of German child-directed speech (CDS) using data from the Child Language Data Exchange System (CHILDES) database, investigating the availability of word stress, transitional probabilities (TPs), and lexical and sublexical frequencies as potential cues for word segmentation. Seven hours of data (~15,000 words) were coded, representing around an average day of speech to infants. The analysis revealed that for 97% of words, primary stress was carried by the initial syllable, implicating stress as a reliable cue to word onset in German CDS. Word identity was also marked by TPs between syllables, which were higher within than between words, and higher for backwards than forwards transitions. Words followed a Zipfian-like frequency distribution, and over two-thirds of words (78%) were monosyllabic. Of the 50 most frequent words, 82% were function words, which accounted for 47% of word tokens in the entire corpus. Finally, 15% of all utterances comprised single words. These results give rich novel insights into the availability of segmentation cues in German CDS, and support the possibility that infants draw on multiple converging cues to segment their input. The data, which we make openly available to the research community, will help guide future experimental investigations on this topic.
Keywords: German; Language acquisition; child-directed speech; distributional cues; speech segmentation
Year: 2021 PMID: 33517856 PMCID: PMC8886305 DOI: 10.1177/0023830920979016
Source DB: PubMed Journal: Lang Speech ISSN: 0023-8309 Impact factor: 1.500
Frequency of primary word stress at each syllable position.
| Syllable | Primary word stress: | | Primary word stress: | | Primary word stress: | |
|---|---|---|---|---|---|---|
| | Count | % | Count | % | Count | % |
| 1 | 14,206 | 96.90 | 1536 | 87.03 | 2771 | 85.92 |
| 2 | 398 | 2.71 | 191 | 10.82 | 398 | 12.34 |
| 3 | 47 | 0.32 | 31 | 1.76 | 47 | 1.46 |
| 4 | 6 | 0.04 | 5 | 0.28 | 6 | 0.19 |
| 5 | 1 | 0.01 | 1 | 0.06 | 1 | 0.03 |
| 6 | 0 | – | 0 | – | 0 | – |
| 7 | 2 | 0.01 | 1 | 0.06 | 2 | 0.06 |
| 8 | 0 | – | 0 | – | 0 | – |
| Total | 14,660 | | 1765 | | 3225 | |
Figure 1. Density plot of transitional probabilities (TPs) between syllables in the corpus. The panels on the left and right show the frequency data for backwards and forwards transitions, respectively. TPs within words are indicated in green, whereas TPs between words are indicated in orange.
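As a rough illustration of the quantities plotted in Figure 1, the sketch below computes forward and backward TPs for adjacent syllable pairs and records whether each pair occurs within a word or across a word boundary. It assumes the common definitions (forward TP = frequency of the pair divided by the frequency of its first syllable; backward TP = frequency of the pair divided by the frequency of its second syllable) and assumes that pairs are not counted across utterance boundaries; the study's exact procedure may differ. The function name, the input layout, and the toy data are hypothetical.

```python
from collections import Counter

def transitional_probabilities(utterances):
    """Compute forward and backward TPs for adjacent syllable pairs.

    `utterances` is a list of utterances; each utterance is a list of words,
    and each word is a list of syllable strings (a hypothetical layout).
    """
    syllable_counts = Counter()
    pair_counts = Counter()
    contexts = {}  # pair -> set of contexts ("within", "between") it was seen in
    for utterance in utterances:
        # Flatten to (syllable, word index) so word boundaries stay visible.
        flat = [(syl, w_idx) for w_idx, word in enumerate(utterance) for syl in word]
        syllable_counts.update(syl for syl, _ in flat)
        # Assumption: pairs never span two utterances.
        for (a, wa), (b, wb) in zip(flat, flat[1:]):
            pair_counts[(a, b)] += 1
            contexts.setdefault((a, b), set()).add("within" if wa == wb else "between")
    forward = {pair: n / syllable_counts[pair[0]] for pair, n in pair_counts.items()}
    backward = {pair: n / syllable_counts[pair[1]] for pair, n in pair_counts.items()}
    return forward, backward, contexts

# Toy input: two utterances, "mama kommt" and "mama".
toy = [[["ma", "ma"], ["kommt"]], [["ma", "ma"]]]
forward, backward, contexts = transitional_probabilities(toy)
print(forward[("ma", "kommt")], backward[("ma", "kommt")])
```

In the toy data, the between-word pair ("ma", "kommt") has a low forward TP because "ma" is followed by several different syllables, but a high backward TP because "kommt" is always preceded by "ma".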
Summary of the linear mixed-effects model for transitional probabilities (TPs).
| | Estimate | 95% confidence interval | Standard error | χ² | df | p |
|---|---|---|---|---|---|---|
| (Intercept) | 0.220 | [0.215, 0.225] | 0.003 | – | – | – |
| Context | −0.219 | [−0.229, −0.209] | 0.005 | 1556.53 | 1 | < 0.001 |
| Direction | 0.047 | [0.035, 0.059] | 0.006 | 52.46 | 1 | < 0.001 |
| Context × direction | −0.041 | [−0.063, −0.018] | 0.012 | 11.24 | 1 | < 0.001 |
Notes: context distinguished between within-word and between-word TPs; direction distinguished between forwards and backwards TPs. Both predictors were deviation-coded (within-word, −0.5; between-word, 0.5; forwards, −0.5; backwards, 0.5). Model fit: Bayesian information criterion (BIC) = 2211; modified Akaike information criterion (AICc) = 2151; marginal R² = 0.098; conditional R² = 0.464.
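To make the deviation coding concrete, the following sketch combines the fixed-effect estimates from the table with the ±0.5 codes to obtain an approximate predicted TP for each of the four cells. It assumes the model was fitted on the raw TP scale and ignores the random effects; the estimates are the rounded values reported above, so the results are approximate. The function and variable names are illustrative only.

```python
# Fixed-effect estimates copied from the model summary table (rounded values).
INTERCEPT = 0.220
CONTEXT = -0.219
DIRECTION = 0.047
INTERACTION = -0.041

# Deviation codes as given in the table notes.
CONTEXT_CODES = {"within": -0.5, "between": 0.5}
DIRECTION_CODES = {"forwards": -0.5, "backwards": 0.5}

def predicted_tp(context, direction):
    """Fixed-effects-only prediction for one context/direction cell."""
    c = CONTEXT_CODES[context]
    d = DIRECTION_CODES[direction]
    return INTERCEPT + CONTEXT * c + DIRECTION * d + INTERACTION * c * d

for context in CONTEXT_CODES:
    for direction in DIRECTION_CODES:
        print(context, direction, round(predicted_tp(context, direction), 3))
```

Under these assumptions, the predicted TPs are roughly 0.30 (within-word, forwards), 0.36 (within-word, backwards), 0.10 (between-word, forwards), and 0.12 (between-word, backwards). With deviation coding, the intercept is the unweighted mean of these four cells, and the negative interaction means the backwards advantage is larger within words than between words.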
Figure 2. Density plot of word token frequencies, indicating the extent to which words occur with particular frequencies in the corpus.
Summary of word token frequencies in the present corpus of German child-directed speech, and in Kaeding’s (1897) study of written German.
| Word token frequency | Current dataset | Kaeding (1897) |
|---|---|---|
| 1 | 49.14% | 49.76% |
| 2 | 13.37% | 15.12% |
| 3 | 6.61% | 7.65% |
| 4 | 4.31% | 4.34% |
| 5 | 3.04% | 3.37% |
| 6–10 | 7.76% | 7.47% |
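A frequency summary of this kind can be sketched as follows. The example assumes that each percentage is the share of distinct words that occur exactly that many times in the corpus; the table itself does not state the denominator, so this reading is an assumption. The function name and the toy token list are hypothetical.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each token frequency n to the share of distinct words occurring exactly n times."""
    word_counts = Counter(tokens)               # word -> number of tokens
    spectrum = Counter(word_counts.values())    # frequency n -> number of distinct words
    n_distinct = len(word_counts)
    return {freq: count / n_distinct for freq, count in sorted(spectrum.items())}

# Toy input: "der" occurs three times, "da" twice, "ball" and "hund" once each.
toy_tokens = ["der", "der", "der", "ball", "hund", "da", "da"]
spectrum = frequency_spectrum(toy_tokens)
print(spectrum)
```

A Zipfian-like distribution shows up in such a spectrum as a large share of words that occur only once, with shares falling off quickly as the frequency increases.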
Figure 3. Frequencies for the 50 most frequent words in the corpus. Panel A (left) shows word tokens, and Panel B (right) shows word types.
Figure 4. Syllable and syllable structure frequencies in the corpus. Panel A (left) shows the 50 most frequent syllables, and Panel B (right) shows all 45 different syllable structures. Because we consider syllabic consonants such as [ṇ] as consonants, it is possible to have syllables with multiple consonants but no vowels (e.g., the second syllable of the verb putzen [pʊtsṇ] “clean” consists of three consonants); similarly, because we code long vowels as VV, it is possible to have syllables with multiple vowels (e.g., the word er [e:ɐ̯] “he” consists of three vowels).
Frequency statistics for word length (measured in number of syllables).
| Number of syllables | Number of word tokens | | Number of unique word tokens | | Number of unique word types | |
|---|---|---|---|---|---|---|
| | Count | % | Count | % | Count | % |
| 1 | 11,435 | 78.00 | 655 | 37.09 | 553 | 34.33 |
| 2 | 2550 | 17.39 | 715 | 40.49 | 672 | 41.71 |
| 3 | 545 | 3.72 | 293 | 16.59 | 285 | 17.69 |
| 4 | 98 | 0.67 | 78 | 4.42 | 76 | 4.72 |
| 5 | 21 | 0.14 | 17 | 0.96 | 17 | 1.06 |
| 6 | 9 | 0.06 | 7 | 0.40 | 7 | 0.43 |
| 7 | 0 | – | 0 | – | 0 | – |
| 8 | 2 | 0.01 | 1 | 0.06 | 1 | 0.06 |
| Total | 14,660 | | 1766 | | 1611 | |
Background information on the data included in the corpus.
| Corpus | Child | Age | Length of recording | Total number of utterances | Total number of utterances excluding unintelligible utterances | Speaker | Number of utterances per speaker | Number of utterances per speaker excluding unintelligible utterances |
|---|---|---|---|---|---|---|---|---|
| Caroline | Caroline (female (f)) | 00;10.01 | 00:06:44 | 22 | 22 | MOT | 22 | 22 |
| | | 00;10.02 | 00:06:37 | 20 | 20 | MOT | 20 | 20 |
| | | 00;11.25 | 00:18:22 | 81 | 79 | MOT | 63 | 63 |
| | | | | | | FAT | 18 | 16 |
| | | 01;00.19 | 00:11:04 | 37 | 37 | MOT | 37 | 37 |
| | | 01;00.23 | 00:16:37 | 69 | 69 | MOT | 53 | 53 |
| | | | | | | FAT | 16 | 16 |
| | | 01;01.02 | 00:12:06 | 61 | 61 | MOT | 61 | 61 |
| | | 01;01.04 | 00:01:46 | 10 | 10 | MOT | 10 | 10 |
| Manuela | Dasca (male (m)) | 00;06.13 | 00:00:34 | 5 | 5 | MOT | 5 | 5 |
| | Nibra (m) | 00;10.24 | 00:00:24 | 5 | 4 | MOT | 5 | 4 |
| | Oskoa (m) | 00;10.12 | 00:01:05 | 28 | 28 | MOT | 28 | 28 |
| | | 00;10.12 | 00:00:50 | 18 | 18 | MOT | 18 | 18 |
| | Viala (m) | 00;06.13 | 00:01:08 | 21 | 21 | MOT | 21 | 21 |
| | Viwia (m) | 00;10.20 | 00:00:15 | 7 | 7 | MOT | 7 | 7 |
| Miller | Kerstin (f) | 01;03.22 | Approximately 00:15:00 | 214 | 211 | MOT | 199 | 198 |
| | | | | | | FAT | 13 | 11 |
| | | | | | | OBS | 2 | 2 |
| | | 01;04.13 | Approximately 00:30:00 | 421 | 416 | MOT | 284 | 283 |
| | | | | | | FAT | 22 | 18 |
| | | | | | | OBS | 79 | 79 |
| | | | | | | VIS | 36 | 36 |
| Rigol | Corinna (f) | 01;00.09 | 00:31:22 | 714 | 657 | MOT | 379 | 356 |
| | | | | | | FAT | 322 | 289 |
| | | | | | | OBS | 13 | 12 |
| | | 01;00.23 | 00:32:38 | 629 | 577 | MOT | 279 | 257 |
| | | | | | | FAT | 348 | 318 |
| | | | | | | OBS | 2 | 2 |
| | | 01;01.08 | 00:33:44 | 487 | 479 | MOT | 463 | 455 |
| | | | | | | OBS | 24 | 24 |
| | Cosima (f) | 01;08.13 | 00:30:38 | 449 | 443 | MOT | 354 | 350 |
| | | | | | | FAT | 14 | 12 |
| | | | | | | OBS | 81 | 81 |
| Wagner | Katrin (f) | 01;05.15 | 03:22:00 | 855 | 803 | MOT | 607 | 570 |
| | | | | | | FAT | 248 | 233 |
| Total | | | 07:32:54 | 4153 | 3967 | | | |
Notes: the Caroline corpus contains utterances that consist of several sentences, which makes them longer than the utterances in the other corpora; to work with comparable numbers, the number of utterances in the Caroline corpus would need to be increased. * The length of recording of the Miller corpus is an estimate based on the number of utterances in the datasets. Speaker abbreviations: MOT = mother, FAT = father, OBS = observer/researcher, VIS = visitor.