Abstract
Zipf's law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) is more strongly correlated with word length, although this tendency is not observed consistently and depends on several methodological choices. The present study examines a more diverse sample of languages than previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish). I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters and in phonemes (for some of the languages), as well as word frequency, informativity given the previous word and informativity given the next word, applying different methods of bigram processing. The results show different correlations between word length and the corpus-based measures for different languages. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly by the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
Keywords: Zipf’s law of abbreviation; corpora; frequency; informativity; linguistic typology; n-grams
Year: 2022 PMID: 35205578 PMCID: PMC8870940 DOI: 10.3390/e24020280
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
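The abstract defines contextual informativity as the average surprisal of a word given its context, estimated here from bigrams. The following is a minimal sketch of that idea, not the study's actual pipeline: the function name `informativity`, the `context` parameter and the toy sentences are hypothetical, and no bigram cleaning or smoothing is applied.

```python
import math
from collections import Counter, defaultdict


def informativity(tokens, context="previous"):
    """Average surprisal (in bits) of each word given one adjacent word.

    context="previous": condition on the word to the left.
    context="next":     condition on the word to the right.
    Probabilities are plain relative frequencies over bigrams (an assumption).
    """
    bigrams = Counter(zip(tokens, tokens[1:]))

    # How often each word occurs in the context slot of a bigram.
    context_totals = Counter()
    for (first, second), n in bigrams.items():
        context_totals[first if context == "previous" else second] += n

    surprisal_sum = defaultdict(float)  # summed surprisal over occurrences
    occurrences = Counter()             # occurrences that have a context
    for (first, second), n in bigrams.items():
        if context == "previous":
            target, ctx = second, first
        else:
            target, ctx = first, second
        p = n / context_totals[ctx]     # P(target | context word)
        surprisal_sum[target] += -n * math.log2(p)
        occurrences[target] += n

    return {w: surprisal_sum[w] / occurrences[w] for w in occurrences}


prev_info = informativity("the cat sat on the mat".split())
next_info = informativity("a b c b".split(), context="next")
```

In the first toy sentence, "cat" and "mat" each follow "the" half of the time, so each carries one bit of informativity given the previous word, whereas "sat" is fully predictable from "cat" and carries zero bits.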
Genealogical information (according to www.wals.info, accessed on 1 December 2021) and number of tokens in the web corpora.
| Language | Genus | Total Number of Tokens (with Punctuation Marks) | Number of Tokens in Dictionaries and Subtitles |
|---|---|---|---|
| Arabic | Semitic | 232,612,052 | 196,210,964 |
| Finnish | Finnic | 130,150,353 | 93,230,461 |
| Hungarian | Ugric | 192,639,921 | 139,922,959 |
| Indonesian | Malayo-Sumbawan | 183,318,963 | 131,109,405 |
| Russian | Slavic | 165,441,482 | 127,190,505 |
| Spanish | Romance | 206,340,525 | 168,109,866 |
| Turkish | Turkic | 157,739,375 | 108,278,142 |
Figure 1. Spearman’s correlations between length and corpus-based measures in seven languages. Pink: based on all bigrams; turquoise: based on cleaned bigrams.
P-values from tests of the differences between correlation coefficients (informativity vs. frequency). The winning measure, i.e., the one with the highest positive correlation with length, is shown in parentheses.
| Language | Bigram Processing Method | Frequency vs. Informativity Given Previous Word | Frequency vs. Informativity Given Next Word |
|---|---|---|---|
| Arabic | all bigrams | | |
| | cleaned bigrams | | |
| Finnish | all bigrams | | |
| | cleaned bigrams | | |
| Hungarian | all bigrams | | |
| | cleaned bigrams | | |
| Indonesian | all bigrams | | |
| | cleaned bigrams | | |
| Russian | all bigrams | | |
| | cleaned bigrams | | |
| Spanish | all bigrams | | |
| | cleaned bigrams | | |
| Turkish | all bigrams | | |
| | cleaned bigrams | | |
Figure 2. Relationships between word lengths in characters and three corpus-based measures: negative log-frequency (top left), informativity given previous word (top right) and informativity given next word (bottom left). The points are the trimmed means based on 500 bootstrapped samples. The error bars represent one standard deviation from the mean. The informativity measures are based on cleaned bigrams.
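The points and error bars in Figure 2 are described as trimmed means over 500 bootstrapped samples with one standard deviation. One way such a summary could be computed for the words of a single length is sketched below. The trimming proportion of 0.1, the resampling unit and the function names are assumptions for illustration; the caption does not specify them.

```python
import random
import statistics


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the given proportion of values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


def bootstrap_trimmed_mean(values, n_boot=500, proportion=0.1, seed=0):
    """Return (mean, standard deviation) of trimmed means over resamples.

    Each resample draws len(values) items with replacement.
    proportion=0.1 is an assumed trimming level, not taken from the source.
    """
    rng = random.Random(seed)
    estimates = [
        trimmed_mean(rng.choices(values, k=len(values)), proportion)
        for _ in range(n_boot)
    ]
    return statistics.fmean(estimates), statistics.stdev(estimates)


# Hypothetical informativity scores for words of one particular length.
scores = [3.1, 2.8, 3.4, 9.9, 3.0, 2.9, 3.3, 3.2, 0.1, 3.1]
point, error_bar = bootstrap_trimmed_mean(scores)
```

Trimming limits the influence of extreme values such as 9.9 and 0.1 in the toy data, which is the usual reason for preferring a trimmed mean over a plain mean in this kind of plot.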
Correlations between predictability measures and word length in phonemes and in characters: Spearman’s correlation coefficients and 95% confidence intervals (in parentheses). The informativity measures are based on cleaned bigrams.
| Language | Predictability Measure | Correlation with Length in Phonemes | Correlation with Length in Characters |
|---|---|---|---|
| Hungarian | Neg. frequency | 0.25 (0.23, 0.27) | 0.24 (0.23, 0.26) |
| | Info given prev | 0.10 (0.08, 0.11) | 0.10 (0.08, 0.12) |
| | Info given next | 0.14 (0.12, 0.16) | 0.13 (0.11, 0.15) |
| Indonesian | Neg. frequency | 0.16 (0.14, 0.18) | 0.17 (0.15, 0.18) |
| | Info given prev | 0.28 (0.26, 0.30) | 0.28 (0.27, 0.30) |
| | Info given next | 0.09 (0.07, 0.11) | 0.09 (0.08, 0.11) |
| Spanish | Neg. frequency | 0.17 (0.15, 0.19) | 0.17 (0.16, 0.19) |
| | Info given prev | −0.01 (−0.03, 0.01) | 0.01 (−0.01, 0.02) |
| | Info given next | 0.18 (0.16, 0.20) | 0.19 (0.17, 0.21) |
| Turkish | Neg. frequency | 0.28 (0.26, 0.30) | 0.28 (0.26, 0.30) |
| | Info given prev | −0.02 (−0.03, 0.00) | −0.02 (−0.04, 0.00) |
| | Info given next | 0.31 (0.29, 0.34) | 0.31 (0.29, 0.33) |
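The table above reports Spearman's correlation coefficients with 95% confidence intervals. As a hedged illustration of how such figures can be obtained, the sketch below computes Spearman's rho from average ranks and a percentile-bootstrap interval. The bootstrap is an assumed choice, since the source does not state how the intervals were derived, and all function names and data are hypothetical.

```python
import random
import statistics


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Return (rho, lower, upper) using a percentile bootstrap (assumed)."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # rho is undefined when a resample is constant
        stats.append(spearman(xs, ys))
    stats.sort()
    lower = stats[int(alpha / 2 * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return spearman(x, y), lower, upper


# Hypothetical word lengths and informativity scores.
lengths = [2, 3, 3, 4, 5, 6, 7, 8, 9, 10]
info = [1.0, 1.5, 1.2, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0]
rho, lower, upper = spearman_ci(lengths, info)
```

Because Spearman's rho depends only on ranks, measuring length in phonemes or in characters gives similar coefficients whenever the two length measures order the words in nearly the same way, which is consistent with the close agreement between the two columns.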
Mean lengths and standard deviations of common nouns and their lexical modifiers. Numbers in bold show the longer NP elements. The numbers are valid only for the predominant order.
| Language | Predominant Order | Number of Modified Heads | Mean Length of Heads | Mean Length of Modifiers |
|---|---|---|---|---|
| Arabic | Head + Modifier | 39,200 | 5.07 | |
| Finnish | Modifier + Head | 11,603 | 9.08 | |
| Hungarian | Modifier + Head | 18,358 | 7.68 | |
| Indonesian | Head + Modifier | 17,257 | 6.62 | |
| Russian | Modifier + Head | 13,057 | 7.27 | |
| Spanish | Head + Modifier | 6,759 | 7.69 | |
| Turkish | Modifier + Head | 19,246 | 5.95 |