Iván G Torre, Łukasz Dębowski, Antoni Hernández-Fernández.
Abstract
Menzerath's law is a quantitative linguistic law which states that, on average, the longer a linguistic construct is, the shorter its constituents are. In contrast, Menzerath-Altmann's law (MAL) is a precise mathematical power-law-exponential formula which expresses the expected length of the linguistic construct conditioned on the number of its constituents. In this paper, we investigate the anatomy of MAL for constructs being word tokens and constituents being syllables, measuring their length in graphemes. First, we derive the exact form of MAL for texts generated by the memoryless source with three emitted symbols, which can be interpreted as a monkey typing model or a null model. We show that this null model complies with Menzerath's law, revealing that Menzerath's law itself can hardly be a criterion of complexity in communication. This observation does not apply to the more precise Menzerath-Altmann's law, which predicts an inverted regime for sufficiently long constructs, i.e., the longer a word is, the longer its syllables are. To support this claim, we analyze MAL on data from 21 languages, consisting of texts from the Standardized Project Gutenberg. We show the presence of the inverted regime, not exhibited by the null model, and we demonstrate the robustness of our results. We also report the complicated distribution of syllable sizes with respect to their position in the word, which might be related to the emerging MAL. Altogether, our results indicate that Menzerath's law, in terms of correlations, is a spurious observation, while complex patterns and efficiency dynamics should rather be attributed to specific forms of Menzerath-Altmann's law.
Year: 2021 PMID: 34415939 PMCID: PMC8378695 DOI: 10.1371/journal.pone.0256133
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Memoryless source model.
The memoryless source has only one state and emits, in an infinite loop, symbols C, V and S with probabilities p_C, p_V, and 1 − p_C − p_V, respectively.
Fig 2. Menzerath-Altmann’s law for the memoryless source.
The parameter p = 0.48 is chosen in accordance with real languages. The solid line is the theoretical model given by Eqs (4) and (5), whereas the square points are a simulation of the memoryless source.
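The memoryless source can be simulated in a few lines. The sketch below is an illustration under explicit assumptions that are not stated in the captions: each vowel V counts as one syllable, the size of a syllable is the word length divided by the number of vowels, and words without vowels are discarded. The vowel probability `p_v = 0.36` is a hypothetical value. Eqs (4) and (5) are not reproduced here, so the closed form in `expected_syllable_size` is derived only from these assumptions and may differ in detail from the paper's formulas.

```python
import random
from collections import defaultdict


def simulate_mal(p_c, p_v, n_symbols, seed=0):
    """Simulate a memoryless source over {C, V, S} and return the mean
    syllable size for each word size (number of syllables).

    Assumptions (illustrative): syllables = number of vowels in the word,
    syllable size = word length / number of vowels, vowel-less words skipped.
    """
    rng = random.Random(seed)
    totals = defaultdict(float)
    counts = defaultdict(int)
    consonants = vowels = 0
    for _ in range(n_symbols):
        u = rng.random()
        if u < p_c:
            consonants += 1
        elif u < p_c + p_v:
            vowels += 1
        else:  # space: the current word ends here
            if vowels > 0:
                totals[vowels] += (consonants + vowels) / vowels
                counts[vowels] += 1
            consonants = vowels = 0
    return {n: totals[n] / counts[n] for n in sorted(counts)}


def expected_syllable_size(n, p_c):
    """Closed form under the same assumptions: given n vowels, the number of
    consonants is negative binomial with mean (n + 1) * p_c / (1 - p_c)."""
    return 1.0 + (n + 1) * p_c / (n * (1.0 - p_c))


if __name__ == "__main__":
    curve = simulate_mal(p_c=0.48, p_v=0.36, n_symbols=400_000, seed=1)
    for n in range(1, 6):
        print(n, round(curve[n], 3), round(expected_syllable_size(n, 0.48), 3))
```

Under these assumptions the expected syllable size depends only on the consonant probability and decreases monotonically with the number of syllables, which is the behaviour the caption attributes to the null model.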
The adopted grapheme classification into seven sonority classes.
| Sonority class | Graphemes |
|---|---|
| Vowels | a, e, i, o, u, y, à, á, â, ä, æ, ã, å, ā, ą, è, é, ê, ë, ē, ė, ę, î, ï, í, ī, į, ì, ô, ö, ò, ó, œ, ø, ō, õ, û, ü, ù, ú, ū, ů, ÿ, ű, ő, ŵ, ŷ, ỳ, ẁ, ě, ý, ǫ |
| Approximants | ŭ, w, ł |
| Liquids | l, r, ř |
| Nasals | m, n, ñ, ń, ŋ, ň |
| Fricatives | ß, z, v, s, f, ç, ć, ś, ŝ, ĉ, ĥ, h, ĵ, š, ž, ð, đ |
| Affricates | x, j, ź, ż, ĝ, č |
| Occlusives | b, c, d, g, t, k, p, q, þ, ď, ť |
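The classification above is a plain lookup table, so it can be expressed as a dictionary. The following sketch only maps each grapheme of a word to its class; it does not implement any syllabification procedure. The names `SONORITY_CLASSES` and `sonority_profile` and the `"Unknown"` label are hypothetical.

```python
import unicodedata

# Grapheme classes copied from the table above.
SONORITY_CLASSES = {
    "Vowels": ("a e i o u y à á â ä æ ã å ā ą è é ê ë ē ė ę î ï í ī į ì "
               "ô ö ò ó œ ø ō õ û ü ù ú ū ů ÿ ű ő ŵ ŷ ỳ ẁ ě ý ǫ").split(),
    "Approximants": "ŭ w ł".split(),
    "Liquids": "l r ř".split(),
    "Nasals": "m n ñ ń ŋ ň".split(),
    "Fricatives": "ß z v s f ç ć ś ŝ ĉ ĥ h ĵ š ž ð đ".split(),
    "Affricates": "x j ź ż ĝ č".split(),
    "Occlusives": "b c d g t k p q þ ď ť".split(),
}

# Reverse index: grapheme -> class name (keys normalized to NFC).
GRAPHEME_TO_CLASS = {
    unicodedata.normalize("NFC", grapheme): name
    for name, graphemes in SONORITY_CLASSES.items()
    for grapheme in graphemes
}


def sonority_profile(word):
    """Return the sonority class of every character in `word`.

    Characters outside the table (digits, punctuation, other scripts)
    are labelled "Unknown"; that label is an illustrative choice.
    """
    normalized = unicodedata.normalize("NFC", word.lower())
    return [GRAPHEME_TO_CLASS.get(ch, "Unknown") for ch in normalized]


if __name__ == "__main__":
    print(sonority_profile("Łódź"))
```

Normalizing to NFC before the lookup keeps accented letters as single code points, so a letter typed as base character plus combining mark still matches its table entry.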
Summary of characteristics for the 21 languages studied, including the total number of books, word tokens, syllables, consonants (C) and vowels (V), as well as the probabilities of vowels (p_V), consonants (p_C) and spaces (p_S).
The last column (Syll/V) reports the syllable-vowel ratio.
| Language | Books | Tokens | Syll. | C | V | p_V | p_C | p_S | Syll/V |
|---|---|---|---|---|---|---|---|---|---|
| English | 2500 | 1.5 ⋅ 10^8 | 2.4 ⋅ 10^8 | 4.0 ⋅ 10^8 | 2.6 ⋅ 10^8 | 0.33 | 0.49 | 0.18 | 0.89 |
| French | 2500 | 1.5 ⋅ 10^8 | 2.5 ⋅ 10^8 | 3.7 ⋅ 10^8 | 3.0 ⋅ 10^8 | 0.37 | 0.45 | 0.18 | 0.82 |
| Finnish | 2226 | 7.7 ⋅ 10^7 | 1.9 ⋅ 10^8 | 2.6 ⋅ 10^8 | 2.4 ⋅ 10^8 | 0.42 | 0.45 | 0.13 | 0.80 |
| German | 1801 | 8.5 ⋅ 10^7 | 1.5 ⋅ 10^8 | 2.8 ⋅ 10^8 | 1.7 ⋅ 10^8 | 0.32 | 0.52 | 0.16 | 0.87 |
| Italian | 834 | 5.1 ⋅ 10^7 | 1.0 ⋅ 10^8 | 1.3 ⋅ 10^8 | 1.2 ⋅ 10^8 | 0.39 | 0.44 | 0.17 | 0.91 |
| Dutch | 817 | 4.5 ⋅ 10^7 | 7.3 ⋅ 10^7 | 1.3 ⋅ 10^8 | 8.9 ⋅ 10^7 | 0.34 | 0.49 | 0.17 | 0.82 |
| Spanish | 650 | 4.2 ⋅ 10^7 | 8.1 ⋅ 10^7 | 1.0 ⋅ 10^8 | 9.0 ⋅ 10^7 | 0.39 | 0.43 | 0.18 | 0.89 |
| Portuguese | 565 | 1.5 ⋅ 10^7 | 2.9 ⋅ 10^7 | 3.6 ⋅ 10^7 | 3.3 ⋅ 10^7 | 0.39 | 0.43 | 0.18 | 0.87 |
| Hungarian | 237 | 1.2 ⋅ 10^7 | 2.6 ⋅ 10^7 | 3.8 ⋅ 10^7 | 2.7 ⋅ 10^7 | 0.35 | 0.49 | 0.16 | 0.96 |
| Swedish | 206 | 8.6 ⋅ 10^6 | 1.4 ⋅ 10^7 | 2.6 ⋅ 10^7 | 1.5 ⋅ 10^7 | 0.30 | 0.52 | 0.18 | 0.99 |
| Esperanto | 106 | 2.2 ⋅ 10^6 | 4.1 ⋅ 10^6 | 5.8 ⋅ 10^6 | 4.4 ⋅ 10^6 | 0.35 | 0.47 | 0.18 | 0.94 |
| Latin | 88 | 3.9 ⋅ 10^6 | 8.4 ⋅ 10^6 | 1.2 ⋅ 10^7 | 9.4 ⋅ 10^6 | 0.38 | 0.47 | 0.15 | 0.89 |
| Danish | 70 | 3.9 ⋅ 10^6 | 6.4 ⋅ 10^6 | 1.1 ⋅ 10^7 | 6.7 ⋅ 10^6 | 0.31 | 0.50 | 0.19 | 0.96 |
| Tagalog | 57 | 9.9 ⋅ 10^5 | 2.0 ⋅ 10^6 | 2.7 ⋅ 10^6 | 2.3 ⋅ 10^6 | 0.38 | 0.45 | 0.17 | 0.90 |
| Catalan | 32 | 1.2 ⋅ 10^6 | 2.0 ⋅ 10^6 | 2.8 ⋅ 10^6 | 2.2 ⋅ 10^6 | 0.36 | 0.45 | 0.19 | 0.89 |
| Polish | 29 | 4.1 ⋅ 10^5 | 8.3 ⋅ 10^5 | 1.2 ⋅ 10^6 | 9.2 ⋅ 10^5 | 0.35 | 0.49 | 0.16 | 0.90 |
| Norwegian | 20 | 8.0 ⋅ 10^5 | 1.2 ⋅ 10^6 | 2.1 ⋅ 10^6 | 1.3 ⋅ 10^6 | 0.31 | 0.50 | 0.19 | 0.95 |
| Czech | 10 | 3.6 ⋅ 10^5 | 7.2 ⋅ 10^5 | 1.0 ⋅ 10^6 | 7.2 ⋅ 10^5 | 0.34 | 0.48 | 0.17 | 0.99 |
| Welsh | 10 | 2.2 ⋅ 10^5 | 3.3 ⋅ 10^5 | 5.6 ⋅ 10^5 | 3.7 ⋅ 10^5 | 0.33 | 0.48 | 0.19 | 0.88 |
| Icelandic | 7 | 7.9 ⋅ 10^4 | 1.3 ⋅ 10^5 | 2.1 ⋅ 10^5 | 1.4 ⋅ 10^5 | 0.32 | 0.50 | 0.18 | 0.96 |
| Afrikaans | 6 | 2.0 ⋅ 10^5 | 3.0 ⋅ 10^5 | 4.9 ⋅ 10^5 | 3.8 ⋅ 10^5 | 0.36 | 0.46 | 0.18 | 0.77 |
Fig 3. Different methods of computing MAL yield similar results.
Relation between word size (measured in number of syllables) and the mean size of those syllables, where each panel corresponds to a different language of the Gutenberg corpus. Each thin grey line represents one book, whereas black circles are the mean over books. Blue squares are the result of computing MAL from the full corpus. Both methods show similar results; solid lines are drawn only for visual comparison. Results on 17 additional languages are provided in S1 Fig.
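The two ways of computing MAL compared in this figure can be sketched as follows. This is an illustration, not the authors' code: it assumes each book is already given as a list of words and each word as a list of syllable strings, and the function names are hypothetical.

```python
from collections import defaultdict


def mal_curve(words):
    """Mean syllable size (in characters) for each word size (in syllables).

    `words` is an iterable of words, each a list of syllable strings.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for syllables in words:
        n = len(syllables)
        if n == 0:
            continue
        total[n] += sum(len(s) for s in syllables) / n
        count[n] += 1
    return {n: total[n] / count[n] for n in sorted(count)}


def mal_mean_over_books(books):
    """Method 1: one curve per book, then average the curves over books."""
    curves = [mal_curve(book) for book in books]
    sizes = sorted({n for curve in curves for n in curve})
    result = {}
    for n in sizes:
        values = [curve[n] for curve in curves if n in curve]
        result[n] = sum(values) / len(values)
    return result


def mal_full_corpus(books):
    """Method 2: pool all words of all books, then compute a single curve."""
    return mal_curve(word for book in books for word in book)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    book_a = [["ca", "sa"], ["sol"], ["ma", "ri", "po", "sa"]]
    book_b = [["pan"], ["a"], ["lu", "na"]]
    print(mal_mean_over_books([book_a, book_b]))
    print(mal_full_corpus([book_a, book_b]))
```

Averaging over books gives every book equal weight, while pooling weights each book by its number of words, so the two curves can differ for small corpora even though the figure reports similar results at corpus scale.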
Fig 4. Menzerath-Altmann’s law and the memoryless source baseline.
Relation between the word size and the mean size of syllables for Italian, Dutch, Spanish and Portuguese. Experimental results are shown as blue circles, the red dotted line is a fit to Menzerath-Altmann’s law, Eq (2), whereas the gray solid line corresponds to Eq (4) with p given by the relative frequency of consonants. Results for 17 additional languages are provided in S2 Fig.
Menzerath’s law tested on the 21 languages under study.
Spearman’s rank correlation coefficients and p-values between the size of the word and the mean size of the syllables. The regression function (4) for the memoryless source is monotonically decreasing so Spearman’s rank correlation coefficient equals −1 in this case.
| Language | Correlation | p-value |
|---|---|---|
| English | −0.64 | 0.09 |
| French | −1 | <10^−2 |
| Finnish | 0.14 | 0.64 |
| German | −0.25 | 0.49 |
| Italian | 0.07 | 0.86 |
| Dutch | −0.53 | 0.14 |
| Spanish | −1 | <10^−2 |
| Portuguese | −0.83 | <10^−2 |
| Hungarian | −0.68 | 0.04 |
| Swedish | −0.26 | 0.53 |
| Esperanto | −0.96 | <10^−2 |
| Latin | −0.89 | <10^−2 |
| Danish | −0.62 | 0.1 |
| Tagalog | −1 | <10^−2 |
| Catalan | −0.77 | 0.07 |
| Polish | −1 | <10^−2 |
| Norwegian | −0.61 | 0.15 |
| Czech | −0.83 | 0.04 |
| Welsh | −1 | <10^−2 |
| Icelandic | 0.0 | 1.0 |
| Afrikaans | −0.83 | 0.04 |
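Spearman's coefficient is the Pearson correlation of the ranks, which is why any strictly decreasing curve scores exactly −1. The following pure-Python sketch illustrates this; it computes the coefficient only (no p-value), the function names are hypothetical, and the decreasing curve `1 / n` is an arbitrary stand-in rather than the paper's Eq (4).

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


if __name__ == "__main__":
    word_sizes = list(range(1, 10))
    decreasing_curve = [1.0 / n for n in word_sizes]  # arbitrary example
    print(spearman(word_sizes, decreasing_curve))
```

Because only ranks matter, the coefficient says nothing about the shape of the curve. A language whose syllable size first falls and then rises with word size can therefore show a weak or non-significant correlation, as several rows of the table do.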
The estimated parameters of MAL for the Gutenberg Corpus.
MAL was fitted to the experimental data using the Levenberg–Marquardt algorithm, excluding words with only one syllable. R² (coefficient of determination) is used to assess the goodness of fit. Column β/γ gives the observable extremum of MAL when β ⋅ γ > 0 and is left blank otherwise.
| Language | α | β | γ | R² | β/γ |
|---|---|---|---|---|---|
| English | 3.19 | −3.5 ⋅ 10^−1 | −5.4 ⋅ 10^−2 | 0.82 | 6.5 |
| French | 2.96 | −2.4 ⋅ 10^−2 | 2.6 ⋅ 10^−2 | 0.94 | - |
| Finnish | 2.69 | −5.6 ⋅ 10^−2 | −8.8 ⋅ 10^−3 | 0.63 | 6.4 |
| German | 3.19 | −1.5 ⋅ 10^−1 | −2.4 ⋅ 10^−2 | 0.62 | 6.3 |
| Italian | 2.59 | −1.8 ⋅ 10^−1 | −3.6 ⋅ 10^−2 | 0.41 | 5.1 |
| Dutch | 3.27 | −1.4 ⋅ 10^−1 | −1.3 ⋅ 10^−2 | 0.63 | 10.9 |
| Spanish | 2.6 | −5.3 ⋅ 10^−2 | 4.6 ⋅ 10^−3 | 0.98 | - |
| Portuguese | 2.66 | −1.1 ⋅ 10^−1 | −1.0 ⋅ 10^−2 | 0.93 | 10.9 |
| Hungarian | 2.69 | −8.9 ⋅ 10^−2 | −1.1 ⋅ 10^−2 | 0.81 | 7.9 |
| Swedish | 2.82 | −1.0 ⋅ 10^−1 | −2.0 ⋅ 10^−2 | 0.71 | 5.1 |
| Esperanto | 2.69 | −1.4 ⋅ 10^−1 | −2.0 ⋅ 10^−2 | 0.95 | 7.1 |
| Latin | 2.82 | −1.7 ⋅ 10^−1 | −2.1 ⋅ 10^−2 | 0.95 | 8.2 |
| Danish | 2.77 | −5.2 ⋅ 10^−2 | −6.5 ⋅ 10^−3 | 0.44 | 8.1 |
| Tagalog | 2.69 | −1.1 ⋅ 10^−1 | −3.9 ⋅ 10^−3 | 0.98 | 28.7 |
| Catalan | 2.83 | −1.9 ⋅ 10^−1 | −2.5 ⋅ 10^−2 | 0.96 | 7.5 |
| Polish | 3.13 | −1.6 ⋅ 10^−1 | −4.1 ⋅ 10^−3 | 0.99 | 39.5 |
| Norwegian | 2.87 | −1.4 ⋅ 10^−1 | −2.7 ⋅ 10^−2 | 0.67 | 5.2 |
| Czech | 2.76 | −2.3 ⋅ 10^−1 | −3.5 ⋅ 10^−2 | 0.95 | 6.7 |
| Welsh | 3.3 | −2.8 ⋅ 10^−3 | 5.9 ⋅ 10^−2 | 0.99 | - |
| Icelandic | 2.67 | 5.8 ⋅ 10^−2 | 1.9 ⋅ 10^−2 | 0.13 | 3.0 |
| Afrikaans | 3.4 | −3.0 ⋅ 10^−1 | −4.8 ⋅ 10^−2 | 0.97 | 6.2 |
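The β/γ column can be checked with a short calculation. The sketch below assumes the usual power-law-exponential form y(n) = a ⋅ n^β ⋅ exp(−γ ⋅ n), where `a` denotes the scale parameter in the first numeric column. Eq (2) is not reproduced here, so this form is an assumption, chosen because its stationary point lies at n = β/γ, as the caption states. The function names are hypothetical.

```python
from math import exp


def mal(n, a, beta, gamma):
    """Assumed MAL form: mean syllable size for a word of n syllables."""
    return a * n ** beta * exp(-gamma * n)


def mal_extremum(beta, gamma):
    """Word size at which the assumed MAL curve has its extremum.

    Setting dy/dn = y * (beta / n - gamma) = 0 gives n = beta / gamma,
    which is positive (observable) only when beta * gamma > 0.
    """
    if beta * gamma > 0:
        return beta / gamma
    return None


if __name__ == "__main__":
    # English row of the table: a = 3.19, beta = -0.35, gamma = -0.054.
    n_star = mal_extremum(-0.35, -0.054)
    print(round(n_star, 1))
    # French row: beta and gamma have opposite signs, so no extremum.
    print(mal_extremum(-0.024, 0.026))
    # Beyond the minimum the curve rises again (the inverted regime).
    print(mal(7, 3.19, -0.35, -0.054) < mal(10, 3.19, -0.35, -0.054))
```

With both β and γ negative, as in most rows, the extremum is a minimum: syllables shorten as words grow up to roughly β/γ syllables and lengthen beyond that point.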
Fig 5. Syllable sizes depending on word length and position.
The mean size of syllables, measured in number of characters, for different word sizes and syllable positions in English, Hungarian, Esperanto and Tagalog. The position in the word has been standardized according to Eq (6), 0 being the first syllable and 1 being the last syllable. The mean size for monosyllabic words is represented with a black circle at x = 0.5. The results for 17 additional languages are provided in S4 Fig.
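One standardization consistent with this description is a linear rescaling of the syllable index. Eq (6) is not reproduced here, so the formula below is an assumption that only matches the stated endpoints (0 for the first syllable, 1 for the last) and the placement of monosyllabic words at 0.5. The function name is hypothetical.

```python
def standardized_positions(n_syllables):
    """Map syllable indices 0..n-1 of a word onto the interval [0, 1].

    Assumed linear rescaling: index / (n - 1). Monosyllabic words have no
    first/last distinction and are placed at 0.5, as in the figure.
    """
    if n_syllables < 1:
        raise ValueError("a word needs at least one syllable")
    if n_syllables == 1:
        return [0.5]
    return [i / (n_syllables - 1) for i in range(n_syllables)]


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, standardized_positions(n))
```

Rescaling lets words of different sizes share one horizontal axis, so initial, medial and final syllables can be compared across word lengths.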