Weibing Deng1, Armen E Allahverdyan2.
Abstract
We study rank-frequency relations for phonemes, the minimal units that still relate to linguistic meaning. We show that these relations can be described by the Dirichlet distribution, a direct analogue of the ideal-gas model in statistical mechanics. This description allows us to demonstrate that the rank-frequency relations for phonemes of a text do depend on its author. The author-dependency effect is not caused by the author's vocabulary (common words used in different texts), and is confirmed by several alternative means. This suggests that it can be directly related to phonemes. These features contrast with rank-frequency relations for words, which are both author and text independent and are governed by Zipf's law.
Year: 2016 PMID: 27058596 PMCID: PMC4825982 DOI: 10.1371/journal.pone.0152561
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Rank-frequency curves and error generated by the Dirichlet density with β = 0.8 and n = 44.
Blue curve: ⟨θ(r)⟩ (as a function of r) calculated according to Eqs (7)–(9). Black curve: calculated via the approximate formula Eq (10); cf. S2 Appendix. Red points: the normalized variance for r = 1, …, 44 calculated according to Eqs (7)–(9). This expression is well approximated in S2 Appendix.
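The kind of curve shown in Fig 1 can be explored numerically. The following is a minimal sketch, not the calculation of Eqs (7)–(9): it assumes the Dirichlet density is symmetric with a single parameter β and estimates the mean of the r-th largest component by Monte Carlo sampling. The function name `mean_ranked_dirichlet`, the sample count and the seed are illustrative choices.

```python
import random

def mean_ranked_dirichlet(n=44, beta=0.8, samples=2000, seed=0):
    """Monte Carlo estimate of the mean r-th largest component of a
    symmetric Dirichlet vector with n components and parameter beta
    (assumption: one common beta for all components)."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(samples):
        # A symmetric Dirichlet sample is a vector of i.i.d. Gamma(beta, 1)
        # draws divided by their sum.
        draws = [rng.gammavariate(beta, 1.0) for _ in range(n)]
        norm = sum(draws)
        ranked = sorted((x / norm for x in draws), reverse=True)
        for r, value in enumerate(ranked):
            totals[r] += value
    return [t / samples for t in totals]

curve = mean_ranked_dirichlet()
for rank in (1, 2, 10, 44):
    print(rank, round(curve[rank - 1], 4))
```

By construction the estimated curve is non-increasing in r and its values sum to 1, as expected of a rank-frequency relation.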
Nine texts and their parameters.
Texts are abbreviated and numbered. The four columns give, respectively, the total number of words, the number of phonemes in all words, the number of different words, and the number of phonemes in the different words.
J. Austen: Mansfield Park (MP or 1) 1814; Pride and Prejudice (PP or 2) 1813; Sense and Sensibility (SS or 3) 1811.
C. Dickens: A Tale of Two Cities (TC or 4) 1859; Great Expectations (GE or 5) 1861; Adventures of Oliver Twist (OT or 6) 1838.
J. Tolkien: The Fellowship of the Ring (FR or 7) 1954; The Two Towers (TT or 8) 1954; The Return of the King (RK or 9) 1955.
| Texts | Total words | Phonemes in all words | Different words | Phonemes in different words |
|---|---|---|---|---|
| MP (1) | 160473 | 567750 | 7854 | 48747 |
| PP (2) | 121763 | 435322 | 6385 | 39767 |
| SS (3) | 119394 | 425822 | 6264 | 38668 |
| TC (4) | 135420 | 468642 | 9841 | 58760 |
| GE (5) | 186683 | 623079 | 10933 | 65364 |
| OT (6) | 159103 | 555372 | 10359 | 61072 |
| FR (7) | 177227 | 617106 | 8644 | 46509 |
| TT (8) | 143436 | 502303 | 7676 | 39823 |
| RK (9) | 134462 | 431141 | 7087 | 36494 |
Fitting parameters for texts numbered as 1–9; see Eqs (11) and (12) and Table 1 for text numbers.
The phoneme frequencies are extracted from all words of the text.
| Parameters | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| β | 0.61 | 0.63 | 0.61 | 0.67 | 0.69 | 0.69 | 0.75 | 0.74 | 0.79 |
|  | 7696 | 7574 | 6151 | 4317 | 5287 | 3993 | 4196 | 4337 | 3580 |
|  | 0.9768 | 0.9765 | 0.9816 | 0.9859 | 0.9820 | 0.9867 | 0.9844 | 0.9842 | 0.9860 |
Fitting parameters for texts numbered as 1–9; see Eqs (11) and (12) and Table 1 for text numbers.
The phoneme frequencies are extracted from different words of the text; see Table 2 for the values of β calculated from all words of texts. Eqs (17) and (18) compare the data presented in Tables 2 and 3.
| Parameters | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| β | 0.72 | 0.69 | 0.69 | 0.77 | 0.78 | 0.79 | 0.968 | 0.979 | 0.975 |
|  | 5150 | 4495 | 5003 | 6107 | 5265 | 5220 | 11296 | 12943 | 10366 |
|  | 0.9818 | 0.9847 | 0.9829 | 0.9771 | 0.9800 | 0.9800 | 0.9501 | 0.9403 | 0.9525 |
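The two preceding tables differ in whether phoneme frequencies are extracted from all words of a text or from its different words only. A minimal sketch of that distinction follows; the toy lexicon `LEXICON` and the function `ranked_phoneme_frequencies` are hypothetical, and a real analysis would rely on a full pronouncing dictionary rather than three hand-written entries.

```python
from collections import Counter

# Hypothetical toy pronunciation lexicon (word -> list of phonemes).
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

def ranked_phoneme_frequencies(words, distinct=False):
    """Return (phoneme, relative frequency) pairs, most frequent first.

    distinct=False counts every word occurrence ("all words");
    distinct=True counts each distinct word once ("different words").
    """
    source = set(words) if distinct else words
    counts = Counter(p for w in source for p in LEXICON[w])
    total = sum(counts.values())
    return [(p, c / total) for p, c in counts.most_common()]

text = ["the", "cat", "sat", "the", "the"]
all_words = ranked_phoneme_frequencies(text)
different_words = ranked_phoneme_frequencies(text, distinct=True)
print(all_words)
print(different_words)
```

In this toy text the repeated word "the" raises the frequency of its phonemes when all words are counted, while counting different words removes that weighting.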
The values of β extracted from different words of texts for 5 authors.
For each author we analyzed three texts. They are described in the S3 Appendix, where we also discuss 8 other authors.
| Author | β (text 1) | β (text 2) | β (text 3) |
|---|---|---|---|
| C. Lyell | 0.798 | 0.785 | 0.792 |
| A. R. Wallace | 0.744 | 0.756 | 0.739 |
| C. Darwin | 0.817 | 0.810 | 0.822 |
| H. Spencer | 0.646 | 0.658 | 0.650 |
| H. G. Wells | 0.737 | 0.735 | 0.724 |
Fig 2. Rank-frequency relation (black circles) and the fitting with Dirichlet distribution (red line).
(a) Left figure: text TC, where frequencies were extracted from all words. (b) Right figure: text PP, where different words were employed; see Table 1 for the description of texts.
Fig 3. Rank-frequency relation (black and red circles) for two texts written by the same author.
(a) Left figure: TC and GE written by Dickens (all words were employed for extracting the phoneme frequencies). (b) Right figure: PP and SS written by Austen (different words were employed); see Table 1.
Fig 4. Rank-frequency relation (black and red circles) for two texts written by different authors.
(a) Left figure: TC by Dickens versus MP by Austen (all words were employed). (b) Right figure: SS by Austen versus RK by Tolkien (different words were employed); see Table 1 for parameters of these texts.
Distances ρ0 and ρ1 between texts; see Table 1 and Eqs (20) and (19) for the definition of ρ0 and ρ1.
The phoneme frequencies are extracted from all words of the text. Eqs (24) and (25) compare the distances from all words with those from different words.
| Texts | ρ0 | ρ1 | Texts | ρ0 | ρ1 | Texts | ρ0 | ρ1 | Texts | ρ0 | ρ1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 & 2 | 3045 | 2227 | 1 & 4 | 3583 | 2784 | 2 & 7 | 7653 | 4978 | 4 & 7 | 5174 | 3950 |
| 1 & 3 | 2062 | 1602 | 1 & 5 | 4690 | 3044 | 2 & 8 | 7629 | 5052 | 4 & 8 | 5327 | 3568 |
| 2 & 3 | 2549 | 2103 | 1 & 6 | 4000 | 3260 | 2 & 9 | 7650 | 5449 | 4 & 9 | 5061 | 3935 |
| 4 & 5 | 3423 | 2100 | 1 & 7 | 7372 | 5149 | 3 & 4 | 3562 | 2546 | 5 & 7 | 6113 | 3894 |
| 4 & 6 | 2382 | 1978 | 1 & 8 | 7402 | 5227 | 3 & 5 | 4924 | 3022 | 5 & 8 | 6436 | 4014 |
| 5 & 6 | 3448 | 2753 | 1 & 9 | 7322 | 5599 | 3 & 6 | 4358 | 3181 | 5 & 9 | 6217 | 4325 |
| 7 & 8 | 2584 | 1808 | 2 & 4 | 3645 | 2712 | 3 & 7 | 7737 | 5266 | 6 & 7 | 5074 | 3727 |
| 7 & 9 | 2066 | 1809 | 2 & 5 | 4762 | 3059 | 3 & 8 | 6950 | 5085 | 6 & 8 | 5706 | 3934 |
| 8 & 9 | 2464 | 2037 | 2 & 6 | 4064 | 3110 | 3 & 9 | 7447 | 5654 | 6 & 9 | 5202 | 3770 |
Distances ρ0 and ρ1 between texts; see Table 1 and Eqs (20) and (19).
The phoneme frequencies are extracted from different words of each text after excluding the words that are common for both compared texts; see Eqs (26) and (27) for comparison with the situation without excluding common words.
| Texts | ρ0 | ρ1 | Texts | ρ0 | ρ1 | Texts | ρ0 | ρ1 | Texts | ρ0 | ρ1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 & 2 | 3792 | 2832 | 1 & 4 | 4758 | 3912 | 2 & 7 | 13323 | 9469 | 4 & 7 | 10980 | 7025 |
| 1 & 3 | 3217 | 2463 | 1 & 5 | 5742 | 4276 | 2 & 8 | 15733 | 10387 | 4 & 8 | 13905 | 7371 |
| 2 & 3 | 3734 | 2502 | 1 & 6 | 6087 | 4830 | 2 & 9 | 14113 | 9621 | 4 & 9 | 12109 | 6928 |
| 4 & 5 | 3146 | 2190 | 1 & 7 | 12574 | 8800 | 3 & 4 | 5188 | 4344 | 5 & 7 | 10346 | 6537 |
| 4 & 6 | 2930 | 2215 | 1 & 8 | 15119 | 9576 | 3 & 5 | 5887 | 4917 | 5 & 8 | 13003 | 7021 |
| 5 & 6 | 2329 | 1610 | 1 & 9 | 13490 | 8895 | 3 & 6 | 6476 | 5285 | 5 & 9 | 11673 | 6673 |
| 7 & 8 | 5918 | 3317 | 2 & 4 | 5708 | 4529 | 3 & 7 | 13391 | 9835 | 6 & 7 | 10413 | 6580 |
| 7 & 9 | 4421 | 2773 | 2 & 5 | 6385 | 4991 | 3 & 8 | 15842 | 10637 | 6 & 8 | 13288 | 6667 |
| 8 & 9 | 4770 | 2809 | 2 & 6 | 6880 | 5495 | 3 & 9 | 14244 | 9891 | 6 & 9 | 11911 | 6433 |
Distances ρ0 and ρ1 between texts; see Table 1 and Eqs (20) and (19).
The phoneme frequencies are extracted from different words of the text; see Eqs (24) and (25) for comparison with all words.
| Texts | ρ0 | ρ1 | Texts | ρ0 | ρ1 | Texts | ρ0 | ρ1 | Texts | ρ0 | ρ1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 & 2 | 1563 | 1346 | 1 & 4 | 2296 | 1967 | 2 & 7 | 8141 | 6587 | 4 & 7 | 5918 | 4795 |
| 1 & 3 | 1317 | 1205 | 1 & 5 | 2703 | 2110 | 2 & 8 | 9999 | 7544 | 4 & 8 | 7875 | 5971 |
| 2 & 3 | 1413 | 1346 | 1 & 6 | 2868 | 2470 | 2 & 9 | 9167 | 7136 | 4 & 9 | 6899 | 5368 |
| 4 & 5 | 1568 | 1266 | 1 & 7 | 7430 | 6103 | 3 & 4 | 2718 | 2193 | 5 & 7 | 5521 | 4631 |
| 4 & 6 | 1380 | 1126 | 1 & 8 | 9535 | 7200 | 3 & 5 | 3264 | 2486 | 5 & 8 | 7842 | 5566 |
| 5 & 6 | 1100 | 1052 | 1 & 9 | 8434 | 6775 | 3 & 6 | 3257 | 2636 | 5 & 9 | 6646 | 5222 |
| 7 & 8 | 2853 | 1653 | 2 & 4 | 2839 | 2252 | 3 & 7 | 7943 | 6539 | 6 & 7 | 5595 | 4486 |
| 7 & 9 | 1946 | 1476 | 2 & 5 | 3318 | 2436 | 3 & 8 | 9998 | 7447 | 6 & 8 | 7785 | 5645 |
| 8 & 9 | 2025 | 1569 | 2 & 6 | 3458 | 2709 | 3 & 9 | 8997 | 7022 | 6 & 9 | 6786 | 5201 |
The fraction p of common words between texts given in Table 1.
Here p is defined as follows. Let n(i) and n(ij) be, respectively, the number of different words in text i and the number of words common to texts i and j. We define p(ij) = n(ij)/(n(i) + n(j) − n(ij)), where 0 ≤ p(ij) ≤ 1. This is the number of common words divided by the number of all different words in texts i and j. As seen from the data below, analogues of Eqs (21)–(23) hold with 1 − p(ij) instead of ρλ(ij).
| Texts | p | Texts | p | Texts | p | Texts | p |
|---|---|---|---|---|---|---|---|
| 1 & 2 | 47554 | 1 & 4 | 35592 | 2 & 7 | 26549 | 4 & 7 | 33901 |
| 1 & 3 | 47786 | 1 & 5 | 35819 | 2 & 8 | 24180 | 4 & 8 | 30387 |
| 2 & 3 | 50655 | 1 & 6 | 36660 | 2 & 9 | 24643 | 4 & 9 | 32005 |
| 4 & 5 | 41146 | 1 & 7 | 28978 | 3 & 4 | 33463 | 5 & 7 | 32069 |
| 4 & 6 | 42454 | 1 & 8 | 25870 | 3 & 5 | 32813 | 5 & 8 | 27963 |
| 5 & 6 | 41822 | 1 & 9 | 26730 | 3 & 6 | 34643 | 5 & 9 | 29994 |
| 7 & 8 | 45010 | 2 & 4 | 32902 | 3 & 7 | 27572 | 6 & 7 | 32002 |
| 7 & 9 | 46948 | 2 & 5 | 32499 | 3 & 8 | 25340 | 6 & 8 | 28649 |
| 8 & 9 | 48173 | 2 & 6 | 33877 | 3 & 9 | 25733 | 6 & 9 | 30518 |
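The definition p(ij) = n(ij)/(n(i) + n(j) − n(ij)) is the Jaccard index of the two vocabularies. A minimal sketch of the computation is given here; the function name `common_word_fraction` and the toy word lists are illustrative, and returning 0 for two empty texts is a convention chosen for this sketch.

```python
def common_word_fraction(words_i, words_j):
    """Fraction of common words p(ij) = n(ij) / (n(i) + n(j) - n(ij))."""
    vocab_i, vocab_j = set(words_i), set(words_j)
    n_ij = len(vocab_i & vocab_j)                 # words common to both texts
    n_all = len(vocab_i) + len(vocab_j) - n_ij    # all different words in both texts
    # Convention for this sketch: two empty texts share no words.
    return n_ij / n_all if n_all else 0.0

text_i = ["a", "rose", "is", "a", "rose"]
text_j = ["a", "rose", "by", "any", "name"]
print(common_word_fraction(text_i, text_j))
```

Here the vocabularies are {a, rose, is} and {a, rose, by, any, name}, so two words are common out of six different words in total. Repeated words do not affect the result, since only different words enter the definition.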