Ramon Ferrer-I-Cancho, Brita Elvevåg.
Abstract
BACKGROUND: Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the 2nd most frequent word has rank 2, ...) is approximately linear when plotted on a double logarithmic scale. It has been argued that the law is not a relevant or useful property of language because simple random texts, constructed by concatenating random characters including blanks behaving as word delimiters, exhibit a Zipf's law-like word rank distribution.
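The random text construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `random_text` and `rank_frequencies` are hypothetical, letters are assumed to be equally likely, and empty words (two consecutive blanks) are assumed to be discarded.

```python
import random
from collections import Counter


def random_text(n_words, alphabet, p_blank, seed=None):
    """Generate n_words words by drawing random characters.

    With probability p_blank the next character is a blank, which ends the
    current word; otherwise it is a letter drawn uniformly from alphabet.
    """
    rng = random.Random(seed)
    words, current = [], []
    while len(words) < n_words:
        if rng.random() < p_blank:
            if current:  # assumption: empty words are skipped
                words.append("".join(current))
                current = []
        else:
            current.append(rng.choice(alphabet))
    return words


def rank_frequencies(words):
    """Return word frequencies sorted by rank (rank 1 = most frequent)."""
    return sorted(Counter(words).values(), reverse=True)


words = random_text(1000, "ab", 0.2, seed=1)
freqs = rank_frequencies(words)
```

Plotting `freqs` against rank 1, 2, ... on a double logarithmic scale gives the rank histogram that is compared with that of a real text.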
Year: 2010 PMID: 20231884 PMCID: PMC2834740 DOI: 10.1371/journal.pone.0009411
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The rank histograms of English texts versus those of random texts.
A comparison of the real rank histogram (thin black line) with two control curves (dashed lines) that give the upper and lower bounds of the expected histogram of a random text of the same length in words, for four English texts. Each histogram shows the frequency of a word as a function of its rank. The random text is generated by the model with a fixed alphabet size. The expected histogram of the random text is estimated by averaging over the rank histograms of a set of random texts. For ease of presentation, the expected histogram is cut off below a minimum expected frequency. AAW: Alice's adventures in wonderland. H: Hamlet. DC: David Crockett. OS: The origin of species.
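The averaging step in the caption can be illustrated with a short sketch. The function name `expected_histogram` and the `min_frequency` parameter are hypothetical, and padding shorter histograms with zeros is an assumption about how ranks beyond a text's vocabulary are treated.

```python
from itertools import zip_longest


def expected_histogram(histograms, min_frequency=0.0):
    """Average several rank histograms position by position.

    Each histogram is a list of frequencies sorted by rank (rank 1 first).
    Shorter histograms are padded with zeros (assumption): a rank beyond a
    text's vocabulary is counted as frequency 0 in that text.
    """
    n = len(histograms)
    mean = [sum(column) / n for column in zip_longest(*histograms, fillvalue=0)]
    # Hypothetical cut-off: keep only ranks whose expected frequency
    # is at least min_frequency.
    return [f for f in mean if f >= min_frequency]


averaged = expected_histogram([[3, 1], [2, 2, 1]])
truncated = expected_histogram([[3, 1], [2, 2, 1]], min_frequency=1.0)
```

Because every input histogram is non-increasing in rank, the average is non-increasing too, so the cut-off simply truncates the tail.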
Figure 2. The rank histograms of English texts versus those of random texts.
The same as Fig. 1 for the model with alphabet size and probability of blank obtained from the real text.
Table 1. Summary of English texts employed.
| Title | Abbreviation | Author |
| Alice's adventures in wonderland | AAW | Lewis Carroll (1832–1898) |
| The adventures of Tom Sawyer | ATS | Mark Twain (1835–1910) |
| A Christmas carol | CC | Charles Dickens (1812–1870) |
| David Crockett | DC | John S. C. Abbott (1805–1877) |
| An enquiry concerning human understanding | ECHU | David Hume (1711–1776) |
| Hamlet | H | William Shakespeare (1564–1616) |
| The hound of the Baskervilles | HB | Sir Arthur Conan Doyle (1859–1930) |
| Moby-Dick: or, the whale | MB | Herman Melville (1819–1891) |
| The origin of species by means of natural selection | OS | Charles Darwin (1809–1882) |
| Ulysses | U | James Joyce (1882–1941) |
The data set of English texts employed in our study.
Table 2. Statistics of the English texts.
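| Abbreviation | Text length (words) | Distinct characters (excluding blank) | Probability of blank | Maximum rank | Mean rank | Standard deviation of rank |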
| AAW | 27342 | 28 | 0.254 | 2574 | 254.05 | 466.60 |
| CC | 29253 | 30 | 0.240 | 4263 | 463.31 | 887.22 |
| H | 32839 | 28 | 0.253 | 4582 | 474.39 | 932.44 |
| ECHU | 57958 | 36 | 0.212 | 4912 | 433.91 | 861.35 |
| HB | 59967 | 39 | 0.244 | 5568 | 472.87 | 990.44 |
| ATS | 73523 | 31 | 0.248 | 7169 | 612.45 | 1298.53 |
| DC | 78819 | 36 | 0.228 | 7385 | 668.60 | 1346.19 |
| OS | 209176 | 36 | 0.207 | 8955 | 589.94 | 1274.53 |
| MB | 218522 | 36 | 0.229 | 17190 | 1291.67 | 2909.44 |
| U | 269589 | 36 | 0.228 | 29213 | 2425.63 | 5444.95 |
Statistical properties of the English texts. See Table 1 for the meaning of each abbreviation. Texts are sorted by increasing length. The columns give, in order, the text length in words, the number of different characters excluding the blank, the estimated probability of blank, the maximum rank (equal to the observed vocabulary size), and the mean and the standard deviation of the rank.
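The three rank statistics can be computed from a rank histogram as in the sketch below. The function name `rank_statistics` is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption, since the source does not say which was used.

```python
import math


def rank_statistics(freqs):
    """Return (maximum rank, mean rank, standard deviation of the rank).

    freqs[i] is the frequency of the word of rank i + 1. Each word
    occurrence contributes its rank once, so ranks are weighted by frequency.
    """
    total = sum(freqs)  # text length in words
    mean = sum(r * f for r, f in enumerate(freqs, start=1)) / total
    # Assumption: population variance (divide by total, not total - 1).
    variance = sum(f * (r - mean) ** 2 for r, f in enumerate(freqs, start=1)) / total
    return len(freqs), mean, math.sqrt(variance)


max_rank, mean_rank, sd_rank = rank_statistics([3, 2, 1])
```

For the toy histogram `[3, 2, 1]` the text has 6 words, the maximum rank is 3 and the mean rank is 10/6.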
Figure 3. The rank histograms of English texts versus those of random texts.
The same as Fig. 1 for the model with alphabet size and character probabilities obtained from the real text.
Table 3. Distance to the mean in standard deviations.
| Abbrv. | Statistic | | | | | | | | | |
| AAW | Maximum rank | 42.6 | −97.5 | −133.2 | −163.4 | −573.4 | −160.6 | 54.0 | −74.3 | −147.1 |
| AAW | Mean rank | 130.5 | −59.7 | −78.6 | −94.2 | −312.9 | −93.1 | 173.0 | −46.3 | −85.7 |
| AAW | Standard deviation of rank | 56.3 | −83.4 | −119.5 | −156.8 | −2033.6 | −153.1 | 74.1 | −63.0 | −135.5 |
| CC | Maximum rank | 99.1 | −80.7 | −120.0 | −151.1 | −555.4 | −158.5 | 116.8 | −53.7 | −139.3 |
| CC | Mean rank | 267.3 | −54.5 | −76.8 | −93.6 | −317.8 | −98.0 | 347.2 | −36.8 | −87.1 |
| CC | Standard deviation of rank | 136.5 | −72.8 | −111.9 | −149.5 | −1969.6 | −159.2 | 169.9 | −48.7 | −134.7 |
| H | Maximum rank | 103.5 | −86.6 | −127.6 | −158.4 | −581.5 | −157.7 | 121.5 | −58.6 | −142.3 |
| H | Mean rank | 277.8 | −58.4 | −81.4 | −97.6 | −331.3 | −97.5 | 361.9 | −40.5 | −89.1 |
| H | Standard deviation of rank | 142.1 | −77.8 | −118.3 | −155.8 | −2017.9 | −154.6 | 176.8 | −53.1 | −135.4 |
| ECHU | Maximum rank | 75.6 | −133.8 | −184.6 | −226.0 | −795.6 | −275.9 | 93.7 | −98.9 | −240.5 |
| ECHU | Mean rank | 247.4 | −81.9 | −108.5 | −129.7 | −431.3 | −155.9 | 328.9 | −61.6 | −137.4 |
| ECHU | Standard deviation of rank | 106.3 | −112.7 | −161.7 | −210.2 | −2494.6 | −278.7 | 138.1 | −82.9 | −227.2 |
| HB | Maximum rank | 92.0 | −131.4 | −182.2 | −225.5 | −791.3 | −246.2 | 112.8 | −93.8 | −207.3 |
| HB | Mean rank | 272.8 | −82.6 | −109.2 | −131.3 | −432.7 | −142.6 | 366.7 | −60.5 | −121.9 |
| HB | Standard deviation of rank | 127.8 | −112.0 | −161.0 | −211.0 | −2482.7 | −238.3 | 165.5 | −79.9 | −189.0 |
| ATS | Maximum rank | 120.7 | −137.9 | −195.9 | −242.1 | −854.7 | −253.9 | 143.7 | −97.7 | −219.7 |
| ATS | Mean rank | 369.8 | −87.6 | −118.6 | −142.1 | −469.4 | −148.1 | 488.6 | −63.6 | −130.6 |
| ATS | Standard deviation of rank | 173.0 | −118.1 | −173.1 | −226.1 | −2620.6 | −241.3 | 218.5 | −83.9 | −199.6 |
| DC | Maximum rank | 119.2 | −143.6 | −201.8 | −250.1 | −882.0 | −294.2 | 143.9 | −102.1 | −246.5 |
| DC | Mean rank | 404.6 | −89.5 | −120.6 | −145.5 | −482.0 | −168.9 | 540.4 | −64.0 | −143.9 |
| DC | Standard deviation of rank | 175.7 | −121.8 | −177.0 | −232.0 | −2678.1 | −288.0 | 224.7 | −86.6 | −226.7 |
| OS | Maximum rank | 72.9 | −258.2 | −341.1 | −419.4 | −1446.2 | −539.9 | 100.0 | −205.0 | −443.3 |
| OS | Mean rank | 349.3 | −148.4 | −189.5 | −228.6 | −754.5 | −289.1 | 486.6 | −119.9 | −240.5 |
| OS | Standard deviation of rank | 117.7 | −203.8 | −279.5 | −362.9 | −3939.7 | −514.2 | 164.6 | −160.9 | −390.7 |
| MB | Maximum rank | 222.1 | −221.6 | −311.1 | −392.2 | −1418.5 | −470.9 | 266.4 | −155.8 | −382.7 |
| MB | Mean rank | 849.5 | −137.8 | −184.8 | −226.4 | −765.8 | −266.8 | 1152.2 | −98.0 | −221.3 |
| MB | Standard deviation of rank | 352.8 | −184.8 | −265.5 | −350.9 | −3908.0 | −444.5 | 452.7 | −130.7 | −339.9 |
| U | Maximum rank | 404.3 | −200.7 | −303.2 | −398.6 | −1491.0 | −481.1 | 466.6 | −120.8 | −388.7 |
| U | Mean rank | 1672.5 | −133.7 | −190.8 | −241.7 | −828.3 | −285.2 | 2206.4 | −78.8 | −235.9 |
| U | Standard deviation of rank | 693.6 | −175.0 | −266.7 | −364.3 | −4068.4 | −462.1 | 862.0 | −107.3 | −354.5 |
Summary of the distance to the mean (in standard deviations) between real values and those of random texts for three different rank statistics: the maximum rank, the mean rank and the standard deviation of the rank. The first column contains the abbreviation of the text (see Table 1 for the meaning of each abbreviation). Texts are sorted by increasing length. The remaining columns correspond to different versions of the random text model and different parameter settings. For each text and parameter setting, we show the three distances, one for each rank statistic. One model parameter is the number of characters other than space. Two of the parameter settings are borrowed from [2]. In one setting, all character probabilities are obtained from the original text. Distances are computed from the mean and standard deviation of each rank statistic, estimated over independently generated replicas of the random text. The random texts have the same length in words as the target real text.
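The distance described in the caption is a standard score, which can be sketched as below. The function name `z_distance` is hypothetical, and the sample standard deviation (`statistics.stdev`) is an assumption; the source does not state which estimator was used.

```python
import statistics


def z_distance(real_value, replica_values):
    """Distance from real_value to the replica mean, in standard deviations.

    replica_values holds the same rank statistic (for example the maximum
    rank) measured on independently generated random texts.
    """
    mean = statistics.mean(replica_values)
    sd = statistics.stdev(replica_values)  # assumption: sample standard deviation
    return (real_value - mean) / sd


z = z_distance(10.0, [4.0, 6.0])
```

A positive distance means the real text's statistic exceeds the random-text average, and a negative distance means it falls below it.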