A. Chacoma, D. H. Zanette.
Abstract
We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or 'tags', namely nouns, verbs and others), and analyse the progressive appearance of new words of each tag along each individual text. We find that, as prescribed by Heaps' Law, vocabulary sizes and text lengths follow a well-defined power-law relation. Meanwhile, the appearance of new words in each text does not obey a power law, and is on the whole well described by the average of random shufflings of the text. Deviations from this average, however, are statistically significant and show systematic trends across the corpus. Specifically, we find that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags add systematically distinct contributions to this tendency, with verbs and others being respectively more and less retarded than the mean trend, and nouns following instead the overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' Law, a feature that is still in need of extensive assessment.
Keywords: Heaps’ Law; grammatical classes; language regularities; statistical anomalies; tagged texts
Year: 2020 PMID: 32269820 PMCID: PMC7137977 DOI: 10.1098/rsos.200008
Source DB: PubMed Journal: R Soc Open Sci ISSN: 2054-5703 Impact factor: 2.963
Table 1. List of works in the present corpus. When two years are indicated in the second column (aus07, wel04), the first one corresponds to the (estimated) year of writing. The last two columns give the length N (number of word tokens) and the vocabulary size V (number of word types).
| author and code | title (publication year) | N | V |
|---|---|---|---|
| J. Austen | |||
| aus01 | Pride and Prejudice (1813) | 122 576 | 8698 |
| aus02 | Emma (1815) | 161 338 | 10 241 |
| aus03 | Sense and Sensibility (1811) | 120 373 | 8631 |
| aus04 | Northanger Abbey (1817) | 77 937 | 7822 |
| aus05 | Persuasion (1818) | 83 821 | 7553 |
| aus06 | Mansfield Park (1814) | 160 770 | 10 883 |
| aus07 | Lady Susan (1794/1871) | 23 254 | 3495 |
| Ch. Dickens | |||
| dic01 | Oliver Twist (1838) | 159 565 | 14 851 |
| dic02 | A Christmas Carol (1843) | 28 954 | 5215 |
| dic03 | The Cricket on the Hearth (1845) | 31 440 | 5818 |
| dic04 | The Haunted Man and the Ghost’s Bargain (1848) | 33 778 | 5818 |
| dic05 | Hard Times (1854) | 102 977 | 13 086 |
| dic06 | A Tale of Two Cities (1859) | 137 153 | 14 040 |
| dic07 | Great Expectations (1860) | 187 455 | 15 717 |
| dic08 | The Mystery of Edwin Drood (1870) | 95 252 | 12 135 |
| dic09 | David Copperfield (1850) | 356 161 | 22 486 |
| dic10 | The Pickwick Papers (1836) | 300 495 | 24 016 |
| dic11 | Little Dorrit (1857) | 38 553 | 23 311 |
| dic12 | Barnaby Rudge (1841) | 255 447 | 20 158 |
| dic13 | The Chimes (1844) | 30 570 | 5822 |
| A. Huxley | |||
| hux01 | The Tillotson Banquet (1922) | 14 393 | 3534 |
| hux02 | Antic Hay (1923) | 87 974 | 13 908 |
| hux03 | Crome Yellow (1921) | 57 208 | 10 342 |
| hux04 | Farcical History of Richard Greenow (1920) | 20 478 | 4954 |
| hux05 | Those Barren Leaves (1925) | 122 484 | 16 807 |
| hux06 | Brave New World (1932) | 63 778 | 11 078 |
| hux07 | Eyeless in Gaza (1936) | 146 216 | 19 068 |
| hux08 | The Devils of Loudun (1952) | 124 116 | 17 282 |
| hux09 | Island (1962) | 107 723 | 15 845 |
| hux10 | Happily Ever After (1920) | 13 704 | 3283 |
| hux11 | Eupompus Gave Flavor to Art by Numbers (1920) | 3334 | 1225 |
| hux12 | Cynthia (1920) | 2437 | 935 |
| hux13 | The Bookshop (1920) | 1698 | 776 |
| hux14 | The Death of Lully (1920) | 4455 | 1443 |
| hux15 | The Gioconda Smile (1921) | 11 190 | 2756 |
| E. A. Poe | |||
| poe01 | The Purloined Letter (1844) | 7042 | 1950 |
| poe02 | The Thousand-and-Second Tale of Scheherazade (1845) | 5660 | 1737 |
| poe03 | A Descent into the Maelström (1841) | 7035 | 1878 |
| poe04 | Von Kempelen and his Discovery (1849) | 2783 | 993 |
| poe05 | Mesmeric Revelation (1844) | 3742 | 1133 |
| poe06 | The Facts in the Case of M. Valdemar (1845) | 3559 | 1177 |
| poe07 | The Black Cat (1843) | 3925 | 1348 |
| poe08 | The Fall of the House of Usher (1839) | 7186 | 2234 |
| poe09 | Silence-a Fable (1838) | 1359 | 427 |
| poe10 | The Masque of the Red Death (1842) | 2425 | 900 |
| poe11 | The Cask of Amontillado (1846) | 2341 | 850 |
| poe12 | The Imp of the Perverse (1845) | 2437 | 936 |
| poe13 | The Island of the Fay (1841) | 1974 | 823 |
| poe14 | The Assignation (1834) | 4473 | 1613 |
| poe15 | The Pit and the Pendulum (1842) | 6152 | 1788 |
| M. Twain | |||
| twa01 | The Gilded Age (1873) | 162 003 | 16 879 |
| twa02 | The Prince and the Pauper (1881) | 69 693 | 10 869 |
| twa03 | A Connecticut Yankee in King Arthur’s Court (1889) | 119 560 | 14 200 |
| twa04 | The American Claimant (1892) | 65 776 | 9462 |
| twa05 | The Tragedy of Pudd’nhead Wilson (1893) | 53 274 | 8175 |
| twa06 | Personal Recollections of Joan of Arc (1896) | 151 693 | 14 697 |
| twa07 | A Horse’s Tale (1907) | 17 127 | 3906 |
| twa08 | The Mysterious Stranger (1916) | 37 262 | 5580 |
| twa09 | A Fable (1909) | 810 | 307 |
| twa10 | Hunting the Deceitful Turkey (1906) | 1259 | 519 |
| twa11 | The McWilliamses And The Burglar Alarm (1882) | 2680 | 904 |
| twa12 | The Adventures of Tom Sawyer (1876) | 72 697 | 9996 |
| twa13 | Adventures of Huckleberry Finn (1884) | 114 973 | 9971 |
| twa14 | Tom Sawyer Abroad (1894) | 35 067 | 4676 |
| twa15 | Tom Sawyer, Detective (1896) | 24 078 | 3354 |
| H. G. Wells | |||
| wel01 | The Time Machine (1895) | 32 391 | 5887 |
| wel02 | The Island of Dr. Moreau (1896) | 43 909 | 6696 |
| wel03 | The Wonderful Visit (1895) | 38 884 | 6709 |
| wel04 | The Wheels of Chance (1895/1935) | 55 824 | 9380 |
| wel05 | The Invisible Man (1897) | 49 460 | 7400 |
| wel06 | The War of the Worlds (1898) | 59 861 | 9063 |
| wel07 | The First Men in the Moon (1901) | 69 114 | 9266 |
| wel08 | The Passionate Friends (1913) | 103 694 | 12 852 |
| wel09 | The Shape of Things to Come (1933) | 156 204 | 18 662 |
| wel10 | The Soul of a Bishop (1917) | 80 080 | 11 066 |
Figure 1. Heaps plot (vocabulary size V versus text length N, measured in number of words) in log-log scales, for the 75 works in the corpus. Different symbols correspond to different authors (aus: J. Austen, dic: Ch. Dickens, hux: A. Huxley, poe: E. A. Poe, twa: M. Twain, wel: H. G. Wells; table 1). The inset shows the same data in linear-linear scales. Lines correspond to a power-law fitting, V ∝ N^h, with h = 0.68.
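The power-law fitting mentioned in the caption of figure 1 can be illustrated with an ordinary least-squares line in log-log coordinates. The sketch below is only an illustration, not the authors' procedure: it uses a handful of (N, V) pairs copied from the table of works rather than the full corpus, and the names `SAMPLE` and `fit_heaps_exponent` are hypothetical. The exponent it returns therefore need not coincide with the reported h = 0.68.

```python
import math

# A few (N, V) pairs copied from the table of works; this is a small
# illustrative subset, not the 75-work corpus used for the published fit.
SAMPLE = {
    "aus02": (161338, 10241),
    "dic02": (28954, 5215),
    "dic09": (356161, 22486),
    "hux12": (2437, 935),
    "poe09": (1359, 427),
    "twa09": (810, 307),
    "wel01": (32391, 5887),
}


def fit_heaps_exponent(pairs):
    """Fit V = k * N**h by least squares on (log N, log V).

    Returns the exponent h and the prefactor k.
    """
    xs = [math.log(n) for n, _ in pairs]
    ys = [math.log(v) for _, v in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    h = sxy / sxx                       # slope of the log-log line
    k = math.exp(mean_y - h * mean_x)   # intercept, back in linear scale
    return h, k


h, k = fit_heaps_exponent(list(SAMPLE.values()))
print(f"h = {h:.2f}, prefactor k = {k:.2f}")
```

Fitting in log-log space weights short and long texts evenly, which matches the way the data are displayed in the main panel; a nonlinear fit in linear scales would be dominated by the longest novels.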
Figure 2. (a) Number of word tokens in each tagged class, Ntag (tag = nouns, verbs and others), as a function of the text length N, in log-log scales, for the 75 works in the corpus. Straight lines have unitary slope. (b) As in (a), for the number of word types in each tag, Vtag, as a function of the vocabulary size V. (c) Heaps plot for the words in each tag, for the 75 works. Open symbols correspond to the same data plotted in figure 1. The two upper straight lines are fittings for nouns and verbs, both with slope h = 0.70. The lower straight line is the fitting for others, with slope h = 0.62.
Figure 3. (a) Curves stand for the Heaps functions v(n) of three works in the corpus, namely, Austen’s Northanger Abbey (aus04), Huxley’s Crome Yellow (hux03) and Wells’ The Wonderful Visit (wel03). Narrow shaded areas are bounded by the average functions . (b,c) Respectively, the absolute and relative Heaps anomalies, defined as in equation (4.5), for the same three texts. Horizontal bands in (c) have integer widths, helping to appraise the absolute anomaly with respect to the standard deviation of randomized shufflings of the texts.
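The comparison between a text's Heaps function and its random shufflings can be sketched as follows. Equation (4.5) is not reproduced in this record, so the definitions used here are assumptions: the absolute anomaly is taken as v(n) minus the mean of the shuffled Heaps functions, and the relative anomaly as that difference divided by the standard deviation over shufflings. The function names, the toy sentence, the number of shufflings and the use of the population standard deviation are likewise illustrative choices.

```python
import random
import statistics


def heaps_function(tokens):
    """Return v(n) for n = 1..N: distinct word types among the first n tokens."""
    seen = set()
    v = []
    for tok in tokens:
        seen.add(tok)
        v.append(len(seen))
    return v


def shuffle_anomalies(tokens, n_shuffles=200, seed=0):
    """Compare v(n) with the mean and spread of v(n) over random shufflings.

    Assumed definitions (not taken from the paper's equations):
      absolute anomaly = v(n) - mean over shufflings
      relative anomaly = absolute anomaly / std over shufflings
    """
    rng = random.Random(seed)
    v = heaps_function(tokens)
    pool = list(tokens)
    curves = []
    for _ in range(n_shuffles):
        rng.shuffle(pool)                     # in-place random permutation
        curves.append(heaps_function(pool))
    mean = [statistics.fmean(col) for col in zip(*curves)]
    std = [statistics.pstdev(col) for col in zip(*curves)]
    absolute = [a - m for a, m in zip(v, mean)]
    # The spread vanishes at n = 1 and n = N, where every ordering agrees.
    relative = [d / s if s > 0 else 0.0 for d, s in zip(absolute, std)]
    return v, mean, std, absolute, relative


# Toy text, for illustration only.
toy = "the cat saw the dog and the dog saw the cat and the bird".split()
v, mean, std, absolute, relative = shuffle_anomalies(toy)
print(v)
```

Under these assumed definitions, a negative anomaly at position n means that fewer new words have appeared up to n than in a typical shuffling, which is the sense of "retarded" used in the abstract.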
Figure 4. The relative anomaly averaged along each whole text, 〈δ〉 (symbols) and the corresponding standard deviations (error bars) for the 75 works in the corpus, as functions of their vocabulary sizes V. The horizontal axis is logarithmic, to ease discerning data in the low-V zone. Different symbols correspond to different authors (figure 1). Horizontal shaded bands have integer widths. The curve corresponds to a linear fitting of 〈δ〉 versus V, as described in the text. The inset shows a log-log plot of versus V. The straight line has slope s = 0.29.
Figure 5. Maximum, average and minimum absolute anomaly (respectively, Δmax, 〈Δ〉 and Δmin) along each of the 75 works in the corpus, as functions of their vocabulary sizes V. Curves correspond to linear fittings of the three quantities versus V. The inset shows a log-log plot of the standard deviation versus V, with computed along each work. The straight line has slope s = 0.83.
Figure 6. The Heaps excess Etag as a function of n and for each tag, as defined by equations (5.1) and (5.3), along Twain’s The Mysterious Stranger (twa08, N = 37 262, V = 5580).
Figure 7. (a) Maximum, average and minimum Heaps excess for nouns along each of the 75 works in the corpus, as functions of the number of nouns in the vocabularies, V. Curves correspond to linear fittings. The inset shows the Heaps excess standard deviation along each text. (b) Same as (a), for verbs. (c) Same as (a), for others. For the average and maximum Heaps excess, the fittings are here linear versus the logarithm of the vocabulary size. Note that, in contrast with the other panels, the inset is plotted in linear-log scales. For clarity, the axis labels have been indicated only once, but they are the same in all plots.