Francis Mollica, Steven T Piantadosi.
Abstract
We introduce theory-neutral estimates of the amount of information learners possess about how language works. We provide estimates at several levels of linguistic analysis: phonemes, wordforms, lexical semantics, word frequency and syntax. Our best guess is that the average English-speaking adult has learned 12.5 million bits of information, the majority of which is lexical semantics. Interestingly, very little of this information is syntactic, even in our upper bound analyses. Generally, our results suggest that learners possess remarkable inferential mechanisms capable of extracting, on average, nearly 2000 bits of information about how language works each day for 18 years.
Keywords: Fermi estimation; information theory; linguistic memory
Year: 2019 PMID: 31032001 PMCID: PMC6458406 DOI: 10.1098/rsos.181393
Source DB: PubMed Journal: R Soc Open Sci ISSN: 2054-5703 Impact factor: 2.963
Summary of estimated bounds across levels of linguistic analysis.
| domain | lower bound | best guess | upper bound |
|---|---|---|---|
| phonemes | 375 | 750 | 1500 |
| phonemic wordforms | 200 000 | 400 000 | 640 000 |
| lexical semantics | 553 809 | 12 000 000 | 40 000 000 |
| word frequency | 40 000 | 80 000 | 120 000 |
| syntax | 134 | 697 | 1394 |
| total (bits) | 794 318 | 12 481 447 | 40 762 894 |
| total per day (bits)ᵃ | 121 | 1900 | 6204 |

ᵃ For this value, we assume language is learned in 18 years of 365 days.
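
The totals and per-day figures in the table can be checked from the per-domain estimates. The following is a minimal sketch of that arithmetic, assuming the per-day values are simply each total divided by 18 × 365 days and rounded to the nearest bit; the variable names are illustrative and not taken from the paper.

```python
# Per-domain estimates in bits, copied from the summary table:
# (lower bound, best guess, upper bound)
estimates = {
    "phonemes": (375, 750, 1_500),
    "phonemic wordforms": (200_000, 400_000, 640_000),
    "lexical semantics": (553_809, 12_000_000, 40_000_000),
    "word frequency": (40_000, 80_000, 120_000),
    "syntax": (134, 697, 1_394),
}

# Learning period stated in the table footnote: 18 years of 365 days.
DAYS = 18 * 365

# Sum each column (lower, best, upper) across the five domains.
totals = tuple(sum(column) for column in zip(*estimates.values()))

# Average bits per day, rounded to the nearest bit (rounding is an assumption).
per_day = tuple(round(total / DAYS) for total in totals)

print(totals)   # column totals in bits
print(per_day)  # average bits per day
```

Under these assumptions the computed values match the "total (bits)" and "total per day (bits)" rows of the table.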
Figure 1. The shaded spheres represent uncertainty in semantic space centred around a word (in green). (a) The uncertainty is given with respect to the word’s farthest connection in semantic space (in yellow), representing R. (b) The uncertainty is given with respect to the Nth nearest neighbour of the word (in red), representing r. The reduction in uncertainty from R to r reflects the amount of semantic information conveyed by the green word.
Figure 2. Histograms showing the number of bits-per-dimension for various estimates of R and r. These robustly show that 0.5–2.0 bits are required to capture semantic distances.
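
One way to read the relationship between R, r and bits per dimension in Figures 1 and 2 is sketched below. It assumes that uncertainty is uniform over a ball in a D-dimensional semantic space, so that shrinking the radius from R to r reduces the volume by a factor of (R/r)^D and yields D · log2(R/r) bits. The function name `semantic_bits`, the radii and the dimensionality are hypothetical values chosen for illustration; this is not presented as the authors' exact computation.

```python
import math


def semantic_bits(R: float, r: float, dims: int) -> float:
    """Bits gained by shrinking a dims-dimensional ball from radius R to r.

    Assumes uncertainty is uniform over the ball, so the information is the
    log of the volume ratio: log2((R / r) ** dims) = dims * log2(R / r).
    """
    if not (0 < r <= R):
        raise ValueError("expected 0 < r <= R")
    return dims * math.log2(R / r)


# Hypothetical values for illustration only: R is twice r,
# which corresponds to 1 bit per dimension.
bits_per_dim = semantic_bits(R=2.0, r=1.0, dims=1)

# The same ratio in a hypothetical 300-dimensional space.
total_bits = semantic_bits(R=2.0, r=1.0, dims=300)

print(bits_per_dim, total_bits)
```

Under this reading, the 0.5–2.0 bits per dimension reported in the caption would correspond to R/r ratios between roughly 1.4 and 4.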
Figure 3. Accuracy in frequency discrimination as a function of log word frequency bin, faceted by log reference word frequency bin. Vertical red lines denote within-bin comparisons. Line ranges reflect bootstrapped confidence intervals.