Abstract
Over the last century, much work has examined how the lexicon is optimized for efficient communication, with a particular focus on the form of words as an evolving balance between production ease and communicative accuracy. Zipf's law of abbreviation, the cross-linguistic trend for less-probable words to be longer, represents some of the strongest evidence that the lexicon is shaped by a pressure for communicative efficiency. However, the various sounds that make up words do not all contribute the same amount of disambiguating information to a listener. Rather, the information a sound contributes depends in part on what specific lexical competitors exist in the lexicon. In addition, because the speech stream is perceived incrementally, early sounds in a word contribute on average more information than later sounds. Using a dataset of diverse languages, we demonstrate that, above and beyond containing more sounds, less-probable words contain sounds that convey more disambiguating information overall. We show further that this pattern tends to be strongest at word beginnings, where sounds can contribute the most information.
Keywords: Zipf’s law of abbreviation; incremental processing; information theory; language efficiency; language evolution
Year: 2020 PMID: 32617441 PMCID: PMC7323847 DOI: 10.1162/opmi_a_00030
Source DB: PubMed Journal: Open Mind (Camb) ISSN: 2470-2986
Relationship between log word probability and mean token-based segment information for words of length 4–8. The gray area represents the 95% confidence interval. Less-probable words contain higher-information segments.
Relationship between mean type-based segment information and log word probability for words of length 4–8. Less-probable words more quickly reduce the cohorts of competing words.
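The two segment-information measures named in these captions can be illustrated with a minimal sketch. It assumes that a segment's information is the negative log2 of the share of the current cohort that survives once the segment is heard, with cohorts weighted by word frequency (token-based) or by a count of one per word (type-based). This section does not spell out the source's exact estimator, so the function names, the toy lexicon, and the use of one character per segment are all hypothetical.

```python
import math
from collections import defaultdict


def cohort_mass(lexicon, use_tokens=True):
    """Map every prefix to the total weight of the words that start with it.

    lexicon: dict of word -> frequency; each character stands for one segment
    (an assumption of this sketch).
    use_tokens=True weights words by frequency (token-based);
    use_tokens=False gives every word a weight of 1 (type-based).
    """
    mass = defaultdict(float)
    for word, freq in lexicon.items():
        weight = float(freq) if use_tokens else 1.0
        for i in range(len(word) + 1):  # includes the empty prefix
            mass[word[:i]] += weight
    return mass


def segment_information(word, mass):
    """Information in bits carried by each segment of `word`.

    Segment i carries -log2(cohort after segment i / cohort before segment i).
    """
    return [
        -math.log2(mass[word[: i + 1]] / mass[word[:i]])
        for i in range(len(word))
    ]


# Hypothetical toy lexicon, not data from the source.
toy_lexicon = {"cat": 4, "cab": 2, "dog": 2}
token_mass = cohort_mass(toy_lexicon, use_tokens=True)
type_mass = cohort_mass(toy_lexicon, use_tokens=False)

print(segment_information("dog", token_mass))
print(segment_information("cat", token_mass))
print(segment_information("cat", type_mass))
```

Under this plain definition the per-segment values of a word sum to the negative log2 of its final cohort's share of the lexicon, so the way values are aggregated across positions matters. The aggregation used in the source is not specified in this section.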
Relationship between log word probability and relative position of uniqueness-point for words of length 4–8. Less-probable words have relatively earlier uniqueness-points for all lengths.
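As an illustration of the relative uniqueness-point measure in this caption, the sketch below takes the uniqueness point to be the first segment position at which no other word in the lexicon shares the word's prefix, and divides it by word length. The handling of words that never become unique (position set to length plus one) is an assumed convention, and the function names and toy lexicon are hypothetical.

```python
from collections import Counter


def prefix_counts(words):
    """Count how many words in the lexicon start with each non-empty prefix."""
    return Counter(w[:i] for w in words for i in range(1, len(w) + 1))


def uniqueness_point(word, counts):
    """1-based position of the first segment after which `word` has no competitors.

    Assumed convention: a word that is a prefix of another word never becomes
    unique within its own length, so it gets len(word) + 1.
    """
    for i in range(1, len(word) + 1):
        if counts[word[:i]] == 1:
            return i
    return len(word) + 1


def relative_uniqueness_point(word, counts):
    """Uniqueness point scaled by word length, so different lengths are comparable."""
    return uniqueness_point(word, counts) / len(word)


# Hypothetical toy lexicon; each character stands for one segment.
toy_words = {"cat", "cab", "cats", "dog"}
counts = prefix_counts(toy_words)
for w in sorted(toy_words):
    print(w, uniqueness_point(w, counts), relative_uniqueness_point(w, counts))
```

Building the prefix counts once and reusing them keeps the lookup for each word proportional to its length, which matters for lexicons with many entries.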
Distribution of Pearson's correlation between log word probability and mean token-based segmental information for 10,000 shuffled variants of the real-world lexicons. The x-axis shows the number of standard deviations from the mean correlation in frequency-shuffled variants (on a log2 scale), and the red dashed lines indicate the correlation in the real-world lexicons. The real-world lexicons show a significantly stronger correlation relative to shuffled variants.
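The comparison against frequency-shuffled variants can be sketched as a generic permutation baseline. This is a minimal illustration under stated assumptions, not the source's procedure: frequencies are reassigned to words uniformly at random (the source may constrain the shuffle in ways not described here), the per-word score is supplied by the caller as `score_fn`, and all function names are hypothetical.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (both must vary)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def shuffle_frequencies(lexicon, rng):
    """Return a variant lexicon with the same words and the same frequencies,
    but with the frequencies reassigned to words at random (assumed scheme)."""
    words = list(lexicon)
    freqs = [lexicon[w] for w in words]
    rng.shuffle(freqs)
    return dict(zip(words, freqs))


def correlation(lexicon, score_fn):
    """Correlate log2 word probability with a per-word score.

    score_fn takes the lexicon (word -> frequency) and returns word -> score,
    so scores that depend on frequency are recomputed for every variant.
    """
    total = sum(lexicon.values())
    scores = score_fn(lexicon)
    words = list(lexicon)
    log_p = [math.log2(lexicon[w] / total) for w in words]
    return pearson(log_p, [scores[w] for w in words])


def shuffle_z_score(lexicon, score_fn, n_shuffles=10_000, seed=0):
    """Standard deviations between the observed correlation and the mean
    correlation across frequency-shuffled variants."""
    rng = random.Random(seed)
    observed = correlation(lexicon, score_fn)
    null = [
        correlation(shuffle_frequencies(lexicon, rng), score_fn)
        for _ in range(n_shuffles)
    ]
    spread = statistics.stdev(null)
    if spread == 0:
        raise ValueError("shuffled correlations do not vary; baseline is uninformative")
    return (observed - statistics.fmean(null)) / spread
```

A quick check can use word length as a stand-in score, for example `shuffle_z_score(lexicon, lambda lex: {w: float(len(w)) for w in lex}, n_shuffles=200)`. To mirror the caption, `score_fn` would instead return each word's mean token-based segment information.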