Understanding Zipf's law of word frequencies through sample-space collapse in sentence formation.

Stefan Thurner, Rudolf Hanel, Bo Liu, Bernat Corominas-Murtra.

Abstract

The formation of sentences is a highly structured and history-dependent process. The probability of using a specific word in a sentence strongly depends on the 'history' of word usage earlier in that sentence. We study a simple history-dependent model of text generation assuming that the sample-space of word usage reduces along sentence formation, on average. We first show that the model explains the approximate Zipf law found in word frequencies as a direct consequence of sample-space reduction. We then empirically quantify the amount of sample-space reduction in the sentences of 10 famous English books, by analysis of corresponding word-transition tables that capture which words can follow any given word in a text. We find a highly nested structure in these transition tables and show that this 'nestedness' is tightly related to the power law exponents of the observed word frequency distributions. With the proposed model, it is possible to understand that the nestedness of a text can be the origin of the actual scaling exponent and that deviations from the exact Zipf law can be understood by variations of the degree of nestedness on a book-by-book basis. On a theoretical level, we are able to show that in the case of weak nesting, Zipf's law breaks down in a fast transition. Unlike previous attempts to understand Zipf's law in language, the sample-space reducing model is not based on assumptions of multiplicative, preferential or self-organized critical mechanisms behind language formation, but simply uses the empirically quantifiable parameter 'nestedness' to understand the statistics of word frequencies.
© 2015 The Author(s) Published by the Royal Society. All rights reserved.
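
The link between sample-space reduction and Zipf's law described in the abstract can be illustrated with a minimal sketch. It is not the authors' sentence-formation model: it assumes the simplest fully nested case, in which a process starts at the highest of N ordered states and repeatedly jumps to a uniformly chosen lower state until it reaches state 1. The function name ssr_visit_counts and all parameter values are hypothetical and chosen only for illustration. Under these assumptions, the relative visit frequency of state i should come out close to 1/i.

```python
import random
from collections import Counter


def ssr_visit_counts(n_states, n_runs, seed=0):
    """Count state visits in a simple sample-space-reducing process.

    Assumption (not taken from the paper's text model): every run starts
    at state n_states and jumps to a uniformly random lower state until
    it reaches state 1, so the set of reachable states shrinks each step.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        state = n_states
        while state > 1:
            # Only states strictly below the current one remain reachable.
            state = rng.randint(1, state - 1)
            counts[state] += 1
    return counts


counts = ssr_visit_counts(n_states=1000, n_runs=20000)

# Compare the observed relative frequency of state i with 1/i.
for i in (1, 2, 4, 8, 16):
    print(i, round(counts[i] / counts[1], 3), round(1 / i, 3))
```

Every run ends in state 1, so counts[1] equals the number of runs and serves as the normalisation. The empirical study summarised above replaces the uniform, fully nested jumps with word-transition tables measured from books, which this sketch does not attempt to reproduce.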

Keywords:  language formation; random walks on networks; scaling in stochastic processes; word-transition networks

Year:  2015        PMID: 26063827      PMCID: PMC4528601          DOI: 10.1098/rsif.2015.0330

Source DB:  PubMed          Journal:  J R Soc Interface        ISSN: 1742-5662            Impact factor:   4.118


