Luís F Seoane, Ricard Solé.
Abstract
What are the relevant levels of description when investigating human language? How are these levels connected to each other? Does one description yield smoothly into the next one such that different models lie naturally along a hierarchy containing each other? Or, instead, are there sharp transitions between one description and the next, such that to gain a little bit of accuracy it is necessary to change our framework radically? Do different levels describe the same linguistic aspects with increasing (or decreasing) accuracy? Historically, answers to these questions were guided by intuition and resulted in subfields of study, from phonetics to syntax and semantics. The need for research at each level is acknowledged, but seldom are these different aspects brought together (with notable exceptions). Here, we propose a methodology to inspect empirical corpora systematically, and to extract from them, blindly, relevant phenomenological scales and interactions between them. Our methodology is rigorously grounded in information theory, multi-objective optimization, and statistical physics. Salient levels of linguistic description are readily interpretable in terms of energies, entropies, phase transitions, or criticality. Our results suggest a critical point in the description of human language, indicating that several complementary models are simultaneously necessary (and unavoidable) to describe it.
Keywords: Pareto-optimality; bottleneck method; phase transitions; statistical mechanics; syntax
Year: 2020 PMID: 33285940 PMCID: PMC7516582 DOI: 10.3390/e22020165
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1. Different levels of grammar. Language contains several layers of complexity that can be gauged using different kinds of measures and are tied to different kinds of problems. The background picture summarizes the enormous combinatorial potential connecting different levels, from the alphabet (smaller sphere) to grammatically correct sentences (larger sphere). On top of this, it is possible to describe each layer by means of a coarse-grained symbolic dynamics approach. One particularly relevant level is the one associated with the way syntax allows generating grammatically correct strings. As indicated in the left diagram, symbols succeed each other following some rules. A coarse-graining groups symbols into a series of classes such that the names of these classes also generate some symbolic dynamics, captured by its own set of rules. How much information can the dynamics induced by the coarse-grained rules recover about the original dynamics induced by the fine-grained rules? Good choices of coarse-graining and of coarse-grained rules will preserve as much information as possible despite being relatively simple.
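The question posed in this caption can be made concrete with a small numerical sketch. The Python below is only an illustration under stated assumptions, not the authors' procedure: it uses the mutual information between consecutive symbols as a simple proxy for the information carried by a symbolic dynamics, and compares a fine-grained tag sequence with a coarse-grained one. The tag abbreviations, the toy sequence, the class mapping, and the function names are all hypothetical.

```python
from collections import Counter
from math import log2


def bigram_mutual_information(seq):
    """Mutual information (in bits) between consecutive symbols of seq.

    Assumption: bigram statistics stand in for the 'rules' of the
    symbolic dynamics; the source does not prescribe this estimator.
    """
    pairs = Counter(zip(seq, seq[1:]))
    total = sum(pairs.values())
    first, second = Counter(), Counter()
    for (a, b), n in pairs.items():
        first[a] += n
        second[b] += n
    mi = 0.0
    for (a, b), n in pairs.items():
        p_ab = n / total
        p_a = first[a] / total
        p_b = second[b] / total
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def coarse_grain(seq, mapping):
    """Replace every fine-grained symbol by the name of its class."""
    return [mapping[s] for s in seq]


# Hypothetical toy data: a short tagged sequence and one possible grouping.
tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN", ".",
        "DT", "NNS", "VBP", "RB", "."]
mapping = {"DT": "D", "NN": "N", "NNS": "N", "VBZ": "V",
           "VBP": "V", "JJ": "A", "RB": "A", ".": "."}

fine_mi = bigram_mutual_information(tags)
coarse_mi = bigram_mutual_information(coarse_grain(tags, mapping))
print(f"fine-grained: {fine_mi:.3f} bits, coarse-grained: {coarse_mi:.3f} bits")
```

Because the coarse sequence is a symbol-by-symbol function of the fine one, its bigram mutual information cannot exceed that of the fine sequence. A good grouping, in this simplified sense, is one that loses little of it.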
Grammatical classes present in the most fine-grained level of our corpora.
| Conjunction | Adverb |
| Cardinal number | Adverb, comparative |
| Determiner | Adverb, superlative |
| Existential there | to |
| Preposition | Interjection |
| Adjective | Verb, base form |
| Adjective, comparative | Verb, past tense |
| Adjective, superlative | Verb, gerund or present participle |
| Modal | Verb, past participle |
| Noun, singular | Verb, non-3rd person singular present |
| Noun, plural | Verb, 3rd person singular present |
| Proper noun, singular | Wh-determiner |
| Proper noun, plural | Wh-pronoun |
| Predeterminer | Possessive wh-pronoun |
| Possessive ending | Wh-adverb |
| Personal pronoun | None of the above |
| Possessive pronoun | ‘.’ |
Figure 2. Interactions between spins and word classes. (a) A first crude model with spins encloses more information than we need for the kind of calculations that we wish to do right now. (b) A reduced version of that model gives us an interaction energy between words or classes of words. These potentials capture some non-trivial features of English syntax, e.g., the existential “there” in “there is” or modal verbs (marked E and M respectively) have a lower interaction energy if they are followed by verbs. Interjections present fairly large interaction energy with any other word, perhaps as a consequence of their independence within sentences.
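To give a feel for what an interaction energy between word classes could look like, here is a deliberately simplified Python sketch. It is not the spin model of the figure: as a stand-in, it assigns each ordered pair of classes an energy equal to the negative log of how much more often the pair occurs than chance would predict, with add-one smoothing. The energy definition, the smoothing constant `alpha`, the tag abbreviations, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def pair_energies(seq, alpha=1.0):
    """Energy for each ordered pair (a, b) of symbols in seq.

    Assumed definition: E(a, b) = -log( P(a, b) / (P(a) * P(b)) ),
    with additive smoothing alpha on the pair counts. Lower energy
    means b follows a more often than expected by chance.
    """
    symbols = sorted(set(seq))
    k = len(symbols)
    pairs = Counter(zip(seq, seq[1:]))
    n_pairs = len(seq) - 1
    left = Counter(seq[:-1])    # counts of symbols in first position
    right = Counter(seq[1:])    # counts of symbols in second position
    norm = n_pairs + alpha * k * k
    energies = {}
    for a in symbols:
        for b in symbols:
            p_ab = (pairs[(a, b)] + alpha) / norm
            p_a = (left[a] + alpha * k) / norm   # marginal of smoothed joint
            p_b = (right[b] + alpha * k) / norm
            energies[(a, b)] = -log(p_ab / (p_a * p_b))
    return energies


# Hypothetical toy sequence in which existential "there" (EX) precedes a verb.
toy = ["EX", "VBZ", "DT", "NN", ".", "EX", "VBZ", "DT", "NN", "."]
E = pair_energies(toy)
print(E[("EX", "VBZ")], E[("EX", "NN")])
```

In this toy sequence the pair (EX, VBZ) receives a lower energy than (EX, NN), mirroring in a crude way the observation that the existential “there” interacts more favorably with a following verb.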
Figure 3. Pareto optimal maximum entropy models of human language. Among all the models that we try out, we prefer those that are Pareto optimal in energy minimization and entropy maximization. (a) These reveal a hierarchy of models in which different word classes group up at different levels. The clustering reveals a series of grammatical classes that belong together owing to the statistical properties of the symbolic dynamics, such as possessives and determiners, which appear near adjectives. (b) A first approximation to the Pareto front of the problem. Future implementations will try out more grammatical classes and produce better quality Pareto fronts, establishing whether phase transitions or criticality are truly present.
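The selection criterion named in this caption, Pareto optimality in energy minimization and entropy maximization, can be sketched in a few lines of Python. This is a generic non-dominated filter, not the authors' implementation; the function name and the (energy, entropy) values are made up for illustration.

```python
def pareto_front(models):
    """Return the (energy, entropy) pairs not dominated by any other pair.

    A model j dominates model i if it has energy no higher and entropy
    no lower than i, and is strictly better in at least one of the two.
    """
    front = []
    for i, (e_i, s_i) in enumerate(models):
        dominated = any(
            e_j <= e_i and s_j >= s_i and (e_j < e_i or s_j > s_i)
            for j, (e_j, s_j) in enumerate(models)
            if j != i
        )
        if not dominated:
            front.append((e_i, s_i))
    return sorted(front)


# Hypothetical candidate models as (energy, entropy) pairs.
candidates = [(1.0, 0.5), (2.0, 1.5), (2.5, 1.0), (3.0, 2.0), (1.5, 0.4)]
front = pareto_front(candidates)
print(front)
```

The quadratic scan is adequate for the handful of candidate models a sketch like this would consider. The shape of the resulting front (for instance kinks or straight segments) is what the statistical-physics reading in terms of phase transitions or criticality would then examine.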