| Literature DB >> 35280234 |
Jinbiao Yang, Antal van den Bosch, Stefan L. Frank.
Abstract
Words typically form the basis of psycholinguistic and computational linguistic studies about sentence processing. However, recent evidence shows that the basic units during reading, i.e., the items in the mental lexicon, are not always words, but can also be sub-word and supra-word units. To recognize these units, human readers require a cognitive mechanism to learn and detect them. In this paper, we assume that eye fixations during reading reveal the locations of the cognitive units, and that the cognitive units are analogous to the text units discovered by unsupervised segmentation models. We predict eye fixations from model-segmented units in both English and Dutch text. The results show that the model-segmented units predict eye fixations better than word units, which suggests that predictive performance indicates the plausibility of model-segmented units as cognitive units. The Less-is-Better (LiB) model, which finds the units that minimize both long-term and working memory load, offers advantages over alternative models in terms of both prediction score and efficiency. Our results also suggest that modeling the least-effort principle for the management of long-term and working memory can lead to inferring cognitive units. Overall, the study supports the theory that the mental lexicon stores not only words but also smaller and larger units, suggests that fixation locations during reading depend on these units, and shows that unsupervised segmentation models can discover these units.
Keywords: cognitive unit; computational cognition; eye movement; mental lexicon; reading units; text segmentation; unsupervised learning
Year: 2022 PMID: 35280234 PMCID: PMC8905434 DOI: 10.3389/frai.2022.731615
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Figure 1. Illustration of the LiB model: (A) information flow in the LiB model; (B) the mechanisms in the text segmentation module; (C) the mechanisms in the lexicon update module.
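The record contains no code, but the two-module design in Figure 1 can be illustrated with a minimal Python sketch, assuming greedy longest-match segmentation and a simple memorize/forget update rule. All function names, thresholds, and the update heuristic are hypothetical illustrations of the least-effort idea (a smaller lexicon means lower long-term memory load; fewer units per sentence means lower working memory load), not the authors' implementation.

```python
from collections import Counter

def segment(text, lexicon, max_len=12):
    """Greedy longest-match segmentation: at each position, take the
    longest string found in the lexicon, falling back to one character."""
    units, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in lexicon:
                units.append(text[i:i + size])
                i += size
                break
    return units

def update_lexicon(corpus, lexicon, merge_top=50, min_uses=2, rounds=10):
    """Toy memorize/forget loop: memorizing frequent unit bigrams shortens
    segmentations (lower working memory load); forgetting rarely used units
    keeps the lexicon small (lower long-term memory load)."""
    for _ in range(rounds):
        unit_counts, pair_counts = Counter(), Counter()
        for sentence in corpus:
            units = segment(sentence, lexicon)
            unit_counts.update(units)
            pair_counts.update(a + b for a, b in zip(units, units[1:]))
        # Forget first (based on this round's usage), then memorize.
        lexicon -= {u for u in lexicon if len(u) > 1 and unit_counts[u] < min_uses}
        for pair, _ in pair_counts.most_common(merge_top):
            lexicon.add(pair)
    return lexicon
```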
Table 1. Corpus statistics after preprocessing.

| Language | Corpus | Sentences | Word tokens | Word types |
|---|---|---|---|---|
| English | GECO | 13,491 | 57,170 | 5,316 |
| | COCA (sample) | 1,745,060 | 9,451,421 | 140,553 |
| Dutch | GECO | 13,407 | 60,836 | 5,859 |
| | SoNaR (books) | 3,308,337 | 22,802,170 | 272,865 |
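Counts of this kind are straightforward to reproduce; below is a sketch assuming plain-text input with one sentence per line and whitespace tokenization (the paper's actual preprocessing pipeline is not specified in this record, and the file name is hypothetical).

```python
def corpus_stats(path):
    """Count sentences (lines), word tokens, and word types in a
    plain-text corpus stored with one sentence per line."""
    sentences, tokens, types = 0, 0, set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            words = line.lower().split()
            if not words:
                continue
            sentences += 1
            tokens += len(words)
            types.update(words)
    return sentences, tokens, len(types)

# e.g., corpus_stats("geco_en.txt") should return a triple like
# (sentences, tokens, types) matching one row of the table above.
```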
Table 2. Segmentation examples from different models in English and Dutch. Vertical bars inside the English and Dutch columns mark unit boundaries; a toy reproduction of this notation follows the table.

| # | Model | English | Dutch |
|---|---|---|---|
| 1 | Input | i was trying to make up my mind what to do | was ik nog aan het overleggen wat ik zou gaan doen |
| | LiB | i was |trying to |make |up |my mind |what |to do | was ik |nog |aan het |over|leggen |wat ik |zou gaan |doen |
| | CBL | i |was trying |to make |up |my mind |what |to do | was |ik nog |aan |het overleggen |wat |ik zou gaan |doen |
| | AG-word | i |was |try|ing |to |make |up |my |mind |what |to |do | was |ik |nog |aan |het |over|leg|gen |wat |ik |zou |gaan |doen |
| | AG-collocation | i was |trying to |make |up |my mind |what |to do | was ik |nog |aan het |over|leggen |wat ik |zou gaan |doen |
| 2 | Input | and it ended in his inviting me down to styles to spend my leave there | en het eind van 't liedje was dat hij mij uitnodigde mijn verlof door te brengen op styles |
| | LiB | and it |ended |in his |invi|ting |me |down to |styles to |sp|end |my |leave |there | en |het eind |van 't |li|e|d|je was |dat hij mij |uitnodig|de |mijn |verlof |door te brengen |op styles |
| | CBL | and |it ended |in |his inviting |me |down |to |styles to spend |my |leave |there | en |het eind |van |'t liedje |was |dat hij |mij uitnodigde |mijn verlof |door |te brengen |op styles |
| | AG-word | and |it |end|ed |in |his |invit|ing |me |down |to |styl|es to |spend |my |leav|e |there | en |het |eind |van |'t |lie|d|je |was |dat |hij |mij |uit|nodig|de |mijn |ver|lo|f |door |te breng|en |op |styles |
| | AG-collocation | and |it |ended |in his |inviting |me |down to |styles to |spend my |leave |there | en |het eind van |'t |lied|je |was |dat |hij mij |uitnodig|de |mijn |verlof |door |te brengen |op styles |
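Given a suitable lexicon, the hypothetical segment() sketch above reproduces a row of this table. A toy demonstration, with a hand-picked (not learned) lexicon:

```python
lexicon = {"i was ", "trying to ", "make ", "up ", "my mind ", "what ", "to do"}
units = segment("i was trying to make up my mind what to do", lexicon)
print("|".join(units))
# i was |trying to |make |up |my mind |what |to do   (LiB row, example 1)
```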
Figure 2. The average token lengths of the observed eye-fixation units, the model-segmented units, and linguistic words in English and Dutch texts. The error bars represent 99% confidence intervals.
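Mean token lengths with 99% confidence intervals, as in Figure 2, can be computed along these lines. The normal-approximation interval is an assumption; the record does not say how the CIs were obtained.

```python
import math
import statistics

def mean_length_with_ci(units, z=2.576):  # z-value for a 99% normal CI
    """Mean unit length in characters, with a normal-approximation 99% CI."""
    lengths = [len(u) for u in units]
    mean = statistics.mean(lengths)
    half = z * statistics.stdev(lengths) / math.sqrt(len(lengths))
    return mean, (mean - half, mean + half)

print(mean_length_with_ci(["i was ", "trying to ", "make ", "up ", "my mind "]))
```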
Figure 3. Distributions of the counts of the (predicted) eye fixations on the English (A) and Dutch (B) corpora. First, we define three labels for the fixation counts (“Skip”: 0, “One”: 1, and “More”: >1). The histograms present the distribution of the three labels; specifically, the vertical histograms present the models' predictions and the horizontal histogram presents the observations in GECO. The scatter plots present the confusion matrix between the model predictions and the GECO observations; the surface area of each circle indicates the number of items in that cell.
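The Skip/One/More labels can be derived from a segmentation once one assumes where fixations land. The sketch below assumes one predicted fixation per unit, placed on the word in which the unit starts; this landing rule is an illustrative assumption, not necessarily the paper's exact mapping.

```python
def fixation_labels(units):
    """Per-word fixation labels from a segmentation, assuming one predicted
    fixation per unit, placed on the word where the unit starts."""
    text = "".join(units)
    starts, position = set(), 0
    for unit in units:
        starts.add(position)
        position += len(unit)
    labels, offset = [], 0
    for word in text.split():
        begin = text.index(word, offset)
        end = begin + len(word)
        count = sum(begin <= s < end for s in starts)
        labels.append("Skip" if count == 0 else "One" if count == 1 else "More")
        offset = end
    return labels

print(fixation_labels(["i was ", "trying to ", "make ", "up ", "my mind "]))
# ['One', 'Skip', 'One', 'Skip', 'One', 'One', 'One', 'Skip']
```

A word fused into a larger unit ("was" inside "i was ") is predicted to be skipped, while a word split across units (e.g., "over|leggen") collects more than one fixation.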
Table 3. Evaluations of models/baselines in different languages.

| Model | English | Dutch |
|---|---|---|
| LiB | 53.06 | |
| CBL | 52.20 | 50.04 |
| AG-word | 30.10 | 28.95 |
| AG-collocation | | 51.45 |
| Word-by-Word | 38.32 | 38.68 |
| Only-Length | 50.82 | 50.57 |
All the scores are the weighted F1 metric between the predicted eye fixations and the observed eye fixations.
Bold values indicate the highest score across models.
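Given per-word labels from a model and from GECO, the weighted F1 in this table is a standard metric; for example, with scikit-learn (toy labels, not the GECO data):

```python
from sklearn.metrics import f1_score

observed  = ["One", "Skip", "One", "More", "One", "Skip", "One"]
predicted = ["One", "One",  "One", "More", "Skip", "Skip", "One"]
# average="weighted": per-label F1, averaged with label-frequency weights.
print(f1_score(observed, predicted, average="weighted"))
```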
Figure 4. The prediction scores under different limits on LiB unit length. The blue/orange dotted lines indicate the peak scores and the corresponding maximum unit length in the English/Dutch simulations, respectively.
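The sweep behind Figure 4 amounts to re-scoring the predictions under different caps on unit length; a sketch reusing the hypothetical segment() and fixation_labels() helpers and scikit-learn's f1_score from above:

```python
def sweep_length_limit(sentence, lexicon, observed, limits=range(1, 21)):
    """Weighted F1 as a function of the maximum allowed unit length."""
    scores = {}
    for limit in limits:
        predicted = fixation_labels(segment(sentence, lexicon, max_len=limit))
        scores[limit] = f1_score(observed, predicted, average="weighted")
    best = max(scores, key=scores.get)  # the peak marked in Figure 4
    return best, scores
```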
Table 4. Comparison of training times and F1 scores between different models and different training corpora.

| Model | Training corpus | Training time (English) | Lexicon size (English) | F1 (English) | Training time (Dutch) | Lexicon size (Dutch) | F1 (Dutch) |
|---|---|---|---|---|---|---|---|
| LiB | GECO | 2 min 31 s | 15,867 | 53.06 | 2 min 38 s | 17,525 | 51.87 |
| | COCA/SoNaR | 24 min 51 s | 97,872 | | 72 min 5 s | 143,665 | |
| CBL | GECO | 1 s | 29,268 | 52.28 | 1 s | 33,248 | 50.04 |
| | COCA/SoNaR | 1 min 24 s | 2,051,239 | 53.30 | 3 min 23 s | 3,782,605 | 51.71 |
COCA/SoNaR means the training corpus is COCA for the English task and SoNaR for the Dutch task. The lexicon size of CBL is the summed count of its stored unigrams and of the backward transitional probabilities between the unigrams.
Bold values indicate the highest score across models.
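Wall-clock comparisons like those above can be collected with a small harness; a sketch around the toy update_lexicon() trainer from earlier (timings on toy input will of course not resemble the table):

```python
import time

def benchmark(trainer, corpus):
    """Time one training run and report the resulting lexicon size."""
    start = time.perf_counter()
    lexicon = trainer(corpus, lexicon=set())
    return time.perf_counter() - start, len(lexicon)

seconds, size = benchmark(update_lexicon, ["i was trying to make up my mind"])
print(f"trained in {seconds:.2f} s; lexicon size {size}")
```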