Sidney Evaldo Leal, Katerina Lukasova, Maria Teresa Carthery-Goulart, Sandra Maria Aluísio.
Abstract
This article presents RastrOS, a new corpus of eye movement data recorded from university students during silent reading of paragraphs of texts in Brazilian Portuguese (BP). The article shows the potential of the corpus for natural language processing (NLP) by using it to evaluate the sentence complexity prediction task in BP, and it also describes the NLP resources and methods developed to create the corpus. Specifically, we present: (i) the method used to select the corpus paragraphs from large corpora, using linguistic metrics and clustering algorithms; (ii) the platform for collecting the Cloze test, which is also responsible for creating the project datasets; and (iii) the hybrid semantic similarity method, based on word embedding models and contextualised word representations, used to generate semantic predictability norms. RastrOS can be downloaded from the Open Science Framework (OSF) repository, together with the computational infrastructure mentioned above. Datasets with predictability norms of 393 participants and eye-tracking data of 37 participants are available in the OSF repository for this work (https://osf.io/9jxg3/).
Keywords: Brazilian Portuguese; Eye-tracking corpus; Natural language processing; Predictability norms; Sentence complexity prediction
Year: 2022 PMID: 35990365 PMCID: PMC9383681 DOI: 10.1007/s10579-022-09609-0
Source DB: PubMed Journal: Lang Resour Eval ISSN: 1574-020X Impact factor: 1.835
Eye-tracking corpora and RastrOS numbers
| Corpus | Language | Stimulus | Corpus Stats | Participants |
|---|---|---|---|---|
| | English | Connected Sentences | | |
| | German | Isolated Sentences | | |
| | English and Dutch | Connected Sentences | | |
| | English | Connected Sentences | | |
| | Russian | Isolated Sentences | | |
| | Brazilian Portuguese | Connected Sentences | | |
Predictability norm variables, and their explanations
| Variable | Description |
|---|---|
| Word_Unique_ID | The ID number for each word in the dataset |
| Text_ID | The text number of the RastrOS corpus (paragraph 1-50) |
| Text | The paragraphs from which the target word is taken |
| Word_Number | The position of the word in the text |
| Sentence_Number | The number of the sentence (1-120) in which the current word is located |
| Word_In_Sentence_Number | The position of the current word within the current sentence |
| Word | The target word, with punctuation, capitalization and contractions removed |
| Response | The response produced by the participant in the Cloze task |
| Response_Count | Number of participants who produced a given response |
| Total_Response_Count | The total number of responses provided on the Cloze task for this word token |
| Response_Proportion | How often a given response was provided, as a proportion of all responses. Response_Proportion = Response_Count/Total_Response_Count |
| Source | Link to the source text. |
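The relation Response_Proportion = Response_Count/Total_Response_Count can be illustrated with a minimal sketch. The function name `response_proportions` and the sample answers are hypothetical and are not taken from the RastrOS scripts. The cleaning step is simplified to lower-casing and trimming, whereas the dataset also removes punctuation and contractions.

```python
from collections import Counter

def response_proportions(responses):
    """Return {response: (Response_Count, Response_Proportion)} for one word token."""
    # Simplified cleaning (assumption): lower-case and trim; empty answers are dropped.
    cleaned = [r.strip().lower() for r in responses if r.strip()]
    total = len(cleaned)        # Total_Response_Count
    counts = Counter(cleaned)   # Response_Count per distinct response
    return {resp: (n, n / total) for resp, n in counts.items()}

# Hypothetical Cloze responses for a single gap (not RastrOS data).
answers = ["casa", "Casa", "rua", "casa", "cidade"]
norms = response_proportions(answers)
print(norms["casa"])  # (3, 0.6)
```

By construction, the proportions of all responses to one word token sum to 1.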
Fig. 1 Distribution of words in PoS Classes. N stands for noun, PRP for preposition, V-FIN for finite verb, DET for article, ADJ for adjective, ADV for adverb, PROP for proper noun, CONJ-C for coordinating conjunction, V-PCP for past participle, V-INF for infinitive, PRON-INDP for independent pronoun (or substantive pronoun), NUM for number, PRON-PERS for personal pronoun, CONJ-S for subordinating conjunction, V-GER for gerund, ERR for error, VAUX for auxiliary verb, INTJ for interjection, IMP for command/imperative, VAUX-S for auxiliary verb
Eye-Tracking Dataset Variables
| Variable | Description |
|---|---|
| RECORDING_SESSION_LABEL | Session ID (Participant). The ID differentiates participants starting the undergraduate course, finishing the course and taking intermediate semesters. |
| Word_Unique_ID | An ID number for each word (each token) in the dataset, composed of the information about Text_ID and Word_Number (for example, UID_13_69) |
| Text_ID | The text number of RastrOS corpus (paragraph 1-50) |
| Genre | The text genre. RastrOS has three genres: journalistic (JN), literary (LT) and popular science (DC). |
| Word_Number | The position of the word in the text. It varies from 1 to the length of the paragraph. |
| Sentence_Number | The ordinal number of the sentence in which the current word is located in the paragraph. This number varies from 1 to 5 as the length of the paragraphs in RastrOS is short. |
| Word_In_Sentence_Number | The ordinal position of the current word within the current sentence. It varies from 1 to the length of the sentence. |
| Word_Place_In_Sent | Word position in quartiles of a sentence: 0-25% = 1, 25-50% = 2, 50-75% = 3 and 75-100% = 4. |
| Word | The word as it appeared on the screen |
| Word_Cleaned | The word, with punctuation and capitalisation removed |
| Word_Length | The length of the current word, in letters |
| Total_Response_Count | The total number of responses provided on the Cloze task for this word token |
| Unique_Count | The total number of unique responses provided on the Cloze task for this word token |
| OrthographicMatch | Cloze probability: The proportion of responses that were an orthographic match with the target word |
| IsModalResponse | Whether the target word was the most commonly produced response (1) or not (0) |
| ModalResponse | The modal response. If IsModalResponse is 1, this is the same as Word (see above). If IsModalResponse is 0, this is whichever response was provided most frequently. |
| ModalResponseCount | A count of how many times the modal response was provided in the Cloze procedure |
| Certainty | The Cloze probability of the modal response. Certainty = ModalResponseCount/ResponseCount |
| POS | The part of speech tag of the target word (See https://visl.sdu.dk/visl/pt/info/symbolset-manual.html for more information on the meaning of the specific tags.) |
| Word_Content_Or_Function | Whether the word is a content word or a function word, based on POS |
| Word_POS | A more general grouping of parts of speech, based on POS, which includes the following categories (in Portuguese): Adjetivo, Advérbio, Artigo, Conjunção, Interjeição, Nome, Numeral, Preposição, Pronome, Verbo. In English they are: Adjective, Adverb, Article, Conjunction, Interjection, Noun, Number, Preposition, Pronoun, Verb, respectively. |
| POSMatch | The proportion of responses with the same POS as the target, using POS column. |
| Word_Inflection | RastrOS evaluates inflection of the following Word_POS: noun, verb, adjective, pronoun and article, using Palavras tags (https://visl.sdu.dk/visl/pt/info/symbolset-manual.html). For nouns there is gender and number; for finite verbs, person, tense and mood; for infinitive verbs, tense and mood; for past participle verbs, gender and number; for adjectives, gender and number; for personal pronouns, gender, number, case and person; for adjective and substantive pronouns, gender and number; for articles, gender and number. |
| InflectionMatch | The proportion of responses that carried the same inflection as the target. RastrOS evaluates inflection of the following Word_POS: noun, verb, adjective, pronoun and article. |
| Semantic_Word_Context_Score | A measure of the semantic association between the target word and the entire preceding passage context. This score is a measure of the semantic fit of the target word with the previous context of a sentence. It was obtained with the hybrid method created in this project, which uses one word embedding model and the contextualised word representation (BERT) which is described in detail in Sect. |
| Semantic_Response_Match_Score | The mean match score between the target and all provided responses. This measure is an estimate of the semantic predictability of a given target word, i.e. it evaluates if the participants can grasp the general meaning of the upcoming word. It was obtained using the hybrid method created in this project, which uses one word embedding model and the contextualised word representation (BERT) which is described in detail in Sect. |
| Semantic_Response_Context_Score | A measure of the semantic association between the response and the entire preceding passage context. This score is a measure of the semantic fit of the response with the previous context of a sentence. This metric was proposed in RastrOS, with no correspondent in Provo. It was obtained using the hybrid method created in this project, which uses one word embedding model and the contextualised word representation (BERT) which is described in detail in Sect. |
| Freq_brWaC_fpm | Normalised frequency (frequency per million) of the word in the BrWaC corpus. The BrWaC corpus was made publicly available in January 2017 and consists of 3.53 million web documents, 2.68 billion tokens and 5.79 million types (TTR 0.0021). |
| Freq_Brasileiro_fpm | Normalised frequency (or frequency per million) of the words of the Corpus Brasileiro. The Corpus Brasileiro ( |
| Freq_brWaC_log | Frequency on the Zipf scale, which is log10(normalised frequency) + 3, of the words using the BrWaC corpus. |
| Freq_Brasileiro_log | Frequency on the Zipf scale, which is log10(normalised frequency) + 3, of the words using the Corpus Brasileiro. The Corpus Brasileiro ( |
| Surprisal | Negative log probability of a word |
| Entropy_reduction | For each word, the distribution of all answers (right and wrong) was obtained and the Shannon Entropy formula (see Eq. |
| Time_to_Start | Time (in seconds) between the presentation of the gap and when the participant started typing. |
| Typing_Time | Time between the start of typing and the submission of the response. |
| Total_time | Sum of Time_to_Start and Typing_Time. |
| IA_ID | Identification number for each interest area in the text. Note that because of typos and text parsing errors, this number may not correspond to the Word_Number. |
| IA_LABEL | The string of letters (w/ punctuation) contained within the interest area |
| TRIAL_INDEX | The order that the text was presented within the experiment for a given participant |
| IA_LEFT | The left boundary of the interest area, in pixels from the left of the screen |
| IA_RIGHT | The right boundary of the interest area, in pixels from the left of the screen |
| IA_TOP | The top boundary of the interest area, in pixels from the top of the screen |
| IA_BOTTOM | The bottom boundary of the interest area, in pixels from the top of the screen |
| IA_AREA | The total screen area of the interest area, in pixels |
| IA_FIRST_FIXATION_DURATION | First Fixation Duration: The duration of the first fixation on the interest area, in milliseconds. |
| IA_FIRST_FIXATION_INDEX | Ordinal sequence of the first fixation that was within the current interest area |
| IA_FIRST_FIXATION_VISITED_IA_COUNT | The number of interest areas visited prior to first fixation on the current interest area |
| IA_FIRST_FIXATION_X | The X position of the first fixation event that was within the current interest area, in pixels |
| IA_FIRST_FIXATION_Y | The Y position of the first fixation event that was within the current interest area, in pixels |
| IA_FIRST_FIX_PROGRESSIVE | Checks whether later interest areas have been visited before the first fixation enters the current interest area. 1 if NO higher IA ID in earlier fixations before the first fixation in the current interest area; 0 otherwise. This measure is useful in reading to check whether the first run of fixations in this interest area is in fact first-pass fixations. |
| IA_FIRST_FIXATION_RUN_INDEX | This counts how many runs of fixations have occurred when a first fixation is made to an interest area. The current run is also included in the tally. |
| IA_FIRST_FIXATION_TIME | Start time of the first fixation to enter the current interest area |
| IA_FIRST_RUN_DWELL_TIME | Gaze duration: Dwell time (i.e., summation of the duration across all fixations) of the first run within the current interest area |
| IA_FIRST_RUN_FIXATION_COUNT | Number of all fixations in a trial falling in the first run of the current interest area |
| IA_FIRST_RUN_START_TIME | Start time of the first run of fixations in the current interest area |
| IA_FIRST_RUN_END_TIME | End time of the first run of fixations in the current interest area |
| IA_FIRST_RUN_FIXATION_% | Percentage of all fixations in a trial falling in the first run of the current interest area |
| IA_DWELL_TIME | Total Reading Time: Dwell time (i.e., summation of the duration across all fixations) on the current interest area |
| IA_FIXATION_COUNT | Total fixations falling in the interest area |
| IA_RUN_COUNT | Number of times the Interest Area was entered and left (runs) |
| IA_SKIP | An interest area is considered skipped (i.e., IA_SKIP = 1) if no fixation occurred in first-pass reading. |
| IA_REGRESSION_IN | Whether the current interest area received at least one regression from later interest areas (e.g., later parts of the sentence). 1 if the interest area was entered from a higher IA_ID (from the right in English); 0 if not. |
| IA_REGRESSION_IN_COUNT | Number of times interest area was entered from a higher IA_ID (from the right in English) |
| IA_REGRESSION_OUT | Whether regression(s) was made from the current interest area to earlier interest areas (e.g., previous parts of the sentence) prior to leaving that interest area in a forward direction. 1 if a saccade exits the current interest area to a lower IA_ID (to the left in English) before a later interest area was fixated; 0 if not. |
| IA_REGRESSION_OUT_COUNT | Number of times an interest area was exited to a lower IA_ID (to the left in English) before a higher IA_ID was fixated in the trial |
| IA_REGRESSION_OUT_FULL | Whether regression(s) was made from the current interest area to earlier interest areas (e.g., previous parts of the sentence). 1 if a saccade exits the current interest area to a lower IA_ID (to the left in English); 0 if not. Note that IA_REGRESSION_OUT only considers first-pass regressions whereas IA_REGRESSION_OUT_FULL considers all regressions, regardless of whether later interest areas have been visited or not. |
| IA_REGRESSION_OUT_FULL_COUNT | Number of times interest area was exited to a lower IA_ID (to the left in English) |
| IA_REGRESSION_PATH_DURATION | Go-Past Time: The summed fixation duration from when the current interest area is first fixated until the eyes enter an interest area with a higher IA_ID |
| IA_FIRST_SACCADE_AMPLITUDE | Amplitude (in degree of visual angle) of the first saccade entering into the current interest area |
| NOTE: Saccade data have not been cleaned, and so include return sweeps (large eye movements from the end of one line to the beginning of the next). Excluding saccades > 15 deg removes these return sweeps without impacting other reading-related saccades. | |
| IA_FIRST_SACCADE_ANGLE | Angle between the horizontal plane and the direction of the first saccade entering into the current interest area |
| IA_FIRST_SACCADE_START_TIME | Start time of the saccade that first landed within the current interest area |
| IA_FIRST_SACCADE_END_TIME | End time of the saccade that first landed within the current interest area |
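Several variables in the table above are simple functions of the raw Cloze responses or of other columns. The following is a minimal sketch of how they could be derived, not the project's own code. All function names and sample values are hypothetical. It also makes assumptions the table does not state: entropy is computed in bits, ties for the modal response go to the first response seen, and a word sitting exactly on a quartile boundary is assigned to the lower quartile.

```python
import math
from collections import Counter

def cloze_measures(target, responses):
    """Per-word predictability measures from cleaned Cloze responses."""
    total = len(responses)                         # Total_Response_Count
    counts = Counter(responses)
    modal, modal_count = counts.most_common(1)[0]  # ties: first response seen wins
    probs = [n / total for n in counts.values()]
    return {
        "Unique_Count": len(counts),
        "OrthographicMatch": counts[target] / total,  # Cloze probability of the target
        "IsModalResponse": int(modal == target),
        "ModalResponse": modal,
        "ModalResponseCount": modal_count,
        "Certainty": modal_count / total,             # Cloze probability of the modal response
        # Shannon entropy in bits over all responses, right and wrong.
        "Entropy": -sum(p * math.log2(p) for p in probs),
    }

def zipf(freq_per_million):
    """Zipf-scale frequency: log10(normalised frequency) + 3."""
    return math.log10(freq_per_million) + 3

def word_place_in_sent(position, sentence_length):
    """Quartile (1-4) of a 1-based word position within its sentence."""
    return (4 * position + sentence_length - 1) // sentence_length

def drop_return_sweeps(amplitudes_deg, limit=15.0):
    """Keep first-saccade amplitudes (degrees of visual angle) at or below the limit."""
    return [a for a in amplitudes_deg if a <= limit]

# Hypothetical responses for one gap whose target word is "casa".
m = cloze_measures("casa", ["casa", "rua", "casa", "cidade"])
print(m["OrthographicMatch"], m["Certainty"], m["Entropy"])  # 0.5 0.5 1.5
print(zipf(1.0))                              # 3.0
print(word_place_in_sent(3, 8))               # 2
print(drop_return_sweeps([2.1, 17.4, 4.0]))   # [2.1, 4.0]
```

The last helper applies the 15 deg threshold from the note on IA_FIRST_SACCADE_AMPLITUDE. Missing amplitude values would need to be handled before filtering.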
Initial evaluation of the Cloze test participants' data on the percentage of correctly predicted words, using the OrthographicMatch variable (the percentage of correct answers)
| Groups | Sum of squares | df | Mean square | F | p-value | |
|---|---|---|---|---|---|---|
| Year | 215.727 | 4 | 53.932 | 1.532 | 0.190 | 0.018 |
| Area | 17.640 | 1 | 17.640 | 0.504 | 0.478 | 0.001 |
| Year * area | 230.439 | 4 | 60.110 | 1.719 | 0.145 | 0.020 |
| Residuals | 11713.902 | 335 | 34.967 | | | |
The percentage of correct answers in each group separated by year and area (SD = standard deviation; N = Number of subjects)
| Year | Area | Mean | SD | N |
|---|---|---|---|---|
| | Exact sciences | 17.146 | 5.356 | 127 |
| | Human sciences | 19.652 | 6.125 | 30 |
| | Exact sciences | 17.765 | 9.805 | 26 |
| | Human sciences | 18.515 | 4.180 | 30 |
| | Exact sciences | 16.535 | 6.868 | 16 |
| | Human sciences | 17.961 | 6.465 | 24 |
| | Exact sciences | 21.497 | 6.320 | 20 |
| | Human sciences | 18.797 | 5.626 | 42 |
| | Exact sciences | 18.332 | 2.472 | 9 |
| | Human sciences | 19.166 | 4.046 | 21 |
Fig. 2 Pipeline for using the clustering method. Step 1 represents the extraction of metrics from the texts in the corpus, using the NILC-Metrix tool, and generates a file with all the features. Step 2 uses these features to generate the ideal number of groups, using the elbow technique, and presents the distribution of texts and the proposed groups, as well as measures for assessing confidence in these groups
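The elbow technique named in the caption can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline. k-means with a deterministic farthest-point initialisation stands in for whichever clustering algorithm was actually used. The elbow is taken as the k with the largest second difference of the inertia curve, which is one common heuristic. The 2-D toy points stand in for the NILC-Metrix feature vectors, and all function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then add the farthest point each time."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_inertia(points, k, iters=100):
    """Run Lloyd's k-means and return the within-cluster sum of squared distances."""
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        updated = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def elbow(inertias):
    """inertias[i] holds the inertia for k = i + 1; return the k with the sharpest bend."""
    bends = {k: inertias[k - 2] - 2 * inertias[k - 1] + inertias[k]
             for k in range(2, len(inertias))}
    return max(bends, key=bends.get)

# Toy data (assumption): three tight groups of five points each.
offsets = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 10), (20, 0)]
          for dx, dy in offsets]

inertias = [kmeans_inertia(points, k) for k in range(1, 6)]
best_k = elbow(inertias)
print(best_k)  # 3
```

On real feature vectors the bend is rarely this sharp, which is presumably why the pipeline also reports measures for assessing confidence in the proposed groups.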
Fig. 3 Screenshot of the Simpligo-Cloze platform while collecting responses for the Cloze test. In this example, two words have already been answered, since the first word of the paragraph is always provided; the participant must then try to predict the fourth word of the paragraph
Fig. 4 Pipeline for processing data collected in the RastrOS project. The Simpligo-Cloze platform integrates all methods for creating predictability and eye-tracking datasets. Both the source code of the Simpligo-Cloze platform and the developed scripts are publicly available at https://github.com/sidleal/simpligo-cloze
| Context: | |
|---|---|
| Task 1: Semantic Fit Target | P1: |
| | P2: |
| | O: 2.96 N: |
| Task 1: Semantic Fit Response | P1: |
| | P2: |
| | O: 2.61 N: |
| Task 2: Semantic Similarity | P1: |
| | P2: |
| | O: 0.34 N: |