Dominique Brunato1, Felice Dell'Orletta1, Giulia Venturi1.
Abstract
In this paper, we present an overview of existing parallel corpora for Automatic Text Simplification (ATS) in different languages, focusing on the approach adopted for their construction. We draw the main distinction between manual and (semi)-automatic approaches in order to investigate in which respects complex and simple texts vary, and whether and how the observed modifications depend on the underlying approach. To this end, we perform a two-level comparison on Italian corpora, since Italian is the only language, with the exception of English, for which there are large parallel resources derived through both approaches. The first level of comparison accounts for the main types of sentence transformations occurring in the simplification process; the second examines the results of a linguistic profiling analysis based on Natural Language Processing techniques and carried out on the original and the simple version of the same texts. For both levels of analysis, we focus our discussion mostly on sentence transformations and linguistic characteristics that pertain to the morpho-syntactic and syntactic structure of the sentence.
Keywords: Italian language; aligned corpora; corpus construction; linguistic complexity; text simplification
Year: 2022 PMID: 35350726 PMCID: PMC8958033 DOI: 10.3389/fpsyg.2022.707630
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Monolingual parallel corpora of original/simplified sentences classified with respect to the type of approach adopted for their construction, the language, the textual genre, the target (GP, general purpose; CHI, children; LL, language learners; L2LL, L2 language learners; PLI, people with language impairments; PLL, people with low literacy level; NLP, NLP tasks; CS, crowd-sourcing), and the size of the corpus.
| Language (reference) | Textual genre | Target | Size |
|---|---|---|---|
| Manual approach | | | |
| ENG (Pellow and Eskenazi, | Everyday documents | GP | 200 sentence pairs |
| ENG (Xu et al., | Newspapers | CHI | 56,037 original sentences |
| ENG (Barzilay and Elhadad, | Encyclopedia Britannica | CHI | 2,600 easy-to-read documents |
| ENG (Allen, | Classroom materials | LL | 178,967 simplified words |
| ENG (Petersen and Ostendorf, | Newspapers | LL | 2,539 original sentences |
| ENG (Xu et al., | Wikipedia | CS | 2,359 original sentences |
| ENG (Alva-Manchego et al., | Wikipedia | CS | 2,359 original sentences |
| Many (Orasan et al., | Miscellanea | PLI | 320 original sentences |
| SPA (Bott and Saggion, | Newspapers | PLI | 145 simplified sentences |
| SPA (Collados, | Newspapers | NLP | 300 simplified sentences |
| FRE (Brouwers et al., | Narrative texts | L2LL | 83 original sentences |
| FRE (Grabar and Cardon, | Encyclopedic, scientific, clinical texts | GP | 4,596 sentence pairs |
| FRE (Gala et al., | L1 student materials | PLI | 52,704 tokens |
| DAN (Klerke and Søgaard, | Newspapers | L2LL | 3,701 document pairs |
| POR (Caseli et al., | Newspapers | PLL | 2,116 original sentences |
| POR (Aluísio et al., | Popular science articles | PLL | 882 original sentences |
| GER (Klaper et al., | Websites | PLI | 7,755 original sentences |
| GER (Sauberli et al., | Newspapers | L2LL | 3,616 sentence pairs |
| JPN Goto et al. ( | Newspapers | L2LL | 2,885 sentence pairs |
| EUS Gonzalez-Dios et al. ( | Popular science articles | L2LL | 227 original sentences |
| RUS Dmitrieva and Tiedemann ( | Literary texts | L2LL | 69,737 original sentences |
| ITA Tonelli et al. ( | Administrative texts | GP | 157 original sentences |
| ITA Brunato et al. ( | Children's literature | PLI | 1,060 sentence pairs |
| ITA Brunato et al. ( | Educational material | L2LL | 1,356 original pairs |
| (Semi)-automatic approach | | | |
| ENG Kauchak ( | Wikipedia | GP | 167K sentence pairs |
| ENG Kajiwara and Komachi ( | Wikipedia | GP | 492,993 sentence pairs |
| ENG Zhu et al. ( | Wikipedia | GP | 108,016 sentence pairs |
| ENG Narayan et al. ( | Wikipedia | GP | 5,546 original sentences |
| ENG Woodsend and Lapata ( | Wikipedia | GP | 14,831 sentence pairs |
| ENG Botha et al. ( | Wikipedia | GP | 1,004,944 original sentences |
| ENG Pavlick and Callison-Burch ( | Miscellanea | CS | 4.5 million simplifying paraphrase rules |
| ITA Tonelli et al. ( | Wikipedia | GP | 530 original sentences |
| FRE Brouwers et al. ( | Wikipedia | L2LL | 72 original sentences |
| FRE Cardon and Grabar ( | Wikipedia | GP | 297,494 sentence pairs |
| ITA Brunato et al. ( | Web corpus | GP | 63,000 sentence pairs |
Figure 1. Distribution of simplification operations across corpora.
Overview of the linguistic features used for linguistic profiling.
| Feature group | Linguistic feature |
|---|---|
| Raw Text | |
| | Sentence length |
| | Word length |
| Vocabulary | |
| | Type/Token ratio for words and lemmas |
| POS tagging | |
| | Distribution of UD and language-specific POS |
| | Lexical density |
| | |
| | Inflectional morphology of lexical verbs and auxiliaries |
| | |
| | Distribution of verbal heads and verbal roots |
| | Verb arity and distribution of verbs by arity |
| | |
| | Depth of the whole syntactic tree |
| | Average length of dependency links and of the longest link |
| | Average length of prepositional chains and distribution by depth |
| | Clause length |
| Dependency | |
| | Relative order of subject and object |
| | |
| | Distribution of dependency relations |
| | |
| | Distribution of subordinate and principal clauses |
| | Average length of subordination chains and distribution by depth |
| | Relative order of subordinate clauses with respect to the main clause |
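As a rough illustration of the raw-text and lexical features listed above, the following minimal Python sketch computes sentence length, word length, type/token ratio, and lexical density for one tagged sentence. The sample sentence, the function name `profile`, the exclusion of punctuation, and the set of content-word tags are assumptions made for the example; they are not taken from the paper's profiling tool.

```python
from statistics import mean

# Hypothetical input: one tokenized sentence as (form, lemma, UPOS) triples.
# The sentence and its tags are invented for illustration only.
SENTENCE = [
    ("The", "the", "DET"),
    ("children", "child", "NOUN"),
    ("read", "read", "VERB"),
    ("the", "the", "DET"),
    ("short", "short", "ADJ"),
    ("stories", "story", "NOUN"),
    (".", ".", "PUNCT"),
]

# Assumed definition of content words for lexical density.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def profile(tokens):
    """Return a few raw-text and lexical features for one sentence."""
    # Punctuation is excluded from all counts (an assumption of this sketch).
    words = [t for t in tokens if t[2] != "PUNCT"]
    forms = [form.lower() for form, _, _ in words]
    lemmas = [lemma.lower() for _, lemma, _ in words]
    content = sum(1 for _, _, upos in words if upos in CONTENT_POS)
    return {
        "sentence_length": len(words),                  # tokens per sentence
        "word_length": mean(len(f) for f in forms),     # characters per token
        "ttr_forms": len(set(forms)) / len(forms),      # type/token ratio, word forms
        "ttr_lemmas": len(set(lemmas)) / len(lemmas),   # type/token ratio, lemmas
        "lexical_density": content / len(words),        # content words / all words
    }


features = profile(SENTENCE)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Over a corpus, such values would typically be averaged per sentence set (complex versus simple) before being compared.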
Figure 2. A sentence from the corpus linguistically annotated in the UD format.
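To make the syntactic features more concrete, here is a minimal Python sketch that reads a sentence in the CoNLL-U format used by UD and derives parse tree depth and dependency link lengths. The CoNLL-U fragment is a made-up English sentence, not the one shown in the figure, and the helper names (`parse_conllu`, `depth_of`, `syntactic_profile`) are hypothetical. Counting depth in edges from the root and skipping punctuation links are assumptions of this sketch.

```python
# Hypothetical CoNLL-U fragment (10 tab-separated columns per token).
CONLLU = """\
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tchildren\tchild\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tread\tread\tVERB\t_\t_\t0\troot\t_\t_
4\tshort\tshort\tADJ\t_\t_\t5\tamod\t_\t_
5\tstories\tstory\tNOUN\t_\t_\t3\tobj\t_\t_
6\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


def parse_conllu(block):
    """Parse one CoNLL-U sentence into a list of token dictionaries."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens.append({
            "id": int(cols[0]),
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return tokens


def depth_of(token_id, heads):
    """Number of edges between a token and the root of the tree."""
    depth = 0
    while heads[token_id] != 0:
        token_id = heads[token_id]
        depth += 1
    return depth


def syntactic_profile(tokens):
    """Compute tree depth and dependency link lengths for one sentence."""
    heads = {t["id"]: t["head"] for t in tokens}
    # Link length: distance in tokens between a dependent and its head.
    # The root link and punctuation are left out (assumption).
    links = [
        abs(t["id"] - t["head"])
        for t in tokens
        if t["head"] != 0 and t["upos"] != "PUNCT"
    ]
    return {
        "tree_depth": max(depth_of(i, heads) for i in heads),
        "avg_link_length": sum(links) / len(links),
        "max_link_length": max(links),
    }


profile = syntactic_profile(parse_conllu(CONLLU))
print(profile)
```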
Distribution of the raw text, lexical, and morpho-syntactic features in the complex and simple set of sentences for the three corpora.
| Feature | Complex | Simple | Diff. | Complex | Simple | Diff. | Complex | Simple | Diff. |
|---|---|---|---|---|---|---|---|---|---|
| Sentence length | 19.92 | 18.61 | | 21.25 | 18.56 | | 8.97 | 8.0 | |
| Word length | 4.89 | 4.80 | | 4.74 | 4.70 | 0.04 | 4.70 | 4.54 | |
| % BIV | 75.59 | 77.31 | | 78.53 | 77.77 | 0.75 | 72.19 | 77.08 | |
| % FO | 78.14 | 79.82 | | 80.21 | 82.73 | | 75.03 | 75.76 | |
| % HU | 13.08 | 12.15 | | 11.98 | 9.68 | | 20.19 | 19.82 | |
| % HA | 8.77 | 8.03 | | 7.81 | 7.60 | 0.21 | 4.78 | 4.42 | |
| Type/Token ratio | 0.942 | 0.941 | −0.001 | 0.921 | 0.913 | 0.008 | 0.97 | 0.99 | |
| Adjectives | 5.87 | 5.97 | −0.01 | 5.34 | 5.11 | 0.23 | 5.74 | 7.90 | |
| Adverbs | 6.82 | 6.97 | −0.15 | 7.62 | 6.73 | 0.89 | 12.26 | 9.95 | |
| Articles | 8.79 | 8.73 | 0.07 | 8.24 | 8.69 | −0.45 | 11.04 | 12.71 | |
| Conjunctions (coordinating) | 3.57 | 3.76 | | 3.98 | 4.72 | | 2.66 | 3.45 | |
| Conjunctions (subordinating) | 1.75 | 2.16 | | 1.73 | 1.09 | | 0.32 | 0.30 | 0.02 |
| Prepositions | 13.31 | 12.50 | | 10.77 | 10.51 | 0.25 | 5.98 | 6.21 | |
| Pronouns | 5.33 | 5.04 | | 17.69 | 17.15 | | 7.23 | 4.14 | |
| Pronouns (relative) | 0.87 | 0.81 | 0.06 | 0.85 | 0.28 | | 0.27 | 0.1 | |
| Pronouns (clitic) | 2.78 | 2.61 | | 5.25 | 2.74 | | 2.47 | 1.60 | |
| Punctuation | 11.57 | 11.54 | 0.03 | 15.53 | 15.52 | 0.01 | 20.5 | 15.13 | |
| Numbers | 1.07 | 0.91 | | 2.25 | 2.47 | −0.22 | 2.25 | 2.47 | |
| Lexical density | 0.59 | 0.60 | | 0.58 | 0.62 | | 0.61 | 0.60 | |
| Indicative mood | 61.23 | 64.4 | | 57.14 | 70.87 | | 68.14 | 68.31 | |
| Participial mood | 6.95 | 4.63 | | 3.95 | 2.84 | 1.11 | 3.65 | 2.42 | |
| Gerundive mood | 3.44 | 2.62 | | 1.56 | – | | 0.46 | 0.04 | |
| Infinitive mood | 15.98 | 17.64 | | 22.1 | 19.67 | | 12.04 | 11.65 | 0.39 |
| Subjunctive mood | 1.00 | 0.57 | | 0.58 | – | | 0.78 | 0.05 | |
| Conditional mood | 0.19 | 0.12 | 0.07 | 0.84 | 0.18 | 0.66 | 3.34 | 0.001 | |
| Present tense | 6.21 | 4.74 | | 43.31 | 90.19 | | 79.18 | 80.91 | |
| Imperfect tense | 50.66 | 52.97 | | 16.39 | 0.82 | | 2.89 | 4.29 | |
| Past tense | 40.98 | 39.97 | | 27.45 | – | 27.45 | 1.33 | 1.57 | |
| 2nd person, singular | 0.44 | 0.51 | −0.07 | 2.77 | 0.37 | | 0.60 | 0.44 | |
| 3rd person, singular | 64.9 | 66.09 | −1.19 | 48.59 | 53.31 | −4.72 | 62.31 | 58.13 | |
| 1st person, plural | – | 0.09 | −0.09 | 2.95 | 4.13 | −1.18 | 1.51 | 1.84 | |
| 2nd person, plural | – | – | – | 0.42 | 0.32 | 0.10 | 0.30 | 0.19 | |
| 3rd person, plural | 18.69 | 19.14 | −0.45 | 13.86 | 16.55 | −2.69 | 8.12 | 7.83 | |
Statistically significant variations with respect to the Wilcoxon signed-rank test at p < 0.05 are bold.
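The footnote above names the Wilcoxon signed-rank test as the significance criterion. As a rough illustration of how that test applies to paired complex/simple values, here is a minimal pure-Python sketch. It uses the normal approximation with no tie or continuity correction, which is an assumption of this example (small samples normally call for the exact distribution), and the paired sentence lengths are invented numbers, not data from the corpora.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums and p comes from the normal approximation.
    """
    # Zero differences carry no sign information and are discarded.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation of the null distribution of W.
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p


# Hypothetical paired sentence lengths (original vs. simplified sentences).
complex_lengths = [21, 18, 25, 30, 17, 22, 19, 27, 24, 20]
simple_lengths = [18, 18, 20, 24, 18, 19, 15, 21, 22, 16]

w, p = wilcoxon_signed_rank(complex_lengths, simple_lengths)
print(f"W = {w}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

The test is paired because each simple sentence is aligned to the complex sentence it was derived from, so the comparison is made on per-pair differences rather than on two independent samples.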
Distribution of the syntactic features in the complex and simple set of sentences for the three corpora.
| Feature | Complex | Simple | Diff. | Complex | Simple | Diff. | Complex | Simple | Diff. |
|---|---|---|---|---|---|---|---|---|---|
| Subjects | 6.37 | 6.87 | | 5.25 | 6.71 | | 7.65 | 6.94 | |
| Objects | 4.77 | 5.12 | | 4.93 | 4.92 | 0.01 | 1.82 | 1.90 | |
| Subjects (passive) | 0.20 | 0.15 | 0.04 | 0.17 | 0.08 | 0.09 | 0.66 | 0.95 | |
| Subordinate clauses | 51.86 | 51.41 | | 53.08 | 47.35 | | 50.083 | 50.078 | |
| Depth of “chains” of subord. | 0.39 | 0.41 | | 0.39 | 0.27 | 0.12 | 0.05 | 0.06 | |
| Post-main subordinates | 40.54 | 43.42 | | 42.27 | 27.98 | | 3.28 | 4.17 | |
| Parse tree depth | 5.80 | 5.56 | | 5.10 | 4.46 | | 2.85 | 2.70 | |
| Dependency links length | 2.07 | 2.03 | | 2.29 | 2.12 | | 1.76 | 1.63 | |
| Length of the longest link | 8.01 | 7.48 | | 9.24 | 7.32 | | 3.81 | 3.31 | |
| Verbal arity | 1.93 | 1.95 | −0.02 | 1.85 | 1.91 | −0.05 | 2.09 | 2.08 | |
| Depth of prepositional “chains” | 1.06 | 1 | | 0.90 | 0.91 | −0.01 | 0.44 | 0.41 | |
| Pre-verbal subjects | 71.07 | 71.35 | −0.28 | 51.16 | 61.71 | | 50.58 | 43.85 | |
| Post-verbal subjects | 9.59 | 11.32 | −1.73 | 16.62 | 15.51 | 1.11 | 15.36 | 14.37 | |
| Pre-verbal objects | 5.72 | 5.54 | 0.17 | 8.29 | 3.24 | | 2.03 | 1.38 | |
Statistically significant variations with respect to the Wilcoxon signed-rank test at p < 0.05 are bold.