Mahdi Mohseni, Christoph Redies, Volker Gast.
Abstract
Computational textual aesthetics studies observable differences between aesthetic categories of text. We use Approximate Entropy to measure the (un)predictability in two aesthetic text categories: canonical fiction ('classics') and non-canonical fiction (with lower prestige). Approximate Entropy is determined for series derived from sentence-length values and from the distribution of part-of-speech tags in windows of text. For comparison, we also include a sample of non-fictional texts. Moreover, we use Shannon Entropy to estimate degrees of (un)predictability due to frequency distributions in the entire text. Our results show that Approximate Entropy values differentiate canonical from non-canonical texts better than Shannon Entropy does; this is not the case for the classification of fictional vs. expository prose. Canonical and non-canonical texts thus differ in sequential structure, while inter-genre differences are a matter of the overall distribution of local frequencies. We conclude that canonical fictional texts exhibit a higher degree of (sequential) unpredictability than non-canonical texts, corresponding to the popular assumption that they are more 'demanding' and 'richer'. In using Approximate Entropy, we propose a new method for text classification in the context of computational textual aesthetics.
Keywords: Approximate Entropy; POS-tags; Shannon Entropy; canonical texts; fictional texts; non-canonical texts; non-fictional texts; text classification
Year: 2022 PMID: 35205572 PMCID: PMC8870941 DOI: 10.3390/e24020278
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
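The abstract describes Approximate Entropy (ApEn) as a measure of sequential (un)predictability in a series such as sentence lengths. A minimal sketch of the standard ApEn definition in plain Python may help to make this concrete. The embedding dimension `m = 2` and the tolerance `r = 0.2 × standard deviation` are common defaults assumed here, not values taken from the paper, and the function names are illustrative only.

```python
import math
from statistics import pstdev


def _phi(series, m, r):
    """Average log-fraction of length-m templates lying within tolerance r."""
    n = len(series) - m + 1
    templates = [series[i:i + m] for i in range(n)]
    total = 0.0
    for a in templates:
        # Chebyshev distance between templates; self-matches are counted,
        # as in the standard ApEn definition, so the count is never zero.
        matches = sum(
            1 for b in templates
            if max(abs(x - y) for x, y in zip(a, b)) <= r
        )
        total += math.log(matches / n)
    return total / n


def approximate_entropy(series, m=2, r_factor=0.2):
    """ApEn(m, r) of a numeric series; m and r_factor are assumed defaults."""
    if len(series) < m + 2:
        raise ValueError("series too short for the chosen m")
    r = r_factor * pstdev(series)  # tolerance scaled to the series' spread
    return _phi(series, m, r) - _phi(series, m + 1, r)


# Hypothetical sentence-length series: strictly alternating, so predictable.
regular = [10, 20] * 50
print(approximate_entropy(regular))
```

A strictly repeating series yields a value close to zero, while an irregular series of the same length yields a larger value. The double loop is quadratic in the series length, which is acceptable for a sketch but slow for long texts.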
Text categories in the Jena Expository and Fictional Prose (JEFP), version 2.0. The table shows, for each text category, the number of texts and the mean text length, measured in tokens, ± standard deviation.
| Category | Number of Texts | Mean Length (×10³ tokens) |
|---|---|---|
| Canonical | 76 | 199 ± 96 |
| Non-Canonical | 130 | 111 ± 56 |
| Non-Fictional | 185 | 171 ± 178 |
Median values of Approximate Entropy (ApEn) for all text properties. ApEn values were analysed for two tasks: canonical (N = 76) vs. non-canonical (N = 130) texts and fictional (N = 206) vs. non-fictional (N = 185) texts. The asterisks indicate whether the differences between the two text categories of a given task are statistically significant (Mann–Whitney U test; ns, not significant; * ; ** ; and *** ). Values that are significantly higher within a pair of columns are shown in boldface. 95% confidence intervals for the median (according to [58]) are shown in parentheses.
| Text Property | Canonical | Non-Canonical | Fictional | Non-Fictional |
|---|---|---|---|---|
| Sentence Length | 1.86 (1.83, 1.89) | 1.87 (1.86, 1.90) | 1.87 (1.86, 1.88) | 1.90 (1.88, 1.92) |
| Noun | | 1.83 (1.81, 1.84) *** | | 1.82 (1.81, 1.84) ** |
| Verb | | 1.70 (1.69, 1.71) *** | 1.714 (1.706, 1.723) | |
| Adjective | | 1.45 (1.43, 1.48) *** | 1.488 (1.469, 1.494) | |
| Adverb | | 1.48 (1.46, 1.49) ** | | 1.36 (1.34, 1.39) *** |
| Pronoun | | 1.681 (1.675, 1.691) *** | | 1.31 (1.28, 1.36) *** |
| Preposition | | 1.67 (1.66, 1.68) *** | 1.678 (1.672, 1.683) | |
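The group comparisons in these tables rely on the Mann–Whitney U test. As a rough illustration of what that test computes, the sketch below ranks the pooled values, derives U for the first sample, and obtains a two-sided p-value from the normal approximation with a tie correction. It omits the continuity correction and exact small-sample distributions that statistical packages usually offer, so it is an approximation under those assumptions and not necessarily the procedure the authors ran; the function name is illustrative.

```python
import math


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value via normal approximation)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # Pool the samples, remembering group membership (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        midrank = (i + 1 + j) / 2.0    # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:                  # all values identical: no evidence
        return u_x, 1.0
    z = (u_x - n1 * n2 / 2.0) / math.sqrt(variance)
    return u_x, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical ApEn values for two small groups of texts.
u, p = mann_whitney_u([1.83, 1.85, 1.86], [1.80, 1.81, 1.82])
print(u, p)
```

With samples as large as those in the tables (76 vs. 130 texts), the normal approximation is generally considered adequate; for very small samples an exact test would be preferable.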
Median values of Shannon Entropy (ShEn) for all text properties. ShEn values were analysed for two tasks: canonical (N = 76) vs. non-canonical (N = 130) texts and fictional (N = 206) vs. non-fictional (N = 185) texts. The asterisks indicate whether the differences between the two text categories of a given task are statistically significant (Mann–Whitney U test; ns, not significant; * ; ** ; and *** ). Values that are significantly higher within a pair of columns are shown in boldface. 95% confidence intervals for the median (according to [58]) are shown in parentheses.
| Text Property | Canonical | Non-Canonical | Fictional | Non-Fictional |
|---|---|---|---|---|
| Sentence Length | 3.96 (3.88, 4.05) | 3.96 (3.87, 4.08) | 3.96 (3.91, 4.03) | |
| Noun | | 1.97 (1.95, 1.98) *** | 1.98 (1.97, 1.99) | 1.97 (1.95, 1.99) |
| Verb | | 1.777 (1.772, 1.783) *** | 1.785 (1.779, 1.792) | |
| Adjective | | 1.49 (1.47, 1.53) *** | 1.52 (1.51, 1.53) | |
| Adverb | | 1.51 (1.49, 1.53) * | | 1.40 (1.37, 1.42) *** |
| Pronoun | | 1.78 (1.77, 1.79) *** | | 1.37 (1.33, 1.42) *** |
| Preposition | | 1.73 (1.72, 1.74) *** | 1.736 (1.729, 1.744) | |
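Unlike Approximate Entropy, Shannon Entropy depends only on how often each value occurs in the whole text, not on the order of the values. The sketch below computes it in bits for a sequence of discrete values. It assumes the input is already discrete (for example raw sentence lengths, or POS-tag proportions that have been binned); any binning scheme, and the unit of bits, are assumptions of this example and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("values must be non-empty")
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical sentence lengths: the order of values does not matter.
print(shannon_entropy([12, 7, 12, 7]))
print(shannon_entropy([12, 12, 7, 7]))
```

Both calls print the same value, which illustrates why a frequency-based measure cannot capture the sequential structure that the abstract attributes to canonical texts.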
Balanced accuracy of classification (in %) for the single features for the canonical/non-canonical distinction (Task 1) and the non-fictional/fictional distinction (Task 2). To compare classification results, we used the 5 × 2CV paired t-test [59] with a significance level of . Values that are significantly higher within a pair of columns are shown in boldface. All values are significantly different (p ≤ 0.05) from random accuracy (50%), except where indicated by a dagger (†).
| Feature | Task 1: ApEn | Task 1: ShEn | Task 2: ApEn | Task 2: ShEn |
|---|---|---|---|---|
| Sentence Length | | 50.0 ± 1.0 † | 53.6 ± 2.9 | |
| Noun | | 60.0 ± 4.5 | 57.4 ± 1.9 | |
| Verb | | 56.2 ± 3.8 | 65.5 ± 2.4 | |
| Adjective | | 51.5 ± 2.7 † | 71.7 ± 2.1 | 74.3 ± 1.0 |
| Adverb | 51.6 ± 1.4 † | 51.0 ± 1.5 † | 72.8 ± 2.2 | 73.0 ± 2.9 |
| Pronoun | | 63.8 ± 1.8 | 95.1 ± 1.5 | 95.0 ± 1.7 |
| Preposition | | 59.7 ± 1.7 | 56.9 ± 2.6 | |
| All | | 68.5 ± 2.3 | 95.4 ± 1.8 | 96.5 ± 1.9 |
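The table reports balanced accuracy and compares feature sets with the 5 × 2CV paired t-test. The sketch below shows, under stated assumptions, how both quantities are commonly computed: balanced accuracy as the mean of per-class recall, and the 5 × 2CV statistic in its usual form, which is compared against a t-distribution with 5 degrees of freedom. The critical value 2.571 corresponds to a two-sided level of 0.05, which is an assumption made for this example; the function names and input layout are illustrative and are not taken from the paper.

```python
import math


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; 0.5 is chance level for two classes."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


def five_by_two_cv_t(diffs):
    """5x2CV paired t statistic.

    `diffs` holds five (d1, d2) pairs: the score difference between the two
    models on each of the two folds, for each of five replications.
    """
    if len(diffs) != 5:
        raise ValueError("expected exactly five replications")
    variance_sum = 0.0
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2.0
        variance_sum += (d1 - mean) ** 2 + (d2 - mean) ** 2
    denominator = math.sqrt(variance_sum / 5.0)
    if denominator == 0:
        raise ValueError("differences have zero variance")
    return diffs[0][0] / denominator


# Hypothetical labels: a classifier that always predicts the majority class.
print(balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0]))

# Hypothetical accuracy differences between two feature sets.
t = five_by_two_cv_t([(0.08, 0.05), (0.07, 0.09), (0.06, 0.04),
                      (0.09, 0.07), (0.05, 0.08)])
T_CRITICAL = 2.571  # two-sided, 5 degrees of freedom, assumed level 0.05
print(t, abs(t) > T_CRITICAL)
```

Balanced accuracy is a natural choice here because the classes are unequal in size (76 vs. 130 texts in Task 1), so a classifier that always predicts the larger class would otherwise look better than chance.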
Figure 1. ApEn (a) and ShEn (b) of Noun and Verb, the two best features for classification of canonical vs. non-canonical texts (Task 1). ApEn and ShEn values of these two features provide an accuracy of 75.9% and 68.4%, respectively. The coloured regions and the border (dashed) line show the decision space of the Support Vector Machine.
Figure 2. Values for ApEn (a) and ShEn (b) of Pronoun. These two features yield high accuracy for the classification of fictional vs. non-fictional texts (Task 2).