Abstract
Classical null hypothesis significance tests are not appropriate in corpus linguistics, because the randomness assumption underlying these testing procedures is not fulfilled. Nevertheless, there are numerous scenarios where it would be beneficial to have some kind of test in order to judge the relevance of a result (e.g. a difference between two corpora) by answering the question whether the attribute of interest is pronounced enough to warrant the conclusion that it is substantial and not due to chance. In this paper, I outline such a test.
Year: 2019 PMID: 31536558 PMCID: PMC6752893 DOI: 10.1371/journal.pone.0222703
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. TTR values per page for the Austen corpus (3,117 pages) and the Shakespeare corpus (3,654 pages).
Note: A page is defined as consisting of N = 250 words.
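The per-page measure in Fig 1 can be illustrated with a minimal sketch. It assumes that TTR is the number of distinct word forms divided by the number of words on a page, that the text is already tokenized, and that a trailing partial page is dropped; the function name `ttr_per_page` is hypothetical and not taken from the paper.

```python
def ttr_per_page(tokens, page_size=250):
    """Return the type-token ratio for each full page of `page_size` tokens.

    Assumption: TTR = distinct tokens / total tokens on the page, and a
    trailing page shorter than `page_size` is discarded.
    """
    ratios = []
    # Walk through the token list one full page at a time.
    for start in range(0, len(tokens) - page_size + 1, page_size):
        page = tokens[start:start + page_size]
        ratios.append(len(set(page)) / page_size)
    return ratios


# Toy usage with a page size of 3 instead of 250.
toy_tokens = ["a", "b", "a", "a", "c", "d", "e"]
toy_ratios = ttr_per_page(toy_tokens, page_size=3)
```

In the toy call, the first page contains two distinct forms out of three tokens and the second contains three out of three; the seventh token does not fill a page and is ignored.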
Fig 2. Frequencies per page for four selected words in the Austen and in the Shakespeare corpus.
p-values (one-sided) based on t-tests on the equality of means for the four selected words in the Austen and in the Shakespeare corpus.
| Word | p-value |
|---|---|
| Brother | .270 |
| He | .000 |
| She | .000 |
| Woman | .004 |
Example of all possible unique permutations for the hypothetical example.
| Author 1 | | | Mean | Author 2 | | | Mean | Difference of means | Rank |
|---|---|---|---|---|---|---|---|---|---|
| 71 | 72 | 74 | 72.33 | 67 | 69 | 70 | 68.67 | 3.67 | 1 |
| 70 | 72 | 74 | 72.00 | 67 | 69 | 71 | 69.00 | 3.00 | 2 |
| 70 | 71 | 74 | 71.67 | 67 | 69 | 72 | 69.33 | 2.33 | 3 |
| 69 | 72 | 74 | 71.67 | 67 | 70 | 71 | 69.33 | 2.33 | 4 |
| 69 | 71 | 74 | 71.33 | 67 | 70 | 72 | 69.67 | 1.67 | 5 |
| 70 | 71 | 72 | 71.00 | 67 | 69 | 74 | 70.00 | 1.00 | 6 |
| 69 | 70 | 74 | 71.00 | 67 | 71 | 72 | 70.00 | 1.00 | 7 |
| 67 | 72 | 74 | 71.00 | 69 | 70 | 71 | 70.00 | 1.00 | 8 |
| 69 | 71 | 72 | 70.67 | 67 | 70 | 74 | 70.33 | 0.33 | 9 |
| 67 | 71 | 74 | 70.67 | 69 | 70 | 72 | 70.33 | 0.33 | 10 |
| 69 | 70 | 72 | 70.33 | 67 | 71 | 74 | 70.67 | -0.33 | 11 |
| 67 | 70 | 74 | 70.33 | 69 | 71 | 72 | 70.67 | -0.33 | 12 |
| 69 | 70 | 71 | 70.00 | 67 | 72 | 74 | 71.00 | -1.00 | 13 |
| 67 | 71 | 72 | 70.00 | 69 | 70 | 74 | 71.00 | -1.00 | 14 |
| 67 | 69 | 74 | 70.00 | 70 | 71 | 72 | 71.00 | -1.00 | 15 |
| 67 | 70 | 72 | 69.67 | 69 | 71 | 74 | 71.33 | -1.67 | 16 |
| 67 | 70 | 71 | 69.33 | 69 | 72 | 74 | 71.67 | -2.33 | 17 |
| 67 | 69 | 72 | 69.33 | 70 | 71 | 74 | 71.67 | -2.33 | 18 |
| 67 | 69 | 71 | 69.00 | 70 | 72 | 74 | 72.00 | -3.00 | 19 |
| 67 | 69 | 70 | 68.67 | 71 | 72 | 74 | 72.33 | -3.67 | 20 |
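The enumeration in the table above can be reproduced with a short sketch. It pools the six values, assigns every possible set of three to Author 1, and ranks the splits by the difference of means (Author 1 minus Author 2). The helper names `all_splits` and `exact_p_value` are hypothetical, and the one-sided p-value is assumed to be the share of splits whose difference is at least as large as the observed one.

```python
from itertools import combinations

# The six pooled values that appear in the table.
values = [67, 69, 70, 71, 72, 74]


def all_splits(pooled, k):
    """Yield (difference of means, group 1, group 2) for every split with k values in group 1."""
    indices = range(len(pooled))
    for chosen in combinations(indices, k):
        group1 = [pooled[i] for i in chosen]
        group2 = [pooled[i] for i in indices if i not in chosen]
        diff = sum(group1) / len(group1) - sum(group2) / len(group2)
        yield diff, group1, group2


def exact_p_value(pooled, k, observed_diff):
    """One-sided p-value: share of splits with a difference >= the observed one."""
    diffs = [diff for diff, _, _ in all_splits(pooled, k)]
    # A small tolerance guards against floating-point noise in ties.
    hits = sum(diff >= observed_diff - 1e-12 for diff in diffs)
    return hits / len(diffs)


# Sort from the largest to the smallest difference, as in the table.
splits = sorted(all_splits(values, 3), key=lambda s: s[0], reverse=True)
for rank, (diff, g1, g2) in enumerate(splits, start=1):
    print(rank, g1, g2, round(diff, 2))
```

With six values and three per author there are 20 unique splits, so the smallest attainable one-sided p-value is 1/20 = .05, reached only by the split ranked first.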
Results of separate Monte Carlo permutation tests with 100,000 repetitions on the equality of means for the four selected words in the Austen and in the Shakespeare corpus.
| Word | Permutations at least as extreme (of 100,000) | p-value |
|---|---|---|
| Brother | 27004 | .270 |
| He | 0 | .0000 |
| She | 0 | .0000 |
| Woman | 426 | .004 |
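When the number of possible splits is too large to enumerate, the permutation distribution can be approximated by random reshuffling. The following is a minimal sketch of a one-sided Monte Carlo permutation test on the difference of means, not the author's implementation. The function name `monte_carlo_permutation_test`, the seed, and the tolerance for ties are assumptions; the p-value is taken as the count divided by the number of repetitions, which is consistent with the table (27004 / 100,000 = .270).

```python
import random


def monte_carlo_permutation_test(sample_a, sample_b, repetitions=100_000, seed=0):
    """Return (hits, p_value) for H1: mean(sample_a) > mean(sample_b).

    `hits` counts the random relabelings whose difference of means is at
    least as large as the observed difference.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility (assumption)
    n_a, n_b = len(sample_a), len(sample_b)
    observed = sum(sample_a) / n_a - sum(sample_b) / n_b

    pooled = list(sample_a) + list(sample_b)
    total = sum(pooled)
    hits = 0
    for _ in range(repetitions):
        # Randomly reassign the pooled observations to the two groups.
        rng.shuffle(pooled)
        sum_a = sum(pooled[:n_a])
        diff = sum_a / n_a - (total - sum_a) / n_b
        # Small tolerance so that exact ties are not lost to rounding.
        if diff >= observed - 1e-12:
            hits += 1
    return hits, hits / repetitions


# Toy usage on the six values of the hypothetical example.
hits, p_value = monte_carlo_permutation_test([71, 72, 74], [67, 69, 70], repetitions=2_000)
```

For the toy data the estimate should lie near the exact value of 1/20, with the usual Monte Carlo error that shrinks as the number of repetitions grows.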
Agreement between the t-test and the permutation test.
‘No’ and ‘Yes’ indicate that the corresponding p-value is greater than or equal to (= ‘No’) or below (= ‘Yes’) the significance level of 1%.
| t-test significant? | Permutation test: No | Permutation test: Yes | Total |
|---|---|---|---|
| No | 10,460 | 65 | 10,525 |
| Yes | 499 | 5,564 | 6,063 |
| Total | 10,959 | 5,629 | 16,588 |
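The share of words on which the two tests reach the same decision follows directly from the diagonal of this table. The short calculation below is a worked check of that arithmetic; the variable names are illustrative.

```python
# Cell counts copied from the agreement table.
both_not_significant = 10_460
both_significant = 5_564
disagreements = 65 + 499  # the two off-diagonal cells

total = both_not_significant + both_significant + disagreements
agreement_rate = (both_not_significant + both_significant) / total

print(total, round(agreement_rate, 5))
```

The two tests agree on 16,024 of 16,588 words, about 96.6%, and disagree on the remaining 564.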
Results for permutation test with different segment lengths.
| Attribute | ||||
|---|---|---|---|---|
| brother | .24437 | .27004 | .31753 | .36108 |
| he | 0 | 0 | 0 | 0 |
| she | 0 | 0 | 0 | 0 |
| woman | .00178 | .00426 | .01196 | .04454 |
| lexical richness | 0 | 0 | 0 | 0 |
| Agreement rate with t-test | .96594 | .96600 | .95599 | .95720 |