Arnaud Rey1,2, Jean-Luc Manguin3, Chloé Olivier1, Sébastien Pacton4, Pierre Courrieu1.
Abstract
Several dictionary websites give access to semantic, synonym, or spelling information about a given word. Over nine years, we systematically recorded all the letter sequences entered into a French web dictionary. A total of 200 million orthographic forms were obtained, allowing us to create a large-scale database of spelling errors that could inform psychological theories of spelling processes. To check the reliability of this big data methodology, we selected from this database a sample of 100 frequently misspelled words. A group of 100 French university students then performed a spelling-to-dictation test on this list of words. The results showed a strong correlation between the two data sets on the frequencies of produced spellings (r = 0.82). Although the distributions of spelling errors were relatively consistent across the two databases, the proportions of correct responses differed significantly. Regression analyses allowed us to generate possible explanations for these differences in terms of task-dependent factors. We argue that comparing the results of these large-scale databases with those of standard, controlled experimental paradigms is a good way to determine the conditions under which this big data methodology can adequately be used to inform psychological theories.
Year: 2019 PMID: 31856230 PMCID: PMC6922404 DOI: 10.1371/journal.pone.0226647
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Frequencies computed for the website and dictation databases for the target word “hallucinant” (hallucinating).
| occurrences in the website database | orthographic string | response | website frequency (%) | dictation frequency (participants) |
|---|---|---|---|---|
| 3510 | hallucinant | correct | 79.9 | 79 |
| 417 | allucinant | error | 9.5 | 16 |
| 221 | alucinant | error | 5.0 | 0 |
| 244 | halucinant | error | 5.6 | 3 |
| 0 | allusinant | error | 0 | 1 |
| 0 | hallucinent | error | 0 | 1 |
For each string, the website frequency is computed by dividing the number of occurrences obtained for that string by the total number of strings related to the target word (i.e., 4392). The dictation frequency is simply equal to the number of participants having produced that string. Note that two strings were not present in the website database (i.e., “allusinant” and “hallucinent”) but were produced in the dictation experiment. Conversely, one string appeared in the website database (i.e., “alucinant”) and not in the dictation experiment.
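The website frequency computation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the counts are copied from the table for "hallucinant", and the names `website_counts` and `website_frequencies` are hypothetical.

```python
# Occurrence counts for the target word "hallucinant", copied from the table.
# Variable and function names are illustrative assumptions, not from the paper.
website_counts = {
    "hallucinant": 3510,
    "allucinant": 417,
    "alucinant": 221,
    "halucinant": 244,
}


def website_frequencies(counts):
    """Return each string's share of all occurrences, as a percentage."""
    total = sum(counts.values())  # 4392 for this target word
    return {string: 100 * n / total for string, n in counts.items()}


freqs = website_frequencies(website_counts)
for string, freq in freqs.items():
    print(f"{string}: {freq:.1f}")
```

Because the dictation group had 100 participants, the dictation counts can be read directly as percentages and compared with these values without further scaling.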
Fig 1. ECVT test for the production frequency of correct and misspelled strings under word dictation.
Correlation coefficients between the six regressors for words and for misspellings.
| | Word frequency | Orthogr. neighbors | Phonol. neighbors | Min bigram frequency | Min big. frq. increase | String length |
|---|---|---|---|---|---|---|
| Words (N = 100) | ||||||
| Word frequency | - | 0.1259 | 0.0690 | 0.1427 | - | -0.0860 |
| Ortho. N | 0.1259 | - | 0.6160 | 0.2953 | - | -0.4334 |
| Phono. N | 0.0690 | 0.6160 | - | 0.2493 | - | -0.6463 |
| Min bigram freq. | 0.1427 | 0.2953 | 0.2493 | - | - | -0.0507 |
| String length | -0.0860 | -0.4334 | -0.6463 | -0.0507 | - | - |
| Errors (N = 653) | ||||||
| Word frequency | - | 0.1547 | 0.1318 | 0.0519 | -0.0550 | -0.1253 |
| Ortho. N | 0.1547 | - | 0.5536 | 0.1333 | -0.1014 | -0.3765 |
| Phono. N | 0.1318 | 0.5536 | - | 0.1161 | -0.0969 | -0.6321 |
| Min bigram freq. | 0.0519 | 0.1333 | 0.1161 | - | 0.5680 | -0.0619 |
| Min big. frq. inc. | -0.0550 | -0.1014 | -0.0969 | 0.5680 | - | -0.0986 |
| String length | -0.1253 | -0.3765 | -0.6321 | -0.0619 | -0.0986 | - |
Correlation coefficients of the six regressors with the frequency of observed strings in the website database and in the spelling-to-dictation data.
| | Word frequency | Orthogr. neighbors | Phonol. neighbors | Min bigram frequency | Min big. frq. increase | String length |
|---|---|---|---|---|---|---|
| Correct (N = 100) | ||||||
| Web database | 0.2350 | 0.0465 | -0.0597 | 0.1341 | - | 0.0710 |
| Dictation | 0.4767 | 0.0972 | 0.0359 | 0.0273 | - | -0.1039 |
| Errors (N = 653) | ||||||
| Web database | 0.0047 | 0.0475 | 0.0555 | 0.0362 | 0.0938 | -0.0951 |
| Dictation | -0.0727 | 0.0338 | 0.0273 | 0.0745 | 0.1029 | -0.0572 |
*: p < .05
**: p < .01
***: p < .001
Statistics of the sample of 6567 words from the website database.
| | min | max | mean | sd |
|---|---|---|---|---|
| Num. of occurrences in the website database | 200 | 290554 | 9326 | 15576 |
| Spelling accuracy | 0.1176 | 0.9988 | 0.7835 | 0.1619 |
| Frequency per million in books (from Lexique 3) | 0 | 5186.8 | 17.6862 | 98.7846 |
| Number of letters | 3 | 18 | 8.2164 | 2.1936 |
| From MANULEX: | ||||
| Grapheme to phoneme minimum consistency | 0 | 100 | 41.6152 | 24.7274 |
| Phoneme to grapheme minimum consistency | 0 | 100 | 23.1678 | 19.3699 |
| Num. of heterographic homophones | 0 | 13 | 1.3224 | 1.8886 |
| Freq. of heterographic homophones | 0 | 1053 | 3.6534 | 21.6155 |
| Num. of phonographic neighbours | 0 | 9 | 0.4570 | 0.9104 |
| Freq. of phonographic neighbourhood | 0 | 3347.5 | 4.3722 | 52.4606 |
Correlation matrix of the spelling accuracy in the website database and the 8 tested regressors.
| | Accu | CGP | CPG | WF | Len | HPn | HPf | PGNn | PGNf |
|---|---|---|---|---|---|---|---|---|---|
| Accu | - | -0.012 | 0.135 | 0.242 | 0.120 | 0.063 | 0.026 | 0.011 | 0.016 |
| CGP | -0.012 | - | 0.236 | 0.031 | -0.298 | 0.096 | 0.001 | 0.140 | 0.112 |
| CPG | 0.135 | 0.236 | - | 0.028 | -0.174 | 0.065 | -0.004 | 0.098 | 0.079 |
| WF | 0.242 | 0.031 | 0.028 | - | -0.204 | 0.215 | 0.429 | 0.204 | 0.217 |
| Len | 0.120 | -0.298 | -0.174 | -0.204 | - | -0.231 | -0.232 | -0.355 | -0.300 |
| HPn | 0.063 | 0.096 | 0.065 | 0.215 | -0.231 | - | 0.396 | 0.332 | 0.180 |
| HPf | 0.026 | 0.001 | -0.004 | 0.429 | -0.232 | 0.396 | - | 0.207 | 0.223 |
| PGNn | 0.011 | 0.140 | 0.098 | 0.204 | -0.355 | 0.332 | 0.207 | - | 0.611 |
| PGNf | 0.016 | 0.112 | 0.079 | 0.217 | -0.300 | 0.180 | 0.223 | 0.611 | - |
Significance: p < .05: |r|>0.024; p < .01: |r|>0.032; p < .001: |r|>0.041; N = 6567.
Accu: spelling accuracy; CGP: grapheme to phoneme min consistency; CPG: phoneme to grapheme min consistency; WF: word log-frequency; Len: number of letters; HPn: number of heterographic homophones; HPf: log-frequency of heterographic homophones; PGNn: number of phonographic neighbours; PGNf: log-frequency of phonographic neighbours.
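The significance thresholds quoted for the correlation matrix follow from the sample size alone. The sketch below shows one way to recover them, assuming a two-tailed test and a normal approximation to the t distribution (reasonable here only because N = 6567 is large); the function name `critical_r` is hypothetical, and the paper does not state how its thresholds were derived.

```python
from math import sqrt
from statistics import NormalDist


def critical_r(alpha, n_items):
    """Smallest |r| significant at two-tailed level alpha for n_items.

    Assumption: the t quantile is approximated by the standard normal
    quantile, which is only appropriate for large samples.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    df = n_items - 2
    # Invert t = r * sqrt(df / (1 - r^2)) to express r in terms of the quantile.
    return z / sqrt(z * z + df)


for alpha in (0.05, 0.01, 0.001):
    print(alpha, round(critical_r(alpha, 6567), 3))
```

With so many items, even very small correlations pass these thresholds, so the magnitude of each coefficient matters more than whether it is significant.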
Stepwise hierarchical multiple regression analysis of the spelling accuracy in the website database (6567 items).
| Regressor | R2 | ΔR2 | Significance | β |
|---|---|---|---|---|
| WF | 0.0586 | 0.0586 | F(1,6565) = 408.7, p < .0001 | 0.2996 |
| CPG | 0.0749 | 0.0163 | F(1,6564) = 115.8, p < .0001 | 0.1570 |
| Len | 0.1141 | 0.0392 | F(1,6563) = 290.5, p < .0001 | 0.2055 |
| HPn | 0.1159 | 0.0018 | F(1,6562) = 13.16, p < .0003 | 0.0684 |
| HPf | 0.1205 | 0.0046 | F(1,6561) = 34.63, p < .0001 | -0.0809 |
β significance: *: p < 10⁻⁷.
WF: word log-frequency; CPG: phoneme to grapheme min consistency; Len: number of letters; HPn: number of heterographic homophones; HPf: log-frequency of heterographic homophones.
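The F statistics in the stepwise table are consistent with the standard F test for a change in R² when one regressor is added. The following sketch recomputes them from the tabulated R² and ΔR² values. It assumes that standard formula (the paper does not spell it out), and the names `f_change` and `steps` are hypothetical. Because the tabulated R² values are rounded to four decimals, the recomputed values match the reported ones only approximately.

```python
def f_change(r2_full, delta_r2, n_items, n_predictors):
    """F statistic for adding one regressor to a nested linear model.

    r2_full      -- R^2 of the model after adding the regressor
    delta_r2     -- increase in R^2 due to that regressor
    n_items      -- number of observations
    n_predictors -- number of regressors in the model after the addition
    """
    residual_df = n_items - n_predictors - 1
    return delta_r2 / ((1.0 - r2_full) / residual_df)


N_ITEMS = 6567

# (regressor, R^2, delta R^2, reported F), copied from the table above.
steps = [
    ("WF", 0.0586, 0.0586, 408.7),
    ("CPG", 0.0749, 0.0163, 115.8),
    ("Len", 0.1141, 0.0392, 290.5),
    ("HPn", 0.1159, 0.0018, 13.16),
    ("HPf", 0.1205, 0.0046, 34.63),
]

for k, (name, r2, delta, reported) in enumerate(steps, start=1):
    recomputed = f_change(r2, delta, N_ITEMS, k)
    print(f"{name}: recomputed F(1,{N_ITEMS - k - 1}) = {recomputed:.1f}, reported {reported}")
```

The residual degrees of freedom produced by this formula (6565 down to 6561) agree with those in the table.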