Amaç Herdağdelen, Marco Marelli.
Abstract
Corpus-based word frequencies are one of the most important predictors in language processing tasks. Frequencies based on conversational corpora (such as movie subtitles) have been shown to capture the variance in lexical decision tasks better than frequencies based on traditional corpora. In this study, we show that frequencies computed from social media are currently the best frequency-based estimators of lexical decision reaction times (up to 3.6% increase in explained variance). The results are robust (observed for Twitter- and Facebook-based frequencies on American English and British English datasets) and are still substantial when we control for corpus size.
Keywords: Frequency effects; Lexical decision; Social media; Text corpora
Year: 2016 PMID: 27477913 PMCID: PMC5484375 DOI: 10.1111/cogs.12392
Source DB: PubMed Journal: Cogn Sci ISSN: 0364-0213
Spearman correlations between frequency predictors and reaction times for American English
| | RT | RTC | FB‐US | HAL | SUBTLEX‐US |
|---|---|---|---|---|---|
| RTC | −0.694 | | | | |
| FB‐US | −0.693 | 0.950 | | | |
| HAL | −0.663 | 0.861 | 0.851 | | |
| SUBTLEX‐US | −0.672 | 0.891 | 0.891 | 0.851 | |
| SUBTLEX‐US (CD) | −0.679 | 0.900 | 0.903 | 0.853 | 0.991 |
Spearman correlations between frequency predictors and reaction times for British English
| | RT | RTC | FB‐UK | BNC | CELEX | SUBTLEX‐UK |
|---|---|---|---|---|---|---|
| RTC | −0.714 | | | | | |
| FB‐UK | −0.709 | 0.922 | | | | |
| BNC | −0.664 | 0.771 | 0.793 | | | |
| CELEX | −0.650 | 0.747 | 0.763 | 0.936 | | |
| SUBTLEX‐UK | −0.694 | 0.858 | 0.887 | 0.888 | 0.845 | |
| SUBTLEX‐UK (CD) | −0.701 | 0.865 | 0.892 | 0.901 | 0.863 | 0.992 |
Absolute correlations and explained variance of various measures with respect to the response latencies from the British Lexicon Project
| Corpus | Pearson Correlation | Explained Variance |
|---|---|---|
| CELEX | 0.627 | 0.409 |
| BNC | 0.638 | 0.422 |
| SUBTLEX‐UK | 0.666 | 0.459 |
| SUBTLEX‐UK CD | 0.675 | 0.471 |
| FB‐UK | 0.684 | 0.483 |
| FB‐UK UC | 0.686 | 0.486 |
| RTC | 0.686 | 0.487 |
| RTC UC | 0.686 | 0.487 |
| Baseline | | 0.489 |
| Baseline + FB‐UK | | 0.515 |
| Baseline + RTC | | 0.522 |
| Baseline + FB‐UK + RTC | | 0.525 |
CD, contextual diversity; UC, user count. Frequency values and response latencies are log‐transformed.
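As a rough illustration of how a correlation such as those in the table above might be computed, the sketch below log-transforms frequencies and response latencies and takes the absolute Pearson correlation. The word values are invented toy data, the base-10 logarithm is an assumption (the note only says the values are log-transformed), and the squared correlation is shown for reference only; the explained-variance column in the table need not equal it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: per-million frequencies and mean latencies (ms).
freq_per_million = [1200.0, 310.0, 45.0, 8.0, 0.9]
rt_ms = [520.0, 560.0, 610.0, 690.0, 780.0]

# Log-transform both variables (base 10 is an assumption).
log_freq = [math.log10(f) for f in freq_per_million]
log_rt = [math.log10(r) for r in rt_ms]

r = pearson(log_freq, log_rt)
abs_r = abs(r)        # the table reports absolute correlations
r_squared = r ** 2    # variance explained by a simple linear fit
```

Frequent words are recognized faster, so `r` is negative for data like these, which is why the table reports absolute values.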
Absolute correlations and explained variance of various measures with respect to the response latencies from the English Lexicon Project
| Corpus | Pearson Correlation | Explained Variance |
|---|---|---|
| HAL | 0.646 | 0.429 |
| SUBTLEX‐US | 0.642 | 0.430 |
| SUBTLEX‐US CD | 0.656 | 0.448 |
| RTC | 0.673 | 0.467 |
| RTC UC | 0.674 | 0.467 |
| FB‐US | 0.674 | 0.468 |
| FB‐US UC | 0.675 | 0.469 |
| Baseline | | 0.495 |
| Baseline + RTC | | 0.506 |
| Baseline + FB‐US | | 0.507 |
| Baseline + FB‐US + RTC | | 0.509 |
CD, contextual diversity; UC, user count. Frequency values and response latencies are log‐transformed.
Figure 1. Variance explained by social media frequency norms in the BLP and ELP item sets for American and British English.
Figure 2. Effect of corpus size on explained variance in reaction times.
Figure 3. Mean absolute residual differences between RTC and SUBTLEX‐US in reaction time modeling analysis, conditioned by reaction time deciles. Deciles for which a paired t test is significant (p < .001, using a Bonferroni correction) are marked by an asterisk.
Over‐represented words in each corpus, according to the log‐likelihood score (G²), for different reaction time quintiles
| Quintile | FB‐US (w.r.t. SUBTLEX‐US) | RTC (w.r.t. SUBTLEX‐US) | SUBTLEX‐US (w.r.t. RTC) |
|---|---|---|---|
| 1 (fastest 20%) | my, day, year, for, today, love, friends, its, so, being | my, its, love, follow, today, day, new, though, for, so | you, sir, here, right, it, know, he, okay, that, do |
| 2 | and, real, but, wit, part, awesome, miss, ass, work, tonight | real, wit, at, but, awesome, ass, watching, miss, and, weekend | mean, won, were, got, father, minute, yeah, doing, honey, come |
| 3 | gonna, birthday, gotta, thankful, bout, prayers, status, season, evening, wish | gonna, via, gotta, followers, bout, wish, birthday, posted, season, boo | him, little, three, supposed, ought, hundred, yours, mister, colonel, enough |
| 4 | am, haha, mommy, hubby, momma, disclose, contained, posting, ma, goodnight | haha, ma, wow, huh, awkward, rite, ad, goodnight, congrats, mum | something, ahead, pardon, discuss, sweetheart, downstairs, warrant, champagne, brilliant, assure |
| 5 (slowest 20%) | a, thru, their, ugh, yea, ah, awhile, tech, hating, auntie | a, twitter, yea, ah, ugh, thru, follower, subscribed, tech, awhile | haven, sergeant, ashore, scare, adjourned, missiles, vanquish, slipstream, potion, hostage |
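For readers unfamiliar with the G² statistic, the sketch below uses the common two-corpus log-likelihood formulation, in which observed counts are compared with the counts expected if the word were equally frequent in both corpora. Which exact variant the table is based on is not stated here, so this form is an assumption, and the function name `g2` and the counts are hypothetical.

```python
import math


def g2(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood score for one word.

    count_a, count_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b:   total token counts of the two corpora.
    """
    total = count_a + count_b
    # Expected counts if the word had the same relative frequency in both.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2.0 * score


# Hypothetical counts: 120 occurrences in corpus A, 30 in corpus B,
# with one million tokens in each corpus.
score = g2(120, 1_000_000, 30, 1_000_000)
over_represented_in_a = 120 / 1_000_000 > 30 / 1_000_000
```

The score itself is unsigned, so the direction of over-representation has to be read from the relative frequencies, as in the last line.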
Figure 4. Log frequency ratio of SUBTLEX‐US and social media frequencies. The dark bars are log2(FB‐US/SUBTLEX‐US), and the light bars are log2(RTC/SUBTLEX‐US).
Standard error of the mean (SEM) for log ratio relative frequencies across words within a LIWC subcategory. SEMs are reported for both log2(FB‐US/SUBTLEX‐US) (SEM FB) and log2(RTC/SUBTLEX‐US) (SEM RTC)
| Category | Subcategory | Word Count | SEM FB | SEM RTC |
|---|---|---|---|---|
| Social | Family | 52 | 0.266 | 0.193 |
| Social | Friend | 39 | 0.261 | 0.215 |
| Social | Humans | 44 | 0.259 | 0.222 |
| Affective | Posemo | 532 | 0.070 | 0.062 |
| Affective | Negemo | 738 | 0.050 | 0.050 |
| Affective | Anx | 135 | 0.128 | 0.123 |
| Affective | Anger | 254 | 0.082 | 0.084 |
| Affective | Sad | 168 | 0.096 | 0.108 |
| Biological | Body | 324 | 0.083 | 0.085 |
| Biological | Health | 263 | 0.095 | 0.094 |
| Biological | Sexual | 100 | 0.157 | 0.152 |
| Biological | Ingest | 201 | 0.094 | 0.084 |
| Relative | Relativ | 963 | 0.043 | 0.043 |
| Relative | Motion | 305 | 0.073 | 0.080 |
| Relative | Space | 372 | 0.069 | 0.066 |
| Relative | Time | 273 | 0.085 | 0.081 |
| Personal | Work | 385 | 0.084 | 0.083 |
| Personal | Achieve | 265 | 0.085 | 0.084 |
| Personal | Leisure | 354 | 0.083 | 0.081 |
| Personal | Home | 180 | 0.135 | 0.116 |
| Personal | Money | 228 | 0.099 | 0.094 |
| Personal | Relig | 141 | 0.177 | 0.148 |
| Personal | Death | 88 | 0.147 | 0.149 |
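The quantities in this table can be sketched as follows: compute the log2 ratio of relative frequencies for each word in a subcategory, then take the standard error of the mean across those words. The frequencies below are invented, and using the sample standard deviation (n − 1 denominator) is an assumption, since the caption does not say which estimator was used.

```python
import math
import statistics


def log2_ratio(rel_freq_a, rel_freq_b):
    """log2 of the ratio of two relative frequencies for one word."""
    return math.log2(rel_freq_a / rel_freq_b)


def sem(values):
    """Standard error of the mean, based on the sample standard deviation."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical per-million frequencies for four words in one subcategory:
# (social media frequency, subtitle frequency).
pairs = [(80.0, 40.0), (12.0, 24.0), (5.0, 5.0), (30.0, 10.0)]

ratios = [log2_ratio(social, subtitle) for social, subtitle in pairs]
mean_ratio = statistics.mean(ratios)
sem_ratio = sem(ratios)
```

A positive `mean_ratio` would indicate that the subcategory is, on average, more frequent in the social media corpus than in the subtitle corpus.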
Absolute correlation and variance explained of various measures with respect to the lexical decision latencies from the entire English Lexicon Project
| Corpus | Pearson Correlation | Explained Variance |
|---|---|---|
| HAL | 0.611 | 0.392 |
| SUBTLEX‐US | 0.648 | 0.447 |
| SUBTLEX‐US CD | 0.654 | 0.454 |
| RTC | 0.684 | 0.490 |
| FB‐US | 0.676 | 0.480 |
| Baseline | | 0.623 |
| Baseline + RTC | | 0.632 |
| Baseline + FB‐US | | 0.632 |
| Baseline + FB‐US + RTC | | 0.633 |
Baseline refers to the model that includes HAL, SUBTLEX‐US, SUBTLEX‐US CD, number of syllables, and number of letters in the word. Frequency values and response latencies are log‐transformed.
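The "Baseline + X" rows compare nested regression models. A minimal sketch of that comparison, assuming ordinary least squares with an intercept, is given below. It uses only two baseline predictors and invented data rather than the full predictor set named in the note, and the helper names `solve` and `r_squared` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of an OLS fit of y on the predictor columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design))
           for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data for eight words.
n_letters = [3.0, 5.0, 4.0, 7.0, 6.0, 8.0, 5.0, 9.0]
log_freq_old = [3.1, 2.4, 2.9, 1.2, 1.9, 0.8, 2.2, 0.5]   # established norm
log_freq_new = [3.0, 2.6, 2.7, 1.5, 1.6, 1.1, 2.5, 0.4]   # added norm
log_rt = [2.70, 2.74, 2.72, 2.83, 2.79, 2.86, 2.75, 2.90]

baseline_r2 = r_squared([n_letters, log_freq_old], log_rt)
full_r2 = r_squared([n_letters, log_freq_old, log_freq_new], log_rt)
gain = full_r2 - baseline_r2  # extra variance explained by the added norm
```

Because the models are nested, the larger model can never explain less variance than the baseline, so the interesting quantity is how large `gain` is.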