Nicolas Pröllochs, Stefan Feuerriegel, Dirk Neumann.
Abstract
Information forms the basis for all human behavior, including the ubiquitous decision-making that people constantly perform in their everyday lives. It is thus the mission of researchers to understand how humans process information to reach decisions. To facilitate this task, this work proposes LASSO regularization as a statistical tool to extract decisive words from textual content and thereby study the reception of granular expressions in natural language. This differs from the usual use of the LASSO as a predictive model and, instead, yields highly interpretable statistical inferences between the occurrences of words and an outcome variable. Accordingly, the method suggests direct implications for the social sciences: it serves as a statistical procedure for generating domain-specific dictionaries as opposed to frequently employed heuristics. In addition, researchers can now identify text segments and word choices that are statistically decisive to authors or readers and, based on this knowledge, test hypotheses from behavioral research.
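As a rough illustration of the idea in the abstract, the sketch below fits a LASSO regression by coordinate descent on a tiny, invented document-term matrix and keeps only the words with nonzero coefficients. The data, the penalty `lam = 0.05`, and the helper names `soft_threshold` and `lasso_cd` are assumptions made for this example; they are not taken from the paper, which may use a different solver and tuning procedure.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimize (1/2n) * sum((y - b0 - X b)^2) + lam * sum(|b|) by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0   # inner product of column j with the partial residual
            norm = 0.0  # squared norm of column j
            for i in range(n):
                fit_without_j = intercept + sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - fit_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm else 0.0
        # The intercept is not penalized: set it to the mean residual.
        intercept = sum(
            y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)
        ) / n
    return intercept, beta


# Invented toy corpus: rows are documents, columns mark whether a stem occurs.
vocab = ["great", "bad", "film"]
X = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
y = [1.0, 1.0, -1.0, -1.0, 0.0, 0.0]  # invented outcome, e.g. a rating

intercept, beta = lasso_cd(X, y, lam=0.05)
selected = [w for w, b in zip(vocab, beta) if b != 0.0]
print(selected)  # ['great', 'bad']
```

On this toy input the neutral stem "film" is shrunk to exactly zero, while "great" and "bad" keep coefficients of opposite sign. That exact-zero behavior is what lets the LASSO act as a word-selection device rather than only a predictor.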
Year: 2018 PMID: 30576343 PMCID: PMC6303018 DOI: 10.1371/journal.pone.0209323
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Common dictionaries in behavioral research.
| Dictionary | Size | Categories | Domain | Selection Process | Polarity Levels | Notes |
|---|---|---|---|---|---|---|
| Diction | 10,000 | 35 linguistic categories (e.g. optimism, satisfaction, praise, blame, denial) | Politics | Expert judgment | Binary | Accessible for purchase via the Diction software for text analysis |
| Harvard IV | 4,206 | 15 linguistic categories (e.g. polarity, motivation, pleasure, pain, cognitive orientation) | Psychology | Expert judgment | Binary | Shipped in General Inquirer |
| LIWC | 4,500 | 64 linguistic dimensions (e.g. polarity, part-of-speech, cognitive and psychological words) | Psychology | Independent judges | Binary | Accessible for purchase from the LIWC text analysis software |
| Loughran-McDonald | 2,709 | Polarity (positive, negative) | Finance | Manual selection procedure | Binary | Based on 2of12inf dictionary |
| QDAP | 6,789 | Polarity (positive, negative) | General | Heuristic based on co-occurrences with positive/negative seed words | Binary | Synset of WordNet |
| SentiStrength | 763 | Positivity, negativity | Social media | Human judgment | Continuous rating | Derived from LIWC |
| SentiWordNet 3.0 | 28,431 | Positivity, negativity, neutrality | General | Heuristic based on co-occurrences with positive/negative seed words | Continuous rating | Based on 86,994 terms from WordNet |
Empirical results of top 15 opinionated terms in movie reviews.
| Word Stem | Coef. | Stand. Error | Relative Freq. (%) | Positive Docs (%) | Negative Docs (%) | Harvard IV |
|---|---|---|---|---|---|---|
| Positive terms | | | | | | |
| great | 0.0709 | 0.0098 | 31.58 | 66.29 | 33.71 | ⊕ |
| perfect | 0.0707 | 0.0096 | 19.18 | 74.17 | 25.83 | ⊕ |
| excel | 0.0572 | 0.0156 | 19.68 | 64.57 | 35.43 | ⊕ |
| best | 0.0571 | 0.0098 | 47.16 | 64.59 | 35.41 | ⊕ |
| life | 0.0551 | 0.0102 | 49.82 | 63.79 | 36.21 | |
| delight | 0.0515 | 0.0098 | 10.69 | 76.82 | 23.18 | ⊕ |
| brilliant | 0.0480 | 0.0095 | 7.19 | 73.06 | 26.94 | ⊕ |
| intens | 0.0469 | 0.0097 | 9.27 | 74.35 | 25.65 | |
| uniqu | 0.0416 | 0.0098 | 8.39 | 73.81 | 26.19 | ⊕ |
| recommend | 0.0393 | 0.0138 | 18.57 | 59.78 | 40.22 | |
| marvel | 0.0390 | 0.0096 | 5.19 | 79.62 | 20.38 | ⊕ |
| hilari | 0.0373 | 0.0096 | 6.97 | 75.64 | 24.36 | ⊕ |
| easi | 0.0353 | 0.0095 | 15.46 | 70.67 | 29.33 | ⊕ |
| matur | 0.0347 | 0.0102 | 10.49 | 74.67 | 25.33 | ⊕ |
| fascin | 0.0346 | 0.0099 | 10.59 | 77.74 | 22.26 | ⊕ |
| Negative terms | | | | | | |
| bad | -0.1124 | 0.0103 | 34.50 | 47.60 | 52.40 | ⊝ |
| worst | -0.1011 | 0.0132 | 16.26 | 52.21 | 47.79 | ⊝ |
| wast | -0.0762 | 0.0144 | 19.28 | 52.54 | 47.46 | ⊝ |
| review | -0.0741 | 0.0169 | 53.28 | 51.48 | 48.52 | |
| suppos | -0.0699 | 0.0097 | 15.66 | 41.07 | 58.93 | |
| least | -0.0672 | 0.0097 | 22.67 | 47.58 | 52.42 | |
| movi | -0.0671 | 0.0130 | 84.66 | 56.06 | 43.94 | |
| cinematograph | -0.0538 | 0.0151 | 21.17 | 44.81 | 55.19 | |
| flat | -0.0526 | 0.0096 | 6.31 | 35.44 | 64.56 | |
| unfortun | -0.0512 | 0.0102 | 14.12 | 43.42 | 56.58 | ⊝ |
| dull | -0.0483 | 0.0096 | 5.43 | 32.72 | 67.28 | ⊝ |
| bore | -0.0483 | 0.0097 | 8.21 | 37.23 | 62.77 | ⊝ |
| denni | -0.0468 | 0.0236 | 23.33 | 42.21 | 57.79 | |
| lack | -0.0450 | 0.0097 | 16.48 | 48.61 | 51.39 | ⊝ |
| wors | -0.0442 | 0.0097 | 7.11 | 38.48 | 61.52 | ⊝ |
Notes: This table reports the extracted terms that convey a particularly positive or negative sentiment in movie reviews. Top: the 15 most positive word stems, together with their estimated coefficients. Standard errors are calculated via the Post-LASSO [39]. Bottom: the 15 most negative word stems. In addition, we provide the relative frequency within the corpus, as well as the ratio of positive and negative documents that contain each word. The last column shows the overlap with the Harvard IV psychological dictionary. The symbol “⊕” indicates terms that appear in the positive word list and “⊝” in the negative word list of this dictionary. The complete list with all 549 stems is given in the supplementary materials.
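The notes mention standard errors from the Post-LASSO. In its usual form, this means refitting ordinary least squares on only the regressors that the LASSO kept and reading the standard errors off that refit. The sketch below follows that reading on invented data; the helper names `invert` and `post_lasso_ols`, the toy inputs, and the use of classical homoskedastic OLS standard errors are assumptions, and the estimator in [39] may differ in its details.

```python
import math


def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(A)
    ]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("matrix is singular")
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]


def post_lasso_ols(X, y, selected):
    """Refit OLS (with intercept) on the selected columns; return coefs and SEs."""
    n = len(X)
    Z = [[1.0] + [row[j] for j in selected] for row in X]
    k = len(Z[0])
    ZtZ = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Zty = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = invert(ZtZ)
    coef = [sum(inv[a][b] * Zty[b] for b in range(k)) for a in range(k)]
    resid = [y[i] - sum(Z[i][a] * coef[a] for a in range(k)) for i in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)  # residual variance
    se = [math.sqrt(sigma2 * inv[a][a]) for a in range(k)]
    return coef, se


# Invented data: columns are the stems "great", "bad", "film".
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
y = [1.2, 0.8, -0.9, -1.1, 0.1, -0.1]
selected = [0, 1]  # assume the LASSO step kept "great" and "bad" only

coef, se = post_lasso_ols(X, y, selected)
for name, c, s in zip(["(intercept)", "great", "bad"], coef, se):
    print(f"{name:12s} coef = {c:+.4f}  se = {s:.4f}")
```

The refit coefficients are no longer shrunk by the penalty, which is why Post-LASSO estimates are typically larger in magnitude than the raw LASSO coefficients for the same variables.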
Empirical results of top 15 polarity expressions in financial filings.
| Word Stem | Coef. | Stand. Error | Relative Freq. (%) | Positive Docs (%) | Negative Docs (%) | Harvard IV | Loughran-McDonald |
|---|---|---|---|---|---|---|---|
| Positive terms | | | | | | | |
| improv | 0.0325 | 0.0045 | 37.27 | 49.70 | 50.30 | ⊕ | ⊕ |
| rais | 0.0160 | 0.0038 | 11.06 | 51.17 | 48.83 | ⊝ | |
| strong | 0.0144 | 0.0045 | 28.41 | 50.32 | 49.68 | ⊕ | |
| increas | 0.0113 | 0.0051 | 60.51 | 49.16 | 50.84 | ||
| facil | 0.0106 | 0.0039 | 35.28 | 49.29 | 50.71 | ||
| waiver | 0.0095 | 0.0039 | 15.33 | 48.33 | 51.67 | ||
| stronger | 0.0080 | 0.0039 | 5.44 | 50.58 | 49.42 | ⊕ | |
| vacat | 0.0076 | 0.0037 | 5.58 | 49.72 | 50.28 | ||
| repurchas | 0.0074 | 0.0039 | 22.97 | 50.03 | 49.97 | ||
| favor | 0.0073 | 0.0040 | 25.45 | 49.71 | 50.29 | ⊕ | ⊕ |
| consumm | 0.0067 | 0.0040 | 15.13 | 48.43 | 51.57 | ⊕ | |
| annum | 0.0056 | 0.0039 | 9.27 | 48.00 | 52.00 | ||
| avoid | 0.0051 | 0.0037 | 11.55 | 48.82 | 51.18 | ⊝ | |
| payrol | 0.0049 | 0.0037 | 6.69 | 49.32 | 50.68 | ||
| middl | 0.0046 | 0.0037 | 5.16 | 49.15 | 50.85 | ||
| Negative terms | | | | | | | |
| declin | -0.0204 | 0.0045 | 23.59 | 48.65 | 51.35 | ⊝ | ⊝ |
| negat | -0.0162 | 0.0040 | 20.03 | 47.85 | 52.15 | ⊝ | ⊝ |
| lower | -0.0138 | 0.0047 | 27.27 | 48.72 | 51.28 | ⊝ | |
| experienc | -0.0117 | 0.0038 | 12.17 | 47.93 | 52.07 | ||
| delay | -0.0091 | 0.0038 | 18.72 | 47.65 | 52.35 | ⊝ | ⊝ |
| broad | -0.0063 | 0.0038 | 11.46 | 48.22 | 51.78 | ||
| advertis | -0.0056 | 0.0042 | 8.86 | 48.35 | 51.65 | ||
| project | -0.0055 | 0.0041 | 36.88 | 48.84 | 51.16 | ||
| pressur | -0.0055 | 0.0038 | 9.42 | 48.96 | 51.04 | ||
| now | -0.0054 | 0.0040 | 27.14 | 48.80 | 51.20 | ||
| challeng | -0.0054 | 0.0039 | 15.58 | 48.41 | 51.59 | ⊝ | ⊝ |
| offer | -0.0052 | 0.0045 | 40.33 | 48.76 | 51.24 | ⊕ | |
| depreci | -0.0051 | 0.0052 | 23.11 | 48.59 | 51.41 | ⊝ | |
| impact | -0.0041 | 0.0046 | 39.79 | 48.62 | 51.38 | ||
| weak | -0.0039 | 0.0038 | 8.58 | 48.35 | 51.65 | ⊝ | ⊝ |
Notes: This table reports verbal expressions that convey positive and negative information in financial disclosures (Form 8-K filings). Top: the 15 most positive word stems, together with their estimated coefficients. Standard errors are calculated via the Post-LASSO [39]. Bottom: the 15 most negative word stems. In addition, we provide the relative frequency in financial filings, as well as the ratio of documents with a positive or negative market response. The last two columns show the agreement between our statistical inferences and two common dictionaries based on human annotations, namely, the Harvard IV psychological and the Loughran-McDonald finance-specific dictionary. The symbol “⊕” indicates terms that appear in the respective positive word list, “⊝” in the negative one. The complete table with all 172 entries is given in the supplements.
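One hedged way to read the coefficient and standard-error columns together is their ratio, which behaves roughly like a z-statistic under a normal approximation. The snippet below computes it for four rows copied from the financial-filings table; the 1.96 cutoff and the function name `z_ratio` are assumptions for illustration, not a test reported by the authors.

```python
def z_ratio(coef, std_error):
    """Coefficient divided by its standard error (approximate z-statistic)."""
    return coef / std_error


# (coefficient, standard error) pairs copied from the table above.
rows = {
    "improv": (0.0325, 0.0045),
    "declin": (-0.0204, 0.0045),
    "middl": (0.0046, 0.0037),
    "weak": (-0.0039, 0.0038),
}

for stem, (coef, se) in rows.items():
    z = z_ratio(coef, se)
    position = "beyond" if abs(z) > 1.96 else "within"
    print(f"{stem:8s} z = {z:6.2f} ({position} +/-1.96)")
```

Under this rough reading, the leading stems such as "improv" and "declin" sit far from zero relative to their standard errors, while stems near the bottom of each list, such as "middl" and "weak", are estimated much less precisely.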
Comparison of human classifications to statistical inferences.
| Dictionary | Size | Overlapping Terms (Count) | Overlapping Terms (Share) | Consensus Classification (Count) | Consensus Classification (Share) | Correlation | Reliability |
|---|---|---|---|---|---|---|---|
| Movie reviews | | | | | | | |
| Harvard IV | 4206 | 222 | 0.4044 | 138 | 0.6216 | 0.3236 *** | 0.2246 |
| Henry | 190 | 26 | 0.0474 | 20 | 0.7692 | 0.5593 ** | 0.5446 |
| Loughran-McDonald | 2709 | 73 | 0.1330 | 45 | 0.6164 | 0.4303 *** | 0.2311 |
| SentiWordNet | 28431 | 440 | 0.8015 | 246 | 0.5591 | 0.2649 *** | 0.1001 |
| QDAP | 6789 | 176 | 0.3206 | 114 | 0.6477 | 0.3638 *** | 0.2863 |
| Financial filings | | | | | | | |
| Harvard IV | 4206 | 55 | 0.3198 | 34 | 0.6182 | 0.2742 * | 0.2270 |
| Henry | 190 | 21 | 0.1221 | 19 | 0.9048 | 0.6333 ** | 0.8102 |
| Loughran-McDonald | 2709 | 20 | 0.1163 | 18 | 0.9000 | 0.6433 ** | 0.8030 |
| SentiWordNet | 28431 | 118 | 0.6860 | 69 | 0.5847 | 0.2089 * | 0.1715 |
| QDAP | 6789 | 40 | 0.2326 | 28 | 0.7000 | 0.4524 ** | 0.3939 |
Notes: This table compares common, human-generated word lists to extracted terms based on our statistical inferences. We omitted LIWC and Diction, since these are commercial products with proprietary dictionaries. When computing correlation coefficients and reliability scores, we exclude non-overlapping terms and count binary dictionary entries with a negative label as -1 and positive ones as 1. Statistical significance levels: *** p < 0.001, ** p < 0.01, * p < 0.05. Reliability (i.e. the concordance with our statistical inferences) is measured in terms of Krippendorff’s alpha coefficient [43].
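The comparison described in these notes can be sketched in a few lines. The example below uses six stems and their Harvard IV marks copied from the movie-review table, so its output does not reproduce any row of the comparison table. The function names `pearson` and `compare` are hypothetical, and the definitions of the two share columns are inferred from the reported numbers (for instance 222/549 ≈ 0.4044 and 138/222 ≈ 0.6216 for Harvard IV) rather than stated in the source. Krippendorff's alpha is not computed here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def compare(dictionary, coefs):
    """Compare a +1/-1 word list against estimated coefficients."""
    overlap = [w for w in coefs if w in dictionary]
    # A term counts as consensus when dictionary label and coefficient share a sign.
    agree = [w for w in overlap if dictionary[w] * coefs[w] > 0]
    return {
        "overlap_count": len(overlap),
        "overlap_share": len(overlap) / len(coefs),
        "consensus_count": len(agree),
        "consensus_share": len(agree) / len(overlap),
        "correlation": pearson(
            [dictionary[w] for w in overlap], [coefs[w] for w in overlap]
        ),
    }


# Coefficients copied from the movie-review table (small subset only).
coefs = {
    "great": 0.0709, "perfect": 0.0707, "life": 0.0551,
    "bad": -0.1124, "worst": -0.1011, "review": -0.0741,
}
# Harvard IV marks from the same table: +1 for a positive mark, -1 for a negative one.
harvard = {"great": 1, "perfect": 1, "bad": -1, "worst": -1}

result = compare(harvard, coefs)
print(result)
```

Because non-overlapping terms such as "life" and "review" are dropped before the correlation is computed, the correlation only reflects the stems that both sources cover.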
Summary statistics of statistical inferences with word tuples.
| | Movie reviews | Financial filings |
|---|---|---|
| Bigrams | | |
| Regressors before regularization | 1059 | 2971 |
| Extracted terms | 442 | 47 |
| Ratio of extracted terms | 41.74% | 1.58% |
| Positive terms | 234 | 19 |
| Negative terms | 208 | 28 |
| Ratio positive terms | 52.94% | 40.43% |
| Ratio negative terms | 47.06% | 59.58% |
| Adjusted R² | 0.3184 | 0.0036 |
| Correlation between model estimate and gold standard | 0.63 | 0.08 |
| Unigrams and bigrams | | |
| Regressors before regularization | 2254 | 4695 |
| Extracted terms | 798 | 132 |
| Ratio of extracted terms | 35.40% | 2.81% |
| Positive terms | 394 | 62 |
| Negative terms | 404 | 70 |
| Ratio positive terms | 49.37% | 46.97% |
| Ratio negative terms | 50.63% | 53.03% |
| Adjusted R² | 0.6126 | 0.0072 |
| Correlation between model estimate and gold standard | 0.83 | 0.11 |
Notes: The table compares our statistical inferences for different inputs, consisting of bigrams and the combination of unigrams and bigrams. These are evaluated in terms of goodness-of-fit and by comparing the number of selected entries. The complete lists of extracted variables and their coefficients are given in the supplements.
Summary statistics for hypothesis testing with movie reviews.
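| Statistic | Panel I: μ1 | Panel I: μ2 | Panel I: μ | Panel II: μ1 | Panel II: μ2 | Panel II: μ | Panel III: μ1 | Panel III: μ2 | Panel III: μ |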
|---|---|---|---|---|---|---|---|---|---|
| Mean | 0.1025 | 0.0578 | 0.1604 | 0.1626 | 0.1402 | 0.3027 | -0.0098 | -0.0930 | -0.1028 |
| Min. | -0.4159 | -0.5922 | -0.8903 | -0.2020 | -0.3977 | -0.2947 | -0.4159 | -0.5922 | -0.8903 |
| 25% Quantile | 0.0159 | -0.0517 | -0.0173 | 0.0825 | 0.0507 | 0.1539 | -0.0729 | -0.1639 | -0.2028 |
| Median | 0.0996 | 0.0528 | 0.1476 | 0.1557 | 0.1342 | 0.2896 | -0.0021 | -0.0849 | -0.0947 |
| 75% Quantile | 0.1853 | 0.1646 | 0.3331 | 0.2325 | 0.2251 | 0.4302 | 0.0587 | -0.0178 | -0.0014 |
| Max. | 0.7336 | 0.7655 | 1.3848 | 0.7336 | 0.7655 | 1.3848 | 0.3624 | 0.3322 | 0.3687 |
| Std. Dev. | 0.1340 | 0.1615 | 0.2569 | 0.1175 | 0.1357 | 0.2066 | 0.1060 | 0.1123 | 0.1604 |
| Skewness | 0.1717 | 0.1535 | 0.2202 | 0.4439 | 0.3028 | 0.5058 | -0.2684 | -0.3898 | -0.3062 |
| Kurtosis | 0.6052 | 0.2682 | 0.2204 | 0.6407 | 0.5611 | 0.6167 | 0.5321 | 0.7152 | 0.6622 |
Notes: Panel I compares the sentiment of the first (μ1) and second half (μ2) of movie reviews, as well as the overall sentiment μ. The additional panels present the same statistics for reviews with positive (Panel II) or negative (Panel III) gold standard only.
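To make the quantities μ1, μ2, and μ concrete, the sketch below scores the two halves of a single review by summing the coefficients of the stems that occur in each half. This scoring rule, the split by token count, the function name `half_sentiments`, and the example review are all assumptions for illustration; the source does not spell out the exact computation. The reported means are at least consistent with μ = μ1 + μ2 up to rounding (0.1025 + 0.0578 ≈ 0.1604 in Panel I). The coefficients are copied from the movie-review table.

```python
def half_sentiments(stems, coefs):
    """Return (mu1, mu2, mu): sentiment of first half, second half, and total."""
    mid = len(stems) // 2
    mu1 = sum(coefs.get(s, 0.0) for s in stems[:mid])
    mu2 = sum(coefs.get(s, 0.0) for s in stems[mid:])
    return mu1, mu2, mu1 + mu2


# Coefficients copied from the movie-review table (subset).
coefs = {"great": 0.0709, "perfect": 0.0707, "bad": -0.1124, "bore": -0.0483}

# Invented, already stemmed review: positive opening, negative ending.
review = ["great", "stori", "perfect", "cast", "but", "bad", "end", "bore"]

mu1, mu2, mu = half_sentiments(review, coefs)
print(f"mu1 = {mu1:+.4f}, mu2 = {mu2:+.4f}, mu = {mu:+.4f}")
```

Computing these three values for every review and then summarizing them per column would give statistics of the kind shown in the table, under the stated assumptions.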