Oscar N. E. Kjell, Sverker Sikström, Katarina Kjell, H. Andrew Schwartz.
Abstract
We show that using a recent break-through in artificial intelligence -transformers-, psychological assessments from text-responses can approach theoretical upper limits in accuracy, converging with standard psychological rating scales. Text-responses use people's primary form of communication -natural language- and have been suggested as a more ecologically-valid response format than closed-ended rating scales that dominate social science. However, previous language analysis techniques left a gap between how accurately they converged with standard rating scales and how well ratings scales converge with themselves - a theoretical upper-limit in accuracy. Most recently, AI-based language analysis has gone through a transformation as nearly all of its applications, from Web search to personalized assistants (e.g., Alexa and Siri), have shown unprecedented improvement by using transformers. We evaluate transformers for estimating psychological well-being from questionnaire text- and descriptive word-responses, and find accuracies converging with rating scales that approach the theoretical upper limits (Pearson r = 0.85, p < 0.001, N = 608; in line with most metrics of rating scale reliability). These findings suggest an avenue for modernizing the ubiquitous questionnaire and ultimately opening doors to a greater understanding of the human condition.Entities:
Mesh:
Year: 2022 PMID: 35273198 PMCID: PMC8913644 DOI: 10.1038/s41598-022-07520-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
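The evaluation reported in the tables below (regressing text-derived embeddings onto rating-scale scores and measuring convergence with Pearson r) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the embeddings are synthetic stand-ins for BERT contextualized embeddings, the ridge penalty is arbitrary, and the paper would additionally use held-out data (e.g., cross-validation) rather than in-sample fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for BERT contextualized embeddings of 608 responses
# (the paper derives these from a transformer; here they are random).
n, d = 608, 64
X = rng.normal(size=(n, d))

# Synthetic stand-in for rating-scale scores (e.g., HILS) with a linear
# signal plus noise, so the regression has something to recover.
true_w = rng.normal(size=d)
y = X @ true_w + rng.normal(scale=0.5, size=n)

# Ridge regression, closed form: w = (X'X + lambda*I)^-1 X'y
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
y_hat = X @ w

# Convergence between text-based estimates and rating-scale scores
r = np.corrcoef(y, y_hat)[0, 1]
print(f"Pearson r = {r:.3f}")
```

In the paper, r for the predicted scores is then compared against the scale's own reliability estimates (inter-item and corrected item-total correlation averages), which serve as the theoretical upper limit.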
Comparing Pearson Correlations based on All Responses Combined and Analyzed with Contextualized Word Embeddings to the Reliability of the Rating Scales.
| Model | HILS | SWLS |
|---|---|---|
| BERT contextualized word embeddings from word- and text-responses of HIL and SWL | 0.85↑*** | 0.80↑** |
| Inter-item Pearson correlation average | 0.76 | 0.73 |
| Corrected item-total Pearson correlation average | 0.84 | 0.82 |
Italic values indicate results from other articles/datasets.
All correlations were significant at p < 0.001. N = 608.
HIL = Harmony in life; SWL = Satisfaction with life; S = Scale.
↑ = significantly higher than the inter-item correlation average; *** = p < 0.001, ** = p < 0.01.
Comparison of Pearson Correlations Using Contextualized versus Decontextualized Word Embeddings for Individual Word- and Text-Responses.
| Context | Word Embeddings | Text-response HIL (HILS) | Text-response SWL (SWLS) | Word-response HIL (HILS) | Word-response SWL (SWLS) |
|---|---|---|---|---|---|
| Context | BERT | 0.74 | 0.74 | 0.79 | 0.75 |
| No context | BERT 1 word/doc | 0.54↓ | 0.59↓ | 0.78 | 0.75 |
| No context | Latent Semantic Analysis | 0.47↓ | 0.46↓ | 0.75↓ | 0.72 |
All correlations were significant at p < 0.001. N = 608.
HIL = Harmony in life; SWL = Satisfaction with life; S = Scale.
Latent Semantic Analysis based on the Google 5-gram corpus, 512 dimensions; the number of dimensions was optimized as described in [10] (i.e., based on the previous state of the art).
↓ = significantly smaller than BERT. See Table SM3 for more comparisons.
The Construct Specific Validity of Language Models Using Individual Word- and Text-Responses.
| Language Model | Predicts | Harmony in life: Words | Harmony in life: Text | Satisfaction with life: Words | Satisfaction with life: Text |
|---|---|---|---|---|---|
| BERT | HILS | 0.79↑ | 0.74↑ | 0.75 | 0.71 |
| BERT | SWLS | 0.66 | 0.61 | 0.75 | 0.74 |
N = 608. HIL = Harmony in life; SWL = Satisfaction with life; S = Scale.
↑ = significantly higher than the SWLS prediction.
The Discriminant Validity of Language Models: Significantly Predicting the Harmony in life scale Minus the Satisfaction with life scale.
| Language Model | Responses | Predicted HILS correlated with predicted SWLS | Accuracy (r) of predicted HILS minus SWLS |
|---|---|---|---|
| BERT | All | 0.96 | 0.34 |
| BERT | Words | 0.97 | 0.25 |
| BERT | Text | 0.96 | 0.27 |
N = 608. All correlations (Pearson r) were significant at p < 0.001.
Accuracy (r) of predicted HILS minus SWLS = the accuracy of predicting the difference score of the normalized HILS minus the normalized SWLS, where each scale was normalized by subtracting its column mean from each score.
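The difference-score target described above (normalized HILS minus normalized SWLS) can be computed directly; the scale scores below are made-up illustrative values, not data from the study.

```python
import numpy as np

# Hypothetical rating-scale totals for five participants (illustrative only)
hils = np.array([20.0, 24.0, 18.0, 30.0, 26.0])
swls = np.array([22.0, 21.0, 19.0, 28.0, 25.0])

# Normalize each scale by subtracting its column mean, then take the difference;
# this removes the overall level of each scale so only relative standing remains.
diff = (hils - hils.mean()) - (swls - swls.mean())
print(diff)
```

Because each column is mean-centered before subtraction, the difference scores themselves sum to zero across participants; the model's accuracy is then the Pearson r between predicted and observed difference scores.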
Word- versus Text-Responses: Accuracy (r), Diversity Index, and Mean (SD) number of words.
| Language Response | r, HILS prediction | r, SWLS prediction | Diversity Index of Words Input | Mean (SD) of N words |
|---|---|---|---|---|
| HIL + SWL Words | 0.83 | 0.77 | 874.5 | 19.71 (1.42) |
| HIL + SWL Text | 0.79 | 0.75 | 409.4 | 145.0 (74.8) |
| HIL words + Text | 0.82 | NA | 543.7 | 79.2 (38.4) |
| SWL words + Text | NA | 0.80 | 518.4 | 85.5 (46.0) |
| HIL Words | 0.79 | 0.66 | 807.0 | 9.8 (1.0) |
| HIL Text | 0.74 | 0.61 | 380.1 | 69.4 (38.4) |
| SWL Words | 0.75 | 0.75 | 653.0 | 9.9 (0.73) |
| SWL Text | 0.71 | 0.74 | 379.4 | 75.6 (45.9) |
All correlations were significant at p < 0.001. N = 608. BERT large using the second last layer (L23). HIL = Harmony in life; SWL = Satisfaction with life; S = Scale.
The Diversity Index of Words Input is 2^entropy, which indicates how many different "types" (i.e., distinct categories) could theoretically be accounted for by the data.
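The 2^entropy diversity index can be computed from the word-type frequency distribution; a minimal sketch (the example word list is illustrative, not from the study):

```python
import math
from collections import Counter

def diversity_index(tokens):
    """Return 2**H, where H is the Shannon entropy (in bits) of the
    token-type distribution. For k equally frequent types this equals k,
    so the index reads as an 'effective number of types'."""
    counts = Counter(tokens)
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2 ** h

# Four equally frequent word types -> diversity index of 4.0
print(diversity_index(["calm", "happy", "free", "whole"]))
```

Skewed frequencies lower the index below the raw type count, which is why text-responses (many repeated function words) show lower diversity than descriptive word-responses in the table above.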