Julia M Rohrer, Martin Brümmer, Stefan C Schmukle, Jan Goebel, Gert G Wagner.
Abstract
Open-ended questions have routinely been included in large-scale survey and panel studies, yet there is some perplexity about how to actually incorporate the answers to such questions into quantitative social science research. Tools developed recently in the domain of natural language processing offer a wide range of options for the automated analysis of such textual data, but their implementation has lagged behind. In this study, we demonstrate straightforward procedures that can be applied to process and analyze textual data for the purposes of quantitative social science research. Using more than 35,000 textual answers to the question "What else are you worried about?" from participants of the German Socio-economic Panel Study (SOEP), we (1) analyzed characteristics of respondents that determined whether they answered the open-ended question, (2) used the textual data to detect relevant topics that were reported by the respondents, and (3) linked the features of the respondents to the worries they reported in their textual data. The potential uses as well as the limitations of the automated analysis of textual data are discussed.
Year: 2017 PMID: 28759628 PMCID: PMC5536367 DOI: 10.1371/journal.pone.0182156
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of the steps of automated text analysis.
Fig 2. Word cloud of “raw” texts, tokenized but not otherwise processed.
Fig 3. Word cloud of free texts after data pre-processing.
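
The difference between the "raw" and the pre-processed word clouds can be illustrated with a minimal sketch. The record does not list the pre-processing steps that were actually applied, so the steps below (lowercasing, tokenizing, stop-word removal) and the tiny English stop-word list are assumptions made purely for illustration; the original answers are in German and would need a German stop-word list.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "and", "in", "my", "for", "about", "is", "a"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize the text and drop stop words."""
    return [tok for tok in tokenize(text) if tok not in stop_words]

def word_frequencies(answers, stop_words=STOP_WORDS):
    """Count word frequencies across all answers after pre-processing."""
    counts = Counter()
    for answer in answers:
        counts.update(preprocess(answer, stop_words))
    return counts

# Made-up example answers (not taken from the SOEP data).
answers = [
    "The future of my children",
    "Unemployment and the future of the children",
]

# "Raw" counts: tokenized but not otherwise processed (compare Fig 2).
raw_counts = Counter(tok for answer in answers for tok in tokenize(answer))
# Counts after pre-processing (compare Fig 3).
clean_counts = word_frequencies(answers)
```

In the raw counts, function words such as "the" dominate; after pre-processing, only content words such as "children" and "future" remain, which is what makes a word cloud of the processed texts more informative.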
Results of binary logistic multilevel regressions predicting responses to the open-ended question, including 222,165 records from 25,978 individuals.
| Trait (unit of change) | Model 1: Odds Ratio | Model 1: p | Model 2: Odds Ratio | Model 2: p | Model 3: Odds Ratio | Model 3: p |
|---|---|---|---|---|---|---|
| Male | 0.87 | < .001 | 0.87 | < .001 | 0.97 | .420 |
| Between | 1.01 | < .001 | 1.01 | < .001 | 1.02 | < .001 |
| Within | 0.98 | < .001 | 0.98 | < .001 | 0.98 | < .001 |
| East | 1.36 | < .001 | 1.23 | < .001 | 1.24 | < .001 |
| Middle secondary | 1.43 | < .001 | 1.50 | < .001 | 1.39 | < .001 |
| Intermediate higher secondary | 1.99 | < .001 | 2.15 | < .001 | 1.89 | < .001 |
| Higher secondary | 2.24 | < .001 | 2.52 | < .001 | 2.20 | < .001 |
| Other | 1.32 | .001 | 1.35 | < .001 | 1.35 | .001 |
| None | 0.96 | .760 | 0.90 | .430 | 1.01 | .959 |
| Not yet | 1.71 | < .001 | 1.89 | < .001 | 1.77 | < .001 |
| Direct | 0.79 | .001 | 0.77 | < .001 | 0.78 | < .001 |
| Indirect | 1.13 | .060 | 1.12 | .070 | 1.11 | .090 |
| Written | 0.90 | < .001 | 0.86 | < .001 | 0.86 | < .001 |
| Mixed | 0.86 | .001 | 0.83 | < .001 | 0.82 | < .001 |
| Between | | | 0.85 | < .001 | 0.84 | < .001 |
| Within | | | 0.91 | < .001 | 0.91 | < .001 |
| Emotional Stability | | | | | 0.88 | < .001 |
| Extraversion | | | | | 1.07 | < .001 |
| Agreeableness | | | | | 0.98 | .286 |
| Conscientiousness | | | | | 1.01 | .585 |
| Openness | | | | | 1.33 | < .001 |
| Fit (against Null model) | 670.82 | < .001 | 1083.72 | < .001 | 1613.47 | < .001 |
| Comparison with Preceding Model | | | 412.90 | < .001 | 529.75 | < .001 |
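
As a reading aid for the table, the following sketch converts an odds ratio into a percentage change in the odds and checks how the model-comparison statistics relate to the fit statistics. The helper names are hypothetical. That the comparison values equal the difference between consecutive fit values is an observation from the table's numbers, not a statement about how the authors computed them.

```python
import math

def odds_ratio_to_percent_change(odds_ratio):
    """Percentage change in the odds per unit of the predictor."""
    return (odds_ratio - 1.0) * 100.0

def odds_ratio_to_log_odds(odds_ratio):
    """Logistic regression coefficient (log-odds) implied by an odds ratio."""
    return math.log(odds_ratio)

# Odds ratios from Model 3 in the table.
openness_change = odds_ratio_to_percent_change(1.33)   # roughly +33%
stability_change = odds_ratio_to_percent_change(0.88)  # roughly -12%

# Fit statistics (against the null model) from the table.
fit_vs_null = {"Model 1": 670.82, "Model 2": 1083.72, "Model 3": 1613.47}

# Differences between consecutive models; these reproduce the
# "Comparison with Preceding Model" row (412.90 and 529.75).
step_1_to_2 = round(fit_vs_null["Model 2"] - fit_vs_null["Model 1"], 2)
step_2_to_3 = round(fit_vs_null["Model 3"] - fit_vs_null["Model 2"], 2)
```

Read this way, a one-unit increase in Openness goes along with roughly 33% higher odds of answering the open-ended question, while a one-unit increase in Emotional Stability goes along with roughly 12% lower odds, holding the other predictors in Model 3 constant.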
Fig 4. Log likelihood of LDA models, depending on the number of topics chosen.
Fig 5. Four of the 15 topics derived through LDA topic modeling.
Labels of the topics derived via LDA topic modeling, proportion of texts in which the topic occurred, and the most relevant terms.
| Topic | Topic name | Occurrence | Most relevant terms |
|---|---|---|---|
| 1 | Future of children | 19.8% | children, future, occupational, youth, development, worries, grandchild, apprenticeship, son, family |
| 2 | Children, youth, school | 6.0% | children, youth, school, education, apprenticeship, violence, bad, upbringing, Germany, educational policy |
| 3 | Health of family | 12.4% | Health, family, man/husband, worries, children, woman/wife, son, parents, mother, daughter |
| 4 | Rising prices | 6.1% | rising, high, pension, Euro, prices, expensive, costs, unemployment, health reform, taxes |
| 5 | Rich and poor | 4.0% | social, rich, poor, unjust, Germany, divide, reduction, justice, peace, gap |
| 6 | Foreigners in Germany | 4.0% | foreigners, Germany, German, politics, land, criminality, law, immigration, judiciary, Hartz |
| 7 | Unemployment | 6.8% | Unemployment, high, youth, young, people, Germany, east, people, jobs, emigration |
| 8 | Finding employment | 6.8% | work, find, receive, job, children, apprenticeship position, apprenticeship place, son, apprenticeship, studies |
| 9 | Pension and financial security | 6.5% | pension, money, live, old, work, state, receive, secure, expensive, foreigners |
| 10 | Development of Germany | 11.6% | Germany, development, politics, general, unemployment, economy, situation, state, social, educational policy |
| 11 | Interpersonal dealings | 7.8% | people, egoism, society, increasing, indifference, contact, general, social, decline in values, together |
| 12 | Moral decay | 4.5% | society, moral, values, decay, politics, loss, media, decline in values, people, general |
| 13 | Politics: corruption and inability | 9.6% | politics, corruption, inability, economy, political party, factitiousness, state, manager, Germany, government |
| 14 | Politics | 3.9% | Politics, government, political party, inability, SPD, problem, Germany, CDU, current, election |
| 15 | War and terrorism | 6.8% | war, USA, world, Iraq, terrorism, Bush, BSE, Islam, America, bird flu |
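
The occurrence percentages in the table add up to more than 100%, which is consistent with a single text being able to contain several topics. The sketch below checks the sum and shows one way such occurrence rates could be derived from per-document topic weights. The function `topic_occurrence`, the threshold of 0.3, and the example weights are hypothetical; the record does not state how occurrence was operationalized.

```python
# Occurrence percentages copied from the table (topic number -> percent).
occurrence = {
    1: 19.8, 2: 6.0, 3: 12.4, 4: 6.1, 5: 4.0,
    6: 4.0, 7: 6.8, 8: 6.8, 9: 6.5, 10: 11.6,
    11: 7.8, 12: 4.5, 13: 9.6, 14: 3.9, 15: 6.8,
}
total_percent = round(sum(occurrence.values()), 1)  # exceeds 100

def topic_occurrence(doc_topic_weights, threshold):
    """Share of documents in which each topic's weight reaches the threshold.

    A document can count toward several topics, so the shares need not
    sum to 1.
    """
    n_docs = len(doc_topic_weights)
    n_topics = len(doc_topic_weights[0])
    return [
        sum(1 for doc in doc_topic_weights if doc[k] >= threshold) / n_docs
        for k in range(n_topics)
    ]

# Made-up topic weights for three documents and three topics.
example_weights = [
    [0.7, 0.3, 0.0],
    [0.5, 0.5, 0.0],
    [0.1, 0.1, 0.8],
]
shares = topic_occurrence(example_weights, threshold=0.3)
```

In the made-up example, the first two documents each count toward two topics, so the shares sum to more than 1, mirroring the pattern in the table.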
Fig 6. Variabilities (Coffey-Feingold-Bromberg measure) of the occurrence of topics across the years.
Fig 7. Time course of the two topics with the highest variabilities across survey years, Topic 15 (War and terrorism) and Topic 4 (Rising prices).
Fig 8. Relationships between reports of worries in closed-ended questions regarding various subjects (Panels A to I) and topic occurrence in free texts (Topics 1 to 15).
Topics with the highest and lowest relative risk are labeled for each item.
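
For readers unfamiliar with relative risk, a minimal sketch of the standard definition follows: the probability of an outcome in one group divided by its probability in a comparison group. Applied to Fig 8, the outcome would be the occurrence of a topic in the free text and the groups would be defined by the answer to a closed-ended worry item; that mapping, the function name, and the counts are illustrative assumptions, not values from the study.

```python
def relative_risk(group_a_with, group_a_total, group_b_with, group_b_total):
    """Risk of the outcome in group A divided by the risk in group B."""
    risk_a = group_a_with / group_a_total
    risk_b = group_b_with / group_b_total
    return risk_a / risk_b

# Made-up counts: among 200 respondents who report a worry in a closed-ended
# item, 40 mention the matching topic in their free text; among 1000
# respondents who do not report the worry, 100 mention it.
rr = relative_risk(40, 200, 100, 1000)
```

A relative risk above 1 means the topic occurs more often among respondents who report the worry in the closed-ended item; a value below 1 means it occurs less often.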
Fig 9. Results of word-level correlational analyses linking the use of single words with features of the respondents within the subsample who answered the open-ended question.
Size reflects the frequency of the word across all answers; horizontal position and color reflect both the strength and direction of the associations; all displayed words are significant at p < .05 (Bonferroni-corrected). (A) Results of linear regressions predicting age. (B) Results of binary-logistic regressions predicting gender. (C) Results of binary-logistic regressions predicting sample region. (D) Results of ordered logistic regressions predicting education.
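
Because one regression is run per word, the captions report Bonferroni-corrected significance. A minimal sketch of that correction: with m tests, each p-value is compared against alpha / m rather than alpha. The words and p-values below are made up for illustration and are not results from the study.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant if p < alpha / number_of_tests."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}

# Hypothetical p-values for four word-level regressions.
p_values = {"children": 0.0001, "pension": 0.012, "war": 0.03, "taxes": 0.2}

# With four tests the corrected threshold is 0.05 / 4 = 0.0125.
significant = bonferroni_significant(p_values)
displayed_words = sorted(word for word, keep in significant.items() if keep)
```

In this example "war" would pass an uncorrected threshold of .05 but not the corrected one, so it would not be displayed in a figure that shows only Bonferroni-significant words.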
Fig 10. Results of word-level correlational analyses linking the use of single words with life satisfaction (A) in the subsample that answered the open-ended question and (B) relative to the full sample. Size reflects the frequency of the word across all answers; horizontal position and color reflect both the strength and direction of the associations; all displayed words are significant at p < .05 (Bonferroni-corrected).
Fig 11. Results of correlational analyses linking features of the respondent and topic occurrence in the textual answer within the subsample that answered the open-ended question.
Topics with the highest and lowest coefficients are labeled for each feature; red bars indicate significant results at p < .05 (Bonferroni-corrected). (A) Results of linear regressions predicting age. (B) Results of binary-logistic regressions predicting gender. (C) Results of binary-logistic regressions predicting sample region. (D) Results of ordered logistic regressions predicting education.
Fig 12. Results of correlational analyses linking the personality of the respondent and topic occurrence in the textual answer.
Topics with the highest and lowest coefficients are labeled. Red bars indicate significant results at p < .05 (Bonferroni-corrected). (A) Results of the linear regressions predicting life satisfaction in the subsample that answered the open-ended question. (B) Results of the linear regressions predicting life satisfaction in the full sample, including respondents who did not answer the open-ended question. (C)-(G) Results of the linear regressions predicting the Big Five personality traits in the subsample that answered the open-ended question.