Olivier Toubia, Jonah Berger, Jehoshua Eliashberg.
Abstract
Narratives, and other forms of discourse, are powerful vehicles for informing, entertaining, and making sense of the world. But while everyday language often describes discourse as moving quickly or slowly, covering a lot of ground, or going in circles, little work has actually quantified such movements or examined whether they are beneficial. To fill this gap, we use several state-of-the-art natural language-processing and machine-learning techniques to represent texts as sequences of points in a latent, high-dimensional semantic space. We construct a simple set of measures to quantify features of this semantic path, apply them to thousands of texts from a variety of domains (i.e., movies, TV shows, and academic papers), and examine whether and how they are linked to success (e.g., the number of citations a paper receives). Our results highlight some important cross-domain differences and provide a general framework that can be applied to study many types of discourse. The findings shed light on why things become popular and how natural language processing can provide insight into cultural success.
Keywords: cultural analytics; cultural success; discourse; natural language processing
Year: 2021 PMID: 34172568 PMCID: PMC8256009 DOI: 10.1073/pnas.2011695118
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205
Fig. 1. Stylized illustration of the measures. Note that higher speed means more distance was covered in the same number of periods. Higher volume means that more ground was covered in the same number of periods. Higher circuitousness means that a less direct route was taken between a set of points.
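The speed and circuitousness measures described in the caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the points are 2-D toy coordinates rather than positions in a high-dimensional semantic space, and circuitousness is assumed here to be the ratio of the distance actually traveled to the shortest route that starts at the first point, ends at the last, and visits every point in between (found by brute force, so only practical for a handful of points). Volume is left out because this excerpt does not say how it is computed.

```python
import math
from itertools import permutations


def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def path_length(points):
    """Total distance covered when visiting the points in the given order."""
    return sum(dist(p, q) for p, q in zip(points, points[1:]))


def average_speed(points):
    """Distance covered per period (one period = one step between points)."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    return path_length(points) / (len(points) - 1)


def circuitousness(points):
    """Actual path length divided by the shortest first-to-last route
    through all points (assumed definition; brute-force search)."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    start, end, middle = points[0], points[-1], points[1:-1]
    shortest = min(
        path_length([start, *order, end]) for order in permutations(middle)
    )
    return path_length(points) / shortest


# Toy paths over the same four points, visited in different orders.
direct = [(0, 0), (1, 0), (2, 0), (3, 0)]
zigzag = [(0, 0), (2, 0), (1, 0), (3, 0)]

print(average_speed(direct), circuitousness(direct))
print(average_speed(zigzag), circuitousness(zigzag))
```

Both toy paths visit the same points, but the second doubles back on itself, so it covers more distance per period and takes a less direct route. A value of 1 for circuitousness means the text took the most direct route available under the assumed definition.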
Link between semantic progression and success
| | Movies | TV show episodes | Academic papers |
| Average speed | 0.048* | 0.072* | −0.125* |
| Normalized volume | 0 | −0.082* | 0.095* |
| Circuitousness | 0 | 0.006 | 0.070* |
| Controls | | | |
| Year fixed effects | Yes | Yes | |
| Genre fixed effects | Yes | Yes | |
| Movie duration | Yes | | |
| TV channels fixed effects | | Yes | |
| Journal fixed effects | | | Yes |
| No. of pages | | | Yes |
| Log(words in document) | Yes | Yes | Yes |
| Log(sentences in document) | Yes | Yes | Yes |
| Topic intensities | Yes | Yes | Yes |
| No. of parameters | 169 | 148 | 158 |
| No. of observations | 4,118 | 12,336 | 29,300 |
| Mean-squared error | 0.711 | 0.793 | 1.066 |
| R² | 0.306 | 0.326 | 0.364 |
Note that all independent variables for which coefficients are reported are standardized. The dependent variable is not standardized. Parameters are estimated using a lasso regression. Confidence intervals are obtained via bootstrapping. *The 95% confidence interval does not include 0. Dependent variable is IMDB ratings for movies and TV show episodes and log(1 + citations) for academic papers.
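The estimation procedure in the note (standardized predictors, an unstandardized dependent variable, a lasso fit, bootstrapped confidence intervals) can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the data are synthetic, the function names are hypothetical, the penalty `lam` is arbitrary, and only a single predictor is used, for which the lasso solution reduces to soft-thresholding the covariance between the standardized predictor and the centered outcome. The paper's actual models include many controls and are not reproduced here.

```python
import random
import statistics


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_one_predictor(x, y, lam):
    """Lasso slope for one predictor, minimizing
    (1/2n) * sum((y - b0 - b * x_std)^2) + lam * |b|,
    where x_std is x standardized and y is centered but not rescaled."""
    n = len(x)
    mean_x, sd_x = statistics.fmean(x), statistics.pstdev(x)
    mean_y = statistics.fmean(y)
    cov = sum(((xi - mean_x) / sd_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    return soft_threshold(cov, lam)


def bootstrap_ci(x, y, lam, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval: refit on resampled (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            lasso_one_predictor([x[i] for i in idx], [y[i] for i in idx], lam)
        )
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * n_boot)]
    upper = estimates[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Synthetic data (assumption): the outcome depends positively on x.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(200)]
y = [0.5 * xi + data_rng.gauss(0, 0.2) for xi in x]

estimate = lasso_one_predictor(x, y, lam=0.05)
lower, upper = bootstrap_ci(x, y, lam=0.05)
print(estimate, (lower, upper))
```

Because the predictor is standardized, the coefficient reads as the change in the dependent variable per one standard deviation of the predictor. The soft-thresholding step also shows how a lasso fit can return a coefficient of exactly 0, which may be why some entries in the table are reported as 0.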