Johan Bollen¹, Marijn Ten Thij²,³, Fritz Breithaupt⁴, Alexander T J Barron², Lauren A Rutter⁵, Lorenzo Lorenzo-Luaces⁵, Marten Scheffer⁶
1. Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408; jbollen@indiana.edu.
2. Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408.
3. Delft Institute of Applied Mathematics, Delft University of Technology, 2628 CD Delft, The Netherlands.
4. Department of Germanic Studies, Indiana University, Bloomington, IN 47405.
5. Psychological and Brain Sciences, Indiana University, Bloomington, IN 47405.
6. Department of Environmental Sciences, Wageningen University, 6708 PB Wageningen, The Netherlands.
In their critique, Schmidt et al. (1) claim that our analysis of book language (2) cannot meaningfully reflect society. Their arguments bear no relevance to our paper. The statement that “words in books are not clinical interviews, and word frequencies are not psychiatric assessments” is irrelevant because we make no attempt at clinical diagnosis. Their observation that “Derrida” is a more frequent book word than “The Beatles” is also a red herring: We do not compare between words, but instead follow the dynamics of phrases over time. Lastly, our tracking of cognitive distortions (CDS) markers is not an attempt to identify “negative thoughts” but rather to detect markers of language involved in the expression of distorted thinking.
Furthermore, Schmidt et al. (1) claim that our results are explained by a composition shift of the Google Books data toward more fiction since 2000. We have to disagree. We made our observations relative to a null model that specifically controls for such changes in corpus composition and other recency effects, and reported a robust signal well above that baseline (2).
Schmidt et al. (ref. 1, figure 1) perform a linear regression analysis that shows a correlation between a word’s relative frequency in fiction and its rise in prevalence. Because our CDS n-grams (2, 3) are about 43% more prevalent in fiction than in English overall, a shift toward more fiction does increase CDS n-gram prevalence. However, our analyses indicate that the observed rise of fiction in the data would only cause CDS prevalence to increase 16% from 1980 to 2019, which is much less than the magnitude of the observed shift and is already accounted for by our null model.
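To make the composition argument concrete, the following is a minimal sketch of how a pure composition effect could be estimated. It assumes a simple two-genre mixture in which prevalence within fiction and within non-fiction stays constant and only the fiction share changes. The function name `composition_effect`, its parameters, and the numbers in the usage line are hypothetical illustrations; they are not the Google Books values and not the procedure used to obtain the 16% figure above.

```python
def composition_effect(share_start: float, share_end: float, ratio: float) -> float:
    """Relative change in overall prevalence caused only by a changing fiction share.

    share_start, share_end: fraction of the corpus that is fiction (0..1).
    ratio: prevalence in fiction divided by prevalence in non-fiction
           (a hypothetical input, assumed constant over time).
    """
    for share in (share_start, share_end):
        if not 0.0 <= share <= 1.0:
            raise ValueError("fiction share must lie between 0 and 1")
    if ratio <= 0:
        raise ValueError("ratio must be positive")

    def overall(share: float) -> float:
        # Overall prevalence expressed in units of the non-fiction prevalence:
        # a weighted average of non-fiction (weight 1 - share) and fiction (weight share).
        return (1.0 - share) + share * ratio

    return overall(share_end) / overall(share_start) - 1.0


# Illustrative call with made-up shares and ratio (not measured values).
change = composition_effect(share_start=0.10, share_end=0.30, ratio=1.5)
print(f"composition-only change: {change:.1%}")
```

Under these assumptions, any observed increase larger than the value returned for the actual fiction shares cannot be attributed to composition alone.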
Fig. 1. (Left) Original results published in Bollen et al. (2). (Right) The same analysis with n-gram counts in the Fiction corpus subtracted from the English corpus. The comparison reveals that the original results are robust against the removal of Fiction, and thus cannot be explained by the growth of Fiction in the Google Books sample.
Schmidt et al. (ref. 1, figure 2) make another inferential error when they draw conclusions from a correspondence between the prevalence of CDS n-grams and the sum of the log prevalence of their constituent words. These observations are not only compatible with our results but predicted by them: Changes in n-gram prevalence should match those of their constituent words. One cannot write “completely bad” without “completely” and “bad.” Furthermore, both terms individually mark similar cognitive distortion types, and will thus follow a similar trajectory.
Instead of such indirect inferences or speculations, there is a more direct way to test whether our results are caused by a rise of fiction in the database. We remove the entire Fiction corpus from English by subtracting Fiction n-gram counts from those in the English corpus. This analysis (Fig. 1) shows that the dynamics of CDS markers hardly differ from our original results. Along with the null model, this confirms that our results are unlikely to be driven by the growth of fiction in the Google Books sample.
Overly harsh critiques of the emerging field of culturomics carry the risk of throwing the baby out with the bathwater. The millions of books produced over the past centuries are not unbiased reflections of natural language. Yet, they are not uncoupled from social, cultural, and psycholinguistic changes (4–8). This implies a treasure trove of information when interpreted with care.
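The Fiction-removal check described above (subtracting Fiction n-gram counts from the English corpus) could be sketched roughly as follows. This is only an illustration under stated assumptions: counts are held as plain per-year dictionaries, the helper names `remove_fiction` and `prevalence` are hypothetical, and the toy numbers are made up. It does not reproduce the null model or the rest of the published analysis.

```python
from typing import Dict


def remove_fiction(english: Dict[int, int], fiction: Dict[int, int]) -> Dict[int, int]:
    """Per-year counts in the English corpus with Fiction counts subtracted."""
    remainder = {}
    for year, count in english.items():
        # Years missing from the Fiction table are treated as zero Fiction counts.
        value = count - fiction.get(year, 0)
        if value < 0:
            raise ValueError(f"Fiction count exceeds English count in {year}")
        remainder[year] = value
    return remainder


def prevalence(ngram_counts: Dict[int, int], total_counts: Dict[int, int]) -> Dict[int, float]:
    """Per-year relative frequency of an n-gram; years with no total are skipped."""
    return {
        year: ngram_counts[year] / total_counts[year]
        for year in ngram_counts
        if total_counts.get(year, 0) > 0
    }


# Toy, made-up counts for a single n-gram and for the corpus totals.
ngram_english = {2000: 10, 2001: 12}
ngram_fiction = {2000: 4, 2001: 5}
total_english = {2000: 1000, 2001: 1100}
total_fiction = {2000: 400, 2001: 450}

ngram_nonfiction = remove_fiction(ngram_english, ngram_fiction)
total_nonfiction = remove_fiction(total_english, total_fiction)
print(prevalence(ngram_nonfiction, total_nonfiction))
```

Both the n-gram counts and the corpus totals are reduced by the Fiction counts, so the resulting prevalence describes only the non-fiction remainder of the English corpus.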
Authors: Jean-Baptiste Michel; Yuan Kui Shen; Aviva Presser Aiden; Adrian Veres; Matthew K Gray; Joseph P Pickett; Dale Hoiberg; Dan Clancy; Peter Norvig; Jon Orwant; Steven Pinker; Martin A Nowak; Erez Lieberman Aiden Journal: Science Date: 2010-12-16 Impact factor: 47.728
Authors: Peter Sheridan Dodds; Eric M Clark; Suma Desu; Morgan R Frank; Andrew J Reagan; Jake Ryland Williams; Lewis Mitchell; Kameron Decker Harris; Isabel M Kloumann; James P Bagrow; Karine Megerdoomian; Matthew T McMahon; Brian F Tivnan; Christopher M Danforth Journal: Proc Natl Acad Sci U S A Date: 2015-02-09 Impact factor: 11.205
Authors: Johan Bollen; Marijn Ten Thij; Fritz Breithaupt; Alexander T J Barron; Lauren A Rutter; Lorenzo Lorenzo-Luaces; Marten Scheffer Journal: Proc Natl Acad Sci U S A Date: 2021-07-27 Impact factor: 11.205