Nadja Younes, Ulf-Dietrich Reips.
Abstract
The Google Books Ngram Viewer (Google Ngram) is a search engine that charts word frequencies from a large corpus of books and thereby allows for the examination of cultural change as it is reflected in books. While the tool's massive corpus of data (about 8 million books or 6% of all books ever published) has been used in various scientific studies, concerns about the accuracy of results have simultaneously emerged. This paper reviews the literature and serves as a guideline for improving Google Ngram studies by suggesting five methodological procedures suited to increase the reliability of results. In particular, we recommend the use of (I) different language corpora, (II) cross-checks on different corpora from the same language, (III) word inflections, (IV) synonyms, and (V) a standardization procedure that accounts for both the influx of data and unequal weights of word frequencies. Further, we outline how to combine these procedures and address the risk of potential biases arising from censorship and propaganda. As an example of the proposed procedures, we examine the cross-cultural expression of religion via religious terms for the years 1900 to 2000. Special emphasis is placed on the situation during World War II. In line with the strand of literature that emphasizes the decline of collectivistic values, our results suggest an overall decrease in religion's importance. However, religion regains importance during times of crisis such as World War II. By comparing the results obtained through the different methods, we illustrate that applying and, in particular, combining our suggested procedures increases the reliability of results and prevents authors from drawing wrong conclusions.
Year: 2019 PMID: 30901329 PMCID: PMC6430395 DOI: 10.1371/journal.pone.0213554
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
List of religious English terms with their German and Italian translations.
| Original | German | Italian |
|---|---|---|
| altar | Altar | altare |
| angel | Engel | angelo |
| belief | Glaube | fede |
| clergy | Geistlichkeit | clero |
| creed | Überzeugung | credo |
| doctrine | Doktrin | dottrina |
| God | Gott | Dio |
| heaven | Himmel | paradiso |
| miracle | Wunder | miracolo |
| pilgrimage | Pilgerfahrt | pellegrinaggio |
| prayer | Gebet | preghiera |
| prophet | Prophet | profeta |
| religion | Religion | religione |
| revelation | Offenbarung | rivelazione |
| ritual | Ritual | rituale |
| saint | Heiliger | santo |
| sermon | Predigt | predica |
| shrine | Schrein | santuario |
| soul | Seele | anima |
| spirit | Geist | spirito |
Fig 1. Higher frequency inflections for the German word “eigen”.
Fig 2. Frequencies of inflections for the word “saint”.
Frequencies of given inflections for the word “saint” using the American English corpus (A) and the three most frequent inflections for the German translation “Heiliger” (B).
Overview of original words and their higher frequency inflections.
| American English | | | British English | | |
|---|---|---|---|---|---|
| Original | High | Ratio | Original | High | Ratio |
| angel | angels | 1.08 | angel | angels | 1.12 |
| saint | saints | 1.18 | | | |
| German | | | Italian | | |
| Original | High | Ratio | Original | High | Ratio |
| Engel | Engels | 2.28 | angelo | angeli | 1.52 |
| Glaube | Glauben | 1.91 | | | |
| Prophet | Propheten | 2.54 | | | |
| Heiliger | Heiligen | 13.10 | | | |
The columns “Original” present original words, whereas the columns “High” present the original words’ higher frequency inflections. The columns “Ratio” display the average yearly ratios between the frequencies of higher frequency inflections and the frequencies of their original counterparts.
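The “Ratio” column described above can be illustrated with a minimal sketch. It assumes that yearly frequencies for a word and its inflection are available as plain year-to-frequency mappings; the function name `average_yearly_ratio` and the numbers are hypothetical and are not Google Ngram data.

```python
from statistics import mean


def average_yearly_ratio(high, original):
    """Average, over years, of high[year] / original[year].

    Years missing from either series, or with a zero original
    frequency, are skipped (an assumption of this sketch).
    """
    ratios = [
        high[year] / original[year]
        for year in sorted(original)
        if year in high and original[year] > 0
    ]
    if not ratios:
        raise ValueError("no overlapping years with non-zero original frequency")
    return mean(ratios)


# Made-up illustrative frequencies (e.g. occurrences per million words).
original = {1900: 20.0, 1901: 25.0, 1902: 40.0}
high = {1900: 30.0, 1901: 25.0, 1902: 20.0}

ratio = average_yearly_ratio(high, original)
print(ratio)
```

Note that averaging the yearly ratios is not the same as dividing the summed frequencies: for the made-up values here the yearly ratios average to 1.0, whereas the ratio of the totals (75 / 85) is about 0.88.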
Fig 3. Frequencies for most frequent words.
Frequencies for the most frequent words in the American English, British English, German, and Italian Google Ngram corpora.
Correlation coefficients for different standardization procedures.
| | (I) | (II) | (III) | (IV) | (V) | (VI) | (VII) |
|---|---|---|---|---|---|---|---|
| | Original | Σ Raw | Σ z-scores | Σ z-scores of | Σ z-scores of | Σ z-scores | |
| American English | r = -0.89, p<0.001 | r = -0.88, p<0.001 | r = -0.71, p<0.001 | r = -0.77, p<0.001 | r = -0.70, p<0.001 | r = -0.76, p<0.001 | r = -0.49, p<0.001 |
| British English | r = -0.96, p<0.001 | r = -0.94, p<0.001 | r = -0.95, p<0.001 | r = -0.96, p<0.001 | r = -0.91, p<0.001 | r = -0.92, p<0.001 | r = -0.81, p<0.001 |
| German | r = -0.47, p<0.001 | r = -0.29, p<0.01 | r = -0.51, p<0.001 | r = -0.46, p<0.001 | r = -0.32, p<0.001 | r = -0.26, p<0.01 | r = -0.13, p>0.1 |
| Italian | r = -0.64, p<0.001 | r = 0.24, p<0.05 | r = -0.76, p<0.05 | r = -0.58, p<0.001 | r = -0.01, p>0.1 | r = 0.35, p<0.001 | r = 0.51, p<0.001 |
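To make the difference between summing raw frequencies and summing z-scores concrete, the following sketch aggregates two made-up word series both ways and correlates each aggregate with the year. It rests on several assumptions not confirmed by this record: that the coefficients in the table relate year to the aggregated series, that z-scores use the population standard deviation, and that the helper names (`z_scores`, `pearson_r`) and data are purely illustrative. It is not claimed to reproduce any specific column (I) to (VII).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a series to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - m) / s for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


years = [1900, 1901, 1902, 1903, 1904]

# Made-up frequencies: one very frequent word that declines steadily and
# one rare word that rises and then falls.
series = {
    "frequent_word": [100.0, 90.0, 80.0, 70.0, 60.0],
    "rare_word": [1.0, 2.0, 3.0, 2.0, 1.0],
}

# Sum of raw frequencies per year: dominated by the frequent word.
sum_raw = [sum(vals) for vals in zip(*series.values())]

# Sum of z-scores per year: every word contributes with equal weight.
standardized = [z_scores(vals) for vals in series.values()]
sum_z = [sum(vals) for vals in zip(*standardized)]

r_raw = pearson_r(years, sum_raw)
r_z = pearson_r(years, sum_z)
print(f"sum of raw frequencies: r = {r_raw:.2f}")
print(f"sum of z-scores:        r = {r_z:.2f}")
```

In this toy case the raw sum follows the frequent word almost perfectly, while the z-score sum gives the rare word equal say and yields a weaker negative correlation, which is the kind of unequal weighting the standardization procedure is meant to address.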
Correlation coefficients for different standardization procedures using higher frequency words and synonyms.
| | (I) | (II) | (III) | (IV) | (V) | (VI) | (VII) |
|---|---|---|---|---|---|---|---|
| | Original | Σ Raw | Σ z-scores | Σ z-scores of | Σ z-scores of | Σ z-scores | |
| American English | r = -0.92, p<0.001 | r = -0.83, p<0.001 | r = -0.77, p<0.001 | r = -0.83, p<0.001 | r = -0.60, p<0.001 | r = -0.68, p<0.001 | r = -0.72, p<0.001 |
| British English | r = -0.95, p<0.001 | r = -0.91, p<0.001 | r = -0.94, p<0.001 | r = -0.95, p<0.001 | r = -0.87, p<0.001 | r = -0.89, p<0.001 | r = -0.89, p<0.001 |
| German | r = -0.73, p<0.001 | r = -0.33, p<0.001 | r = -0.76, p<0.001 | r = -0.74, p<0.001 | r = -0.36, p<0.001 | r = -0.29, p<0.01 | r = -0.26, p<0.01 |
| Italian | r = -0.76, p<0.001 | r = -0.28, p<0.01 | r = -0.84, p<0.001 | r = -0.72, p<0.001 | r = -0.49, p<0.001 | r = -0.19, p<0.1 | r = -0.14, p>0.1 |
Fig 4. Frequencies of religious terms translated to Russian.