Alessandro Miani, Thomas Hills, Adrian Bangerter.
Abstract
The spread of online conspiracy theories represents a serious threat to society. To understand the content of conspiracies, here we present the language of conspiracy (LOCO) corpus. LOCO is an 88-million-token corpus composed of topic-matched conspiracy (N = 23,937) and mainstream (N = 72,806) documents harvested from 150 websites. Mimicking internet user behavior, documents were identified using Google by crossing a set of seed phrases with a set of websites. LOCO is hierarchically structured, meaning that each document is cross-nested within websites (N = 150) and topics (N = 600, on three different resolutions). A rich set of linguistic features (N = 287) and metadata includes upload date, measures of social media engagement, measures of website popularity, size, and traffic, as well as political bias and factual reporting annotations. We explored LOCO's features from different perspectives showing that documents track important societal events through time (e.g., Princess Diana's death, Sandy Hook school shooting, coronavirus outbreaks), while patterns of lexical features (e.g., deception, power, dominance) overlap with those extracted from online social media communities dedicated to conspiracy theories. By computing within-subcorpus cosine similarity, we derived a subset of the most representative conspiracy documents (N = 4,227), which, compared to other conspiracy documents, display prototypical and exaggerated conspiratorial language and are more frequently shared on Facebook. We also show that conspiracy website users navigate to websites via more direct means than mainstream users, suggesting confirmation bias. LOCO and related datasets are freely available at https://osf.io/snpcg/.
Year: 2021 PMID: 34697754 PMCID: PMC8545361 DOI: 10.3758/s13428-021-01698-z
Source DB: PubMed Journal: Behav Res Methods ISSN: 1554-351X
Key features of eight corpora relevant to conspiracy theory content
| Resource | BNC | WaCky | CORPS | FNweb | RumTweet | PHEME | NYT | LOCO |
|---|---|---|---|---|---|---|---|---|
| Focus | Language | Language | Political speeches | Fake news | Rumors | Rumors | Conspiracy | Conspiracy |
| Obtained from | Printed material | Web pages | Web pages | Webpages from list of websites | Newspaper | Webpages from list of websites | ||
| Number of documents | 4 K | 2.69 M | 3.6 K | 14 K (7 K fake) | 192 K tweets (61 K rumor) | 7.5 K threads (35 K rumor tweets) | 100 K (800 conspiracy) | 96 K (24 K conspiracy) |
| Number of tokens | 100 M | 1.9 B | 7.9 M | 7 M* | 2.8 M* | 100 K* | | 88 M |
| Topic structure | NO | 2 K seeds | NO | NO | YES 111 events (60 rumors, 51 non-rumors) | YES 9 events | YES | YES 47 seeds 600 topics |
| Grouping structure | NO | NO | NO | YES | YES | YES (matched) | YES | YES (matched) |
| Year range | 1917–2010 | 2013–2018 | 2006–2009 | Events around 2014–2015 | 1897–2010 | 1853–2020 | ||
| Freely available | YES | YES | YES | YES | YES | YES | NO | YES |
Note. Resources: BNC (Aston & Burnard, 1998); WaCky (Baroni et al., 2009); CORPS (Guerini et al., 2013); FNweb (Castelo et al., 2019); RumTweet (Kwon et al., 2017); PHEME (Zubiaga et al., 2016); NYT (Uscinski et al., 2011). *Number of tokens calculated from studies’ freely available datasets
Summary statistics of mainstream, conspiracy, and all documents in LOCO
| | Mainstream | Conspiracy | Whole corpus |
|---|---|---|---|
| No. of documents | 72,806 | 23,937 | 96,743 |
| No. of websites | 92 | 58 | 150 |
| Range of years | 1853–2020 | 2004–2020 | 1853–2020 |
| No. of words per document | 805.94 (939) [97–9507] | 1236.32 (1307) [100–9428] | 912.43 (1059) [97–9507] |
| No. of sentences per document | 37.92 (47.89) [1–1087] | 59.63 (69.58) [1–1047] | 43.29 (54.88) [1–1087] |
| No. of paragraphs per document | 16.56 (19.30) [1–829] | 24.51 (32.83) [1–905] | 18.53 (23.64) [1–905] |
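
The whole-corpus column can be cross-checked against the two subcorpus columns. The sketch below is a consistency check rather than part of the original analysis: it recomputes the pooled means as document-weighted averages of the values in the table. The dictionary layout and the function name `pooled_mean` are illustrative choices, and treating "words" as roughly equivalent to "tokens" is an assumption.

```python
# Summary statistics copied from the table above (per-document means).
SUBCORPORA = {
    "mainstream": {"n_docs": 72_806, "words": 805.94, "sentences": 37.92, "paragraphs": 16.56},
    "conspiracy": {"n_docs": 23_937, "words": 1236.32, "sentences": 59.63, "paragraphs": 24.51},
}


def pooled_mean(feature: str) -> float:
    """Document-weighted mean of a per-document feature across both subcorpora."""
    total_docs = sum(s["n_docs"] for s in SUBCORPORA.values())
    weighted_sum = sum(s["n_docs"] * s[feature] for s in SUBCORPORA.values())
    return weighted_sum / total_docs


total_docs = sum(s["n_docs"] for s in SUBCORPORA.values())
# Approximate word count implied by the per-document means.
total_words = sum(s["n_docs"] * s["words"] for s in SUBCORPORA.values())

print(total_docs)                           # 96743
print(round(pooled_mean("words"), 2))       # 912.43
print(round(pooled_mean("sentences"), 2))   # 43.29
print(round(pooled_mean("paragraphs"), 2))  # 18.53
print(round(total_words / 1e6, 1))          # 88.3
```

The recomputed means match the whole-corpus column, and the implied total of about 88 million words is in line with the corpus size reported in the abstract.
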
Types of variables included in LOCO
| Level | Variable type | Example of variable | Section |
|---|---|---|---|
| 1. Document | Raw content | Document ID | Table |
| | | Title | 3.4 |
| | | Text | 3.4 |
| | Features | Number of words, sentences, paragraphs | 3.8.2 |
| | Semantic content | Topic | 3.6 |
| | | Lexical features | 3.5 |
| | Conspiracy content | Representativeness | 3.7 |
| | | Mention of conspiracy | 3.8.1 |
| 2. Webpage | Information | Website host | 3.2 |
| | | URL | 3.3 |
| | | Date | 3.8.3 |
| | | Seeds | 3.1 |
| | Spread | Facebook shares, comments, and reactions | 3.8.4 |
| 3. Website | Classification | Political orientation, factual reporting, category | 3.8.5 |
| | Size | Number of webpages | 3.8.6 |
| | Popularity | Visits, traffic, and rank | 3.8.6 |
| | Spread | Facebook shares, comments, and reactions | 3.8.4 |
List of seeds
| Seed | Source | No. of conspiracy documents | No. of mainstream documents |
|---|---|---|---|
| 5g | m | 702 | 1664 |
| aids | 2 | 1025 | 2428 |
| alien | 1, 2 | 813 | 1715 |
| barack obama | 1 | 496 | 1485 |
| big foot | 1 | 708 | 2019 |
| big pharma | 1 | 716 | 1758 |
| bill gates | m | 717 | 1623 |
| cancer | m | 839 | 2098 |
| chemtrails | 1 | 744 | 549 |
| cia cocaine | 1 | 552 | 1030 |
| climate change | 1, 2 | 889 | 2166 |
| coronavirus | m | 1104 | 2588 |
| covid 19 | m | 1004 | 2395 |
| drug companies | 1 | 1024 | 2356 |
| ebola | m | 626 | 2140 |
| elvis death | m | 188 | 1386 |
| elvis presley | m | 132 | 1258 |
| flat earth | m | 605 | 1646 |
| fluoride water | 1 | 395 | 1384 |
| george bush | 1 | 844 | 1737 |
| george soros | m | 735 | 1178 |
| global warming | 1, 2 | 896 | 1793 |
| gmo | m | 620 | 1924 |
| illuminati | m | 804 | 1479 |
| jfk assassination | 1, 2 | 607 | 1344 |
| jonestown suicide | 2 | 42 | 594 |
| mh370 | m | 167 | 1086 |
| michael jackson death | m | 616 | 1564 |
| mind control | 1 | 949 | 2036 |
| moon landing | 1, 2 | 349 | 1579 |
| new world order | 1 | 1036 | 2162 |
| nwo | 1 | 814 | 1350 |
| osama bin laden | 1 | 645 | 1415 |
| paul mccartney death | 1 | 149 | 1190 |
| pharmaceutical industry | 1 | 828 | 1684 |
| pizzagate | m | 359 | 1012 |
| planned parenthood | m | 626 | 1434 |
| population control | m | 972 | 2295 |
| princess diana death | 2 | 309 | 1338 |
| reptilian | 1 | 494 | 1418 |
| saddam hussein | 1 | 677 | 1623 |
| sandy hook | m | 470 | 1500 |
| september 11 attack | 1, 2 | 939 | 2207 |
| vaccine | 1 | 803 | 2125 |
| vaccine autism | 1 | 531 | 1654 |
| vaccine covid | m | 923 | 2031 |
| zika virus | m | 473 | 1675 |
Note. Sources 1, 2, and m refer to: 1 = Jensen (2013); 2 = Douglas and Sutton (2011); m = manual
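
The abstract describes identifying documents by crossing a set of seed phrases with a set of websites. A minimal sketch of that crossing is given below. The domain names are hypothetical placeholders, and the quoted-phrase plus `site:` query format is an assumption; the exact query strings used for LOCO are not stated here.

```python
from itertools import product

# A few seeds from the table above; the full list has 47 entries.
seeds = ["5g", "flat earth", "moon landing"]

# Hypothetical placeholder domains; the real corpus draws on 150 websites.
websites = ["example-mainstream.org", "example-conspiracy.net"]


def build_queries(seeds, websites):
    """Cross every seed with every website, yielding one search query per pair."""
    # Assumed format: an exact-phrase match restricted to a single domain.
    return [f'"{seed}" site:{site}' for seed, site in product(seeds, websites)]


queries = build_queries(seeds, websites)
print(len(queries))  # 6
print(queries[0])    # "5g" site:example-mainstream.org
```

Under this scheme, the full design of 47 seeds and 150 websites would produce 47 × 150 = 7,050 seed-website pairs.
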
Fig. 2. LDA topic gamma values over time. The red dotted vertical lines represent the occurrences of significant events associated with the topic. In the 9/11 topic, each vertical line represents September 11th in each year, starting from 2001. Coronavirus topics (bottom) are distributed over the year 2020 (from January to July, when LOCO data collection ended).
Fig. 1. Distribution of documents in LOCO by date. Distribution for (a) each subcorpus (red: conspiracy; green: mainstream) and (b) all documents from 1995 to the time of data collection (the red vertical line represents the mean; the boxplot on top displays the median and the interquartile range)
Differences between conspiracy and mainstream website metrics
| Metric | Mainstream M | (SD) | N | Conspiracy M | (SD) | N | t (raw) | Sig. | d (raw) | t (log) | Sig. | d (log) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total monthly visits | 102,285,513 | (191,306,614) | 92 | 965,242 | (2,115,315) | 28 | 5.08 | *** | 1.10 | 14.00 | *** | 3.02 |
| Global rank | 7313 | (21,765) | 89 | 211,904 | (168,890) | 28 | −6.39 | *** | 1.39 | −17.11 | *** | 3.71 |
| Website size | 6,844,908 | (16,049,205) | 92 | 6224 | (12,918) | 58 | 4.09 | *** | 0.69 | 18.43 | *** | 3.09 |
| FB projected shares | 3,213,458,353 | (9,348,074,961) | 92 | 2,711,190 | (14,540,274) | 58 | 3.29 | ** | 0.55 | 12.78 | *** | 2.14 |
| Traffic, direct† | 28.95 | (13.45) | 92 | 57.55 | (21.57) | 28 | −6.63 | *** | 1.43 | |||
| Traffic, search† | 56.83 | (17.49) | 92 | 13.82 | (10.44) | 28 | 16.00 | *** | 3.45 | |||
| Traffic, social† | 8.08 | (5.58) | 92 | 18.4 | (19.28) | 28 | −2.8 | ** | 0.60 | |||
Note. Differences tested with Welch's unequal variances t-test. Log transformation was applied to highly skewed variables after adding a constant of 1, to avoid values of negative infinity when the raw score was zero. †Values expressed as percentages and not log-transformed. d: Cohen’s d. FB: Facebook. Website size is expressed in number of webpages
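
As a worked example of the test named in the note, the sketch below recomputes Welch's t statistic for the "Traffic, direct" row from the reported means, SDs, and group sizes. The conversion d = |t| · sqrt(1/n1 + 1/n2) is an assumption: it is a common way to derive Cohen's d from t, and it reproduces the tabled value, but the note does not say which formula was used. The natural logarithm in `log_plus_one` is likewise assumed, since the base of the log transformation is not given.

```python
import math


def welch_t(m1, sd1, n1, m2, sd2, n2):
    """Welch's t statistic computed from group means, SDs, and sizes."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (m1 - m2) / standard_error


def d_from_t(t, n1, n2):
    """Assumed conversion from t to Cohen's d: |t| * sqrt(1/n1 + 1/n2)."""
    return abs(t) * math.sqrt(1 / n1 + 1 / n2)


def log_plus_one(x):
    """log(x + 1): defined at x = 0, unlike log(x). Natural log is assumed."""
    return math.log1p(x)


# "Traffic, direct" row: mainstream vs. conspiracy (percentages, not logged).
t = welch_t(28.95, 13.45, 92, 57.55, 21.57, 28)
d = d_from_t(t, 92, 28)
print(round(t, 2), round(d, 2))  # -6.63 1.43
print(log_plus_one(0))           # 0.0
```

Both printed statistics agree with the "Traffic, direct" row of the table.
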
LOCO dataset variables description
| Variable name (% empty/missing values, if any) | Variable description |
|---|---|
| doc_id | Six-character hexadecimal sequence of document unique identification number. The first character stores the source: C stands for conspiracy (e.g., C0004d) and M stands for mainstream (e.g., M095eb) |
| URL | URL associated with the document |
| Website | The website from which the document was extracted |
| seeds (2.26%) | The seeds we used to gather documents. The page was returned by all the keywords listed in this variable ( |
| date (33.98%) | The date the webpage was uploaded or updated (format: YYYY-MM-DD) |
| subcorpus | Either conspiracy or mainstream ( |
| title (0.11%) | Title of the document |
| txt | Document text (see text statistics in Table |
| txt_nwords | Number of words |
| txt_nsentences | Number of sentences |
| txt_nparagraphs | Number of paragraphs |
| topic_k100 | The topic ID with highest gamma value within k100 LDA ( |
| topic_k200 | The topic ID with highest gamma value within k200 LDA ( |
| topic_k300 | The topic ID with highest gamma value within k300 LDA ( |
| mention_conspiracy | Occurrence count for the word “conspir*” in the text; see the “Mentioning conspiracy” section |
| conspiracy_representative | Logical. TRUE ( |
| cosine_similarity | Cosine similarity values for conspiracy documents (values > mean + 1 SD are considered representative) |
| FB_shares (0.01%) | URL’s Facebook shares |
| FB_comments (0.01%) | URL’s Facebook comments |
| FB_reactions (0.01%) | URL’s Facebook reactions |
Note. Percentages of empty/missing values are calculated on the list of documents (N = 96,743)
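
Two conventions from the table lend themselves to a short illustration: the source letter stored in the first character of `doc_id`, and the rule that cosine similarity values above mean + 1 SD count as representative. The sketch below uses made-up similarity values, and the function names are illustrative. Using the sample SD (rather than the population SD) is an assumption.

```python
from statistics import mean, stdev


def source_of(doc_id: str) -> str:
    """Map the first character of a doc_id to its subcorpus (C or M)."""
    return {"C": "conspiracy", "M": "mainstream"}[doc_id[0]]


def flag_representative(similarities):
    """Flag values above mean + 1 SD; sample SD is an assumption here."""
    threshold = mean(similarities) + stdev(similarities)
    return [value > threshold for value in similarities]


# Made-up cosine similarities for five conspiracy documents (not LOCO data).
toy_similarities = [0.1, 0.2, 0.3, 0.4, 0.9]

print(source_of("C0004d"))                    # conspiracy
print(source_of("M095eb"))                    # mainstream
print(flag_representative(toy_similarities))  # [False, False, False, False, True]
```
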
LOCO’s website metadata variables description
| Variable name (% empty/missing values, if any) | Variable description |
|---|---|
| Website | Website name ( |
| URL | URL associated with the website domain |
| n_webpages | Overall number of webpages in website obtained by Google search (see “ |
| MBFC_political_orientation (69%) | Political orientation. Left ( |
| MBFC_factual_reporting (21%) | Factual reporting. Very_low ( |
| MBFC_conspiracy | Logical. If TRUE ( |
| MBFC_pseudoscience (62%) | For conspiracy websites only. Zero ( |
| MBFC_proscience | Logical. TRUE ( |
| SW_total_visits (20%) | Total visits, desktop and mobile web aggregated |
| SW_global_rank (22%) | Traffic rank of website, as compared to all other websites in the world |
| SW_Category (20%) | Website category (e.g., news_and_media, |
| SW_traffic_direct (20%) | Percentage of direct desktop incoming traffic (from typing the URL in a browser) |
| SW_traffic_search (20%) | Percentage of search desktop incoming traffic (from a search engine) |
| SW_traffic_social (20%) | Percentage of social desktop incoming traffic (from a URL on social media) |
| FB_shares_homepage | Facebook shares of homepage (see discussion in SM9) |
| FB_shares_estimated | Estimated overall Facebook shares given total number of website’s webpages (see “ |
Note. Percentages of empty/missing values are calculated on the list of websites (N = 150)
Fig. 3. Differences in lexical features between conspiracy and mainstream documents. (a) Effect sizes that yielded a Cohen’s d > .20 from t-tests between conspiracy and mainstream documents on Empath lexical categories. Positive effect sizes indicate that the category value is higher in conspiracy documents. A star indicates that the category also emerged as having d > .20 in Klein et al. (2019). (b) Comparison of means [and 95% CIs] for the same set of variables (scaled to z values) across different document categories
Fig. 4. Differences in lexical features between high- and low-representative conspiracy documents. Positive β estimates indicate that the category is higher among conspiracy documents that are more representative of the conspiracy corpus as measured by their document cosine similarity with other conspiracy documents in the corpus
Fig. 5. Types of incoming traffic by website category. Average of websites’ percentages of incoming traffic (direct, from web search, and from social media) by website categories. Error bars represent the standard error of the mean
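
The quantities plotted in Fig. 5 (a per-category average with the standard error of the mean as error bars) can be sketched as follows. The category labels and percentages are made up for illustration and are not values from LOCO; the function name `mean_and_sem` is likewise illustrative.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Made-up (website category, direct-traffic %) pairs; not values from LOCO.
rows = [
    ("conspiracy", 50.0), ("conspiracy", 60.0), ("conspiracy", 70.0),
    ("mainstream", 20.0), ("mainstream", 30.0), ("mainstream", 40.0),
]


def mean_and_sem(rows):
    """Return {category: (mean, standard error of the mean)}."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    # SEM = sample SD divided by the square root of the group size.
    return {c: (mean(v), stdev(v) / sqrt(len(v))) for c, v in groups.items()}


summary = mean_and_sem(rows)
for category, (avg, sem) in summary.items():
    print(f"{category}: mean={avg:.1f}, SEM={sem:.2f}")
```
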