Arianna D'Ulizia, Maria Chiara Caschera, Fernando Ferri, Patrizia Grifoni.
Abstract
Fake news detection has gained increasing importance in the research community due to the widespread diffusion of fake news through media platforms. Many datasets have been released in the last few years, aiming to assess the performance of fake news detection methods. In this survey, we systematically review twenty-seven popular datasets for fake news detection, providing insights into the characteristics of each dataset and a comparative analysis among them. A characterization of fake news detection datasets, composed of eleven characteristics extracted from the surveyed datasets, is provided, along with a set of requirements for comparing and building new datasets. Due to the ongoing interest in this research topic, the results of the analysis can guide researchers in selecting or defining suitable datasets for evaluating their fake news detection methods.
Keywords: Evaluation datasets; Fake news detection; Online fake news
Year: 2021 PMID: 34239967 PMCID: PMC8237334 DOI: 10.7717/peerj-cs.518
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Comparison among the contributions of our survey and existing surveys on fake news detection.
| Contributions | Fake news characterization | Fake news detection methods | Fake news diffusion models | Fake news mitigation techniques | Evaluation: dataset characterization | Evaluation: datasets | Evaluation: dataset requirements | Evaluation: metrics |
|---|---|---|---|---|---|---|---|---|
Figure 1. PRISMA 2009 flow diagram.
Figure 2. (A) The major components to characterize fake news defined by Zhang & Ghorbani (2019); (B) the major components of the FNDD characterization defined in our survey.
The characteristics adapted from Zhang & Ghorbani (2019) are indicated with a colored background, while the newly defined ones are indicated with a white background.
Characteristics of the surveyed datasets for fake news detection.
| Dataset | News domain | Application purpose | Type of disinformation | Language | Size | News content type | Rating scale | Media platform | Spontaneity | Availability | Extraction time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Yelp dataset | Technology | Fake detection | Fake reviews | English | 18,912 reviews | Text | 2 values | Mainstream | Yes | Yes | No |
| PHEME dataset | Society, politics | Rumor detection | Rumors | English and German | 330 rumorous conversations and 4,842 tweets overall | Text | 3 values | Social media (Twitter) | Yes | Yes | No |
| CREDBANK | Society | Veracity classification | Rumors | English | 60 million streaming tweets | Text | 5 values | Social media (Twitter) | Yes | Yes | Yes (October 2014 - February 2015) |
| BuzzFace | Politics, society | Veracity classification | Fake news articles | English | 2,263 news | Text | 4 values | Social media (Facebook) | Yes | Yes | Yes (September 2016) |
| FacebookHoax | Science | Fake detection | Hoaxes | English | 15,500 posts | Text | 2 values | Social media (Facebook) | Yes | Yes | Yes (July 2016 - December 2016) |
| LIAR | Politics | Fake detection | Fake news articles | English | 12,836 short statements | Text | 6 values | Mainstream + social media (Facebook, Twitter) | Yes | Yes | Yes (2007-2016) |
| Fact checking dataset | Politics, society | Fact checking | Fake news articles | English | 221 statements | Text | 5 values | Mainstream | Yes | Yes | No |
| FEVER | Society | Fact checking | Fake news articles | English | 185,445 claims | Text | 3 values | Mainstream | No | Yes | No |
| EMERGENT | Society, technology | Rumor detection | Rumors | English | 300 claims and 2,595 associated article headlines | Text | 3 values | Mainstream + social media (Twitter) | Yes | Yes | No |
| FakeNewsNet | Society, politics | Fake detection | Fake news articles | English | 422 news | Text, images | 2 values | Mainstream + social media (Twitter) | Yes | Yes | No |
| Benjamin Political News Dataset | Politics | Fake detection | Fake news articles | English | 225 stories | Text | 3 values | Mainstream | Yes | Yes | Yes (2014-2015) |
| Burfoot Satire News Dataset | Politics, economy, technology, society | Fake detection | Satire | English | 4,233 news samples | Text | 2 values | Mainstream | Yes | Yes | No |
| BuzzFeed News dataset | Politics | Fake detection | Fake news articles | English | 2,283 news samples | Text | 4 values | Social media (Facebook) | Yes | Yes | Yes (2016-2017) |
| MisInfoText dataset | Society | Fact checking | Fake news articles | English | 1,692 news articles | Text | 4 values for BuzzFeed and 5 values for Snopes | Mainstream | Yes | Yes | No |
| Ott et al.’s dataset | Tourism | Fake detection | Fake reviews | English | 800 reviews | Text | 2 values | Social media (TripAdvisor) | No | Yes | No |
| FNC-1 dataset | Politics, society, technology | Fake detection | Fake news articles | English | 49,972 articles | Text | 4 values | Mainstream | Yes | Yes | No |
| Spanish fake news corpus | Science, sport, economy, education, entertainment, politics, health, security, society | Fake detection | Fake news articles | Spanish | 971 news | Text | 2 values | Mainstream | Yes | Yes | Yes (January 2018 - July 2018) |
| Fake_or_real_news | Politics, society | Fake detection | Fake news articles | English | 6,337 articles | Text | 2 values | Mainstream | Yes | Yes | No |
| TSHP-17 | Politics | Fact checking | Fake news articles | English | 33,063 articles | Text | 6 values for PolitiFact and 4 values for unreliable sources | Mainstream | Yes | Yes | No |
| QProp | Politics | Fact checking | Fake news articles | English | 51,294 articles | Text | 2 values | Mainstream | Yes | Yes | No |
| NELA-GT-2018 | Politics | Fake detection | Fake news articles | English | 713,000 articles | Text | 2 values | Mainstream | Yes | Yes | Yes (February 2018 - November 2018) |
| TW_info | Politics | Fake detection | Fake news articles | English | 3,472 articles | Text | 2 values | Social media (Twitter) | Yes | Yes | Yes (January 2015 - April 2019) |
| FCV-2018 | Society | Fake detection | Fake news content | English, Russian, Spanish, Arabic, German, Catalan, Japanese, and Portuguese | 380 videos and 77,258 tweets | Videos, text | 2 values | Social media (YouTube, Facebook, Twitter) | Yes | Yes | Yes (April 2017 - July 2017) |
| Verification Corpus | Society | Veracity classification | Hoaxes | English, Spanish, Dutch, French | 15,629 posts | Text, images, videos | 2 values | Social media (Twitter) | Yes | Yes | Yes (2012-2015) |
| CNN/Daily Mail summarization dataset | Politics, society, crime, sport, business, technology, health | Fake detection | Fake news articles | English | 287,000 articles | Text | 4 values | Mainstream | Yes | Yes | Yes (April 2007 - April 2015) |
| Zheng et al.’s dataset | Society | Clickbait detection | Clickbait | Chinese | 14,922 headlines | Text | 2 values | Mainstream + social media (WeChat) | Yes | Yes | No |
| Tam et al.’s dataset | Politics, technology, science, crime, fraud and scam, fauxtography | Rumor detection | Rumors | English | 1,022 rumors and 4 million tweets | Text | 2 values | Social media (Twitter) | Yes | Yes | Yes (May 2017 - November 2017) |
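The eleven characteristics in the table above can be treated as a record type, which makes it straightforward to shortlist datasets for an evaluation. The following is a minimal sketch, not part of the survey: the class name `DatasetProfile`, its field names, and the helper `select` are hypothetical, and only three rows of the table are transcribed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetProfile:
    """One row of the characteristics table (field names are illustrative)."""
    name: str
    news_domains: Tuple[str, ...]
    application_purpose: str
    disinformation_type: str
    languages: Tuple[str, ...]
    size: str
    content_types: Tuple[str, ...]
    rating_values: int
    media_platform: str
    spontaneous: bool
    available: bool
    extraction_period: Optional[str]  # None where the table says "No"


# Three rows transcribed from the table; the remaining rows are omitted.
PROFILES = [
    DatasetProfile("LIAR", ("Politics",), "Fake detection",
                   "Fake news articles", ("English",),
                   "12836 short statements", ("Text",), 6,
                   "Mainstream + social media (Facebook, Twitter)",
                   True, True, "2007-2016"),
    DatasetProfile("FEVER", ("Society",), "Fact checking",
                   "Fake news articles", ("English",),
                   "185,445 claims", ("Text",), 3,
                   "Mainstream", False, True, None),
    DatasetProfile("Verification Corpus", ("Society",),
                   "Veracity classification", "Hoaxes",
                   ("English", "Spanish", "Dutch", "French"),
                   "15629 posts", ("Text", "images", "videos"), 2,
                   "Social media (Twitter)", True, True, "2012-2015"),
]


def select(profiles, language=None, purpose=None, spontaneous=None):
    """Return the profiles that satisfy every criterion that is not None."""
    matches = []
    for p in profiles:
        if language is not None and language not in p.languages:
            continue
        if purpose is not None and p.application_purpose != purpose:
            continue
        if spontaneous is not None and p.spontaneous != spontaneous:
            continue
        matches.append(p)
    return matches


if __name__ == "__main__":
    for p in select(PROFILES, language="Spanish"):
        print(p.name)
```

Keeping languages, domains, and content types as tuples reflects that a single dataset can take several values for those characteristics.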
Figure 3. Number of datasets by year of publication.
Figure 4. Frequency distribution (percentage) of the following characteristics of the surveyed datasets: (A) news domains; (B) application purpose; (C) types of disinformation; (D) rating scale; (E) media platforms.
Figure 5. Size (number of items) of the datasets.
(A) Size of the surveyed datasets with less than 100,000 collected items; (B) size of the surveyed datasets with more than 100,000 collected items.
Classification of datasets for fake news detection according to the news domain.
| News domain | Dataset | # of datasets | |
|---|---|---|---|
| Technology | Yelp, EMERGENT, Burfoot Satire News, FNC-1, CNN/Daily Mail summarization dataset, Tam et al.’s dataset | 6 | |
| Politics | PHEME, BuzzFace, LIAR, Fact checking, FakeNewsNet, Benjamin Political News, Burfoot Satire News, BuzzFeed News, FNC-1, Spanish fake news, Fake_or_real_news, TSHP-17, Qprop, NELA-GT-2018, TW_info, CNN/Daily Mail summarization dataset, Tam et al.’s dataset | 17 | |
| Economy | Burfoot Satire News, Spanish fake news, CNN/Daily Mail summarization dataset | 3 | |
| Society | PHEME, CREDBANK, BuzzFace, Fact checking, FEVER, EMERGENT, FakeNewsNet, Burfoot Satire News, MisInfoText, FNC-1, Spanish fake news, Fake_or_real_news, FCV-2018, Verification Corpus, CNN/Daily Mail summarization dataset, Zheng et al.’s dataset | 16 | |
| Science | FacebookHoax, Spanish fake news, Tam et al.’s dataset | 3 | |
| Security | Spanish fake news | 1 | |
| Health | Spanish fake news, CNN/Daily Mail summarization dataset | 2 | |
| Tourism | Ott et al.’s | 1 | |
| Sport | Spanish fake news, CNN/Daily Mail summarization dataset | 2 | |
| Education | Spanish fake news | 1 | |
| Entertainment | Spanish fake news | 1 | |
| Crime | CNN/Daily Mail summarization dataset, Tam et al.’s dataset | 2 | |
| Fraud and scam | Tam et al.’s dataset | 1 | |
| Fauxtography | Tam et al.’s dataset | 1 | |
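Because a dataset can cover several news domains, it is counted once in every domain it touches, so the per-domain counts add up to more than the number of datasets. A minimal sketch of that tally follows; the variable and function names are hypothetical, and only four datasets from the characteristics table are included.

```python
from collections import Counter

# Domain labels for four datasets, taken from the characteristics table.
DOMAINS = {
    "Yelp dataset": ["Technology"],
    "PHEME": ["Society", "Politics"],
    "Burfoot Satire News": ["Politics", "Economy", "Technology", "Society"],
    "Ott et al.'s dataset": ["Tourism"],
}


def count_by_domain(domains):
    """Count datasets per domain; a multi-domain dataset counts in each one."""
    counts = Counter()
    for labels in domains.values():
        counts.update(set(labels))  # set() guards against repeated labels
    return counts


if __name__ == "__main__":
    counts = count_by_domain(DOMAINS)
    for domain, n in sorted(counts.items()):
        print(f"{domain}: {n}")
    print("datasets:", len(DOMAINS), "domain assignments:", sum(counts.values()))
```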
Dataset characteristics and requirements for comparing datasets.
Classification of datasets for fake news detection according to the extraction period.
| Extraction period | Dataset | # of datasets | |
|---|---|---|---|
| Defined | CREDBANK, BuzzFace, FacebookHoax, LIAR, Benjamin Political News, BuzzFeed News, Spanish fake news, NELA-GT-2018, TW_info, FCV-2018, Verification Corpus, CNN/Daily Mail summarization dataset, Tam et al.’s dataset | 13 | |
| Not-defined | Yelp dataset, PHEME, Fact checking, FEVER, EMERGENT, FakeNewsNet, Burfoot Satire News, Ott et al.’s dataset, MisInfoText, FNC-1, Fake_or_real_news, TSHP-17, QProp, Zheng et al.’s dataset | 14 | |
Figure 6. (A) Requirements for fake news detection datasets defined by Rubin, Chen & Conroy (2015); (B) requirements for fake news detection datasets defined in our study.
Matching between Rubin et al.’s requirements and our dataset characteristics.
| Rubin et al. requirements | Dataset characteristics according to our FNDD characterisation |
|---|---|
| Homogeneity in lengths | Media platform |
| - | Type of disinformation |
| - | News domain |
| - | Application purpose |
| Availability of both trustful and deceptive news | Rating scale |
| Digital textual format accessibility | News content type |
| Pragmatic concerns | Availability |
| Language and culture | Language |
| Verifiability of ground truth | Spontaneity |
| Predefined time frame | Extraction period |
| Homogeneity in writing matter | - |
| The manner of news delivery | - |
FNDD characteristics and our requirements for comparing datasets.
| Categories of requirements | Our dataset requirements | Dataset characteristics according to our FNDD characterisation |
|---|---|---|
| Homogeneity requirements | Homogeneity in news lengths | Media platform |
| | Homogeneity in type of disinformation | Type of disinformation |
| | Homogeneity in news domain | News domain |
| | Homogeneity in application purpose | Application purpose |
| Availability requirements | Fake/real availability | Rating scale |
| | Textual format availability | News content type |
| | Public availability | Availability |
| | Multi-lingual availability | Language |
| Verifiability requirement | Verifiability of ground truth | Spontaneity |
| Temporal requirement | Belonging to a predefined time frame | Extraction period |
Analysis of the surveyed datasets for fake news detection according to our requirements.
| Dataset | Availability of fake and real news | Textual format availability | Verifiability of ground-truth | Homogeneity of news length | Homogeneity in type of disinformation | Homogeneity in news domain | Belonging to a predefined time frame | Homogeneity in application purpose | Public availability | Multi-lingual availability |
|---|---|---|---|---|---|---|---|---|---|---|
| Yelp dataset | – | |||||||||
| PHEME dataset | ||||||||||
| CREDBANK | – | |||||||||
| BuzzFace | – | |||||||||
| FacebookHoax | – | |||||||||
| LIAR | – | |||||||||
| Fact checking dataset | – | |||||||||
| FEVER | – | |||||||||
| EMERGENT | – | |||||||||
| FakeNewsNet | – | |||||||||
| Benjamin Political News Dataset | – | |||||||||
| Burfoot Satire News Dataset | – | |||||||||
| BuzzFeed News dataset | – | |||||||||
| MisInfoText dataset | – | |||||||||
| Ott et al.’s dataset | – | |||||||||
| FNC-1 dataset | – | |||||||||
| Spanish fake news corpus | ||||||||||
| Fake_or_real_news | ||||||||||
| TSHP-17 | ||||||||||
| QProp | ||||||||||
| NELA-GT-2018 | ||||||||||
| TW_info | ||||||||||
| FVC-2018 | ||||||||||
| Verification Corpus | ||||||||||
| CNN/Daily Mail summarization dataset | – | |||||||||
| Zheng et al.’s dataset | – | |||||||||
| Tam et al.’s dataset | – |
Analysis of the surveyed datasets for fake news detection according to the requirement to belong to a predefined time frame.
| Dataset | ||
|---|---|---|
| CREDBANK | ||
| BuzzFace | ||
| LIAR | ||
| Benjamin Political News Dataset | ||
| FacebookHoax dataset | ||
| BuzzFeed News dataset | ||
| Spanish fake news corpus | ||
| NELA-GT-2018 | ||
| TW_info dataset | ||
| FCV-2018 | ||
| Verification Corpus | ||
| CNN/Daily Mail summarization dataset | ||
| Tam et al.’s dataset | ||
Analysis of the surveyed datasets for fake news detection according to the verifiability of ground-truth requirement.
| Requirement | Dataset | # of datasets |
|---|---|---|
| Verifiability of ground truth | Yelp dataset | 1 |
| | PHEME dataset, EMERGENT, Benjamin Political News Dataset, BuzzFeed News dataset, Spanish fake news corpus | 5 |
| | CREDBANK, BuzzFace | 2 |
| | FacebookHoax, Fact checking dataset, Burfoot Satire News Dataset, MisInfoText dataset, FNC-1 dataset, Fake_or_real_news, TSHP-17, QProp, TW_info dataset, Zheng et al.’s dataset, Tam et al.’s dataset | 11 |
| | LIAR | 1 |
| | FEVER | 1 |
| | FakeNewsNet, Verification Corpus | 2 |
| | Ott et al.’s dataset, CNN/Daily Mail summarization dataset | 2 |
| | NELA-GT-2018 | 1 |
| | FCV-2018 | 1 |
Figure 7. Analysis of the surveyed datasets according to: (A) the verifiability of ground-truth; (B) the homogeneity of the news length; (C) the homogeneity in the news domain; (D) the homogeneity in the type of disinformation; (E) the homogeneity in application purpose.
Analysis of the surveyed datasets for fake news detection according to the homogeneity in news length requirement.
| Requirement | Dataset | # of datasets |
|---|---|---|
| News length | PHEME dataset, CREDBANK, Benjamin Political News Dataset, TW_info dataset | 4 |
| | BuzzFace, FacebookHoax, BuzzFeed News dataset | 3 |
| | LIAR, FEVER | 2 |
| | Fact checking dataset, EMERGENT | 2 |
| | FakeNewsNet, Burfoot Satire News Dataset, MisInfoText dataset, FNC-1 dataset, Spanish fake news corpus, Fake_or_real_news, TSHP-17, QProp, NELA-GT-2018, CNN/Daily Mail summarization dataset | 10 |
| | Ott et al.’s dataset | 1 |
| | Zheng et al.’s dataset | 1 |
| | Yelp dataset, FCV-2018, Verification Corpus, Tam et al.’s dataset | 4 |
Analysis of the surveyed datasets for fake news detection according to the homogeneity in type of disinformation requirement.
| Requirement | Dataset | # of datasets |
|---|---|---|
| Type of disinformation | Yelp dataset, Ott et al.’s dataset | 2 |
| | PHEME dataset, CREDBANK, EMERGENT, Tam et al.’s dataset | 4 |
| | BuzzFace, LIAR, Fact checking dataset, FEVER, FakeNewsNet, Benjamin Political News Dataset, BuzzFeed News dataset, MisInfoText dataset, Spanish fake news corpus, Fake_or_real_news, TSHP-17, QProp, TW_info dataset, CNN/Daily Mail summarization dataset, FCV-2018, Verification Corpus | 15 |
| | FacebookHoax, FNC-1 dataset | 2 |
| | Burfoot Satire News Dataset | 1 |
| | MisInfoText dataset, NELA-GT-2018 | 2 |
| | Zheng et al.’s dataset | 1 |
Classification of datasets for fake news detection according to the application purpose.
| Application purpose | Dataset | # of datasets | |
|---|---|---|---|
| Fake detection | Yelp, FacebookHoax, LIAR, FakeNewsNet, Benjamin Political News, Burfoot Satire News, BuzzFeed News, Ott et al.’s, FNC-1, Spanish fake news, Fake_or_real_news, NELA-GT-2018, TW_info, FCV-2018, CNN/Daily Mail summarization dataset | 15 | |
| Fact checking | Fact checking, FEVER, MisInfoText, TSHP-17, Qprop | 5 | |
| Veracity classification | CREDBANK, BuzzFace, Verification Corpus | 3 | |
| Rumor detection | EMERGENT, PHEME, Tam et al.’s dataset | 3 | |
| Clickbait detection | Zheng et al.’s dataset | 1 | |
Classification of datasets for fake news detection according to the language.
| Language | Dataset | # of datasets | |
|---|---|---|---|
| Monolingual–English | Yelp dataset, CREDBANK, BuzzFace, FacebookHoax, LIAR, Fact checking, FEVER, EMERGENT, FakeNewsNet, Benjamin Political News, Burfoot Satire News, BuzzFeed News, MisInfoText, Ott et al.’s dataset, FNC-1, Fake_or_real_news, TSHP-17, Qprop, NELA-GT-2018, TW_info, CNN/Daily Mail summarization dataset, Tam et al.’s dataset | 22 | |
| Monolingual–Spanish | Spanish fake news | 1 | |
| Monolingual–Chinese | Zheng et al.’s dataset | 1 | |
| Multi-lingual | PHEME, FCV-2018, Verification Corpus | 3 | |
Classification of datasets for fake news detection according to the type of disinformation.
| Type of disinformation | Dataset | # of datasets | |
|---|---|---|---|
| Fake news articles | BuzzFace, LIAR, Fact checking, FEVER, FakeNewsNet, Benjamin Political News, BuzzFeed News, MisInfoText, FNC-1, Spanish fake news, Fake_or_real_news, TSHP-17, Qprop, NELA-GT-2018, TW_info, FCV-2018, CNN/Daily Mail summarization dataset | 17 | |
| Fake reviews | Yelp dataset, Ott et al.’s dataset | 2 | |
| Satire | Burfoot Satire News | 1 | |
| Hoaxes | FacebookHoax, Verification Corpus | 2 | |
| Rumors | PHEME, CREDBANK, EMERGENT, Tam et al.’s dataset | 4 | |
| Clickbait | Zheng et al.’s dataset | 1 | |
Classification of datasets for fake news detection according to the size.
| Size (number of items) | Dataset | # of datasets | |
|---|---|---|---|
| 0–1,000 | Benjamin Political News, Fact checking, FakeNewsNet, Ott et al.’s dataset, Spanish fake news | 5 | |
| 1,000–10,000 | MisInfoText, BuzzFeed News, BuzzFace, EMERGENT, Burfoot Satire News, PHEME, Fake_or_real_news, TW_info | 8 | |
| 10,000–100,000 | LIAR, FacebookHoax, Yelp dataset, TSHP-17, FNC-1, Qprop, FCV-2018, Verification Corpus, Zheng et al.’s dataset | 9 | |
| 100,000–100,000,000 | FEVER, NELA-GT-2018, CREDBANK, CNN/Daily Mail summarization dataset, Tam et al.’s dataset | 5 | |
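The size classes above amount to a simple binning rule over the number of collected items. The sketch below illustrates it under one loudly stated assumption: the table does not say which class a value lying exactly on a boundary (for example 1,000) belongs to, so the lower bound of each class is treated as inclusive here. The names `BIN_EDGES`, `BIN_LABELS`, and `size_bin` are hypothetical.

```python
import bisect

# Upper edges separating the four size classes used in the table.
BIN_EDGES = [1_000, 10_000, 100_000]
BIN_LABELS = [
    "0–1,000",
    "1,000–10,000",
    "10,000–100,000",
    "100,000–100,000,000",
]
MAX_SIZE = 100_000_000


def size_bin(n_items):
    """Map an item count to its size class.

    Assumption (not stated in the table): a count equal to a class
    boundary falls into the higher class.
    """
    if n_items < 0 or n_items > MAX_SIZE:
        raise ValueError(f"item count out of range: {n_items}")
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, n_items)]


if __name__ == "__main__":
    # Item counts taken from the characteristics table.
    for name, n in [("Ott et al.'s dataset", 800), ("BuzzFace", 2263),
                    ("LIAR", 12836), ("FEVER", 185445)]:
        print(name, "->", size_bin(n))
```

With these counts, the four example datasets land in the same classes the table assigns them.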
Classification of datasets for fake news detection according to the news content type.
| News content type | Dataset | # of datasets | |
|---|---|---|---|
| Text | Yelp dataset, PHEME, CREDBANK, BuzzFace, FacebookHoax, LIAR, Fact checking, FEVER, EMERGENT, Benjamin Political News, Burfoot Satire News, BuzzFeed News, MisInfoText, Ott et al.’s dataset, FNC-1, Spanish fake news, Fake_or_real_news, TSHP-17, Qprop, NELA-GT-2018, TW_info, CNN/Daily Mail summarization dataset, Tam et al.’s dataset, Zheng et al.’s dataset | 24 | |
| Text, images | FakeNewsNet | 1 | |
| Text, videos | FCV-2018 | 1 | |
| Text, images, videos | Verification Corpus | 1 | |
Classification of datasets for fake news detection according to the rating scale.
| Rating scale | Dataset | # of datasets | |
|---|---|---|---|
| 2 values | Yelp dataset, FacebookHoax, FakeNewsNet, Burfoot Satire News, Ott et al.’s dataset, Spanish fake news, Fake_or_real_news, Qprop, NELA-GT-2018, TW_info, FCV-2018, Verification Corpus, Zheng et al.’s dataset, Tam et al.’s dataset | 14 | |
| 3 values | PHEME, FEVER, EMERGENT, Benjamin Political News | 4 | |
| 4 values | BuzzFace, BuzzFeed News, MisInfoText, FNC-1, TSHP-17, CNN/Daily Mail summarization dataset | 6 | |
| 5 values | CREDBANK, Fact checking | 2 | |
| 6 values | LIAR | 1 | |
Classification of datasets for fake news detection according to the media platform.
| Media platform | Dataset | # of datasets | |
|---|---|---|---|
| Mainstream media | Yelp dataset, Fact checking, FEVER, Benjamin Political News, Burfoot Satire News, MisInfoText, FNC-1, Spanish fake news, Fake_or_real_news, TSHP-17, Qprop, NELA-GT-2018, CNN/Daily Mail summarization dataset | 13 | |
| Online social media | PHEME, CREDBANK, BuzzFace, FacebookHoax, BuzzFeed News, Ott et al.’s dataset, TW_info, FCV-2018, Verification Corpus, Tam et al.’s dataset | 10 | |
| Mainstream + Online social media | LIAR, EMERGENT, FakeNewsNet, Zheng et al.’s dataset | 4 | |
Classification of datasets for fake news detection according to the spontaneity.
| Spontaneity | Dataset | # of datasets | |
|---|---|---|---|
| Spontaneous | Yelp dataset, PHEME, CREDBANK, BuzzFace, FacebookHoax, LIAR, Fact checking, EMERGENT, FakeNewsNet, Benjamin Political News, Burfoot Satire News, BuzzFeed News, MisInfoText, FNC-1, Spanish fake news, Fake_or_real_news, TSHP-17, Qprop, NELA-GT-2018, TW_info, FCV-2018, Verification Corpus, CNN/Daily Mail summarization dataset, Zheng et al.’s dataset, Tam et al.’s dataset | 25 | |
| Artificial | FEVER, Ott et al.’s dataset | 2 | |