Christian E Lopez, Caleb Gallemore.
Abstract
This work presents an openly available dataset to facilitate researchers' exploration and hypothesis testing about the social discourse of the COVID-19 pandemic. The dataset currently consists of over 2.2 billion tweets (count as of September 2021) from all over the world, in multiple languages. The tweets start from January 22, 2020, when the total number of reported COVID-19 cases was below 600 worldwide. The dataset was collected using the Twitter API and by rehydrating tweets from other available datasets; data collection is ongoing as of the time of writing. To facilitate hypothesis testing and exploration of social discourse, the English and Spanish tweets have been augmented with state-of-the-art Twitter Sentiment and Named Entity Recognition algorithms. The dataset and the accompanying summary files allow researchers to avoid some computationally intensive analyses, facilitating more widespread use of social media data to gain insights on issues such as (mis)information diffusion, semantic networks, sentiments, and the evolution of COVID-19 discussions. In addition, the dataset provides an archive for researchers in the social sciences who wish to have access to a dataset covering the entire duration of the pandemic.
Keywords: COVID-19; Named Entity Recognition; Sentiment analysis; Twitter
Year: 2021 PMID: 34697560 PMCID: PMC8528187 DOI: 10.1007/s13278-021-00825-0
Source DB: PubMed Journal: Soc Netw Anal Min
Fig. 3 Tweet frequency across the top five observed languages
List and description of dataset tables
| Table Name | Description |
|---|---|
| Summary_Details | General details about each tweet: (i) tweet ID, (ii) language, (iii) whether geolocation is present, (iv) number of likes, (v) number of retweets, (vi) country of the tweet, and (vii) date created |
| Summary_Hastag | Hashtags used in the tweets |
| Summary_Mentions | Accounts mentioned in the tweets |
| Summary_Sentiment | Sentiment information for the English-language tweets |
| Summary_NER | Named entities recognized by the NER algorithm in the English-language tweets |
| Summary_Sentiment_ES | Sentiment information for the Spanish-language tweets |
| Summary_NER_ES | Named entities recognized by the NER algorithm in the Spanish-language tweets |
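As a minimal sketch of how these summary tables might be combined, the example below joins `Summary_Details` to `Summary_Sentiment` on tweet ID and counts tweets per language and sentiment label. The table names come from the list above; the column names (`tweet_id`, `language`, `retweets`, `sentiment_label`), the storage in SQLite, and the toy rows are assumptions made for illustration and are not taken from the dataset documentation.

```python
import sqlite3


def sentiment_counts_by_language(conn):
    """Count tweets per (language, sentiment label) by joining on tweet ID."""
    query = """
        SELECT d.language, s.sentiment_label, COUNT(*) AS n
        FROM Summary_Details AS d
        JOIN Summary_Sentiment AS s ON s.tweet_id = d.tweet_id
        GROUP BY d.language, s.sentiment_label
        ORDER BY d.language, s.sentiment_label
    """
    return conn.execute(query).fetchall()


# Toy in-memory database: hypothetical columns and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Summary_Details "
    "(tweet_id INTEGER PRIMARY KEY, language TEXT, retweets INTEGER)"
)
conn.execute(
    "CREATE TABLE Summary_Sentiment "
    "(tweet_id INTEGER PRIMARY KEY, sentiment_label TEXT)"
)
conn.executemany(
    "INSERT INTO Summary_Details VALUES (?, ?, ?)",
    [(1, "en", 3), (2, "en", 0), (3, "es", 7)],
)
conn.executemany(
    "INSERT INTO Summary_Sentiment VALUES (?, ?)",
    [(1, "negative"), (2, "neutral")],
)

rows = sentiment_counts_by_language(conn)
for language, label, n in rows:
    print(language, label, n)
```

The inner join drops the Spanish toy tweet because it has no row in `Summary_Sentiment`; under the same assumptions, Spanish tweets would be joined against `Summary_Sentiment_ES` instead.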
Fig. 1 Example of a tweet related to COVID-19
Fig. 2 Example of dataset tables
Tweet summary statistics, by month
| Month/year | Avg. OR | Avg. RT | Avg. Total | OR | RT | Total |
|---|---|---|---|---|---|---|
| Jan/2020 | 5,947.00 | 30,576.50 | 35,501.50 | 1,958,346 | 7,852,504 | 9,810,850 |
| Feb/2020 | 10,978.00 | 29,918.00 | 40,604.50 | 7,624,648 | 21,944,443 | 29,568,948 |
| Mar/2020 | 13,095.50 | 44,714.50 | 56,283.00 | 12,610,824 | 46,659,589 | 59,270,412 |
| Apr/2020 | 30,091.00 | 89,513.00 | 119,859.50 | 20,591,357 | 60,301,889 | 80,893,244 |
| May/2020 | 35,163.00 | 99,928.50 | 135,709.00 | 26,258,213 | 73,618,083 | 99,876,289 |
| Jun/2020 | 51,033.00 | 142,569.00 | 193,096.00 | 34,786,076 | 95,171,388 | 129,957,461 |
| Jul/2020 | 54,131.50 | 154,737.00 | 209,566.50 | 29,441,533 | 82,903,912 | 112,345,445 |
| Aug/2020 | 51,330.50 | 143,551.00 | 195,142.00 | 37,596,182 | 103,098,588 | 140,694,770 |
| Sep/2020 | 50,068 | 132,040 | 182,947 | 35,861,979 | 92,957,247 | 128,819,226 |
| Oct/2020 | 54,489 | 137,225 | 198,708 | 41,062,885 | 104,195,279 | 144,962,625 |
| Nov/2020 | 64,125 | 111,686 | 177,062 | 45,096,171 | 77,885,575 | 122,981,746 |
| Dec/2020 | 64,840 | 121,149 | 186,852 | 49,065,436 | 87,366,002 | 133,179,589 |
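The monthly counts can be turned into a retweet share with simple arithmetic. The sketch below assumes that OR and RT denote original tweets and retweets, which the table itself does not spell out, and it uses three rows copied from the table. The denominator is OR + RT rather than the Total column, because the two do not agree exactly in every row.

```python
# (OR, RT) counts copied from the table for three sample months.
monthly = {
    "Jan/2020": (1_958_346, 7_852_504),
    "Jun/2020": (34_786_076, 95_171_388),
    "Dec/2020": (49_065_436, 87_366_002),
}


def retweet_share(original, retweets):
    """Percentage of retweets among original tweets plus retweets."""
    return 100.0 * retweets / (original + retweets)


shares = {
    month: round(retweet_share(original, retweets), 1)
    for month, (original, retweets) in monthly.items()
}

for month, share in shares.items():
    print(f"{month}: {share}% retweets")
```

Under that assumption, the three sample rows show the retweet share falling from roughly four fifths in January 2020 to under two thirds in December 2020.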
Distribution of tweets, by language
| Language | English | Spanish | Portuguese | French | Bahasa | Others |
|---|---|---|---|---|---|---|
| Number of Tweets | 1,133,263,003 | 227,558,226 | 77,280,463 | 50,812,571 | 44,299,982 | 200,157,965 |
| Percentage (%) | 65.38 | 13.13 | 4.46 | 2.93 | 2.56 | 11.55 |
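The percentage row follows directly from the counts: each language's tweet count is divided by the sum over all six columns. The short sketch below reproduces that arithmetic from the numbers in the table; the variable names are illustrative only.

```python
# Tweet counts per language, copied from the table above.
counts = {
    "English": 1_133_263_003,
    "Spanish": 227_558_226,
    "Portuguese": 77_280_463,
    "French": 50_812_571,
    "Bahasa": 44_299_982,
    "Others": 200_157_965,
}

total = sum(counts.values())

# Share of each language as a percentage of all tweets in the table.
percentages = {lang: round(100 * n / total, 2) for lang, n in counts.items()}

for lang, pct in percentages.items():
    print(f"{lang}: {pct}%")
```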
Fig. 4 Map of tweets featuring geolocation information
Fig. 5 Sentiment of English-language tweets
Fig. 6 Daily proportion of English-language tweets by sentiment
Fig. 7 Sentiment of Spanish-language tweets
Fig. 8 Daily proportion of Spanish-language tweets by sentiment
Top 5 mentions, hashtags, and named entities
| | 1st | 2nd | 3rd | 4th | 5th |
|---|---|---|---|---|---|
| Mentions | @realDonaldTrump | @realdonaldtrump | @mippcivzla | @joebiden | @narendramodi |
| Hashtag | #covid19 | #coronavirus | #covid | #covid-19 | #lockdown |
| NER Person (English/Spanish) | trump/maduro | biden/covid | covid/ivanduque | donal trump/nicolas maduro | fauci/trum |
| NER Location (English/Spanish) | us/españa | china/italia | uk/china | america/venezuela | india/méxico |
| NER Organization (English/Spanish) | cdc/gobierno | trump/china | senate/oms | covid/minsaludcol | pfizer/auronplay |
| NER Miscellaneous (English/Spanish) | covid-19/coronavirus | americans/covid-19 | covid/covid19 | coronavirus/covid | american/covidã19 |
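The mentions row lists `@realDonaldTrump` and `@realdonaldtrump` as separate entries, although Twitter handles are case-insensitive. A minimal sketch of case-folded counting is shown below; it is an illustration with made-up sample data, not the procedure used to build the table, and the function name `top_mentions` is hypothetical.

```python
from collections import Counter


def top_mentions(mentions, k=5):
    """Return the k most frequent handles after case-folding."""
    return Counter(m.casefold() for m in mentions).most_common(k)


# Made-up sample: two spellings of the same handle plus one other handle.
sample = ["@realDonaldTrump", "@realdonaldtrump", "@joebiden", "@realDonaldTrump"]

ranked = top_mentions(sample)
print(ranked)
```

Applying the same normalization before ranking would merge differently cased spellings of a handle into a single entry.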
Fig. 9 Network generated from the augmented dataset of English tweets
Openly available COVID-19 Twitter datasets, with features available for download without need for rehydration
| Citation | Approximate Tweets | Dates | Tweet ID | Time Stamp | Text | Location | Sentiment | Topic | Other attributes |
|---|---|---|---|---|---|---|---|---|---|
| Abdul-Mageed et al. | 1.5 × 10⁹ | 2007–May 15, 2020 | ✔ | ✔ | | | | | |
| Lamsal | 1.3 × 10⁹ | Oct 01, 2019–Ongoing | ✔ | | | | | | |
| Banda et al. | 1.1 × 10⁹ | Jan 27–Ongoing | ✔ | ✔ | | | | | Hashtag/mention summaries; 1,000 frequent terms |
| Chen et al. | 623 × 10⁶ | Jan 28–Ongoing | ✔ | | | | | | |
| Suprem & Pu | 600 × 10⁶ | Jan 25–Ongoing | ✔ | | | | | | |
| Yang Q et al. 2020 | 105 × 10⁶ | Mar 01–May 15, 2020 | ✔ | ✔ | | | | | |
| Gupta et al. | 63 × 10⁶ | Jan 28–Jul 01, 2020 | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | User Metadata; Hashtags; Retweets; Emotions |
| Ziems et al. | 31 × 10⁶ | Jan 15–Ongoing | ✔ | ✔ | ✔ | ✔ | | | Sampled egonetworks |
| Gao et al. | 25 × 10⁶ | Jan 20–Mar 24, 2020 | ✔ | ✔ | | | | | |
| Dimitrov et al. | 8.2 × 10⁶ | Oct 2019–Apr 2020 | ✔ | ✔ | ✔ | | | | NER; Mentions; Hashtags; User Metadata; URLs |
| Alqurashi et al. | 4.5 × 10⁶ | Jan 01–May 30, 2020 | ✔ | | | | | | |
| de Melo & Figueiredo | 3.9 × 10⁶ | Jan–May 2020 | ✔ | ✔ | | | | | Retweets; hashtags |
| Haouari et al. | 2.7 × 10⁶ | Jan 27–Jan 31, 2021 | ✔ | | | | | | Propagation network of top 1,000 tweets by day |
| Feng & Zhou | 650 × 10³ | Jan 25–May 10, 2020 | ✔ | ✔ | ✔ | ✔ | ✔ | | |
| Sarker et al. 2020 | 472 × 10³ | Feb 01–?? Apr 2020 | | | | | | | Self-reported COVID-19 symptoms |
| Cui & Lee | 296 × 10³ | Dec 01–Nov 01, 2020 | ✔ | | | | | | User ID; Reply ID; Misinformation detection |
| Elhadad et al. | 110 × 10³ | Feb 04–Mar 10, 2020 | ✔ | ✔ | | | | | Fact-checking annotation |
| Naseem et al. | 90 × 10³ | Feb–Mar 2020 | | | | | | | |
| Dharawat et al. | 61 × 10³ | | | | | | | | Health risk severity |
| Vidgen et al. | 40 × 10³ | Jan 01–Mar 10, 2020 | ✔ | ✔ | | | | | |
| Mutlu et al. | 14 × 10³ | Apr 04–Apr 30, 2020 | ✔ | | | | | | Human-coded stances on hydroxychloroquine |
| Ameur et al. 2021 | 11 × 10³ | Dec 15, 2019–Dec 15, 2020 | ✔ | ✔ | | | | | Manual topic, misinformation, and negative speech annotations |
| Memon & Carley | 4.5 × 10³ | Mar 29; Jun 15/24, 2020 | ✔ | ✔ | ✔ | | | | Tweets for users collected in this period |