Tiago de Melo, Carlos M. S. Figueiredo.
Abstract
In this data article, we provide a collection of 3,925,366 tweets and 18,413 online news articles on the online discussion of COVID-19 in Brazil. The Twitter data were collected with the Twitterscraper Python library, using a set of Portuguese keywords related to COVID-19. To help researchers and data enthusiasts identify tweets that contain hashtags, media, or retweets, we created a separate dataset for each of these three categories. The news on COVID-19 was collected from the UOL portal, the most popular Brazilian website. All data were gathered from January to May 2020. These datasets may be of interest to communities such as data science, social science, natural language processing, tourism, infodemiology, and public health.
Keywords: COVID-19; Dataset; News; Pandemic; Portuguese; Twitter
Year: 2020 PMID: 32844106 PMCID: PMC7434436 DOI: 10.1016/j.dib.2020.106179
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Fig. 1. Trend of news and tweets on the COVID-19 topic.
Datasets in Mendeley.
| Dataset | Description | Fields |
|---|---|---|
| General | Data collection of tweets regarding COVID-19. This dataset has approximately 220MB. | tweet_id: unique identifier for Twitter. |
| UOL | Data collection of news media regarding COVID-19. This dataset has approximately 65MB. | date: when the news item was posted on the website. |
| Retweets | Data collection of tweets with at least one retweet. This dataset has approximately 26MB. | tweet_id: unique identifier for Twitter. |
| Media | Data collection of tweets with at least one picture or video. This dataset has approximately 31MB. | tweet_id: unique identifier for Twitter. |
| Hashtags | List of hashtags in collected tweets. This dataset has approximately 32MB. | tweet_id: unique identifier for Twitter. |
| Python Scripts | List of programs written in Python to collect, transform, read and visualize each of the datasets. Each program has the following name format: 1) Collection - crawler-twitter.py and crawler-uol.py; 2) Transformation - create-<dataset_name>.py; 3) Reading - read-<dataset_name>.py; and 4) Visualizing – Script available at |
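
The reading and transformation steps listed above can be illustrated with a minimal sketch. It is not the authors' script: only the `tweet_id` field is documented in the table, so the `text` column, the inline sample data, and the function names `read_dataset` and `extract_hashtags` are assumptions made for illustration.

```python
import csv
import io
import re

# Hypothetical sample standing in for the General dataset. Only `tweet_id`
# is documented in the table above; the `text` column is an assumption.
SAMPLE_CSV = """tweet_id,text
1,Fique em casa #quarentena #COVID19
2,Novos casos confirmados no Brasil
3,Isolamento social continua #coronavirus
"""

# A hashtag is a '#' followed by word characters (Unicode-aware in Python 3).
HASHTAG_RE = re.compile(r"#(\w+)")


def read_dataset(handle):
    """Read a CSV file object with a header row into a list of dicts."""
    return list(csv.DictReader(handle))


def extract_hashtags(rows):
    """Return (tweet_id, hashtag) pairs for every hashtag found in a tweet."""
    pairs = []
    for row in rows:
        for tag in HASHTAG_RE.findall(row["text"]):
            pairs.append((row["tweet_id"], tag.lower()))
    return pairs


rows = read_dataset(io.StringIO(SAMPLE_CSV))
hashtag_pairs = extract_hashtags(rows)
print(hashtag_pairs)
```

To read a real file instead of the inline sample, `read_dataset` could be given `open(path, newline="", encoding="utf-8")`, where `path` points at one of the downloaded CSV files.
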
Fig. 2. Distribution of tweets by keyword.
Fig. 3. Distribution of news for the same keywords used for Twitter.
Fig. 4. Word cloud of hashtags used on Twitter on the COVID-19 topic.
Top 20 terms from January to May.
| Rank | January | February | March | April | May |
|---|---|---|---|---|---|
| 1 | coronavirus | coronavirus | COVID19 | COVID19 | pandemia |
| 2 | corona | quarentena | pandemia | coronavirus | COVID19 |
| 3 | vírus (virus) | corona | coronavirus | quarentena | social |
| 4 | China | vírus | vírus | pandemia | coronavirus |
| 5 | quarentena (quarantine) | China | social | social | isolamento |
| 6 | Brasil | COVID19 | corona | isolamento | quarentena |
| 7 | pandemia (pandemic) | Brasil | quarentena | vírus | pessoas |
| 8 | casos (cases) | casos | isolamento (isolation) | corona | Brasil |
| 9 | pessoa (person) | pandemia | pessoas | Brasil | vírus |
| 10 | Saúde (health) | pessoas | cloroquina (chloroquine) | cloroquina | casa |
| 11 | mundo (world) | mundo | Brasil | COVID | meio (middle, kind, or media) |
| 12 | gente (people) | brasileiros (brazilians) | casa | saúde | saúde |
| 13 | novo (new) | carnaval | casos | pessoas | mortes (deaths) |
| 14 | carnaval (carnival) | Itália (Italy) | mundo | casos | mundo |
| 15 | medo (fear) | governo | gente | contra | contra |
| 16 | suspeita (suspicious) | país (country) | saúde | casa | distanciamento |
| 17 | cerveja (beer) | gente | Bolsonaro | Bolsonaro | COVID |
| 18 | surto (outbreak) | surto | COVID | mortes | cloroquina |
| 19 | governo (government) | casa (home) | lockdown | gente | Bolsonaro |
| 20 | cidade (city) | doença (disease) | contra (against) | mundo | presidente (president) |
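
A ranking like the one in the table above can be produced by counting tokens per month. The sketch below shows one plausible way to do so; the `(date, text)` record layout, the tiny stopword list, and the function name `top_terms_by_month` are assumptions, and the authors' actual tokenization and stopword handling may differ.

```python
from collections import Counter, defaultdict
import re

# Hypothetical (date, text) records; the real datasets may use other fields.
TWEETS = [
    ("2020-01-28", "Coronavirus na China: novos casos"),
    ("2020-01-30", "Medo do coronavirus e casos suspeitos"),
    ("2020-03-15", "Quarentena e isolamento social na pandemia"),
    ("2020-03-20", "Pandemia: fique em casa, quarentena já"),
]

# Tiny illustrative stopword list, not the one used by the authors.
STOPWORDS = {"na", "do", "e", "em", "já", "de", "a", "o"}


def top_terms_by_month(tweets, n=3):
    """Map each 'YYYY-MM' month to its n most frequent non-stopword terms."""
    counts = defaultdict(Counter)
    for day, text in tweets:
        month = day[:7]  # 'YYYY-MM' prefix of an ISO date string
        tokens = re.findall(r"\w+", text.lower())
        counts[month].update(t for t in tokens if t not in STOPWORDS)
    return {month: counter.most_common(n) for month, counter in counts.items()}


top_terms = top_terms_by_month(TWEETS, n=2)
for month, terms in sorted(top_terms.items()):
    print(month, terms)
```
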
| Subject | Social Science, Health Informatics, Computer Science |
|---|---|
| Specific subject area | Mining of COVID-19-related online news and social media to understand the main topics discussed and their effects on people's lives. |
| Type of data | Text (CSV-formatted) |
| How data were acquired | Tweets and news on the COVID-19 pandemic were retrieved using a set of keywords related to this topic. We used self-made Python scripts with the Twitter Streaming API for tweets and the Requests API for news. |
| Data format | Raw |
| Parameters for data collection | Tweets and news matching a set of keywords in Portuguese, posted from the start of January to the end of May 2020. |
| Description of data collection | We collected Twitter data and news articles posted from January to May 2020 and kept only those in Portuguese. All data are provided as CSV-formatted text files, together with sample Python code to read each dataset. |
| Data source location | News: |
| Data accessibility | Repository name: Mendeley Data |
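
The collection parameters described in the table (keyword match plus a January to May 2020 window) amount to a simple filter, sketched below. The keyword tuple, the sample records, and the function name `matches` are hypothetical; the article's actual keyword list is not reproduced in this section.

```python
from datetime import date

# Hypothetical keyword set; the article's actual list is not given here.
KEYWORDS = ("coronavirus", "covid", "quarentena", "pandemia")
START, END = date(2020, 1, 1), date(2020, 5, 31)


def matches(text, posted_on, keywords=KEYWORDS, start=START, end=END):
    """True if the text mentions any keyword and the date is in [start, end]."""
    lowered = text.lower()
    return start <= posted_on <= end and any(k in lowered for k in keywords)


# Illustrative records: (text, posting date).
records = [
    ("Casos de COVID-19 aumentam", date(2020, 3, 12)),
    ("Previsão do tempo para hoje", date(2020, 3, 12)),
    ("Pandemia ainda preocupa", date(2020, 6, 2)),
]
selected = [record for record in records if matches(*record)]
print(selected)
```

Substring matching is the simplest choice here and will also match longer words that contain a keyword; a word-boundary regular expression would be stricter.
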