Yang Liu, Christopher Whitfield, Tianyang Zhang, Amanda Hauser, Taeyonn Reynolds, Mohd Anwar.
Abstract
PURPOSE: It has been over a year since the first known case of coronavirus disease (COVID-19) emerged, yet the pandemic is far from over. To date, the coronavirus pandemic has infected over eighty million people and has killed more than 1.78 million worldwide. This study aims to explore "how useful is Reddit social media platform to surveil COVID-19 pandemic?" and "how do people's concerns/behaviors change over the course of COVID-19 pandemic in North Carolina?". The purpose of this study was to compare people's thoughts, behavior changes, discussion topics, and the number of confirmed cases and deaths by applying natural language processing (NLP) to COVID-19 related data.
Keywords: COVID-19; Named-entity recognition; Natural language processing; Sentence clustering; Social media; Topic modeling
Year: 2021 PMID: 34188896 PMCID: PMC8226148 DOI: 10.1007/s13755-021-00158-4
Source DB: PubMed Journal: Health Inf Sci Syst ISSN: 2047-2501
Fig. 1 Methodological workflow
Number of posts in each of the 18 subreddits over six months, by NC landform distribution
| Subreddits | Members | March | April | May | June | July | August | Total # of posts | Landform distribution (Mountain (Western), Piedmont (Central), Coast (Eastern), or Other) | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| r/asheville | 26,620 | 1671 | 1047 | 628 | 588 | 671 | 330 | 4935 | X | |||
| r/bullcity | 16,422 | 590 | 84 | 361 | 380 | 405 | 280 | 2100 | X | |||
| r/cary | 2603 | 15 | 2 | 0 | 0 | 0 | 3 | 20 | X | |||
| r/chapelhill | 6629 | 40 | 1 | 0 | 3 | 0 | 3 | 47 | X | |||
| r/Charlotte | 58,773 | 1441 | 686 | 1430 | 428 | 441 | 368 | 4794 | X | |||
| r/CoronaNC | 2593 | 252 | 168 | 139 | 108 | 82 | 85 | 834 | X | |||
| r/elizabethcity | 117 | 16 | 5 | 0 | 0 | 0 | 0 | 21 | X | |||
| r/ENC | 411 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | X | |||
| r/fayettenam | 2180 | 48 | 27 | 2 | 2 | 8 | 28 | 115 | X | |||
| r/greenvilleNCarolina | 870 | 4 | 0 | 0 | 6 | 0 | 0 | 10 | X | |||
| r/gso | 9547 | 308 | 222 | 67 | 38 | 28 | 7 | 670 | X | |||
| r/NorthCarolina | 90,677 | 896 | 1307 | 956 | 1049 | 1167 | 1124 | 6499 | X | |||
| r/NorthCarolinaCOVID | 1278 | 126 | 64 | 59 | 33 | 43 | 27 | 352 | X | |||
| r/raleigh | 64,580 | 1825 | 812 | 802 | 636 | 584 | 777 | 5436 | X | |||
| r/triangle | 30,541 | 570 | 131 | 72 | 39 | 78 | 110 | 1000 | X | |||
| r/Wilmington | 9114 | 403 | 209 | 143 | 202 | 21 | 29 | 1007 | X | |||
| r/winstonsalem | 7330 | 255 | 104 | 105 | 57 | 37 | 8 | 566 | X | |||
| r/WNC | 2524 | 39 | 12 | 23 | 13 | 4 | 0 | 91 | X | |||
| Total | 332,809 | 8504 | 4881 | 4787 | 3582 | 3569 | 3179 | 28,502 | ||||
Fig. 2 Number of members per post in each subreddit, that is, how many members correspond to one post. There are a total of 332,809 members in the 18 subreddits. The left pie chart shows the percentage of members by geographic classification; the right pie chart shows the percentage of posts by geographic classification (example: in r/asheville, there is one post per 5 members).
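The members-per-post ratio described in the Fig. 2 caption can be reproduced from the subreddit table with simple division. The sketch below is only an illustration: the figures are copied from four rows of the table, and the function and variable names are assumptions, not part of the original study.

```python
# (members, total posts from March to August), copied from the subreddit table
subreddits = {
    "r/asheville": (26620, 4935),
    "r/Charlotte": (58773, 4794),
    "r/NorthCarolina": (90677, 6499),
    "r/raleigh": (64580, 5436),
}


def members_per_post(members: int, posts: int) -> float:
    """Return how many members correspond to one post."""
    if posts == 0:
        raise ValueError("subreddit has no posts in the period")
    return members / posts


# Hypothetical helper dictionary: subreddit name -> members per post
ratios = {name: members_per_post(m, p) for name, (m, p) in subreddits.items()}

for name, ratio in ratios.items():
    print(f"{name}: one post per {ratio:.1f} members")
```

For r/asheville the ratio is 26,620 / 4935, which rounds to the "one post per 5 members" quoted in the caption.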
Fig. 3 Distribution of the number of confirmed cases (a), deaths (b) and posts (c) from March to August
The five most similar words to Gloves, Soap, Fever, Test, and Lockdown across the three different algorithms (CBOW, Skip-Gram, and GloVe)
| Gloves (CBOW) | Gloves (Skip-Gram) | Gloves (GloVe) | Soap (CBOW) | Soap (Skip-Gram) | Soap (GloVe) | Fever (CBOW) | Fever (Skip-Gram) | Fever (GloVe) | Test (CBOW) | Test (Skip-Gram) | Test (GloVe) | Lockdown (CBOW) | Lockdown (Skip-Gram) | Lockdown (GloVe) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Save | Practice | Wear | Alcohol-based | Water | Alcohol-based | Infected | Negative | Cough | Kit | Currently | r/coronavirussc | Admit | Reasonable | California |
| Clean | Sanitize | Useless | Refrain | Sanitizer | Sleeve | Cough | Cough | Negative | Positive | Lab | r/coronavirusalabama | Eviction | Possibly | Similar |
| Completely | Wear | Sanitize | Squirt | Alcohol-based | Water | Thousand | Breath | Shortness | Case | Result | Positive | Strike | Relatively | Monger |
| Apart | Hygiene | Mask | Wipe | Bottle | Gallon | Symptom | Shortness | Ache | Confirm | Kit | Kit | Course | Stand | Compare |
| Homemade | Shake | Touch | Towel | Often | Hearsay | Yesterday | 100.4 | 100.4 | cdc | cdc | Roadblock | Vulnerable | ppl | Martial |
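Whichever algorithm produces the vectors (CBOW, Skip-Gram, or GloVe), a "most similar words" list like the one above is typically obtained by ranking the vocabulary by cosine similarity to the query word. The sketch below shows that ranking step only. The three-dimensional vectors are made-up toy values, not embeddings from the study, and `most_similar` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, vectors, k=5):
    """Return the k words whose vectors are closest to the query word."""
    q = vectors[query]
    scored = [(word, cosine(q, vec)) for word, vec in vectors.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumed values, for illustration only)
toy_vectors = {
    "fever": [0.9, 0.1, 0.0],
    "cough": [0.8, 0.2, 0.1],
    "symptom": [0.7, 0.3, 0.0],
    "lockdown": [0.0, 0.1, 0.9],
    "eviction": [0.1, 0.0, 0.8],
}

top = most_similar("fever", toy_vectors, k=2)
print(top)
```

With trained embeddings, the same ranking applied to each model's vector table would yield one column of the comparison above.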
Identification of entities for 3 mitigation types (distancing, disinfection, and PPE), and 2 detection types (symptoms and testing)
| Categories | Asheville | Categories | Charlotte | Categories | Greensboro | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| March to May | June to August | March to May | June to August | March to May | June to August | |||||||||
| Entity name | # of entities | Entity name | # of entities | Entity name | # of entities | Entity name | # of entities | Entity name | # of entities | Entity name | # of entities | |||
| DIST | 87 | 135 | DIST | 140 | 220 | DIST | 8 | 28 | ||||||
| Lockdown | 53 | Social distancing | 105 | Lockdown | 110 | Lockdown | 146 | Work from home | 2 | Lockdown | 13 | |||
| Work (from) home | 6 | Work (from)/stay home | 10 | Work (from) home | 18 | Work (from) home | 69 | Work time | 1 | Work from home | 5 | |||
| DIT | (Hand) sanitizer(s)/soap | 9 | 50 | DIT | (Hand) sanitizer/soap | 10 | 116 | DIT | Impact | 1 | 21 | |||
| 10 | Wipe | 12 | 11 | Wipe | 33 | Wipe | 1 | Wipe | 11 | |||||
| Wipe | 3 | Bleach | 11 | Wipe | 3 | Lysol | 25 | Profit | 1 | Hygiene | 7 | |||
| PPE | 973 | 574 | PPE | 952 | 707 | PPE | 117 | 125 | ||||||
| n95(s)/kn95 | 46 | Glove | 37 | n95(s) | 24 | n95(s) | 51 | Cloth | 3 | Glove | 9 | |||
| Glove | 17 | n95(s)/kn95 | 32 | Cloth/gown | 9 | Cloth/gown | 36 | Glove | 3 | n95(s) | 9 | |||
| SYM | 229 | 225 | SYM | 320 | 511 | SYM | 7 | 43 | ||||||
| Flu/influenza | 99 | Flu/influenza | 211 | Flu/influenza | 81 | Flu/influenza | 316 | Death | 4 | Death | 29 | |||
| Coughed | 30 | Coughed | 62 | Cough | 25 | Cough | 108 | Coughed | 2 | Coughed | 15 | |||
| TEST | 320 | Cases | 436 | TEST | 529 | Cases | 992 | TEST | Cases | 12 | Cases | 122 | ||
| (Antibody) test | 277 | 648 | (Antibody) test | 324 | 1170 | 21 | 188 | |||||||
| Test result | 260 | Test result | 208 | Test result | 182 | Test result | 421 | Test result | 16 | Test result | 80 | |||
Most distinct and frequently mentioned entities are in bold
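To make the per-category entity counts concrete, the sketch below tags mentions with a small keyword dictionary and tallies them by category. This is a simplified stand-in, not the custom NER model used in the paper: the category labels follow the table, while the keyword patterns, the sample posts, and the `count_entities` helper are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumed keyword patterns per category; longer alternatives are listed first
# so that "test result" is matched before the shorter "test".
ENTITY_PATTERNS = {
    "DIST": r"lockdown|social distancing|work(?:ing)? from home|stay(?:ing)? home",
    "DIT": r"(?:hand )?sanitizers?|soap|wipes?|bleach|lysol",
    "PPE": r"k?n95s?|gloves?|masks?|gowns?",
    "SYM": r"flu|influenza|cough(?:ed|ing)?|fever",
    "TEST": r"antibody tests?|test results?|tests?|cases?",
}

COMPILED = {
    category: re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)
    for category, pattern in ENTITY_PATTERNS.items()
}


def count_entities(posts):
    """Return {category: Counter(entity -> count)} over a list of post texts."""
    counts = {category: Counter() for category in COMPILED}
    for text in posts:
        for category, regex in COMPILED.items():
            for match in regex.finditer(text):
                counts[category][match.group(0).lower()] += 1
    return counts


# Made-up example posts, not taken from the dataset
sample_posts = [
    "Lockdown again? I ran out of hand sanitizer and N95 masks.",
    "My test result came back negative, but I still have a cough.",
]

entity_counts = count_entities(sample_posts)
for category, counter in entity_counts.items():
    print(category, sum(counter.values()), dict(counter))
```

A trained NER model would generalize beyond a fixed keyword list; the dictionary approach is shown only because it makes the counting step easy to follow.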
Fig. 4 Word clouds representing each topic found using LDA topic modeling. The larger the word, the more significant it is within that topic
Sample of BERT sentences clustering on different topics of subreddits
| Subreddit | Time period | Topics | Sentence sample |
|---|---|---|---|
| Asheville | March–May | Concerns | |
| | | Spread of virus | ***Try |
| | | Impacts | That means right now the average North Carolinian on |
| | June–August | Concerns | I don't understand |
| | | Impacts | In June of 2020 what are we afraid of? Hospitals have had time to prepare. Cases will |
| | | Testing | Literally the least shocking thing ever. And yet Buncombe County has suspended all |
| Charlotte | March–May | Lockdown | Doubt it, remember when Florida |
| | | Impacts | So, in a few weeks, US |
| | | Spread of virus | Ummm…we have a lower infection rate because of the |
| | June–August | Spread of virus | In my view, the only good news from these charts is that growth appears linear rather than exponential. However, we shouldn’t be complacent in thinking that this is an inherent quality of the virus. If we relax controls enough and a seasonal affect is removed, we will likely see |
| | | Reopen | I think it would be nearly |
| | | Lockdown | We all stayed |
| Greensboro | March–May | Testing | There's no third option. It's reclosures or much more |
| | | Lockdown | Go do some real reporting and find out why we |
| | | Spread of virus | One scenario without lockdown: 1k get it in March, 2k in April, and the remaining 7k in May. Given this scenario, we peak in May with 7k cases all at once |
| | June–August | Politics | I'm not trying to change anyone's mind on masks. There's no changing anyone's mind about it if we have conflicting guidance between our doctors/scientists and our |
| | | Lockdown | So, this is not regular time. Everything isn't hunky dory. We don't need another person to get infected and then go out and infect other people. |
| | | Impacts | The $600 Pandemic Unemployment Assistance (PUA) from the federal government runs out on 7/31. Your regular state benefits will continue past that date |
| Raleigh | March–May | Impacts | We have to |
| | | Reopen | They |
| | | Lockdown | My point was they were smart. They are now prepared for even bigger |
| | June–August | Spread of virus | What you see now, however, is a clearly disturbing trend: |
| | | Impacts | Still, |
| | | Reopen | Bull and Bear is operating way outside the guidelines. Gyms are the last place to be |
| Wilmington | March–May | Reopen | We know who is at risk, yet we |
| | | Testing | Widespread |
| | | Lockdown | No. |
| | June–August | Spread of virus | New Hanover has |
| | | Reopen | Other countries were told to stay home and they did. Now they're |
| | | Testing | When you look at the percentage of positives against the |
| North Carolina | March–May | Reopen | I want NC to " |
| | | Testing | And it frankly pisses me off that there isn't way more |
| | | Spread of virus | These people are underestimating how easily this |
| | June–August | Spread of virus | * NC is currently |
| | | Reopen | You do realize that people have |
| | | Impacts | As far as the costs. What are the economic costs of 231,000,000 |
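Sentence clustering of this kind generally works by mapping each sentence to an embedding vector and then grouping nearby vectors, for example with K-means. The sketch below covers only the grouping step in plain Python. The two-dimensional points are toy stand-ins for sentence embeddings (real BERT embeddings have hundreds of dimensions and come from a model not shown here), and the `kmeans` function is an illustrative assumption rather than the authors' implementation.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means: returns (labels, centroids) for a list of vectors."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy "sentence embeddings": two well-separated groups (assumed values)
toy_embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # e.g. sentences about testing
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # e.g. sentences about reopening
]

labels, centroids = kmeans(toy_embeddings, k=2)
print(labels)
```

Once sentences are grouped, a topic name such as "Testing" or "Reopen" still has to be assigned to each cluster by inspecting its members.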
Comparison of state-of-the-art methods
| Objective | References | Data source | Method |
|---|---|---|---|
| To measure and monitor citizens’ concern levels using public sentiments in Twitter data | Chun et al. [ | | NLP and case fatality rate (CFR) |
| To retrieve articles related to COVID-19 | Das et al. [ | A corpus of scientific articles | Graph community detection and Bio-BERT embeddings |
| To utilize NLP for the analysis of public health applications | Conway et al. [ | Reddit, Microblog, Instagram, etc. | Literature review |
| To characterize the media coverage and collective internet response to the COVID-19 in four countries | Gozzi et al. [ | Reddit and Wikipedia | Linear regression model, nonnegative matrix factorization |
| To characterize people’s responses to COVID-19 on two Reddit communities | Zhang et al. [ | 2 subreddits on Reddit | Classification, Fightin' words model, |
| To leverage NLP to characterize changes in mental and non-mental health support groups during the initial stage of the pandemic | Low et al. [ | | NLP, unsupervised clustering, topic modeling, similarity |
| To predict the general sentiment polarity of the COVID-19 related news on Reddit before a news article is published | Dheeraj [ | | Sentiment analysis |
| To understand the patient mental health through the stages of COVID-19 illness | Murray et al. [ | | Topic modeling, sentiment analysis, clustering |
| To understand the public’s concerns around coronavirus and identify future opportunities for medical experts to leverage the Reddit in communicating with the general public | Lai et al. [ | 1 subreddit on Reddit | Retrospective content analysis |
| To track public priorities and concerns regarding COVID-19 | Stokes et al. [ | | Topic modeling |
| To explore “how useful is Reddit social media platform to surveil COVID-19 pandemic?” and “how do people’s concerns/behaviors change over the course of COVID-19 pandemic in North Carolina?” | Our paper | 18 subreddits of North Carolina on Reddit | NLP, word embedding, similarity, topic modeling, custom NER, BERT-based clustering, K-means clustering |