Julia Wu, Venkatesh Sivaraman, Dheekshita Kumar, Juan M Banda, David Sontag.
Abstract
The rapid evolution of the COVID-19 pandemic has underscored the need to quickly disseminate the latest clinical knowledge during a public-health emergency. One surprisingly effective platform for healthcare professionals (HCPs) to share knowledge and experiences from the front lines has been social media (for example, the "#medtwitter" community on Twitter). However, identifying clinically relevant content in social media without manual labeling is a challenge because of the sheer volume of irrelevant data. We present an unsupervised, iterative approach to mine clinically relevant information from social media data, which begins by heuristically filtering for HCP-authored texts and incorporates topic modeling and concept extraction with MetaMap. This approach identifies granular topics and tweets with high clinical relevance from a set of about 52 million COVID-19-related tweets from January to mid-June 2020. We also show that because the technique does not require manual labeling, it can be used to identify emerging topics on a week-to-week basis. Our method can aid in future public-health emergencies by facilitating knowledge transfer among healthcare workers in a rapidly changing information environment, and by providing an efficient and unsupervised way of highlighting potential areas for clinical research.
Keywords: Clinical concept extraction; Data mining; Information retrieval; Public health surveillance; Social media; Topic modeling
Year: 2021 PMID: 34153432 PMCID: PMC9339268 DOI: 10.1016/j.jbi.2021.103844
Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 8.000
Paraphrased example tweets from the HCP-authored subset of the dataset.
Fig. 1. Topics generated after three iterations of relevance filtering (i.e. from ). The upper section shows the top 40 highest-scoring topics, while the lower section shows the 10 lowest-scoring topics. Both sections are sorted vertically in order of the date of maximum intensity. The heat map colors indicate the popularity of each topic per day, with yellow representing the peak of popularity for the topic.
Fig. 2. Comparison of the iterative relevance filtering method with traditional and seeded LDA approaches by two metrics: the proportion of topic words annotated by MetaMap and the UMass coherence score. Error bars and shaded regions indicate 95% confidence intervals around the mean over 5 trials.
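For readers unfamiliar with the UMass coherence score named in Fig. 2, the following is a minimal sketch of the commonly cited document co-occurrence formulation, in which each ordered pair of topic words contributes log((D(w_m, w_l) + 1) / D(w_l)) and D counts the documents containing the given words. The source does not describe how the score was computed for the paper, so the function name `umass_coherence`, the tokenized-document input format, and the handling of words absent from the corpus are all assumptions made for illustration.

```python
import math


def umass_coherence(topic_words, documents):
    """Sketch of UMass coherence for a single topic.

    topic_words: list of words, ordered from most to least probable.
    documents:   list of tokenized documents (each a list of words).
    Hypothetical helper; not the implementation used in the paper.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            d_l = doc_freq(topic_words[l])
            if d_l == 0:
                # Assumption: skip conditioning words never seen in the corpus
                # to avoid dividing by zero.
                continue
            d_ml = doc_freq(topic_words[m], topic_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score


# Toy corpus, invented purely for illustration.
docs = [
    ["patient", "icu", "ventilator"],
    ["patient", "icu"],
    ["mask", "wear"],
    ["patient", "mask"],
]
print(umass_coherence(["patient", "icu", "ventilator"], docs))
```

Scores closer to zero indicate that a topic's top words tend to appear together in the same documents; topics mixing unrelated words receive more negative scores.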
Comparison of 20-topic models generated by various approaches: (a) SeededLDA seeded with the top 20 most relevant clinical concepts; (b) SeededLDA seeded with the first three words of the 20 highest-scoring topics in Fig. 1; (c) a traditional MALLET model on the initial HCP-authored tweet corpus; and (d) a MALLET model trained on the final filtered tweet corpus. Words provided as seeds to the SeededLDA algorithm are italicized. Topics are presented in the arbitrary order output by each algorithm.
| (a) SeededLDA with Concepts | (b) SeededLDA with Topics |
| (c) Traditional LDA | (d) Iterative Relevance Filtering |
| week, good, dont, thing, year | risk, high, blood, heart, lung |
| open, travel, close, reopen, food | sars, cell, viral, human, ace |
| business, government, job, crisis, pay | people, flu, death, die, rate |
| dr, today, live, question, pm | patient, treatment, evidence, good, life |
| test, positive, contact, symptom, people | patient, pt, icu, ventilator, require |
| vaccine, study, patient, treatment, research | symptom, cough, respiratory, fever, droplet |
| time, child, family, school, feel | patient, health, doctor, hospital, care |
| health, pandemic, public, response, state | immune, response, infection, level, cytokine |
| lockdown, fight, life, india, save | study, data, outcome, mortality, group |
| people, die, life, kill, stop | drug, trial, hydroxychloroquine, treatment, remdesivir |
| case, death, report, number, day | test, antibody, positive, result, negative |
| news, china, world, country, uk | mask, wear, face, protect, public |
| pandemic, community, change, impact, risk | day, case, symptom, infection, report |
| work, support, great, team, pandemic | disease, severe, child, infectious, syndrome |
| read, data, article, important, great | patient, cancer, pandemic, treatment, impact |
| patient, care, hospital, doctor, medical | vaccine, trial, develop, plasma, recover |
| disease, infection, risk, flu, sars | work, great, dr, article, today |
| mask, home, stay, spread, social | patient, pneumonia, clinical, ct, finding |
| late, learn, information, check, free | transmission, asymptomatic, spread, infection, contact |
| trump, american, president, medium, house | people, dont, question, lot, time |
Words and phrases that were designated most (upper half) and least (lower half) relevant according to Eq. 2 (relevance values shown in parentheses). The first column indicates the initial relevance estimates computed after heuristic author filtering; the right two columns list the relevances after three rounds of filtering with and without concept annotations.
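| Rank | Initial (after heuristic author filtering) | After three rounds, with concept annotations | After three rounds, without concept annotations |
| 1 | medtwitter ( | cells ( | cells ( |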
| 2 | publichealth ( | lung ( | mortality ( |
| 3 | physicians ( | hcq ( | rate ( |
| 4 | patients with ( | trial ( | asymptomatic ( |
| 5 | pts ( | patients with ( | rate of ( |
| 6 | clinical ( | severe ( | fatality ( |
| 7 | physician ( | respiratory ( | mild ( |
| 8 | icu ( | blood ( | growth rate ( |
| 9 | surgery ( | antibodies ( | viral ( |
| 10 | md ( | ards ( | ace2 ( |
| 1 | f∗∗∗ ( | business ( | trump ( |
| 2 | s∗∗∗ ( | leadership ( | business ( |
| 3 | f∗∗∗ing ( | businesses ( | bbc news ( |
| 4 | petition ( | students ( | bbc ( |
| 5 | the petition ( | government ( | president ( |
| 6 | rt ( | trump ( | news coronavirus ( |
| 7 | sign the petition ( | crisis ( | bbc news coronavirus ( |
| 8 | democrats ( | county ( | the latest ( |
| 9 | sign the ( | pm ( | businesses ( |
| 10 | the petition via ( | economy ( | amid ( |
Fig. 3. Fraction of tweets preserved by two different models (with and without using clinical concepts for relevance filtering) for each of 12 pre-defined tweet categories, including clinically relevant subjects (first seven bars) and irrelevant ones (next five). The progressively darker bars represent successive stages of filtering. The parenthesized number indicates the number of tweets that fell into each category. The total fractions for the relevant and irrelevant categories are shown in the right panel.
Fig. 4. Incidence of selected topics of clinical interest in HCP-authored tweets, time-limited topic models, and academic publications. The number of tweets containing the bolded keywords is plotted in blue as a 7-day rolling sum (note the different y-axis scales), while the red heatmap bar shows the number of new publications in each week. Time intervals are highlighted in green if the topic keywords appear in the topic model for that interval (specifically, in the top ten words for each topic). The black carets mark the date of the first tweet (paraphrased at left of each plot) and the first publication relevant to the topic.
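The 7-day rolling sum used for the tweet counts in Fig. 4 can be illustrated with a short sketch. It assumes a trailing window (each day's value covers that day and the six preceding days); the source does not say whether the window is trailing or centered. The function name `rolling_sum_7day` and the example dates are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def rolling_sum_7day(tweet_dates, start, end):
    """Return {day: count of tweets in the 7 days ending on that day}.

    tweet_dates: iterable of datetime.date, one entry per matching tweet.
    Assumes a trailing window; a centered window would shift the curve.
    """
    daily = Counter(tweet_dates)  # missing days count as zero
    result = {}
    day = start
    while day <= end:
        result[day] = sum(daily[day - timedelta(days=k)] for k in range(7))
        day += timedelta(days=1)
    return result


# Invented dates, for illustration only.
dates = [date(2020, 3, 1), date(2020, 3, 1), date(2020, 3, 5), date(2020, 3, 8)]
sums = rolling_sum_7day(dates, date(2020, 3, 1), date(2020, 3, 9))
for day, total in sums.items():
    print(day.isoformat(), total)
```

Summing over a week smooths out day-of-week effects in posting volume, which makes a sustained rise in a keyword easier to see than in raw daily counts.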
Fig. 5. Topics containing the same surfaced clinical concepts plotted across time. Topics containing the concept anosmia from the time intervals 3/8–3/22, 3/15–3/29, and 3/22–4/05 are shown at the top. Topics containing the concept thrombosis from the time intervals 3/29–4/12, 4/05–4/19, and 4/19–5/3 are shown at the bottom. Each topic is rendered as a word cloud; the size of each word reflects its weight in the topic. A paraphrased tweet belonging to each topic is also shown, along with the topic’s clinical relevance ranking relative to other topics in that time interval.