
Social Media as a Sentinel for Disease Surveillance: What Does Sociodemographic Status Have to Do with It?

Elaine O Nsoesie, Luisa Flor, Jared Hawkins, Adyasha Maharana, Tobi Skotnes, Fatima Marinho, John S Brownstein.

Abstract

INTRODUCTION: Data from social media have been shown to have utility in augmenting traditional approaches to public health surveillance. Quantifying the representativeness of these data is needed for making accurate public health inferences.
METHODS: We applied machine-learning methods to explore spatial and temporal dengue event reporting trends on Twitter relative to confirmed cases, and quantified associations with sociodemographic factors across three Brazilian states (São Paulo, Rio de Janeiro, and Minas Gerais) at the municipality level.
RESULTS: Education and income were positive predictors of dengue reporting on Twitter. In contrast, municipalities with a higher percentage of older adults, and males were less likely to report suspected dengue disease on Twitter. Overall, municipalities with dengue disease tweets had higher mean per capita income and lower proportion of individuals with no primary school education.
CONCLUSIONS: These observations highlight the need to understand population representation across locations, age, and racial/ethnic backgrounds in studies using social media data for public health research. Additional data is needed to assess and compare data representativeness across regions in Brazil.

Keywords:  Brazil; Twitter; dengue; disease surveillance; infectious disease; social media; sociodemographic status; socioeconomic factors

Year:  2016        PMID: 28123858      PMCID: PMC5222536          DOI: 10.1371/currents.outbreaks.cc09a42586e16dc7dd62813b7ee5d6b6

Source DB:  PubMed          Journal:  PLoS Curr        ISSN: 2157-3999


INTRODUCTION

Dengue is a mosquito-borne disease transmitted between humans by infected Aedes mosquitoes [1] and is a major cause of illness and death in many tropical and subtropical regions [2]. Despite improvements in disease surveillance and investments in mosquito control programs, dengue remains a major public health threat in many countries [3, 4, 5, 6]. Efforts at improving surveillance have explored non-traditional data sources, including crowd-generated approaches using mobile phones and social media [7, 8], and Internet search query data [9, 10, 11]. These systems have the potential to capture mild infections not requiring medical attention, and enable the ascertainment of the probable temporal and spatial distribution of cases prior to official reports of disease. Where available, disease reports on social media platforms also have the advantage of having geographical coordinates (latitude and longitude), enabling estimation of the probable location of the case report, and potential for prompt response and vector control. Despite this resource, in-depth analyses at fine geographical resolutions to understand temporal and spatial variation of dengue reporting using these non-traditional sources, and an understanding of key sociodemographic determinants, are lacking. Using geotagged dengue disease event tweets from October 2012 to December 2014, we explore spatio-temporal dengue event reporting trends on Twitter relative to confirmed cases, and quantify key sociodemographic determinants across three Brazilian states (São Paulo, Rio de Janeiro, and Minas Gerais) at the municipality level. Brazil’s comprehensive dengue surveillance system covers over 200 million individuals [12, 13], thereby enabling a detailed assessment of this data resource.

RESULTS

To extract major features distinguishing irrelevant and relevant (i.e. suspected dengue disease) tweets, we considered emojis, location information (state, county and micro-region), unigrams, bigrams and trigrams. We compared three machine-learning classifiers: Support Vector Machines (with linear, sigmoid and radial basis function kernels), Naïve Bayes and Maximum entropy. The accuracy of the different classifiers was evaluated using a sample of the data and a five-fold cross validation process. The Naïve Bayes classifier based on a feature set combining text unigrams and emojis performed best (Table 1). The precision, recall and F1-score (a measure of accuracy) for relevant tweets were 75.20%, 80.51% and 77.52%, respectively, and 90.23%, 87.34% and 88.66%, respectively, for irrelevant tweets. The macro-averaged precision, recall and F1-score of the system were 82.72%, 83.93% and 82.72%. The most significant text features (Table 2) suggest that individuals are more likely to tweet about dengue if a family member or associate is ill, or to express sadness or pain, or to discuss death due to dengue disease.
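The five-fold cross validation mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the function names (`k_fold_indices`, `cross_validate`), the toy majority-class "classifier" and the ten example labels are all assumptions made for illustration. It only shows the partition, train, validate and average loop.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx

def cross_validate(items, labels, train_fn, score_fn, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(items), k, seed):
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [items[i] for i in val_idx],
                               [labels[i] for i in val_idx]))
    return sum(scores) / len(scores)

# Hypothetical stand-in classifier: always predict the most common training label.
def train_majority(items, labels):
    return max(set(labels), key=labels.count)

def accuracy(model, items, labels):
    return sum(1 for y in labels if y == model) / len(labels)

# Hypothetical labels for ten tweets (not data from the study).
toy_labels = ["irrelevant"] * 8 + ["relevant"] * 2
toy_items = list(range(len(toy_labels)))
mean_accuracy = cross_validate(toy_items, toy_labels, train_majority, accuracy, k=5)
```

Any real classifier can be substituted for `train_majority` as long as `train_fn` returns a model and `score_fn` scores it on the held-out fold.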
Table 1

Comparison of classifier performance

Classifier with Feature Sets | Naive Bayes (Unigrams Only) | Naive Bayes (Unigrams + Emojis) | Naive Bayes (Unigrams + Emojis + Bigrams) | Linear SVM (Unigrams + Emojis)
Accuracy (%) | 84.45 | 85.06 | 86.22 | 84.91
Precision (%) | 74.21 | 75.20 | 81.20 | 77.17
Recall (%) | 79.74 | 80.51 | 75.53 | 76.02
True Negative Rate (%) | 86.83 | 87.34 | 91.48 | 89.17
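The metrics in Table 1 all derive from a two-class confusion matrix. The sketch below shows the standard formulas. The counts are hypothetical (the source does not report the confusion matrix), and `class_metrics` is an illustrative name. Note that the true negative rate for the relevant class is the same quantity as the recall of the irrelevant class, which is consistent with the value of 87.34 appearing in both roles in the text and in Table 1.

```python
def class_metrics(tp, fp, fn, tn):
    """Return metrics for the 'positive' class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "true_negative_rate": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts, treating 'relevant' tweets as the positive class.
tp, fp, fn, tn = 80, 25, 20, 175

relevant = class_metrics(tp, fp, fn, tn)
# Swapping the roles of the two classes gives the 'irrelevant' metrics.
irrelevant = class_metrics(tp=tn, fp=fn, fn=fp, tn=tp)

# Macro-averaging: the unweighted mean of the per-class values.
macro = {name: (relevant[name] + irrelevant[name]) / 2
         for name in ("precision", "recall", "f1")}
```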
Table 2

The most representative unigram features

Features - 'Irrelevant' tweets | English Translation | Features - 'Sickness' tweets | English Translation
parado | stopped | irmã / irmão / vo / mãe / pai / prima / professor | sister / brother / grandfather / mother / father / cousin / professor
mosquito | mosquito | dores / dói | pains / it hurts
ebola | ebola | :( ... | emojis
copa | cup | hospital | hospital
dando | giving | morrendo | dying
veneno | poison | repouso | rest
saúde | cheers | resultado | result
Suspected dengue cases were reported on Twitter from all states in Brazil from 2013 to 2015 (Figure SI 1), and the highest volume of reports originated from São Paulo (3204 reports; 41.39%), Rio de Janeiro (1368 reports; 17.67%) and Minas Gerais (1025 reports; 13.24%). These reports were distributed across 254 (39.38%), 143 (16.78%) and 64 (69.57%) municipalities in São Paulo, Minas Gerais and Rio de Janeiro, respectively. The tweet volume was much lower than the dengue case volume, and densely populated municipalities tended to have higher dengue case and tweet volumes across all states (Figure 1).

Figure 1. Spatial variation of case and tweet volume by municipality across the states of São Paulo, Minas Gerais, and Rio de Janeiro for 2013-2014.

Sociodemographic Analysis

The best multivariable logistic model to predict the occurrence of dengue tweets included population density and the percentages of individuals with higher education, older adults (defined as 60 and above), and males. A high percentage of individuals with higher education at the municipality level was positively associated (0.14, 95% CI [0.11, 0.16]) with dengue reporting on Twitter. In contrast, a higher percentage of older adults (-0.12, 95% CI [-0.18, -0.06]) and of males (-0.32, 95% CI [-0.47, -0.18]) at the municipality level was negatively associated with the observation of a dengue tweet. Compared to the other variables, population density was only mildly predictive (p=0.04) of dengue disease reporting on Twitter. Additionally, a 1% increase in income was associated with a 2.89% increase in the odds of observing a dengue tweet in a municipality. These differences were more marked for municipalities in Rio de Janeiro compared to Minas Gerais.

Figure 2. Comparison of the distribution of (a) mean per capita income; (b) percent population 60 years and older; (c) percent population without basic education; and (d) percent population identified as male between municipalities with and without tweets.

Temporal Analysis

We fit univariate linear regression models to state-level dengue case data, with weekly tweet volume as the independent variable for each of the three states. Although fewer than 50% of municipalities accounted for the tweets in two of the three states, weekly tweet volume explained 53.65% (correlation (r) = 0.74), 85.69% (r = 0.93) and 67.98% (r = 0.82) of the variance observed in the confirmed weekly dengue cases for the states of São Paulo, Minas Gerais and Rio de Janeiro, respectively.
Univariate linear regression models fit to weekly tweet volume for the municipalities of São Paulo in São Paulo, Belo Horizonte in Minas Gerais, and Rio de Janeiro in Rio de Janeiro had similar outcomes. Weekly tweet volume from the municipality of São Paulo explained 77.47% (r = 0.88) of the variance observed in the confirmed weekly case data (Figure 3(a) and (b)). Similarly, weekly tweet volume for Belo Horizonte and Rio de Janeiro separately explained 81.41% (r = 0.90) and 56.36% (r = 0.68) of the variance observed in the confirmed weekly case data.
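In a univariate linear regression with an intercept, the share of variance explained (R²) equals the square of the Pearson correlation r. For example, 85.69% of variance explained corresponds to r ≈ 0.93. The sketch below illustrates that relationship with an ordinary least-squares fit in plain Python. The weekly series are hypothetical, not data from the study, and the function names are illustrative. The authors report using R for these analyses.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fit_univariate(x, y):
    """Least-squares fit y = intercept + slope * x; returns (slope, intercept, R^2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical weekly counts for one municipality.
weekly_tweets = [2, 5, 9, 14, 11, 6, 3, 1]
weekly_cases = [40, 90, 210, 300, 260, 150, 70, 30]

slope, intercept, r_squared = fit_univariate(weekly_tweets, weekly_cases)
r = pearson_r(weekly_tweets, weekly_cases)
```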

Figure 3. (a) and (b) are scaled weekly volume of tweets of suspected dengue disease and confirmed dengue cases for the municipality of São Paulo, São Paulo, respectively. (c) univariate linear regression model of weekly dengue cases fitted against weekly suspected dengue disease tweets.

Dengue cases peaked a week prior to the suspected dengue disease tweets for both São Paulo and Belo Horizonte municipalities. In contrast, the dengue tweets peaked two weeks prior to dengue cases for Rio de Janeiro, suggesting tweets could be predictive of dengue case volume. Additionally, weekly volume of tweets of suspected dengue cases captured dynamical changes in reported cases, which differed significantly across municipalities in the same state (e.g., São Paulo (Figure 3(b)) and Santos (SI 6) in the state of São Paulo). However, such associations were only observed for municipalities with a high tweet volume, suggesting that state-level aggregation of such data excludes some municipalities with confirmed dengue cases.
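One way to examine whether tweets lead or trail cases is to compare peak weeks and to compute the correlation at several weekly lags. The sketch below is an illustration under assumptions, not the authors' procedure: the source does not say how peak timing was determined, the two series are hypothetical, and the function names are invented for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def peak_week(series):
    """Index of the week with the largest count."""
    return max(range(len(series)), key=series.__getitem__)

def lagged_correlation(tweets, cases, lag):
    """Correlate tweets in week t with cases in week t + lag (lag may be negative)."""
    if lag >= 0:
        x, y = tweets[:len(tweets) - lag], cases[lag:]
    else:
        x, y = tweets[-lag:], cases[:len(cases) + lag]
    return pearson_r(x, y)

# Hypothetical series in which cases trail tweets by two weeks.
tweets = [1, 2, 4, 9, 15, 8, 4, 2, 1, 1]
cases = [10, 10, 10, 20, 40, 90, 150, 80, 40, 20]

# Positive values mean tweets peak before cases.
lead_weeks = peak_week(cases) - peak_week(tweets)
best_lag = max(range(-3, 4), key=lambda lag: lagged_correlation(tweets, cases, lag))
```

A positive `best_lag` indicates that tweet volume tends to precede case volume, as reported for Rio de Janeiro. A negative value indicates the reverse, as reported for São Paulo and Belo Horizonte.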

DISCUSSION

Real-time reports of dengue on social media can potentially be used to improve disease response time, resulting in quicker control efforts and mitigation of disease spread. Although the majority of tweets are suspected cases, laboratory-confirmed cases are also reported, and real-time reports provide timely updates for situational awareness [14], which is necessary due to weekly or monthly delays in dengue case reporting in Brazil and other endemic regions. Inequality and low mean per capita income have been associated with dengue mortality in Brazil [15]. Furthermore, males and people older than 69 years had a higher mortality rate from neglected tropical diseases, when compared to other populations in Brazil from 2000 to 2011 [16]. Our results indicate that these populations (less educated individuals, males, and people older than 60) are less likely to tweet about dengue disease. This suggests that social media might not be an adequate supplement to traditional public health surveillance for these populations. The rapid penetration of the Internet and mobile phone technology has provided a great opportunity for improving data collection in data-poor regions. However, different communities use varied forms of technology to communicate, and some portions of the population (e.g., individuals with little or no basic education) might lack access to, and the knowledge to use, certain technologies. Therefore, studies that aggregate these data across spatial and temporal scales may only represent major cities or regions with higher education and income, thereby excluding poorer regions.
A limitation of this study is that only approximately 1 to 4% of tweets are geotagged, thereby leading to a small sample size for most municipalities. In addition, some of the suspected dengue cases are likely due to other disease etiologies, and a denominator for scaling the tweet volume is unavailable. Additional data are needed to explore representativeness and differences across regions in Brazil.
Despite these limitations, significant correlations were observed between tweets and actual case reports. Two approaches for improving the utility of these data for public health surveillance are to integrate data from different sources, and to develop methods that improve estimates in data-poor scenarios so that poor and at-risk populations are represented [17]. Participatory surveillance systems could be useful in supplementing these data if at-risk individuals can be convinced to participate. In addition to surveillance, these data can be used for seeding mathematical models for assessment of control strategies and real-time updates of disease occurrence reports [18]. Suspected cases in a municipality can be later confirmed as additional data become available. If combined with mobility, environmental and socioeconomic covariates, there is potential for assessing the likely spread and quantifying the impact of different intervention methods during ongoing disease epidemics, such as Zika and chikungunya, that share the same vectors as dengue [19]. Our results suggest that populations that have been shown to have a higher dengue mortality risk are also less likely to tweet about dengue. Studies aiming at augmenting dengue surveillance using these data should make careful inferences, while accounting for the caveats associated with this data resource, including the underrepresentation of specific populations.

MATERIALS AND METHODS

Dengue Case Data

De-identified dengue case reports were provided by the Brazilian Ministry of Health for October 2012 to December 2014. We further aggregated the data to daily and weekly totals for each municipality and state. The cases comprised dengue hemorrhagic fever, dengue shock syndrome and dengue fever.

Dengue Reports from Twitter

We extracted from Twitter, a social networking site, a subset of tweets containing the term “dengue” or hashtags with dengue (e.g., #eutenhodengue) posted between October 2012 and May 2015 for Brazil. This was done by 1) writing a custom script in PHP to access the free Twitter Public API to collect the maximum allowed number of tweets (up to 1% of total volume) with any geographical coordinates (either tweet coordinates or place coordinates), and then 2) restricting to those tweets with coordinates that were within the geographic bounding box for Brazil.

Tweet Classification

We developed a large manually curated sample of tweets by classifying each tweet as irrelevant, official report, or relevant (suspected dengue disease case). Two curators independently classified each tweet, and tweets with curator agreement (8,000 of 10,116) were used to train a machine learning classifier and to assess human-machine agreement. A standard two-step classification approach, involving pre-processing and evaluation of three machine-learning classifiers (Support Vector Machines with linear, sigmoid and radial basis function kernels; Naïve Bayes; and Maximum entropy), was used. All manually classified tweets were assigned to a training or test set. Each tweet was pre-processed and represented as a feature vector. This involved tokenization (separation of sentences into individual words), stemming, and removal of stop words and common words not typically useful for classification.
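The pre-processing described above (tokenization, stop-word removal, and representation as a feature vector of unigrams and emoji-like tokens) can be sketched as follows. This is a simplified illustration, not the authors' pipeline. The stop-word list and the emoticon pattern are small hypothetical stand-ins, `extract_features` is an invented name, and stemming is omitted for brevity.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny Portuguese stop-word list.
STOP_WORDS = {"a", "o", "e", "de", "da", "do", "em", "com", "que", "minha", "meu"}

# Matches simple text emoticons such as :( ;) :-) ; real emoji handling would need more.
EMOTICON = re.compile(r"[:;]['-]?[()]")

# Runs of letters only (Unicode-aware), so accented words like "mãe" stay intact.
WORD = re.compile(r"[^\W\d_]+")

def extract_features(tweet):
    """Return a bag-of-features Counter of unigrams plus emoticon markers."""
    emoticons = EMOTICON.findall(tweet)
    # Remove emoticons before tokenizing, then lowercase the remaining text.
    text = EMOTICON.sub(" ", tweet).lower()
    tokens = [t for t in WORD.findall(text) if t not in STOP_WORDS]
    features = Counter(tokens)
    features.update("EMO_" + e for e in emoticons)
    return features

# Hypothetical tweet: "My mother has dengue :("
example = extract_features("Minha mãe está com dengue :(")
```

The resulting counts could then be passed to any of the classifiers named above. Features such as "mãe" and ":(" correspond to the kinds of unigrams listed for 'Sickness' tweets in Table 2.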
To extract major features distinguishing irrelevant and relevant tweets, we considered emojis, location information (state, county and micro-region), unigrams, bigrams and trigrams. The accuracy of the different classifiers was evaluated using the test data and a five-fold cross validation process. The cross validation involves randomly partitioning the data into a training and a validation set prior to applying the classifiers. This process is repeated five times and the results are averaged. The best performing machine learning classifier was applied to the 14,611 unclassified tweets in the database and 2,207 tweets with curator disagreement. All tweets were reverse geo-located to extract the municipality and state of origin. Python was used for these analyses.

Spatio-temporal Analysis

We used the resulting dataset (manually tagged and machine-classified tweets) to describe spatial and temporal trends in reporting, and evaluated key sociodemographic determinants of the reporting of dengue on Twitter using logistic regression, after considering a mixed effects logistic regression model. The response variable was coded as one if reports of dengue on Twitter could be mapped to a municipality and zero otherwise. We explored different combinations of the six covariates from the Brazilian census (www2.datasus.gov.br): sex (male or female), age (under five, five to fourteen, fifteen to thirty-nine, forty to fifty-nine, and sixty and above), race (white, brown, black, yellow, indigenous and undeclared), level of education (uneducated or incomplete elementary cycle, complete primary cycle or 2nd cycle incomplete, and 2nd cycle complete or more), mean per capita income and population density at the municipality level. Since the levels of the various variables were highly correlated, we evaluated four models whose main differences were in the level of the education variable and the age group considered.
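Logistic regression coefficients are usually easier to interpret as odds ratios. The sketch below assumes that the estimates given in the Results (0.14, -0.12 and -0.32, with their 95% confidence intervals) are coefficients on the log-odds scale per one-unit increase in the covariate. The source does not state the scale explicitly, so this reading is an assumption. The helper names are illustrative.

```python
from math import exp

def odds_ratio(beta, ci_low, ci_high):
    """Convert a log-odds coefficient and its CI bounds to the odds-ratio scale."""
    return exp(beta), exp(ci_low), exp(ci_high)

def percent_change_in_odds(beta):
    """Percent change in the odds for a one-unit increase in the covariate."""
    return (exp(beta) - 1) * 100

# Estimates as reported in the Results; interpreting them as log-odds is an assumption.
estimates = {
    "percent with higher education": (0.14, 0.11, 0.16),
    "percent aged 60 and above": (-0.12, -0.18, -0.06),
    "percent male": (-0.32, -0.47, -0.18),
}

odds_ratios = {name: odds_ratio(*values) for name, values in estimates.items()}
changes = {name: percent_change_in_odds(values[0]) for name, values in estimates.items()}
```

Under this assumption, an odds ratio above 1 (higher education) means higher odds of observing a dengue tweet in a municipality, and an odds ratio below 1 (older adults, males) means lower odds.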
Additionally, univariate linear regression models and Pearson correlation were used to quantify the temporal association between tweets and dengue case data. The response variable was a time series of the number of confirmed dengue cases and the independent variable was the number of relevant tweets. The model was fit for municipalities with a high volume of relevant dengue tweets. These analyses were implemented in R.

Authors' Contributions

EON, LSF, and TS manually classified tweets. JBH, JSB, LSF, and FM provided data. AM implemented the machine learning classifier. EON drafted the manuscript. All authors read and edited the manuscript.

Data Availability

The Python code used in the analyses is available on GitHub: https://github.com/adypooja/dengueTweets. The dengue case data are publicly available from the Brazilian Ministry of Health SINAN system (http://sinan.saude.gov.br/sinan).

Competing Interest Statement

John S. Brownstein is a member of the PLOS Currents: Outbreaks review board.

Corresponding Author

Elaine Nsoesie: onelaine@vt.edu

Appendix

Figure SI 1. Number of dengue tweets from each state in Brazil. There was at least one tweet of a suspected dengue case from each of the states, with the highest volume originating from São Paulo, Rio de Janeiro and Minas Gerais.
Figure SI 2. Trend of monthly tweet volume and confirmed cases for Niteroi municipality in Rio de Janeiro. The estimated Pearson correlation was 0.894 and 0.708 for monthly and weekly reports, respectively.
Figure SI 3. Trend of monthly tweet volume and confirmed cases for Rio de Janeiro municipality in Rio de Janeiro. The Pearson correlation was 0.749 and 0.683 for monthly and weekly reports, respectively.
Figure SI 4. Trend of monthly tweet volume and confirmed cases for Juiz de Fora municipality in Minas Gerais. The Pearson correlation was 0.913 and 0.524 for monthly and weekly reports, respectively.
Figure SI 5. Trend of monthly tweet volume and confirmed cases for Belo Horizonte municipality in Minas Gerais. The Pearson correlation was 0.978 and 0.903 for monthly and weekly reports, respectively.
Figure SI 6. Trend of monthly tweet volume and confirmed cases for Santos municipality in São Paulo. The Pearson correlation was 0.845 and 0.689 for monthly and weekly reports, respectively.
  16 in total

1.  Trends and factors associated with dengue mortality and fatality in Brazil.

Authors:  Enny Santos Paixão; Maria da Conceição Nascimento Costa; Laura Cunha Rodrigues; Davide Rasella; Luciana Lobato Cardim; Alcione Cunha Brasileiro; Maria Gloria Lima Cruz Teixeira
Journal:  Rev Soc Bras Med Trop       Date:  2015 Jul-Aug       Impact factor: 1.581

Review 2.  Effect of dengue vector control interventions on entomological parameters in developing countries: a systematic review and meta-analysis.

Authors:  T E Erlanger; J Keiser; J Utzinger
Journal:  Med Vet Entomol       Date:  2008-09       Impact factor: 2.739

3.  Prediction of dengue incidence using search query surveillance.

Authors:  Benjamin M Althouse; Yih Yng Ng; Derek A T Cummings
Journal:  PLoS Negl Trop Dis       Date:  2011-08-02

Review 4.  A critical assessment of vector control for dengue prevention.

Authors:  Nicole L Achee; Fred Gould; T Alex Perkins; Robert C Reiner; Amy C Morrison; Scott A Ritchie; Duane J Gubler; Remy Teyssou; Thomas W Scott
Journal:  PLoS Negl Trop Dis       Date:  2015-05-07

Review 5.  Reviewing dengue: still a neglected tropical disease?

Authors:  Olaf Horstick; Yesim Tozan; Annelies Wilder-Smith
Journal:  PLoS Negl Trop Dis       Date:  2015-04-30

6.  Epidemiology of dengue: past, present and future prospects.

Authors:  Natasha Evelyn Anne Murray; Mikkel B Quam; Annelies Wilder-Smith
Journal:  Clin Epidemiol       Date:  2013-08-20       Impact factor: 4.790

7.  Epidemiological trends of dengue disease in Brazil (2000-2010): a systematic literature search and analysis.

Authors:  Maria Glória Teixeira; João Bosco Siqueira; Germano L C Ferreira; Lucia Bricks; Graham Joint
Journal:  PLoS Negl Trop Dis       Date:  2013-12-19

8.  Mortality from neglected tropical diseases in Brazil, 2000-2011.

Authors:  Francisco Rogerlândio Martins-Melo; Alberto Novaes Ramos; Carlos Henrique Alencar; Jorg Heukelbach
Journal:  Bull World Health Organ       Date:  2015-11-24       Impact factor: 9.408

9.  The global distribution and burden of dengue.

Authors:  Samir Bhatt; Peter W Gething; Oliver J Brady; Jane P Messina; Andrew W Farlow; Catherine L Moyes; John M Drake; John S Brownstein; Anne G Hoen; Osman Sankoh; Monica F Myers; Dylan B George; Thomas Jaenisch; G R William Wint; Cameron P Simmons; Thomas W Scott; Jeremy J Farrar; Simon I Hay
Journal:  Nature       Date:  2013-04-07       Impact factor: 49.962

Review 10.  Public health for the people: participatory infectious disease surveillance in the digital age.

Authors:  Oktawia P Wójcik; John S Brownstein; Rumi Chunara; Michael A Johansson
Journal:  Emerg Themes Epidemiol       Date:  2014-06-20
