Abeed Sarker, Graciela Gonzalez.
Abstract
In this data article, we present to the data science, natural language processing, and public health communities an unlabeled corpus and a set of language models. We collected the data from Twitter using drug names as keywords, including their common misspelled forms. Using this data, which is rich in drug-related chatter, we developed language models to aid the development of data mining tools and methods in this domain. We generated several models that capture (i) distributed word representations and (ii) probabilities of n-gram sequences. The data set we are releasing consists of 267,215 Twitter posts made during the four-month period from November 2014 to February 2015. The posts mention over 250 drug-related keywords. The language models encapsulate semantic and sequential properties of the texts.
Year: 2016 PMID: 27981203 PMCID: PMC5144647 DOI: 10.1016/j.dib.2016.11.056
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Fig. 1. Left: distribution of tweets over the four months from November 2014 to the end of February 2015. Right: top 10 drug-related keywords on Twitter over the four-month period.
Fig. 2. Sample tweets containing correct spellings and misspellings for the drugs Seroquel® and Adderall®.
Fig. 3. Sample tweets containing drug-related chatter. Collected between November 2014 and February 2015.
Fig. 4. Drug-adverse reaction association signals for three drugs obtained using a distributed word representation model and cosine similarity.
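The association signals in Fig. 4 rest on cosine similarity between word vectors. The following is a minimal sketch of that comparison, not the released models: the three-dimensional vectors and the names `drug_a`, `reaction_x`, and `reaction_y` are hypothetical placeholders, whereas real distributed representations would be loaded from a trained model and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors; NOT taken from the released language models.
toy_vectors = {
    "drug_a": [0.9, 0.1, 0.3],
    "reaction_x": [0.8, 0.2, 0.4],
    "reaction_y": [-0.5, 0.9, 0.0],
}


def rank_neighbours(query, vectors):
    """Rank every other word by cosine similarity to the query word."""
    q = vectors[query]
    scores = [
        (word, cosine_similarity(q, vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_neighbours("drug_a", toy_vectors)` lists the candidate terms from most to least similar; with real embeddings, the top-ranked terms for a drug name would be the candidates for an association signal.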
Fig. 5. Sample tetra-gram language model scores for health and non-health tweets.
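A tetra-gram (4-gram) language model score of the kind shown in Fig. 5 can be illustrated with a small count-based model. This is a sketch under stated assumptions: the record does not say which toolkit or smoothing method produced the released models, so add-one (Laplace) smoothing is used here purely for simplicity, and the class name `NGramModel` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Yield n-grams over a token list padded with start/end markers."""
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


class NGramModel:
    """Count-based n-gram model with add-one smoothing (an assumption)."""

    def __init__(self, n=4):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = {"</s>"}

    def train(self, sentences):
        for tokens in sentences:
            self.vocab.update(tokens)
            for gram in ngrams(tokens, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def log_prob(self, tokens):
        """Sum of smoothed log-probabilities of each n-gram in the sentence."""
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen words
        total = 0.0
        for gram in ngrams(tokens, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total


# Hypothetical toy training data; NOT drawn from the released corpus.
model = NGramModel(n=4)
model.train([
    ["this", "drug", "makes", "me", "sleepy"],
    ["this", "drug", "makes", "me", "dizzy"],
])
in_domain = model.log_prob(["this", "drug", "makes", "me", "sleepy"])
out_of_domain = model.log_prob(["the", "game", "starts", "at", "noon"])
```

Here `in_domain` is higher (less negative) than `out_of_domain` because the first sentence shares its 4-grams with the training text, which is the intuition behind using such scores to separate health-related from non-health tweets.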
| Subject area | |
| More specific subject area | |
| Type of data | |
| How data was acquired | |
| Data format | |
| Experimental factors | |
| Experimental features | |
| Data source location | |
| Data accessibility | |