Yoonsang Kim, Jidong Huang, Sherry Emery.
Abstract
BACKGROUND: Social media have transformed the communications landscape.
Keywords: Twitter; digital disease detection; infodemiology; infoveillance; precision and recall; search filter; sensitivity and specificity; social media; standard reporting
Year: 2016 PMID: 26920122 PMCID: PMC4788740 DOI: 10.2196/jmir.4738
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
A framework for Twitter data collection and validation.
| Step | Details |
| Develop search filter | 1. Build a list of search keywords: (a) generate a list of candidate keywords based on expert knowledge, a systematic search of topic-related language, and other resources; (b) screen the keywords by examining the relevance and frequency of posts; (c) discard keywords that return posts with a high proportion of irrelevant content or a relatively low frequency; and (d) add and screen new keywords when new relevant terms and phrases emerge. |
| | 2. Integrate keywords with search rules (eg, Boolean operators) for a more focused search. |
| Apply search filter | 3. The search filter splits the data into a retrieved set and an unretrieved set. |
| Assess search filter | 4. Cross-tabulate data by gold standard and search filter: (a) randomly sample from the retrieved and unretrieved data (stratified sampling may be applied); (b) manually code the sampled data to determine relevance in both the retrieved and unretrieved sets; and (c) cross-tabulate the sampled data by human-coded relevance (coded relevant vs irrelevant) and search filter retrieval status (retrieved vs unretrieved). |
| | 5. Compute retrieval precision and retrieval recall. |
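Steps 1 to 4 of the framework can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword list, the exclusion term, the example posts, and the function names `matches_filter`, `apply_filter`, and `sample_for_coding` are all hypothetical and do not come from the study.

```python
import random
import re

# Hypothetical keywords and exclusion term, for illustration only.
KEYWORDS = ["vape", "e-cig", "ecig"]
EXCLUDE = ["giveaway"]


def matches_filter(text):
    """Boolean rule: (any keyword) AND NOT (any exclusion term)."""
    lowered = text.lower()
    has_keyword = any(
        re.search(r"\b" + re.escape(keyword), lowered) for keyword in KEYWORDS
    )
    has_excluded = any(term in lowered for term in EXCLUDE)
    return has_keyword and not has_excluded


def apply_filter(posts):
    """Split the archive into a retrieved set and an unretrieved set."""
    retrieved = [p for p in posts if matches_filter(p)]
    unretrieved = [p for p in posts if not matches_filter(p)]
    return retrieved, unretrieved


def sample_for_coding(retrieved, unretrieved, n_retrieved, n_unretrieved, seed=0):
    """Draw a random sample from each stratum for manual relevance coding."""
    rng = random.Random(seed)
    sample_r = rng.sample(retrieved, min(n_retrieved, len(retrieved)))
    sample_u = rng.sample(unretrieved, min(n_unretrieved, len(unretrieved)))
    return sample_r, sample_u


posts = [
    "Trying to quit smoking with my new vape",
    "E-cig rules announced today",
    "Win a free vape in our giveaway",
    "Lovely weather this morning",
]
retrieved, unretrieved = apply_filter(posts)
sample_r, sample_u = sample_for_coding(retrieved, unretrieved, 1, 1)
```

Sampling the two strata separately lets the unretrieved set, which is usually far larger, be sampled at a different rate from the retrieved set.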
Figure 1. The archive (a+b+c+d), retrieved tweets (a+b), and relevant tweets (a+c+e) in the Twitterverse.
Assessment of search filter with human coding as a gold standard.
| Search filter | Human coding: coded relevant | Human coding: coded not-relevant | Total |
| Retrieved | a (true positive) | b (false positive) | a + b = n1 |
| Not retrieved | c (false negative) | d (true negative) | c + d = n2 |
| Total | a + c | b + d | n |
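In terms of the cell counts a, b, c, and d in this table, the usual textbook definitions of the retrieval measures are as follows. These are the standard formulas, given here as a reading aid; the study's own estimators (for example, any weighting by sampling fraction) may differ in detail.

```latex
\[
\text{precision} = \frac{a}{a+b}, \qquad
\text{recall} = \frac{a}{a+c}, \qquad
\text{specificity} = \frac{d}{b+d},
\]
\[
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2a}{2a+b+c}.
\]
```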
Figure 2. The average limits of 95% confidence intervals for recall (vertical axis) as the sample size of unretrieved messages increases (horizontal axis), fixing the sample size of retrieved data at 3000.
Multinomial likelihood contributions of all possible cases of observed data and unknown quantities (the unknown quantities of truly relevant tweets are denoted by y1, y2, y3, y4).
| Search filter | Human coding: coded relevant | Human coding: coded not-relevant |
| Retrieved | a − y1 | b − y2 |
| | y1 | y2 |
| Not retrieved | y3 | y4 |
| | c − y3 | d − y4 |
Search filter versus human coding on sampled data adjusted for sampling fraction.
| Search filter | Human coding: coded relevant | Human coding: coded not-relevant | Total |
| Retrieved | 128 | 6 | 134 |
| Not retrieved | 20 | 6285 | 6305 |
| Total | 148 | 6291 | 6439 |
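As a worked check, the point estimates implied by the counts in this table can be computed directly. The sketch below assumes the standard definitions of precision, recall, specificity, and F1 score; the function name `retrieval_metrics` is hypothetical, and no interval estimates are attempted.

```python
def retrieval_metrics(a, b, c, d):
    """Point estimates from a 2x2 table.

    a: retrieved and coded relevant (true positives)
    b: retrieved and coded not-relevant (false positives)
    c: not retrieved and coded relevant (false negatives)
    d: not retrieved and coded not-relevant (true negatives)
    """
    return {
        "precision": a / (a + b),
        "recall": a / (a + c),
        "specificity": d / (b + d),
        "f1": 2 * a / (2 * a + b + c),
    }


# Counts taken from the table above.
metrics = retrieval_metrics(a=128, b=6, c=20, d=6285)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# precision: 0.955
# recall: 0.865
# specificity: 0.999
# f1: 0.908
```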
Prior and posterior means and 95% credible intervals when unretrieved messages cannot be archived.
| Parameter | Beta prior: mean | Beta prior: 95% HD^a | Posterior: mean | Posterior: 95% HPD^b |
| Prevalence | 0.010 | 1×10⁻⁶–0.031 | 0.028 | 0.020–0.038 |
| Recall | 0.667 | 0.340–0.954 | 0.752 | 0.505–0.979 |
| Precision^c | – | – | 0.955 | 0.949–0.961 |
| Specificity | 0.733 | 0.474–0.962 | 0.999 | 0.999–0.999 |
| F1 score^c | – | – | 0.835 | 0.663–0.968 |
^a HD: highest density interval.
^b HPD: highest posterior density interval. For skewed distributions, the HPD interval is narrower than the equal-tailed interval (computed using the R package BOA [36]).
^c Prior density functions of precision and F1 score are not specified but are determined as a function of the other parameters.
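Precision and F1 score need no priors of their own because they follow from prevalence, recall, and specificity. A minimal sketch of that relationship, using Bayes' theorem and the standard F1 definition, is given below; the function names are hypothetical. Plugging in posterior means is only a rough check, since the mean of a function of the parameters is not the function of their means.

```python
def implied_precision(prevalence, recall, specificity):
    """Precision as a function of prevalence, recall, and specificity."""
    true_positive_rate = prevalence * recall
    false_positive_rate = (1 - prevalence) * (1 - specificity)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Posterior means from the table above, used as rough plug-in values.
precision = implied_precision(prevalence=0.028, recall=0.752, specificity=0.999)
f1 = f1_score(precision, recall=0.752)
print(f"precision ~ {precision:.2f}, F1 ~ {f1:.2f}")
```

With these inputs the plug-in precision is about 0.96 and the plug-in F1 score about 0.84, close to but not identical with the tabulated posterior means.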
Prior and posterior means and 95% credible intervals when human coding is not a standard classifier.
| Parameter | Beta prior: mean | Beta prior: 95% HD^a | Posterior: mean | Posterior: 95% HPD^b |
| Prevalence | 0.019 | 1×10⁻⁶–0.031 | 0.021 | 0.018–0.025 |
| | | | | |
| Recall | 0.667 | 0.340–0.954 | 0.929 | 0.862–0.992 |
| Precision^c | – | – | 0.956 | 0.914–0.994 |
| Specificity | 0.733 | 0.474–0.962 | 0.999 | 0.998–1.000 |
| F1 score^c | – | – | 0.942 | 0.901–0.982 |
| | | | | |
| Recall | 0.733 | 0.474–0.962 | 0.961 | 0.923–0.995 |
| Precision^c | – | – | 0.897 | 0.824–0.971 |
| Specificity | 0.800 | 0.616–0.975 | 0.998 | 0.996–0.999 |
| F1 score^c | – | – | 0.927 | 0.883–0.971 |
^a HD: highest density interval.
^b HPD: highest posterior density interval. For skewed densities, the HPD interval is narrower than the equal-tailed interval (computed using the R package BOA [36]).
^c Prior densities of precision and F1 score are not specified but are implied as a function of the other parameters.
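The difference between an HPD interval and an equal-tailed interval can be illustrated with a short sketch. The source reports intervals computed with the R package BOA; the Python code below is not that implementation but a common sample-based approximation (the shortest window covering 95% of sorted draws). The Beta(30, 2) distribution is a hypothetical skewed posterior chosen only for illustration.

```python
import math
import random


def hpd_interval(samples, mass=0.95):
    """Shortest interval that covers `mass` of the sorted draws."""
    s = sorted(samples)
    n = len(s)
    k = math.ceil(mass * n)  # number of draws the interval must cover
    start = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]


def equal_tailed_interval(samples, mass=0.95):
    """Interval that leaves (roughly) equal numbers of draws in each tail."""
    s = sorted(samples)
    n = len(s)
    k = math.ceil(mass * n)
    start = (n - k) // 2
    return s[start], s[start + k - 1]


rng = random.Random(0)
# Hypothetical left-skewed posterior draws, not taken from the study.
draws = [rng.betavariate(30, 2) for _ in range(20000)]

hpd = hpd_interval(draws)
eti = equal_tailed_interval(draws)
print("HPD width:", hpd[1] - hpd[0])
print("Equal-tailed width:", eti[1] - eti[0])
```

Because both functions cover the same number of draws and the HPD search considers every such window, the HPD interval can never be wider than the equal-tailed one; for a skewed density it is typically narrower and shifted toward the mode.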