Pranoti Pimpalkhute, Apurv Patki, Azadeh Nikfarjam, Graciela Gonzalez.
Abstract
Social media postings are rich in information that often remains hidden and inaccessible to automatic extraction because of inherent limitations of the sites' APIs, which mostly restrict access to keyword-based searches (and limit both the number of keywords and the number of postings returned). When mining social media for drug mentions, one of the first problems to solve is how to derive a list of variants of the drug name (common misspellings) that captures a sufficient number of postings. We present an approach that filters the potential variants based on the intuition that, faced with the task of writing an unfamiliar, complex word (the drug name), users tend to revert to phonetic spelling; we therefore give preference to variants that reflect the phonemes of the correct spelling. The algorithm allowed us to capture 50.4% to 56.0% of the user comments using only about 18% of the variants.
Keywords: Information Retrieval; Natural Language Processing and Free Text Data Mining; Spelling-Error
Year: 2014 PMID: 25717407 PMCID: PMC4333687
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
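The candidate pool that the abstract starts from (misspellings one edit away from the drug name) can be illustrated with a short Python sketch. The function name `one_edit_variants` and the choice of a lowercase ASCII alphabet are assumptions made for illustration; the paper's own generator may define edits differently, so the counts this sketch produces need not match the published statistics.

```python
import string


def one_edit_variants(word, alphabet=string.ascii_lowercase):
    """Return every string at Levenshtein distance exactly 1 from `word`.

    Hypothetical helper: covers single deletions, substitutions and
    insertions over the given alphabet (assumed to be a-z here).
    """
    word = word.lower()
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {left + right[1:] for left, right in splits if right}
    substitutions = {
        left + ch + right[1:]
        for left, right in splits if right
        for ch in alphabet
    }
    insertions = {left + ch + right for left, right in splits for ch in alphabet}
    variants = deletions | substitutions | insertions
    variants.discard(word)  # the correct spelling is not a misspelling
    return variants


variants = one_edit_variants("prozac")
print("prozak" in variants, "przac" in variants, "prozacs" in variants)
```

A pool of this size is far too large to submit as search keywords, which is why a filtering step is needed before querying a rate-limited API.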
CMU and Metaphone encoding
| Word Variant | CMU Expanded Pronunciation | Metaphone Encoding |
|---|---|---|
| POZAC | P AA Z AH K | PSK |
| PRZAC | P R Z AE K | |
| PROAC | P R OW AE K | PRK |
| PROZAK | | |
| PROXAC | | PRKS |
correct spelling of the word
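To make the idea of preferring variants that "sound like" the correct spelling concrete, the sketch below compares rough phonetic keys. This is a deliberately crude stand-in, not the Metaphone algorithm and not the CMU pronunciation lookup: the `SOUND_CLASS` mapping and the function `rough_phonetic_key` are hypothetical, and a real implementation would use a full phonetic encoder.

```python
VOWELS = set("aeiou")
# Hypothetical letter-to-sound classes; a crude stand-in for Metaphone rules.
SOUND_CLASS = {"c": "K", "k": "K", "q": "K", "z": "S", "s": "S", "x": "KS"}


def rough_phonetic_key(word):
    """Build a consonant skeleton: drop non-initial vowels, merge adjacent equal sounds."""
    key = []
    prev = None  # sound code of the immediately preceding consonant, if any
    for i, ch in enumerate(word.lower()):
        if ch in VOWELS and i > 0:
            prev = None  # a vowel separates consonant sounds
            continue
        code = SOUND_CLASS.get(ch, ch.upper())
        if code != prev:
            key.append(code)
        prev = code
    return "".join(key)


target = rough_phonetic_key("prozac")
candidates = ["prozak", "prozaq", "pozac", "proac", "proxac", "prazac"]
# Keep only candidates whose key matches the key of the correct spelling.
kept = [c for c in candidates if rough_phonetic_key(c) == target]
print(target, kept)  # PRSK ['prozak', 'prozaq', 'prazac']
```

Variants such as "pozac" or "proac" drop a consonant sound and so receive a different key, which is the kind of distinction a phonetic filter relies on.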
Figure 1. Control Flow Chart
Statistics of misspelled variants.
| | Paxil | Prozac | Seroquel | Olanzapine |
|---|---|---|---|---|
| Levenshtein (1-edit) distance words | 238 | 291 | 397 | 503 |
| CMU lib words generated | 21 | 18 | 27 | 31 |
| Metaphone words generated | 79 | 103 | 121 | 338 |
| Combining the two lists | 85 | 104 | 119 | 327 |
| Keywords selected by proposed algorithm | 18 | 18 | 17 | 15 |
Figure 3. Google Plus comments and Tweets for Custom Search API sorted variants and Random variants.
Figure 2. Number of Tweets vs Drug Variants.
| Prozac variant | Google hits | Paxil variant | Google hits | Seroquel variant | Google hits | Olanzapine variant | Google hits |
|---|---|---|---|---|---|---|---|
| prozact | 3960000 | paxl | 52300000 | seroquels | 1910000 | olanzapin | 1220000 |
| prozaac | 3160000 | pxil | 12200000 | seroqul | 1810000 | olanzapoine | 869000 |
| prozaqc | 1300000 | pexil | 10600000 | seroqual | 1810000 | olanzapines | 868000 |
| prozaxc | 1300000 | paxol | 2490000 | sroquel | 1800000 | olanzaoine | 864000 |
| prozax | 1270000 | paxial | 2340000 | seruquel | 1790000 | olanzaopine | 863000 |
| prozc | 1260000 | paxiol | 866000 | saroquel | 1760000 | olanzapne | 796000 |
| prozec | 1260000 | paxill | 856000 | seroqel | 1710000 | olanzaplne | 765000 |
| proazac | 1260000 | paxilk | 819000 | seroquell | 1230000 | olanzapuine | 734000 |
| prozzac | 1220000 | paxilo | 809000 | serocquel | 763000 | olanzapins | 567000 |
| prazac | 1210000 | paxils | 790000 | seroguel | 751000 | olanzpine | 565000 |
| proazc | 1180000 | paxilv | 750000 | seroquol | 742000 | olanzopine | 536000 |
| proxac | 1150000 | paxilj | 746000 | sereoquel | 676000 | olanzipine | 530000 |
| prozacs | 1120000 | paxiln | 738000 | seriquel | 615000 | olanazapine | 525000 |
| prizac | 1100000 | paxilq | 738000 | serroquel | 604000 | olanzepine | 386000 |
| przac | 1070000 | paxcil | 708000 | serequel | 111000 | olanzapinm | 6820 |
| porzac | 997000 | paxiul | 694000 | seraquel | 106000 | | |
| prozacc | 995000 | paxilz | 668000 | seroquela | 5580 | | |
| prozaq | 12500 | paxila | 5700 | | | | |
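Ranking variants by hit count, as the table does, amounts to a sort followed by a cut-off. The sketch below uses a handful of Prozac rows from the table; the function name `top_variants` and the cut-off `k` are illustrative assumptions, and how the hit counts were retrieved is outside the scope of the example.

```python
# A few (variant, Google hits) pairs copied from the Prozac columns of the table.
google_hits = {
    "prozaq": 12500,
    "przac": 1070000,
    "prozact": 3960000,
    "prozax": 1270000,
    "prozaac": 3160000,
}


def top_variants(hits, k):
    """Return the k variants with the highest hit counts, best first."""
    return sorted(hits, key=hits.get, reverse=True)[:k]


print(top_variants(google_hits, 3))  # ['prozact', 'prozaac', 'prozax']
```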
Evaluation for Twitter and GooglePlus
| Drug Name | Twitter Comments Coverage (%) | GooglePlus Comments Coverage (%) | Keyword Coverage (%) |
|---|---|---|---|
| Prozac | 45.44 | 52.76 | 17.31 |
| Paxil | 25.86 | 54.55 | 21.18 |
| Seroquel | 65.28 | 62.30 | 14.41 |
| Olanzapine | 65.09 | 54.29 | 4.57 |
| Average | 50.42 | 55.97 | |
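The summary figures can be checked with a few lines of arithmetic. The sketch below averages the per-drug coverage values (rounded to two decimals) and, as an assumption not stated in the record, treats keyword coverage as the number of selected keywords divided by the size of the combined variant list. That ratio reproduces the Prozac and Paxil values; for Seroquel and Olanzapine it comes out close to, but not exactly at, the tabulated figures.

```python
from statistics import mean

# Coverage percentages from the evaluation table, rounded to two decimals.
twitter = {"Prozac": 45.44, "Paxil": 25.86, "Seroquel": 65.28, "Olanzapine": 65.09}
google_plus = {"Prozac": 52.76, "Paxil": 54.55, "Seroquel": 62.30, "Olanzapine": 54.29}

twitter_avg = mean(twitter.values())
google_plus_avg = mean(google_plus.values())
print(f"{twitter_avg:.1f} {google_plus_avg:.1f}")  # 50.4 56.0


def keyword_coverage(selected, combined):
    """Assumed definition: selected keywords as a percentage of the combined list."""
    return 100 * selected / combined


print(f"{keyword_coverage(18, 104):.2f}")  # Prozac: 17.31
print(f"{keyword_coverage(18, 85):.2f}")   # Paxil: 21.18
```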