Conrad J Harrison, Chris J Sidey-Gibbons.
Abstract
BACKGROUND: Unstructured text, including medical records, patient feedback, and social media comments, can be a rich source of data for clinical research. Natural language processing (NLP) describes a set of techniques used to convert passages of written text into interpretable datasets that can be analysed by statistical and machine learning (ML) models. The purpose of this paper is to provide a practical introduction to contemporary techniques for the analysis of text data, using freely available software.
Year: 2021 PMID: 34332525 PMCID: PMC8325804 DOI: 10.1186/s12874-021-01347-1
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Fig. 1 Creating a document term matrix from the data
Fig. 2 A part of the document term matrix
Fig. 3 Latent Dirichlet allocation can be performed with a short passage of code
Fig. 4 Splitting data into training and test sets
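The code shown in the figures is not reproduced in this record, so the following is an independent minimal sketch of two of the steps the captions name: building a document term matrix and splitting it into training and test sets. It uses only the Python standard library. The function names, the example reviews, the 25% test fraction, and the seed are all assumptions made for illustration, not the authors' implementation.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle row indices reproducibly and hold out a fraction as a test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Hypothetical reviews, invented for this example.
reviews = [
    "No side effects, works well",
    "Bad side effects after one week",
    "Works well after one week",
    "No effects at all",
]
vocabulary, matrix = document_term_matrix(reviews)
train, test = train_test_split(matrix)
```

Each row of `matrix` counts how often every vocabulary term occurs in one review, which is the form that downstream statistical or ML models consume. A real pipeline would typically also remove stop words and stem terms before counting.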
Sentiment analysis for reviews of Viagra, Levothyroxine, Oseltamivir and Apixaban
| 3 | 6 | 33% | |
| 7 | 16 | 30% | |
| 3 | 44 | 6% | |
| 0 | 2 | 0% |
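The column labels of this table did not survive in this record. The arithmetic is consistent with the third column being the first count as a share of the first two combined (for example, 3 of 3 + 6 is 33%). Assuming the two counts are positive and negative terms matched against a sentiment lexicon, the calculation might be sketched as follows. The lexicon and function names are hypothetical.

```python
import re

# Hypothetical miniature lexicon, for illustration only.
POSITIVE = {"good", "great", "effective", "works"}
NEGATIVE = {"bad", "pain", "nausea", "worse"}


def sentiment_counts(text):
    """Count tokens that appear in the positive and negative lexicons."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return positive, negative


def percent_positive(positive, negative):
    """Positive terms as a percentage of all sentiment-bearing terms."""
    total = positive + negative
    return 0.0 if total == 0 else 100 * positive / total


# Counts copied from the four table rows (column meaning is assumed).
table_rows = [(3, 6), (7, 16), (3, 44), (0, 2)]
percentages = [round(percent_positive(p, n)) for p, n in table_rows]
```

Under that assumption, `percentages` reproduces the third column of the table.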
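Terms most likely to belong to topics 1, 2 and 3
| Topic 1 term | Probability | Topic 2 term | Probability | Topic 3 term | Probability |
|---|---|---|---|---|---|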
| effect | 0.0180 | period | 0.0248 | pain | 0.0209 |
| feel | 0.0167 | month | 0.0238 | effect | 0.0162 |
| start | 0.0162 | pill | 0.0164 | onli | 0.0120 |
| week | 0.0135 | control | 0.0144 | time | 0.0117 |
| month | 0.0124 | week | 0.0142 | start | 0.0108 |
| medic | 0.0116 | birth | 0.0129 | veri | 0.0097 |
| time | 0.0115 | weight | 0.0122 | week | 0.0097 |
| anxieti | 0.0110 | cramp | 0.0121 | doctor | 0.0094 |
| depress | 0.0109 | gain | 0.0120 | feel | 0.0093 |
| life | 0.0101 | start | 0.0116 | medic | 0.0092 |
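A table like this can be read off a fitted topic model by ranking terms on their per-topic probability. The sketch below shows only that ranking step, using four values copied from the first column pair above. The function name and the dictionary layout are assumptions for illustration, and no topic model is fitted here.

```python
def top_terms(term_probabilities, k=3):
    """Return the k (term, probability) pairs with the highest probability."""
    ranked = sorted(term_probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Four term probabilities copied from the first column pair of the table.
first_topic = {"week": 0.0135, "effect": 0.0180, "start": 0.0162, "feel": 0.0167}
leading = top_terms(first_topic, k=3)
```

The terms appear in stemmed form (for example "anxieti" and "onli"), so the ranking operates on stems rather than on the words as originally written.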
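Documents most likely to belong to topics 1, 2 and 3
| Topic 1 document | Probability | Topic 2 document | Probability | Topic 3 document | Probability |
|---|---|---|---|---|---|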
| Citalopram | 0.9997 | Etonogestrel | 0.9999 | Bisacodyl | 0.9992 |
| Prozac | 0.9995 | Nexplanon | 0.9999 | Clindamycin | 0.9989 |
| Pristiq | 0.9994 | Ethinyl estradiol / norgestimate | 0.9998 | Oseltamivir | 0.9988 |
| Vortioxetine | 0.9993 | Mirena | 0.9998 | Aluminium chloride hexahydrate | 0.9988 |
| Effexor | 0.9992 | Medroxyprogesterone | 0.9997 | Propofol | 0.9982 |
| Mirtazapine | 0.9992 | Skyla | 0.9997 | Polyethylene glycol 3350 with electrolytes | 0.9981 |
| Strattera | 0.9990 | Depo-Provera | 0.9996 | Bactrim DS | 0.9980 |
| Abilify | 0.9990 | Lo Loestrin Fe | 0.9996 | Otezla | 0.9979 |
| Aripiprazole | 0.9989 | Plan B | 0.9995 | MoviPrep | 0.9979 |
| Levetiracetam | 0.9988 | Desogestrel / ethinyl estradiol | 0.9995 | Levaquin | 0.9977 |
Supervised machine learning algorithm performance
| Model | Classification accuracy | AUC | Sensitivity | Specificity |
|---|---|---|---|---|
| Regularised regression | 0.664, 95% CI [0.608, 0.716] | 0.671, 95% CI [0.599, 0.734] | 0.720, 95% CI [0.651, 0.785] | 0.549, 95% CI [0.439, 0.651] |
| Support vector machine | 0.720, 95% CI [0.664, 0.776] | 0.725, 95% CI [0.658, 0.789] | 0.815, 95% CI [0.755, 0.873] | 0.524, 95% CI [0.420, 0.636] |
| Artificial neural network | 0.688, 95% CI [0.628, 0.744] | 0.672, 95% CI [0.599, 0.739] | 0.982, 95% CI [0.959, 1.000] | 0.085, 95% CI [0.026, 0.154] |
AUC: area under the receiver operating characteristic curve; CI: confidence interval
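Accuracy, sensitivity, and specificity all derive from the four cells of a confusion matrix. The sketch below computes them from counts. The counts are illustrative only: they were chosen to be consistent with the artificial neural network row and are not taken from the source. They show how high sensitivity paired with very low specificity arises when a classifier assigns almost every case to the positive class.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),      # share of true positives detected
        "specificity": tn / (tn + fp),      # share of true negatives detected
    }


# Illustrative counts (an assumption, not reported in the source):
# 168 positive cases and 82 negative cases, nearly all predicted positive.
ann_like = classification_metrics(tp=165, fp=75, tn=7, fn=3)
rounded = {name: round(value, 3) for name, value in ann_like.items()}
# rounded == {"accuracy": 0.688, "sensitivity": 0.982, "specificity": 0.085}
```

Because accuracy weights the classes by their frequency, a model that rarely predicts the negative class can still score reasonably on accuracy when positive cases dominate, which is why sensitivity and specificity are reported alongside it.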