Ramya Tekumalla, Juan M. Banda
Abstract
Twitter has been a remarkable resource for research in pharmacovigilance in the last decade. Traditionally, rule- or lexicon-based methods have been utilized for automatically extracting drug tweets for human annotation. The process of human annotation to create labeled sets for machine learning models is laborious, time-consuming, and not scalable. In this work, we demonstrate the feasibility of applying weak supervision (noisy labeling) to select drug data and of building machine learning models with large amounts of noisily labeled data instead of limited gold standard labeled sets. Our results demonstrate that models built with large amounts of noisy data achieve performance similar to models trained on limited gold standard datasets. Weak supervision thus reduces the need to rely on manual annotation, allowing more data to be labeled easily and made useful for downstream machine learning applications, in this case drug mention identification.
Keywords: Noisy learning; Pharmacovigilance; Twitter; Weak supervision
Year: 2021 PMID: 34728902 PMCID: PMC8554513 DOI: 10.1007/s00521-021-06614-2
Source DB: PubMed Journal: Neural Comput Appl ISSN: 0941-0643 Impact factor: 5.102
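To make the weak-supervision step concrete: below is a minimal sketch of lexicon-based noisy labeling, assuming a small illustrative term list (`DRUG_TERMS` stands in for the paper's much larger curated drug lexicon).

```python
import re

# Illustrative term list; the paper relies on a far larger curated drug lexicon.
DRUG_TERMS = {"ibuprofen", "adderall", "xanax", "metformin", "tylenol"}

def weak_label(tweet: str) -> int:
    """Noisy label: 1 if any token matches the drug lexicon, else 0."""
    tokens = re.findall(r"[a-z0-9]+", tweet.lower())
    return int(any(tok in DRUG_TERMS for tok in tokens))

tweets = ["took some ibuprofen for this headache", "great weather for a run today"]
print([weak_label(t) for t in tweets])  # -> [1, 0]
```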
Training sets and description
| Training size | Description | No. of training sets used |
|---|---|---|
| 100 k | 100,000 drug tweets + 100,000 non-drug tweets | 10 |
| 200 k | 200,000 drug tweets + 200,000 non-drug tweets | 10 |
| 300 k | 300,000 drug tweets + 300,000 non-drug tweets | 10 |
| 500 k | 500,000 drug tweets + 500,000 non-drug tweets | 10 |
| 1 M | 1,000,000 drug tweets + 1,000,000 non-drug tweets | 10 |
| 2 M | 2,000,000 drug tweets + 2,000,000 non-drug tweets | 10 |
| 3 M | 3,000,000 drug tweets + 3,000,000 non-drug tweets | 10 |
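A sketch of how one balanced training set per row above could be drawn from the weakly labeled pools; the toy pool contents and the `balanced_sample` helper are illustrative assumptions, not the authors' code.

```python
import random

def balanced_sample(drug_tweets, non_drug_tweets, size, seed=0):
    """Draw `size` tweets from each class into one balanced, shuffled set."""
    rng = random.Random(seed)
    data = [(t, 1) for t in rng.sample(drug_tweets, size)]
    data += [(t, 0) for t in rng.sample(non_drug_tweets, size)]
    rng.shuffle(data)
    return data

# Toy pools; the paper samples 100k-3M tweets per class, ten sets per size.
drug_pool = ["took ibuprofen", "xanax kicked in", "metformin refill"] * 10
non_drug_pool = ["nice weather", "game tonight", "coffee time"] * 10
training_sets = [balanced_sample(drug_pool, non_drug_pool, size=20, seed=i) for i in range(10)]
```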
Details of embeddings
| # | Embedding name and source | Details |
|---|---|---|
| em1 | Drug Chatter Twitter | 1B drug tweets from user timelines; window size 5 and dimension 400 |
| em2 | GloVe (Common Crawl) | 840B tokens, 2.2 M vocab, cased, 300d vectors |
| em3 | Twitter Word2vec Embeddings | 400 million tweets; negative sampling; skip-gram architecture; window of 1; subsampling rate of 0.001; vector size of 400 |
| em4 | glove.twitter.27B | 2B tweets, 27B tokens, 1.2 M vocab, uncased, 200d vectors |
| em5 | RedMed Model | 3 M tokens, 64d; Reddit drug posts |
| em6 | GloVe (Common Crawl) | 42B tokens, 1.9 M vocab, uncased, 300d vectors |
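As a sketch of how any of these pretrained embeddings could be turned into tweet-level features: gensim's `KeyedVectors` reads word2vec-format files, and a tweet can be represented by the average of its token vectors. The file name and binary format here are assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

# File name/format are assumptions; e.g., em3 ships in binary word2vec format.
kv = KeyedVectors.load_word2vec_format("word2vec_twitter_tokens.bin", binary=True)

def tweet_vector(tweet: str) -> np.ndarray:
    """Average the vectors of in-vocabulary tokens; zero vector if none match."""
    vecs = [kv[tok] for tok in tweet.lower().split() if tok in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

X = np.vstack([tweet_vector(t) for t in ["took ibuprofen today", "nice weather"]])
```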
Fig. 1 ROC curves for all embedding models
SVM model results for all training sets at the 100 k training size
| Training set # | P | R | F | A |
|---|---|---|---|---|
| 1 | 0.9943 | 0.8012 | 0.8874 | 0.8983 |
| 2 | 0.9946 | 0.8004 | 0.8870 | 0.8981 |
| 3 | 0.9946 | 0.8028 | 0.8884 | 0.8992 |
| 4 | 0.9939 | 0.7990 | 0.8859 | 0.8971 |
| 5 | 0.9938 | 0.7869 | 0.8784 | 0.8910 |
| 6 | 0.9938 | 0.8028 | 0.8881 | 0.8989 |
| 7 | 0.9941 | 0.7955 | 0.8838 | 0.8954 |
| 8 | 0.9939 | 0.8037 | 0.8888 | 0.8994 |
| 10 | 0.9941 | 0.8018 | 0.8876 | 0.8985 |
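A minimal sketch of producing the P/R/F/A numbers above with scikit-learn, using synthetic features as a stand-in for the averaged tweet embeddings; a linear SVM is an assumption, since the record does not pin down the kernel.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Synthetic stand-in features; in the paper these would be averaged tweet embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 400))
y = (X[:, 0] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LinearSVC().fit(X_train, y_train)  # linear kernel is an assumption
y_pred = clf.predict(X_test)
p, r, f, _ = precision_recall_fscore_support(y_test, y_pred, average="binary")
print(f"P={p:.4f} R={r:.4f} F={f:.4f} A={accuracy_score(y_test, y_pred):.4f}")
```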
Fig. 4 F-measure for all best classical models
F-measure of best classical models
| Training Size | LR | SVM | NB | RF | DT |
|---|---|---|---|---|---|
| 100 k | 0.8429 | 0.8462 | 0.7710 | 0.7512 | |
| 200 k | 0.8597 | 0.8626 | 0.7821 | 0.7194 | |
| 300 k | 0.8708 | 0.8749 | 0.8028 | 0.7582 | |
| 500 k | 0.8784 | 0.8861 | 0.8116 | 0.7465 | |
| 1 M | 0.8907 | 0.8906 | 0.8231 | 0.7750 | |
| 2 M | 0.8948 | 0.8961 | 0.8409 | 0.7592 | |
| 3 M | 0.8984 | 0.8989 | 0.8464 | 0.7262 |
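Continuing the previous sketch (reusing its `X_train`/`y_train` split), comparing the five classical models by F-measure might look like this; the hyperparameters are assumptions, since the table reports each model's best configuration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Hyperparameters are assumptions; the table reports each model's best configuration.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=100),
    "DT": DecisionTreeClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, f1_score(y_test, model.predict(X_test)))
```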
Fig. 2 Precision for all best classical models
Fig. 3 Recall for all best classical models
Fig. 5 a ROC curves for all 10 training sets at the 100 k training size. b Enlarged version of (a)
F-measure of best deep learning models
| Training size | BERT | BioBERT | RoBERTa | CNN | LSTM |
|---|---|---|---|---|---|
| 100 k | 0.9296 | 0.9339 | 0.8675 | 0.9869 | |
| 200 k | 0.9282 | 0.9287 | 0.9333 | 0.8566 | |
| 300 k | 0.9324 | 0.9299 | 0.9371 | 0.8558 | |
| 500 k | 0.9270 | 0.9264 | 0.8549 | 0.9389 | |
| 1 M | 0.9705 | 0.9257 | 0.8392 | 0.9312 | |
| 2 M | 0.9231 | 0.9152 | 0.8118 | 0.9152 | |
| 3 M | 0.9125 | 0.9517 | 0.8200 | 0.9180 |
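A hedged sketch of fine-tuning one of the transformer models above for binary drug-mention classification with Hugging Face `transformers`; the model name, learning rate, and toy batch are assumptions, not the paper's training setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint and hyperparameters are assumptions, mirroring the BERT column above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["took ibuprofen for a headache", "great weather today"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few gradient steps on the toy batch
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)  # predicted class per tweet
```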
Fig. 6 Precision of all best deep learning models
Fig. 7 Recall of all best deep learning models
Fig. 8 F-measure of all best deep learning models
Results of classical models when trained on the gold standard dataset
| Classifier | P | R | F | A |
|---|---|---|---|---|
| LR | 0.9635 | 0.9494 | 0.9564 | 0.9573 |
| NB | 0.7191 | 0.9898 | 0.8330 | 0.8043 |
| RF | 0.8032 | 0.7685 | 0.7855 | 0.7929 |
| DT | 0.9447 | 0.9601 | 0.9523 | 0.9526 |
Results of deep learning models when trained on the gold standard dataset
| Classifier | P | R | F | A |
|---|---|---|---|---|
| CNN | 0.9146 | 0.8737 | 0.8937 | 0.8968 |
| RNN | 0.9855 | 0.9882 | 0.9868 | 0.9869 |
| BioBERT | 0.9850 | 0.9973 | 0.9911 | 0.9908 |
| RoBERTa | 0.9967 | 0.9978 | 0.9973 | 0.9972 |
Minimum number of noisy samples required
| Accuracy | Minimum number of noisy samples | Model |
|---|---|---|
| 0.9978 | 14,558 | BERT |
| 0.7930 | 42,021 | DT |
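One way such minimum-sample counts can be estimated is a learning-curve search that grows the noisy training subset until a target validation accuracy is first reached; a sketch, with `model_factory`, the step size, and the target threshold all illustrative.

```python
from sklearn.metrics import accuracy_score

def min_noisy_samples(model_factory, X, y, X_val, y_val, target, step=1000):
    """Grow the training subset until validation accuracy first reaches `target`."""
    for n in range(step, len(X) + 1, step):
        model = model_factory().fit(X[:n], y[:n])
        if accuracy_score(y_val, model.predict(X_val)) >= target:
            return n
    return None  # target never reached within the available data

# e.g. min_noisy_samples(lambda: DecisionTreeClassifier(), X_train, y_train,
#                        X_test, y_test, target=0.7930)  # cf. the DT row above
```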
Comparison of gold standard vs silver standard results
| Data | No. of drug samples | P | R | F | A | Best model |
|---|---|---|---|---|---|---|
| Gold standard | 7215 | 0.9978 | 0.9978 | 0.9978 | 0.9977 | BERT |
| Silver standard | 100,000 | 0.9953 | 0.9950 | 0.9951 | 0.9951 | BERT |
| Silver standard | 100,000 | 0.9852 | 0.9410 | 0.9626 | 0.9634 | LSTM |
| Silver standard | 300,000 | 0.9919 | 0.9009 | 0.9442 | 0.9468 | LSTM |
| Silver standard | 500,000 | 0.9954 | 0.9374 | 0.9655 | 0.9665 | RoBERTa |
| Silver standard | 1,000,000 | 0.9832 | 0.9953 | 0.9892 | 0.9891 | RoBERTa |
| Silver standard | 2,000,000 | 0.9971 | 0.9500 | 0.9730 | 0.9736 | RoBERTa |
| Silver standard | 3,000,000 | 0.9951 | 0.9500 | 0.9720 | 0.9726 | BERT |