Jhih-Yuan Huang, Wei-Po Lee, King-Der Lee.
Abstract
Social forums offer many new channels for collecting patients' opinions to construct predictive models of adverse drug reactions (ADRs) for post-marketing surveillance. However, due to the characteristics of social posts, many challenges remain in deriving such models, mainly data sparseness, high-dimensional features, and term diversity. To tackle these crucial issues in identifying ADRs from social posts, we perform data analytics from the perspectives of data balance, feature selection, and feature learning. Meanwhile, we design a comprehensive experimental analysis to investigate the performance of different data processing techniques and data modeling methods. Most importantly, we present a deep learning-based approach that adopts the BERT (Bidirectional Encoder Representations from Transformers) model with a new batch-wise adaptive strategy to enhance predictive performance. A series of experiments has been conducted to evaluate the machine learning methods with both manual and automated feature engineering processes. The results show that both types of methods, each with its own advantages, are effective in ADR prediction. In contrast to traditional machine learning methods, our feature learning approach accomplishes the required task automatically, saving the manual effort of a large number of experiments.
Keywords: adverse drug reaction; deep learning; feature engineering; machine learning; pharmacovigilance; social media monitoring
Year: 2022 PMID: 35455795 PMCID: PMC9024774 DOI: 10.3390/healthcare10040618
Source DB: PubMed Journal: Healthcare (Basel) ISSN: 2227-9032
The feature names extracted from the dataset.
| Feature Name | Dim | Description |
|---|---|---|
| text | 5000 | |
| synset vector | 2000 | the tf-idf measure for each derived synonym |
| cluster vector | 981 | the term frequency (tf) of cluster terms |
| topic vector | 500 | the topic terms that appear in the instance |
| sentiments | 5 | the sum of all the individual term-POS (part-of-speech) sentiment scores divided by the length of the sentence in words |
| good/bad | 4 | four features: MORE-GOOD, MORE-BAD, LESS-GOOD, and LESS-BAD |
| structural features | 3 | lengths of the text segments in words |
| ADRs lexicon | 2 | The first feature is a binary feature indicating the presence/absence of ADR mentions. The second feature is a numeric feature computed by counting the number of ADR mentions in a text segment and dividing it by the number of words in the text segment. |
| topics | 1 | sums of all the relevance scores of the terms in each instance |
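Several of the features above (the text and synset vectors) are tf-idf weighted. A minimal sketch of tf-idf, assuming simple pre-tokenized documents (the tokenization and vocabulary handling here are illustrative, not the paper's pipeline):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf vectors for a list of tokenized documents.

    tf  = term count / document length
    idf = log(N / document frequency)
    """
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        length = len(doc)
        counts = Counter(doc)
        vectors.append({t: (c / length) * math.log(n / df[t])
                        for t, c in counts.items()})
    return vectors

docs = [["drug", "caused", "headache"],
        ["drug", "worked", "well"],
        ["headache", "gone", "now"]]
vecs = tfidf(docs)
```

Terms shared across documents ("drug") receive lower weights than terms unique to one document ("worked"), which is what makes tf-idf useful for sparse social-media text.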
Figure 1. The proposed network architecture.
The baseline results by SVM and logistic regression.
| Methods | Accuracy | Precision | Recall | F-Score | AUC |
|---|---|---|---|---|---|
| SVM | 0.90 | 0.54 | 0.51 | 0.52 | 0.73 |
| LR | 0.90 | 0.51 | 0.56 | 0.53 | 0.75 |
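The metric columns in these tables follow the standard definitions; a minimal sketch computing them from confusion-matrix counts (the counts below are illustrative, not from the paper's dataset):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_score

# On a roughly 9:1 imbalanced set, high accuracy can coexist with a
# modest F-score, as in the baseline rows above.
acc, p, r, f = classification_metrics(tp=51, fp=49, fn=49, tn=851)
```

This is why the tables report precision, recall, F-score, and AUC alongside accuracy: accuracy alone is dominated by the majority (non-ADR) class.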
Results of different under-sampling techniques.
| Methods | Accuracy | Precision | Recall | F-Score | AUC |
|---|---|---|---|---|---|
| Without balance | 0.90 | 0.58 | 0.41 | 0.48 | 0.69 |
| Random under-sampling | 0.74 | 0.28 | 0.78 | 0.41 | 0.76 |
| TomekLinks | 0.90 | 0.59 | 0.41 | 0.48 | 0.69 |
| NearMiss | 0.37 | 0.15 | 0.95 | 0.26 | 0.62 |
| CondensedNearestNeighbour | 0.85 | 0.39 | 0.58 | 0.47 | 0.73 |
| OneSidedSelection | 0.90 | 0.59 | 0.41 | 0.48 | 0.69 |
| NeighbourhoodCleaningRule | 0.90 | 0.56 | 0.51 | 0.53 | 0.72 |
| EditedNearestNeighbours | 0.89 | 0.55 | 0.54 | 0.54 | 0.74 |
| RepeatedEditedNearestNeighbours | 0.88 | 0.48 | 0.56 | 0.52 | 0.74 |
| AllKNN | 0.89 | 0.51 | 0.55 | 0.53 | 0.74 |
| InstanceHardnessThreshold | 0.85 | 0.40 | 0.63 | 0.49 | 0.75 |
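The simplest of the techniques above is random under-sampling: discard majority-class instances at random until the classes are balanced. A minimal pure-Python sketch (binary 0/1 labels assumed; real experiments would use a library implementation):

```python
import random

def random_undersample(X, y, seed=42):
    """Drop majority-class samples at random until both classes have
    the same number of instances (binary labels 0/1 assumed)."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    majority, minority = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    kept = rng.sample(majority, len(minority)) + minority
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]

X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2          # 8:2 imbalance
Xb, yb = random_undersample(X, y)
```

The table shows the trade-off this creates: recall rises sharply (0.41 to 0.78) but precision drops, because most of the majority class is thrown away.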
Results of different over-sampling techniques.
| Methods | Accuracy | Precision | Recall | F-Score | AUC |
|---|---|---|---|---|---|
| Without balance | 0.90 | 0.58 | 0.41 | 0.48 | 0.69 |
| Random over-sampling | 0.87 | 0.47 | 0.55 | 0.51 | 0.73 |
| SMOTE | 0.88 | 0.47 | 0.54 | 0.50 | 0.73 |
| Borderline-SMOTE type 1 | 0.88 | 0.48 | 0.55 | 0.51 | 0.73 |
| Borderline-SMOTE type 2 | 0.87 | 0.46 | 0.60 | 0.52 | 0.76 |
| Support Vectors SMOTE | 0.89 | 0.53 | 0.50 | 0.51 | 0.72 |
| ADASYN | 0.89 | 0.53 | 0.49 | 0.51 | 0.72 |
| SMOTE + Tomek | 0.88 | 0.47 | 0.55 | 0.51 | 0.73 |
| SMOTE + ENN | 0.88 | 0.47 | 0.54 | 0.50 | 0.73 |
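SMOTE, the basis of most rows above, synthesizes new minority samples by interpolating between a minority instance and one of its minority-class nearest neighbors. A minimal sketch (pure Python; the neighbor count k and amount of over-sampling are illustrative parameters):

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic samples by interpolating each chosen
    minority point toward one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbors)
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nn)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=4)
```

Because the synthetic points lie on segments between existing minority samples, over-sampling densifies the minority region instead of merely duplicating instances, which is why it degrades precision less than random under-sampling in the tables above.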
Results of the ensemble-based approaches.
| Methods | Accuracy | Precision | Recall | F-Score | AUC |
|---|---|---|---|---|---|
| Balanced Bagging DT | 0.81 | 0.32 | 0.67 | 0.43 | 0.75 |
| Balanced RandomForest | 0.73 | 0.26 | 0.76 | 0.39 | 0.75 |
| EasyEnsemble | 0.74 | 0.26 | 0.76 | 0.38 | 0.75 |
| RUSBoost | 0.74 | 0.26 | 0.76 | 0.38 | 0.75 |
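The methods in this table combine resampling with ensembling: each base learner is trained on a balanced subset of the data and the learners are then combined by voting or boosting. A minimal sketch of the EasyEnsemble-style sampling step (pure Python; the base classifiers themselves are omitted):

```python
import random

def balanced_subsets(X, y, n_subsets, seed=1):
    """EasyEnsemble-style sampling: each subset keeps every minority
    instance plus an equal-sized random draw from the majority class."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    subsets = []
    for _ in range(n_subsets):
        idx = minority + rng.sample(majority, len(minority))
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets

X = [[float(i)] for i in range(20)]
y = [0] * 17 + [1] * 3
subsets = balanced_subsets(X, y, n_subsets=4)
# one base classifier would be trained per balanced subset, then combined
```

Because every subset retains all minority instances, these ensembles achieve the high recall seen above (0.67 to 0.76) at the cost of precision.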
Figure 2. Results of the classifier with the feature selection scheme.
Results of removing the 1000 most important features.
| Method | Accuracy | Precision | Recall | F-Score | AUC |
|---|---|---|---|---|---|
| LR with data balancing (DB) | 0.90 | 0.51 | 0.56 | 0.53 | 0.75 |
| LR with DB and feature selection (FS) | 0.90 | 0.51 | 0.62 | 0.56 | 0.78 |
| 1000 best features removed | 0.84 | 0.29 | 0.30 | 0.29 | 0.60 |
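The last row is an ablation: remove the highest-ranked features and re-evaluate, confirming they carry most of the signal. A sketch of the mechanics, assuming features are ranked by absolute model weight (the paper's actual ranking criterion is not shown in this excerpt, so that choice is an assumption):

```python
def drop_top_features(X, weights, k):
    """Remove the k features with the largest absolute weights,
    mirroring the 'best features removed' ablation above."""
    ranked = sorted(range(len(weights)), key=lambda j: -abs(weights[j]))
    dropped = set(ranked[:k])
    keep = [j for j in range(len(weights)) if j not in dropped]
    return [[row[j] for j in keep] for row in X], keep

X = [[1.0, 2.0, 3.0, 4.0]]
weights = [0.1, -0.9, 0.5, 0.05]   # hypothetical model coefficients
X_reduced, kept = drop_top_features(X, weights, k=2)
```

Retraining on `X_reduced` and observing the performance drop (F-score 0.56 to 0.29 in the table) validates that the selected features were genuinely the informative ones.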
Results of the different types of features selected.
| Feature Name | Selected | Original | % of Category | % of All Features |
|---|---|---|---|---|
| text | 332 | 5000 | 6.6 | 4.90 |
| synset vector | 471 | 2000 | 23.5 | 6.90 |
| sentiments | 3 | 5 | 60.0 | 0.04 |
| cluster vector | 123 | 981 | 12.5 | 1.80 |
| structural features | 2 | 3 | 66.7 | 0.03 |
| ADRs lexicon | 2 | 2 | 100.0 | 0.03 |
| topics | 1 | 1 | 100.0 | 0.01 |
| topic vector | 65 | 500 | 13.0 | 0.95 |
| good/bad | 1 | 4 | 25.0 | 0.01 |
Results by deep learning methods.
| Methods | Accuracy | Precision | Recall | F-Score | AUC |
|---|---|---|---|---|---|
| CNN | 0.88 | 0.47 | 0.50 | 0.48 | 0.71 |
| CRNN | 0.85 | 0.38 | 0.53 | 0.44 | 0.71 |
| RCNN | 0.89 | 0.50 | 0.44 | 0.46 | 0.69 |
| BERT with BCE | 0.90 | 0.56 | 0.50 | 0.53 | 0.85 |
| BERT with MSE | 0.91 | 0.62 | 0.45 | 0.52 | 0.86 |
| BERT with fixed weights-1 | 0.90 | 0.58 | 0.49 | 0.51 | 0.82 |
| BERT with fixed weights-2 | 0.91 | 0.53 | 0.55 | 0.53 | 0.83 |
| BERT with BAW | 0.90 | 0.56 | 0.53 | 0.55 | 0.87 |
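The best AUC comes from "BERT with BAW", the batch-wise adaptive strategy: the class weights in the loss are recomputed for each mini-batch rather than fixed globally (as in the "fixed weights" rows). A minimal sketch of per-batch class weighting for a binary cross-entropy loss; the inverse-frequency form used here is an assumption, since the paper's exact weighting formula is not shown in this excerpt:

```python
import math

def batch_weights(labels):
    """Per-batch class weights, inversely proportional to each class's
    frequency in the current mini-batch (binary labels 0/1)."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    # guard against a batch containing only one class
    w_pos = n / (2 * n_pos) if n_pos else 0.0
    w_neg = n / (2 * n_neg) if n_neg else 0.0
    return w_pos, w_neg

def weighted_bce(probs, labels):
    """Binary cross-entropy using the batch-adaptive weights above."""
    w_pos, w_neg = batch_weights(labels)
    total = 0.0
    for p, t in zip(probs, labels):
        total += -(w_pos * t * math.log(p)
                   + w_neg * (1 - t) * math.log(1 - p))
    return total / len(labels)

loss = weighted_bce([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0])
```

Recomputing the weights per batch lets the loss track the actual class mix each step, which matters when imbalanced data makes mini-batch composition vary widely.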