Abeed Sarker, Maksim Belousov, Jasper Friedrichs, Kai Hakala, Svetlana Kiritchenko, Farrokh Mehryary, Sifei Han, Tung Tran, Anthony Rios, Ramakanth Kavuluru, Berry de Bruijn, Filip Ginter, Debanjan Mahata, Saif M Mohammad, Goran Nenadic, Graciela Gonzalez-Hernandez.
Abstract
Objective: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data.
Year: 2018 PMID: 30272184 PMCID: PMC6188524 DOI: 10.1093/jamia/ocy114
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. Class distributions for subtasks-1 and -2.
Figure 2. Sample instances and their categories for the 3 subtasks. Medication names are shown in bold-face.
Figure 3. Percentage distributions for 5 categories of approaches attempted by teams for the shared tasks.
Performance metrics for selected system submissions for subtask-1, baselines, and system ensembles. Precision, recall, and F1-score over the ADR class are shown. The top F1-score among all systems is shown in bold. Detailed discussions about the approaches can be found in the system description papers referenced
| System/Team | ADR precision | ADR recall | ADR F1-score |
|---|---|---|---|
| | 0.774 | 0.098 | 0.174 |
| | 0.501 | 0.215 | 0.219 |
| | 0.429 | 0.066 | 0.115 |
| | 0.392 | 0.488 | 0.435 |
| | 0.437 | 0.393 | 0.414 |
| | 0.395 | 0.431 | 0.412 |
| | 0.498 | 0.337 | 0.402 |
| | 0.336 | 0.348 | 0.342 |
| | 0.435 | 0.492 | 0.461 |
| | 0.529 | 0.398 | 0.454 |
| | 0.462 | 0.492 | |
| | 0.521 | 0.415 | 0.462 |
| | 0.304 | 0.641 | 0.413 |
| | 0.464 | 0.441 | 0.452 |
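The F1-score column is the harmonic mean of the precision and recall columns. The following is a minimal sketch of that relationship, not the shared task's evaluation script; the function name `f1_from_precision_recall` is a hypothetical helper introduced here for illustration.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (F1-score).

    Hypothetical helper for illustration only; returns 0.0 when both
    inputs are zero to avoid a division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Check against two rows of the table above (values copied from the table).
row_a = round(f1_from_precision_recall(0.392, 0.488), 3)
row_b = round(f1_from_precision_recall(0.437, 0.393), 3)
print(row_a, row_b)
```

Because the harmonic mean is dominated by the smaller of the two values, a system with high precision but very low recall (such as the first row) still receives a low F1-score.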
Figure 4. Distributions of system scores for the 3 subtasks (1, 2, and 3, respectively, from left to right).
Summary of system extensions and changes in performance compared to the original shared task systems
| Team | Subtask (evaluation metric) | Extension description | Score | Performance change |
|---|---|---|---|---|
| NRC-Canada | 1 (ADR F1-score) | Ensemble of 7 classifiers with random undersampling of the majority class to imbalance ratio of 1:2 | 0.456 | +0.021 |
| UKNLP | 1 (ADR F1-score) | Additional training data, logistic regression and CNN ensembles | 0.459 | +0.057 |
| InfyNLP | 2 (micro-averaged F1-score for classes 1 and 2) | Additional training data, increased number of random search runs | 0.692 | −0.001 |
| NRC-Canada | 2 (micro-averaged F1-score for classes 1 and 2) | Additional training data | 0.679 | +0.0058 |
| UKNLP | 2 (micro-averaged F1-score for classes 1 and 2) | Additional training data (and removed all non-ASCII characters from tweets) | 0.694 | +0.005 |
| TurkuNLP | 2 (micro-averaged F1-score for classes 1 and 2) | Additional training data | 0.665 | +0.002 |
| UKNLP | 3 (accuracy) | CNN instead of LSTM at the character level for hierarchical composition | 87.7% | +0.5 |
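One extension in the table above relies on random undersampling of the majority class to an imbalance ratio of 1:2. The sketch below illustrates the general technique only; it is not the NRC-Canada implementation. It assumes the ratio means one minority instance for every two majority instances, and the function name, parameter names, and default seed are all hypothetical.

```python
import random


def undersample_majority(examples, labels, minority_label=1,
                         majority_per_minority=2, seed=0):
    """Randomly drop majority-class instances until the ratio of
    minority to majority instances is 1:majority_per_minority.

    Assumption: every label other than `minority_label` counts as
    the majority class.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]

    # Never ask for more majority instances than are available.
    keep_n = min(len(majority_idx), majority_per_minority * len(minority_idx))
    kept = minority_idx + rng.sample(majority_idx, keep_n)
    rng.shuffle(kept)  # avoid leaving all minority instances at the front

    return [examples[i] for i in kept], [labels[i] for i in kept]


# Toy data: 10 minority (label 1) and 100 majority (label 0) instances.
toy_labels = [1] * 10 + [0] * 100
toy_examples = list(range(len(toy_labels)))
sampled_x, sampled_y = undersample_majority(toy_examples, toy_labels)
print(sampled_y.count(1), sampled_y.count(0))
```

In an ensemble setting, each classifier could be trained on a different random sample by varying `seed`, so that the discarded majority instances differ between members.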
Figure 5. Boxplots illustrating the performances of SVMs, CNNs, and Other classification strategies for subtasks-1 and -2.
Performance metrics for selected system submissions for subtask-2, baselines, and system ensembles. Micro-averaged precision, recall, and F1-scores are shown for the definite intake (class 1) and possible intake (class 2) classes. The highest F1-score over the evaluation dataset is shown in bold. Detailed discussions about the approaches can be found in the system description papers referenced (when available)
| System/Team | Micro-averaged precision for classes 1 and 2 | Micro-averaged recall for classes 1 and 2 | Micro-averaged F1-score for classes 1 and 2 |
|---|---|---|---|
| | 0.359 | 0.503 | 0.419 |
| | 0.652 | 0.436 | 0.523 |
| | 0.628 | 0.487 | 0.549 |
| | 0.725 | 0.664 | 0.693 |
| | 0.701 | 0.677 | 0.689 |
| | 0.708 | 0.642 | 0.673 |
| | 0.691 | 0.641 | 0.665 |
| | 0.701 | 0.630 | 0.663 |
| | 0.709 | 0.604 | 0.652 |
| | 0.690 | 0.554 | 0.614 |
| | 0.736 | 0.657 | 0.694 |
| | 0.726 | 0.679 | |
| | 0.724 | 0.673 | 0.697 |
| | 0.723 | 0.667 | 0.694 |
| | 0.727 | 0.673 | 0.699 |
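Micro-averaging over classes 1 and 2 pools the true positives, false positives, and false negatives of both classes before computing precision, recall, and F1-score. The sketch below shows one way to compute this. It is an illustration under stated assumptions, not the official evaluation script. The function name `micro_prf` is hypothetical, and the toy labels assume a third class (written here as 3) that is excluded from the average.

```python
def micro_prf(gold, pred, target_classes=(1, 2)):
    """Micro-averaged precision, recall, and F1 over `target_classes`.

    Counts are pooled across the target classes; any other class only
    contributes through errors against a target class.
    """
    tp = fp = fn = 0
    for c in target_classes:
        for g, p in zip(gold, pred):
            if p == c and g == c:
                tp += 1          # correctly predicted class c
            elif p == c and g != c:
                fp += 1          # predicted c, but gold is another class
            elif g == c and p != c:
                fn += 1          # gold is c, but another class predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy gold and predicted labels (hypothetical data).
gold = [1, 1, 2, 2, 3, 3]
pred = [1, 2, 2, 3, 3, 1]
print(micro_prf(gold, pred))
```

Note that a class 1 instance mislabeled as class 2 counts twice in the pooled totals: once as a false negative for class 1 and once as a false positive for class 2.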
System performances for subtask-3, including baselines and ensembles. Summary approaches and accuracies over the evaluation set are presented. Best performance is shown in bold
| Team | Approach summary | Accuracy (%) |
|---|---|---|
| | Exact lexical match with MedDRA PT | 11.6 |
| | Exact lexical match with MedDRA LLT or PT | 25.1 |
| | Match with training set annotation | 63.5 |
| | Multinomial Logistic Regression | 87.7 |
| | RNN with GRU | 85.5 |
| | Ensemble | 88.5 |
| | Hierarchical RNN with LSTM | 87.2 |
| | Hierarchical RNN with LSTM and external data | 86.7 |
| | All systems | |
| | Top 3 | |
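The exact lexical match baselines in the table above can be pictured as a dictionary lookup from a mention string to a concept identifier, scored by accuracy. The sketch below is an illustration only. The term dictionary, the identifiers `C001` and `C002`, and the function names are hypothetical placeholders, not real MedDRA entries or the organizers' code. Lowercasing and whitespace stripping are also assumptions about how "exact" matching might be applied.

```python
def exact_match_normalize(mention, term_to_code):
    """Return the code whose term exactly matches the mention
    (after lowercasing and stripping), or None if there is no match."""
    return term_to_code.get(mention.strip().lower())


def accuracy(gold_codes, predicted_codes):
    """Percentage of mentions whose predicted code equals the gold code."""
    if not gold_codes:
        return 0.0
    correct = sum(1 for g, p in zip(gold_codes, predicted_codes) if g == p)
    return 100.0 * correct / len(gold_codes)


# Hypothetical toy lexicon and mentions (placeholder codes, not MedDRA).
toy_lexicon = {"headache": "C001", "nausea": "C002"}
mentions = ["Headache", "head hurts", "nausea ", "feeling sick"]
gold = ["C001", "C001", "C002", "C002"]

predicted = [exact_match_normalize(m, toy_lexicon) for m in mentions]
print(predicted, accuracy(gold, predicted))
```

Colloquial mentions such as "head hurts" have no exact match in the toy lexicon, which illustrates why the lexical baselines in the table score far below the learned systems.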