Asan Agibetov, Kathrin Blagec, Hong Xu, Matthias Samwald.
Abstract
BACKGROUND: Biomedical literature is expanding rapidly, and tools that help locate information of interest are needed. To this end, many approaches for classifying sentences in biomedical publications according to their coarse semantic and rhetorical categories (e.g., Background, Methods, Results, Conclusions) have been devised, with recent state-of-the-art results reported for a complex deep learning model. Recent evidence showed that shallow and wide neural models such as fastText can provide results that are competitive with or superior to complex deep learning models while requiring drastically lower training times and scaling better. We analyze the efficacy of the fastText model in the classification of biomedical sentences in the PubMed 200k RCT benchmark, and introduce a simple pre-processing step that enables the application of fastText to sentence sequences. Furthermore, we explore the utility of two unsupervised pre-training approaches in scenarios where labeled training data are limited.
Keywords: FastText; Natural language processing; Neural networks; Scientific abstracts; Text classification; Word vector models
Year: 2018 PMID: 30577747 PMCID: PMC6303852 DOI: 10.1186/s12859-018-2496-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The second column gives the vocabulary size. For the train, validation and test datasets, the number of abstracts is shown, followed by the number of sentences in parentheses.
| Dataset | Vocabulary size | Train | Validation | Test |
|---|---|---|---|---|
| PubMed 20k | 68k | 15k (180k) | 2.5k (30k) | 2.5k (30k) |
| PubMed 200k | 331k | 190k (2.2M) | 2.5k (29k) | 2.5k (29k) |
| Extended corpus | 451k | 872k (10.3M) | 2.5k (29k) | 2.5k (29k) |
Fig. 1 Schematic representation of the neural embedding model for sentences (supervised and unsupervised), consisting of two embedding layers and a final softmax layer over k classes (for the supervised case). In the unsupervised case, the softmax outputs the probability of the target word (over the whole vocabulary, as in the C-BOW model) given its context: a fixed-length context for fastText and the entire sentence for sent2vec. Independently of the training mode (supervised vs. unsupervised), word embeddings are stored as columns of the weight matrix V of the first embedding layer. Note that in the unsupervised case the rows of the weight matrix U of the second embedding layer represent the embeddings of the “negative” words; these embeddings, however, are not used for downstream machine learning tasks. In all instances the averaging of the embeddings of constituent tokens is performed by fastText (the sent2vec implementation is based on fastText).
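The supervised forward pass described in the caption can be sketched in a few lines. This is an illustrative toy, not the fastText implementation: the function names, the tiny dimensions and the random weights are all assumptions, and n-gram features, hashing and training are left out. It only shows the three steps the caption names: look up columns of V, average them, and apply a softmax over k classes through U.

```python
import math
import random


def average_embeddings(token_ids, V):
    """Average the columns of V selected by token_ids.

    V is stored as dim rows of vocab_size entries, so the embedding of
    token t is the column [V[d][t] for d in range(dim)].
    """
    n = len(token_ids)
    return [sum(row[t] for t in token_ids) / n for row in V]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(token_ids, V, U):
    """Class probabilities for one sentence; U has one row per class."""
    h = average_embeddings(token_ids, V)
    scores = [sum(u * x for u, x in zip(row, h)) for row in U]
    return softmax(scores)


# Toy sizes and random weights (assumptions for illustration only).
rng = random.Random(0)
dim, vocab_size, k = 4, 10, 5
V = [[rng.uniform(-0.5, 0.5) for _ in range(vocab_size)] for _ in range(dim)]
U = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(k)]

probs = predict_proba([1, 3, 7], V, U)  # a "sentence" of three token ids
```

In the unsupervised setting the same structure would apply, except that the softmax would range over the vocabulary rather than over k classes.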
Weighted F1 scores for various models trained on single sentences. Best results for each dataset are printed in bold. For our models, training time is given (for hyperparameter settings yielding the shown score)
| Model | PubMed 20k | PubMed 200k | Extended corpus |
|---|---|---|---|
| Logistic regression model (LR) [ | .831 | .859 (33,006 s) | - |
| Forward artificial neural network (ANN) [ | .861 | .884 | - |
| Conditional random field (CRF) [ | .895 | .915 (4867 s) | - |
| bi-ANN [1]a | | .916 | - |
| | .825 (5 s) | .852 (13 s) | .852 (61 s) |
| | .896 (11 s) | | |
a Result and runtime reported by [11]; the runtimes reported by the authors include both training and testing time, whereas we report only training time. Testing a trained fastText model took approx. 15 s with the evaluation tool supplied by the fastText library.
Ablation experiments based on PubMed 200k dataset
| Model variant | Weighted F1 score |
|---|---|
| Full model | .917 |
| Removed numeric sentence position | .912 |
| Removed sentence context | .904 |
| Removed both sentence context and numeric sentence position (single sentence model) | .852 |
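The two ablated features, sentence context and numeric sentence position, can both be supplied to a bag-of-words classifier as extra tokens. The sketch below shows one possible encoding and should not be read as the authors' exact pre-processing: the `add_context` helper, the `pos_<n>` token, the `-1_`/`+1_` prefixes and the window size are all hypothetical. Only the `__label__` prefix is the default label marker of fastText's supervised input format.

```python
def add_context(sentences, window=1):
    """Return one augmented token string per sentence of an abstract.

    Hypothetical encoding (an assumption, not the paper's scheme):
    - the 1-based sentence position becomes a "pos_<n>" token;
    - tokens of neighbouring sentences are added with a relative-offset
      prefix such as "-1_" or "+1_".
    """
    tokenized = [s.lower().split() for s in sentences]
    augmented = []
    for i, tokens in enumerate(tokenized):
        parts = [f"pos_{i + 1}"] + tokens
        for offset in range(-window, window + 1):
            j = i + offset
            if offset == 0 or not 0 <= j < len(tokenized):
                continue  # skip the sentence itself and out-of-range neighbours
            parts += [f"{offset:+d}_{tok}" for tok in tokenized[j]]
        augmented.append(" ".join(parts))
    return augmented


# Made-up example abstract and labels, for illustration only.
abstract = ["We study X .", "Patients were randomized .", "Y improved ."]
labels = ["OBJECTIVE", "METHODS", "RESULTS"]

# One training line per sentence in fastText's supervised text format.
lines = [f"__label__{lab} {text}"
         for lab, text in zip(labels, add_context(abstract))]
```

Setting `window=0` and dropping the `pos_<n>` token would correspond to the single sentence model in the last row of the table.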
Confusion matrix for test results for the PubMed 200k dataset, yielded by the fastText model with sentence context and numeric sentence position
| True label (rows) / Predicted label (columns) | Objective | Background | Methods | Results | Conclusions |
|---|---|---|---|---|---|
| Objective | | 573 | 98 | 0 | 2 |
| Background | 471 | | 91 | 8 | 42 |
| Methods | 25 | 36 | | 296 | 18 |
| Results | 3 | 3 | 395 | | 131 |
| Conclusions | 1 | 23 | 13 | 203 | |
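A weighted F1 score can be derived directly from a confusion matrix of this shape (rows are true labels, columns are predicted labels). The following is a minimal sketch under the usual definition, in which per-class F1 scores are averaged with weights proportional to each class's number of true instances. The function name and the 2×2 toy matrix are assumptions for illustration; the toy numbers are not taken from the table above.

```python
def weighted_f1(confusion):
    """Weighted F1 from a square confusion matrix.

    confusion[i][j] is the number of items whose true class is i and
    whose predicted class is j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    score = 0.0
    for c in range(k):
        tp = confusion[c][c]
        support = sum(confusion[c])                         # true items of class c
        predicted = sum(confusion[r][c] for r in range(k))  # items predicted as c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support / total                       # weight by class support
    return score


# Made-up two-class example, not data from the paper.
toy = [[8, 2],
       [1, 9]]
toy_score = weighted_f1(toy)
```

Because the weights follow class support, frequent classes dominate the score.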
Fig. 2 Weighted F1 score of test set predictions for different training set sizes