Aris Fergadis1,2, Christos Baziotis1,3, Dimitris Pappas2,3, Haris Papageorgiou2, Alexandros Potamianos1,2.
Abstract
In this paper, we describe a hierarchical bi-directional attention-based Recurrent Neural Network (RNN) as a reusable sequence encoder architecture, which is used as both the sentence encoder and the document encoder for document classification. The sequence encoder is composed of two bi-directional RNNs equipped with an attention mechanism that identifies and captures the most important elements (words or sentences) in a document, followed by a dense layer for the classification task. Our approach exploits the hierarchical nature of documents, which are composed of sequences of sentences, which in turn are composed of sequences of words. In our model, we use word embeddings to project the words to a low-dimensional vector space. We leverage word embeddings trained on PubMed to initialize the embedding layer of our network. We apply this model to biomedical literature, specifically to paper abstracts published in PubMed. We argue that the title of the paper usually contains important information that is more salient than a typical sentence in the abstract. For this reason, we propose a shortcut connection that integrates the title vector representation directly into the final feature representation of the document. We concatenate the sentence vector that represents the title and the vectors of the abstract to form the document feature vector used as input to the task classifier. With this system we participated in the Document Triage Task of the BioCreative VI Precision Medicine Track and achieved 0.6289 Precision, 0.7656 Recall and 0.6906 F1-score, with the Precision and F1-score ranking first among the participating systems.
Database URL: https://github.com/afergadis/BC6PM-HRNN
Year: 2018 PMID: 30137284 PMCID: PMC6105093 DOI: 10.1093/database/bay076
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Overview of our proposed system. Word Vectors is a matrix of word embeddings, where M is the maximum number of sentences and N the maximum number of words in a sentence. t refers to the Sentence Encoder representation for the title vector and to the representations of the abstract vectors.
Figure 2. Architecture of our proposed sequence encoder. The same architecture is used for encoding a sequence of word vectors into a sentence vector (sentence encoder) and a sequence of sentence vectors into a document vector (document encoder). When used as a sentence encoder, x represents words, T takes values up to N and the output sequence vector is a sentence vector. When used as a document encoder, x represents sentences, T takes values up to M and the output is a document vector.
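The reuse of one encoder at both levels, together with the title shortcut, can be illustrated with a minimal sketch. It is not the authors' implementation: the bi-directional RNN layer is omitted entirely, so attention is applied directly to the input vectors, and the scoring function (tanh of a dot product followed by a softmax) is an assumed, commonly used form. The names `attention_pool` and `encode_document`, the weight vectors, and the choice to feed only abstract sentences to the document encoder are all hypothetical.

```python
import math


def attention_pool(states, weights, bias=0.0):
    """Collapse a sequence of vectors into one vector with attention.

    Assumed scoring: e_i = tanh(w . h_i + b), a = softmax(e),
    output = sum_i a_i * h_i. In the described model the states would
    be bi-directional RNN outputs; here they are the raw inputs.
    """
    scores = [
        math.tanh(sum(w * h for w, h in zip(weights, state)) + bias)
        for state in states
    ]
    # Numerically stable softmax over the sequence positions.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(states[0])
    pooled = [
        sum(a * state[d] for a, state in zip(alphas, states))
        for d in range(dim)
    ]
    return pooled, alphas


def encode_document(title, abstract, word_weights, sentence_weights):
    """Hierarchical encoding with a title shortcut (illustrative only).

    title:    list of word vectors
    abstract: list of sentences, each a list of word vectors
    """
    # Sentence level: the same encoder turns words into a sentence vector.
    title_vec, _ = attention_pool(title, word_weights)
    sentence_vecs = [attention_pool(s, word_weights)[0] for s in abstract]
    # Document level: the same encoder turns sentence vectors into a document vector.
    doc_vec, _ = attention_pool(sentence_vecs, sentence_weights)
    # Shortcut connection: the title vector is concatenated to the document vector.
    return title_vec + doc_vec


if __name__ == "__main__":
    title = [[1.0, 0.0], [0.5, 0.5]]
    abstract = [[[0.0, 1.0], [1.0, 1.0]], [[0.2, 0.1]]]
    features = encode_document(title, abstract, [1.0, 0.0], [0.0, 1.0])
    print(len(features))  # title vector and document vector, concatenated
```

Because the shortcut is a plain concatenation, the classifier input is twice as wide as a single encoder output, and the title contributes directly rather than only through the document-level attention weights.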
Dataset of BC6-PM document triage task
| Dataset | Negative | Positive | Total |
|---|---|---|---|
| Train | 2353 (57.64%) | 1729 (42.36%) | 4082 |
| Test | 723 (50.67%) | 704 (49.33%) | 1427 |
Figure 3. Distribution of the number of sentences in the abstracts of the train and test sets.
Figure 4. Distribution of the number of words per sentence in the train and test sets.
Hyper-parameter values of our model
| Layer | Size | Dropout | Noise (σ) |
|---|---|---|---|
| Embedding | 200 | 0.2 | 0.2 |
| GRU | 150 (x2) | 0.3 | — |
| Attention | 1 | 0.3 | — |
| GRU | 150 (x2) | 0.3 | — |
| Attention | 1 | 0.3 | — |
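For readers who want these values in machine-readable form, the table can be written as a small configuration object. This is a sketch under two assumptions that the table itself does not state: that "150 (x2)" means 150 units per direction of a bi-directional GRU, and that the first GRU/Attention pair belongs to the sentence encoder and the second to the document encoder. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EncoderConfig:
    """One GRU + attention pair from the hyper-parameter table."""
    gru_units: int = 150          # per direction (assumed reading of "150 (x2)")
    bidirectional: bool = True
    gru_dropout: float = 0.3
    attention_dropout: float = 0.3

    @property
    def output_size(self) -> int:
        # Forward and backward states are assumed to be concatenated.
        return self.gru_units * (2 if self.bidirectional else 1)


@dataclass(frozen=True)
class ModelConfig:
    embedding_size: int = 200
    embedding_dropout: float = 0.2
    embedding_noise_sigma: float = 0.2
    # Assumed mapping of the two table blocks to the two encoder levels.
    sentence_encoder: EncoderConfig = field(default_factory=EncoderConfig)
    document_encoder: EncoderConfig = field(default_factory=EncoderConfig)


config = ModelConfig()
print(config.sentence_encoder.output_size)
```

Under the stated reading, each encoder emits a vector twice the size of its per-direction GRU state.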
Tokenization options for biomedical entity mentions [e.g. ‘branchio-oto-renal (BOR) syndrome’]
| Option | Result |
|---|---|
| Tokens | ‘branchio’, ‘oto’, ‘renal’, ‘BOR’, ‘syndrome’ |
| MWE | ‘branchio-oto-renal’, ‘(BOR)’, ‘syndrome’ |
| Both | ‘branchio-oto-renal’, ‘(BOR)’, ‘syndrome’, ‘branchio’, ‘oto’, ‘renal’, ‘BOR’, ‘syndrome’ |
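The three options can be reproduced for the example mention with a few lines of Python. This is only a sketch that matches the rows of the table; the tokenizer actually used in the system is not described here, and the function name `tokenize`, the whitespace split for multi-word expressions (MWE) and the alphanumeric regular expression are assumptions.

```python
import re


def tokenize(text, option="tokens"):
    """Tokenize an entity mention under one of three illustrative options."""
    # "tokens": split on every non-alphanumeric character.
    words = re.findall(r"[A-Za-z0-9]+", text)
    # "mwe": keep hyphenated and bracketed units intact (whitespace split).
    units = text.split()
    if option == "tokens":
        return words
    if option == "mwe":
        return units
    if option == "both":
        # Intact units first, followed by their constituent tokens.
        return units + words
    raise ValueError(f"unknown tokenization option: {option!r}")


mention = "branchio-oto-renal (BOR) syndrome"
for option in ("tokens", "mwe", "both"):
    print(option, tokenize(mention, option))
```

The "both" option repeats ‘syndrome’ because it appears once as an intact unit and once as a token, exactly as in the table.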
Official results for the submitted runs along with the organizer’s baseline and an SVM model
| Model | Run | RNN size | Shortcut connection | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| Baseline | — | — | — | 0.6122 | 0.6435 | 0.6274 |
| SVM | — | — | — | 0.5850 | 0.7869 | 0.6711 |
| HBGRU | 1 | 100 | No | 0.6136 | 0.7670 | 0.6818 |
| HBGRU | 2 | 150 | No | 0.5944 | 0.8139 | 0.6871 |
| HBGRU | 3 | 150 | Yes | 0.6289 | 0.7656 | 0.6906 |
The hyper-parameters not mentioned remain unchanged.
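As a consistency check, the F1-score is the harmonic mean of Precision and Recall, so each reported F1 can be recomputed from the other two columns. The sketch below uses the values reported for the baseline, the SVM and the three HBGRU runs (the F1-score of run 3 is the one given in the abstract); the dictionary layout and the tolerance are illustrative choices.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per system, copied from the results.
REPORTED = {
    "Baseline": (0.6122, 0.6435, 0.6274),
    "SVM": (0.5850, 0.7869, 0.6711),
    "HBGRU run 1": (0.6136, 0.7670, 0.6818),
    "HBGRU run 2": (0.5944, 0.8139, 0.6871),
    "HBGRU run 3": (0.6289, 0.7656, 0.6906),
}

for name, (precision, recall, reported) in REPORTED.items():
    recomputed = f1_score(precision, recall)
    # Inputs are rounded to four decimals, so allow a small tolerance.
    consistent = abs(recomputed - reported) < 5e-4
    print(f"{name}: recomputed={recomputed:.4f} reported={reported:.4f} ok={consistent}")
```

Small differences in the last digit are expected because Precision and Recall are themselves rounded.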
F1-scores of the 5-fold cross-validation, with and without annotation of biomedical entities, for the three tokenization options
| Fold | Annotation | Tokens | MWE | Both |
|---|---|---|---|---|
| 1 | Yes | 0.6078 | 0.6097 | 0.6088 |
| 2 | Yes | 0.7493 | 0.7550 | 0.7399 |
| 3 | Yes | 0.8023 | 0.7883 | 0.8067 |
| 4 | Yes | 0.7834 | 0.7846 | 0.7581 |
| 5 | Yes | 0.6974 | 0.7019 | 0.7018 |
| 1 | No | 0.6257 | 0.6171 | 0.6178 |
| 2 | No | 0.7557 | 0.7516 | 0.7555 |
| 3 | No | 0.7988 | 0.7903 | 0.8012 |
| 4 | No | 0.7904 | 0.7682 | 0.7578 |
| 5 | No | 0.7145 | 0.7136 | 0.7037 |
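One way to compare the six configurations is to average the per-fold F1-scores. The sketch below does only that, using the values from the table; it is not necessarily how the configuration for the submitted runs was selected, and the variable names are illustrative.

```python
from statistics import mean

# Per-fold F1-scores keyed by (annotation, tokenization option), folds 1 to 5.
F1_BY_FOLD = {
    ("Yes", "Tokens"): [0.6078, 0.7493, 0.8023, 0.7834, 0.6974],
    ("Yes", "MWE"): [0.6097, 0.7550, 0.7883, 0.7846, 0.7019],
    ("Yes", "Both"): [0.6088, 0.7399, 0.8067, 0.7581, 0.7018],
    ("No", "Tokens"): [0.6257, 0.7557, 0.7988, 0.7904, 0.7145],
    ("No", "MWE"): [0.6171, 0.7516, 0.7903, 0.7682, 0.7136],
    ("No", "Both"): [0.6178, 0.7555, 0.8012, 0.7578, 0.7037],
}

mean_f1 = {config: mean(scores) for config, scores in F1_BY_FOLD.items()}
best = max(mean_f1, key=mean_f1.get)

# List configurations from highest to lowest mean F1.
for (annotation, option), score in sorted(mean_f1.items(), key=lambda kv: -kv[1]):
    print(f"annotation={annotation:<3} tokenization={option:<6} mean F1={score:.4f}")
```

The spread between folds (roughly 0.61 on fold 1 versus 0.80 on fold 3) is much larger than the spread between configurations, so differences in the means should be read with caution.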