Gully A. Burns, Xiangci Li, Nanyun Peng.
Abstract
We investigate the application of deep learning to biocuration tasks that involve classification of text associated with biomedical evidence in primary research articles. We developed a large-scale corpus of molecular papers derived from PubMed and PubMed Central open access records and used it to train deep learning word embeddings under the GloVe, FastText and ELMo algorithms. We applied those models to a distant-supervised method classification task based on text from figure captions, or from fragments surrounding references to figures in the main text, using a variety of models and parameterizations. We then developed document classification (triage) methods for molecular interaction papers by using deep learning mechanisms of attention to aggregate classification-based decisions over selected paragraphs in the document. We obtained triage performance with an accuracy of 0.82 using a combined convolutional neural network and bi-directional long short-term memory (CNN-BiLSTM) architecture, augmented by attention, to produce a single triage decision. With this work, we hope to encourage developers of biocuration systems to apply deep learning methods to their specialized tasks by repurposing large-scale word embeddings for their own data.
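The attention-based aggregation described in the abstract can be sketched in plain Python. This is an illustrative toy, not the authors' implementation: the `attention_pool` helper, the toy vectors, and the classifier weights are all invented for demonstration. The idea it shows is that per-paragraph encodings (stand-ins for CNN-BiLSTM outputs) are scored against a learned query, softmax-normalized into attention weights, and averaged into a single document vector that feeds one final triage decision.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_pool(paragraph_vecs, query):
    """Aggregate per-paragraph encodings into one document vector.

    paragraph_vecs: equal-length feature vectors, one per paragraph
    (stand-ins for CNN-BiLSTM encoder outputs); query: a learned
    attention parameter vector of the same length."""
    alpha = softmax([dot(h, query) for h in paragraph_vecs])
    dim = len(paragraph_vecs[0])
    pooled = [sum(a * h[i] for a, h in zip(alpha, paragraph_vecs))
              for i in range(dim)]
    return pooled, alpha

# Toy example: three "paragraphs" with 2-d encodings.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]                      # invented attention parameters
doc_vec, alpha = attention_pool(H, query)

# A final linear layer plus sigmoid yields one triage probability.
classifier_w, bias = [0.5, -0.25], 0.1  # invented classifier weights
p_curatable = 1 / (1 + math.exp(-(dot(doc_vec, classifier_w) + bias)))
```

Because the weighted average is differentiable, the attention weights can be trained end-to-end with the encoder, letting the model learn which paragraphs (e.g. captions or evidence fragments) matter most for the document-level label.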
Year: 2019 PMID: 30938776 PMCID: PMC6449534 DOI: 10.1093/database/baz034
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Neural network configurations for method classification based on the INTACT evidence fragment corpus.
Figure 2. Neural network configurations for document triage applied to ‘Darkspace’ data.
Triage accuracy by text source, network model, and embedding type

| Text source | Network model | Accuracy (embedding 1) | Accuracy (embedding 2) |
|---|---|---|---|
| Abstract | CNN-BiLSTM | 0.71 ± 0.02 | 0.76 ± 0.02 |
| All paragraphs | CNN-BiLSTM + attention | 0.81 ± 0.01 | 0.76 ± 0.01 |
| Captions | CNN-BiLSTM + attention | 0.82 ± 0.01 | 0.79 ± 0.01 |
| Captions + Evid. Frg. | CNN-BiLSTM + attention | 0.82 ± 0.01 | 0.77 ± 0.01 |
| Evid. Frg. | CNN-BiLSTM + attention | 0.81 ± 0.01 | 0.77 ± 0.01 |
| MeSH | Simple CNN | 0.61 ± 0.03 | 0.70 ± 0.02 |
| Title | Simple CNN | 0.60 ± 0.02 | 0.69 ± 0.02 |
| Title + Abstract | CNN-BiLSTM | 0.72 ± 0.01 | 0.71 ± 0.01 |
Figure 3. Classification accuracy for experimental methods based on text, network model, and embedding.