| Literature DB >> 30864314 |
Arjun Magge, Davy Weissenbacher, Abeed Sarker, Matthew Scotch, Graciela Gonzalez-Hernandez.
Abstract
Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The often insufficient geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods to extract geographic location names (toponyms) from the scientific article associated with a sequence, and to disambiguate those locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations, and of their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, a disambiguation accuracy of 91%, and an overall resolution F1 score of 0.88, all significantly higher than previously developed methods, improving our capability to find the location of infected hosts and to enrich metadata information.
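The population heuristic mentioned in the abstract can be sketched as follows. This is an illustrative sketch only: the candidate records, field names, and example entries below are hypothetical and do not reproduce the authors' actual gazetteer pipeline.

```python
# Sketch of a population heuristic for toponym disambiguation.
# Candidate records and their fields ('name', 'lat', 'lon',
# 'population') are hypothetical, not the authors' actual schema.

def disambiguate(candidates):
    """Pick the candidate location with the largest population."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["population"])

# Example: an ambiguous toponym like "Manchester" matches several
# gazetteer entries; the heuristic prefers the most populous one.
manchester_candidates = [
    {"name": "Manchester, UK", "lat": 53.48, "lon": -2.24, "population": 547_000},
    {"name": "Manchester, NH, USA", "lat": 42.99, "lon": -71.45, "population": 112_000},
]
best = disambiguate(manchester_candidates)
print(best["name"])
```

In practice the heuristic serves as a strong baseline because ambiguous place names are mentioned far more often in reference to their most populous referent.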
Year: 2019 PMID: 30864314 PMCID: PMC6417823
Source DB: PubMed Journal: Pac Symp Biocomput ISSN: 2335-6928
Fig. 1. A schematic representation of the sequence of actions performed in the NER system, equipped with bi-directional RNN layers and an output CRF layer. The RNN variants discussed in this paper replace the RNN units with LSTM, LSTM-Peephole, GRU, and UG-RNN units.
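At inference time, the CRF output layer in Fig. 1 decodes the best tag sequence from the RNN's per-token scores, typically via the Viterbi algorithm. The sketch below shows that decoding step in isolation; all scores are hand-picked for illustration, whereas a trained model would learn emission scores (from the RNN) and tag-transition scores.

```python
# Minimal Viterbi decoder for a linear-chain CRF output layer.
# All scores below are illustrative, not learned model weights.

def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for one sentence.

    emissions: one {tag: score} dict per token.
    transitions: {(prev_tag, tag): score}.
    """
    best = {t: emissions[0][t] for t in tags}  # best path score ending in t
    backpointers = []
    for emit in emissions[1:]:
        new_best, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[(p, t)])
            new_best[t] = best[prev] + transitions[(prev, t)] + emit[t]
            ptr[t] = prev
        best = new_best
        backpointers.append(ptr)
    # Trace the best path backwards from the best final tag.
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Toy example: BIO tags over the tokens "New York virus".
TAGS = ["B-LOC", "I-LOC", "O"]
EMISSIONS = [
    {"B-LOC": 3, "I-LOC": 0, "O": 1},  # "New"
    {"B-LOC": 0, "I-LOC": 2, "O": 1},  # "York"
    {"B-LOC": 0, "I-LOC": 0, "O": 3},  # "virus"
]
TRANSITIONS = {
    ("B-LOC", "B-LOC"): -1, ("B-LOC", "I-LOC"): 2,   ("B-LOC", "O"): 0,
    ("I-LOC", "B-LOC"): -1, ("I-LOC", "I-LOC"): 1,   ("I-LOC", "O"): 0,
    ("O", "B-LOC"): 1,      ("O", "I-LOC"): -10,     ("O", "O"): 1,
}
print(viterbi_decode(EMISSIONS, TRANSITIONS, TAGS))
```

The strongly negative O→I-LOC transition is what lets the CRF enforce valid BIO sequences, which per-token softmax output cannot guarantee.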
Median precision (P), recall (R), and F1 scores for NER and resolution. Bold scores indicate the highest performance. All recurrent neural network units were used in a bidirectional setup, with inputs comprising pre-trained word embeddings, character embeddings, and case features, and an output layer with an additional CRF layer.
| Method | NER-Strict P | NER-Strict R | NER-Strict F1 | NER-Overlap P | NER-Overlap R | NER-Overlap F1 | Resolution P | Resolution R | Resolution F1 |
|---|---|---|---|---|---|---|---|---|---|
| Rule-based | 0.58 | 0.876 | 0.698 | 0.599 | 0.904 | 0.72 | 0.547 | 0.697 | |
| CRF-All | 0.85 | 0.76 | 0.80 | 0.86 | 0.77 | 0.81 | - | - | - |
| FFNN + DS | 0.90 | 0.93 | 0.91 | - | - | - | - | - | - |
| RNN | 0.910 | 0.891 | 0.901 | 0.931 | 0.912 | 0.922 | 0.896 | 0.817 | 0.855 |
| UG-RNN | 0.948 | 0.902 | 0.924 | 0.959 | 0.912 | 0.935 | 0.903 | 0.824 | 0.862 |
| GRU | 0.919 | 0.935 | 0.930 | 0.948 | | | 0.888 | 0.835 | 0.860 |
| LSTM | 0.932 | 0.926 | 0.929 | 0.954 | 0.947 | 0.950 | 0.892 | 0.842 | 0.866 |
| LSTM-Peep | | | 0.934 | | | 0.951 | | | 0.863 |
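The strict and overlapping matching criteria used in the table above can be sketched as follows. This is an illustrative sketch under simple assumptions (spans as character offsets, overlap counted per predicted span); the paper's actual evaluation script is not shown.

```python
# Sketch of strict vs. overlapping span matching for NER evaluation.
# Spans are (start, end) character offsets; data is illustrative.

def span_f1(gold, pred, strict=True):
    """Compute precision, recall, and F1 over entity spans.

    strict: spans must match exactly; otherwise any overlap counts.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    if strict:
        tp = len(set(gold) & set(pred))
    else:
        tp = sum(any(overlaps(p, g) for g in gold) for p in pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

GOLD = [(0, 8), (20, 26)]   # gold entity spans
PRED = [(0, 8), (21, 26)]   # second prediction is off by one character
print(span_f1(GOLD, PRED, strict=True))
print(span_f1(GOLD, PRED, strict=False))
```

A boundary error of a single character fails the strict criterion but still counts under overlapping matching, which is why the overlapping scores in the table are consistently higher than the strict ones.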
Fig. 2. (Left) Ablation/leave-one-out analysis showing the contribution of individual features to NER performance across the RNN models. (Right) Impact of additional layers on NER performance across the RNN models. Here, RNN layers refer to the respective variants of the RNN architectures. The y-axis shows strict F1 scores.