| Literature DB >> 30864314 |
Arjun Magge, Davy Weissenbacher, Abeed Sarker, Matthew Scotch, Graciela Gonzalez-Hernandez.
Abstract
Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The often insufficient geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods to extract geographic location names (toponyms) from the scientific article associated with a sequence, and to disambiguate those locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations, and of their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, a disambiguation accuracy of 91%, and an overall resolution F1 score of 0.88, all significantly higher than previously developed methods, improving our capability to find the location of infected hosts and to enrich metadata information.
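The population heuristic mentioned in the abstract can be sketched as follows. This is an illustrative sketch only: the candidate records, field names, and example entries below are hypothetical and do not reproduce the authors' actual gazetteer pipeline.

```python
# Sketch of a population heuristic for toponym disambiguation.
# Candidate records and their fields ('name', 'lat', 'lon',
# 'population') are hypothetical, not the authors' actual schema.

def disambiguate(candidates):
    """Pick the candidate location with the largest population."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["population"])

# Example: an ambiguous toponym like "Manchester" matches several
# gazetteer entries; the heuristic prefers the most populous one.
manchester_candidates = [
    {"name": "Manchester, UK", "lat": 53.48, "lon": -2.24, "population": 547_000},
    {"name": "Manchester, NH, USA", "lat": 42.99, "lon": -71.45, "population": 112_000},
]
best = disambiguate(manchester_candidates)
print(best["name"])
```

In practice the heuristic serves as a strong baseline because ambiguous place names are mentioned far more often in reference to their most populous referent.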
Year: 2019 PMID: 30864314 PMCID: PMC6417823
Source DB: PubMed Journal: Pac Symp Biocomput ISSN: 2335-6928
Fig. 1. A schematic representation of the sequence of actions performed in the NER system, equipped with bi-directional RNN layers and an output CRF layer. The RNN variants discussed in this paper replace the RNN units with LSTM, LSTM-Peephole, GRU, and UG-RNN units.
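At inference time, the CRF output layer in Fig. 1 decodes the best tag sequence from the RNN's per-token scores, typically via the Viterbi algorithm. The sketch below shows that decoding step in isolation; all scores are hand-picked for illustration, whereas a trained model would learn emission scores (from the RNN) and tag-transition scores.

```python
# Minimal Viterbi decoder for a linear-chain CRF output layer.
# All scores below are illustrative, not learned model weights.

def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for one sentence.

    emissions: one {tag: score} dict per token.
    transitions: {(prev_tag, tag): score}.
    """
    best = {t: emissions[0][t] for t in tags}  # best path score ending in t
    backpointers = []
    for emit in emissions[1:]:
        new_best, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[(p, t)])
            new_best[t] = best[prev] + transitions[(prev, t)] + emit[t]
            ptr[t] = prev
        best = new_best
        backpointers.append(ptr)
    # Trace the best path backwards from the best final tag.
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Toy example: BIO tags over the tokens "New York virus".
TAGS = ["B-LOC", "I-LOC", "O"]
EMISSIONS = [
    {"B-LOC": 3, "I-LOC": 0, "O": 1},  # "New"
    {"B-LOC": 0, "I-LOC": 2, "O": 1},  # "York"
    {"B-LOC": 0, "I-LOC": 0, "O": 3},  # "virus"
]
TRANSITIONS = {
    ("B-LOC", "B-LOC"): -1, ("B-LOC", "I-LOC"): 2,   ("B-LOC", "O"): 0,
    ("I-LOC", "B-LOC"): -1, ("I-LOC", "I-LOC"): 1,   ("I-LOC", "O"): 0,
    ("O", "B-LOC"): 1,      ("O", "I-LOC"): -10,     ("O", "O"): 1,
}
print(viterbi_decode(EMISSIONS, TRANSITIONS, TAGS))
```

The strongly negative O→I-LOC transition is what lets the CRF enforce valid BIO sequences, which per-token softmax output cannot guarantee.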
Median precision (P), recall (R), and F1 scores for NER and resolution. Bold scores indicate the highest performance. All recurrent neural network units were used in a bidirectional setup, with inputs comprising pre-trained word embeddings, character embeddings, and case features, and an output layer with an additional CRF layer.
| Method | NER-Strict P | NER-Strict R | NER-Strict F1 | NER-Overlap P | NER-Overlap R | NER-Overlap F1 | Resolution P | Resolution R | Resolution F1 |
|---|---|---|---|---|---|---|---|---|---|
| Rule-based | 0.58 | 0.876 | 0.698 | 0.599 | 0.904 | 0.72 | 0.547 | 0.697 | |
| CRF-All | 0.85 | 0.76 | 0.80 | 0.86 | 0.77 | 0.81 | - | - | - |
| FFNN + DS | 0.90 | 0.93 | 0.91 | - | - | - | - | - | - |
| RNN | 0.910 | 0.891 | 0.901 | 0.931 | 0.912 | 0.922 | 0.896 | 0.817 | 0.855 |
| UG-RNN | 0.948 | 0.902 | 0.924 | 0.959 | 0.912 | 0.935 | 0.903 | 0.824 | 0.862 |
| GRU | 0.919 | 0.935 | 0.930 | 0.948 | | | 0.888 | 0.835 | 0.860 |
| LSTM | 0.932 | 0.926 | 0.929 | 0.954 | 0.947 | 0.950 | 0.892 | 0.842 | 0.866 |
| LSTM-Peep | | | 0.934 | | | 0.951 | | | 0.863 |
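The strict and overlapping matching criteria used in the table above can be sketched as follows. This is an illustrative sketch under simple assumptions (spans as character offsets, overlap counted per predicted span); the paper's actual evaluation script is not shown.

```python
# Sketch of strict vs. overlapping span matching for NER evaluation.
# Spans are (start, end) character offsets; data is illustrative.

def span_f1(gold, pred, strict=True):
    """Compute precision, recall, and F1 over entity spans.

    strict: spans must match exactly; otherwise any overlap counts.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    if strict:
        tp = len(set(gold) & set(pred))
    else:
        tp = sum(any(overlaps(p, g) for g in gold) for p in pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

GOLD = [(0, 8), (20, 26)]   # gold entity spans
PRED = [(0, 8), (21, 26)]   # second prediction is off by one character
print(span_f1(GOLD, PRED, strict=True))
print(span_f1(GOLD, PRED, strict=False))
```

A boundary error of a single character fails the strict criterion but still counts under overlapping matching, which is why the overlapping scores in the table are consistently higher than the strict ones.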
Fig. 2. (Left) Ablation/leave-one-out analysis showing the contribution of individual features to NER performance across the RNN models. (Right) Impact of additional layers on NER performance across the RNN models. Here, RNN layers refer to the respective variants of the RNN architectures. The y-axis shows strict F1 scores.