Davy Weissenbacher, Abeed Sarker, Tasnia Tahsin, Matthew Scotch, Graciela Gonzalez.
Abstract
The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of the infected host of the virus, which is often present in public databases such as GenBank. However, the geographic metadata in most GenBank records is not precise enough for many phylogeographic studies; therefore, researchers often need to search the articles linked to the records for more information, which can be a tedious process. Here, we describe two approaches for automatically detecting geographic location mentions in articles pertaining to virus-related GenBank records: a supervised sequence labeling approach with innovative features and a distant-supervision approach with novel noise-reduction methods. Evaluated on a manually annotated gold standard, our supervised sequence labeling and distant supervision approaches attained F-scores of 0.81 and 0.66, respectively.
Year: 2017 PMID: 28815119 PMCID: PMC5543364
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Figure 1: Sample sentence and token-level annotation using the BIO (Beginning, Inside, Outside) scheme. The left-most column shows the tokens, the Annotation column shows the token-level annotations, and the four columns on the right show four sample features and their representations. For multi-word named entities, the first token is given the B tag and all the following tokens are given the I tag. All other tokens are given the O tag, indicating that they do not belong to any toponym.
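The BIO tagging convention described in the caption can be sketched as a small helper function. The sample sentence and span indices below are hypothetical, not the figure's actual example:

```python
def bio_tag(tokens, entity_spans):
    """Assign BIO tags to tokens, given (start, end) token-index spans
    (end exclusive) of named entities -- here, toponyms."""
    tags = ["O"] * len(tokens)          # default: token is outside any entity
    for start, end in entity_spans:
        tags[start] = "B"               # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"               # continuation tokens of a multi-word entity
    return tags

tokens = ["The", "virus", "spread", "from", "New", "York", "to", "Boston", "."]
# "New York" covers token indices 4-6; "Boston" is index 7-8.
print(bio_tag(tokens, [(4, 6), (7, 8)]))
# ['O', 'O', 'O', 'O', 'B', 'I', 'O', 'B', 'O']
```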
Toponym detection performance for various classifiers; best results in each category shown in bold. Overlapping evaluation considers partial overlap to be correct, while strict evaluation evaluates each token separately. CRF with all the features outperforms other variants, including the state-of-the-art knowledge-based toponym detector. Note that these scores are based on the positive classes (i.e., B and I) only.
| Classifier | Overlapping Precision | Overlapping Recall | Overlapping F-score | Strict Precision | Strict Recall | Strict F-score |
|---|---|---|---|---|---|---|
|  | 0.60 | 0.72 |  | 0.58 | 0.88 | 0.70 |
|  |  | 0.70 | 0.77 | 0.83 | 0.69 | 0.75 |
|  | 0.78 | 0.35 | 0.49 | 0.76 | 0.34 | 0.47 |
|  |  | 0.19 | 0.31 | 0.84 | 0.18 | 0.30 |
|  | 0.77 | 0.85 |  | 0.76 |  | 0.80 |
|  | 0.85 | 0.75 | 0.80 | 0.84 | 0.74 | 0.79 |
|  | 0.52 | 0.89 | 0.66 | 0.51 | 0.86 | 0.64 |
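One plausible reading of the two evaluation modes in the table can be sketched as follows; the exact matching rules are an assumption, since the paper only summarizes them in the caption:

```python
def strict_scores(gold, pred):
    """Strict (token-level) evaluation: a predicted B or I tag counts
    as correct only if it matches the gold tag for that token exactly."""
    tp = sum(1 for g, p in zip(gold, pred) if p != "O" and g == p)
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and g != p)
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and g != p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

def overlapping_scores(gold_spans, pred_spans):
    """Overlapping (span-level) evaluation: a predicted span counts as
    correct if it shares at least one token with some gold span."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    tp = sum(1 for p in pred_spans if any(overlaps(p, g) for g in gold_spans))
    missed = sum(1 for g in gold_spans
                 if not any(overlaps(g, p) for p in pred_spans))
    prec = tp / len(pred_spans) if pred_spans else 0.0
    rec = (len(gold_spans) - missed) / len(gold_spans) if gold_spans else 0.0
    return prec, rec

# The prediction misses one "I" token of a two-token toponym:
# strict evaluation penalizes recall, overlapping evaluation does not.
print(strict_scores(["O", "B", "I", "O", "B"], ["O", "B", "O", "O", "B"]))
print(overlapping_scores([(1, 3), (4, 5)], [(1, 2), (4, 5)]))
```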
Figure 2: Precision, Recall and F-scores for our system over a blind evaluation set for various training set sizes (100% = 48 documents). The figure shows that precision remains fairly constant for our system, but recall shows steady improvement as more training data is utilized. We fit a logarithmic trend line to the F-score curve to derive the relationship between training set size and classification performance.
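Fitting a logarithmic trend line of the kind described for Figure 2 can be done with an ordinary least-squares fit in log space. The F-score values below are illustrative placeholders, not the paper's actual curve:

```python
import numpy as np

# Hypothetical F-scores at 25%, 50%, 75% and 100% of the
# 48-document training set (illustrative values only).
docs = np.array([12, 24, 36, 48])
fscores = np.array([0.70, 0.75, 0.78, 0.80])

# Fit F = a * ln(n) + b: linear regression on log-transformed sizes.
a, b = np.polyfit(np.log(docs), fscores, 1)

def predicted_f(n):
    """Predicted F-score for a training set of n documents."""
    return a * np.log(n) + b
```

A positive slope `a` with this functional form captures the diminishing returns of added training data: each doubling of the training set buys roughly the same F-score increment.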