Arjun Magge, Davy Weissenbacher, Abeed Sarker, Matthew Scotch, Graciela Gonzalez-Hernandez.
Abstract
Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human-annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER.
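The abstract describes a feedforward network that decides, per token, whether it is a toponym. A minimal sketch of such a two-hidden-layer scorer follows; the feature dimension, layer widths, and weights are illustrative stand-ins, not the paper's actual configuration:

```python
import numpy as np

def ffnn_toponym_score(x, W1, b1, W2, b2, W3, b3):
    """Score one token's feature vector with a two-hidden-layer FFNN.

    Returns the estimated probability that the token is a toponym.
    Layer sizes and weights here are illustrative only.
    """
    h1 = np.maximum(0.0, x @ W1 + b1)   # hidden layer 1, ReLU
    h2 = np.maximum(0.0, h1 @ W2 + b2)  # hidden layer 2, ReLU
    z = h2 @ W3 + b3                    # scalar logit
    return 1.0 / (1.0 + np.exp(-z))     # sigmoid -> P(toponym)

# Toy usage with random weights (hypothetical sizes: 6-dim token
# features, 4 units per hidden layer).
rng = np.random.default_rng(42)
d, h = 6, 4
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, h)), np.zeros(h)
W3, b3 = rng.normal(size=h), 0.0
p = ffnn_toponym_score(rng.normal(size=d), W1, b1, W2, b2, W3, b3)
```

In a real system `x` would be built from the word embeddings compared in the results table (GloVe vs. Wiki-pm-pmc), optionally concatenated with the hand-crafted features of the "+ features" configurations.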
Mesh:
Year: 2018 PMID: 29950020 PMCID: PMC6022665 DOI: 10.1093/bioinformatics/bty273
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The NER architecture with distant supervision. The NER model is first trained on distant-supervision data, followed by human-annotated data, to obtain the final model.
Fig. 2. The training procedure of the NER's neural network with two hidden layers.
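Figures 1 and 2 describe a two-stage schedule: the network is first trained on distantly supervised examples, then the same weights continue training on the human-annotated set. A hedged sketch of that schedule, using a single logistic unit as a stand-in for the full FFNN (all data, sizes, and hyperparameters below are synthetic, chosen only to illustrate the schedule):

```python
import numpy as np

def train(w, b, X, y, lr=0.5, epochs=300):
    """Gradient-descent training of a logistic token classifier.

    Stands in for the paper's FFNN; only the two-stage schedule
    (pre-train, then fine-tune the same weights) matters here.
    """
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(toponym)
        w -= lr * X.T @ (p - y) / len(y)        # log-loss gradient step
        b -= lr * np.mean(p - y)
    return w, b

# Synthetic stand-ins: distant labels are plentiful but noisy,
# human labels are scarce but clean.
rng = np.random.default_rng(0)
true_w = rng.normal(size=8)
X_distant = rng.normal(size=(500, 8))
y_distant = ((X_distant @ true_w > 0) ^ (rng.random(500) < 0.2)).astype(float)
X_human = rng.normal(size=(50, 8))
y_human = (X_human @ true_w > 0).astype(float)

w, b = np.zeros(8), 0.0
w, b = train(w, b, X_distant, y_distant)  # stage 1: distant supervision
w, b = train(w, b, X_human, y_human)      # stage 2: fine-tune on gold data
acc = np.mean(((X_human @ w + b) > 0) == (y_human > 0.5))
```

The point of the ordering is that the cheap, noisy distant-supervision stage provides an initialization, which the small human-annotated set then refines rather than having to train from scratch.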
Precision, Recall and F1 scores using strict tokenwise evaluation for toponym detection where the NER was trained on Dtrain and tested on Dtest
| Configuration | Word embedding | Precision | Recall | F1 |
|---|---|---|---|---|
| FFNN 1-layer | No pre-training | **0.97** | 0.65 | 0.779 |
| | GloVe | 0.89 | 0.87 | 0.883 |
| | Wiki-pm-pmc | 0.92 | 0.82 | 0.878 |
| FFNN 2-layers | GloVe | 0.92 | 0.86 | 0.891 |
| | Wiki-pm-pmc | 0.93 | 0.88 | 0.906 |
| FFNN 2-layers + features | GloVe | 0.94 | 0.87 | 0.903 |
| | Wiki-pm-pmc | 0.96 | 0.86 | **0.910** |
| Random forest + features | Wiki-pm-pmc | 0.82 | 0.91 | 0.862 |
| SVM + features | Wiki-pm-pmc | 0.83 | **0.92** | 0.875 |
Bold indicates highest scores in the performance measure.
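The strict tokenwise evaluation behind these scores treats every token as an independent binary decision. A small helper, written from the standard definitions (the function name and toy data are ours), shows how the Precision, Recall, and F1 columns relate:

```python
def tokenwise_prf(gold, pred):
    """Strict tokenwise precision/recall/F1 for binary labels (1 = toponym)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy sentence of five tokens: two true positives, one false positive,
# one false negative.
gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
p, r, f = tokenwise_prf(gold, pred)  # each equals 2/3 here
```

Because F1 is the harmonic mean 2PR/(P+R), a high precision cannot compensate for a low recall, which is why the 1-layer model with no pre-training (P=0.97, R=0.65) still scores lowest on F1.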
Examples of errors made by the NER trained on supervised annotated data
| Error type | No. | Category | Example |
|---|---|---|---|
| Partial match | 1 | Tagged prefix | Probable person to person transmission of novel avian influenza A (H7N9) virus in |
| | 2 | Tagged suffix | Surveillance was conducted in live poultry markets in |
| | 3 | Tagged suffix | University of Ibadan, |
| | 4 | Unrecognized token | The overwhelming majority (94.2%) of H9N2 influenza viruses were isolated in |
| False positive | 5 | Other entities | Phylogenetic analyses show that it is a recombinant virus containing genome segments derived from the Eurasia and |
| | 6 | Other entities | Thus, current G1-like viruses in southern |
| | 7 | Other entities | This work was supported by a Natural Sciences and Engineering Research Council of |
| | 8 | Partial annotation | Abbreviations: BJ and |
| False negative | 9 | Table entries | Virus Group State of isolation Date of isolation A/chicken/Nigeria/1071-1/2007 EMA1/EMA2-2: 6-R07 |
| | 10 | Unrecognized toponym | The characterization of the swH3N2 / pH1N1 reassortant viruses from swine in the province of |
| | 11 | Unrecognized toponym | Centers for Disease Control and Prevention, |
Note: Underlined tokens indicate entities recognized by the NER. Italicized tokens are human annotated gold standard entities.
Tokenwise scores for performance comparison of NERs
| Implementation | Precision | Recall | F1 |
|---|---|---|---|
| Knowledge-based | 0.58 | **0.88** | 0.70 |
| CRF-All | 0.85 | 0.76 | 0.80 |
| Stanford-NER | 0.89 | 0.85 | 0.872 |
| Train | **0.96** | 0.86 | **0.910** |
| Train | | | |
Bold indicates highest scores in the performance measure.