Tasnia Tahsin, Rachel Beard, Robert Rivera, Rob Lauder, Garrick Wallstrom, Matthew Scotch, Graciela Gonzalez.
Abstract
Zoonotic viruses are emerging or re-emerging pathogens that pose significant public health threats throughout the world. It is therefore crucial to advance current surveillance mechanisms for these viruses through approaches such as phylogeography. Despite the abundance of zoonotic viral sequence data in publicly available databases such as GenBank, phylogeographic analysis of these viruses is often limited by the lack of adequate geographic metadata. However, many GenBank records include references to articles with more detailed information, and automated systems may help extract this information efficiently and effectively. In this paper, we describe our efforts to determine the proportion of GenBank records with "insufficient" geographic metadata for seven well-studied viruses. We also evaluate the performance of four different Named Entity Recognition (NER) systems for automatically extracting related entities, using a manually created gold standard.
Year: 2014 PMID: 25717409 PMCID: PMC4333696
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Figure 1. Flowchart of experiment procedure
Figure 2. Screenshot of data automatically extracted from GenBank
Figure 3. Description of sufficiency criteria for GenBank record
Percentage of GenBank records with insufficient geographic information for each virus.
| Virus | Number of Entries | % Insufficient |
|---|---|---|
| WEE | 67 | 90 |
| Rabies | 4450 | 85 |
| WNV | 1084 | 79 |
| SLE | 141 | 74 |
| Hanta | 1745 | 66 |
| Influenza | 51734 | 62 |
| EEE | 374 | 51 |
| All | 59595 | 64 |
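
The "All" row can be cross-checked against the per-virus rows. The sketch below assumes that the overall percentage is the entry-weighted average of the per-virus percentages; because those percentages are already rounded in the table, the result is only approximate. The function name `pooled_percentage` is illustrative and does not come from the paper.

```python
# (virus, number of entries, % insufficient), copied from the table above
rows = [
    ("WEE", 67, 90),
    ("Rabies", 4450, 85),
    ("WNV", 1084, 79),
    ("SLE", 141, 74),
    ("Hanta", 1745, 66),
    ("Influenza", 51734, 62),
    ("EEE", 374, 51),
]


def pooled_percentage(rows):
    """Return (total entries, entry-weighted average percentage)."""
    total = sum(n for _, n, _ in rows)
    weighted = sum(n * pct for _, n, pct in rows)
    return total, weighted / total


total_entries, overall_pct = pooled_percentage(rows)
print(total_entries, round(overall_pct))  # 59595 64
```

Both figures agree with the "All" row. Influenza accounts for most of the records, so the overall percentage sits close to the Influenza value.
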
Inter-rater agreement measurements (H.M. = Harmonic Mean, J.S. = Jaccard Similarity).
| Entity | H.M. (A,B) | H.M. (A,C) | H.M. (B,C) | J.S. (A,B) | J.S. (A,C) | J.S. (B,C) |
|---|---|---|---|---|---|---|
| Date | .975; .978 | .979; .987 | .962; .973 | .952; .957 | .950; .965 | .928; .947 |
| GeneName | .914; .926 | .913; .932 | .911; .954 | .845; .868 | .840; .872 | .837; .913 |
| Location | .945; .961 | .907; .931 | .914; .935 | .897; .925 | .831; .871 | .841; .877 |
| Organism | .909; .956 | .874; .940 | .915; .959 | .833; .916 | .792; .905 | .843; .922 |
| Virus | .952; .958 | .947; .966 | .947; .955 | .907; .920 | .903; .937 | .900; .914 |
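
As a rough illustration of the two measures in this table, the sketch below computes a harmonic-mean agreement and a Jaccard similarity for a pair of annotators. It assumes that each annotation is a `(document, start, end, type)` tuple and that two annotations match only when the tuples are identical; the page does not state the matching rule the authors used, and the toy spans are invented for the example.

```python
def agreement(spans_a, spans_b):
    """Return (harmonic mean, Jaccard similarity) for two annotation sets."""
    a, b = set(spans_a), set(spans_b)
    if not a or not b:
        return 0.0, 0.0
    matched = len(a & b)
    if matched == 0:
        return 0.0, 0.0
    frac_a = matched / len(a)  # share of A's annotations also made by B
    frac_b = matched / len(b)  # share of B's annotations also made by A
    harmonic_mean = 2 * frac_a * frac_b / (frac_a + frac_b)
    jaccard = matched / len(a | b)
    return harmonic_mean, jaccard


# Toy annotations (hypothetical): (document id, start offset, end offset, entity type)
annotator_a = {
    ("doc1", 0, 4, "Date"),
    ("doc1", 10, 18, "Location"),
    ("doc2", 5, 9, "Virus"),
}
annotator_b = {
    ("doc1", 0, 4, "Date"),
    ("doc1", 10, 18, "Location"),
    ("doc2", 6, 9, "Virus"),
    ("doc2", 20, 27, "Organism"),
}

hm, js = agreement(annotator_a, annotator_b)
print(round(hm, 3), round(js, 3))  # 0.571 0.4
```

Under this exact-match assumption the two measures are tied by J.S. = H.M. / (2 - H.M.), which is why the Jaccard values are always the lower of the two.
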
Performance statistics of the integrated NER system
| Entity | Precision | Recall | F-measure |
|---|---|---|---|
| GeneName | 0.070; 0.239 | 0.114; 0.395 | 0.087; 0.297 |
| Location | 0.452; 0.626 | 0.658; 0.783 | 0.536; 0.696 |
| Species | 0.853; 0.962 | 0.563; 0.658 | 0.678; 0.781 |
| Date | 0.800; 0.853 | 0.681; 0.727 | 0.736; 0.785 |
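
The F-measure column is consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes it from the reported precision and recall; the second Date precision is taken as 0.853, and the helper name `f_measure` is illustrative only.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) for both values in each table cell
reported = {
    "GeneName": [(0.070, 0.114, 0.087), (0.239, 0.395, 0.297)],
    "Location": [(0.452, 0.658, 0.536), (0.626, 0.783, 0.696)],
    "Species": [(0.853, 0.563, 0.678), (0.962, 0.658, 0.781)],
    "Date": [(0.800, 0.681, 0.736), (0.853, 0.727, 0.785)],  # 0.853 is an assumed reading
}

# Largest gap between the recomputed and the reported F-measure
max_dev = max(
    abs(f_measure(p, r) - f)
    for cells in reported.values()
    for p, r, f in cells
)
print(max_dev < 0.001)  # True
```

Every recomputed value falls within 0.001 of the reported one, which is what rounding to three decimals would produce.
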
Entity type frequency table
| Entity | Annotator A | Annotator B | Annotator C |
|---|---|---|---|
| Date | 386 | 387 | 390 |
| GeneName | 230 | 209 | 208 |
| Location | 846 | 846 | 903 |
| Organism | 916 | 866 | 850 |
| Virus | 1037 | 994 | 1031 |
Additional inter-rater agreement measurements
| Entity | | | | | | |
|---|---|---|---|---|---|---|
| Date | .977; .979 | .974; .977 | .984; .992 | .974; .982 | .966; .977 | .959; .969 |
| GeneName | .870; .880 | .962; .976 | .870; .887 | .962; .981 | .909; .952 | .913; .957 |
| Location | .943; .961 | .948; .961 | .939; .962 | .879; .901 | .944; .966 | .885; .905 |
| Organism | .885; .930 | .935; .984 | .843; .906 | .902; .976 | .906; .950 | .924; .968 |
| Virus | .932; .938 | .972; .979 | .945; .963 | .951; .969 | .965; .973 | .930; .938 |