Anne E Thessen, Cynthia Sims Parr.
Abstract
Numerous digitization and ontological initiatives have focused on translating biological knowledge from narrative text to machine-readable formats. In this paper, we describe two workflows for knowledge extraction and semantic annotation of text data objects featured in an online biodiversity aggregator, the Encyclopedia of Life. One workflow tags text with DBpedia URIs based on keywords. The other finds taxon names in text using GNRD for the purpose of building a species association network. Both workflows perform well: the annotation workflow has an F1 Score of 0.941 and the association algorithm has an F1 Score of 0.885. Existing text annotators such as Terminizer and DBpedia Spotlight performed well but require some optimization to be useful in the ecology and evolution domain. Important future work includes scaling up and improving accuracy through the use of distributional semantics.
Year: 2014 PMID: 24594988 PMCID: PMC3940440 DOI: 10.1371/journal.pone.0089550
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Test Species and URI Annotation Results.
| Common Name | Scientific Name | EOL Taxon ID | Number of Text Objects in English | Number of Correct URIs | Number of Correct and Unique URIs | Precision | Recall | F1 Score |
| Great White Shark | | 213726 | 56 | 169 | 45 | 0.918 | 1 | 0.956 |
| Lion | | 328672 | 52 | 173 | 45 | 0.961 | 1 | 0.980 |
| Moss | | 861365 | 2 | 0 | 0 | 0 | 0 | 0 |
| Diatom | | 904440 | 19 | 23 | 14 | 0.852 | 1 | 0.92 |
| Mosquito | | 740671 | 8 | 8 | 6 | 1 | 1 | 1 |
| Bacteria | | 972688 | 7 | 36 | 17 | 0.783 | 1 | 0.878 |
| Cactus | | 594934 | 20 | 13 | 9 | 1 | 1 | 1 |
| Oak | | 1151323 | 36 | 37 | 20 | 0.949 | 1 | 0.974 |
| Worm | | 3126801 | 17 | 28 | 15 | 0.903 | 1 | 0.949 |
| Crab | | 312939 | 45 | 94 | 36 | 0.95 | 1 | 0.97 |
| Spider | | 1182604 | 17 | 27 | 16 | 0.871 | 1 | 0.931 |
| Ciliate | | 484359 | 2 | 3 | 3 | 1 | 1 | 1 |
| Mushroom | | 190054 | 9 | 9 | 7 | 0.9 | 1 | 0.947 |
| Beetle | | 982103 | 2 | 5 | 5 | 1 | 1 | 1 |
| Walrus | | 328627 | 57 | 100 | 35 | 0.909 | 1 | 0.952 |
| Water Bear | | 1053826 | 4 | 5 | 5 | 0.833 | 1 | 0.909 |
| Virus | Tobacco mosaic virus | 8615186 | 2 | 5 | 4 | 1 | 1 | 1 |
| Copepod | | 1020941 | 6 | 6 | 4 | 1 | 1 | 1 |
| Kelp | | 902899 | 9 | 7 | 5 | 0.875 | 1 | 0.933 |
| Tube Worm | | 393274 | 55 | 68 | 17 | 0.523 | 1 | 0.687 |
| Honey Bee | | 1045608 | 61 | 222 | 54 | 0.978 | 1 | 0.989 |
| GRAND TOTAL | | | 487 | 1038 | 362 | 0.889 | 1 | 0.941 |
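For reference, the F1 Score column is consistent with the usual harmonic mean of precision and recall. The worked line below applies that formula to the GRAND TOTAL row; the symbols P and R are shorthand introduced here, not notation from the paper.

```latex
F_1 = \frac{2PR}{P + R},
\qquad
F_1^{\text{total}} = \frac{2 \times 0.889 \times 1}{0.889 + 1} = \frac{1.778}{1.889} \approx 0.941
```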
URI Annotation Errors.
| Species | Negation | Describing Related Taxa | Word Part | Generalities | Homonym | Total Errors |
| Shark | 2 | 7 | 1 | 2 | 0 | 12 |
| Lion | 0 | 4 | 2 | 0 | 0 | 6 |
| Moss | N/A | N/A | N/A | N/A | N/A | N/A |
| Diatom | 0 | 1 | 1 | 2 | 0 | 4 |
| Mosquito | 0 | 0 | 0 | 0 | 0 | 0 |
| Bacteria | 0 | 1 | 1 | 3 | 1 | 6 |
| Cactus | 0 | 0 | 0 | 0 | 0 | 0 |
| Oak | 0 | 2 | 0 | 0 | 0 | 2 |
| Worm | 0 | 1 | 0 | 1 | 1 | 3 |
| Crab | 0 | 3 | 0 | 0 | 1 | 4 |
| Spider | 0 | 1 | 0 | 0 | 2 | 3 |
| Ciliate | 0 | 0 | 0 | 0 | 0 | 0 |
| Mushroom | 0 | 0 | 0 | 1 | 0 | 1 |
| Beetle | 0 | 0 | 0 | 0 | 0 | 0 |
| Walrus | 1 | 0 | 2 | 1 | 3 | 7 |
| Water Bear | 0 | 0 | 0 | 0 | 1 | 1 |
| Copepod | 0 | 0 | 0 | 0 | 0 | 0 |
| Virus | 0 | 0 | 0 | 0 | 0 | 0 |
| Kelp | 0 | 0 | 1 | 0 | 0 | 1 |
| Tube Worm | 1 | 1 | 1 | 0 | 1 | 4 |
| Honey Bee | 0 | 0 | 2 | 0 | 1 | 3 |
| GRAND TOTAL | 4 | 21 | 11 | 10 | 11 | 57 |
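One way to read the "Word Part" column is that a keyword matched inside a longer word; that reading is an assumption, since the table does not define the category. Under that assumption, a keyword tagger can avoid such matches by requiring word boundaries. The sketch below is illustrative only: the function name `tag_keywords` and the two-entry keyword table are hypothetical and are not the paper's implementation.

```python
import re

# Hypothetical keyword-to-URI table; the entries are illustrative only.
KEYWORD_URIS = {
    "ant": "http://dbpedia.org/resource/Ant",
    "coral reef": "http://dbpedia.org/resource/Coral_reef",
}

def tag_keywords(text, keyword_uris):
    """Return sorted (keyword, uri) pairs whose keyword occurs as a whole word or phrase."""
    hits = []
    for keyword, uri in keyword_uris.items():
        # \b anchors prevent matches inside longer words ("ant" in "plant").
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append((keyword, uri))
    return sorted(hits)

print(tag_keywords("Plants grow near the coral reef.", KEYWORD_URIS))
```

Strict word boundaries also skip plurals such as "ants", so a real tagger would likely need stemming or an explicit list of variants.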
Figure 1. Automated Species Interactions.
Taxon interaction network generated algorithmically by extracting information from text under the “Ecology” Chapter using the EOL and GNRD APIs. These associations are from the test species only.
Figure 2. Manual Species Interactions.
Taxon interaction network generated manually by reading through all of the text on an EOL taxon page and collecting scientific and common names. These associations are from the test species only.
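The networks in both figures link a page's taxon to the other taxa named in its text. A minimal sketch of that linking step follows, assuming the names per page have already been extracted (for example by a name recognizer such as GNRD, which is not called here). The function name, the undirected edges, and the sample names are assumptions for illustration and are not data from the paper.

```python
from collections import defaultdict

def build_association_network(names_by_page):
    """Map each taxon to the set of taxa it is associated with."""
    network = defaultdict(set)
    for page_taxon, found_names in names_by_page.items():
        for name in found_names:
            if name == page_taxon:
                continue  # a page mentioning its own taxon is not an association
            # Assumption: associations are stored as undirected edges.
            network[page_taxon].add(name)
            network[name].add(page_taxon)
    return dict(network)

# Hypothetical input: names found in the text of each taxon page.
example_pages = {
    "Apis mellifera": ["Apis mellifera", "Varroa destructor", "Trifolium repens"],
    "Panthera leo": ["Panthera leo", "Equus quagga"],
}
network = build_association_network(example_pages)
for taxon in sorted(network):
    print(taxon, "->", sorted(network[taxon]))
```

Storing edges as sets keeps repeated mentions of the same name from producing duplicate associations.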
Performance Metrics for Associations Workflow.
| | True Positives | False Positives | False Negatives | True Negatives | Precision | Recall | F1 Score |
| | 9 | 2 | 2 | 23 | 0.818 | 0.818 | 0.818 |
| | 30 | 2 | 4 | 16 | 0.938 | 0.882 | 0.909 |
| | 0 | 0 | 0 | 1 | | | |
| | 0 | 0 | 0 | 3 | | | |
| | 0 | 0 | 0 | 6 | | | |
| | 0 | 0 | 1 | 119 | | | |
| | 1 | 0 | 0 | 5 | 1.000 | 1.000 | 1.000 |
| | 81 | 2 | 3 | 15 | 0.976 | 0.976 | 0.976 |
| | 0 | 0 | 2 | 3 | | | |
| | 49 | 9 | 11 | 13 | 0.845 | 0.875 | 0.860 |
| | 0 | 0 | 0 | 11 | | | |
| | 0 | 0 | 0 | 1 | | | |
| | 0 | 0 | 0 | 51 | | | |
| | 0 | 0 | 1 | 24 | | | |
| | 1 | 1 | 1 | 17 | 0.500 | 0.500 | 0.500 |
| | 0 | 0 | 0 | 7 | | | |
| Tobacco mosaic virus | 5 | 0 | 1 | 0 | 1.000 | 0.833 | 0.909 |
| | 0 | 0 | 0 | 4 | | | |
| | 0 | 0 | 0 | 2 | | | |
| | 0 | 0 | 0 | 7 | | | |
| | 318 | 75 | 17 | 19 | 0.809 | 0.952 | 0.875 |
| TOTAL | 494 | 91 | 43 | 347 | 0.844 | 0.930 | 0.885 |
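The metric columns can be derived from the count columns with the standard definitions: precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below applies them to the first data row of the table (9 true positives, 2 false positives, 2 false negatives). The function name is hypothetical, and returning `None` when a denominator is zero is a choice made here for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1); a value is None when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None  # F1 is undefined without both precision and recall
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=9, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.818 0.818 0.818
```

For a row with no positives at all, such as 0 true positives, 0 false positives, and 0 false negatives, the function returns `None` for all three values.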
Performance of GNRD without Associations Workflow.
| | True Positives | False Positives | False Negatives | True Negatives | Precision | Recall | F1 Score |
| | 10 | 37 | 1 | 0 | 0.213 | 0.909 | 0.345 |
| | 33 | 45 | 2 | 0 | 0.423 | 0.917 | 0.579 |
| | 0 | 2 | 0 | 0 | | | |
| | 0 | 7 | 0 | 0 | | | |
| | 0 | 7 | 0 | 0 | | | |
| | 0 | 135 | 1 | 0 | | | |
| | 1 | 7 | 0 | 0 | 0.125 | 1.000 | 0.222 |
| | 82 | 26 | 2 | 3 | 0.759 | 0.976 | 0.854 |
| | 2 | 4 | 0 | 0 | 0.333 | 1.000 | 0.500 |
| | 53 | 28 | 2 | 1 | 0.654 | 0.964 | 0.779 |
| | 0 | 16 | 0 | 0 | | | |
| | 0 | 3 | 0 | 0 | | | |
| | 0 | 55 | 0 | 1 | | | |
| | 1 | 25 | 0 | 0 | 0.038 | 1.000 | 0.074 |
| | 2 | 29 | 0 | 1 | 0.065 | 1.000 | 0.121 |
| | 0 | 8 | 0 | 0 | | | |
| Tobacco mosaic virus | 5 | 3 | 1 | 0 | 0.625 | 0.833 | 0.714 |
| | 0 | 6 | 0 | 0 | | | |
| | 0 | 4 | 0 | 0 | | | |
| | 0 | 12 | 0 | 0 | | | |
| | 321 | 101 | 15 | 0 | 0.761 | 0.961 | 0.849 |
| TOTAL | 510 | 560 | 24 | 6 | 0.477 | 0.957 | 0.636 |
Figure 3. Annotator Comparison.
Combined results from Terminizer (blue) and DBpedia Spotlight (red) when given an EOL text object to annotate (originally from ARKive). Strings that were annotated by both tools are colored purple. The superscripts correspond to the Identifier column in Table S1.