Nhung T H Nguyen, Roselyn S Gabud, Sophia Ananiadou.
Abstract
Background: Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names, or of habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrences from biodiversity literature. To alleviate this issue, we have constructed the COPIOUS corpus, a gold standard corpus that covers a wide range of biodiversity entities.

Results: Two annotators manually annotated the corpus with five categories of entities: taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is approximately 81.86% F-score. Amongst the five categories, agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library, with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus achieved an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and Global Name Recognition and Discovery (GNRD). More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical location and Taxon-Temporal expression were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences.

Conclusion: The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can further be used for relation extraction to locate species occurrences in literature, a useful task for monitoring species distribution and preserving biodiversity.
Keywords: Biodiversity; gold standard; named entity recognition; species occurrence; text mining
Year: 2019 PMID: 30700967 PMCID: PMC6351503 DOI: 10.3897/BDJ.7.e29626
Source DB: PubMed Journal: Biodivers Data J ISSN: 1314-2828
Statistics of Linnaeus and S800 corpora for species names.
| Corpus | Text type | Documents | Sentences | Words | Entities |
| --- | --- | --- | --- | --- | --- |
| Linnaeus | PMC full paper | 100 | 17,580 | 502,507 | 4,259 |
| S800 | PubMed abstract | 800 | 8,064 | 201,981 | 3,708 |
Figure 1. Argo’s Manual Annotation Editor to support annotators. Each entity category is represented using a different colour.
Inter-annotator agreement on different named categories over 200 doubly-annotated documents. The categories are arranged in descending order of agreement.
| Category | Precision (%) | Recall (%) | F-score (%) |
| --- | --- | --- | --- |
| Geographical Location | 94.32 | 94.89 | 94.60 |
| Person | 88.93 | 91.76 | 90.33 |
| Temporal Expression | 86.59 | 87.25 | 86.92 |
| Taxon | 81.09 | 83.87 | 82.45 |
| Habitat | 45.85 | 48.36 | 47.07 |
| Overall | 82.09 | 81.62 | 81.86 |
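The agreement figures above can be sanity-checked by recomputing each F-score from its precision and recall. The sketch below assumes the balanced F-score (the harmonic mean of precision and recall), which the record does not define explicitly; the function name `f_score` and the 0.02 tolerance are illustrative choices, not part of the original work.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) in percent, copied from the table above.
reported = {
    "Geographical Location": (94.32, 94.89, 94.60),
    "Person": (88.93, 91.76, 90.33),
    "Temporal Expression": (86.59, 87.25, 86.92),
    "Taxon": (81.09, 83.87, 82.45),
    "Habitat": (45.85, 48.36, 47.07),
    "Overall": (82.09, 81.62, 81.86),
}

# Absolute gap between the recomputed and the reported F-score per category.
differences = {
    category: abs(f_score(p, r) - f) for category, (p, r, f) in reported.items()
}

for category, (p, r, f) in reported.items():
    print(f"{category}: recomputed {f_score(p, r):.2f}, reported {f:.2f}")
```

Because the published precision and recall are already rounded to two decimals, the recomputed values can differ from the reported ones in the last digit; a small tolerance is therefore more appropriate than an exact comparison.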
Statistics of the gold standard corpus. The entity categories are arranged in descending order of the number of instances.
| Number of documents | | 668 |
| Number of sentences | | 26,277 |
| Number of words | | 33,475 |
| Number of entities | Taxon | 12,227 |
| | Geographical Location | 9,921 |
| | Person | 2,889 |
| | Temporal Expression | 2,210 |
| | Habitat | 1,554 |
The distribution of entities in training, development and test sets.
| Category | Training | Development | Test |
| --- | --- | --- | --- |
| Taxon | 9,357 | 1,548 | 1,322 |
| Geographical Location | 8,121 | 992 | 878 |
| Person | 2,479 | 180 | 230 |
| Temporal Expression | 1,800 | 157 | 253 |
| Habitat | 1,308 | 91 | 115 |
Performance of CRF and BiLSTM on the testing set. The categories are arranged in descending order of F-score for each type of model.
| Model | Category | Precision (%) | Recall (%) | F-score (%) |
| --- | --- | --- | --- | --- |
| CRF | Geographical Location | 82.35 | 83.49 | 82.92 |
| | Taxon | 75.27 | 62.40 | 68.23 |
| | Temporal Expression | 77.19 | 52.17 | 62.26 |
| | Person | 72.82 | 43.10 | 54.15 |
| | Habitat | 63.55 | 44.16 | 52.11 |
| | Overall | 77.67 | 66.29 | 71.53 |
| BiLSTM | Geographical Location | 85.05 | 85.63 | 85.34 |
| | Taxon | 77.42 | 69.67 | 73.34 |
| | Habitat | 64.10 | 64.94 | 64.52 |
| | Temporal Expression | 70.67 | 54.36 | 61.45 |
| | Person | 58.92 | 48.44 | 53.17 |
| | Overall | 77.49 | 71.89 | 74.58 |
Performance of different NER tools on Taxon entities in the COPIOUS corpus test set. For our NER, we report the BiLSTM model, which gave the best performance on taxon names.
| Tool | Precision (%) | Recall (%) | F-score (%) |
| --- | --- | --- | --- |
| Our NER (BiLSTM) | 77.42 | 69.67 | 73.34 |
| GNRD | 77.61 | 54.02 | 63.70 |
| SPECIES Tagger | 86.79 | 4.51 | 8.57 |
Figure 2. Schema of occurrence extraction.
Figure 3. Examples of species occurrences automatically extracted by PASMED. A dashed line indicates an incorrect relation.
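The abstract states that the extracted binary relations (Taxon-Habitat, Taxon-Person, Taxon-Geographical location, Taxon-Temporal expression) can be combined into a repository of species occurrences. The following is a minimal sketch of one way such a grouping step might look; it is not the authors' pipeline or PASMED's output format. The tuple layout, the `build_occurrences` helper and all values are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical binary relations: (taxon, argument type, argument text).
# The values are placeholders and are NOT taken from the COPIOUS corpus.
relations = [
    ("Taxon A", "Habitat", "habitat X"),
    ("Taxon A", "GeographicalLocation", "location Y"),
    ("Taxon A", "TemporalExpression", "1905"),
    ("Taxon B", "Person", "collector Z"),
    ("Taxon A", "GeographicalLocation", "location Y"),  # repeated mention
]


def build_occurrences(relations):
    """Group binary Taxon-* relations into one record per taxon."""
    records = defaultdict(lambda: defaultdict(list))
    for taxon, arg_type, arg_text in relations:
        values = records[taxon][arg_type]
        if arg_text not in values:  # keep the first mention, skip repeats
            values.append(arg_text)
    # Convert the nested defaultdicts into plain dicts.
    return {taxon: dict(slots) for taxon, slots in records.items()}


occurrences = build_occurrences(relations)
for taxon, slots in occurrences.items():
    print(taxon, slots)
```

Grouping by taxon name alone is a simplification: a usable occurrence record would likely need to keep the source document or sentence with each relation, so that a location and a date are only combined when they were reported together.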