Chen Li1, Zhiqiang Rao2, Qinghua Zheng1, Xiangrong Zhang2.
Abstract
Current research in bio-text mining mainly focuses on event extraction. Biological networks present much richer and more meaningful information to biologists than events. Bio-entity coreference resolution (CR) is a very important method for completing a bio-event's attributes and interconnecting events into bio-networks. Though general CR methods have been studied for a long time, they do not produce practically useful results when applied to a specialized domain. Therefore, bio-entity CR needs attention to better assist biological network extraction. In this article, we present two methods for bio-entity CR. The first is a rule-based method, which creates a set of syntactic rules or semantic constraints for CR. It achieves state-of-the-art performance (an F1-score of 62.0%) on the community-supported dataset. We also present a machine learning-based method, which makes use of a recurrent neural network model, a long short-term memory (LSTM) network. It automatically learns global discriminative representations of all kinds of coreferences without hand-crafted features. The model outperforms the previously best machine learning-based method.
Year: 2018 PMID: 30010737 PMCID: PMC6041745 DOI: 10.1093/database/bay065
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Coreferences in biological texts. (A) is a sentence depicting a biological reaction illustrated by (B), and (C) is a sentence depicting a biological reaction illustrated by (D).
Figure 2. System architecture.
Figure 6. LSTM-Coref.
Results on test dataset
| System | Recall (%) | Precision (%) | F (%) |
|---|---|---|---|
| UU | 22.2 | 73.3 | 34.1 |
| UZ | 21.5 | 55.5 | 31.0 |
| CU | 19.4 | 63.2 | 29.7 |
| UT | 14.4 | 67.2 | 23.8 |
| ( | 52.5 | 50.2 | 51.3 |
| ( | 50.4 | 62.7 | 55.9 |
| ( | 55.6 | 67.2 | 60.9 |
| Proposed | 60.2 | 63.8 | 62.0 |
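
The last column of the table above appears to be the F-score, that is, the harmonic mean of precision and recall. The following minimal sketch recomputes it for two rows as a sanity check. The function name `f_score` and the `rows` dictionary are illustrative assumptions and not part of the original paper; small differences from the reported values are expected because the table lists rounded percentages.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F) copied from two rows of the table above.
rows = {
    "UU": (22.2, 73.3, 34.1),
    "Proposed": (60.2, 63.8, 62.0),
}

for name, (recall, precision, reported) in rows.items():
    computed = f_score(precision, recall)
    print(f"{name}: computed {computed:.2f}, reported {reported}")
```

The recomputed values agree with the reported ones to within about 0.1 percentage points, which is consistent with the published figures having been derived from unrounded counts.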
Results on development dataset
| | ( | | | Proposed | | |
|---|---|---|---|---|---|---|
| | Recall (%) | Precision (%) | F (%) | Recall (%) | Precision (%) | F (%) |
| Relative pronoun | 28.2 | 83.3 | 42.2 | 28.2 | 83.3 | 42.2 |
| Personal pronoun | 26.3 | 77.9 | 39.3 | 33.6 | 72.3 | 45.9 |
| Definite NP | 6.9 | 58.3 | 12.4 | 6.9 | 70.0 | 12.6 |
| All | 59.9 | 77.7 | 67.4 | 68.8 | 76.0 | 72.2 |
LSTM-Coref results on test dataset
| System | Recall (%) | Precision (%) | F (%) |
|---|---|---|---|
| UU | 22.2 | 73.3 | 34.1 |
| ( | 52.5 | 50.2 | 51.3 |
| ( | 50.4 | 62.7 | 55.9 |
| ( | 55.6 | 67.2 | 60.9 |
| LSTM-Coref | 54.9 | 58.0 | 56.4 |
LSTM-Coref results on development dataset with different feature combinations
| Feature set | Recall (%) | Precision (%) | F (%) |
|---|---|---|---|
| Mention-vec | 52.5 | 65.0 | 58.1 |
| Mention-vec+features | 60.4 | 61.9 | 61.2 |
Figure 7. Learning curves on development.
MGLs of rule method
| Types | Relative | Personal | DNP | Others | All |
|---|---|---|---|---|---|
| MGM | 4 | 2 | 16 | 11 | **33** |
| FL | 2 | 9 | 7 | 0 | 18 |
| OOR | 0 | 5 | 7 | 0 | 12 |
| Sum | 6 | 16 | **30** | 11 | 63 |
Bold values are the main errors of coreference types or error types.
Spurious gold links of rule method
| Types | Relative | Personal | DNP | All |
|---|---|---|---|---|
| EL | 6 | 0 | 0 | 6 |
| FL | 5 | 19 | 6 | **30** |
| BMB | 0 | 8 | 0 | 8 |
| Sum | 11 | **27** | 6 | 44 |
Bold values are the main errors of coreference types or error types.
MGLs of LSTM-Coref
| Types | Relative | Personal | DNP | Others | All |
|---|---|---|---|---|---|
| MGM | 7 | 2 | 12 | 11 | 32 |
| FL | 2 | 14 | 25 | 0 | **41** |
| OOR | 0 | 0 | 7 | 0 | 7 |
| Sum | 9 | 16 | **44** | 11 | 80 |
Bold values are the main errors of coreference types or error types.
Spurious gold links of LSTM-Coref
| Types | Relative | Personal | DNP | All |
|---|---|---|---|---|
| EL | 12 | 0 | 0 | 12 |
| FL | 8 | 28 | 15 | **51** |
| BMB | 1 | 11 | 0 | 12 |
| Sum | 21 | **39** | 15 | 75 |
Bold values are the main errors of coreference types or error types.