Xun Zhu, Chen Lyu, Donghong Ji, Han Liao, Fei Li.
Abstract
Scientific information extraction is a crucial step in understanding scientific publications. In this paper, we focus on scientific keyphrase extraction, which aims to identify keyphrases in scientific articles and classify them into predefined categories. We present a neural network-based approach for this task that uses a bidirectional long short-term memory (LSTM) network to represent the sentences of an article. On top of the bidirectional LSTM layer, a conditional random field (CRF) predicts the label sequence for the whole sentence. Because annotated data for supervised learning is expensive to obtain, we incorporate a self-training method into our neural model to leverage unlabeled articles. Experimental results on the ScienceIE corpus and the ACL keyphrase corpus show that our neural model achieves promising performance without any hand-designed features or external knowledge resources. Furthermore, it efficiently incorporates the unlabeled data and achieves competitive performance compared with previous state-of-the-art systems.
Year: 2020 PMID: 32413094 PMCID: PMC7228065 DOI: 10.1371/journal.pone.0232547
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Keyphrases annotated in the ScienceIE corpus.
Fig 2. One example using the BILUO label scheme.
Sentence 1 comes from the ScienceIE corpus.
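
The BILUO scheme named in the caption above marks each token as the Beginning, Inside, or Last token of a multi-token keyphrase, as a Unit (single-token) keyphrase, or as Outside any keyphrase. The following is a minimal sketch of that encoding, not the authors' code. The function name `spans_to_biluo`, the example sentence, and the category names `Process` and `Material` are illustrative assumptions; the source only says that keyphrases are classified into predefined categories.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, category) -- a hypothetical span format
Span = Tuple[int, int, str]


def spans_to_biluo(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping keyphrase spans into one BILUO label per token."""
    labels = ["O"] * n_tokens  # every token starts Outside any keyphrase
    for start, end, category in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"span ({start}, {end}) is out of range")
        if end - start == 1:
            labels[start] = f"U-{category}"        # Unit: single-token keyphrase
        else:
            labels[start] = f"B-{category}"        # Beginning token
            for i in range(start + 1, end - 1):
                labels[i] = f"I-{category}"        # Inside tokens
            labels[end - 1] = f"L-{category}"      # Last token
    return labels


# Invented example sentence (not taken from either corpus).
tokens = ["We", "train", "a", "conditional", "random", "field", "on", "abstracts"]
spans = [(3, 6, "Process"), (7, 8, "Material")]
for token, label in zip(tokens, spans_to_biluo(len(tokens), spans)):
    print(f"{token:12s} {label}")
```

Compared with a plain BIO scheme, the extra L and U labels make span boundaries explicit, which gives a sequence labeler a more direct signal for where a keyphrase ends.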
Fig 3. Overview of the BLSTM-CRF model.
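
In a BLSTM-CRF model, the CRF layer scores whole label sequences rather than individual tokens, and the best sequence is typically found with Viterbi decoding. The sketch below illustrates that decoding step in plain Python under stated assumptions: `emissions[t][j]` stands in for the per-token label scores a bidirectional LSTM would produce, and `transitions[i][j]` for a learned score of moving from label `i` to label `j`. The toy numbers are invented, and the source does not describe its decoder in this much detail.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label index sequence for one sentence."""
    if not emissions:
        return []
    n_labels = len(transitions)
    score = list(emissions[0])   # best score of any path ending in each label
    backpointers = []            # for each later step: best previous label per label
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label to recover the full path.
    best = max(range(n_labels), key=lambda i: score[i])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Toy scores for two labels over three tokens (illustrative values only).
emissions = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
no_preference = [[0.0, 0.0], [0.0, 0.0]]
penalize_switch = [[0.0, -5.0], [-5.0, 0.0]]
print(viterbi(emissions, no_preference))
print(viterbi(emissions, penalize_switch))
```

With zero transition scores the decoder reduces to picking the best label per token; with a penalty on switching labels it prefers a consistent sequence, which is the behavior that motivates placing a CRF on top of the LSTM outputs.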
Statistics of the datasets.
| Corpus | Unit | Training | Dev | Test |
|---|---|---|---|---|
| ScienceIE | Sentences | 2403 | 399 | 851 |
| ScienceIE | Keyphrases | 6721 | 1154 | 2051 |
| ACL | Sentences | 2159 | 283 | 272 |
| ACL | Keyphrases | 2999 | 389 | 392 |
Hyper-parameter settings.
| Hyper-parameter | Value |
|---|---|
| Probability threshold | 0.8 |
| Initial learning rate | 0.01 |
| Regularization parameter | $10^{-8}$ |
| Dropout rate | 0.4 |
| Dim(emb(word)) | 300 |
| Dim(emb(POS)), Dim(emb(DEP)) | 25 |
| Hidden layer size | 100 |
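
The probability threshold of 0.8 listed above is presumably the confidence cut-off used by the self-training method to decide which automatically labeled sentences join the training set. The sketch below shows one generic way such a loop can be written; it is an assumption-laden illustration, not the authors' procedure. The names `select_confident`, `self_train`, `fit`, and `predict` are hypothetical, as are the per-sentence probability, the `>=` comparison, and the `max_rounds` limit.

```python
from typing import Callable, List, Tuple

Sentence = List[str]
Labels = List[str]
Example = Tuple[Sentence, Labels]
# Hypothetical interface: a predictor returns labels plus a sequence probability.
Predictor = Callable[[Sentence], Tuple[Labels, float]]
Trainer = Callable[[List[Example]], Predictor]


def select_confident(unlabeled: List[Sentence], predict: Predictor,
                     threshold: float = 0.8) -> Tuple[List[Example], List[Sentence]]:
    """Split unlabeled sentences into confidently auto-labeled ones and the rest."""
    confident, remaining = [], []
    for sentence in unlabeled:
        labels, probability = predict(sentence)
        if probability >= threshold:   # assumed comparison; the source gives only 0.8
            confident.append((sentence, labels))
        else:
            remaining.append(sentence)
    return confident, remaining


def self_train(labeled: List[Example], unlabeled: List[Sentence], fit: Trainer,
               threshold: float = 0.8, max_rounds: int = 5) -> Tuple[Predictor, List[Example]]:
    """Retrain repeatedly, adding confident predictions to the training set."""
    train_set = list(labeled)
    pool = list(unlabeled)
    predict = fit(train_set)
    for _ in range(max_rounds):        # max_rounds is an illustrative stopping rule
        confident, pool = select_confident(pool, predict, threshold)
        if not confident:
            break                      # nothing new was confident enough
        train_set.extend(confident)
        predict = fit(train_set)
    return predict, train_set
```

Here `fit` would wrap training of the sequence labeler on the current training set. A higher threshold admits fewer but cleaner pseudo-labels, which trades coverage of the unlabeled data against label noise.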
Effects of the heuristic rule.
| Models | ScienceIE (P/R/F1) | ACL (P/R/F1) |
|---|---|---|
| BLSTM-CRF | 47.4/40.5/43.7 | 40.7/31.4/35.4 |
| BLSTM-CRF-H | 47.3/43.1/45.1 | 39.9/31.4/35.1 |
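
Each cell in these result tables lists precision, recall, and F1, where F1 is the harmonic mean of the first two. As a quick consistency check, the sketch below recomputes F1 from the precision and recall values in the table above; the helper name `f1_score` is our own. Because the published P and R are themselves rounded to one decimal, the recomputed values should be expected to match the reported F1 only to within rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, corpus) -> (P, R, reported F1), copied from the table above.
reported = {
    ("BLSTM-CRF", "ScienceIE"): (47.4, 40.5, 43.7),
    ("BLSTM-CRF", "ACL"): (40.7, 31.4, 35.4),
    ("BLSTM-CRF-H", "ScienceIE"): (47.3, 43.1, 45.1),
    ("BLSTM-CRF-H", "ACL"): (39.9, 31.4, 35.1),
}

for (model, corpus), (p, r, f1) in reported.items():
    print(f"{model:12s} {corpus:10s} recomputed={f1_score(p, r):.2f} reported={f1}")
```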
Effects of the self-training method.
| Models | ScienceIE (P/R/F1) | ACL (P/R/F1) |
|---|---|---|
| Baseline | 47.3/43.1/45.1 | 40.7/31.4/35.4 |
| | 49.2/42.6/45.7 | 42.6/30.6/35.6 |
| | 48.6/43.2/45.7 | 45.0/29.6/35.7 |
| | 48.6/42.9/45.6 | 45.4/31.6/37.3 |
Ablation test for different embeddings.
| Models | ScienceIE (P/R/F1) | ACL (P/R/F1) |
|---|---|---|
| BLSTM-CRF | 47.4/40.5/43.7 | 40.7/31.4/35.4 |
| No POS embeddings | 48.2/39.2/43.2 | 40.9/29.3/34.2 |
| No dependency embeddings | 47.6/39.9/43.4 | 41.5/28.6/33.8 |
Results of our model on the ScienceIE corpus, together with other top-performance systems.
| Models | F1(%) |
|---|---|
| Gupta | 9.8 |
| Tsai | 11.9 |
| AI2 | 44 |
| Luan | 46.6 |
| BLSTM-CRF | 43.7 |
| BERT | 35.1 |
Results of our model on the ACL corpus, together with other top-performance systems.
| Models | F1(%) |
|---|---|
| BLSTM-CRF | 35.4 |
| BERT | 31.6 |