Abstract
In this article, we describe our system for the CHEMPROT task of the BioCreative VI challenge. Although considerable research has been conducted on named entity recognition of genes and drugs, there has been limited research on extracting the relationships between them. Extracting relations between chemical compounds and genes from the literature is an important element of pharmacological and clinical research. The CHEMPROT task of BioCreative VI aims to promote the development of text-mining systems that can automatically extract relationships between chemical compounds and genes. We tested three recursive neural network approaches to improve the performance of relation extraction. For the BioCreative VI challenge, we developed a tree-structured Long Short-Term Memory network (tree-LSTM) model with several additional features, including a position feature and a subtree containment feature, and we also applied an ensemble method. After the challenge, we applied additional pre-processing steps to the tree-LSTM model and tested the performance of another recursive neural network model, the Stack-augmented Parser-Interpreter Neural Network (SPINN). Our tree-LSTM model achieved an F-score of 58.53% in the BioCreative VI challenge. Our tree-LSTM model with additional pre-processing and the SPINN model obtained F-scores of 63.7% and 64.1%, respectively.
Database URL: https://github.com/arwhirang/recursive_chemprot
Year: 2018 PMID: 29961818 PMCID: PMC6014134 DOI: 10.1093/database/bay060
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Five groups of CHEMPROT relations to be used for evaluation
| Groups | CHEMPROT relations | Sentence example |
|---|---|---|
| CPR:3 | UPREGULATOR, ACTIVATOR, INDIRECT_UPREGULATOR | <BC6ENT1>Amitriptyline</BC6ENT1>, but not any other tricyclic or selective serotonin reuptake inhibitor antidepressants, promotes <BC6ENT2>TrkA</BC6ENT2> autophosphorylation in primary neurons and induces neurite outgrowth in PC12 cells. |
| CPR:4 | DOWNREGULATOR, INHIBITOR, INDIRECT_DOWNREGULATOR | Ginseng total saponins, <BC6ENT1>ginsenosides Rb2, Rg1 and Rd</BC6ENT1> administered intraperitoneally attenuated the immobilization stress-induced increase in plasma <BC6ENT2>IL-6</BC6ENT2> level. |
| CPR:5 | AGONIST, AGONIST-ACTIVATOR, AGONIST-INHIBITOR | At 10(-6) M in transcription assays, none of these compounds showed progestin agonist activity, whereas <BC6ENT1>mifepristone</BC6ENT1> and its monodemethylated metabolite manifested slight <BC6ENT2>glucocorticoid</BC6ENT2> agonist activity. |
| CPR:6 | ANTAGONIST | In another experiment, <BC6ENT1>cyanopindolol</BC6ENT1>, an antagonist of the <BC6ENT2>serotonin terminal autoreceptor</BC6ENT2>, also prolonged the clearance of 5-HT from the CA3 region. |
| CPR:9 | SUBSTRATE, PRODUCT_OF, SUBSTRATE_PRODUCT_OF | Leukotriene A(4) hydrolase (<BC6ENT1>LTA(4)H</BC6ENT1>) is a cytosolic enzyme that stereospecifically catalyzes the transformation of <BC6ENT2>LTA(4)</BC6ENT2> to LTB(4). |
Figure 1. Overall system pipeline and data examples. We applied pre-processing and sentence parsing to the challenge data. The examples (a–c) show the input sentence, the pre-processing result, and the parsing result, respectively. We also extract the subtree containment feature and the position feature for the tree-LSTM model, and the transition list for the SPINN model. Steps labeled 'Post-challenge' indicate additional work performed after the challenge to further improve performance.
Hyperparameter search ranges and selected values
| Model | Parameter | Test range | Test unit | Selected |
|---|---|---|---|---|
| Tree-LSTM | Batch size | 64–512 | 64 | 256 |
| | Hidden unit size | 64–512 | 64 | 256 |
| | Learning rate | 0.0005–0.01 | 0.0005 | 0.001 |
| | Keep probability | 0.5–0.9 | 0.1 | 0.5 |
| | Subtree containment size | 2–10 | 2 | 10 |
| | Epochs | 500–1000 | 100 | 1000 |
| SPINN | Batch size | 64–256 | 64 | 256 |
| | Hidden unit size | 64–256 | 64 | 256 |
| | MLP dropout | 0.5–0.9 | 0.1 | 0.5 |
MLP, Multi-Layer Perceptron.
Figure 2. The architecture of our tree-LSTM model. (1) The target words in the sentences are underlined. (2) Vector representations of words from a pre-trained word embedding. (3) The subtree containment feature vector for each leaf. (4) Position feature vectors; PV_1 and PV_2 are the relative distances of the first and second target words from the current word, respectively. (5) An example of the position feature vector when the current word is 'dual'.
Vector representation according to the distance between one of the target entities and the current word
| Dimension \ Relative distance | −5 | −4 | −3 | −2 | −1 | 0 | 1 | 2 | 3 | 4 | 5 | 6–10 | 11–15 | 16–20 | 21–∞ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 7 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 8 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 10 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
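One way to read this table: each column is a 10-dimensional binary vector in which the first dimension fires when the current word lies to the right of the entity, dimensions 2–5 encode binned right-side distances (≥21, ≥16, ≥11, ≥6), and dimensions 6–10 encode absolute distances (≥5, ≥4, ≥3, ≥2, ≥1). The following sketch reconstructs the vectors under that reading; it is our interpretation of the table, not the authors' released code:

```python
def position_vector(dist):
    """Map a signed word distance to the 10-bit position vector.

    Bit layout (an assumption inferred from the table above):
      bit 0    : the current word follows the entity (dist >= 1)
      bits 1-4 : right-side distance reaches 21, 16, 11, 6
      bits 5-9 : absolute distance reaches 5, 4, 3, 2, 1
    """
    vec = [1 if dist >= 1 else 0]
    vec += [1 if dist >= t else 0 for t in (21, 16, 11, 6)]
    vec += [1 if abs(dist) >= t else 0 for t in (5, 4, 3, 2, 1)]
    return vec

print(position_vector(0))   # all zeros: the current word is the entity itself
print(position_vector(-3))  # only the |dist| >= 3, 2, 1 bits fire
```

Note that the negative side has no binned tail, so every word more than five positions to the left receives the same vector as distance −5, matching the table's leftmost column.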
Figure 3. The SPINN model, which implements a shift-reduce parser operation at each transition step.
The statistics of the BioCreative VI CHEMPROT corpus after pre-processing
| Dataset | Abstracts | Positive | Negative | Ratio |
|---|---|---|---|---|
| Train_orig | 1020 | 4157 | 16,964 | 1:3.08 |
| Develop_orig | 612 | 2416 | 10,614 | 1:3.39 |
| Test_orig | 3399 | 58,523 | | |
| Train | 1020 | 4133 | 16,522 | 1:2.99 |
| Devel | 612 | 2412 | 10,362 | 1:3.29 |
| Test | 3399 | 3444 | 10,999 | 1:3.19 |
The first three datasets are the original datasets used during the challenge. The challenge organizers appended dummy data to the test set to prevent manual annotation by participants.
Comparison between the results of our recursive neural network systems and the top three CHEMPROT challenge results

| Rank | Team ID (model) | P (%) | R (%) | F (%) |
|---|---|---|---|---|
| Challenge results | | | | |
| 1 | TEAM_430 | | 57.3 | |
| 2 | TEAM_403 | 56.1 | | 61.4 |
| 3 | TEAM_417 | 66.0 | 56.6 | 60.9 |
| 4 | Our tree-LSTM (ensemble) | 67.0 | 51.9 | 58.5 |
| Our post-challenge enhancements | | | | |
| | Tree-LSTM (single)+pp | 65.7 | 58.1 | 61.7 |
| | Tree-LSTM (ensemble)+pp | 70.0 | 58.4 | 63.7 |
| | SPINN (single)+pp | 61.5 | 60.2 | |
| | SPINN (ensemble)+pp | | 56.0 | 64.1 |

Note: P, R and F denote precision, recall and F1 score, respectively. 'pp' denotes the new pre-processing method, applied only to the post-challenge models. '(ensemble)' is the weighted vote of 10 instances of the same model, trained independently with randomly initialized weights; '(single)' is the result of a single model instance.
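The note describes weighted voting over 10 independently trained instances of the same model. The exact weighting scheme is not given here; a minimal sketch, assuming each instance contributes a scalar weight (for example, its development-set F-score) and that uniform weights reduce to plain majority voting:

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Combine per-instance label predictions by weighted voting.

    predictions: one predicted label per trained model instance
    weights:     optional per-instance weights (a hypothetical choice,
                 e.g. dev-set F-scores); defaults to uniform majority voting
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    score = defaultdict(float)
    for label, w in zip(predictions, weights):
        score[label] += w
    return max(score, key=score.get)

# 10 instances voting on one candidate chemical-gene pair
votes = ["CPR:4"] * 6 + ["CPR:3"] * 3 + ["False"]
print(weighted_vote(votes))  # CPR:4
```

With per-instance weights, a minority label can still win if its voters are sufficiently trusted, which is the main reason to prefer weighted over plain majority voting.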
Confusion matrix for the SPINN model (ensemble) result on the test set
| Pred \ Gold | False | CPR:3 | CPR:4 | CPR:5 | CPR:6 | CPR:9 |
|---|---|---|---|---|---|---|
| False | 10,410 | 281 | 566 | 72 | 95 | 396 |
| CPR:3 | 134 | 331 | 26 | 1 | 0 | 4 |
| CPR:4 | 266 | 51 | 1064 | 0 | 4 | 6 |
| CPR:5 | 16 | 0 | 0 | 107 | 4 | 0 |
| CPR:6 | 30 | 0 | 1 | 3 | 189 | 0 |
| CPR:9 | 100 | 1 | 1 | 0 | 0 | 238 |

Rows (Pred) are the model's predicted labels; columns (Gold) are the gold-standard labels.
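As a sanity check, the micro-averaged precision, recall and F1 over the five CPR classes can be recomputed directly from this confusion matrix; the resulting F1 (≈64.1%) agrees with the SPINN score reported in the abstract. This is an illustrative sketch, not the official evaluation script:

```python
# Confusion matrix from the table above: MATRIX[pred][gold], with index 0
# being the negative class 'False' and indices 1-5 the CPR relation classes.
MATRIX = [
    [10410, 281, 566, 72, 95, 396],  # predicted False
    [134, 331, 26, 1, 0, 4],         # predicted CPR:3
    [266, 51, 1064, 0, 4, 6],        # predicted CPR:4
    [16, 0, 0, 107, 4, 0],           # predicted CPR:5
    [30, 0, 1, 3, 189, 0],           # predicted CPR:6
    [100, 1, 1, 0, 0, 238],          # predicted CPR:9
]

def micro_prf(matrix):
    """Micro-averaged P/R/F1, treating class 0 ('False') as negative."""
    n = len(matrix)
    tp = sum(matrix[i][i] for i in range(1, n))          # correct relations
    pred_pos = sum(sum(row) for row in matrix[1:])       # predicted relations
    gold_pos = sum(matrix[i][j] for i in range(n) for j in range(1, n))
    p, r = tp / pred_pos, tp / gold_pos
    return p, r, 2 * p * r / (p + r)

p, r, f = micro_prf(MATRIX)
print(f"P={p:.1%} R={r:.1%} F={f:.1%}")  # P=74.9% R=56.1% F=64.1%
```

The column sums over the relation classes total 3441, close to the 3444 positive test instances reported in the corpus statistics table.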
Types of errors and corresponding example sentences from our SPINN model
| Predicted results | A representative example sentence |
|---|---|
| Answer: CPR:4 (INHIBITOR) Predicted: - | Small molecules bearing hydroxamic acid as the <BC6ENTC> |
| Answer: CPR:3 (ACTIVATOR) Predicted: - | <BC6ENTC> |
| Answer: CPR:4 (INHIBITOR) Predicted: CPR:3 (ACTIVATOR) | <BC6ENTC> |
An example of a binarized parse tree, its token sequence, and the corresponding shift-reduce transition list (a complete sequence for n tokens contains n shifts and n − 1 reduces):

- Parse tree: (BC6ENT1 (is ((an inhibitor) (of BC6ENT2))))
- Token list: [BC6ENT1, is, an, inhibitor, of, BC6ENT2]
- Transition list: [shift, shift, shift, shift, reduce, shift, shift, reduce, reduce, reduce, reduce]
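The transition list drives SPINN's stack/buffer computation: shift moves the next token from the buffer onto the stack, and reduce pops the top two stack items and composes them (in SPINN via a TreeLSTM composition function; here plain tuples stand in for that, purely for illustration). A minimal interpreter sketch, using the complete 2n − 1 transition sequence for the six-token example:

```python
def run_transitions(tokens, transitions):
    """Replay shift-reduce transitions; 'reduce' composes the top two
    stack items (SPINN would apply a TreeLSTM cell; we build tuples)."""
    stack, buffer = [], list(tokens)
    for t in transitions:
        if t == "shift":
            stack.append(buffer.pop(0))
        else:  # reduce
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
    assert len(stack) == 1 and not buffer, "transition list must be complete"
    return stack[0]

tokens = ["BC6ENT1", "is", "an", "inhibitor", "of", "BC6ENT2"]
# 6 shifts + 5 reduces = 2*6 - 1 transitions for a full binary tree
trans = ["shift"] * 4 + ["reduce"] + ["shift"] * 2 + ["reduce"] * 4
print(run_transitions(tokens, trans))
# ('BC6ENT1', ('is', (('an', 'inhibitor'), ('of', 'BC6ENT2'))))
```

Because every binary tree over n leaves has exactly n − 1 internal nodes, any transition list with fewer reduces leaves more than one element on the stack and cannot represent a complete parse.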