Maryam Khordad, Robert E Mercer.
Abstract
BACKGROUND: One important type of information contained in the biomedical research literature is newly discovered relationships between phenotypes and genotypes. Because of the large quantity of literature, a reliable automatic system to identify this information for future curation is essential. Such a system provides important and up-to-date data for database construction and updating, and even for text summarization. In this paper we present a machine learning method to identify these genotype-phenotype relationships. No large human-annotated corpus of genotype-phenotype relationships currently exists, so a semi-automatic approach has been used to annotate a small labelled training set, and a self-training method is proposed to annotate more sentences and enlarge the training set.
Keywords: Computational linguistics; Genotype-phenotype relationship; Genotypes; Phenotypes; Self-training; Semi-automatic corpus annotation
Year: 2017 PMID: 29212530 PMCID: PMC5719522 DOI: 10.1186/s13326-017-0163-8
Source DB: PubMed Journal: J Biomed Semantics
Fig. 1 Workflow
Distribution of data in our different sets
| Data set | Sentences | Instances | Positive instances | Negative instances |
|---|---|---|---|---|
| Training set | 509 | 845 | 576 | 269 |
| Test set | 244 | 823 | 536 | 287 |
| Unlabelled data | 408 | 823 | N/A | N/A |
List of dependency features
| Features | Description |
|---|---|
| Relationship term | Root of the portion of the dependency tree connecting phenotype and genotype |
| Stemmed relationship term | Stemmed by |
| Relative position of relationship term | Whether it is before the first entity, after the second entity or between them |
| The relationship term combined with the dependency relationship | To consider the grammatical role of the relationship term in the dependency path. |
| The relationship term and its relative position | |
| Key term | Described in Ibn Faiz’s four step method [ |
| Key term and its relative position | |
| Collapsed version of the dependency path | All occurrences of nsubj/nsubjpass are replaced with subj, rcmod/partmod with mod, prep_x with x, and everything else with O, a placeholder to indicate that a dependency has been ignored. |
| Second version of the collapsed dependency path | Only the prep_* of dependency relationships are kept. |
| Negative dependency relationship | A binary feature that shows whether there is any node in the path between the entities which dominates a |
| prep_between | A binary feature that checks for the existence of two consecutive prep_between links in a dependency path. |
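
The two collapsed-path features in the table above can be illustrated with a minimal Python sketch. The function names `collapse_path` and `prep_only_path`, and the example list of dependency labels, are hypothetical and not taken from the paper. The sketch also assumes that the second version keeps the `prep_*` labels unchanged, which the table does not state explicitly.

```python
def collapse_path(relations):
    """Collapse dependency labels following the table's description.

    nsubj/nsubjpass -> subj, rcmod/partmod -> mod, prep_x -> x,
    and every other label -> "O" (placeholder for an ignored dependency).
    """
    collapsed = []
    for rel in relations:
        if rel in ("nsubj", "nsubjpass"):
            collapsed.append("subj")
        elif rel in ("rcmod", "partmod"):
            collapsed.append("mod")
        elif rel.startswith("prep_"):
            collapsed.append(rel[len("prep_"):])
        else:
            collapsed.append("O")
    return collapsed


def prep_only_path(relations):
    """Second version: keep only the prep_* labels (assumed to stay unchanged)."""
    return [rel for rel in relations if rel.startswith("prep_")]


# Hypothetical path between two entities, for illustration only.
path = ["nsubjpass", "prep_of", "det", "prep_with"]
print(collapse_path(path))   # ['subj', 'of', 'O', 'with']
print(prep_only_path(path))  # ['prep_of', 'prep_with']
```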
List of syntactic and surface features
| Features | Description |
|---|---|
| Syntactic features | |
| Stemmed version of relationship term in the Least Common Ancestor (LCA) node of the two entities | If the head of the LCA node of the two entities in the syntax tree is a relationship term then this feature takes a stemmed version of the head word as its value, otherwise it takes a NULL value. |
| The label of each of the constituents in the path between the LCA and each entity combined with its distance from the LCA node | |
| Surface features | |
| Relationship terms and their relative positions | The relationship terms between two entities or within a short distance (4 tokens) from them. |
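
As a rough illustration of the surface feature, the following Python sketch collects relationship terms that lie between the two entities or within 4 tokens of them, and tags each with its relative position. The function name, the token-index representation of entities, and the small set of relationship terms are assumptions made for this example; the paper's own term list and tokenization are not given here.

```python
def relation_terms_with_position(tokens, first_idx, second_idx,
                                 relation_terms, window=4):
    """Return (term, position) pairs for relationship terms near two entities.

    position is "before" (within `window` tokens before the first entity),
    "between" (between the two entities), or "after" (within `window`
    tokens after the second entity).
    """
    lo, hi = sorted((first_idx, second_idx))
    features = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if i in (lo, hi) or word not in relation_terms:
            continue
        if lo < i < hi:
            features.append((word, "between"))
        elif lo - window <= i < lo:
            features.append((word, "before"))
        elif hi < i <= hi + window:
            features.append((word, "after"))
    return features


# Hypothetical relationship terms; the sentence is the one used for Fig. 2.
terms = {"association", "confirmed"}
sentence = "The association of Genotype1 with Phenotype2 is confirmed".split()
print(relation_terms_with_position(sentence, 3, 5, terms))
# [('association', 'before'), ('confirmed', 'after')]
```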
Fig. 2 Dependency tree related to the sentence “The association of Genotype1 with Phenotype2 is confirmed”
Fig. 3 The self-training process
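
A generic self-training loop can be sketched in Python as follows. This is not the authors' implementation. The toy one-dimensional classifier `fit_centroid`, the function `self_train`, and the reading of the two threshold parameters as separate confidence cut-offs for positive and negative predictions are all assumptions made for illustration.

```python
import math


def fit_centroid(labelled):
    """Toy classifier (hypothetical): logistic score around the midpoint of
    the positive and negative class means. Assumes positives lie higher."""
    pos = [x for x, y in labelled if y == 1]
    neg = [x for x, y in labelled if y == 0]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: 1.0 / (1.0 + math.exp(-(x - mid)))


def self_train(labelled, unlabelled, fit, pos_threshold, neg_threshold,
               max_iterations=31):
    """Repeatedly train, label confident unlabelled items, and retrain."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_iterations):
        if not pool:
            break
        model = fit(labelled)
        remaining, added = [], 0
        for x in pool:
            p = model(x)  # estimated probability of the positive class
            if p >= pos_threshold:
                labelled.append((x, 1))
                added += 1
            elif 1.0 - p >= neg_threshold:
                labelled.append((x, 0))
                added += 1
            else:
                remaining.append(x)  # not confident enough; keep for later
        pool = remaining
        if added == 0:
            break  # nothing new was labelled, so retraining would not change
    return fit(labelled), labelled, pool


seed = [(0.0, 0), (4.0, 1)]
model, enlarged, left_over = self_train(seed, [-3.0, 2.1, 7.0], fit_centroid,
                                        pos_threshold=0.92, neg_threshold=0.85)
print(len(enlarged), left_over)  # 4 [2.1]
```

In this toy run the two clear-cut points are added to the training set in the first iteration, while the borderline point stays unlabelled because neither threshold is met.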
Evaluation results
| Method | Precision | Recall | F-measure |
|---|---|---|---|
| Supervised learning method | 76.47 | 77.61 | 77.03 |
| Self-training method | 77.70 | 77.84 | 77.77 |
| PPI-configured ML-based tool | 75.19 | 53.17 | 62.29 |
| PPI-configured rule-based tool | 77.77 | 38.04 | 51.09 |
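
The F-measure column can be checked against the precision and recall columns. The sketch below assumes that the reported F-measure is the balanced F1 score, i.e. the harmonic mean 2PR/(P+R); the source does not state the weighting. The function name `f_measure` is introduced here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from the table above.
rows = {
    "Supervised learning method": (76.47, 77.61, 77.03),
    "Self-training method": (77.70, 77.84, 77.77),
    "PPI-configured ML-based tool": (75.19, 53.17, 62.29),
    "PPI-configured rule-based tool": (77.77, 38.04, 51.09),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f_measure(p, r):.2f}, reported {reported:.2f}")
```

Under this assumption the recomputed values agree with the reported ones to within about 0.01, which is consistent with the table having been computed from unrounded precision and recall.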
Fig. 4 Precision values on the test set for all 22 parameter settings for 31 semi-supervised learning iterations
Fig. 5 Recall values on the test set for all 22 parameter settings for 31 semi-supervised learning iterations
Fig. 6 F-measure values on the test set for all 22 parameter settings for 31 semi-supervised learning iterations
Maximum values for precision, recall, and F-measure
| Parameter setting | Precision: maximum value | Precision: iteration | Recall: maximum value | Recall: iteration | F-measure: maximum value | F-measure: iteration |
|---|---|---|---|---|---|---|
| 0.82 0.92 | 0.7699 | 4 | 0.8138 | 17 | 0.7880 | 19 |
| 0.83 0.92 | 0.7714 | 5 | 0.8287 | 31 | 0.7911 | 31 |
| 0.84 0.92 | 0.7709 | 7 | 0.8250 | 30 | 0.7889 | 26 |
| 0.85 0.92 | 0.7780 | 5 | 0.8268 | 31 | 0.7935 | 17 |
| 0.86 0.92 | 0.7709 | 5 | 0.8156 | 13 | 0.7870 | 12 |
| 0.87 0.92 | 0.7743 | 5 | 0.8063 | 20 | 0.7788 | 20 |
| 0.88 0.92 | 0.7698 | 5 | 0.8231 | 23 | 0.7907 | 23 |
| 0.85 0.93 | 0.7770 | 6 | 0.8268 | 24 | 0.7870 | 12 |
| 0.86 0.93 | 0.7757 | 5 | 0.8324 | 25 | 0.7856 | 25 |
| 0.87 0.93 | 0.7689 | 4 | 0.8287 | 15 | 0.7857 | 19 |
| 0.88 0.93 | 0.7704 | 7 | 0.8343 | 20 | 0.7946 | 17 |
| 0.89 0.93 | 0.7665 | 1 | 0.8399 | 27 | 0.7923 | 30 |
| 0.85 0.94 | 0.7755 | 9 | 0.8250 | 31 | 0.7836 | 14 |
| 0.86 0.94 | 0.7712 | 2 | 0.8250 | 31 | 0.7849 | 19 |
| 0.87 0.94 | 0.7741 | 5 | 0.8436 | 26 | 0.7961 | 26 |
| 0.88 0.94 | 0.7689 | 5 | 0.8194 | 25 | 0.7849 | 13 |
| 0.89 0.94 | 0.7715 | 6 | 0.8156 | 13 | 0.7892 | 13 |
| 0.85 0.95 | 0.7694 | 2 | 0.8156 | 20 | 0.7866 | 11 |
| 0.86 0.95 | 0.7688 | 2 | 0.8287 | 19 | 0.7896 | 15 |
| 0.87 0.95 | 0.7694 | 2 | 0.8268 | 31 | 0.7848 | 11 |
| 0.88 0.95 | 0.7705 | 7 | 0.8231 | 28 | 0.7875 | 14 |
| 0.89 0.95 | 0.7681 | 10 | 0.8212 | 21 | 0.7848 | 13 |
Fig. 7 Instances added for all 22 parameter settings for 31 semi-supervised learning iterations on the test set
Results after deleting Phenominer sentences from the test set
| Method | Precision | Recall | F-measure |
|---|---|---|---|
| Supervised learning method | 80.20 | 79.79 | 80.00 |
| Self-training method | 80.05 | 81.07 | 80.55 |