Sofie Van Landeghem, Jari Björne, Thomas Abeel, Bernard De Baets, Tapio Salakoski, Yves Van de Peer.
Abstract
BACKGROUND: Text mining tools have gained popularity for processing the vast number of research articles available in the biomedical literature. It is crucial that such tools extract information with a sufficient level of detail to be applicable in real-life scenarios. Studies of mining non-causal molecular relations contribute to this goal by formally identifying the relations between genes, promoters, complexes and various other molecular entities found in text. More importantly, these studies help to enhance the integration of text mining results with database facts.
Year: 2012 PMID: 22759460 PMCID: PMC3384255 DOI: 10.1186/1471-2105-13-S11-S6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Examples of entity relations
| Type of relation | Examples |
|---|---|
| Equivalence | [human |
| Member-Collection | [ |
| Protein-Component | [ |
| Subunit-Complex | [ |
Examples of entity relation types, including both embedded and non-embedded cases. GGPs are in italic and domain terms are delimited by square brackets.
Dataset dimensionalities
| Relation type | Train instances | Test instances |
|---|---|---|
|  | 1689 | 334 |
|  | 751 | 163 |
|  | 720 | 129 |
| Functional (GENIA - E) | 110 | 17 |
| Locus (GENIA - E) | 11 | 5 |
| Member-Collection (GENIA - E) | 5 | 0 |
| Misc (GENIA - E) | 53 | 11 |
| Object-Variant (GENIA - E) | 14 | 5 |
| Out-of (GENIA - E) | 40 | 7 |
|  | 222 | 51 |
|  | 108 | 22 |
|  | 760 | 181 |
|  | 593 | 174 |
|  | 275 | 82 |
Number of positive instances of the various types in the entity relation corpora. ST refers to the BioNLP'11 Shared Task data, while GENIA refers to the GENIA relation corpus. The latter corpus is further divided into embedded (E) and non-embedded (NE) cases. Datasets sufficiently large for classification analysis are in bold.
Corpora characteristics
| Relation types | Embedded distinction | Gold | Gold | 800 | 150 | 260 | |
|---|---|---|---|---|---|---|---|
| 2 | no | yes | no | train | dev. | test | |
| 4 | yes | yes | yes | train | test | - |
Characteristics of the two different entity relations corpora. The number of relation types only includes those that were used in the classification experiments.
Figure 1. Flowchart of the GETM framework, including example intermediate steps for the sentence "Thrombin-induced p65 homodimer binding to downstream NF-kappa B site of the promoter mediates endothelial ICAM-1 expression and neutrophil adhesion."
Figure 2. Clustering example. A few examples of clusters of domain terms, derived by LSA and clustered with the Markov Cluster algorithm.
Figure 3. Feature selection analysis for Protein-Component. Most important features for predicting Protein-Component relations, as identified by the GETM framework. The feature cloud shows all types of grammatical and lexical features that are most discriminative according to the ensemble feature selection algorithm. Red marks features that indicate negative examples; blue marks features that indicate positive examples.
Results on the ST data
| System | Subunit-Complex P | Subunit-Complex R | Subunit-Complex F | Protein-Component P | Protein-Component R | Protein-Component F | All P | All R | All F |
|---|---|---|---|---|---|---|---|---|---|
| TEES (T) | 66.95 | 48.47 | 56.23 | 68.57 | 50.90 | 58.43 | 68.04 | 50.10 |  |
| GETM | 38.12 | 47.85 | 42.43 | 36.53 | 47.31 | 41.23 | 37.04 | 47.48 |  |
| Hybrid (H) | 66.95 | 48.47 | 56.23 | 61.79 | 52.40 | 56.70 | 63.32 | 51.11 |  |
| T ∩ H | 75.25 | 46.63 | 57.58 | 71.56 | 48.80 | 58.03 | 72.70 | 48.09 |  |
| T ∪ H | 60.74 | 50.31 | 55.03 | 59.73 | 53.89 | 56.66 | 60.05 | 52.72 |  |
Performance on the ST'11 test set, measured by precision, recall and their harmonic mean, the F-score (F). The first few rows indicate the official results of the TEES (T) and GETM frameworks. Next, the performance of the hybrid (H) system is shown. Finally, the two last rows report on the performance of creating the intersection and the union of TEES and the hybrid system.
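The metrics in this table, and the effect of taking the intersection or union of two systems' predictions, can be illustrated with a minimal Python sketch. The relation identifiers (`r1`, `r2`, ...), the two prediction sets, and the function names `f_score` and `prf` are hypothetical placeholders chosen for illustration; they are not taken from the paper or from the TEES or GETM code.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale in, same scale out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def prf(predicted: set, gold: set) -> tuple:
    """Precision, recall and F-score of a set of predicted relations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical toy data: gold-standard relations and two systems' predictions.
gold = {"r1", "r2", "r3", "r4"}
system_a = {"r1", "r2", "r5"}
system_b = {"r2", "r3", "r6"}

# Intersection keeps only relations both systems agree on (favours precision);
# union keeps every relation either system proposes (favours recall).
for name, predictions in [
    ("A", system_a),
    ("B", system_b),
    ("A and B (intersection)", system_a & system_b),
    ("A or B (union)", system_a | system_b),
]:
    p, r, f = prf(predictions, gold)
    print(f"{name}: P={p:.2f} R={r:.2f} F={f:.2f}")

# Sanity check against a reported row: P = 66.95, R = 48.47 gives F = 56.23.
print(round(f_score(66.95, 48.47), 2))
```

On the toy data the intersection reaches the highest precision and the union the highest recall, which mirrors the pattern in the last two rows of the table.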
Results on the GENIA relation corpus
| Relation type | TEES P | TEES R | TEES F | GETM P | GETM R | GETM F |
|---|---|---|---|---|---|---|
|  | 93.13 | 95.31 | 94.21 | 97.64 | 96.12 | 96.88 |
|  | 96.08 | 96.08 | 96.08 | 100.00 | 86.27 | 92.63 |
|  | 79.17 | 86.36 | 82.61 | 80.00 | 72.73 | 76.19 |
|  | 92.23 | 94.53 |  | 97.85 | 90.10 |  |
|  | 81.44 | 75.14 | 78.16 | 71.73 | 75.69 | 73.66 |
|  | 87.77 | 67.03 | 76.01 | 73.33 | 83.63 | 78.14 |
|  | 81.54 | 64.63 | 72.11 | 73.24 | 63.41 | 67.97 |
|  | 83.83 | 69.89 |  | 72.65 | 76.50 |  |
|  | 86.83 | 77.55 |  | 79.94 | 80.82 |  |
Performance on the GENIA relation corpus for embedded (E) and non-embedded (NE) relation types, measured by precision, recall and their harmonic mean, the F-score.