Deyu Zhou1, Dayou Zhong1, Yulan He2.
Abstract
Biomedical relation extraction aims to uncover high-quality relations from life science literature with high accuracy and efficiency. Early biomedical relation extraction tasks focused on capturing binary relations, such as protein-protein interactions, which are crucial for virtually every process in a living cell. Information about these interactions provides the foundations for new therapeutic approaches. In recent years, interest has shifted to the extraction of complex relations such as biomolecular events. Complex relations go beyond binary relations: they can involve more than two arguments and may also take another relation as an argument. In this paper, we conduct a thorough survey of the research in biomedical relation extraction. We first present a general framework for biomedical relation extraction and then discuss the approaches proposed for binary and complex relation extraction, with a focus on the latter since it is a much more difficult task than binary relation extraction. Finally, we discuss the challenges we face in complex relation extraction and outline possible solutions and future directions.
Year: 2014 PMID: 25214883 PMCID: PMC4156999 DOI: 10.1155/2014/298473
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1. Total bibliographical data in MEDLINE since 1995 and the two stages of biomedical relation extraction research.
Example of a sentence and its corresponding PPIs.
| Sentence | Leukotriene B4 stimulates c-fos and c-jun gene transcription and AP-1 binding activity in human monocytes. |
|---|---|
| PPIs | Stimulate (leukotriene B4, c-fos) |
| | Stimulate (leukotriene B4, c-jun) |
| | Stimulate (leukotriene B4, AP-1) |
Example of a sentence and the relations it contains.
| Sentence | The binding of I kappa B/MAD-3 to NF-kappa B p65 is sufficient to retarget NF-kappa B p65 from the nucleus to the cytoplasm. |
|---|---|
| Relation_1 | Binding (I kappa B/MAD-3, NF-kappa B p65) |
| Relation_2 | Localization (NF-kappa B p65, the nucleus, and the cytoplasm) |
| Relation_3 | Positive regulation (relation_1, relation_2) |
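
The two example tables above can be encoded with a single recursive data structure. The following is a minimal sketch, not taken from the survey: the `Relation` class and its `is_complex` and `depth` helpers are hypothetical names introduced only for illustration. It shows how a relation can take entity strings or other relations as arguments.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    """A typed relation whose arguments are entity strings or other relations."""
    rel_type: str
    args: Tuple[Union[str, "Relation"], ...]

    def is_complex(self) -> bool:
        # Complex here means: more than two arguments, or a relation as an argument.
        return len(self.args) > 2 or any(isinstance(a, Relation) for a in self.args)

    def depth(self) -> int:
        # Nesting depth: 1 for a flat relation, plus 1 per level of nested relations.
        nested = (a.depth() for a in self.args if isinstance(a, Relation))
        return 1 + max(nested, default=0)


# The three relations from the example sentence (hypothetical encoding).
binding = Relation("Binding", ("I kappa B/MAD-3", "NF-kappa B p65"))
localization = Relation("Localization", ("NF-kappa B p65", "the nucleus", "the cytoplasm"))
regulation = Relation("Positive regulation", (binding, localization))
```

Under this encoding a PPI such as Stimulate (leukotriene B4, c-fos) is a flat two-argument relation, while the positive regulation example nests two relations one level deep.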
Figure 2. The general framework of a relation extraction system.
Figure 3. General procedure of a PPI extraction system employing different methodologies.
Features employed in [30].
| Feature type | Features in detail |
|---|---|
| Word level | Lemma form of a word; relative position to the pair of proteins (before, middle, after); frequency in the sentence |
| Shortest path level | Vertex walks in the shortest path; edge walks in the shortest path; subsets of walks on the target pair in a parse structure |
| Graph level | Graph matrices based on a parse structure subgraph and linear order subgraph from the dependency parsers. The graph features are all the nonzero elements in the graph matrices |
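
The word-level row of the table can be illustrated with a short sketch. This is not the implementation used in [30]: the function name `word_level_features` is hypothetical, lowercasing stands in for a real lemmatizer, and each protein mention is assumed to be already merged into a single token.

```python
from collections import Counter
from typing import Dict, List


def word_level_features(tokens: List[str], p1: int, p2: int) -> List[Dict[str, object]]:
    """Word-level features for the candidate protein pair at token indices p1 < p2."""
    # Assumption: lowercasing stands in for a real lemmatizer.
    lemmas = [t.lower() for t in tokens]
    counts = Counter(lemmas)
    features = []
    for i, lemma in enumerate(lemmas):
        if i in (p1, p2):
            continue  # skip the two protein mentions themselves
        if i < p1:
            position = "before"
        elif i < p2:
            position = "middle"
        else:
            position = "after"
        features.append({"lemma": lemma, "position": position, "frequency": counts[lemma]})
    return features


# Assumed tokenization of part of the example PPI sentence.
tokens = ["Leukotriene_B4", "stimulates", "c-fos", "and", "c-jun", "gene", "transcription"]
feats = word_level_features(tokens, p1=0, p2=2)
```

For the pair (Leukotriene_B4, c-fos), the only word in the "middle" position is "stimulates", which is the kind of cue such features are meant to expose to a classifier.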
Available annotated corpora for binary relation extraction in the biomedical domain.
| Corpus name | General description | URL |
|---|---|---|
| GENIA | 2,000 MEDLINE abstracts with more than 400,000 words and almost 100,000 annotations for biological terms. | |
| LLL05 | 80 sentences in the training set including 106 examples of genic interactions without coreferences and 165 examples of interactions with coreferences. | |
| BioCreAtIvE II | Training data is derived from the content of the IntAct and MINT databases. The test set consists of a collection of PubMed article abstracts. | |
| AIMed | 225 MEDLINE abstracts (200 abstracts describing interactions between human proteins and around 1,000 tagged interactions). | |
| BioInfer | 1,100 sentences annotated with protein names, their relationships, and PPI annotations. | |
| HPRD50 | 50 abstracts referenced by the Human Protein Reference Database including 266 relation instances. | |
Performance of existing PPI extraction methods on the data corpora used.
| Category | Recall (%) | Precision (%) | Corpus | References |
|---|---|---|---|---|
| Rule-based | 86.8 | 94.3 | 834 and 752 sentences obtained by a MEDLINE search using these keywords, “protein binding,” “yeast,” “ | [ |
| | 60 | 87 | 550 sentences were retained containing at least one of four keywords “interact,” “bind,” “associate,” “complex,” or one of their inflections from 3343 abstracts retrieved from MEDLINE with the following keywords: “ | [ |
| | 80.0 | 80.5 | About 1200 sentences were kept from the top 50 biomedical papers retrieved from the Internet by querying using the keyword “protein-protein interaction.” | [ |
| ML methods | 57 | 90 | Training set consists of 500 abstracts from MEDLINE. Evaluation set consists of 56 abstracts collected using search strings “protein” and “inhibit.” | [ |
| | 21 | 91 | 3.4 million sentences from approximately 3.5 million MEDLINE abstracts dated after 1988 containing at least one notation of a human protein. | [ |
| | 71.9 | 60 | AIMed | [ |
| | 87.2 | 72.5 | LLL | [ |
| | 76 | 70 | The test corpus consists of 300 randomly selected sentences. | [ |
| | 70.7 | 70.3 | LLL | [ |
| | 71.9 | 60 | AIMed | [ |
| | 59.26 | 63.37 | LLL | [ |
| | 89 | 73 | LLL | [ |
Figure 4. An example of identifying trigger words based on the predefined pattern.
Figure 5. Event extraction rules employed in [58].
Figure 6. An example of a sentence with target event edge to be extracted.
Features sets and classifiers employed in machine learning-based approaches for event extraction.
| References | Feature sets | Classifier |
|---|---|---|
| [ | (1) N-grams (merging the attributes of 2 to 4 consecutive tokens); (2) individual component features for each token and edge in a path; (3) semantic node features (the attributes of the two terminal event/entity nodes of the potential event argument edge); (4) frequency features (the length of the shortest path and the number of named entities and event nodes, per type, in the sentence) | Multiclass SVM |
| [ | (1) Trigger type; (2) argument terms; (3) argument type; (4) argument supertype; (5) trigger and argument; (6) trigger and argument POS; (7) parse tree path; (8) voice of sentence (active or passive); (9) trigger and argument partial paths; (10) trigger subcategorization | SVM |
| [ | (1) Words and POS in a window around the trigger; (2) distances between the trigger and the two nearest annotated proteins (left and right) and the theme candidate | C4.5 decision tree |
| [ | (1) Three stemmed consecutive words from the subsentence spanning the event; (2) lexical and syntactic information of triggers; (3) size of the subgraph; (4) bag of words; (5) length of the subsentence; (6) extra features for regulation events; (7) vertex walks which consist of two vertices and their connecting edge | SVM |
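
The N-gram feature in the first row (merging 2 to 4 consecutive tokens) can be sketched as follows. This is an illustrative simplification, not the cited system's code: the function name `token_ngrams` is hypothetical, and plain token strings are used here in place of the token attributes along a dependency path.

```python
from typing import List


def token_ngrams(tokens: List[str], n_min: int = 2, n_max: int = 4) -> List[str]:
    """Return all n-grams of n_min..n_max consecutive tokens, joined with underscores."""
    ngrams = []
    for n in range(n_min, n_max + 1):
        # When n exceeds the number of tokens the range is empty, so nothing is added.
        for start in range(len(tokens) - n + 1):
            ngrams.append("_".join(tokens[start:start + n]))
    return ngrams


# Assumed token sequence, loosely based on the example sentence.
path_tokens = ["binding", "of", "I-kappa-B", "to", "NF-kappa-B"]
features = token_ngrams(path_tokens)
```

Each resulting string would typically be treated as one sparse binary feature for a classifier such as an SVM.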
Figure 7. The dependency path for the sentence “The binding of I kappa B/MAD-3 to NF-kappa B p65 is sufficient to retarget NF-kappa B p65 from the nucleus to the cytoplasm.”
An example of a sentence and its event representations employed in [85].
| Sentence | |
|---|---|
| Event representations | |
Examples of the three subtasks of the BioNLP'09 shared task.
| Subtask | Sentence | Events |
|---|---|---|
| Core event extraction | | |
| Event enrichment | We demonstrate the | |
| Negation and speculation recognition | This | |
Performance of the biomedical event extraction systems not participating in the BioNLP shared tasks.
| Category | Recall (%) | Precision (%) | F-score (%) | Corpus | References |
|---|---|---|---|---|---|
| Nonpipeline | NA | NA | 56.0 | BioNLP'11 | [ |
| | NA | NA | | BioNLP'11 | [ |
| Rule-based | 38.01 | 52.06 | 43.94 | BioNLP'11 | [ |
| | 33.66 | 41.77 | 37.28 | BioNLP'09 | [ |
| | 10.12 | 27.17 | 14.75 | BioNLP'11 | [ |
| Machine learning | 51.25 | 64.92 | 57.28 | BioNLP'11 | [ |
| | NA | NA | | BioNLP'09 | [ |
| | NA | NA | 53.30 | BioNLP'11 | [ |
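
In the rows above that report recall, precision, and a third value, the third value is consistent with the balanced F-score, that is, the harmonic mean of precision and recall. The sketch below recomputes it for those rows; the function name `f_score` is a hypothetical helper, not something defined in the survey.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall), in the units of the inputs."""
    if recall + precision == 0:
        return 0.0  # avoid division by zero when both values are zero
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %, reported third value) for the rows that list all three numbers.
rows = [
    (38.01, 52.06, 43.94),
    (33.66, 41.77, 37.28),
    (10.12, 27.17, 14.75),
    (51.25, 64.92, 57.28),
]
for recall, precision, reported in rows:
    print(f"R={recall} P={precision} computed={f_score(recall, precision):.2f} reported={reported}")
```

Rows marked NA for recall and precision cannot be checked this way, since only the combined score was reported for them.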