Sampo Pyysalo, Antti Airola, Juho Heimonen, Jari Björne, Filip Ginter, Tapio Salakoski.
Abstract
BACKGROUND: Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation, and consequently resources are largely incompatible and methods are difficult to evaluate.
Year: 2008 PMID: 18426551 PMCID: PMC2349296 DOI: 10.1186/1471-2105-9-S3-S6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Corpora
| | AIMed | BioInfer | HPRD50 | IEPA | LLL |
| Size | 1955 | 1100 | 145 | 486 | 77 |
| Entity scope | human P/G | P/G/R and related | human P/G | Chemicals | P/G |
| Entity coverage | all occurrences | all occurrences | NER system | list of 16 names | list of 116 names |
| Entity types | no | 111 types (ontology) | no | no | P/G |
| PPI types | no | 68 types (ontology) | no | no | 3 types |
| PPI binding | no | yes | no | yes | no |
| PPI directed | no | yes | no | yes | yes |
| PPI complex | no | yes | no | no | no |
| PPI negative | no | yes | no | no | no |
| PPI certainty | no | no | yes | no | no |
Legend:
Size: Number of sentences in the corpus
Entity scope: Types of the named entities identified in the corpus: (P)rotein, (G)ene, (R)NA
Entity coverage: Coverage of in-scope entity occurrences in each sentence
Entity types: Explicit identification of the type of the annotated named entity occurrences
PPI types: Explicit indication of the type of the annotated interactions
PPI binding: Identification of the specific text spans that entail the annotated interactions
PPI directed: Specification of the directionality of the interaction (typically identification of agent vs. patient roles)
PPI complex: Annotation includes nested or n-ary (for n > 2) interactions
PPI negative: Annotation of negative interactions
PPI certainty: Annotation of the levels of certainty, or speculativeness, of interactions
PPI extraction performance
| Corpus | Co-occurrence P | Co-occurrence R | Co-occurrence F | RelEx P | RelEx R | RelEx F |
| AIMed | 0.17 | 0.95 | 0.29 | 0.40 | 0.50 | 0.44 |
| BioInfer | 0.13 | 0.99 | 0.23 | 0.39 | 0.45 | 0.41 |
| HPRD50 | 0.38 | 1.0 | 0.55 | 0.76 | 0.64 | 0.69 |
| IEPA | 0.41 | 1.0 | 0.58 | 0.74 | 0.61 | 0.67 |
| LLL | 0.50 | 1.0 | 0.66 | 0.82 | 0.72 | 0.77 |
(P)recision, (R)ecall, and (F)-score for the co-occurrence and RelEx methods on the various corpora.
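As a rough illustration of how the three scores relate, the sketch below builds a co-occurrence baseline that proposes every pair of entities in a sentence as interacting and scores it against a set of annotated pairs. The function names and the toy sentence are assumptions made for illustration; this is not the evaluation code used in the study.

```python
from itertools import combinations


def cooccurrence_pairs(entities):
    """Propose every unordered pair of entities in one sentence as interacting."""
    return {frozenset(pair) for pair in combinations(entities, 2)}


def precision_recall_f(predicted, gold):
    """Return (precision, recall, F-score) for sets of predicted and gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical sentence with four entities and two annotated interactions.
entities = ["A", "B", "C", "D"]
gold = {frozenset(("A", "B")), frozenset(("C", "D"))}

predicted = cooccurrence_pairs(entities)
p, r, f = precision_recall_f(predicted, gold)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

In this toy case the baseline proposes six pairs, two of which are annotated, so it reaches full recall at low precision, the same pattern seen in the co-occurrence columns of the table.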
Figure 1. RelEx and co-occurrence F-scores for the five corpora.
Corpus statistics
| Corpus | Avg. tokens per sentence | Avg. entities per sentence | Avg. entity pairs per sentence | Avg. interactions per sentence | Sentences with no entities | Sentences with no interactions |
| AIMed | 25.2 | 2.2 | 3.0 | 0.5 | 18% | 69% |
| BioInfer | 31.3 | 4.2 | 9.4 | 1.3 | ~0% | 48% |
| HPRD50 | 26.1 | 2.8 | 3.0 | 1.1 | 0% | 38% |
| IEPA | 32.2 | 2.3 | 1.7 | 0.7 | 0% | 37% |
| LLL | 29.6 | 3.1 | 4.3 | 2.1 | 0% | 0% |
PPI extraction performance on filtered corpora
| Corpus | Co-occurrence P | Co-occurrence ΔP | Co-occurrence F | Co-occurrence ΔF | RelEx P | RelEx ΔP | RelEx F | RelEx ΔF |
| AIMed | 0.53 | 0.36 | 0.68 | 0.39 | 0.85 | 0.45 | 0.63 | 0.19 |
| BioInfer | 0.53 | 0.40 | 0.70 | 0.47 | 0.78 | 0.39 | 0.57 | 0.16 |
| HPRD50 | 0.64 | 0.26 | 0.78 | 0.23 | 0.93 | 0.17 | 0.76 | 0.07 |
| IEPA | 0.88 | 0.47 | 0.94 | 0.36 | 1.00 | 0.26 | 0.75 | 0.08 |
| LLL | 0.50 | 0.00 | 0.66 | 0.00 | 0.82 | 0.00 | 0.77 | 0.00 |
Precision and F-score for the co-occurrence and RelEx methods on the corpora with only those entities preserved that participate in an interaction. Recall is not shown, as it is unaffected by this modification. The Δ columns show the absolute difference from the results without filtering.
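Because recall is unchanged by filtering, the filtered F-scores can be approximately recomputed from the filtered precision in this table and the recall in the unfiltered performance table. The sketch below does this as a consistency check; the `f_score` helper and the row layout are assumptions for illustration, and the numbers are copied from the two tables. Since the published precision and recall are already rounded to two decimals, the recomputed values can differ from the reported ones by about 0.01.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (corpus, method, filtered precision, unfiltered recall, reported filtered F)
rows = [
    ("AIMed", "Co-occurrence", 0.53, 0.95, 0.68),
    ("BioInfer", "Co-occurrence", 0.53, 0.99, 0.70),
    ("HPRD50", "Co-occurrence", 0.64, 1.00, 0.78),
    ("IEPA", "Co-occurrence", 0.88, 1.00, 0.94),
    ("LLL", "Co-occurrence", 0.50, 1.00, 0.66),
    ("AIMed", "RelEx", 0.85, 0.50, 0.63),
    ("BioInfer", "RelEx", 0.78, 0.45, 0.57),
    ("HPRD50", "RelEx", 0.93, 0.64, 0.76),
    ("IEPA", "RelEx", 1.00, 0.61, 0.75),
    ("LLL", "RelEx", 0.82, 0.72, 0.77),
]

differences = []
for corpus, method, precision, recall, reported in rows:
    recomputed = f_score(precision, recall)
    differences.append(abs(recomputed - reported))
    print(f"{corpus:9s}{method:14s} recomputed F={recomputed:.3f} reported F={reported:.2f}")
```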
Figure 2. RelEx and co-occurrence F-scores for the filtered version of the corpora. Unfiltered results given in Figure 1 are shown in gray for reference.
Figure 3. Distribution of interaction types in the corpora as mapped to the BioInfer ontology. In cases where the reason an interaction was annotated could not be identified, the supplemental Out of Ontology type was assigned. Empty cells represent a zero count.
Qualitative analysis results
| Corpus | Explicit | Direct | Definite | Positive |
| AIMed | 52% | 72% | 92% | 98% |
| BioInfer | 67% | 45% | 93% | 100% |
| HPRD50 | 75% | 53% | 81% | 92% |
| IEPA | 86% | 6% | 92% | 100% |
| LLL | 73% | 24% | 94% | 100% |
Results of the analysis of 50 interactions per corpus with respect to their directness, explicitness, certainty, and polarity. Results are given as the fraction of analysed interactions that were not out-of-ontology (see Figure 3) and were identified as having a given property by two annotators. Cases where the annotators disagreed were counted as half a point.
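The half-point scoring described above can be sketched as follows. The function name, the input format (one pair of boolean judgments per in-ontology interaction), and the example data are hypothetical and only illustrate the arithmetic.

```python
def property_fraction(judgments):
    """Fraction of interactions judged to have a property by two annotators.

    judgments: list of (annotator_a, annotator_b) booleans, one pair per
    analysed interaction that was not out-of-ontology.
    Agreement on True counts 1 point, disagreement 0.5, agreement on False 0.
    """
    if not judgments:
        return 0.0
    points = 0.0
    for a, b in judgments:
        if a and b:
            points += 1.0
        elif a != b:
            points += 0.5
    return points / len(judgments)


# Hypothetical judgments for four interactions.
example = [(True, True), (True, False), (False, False), (True, True)]
print(f"{property_fraction(example):.1%}")
```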
Figure 4. RelEx interaction extraction rules. The three RelEx rules search the dependency graph for paths between protein names indicating an interaction. Found paths are filtered according to criteria such as whether certain interaction terms appear in them. In the examples, the protein names are marked in bold and the possible interaction words in italics. Rule 1 extracts protein1-relation-protein2 structures from the dependency graph. In the basic case, the rule requires that the path include a subject dependency (e.g. nsubj). Rule 2 extracts relations that are represented by a sequence of noun phrases connected to each other via prepositional dependencies (prep). Rule 3 searches for structures of the type relation-between-protein1-and-protein2 connected through prep_between and conj_and dependencies.
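To make the shape of Rule 3 concrete, the following is a much simplified sketch that looks for a relation word attached to one protein via prep_between, with that protein attached to a second protein via conj_and. The triple representation of dependencies, the function name, and the example words are assumptions for illustration only; this is not the RelEx implementation, which works on parser output and applies further filtering.

```python
def rule3_candidates(dependencies, proteins, interaction_words):
    """Find relation-between-protein1-and-protein2 structures.

    dependencies: list of (label, head, dependent) triples (hypothetical format).
    Returns a list of (relation_word, protein1, protein2) tuples.
    """
    found = []
    for label, head, dependent in dependencies:
        # Step 1: relation word --prep_between--> protein1
        if label != "prep_between":
            continue
        if head not in interaction_words or dependent not in proteins:
            continue
        # Step 2: protein1 --conj_and--> protein2
        for label2, head2, dependent2 in dependencies:
            if label2 == "conj_and" and head2 == dependent and dependent2 in proteins:
                found.append((head, dependent, dependent2))
    return found


# Toy dependencies for a phrase like "interaction between ProtA and ProtB occurs".
toy_dependencies = [
    ("nsubj", "occurs", "interaction"),
    ("prep_between", "interaction", "ProtA"),
    ("conj_and", "ProtA", "ProtB"),
]
matches = rule3_candidates(
    toy_dependencies,
    proteins={"ProtA", "ProtB"},
    interaction_words={"interaction"},
)
print(matches)
```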