Ke Ravikumar, Haibin Liu, Judith D Cohn, Michael E Wall, Karin Verspoor.
Abstract
BACKGROUND: We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature. The method searches text for mentions of amino acids at specific sequence positions and attempts to associate each mention with the correct protein also named in the text. The methods presented in this work will enable improved extraction of protein functional sites from articles, ultimately supporting protein function prediction. Our method uses linguistic patterns to identify amino acid residue mentions in text. Further, we apply an automated graph-based method to learn syntactic patterns corresponding to protein-residue pairs mentioned in the text. Finally, we present an approach to the automated construction of relevant training and test data using the distant supervision model.
Year: 2012 PMID: 23046792 PMCID: PMC3465209 DOI: 10.1186/2041-1480-3-S3-S2
Source DB: PubMed Journal: J Biomed Semantics
Evaluation of performance of residue and mutation extraction on the Nagel corpus (original annotations)
| Evaluation Scheme | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|
| System 1: -SLAA, +SLM | 88.92 | 98.09 | 93.28 |
| System 2: +SLAA, -SLM | 71.86 | 79.01 | 75.27 |
| System 3: -SLAA, -SLM | 94.78 | 76.33 | 84.56 |
| System 4: +SLAA, +SLM | 74.42 | 98.85 | 84.91 |
| Nagel et al.'s reported numbers | 92.00 | 98.00 | 95.00 |
SLAA - Single letter Amino acid patterns; SLM - Single Letter Mutation patterns; + for inclusion of patterns; - for exclusion of patterns
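The F-Measure column in the table above can be recomputed from the precision and recall columns. The following is a minimal sketch, assuming the balanced F-measure (harmonic mean of precision and recall); the function name `f_measure` and the `rows` dictionary are illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean) of precision and recall, both in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "System 1": (88.92, 98.09),
    "System 2": (71.86, 79.01),
    "System 3": (94.78, 76.33),
    "System 4": (74.42, 98.85),
}

for name, (p, r) in rows.items():
    print(f"{name}: F = {f_measure(p, r):.2f}")
```

Recomputing the column this way is a quick consistency check when comparing the four pattern configurations.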
Evaluation of performance of residue and mutation extraction on the Nagel corpus (Modified per our annotation guidelines)
| Evaluation Scheme | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|
| System 1: -SLAA, +SLM | 91.64 | 98.50 | 94.95 |
| System 2: +SLAA, -SLM | 74.38 | 79.17 | 76.70 |
| System 3: -SLAA, -SLM | 96.23 | 77.27 | 85.71 |
| System 4: +SLAA, +SLM | 77.41 | 98.88 | 86.84 |
SLAA - Single letter Amino acid patterns; SLM - Single Letter Mutation patterns; + for inclusion of patterns; - for exclusion of patterns
Table 3
a. Evaluation of performance of mutation extraction on MutationFinder corpus
| Data set | System | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|---|
| Development | Our system | 96.82 | 82.91 | 89.32 |
| Test | Our system | 95.61 | 81.59 | 88.04 |
| Test | MutationFinder | 98.40 | 81.90 | 89.40 |

b. Evaluation of performance of residue and mutation extraction on LEAP-FS corpus
| Corpus | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|
| LEAP-FS | 85.23 | 87.93 | 86.56 |
Evaluation of subgraph matching and co-occurrence baseline approach for protein-residue relation extraction on silver corpus.
| Corpus | Method | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|---|
| Development | E+P+A | 80.26 | 77.05 | 78.62 |
| Development | E+P*+A* | 79.10 | 78.10 | 78.60 |
| Development | E+P+A+Rule ranking | 81.20 | 76.42 | 78.74 |
| Development | E+P*+A*+Rule ranking | 79.35 | 77.68 | 78.51 |
| Development | Sentence co-occurrence baseline | 59.45 | 100 | 75.28 |
| Test | E+P+A | 84.07 | 79.43 | 81.69 |
| Test | E+P*+A* | 82.72 | 80.10 | 81.39 |
| Test | E+P+A+Rule ranking | 86.83 | 78.26 | 82.32 |
| Test | E+P*+A*+Rule ranking | 83.60 | 78.43 | 80.93 |
| Test | Sentence co-occurrence baseline | 62.42 | 100 | 76.86 |
| Test | Approximate subgraph matching (ASM) with distance threshold 0.6 | 81.96 | 86.62 | 84.22 |
E+P+A - Match edge labels, Parts of speech, All tokens; E+P+A* - Match only Edge labels and Parts of speech.
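The sentence co-occurrence baseline in the table above can be illustrated with a short sketch. It assumes the baseline simply pairs every protein with every residue mentioned in the same sentence; the data layout, function names, and toy sentences below are hypothetical and are not drawn from the silver corpus.

```python
from itertools import product


def cooccurrence_pairs(sentences):
    """Pair every protein with every residue mentioned in the same sentence."""
    pairs = set()
    for proteins, residues in sentences:
        pairs.update(product(proteins, residues))
    return pairs


def precision_recall(predicted, gold):
    """Return (precision, recall) in percent for two sets of (protein, residue) pairs."""
    true_positives = len(predicted & gold)
    precision = 100 * true_positives / len(predicted) if predicted else 0.0
    recall = 100 * true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy, made-up input: each sentence is (protein mentions, residue mentions).
sentences = [
    (["trypsin"], ["Ser195", "His57"]),
    (["lysozyme", "trypsin"], ["Glu35"]),
]
gold = {("trypsin", "Ser195"), ("trypsin", "His57"), ("lysozyme", "Glu35")}

predicted = cooccurrence_pairs(sentences)
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.1f} recall={recall:.1f}")
```

Under this assumption the baseline recovers every relation whose protein and residue share a sentence, at the cost of spurious pairs. That pattern is consistent with the 100% recall and lower precision reported for the baseline rows.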
Figure 1. Effect of the distance threshold in approximate subgraph matching on the performance of protein-residue relation extraction on the test portion of the silver corpus.
Evaluation of protein residue relation extraction on Nagel corpus
| Method | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|
| Abstract level co-occurrence | 57.10 | 100 | 72.69 |
| Sentence level co-occurrence | 63.50 | 84.77 | 72.61 |
| Subgraph matching | 85.09 | 69.54 | 76.54 |
| Subgraph matching | 90.38 | 71.57 | 79.89 |
Protein residue relation statistics of silver corpus
| Parameter | Number |
|---|---|
| Total number of abstracts | 18,045 |
| Total number of sentences | 138,790 |
| Total sentences with protein names | 41,722 |
| Total sentences with at least one amino acid or mutation | 13,729 |
| Sentences with co-mentions of protein-amino acid (or) mutation | 5,256 |
| Sentences with validated protein-residue relations | 2,516 |
| Physically validated protein-residue relations | 2,814 |
| Total abstracts with validated protein-residue relation | 1,728 |
Figure 2. System architecture.
Pattern definitions and regular expressions to detect amino acid residues and mutations in the text
| Pattern name | Pattern Meaning | Expressions |
|---|---|---|
| RES-S | Single letter amino acid code | [ARNDCQEGHILKMFPSTWYVOUBZX] |
| RES-T | Three letter amino acid code | ([aA]la\|ALA\|[aA]rg\|ARG\|[aA]sn\|ASN\|[aA]sp\|ASP\|[cC]ys\|CYS\|[gG]ln\|GLN\|[gG]lu\|GLU\|[gG]ly\|GLY\|[hH]is\|HIS\|[iI]le\|ILE\|[lL]eu\|LEU\|[lL]ys\|LYS\|[mM]et\|MET\|[pP]he\|PHE\|[pP]ro\|PRO\|[sS]er\|SER\|[tT]hr\|THR\|[tT]rp\|TRP\|[tT]yr\|TYR\|[vV]al\|VAL\|[pP]yl\|PYL\|[sS]ec\|SEC) |
| RES-F | Full amino acid names | ([aA]lanine\|[aA]rginine\|[aA]sparagine\|[aA]spart(ate\|ic acid)\|[cC]ysteine\|[gG]lutamine\|[gG]lutam(ate\|ic acid)\|[gG]lycine\|[hH]istidine\|[iI]soleucine\|[lL]eucine\|[lL]ysine\|[mM]ethionine\|[pP]henylalanine\|[pP]roline\|[sS]erine\|[tT]hreonine\|[tT]ryptophan\|[tT]yrosine\|[vV]aline\|[pP]yrrolysine\|[aA]spartic acid\|[aA]sparagine\|[gG]lutamic acid\|[gG]lutamine) |
| POS | Residue Position | 0[ |
| WTRES | Wild type residue | (RES-S\|RES-T\|RES-F) |
| MUTRES | Mutant residue | (RES-S\|RES-T\|RES-F) |
| UNIARR | Unicode character for arrows | \\u2192,\\u21D2 |
| UNIDASH | Unicode character for dash | \\u2013 |
| GRAMMAR | Grammatical expressions | residues? at positions?\|for\| position\|residues? (in\|on\|at) \|substitutions? at\|always exists as\|at positions?\|mutated to\|substituted by |
| POSCOORD | Co-ordination of residue position | POS(,\\s?POS)* (and\|or) POS |
| AMINOCOORD | Co-ordination of amino acid residues | (RES-T\|RES-F)(,\\s?RES-T\|RES-F)* (and\|or) (RES-T\|RES-F) |
| WORD | ANY WORD | |
| PREP | Prepositions | in, at, on, within, of |
Pattern names are shown in THIS FONT and can be themselves used within other regular expressions.
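To illustrate how named patterns such as RES-T and POS can be combined into larger expressions, here is a minimal Python sketch. It makes several assumptions that are not in the table: the POS expression is truncated in the table, so a plain positive integer stands in for it; the composite patterns `RESIDUE_MENTION` and `MUTATION` and the helper functions are hypothetical examples, not the authors' actual rules.

```python
import re

# Standard three-letter codes enumerated by the RES-T row.
CODES = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
         "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
         "Pyl", "Sec"]

# Build the RES-T alternation in the table's style, e.g. "[aA]la|ALA".
RES_T = "(?:" + "|".join(
    f"[{c[0].lower()}{c[0]}]{c[1:]}|{c.upper()}" for c in CODES
) + ")"

# ASSUMPTION: POS is truncated in the table; a positive integer is used instead.
POS = r"[1-9][0-9]*"

# Hypothetical composite: residue followed by a position, e.g. "Ser195" or "His-57".
RESIDUE_MENTION = re.compile(rf"\b(?P<res>{RES_T})-?(?P<pos>{POS})\b")

# Hypothetical composite: wild-type residue, position, mutant residue, e.g. "Arg117His".
MUTATION = re.compile(rf"\b(?P<wt>{RES_T})(?P<pos>{POS})(?P<mut>{RES_T})\b")


def find_residues(text):
    """Return (residue, position) tuples for residue mentions in the text."""
    return [(m["res"], int(m["pos"])) for m in RESIDUE_MENTION.finditer(text)]


def find_mutations(text):
    """Return (wild type, position, mutant) tuples for mutation mentions in the text."""
    return [(m["wt"], int(m["pos"]), m["mut"]) for m in MUTATION.finditer(text)]


print(find_residues("Mutation of Ser195 and His-57 abolished activity."))
print(find_mutations("The Arg117His variant was inactive."))
```

Building the larger expressions by string substitution mirrors the way the table lets pattern names appear inside other regular expressions.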
Figure 3. Physical validation of protein residue relation.
Figure 4. Rule induction and protein-residue relation extraction.