Jan Czarnecki, Irene Nobeli, Adrian M Smith, Adrian J Shepherd.
Abstract
BACKGROUND: Biological text mining research is increasingly focusing on the extraction of complex relationships relevant to the construction and curation of biological networks and pathways. However, one important category of pathway - metabolic pathways - has been largely neglected. Here we present a relatively simple method for extracting metabolic reaction information from free text that scores different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence and location of stemmed keywords. This method extends an approach that has proved effective in the context of the extraction of protein-protein interactions.
Year: 2012 PMID: 22823282 PMCID: PMC3475109 DOI: 10.1186/1471-2105-13-172
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1. Assignments of the entities enzyme (E), substrate (S) and product (P) for a sample sentence
| L-Arabinose isomerase | L-arabinose | L-ribulose | n-arabinose |
|---|---|---|---|
| E | S | P | P |
| E | S | S | P |
| E | P | S | S |
| E | P | P | S |
| E | S | P | |
| E | S | | P |
| E | | S | P |
| E | P | S | |
| E | P | | S |
| E | | P | S |
The ten assignments of E, S and P for the sentence “L-Arabinose isomerase catalyzes the conversion of L-arabinose to L-ribulose, the first step in the utilization of n-arabinose by Escherichia coli B/r”. Given that L-Arabinose isomerase is the only tagged protein, it is deemed to be the enzyme in all cases, whereas different numbers and orderings of substrates and products are possible, given the presence of three tagged small molecules (L-arabinose, L-ribulose and n-arabinose). Note that other potential orderings (namely E-P-S-P and E-S-P-S) are not considered, as they are deemed highly unlikely to occur in practice.
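The enumeration described in this caption can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the function name `role_assignments` is hypothetical, and the filtering rule (at least one substrate, at least one product, and at most one switch between roles when reading left to right) is an assumption inferred from the caption's exclusion of E-P-S-P and E-S-P-S.

```python
from itertools import product

def role_assignments(n_molecules):
    """Enumerate S / P / unassigned labelings for the tagged small molecules.

    Assumed rule (inferred from the caption, not taken from the authors' code):
    keep a labeling only if it has at least one substrate and one product and
    the roles switch at most once when read from left to right.
    """
    results = []
    for labels in product(("S", "P", None), repeat=n_molecules):
        roles = [r for r in labels if r is not None]
        if "S" not in roles or "P" not in roles:
            continue  # a reaction needs at least one substrate and one product
        switches = sum(a != b for a, b in zip(roles, roles[1:]))
        if switches > 1:
            continue  # drops interleaved orderings such as S-P-S and P-S-P
        results.append(labels)
    return results

assignments = role_assignments(3)
for labels in assignments:
    # The single tagged protein is always treated as the enzyme (E).
    print("E", *(r or "-" for r in labels))
print(len(assignments))  # 10
```

With three tagged small molecules, this rule yields the ten assignments listed in the table; the order in which they are printed differs from the table.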
Figure 1. The pantothenate and coenzyme A biosynthesis pathway. A diagram of the pathway obtained using the BioCyc pathway viewer [35].
Figure 2. Graphs showing the performance of OSCAR3 at a range of confidence thresholds. Performance is shown under the following conditions: a) when applied to the SCAI chemical corpus; b) when applied to the GENIA corpus without acronym detection; and c) when applied to the GENIA corpus with acronym detection. The y-axis gives the recall(C), precision and F-score values in the range 0 to 1.
Table 2. The tagging performance of BANNER and OSCAR3
| | BANNER | OSCAR3 |
|---|---|---|
| Recall(C) (%) | 81 (112/139) | 96 (329/343) |
| Precision (%) | 85 (112/132) | 86 (329/384) |
| F-score (%) | 83 | 91 |
| Recall(C) (%) | 93 (250/268) | 82 (528/647) |
| Precision (%) | 76 (250/327) | 95 (528/558) |
| F-score (%) | 84 | 88 |
| Recall(C) (%) | 91 (341/376) | 81 (456/565) |
| Precision (%) | 82 (341/414) | 92 (456/494) |
| F-score (%) | 86 | 86 |
The tagging performance of the NER tools when applied to the Abstracts and Introductions from papers referenced in EcoCyc with respect to our three evaluation pathways. Taking the BANNER column for the pantothenate and coenzyme A biosynthesis pathway as an example, the numbers in brackets indicate that BANNER correctly identified 112 out of the 139 protein names (recall row); and of the 132 names it tagged, 112 were correct (precision row). The OSCAR3 results are with a confidence threshold of zero.
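The bracketed counts translate into percentages as in the following sketch. The function name `tagging_scores` is hypothetical, and the F-score is assumed to be the balanced harmonic mean of precision and recall, which is consistent with the values in the table but is not stated explicitly in this excerpt.

```python
def tagging_scores(true_positives, gold_total, tagged_total):
    """Return (recall, precision, F-score) as fractions between 0 and 1.

    gold_total is the number of names in the gold standard and tagged_total
    the number of names the tool tagged. The F-score is assumed to be the
    balanced harmonic mean of precision and recall.
    """
    recall = true_positives / gold_total
    precision = true_positives / tagged_total
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score

# BANNER on the pantothenate and coenzyme A biosynthesis pathway:
# 112 correct names out of 139 in the gold standard, 132 names tagged.
recall, precision, f_score = tagging_scores(112, 139, 132)
print(round(100 * recall), round(100 * precision), round(100 * f_score))  # 81 85 83
```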
Table 3. The performance of our metabolic reaction extraction method on three evaluation pathways
| | Correct reactions (ignoring enzymes) | Correct reactions (including enzymes) |
|---|---|---|
| Recall(P) (%) | 78 (7/9) | 56 (5/9) |
| Precision (%) | 59 (24/41) | 41 (17/41) |
| Recall(P) (%) | 90 (9/10) | 70 (7/10) |
| Precision (%) | 60 (39/65) | 38 (25/65) |
| Recall(P) (%) | 29 (2/7) | 29 (2/7) |
| Precision (%) | 30 (11/37) | 14 (5/37) |
Taking the “correct reactions (ignoring enzymes)” column for the pantothenate and coenzyme A biosynthesis pathway as an example, the numbers in brackets indicate that our algorithm correctly identified 7 out of the 9 reactions in the curated EcoCyc pathway (recall row), giving 78%; and of the 41 identified interactions (precision row), 24 were valid reactions (irrespective of whether they belong to the pathway or not), giving 59%. A reaction for which the substrate(s) and product(s) have been correctly assigned, but not the enzyme, is deemed correct in column two, but incorrect in column three.
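The difference between the two columns can be illustrated with a small comparison function. This is a sketch under stated assumptions: the `Reaction` type, the `matches` function and the placeholder enzyme name are hypothetical, and exact string equality stands in for whatever name normalisation the real evaluation used.

```python
from typing import FrozenSet, NamedTuple, Optional

class Reaction(NamedTuple):
    # Hypothetical container; the source does not specify a data structure.
    enzyme: Optional[str]
    substrates: FrozenSet[str]
    products: FrozenSet[str]

def matches(predicted, curated, ignore_enzyme):
    """Compare a predicted reaction with a curated one.

    With ignore_enzyme=True only substrates and products must agree (column
    two); with ignore_enzyme=False the enzyme must agree too (column three).
    """
    same_molecules = (predicted.substrates == curated.substrates
                      and predicted.products == curated.products)
    if ignore_enzyme:
        return same_molecules
    return same_molecules and predicted.enzyme == curated.enzyme

curated = Reaction("L-arabinose isomerase",
                   frozenset({"L-arabinose"}), frozenset({"L-ribulose"}))
# "enzyme X" is a made-up placeholder for a wrongly assigned enzyme.
predicted = Reaction("enzyme X",
                     frozenset({"L-arabinose"}), frozenset({"L-ribulose"}))

print(matches(predicted, curated, ignore_enzyme=True))   # True
print(matches(predicted, curated, ignore_enzyme=False))  # False
```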
Table 4. Binary interaction extraction for all three evaluation pathways
| | Substrate–product | Substrate–enzyme | Product–enzyme | Total |
|---|---|---|---|---|
| Recall(P) (%) | 67 (10/15) | 58 (7/12) | 55 (6/11) | 61 (23/38) |
| Precision (%) | 59 (35/59) | 65 (13/20) | 59 (13/22) | 60 (61/101) |
| Recall(P) (%) | 82 (9/11) | 64 (7/11) | 70 (7/10) | 78 (25/32) |
| Precision (%) | 48 (55/114) | 62 (28/45) | 58 (26/45) | 53 (109/204) |
| Recall(P) (%) | 20 (2/10) | 38 (3/8) | 38 (3/8) | 31 (8/26) |
| Precision (%) | 40 (12/30) | 80 (8/10) | 67 (6/9) | 53 (26/49) |
| Recall(P) (%) | 63 (737/1167) | 37 (166/439) | 36 (157/439) | 52 (1060/2045) |
| Precision (%) | 88 (749/856) | 72 (169/235) | 67 (157/234) | 81 (1075/1325) |
Numbers in brackets were calculated as for Table 3.
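One way to break a reaction into the three kinds of binary interaction counted in this table is sketched below. The function name `binary_interactions` and the pairing rule (every substrate with every product, and every substrate and product with the enzyme) are assumptions made for illustration; the source does not spell out the exact decomposition.

```python
def binary_interactions(enzyme, substrates, products):
    """Decompose one reaction into labelled binary interactions.

    Assumed rule: pair every substrate with every product, and pair each
    substrate and each product with the enzyme when one is assigned.
    """
    pairs = set()
    for substrate in substrates:
        for prod in products:
            pairs.add(("substrate-product", substrate, prod))
        if enzyme is not None:
            pairs.add(("substrate-enzyme", substrate, enzyme))
    for prod in products:
        if enzyme is not None:
            pairs.add(("product-enzyme", prod, enzyme))
    return pairs

pairs = binary_interactions("L-arabinose isomerase",
                            ["L-arabinose"], ["L-ribulose"])
for pair in sorted(pairs):
    print(pair)
```

For a reaction with one enzyme, one substrate and one product, this gives exactly one interaction of each type.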
Figure 3. A network showing the reactions predicted from the eight source papers for the pantothenate and coenzyme A biosynthesis pathway. Squares are small molecules, circles are enzymes, and a pair of arrows is used to denote a single reaction (the first for the interaction substrate-enzyme, and the second for the interaction enzyme-product). Items labeled green are correct; items labeled red are incorrect. The number next to a reaction indicates the number of times that reaction was extracted from the set of source texts. The reactions on the right-hand side of the figure (lying outside the blue rectangle) are reactions extracted by our algorithm that are not part of the manually-annotated pantothenate and coenzyme A biosynthesis pathway from EcoCyc given in Figure 1.
Table 5. Performance comparison of gene/protein extraction tools with our metabolic reaction extraction method
| Method | Interaction type | Precision (%) | Recall (%) |
|---|---|---|---|
| RelEx | Protein-protein | 39-80 | 45-72 |
| Baseline(k) | Protein-protein | 23-54 | 52-67 |
| AkanePPI (trained on BioInfer) | Protein-protein | 29-77 | 40-56 |
| Method described in this paper | Substrate-product | 40-88 | 20-82 |
| Method described in this paper | Substrate-enzyme | 62-80 | 37-64 |
| Method described in this paper | Product-enzyme | 58-67 | 36-70 |
The ranges of scores for the gene/protein extraction tools are for five corpora as evaluated in [20]. The scores for our metabolic reaction extraction method summarize those in Table 4, i.e. they are broken down into the same three binary interactions, and each range spans the three evaluation corpora and the Reactome dataset.
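The ranges reported for our method can be reproduced from the Table 4 percentages as a quick check. The dictionary layout and the helper name `score_range` are hypothetical; the numbers are copied from Table 4 in the order of the three evaluation pathways followed by the Reactome dataset.

```python
# Percentages copied from Table 4: three evaluation pathways, then Reactome.
table4_percent = {
    ("Substrate-product", "Precision"): [59, 48, 40, 88],
    ("Substrate-product", "Recall"): [67, 82, 20, 63],
    ("Substrate-enzyme", "Precision"): [65, 62, 80, 72],
    ("Substrate-enzyme", "Recall"): [58, 64, 38, 37],
    ("Product-enzyme", "Precision"): [59, 58, 67, 67],
    ("Product-enzyme", "Recall"): [55, 70, 38, 36],
}

def score_range(values):
    """Format the lowest and highest score as 'min-max'."""
    return f"{min(values)}-{max(values)}"

for (interaction, metric), values in table4_percent.items():
    print(interaction, metric, score_range(values))
```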