Sam Zaremba, Mila Ramos-Santacruz, Thomas Hampton, Panna Shetty, Joel Fedorko, Jon Whitmore, John M Greene, Nicole T Perna, Jeremy D Glasner, Guy Plunkett, Matthew Shaker, David Pot.
Abstract
BACKGROUND: The Enteropathogen Resource Integration Center (ERIC; http://www.ericbrc.org) aims to provide bioinformatics support for the scientific community researching enteropathogenic bacteria such as Escherichia coli and Salmonella spp. Rapid and accurate identification of experimental conclusions from the scientific literature is critical to support research in this field. Natural Language Processing (NLP), and in particular Information Extraction (IE) technology, can be a significant aid to this process.
DESCRIPTION: We have trained a powerful, state-of-the-art IE technology on a corpus of abstracts from the microbial literature in PubMed to automatically identify and categorize biologically relevant entities and predicative relations. These relations include: Genes/Gene Products and their Roles; Gene Mutations and the resulting Phenotypes; and Organisms and their associated Pathogenicity. Evaluations on blind datasets show an average F-measure of greater than 90% for entities (genes, operons, etc.) and over 70% for relations (gene/gene product to role, etc.). This IE capability, combined with text indexing and relational database technologies, constitutes the core of our recently deployed text mining application.
Year: 2009 PMID: 19515247 PMCID: PMC2704210 DOI: 10.1186/1471-2105-10-177
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Number of extraction rules implemented in ERIC NetOwl®
| Entity type | Rules | Relation type | Rules |
| --- | --- | --- | --- |
| Organism | 5 | Gene or Gene Product Roles | 150 |
| Strain | 18 | Mutation Phenotypes | 42 |
| Enzyme | 5 | Organism Pathogenesis | 9 |
| Gene | 18 | | |
| Gene Product | 20 | | |
| Operon | 31 | | |
Performance scores on entities and relations in Blind set (138 abstracts)
| Organism (1362 entities) | 92.0 | 98.1 | 94.9 | Gene or Gene Product Roles (615 relations) | 62.5 | 82.9 | 70.9 |
| Strain (554 entities) | 81.1 | 82.9 | 81.9 | Mutation Phenotypes (149 relations) | 58.5 | 77.8 | 66.7 |
| Enzyme (386 entities) | 85.7 | 81.6 | 83.5 | Organism Pathogenesis (34 relations) | 68.9 | 83.1 | 75.1 |
| Gene (916 entities) | 93.6 | 93.7 | 93.6 | | | | |
| Gene Product (1425 entities) | 92.3 | 94.8 | 93.5 | | | | |
| Operon (310 entities) | 96.2 | 93.0 | 94.5 | | | | |
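The column labels of the blind-set table did not survive extraction. The abstract reports F-measures, so one plausible reading is that the first two numeric columns are precision and recall (in some order) and the third is the balanced F-measure. The Python sketch below checks that reading under exactly that assumption. The function and variable names are illustrative and are not part of ERIC or NetOwl. Because the harmonic mean is symmetric, the check does not depend on which of the first two columns is precision.

```python
# Rows copied from the blind-set table:
# (label, first score, second score, reported third score).
# ASSUMPTION: the first two scores are precision and recall (order unknown)
# and the third is the balanced F-measure; the extracted table has no headers.
ENTITY_ROWS = [
    ("Organism", 92.0, 98.1, 94.9),
    ("Strain", 81.1, 82.9, 81.9),
    ("Enzyme", 85.7, 81.6, 83.5),
    ("Gene", 93.6, 93.7, 93.6),
    ("Gene Product", 92.3, 94.8, 93.5),
    ("Operon", 96.2, 93.0, 94.5),
]
RELATION_ROWS = [
    ("Gene or Gene Product Roles", 62.5, 82.9, 70.9),
    ("Mutation Phenotypes", 58.5, 77.8, 66.7),
    ("Organism Pathogenesis", 68.9, 83.1, 75.1),
]


def f_measure(a: float, b: float) -> float:
    """Balanced F-measure: the harmonic mean of two scores (symmetric)."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def max_deviation(rows) -> float:
    """Largest gap between the recomputed F-measure and the reported third column."""
    return max(abs(f_measure(a, b) - reported) for _, a, b, reported in rows)


def macro_average(rows) -> float:
    """Unweighted mean of the reported third-column scores."""
    return sum(reported for _, _, _, reported in rows) / len(rows)


if __name__ == "__main__":
    for name, rows in (("entities", ENTITY_ROWS), ("relations", RELATION_ROWS)):
        print(name, "max deviation:", round(max_deviation(rows), 2),
              "macro average:", round(macro_average(rows), 1))
```

Under these assumptions the recomputed values agree with the third column to within about half a point, and the unweighted averages of the third column are consistent with the abstract's figures of greater than 90% for entities and over 70% for relations.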
Figure 1. An overview of the ERIC Literature Text Mining population process.
Figure 2. (Left) The Latest Articles tab lists PubMed abstracts involving enteropathogens published over the previous 7 days. (Right) The Search tab supports queries by keyword(s) and phrases, PMID, date range, and/or journal. The PMID link of a title retrieves the abstract in the ERIC text mining interface.
Figure 3. ERIC text mining interface of a PubMed abstract processed by NetOwl.
Figure 4. Detail of the Relationships Extracted panel on the ERIC text mining interface.
Figure 5. Montage showing the workflow from an extracted gene/gene product in the text mining interface to the ASAP annotations database.
Figure 6. Detailed Feature page in ERIC-ASAP. Community users viewing newly extracted information may alert ERIC via the "Add a note to the curator" button (inset).