Kimberly Van Auken, Petra Fey, Tanya Z Berardini, Robert Dodson, Laurel Cooper, Donghui Li, Juancarlos Chan, Yuling Li, Siddhartha Basu, Hans-Michael Muller, Rex Chisholm, Eva Huala, Paul W Sternberg.
Abstract
WormBase, dictyBase and The Arabidopsis Information Resource (TAIR) are model organism databases containing information about Caenorhabditis elegans and other nematodes, the social amoeba Dictyostelium discoideum and related Dictyostelids and the flowering plant Arabidopsis thaliana, respectively. Each database curates multiple data types from the primary research literature. In this article, we describe the curation workflow at WormBase, with particular emphasis on our use of text-mining tools (BioCreative 2012, Workshop Track II). We then describe the application of a specific component of that workflow, Textpresso for Cellular Component Curation (CCC), to Gene Ontology (GO) curation at dictyBase and TAIR (BioCreative 2012, Workshop Track III). We find that, with organism-specific modifications, Textpresso can be used by dictyBase and TAIR to annotate gene products to GO's Cellular Component (CC) ontology.
Year: 2012 PMID: 23160413 PMCID: PMC3500519 DOI: 10.1093/database/bas040
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. The WormBase literature curation workflow. WormBase literature curation incorporates automated (blue), semi-automated (green) and manual (pink) steps. Potentially curatable papers are initially brought into WormBase primarily via PubMed searches, with additional contributions from authors and from a collaboration between WormBase and the GSA for Genetics and G3: Genes, Genomes, Genetics papers. Following full-text acquisition, a triage step is used to determine what data types are present in papers. The triage step is largely automated but also includes author contributions. Once papers have been flagged for data types, the curators responsible for each data type use manual and semi-automated methods to extract the information and convert it into a machine-readable format.
Figure 2. Pipeline for Textpresso for Cellular Component Curation for dictyBase and TAIR. PDFs of publications included in the dictyBase and TAIR curation corpora and files of Dictyostelium and Arabidopsis gene and protein names and synonyms are uploaded to a Textpresso server at Caltech. PDFs are converted to text; gene and protein names and synonyms are processed to include variants (e.g. upper- and lower-case versions), and organism-specific terms are added to the Textpresso cellular component ontology. Full text is then marked up using the new categories. Four- and five-category searches are performed on the full text and results formatted and stored for use in the curation database. Using a Web-based curation form, curators make annotations that are subsequently stored in the curation database and available for export as annotation files.
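The name-variant and multi-category search steps in this pipeline can be illustrated with a minimal Python sketch. It is not the Textpresso implementation: the category names, the toy lexicons, the gene name `geneA` and the helper functions are all hypothetical, and the sketch assumes that a multi-category search keeps a sentence only when it contains at least one term from every category.

```python
import re

def name_variants(name):
    """Return simple case variants of a gene or protein name (illustrative only)."""
    return {name, name.lower(), name.upper(), name.capitalize()}

def build_lexicon(names):
    """Expand a list of names and synonyms into a set that includes case variants."""
    lexicon = set()
    for name in names:
        lexicon |= name_variants(name)
    return lexicon

def matches(sentence, terms):
    """True if any term occurs in the sentence, not embedded in a longer word."""
    for term in terms:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, sentence):
            return True
    return False

def category_search(sentences, categories):
    """Keep sentences that match at least one term from every category."""
    return [
        s for s in sentences
        if all(matches(s, terms) for terms in categories.values())
    ]

# Hypothetical four-category setup; real category contents are not given here.
categories = {
    "gene_product": build_lexicon(["geneA"]),
    "component": {"nucleus", "mitochondria"},
    "verb": {"localizes", "localized"},
    "assay": {"GFP", "immunofluorescence"},
}

sentences = [
    "GFP-tagged GENEA localizes to the nucleus.",
    "Expression of geneB was detected by immunofluorescence.",
    "geneA is found in the nucleus.",
]

hits = category_search(sentences, categories)
print(hits)
```

With these toy inputs only the first sentence matches all four categories. Adding a fifth category to the `categories` dictionary narrows the result set further, which is one way to read the recall and precision trade-off between four- and five-category searches.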
Figure 3. The Textpresso for CCC Curation Form. A screenshot of the curation form used for the Textpresso for CCC evaluation is shown. Textpresso sentences are displayed in the bottom right corner of the form, with matches to each of the Textpresso categories highlighted and color coded. The title and abstract of the paper are shown at the top. On the bottom left side of the form are three curation boxes containing, from left to right, the identified gene product, the component term from the retrieved sentence and suggested GO terms based on previous curation. To make a GO annotation, the curator makes a selection from each of the boxes (selections are highlighted in gray), selects the curate radio button above the sentence and presses Submit to commit the annotation to the curation database. Additional radio buttons allow curators to further classify sentences, if needed. These additional actions were not part of the current BioCreative evaluation.
Results of Textpresso sentence retrieval for four- and five-category searches for Dictyostelium discoideum and Arabidopsis thaliana literature
| Four-category: Recall | Four-category: Precision | Four-category: F-score | Five-category: Recall | Five-category: Precision | Five-category: F-score |
|---|---|---|---|---|---|
| 0.379 | 0.775 | 0.509 | 0.397 | 0.815 | 0.534 |
| 0.522 | 0.575 | 0.547 | 0.568 | 0.893 | 0.694 |
Results of Textpresso-based cellular component annotation for four- and five-category searches for Dictyostelium discoideum and Arabidopsis thaliana literature
| Four-category: Recall | Four-category: Precision | Four-category: F-score | Five-category: Recall | Five-category: Precision | Five-category: F-score |
|---|---|---|---|---|---|
| 0.371 | 0.783 | 0.503 | 0.322 | 0.750 | 0.450 |
| 0.145 | 0.778 | 0.244 | 0.113 | 0.714 | 0.195 |
| 0.460 | 0.920 | 0.613 | 0.420 | 0.913 | 0.575 |
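In both tables, the third value in each recall/precision group is consistent with the F-score, that is, the harmonic mean of recall and precision. The following Python sketch recomputes it from the reported values; the function name `f_score` and the row layout are illustrative choices, and small differences are expected because the published recall and precision are already rounded to three decimals.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (F1); 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# (recall, precision, reported third value) copied from the two tables above,
# four-category group first, then five-category group, row by row.
reported = [
    (0.379, 0.775, 0.509), (0.397, 0.815, 0.534),
    (0.522, 0.575, 0.547), (0.568, 0.893, 0.694),
    (0.371, 0.783, 0.503), (0.322, 0.750, 0.450),
    (0.145, 0.778, 0.244), (0.113, 0.714, 0.195),
    (0.460, 0.920, 0.613), (0.420, 0.913, 0.575),
]

# Largest gap between the recomputed and the reported value.
max_diff = max(abs(f_score(r, p) - f) for r, p, f in reported)

for r, p, f in reported:
    print(f"recall={r:.3f} precision={p:.3f} reported={f:.3f} recomputed={f_score(r, p):.3f}")
```

Every recomputed value agrees with the reported one to within 0.001, which supports reading the unlabeled third column as the F-score.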