Zhiyong Lu, Lynette Hirschman.
Abstract
Manual curation of data from the biomedical literature is a rate-limiting factor for many expert-curated databases. Despite continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert-curated databases. To close this gap and better understand all aspects of literature curation, we invited submissions of written descriptions of curation workflows from expert-curated databases for the BioCreative 2012 Workshop Track II. We received seven qualified contributions, primarily from model organism databases. Based on these descriptions, we identified commonalities and differences across the workflows, the common ontologies and controlled vocabularies used, and the current and desired uses of text mining for biocuration. Compared with a survey done in 2009, our 2012 results show that many more databases are now using text mining in parts of their curation workflows. In addition, the workshop participants identified text-mining aids for finding gene names and symbols (gene indexing), prioritization of documents for curation (document triage) and ontology concept assignment as those most desired by the biocurators. Database URL: http://www.biocreative.org/tasks/bc-workshop-2012/workflow/
Year: 2012 PMID: 23160416 PMCID: PMC3500522 DOI: 10.1093/database/bas043
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Outline of issues for describing the curation workflow
| Issue | Specific questions |
|---|---|
| Introduction | Overall philosophy: what information is captured and from what sources? What use is being made of this information or is envisioned for this information? What is the current workflow of the operation, and where are automated methods used? |
| Encoding methods | How is the information captured to make it machine readable? What entities are involved and how are they entered in the database? What relationships are involved and how are they symbolized? What standardized or controlled vocabularies are used? Give examples of a variety of data elements and how they appear in the database |
| Information access | When a curator runs into a problem or a difficult case, what kind of information is needed to solve it? What kind of internet searching is used most often in difficult cases? Dictionary? Wikipedia? Other database? |
| Use of text-mining tools | What text-mining tools do you currently employ in your workflow and what problems do these algorithms solve for you? What problems do you have that are not currently solved, but which you think could be amenable to a text-mining solution (i.e. for which steps could text mining overcome current bottlenecks in the existing pipeline)? |
Stages in the curation workflow
| Curation stage | Sub-stage | Description |
|---|---|---|
| Sources | 0 | Collecting papers to be curated from multiple sources |
| Paper selection | 1 | Triage to prioritize articles for curation |
| | 2 | Indexing of biological entities of interest |
| Full curation | 3 | Curation of relations, experimental evidence |
| | 4 | Extraction of evidence within document (e.g. sentences, images) |
| | 5 | Check of record |
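The stages in the table can be sketched as a small data model. This is only an illustration: it assumes that rows with a blank stage cell belong to the stage named above them, and that a paper passes through the sub-stages strictly in order, which the individual databases may not do. The names `SubStage`, `STAGE_OF` and `PaperRecord` are hypothetical and do not come from any of the surveyed systems.

```python
from dataclasses import dataclass
from enum import IntEnum


class SubStage(IntEnum):
    """Sub-stage numbers as listed in the table of curation stages."""
    COLLECT_SOURCES = 0      # collecting papers from multiple sources
    TRIAGE = 1               # prioritizing articles for curation
    ENTITY_INDEXING = 2      # indexing biological entities of interest
    RELATION_CURATION = 3    # curating relations and experimental evidence
    EVIDENCE_EXTRACTION = 4  # extracting evidence (sentences, images)
    RECORD_CHECK = 5         # checking the finished record


# Top-level stage for each sub-stage (assumption: blank cells in the
# table continue the stage named in the row above).
STAGE_OF = {
    SubStage.COLLECT_SOURCES: "Sources",
    SubStage.TRIAGE: "Paper selection",
    SubStage.ENTITY_INDEXING: "Paper selection",
    SubStage.RELATION_CURATION: "Full curation",
    SubStage.EVIDENCE_EXTRACTION: "Full curation",
    SubStage.RECORD_CHECK: "Full curation",
}


@dataclass
class PaperRecord:
    """Hypothetical tracker for one paper moving through the workflow."""
    pmid: str
    next_stage: SubStage = SubStage.COLLECT_SOURCES
    done: bool = False

    def advance(self) -> None:
        """Mark the current sub-stage complete and move to the next one."""
        if self.done:
            raise ValueError("paper has already completed the workflow")
        if self.next_stage is SubStage.RECORD_CHECK:
            self.done = True
        else:
            self.next_stage = SubStage(self.next_stage + 1)


if __name__ == "__main__":
    paper = PaperRecord(pmid="23160416")
    paper.advance()  # sources collected
    paper.advance()  # triage finished
    print(paper.next_stage.name, "-", STAGE_OF[paper.next_stage])
```

A strictly linear model keeps the sketch short; a production system would more likely allow papers to be rejected at triage or sent back from the record check.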
Commonalities and differences in the curation workflow stages
| Curation stage | Commonalities | Differences |
|---|---|---|
| Source collection | PubMed search (abstracts); full-text articles (PDF) | Number of papers to be curated; acceptance of sources outside of PubMed (e.g. author submission) |
| Paper selection (triage) | Manual process by humans; primarily based on abstract; assignment of curation priorities; identification of genes/proteins | Database-specific selection criteria (e.g. species, gene/function, novelty); identification of additional bio-entities (e.g. anatomy, cell type) |
| Full curation | Gene (function) centric; use of full text; use of controlled vocabularies and ontologies; identification of experimental evidence; contacting authors when needed | Annotating database/species-specific entities and relationships; annotating images (Xenbase) |
Common ontologies used across multiple curation databases (“X” indicates ontology in use by the database in column header)
| Ontologies | AgBase | TAIR | MGI | Xenbase | MaizeGDB | FlyBase | WormBase |
|---|---|---|---|---|---|---|---|
| Gene Ontology | X | X | X | X | X | X | X |
| Plant Ontology ( | X | X | X | ||||
| Sequence Ontology ( | X | X | X |
Current uses of text mining and desired uses
| Status | Specific use cases of text-mining tools |
|---|---|
| Current | Finding gene names and symbols (gene indexing); querying full text with Textpresso; assigning GO cellular component terms |
| Future/desired | Improving gene indexing results; performing document triage; recognizing additional biological concepts (disease, anatomy); capturing terms from additional ontologies (e.g. GO, particularly molecular function and biological process); capturing complex relations such as gene regulation |
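
To make the two most frequently named use cases concrete, gene indexing and document triage, the following is a minimal sketch of a dictionary-based approach. It is not the method used by any of the surveyed databases: the lexicon, its symbols and identifiers, and the functions `index_genes` and `triage` are all hypothetical, and ranking abstracts by the number of distinct gene symbols they mention is only one simple triage heuristic.

```python
import re

# Hypothetical lexicon mapping gene symbols to made-up database identifiers.
GENE_LEXICON = {
    "abc-1": "DB:0001",
    "xyz2": "DB:0002",
    "foo3": "DB:0003",
}


def index_genes(text, lexicon=GENE_LEXICON):
    """Return the sorted lexicon symbols that occur as whole tokens in text."""
    tokens = set(re.findall(r"[A-Za-z0-9\-]+", text))
    return sorted(symbol for symbol in lexicon if symbol in tokens)


def triage(abstracts, lexicon=GENE_LEXICON):
    """Rank document ids by number of distinct gene symbols, highest first.

    Ties are broken alphabetically by document id so the order is stable.
    """
    scored = [
        (len(index_genes(text, lexicon)), doc_id)
        for doc_id, text in abstracts.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    abstracts = {
        "d1": "We show that abc-1 regulates xyz2 in embryos.",
        "d2": "A review of curation practices.",
        "d3": "Loss of foo3 is lethal.",
    }
    print(index_genes(abstracts["d1"]))  # ['abc-1', 'xyz2']
    print(triage(abstracts))             # ['d1', 'd3', 'd2']
```

Exact token matching misses synonyms, case variants and ambiguous symbols, which is consistent with "improving gene indexing results" appearing in the desired-uses row.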