Rafal Rak, Riza Theresa Batista-Navarro, Andrew Rowley, Jacob Carter, Sophia Ananiadou.
Abstract
Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and the identification of interactions between those concepts. Text mining has been shown to have the potential to significantly reduce the effort of biocurators in all three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions through a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge, whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in the flexibility of defining biocuration tasks. As expected, the versatility of the workbench lengthened the time the curators spent learning the system before taking on the task, which may have affected the usability of Argo. Participation in the challenge gave us an opportunity to gather valuable feedback and identify areas for improvement, some of which have already been addressed.
Database URL: http://argo.nactem.ac.uk
Year: 2014 PMID: 25037308 PMCID: PMC4103424 DOI: 10.1093/database/bau070
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. A screenshot of Argo’s workflow diagramming window. Users create their workflows graphically by selecting and placing elementary processing components onto a drawing canvas and interconnecting them to form meaningful processing units.
Figure 2. A screenshot of the Argo application showing the Workflows panel, which lists user-created workflows and enables managing them. The right-hand-side panel provides a description of the currently selected workflow as well as warning messages informing the user of problems with the underlying components’ configuration or connections.
Figure 3. A screenshot of a fragment of Argo’s manual annotation editor. The editor allows users to visually create, modify or delete annotations in the left-hand-side panel, as well as fill in more specific information (governed by a given annotation schema) for each of the annotations in the right-hand-side panel.
Figure 4. A screenshot of a window for selecting an identifier for an annotated fragment of text from an external resource (in this case, the ChEBI ontology). Users may select the most suitable entry by browsing a selection in the left-hand-side panel and viewing the details in the central panel.
Data serialization and deserialization components available in Argo that may be used in biocuration workflows as terminal components for reading source data and saving the results of processing
| Component name | Description |
|---|---|
| Document Reader | Deserializes text files stored in the user’s personal space (i.e. the Documents panel) |
| Kleio Search | Remotely fetches PubMed abstracts matching a query set as a parameter |
| PubMed Abstract Reader | Fetches abstracts directly from PubMed using a list of PubMed IDs as input |
| Input Text Reader | Reads text supplied in a parameter |
| BioNLP Shared Task Reader | Deserializes triple files (containing plain text, stand-off annotations of named entities and stand-off annotations of events or structured relationships) as defined in the BioNLP shared task |
| XMI Reader/Writer | (De)serializes entire CASes (data and annotations) from/into the XML Metadata Interchange (XMI) format |
| RDF Reader/Writer | (De)serializes entire CASes from/into RDF which may then be reused in other applications, e.g. in query engines supporting SPARQL |
| BioC Reader/Writer | (De)serializes selected annotations from/into BioC format |
BioNLP shared task: http://2013.bionlp-st.org
BioC: http://bioc.sourceforge.net
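To make the idea of (de)serializing stand-off annotations concrete, the following is a minimal sketch of writing typed character spans to a BioC-style XML document. It is not Argo's BioC Writer: the `Annotation` class, the `to_bioc` function and the sample sentence are hypothetical, the element layout follows the general BioC structure (collection, document, passage, annotation) as an assumption, and the output is not validated against the BioC DTD.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation: a typed character span over the text."""
    ann_id: str
    ann_type: str
    begin: int  # inclusive character offset
    end: int    # exclusive character offset


def to_bioc(doc_id: str, text: str, annotations: List[Annotation]) -> str:
    """Serialize one document and its annotations to a BioC-style XML string."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example"
    ET.SubElement(collection, "date").text = ""
    ET.SubElement(collection, "key").text = ""

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id

    # A single passage holding the whole text, starting at offset 0.
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text

    for ann in annotations:
        node = ET.SubElement(passage, "annotation", id=ann.ann_id)
        ET.SubElement(node, "infon", key="type").text = ann.ann_type
        ET.SubElement(
            node,
            "location",
            offset=str(ann.begin),
            length=str(ann.end - ann.begin),
        )
        # The covered text is copied so the annotation is readable on its own.
        ET.SubElement(node, "text").text = text[ann.begin:ann.end]

    return ET.tostring(collection, encoding="unicode")


sample_text = "Aspirin inhibits COX-2."
sample_annotations = [
    Annotation("T1", "Chemical", 0, 7),
    Annotation("T2", "GGP", 17, 22),
]
print(to_bioc("doc1", sample_text, sample_annotations))
```

Keeping annotations as offsets into unchanged text, rather than inline markup, is what lets the same document carry annotations from several components or curators at once.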
Text analysis components available in Argo that may be used in biocuration workflows to produce annotations
| Component name | Description |
|---|---|
| GENIA Sentence Splitter | A sentence splitter trained on biomedical text |
| GENIA Tagger | Performs tokenization, part-of-speech and chunk tagging, and recognition of genes or gene products (e.g., proteins, DNA, RNA) |
| GENIA Dependency Parser | A dependency parser optimized for biomedical text |
| Enju Parser | Returns phrase and predicate-argument structures for general and biomedical text |
| Anatomical Entity Tagger | A machine learning-based anatomical entity mention recognizer |
| NERsuite | A named entity recognizer implemented on top of the NERsuite package. |
| Chemical Entity Recogniser | A named entity recognizer optimized for chemical text |
| OscarMER | A refactored version ( |
| Species Tagger | A tagger for species names based on a dictionary look-up method |
| CTD Linker | Normalizes an action term to one of the types in the CTD interaction types ontology. |
| ChEBI Linker | Normalizes a chemical compound name to an entry in the Chemical Entities of Biological Interest (ChEBI) database. |
| UniProt Linker | Normalizes a name of a gene or gene product to a UniProt entry. |
| EventMine | A machine learning-based event extractor with models for GENIA, epigenetics, infectious diseases, pathway and cancer genetics event types |
NERsuite: http://nersuite.nlplab.org
Utility components available in Argo that may be used in biocuration workflows to support automatic processing
| Component name | Description |
|---|---|
| Manual Annotation Editor | A user-interactive component that supports visualization and manipulation of annotations, allowing the user to manually intervene in the processing of a workflow |
| SPARQL Annotation Editor | Creates, removes and modifies annotations using a SPARQL query |
| Generic Listener | An interface that allows users to plug in their own components running on a local machine |
| Agreement Evaluator | Analyses two or more input annotation efforts (coming from different branches in a workflow) and produces a tab-separated file, reporting agreement rates between the inputs; may serve to compute inter-annotator agreement |
| Reference Evaluator | Compares automatically generated annotations against reference annotations and produces a tab-separated file reporting evaluation results. |
The performance of the automatic workflows compared against the annotations of human curators
| Curator | Category | Manual annotation P | Manual annotation R | Manual annotation F | Manual correction P | Manual correction R | Manual correction F | All P | All R | All F |
|---|---|---|---|---|---|---|---|---|---|---|
| Curator 1 | Chemicals | 47 | 67 | 55 | 48 | 71 | 57 | 47 | 69 | 56 |
| Curator 1 | GGPs | 63 | 61 | 62 | 63 | 55 | 58 | 63 | 58 | 60 |
| Curator 1 | Triggers | 93 | 55 | 69 | 98 | 88 | 93 | 96 | 71 | 81 |
| Curator 2 | Chemicals | 39 | 60 | 47 | 93 | 97 | 95 | 67 | 83 | 74 |
| Curator 3 | Chemicals | 40 | 67 | 50 | 91 | 98 | 95 | 66 | 86 | 75 |
| Majority voting | Chemicals | 27 | 66 | 39 | 90 | 98 | 94 | 67 | 87 | 76 |
| Union | Chemicals | 57 | 49 | 53 | 82 | 96 | 89 | 72 | 73 | 72 |
The results are split into the two subsets of documents used in the curation task. P—precision, R—recall, F—F-score. Reported values are in percentages.
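As a reminder of how the three reported measures relate, here is a minimal sketch that computes precision, recall and F-score from two sets of annotations. The exact-match criterion (same offsets and same category), the `Span` tuple layout and the function names are assumptions made for illustration; this record does not state which matching criterion was used in the evaluation.

```python
from typing import Set, Tuple

# Hypothetical annotation representation: (begin offset, end offset, category).
Span = Tuple[int, int, str]


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(
    predicted: Set[Span], reference: Set[Span]
) -> Tuple[float, float, float]:
    """Score predicted annotations against reference ones by exact match."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy data: the workflow finds 3 spans, the curator marked 4, and 2 coincide.
predicted = {(0, 7, "Chemical"), (17, 22, "GGP"), (30, 35, "Chemical")}
reference = {(0, 7, "Chemical"), (17, 22, "GGP"), (40, 48, "Chemical"), (50, 55, "GGP")}

p, r, f = precision_recall_f(predicted, reference)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

The harmonic mean is consistent with the table: for Curator 1, Chemicals, manual annotation, P = 47 and R = 67 give F = 2 × 47 × 67 / (47 + 67) ≈ 55.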
The inter-annotator agreement
| Curator | Manual annotation | Manual correction | All |
|---|---|---|---|
| Curator 1 and 2 | 76 | 57 | 67 |
| Curator 2 and 3 | 76 | 92 | 84 |
| Curator 1 and 3 | 82 | 56 | 69 |
The results are split into the two subsets of documents used in the curation task. Reported values are F-scores in percentages.
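Pairwise agreement expressed as an F-score can be sketched as follows. Because swapping the two annotators only swaps precision and recall, the F-score is symmetric and reduces to 2|A ∩ B| / (|A| + |B|). The exact-match criterion, the span layout, the curator names in the toy data and the convention of returning 1.0 for two empty sets are all assumptions for illustration, not details taken from the source.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# Hypothetical annotation representation: (begin offset, end offset, category).
Span = Tuple[int, int, str]


def agreement_f(a: Set[Span], b: Set[Span]) -> float:
    """F-score between two annotators; symmetric in its arguments."""
    if not a and not b:
        return 1.0  # assumed convention: two empty annotation sets fully agree
    return 2 * len(a & b) / (len(a) + len(b))


def pairwise_agreement(
    curators: Dict[str, Set[Span]]
) -> Dict[Tuple[str, str], float]:
    """Agreement for every unordered pair of curators."""
    return {
        (x, y): agreement_f(curators[x], curators[y])
        for x, y in combinations(sorted(curators), 2)
    }


# Toy annotation sets, not data from the study.
toy_curators = {
    "curator_1": {(0, 7, "Chemical"), (17, 22, "Chemical"), (30, 35, "Chemical")},
    "curator_2": {(0, 7, "Chemical"), (17, 22, "Chemical")},
    "curator_3": {(0, 7, "Chemical"), (17, 22, "Chemical"), (40, 48, "Chemical")},
}

for pair, score in pairwise_agreement(toy_curators).items():
    print(pair, f"{100 * score:.0f}")
```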
The approximate and indirect mapping of human and automatic annotations to chemical–gene interactions in the CTD
| Curator | Category | Manual annotation P | Manual annotation R | Manual annotation F | Manual correction P | Manual correction R | Manual correction F | All P | All R | All F |
|---|---|---|---|---|---|---|---|---|---|---|
| Automatic workflow | Chemicals | 16 | 55 | 25 | 16 | 49 | 24 | 16 | 52 | 25 |
| Automatic workflow | GGPs | 14 | 37 | 21 | 14 | 34 | 19 | 14 | 36 | 20 |
| Automatic workflow | Triggers | 63 | 42 | 50 | 53 | 51 | 52 | 58 | 46 | 51 |
| Curator 1 | Chemicals | 33 | 68 | 44 | 32 | 64 | 43 | 32 | 66 | 44 |
| Curator 1 | GGPs | 20 | 40 | 26 | 18 | 39 | 24 | 19 | 39 | 25 |
| Curator 1 | Triggers | 55 | 58 | 57 | 46 | 60 | 52 | 51 | 59 | 54 |
| Curator 2 | Chemicals | 31 | 67 | 43 | 18 | 53 | 27 | 24 | 60 | 34 |
| Curator 3 | Chemicals | 36 | 66 | 46 | 17 | 49 | 25 | 24 | 58 | 34 |
| Majority voting | Chemicals | 35 | 67 | 46 | 18 | 52 | 27 | 25 | 60 | 35 |
| Union | Chemicals | 26 | 68 | 38 | 19 | 66 | 30 | 22 | 67 | 34 |
The results are split into the two subsets of documents used in the curation task. P—precision, R—recall, F—F-score. Reported values are in percentages.
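The "Majority voting" and "Union" rows combine the annotations of several curators into a single set. One plausible way to build such sets is sketched below; the source does not describe the exact procedure, so the strict-majority threshold, the exact-match criterion and the function names are assumptions.

```python
from collections import Counter
from typing import Sequence, Set, Tuple

# Hypothetical annotation representation: (begin offset, end offset, category).
Span = Tuple[int, int, str]


def majority_vote(annotation_sets: Sequence[Set[Span]]) -> Set[Span]:
    """Keep spans marked by more than half of the curators."""
    # Each curator contributes a span at most once, so counts are vote counts.
    votes = Counter(span for spans in annotation_sets for span in spans)
    threshold = len(annotation_sets) / 2
    return {span for span, count in votes.items() if count > threshold}


def union(annotation_sets: Sequence[Set[Span]]) -> Set[Span]:
    """Keep spans marked by at least one curator."""
    return set().union(*annotation_sets)


# Toy annotation sets for three curators, not data from the study.
c1 = {(0, 7, "Chemical"), (17, 22, "Chemical"), (30, 35, "Chemical")}
c2 = {(0, 7, "Chemical"), (17, 22, "Chemical")}
c3 = {(0, 7, "Chemical"), (40, 48, "Chemical")}

print(sorted(majority_vote([c1, c2, c3])))
print(sorted(union([c1, c2, c3])))
```

Under these assumptions, majority voting favours precision of the merged set (a span needs two of three curators), whereas the union favours coverage (any one curator suffices).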