Rafal Rak, Riza Theresa Batista-Navarro, Jacob Carter, Andrew Rowley, Sophia Ananiadou.
Abstract
Web services have become a popular means of interconnecting solutions for processing a body of scientific literature. This has fuelled research on high-level data exchange formats suitable for a given domain and ensuring the interoperability of Web services. In this article, we focus on the biological domain and consider four interoperability formats, BioC, BioNLP, XMI and RDF, that represent domain-specific and generic representations and include well-established as well as emerging specifications. We use the formats in the context of customizable Web services created in our Web-based, text-mining workbench Argo that features an ever-growing library of elementary analytics and capabilities to build and deploy Web services straight from a convenient graphical user interface. We demonstrate a 2-fold customization of Web services: by building task-specific processing pipelines from a repository of available analytics, and by configuring services to accept and produce a combination of input and output data interchange formats. We provide qualitative evaluation of the formats as well as quantitative evaluation of automatic analytics. The latter was carried out as part of our participation in the fourth edition of the BioCreative challenge. Our analytics built into Web services for recognizing biochemical concepts in BioC collections achieved the highest combined scores out of 10 participating teams.
Database URL: http://argo.nactem.ac.uk
Year: 2014 PMID: 25006225 PMCID: PMC4086403 DOI: 10.1093/database/bau064
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Comparison of selected functionalities of Argo and other related platforms
| Feature | Argo | GATE Developer | U-Compare | Taverna | Kepler | Triana |
|---|---|---|---|---|---|---|
| Based on a standard interoperability framework | + | − | + | − | − | − |
| Web-based | + | − | − | − | − | − |
| GUI-based workflow construction | + | + | + | + | + | + |
| In-built library of analytics | + | + | + | − | + | + |
| Focussed on text mining | + | + | + | − | − | − |
| Strong support for biomedical applications | + | + | + | + | − | − |
| Support for data curation | + | + | − | − | − | − |
| Workflow sharing | + | + | + | + | + | + |
| Web service deployment | + | − | + | + | + | + |
| Customizable I/O formats for Web services | + | − | − | − | − | − |
Figure 1. A Web service-enabled workflow built in Argo for identification of metabolic process concepts. The workflow features BioC as the Web service’s input and output format. The callouts show component-specific output annotation types that are relevant for this workflow.
Figure 2. A Web service-enabled workflow built in Argo for biological event extraction. The workflow accepts REST calls with data in BioNLP format and produces RDF output. The callouts show component-specific output annotation types that are relevant for this workflow.
Examples of the transcription of BioNLP annotations into BioC XML format
| Annotation category | BioNLP annotation | BioC transcription |
|---|---|---|
| Entities | T1 Protein 19 49 interferon regulatory factor 4 | |
| Events with modifications | T11 Gene_expression 55 65 expression<br>E2 Gene_expression:T11 Theme:T1<br>M1 Speculation E2 | |
| Equivalent entities | * Equiv T2 T3 | |
| Coreferences (GENIA corpora) | R1 Coreference Subject:T13 Object:T3<br>R2 Coreference Subject:T13 Object:T4<br>R3 Coreference Subject:T13 Object:T5 | |
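
To illustrate the kind of transcription the table refers to, the following is a minimal sketch of how the entity line above could be mapped to a BioC `annotation` element. It is not the authors' converter. The function name `bionlp_entity_to_bioc` is hypothetical, and the use of an infon with key `type` for the entity type is an assumed convention. The element and attribute names (`annotation`, `infon`, `location` with `offset` and `length`, `text`) follow the BioC DTD. Discontinuous spans and non-entity annotations (events, modifications, relations) are not handled.

```python
import xml.etree.ElementTree as ET


def bionlp_entity_to_bioc(line):
    """Convert one BioNLP text-bound annotation line to a BioC <annotation>.

    Hypothetical helper: expects 'ID TYPE START END TEXT' separated by
    whitespace (tabs or spaces) and a single contiguous span.
    """
    ann_id, ann_type, start, end, text = line.split(None, 4)
    start, end = int(start), int(end)

    annotation = ET.Element("annotation", id=ann_id)
    # Assumption: the entity type is stored in an infon keyed "type".
    infon = ET.SubElement(annotation, "infon", key="type")
    infon.text = ann_type
    # BioNLP gives start/end offsets; BioC stores offset and length.
    ET.SubElement(annotation, "location",
                  offset=str(start), length=str(end - start))
    ET.SubElement(annotation, "text").text = text
    return annotation


element = bionlp_entity_to_bioc(
    "T1 Protein 19 49 interferon regulatory factor 4")
print(ET.tostring(element, encoding="unicode"))
```

The main structural difference this exposes is that BioNLP records a span as start and end offsets, whereas BioC records an offset and a length.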
External databases used as dictionaries by the proposed NERs
| Concept type | External databases |
|---|---|
| Chemical | Chemical Entities of Biological Interest (ChEBI) |
| Gene | UniProt |
| Disease | Medical Subject Headings (MeSH) |
| Action term | BioLexicon |
Approximate string matching algorithm applied to produce silver annotations
| Step | Phrase in text | CTD entry |
|---|---|---|
| Input | injured by stun gun | Stun Gun Injury |
| Case normalization | injured by stun gun | stun gun injury |
| Stop word removal | injured stun gun | stun gun injury |
| Stemming | injur stun gun | stun gun injur |
| Reordering | gun injur stun | gun injur stun |
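
The steps in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the stop word list and the suffix-stripping `toy_stem` function are hypothetical stand-ins (the source does not say which stop word list or stemmer was used), and "reordering" is assumed to mean sorting tokens alphabetically, which is consistent with the last row of the table.

```python
import re

# Assumption: a tiny illustrative stop word list, not the one used in the paper.
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}

# Assumption: a toy suffix stripper standing in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "y", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(phrase):
    """Apply case normalization, stop word removal, stemming and reordering."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())    # case normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    tokens = [toy_stem(t) for t in tokens]               # stemming
    return " ".join(sorted(tokens))                      # reordering


def matches(phrase_in_text, ctd_entry):
    """Two strings match if their normalized forms are identical."""
    return normalize(phrase_in_text) == normalize(ctd_entry)


print(normalize("injured by stun gun"))  # gun injur stun
print(normalize("Stun Gun Injury"))      # gun injur stun
print(matches("injured by stun gun", "Stun Gun Injury"))  # True
```

Comparing normalized forms rather than raw strings is what lets a phrase in running text match a dictionary entry that differs in case, word order, function words and inflection.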
Official BioCreative IV evaluation results for NaCTeM’s CTD Web services
| Category | Precision(%) | Recall(%) | F-score(%) | Average response time (sec.) |
|---|---|---|---|---|
| Chemical | 75.24 | 73.41 | 74.31 | 0.77 |
| Gene | 53.61 | 70.86 | 61.04 | 0.80 |
| Disease | 34.67 | 49.42 | 40.75 | 0.78 |
| Action term | 34.53 | 50.72 | 41.09 | 0.92 |
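
The F-scores reported above are consistent with the balanced F-score, that is, the harmonic mean of precision and recall. Assuming that definition (the table itself does not state it), the values can be reproduced from the precision and recall columns as follows; the function name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall.

    Assumption: inputs and output share one scale (percentages here).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the table above.
results = {
    "Chemical": (75.24, 73.41),
    "Gene": (53.61, 70.86),
    "Disease": (34.67, 49.42),
    "Action term": (34.53, 50.72),
}

for category, (p, r) in results.items():
    print(f"{category}: {f_score(p, r):.2f}")
```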
Contribution of dictionaries to the performance of the proposed NERs
| Dictionaries | Chemical precision | Chemical recall | Chemical F-score | Gene precision | Gene recall | Gene F-score | Disease precision | Disease recall | Disease F-score |
|---|---|---|---|---|---|---|---|---|---|
| None | 76.38 | 67.28 | **71.54 | 53.77 | 64.23 | **58.54 | 33.87 | 44.85 | **38.59 |
| CTD only | 74.59 | 72.40 | *73.48 | 53.28 | 68.87 | 60.08 | 34.35 | 49.10 | 40.42 |
| All | 75.24 | 73.41 | 74.31 | 53.61 | 70.92 | 61.06 | 34.67 | 49.52 | 40.79 |
Note: The difference in F-score between the NERs using all dictionaries and the other setups is statistically significant for cells marked with * (0.01 < P-value < 0.05) and ** (P-value < 0.01).
Values in percentages.
Performance gain of the proposed NERs (with all dictionaries) trained on the created silver corpus against the same NERs trained on domain-related, gold standard corpora
| Category | Gold standard corpus | Precision | Recall | F-score |
|---|---|---|---|---|
| Chemical | BioCreative IV CHEMDNER | +34.11 | −10.40 | +19.13 |
| Gene | BioCreative II Gene Mention | +23.91 | −4.28 | +18.48 |
| Disease | NCBI Disease | +3.30 | +0.74 | +2.60 |
Note: F-score gain is statistically significant (P-value < 0.01) for all categories. Values in percentage points.