Sun Kim, Rezarta Islamaj Doğan, Andrew Chatr-Aryamontri, Christie S Chang, Rose Oughtred, Jennifer Rust, Riza Batista-Navarro, Jacob Carter, Sophia Ananiadou, Sérgio Matos, André Santos, David Campos, José Luís Oliveira, Onkar Singh, Jitendra Jonnagaddala, Hong-Jie Dai, Emily Chia-Yu Su, Yung-Chun Chang, Yu-Chen Su, Chun-Han Chu, Chien Chin Chen, Wen-Lian Hsu, Yifan Peng, Cecilia Arighi, Cathy H Wu, K Vijay-Shanker, Ferhat Aydın, Zehra Melce Hüsünbeyi, Arzucan Özgür, Soo-Yong Shin, Dongseop Kwon, Kara Dolinski, Mike Tyers, W John Wilbur, Donald C Comeau.
Abstract
BioC is a simple XML format for text, annotations and relations, and was developed to achieve interoperability for biomedical text processing. Following the success of BioC in BioCreative IV, the BioCreative V BioC track addressed a collaborative task to build an assistant system for BioGRID curation. In this paper, we describe the framework of the collaborative BioC task and discuss our findings based on the user survey. This track consisted of eight subtasks including gene/protein/organism named entity recognition, protein-protein/genetic interaction passage identification and annotation visualization. Using BioC as their data-sharing and communication medium, nine teams, world-wide, participated and contributed either new methods or improvements of existing tools to address different subtasks of the BioC track. Results from different teams were shared in BioC and made available to other teams as they addressed different subtasks of the track. In the end, all submitted runs were merged using a machine learning classifier to produce an optimized output. The biocurator assistant system was evaluated by four BioGRID curators in terms of practical usability. The curators' feedback was overall positive and highlighted the user-friendly design and the convenient gene/protein curation tool based on text mining.
Database URL: http://www.biocreative.org/tasks/biocreative-v/track-1-bioc/
Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.
Year: 2016 PMID: 27589962 PMCID: PMC5009341 DOI: 10.1093/database/baw121
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Overview of BioCreative V BioC track. The track consists of named entity recognition (NER), protein–protein interaction (PPI), genetic interaction (GI) and visual tool tasks. The NER tasks include gene/protein NER, species/organism NER and gene/protein normalization. The PPI/GI tasks include finding passages with PPI/GI information (PPI/GI Passages), passages with PPI experimental methods (PPI Evidence Passages) and passages with GI types (GI Evidence Passages).
PPI experimental methods and GI interaction types defined in BioGRID
| PPI experimental methods | GI interaction types |
|---|---|
| Affinity Capture-Luminescence | Dosage Growth Defect |
| Affinity Capture-MS | Dosage Lethality |
| Affinity Capture-RNA | Dosage Rescue |
| Affinity Capture-Western | Negative Genetic |
| Biochemical Activity | Phenotypic Enhancement |
| Co-crystal Structure | Phenotypic Suppression |
| Co-fractionation | Positive Genetic |
| Co-localization | Synthetic Growth Defect |
| Co-purification | Synthetic Haploinsufficiency |
| Far Western | Synthetic Lethality |
| FRET | Synthetic Rescue |
| PCA | |
| Protein-peptide | |
| Protein-RNA | |
| Proximity Label-MS | |
| Reconstituted Complex | |
| Two-hybrid | |
Detailed information can be found in http://wiki.thebiogrid.org/doku.php/experimental_systems.
Datasets used and created by participating teams
| Teams | Tasks | Datasets | URL |
|---|---|---|---|
| Matos et al. (T2) | NER (Gene/protein) + normalization | BioCreative II Gene Mention Recognition corpus | |
| | NER (Species/organism) + normalization | LINNAEUS | |
| Batista-Navarro et al. (T3) | NER (Gene/protein) + normalization | CHEMDNER GPRO | |
| | NER (Species/organism) + normalization | S800 | |
| Singh et al. (T4) | NER (Gene/protein) + normalization | BioCreative II Gene Mention Recognition corpus | |
| | NER (Species/organism) + normalization | S800 | |
| | NER (Gene/protein/species) evaluation | IGN corpus | |
| Peng et al. (T6) | PPI passages | 20 in-house full text documents | |
| | | AIMed corpus | |
| Aydin et al. (T7) | PPI experimental method passages | In-house developed corpus | |
| Kim and Wilbur (T8) | PPI passages | BioCreative PPI corpus | |
| | | Two in-house developed corpora | |
| Islamaj Dogan et al. (T8) | GI passages | Two in-house developed corpora | |
Submitted runs from nine participating teams
| Team | Task 1 | Task 2 | Task 3 | Task 4 | Task 6 | Task 7 | Task 8 |
|---|---|---|---|---|---|---|---|
| T1 | 1 | | | | | | |
| T2 | 1 | 1 | 1 | | | | |
| T3 | 1 | 1 | 1 | | | | |
| T4 | 1 | 1 | 1 | | | | |
| T5 | | | | 1 | | | |
| T6 | | | | 4 | | | |
| T7 | | | | | 2 | | |
| T8 | | | | 1 | 2 | 4 | |
| T9 | | | | | | | 1 |
| Total | 4 | 3 | 3 | 6 | 4 | 4 | 1 |
To boost the synergy effect of using multiple runs, we (T8) produced additional results for Tasks 4 and 6. Only one team was selected for Task 8 because that task was to develop a user interface.
Figure 2. BioC Format for BioCreative V BioC track. (A) BioC format to share annotations for named entity recognition tasks: gene/protein and organism mentions and normalization. OrganismID and GeneID are NCBI Taxonomy ID and Entrez Gene ID, respectively. (B) BioC format to share annotations for the molecular interaction tasks: protein–protein interaction mention and evidence (PPImention, PPIevidence) and genetic interaction mention and evidence (GImention, GIevidence).
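The annotation types named in the Figure 2 caption can be illustrated with a minimal sketch of reading a BioC file using Python's standard library. The sample document below is invented for illustration: the sentence, the document ID, the Gene ID values and the exact infon layout (a `type` infon plus a `GeneID` or `OrganismID` infon) are assumptions, not data or conventions taken from the track. Only the general BioC element structure (collection, document, passage, annotation, infon, location) and the type names from the caption are relied on.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC collection. IDs and the sentence are placeholders,
# except 4932, the NCBI Taxonomy ID of Saccharomyces cerevisiae.
SAMPLE = """<collection>
  <source>example</source>
  <date>20150101</date>
  <key>example.key</key>
  <document>
    <id>DOC0</id>
    <passage>
      <offset>0</offset>
      <text>Rad53 binds Asf1 in yeast.</text>
      <annotation id="G1">
        <infon key="type">Gene</infon>
        <infon key="GeneID">00001</infon>
        <location offset="0" length="5"/>
        <text>Rad53</text>
      </annotation>
      <annotation id="G2">
        <infon key="type">Gene</infon>
        <infon key="GeneID">00002</infon>
        <location offset="12" length="4"/>
        <text>Asf1</text>
      </annotation>
      <annotation id="O1">
        <infon key="type">Organism</infon>
        <infon key="OrganismID">4932</infon>
        <location offset="20" length="5"/>
        <text>yeast</text>
      </annotation>
      <annotation id="P1">
        <infon key="type">PPImention</infon>
        <location offset="0" length="26"/>
        <text>Rad53 binds Asf1 in yeast.</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return (doc_id, type, covered_text, infons) for every annotation."""
    root = ET.fromstring(xml_text)
    rows = []
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for passage in doc.iter("passage"):
            # Annotation offsets are document-level, so subtract the
            # passage offset before slicing the passage text.
            base = int(passage.findtext("offset"))
            text = passage.findtext("text") or ""
            for ann in passage.iter("annotation"):
                infons = {i.get("key"): i.text for i in ann.findall("infon")}
                loc = ann.find("location")
                start = int(loc.get("offset")) - base
                span = text[start:start + int(loc.get("length"))]
                rows.append((doc_id, infons.get("type"), span, infons))
    return rows


for doc_id, ann_type, span, infons in read_annotations(SAMPLE):
    print(doc_id, ann_type, repr(span), infons)
```

Because every annotation carries its own type and location, output from different teams can in principle be concatenated into one passage and filtered by type afterwards, which is the property the track relied on for sharing results.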
Evaluation set used for optimizing the merger of submitted runs
| Organisms | Documents | Molecular interaction information |
|---|---|---|
| Yeast | 60 | PPI and GI |
| Human | 38 | PPI and GI |
| Human | 17 | PPI |
| Human | 5 | GI |
Documents were randomly selected from PMC articles relevant to either yeasts or humans. Of the 120 documents, 98 contained both PPI and GI information; the remaining 22 contained only PPI (17) or only GI (5) information.
Figure 3. Annotation Interface for full-text PMC articles. This is a screenshot of our annotation interface that curators used to create a gold-standard annotation set. For annotations, a curator selects relevant text and chooses an annotation type button on the screen. Gene ID and Tax ID options are for assigning IDs to gene and organism names.
Figure 4. Score assigning process for each submission from PPI/GI tasks.
Questionnaire used for user feedback
| Questions | Average rating |
|---|---|
| I. | |
| a. Please rate your experience with BioC Viewer. | 3.3 |
| b. Overall, I am satisfied with BioC Viewer. | 3.0 |
| c. I would recommend BioC Viewer to other PPI/GI curators. | 2.8 |
| II. | |
| a. It is easy to use BioC Viewer. | 5.0 |
| b. I am satisfied with using BioC Viewer. | 4.0 |
| c. BioC Viewer is powerful enough to complete the task. | 3.0 |
| III. | |
| a. Speed: the system would reduce annotation time to reach my curation goal. | 3.5 |
| b. Effectiveness: the system would help me get closer to my curation goal. | 3.0 |
| c. Efficiency: I can be both fast and effective with the system. | 2.8 |
| IV. | |
| a. Task 1 (gene/protein NER) | 4.3 |
| b. Task 2 (organism NER) | 2.7 |
| c. Task 3 (gene/protein name normalization) | 3.8 |
| d. Task 4 (Passages with PPIs) | 3.3 |
| e. Task 6 (Passages with PPI experimental systems) | 2.5 |
| f. Task 7 (Passages with GI types) | 3.0 |
| V. | |
| a. It was easy to find and read information. | 4.0 |
| b. Highlights were adequate and helpful. | 3.5 |
| c. Information was well organized. | 3.5 |
| VI. | |
| a. It was easy to learn how to operate the interface. | 4.3 |
| b. It was easy to remember features in BioC Viewer. | 4.3 |
| c. It was straightforward to use the interface. | 4.3 |
| VII. | |
| a. The interface was fast enough to do my job. | 3.5 |
| b. The interface performed consistently. | 4.0 |
| c. The interface provided a means to easily correct mistakes. | 3.0 |
For each question, BioGRID curators gave a rating on a scale from 1 (bad) to 5 (good). The scores shown are the average ratings from the four curators.
Figure 5. Curators’ ratings for prediction performance for each task. Tasks 1, 2 and 3 are gene/protein named entity recognition (NER), species/organism NER and gene/protein name normalization, respectively. Tasks 4, 6 and 7 are passages with protein–protein interactions (PPIs), PPI experimental methods and genetic interaction types, respectively. Tasks 1 and 3 received positive responses overall; however, ratings were mixed for other tasks depending on curators’ preferences. Curator 4 did not assign a score for Task 2.
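The missing Task 2 score also shows up in the questionnaire averages, and a small arithmetic check makes this visible. The sketch below assumes that each curator gave a whole-number rating from 1 to 5 and that averages were rounded half-up to one decimal; neither assumption is stated in the source. Under those assumptions it lists which one-decimal averages a given number of raters can produce.

```python
from decimal import Decimal, ROUND_HALF_UP


def reachable_means(n_raters, low=1, high=5):
    """One-decimal means that n_raters integer scores in [low, high] can give.

    Assumes whole-number ratings and half-up rounding (both assumptions).
    """
    tenth = Decimal("0.1")
    return {
        (Decimal(total) / n_raters).quantize(tenth, rounding=ROUND_HALF_UP)
        for total in range(low * n_raters, high * n_raters + 1)
    }


four_raters = reachable_means(4)
three_raters = reachable_means(3)

# The Task 2 average reported in the questionnaire table is 2.7.
print(Decimal("2.7") in four_raters)   # can four integer ratings average to 2.7?
print(Decimal("2.7") in three_raters)  # can three?
```

With four raters every average is a multiple of 0.25, so values such as 3.3, 3.8 and 4.3 are reachable but 2.7 is not. With three raters a total of 8 gives 2.67, which rounds to 2.7, consistent with the Task 2 average having been taken over the three curators who responded.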