| Literature DB >> 23327936 |
An overview of the BioCreative 2012 Workshop Track III: interactive text mining task
Cecilia N Arighi1, Ben Carterette, K Bretonnel Cohen, Martin Krallinger, W John Wilbur, Petra Fey, Robert Dodson, Laurel Cooper, Ceri E Van Slyke, Wasila Dahdul, Paula Mabee, Donghui Li, Bethany Harris, Marc Gillespie, Silvia Jimenez, Phoebe Roberts, Lisa Matthews, Kevin Becker, Harold Drabkin, Susan Bello, Luana Licata, Andrew Chatr-aryamontri, Mary L Schaeffer, Julie Park, Melissa Haendel, Kimberly Van Auken, Yuling Li, Juancarlos Chan, Hans-Michael Muller, Hong Cui, James P Balhoff, Johnny Chi-Yang Wu, Zhiyong Lu, Chih-Hsuan Wei, Catalina O Tudor, Kalpana Raja, Suresh Subramani, Jeyakumar Natarajan, Juan Miguel Cejuela, Pratibha Dubey, Cathy Wu.
Abstract
In many databases, biocuration primarily involves literature curation, which usually includes retrieving relevant articles, extracting information that will translate into annotations, and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve the efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator in the given curation task, the inherent difficulty of the curation, and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify the strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators' overall experience of a system, regardless of the system's high scores on design, learnability and usability. In addition, strategies to refine the annotation guidelines and system documentation, to adapt the tools to the needs and query types of the end user, and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics were analyzed during this task. This analysis will help to plan a more intense study in BioCreative IV.
Year: 2013 PMID: 23327936 PMCID: PMC3625048 DOI: 10.1093/database/bas056
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. BioCreative 2012 workshop workflow. The chart shows the three main phases of this activity: (1) the preparation phase included system and document preparation by the teams, recruitment of biocurators to test each system, and modification of each system for its assigned biocuration group; (2) the training phase actively involved both teams and biocurators, the former providing the necessary support to use the system, the latter learning about the curation task and the system functionalities and reporting system bugs when necessary; and (3) the evaluation phase included selection of the corpus and its manual annotation by an expert (to create the gold standard), followed by annotation of this corpus by biocurators, half manually and half system-assisted, along with time recording and completion of the user survey. The results were collected by the teams and coordinators and presented at the workshop. Key dates are indicated on the right side.
Systems registered in BioCreative 2012 Track III
System descriptions, with the task proposed in BioCreative and reported internal benchmark results. (a) Term-based EQs are EQ statements created strictly from the original descriptions, independent of any ontologies, whereas label-based EQs are the corresponding formal statements (using ontology terms). (b) This system only participated at the workshop.
Participating databases/institutions in BioCreative Workshop 2012
| Database/Institution type | Database/Institution | Gold standard annotation | Pre-workshop evaluation | Workshop evaluation |
|---|---|---|---|---|
| Industry | AstraZeneca (1) | | | ✓ |
| Industry | Merck Serono (1) | | ✓ | ✓ |
| Industry | Pfizer (1) | | ✓ | ✓ |
| Literature | NLM (1) | | ✓ | ✓ |
| Model Organism (MOD)/Gene Ontology Consortium (GOC) | AgBase (1) | | | ✓ |
| MOD/GOC | dictyBase (2) | ✓ | ✓ | ✓ |
| MOD/GOC | FlyBase (1) | | | ✓ |
| MOD/GOC | MaizeDB (1) | | | ✓ |
| MOD/GOC | MGI (3) | | ✓ | ✓ |
| MOD/GOC | SGD (1) | | ✓ | ✓ |
| MOD/GOC | TAIR (2) | ✓ | ✓ | ✓ |
| MOD/GOC | WormBase (1) | | | ✓ |
| MOD/GOC | XenBase (1) | | | ✓ |
| MOD/GOC | ZFIN (1) | | ✓ | |
| Ontology | Plant Ontology (1) | | ✓ | ✓ |
| Ontology | Protein Ontology (2) | ✓ | | |
| Pathway | Reactome (2) | | ✓ | |
| Phenotype | GAD (1) | | ✓ | |
| Phenotype | Phenoscape (3) | ✓ | ✓ | ✓ |
| Protein–protein interaction | BioGrid (1) | | ✓ | ✓ |
| Protein–protein interaction | MINT (1) | | ✓ | ✓ |
| Others (approx. 11) | | | | ✓ |
Numbers in parentheses are the number of biocurators from each institution. Biocurators aided in dataset annotation and system evaluation.
Dataset preparation for systems in BioCreative Workshop 2012
| System | Dataset selection for pre-workshop evaluation | Information captured | Biocurators involved in gold standard annotation | Biocurators involved in the evaluation |
|---|---|---|---|---|
| Textpresso | 30 full-length articles about Dictyostelium | Paper identifier, annotation entity, paper section, curatable sentence, component term in sentence, GO term, GO ID and evidence code | dictyBase senior curator | dictyBase and Plant Ontology (a) |
| PCS | 50 textual descriptions of phenotypic characters in NeXML format, randomly selected from 50 articles about fish or other vertebrates; gold standard: the 50 character descriptions annotated by a senior Phenoscape biocurator | Entity term, entity ID, quality term, quality ID, quality negated, quality modifier, entity locator, count and more | Phenoscape senior curator | ZFIN and Phenoscape |
| PubTator | TAIR set: 50 abstracts (24 relevant) on Arabidopsis, sampled from November 2011 and already curated by TAIR | Gene indexing: gene names and Entrez Gene IDs | Existing annotated corpus | TAIR and National Library of Medicine (NLM) |
| PubTator | NLM set: 50 abstracts sampled from the Gene Indexing Assistant Test Collection (human) | Document triage information: list of relevant PMIDs | | |
| PPInterFinder | 50 abstracts describing human kinases, obtained using a combination of tools/resources (such as UniProt, PubMeMiner, FABLE and PIE) | PMID, protein interactant name 1, protein interactant name 2 | NR | BioGrid and MINT |
| eFIP | PMID-centric: 50 abstracts randomly selected based on proteins involved in two pathways of interest to Reactome: autophagy and HIV infection | PMID, phosphorylated protein, phosphorylation site, interactant name, effect, evidence sentence | NR | Merck Serono, Reactome and SGD (b) |
| eFIP | Gene-centric: 10 first-ranked abstracts for 4 proteins involved in the adaptive immune system (Reactome: REACT_75774) | | | |
| T-HOD | PMID-centric: 50 abstracts from 2011 journals about obesity, diabetes or hypertension | PMID, Entrez Gene ID, gene name, disease, gene–disease relation, evidence sentence | Protein Ontology senior curator | Pfizer, Reactome, GAD and MGI |
| T-HOD | Gene-centric: review relevance of documents for four genes | | | |
NR: not recorded. (a) Curator novice to GO annotation. (b) The SGD curator participated in the first evaluation, which is not reported in the performance results here.
System performance metrics in pre-workshop evaluation
Textpresso: sentence level

| Measure (%) | System alone, Category 4 (a) | System alone, Category 5 (a) | Manual, Curator 1 (b) | Manual, Curator 2 (b) |
|---|---|---|---|---|
| Recall | 37.9 | 39.7 | 55.1 | 26.9 |
| Precision | 77.5 | 81.5 | 41.7 | 63.3 |
| F-measure | 50.9 | 53.4 | 47.5 | 37.8 |

Textpresso: GO annotation level

| Measure (%) | System-assisted, Curator 1, Category 4 (a) | System-assisted, Curator 2, Category 4 (a) | System-assisted, Curator 1, Category 5 (a) | System-assisted, Curator 2, Category 5 (a) | Manual, Curator 1 (b) | Manual, Curator 2 (b) |
|---|---|---|---|---|---|---|
| Recall | 37.1 | 14.5 | 32.2 | 11.3 | 86.8 | 39.5 |
| Precision | 78.3 | 77.8 | 75.0 | 71.4 | 42.8 | 41.2 |
| F-measure | 50.3 | 24.4 | 45.1 | 19.5 | 57.3 | 40.3 |

PCS: term-based EQs (c)

| Measure (%) | System alone | Curator 1 | Curator 2 (d) | Curator 3 |
|---|---|---|---|---|
| Recall | 65.0 | 47.0 | 38.0 | 50.0 |
| Precision | 60.0 | 57.0 | 65.0 | 67.0 |
| F-measure | 62.4 | 51.5 | 48.0 | 57.3 |

PCS: label-based EQs (c)

| Measure (%) | System alone | Curator 1 | Curator 2 (d) | Curator 3 |
|---|---|---|---|---|
| Recall | 24.0 | 44.0 | 51.0 | 51.0 |
| Precision | 23.0 | 54.0 | 81.0 | 74.0 |
| F-measure | 23.5 | 48.5 | 62.6 | 60.4 |

PCS: label-based EQs (c), Phenex + CharaParser versus Phenex

| Measure (%) | Phenex + CharaParser, Curator 1 | Phenex + CharaParser, Curator 2 (d) | Phenex + CharaParser, Curator 3 | Phenex, Curator 1 | Phenex, Curator 2 (d) | Phenex, Curator 3 |
|---|---|---|---|---|---|---|
| Recall | 51.0 | 38.0 | 66.0 | 37.0 | 63.0 | 36.0 |
| Precision | 58.0 | 70.0 | 84.0 | 49.0 | 88.0 | 60.0 |
| F-measure | 54.3 | 49.3 | 73.9 | 42.2 | 73.4 | 45.0 |

PubTator: NLM indexing, mention level

| Measure (%) | System alone | System-assisted, Curator 1 | Manual, Curator 1 |
|---|---|---|---|
| Recall | 80.1 | 98.6 | 91.0 |
| Precision | 83.4 | 98.3 | 93.0 |
| F-measure | 81.7 | 98.0 | 92.0 |

PubTator: TAIR indexing, document level

| Measure (%) | System alone | System-assisted, Curator 2 | Manual, Curator 2 |
|---|---|---|---|
| Recall | 76.0 | 90.0 | 91.0 |
| Precision | 73.9 | 77.1 | 75.0 |
| F-measure | 74.9 | 83.0 | 82.0 |

PubTator: TAIR triage

| Measure (%) | System alone | System-assisted, Curator 2 |
|---|---|---|
| Recall | 68.6 | 84.6 |
| Precision | 80.5 | 100.0 |
| F-measure | 74.1 | 92.0 |

PPInterFinder: PPI algorithm alone

| Measure (%) | System alone | System-assisted, Curator 1 | System-assisted, Curator 2 | Manual, Curator 1 | Manual, Curator 2 |
|---|---|---|---|---|---|
| Recall | NR | 69.8 | 63.8 | 72.7 | 79.7 |
| Precision | | 85.7 | 85.7 | 87.0 | 90.4 |
| F-measure | | 76.9 | 73.2 | 79.2 | 84.7 |

PPInterFinder: PPI algorithm (gene mention/gene normalization)

| Measure (%) | System alone | System-assisted, Curator 1 | System-assisted, Curator 2 |
|---|---|---|---|
| Recall | NR | 46.9 | 46.9 |
| Precision | | 85.7 | 85.7 |
| F-measure | | 60.6 | 60.6 |

eFIP: PMID-centric, sentence level

| Measure (%) | System alone | System-assisted, Curator 1 | System-assisted, Curator 2 | Manual, Curator 1 | Manual, Curator 2 |
|---|---|---|---|---|---|
| Recall | NR | 69.2 | 88.2 | 89.5 | 77.8 |
| Precision | | 94.7 | 79.0 | 85.0 | 70.0 |
| F-measure | | 80.0 | 83.3 | 87.2 | 73.7 |

eFIP: gene-centric, document level

| Measure (%) | System alone | System-assisted, Curator 1 | System-assisted, Curator 2 | Manual, Curator 1 | Manual, Curator 2 |
|---|---|---|---|---|---|
| Recall | NR | 78.6 | 85.7 | 100.0 | 77.8 |
| Precision | | 91.7 | 85.7 | 83.3 | 77.8 |
| F-measure | | 84.6 | 85.7 | 90.9 | 77.8 |

eFIP: document ranking

| Measure (%) | Value |
|---|---|
| nDCG | 93–100 |

T-HOD: PMID-centric, sentence level

| Measure (%) | System alone | Curator 1 | Curator 2 | Curator 3 | Curator 4 |
|---|---|---|---|---|---|
| Recall | 70.0 | 56.0 | 22.0 | 24.0 | 42.0 |
| Precision | 79.5 | 32.0 | 26.0 | 40.0 | 42.0 |
| F-measure | 74.5 | 40.0 | 24.0 | 30.0 | 42.0 |

T-HOD: gene-centric, document level

| Measure (%) | System alone | Curator 1 | Curator 2 | Curator 3 | Curator 4 |
|---|---|---|---|---|---|
| Recall | 54.3 | 56.0 | 30.0 | 26.0 | 42.0 |
| Precision | 72.1 | 63.0 | 41.0 | 52.0 | 71.0 |
| F-measure | 62.0 | 59.0 | 35.0 | 35.0 | 53.0 |
(a) The 4-category search uses a 'bag of words' for (1) assay terms, (2) verbs, (3) cellular component terms and (4) gene product names, whereas the 5-category search also includes words for tables and figures. (b) Manual annotations do not necessarily correspond to either the 4- or 5-category search, as curators annotate sentences that fit both criteria. (c) Term-based EQs are entity–quality (EQ) statements created strictly from the original descriptions, independent of any ontologies, whereas label-based EQs are the corresponding formal statements (using ontology terms). (d) The curator ignored an unspecified number of CharaParser proposals to save time. NR: not recorded.
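For reference, the recall, precision and F-measure percentages above follow the standard definitions over true positives (TP), false positives (FP) and false negatives (FN); the nDCG reported for eFIP's document ranking normalizes the discounted cumulative gain of the produced ranking by that of the ideal ranking (one common formulation is shown below; the exact variant used is not specified in this record):

$$R = \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP}, \qquad F_1 = \frac{2PR}{P + R}$$

$$\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}$$

As a quick consistency check, Textpresso's Category 4 sentence-level values give $F_1 = 2(0.379 \times 0.775)/(0.379 + 0.775) \approx 0.509$, matching the 50.9% reported in the table.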
Ratio of time for task completion (manual/system-assisted) and curation time ranges
| System | Ratio, Curator 1 | Ratio, Curator 2 | Ratio, Curator 3 | Ratio, Curator 4 | Manual time range (min) | System-assisted time range (min) |
|---|---|---|---|---|---|---|
| Textpresso | 2.3 | 2.5 (a) | | | 375–692 | 150–297 |
| PCS | 1.0 | 0.8 | | | 135–210 | 165–210 |
| PubTator | 1.8 | 1.7 | | | 83–135 | 49–79 |
| PPInterFinder | 0.9 | NR | | | 58 | 62 |
| eFIP | 2.4 | 2.5 | | | 88–120 | 35–50 |
| T-HOD | 0.9 | 1.3 | 1.2 | 4.0 | 110–140 (b) | 110–120 (b) |
NR: not recorded. (a) Only after getting familiar with the tool. (b) One curator was significantly faster (60 min manually versus 15 min with T-HOD) and is not shown.
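The time ratio is simply the manual curation time divided by the system-assisted time for the same amount of curation, so values above 1 indicate a speed-up. For example, the endpoints of eFIP's reported time ranges are consistent with the 2.4–2.5 ratios shown:

$$\text{ratio} = \frac{t_{\text{manual}}}{t_{\text{assisted}}}, \qquad \frac{88\ \text{min}}{35\ \text{min}} \approx 2.5, \qquad \frac{120\ \text{min}}{50\ \text{min}} = 2.4$$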
Overall rating for each system by category in pre-workshop evaluation (subjective measure: overall median for each section)

| System | Overall evaluation | Task completion | System design | Learnability | Usability | Recommendation |
|---|---|---|---|---|---|---|
| Textpresso | 4.0 | 4.5 | 6.0 | 6.0 | 6.0 | 3.5 |
| PCS | 3.0 | 3.0 | 4.5 | 6.0 | 7.0 | 3.0 |
| PubTator | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 7.0 |
| PPInterFinder | 2.5 | 1.0 | 4.5 | 5.5 | 3.5 | 2.0 |
| eFIP | 5.5 | 6.0 | 6.0 | 6.0 | 6.0 | 5.0 |
| T-HOD | 4.0 | 3.0 | 4.5 | 5.0 | 5.0 | 3.0 |
Median value across the questions linked to each category. Likert scale from 1 (worst) to 7 (best).
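As a minimal sketch of this aggregation, the per-category medians can be computed from raw Likert responses as below; the question-to-category grouping and the scores are illustrative stand-ins, not the actual survey data:

```python
import statistics

# Illustrative grouping of survey questions into rating categories
# (the real grouping is defined by the survey, not reproduced here).
categories = {
    "Usability": ["Q3", "Q5", "Q6"],
    "Learnability": ["Q15"],
}

# Illustrative Likert responses (1-7, worst to best) per question,
# one value per responding biocurator, for a single system.
responses = {
    "Q3": [5, 6, 6],
    "Q5": [4, 6, 5],
    "Q6": [6, 7, 5],
    "Q15": [6, 6, 7],
}

# Overall median for a category = median over all responses to all
# questions linked to that category.
for category, questions in categories.items():
    scores = [score for q in questions for score in responses[q]]
    print(f"{category}: {statistics.median(scores)}")
```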
Degree of correlation of the top 10 questions with the overall satisfaction measure
| Question | Correlation |
|---|---|
| Q4: personal experience | 0.719 |
| Q10: task completion efficiency | 0.622 |
| Q8: task completion speed | 0.569 |
| Q5: power to complete tasks | 0.568 |
| Q9: task completion effectiveness | 0.53 |
| Q23: consistent use of terms | 0.473 |
| Q6: flexibility | 0.443 |
| Q25: helpful error messages | 0.438 |
| Q15: learning to perform tasks | 0.431 |
| Q3: ease of use | 0.431 |
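This record does not state which correlation coefficient was used; the sketch below assumes Pearson correlation between the scores given to one question and the overall satisfaction measure across respondents (the arrays are illustrative, not the actual survey responses):

```python
from scipy.stats import pearsonr

# Illustrative Likert scores (1-7) for one question across respondents,
# paired with each respondent's overall satisfaction score.
q4_personal_experience = [6, 5, 7, 3, 4, 6, 2, 5]
overall_satisfaction = [6, 5, 6, 2, 4, 6, 3, 5]

r, p_value = pearsonr(q4_personal_experience, overall_satisfaction)
print(f"correlation with overall satisfaction: r = {r:.3f} (p = {p_value:.3f})")
```

Because Likert responses are ordinal, a rank-based coefficient such as Spearman's rho (scipy.stats.spearmanr) would be an equally defensible choice.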
Overall rating for each system by category in workshop evaluation (subjective measure: overall median for each section)

| System | Overall evaluation | Task completion | System design | Learnability | Usability | Recommendation |
|---|---|---|---|---|---|---|
| PubTator | 6.0 | 5.5 | 6.0 | 6.0 | 6.5 | 7.0 |
| eFIP | 6.0 | 6.0 | 6.0 | 6.0 | 7.0 | 5.5 |
| Tagtog (a) | 5.0 | 5.0 | 5.0 | 5.0 | 6.0 | 4.5 |
| Textpresso | 4.0 | 5.0 | 5.0 | 5.0 | 6.0 | 4.5 |
| PCS | 4.0 | 3.0 | 6.0 | 6.0 | 6.0 | 4.0 |
| PPInterFinder | 4.0 | 2.5 | 5.0 | 5.0 | 5.0 | 3.0 |
| T-HOD | 4.0 | 3.0 | 4.0 | 5.0 | 5.0 | 3.0 |
Median across the questions linked to each category. Likert scale from 1 (worst) to 7 (best). (a) This system was only reviewed at the workshop.