Qinghua Wang, Shabbir S Abdul, Lara Almeida, Sophia Ananiadou, Yalbi I Balderas-Martínez, Riza Batista-Navarro, David Campos, Lucy Chilton, Hui-Jou Chou, Gabriela Contreras, Laurel Cooper, Hong-Jie Dai, Barbra Ferrell, Juliane Fluck, Socorro Gama-Castro, Nancy George, Georgios Gkoutos, Afroza K Irin, Lars J Jensen, Silvia Jimenez, Toni R Jue, Ingrid Keseler, Sumit Madan, Sérgio Matos, Peter McQuilton, Marija Milacic, Matthew Mort, Jeyakumar Natarajan, Evangelos Pafilis, Emiliano Pereira, Shruti Rao, Fabio Rinaldi, Karen Rothfels, David Salgado, Raquel M Silva, Onkar Singh, Raymund Stefancsik, Chu-Hsien Su, Suresh Subramani, Hamsa D Tadepally, Loukia Tsaprouni, Nicole Vasilevsky, Xiaodong Wang, Andrew Chatr-Aryamontri, Stanley J F Laulederkind, Sherri Matis-Mitchell, Johanna McEntyre, Sandra Orchard, Sangya Pundir, Raul Rodriguez-Esteban, Kimberly Van Auken, Zhiyong Lu, Mary Schaeffer, Cathy H Wu, Lynette Hirschman, Cecilia N Arighi.
Abstract
Fully automated text mining (TM) systems promote efficient literature searching, retrieval, and review but are not sufficient to produce ready-to-consume curated documents. These systems are not meant to replace biocurators, but instead to assist them in one or more literature curation steps. For a tool to be adopted, its user interface is therefore an important aspect to consider. The BioCreative Interactive task (IAT) is a track designed for exploring user-system interactions, promoting development of useful TM tools, and providing a communication channel between the biocuration and the TM communities. In BioCreative V, the IAT track followed a format similar to previous interactive tracks, where the utility and usability of TM tools, as well as the generation of use cases, have been the focal points. The proposed curation tasks are user-centric and formally evaluated by biocurators. In BioCreative V IAT, seven TM systems and 43 biocurators participated. Two levels of user participation were offered to broaden curator involvement and obtain more feedback on usability aspects. The full level participation involved training on the system, curation of a set of documents with and without TM assistance, tracking of time-on-task, and completion of a user survey. The partial level participation was designed to focus on usability aspects of the interface and not the performance per se. In this case, biocurators navigated the system by performing pre-designed tasks and were then asked whether they had been able to complete each task and how difficult it was. In this manuscript, we describe the development of the interactive task, from planning to execution, and discuss major findings for the systems tested.
Database URL: http://www.biocreative.org.
Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.
Year: 2016 PMID: 27589961 PMCID: PMC5009325 DOI: 10.1093/database/baw119
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
IAT activity workflow suggested to biocurators committed to full level participation
| Week | Activity |
|---|---|
| Week 1 | Training with guided exercises with TM team |
| Week 2 | Review of task guidelines with TM team and coordinator. |
| Week 3 | Pre-designed tasks exercise |
| Week 4 | 1 h annotation (non-TM assisted) and 1 h annotation (TM-assisted) |
| Week 5 | 1 h annotation (non-TM assisted) and 1 h annotation (TM-assisted) |
| Week 6 | Survey and submission of data |
The schedule was presented to teams and curators as a guide to plan the different steps of the IAT activity. It was important to follow the order of these steps, whereas the time devoted to each could vary depending upon the curator’s availability. However, by the end of Week 6 all surveys and data should be submitted.
Summary of IAT participating systems
| System | Description | Bioconcepts | Link to Standards | Curation workflow step | Relations captured | Text | Browser |
|---|---|---|---|---|---|---|---|
| Argo | Curation of phenotypes relevant to chronic obstructive pulmonary disease (COPD) | Gene/protein, medical condition, sign/symptom, drug | UniProt, UMLS, ChEBI | Entity Detection, Relation/Evidence | COPD-medical condition, COPD-drug, COPD-protein, COPD-sign/symptom | full-text | Chrome, Firefox, Safari |
| BELIEF | Semi-automated curation interface which supports relation extraction and encoding in the modeling language BEL (Biological Expression Language) | Gene/protein, disease, chemical, biological processes | HGNC/MGI/RGD, MeSH Diseases Branch, ChEBI, GO-Biological Process, GO-Complex, Selventa Protein/Family Names | Entity Detection, Relation/Evidence | Relations expressed in BEL. Relations can be expressed between all of the detected entity types | abstract, full-text | Chrome, Firefox |
| egas | Identification of clinical attributes associated with human inherited gene mutations, described in PubMed abstracts | Gene/protein, mutation, disorder/disease, zygosity, penetrance, ethnicity | HGNC, OMIM, Human Phenotype Ontology, NCI Thesaurus | Entity Detection, Relation/Evidence | gene/protein-mutation, gene/protein-disease, mutation-zygosity, mutation-penetrance | abstract | Chrome, Firefox, Safari |
| EXTRACT | Lists the environment type and organism name mentions identified in a given piece of text | Environment, organism, tissue, disease | Environment Ontology, NCBI taxonomy, BRENDA tissue ontology, Disease ontology | Entity Detection | | text snippets | Chrome, Firefox |
| GenDisFinder | Knowledge discovery of known/novel human gene-disease associations (GDAs) from biomedical literature | Gene, disease | EntrezGene, OMIM | Triage, Entity Detection, Relation/Evidence | gene/protein-disease, GDA-related action words and network association type | abstract | Chrome, Explorer, Firefox, Safari |
| MetastasisWay (MET) | Looks for the biomedical concepts and relations associated with metastasis and then constructs the metastasis pathway | Gene/protein, metastasis, cancer, tissue, body part, microRNA, gene expression, cell line, experimental techniques | EntrezGene, Disease Ontology, MirTarBase | Entity Detection, Relation/Evidence | positive and negative regulations between biomedical concepts associated with metastasis | abstract | Chrome |
| Ontogene | Curation of bioconcepts, such as miRNA, gene, disease and chemicals, and their relations | microRNA, gene/protein, disease, organism | RegulonDB ID, CTD, NCBI taxonomy | Entity Detection | | full-text | Chrome, Firefox, Safari |
The columns from left to right indicate: (i) name of the system, (ii) description of the system in relation to the biocuration task proposed, (iii) bioconcepts (what entities are detected, e.g. gene, disease), (iv) standards adopted by the system to link the bioconcepts detected to corresponding databases and ontologies, (v) what step of the literature curation workflow the system helps with, (vi) the relations captured include relations between bioconcepts (e.g. relation between gene-disease), (vii) the text column lists what type of text the systems are able to process and (viii) the browser column indicates the system compatibility with web browsers.
Figure 1. Distribution of biocurators (A) by geographic area, (B) by type of database/institution, and (C) by level of participation. A total of 43 biocurators participated in this activity. Note that the total number in (C) is higher because some biocurators tested more than one system, and all curators participated in the partial activity.
Results on task completion in the pre-designed tasks for each system
| System | Task | % users completed task | % found it difficult (of those who completed task) | % not-confident (of those who completed task) |
|---|---|---|---|---|
| Argo | TASK1-Launching Argo | 100 | 0 | 0 |
| Argo | TASK2-Find the page with tutorial for curation task | 80 | 0 | 0 |
| Argo | TASK3-Managing files in Argo | 100 | 0 | 0 |
| Argo | TASK4-Open a file | 80 | 25 | 0 |
| Argo | TASK5-Edit annotations | 80 | 25 | 0 |
| Argo | TASK6-Saving annotations | 80 | 25 | 0 |
| BELIEF | TASK1-Find information about BEL | 100 | 13 | 13 |
| BELIEF | TASK2-Find and open project. Understanding content of page | 100 | 0 | 13 |
| BELIEF | TASK3-Edit the BEL statements and select for export | 75 | 33 | 17 |
| BELIEF | TASK4-Export the document | 100 | 0 | 13 |
| BELIEF | TASK5-Add document to project | 88 | 14 | 0 |
| egas | TASK1-Log in and access the project | 100 | 0 | 0 |
| egas | TASK2-Find project status (private vs public) | 89 | 0 | 13 |
| egas | TASK3-Finding help | 100 | 0 | 0 |
| egas | TASK4-Edit annotation | 100 | 0 | 0 |
| egas | TASK5-Export and opening file | 33 | 0 | 0 |
| EXTRACT | TASK1-Install bookmarklet | 100 | 0 | 0 |
| EXTRACT | TASK2-Extract on a piece of text | 100 | 0 | 0 |
| EXTRACT | TASK3-Review annotations and information | 90 | 0 | 0 |
| EXTRACT | TASK4-Save Extract table | 100 | n/a | n/a |
| EXTRACT | TASK5-Finding help | 100 | 0 | 0 |
| GenDisFinder | TASK1-Find information on format | 100 | 0 | 0 |
| GenDisFinder | TASK2-Find GenDisFinder gene-disease associations in a given abstract | 33 | 0 | 0 |
| GenDisFinder | TASK3-Understand annotations and network | 33 | 0 | 0 |
| GenDisFinder | TASK4-Edit annotation | 56 | 20 | 20 |
| GenDisFinder | TASK5-Export annotation | 67 | n/a | n/a |
| MetastasisWay | TASK1-Register and install MAT | 82 | 33 | 22 |
| MetastasisWay | TASK2-Find information about vocabularies used* | 89 | 13 | 50 |
| MetastasisWay | TASK3-Review and edit annotations* | 67 | 17 | 17 |
| MetastasisWay | TASK4-Save annotation* | 89 | n/a | n/a |
| Ontogene | TASK1-Open a document in Ontogene | 100 | 10 | 0 |
| Ontogene | TASK2-Find information about panels | 100 | 10 | 0 |
| Ontogene | TASK3-Using filters in panels | 100 | 0 | 0 |
| Ontogene | TASK4-Validate annotation | 80 | 0 | 0 |
| Ontogene | TASK5-Export annotations | 100 | 0 | 0 |

*Calculations based on the 9 curators who were able to install the application.
For each system, a series of tasks were presented to the biocurators via the SurveyMonkey interface followed by questions to address task completion, difficulty of the task and confidence on the task. Based on the responses we calculated the percentage (%) of users that completed each task; the percentage that found the task difficult even when they were able to finish it; and the percentage who felt not-confident about their task performance. n/a means not applicable, that is we did not ask the question for that particular task.
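The three percentages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TaskResponse` record, its field names, and the example answers are hypothetical and are not taken from the survey data; rounding to whole percentages is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResponse:
    """One curator's answers for one pre-designed task (hypothetical record)."""
    completed: bool
    difficult: bool = False       # only counted when the task was completed
    not_confident: bool = False   # only counted when the task was completed


def pct(part: int, whole: int) -> Optional[int]:
    """Whole-number percentage, or None when there is nothing to divide by."""
    return round(100 * part / whole) if whole else None


def summarize_task(responses: List[TaskResponse]) -> dict:
    """Completion rate over all users; difficulty and confidence over completers only."""
    finished = [r for r in responses if r.completed]
    return {
        "completed": pct(len(finished), len(responses)),
        "difficult": pct(sum(r.difficult for r in finished), len(finished)),
        "not_confident": pct(sum(r.not_confident for r in finished), len(finished)),
    }


# Hypothetical example: five curators, four finish, one of those found it difficult.
example = [
    TaskResponse(completed=True),
    TaskResponse(completed=True, difficult=True),
    TaskResponse(completed=True),
    TaskResponse(completed=True),
    TaskResponse(completed=False),
]
print(summarize_task(example))
```

Computing the difficulty and confidence percentages over completers only is what makes a row such as 80 / 25 / 0 possible with five respondents.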
Figure 2. Pooled responses to questions related to system perception of usability from the pre-designed task activity.
Figure 3. Plot of the Net Promoter Score (NPS, bars) and the median system rating (dots) for each system. The y-axis indicates whether the NPS and the median are positive (NPS > 0; system rating median > 3) or negative (NPS < 0; system rating median < 3). The NPS is shown as bars, with white and grey indicating positive and negative scores, respectively. The median system rating is shown as a black dot, with a dotted line extending from the minimum to the maximum value for the sample.
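For readers unfamiliar with the NPS, a minimal sketch of the conventional calculation follows. It assumes the standard 0 to 10 "how likely are you to recommend" question, with 9 and 10 counted as promoters and 0 to 6 as detractors; the source does not state the exact scale or cut-offs used in the IAT survey, and the example ratings are made up.

```python
from typing import Sequence


def net_promoter_score(ratings: Sequence[int]) -> float:
    """Conventional NPS: percentage of promoters minus percentage of detractors.

    Assumes a 0-10 recommendation scale (9-10 promoters, 0-6 detractors);
    these cut-offs are the usual convention, not something stated in the source.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Made-up ratings: three promoters, one passive, one detractor.
print(net_promoter_score([10, 9, 9, 8, 5]))
```

The result ranges from -100 (all detractors) to 100 (all promoters), which is why the figure only needs to distinguish scores above and below zero.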
Ontogene metrics from full level evaluation
| Performance | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Curators | Annotation | Non-TM assisted | TM assisted | | | | | |
| 3 | #articles/day | 1 | 12 | | | | | |
| Survey | median | Q1 | min | max | Q3 | | Ave. | St. Dev. |
| Task | 4 | 3 | 3 | 5 | 5 | SUS | 91.67 | 4.44 |
| Design | 4 | 3.75 | 3 | 5 | 4.25 | Usability | 90.62 | 6.25 |
| Usability | 3 | 3 | 3 | 4 | 4 | Learnability | 95.83 | 5.55 |
The upper half of the table shows the number of curators involved in the evaluation, and the throughput (number (#) of articles curated per day) without or with the assistance of TM. The lower half of the table shows the central tendency of the survey results for the pool of questions related to ability to complete the task (Task), Design of the interface (Design) and Usability. The responses were converted to a numeric scale from 1 (most negative response) to 5 (most positive response). To give an idea of the response distribution, the central tendency is described with the median along with the minimum (min) and maximum (max) values and the lower (Q1) and upper (Q3) quartiles. In addition, the average system usability score (SUS) from the SUS questionnaire and its breakdown into the usability (all questions but 4 and 10) and learnability (questions 4 and 10) questions are shown on the lower right. A score higher than 68 means the system scored better than average (other benchmarked systems).
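The SUS figures can be reproduced from raw questionnaire answers with the standard scoring rules. The sketch below assumes the usual ten-item SUS form (odd-numbered items positively worded, even-numbered items negatively worded, each answered 1 to 5) and the common rescaling of the two subscales to 0 to 100; the source only says which questions belong to each subscale, so the rescaling constants are an assumption about how the breakdown was obtained.

```python
from typing import Dict, Sequence

LEARNABILITY_ITEMS = {4, 10}  # as stated in the table notes; the rest are usability


def sus_scores(answers: Sequence[int]) -> Dict[str, float]:
    """Score one respondent's ten SUS answers (each 1-5, in questionnaire order)."""
    if len(answers) != 10 or any(a not in range(1, 6) for a in answers):
        raise ValueError("expected ten answers, each between 1 and 5")
    # Odd items contribute (answer - 1); even items contribute (5 - answer).
    contrib = [
        (a - 1) if item % 2 == 1 else (5 - a)
        for item, a in enumerate(answers, start=1)
    ]
    learn = sum(c for item, c in enumerate(contrib, start=1) if item in LEARNABILITY_ITEMS)
    usab = sum(contrib) - learn
    return {
        "sus": sum(contrib) * 2.5,        # 10 items, max 40 points -> 0-100
        "usability": usab * 3.125,        # 8 items, max 32 points -> 0-100
        "learnability": learn * 12.5,     # 2 items, max 8 points -> 0-100
    }


# Hypothetical respondent who mildly agrees with positive items
# and mildly disagrees with negative ones.
print(sus_scores([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```

Per-system averages and standard deviations like those in the table would then be taken over the respondents' individual scores, and the overall value compared against the benchmark of 68.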
Figure 4. Scores for usability and learnability for each system. The SUS score (black) encompasses 10 standard questions; questions 4 and 10 relate to learnability (light grey), whereas the others relate to usability (dark grey). Standard deviations are shown. The dashed line indicates the average SUS score of 68.
Argo metrics from full level evaluation
| Performance | | Ave. # documents/hour | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Curators | Annotation | Non-TM assisted | TM assisted | Ave. IAA | | | | |
| 5 | concept | 9 | 14 | 68.12% | | | | |
| | relation | 25 | 35 | | | | | |
| Survey | median | Q1 | min | max | Q3 | | Ave. | St. Dev. |
| Task | 4 | 4 | 2 | 5 | 4.5 | SUS | 71 | 3.6 |
| Design | 3 | 3 | 2 | 4 | 4 | Usability | 72.5 | 3.5 |
| Usability | 4 | 3 | 2 | 5 | 4 | Learnability | 65 | 8 |
The upper half of the table shows the number of curators involved in the evaluation, the throughput (average number (#) of curated documents per hour) without or with the assistance of TM, and the average inter-annotator agreement (IAA). The lower half of the table shows the central tendency of the survey results for the pool of questions related to ability to complete the task (Task), Design of the interface (Design) and Usability. The responses were converted to a numeric scale from 1 (most negative response) to 5 (most positive response). To give an idea of the response distribution, the central tendency is described with the median along with minimum (min) and maximum (max) values, respectively, and the lower (Q1) and upper (Q3) quartiles, respectively. In addition, the average system usability score (SUS) from the SUS questionnaire and its breakdown into the usability (all questions except 4 and 10) and learnability (questions 4 and 10) questions are shown on the lower right. A score higher than 68 means the system scored better than average (other benchmarked systems).
BELIEF metrics from full level evaluation
| Performance | Ave. # documents/hour | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Curators | Non-TM assisted | TM assisted | | | | | | |
| 6 | 4 | 4 | | | | | | |
| Survey | median | Q1 | min | max | Q3 | | Ave. | St. Dev. |
| Task | 4 | 3 | 2 | 5 | 4 | SUS | 66.67 | 15.28 |
| Design | 3.5 | 3 | 2 | 5 | 4 | Usability | 67.19 | 13.54 |
| Usability | 3 | 3 | 2 | 4 | 4 | Learnability | 64.58 | 31.25 |
The number of documents per hour was rounded up.
The upper half of the table shows the number of curators involved in the evaluation, and the throughput (average number (#) of curated documents per hour) without or with the assistance of TM. The lower half of the table shows the central tendency of the survey results for the pool of questions related to ability to complete the task (Task), Design of the interface (Design) and Usability. The responses were converted to a numeric scale from 1 (most negative response) to 5 (most positive response). To give an idea of the response distribution, the central tendency is described with the median along with the minimum (min) and maximum (max) values and the lower (Q1) and upper (Q3) quartiles. In addition, the average system usability score (SUS) from the SUS questionnaire and its breakdown into the usability (all questions but 4 and 10) and learnability (questions 4 and 10) questions are shown on the lower right. A score higher than 68 means the system scored better than average (other benchmarked systems).
Egas metrics from full level evaluation
| Performance | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Curators | Annotation | Non-TM assisted | TM assisted | Ave. IAA | | | | |
| 7 | concept | 664 | 744 | 74% | | | | |
| | relation | 157 | 217 | | | | | |
| | time/article (seconds) | 245 | 219 | | | | | |
| | time/concept (seconds) | 13.1 | 10.8 | | | | | |
| Survey | median | Q1 | min | max | Q3 | | Ave. | St. Dev. |
| Task | 4 | 4 | 3 | 5 | 5 | SUS | 77.14 | 9.69 |
| Design | 4 | 4 | 3 | 5 | 5 | Usability | 76.34 | 9.18 |
| Usability | 4 | 3 | 3 | 5 | 4 | Learnability | 80.36 | 13.26 |
Seven curators participated in the full activity. Two of them annotated only a small portion of the corpus (8–13 documents), so their annotations were excluded from the annotation metrics, but their responses were included in the survey.
The upper half of the table shows the number of curators involved in the evaluation, the throughput (number of concepts and relations annotated, and average time per article or per concept annotated) without or with the assistance of TM, and the average inter-annotator agreement (IAA). The lower half of the table shows the scores for the survey results for the pool of questions related to ability to complete the task (Task), Design of the interface (Design) and Usability. The scale ran from 1 (most negative response) to 5 (most positive response). To give an idea of the response distribution, the scores are shown as the median along with the minimum (min) and maximum (max) values and the lower (Q1) and upper (Q3) quartiles. In addition, the average system usability score (SUS) from the SUS questionnaire and its breakdown into the usability (all questions but 4 and 10) and learnability (questions 4 and 10) questions are shown on the lower right.
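To make the effect of TM assistance easier to read off the Egas table, the relative change between the two conditions can be worked out directly from the reported figures. The helper below is an illustrative sketch (the function name is not from the source); it says nothing about statistical significance, which the table does not report.

```python
def relative_change(baseline: float, assisted: float) -> float:
    """Percentage change from the non-TM-assisted value to the TM-assisted value."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100 * (assisted - baseline) / baseline


# Values copied from the Egas table: (non-TM assisted, TM assisted).
egas = {
    "concepts annotated": (664, 744),
    "relations annotated": (157, 217),
    "time/article (s)": (245, 219),
    "time/concept (s)": (13.1, 10.8),
}

for metric, (non_tm, tm) in egas.items():
    print(f"{metric}: {relative_change(non_tm, tm):+.1f}%")
```

With these inputs the two time measures come out negative (less time per article and per concept with TM assistance) and the two annotation counts come out positive.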
EXTRACT metrics from full level evaluation
| Survey | median | Q1 | min | max | Q3 | | Ave. | St. Dev. |
|---|---|---|---|---|---|---|---|---|
| Task | 4 | 3.25 | 1 | 4 | 4 | SUS | 77.5 | 20.0 |
| Design | 4.25 | 3.75 | 2 | 5 | 5 | Usability | 76.6 | 20.3 |
| Usability | 4 | 4 | 4 | 4 | 5 | Learnability | 81.2 | 18.7 |
The table shows the central tendency of the survey results for the pool of questions related to ability to complete the task (Task), Design of the interface (Design) and Usability. The responses were converted to a numeric scale from 1 (most negative response) to 5 (most positive response). To give an idea of the response distribution, the central tendency is described with the median along with minimum (min) and maximum (max) values, respectively, and the lower (Q1) and upper (Q3) quartiles, respectively. In addition, the average system usability score (SUS) from the SUS questionnaire and its breakdown into the usability (all questions but 4 and 10) and learnability (questions 4 and 10) questions are shown on the lower right. A score higher than 68 means the system scored better than average (other benchmarked systems).
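The conversion and summary described above can be sketched as follows. The answer labels in `LIKERT` and the example responses are hypothetical, since the source does not list the exact wording of the survey options, and the quartile convention (`method="inclusive"`) is an assumption; other conventions can give slightly different Q1 and Q3 values on small samples.

```python
import statistics
from typing import Dict, Sequence

# Hypothetical answer labels mapped to the 1 (most negative) to 5 (most positive) scale.
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def summarize_responses(responses: Sequence[str]) -> Dict[str, float]:
    """Median, quartiles, and range of pooled survey answers on the 1-5 scale."""
    scores = sorted(LIKERT[r.lower()] for r in responses)
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"median": median, "Q1": q1, "min": scores[0], "max": scores[-1], "Q3": q3}


# Made-up pooled answers for one question group.
pooled = ["Agree", "Neutral", "Strongly agree", "Agree", "Disagree"]
print(summarize_responses(pooled))
```

Reporting the median with quartiles and range, rather than a mean, suits ordinal answers from a handful of curators, which is presumably why the tables are laid out this way.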
GenDisFinder metrics from full level evaluation
| Survey | median | Q1 | min | max | Q3 | | Ave. | St. Dev. |
|---|---|---|---|---|---|---|---|---|
| Task | 3 | 3 | 3 | 3 | 3 | SUS | 57.50 | n/a |
| Design | 3.5 | 3 | 3 | 4 | 4 | Usability | 53.12 | n/a |
| Usability | 3 | 3 | 3 | 3 | 3 | Learnability | 75.00 | n/a |
The table shows the central tendency of the survey results for the pool of questions related to ability to complete the task (Task), Design of the interface (Design) and Usability. The responses were converted to a numeric scale from 1 (most negative response) to 5 (most positive response). To give an idea of the response distribution, the central tendency is described with the median along with minimum (min) and maximum (max) values, respectively, and the lower (Q1) and upper (Q3) quartiles, respectively. In addition, the average system usability score (SUS) from the SUS questionnaire and its breakdown into the usability (all questions but 4 and 10) and learnability (questions 4 and 10) questions are shown on the lower right. A score higher than 68 means the system scored better than average (other benchmarked systems).
MetastasisWay metrics from full level evaluation
| Performance | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Curators | Annotation | Non-TM assisted | TM assisted | | | | | |
| 6 | #abstracts Week1 | 46 | 40 | | | | | |
| | #abstracts Week2 | 49 | 44 | | | | | |
| Survey | median | Q1 | min | max | Q3 | | Ave. | St. Dev. |
| Task | 4 | 3.25 | 1 | 5 | 4 | SUS | 68.75 | 5.41 |
| Design | 4 | 4 | 3 | 5 | 5 | Usability | 68.75 | 7.29 |
| Usability | 4 | 3 | 2 | 5 | 4 | Learnability | 68.75 | 14.58 |
The upper half of the table shows the number of curators involved in the evaluation, the throughput (number (#) of abstracts annotated per week) without or with the assistance of TM. The lower half of the table shows the central tendency of the survey results for the pool of questions related to ability to complete the task (Task), Design of the interface (Design) and Usability. The responses were converted to a numeric scale from 1 (most negative response) to 5 (most positive response). To give an idea of the response distribution, the central tendency is described with the median along with minimum (min) and maximum (max) values, respectively, and the lower (Q1) and upper (Q3) quartiles, respectively. In addition, the average system usability score (SUS) from the SUS questionnaire and its breakdown into the usability (all questions but 4 and 10) and learnability (questions 4 and 10) questions are shown on the lower right. A score higher than 68 means the system scored better than average (other benchmarked systems).
Figure 5. Usage of standards/databases proposed by the systems. The table describes most of the bioentities and standards/databases proposed by the different systems, and the bar graphs show the number of IAT evaluators using each standard/database. Note that environment is a specialized bioentity type which is only used by the microbial and metagenomics communities. Data from 25 users.