Cecilia N Arighi, Phoebe M Roberts, Shashank Agarwal, Sanmitra Bhattacharya, Gianni Cesareni, Andrew Chatr-Aryamontri, Simon Clematide, Pascale Gaudet, Michelle Gwinn Giglio, Ian Harrow, Eva Huala, Martin Krallinger, Ulf Leser, Donghui Li, Feifan Liu, Zhiyong Lu, Lois J Maltais, Naoaki Okazaki, Livia Perfetto, Fabio Rinaldi, Rune Sætre, David Salgado, Padmini Srinivasan, Philippe E Thomas, Luca Toldo, Lynette Hirschman, Cathy H Wu.
Abstract
BACKGROUND: The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus, in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested.
Year: 2011 PMID: 22151968 PMCID: PMC3269939 DOI: 10.1186/1471-2105-12-S8-S4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Members of the UAG represent a diverse sample of end users with multiple text mining needs
| Domains represented by UAG members and Chair* | |
|---|---|
| Model Organism Databases | dictyBase, MGI, TAIR, Gramene, Wormbase |
| Protein Sequence Databases | UniProtKB |
| Protein-Protein Interaction Databases | BioGrid, MINT |
| Ontologies | Gene Ontology, Protein Ontology, Plant Ontology, Microbial Phenotype Ontology |
| Pharmaceutical Companies | Dupont, Merck KGaA, Pfizer |
*Note that some members represent more than one resource
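Gene centrality assignment by a subset of UAG members (9) on two selected articles.
| Gene (species) | Entrez Gene ID | Central vote | Gene (species) | Entrez Gene ID | Central vote |
|---|---|---|---|---|---|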
| ypr152c (yeast) | 856275 | 5 | |||
| DBP2 (yeast) | 855611 | 2 | |||
| ECM33 (yeast) | 852370 | 2 | |||
| pRB (human) | 5925 | 5 | Clf1 (yeast) | 850808 | 1 |
| CD71 (mouse) | 22042 | 4 | CA150 (human) | 10915 | 1 |
| c-kit (mouse) | 16590 | 4 | |||
| ter119 (mouse) | 104231 | 4 | |||
| pcna (mouse) | 18538 | 3 | |||
| p107 (mouse) | 18148 | 3 | |||
| beta-actin (mouse) | 11461 | 3 | |||
| eGFP (B. cereus) | 8382257 | 1 | |||
The consensus genes that were considered central are in bold. Central vote is the number of UAG members who selected the given gene as central.
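The vote counts above can be turned into a consensus list with a short script. This is a minimal sketch, assuming that "consensus" means a strict majority of the nine voting UAG members (later captions describe central genes as determined by majority vote); the function name `central_genes` is illustrative and not part of the original study.

```python
def central_genes(votes: dict[str, int], n_voters: int) -> list[str]:
    """Return genes chosen as central by a strict majority of voters.

    Assumption: consensus = more than half of the voters.
    """
    return [gene for gene, count in votes.items() if count > n_voters / 2]


# Central votes copied from the table above (nine UAG members voted).
votes = {
    "ypr152c": 5, "DBP2": 2, "ECM33": 2, "Clf1": 1, "CA150": 1,
    "pRB": 5, "CD71": 4, "c-kit": 4, "ter119": 4,
    "pcna": 3, "p107": 3, "beta-actin": 3, "eGFP": 1,
}

consensus = central_genes(votes, n_voters=9)
print(consensus)
```

Under this assumed threshold, only genes with at least five of nine votes qualify; a different definition of consensus would change the cutoff but not the structure of the calculation.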
Overview of the major features offered by IAT systems.
| Team | Team 61 | Team 65 | Team 68 | Team 78 | Team 89 | Team 93 |
|---|---|---|---|---|---|---|
| | No | Yes | Yes | Yes | Yes | Yes |
| | Yes | Yes | Yes | Yes | Yes | Yes |
| | Text | PMCID or PMID | PMCID or PMID | PMCID | PMCID | PMCID or PMID |
| | No | Yes | Yes | Yes | Yes | Yes |
| | No | Yes | Yes | Yes | Yes | Yes |
| | No | Yes | Yes | No | Yes | Yes |
| | No | Yes | Yes | Yes | Yes | No |
| | Remove and add both species and genes | Remove species and genes/proteins. Add species and genes that can be associated with a term in document | No | No | Remove and add genes | Remove a gene or add it back (cannot add new) |
| | UniProt, NCBI taxonomy, and MIM | Entrez Gene, UniProt | Entrez Gene, KEGG, UniProt, Interpro, GO, DIP, Intact, MIPS, MINT, HPRD, dbSNP | Entrez Gene, NCBI taxonomy | Entrez Gene | Entrez Gene, NCBI taxonomy |
| | Multiple boxes with abstract, species and protein information | Panels with information linked interactively | Panels with information linked interactively | Panels with linked information | Table | Summary Table |
| | Saves tagged abstract | Tabular format | Tabular format | Tabular format, need to specify before querying | Tabular format | Tabular format |
Figure 1. Usability and performance assessment survey results. Note that only selected questions are shown in graph format. Results are shown as the number of UAG members who selected a particular response.
Example of an article that presents name ambiguity between gene names, and between a gene name and a term from another domain (PMC2275796).
| Gene ID | Gene names | Species | Central Vote | Curated Output (a) | Raw output, Team 78 | Raw output, Team 68 | Raw output, Team 65 | Raw output, Team 93 | Raw output, Team 89 |
|---|---|---|---|---|---|---|---|---|---|
| 56606 | GLUT9/SLC2A9 | human | 7 | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C |
| 9948 | WDR1/AIP1 | human | | Y | Y | Y | Y | Y | - |
| Some examples of ambiguity found in system’s output | |||||||||
| 11182 | GLUT9/SLC2A6 | human | | | | N, C | N, C | N, C | |
| CAD | N | N | |||||||
| MI | N | N | |||||||
| 139741 | MAGI2/AIP1 | human | N | N | N | ||||
| Total genes detected | | | | 2 | 6 | 4 | 44 | 4 | 15 |
| FP (all genes in the article) | | | | 0 | 4 | 2 | 42 | 2 | 14 |
| FN (all genes in the article) | | | | 0 | 0 | 0 | 0 | 0 | 1 |
| TP (all genes in the article) | | | | 2 | 2 | 2 | 2 | 2 | 1 |
| Precision (all genes in the article) | | | | 1 | 0.33 | 0.50 | 0.05 | 0.50 | 0.07 |
| Recall (all genes in the article) | | | | 1 | 1 | 1 | 1 | 1 | 0.5 |
| Total central genes | | | | 1 | 1 | 2 | 2 | 2 | 1 |
| FP (central genes) (b) | | | | 0 | 0 | 1 | 1 | 1 | 0 |
| FN (central genes) | | | | 0 | 0 | 0 | 0 | 0 | 0 |
| TP (central genes) | | | | 1 | 1 | 1 | 1 | 1 | 1 |
| Precision (central genes) | | | | 1 | 1 | 0.50 | 0.50 | 0.50 | 1 |
| Recall (central genes) | | | | 1 | 1 | 1 | 1 | 1 | 1 |
List of Entrez Gene IDs, gene names and species found in PMC2275796. The Central Vote column indicates the number of curators who selected the gene as central. "Y": the gene mentioned in the article was detected; "-": the gene mentioned was missed; "N": the entity detected was not a gene or was the wrong gene; "C": central gene as determined by majority vote, which for the systems means that the gene was ranked high (higher than non-central genes); "Total genes detected": all gene mentions provided by a given system (what the system considered a gene). FP and FN stand for false positive and false negative, respectively. (a) Curated output by manual curation (2 curators) and system-assisted curation (5 curators) was identical, so it is shown as a single column. (b) The FP for central gene performance was calculated by comparing the list of manually curated central genes with the gene ranking by the system. If any non-central gene is ranked higher than a central one, it is considered a FP.
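The precision and recall rows can be recomputed from the TP, FP and FN counts, and footnote (b) describes a ranking-based rule for central-gene false positives. The sketch below illustrates both. The counts are copied from the table for PMC2275796; the function names and the example ranking are hypothetical (the table does not report actual rank positions), and reading the footnote as "count non-central entries placed above the lowest-ranked central gene" is an assumption.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported genes that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true genes that were reported: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def central_false_positives(ranking: list[str], central: set[str]) -> int:
    """Count non-central entries ranked above the lowest-ranked central gene.

    Assumed reading of footnote (b); returns 0 if no central gene is ranked.
    """
    positions = [i for i, gene in enumerate(ranking) if gene in central]
    if not positions:
        return 0
    return sum(1 for gene in ranking[: positions[-1]] if gene not in central)


# (TP, FP, FN) per team for all genes in PMC2275796, copied from the table.
raw_counts = {
    "78": (2, 4, 0),
    "68": (2, 2, 0),
    "65": (2, 42, 0),
    "93": (2, 2, 0),
    "89": (1, 14, 1),
}

scores = {
    team: (round(precision(tp, fp), 2), round(recall(tp, fn), 2))
    for team, (tp, fp, fn) in raw_counts.items()
}
print(scores)

# Hypothetical ranking in which an ambiguous match outranks the central gene.
example_ranking = ["SLC2A6", "SLC2A9", "WDR1"]
print(central_false_positives(example_ranking, central={"SLC2A9"}))
```

The rounded scores agree with the precision and recall rows reported for the five systems, which is a useful consistency check on the realigned table.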
Example of an article containing multiple gene and species mentions (PMC2680910)
| Gene ID | Gene names | Species | Central Vote | Curated output 1 (a) | Curated output 2 | Curated output 3 | Curated output 4 | Curated output 5 | Raw output, Team 78 | Raw output, Team 68 | Raw output, Team 65 | Raw output, Team 93 | Raw output, Team 89 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10015 | ALIX | human | 7 | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C |
| 57630 | POSH | human | 7 | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C | Y, C |
| 155030 | Gag | HIV-1 | 6 | Y, C | Y, C | - | Y, C | Y, C | Y, C | - | - | Y, C | - |
| 36990 | POSH | Drosophila | | Y | Y | Y | Y | Y | Y | Y | - | Y | Y |
| 43330 | ALIX | Drosophila | | Y | Y | Y | Y | Y | Y | Y | - | - | Y |
| 128866 | CHMP4B | human | | Y | Y | Y | Y | - | Y | - | Y | - | Y |
| 39659 | TAK-1 | Drosophila | | Y | Y | Y | Y | Y | - | Y | - | - | Y |
| 3355106 | ALG-2 | Drosophila | | Y | Y | Y | Y | Y | - | - | - | Y | - |
| 7323 | UbcH5c | human | | Y | Y | Y | Y | - | - | Y | Y | - | - |
| 1489984 | p9 | EIAV | | Y | Y | Y | Y | - | - | - | - | - | - |
| 137492 | HCRP1 | human | | Y | Y | Y | Y | - | Y | - | Y | - | - |
| 7251 | TSG101 | human | | Y | Y | Y | Y | - | Y | - | Y | - | - |
| 155030 | p6 | HIV-1 | | Y | - | Y | Y | - | - | - | - | - | - |
| 7334 | UBC13 | human | 1 | Y | - | Y, C | Y | - | Y | Y | Y | - | - |
| Total genes detected | | | | 14 | 19 | 13 | 26 | 10 | 90 | 22 | 120 | 9 | 52 |
| FP | | | | 0 | 5 | 0 | 0 | 3 | 81 | 15 | 113 | 4 | 46 |
| FN | | | | 0 | 2 | 1 | 0 | 7 | 5 | 7 | 7 | 8 | 8 |
| TP | | | | 14 | 12 | 13 | 14 | 7 | 9 | 7 | 7 | 5 | 6 |
| Precision | | | | 1.00 | 0.71 | 1.00 | 1.00 | 0.70 | 0.10 | 0.32 | 0.06 | 0.56 | 0.12 |
| Recall | | | | 1.00 | 0.86 | 0.93 | 1.00 | 0.50 | 0.64 | 0.50 | 0.50 | 0.38 | 0.43 |
List of Entrez Gene IDs, gene names and species found in PMC2680910. The Central Vote column indicates the number of curators who selected the gene as central. "Y": the gene mentioned in the article was detected; "-": the gene mentioned was missed; "C": central gene as determined by majority vote, which for the systems means that the gene was ranked high (higher than non-central genes); "Total genes detected": all gene mentions provided by a given system (what the system considered a gene). FP and FN stand for false positive and false negative, respectively. (a) Curated output by manual curation (2 curators, columns 1-2) and system-assisted curation (5 curators, of which 3 are shown, columns 3-5).
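Because this table lists every gene counted for the article, the TP and FN totals of a column can be tallied directly from its detection marks. The sketch below does this for two columns, with marks transcribed in row order from the table; the helper name `tally` is illustrative, and false positives cannot be derived this way because the spurious entities are not listed.

```python
def tally(marks: list[str]) -> tuple[int, int]:
    """Return (TP, FN) from per-gene marks: 'Y' or 'Y, C' = detected, '-' = missed."""
    tp = sum(1 for mark in marks if mark.startswith("Y"))
    fn = sum(1 for mark in marks if mark == "-")
    return tp, fn


# Marks for the 14 listed genes, top to bottom, transcribed from the table.
curated_5 = ["Y, C", "Y, C", "Y, C", "Y", "Y", "-", "Y",
             "Y", "-", "-", "-", "-", "-", "-"]
team_68 = ["Y, C", "Y, C", "-", "Y", "Y", "-", "Y",
           "-", "Y", "-", "-", "-", "-", "Y"]

for label, marks in [("curated output 5", curated_5), ("team 68", team_68)]:
    tp, fn = tally(marks)
    print(label, tp, fn, round(tp / (tp + fn), 2))
```

Both columns give 7 detected and 7 missed genes, matching the TP, FN and recall values reported for them in the summary rows.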
Example of an article where a new gene name is introduced (PMC2764847).
| Gene ID | Gene name | Species | Central Vote | Curated Output (a) | Raw output, Team 78 | Raw output, Team 68 | Raw output, Team 65 | Raw output, Team 93 | Raw output, Team 89 |
|---|---|---|---|---|---|---|---|---|---|
| 828316 | AtIscU1 | A. thaliana | 9 | Y, C | - | - | - | - | - |
| 829947 | AtHscA1 | A. thaliana | 8 | Y, C | - | - | - | - | - |
| 830529 | AtHscB | A. thaliana | 8 | Y, C | - | Y | - | Y, C | - |
| 852866 | Jac1 | Yeast | 8 | Y, C | Y, C | Y, C | Y, C | - | Y, C |
| 851084 | Ssq1 | Yeast | 8 | Y, C | Y, C | Y, C | Y, C | - | Y, C |
| 830818 | HscA2 | A. thaliana | 1 | Y | - | - | - | - | - |
| 821316 | AtIscU2 | A. thaliana | 1 | Y | - | - | - | - | - |
| 825719 | AtIscU3 | A. thaliana | 1 | Y | - | - | - | - | - |
| Total genes detected | | | | 29 (manual) | 54 | 22 | 65 | 9 | 23 |
| FP | | | | | 46 | 14 | 58 | 7 | 16 |
| FN | | | | | 21 | 21 | 19 | 27 | 22 |
| TP | | | | | 8 | 8 | 10 | 2 | 7 |
| Precision | | | | 0.93 (0.07) (b) | 0.15 | 0.36 | 0.15 | 0.22 | 0.30 |
| Recall | | | | 0.75 (0.16) (b) | 0.28 | 0.28 | 0.34 | 0.07 | 0.24 |
There were a total of 29 gene mentions in the article (as determined independently by manual curation), but for simplicity only the proposed central genes are listed here (as considered by 10 curators). The Central Vote column indicates the number of curators who selected the gene as central. "Y": the gene mentioned in the article was detected; "-": the gene mentioned was missed; "C": central gene as determined by majority vote, which for the systems means that the gene was ranked high (higher than non-central genes); "Total genes detected": all gene mentions provided by a given system (what the system considered a gene). FP and FN stand for false positive and false negative, respectively. (a) Curated output by 10 curators (2 per system). Central genes were selected by majority vote, after discrepancies in annotation had been reviewed with individual UAG members. (b) Average value from the curators' output, with standard deviation shown in parentheses.
Gene Entity Recognition errors and potential solutions
| Error Class | Error | Example (PMCID) | Potential Solution |
|---|---|---|---|
| Synonym Not Found | New synonym is not found in databases | AtHscB (PMC2764847) | Increase breadth of databases searched by tool |
| | Species prefix obfuscates synonym | AtHscB (PMC2764847) | Ability to add synonym or species-specific rules for string matching |
| Ambiguity | Synonym is a common English word | WASp | Ability to add or remove a synonym and reprocess highlighting |
| | Synonym maps to more than one identifier | AIP1 | Present matches simultaneously with clues like other synonyms and interacting partners |
| | Species not clearly specified | Reference 5 in PMC2680910 | Be able to navigate to other sections of the paper, other papers; be able to curate to orthologous cluster of proteins |
| | Synonym refers to a protein family or an enzymatic activity | | Ability to curate to protein family or orthologous cluster of proteins |
Figure 2. ODIN interface. The ODIN interface is organized in 3 panels: the inspector panel (left) is used to edit single annotations, the document panel (center) contains the document being inspected, and the annotation panel (right) contains grid views (in different tabs) of the terms, concepts and interactions identified by the system in the target document. The term tab contains columns showing the textual form of a term occurrence, its possible concept identifiers and main semantic types together with an ambiguity count. In the concept tab (called "Genes/Proteins" for this task) there is a row for each concept identifier with a relevance score, a frequency count, the most prominent text zone where the concept appears (title, abstract, text), its semantic type, and a link to allow exploration of the concept in the web interface of the ontology from which it stems.
Figure 3. GeneView interface. The main panel shows the article and the recognized entities. Detected gene names are highlighted in green, and entity-specific information is displayed, as shown for gene ALIX (PDCD6IP). The left panel provides an overview of all entities found in the article, sorted by overall count. This ranking can be manually modified. By default all genes are highlighted in the text, but GeneView allows the highlighting to be limited to the species of interest.
Figure 4. IAT interface from University of Iowa. The left panel displays the full text of the article selected by the user for the purpose of gene normalization. The right panel shows a ranked list of gene and species names along with their normalized identifiers. In this figure, all instances of the user-selected gene POSH are highlighted.
Figure 5. GeneIR interface from University of Wisconsin. Screenshot showing the two search boxes. Results are presented as a table. Links are provided to view the genes highlighted in the article, add or delete a gene, and download the gene list. The list of genes can be sorted by centrality (default), presence in title and abstract, or the frequency with which they appear in the article.
Figure 6. GNSuite interface. A screenshot for PMC2680910 with the "gene summary table" and "full text" tabs activated. On the left are links to the system documentation, and on the right is detailed information about the most recently clicked gene name. At the top of the screen, right under the PMC and PubMed identifier information, are tabs for the different input sub-systems for gene and species information, in addition to the summary tabs and a "hide gene tables" tab. The gene table can be saved locally by clicking the provided button. At the bottom of the screen are three tabs for viewing the abstract/MEDIE, full text/GNSuite, or Web-search results, respectively. The selected gene and species names from the top tables are highlighted in the texts at the bottom.
Figure 7. MyMiner interface. MyMiner Entity tagging and Entity linking user interfaces for the abstract of article PMC2680910. Entity tagging (A) and Entity linking (B) have been manually edited; some tags have been added or removed according to the biocurator's choices.