Aurore Britan, Isabelle Cusin, Valérie Hinard, Luc Mottin, Emilie Pasche, Julien Gobeill, Valentine Rech de Laval, Anne Gleizes, Daniel Teixeira, Pierre-André Michel, Patrick Ruch, Pascale Gaudet.
Abstract
The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction.

Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations.

The evaluation of the precision of document triage by neXtA5 showed that both curators agree with neXtA5 for 67% (BP) and 63% (D) of abstracts, while the curators agree with each other on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory.

For concept extraction, curators approved 35% (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36% (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59% (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.
Year: 2018 PMID: 30576492 PMCID: PMC6301339 DOI: 10.1093/database/bay129
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Comparison of some existing text-mining tools
The performance of the main parameters important for the curation workflow is indicated by the degree of shading: white means feature not available; light grey, medium performance; and dark grey, very good performance.
Figure 1. Activity diagram of the literature curation process using neXtA5.
Figure 2. neXtA5 user interface for query page.
Figure 3. neXtA5 user interface for curation. From the abstract of an article, neXtA5 extracts relevant concepts and displays a list of potential annotations. Here, the BP annotations related to PIM1 and extracted from the abstract of (32) are shown.
Semantic classification of concepts annotated by the curators or proposed by neXtA5
| Semantic classification | Concept |
|---|---|
| 1 | Reactive oxygen species biosynthetic process |
| | Reactive oxygen species metabolic process |
| | ROS generation |
| 2 | S phase |
| | DNA replication |
| | Regulation of cell cycle |
| 3 | Autophagy |
| | Autophagosome assembly |
| | Autophagosome formation |
Figure 4. Ancestor charts of the GO terms from semantic classification 2, shown in Table 2 [S phase (GO:0051320), DNA replication (GO:0006260) and regulation of cell cycle (GO:0051726)], using https://www.ebi.ac.uk/QuickGO/.
Figure 5. neXtA5 user interface for curation. One of the guidelines for the curators when selecting relevant documents was not to consider statements from titles and from the introductory part of the abstract. Here, the introduction of the abstract of (37) related to FYN function (BP axis) is highlighted in yellow.
Inter-curator agreement analysis
| | BP | D |
|---|---|---|
| Papers accepted by both curators | 162 | 152 |
| Papers rejected by both curators | 39 | 48 |
| Papers rejected by just one curator | 41 | 42 |
| Total papers analyzed | | |
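
The agreement rate quoted in the abstract (~80%) can be checked against the counts in the inter-curator agreement table. The sketch below is only a worked calculation: it assumes the first numeric column is the BP axis and the second is the D axis (the header row did not survive extraction; 162/242 ≈ 67% and 152/242 ≈ 63% match the BP and D triage figures in the abstract), and it derives the total as the sum of the three listed categories. The dictionary keys and function name are illustrative.

```python
# Counts copied from the inter-curator agreement table.
# Assumption: first numeric column = BP axis, second = D axis.
counts = {
    "BP": {"accepted_by_both": 162, "rejected_by_both": 39, "rejected_by_one": 41},
    "D": {"accepted_by_both": 152, "rejected_by_both": 48, "rejected_by_one": 42},
}


def observed_agreement(axis_counts):
    """Share of papers on which both curators made the same accept/reject call."""
    total = sum(axis_counts.values())
    agreed = axis_counts["accepted_by_both"] + axis_counts["rejected_by_both"]
    return agreed / total


for axis, axis_counts in counts.items():
    total = sum(axis_counts.values())
    print(f"{axis}: {total} papers, observed agreement = {observed_agreement(axis_counts):.1%}")
```

Under these assumptions each axis covers 242 papers, and the observed agreement is a little above 80% on both axes, in line with the abstract.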
Figure 6. Inter-curator agreement with respect to concepts in BP (A) and D (B) axes showing the proportion of common concepts found by both curators. The number indicated is the number of common concepts identified by both curators (0–4 for BP; 0–6 for Ds).
Precision analysis for BP and D axes
| | | | | | |
|---|---|---|---|---|---|
| BP | 3175 | 699 | 413 | 2061 | |
| Ds | 4967 | 1094 | 146 | 3727 | |
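
The approval rates in the abstract (35% for BP, 25% for D) can be reproduced from the precision table, even though its column headers were lost. The sketch below assumes the first numeric column is the number of annotations proposed by neXtA5, the second and third are two categories of approved annotations, and the fourth is the rejected annotations; these roles are an interpretation, not recovered headers. Note that the three trailing BP counts sum to 3173, two short of the first column as printed, so the calculation divides by the first column.

```python
# Rows copied from the precision table; the meaning of each position is assumed:
# (proposed, approved_category_1, approved_category_2, rejected)
rows = {
    "BP": (3175, 699, 413, 2061),
    "D": (4967, 1094, 146, 3727),
}


def approved_fraction(proposed, approved_1, approved_2, rejected):
    """Fraction of proposed annotations that fall in either approved category."""
    return (approved_1 + approved_2) / proposed


for axis, row in rows.items():
    print(f"{axis}: {approved_fraction(*row):.0%} of proposed annotations approved")
```

That the result matches the abstract for both axes supports this reading of the columns, but it does not identify what distinguishes the two approved categories.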
Average number of terms found by curators (common terms and total terms) and by neXtA5 for BP and D axes
| | BP | D |
|---|---|---|
| Number of concepts identified by at least one curator and neXtA5 | 1.1 | 1.2 |
| Manual curator (average number of concepts/papers) | 2.4 | 1.5 |
| neXtA5 (average number of concepts/papers) | 6.2 | 6.0 |
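
The abstract reports both the precision and the recall of the suggested annotations. As a minimal sketch of how these two measures relate for a single abstract, the example below compares a set of suggested concepts with a set of curated concepts. The function name and the example term sets are hypothetical; they are not neXtA5 output, and the exact counting rules used in the evaluation may differ.

```python
def precision_recall(suggested, curated):
    """Set-based precision and recall of suggested concepts against curated ones."""
    suggested, curated = set(suggested), set(curated)
    common = suggested & curated
    precision = len(common) / len(suggested) if suggested else 0.0
    recall = len(common) / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical concept sets for one abstract (illustration only).
suggested_terms = {"autophagy", "signaling", "phosphorylation", "cell cycle"}
curated_terms = {"autophagy", "DNA replication"}

precision, recall = precision_recall(suggested_terms, curated_terms)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

A system that proposes many more concepts per paper than the curators select, as in the table above, will tend to trade precision for recall under this definition.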
Precision analysis for BP (A) and D (B) axes
| (A) BP | | (B) D | |
|---|---|---|---|
| LRRK2 | 247 | LYN | 398 |
| SGK1 | 301 | SYK | 570 |
| SYK | 262 | ZAP70 | 250 |
| IRAK4 | 236 | PIM1 | 444 |
| LYN | 265 | FYN | 402 |
| FYN | 343 | RIPK2 | 194 |
| PIM1 | 327 | IRAK4 | 351 |
| CDK2 | 333 | CDK2 | 504 |
| RIPK2 | 145 | LRRK2 | 452 |
| CSK | 318 | SGK1 | 491 |
| STK11 | 156 | STK11 | 635 |
| ZAP70 | 242 | CSK | 276 |
List of terms rejected by the curators in BP (A) and D (B) axes
(A) BP axis
| ID | | Proposed label | | | | Total |
|---|---|---|---|---|---|---|
| GO:0023052 | Signaling | Signaling | 154 | 63 | 56 | 273 |
| GO:0032502 | Developmental process | Developmental process | 112 | 13 | 5 | 130 |
| GO:0065007 | N/A | Biological regulation | 110 | 26 | 0 | 136 |
| GO:0016310 | Phosphorylation | Phosphorylation | 110 | 55 | 66 | 231 |
| GO:0007165 | Signal transduction | Signal transduction | 90 | 11 | 17 | 118 |
| GO:0006351 | Transcription, DNA-templated | Transcription, DNA-templated | 88 | 6 | 18 | 112 |
| GO:0009058 | Biosynthetic process | Biosynthetic process | 83 | 19 | 2 | 104 |
| GO:0040007 | Growth | Growth | 74 | 16 | 2 | 92 |
| GO:0010467 | Gene expression | Gene expression | 55 | 1 | 16 | 72 |
| GO:0006915 | Apoptotic process | Apoptotic process | 44 | 4 | 36 | 84 |
| GO:0009405 | N/A | Pathogenesis | 42 | 0 | 0 | 42 |
| GO:0051726 | Regulation of cell cycle | Regulation of cell cycle | 40 | 2 | 7 | 49 |
| GO:0007049 | Cell cycle | Cell cycle | 37 | 4 | 16 | 57 |
| GO:0006954 | Inflammatory response | Inflammatory response | 36 | 2 | 20 | 58 |
| GO:0006283 | Transcription-coupled nucleotide-excision repair | TCR | 34 | 9 | 1 | 44 |
| GO:0016246 | N/A | RNA interference | 31 | 0 | 0 | 31 |
| GO:0008283 | Cell proliferation | Cell proliferation | 31 | 1 | 20 | 52 |
| GO:0009056 | Catabolic process | Catabolic process | 26 | 10 | 8 | 44 |
| GO:0033673 | N/A | Negative regulation of kinase activity | 26 | 0 | 0 | 26 |
| GO:0051179 | Localization | Localization | 24 | 10 | 1 | 35 |
| GO:0008152 | Metabolic process | Metabolic process | 22 | 0 | 3 | 25 |
| GO:0016049 | Cell growth | Cell growth | 21 | 1 | 4 | 26 |
| GO:0045087 | Innate immune response | Innate immune response | 19 | 0 | 6 | 25 |
| GO:0001816 | Cytokine production | Cytokine production | 17 | 1 | 15 | 33 |
| GO:0008219 | Cell death | Cell death | 16 | 2 | 10 | 28 |
| GO:0006412 | Translation | Translation | 16 | 1 | 13 | 30 |
| GO:0042110 | T cell activation | T-cell activation | 14 | 0 | 6 | 20 |
| GO:0051320 | S phase | S phase | 13 | 5 | 10 | 28 |
| GO:0046903 | Secretion | Secretion | 13 | 7 | 6 | 26 |
| GO:0006914 | Autophagy | Autophagy | 13 | 0 | 7 | 20 |
| GO:0030154 | Cell differentiation | Cell differentiation | 12 | 2 | 0 | 14 |
| GO:0032259 | N/A | Methylation | 12 | 0 | 0 | 12 |
| GO:0006260 | DNA replication | DNA replication | 11 | 0 | 9 | 20 |
| GO:0009293 | N/A | Transduction | 11 | 3 | 0 | 14 |
| GO:0006810 | N/A | Transport | 11 | 0 | 0 | 11 |
| GO:0046960 | Sensitization | Sensitization | 11 | 0 | 1 | 12 |
| GO:0016311 | N/A | Dephosphorylation | 10 | 0 | 0 | 10 |

(B) D axis
| ID | | Proposed label | | | | Total |
|---|---|---|---|---|---|---|
| C2991 | Disease or Disorder | Condition | 148 | 11 | 8 | 167 |
| C3262 | Neoplasm | Tumor | 100 | 12 | 39 | 151 |
| C45576 | N/A | Mutation | 90 | 0 | 0 | 90 |
| C9305 | Malignant neoplasm | Cancer | 90 | 2 | 29 | 121 |
| C3114 | Hypersensitivity | Sensitivity | 50 | 1 | 2 | 53 |
| C3137 | Inflammation | Inflammation | 49 | 1 | 17 | 67 |
| C18264 | Pathogenesis | Pathogenesis | 46 | 1 | 1 | 48 |
| C120860 | N/A | Accumulation | 43 | 0 | 0 | 43 |
| C18078 | Carcinogenesis | Tumorigenesis | 36 | 4 | 11 | 51 |
| C26845 | Parkinson’s disease | Parkinson’s disease | 33 | 0 | 9 | 42 |
| C19296 | N/A | Deletion | 32 | 0 | 0 | 32 |
| C50753 | N/A | Staining | 30 | 0 | 0 | 30 |
| C3324 | Peutz–Jeghers syndrome | Peutz–Jeghers syndrome | 29 | 0 | 6 | 35 |
| C14339 | N/A | Knockout mice | 27 | 0 | 0 | 27 |
| C20200 | N/A | Outcome | 26 | 0 | 0 | 26 |
| C45581 | Gene amplification abnormality | Amplification | 26 | 0 | 1 | 27 |
| C3671 | N/A | Injury | 25 | 4 | 0 | 29 |
| C53802 | Adverse event associated with the gastrointestinal system | Gastrointestinal | 25 | 0 | 5 | 30 |
| C42077 | Cellular infiltrate | Infiltration | 24 | 0 | 3 | 27 |
| C17666 | N/A | Germline mutations | 23 | 0 | 0 | 23 |
| C75004 | Invasion | Invasion | 22 | 1 | 5 | 28 |
| C55998 | N/A | Platelets | 19 | 0 | 0 | 19 |
| C3161 | Leukemia | Leukemia | 19 | 0 | 5 | 24 |
| C53791 | Adverse event associated with infection | Infection | 18 | 14 | 3 | 35 |
| C54685 | Tissue adhesion | Adhesion | 17 | 0 | 1 | 18 |
| C94604 | N/A | Mouse model | 16 | 0 | 0 | 16 |
| C39723 | Immune system finding | Immune system | 16 | 0 | 1 | 17 |
| C19987 | Cancer progression | Cancer progression | 16 | 0 | 2 | 18 |
| C4089 | Polyposis | Polyposis | 16 | 1 | 1 | 18 |
| C93210 | Inflammatory disorder | Inflammatory diseases | 16 | 0 | 5 | 21 |
| C19151 | Metastasis | Metastases | 16 | 5 | 24 | 45 |
| C53809 | Adverse event associated with the vascular system | Vascular | 15 | 0 | 2 | 17 |
| C17609 | Tumor progression | Tumor progression | 15 | 0 | 3 | 18 |
| C3208 | Lymphoma | Lymphoma | 15 | 0 | 7 | 22 |
| C16897 | N/A | Necrosis | 14 | 0 | 0 | 14 |
| C27990 | Toxicity | Toxicity | 14 | 0 | 1 | 15 |
| C36117 | Invasive lesion | Invasive | 14 | 2 | 4 | 20 |
| C62200 | N/A | Point mutation | 13 | 0 | 0 | 13 |
| C39725 | Immunodeficiency | Immunodeficient | 13 | 0 | 1 | 14 |
| C120867 | N/A | Bacteria | 13 | 5 | 0 | 18 |
| C102283 | N/A | Extracted | 12 | 0 | 0 | 12 |
| C17354 | N/A | Frameshift mutation | 12 | 0 | 0 | 12 |
| C28193 | N/A | Syndrome | 12 | 0 | 0 | 12 |
| C2873 | N/A | Aneuploidy | 12 | 0 | 0 | 12 |
| C45582 | N/A | Duplication | 12 | 0 | 0 | 12 |
| C18016 | Loss of heterozygosity | Allelic loss | 12 | 0 | 1 | 13 |
| C14174 | N/A | Metastatic | 12 | 2 | 0 | 14 |
| C50774 | Tissue degeneration | Degeneration | 12 | 0 | 2 | 14 |
| C2916 | Carcinoma | Carcinomas | 12 | 2 | 1 | 15 |
| C3340 | Polyp | Polyps | 12 | 1 | 3 | 16 |
| C2950 | Cytogenetic abnormality | Chromosomal aberration | 11 | 0 | 1 | 12 |
| C3117 | Hypertension | Hypertension | 11 | 0 | 4 | 15 |
| C4872 | Breast carcinoma | Breast carcinomas | 11 | 0 | 17 | 28 |
| C120945 | N/A | Inclusions | 10 | 0 | 0 | 10 |
| C17212 | N/A | Cell transformation | 10 | 0 | 0 | 10 |
| C18133 | N/A | Missense mutations | 10 | 0 | 0 | 10 |
| C3101 | N/A | Inherited disease | 10 | 0 | 0 | 10 |
| C3174 | N/A | Chronic myelogenous leukemia | 10 | 0 | 0 | 10 |
| C48189 | N/A | Genome instability | 10 | 0 | 0 | 10 |
| C48275 | N/A | Fatal | 10 | 0 | 0 | 10 |
| C8509 | Primary neoplasm | Primary tumor | 10 | 0 | 4 | 14 |
Terms always rejected are highlighted in grey. The list is limited to terms proposed at least 30 times by the system. The proposed label does not necessarily correspond to the primary class label; it may be the term synonym identified by neXtA5.
Results of learning to rank applied to annotations
| | | | | |
|---|---|---|---|---|
| BP | 0.48 | 0.28 | 0.63 | 0.35 |
| D | 0.48 | 0.17 | 0.59 | 0.22 |
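
The abstract defines top precision as precision at first rank. A minimal sketch of that metric follows, assuming each abstract has a ranked list of suggested terms and a set of terms approved by the curators. The function name, the abstract identifiers and the term lists are hypothetical illustrations; they are not neXtA5 data, and the ranking model itself is not shown.

```python
def precision_at_first_rank(ranked_suggestions, approved_terms):
    """Fraction of abstracts whose top-ranked suggestion was approved by a curator.

    Abstracts with no suggestions are skipped rather than counted as misses.
    """
    scored = [(doc, ranked) for doc, ranked in ranked_suggestions.items() if ranked]
    if not scored:
        return 0.0
    hits = sum(1 for doc, ranked in scored if ranked[0] in approved_terms.get(doc, set()))
    return hits / len(scored)


# Hypothetical ranked suggestions and curator-approved terms for three abstracts.
ranked_suggestions = {
    "abstract_a": ["autophagy", "signaling"],
    "abstract_b": ["signaling", "cell cycle"],
    "abstract_c": ["DNA replication"],
}
approved_terms = {
    "abstract_a": {"autophagy"},
    "abstract_b": {"cell cycle"},
    "abstract_c": {"DNA replication"},
}

top_precision = precision_at_first_rank(ranked_suggestions, approved_terms)
print(f"precision at first rank = {top_precision:.2f}")
```

In this toy input the correct term for `abstract_b` is ranked second, so it does not count as a hit; this is why improving the ranking function raises top precision without changing which terms are extracted.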