Luc Mottin, Julien Gobeill, Emilie Pasche, Pierre-André Michel, Isabelle Cusin, Pascale Gaudet, Patrick Ruch.
Abstract
The rapid increase in the number of published articles makes it challenging for curated databases to remain up to date. To help the scientific community and database curators deal with this issue, we have developed an application, neXtA5, which prioritizes the literature for specific curation requirements. neXtA5 is a curation service composed of three main elements. The first component is a named-entity recognition module, which annotates MEDLINE along predefined axes. This report focuses on three axes: Diseases, and the Molecular Function and Biological Process sub-ontologies of the Gene Ontology (GO). The automatic annotations are then stored in a local database, BioMed, for each annotation axis. Additional entities such as species and chemical compounds are also identified. The second component is an existing search engine, which retrieves the most relevant MEDLINE records for any given query. The third component uses the content of BioMed to generate an axis-specific ranking, which takes into account the density of named entities stored in the BioMed database. The two ranked lists are ultimately merged using a linear combination, which has been specifically tuned to support the annotation of each axis. The fine-tuning of the coefficients is formally reported for each axis-driven search. Compared with PubMed, which is the system used by most curators, the improvements are +231% for Diseases, +236% for Molecular Functions and +3153% for Biological Process when measuring the precision of the top-returned PMID (P0 or mean reciprocal rank). The current search methods significantly improve the search effectiveness of curators for three important curation axes. Further experiments are being performed to extend the curation types, in particular to protein-protein interactions, which require specific relationship extraction capabilities.
In parallel, user-friendly interfaces powered by a set of JSON web services are currently being integrated into the neXtProt annotation pipeline.
Available on: http://babar.unige.ch:8082/neXtA5
Database URL: http://babar.unige.ch:8082/neXtA5/fetcher.jsp
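The merging step described in the abstract, a linear combination of the search-engine ranking and the axis-specific ranking, can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the function names `min_max` and `combine`, the single weight `alpha`, the min-max normalization, and the treatment of a PMID missing from one list as scoring 0.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a list with identical scores maps to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pmid: 0.0 for pmid in scores}
    return {pmid: (s - lo) / (hi - lo) for pmid, s in scores.items()}


def combine(
    engine: Dict[str, float],   # PMID -> search-engine relevance score
    density: Dict[str, float],  # PMID -> axis-specific entity-density score
    alpha: float,               # hypothetical per-axis weight in [0, 1]
) -> List[Tuple[str, float]]:
    """Merge two scored lists as alpha * engine + (1 - alpha) * density."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    e, d = min_max(engine), min_max(density)
    merged = {
        pmid: alpha * e.get(pmid, 0.0) + (1.0 - alpha) * d.get(pmid, 0.0)
        for pmid in set(e) | set(d)
    }
    # Highest combined score first; ties are broken by PMID for determinism.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    engine_scores = {"111": 10.0, "222": 5.0, "333": 0.0}
    density_scores = {"111": 0.0, "222": 4.0, "333": 8.0}
    for pmid, score in combine(engine_scores, density_scores, alpha=0.25):
        print(pmid, round(score, 3))
```

With `alpha` close to 1 the output follows the search engine; with `alpha` close to 0 it follows the entity density. Under these assumptions, tuning a ranking for one axis amounts to choosing the `alpha` that maximizes a retrieval metric on a benchmark for that axis.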
Year: 2016 PMID: 27374119 PMCID: PMC4930835 DOI: 10.1093/database/baw098
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Example of GO BP entity with valid SY
|
[Term] id: GO:0030318 name: melanocyte differentiation namespace: biological_process def: "The process in which a relatively unspecialized cell acquires specialized features of a melanocyte." [GOC:mah] synonym: "melanocyte cell differentiation" EXACT [] synonym: "melanophore differentiation" EXACT [] is_a: GO:0050931 ! pigment cell differentiation is_a: GO:0060563 ! neuroepithelial cell differentiation |
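To show how such a GO entry could feed a thesaurus, the sketch below parses the term above from its usual one-tag-per-line OBO layout and collects the preferred name plus its synonyms. Keeping only `EXACT` synonyms is an assumption made for illustration; the source does not state which synonym scopes count as valid. The names `parse_term` and `thesaurus_labels` are hypothetical, and the `def:` line is omitted for brevity.

```python
import re
from typing import Dict, List

# The GO term from the example, written one tag per line as in an OBO file.
STANZA = """[Term]
id: GO:0030318
name: melanocyte differentiation
namespace: biological_process
synonym: "melanocyte cell differentiation" EXACT []
synonym: "melanophore differentiation" EXACT []
is_a: GO:0050931 ! pigment cell differentiation
is_a: GO:0060563 ! neuroepithelial cell differentiation
"""

# A synonym value is a quoted string followed by a scope keyword.
SYNONYM_RE = re.compile(r'^"(?P<text>.*)"\s+(?P<scope>EXACT|BROAD|NARROW|RELATED)\b')


def parse_term(stanza: str) -> Dict[str, List[str]]:
    """Collect tag -> list of values for a single [Term] stanza."""
    fields: Dict[str, List[str]] = {}
    for line in stanza.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        fields.setdefault(tag, []).append(value)
    return fields


def thesaurus_labels(fields: Dict[str, List[str]]) -> List[str]:
    """Return the preferred name followed by its EXACT synonyms (assumed filter)."""
    labels = list(fields.get("name", []))
    for raw in fields.get("synonym", []):
        match = SYNONYM_RE.match(raw)
        if match and match.group("scope") == "EXACT":
            labels.append(match.group("text"))
    return labels


if __name__ == "__main__":
    term = parse_term(STANZA)
    print(term["id"][0], thesaurus_labels(term))
```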
Index size per axis after thesauri refinement
| Thesaurus | #terms |
|---|---|
| Diseases | 97545 |
| GO MF | 36068 |
| GO BP | 99242 |
| GO CC | 6459 |
| ECO | 174 |
| Species | 81 |
| ChEBI | 41 |
Figure 1. The neXtA5 functional architecture.
Figure 2. The neXtA5 web interface. The query was {FER—vectorial—Diseases}, with 1990 as the lower limit (the date can be modified in advanced mode) for the publication dates. The output presents the first fifty results ranked over the chosen axis, with the score of the linear combination and the concepts identified in the PMID.
JSON annotation example for PMID: 23883606
a. This article contains six concepts: one from NCIt (“dehydration”) and three GO (two different synonyms for “hydrogen:potassium-exchanging ATPase activity” and three occurrences of the term “membrane”). Bold entries represent the JSON fields that could be captured from the web service.
Distribution of annotated proteins and PMIDs in the benchmark for each annotation axis
| Thesaurus | #kinases | #PMIDs |
|---|---|---|
| Diseases | 100 | 4839 |
| GO MF | 18 | 47 |
| GO BP | 100 | 3189 |
Figure 3. Optimization of the ranking function for each axis.
Figure 4. Precision (at P0, P10 and MAP) obtained by the neXtA5 vector-space retrieval model compared with the PubMed search modes: (a) Diseases, (b) BP and (c) MF.
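The measures named in the Figure 4 caption can be sketched for a single query as below. This shows the standard textbook definitions of reciprocal rank, precision at a cutoff and average precision, not necessarily the exact evaluation procedure behind the figure. The function names and the toy PMIDs are hypothetical. Averaging `reciprocal_rank` and `average_precision` over all benchmark queries would give the mean reciprocal rank and MAP.

```python
from typing import List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant PMID, or 0.0 if none is retrieved."""
    for rank, pmid in enumerate(ranked, start=1):
        if pmid in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (k=10 gives P10)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for pmid in ranked[:k] if pmid in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values measured at each relevant hit."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, pmid in enumerate(ranked, start=1):
        if pmid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


if __name__ == "__main__":
    ranked_pmids = ["a", "b", "c", "d"]   # system output, best first
    curated_pmids = {"b", "d"}            # PMIDs judged relevant
    print(reciprocal_rank(ranked_pmids, curated_pmids))
    print(precision_at_k(ranked_pmids, curated_pmids, k=2))
    print(average_precision(ranked_pmids, curated_pmids))
```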