NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition.

Rezarta Islamaj¹, Chih-Hsuan Wei¹, David Cissel¹, Nicholas Miliaras¹, Olga Printseva¹, Oleg Rodionov¹, Keiko Sekiya¹, Janice Ward¹, Zhiyong Lu².

Abstract

The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. While current methods for tagging gene entities have been developed for biomedical literature, their performance on species other than human is substantially lower due to the lack of annotation data. We therefore present the NLM-Gene corpus, a high-quality manually annotated corpus for genes developed at the US National Library of Medicine (NLM), covering ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per document, and a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) when compared to previous gene annotation corpora. NLM-Gene consists of 550 PubMed abstracts from 156 biomedical journals, doubly annotated by six experienced NLM indexers, randomly paired for each document to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. This gold-standard corpus can serve as a benchmark to develop and test new gene text-mining algorithms. Using this new resource, we have developed a new gene-finding algorithm based on deep learning that improves on both the precision and the recall of existing tools. The NLM-Gene annotated corpus is freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/lu/NLMGene. We have also applied this tool to the entire PubMed/PMC, and the results are freely accessible through our web-based tool PubTator (www.ncbi.nlm.nih.gov/research/pubtator).

Keywords: Biomedical Text Mining; Deep Learning; Gene entity recognition; Manual annotation; Natural language processing

Year: 2021 PMID: 33839304 DOI: 10.1016/j.jbi.2021.103779

Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317

Cited

2 in total

1. RegEl corpus: identifying DNA regulatory elements in the scientific literature.

Authors: Samuele Garda; Freyda Lenihan-Geels; Sebastian Proft; Stefanie Hochmuth; Markus Schülke; Dominik Seelow; Ulf Leser
Journal: Database (Oxford) Date: 2022-06-27 Impact factor: 4.462

2. Assigning species information to corresponding genes by a sequence labeling framework.

Authors: Ling Luo; Chih-Hsuan Wei; Po-Ting Lai; Qingyu Chen; Rezarta Islamaj; Zhiyong Lu
Journal: Database (Oxford) Date: 2022-10-13 Impact factor: 4.462