Rishab Mallick, Valerio Arnaboldi, Paul Davis, Stavros Diamantakis, Magdalena Zarowiecki, Kevin Howe.
Abstract
Biological databases collect and standardize data through biocuration. Although the major model organism databases have automated some of their curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers) and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. Our model achieves a precision of 82.59% on gene-mutation matches, evaluated on text extracted from 100 papers, and even recovers some data that was missed during manual curation. Code at: https://github.com/WormBase/genomic-info-from-papers.
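To illustrate the regular-expression part of such a pipeline, the following is a minimal sketch only, not the code from the WormBase repository. It assumes that a gene looks like a conventional C. elegans gene name (for example daf-2), that a mutation is a single-letter amino acid substitution (for example G383E), and that a gene and a mutation appearing in the same sentence form a candidate match. The patterns, the function names and the co-occurrence rule are all hypothetical simplifications; the BERT-based Named Entity Recognition and bag-of-words components are not shown.

```python
import re

# Hypothetical patterns for illustration; not taken from the WormBase code.
AMINO = "ACDEFGHIKLMNPQRSTVWY"
# C. elegans gene names: 3-4 lowercase letters, a hyphen, then a number.
GENE_RE = re.compile(r"\b[a-z]{3,4}-\d+\b")
# Protein substitution: reference residue, position, new residue or stop (*).
MUTATION_RE = re.compile(rf"\b[{AMINO}]\d+(?:[{AMINO}]|\*)(?!\w)")


def gene_mutation_pairs(sentence):
    """Return every (gene, mutation) pair that co-occurs in one sentence."""
    genes = [m.group(0) for m in GENE_RE.finditer(sentence)]
    mutations = [m.group(0) for m in MUTATION_RE.finditer(sentence)]
    return [(g, mut) for g in genes for mut in mutations]


def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP); returns 0.0 when nothing was predicted."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


if __name__ == "__main__":
    print(gene_mutation_pairs("The daf-2 allele carries a G383E substitution."))
```

Pairing by sentence co-occurrence over-generates when a sentence mentions several genes, which is one reason a real system would add further filtering on top of pattern matching.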
Year: 2022 PMID: 35663412 PMCID: PMC9160977 DOI: 10.17912/micropub.biology.000578
Source DB: PubMed Journal: MicroPubl Biol ISSN: 2578-9430