Arjun Magge1,2,3, Davy Weissenbacher3, Karen O'Connor3, Tasnia Tahsin1, Graciela Gonzalez-Hernandez3, Matthew Scotch1,2.
Abstract
SUMMARY: We present GeoBoost2, a natural language-processing pipeline that extracts the location of infected hosts to enrich metadata in nucleotide sequence repositories such as the National Center for Biotechnology Information's GenBank, supporting downstream analyses including phylogeography and genomic epidemiology. The increasing number of pathogen sequences requires complementary information extraction methods for focused research, including surveillance within countries and across borders. In this article, we describe the enhancements over our earlier release, including improved end-to-end extraction performance and speed, a fully functional web interface and state-of-the-art deep learning methods for location extraction.
Year: 2020 PMID: 32683454 PMCID: PMC7755405 DOI: 10.1093/bioinformatics/btaa647
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Screenshot from the GeoBoost2 website. In this example, the user enters GenBank accession IDs for Zika virus and designates the sufficiency level in terms of administrative divisions (GeoNames, 2020a), such as ADM1 for states/provinces and ADM2 for counties, along with the maximum number of possible locations to display per record. Upon submission of the request, GeoBoost2 extracts locations from the GenBank record metadata. For each record where the sufficiency level is not met, GeoBoost2 checks the associated PubMed abstracts and open access articles. The system then displays all possible locations it extracted, with details available on hovering over the pins on the map. The user can then export the data in csv, tsv or json format. For record KU497555 (Calvet ), only the country information was available in the metadata, but a finer location, in this case the state of Paraiba, was found in one of the linked papers.
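The fallback behaviour described in the caption can be illustrated with a minimal Python sketch. It is not GeoBoost2's actual code: the `Location` and `Record` classes, the `resolve_locations` function, the `LEVELS` ordering and the hard-coded example values are all hypothetical, chosen only to show how a sufficiency level and a per-record location cap might interact.

```python
from dataclasses import dataclass, field

# Hypothetical ranking of granularity: higher means finer.
LEVELS = {"COUNTRY": 0, "ADM1": 1, "ADM2": 2}

@dataclass
class Location:
    name: str
    level: str   # one of the keys in LEVELS
    source: str  # "metadata" or "article" (labels assumed for illustration)

@dataclass
class Record:
    accession: str
    metadata_locations: list
    article_locations: list = field(default_factory=list)

def meets(loc, required_level):
    """True if the location is at least as fine as the requested level."""
    return LEVELS[loc.level] >= LEVELS[required_level]

def resolve_locations(record, required_level, max_locations):
    """Use metadata first; consult linked articles only if metadata is too coarse."""
    candidates = list(record.metadata_locations)
    if not any(meets(loc, required_level) for loc in candidates):
        candidates.extend(record.article_locations)
    # Finest locations first; the sort is stable, so ties keep their input order.
    candidates.sort(key=lambda loc: LEVELS[loc.level], reverse=True)
    return candidates[:max_locations]

# Example mirroring the caption: country-only metadata, state found in a linked paper.
record = Record(
    accession="KU497555",
    metadata_locations=[Location("Brazil", "COUNTRY", "metadata")],
    article_locations=[Location("Paraiba", "ADM1", "article")],
)
found = resolve_locations(record, required_level="ADM1", max_locations=5)
for loc in found:
    print(loc.name, loc.level, loc.source)
```

In this sketch the linked articles are consulted only when no metadata location reaches the requested level, which matches the order of checks given in the caption; how the real system ranks or filters candidates is not stated in the source.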