GeoBoost2: a natural language-processing pipeline for GenBank metadata enrichment for virus phylogeography.

Arjun Magge^1,2,3, Davy Weissenbacher^3, Karen O'Connor^3, Tasnia Tahsin^1, Graciela Gonzalez-Hernandez^3, Matthew Scotch^1,2.

Abstract

SUMMARY: We present GeoBoost2, a natural language-processing pipeline that extracts the location of infected hosts to enrich metadata in nucleotide sequence repositories such as the National Center for Biotechnology Information's GenBank for downstream analyses, including phylogeography and genomic epidemiology. The increasing number of pathogen sequences requires complementary information extraction methods for focused research, including surveillance within countries and across borders. In this article, we describe the enhancements over our earlier release, including improved end-to-end extraction performance and speed, a fully functional web interface and state-of-the-art deep-learning methods for location extraction.
AVAILABILITY AND IMPLEMENTATION: The application is freely available on the web at https://zodo.asu.edu/geoboost2. Source code, usage examples and annotated data for GeoBoost2 are freely available at https://github.com/ZooPhy/geoboost2.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2020; PMID: 32683454; PMCID: PMC7755405; DOI: 10.1093/bioinformatics/btaa647

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

Molecular sequences play a vital role in phylogenetic, phylogeographic and epidemiological studies that seek to understand the evolution and migration of pathogens across countries and continents. The National Center for Biotechnology Information (NCBI) maintains GenBank (Benson et al.), one of the largest comprehensive databases of nucleotide sequences available to the public. As of July 2020, GenBank contains 217 million entries (NCBI, 2020a), with over 3 million viral sequences reported in the latest release notes (NCBI, 2020b). The availability of such a database supports research in various domains of public health, particularly infectious diseases such as Ebola, Zika and most recently SARS-CoV-2 (Dudas et al.; Lai et al.; Pybus et al.). However, the geographic metadata about the location of infected hosts (LOIH) that is readily available at the individual record level may be insufficient for studies conducted at the state/province level within a country (Scotch et al.; Tahsin et al.). Detailed geographic metadata is crucial not just for epidemiological studies, but also for retrospective genomic studies by the wider scientific community. Geographic metadata about the infected host is not required when submitting a sequence to GenBank. The database offers a Features table that includes both mandatory and optional qualifiers (Benson et al.; INSDC, 2019). The geographic qualifiers are optional; they include lat_lon for approximate coordinates and country for named locations. Among the over 3 million viral sequences available (NCBI, 2020b), only about 1% of the records contained the infected host's coordinates in the lat_lon field and only 26% contained host location information more specific than a country in the country field. This lack of detailed metadata in GenBank creates barriers for phylogeography and genomic epidemiology at a local level.
Researchers are then required to manually analyze other metadata fields in the record and/or review any associated PubMed articles. If no additional metadata is found, the researcher might decide to exclude these records from the study altogether, reducing the sample size of the study and potentially introducing bias. GeoBoost2 provides a framework to automate this manual extraction process, analyzing the individual metadata fields with the objective of extracting the LOIH from associated records. GeoBoost2 applies advanced data mining methods to the linked PubMed articles to enrich the geospatial metadata, and improves extraction performance over its predecessor GeoBoost (Tahsin et al., 2018) by over 35% when evaluated on two corpora. Overall, GeoBoost2 achieved 90% accuracy in resolving the LOIH in GenBank metadata and 57% accuracy in resolving the LOIH from associated PubMed articles. To the best of our knowledge, GeoBoost and GeoBoost2 are the only systems that use natural language-processing (NLP) techniques to extract the LOIH from articles cited in GenBank accessions. In Supplementary Information, we describe in detail our methods and evaluation of GeoBoost2. We also provide a screenshot of the current version of the interface (Fig. 1).
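To make the optional country and lat_lon qualifiers concrete, the following is a minimal sketch of reading them from the source feature of a GenBank flat-file record. It is not GeoBoost2's implementation. The record excerpt, its values and the function names are hypothetical; the only convention assumed is the INSDC format in which the country qualifier holds a country name optionally followed by a colon and a finer region.

```python
import re

# Hypothetical excerpt of a GenBank flat-file "source" feature.
# All values are illustrative and not copied from a real record.
RECORD = '''\
     source          1..10676
                     /organism="Zika virus"
                     /mol_type="genomic RNA"
                     /host="Homo sapiens"
                     /country="Brazil: Paraiba"
                     /lat_lon="7.12 S 34.86 W"
'''

# Matches qualifier lines of the form   /name="value"
QUALIFIER = re.compile(r'^\s+/(\w+)="([^"]*)"', re.MULTILINE)


def source_qualifiers(flat_text):
    """Return a dict mapping qualifier name to value."""
    return {name: value for name, value in QUALIFIER.findall(flat_text)}


def geographic_metadata(flat_text):
    """Summarise what the record itself says about the host location."""
    quals = source_qualifiers(flat_text)
    # INSDC convention: "<country>[:<region>[, <locality>]]"
    name, _, detail = quals.get("country", "").partition(":")
    return {
        "country": name.strip() or None,
        "sub_country": detail.strip() or None,  # None means country-level only
        "lat_lon": quals.get("lat_lon"),
    }


print(geographic_metadata(RECORD))
```

A record whose sub_country and lat_lon are both None is the kind of record for which the manual review described above would otherwise be needed.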

Fig. 1.

Screenshot from the GeoBoost2 website. In this example, the user enters GenBank accession IDs for Zika virus and designates the sufficiency level in terms of administrative divisions (GeoNames, 2020a), such as ADM1 for states/provinces or ADM2 for counties, and the maximum number of possible locations to be displayed per record. Upon submission of the request, GeoBoost2 extracts locations from GenBank record metadata. For each record where the sufficiency level is not met, GeoBoost2 checks associated PubMed abstracts/open access articles. The system then displays all possible locations it extracted, with details available on hovering over the pins on the map. The user can then export the data in CSV, TSV or JSON format. For record KU497555 (Calvet et al.), only the country information was available in the metadata, but a finer location, in this case the state of Paraiba, was found in one of the linked papers.

GeoBoost2 includes:

A state-of-the-art deep-learning NLP algorithm trained on manually annotated geographic location mentions in PubMed Central Open Access articles (Magge et al., 2019). All geographic location mentions are disambiguated and resolved to a unique identifier in GeoNames (2020a,b), a database containing 12 million locations across the globe.

A Python 3.7 framework implementation (replacing a Java-based framework) for continuous improvement with deep-learning and machine-learning methods for information extraction.

A Web-based interface with a map view that accepts as input any GenBank accessions (not limited to viruses) and provides features to export results. In addition to accepting GenBank accession IDs, the tool can also accept PubMed IDs or raw text captured from an article for mining geographic locations.

An application programming interface (API) for use of the results in downstream applications.
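The sufficiency-level behaviour described in the Fig. 1 caption can be illustrated with a short sketch. This is an assumption-laden simplification, not the tool's actual logic: the ranking of GeoNames feature codes, the hit dictionaries, their scores and the function names are all hypothetical, and finer feature classes such as populated places are left out.

```python
# Coarse-to-fine ordering of a few GeoNames feature codes.
# The ordering and the choice of codes are assumptions made for this sketch.
LEVELS = ["PCLI", "ADM1", "ADM2", "ADM3"]


def is_sufficient(feature_code, required="ADM1"):
    """True if feature_code is at least as fine as the required level."""
    if feature_code not in LEVELS:
        return False  # unknown codes are conservatively treated as insufficient
    return LEVELS.index(feature_code) >= LEVELS.index(required)


def resolve(metadata_hits, article_hits, required="ADM1", max_locations=3):
    """Use record metadata first; add article hits only if metadata is too coarse."""
    if any(is_sufficient(h["code"], required) for h in metadata_hits):
        candidates = list(metadata_hits)
    else:
        candidates = list(metadata_hits) + list(article_hits)
    # Sufficient locations come first, then higher (hypothetical) scores.
    ranked = sorted(
        candidates,
        key=lambda h: (is_sufficient(h["code"], required), h["score"]),
        reverse=True,
    )
    return ranked[:max_locations]


# Hypothetical hits mirroring the KU497555 example: country in the metadata,
# state in a linked paper. The scores are invented.
metadata = [{"name": "Brazil", "code": "PCLI", "score": 0.9}]
articles = [{"name": "Paraiba", "code": "ADM1", "score": 0.6}]
print([h["name"] for h in resolve(metadata, articles, required="ADM1")])
```

With required set to "PCLI", the metadata alone is sufficient and the article hits are never consulted, which mirrors the fallback order in the caption.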
In addition to mining PubMed articles directly linked in the GenBank accessions, GeoBoost2 also mines geographic locations from additional PubMed articles, and their respective Supplementary Information, that have cited the GenBank accessions in their studies. All data retrieval functionalities in the tool rely on APIs provided by NCBI, ensuring the latest available information. Results from GeoBoost2 can be used for Bayesian discrete phylogeography on ZooPhy (Scotch et al., 2019b; ZooPhy, 2020). Here, the probabilities for potential LOIH generated by GeoBoost2 can be used as sampling uncertainties (Scotch et al.) for the taxa in phylogeographic studies implemented using BEAST (Suchard et al.). We plan to extend our information extraction and normalization efforts to additional optional qualifiers such as collection_date, host and isolation_source. We also plan to validate the performance of the tool on other pathogens such as bacteria and parasites. With the growing concern over emerging and re-emerging pathogens, a publicly available, free tool like GeoBoost2 will facilitate public health surveillance and genomic epidemiology.
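As an illustration of retrieval through NCBI's APIs, the sketch below builds request URLs for the public E-utilities service: one to fetch a GenBank flat file for an accession and one to look up PubMed records linked to it. The text above does not say which NCBI endpoints GeoBoost2 calls, so the use of EFetch and ELink here is an assumption, and the function names are hypothetical. No network request is made.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def genbank_fetch_url(accessions):
    """EFetch URL returning GenBank flat-file text for one or more accessions."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "gb",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def linked_pubmed_url(accession):
    """ELink URL listing PubMed records linked to a nucleotide record."""
    params = {"dbfrom": "nuccore", "db": "pubmed", "id": accession}
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


print(genbank_fetch_url(["KU497555"]))
print(linked_pubmed_url("KU497555"))
```

The resulting URLs can be passed to any HTTP client; NCBI's usage guidelines on request rates and API keys apply to real use.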

References

1. At the intersection of public-health informatics and bioinformatics: using advanced Web technologies for phylogeography.

Authors: Matthew Scotch; Changjiang Mei; Cynthia Brandt; Indra Neil Sarkar; Kei Cheung
Journal: Epidemiology Date: 2010-11 Impact factor: 4.822

2. A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records.

Authors: Tasnia Tahsin; Davy Weissenbacher; Robert Rivera; Rachel Beard; Mari Firago; Garrick Wallstrom; Matthew Scotch; Graciela Gonzalez
Journal: J Am Med Inform Assoc Date: 2016-01-17 Impact factor: 4.497

3. GeoBoost: accelerating research involving the geospatial metadata of virus GenBank records.

Authors: Tasnia Tahsin; Davy Weissenbacher; Karen O'Connor; Arjun Magge; Matthew Scotch; Graciela Gonzalez-Hernandez
Journal: Bioinformatics Date: 2018-05-01 Impact factor: 6.937

4. Unifying the spatial epidemiology and molecular evolution of emerging epidemics.

Authors: Oliver G Pybus; Marc A Suchard; Philippe Lemey; Flavien J Bernardin; Andrew Rambaut; Forrest W Crawford; Rebecca R Gray; Nimalan Arinaminpathy; Susan L Stramer; Michael P Busch; Eric L Delwart
Journal: Proc Natl Acad Sci U S A Date: 2012-08-27 Impact factor: 11.205

5. Virus genomes reveal factors that spread and sustained the Ebola epidemic.

Authors: Gytis Dudas; Luiz Max Carvalho; Trevor Bedford; Andrew J Tatem; Guy Baele; Nuno R Faria; Daniel J Park; Jason T Ladner; Armando Arias; Danny Asogun; Filip Bielejec; Sarah L Caddy; Matthew Cotten; Jonathan D'Ambrozio; Simon Dellicour; Antonino Di Caro; Joseph W Diclaro; Sophie Duraffour; Michael J Elmore; Lawrence S Fakoli; Ousmane Faye; Merle L Gilbert; Sahr M Gevao; Stephen Gire; Adrianne Gladden-Young; Andreas Gnirke; Augustine Goba; Donald S Grant; Bart L Haagmans; Julian A Hiscox; Umaru Jah; Jeffrey R Kugelman; Di Liu; Jia Lu; Christine M Malboeuf; Suzanne Mate; David A Matthews; Christian B Matranga; Luke W Meredith; James Qu; Joshua Quick; Suzan D Pas; My V T Phan; Georgios Pollakis; Chantal B Reusken; Mariano Sanchez-Lockhart; Stephen F Schaffner; John S Schieffelin; Rachel S Sealfon; Etienne Simon-Loriere; Saskia L Smits; Kilian Stoecker; Lucy Thorne; Ekaete Alice Tobin; Mohamed A Vandi; Simon J Watson; Kendra West; Shannon Whitmer; Michael R Wiley; Sarah M Winnicki; Shirlee Wohl; Roman Wölfel; Nathan L Yozwiak; Kristian G Andersen; Sylvia O Blyden; Fatorma Bolay; Miles W Carroll; Bernice Dahn; Boubacar Diallo; Pierre Formenty; Christophe Fraser; George F Gao; Robert F Garry; Ian Goodfellow; Stephan Günther; Christian T Happi; Edward C Holmes; Brima Kargbo; Sakoba Keïta; Paul Kellam; Marion P G Koopmans; Jens H Kuhn; Nicholas J Loman; N'Faly Magassouba; Dhamari Naidoo; Stuart T Nichol; Tolbert Nyenswah; Gustavo Palacios; Oliver G Pybus; Pardis C Sabeti; Amadou Sall; Ute Ströher; Isatta Wurie; Marc A Suchard; Philippe Lemey; Andrew Rambaut
Journal: Nature Date: 2017-04-12 Impact factor: 49.962

6. Natural language processing methods for enhancing geographic metadata for phylogeography of zoonotic viruses.

Authors: Tasnia Tahsin; Rachel Beard; Robert Rivera; Rob Lauder; Garrick Wallstrom; Matthew Scotch; Graciela Gonzalez
Journal: AMIA Jt Summits Transl Sci Proc Date: 2014-04-07

7. GenBank.

Authors: Dennis A Benson; Mark Cavanaugh; Karen Clark; Ilene Karsch-Mizrachi; James Ostell; Kim D Pruitt; Eric W Sayers
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

8. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10.

Authors: Marc A Suchard; Philippe Lemey; Guy Baele; Daniel L Ayres; Alexei J Drummond; Andrew Rambaut
Journal: Virus Evol Date: 2018-06-08

9. Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature.

Authors: Arjun Magge; Davy Weissenbacher; Abeed Sarker; Matthew Scotch; Graciela Gonzalez-Hernandez
Journal: Pac Symp Biocomput Date: 2019

10. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): The epidemic and the challenges.

Authors: Chih-Cheng Lai; Tzu-Ping Shih; Wen-Chien Ko; Hung-Jen Tang; Po-Ren Hsueh
Journal: Int J Antimicrob Agents Date: 2020-02-17 Impact factor: 5.283
