A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records.

Tasnia Tahsin, Davy Weissenbacher, Robert Rivera, Rachel Beard, Mari Firago, Garrick Wallstrom, Matthew Scotch, Graciela Gonzalez.

Abstract

OBJECTIVE: The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping it to latitude/longitude coordinates using knowledge derived from external geographical databases.
MATERIALS AND METHODS: We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitudes of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus.
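
The general idea of the steps described above (start from the record's own location, look for a more specific place in the article text, disambiguate it, then map it to coordinates) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `GAZETTEER` contents, the rough coordinates, and the helper names `find_places`, `candidates`, and `resolve` are all hypothetical, and the single "same country" rule stands in for the paper's much richer rule set.

```python
import re

# Hypothetical toy gazetteer: place name -> candidate senses.
# Coordinates are rough and purely illustrative.
GAZETTEER = {
    "China": [{"country": "China", "lat": 35.0, "lon": 105.0}],
    "Guangdong": [{"country": "China", "lat": 23.4, "lon": 113.4}],
    "Georgia": [
        {"country": "Georgia", "lat": 42.0, "lon": 43.5},  # the country
        {"country": "USA", "lat": 32.7, "lon": -83.4},     # the US state
    ],
}

def find_places(text):
    """Rule: return gazetteer names that occur as whole words in the text."""
    return [n for n in GAZETTEER if re.search(rf"\b{re.escape(n)}\b", text)]

def candidates(name, country):
    """Disambiguate: keep only senses of `name` located in `country`."""
    return [c for c in GAZETTEER.get(name, []) if c["country"] == country]

def resolve(record_country, article_text):
    """Prefer a more specific article location that is consistent with the
    country in the record metadata; otherwise fall back to the country."""
    for name in find_places(article_text):
        if name != record_country:
            match = candidates(name, record_country)
            if match:
                return name, (match[0]["lat"], match[0]["lon"])
    match = candidates(record_country, record_country)
    if match:
        return record_country, (match[0]["lat"], match[0]["lon"])
    return None  # nothing in the gazetteer matches

print(resolve("China", "Samples were collected at markets in Guangdong."))
print(resolve("USA", "Isolates were obtained from poultry in Georgia."))
```

In the second call, the record's country is what selects the US state rather than the country of Georgia, which is the kind of ambiguity the disambiguation step has to handle.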
RESULTS: We found the precision, recall, and f-measure of our system for linking GenBank records to the latitude/longitudes of their LOIH to be 0.832, 0.967, and 0.894, respectively.
DISCUSSION: Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction.
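
As a quick consistency check on the reported figures, the short Python sketch below recomputes the f-measure from the stated precision and recall. It assumes the abstract refers to the balanced F-measure (the harmonic mean of precision and recall, often written F1), which the abstract does not state explicitly; the function name `f_measure` and its `beta` parameter are illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as reported in the abstract.
precision, recall = 0.832, 0.967
print(round(f_measure(precision, recall), 3))  # prints 0.894
```

Under that assumption the result agrees with the reported 0.894.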
CONCLUSION: Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles.
© The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

Keywords:  information extraction; natural language processing; phylogeography

Year:  2016        PMID: 26911818      PMCID: PMC4997033          DOI: 10.1093/jamia/ocv172

Source DB:  PubMed          Journal:  J Am Med Inform Assoc        ISSN: 1067-5027            Impact factor:   4.497


  14 in total

1.  GeoBoost: accelerating research involving the geospatial metadata of virus GenBank records.

Authors:  Tasnia Tahsin; Davy Weissenbacher; Karen O'Connor; Arjun Magge; Matthew Scotch; Graciela Gonzalez-Hernandez
Journal:  Bioinformatics       Date:  2018-05-01       Impact factor: 6.937

2.  GenBank as a Source to Monitor and Analyze Host-Microbiome Data.

Authors:  Vivek Ramanan; Shanti Mechery; Indra Neil Sarkar
Journal:  Bioinformatics       Date:  2022-07-08       Impact factor: 6.931

3.  Linking dimensions of data on global marine animal diversity.

Authors:  Thomas J Webb; Bart Vanhoorne
Journal:  Philos Trans R Soc Lond B Biol Sci       Date:  2020-11-02       Impact factor: 6.237

4.  Seqenv: linking sequences to environments through text mining.

Authors:  Lucas Sinclair; Umer Z Ijaz; Lars Juhl Jensen; Marco J L Coolen; Cecile Gubry-Rangin; Alica Chroňáková; Anastasis Oulas; Christina Pavloudi; Julia Schnetzer; Aaron Weimann; Ali Ijaz; Alexander Eiler; Christopher Quince; Evangelos Pafilis
Journal:  PeerJ       Date:  2016-12-20       Impact factor: 2.984

5.  Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods.

Authors:  Davy Weissenbacher; Abeed Sarker; Tasnia Tahsin; Matthew Scotch; Graciela Gonzalez
Journal:  AMIA Jt Summits Transl Sci Proc       Date:  2017-07-26

6.  The Effects of Sampling Location and Predictor Point Estimate Certainty on Posterior Support in Bayesian Phylogeographic Generalized Linear Models.

Authors:  Daniel Magee; Jesse E Taylor; Matthew Scotch
Journal:  Sci Rep       Date:  2018-04-12       Impact factor: 4.379

7.  Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research.

Authors:  Tasnia Tahsin; Davy Weissenbacher; Demetrius Jones-Shargani; Daniel Magee; Matteo Vaiente; Graciela Gonzalez; Matthew Scotch
Journal:  Database (Oxford)       Date:  2017-01-01       Impact factor: 3.451

8.  Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature.

Authors:  Arjun Magge; Davy Weissenbacher; Abeed Sarker; Matthew Scotch; Graciela Gonzalez-Hernandez
Journal:  Pac Symp Biocomput       Date:  2019

9.  GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography.

Authors:  Arjun Magge; Davy Weissenbacher; Karen O'Connor; Tasnia Tahsin; Graciela Gonzalez-Hernandez; Matthew Scotch
Journal:  Bioinformatics       Date:  2020-12-22       Impact factor: 6.937

10.  A systematic review of spatial decision support systems in public health informatics supporting the identification of high risk areas for zoonotic disease outbreaks.

Authors:  Rachel Beard; Elizabeth Wentz; Matthew Scotch
Journal:  Int J Health Geogr       Date:  2018-10-30       Impact factor: 3.918
