20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration.

Anne E Thessen, Jorrit H Poelen, Matthew Collins, Jen Hammock.

Abstract

Biodiversity information is made available through numerous databases, each with its own data model, web services, and data types. Combining data across databases leads to new insights, but it is difficult because each database uses its own system of identifiers. In the absence of stable and interoperable identifiers, databases are often linked using taxonomic names. This labor-intensive, error-prone, and lengthy process relies on accessible versions of nomenclatural authorities and on fuzzy-matching algorithms. Linking diverse data requires more than technology. Concrete progress on finding relationships between biodiversity datasets depends on new social collaborations such as the Global Unified Open Data Architecture (GUODA), which combines the skills of computer engineers from iDigBio, server resources from the Advanced Computing and Information Systems (ACIS) Lab, global-scale data presentation from the Encyclopedia of Life (EOL), and the contributions of independent developers and researchers. This paper discusses a technical solution developed by the GUODA collaboration for faster linking across databases, with a use case linking Wikidata and the Global Biotic Interactions database (GloBI). The GUODA infrastructure is a 12-node high-performance computing cluster with about 192 threads, 12 TB of storage, and 288 GB of memory. Using GUODA, 20 GB of compressed JSON from Wikidata was processed and linked to GloBI in about 10-11 min. Instead of comparing name strings or relying on a single identifier, Wikidata and GloBI were linked by comparing graphs of biodiversity identifiers external to each system. This method added 119,957 Wikidata links to GloBI, an increase of 13.7% in all outgoing name links in GloBI. Wikidata and GloBI were also compared to the Open Tree of Life Reference Taxonomy to examine consistency and coverage. Parsing the Wikidata, Open Tree of Life Reference Taxonomy, and GloBI archives and calculating consistency metrics took minutes on the GUODA platform. As a model collaboration, GUODA has the potential to revolutionize biodiversity science by bringing diverse, technically minded people together with high-performance computing resources that are accessible from a laptop or desktop. However, participating in such a collaboration still requires basic programming skills. © 2018 Thessen et al.
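The idea of linking through external identifiers rather than name strings can be illustrated with a minimal Python sketch. It is not the authors' code and says nothing about how GUODA implements the step. It assumes each Wikidata item and each GloBI taxon carries a set of identifiers from third-party systems, and it links two records when they share at least one such identifier, which is the simplest case of comparing identifier graphs. The function name, record keys, and identifier strings are all hypothetical placeholders.

```python
from collections import defaultdict


def link_by_shared_identifiers(wikidata_items, globi_taxa):
    """Map each GloBI taxon key to the Wikidata items sharing an external ID.

    Both arguments are dicts of record key -> set of external identifier
    strings. This input shape is an assumption made for illustration.
    """
    # Index Wikidata items by every external identifier they assert.
    items_by_external_id = defaultdict(set)
    for item_id, external_ids in wikidata_items.items():
        for external_id in external_ids:
            items_by_external_id[external_id].add(item_id)

    # A GloBI taxon links to every item reachable through a shared identifier.
    links = defaultdict(set)
    for taxon_key, external_ids in globi_taxa.items():
        for external_id in external_ids:
            links[taxon_key] |= items_by_external_id.get(external_id, set())

    # Drop taxa with no match and sort for stable output.
    return {key: sorted(items) for key, items in links.items() if items}


if __name__ == "__main__":
    # Hypothetical records; none of these identifiers are real.
    wikidata_items = {
        "Q_A": {"NCBI:1", "GBIF:10"},
        "Q_B": {"ITIS:5"},
    }
    globi_taxa = {
        "taxon-1": {"NCBI:1"},
        "taxon-2": {"ITIS:5", "GBIF:10"},
        "taxon-3": {"WORMS:9"},
    }
    print(link_by_shared_identifiers(wikidata_items, globi_taxa))
```

In this toy data, `taxon-2` matches two different items through two different identifiers. Conflicts of that kind are where a consistency check against a reference taxonomy, as described in the abstract, would be useful.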

Keywords:  Biodiversity; Collaboration; Graph; Identifiers; Linking; Wikidata

Year:  2018        PMID: 33816817      PMCID: PMC7924439          DOI: 10.7717/peerj-cs.164

Source DB:  PubMed          Journal:  PeerJ Comput Sci        ISSN: 2376-5992


