Patricia Kelbert, Gabriele Droege, Katharine Barker, Kyle Braak, E Margaret Cawsey, Jonathan Coddington, Tim Robertson, Jamie Whitacre, Anton Güntsch.
Abstract
With the rapidly growing number of data publishers, the process of harvesting and indexing information to offer advanced search and discovery becomes a critical bottleneck in globally distributed primary biodiversity data infrastructures. The Global Biodiversity Information Facility (GBIF) implemented a Harvesting and Indexing Toolkit (HIT), which largely automates data harvesting activities for hundreds of collection and observational data providers. The team of the Botanic Garden and Botanical Museum Berlin-Dahlem has extended this well-established system with a range of additional functions, including improved processing of multiple taxon identifications, the ability to represent associations between specimen and observation units, new data quality control and new reporting capabilities. The open source software B-HIT can be freely installed and used for setting up thematic networks serving the demands of particular user groups.
Year: 2015 PMID: 26544980 PMCID: PMC4636251 DOI: 10.1371/journal.pone.0142240
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. GBIF-HIT harvesting process.
It consists of four major steps that have to be executed after each update of a datasource. The harvested data is then parsed and stored in the database.
ABCD versions supported by the GBIF-HIT and B-HIT.
| Toolkit / ABCD version | 1.2 | 2.06 | 2.1 | EFG | GGBN | GGBN Enviro | ABCD Archive |
|---|---|---|---|---|---|---|---|
| GBIF-HIT | X | X | | | | | |
| B-HIT | X | X | X | X | X | X | |
Supported ABCD versions are marked with an X.
Darwin Core versions supported by the GBIF-HIT and B-HIT.
| Toolkit / Darwin Core (DwC) version | DwC 1.0, 1.4, 1.4-Geospatial, 1.4-Curatorial, MaNIS 1.0, MaNIS 1.21 | DwC Archive | DwC GGBN |
|---|---|---|---|
| GBIF-HIT | X | X | |
| B-HIT | X | X | X |
Supported DwC versions are marked with an X.
Fig 2. Principal model of raw and improved data in the B-HIT database.
Corresponding raw and improved tables (e.g. the identification from the provider and the improved identification) are linked through a 1-1 relation. Multiple identifications can be associated with a single record and are therefore linked through a 1-n relation with the (raw) occurrence table.
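The relations described in this caption can be illustrated with a small schema. The following is only a minimal sketch using Python's built-in `sqlite3` module: the table and column names (`raw_occurrence`, `raw_identification`, `improved_identification`) and the sample values are hypothetical and are not taken from the actual B-HIT database. A `UNIQUE` foreign key stands in for the 1-1 relation, and a plain foreign key for the 1-n relation.

```python
import sqlite3

# Hypothetical table and column names; the real B-HIT schema may differ.
SCHEMA = """
CREATE TABLE raw_occurrence (
    id INTEGER PRIMARY KEY,
    unit_id TEXT NOT NULL
);
-- 1-n: several raw identifications may point to one occurrence
CREATE TABLE raw_identification (
    id INTEGER PRIMARY KEY,
    occurrence_id INTEGER NOT NULL REFERENCES raw_occurrence(id),
    scientific_name TEXT
);
-- 1-1: UNIQUE allows at most one improved row per raw identification
CREATE TABLE improved_identification (
    id INTEGER PRIMARY KEY,
    raw_identification_id INTEGER NOT NULL UNIQUE
        REFERENCES raw_identification(id),
    scientific_name TEXT
);
"""


def build_demo_db():
    """Create an in-memory database with one occurrence and two identifications."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO raw_occurrence (id, unit_id) VALUES (1, 'unit-1')")
    conn.executemany(
        "INSERT INTO raw_identification (id, occurrence_id, scientific_name) "
        "VALUES (?, ?, ?)",
        [(1, 1, "provider name A"), (2, 1, "provider name B")],
    )
    conn.executemany(
        "INSERT INTO improved_identification (raw_identification_id, scientific_name) "
        "VALUES (?, ?)",
        [(1, "improved name A"), (2, "improved name B")],
    )
    return conn


def identifications_for(conn, occurrence_id):
    """Return (raw name, improved name) pairs for one occurrence record."""
    cursor = conn.execute(
        "SELECT r.scientific_name, i.scientific_name "
        "FROM raw_identification r "
        "JOIN improved_identification i ON i.raw_identification_id = r.id "
        "WHERE r.occurrence_id = ? ORDER BY r.id",
        (occurrence_id,),
    )
    return cursor.fetchall()
```

Calling `identifications_for(build_demo_db(), 1)` returns both identification pairs for the single occurrence, while inserting a second improved row for the same raw identification fails with `sqlite3.IntegrityError`.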
Fig 3. Representation of associated units in ABCD 2.1.
A DNA Sample with the triple ID “DB 4745—BGBM—DNA Bank” is associated to a tissue (triple ID “B GT 0004682—BGBM—Herbarium Berolinense”). This tissue is associated to a herbarium sheet (triple ID “B 10 0163635—BGBM—Herbarium Berolinense”). The associated dataset access point and triple ID make it possible to retrieve each record.
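The chain in this example (DNA sample, tissue, herbarium sheet) can be followed programmatically once each unit is keyed by its triple ID. The sketch below is illustrative only: the `TripleID` field names, the in-memory `ASSOCIATIONS` dictionary, and the `association_chain` helper are assumptions made for this example and do not reflect how B-HIT stores or resolves associations (which also involves the dataset access point).

```python
from typing import Dict, List, NamedTuple


class TripleID(NamedTuple):
    # Field names are assumptions; the caption only shows three joined parts.
    unit_id: str
    institution: str
    source: str


DNA = TripleID("DB 4745", "BGBM", "DNA Bank")
TISSUE = TripleID("B GT 0004682", "BGBM", "Herbarium Berolinense")
SHEET = TripleID("B 10 0163635", "BGBM", "Herbarium Berolinense")

# Hypothetical lookup: each unit maps to the unit it is associated with.
ASSOCIATIONS: Dict[TripleID, TripleID] = {DNA: TISSUE, TISSUE: SHEET}


def association_chain(
    start: TripleID, associations: Dict[TripleID, TripleID]
) -> List[TripleID]:
    """Follow associations from `start` until no further unit is linked."""
    chain = [start]
    seen = {start}
    current = start
    while current in associations:
        current = associations[current]
        if current in seen:  # stop on cyclic associations
            break
        chain.append(current)
        seen.add(current)
    return chain
```

Here `association_chain(DNA, ASSOCIATIONS)` yields the DNA sample, the tissue, and the herbarium sheet in that order.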
Fig 4. Web interface of B-HIT.
This extended user interface gives access to the new functionalities (i.e. Associated Datasource Harvesting, Data quality, Datasource Management) through a series of tabs.
Export subset of the quality logs.
| Test | Original value (countryname-ISOcode) | New value (countryname-ISOcode) | Suggestion or log | UnitIDs |
|---|---|---|---|---|
| | none-CH | Switzerland-CH | extracted country from gathering area | Bridel-1-512 Bridel-1-576 [..] |
| | none-none | United States-US | extracted country from locality | Bridel-1-525 Bridel-1-898 [..] |
| | none-MG | Madagascar-MG | extracted country from locality | Bridel-1-206 Bridel-1-359 |
| | Slovak Republic-none | Slovakia-SK | countryname replaced Slovak Republic by Slovakia | M-0136500-550428-132827 [..] |
| | Bayern-none | Germany-DE | countryname replaced Bayern by Germany | ZSM-A-20032864 / 604358 / 487654 |
For each kind of test, a dedicated table stores the test name, the value supplied by the provider, the improved value, and a brief explanation. The quality tables also record the list of affected records, which helps the provider find and correct its data.
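A quality log entry of this shape could be produced along the following lines. This is a hedged sketch, not B-HIT's implementation: the `QualityLogEntry` class, the `check_country_names` function, the test name `"countryname"`, and the two-entry replacement table are all hypothetical. The replacement pairs and the log wording are taken from the exported subset above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple

# Replacement pairs seen in the exported log subset; a real table would be larger.
COUNTRY_REPLACEMENTS: Dict[str, Tuple[str, str]] = {
    "Slovak Republic": ("Slovakia", "SK"),
    "Bayern": ("Germany", "DE"),
}


@dataclass
class QualityLogEntry:
    test: str       # hypothetical test name
    original: str   # provider value, formatted as countryname-ISOcode
    improved: str   # improved value, formatted as countryname-ISOcode
    log: str        # brief explanation
    unit_ids: List[str] = field(default_factory=list)  # affected records


def check_country_names(
    records: Iterable[Tuple[str, str, Optional[str]]]
) -> List[QualityLogEntry]:
    """Group records whose country name has a known replacement.

    Each record is a (unit_id, country_name, iso_code) tuple.
    """
    entries: Dict[Tuple[str, Optional[str]], QualityLogEntry] = {}
    for unit_id, name, iso in records:
        if name not in COUNTRY_REPLACEMENTS:
            continue  # nothing to improve for this record
        new_name, new_iso = COUNTRY_REPLACEMENTS[name]
        key = (name, iso)
        if key not in entries:
            entries[key] = QualityLogEntry(
                test="countryname",
                original=f"{name}-{iso or 'none'}",
                improved=f"{new_name}-{new_iso}",
                log=f"countryname replaced {name} by {new_name}",
            )
        entries[key].unit_ids.append(unit_id)
    return list(entries.values())
```

Grouping by original value keeps one log row per distinct problem, with the affected unit IDs collected alongside it, which mirrors the layout of the exported table.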