Mathias Dillen1, Quentin Groom1, Simon Chagnoux2, Anton Güntsch3, Alex Hardisty4, Elspeth Haston5, Laurence Livermore6, Veljo Runnel7, Leif Schulman8, Luc Willemse9, Zhengzhe Wu8, Sarah Phillips10.
Abstract
BACKGROUND: More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons.
NEW INFORMATION: To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used where a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.
Year: 2019 PMID: 30833825 PMCID: PMC6396854 DOI: 10.3897/BDJ.7.e31817
Source DB: PubMed Journal: Biodivers Data J ISSN: 1314-2828
Table 1. The guidelines given to herbaria to select specimens for the test dataset. The goal was not to obtain a representative sample of all specimens, but to obtain comparable subsets that have labels written in different languages, are printed or handwritten, cover a wide range of dates, include both type specimens and general collections, and come from different families and different parts of the world.
| Number of specimens | Type status | Date of collection | Geography |
|---|---|---|---|
| 25 | Type | < 1970 | Any country |
| 25 | Type | > 1970 | Any country |
| 25 | non-Type | < 1970 | From the country where the herbarium is located |
| 25 | non-Type | > 1970 | From the country where the herbarium is located |
| 100 | non-Type | Any | non-Type specimens from one other country or region of which the herbarium possesses a substantial number of specimens |
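The template above fixes how many specimens each herbarium contributes. As a minimal sketch of that arithmetic (the variable names are illustrative and not part of the dataset), the quotas can be summed and multiplied by the nine contributing collections to recover the 1,800 images stated in the abstract:

```python
# Quotas copied from the selection template:
# (number of specimens, type status, date of collection, geography).
TEMPLATE = [
    (25, "Type", "< 1970", "any country"),
    (25, "Type", "> 1970", "any country"),
    (25, "non-Type", "< 1970", "country where the herbarium is located"),
    (25, "non-Type", "> 1970", "country where the herbarium is located"),
    (100, "non-Type", "any", "one other country or region"),
]

# Number of contributing collections, as stated in the abstract.
N_HERBARIA = 9

# Specimens requested from a single herbarium.
per_herbarium = sum(count for count, *_ in TEMPLATE)

# Specimens in the whole benchmark if every herbarium fills its quota.
total = per_herbarium * N_HERBARIA

print(per_herbarium, total)
```

Individual herbaria deviated from the template in how they filled the categories, so this only checks the overall totals, not the composition of each contribution.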
Table 2. Contributions of nine different institutes to the dataset. Availability of JPG and TIFF images is indicated, as well as the source of label data. Most institutes were able to follow the template in Table 1. The regions picked for the 100 non-type specimens are indicated in the last column, as are deviations from the template in Table 1. Institution codes follow Index Herbariorum (http://sweetgum.nybg.org/science/ih/). The DOI of the collections is listed if GBIF was used as a data source. FinBIF is the Finnish Biodiversity Information Facility available at www.species.fi (Schulman et al. 2018). JACQ is a joint specimen data management system of over 30 European and Asian herbaria available at https://herbarium.univie.ac.at/database/ (Rainer and Vitek 2009).
| Institution | Institution Code | Specimen selection and deviations from Table 1 |
|---|---|---|
| | BR | As Table 1; 100 from AU, CA, NZ, US |
| Royal Botanic Gardens, | K | As Table 1; 100 from BR |
| Natural History Museum, | BM | As Table 1; 100 from AU, CA, NZ, US |
| Botanic Garden and Botanical Museum, | B | As Table 1; 100 from AU, BR, CN, ID, TZ, US |
| Royal Botanic Garden | E | As Table 1; 100 from CN |
| National Museum of Natural History, | P | 50 Type, 50 non-Type FR, 100 non-Type not FR |
| Natural History Museum, University of | TU | 100 < 1970, 100 > 1970 |
| | L | As Table 1; 100 from ID; no selection on date |
| Finnish Museum of Natural History LUOMUS, University of | H | As Table 1; 14 FI, 36 ET instead of 50 FI; 100 from AU, BR, CN, |
Figure 1. A classification of the languages used on labels of the different specimens. EN = English, FR = French, LA = Latin, ET = Estonian, DE = German, NL = Dutch, PT = Portuguese, ES = Spanish, SV = Swedish, RU = Russian, FI = Finnish and IT = Italian. ZZ indicates a single language could not be determined: either there were multiple languages used on the label, there was no obvious use of a certain language (i.e. only scientific Latin terms) or the language was not readily identifiable. Different herbaria are identified by their Index Herbariorum codes (Institution Code in Table 2).
Figure 4. The location of geolocated specimens within the dataset and the number of specimens from each country. A total of 267 (15%) specimens have coordinates associated with them and 1,695 (94%) are located to a country. Both categories may overlap. The map uses a Mollweide equal-area projection.
Figure 3. A stacked pie chart generated using Krona (Ondov et al. 2015, https://github.com/marbl/Krona/), depicting the taxonomic distribution by phylum, order and family (if known). Missing taxa were extracted from the GBIF backbone by family, if possible. For H, they were extracted by genus, as family was unavailable. An interactive version of this graph is available as an HTML file in the supplementary material.
Figure 2. The distribution of collection dates (by year, if known) of the specimens in the dataset for each providing institution. The heat colour indicates the number of specimens for each 10-year period. Year data were extracted from the Darwin Core terms eventDate and verbatimEventDate if these followed the ISO 8601 standard. Codes for the herbaria follow Table 2.
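The year extraction described in the Figure 2 caption can be sketched roughly as follows. This is not the authors' code. It assumes that only extended-format ISO 8601 strings (`YYYY`, `YYYY-MM` or `YYYY-MM-DD`, optionally followed by a time or an interval end) count as valid, that eventDate is tried before verbatimEventDate, and that each 10-year period starts on a year ending in 0. The helper names and the example records are hypothetical.

```python
import re
from collections import Counter
from typing import Optional

# Assumption: accept YYYY, YYYY-MM or YYYY-MM-DD, optionally followed by
# a time part ("T...") or the end of an interval ("/...").
ISO_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?(?:T.*)?(?:/.*)?$")


def extract_year(value: Optional[str]) -> Optional[int]:
    """Return the year of an ISO 8601 date string, or None if it does not match."""
    if not value:
        return None
    match = ISO_DATE.match(value.strip())
    return int(match.group(1)) if match else None


def decade(year: int) -> int:
    """Return the first year of the 10-year period containing `year`."""
    return year - year % 10


# Made-up records using Darwin Core term names; not taken from the dataset.
records = [
    {"institutionCode": "BR", "eventDate": "1963-03-08", "verbatimEventDate": "8 March 1963"},
    {"institutionCode": "BR", "eventDate": "", "verbatimEventDate": "1895"},
    {"institutionCode": "K", "eventDate": "1906-06/1906-07", "verbatimEventDate": ""},
    {"institutionCode": "K", "eventDate": "", "verbatimEventDate": "s.d."},
]

# Count specimens per (institution, 10-year period); undated records are skipped.
counts = Counter()
for record in records:
    year = extract_year(record["eventDate"])
    if year is None:
        year = extract_year(record["verbatimEventDate"])
    if year is not None:
        counts[(record["institutionCode"], decade(year))] += 1

for (code, period), n in sorted(counts.items()):
    print(code, period, n)
```

Free-text dates such as "8 March 1963" are deliberately left unparsed here, which mirrors the caption's restriction to ISO 8601 values and explains why some specimens have no year in the figure.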
| Column label | Column description |
|---|---|
| Data and links.csv | Supplementary Info 5 |