Quentin Groom, Mathias Dillen, Helen Hardy, Sarah Phillips, Luc Willemse, Zhengzhe Wu.
Abstract
There are more than 1.2 billion biological specimens in the world's museums and herbaria. These objects are particularly important forms of biological sample and observation. They underpin biological taxonomy but the data they contain have many other uses in the biological and environmental sciences. Nevertheless, from their conception they are almost entirely documented on paper, either as labels attached to the specimens or in catalogues linked with catalogue numbers. In order to make the best use of these data and to improve the findability of these specimens, these data must be transcribed digitally and made to conform to standards, so that these data are also interoperable and reusable. Through various digitization projects, the authors have experimented with transcription by volunteers, expert technicians, scientists, commercial transcription services and automated systems. We have also been consumers of specimen data for taxonomical, biogeographical and ecological research. In this paper, we draw from our experiences to make specific recommendations to improve transcription data. The paper is split into two sections. We first address issues related to database implementation with relevance to data transcription, namely versioning, annotation, unknown and incomplete data and issues related to language. We then focus on particular data types that are relevant to biological collection specimens, namely nomenclature, dates, geography, collector numbers and uniquely identifying people. We make recommendations to standards organizations, software developers, data scientists and transcribers to improve these data with the specific aim of improving interoperability between collection datasets.
Year: 2019 PMID: 31819990 PMCID: PMC6901386 DOI: 10.1093/database/baz129
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. An example of location information on a herbarium sheet label. This text can be entered into a database in several ways, even if the goal is to transcribe it in a verbatim manner. For example, should the dwc:verbatimLocality contain the country ‘Gabon’, the province ‘Ogooué-Lolo’ and the habitat ‘In forest’? Transcribers may decide to distribute parts of this text into the fields dwc:country, dwc:stateProvince, dwc:locality, dwc:verbatimLocality and dwc:habitat or they might choose to transcribe everything literally into dwc:verbatimLocality. Source: http://www.botanicalcollections.be/specimen/BR0000013860288.
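The two transcription strategies described in the caption can be sketched in code. This is only an illustration: `LABEL_TEXT` is a hypothetical stand-in built from the three fragments quoted in the caption (the punctuation joining them is assumed, and the real label contains more text), and the function names are not part of Darwin Core. Keeping the full text in dwc:verbatimLocality alongside the atomized fields is a design choice made here, not a rule stated in the source.

```python
# Hypothetical stand-in for the label text: only 'Gabon', 'Ogooué-Lolo' and
# 'In forest' come from the caption; the separators are assumed.
LABEL_TEXT = "Gabon. Ogooué-Lolo. In forest"


def transcribe_verbatim(label_text):
    """Strategy 1: transcribe everything literally into dwc:verbatimLocality."""
    return {"dwc:verbatimLocality": label_text}


def transcribe_atomized(label_text):
    """Strategy 2: distribute the parts over more specific Darwin Core terms.

    Assumes exactly three period-separated parts, which holds for the
    stand-in text above but not for labels in general.
    """
    country, province, habitat = [part.strip() for part in label_text.split(".")]
    return {
        "dwc:country": country,
        "dwc:stateProvince": province,
        "dwc:habitat": habitat,
        # Assumed choice: keep the untouched text so the split can be re-checked.
        "dwc:verbatimLocality": label_text,
    }
```

Two databases that each follow a different strategy will hold different values in dwc:verbatimLocality for the same label, which is the interoperability problem the figure illustrates.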
A list of use cases for verbatim data, with examples and notes on applications
| No. | Use case | Examples | Application notes |
|---|---|---|---|
| 1 | Facilitating data cleaning and indicating the degree of interpretation in the standardized fields | Dates that are found to be unlikely or impossible can be easily checked for typos or erroneous transcription | If a digital image of the label is available, then there is less need to check a verbatim transcription for validation. |
| 2 | Discovering information hidden in the typography of how text is presented on the label | The syntax of person names can be a clue to the writer’s identity and for linking related specimens | This is unnecessary for most specimens but is valuable for enriching poorly documented specimens. |
| 3 | Increasing the findability of specimens | Where a word, such as a place name, can be read but not understood, then the text can still be found | Original text can be searched in the original language. |
| 4 | Accommodating partial or uncertain transcriptions, which would otherwise clutter standardized, interpreted fields | The use of square brackets ([]) and ellipses to indicate uncertainty or a failure to read part of the text | Other transcribers can build on the initial attempt, and it will be clear that the information is present on the label. |
| 5 | Providing training and validation source data for automated text capture methods | Automated reading of 19th century handwriting and recognition of symbols used on labels | Finding gold standard training data for algorithms is a common problem. |
| 6 | Accommodating data that are not sufficiently standardized for the interpreted field or that fail to comply with the restrictions of the interpreted field | Dates that lack a year or data awaiting interpretation | It is common to find verbatim fields containing data in non-standard formats, yet they are not transcribed data either. |
| 7 | Accommodating data following obsolete or bespoke standards | Grid system location codes | When a database is migrated from one system to another, then verbatim fields are used to store old formats. |
| 8 | Preserving the original language when interpretation has included translation | Habitats can have some very specific meanings in different languages and they are difficult to translate because there may not be a direct equivalent. | This also improves the findability of specimens written in a different language. |
A list of terms for missing data values that could be applied to fields in Darwin Core
| Missing data terms | Definition | Example |
|---|---|---|
| unknown | The information is not digitally available. | Empty value in a digital record of unknown provenance |
| unknown:undigitized | The information is not digitally available. No attempt has been made to digitize it. | Empty value in a skeletal record to which data still need to be added from the label |
| unknown:missing | The information is not digitally available. It appeared to be absent during digitization. | A value of S.D. used by transcription platforms to indicate the absence of a date value |
| unknown:indecipherable | The information is not digitally available. It appeared to be present during digitization, but failed to be captured. | An indication made by a transcriber that they failed to transcribe the information |
| known:withheld | The information is digitally available, but it has been withheld by the provider. | A georeferenced record for which coordinate data are available but withheld for conservation considerations |
The generic unknown indicates that the information is indeed not available.
The additives undigitized, missing and indecipherable allow elaboration as to why the data are unavailable, if this reason is known.
known:withheld indicates that the data are digitally available in a more primary source and could potentially be retrieved after contacting the data provider.
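A consumer of such data might interpret these terms as follows. This is a minimal sketch, assuming the additives are joined to `unknown` with a colon by analogy with `known:withheld`; the `MissingValue` type and the `parse_missing_term` function are hypothetical names, not part of Darwin Core.

```python
from typing import NamedTuple, Optional

# Additives named in the notes above; the colon-joined spelling is an assumption.
ADDITIVES = {"undigitized", "missing", "indecipherable"}


class MissingValue(NamedTuple):
    digitally_available: bool  # True only for known:withheld
    reason: Optional[str]      # None for the generic 'unknown'


def parse_missing_term(value: str) -> Optional[MissingValue]:
    """Return a MissingValue for a recognised term, or None for ordinary data."""
    if value == "unknown":
        return MissingValue(False, None)
    if value == "known:withheld":
        return MissingValue(True, "withheld")
    prefix, _, additive = value.partition(":")
    if prefix == "unknown" and additive in ADDITIVES:
        return MissingValue(False, additive)
    return None  # not a missing-data term, so treat it as real content
```

Separating "is the value retrievable" from "why is it absent" lets downstream users decide, for example, whether contacting the data provider is worthwhile.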
Figure 2. Examples of potential problems encountered while transcribing dates from specimen labels. (a) Handwriting difficult to interpret (1849 or 1899). (b) Symbolism used can be interpreted differently (5 February or 5 November). (c) Impossible but partially true date (correct year was 2002). (d) Impossible but likely mostly true date. (e) Uncertainty of order of day and month and missing century digits (2 December or 12 February, of 1981 or 1881). All examples from specimens in the Meise Botanic Garden herbarium.
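Case (e) can be made concrete by enumerating every calendar date consistent with an ambiguous label, rather than silently picking one. The sketch below is illustrative only: the function name `candidate_dates` and the default pair of centuries are assumptions, not part of any standard or of the authors' tooling.

```python
from datetime import date
from itertools import product


def candidate_dates(a, b, yy, centuries=(1800, 1900)):
    """List every valid date consistent with an ambiguous 'a/b/yy' label.

    Both day/month orders and each assumed century are tried; impossible
    combinations (for example a month of 31) are discarded.
    """
    candidates = set()
    for (day, month), century in product([(a, b), (b, a)], centuries):
        try:
            candidates.add(date(century + yy, month, day))
        except ValueError:
            pass  # not a real calendar date, so it cannot be the answer
    return sorted(candidates)
```

For the label in (e), `candidate_dates(2, 12, 81)` yields four dates: 12 February and 2 December in both 1881 and 1981. Other evidence, such as the collector's active years, is then needed to choose between them.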
Figure 3. Change in dates of observation for occurrence records on GBIF. Note the 12 spikes corresponding to the first day of each month, with a disproportionately large spike for the first of January. This is most likely caused by many systems, including GBIF itself, storing partial dates as the first day of the month and only using the start date of a date range. Created from a snapshot of GBIF taken on 06 April 2019.
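One way to avoid introducing such artefacts is to store a date only at the precision that is actually known, using the reduced-precision forms of ISO 8601 (for example `1981-02` for a month or `1981` for a year). The helper below is a minimal sketch of that idea; the name `to_iso_partial` is hypothetical and the function does not validate that the day exists in the given month.

```python
from typing import Optional


def to_iso_partial(year: int, month: Optional[int] = None,
                   day: Optional[int] = None) -> str:
    """Format a possibly incomplete date at the precision actually known."""
    if month is None:
        if day is not None:
            # A day without a month has no reduced-precision ISO 8601 form.
            raise ValueError("a day without a month cannot be expressed")
        return f"{year:04d}"
    if day is None:
        return f"{year:04d}-{month:02d}"
    return f"{year:04d}-{month:02d}-{day:02d}"
```

With this representation, a label reading only "February 1981" is stored as `1981-02` instead of being padded to `1981-02-01`, so it cannot be mistaken for an observation made on the first of the month.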