Agnes Kirchhoff1, Ulrich Bügel2, Eduard Santamaria2, Fabian Reimeier1, Dominik Röpert1, Alexander Tebbje3, Anton Güntsch1, Fernando Chaves2, Karl-Heinz Steinke3, Walter Berendsohn1.
Abstract
Over the past years, herbarium collections worldwide have started to digitize millions of specimens on an industrial scale. Although imaging costs are steadily falling, capturing the accompanying label information is still predominantly done manually and is becoming the principal cost factor. To streamline the capture of herbarium specimen metadata, we specified a formal, extensible workflow that integrates a wide range of automated specimen image analysis services. We implemented the workflow on the basis of OpenRefine, together with a plugin for handling service calls and responses. The evolving system presently covers the generation of optical character recognition (OCR) text from specimen images, the identification of regions of interest in images and the extraction of meaningful information items from the OCR output. These implementations were developed as part of the Deutsche Forschungsgemeinschaft-funded project 'A standardised and optimised process for data acquisition from digital images of herbarium specimens' (StanDAP-Herb).
Year: 2018 PMID: 30295725 PMCID: PMC6174549 DOI: 10.1093/database/bay103
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Herbarium sheet.
Figure 2. Information on specimen labels.
Figure 3. Digistreets for the MNHN in Paris (photo: P. Lafaite, MNHN).
Figure 4. The StanDAP-Herb main choreography.
Web services to enable the implementation of workflows for processing digital herbarium specimens (23, 24)
| Type | Approach | Service | Input | Output | Method |
|---|---|---|---|---|---|
| Image-based | Object recognition | Scale Matching Service | GUID, SRI | Scale region coordinates | Template matching algorithm (uses an example image of the scale being searched for to detect it in other images) |
| Image-based | DPI recognition | DPI Service | SRI resolution, SRI size, scale region coordinates | DPI | Computation of DPI using the physical size of the actual scale and the size of its digital counterpart on the herbarium sheet |
| Image-based | Object recognition | Text Region Service | GUID, DPI | Text region coordinates | Line contrast approach (exploits the very high horizontal contrast of text lines, in which dark and bright areas alternate rapidly) / machine learning approach to detect text-like structures |
| Image-based | OCR | Tesseract/OmniPage Service | GUID, text region coordinates | Text | Tesseract/OmniPage OCR algorithm |
| Text-based | Dictionary-based | Scientific Name Extractor | Text | Scientific names | Parsing with a dictionary based on Global Names |
| Text-based | Dictionary-based | Botanist Name Extractor | Text | Botanist names | Parsing with a dictionary based on the botanists’ database of the Harvard University Herbaria & Libraries |
| Text-based | Regular expression | Date Extractor | Text | Dates | Matching using regular expressions for dates related to collection, accession and determination |
| Text-based | Regular expression | Geographical Coordinates (GeoCoord) Extractor | Text | Coordinates | Matching using regular expressions for geographical coordinates |
| Text-based | Dictionary-based | Location Extractor | Text | Locations | Parsing based on the Cartographic Location and Vicinity Indexer (CLAVIN) library |
GUID indicates Globally Unique Identifier; SRI, Scale Reference Image.
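
The DPI Service row in the table above describes deriving the resolution of a specimen image from the known physical size of the scale and the size of its counterpart on the sheet. The following is a minimal sketch of that arithmetic, not the StanDAP-Herb implementation. The names `Region` and `estimate_dpi` are hypothetical, and the sketch assumes that the scale lies horizontally and that the detected region spans the full width of the scale shown in the SRI.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Pixel bounding box of the scale as detected on the herbarium sheet image."""
    left: int
    top: int
    right: int
    bottom: int


def estimate_dpi(scale_region: Region, sri_width_px: int, sri_dpi: float) -> float:
    """Estimate the DPI of a sheet image from a detected scale region.

    Assumption: the scale reference image (SRI) was captured at a known
    resolution, so its physical width follows from its pixel width.
    """
    if sri_width_px <= 0 or sri_dpi <= 0:
        raise ValueError("SRI width and resolution must be positive")

    region_width_px = scale_region.right - scale_region.left
    if region_width_px <= 0:
        raise ValueError("scale region must have a positive width")

    # Physical width of the scale in inches, derived from the reference image.
    physical_width_in = sri_width_px / sri_dpi

    # The same physical width covers region_width_px pixels on the sheet image.
    return region_width_px / physical_width_in
```

For example, if the SRI is 2400 px wide at 600 DPI (a 4-inch scale) and the scale occupies 1200 px on the sheet image, `estimate_dpi(Region(100, 50, 1300, 150), 2400, 600)` returns `300.0`.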
Figure 5. Pre-OCR workflow.
Figure 6. OCR workflow.
Figure 7. Extractor workflow.
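
The Date Extractor and GeoCoord Extractor listed in the table match OCR text against regular expressions. The sketch below illustrates that idea on a single line of label text. The patterns, the names `DATE_RE` and `COORD_RE`, and the function `extract_items` are illustrative assumptions and are not the expressions used by the StanDAP-Herb services. They cover only a few common date notations and degree/minute(/second) coordinates with a hemisphere letter.

```python
import re

# Assumed date notations: 1987-05-12, 12.05.1987, 12 May 1987.
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}"
    r"|\d{1,2}\.\s?\d{1,2}\.\s?\d{4}"
    r"|\d{1,2}\.?\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{4})\b",
    re.IGNORECASE,
)

# Assumed coordinate notation: degrees, minutes, optional seconds, hemisphere.
COORD_RE = re.compile(
    r"""\d{1,3}\s?°\s?                      # degrees
        \d{1,2}\s?['′]                      # minutes
        (?:\s?\d{1,2}(?:\.\d+)?\s?["″])?    # optional seconds
        \s?[NSEW]                           # hemisphere letter
    """,
    re.VERBOSE,
)


def extract_items(ocr_text: str) -> dict:
    """Return all date and coordinate strings found in the OCR text."""
    return {
        "dates": DATE_RE.findall(ocr_text),
        "coordinates": COORD_RE.findall(ocr_text),
    }


if __name__ == "__main__":
    label = "Berlin-Dahlem, 52°27'N 13°18'E, leg. 12.05.1987, det. 3 June 1990"
    print(extract_items(label))
```

A production extractor would additionally need to tolerate OCR errors (for example `O` read for `0`) and to decide which date refers to collection, accession or determination; the sketch leaves both out.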