Abraham Nieva de la Hidalga, Paul L Rosin, Xianfang Sun, Ann Bogaerts, Niko De Meeter, Sofie De Smedt, Maarten Strack van Schijndel, Paul Van Wambeke, Quentin Groom.
Abstract
Digitisation of natural history collections has evolved from creating databases for the recording of specimens' catalogue and label data to include digital images of specimens. This has been driven by several important factors, such as a need to increase global accessibility to specimens and to preserve the original specimens by limiting their manual handling. The size of the collections pointed to the need for high-throughput digitisation workflows. However, digital imaging of large numbers of fragile specimens is an expensive and time-consuming process that should be performed only once. To achieve this, the digital images produced need to be useful for the largest set of applications possible and have a potentially unlimited shelf life. The constraints on digitisation speed need to be balanced against the applicability and longevity of the images, which, in turn, depend directly on the quality of those images. As a result, the quality criteria that specimen images need to fulfil influence the design, implementation and execution of digitisation workflows. Different standards and guidelines for producing quality research images from specimens have been proposed; however, their actual adaptation to suit the needs of different types of specimens requires further analysis. This paper presents the digitisation workflow implemented by Meise Botanic Garden (MBG). This workflow is relevant because of its modular design, its strong focus on image quality assessment, its flexibility, which allows combining in-house and outsourced digitisation, processing, preservation and publishing facilities, and its capacity to evolve by integrating alternative components from different sources.
The design and operation of the digitisation workflow are described to show how it was derived, with particular attention to the audit trail built into the workflow. This audit trail ensures the scalable production of high-quality specimen images and ensures that new modules do not affect either the speed of imaging or the quality of the images produced.
Keywords: Data capture; digital specimen; digitisation workflow; herbarium sheets; image quality control; natural history collections
Year: 2020 PMID: 32269476 PMCID: PMC7125238 DOI: 10.3897/BDJ.8.e47051
Source DB: PubMed Journal: Biodivers Data J ISSN: 1314-2828
Table 1. Quality criteria for herbarium sheet images.
| Purpose | Resolution | Colour depth | Colour accuracy or tonal range |
|---|---|---|---|
| Web Publishing | 72 PPI | 24-bit colour | ΔE < 5 |
| Printing | 300 PPI | 24-bit colour | ΔE < 5 |
| OCR Labels | 400 PPI | 8-bit grey scale | Min: 28 steps |
| Identify Specimen Features | 400 PPI | 24-bit colour | ΔE < 5 |
| Research on Specimen | 600 PPI* | 24-bit colour | ΔE < 5 |
| Preservation | 600 PPI* | 24-bit colour | ΔE < 5 |

* Minimum resolution recommended; if digitisation devices available allow for higher resolution, that resolution should be used.
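
The ΔE < 5 criterion in the table above bounds the colour difference between a measured colour-chart patch and its reference value. The source does not state which ΔE formula is applied; as an illustration only, the sketch below assumes the simplest one, CIE76, which is the Euclidean distance between two colours in CIELAB space. The function name and the sample L*a*b* values are hypothetical.

```python
import math


def delta_e_cie76(lab_measured, lab_reference):
    """Euclidean distance between two CIELAB colours (CIE76 formula)."""
    return math.dist(lab_measured, lab_reference)


# Hypothetical (L*, a*, b*) values for one colour-chart patch.
measured = (52.0, 10.0, -6.0)
reference = (50.0, 12.0, -5.0)

de = delta_e_cie76(measured, reference)
print(f"dE = {de:.2f}, within tolerance: {de < 5}")
```

Other ΔE formulas (for example CIEDE2000) weight the components differently, so a threshold of 5 is not directly comparable across formulas.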
Figure 1. Examples of herbarium sheets and the required elements to capture. The left image corresponds to a specimen digitised during the GPI project and the one on the right is a specimen digitised during DOE!. The elements are (1) Colour Chart, (2) Scale Bar, (3) Barcode, (4) Labels and (5) Institution Name. As the images show, some elements may be combined, for instance the scale bar and institution name on the left and colour chart and scale on the right.
Table 2. MBG digitisation workflow tasks.
| Selection of specimens to digitise. Retrieval from storage. Identification of specimens (barcoding). Conservation/restoration of specimens selected for digitisation. Specifying safeguards for handling specimens. Marking specimens that are already digitised. Extraction of exceptions for internal imaging (e.g. capsuled specimens or specimens that needed to be imaged twice due to added booklets). Creation of metadata record / adding cover barcodes for external transcription of the labels. Transfer to digitisation station. | Specimens are selected and prioritised for digitisation by collection curators. |
| Station(s) Setup: digitisation equipment selection, acquisition and set up. Equipment testing/calibration. Training of digitisation technicians. | Equipment should be calibrated to minimise image postprocessing after digitisation. |
| Digitisation: mounting for imaging. Digitisation of a specimen, creation of a master file (TIFF). Unmounting and return of specimen. Data capture, based on the image when outsourced. | Identification, digitisation and [meta]data capture, so that images are correctly linked to the corresponding specimen records. |
| Retrieval of master files (TIFF) from temporary storage. Creation of derivatives for publishing and distribution (JPEG2000 and JPG). Verification of naming and linking of files (based on barcode ID). Verification of file formats. | Verification of master image resolution format. |
| Imaging (2) and image processing (3) are integrated. The task receives specimens and produces full sets of images (TIFF, JPEG2000 and JPG). | Same as those for 2 and 3 above. |
| Verification of image sets (correspondence of master and derivatives). Verification of naming and linking of files (based on barcode ID). | The task is simpler. However, the load increases considerably, from 5,000 to 25,000 weekly specimen image sets to process (400% increase). |
| Transfer of master and derivative files to archive servers and image servers. Create and preserve links to storage. | Verify that master and derivative files are not corrupted in transfer to storage. |
| Deposit master files on external archives for long term preservation. | Verify master is not corrupted in transfer and images are recoverable. |
| Extraction of data from images, populating/complementing specimen record. Final verification/correction of specimen data. | Verify readability of image data for transcription. |
| Extraction of data from images, populating/complementing specimen record. | Verify readability of image data for transcription. |
| Final verification/correction of specimen data. | Verification against reference image and recorded data before publishing. |
| Creation of digital specimen, verifying links to images, data, physical specimen and collection management system data. Publishing of digital specimen. | Data, metadata, persistent identifiers and links are used to build stable long-lasting specimens which adhere to FAIR data principles. |
Figure 2. MBG digitisation workflow diagram. The circle shapes at the top and bottom indicate the start and end of the workflow. The rounded corner boxes represent workflow tasks (described in Table 2). The lines connecting tasks indicate flow of execution. The squares on connecting lines represent the data objects produced. The diamond shapes indicate a fork or merge of the flow. The bar shapes represent flow synchronisation, i.e. processing waits for completion of previous tasks.
Table 3. Image processing sub-tasks.
| No. | Sub-task | Type | Image set | Input state | Output state (success) | Output state (error) |
|---|---|---|---|---|---|---|
| 1 | Check file name (Table 5) | AT, QA | TIFF set | | names_ok | names_error |
| 2 | Check TIFF file size, image dimensions and resolution (Table 6) | AT, QA | TIFF set | names_ok | fssr_ok | fssr_error |
| 3 | Generate JPEG 2000 derivatives | AT, IH | TIFF set | fssr_ok | jp2_gen | jp2_gen_err |
| | | | JP2 set | | jp2_gen | jp2_gen_err |
| 4 | Generate JPEG derivatives | AT, IH | TIFF set | jp2_gen | jpg_gen | jpg_gen_err |
| | | | JPG set | | jpg_gen | jpg_gen_err |
| 5 | Check metadata file structure (Table 7) | AT, QC | TIFF set | jpg_gen | md5_ok | md5_error |
| 6 | Check duplicates (Table 8) | AT, QA | TIFF set | md5_ok | unique | duplicate |
| 7 | Check structure and file size (Table 9) | AT, QA | TIFF set | unique | fss_ok | fss_error |
| | | | JP2 set | jp2_gen | fss_ok | fss_error |
| 8 | Visual QC of TIFF files (Table 10) | MT, QC | TIFF set | fss_ok | vqc_ok | vqc_error |
| 9 | Check file name (Table 5) | AT, QA, IH | JPG set | jpg_gen | jpgn_ok | jpgn_error |
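
The state columns in the table above chain the sub-tasks together: a sub-task runs only on a set that is in the expected input state, and it leaves the set in either a success or an error state. A minimal sketch of that audit trail for the TIFF set follows. The transitions are copied from the table; the sub-task labels, the `run_chain` function and the stand-in check callables are hypothetical and do not reproduce the MBG scripts.

```python
# (input state, sub-task label, success state, error state) for the TIFF set.
# None marks the initial state, before any check has run.
TIFF_CHAIN = [
    (None, "check file name", "names_ok", "names_error"),
    ("names_ok", "check size and resolution", "fssr_ok", "fssr_error"),
    ("fssr_ok", "generate JP2", "jp2_gen", "jp2_gen_err"),
    ("jp2_gen", "generate JPG", "jpg_gen", "jpg_gen_err"),
    ("jpg_gen", "check md5 metadata", "md5_ok", "md5_error"),
    ("md5_ok", "check duplicates", "unique", "duplicate"),
    ("unique", "check structure", "fss_ok", "fss_error"),
    ("fss_ok", "visual QC", "vqc_ok", "vqc_error"),
]


def run_chain(checks, chain=TIFF_CHAIN):
    """Advance a batch through the chain and stop at the first failure.

    `checks` maps each sub-task label to a callable returning True on
    success. The returned list of states reached is the audit trail.
    """
    state, trail = None, []
    for required, label, ok_state, err_state in chain:
        if state != required:
            raise RuntimeError(f"{label}: expected {required}, got {state}")
        state = ok_state if checks[label]() else err_state
        trail.append(state)
        if state == err_state:
            break
    return trail


# Stand-in checks: everything passes except the duplicate check.
checks = {label: (lambda: True) for _, label, _, _ in TIFF_CHAIN}
checks["check duplicates"] = lambda: False
print(run_chain(checks))
```

Because every sub-task names the state it requires, a new module can be inserted by adding one row to the chain without touching the other checks.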
Table 4. Store images sub-tasks.
| No. | Sub-task | Type | Image set | Input state | Output state (success) | Output state (error) |
|---|---|---|---|---|---|---|
| 1 | Remove duplicates and bad crops (Table 11) | MT, QA | TIFF set | vqc_ok | dup_rmv | |
| | | | JP2 set | fss_ok | dup_rmv | |
| | | | JPG set | jpgn_ok | dup_rmv | |
| 2 | Copy files to archive | AT | JP2 set | dup_rmv | stg_ok | stg_error |
| | | | JPG set | dup_rmv | stg_ok | stg_error |
| 3 | Generate image viewers | AT | JP2 set | stg_ok | vwrg_ok | |
| 4 | Copy files to FTP server | AT | TIFF set | stg_ok | svrc_ok | svrc_error |
| 5 | Copy files to external archive | AT | TIFF set | svrc_ok | arc_ok | arc_error |
| 6 | Check JP2 and JPG sets stored (Table 12) | AT, QA | JP2 set | vwrg_ok | stgv_ok | stgv_err |
| | | | JPG set | stg_ok | stgv_ok | stgv_err |
| 7 | Clear buffer server (Table 13) | AT, QA | TIFF set | arc_ok | bufc_ok | bufc_err |
| 8 | Clear buffer server | AT | JP2 set | stgv_ok | bufc_ok | bufc_err |
| | | | JPG set | stgv_ok | bufc_ok | bufc_err |
Table 5. Check file name sub-task*.
| Agent | Check-barcode (script). |
| Function | Verify that image file names structure is formed using the corresponding barcode. |
| Dependencies | ZBAR open source library for reading barcodes from image files ( |
| Target(s) | Master images of TIFF set (Image Processing sub-task 1); production images on JPG and JP2 sets (Image Processing sub-task 10). |
| Criteria | Each file name must conform to the format: (1) two-letter prefix (here BR); (2) 13-digit string, padded left with zeros, which contains the digits in the barcode; (3) optional two-character suffix, an underscore and a letter (a-z), used if the specimen is associated with more than one image (e.g. _a); (4) file extension consistent with the set being processed (.tif, .jp2 or .jpg). |
| Success | Filenames are correctly formed (names_ok). |
| Fail | Filenames are incorrect (names_err). |
| Example | Valid file names for the images in the three sets corresponding to specimens with barcode from the example shown on Fig. BR0000008378064.tif BR0000008378064.jp2 BR0000008378064.jpg |
| Exceptions | Herbarium sheets can contain more than one specimen and more than one barcode. These sheets may be flagged as incorrect and require manual processing. Additionally, herbarium sheets can have legacy barcodes from previous cataloguing efforts and, consequently, may have more than one barcode even when having only one specimen. If this is the case, the legacy barcode is removed, the image is deleted and the specimen is sent back for re-imaging. |
| * Owing to the need to image collections of other herbaria and various subcollections, other filename formats have had to be accommodated. | |
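
The naming criteria above translate directly into a pattern match. The sketch below is an illustration under stated assumptions, not the Check-barcode script: it takes the barcode digits as a string argument instead of reading them from the image with ZBAR, and the function and parameter names are hypothetical.

```python
import re

# Two-letter prefix, 13 digits, optional "_" + letter suffix, known extension.
NAME_RE = re.compile(
    r"^(?P<prefix>[A-Z]{2})(?P<digits>\d{13})(?P<suffix>_[a-z])?"
    r"\.(?P<ext>tif|jp2|jpg)$"
)


def check_file_name(file_name, barcode_digits, expected_ext, prefix="BR"):
    """Return 'names_ok' if the name matches the barcode and the set."""
    match = NAME_RE.match(file_name)
    if match is None:
        return "names_err"
    ok = (
        match["prefix"] == prefix
        and match["digits"] == barcode_digits.zfill(13)  # pad left with zeros
        and match["ext"] == expected_ext
    )
    return "names_ok" if ok else "names_err"


print(check_file_name("BR0000008378064.tif", "8378064", "tif"))
print(check_file_name("BR0000008378064_a.jpg", "8378064", "jpg"))
print(check_file_name("BR8378064.tif", "8378064", "tif"))
```

The other filename formats mentioned in the table footnote would need their own patterns.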
Table 6. Check file size and resolution sub-task.
| Agent | Check-tif-resol-and-size (script). |
| Function | Utilise image file size to detect resolution and cropping. |
| Dependencies | JHOVE: a file format identification, validation and characterisation tool ( |
| Target(s) | Master images of TIFF set (Image Processing sub-task 2). |
| Criteria | Each file size must be above 88 MB (average minimum file size, which is a consistent indicator of image dimensions). |
| Success | Correct file size indicates that cropping and resolution are within the acceptable range (fssr_ok). |
| Fail | Incorrect file size may indicate cropping or resolution issues (fssr_err). The images need to be flagged for manual verification. |
| Exceptions | Some specimens can be preserved in non-standard size sheets, like the one shown on Fig. |
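
The size criterion above amounts to a threshold test on the file size. The sketch below shows only that test; it does not call JHOVE, and reading "88 MB" as decimal megabytes is an assumption, as are the function and variable names.

```python
import os
import tempfile

# 88 MB threshold from the criteria; decimal megabytes are an assumption.
MIN_TIFF_BYTES = 88 * 1000 * 1000


def check_tiff_size(path, min_bytes=MIN_TIFF_BYTES):
    """Return 'fssr_ok' if the file is larger than the threshold."""
    return "fssr_ok" if os.path.getsize(path) > min_bytes else "fssr_err"


# Demo with a 1 KiB stand-in file: too small for the real threshold,
# large enough for an artificially low one.
with tempfile.NamedTemporaryFile(suffix=".tif", delete=False) as tmp:
    tmp.write(b"\0" * 1024)
result_default = check_tiff_size(tmp.name)
result_low = check_tiff_size(tmp.name, min_bytes=512)
os.remove(tmp.name)
print(result_default, result_low)
```

A file flagged `fssr_err` is only a candidate for manual verification, since non-standard sheet sizes can legitimately fall below the threshold.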
Table 7. Check TIFF metadata file structure sub-task.
| Agent | Check-md5-meta (script). |
| Function | Utilise md5 checksum to verify the integrity of images after transmission, storage and recovery operations. |
| Dependencies | md5deep and hashdeep software packages to verify the match between stored and computed md5 hash values (
| Target(s) | Master images of TIFF set, only a subset is verified (Image Processing sub-task 5). |
| Criteria | Calculated md5 hashset values must coincide with stored hashset values. |
| Success | The image file has not changed, the copy is consistent with the original (md5_ok). |
| Fail | The image file has been corrupted since its creation, original archive file is required to restore it (md5_err). |
| Exceptions | If errors are detected in a sample, the process can be reverted to verify the full batch. |
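
The integrity criterion above compares a stored md5 value with one recomputed from the file. The workflow uses md5deep and hashdeep for this; the sketch below swaps in Python's standard `hashlib` module to show the same comparison. The function names and the demo payload are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5(path, stored_hash):
    """Return 'md5_ok' if the recomputed hash equals the stored one."""
    same = md5_of_file(path) == stored_hash.strip().lower()
    return "md5_ok" if same else "md5_err"


# Demo: a stand-in file checked against its true hash and a wrong one.
payload = b"stand-in for TIFF bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
stored = hashlib.md5(payload).hexdigest()
intact_result = check_md5(tmp.name, stored)
corrupt_result = check_md5(tmp.name, "0" * 32)
os.remove(tmp.name)
print(intact_result, corrupt_result)
```

Reading in chunks keeps memory use flat, which matters for master files of around 88 MB and above.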
Table 8. Check duplicates sub-task.
| Agent | check-dups (script). |
| Function | Verify barcodes in a new batch against the ones already in the archive database. |
| Dependencies | None. |
| Target(s) | Master images of TIFF set (Image Processing sub-task 6). |
| Criteria | A script checks for possible duplicates by looking up each barcode in the batch in the archive database and verifying that it has not already been used. |
| Success | The set does not contain duplicate images (unique). |
| Fail | The set contains duplicate images, which need to be further analysed to determine if they are valid duplicates or need to be flagged for removal from the set (duplicate). |
| Exceptions | Some types of duplicates are allowed, but require the intervention of a human operator. |
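
The duplicate check above is a membership lookup of each batch barcode against the barcodes already archived. In the sketch below the archive database is replaced by an in-memory set, and the function name and the second barcode are hypothetical.

```python
def check_duplicates(batch_barcodes, archived_barcodes):
    """Return the batch state and the barcodes already present in the archive."""
    archived = set(archived_barcodes)
    duplicates = [code for code in batch_barcodes if code in archived]
    state = "duplicate" if duplicates else "unique"
    return state, duplicates


# Stand-in for the archive database lookup.
archive = {"BR0000008378064"}

print(check_duplicates(["BR0000008378065"], archive))
print(check_duplicates(["BR0000008378064", "BR0000008378065"], archive))
```

Returning the list of matches, rather than only the state, gives the human operator the items to review when deciding whether a duplicate is valid.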
Table 9. Check structure sub-task.
| Agent | check-jp2-and-size (script). |
| Function | Verify that the images conform to the standards selected by MBG for long-term storage (TIFF) and high-definition production images (JP2). |
| Dependencies | JHOVE for analysing and checking that the images are well-formed (consistent with the basic requirements of the format) and valid ( |
| Target(s) | Master images of TIFF set (Image Processing sub-task 7) |
| Criteria | TIFF images must conform to the TIFF 6.0 Specification. |
| Success | The image files conform to the corresponding standard (fss_ok). |
| Fail | The image files do not conform to the corresponding standard (fss_err). |
| Exceptions | Legacy scans prior to the implementation of the audit trail procedures may not conform to the current standards selected. |
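
Full conformance checking is delegated to JHOVE in the workflow. As a much narrower illustration of what "well-formed" means at the lowest level, the sketch below tests only the 8-byte image file header defined by the TIFF 6.0 specification: a byte-order mark (`II` or `MM`), the number 42, and the offset of the first image file directory. It is not a substitute for JHOVE, and the function name is hypothetical.

```python
import struct


def has_tiff_header(first_bytes):
    """Check the 8-byte TIFF header: byte order, magic number 42, IFD offset."""
    if len(first_bytes) < 8:
        return False
    order = first_bytes[:2]
    if order == b"II":      # little-endian
        fmt = "<HI"
    elif order == b"MM":    # big-endian
        fmt = ">HI"
    else:
        return False
    magic, ifd_offset = struct.unpack(fmt, first_bytes[2:8])
    # The first IFD cannot start inside the 8-byte header.
    return magic == 42 and ifd_offset >= 8


print(has_tiff_header(b"II*\x00\x08\x00\x00\x00"))
print(has_tiff_header(b"MM\x00*\x00\x00\x00\x08"))
print(has_tiff_header(b"\x89PNG\r\n\x1a\n"))
```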
Table 10. Visual inspection sub-task.
| Agent | Quality Manager (person). | |
| Function | Verify image quality by visually inspecting a sample of the images in the batch. | |
| Dependencies | Calibrated high pixel density display (e.g. Retina 5K Apple) Image editing programme (e.g. GIMP ( | |
| Target(s) | Master images of TIFF set, only a subset is verified (Image Processing sub-task 8). | |
| Criteria | focus | Edges of the elements (specimen, labels, charts) are well defined, the text is readable. |
| | cropping | All elements of the specimen are visible in the image frame, i.e. no parts seem to extend beyond the edge of the image. |
| | exposure | Verify white balance using the white box of the colour chart and verify its average value. Above 250: image overexposed, reject. Below 225: image underexposed, reject. Above 18: image overexposed, reject. Below 12: image underexposed, reject. |
| | barcode | Verify that the name of the file is the same as the barcode on the sheet. |
| Success | Images meet visual quality criteria (vqc_ok). | |
| Fail | Images do not meet visual quality criteria (vqc_err). In this case, the operator needs to verify another sample to determine if the whole batch should be rejected. | |
| Exceptions | Reference values need to be verified depending on the colour chart used. Specimens have been photographed with two types of colour chart: Standard CIE D50 Illuminant D50, Macbeth ColorChecker and ISA Golden Thread target. | |
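
The exposure criterion above is a range test on the average value of a colour-chart patch. In the workflow this inspection is manual; the sketch below only illustrates the arithmetic. The default limits are the 225 to 250 range given for the white box. The table lists a second range (12 to 18) without naming the patch it applies to, so the limits are parameters here; the function name and the sample values are hypothetical.

```python
from statistics import mean


def check_exposure(patch_values, low=225, high=250):
    """Classify a patch by its average value against the accepted range."""
    average = mean(patch_values)
    if average > high:
        return "overexposed"
    if average < low:
        return "underexposed"
    return "ok"


# Hypothetical pixel values sampled from the white box of a colour chart.
print(check_exposure([240, 242, 238]))
print(check_exposure([252, 253, 254]))
print(check_exposure([200, 210, 205]))
```

Since the reference values depend on the colour chart used, the limits should be set per chart rather than hard-coded.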
Table 11. Remove duplicates sub-task.
| Agent | Quality Manager (person). |
| Function | Remove images flagged as bad crops or duplicates. |
| Dependencies | Error log report with list of non-compliant images. |
| Target(s) | Master images of TIFF set (Store Image sub-task 1). |
| Criteria | If an image in one of the sets is flagged (TIFF, JP2 or JPG), that image is removed from the set and all corresponding images in the other sets are also removed. |
| Success | Flagged images have been removed (dup_rmv). |
| Fail | Flagged images have not been removed (dup_err). |
| Exceptions | If flagged images are part of the production set, the corresponding image from the master set needs to be validated to determine if the error was generated when the derivatives were produced or it is an imaging error. |
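
The removal rule above keeps the three sets in step: flagging an image in any one set removes its counterparts from the others. In the workflow this is done by the Quality Manager from the error log; the sketch below only illustrates the rule on lists of file names. The function name and the second barcode are hypothetical.

```python
def remove_flagged(image_sets, flagged_stems):
    """Drop every file whose name (without extension) is flagged, in all sets."""
    flagged = set(flagged_stems)
    return {
        set_name: [
            name for name in names if name.rsplit(".", 1)[0] not in flagged
        ]
        for set_name, names in image_sets.items()
    }


sets = {
    "TIFF": ["BR0000008378064.tif", "BR0000008378065.tif"],
    "JP2": ["BR0000008378064.jp2", "BR0000008378065.jp2"],
    "JPG": ["BR0000008378064.jpg", "BR0000008378065.jpg"],
}
cleaned = remove_flagged(sets, ["BR0000008378065"])
print(cleaned)
```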
Table 12. Check if production set stored sub-task.
| Agent | check-if-archived (script). |
| Function | Verify that the production set images have been copied to the image repository and the back-up server. |
| Dependencies | Logs containing the paths to the servers where the image sets are stored. Read access to server for verification of file paths. |
| Target(s) | Production images on JPG and JP2 sets (Store Image sub-task 6). |
| Criteria | File paths for the images in the production sets need to be valid and non-empty. |
| Success | Image files are stored and backed up (stgv_ok). |
| Fail | Error in the image files store/back up process (stgv_err). |
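
The storage criterion above requires every logged path to be valid and non-empty. The sketch below reads that as "the path string is not empty and points to an existing file", which is an assumption; it checks local paths only and does not reproduce the check-if-archived script or its server access. All names are hypothetical.

```python
import os
import tempfile


def check_if_archived(paths):
    """Return the state and the list of paths that are empty or missing."""
    missing = [p for p in paths if not p or not os.path.isfile(p)]
    return ("stgv_ok" if not missing else "stgv_err"), missing


# Demo with a temporary directory standing in for the image repository.
with tempfile.TemporaryDirectory() as archive_dir:
    stored = os.path.join(archive_dir, "BR0000008378064.jp2")
    with open(stored, "wb") as handle:
        handle.write(b"\0")
    absent = os.path.join(archive_dir, "BR0000008378064.jpg")
    all_present = check_if_archived([stored])
    one_missing = check_if_archived([stored, absent, ""])
print(all_present[0], one_missing[0])
```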
Table 13. Clear buffer server sub-task.
| Agent | del-dir-viaa (script). |
| Function | Delete master set copy from the buffer server, once reception and archiving is confirmed. |
| Dependencies | Confirmation from contractor of archiving of TIFF set. |
| Target(s) | Master images of TIFF set (Store Image subtask 7). |
| Criteria | The acknowledge code from contractor indicates that the master set has been received and archived. |
| Success | The image files are archived and the buffer has been cleared (bufc_ok). |
| Fail | Archiving of image files is not confirmed (bufc_err). |
| Exceptions | Retry copy files to archive sub-task. |
Figure 5. Example of results from in-house and outsourced digitisation. Image "a" corresponds to a specimen digitised in-house and image "b" corresponds to a specimen digitised by the contractor. Images "c" and "d" correspond to close-ups of the sections highlighted in "a" and "b", respectively, presented at 100% size (5x5 cm square).
Figure 6. Image Resolution and Scaling Comparison. HS A fragments correspond to close-ups of the images shown in Fig. 5a-c. HS B fragments correspond to close-ups of the images shown in Fig. 5b-d.
Figure 7. Diagram of the digitisation workflows at the Royal Botanic Garden Edinburgh (from Haston 2012).
Figure 8. Picturae digitisation workflow for herbarium collections. The digitisation task (3) includes image quality control checks (courtesy of Picturae).