Robyn E Drinkwater, Robert W N Cubey, Elspeth M Haston.
Abstract
At the Royal Botanic Garden Edinburgh (RBGE) the use of Optical Character Recognition (OCR) to aid the digitisation process has been investigated. This was tested using a herbarium specimen digitisation process with two stages of data entry. Records were initially batch-processed to add data extracted from the OCR text, and were then sorted by Collector and/or Country. Using images of the specimens, a team of six digitisers then added data to the specimen records. To investigate whether the data from OCR aid the digitisation process, the digitisers completed a series of trials which compared the efficiency of data entry between sorted and unsorted batches of specimens. A survey was carried out to explore the opinions of the digitisation staff on the different sorting options. In total 7,200 specimens were processed. When compared to an unsorted, random set of specimens, those which were sorted based on data added from the OCR were quicker to digitise. Of the methods tested here, the most successful in terms of efficiency used a protocol which required entering data into a limited set of fields and where the records were filtered by Collector and Country. The survey and subsequent discussions with the digitisation staff highlighted their preference for working with sorted specimens, in which label layout, locations and handwriting are likely to be similar, so that familiarity with the Collector or Country is rapidly established.
Keywords: Data entry; Digitisation; Herbarium; Label; OCR; Specimen
Year: 2014 PMID: 25009435 PMCID: PMC4086207 DOI: 10.3897/phytokeys.38.7168
Source DB: PubMed Journal: PhytoKeys ISSN: 1314-2003 Impact factor: 1.635
Figure 1. Example labels: (a) pre-printed label with handwritten details; (b, c) mixed labels with pre-printed and typed information; (d) mainly handwritten label, with printer's mark; (e) mainly handwritten label with unusual phrasing.
Average time taken to complete trials.
| Trial | ‘Filter’ | Number of completed batches per Protocol | Average time, Complete Protocol (minutes) | % time saved, Complete Protocol (compared with Random 1) | Average time, Partial Protocol (minutes) | % time saved, Partial Protocol (compared with Random 1) |
|---|---|---|---|---|---|---|
| 1. | Random 1 | 10 | 313 | 0% | 226.9 | 0% |
| 2. | Collector | 10 | 259.5 | 17.1% | 220.2 | 2.7% |
| 3. | Country | 10 | 345.7 | 10.5% increase | 192.6 | 15.2% |
| 4. | Collector & Country | 10 | 262.8 | 16.1% | 105.3 | 53.6% |
| 5. | Collector & Country (OCR) | 10 | 252.6 | 19.3% | 125.7 | 44.7% |
| 6. | Random 2 | 10 | 283.9 | 9.3% | 219.9 | 3.1% |
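The "% time saved" columns can be recomputed from the average times with a simple relative-difference formula. The following is a minimal sketch, assuming the percentage is defined as (Random 1 time − trial time) / Random 1 time; the names `AVERAGE_MINUTES` and `percent_time_saved` are illustrative and not from the source. Values recomputed from the rounded averages can differ from the printed percentages by a few tenths of a percentage point, possibly because the published figures were derived from unrounded averages (an assumption).

```python
# Average times in minutes, copied from the table above: (Complete, Partial).
AVERAGE_MINUTES = {
    "Random 1": (313.0, 226.9),
    "Collector": (259.5, 220.2),
    "Country": (345.7, 192.6),
    "Collector & Country": (262.8, 105.3),
    "Collector & Country (OCR)": (252.6, 125.7),
    "Random 2": (283.9, 219.9),
}


def percent_time_saved(baseline: float, trial: float) -> float:
    """Percentage of the baseline time saved; negative means the trial was slower."""
    return (baseline - trial) / baseline * 100.0


# Random 1 is the baseline for both protocols (assumed from the column headers).
base_complete, base_partial = AVERAGE_MINUTES["Random 1"]

savings = {
    name: (
        percent_time_saved(base_complete, complete),
        percent_time_saved(base_partial, partial),
    )
    for name, (complete, partial) in AVERAGE_MINUTES.items()
}

for name, (saved_complete, saved_partial) in savings.items():
    print(f"{name:26s} complete {saved_complete:6.1f}%  partial {saved_partial:6.1f}%")
```

A negative value corresponds to an entry such as "10.5% increase" in the table: the Country filter under the Complete Protocol took longer than the Random 1 baseline.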
Figure 2. Box plot of Complete and Partial Protocol results. R1C – Random 1 Complete; R1P – Random 1 Partial; CollC – Collector only Complete; CollP – Collector only Partial; CouC – Country only Complete; CouP – Country only Partial; CCC – Collector & Country Complete; CCP – Collector & Country Partial; OCRC – Collector & Country OCR Complete; OCRP – Collector & Country OCR Partial; R2C – Random 2 Complete; R2P – Random 2 Partial.
Figure 3. Box plot of Partial Protocol results.
Result of ANOVA using Protocol ‘pairs’ (Complete and Partial).
| Protocol | | Df | F Value | Pr (>F) | Significance |
|---|---|---|---|---|---|
| Partial | Trial | 5 | 6.487 | 0.0013 | ** (0.001) |
| | Residuals | 18 | | | |
Format of the trials.
| Trial | ‘Filter’ | Protocol | Number of repeats/person | Total specimens/person |
|---|---|---|---|---|
| 1. | Random | Complete | 2 | 100 |
| 1. | Random | Partial | 2 | 100 |
| 2. | Collector | Complete | 2 | 100 |
| 2. | Collector | Partial | 2 | 100 |
| 3. | Country | Complete | 2 | 100 |
| 3. | Country | Partial | 2 | 100 |
| 4. | Collector & Country | Complete | 2 | 100 |
| 4. | Collector & Country | Partial | 2 | 100 |
| 5. | Collector & Country (OCR) | Complete | 2 | 100 |
| 5. | Collector & Country (OCR) | Partial | 2 | 100 |
| 6. | Random | Complete | 2 | 100 |
| 6. | Random | Partial | 2 | 100 |
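As a consistency check, the trial format can be tied back to the total of 7,200 specimens reported in the abstract. This is a minimal sketch, assuming that the two repeats apply to each protocol and that both repeats contain the same number of specimens; the constant names are illustrative and not from the source.

```python
DIGITISERS = 6                       # "a team of six digitisers" (abstract)
TRIALS = 6                           # trials 1 to 6 in the table above
PROTOCOLS = ("Complete", "Partial")  # each trial is run under both protocols
REPEATS_PER_PERSON = 2               # per trial and protocol (assumed)
SPECIMENS_PER_PERSON = 100           # per trial and protocol

# Specimens in one repeat, assuming the two repeats are the same size.
specimens_per_batch = SPECIMENS_PER_PERSON // REPEATS_PER_PERSON

# Total across all digitisers, trials and protocols.
total_specimens = DIGITISERS * TRIALS * len(PROTOCOLS) * SPECIMENS_PER_PERSON

print(f"specimens per batch: {specimens_per_batch}")
print(f"total specimens:     {total_specimens}")
```

Under these assumptions the total matches the 7,200 specimens stated in the abstract.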
Result of ANOVA for the 12 trials.
| | Df | F Value | Pr (>F) | Significance |
|---|---|---|---|---|
| Trial | 11 | 13.03 | 4.11e-14 | *** (0) |
| Residuals | 85 | | | |
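The Significance column in the ANOVA tables appears to follow the star notation printed by R (`0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`), where the number in parentheses is the lower bound of the interval containing the p-value. That the analysis used R is an assumption based only on this notation. A minimal sketch of the mapping, with the hypothetical helper name `significance_code`:

```python
def significance_code(p_value: float) -> str:
    """Map a p-value to a significance code, following R's star convention.

    Intervals are treated as right-closed, e.g. (0.001, 0.01] maps to '**'.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    # (upper bound of interval, code), checked from most to least significant.
    thresholds = [(0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, ".")]
    for upper_bound, code in thresholds:
        if p_value <= upper_bound:
            return code
    return " "


# p-values copied from the two ANOVA tables.
print(significance_code(0.0013))    # Partial Protocol, 6 trials
print(significance_code(4.11e-14))  # all 12 trials
```

With this mapping, 0.0013 falls between 0.001 and 0.01 and 4.11e-14 falls below 0.001, which agrees with the `** (0.001)` and `*** (0)` entries in the tables.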