Andrea Thomer, Gaurav Vaidya, Robert Guralnick, David Bloom, Laura Russell.
Abstract
Part diary, part scientific record, biological field notebooks often contain details necessary to understanding the location and environmental conditions existent during collecting events. Despite their clear value for (and recent use in) global change studies, the text-mining outputs from field notebooks have been idiosyncratic to specific research projects, and impossible to discover or re-use. Best practices and workflows for digitization, transcription, extraction, and integration with other sources are nascent or non-existent. In this paper, we demonstrate a workflow to generate structured outputs while also maintaining links to the original texts. The first step in this workflow was to place already digitized and transcribed field notebooks from the University of Colorado Museum of Natural History founder, Junius Henderson, on Wikisource, an open text transcription platform. Next, we created Wikisource templates to document places, dates, and taxa to facilitate annotation and wiki-linking. We then requested help from the public, through social media tools, to take advantage of volunteer efforts and energy. After three notebooks were fully annotated, content was converted into XML and annotations were extracted and cross-walked into Darwin Core compliant record sets. Finally, these recordsets were vetted, to provide valid taxon names, via a process we call "taxonomic referencing." The result is identification and mobilization of 1,068 observations from three of Henderson's thirteen notebooks and a publishable Darwin Core record set for use in other analyses. Although challenges remain, this work demonstrates a feasible approach to unlock observations from field notebooks that enhances their discovery and interoperability without losing the narrative context from which those observations are drawn.
"Compose your notes as if you were writing a letter to someone a century in the future." Perrine and Patton (2011).
Keywords: Field notes; Colorado; Darwin Core; Junius Henderson; Wikisource; annotation; biodiversity; crowd sourcing; digitization; natural history; notebooks; species occurrence records; taxonomic referencing; text-mining; transcription
Year: 2012 PMID: 22859891 PMCID: PMC3406479 DOI: 10.3897/zookeys.209.3247
Source DB: PubMed Journal: Zookeys ISSN: 1313-2970 Impact factor: 1.546
Figure 1. Web browser view of a scanned page of Henderson’s journal displayed side-by-side with transcriptions and annotations using the MediaWiki Proofread Page extension.
Figure 2. Index page for Notebook #1. Each Index page corresponds to a multipage file. The Index page displays volume metadata and links to sections of the notebook, while also providing links out to each notebook page and color-coding to determine which pages have been already transcribed and proofed.
Figure 3. Henderson’s first sentence. “Boulder, Colo. July 28, 1905. Saw Say [sic] Phoebe and siskins, [American] Robins, [Northern] Flicker.”
Figure 4. Editing a notebook page on Wikisource. This screenshot shows side-by-side transcription and wiki markup syntax.
Figure 5. An example of how a location (Big Thompson Creek near Loveland), a date (Sunday, June 10, 1906), and a taxon (Cottonwood, genus Populus) are grouped from across multiple pages.
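The grouping described in the Figure 5 caption can be illustrated with a minimal Python sketch. It is not the authors' extraction code: the template names (`place`, `dated`, `taxon`), the page numbers, and the sample wikitext are all hypothetical, and the rule of carrying the most recent place and date forward to each taxon is an assumption made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical template names; the real Wikisource templates may differ.
ANNOTATION = re.compile(r"\{\{(place|dated|taxon)\|([^{}|]+)(?:\|[^{}]*)?\}\}")


@dataclass
class Observation:
    verbatim_locality: Optional[str]
    verbatim_date: Optional[str]
    scraped_name: str
    page: int


def group_annotations(pages):
    """Walk pages in order, remembering the latest place and date,
    and emit one observation for every taxon annotation."""
    place = date = None
    observations = []
    for page_number, wikitext in pages:
        for match in ANNOTATION.finditer(wikitext):
            kind, value = match.group(1), match.group(2).strip()
            if kind == "place":
                place = value
            elif kind == "dated":
                date = value
            else:  # taxon: attach the current place/date context
                observations.append(Observation(place, date, value, page_number))
    return observations


# Invented sample text; only the place, date, and taxon come from Figure 5.
pages = [
    (12, "{{dated|Sunday, June 10, 1906}} At {{place|Big Thompson Creek near Loveland}}."),
    (13, "Large {{taxon|Cottonwood}} trees along the bank."),
]
for obs in group_annotations(pages):
    print(obs)
```

Because the place and date persist across page boundaries, a taxon mentioned on a later page still inherits the context set on an earlier one, which is the cross-page behavior the caption describes.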
Summary information on each notebook.
| | Notebook 1 | Notebook 2 | Notebook 3 |
|---|---|---|---|
| URL | | | |
| Number of annotations | 632 | 703 | 1007 |
| Taxon annotations | 349 (201 unique) | 224 (125 unique) | 514 (248 unique) |
| Place annotations | 219 (115 unique) | 419 (154 unique) | 401 (139 unique) |
| Date annotations | 64 (63 unique) | 60 (59 unique) | 92 (90 unique) |
| Dates in range | July 1905 to April 1907 | May 1907 to October 1908 | January 1909 to September 1909 |
| Time spent annotating | 6 weeks | 4 weeks | 6 weeks |
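As a quick consistency check on the summary table, the taxon, place, and date counts for each notebook should add up to the reported number of annotations. The short Python sketch below copies the counts from the table; the variable names are illustrative only.

```python
# Counts copied from the summary table: (taxon, place, date, reported total).
notebooks = {
    "Notebook 1": (349, 219, 64, 632),
    "Notebook 2": (224, 419, 60, 703),
    "Notebook 3": (514, 401, 92, 1007),
}

for name, (taxon, place, date, total) in notebooks.items():
    # Each reported total should equal the sum of the three annotation types.
    assert taxon + place + date == total, name

all_taxon = sum(counts[0] for counts in notebooks.values())
all_total = sum(counts[3] for counts in notebooks.values())
print(all_taxon, all_total)  # prints: 1087 2342
```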

| Darwin Core Class | Terms included in Darwin Core file |
|---|---|
| Record-level Terms | dcterms:modified, basisOfRecord, institutionCode, collectionCode, source |
| Occurrence | catalogNumber, recordedBy |
| Event | eventDate, year, month, day, verbatimDate, fieldNotes |
| Location | country, countryCode, stateProvince, locality, verbatimLocality |
| Identification | identifiedBy, identificationRemarks |
| Taxon | taxonID, scientificName, kingdom, phylum, class, order, family, genus, species, vernacularName, taxonStatus, taxonRemarks |
| Non-Darwin Core Terms | ScrapedName records the scientificName for the organism observed, as entered by Henderson and transcribed by us. AnnotatorName records the corrected ScrapedName as recorded by the annotators; annotators had the option of leaving this field blank, in which case we use the ScrapedName as the AnnotatorName. Both ScrapedName and AnnotatorName were fed through a taxonomic resolution process (see Methods, section “Proofing the Darwin Core record set”). Three taxonomic resolvers were used for some of the records: the Global Names Index (GNI), the Encyclopedia of Life (EOL), and the Integrated Taxonomic Information System (ITIS). The resulting identifiers and best-matched scientificNames are provided for all three services; additionally, our ITIS service returned vernacular names, which are also recorded. The Source of correct name field indicates whether EOL, ITIS, or Both services returned the correct name. canonicalScientificName is the scientificName with the authorship information deleted. AnnotatorLocality records the corrected, modern place name that annotators were asked to provide for the verbatimName. Higher taxonomy (kingdom, phylum/division, etc.) was only extracted from ITIS for records where the ITIS name was correct. The taxonID field contains the ITIS Taxonomic Serial Number (TSN) used to look up the higher taxonomy; the scientificName from TSN field contains the scientific name that ITIS associates with that TSN. |
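Two of the rules stated for the non-Darwin Core terms are simple enough to sketch in Python: the AnnotatorName fallback to ScrapedName, and the derivation of canonicalScientificName by deleting authorship. This is a minimal sketch, not the authors' implementation. The function names are hypothetical, the authorship stripping is a rough heuristic (it keeps the genus plus any following all-lowercase epithets, so it would mishandle lowercase author particles such as "de" and rank markers such as "var."), and the example name is illustrative input rather than a record from the data set.

```python
import re

# An epithet is assumed to be all lowercase letters, optionally hyphenated.
_EPITHET = re.compile(r"^[a-z][a-z-]*$")


def annotator_name(scraped_name, annotator_entry):
    """Use the annotator's correction, or fall back to ScrapedName if blank."""
    entry = (annotator_entry or "").strip()
    return entry if entry else scraped_name


def canonical_name(scientific_name):
    """Heuristically drop authorship: keep the first token (genus) and any
    lowercase epithets that follow, stopping at the first other token."""
    tokens = scientific_name.split()
    if not tokens:
        return ""
    kept = [tokens[0]]
    for token in tokens[1:]:
        if not _EPITHET.match(token):
            break
        kept.append(token)
    return " ".join(kept)


def build_record(scraped_name, annotator_entry, scientific_name, locality):
    """Assemble a partial record using term names from the table above."""
    return {
        "recordedBy": "Junius Henderson",
        "verbatimLocality": locality,
        "scientificName": scientific_name,
        "ScrapedName": scraped_name,
        "AnnotatorName": annotator_name(scraped_name, annotator_entry),
        "canonicalScientificName": canonical_name(scientific_name),
    }


# Illustrative input only; the resolved name here is an assumed example.
record = build_record(
    "Cottonwood", "", "Populus angustifolia James",
    "Big Thompson Creek near Loveland",
)
print(record)
```

A production pipeline would more likely rely on a dedicated scientific-name parser or on the canonical form returned by a resolver, since authorship strings vary too widely for a token-based rule to be reliable.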