
Meteorological data rescue: Citizen science lessons learned from Southern Weather Discovery.

Andrew M Lorrey1, Petra R Pearce1, Rob Allan2, Clive Wilkinson3, John-Mark Woolley1, Emily Judd1,4, Stuart Mackay1, Sudhir Rawhat5, Laura Slivinski6, Sally Wilkinson3, Ed Hawkins7, Patrick Quesnel5, Gilbert P Compo6.   

Abstract

Daily weather reconstructions (called "reanalyses") can help improve our understanding of meteorology and long-term climate changes. Adding undigitized historical weather observations to the datasets that underpin reanalyses is desirable; however, the time available to capture those data from a range of archives is usually limited. Southern Weather Discovery is a citizen science data rescue project that recovered tabulated handwritten meteorological observations from ship log books and land-based stations spanning New Zealand, the Southern Ocean, and Antarctica. We describe the Zooniverse-hosted Southern Weather Discovery campaign, highlight promotion tactics, and report the replicate keying levels needed to obtain 100% complete transcribed datasets with minimal type 1 and type 2 transcription errors. Rescued weather observations can augment optical character recognition (OCR) text recognition libraries. Closer links between citizen science data rescue and OCR-based scientific data capture will accelerate weather reconstruction improvements, which can be harnessed to mitigate impacts on communities and infrastructure from weather extremes.
© 2022 The Authors.


Keywords:  Zooniverse; citizen science; climate; data rescue; meteorology; optical character recognition; reanalysis

Year:  2022        PMID: 35755873      PMCID: PMC9214331          DOI: 10.1016/j.patter.2022.100495

Source DB:  PubMed          Journal:  Patterns (N Y)        ISSN: 2666-3899


Introduction

The importance of meteorological data rescue

Historical climate research has significantly bolstered global reconstructions of daily weather, also known as reanalyses.1, 2, 3, 4 Reanalyses are valuable tools for visualizing and contextualizing local weather patterns and extreme weather events, as well as investigating climate variability and teleconnections.7, 8, 9 These long weather reconstructions also improve understanding of broad climate change trends in regions where observation depths are robust and can be cautiously interpreted. Broadly, reanalyses rely on incorporating historical observations into estimates of past conditions using modern weather models. The performance of centennial-length reanalyses, like the 20th Century Reanalysis (20CR), can be highly dependent on the density and accuracy of such observations throughout time. Uncertainties in historical daily weather patterns from these reanalyses can arise from a diminished spatiotemporal coverage of near-surface terrestrial and marine observations that were assimilated into the reconstruction. However, international surface observation databases (e.g., International Comprehensive Ocean-Atmosphere Data Set [ICOADS]; International Surface Pressure Databank [ISPD])11, 12, 13, 14, 15, 16 that underpin reanalyses are continually expanding as new data are recovered and digitized by ongoing meteorological data rescue efforts. Data rescue therefore creates new pathways to improve reanalyses, like 20CR, but those opportunities are heavily dependent on (1) the ability to locate missing meteorological records for areas where spatial coverage is weak, and (2) the capability and capacity to rapidly capture, transcribe, and efficiently quality control numerical observations contained in archives. The first data rescue dependency is being addressed in parallel by individual research efforts and coordinated international initiatives.
Both approaches have uncovered new historical meteorological observation sources that have led to improved visibility, management, and curation of those data. Examples of international coordination efforts include the international Atmospheric Circulation Reconstructions over Earth (ACRE) initiative, the International Data Rescue (I-DARE) portal hosted by the World Meteorological Organization (WMO; https://www.idare-portal.org/), and the Copernicus C3S Data Rescue Service (https://data-rescue.copernicus-climate.eu/). The WMO I-DARE and Copernicus portals (https://datarescue.climate.copernicus.eu/) are currently being integrated into the same framework. Collectively, these efforts have improved the quality of recovered data, helped with resource sharing when capturing digital surrogates of original data sources, and reduced replication when obtaining archived meteorological data resources. The second data rescue dependency has been addressed either by individual researchers or research groups who manually transcribe historical data into digital format, or by using computer-aided recovery of text and numeric data (e.g., optical character recognition [OCR]). The efficacy of the latter approach to date has, in a handful of trials, shown some promise but with significant limitations. However, progress to speed up transcription has been made using citizen science, which relies on individuals who are willing to voluntarily transcribe historical analogue meteorological data. There are many projects that have used this approach in recent years across a range of document types (see Ashcroft et al., 2016 for some Australasian examples). Pioneering efforts for handwritten tabulated observations are exemplified by OldWeather (www.oldweather.org), Meteororum ad Extremum Terrae (http://Zooniverse.org/projects/acre-ar/meteororum-ad-extremum-terrae), and Weather Rescue (www.weatherrescue.org).
In this descriptor article, we summarize our experiences from Southern Weather Discovery (SWD) (www.southernweatherdiscovery.org), a citizen science initiative hosted on the Zooniverse web platform, to show how southern hemisphere meteorological time series have been generated and quality controlled from historical ship logbooks. This case study builds on prior work that has documented preparation of historical documents and transcription tactics,23, 24, 25, 26, 27 but also adds detail by explaining key elements of publicity and a media strategy plan that engendered public support for meteorological data rescue. We provide a stepwise account of our methods, highlighting some successes and pitfalls, that other researchers may benefit from to improve citizen science data rescue efforts for the geosciences. We provide details on the use of a data transcription interface on Zooniverse, preparation of historical documents for transcription, requirements for retrieving data from Zooniverse, and tactics to form a comprehensive observation dataset with minimal transcription errors. We also discuss serendipitous outcomes from SWD citizen science, where replicate keying of meteorological observations can be harnessed to improve artificial intelligence (AI) transcription of tabulated scientific data.

Launching a data rescue mission from the antipodes

A citizen science data rescue effort was launched as a component of the New Zealand Deep South National Science Challenge (DeepSouthChallenge.co.nz [DSC]) in 2015. The DSC’s main aim is to understand the role of the Antarctic and Southern Ocean in determining New Zealand’s future climate conditions and environmental outcomes from climate changes. The focus of DSC data rescue work was to recover undigitized weather observations and use them to help assess and evaluate the New Zealand Earth System Model (NZESM). Many late 19th and early 20th century historical weather and climate events caused damage and disruption to New Zealand’s civil infrastructure and economy (e.g., significant snowfall, floods, droughts). Ensemble uncertainty in the 20CR analysis during the late 19th/early 20th century is large around New Zealand and the South Pacific (Figure 1), providing little insight into the atmospheric conditions leading up to these important past episodes. Thus, evaluating the quality of the NZESM using a reanalysis during those times is hampered. However, a few examples of full transcriptions and analyses of early handwritten scientific observations show potential to address this shortcoming. Improving the efficacy of long-range reanalyses like 20CR with newly rescued data could enable further testing and more detailed validation of the NZESM, with direct applications toward improving our understanding of weather events, climate variability, and long-term changes.
Figure 1

Twentieth Century Reanalysis ensemble spread in late 1800s and early 1900s

The Twentieth Century Reanalysis version 2c (20CRv2c) mean ensemble spread for the 1,000 hPa geopotential height (top) from 1891 to 1910 is contrasted with a spread anomaly plot (bottom) where the zonal (latitude average) mean for the same interval has been subtracted. This was the most recent version of 20CR that existed when Southern Weather Discovery (SWD) began. These plots show the effects of areas where there are relatively concentrated (e.g., New Zealand, Australia) and diminished (e.g., Amundsen Sea) observations. Outside of the southwest Pacific tropics, and especially around Antarctica, there is higher uncertainty in the 20CR daily weather reconstruction. Centers of action for modes of variability that affect New Zealand (including PSA, SAM, ZW3) are indicated. Plots are courtesy of National Oceanic and Atmospheric Administration Physical Science Laboratory (NOAA PSL). The spatial extent of the ACRE Antarctica regional chapter domain in the bottom panel was the focus of SWD data rescue.

To achieve data rescue aims that could contribute to the DSC, ACRE Antarctica (a chapter of ACRE) was created to focus on data rescue of historical meteorological observations within the high-latitude region bounded by Australia, South America, and Antarctica. Within that geographic domain, there are key atmospheric and oceanic centers of action that are linked to modes of climate variability including the Pacific South American Mode (PSA), Zonal Wave 3 (ZW3), and the Southern Annular Mode (SAM), that directly and indirectly (via teleconnections) impinge on New Zealand’s weather and regional climate conditions (Figure 1). In contrast to the broad continental expanse of the northern hemisphere and the tropics with numerous land-based weather observation stations, the ACRE Antarctica region is dominated by the South Pacific Ocean and Southern Ocean.
Thus, historical observations are sparse and significant gaps in observation coverage over the southern hemisphere oceans produce large weather reconstruction uncertainties (Figures 1 and 2). This geographic predicament means any data rescue efforts need to consider limitations of coastal and maritime scientific data resources from lighthouses and seasonal coastal stations, and should harness the benefits of observations from harbor-based and ocean-going vessels. For the latter type of resource, ship log books have previously shown great potential to bolster maritime instrumental observations for the 19th and 20th centuries. Essential climate variables (ECVs), including atmospheric pressure, air temperature, sea surface temperature, and sea ice extent, were targeted for recovery in ACRE Antarctica’s initial funding support from the DSC. Additional financial support from the Copernicus Climate Change Service (C3S) further allowed us to refine citizen science data rescue methods, which has helped to streamline the capture of historical maritime weather observations.
Figure 2

Change in selected 20CR uncertainty metrics through time

(Top) The inter-region mean ensemble spread (uncertainty in daily reconstruction of weather) for the tropics and the ACRE Antarctica domain in the 20CR version 3 (20CRv3) January to March. It shows progressive improvement for both regions through time from the mid-19th to mid-20th century. (Bottom) The mean ensemble spread ratio (ACRE Antarctica mean ensemble spread divided by the tropics mean ensemble spread) is a dimensionless index indicating that, despite overall 20CR improvement, there is still lower uncertainty for past daily weather in the tropics (possibly a result of greater density or greater consistency of observations in that region) relative to the high southern latitudes. It also shows distinct seasonal differences: for the southern high latitudes, the mean ensemble spread uncertainty is lowest in summer and worst in autumn and winter.
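
The ratio described in the bottom panel can be illustrated with a minimal sketch. It assumes that ensemble spread at a grid point is the standard deviation across ensemble members and that a regional mean is a simple unweighted average; the function names and the numbers are hypothetical and are not taken from 20CR, and a real calculation would normally weight grid points by area.

```python
from statistics import mean, pstdev


def ensemble_spread(members):
    """Spread at one grid point and time: standard deviation across members (assumed definition)."""
    return pstdev(members)


def regional_mean_spread(grid_points):
    """Unweighted mean of the spread over all grid points in a region (no area weighting)."""
    return mean(ensemble_spread(members) for members in grid_points)


def spread_ratio(acre_antarctica, tropics):
    """Dimensionless index: ACRE Antarctica mean spread divided by tropics mean spread."""
    return regional_mean_spread(acre_antarctica) / regional_mean_spread(tropics)


# Hypothetical geopotential heights (m): one inner list per grid point, one value per member.
tropics = [[100.0, 102.0, 98.0], [101.0, 99.0, 100.0]]
acre_antarctica = [[90.0, 110.0, 100.0], [80.0, 120.0, 100.0]]

ratio = spread_ratio(acre_antarctica, tropics)
print(f"spread ratio = {ratio:.2f}")
```

A ratio above 1 means the reconstruction is more uncertain over the ACRE Antarctica domain than over the tropics, which is the situation the caption describes.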


Data

Navigating happy hunting grounds for historical maritime weather data

Significant efforts have been made to photograph logbooks from ships that visited the southern hemisphere during the early to mid-20th century. Those resources are widely dispersed across a broad range of archives (see Teleti et al. and Chappell et al.). For example, in the archives of New Zealand’s National Institute of Water and Atmospheric Research (NIWA), there are copies of published historical scientific expeditions to the Antarctic region written in English, French, Spanish, Portuguese, Russian, Norwegian, Finnish, and Swedish. Many of the original historical ship logbooks supporting those publications are held in European archives and include British merchant and immigration ships that visited New Zealand, Australia, and the South Pacific. In addition, significant numbers of ship logbooks exist in Scandinavia, related to whale hunting in the southern hemisphere. An initial assessment spanning the period 1900–1960 indicates a minimum of seven million unique ship logbook weather observations for the high southern latitudes. In addition to regional data richness, data consistency is an important element to consider, given the sheer scale of processing and formatting citizen science data (see explanations below). Many logbooks have a standard printed table format that shipboard observers completed while at sea, and some shipboard expedition reports also include land-based observations from stationary and overland traverses. Our primary focus was then narrowed to ship logbook data rescue from 1900 to 1950, a time frame that encompasses several severe weather events that affected New Zealand. Some 150,000 logbook images from more than 300 individual voyages were carefully photographed and passed to NIWA in support of SWD to bolster mid-19th to mid-20th century sample depth and reduce ensemble uncertainty in future 20CR iterations (Figure 2). Table 1 shows details for the log books that were digitized.
The experimental procedures section outlines the process of data collection for this study. It describes how we established the SWD project identity and set up a data transcription platform on Zooniverse. This section also highlights progressive changes in data-rescue tactics and the approaches we deployed to recruit personnel who transcribed data from ship logbooks and land-based meteorological registers.
Table 1

Ship logbook meteorological observations recovered in SWD phase I hosted on Zooniverse

Unique ship | Unique logbook file name | Ship name | Log images | Images clipped | Number of clips | Clips not loaded | Year of data | Barometer uncorrected | Attached thermometer | Barometer corrected | Air temperature | Sea temperature | Total observations
1MF911_44460_Athel ChiefAthel Chief1051806194622620022900655
2MF911_19702_BullysesBullyses5340519307778775959350
2MF911_19703_BullysesBullyses42301219304747474747235
3MF911_32842_CambridgeCambridge6354019355353535352264
4MF911_27139_CanonosaCanonosa4236319333635313636174
4MF911_28085_CanonosaCanonosa1059091933172173157173173848
4MF911_29126_CanonosaCanonosa8472619348282828180407
4MF911_29949_CanonosaCanonosa8472919347271717272358
5MF911_24993_CopticCoptic6354319326665666565327
5MF911_27188_CopticCoptic6354919335960596060298
5MF911_32788_CopticCoptic6354019356163636061308
5MF911_34004_CopticCoptic631081519365858585858290
5MF911_37438_CopticCoptic631089193764616300188
6MF911_27819_CumberlandCumberland84721519336564646666325
6MF911_37610_CumberlandCumberland8414430193771727222219
6MF911_39392_CumberlandCumberland841442419386868686566335
7MF911_17513_DeucalionDeucalion111501929210373737132
7MF911_17626_DeucalionDeucalion111501929200272727101
8MF911_17770_DevonDevon22301219294242384241205
8MF911_22989_DevonDevon447201931173176169176175869
9ML911_2798_Discovery IIDiscovery II292934201950450451450001,351
10MF911_19398_Dorington CourierDorington Courier2230019307373737273364
11MF911_35639_Dunedin StarDunedin Star331082419365355535147259
12MF911_39566_DurhamDurham224801938107107107107107535
12MF911_41980_DurhamDurham335491939117117116116117583
12ML911_900_DurhamDurham111123127194825825825800774
13MF911_44536_Empire VictoryEmpire Victory111119801946351352352001,055
14MF911_40649_EssexEssex331081819385448545555266
15MF911_33390_FordsdaleFordsdale3354019358485857979412
15MF911_37896_FordsdaleFordsdale841053919376566666463324
16MF911_17310_GloxiniaGloxinia423001929630616464252
16MF911_17474_GloxiniaGloxinia423001929760787575304
16MF911_19208_GloxiniaGloxinia211521929380383833147
16MF911_19287_GloxiniaGloxinia8345121930850858578333
17MF911_33108_HertfordHertford8472919357575747575374
17MF911_37308_HertfordHertford427215193739373800114
18MF911_20476_HororataHororata42301219304848484848240
19MF911_25026_HuntingdonHuntingdon6354019326768646868335
19MF911_25770_HuntingdonHuntingdon84721519327270737373361
19MF911_26744_HuntingdonHuntingdon5236019324043434343212
19MF911_27605_HuntingdonHuntingdon6354619335655585960288
19MF911_30667_HuntingdonHuntingdon84721819346868676868339
19MF911_35480_HuntingdonHuntingdon841442419367272726969354
19MF911_37975_HuntingdonHuntingdon841442119377879807878393
20MF911_34739_HurunaiHurunai8414414419361480146147146587
21MF911_34145_IonicIonic841441441936115117115112112571
22ML_17955_JunieJunie262016016019294424424434444442,215
23MF911_27575_KarameaKaramea63545419335959595959295
23MF911_28249_KarameaKaramea84727219336463646464319
23MF911_37224_KarameaKaramea6354541937134133134135135671
23MF911_40779_KarameaKaramea7310810819386060596060299
23ML_18569_KarameaKaramea312016016019324414394414344352,190
23ML_18678_KarameaKaramea292116816819324434444434224262,178
24MF911_9415_KiaoraKia Ora634545192590009089269
25MF911_44413_LafoniaLafonia83108108194641414100123
26MF911_39550_LorigaLoriga6354541938121121120121118601
27MF911_39459_LosadaLosada21181819383030303030150
28MF911_32584_MahanaMahana6310810819356053535958283
29MF911_26079_MahiaMahia84727219327475747575373
29MF911_27728_MahiaMahia63545419335252525252260
29MF911_37618_MahiaMahia84144144193771717100213
29MF911_43608_MahiaMahia10310810819465555525049261
30ML_17885_MaimoaMaimoa29211681319294514504424514512,245
30ML_18480_MaimoaMaimoa3524192019315295235274073562,342
30ML_18660_MaimoaMaimoa32231841019324964964944784752,439
30MF911_27208_MaimosaMaimosa8472619337373727071359
30MF911_34685_MaimosaMaimosa841441619367878787672382
31MF911_12243_MamariMamari423001926/2777007877232
31MF911_13623_MamariMamari634501927920939393371
32MF911_25506_MatakanaMatakana6354019326868666868338
32MF911_27290_MatakanaMatakana6354019337171717171355
32MF911_28274_MatakanaMatakana8472919338181808181404
32MF911_33110_MatakanaMatakana84721519357171717271356
32MF911_9992_MatakanaMatakana42300192576007777230
32ML_17869_MatakanaMatakana30231842019294704634654674652,330
33MF911_27403_MiddlesexMiddlesex10590331933138137138138138689
34ML_18676_NorfolkNorfolk2921168019324424464424364292,195
34ML911_958_NorfolkNorfolk361021021194823323423100698
35MF911_44234_NorthumberlandNorthumberland13725221194729229129600879
36MF911_23534_OpawaOpawa6354319316464646464320
36MF911_26440_OpawaOpawa8472019326264616464315
36MF911_27357_OpawaOpawa84722419336364636464318
36MF911_28505_OpawaOpawa84721519337071707071352
36MF911_29526_OpawaOpawa6354619346363636363315
36MF911_30783_OpawaOpawa84722119346666666666330
36ML_18575_OpawaOpawa3119152019324284244284284262,134
37MF911_27680_OrariOrari6354319335656505454270
37MF911_30480_OrariOrari4236019343334343434169
37ML911_527_OrariOrari221225230194726226225700781
37ML911_81_OrariOrari311225230194724024023800718
38ML_18115_OtakiOtaki39292323619295625635634564262,570
39MF911_10588_OtiraOtira4230019264040394040199
39MF911_26306_OtiraOtira6354019327777727979384
39MF911_28021_Otira_DUPOtira8472919337979797979395
40MF911_29047_PakehaPakeha1059091933-348889898989444
40ML_17655_PakehaPakeha29221761419284674674674594622,322
40ML_17804_PakehaPakeha37292322019285785815775805782,894
40ML_18410_PakehaPakeha32211681019314534554553363332,032
41MF911_25239_PiakoPiako8472619327878787777388
41MF911_27502_PiakoPiako8472919337475757475373
42MF911_25348_Port AdelaidePort Adelaide959011932167163168153153804
42MF911_25349_Port AdelaidePort Adelaide2118019322529290083
42MF911_33271_Port AdelaidePort Adelaide8472019356970716971350
42MF911_35042_Port AdelaidePort Adelaide147252231936140141140134134689
42ML_18174_Port AdelaidePort Adelaide3519152019304174204144103832,044
43MF911_25998_Port AlmaPort Alma8472019327576767676379
43MF911_27851_Port AlmaPort Alma84721519337069697162341
43ML_18499_Port AlmaPort Alma3220160019314224214253763822,026
43ML_18587_Port AlmaPort Alma3721168019324634654634524452,288
44MF911_32068_Port AucklandPort Auckland63108019356666666652316
44MF911_41786_Port AucklandPort Auckland12610801938130130129131131651
44ML_17895_Port AucklandPort Auckland30231841819294734744754754752,372
44ML_18144_Port AucklandPort Auckland3021168019304524554533683692,097
45MF911_16080_Port BowenPort Bowen42300192876007676228
45MF911_32187_Port BowenPort Bowen841363419355853575960287
45MF911_33308_Port BowenPort Bowen6354019356565656561321
45MF911_34293_Port BowenPort Bowen841443819366464646464320
45MF911_35486_Port BowenPort Bowen4272019364747474747235
46ML_17849_Port CampbellPort Campbell28221761619294614604574584592,295
47MF911_26975_Port CarolinePort Caroline10472619338283838383414
47ML_18260_Port CarolinePort Caroline3221168019304364334344004012,104
47ML_18565_Port CarolinePort Caroline3823184019324794784824684672374
48MF911_31626_Port ChalmersPort Chalmers84117019357159737372348
48MF911_37654_Port_ChalmersPort Chalmers631081119376262626061307
49MF911_25515_Port DarwinPort Darwin8472919327069716866344
49MF911_35675_Port DarwinPort Darwin63108319366565656548308
49MF911_37099_Port DarwinPort Darwin841440193770707100211
49MF911_39424_Port DarwinPort Darwin841443019386868686868340
50MF911_13131_Port DenisonPort Denison423001927790787979315
50MF911_29212_Port DenisonPort Denison8472919347676767676380
50MF911_34150_Port DenisonPort Denison841442719367667747574366
50MF911_35293_Port DenisonPort Denison841441819368166808479390
51MF911_41942_Port DundedinPort Dunedin147126121939136131130137135669
51ML_18673_Port DunedinPort Dunedin2920160019334164154153653601,971
52MF911_36019_Port FremantlePort Fremantle631140193611411411400342
52ML_18558_Port FremantlePort Fremantle3519152019324224194204064102,077
52ML_18630_Port FremantlePort Fremantle3121168819324704704684594412,308
52ML_18680_Port FremantlePort Fremantle3021168019324534524534374412,236
53ML_18476_Port_GisbornePort Gisborne3318144019314224214222742521,791
53MF911_32744_Port GisbornePort Gisborne63108019357070706768345
53MF911_34915_Port GisbornePort Gisborne63108619365454565454272
53MF911_39080_Port GisbornePort Gisborne63108919386264646366319
53MF911_40057_Port GisbornePort Gisborne63108319386665646665326
54MF911_33928_Port HobartPort Hobart6354019367071717171354
54MF911_34904_Port HobartPort Hobart63108019367071676465337
54MF911_35996_Port HobartPort Hobart63108019367069706666341
55ML_18639_Port HunterPort Hunter34241922019324985035043453652,215
56MF911_40199_Port JacksonPort Jackson1051801219389498989596481
57ML_17977_Port MelbournePort Melbourne32241922219294744774774794792,386
58MF911_11208_Port NapierPort Napier6230019264341434342212
59ML_17873_Port NicholsonPort Nicholson31231841819294794804794784802,396
59ML_18399_Port NicholsonPort Nicholson30211681219314484464394184072,158
60ML_18155_Port SydneyPort Sydney3322176019304434464434174162,165
61MF911_41432_Port TownvillePort Townsville6310818193856565300165
62ML_17860_Port VictorPort Victor35282243619295415405405395372,697
63MF911_23307_Port WellingtonPort Wellington84721219311511152152153609
63MF911_27086_Port WellingtonPort Wellington847291933150150150152152754
63MF911_28051_Port WellingtonPort Wellington84721219338484868685425
63MF911_37821_Port WellingtonPort Wellington73108619375151514545243
64MF911_33984_Port WyndhamPort Wyndham631080193655555522169
64MF911_36181_Port WyndhamPort Wyndham631082119364747474646233
64MF911_39352_Port WyndhamPort Wyndham66310819384444443732201
64MF911_40313_Port WyndhamPort Wyndham631081519384444434141213
65MF911_41898_Reina del PacificoReina del Pacifico42363193955666666208
66ML_17827_RimutakaRimutaka42342722619297056897057087013,508
67ML_17998_RunpenuRuapehu36252002619294964954954584522,396
68ML_18579_SomersetSomerset3522176019324584594604354392,251
68ML_18646_SomersetSomerset3821168619324574584562742421,887
69MF911_44525_Southern HarvesterSouthern Harvester427227194751525300156
70MF911_18465_Southern KingSouthern King211501929383803811125
70MF911_18548_Southern KingSouthern King211591929130013026
70MF911_18819_Southern KingSouthern King211591929130013026
70MF911_18978_Southern KingSouthern King2115919291300131238
70MF911_18979_Southern KingSouthern King21159192916150151056
70MF911_21761_Southern KingSouthern King6354161930957902120215
71ML911_1542_StruanStruan241429455194715014200157
72MF911_43636_SuffolkSuffolk8310812194612112012100362
73MF911_10646_TairoaTairoa4230019266969696969345
73MF911_30690_TairoaTairoa8472619347878787878390
73MF911_35417_TairoaTairoa841443319366669666669336
73MF911_37729_TairoaTairoa841442419377777767675381
73MF911_8723_TairoaTairoa6345919259090899090449
74MF911_25999_TaranakiTaranaki635401932143143143142141712
74MF911_26898_TaranakiTaranaki6354019336767656767333
74MF911_27674_TaranakiTaranaki6354019336969686969344
74MF911_30161_TaranakiTaranaki6354019346464636464319
74MF911_42691_TaranakiTaranaki12621631939120133125134133645
74ML_18585_TaranakiTaranaki3019152019323793833853313191,797
75MF911_22425_TasmaniaTasmania8472919317878787878390
75MF911_26731_TasmaniaTasmania947201932107107107107107535
75MF911_27816_TasmaniaTasmania10590151933122122122123123612
76ML911_2149_ThuleThule372833301950430434436001,300
77MF911_37834_TongariroTongariro841443619376868686969342
77MF911_42032_TongarioTongariro12610831938134134134134134670
77ML_18641_TongariroTongariro2819152119324284294293323531,971
78MF911_44607_TrepasseyTrespassey8472619467675900160
79MF911_35419_Tuscan_StarTuscan Star631081819365657555759284
80MF911_12609_VerbaniaVerbania42300192747004747141
81MF911_10751_WaimanaWaimana63450192680008080240
82MF911_32663_WaipawaWaipawa4272019354242424141208
82MF911_34853_WaipawaWaipawa10518012193810510210300310
82ML911_988_WaipawaWaipawa391225248194823223223200696
83MF911_35087_WaiweraWaiwera63108919366464646363318
83MF911_36265_WaiweraWaiwera631081219366161616161305
83MF911_39360_WaiweraWaiwera631081719385859575659289
84MF911_32763_WestmorelandWestmoreland6310801935135135135135135675
85MF911_24120_ZealandicZealandic635401931142142142142142710
85MF911_25112_ZealandicZealandic635401932139139139139139695
85MF911_25805_ZealandicZealandic635401932129129129129129645
85MF911_30311_ZealandicZealandic6472319348181818585413

A total of 150,690 observations from 85 unique ships that embarked on 210 voyages were successfully captured by replicate keying from citizen scientists. Log images are the total number of digital files that correspond to the unique logbook file, images clipped are the total number of images that had data within the ACRE Antarctica domain (see Figure 1) that were selected for processing, the number of clips is the total number of segments that were extracted from all pages selected for processing, and clips not loaded are the number of clips that were blank (had no data). Total number of recovered observations for each category for each unique voyage (barometric pressure, air temperature, and sea temperature) are shown. Total number of unsuccessful transcriptions not shown.


Results

A total of 210 logs from voyages of 85 unique ships were transcribed in phase I of SWD (see Table 1 for details including years of coverage). Over 2,500 log book images were collectively obtained for those voyages, and 1,521 of those images were then selected for transcription. From the log book images that were used, 18,490 clips containing multiple meteorological observations were loaded to Zooniverse (taking into account that 16.6% of a grand total of 22,180 clips were blank and did not need to be transcribed). A grand total of 150,690 meteorological observations were recovered through replicate keying (uncorrected barometer, n = 32,747; attached thermometer, n = 31,399; corrected barometer, n = 32,196; air temperature, n = 27,330; sea temperature, n = 27,018). The total time for SWD phase I data capture was 9 months (running from October 2018 to July 2019), with a majority of transcribed observations obtained within the first 2 months from the project launch.
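
The totals quoted above can be cross-checked with a few lines of arithmetic. The short sketch below uses only the counts given in the text; the variable names are ours and purely illustrative.

```python
# Per-variable observation counts reported for SWD phase I.
recovered = {
    "uncorrected barometer": 32_747,
    "attached thermometer": 31_399,
    "corrected barometer": 32_196,
    "air temperature": 27_330,
    "sea temperature": 27_018,
}
total_observations = sum(recovered.values())

total_clips = 22_180   # all clips extracted from the selected images
loaded_clips = 18_490  # clips loaded to Zooniverse
blank_clips = total_clips - loaded_clips
blank_percent = round(100 * blank_clips / total_clips, 1)

print(total_observations)  # 150690
print(blank_percent)       # 16.6
```

The per-variable counts sum to the reported grand total of 150,690, and the 3,690 clips that were not loaded correspond to the stated 16.6% blank rate.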

Charting a new course for streamlined data transcription

Determining what transcription retirement limit to use for individual observations was still an open question when we launched SWD and after completing phase I. Replicate keying of logbook segments is designed to provide a majority consensus (and a measure of confidence through replication) that defines what numeric value exists in each table cell. The time needed to key a cell scales with the number of replicates, so the effort spent repeating keying to increase confidence should have a functional limit. Conversely, choosing too few replicate keying attempts places the onus back on the researcher more frequently to re-classify questionable values that are not resolved via consensus. In turn, that can also generate re-work in terms of reposting logbook clips online to obtain additional transcriptions. We initially chose to have entries transcribed by 10 different volunteers during the first phase of SWD. This limit was increased from an initial five entries after we discovered some problems with the general data format returned by Zooniverse (see issues outlined below). In SWD phase II, a goal was to determine optimal transcription and image clip retirement limits. Tabulated historical weather observations for eight ECVs (attached thermometer, uncorrected barometer, corrected barometer, maximum temperatures, minimum temperatures, wind direction, wind force, wind run) for the austral winter of 1939 (June, July, August) were digitally scanned from original meteorological Form 301 paper copies held in NIWA’s archive for 63 stations spread across New Zealand, covering the winter season when the 1939 Week it Snowed Everywhere (WISE) event occurred. A brute-force approach was employed by setting SWD transcription retirement limits at 20 for WISE, which was twice the replicate depth used for SWD phase I transcription. 
Using these data, we were able to use hierarchical degradation that progressively lowered replicate transcription sample depth of keyed values in order to evaluate optimal data keying retirement limits. Retirement statistics (completed successful transcription) were assessed for individual entries (each individual observation recorded in a log book clip), segments (the log book clips), and entire logbook images (with multiple segments that contain multiple entries). We considered results from our entire pool of volunteers (n = 20), the control dataset, to evaluate the effects of transcription sample depth degradation. The 20-volunteer sample depth also allowed us to gather a large enough dataset to evaluate type 1 (consensus acceptance of an incorrect value; false-positive/acceptance) and type 2 (non-consensus and rejection of a value that was legitimate; false-negative/rejection) errors. We also used the WISE dataset to evaluate minimum number of repeat classifications needed to obtain a 100% complete dataset via majority consensus with minimal transcription errors. Some examples of log books that had the most common errors are provided in the supplemental information. The percentage of entries, segments, and images that were “retired” (i.e., consensus reached, with citizen science transcription considered a success) decreased for all replicate keying tests conducted on each of the hierarchical transcription classes (5, 10, 15, and 20 volunteers) when the pass rate threshold was raised progressively from 60% to 90% (Figure 3). Results for all the hierarchical classes appeared most similar for entries, segments, and images for the 75% pass rate threshold and were the most different for the 90% pass rate test. 
The difference between success of the five-volunteer class within the 60% pass rate test and the 90% pass rate test resulted from the fact that all five answers need to align for the latter to be considered a success, while the former only requires three out of five to be right. The results for 10 versus 20 volunteers in the 90% pass rate test also appeared similar. Few appreciable differences were also observed in the 60% pass rate test for the 10, 15, and 20 volunteer classes.
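The relationship between pass rate threshold and the number of matching answers can be made concrete with a small calculation. The sketch below is illustrative only: the function name `votes_required` is hypothetical, and it assumes that a threshold is met when the count of identical answers is at least the pass rate multiplied by the number of volunteers, rounded up.

```python
def votes_required(n_volunteers, pass_rate_pct):
    """Smallest number of identical answers that meets the pass rate.

    Uses integer ceiling division so that no floating-point rounding
    can change the result (assumption: thresholds are whole percentages).
    """
    return (pass_rate_pct * n_volunteers + 99) // 100


# Tabulate the requirement for each volunteer class and threshold in the text.
for n in (5, 10, 15, 20):
    needed = {pct: votes_required(n, pct) for pct in (60, 75, 90)}
    print(n, needed)
```

Under this assumption the five-volunteer class needs three matching answers at 60% but all five at 90%, which is the contrast described above.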
Figure 3

WISE consensus results

Frequency of successful consensus classification using different thresholds of agreement for individual entries, meteorological form segments (clips), and entire meteorological form images pooled from unique land-based stations that were transcribed in the WISE phase of work on SWD. See supplemental information for more details about the number of data points, segments, and images that comprise these statistics.

We evaluated the probability of type 1 and type 2 errors by comparing expert-guided transcriptions of original logbook entries with the consensus values obtained through WISE. This experiment used tests based on several draws of 20 entries at random for each of the eight WISE data entry tasks, and just over 46,200 values constituted the pool that could be analyzed to assess errors associated with data entry. Across the entire dataset, 56% of entries had a low risk of error, 7% had a medium risk, 1% had a high risk, and 36% were blank. Blank and failed consensus entries were automatically excluded from these random draws. Each draw was evaluated with respect to consensus keying based on either a threshold of agreement (termed O75, 75% consensus, 15 of 20 values; O60, 60% consensus, 12 of 20 values) or by selecting the first five or 10 keyed responses (O5, O10) out of the 20 selected values. An additional test, termed “output resampled” (ORS), added a step to the O60 consensus processing with a random draw for entries that failed to reach consensus as a way to reach a definitive result. Each of the failed entries from this test had a statistical mode calculated from 500 iterations that individually pulled a five-sample random draw from the pool of 20 entered values (Table 2). 
We further classed type 1 and type 2 errors in each of these tests across three categories of keying success with respect to whether there was an increased likelihood of either error occurring (with the a priori assumption this would be strongly linked to the quality of the uploaded image on SWD). These categorical tests spanned low-risk images (consensus pass rate = 100%; clear penmanship, no edits in the original cell), medium-risk images (consensus pass rate <80%), and high-risk images (consensus not reached; often associated with edited original tabulated entries or messy penmanship).
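The threshold and resampling tests can be sketched in a few lines. This is a minimal illustration and not the code used in the study: the function names, the fixed random seed, and the example readings are hypothetical, and the resampling step reflects one reading of the description (take the mode of each five-value draw, then the mode of those 500 modes).

```python
import random
from collections import Counter


def consensus(values, pass_rate_pct=60):
    """Return the most common keyed value if it meets the pass rate, else None."""
    value, votes = Counter(values).most_common(1)[0]
    required = (pass_rate_pct * len(values) + 99) // 100  # ceiling division
    return value if votes >= required else None


def resampled_mode(values, draws=500, sample_size=5, seed=0):
    """Resolve a failed consensus by repeated small random draws.

    Assumption: each draw samples without replacement from the keyed values,
    the mode of each draw is recorded, and the most frequent mode is returned.
    Requires len(values) >= sample_size.
    """
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(draws):
        sample = rng.sample(values, sample_size)
        tally[Counter(sample).most_common(1)[0][0]] += 1
    return tally.most_common(1)[0][0]


# Hypothetical cell keyed by 20 volunteers with no 60% majority (11 of 20).
keyed = ["29.87"] * 11 + ["29.81"] * 5 + ["29.37"] * 4
result = consensus(keyed, 60)
if result is None:
    result = resampled_mode(keyed)
print(result)
```

A resampling step of this kind always returns a value, which is why it trades type 2 errors (rejections) for a somewhat higher chance of type 1 errors on difficult cells.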
Table 2

Type 1 and type 2 errors associated with the WISE hierarchical degradation and resampling tests (O75, O60, O5, O10, and O-RS)

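| Test | Low risk T1 | Low risk T2 | Low risk Correct | Medium risk T1 | Medium risk T2 | Medium risk Correct | High risk T1 | High risk T2 | High risk Correct | Blank cells T1 | Blank cells T2 | Blank cells Correct | Whole set T1 | Whole set T2 | Whole set Correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| O75 | 0 | 0 | 100 | 0 | 13 | 87 | 0 | 89 | 11 | 0 | 0 | 100 | 0.00 | 1.83 | 98.17 |
| O60 | 0 | 0 | 100 | 0 | 5 | 95 | 5 | 62 | 34 | 0 | 0 | 100 | 0.05 | 0.96 | 98.99 |
| ORS | 0 | 0 | 100 | 0 | 0 | 100 | 16 | 0 | 84 | 0 | 0 | 100 | 0.16 | 0.00 | 99.84 |
| O5 | 0 | 0 | 100 | 3 | 0 | 97 | 26 | 30 | 44 | 0 | 0 | 100 | 0.51 | 0.30 | 99.19 |
| O10 | 0 | 0 | 100 | 0 | 3 | 97 | 12 | 49 | 39 | 0 | 0 | 100 | 0.12 | 0.69 | 99.19 |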

The whole-set percentages were calculated by weighting the percentage for each category (low, medium, and high risk, and blank cells) by the proportion of the entire dataset that falls in that category, in order to obtain whole-set percentage error and percentage correct results. These results represent the aggregate for all entry types (pressure, temperature, and wind) across eight tasks. More details about this experiment can be found in the supplemental information.

Blank cells were identified correctly in all tests. For the low-risk image category, type 1 and type 2 errors were absent. For medium-risk images, type 1 errors increased slightly and type 2 errors increased more markedly (Table 2). For high-risk images, all of the tests except O75 revealed type 1 errors, and there were no type 2 errors associated with the ORS analysis. The most common incorrect transcription issues had to do with (1) confusion between 4s and 7s and 4s and 6s; (2) omission of decimal points or other delimiters; and (3) extraneous notes, arrows, or values in cells where original data had been manually corrected (i.e., crossed out and re-written).
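The whole-set columns of Table 2 follow from the weighting described in the table note. The sketch below reproduces that arithmetic using the rounded dataset proportions quoted in the text (56% low risk, 7% medium risk, 1% high risk, 36% blank) and per-category percentages read from Table 2. The names `WEIGHTS` and `whole_set` are hypothetical, and because the inputs are rounded, results can differ from the published values in the second decimal place.

```python
# Share of the whole dataset in each category, in percent (rounded, from the text).
WEIGHTS = {"low": 56, "medium": 7, "high": 1, "blank": 36}


def whole_set(rates_by_category):
    """Weight each category's percentage by that category's share of the dataset."""
    return sum(WEIGHTS[c] * rates_by_category[c] for c in WEIGHTS) / 100


# Type 1 error percentages for the ORS test and type 2 for O75, per category.
ors_type1 = {"low": 0, "medium": 0, "high": 16, "blank": 0}
o75_type2 = {"low": 0, "medium": 13, "high": 89, "blank": 0}

print(whole_set(ors_type1))  # compare with the published 0.16
print(whole_set(o75_type2))  # compare with the published 1.83
```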

Training machines to guide the data rescue ship

The WISE dataset was also used to independently test Microsoft Read API (Figure 4) using digital photograph surrogates of the 1939 Form 301s that contained the original analogue data. One advantage of Microsoft Read API is the ability to transcribe an entire sheet using computer vision, which can save research preparation time related to clipping and uploading segments of a page onto Zooniverse for citizen science transcription. Six high-resolution scans of full original Form 301 data sheets from two stations were used for a preliminary test of Microsoft Read API, which draws on an OCR engine based on deep learning algorithms.44, 45, 46 A Microsoft Excel template indicating the position of data on the page (row and column of each cell) was also provided to the Microsoft team for supplying values back to NIWA for validation.
Figure 4

Azure OCR pipeline

Generalized architecture of the automated Azure cloud computing pipeline hosted by Microsoft that was used for the WISE OCR and transcription experiment. Handwritten meteorological tables in portable document format (PDF) were transferred to Microsoft and loaded to the Azure Data Lake Storage (ADLSv2), where Function Apps code forwarded them for text extraction. The Read API Azure Cognitive Service was used to extract handwritten digits from each PDF, in conjunction with custom machine learning models deployed using the Azure Kubernetes service via the Azure Container Registry. The custom model removed noise from the digital surrogates and located cells with digits in them. The extracted components from each page were further processed, and the final outcome from OCR analysis was stored in the Azure SQL database (Result Store), where it was accessed, analyzed, and visualized using Power BI. In addition, credentials for inter-service communication were securely held in Key Vault.

The results from OCR using Microsoft Read API indicate variable efficacy between different observing sites and for different observation types (Table 3). Across five quantitative observation categories (attached thermometer, barometer uncorrected, barometer corrected, maximum temperature, minimum temperature), the Microsoft Read API validation grand strike rate was 69% ± 15% (n = 920). Results for transcribing ECVs were also site dependent (related to penmanship of the observer who filled in the data table).
Table 3

Results from Microsoft Read API for the WISE

| | Attached thermometer | Barometer | Barometer corrected | Minimum temperature | Maximum temperature |
|---|---|---|---|---|---|
| Grand strike rate, uncorrected (%) | 65.1 | 77.1 | 69.0 | 64.2 | 70.7 |
| Grand strike rate, potential correction (%) | 81.5 | 80.1 | 76.0 | 78.7 | 80.8 |

Strike rate (percentage correct) across five meteorological variables transcribed by Microsoft Read API for Albert Park (A64871) and Christchurch (H32561) spanning June to August 1939. The potential corrected grand strike rate is corrected for any miss related to a decimal or a dash that was not captured in the automated transcription.

Extraneous formatting and errors related to decimals and dashes were not considered when validating the Microsoft Read API because they are minor (see Table 3; difference between uncorrected and potential corrected strike rate for machine learning transcription). The most common issues identified where the Microsoft Read API auto-transcription did not validate related to incorrect transcription of the first digit of a numeric string, and designation of a letter where a number actually occurred. The most common digits that were not transcribed correctly were 4s and 7s (often swapped). Both of those shortcomings are similar to issues that we experienced on SWD for citizen scientists keying in data for the WISE experiment. Additional simplified guidance for unsupervised machine learning algorithms could be applied in those cases (e.g., pressure values recorded in inches of mercury must begin with a 2 or 3) to improve strike rate results for the Microsoft Read API (Table 3).
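A rule of the kind suggested above can be expressed as a simple post-processing check on transcribed strings. The sketch below is hypothetical and was not part of the Microsoft pipeline: the function name `check_inhg_reading` and the wording of the flags are assumptions, and it only flags suspect values for review rather than correcting them.

```python
def check_inhg_reading(text):
    """Return a list of problems found in a transcribed pressure value (inHg)."""
    problems = []
    value = text.strip()
    # A letter in place of a digit shows up as a non-numeric string.
    if not value or not value.replace(".", "", 1).isdecimal():
        problems.append("non-numeric characters")
    # Rule quoted in the text: readings in inches of mercury begin with 2 or 3.
    if value[:1] not in ("2", "3"):
        problems.append("first digit is not 2 or 3")
    return problems


for reading in ("29.87", "79.87", "2g.87"):
    print(reading, check_inhg_reading(reading))
```

Flagged values could then be routed to an expert or to citizen science keying instead of being accepted automatically.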

Discussion

Consolidating lessons learned from the SWD data rescue journey

Improving our understanding of past weather events and the roles that modes of variability have played in guiding extreme conditions requires better reanalyses, and in particular the coverage for the southern high latitudes needs to be dramatically augmented (Figure 1). There is massive potential to improve global reanalyses using the troves of historical meteorological data that are stored in a wide range of archives. These observations can be digitized by volunteers with assistance from scientists who can prioritize and arrange data rescue activities. A major advantage of using Web-based citizen science for data rescue efforts is that the human resource can be drawn from all regions on Earth, volunteer time is free, and progress is made more or less continuously. In addition, citizen science data rescue provides an opportunity to engage and educate the general public about the importance of long-term meteorological observations and climate change. In SWD, we learned that when data rescue is conducted under the auspices of a global effort like ACRE, and with support from agencies like the World Meteorological Organization and Copernicus Climate Change Service, it engenders increased regional responsibility for data stewardship and archives while raising the profile of the science. This typically has a positive feedback for conducting additional data rescue activities, particularly in remote and under-resourced regions. In addition, the transparency of nation- and archive-specific data holdings improves, which engenders wider data sharing that can exceed what ad hoc efforts undertaken by isolated researchers have achieved in the past. It is also clear that automated OCR approaches, like those we tested using Microsoft Read API, could be greatly improved by using the large volume of data captured through citizen science efforts like SWD. 
The SWD core team that undertook the tasks required to capture handwritten observations using Zooniverse consisted of nine people. Our team members collectively found and captured digital twins of data sheets in multiple archives, prepared them for transcription on the Web platform, retrieved/parsed replicate keyed observations, and undertook statistical analyses of the results. Each of these data rescue tasks does not constitute an equivalent time investment or skill level. We also obtained external support from national and international collaborators to achieve many of our aims (e.g., finding ship log books in archives, testing machine learning OCR transcriptions). We divided basic data rescue tasks between senior scientists, casual staff, and students in order to maximize the use of limited funding. Overall, the foundation for a data rescue project like ours could be run on 1.0 full-time equivalent (FTE) employment. However, it is likely that multiple years would be required if one person were to do all of the associated tasks, including the field work. This type of effort also requires a broad enough skill set that includes development, adaptation, or augmentation of code that automates tasks through scientific programming. In addition, support from professional media experts would be required to attain the level of external project promotion we achieved. The benefits of crowd-sourcing labor to key historical observations are partially offset by some unique challenges. A significant investment of time is required to train personnel in how to manually clip the logbook images or to set up different workflows for logbooks that are printed in different formats. This echoes findings learned from citizen science efforts to key United Kingdom Met Office daily weather reports, where it was noted that the effort required to clip segments of images and provide them using consistent formatting for end-user context places an additional time burden on the research team. 
Automated clipping routines we tested reduced the time investment for this specific data rescue step, but success is highly dependent on the quality of photography and the types of scientific data tables being rescued. We are aware that ACRE Argentina is at present using clipping approaches that focus on single cells and provide them in Zooniverse without formatting, which is a potential time-saving measure (https://www.zooniverse.org/projects/acre-ar/meteororum-ad-extremum-terrae). For SWD phase I, all the logbooks that were not in a consistent format were omitted. An issue related to consistency of data transcription from international audiences can also arise. We noted that dashes and decimal points were commonly replaced with commas or used as delimiters, which made post-transcription parsing of the data retrieved from native Zooniverse outputs difficult. There were also significant discrepancies related to the citizen science transcription of ship coordinates that led us to eventually input that category manually using an expert team. Significant time was also required to respond to questions from volunteers (particularly in the early stages following initial project launch).
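One way to reduce the separator problem described above is to normalize keyed values before consensus is calculated. The sketch below is an assumption about how such a step could look, not the parsing code used for SWD: the function name `normalize_keyed_value` is hypothetical, and it treats a single comma or space as a decimal point only when the value contains no period.

```python
import re

# A plain number with an optional sign and an optional decimal part.
_NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def normalize_keyed_value(raw):
    """Return a keyed value with a period as decimal separator, or None.

    Assumption: volunteers meant a decimal point when they typed a comma
    or a space and no period is present in the value.
    """
    text = raw.strip()
    if "." not in text:
        text = re.sub(r"[,\s]+", ".", text, count=1)
    return text if _NUMERIC.match(text) else None


for raw in ("29.87", "29,87", "29 87", "2987", "n/a"):
    print(repr(raw), "->", normalize_keyed_value(raw))
```

Values that cannot be normalized are returned as `None` so they can be reviewed rather than silently altered.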

Scanning the horizon for fair winds and smooth data rescue sailing

Based on the outcome of the WISE experiment tests, we consider a compromise can be reached between time spent keying by citizen science volunteers and achieving completeness and accuracy of a transcribed dataset when eight replicate entries are employed. To achieve that, we recommend initially setting a minimum 60% pass rate threshold (five out of eight in agreement) and then using a resampling scheme for any values that did not reach consensus. Using that scheme, we would expect that type 2 errors would be absent from the transcribed data, and type 1 errors would be, on average, less than two in 1,000. In addition, the transcribed dataset will be 100% complete and close to 99.5% accurate. It is also worth noting that these results are dependent upon the nature of the data being transcribed. The specific retirement limit and broader strategy employed for scientific data transcription may need to be adjusted based on the type of observations being rescued. Integer values with no decimal points are the most straightforward to key and require little repetition to ensure a correct consensus value. Conversely, alphanumeric values and values with many significant figures introduce more complexity or opportunity for variation among responses from the citizen scientists (e.g., representing a decimal with a period, a comma, a space, or ignoring the decimal altogether). For example, within our dataset, we observed significantly more errors in the temperature fields, which generally include decimal points, than the pressure (integer-only values) and wind run (alphabetic-only values) fields. As such, ironing out idiosyncrasies so that data rescue efforts through Zooniverse can be universal and successful involves the following minimum requirements:
- Prepare scans of data tables in a way that enables efficient keying and that is easy to understand.
- Test and re-test workflows to ensure they are simple to follow (heeding participant feedback).
- Design tasks so that citizen scientists have the best chance of entering a correct result.
- Evaluate initial inputs and data retrievals with a small dataset before launching a full data rescue campaign.
- Optimize replicate keying levels to balance confidence of results with time invested from citizen scientists.
- Prepare enough material in advance to ensure momentum can be continually maintained.

Promotion of our project and engagement with media and the public was strongly connected to the rate of retirement for logbook segments and the overall success of completing the recovery of meteorological data via SWD. Our approach kept the following in mind:
- A strong communications strategy with a “hook” to get people involved.
- Willingness to engage with the media and the project participants.
- Promotion of the project on multiple social media platforms.
- Repeated contact with the citizen science community using emails and updates as tasks progressed.

Data rescue on Zooniverse has a proven successful track record for several projects that have focused on the recovery of historical weather observations. Our approach for SWD is something that can be easily replicated for other disciplines where tabulated scientific data need to be transcribed. We recently provided training to assist the launch of the Climate History Australia project (https://climatehistory.com.au) using the lessons we learned via SWD. It is important to note that inter-project knowledge sharing for meteorological data rescue has largely been by word of mouth and interpersonal relationships (having been helped by colleagues in the Weather Rescue and Old Weather projects that came prior to our project). There are relatively few references in the literature that describe exactly how data rescue that engages the general public is undertaken. 
Hence, we hope that this study provides a basic roadmap for novice practitioners that highlights insights about success and challenges for data rescue, and that the scientific community can build upon these lessons to accelerate the rapid acquisition of historical scientific data for wider societal benefits.

Experimental procedures

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Andrew Lorrey (a.lorrey@niwa.co.nz).

Materials availability

Digital twins of the original log books and meteorological forms used in this study are held by NIWA. They can be made available on reasonable request.

Establishing a citizen science identity for our data rescue crew

When our work began, leading exemplars for historical meteorological data rescue harnessing citizen science were OldWeather56, 57, 58 and Weather Rescue. The latter project was built on the free-to-use Zooniverse Web platform (www.zooniverse.org) and demonstrated a capability to recover millions of observations keyed in replicate. Based on the global success of Weather Rescue, both in terms of public engagement and the great speed and volume of historical weather data transcribed, our research team decided to employ a similar design. We registered our project on Zooniverse, and simultaneously created a project identity. Our project description included a name and icon connected to the southern hemisphere region where we wanted to generate “discovery” science about weather and climate with historical meteorological observations and reanalyses. The heavy focus on rescuing maritime data in our project led us to use a ship as a project icon, including a sail with an Antarctica logo that was embellished with a thermometer and sun in the background. The name SWD arose out of testing word combinations we thought were reflective of the project work and regional focus. It is also a subtle play on words with respect to the well-known RRS Discovery Antarctic expeditions (from which we have obtained data). A website domain name was purchased in order to make a shortcut (via redirection) to SWD (www.southernweatherdiscovery.org) to make it easier for the general public to find us on Zooniverse (instead of directing them to find the project at the Zooniverse URL https://www.zooniverse.org/projects/drewdeepsouth/southern-weather-discovery).

Guiding citizen scientists through an ocean of data

The Zooniverse Web platform is designed to accommodate novice citizen science practitioners who have no prior knowledge of website design or Web development. The build-a-project instructions (https://help.zooniverse.org/getting-started/) guide the completion of a project setup leading to two basic website components: a front end, which the general public can see and work with, and a back end that contains the design and content elements required to organize workflows and create data entry fields. There are several Web page hierarchical elements that can be viewed on the SWD front end, which include primary navigation tabs labeled About, Classify, Talk, and Collect. We discuss the first three of these tabs below. Under the About tab, there are subsidiary tabs for Research, The Team (biographic information), Results, and Frequently Asked Questions (FAQ). We felt it was important to complete details for the Research and Team tabs under the About heading in order to establish our project identity upon launching SWD. We used the Team tab to outline biographic information; this element of the website humanizes the project by providing a face behind the science, as well as key points of contact. Additional considerations for providing personal details need to be weighed by each research team; we included the ability for citizen scientists to contact us to engender a better connection between our role as researchers and the public who we were trying to engage with for participating in data transcription. The Research tab provided an opportunity to outline more in-depth reasons for doing citizen science data transcription. Many of the citizen scientists using the Zooniverse platform are genuinely excited about the research, and providing these additional details helps them to engage with the project. The Talk tab included conversations between our research team and citizen scientists. 
It was used to engage with participants who initiated questions or discussions, with most of the queries related to general data entry issues that were not pre-emptively thought of for the tutorial (mostly uncommon problems). In rare cases, submitted questions were related to reiteration of instructions when an occasional user did not understand our tutorial. More detailed questions about experimental design, including reasons for retirement limits for each image, and how to deal with missing data were popular topics. We also used the Talk tab to occasionally provide new instructions for data keying, and in one case we specifically asked citizen scientists to change their transcription on the fly (to not use commas as a numeric separator due to a data formatting issue with Zooniverse). The Classify tab will be discussed below in more detail when we outline how workflows for data transcription were made.

Advanced preparations for a long data rescue voyage

Historical ship logbook observations and land-based meteorological registers were handwritten on standardized printed table forms (Figure 5), making them ideal for Zooniverse platform transcription. We undertook two main transcription tranches, each with a slightly different approach for uploading digital copies of meteorological registers for transcription. Ahead of volunteers keying data online, the architecture of a basic workplan needs to be considered to determine how the division of labor should proceed. This helps to maximize efficiency, minimize transcription errors, and reduce preparation time. Consideration about the data types that are keyed and preparation related to both SWD transcription tranches are provided below.
Figure 5

A log book page used in SWD

This example shows a standard weather observation register that was transcribed by citizen scientists in SWD. Clipping masks were placed over the digital version of the register, with alphanumeric labels placed on to ship position (X1–X6), barometric pressure (A1–A6), and temperature (B1–B6). The original file name contains a unique sample identifier, the name of the ship (in this case the MS Port Gisborne), and an image number, to which the clipping mask alphanumeric code was added before uploading to Zooniverse (e.g., MF911_39,080_Port Gisborne_IMG_6247_B5.jpg for the clip of the register corresponding to box B5). This scheme facilitated ease of data retrieval and reparsing the data into a continuous time series for replicate quality assurance and further analysis. Image supplied by C. Wilkinson, RECLAIM.

Most of the logbooks used in the first phase of SWD (Table 1) were sourced from UK merchant and immigration ships that visited New Zealand and Australia via the South Pacific, and were acquired from international archives through an extension of the Recovery of Logbooks and International Marine data (RECLAIM) project. Many of those logs used a standard printed register that arranged multiple observations in columns containing a unique variable (e.g., 9 a.m. temperature, pressure) and discrete entries in rows corresponding to a common date, time, and location (Figure 5). In the second phase of SWD (see section “shore leave for the WISE”), New Zealand land-based observations from meteorological registers containing a broader range of observations were drawn on, with nine discrete variables to key. For both phases of SWD, individual meteorological register pages were subdivided into small parts to provide a segment for an individual volunteer to transcribe rather than providing the whole page. 
This choice was based on discussions with colleagues and feedback from volunteers, and helped to (1) ensure data keying contributions could be completed in short bursts rather than taking up lengthy intervals of time, (2) minimize the risk of volunteers abandoning a data entry form before submitting a full transcription, (3) reduce mistakes that are associated with transcribing the wrong column or row, and (4) decrease the probability of widespread error propagated across an entire log, if, for example, a specific volunteer had a difficult time deciphering the handwriting on a specific page. Our project’s contractual requirements for the DSC also meant we prioritized certain observations on a logbook page, and therefore only a subset of meteorological logbook segments for each page and logbook were targeted for transcription.

A task fit for a clipper

To accommodate the structure of Zooniverse workflows that lead citizen science volunteers through keying (covered in detail below), we created subsets of each logbook page that were cropped and uploaded to SWD in a standard format. Adobe Illustrator software was used to crop segments of each logbook page using the artboard function. Logbook segments usually covered 2 days of a voyage, with four rows for observations per day (Figure 6). Clipping each segment out of the entire page was initially a semi-manual process, because many logbook images were not positioned identically each time a digital surrogate was created in archive (e.g., pages were inconsistently positioned when captured). This meant artboards used for clipping had to be iteratively adjusted to ensure the observations contained in the 2-day logbook segments were not truncated. Eventually, our team fixed this in pre-processing so the entire process of clipping could be automated. Once artboards were adjusted and aligned to the standardized logbook table dimensions, we adapted existing JavaScript to automate image labeling and cropping to produce logbook segments. A labeling convention was devised to identify where each segment clip was located on the original logbook page, with column A assigned to barometric pressure, column B assigned to temperature, and column X for ship position (Figure 5). The image names of each clip contained information about the archive folder, the ship name, the original image name, and the position of the clip on the logbook page attached as a file name suffix (see Figure 5).
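The labeling convention described above can be illustrated with a minimal Python sketch. This is not the project's JavaScript or its actual code; the function names, the assumption that original image names look like `IMG_<digits>`, and the regular expression are all hypothetical, chosen only to match the pattern of the Figure 5 example file name.

```python
import re
from typing import NamedTuple

# Column letters as described in the text (Figure 5).
COLUMN_VARIABLES = {
    "A": "barometric pressure",
    "B": "temperature",
    "X": "ship position",
}


class ClipName(NamedTuple):
    sample_id: str  # archive/sample identifier, may itself contain underscores
    ship: str       # ship name, may contain spaces
    image: str      # original image name (assumed to look like IMG_1234)
    column: str     # clipping-mask column letter (A, B, or X)
    index: int      # position of the clip within that column on the page


# Assumed layout: <sample id>_<ship>_<IMG_nnnn>_<letter><index>.jpg
CLIP_PATTERN = re.compile(
    r"^(?P<sample>.+)_(?P<ship>[^_]+)_(?P<image>IMG_\d+)"
    r"_(?P<col>[ABX])(?P<index>\d+)\.jpg$"
)


def build_clip_name(sample_id, ship, image, column, index):
    """Append the clipping-mask code to the original file name parts."""
    if column not in COLUMN_VARIABLES:
        raise ValueError(f"unknown column letter: {column!r}")
    return f"{sample_id}_{ship}_{image}_{column}{index}.jpg"


def parse_clip_name(filename):
    """Recover the parts of a clip file name built with the layout above."""
    match = CLIP_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {filename!r}")
    return ClipName(
        match["sample"], match["ship"], match["image"],
        match["col"], int(match["index"]),
    )
```

For example, `parse_clip_name(build_clip_name("MF911_39080", "Port Gisborne", "IMG_6247", "B", 5))` recovers the ship name and the `B5` position, and `COLUMN_VARIABLES["B"]` identifies the clip as a temperature segment.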
Figure 6

Example of ship log segment in SWD

(Left) Formatted clipped segment taken from the MS Port Alma in 1932 that shows 2 days of handwritten regimented observations in tabulated format for uncorrected atmospheric pressure, attached thermometer, and corrected pressure (reduced to sea level). (Center) The task description for keying these observations (step 1) serves as a check that the correct image clip was uploaded, while the example for the data entry field instructions (right) indicates to the citizen scientist which column to key and how to separate the values.

Prior to uploading each clipped image to Zooniverse, a Jupyter notebook (a Web-based interactive computing platform) script run in Python was used to add the name of the ship, the year of the voyage, the hours of observation, and column headings (see Figure 6). Our decision to use a Jupyter notebook for this step enabled research team members to generate logbook segment clips regardless of their scientific computing experience. In SWD phase II, we also initiated automation of logbook segment clipping using MATLAB to help streamline this stage of the data rescue process. This labeling system also makes reassembling data after transcription easier. Links to code for the aforementioned steps are provided in the supplemental information.
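To illustrate how a position-encoding file name can simplify reassembly, the sketch below orders transcribed clips for one column and concatenates their values. It is a hypothetical example, not the project's code: it assumes that image numbers increase through a voyage, that the clip index increases down the page, and that transcriptions arrive as a mapping from clip file name to a list of keyed values.

```python
import re

# Assumed file name ending: ..._IMG_<image number>_<column letter><index>.jpg
CLIP_SUFFIX = re.compile(r"_IMG_(?P<image>\d+)_(?P<col>[ABX])(?P<index>\d+)\.jpg$")


def reassemble(transcriptions, column):
    """Return one continuous list of values for a single column letter.

    transcriptions: dict mapping clip file name -> list of keyed values.
    Clips are ordered by image number, then by clip index on the page
    (both orderings are assumptions made for this sketch).
    """
    keyed = []
    for name, values in transcriptions.items():
        match = CLIP_SUFFIX.search(name)
        if match is None:
            raise ValueError(f"unrecognized clip name: {name!r}")
        if match["col"] == column:
            keyed.append((int(match["image"]), int(match["index"]), values))
    keyed.sort(key=lambda item: item[:2])
    return [value for _, _, values in keyed for value in values]
```

Because the ordering comes entirely from the file name suffix, clips can be transcribed by volunteers in any order and still be rejoined afterwards.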

Changing tack with specialized workflows

Three specialized workflows were created on Zooniverse for volunteers to take part in transcribing data for the SWD first tranche: ship position, temperature, and barometric pressure. The workflows were designed to be as simple as possible and utilized the logbook clips discussed above rather than displaying a whole logbook page. In the first tranche, we used an open entry field and asked volunteers to key a small column of data, with values separated by a range of delimiters (e.g., space, comma). In the SWD second phase, we upgraded the data entry forms to provide an individual entry box for each observation and adjusted the subdivisions of the logbook pages to ensure only one column of data was keyed in a step. This was assisted by Zooniverse via the Combo Task feature, which was experimental during early 2020, having been trialed through the Weather Rescue project. Although further customization of a Zooniverse-hosted citizen science website is possible, our team only used minimal special requirements like this that were facilitated by the Zooniverse staff. The workflow questions were designed to lead the volunteers through the image with handwritten meteorological data: first, we asked if the image related to what the workflow task indicated (to potentially eliminate images that had been loaded into the wrong workflow). This was followed by a number of sequential questions that asked the volunteer to transcribe columns of numbers (Figure 6). A workflow task also asked the volunteers to transcribe the latitude and longitude so the historical weather observations could be ascribed to a location, date, and time. A separate workflow for temperature observations asked volunteers to transcribe air, sea, dry bulb, and wet bulb temperatures. Finally, a barometric pressure workflow asked volunteers to transcribe uncorrected pressure, the attached thermometer (required for correcting raw pressure measurements), and the corrected pressure at sea level. 
Alongside each workflow, Zooniverse requires a tutorial and a field guide. Tutorials lead volunteers through a workflow step by step and address any tasks idiosyncratic to it, so each workflow has its own tutorial. The field guide addresses more general questions about the workflows and the project, and it can be found on the side of any page of the Zooniverse project (see more details on southernweatherdiscovery.org).
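The open entry field used in the first tranche means that one keyed response can mix delimiters. A minimal sketch of how such a response might be split into individual values follows. The function name and the delimiter set are assumptions: the text mentions only spaces and commas as examples, and semicolons and slashes are added here for illustration.

```python
import re

# Delimiters assumed for this sketch: whitespace, comma, semicolon, slash.
DELIMITERS = re.compile(r"[\s,;/]+")


def parse_numeric_entry(raw, expected_count=None):
    """Split a free-text entry into floats; unreadable tokens become None."""
    tokens = [token for token in DELIMITERS.split(raw.strip()) if token]
    values = []
    for token in tokens:
        try:
            values.append(float(token))
        except ValueError:
            values.append(None)  # keep the slot so row alignment is preserved
    if expected_count is not None and len(values) != expected_count:
        raise ValueError(
            f"expected {expected_count} values, found {len(values)}: {raw!r}"
        )
    return values
```

A limitation of this approach is that a decimal comma (for example `29,90`) would be split into two values, which is one reason an individual entry box per observation, as adopted in the second phase, is easier to parse.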

Conscripting data rescue participants

Our maiden voyage into the deep south

Our team used a multiphase communication plan to introduce and promote SWD, including video and print media campaigns to garner participation and maintain interest in our meteorological data rescue project. An initial step was to create an introductory video for the SWD website that would encourage people to participate in digital keying of historical weather observations. The video content was crafted with the assumption that the audience had never heard of the project or previously participated in a citizen science effort. This portion of our strategy, in addition to parallel strategies for social media, print media, and radio, was designed by the SWD team and the NIWA Communications team across several months of work to ensure the data rescue scientific content was robust and that delivery to multiple media outlets would be ready in time for launching our SWD project on Zooniverse. In pre-production for the SWD introductory video, we noted an obvious limitation related to visual content being restricted to historical ship logbooks. However, we were fortunate to find historical footage shot by Herbert Ponting of Robert Falcon Scott’s British Antarctic Expedition to the South Pole in 1910. The black and white video footage of Scott’s expedition showcases the conditions under which the scientific observations were made during the “heroic age of exploration.” Weaving several segments from this historical video into our messaging was central to the strategy of initiating and maintaining engagement with SWD data rescue. To reflect the nautical elements of SWD data rescue, key segments were filmed at the Auckland Maritime Museum. We also framed the central issue around the difficulty of transcribing handwriting, and demonstrated how the audience could be a part of the solution. 
The general progression of the video also highlighted the importance of recovering historical meteorological observations to provide insights about our current and future climate (refer to the statement “Using their legacy to help ours” at the 2-minutes-and-13-seconds mark in our first SWD video; https://vimeo.com/297007476). The mixture of contemporary and historical video footage engenders ties to the golden age of exploration, with the idea to “breadcrumb” prospective citizen scientists toward participating. For the SWD launch, observations that were taken at temporary encampments and during overland sledging missions during Scott’s expedition were added onto the SWD website. These workflows were used to entice members of the public to take part in the transcribing effort, and for the media to create a story around. Some of these data came from printed tables and had already been transcribed by other researchers (without our prior knowledge). However, it was considered a minimal time sacrifice to copy and clip those images to garner significant public interest in the project. SWD was launched on 30 October 2018. The introductory SWD video was promoted by NIWA and reached 127,200 people on Facebook, resulting in 288 comments, likes, or shares. It was viewed 55,000+ times on Facebook, YouTube, and Vimeo. The project was also promoted through the NIWA Communications team to New Zealand media, with a story featured on primetime television news (TV One), which has a nightly audience of ∼600,000 national viewers (>10% of New Zealand’s population) (https://www.tvnz.co.nz/one-news/new-zealand/robert-scotts-weather-logs-give-kiwi-scientists-new-insight-climate-change). A story about SWD also featured on the front page of The New Zealand Herald website, New Zealand’s second largest online news outlet with a monthly reach of 1.3 million subscribers (https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12151407). 
SWD also had significant coverage in provincial and regional newspapers, with an additional estimated 100,000+ audience reach. A parallel social media campaign was also launched through NIWA’s social channels (Facebook and Twitter) and by members of the research team, with subsidiary re-promotion of materials to reach the Weather Rescue participants (who were largely based overseas and at the time were waiting for more data to key). NIWA’s Twitter promotion about the project reached an audience of 7,955 people with 165 comments, likes, or retweets. Posts on Twitter about the project were shared by climate scientists, international and New Zealand science organizations, and hundreds of members of the public (and even by Chelsea Clinton to her ∼2.4 million followers). The project was also promoted in an e-mail newsletter to all Zooniverse volunteers. The metrics and progress components supplied by Zooniverse also allowed us to track data transcription progress, monitor feedback from participants, and identify opportunities for social media pushes to re-energize the effort and draw in more people. The uptake of data keying by volunteers was swift, and over 50,000 observations, including all of the ice sledging data from Scott’s expedition, were initially transcribed in replicate over the first 2 days after the launch of the SWD project. In total, 167,914 unique meteorological observations were successfully captured in replicate through phase I of the SWD project.

Shore leave for the WISE

The second phase of SWD focused on a project called The Week it Snowed Everywhere (WISE). This was a phrase that we coined to describe a significant snowfall event that affected most of New Zealand during the austral mid-winter of 1939 (Figure 7). A primary goal for this phase of SWD was to evaluate transcription retirement limits (replicate keying) and how those limits relate to optimal accuracy of citizen science transcription when dealing with different levels of replicate transcription. We also wanted to highlight the serendipitous benefit of SWD citizen science data rescue that comes from high levels of keying replication, including the ability to augment training libraries that underpin computer vision transcription of handwritten tabulated numbers. Automated transcription using computer vision techniques commonly relies on a standardized digital library called the Modified National Institute of Standards and Technology database (MNIST), which is used to train AI approaches. However, the MNIST dataset is relatively limited in terms of exemplary forms for handwritten digits compared with available contemporary resources and offerings in old texts.
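One common way to reconcile replicate keying is a simple majority vote across the transcriptions collected for the same observation before its image is retired. The sketch below shows that idea only; this section does not state which reconciliation rule the project used, and the function name and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter


def consensus(replicates, min_agreement=0.5):
    """Return (value, share) for the most frequently keyed value.

    replicates: list of strings keyed by different volunteers for one cell.
    The value is None when no entry exceeds the min_agreement share,
    which flags the cell for review or for further replicate keying.
    """
    cleaned = [entry.strip() for entry in replicates if entry and entry.strip()]
    if not cleaned:
        return None, 0.0
    value, count = Counter(cleaned).most_common(1)[0]
    share = count / len(cleaned)
    return (value if share > min_agreement else None), share
```

For instance, `consensus(["29.90", "29.90", "29.96", "29.90"])` accepts `"29.90"` with a 0.75 share, whereas two disagreeing replicates return no value. Raising the retirement limit supplies more replicates per cell, which makes ties like this less likely at the cost of more volunteer effort.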
Figure 7

Snow during the Week it Snowed Everywhere

Snowfall evidence for the Week it Snowed Everywhere (WISE) during late July 1939 at (left) Pukekohe, Auckland (credit: Huia Mitchell via Auckland Libraries Heritage collections, Footprints 03956), and (right) in the streets of Dunedin, Otago (credit: Evening Star, reproduced by the Otago Daily Times).

A promotional video for WISE homed in on the technological connection between citizen science-driven data rescue and AI-based handwriting transcription (https://vimeo.com/374313908). A primary goal for this promotion was to communicate to the citizen scientists how their assistance could accelerate technology improvements and our scientific goals. In this case, having humans contribute to deep datasets that can train AI for handwriting transcription would result in more rapid realization of the benefits of weather reconstructions on a global scale. We began the WISE video by establishing the value of the ship log observations for understanding past weather events (see quote at 20 seconds in the video, which states “We cannot go back. This is our time machine”). Then, we highlighted the problem that OCR has for transcribing tabulated handwritten digits. We coupled both concepts with the idea that combining scientific knowledge of meteorological data with citizen science and partnering with a global leader in software provision (Microsoft) could help to rapidly overcome a significant problem. Despite OCR technology being used for decades, there are limited video exemplars that demonstrate exactly how it works. To get around this shortfall for communicating to the target audience, our team made a suite of visual animations depicting what OCR software basically does (including a mock visualization of the MNIST training dataset). These connections helped to bring two project elements together: historical handwritten logbooks and OCR technology. 
The understated message is that the 1939 snowfall event provides data that can lead to improved OCR technology, which in turn can help to surmount present limits on rapid acquisition of historical scientific observations. The example from the snowfall event of 1939 also connects the importance of studying past extreme weather events with understanding global change from a relatively isolated location in the antipodes. The WISE video launched at the November 2019 Microsoft Envision Forum NZ held in Auckland. In parallel, there was a promotional media campaign driven by NIWA Communications, with uptake of the story by all major print media, television, and radio outlets in New Zealand (Figure 8). The connection between Microsoft New Zealand and their parent organization also meant shared Twitter reach presenting the link to the promotional video exceeded 400,000 impressions in under 1 month. We also received significant help by “piggy-backing” off of Rainfall Rescue, a UK-based data rescue project running on Zooniverse, which supplied a volunteer corps to SWD directly after their project was completed. This influx of citizen scientists saw our project classifications increase by about 500%, and that level was maintained through completion, dramatically reducing the time for data capture (Figure 8).
Figure 8

Zooniverse classification daily progress

Classification statistics for SWD, highlighting phase II WISE activity. Each classification is an instance where a citizen scientist has undertaken a keyed data transcription for a small section of a log book uploaded to Zooniverse. A noticeable boost in keying and project participation by citizen scientists coincided with media and social media advertising, emails to participants, and new material being uploaded to the site. The largest keying increase was associated with the completion of the UK Rainfall Rescue project, which bolstered international participation in our project.
