Alison Specht, Matthew P Bolton, Bryn Kingsford, Raymond L Specht, Lee Belbin.
Abstract
This paper discusses the process of retrieving and updating legacy data to allow online discovery and delivery. Conserving ecological data over the long term, whether inside or outside institutions, has many pitfalls. Interruptions to custodianship, old media, lost knowledge and the continuous evolution of species names make the resurrection of old data challenging. We caution against technological arrogance and emphasise the importance of international standards. We use a case study of a compiled set of continent-wide vegetation survey data for which the analyses had been published but the raw data had not. In the original study, publications containing plot data collected from the 1880s onwards had been gathered, interpreted, digitised and integrated for the classification of vegetation and the analysis of its conservation status across Australia. These compiled data are an extremely valuable national collection that demanded publication in open, readily accessible online repositories, such as the Terrestrial Ecosystem Research Network (http://www.tern.org.au) and the Atlas of Living Australia (ALA: http://www.ala.org.au), the Australian node of the Global Biodiversity Information Facility (GBIF: http://www.gbif.org). It is hoped that the lessons learnt from this project may trigger a sober review of the value of endangered data, the cost of retrieval and the importance of suitable and timely archiving through the vicissitudes of technological change, so that the initial, unique investment in collection enables multiple re-use in perpetuity.
Keywords: data conservation; data curation; data retrieval; legacy data; long-term data accessibility
Year: 2018 PMID: 30473618 PMCID: PMC6235994 DOI: 10.3897/BDJ.6.e28073
Source DB: PubMed Journal: Biodivers Data J ISSN: 1314-2828
Figure 1. The workflow from collation of original documents (A) through the publication of the ‘Conservation Atlas’ (E) to the retrieval project (G). The first step was to extract and digitise data from written publications (A-B). Due to the computing limitations of the time, it was necessary to split the data into sub-files (B and C) for analysis (D), which was the aim of the original project (‘The Conservation Atlas’, 1975-1995). Storage throughout the Conservation Atlas project was in both hard copy printouts and digital form. The ‘mainframe’ computers referred to were those from the PDP-10 computer family, accessed through the University of Queensland computer centre. The magnetic tapes were used as backup storage from the PDP-10s, and the Exabyte tape was used to store the data from the magnetic tapes at the end of the Conservation Atlas project.
Note: Letters are used to facilitate reference to the figure from the text. The temporal axis is not to scale.
Numbers of sites and species in each vegetation formation in the initial project. These numbers include species that occur in more than one vegetation formation.
* = Not including introduced species or singletons within the formation; ** = Not including tree species >10 m tall
| Closed forests | n/a | 644 | 1,418 |
| Dry scrubs – SE Queensland | 232 | 232 | 475 |
| Dry scrubs – Northern Territory | n/a | 1,219 | 559 |
| Eucalypt open-forests and woodlands (tree species) | 201 | 1,275 | 276 |
| Sclerophyll vegetation SW Western Australia | 64 | 172 | 1,761 |
| Sclerophyll vegetation Central and Eastern Australia | 188 | 549 | 2,581** |
| Sclerophyll vegetation – heathland and tall shrubland | 136 | 312 | 2,071** |
| Alpine vegetation | 73 | 61 | 556 |
| Savannah understorey | 56 | 198 | 1,313 |
| Mallee open-scrub | 28 | 41 | 395 |
| Desert | 54 | 148 | 1,229 |
| Chenopod shrubland | 30 | 68 | 410 |
| Forested wetlands (including brigalow) | 31 | 36 | 193 |
| Arid wetlands | 20 | 42 | 642 |
| Freshwater swamp vegetation | 80 | 80 | 139 |
| Coastal dune vegetation | 45 | 56 | 315 |
| Coastal wetland vegetation (mangroves and saltmarshes) | n/a | 15 | 74 |
Figure 2. Illustration of the data resources available to the retrieval project: (i) a sample of the boxes of original copies of papers and reports (A), (ii) a table extracted from a publication prepared for data entry (B), (iii) a sample of the hard copy printouts showing alphanumeric lists of species under each location and community (C), (iv) the magnetic tapes on which backups were kept from day to day during the 1980s project (D), and (v) an Exabyte tape on to which the data from the magnetic tapes were transferred in 1991 (E).
Figure 3. Diagrammatic representation of the workflow for retrieval of data from the original reference files (A). Following the 1980s organisation of the data, these files were separated into two parts for editing: (i) information on the sites at which data were collected (B), and (ii) the species lists, which were updated through the Biodiversity Information Explorer, BIE (http://bie.ala.org.au/ws) (C). Once these components were updated, they were re-assembled using Darwin Core standards (D) to enable delivery through a data portal (in this case the Knowledge Network for Biocomplexity, KNB: https://knb.ecoinformatics.org). Ecological Metadata Language (EML) was used to describe the dataset.
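The re-assembly step (D) can be pictured as flattening each site and its updated species list into one row per occurrence. The following is a minimal Python sketch under stated assumptions, not the project's actual code: the helpers `to_dwc_rows` and `write_dwc_csv`, the selection of Darwin Core terms and the sample site values are all illustrative.

```python
import csv
import io

# Darwin Core terms chosen for this illustration (an assumption, not the
# project's documented mapping).
DWC_FIELDS = ["eventID", "locality", "decimalLatitude",
              "decimalLongitude", "scientificName", "occurrenceStatus"]


def to_dwc_rows(site_id, community, locality, lat, lon, names):
    """Yield one occurrence row per species name recorded at a site."""
    for name in names:
        yield {
            "eventID": f"{site_id}-{community}",
            "locality": locality,
            "decimalLatitude": round(lat, 4),
            "decimalLongitude": round(lon, 4),
            "scientificName": name,
            "occurrenceStatus": "present",
        }


def write_dwc_csv(rows):
    """Serialise rows to CSV text with the Darwin Core terms as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical site and species names, for illustration only.
rows = list(to_dwc_rows("N032", "01", "Central Coast: Sydney",
                        -33.85, 151.2167,
                        ["Phragmites australis", "Eucalyptus robusta"]))
csv_text = write_dwc_csv(rows)
```

Keeping one row per species per community means the site information is repeated on every row, which is redundant but matches the flat occurrence tables that biodiversity portals commonly ingest.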
An example of the core data available from printouts and (mostly) retrieved from Exabyte tapes according to formation and State. These examples are from the forested wetlands and desert acacia formations in New South Wales (N) and the Northern Territory (P).
| 800000 | N |
| 503200 | LOCATION N032 = CENTRAL COAST: SYDNEY (PIDGEON 1940) |
| 903200 | 33 51 151 13 |
| 503201 | COMMUNITY 01 = FRESHWATER RIVER (COMBINED LIST) |
| 003201 | UTRIAUST UTRIEXOL UTRIBILO VALLGIGA POTAOCHR POTAPERF POTATRIC BRASSCHR # |
| 003201 | NAJAMARI MYRIPROP PHRAAUST ELEOCHAR* TYPHORIE TYPHDOMI TRIGPROC TRIGSTRI # |
| 003201 | JUNCPAUC JUNCPALL JUNCPLAN AGROAVEN GAHNIA__* CASUCUNN MELALINA MELASTYP # |
| 003201 | CALLSALI EUCAROBU EUCAAMPL CAREX___* ISOLPROL VILLRENI ALISPLAN RANURIVU # |
| 003201 | GRATPUBE GOODPANI HYDRPEDU CENTASIA VIOLHEDE PRUNVULG STELFLAC SCHOAPOG # |
| 003201 | OPLIIMBE BLECINDI ADIAAETH PHILLANU # |
| 503202 | COMMUNITY 02 = FRESHWATER SWAMPS ON WIND BLOWN SAND (PORT STEPHENS) |
| 003202 | BAUMTERE BAUMARTI TRIGPROC TRIGSTRI PHILLANU LEPIARTI MELAQUIN EUCAROBU # |
| 003202 | ISOLINUN GRATPEDU DROSSPAT VILLRENI BAUMJUNC SCHOBREV RESTAUST LEPTTENA # |
| 003202 | RESTTETR SPREINCA BOROPARV EPACOBTU GONOMICR BLECINDI HYDRTRIP SPHAGNUM* # |
| 003202 | VIOLHEDE # |
| 500000 | ------------------------------- |
| 800000 | P |
| 503700 | LOCATION P037 = TANAMI DESERT: LAKE SURPRISE, N.T. (MACONOCHIE 1973) |
| 903700 | 20 15 131 45 |
| 503701 | COMMUNITY 01 = TUSSOCK GRASS-SEDGE-LAND + TREES |
| 303701 | EUCAPAPU ACACVICT # |
| 003701 | ABUTOTOC ACACADSU ACACJENS ACACMELL ACACSTIP ACACTENU ALTEANGU ARISBROW # |
| 003701 | ARISINAE BERGTRIM BONALINE BRACHOLO BRUNAUS2 BULBBARB CANTATTE CASSCOST # |
| 003701 | CASSHELM CASSOLIG CASSFILI CLEOVISC CLERFLOR COMESYLV CROTCUNN CROTEREM # |
| 003701 | CYPEBULB CYPECUNN CYPEHOLO CYPEIRIA DAMPCAND DESMMUEL DICRLEWE DODOPETI # |
| 003701 | ECTRSCHU ELYTSPIC ERAGLANF ERIAARIS ERIABENT EUCAASPE EUCAPRUI EUCASETO # |
| 003701 | EUCATERM EULAFULV EUPHDRUM EUPHWHEE GOODAZUR GOODENIA*GOMPCONI GREVJUNC # |
| 003701 | GREVWICK HALGSOLA HELIAMBI HIBILEPT HIBISTURC HIBISTURP INDIBREV IPOMMUEL # |
| 003701 | ISOTATRO LOMALEUC MARSEXAR MELAGLOM MELALASI MELANERV MELHOBLO MELOMADE # |
| 003701 | MERRDAVE MIRBVIMI MORGFLOR NEPTDIMO PANIAUST PARAMUEL PHYLCARP PHYLHUNT # |
| 003701 | PHYLRHYT PIMEAMMO PLECPUNG PLUCTETR PLUCTETRT POLYSYNA POLYGALA *PORTFILI # |
| 003701 | PORTOLER PSORMART PTILARTH PTILASTR PTILCALO RULILOXO SANTLANC SCAEPARV # |
| 003701 | SCIRLAEV SIDAPLAT STACMEGA SWAIBUR3 SYNATILL TINOSMIL TRIAPILO TRIOPUNG # |
| 003701 | TRIUGLAU WALTINDI ZORNALBI # |
| 500000 | ------------------------------- |
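The layout of these records can be read programmatically. The sketch below is a minimal Python parser written from the example alone, so its reading of the six-digit code is an assumption rather than documented fact: the first digit is taken as the record type (8 = State, 5 = location or community label, 9 = coordinates, 0 and 3 = species lists), digits 2-4 as the location number and digits 5-6 as the community number. Coordinates are assumed to be degrees and minutes of latitude south and longitude east. The names `Location`, `Community` and `parse_records` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Community:
    label: str
    species: List[str] = field(default_factory=list)


@dataclass
class Location:
    state: Optional[str]
    site_id: str
    label: str
    lat: Optional[float] = None
    lon: Optional[float] = None
    communities: Dict[str, Community] = field(default_factory=dict)


def parse_records(lines):
    """Parse pipe-delimited (code, text) rows into a list of Location objects."""
    locations, state, current = [], None, None
    for line in lines:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2 or not re.fullmatch(r"\d{6}", cells[0]):
            continue  # skip anything that is not a (code, text) row
        code, text = cells
        rec_type, loc_no, com_no = code[0], code[1:4], code[4:6]
        if rec_type == "8":
            state = text
        elif rec_type == "5" and loc_no == "000":
            continue  # dashed separator row
        elif rec_type == "5" and com_no == "00":
            head, _, label = text.partition("=")
            current = Location(state, head.split()[-1], label.strip())
            locations.append(current)
        elif current is None:
            continue  # data row before any location header
        elif rec_type == "5":
            current.communities[com_no] = Community(text.partition("=")[2].strip())
        elif rec_type == "9":
            lat_d, lat_m, lon_d, lon_m = map(int, text.split())
            current.lat = -(lat_d + lat_m / 60)  # assumed southern hemisphere
            current.lon = lon_d + lon_m / 60     # assumed east of Greenwich
        elif rec_type in ("0", "3"):
            community = current.communities.setdefault(com_no, Community(""))
            community.species.extend(text.rstrip("#").split())
    return locations


SAMPLE = """\
| 800000 | N |
| 503200 | LOCATION N032 = CENTRAL COAST: SYDNEY (PIDGEON 1940) |
| 903200 | 33 51 151 13 |
| 503201 | COMMUNITY 01 = FRESHWATER RIVER (COMBINED LIST) |
| 003201 | OPLIIMBE BLECINDI ADIAAETH PHILLANU # |
| 500000 | ------------------------------- |
""".splitlines()

sites = parse_records(SAMPLE)
```

Splitting on whitespace is a simplification. Some rows in the printout contain codes that have run together or exceed eight characters (for example `GOODENIA*GOMPCONI` and `HIBISTURC`), and these would need manual checking rather than automatic splitting.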
Example of records from the publications spreadsheet. ID = our imposed identification number (roughly alphabetical).
| ID | Author(s) | Year | Title | Source | Volume | Pages |
| 1 | Abbott, J. | 1977 | Species richness, turnover and equilibrium in insular floras near Perth, Western Australia. | Aust. J. Bot. | 25 | 193-208 |
| 8 | Adams, L. D. & Craven, L. A. | 1976 | Checklist of vascular plants in a study area of the South Coast of N.S.W. | C.S.I.R.O. Land Use Res. Tech. Mem. | 76/16 | |
| 387 | McMahon, A.R.G., Carr, G.W., Todd, J.A. & Race, G.J. | 1990 | The Conservation Status of Major Plant Communities in Australia: Victoria. | Ecological Horticulture Pty Ltd, Clifton Hill, Vic. | | |
| 474 | Pye, K. | 1982 | Morphology and sediments of the Ramsay Bay sand dunes, Hinchinbrook Island, North Queensland. | Proc. R. Soc. Qld | 93 | 31-47 |
| 560 | Tate, R. | 1880 | On the geological and botanical features of southern Yorke Peninsula, South Australia. | Trans. R. Soc. S. Aust. | 13 | 112-120 |
| 705 | Willis, J.H. | 1967 | Systematic arrangement of vascular plants noted on the slopes and summit of the peak: The Rocks Nature Reserve, New South Wales. | Nat. Pks & Wildl. Serv., N.S.W. | 705 |
An example of the species conversion file for the sclerophyll formation and of alphacodes. This example does not illustrate the size of the files.
| 2 | L G | ABELMOSC |
| 14 | LZG | ACACACAN |
| 19 | LMG | ACACARGY |
| 20 | SZG | ACACARMA |
| 21 | MLG | ACACASHA |
| 174 | S G | ACACKEMP |
| 466 | S G | BORRCARP/ |
| 704 | S G | CARPAEQU |
| 705 | L G | CARPMODE |
| 3019 | SIG | RUMEACET |
| 3020 | SIG | RUMEANGI |
| 3647 | S G | ZYGOFRUT |
| 3650 | L G | ZYGOIODO |
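The alphacodes appear to be built from the first four letters of the genus followed by the first four letters of the species epithet. In the printouts, genus-only entries are padded with underscores and flagged with an asterisk (for example `CAREX___*`). The Python sketch below encodes that reading. The rule is inferred from the examples rather than taken from a specification, the function name `alphacode` is hypothetical, and numeric disambiguation suffixes such as `BRUNAUS2` and names shorter than four letters are not handled.

```python
def alphacode(name: str) -> str:
    """Build an alphacode from a genus name or a binomial (inferred rule)."""
    parts = name.split()
    if not parts:
        raise ValueError("empty name")
    if len(parts) == 1:
        # Genus only: truncate or pad to 8 characters, then flag with '*'.
        return parts[0][:8].upper().ljust(8, "_") + "*"
    genus, epithet = parts[0], parts[1]
    # Binomial: four letters of the genus plus four letters of the epithet.
    return (genus[:4] + epithet[:4]).upper()
```

For instance, `alphacode("Phragmites australis")` returns `PHRAAUST` and `alphacode("Gahnia")` returns `GAHNIA__*`, both of which match codes in the printout. Whether those codes stood for exactly these names in the original files is not confirmed here.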
Species name match categories.
| Code | Meaning | Action |
| MATCH | Near-exact match or better | accept |
| PARTIAL-L and PARTIAL-R | A significant substring match | manual check |
| FUZZY | Fuzzy matching algorithm built on the score from the web service using a 'letter-pair similarity' score | manual check |
| WEAK | A weak match falling below thresholds; the best match is retained | manual check |
| TAXM | No match or major problem with original or subsequent species name | refer to expert |
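To make the categories concrete, the sketch below shows one way a legacy name and a candidate returned by a name service could be sorted into them. It is an illustration under several assumptions, not the project's implementation: 'letter-pair similarity' is taken to be a Dice coefficient over adjacent letter pairs, PARTIAL-L and PARTIAL-R are read as matches at the left and right ends of the name, and the thresholds `match_min` and `fuzzy_min` are hypothetical values.

```python
def letter_pairs(text):
    """Return the adjacent letter pairs of each word in the text."""
    pairs = []
    for word in text.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def pair_similarity(a, b):
    """Dice coefficient over letter pairs: 2 * shared / (pairs_a + pairs_b)."""
    pairs_a, remaining = letter_pairs(a), letter_pairs(b)
    total = len(pairs_a) + len(remaining)
    if total == 0:
        return 0.0
    shared = 0
    for pair in pairs_a:
        if pair in remaining:
            remaining.remove(pair)  # each pair in b can be matched only once
            shared += 1
    return 2.0 * shared / total


def classify(original, candidate, match_min=0.95, fuzzy_min=0.7):
    """Assign a match category; the thresholds are illustrative assumptions."""
    if not candidate:
        return "TAXM"  # nothing returned: refer to expert
    a, b = original.strip().lower(), candidate.strip().lower()
    score = pair_similarity(a, b)
    if score >= match_min:
        return "MATCH"
    if a.startswith(b) or b.startswith(a):
        return "PARTIAL-L"
    if a.endswith(b) or b.endswith(a):
        return "PARTIAL-R"
    if score >= fuzzy_min:
        return "FUZZY"
    return "WEAK"  # best available candidate kept for manual checking
```

Under these assumptions only MATCH would be accepted automatically; every other category is routed to a person, as in the table.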