
Designing and implementing a data model for describing environmental monitoring and research sites.

Christoph Wohner¹,², Johannes Peterseil¹, Hermann Klug².

Abstract

With new technological advancements and increasing demands of open data in environmental sciences, the requirements for data are increasing in a variety of ways. Having machine and human readable documentation about environmental research and monitoring sites available online is one of them. The Dynamic Ecological Information Management System - Site and Dataset Registry (DEIMS-SDR, https://www.deims.org/) is a research and monitoring site registry that allows the description of in-situ observation or experimental sites, generating persistent, unique and resolvable identifiers for each site. The aim of DEIMS-SDR is to collect site information in a catalogue describing a wide range of sites across the globe, providing information including each site's location, ecosystems, facilities, measured parameters and research themes and enabling that standardised information to be openly available. This article describes the outcomes of the revision of its data model, the conceptual considerations behind it and how it is implemented. These conceptual considerations also encompass the definition of what we call the "onion model of site data interoperability" - a fundamental concept for the design of site data models against the backdrop of data interoperability. Furthermore, we illustrate the capabilities of the revised data model and provide an overview of common data formats for the description of sites, current initiatives driving the harmonisation of descriptions and the outlook of future developments.


Keywords: DEIMS-SDR; Interoperability; Site data model; System architecture

Year: 2022 PMID: 36105745 PMCID: PMC9461184 DOI: 10.1016/j.ecoinf.2022.101708

Source DB: PubMed Journal: Ecol Inform ISSN: 1574-9541 Impact factor: 4.498

Introduction

Accurate data collection is essential to maintaining the integrity of scientific research (Sapsford and Jupp, 2006). Many research contexts use in-situ data collection systems that generate large volumes of data that must be documented, managed, and shared. To ensure accuracy in data collection and management, data collection systems have to be developed and maintained. For this, data models are needed to support data and computer systems by providing definitions and formats of data. Even though the data modelling phase represents only a relatively small share of the total development effort of data systems, its impact on the final result is probably greater than that of any other phase (Simsion and Witt, 2009). Hence, if a flawed data model is implemented, the resulting system and its interfaces will also be flawed. Versatile and thought-out data models are therefore the key for efficient data collection systems (West, 2011) and thus a precondition for the collection and storage of accurate data. With a general trend towards more data sharing in ecology (Farley et al., 2018; Hampton et al., 2013; Michener, 2015), the purpose of such systems extends not only to the collection but also the publication and sharing of data. Increasing requirements for the collection and sharing of data are defined on the national and international scale by projects and funding organisations (Guidelines on FAIR Data Management in Horizon 2020, 2016) or global initiatives like the Group on Earth Observations (GEO; GEOSS Data Sharing Principles, 2015). One of these requirements is the provision of information about the context of measurements, i.e. the observation or research facilities where measurements were done. For any place-based observation, information on such facilities or sites is an intrinsic and important part of the description, discovery and reuse of data and expertise. 
This information is also needed when managing research site networks (Mirtl et al., 2018; Wohner et al., 2020). Furthermore, in-situ research and monitoring sites are an important source for ground-truthing remote sensing data (Pause et al., 2016). The need for a persistent database into which information about in situ observatories and other infrastructure can be compiled has been identified while underlining that it is often unclear who is measuring what and where (Shahgedanova et al., 2021). It is therefore necessary for information about research and monitoring sites to be readily available online in both machine and human readable form. The information is then searchable and sharable thus allowing initiatives such as GEO or the European Union's Earth Observation Programme Copernicus to find, use and capitalise on existing resources. A system for the documentation of such sites is DEIMS-SDR (Dynamic Ecological Information Management System - Site and dataset registry; Wohner et al., 2019; DEIMS-SDR, 2021). DEIMS-SDR is a web-based information portal for integrated ecological information that comprises detailed descriptions of sites where research is carried out, including the technical infrastructure, ecosystem properties and research activities and foci. The available information is exposed through various formats and services. While DEIMS-SDR allows documentation of sites, datasets, sensors and research activities, this article focuses on its site data model, placing special emphasis on the conceptual considerations behind it. In this context, it needs to be underlined that the usage of terms such as “facility” or “site” is not standardised in the ecological domain and other terms, such as “plot”, “station” or “platform”, are also commonly used. In addition, these terms often use differing definitions (Wohner et al., 2020). 
For this article, we will exclusively use the term “site”, which we define as “an in-situ observation or experimentation facility, delimited in space, but varying in size and complexity of the internal organisational and observational design, for the collection of data covering e.g. biogeophysical, biotic or socio-ecological characteristics” (Wohner et al., 2019). Systems for the documentation of sites are referred to as site catalogues. DEIMS-SDR is used as the official site catalogue of the International Long-Term Ecological Research (ILTER) network (ILTER, 2021; Mirtl et al., 2018), its regional network (LTER-Europe, 2022) and other member networks (Kim et al., 2018). These networks consist of sites focusing on long-term, in-situ research covering a wide range of ecosystems and biogeographical regions. In addition, DEIMS-SDR has been and is used in a number of European Union (EU) projects, such as eLTER (2019), e-shape (2021), ENVRI-PLUS (2019) and ENVRI-FAIR (2021). While DEIMS-SDR has been designed with a focus on the needs of LTER and is primarily used by its communities, its data is openly shared. Following the recommendations of OpenAIRE (OpenAIRE, 2021), data published by DEIMS-SDR is using the CC BY-NC 4.0 International license (CC BY-NC 4.0 International, 2021) to enable a free use of the information while at the same time ensuring appropriate credit for the system and its data providers. Moreover, DEIMS-SDR is open to the research community to provide information about additional long-term monitoring research sites not being formally part of an LTER network. Any ecological or environmental in-situ monitoring or research site regardless of its network affiliation can be registered and documented. Initially, DEIMS-SDR was an extended fork of DEIMS, a dataset management system developed by the US LTER network (Gries et al., 2010; Wohner et al., 2019). 
DEIMS was designed to aid the data management of the network and its member sites by providing an installation profile for the open-source Content Management System Drupal 6 and later on Drupal 7, setting up web-services that allowed the management of datasets while ensuring interoperability with data harvesting systems of the network. DEIMS-SDR added to this functionality by allowing the description of sites, sensors and general research activities. However, with the advent of Drupal 8, and the subsequent need to migrate all data and rebuild large parts of the code base (Drupal Documentation, 2021), DEIMS-SDR was detached from the initial DEIMS code base and has since become an independent software stack. Data on DEIMS-SDR is entered by filling in forms for the different entity types, most notably the sites. The users providing such information are usually the principal investigators of sites or managers of research networks. In addition to the manually entered data, some data is calculated automatically as described in section 4.1. In the various LTER networks, site managers are periodically asked to update their site records, which is further encouraged and supported in EU projects such as the aforementioned eLTER PLUS or e-shape in order to keep the data accurate and current. As of May 2022, DEIMS-SDR stores over 1200 site records (Fig. 1). While the majority of these sites are located in Europe, DEIMS-SDR ultimately aims for comprehensive coverage across the globe. This openly available site data has been used for a variety of studies (Mollenhauer et al., 2018; Wohner et al., 2021; Zilioli et al., 2019), and as the source of input data for the processing of large-scale datasets (Rennie et al., 2021). However, some of these analyses revealed shortcomings of the web service in general and the site data model in particular. 
Therefore, a revision of the data model was necessary to meet current and upcoming requirements for future analyses and to ensure increased accessibility and continued use of DEIMS-SDR data. This article extends the work done in Wohner et al. (2019), where an initial overview of the system, its architecture and capabilities was given, and incorporates findings of Wohner et al. (2020), which took a more holistic look at existing systems, their data models and how interoperability between such systems could be increased. It outlines the revised version of the data model for the description of environmental monitoring and research sites. In doing so, it corrects flaws in previous descriptions of the system, while putting an emphasis on the documentation of sites and the data model used for it, which over the course of the past few years has turned out to be DEIMS-SDR's most influential functionality. Furthermore, it illustrates the developments and components that have been added since Wohner et al. (2019), such as a REST-API and a quality assurance tool, and introduces a concept for the design of interoperable site data models.

Fig. 1

Sites registered on DEIMS-SDR as of May 2022.

General considerations about site data models

A data model needs to fulfil a set of requirements that vary from case to case. Generally, the quality of a data model can be evaluated by looking at multiple factors, such as correctness, completeness, integration, simplicity, understandability, flexibility, and implementability (Moody and Shanks, 1994). Hence, before creating or, in this case, revising a data model, there are a number of questions to be considered. Are there existing models that already fit the given purpose? Are there (meta)data formats that should be adhered to? Are there any working groups that are currently devising recommendations or standards? Which of these existing concepts should be incorporated? The following sections provide a brief overview of the problems we had with previous iterations of the data model, existing formats and initiatives relevant for site data models. Conceptual challenges when documenting sites will also be described.

Previous shortcomings of the data model

During data processing and analysis work carried out for projects like eLTER PLUS (2021) or studies such as Mollenhauer et al. (2018) and Wohner et al. (2021), it became clear that there were inaccuracies in the collected data, which reduced the validity of derived results. After collecting user feedback through questionnaires as well as live sessions during project meetings, we realised that the fields in the form used to describe sites were sometimes not sufficiently defined, resulting in fuzzy data input that was not easily comparable between site records. One example of this was the field “site owner”. It was not clear to users what exactly this meant. Was it the organisation owning the equipment, financing the operation or actually owning the land the site is located on? It was therefore necessary to revise the label and definition of the field, turning it into “operating organisation”, as well as introducing an additional field called “funding agency”. Other cases included redundancies and overlaps between fields like the “research topics” and “observed parameters”. This made it difficult to work with the collected data and caused dissatisfaction among users both when entering data and when using the collected data themselves. This was especially the case in Wohner et al. (2021), where the provided geographic boundaries of sites were used to carry out a study on the representativeness of the ILTER site network. Even though DEIMS-SDR provided a clear definition for the field describing the site boundaries in this case, the collected data differed vastly in the types of boundaries that were provided. Sometimes the provided boundaries would describe a sampling area, while in other cases they would describe the hydrological catchment area. As a result, the extent of particular sites would exceed 10,000 km², as opposed to other sites with an extent of merely 1 ha. 
This led to a strong bias towards the sites with such extensive boundaries and made substantial processing work necessary for the site boundaries to be comparable. After collecting user feedback from the managers of the affected sites, we realised that the given way to describe boundaries was insufficient to describe the sometimes more complex geographic setup. This is further described in 2.5. The sum of these various shortcomings had to be taken into consideration when revising the data model.
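The scale of this bias can be made concrete with a small calculation. The sketch below is illustrative only: the two site names are hypothetical, and only the two orders of magnitude (10,000 km² and 1 ha) are taken from the text above.

```python
# Illustrative only: two hypothetical sites with the extents mentioned in the text.
HECTARES_PER_KM2 = 100  # 1 km² = 100 ha

extents_ha = {
    "catchment_site": 10_000 * HECTARES_PER_KM2,  # 10,000 km² expressed in hectares
    "plot_site": 1,                               # 1 ha
}

# Ratio between the largest and the smallest extent.
ratio = extents_ha["catchment_site"] / extents_ha["plot_site"]

# Share of the total area contributed by each site if extents are simply summed.
total_ha = sum(extents_ha.values())
area_share = {name: area / total_ha for name, area in extents_ha.items()}

print(f"ratio: {ratio:,.0f}")
for name, share in area_share.items():
    print(f"{name}: {share:.6%} of the summed area")
```

In any area-weighted aggregation, the site described by its catchment would dominate the result almost entirely, which is why boundaries of different types need to be distinguished before they are compared.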

Existing data standards

A major requirement for the sharing of collected data is the compliance with existing standards and formats. Each of these standards and formats has certain requirements that have to be met when aiming for compliance. As detailed in Wohner et al. (2020), there are a number of data formats and standards that can be used for the description of sites. For the revision of the DEIMS-SDR site data model, we decided to be compatible with the most common standards, namely Dublin Core, ISO 19115/19139 and Observations and Measurements (O&M) in the form of INSPIRE Environmental Monitoring Facilities (INSPIRE EF). Dublin Core is a set of fifteen elements for describing resources, formally standardised as ISO 15836 (ISO, 2017). It was chosen due to the simplicity of creation and maintenance of its metadata as well as its generally wide usage. ISO 19115 (ISO, 2014) and its parts define how to describe geographical information and associated services, including contents, spatial-temporal extent, data quality, access and rights to use, with ISO 19139 being its XML encoding standard (ISO, 2019). It was chosen as it is the primary metadata format of the Catalogue Service for the Web (CSW), which is supported by DEIMS-SDR (see section 5). O&M is an Open Geospatial Consortium (OGC) standard defining a conceptual schema encoding for observations and for features involved in sampling when making observations (Observations and Measurements, 2013). It is one of the core standards in the OGC Sensor Web Enablement suite, as the OGC Sensor Observation Service (SOS) Interface Standard 2.0 relies on O&M (OGC Sensor Observation Service, 2012). The INSPIRE EF data specification re-uses the O&M specification for the actual observations and is a generic model that has been developed for environmental monitoring of any kind and across any domain (INSPIRE EF Data Specification, 2013). 
It includes two aspects, (1) the environmental monitoring facility as a spatial object in the context of INSPIRE (location) and (2) data obtained through observations and measurements linked to the environmental monitoring facility (operation). Due to its prominent role in the INSPIRE Directive (European Parliament and of the Council, 2007) and in the scope of EU projects, it was thus chosen. Another relevant standard that the newly revised data model is only partially compatible with is the WMO WIGOS metadata standard. The Integrated Global Observing System (WIGOS) Metadata Standard of the World Meteorological Organization (WMO) is used in the framework for all WMO observing systems. Similar to INSPIRE EF, it also consists of two parts, namely (1) discovery and interpretation/description metadata and (2) observational metadata (World Meteorological Organization, 2017). As DEIMS-SDR is not affiliated with the WMO, we abstained from complete integration with the WIGOS standard and instead focused on particular aspects such as adopting relevant WIGOS codelists (see section 3).

Data interoperability

Interoperability of data in general and site documentation in particular is a requirement in the scope of EU projects, such as ENVRI-FAIR, but also in non-EU contexts (see section 2.6) that is gaining importance. Regardless of the specific requirements of a site catalogue, there should be a general level of interoperability with other catalogues in terms of their data models to allow for at least basic aggregation of site information stemming from different sources. Based on the work done in Wohner et al. (2020) and following the general principle of clean architecture and its so-called onion models (Martin, 2018), we suggest that data models of site catalogues should consist of three types of layers: (1) a set of core fields (Name, Identifier, Textual Description, Contact, Observed Properties) standardised across domains, (2) a set of fields standardised within each domain, e.g. ecosystem research infrastructures, and (3) catalogue specific fields that are not and do not have to be standardised (see Fig. 2). We therefore would like to coin the term “onion model of site data interoperability” which intends to symbolise this multi-layered approach.

Fig. 2

Onion model of site data interoperability.

A site belonging to multiple domains, e.g. ecosystem, polar, carbon flux or seismological monitoring/research, could therefore have multiple domain specific layers that would be stacked on top of each other. Using an onion model concept for site data models would ensure a minimum level of interoperability between site catalogues through the core fields and increased interoperability between catalogues in the same domain. This is exemplified in Fig. 2 where the catalogues A, B and C would always conform to the Core Fields and therefore have basic data interoperability. Catalogues A and B describing sites in the same domain, e.g. the ecosystem domain, would have increased interoperability. Catalogue (or network) specific fields would not have to be interoperable, allowing for greater flexibility as opposed to the other two layers. An example for a practical implementation is given in section 2.6. All of these different layers would be connected via the core fields including a persistent, resolvable and unique identifier. Such an identifier should ultimately be acknowledged across domains and networks, e.g. a DOI. However, currently there is no such identification service for the persistent identification of sites that is used by the broad community as, for instance, ORCIDs are. Until such a widely recognised, domain-independent service is in place, DEIMS-SDR issues the so-called DEIMS.iD, which serves this purpose for sites registered on DEIMS-SDR. At present, the DEIMS.iD is only recognised by particular other communities and their systems within the ecosystem domain, such as the US National Ecological Observatory Network (NEON; NEON Data API Documentation, 2021) or the National Emission reduction Commitments Directive of the European Union.
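The layering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DEIMS-SDR implementation: the core field names follow the list given in the text, whereas the domain fields, catalogue names and catalogue specific fields are hypothetical placeholders.

```python
from dataclasses import dataclass

# Layer 1: core fields as listed in the text, standardised across domains.
CORE_FIELDS = frozenset(
    {"name", "identifier", "textual_description", "contact", "observed_properties"}
)

# Layer 2: fields standardised within a domain (hypothetical examples).
DOMAIN_FIELDS = {
    "ecosystem": frozenset({"ecosystem_type", "biome"}),
    "polar": frozenset({"ice_cover"}),
}


@dataclass
class Catalogue:
    """A site catalogue described by the layers of the onion model."""
    name: str
    domains: frozenset
    specific_fields: frozenset = frozenset()  # Layer 3: not standardised


def interoperable_fields(a: Catalogue, b: Catalogue) -> set:
    """Fields two catalogues can exchange: the core plus every shared domain layer."""
    shared = set(CORE_FIELDS)
    for domain in a.domains & b.domains:
        shared |= DOMAIN_FIELDS[domain]
    return shared


# Catalogues A and B share the ecosystem domain, catalogue C does not.
cat_a = Catalogue("A", frozenset({"ecosystem"}), frozenset({"a_internal_code"}))
cat_b = Catalogue("B", frozenset({"ecosystem"}))
cat_c = Catalogue("C", frozenset({"polar"}))

print(sorted(interoperable_fields(cat_a, cat_b)))  # core and ecosystem fields
print(sorted(interoperable_fields(cat_a, cat_c)))  # core fields only
```

Catalogue specific fields never appear in the result, which mirrors the idea that the outermost layer does not have to be interoperable.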

Stakeholder requirements

In addition to the general requirements given by data standards and the overall need for interoperability, the quality of a data model is also defined by the satisfaction of involved stakeholders (Moody and Shanks, 1994). In the case of DEIMS-SDR, the aforementioned research networks LTER-Europe and ILTER define these stakeholder requirements. Generally, these requirements can be divided into two categories: 1) requirements defined by regular users like scientists and 2) requirements defined by the network management. While the requirements defined by data standards and interoperability needs are comparatively stable, the requirements defined by research networks might quickly change over the runtime of a project or due to political decisions. This poses additional challenges for a data model and its implementation, because an implemented model cannot be fundamentally overhauled on a constant basis (see section 6). Further stakeholder requirements are defined in projects linked to LTER such as e-shape (2021), which aims to improve user uptake of Earth Observation (EO) data. It develops operational EO services and demonstrates the benefits of the EO pilots amongst defined key user communities. This entails, inter alia, requirements regarding the machine-readable provision of site information and data through APIs to be intersected with remote-sensing data. Another EU project that defines such requirements is eLTER PLUS (2021), which is closely tied to LTER-Europe and addresses long-term monitoring and analysis with regard to biodiversity loss, biogeochemical controls of ecosystem functions, the climate-water-food nexus and socio-ecological systems. Within the eLTER PLUS project, the needs of the LTER-Europe community were collected using targeted semi-structured interviews and a user-centric design process. Fictionalised biographies based on user types identified in the interviews, called personas, were developed. 
These personas were presented to the community in a series of online workshops, whose participants in turn described what each persona might want from a system. The material created by the participants was used to generate use cases that describe the required functionality, which is set to be evaluated and, if necessary, revised. These defined requirements predominantly concerned the user interface and the documentation provided to the users. In particular, they addressed the definitions, help texts and lists of values of fields, how the site form looks and reacts to user input, and how site information should be presented to the public once it has been entered in the system (see section 4.4). Another example of an eLTER PLUS requirement is that well-arranged reports for each LTER network should be generated automatically. These reports should be based on site information available on DEIMS-SDR and provide an overview of each site in the network, its location, covered biomes and observed properties. Requirements defined in e-shape involve the implementation of standardised services for the provision of geodata and metadata to external services and users who incorporate site data in their systems and analysis workflows. An example of this is the creation of “Remote Sensing Analysis Areas” for selected sites. These polygon geometries are used for cookie-cutting pan-European remote sensing datasets and providing the processed data to project partners. These geometries need to be available through Web Feature Services (WFS) providing machine-readable geodata such as GeoJSON. This enables their use in automated data processing workflows. These WFS and other related services are further described in section 5. Repeated feedback processes in these projects result in requirements changing over the course of a project. 
To properly address such changing stakeholder requirements, we therefore broadly distinguish between fields that are of general interest and fields that are relevant in the scope of a project or network. Such “project-related” fields are implemented using labels indicating that they are intended for usage in a particular project. They are generally not included in any of the publicly available APIs (see section 4.3) but in dedicated exports available for project partners instead. At the end of a project, these fields may be archived and removed from the system. Should a project-related field prove to be valuable beyond the scope of a project, it might be moved to the general site data model. This approach is in line with the aforementioned onion model of site data interoperability. It also allows both having a stable data model and API definitions as well as a more flexible part of the model which is suitable for temporary projects and their needs.
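The separation between general and project-related fields can be illustrated with a short sketch. It assumes a simple in-memory field registry; the class `FieldDef`, the helper functions and the field and project names are hypothetical and do not reflect how DEIMS-SDR stores its field definitions.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FieldDef:
    """A field of the site model; `project` is None for general fields."""
    name: str
    project: Optional[str] = None


# Hypothetical registry: two general fields and one project-related field.
FIELDS = [
    FieldDef("name"),
    FieldDef("observed_properties"),
    FieldDef("reporting_status", project="project-x"),
]


def public_view(record: dict, fields: list) -> dict:
    """Keep only general fields, as a public API would expose them."""
    general = {f.name for f in fields if f.project is None}
    return {key: value for key, value in record.items() if key in general}


def project_export(record: dict, fields: list, project: str) -> dict:
    """Keep general fields plus the fields labelled for one project."""
    allowed = {f.name for f in fields if f.project in (None, project)}
    return {key: value for key, value in record.items() if key in allowed}


def promote(fields: list, name: str) -> list:
    """Move a project-related field into the general model."""
    return [replace(f, project=None) if f.name == name else f for f in fields]


record = {
    "name": "Example site",
    "observed_properties": ["pH"],
    "reporting_status": "submitted",
}
print(public_view(record, FIELDS))
print(project_export(record, FIELDS, "project-x"))
```

Because the public view is derived from the registry rather than hard-coded, archiving a project field or promoting it to the general model changes only the registry, while the public API definition stays stable.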

Conceptual difficulties

Most established catalogues for the documentation of ecosystem research and monitoring sites feature relatively simple means of describing the geographic extent of their sites. The collected information often consists of point coordinates and/or a bounding polygon. Similar to the varying usage of alternative terms for “site” outlined in the introduction, research networks and their respective catalogues also tend to define the spatial extent of sites differently from each other (Wohner et al., 2020). However, the geographic setup of sites is usually more complex than simply describing one set of coordinates or a polygon. Instead, the geographic composition and observation design of a site can consist of different horizontal and vertical layers of geographic information depending on the research or monitoring context (Fig. 3).

Fig. 3

Different geographic layers of a site.

Depending on the types of parameters monitored at a site, it might have different boundaries (or “areas of interest”) that are of relevance when describing a site and potentially linking them to a specific observation. A site monitoring a river or lake might have multiple sampling points or measurement equipment deployed in one or multiple places. In addition, the extent of the hydrological catchment might also be of relevance in this case. Sites monitoring air quality or carbon fluxes can have measurement equipment with sensors installed at different heights. In this case, the airshed might also be of relevance. Sites that monitor a forest might have multiple plots in said forest with different parameters being observed, also making the forest itself relevant as a reference area. Aggregating and using such different types of site boundaries without further distinction can lead to spatial and thematic inaccuracies in case studies, e.g. in the assessment of biogeographical and socio-ecological representativeness of LTER site networks (Mollenhauer et al., 2018; Wohner et al., 2021). To address this issue, we therefore developed and implemented the concept of “locations”. It allows users to create an arbitrary number of geographic features, e.g. points, lines or polygons that can be linked to a site. These locations can be classified according to a predefined list of location types and provide geographical and contextual reference areas when documenting datasets. Currently, the location types encompass “Sampling area”, “Hydrological catchment”, “Airshed”, “Model area”, “Socio-ecological reference area” and the aforementioned “Remote Sensing Analysis Area”. These types are based on stakeholder requirements as described in section 2.4 and might be subject to change in the future. In the previously discussed data standards (section 2.2), such different types of locations generally correspond to the so-called “Feature of Interest”. 
In INSPIRE EF, corresponding entities are the “Feature of Interest Station/Location (sampling point)”, “Feature of Interest Sample/Specimen” or “Feature of Interest Extensive Feature (sampling surface)” (D2.9 Guidelines for the use of Observations and Measurements and Sensor Web Enablement-related standards in INSPIRE Annex II and III data specification development, 2011). We decided to use the term “locations” as opposed to “feature of interest” as it allows for API calls with a simpler syntax (see section 5) and is probably more intelligible to scientists in the environmental domain with no knowledge of such data standards and their terminology. By classifying locations using the above-mentioned list of types, further distinction of the location can be achieved. Depending on the way these locations are used, their application context can vary. Hence, they can also implicitly act as what O&M describes as the “samplingFeature”. In addition to locations, DEIMS-SDR also allows sensors to be described separately and linked to sites (Wohner, 2022b, Wohner, 2022c), effectively allowing the user to describe deployed equipment as well. While we consider this functionality a novelty for DEIMS-SDR, the site catalogue of the NEON had already implemented similar functionality before DEIMS-SDR did (NEON Data API Documentation, 2021).
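A location as described above can be represented as a typed geographic feature linked to a site. The following sketch uses the general GeoJSON Feature layout (a geometry plus a properties object); the property names `site` and `locationType`, the helper functions, the site identifier and the coordinates are assumptions made for illustration and are not taken from the DEIMS-SDR API.

```python
# Location types as listed in the text.
LOCATION_TYPES = {
    "Sampling area",
    "Hydrological catchment",
    "Airshed",
    "Model area",
    "Socio-ecological reference area",
    "Remote Sensing Analysis Area",
}

# Geometry kinds mentioned in the text: points, lines and polygons.
GEOMETRY_TYPES = {"Point", "LineString", "Polygon"}


def make_location(site_id: str, location_type: str, geometry: dict) -> dict:
    """Build a GeoJSON-style feature for a location linked to a site."""
    if location_type not in LOCATION_TYPES:
        raise ValueError(f"unknown location type: {location_type!r}")
    if geometry.get("type") not in GEOMETRY_TYPES:
        raise ValueError(f"unsupported geometry type: {geometry.get('type')!r}")
    return {
        "type": "Feature",
        "geometry": geometry,
        "properties": {"site": site_id, "locationType": location_type},
    }


def locations_of_type(locations: list, location_type: str) -> list:
    """Select the locations of one type, e.g. before comparing extents."""
    return [
        loc for loc in locations
        if loc["properties"]["locationType"] == location_type
    ]


# Hypothetical site with a sampling point and a catchment polygon (lon, lat order).
site_locations = [
    make_location("site-0001", "Sampling area",
                  {"type": "Point", "coordinates": [14.0, 47.0]}),
    make_location("site-0001", "Hydrological catchment",
                  {"type": "Polygon", "coordinates": [[
                      [13.9, 46.9], [14.1, 46.9], [14.1, 47.1],
                      [13.9, 47.1], [13.9, 46.9],
                  ]]}),
]
print(len(locations_of_type(site_locations, "Hydrological catchment")))
```

Filtering by type before aggregation avoids mixing, for example, sampling areas with hydrological catchments in the same analysis.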

Current initiatives for the harmonisation and standardisation of site documentation

In addition to requirements defined by stakeholders and data formats, there are a number of initiatives and projects dedicated to the harmonisation and standardisation of site documentation. Groups and projects that are relevant for DEIMS-SDR are the EU project ENVRI-FAIR, the Global Ecosystem Research Infrastructures (GERI) initiative and the Polar Observing Assets working group (POAwg, 2021). ENVRI-FAIR targets the development and implementation of a technical and policy framework to overcome discipline boundaries within the Environmental Research Infrastructures (ENVRI) community and foster the implementation of the FAIR data principles (Wilkinson et al., 2016). Its aim is to facilitate interdisciplinary Earth system science by doing cross-discipline harmonisation and standardisation, together with the implementation of joint data management and access structures. Within ENVRI-FAIR, there is a use case dedicated to the harmonisation of site documentation across participating ecosystem research infrastructures, namely AnaEE (2021), LTER-Europe (2022), ICOS (2021), Lifewatch (2021) and SIOS (2021) (ENVRI-FAIR, 2021). One of the outputs of this use case is the work done by Wohner et al. (2020) defining a set of core fields for cross-RI site documentation (see Table 1).

Table 1

Core elements of the site data model.

Field name	Multiplicity	Description	Comment
Name	1..1	Name identifying the documented observation and/or experimentation facility (site). Example: “Zöbelboden”
Textual Description	1..1	A short textual description of the site or platform which includes the location, biophysical characteristics, a brief history, the main scientific purpose
Contact	1..n	Reference to the contact person responsible for the site. The person could be either the principal investigator, the information manager, the site manager or a technician. Example: “John Doe”	The referenced contact entity consists of sub-elements such as first name, last name, email address, ORCID
Observed properties	1..n	Description of the observed parameters and parameter groups at the site. The parameter is defined as property of the ecosystem or an ecosystem compartment which can be observed either by sensors or humans, e.g. pH, species number, radiation
Centroid or representative coordinates	1..1	Location of the site in the geographic space, given as centre coordinates expressed in latitude and longitude in decimal degrees
Boundaries	0..1	A primary boundary is defined as “The geographic extent covering the area of all measurement infrastructures including measurement devices (e.g. hydrological observations, soil respiration chambers) as well as permanent plots (e.g. vegetation surveys, soil sampling)”.	This is to ensure that one type of boundaries is available for as many sites as possible and thus enabling a homogenous dataset of site boundaries
Locations	0..n	Optional geographic information to provide a better overview of the geographic setup of a site
DEIMS.iD	1..1	Persistent, unique and resolvable identifier for a site	Generated automatically
Creation date and time	1..1	Date and time when a record was created	Generated automatically
Update date	1..1	Date and time when a record was last updated	Generated automatically
Metadata creator(s)	1..n	Provides the full name of the person(s) or organisation(s), who created the documentation for the site	The referenced contact entity consists of sub-elements such as first name, last name, email address, orcid
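
To make the multiplicities in Table 1 more tangible, the following is a minimal Python sketch of how the core fields could be held in memory and checked. The class names, attribute names and the `multiplicity_errors` helper are assumptions made for illustration; they do not reproduce the DEIMS-SDR schema or its Drupal implementation, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


def _now() -> datetime:
    """Timestamp helper standing in for the automatically generated dates."""
    return datetime.now(timezone.utc)


@dataclass
class Contact:
    """Hypothetical contact entity with some sub-elements named in Table 1."""
    first_name: str
    last_name: str
    email: Optional[str] = None
    orcid: Optional[str] = None


@dataclass
class CoreSiteRecord:
    """Illustrative container for the core fields (not the real schema)."""
    deims_id: str                      # 1..1
    name: str                          # 1..1
    description: str                   # 1..1
    contacts: List[Contact]            # 1..n
    observed_properties: List[str]     # 1..n
    centroid: Tuple[float, float]      # 1..1, (latitude, longitude)
    metadata_creators: List[Contact]   # 1..n
    boundaries: Optional[dict] = None  # 0..1, e.g. a GeoJSON geometry
    locations: List[dict] = field(default_factory=list)  # 0..n
    created: datetime = field(default_factory=_now)
    updated: datetime = field(default_factory=_now)

    def multiplicity_errors(self) -> List[str]:
        """Return the violated multiplicity constraints (empty if none)."""
        errors = []
        for label, value in (("DEIMS.iD", self.deims_id),
                             ("name", self.name),
                             ("description", self.description)):
            if not value.strip():
                errors.append(f"{label} is required (1..1)")
        for label, values in (("contact", self.contacts),
                              ("observed property", self.observed_properties),
                              ("metadata creator", self.metadata_creators)):
            if not values:
                errors.append(f"at least one {label} is required (1..n)")
        lat, lon = self.centroid
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            errors.append("centroid must be a valid (latitude, longitude) pair")
        return errors


# Placeholder values only; identifier and coordinates are invented.
john = Contact(first_name="John", last_name="Doe")
site = CoreSiteRecord(
    deims_id="example-id",
    name="Example site",
    description="Placeholder description of the site.",
    contacts=[john],
    observed_properties=["pH"],
    centroid=(47.0, 14.0),
    metadata_creators=[john],
)
print(site.multiplicity_errors())
```

Optional fields (0..1 and 0..n) default to empty values, so only the mandatory core has to be supplied when a record is created.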

The GERI initiative consists of six major ecosystem research infrastructures (SAEON in South Africa, TERN in Australia, CERN in China, NEON in the USA, ICOS and LTER Europe in Europe) that have started programmatic work needed for concerted operation and the provisioning of interoperable data and services (Bäck et al., 2021). One of its activities is the harmonisation and collection of site information of the participating research infrastructures. One of the concrete steps that have been taken is the inclusion of the DEIMS.iD in the NEON Data API (NEON Data API Documentation, 2021). This means that every site that is documented in both the NEON Site Catalog and in DEIMS-SDR can be linked through the same identifier. This eases the process of synchronising site descriptions between the two catalogues and allows integration of information from both catalogues. POAwg is a newly formed working group that facilitates the discovery and interoperability of information about research and monitoring assets in polar regions, which encompasses sites, transects, observatories, projects, and networks or systems (POAwg, 2021). While POAwg has a dedicated focus on polar assets, some of its findings can likely be generalised and applied to the documentation of non-polar assets as well.

Content of the data model

For the revision of the data model, we compiled all relevant fields of the different standards and used the fields that are common across standards as the basis for the data model. For a more comprehensive overview, comparison and mapping between these standards, please refer to Wohner et al. (2020). Following the concept of the onion model of site data interoperability (see section 2.3), we outline the content of the data model below. However, we underline that there is no domain-wide consensus on which fields are core or domain-specific fields. We merely present ideas on the fields that we would use to populate conceptual layers. While this publication intends to detail relevant aspects of the developed data model, the model is too extensive to be described here in its entirety. The complete documentation of the model in its latest revision can be found online (DEIMS-SDR Site Data Model, 2021) alongside all other data models used for DEIMS-SDR which are not described in this article. A snapshot of the site data model and its related classes in the Unified Modelling Language format (UML) is also available (Wohner, 2022c). Compared to the previous version of the site data model (Wohner et al., 2019), fields that had semantic overlaps or redundancies, such as “history” and “purpose”, which in their definition overlapped with the general site description, were removed. We renamed the field “parameters” to “observed properties” to use a more common terminology and removed the field “research topics”, as users usually diligently filled in only the coarser research topics field while ticking only a few values in the more detailed parameters field. As a result, searching for similar research topics and parameters in the search interface returned different results depending on whether the filter was applied to research topics or to parameters. Consequently, users started doubting the system and its data quality. 
However, this only addressed one part of the problem, as the “observed properties” are currently a relatively flat list of terms that does not allow for aggregation on different levels, i.e. deriving research topics from observed properties, as well as mapping to other vocabularies and establishing increased interoperability between different site descriptions. This issue is further discussed in section 4.4. Overall, this led to a slight simplification of the model without losing relevant information. The description of geographic information is now realised through three separate fields: (1) the centroid/representative coordinates, (2) a primary boundary with a given definition, and (3) the possibility to create and link any number of locations. This information is stored in the database both as a GeoJSON string and as an SQL “Geometry” type, which is used as the data source for GeoServer, which in turn provides data as e.g. GML, KML, Shapefile or GeoJSON. On creation of a location record, it is assigned a Universally Unique Identifier (UUID), which allows persistent identification of a location and querying information about the location through the API (see section 5). Each location can also be linked to a site. Internally, this link is realised through the site identifier; in the user interface, it is presented via the site name. The identifier, as well as information on creation and update dates, is generated automatically on creation of a record. At least one contact person and one metadata creator have to be linked to a site record. Such a person record consists of a name, postal address, email address, ORCID, association with networks and scientific focus. 
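The handling of locations described above (a UUID per location, a link to the site identifier, and the geometry kept as a GeoJSON string) can be sketched in a few lines of Python. This is an illustration under assumptions: the function name, the dictionary keys and the example values are hypothetical and do not mirror the actual Drupal/MySQL implementation.

```python
import json
import uuid

# Geometry types defined by the GeoJSON specification (RFC 7946).
GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString",
    "MultiLineString", "Polygon", "MultiPolygon",
}


def make_location_record(site_id: str, geometry: dict) -> dict:
    """Build a location record with a fresh UUID and a GeoJSON string.

    `site_id` stands for the identifier of the site the location is linked
    to; the record layout is invented for this sketch.
    """
    if geometry.get("type") not in GEOMETRY_TYPES:
        raise ValueError("unsupported GeoJSON geometry type")
    return {
        "uuid": str(uuid.uuid4()),        # persistent identifier of the location
        "site_id": site_id,               # internal link to the site
        "geojson": json.dumps(geometry),  # geometry serialised as a string
    }


# Invented triangle; GeoJSON positions are given as [longitude, latitude].
plot = {
    "type": "Polygon",
    "coordinates": [[[14.0, 47.0], [14.1, 47.0], [14.1, 47.1], [14.0, 47.0]]],
}
record = make_location_record("example-site-id", plot)
print(record["uuid"], record["geojson"])
```

Keeping the serialised GeoJSON next to a native geometry column, as the article describes, lets the string be returned by an API unchanged while the geometry column serves spatial services.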
For the domain specific layer of the model (Table 2), we introduced new fields such as “Land Use”, which is based on the Hierarchical INSPIRE Land Use Classification System (HILUCS, 2021) and “Landforms” whose list of values is based on a simplified list of the Wikipedia page of the glossary of landforms (Wikipedia Landforms, 2021), which itself is based on Hargitai and Kereszturi (2015).

Table 2

Domain specific key elements of the site data model.

Name	Multiplicity	Description	Comment
Land Use	0..n	Land Use is an emerging socio-economic activity wherein a region of one major specific purpose utility may be converted into another land for general purpose utility. It provides information on land cover, and the types of human activity involved in land use. […] Example: “forestry based on continuous cover”	Uses the INSPIRE HILUCS classification (HILUCS, 2021)
Landforms	0..n	This field describes the geomorphological landform, i.e. the feature of the solid surface of the Earth, which your site is composed of. Example: “Basin”
Biogeographical region	0..1	Area of similar character in terms of the biota (fauna & flora) present in it. […] (EEA Glossary, 1998). Example: “Alpine”	Only applicable for European sites (Biogeographical Regions, 2019)
Site Status	0..1	Describes the current status of the site, e.g. if it is planned or already existing.	Uses a WIGOS codelist (WIGOS Reporting Status, 2021)
IUCN category	0..n	Description of the level of nature protection which is applied for the whole or a part of the site or platform. Example: “Category Ib – wilderness area”	Uses IUCN protected area categories (Protected Area Categories, 2021)
EUNIS Habitat	0..n	Assignment of the habitat type based on the EEA, EUNIS Habitat types. Example: “Raised bogs (D1.1)”	Only the levels 1–3 are used (EUNIS Habitat, 2020)

Based on the work done so far in POAwg, we also revised the fields for temperature and precipitation values so that average monthly air temperature and precipitation values can now be described if applicable. The field “site status” was revised to utilise the WIGOS Codes registry for “Station/Platform operating status” (WMO Codes Registry, 2021). This was done so that the field uses a web-accessible code list, thus increasing interoperability in general and with the WMO WIGOS specification in particular. There are two other fields thematically linked to this field, namely the year the site was established and, in case the site status is set to “closed”, the year the site was closed. This indicates the monitoring period of closed or inactive sites. We also included the International Union for Conservation of Nature (IUCN) categories for the description of the protection level if applicable (Protected Area Categories, 2021). There are also ecological classification fields, such as the Biogeographical Regions (2019) or the EUNIS Habitat (2020), which also utilise existing codelists. Due to the usage of accepted and published codelists, these fields are suitable to be used in other catalogues dedicated to documenting ecosystem research/monitoring sites. DEIMS-SDR includes additional ecosystem classification fields that are not further discussed in this article but can be looked up elsewhere (DEIMS-SDR Site Data Model, 2021; Wohner, 2022c). Definitions and vocabularies used for these fields are project specific and currently not available in a format that follows the FAIR principles (Wilkinson et al., 2016). This will be subject of future work and will be made public in the Environmental Thesaurus (EnvThes, 2022), which is the primary source of keywords and vocabulary terms used in DEIMS-SDR. 
The catalogue specific layer of the new DEIMS-SDR site data model revolves around the requirements defined by the different LTER networks. This includes a network membership acknowledgement for each site and allows indicating the membership in a given network as well as a corresponding site identifier within the network. Using this setup, the respective network managers are able to verify site memberships in their network. This requirement was generalised and in principle made available to every network registered on DEIMS-SDR. However, this functionality is currently only available for the LTER networks as well as a limited number of other networks as it requires an agreement on both sides for such a workflow to be implemented. Additionally, information about the deployed equipment or devices can be indicated as a simple listing of the equipment type, e.g. lysimeter or measurement tower. The code list used is based on network specific requirements and is published in EnvThes (EnvThes, 2022). Nevertheless, the list of equipment and devices is currently under revision and will be extended based on the network needs. Another requirement was the inclusion of an LTER-specific site classification as well as being able to centrally tag sites with keywords used for the management of the ILTER and LTER-Europe networks using a dedicated LTER vocabulary.

Functionality enabled by the data model

Calculating values of fields based on existing site information

Some values of fields that are present in the model can be calculated (semi-)automatically based on existing information, most notably geographic information. More specifically, we derived the aforementioned IUCN categories for European sites by intersecting site boundary information with the Common Database on Designated Areas (CDDA) dataset of the European Environment Agency (CDDA, 2021). Users validated the results. The same was done for the calculation of the biogeographical regions for sites located in Europe (Biogeographical Regions, 2019). In general, this is also possible, but currently not executed, for fields describing the country in which a site is located or the elevation values of a site. This is because site managers are able to provide more accurate values than the ones that can be calculated automatically. As mentioned in the introduction, DEIMS-SDR site boundary information has also been used to process regional climate change scenario data (Rennie et al., 2021). The revised fields for average monthly air temperature and precipitation values can be populated by processing large-scale input datasets, e.g. Copernicus Climate Change Service (2020). The calculated average values allow generating climate charts for sites.
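
As an illustration of how the monthly temperature field mentioned above might be populated, the sketch below aggregates a series of daily air temperature values into one mean per calendar month. It assumes that values for the site have already been extracted from the large-scale input dataset; the function name and the sample values are invented. For precipitation, daily values would typically be summed per month before averaging across years, which this sketch does not do.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, Iterable, Tuple


def monthly_means(daily: Iterable[Tuple[date, float]]) -> Dict[int, float]:
    """Average all daily values per calendar month (1-12), across years."""
    by_month = defaultdict(list)
    for day, value in daily:
        by_month[day.month].append(value)
    # Sort by month so the result can be fed to a climate chart directly.
    return {month: round(mean(values), 2)
            for month, values in sorted(by_month.items())}


# Invented daily air temperatures in degrees Celsius.
daily_air_temperature = [
    (date(2020, 1, 1), -2.0),
    (date(2020, 1, 2), 0.0),
    (date(2020, 7, 1), 18.0),
    (date(2021, 7, 1), 20.0),
]
print(monthly_means(daily_air_temperature))
```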

Using gamification to increase data quality

Even a well-thought-out data model implemented in a state-of-the-art system is insufficient if it is not populated with quality-assured data. To facilitate the input of accurate and complete data by the users, we have implemented so-called “gamified” interfaces and reporting tools. Gamification is the strategic attempt to enhance systems with the purpose of creating experiences similar to those experienced when playing games in order to motivate and engage users (Hamari, 2007). Gamified interfaces have been proven to increase data quality in data repositories (Trisovic et al., 2021). We therefore implemented such gamification design elements in DEIMS-SDR. The most notable examples for this are the Site Record Completeness Measure and the Site Record Quality Assurance (QA) tool. The completeness measure is a status bar showing the percentage of recommended fields that are filled in. For this purpose, a particular set of fields in the data model is defined as “recommended”. Depending on the degree to which a record has completed these defined fields, the status bar is coloured either red, orange, yellow or green. The QA tool checks a given site record with regard to the completeness and the quality of data provided for selected fields. For instance, the validity and topology of the provided geodata is checked using the JavaScript Topology Suite (JSTS, 2021). If there is an invalid geometry, e.g. a polygon intersecting itself, the tool will provide written feedback. Other checks include the extent to which fields are filled, e.g. the number of provided values or the total length of text strings, as well as logical comparisons, e.g. a short name should be shorter than the full name of the site. The introduction of these tools led to increased data quality and positive user feedback.
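
The logic of the completeness measure and of some of the QA checks described above can be sketched as follows. This is not the code of the DEIMS-SDR tools (the QA tool is built with JavaScript); the list of recommended fields, the colour thresholds, the minimum description length and all field names are assumptions chosen for illustration.

```python
from typing import Dict, List

# Hypothetical set of "recommended" fields; the real set is defined in DEIMS-SDR.
RECOMMENDED_FIELDS = [
    "name", "short_name", "description",
    "contacts", "observed_properties", "boundaries",
]


def completeness(record: Dict[str, object]) -> float:
    """Percentage of recommended fields that hold a non-empty value."""
    filled = sum(1 for name in RECOMMENDED_FIELDS if record.get(name))
    return 100.0 * filled / len(RECOMMENDED_FIELDS)


def status_colour(percent: float) -> str:
    """Map a percentage to a status bar colour (thresholds are assumptions)."""
    if percent < 25:
        return "red"
    if percent < 50:
        return "orange"
    if percent < 75:
        return "yellow"
    return "green"


def qa_messages(record: Dict[str, object],
                min_description_length: int = 100) -> List[str]:
    """Return written feedback for a few simple, illustrative checks."""
    messages = []
    short = record.get("short_name") or ""
    full = record.get("name") or ""
    if short and full and len(short) >= len(full):
        messages.append("short name should be shorter than the full name")
    if len(record.get("description") or "") < min_description_length:
        messages.append("description is shorter than recommended")
    if not record.get("observed_properties"):
        messages.append("no observed properties provided")
    return messages


# Invented record: four of the six recommended fields are filled in.
example = {
    "name": "Example Long-Term Site",
    "short_name": "ELTS",
    "description": "x" * 120,
    "contacts": ["John Doe"],
    "observed_properties": [],
    "boundaries": None,
}
percent = completeness(example)
print(round(percent, 1), status_colour(percent), qa_messages(example))
```

Geometry validity checks, which the article attributes to JSTS, are deliberately left out because they require a topology library.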

Sharing site data

Standardised ways to share site documentation are limited. While there are standards for describing sites as outlined in section 2.2, the application of such standards is scarce in the ecosystem domain, and ways to share this information are scarcer still. Additionally, these standards only support a limited set of fields that usually encompass a basic description of sites, limiting the amount of information that can be shared (Wohner et al., 2020). Since the publication of Wohner et al. (2019), we have therefore built a REST-API exposing all information that is available about a site on DEIMS-SDR as JSON files (DEIMS-SDR REST-API, 2021; see section 5). While this API is not standardised to the same degree as established metadata formats, it allows sharing of all available information in a common data format. The general idea is to expose all information about a site available on DEIMS-SDR in a data format that can be parsed and further worked with, without the restrictions that metadata formats have. Hence, these JSON records form the basis for creating records in other formats, e.g. ISO 19139 for creating discovery records, which can be distributed using a CSW service. Efforts in supporting additional formats, such as the Data Catalog Vocabulary (DCAT), are currently being undertaken for the ENVRI-FAIR project. Other metadata formats will likely follow in the future depending on the requirements of projects that DEIMS-SDR will be part of. However, even with the support of additional formats, the possibilities to query site records are limited. This is due to the unfinished state of the list of values used for some fields. These fields have therefore not yet been configured to be filterable in the API. Using the REST-API and custom scripts to query and filter the content of desired fields nevertheless enables potent queries by combining multiple filters, which can be realised in any programming language that allows HTTP requests. 
Examples for such queries are:
Sites that research mountain lakes (landform = mountain && lake)
Sites located in the Alps (country = France, Switzerland, Liechtenstein, Austria, Germany, Italy, Slovenia, Monaco and/or a geospatial query using a bounding box and/or elevation, e.g. elevation >2000 msl)
Sites that monitor air quality (observed property = air quality)
Sites that research forest ecosystems (ecosystem type = forest)
Sites that have a permanent power supply (equipment = permanent power supply)
Sites located on Mediterranean coasts that research soil erosion (country = Spain, France, Italy, …) (landform = coastal) (observed property = soil erosion)
Sites located on Pacific (spatial query using a bounding polygon) islands (landform = island) that research forests (EUNIS habitat = Woodland, forest and other wooded land) with regard to decomposition (observed property = decomposition) and have a lysimeter (equipment = lysimeter)
There are plans to extend the capabilities of the search interface and provide a GUI for such queries. A first version of a Python package for easing data access has been released (deimsPy, 2021). An overview of data extraction methods for end users is given in the official DEIMS-SDR user documentation (DEIMS-SDR User Manual, 2021).
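
The example queries listed above amount to combining several filters over parsed JSON site records. A minimal Python sketch of this client-side filtering is given below. It deliberately works on an in-memory list so that no assumptions about API endpoints are needed; the key names (`landforms`, `equipment`, `elevation_m`) and the three sample records are invented and do not reflect the actual structure of the JSON returned by the REST-API.

```python
from typing import Callable, Dict, Iterable, List

Site = Dict[str, object]
Filter = Callable[[Site], bool]


def has_value(field: str, value: str) -> Filter:
    """Pass sites whose list-valued `field` contains `value`."""
    return lambda site: value in site.get(field, [])


def above_elevation(metres: float) -> Filter:
    """Pass sites with a known elevation strictly above `metres`."""
    return lambda site: site.get("elevation_m", float("-inf")) > metres


def query(sites: Iterable[Site], *filters: Filter) -> List[Site]:
    """Return the sites that satisfy all given filters (logical AND)."""
    return [site for site in sites if all(f(site) for f in filters)]


# Invented records standing in for parsed JSON site descriptions.
sites = [
    {"title": "Site A", "landforms": ["mountain", "lake"],
     "elevation_m": 2300, "equipment": ["lysimeter"]},
    {"title": "Site B", "landforms": ["island"],
     "elevation_m": 15, "equipment": []},
    {"title": "Site C", "landforms": ["mountain"],
     "elevation_m": 1800, "equipment": ["lysimeter"]},
]

mountain_lakes = query(sites,
                       has_value("landforms", "mountain"),
                       has_value("landforms", "lake"))
high_with_lysimeter = query(sites,
                            has_value("equipment", "lysimeter"),
                            above_elevation(2000))
print([s["title"] for s in mountain_lakes])
print([s["title"] for s in high_with_lysimeter])
```

Representing each condition as a small function keeps the combination step generic, so spatial or vocabulary-based filters could be added without changing `query`.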

Hierarchical observed properties

There are a number of challenges when documenting and communicating the observed properties of sites. A typical LTER site can monitor >200 different parameters covering different ecosystem compartments such as soil, water and air, e.g. LTER Zöbelboden (2021). Hence, giving a concise overview of the observed properties of a site is not straightforward. Plain text lists are difficult to read and interpret if presented as a flat list without any grouping, hierarchy or other structure. Therefore, there is a need for taxonomies or controlled vocabularies that enable intelligible visualisations of the research focus of a site in a hierarchical way and aggregating this information on different levels. The principle of visualising such a hierarchical taxonomy is illustrated in Fig. 4.

Fig. 4

Example for a hierarchically structured visualisation of the observed properties of a site.

While the data model already allows such a visualisation from a technical and syntactical viewpoint, we currently lack the published standardised and hierarchical vocabulary on observed properties that allows visualising observation information this way. Future work in the eLTER PLUS project will therefore be about creating such a semantically sound, hierarchical taxonomy for the description of the observed properties of a site in the LTER context. Fig. 4 uses an early version of such a taxonomy that is currently under development in the scope of the eLTER PLUS project (Zacharias et al., 2021). Further challenges will be the semantic harmonisation of such a taxonomy with other site catalogues, e.g. by developing term mappings which will be realised in the earlier mentioned Environmental Thesaurus (EnvThes, 2022) also taking into account the recommendation from the Research Data Alliance “I-ADOPT” working group (Magagna et al., 2022).
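
To illustrate what aggregation on different levels means in practice, the sketch below rolls a flat list of observed properties up to top-level groups using broader-term relations. The small vocabulary is invented for this example and is not taken from EnvThes or from the taxonomy under development in eLTER PLUS.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Invented "term -> broader term" relations, for illustration only.
BROADER = {
    "pH": "soil chemistry",
    "soil chemistry": "soil",
    "soil temperature": "soil",
    "air temperature": "atmosphere",
    "radiation": "atmosphere",
}


def path_to_root(term: str, broader: Dict[str, str]) -> List[str]:
    """Return [term, parent, ..., top-level term]; raise on cyclic relations."""
    path = [term]
    while path[-1] in broader:
        parent = broader[path[-1]]
        if parent in path:
            raise ValueError(f"cycle in vocabulary at {parent!r}")
        path.append(parent)
    return path


def count_by_top_level(terms: Iterable[str], broader: Dict[str, str]) -> Counter:
    """Count observed properties per top-level group."""
    return Counter(path_to_root(term, broader)[-1] for term in terms)


observed = ["pH", "soil temperature", "radiation"]
print(path_to_root("pH", BROADER))
print(count_by_top_level(observed, BROADER))
```

With a published hierarchical vocabulary in place of `BROADER`, the same roll-up could drive a grouped visualisation such as the one in Fig. 4.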

Technical implementation

The aforementioned concepts and data model contents are implemented as a publicly available web service called DEIMS-SDR (2021). It is a software stack consisting of a number of open-source components as well as custom extensions and applications. As of June 2022, the stack consists of the Content Management System (CMS) Drupal 9, a MySQL database, GeoServer, pycsw, custom code components within Drupal as well as external applications that access DEIMS-SDR data via the aforementioned custom-built REST-API. While the major components have stayed the same, the configuration of and data provision between those components have changed significantly when compared to previous versions of the software stack (Wohner et al., 2019). A brief overview of the current components and how they are connected is presented in Fig. 5.

Fig. 5

System components of DEIMS-SDR.

The open-source CMS Drupal 9 represents the biggest part of the software stack. It provides the user interface functionality as well as the general database structure, which is used to collect site information provided by the users. It is currently configured to use a MySQL database, which is one of the recommended database management systems for Drupal (Drupal Database Server Requirements, 2021). Since all site data is stored in the Drupal component of the software stack, the actual site data model implementation follows the general Drupal schema (Drupal Database Schema, 2021). In principle, the site data model can still be used for non-Drupal systems as well. Depending on future developments of Drupal and general requirements of DEIMS-SDR, the MySQL database might be replaced with a PostgreSQL database with a PostGIS extension for a better integration with GeoServer and in order to be able to do more complex geodata processing. GeoServer is an open-source server that allows users to share, process and edit geospatial data. It publishes data from any major spatial data source using open standards (GeoServer, 2021). The DEIMS-SDR instance of GeoServer is directly connected with the MySQL database. It provides all relevant geodata services for DEIMS-SDR, has been an essential part of the software stack for years and has proven to be a reliable way to share DEIMS-SDR data using Web Feature Services (WFS) and Web Map Services (WMS), complying with the requirements for data provision defined in EU projects (see section 2.3). Another key part of the stack for data provision is the REST-API. Even though Drupal 9 provides out-of-the-box REST-API functionality, we decided to develop a custom REST-API to expose all relevant information to the public in comparatively simple JSON objects. This was done as the Drupal REST-API would have had a significant overhead and increased the size and complexity of the JSON objects. 
Like Drupal itself, the API is written in PHP using the Symfony framework (Symfony, 2021) and is deployed like any regular Drupal module would be. It directly accesses the system's MySQL database via Drupal's API calls and generates the aforementioned JSON objects based on template files that call each field described in the data model and print their content in a predefined structure. If locations or sensors are linked to a site, a list of references to each location and sensor is printed in the JSON object as well. This allows each sensor or location to be called independently while also connecting this information to the sites. Documentation about how to use the custom REST-API as well as the exact format of its output is available following the OpenAPI specification 3.0, a specification for machine-readable interface files for describing, producing, consuming, and visualising RESTful web services (DEIMS-SDR REST-API, 2021; DEIMS-SDR User Manual, 2021; OpenAPI, 2021). A snapshot of the code is available on Zenodo (Wohner, 2022a). This REST-API is the basis for a set of applications that work with DEIMS-SDR data. As the data is licensed as CC BY-NC 4.0 International and therefore openly available, applications outside the context of the various LTER networks or EU projects can also access it. Such external applications can be, and partially are, written in a variety of languages and frameworks, depending on the development context. A setup like this has the advantage that the system or its components can be replaced without needing to be concerned with breaking existing applications as long as the APIs and their standardised output format stay the same. While we would like to keep the APIs and their data as open as possible, depending on threats, e.g. Distributed Denial-of-Service (DDoS) attacks, access to the APIs might eventually be restricted in some way. 
Compared to previous versions of the stack (Wohner et al., 2019), we placed a greater emphasis on modularisation and kept the amount of Drupal code to a minimum. Instead, we use more encapsulated external apps that access the REST-API. The intention is to enable both project-internal as well as external developers to build applications for DEIMS-SDR without needing to know the Drupal API or the Symfony framework that Drupal uses (Symfony, 2021). Examples for such external applications are the site map and the QA tool. The DEIMS-SDR Site Map (2021) is a custom web-mapping application running on OpenLayers, Bootstrap and a number of auxiliary libraries that uses the WFS and WMS services of GeoServer and queries detailed site descriptions using the REST-API. It provides a cartographic overview of all site records on DEIMS-SDR. To ensure the data quality of DEIMS-SDR as explained in section 4.2, we now utilise a gamified QA tool, which queries and evaluates site data using the REST-API. The QA tool is mostly built with JavaScript, a few auxiliary libraries and Bootstrap (Bootstrap, 2021) for styling. It provides HTML representations of the site record quality checks (DEIMS-SDR Site QA Tool, 2021). Similar to the QA tool, there is a metadata application written in Python that queries the REST-API and translates the records to ISO 19139 on the fly. The modular structure and containerised nature of the scripts allow easy deployment and updating even without any Drupal knowledge. These ISO records are batch-imported into pycsw on a weekly basis. Pycsw is an OGC API - Records and CSW server implementation written in Python (pycsw, 2021). The deployed instance of pycsw ingests the generated ISO 19139 records and exposes them using CSW. This service is harvested by external infrastructures, such as the GEOSS Portal (2021) or the eLTER Information System (2021). Other applications, such as analysis or data integration scripts, are usually hosted in external data labs using e.g. Jupyter Notebook (Jupyter, 2021). These decoupled interfaces allow users, e.g. ecologists, simple access to DEIMS-SDR data without needing to know about the specifics of data services such as the REST-API, WFS or metadata formats.
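
The translation step from a JSON site record to ISO 19139 mentioned above can be illustrated with Python's standard library. The sketch below is not the DEIMS-SDR metadata application. It maps three hypothetical input keys (`id`, `title`, `abstract`) to a small fragment using the ISO 19139 `gmd` and `gco` namespaces, and the result is not a complete or schema-valid record, since mandatory elements such as contact and date stamp are omitted.

```python
import xml.etree.ElementTree as ET

# Namespaces used by ISO 19139 encoded metadata.
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)


def _char_string(parent: ET.Element, tag: str, text: str) -> ET.Element:
    """Append <gmd:tag><gco:CharacterString>text</gco:CharacterString></gmd:tag>."""
    wrapper = ET.SubElement(parent, f"{{{GMD}}}{tag}")
    ET.SubElement(wrapper, f"{{{GCO}}}CharacterString").text = text
    return wrapper


def site_to_iso_fragment(site: dict) -> str:
    """Map a simplified site dictionary to a partial ISO 19139 document."""
    root = ET.Element(f"{{{GMD}}}MD_Metadata")
    _char_string(root, "fileIdentifier", site["id"])
    ident_info = ET.SubElement(root, f"{{{GMD}}}identificationInfo")
    data_ident = ET.SubElement(ident_info, f"{{{GMD}}}MD_DataIdentification")
    citation = ET.SubElement(data_ident, f"{{{GMD}}}citation")
    ci_citation = ET.SubElement(citation, f"{{{GMD}}}CI_Citation")
    _char_string(ci_citation, "title", site["title"])
    _char_string(data_ident, "abstract", site["abstract"])
    return ET.tostring(root, encoding="unicode")


# Hypothetical input; the key names are assumptions, not the REST-API format.
xml_text = site_to_iso_fragment({
    "id": "example-id",
    "title": "Example site",
    "abstract": "Placeholder description of the site.",
})
print(xml_text)
```

A production translation would add the remaining mandatory ISO elements and validate the output against the schema before loading it into a catalogue service.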

Conclusions

The concepts, references and developed model in this article present the current status quo when documenting sites in the ecosystem domain and potentially beyond using the example of DEIMS-SDR. While a few significant steps regarding the documentation of sites have been made in the past years, there are still many unanswered questions regarding the standardisation, harmonisation and sharing of site data. Means to share and synthesise site data will become more important and should be considered when designing site data models. It can be expected that in the future there will be more aggregates of site information fetched from different sources. This will add to the already increasing need for data harmonisation of environmental research infrastructures (Huber et al., 2021). Additionally, increasing integration of remote sensing and in-situ data will also call for greater availability of machine-readable site documentation as illustrated by projects such as EcoPotential (2019) and e-shape (2021) or the In Situ Component of the European Union's Earth observation programme (Copernicus In Situ Component, 2021). For these purposes, there is a need for unique and persistent identification of sites across networks, infrastructures and even scientific domains. However, this issue is currently not properly addressed. Issuing DOIs for sites might be a pragmatic way to identify sites and would also enable indexing citations and subsequently calculating measures like an impact factor for sites. This could turn into a viable way for assessing and comparing the importance of particular in-situ sites as well as being an incentive for site managers to document their sites. An example for this would be the possibility of site managers to better track where and how their research is used by other scientists. 
A high impact factor of a site could also facilitate the acquisition of funding as it would illustrate the relative importance of a site in the national or international context. While the user-based data input of site documentation in DEIMS-SDR has proven successful, it has a significant downside. Changes in the site data model might take a long time until they are reflected in the data itself. While some data can be mapped, we have witnessed with DEIMS-SDR that after extending an already implemented database schema (e.g. by extending lists of values for dropdown menus or adding entirely new fields that cannot be mapped) and the corresponding data input forms, it can easily take months or even years until the majority of users have updated their site records and data model changes are reflected in the actual data. But even with this disadvantage, the overall process of collecting site information from users has proven effective. Both the gamified interfaces and the modular setup of DEIMS-SDR make it relatively easy to maintain such a system with limited development resources while at the same time achieving satisfying results. This approach will therefore be further pursued in future iterations of the software stack. We will continue to work on the site data model, refine and adapt it to future requirements that will likely be collected in upcoming EU projects while at the same time pushing for the acceptance of the core fields as a standard for site descriptions. Recommendations for the adaptation of data models provided by initiatives such as POAwg will also be considered when revising the site data model in the future. This might also encompass incorporating conventions to provide schema.org markup in landing pages to improve data discovery through search engines (Matt Jones et al., 2021), thus increasing the findability of site records. 
Regardless of such recommendations, site data models should be compatible with common data formats, which DEIMS-SDR will continue to strive for. This increases the chances of being compatible with formats that might be developed in the future. To ease access to DEIMS-SDR data for the general public, we will further develop the Python package and potentially also release an R package. With this outlook in mind, this article should be regarded as a snapshot of the current developments of site data models. We hope our use case and especially the onion model of site data interoperability can be helpful for other projects or infrastructures that are also facing the challenges of building registries of their research and monitoring infrastructures and can be the basis for increased interoperability of site documentation.

CRediT authorship contribution statement

Christoph Wohner: Conceptualization, Methodology, Software, Validation, Resources, Data curation, Writing – original draft, Visualization, Writing – review & editing. Johannes Peterseil: Writing – original draft, Writing – review & editing, Supervision. Hermann Klug: Writing – original draft, Writing – review & editing, Supervision.

Declaration of Competing Interest

None.

4 in total

1. Genesis, goals and achievements of Long-Term Ecological Research at the global scale: A critical review of ILTER and future directions.

Authors: M Mirtl; E T Borer; I Djukic; M Forsius; H Haubold; W Hugo; J Jourdan; D Lindenmayer; W H McDowell; H Muraoka; D E Orenstein; J C Pauw; J Peterseil; H Shibata; C Wohner; X Yu; P Haase
Journal: Sci Total Environ Date: 2018-02-19 Impact factor: 7.963

Review 2. Long-term environmental monitoring infrastructures in Europe: observations, measurements, scales, and socio-ecological representativeness.

Authors: Hannes Mollenhauer; Max Kasner; Peter Haase; Johannes Peterseil; Christoph Wohner; Mark Frenzel; Michael Mirtl; Robert Schima; Jan Bumberger; Steffen Zacharias
Journal: Sci Total Environ Date: 2017-12-27 Impact factor: 7.963

3. Assessing the biogeographical and socio-ecological representativeness of the ILTER site network.

Authors: Christoph Wohner; Thomas Ohnemus; Steffen Zacharias; Hannes Mollenhauer; Erle C Ellis; Hermann Klug; Hideaki Shibata; Michael Mirtl
Journal: Ecol Indic Date: 2021-08 Impact factor: 4.958

4. The FAIR Guiding Principles for scientific data management and stewardship.

Authors: Mark D Wilkinson; Michel Dumontier; IJsbrand Jan Aalbersberg; Gabrielle Appleton; Myles Axton; Arie Baak; Niklas Blomberg; Jan-Willem Boiten; Luiz Bonino da Silva Santos; Philip E Bourne; Jildau Bouwman; Anthony J Brookes; Tim Clark; Mercè Crosas; Ingrid Dillo; Olivier Dumon; Scott Edmunds; Chris T Evelo; Richard Finkers; Alejandra Gonzalez-Beltran; Alasdair J G Gray; Paul Groth; Carole Goble; Jeffrey S Grethe; Jaap Heringa; Peter A C 't Hoen; Rob Hooft; Tobias Kuhn; Ruben Kok; Joost Kok; Scott J Lusher; Maryann E Martone; Albert Mons; Abel L Packer; Bengt Persson; Philippe Rocca-Serra; Marco Roos; Rene van Schaik; Susanna-Assunta Sansone; Erik Schultes; Thierry Sengstag; Ted Slater; George Strawn; Morris A Swertz; Mark Thompson; Johan van der Lei; Erik van Mulligen; Jan Velterop; Andra Waagmeester; Peter Wittenburg; Katherine Wolstencroft; Jun Zhao; Barend Mons
Journal: Sci Data Date: 2016-03-15 Impact factor: 6.444
