Patricia A Soranno1, Edward G Bissell1, Kendra S Cheruvelil1, Samuel T Christel2, Sarah M Collins1, C Emi Fergus1, Christopher T Filstrup3, Jean-Francois Lapierre1, Noah R Lottig4, Samantha K Oliver2, Caren E Scott1, Nicole J Smith1, Scott Stopyak1, Shuai Yuan5, Mary Tate Bremigan1, John A Downing3, Corinna Gries2, Emily N Henry6, Nick K Skaff1, Emily H Stanley2, Craig A Stow7, Pang-Ning Tan8, Tyler Wagner9, Katherine E Webster5.
Abstract
Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database.
Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.
Keywords: Data harmonization; Data reuse; Data sharing; Database documentation; Ecoinformatics; Integrated database; LAGOS; Landscape limnology; Macrosystems ecology; Water quality
Year: 2015 PMID: 26140212 PMCID: PMC4488039 DOI: 10.1186/s13742-015-0067-4
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Fig. 1 A description of the major components and data themes that are integrated to create LAGOS. P is phosphorus, N is nitrogen, C is carbon. Further detail is provided in Figures 5 and 6
Fig. 2 The study extent of LAGOS, showing location of all lakes ≥ 4 ha (blue polygons). The study extent included 17 states in the upper Midwest and Northeastern parts of the US. Note that there are many lakes that straddle the state boundaries but are still included in the database because the source data for the lakes are based on natural watershed boundaries rather than state boundaries
Fig. 5 The workflow used to create LAGOS, including the research decisions needed to design the database. Once the research decisions have been made (grey boxes), the workflow is divided into three modules: building the multi-themed GEO data module (green boxes); georeferencing the site-level data (orange boxes); and building the site-level data module (blue boxes). The black boxes with white text identify the Additional files (AF) that describe each element in further detail and the red text provides the programming language or software used for each step. ARCGIS is ArcGIS, Ver 10.1 (ESRI); FGDC is the Federal Geographic Data Committee metadata standard; EXCEL is Microsoft Excel; TAUDEM is the TauDEM Version 5 suite of models to analyze topographical data; PYTHON is the Python programming language; SQL is structured query language used in the PostgreSQL database system; R is the R statistical language [36]; and EML is ecological metadata language
Fig. 6 Database schema for LAGOS including the two main modules: LAGOSGEO (green box) and LAGOSLIMNO (blue box). The component that links the two modules is the ‘aggregated lakes’ table (LAGOS lakes) that has the unique identifier and spatial location for all 50,000 lakes. LAGOSGEO data are stored in horizontal tables that are all linked back to the spatial extents for which they are calculated and ultimately linked to each of the 50,000 individual lakes. The LAGOSGEO data include information for each lake, calculated at a range of different spatial extents that the lake is located within (such as its watershed, its HUC 12, or its state). Each green box identifies a theme of data, the number of metrics that are calculated for that theme, and the number of years over which the data are sampled. LAGOSLIMNO data are stored in vertical tables that are also all linked back to the aggregated lakes table. The ‘limno values’ table and associated tables (in blue) include the values from the ecosystem-level datasets for water quality; each value also has other tables linked to it that describe features of that data value such as the water depth at which it was taken, the flags associated with it, and other metadata at the data value level. The ‘program-level’ tables (in purple) include information about the program responsible for collecting the data. Finally, the ‘source lakes’ table and associated tables include information about each lake where available. Note that a single source can have multiple programs that represent different datasets provided to LAGOS
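The contrast between horizontal (one column per metric) and vertical (one row per measured value) tables described in the Fig. 6 caption can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than the PostgreSQL system named in Fig. 5, and every table name, column name, and data value below is a hypothetical stand-in, not the actual LAGOS schema. For brevity, depth and flag are folded into columns of the vertical table instead of being held in separate linked tables.

```python
import sqlite3

# Hypothetical, simplified schema; names are illustrative assumptions only.
SCHEMA = """
CREATE TABLE lagos_lakes (              -- 'aggregated lakes' linking table
    lake_id INTEGER PRIMARY KEY,
    lat REAL,
    lon REAL
);
CREATE TABLE geo_landuse_watershed (    -- horizontal: one column per metric
    lake_id INTEGER PRIMARY KEY REFERENCES lagos_lakes(lake_id),
    pct_forest REAL,
    pct_agriculture REAL
);
CREATE TABLE limno_values (             -- vertical: one row per data value
    value_id INTEGER PRIMARY KEY,
    lake_id INTEGER NOT NULL REFERENCES lagos_lakes(lake_id),
    variable TEXT NOT NULL,
    sample_date TEXT NOT NULL,
    depth_m REAL,
    value REAL,
    flag TEXT
);
"""


def build_demo_db():
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO lagos_lakes VALUES (?, ?, ?)",
        [(1, 44.0, -89.5), (2, 45.1, -93.2)],
    )
    conn.executemany(
        "INSERT INTO geo_landuse_watershed VALUES (?, ?, ?)",
        [(1, 60.0, 25.0), (2, 10.0, 80.0)],
    )
    # Only lake 1 has water quality samples; lake 2 has geospatial data only.
    conn.executemany(
        "INSERT INTO limno_values "
        "(lake_id, variable, sample_date, depth_m, value, flag) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "total_phosphorus_ugL", "2005-07-15", 0.5, 12.0, None),
            (1, "total_phosphorus_ugL", "2006-07-20", 0.5, 16.0, None),
            (1, "chlorophyll_a_ugL", "2005-07-15", 0.5, 4.0, "below_detection"),
        ],
    )
    return conn


def mean_variable_with_landuse(conn, variable):
    """Return (lake_id, pct_forest, mean value) for every lake.

    The LEFT JOIN keeps lakes that have no samples for the variable,
    so their mean comes back as None.
    """
    query = """
        SELECT l.lake_id, g.pct_forest, AVG(v.value)
        FROM lagos_lakes AS l
        JOIN geo_landuse_watershed AS g ON g.lake_id = l.lake_id
        LEFT JOIN limno_values AS v
               ON v.lake_id = l.lake_id AND v.variable = ?
        GROUP BY l.lake_id, g.pct_forest
        ORDER BY l.lake_id
    """
    return conn.execute(query, (variable,)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in mean_variable_with_landuse(demo, "total_phosphorus_ugL"):
        print(row)
```

In this sketch, adding a new water quality variable to the vertical table needs only new rows, whereas adding a new geospatial metric to the horizontal table needs a new column. Lakes without any water quality samples still appear in the joined result because every lake is present in the linking table.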
Fig. 3 Contributions and collaborations of disciplines for developing an integrated geospatial-temporal database for macrosystems ecology (MSE). Ecoinformatics includes database systems, metadata, and other informatics tools needed for documenting and integrating datasets. Although statistics and machine learning are not used to create the integrated database, the constraints and requirements for future statistical and machine learning modeling should be incorporated into the process from the beginning
Assumptions and fundamental principles in building, maintaining, and sharing integrated macrosystems ecology databases
| • The database should include both a ‘census’ population in which all possible ‘ecosystems’ or ‘sites’ are geographically represented in addition to the sites with |
| • The database will be fully documented, including descriptions of: the original data providers or sources, database design, all data processing steps and code for all data, possible errors or limitations of the data for the integrated dataset and individual datasets, and methods and code for geospatial data processing. |
| • To the greatest degree possible, existing community data standards are used to facilitate integration with other efforts. |
| • To the greatest degree possible, the provenance of the original data will be preserved through to the final data product. |
| • The database will include a versioning system to track different versions of the database for future users and to facilitate reproducibility. |
| • The database will be made publicly accessible in an online data repository with a permanent identifier using non-proprietary data formats at the end of the project or after a suitable embargo if necessary. |
| • A data paper will be written with the original data providers as co-authors to ensure recognition of data providers. |
| • A data-methods paper will be written with the data-integration team as co-authors to ensure recognition of data integrators. |
| • Once the database is made available in a data repository and is open-access, whether it is static (no further data are added to the database) or ongoing (data continue to be added to it), there is a set of community policies by which other scientists use and cite the database, the original data providers, and the database integrators. |
Fig. 4 Flow chart of the sequence of research decisions relevant to the database design and integration efforts that are required prior to entering the database development phase
The description of the sources of site-level datasets that were identified to integrate into LAGOSLIMNO
| Program type providing dataset | Number of datasets | Type of sampling | Spatial resolution | Temporal range of data | Temporal resolution |
|---|---|---|---|---|---|
| Federal agency | 7 | Survey | US | 1991 - 2007 | Single summer sample |
| Federal agency | 7 | Long-term | Single lake - Regional | 1984 - 2011 | Weekly - Yearly |
| LTER program | 5 | Long-term | Single lake - Regional | 1967 - 2013 | Weekly - Monthly (up to all year) |
| State agency | 14 | Survey | State | 1937 - 2011 | Single summer sample - Monthly |
| State agency | 14 | Long-term | Watershed - Regional | 1984 - 2011 | Weekly - Yearly |
| Citizen monitoring program | 6 | Survey | Regional - State | 1989 - 2011 | Single summer sample - Monthly |
| Citizen monitoring program | 4 | Long-term | Regional - State | 1974 - 2012 | Monthly - Multi-years |
| Non-profit agency | 3 | Long-term | Regional | 1990 - 2011 | Monthly - Multi-years |
| Tribal agency | 5 | Long-term | Regional | 1998 - 2011 | |
| University research program | 16 | Long-term | Single lake - Regional | 1925 - 2011 | Single summer sample - Weekly (some fall and winter samples) |
Note that at the time of writing, we have not incorporated all of these datasets into the database. The table describes the types of programs providing data, the type of sampling conducted, the spatial resolution, and the temporal range and resolution of the data