
Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package.

Alexander G Hurley1, Richard L Peters2,3, Christoforos Pappas4,5,6, David N Steger1,7,8, Ingo Heinrich1,7,8.   

Abstract

Ecological research, just as all Earth System Sciences, is becoming increasingly data-rich. Tools for processing of "big data" are continuously developed to meet corresponding technical and logistical challenges. However, even at smaller scales, data sets may be challenging when best practices in data exploration, quality control and reproducibility are to be met. This can occur when conventional methods, such as generating and assessing diagnostic visualizations or tables, become unfeasible due to time and practicality constraints. Interactive processing can alleviate this issue, and is increasingly utilized to ensure that large data sets are diligently handled. However, recent interactive tools rarely enable data manipulation, may not generate reproducible outputs, or are typically data/domain-specific. We developed datacleanr, an interactive tool that facilitates best practices in data exploration, quality control (e.g., outlier assessment) and flexible processing for multiple tabular data types, including time series and georeferenced data. The package is open-source, and based on the R programming language. A key functionality of datacleanr is the "reproducible recipe"-a translation of all interactive actions into R code, which can be integrated into existing analyses pipelines. This enables researchers experienced with script-based workflows to utilize the strengths of interactive processing without sacrificing their usual work style or functionalities from other (R) packages. We demonstrate the package's utility by addressing two common issues during data analyses, namely 1) identifying problematic structures and artefacts in hierarchically nested data, and 2) preventing excessive loss of data from 'coarse,' code-based filtering of time series. Ultimately, with datacleanr we aim to improve researchers' workflows and increase confidence in and reproducibility of their results.

Year:  2022        PMID: 35551557      PMCID: PMC9098071          DOI: 10.1371/journal.pone.0268426

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.752


Introduction

Ecology, just as all Earth system sciences, is increasingly data-rich [e.g., 1]. These data are a boon for novel inferences, and increasingly inform decision making [2, 3], for example, through databases from coordinated efforts that facilitate synoptic studies of carbon fluxes [e.g., FLUXNET, 4] and stocks [5] or ecosystem functioning [e.g., via trait databases like TRY, 6]. Low-cost monitoring and sensing solutions have also immensely increased the amount of data individual researchers can produce [e.g., 7]. However, the data deluge—often from heterogeneous sources—introduces new logistical and computational challenges for researchers [7, 8] wanting to maintain best practices in data analyses, reproducibility and transparency [see frameworks on workflow implementation in 9, 10]. It is clear that we need not only frameworks, but also flexible tools to deal with the ever-increasing, heterogeneous data and corresponding issues. Paramount to any analyses is ensuring the validity of input data through adequate exploration and quality control, which allows identifying any idiosyncrasies, outliers or erroneous structures. However, with growing data volumes this becomes increasingly difficult. Indeed, several definitions establish “big data” at the threshold where single entities (i.e. researchers, institutions, disciplines) are no longer able to manage and process a given data set due to its size or complexity [e.g., 11, 12]. Yet, several current research applications in ecology and Earth system science require handling more than Gigabyte-scale data and regularly lead to the development of dedicated and domain-specific processing pipelines and tools, e.g., processing of raw data from FLUXNET [4] or automating data assimilation from long-term experiments [13]. Individual scientists, however, frequently encounter data sets smaller than this, which nonetheless challenge the feasibility of common data processing and exploration methods. 
These include the best practice examples of generating static diagnostic/summary visualizations, statistics and tables for detecting problematic observations [e.g., 14, 15]. Data sets of this intermediate scale are termed "high-volume," rather than "big," for the purposes of this study. Issues with these data often arise when the dimensions and data types require many of the aforementioned items (e.g., n-dimensional visualizations), and their individual assessment becomes unfeasible due to time and practicality constraints, even when their generation can be largely automated. Hence, they can pose a challenge even for experienced researchers adept at script-based analyses, if convenient tools do not exist or are financially inaccessible due to commercial licensing. For instance, over-plotting may require generating several static visualizations for nested categorical levels, such as branch, individual tree and forest stand, or for spatial granularity, such as plot, site and region. Furthermore, time series from monitoring equipment may show issues related to sensor drift, step-shifts, or random sensor errors. While gap-filling, trend-adjustment and outlier-removal algorithms exist for these circumstances [e.g., 16, 17], subsequent manual checking is usually still advised, leading to similar issues as above. For time series, in particular, problematic periods (e.g., systematically-occurring sensor errors) may be removed entirely for convenience in code-based processing; by contrast, interactive engagement down to individual observations may allow applying more diligence and retaining more data. Ideally, researchers should be able to engage with their data, across scales and dimensions, as diligently as needed, with as little effort as possible. Accordingly, interactive processing is increasingly called for and deemed critical [18] for ensuring best practices in data exploration and quality control when dealing with high-volume data and beyond [e.g., 19, 20].
Indeed, interactive exploration is increasingly provided through open-source graphing frameworks (e.g., plotly; https://plotly.com/ or highcharts; https://highcharts.com/) and/or commercially-licensed software (e.g., Tableau®; https://tableau.com/). However, actual data manipulation, and especially the generation of subsequent outputs that are fully reproducible, are far less common features; this may foster reluctance to share analysis code [e.g., 21]. Further issues can arise when outputs are (commercially licensed) platform/software dependent and thus not easily integrated with other widely-used languages, such as R [22] or python [23]. Interactive, reproducible processing is, therefore, typically linked to method-specific workflows within research domains, for instance, to annotate images [e.g., 24], acoustic files [e.g., 25], or explore spatial and time series data [e.g., 20, 26]. There is a clear need for interactive tools that can facilitate best practices in processing heterogeneous, high-volume data, while enabling interoperability with reproducible workflows. To address this, we developed datacleanr: an open-source R-based package containing a graphical user interface for rapid data exploration and filtering, as well as interactive visualization and annotation of various data types, including spatial (georeferenced) and time series observations. datacleanr is designed to fit into existing, scripted (R) processing pipelines without sacrificing the benefits of interactivity: this is achieved through features that allow validating the results of previous quality control, and by generating a code script to repeat any interactive operation. The code script can be slotted into existing workflows, and datacleanr's output can hence be directly used for subsequent reproducible analyses. Below we provide an overview of the package.
Additionally, we demonstrate datacleanr’s utility with two ecology-based use-cases addressing common issues during data processing: 1) Identifying problematic data structures and artefacts using an urban tree survey, where data is nested by species, street and city district. 2) Preventing excessive loss of data from “coarse,” code-based filtering in messy time series of sap flow data, bolstering subsequent analyses. Lastly, we provide an outlook for future developments and conclude by inviting the community to contribute to further increase datacleanr’s capabilities and reach.

Datacleanr overview

Availability

This publication used v1.0.3 of datacleanr, which is permanently archived on Zenodo under https://doi.org/10.5281/zenodo.6337609. Stable package releases are available on the Comprehensive R Archive Network (CRAN; use install.packages("datacleanr")), which aim to mirror new developments provided and documented on a dedicated repository (www.github.com/the-hull/datacleanr). The repository provides installation instructions for all sources (CRAN, repository) and animated demonstrations with test data. datacleanr is available under a GPL-3 license.

Capabilities

datacleanr is an interactive R package for processing high-volume data, and it caters to best practices in data exploration, processing, and reproducibility. This section describes the general capabilities of the package; an in-depth walk-through of all functionalities is provided with animated examples in S1 File in the supplemental material. The package uses the shiny [27] and plotly [28] packages to generate a web browser-based graphical user interface (GUI); modern browser capabilities allow displaying approximately 2 million observations smoothly, beyond which visualizations and processing increasingly slow down (depending on computing power). The GUI has four modules represented in application tabs, which are documented using intuitively-placed help links and package documentation: 1) Set-up and Overview (grouping and exploration), 2) Filtering, 3) Visual Cleaning and Annotating, 4) Extraction (reproducible recipe). The processing GUI is launched with datacleanr::dcr_app(x) in R, where x is a data set for processing (several data types are possible, including data.frames, tibbles and data.tables; run ?datacleanr::dcr_app for help). The chart in Fig 1 shows the datacleanr workflow across the four modules (A-D) with optional pre- and post-processing with external algorithms. Users are encouraged to cycle through multiple grouping structures, filters and variable combinations to get adequately acquainted with their data. The functions of individual tabs are discussed in detail below.
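As a minimal sketch, launching the GUI from an R session looks as follows. Because datacleanr::dcr_app() opens an interactive browser session, the call itself is commented out; the surrounding lines only prepare an example input using R's built-in iris data.

```r
# Any tabular object can serve as input, e.g., a built-in example data.frame
x <- iris

# several input types are supported (data.frames, tibbles, data.tables)
stopifnot(is.data.frame(x))

# launch the interactive session (opens the GUI in the default web browser):
# datacleanr::dcr_app(x)
```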
Fig 1

Conceptual workflow for datacleanr across its four processing modules.

A) The Set-up and Overview tab allows for a quick initial assessment of a data set (variable types, distribution, completeness), where nested structures (e.g., by plot, site, region) can be resolved by defining a grouping structure from a categorical data column. B) The Filtering tab allows sub-setting the data based on valid R code (logical statements), which can be targeted (i.e., "scoped") to individual groups from A). C) The Visual Cleaning and Annotating tab allows generating two- or three-dimensional visualizations (X, Y, and point size) rapidly, while dividing the data set into groups specified in A); data points for further inspection can be identified by clicking or lasso selection, through which annotations may also be added. An overview table and histogram highlight selected points and the potential impact on the data's distribution, should the selected observations be removed. D) The Extract Recipe tab generates code to reproduce all processing steps, which can be copied to the clipboard or sent directly to an active RStudio® [29] session's script; depending on the processing mode (in memory or from a file), additional settings for file name specification are available. The schematic here illustrates the potential for including datacleanr into an existing workflow, for example, with prior determination of outliers using external algorithms (requires appending a logical TRUE/FALSE column named .dcrflag), interactive exploration and processing (with datacleanr), and informing subsequent analyses by drawing on the interactively annotated data (.annotation column in output from datacleanr).


Set-up and overview

datacleanr facilitates processing of nested data through defining a grouping structure (Fig 1A; also see animation at https://doi.org/10.5281/zenodo.6469658) to the level of interest (e.g., by selecting species, plot and region). The structure is available during targeted filtering (scoping; see section Filtering) and visual cleaning (see section Visual cleaning and annotating, as well as Case studies). Once the grouping is set, a dataset summary can be generated via the package summarytools [30], highlighting duplicates, missingness, and distribution of each variable.
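The grouping defined here drives the rest of the session. As a rough base-R analogue (not datacleanr's internal code; column names taken from R's built-in iris data), the partitioning can be sketched as:

```r
# Sketch of the grouping the Set-up tab defines interactively
x <- iris

# group by one categorical column, as on the Set-up and Overview tab
groups <- split(x, x$Species)

length(groups)        # number of groups the app would cycle through
sapply(groups, nrow)  # observations per group, as in the overview summary
```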

Filtering

Filtering (Fig 1B; also see animation at https://doi.org/10.5281/zenodo.6469721) is done by adding filter statement text boxes by clicking on the respective button on the tab. Statements can be applied to the entire data set, or targeted to specific groups using a “scoped” (i.e. group-specific) filter. The application’s interactivity allows reviewing the impact of filters through a text note highlighting the percentage of removed data and an overview table showing the remaining observations (per group), as well as by iterating between settings and visualizations (across several variable combinations). This is more efficient than (re-)generating individual, static figures, and highlights which data will be excluded. The result of a quantile-based threshold filter implemented in datacleanr, as used e.g., in TRY [6] or BAAD [31], is illustrated in Fig 2.
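For illustration, a base-R sketch of a full-range versus scoped percentile filter of the kind described above (using R's built-in iris data; this mirrors the filter logic, not datacleanr's implementation):

```r
# Percentile threshold filter (0.01/0.99), applied globally and scoped to groups
x <- iris

# global filter: keep observations within the 1st-99th percentile of one variable
q <- quantile(x$Sepal.Length, c(0.01, 0.99))
global_keep <- x[x$Sepal.Length >= q[1] & x$Sepal.Length <= q[2], ]

# scoped filter: the same statement evaluated within each Species group
scoped_keep <- do.call(rbind, lapply(split(x, x$Species), function(g) {
  qg <- quantile(g$Sepal.Length, c(0.01, 0.99))
  g[g$Sepal.Length >= qg[1] & g$Sepal.Length <= qg[2], ]
}))

nrow(global_keep)  # observations retained by the global filter
nrow(scoped_keep)  # scoping typically retains a different subset
```

Scoped filters respect per-group distributions, which is why Fig 2 shows different exclusions for the full-range and group-wise cases.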
Fig 2

Example application of statistical filtering.

A percentile threshold filter (0.01 and 0.99) is applied to the full range or scoped to groups (panels, see bold text) on the x-variable. Trait data from BAAD [31] is used to illustrate the impact on bivariate relationships across plant functional types, where the gray shading indicates the filtered variable space and text labels count the excluded observations (e.g., "n = 2"). Note, the figure was not generated in datacleanr.

Any filtering statement (provided as valid R code) which evaluates to TRUE or FALSE and uses the dataset's column names can be supplied. That is, let example_numeric_variable and example_categorical_variable be column names; then the following statements are valid and can be entered in a filter statement box:

# simple logical filter
example_numeric_variable > 3

# using expressions to define thresholds
## percentile/rank based
example_numeric_variable > quantile(example_numeric_variable, 0.01)

## dispersion based (median absolute deviation)
example_numeric_variable > median(example_numeric_variable) - 3 * mad(example_numeric_variable)

# examples for subsetting
example_categorical_variable == "SpeciesA"
example_categorical_variable %in% c("SpeciesA", "SpeciesB")

Visual cleaning and annotating

Interactive visualizations (Fig 1C; also see animation at https://doi.org/10.5281/zenodo.6469756) via plotly [28] are based on bivariate scatter or time series plots with optional third dimension represented by point size. Spatial data can be displayed on interactive map tiles, if columns named lon (longitude) and lat (latitude) are present in decimal degrees. Example visualizations for currently supported data types are in Fig 3.
Fig 3

Examples of interactive visualizations.

Panels show a subset of hourly time series of latent heat fluxes (A) from all Swiss FLUXNET2015 sites [4], spatial data illustrating sample locations for BAAD (B) and the relationship between stem diameter and height (C) with plant traits from BAAD [31]. Colors represent the grouping structure defined in the “Set-up” operation (A: Swiss sites; B, C: functional types).

A key feature on the visualization tab is the grouping structure table, which allows highlighting granular data levels, e.g., all species at a given site, by hiding all other data (see Figs 7 and 8). Entries in this table correspond to colored traces in the figure legend via unique numbers. Users can thus cycle through or compare deeply nested data structures. Visualizations support zooming and panning (scatter plots and maps), as well as axis scrolling and stretching on mouse hover-over. Observations can be (de-)selected through clicking or lasso and box selection, and annotated with text labels, which are listed in a summary table below the visualization. Annotations can be provided in a text box, and are added either individually through a button click, or automatically on every selection (requires ticking the corresponding box); these are added to the input data in an appended column (.annotation) and can be used to inform subsequent processing. Lastly, histograms of all displayed numeric variables can be generated to assess the potential impact of data removal.
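A minimal sketch of how the appended .annotation column can inform subsequent processing; the row indices and label text here are hypothetical (the column itself is produced by the app during the interactive session):

```r
# Illustrative structure of annotated output: datacleanr appends a character
# column named .annotation to the input data
x <- iris
x$.annotation <- ""                         # empty annotation by default
x$.annotation[c(15, 16)] <- "sensor error"  # labels added interactively

# annotated rows can then drive subsequent, scripted processing
annotated <- x[x$.annotation != "", ]
nrow(annotated)  # 2
```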
Fig 7

Overview of visual cleaning tab for Berlin tree data.

The plot shows tree age (x) and diameter (y), resolving nesting by district and species. All 120 groups are displayed (see A, and figure legend). Potential outliers are obvious and easily highlighted and annotated for later reference (B, C). The dense point cloud comprises nearly 320000 observations and requires further inspection.

Fig 8

Identification of problematic data structures.

Closer inspection of the tree dataset using the grouping structure (A) to highlight/hide specific groups. The obvious, near-perfect linear relationship between tree age and diameter at breast height requires further inspection. Concerning data points are easily selected leveraging the interactivity of the visualization by clicking or with a lasso tool (B, see inset).

Extract recipe

Reproducibility requires that any analysis step can be recovered, comprehended, and executed identically, repeatedly, and independently of the user. The datacleanr package caters to this by translating every processing action (filtering, highlighting or annotating) into R code on the Extract Recipe tab, which can be copied or sent directly to an active RStudio® [R development environment, 29] session (Fig 1D; also see animation at https://doi.org/10.5281/zenodo.6469767). This code represents a recipe to reproduce the interactive processing, and survives the interactive session; subsequent analysis steps can thus include and build on the recipe (i.e., code script) for generating quality-controlled data. The dcr_app can also be launched with a file path to an *.RDS file on disk, rather than an object in R's environment (i.e., memory). In this case, additional saving options are available for adjusting output names and file locations (Fig 4). This is currently recommended for data requiring extensive annotation, which would result in code scripts of considerable length. However, the option will be made available for both modes (file path, object) in a future version.
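For illustration, a hypothetical recipe script might look as follows; the filter threshold and row indices are invented for this sketch, as actual recipes are generated from the interactive session:

```r
# Hypothetical "reproducible recipe": a plain R script that repeats the
# interactive processing on the original input data
x <- iris

# 1) filtering step, as defined on the Filtering tab
x <- x[x$Sepal.Length > 4.5, ]

# 2) removal of interactively selected observations (indices are illustrative)
selected_rows <- c(3, 7)
x <- x[-selected_rows, ]

# 3) annotations added on the visualization tab (appended column)
x$.annotation <- ""

nrow(x)  # quality-controlled data for subsequent analyses
```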
Fig 4

Code recipe extraction.

Options provided in the Extract Recipe tab for defining and saving outputs when datacleanr is launched with a file path. Copying or sending the recipe (i.e. R code) to an active RStudio® [29] session is always possible.

Observations that were flagged by prior outlier detection or data processing have distinct symbols in interactive visualizations; this is enabled by adding (or renaming) a logical (TRUE/FALSE) flagging column named .dcrflag before launching dcr_app(). The reproducible code recipe can be used as a step following or preceding additional analyses.
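A sketch of preparing such a flagging column before launching the app; the z-score rule below is an arbitrary stand-in for any external outlier algorithm, and only the .dcrflag column name is taken from the text above:

```r
# Flag outliers externally, then pass them to the app via .dcrflag
x <- iris
z <- abs(scale(x$Sepal.Width))   # standardized deviations (example rule only)
x$.dcrflag <- as.vector(z > 3)   # logical TRUE/FALSE column expected by the app

sum(x$.dcrflag)  # flagged observations receive distinct symbols in the GUI
# datacleanr::dcr_app(x)
```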

Interoperability with external packages and algorithms. Interoperability is achieved with pre- and post-processing (in R) by two means: externally derived quality flags can be passed into the application via the logical .dcrflag column, and the application's outputs (the code recipe and the .annotation column) can feed into subsequent scripted analyses.

datacleanr can hence be embedded in R-based workflows, drawing on the existing strengths of R's extensive ecosystem of packages to increase flexibility and, ultimately, productivity. For instance, a script-based workflow applying widely-used R packages for reading, "wrangling" and cleaning data, such as readr [32], dplyr [33], tidyr [34] and lubridate [35] from the tidyverse [36], as well as janitor [37], can be complemented with datacleanr's interactivity. In addition, due to the script and file-based output, datacleanr can also be included in workflow management tools such as drake [38] and workflowr [39].
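A minimal sketch of such an embedded workflow; base R stands in for readr/dplyr here so the example is self-contained, and the interactive step is commented out:

```r
# Scripted pipeline with an interactive cleaning step in the middle
infile <- tempfile(fileext = ".csv")
write.csv(iris, infile, row.names = FALSE)

# 1) read and wrangle
x <- read.csv(infile)
x <- x[!is.na(x$Sepal.Length), ]

# 2) interactive exploration and cleaning (opens the GUI):
# x <- datacleanr::dcr_app(x)

# 3) subsequent, fully scripted analysis
fit <- lm(Petal.Length ~ Sepal.Length, data = x)
coef(fit)
```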

Efficiency and batching

datacleanr has been extensively tested on mobile and desktop workstations (Windows 10 and 11, Ubuntu 19.10) considered medium to high-end, and is easily capable of processing and displaying more than 1 million observations simultaneously. A speed test with outlier selections at excessive and improbable scales indicated comfortable response times for most user scenarios (Fig 5).
Fig 5

Speed test of visualization and data selection on synthetic data (n = 250000).

In 25 consecutive steps 10000 (additional) observations were selected. This was repeated three times (points) on low and high CPU-power settings, and processing time was determined using profvis [40], with bands plotted as visual aids. The inset shows processing totals (mean, min, max) after completing all 25 selections. Even with selections representing unlikely outlier numbers, the application remained responsive and appreciably fast.

Speed test of visualization and data selection on synthetic data (n = 250000).

In 25 consecutive steps 10000 (additional) observations were selected. This was repeated three times (points) on low and high CPU-power settings, and processing time was determined using profvis [40], with bands plotted as visual aids. The inset shows processing totals (mean, min, max) after completing all 25 selections. Even with selections representing unlikely outlier numbers, the application remained responsive and appreciably fast. Nevertheless, the notable limiting factors for processing are number of columns (in exploratory summary) and the grouping structure during plotting. That is, a large number of unique groups will slow down the visualization, and we recommended aiming for a maximum of around 100 groups per datacleanr run. datacleanr::dcr_app() returns processed results to the active R session. Hence, multiple datasets can be processed in batch and results (including code) saved for subsequent use. This is especially helpful when data nesting structures are too deep (e.g., ≫100 groups), or datasets too large (approximately above 2 million observations) to handle in one sitting. In these cases, a split-combine approach is recommended: # prepare data into species sub-sets iris_split <- split(x = iris, f = iris$Species) # run for each species dcr_iris <- lapply(iris_split, function(split){ datacleanr::dcr_app(split) }) Similarly, a list of file paths to datasets can be supplied (see help in R via? datacleanr::dcr_app()).
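The split-combine approach can then be completed by recombining the processed groups; in this sketch a placeholder cleaning function stands in for the interactive datacleanr::dcr_app() call:

```r
# Full split-process-combine batch run (placeholder instead of the GUI call)
iris_split <- split(x = iris, f = iris$Species)

process_one <- function(split) {
  # placeholder for: datacleanr::dcr_app(split)
  split[split$Sepal.Width > 2, ]
}

dcr_iris <- lapply(iris_split, process_one)

# recombine processed groups for subsequent analyses
iris_clean <- do.call(rbind, dcr_iris)
nrow(iris_clean)
```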

Case studies

Below are two use-cases illustrating the utility of the interactive approach adopted in datacleanr.

Identifying structure and artefacts in nested data

High-volume data with nested hierarchical structure (i.e. observations grouped at many levels) are difficult to explore and process, especially when obtained from secondary sources, where the data-generating process or the collection method are not fully known [e.g., see 41]. In such cases greater care is required, as unexpected or erroneous structures and artifacts can be present, especially when these vary at group level. Here, interactively engaging with the data can expedite processing while increasing confidence. As an example of dealing with such scenarios in datacleanr, we present a subset of nearly 320000 city and park trees listed in Berlin's (DE) green infrastructure registry (https://daten.berlin.de/, Strassenbäume, Anlagenbäume), focusing on mensuration data from the 10 most-frequent species across all 12 districts. These data were collected by different agencies or contractors (within and between districts), and are nested at multiple levels (district, street/park). For convenient exploration with datacleanr, the data is grouped by district and species (Fig 6), giving 120 sub-groups.
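The grouping for this case study can be sketched with synthetic stand-in data; the column names BEZIRK and ART_BOT follow the case study, while the values are invented:

```r
# Synthetic stand-in for the Berlin registry subset
set.seed(1)
trees <- data.frame(
  BEZIRK  = sample(paste0("district_", 1:12), 1000, replace = TRUE),
  ART_BOT = sample(paste0("species_", 1:10), 1000, replace = TRUE),
  age     = sample(5:150, 1000, replace = TRUE)
)

# grouping by district and species, as set on the Set-up and Overview tab
n_groups <- nrow(unique(trees[, c("BEZIRK", "ART_BOT")]))
n_groups  # up to 12 x 10 = 120 sub-groups
```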
Fig 6

Example set-up for hierarchically nested data.

Grouping structure set to district (BEZIRK) and species (ART_BOT) in “Set-up and Overview” tab for subsequent exploration, plotting and cleaning.

Potential outliers are readily identified and annotated in a bivariate plot of tree age and diameter (Fig 7). Note, these observations could also be captured using threshold filters.

Upon cycling through the set groups, however, additional structures are apparent in the district of Neukölln for Quercus robur L. (Fig 8), among others. These structures would only be apparent if individual visualizations (at least 120, and potentially at variable zoom) had been generated. Yet, they could not be removed easily with threshold filters and would require a high level of effort to address with manual or automated code-based processing.

Contrastingly, with scroll and zoom in datacleanr, problematic observations are efficiently selected and annotated. Such observations could be erroneous, and, for example, pose an issue in hierarchical modeling, if an entire group structure is affected (e.g., random effect at park or street level). We take this opportunity to explicitly urge users to make extensive use of the annotation feature on the visualization tab to provide rich information on the selected observations (e.g., "interpolated observations"), as well as to adhere to best practices and transparency in outlier assessment and handling [e.g., 42] for any subsequent removal.

Retaining more data from time series with interactive cleaning

Time series data, e.g., from ecophysiological monitoring, can be messy due to instrument drift, response lags, power issues, etc. In high volumes, messy data may call for pragmatic decisions, such as indiscriminately removing entire periods, if interactive processing tools are not available. Such decisions may be owed either to time constraints for detailed manual processing, or to automated approaches that do not accommodate unexpected processes and resulting observations. Interactive processing with datacleanr, both after automated quality control and in the first instance, allows inspecting high-volume data at high resolution and identifying the impact of erroneous data points. Consequently, individual problematic observations, rather than entire periods, can be flagged and removed after careful consideration. We provide an example of manual (code-based period filtering) vs. interactive processing with datacleanr of an unpublished (in prep.) time series of raw sap flow data from the TERENO North-East Observatory [Müritz National Park, 43]. Note, in both cases due diligence and best practices were applied. With datacleanr more observations were retained, as individual points, not only periods, could be removed. Consequently, resulting gaps were shorter, could be gap-filled, and potentially provide a higher level of insight (Fig 9). Further, the processing time decreased from approximately 2 hours by a skilled R-user to under 15 minutes for the entire series.
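The difference between the two strategies can be illustrated with a synthetic series; all numbers below are invented and this is not the sap flow data from the study:

```r
# "Coarse" period-based vs. point-wise filtering of a noisy time series
set.seed(42)
n <- 24 * 10                              # ten days of hourly observations
flow <- sin(seq_len(n) / 24 * 2 * pi) + rnorm(n, sd = 0.05)
spikes <- sample(seq_len(n), 12)          # scattered sensor errors
flow[spikes] <- flow[spikes] + 5

# coarse approach: drop every day containing at least one spike
day <- rep(1:10, each = 24)
bad_days <- unique(day[spikes])
coarse_kept <- sum(!day %in% bad_days)

# point-wise approach: drop only the flagged observations
pointwise_kept <- n - length(spikes)

c(coarse = coarse_kept, pointwise = pointwise_kept)  # point-wise retains more
```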
Fig 9

Comparison of code-based and interactive processing with datacleanr of raw sap flux data.

Compared to the often more tedious, code-based filtering, interactive quality control using domain expertise allowed retaining more observations, resulting in greater data coverage across days (x-axis) for the measurement campaign, which lasted a total of 2769 days. Here, three additional full days of measurements, as well as several additional days with varying partial coverage, were retained, as indicated by the text labels to the right of the bars (completeness by day; e.g., 25% of all measurements for a given day). This is because individual, problematic observations could be removed interactively (see inset), which may increase explanatory power in subsequent analyses.
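A "completeness by day" metric of this kind can be computed along the following lines; this is a hedged base-R sketch with invented values and an assumed 30-minute measurement resolution (48 readings per day), not the code used for the figure.

```r
# Hypothetical retained readings after cleaning (names and values invented)
retained <- data.frame(
  day  = as.Date(c("2016-07-01", "2016-07-01", "2016-07-01", "2016-07-02")),
  flux = c(5.1, 5.2, 5.0, 4.9)
)
expected_per_day <- 48  # assumed 30-min resolution

# Count retained readings per day and express them as percent coverage
completeness <- aggregate(flux ~ day, data = retained, FUN = length)
names(completeness)[2] <- "n_retained"
completeness$pct <- round(100 * completeness$n_retained / expected_per_day, 1)
completeness$pct  # 6.2 and 2.1 percent coverage for the two toy days
```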


Future developments

Development will continue to enhance performance and incorporate user feedback. Additional improvements are planned and will be implemented in upcoming versions. These include: 1) saving and loading processing progress within the application; 2) pre-selecting groups for plotting to reduce loading times; and 3) a toggle to display filtered data (from the Filter tab) in visualizations for easier assessment of filters. Further, 4) a more convenient method for gracefully handling data selections from multiple groups in the interactive visualization will be added; this is particularly helpful when, for instance, problematic observations cluster in similar plot regions. Lastly, 5) additional options for data input via database connections and internal splitting of (large) data sets will be added.

Conclusion

Exploration and processing of high-volume data can be enhanced by using interactive tools. datacleanr achieves this with its flexibility and interoperability, while facilitating best practices in data exploration, outlier detection, and, especially, reproducibility through the extractable code recipe. While we acknowledge the place for and utility of fully automated processing pipelines, we are certain that freely available, interactive tools will improve researchers’ and analysts’ necessary engagement with their data and, consequently, increase confidence in their results. Further, we believe datacleanr’s design will increase the productivity of both technically proficient users and those with limited programming ability. For this, we ensured it fits seamlessly into existing, script-based analysis pipelines and can also be used as a stand-alone tool by a wide audience. Lastly, we hope datacleanr will be of great use to the scientific community, including ecology, Earth system sciences, and fields working with spatial and temporal data in general. We encourage users to provide feedback and suggestions in the dedicated repository to drive the continued development of the application.
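As an illustration of what fitting into a script-based pipeline can look like, the sketch below mimics the shape of an extracted recipe: plain R that re-applies a filter statement and removes interactively selected rows. It is not datacleanr's verbatim output; all names, indices, and values are invented.

```r
# Illustrative stand-in for an extracted "reproducible recipe" (hypothetical)
sap <- data.frame(flux = c(5.0, 5.2, -99, 5.1, 87.3))

# 1) Filter statement, as entered on the Filter tab (valid R, TRUE/FALSE)
sap_f <- sap[sap$flux > 0, , drop = FALSE]

# 2) Remove observations selected interactively on the visualization tab
selected  <- 4                       # e.g., the implausible spike of 87.3
sap_clean <- sap_f[-selected, , drop = FALSE]
nrow(sap_clean)  # 3 observations remain
```

Because the recipe is ordinary R code, it can be version-controlled and re-run alongside the rest of an analysis, which is what makes the interactive session reproducible.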

Animated walk-through.

An overview of the package’s functionalities with animated examples of every feature. (HTML)

15 Feb 2022
PONE-D-21-05607
Addressing the need for interactive, efficient and reproducible data processing in ecology with the datacleanr R application
PLOS ONE

Dear Dr. Hurley,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. The reviewers raised a number of concerns with your study, in particular the ease of use of the program and the lack of a walk-through analysis, as well as some points to improve clarity. Their comments can be viewed in full, below and in the attached file. Please submit your revised manuscript by Mar 31 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Natasha McDonald, PhD Associate Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Your abstract cannot contain citations. 
Please only include citations in the body text of the manuscript, and ensure that they remain in ascending numerical order on first mention. 3. We note that Figure 3 in your submission contain map images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright. We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission: a. You may seek permission from the original copyright holder of Figure 3 to publish the content specifically under the CC BY 4.0 license. We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text: “I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.” Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission. 
In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].” b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only. The following resources for replacing copyrighted map figures may be helpful: USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/ The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/ Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/ Landsat: http://landsat.visibleearth.nasa.gov/ USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/# Natural Earth (public domain): http://www.naturalearthdata.com/ 4. We note that Figures 4, 6, 7 and 8 in your submission contain copyrighted images. 
All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright. We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission: a. You may seek permission from the original copyright holder of Figures 4, 6, 7 and 8 to publish the content specifically under the CC BY 4.0 license. We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text: “I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.” Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission. In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].” b. 
If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only. 5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: N/A Reviewer #2: N/A ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? 
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: Addressing the need for interactive, efficient and reproducible data processing in ecology with the dataclearnr R application The manuscript introduces an R package for data processing of large ecological data sets. It gives an overview of the package's functionality. I think that such an R package is useful, especially for the scenarios the authors have pointed out (large data sets with many different scales / scopes, large monitoring data sets), for quick visualizations, outlier and problem detection. I only have a few suggestions for improving the manuscript: 1. 
line 139 Figure 1 caption: Sometimes the authors refer to specific functionality that the reader would only really understand when running the package, e.g. line 149 after "Set and Start" is clicked. This detail is perhaps too technical for this overview. 2. line 151 via 29 .... would be more useful if it named the package directly 3. line 156: not sure what 'through text cues on the tab' refers to 4. Figure 2 and caption: To me it is not entirely clear what is displayed here. E.g. what is the difference between left and right, what the n = .. refer to? Make clearer that the grey areas correspond to points which are filtered out?' 5. line 167 --- 'any statement': make clearer that this refers to the filtering statement, it would be useful to add that this filtering statement requires R code 6. line 169: insert 'the' before 'following' 7. line 185: more useful to refer directly to 'plotly' 8. line 186: insert 'be' before 'displayed' 9. line 197: clarify what 'key feature' refers to. I think it refers to the visualization tab, but this is not clear. 10. line 199: correspond to 11. line 242, Figure 5 caption:not sure what 'processing totals' refer to. Why (n=3) after means? Does this imply that there were 3 runs at every setting (number of points)? 12. line 270: remove comma after high-volume 13. Figure 9 and caption (line 331): I cannot see a difference between figures A and B. Therefore, is A necessary. If there are differences, maybe highlight these or point out. Also, I am not sure how to understand panel C. 'including' in caption misplaced? Reviewer #2: The paper describes an R package which provides an interactive graphical user interface to identify outliers and clean the data. It has one extremely important feature, i.e. it returns the R code for doing the filtering and the cleaning of the data, i.e. therefore making it transparent and reproducible. 
This is an essential feature, which bridges the gap between purely code based outlier removal and interactive outlier identification. But as the R script is effectively a script adding a column identifying if a data point was identified as an outlier or not, it would be very useful to also generate a report which includes e.g. the graphs in which the outliers were identified and the filtering rules as an html or pdf which can be added to the data as a justification why the points were identified as outliers. In the ideal case (future development?) I would suggest a config file (yaml?) which contains all the info and settings used, and when loaded, loads the data and applies all settings from the previous session. Unfortunately, I was only able to test the app after quite a bit of trying, as no walk-through of the data analysis is provided. This is a pity, as all the data is available in appropriate licenses and all that would be needed is to supplement the manuscript with boxes showing the settings for each step. This relatively easy addition would make the manuscript much more approachable. Overall, I have added a number of comments to the pdf document (attached) which can be easily adressed. As mentioned, my main concern is the missing of a walk-through through one analysis. I would suggest that this relatively straight forward addition / change is done before publication. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. 
Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
Submitted filename: PONE-D-21-05607_reviewer.pdf Click here for additional data file. 1 Apr 2022 Dear Editor and referees, Thank you for the chance to improve our manuscript; your comments were highly appreciated. I kindly refer you to the attached file "Response to Reviewers.docx" for a formatted version of our rebuttal, but have copy-pasted it below for completeness. Sincerely, Alexander Hurley ______ Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package Response to reviewers Editor The reviewers raised a number of concerns with your study, in particular the ease of use of the program and the lack of a walk-through analysis, as well as some points to improve clarity. Their comments can be viewed in full, below and in the attached file. We thank the editor and reviewers for their time and the opportunity to improve the manuscript and thus the (initial) user experience of datacleanr. We have carefully considered the concerns raised by the referees and included a detailed walk-through analysis to further enhance the usability of the presented package. With these suggestions we feel that the manuscript has significantly improved. Reviewer 1 The manuscript introduces an R package for data processing of large ecological data sets. It gives an overview of the package’s functionality. I think that such an R package is useful, especially for the scenarios the authors have pointed out (large data sets with many different scales / scopes, large monitoring data sets), for quick visualizations, outlier and problem detection. We thank the reviewer for their assessment and recognition of our tool’s utility. I only have a few suggestions for improving the manuscript: We have addressed all comments by implementing the suggested changes or by providing a justification for the issue at hand in its current state. An overview thereof is given below. *1. 
line 139 Figure 1 caption: Sometimes the authors refer to specific functionality that the reader would only really understand when running the package, e.g. line 149 after “Set and Start” is clicked. This detail is perhaps too technical for this overview. We improved the wording of technical details here, enhanced Fig 1 caption (package overview, L165 in response) to better guide the reader, and added a SI File with animated examples for a walk-through, as also suggested by Reviewer 2. The section mentioned above now reads: “The structure is available during targeted filtering (scoping; see Filtering) and visual cleaning (see Figs 6-8). Once the grouping is set, a dataset summary can be generated via the package summarytools [29], highlighting duplicates, missingness, and distribution of each variable.” As noted above, the caption of Fig 1 has been improved significantly, in accordance with Reviewer 2’s suggestions (see in respective section further below, L165), which we believe allows the reader to follow the descriptions more readily. *2. line 151 via 29 …. would be more useful if it named the package directly Adjusted to “..summary can be generated via the package summarytools [29]..” *3. line 156: not sure what ‘through text cues on the tab’ refers to The section in question now reads: “The application’s interactivity allows reviewing the impact of filters through a text note highlighting the percentage of removed data and an overview table showing the remaining observations (per group), …” *4. Figure 2 and caption: To me it is not entirely clear what is displayed here. E.g. what is the difference between left and right, what the n = .. refer to? Make clearer that the grey areas correspond to points which are filtered out?’ We appreciate the caption was not detailed enough and have adjusted it to: “Figure 2: Example of the impact of statistical filtering on bivariate relationships between trait data from BAAD [30]. 
A percentile threshold filter (0.01 and 0.99) is used to remove extreme low and high values on the x-variable across its full space (left) or scoped to groups represented by functional types (right). The gray shading indicates the filtered variable space (full or scoped), while text labels and black points count and highlight, respectively, individual observations captured by the applied filter. Note, the figure was not generated in datacleanr.” We are confident that the adjusted caption sufficiently explains the figure and highlights the difference between full and scoped filtering adequately. *5. line 167 — ‘any statement’: make clearer that this refers to the filtering statement, it would be useful to add that this filtering statement requires R code The text was adjusted to: “Any filtering statement (provided as valid R code) which evaluates to TRUE or FALSE..” *6. line 169: insert ‘the’ before ‘following’ Done. *7. line 185: more useful to refer directly to ‘plotly’ Done. *8. line 186: insert ‘be’ before ‘displayed’ Done. *9. line 197: clarify what ‘key feature’ refers to. I think it refers to the visualization tab, but this is not clear. This was changed to: “A key feature on the visualization tab is the grouping structure table…” *10. line 199: correspond to Done. *11. line 242, Figure 5 caption: not sure what ‘processing totals’ refer to. Why (n=3) after means? Does this imply that there were 3 runs at every setting (number of points)? The caption was adjusted to: “Figure 5: Speed test of visualization and outlier selection on synthetic data (n = 250000). In 25 consecutive steps 10000 (additional) points were selected. This was repeated three times on low and high CPU-power settings, and processing time was determined using profvis [35]. Bands represent minimum and maximum durations, and points are means of the three replicates. The inset shows processing totals (mean, min, max) after completing all 25 selections. 
Even with unlikely outlier numbers, the application remained responsive and appreciably fast.” *12. line 270: remove comma after high-volume Done. *13. Figure 9 and caption (line 331): I cannot see a difference between figures A and B. Therefore, is A necessary. If there are differences, maybe highlight these or point out. Also, I am not sure how to understand panel C. ‘including’ in caption misplaced? We appreciate this issue, and have previously tried different versions of the figure with overplotting and offsetting in a single panel. However, neither option was fully satisfactory. This was either due to the same issue (distance between lines) or overplotting due to the time scale. We do want to emphasize the entire time series to highlight the package’s capability of dealing with fairly large/high-resolution data here for one processing example, and have used the caption to better highlight visible differences, , although these admittedly require increased attention. The caption now reads: “Figure 9: Comparison of code-based (A) and interactive processing (B) of raw sap flux data. Compared to often more tedious, code-based filtering, interactive quality control using domain expertise allowed retaining more observations resulting in greater data coverage. Here, three additional full days of measurements, as well as several days with varying partial coverage were retained (C, completeness by day; e.g. 25 % of all measurements for a given day). For example, differences are found in 2014 and 2017, where the code-based filtering removes entire periods (i.e., days, weeks). By contrast, individual, problematic observations could be removed interactively (D), which may increase explanatory power in subsequent analyses.” Reviewer 2 *It has one extremely important feature, i.e. it returns the R code for doing the filtering and the cleaning of the data, i.e. therefore making it transparent and reproducible. 
This is an essential feature, which bridges the gap between purely code based outlier removal and interactive outlier identification. We thank the reviewer for the recognition of datacleanr’s utility and the thorough review. *But as the R script is effectively a script adding a column identifying if a data point was identified as an outlier or not, it would be very useful to also generate a report which includes e.g. the graphs in which the outliers were identified and the filtering rules as an html or pdf which can be added to the data as a justification why the points were identified as outliers. We discussed similar features for PDF or HTML reports during datacleanr’s development. We agree that a justification or annotation for conspicuous data is not only useful but necessary. This is why we implemented the annotation feature in the interactive data selection. This gives users the freedom to use a set of self-defined annotations or tags for different scenarios (e.g., “high value,” “battery failure,” etc.). These annotations are stored in the .annotation column, and allow to not only identify selected points in a boolean manner, but also provide the user-specified annotation. From our own use of datacleanr we have concluded that this approach affords flexibility down the line to implement case-specific solutions, such as bespoke graphs and tables, which can be included in user-specific outputs (e.g., PDF or HTML reports based on RMarkdown). We also would like to emphasize that the reproducible recipe provides all the necessary information for such reports - or even as a stand-alone overview - if the data selection and annotation is done with due diligence. *In the ideal case (future development?) I would suggest a config file (yaml?) which contains all the info and settings used, and when loaded, loads the data and applies all settings from the previous session. 
This is most certainly planned as a future development, and will likely rely on the new shiny caching feature, rather than a yaml config file – we will explore the latter option as well and appreciate the idea/pointer, however. As we recognize the importance of this feature it is also the first we mention in this section (with updated wording): “Additional improvements are planned and will be implemented in upcoming versions. These include: 1) saving and loading processing progress within the application …” *Unfortunately, I was only able to test the app after quite a bit of trying, as no walk-through of the data analysis is provided. This is a pity, as all the data is available in appropriate licenses and all that would be needed is to supplement the manuscript with boxes showing the settings for each step. This relatively easy addition would make the manuscript much more approachable. Thank you for highlighting this. We have included the package’s readme file (https://github.com/the-Hull/datacleanr or on CRAN as https://cran.r-project.org/web/packages/datacleanr/readme/README.html) with animated examples as an SI file (modified to comply with the 20 MB file restriction), which we consider even more instructive than a walk-through with screenshots only. It includes details on installation, capabilities, and use, with in-depth examples of every feature. We note this in the “Capabilities section” to prime the reader. The paragraph now reads: “datacleanr is an interactive R package for processing high-volume data, and it caters to best practices in data exploration, processing, and reproducibility. This section describes the general capabilities of the package, and an in-depth walk-through of all functionalities is provided with animated examples in S1 File).” *Overall, I have added a number of comments to the pdf document (attached) which can be easily adressed. Thank you for the thorough comments. We have adjusted the text accordingly. 
A list of noteworthy alterations or responses to comments in the PDF beyond simple text adjustments:

- The title of the article was changed to read "package" rather than "application," and it is referred to as such throughout the article now.

- Added DOI (https://doi.org/10.5281/zenodo.6337609) and version number (v1.0.3) to the "Availability" section.

- Updated the data availability statement to include the "latest version" Zenodo DOI, for which the archive now contains CSV data only: https://doi.org/10.5281/zenodo.4550726

- Instead of noting an explicit example with respect to financially inaccessible software, we rephrased to "..if convenient tools do not exist or are financially inaccessible due to commercial licensing".

- tibbles can be supplied to dcr_app(); the help documentation for dcr_app() notes that the data can be a data.frame, tbl (tibble), or data.table.

- The caption for Figure 1 was modified to address all processing modules. It now reads: Figure 1: Conceptual workflow for datacleanr across its four processing modules. A) The Set-up and Overview tab allows for a quick initial assessment of a data set (variable types, distribution, completeness), where nested structures (e.g., by plot, site, region) can be resolved by defining a grouping structure from a categorical data column. B) The Filtering tab allows sub-setting the data based on valid R code (logical statements), which can be targeted (i.e., "scoped") to individual groups from A). C) The Visual Cleaning and Annotating tab allows rapidly generating two- or three-dimensional visualizations (X, Y, and point size), while dividing the data set into the groups specified in A); data points for further inspection can be identified by clicking or lasso selection, through which annotations may also be added. An overview table and histogram highlight selected points and the potential impact on the data's distribution should the selected observations be removed.
D) The Extract recipe tab generates code to reproduce all processing steps, which can be copied to the clipboard or sent directly to an active RStudio session's script; depending on the processing mode (in memory or from a file), additional settings for file name specification are available. The schematic here illustrates the potential for integrating datacleanr into an existing workflow, for example, with prior determination of outliers using external algorithms (requires appending a logical TRUE/FALSE column named .dcrflag), interactive exploration and processing (with datacleanr), and informing subsequent analyses by drawing on the interactively annotated data (.annotation column in output from datacleanr).

- We considered your comment on using text-based data objects, as opposed to binary files. We strongly feel that the binary format for use with datacleanr is necessary, as it is the only way to ensure that data input/output does not alter data types, for example from factor or time to character, which may require that additional bespoke code be added to the extracted recipe. We cannot ensure that this would be done correctly during generation of the recipe, and thus prefer the *.RDS format. We appreciate the limitation on universal data input/output, however.

*Is the cleaning (or any other aspect of this app) parallelised, and could you gain a substantial increase if you do this?

Currently, the data selection (point clicking and lasso selection) is implemented by carrying point indices and plotly trace numbers (i.e., group numbers) in a data.frame. The limiting factors are checking for duplicates and redrawing the "outlier" trace (which is plotly's equivalent of a ggplot2 geom). In fact, due to plotly's mechanics, the entire plot needs to be redrawn or refreshed when traces are manipulated.
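The .dcrflag step from the caption could be sketched as follows; the z-score rule and the column name `value` are illustrative assumptions, not part of datacleanr:

```r
library(datacleanr)

# Hypothetical example: flag candidate outliers with an external
# rule before launching the app. 'my_data' stands in for any
# data.frame, tibble, or data.table with a numeric column 'value'.
df <- my_data

# Simple z-score rule (assumption; any external algorithm works)
z <- (df$value - mean(df$value, na.rm = TRUE)) /
  sd(df$value, na.rm = TRUE)

# datacleanr recognizes a logical column named exactly '.dcrflag'
# and highlights these points during interactive inspection
df$.dcrflag <- abs(z) > 3

dcr_app(df)  # inspect, annotate, and extract the reproducible recipe
```

Any pre-computed logical vector can be appended this way, so the interactive session starts with the externally flagged candidates already visible.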
Packages such as multidplyr may be interesting for subsetting individual groups in the Filtering tab, but those operations are certainly not a bottleneck in processing time. We do, however, strive to enhance the user experience in the future by further decreasing processing time through streamlining the above-mentioned data.frame operations.

- We appreciate you highlighting the necessity of providing information on why observations were considered conspicuous, erroneous, or outliers, and have rephrased line 310 (PDF) to: "We take this opportunity to explicitly urge users to make extensive use of the annotation feature on the Visualization tab to provide rich information on the selected observations (e.g., "interpolated observations"), as well as to adhere to best practices and transparency in outlier assessment and handling [e.g., 37] for any subsequent removal."

- Comments on future developments were:

+ maybe included but not mentioned here: generate a report which can include comments on why certain decisions were taken - THIS WOULD BE A VERY USEFUL FEATURE. See previous comments on the annotation tool and flexibility (L111 in response).

+ loading from other file formats: csv, txt, databases (DBI). See previous comments on file formats; database connections where data types are preserved could be a viable option, however, which we will consider for future developments.

+ parallelisation of processing. See previous comments (L199 in response).

+ include splicing in the app (probably loading data into an sqlite database and querying subsets out)? We have included this in the list of future developments within the text (L401 in Revised Manuscript with Track Changes.docx).

+ I don't know if it is possible - reduce the dependencies. We have done our best to keep dependencies as low as possible, but shiny and plotly are rather heavy. However, we still strive to drop dependencies where possible and will aim to do so in the future, e.g., for new developments.
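The type-preservation argument for *.RDS can be illustrated with a small base-R round trip (a generic sketch, not datacleanr-specific code):

```r
# Factors and timestamps survive an RDS round trip unchanged ...
df <- data.frame(
  site = factor(c("A", "B")),
  time = as.POSIXct(c("2022-04-13 10:00", "2022-04-13 11:00"),
                    tz = "UTC")
)

saveRDS(df, "df.rds")
str(readRDS("df.rds"))   # site is still a factor, time still POSIXct

# ... whereas a CSV round trip silently coerces both columns,
# requiring bespoke re-parsing code in any extracted recipe
write.csv(df, "df.csv", row.names = FALSE)
str(read.csv("df.csv"))  # site and time read back as character
```

This is the loss the response refers to: the data survive a text round trip, but their classes do not, so a recipe replayed on the CSV would operate on different types than the interactive session did.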
+ add vignettes to the package (possibly this paper? License?). We appreciate the utility of vignettes. However, the GitHub repository (https://github.com/the-Hull/datacleanr) has an extensive set of examples with GIFs, which we believe are better suited to highlighting the package's functionality; this now constitutes the walk-through in the supporting information, in a somewhat reduced fashion to meet the <20 MB file size requirements. We would also like to note that CRAN regularly enforces a <5 MB package size policy, and we believe the GIFs in the README do better justice to the package's functionalities than a smaller screenshot-based document. The license for the package is GPL-3 and is listed in the Availability section.

+ In response to the comment on enhancement of data analyses through interactivity, we rephrased the first sentences of the conclusion to: "Exploration and processing of high-volume data can be enhanced by using interactive tools. datacleanr achieves this with its flexibility and interoperability, while facilitating best practices in data exploration, outlier detection, and especially reproducibility through the extractable code recipe."

*As mentioned, my main concern is the missing walk-through of one analysis. I would suggest that this relatively straightforward addition/change is made before publication.

We appreciate this concern and are confident that our animated examples in the Supporting Information file S1 File (Walkthrough.html) sufficiently address it.

Submitted filename: Response to Reviewers.docx

13 Apr 2022
PONE-D-21-05607R1
Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package

PLOS ONE

Dear Dr. Hurley,

I took over the editorial role for your manuscript at PLOS ONE and would like to acknowledge my role as Reviewer 2 in the last round of review. I acknowledge the extremely long time since your initial submission, but I was only assigned the role of editor for this paper a few days ago and will do everything possible to bring this paper to publication as soon as possible. A few comments are listed below in this letter.

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Thanks a lot for addressing all points raised by the reviewers. I have only a few minor points which should be addressed before publication, and some more general comments:

Text: Please include a reference for RStudio, as is done for e.g. R.

Figure 2 caption: clarify what n means.

Sort figures according to their first occurrence in the manuscript (i.e., 6-8 after Fig 1).

Availability section: Start by stating which version you are using in the package and mention the DOI - then mention the stable release and where one can obtain it (DOI to newest version, 10.5281/zenodo.6337608, CRAN), and finally the GitHub repo as the newest version. Probably add one sentence about release plans: is CRAN always the latest stable, and is the GitHub master/main branch stable? Do you have a dev branch for development and possibly unstable versions?

Capabilities, l. 135: add "in the supplemental material" or "LINK TO THE PERMANENT FILE IN GITHUB".

l. 238: "2 million observations smoothly" - is this a hard limit, or a soft limit after which it gets slower but still works? Also, in line 270 you mention one million.
I still think it would be nice to have screenshots in the "Capabilities" section where you refer to the individual steps (you have them in the visualization section), but I appreciate your point that the animated GIFs make this clearer. Can you put direct links to the animated GIFs in the sections they refer to and keep these permanent (e.g., link to a specific tagged release on GitHub)?

l. 264: Add individual references to the individual packages (readr, dplyr, tidyr and lubridate).

Figures:

Figure 5: with three replicates, plotting mean, min, and max (min and max being two of the three points) is an aggressive approach. I would rather leave the shaded area and plot, instead of the mean, the third point. I do not see it as necessary in this context, as the graph is not at all crucial to the paper, but I would regard it as a cleaner approach.

Figure 9 caption: re the comment by reviewer 1 on the difference: I agree - although it is explained in the caption, I still struggle to see where the difference is. Also, I think the actual values of deltaV are not relevant in this graph. As a y axis, you would simply need four categories (from top to bottom): days in the original data, days removed by code-based filtering, days removed by interactive processing, and days gained by using interactive processing. These would be the relevant information, as the values on these days are not relevant based on the caption. If you want to retain the graph as it is, I would strongly suggest having a fourth (or fifth?) colour which highlights the data points retained in addition to the code-based filtering.

SI: I like the SI a lot and the GIFs work brilliantly - thanks.

SI: Example 2 has a missing GIF.

General comments which do not require any action from your side:

Text-based data objects versus RDS: most data is stored in CSV files, as these can be generated from e.g. Excel. So using these as input would be very useful.
Export of results does not need to be lossless, but should only include the relevant results; additional details could be saved as text files, or even as RDS files. Saving should be lossless, so RDS is appropriate there.

CRAN issues - use R-universe for the "full" package, and provide functions in the package to download the additional info and data when needed.

l. 269: If you put the shiny app on a Shiny server, you can use it from all platforms which have a browser (even smartphones). Probably include this in future plans (not relevant to the paper here).

Reading in from databases (sqlite and duckdb come to mind as widely used stand-alone databases, used for larger datasets when standard handling in R is no longer possible) would be extremely useful as a next step.

Please submit your revised manuscript by May 28 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Rainer M Krug, PhD
Academic Editor
PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free.
Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

21 Apr 2022

Dear Editor,

We extend our thanks for your time (also for the previous review) and the opportunity to further improve the manuscript. We have carefully considered the concerns and are confident that we fully address them in this second round of revisions. We kindly refer to the submitted response letter for detailed responses. Note that line references there refer to Revised Manuscript with Track Changes.docx.

Sincerely,
Alexander Hurley, on behalf of all authors

Submitted filename: Response to Reviewers.docx

2 May 2022

Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package

PONE-D-21-05607R2

Dear Dr. Hurley,

I received your revised version today and I am happy with the changes you made. We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.

To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double-check that your user information is up to date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible - no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Rainer M Krug, PhD
Guest Editor
PLOS ONE

5 May 2022

PONE-D-21-05607R2

Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package

Dear Dr. Hurley:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Rainer M Krug
Guest Editor
PLOS ONE
Lindsay Hutley; Andreas Ibrom; Hiroki Ikawa; Marcin Jackowicz-Korczynski; Dalibor Janouš; Wilma Jans; Rachhpal Jassal; Shicheng Jiang; Tomomichi Kato; Myroslava Khomik; Janina Klatt; Alexander Knohl; Sara Knox; Hideki Kobayashi; Georgia Koerber; Olaf Kolle; Yoshiko Kosugi; Ayumi Kotani; Andrew Kowalski; Bart Kruijt; Julia Kurbatova; Werner L Kutsch; Hyojung Kwon; Samuli Launiainen; Tuomas Laurila; Bev Law; Ray Leuning; Yingnian Li; Michael Liddell; Jean-Marc Limousin; Marryanna Lion; Adam J Liska; Annalea Lohila; Ana López-Ballesteros; Efrén López-Blanco; Benjamin Loubet; Denis Loustau; Antje Lucas-Moffat; Johannes Lüers; Siyan Ma; Craig Macfarlane; Vincenzo Magliulo; Regine Maier; Ivan Mammarella; Giovanni Manca; Barbara Marcolla; Hank A Margolis; Serena Marras; William Massman; Mikhail Mastepanov; Roser Matamala; Jaclyn Hatala Matthes; Francesco Mazzenga; Harry McCaughey; Ian McHugh; Andrew M S McMillan; Lutz Merbold; Wayne Meyer; Tilden Meyers; Scott D Miller; Stefano Minerbi; Uta Moderow; Russell K Monson; Leonardo Montagnani; Caitlin E Moore; Eddy Moors; Virginie Moreaux; Christine Moureaux; J William Munger; Taro Nakai; Johan Neirynck; Zoran Nesic; Giacomo Nicolini; Asko Noormets; Matthew Northwood; Marcelo Nosetto; Yann Nouvellon; Kimberly Novick; Walter Oechel; Jørgen Eivind Olesen; Jean-Marc Ourcival; Shirley A Papuga; Frans-Jan Parmentier; Eugenie Paul-Limoges; Marian Pavelka; Matthias Peichl; Elise Pendall; Richard P Phillips; Kim Pilegaard; Norbert Pirk; Gabriela Posse; Thomas Powell; Heiko Prasse; Suzanne M Prober; Serge Rambal; Üllar Rannik; Naama Raz-Yaseef; David Reed; Victor Resco de Dios; Natalia Restrepo-Coupe; Borja R Reverter; Marilyn Roland; Simone Sabbatini; Torsten Sachs; Scott R Saleska; Enrique P Sánchez-Cañete; Zulia M Sanchez-Mejia; Hans Peter Schmid; Marius Schmidt; Karl Schneider; Frederik Schrader; Ivan Schroder; Russell L Scott; Pavel Sedlák; Penélope Serrano-Ortíz; Changliang Shao; Peili Shi; Ivan Shironya; Lukas Siebicke; Ladislav 
Šigut; Richard Silberstein; Costantino Sirca; Donatella Spano; Rainer Steinbrecher; Robert M Stevens; Cove Sturtevant; Andy Suyker; Torbern Tagesson; Satoru Takanashi; Yanhong Tang; Nigel Tapper; Jonathan Thom; Frank Tiedemann; Michele Tomassucci; Juha-Pekka Tuovinen; Shawn Urbanski; Riccardo Valentini; Michiel van der Molen; Eva van Gorsel; Ko van Huissteden; Andrej Varlagin; Joseph Verfaillie; Timo Vesala; Caroline Vincke; Domenico Vitale; Natalia Vygodskaya; Jeffrey P Walker; Elizabeth Walter-Shea; Huimin Wang; Robin Weber; Sebastian Westermann; Christian Wille; Steven Wofsy; Georg Wohlfahrt; Sebastian Wolf; William Woodgate; Yuelin Li; Roberto Zampedri; Junhui Zhang; Guoyi Zhou; Donatella Zona; Deb Agarwal; Sebastien Biraud; Margaret Torn; Dario Papale
Journal:  Sci Data       Date:  2020-07-09       Impact factor: 6.444

7.  Developing a modern data workflow for regularly updated data.

Authors:  Glenda M Yenni; Erica M Christensen; Ellen K Bledsoe; Sarah R Supp; Renata M Diaz; Ethan P White; S K Morgan Ernest
Journal:  PLoS Biol       Date:  2019-01-29       Impact factor: 8.029

8.  DetEdit: A graphical user interface for annotating and editing events detected in long-term acoustic monitoring data.

Authors:  Alba Solsona-Berga; Kaitlin E Frasier; Simone Baumann-Pickering; Sean M Wiggins; John A Hildebrand
Journal:  PLoS Comput Biol       Date:  2020-01-13       Impact factor: 4.475

9.  Creating and sharing reproducible research code the workflowr way.

Authors:  John D Blischak; Peter Carbonetto; Matthew Stephens
Journal:  F1000Res       Date:  2019-10-14

10.  Biospytial: spatial graph-based computing for ecological Big Data.

Authors:  Juan M Escamilla Molgora; Luigi Sedda; Peter M Atkinson
Journal:  Gigascience       Date:  2020-05-01       Impact factor: 6.524

10 in total
