Petra Führding-Potschkat, Holger Kreft, Stefanie M Ickert-Bond.
Abstract
Digital point-occurrence records from the Global Biodiversity Information Facility (GBIF) and other data providers enable a wide range of research in macroecology and biogeography. However, data errors may hamper immediate use. Manual data cleaning is time-consuming and often unfeasible, given that the databases may contain thousands or millions of records. Automated data cleaning pipelines are therefore of high importance. Taking North American Ephedra as a model, we examined how different data cleaning pipelines (using, e.g., the GBIF web application and four different R packages) affect downstream species distribution models (SDMs). We also assessed how the pipeline data differed from expert data. From 13,889 North American Ephedra observations in GBIF, the pipelines removed 31.7% to 62.7% false positives, invalid coordinates, and duplicates, leading to datasets between 9484 (GBIF application) and 5196 records (manual-guided filtering). The expert data consisted of 704 records, comparable to data from field studies. Although differences in the absolute numbers of records were relatively large, species richness models based on stacked SDMs (S-SDM) from pipeline and expert data were strongly correlated (mean Pearson's r across the pipelines: .9986, vs. the expert data: .9173). Our results suggest that all R package-based pipelines reliably identified invalid coordinates. In contrast, the GBIF-filtered data still contained both spatial and taxonomic errors. A major drawback is that no pipeline fully detected misidentified specimens without the assistance of taxonomic expert knowledge. We conclude that application-filtered GBIF data will still need additional review to achieve higher spatial data quality. Achieving high-quality taxonomic data will require extra effort, probably by thoroughly analyzing the data for misidentified taxa, supported by experts.
Keywords: GBIF; automated data cleaning pipelines; data quality; expert data; species distribution modeling
Year: 2022 PMID: 35949539 PMCID: PMC9351331 DOI: 10.1002/ece3.9168
Source DB: PubMed Journal: Ecol Evol ISSN: 2045-7758 Impact factor: 3.167
FIGURE 2 (a–c) North America‐native Ephedra specimens (female specimens with seeds): Ephedra antisyphilitica, E. nevadensis, and E. trifurca (left to right). (d) Examples of taxonomic and spatial errors identified in the Ephedra data. Filter categories of the following markers: False positives. Markers 1, 8, and 9 were specimens from shops in Seattle and Berkeley. Markers 3, 4, 10, and 11 were non‐native species from botanical gardens and scientific institutes. Marker 2 pointed to a North America‐native species at the University of Connecticut, NY. Markers 5 to 7 showed coordinate errors that can only be identified from the verbatim locality description. The species at markers 12 and 13 were misidentified, as the documented species do not occur naturally at these localities. The map data were derived from P1, post‐cleaning (L3, number of co‐occurring species). Color coding of the map: P1 observed distribution (see Figure 4).
Results of the pipelines' data cleaning performance, compared to the P0 benchmark dataset (summary table)
Note: The color‐coded cells of the P1 to P6 datasets indicate the activity of a particular DC tool (color code see below). The blue cells of the P0 benchmark indicate the number of Ephedra records in GBIF, quantified by standardization and error category. Records that did not comply with the standardization conditions or were erroneous in the context of this study were flagged (flg). Since several standardization conditions and errors coincided in the same record, the number of removed records did not correspond to the sum of the identified errors. The P1, P2, and P3 data retrieval tools partially standardized the data and eliminated several errors (“three‐in‐one” tools). Thus, the number of records they retrieved differed significantly from that of P4 to P6 and P0. The removed records in these pipelines could only be reconstructed as differences of subcategories (e.g., in‐scope countries, collection year, null and zero coordinates) in comparison to P0. The difference between P3 and P2 resulted from the added dplyr and CC packages, which increased standardization and removed still more erroneous records. Using the added packages provided more insight into data cleaning.
Abbreviation: CC (→ P3/P6) = R package CoordinateCleaner.
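The note's remark that removed records do not add up to the sum of the identified errors can be illustrated with a small sketch. It is written in Python rather than the R used in the study, and the records, flag names, and the `summarize_flags` helper are hypothetical, not taken from the pipelines.

```python
# Hypothetical records: each carries the set of flags it triggered.
# Flag names are illustrative assumptions, not the study's column names.
records = {
    "rec1": {"zero_coordinates", "out_of_scope_country"},
    "rec2": {"duplicate"},
    "rec3": set(),  # clean record
    "rec4": {"zero_coordinates", "duplicate"},
    "rec5": set(),  # clean record
}


def summarize_flags(records):
    """Return (flag counts per category, number of records removed)."""
    per_category = {}
    for flags in records.values():
        for flag in flags:
            per_category[flag] = per_category.get(flag, 0) + 1
    # A record is removed once, no matter how many flags it carries.
    removed = sum(1 for flags in records.values() if flags)
    return per_category, removed


per_category, removed = summarize_flags(records)
sum_of_flags = sum(per_category.values())
print(per_category, "sum of flags:", sum_of_flags, "records removed:", removed)
```

In this toy case two records carry two flags each, so the flag total exceeds the number of removed records.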
FIGURE 3 Information‐condensing pyramid of the pipelines and the expert data (L1 to L5: Condensing levels of the data). The correlation between datasets increases from the bottom to the top of the pyramid, because each transformation condenses the species occurrence information further. The 704 expert data occurrences (L1) were allocated into 358 grid cells (L2, with a maximum of four co‐occurring species, L3). The correlation of 0.6536 (L4, mean Pearson's r of pairings [P1 to P6/expert]) was compared to the mean of the pairings P1 to P6. At this level (L4), the minimum Pearson's r value of the occupied grid cells from pipeline data was .9920 (pair: P1/P6), and the maximum Pearson's r value was .9999 (pair: P4/P5). At the L5 level, the minimum Pearson's r value was .9951 (pair: P1/P6), and the maximum Pearson's r value was 1.0000 (pair: P4/P5). Dashed box: Expert data comparison numbers, L2 to L4.
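The lower condensing levels of the pyramid (points to grid cells to co‐occurring species, then a correlation between datasets) can be sketched as follows. This is a minimal Python illustration, not the study's R workflow: the one-degree `cell_size`, the treatment of unoccupied cells as zero richness, the helper names, and the toy coordinates are all assumptions.

```python
import math
from collections import defaultdict


def richness_per_cell(occurrences, cell_size=1.0):
    """Allocate (species, lon, lat) points to grid cells and count the
    distinct species per cell. cell_size (degrees) is an assumed value."""
    cells = defaultdict(set)
    for species, lon, lat in occurrences:
        cell = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        cells[cell].add(species)
    return {cell: len(spp) for cell, spp in cells.items()}


def pearson_r(x, y):
    """Plain Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def compare_richness(rich_a, rich_b):
    """Correlate two richness maps over the union of their occupied cells,
    counting a cell missing from one map as zero species (an assumption)."""
    cells = sorted(set(rich_a) | set(rich_b))
    a = [rich_a.get(c, 0) for c in cells]
    b = [rich_b.get(c, 0) for c in cells]
    return pearson_r(a, b)


# Toy datasets with made-up coordinates; dataset_b lacks one record.
dataset_a = [
    ("Ephedra nevadensis", -115.2, 36.1),
    ("Ephedra trifurca", -115.7, 36.9),
    ("Ephedra trifurca", -106.3, 31.8),
    ("Ephedra antisyphilitica", -99.5, 28.4),
]
dataset_b = dataset_a[:3]

rich_a = richness_per_cell(dataset_a)
rich_b = richness_per_cell(dataset_b)
r = compare_richness(rich_a, rich_b)
print(rich_a, rich_b, r)
```

Because many points collapse into few cells and cell counts, two datasets that differ in individual records can still yield similar richness vectors, which is the condensing effect the pyramid describes.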
Pipeline filter summary for standardization and error removal
| Categories | Filter | Requirement | Rationale |
|---|---|---|---|
| STD | Country range | Spatial | North America: Mexico and the USA |
| STD | Infraspecific rank | Taxonomic | Required rank: species (Claridge et al., |
| STD | Collection years | Temporal | 1945 to 2020, as older records are more likely to contain erroneous coordinates (Zizka et al., |
| STD | Basis of record | Consistency | Specimens and observations. |
| STD | Occurrence status | Consistency | Presence data. |
| FPS | Non‐North America‐native | Taxon | All non‐native |
| FPS/REC | Zero or missing coordinates | Spatial | Zeroes and missing values may represent records with data entry errors. Missing values will cause error messages in |
| REC | Longitude and latitude are equal | Spatial | Equal longitude and latitude may represent records with data entry errors. |
| DUP | Duplicate records | Consistency | Duplicate records that may represent, for example, record copy errors. |
| FPS | Country capitals | Spatial | Records that may contain the coordinates of the country capital. |
| FPS | Country centroids | Spatial | Records that may contain the centroid coordinates of the country. |
| FPS | GBIF headquarters | Spatial | Records that may contain the coordinates of the GBIF headquarters. |
| FPS | Biodiversity institutions | Spatial | Records that may contain the coordinates of biodiversity institutions where the herbarium voucher is stored. |
| FPS | Geographic outliers | Spatial | Geographic outliers that may represent misidentified specimens. |
| REC | Urban areas | Spatial | Records from urban areas that may represent old data or vague locality descriptions. |
| REC | dd.mm to dd.dd conversion errors | Spatial | Records with ddmm to dd.dd conversion error (misinterpretation of the degree sign as decimal delimiter). |
| REC | Rasterized collections | Spatial | Records with a significant proportion of coordinates that might have a low precision. |
| FPS | “Manual” removal of false positives | Consistency | False positives that have been overlooked by automated error removal, based on the knowledge that they are in the records. |
Note: Categories: DUP, duplicate records; FPS, false positives; REC, recording errors; STD, standardization.
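A few of the simpler filters in the table (country range, collection years, zero or missing coordinates, equal longitude and latitude, duplicates) can be sketched as a single pass over the records. This is a hedged Python illustration, not the R packages the pipelines used: the record fields, the ISO country codes, the duplicate key, and the `clean_records` helper are assumptions, and each record is counted under the first filter it fails.

```python
def clean_records(records, countries=("MX", "US"), years=(1945, 2020)):
    """Apply a subset of the table's filters; return (kept, removed counts).
    Field names and the duplicate key are illustrative assumptions."""
    kept, seen = [], set()
    removed = {
        "country_range": 0,
        "collection_years": 0,
        "zero_or_missing": 0,
        "equal_lon_lat": 0,
        "duplicate": 0,
    }
    for rec in records:
        lon, lat, year = rec.get("lon"), rec.get("lat"), rec.get("year")
        key = (rec.get("species"), lon, lat, year)
        if rec.get("country") not in countries:
            reason = "country_range"
        elif year is None or not (years[0] <= year <= years[1]):
            reason = "collection_years"
        elif lon is None or lat is None or lon == 0 or lat == 0:
            reason = "zero_or_missing"
        elif lon == lat:
            reason = "equal_lon_lat"
        elif key in seen:
            reason = "duplicate"
        else:
            reason = None
        if reason is None:
            seen.add(key)
            kept.append(rec)
        else:
            removed[reason] += 1
    return kept, removed


# Made-up example records, one per failure mode plus one clean record.
example = [
    {"species": "Ephedra trifurca", "lon": -106.3, "lat": 31.8, "year": 1998, "country": "US"},
    {"species": "Ephedra trifurca", "lon": -106.3, "lat": 31.8, "year": 1998, "country": "US"},
    {"species": "Ephedra nevadensis", "lon": 0.0, "lat": 0.0, "year": 2005, "country": "US"},
    {"species": "Ephedra viridis", "lon": None, "lat": 36.0, "year": 2010, "country": "US"},
    {"species": "Ephedra aspera", "lon": 30.0, "lat": 30.0, "year": 2001, "country": "MX"},
    {"species": "Ephedra nevadensis", "lon": -115.2, "lat": 36.1, "year": 1930, "country": "US"},
    {"species": "Ephedra distachya", "lon": 4.5, "lat": 43.6, "year": 2000, "country": "FR"},
]
kept, removed = clean_records(example)
print(len(kept), removed)
```

The spatial tests that need reference data (capitals, centroids, institutions, urban areas, outliers) are omitted here, since they depend on gazetteers that this sketch does not reproduce.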
FIGURE 1 Workflow of the pipelines and the downstream analyses. The pipelines' part comprised the following sections: Data Retrieval, Standardization, and Error Removal. The Downstream Analysis featured the Predictor Variables Extraction, the Model Fitting, the Model Building (SDMs, S‐SDMs) and Evaluation, and the Correlation Analysis developed from the pipeline data P1 to P6 and the expert data. R packages used in the course of the workflow are in italics. (a) Observed species distribution from GBIF P1 data. (b) Observed species distribution from expert data. Filter categories: DUP, Duplicate records; FPS, False positives; REC, Recording Errors.
FIGURE 4 Stacked species distribution maps based on cleaned GBIF data from pipelines P1, P6, and expert data. Depicted are the maps of the pipeline that cleaned least (P1) and the pipeline that cleaned most (P6), which show only minor differences (the maps from the other pipeline data are close to P6). The control map from the expert data differs from the pipeline maps. Left: Observed distribution (L2 data). Point‐occurrences after passing the pipelines, allocated to grid cells of a stacked range map of all Ephedra species. The expert map shows fewer occupied grid cells (n = 358) than P1 (n = 636), resulting in a smaller range. Right: Map of the predicted probability of species from S‐SDMs (L5 data). The color keys show highly correlated patterns of each data quality (P1, P6, and expert data: 0 to 12 species, Pearson's r = .9173).