Leen Vandepitte1, Samuel Bosch2, Lennert Tyberghein3, Filip Waumans3, Bart Vanhoorne3, Francisco Hernandez3, Olivier De Clerck3, Jan Mees3.
Abstract
Being able to assess the quality and level of completeness of data has become indispensable in marine biodiversity research, especially when dealing with large databases that typically compile data from a variety of sources. Very few integrated databases offer quality flags at the level of the individual record, making it hard for users to extract the data that are fit for their specific purposes. This article describes the different steps that were developed to analyse the quality and completeness of the distribution records within the European and international Ocean Biogeographic Information Systems (EurOBIS and OBIS). Records are checked for data format, completeness and validity of information, quality and detail of the taxonomy and geographic indications used, and whether or not the record is a putative outlier. The corresponding quality control (QC) flags will not only help users with their data selection, they will also help the data management team and the data custodians to identify possible gaps and errors in the submitted data, providing scope to improve data quality. The results of these quality control procedures are now available on both the EurOBIS and OBIS databases. Through the Biology portal of the European Marine Observation and Data Network (EMODnet Biology), a subset of EurOBIS records, namely those passing a specific combination of these QC steps, is offered to the users. In the future, EMODnet Biology will offer a wide range of filter options through its portal, allowing users to make specific selections themselves. Through LifeWatch, users can already upload their own data and check them against a selection of the quality control procedures described here. Database URL: www.eurobis.org (www.iobis.org; www.emodnet-biology.eu/).
Year: 2015 PMID: 25632106 PMCID: PMC4309024 DOI: 10.1093/database/bau125
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Table 1. Overview of all the QC steps in the EurOBIS database, including the unique bit-sequence (2^(X-1), with X = number of the QC flag) assigned when the QC step is evaluated positively. The second-to-last column lists whether a QC step is also available to the users through the online web services. IQR = Interquartile range; MAD = Median absolute deviation; SSS = Sea surface salinity; SST = Sea surface temperature
| QC-number | Category | Question | Bit-sequence, if answer is yes | Available as online data service | Implemented in |
|---|---|---|---|---|---|
| 2 | Taxonomy | Is the taxon name matched to WoRMS? | 2 | Yes (taxon match) | EurOBIS + OBIS |
| 3 | Taxonomy | Is the taxon level lower than family? | 4 | Yes (taxon match) | EurOBIS + OBIS |
| 4 | Geography: lat/lon | Are the latitude/longitude values different from zero? | 8 | Yes (check OBIS format) | EurOBIS + OBIS |
| 5 | Geography: lat/lon | Are the latitude/longitude values within their possible boundaries? | 16 | Yes (check OBIS format) | EurOBIS + OBIS |
| 6 | Geography: lat/lon | Are the coordinates situated in sea or along the coastline (20 km buffer)? | 32 | Yes (check OBIS format) | EurOBIS + OBIS |
| 9 | Geography: lat/lon | Are the coordinates situated in the expected geographic area (compare metadata)? | 256 | No, but visual check possible through separate data validation service | EurOBIS |
| 18 | Geography: depth | Is minimum depth ≤ maximum depth? | 131 072 | Not yet available | EurOBIS + OBIS |
| 19 | Geography: depth | Is the sampling depth possible when compared with GEBCO depth map (incl. margin)? | 262 144 | No, but depths per lat-lon can be requested through geographic web services | EurOBIS + OBIS |
| 7 | Completeness: date/time | Is the sampling year (start/end) completed and valid? | 64 | Yes (check OBIS format) | EurOBIS + OBIS |
| 11 | Completeness: date/time | Is the sampling date (year/month/day; start/end) valid? | 1 024 | Yes (check OBIS format) | EurOBIS + OBIS |
| 12 | Completeness: date/time | If a start and end date are given, is the start before the end? | 2 048 | Yes (check OBIS format) | EurOBIS + OBIS |
| 13 | Completeness: date/time | If a sampling time is given, is this valid and is the time zone completed? | 4 096 | Not yet available | EurOBIS + OBIS |
| 14 | Completeness: presence/abundance/biomass | Is the value of the field ‘ObservedIndividualCount’ empty or > 0? | 8 192 | Not yet available | EurOBIS + OBIS |
| 15 | Completeness: presence/abundance/biomass | Is the value of the field ‘Observedweight’ empty or > 0? | 16 384 | Not yet available | EurOBIS + OBIS |
| 16 | Completeness: presence/abundance/biomass | Is the field ‘SampleSize’ completed if the field ‘ObservedIndividualCount’ is > 0? | 32 768 | Not yet available | EurOBIS + OBIS |
| 1 | (Eur)OBIS data format | Are the required fields from the OBIS Schema completed? | 1 | Yes (check OBIS format) | EurOBIS + OBIS |
| 10 | (Eur)OBIS data format | Is the ‘Basis of Record’ documented, and is an existing OBIS code used? | 512 | Yes (check OBIS format) | EurOBIS + OBIS |
| 17 | (Eur)OBIS data format | Is the value of the field ‘Sex’ empty or is an existing OBIS code used? | 65 536 | Not yet available | EurOBIS + OBIS |
| 21 | Outliers: environment | Is the observation within six MADs from the median depth of this taxon? | 1 048 576 | Not yet available | OBIS |
| 22 | Outliers: environment | Is the observation within three IQRs from the first & third quartile depth of this taxon? | 2 097 152 | Not yet available | OBIS |
| 23 | Outliers: environment | Is the observation within six MADs from the median SSS of this taxon? | 4 194 304 | Not yet available | OBIS |
| 24 | Outliers: environment | Is the observation within three IQRs from the first & third quartile SSS of this taxon? | 8 388 608 | Not yet available | OBIS |
| 25 | Outliers: environment | Is the observation within six MADs from the median SST of this taxon? | 16 777 216 | Not yet available | OBIS |
| 26 | Outliers: environment | Is the observation within three IQRs from the first & third quartile SST of this taxon? | 33 554 432 | Not yet available | OBIS |
| 27 | Outliers: geography | Is the observation within six MADs from the distance to the centroid of this taxon? | 67 108 864 | Not yet available | OBIS |
| 28 | Outliers: geography | Is the observation within three IQRs from the first & third quartile distance to the centroid of this taxon? | 134 217 728 | Not yet available | OBIS |
| 29 | Outliers: geography | Is the observation within six MADs from the distance to the centroid of this dataset? | 268 435 456 | Not yet available | OBIS |
| 30 | Outliers: geography | Is the observation within three IQRs from the first & third quartile distance to the centroid of this dataset? | 536 870 912 | Not yet available | OBIS |
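
The bit-sequence column of the table above follows a power-of-two pattern: QC step X contributes the value 2^(X-1) when it is evaluated positively. The minimal Python sketch below illustrates that arithmetic. It assumes that the per-step values of a record are combined into one integer, which the table itself does not state, and the function names `qc_bit`, `encode` and `decode` are hypothetical, not part of any EurOBIS or OBIS service.

```python
def qc_bit(qc_number: int) -> int:
    """Bit value of one QC step, following the 2^(X-1) rule of the table."""
    if qc_number < 1:
        raise ValueError("QC numbers start at 1")
    return 1 << (qc_number - 1)


def encode(passed_steps) -> int:
    """Combine the QC steps a record passed into a single integer (assumed storage)."""
    value = 0
    for step in passed_steps:
        value |= qc_bit(step)
    return value


def decode(value: int) -> list:
    """Return the QC step numbers whose bit is set in `value`."""
    return [i + 1 for i in range(value.bit_length()) if (value >> i) & 1]


# Hypothetical record that passed QC steps 1, 2, 3, 4, 5 and 7.
flags = encode([1, 2, 3, 4, 5, 7])
print(flags)          # 95  (1 + 2 + 4 + 8 + 16 + 64)
print(decode(flags))  # [1, 2, 3, 4, 5, 7]
```

Because each step owns exactly one bit, the combined value can be decoded without ambiguity, which is what makes the bit-sequences "unique".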
Figure 1. Relative number of records (%) that pass the individual QC steps within the OBIS database. The QC steps are listed in Table 1.
Figure 2. Box and whisker plot per QC step, showing the variability of quality and completeness (in percentage) of the distribution records within the 21 OBIS nodes.
Table 2. Overview of the number of records (absolute and relative) that pass specific combinations of QC steps, indicating their fitness for use in analysing research hypotheses. QC2: taxon name matched to WoRMS; QC3: taxon level more detailed than family; QC4: coordinates different from zero; QC5: coordinates within possible boundaries; QC6: coordinates in sea or within 20 km coastline buffer; QC7: sampling year available and valid; QC16: count available, in combination with sample size information
| Combined QC steps | Positively evaluated OBIS records (#) | Positively evaluated OBIS records (%) |
|---|---|---|
| 2-3-4-5 | 34 991 925 | 86.05 |
| 2-3-4-5-6 | 32 216 817 | 79.22 |
| 2-3-4-5-7 | 32 849 480 | 80.78 |
| 2-3-4-5-6-7 | 30 311 653 | 74.54 |
| 2-3-4-5-16 | 23 315 398 | 57.33 |
| 2-3-4-5-6-16 | 19 189 668 | 47.19 |
| 2-3-4-5-6-7-16 | 19 189 668 | 47.19 |
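
Selecting records that pass a combination of QC steps, as in the table above, amounts to a bitmask test on the combined flag value. The sketch below is only an illustration under the assumption that each record carries one integer with bit 2^(X-1) set for every passed step X. The record identifiers and flag values are made up, and `mask_for` and `passes` are hypothetical helper names.

```python
def mask_for(steps) -> int:
    """Bitmask with bit 2^(X-1) set for every required QC step X."""
    mask = 0
    for step in steps:
        mask |= 1 << (step - 1)
    return mask


def passes(record_flags: int, steps) -> bool:
    """True if the record passed every QC step in `steps`."""
    mask = mask_for(steps)
    return (record_flags & mask) == mask


# Hypothetical records: identifier -> combined QC flag value.
records = {
    "rec-a": mask_for([2, 3, 4, 5, 6, 7]),
    "rec-b": mask_for([2, 3, 4, 5]),
    "rec-c": mask_for([2, 4, 5, 6, 7, 16]),
}

combo = [2, 3, 4, 5, 6, 7]
selected = [rid for rid, flags in records.items() if passes(flags, combo)]
share = 100 * len(selected) / len(records)
print(selected, f"{share:.2f} %")
```

Adding a step to the combination can only keep or shrink the selection, which matches the decreasing percentages in the table.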
Figure 3. Results of the geographic outlier analysis on the dataset ‘ICES Biological Community’. The left figure (A) represents the IQR approach, the right figure (B) represents the MAD approach. Black diamonds indicate the centroid of the investigated data, green triangles have been evaluated as OK, orange squares have been evaluated as possible outliers.
Figure 4. Results of the geographic and environmental outlier analysis of the species Verruca stroemia (Crustacea, Cirripedia). The left column represents the IQR approach, the right column represents the MAD approach. The different outlier analyses are A: geography, B: bathymetry, C: Sea Surface Salinity (SSS) and D: Sea Surface Temperature (SST). Black diamonds indicate the centroid of the investigated data (only for the geographic outlier analysis), green triangles have been evaluated as OK, orange squares have been evaluated as possible outliers.
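
The two outlier rules compared in these figures can be sketched in a few lines of Python: a value is evaluated as OK when it lies within six MADs of the median, or within three IQRs of the first and third quartiles. This is a rough illustration, not the procedure used by OBIS. It assumes an unscaled MAD and the default quantile method of Python's `statistics` module, and the depth values are invented.

```python
from statistics import median, quantiles


def mad_ok(values, k=6.0):
    """True where a value lies within k MADs of the median (unscaled MAD assumed)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [abs(v - med) <= k * mad for v in values]


def iqr_ok(values, k=3.0):
    """True where a value lies within k IQRs below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # default 'exclusive' method (an assumption)
    iqr = q3 - q1
    return [q1 - k * iqr <= v <= q3 + k * iqr for v in values]


# Invented sampling depths (m) for one taxon; the last value is suspicious.
depths = [10, 12, 11, 13, 12, 14, 11, 500]
print(mad_ok(depths))
print(iqr_ok(depths))
```

For the geographic analyses the same tests would be applied to the distance of each record to the centroid instead of to depth, SSS or SST. Both rules flag only the 500 m record in this toy data, but they can disagree on real data, which is why the figures show them side by side.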
Figure 5. Synthesis map representing the combined results of the outlier analyses of Verruca stroemia from Figure 4. The scale represents the number of times a species distribution is seen as an outlier when combining the eight outlier analyses from Figure 4 (geography, bathymetry, SSS and SST, each according to the IQR and MAD approaches). The black diamond indicates the centroid of the investigated data.
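
The outlier count behind such a synthesis map can be approximated from the QC flags. The sketch below assumes that QC steps 21 to 28 (the eight taxon-based outlier checks) are stored as bits 2^(X-1) in one integer per record, and that a cleared bit means the record was evaluated as a possible outlier. Both points are assumptions: a cleared bit could also mean that the check was never run. `outlier_count` is a hypothetical helper.

```python
# QC steps 21..28: depth, SSS, SST and distance to taxon centroid, each by MAD and IQR.
OUTLIER_STEPS = range(21, 29)


def outlier_count(flags: int) -> int:
    """Number of the eight outlier checks whose bit is NOT set in `flags`."""
    return sum(1 for step in OUTLIER_STEPS if not (flags & (1 << (step - 1))))


# Hypothetical record that passes all checks except the two depth checks (21, 22).
flags = sum(1 << (step - 1) for step in range(23, 29))
print(outlier_count(flags))  # 2
```

A count of 0 corresponds to a record that no analysis flags, and 8 to a record flagged by every analysis.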