Lee Belbin, Joanne Daly, Tim Hirsch, Donald Hobern, John La Salle.
Abstract
A recent ZooKeys paper (Mesibov, 2013: http://www.pensoft.net/journal_home_page.php?journal_id=1&page=article&SESID=df7bcb35b02603283dcb83ee0e0af0c9&type=show&article_id=5111) highlighted data quality issues in aggregated data sets but did not provide a realistic way to address them. This paper provides an aggregator's perspective, including ways that the whole community can help to address data quality issues. The establishment of GBIF and national nodes (national aggregators) such as the Atlas of Living Australia (ALA) has integrated and exposed a huge diversity of biological observations, along with many associated issues. Much of the admirable work by Mesibov (2013) was enabled by having the data exposed. Data quality, one of the highest priorities for GBIF, the national nodes and other aggregators, depends on both automatic methods and community experts to detect and correct data issues. Not all issues can be automatically detected or corrected, however, so community assistance is needed to help improve the quality of exposed biological data. We do need to improve the infrastructure and associated processes so that data issues can be identified more easily and all changes documented, ensuring that a full record is permanently and publicly available.
Keywords: ALA; Australia; GBIF; data cleaning; data quality; fitness for use; millipede; occurrence records
Year: 2013 PMID: 23794914 PMCID: PMC3689094 DOI: 10.3897/zookeys.305.5438
Source DB: PubMed Journal: Zookeys ISSN: 1313-2970 Impact factor: 1.546
Example of automated data checks within the Atlas of Living Australia.
| No. | Issue | Percentage |
| 1 | Missing coordinate precision | 90.2% |
| 2 | Geodetic datum assumed WGS84 | 44.4% |
| 3 | Decimal Latitude Longitude Converted | 27.6% |
| 4 | Unrecognized geodetic datum | 21.1% |
| 5 | Coordinate uncertainty not specified | 18.6% |
| 6 | Possible duplicate record | 8.4% |
| 7 | Invalid collection date | 8.0% |
| 8 | No collection date supplied | 4.1% |
| 9 | Coordinate uncertainty not valid | 2.6% |
| 10 | Habitat incorrect for species (user flagged issue category) | 2.4% |
| 11 | Name not in national checklists | 2.3% |
| 12 | Basis of record not supplied | 2.1% |
| 13 | Altitude value non-numeric | 2.0% |
| 14 | Name not in any national or international checklists | 1.1% |
| 15 | Suspected outlier (user flagged issue category) | 1.0% |
| 16 | Type status not recognized | <1% |
| 17 | Basis of record badly formed | <1% |
| 18 | Coordinates don’t match supplied state | <1% |
| 19 | Supplied country not recognized | <1% |
| 20 | Image URL invalid | <1% |
| 21 | Supplied coordinates are zero | <1% |
| 22 | Collection code not recognized | <1% |
| 23 | Min and max depth reversed | <1% |
| 24 | Unparseable verbatim coordinates | <1% |
| 25 | Coordinates derived from verbatim coordinates | <1% |
| 26 | Latitude is negated | <1% |
| 27 | Depth value non-numeric | <1% |
| 28 | Outside expert range for species | <1% |
| 29 | Longitude is negated | <1% |
| 30 | Min and max altitude reversed | <1% |
| 31 | Coordinates were transposed | <1% |
| 32 | Decimal Lat/Long calculated from easting-northing (grid reference) | <1% |
| 33 | Supplied coordinates centre of state | <1% |
| 34 | Coordinate precision and uncertainty transposed | <1% |
| 35 | Coordinates are out of range for species | <1% |
| 36 | Decimal Lat/Long calculated from Easting Northing Failed | <1% |
| 37 | Coordinates centre of country | <1% |
| 38 | Geospatial issue (user flagged issue category) | <1% |
| 39 | Day and month transposed | <1% |
| 40 | Depth out of range | <1% |
| 41 | Taxon misidentified (user flagged issue category) | <1% |
| 42 | Taxonomic issue (user flagged issue category) | <1% |
| 43 | Temporal issue (user flagged issue category) | <1% |
| 44 | Altitude out of range | <1% |
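To make a few of the checks listed above concrete, the following is a minimal sketch of how such rules might be expressed. It is not the Atlas of Living Australia's implementation: the function name `flag_issues` is hypothetical, the record is assumed to be a plain dictionary keyed by Darwin Core term names, and the transposition rule is a deliberately simple heuristic chosen for illustration.

```python
from typing import Any, Dict, List


def flag_issues(record: Dict[str, Any]) -> List[str]:
    """Return the names of the illustrative checks that a record fails.

    Assumption: `record` uses Darwin Core term names as keys and holds
    numeric values (or None) for coordinates, uncertainty and depths.
    """
    issues: List[str] = []

    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is not None and lon is not None:
        if lat == 0 and lon == 0:
            issues.append("Supplied coordinates are zero")
        elif 90 < abs(lat) <= 180 and abs(lon) <= 90:
            # The latitude is impossible, but the pair would be valid if swapped.
            issues.append("Coordinates were transposed")

    if record.get("coordinateUncertaintyInMeters") is None:
        issues.append("Coordinate uncertainty not specified")

    min_depth = record.get("minimumDepthInMeters")
    max_depth = record.get("maximumDepthInMeters")
    if min_depth is not None and max_depth is not None and min_depth > max_depth:
        issues.append("Min and max depth reversed")

    if record.get("eventDate") is None:
        issues.append("No collection date supplied")

    return issues


if __name__ == "__main__":
    example = {
        "decimalLatitude": 147.3,   # looks like a longitude
        "decimalLongitude": -42.9,  # looks like a latitude
        "coordinateUncertaintyInMeters": 100,
        "eventDate": None,
    }
    print(flag_issues(example))
```

Checks of this kind only detect a suspected problem. Whether a flagged record can also be corrected automatically, or needs a domain expert, is a separate question.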
A two-way decision table of issue detection versus correction.
| | Domain specific expertise required to address issue? | |
| Domain specific expertise required to detect issue? | Type 1 | Type 2 |
| | Type 3 | Type 4 |