
An analysis of the process and results of manual geocode correction.

Yolanda J McDonald, Michael Schwind, Daniel W Goldberg, Amanda Lampley, Cosette M Wheeler.

Abstract

Geocoding is the science and process of assigning geographical coordinates (i.e. latitude, longitude) to a postal address. The quality of the geocode can vary dramatically depending on several variables, including incorrect input address data, missing address components, and spelling mistakes. A dataset with a considerable number of geocoding inaccuracies can potentially result in an imprecise analysis and invalid conclusions. There has been little quantitative analysis of the amount of effort (i.e. time) to perform geocoding correction, and how such correction could improve geocode quality type. This study used a low-cost and easy to implement method to improve geocode quality type of an input database (i.e. addresses to be matched) through the processes of manual geocode intervention, and it assessed the amount of effort to manually correct inaccurate geocodes, reported the resulting match rate improvement between the original and the corrected geocodes, and documented the corresponding spatial shift by geocode quality type resulting from the corrections. Findings demonstrated that manual intervention of geocoding resulted in a 90% improvement of geocode quality type, took 42 hours to process, and the spatial shift ranged from 0.02 to 151,368 m. This study provides evidence to inform research teams considering the application of manual geocoding intervention that it is a low-cost and relatively easy process to execute.

Year:  2017        PMID: 28555477      PMCID: PMC5978681          DOI: 10.4081/gh.2017.526

Source DB:  PubMed          Journal:  Geospat Health        ISSN: 1827-1987            Impact factor:   1.212


Introduction

Geocoding is the process of matching postal addresses to their corresponding geographical coordinates (i.e. latitude, longitude) (Rushton ). Sophisticated science, data sets, and algorithms underlie this complex process (Boscoe, 2008; Zandbergen, 2008). A large number of published studies (Goldberg, 2008; Ratcliffe, 2001) describe the numerous algorithms used during the geocoding process to attempt to match an input address to an address stored in a reference database. The variability in algorithms, addresses, and databases can lead to a variety of errors in the geocoded results (Ratcliffe, 2001; Gilboa ; Schootman ; Zandbergen, 2008, 2011; Goldberg ). No one-size-fits-all geocoding system works perfectly in every situation and for every user. The accuracy of this complex process can range from the centroid of a rooftop to the centroid of a state (Jacquez and Rommel, 2009). This leads to the following questions: Should inaccuracies be incorporated into research or should they be omitted entirely? Should inaccuracies be corrected? Is there a threshold that inaccuracies should not exceed? Previous studies have indicated that researchers should attempt to correct inaccurate data so that real-world variances can be incorporated into analysis (Krieger, 2003; Zandbergen, 2007; Goldberg ; Goldberg and Cockburn, 2012; Murray ; Zandbergen, 2012). The practical application of reducing geocode inaccuracies is to improve the source data (i.e. geocoded data) used for spatial analysis (Strickland ). However, despite calls to pay heed to geocode quality by type and to employ manual geocode correction methods, there are few documented case studies that evaluate the cost effectiveness of this practice, or the improvements that can be expected by undertaking such an effort (Goldberg ). The purpose of this study was to quantify the effort (i.e. time) required to manually correct the geocodes in a health-related dataset, as well as the match rate improvement between the original and corrected geocodes, and the corresponding spatial shift by geocode quality type resulting from the corrections. The results of this study can be used to help guide researchers as they decide whether or not to undertake manual geocoding correction to improve the geocode quality type of a dataset.

Materials and Methods

Web based geocoding and interactive geocoding correction procedures were performed using the Texas A&M University (TAMU) Geoservices Online Geocoding service, version 4.01, which was developed by the study authors (Goldberg ). The corrections were performed by the study authors, a Ph.D. student and an honors undergraduate student. This web-based system allows for rapid manual intervention of previously geocoded data by drawing from online satellite imagery, street maps, and additional geocoding engines to determine an improved geocode for each record (Goldberg ). This system allows a user to upload a dataset and analyse each record one at a time. It compares the current location of each geocode to that of another location provided by an alternate geocoder (i.e. Google Maps) within the TAMU online geocoding platform, and allows the user the flexibility to execute a manual intervention process to determine a more accurate geocode. The user can select which geocoder produced a more accurate location and the dataset can be updated with the corrected coordinates. In the event that neither geocoder provides an accurate location, the user can utilise online sources to refine an address (e.g. misspelling of an address) as well as aerial imagery and street views to attempt to find the location intuitively, and visually verify a location using Google Maps. The TAMU Geoservices Online Geocoding service utilises publicly accessible data so person-hours are the only cost associated with the geocode correction processes. It is free to all researchers (https://geoservices.tamu.edu/), and the source code can be made available upon request to researchers and/or organisations that wish to use it. To analyse the impact of the geocode correction process, a health related dataset was used. 
This dataset contained 784 addresses of health service facilities located within the state of New Mexico that offered cervical screening (Pap and/or Human Papillomavirus testing), diagnostic testing (colposcopy), and excisional pre-cancer treatment (loop electrosurgical excision procedure or cone biopsy). Although this data is publicly available, it is not practical to obtain information on specific tests offered by individual clinics or providers. This unique health service facilities dataset was provided by the New Mexico HPV Pap Registry (NMHPVPR). The NMHPVPR is the first population-based statewide cervical screening registry in the United States; it includes address-level data on healthcare facilities providing the aforementioned services in rural and urban areas. Due to the uniqueness of this data set, the authors invested the effort to obtain the most accurate geocoding possible. The first step of processing was to geocode the entire set of addresses using the TAMU Geoservices Online Geocoding service. The version of the geocoding service used for this research included the 2015 Navteq Address Points database, the 2010 USPS ZIP+4 reference files, the 2010 Boundary Solutions National Parcel Data Layer, the 2010 US Census TIGER/Line files, and the US Census Bureau 2010 Cartographic Boundary files for Minor Civil Divisions, Zip Code Tabulation Areas, Counties, and States. Once the results were obtained, the geocoded file was uploaded to the TAMU Geoservices Online Geocoding Correction Service; Figure 1 displays the geocode correction tool interface. This service provides a user interface that displays a map showing the point obtained from the TAMU geocoding system and the point obtained from the alternate geocoder, i.e. Google Maps. If the alternate geocoder is able to find a match that is more accurate than the original match, a button can be pressed that updates the original geocode with the more accurate geocode.
As previously noted, in the case that both geocodes appear to be inaccurate, the next step would be to attempt manual interactive geocoding. Online resources can be used to refine the address contained within the input file and often photo(s) of the building to be geocoded are available online. In addition, the user can study aerial imagery and street views of the location and attempt to manually locate the site; Figure 2 displays the correction prompt. If the site is located, the user marks that spot on the map and the geocode will be updated. These processes were used to update and correct the health service facility dataset analysed for this study. The final file contained information about the original geocodes and the corrected geocodes, which were used for comparative analysis.
Figure 1

Manual geocode correction tool interface.

Figure 2

Prompt for new accuracy description.
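
The review workflow described above (keep the original geocode, accept the alternate geocoder's point, or place the point manually) can be sketched as a small record structure. This is only an illustrative sketch, not the TAMU Geoservices implementation: the class names, field names, the `review` helper, and the coordinates are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Geocode:
    """One candidate location for an address (hypothetical field names)."""
    source: str          # e.g. "original", "alternate", or "manual"
    latitude: float
    longitude: float
    quality_type: str    # e.g. "USPS zip centroid", "Building centroid"


@dataclass
class CorrectionRecord:
    """Keeps the original geocode next to whatever the reviewer accepted."""
    address: str
    original: Geocode
    corrected: Optional[Geocode] = None

    def accept(self, candidate: Geocode) -> None:
        # The original is never overwritten, so both can be compared later.
        self.corrected = candidate

    @property
    def final(self) -> Geocode:
        return self.corrected if self.corrected is not None else self.original


def review(record: CorrectionRecord,
           alternate: Geocode,
           manual: Optional[Geocode] = None,
           alternate_is_better: bool = False) -> CorrectionRecord:
    """Apply one of three outcomes: take the alternate, take a manual point, or keep the original."""
    if alternate_is_better:
        record.accept(alternate)
    elif manual is not None:
        record.accept(manual)
    return record


# Illustrative values only; these are not records from the study dataset.
record = CorrectionRecord(
    address="(illustrative address)",
    original=Geocode("original", 35.0000, -106.0000, "USPS zip centroid"),
)
alternate = Geocode("alternate", 35.0100, -106.0200, "Address range interpolation")
manual = Geocode("manual", 35.0105, -106.0210, "Building centroid")
review(record, alternate, manual=manual)
print(record.original.quality_type, "->", record.final.quality_type)
```

Keeping the original geocode alongside the corrected one is what makes the later comparison of quality type and spatial shift possible.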

Results

This section provides a description of the results that were obtained from manually correcting the 784 geocodes. The same method used in prior research (Goldberg ) was used to classify an improved record under one of two criteria (Rushton ). A record that was originally non-geocodable and for which a geocode was obtained after processing was categorised under criterion one. A record that was previously geocodable and for which the accuracy of the geocode was improved after processing was categorised under criterion two (Boscoe, 2008). It should be noted that we considered a record that has a lower North American Association of Central Cancer Registries (NAACCR) GIS Coordinate Quality Code (Goldberg, 2008) after it has been processed to be an improvement in accuracy according to criterion two. We acknowledge that without direct field observation, it is not possible to assess with 100% accuracy that the original geocode was improved. All of the records in the dataset were geocodable in the original file, therefore no records met criterion one. For measuring improvement, we followed the geocode output type hierarchy of the NAACCR GIS Coordinate Quality Code. Of the 784 records, 709 met criterion two. Ninety percent of the original addresses were corrected to a higher accuracy after the manual correction processes and 10% did not change. Of the 75 records that did not change, 21 were of the Exact Parcel Centroid quality, 50 were of Address Range Interpolation, and four records were of the USPS Zip Centroid quality. Table 1 shows that the 71 unchanged addresses that matched to either Exact Parcel Centroid or Address Range Interpolation were already of the second or third highest ranked geocode quality types (Goldberg, 2008).
Table 1

Geocode quality types and descriptions ranked from most to least accurate and geocode quality types of the original and corrected dataset.

| Quality type | Description | Original quality type, total (N=784) | Original % | Corrected quality type, total (N=784) | Corrected % |
|---|---|---|---|---|---|
| Building centroid | Matched to the centroid of the building | 0 | 0.00 | 638 | 81.38 |
| Exact parcel centroid point | Matched to the centroid of the parcel | 194 | 24.75 | 44 | 5.61 |
| Address range interpolation | Uses information about the address number ranges to estimate the position of a numbered address | 386 | 49.23 | 79 | 10.08 |
| Street centroid | Matched to the centroid of the street | 0 | 0.00 | 18 | 2.29 |
| USPS zip centroid | Matched to the zip code area centroid | 204 | 26.02 | 4 | 0.51 |
| City centroid | Matched to the centroid of the city | 0 | 0.00 | 1 | 0.13 |
| State centroid | Matched to the centroid of the state | 0 | 0.00 | 0 | 0.00 |

USPS, United States Postal Service.

Table 1 contains the original and corrected geocode quality type for the dataset. The original dataset contained zero records that were geocoded to the Building Centroid quality type. The corrected dataset contains 638 (81.38%) geocodes of this quality. It is notable that the original geocoded dataset contained 204 (26%) geocodes that matched to the USPS Zip Centroid quality type and after manual geocoding correction there were only four (<1%) records.
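
The two improvement criteria can be expressed as a comparison of ranks. The sketch below assumes the ranking order listed in Table 1 (most to least accurate) as a stand-in for the NAACCR GIS Coordinate Quality Code hierarchy; it does not use the actual NAACCR code values, and the function names and the use of `None` for a non-geocodable record are assumptions made for illustration.

```python
# Quality types in the order given in Table 1, most accurate first.
# This ordering stands in for the NAACCR code hierarchy (assumption).
QUALITY_ORDER = [
    "Building centroid",
    "Exact parcel centroid",
    "Address range interpolation",
    "Street centroid",
    "USPS zip centroid",
    "City centroid",
    "State centroid",
]
RANK = {name: position for position, name in enumerate(QUALITY_ORDER)}


def classify(original, corrected):
    """Return 'criterion 1', 'criterion 2', or 'unchanged'.

    `original` is None when the record was non-geocodable before correction.
    """
    if original is None:
        return "criterion 1" if corrected is not None else "unchanged"
    if corrected is not None and RANK[corrected] < RANK[original]:
        return "criterion 2"  # moved up the hierarchy
    return "unchanged"


def improvement_rate(improved, total):
    """Percentage of records that improved, rounded to one decimal place."""
    return round(100 * improved / total, 1)


print(classify("USPS zip centroid", "Building centroid"))
print(classify("Exact parcel centroid", "Exact parcel centroid"))
# Counts reported in the text: 709 of 784 records met criterion two.
print(improvement_rate(709, 784))
```

With the counts reported in this section, the rate is consistent with the "ninety percent" figure quoted above.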

Discussion

Processing time

The correction process of the entire dataset consisting of 784 records was completed in 42.21 hours. The average processing time was 194 seconds per record. In the following sections, we will discuss the quality improvement of the dataset. The purpose of analysing both the time taken and the geocode quality improvement is to illustrate the effort that is involved versus the improvement in geocode accuracy gained.
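
The per-record figure follows directly from the reported total. The short sketch below reproduces that arithmetic and adds a naive linear extrapolation for planning purposes; the `estimate_hours` helper is hypothetical and assumes that other datasets would take the same average time per record, which may not hold when the mix of geocode quality types differs.

```python
TOTAL_HOURS = 42.21   # reported total correction time
RECORDS = 784         # records in the dataset

seconds_per_record = TOTAL_HOURS * 3600 / RECORDS
print(f"{seconds_per_record:.1f} seconds per record")


def estimate_hours(n_records, avg_seconds=194):
    """Rough effort estimate, assuming the study's average time per record applies."""
    return n_records * avg_seconds / 3600


print(f"{estimate_hours(784):.2f} hours for 784 records")
```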

Spatial shift

Of the 784 geocodes, 709 were assigned a new set of coordinates during the correction process. In this section we review the spatial shift that the majority of the geocodes underwent. This distance was measured in meters (m) using the XY to Line tool within ArcGIS 10.1. Of the addresses that met criterion two, the spatial shift improvements ranged from the smallest (0.018851 m) to the largest (151,368 m); the mean was 1963 m and the median was 114 m (Table 2). For the smallest spatial shift improvement category, i.e. Exact Parcel Centroid to Building Centroid, we found that these geocode quality types were closely aligned and required minimal processing time (mean 100 seconds, median 52 seconds). In the event that the original geocode location of an Exact Parcel Centroid quality type was already accurate but needed to be updated to Building Centroid, the building was selected to reflect its true level of accuracy. The newly selected point was located proximate to the original point, resulting in the small difference between the original and corrected geocodes. For the largest spatial shift, the geocode quality improved from USPS Zip Centroid to Street Centroid and the processing time was 1276 seconds (about 21.3 min). Figure 3 illustrates an example of the spatial shift between the original and corrected geocoded points. In the bottom left of the diagram, it can be seen that many corrected geocoded points were derived from the same original point. In this case, many addresses were originally geocoded to a zip code centroid and then corrected to more accurate, single-location geocodes.
Table 2

Geocode quality types of the original and corrected dataset and spatial shift improvement by each geocode quality type correction.

| Old geocode quality type | New geocode quality type* | N (total N=703) | % | Mean (m) | Median (m) | IQR (Q1, Q3)° (m) | Minimum (m) | Maximum (m) |
|---|---|---|---|---|---|---|---|---|
| Address range interpolation | Building centroid | 323 | 45.95 | 355.22 | 105.88 | (54.21, 221.96) | 3.49 | 33936.56 |
| Address range interpolation | Exact parcel centroid | 10 | 1.42 | 253.77 | 72.32 | (42.75, 130.22) | 7.04 | 1904.97 |
| Exact parcel centroid | Building centroid | 171 | 24.32 | 116.62 | 11.66 | (2.29, 27.25) | 0.02 | 8260.35 |
| USPS zip centroid | Building centroid | 143 | 20.34 | 5070.82 | 3094.47 | (1446.09, 5455.60) | 191.04 | 54717.53 |
| USPS zip centroid | Exact parcel centroid | 14 | 1.99 | 9903.80 | 5669.26 | (3036.69, 11614.65) | 871.14 | 41691.95 |
| USPS zip centroid | Address range interpolation | 29 | 4.13 | 6581.60 | 3405.08 | (858.99, 12227.95) | 114.31 | 23920.18 |
| USPS zip centroid | Street centroid | 13 | 1.85 | 22956.72 | 11708.03 | (3959.76, 20884.24) | 1734.06 | 151367.94 |
| All corrections | | 703 | | 1963.18 | 113.81 | (24.64, 940.39) | 0.02 | 151367.94 |

USPS, United States Postal Service.

*Geocode quality type change of N≥5.

°IQR, interquartile range.

Figure 3

Spatial shift from original geocode to corrected geocode.
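
The spatial shift between an original and a corrected geocode is a point-to-point distance. The study measured it with the XY to Line tool in ArcGIS 10.1; the sketch below substitutes the haversine great-circle formula on a spherical Earth, so its results would differ slightly from the reported values. The function name, the Earth radius constant, and the sample coordinates are assumptions for illustration and are not drawn from the study data.

```python
import math
import statistics

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# (original_lat, original_lon, corrected_lat, corrected_lon); made-up pairs.
pairs = [
    (35.0000, -106.0000, 35.0001, -106.0001),
    (35.0000, -106.0000, 35.0100, -106.0200),
    (35.0000, -106.0000, 35.2000, -106.3000),
]
shifts = [haversine_m(*pair) for pair in pairs]

print("mean shift (m):  ", round(statistics.mean(shifts), 2))
print("median shift (m):", round(statistics.median(shifts), 2))
```

Reporting the median alongside the mean matters here because a few very large shifts, such as zip centroid corrections, pull the mean far above the typical shift.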

Geocoding a list of addresses is often just the first step to a more extensive project (Rushton ; Goldberg ). This first step, however, is very important because it can ultimately dictate the accuracy and direction of the final result (Oliver ; Zandbergen, 2009; Wey ). Prior research has demonstrated that geocoded datasets should be evaluated not only for match rate but also by geocode quality type (Goldberg ; Rushton ). Based on the level of accuracy of geocodes and the research purpose, it is our recommendation that researchers pause and evaluate if it is necessary to invest time to improve the accuracy of the geocodes (Krieger ; Bonner ; Nuckols ; Oliver ; Grubesic and Matisziw, 2006; Schootman ; Zandbergen, 2007, 2009). This study illustrates that a dataset of lower geocode quality types can be improved to a higher level of quality with very little investment of time, effort, or finances. The original dataset contained zero geocodes that matched to a building centroid. After 42 hours (~one week of work), 638 (81%) of the geocodes matched to a building centroid. Our spatial shift findings support previous studies demonstrating that inaccurate geocoding produces positional errors (Cayo and Talbot, 2003; Ward ). These errors have the potential to impact health analysis ranging from inaccurate local disease rates to imprecise accessibility measures; these health analysis studies are frequently used to inform health policy decisions (Jacquez, 2012). The manual intervention geocoded dataset that was produced as part of this study is now more suitable to be used for analysis because it will yield more reliable results.

Conclusions

The current study provides additional motivation and evidence that manual geocoding correction is both a feasible and economical method for improving the quality of geocoded data. We demonstrated that the manual intervention geocoding processes resulted in increased match rates, higher confidence in geocode quality, and improved geocode match types. Finally, this study supports prior research that has been conducted in the geocoding accuracy and analysis field, and indicates that prior findings are transferable from one geographic region to another as well as across domains of health services (Goldberg ). As demonstrated by this study, the TAMU Geoservices geocoder and the geocode correction tool, which is integrated in the online web service, are a low- to no-cost, easy-to-use option to improve geocode accuracy.

References

1.  Place, space, and health: GIS and epidemiology.

Authors:  Nancy Krieger
Journal:  Epidemiology       Date:  2003-07       Impact factor: 4.822

2.  Error propagation models to examine the effects of geocoding quality on spatial analysis of individual-level datasets.

Authors:  P A Zandbergen; T C Hart; K E Lenzer; M E Camponovo
Journal:  Spat Spatiotemporal Epidemiol       Date:  2012-02-11

3.  Positional accuracy of two methods of geocoding.

Authors:  Mary H Ward; John R Nuckols; James Giglierano; Matthew R Bonner; Calvin Wolter; Matthew Airola; Wende Mix; Joanne S Colt; Patricia Hartge
Journal:  Epidemiology       Date:  2005-07       Impact factor: 4.822

4.  Geocoding in cancer research: a review.

Authors:  Gerard Rushton; Marc P Armstrong; Josephine Gittler; Barry R Greene; Claire E Pavlik; Michele M West; Dale L Zimmerman
Journal:  Am J Prev Med       Date:  2006-02       Impact factor: 5.043

5.  Positional accuracy and geographic bias of four methods of geocoding in epidemiologic research.

Authors:  Mario Schootman; David A Sterling; James Struthers; Yan Yan; Ted Laboube; Brett Emo; Gary Higgs
Journal:  Ann Epidemiol       Date:  2007-04-19       Impact factor: 3.797

6.  The effect of administrative boundaries and geocoding error on cancer rates in California.

Authors:  Daniel W Goldberg; Myles G Cockburn
Journal:  Spat Spatiotemporal Epidemiol       Date:  2012-02-10

7.  Local indicators of geocoding accuracy (LIGA): theory and application.

Authors:  Geoffrey M Jacquez; Robert Rommel
Journal:  Int J Health Geogr       Date:  2009-10-28       Impact factor: 3.918

8.  Geographic variability in geocoding success for West Nile virus cases in South Dakota.

Authors:  Christine L Wey; Jennifer Griesse; Lon Kightlinger; Michael C Wimberly
Journal:  Health Place       Date:  2009-06-12       Impact factor: 4.078

9.  Comparison of residential geocoding methods in population-based study of air quality and birth defects.

Authors:  Suzanne M Gilboa; Pauline Mendola; Andrew F Olshan; Catherine Harness; Dana Loomis; Peter H Langlois; David A Savitz; Amy H Herring
Journal:  Environ Res       Date:  2006-02-17       Impact factor: 6.498

10.  Quantifying geocode location error using GIS methods.

Authors:  Matthew J Strickland; Csaba Siffel; Bennett R Gardner; Alissa K Berzen; Adolfo Correa
Journal:  Environ Health       Date:  2007-04-04       Impact factor: 5.984
