
Smartphone-assisted spatial data collection improves geographic information quality: pilot study using a birth records dataset.

Xiaohui Xu, Hui Hu, Sandie Ha, Daikwon Han.

Abstract

It is well known that the conventional, automated geocoding method based on self-reported residential addresses has many limitations, including missing geocodes and positional errors. We developed a smartphone-assisted aerial image-based method, which uses the Google Maps application programming interface as a spatial data collection tool during the birth registration process. In this pilot study, we tested whether the smartphone-assisted method provides more accurate geographic information than the automated geocoding method when both methods can obtain geocodes for an address. We randomly selected 100 well-geocoded addresses among women who gave birth in Alachua county, Florida in 2012. We compared geocodes generated from three geocoding methods: i) the smartphone-assisted aerial image-based method; ii) the conventional, automated geocoding method; and iii) the global positioning system (GPS). We used the GPS data as the reference method. The automated geocoding method yielded positional errors larger than 100 m among 29.3% of addresses, while all addresses geocoded by the smartphone-assisted method had errors less than 100 m. The positional errors of the automated geocoding method were greater for apartment/condominiums compared with other dwellings and also for rural addresses compared with urban ones. We conclude that the smartphone-assisted method is a promising method for prospective spatial data collection because it improves positional accuracy.

Year:  2016        PMID: 27903063      PMCID: PMC5800510          DOI: 10.4081/gh.2016.482

Source DB:  PubMed          Journal:  Geospat Health        ISSN: 1827-1987            Impact factor:   1.212


Introduction

Geocoded, vital statistics birth records have been widely used to examine the potential adverse effects of environmental exposures during pregnancy on pregnancy and birth outcomes, including low birth weight, preterm delivery, small for gestational age (Dadvand et al.; Metcalfe et al.; Sapkota et al.; Shah and Balkhair, 2011; Stieb et al.; Strand et al.), congenital anomalies (Vrijheid et al.), pregnancy complications such as hypertensive disorders of pregnancy (Hu et al.), and gestational diabetes mellitus (Hu et al.). A wide range of environmental factors have been investigated in previous studies, including air pollution (Hu et al., 2015; Sapkota et al.; Shah and Balkhair, 2011; Stieb et al.; Vrijheid et al.), temperature (Strand et al.), greenness (Dadvand et al.), built environment (Hystad et al.; Miranda et al.), and other neighbourhood-level factors such as income, education, and racial residential segregation (Anthopolos et al.; Metcalfe et al.). These studies provide important evidence in this field. However, geocoded information in the vital statistics birth records, produced by the traditional automated geocoding method based on self-reported residential addresses, has many issues, including missing geocode data and positional errors of geocoded addresses. The positional accuracy of geocoded addresses has drawn much attention, and recent studies suggest that potential errors cannot be ignored when using geocoding methods in epidemiologic studies (Cayo and Talbot, 2003; Hurley et al.; Whitsel et al.). The positional errors seen with geocoding can have substantial impacts on many salient factors underlying environmental epidemiologic studies (Jacquez, 2012), including exposure estimates (Zandbergen, 2007), health access analysis (Frizzelle et al.; McLafferty et al.), disease cluster detection (Jacquez and Rommel, 2009; Zimmerman et al.), disease rate estimates (Goldberg and Cockburn, 2012), and spatial weights (Jacquez and Rommel, 2009).
More importantly, studies have shown heterogeneity in positional accuracy, with greater geocoding errors observed in rural compared to urban areas (Cayo and Talbot, 2003; Hurley et al.; Whitsel et al.). These errors may cause differential misclassification between rural and non-rural individuals and lead to biased results in epidemiologic studies (Krieger et al.; Oliver et al.). Alternative geocoding methods such as aerial image-based methods have been available for a long time and are usually used to improve the positional accuracy of addresses in the traditional post-hoc geocoding method. The advantages of these methods have been reported by many authors (Baltsavias, 1993; Boulos, 2005; Conzelmann et al.; Hild and Fritsch, 1998; Richards et al.; Ward et al.), but the limited knowledge that geographic information system technicians have of the addresses could significantly restrict their application in geocoding. More importantly, to our knowledge, these techniques have not been used for spatial data collection. We propose a smartphone-assisted aerial image-based method for spatial data collection during the process of birth registration. Compared with the traditional post-hoc geocoding method, this method has several advantages, including map/aerial image searching for addresses, participant-involved verification and real-time geocoding (Figure 1). The prospective use of such methods has the potential to substantially improve data quality by reducing missing values and improving the accuracy of geographic information.

Figure 1. Illustration of a smartphone-assisted aerial image-based method for spatial data collection.

In this pilot study, we aimed to examine whether the smartphone-assisted, aerial image-based method provides more accurate geographic information than the post-hoc geocoding method when both methods can obtain the geographic information of an address.

Materials and Methods

Study population and geocoding by Florida Department of Health

We obtained birth record data from the Bureau of Vital Statistics & Office of Health Statistics and Assessment, Florida Department of Health (FDOH), Tallahassee, FL, USA. The data included all registered live births in Florida (FL), USA between January 1, 2012 and December 31, 2012 (n=211,437). The FDOH used ArcGIS 10.1 software with the topologically integrated geographic encoding and referencing (TIGER) street database from the US Census Bureau to geocode the maternal residential address at delivery for all FL residents, while 1,093 births with a maternal address outside FL were not geocoded. A total of 206,796 (98.3%) women were successfully geocoded among the 210,344 women living within the state of Florida. A total of 2,733 women with geocoded maternal residential addresses inside Alachua county, FL were eligible to be sampled in this study. The population of Alachua county was 251,417 (71% urban, 29% rural) that year. From these eligible addresses, a total of 100 addresses were randomly sampled using the SURVEYSELECT procedure in SAS 9.3 (http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_surveyselect_sect001.htm). We compared geocodes generated from three geocoding methods: i) the conventional, FDOH-geocoded records using an automated geocoding method based on the TIGER street database (https://www.census.gov/geo/maps-data/data/tiger.html) and ArcGIS (http://www.esri.com); ii) reference measures taken with global positioning system (GPS) receivers 5 m away from the sampled addresses (outside the building); and iii) the geocodes obtained from the smartphone-assisted, aerial image-based method using the Google Maps application programming interface (API) (Google, 2015).
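
The sampling step and the reported match rate can be illustrated with a short Python sketch. This is only an analogue of a simple random sample without replacement, not the SURVEYSELECT procedure used in the study; the record identifiers and the seed are hypothetical.

```python
import random


def sample_addresses(record_ids, n, seed):
    """Draw a simple random sample of n records without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return rng.sample(list(record_ids), n)


# Hypothetical identifiers standing in for the 2,733 eligible addresses.
eligible = range(1, 2734)
sampled = sample_addresses(eligible, n=100, seed=2012)

# Match rate reported in the text: 206,796 of 210,344 Florida residents.
match_rate_percent = round(100 * 206796 / 210344, 1)
print(len(sampled), match_rate_percent)
```

The denominator follows from the text: 211,437 registered births minus 1,093 with a maternal address outside Florida gives 210,344.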

Global positioning system receiver measurements

The Garmin GPSMAP® 76Cx receiver (Garmin International Inc., Olathe, KS, USA) was used. The typical position accuracy of this receiver ranges from 3 to 5 m, and it has been validated and widely used in many studies (Wing, 2008). In this study, GPS measurements were taken 5 m away from the sampled addresses (outside the building), in order to avoid direct interaction or contact with any residents. None of the addresses located in apartment complexes had controlled access during the daytime, when the measurements were taken. All data were collected in January 2015.

The smartphone-assisted, aerial image-based method

Besides the automated and GPS-measured geocodes, we developed and used a method built on satellite and aerial images using the Google Maps API (Google, 2015). Briefly, the researchers search for the address on the map automatically, browse the aerial images, verify the location (i.e. simulating the process of participant-involved verification) and obtain the geocodes of the address or, if the address cannot be found automatically, of the first pinpoint placed on the aerial images, aligned with the centroid of the actual address. The system then returns and records the longitude and latitude of the pinpoint. Figure 1 shows the workflow of the smartphone-assisted, aerial image-based method for spatial data collection during a participant interview. As shown, the geographic coordinates of the location are generated and collected automatically by the proposed method, so that no post-hoc data cleaning or geocoding is needed. In this pilot study, the data collectors, who stood in for participants, all had background knowledge obtained through field visits to the selected addresses.
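
The workflow described above (search, verify, fall back to a manually placed pinpoint) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `collect_geocode` and the three callables are hypothetical stand-ins for the map search, the on-screen confirmation and the pin placement, and no real Google Maps API call is shown. Falling back to a pin when the confirmation fails is an added assumption; the text only describes the fallback for addresses that cannot be found.

```python
from typing import Callable, Optional, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def collect_geocode(
    address: str,
    search: Callable[[str], Optional[LatLon]],
    confirm: Callable[[str, LatLon], bool],
    place_pin: Callable[[str], LatLon],
) -> dict:
    """Return coordinates for an address, using a manual pin as fallback."""
    candidate = search(address)  # automatic address search on the map
    if candidate is not None and confirm(address, candidate):
        lat, lon = candidate
        source = "address search"
    else:
        # Address not found (or, by assumption, rejected on the image):
        # the user places a pinpoint on the aerial image instead.
        lat, lon = place_pin(address)
        source = "manual pin"
    return {"address": address, "latitude": lat, "longitude": lon, "source": source}


# Made-up address and coordinates, for illustration only.
record = collect_geocode(
    "123 Example St",
    search=lambda addr: None,  # simulate an address that is not found
    confirm=lambda addr, loc: True,
    place_pin=lambda addr: (29.65, -82.32),
)
print(record["source"])
```

Because the coordinates are recorded at the moment of verification, the record needs no later geocoding pass, which is the property the text emphasises.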

Covariates

Information on maternal sociodemographic status was obtained from the vital statistics dataset, including maternal age at delivery (<30 or ≥30 years old), race (black or non-black), education level (<high school, high school or >high school), marital status (married or not married) and insurance type (Medicaid or non-Medicaid). In addition, housing types were categorised into two groups: apartment/condominium and others. We also categorised each address as urban or rural based on the GPS-measured geocodes using the 2013 cartographic boundary shapefiles (urban areas) from the US Census (https://www.census.gov/geo/maps-data/data/cbf/cbf_ua.html).
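
The urban/rural assignment amounts to a point-in-polygon test of each GPS coordinate against the urban-area boundaries. The sketch below uses a plain ray-casting test on a made-up rectangular boundary; it is an illustration under simplifying assumptions (single rings, no holes, planar coordinates) and not the GIS procedure used in the study, which would read the Census shapefiles directly.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def classify_area(lon, lat, urban_polygons):
    """Label a coordinate 'urban' if it falls in any urban polygon."""
    if any(point_in_polygon(lon, lat, p) for p in urban_polygons):
        return "urban"
    return "rural"


# Hypothetical rectangular "urban area"; not a real Census boundary.
urban = [[(-82.4, 29.6), (-82.2, 29.6), (-82.2, 29.7), (-82.4, 29.7)]]
print(classify_area(-82.3, 29.65, urban), classify_area(-82.5, 29.65, urban))
```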

Statistical analysis

The geocodes measured by the GPS receiver were used as the reference in this study. Geocodes from all three methods were based on the WGS84 datum. The positional errors of the addresses geocoded automatically by FDOH and of the geocodes generated using the smartphone-assisted method were determined as their geodetic distance (the shortest path along the ellipsoid of the Earth at sea level between two points) in meters to the GPS-measured geocodes, using the GEODIST function in SAS 9.3. Descriptive statistics were generated where appropriate, and paired t-tests were used to examine the difference in positional errors between the automated geocoding method and the smartphone-assisted method. The distribution of parcel size for the addresses was generated by housing type (apartment/condominium or not). We used both regression and tree-based methods to model the potential association between housing types, maternal characteristics, urbanization and the positional accuracy of the automated geocoding method. The positional errors of the addresses geocoded automatically by FDOH were modelled both as continuous and dichotomous variables (>100 m or ≤100 m). The cut-off of 100 m was selected because of its wide use in the literature on positional accuracy and environmental exposure assessment (Bonner et al.; Gordian et al.; Wu et al.; Zandbergen et al.). We first fitted generalized linear models for these outcomes and all covariates, with the continuous outcome log-transformed to account for its skewed distribution, and then used regression trees to further explore the potential interactions and nonlinear associations between the covariates and the outcomes (James et al.). The regression tree is a non-parametric method which recursively partitions the data space and fits a simple prediction model within each partition. Therefore, it can identify complex interactions and non-linear associations between the predictors and the outcome without any a priori specification.
Data management was performed using SAS 9.3 and all analyses were conducted using R 3.1.2.
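
The positional error is a distance between two latitude/longitude pairs, dichotomised at 100 m. The sketch below uses the haversine formula on a sphere, which is a swapped-in approximation and not the ellipsoidal calculation described above; the Earth radius constant and the function names are choices made here for illustration.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def exceeds_cutoff(error_m, cutoff_m=100.0):
    """Dichotomise a positional error at the 100 m cut-off."""
    return error_m > cutoff_m


# Made-up coordinates: a geocode 0.001 degrees of latitude from the reference.
error = haversine_m(29.650, -82.320, 29.651, -82.320)
print(round(error, 1), exceeds_cutoff(error))
```

At the distances involved here (tens to hundreds of meters), the spherical and ellipsoidal results typically differ by well under 1%, but an ellipsoidal routine should be used to reproduce the study's numbers.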
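
The recursive partitioning idea behind a regression tree can be illustrated by its first step: choosing the binary predictor whose split gives the smallest within-group sum of squared errors. The sketch below is a toy version with made-up log-error values; it is not the R analysis used in the study, and real tree software adds recursion, stopping rules and pruning.

```python
def sse(values):
    """Sum of squared deviations from the group mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(rows, predictors, outcome):
    """Return (predictor, total SSE) for the best single binary split."""
    best = None
    for name in predictors:
        left = [r[outcome] for r in rows if r[name]]
        right = [r[outcome] for r in rows if not r[name]]
        if not left or not right:
            continue  # a split must put observations on both sides
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (name, total)
    return best


# Made-up log-transformed errors, for illustration only.
rows = [
    {"apartment": True, "rural": False, "log_error": 5.0},
    {"apartment": True, "rural": False, "log_error": 5.2},
    {"apartment": True, "rural": True, "log_error": 4.8},
    {"apartment": False, "rural": False, "log_error": 3.4},
    {"apartment": False, "rural": True, "log_error": 4.0},
    {"apartment": False, "rural": False, "log_error": 3.5},
]
split = best_binary_split(rows, ["apartment", "rural"], "log_error")
print(split[0])
```

Applying the same search again within each resulting group is what lets a tree surface interactions, such as an urban/rural effect that appears only among one housing type.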

Results

Among the 100 randomly sampled addresses, 99 were successfully identified and geocoded using both the GPS receiver and the smartphone-assisted method. All subsequent analyses were based on these 99 addresses. For the one remaining address, apparent errors in the street number made it unidentifiable, so it was excluded from this study. Table 1 shows the distribution of maternal socioeconomic status at delivery, housing and area characteristics. Most of the women living at the sampled addresses were less than 30 years old (65.66%), were non-black (64.65%), had education levels greater than high school (74.75%), were married (59.60%) and had insurance other than Medicaid (61.62%). Approximately 30% of the housing was apartments or condominiums and approximately 14% of the addresses were located in rural areas. Table 1 also presents the geometric means of positional errors for both the automated geocoding method and the smartphone-assisted method. Overall, the automated geocoding method yielded a geometric mean positional error of 56.46 m, while the geometric mean error for the smartphone-assisted method was 13.30 m. Consistent patterns were observed in all subgroups by sociodemographic status, housing and area characteristics. In addition, paired t-tests showed significant differences between the two methods in all subgroups examined (all P values <0.05). The distribution of parcel size by housing type is presented in Table 2.

Table 1. Geometric means of positional errors by maternal sociodemographic status and housing and area characteristics.

| Parameter | N | % | Automated geocoding method: positional error (m), geometric mean±SD | Mobile-assisted aerial image-based method: positional error (m), geometric mean±SD | P |
|---|---|---|---|---|---|
| Total | 99 | 100.00 | 56.46±3.81 | 13.30±3.18 | <0.001 |
| Age at delivery (years) | | | | | |
| <30 | 65 | 65.66 | 58.22±3.98 | 11.92±3.18 | <0.001 |
| ≥30 | 34 | 34.34 | 53.25±3.56 | 16.41±3.14 | <0.001 |
| Race | | | | | |
| Black | 35 | 35.35 | 45.57±3.44 | 10.15±3.60 | <0.001 |
| Non-black | 64 | 64.65 | 63.48±4.05 | 15.43±2.90 | <0.001 |
| Education | | | | | |
| <High school | 17 | 17.17 | 55.33±4.24 | 10.83±3.43 | <0.001 |
| High school | 8 | 8.08 | 59.88±4.39 | 9.47±4.58 | 0.031 |
| >High school | 74 | 74.75 | 56.36±3.74 | 14.47±3.01 | <0.001 |
| Marital status | | | | | |
| Married | 59 | 59.60 | 67.20±3.83 | 13.76±3.36 | <0.001 |
| Not married | 40 | 40.40 | 43.67±3.68 | 12.67±2.96 | <0.001 |
| Insurance | | | | | |
| Medicaid | 38 | 38.38 | 48.90±3.61 | 10.72±3.51 | <0.001 |
| Non-Medicaid | 61 | 61.62 | 61.75±3.95 | 15.22±2.95 | <0.001 |
| Housing type | | | | | |
| Apartment/condominium | 30 | 30.30 | 151.09±4.06 | 7.91±3.28 | <0.001 |
| Other | 69 | 69.70 | 36.80±2.90 | 16.68±2.93 | <0.001 |
| Area | | | | | |
| Urban area | 85 | 85.86 | 54.97±3.85 | 12.94±3.14 | <0.001 |
| Rural area | 14 | 14.14 | 66.40±3.69 | 15.72±3.56 | 0.021 |

SD, standard deviation.
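
Because positional errors are strongly right-skewed, Table 1 summarises them with geometric means. A geometric mean is the exponential of the mean of the log-transformed values. The sketch below also computes a geometric (multiplicative) standard deviation; reading the table's "±SD" column that way is an assumption, since the source does not define it. The input values are made up.

```python
import math
import statistics


def geometric_mean_sd(errors_m):
    """Geometric mean and geometric SD of positive values."""
    logs = [math.log(e) for e in errors_m]
    gm = math.exp(statistics.mean(logs))
    gsd = math.exp(statistics.stdev(logs))  # multiplicative spread factor
    return gm, gsd


# Made-up positional errors in meters, for illustration only.
gm, gsd = geometric_mean_sd([10.0, 100.0, 1000.0])
print(round(gm, 2), round(gsd, 2))
```

The arithmetic mean of the same three values is 370 m, which shows how a single large error dominates an untransformed average.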

Table 2. Distribution of parcel size (square meters) by housing type.

| Housing type | N | Median | Mean | SD | Quartile 1 | Quartile 3 |
|---|---|---|---|---|---|---|
| Apartment/condominium | 30 | 40,984.13 | 57,131.70 | 62,680.05 | 1627.07 | 104,800.82 |
| Others | 69 | 1104.41 | 18,958.34 | 48,402.67 | 730.78 | 7265.93 |
| Total | 99 | 1390.65 | 24,742.18 | 52,294.61 | 801.27 | 20,234.57 |

SD, standard deviation.

Figure 2 compares the positional errors of the automated geocoding method and the smartphone-assisted method. All locations geocoded from aerial images fell within 100 m of the true location, with around 94% of them within 50 m. In contrast, only around 70% of the automatically geocoded addresses were within 100 m of the true location, with 52% and 9% having errors less than 50 m and 10 m, respectively. When stratified (Table 3), the automated geocoding method produced a higher proportion of addresses with positional errors greater than 100 m for apartment/condominiums than for other housing types (67% vs 13%) and for rural than for urban addresses (43% vs 27%). No address had a positional error >100 m with the new mobile-assisted method.

Figure 2. The positional errors between the automated geocoding method and the smartphone-assisted method.

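Table 3. Positional errors by housing type and area.

| Housing type/area | Total number of addresses | Automated geocoding method: addresses with errors >100 m | Automated geocoding method: % (95% CI) | Mobile-assisted aerial image-based method: addresses with errors >100 m | Mobile-assisted aerial image-based method: % (95% CI) |
|---|---|---|---|---|---|
| Apartment or condominium | 30 | 20 | 66.67 (49.80, 83.54) | 0 | - |
| Other | 69 | 9 | 13.04 (5.10, 20.99) | 0 | - |
| Urban area | 85 | 23 | 27.06 (17.61, 36.50) | 0 | - |
| Rural area | 14 | 6 | 42.86 (16.93, 68.78) | 0 | - |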

CI, confidence interval.
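
The source does not state how the confidence intervals in Table 3 were computed, but they are consistent with a Wald (normal-approximation) interval for a proportion, p ± 1.96·sqrt(p(1−p)/n). The sketch below, with a function name chosen here for illustration, reproduces the four rows from the reported counts.

```python
import math


def wald_ci_percent(events, total, z=1.96):
    """Proportion and Wald 95% CI, all expressed in percent."""
    p = events / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return (
        round(100 * p, 2),
        round(100 * (p - half_width), 2),
        round(100 * (p + half_width), 2),
    )


# Counts of addresses with errors >100 m, taken from Table 3.
for label, events, total in [
    ("Apartment or condominium", 20, 30),
    ("Other", 9, 69),
    ("Urban area", 23, 85),
    ("Rural area", 6, 14),
]:
    print(label, wald_ci_percent(events, total))
```

With only 14 rural addresses the interval spans more than 50 percentage points, which is worth keeping in mind when reading the urban/rural comparison.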

Table 4 shows the results of the generalized linear models used to examine the potential association between the positional errors of the automated geocoding method and covariates. The continuous model showed that the apartment/condominium housing type was associated with a 1.59 [95% confidence interval (CI): 1.07, 2.12] increase in the log-transformed positional error. In addition, the logistic regression model found that apartment/condominium addresses and addresses located in rural areas had 64.54 (95% CI: 14.94, 409.55) and 9.66 (95% CI: 1.79, 64.93) times the odds, respectively, of being automatically geocoded with positional errors >100 m. Non-black women's addresses were also significantly associated with increased odds (OR: 7.08, 95% CI: 1.25, 51.90) of having positional errors greater than 100 m when using the automated geocoding method.

Table 4. Associations between positional error of the automated geocoding method used by the Florida Department of Health and maternal socioeconomic status and housing characteristics.

| Parameter | Continuous (log-transformed), β (95% CI) | Dichotomous (>100 m vs ≤100 m), OR (95% CI) |
|---|---|---|
| Age at delivery (years) | | |
| <30 | Reference | Reference |
| ≥30 | -0.25 (-0.79, 0.30) | 0.89 (0.22, 3.70) |
| Race | | |
| Black | Reference | Reference |
| Non-black | 0.32 (-0.36, 1.00) | 7.08 (1.25, 51.90) |
| Education | | |
| <High school | Reference | Reference |
| High school | 0.37 (-0.54, 1.28) | 4.92 (0.50, 53.51) |
| >High school | 0.38 (-0.40, 1.15) | 0.63 (0.08, 5.09) |
| Marital status | | |
| Married | Reference | Reference |
| Not married | -0.43 (-1.09, 0.23) | 0.77 (0.14, 4.23) |
| Insurance | | |
| Medicaid | Reference | Reference |
| Non-Medicaid | 0.02 (-0.66, 0.69) | 0.42 (0.08, 2.09) |
| Housing type | | |
| Other | Reference | Reference |
| Apartment/condominium | 1.59 (1.07, 2.12) | 64.54 (14.94, 409.55) |
| Area | | |
| Urban area | Reference | Reference |
| Rural area | 0.62 (-0.12, 1.35) | 9.66 (1.79, 64.93) |

CI, confidence interval; OR, odds ratio.
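
A coefficient on a log-transformed outcome is easier to read after exponentiation, where it becomes a ratio of geometric means. The sketch below applies this to the apartment/condominium coefficient reported in the text, assuming the natural logarithm was used (the source does not say which base), and computes the unadjusted odds ratio implied by the counts in Table 3 for comparison with the adjusted estimate. Function names are chosen here for illustration.

```python
import math


def ratio_from_log_beta(beta, lower, upper):
    """Exponentiate a log-scale coefficient and its CI (natural log assumed)."""
    return tuple(round(math.exp(v), 2) for v in (beta, lower, upper))


def crude_odds_ratio(events_a, total_a, events_b, total_b):
    """Unadjusted odds ratio of group A versus group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return round(odds_a / odds_b, 2)


# Apartment/condominium coefficient on the log-transformed error.
print(ratio_from_log_beta(1.59, 1.07, 2.12))

# Table 3 counts with errors >100 m: 20 of 30 apartments, 9 of 69 others.
print(crude_odds_ratio(20, 30, 9, 69))
```

Under the natural-log assumption the coefficient corresponds to roughly a five-fold higher geometric mean error. The crude odds ratio is far smaller than the adjusted 64.54, and the adjusted interval is very wide, both of which reflect the small sample.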

Figure 3 presents the covariates significantly associated with positional errors of the automated geocoding method in the regression tree analyses. Housing type was significant in the models for both the continuous and the dichotomous outcome, and urbanity was an important predictor of positional errors of the automated geocoding method among the addresses that were not apartment/condominiums.

Figure 3. Covariates significantly associated with positional errors of the automated geocoding method.

Discussion

Using GPS receivers as the reference measure for true location, we compared the positional errors of the automated geocoding method used by FDOH and the smartphone-assisted geocoding method. The conventional automated geocoding method has substantial deficiencies in positional accuracy, with approximately 30% of the geocoded addresses having positional errors exceeding 100 m; this is a significant methodologic shortcoming in many settings of environmental epidemiologic studies (Griffith et al.; Zandbergen, 2008). The positional errors of the automated geocoding method observed in this study are comparable to those in previous research conducted in the states of Iowa, New York and Texas, where 21-28% of automatically geocoded addresses were reported to have errors over 100 m (Bonner et al.; Ward et al.; Zhan et al.). More importantly, our study shows that such errors are not randomly distributed, given the association observed between positional errors and housing type and urbanity. In addition to the urban-rural heterogeneity of positional errors reported in previous studies (Cayo and Talbot, 2003; Hurley et al.; Whitsel et al.), we observed even larger heterogeneity by housing type, with greater errors for apartment/condominium addresses. These non-randomly distributed errors may lead to a differential misclassification bias that will greatly influence the validity of studies based on these automated geocoding data. In addition, we found that the smartphone-assisted geocoding method may substantially increase positional accuracy compared with traditional geocoding. Unlike some previous studies, which used aerial image geocodes as the gold standard for true location (Schootman et al.), we regarded aerial imagery as a potential method for verifying address locations during spatial data collection. Although the aerial image substantially improved positional accuracy, it still showed slight discrepancies when compared with the GPS-measured geocodes.
This may be due to several reasons, of which the resolution of the aerial image is one important factor. In addition, in our study, some of the homes could not be accurately identified in the aerial images since they were covered and surrounded by trees and green spaces. In spite of these limitations, the smartphone-assisted method still offered significant improvement over the traditional methods, especially for apartment/condominium addresses, since most automated geocoding methods cannot handle apartment-level information. Extensive efforts have been devoted to improving automated geocoding, and many methods have been proposed, including manual intervention (Chaput et al.; Goldberg et al.; Ward et al.), re-geocoding with a different geocoder (Lovasi et al.; Zhan et al.), and imputation or pseudocoding (Boscoe, 2008; Henry and Boscoe, 2008; Strickland et al.). However, all these methods focus on improving spatial data quality after data collection. The proposed smartphone-assisted method integrates aerial image-based manual correction into data collection itself, making it possible to collect and geocode addresses prospectively and, importantly, to verify the geocoded data while they are being collected. Previous studies have suggested an error rate of 10% and a missing rate of 5% for self-reported addresses in public health surveillance datasets (Zinszer et al.). Such errors and missing data can be caused by both participants and administrative staff. Participants may skip an address or report a wrong one for many reasons, such as privacy concerns and recall errors. Staff, on the other hand, may make data-entry and processing mistakes. Importantly, the automated geocoding method may sometimes fail to identify such errors and even assign a false-matched geocode. Unfortunately, it is hard to detect such errors in large datasets and there is no existing validation tool to identify and fix these errors in the data collection process.
Such errors are therefore almost impossible to correct once the data collection has been completed. However, the proposed smartphone-assisted method can avoid these issues during data collection through participant-involved verification, real-time geocoding and aerial image/map-assisted real-time search. The proposed method can easily be integrated into many data collection systems to obtain high-quality spatial data. Integrating this method into data collection systems will transfer the effort of geocoding from the data collectors to the participants, making it feasible for data collection in large health studies or electronic health records such as vital statistics birth records. It will also allow participants to interact with the geocoding system directly, offering an unprecedented use of street maps, satellite images and street views to reduce missing records as well as to improve positional accuracy. Indeed, participants have more local knowledge than GIS technicians and can accurately verify and find the locations of their addresses on maps/aerial images. Therefore, the use of this method for spatial data collection has great potential for improving spatial data quality. Several limitations of this study should be noted. First, this is a pilot study that has a relatively small sample size and focused on only one county. Additionally, the smartphone-assisted method was carried out by researchers rather than residents. Residents may provide more accurate geocoding information using the system, as they are more familiar with the neighbourhood, especially when the home cannot be directly identified in the image. Furthermore, measurement errors may exist in the GPS reference measurements, since we were not able to enter the participants' homes.

Conclusions

With respect to the vital statistics birth record dataset, studies relying on automated geocoding may suffer from potential differential bias. Apartment or condominium addresses and addresses located in rural areas are more likely to have greater positional errors. The smartphone-assisted method may substantially improve positional accuracy in geocoding and has the potential to be used as a spatial data collection tool to further improve spatial data quality.

References

1.  Post office box addresses: a challenge for geographic information system-based studies.

Authors:  Susan E Hurley; Theresa M Saunders; Rachna Nivas; Andrew Hertz; Peggy Reynolds
Journal:  Epidemiology       Date:  2003-07       Impact factor: 4.822

2.  Positional accuracy of geocoded addresses in epidemiologic research.

Authors:  Matthew R Bonner; Daikwon Han; Jing Nie; Peter Rogerson; John E Vena; Jo L Freudenheim
Journal:  Epidemiology       Date:  2003-07       Impact factor: 4.822

3.  Residential address errors in public health surveillance data: a description and analysis of the impact on geocoding.

Authors:  Kate Zinszer; Christian Jauvin; Aman Verma; Lucie Bedard; Robert Allard; Kevin Schwartzman; Luc de Montigny; Katia Charland; David L Buckeridge
Journal:  Spat Spatiotemporal Epidemiol       Date:  2010-03-20

4.  A research agenda: does geocoding positional error matter in health GIS studies?

Authors:  Geoffrey M Jacquez
Journal:  Spat Spatiotemporal Epidemiol       Date:  2012-02-14

5.  Surrounding greenness and pregnancy outcomes in four Spanish birth cohorts.

Authors:  Payam Dadvand; Jordi Sunyer; Xavier Basagaña; Ferran Ballester; Aitana Lertxundi; Ana Fernández-Somoano; Marisa Estarlich; Raquel García-Esteban; Michelle A Mendez; Mark J Nieuwenhuijsen
Journal:  Environ Health Perspect       Date:  2012-08-16       Impact factor: 9.031

6.  Comparing a single-stage geocoding method to a multi-stage geocoding method: how much and where do they disagree?

Authors:  Gina S Lovasi; Jeremy C Weiss; Richard Hoskins; Eric A Whitsel; Kenneth Rice; Craig F Erickson; Bruce M Psaty
Journal:  Int J Health Geogr       Date:  2007-03-16       Impact factor: 3.918

7.  Quantifying geocode location error using GIS methods.

Authors:  Matthew J Strickland; Csaba Siffel; Bennett R Gardner; Alissa K Berzen; Adolfo Correa
Journal:  Environ Health       Date:  2007-04-04       Impact factor: 5.984

8.  Positional error in automated geocoding of residential addresses.

Authors:  Michael R Cayo; Thomas O Talbot
Journal:  Int J Health Geogr       Date:  2003-12-19       Impact factor: 3.918

9.  Spatial analysis of human granulocytic ehrlichiosis near Lyme, Connecticut.

Authors:  Emma K Chaput; James I Meek; Robert Heimer
Journal:  Emerg Infect Dis       Date:  2002-09       Impact factor: 6.883

10.  Residential greenness and birth outcomes: evaluating the influence of spatially correlated built-environment factors.

Authors:  Perry Hystad; Hugh W Davies; Lawrence Frank; Josh Van Loon; Ulrike Gehring; Lillian Tamburic; Michael Brauer
Journal:  Environ Health Perspect       Date:  2014-07-11       Impact factor: 9.031
