Literature DB >> 35761991

MTBF-33: A multi-temporal building footprint dataset for 33 counties in the United States (1900 - 2015).

Johannes H Uhl1,2,3, Stefan Leyk2,3.   

Abstract

Despite abundant data on the spatial distribution of contemporary human settlements, historical datasets on the long-term evolution of human settlements at fine spatial and temporal granularity are scarce, limiting our quantitative understanding of long-term changes of built-up areas. This is because commonly used large-scale mapping methods (e.g., computer vision) and suitable data sources (i.e., aerial imagery, remote sensing data, LiDAR data) have only been available in recent decades. However, there are alternative data sources such as cadastral records that are digitally available, containing relevant information such as building construction dates, allowing for an approximate, digital reconstruction of past building distributions. We conducted a non-exhaustive search of open and publicly available data resources from administrative institutions in the United States and gathered, integrated, and harmonized cadastral parcel data, tax assessment data, and building footprint data for 33 counties, wherever building footprint geometries and building construction year information was available. The result of this effort is a unique dataset that we call the Multi-Temporal Building Footprint Dataset for 33 U.S. Counties (MTBF-33). MTBF-33 contains over 6.2 million building footprints including their construction year, and can be used to derive retrospective depictions of built-up areas from 1900 to 2015, at fine spatial and temporal grain. Moreover, MTBF-33 can be employed for data validation purposes, or to train statistical learning models aiming to extract historical information on human settlements from remote sensing data, historical maps, or similar data sources.
© 2022 The Author(s).

Entities:  

Keywords:  Building stock; Built environment; Change detection; Historical GIS; Human settlements; Long-term land use change; Remote sensing; Urbanization; historical spatial data

Year:  2022        PMID: 35761991      PMCID: PMC9233215          DOI: 10.1016/j.dib.2022.108369

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table

Value of the Data

Open and publicly accessible data on building age are scarce. Our data scraping, integration, and harmonization effort aims to fill this gap in the data landscape. Knowledge of the construction year of individual buildings allows for creating spatially and temporally fine-grained depictions of past built-up surfaces. Such spatial-historical data may serve as calibration and validation data for urban change models and for historical (and more recent) human settlement datasets, as well as for historical population downscaling efforts. Moreover, such data are very useful to be employed as auxiliary data for the automated training data generation for data-intensive (e.g. deep learning) computer vision models to automatically extract urban change signals from remote sensing data or historical maps. These data enable historical analyses of the building stock in 33 U.S. counties, encompassing the whole state of Massachusetts, as well as several urban areas of different settlement age and characteristics, such as Boston, Charlotte, and Minneapolis. Lastly, these data are highly valuable for urban planners, remote sensing analysts, historians, demographers, and data scientists working in the context of urban land use change and (sub)urbanization, as they provide rare insight into the long-term dynamics of built-up areas at very high spatial and temporal detail.

Data Description

We collected open and publicly available data resources from the web from administrative, county- or state-level institutions in the United States and integrated and harmonized cadastral parcel data, tax assessment data, and building footprint data for 33 counties, where building footprint data and building construction year information (“year built”) was available. The result of this effort is a unique dataset called the Multi-Temporal Building Footprint Dataset for 33 U.S. Counties (MTBF-33, [1], available at http://dx.doi.org/10.17632/w33vbvjtdy). MTBF-33 contains over 6.2 million building footprints including their construction year, and is available as polygonal geospatial vector data in 33 ESRI Shapefiles, in geographic coordinates (WGS84, EPSG:4326), as well as projected into Albers equal area conic projection for the contiguous USA (USGS version, SR-ORG:74801), organized per county. Fig. 1 shows small subsets of the MTBF-33 dataset for selected regions.
Fig. 1

MTBF-33 multi-temporal building footprint data examples, shown for (a) Boulder (Colorado) (b) Sarasota (Florida), (c) Boston (Massachusetts), and (d) Minneapolis (Minnesota).

MTBF-33 multi-temporal building footprint data examples, shown for (a) Boulder (Colorado) (b) Sarasota (Florida), (c) Boston (Massachusetts), and (d) Minneapolis (Minnesota). Moreover, Fig. 2 shows small subsets of the data for most of the 33 U.S. counties, illustrating the variability in the data and their coverage across different geographic settings.
Fig. 2

MTBF-33 multi-temporal building footprint dataset, showing examples for most of the 33 counties covered in MTBF-33. Water mask in grey derived from GHS-BUILT R2018A (epoch 2014, [2]). Counties are sorted by their FIPS in the same order as shown in Table 1 (upper left: Boulder County, lower right: Mecklenburg county). Parts of New York County and Kings County (New York City) are jointly shown as “Manhattan”. Not shown are Queens and Richmond counties (New York City).

MTBF-33 multi-temporal building footprint dataset, showing examples for most of the 33 counties covered in MTBF-33. Water mask in grey derived from GHS-BUILT R2018A (epoch 2014, [2]). Counties are sorted by their FIPS in the same order as shown in Table 1 (upper left: Boulder County, lower right: Mecklenburg county). Parts of New York County and Kings County (New York City) are jointly shown as “Manhattan”. Not shown are Queens and Richmond counties (New York City).
Table 1

Overview and statistics of the 33 counties covered in MTBF-33.

CountyCountyBuildings w/TotalPercentYear builtYear builtYear builtYear built
FIPSNameStatevalid year builtbuildingscompleteminimummaximummeanmedian
08013BoulderColorado76,92980,25595.91858201419681971
12057HillsboroughFlorida410,076421,04697.41842201419811984
12081ManateeFlorida154,416173,17389.21870201319801982
12115SarasotaFlorida194,043198,68597.71877201319811981
18163VanderburghIndiana93,797108,79886.21810201519511950
24005BaltimoreMaryland281,216308,93391.01676201519581958
25001BarnstableMassachusetts172,542185,81892.91626201419611971
25003BerkshireMassachusetts82,03689,79091.41650201519371950
25005BristolMassachusetts225,156233,70496.31500201319481957
25007DukesMassachusetts18,75823,52479.71660201419631980
25009EssexMassachusetts224,351270,39883.01600201419371947
25011FranklinMassachusetts43,43650,20986.51666201319361951
25013HampdenMassachusetts192,281207,19592.81600201519471953
25015HampshireMassachusetts69,50577,98289.11629201419471960
25017MiddlesexMassachusetts460,722500,04792.11600201519421950
25019NantucketMassachusetts13,54713,97197.01640201119621983
25021NorfolkMassachusetts216,150242,63189.11500201519441951
25023PlymouthMassachusetts207,264230,78889.81600201519501962
25025SuffolkMassachusetts106,037109,87696.51637201519241920
25027WorcesterMassachusetts317,302344,30792.21650201519481957
27003AnokaMinnesota128,498135,30795.01852201519771979
27019CarverMinnesota40,48841,76896.91816201519691984
27037DakotaMinnesota145,903163,17989.41832201419731978
27053HennepinMinnesota380,301387,85698.11843201019551956
27123RamseyMinnesota239,544245,27997.71850201519461951
27163WashingtonMinnesota86,21695,01490.71742201519731983
34025MonmouthNew Jersey206,624212,95197.01684201519611963
36005BronxNew York102,658103,86598.81780201519411931
36047KingsNew York329,283331,81399.21800201519311925
36061New YorkNew York45,32246,20998.11765201419211910
36081QueensNew York454,506457,62899.31661201519391935
36085RichmondNew York138,609140,05099.01665201419621969
37119MecklenburgNorth Carolina402,242418,05696.21792201519801984
As can be seen in Figs. 1 and 2, there are several buildings without year built attribute (white color). We report the year built attribute completeness for each of the 33 counties in Table 1. Moreover, Table 1 shows some basic year built statistics, illustrating the variety in temporal coverage of the data. Overview and statistics of the 33 counties covered in MTBF-33.

Experimental Design, Materials and Methods

Data creation We manually collected cadastral parcel data, tax assessment data, and building footprint data from publicly and openly available web resources, such as from state-level or county-level administrative GIS or spatial data resources. We used open data portals such as https://hub.arcgis.com to identify counties or states where (a) both parcel and building footprint data is available, and (b) parcel data or joinable tax assessment data contains information on the year when structures have been established (year built). We identified 33 counties that satisfied these criteria and where the completeness of the building footprint data and the year built attribute was acceptable (see Table 1). In counties where the year built information was contained in separate tax assessment datasets, we first joined the tax assessment data to the parcel data. Then, we integrated the parcel data and building footprint data. This integration was done through a spatial join operation, in order to transfer the year built attribute from the parcel polygon features to the building footprint features contained within the cadastral parcel boundaries. This spatial assignment was based on a majority-area criterion in order to account for certain levels of spatial offsets between parcel and building footprint data. Such offsets may exist due to different data acquisition methods: While parcel boundaries are typically measures using terrestrial or Global Navigation Satellite System (GNSS)-based land surveying technologies, building footprint data may be obtained through automatic segmentation of LiDAR data or by digitization in aerial imagery. As a result of these spatial joins, the year built attribute was transferred to the building footprint features. For these processes, we used the GeoPandas2 and ESRI ArcPy3 python package. We then harmonized and cleaned the data. This cleaning process involved the identification of non-plausible year built values (e.g., < 1500). We do this to remove structural zero values representing missing information, and obviously incorrect values, likely resulting from typos, such as “910” which could be “1910”. The threshold of 1500 was chosen as it marks the approximate end of the Pre-Columbian era, and very few built-up structures or dwellings from that era are still intact, and if so, they are rather considered a monument or landmark than a “building” as defined in modern building stock databases. Such missing or non-plausible year built values were set to 0. Importantly, any property-, building-, or individual-level data other than the year built attribute was removed, so that the MTBF-33 data exclusively consists of building footprint geometries and their construction year. The resulting polygonal, geospatial vector data represent building footprints for 33 counties in the conterminous United States, allowing for the reconstruction of spatially and temporally fine-grained depictions of built-up surfaces (i.e., building level, annual temporal resolution). While contemporary building footprint data is available at high levels of accuracy [3], data on the historical distributions of the U.S. building stock is very scarce, in particular for time periods earlier than the 1970s or 1980s, when remote-sensing-based, digital earth observation data became accessible. Thus, despite representing only about 1% of the U.S. counties, this unique dataset covers more than 40,000 km² and more than 6,000,000 cadastral parcels, and fills an important gap in the geospatial data landscape. The MTBF-33 dataset was collected in 2016-2017 and since then, MTBF-33 has been employed by the authors for different purposes, including the validation of global remote-sensing-based multi-temporal built-up surface data [4], [5], [6], [7], [8], the validation of historical settlement data derived from property databases [9,10], to automatically generate training data for urban change detection based on Landsat time series data [11], to assess the sensitivity of Landsat time series to urban changes [12], and for training data generation used by computer vision models to extract settlement patterns from historical topographic maps [13,14]. Validation and uncertainty assessment: The validation of data that entail advancements in quality and accuracy compared to existing data products is always challenging. We evaluated MTBF-33 through two approaches. First, we compared the dataset for all 33 counties with the Microsoft Building Footprint (MSBF) dataset [3] for the building footprints existing in 2015. Second, we carried out a visual comparison between the MTBF-33 and urban extents as found in historical topographic maps.

Agreement assessment between MTBF-33 and Microsoft building footprint data

Microsoft's building footprint dataset (data release from 2018) has a US-wide coverage and has been extracted from Microsoft Bing imagery using a deep-learning-based computer vision method. While the acquisition years of the Bing imagery are likely to vary across the United States, we assume that MSBF represents the U.S. building stock approximately in 2015, expecting an average period of three years between image data acquisition and building footprint data release. In order to evaluate the MTBF-33 quantitatively, we created gridded binary layers (i.e., built-up versus not built-up) for 2015 for both MTBF-33 and MSBF, in a spatial grid of 250m x 250m, based on an area majority rule, for each of the 33 counties covered in MTBF-33. Based on these gridded surfaces, we established confusion matrices per county, used to calculate various agreement measures to assess agreement between the two binary layers, using the MSBF data as reference data (Table 2). While some of these measures have been criticized due to some limitations, e.g., if class proportions are imbalanced [15,16], a cross-section through all those measures represents a reliable assessment basis. We carried out the agreement assessment across all 33 counties, as well as separately for higher-density and lower-density counties. This stratification was done based on the MSBF built-up surface density per county, using the median county-level built-up surface density as a threshold.
Table 2

Results of the agreement assessment between MTBF-33 (in 2015) and Microsoft building footprint data.

Agreement measureAll countiesHigher-density countiesLower-density counties
Overall accuracy0.9330.9690.919
Precision0.9560.9600.954
Recall0.9000.9900.853
F-measure0.9280.9740.901
Kappa index0.8650.9350.833
Results of the agreement assessment between MTBF-33 (in 2015) and Microsoft building footprint data. As can be seen in Table 2, when using all 33 counties for map comparison, all accuracy measures show high agreement between the two layers (between 86.5% based on Kappa and 95.6% based on Precision). Higher accuracy is observed for high-density counties compared to the low-density counties. The notable difference in Recall (0.99 and 0.85, respectively) indicates that omission errors are higher in low-density counties, possibly because MSBF identifies structures that are not part of the county assessor's building stock database (e.g., barns), especially in more rural settings. Thus, MTBF-33 has no built structures at those locations, resulting in higher proportions of false negatives. This effect propagates into the other measures, resulting in reduced F-measure and Kappa index for lower-density counties. However, even in lower-density counties, no accuracy measure is less than 0.83 indicating high levels of agreement between the two datasets. Moreover, there may be slight temporal gaps between MTBF-33 (representing the building stock in 2015) and MSBF (heterogeneous acquisition years of the imagery underlying the MSBF data), due to the heterogeneous levels of temporal coverage of the MTBF-33 data (see built year statistics in Table 1), and due to the vagueness in the definition of the construction year in MTBF-33. These factors could explain some of the slight disagreement observed in Table 2. Here, it is worth noting that the aforementioned, assumed period of three years between Bing imagery acquisition and data release in 2018 may be overly optimistic, i.e., the underlying imagery may have been acquired prior to 2015 and thus, MSBF may reflect an earlier state of the building stock. Assuming predominant growth (rather than shrinkage) of the building stock over time, such an incorrect time stamp of our reference data (i.e., MSBF) would result in artificially inflated commission errors (i.e., low precision) in our test data (i.e., the MTBF-33) which represents a later state of the building stock. However, as shown in Table 2, Precision is very high across all strata, and thus, the unknown temporal reference of the MSBF data and the effects of a potential temporal gap may explain the observed commission errors of 0.040 to 0.046.

Qualitative evaluation of MTBF-33 data against historical maps

We carried out a visual comparison between the MTBF-33 and urbanized extents as shown in historical topographic maps of the U.S. Geological Survey (USGS) historical topographic map collection (HTMC4) [17]. To do so, we created historical binary layers of the MTBF-33 to match the publication dates of the historical map sheets. Fig. 3 shows historical map sheets for Boulder, Colorado, and the respective extracted MTBF-33 binary layers for 1904, 1957 and 1984. As can be seen, the spatial representations of built-up / urbanized areas generally agree. Agreement is highest within urban areas as shown by denser building blocks along roads in 1904, in red in 1957 and grey in 1984, even though MTBF-33 shows much finer spatial detail of the built-up areas. Outside the urbanized areas, the 1904 and 1957 maps show detailed building symbols along roads, many of which are also visible in the MTBF-33 layer. Some discrepancies can be seen in the North-west part of the 1984 map, where some buildings are visible in MTBF-33 but not in the historical map. This is due to the level of cartographic generalization used in the 1984 map sheet (scale 1:100,000, whereas the 1904 and 1957 maps are at scale 1:62,500) which may not include individual building footprints outside urban extents.
Fig. 3

Comparison of retrospective building footprint distributions to historical maps for Boulder, Colorado, in 1904 (scale 1:62,500), 1957 (scale 1:62,500), and 1984 (scale 1:100,000). Map source: USGS-HTMC.

Comparison of retrospective building footprint distributions to historical maps for Boulder, Colorado, in 1904 (scale 1:62,500), 1957 (scale 1:62,500), and 1984 (scale 1:100,000). Map source: USGS-HTMC. Moreover, such discrepancies may be due to temporal uncertainty in the historical maps (i.e., temporal gap between land surveying or field check, and map edition / publishing year) and due to the potential uncertainty of the construction year information in MTBF-33. It is unknown whether the date on record reflects the beginning or the end of the building construction phase, and how long the construction phase endured. Moreover, the construction year could be an estimate, and buildings may be missing in MTBF-33 because of incomplete records or missing built year information. However, the visual similarity for the two historical map sheets combined with the quantitative agreement assessment against Microsoft's building footprint dataset provide strong confidence that the built-up surfaces in MTBF-33 are very plausible and accurate with some local variations in completeness. There are uncertainties in MTBF-33 that are very difficult to measure. For example, buildings that have been torn down or were destroyed are not included in the dataset. The respective parcels may have become vacant land or a new structure may have been built. As a consequence, there is survivorship bias in construction year information which increases as we go back in time. There are few studies that report on or measure survivorship bias in settlement layers as this requires access to historical versions of cadastral data or demolition records (e.g., [18], [19], [20], [21]. For example, McShane et al. [21] used historical demolition data for Colorado and found that survivorship bias had limited impact on the resulting settlement layers, resulting in relative errors of less than 2%. Note that while we retain all plausible year built values from the scraped source data, we constrain the temporal coverage to the period 1900 – 2015. We discourage data users to use MTBF-33 to create snapshots of built-up surfaces preceding the year 1900, as the survivorship bias may be very large.

Ethics Statements

None.

CRediT Author Statement

Johannes H. Uhl: Methodology, Investigation, Data curation, Validation, Visualization, Writing - Original Draft; Stefan Leyk: Conceptualization, supervision, writing - reviewing and editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
SubjectGeography
Specific subject areaUrban change detection, long-term land development, built environment, human settlements
Type of dataGeospatial vector data
How the data were acquiredData were collected, integrated, and harmonized from web-based, open and publicly available sources published by local or regional governmental organizations, such as county or state governments.
Data formatRaw: ESRI Shapefile, ESRI File Geodatabase, Excel spreadsheets, CSV files. Filtered: ESRI Shapefile
Description of data collectionWe identified U.S. counties or states that provide building footprint data and cadastral parcel data attributed with building construction year information. In a non-exhaustive search we identified 33 U.S. counties where these criteria were met. We integrated and harmonized these data to create geospatial vector datasets holding over 6.2 million building footprints attributed with their construction year.
Data source locationSource data was collected in 2016 from the following resources:ftp://ftp.co.ramsey.mn.us/GISdata/ (last accessed: 2016-03-01)ftp://ftp.lmic.state.mn.us/pub/data/elevation/lidar/county/ (last accessed: 2016-03-01)ftp://ftp1.fgdl.org/pub/state/ (last accessed: 2016-03-01)ftp://gisdata.co.anoka.mn.us/ (last accessed: 2016-03-01)http://bostonopendata.boston.opendata.arcgis.com/datasets/f3d274161b4a47aa9acf48d0d04cd5d4_3 (last accessed: 2016-03-01)http://data.evansvilleapc.opendata.arcgis.com/datasets/0f1a8007227641d394f4170acba8aa67_1 (last accessed: 2016-03-01)http://maps.co.mecklenburg.nc.us/openmapping/data.html (last accessed: 2022-03-02)http://opendata.arcgis.com/datasets/611eb2cad81a4089afa188233e6b6dd1_0 (last accessed: 2022-03-02)http://www.co.carver.mn.us/GIS (last accessed: 2022-03-02)http://www.co.dakota.mn.us/homeproperty/propertymaps/pages/default.aspx (last accessed: 2022-03-02)http://www.co.ramsey.mn.us/is/gisdata.htm (last accessed: 2016-03-01)http://www.co.washington.mn.us/index.aspx?NID=1606 (last accessed: 2022-03-02)https://city-tampa.opendata.arcgis.com/datasets/building-footprint (last accessed: 2017-12-01)https://data.cityofboston.gov/Permitting/Property-Assessment-2015/yv8c-t43q (last accessed: 2016-03-01)https://gisdata.mn.gov/dataset/us-mn-co-dakota-plan-parcels (last accessed: 2022-03-02)https://gisdata.mn.gov/dataset/us-mn-co-dakota-struc-propertyinfo-buildingp (last accessed: 2022-03-02)https://gisdata.mn.gov/dataset/us-mn-state-metrogis-plan-regonal-prcls-open (last accessed: 2016-03-01)https://gis-monmouthnj.opendata.arcgis.com/datasets/0d2bb2c31e854819939cae0e6e1b589b_0 (last accessed: 2018-12-02)
https://gis-monmouthnj.opendata.arcgis.com/datasets/fec0b3d813174cdfb766134315120460_0/ (last accessed: 2022-03-02)https://koordinates.com/layer/102872-sarasota-county-florida-building-footprint/ (last accessed: 2022-03-02)https://mecklenburgcounty.exavault.com/share/view/mg5f-3uke3hyw (last accessed: 2022-03-02)https://opendata-bc-gis.hub.arcgis.com/datasets/building-footprints/ (last accessed: 2016-10-01)https://opendata-bc-gis.hub.arcgis.com/datasets/parcels/explore (last accessed: 2022-03-02)https://public-manateegis.opendata.arcgis.com/datasets/building-footprints (last accessed: 2017-07-01)https://www.bouldercounty.org/government/open-data/ (last accessed: 2022-03-02)https://www.bouldercounty.org/government/open-data/) (last accessed: 2022-03-02)https://www.mass.gov/get-massgis-data (last accessed: 2022-03-02)
Data accessibilityRepository name: Mendeley DataData identification number: 10.17632/w33vbvjtdyDirect URL to data: https://data.mendeley.com/datasets/w33vbvjtdy
Related research articleS. Leyk, J. H. Uhl, D. Balk, B. Jones, Assessing the accuracy of multi-temporal built-up land layers across rural-urban trajectories in the United States, Remote Sens. Environ. 204 (2018). https://doi.org/10.1016/j.rse.2017.08.035.
  5 in total

1.  Assessing the Accuracy of Multi-Temporal Built-Up Land Layers across Rural-Urban Trajectories in the United States.

Authors:  Stefan Leyk; Johannes H Uhl; Deborah Balk; Bryan Jones
Journal:  Remote Sens Environ       Date:  2017-10-07       Impact factor: 10.164

2.  Exposing the urban continuum: Implications and cross-comparison from an interdisciplinary perspective.

Authors:  Johannes H Uhl; Hamidreza Zoraghein; Stefan Leyk; Deborah Balk; Christina Corbane; Vasileios Syrris; Aneta J Florczyk
Journal:  Int J Digit Earth       Date:  2018-11-26       Impact factor: 3.538

3.  A guide for evaluating and reporting map data quality: Affirming Shao et al. "Overselling overall map accuracy misinforms about research reliability".

Authors:  S V Stehman; J Wickham
Journal:  Landsc Ecol       Date:  2020-06-01       Impact factor: 3.848

4.  HISDAC-US, historical settlement data compilation for the conterminous United States over 200 years.

Authors:  Stefan Leyk; Johannes H Uhl
Journal:  Sci Data       Date:  2018-09-04       Impact factor: 6.444

5.  Fine-grained, spatiotemporal datasets measuring 200 years of land development in the United States.

Authors:  Johannes H Uhl; Stefan Leyk; Caitlin M McShane; Anna E Braswell; Dylan S Connor; Deborah Balk
Journal:  Earth Syst Sci Data       Date:  2021-01-27       Impact factor: 11.333

  5 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.