Jonathan Tollefson, Scott Frickel, Maria I Restrepo.
Abstract
U.S. cities contain unknown numbers of undocumented "manufactured gas" sites, legacies of an industry that dominated energy production during the late 19th and early 20th centuries. While many of these unidentified sites likely contain significant levels of highly toxic and biologically persistent contamination, locating them remains a major challenge. We propose a new method to identify manufactured gas production, storage, and distribution infrastructure in bulk by applying feature extraction and machine learning techniques to digitized historic Sanborn fire insurance maps. Our approach, which relies on a two-part neural network to classify candidate map regions, increases the rate of site identification 20-fold compared to unaided visual coding.
Year: 2021 PMID: 34347840 PMCID: PMC8336811 DOI: 10.1371/journal.pone.0255507
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Circular gasometer structures as they appear in Sanborn fire insurance maps.
Fig source: Sanborn Map Company.
Fig 2. MGP identification workflow.
Fig source: Tollefson.
Map volumes from sampled cities and regions.
| Region | Years |
|---|---|
| Chicago | 1901, 1905, 1906, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1916, 1917, 1918, 1919, 1920, 1946, 1949, 1950, 1951 |
| New Orleans | 1885, 1887, 1893, 1895, 1896, 1908, 1909, 1937, 1940, 1950, 1951 |
| New York City area | 1886, 1887, 1888, 1893, 1895, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1898, 1901, 1902, 1903 |
| Portland | 1889, 1901, 1905, 1908, 1909, 1950 |
| Rhode Island (all cities) | 1884, 1885, 1886, 1887, 1889, 1890, 1891, 1892, 1894, 1896, 1897, 1898, 1899, 1900, 1902, 1903, 1907, 1909, 1910, 1911, 1912, 1920, 1921, 1922, 1923, 1925, 1928, 1933, 1934, 1935, 1940, 1941, 1944, 1945, 1946, 1947, 1949, 1950, 1951, 1953, 1955, 1956, 1958 |
| San Francisco | 1886, 1887, 1889, 1893, 1899, 1900, 1913, 1914, 1915, 1948, 1949, 1950 |
Note: Publication dates represent years in which at least one map volume was published in a given region. New York City area truncated at 1903.
Fig 3. Examples of positive and negative MGP classification categories.
Fig source: Tollefson, using map images from Sanborn Map Company.
CNN layer architecture.
| Layer | Output shape | Param. # |
|---|---|---|
| 2D Convolution | [64, 64, 16] | 448 |
| 2D Convolution | [62, 62, 16] | 2,320 |
| Max Pooling | [31, 31, 16] | |
| 2D Convolution | [29, 29, 32] | 4,640 |
| Max Pooling | [14, 14, 32] | |
| 2D Convolution | [12, 12, 64] | 18,496 |
| Max Pooling | [6, 6, 64] | |
| Flatten | [2304] | |
| Dense | [1032] | 2,378,760 |
| Dropout | [1032] | |
| Dense | [2] | 2,066 |
| Total parameters | | 2,406,730 |
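The parameter counts in the table can be checked by hand. The sketch below is a minimal arithmetic check, not the authors' code: it assumes 3×3 convolution kernels and a 3-channel input image, neither of which is stated in this section, although both are consistent with the reported counts. The helper names `conv2d_params` and `dense_params` are hypothetical.

```python
def conv2d_params(kernel, in_channels, out_channels):
    """Weights (kernel x kernel x in x out) plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs


# ASSUMPTION: 3x3 kernels and an RGB (3-channel) input; not stated in the table.
KERNEL = 3
conv_channels = [(3, 16), (16, 16), (16, 32), (32, 64)]
conv_counts = [conv2d_params(KERNEL, i, o) for i, o in conv_channels]

# The last pooling layer outputs [6, 6, 64], which the Flatten layer unrolls.
flatten_size = 6 * 6 * 64
dense_counts = [dense_params(flatten_size, 1032), dense_params(1032, 2)]

total = sum(conv_counts) + sum(dense_counts)
print(conv_counts, flatten_size, dense_counts, total)
```

Under these assumptions the per-layer counts and the total agree with the table, and almost all of the parameters sit in the first dense layer.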
MLP hyperparameters.
| Parameter | Tests |
|---|---|
| Hidden layer sizes | (100, 50); (100,); (3, 3, 3) |
| Alpha | 0.00001; 0.0001; 0.01; 0.1; 0.5 |
| Learning rate | Constant; Adaptive |
* Selected parameter.
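To make the size of this search space concrete, the sketch below enumerates every combination of the tested values. The dictionary keys follow the parameter names of scikit-learn's `MLPClassifier` (`hidden_layer_sizes`, `alpha`, `learning_rate`); that the authors used that library is an assumption, and `expand_grid` is a hypothetical helper.

```python
from itertools import product

# Tested values from the table. Key names are an ASSUMPTION borrowed from
# scikit-learn's MLPClassifier; the source does not name the library used.
param_grid = {
    "hidden_layer_sizes": [(100, 50), (100,), (3, 3, 3)],
    "alpha": [0.00001, 0.0001, 0.01, 0.1, 0.5],
    "learning_rate": ["constant", "adaptive"],
}


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combos = expand_grid(param_grid)
print(len(combos))  # 3 layer layouts x 5 alphas x 2 learning-rate modes
```

A full grid search would train and score one model per entry in `combos`, typically with cross-validation, and keep the best-scoring setting.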
Fig 4. ROC curves for all folds and all model iterations, hybrid CNN-MLP model.
Dotted line displays the expected TPR-FPR curve for a fully random classifier. Fig source: Tollefson.
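For readers unfamiliar with ROC curves, the following is a minimal pure-Python sketch of how the true positive rate (TPR) and false positive rate (FPR) are traced out by sweeping a score threshold, and why an uninformative classifier falls on the diagonal. The function names and toy data are illustrative assumptions, not the authors' evaluation code.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping the threshold from high to low.

    labels: 1 for a true MGP candidate, 0 otherwise.
    scores: classifier confidence that the candidate is positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Candidates with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: one classifier that ranks every positive first, and one that
# gives all candidates the same score and so carries no information.
perfect = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
uninformative = roc_points([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(auc(perfect), auc(uninformative))
```

The uninformative case reduces to the straight line from (0, 0) to (1, 1), which is the diagonal reference line in an ROC plot.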
Step-wise reduction in candidate map pages.
| Step | N-output | N-MGP | Pct. Positive |
|---|---|---|---|
| 1. Data acquisition | 3,278(m) | 56(m) | 1.7% |
| 2. Circle detection | 1,336(c) | 68(c) | 5% |
| 3. Standardization | - | - | - |
| 4. CNN + MLP model | 206(c) | 58(c) | 28% |
| 5. Full-size map images | 166(m) | 50(m) | 30% |
| Total reduction in candidate map regions: 94% | | | |
| Final recall rate: 90% | | | |
(m) N = number of full map pages.
(c) N = number of candidate circles.
* No MGP data loss in this step. Reduction in N-MGP from 58 to 50 due to aggregation of individual candidate circles to full map pages.
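The Pct. Positive column can be recomputed directly from the N-output and N-MGP counts. The sketch below does so, and also computes a page-level ratio of MGP pages retained (50 of 56). Treating that ratio as the basis for the reported recall is an assumption, since this section does not state how the summary figures were derived.

```python
# (step, N-output, N-MGP) copied from the table; step 3 reports no counts.
steps = [
    ("Data acquisition", 3278, 56),
    ("Circle detection", 1336, 68),
    ("CNN + MLP model", 206, 58),
    ("Full-size map images", 166, 50),
]

# Share of remaining candidates that are true MGP sites, in percent.
pct_positive = {name: round(100 * n_mgp / n_out, 1) for name, n_out, n_mgp in steps}

# ASSUMPTION: page-level recall compares MGP pages at the end (50)
# with MGP pages at the start (56).
page_recall = steps[-1][2] / steps[0][2]

for name, pct in pct_positive.items():
    print(f"{name}: {pct}% positive")
print(f"page-level MGP retention: {page_recall:.3f}")
```

The recomputed percentages match the table after rounding, and the page-level ratio is close to the reported recall of 90%.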