Tim Y. Tang, Daniele De Martini, Shangzhe Wu, and Paul Newman
Abstract
Traditional approaches to outdoor vehicle localization assume a reliable, prior map is available, typically built using the same sensor suite as the on-board sensors used during localization. This work makes a different assumption: it assumes that an overhead image of the workspace is available and uses it as a map for range-based sensor localization by a vehicle. Here, the range-based sensors are radars and lidars. Our motivation is simple: off-the-shelf, publicly available overhead imagery such as Google satellite images can be a ubiquitous, cheap, and powerful tool for vehicle localization when a usable prior sensor map is unavailable, inconvenient, or expensive. The challenge to be addressed is that overhead images are clearly not directly comparable to data from ground range sensors because of their starkly different modalities. We present a learned metric localization method that not only handles the modality difference, but is also cheap to train, learning in a self-supervised fashion without requiring metrically accurate ground truth. By evaluating across multiple real-world datasets, we demonstrate the robustness and versatility of our method for various sensor configurations in cross-modality localization, achieving localization errors on par with a prior supervised approach while requiring no pixel-wise aligned ground truth for supervision at training. We pay particular attention to the use of millimeter-wave radar, which, owing to its complex interaction with the scene and its immunity to weather and lighting conditions, makes for a compelling and valuable use case.
Keywords: Localization; cross-modality localization; deep learning; self-supervised learning
Year: 2021 PMID: 34992328 PMCID: PMC8721700 DOI: 10.1177/02783649211045736
Source DB: PubMed Journal: Int J Rob Res ISSN: 0278-3649 Impact factor: 4.703
Fig. 1. Given a map image of one modality (left) and a live data image of another modality (middle), we wish to find the unknown offset between them. To do so, our method generates a synthetic image (right) that is pixel-wise aligned with the map image, but has the same appearance and observed scene as the live data image. Top: localizing radar data against satellite imagery. Middle: localizing lidar data against satellite imagery. Bottom: localizing radar data against a prior lidar map.
Fig. 2. Our method is demonstrated on datasets collected at different locations, in various settings including urban (Oxford, UK), residential (Karlsruhe, Germany), campus (KAIST, Korea), and highway (Sejong City, Korea).
Fig. 3. A one-to-many mapping (left) versus a one-to-one mapping (right). Left: the mapping from one modality to the other preserves color, but is ambiguous in the orientation of the output, resulting in a one-to-many mapping, and is therefore not a function. Right: augmenting the input with an additional element constrains the orientation, resulting in a one-to-one mapping, as the mapping is now unambiguous in both color and orientation. Note that the mapping on the right is one-to-one, but not necessarily surjective.
Fig. 4. Two radar images captured 15 seconds apart (2 and 4), pixel-wise aligned with satellite images (1 and 3). Though the overlapping scenes in the satellite images are identical, the radar scans appear significantly different, as they capture different regions within their fields of view.
Fig. 5. Results of CycleGAN: satellite image (left), live radar image pixel-wise aligned with the satellite image (middle), synthetic radar image (right). There is no explicit constraint on which regions of the input satellite image will appear in the output synthetic image. This leads to large localization errors, as the synthetic image does not contain the scenes observed by the live radar image.
Fig. 6. Prior work in Tang et al. (2020b) proposes a network to infer the rotation offset: a stack of rotated radar images is softmaxed to produce a radar image with the same heading as the satellite image.
Fig. 7. Given the map image and a rotation stack of live images, the network finds the rotation-aligned live image by taking a softmax over the stack. Then, given the live image and a rotation stack of map images, the same network outputs a softmaxed map image. A loss is applied to enforce the output of the second pass to match the original map image, which in turn enforces the output of the first pass to be correctly rotation-aligned. Both instances of the network in the figure refer to the same network with the same parameters, applied in different forward passes.
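For concreteness, below is a minimal PyTorch sketch of the rotation-stack-and-softmax idea described above. The helper names (`make_rotation_stack`, `soft_select`), the use of `affine_grid`/`grid_sample` for resampling, and the equal angular spacing are illustrative assumptions, not the authors' implementation.

```python
import math

import torch
import torch.nn.functional as F

def make_rotation_stack(live_img, num_rotations):
    """Resample a single-channel live image (1, H, W) at candidate headings
    spanning 360 degrees; each slice of the stack is one rotated copy."""
    stack = []
    for k in range(num_rotations):
        angle = math.radians(k * 360.0 / num_rotations)
        c, s = math.cos(angle), math.sin(angle)
        # 2x3 affine matrix for a pure rotation about the image centre.
        rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0]], dtype=torch.float32)
        grid = F.affine_grid(rot.unsqueeze(0), [1, 1, *live_img.shape[-2:]],
                             align_corners=False)
        stack.append(F.grid_sample(live_img.unsqueeze(0), grid,
                                   align_corners=False))
    return torch.cat(stack, dim=1)           # (1, num_rotations, H, W)

def soft_select(rotation_stack, logits):
    """Blend the stack with softmax weights predicted by the rotation
    network; a peaked softmax effectively picks one rotated copy."""
    weights = torch.softmax(logits, dim=-1)  # (1, num_rotations)
    return torch.einsum('bn,bnhw->bhw', weights, rotation_stack)
```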
Fig. 8. Architecture for image generation in the prior supervised approach (Tang et al., 2020b).
Fig. 9. Top: during pre-training, we learn an appearance encoder and an intra-modality pose encoder that discovers the translation offset between an image and a shifted version of itself. Bottom: fixing the weights of these two encoders, we seek to learn a cross-modality pose encoder that discovers the translation offset between two images from different modalities. The fixed encoders provide the necessary geometric and appearance relationships used for self-supervised learning.
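The pre-training signal in the top half of Fig. 9 requires no surveyed ground truth, because the translation offset is sampled by the training code itself. A minimal sketch follows, assuming a uniform offset range and wrap-around shifting via `torch.roll`, neither of which is specified in the paper text above.

```python
import random

import torch

def random_shift_pair(img, max_offset_px):
    """Build a self-supervised training pair: an image and a randomly
    translated copy of itself; the sampled offset is a free label for the
    intra-modality pose encoder. torch.roll wraps around at the borders,
    which is a simplification (real tiles would be re-cropped instead)."""
    dx = random.randint(-max_offset_px, max_offset_px)
    dy = random.randint(-max_offset_px, max_offset_px)
    shifted = torch.roll(img, shifts=(dy, dx), dims=(-2, -1))
    target = torch.tensor([float(dx), float(dy)])
    return img, shifted, target

# Training sketch (pose_encoder is whatever network consumes the image pair):
#   pred = pose_encoder(img, shifted)
#   loss = torch.nn.functional.mse_loss(pred, target)
```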
Fig. 10. The embedding networks are learned to project real live images and synthetic images into a joint embedding space, where their translation offset can be found by maximizing correlation.
Fig. 11. Overall data flow of our method at inference: given the map image and the live data image, and based on the initial heading estimate, we form a stack of rotated live images, from which the rotation inference network discovers the live image rotated to be rotation-aligned with the map; this process also infers the heading estimate. The appearance and pose encoders are then used to generate a synthetic image that has the same appearance and observed scene as the live image and is pose-aligned with the map image. Finally, the synthetic image and the live image are projected to deep embeddings, where the estimate for the translation offset is found by correlation maximization.
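The final correlation-maximization step admits a compact sketch: densely correlate the two embeddings and take the argmax of the correlation surface as the translation estimate. The single-channel embeddings, the `conv2d`-based correlation, and the sign convention below are assumptions for illustration; sub-pixel refinement is omitted.

```python
import torch
import torch.nn.functional as F

def correlation_offset(embedding_a, embedding_b):
    """Estimate the 2-D translation between two single-channel embeddings
    (1, H, W) by dense cross-correlation; the argmax of the correlation
    surface gives the pixel offset."""
    a = embedding_a.unsqueeze(0)                       # (1, 1, H, W) search image
    b = embedding_b.unsqueeze(0)                       # (1, 1, H, W) template
    h, w = b.shape[-2:]
    corr = F.conv2d(a, b, padding=(h // 2, w // 2))    # correlation surface
    idx = int(torch.argmax(corr))
    _, _, ch, cw = corr.shape
    dy, dx = divmod(idx, cw)
    return dx - cw // 2, dy - ch // 2                  # offset of b relative to a
```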
Architecture for rotation inference.
| Rotation Inference Function |
|---|
| Input shape: |
| Conv(4, 32, 3, 2, 1) + IN + ReLU |
| Conv(32, 64, 3, 2, 1) + IN + ReLU |
| Conv(64, 128, 3, 2, 1) + IN + ReLU |
| Conv(128, 256, 3, 2, 1) + IN + ReLU |
| Latent shape: |
| Take the mean along |
| Latent vector shape: |
| Matrix-multiply the softmax weights with the input |
| Shape of the multiplication product: |
| Extract the associated channel(s) to get |
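A PyTorch sketch of the table above: the channel sizes follow the listed convolutions, while the pooled-latent-to-softmax head (here `nn.Linear`) and the weighted sum over the rotation stack are assumptions filling in table entries that did not survive extraction.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Conv(cin, cout, kernel=3, stride=2, pad=1) + InstanceNorm + ReLU,
    # mirroring the rows of the table above.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 2, 1),
                         nn.InstanceNorm2d(cout),
                         nn.ReLU(inplace=True))

class RotationInference(nn.Module):
    """Sketch of the rotation-inference function: a strided conv encoder
    whose pooled latent vector is softmaxed into weights over the rotation
    stack; a peaked softmax selects the rotation-aligned live image."""
    def __init__(self, num_rotations):
        super().__init__()
        self.encoder = nn.Sequential(conv_block(4, 32), conv_block(32, 64),
                                     conv_block(64, 128), conv_block(128, 256))
        self.head = nn.Linear(256, num_rotations)

    def forward(self, x, rotation_stack):
        latent = self.encoder(x).mean(dim=(2, 3))        # global average pool
        weights = torch.softmax(self.head(latent), dim=1)
        # Weighted sum over the stack of rotated live images.
        aligned = torch.einsum('bn,bnhw->bhw', weights, rotation_stack)
        return aligned, weights
```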
U-Net architecture for learning embeddings.
| Embedding Networks |
|---|
| Conv(1, 32, 4, 2, 0) |
| LReLU(0.2) + Conv(32, 64, 4, 2, 0) + IN |
| LReLU(0.2) + Conv(64, 128, 4, 2, 0) + IN |
| LReLU(0.2) + Conv(128, 256, 4, 2, 0) + IN |
| LReLU(0.2) + Conv(256, 512, 4, 2, 0) + IN |
| LReLU(0.2) + ReLU + Conv(512, 1024, 4, 2, 0) |
| ReLU + ConvT(1024, 512, 4, 2, 1, 0) + IN |
| ReLU + ConvT(512, 256, 4, 2, 1, 0) + IN |
| ReLU + ConvT(256, 128, 4, 2, 1, 0) + IN |
| ReLU + ConvT(128, 64, 4, 2, 1, 0) + IN |
| ReLU + ConvT(64, 32, 4, 2, 1, 0) + IN |
| ReLU + ConvT(32, 1, 4, 2, 1, 0) + Sigmoid |
| With skip connections in-between intermediate layers |
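A compact PyTorch sketch of this U-Net: the channel progression and the LeakyReLU/IN encoder and ReLU/ConvT decoder follow the table, while the padding values, the first-layer activation, and the exact skip wiring are simplifications.

```python
import torch
import torch.nn as nn

class UNetEmbedding(nn.Module):
    """Minimal U-Net sketch for the embedding networks: a LeakyReLU /
    strided-conv encoder, a ReLU / transposed-conv decoder with a sigmoid
    output, and skip connections between mirrored layers."""

    def __init__(self, channels=(1, 32, 64, 128, 256, 512, 1024)):
        super().__init__()
        self.downs = nn.ModuleList([
            nn.Sequential(nn.LeakyReLU(0.2), nn.Conv2d(cin, cout, 4, 2, 1),
                          nn.InstanceNorm2d(cout))
            for cin, cout in zip(channels[:-1], channels[1:])
        ])
        rev = list(reversed(channels))
        ups = []
        for i, (cin, cout) in enumerate(zip(rev[:-1], rev[1:])):
            in_ch = cin if i == 0 else cin * 2           # doubled by skip concat
            layers = [nn.ReLU(), nn.ConvTranspose2d(in_ch, cout, 4, 2, 1)]
            if cout != channels[0]:                      # no norm on the output layer
                layers.append(nn.InstanceNorm2d(cout))
            ups.append(nn.Sequential(*layers))
        self.ups = nn.ModuleList(ups)
        self.out = nn.Sigmoid()

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)
        skips = skips[:-1][::-1]                         # drop bottleneck, deepest first
        for i, up in enumerate(self.ups):
            x = up(x) if i == 0 else up(torch.cat([x, skips[i - 1]], dim=1))
        return self.out(x)
```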
Fig. 12. Training (blue), validation (green), and test (red) trajectories for RobotCar (top left), KAIST (top right), Sejong (bottom left), and 2011_10_03_drive_0034 (bottom right). Certain data are removed to avoid overlap between the splits.
Mean error and error standard deviation for radar localization against satellite imagery.
| Dataset (method) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| RobotCar (ours) | 3.44 | 5.40 | 3.03 | 3.97 | 6.23 | 4.44 | 7.50 | 3.04 |
| RobotCar (RSL-Net (Tang et al., 2020b)) | | 3.12 | | | | | 4.16 | |
| MulRan (ours) | 6.02 | 7.02 | 2.92 | 7.64 | 8.91 | 5.75 | 6.87 | 3.64 |
| MulRan (RSL-Net) | 7.11 | | | 9.03 | 6.50 | | | |
Mean error and error standard deviation for lidar localization against satellite imagery.
| Dataset (method) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| RobotCar (ours) | 1.54 | 1.85 | 2.29 | 3.55 | 4.27 | 1.54 | 2.29 | 3.10 |
| RobotCar (RSL-Net) | 2.31 | 2.55 | | 5.33 | 5.89 | 2.26 | 2.57 | |
| KITTI (ours) | 3.05 | 3.13 | 1.67 | 6.64 | 6.82 | | 2.57 | |
| KITTI (RSL-Net) | | | | | | 3.70 | 3.88 | |
Mean error and error standard deviation for radar localization against prior lidar map.
| Dataset (method) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| RobotCar (ours) | | | 2.65 | | | | 3.53 | |
| RobotCar (RSL-Net) | 2.66 | 3.41 | | 3.07 | 3.93 | 2.85 | | 2.18 |
| RobotCar (CycleGAN) | 6.41 | 9.05 | 2.65 | 7.40 | 10.44 | 7.43 | 7.17 | 2.06 |
| MulRan (ours) | 3.57 | 3.26 | 2.15 | 4.53 | 4.13 | 4.29 | 4.84 | 2.38 |
| MulRan (RSL-Net) | | | | | | | | |
| MulRan (CycleGAN) | 4.84 | 4.39 | 2.15 | 6.14 | 5.58 | 5.78 | 4.96 | 2.38 |
Fig. 13. Estimated pose (blue) versus ground-truth pose (red) for localizing a radar (left) and a lidar (right) against satellite imagery. Our system tracks the vehicle's pose over 1 km, occasionally falling back to odometry in the radar experiment (green). The system is stand-alone and requires GPS only for the first frame.
Ablation study for using reduced training data, evaluated on radar localization against satellite imagery on the RobotCar Dataset.
| Training data | | | | | |
|---|---|---|---|---|---|
| RobotCar (full) | | | | | |
| RobotCar (first …) | 7.96 | 7.45 | 6.03 | 9.18 | 8.59 |
| RobotCar (every …) | 4.36 | 6.18 | 4.40 | 5.03 | 7.14 |
Results for radar localization against satellite imagery on multiple test sequences.
| Dataset (sequence) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| RobotCar (sequence no. 1) | 3.44 | 5.40 | 3.03 | 3.97 | 6.23 | 4.44 | 7.50 | 3.04 |
| RobotCar (sequence no. 7) | 3.44 | 5.15 | 3.96 | 3.97 | 5.95 | 5.76 | 5.44 | 4.32 |
| RobotCar (sequence no. 15) | 3.21 | 5.21 | 3.80 | 3.70 | 6.01 | 2.83 | 5.76 | 4.20 |
| MulRan | 6.02 | 7.02 | 2.92 | 7.64 | 8.91 | 5.75 | 6.87 | 3.64 |
| MulRan | 6.44 | 6.59 | 5.15 | 8.19 | 8.37 | 6.64 | 6.62 | 7.32 |
Results for lidar localization against satellite imagery on multiple test sequences.
| Dataset (sequence) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| RobotCar (sequence no. 1) | 1.54 | 1.85 | 2.29 | 3.55 | 4.27 | 1.54 | 2.29 | 3.10 |
| RobotCar (sequence no. 7) | 1.38 | 1.72 | 2.35 | 3.18 | 3.98 | 1.37 | 1.83 | 2.71 |
| RobotCar (sequence no. 15) | 1.67 | 1.81 | 2.05 | 3.87 | 4.17 | 1.42 | 1.84 | 2.12 |
Results for radar localization against satellite imagery using a circular translation offset range.
| Dataset (method) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| RobotCar (ours) | 2.83 | 4.49 | 3.57 | 3.39 | 5.39 | | 5.48 | 3.71 |
| RobotCar (RSL-Net) | | | | | | 3.85 | | |
| MulRan (ours) | 6.30 | | 3.01 | 8.00 | | | 5.51 | 3.61 |
| MulRan (RSL-Net) | 7.58 | | | 9.62 | 5.25 | | | |
Results for lidar localization against satellite imagery using a circular translation offset range.
| Dataset (method) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| RobotCar (ours) | | | | | | | | |
| RobotCar (RSL-Net) | 1.81 | 2.15 | 2.09 | 4.19 | 4.96 | 1.53 | 1.90 | 3.17 |
| KITTI (ours) | 2.93 | 3.10 | | 6.38 | 6.74 | | | |
| KITTI (RSL-Net) | | | 1.50 | | | 3.49 | 3.09 | |
Mean error for using various choices of network width and depth, evaluated on lidar localization against satellite imagery on the RobotCar Dataset.
| Configuration | | | | | | (M) |
|---|---|---|---|---|---|---|
| RobotCar, … channels | 1.57 | 1.93 | 2.58 | 3.62 | 4.45 | 5.23 |
| RobotCar, ours | 1.54 | 1.85 | 2.29 | 3.55 | 4.27 | 20.86 |
| RobotCar, … channels | 1.58 | | | 3.64 | | 83.37 |
| RobotCar, … depth | 1.70 | 2.09 | 2.47 | 3.92 | 4.82 | 5.22 |
| RobotCar, ours | 1.54 | 1.85 | 2.29 | 3.55 | 4.27 | 20.86 |
| RobotCar, … depth | 1.56 | 1.98 | 2.39 | 3.60 | 4.57 | 83.41 |
Architecture of the networks for image generation.
| Appearance Encoder |
|---|
| RP(3) + Conv(1, 16, 7, 1, 0) + IN + ReLU |
| Conv(16, 32, 3, 2, 1) + IN + ReLU |
| Conv(32, 64, 3, 2, 1) + IN + ReLU |
| Conv(64, 128, 3, 2, 1) + IN + ReLU |
| Conv(128, 256, 3, 2, 1) + IN + ReLU |
| ResNet blocks (…) |
| Conv(256, 256, 3, 1, 0) + IN + ReLU + Drop(0.5) |
| Conv(256, 256, 3, 1, 0) + IN |
| Intra-Modality Pose Encoder |
| RP(3) + Conv(2, 16, 7, 1, 0) + IN + ReLU |
| Conv(16, 32, 3, 2, 1) + IN + ReLU |
| Conv(32, 64, 3, 2, 1) + IN + ReLU |
| Conv(64, 128, 3, 2, 1) + IN + ReLU |
| Conv(128, 256, 3, 2, 1) + IN + ReLU |
| ResNet blocks (…) |
| Conv(256, 256, 3, 1, 0) + IN + ReLU + Drop(0.5) |
| Conv(256, 256, 3, 1, 0) + IN |
| Cross-Modality Pose Encoder |
| RP(3) + Conv(4, 16, 7, 1, 0) + IN + ReLU |
| Conv(16, 32, 3, 2, 1) + IN + ReLU |
| Conv(32, 64, 3, 2, 1) + IN + ReLU |
| Conv(64, 128, 3, 2, 1) + IN + ReLU |
| Conv(128, 256, 3, 2, 1) + IN + ReLU |
| ResNet blocks (…) |
| Conv(256, 256, 3, 1, 0) + IN + ReLU + Drop(0.5) |
| Conv(256, 256, 3, 1, 0) + IN |
| Decoder |
| ConvT(512, 256, 3, 2, 1, 1) + IN + ReLU + Drop(0.5) |
| ConvT(256, 128, 3, 2, 1, 1) + IN + ReLU + Drop(0.5) |
| ConvT(128, 64, 3, 2, 1, 1) + IN + ReLU + Drop(0.5) |
| ConvT(64, 32, 3, 2, 1, 1) + IN + ReLU + Drop(0.5) |
| RP(3) + Conv(32, 1, 7, 1, 0) + Sigmoid |
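A PyTorch sketch of how these pieces could fit together for image generation: the encoder and decoder layer settings follow the table (ResNet blocks and dropout omitted for brevity), while the channel-wise concatenation of appearance and pose features before the decoder is an assumption based on the decoder's 512-channel input.

```python
import torch
import torch.nn as nn

def encoder(in_ch):
    """Reflection-padded 7x7 stem followed by four stride-2 convs to a
    256-channel feature map, as in the encoder rows above."""
    layers = [nn.ReflectionPad2d(3), nn.Conv2d(in_ch, 16, 7, 1, 0),
              nn.InstanceNorm2d(16), nn.ReLU(inplace=True)]
    chans = [16, 32, 64, 128, 256]
    for cin, cout in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv2d(cin, cout, 3, 2, 1),
                   nn.InstanceNorm2d(cout), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class Generator(nn.Module):
    """Sketch of the image-generation path: an appearance encoder
    (1-channel input) and a pose encoder (here the 4-channel cross-modality
    variant) produce 256-channel features that are concatenated and decoded
    to a synthetic image."""
    def __init__(self):
        super().__init__()
        self.appearance = encoder(1)
        self.pose = encoder(4)
        up = []
        chans = [512, 256, 128, 64, 32]
        for cin, cout in zip(chans[:-1], chans[1:]):
            up += [nn.ConvTranspose2d(cin, cout, 3, 2, 1, 1),
                   nn.InstanceNorm2d(cout), nn.ReLU(inplace=True)]
        up += [nn.ReflectionPad2d(3), nn.Conv2d(32, 1, 7, 1, 0), nn.Sigmoid()]
        self.decoder = nn.Sequential(*up)

    def forward(self, live_img, pose_input):
        feats = torch.cat([self.appearance(live_img),
                           self.pose(pose_input)], dim=1)
        return self.decoder(feats)
```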
Fig. 14. A pair of satellite and radar images queried using zoom levels of 18 (top) versus 17 (bottom) at roughly the same center position. Although “zooming in” gives a finer resolution, certain regions far away are not seen in the resulting radar images, despite being observed by the sensor.
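The resolution trade-off can be quantified with the standard Web Mercator tile model used by common satellite-image providers; the paper's actual query parameters may differ, so the numbers below are only indicative.

```python
import math

def ground_resolution_m_per_px(latitude_deg, zoom):
    """Approximate ground resolution of a standard 256-pixel Web Mercator
    tile at a given latitude and zoom level (metres per pixel)."""
    earth_circumference_m = 2 * math.pi * 6378137.0
    return (math.cos(math.radians(latitude_deg)) * earth_circumference_m
            / (256 * 2 ** zoom))

# Oxford (~51.75 N): going from zoom 17 to 18 roughly halves the metres per
# pixel, so a fixed-size image crop covers half the range in each direction.
print(ground_resolution_m_per_px(51.75, 17))   # ~0.74 m/px
print(ground_resolution_m_per_px(51.75, 18))   # ~0.37 m/px
```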
Mean error and error standard deviation for using refined resolution, evaluated for radar localization against satellite imagery. We also include results for a hybrid approach utilizing both resolutions.

| Dataset (resolution) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| RobotCar, … | 3.44 | 5.40 | 3.03 | 3.97 | 6.23 | 4.44 | 7.50 | 3.04 |
| RobotCar, … | 2.23 | 2.42 | 3.22 | 5.29 | 5.74 | | | 3.13 |
| RobotCar, hybrid | 2.56 | 3.60 | 3.03 | 5.91 | 8.32 | 3.78 | 5.45 | 3.04 |
Mean error and error standard deviation for various rotation increments, evaluated for radar localization against satellite imagery on the RobotCar Dataset.
| Rotation increment | | |
|---|---|---|
| | 3.29 | 3.91 |
| | 4.04 | 4.93 |
| | 4.26 | 5.86 |
Fig. 15. Mean error in rotation (left) and translation (right) with larger initial translation offset range, evaluated for radar localization against satellite imagery on the RobotCar Dataset.
Fig. 16. The center of the image is shifted onto each of the four quadrants to produce four shifted versions.
Fig. 17. The unknown translation offset between the map image and the live image is larger than the networks are designed for.
Fig. 18. If we shift the image onto the correct quadrant to form a shifted version, then the offset between the shifted image and the other image is within what the networks are designed for. In this case, both generated images should be accurate, as the offsets in both cases are within what the networks are trained for.
Fig. 19. The resulting synthetic image will still be erroneous if an incorrect quadrant is selected. Here the offset between the shifted image and the other image is larger than what the networks can handle; in this case, both generated images will be problematic due to the excessive offsets.
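A sketch of the quadrant-shifting idea in Figures 16 to 19. The wrap-around shift (`torch.roll`) and, in particular, the score-based selection of the correct quadrant are assumptions here: `solve_fn` and `score_fn` are hypothetical stand-ins for the pose solver and its correlation peak value, since the captions only state that choosing the wrong quadrant fails.

```python
import torch

def quadrant_candidates(map_img, quarter_px):
    """Shift the map image so its centre lands on each of the four
    quadrants (Fig. 16), yielding candidates whose residual offset to the
    live image is small enough for the networks to handle. Wrap-around is
    an illustrative shortcut; real tiles would be re-cropped."""
    offsets = [(-quarter_px, -quarter_px), (-quarter_px, quarter_px),
               (quarter_px, -quarter_px), (quarter_px, quarter_px)]
    return [(torch.roll(map_img, shifts=o, dims=(-2, -1)), o) for o in offsets]

def localize_with_quadrants(map_img, live_img, quarter_px, solve_fn, score_fn):
    """Run the normal pipeline on each shifted candidate and keep the one
    with the highest score (assumed selection rule)."""
    best = None
    for shifted, (dy, dx) in quadrant_candidates(map_img, quarter_px):
        pose = solve_fn(shifted, live_img)
        score = score_fn(shifted, live_img, pose)
        if best is None or score > best[0]:
            best = (score, pose, (dy, dx))
    return best
```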
Radar localization against satellite imagery evaluated on the test set of RobotCar, where the initial translation offset is large at inference. We also included results from Table 5 for reference.

| | | | | | | | |
|---|---|---|---|---|---|---|---|
| ✗ | 3.44 | 5.40 | 3.03 | 3.97 | 6.23 | | |
| ✗ | 6.62 | 7.88 | | 7.64 | 9.09 | | |
| ✓ | 4.67 | | | 5.39 | | | |
| ✗ | 7.35 | | | 8.49 | | | |
Fig. 20. Qualitative results of CycleGAN for domain adaptation between a single scan of radar and lidar data. From left to right: a real radar image and its synthetic lidar image, a real lidar image and its synthetic radar image. Top: CycleGAN applied in Cartesian coordinate representation. Bottom: CycleGAN applied in polar coordinate representation.
Localization results for experiments where the range sensor used is different between training and inference time, with and without using modality transfer, evaluated on the RobotCar Dataset. We also included results from Tables 6 and 14 for reference.
| Training | Inference | Modality transfer | | | | | |
|---|---|---|---|---|---|---|---|
| Lidar | Lidar | ✗ | 1.54 | 1.85 | 2.29 | 3.55 | 4.27 |
| Radar | Lidar | ✗ | 6.39 | 8.64 | 4.71 | 14.74 | 19.95 |
| Radar | Lidar | ✓ (Cartesian) | 2.63 | 3.48 | 2.65 | 6.08 | 8.03 |
| Radar | Lidar | ✓ (Polar) | | | | | |
| Radar | Radar | ✗ | 2.23 | 2.42 | 3.22 | 5.29 | 5.74 |
| Lidar | Radar | ✗ | 7.03 | 6.43 | 10.05 | 16.23 | 14.83 |
| Lidar | Radar | ✓ (Cartesian) | 2.74 | | | 6.32 | |
| Lidar | Radar | ✓ (Polar) | 3.52 | 4.62 | | 8.14 | |
Fig. 21. Images at various stages of our method: map image (a), live data image (b), output of rotation inference (c), embedding (d), pixel-wise aligned ground truth (e), synthetic image (f), and embedding (g). From top to bottom: radar localization against satellite imagery evaluated on RobotCar (rows 1–2) and MulRan (rows 3–4), lidar localization against satellite imagery evaluated on RobotCar (rows 5–6) and KITTI (rows 7–8), and radar localization against lidar map evaluated on MulRan (row 9) and RobotCar (row 10).
Image generation network for our implementation of RSL-Net, used for comparison.
| Encoder |
|---|
| RP(3) + Conv(4, 32, 7, 1, 0) + IN + ReLU |
| Conv(32, 64, 3, 2, 1) + IN + ReLU |
| Conv(64, 128, 3, 2, 1) + IN + ReLU |
| Conv(128, 256, 3, 2, 1) + IN + ReLU |
| Conv(256, 512, 3, 2, 1) + IN + ReLU |
| ResNet blocks (…) |
| Conv(512, 512, 3, 1, 0) + IN + ReLU + Drop(0.5) |
| Conv(512, 512, 3, 1, 0) + IN |
| Decoder |
| ConvT(512, 256, 3, 2, 1, 1) + IN + ReLU + Drop(0.5) |
| ConvT(256, 128, 3, 2, 1, 1) + IN + ReLU + Drop(0.5) |
| ConvT(128, 64, 3, 2, 1, 1) + IN + ReLU + Drop(0.5) |
| ConvT(64, 32, 3, 2, 1, 1) + IN + ReLU + Drop(0.5) |
| RP(3) + Conv(32, 1, 7, 1, 0) + Sigmoid |
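The "ResNet blocks" rows in these generator tables correspond to residual blocks around the two 3×3 convolutions listed beneath them; a generic reflection-padded version is sketched below, with the repetition count left as a placeholder since it did not survive extraction.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    """A common reflection-padded residual block (as in CycleGAN-style
    generators): Conv + IN + ReLU + Dropout, then Conv + IN, added back
    onto the input."""
    def __init__(self, channels, dropout=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, 3, 1, 0),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, 3, 1, 0),
            nn.InstanceNorm2d(channels))

    def forward(self, x):
        return x + self.block(x)

# e.g. a stack of blocks at the 512-channel bottleneck of the encoder above;
# the count of 6 is a placeholder, not the paper's value.
bottleneck = nn.Sequential(*[ResnetBlock(512) for _ in range(6)])
```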