Ying Sun1,2,3, Jun Hu1,2, Juntong Yun1,2, Ying Liu1,2, Dongxu Bai1,2, Xin Liu1,2, Guojun Zhao1,2, Guozhang Jiang1,2,3, Jianyi Kong1,2,3, Baojia Chen4.
Abstract
Simultaneous localization and mapping (SLAM) can localize a robot and build a map in an unknown environment, but the resulting maps often suffer from poor readability and interactivity, and they do not distinguish primary from secondary information. For intelligent robots to interact meaningfully with their environment, they must understand both the geometric and the semantic properties of the surrounding scene. Our proposed method not only reduces the absolute positional error (APE) and improves the positioning performance of the system, but also constructs an object-oriented dense semantic point cloud map and outputs a point cloud model of each object, so that each object in the indoor scene can be reconstructed. In our experiments, eight categories of objects are detected and semantically mapped using COCO weights; in theory, most objects in a real scene can be reconstructed. Experiments show that the number of points in the point cloud is significantly reduced. The average positioning error of the eight categories of objects on the Technical University of Munich (TUM) datasets is very small. Introducing semantic constraints also reduces the absolute positional error of the camera and improves the positioning performance of the system. At the same time, our algorithm can segment the point cloud models of objects in the environment with high accuracy.
Keywords: deep learning; multi-objective location; semantic mapping; target tracking; visual SLAM
Year: 2022 PMID: 36236676 PMCID: PMC9571389 DOI: 10.3390/s22197576
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. The frame diagram of Deepsort with Mask R-CNN.
Figure 2. The frame diagram of the object-oriented semantic mapping system.
The computer parameters of the experiments.
| Name | Model | Remarks |
|---|---|---|
| Operating System | Ubuntu 20.04 | / |
| Graphics Processing Unit (GPU) | NVIDIA Quadro M2000 | 4 GB |
| Central Processing Unit (CPU) | Intel® Xeon(R) CPU E5-1620 v4 | 3.5 GHz × 8 |
| Random Access Memory (RAM) | DDR4 | 32 GB |
| Hard Disk | 256 GB SSD + 1000 GB HDD | Samsung |
The specific parameters for training in Mask R-CNN.
| Parameters | Quantity |
|---|---|
| Backbone | ResNet-101 |
| Epoch | 50 |
| Number of pictures | 750 |
| Training ratio | 0.8 |
| Validation ratio | 0.1 |
| Test ratio | 0.1 |
| Batch size | 1 |
| Training time (s/epoch) | 2917 |
| class_names | bowl, plate, cup |
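The split ratios and training time above imply concrete image counts and a total training duration. The following is a minimal sketch of that arithmetic; the function name `split_counts` and the rounding rule are illustrative assumptions, since the source does not say how the authors partitioned the images.

```python
def split_counts(total, train_ratio, val_ratio):
    """Return (train, val, test) image counts for the given ratios.

    Assumption: counts are rounded to the nearest integer and the
    remainder is assigned to the test set.
    """
    n_train = round(total * train_ratio)
    n_val = round(total * val_ratio)
    n_test = total - n_train - n_val
    return n_train, n_val, n_test


# Values taken from the training-parameter table.
train, val, test = split_counts(750, 0.8, 0.1)
total_hours = 50 * 2917 / 3600  # epochs * seconds per epoch, in hours

print(train, val, test)
print(f"{total_hours:.1f} h")
```

With the table values, this gives 600 training, 75 validation, and 75 test images, and roughly 40.5 hours for the 50 epochs.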
Figure 3. The training results of Mask R-CNN on freiburg2_dishes. The left image is the validation loss curve, and the right is the precision–recall curve.
Figure 4. The detection results on freiburg2_dishes using the self-trained weights. From left to right: the original RGB image, the detection result of Mask R-CNN, and the detection result of Mask R-CNN with Deepsort.
Figure 5. The results of semantic mapping of freiburg2_dishes using the self-trained weights. (a) The colors assigned to the three categories of objects in the semantic map. (b) The dense point cloud map (left) and the dense semantic point cloud map (right). (c) The dense point cloud model of each object in freiburg2_dishes (bowl, bowl, cup, plate).
Figure 6. The results of target tracking and instance segmentation on the TUM datasets using the COCO weights. From left to right: the original RGB image, the detection result of Mask R-CNN, and the detection result of Mask R-CNN with Deepsort. Panels (a–f) show the detection results for the back of freiburg3_office, the front of freiburg3_office, freiburg3_teddy, freiburg2_desk, freiburg1_desk, and freiburg1_room, respectively.
Figure 7. The results of semantic mapping of five TUM datasets using the COCO weights. (a) The color corresponding to each object category. (b–g) The dense point cloud map (left) and the dense semantic point cloud map (right) of the five datasets.
Figure 8. The dense point cloud models of the eight categories of objects. Panels (a–h) show the point clouds of the chair, TV monitor, keyboard, bottle, teddy bear, book, cup, and mouse, respectively.
The number of detected objects and actual objects.
| Datasets | The Number of Detected Objects (D) | The Number of Actual Objects (GT) | Error |
|---|---|---|---|
| fr3_office | 36 | 30 | 20.0% |
| fr3_teddy | 1 | 1 | 0.0% |
| fr2_desk | 18 | 16 | 12.5% |
| fr1_desk | 16 | 15 | 6.7% |
| fr1_room | 21 | 20 | 5.0% |
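The Error column is consistent with the relative over-count (D − GT) / GT, although the source does not state the formula explicitly. A minimal sketch under that assumption (the function name `count_error` is illustrative):

```python
def count_error(detected, actual):
    """Relative counting error in percent, assuming (D - GT) / GT."""
    return (detected - actual) / actual * 100.0


# (detected, actual) pairs copied from the table above.
rows = {
    "fr3_office": (36, 30),
    "fr3_teddy": (1, 1),
    "fr2_desk": (18, 16),
    "fr1_desk": (16, 15),
    "fr1_room": (21, 20),
}

for name, (d, gt) in rows.items():
    print(f"{name}: {count_error(d, gt):.1f}%")
```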
The objects’ centroids calculated by the estimated pose and the ground truth.
| Objects | Tracked ID | Method | Centroids | Error (m) |
|---|---|---|---|---|
| chair | 6 | estimated | [0.585, 0.343, 2.32] | 0.0237 |
| | | ground truth | [0.591, 0.352, 2.34] | |
| TV monitor | 389 | estimated | [0.592, −0.713, 3] | 0.00823 |
| | | ground truth | [0.593, −0.705, 3] | |
| keyboard | 513 | estimated | [0.565, −0.614, 3.24] | 0.0116 |
| | | ground truth | [0.576, −0.612, 3.24] | |
| bottle | 3 | estimated | [−0.513, −0.78, 2.4] | 0.00828 |
| | | ground truth | [−0.511, −0.773, 2.4] | |
| teddy bear | 188 | estimated | [0.778, −0.966, 3.81] | 0.0131 |
| | | ground truth | [0.776, −0.953, 3.81] | |
| book | 11 | estimated | [−0.203, −0.0962, 2.28] | 0.00997 |
| | | ground truth | [−0.198, −0.0899, 2.29] | |
| cup | 41 | estimated | [−1.37, 0.143, 2.92] | 0.0098 |
| | | ground truth | [−1.37, 0.147, 2.93] | |
| mouse | 402 | estimated | [0.252, −0.701, 3.41] | 0.00312 |
| | | ground truth | [0.254, −0.7, 3.4] | |
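The error between an estimated and a ground-truth centroid can be sketched as a Euclidean distance. This is an assumption: the source does not name the metric, but the tabulated values are broadly consistent with it. Because the centroids in the table are rounded to about three significant figures, recomputing from them only approximates the reported errors. The function name `centroid_error` is illustrative.

```python
import math


def centroid_error(estimated, ground_truth):
    """Euclidean distance (m) between two 3D centroids.

    Assumption: the paper's error is the straight-line distance
    between the estimated and ground-truth centroids.
    """
    return math.dist(estimated, ground_truth)


# Teddy bear centroids from the table (rounded values).
teddy_est = [0.778, -0.966, 3.81]
teddy_gt = [0.776, -0.953, 3.81]

print(f"{centroid_error(teddy_est, teddy_gt):.4f} m")
```

For the teddy bear row this gives a value close to the reported 0.0131 m. Rows whose depth coordinate differs only in the last rounded digit can deviate more from the reported error.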
The average positioning error (m) of objects in five datasets.
| Objects | fr3_office | fr3_teddy | fr2_desk | fr1_room | fr1_desk |
|---|---|---|---|---|---|
| chair | 0.0235 | / | 0.146 | 0.266 | 0.232 |
| TV monitor | 0.00927 | / | 0.266 | | 0.0839 |
| keyboard | 0.0118 | / | 0.0908 | 0.347 | 0.123 |
| bottle | 0.0154 | / | 0.0557 | 0.0347 | 0.00387 |
| teddy bear | 0.0177 | 0.00186 | 0.133 | 0.189 | / |
| book | 0.00972 | / | 0.0238 | 0.161 | 0.00946 |
| cup | 0.0167 | / | 0.0199 | 0.0539 | 0.0366 |
| mouse | 0.00656 | / | | 0.0979 | 0.0846 |
The number of points in the point clouds of the five datasets.
| Datasets | Points in Original Point Cloud | Points in Object-Only Point Cloud | Reduction |
|---|---|---|---|
| fr3_office | 4,814,372 | 2,635,597 | 45.3% |
| fr3_teddy | 8,451,663 | 3,395,009 | 59.8% |
| fr2_desk | 3,090,597 | 1,195,625 | 57.6% |
| fr1_room | 2,581,204 | 993,299 | 61.5% |
| fr1_desk | 364,226 | 166,546 | 54.3% |
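The Reduction column can be read as the fraction of points removed when only object points are kept, that is, 1 − (object-only points / original points). The sketch below assumes that definition, which the source does not spell out, and checks it on two rows for illustration; the function name `reduction_percent` is hypothetical.

```python
def reduction_percent(original_points, object_points):
    """Percentage of points removed, assuming 1 - object/original."""
    return (1.0 - object_points / original_points) * 100.0


# Two rows copied from the table, used here as examples.
examples = {
    "fr3_office": (4_814_372, 2_635_597),
    "fr1_desk": (364_226, 166_546),
}

for name, (orig, obj) in examples.items():
    print(f"{name}: {reduction_percent(orig, obj):.1f}%")
```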
Comparison of the localization performance of our method with existing SLAM systems. RMSE (m).
| Sequence | ORB-SLAM2 | Elastic-Fusion | DVO-SLAM | Fusion++ | Ours |
|---|---|---|---|---|---|
| fr3_office | 0.0121 | 0.017 | 0.035 | 0.1082 | |
| fr3_teddy | | 0.048 | 0.046 | 0.1535 | 0.0259 |
| fr2_desk | 0.0095 | 0.071 | 0.017 | 0.1144 | |
| fr1_room | 0.0474 | 0.068 | | 0.2356 | 0.0452 |
| fr1_desk | 0.0163 | 0.020 | 0.021 | 0.0499 | |
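As a rough illustration of the RMSE metric used in this comparison, the sketch below computes the root-mean-square of per-frame translational errors between an estimated and a ground-truth trajectory. It assumes the two trajectories are already associated by timestamp and aligned to a common frame, which the standard evaluation tools do beforehand; the function name `ape_rmse` and the toy trajectory are hypothetical and are not data from the paper.

```python
import math


def ape_rmse(estimated, ground_truth):
    """RMSE (m) of translational error over matched camera positions.

    Assumption: both lists hold 3D positions that are already
    time-associated and expressed in the same coordinate frame.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equal length")
    squared = [math.dist(e, g) ** 2 for e, g in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))


# Toy two-frame trajectory, made up purely for illustration.
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)]

print(f"{ape_rmse(est, gt):.4f} m")
```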
Comparison of the function of our method with existing semantic SLAM systems.
| Semantic SLAM System | Semantics | Scenario-Oriented Semantic Maps | Object-Oriented Semantic Maps | Semantic Help SLAM Positioning | Objects Location | Objects Reconstruction |
|---|---|---|---|---|---|---|
| SLAM++ | √ | √ | | | | |
| Meaningful maps | √ | √ | | | | |
| CNN-SLAM | √ | √ | | | | |
| SemanticFusion | √ | √ | √ | | | |
| MaskFusion | √ | √ | √ | √ | | |
| Ours | √ | √ | √ | √ | √ |