Masatoshi Nagano, Tomoaki Nakamura, Takayuki Nagai, Daichi Mochihashi, Ichiro Kobayashi.
Abstract
In this study, HcVGH, a method that learns spatio-temporal categories by segmenting first-person-view (FPV) videos captured by mobile robots, is proposed. Humans perceive continuous high-dimensional information by dividing and categorizing it into significant segments. This unsupervised segmentation capability is considered important for mobile robots to learn spatial knowledge. The proposed HcVGH combines a convolutional variational autoencoder (cVAE) with HVGH, a previous method; it follows the hierarchical Dirichlet process–variational autoencoder–Gaussian process–hidden semi-Markov model, comprising deep generative and statistical models. In the experiment, FPV videos of an agent in a simulated maze environment were used. FPV videos contain spatial information, and spatial knowledge can be learned by segmenting them. Using the FPV-video dataset, the segmentation performance of the proposed model was compared with that of previous models: HVGH and the hierarchical recurrent state-space model (HRSSM). HcVGH achieved an average segmentation F-measure of 0.77, outperforming the baseline methods. Furthermore, the experimental results showed that parameters representing the movability of the maze environment can be learned.
Keywords: Gaussian process; convolutional variational autoencoder; hidden semi-Markov model; segmentation; spatio-temporal categorization; unsupervised learning
Year: 2022 PMID: 36246490 PMCID: PMC9562109 DOI: 10.3389/frobt.2022.903450
Source DB: PubMed Journal: Front Robot AI ISSN: 2296-9144
FIGURE 1. Generative process of the proposed method.
FIGURE 2. HcVGH model: white and gray nodes represent unobserved variables and the high-dimensional observed sequence obtained by concatenating segments, respectively.
FIGURE 3. Convolutional variational autoencoder network architecture: (A) encoder: six convolutional layers (conv) and two fully connected layers (FC); (B) decoder: one fully connected layer and seven deconvolutional layers (conv_T).
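The layer counts above are specific to the paper's network; as a minimal, framework-free sketch of the VAE core that such an encoder–decoder optimizes, assuming a diagonal Gaussian posterior (the reparameterization trick and the ELBO's KL term):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.zeros((1, 3))        # stand-in encoder outputs for one frame, 3 latent dims
log_var = np.zeros((1, 3))
z = reparameterize(mu, log_var, rng)      # a differentiable sample of the latent
kl = kl_to_standard_normal(mu, log_var)   # 0 here, since the posterior is N(0, I)
```

In HcVGH the KL regularizer is taken against the Gaussian inferred by the HDP-GP-HSMM rather than N(0, I); the closed form above is the standard-normal special case.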
FIGURE 4. Overview of HcVGH parameter estimation. The parameters are learned through a mutual learning loop between the cVAE and the HDP-GP-HSMM.
FIGURE 5. (A) Top view of the maze. The white block indicates the agent, which can move along the six paths indicated by the arrows. (B) Example of the first-person-view video data captured at the location circled in black in (A).
Segmentation results of HcVGH and the baseline methods.
| Model | Hyperparameter | Hamming distance | Precision | Recall | F-measure |
|---|---|---|---|---|---|
| HcVGH | λ = 20 | 0.33 ± 0.05 | 0.84 ± 0.06 | 0.91 ± 0.06 | 0.87 ± 0.06 |
| | λ = 10 | 0.19 ± 0.02 | 0.68 ± 0.05 | 0.96 ± 0.01 | 0.79 ± 0.03 |
| | λ = 7 | 0.18 ± 0.01 | 0.61 ± 0.03 | 1.0 ± 0.0 | 0.75 ± 0.02 |
| | λ = 5 | 0.19 ± 0.01 | 0.56 ± 0.02 | 0.99 ± 0.01 | 0.72 ± 0.01 |
| | λ = 4 | 0.19 ± 0.01 | 0.55 ± 0.02 | 1.0 ± 0.0 | 0.71 ± 0.02 |
| | Average | 0.22 ± 0.06 | 0.65 ± 0.12 | 0.97 ± 0.04 | 0.77 ± 0.07 |
| HVGH | λ = 20 | 0.78 ± 0.18 | 0.54 ± 0.33 | 0.49 ± 0.35 | 0.50 ± 0.33 |
| | λ = 10 | 0.66 ± 0.19 | 0.58 ± 0.33 | 0.56 ± 0.33 | 0.55 ± 0.32 |
| | λ = 7 | 0.60 ± 0.26 | 0.34 ± 0.31 | 0.45 ± 0.42 | 0.39 ± 0.35 |
| | λ = 5 | 0.68 ± 0.20 | 0.55 ± 0.34 | 0.55 ± 0.34 | 0.51 ± 0.29 |
| | λ = 4 | 0.80 ± 0.21 | 0.20 ± 0.27 | 0.29 ± 0.39 | 0.23 ± 0.31 |
| | Average | 0.70 ± 0.20 | 0.44 ± 0.33 | 0.47 ± 0.35 | 0.43 ± 0.32 |
| HRSSM (6 paths) | Nmax = 1 | 0.40 | 0.95 | 0.56 | 0.70 |
| | Nmax = 2 | 0.41 | 0.72 | 0.23 | 0.34 |
| | Nmax = 3 | 0.41 | 0.79 | 1.0 | 0.88 |
| | Nmax = 4 | 0.40 | 0.62 | 0.96 | 0.76 |
| | Nmax = 5 | 0.40 | 0.39 | 1.0 | 0.55 |
| | Average | 0.40 ± 0.01 | 0.69 ± 0.21 | 0.76 ± 0.35 | 0.65 ± 0.21 |
| HRSSM (1M) | Nmax = 1 | 0.35 | 1.0 | 0.69 | 0.80 |
| | Nmax = 2 | 0.35 | 1.0 | 0.48 | 0.64 |
| | Nmax = 3 | 0.34 | 0.64 | 0.65 | 0.63 |
| | Nmax = 4 | 0.39 | 0.51 | 0.73 | 0.60 |
| | Nmax = 5 | 0.39 | 0.44 | 0.92 | 0.59 |
| | Average | 0.36 ± 0.03 | 0.72 ± 0.26 | 0.70 ± 0.16 | 0.65 ± 0.09 |
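The metrics in the table can be computed with standard definitions; below is a minimal sketch, assuming frame-wise class labels for the normalized Hamming distance and boundary matching within a ±tol frame window for precision/recall/F-measure (the paper's exact matching rule and tolerance are not reproduced here):

```python
import numpy as np

def normalized_hamming(true_labels, pred_labels):
    """Fraction of frames whose class labels disagree (after any label alignment)."""
    t, p = np.asarray(true_labels), np.asarray(pred_labels)
    return float(np.mean(t != p))

def boundary_prf(true_bounds, pred_bounds, tol=2):
    """Precision/recall/F-measure for segment boundaries matched within +/- tol frames."""
    true_bounds, pred_bounds = set(true_bounds), set(pred_bounds)
    # A predicted boundary counts as correct if some true boundary lies within tol.
    hit_pred = sum(any(abs(b - t) <= tol for t in true_bounds) for b in pred_bounds)
    precision = hit_pred / len(pred_bounds) if pred_bounds else 0.0
    # A true boundary counts as recalled if some predicted boundary lies within tol.
    hit_true = sum(any(abs(t - b) <= tol for b in pred_bounds) for t in true_bounds)
    recall = hit_true / len(true_bounds) if true_bounds else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

For example, `boundary_prf([10, 20, 30], [11, 20, 50], tol=2)` matches two of three boundaries on each side, giving precision, recall, and F-measure of 2/3.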
FIGURE 6. Segmentation results for first-person-view video data.
FIGURE 7. Transition probabilities of the estimated classes: (A) the transition matrix; (B) the maze locations corresponding to each class.
Evaluation of the spatial movability of paths. “T” in the “Type” column indicates class sequences included in the training data; “G” indicates generated class sequences that were not included in the training data. Underlined numbers in the “Class sequence” column represent spatially impossible transitions.
| No. | Type | Class sequence | Score |
|---|---|---|---|
| 1 | T | 20, 11, 10, 21, 17, 15, 8, 16, 9, 4, 12, 23, 26, 2, 22 | −0.228 |
| 2 | T | 20, 6, 3, 7, 24, 14, 3, 25, 22, 26, 2, 22 | −0.390 |
| 3 | T | 20, 6, 3, 7, 24, 14, 5, 12, 23, 26, 2, 22 | −0.327 |
| 4 | T | 20, 11, 10, 21, 17, 15, 8, 16, 2, 13 | −0.432 |
| 5 | T | 20, 11, 10, 21, 17, 15, 0, 20, 27, 2, 4, 12, 23, 26, 2, 22 | −0.426 |
| 6 | T | 20, 11, 10, 19, 9, 18, 7, 24, 14, 3, 2, 1, 22 | −0.531 |
| 7 | G | 20, 11, 10, 21, 17, 15, 0, 20, 11, 10, 21, 17, 15, 8, 16, 9, 4, 12, 23, 26, 2, 22 | −0.244 |
| 8 | G | 20, 11, 10, 21, 17, 15, 0, 20, 11, 10, 19, 9, 18, 7, 24, 14, 5, 12, 23, 26, 2, 22 | −0.297 |
| 9 | G | 20, 11, 10, 21, 17, 15, 0, 20, 11, 10, 21, 17, 15, 8, 16, 2, 13 | −0.364 |
| 10 | G | 20, 11, 10, 21, 17, 15, 8, 16, 9, | −4.397 |
| 11 | G | 20, 11, 10, 21, 17, 15, 0, 20, 11, | −4.420 |
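Scores like those above can be read as per-step average log transition probabilities under the learned transition matrix (Figure 7): a single spatially impossible transition drags the average sharply down, as in rows 10 and 11. A hypothetical sketch with a toy 3-class matrix (the actual matrix and scoring rule are the paper's, not reproduced here):

```python
import numpy as np

def avg_log_transition_prob(seq, trans, eps=1e-10):
    """Average of log P(seq[i+1] | seq[i]) over a class sequence.
    eps floors near-zero probabilities so impossible transitions score
    very low instead of -inf."""
    logps = [np.log(trans[a, b] + eps) for a, b in zip(seq[:-1], seq[1:])]
    return float(np.mean(logps))

# Toy transition matrix: class 2 is unreachable directly from class 0.
trans = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.1, 0.1, 0.8]])
ok = avg_log_transition_prob([0, 1, 2, 2], trans)    # every step is possible
bad = avg_log_transition_prob([0, 2, 2, 2], trans)   # 0 -> 2 is impossible
```

Here `ok` stays close to 0 while `bad` is dominated by the log of the floored zero probability, mirroring the gap between rows 1–9 and rows 10–11.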
FIGURE 8. Results estimated by HRSSM: (A–D) show examples of images predicted using the same context.
Number of predicted paths at the T-junction.
| Model | Left | Right | Else |
|---|---|---|---|
| HRSSM (1M) | 52 | 23 | 25 |
FIGURE 9. Latent variables learned by HcVGH: (A–C) show the first vs. second, first vs. third, and second vs. third principal components of the latent variables, respectively. The color of each point reflects the correct corridor class.
FIGURE 10. Latent variables learned by HRSSM (6 paths): (A–C) show the first vs. second, first vs. third, and second vs. third principal components of the latent variables, respectively. The color of each point reflects the correct corridor class.
FIGURE 11. Latent variables learned by HRSSM (1M): (A–C) show the first vs. second, first vs. third, and second vs. third principal components of the latent variables, respectively. The color of each point reflects the correct corridor class.
FIGURE 12. Latent variables learned by HVGH: (A–C) show the first vs. second, first vs. third, and second vs. third principal components of the latent variables, respectively. The color of each point reflects the correct corridor class.
FIGURE 13. Latent variables learned by the cVAE alone: (A–C) show the first vs. second, first vs. third, and second vs. third principal components of the latent variables, respectively. The color of each point reflects the correct corridor class.
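Principal-component plots like those in Figures 9–13 can be produced with ordinary PCA; a minimal sketch via SVD, with random data standing in for the actual learned latent variables:

```python
import numpy as np

def pca_components(x, k=3):
    """Project rows of x onto the top-k principal components via SVD."""
    xc = x - x.mean(axis=0)                  # center each latent dimension
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:k].T                     # scores for components 1..k

rng = np.random.default_rng(0)
latents = rng.standard_normal((100, 10))     # stand-in for learned latent variables
scores = pca_components(latents, k=3)        # columns: PC1, PC2, PC3
# Plotting (PC1, PC2), (PC1, PC3), and (PC2, PC3) gives the three panels,
# colored by the ground-truth corridor class of each frame.
```

Since the score columns come from an orthogonal basis, they are mutually uncorrelated, which is what makes the three 2-D scatter views complementary.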