Haibo Sun 1,2,3,4; Feng Zhu 2,3,4; Yanzi Kong 2,3,4,5; Jianyu Wang 1,2,3,4; Pengfei Zhao 2,3,4,5.
Abstract
Active object recognition (AOR) aims to collect additional information to improve recognition performance by purposefully adjusting the viewpoint of an agent. How to determine the agent's next best viewpoint, i.e., viewpoint planning (VP), is a research focus. Most existing VP methods perform viewpoint exploration in a discrete viewpoint space; they have to sample the viewpoint space and may introduce significant quantization error. To address this challenge, a continuous VP approach for AOR based on reinforcement learning is proposed. Specifically, we use two separate neural networks to model the VP policy as a parameterized Gaussian distribution and resort to the proximal policy optimization (PPO) framework to learn the policy. Furthermore, a dynamic exploration scheme based on adaptive entropy regularization is presented to automatically adjust the viewpoint exploration ability during learning. Finally, experimental results on the public GERMS dataset demonstrate the superiority of the proposed VP method.
Keywords: active object recognition; adaptive entropy regularization; continuous viewpoint planning; dynamic exploration; proximal policy optimization
Year: 2021 PMID: 34946008 PMCID: PMC8701023 DOI: 10.3390/e23121702
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
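The abstract describes the continuous VP policy as a parameterized Gaussian distribution whose mean and standard deviation are produced by two separate neural networks. The PyTorch sketch below illustrates that idea only; the hidden sizes, activations, tanh-squashed mean, and softplus-activated standard deviation are illustrative assumptions, not the paper's reported architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianVPPolicy(nn.Module):
    """Continuous viewpoint-planning policy: pi(a|s) = N(mu(s), sigma(s)^2).

    The mean and standard deviation come from two separate networks, so the
    exploration ability (sigma) can be learned on its own. Hidden sizes and
    activations are illustrative assumptions.
    """

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim), nn.Tanh(),      # viewpoint change in [-1, 1]
        )
        self.std_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim), nn.Softplus(),  # keep sigma > 0
        )

    def forward(self, state: torch.Tensor) -> Normal:
        mu = self.mean_net(state)
        sigma = self.std_net(state) + 1e-4  # avoid a degenerate distribution
        return Normal(mu, sigma)

    def act(self, state: torch.Tensor):
        dist = self.forward(state)
        action = dist.sample()              # continuous viewpoint adjustment
        return action, dist.log_prob(action).sum(-1)
```

Keeping the standard deviation in its own network lets the exploration ability be adjusted independently of the policy mean, which is what the dynamic exploration schemes described in Figures 2 and 3 rely on.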
Figure 1. Illustration of viewpoint exploration ability. The exploration ability of a VP policy with a larger standard deviation is stronger than that of a VP policy with a smaller standard deviation, because a larger standard deviation offers more possibilities to try new viewpoints.
Figure 2. The proposed AOR pipeline. The pipeline adopts the PPO framework [13] to learn a continuous VP policy represented by a parameterized Gaussian model. To realize dynamic exploration, two separate neural networks are used to represent the mean and standard deviation of the Gaussian model and are trained concurrently. During training, the policy is improved by collecting sample trajectories and optimizing the PPO objective.
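Figure 2 states that the policy is improved by collecting sample trajectories and optimizing the PPO objective. The following is a minimal sketch of one such update with the clipped surrogate loss and an entropy-regularization term, usable with a policy like the one sketched above; the clipping range, entropy coefficient, and the assumption that advantages are estimated externally (e.g., by a separate critic) are placeholders rather than the paper's settings.

```python
import torch

def ppo_update(policy, optimizer, states, actions, old_log_probs, advantages,
               clip_eps: float = 0.2, entropy_coef: float = 0.01):
    """One PPO policy update on a batch of collected transitions.

    Loss = -min(r * A, clip(r, 1-eps, 1+eps) * A) - c * H[pi(.|s)],
    where r is the new/old probability ratio and c the entropy coefficient.
    old_log_probs are computed at collection time and carry no gradient.
    """
    dist = policy(states)
    log_probs = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(log_probs - old_log_probs)           # pi_new / pi_old

    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    entropy = dist.entropy().sum(-1).mean()                # exploration bonus

    loss = -torch.min(surr1, surr2).mean() - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), entropy.item()
```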
Figure 3. Changes of exploration ability during training on the GERMS left-arm dataset [12] under different dynamic exploration schemes. Because the standard deviation is a function of the state, the standard deviation representing the exploration ability refers to the average standard deviation over all states. However, there are infinitely many states, so this average cannot be computed exactly; during training, we approximate it by the average standard deviation over a set of sampled states. We implement three dynamic exploration schemes step by step: (1) the first parameterizes the policy mean and standard deviation simultaneously with two separate neural networks; (2) the second adds entropy regularization with a constant coefficient on the basis of (1); (3) the third improves the constant coefficient into an adaptive version on the basis of (2) (the curve labeled Adaptive c). Experimental comparison shows that scheme (3) meets our dynamic exploration need.
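Figure 3 concludes that an adaptive entropy-regularization coefficient gives the desired dynamic exploration. One plausible way to realize such adaptation, sketched below purely for illustration, is to nudge the coefficient c up or down depending on whether the approximate average standard deviation of sampled states falls below or above boundary curves (cf. Figure 4); the update rule, step size, and bounds are assumptions, not the paper's adaptation law.

```python
def adapt_entropy_coef(c: float, avg_std: float, lower: float, upper: float,
                       step: float = 1e-3, c_min: float = 0.0, c_max: float = 0.1) -> float:
    """Hypothetical adaptive update of the entropy-regularization coefficient c.

    If the measured exploration ability (average sigma over sampled states)
    drops below the lower boundary, increase c to encourage exploration;
    if it exceeds the upper boundary, decrease c; otherwise leave c unchanged.
    """
    if avg_std < lower:
        c = min(c + step, c_max)
    elif avg_std > upper:
        c = max(c - step, c_min)
    return c
```

The returned coefficient would then play the role of `entropy_coef` in the PPO update sketched after Figure 2.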
Figure 4. Diagram of the upper and lower boundary functions of the standard deviation.
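A minimal sketch of such boundary functions is given below, assuming they are decaying schedules of the training progress that narrow the admissible exploration range over time; the exponential form, constants, and function name are purely illustrative and not taken from the paper.

```python
import math

def std_boundaries(step: int, total_steps: int,
                   upper_start: float = 1.0, lower_start: float = 0.5,
                   upper_end: float = 0.2, lower_end: float = 0.05):
    """Illustrative decaying upper/lower boundary curves for the average sigma."""
    frac = min(step / max(total_steps, 1), 1.0)
    upper = upper_end + (upper_start - upper_end) * math.exp(-3.0 * frac)
    lower = lower_end + (lower_start - lower_end) * math.exp(-3.0 * frac)
    return lower, upper
```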
Figure 5. The GERMS dataset [12]. The objects are soft toys representing various human cell types, microbes, and disease-related organisms.
Figure 6. Images from different viewpoints in different tracks.
GERMS dataset statistics (mean ± std).
| Split | Number of Tracks | Images/Track | Total Number of Images |
|---|---|---|---|
| Train | 816 | 157 ± 12 | 76,722 |
| Test | 549 | 145 ± 19 | 51,561 |
Abbreviations and interpretations for different components in our dynamic exploration scheme.
| Abbreviation | Interpretation |
|---|---|
| BL | Baseline PPO framework [13] |
| SSDN | Separate standard deviation network |
| ER | Entropy regularization (with a fixed coefficient) |
| AERC | Adaptive entropy regularization coefficient |
Figure 7. Performance comparison results of the ablation experiments.
Figure 8. Performance comparison results of continuous VP policies combined with different dynamic exploration schemes. The three parameters involved in ILDDE are set to 120, 0.1, and 4200, respectively.
Figure 9. Performance comparison between our proposed continuous VP method and several competing approaches.