Juan Manuel Vera-Diaz, Daniel Pizarro, Javier Macias-Guarasa.
Abstract
This paper presents a novel approach for indoor acoustic source localization using microphone arrays, based on a Convolutional Neural Network (CNN). In the proposed solution, the CNN is designed to directly estimate the three-dimensional position of a single acoustic source, using the raw audio signal as input and avoiding hand-crafted audio features. Given the limited amount of available localization data, we propose a two-step training strategy. We first train the network on semi-synthetic data generated from close-talk speech recordings, simulating the time delays and distortion that the signal suffers as it propagates from the source to the microphone array. We then fine-tune the network using a small amount of real data. Our experimental results, evaluated on a publicly available dataset recorded in a real room, show that this approach produces networks that significantly improve on existing localization methods based on SRP-PHAT strategies, and also on those presented in very recent proposals based on Convolutional Recurrent Neural Networks (CRNN). In addition, our experiments show that the performance of our CNN method does not depend significantly on the speaker's gender or on the size of the signal window being used.
Keywords: acoustic source localization; convolutional neural networks; deep learning; microphone arrays
Year: 2018 PMID: 30322007 PMCID: PMC6210564 DOI: 10.3390/s18103418
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
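The semi-synthetic training step described in the abstract can be illustrated with a minimal sketch. It assumes free-field propagation, integer-sample delays, and a simple 1/distance attenuation; the noise and reverberation that the abstract groups under "distortion" are not modelled. The function name `simulate_array_signals`, its parameters, and the example geometry are hypothetical and are not taken from the paper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def simulate_array_signals(clean, source_pos, mic_positions, fs, c=SPEED_OF_SOUND):
    """Delay and attenuate a close-talk signal for each microphone.

    Assumptions (not from the paper): free-field propagation, delays
    rounded to whole samples, amplitude scaled by 1/distance.
    """
    clean = np.asarray(clean, dtype=float)
    source_pos = np.asarray(source_pos, dtype=float)
    mic_positions = np.asarray(mic_positions, dtype=float)

    # Source-to-microphone distances in metres, one per microphone.
    distances = np.linalg.norm(mic_positions - source_pos, axis=1)
    # Propagation delays, rounded to the nearest sample.
    delays = np.rint(distances / c * fs).astype(int)

    out = np.zeros((len(mic_positions), len(clean)))
    for m, (dist, n) in enumerate(zip(distances, delays)):
        if n < len(clean):
            # Shift by n samples (zero-padded at the start) and attenuate.
            out[m, n:] = clean[: len(clean) - n] / max(dist, 1e-3)
    return out, delays


# Hypothetical usage: a unit impulse heard by two microphones.
impulse = np.zeros(200)
impulse[0] = 1.0
signals, delays = simulate_array_signals(
    impulse, [0.0, 0.0, 0.0], [[0.343, 0.0, 0.0], [0.686, 0.0, 0.0]], fs=16000
)
```

A network trained on signals generated this way sees the inter-microphone delay pattern associated with each source position, which is the information the localization task depends on.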
Figure 1. Network topology used.
Network convolutional layers summary.
| Block | Filters | Kernel |
|---|---|---|
| Convolutional block 1 | 96 | 7 |
| Convolutional block 2 | 96 | 7 |
| Convolutional block 3 | 128 | 5 |
| Convolutional block 4 | 128 | 5 |
| Convolutional block 5 | 128 | 3 |
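To give a sense of the size of the convolutional stack summarized above, the sketch below counts its trainable weights. It assumes that each block is a plain one-dimensional convolution with a bias term and that the input has one channel per microphone; the value of 4 input channels is a hypothetical choice, and any normalization or fully connected layers in the actual network are ignored.

```python
# (filters, kernel size) for each convolutional block, as listed in the table.
CONV_BLOCKS = [(96, 7), (96, 7), (128, 5), (128, 5), (128, 3)]


def conv1d_param_counts(in_channels, blocks=CONV_BLOCKS):
    """Return the weight-plus-bias count of each 1D convolutional block.

    Assumes plain 1D convolutions with bias (an assumption, not a detail
    confirmed by the table).
    """
    counts = []
    channels = in_channels
    for filters, kernel in blocks:
        # One kernel of length `kernel` per (input, output) channel pair,
        # plus one bias per output channel.
        counts.append(channels * filters * kernel + filters)
        channels = filters
    return counts


# Hypothetical input with 4 channels (one per microphone).
counts = conv1d_param_counts(4)
total = sum(counts)  # 260288 under these assumptions
```

Under these assumptions, the second and fourth blocks account for most of the weights because they connect two wide layers with relatively long kernels.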
Figure 2. (a) Simplified top view of the IDIAP Smart Meeting Room; (b) a real picture of the room extracted from a video frame; (c) microphone setup used in this proposal.
Sequences used from the IDIAP Smart Meeting Room.
| Sequence (abbr.) | Average Speaker Height (cm) * | Duration (s) | Number of Ground Truth Frames | Description |
|---|---|---|---|---|
| seq01-1p-0000 ( | 54.3 | 208 | 2248 | A single male speaker, static while speaking, at each of the 16 locations. The speaker is facing the microphone arrays. |
| seq02-1p-0000 ( | 62.5 | 171 | 2411 | A single female speaker, static while speaking, at each of the 16 locations. The speaker is facing the microphone arrays. |
| seq03-1p-0000 ( | 70.3 | 220 | 2636 | A single male speaker, static while speaking, at each of the 16 locations. The speaker is facing the microphone arrays. |
| seq11-1p-0100 ( | 53.5 | 33 | 481 | A single male speaker, making random movements while speaking, and facing the arrays. |
| seq15-1p-0100 ( | 79.5 | 36 | 436 | A single male speaker, walking around while alternating speech and long silences. No constraints. |
* The average speaker height is referenced to the system’s coordinates and refers to the speaker’s mouth height.
Baseline results for the SRP-PHAT strategy (column SRP), the strategy in [35] (column GMBF), and the Convolutional Neural Network (CNN) trained with synthetic data without applying the fine-tuning procedure (column CNN), for sequences s01, s02 and s03 and for different window sizes. Relative improvements compared to SRP-PHAT are shown below the MOTP (Multiple Object Tracking Precision) values.
| | 80 ms: SRP | 80 ms: GMBF | 80 ms: CNN | 160 ms: SRP | 160 ms: GMBF | 160 ms: CNN | 320 ms: SRP | 320 ms: GMBF | 320 ms: CNN |
|---|---|---|---|---|---|---|---|---|---|
| Average | | | | | | | | | |
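The relative improvements over SRP-PHAT reported in these tables can be read as percentage reductions of the localization error. The sketch below assumes that MOTP is an error measure for which lower is better and that the improvement is normalized by the SRP-PHAT value; the function name and the numbers in the usage line are hypothetical and are not results from the paper.

```python
def relative_improvement(motp_baseline, motp_method):
    """Percentage reduction of MOTP error relative to a baseline.

    Assumes lower MOTP is better; a positive result means the method
    has a smaller error than the baseline.
    """
    if motp_baseline <= 0:
        raise ValueError("baseline MOTP must be positive")
    return 100.0 * (motp_baseline - motp_method) / motp_baseline


# Hypothetical values: baseline error 1.0 m, method error 0.75 m.
improvement = relative_improvement(1.0, 0.75)  # 25.0
```

A negative value under this definition would indicate that the method performs worse than the SRP-PHAT baseline.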
Results for the strategy in [35] (column GMBF) and the CNN that was fine-tuned with sequence s15 (column CNNf15).
| | 80 ms: GMBF | 80 ms: CNNf15 | 160 ms: GMBF | 160 ms: CNNf15 | 320 ms: GMBF | 320 ms: CNNf15 |
|---|---|---|---|---|---|---|
| Average | | | | | | |
Relative improvements over SRP-PHAT for the strategy presented in Ref. [35] (columns GMBF) and the CNN fine-tuned with sequences s15 and s11 (columns CNNf15+11).
| | 80 ms: GMBF | 80 ms: CNNf15+11 | 160 ms: GMBF | 160 ms: CNNf15+11 | 320 ms: GMBF | 320 ms: CNNf15+11 |
|---|---|---|---|---|---|---|
| Average | | | | | | |
Fine tuning material used in the experiment corresponding to columns CNNf15+11+st in Table 7.
| Test Sequence | Fine Tuning Sequences |
|---|---|
| seq01 | |
| seq02 | |
| seq03 | |
Relative improvements over SRP-PHAT for the strategy presented in Ref. [35] (column GMBF) and the CNN fine-tuned with the sequences described in Table 6 (column CNNf15+11+st).
| | 80 ms: GMBF | 80 ms: CNNf15+11+st | 160 ms: GMBF | 160 ms: CNNf15+11+st | 320 ms: GMBF | 320 ms: CNNf15+11+st |
|---|---|---|---|---|---|---|
| Average | | | | | | |
Figure 3. MOTP relative improvements over SRP-PHAT for GMBF and CNN using different fine tuning subsets (for all window sizes).
Results for the CNN proposal, either trained from scratch (column tr−sc) or using semi-synthetic training + fine tuning (column tr+ft), for different training/fine tuning sequences.
| tr−sc/tr+ft Sequences | 80 ms: tr−sc | 80 ms: tr+ft | 160 ms: tr−sc | 160 ms: tr+ft | 320 ms: tr−sc | 320 ms: tr+ft |
|---|---|---|---|---|---|---|
| Sequences of | | | | | | |
Relative improvements over SRP-PHAT for the strategy in [18] (column SELDnet); and the CNN fine-tuned with the sequences described in Table 6 (column CNNf15+11+st).
| | 80 ms: SELDnet | 80 ms: CNNf15+11+st | 160 ms: SELDnet | 160 ms: CNNf15+11+st | 320 ms: SELDnet | 320 ms: CNNf15+11+st |
|---|---|---|---|---|---|---|
| Average | | | | | | |