Stavros Ntalampiras, Ilyas Potamitis.
Abstract
Human activities are accompanied by characteristic sound events, the processing of which might provide valuable information for automated human activity recognition. This paper presents a novel approach addressing the case where one or more human activities are associated with limited audio data, resulting in a potentially highly imbalanced dataset. Data augmentation is based on transfer learning; more specifically, the proposed method: (a) identifies the classes that are statistically close to the ones associated with limited data; (b) learns a multiple-input, multiple-output transformation; and (c) transforms the data of the closest classes so that they can be used for modeling the classes associated with limited data. Furthermore, the proposed framework includes a feature set extracted from signal representations in diverse domains, i.e., temporal, spectral, and wavelet. Extensive experiments demonstrate the relevance of the proposed data augmentation approach under a variety of generative recognition schemes.
Keywords: echo state network; generalized audio recognition; hidden Markov model; multidomain features; transfer learning
Year: 2018; PMID: 29941845; PMCID: PMC6163773; DOI: 10.3390/bios8030060
Source DB: PubMed; Journal: Biosensors (Basel); ISSN: 2079-6374
Figure 1. The logical flow of the proposed method. It includes: (a) signal preprocessing; (b) feature extraction; (c) GMM creation; (d) Kullback–Leibler divergence calculation as a distance metric; (e) identification of the closest models; and (f) data augmentation based on transfer learning.
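Steps (c) to (e) of Figure 1 can be illustrated with a minimal sketch. The Kullback–Leibler divergence between two GMMs has no closed form, so the sketch estimates it by Monte Carlo sampling; whether the paper uses this particular estimator is not stated here. The one-dimensional toy models, the class names (`scarce`, `candidate_1`, `candidate_2`), and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def gmm_logpdf(x, gmm):
    """Log-density of a 1-D GMM given as a list of (weight, mean, std)."""
    terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2 * math.pi)
        - 0.5 * ((x - m) / s) ** 2
        for w, m, s in gmm
    ]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def gmm_sample(gmm, n, rng):
    """Draw n samples: pick a component by weight, then sample from it."""
    weights = [w for w, _, _ in gmm]
    components = rng.choices(gmm, weights=weights, k=n)
    return [rng.gauss(m, s) for _, m, s in components]


def kl_monte_carlo(p, q, n=5000, seed=0):
    """Estimate KL(p || q) as the mean of log p(x) - log q(x) for x ~ p."""
    rng = random.Random(seed)
    samples = gmm_sample(p, n, rng)
    return sum(gmm_logpdf(x, p) - gmm_logpdf(x, q) for x in samples) / n


def closest_class(target, models):
    """Return the class whose model has the smallest divergence from target."""
    divergences = {
        name: kl_monte_carlo(models[target], gmm)
        for name, gmm in models.items()
        if name != target
    }
    return min(divergences, key=divergences.get), divergences


# Toy models with made-up parameters (not taken from the paper).
models = {
    "scarce": [(0.6, 0.0, 1.0), (0.4, 2.0, 0.5)],
    "candidate_1": [(0.5, 0.3, 1.1), (0.5, 2.2, 0.6)],
    "candidate_2": [(0.7, 6.0, 1.0), (0.3, 9.0, 2.0)],
}
nearest, divergences = closest_class("scarce", models)
print(nearest, divergences)
```

In this toy setting, the class selected as `nearest` would be the one whose data are transformed and reused to model the data-scarce class.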
Figure 2. The Echo State Network used for feature space transformation: (a) input layer; (b) reservoir layer; and (c) output layer.
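The three layers in Figure 2 can be sketched as a small echo state network that maps feature vectors of one class onto those of another. This is a generic ESN with a fixed random reservoir and a ridge-regression readout, not the authors' implementation; the reservoir size, spectral radius, ridge penalty, and the synthetic source and target data are all assumptions made for illustration.

```python
import numpy as np


class EchoStateNetwork:
    """Minimal ESN: fixed random input and reservoir weights, trained readout."""

    def __init__(self, n_in, n_res, n_out, spectral_radius=0.9, ridge=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # (a) Input layer: random, untrained input weights.
        self.w_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
        # (b) Reservoir layer: random recurrent weights, rescaled so the
        # largest absolute eigenvalue equals the chosen spectral radius.
        w = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
        w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
        self.w = w
        self.ridge = ridge
        # (c) Output layer: the only trained weights.
        self.w_out = np.zeros((n_out, n_res))

    def _states(self, inputs):
        """Run the reservoir over a sequence of input vectors."""
        x = np.zeros(self.w.shape[0])
        states = []
        for u in inputs:
            x = np.tanh(self.w_in @ u + self.w @ x)
            states.append(x)
        return np.array(states)

    def fit(self, inputs, targets):
        """Train the readout with ridge regression on the reservoir states."""
        s = self._states(inputs)
        gram = s.T @ s + self.ridge * np.eye(s.shape[1])
        self.w_out = np.linalg.solve(gram, s.T @ targets).T
        return self

    def transform(self, inputs):
        """Map input feature vectors to the target feature space."""
        return self._states(inputs) @ self.w_out.T


# Synthetic stand-ins for source-class and target-class features.
rng = np.random.default_rng(1)
source = rng.normal(size=(200, 3))
mixing = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, -0.5], [0.3, 0.0, 1.0]])
target = np.tanh(source @ mixing)

esn = EchoStateNetwork(n_in=3, n_res=50, n_out=3).fit(source, target)
mapped = esn.transform(source)
mse = float(np.mean((mapped - target) ** 2))
print(mapped.shape, mse)
```

Only the readout weights are learned; keeping the input and reservoir weights fixed is what makes ESN training a single linear solve.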
The quantities of audio data per class of human activity.
| Human Activity | 10-Second Audio Clips |
|---|---|
| Brew coffee | 245 |
| Cooking | 132 |
| Use microwave oven | 42 |
| No activity | 16 |
| Taking a shower | 428 |
| Washing dishes | 134 |
| Washing hands | 70 |
| Brushing teeth | 92 |
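The degree of imbalance in the table can be quantified with a few lines of arithmetic. The sketch below uses the clip counts exactly as listed; the variable names are illustrative.

```python
# Number of 10-second clips per class, copied from the table above.
clips = {
    "Brew coffee": 245,
    "Cooking": 132,
    "Use microwave oven": 42,
    "No activity": 16,
    "Taking a shower": 428,
    "Washing dishes": 134,
    "Washing hands": 70,
    "Brushing teeth": 92,
}

total_clips = sum(clips.values())            # 1159 clips
total_hours = total_clips * 10 / 3600        # each clip lasts 10 seconds
largest = max(clips, key=clips.get)          # "Taking a shower"
smallest = min(clips, key=clips.get)         # "No activity"
imbalance_ratio = clips[largest] / clips[smallest]  # 428 / 16 = 26.75

for name, count in sorted(clips.items(), key=lambda kv: kv[1]):
    print(f"{name:20s} {count:4d} clips  {100 * count / total_clips:5.1f}% of data")
print(f"imbalance ratio ({largest} / {smallest}): {imbalance_ratio}")
```

The largest class has roughly 27 times as many clips as the smallest, which is the kind of imbalance the augmentation approach is meant to address.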
The confusion matrix (in %) with respect to the class-specific and universal HMMs with and without the proposed TL data augmentation framework. The presentation format is the following: cHMM/cHMM-TL/uHMM/uHMM-TL. Average values over 50 iterations are shown. The achieved average recognition rates are 83.1%/89.5%/88.5%/94.6%. The highest rates are emboldened.
| Presented \ Responded | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | 90.3/92/90.1/ | -/-/-/- | -/-/-/- | 3.3/2.4/3/- | 6.4/5.6/6.9/4.3 | -/-/-/- | -/-/-/- | -/-/-/- |
| | -/-/-/- | 88.5/93.3/91/ | 11/6.7/8.1/5.7 | -/-/-/- | -/-/-/- | -/-/-/- | -/-/-/- | 0.5/0/0.9/- |
| | 1.9/-/-/- | 14.4/12.3/12.7/7.1 | 76.9/85.2/84.8/ | -/-/-/- | -/-/-/- | -/-/-/- | -/-/-/- | 6.8/2.5/2.5/- |
| | -/-/-/- | -/-/-/- | -/-/-/- | 91.7/93/92.2/ | -/-/-/- | -/-/-/- | 5.9/4.9/5.7/2.2 | 2.4/2.1/2.1/- |
| | 12.4/12.1/12.1/9.6 | -/-/-/- | -/-/-/- | 3.7/3.6/3.6/- | 83.9/84.3/84.3/ | -/-/-/- | -/-/-/- | -/-/-/- |
| | -/-/-/- | -/-/-/- | -/-/-/- | 3.6/-/-/- | -/-/-/- | 78.9/92.5/92.5/ | -/-/-/- | 17.5/7.5/7.5/4.4 |
| | -/-/-/- | -/-/-/- | -/-/-/- | 14.9/10.8/11.4/4.8 | -/-/-/- | -/-/-/- | 80.6/87.9/87/ | 4.5/1.3/1.6/- |
| | 6.7/-/-/- | -/-/-/- | -/-/-/- | -/-/-/- | -/-/-/- | 19.3/12.6/13.7/4.9 | -/-/-/- | 74/87.4/86.3/ |
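The reported average recognition rates can be cross-checked against the diagonal of the confusion matrix. The sketch below assumes the averages are unweighted means of the per-class diagonal rates, which is an assumption rather than something the caption states. The uHMM-TL diagonal entries are not present in the table as given here, so that average is left uncomputed. The helper name `parse_cell` is hypothetical.

```python
# Diagonal cells of the confusion matrix, copied as they appear in the table.
# Cell format: cHMM/cHMM-TL/uHMM/uHMM-TL (the fourth value is absent).
diagonal = [
    "90.3/92/90.1/", "88.5/93.3/91/", "76.9/85.2/84.8/", "91.7/93/92.2/",
    "83.9/84.3/84.3/", "78.9/92.5/92.5/", "80.6/87.9/87/", "74/87.4/86.3/",
]
schemes = ["cHMM", "cHMM-TL", "uHMM", "uHMM-TL"]


def parse_cell(cell):
    """Split an 'a/b/c/d' cell; empty or '-' entries become None."""
    return [float(v) if v not in ("", "-") else None for v in cell.split("/")]


rows = [parse_cell(cell) for cell in diagonal]

averages = {}
for i, name in enumerate(schemes):
    values = [row[i] for row in rows]
    # Skip any scheme with a missing diagonal entry instead of guessing it.
    averages[name] = None if None in values else sum(values) / len(values)

for name, value in averages.items():
    print(name, "n/a" if value is None else f"{value:.2f}")
```

Under that assumption, the computed means for cHMM, cHMM-TL, and uHMM agree with the reported 83.1%, 89.5%, and 88.5% to within rounding.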