Simone Hantke, Felix Weninger, Richard Kurle, Fabien Ringeval, Anton Batliner, Amr El-Desoky Mousa, Björn Schuller.
Abstract
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database, which is made publicly available for research purposes. It features 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing low-level acoustic features and by using higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with a determination coefficient of up to 56.2%.
Year: 2016 PMID: 27176486 PMCID: PMC4866718 DOI: 10.1371/journal.pone.0154486
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
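The leave-one-speaker-out protocol named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `leave_one_speaker_out` and the toy speaker IDs are hypothetical, and the SVM training step itself is left out. The point is only that all utterances of one speaker are held out together, so no speaker appears in both the training and the test partition of a fold.

```python
from collections import defaultdict


def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)
    for spk in sorted(by_speaker):
        test = by_speaker[spk]
        # Every utterance from any other speaker goes into the training partition.
        train = [i for i, s in enumerate(speaker_ids) if s != spk]
        yield spk, train, test


# Hypothetical toy data: one speaker ID per utterance.
speakers = ["s01", "s01", "s02", "s02", "s03"]
folds = list(leave_one_speaker_out(speakers))
```

With 30 subjects, such a scheme would produce 30 folds; a classifier would be trained on `train` and scored on `test` in each fold.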
Technical setup for the recording of the audio-visual streams.
| Stream | Setting | Value |
|---|---|---|
| Audio | Tools | AKG HC 577 L headset; MAudio Fast Track C400 |
| Audio | Sample rate | 44.1 kHz mono |
| Audio | Bit depth | 24 bits per sample |
| Video | Tool | Logitech HD Pro Webcam C920 |
| Video | Codec | MJPEG |
| Video | Framerate | 30 fps |
| Video | Resolution | 1280 × 720 pixels |
Number of subjects having special health issues.
| Health issue | Never | Rarely | Sometimes | Regularly | Often |
|---|---|---|---|---|---|
| Speech impediment | 26 | 2 | 2 | 0 | 0 |
| Vocal tract dysfunction | 29 | 0 | 1 | 0 | 0 |
| Difficulties in swallowing | 28 | 2 | 0 | 0 | 0 |
| Heart and circulatory problems | 28 | 2 | 0 | 0 | 0 |
| Toothaches | 27 | 2 | 1 | 0 | 0 |
| Being vegetarian | 28 | 2 | 0 | 2 | 0 |
| Smoking | 13 | 2 | 8 | 4 | 3 |
| Alcohol consumption | 1 | 5 | 14 | 3 | 7 |
1 The subjects have masticatory disturbance and have difficulty chewing and swallowing when ill. Under normal health circumstances, as during the recordings, no subject had any speech impediment, vocal tract dysfunction or masticatory disturbance.
2 The subjects eat fish but no meat.
3 The subjects prefer vegetarian food whenever possible.
Chosen food classes and amount of food served to the subjects while recording each utterance.
| Food | ID | Weight [g] |
|---|---|---|
| Apple | Ap | 11–15 |
| Nectarine | Ne | 17–20 |
| Banana | Ba | 22–26 |
| Haribo Smurfs | Ha | 5 |
| Biscuit | Bi | 5–6 |
| Crisps | Cr | 3–4 |
1 Haribo Smurf is a specific type of jelly gum.
Self-reporting on likability and difficulty of eating of food classes rated by all subjects.
| Food | Likability | Difficulty |
|---|---|---|
| Apple | .73 (.20) | .63 (.24) |
| Nectarine | .76 (.21) | .47 (.23) |
| Banana | .77 (.20) | .43 (.25) |
| Haribo Smurfs | .67 (.29) | .45 (.26) |
| Biscuit | .56 (.25) | .67 (.26) |
| Crisps | .68 (.28) | .54 (.26) |
Likability is rated on a [0, 1] scale (dislike extremely: 0, like extremely: 1). Difficulty of eating is converted from a 5-point Likert scale to a [0, 1] scale (very easy: 0, very difficult: 1) to ease comparison with likability. Entries are given as mean value (standard deviation).
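The conversion of the difficulty ratings to a [0, 1] scale can be illustrated with a short sketch. It assumes a linear mapping in which Likert point 1 becomes 0 and point 5 becomes 1; the source only states that the ratings were converted, so this mapping, the function name `likert_to_unit`, and the sample ratings are assumptions made for illustration.

```python
from statistics import mean, stdev


def likert_to_unit(rating, points=5):
    """Map a Likert rating in 1..points linearly onto [0, 1] (assumed mapping)."""
    if not 1 <= rating <= points:
        raise ValueError(f"rating must lie in 1..{points}, got {rating}")
    return (rating - 1) / (points - 1)


# Hypothetical ratings, not taken from the database.
ratings = [1, 3, 5, 4]
scaled = [likert_to_unit(r) for r in ratings]

# Mean and (sample) standard deviation, the two numbers reported per food class.
# Whether the table uses the sample or the population deviation is not stated.
summary = (mean(scaled), stdev(scaled))
```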
Statistics of the iHEARu-EAT database.
| Class | Read: # | Read: duration | Spontaneous: # | Spontaneous: duration |
|---|---|---|---|---|
| Apple | 196 | 24:47 | 28 | 4:00 |
| Nectarine | 196 | 25:00 | 28 | 3:37 |
| Banana | 210 | 25:21 | 30 | 3:41 |
| Haribo Smurfs | 189 | 22:57 | 27 | 3:38 |
| Biscuit | 203 | 25:47 | 29 | 4:08 |
| Crisps | 210 | 25:58 | 30 | 3:44 |
| No Food | 210 | 23:03 | 30 | 4:13 |
| Total | 1414 | 2:53:01 | 202 | 27:05 |
Number (#) and duration of speech utterances per class for read and spontaneous conditions. The slight difference in the number of utterances per class is due to the fact that some subjects chose not to eat all types of food.
Fig 1. Exemplary subjects of the iHEARu-EAT database while recording an utterance without eating food (left), eating a banana (middle) and eating crisps (right).
Unusual configurations of the supra-glottal part of the vocal tract are clearly visible for the eating conditions.
ASR WERs [%] using 7-way acoustic model training on the iHEARu-EAT dataset.
| 24.54 | 31.88 | 28.44 | 6.42 | ||||
| 21.79 | 24.31 | 25.46 | 35.78 | 34.63 | 34.40 | 9.17 | |
| 17.66 | 23.17 | 33.94 | 34.17 | 28.44 | 8.26 | ||
| 20.41 | 26.38 | 27.75 | 32.11 | 31.88 | 30.28 | 11.01 | |
| 20.18 | 25.46 | 32.80 | 30.96 | 25.46 | 11.01 | ||
| 18.81 | 25.23 | 26.83 | 31.65 | 28.21 | 8.72 | ||
| 34.17 | 47.02 | 52.29 | 55.05 | 61.93 | 54.36 | ||
| 10.32 | 16.74 | 17.20 | 21.79 | 16.51 | 13.99 | 5.28 |
ComParE acoustic feature set: 65 low-level descriptors (LLD).
| Low-level descriptor | Group |
|---|---|
| Sum of auditory spectrum (loudness) | prosodic |
| Sum of RASTA-filtered auditory spectrum | prosodic |
| RMS Energy, Zero-Crossing Rate | prosodic |
| RASTA-filt. aud. spect. bands. 1–26 (0–8 kHz) | spectral |
| MFCC 1–14 | cepstral |
| Spectral energy 250–650 Hz, 1 k–4 kHz | spectral |
| Spectral Roll-Off Pt. 0.25, 0.5, 0.75, 0.9 | spectral |
| Spectral Flux, Centroid, Entropy, Slope | spectral |
| Psychoacoustic Sharpness, Harmonicity | spectral |
| Spectral Variance, Skewness, Kurtosis | spectral |
| prosodic | |
| Prob. of voicing | voice qual. |
| log. HNR, Jitter (local & | voice qual. |
ComParE acoustic feature set: Functionals applied to LLD contours (Table 7).
| Functional | Group |
|---|---|
| quartiles 1–3, 3 inter-quartile ranges | percentiles |
| 1% percentile (≈ min), 99% pctl. (≈ max) | percentiles |
| percentile range 1%–99% | percentiles |
| position of min / max, range (max − min) | temporal |
| arithmetic mean | moments |
| contour centroid, flatness | temporal |
| standard deviation, skewness, kurtosis | moments |
| relative duration LLD is rising | temporal |
| rel. dur. LLD is above 25 / 50 / 75 / 90% range | temporal |
| rel. duration LLD has positive curvature | temporal |
| gain of linear prediction (LP), LP Coeff. 1–5 | modulation |
| mean, max, min, std. dev. of segment length | temporal |
| mean value of peaks | peaks |
| mean value of peaks − arithmetic mean | peaks |
| mean / std.dev. of inter peak distances | peaks |
| amplitude mean of peaks, of minima | peaks |
| amplitude range of peaks | peaks |
| mean / std. dev. of rising / falling slopes | peaks |
| linear regression slope, offset, quadratic error | regression |
| quadratic regression a, b, offset, quadratic err. | regression |
| percentage of non-zero frames | temporal |
1: arithmetic mean of LLD / positive Δ LLD.
2: not applied to voicing related LLD except F0.
3: only applied to F0.
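To make the idea of functionals concrete, the sketch below summarises one frame-level LLD contour with a handful of the statistics listed in the table: quartiles, range, moments, a linear regression slope and offset, and the relative duration the contour is rising. It is an illustration only. The function name `functionals` and the toy contour are hypothetical, and the exact definitions used by the feature extractor (for example the quartile interpolation rule) may differ from the ones assumed here.

```python
from statistics import mean, pstdev, quantiles


def functionals(contour):
    """Map a variable-length contour to a fixed set of summary statistics."""
    n = len(contour)
    # Assumption: quartiles with inclusive linear interpolation.
    q1, q2, q3 = quantiles(contour, n=4, method="inclusive")
    # Least-squares line x ~ slope * t + offset over frame index t.
    t_mean = (n - 1) / 2
    x_mean = mean(contour)
    slope = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(contour)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    offset = x_mean - slope * t_mean
    # Fraction of frame-to-frame steps in which the contour rises.
    rising = sum(b > a for a, b in zip(contour, contour[1:])) / (n - 1)
    return {
        "quartile1": q1, "quartile2": q2, "quartile3": q3,
        "iqr_1_3": q3 - q1,
        "range": max(contour) - min(contour),
        "pos_max": contour.index(max(contour)),
        "mean": x_mean,
        "stddev": pstdev(contour),
        "lin_slope": slope, "lin_offset": offset,
        "rel_dur_rising": rising,
    }


# Hypothetical contour, e.g. a loudness value per frame.
feats = functionals([1.0, 2.0, 3.0, 4.0, 5.0])
```

Applying every functional to every LLD (and its delta) is what turns an utterance of arbitrary length into one fixed-length feature vector suitable for an SVM.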
Binary classification of eating condition.
| Feature | Spontaneous UAR [%] | Read UAR [%] | All UAR [%] |
|---|---|---|---|
| WER | – | 85.3 | – |
| CER | – | 90.7 | – |
| RTF | 86.9 | 91.8 | 91.8 |
| LL | 68.1 | 83.3 | 79.9 |
| WER+CER | – | 90.3 | – |
| RTF+LL | 93.9 | ||
| ALL | – | – | |
Binary classification of eating condition (Food / No Food) using ASR-related features per type of speech (spontaneous and read as well as both): UAR using SVMs. WER and CER require knowledge of the reference text whereas RTF and LL do not require a priori knowledge. Chance level UAR: 50.0%. WER: word error rate, CER: character error rate, RTF: real-time factor, LL: log-likelihood.
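The ASR-related features named in this caption can be sketched as below, assuming the standard definitions: WER and CER as Levenshtein edit distance between reference and hypothesis, normalised by reference length, and RTF as decoding time divided by audio duration. The function names and the example sentence are hypothetical, and the recogniser that would produce the hypothesis and the log-likelihood is not modelled.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent; needs the reference transcript."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent; needs the reference transcript."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def rtf(decoding_seconds, audio_seconds):
    """Real-time factor: processing time relative to audio duration."""
    return decoding_seconds / audio_seconds


# Hypothetical example: the recogniser drops one word.
example_wer = wer("the soup was warmed up", "the soup warmed up")
example_cer = cer("the soup was warmed up", "the soup warmed up")
```

This also shows why WER and CER are only available for read speech in the table: both need a known reference text, whereas RTF and LL come from the decoder alone.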
2-way and 7-way classification of eating condition.
| UAR [%] | Spontaneous | Read | All | Chance |
|---|---|---|---|---|
| ASR-related | ||||
| 2-way | 91.1 | 94.9 | 92.5 | 50.0 |
| 7-way | 28.1 | 31.2 | 30.0 | 14.3 |
| 2-way | 98.0 | 50.0 | ||
| 7-way | 65.6 | 14.3 | ||
| ASR-related + | ||||
| 2-way | 89.6 | 96.9 | 50.0 | |
| 7-way | 57.2 | 65.0 | 14.3 | |
2-way (Food / No Food) and 7-way classification of eating condition (6 types of food / No Food) using either ASR-related features, low-level acoustic features (ComParE set) or their combination, per type of speech (spontaneous and read as well as both): UAR using SVMs.
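UAR (unweighted average recall) is the mean of the per-class recalls, so each class counts equally regardless of how many utterances it has, and the chance level for K classes is 100/K percent. The sketch below computes both from a confusion matrix of counts; the function names and the toy matrix are hypothetical and not taken from the paper.

```python
def unweighted_average_recall(confusion):
    """UAR in percent; confusion[i][j] counts class-i items predicted as class j."""
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return 100.0 * sum(recalls) / len(recalls)


def chance_level(num_classes):
    """UAR in percent expected from guessing, for any class distribution."""
    return 100.0 / num_classes


# Hypothetical, imbalanced 2-class example: 100 items of class 0, 10 of class 1.
toy = [[90, 10],
       [5, 5]]
toy_uar = unweighted_average_recall(toy)
toy_accuracy = 100.0 * (90 + 5) / 110
```

In the toy case plain accuracy is about 86% while UAR is 70%, because the poorly recognised minority class weighs as much as the majority class. `chance_level(2)` and `chance_level(7)` reproduce the 50.0% and 14.3% chance values in the table.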
Confusion matrix obtained by SVMs on the ComParE feature set in the 7-way classification of eating condition for both read and spontaneous speech production.
| 24.0 | 4.1 | 5.1 | 5.1 | 5.1 | 1.5 | ||
| 21.9 | 14.8 | 11.2 | 3.6 | 3.1 | 0.5 | ||
| 5.7 | 18.1 | 22.4 | 1.9 | 0.5 | 2.9 | ||
| 6.9 | 11.1 | 7.9 | 0.5 | 2.1 | 1.6 | ||
| 7.9 | 6.4 | 1.0 | 1.0 | 10.8 | 0.5 | ||
| 6.7 | 3.8 | 1.0 | 3.3 | 12.4 | 0.5 | ||
| 0.5 | 1.4 | 3.8 | 1.0 | 0.5 | 0.5 |
Fig 2. Solutions of non-metric multidimensional scaling (NMDS) applied to class confusions (2-D (top), 1-D (bottom left)) or Euclidean class center distances (1-D (bottom right)) in the 7-way task, ComParE low-level acoustic features.
Fig 3. Degree of high-frequency noise of the words ‘warmed up’ caused by eating.
Subjects (left: female, right: male) while recording an utterance eating a banana (top), without eating any food (middle), and eating crisps (bottom).
Regression-based recognition of eating condition.
| [%] | SVR/100 feat. | 1-best feat. | |
|---|---|---|---|
| Label | RAE | ||
| NMDS 1-D (Conf) | 45.5 | 69.7 | 23.0 |
| NMDS 1-D (Dist) | 66.8 | 31.9 | |
Regression-based recognition of eating condition: determination coefficient (R²) and relative absolute error (RAE) obtained by SVR on the ComParE feature set, and best single feature baseline. Numeric labels obtained from 1-D NMDS solutions (‘Conf’usions or ‘Dist’ance of class centers).
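The two regression measures can be sketched as follows. The source does not spell out its formulas, so the definitions below are assumptions: R² is taken as one minus the ratio of residual to total sum of squares (some toolkits report the squared correlation coefficient instead), and RAE as the summed absolute error relative to that of always predicting the mean label. Function names and data are hypothetical.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Determination coefficient in percent (assumed: 1 - SS_res / SS_tot)."""
    y_bar = mean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 100.0 * (1.0 - ss_res / ss_tot)


def relative_absolute_error(y_true, y_pred):
    """RAE in percent: absolute error relative to a predict-the-mean baseline."""
    y_bar = mean(y_true)
    abs_err = sum(abs(y - p) for y, p in zip(y_true, y_pred))
    baseline = sum(abs(y - y_bar) for y in y_true)
    return 100.0 * abs_err / baseline


# Hypothetical ordinal labels and predictions.
labels = [0.0, 1.0, 2.0, 3.0]
preds = [0.5, 1.0, 2.0, 2.5]
toy_r2 = r_squared(labels, preds)
toy_rae = relative_absolute_error(labels, preds)
```

Higher R² and lower RAE both indicate a better fit; an RAE of 100% would mean the regressor does no better than predicting the mean label.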