Tino Haderlein, Cornelia Schwemmle, Michael Döllinger, Václav Matoušek, Martin Ptok, Elmar Nöth.
Abstract
Due to low intra- and interrater reliability, perceptual voice evaluation should be supported by objective, automatic methods. In this study, text-based, computer-aided prosodic analysis and measurements of connected speech were combined in order to model perceptual evaluation of the German Roughness-Breathiness-Hoarseness (RBH) scheme. 58 connected speech samples (43 women and 15 men; 48.7 ± 17.8 years) containing the German version of the text "The North Wind and the Sun" were evaluated perceptually by 19 speech and voice therapy students according to the RBH scale. For the human-machine correlation, Support Vector Regression with measurements of the vocal fold cycle irregularities (CFx) and the closed phases of vocal fold vibration (CQx) of the Laryngograph and 33 features from a prosodic analysis module were used to model the listeners' ratings. The best human-machine results for roughness were obtained from a combination of six prosodic features and CFx (r = 0.71, ρ = 0.57). These correlations were approximately the same as the interrater agreement among human raters (r = 0.65, ρ = 0.61). CQx was one of the substantial features of the hoarseness model. For hoarseness and breathiness, the human-machine agreement was substantially lower. Nevertheless, the automatic analysis method can serve as the basis for a meaningful objective support for perceptual analysis.
Year: 2015 PMID: 26136813 PMCID: PMC4468283 DOI: 10.1155/2015/316325
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1: Age distribution of the speaker group (n = 58).
Diagnoses within the speaker group (n = 58).
| Diagnosis | n |
|---|---|
| Edema | |
| Reinke's edema (bilateral) | 3 |
| Edge edema | 1 |
| Pareses | |
| Vocal fold paresis (right) | 8 |
| Vocal fold paresis (left) | 3 |
| Vocal fold paresis (bilateral) | 2 |
| Benign tumors, pseudotumors | |
| Hyperplasia vocal fold (right) | 1 |
| Vocal fold polyp (right) | 4 |
| Vocal fold polyp (left) | 1 |
| Vocal fold cyst (right) | 1 |
| Vocal fold nodules | 3 |
| Vocal fold granuloma | 1 |
| Larynx papillomatosis | 1 |
| Inflammations | |
| Laryngitis | 3 |
| Central movement disorders | |
| Spasmodic dysphonia | 3 |
| Balbuties | 1 |
| Other central disorders | 1 |
| Functional dysphonia | |
| Psychogenic dysphonia | 1 |
| Dysphagia | 16 |
| Normal laryngeal findings | 4 |
Prosodic features and their intervals of computation; 33 prosodic features are based upon duration (“Dur”), energy (“En”), and fundamental frequency (“F0”) measures. The context size denotes the interval of words on which the features are computed; W: computed on current word, WPW: computed in the interval that contains the second and first word before the current word, and the pause between them.
| Features | Context WPW | Context W |
|---|---|---|
| Pause: before, Fill-before, after, Fill-after | | • |
| En: RegCoeff, MseReg, Abs, Norm, Mean | • | • |
| En: Max, MaxPos | | • |
| Dur: Abs, Norm | • | • |
| F0: RegCoeff, MseReg | • | • |
| F0: Mean, Max, MaxPos, Min, MinPos, Off, OffPos, On, OnPos | | • |
The features are abbreviated as follows.
Length of pauses “Pause”: length of silent pause before (before) and after (after), and filled pause before (Fill-before) and after (Fill-after) the respective word in context.
Energy features “En”: regression coefficient (RegCoeff) and mean square error (MseReg) of the energy curve with respect to the regression curve; mean (Mean) and maximum energy (Max) with its position on the time axis (MaxPos); absolute (Abs) and normalized (Norm) energy values.
Duration features “Dur”: absolute (Abs) and normalized (Norm) duration.
F0 features “F0”: regression coefficient (RegCoeff) and the mean square error (MseReg) of the F0 curve with respect to its regression curve; mean (Mean), maximum (Max), minimum (Min), voice onset (On), and offset (Off) values as well as the position of Max (MaxPos), Min (MinPos), On (OnPos), and Off (OffPos) on the time axis; all F0 values are normalized.
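The RegCoeff and MseReg features described above can be illustrated with an ordinary least-squares line fitted to a contour. The following is a minimal sketch, not the prosody module's actual implementation: it assumes equally spaced frames, so the slope is expressed per frame, and the function names `contour_regression` and `max_with_position` are hypothetical.

```python
def contour_regression(values):
    """Fit a least-squares line to equally spaced samples (assumption).

    Returns (slope, mse): the regression coefficient per frame and the
    mean square error of the contour with respect to the fitted line.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    mse = sum((y - (intercept + slope * x)) ** 2
              for x, y in enumerate(values)) / n
    return slope, mse


def max_with_position(values):
    """Return the maximum of a contour and the frame index where it occurs."""
    peak = max(values)
    return peak, values.index(peak)


# Hypothetical F0 contour of one word, in Hz, one value per frame.
f0 = [100.0, 102.0, 104.0, 106.0]
slope, mse = contour_regression(f0)
peak, peak_frame = max_with_position(f0)
```

For this perfectly linear contour the slope is 2 Hz per frame and the mean square error is zero; an irregular contour yields a larger error around the same kind of fitted line.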
Perceptual evaluation results (average, standard deviation, and minimal and maximal values) and interrater agreement expressed as Krippendorff's α and the correlation coefficients r and ρ (n = 58).
| Criterion | Average | Standard dev. | Min | Max | α | r | ρ |
|---|---|---|---|---|---|---|---|
| R | 0.88 | 0.51 | 0.05 | 2.21 | 0.45 | 0.65 | 0.61 |
| B | 0.59 | 0.47 | 0.00 | 2.16 | 0.33 | 0.58 | 0.50 |
| H | 0.81 | 0.56 | 0.05 | 1.89 | 0.36 | 0.59 | 0.57 |
Figure 2: Perceptual roughness (R) evaluation by 19 listeners (mean value and standard deviation).
Figure 4: Perceptual hoarseness (H) evaluation by 19 listeners (mean value and standard deviation).
Correlation r (ρ) between the perceptual ratings (n = 58).
| | B | H |
|---|---|---|
| R | 0.13 (0.33) | 0.50 (0.53) |
| B | | 0.53 (0.67) |
∗ = correlation is significant at the 0.01 level.
Best feature sets for human-machine correlation and their weights in the regression formulae.
| Feature | Context | Rbest,I | Rbest,I | Rbest,II | Rbest,II | Bbest | Hbest,I | Hbest,II | Hbest,III | Hbest,IV |
|---|---|---|---|---|---|---|---|---|---|---|
| DurNorm | WPW | −0.057 | −0.046 | 0.377 | 0.499 | 0.378 | ||||
| DurNorm | W | 0.513 | 0.402 | |||||||
| F0Min | W | −0.446 | −0.458 | −0.452 | −0.389 | |||||
| F0Mean | W | −0.195 | −0.226 | −0.191 | −0.172 | |||||
| F0Onset | W | 0.173 | ||||||||
| F0OffPos | W | 0.322 | 0.120 | 0.185 | 0.236 | |||||
| EnNorm | WPW | −0.151 | 0.343 | |||||||
| EnNorm | W | −0.247 | −0.315 | 0.155 | ||||||
| MeanJitter | 15 W | 0.118 | 0.186 | 0.113 | 0.249 | 0.239 | 0.366 | 0.368 | 0.320 | 0.208 |
| MeanShimmer | 15 W | 0.144 | 0.138 | 0.145 | 0.114 | −0.031 | ||||
| StandDevShimmer | 15 W | −0.163 | ||||||||
| #+Voiced | 15 W | 0.321 | 0.347 | 0.334 | 0.324 | 0.094 | −0.133 | −0.117 | 0.122 | |
| RelNum+/−Voiced | 15 W | −0.164 | 0.218 | 0.082 | −0.144 | |||||
| CFx | 0.210 | 0.206 | ||||||||
| CQx | 0.643 | 0.495 | −0.242 | 0.506 | ||||||
| r | | 0.71 | 0.66 | 0.71 | 0.67 | 0.36 | 0.53 | 0.47 | 0.45 | 0.49 |
| ρ | | 0.57 | 0.49 | 0.58 | 0.49 | 0.27 | 0.54 | 0.46 | 0.45 | 0.55 |
| Significance level | | <0.001 | <0.001 | <0.001 | <0.001 | 0.003 | <0.001 | <0.001 | <0.001 | <0.001 |
Contexts: W: word, WPW: word-pause-word, 15 W: 15 words (“global” feature). The correlations of the respective set to the human reference are given by r (Pearson) and ρ (Spearman).
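The two agreement measures reported here differ in what they compare: Pearson's r uses the values themselves, while Spearman's ρ applies the same formula to their ranks. The sketch below shows both under the usual textbook definitions, with tied values receiving average ranks; the function names and the example data are illustrative and not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical mean listener ratings and model outputs for four speakers.
human = [0.2, 0.9, 1.4, 2.1]
machine = [0.5, 0.8, 1.1, 3.0]
r, rho = pearson(human, machine), spearman(human, machine)
```

Because the hypothetical model output preserves the speaker ordering exactly, ρ is 1 here while r is lower, which illustrates why the two coefficients in the table can diverge.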
Figure 5: Perceptual roughness (R) evaluation by 19 listeners, the SVR regression values (Rbest,I), and their best-fit line.
Figure 7: Perceptual hoarseness (H) evaluation by 19 listeners, the SVR regression values (Hbest,I), and their best-fit line.
Weighting factors in the regression sums when the RBH rating is modeled by CFx and CQx only and the human-machine correlation (r, ρ).
| Feature | R | B | H |
|---|---|---|---|
| CFx | 0.303 | 0.091 | 0.340 |
| CQx | 0.033 | 0.117 | 0.490 |
| r | 0.31 | −0.10 | 0.44 |
| ρ | 0.43 | −0.05 | 0.48 |
| Significance level | 0.009 | 0.228 | <0.001 |
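The weighting factors in this table can be read as coefficients of a weighted sum over the two Laryngograph measures. The sketch below assumes a purely linear model on already normalized inputs. The source states neither the normalization nor the offset term, so the `bias` parameter, the input values, and the function name `linear_score` are hypothetical; only the two weights are taken from the first weight column of the table.

```python
def linear_score(features, weights, bias=0.0):
    """Weighted sum of named features plus an offset (assumed linear model)."""
    missing = set(weights) - set(features)
    if missing:
        raise KeyError(f"missing features: {sorted(missing)}")
    return bias + sum(weights[name] * features[name] for name in weights)


# Weights from the first weight column of the table (CFx and CQx only).
weights_first_column = {"CFx": 0.303, "CQx": 0.033}

# Hypothetical, already normalized measurements for one speaker.
speaker = {"CFx": 1.0, "CQx": -0.5}

score = linear_score(speaker, weights_first_column)  # bias unknown, left at 0
```

With these weights, a change in CFx moves the score roughly nine times as much as an equal change in CQx, which is how the relative size of the factors can be interpreted.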
Figure 8: Perceptual roughness (R) evaluation by 19 listeners versus CFx and CQx, respectively.
Figure 10: Perceptual hoarseness (H) evaluation by 19 listeners versus CFx and CQx, respectively.
Pairwise correlations among the prosodic and Laryngograph measures that appeared in the best models of the human ratings.
| Feature | Context | DurNorm (WPW) | DurNorm (W) | F0Min (W) | F0Mean (W) | F0Onset (W) | F0OffPos (W) | EnNorm (WPW) | EnNorm (W) | MeanJitter (15 W) | MeanShimmer (15 W) | StandDevShimmer (15 W) | #+Voiced (15 W) | RelNum+/−Voiced (15 W) | CFx | CQx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DurNorm | WPW | | 0.02 | −0.17 | −0.04 | 0.01 | −0.23 | | 0.03 | 0.07 | −0.03 | −0.12 | 0.01 | −0.05 | 0.20 | 0.04 |
| DurNorm | W | 0.10 | | | −0.24 | −0.19 | | 0.01 | | 0.22 | 0.00 | −0.03 | 0.08 | 0.13 | 0.13 | 0.05 |
| F0Min | W | 0.02 | | | | | | −0.19 | −0.11 | | | −0.14 | | | | −0.11 |
| F0Mean | W | −0.09 | | | | | | −0.02 | −0.09 | −0.07 | | 0.02 | −0.17 | −0.13 | −0.03 | 0.12 |
| F0Onset | W | 0.00 | −0.23 | | | | | 0.05 | −0.06 | −0.01 | −0.12 | 0.07 | −0.10 | −0.10 | 0.02 | 0.06 |
| F0OffPos | W | −0.13 | | | 0.20 | | | −0.18 | −0.23 | −0.04 | | 0.08 | | | −0.01 | 0.01 |
| EnNorm | WPW | | 0.07 | 0.02 | −0.06 | 0.07 | −0.11 | | 0.06 | 0.14 | 0.02 | −0.10 | 0.08 | 0.00 | 0.14 | 0.00 |
| EnNorm | W | 0.19 | | −0.11 | −0.13 | −0.03 | −0.22 | 0.24 | | 0.15 | −0.14 | −0.11 | −0.02 | −0.02 | 0.07 | 0.02 |
| MeanJitter | 15 W | −0.11 | 0.18 | | −0.11 | 0.00 | 0.08 | −0.04 | 0.03 | | | | | | | 0.23 |
| MeanShimmer | 15 W | −0.18 | −0.02 | | | −0.13 | −0.23 | −0.08 | −0.20 | 0.21 | | | | | 0.15 | −0.04 |
| StandDevShimmer | 15 W | | −0.10 | −0.10 | −0.03 | −0.02 | −0.03 | −0.21 | −0.16 | 0.17 | | | | | 0.15 | 0.04 |
| #+Voiced | 15 W | −0.13 | 0.13 | | | −0.21 | | −0.06 | 0.00 | | | | | | | 0.15 |
| RelNum+/−Voiced | 15 W | −0.14 | 0.16 | | −0.24 | −0.17 | | −0.09 | −0.01 | | | | | | 0.20 | 0.07 |
| CFx | | 0.08 | 0.24 | | −0.07 | −0.02 | −0.18 | 0.05 | 0.10 | | 0.12 | 0.00 | | | | |
| CQx | | −0.07 | 0.02 | −0.04 | 0.18 | 0.07 | −0.01 | −0.12 | −0.04 | 0.15 | −0.03 | −0.04 | 0.11 | 0.08 | | |
Upper right triangle: Pearson's r; lower left triangle: Spearman's ρ.
Contexts: W: word, WPW: word-pause-word, 15 W: 15 words (“global” feature).
All r and ρ correlations with an absolute value larger than 0.25 (0.33) are significant at the 0.05 (0.01) level.
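The significance thresholds quoted above are consistent with the standard t-test for a correlation coefficient at n = 58. The source does not say which test was used, so the following is a sketch under that assumption; the two critical t values are approximate table values for 56 degrees of freedom (two-sided), and the function names are hypothetical.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic of a correlation r from n pairs (n - 2 degrees of freedom)."""
    return r * sqrt((n - 2) / (1 - r * r))


def critical_r(t_crit, n):
    """Smallest |r| that reaches the critical t value for n pairs."""
    return t_crit / sqrt(t_crit ** 2 + (n - 2))


N_SPEAKERS = 58
# Approximate two-sided critical values of Student's t with 56 degrees
# of freedom, taken from standard tables (assumption, not from the source).
T_CRIT_05 = 2.003
T_CRIT_01 = 2.667

r_05 = critical_r(T_CRIT_05, N_SPEAKERS)  # about 0.26
r_01 = critical_r(T_CRIT_01, N_SPEAKERS)  # about 0.34
```

Under these assumptions a correlation must exceed roughly 0.26 in absolute value to be significant at the 0.05 level and roughly 0.34 at the 0.01 level, in line with the thresholds stated in the table note.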