Jennifer M Vojtech, Dante D Cilento, Austin T Luong, Jacob P Noordzij, Manuel Diaz-Cadiz, Matti D Groll, Daniel P Buckley, Victoria S McKenna, J Pieter Noordzij, Cara E Stepp.
Abstract
Methods for automating relative fundamental frequency (RFF), an acoustic estimate of laryngeal tension, rely on manual identification of voiced/unvoiced boundaries from acoustic signals. This study determined the effect of incorporating features derived from vocal fold vibratory transitions on acoustic boundary detection. Simultaneous microphone and flexible nasendoscope recordings were collected from adults with typical voices (N=69) and with voices characterized by excessive laryngeal tension (N=53) producing voiced-unvoiced-voiced utterances. Acoustic features that coincided with vocal fold vibratory transitions were identified and incorporated into an automated RFF algorithm ("aRFF-APH"). Voiced/unvoiced boundary detection accuracy was compared among the aRFF-APH algorithm, a recently published version of the automated RFF algorithm ("aRFF-AP"), and gold-standard, manual RFF estimation. Chi-square tests were performed to characterize differences in boundary cycle identification accuracy among the three RFF estimation methods. Voiced/unvoiced boundary detection accuracy significantly differed by RFF estimation method for both voicing offsets and onsets. Of 7721 productions, 76.0% of boundaries were accurately identified via the aRFF-APH algorithm, compared to 70.3% with the aRFF-AP algorithm and 20.4% with manual estimation. Incorporating acoustic features that corresponded with voiced/unvoiced boundaries led to improvements in boundary detection accuracy that surpassed the gold-standard method for calculating RFF.
Keywords: high-speed videoendoscopy; laryngeal tension; relative fundamental frequency; voice assessment
Year: 2021 PMID: 36188437 PMCID: PMC9524108 DOI: 10.3390/app11093816
Source DB: PubMed Journal: Appl Sci (Basel) ISSN: 2076-3417 Impact factor: 2.838
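The record above does not spell out how RFF itself is computed. In the RFF literature, each of the ten vocal cycles flanking a voiced/unvoiced boundary is assigned an instantaneous f0, normalized to a steady-state reference f0, and expressed in semitones. A minimal sketch of that normalization (the function name is illustrative, not from the paper):

```python
import math

def rff_semitones(cycle_f0, steady_f0):
    """RFF of one vocal cycle: its f0 expressed in semitones relative to a
    steady-state reference f0 (12 semitones per octave)."""
    return 12.0 * math.log2(cycle_f0 / steady_f0)

# A cycle at 110 Hz against a 100 Hz steady-state reference is ~1.65 ST.
offset_rff = rff_semitones(110.0, 100.0)
```

A full RFF value for a production is the set of such semitone offsets for the ten offset cycles and ten onset cycles marked in Figure 1.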
Figure 1. Acoustic waveform of the vowel–voiceless consonant–vowel production, /ifi/. Voicing cycles preceding the voiceless consonant, /f/, are marked as voicing offset cycles, whereas those following the /f/ are indicated as voicing onset cycles. The first and tenth vocal cycles are highlighted for each transition.
Overall demographic information for the 122 participants.
| Cohort | Gender: M | Gender: F | Age: Mean | Age: SD | Age: Range | Dysphonia Severity: Mean | Dysphonia Severity: SD | Dysphonia Severity: Range |
|---|---|---|---|---|---|---|---|---|
| Young adults with typical voices | 18 | 17 | 22.8 | 5.5 | 18–31 | 5.4 | 3.8 | 0.6–23.5 |
| Older adults with typical voices | 18 | 16 | 65.6 | 10.8 | 41–91 | 11.4 | 7.7 | 1.7–34.2 |
| Adults with HVD | 6 | 22 | 37.5 | 16.1 | 19–70 | 12.3 | 10.7 | 0.9–38.5 |
| Adults with PD | 18 | 7 | 63.0 | 9.4 | 43–75 | 19.2 | 13.3 | 4.0–51.3 |
HVD = Hyperfunctional voice disorder;
PD = Parkinson’s disease
Figure 2. (a) View of the vocal folds under flexible nasendoscopy, with the glottic angle marked from the anterior commissure to the vocal processes, (b) acoustic signal, (c) raw glottic angle waveform (gray) with smoothed data overlay (black), and (d) filtered quick vibratory profile (QVP). The black dotted line in (b), (c), and (d) indicates the current video frame shown in (a). The time of voicing offset (orange solid line) and time of voicing onset (teal solid line) are indicated in (b), (c), and (d).
Number of participants for which each of five trained technicians manually computed relative fundamental frequency.[1]
| Technician | Total | In Common with T1 | In Common with T2 | In Common with T3 | In Common with T4 |
|---|---|---|---|---|---|
| 1 | 37 | | | | |
| 2 | 82 | 5 | | | |
| 3 | 79 | 18 | 53 | | |
| 4 | 29 | 14 | 13 | 2 | |
| 5 | 17 | 0 | 11 | 6 | 0 |
Total ratings sum to 244 as manual RFF estimation was performed twice for each participant (N=122).
Acoustic measures for classifying voiced and unvoiced speech segments, with abbreviations (Abbr), the signal used to calculate the feature, the feature definition, and the hypothesized trend for voiced (V) versus unvoiced (UV) segments. In the original table, orange-shaded rows marked features already included in the aRFF-AP algorithm.
| Feature Name | Abbr. | Signal(s) | Definition | Hypothesized |
|---|---|---|---|---|
| Autocorrelation | ACO | Raw and Filtered Microphone | ACO is a comparison of a segment of a voice signal to a delayed copy of itself as a function of the delay. | V > UV |
| Mean Cepstral Peak Prominence | CPP | Raw and Filtered Microphone | CPP reflects the distribution of energy at harmonically related frequencies. | V > UV |
| Average Pitch Strength | APS | Pitch Strength Contour | Using Auditory-SWIPE′ to estimate a pitch strength contour, APS is the mean pitch strength within a window. | V > UV |
| Average Voice fo | Afo | fo Contour | Afo is the mean of the fo contour within a window. | V > UV |
| Cross-correlation | XCO | Raw and Filtered Microphone | XCO is a comparison of a segment of a voice signal with a different segment of the signal. | V > UV |
| Low-to-high ratio of spectral energy | LHR | Raw and Filtered Microphone | LHR is calculated by comparing spectral energy below a specified cut-off frequency (here, 4 kHz) to the energy above it. | V > UV |
| Median Pitch Strength | MPS | Pitch Strength Contour | MPS was included as an alternative to APS. | V > UV |
| Median Voice fo | Mfo | fo Contour | Mfo was included as an alternative to Afo. | V > UV |
| Normalized Cross-correlation | NXCO | Raw and Filtered Microphone | NXCO was included as an alternative to XCO, in which the amplitudes of the compared windows are normalized to remove differences in signal amplitude as a factor. | V > UV |
| Normalized Peak-to-peak Amplitude | PTP | Raw and Filtered Microphone | PTP is the range of the amplitude of a windowed voice signal. | V > UV |
| Number of Zero Crossings | NZC | Raw and Filtered Microphone | NZC refers to the number of sign changes of the windowed signal. | V < UV |
| Short-time Energy | STE | Raw and Filtered Microphone | STE is the energy of a short voice segment. | V > UV |
| Short-time Log Energy | SLE | Raw and Filtered Microphone | SLE was included as an alternative to STE, and is calculated as the logarithm of the energy of a short voice segment. | V > UV |
| Short-time Magnitude | STM | Raw and Filtered Microphone | STM is the magnitude of a short voice segment. | V > UV |
| Signal-to-noise Ratio | SNR | Raw and Filtered Microphone | SNR is an estimate of the power of a signal compared to that of a segment of noise. | V > UV |
| Standard Deviation of Cepstral Peak Prominence | SD CPP | Raw and Filtered Microphone | SD CPP is the standard deviation of CPP values within a window, and may capture variations in signal periodicity as a result of aspiration and frication noise in the /f/. | V < UV |
| Standard Deviation of Voice fo | SD fo | fo Contour | SD fo is the standard deviation of the fo contour within a window. | V < UV |
| Waveform Shape Similarity | WSS | Raw and Filtered Microphone | WSS is the normalized sum of square error between the current window of time and the previous window of time. It is calculated relative to a window of time in the voiceless consonant. | V < UV |
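Several of the time-domain measures in the table reduce to simple windowed statistics. A minimal sketch of five of them, assuming NumPy; the function name and scaling choices are illustrative, not the paper's exact implementation:

```python
import numpy as np

def window_features(x):
    """Windowed acoustic features following the table's definitions
    (illustrative implementation; the paper's exact scaling may differ)."""
    x = np.asarray(x, dtype=float)
    ptp = float(x.max() - x.min())              # peak-to-peak amplitude (PTP)
    nzc = int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:])))  # zero crossings (NZC)
    ste = float(np.sum(x ** 2))                 # short-time energy (STE)
    sle = float(np.log(ste + 1e-12))            # short-time log energy (SLE)
    stm = float(np.sum(np.abs(x)))              # short-time magnitude (STM)
    return {"PTP": ptp, "NZC": nzc, "STE": ste, "SLE": sle, "STM": stm}

# A periodic (voiced-like) window vs. a noisy (unvoiced-like) window:
t = np.arange(256) / 8000.0                     # 32 ms at 8 kHz
voiced = np.sin(2 * np.pi * 100.0 * t)
unvoiced = 0.1 * np.random.default_rng(0).standard_normal(256)
# As hypothesized in the table, the voiced window shows higher PTP/STE/STM
# and fewer zero crossings than the unvoiced window.
```

These per-window values are what the logistic regression below combines to classify each window as voiced or unvoiced.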
Figure 3. Normalized feature values calculated from the raw microphone signal or Auditory-SWIPE′ output (teal) with respect to distance (pitch periods) from the true boundary cycle (thin black dotted line) for voicing offset. Normalized feature values calculated from the band-pass filtered microphone signal are overlaid in orange (when applicable). Top row: normalized peak-to-peak amplitude (PTP), short-time magnitude (STM), short-time energy (STE), cross-correlation (XCO), normalized cross-correlation (NXCO), autocorrelation (ACO). Middle row: mean and standard deviation of cepstral peak prominence (CPP, SD CPP), signal-to-noise ratio (SNR), number of zero crossings (NZC), waveform shape similarity (WSS), low-to-high ratio of spectral energy (LHR). Bottom row: log energy (LE), average and median pitch strength (APS, MPS), average, median, and standard deviation of fo (Afo, Mfo, SD fo). Thick solid lines indicate mean values of features that were retained after manual inspection. Thick orange and teal dashed lines indicate mean values of features that were removed through manual inspection. Shaded regions indicate standard deviation.
Figure 4. Normalized feature values calculated from the raw microphone signal or Auditory-SWIPE′ output (teal) with respect to distance (pitch periods) from the true boundary cycle (thin black dotted line) for voicing onset. Normalized feature values calculated from the band-pass filtered microphone signal are overlaid in orange (when applicable). Top row: normalized peak-to-peak amplitude (PTP), short-time magnitude (STM), short-time energy (STE), cross-correlation (XCO), normalized cross-correlation (NXCO), autocorrelation (ACO). Middle row: mean and standard deviation of cepstral peak prominence (CPP, SD CPP), signal-to-noise ratio (SNR), number of zero crossings (NZC), waveform shape similarity (WSS), low-to-high ratio of spectral energy (LHR). Bottom row: log energy (LE), average and median pitch strength (APS, MPS), average, median, and standard deviation of fo (Afo, Mfo, SD fo). Thick solid lines indicate mean values of features that were retained after manual inspection. Thick orange and teal dashed lines indicate mean values of features that were removed through manual inspection. Shaded regions indicate standard deviation.
Summary of significant variables in the stepwise binary logistic regression statistical model.
| Model | Acoustic Feature | Coef | SE Coef | z | p | 95% CI: Lower | 95% CI: Upper | VIF |
|---|---|---|---|---|---|---|---|---|
| Voicing Offset | Constant | 0.10 | 0.07 | 1.48 | .15 | −0.03 | 0.24 | — |
| | Filtered Waveform Shape Similarity | −1.52 | 0.05 | −30.07 | <.001 | −1.62 | −1.42 | 1.30 |
| | Median Voice fo | 1.46 | 0.04 | 34.85 | <.001 | 1.37 | 1.54 | 1.21 |
| | Cepstral Peak Prominence | 1.23 | 0.06 | 20.07 | <.001 | 1.11 | 1.35 | 1.27 |
| | Number of Zero Crossings | −3.31 | 0.04 | −78.69 | <.001 | −3.39 | −3.23 | 1.55 |
| | Short-time Energy | −5.72 | 0.15 | −38.18 | <.001 | −6.01 | −5.42 | 9.03 |
| | Average Pitch Strength | 9.24 | 0.12 | 78.52 | <.001 | 9.01 | 9.47 | 4.81 |
| | Normalized Cross-correlation | −0.84 | 0.05 | −16.77 | <.001 | −0.93 | −0.74 | 1.53 |
| | Cross-correlation | 1.00 | 0.16 | 6.25 | <.001 | 0.69 | 1.31 | 7.74 |
| Voicing Onset | Constant | −2.18 | 0.10 | −22.69 | <.001 | −2.37 | −2.00 | — |
| | Filtered Waveform Shape Similarity | 1.40 | 0.08 | 18.34 | <.001 | 1.25 | 1.55 | 1.30 |
| | Median Voice fo | 2.21 | 0.06 | 40.31 | <.001 | 2.10 | 2.31 | 1.19 |
| | Cepstral Peak Prominence | 1.05 | 0.08 | 12.53 | <.001 | 0.89 | 1.22 | 1.06 |
| | Number of Zero Crossings | −2.62 | 0.06 | −42.15 | <.001 | −2.75 | −2.50 | 1.66 |
| | Average Pitch Strength | 8.94 | 0.15 | 59.45 | <.001 | 8.65 | 9.24 | 2.83 |
| | Signal-to-noise Ratio | 0.56 | 0.06 | 9.84 | <.001 | 0.45 | 0.68 | 2.44 |
| | Filtered Short-time Energy | −3.75 | 0.10 | −37.51 | <.001 | −3.95 | −3.56 | 3.66 |
| | Filtered Short-time Log Energy | 3.11 | 0.07 | 44.81 | <.001 | 2.97 | 3.24 | 3.01 |
VIF = variance inflation factor
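The fitted coefficients feed a standard logistic model: the probability that a window is voiced is the sigmoid of the linear predictor. A sketch using the voicing-offset coefficients from the table above; the key names are abbreviations I have assigned here, and feature values must be scaled as in the paper, so raw inputs will not reproduce its results:

```python
import math

# Voicing-offset model coefficients from the table above ("fWSS" = filtered
# waveform shape similarity, "Mfo" = median voice fo, etc.).
COEF = {
    "const": 0.10, "fWSS": -1.52, "Mfo": 1.46, "CPP": 1.23, "NZC": -3.31,
    "STE": -5.72, "APS": 9.24, "NXCO": -0.84, "XCO": 1.00,
}

def p_voiced(features):
    """P(window is voiced): sigmoid of the linear predictor built from the
    supplied (already-scaled) feature values; omitted features count as 0."""
    z = COEF["const"] + sum(COEF[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Strong pitch strength pushes the prediction toward "voiced" (APS coef +9.24),
# while many zero crossings push it toward "unvoiced" (NZC coef -3.31).
```

The signs match the hypothesized trends in the feature table: features expected to be larger in voiced speech carry positive coefficients, and vice versa.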
Figure 5. Boundary cycle identification for each RFF estimation method (manual, aRFF-AP, aRFF-APH) for (a) voicing offset and (b) voicing onset. Results for manual RFF estimation are shown in light orange, aRFF-AP in dark orange, and aRFF-APH in teal.
Chi-square (χ²) tests of independence examining RFF estimation method and accuracy of boundary cycle identification for voicing offset and onset. Effect size interpretations of Cramer’s V are based on criteria from Cohen [49].
| Model | RFF Estimation Methods | df | N | χ² | p | Cramer’s V | Effect Size |
|---|---|---|---|---|---|---|---|
| Voicing Offset | Manual vs. aRFF-AP vs. aRFF-APH | 2 | 21793 | 5821.0 | <.001 | .52 | Large |
| | Manual vs. aRFF-AP | 1 | 14250 | 3928.0 | <.001 | .53 | Large |
| | Manual vs. aRFF-APH | 1 | 14268 | 4982.0 | <.001 | .59 | Large |
| | aRFF-AP vs. aRFF-APH | 1 | 15068 | 89.7 | <.001 | .08 | Negligible |
| Voicing Onset | Manual vs. aRFF-AP vs. aRFF-APH | 2 | 19112 | 6417.0 | <.001 | .58 | Large |
| | Manual vs. aRFF-AP | 1 | 12631 | 3420.0 | <.001 | .52 | Large |
| | Manual vs. aRFF-APH | 1 | 12391 | 4831.0 | <.001 | .62 | Large |
| | aRFF-AP vs. aRFF-APH | 1 | 13202 | 283.0 | <.001 | .15 | Small |
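The effect-size column is consistent with Cramer's V computed as sqrt(χ² / (N * min(r-1, c-1))); each comparison here is a methods-by-accuracy contingency table whose smaller dimension is 2, so min(r-1, c-1) = 1. A quick sanity check against the reported rows:

```python
import math

def cramers_v(chi2, n, min_dim_minus_1=1):
    """Cramer's V effect size for a chi-square test of independence."""
    return math.sqrt(chi2 / (n * min_dim_minus_1))

# Omnibus voicing-offset row: chi2 = 5821.0, N = 21793 -> V of about .52 ("Large")
offset_v = cramers_v(5821.0, 21793)
# Omnibus voicing-onset row:  chi2 = 6417.0, N = 19112 -> V of about .58 ("Large")
onset_v = cramers_v(6417.0, 19112)
```

Rounding each result to two decimals reproduces the table's Cramer's V values.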