Literature DB >> 36203405

Speech Recognition and Listening Effort of Meaningful Sentences Using Synthetic Speech.

Saskia Ibelings^1,2,3, Thomas Brand^2,3, Inga Holube^1,3.

Abstract

Speech-recognition tests are an important component of audiology. However, the development of such tests can be time consuming. The aim of this study was to investigate whether a Text-To-Speech (TTS) system can reduce the cost of development, and whether comparable results can be achieved in terms of speech recognition and listening effort. For this, the everyday sentences of the German Göttingen sentence test were synthesized for both a female and a male speaker using a TTS system. In a preliminary study, this system was rated as good, but worse than the natural reference. Due to the Covid-19 pandemic, the measurements took place online. Each set of speech material was presented at three fixed signal-to-noise ratios. The participants' responses were recorded and analyzed offline. Compared to the natural speech, the adjusted psychometric functions for the synthetic speech, independent of the speaker, resulted in an improvement of the speech-recognition threshold (SRT) by approximately 1.2 dB. The slopes, which were independent of the speaker, were about 15 percentage points per dB. The time periods between the end of the stimulus presentation and the beginning of the verbal response (verbal response time) were comparable for all speakers, suggesting no difference in listening effort. The SRT values obtained in the online measurement for the natural speech were comparable to published data. In summary, the time and effort for the development of speech-recognition tests may be significantly reduced by using a TTS system. This finding provides the opportunity to develop new speech tests with a large amount of speech material.

Entities: Chemical

Keywords: audiology; listening effort; speech quality; speech recognition; synthetic speech

Mesh：

Year: 2022 PMID： 36203405 PMCID： PMC9549212 DOI： 10.1177/23312165221130656

Source DB: PubMed Journal: Trends Hear ISSN： 2331-2165 Impact factor: 3.496

Introduction

Speech-recognition tests are used not only in the clinical diagnosis of hearing impairment, but also in the evaluation of hearing systems such as hearing aids. In speech-recognition tests, the task is to repeat the recognized words. It is possible to perform tests in quiet at different sound pressure levels (SPL), or in noise at different signal-to-noise ratios (SNR). Measurements in noise yield a psychometric function describing a speech-recognition score as a function of SNR. The SNR of a specific speech-recognition level, often 50%, is called the speech-recognition threshold (SRT). Currently used speech recognition-tests have the common feature that they were recorded using real (natural) speakers. For the recordings themselves, not only are efforts in terms of time and technical knowledge needed, but also professional equipment. The question arises whether the complex process of developing speech tests can be simplified using synthetic speech. Matrix tests, for example in German the Oldenburg sentence test (OLSA; Wagener et al., 1999), consist of 50 well-known words, and all sentences have the same grammatical structure. To generate the OLSA, 100 sentences were recorded so that all possible word transitions were considered. The recordings took place in a sound attenuated booth. The sentences were cut into segments, which include every single word with its specific co-articulation at the end. These words were concatenated into new combinations to generate new sentences. To achieve homogeneous word intelligibility, level adjustments were necessary (Kollmeier et al., 2015). For the German Göttingen Sentence Test (GÖSA; Kollmeier & Wesselkamp, 1997), which consists of everyday sentences, psychometric functions for each sentence and weighting factors for the individual words of a sentence were measured. Based on the results, level corrections were applied to reduce inhomogeneities. The GÖSA finally consists of ten test lists of 20 sentences each. These include not only declarative sentences, but also exclamations and questions (Kollmeier & Wesselkamp, 1997). The different test lists should be presented only once within a reasonable time period, because participants might remember the sentences and speech-recognition scores might thus increase (Yund & Woods, 2010). However, the relatively small number of ten test lists limits the test's applicability in research and in clinical care, e.g., for regular repetition in cochlear implant validation. Therefore, a sentence test with many more lists would be desirable. However, existing sentence tests with natural speakers cannot easily be extended, because the voice's characteristics are not constant over a larger age range (Schötz, 2007), the manner of speaking (e.g., speech rate; Schlueter et al., 2014) might not be replicated, and technical equipment might differ, which may result in differences in speech-recognition scores. To simplify the process of speech-test development, it is possible to use Text-To-Speech (TTS) systems. Current, TTS systems are not only less expensive, but can also reduce the effort by saving some optimization steps. Furthermore, there is then no need to hire a speaker or purchase recording equipment. TTS systems also offer the advantage that any amount of material can be post-produced. Nuesse et al. (2019) already showed that a TTS system is suitable for generating the sentences of OLSA using a female speaker (Wagener et al., 2014). A comparison between the natural speaker and the synthetized speaker using the voice “Claudia” (Acapela Group, Mons, Belgium), revealed an SRT difference of only 0.5 dB and comparable slopes (Nuesse et al., 2019). Another study using the German Freiburg monosyllabic test also found that natural and synthetic speech resulted in comparable SRTs and slopes (Schwarz et al., 2022). Although speech recognition is comparable, listening effort might be influenced by synthetic speech. An explanation of listening effort is given by the “Ease of Language Understanding” model (ELU) by Roennberg et al. (2013). Speech is the input signal to the listener's intermediate memory. The information contained in the speech regarding phonology, syntax, semantics, and prosody is automatically compared with representations from the listener's long-term memory. If the information matches, recognition of speech is easy. Due to hearing loss or complex environmental noise, however, recognizing speech may be hampered due to a mismatch of information. Synthetic speech may have an impact on this recognition process, in which case additional mental resources in the form of further processing steps and conscious and active processes are needed to recognize the speech. These processes include using context effects in sentence recognition. Listening effort is thus the increased demand on mental resources needed to identify the speech that is not understood (Roennberg et al., 2013). Both objective and subjective methods have been used to determine listening effort (Klink et al., 2012a, 2012b; McGarrigle et al., 2014; Pichora-Fuller et al., 2016). Subjective methods include questionnaires and scales (Krueger et al., 2017). Brain activity, as an objective measure, is examined using electroencephalography (EEG; Obleser et al., 2012). Other objective methods include physiological and cognitive measures, such as pupil size (Koelewijn et al., 2015), heart activity (Mackersie & Calderon-Moultrie, 2016), or skin conductance (Holube et al., 2016). Simantiraki et al. (2018), as well as Govender and King (2018), used pupillometry to measure listening effort for synthetic speech, and showed that synthetic speech generated with an HMM (Hidden-Markov Model)-based system led to larger pupil dilations than natural speech, indicating an increase in listening effort. It is worth noting that pupil size also showed a relationship with the quality of the TTS systems. Synthetic sentences that were rated higher in quality by the participants resulted in smaller changes in pupil size and reduced listening effort (Govender & King, 2018). Another method to gauge listening effort is to measure the Verbal Response Time (VRT; Houben et al., 2013; Meister et al., 2018; Pals et al., 2015; Visentin et al., 2021). VRT describes the time delay between the end of the stimulus presentation and the beginning of the response of the participant. According to Pals et al. (2015) the VRT is a good indicator for listening effort, but it is still unclear, whether it directly measures the listening effort itself or another dimension of listening effort. Both the presence of noise and poorer SNR (Houben et al., 2013; Meister et al., 2018; Visentin et al., 2021) led to higher VRT values. In the last few years, TTS systems have been continuously improved, resulting in the assumption that synthetic speech closely matches natural speech (King, 2014). Older systems often use Unit Selection (King, 2014; Taylor, 2006), in which speech is stored in a library. To find the right segment, the written text is decomposed into phonetic units. Before concatenating the segments, the selected segments are adjusted in duration, intensity, or even frequency. Nuesse et al. (2019) used such a system (Virtual Speaker, Acapela, Mons, Belgium) and found comparable speech-recognition scores for natural and synthetic speech. However, disadvantages of Unit Selection are both the diminished fluency of the sound and the high storage load (Taylor, 2006). Statistical models based on HMMs (Taylor, 2006) can, however, remedy these problems. The automated training based on many representative speech materials using statistics makes the model more robust. In addition, the use of features like Mel-Frequency Cepstral Coefficients (MFCC) can reduce storage requirements (Taylor, 2006). According to the Blizzard Challenge (King, 2014), statistical models result in an improved intelligibility compared to systems with Unit Selection. Modern systems often use Deep Neural Networks (DNN; Zen et al., 2013). DNN-based systems are not only rated as sounding more natural, but an objective improvement over HMM systems was also observed, e.g., a lower error rate for voiced and unvoiced utterances (Zen et al., 2013). One example of a DNN-based system is Acapela Cloud (Acapela Group, Solna, Sweden). In HMM- and DNN-based systems, the parameters obtained are used as the input of a parametric synthesizer or vocoder to generate audio signals. A vocoder uses a source-filter model of speech, i.e., a source signal (noise or pulse train) is passed through a filter representing the human vocal tract (Bunnell, 2022; Zen et al., 2013). Although the Wavenet technology (Shen et al., 2018) is also based on neural networks, it models the raw waveform of the audio signals sample by sample instead of using a vocoder. The overarching aim of the current study was to examine whether TTS systems can be usefully applied to speech-recognition tests using everyday sentences. In the first experiment, the qualities of three different TTS systems were compared in different dimensions. Using the best-rated TTS system, the sentences of the GÖSA were synthesized for the second experiment for a male (TTSmale) and a female speaker (TTSfemale). TTSmale was used to allow comparisons with the natural male speaker of the original GÖSA. TTSfemale was used because many international speech-recognition tests were recorded using female speakers (Kollmeier et al., 2015). Both TTSmale and TTSfemale were modified to match as far as possible the speech rate of the original material. The following listening tests were performed by normal-hearing participants. Speech-recognition scores and VRT values were compared for synthetic and natural speech. Overall, the research questions were: Is it possible to reduce the effort required for generating speech test material of everyday sentences by using a TTS system? This would be expected, since the production time of the synthetic speech material is shorter than for natural speech, because optimization steps such as level adjustments may no longer be required due to the more consistent properties of the synthetic speech. Are the VRT values for synthetic speech and natural speech comparable? Are the speech-recognition scores for synthetic speech comparable to those of natural speech? It was expected that the differences would be in the range of those for different natural speakers, i.e., up to 5 dB (Hochmuth et al., 2015).

Experiment 1: Comparison of TTS Systems

Methods

TTS Systems

To generate speech material using a synthetic voice, a suitable TTS system had to be chosen among the large number of TTS systems found using online research. The selection criteria were the availability of a male and a female speaker, and a subjective sound quality for the German language, as rated by the first author of this contribution. Additionally, the conditions of use were reviewed, especially in terms of their usability for a publicly available speech test and of a guarantee of the stability of the voices over a longer time period, for adding speech material in the future. None of the freely available TTS systems had sufficient sound quality for the German language without further training needs. One of the potential candidates was the commercial Virtual Speaker by the Acapela Group (Mons, Belgium) used in Nuesse et al. (2019). Since 2021, the Acapela Group (Solna, Sweden) offers a Cloud Service with modified speech synthesis, which was also preselected. Although Google Wavenet (Dublin, Ireland) does not guarantee the stability of the voices, it was preselected because of its application for synthesized digits in the digit-in-noise test (Polspoel et al., 2020). Table 1 gives an overview of the three preselected TTS systems, which differ in their signal-generation processes and costs.

Table 1.

Preselected TTS Systems.

TTS system	Signal generation	Voice (female)	Voice (male)	Costs	Abbreviation
Virtual Speaker (Acapela Group, Mons, Belgium)	Unit Selection	Claudia	Klaus	1.500 € for 5h	Acapela_US
Acapela Cloud Service (Acapela Group, Solna, Sweden)	Deep Neural Network	Claudia	Klaus	1.500 € for 75 min	Acapela_DNN
Google (Dublin, Ireland)	Wavenet	de-DE-Wavenet-F	de-DE-Wavenet-B	1 million characters free per month, otherwise 16 $ per 1 million characters	Google_wav

Preselected TTS Systems.

Stimuli

Based on Nuesse et al. (2019), 12 different sentences were chosen as stimuli. These were taken from the following speech-recognition tests: An excerpt from “North wind and sun” (Holube et al., 2010) spoken by a female speaker was also presented. All sentences were generated using the three different TTS systems for both speakers. The sentences generated were not optimized, but calibrated to the same digital root-mean-square (RMS) level. 6 everyday sentences from the GÖSA (Kollmeier & Wesselkamp, 1997) 3 sentences without semantic context from the OLSA (Wagener et al., 2014) 3 sentences with low semantic context from the Oldenburg Linguistically and Audiologically Controlled Sentences corpus (OLACS, Uslar et al., 2013)

Measurement Process

Systematic subjective ratings of the different systems and speakers were carried out using a MUSHRA test (MUlti Stimulus test with Hidden Reference and Anchor; ITU-R BS.1534-3, 2015). The quality dimensions evaluated were naturalness, prosody (stress and intonation), and speech flow in combination with subjective intelligibility (Hinterleitner et al., 2013). Due to the Covid-19 pandemic, measurements took place online using webMUSHRA (Schoeffler et al., 2018). Based on ITU-R BS.1534-3 (2015), the original material was presented as a reference and also named as such, so that the reference was known to the participants. In addition to the synthesized speech, a hidden reference and an anchor were also presented to the participants. According to ITU-R BS.1534-3 (2015) the anchor was derived from the reference by filtering using a low-pass filter with a cut-off frequency of 3.5 kHz. The order of the sentences, TTS systems, and speakers were randomized. The participants could switch between the hidden reference, the anchor, and the synthetic speech, and they could listen to them as often as necessary. The participants’ task was to compare these stimuli and rate them on a scale, that included values from 0 (very poor) to 100 (excellent). After completing the MUSHRA tests, participants were asked to sort the TTS systems for each speaker by preference (1 = favourite, 3 = last).

Participants

The quality of the speech was evaluated by 14 participants from 22 to 60 years of age (median age: 25.0 years, eleven females, three males). The participants were students and employees recruited via the mailing list of Jade University of Applied Sciences in Oldenburg. According to their own assessment, they had normal hearing. The experiment was approved by the ethics committee (Kommission für Forschungsfolgenabschätzung und Ethik) of Carl von Ossietzky University in Oldenburg, Germany (Drs. EK/2021/063).

Analysis and Statistics

The evaluation used Matlab 2020a (The MathWorks, Inc., Natick, Massachusetts) and SPSS 27 (IBM Corp., Armonk, New York). According to ITU-R BS.1534 3, two participants were excluded from the analysis. One participant rated more than 15% of the reference sentences worse than 90; the second scored the anchor greater than 90 in more than 15% of cases. Hence in total, data from 12 participants were used for the statistical analysis. Shapiro-Wilk tests revealed that the data of only one condition (naturalness for AcapelaUS with male speaker) deviated from a normal distribution (p = .016). Hence, for each quality dimension, repeated-measures analysis of variance (ANOVA) was performed with the within-subject factors TTS system and speaker (male/female). Post hoc tests were t-tests for paired samples with Bonferroni correction (α = .0167).

Results

Figure 1 shows the ratings for the various systems in the quality dimensions prosody, speech flow, and naturalness. In addition, the results for the original speech material are shown.

Figure 1.

Ratings using the MUSHA procedure of the quality dimensions prosody (top panel), speech flow (middle panel), and naturalness (bottom panel) for the different TTS systems and speakers (m = male, female = f), evaluated by 12 participants. The TTS systems are abbreviated as in Table 1. The original material was rated best in all dimensions. AcapelaUS was rated worst in all dimensions, with a trend towards the female speaker being rated better than the male speaker. AcapelaDNN and GoogleWav produced similar results. However, the participants rated the female Google-speaker worse than the male speaker. The repeated-measures ANOVA for the quality dimension prosody revealed that the TTS system had a significant effect on prosody ratings [F(2, 22) = 115.7, p < .001]. The speaker type (male or female) had no significant effect [F(1, 11) = 88.5, p = .529]. The interaction of the two factors was also significant [F(2, 22) = 17.1, p < .001]. The post-hoc tests performed with Bonferroni correction showed that AcapelaUS was significantly different from AcapelaDNN and GoogleWav (p < .001), whereas GoogleWav and AcapelaDNN were not significantly different in prosody (p = .242). Significant effects of TTS system [Greenhouse-Geisser ε = 0.661; F(1.321, 14.534) = 119.0, p < .001] and speaker [F(1, 11) = 10.2, p = .009] on speech-flow ratings were found. Their interaction also proved to be significant [F(2, 22) = 18.5, p < .001]. Post-hoc tests indicated that AcapelaUS was significantly different from the other systems (p < .001), whereas no significant difference was found between GoogleWav and AcapelaDNN (p > .05). For naturalness, the ANOVA showed a significant effect of the TTS system on the ratings [Greenhouse-Geisser ε = 0.656; F(1.312; 14.434) = 105.1, p < .001], but no effect of the speaker type [F(1, 11) = 2.97, p = .113]. In addition, there was a significant interaction [F(2, 22) = 31.8, p < .001]. The post-hoc tests performed with Bonferroni correction showed that the ratings of all TTS systems differed significantly from each other (p < .0167 in each case). The overall impression given by the participants supports the previously described results (see Figure 2). Among female speakers, ACAPELADNN was the first choice of all participants. For the male speaker, five participants preferred ACAPELADNN, seven GoogleWav. For most of the participants ACAPELAUS was the last choice.

Figure 2.

Overall impression of the three different TTS systems for the male and female speaker. The participants (N = 12) had to rank the systems in descending order according to their overall impression.

Discussion and Conclusion

Three different TTS systems using different synthesis methods and voices were evaluated. One of the TTS systems, AcapelaUS, outperformed two other systems in Nuesse et al. (2019). In contrast, in the current study, AcapelaUS was always rated as the worst. That the two other systems outperformed AcapelaUS is presumably related to the advancements in TTS systems and their functionalities over the past few years. This result is in line with Zen et al. (2013), who found that the DNN systems they tested produced better results than other systems. Nuesse et al. (2019) found a similar difference between the subjective ratings for AcapelaUS compared to the original material, as found in the current study for the synthetic speech material. The two TTS systems AcapelaDNN and GoogleWav were often rated as similar, but for naturalness and overall impression, AcapelaDNN outperformed GoogleWav. One dimension with similar ratings was fluency and intelligibility, based on Hinterleitner et al. (2013). It should be noted that one participant reported difficulties when scoring these two dimensions together. In general, the explanation of the procedures to the participants was limited to the written instructions in the online study. Further explanations, easily and often informally given in lab studies, were not possible. Nevertheless, no influence on the results was detected. The results are consistent in that the natural speech achieved a very high score in all dimensions. Although AcapalaDNN was rated as good, the ratings differ from those for natural speech. One possible explanation is that due to the measurement setup the participants knew the reference. Therefore, it was possible to recognize the reference within the presented stimuli and to evaluate it as the most natural. This aspect reveals a problem of using the MUSHRA test in this application. The MUSHRA test is usually used to evaluate intermediate quality differences of audio systems when processing the same source signal. In this study, however, although the same sentence was always used for each condition, these sentences were generated in different ways and differed in their speakers. Therefore, in this study the MUSHRA test only evaluates the different TTS systems against each other, and a comparison to the natural reference does not seem to be appropriate using this setup. However, future developments of TTS systems could possibly lead to natural speech being outperformed by synthetic speech in the assessed dimensions, especially if natural speech is recorded using an untrained speaker. It remains unclear whether the ratings were influenced by the selected stimuli. It is conceivable that acoustic differences are less obvious in sentences with a high context and that the TTS systems are therefore rated better than in sentences with lower context. Since both meaningful everyday sentences (GÖSA) and low context sentences like OLSA and OLACS were presented, the ratings could also be analyzed separately for each sentence group. However, the data is limited to 12 participants and only shows deviating trends in the comparisons of ratings for the two sentence groups synthesized with the different TTS systems. Hence, a detailed analysis of the hypothesis should be addressed in a future study. Overall, because AcapelaDNN yielded ratings (nearly) as good as the original material, it was concluded that AcapelaDNN is a good choice for synthesizing a new version of the GÖSA.

Experiment 2: Generating the Göttingen Sentences Using Synthetic Speech

Based on the previous results, the GÖSA sentences were generated using the TTS system AcapelaDNN. Both speech recognition and VRT were measured for natural and synthetic speech.

Procedure for Synthesis

The speech material of the GÖSA was resynthesized using the Acapela Cloud Service (Acapela Group, Solna, Sweden, https://www.acapela-cloud.com, accessed 8/27/2021). Each of the 200 sentences were generated with the German voices Klaus and Claudia. The sentences were entered into the online software and the sampling frequency set to 44.1 kHz.

Speech Characteristics

The speaking rate of the sentences was adapted to that of the original recordings. To allow direct comparisons, all audio files were cut directly before the beginning of each sentence and after the end of the sentence. The mean speech rate of the original recordings was 277 ± 38 syllables per minute. The speech rate of the synthesized speech differed from the original recordings, especially in that the sentences using TTSmale were much faster. Using the overlap-add procedure implemented in Praat (Boersma & Weenink, 2007), the speech rate of the synthetic speech was reduced. Figure 3(a) shows the speech rates of all sentences after adaptation. The adapted synthesized sentences are openly available at Zenodo (Ibelings et al., 2022). The fundamental frequency for TTSmale was somewhat lower than that of the original speech material (see Figure 3(b)). TTSfemale had the highest fundamental frequency. The long-term average speech spectra of all three variants are shown in Figure 3(c).

Figure 3.

Speech rate (a), fundamental frequency (b), and long-term average spectrum (c) of the original (natural male voice) and the synthetic male and female speakers (TTSmale, TTSfemale).

Masker

To optimize the spectral masking of stationary maskers, the masker should have the same spectral characteristics as the corresponding speech material (Festen & Plomp, 1990). The masker of the natural GÖSA was created from recordings of the same speaker using different speech material and different equipment, resulting in spectral deviations (Zinner et al., 2021). Therefore, to facilitate comparisons, each set of speech material was superimposed 30 times (Wagener et al., 2003) to generate a stationary masker. The power density spectra of the resulting maskers, called speech-adjusted noises (SAN, Zinner et al., 2021), differed from the speech materials by less than 0.1 dB in the frequency range from 100 Hz to 12 kHz. Both masker and the sentences were digitally calibrated to the same RMS value.

Measurement Procedure

Due to the Covid-19 pandemic, the measurements took place online via Gorilla Experiment Builder (www.gorilla.sc). Gorilla allows building experiments using the task builder or scripting. The participants started the experiment in their home environment with their own equipment by opening the Gorilla link. In the first step, the participants were informed about the study. Subsequently, exclusion criteria (age under 18 years, hearing impairment, mother tongue not German, no microphone available, GÖSA known) were clarified. If no exclusion criterion was met, the experiment continued with information, instructions, and the consent of the participants. To ensure that the participant's microphone was functional, a microphone test was conducted by asking the participants to allow access to their microphone and to record one test sentence. If the test recording was audible, a headphone test (Woods et al., 2017), which had been implemented in Gorilla by Milne et al. (2021), was performed. If these technical requirements were met, participants proceeded with the GÖSA. The measurement consisted of nine test lists, i.e., the three speakers (original, TTSmale, TTSfemale) were presented at three fixed SNRs (−4, −6, and −8 dB). The SNR values were chosen according to the psychometric function from Zinner et al. (2021) to meet recognition scores of approx. 20, 50, and 80%. The masker started 500 ms before the sentence and ended 20 ms after it. One of the ten lists of the GÖSA, each containing 20 sentences, was randomly selected, but ensured that no test list was measured twice per participant. The order of the sentences within each test list was randomized. After each sentence presentation, participants were asked to repeat the sentence orally. Guessing was allowed. The responses were recorded using Gorilla's audiorecord zone and saved as a so-called weba file. Participants were recruited via the bulletin board of Oldenburg University. The link to the study was opened 126 times, including possible second openings by the same person. Hence, individuals may have been counted multiple times. The experiment was started by 67 participants. Six were excluded, because German was not their mother tongue. Of the remaining 61 participants, 24 discontinued the experiment during the instruction and consent process and the check of technical requirements. Thirty-seven participants started the GÖSA, and 25 finished the whole experiment. The age of the 25 participants was between 19 and 40 years (average: 25.6 years, standard deviation: 5.3 years). Seven of the participants were male, 18 female, and all reported normal hearing. They also declared that they did not know the GÖSA and that their mother tongue was German. The experiment was approved by the ethics committee (Kommission für Forschungsfolgenabschätzung und Ethik) of the Carl von Ossietzky University in Oldenburg, Germany (Drs. EK/2021/063). When finishing the experiment, a voucher code of 10 € for a mail-order company was offered. Gorilla generates a separate table for each questionnaire, information section, and GÖSA test. The verbal responses of the participants were saved as weba files. For three participants, the weba files were incomplete or contained only noise. Therefore, these participants were excluded. For calculating the VRT, both the sentence offset time and the onset time of the participants’ responses are necessary. Figure 4 shows the chronological sequence. To determine the onset time of the response, a speech recognizer of the Fraunhofer Institute for Digital Media Technology IDMT (Branch for Hearing, Speech and Audio Technology, Oldenburg, Germany) was used. The inputs were the weba files, converted into raw PCM data with a sampling rate of 16 kHz, single channel, and 16-bit signed integer samples. The output files contained start- and end time of the participants’ responses per condition (combination of SNR and speaker).

Figure 4.

Schematic illustration of the chronological sequence and the verbal response time. The schematic is only to illustrate when each element began and ended.

Schematic illustration of the chronological sequence and the verbal response time. The schematic is only to illustrate when each element began and ended. Since it was not possible to measure the sentence offset time with Gorilla, the masker onset was saved instead. Using the knowledge of the sentence length and masker onset, the sentence offset and the VRT were calculated: According to the Shapiro-Wilk test, the calculated VRT values were not normally distributed. After transformation using the natural logarithm (Baayen & Milin, 2010; Pals et al., 2015), no significant deviation from a normal distribution was detected (p > .05). For measuring the speech-recognition score, all recorded responses were transcribed and word scoring per condition was used. The weighting factors applied in the natural GÖSA (Kollmeier & Wesselkamp, 1997) were used for each speaker. In one of nine conditions (TTSfemale at −4 dB SNR), the speech-recognition scores were not normally distributed (Shapiro-Wilk-Test, p = .003). Hence, parametric tests were applied. For both the logarithmized VRT values and for the speech-recognition scores, individual values deviating from the mean by more than three times the standard deviation were defined as outliers. Thus, one participant was considered as an outlier for speech-recognition scores under two conditions. Therefore, all statistical tests were performed using 21 participants. Subsequently, based on the speech-recognition scores at the three SNR values, psychometric functions of the form were fitted for each speaker using the Maximum Likelihood procedure (Brand & Kollmeier, 2002). L describes the SNR in dB. The slope at the SRT is denoted by s50 and is given in pp/dB. In the results, the VRT and the speech-recognition scores are presented as boxplots. The line in the middle of the box represents the median, the lower and upper bound of the box indicates the 25th and 75th percentile (box length is the interquartile range); whiskers are drawn from the lowest to the highest value within 1.5 times the interquartile range, + indicates values that are outside 1.5-times the interquartile range.

Results

Verbal Response Time

Figure 5(a) shows the VRT in ms and Figure 5(b) the logarithm of the VRT. As expected, poorer SNR led to higher VRTs, indicating an increase in listening effort. The lowest median is about 400 ms for TTSfemale at −4 dB SNR; the TTSfemale at −8 dB SNR led to the highest median (about 820 ms).

Figure 5.

(a) verbal response time (VRT, delay between end of sentence and verbal response of the participant) for natural (original) and both synthetic speakers (TTSmale and TTSfemale) at three SNRs (−4, −6, and −8 dB). (b) Log transformed VRTs to avoid deviations from a normal distribution, N = 21. An ANOVA for repeated measurements for the logarithmized VRT with the factors SNR and speaker (original, TTSmale and TTSfemale) confirmed a significant effect of the SNR [F(2, 38) = 29.2, p < .001], but no significant effect of the speaker [F(2, 38) = 0.363, p = .698]. There was no significant interaction between the factors [F(4, 76) = 2.84, p = .205]. For the female speaker, Bonferroni-adjusted post-hoc analysis (α = .167) revealed significantly higher VRT values for −8 dB SNR than for either −4 dB SNR or −6 dB SNR (p < .001 each). For the male speaker, significant differences between the VRT values for −8 dB SNR and −4 dB SNR were found (p = .001). There were no significant differences between the VRT values for the original speaker (p > .0167).

Speech Recognition

The poorer the SNR, the fewer words were correctly recognized, independently of the speaker (see Figure 6). Furthermore, the synthetic speakers generated higher recognition scores than the natural speaker.

Figure 6.

Speech-recognition scores for the natural (original) and synthetic speakers (TTSmale and TTSfemale) at three different SNRs (−4 dB, −6 dB and −8 dB), N = 21.

Speech-recognition scores for the natural (original) and synthetic speakers (TTSmale and TTSfemale) at three different SNRs (−4 dB, −6 dB and −8 dB), N = 21. The repeated-measures ANOVA confirmed a significant effect of the SNR on the speech-recognition scores [F(2, 40) = 519.7, p < .001]. Furthermore, the speakers showed a significant effect [F(2, 40) = 59.7, p < .001]. There was no significant interaction between SNR and speakers [F(4, 80) = 1.36, p = .254]. Post-hoc tests with Bonferroni correction (α = .0167) revealed significant differences for all SNR values (p < .001). The speech-recognition scores for the natural speaker differed significantly from TTSmale (p < .001) and from TTSfemale (p < .001). The scores for the synthetic speakers were not statistically different (p = .451).

Psychometric Functions

The psychometric functions for the three speakers based on all speech-recognition scores were fitted using equation (2) and are shown in Figure 7. The SRT for the natural speech is worse (−6.5 dB SNR) than for the synthetic speech (about −7.6 dB SNR). Natural speech resulted in the lowest slope at 14 pp/dB, the slopes for the synthetic speech being both slightly higher (16 pp/dB). Table 2 gives the measured values in comparison to published data.

Figure 7.

Table 2.

Comparison of the Measured SRT Values and Slopes for the Natural Speech Material and the Synthetic Material in the SAN Noise Compared to Published Data. In Each Case, the Values are Based on the Psychometric Functions Fitted to all Data Points.

	SRT in dB SNR	Slope in pp/dB
Original	−6.5	14.0
TTS_male	−7.6	16.0
TTS_female	−7.7	16.0
Zinner et al. (2021)	−6.2	18.1

Speech recognition scores for different SNR and speakers. The circles represent the individual speech-recognition scores of the 21 participants for the natural (Original) and both synthetic speakers (TTSmale, TTSfemale). Based on the scores, psychometric functions were fitted. The numbers in the figure show the SRT50 in dB SNR and the slope in pp/dB. Comparison of the Measured SRT Values and Slopes for the Natural Speech Material and the Synthetic Material in the SAN Noise Compared to Published Data. In Each Case, the Values are Based on the Psychometric Functions Fitted to all Data Points.

Discussion

One aim of this study was to compare the listening effort estimated using the VRT for synthetic and natural speech. For both, the results showed that lower SNRs led to higher VRTs of up to a median of about 800 ms. This agrees with the results of Meister et al. (2018), who also found an increase in VRT with lower SNR (worse speech-recognition scores). They measured the VRT at two different SNRs corresponding to speech-recognition scores of 50% and 80%, and in two different maskers (fluctuating and stationary) for different participant groups (young and old normal-hearing listeners, older hearing-aid users). For all groups, worse SNR led to higher VRTs. A similar study design to that of Meister et al. (2018) was used by Pals et al. (2015), who also found higher VRT values with lower SNR. Quantitatively comparing the results for stationary maskers and comparable speech-recognition scores from those studies to the results of this study reveals that the VRT values of the current study have the same median range, but show a larger spread. A possible explanation is that in the current study, fixed SNRs were used, whereas Meister et al. (2018) and Pals et al. (2015) used fixed SRTs. Furthermore, the age of some participants of the current study is higher than the age of their participants. A greater variance in age may result in a greater variance in VRT (Meister et al., 2018). In addition, the previous studies conducted the experiment in the same booth for all listeners. In contrast, the present study was conducted online at different locations. Therefore, it should be noted that parameters such as the performance of the computer, stability, and speed of the internet connection may have influenced the results (Anwyl-Irvine et al., 2021), hampering comparisons with other studies. According to Anwyl-Irvine et al. (2021), online tests with Gorilla result in a delay of about 80 ms (standard deviation: 8 ms) for different devices and browsers. Nevertheless, the absolute time delay is not important for the outcome of this study, because all participants performed under all conditions (SNR and speaker); only relative differences between the conditions were analyzed. It should be noted that there is no agreement on whether VRT directly measures listening effort or whether it is only related to listening effort. Visentin et al. (2021) measured speech recognition using a matrix test at different SNR values with normal-hearing participants. Subjective ratings of listening effort and VRT were measured; in addition, pupillometry was used. The VRT was found to be most sensitive to changes in SNR, which the authors equate with a change in listening effort. At the same time, however, no correlation with the results of pupillometry could be found, leading to the conclusion that different dimensions of listening effort were captured (Visentin et al., 2021). Pals et al. (2015) call the VRT measurement a “good candidate” for measuring listening effort. Meister et al. (2018) consider VRT a good way to obtain information beyond perceived listening effort. Summarizing these studies, the VRT measurement appears to be an indicator of listening effort, although it is still unclear whether it can be a direct measure of it. In the current study, no significant difference between the VRT values for the synthetic and the natural speakers was found. Related to the above-mentioned studies, this suggests a similar listening effort for both synthetic and natural speakers. By contrast, Govender and King (2018), who used pupillometry to measure listening effort, observed an increase in pupil size for synthetic speech, i.e., an increased listening effort for synthetic speech compared to natural speech. They observed no clear differences in listening effort between four different TTS systems; in some cases, however, there was a trend toward higher-quality rated systems resulting in lower listening effort. Simantiraki et al. (2018) also used pupil dilation as an indicator for listening effort. They noted that synthetic speech generated using an HMM based system led to larger pupil dilations than natural speech. None of the systems from these studies were based on neural networks. Hence, it can be assumed that AcapelaDNN sounds more natural than the systems used by Govender and King (2018) and Simantiraki et al. (2018). The SRT value for the natural speech measured online was −6.5 dB SNR. Nearly the same value (−6.1 dB SNR) was measured by Kollmeier and Wesselkamp (1997) using a different noise masker (original GÖSA noise). Zinner et al. (2021) found an SRT of −6.2 dB SNR, closely matching the current result of −6.5 dB SNR for the same SAN masker in the free field in a sound-proofed booth with normal-hearing participants. The good agreement despite a lack of control over the equipment used and possible unrecognized hearing losses in the current study indicates that online measurements are an appropriate tool to measure speech recognition using a speech test of everyday sentences. The speech-recognition scores for the synthetic speech were significantly different from those of natural speech. Synthetic speech led to better SRTs than natural speech by 1.2 dB. One possible reason is that the natural GÖSA appears a little less clearly articulated, and partly mumbled (Müller-Deile, 2009) compared to the synthetic speech. In contrast, Simantiraki et al. (2018) found worse SRTs for synthetic speech than for natural speech. In their study, four different speech types (e.g., synthetic and plain speech) were used. The authors defined plain speech as sentences spoken in a normal way using a male speaker and their synthetic speech was generated using an HMM-based TTS system. The normal-hearing participants scored 30% fewer correct responses with synthetic speech than with normal speech. As mentioned by Zen et al. (2013), HMM-based systems are rated less natural than DNN-based systems. It can therefore be concluded that the TTS systems have subsequently improved. The difference in SRT of 1.2 dB between natural and synthetic speech is in the same range as for different natural speakers. Differences in speech recognition between different natural speakers were already observed for the OLSA, which was recorded using a female and a male speaker. In contrast to the current study, where differences between the male and the female synthetic speaker were negligible, the SRT for the natural female speaker was about 2.5 dB better than for the natural male speaker (Wagener et al., 2014). For different German natural speakers, Hochmuth et al. (2015) found differences in SRT of up to 5 dB. The authors explained the differences as related to a different speech rate and a larger vowel space for the female voice. It should be noted that in contrast to the current study, the speech materials in those studies (Hochmuth et al., 2015; Wagener et al., 2014) were not adjusted to the same speech rate. Nevertheless, the speech rate might not have a significant effect on the SRT for the GÖSA. Winkler et al. (2021) showed that the SRT for the GÖSA for normal-hearing participants at a speech rate of 222 syllables per minute was not significantly different from the SRT at 279 syllables per minute. The fitted slope for the natural speaker was 14 pp/dB, and the slopes for the synthetic speakers were 16 pp/dB. In other studies, the slope was 17 pp/dB to 18 pp/dB (Kollmeier & Wesselkamp, 1997; Zinner et al., 2021). Since the slopes were shallower not only for the synthetic speakers but also for the natural speaker, the effect could possibly be related to the way the measurements were carried out. While measurements are normally conducted in a soundproof booth, in this study the measurements took place in the everyday environment of the participants. Thus, it could not be ensured that the participants were not distracted by other factors (e.g., by using the cell phone or doing other work on the computer screen), reducing attention. Additionally, participants were not tested for hearing impairments, and the age range was rather broad. Differences in participants’ performance lead to shallower discrimination functions that are fitted to pooled data (Wagener, 2004). Nevertheless, overall, the observed slopes for the synthetic speakers almost match the literature values for natural speakers, despite the absence of optimization steps typically applied to natural recordings. This indicates that optimization steps might not be necessary when producing speech tests with TTS systems. To facilitate comparisons, the measurements should be repeated or performed under controlled conditions in the laboratory and using more participants. It is unclear whether the similarity of natural and TTS materials with respect to the results of this study was influenced by the fact that everyday sentences were used. However, it can be assumed that for the purpose of speech audiometry, TTS systems can also be used for other materials. The reason for the assumption is that also for the German matrix test, which consists of sentences without semantic context as well as for the Freiburg monosyllabic test, it was shown that SRTs and slopes of the psychometric functions are similar for natural and synthetic speech (Nuesse et al., 2019; Schwarz et al., 2022). Nevertheless, it should be noted that the results of this study apply only to the GÖSA. This test consists of everyday sentences with three to seven words, and includes questions as well as declarative sentences and exclamations. The grammatical structure is mostly simple (subject-predicate-object) and there are no sub-clauses. Whether the results are also valid for other - or more complex - sentence structures can be examined in future studies.

Conclusion

Overall, it was shown that the use of a TTS system can simplify the generation of speech material for a speech-recognition test by reducing the time effort required for recording and subsequent optimization. Although the selected TTS system was not rated equally or better than the natural reference, this study confirmed that audiological measurements using synthetic speech are possible. The speech-recognition measurements resulted in about 1.2 dB lower SRTs for the synthetic speakers compared to the natural speaker recording of the original GÖSA. However, the slopes of the psychometric functions were slightly shallower than reported in other studies. Verbal response time, which can be interpreted as indicating listening effort, was comparable for synthetic and natural speech. It should be noted, that the results presented here only apply to everyday sentences. Other tests may lead to different results. However, for further measurements and for generating new speech-recognition tests, the use of a TTS system such as Acapela Cloud is a reasonable choice.

27 in total

1. Efficient adaptive procedures for threshold and concurrent slope estimates for psychophysics and speech intelligibility tests.

Authors: Thomas Brand; Birger Kollmeier
Journal: J Acoust Soc Am Date: 2002-06 Impact factor: 1.840

Review 2. The multilingual matrix test: Principles, applications, and comparison across languages: A review.

Authors: Birger Kollmeier; Anna Warzybok; Sabine Hochmuth; Melanie A Zokoll; Verena Uslar; Thomas Brand; Kirsten C Wagener
Journal: Int J Audiol Date: 2015-09-18 Impact factor: 2.117

3. Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing.

Authors: J M Festen; R Plomp
Journal: J Acoust Soc Am Date: 1990-10 Impact factor: 1.840

4. [Speech intelligibility tests in cochlear implant patients].

Authors: J Müller-Deile
Journal: HNO Date: 2009-06 Impact factor: 1.284

5. Development and evaluation of a German sentence test for objective and subjective speech intelligibility assessment.

Authors: B Kollmeier; M Wesselkamp
Journal: J Acoust Soc Am Date: 1997-10 Impact factor: 1.840

6. Verbal Response Times as a Potential Indicator of Cognitive Load During Conventional Speech Audiometry With Matrix Sentences.

Authors: Hartmut Meister; Sebastian Rählmann; Ulrike Lemke; Jana Besser
Journal: Trends Hear Date: 2018 Jan-Dec Impact factor: 3.293

7. Hearing Impairment and Cognitive Energy: The Framework for Understanding Effortful Listening (FUEL).

Authors: M Kathleen Pichora-Fuller; Sophia E Kramer; Mark A Eckert; Brent Edwards; Benjamin W Y Hornsby; Larry E Humes; Ulrike Lemke; Thomas Lunner; Mohan Matthen; Carol L Mackersie; Graham Naylor; Natalie A Phillips; Michael Richter; Mary Rudner; Mitchell S Sommers; Kelly L Tremblay; Arthur Wingfield
Journal: Ear Hear Date: 2016 Jul-Aug Impact factor: 3.570

Speech Recognition and Listening Effort of Meaningful Sentences Using Synthetic Speech.

Introduction

Experiment 1: Comparison of TTS Systems

Methods

TTS Systems

Stimuli

Measurement Process

Participants

Analysis and Statistics

Results

Discussion and Conclusion

Experiment 2: Generating the Göttingen Sentences Using Synthetic Speech

Procedure for Synthesis

Speech Characteristics

Masker

Measurement Procedure

Results

Verbal Response Time

Speech Recognition

Psychometric Functions

Discussion

Conclusion

1. Efficient adaptive procedures for threshold and concurrent slope estimates for psychophysics and speech intelligibility tests.

Review 2. The multilingual matrix test: Principles, applications, and comparison across languages: A review.

3. Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing.

4. [Speech intelligibility tests in cochlear implant patients].

5. Development and evaluation of a German sentence test for objective and subjective speech intelligibility assessment.

6. Verbal Response Times as a Potential Indicator of Cognitive Load During Conventional Speech Audiometry With Matrix Sentences.

7. Hearing Impairment and Cognitive Energy: The Framework for Understanding Effortful Listening (FUEL).

8. Intelligibility of time-compressed speech: the effect of uniform versus non-uniform time-compression algorithms.

9. Measuring Speech Recognition With a Matrix Test Using Synthetic Speech.

10. Subjective Listening Effort and Electrodermal Activity in Listening Situations with Reverberation and Noise.