Speaking to a common tune: Between-speaker convergence in voice fundamental frequency in a joint speech production task.

Abstract

Recent research on speech communication has revealed a tendency for speakers to imitate at least some of the characteristics of their interlocutor's speech sound shape. This phenomenon, referred to as phonetic convergence, entails a moment-to-moment adaptation of the speaker's speech targets to the perceived interlocutor's speech. It is thought to contribute to setting up a conversational common ground between speakers and to facilitate mutual understanding. However, it remains uncertain to what extent phonetic convergence occurs in voice fundamental frequency (F0), in spite of the major role played by pitch, F0's perceptual correlate, as a conveyor of both linguistic information and communicative cues associated with the speaker's social/individual identity and emotional state. In the present work, we investigated to what extent two speakers converge towards each other with respect to variations in F0 in a scripted dialogue. Pairs of speakers jointly performed a speech production task, in which they were asked to alternately read aloud a written story divided into a sequence of short reading turns. We devised an experimental set-up that allowed us to manipulate the speakers' F0 in real time across turns. We found that speakers tended to imitate each other's changes in F0 across turns that were both limited in amplitude and spread over large temporal intervals. This shows that, at the perceptual level, speakers monitor slow-varying movements in their partner's F0 with high accuracy and, at the production level, that speakers exert a very fine-tuned control on their laryngeal vibrator in order to imitate these F0 variations. Remarkably, F0 convergence across turns was found to occur in spite of the large melodic variations typically associated with reading turns. Our study sheds new light on speakers' perceptual tracking of F0 in speech processing, and the impact of this perceptual tracking on speech production.

Year: 2020 PMID: 32365075 PMCID: PMC7197779 DOI: 10.1371/journal.pone.0232209

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

In spoken-language interactions, recent work has revealed that speakers tend to imitate their interlocutor’s own way of speaking (see [1] for a recent review). This phenomenon, referred to as phonetic convergence, entails a moment-to-moment adaptation of the speaker’s speech targets to the perceived interlocutor’s speech patterns. It is thought to contribute to setting up a conversational common ground between speakers, to facilitate mutual understanding, and to strengthen social relationships [2]. In addition, phonetic convergence has been found to persist after the interaction has ended [3], and this provides evidence for the emerging view that words’ spoken forms in the mental lexicon continuously evolve throughout the speaker’s lifespan under exposure to speech produced by other speakers. In the present work, we focused on voice fundamental frequency (F0), a central dimension of speech, as a conveyor of both linguistic information (through intonation patterns, as well as lexical tones in tone languages, in particular) and communicative cues associated with the speaker’s age, gender, and/or emotional state. Our main objective was to contribute to better characterizing the size of convergence effects in F0 in both the temporal and frequency domains. More specifically, we aimed to experimentally determine whether, and if so to what extent, convergence in F0 between human speakers extends across speakers’ turns. We also sought to establish how accurately speakers may imitate changes in their partner’s F0 that are both limited in magnitude and spread over large intervals. Previous studies have examined potential between-speaker convergence effects in F0 by means of direct, acoustic measures [4-15], indirect, perceptual evaluations performed by listeners [16], or both [1, 17–19]. The results, however, have shown important discrepancies both across and within studies, as to whether convergence occurs or not, and if so to what extent. 
Many of these studies have employed a repetition task, which entails participants repeating a series of isolated vowels [4, 9], nonwords [7], words [1, 6, 17, 19], or sentences [8, 20, 21] previously recorded by one or several model speaker(s) and played out to the participants. Using this approach, [4] and [9] have provided acoustic evidence for F0 convergence in the repetition of vowels, with a larger effect in [4] than [9]. [17] have reported a small but significant acoustic convergence effect in F0 in single-word shadowing, a finding consistent with Goldinger’s often cited, albeit unpublished early study referred to in [6]. In a VCV (/aba/) repetition task, however, [5] found that convergence in F0 occurred to a small degree when participants were presented with both the audio and video recordings of the model speaker, but not in the audio-only condition. [7] had German-speaking healthy participants repeat nonwords in different tasks that included delayed repetition and shadowing. Participants showed an F0 convergence effect towards the model speaker in the delayed repetition task but not in the shadowing task. Single-subject analyses revealed that in delayed repetition, the effect was significant for 3 participants only out of 10. The absence of F0 convergence in the shadowing task was attributed by the authors to an overall increase in F0 resulting from an enhanced speaking effort in the shadowing compared with the delayed repetition task. In a recent, single-word shadowing study [18], participants did not display a consistent trend toward acoustic convergence in F0. Likewise, Pardo and colleagues [1, 19] did not find acoustic evidence for convergence in F0 in their large-scale studies using single-word shadowing. It is difficult to pinpoint what may be the origin of the disparities in the occurrence and extent of convergence effects in F0 in the abovementioned studies, given the vast array of differences that these studies show at the methodological level. 
These differences include the number of model speakers (from one speaker, e.g. [7, 17], to 20 speakers in [19]), the number, phonological make-up and lexical status of the items used as stimuli, and the index employed to characterize convergence from the F0 measures, among other features. One may note, however, that three ([4, 7, 9]) of the studies in which convergence in F0 was observed appear to share one characteristic that we do not find in other studies. In [4], [7], and [9], the experimenters made F0 in the stimuli vary in a systematic way, either through resynthesis ([4, 7]) or by selection of a set of F0 values ([9]) in the material recorded by the model speaker(s). Systematic variations in F0 in the stimuli may have facilitated the emergence of convergence effects, compared with stimuli in which the range of F0 variations was not controlled. Convergence effects in repetition tasks have also been subject to perceptual evaluations, in conjunction with acoustic analyses [1, 17–19] or in an independent way [16]. When the participants’ task is to repeat (non-)words or shorter linguistic units, perceptual evaluations are most frequently carried out by means of an AXB classification test, in which listeners are asked to determine whether the participant’s shadowed version of a word (stimulus A) sounds more similar to the model speaker’s version (stimulus X) compared with the participant’s baseline version of that word (stimulus B, with stimuli A and B counterbalanced across trials). Because the listeners’ perceptual judgments are necessarily holistic, the potential influence of F0 in these judgments is difficult to disentangle from that of other acoustic parameters. In [16], however, F0 was artificially manipulated independently of other parameters, by being equated across the A, X and B stimuli in the AXB test. 
The results showed that shadowed words were more often correctly perceived as better imitations of the model speaker’s words for the original than for the equated-F0 stimuli, and were therefore indicative of F0 being a salient cue to imitation in single-word shadowing (see [16], footnote 4). Work has also been carried out on potential convergence effects in tasks that involve pairs of participants speaking in a turn-taking fashion, and performed by one speaker in conjunction with another human speaker or an artificial agent. This includes conversational interactions, but also interactive verbal games (e.g., [22]), or joint reading tasks as in the present piece of work, among other examples. Gregory, Webster and colleagues conducted a series of acoustic studies [12–15, 23, 24] on dyadic interviews and dyadic conversations, which were all carried out according to the same general design. Recordings for each participant were divided into a number of excerpts equally spaced over the duration of the interaction, and a long-term average spectrum (LTAS) was computed across the low-frequency range for each excerpt. At each temporal division, the LTAS for each speaker was then compared to that of her/his interlocutor and that of the other speakers. Gregory and colleagues recurrently showed that correlations were higher for actual pairs of speakers (that had actually interacted with each other) than for virtual pairs. These findings have been taken as providing strong support for convergence in F0 in conversational interactions. However, caution may be required in the interpretation of these results, due to the lack of information on different methodological and technical aspects that are central to accurately analyzing F0. In particular, both sampling frequency and duration of excerpts were unspecified, as was the participants’ gender in [12] (for the Arabic speakers) and [15]. 
In addition, whereas the LTAS was focused on a narrow low-frequency band (62-192 Hz) in [12], it was extended up to 500 Hz in [13-15], and this may have resulted in the LTAS incorporating spectral components above F0, such as F1 in non-low vowels in male speakers. It is also difficult to ascertain that LTAS correlations have not been affected by variations in the recording conditions across interviews (which would tend to mechanically make the correlations higher for actual compared with virtual pairs), in [13] for example. In a more recent work, [11] directly measured mean F0 values associated with their participants’ speaking turns in conversational exchanges with a virtual agent, and found a convergence effect in their participants towards the virtual agent’s pre-recorded voice. [11] also explored their participants’ potential tendency to converge towards the virtual agent in F0 changes across turns. The participants’ mean F0 appeared to vary across turns in a periodic fashion that mirrored the periodic pattern contained in the model speakers’ recorded voices, as implemented in the virtual agents. In the present work, we asked whether, and if so to what extent, two speakers converge towards each other with respect to variations in F0 in a scripted dialogue. Pairs of speakers jointly performed a speech production task, in which they were asked to alternately read aloud a written story divided into a sequence of short fragments, or reading turns. The task was conceived with a view to studying convergence in F0 in the framework of the novel experimental approach to sensori-motor integration and cognition known as joint action [25, 26]. We devised an experimental set-up that allowed us to manipulate the speakers’ F0 in real time and with high accuracy across turns, in a way which bears similarities with the paradigm employed by Natale in his early study [27] on convergence in vocal intensity in dyadic communication. 
Both our experimental design and convergence measures were specifically conceived with a view to disentangling genuine convergence effects in F0 changes across turns, from similarities between speakers in F0 patterns as a by-product of the fact that speakers employ shared sentence or discourse structures. Studies on modified auditory feedback where the formants [28] or F0 [29, 30] is altered in real time have found an automatic response in a direction opposite to that of the perturbation (although same-direction responses can also happen, see [31]). This behavior is accounted for by theoretical models which assume that an internal simulation of the speakers’ auditory target is compared with the actual auditory feedback and issue correcting commands to the motor system when a mismatch is detected [32, 33]. This framework could explain why speakers perform with outstanding accuracy in speech tasks such as shadowing [16] or synchronous speech [34]. We hypothesised that, because F0 is a core dimension of communicative behavior, speakers respond to F0 modifications in their interlocutor’s speech by tending to imitate these modifications.

Materials and methods

This study was approved by the Ethics Committee of Aix-Marseille University.

Participants

Sixty-two female native speakers of French, all undergraduate students at Aix-Marseille University and from 18 to 47 years old, took part in the experiment. We chose our sample size following Sato et al.’s [9] study, in which F0 convergence effects were observed with cohorts of 24 participants exposed to F0 values varying from 196 to 296 Hz (for female voices), that is, a variation of 714 cents. Given our planned variation of 400 cents, that is, about half of that in [9], we estimated that we would need at least twice as many participants to observe an F0 convergence effect, but set the number of participants to 62 to be on the safe side. One pair of participants had to be discarded from the analyses owing to faulty recordings of the audio signals, leaving 60 participants in the test sample. Post-hoc analyses confirmed that this sample size was adequate to observe the expected convergence effect with our planned design (see Results). Whether there are variations in the amount of between-speaker phonetic convergence as a function of speaker gender has been a matter of debate. In an often-cited work, [35] found that female shadowers converged towards the model speaker to a greater extent than male shadowers. However, more recent studies with a focus on gender-related differences in phonetic convergence (e.g., [1, 3, 17, 36–44]) have provided results that were inconsistent in that respect. In Pardo’s seminal study [44] on phonetic convergence in conversational interactions, for example, convergence was found to be greater for male than for female speakers. In another, large-scale study, Pardo et al. [44] found no difference in the amount of phonetic convergence depending on speaker gender, whether in their conversational interaction task or in their speech shadowing task. Because gender was not a focus of interest in our study, we chose to only have female participants, as in [5, 45–49], for technical reasons explained below. 
Recruitment and testing were carried out in accordance with the standard procedures of Aix-Marseille University at the time of recruitment. All participants provided written informed consent. They were recruited in pairs, with the requirements that a) they had no auditory, speech production or reading disorder known to them, and b) pair members already knew each other and had an age difference of no more than ten years. Fulfillment of these criteria was established by means of a questionnaire filled in by the participants prior to the experiment. Both the familiarity and age difference criteria were expected to facilitate coordination between participants in the reading task. Duration of acquaintance ranged from a number of weeks (6 pairs) to several months (14 pairs) or years (9 pairs). Age difference was lower than 3 years for most (26) pairs and did not exceed 8 years.

Procedure

We asked pairs of participants to perform an alternate reading task. This entailed participants alternately reading aloud a written short story divided into a fixed number of reading turns (N = 74, see below). Each pair of participants was given a general introduction to the study by the experimenter in a control room, as well as written instructions. The two participants were then randomly assigned and dispatched to separate sound-isolated booths (A and B), where each of them was equipped with a C520 Sennheiser headset microphone and a pair of HD202 Sennheiser closed headphones. The participants’ positioning in different booths ensured that each participant’s voice was conveyed to the other participant through this electronic communication channel only, and that airborne sound transmission between participants was blocked. Further instructions by the experimenter were also transmitted through the communication equipment, from the control room. Participants first had to silently read the text they were to use in the alternate reading task, to familiarize themselves with it. They then did a practice session together, with a different, short text. Following this, they jointly performed two repetitions of the alternate reading task. The average duration of each repetition was 4 min and 33 s across the 30 pairs of participants retained for analysis. In all, the experiment lasted around 30 minutes. The experimental set-up is shown in Fig 1.

Fig 1

Experimental set-up.

Participants are seated in individual booths and communicate with each other through microphones and headphones, while an experimenter operates the F0 transformation software in the control room. Dashed lines indicate F0-transformed voice.

Materials

The text used in the experiment was a simplified version of a technical notice for installing a wooden floor, chosen for its neutral style. It contained 804 words and was split into 74 turns, each from 6 to 13 words long, with turn boundaries placed within but not across sentences. This was done to avoid participants making long pauses between turns, and to favor prosodic continuity from one participant’s turn to the other participant’s. The text was printed in two versions, one for participant A with odd-number turns in bold face and even-number turns in gray color, and the opposite pattern for participant B (see S1 Text for the full text including turn segmentation).

Experimental design

Unbeknownst to both participants, and using an experimental device that was placed in the control room along the communication channel between them, we artificially shifted the participants’ F0 from one turn to the next, by a value τ determined at each turn according to the following sinusoidal function:

$$\tau(t) = A \sin(\omega t + \phi) \qquad (1)$$

where A is the amplitude of the transformation and was set to 200 cents (2 semitones), t is an index associated with the reading turns 1 to 74, ω is the angular frequency and was set to 2π/74 for τ to achieve one complete cycle over the sequence of reading turns, and ϕ is the phase angle, set to either 0 or π, as detailed below. The F0 transformation value τ was set before the beginning of each reading turn by the experimenter. The long period and limited amplitude of the transformation were both chosen so that the participants did not notice that their partner’s voice had been artificially manipulated. The maximal change in τ between two consecutive turns of a given speaker was about 34 cents, i.e., 1/6 tone, and this made it unlikely for the other speaker to detect that change, all the more so since that speaker had to produce a turn herself in between.

To assess the extent to which participants reproduced each other’s shifts in F0 across turns, we asked participants to perform the task twice. In one reading, the phase angle of the transformation function was 0 (hereafter, 0-phase condition). In the other reading, the phase angle was π (π-phase condition). The order of the 0-phase and π-phase readings was counterbalanced across pairs of participants. We then computed, for each participant and each turn, the difference δ in the median of the untransformed F0 values between the 0-phase and π-phase conditions, as follows:

$$\delta(t) = \tilde{F0}_{0\text{-phase}}(t) - \tilde{F0}_{\pi\text{-phase}}(t) \qquad (2)$$

where $\tilde{F0}_{0\text{-phase}}(t)$ and $\tilde{F0}_{\pi\text{-phase}}(t)$ are the medians of the untransformed F0 values for turn t in the 0-phase and π-phase conditions, respectively.
Because our goal here was to characterize F0 patterns in the speech waveform as produced by the participants, both median values related to the participants’ untransformed speech. If we assume that each participant tends to reproduce the shifts in F0 to which they are exposed in their partner’s speech, as heard through the voice-transformation system, δ should mirror the variations of τ in the π-phase condition as subtracted from τ in the 0-phase condition. That is, δ should display a sinusoidal shape with a period of 74 turns and a phase angle of 0. Note that δ is computed as a difference in F0 between two readings of the same text by the same two participants. As a result, δ is expected to mostly reflect the participants’ degree of convergence towards the F0 movements related to τ in their partner’s voice, and to be largely insensitive to the prosodic variations associated with specific portions of the text, the participants’ specific reading styles, or both. These variations should tend to be abstracted away in the calculation of the F0 difference between the two readings of the text by the same participant. The values of the transformation function in the two reading conditions are shown in Fig 2.
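The transformation schedule described in this section can be illustrated with a short sketch. It assumes that Eq 1 has the sine form τ(t) = A sin(ωt + ϕ), which is consistent with the expectation that δ has a phase angle of 0; the function and variable names are hypothetical and this is not the software used in the study.

```python
import math

A_CENTS = 200.0                 # amplitude of the shift (from the text)
N_TURNS = 74                    # number of reading turns (from the text)
OMEGA = 2 * math.pi / N_TURNS   # one full cycle over the turn sequence


def tau(t, phase):
    """F0 shift in cents at turn t (sine form of Eq 1 is an assumption)."""
    return A_CENTS * math.sin(OMEGA * t + phase)


def cents_to_ratio(cents):
    """Frequency ratio corresponding to a shift expressed in cents."""
    return 2.0 ** (cents / 1200.0)


turns = range(1, N_TURNS + 1)
tau_0 = [tau(t, 0.0) for t in turns]        # 0-phase condition
tau_pi = [tau(t, math.pi) for t in turns]   # pi-phase condition

# Two consecutive turns of the same speaker are two turns apart,
# so the largest change she undergoes is between turn t and turn t + 2.
max_step = max(abs(tau_0[i + 2] - tau_0[i]) for i in range(N_TURNS - 2))

# Under perfect imitation, delta would track tau_0 - tau_pi = 2A sin(wt).
ideal_delta = [a - b for a, b in zip(tau_0, tau_pi)]

print(f"largest change between a speaker's consecutive turns: {max_step:.1f} cents")
```

Under these assumptions the largest change between two consecutive turns of one speaker equals 2A sin(ω), which is the figure of about 34 cents quoted in the text.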

Fig 2

F0 transformation values for the two repetitions of the task.

Top: 0-phase condition. Bottom: π-phase condition.

Voice fundamental frequency transformation

The real-time voice transformation system was implemented using the Max5 software (Cycling74). It consisted of a graphical interface that allowed us to interactively apply the F0 transformation to either of the two participants’ channels (see S1 Fig), and to more generally control the experimental procedure, including playing pre-recorded instructions. The voice transformation module used the phase vocoder supervp.trans~ [50] provided within the real-time sound and music processing library IMTR-trans. Preliminary tests showed that the quality of the voice transformation was higher for female than for male speakers, owing to the fact that the higher mean F0 in female voices makes it possible to use shorter analysis/resynthesis time windows. This is the reason why we recruited female participants only.

Data analysis

For each participant, F0 values were extracted every 10 ms in both the untransformed and the transformed speech recordings, as produced by the participant and heard by the participant’s partner respectively. We employed a two-pass procedure (see [51]) using the Praat software [52] to minimize octave jumps and other detection errors: first, an automatic detection was performed using maximal F0 register limits (75 − 750 Hz). These limits were then manually adjusted on the basis of a visual inspection of the detection results for each participant and the new, speaker-specific, limit values were used for the second and final automatic detection pass. The temporal location of the boundaries between consecutive reading turns was established by a silence-detection semi-automatic procedure, followed by a visual check and adjustments when necessary using a signal editor. For each turn, we then took the median F0 value in the channel of the participant that had spoken during that turn, in both the untransformed and transformed speech recordings. To estimate to what extent the sinusoidal pattern introduced in τ (Eq 1) can be found in δ (Eq 2), we fitted a sinusoidal function to the δ data series and sought to estimate the target parameters A, ω and ϕ from Eq 1 using nonlinear least-squares regression (function nls() from the R package stats [53]).
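The per-turn median described above can be sketched as follows. This is an illustration only: the study used Praat and a signal editor, whereas the function name, the data layout (a list of F0 values with None for unvoiced frames) and the toy values below are hypothetical.

```python
from statistics import median

FRAME_STEP_S = 0.010  # one F0 estimate every 10 ms (from the text)


def turn_medians(f0_track, turn_bounds):
    """Median F0 (Hz) of the voiced frames falling inside each turn.

    f0_track: F0 values in Hz, None for unvoiced frames; frame i is
              assumed to sit at time i * FRAME_STEP_S.
    turn_bounds: list of (start_s, end_s) pairs, one per reading turn.
    """
    medians = []
    for start_s, end_s in turn_bounds:
        voiced = [
            f0
            for i, f0 in enumerate(f0_track)
            if f0 is not None and start_s <= i * FRAME_STEP_S < end_s
        ]
        # A turn without any voiced frame yields no median.
        medians.append(median(voiced) if voiced else None)
    return medians


# Toy example with made-up values: nine frames, two turns.
toy_track = [None, 200.0, 210.0, 205.0, None, None, 180.0, 190.0, None]
toy_bounds = [(0.0, 0.045), (0.045, 0.09)]
toy_medians = turn_medians(toy_track, toy_bounds)
```

The median rather than the mean is used here, as in the text, which limits the influence of any residual octave jumps within a turn.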

Results

Pairs of participants accomplished the joint reading task in a smooth and fluent way, as indicated by the short lag (mean duration: 219 ms, SD: 253 ms) between each reading turn and the following one. As verified during a debriefing with the experimenter that followed the experiment, none of the participants noticed that the voice of their partner had been artificially modified. As the transformations made to each participant’s voice could be heard by the participant’s partner but not by the participant herself, none of the participants reported that their own voice had been artificially modified either. The accuracy of the voice transformation software was evaluated by calculating the difference between the measured transformed F0 values (F0 transf) and their expected values, estimated by the measured untransformed F0 values shifted by τ (F0 untransf + τ). The distribution of the difference was highly leptokurtic (kurtosis value of 414.0), with more than 92% of the measured points lying within ±20 cents of the expected values, indicating that the voice transformation system was highly accurate.
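A minimal sketch of this accuracy check is given below. It assumes that the comparison is made in cents, frame by frame, between the measured shift and the intended shift τ; the function names and the three example frames are hypothetical and do not come from the study's data.

```python
import math


def hz_to_cents(f_hz, ref_hz):
    """Interval in cents between a frequency and a reference frequency."""
    return 1200.0 * math.log2(f_hz / ref_hz)


def shift_error_cents(f0_untransf_hz, f0_transf_hz, tau_cents):
    """Measured shift minus intended shift tau, in cents."""
    return hz_to_cents(f0_transf_hz, f0_untransf_hz) - tau_cents


def share_within(errors, tol_cents=20.0):
    """Proportion of errors lying within +/- tol_cents."""
    return sum(abs(e) <= tol_cents for e in errors) / len(errors)


# Made-up frames: (untransformed F0 in Hz, transformed F0 in Hz, tau in cents).
frames = [
    (220.0, 220.0 * 2 ** (100.0 / 1200.0), 100.0),  # exact 100-cent shift
    (220.0, 233.0, 100.0),                          # slightly short of 100 cents
    (200.0, 205.0, 0.0),                            # spurious shift of about 43 cents
]
errors = [shift_error_cents(u, v, tau) for u, v, tau in frames]
share = share_within(errors)
```

The same conversion reproduces the stimulus range quoted in the Participants section: hz_to_cents(296.0, 196.0) is about 714 cents.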

Between-speaker convergence in F0 shifts across turns

If participants imitate the turn-wise F0 transformation heard in their partner’s voice, δ (see Eq 2) should follow a sinusoidal pattern across turns, with the same period (74 turns) as that of the applied transformation τ, and a zero phase angle. A sinusoidal function with constant term C, amplitude A, period T and phase ϕ was fitted to δ to determine the values that allowed that function to best account for the variations shown by δ across turns. We used nonlinear least-squares regression with initial values set to C = 0 cents, A = 40 cents, T = 74 turns, and ϕ = 0 to estimate the coefficients of the model. The estimated coefficients were C = 7.06 cents, A = 24.11 cents, T = 75.88 turns and ϕ = −0.68, i.e., −8.11 turns (all p < 0.001). The resulting fit is shown in Fig 3. This indicates that, in both the 0- and π-phase conditions, participants converged towards each other by exhibiting F0 variations across turns that followed a single-cycle sinusoid, with a delay of 8.11 turns with respect to the transformation applied, that is, 4.06 turns heard by each participant, and an amplitude of 12.06 cents on average over the two repetitions of the task.
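To make the shape of the fitted model concrete, the sketch below fits δ(t) = C + A sin(2πt/T + ϕ) to a complete series of turns. It is a simplification and not the nls() procedure used in the study: the period is fixed to the length of the series (an assumption), in which case the least-squares solution reduces to projections onto the sine and cosine components. The names are hypothetical, and the data are synthetic and noise-free, generated from the reported C, A and ϕ.

```python
import math


def fit_single_cycle(delta):
    """Least-squares fit of C + A*sin(2*pi*t/n + phi) for t = 1..n.

    The period is fixed to n = len(delta) (assumption). Over exactly one
    period the constant, sine and cosine regressors are orthogonal, so
    each coefficient is a simple projection. Requires a complete series.
    """
    n = len(delta)
    omega = 2 * math.pi / n
    c = sum(delta) / n
    a = 2 / n * sum(d * math.sin(omega * t) for t, d in enumerate(delta, start=1))
    b = 2 / n * sum(d * math.cos(omega * t) for t, d in enumerate(delta, start=1))
    # A*sin(wt + phi) = (A cos phi) sin(wt) + (A sin phi) cos(wt)
    amplitude = math.hypot(a, b)
    phase = math.atan2(b, a)
    return c, amplitude, phase


def phase_to_turns(phase, period):
    """Express a phase angle (radians) as a shift in turns."""
    return phase / (2 * math.pi) * period


# Synthetic series built from the reported coefficients, with T fixed at 74.
N = 74
true_c, true_a, true_phi = 7.06, 24.11, -0.68
synthetic = [
    true_c + true_a * math.sin(2 * math.pi * t / N + true_phi)
    for t in range(1, N + 1)
]
c_hat, a_hat, phi_hat = fit_single_cycle(synthetic)
```

A negative ϕ corresponds to a sinusoid that lags the applied transformation, which is how the delay in turns is read off the fitted phase.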

Fig 3

Between-speaker convergence in F0 shifts across turns.

δ measure: Difference between 0-phase and π-phase condition in median F0 for each participant and each turn. Sinusoidal fit is shown in blue, with 95% confidence interval in light blue.

A post-hoc evaluation of the replicability of the model coefficients’ significance was conducted by generating new data using the estimated model coefficients and a random error term with mean and standard deviation equal to that of the estimated residual standard error. Out of 1000 simulated datasets, the C, A, ω and ϕ coefficients were significant 97.2, 100, 100 and 98.1% of the time, respectively. We take this high degree of replicability as a confirmation of the adequacy of our participant sample size (see Methods).

Between-speaker convergence in mean F0

We examined to what extent participants converged towards each other in mean F0, by calculating the correlation in mean F0 across pairs of participants. Over the entire duration of the task, this correlation was found to be significantly positive (r = 0.45, p < 0.02, see Fig 4a). When computed for each successive pair of turns (associated with participant A and B respectively), the correlation reached its greatest positive value at the beginning of the task, decreased over the first reading (slope of linear regression = −0.003, p < 0.001), and remained stable for the second reading (p = 0.43). This was true regardless of whether participants started with the 0- or π-phase condition (see Fig 4b).

Fig 4

Between-speaker convergence in mean F0.

In each panel, the regression line is shown in blue. (a) Global A/B F0 correlation: mean F0 of participant A as a function of mean F0 of participant B over the total duration of the task. The dashed line represents a hypothetical correlation of 1. (b) A/B correlation in F0 (as in (a)) for successive pairs of turns. (c) Mean δ in turns 19 to 22 as a function of the overall F0 difference of the pair members.

In addition, we asked to what extent the participants’ tendency to imitate each other’s perceived shifts in F0 across turns, as measured by δ, was related to how close participants were to each other in mean F0 value. To answer this question, we focused on turns 19 to 22, a selection determined as the longest turn sequence where δ was found to significantly differ from 0, as evaluated by uncorrected independent t-tests on each turn. Fig 4c shows the average δ in that interval as a function of the absolute difference in grand average F0 between participants. We found a significantly negative correlation between these two dimensions (r = −0.30, p < 0.02), showing that co-participants who were closer to each other in mean F0 tended to more closely imitate each other’s perceived shifts in F0 from one turn to the next.

Predictability of F0 across turns

To further characterize the turn-by-turn dynamics of convergence in F0, we evaluated to what extent the participants’ median F0 at each turn could be predicted from F0 values in preceding turns. We performed three linear mixed-effect analyses, each predicting the participants’ median untransformed F0 value at turn t. For model m1, the predictor was the median transformed F0 value at turn t − 1, i.e., the transformed F0 value of the participant’s partner as heard by the participant. For model m2, the predictor was the median untransformed F0 value at turn t − 2, i.e., the untransformed F0 of the participant’s own preceding turn as heard by the participant through auditory feedback. The predictors for m3 were a combination of the two predictors of m1 and m2. We included for all three models the same random effect structure, obtained by increasing the complexity of the structure until adding a term did not significantly increase the explained variance. The random effect structure consisted of an intercept and a random slope by turn for the first predictor, and an intercept and a random slope by participant for both predictors. Data were transformed to z-scores by participant prior to modeling to avoid numerical convergence issues due to difference in intra- vs. inter-pair variation. Table 1 summarizes the three models’ fits to the data, as well as the results of an ANOVA between the first two models and the third one in which they are both nested.

Table 1

Output of linear mixed-effect modeling of the turn median F0.

Model	Predictors	SD (a)	SD (b)	SD (c)	SD (d)	SD (e)	SD (f)	AIC	χ² (vs. m3)	p (vs. m3)
m1	F̃0transf(t−1)	0.37	0.01	0.55	0.30	0.09	0.63	8891	87.52	< 2.2e-16
m2	F̃0untransf(t−2)	0.38	0.01	0.46	0.12	0.12	0.63	8827	23.71	1.122e-06
m3	F̃0transf(t−1) + F̃0untransf(t−2)	0.38	0.01	0.47	0.12	0.08	0.63	8805	–	–

F̃0(t) refers to the median of the F0 value in turn t. Columns show, from left to right: the model ID; the predictors; the random-effect standard deviations (SD) for terms (a) and (b), the intercept and the slope for F̃0transf(t−1), respectively, by turn, for terms (c), (d) and (e), the intercept and the slopes for F̃0transf(t−1) and F̃0untransf(t−2), respectively, by participant, and for (f), the residuals; and Akaike’s Information Criterion (AIC). The two rightmost columns give the results of an ANOVA between each of the first two models and m3.

We found that when tested separately, the factors associated with m1 and m2 were significant predictors of the median F0 at turn t, and that the combination of the two factors provided a significantly better fit than either of the factors taken separately. This suggests that in joint reading, the median F0 produced by participants during a turn depends on both their own median F0 in their preceding turn and their interlocutor’s median F0 as just heard in the immediately preceding turn.
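The p values of the model comparison in Table 1 can be related to the χ² statistics with a short check. The number of degrees of freedom is not stated in the text; the sketch assumes one degree of freedom for each comparison, an assumption that approximately reproduces the reported values. The function name is hypothetical.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 d.f.

    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Chi-square statistics taken from Table 1 (d.f. = 1 is an assumption).
p_m1_vs_m3 = chi2_sf_df1(87.52)
p_m2_vs_m3 = chi2_sf_df1(23.71)

print(f"m1 vs m3: p = {p_m1_vs_m3:.3g}")
print(f"m2 vs m3: p = {p_m2_vs_m3:.3g}")
```

Both comparisons point the same way as the AIC values: m3 has the lowest AIC (8805), followed by m2 (8827) and m1 (8891).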

Discussion

This study first demonstrates that, in a joint reading task, the two speakers tend to imitate each other’s changes in F0 across turns that are both limited in amplitude and spread over large temporal intervals. In our experimental set-up, the shift we introduced in each speaker’s F0 between two of their consecutive reading turns was always smaller than one-sixth of a tone. The observed between-speaker convergence in F0 shifts across turns shows that, at the perceptual level, speakers monitor slow-varying movements in their partner’s F0 with high accuracy and, at the production level, that speakers exert a very fine-tuned control on their laryngeal vibrator in order to imitate these F0 variations. Remarkably, F0 convergence across turns was found to occur in spite of the large melodic variations typically associated with reading. Indeed, we found that the average F0 range of a turn, measured as the mean difference between the turn maximum and minimum values, was 10.94 semitones (SD = 3.00) across participants, close to one octave, and was therefore much larger than the one-sixth of a tone shift between reading turns in each speaker.

Our results also indicate that speakers tended to converge towards each other in mean F0, a tendency that was found to establish itself from the beginning of the sequence of reading turns. It is important to note that convergence in mean F0, on the one hand, and convergence in F0 shifts across turns, on the other hand, constitute two different dimensions of variation in F0. The first dimension is concerned with how close speakers are to each other along the F0 scale. The second dimension is linked to how accurately each speaker reproduces the other speaker’s variations in F0 over the reading-turn sequence. These two dimensions are, in principle, mutually independent: for example, it could be conceived that speakers espouse each other’s changes in F0 from one turn to the next whilst remaining at the same distance from each other on the F0 scale. Our data, however, reveal that convergence between speakers occurred on both dimensions simultaneously, and that a greater amount of convergence in mean F0 was associated with a greater amount of convergence in F0 changes across turns.

We also found that convergence between speakers fell into place from the beginning of the joint reading task. This applied to both convergence in mean F0 and convergence in F0 shifts across turns. In the latter case in fact, the δ measure appeared to deviate from zero to a greater extent over the first part relative to the second part of the reading-turn sequence (see Fig 3). These results are at variance with a conventional view of convergence as a phenomenon that gradually builds up over the course of a speech production task (see [10, 54] for schematized representations of this conventional view). In contrast to this view, our results indicate that convergence in both mean F0 and F0 shifts across turns can be performed very quickly and as soon as speakers start interacting with each other.

A potential limitation of our work relates to the fact that our pairs of speakers already knew each other, since the question may be raised whether familiarity between speakers may have contributed to facilitating convergence in F0. However, in the only study known to us on the potential links between familiarity and convergence, Pardo and colleagues [55] found that perceived convergence between college roommates did not differ over the course of the academic year. Thus, the available experimental evidence does not point to an increase in phonetic convergence with increased familiarity.
Another significant outcome of this work is that speakers converged to a greater extent towards each other (as measured by F0 shifts across turns) when they were already close to each other (as measured by overall proximity in mean F0). This is at odds with an approach to convergence according to which speakers move towards a target that is halfway between them along one or several phonetic dimensions, with the implication that the speakers deviate more from their respective initial positions when these positions are further apart (see [36, 56], among others). Our results are more consistent with a different view, in which speakers engaged in a verbal interaction tend to become more phonetically alike when they already sound more like each other at the outset. It may indeed be assumed that phonetic convergence towards the interlocutor will be facilitated when that interlocutor’s speech sounds are more within the range of the speaker’s own, long-established, articulatory maneuvers [4].

Our data can be accounted for by means of a new, dynamical model of F0 control based on three main assumptions. The first assumption is that speakers compute and store in memory a measure of mean F0 in their interlocutor’s speech over the interlocutor’s speaking turn. This entails speakers’ being able to abstract mean F0 from the potentially large up-and-down F0 excursions that the interlocutor may perform throughout the turn. The second assumption is that mean F0 associated with the speaker’s upcoming turn t + 1 is set as a function of both the speaker’s mean F0 in her/his last turn (t − 1), and the interlocutor’s perceived mean F0 in the ongoing turn t. The third assumption is that the interlocutor’s contribution to setting the speaker’s mean F0 has a weight that is larger when the speaker and interlocutor are closer to each other on the F0 dimension.
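The three assumptions can be made concrete with a toy update rule. The text gives no equations for the model, so everything below is an illustrative assumption rather than the authors' formulation: the exponential form of the weight, the parameters `w_max` and `scale_st`, their default values, the interpolation on a log-frequency scale, and all function names are hypothetical.

```python
import math


def semitone_distance(f_a_hz, f_b_hz):
    """Absolute distance between two F0 values, in semitones."""
    return abs(12.0 * math.log2(f_a_hz / f_b_hz))


def partner_weight(own_prev_hz, partner_hz, w_max=0.5, scale_st=4.0):
    """Weight given to the partner's mean F0 (third assumption).

    The weight is largest (w_max) when the two voices coincide and shrinks
    as they move apart; the exponential decay is an arbitrary choice.
    """
    distance = semitone_distance(own_prev_hz, partner_hz)
    return w_max * math.exp(-distance / scale_st)


def next_turn_target(own_prev_hz, partner_hz, w_max=0.5, scale_st=4.0):
    """Mean F0 target for turn t + 1 (second assumption).

    own_prev_hz: the speaker's own mean F0 in turn t - 1.
    partner_hz:  the partner's mean F0 as perceived over turn t
                 (first assumption: one summary value per turn).
    The target is a weighted interpolation on a log-frequency scale.
    """
    w = partner_weight(own_prev_hz, partner_hz, w_max, scale_st)
    return own_prev_hz * (partner_hz / own_prev_hz) ** w
```

With these choices, a speaker at 200 Hz moves part of the way towards a partner heard at 220 Hz, but gives far less weight to a partner heard at 260 Hz, which is the qualitative pattern the third assumption describes.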
Our proposed account differs in several important respects from current models of F0 convergence between speakers, such as the one presented in [9]. First, these models appear to be agnostic as to the size of the time window over which the interlocutor’s F0 may be integrated, whereas we contend that this time window extends over one speaking turn. Second, we do not regard F0 convergence as stemming from a perceptuo-motor recalibration mechanism, by virtue of which changes in a speaker’s sensory targets occur as a result of that speaker being exposed to another speaker’s voice. Rather, in our account, the two speakers are speaking to a common tune, i.e. the target mean F0 for an upcoming turn is established by them in a joint manner. In other words, instead of conceiving F0 convergence as a shift in each speaker’s F0 under the influence of an external speech input, we suggest that it is the product of a two-speaker shared sensory-motor plan. Finally, our model sets limits to F0 convergence, which we expect to apply to a greater extent to speakers whose voices already resemble each other more.

Joint reading aloud is but one instance of a wide repertoire of behaviors today referred to as joint action. Joint action has been defined as a social interaction whereby two or more individuals coordinate their actions in space and time to bring about a change in the environment [25]. In this domain, a central issue is to what extent joint action is the result of joint planning, and entails using shared task representations and sensory-motor goals [26]. To our knowledge, our results provide the first piece of experimental evidence for convergence in F0 as stemming from the use of shared representations and sensory-motor plans in a joint speech production task.

Interface for the voice transformation software.

Top-left panel: global parameters with, from top to bottom: toggle Audio, set recording index, allow cross-talk, set transformation’s phase angle (0 or π), amplitude and period. Top-right panel: visual indicators monitored during the task. Audio signal and current transformation value for participants A and B are shown on the left and right respectively, with the current turn number in the center, and the recording indicator at the bottom. Bottom panel: commands to control the task. Left: pushbuttons triggering audible instructions to participants for the 4 parts of the task (silent reading, practice text, first repetition of text, second repetition of text). Right: pushbuttons to initialize the task, start and stop recording in green, red and gray respectively. (PDF)

Text read by the participants.

For participant A (version shown here), odd-numbered turns are in boldface and are to be read aloud while even-numbered turns are in gray color and are to be listened to in the partner’s voice. For participant B (not shown), odd-numbered turns are in gray color, and even-numbered turns in boldface. Turn boundaries and turn numbers, added here for reference, were not shown in the version given to both participants. (PDF)

27 Dec 2019

PONE-D-19-26013
Speaking to a common tune: between-speaker convergence in voice fundamental frequency in a joint speech production task
PLOS ONE

Dear authors,

I have now received two clear and detailed assessments of the manuscript you submitted to PLOS ONE entitled "Speaking to a common tune: between-speaker convergence in voice fundamental frequency in a joint speech production task". While both reviewers consider this paper to represent an interesting contribution to the field, the first reviewer in particular raised a number of major issues regarding the description and analysis of the data. I agree with this reviewer that the findings should be contextualized more precisely, and that the conclusions reached in some of the statistical analyses may not have been warranted. Taken together, addressing these and all other suggestions may require carrying out new analyses and reworking several sections of the text. However, I am convinced that addressing the reviewers' comments will improve the quality of your manuscript. I therefore hope you will consider resubmission. I look forward to receiving your revised manuscript.

We would appreciate receiving your revised manuscript by January 24.

Kind regards,
Francisco José Torreira, Ph.D.
Academic Editor
PLOS ONE

Journal Requirements:

We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

Additional Editor Comments:

l. 9: "speaker's mental lexicon": Perhaps you could use 'mental sound representations' or something similar, as it is not obvious whether/how phonetic convergence may affect the structure of the lexicon beyond phonetic and phonological processing.
l. 14: "indexical information associated with the speaker’s age, gender, and/or emotional state": Paralinguistic uses of F0 have been claimed to play communicative roles (cf. Ohala and Gussenhoven's paralinguistic uses of their biological codes).
l. 57: Consider dropping "have" for consistency in the use of past tenses.
l. 85: "to what extent": this projects some kind of comparison later.
l. 102: Consider dropping "the" in "the F0".
l. 116: "that participants read aloud alternatively": it'd be useful for the reader to have an approximate idea of how long the task was.
l. 130: "achieve a turn": Not clear what is meant by "achieve" here.
l. 184: "following [13] study": Please check grammar.
l. 264: "7.5 turns": 75 turns
l. 303: "in the immediately preceding turn.": Since m3 performed better than m1 and m2, shouldn't turn be "preceding turns"?
l. 306: Consider dropping the word "even".
l. 316: "a tendency that was found to establish from": Please check grammar.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: No

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: No

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: Yes

5. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics.

Reviewer #1: In this study, the authors investigate pitch adaptation, the tendency for two speakers to converge towards similar pitch patterns over the course of a conversation (in this study during a scripted dialogue). They found that speakers imitated each other's changes in F0 across turns and that speakers were both able to perceive their partner’s changes in F0 and imitate these variations. The paper is clear and well written. The statistical analyses and experimental design are appropriate. The approach taken to study pitch convergence is novel and will be valuable to the expert reader. Below some suggestions for improvement of the manuscript.

Abstract:

1/ “F0, in spite of the major role played by F0”: instead of F0 (parameter), maybe rather use “pitch” (perception)

2/ “we asked to what extent two speakers converge towards each other with respect to variations in F0 in a scripted dialogue.”: investigate instead of ask

Title:

3/ “Speaking to a common tune: between-speaker convergence in voice fundamental frequency in a joint speech production task”: the authors may just use the term pitch convergence to avoid the need of clarifying “voice” in “voice fundamental frequency”

Introduction

4/ “It is thought to contribute to setting up a conversational common ground between speakers and to facilitate mutual understanding”: the authors may add here the other effect/role of convergence in strengthening relationship. This is of importance as the participants in this study knew each other prior to the experiment. It may be assumed that some form of phonetic convergence is already set in place in other contexts of social interaction they are used to engage in.

5/ “As is well known,” maybe remove not necessary

6/ “The results, however, have shown important discrepancies both across and within studies,” ok but there are some consistencies of global pitch convergence (f0 median/ range) which has been well described in the literature. The authors could here be more specific about what would have led to inconsistencies across studies; what exactly was measured of f0? what temporal span was used to assess f0 changes across studies? how much time was given to the participants to potentially measure pitch convergence? How familiar were the speakers before they engaged in the experimental task?

7/ “Our main objective was to contribute to better characterizing the size of convergence effects in F0 in both the temporal and frequency domains. More specifically, we aimed to experimentally determine whether and if so to what extent convergence in F0 between human speakers extends across turns. We also sought to establish how accurately speakers may imitate changes in their partner's F0 that are both limited in magnitude and spread over large intervals.”: The objectives could come earlier in the introduction to highlight the novelty of the work. Otherwise it may not be clear to the reader in what way this study advances current knowledge and measurement of pitch convergence.

8/ “This framework could explain why speakers perform with outstanding accuracy with little training at uncommon speech tasks such as shadowing [8] or synchronous speech [35].” “with little training” could be misleading. Yes these are uncommon speech tasks and speakers are not trained for these specific tasks however the fact that they are consistently engaged in social interaction in their day-to-day activities make them expose to others’ speaking styles and therefore they may engage in pitch convergence on a daily basis.

Materials and methods

9/ The sections participants and procedure could be moved before the experimental design section to set the context for the experimental design.

10/ “communicated with each other using microphones and headphones.”: for the reader’s interest, the author could specify here the reason why the speakers were put in separate rooms (e.g. not to rely on non-verbal cues to communicate).

11/ “the participants did not notice that their partner's voice had been artificially manipulated.” How did the authors control for that? Post questionnaire? Also, what was the lag between speakers’ turns?

12/ “For that reason, we recruited female participants”: it is important to state this in the inclusion criteria and maybe add this as a limitation to the work as you may have measured here the effect of gender (male vs. female) or gender-mixed pair (female-female vs female-male) on adaptation capacity.

13/ “We chose our sample size following [13] study, where F0 convergence effects were observed with cohorts of 24 participants exposed to F0 values varying from 196 to 296 Hz (for female voices), that is, a variation of 714 cents. Given our planned variation of 400 cents, that is, about half of that in [13], we estimated that we would need at least twice as many participants to observe an F0 convergence effect. .. we provisioned for an additional one third of participants, resulting in planned participants.” A power analysis maybe?

14/ “Participants came in pairs to the lab with the requisite that pair members knew each other well and had an age difference of no more than ten years. Both the familiarity and the age difference criteria were expected to facilitate the participants' joint accomplishment of the reading task.” A) This goes back to my previous comments that “little training” may not be appropriate in the context of this study, specifically when the authors ensured that the participants knew each other well to facilitate joint accomplishment; B) how did you control for similar familiarity across pairs? C) there may be a bit of circularity here. The results on the ability to both perceive slight changes in f0 and imitate F0 changes are to be taken in the context that the speakers were selected according to their familiarity. In this context, the results of them being able to quickly hear f0 changes and imitate are highly predictable and therefore it could be argued that the found convergence is just the result of pairing familiar individuals together.

15/ “median F0 value in the channel of the participant that had spoken during that turn” why not looking at the range too?

Results

16/ “We examined to what extent participants converged towards each other in mean F0, by calculating the correlation in mean F0 across pairs of participants. Over the entire duration of the task, this correlation was found to be significantly positive (r = 0.45; p < 0.02, see Fig 3a)”. Providing that the authors measure f0median over the turn and that F0 median per turn tend to decline over the course of a paragraph, how can the authors ascertain that the speakers were adapting to each other and that the found correlation is not a by-product of global declination naturally used by both readers in the context of text reading (in this global context - larger temporal span)?

17/ “showing that co-participants who were closer to each other in mean F0 tended to more closely imitate each other's perceived shifts in F0 from one turn to the next” yes and this should be contextualised as per my previous comment of using same-gender pair and concluding on ability to adapt.

18/ “as soon as each speaker starts being exposed to the other speaker's voice.” This could be misleading provided that the speakers are used to each other’s speaking style and know each other well.

19/ The authors could emphasize the novelty of their approach of pitch convergence (experimental design and measure), and may make it a study goal in order to highlight the contribution of their work to the literature.

Reviewer #2: My complete review is included as an attachment. Re: question 3 above, the authors state that data will be available upon acceptance, but do not give the specified information as requested as to why this is the case.

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?
Reviewer #1: No
Reviewer #2: No

Submitted filename: Review_PLOSOne_Dec2019.pdf

4 Feb 2020

The responses to the reviewers' and editor's comments are in the attached document.

Submitted filename: Response to Reviewers.pdf

10 Apr 2020

Speaking to a common tune: between-speaker convergence in voice fundamental frequency in a joint speech production task
PONE-D-19-26013R1

Dear Dr. Nguyen,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements. Following the recommendation of Reviewer #2, I encourage you to have your manuscript proofread by a native speaker if possible in order to improve the readability of the text.

With kind regards,
Francisco José Torreira, Ph.D.
Academic Editor
PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #1: All comments have been addressed
Reviewer #2: (No Response)

2. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #1: Yes
Reviewer #2: Yes

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #1: Yes
Reviewer #2: Yes

6. Review Comments to the Author
Reviewer #1: (No Response)
Reviewer #2: My concerns with the previous version have been substantially addressed. The manuscript should be proofread by a native speaker of English, since there are still a few usage errors to be corrected.

7. Do you want your identity to be public for this peer review?
Reviewer #1: No
Reviewer #2: No

16 Apr 2020

PONE-D-19-26013R1
Speaking to a common tune: between-speaker convergence in voice fundamental frequency in a joint speech production task

Dear Dr. Nguyen:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

Thank you for submitting your work to PLOS ONE.

With kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Francisco José Torreira
Academic Editor
PLOS ONE



1. Computational neuroanatomy of speech production.

2. Voice F0 responses to manipulations in pitch feedback.

3. Sensorimotor adaptation in speech production.

4. A nonverbal signal in voices of interview partners effectively predicts communication accommodation and social status perceptions.

5. Phonetic convergence across multiple measures and model talkers.

6. Vocal imitation of song and speech.

7. Cognitive Load Reduces Perceived Linguistic Convergence Between Dyads.

8. Repeat what after whom? Exploring variable selectivity in a cross-dialectal shadowing task.

9. Neural correlates of phonetic convergence and speech imitation.

10. Can mergers-in-progress be unmerged in speech accommodation?

1. Effects of intention in the imitation of sung and spoken pitch.