Peter A. Krause, Alan H. Kawamoto.
Abstract
In natural conversation, turns are handed off quickly, with the mean downtime commonly ranging from 7 to 423 ms. To achieve this, speakers plan their upcoming speech as their partner's turn unfolds, holding the audible utterance in abeyance until socially appropriate. The role played by prediction is debated, with some researchers claiming that speakers predict upcoming speech opportunities, and others claiming that speakers wait for detection of turn-final cues. The dynamics of articulatory triggering may speak to this debate. It is often assumed that the prepared utterance is held in a response buffer and then initiated all at once. This assumption is consistent with standard phonetic models in which articulatory actions must follow tightly prescribed patterns of coordination. This assumption has recently been challenged by single-word production experiments in which participants partly positioned their articulators to anticipate upcoming utterances, long before starting the acoustic response. The present study considered whether similar anticipatory postures arise when speakers in conversation await their next opportunity to speak. We analyzed a pre-existing audiovisual database of dyads engaging in unstructured conversation. Video motion tracking was used to determine speakers' lip areas over time. When utterance-initial syllables began with labial consonants or included rounded vowels, speakers produced distinctly smaller lip areas (compared to other utterances), prior to audible speech. This effect was moderated by the number of words in the upcoming utterance; postures arose up to 3,000 ms before acoustic onset for short utterances of 1-3 words. We discuss the implications for models of conversation and phonetic control.
Keywords: articulation; motor control; speech planning; timing prediction; turn-taking
Year: 2021 PMID: 34326798 PMCID: PMC8315268 DOI: 10.3389/fpsyg.2021.684248
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Speaker-specific information.
| Speaker | Number of dyads | Mins. of recordings | Constrained utterances | Unconstrained utterances |
| P1 | 2 | 10 | 12 | 29 |
| P2 | 3 | 15 | 12 | 50 |
| P3 | 3 | 15 | 50 | 59 |
| P4 | 2 | 10 | 19 | 23 |
| P5 | 2 | 10 | 14 | 36 |
| P6 | 3 | 15 | 19 | 29 |
FIGURE 1. Plots of OpenFace parameters 48–67, which track the outer and inner lips, as arrayed on two frames of real facial video, each capturing the moment of acoustic onset for a different word. (Left) The first author beginning the word “oodles,” a constrained word. (Right) The first author beginning the word “apple,” an unconstrained word.
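As context for how a "lip area" signal can be derived from 2-D landmarks, the sketch below computes the area of the outer-lip polygon with the shoelace formula. In the 68-point landmark scheme that OpenFace uses, indices 48–59 trace the outer lip and 60–67 the inner lip; the specific coordinates below are made-up illustration values, not data from the study, and this is only one plausible way to operationalize lip area.

```python
def polygon_area(points):
    """Shoelace formula: area of a simple polygon from ordered (x, y) vertices."""
    acc = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        acc += x1 * y2 - x2 * y1
    return abs(acc) / 2.0

# Toy outer-lip contour (ordered around the mouth), e.g. pixel coordinates
# of landmarks 48-59 in one video frame. Values are invented.
outer_lip = [(48, 80), (55, 74), (62, 72), (68, 73), (74, 72),
             (81, 74), (88, 80), (81, 86), (74, 89), (68, 90),
             (62, 89), (55, 86)]

print(polygon_area(outer_lip))  # → 491.0
```

Tracking this quantity frame by frame (at the video's 30 fps, i.e., one sample every ~33 ms) yields the lip-area time series analyzed at the junctures in the table below.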
FIGURE 2. A histogram depicting the floor transfer offsets (i.e., the inter-utterance gaps for gap utterances and between-overlap utterances) in the final dataset.
Reports of linear mixed models fit to lip area.
| Time point (ms) | Model details | Value |
| −3,000 | Random effects | (1 | speaker) + (1 | first word) |
| | Constraint × log10(word count) | β = −64.48, SE = 24.99 |
| | Main effect of constraint | β = 68.73, SE = 25.96 |
| | Johnson-Neyman interval for probable speech postures | [0, 3.09] |
| −2,500 | Random effects | (log10(word count) | speaker) |
| | Constraint × log10(word count) | β = −61.86, SE = 24.14 |
| | Main effect of constraint | β = 66.33, SE = 23.62 |
| | Johnson-Neyman interval for probable speech postures | [0, 3.72] |
| −2,000 | Random effects | (log10(word count) | speaker) + (1 | first word) |
| | Constraint × log10(word count) | β = −58.07 |
| | Main effect of constraint | β = 71.78, SE = 25.83 |
| | Johnson-Neyman interval for probable speech postures | [0, 4.47] |
| −1,500 | Random effects | (log10(word count) | speaker) |
| | Constraint × log10(word count) | β = −90.33 |
| | Main effect of constraint | β = 105.23, SE = 24.96 |
| | Johnson-Neyman interval for probable speech postures | [0, 6.45] |
| −1,000 | Random effects | (log10(word count) | speaker) + (1 | first word) |
| | Constraint × log10(word count) | β = −70.37, SE = 27.22 |
| | Main effect of constraint | β = 82.24, SE = 29.40 |
| | Johnson-Neyman interval for probable speech postures | [0, 3.89] |
| −500 | Random effects | (log10(word count) | speaker) + (1 | first word) |
| | Constraint × log10(word count) | β = −42.84 |
| | Main effect of constraint | β = 62.23 |
| | Johnson-Neyman interval for probable speech postures | N/A |
| 0 | Random effects | (log10(word count) | speaker) + (1 | first word) |
| | Constraint × log10(word count) | β = −35.16, SE = 24.35 |
| | Main effect of constraint | β = 97.83, SE = 27.63 |
| | Johnson-Neyman interval for probable speech postures | N/A |
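For readers unfamiliar with the Johnson-Neyman technique reported in the table: given a model lip_area = b0 + b1·constraint + b2·x + b3·(constraint × x), the conditional effect of constraint at moderator value x is b1 + b3·x, with variance Var(b1) + x²·Var(b3) + 2x·Cov(b1, b3). The interval boundaries are the x values where the effect's |t| equals the critical value, which reduces to a quadratic in x. The sketch below is a minimal illustration: b1 and b3 echo the −3,000 ms row, but the covariance term and critical t are invented, so the resulting interval is not the paper's.

```python
import math

def johnson_neyman(b1, b3, var_b1, var_b3, cov_b13, t_crit):
    """Roots of (b1 + b3*x)^2 = t^2 * (var_b1 + 2*x*cov_b13 + x^2*var_b3),
    i.e., the moderator values where the conditional effect of the focal
    predictor crosses the significance boundary."""
    a = b3**2 - t_crit**2 * var_b3
    b = 2.0 * (b1 * b3 - t_crit**2 * cov_b13)
    c = b1**2 - t_crit**2 * var_b1
    disc = b**2 - 4.0 * a * c
    if disc < 0:
        return None  # conditional effect never crosses the boundary
    r1 = (-b - math.sqrt(disc)) / (2.0 * a)
    r2 = (-b + math.sqrt(disc)) / (2.0 * a)
    return tuple(sorted((r1, r2)))

# b1, b3 from the -3,000 ms row; the covariance is made up for illustration.
bounds = johnson_neyman(b1=68.73, b3=-64.48,
                        var_b1=25.96**2, var_b3=24.99**2,
                        cov_b13=-300.0, t_crit=1.96)
print(bounds)
```

Note that x here is on whatever scale the moderator enters the model (log10 word count in the paper's case); any back-transformation to raw word counts is a separate step.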
FIGURE 3. Predicted lip area values produced by the linear mixed models, when setting the word count predictor to 2 words and 8 words. Models were fit to junctures at 15-frame (500-ms) intervals, starting at 90 frames (3,000 ms) preceding acoustic onset. Error bars: Bootstrapped 95% CI.
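The general shape of the models in the table can be sketched as follows. This is a hypothetical reconstruction on simulated data, not the authors' analysis: all effect sizes, noise levels, and the 0/1 constraint coding are invented. Note also that statsmodels' MixedLM supports a single grouping factor, so the crossed by-first-word random intercept that some rows include would require a package such as lme4 in R.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated toy data: lip area as a function of labial constraint,
# log10 word count, and their interaction (signs mirror the table).
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "speaker": rng.choice([f"P{i}" for i in range(1, 7)], size=n),
    "constrained": rng.integers(0, 2, size=n),   # 1 = labially constrained onset
    "word_count": rng.integers(1, 16, size=n),
})
df["log_wc"] = np.log10(df["word_count"])
df["lip_area"] = (600.0
                  + 70.0 * df["constrained"]                  # toy main effect
                  - 60.0 * df["constrained"] * df["log_wc"]   # toy interaction
                  + rng.normal(0.0, 30.0, size=n))

# Fixed effects: constraint x log10(word count); random slope of
# log10(word count) by speaker, as in the "(log10(word count) | speaker)" rows.
model = smf.mixedlm("lip_area ~ constrained * log_wc", df,
                    groups=df["speaker"], re_formula="~log_wc")
fit = model.fit()
print(fit.params)

# Predicted lip area for constrained utterances at 2 vs. 8 words, as in Figure 3.
new = pd.DataFrame({"constrained": [1, 1], "log_wc": np.log10([2, 8])})
print(fit.predict(new))
```

Fitting one such model per 500-ms juncture, as the table does, would simply repeat this fit with the lip-area column taken from each time point.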
FIGURE 4. The regression line (with 95% confidence band) for the change of maximum lip movement speed with log10-transformed word count, as predicted by the mixed-effects model.
Content and frequency of 1-word utterances.
| Word | Count | Labial constraint |
| Yeah | 37 | Unconstrained |
| Alright | 13 | Constrained |
| Right | 9 | Constrained |
| No | 8 | Constrained |
| Yep | 5 | Unconstrained |
| Yes | 4 | Unconstrained |
| Really | 3 | Constrained |
| So | 2 | Constrained |
| Excellent | 1 | Unconstrained |
| Mmhmm | 1 | Constrained |
| Next | 1 | Unconstrained |
| Nice | 1 | Unconstrained |
| Oh | 1 | Constrained |
| Ok | 1 | Constrained |
| Thanks | 1 | Unconstrained |
| That’s | 1 | Unconstrained |
| Very | 1 | Constrained |
| Well | 1 | Constrained |
| What | 1 | Constrained |
| Which | 1 | Constrained |