| Literature DB >> 34405288 |
Judith M. Varkevisser, Ralph Simon, Ezequiel Mendoza, Martin How, Idse van Hijlkema, Rozanda Jin, Qiaoyi Liang, Constance Scharff, Wouter H. Halfwerk, Katharina Riebel.
Abstract
Bird song and human speech are learned early in life, and in both cases engagement with live social tutors generally leads to better learning outcomes than passive audio-only exposure. Real-world tutor–tutee relations are normally not uni- but multimodal, and observations suggest that visual cues related to sound production might enhance vocal learning. We tested this hypothesis by pairing appropriate, colour-realistic, high frame-rate videos of a singing adult male zebra finch tutor with song playbacks and presenting these stimuli to juvenile zebra finches (Taeniopygia guttata). Juveniles exposed to song playbacks combined with video presentation of a singing bird approached the stimulus more often and spent more time close to it than juveniles exposed to audio playback only or to audio playback combined with pixelated, time-reversed videos. However, higher engagement with the realistic audio-visual stimuli was not predictive of better song learning. Thus, although multimodality increased stimulus engagement, and biologically relevant video content was more salient than colour- and movement-equivalent videos, the higher engagement with the realistic audio-visual stimuli did not lead to enhanced vocal learning. Whether the lack of three-dimensionality of a video tutor and/or the lack of meaningful social interaction makes video tutors less suitable for facilitating song learning than audio-visual exposure to a live tutor remains to be tested.
Keywords: Bird song; Multimodal communication; Video tutors; Vocal development
Year: 2021 PMID: 34405288 PMCID: PMC8940817 DOI: 10.1007/s10071-021-01547-8
Source DB: PubMed Journal: Anim Cogn ISSN: 1435-9448 Impact factor: 2.899
Fig. 1 Overview of the different tutoring treatments in this study. The audio–video treatment consisted of synchronous sound and video exposure (120 fps video, with sound and beak movements aligned; for an example see Online Resource 1). The audio-pixel treatment consisted of the same song and the same video, but the video was pixelated and played back in reversed order (for an example see Online Resource 2). In the audio treatment, only the audio channel of the song was played back
Description of the different tutoring schedules used in this study
| Schedule | # daily tutoring sessions | Daily tutoring times | # songs/session | # songs/day | Inter-song interval | # tutor groups |
|---|---|---|---|---|---|---|
| 1 | 3 | 8:15, 12:15, 16:15 | 10 | 30 | Fixed, 1 min | 3 |
| 2 | 4 | 8:15, 10:15, 12:15, 16:15 | 12 | 48 | Variable, range 2–6 s^c | 9 |
| 3^a | 8 | 8:15, 8:45, 9:15, 10:15, 12:15, 13:30, 14:45, 16:15 | 24 | 192 | Variable, range 2–6 s | 4^b |
^a With this schedule, no birds were tutored in the Audio condition
^b All tutor groups had a different tutor song, but these four groups received the songs of 4 of the tutors used in schedule 2
^c The playback program used random inter-song intervals in the given range (see the sketch below)
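As footnote c notes, songs within a session were separated by randomised intervals. A minimal sketch of such a playback scheduler, with a `play_song` stand-in for the actual playback call (the playback program used in the study is not described in this record, so all names here are illustrative):

```python
import random
import time

def play_song(path: str) -> None:
    print(f"playing {path}")  # stand-in for the real audio playback call

def run_session(path: str, n_songs: int, interval_range: tuple[float, float]) -> None:
    """Play one tutoring session: n_songs playbacks separated by a random
    inter-song interval drawn uniformly from interval_range (seconds)."""
    for i in range(n_songs):
        play_song(path)
        if i < n_songs - 1:  # no pause needed after the last song
            time.sleep(random.uniform(*interval_range))

# Schedule 2: 4 daily sessions of 12 songs each, inter-song interval 2-6 s
# run_session("tutor_song.wav", n_songs=12, interval_range=(2.0, 6.0))
```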
Fig. 8 Absolute radiance of the ASUS gaming monitors used for stimulus presentation
Fig. 9 Example frames from a video stimulus. a Original frame before colour adjustment. b Colour adjusted frame which was used for stimulus presentation. Note that the colours were adjusted for presentation on a particular screen (VG248QE, ASUS, Taipei, Taiwan) and that colours might deviate if shown on a different screen or in a printed version
Fig. 10 Power spectra of one motif of one of the tutors in the original recording (a) and re-recorded after playback in the experimental set-up (b). Spectrograms of the same original recording (c) and of the re-recorded playback in the experimental set-up (d)
Fig. 11 Image of the random pixels used for the displacement filter to generate the pixelated videos
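Together, the displacement filter shown in Fig. 11 and the reversed playback order described in Fig. 1 define the audio-pixel manipulation. The following numpy sketch illustrates the idea under stated assumptions (frames as uint8 arrays, uniform random per-pixel offsets); it is not the authors' actual video pipeline:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def displace(frame: np.ndarray, max_shift: int = 20) -> np.ndarray:
    """Remap each pixel by a random (dy, dx) offset taken from a random-pixel
    map, scrambling spatial structure while keeping the colour distribution."""
    h, w = frame.shape[:2]
    dy = rng.integers(-max_shift, max_shift + 1, size=(h, w))
    dx = rng.integers(-max_shift, max_shift + 1, size=(h, w))
    ys = np.clip(np.arange(h)[:, None] + dy, 0, h - 1)
    xs = np.clip(np.arange(w)[None, :] + dx, 0, w - 1)
    return frame[ys, xs]

def make_audio_pixel_video(frames: list[np.ndarray]) -> list[np.ndarray]:
    """Pixelate every frame, then reverse the frame order (time reversal)."""
    return [displace(f) for f in frames][::-1]
```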
Fig. 3 Schematic top view of the experimental set-up. In the set-up for the audio group, there was no screen next to the cage. For the behaviour observations, we divided the cage into three areas, with 1 being the perch area nearest to the screen (8 cm of perch), 2 being an intermediate area (60 cm of perch) and 3 the perch area furthest from the screen (104 cm of perch). The dotted rectangle indicates the location of the loudspeaker (hanging 50 cm above the cage). F = food, W = water. Food and water bottles were placed on the floor of the cage
Overview of song analysis parameters used in this study and the sample that was used to calculate them
| Parameter | Definition | Sample per bird used to calculate the parameter |
|---|---|---|
| Typical motif | Most frequently produced motif | 20 random songs |
| Full motif | Motif with highest # different syllables in bird’s repertoire | 20 random songs |
| Total number of syllables | # syllables in a tutee’s typical motif | Typical motif |
| Number of unique syllables | # unique syllables in a tutee’s full motif | Full motif |
| Linearity | | 20 random songs |
| Consistency | | 20 random songs |
| Human observer similarity score model–tutee | Human observer similarity scores comparing the model’s to the tutee’s motif | Full motif |
| Human observer similarity score tutee–model | Human observer similarity scores comparing the tutee’s to the model’s motif | Full motif |
| SAP similarity score tutor–tutee | SAP similarity scores comparing tutors’ to tutees’ motifs | 10 random motifs |
| SAP similarity score tutee–tutor | SAP similarity scores comparing tutees’ to tutors’ motifs | 10 random motifs |
| Luscinia similarity score | 1 − Luscinia distance score for comparison of tutor and tutee motifs | 10 random motifs |
| SAP stereotypy score | SAP similarity scores for the comparison between tutee motifs | 10 random motifs |
| Luscinia stereotypy score | 1 − Luscinia distance scores for the comparison between tutee motifs | 10 random motifs |
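The Definition cells for Linearity and Consistency are blank in this record. Assuming they follow the standard sequence measures of Scharff and Nottebohm (1991), which these parameter names usually denote, a sketch over syllable-labelled songs (one letter per syllable, as in Fig. 2):

```python
from collections import Counter

def linearity(songs: list[str]) -> float:
    """Number of different syllables divided by number of different
    transitions; lower values indicate more variable sequencing."""
    syllables = {s for song in songs for s in song}
    transitions = {t for song in songs for t in zip(song, song[1:])}
    return len(syllables) / len(transitions)

def consistency(songs: list[str]) -> float:
    """Proportion of all transitions that are 'typical', i.e. the most
    frequent transition out of each syllable."""
    counts = Counter(t for song in songs for t in zip(song, song[1:]))
    by_origin: dict[str, list[int]] = {}
    for (a, _b), n in counts.items():
        by_origin.setdefault(a, []).append(n)
    typical = sum(max(ns) for ns in by_origin.values())
    return typical / sum(counts.values())

# e.g. linearity(["abcd", "abcd", "abdc"]) over 20 random songs per bird
```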
Fig. 2 Spectrograms of the full motif of the tutor, the unfamiliar full motif of another adult male and three tutees from one tutor group. Letters above tutor and unfamiliar song spectrograms indicate how syllables were labelled with letters for further analyses. Human observer similarity between tutor/unfamiliar song and tutees was scored on a scale from 0 to 3. Syllables marked with the same colour and with the same label above them had a total similarity score of 4 or higher when the similarity scores of all three observers for this comparison were summed up
Fig. 5 The average number of direct screen approaches during the stimulus presentations. Values are the average per tutee over the three scored presentations per recording day; on every fifth day of the tutoring period, three (out of four) tutoring sessions were recorded and scored. *Indicates p < 0.05, GLMM see Table 4
Details of best model (LMM) for the proportion of time spent in different areas of the cage, corrected for the perch length in that area
| Response variable^a | Model term | Level | Estimate | SE | t |
|---|---|---|---|---|---|
| Prop. of time spent, corrected for perch length | Intercept | | 0.69 | 0.03 | 20.49 |
| | Treatment | Audio–video | 0.32 | 0.05 | 6.72 |
| | | Audio-pixel | 0.14 | 0.05 | 2.95 |
| | Location | Area 2 | − 0.07 | 0.05 | − 1.56 |
| | | Area 3 | − 0.17 | 0.05 | − 3.61 |
| | Location × treatment | Area 2 × audio–video | − 0.51 | 0.07 | − 7.61 |
| | | Area 3 × audio–video | − 0.50 | 0.07 | − 7.48 |
| | | Area 2 × audio-pixel | − 0.23 | 0.07 | − 3.34 |
| | | Area 3 × audio-pixel | − 0.21 | 0.07 | − 3.16 |
^a LMM with random factor ‘Tutor group’. For post-hoc comparisons see Appendix, Table 11
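A hedged sketch of how a model of this form could be fitted in Python with statsmodels (the software actually used is not named in this record, and the column names below are illustrative):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per bird x area x treatment; file and column names are assumptions
df = pd.read_csv("time_in_areas.csv")

# LMM: treatment x location fixed effects, random intercept for tutor group
model = smf.mixedlm("prop_time_corrected ~ treatment * area",
                    data=df, groups=df["tutor_group"])
result = model.fit()
print(result.summary())  # estimates and SEs as in the table above
```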
Fig. 4 Proportion of time spent in the different cage areas, corrected for the total perch length in that area. Box plots indicate the median (mid-line), interquartile range (box), and 1.5 times the interquartile range (whiskers). Data points beyond this range are plotted as individual points. Different letters above boxes indicate a significant difference of p < 0.05 according to post hoc tests (see Appendix, Table 11), LMM see Table 3
Results for post-hoc comparison of proportion of time that tutees spent in different areas of the cage (corrected for the perch length in that area)
| Contrast | Estimate | SE | t | p |
|---|---|---|---|---|
| Area 1 audio vs Area 1 audio–video | − 0.32 | 0.05 | − 6.72 | **< 0.05** |
| Area 1 audio vs Area 1 audio-pixel | − 0.14 | 0.05 | − 2.95 | 0.09 |
| Area 1 audio vs Area 2 audio | 0.07 | 0.05 | 1.56 | 0.82 |
| Area 1 audio vs Area 2 audio–video | 0.27 | 0.05 | 5.60 | **< 0.05** |
| Area 1 audio vs Area 2 audio-pixel | 0.16 | 0.05 | 3.33 | **< 0.05** |
| Area 1 audio vs Area 3 audio | 0.17 | 0.05 | 3.61 | **< 0.05** |
| Area 1 audio vs Area 3 audio–video | 0.36 | 0.05 | 7.47 | **< 0.05** |
| Area 1 audio vs Area 3 audio-pixel | 0.24 | 0.05 | 5.13 | **< 0.05** |
| Area 1 audio–video vs Area 1 audio-pixel | 0.18 | 0.05 | 3.77 | **< 0.05** |
| Area 1 audio–video vs Area 2 audio | 0.40 | 0.05 | 8.28 | **< 0.05** |
| Area 1 audio–video vs Area 2 audio–video | 0.59 | 0.05 | 12.32 | **< 0.05** |
| Area 1 audio–video vs Area 2 audio-pixel | 0.48 | 0.05 | 10.05 | **< 0.05** |
| Area 1 audio–video vs Area 3 audio | 0.49 | 0.05 | 10.33 | **< 0.05** |
| Area 1 audio–video vs Area 3 audio–video | 0.68 | 0.05 | 14.19 | **< 0.05** |
| Area 1 audio–video vs Area 3 audio-pixel | 0.57 | 0.05 | 11.85 | **< 0.05** |
| Area 1 audio-pixel vs Area 2 audio | 0.22 | 0.05 | 4.51 | **< 0.05** |
| Area 1 audio-pixel vs Area 2 audio–video | 0.41 | 0.05 | 8.55 | **< 0.05** |
| Area 1 audio-pixel vs Area 2 audio-pixel | 0.30 | 0.05 | 6.28 | **< 0.05** |
| Area 1 audio-pixel vs Area 3 audio | 0.31 | 0.05 | 6.56 | **< 0.05** |
| Area 1 audio-pixel vs Area 3 audio–video | 0.50 | 0.05 | 10.42 | **< 0.05** |
| Area 1 audio-pixel vs Area 3 audio-pixel | 0.39 | 0.05 | 8.08 | **< 0.05** |
| Area 2 audio vs Area 2 audio–video | 0.19 | 0.05 | 4.04 | **< 0.05** |
| Area 2 audio vs Area 2 audio-pixel | 0.08 | 0.05 | 1.77 | 0.70 |
| Area 2 audio vs Area 3 audio | 0.10 | 0.05 | 2.05 | 0.51 |
| Area 2 audio vs Area 3 audio–video | 0.28 | 0.05 | 5.90 | **< 0.05** |
| Area 2 audio vs Area 3 audio-pixel | 0.17 | 0.05 | 3.57 | **< 0.05** |
| Area 2 audio–video vs Area 2 audio-pixel | − 0.11 | 0.05 | − 2.27 | 0.37 |
| Area 2 audio–video vs Area 3 audio | − 0.09 | 0.05 | − 1.98 | 0.56 |
| Area 2 audio–video vs Area 3 audio–video | 0.09 | 0.05 | 1.87 | 0.64 |
| Area 2 audio–video vs Area 3 audio-pixel | − 0.02 | 0.05 | − 0.46 | 0.99 |
| Area 2 audio-pixel vs Area 3 audio | 0.01 | 0.05 | 0.28 | 1.00 |
| Area 2 audio-pixel vs Area 3 audio–video | 0.20 | 0.05 | 4.14 | **< 0.05** |
| Area 2 audio-pixel vs Area 3 audio-pixel | 0.09 | 0.05 | 1.80 | 0.68 |
| Area 3 audio vs Area 3 audio–video | 0.18 | 0.05 | 3.85 | **< 0.05** |
| Area 3 audio vs Area 3 audio-pixel | 0.07 | 0.05 | 1.52 | 0.84 |
| Area 3 audio–video vs Area 3 audio-pixel | − 0.11 | 0.05 | − 2.33 | 0.33 |
Significant p-values are indicated in bold
Details of best model (GLMM) for the number of screen approaches
| Response variable^a | Model term | Level | Estimate | SE | z |
|---|---|---|---|---|---|
| Number of screen approaches | Intercept | | − 4.15 | 0.70 | − 5.89 |
| | Treatment | Audio–video | 3.46 | 0.66 | 5.21 |
| | | Audio-pixel | 2.45 | 0.70 | 3.51 |
Significant p-values are given in bold
^a Negative binomial GLMM with random factor ‘Tutor group’. Significant post-hoc comparisons: audio vs. audio–video: estimate: − 3.46, SE: 0.66, z: − 5.21, p < 0.01, audio vs. audio-pixel: estimate: − 2.45, SE: 0.70, z: − 3.51, p < 0.01, audio–video vs. audio-pixel: estimate: 1.02, SE: 0.38, z: 2.67, p < 0.05
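For illustration, a negative binomial model of the approach counts could look as follows in Python. One caveat: statsmodels has no native negative binomial GLMM, so this sketch omits the ‘Tutor group’ random factor that the authors' model included (fitting that would need, e.g., R's glmmTMB); file and column names are assumptions.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("screen_approaches.csv")  # assumed: one row per observation

# Fixed-effects-only negative binomial GLM as an approximation
result = smf.glm("n_approaches ~ treatment", data=df,
                 family=sm.families.NegativeBinomial()).fit()
print(result.summary())
```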
Mean values of song structure and performance parameters and details on ANOVA for comparison between null model and model including ‘treatment’ as a fixed effect
| Parameter | Tutor (not in models), mean ± SD | Audio–video, mean ± SD | Audio-pixel, mean ± SD | Audio, mean ± SD | n | χ² (df = 2) | p |
|---|---|---|---|---|---|---|---|
| Total nr syllables | 6.33 ± 1.44 | 5.08 ± 1.38 | 6.46 ± 1.76 | 5.25 ± 2.34 | 42 | 2.56 | 0.28 |
| Nr unique syllables | 5.25 ± 1.60 | 4.60 ± 1.30 | 4.93 ± 1.44 | 4.42 ± 0.51 | 42 | 0.40 | 0.82 |
| Linearity | 0.46 ± 0.12 | 0.41 ± 0.11 | 0.40 ± 0.10 | 0.44 ± 0.09 | 42 | 0.85 | 0.66 |
| Consistency | 0.94 ± 0.04 | 0.89 ± 0.08 | 0.90 ± 0.07 | 0.92 ± 0.08 | 42 | 0.77 | 0.68 |
In the models, only the data from the tutees in the different tutoring treatments were compared (the tutor data were not included in the models)
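The χ² and p columns above come from comparing each null model against the model with ‘treatment’ added. A minimal likelihood-ratio-test sketch, assuming two already-fitted nested models and k = 2 extra parameters (the two non-reference treatment levels):

```python
from scipy.stats import chi2

def lr_test(llf_null: float, llf_full: float, k: int) -> tuple[float, float]:
    """Likelihood-ratio test: chi-squared statistic and its p-value for
    k additional parameters in the full model."""
    stat = 2.0 * (llf_full - llf_null)
    return stat, chi2.sf(stat, df=k)

# e.g. lr_test(null_fit.llf, treatment_fit.llf, k=2) -> (2.56, 0.28)
# reproduces the 'Total nr syllables' row above
```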
Details of best models for the song structure and performance parameters
| Response variable | Model term | Level | Estimate | SE | t/z |
|---|---|---|---|---|---|
| (A) Total number of syllables^a | Intercept | | 1.72 | 0.07 | 25.16 |
| (B) Number of unique syllables^a | Intercept | | 1.54 | 0.07 | 21.57 |
| (C) Linearity^b | Intercept | | 0.51 | 0.04 | 14.49 |
| | Schedule | Schedule 2 | − 0.12 | 0.04 | − 3.04 |
| | | Schedule 3 | − 0.12 | 0.05 | − 2.47 |
| (D) Consistency^b | Intercept | | 0.90 | 0.02 | 49.34 |
^a GLMM with a Poisson distribution and random factor ‘Tutor group’
^b LMM with random factor ‘Tutor group’
Pearson correlation coefficients for the human observer similarity scores (square-root transformed to meet assumptions of normality), the median SAP similarity scores and the median Luscinia similarity scores for the tutor to tutee comparison
| Comparison | n | r | p |
|---|---|---|---|
| Human observer sim. score—SAP sim. score | 42 | 0.04 | 0.98 |
| Human observer sim. score—Luscinia sim. score | 42 | | **< 0.05** |
| SAP sim. score—Luscinia sim. score | 42 | 0.14 | 0.44 |
Significant values are given in bold
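A sketch of one of these correlations in Python (placeholder data; in the study each vector held one value per tutee, n = 42, and the human observer scores were square-root transformed first):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
human_scores = rng.uniform(0.0, 3.0, size=42)  # placeholder, 0-3 rating scale
sap_scores = rng.uniform(0.0, 100.0, size=42)  # placeholder SAP % similarity

r, p = pearsonr(np.sqrt(human_scores), sap_scores)
print(f"r = {r:.2f}, p = {p:.2f}")
```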
Details of best models for the arcsine square-root transformed human observer similarity scores for the comparison of the model songs to the tutee songs (A) and the tutee songs to the model songs (B)
| Response variable | Model term | Level | Estimate | SE | t |
|---|---|---|---|---|---|
| (A) Model–tutee^a | Intercept | | 0.52 | 0.02 | 21.63 |
| | Model | Unfamiliar song | − 0.08 | 0.03 | − 2.34 |
| (B) Tutee–model^a | Intercept | | 0.57 | 0.02 | 23.41 |
| | Model | Unfamiliar song | − 0.08 | 0.03 | − 2.18 |
^a LMM with random factors ‘Tutor group’ and ‘Bird ID’
Details of models with ‘Treatment’ as fixed factor for the arcsine square-root transformed human observer similarity scores (A) and the best models for the arcsine square-root transformed SAP (B) and Luscinia (C) similarity scores
| Response variable | Model term | Level | Estimate (tutor–tutee) | SE | t | Estimate (tutee–tutor) | SE | t |
|---|---|---|---|---|---|---|---|---|
| (A) Human observer sim. scores^a | Intercept | | 0.62 | 0.05 | 12.18 | 0.64 | 0.05 | 14.05 |
| | Treatment | Audio–video | − 0.17 | 0.07 | − 2.58 | − 0.13 | 0.06 | − 2.16 |
| | | Audio-pixel | − 0.10 | 0.07 | − 1.48 | − 0.07 | 0.06 | − 1.18 |
| (B) SAP sim. scores^b | Intercept | | 1.00 | 0.05 | 18.59 | 1.07 | 0.04 | 27.07 |
| | Treatment | Audio–video | 0.06 | 0.05 | 1.01 | | | |
| | | Audio-pixel | 0.16 | 0.05 | 3.01 | | | |
| (C) Luscinia sim. scores^c | Intercept | | 1.19 | 0.01 | 109.69 | | | |
| | Treatment | Audio–video | − 0.024 | 0.01 | − 2.15 | | | |
| | | Audio-pixel | − 0.001 | 0.01 | − 0.07 | | | |
^a LMMs with random factor ‘Tutor group’. Significant post-hoc comparison tutor–tutee: audio vs. audio–video: estimate: 0.17, SE: 0.07, t: 2.56, p = 0.04
^b LMMs with random factor ‘Tutor group’. Significant post-hoc comparison tutor–tutee: audio vs. audio-pixel: estimate: − 0.16, SE: 0.06, t: − 2.99, p = 0.02. For the tutee–tutor comparison, ‘treatment’ was not included in the best model
^c LMMs with random factor ‘Tutor group’
Fig. 6 Graph showing the human observer similarity score for the tutor–tutee (a) and the tutee–tutor comparison (b), the SAP similarity score for the tutor–tutee (c) and the tutee–tutor (d) comparison and the Luscinia similarity score for the symmetric tutee and tutor comparison (e). *Indicates p < 0.05, LMMs see Table 9. NB human observer and SAP similarity scores calculate how much of one signal can be found in another signal. Therefore, when comparing two signals, two different comparisons can be made [what proportion of the tutor motif is found in the tutee motif (tutor–tutee) and what proportion of the tutee motif is found in the tutor motif (tutee–tutor)]. Luscinia does not calculate how much of one signal can be found in another signal, but calculates how dissimilar two signals are
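The caption's distinction between the two score conventions reduces to a simple transformation, shown here together with the arcsine square-root transform applied to scores before the LMMs in Tables 8–10. A sketch, assuming scores rescaled to [0, 1]:

```python
import numpy as np

def luscinia_similarity(distance: np.ndarray) -> np.ndarray:
    """Luscinia reports dissimilarity; similarity = 1 - distance."""
    return 1.0 - distance

def arcsine_sqrt(scores01: np.ndarray) -> np.ndarray:
    """Variance-stabilising transform for proportion-like scores in [0, 1]."""
    return np.arcsin(np.sqrt(scores01))
```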
Fig. 7 a SAP and b Luscinia stereotypy scores for the 10 randomly selected tutee motifs
Details of best models for the (arcsine square-root transformed) SAP (A) and Luscinia (B) stereotypy scores
| Response variable^a | Model term | Level | Estimate | SE | t |
|---|---|---|---|---|---|
| (A) SAP stereotypy score | Intercept | | 1.23 | 0.04 | 30.15 |
| | Schedule | Schedule 2 | − 0.15 | 0.05 | − 3.30 |
| | | Schedule 3 | − 0.23 | 0.06 | − 4.09 |
| (B) Luscinia stereotypy score | Intercept | | 1.32 | 0.005 | 249.4 |
^a LMMs with random factor ‘Tutor group’