Hugo Gonzalez Villasanti, Laura M Justice, Leidy Johana Chaparro-Moreno, Tzu-Jung Lin, Kelly Purtell.
Abstract
The present study explored whether a tool for automatic detection and recognition of interactions and child-directed speech (CDS) in preschool classrooms could be developed, validated, and applied to non-coded video recordings representing children's classroom experiences. Using first-person video recordings collected by 13 preschool children during a morning in their classrooms, we extracted high-level audiovisual features from recordings using automatic speech recognition and computer vision services from a cloud computing provider. Using manual coding for interactions and transcriptions of CDS as reference, we trained and tested supervised classifiers and linear mappings to measure five variables of interest. We show that the supervised classifiers trained with speech activity, proximity, and high-level facial features achieve adequate accuracy in detecting interactions. Furthermore, in combination with an automatic speech recognition service, the supervised classifier achieved error rates for CDS measures that are in line with other open-source automatic decoding tools in early childhood settings. Finally, we demonstrate our tool's applicability by using it to automatically code and transcribe children's interactions and CDS exposure vertically within a classroom day (morning to afternoon) and horizontally over time (fall to winter). Developing and scaling tools for automatized capture of children's interactions with others in the preschool classroom, as well as exposure to CDS, may revolutionize scientific efforts to identify precise mechanisms that foster young children's language development.
Year: 2020 PMID: 33237919 PMCID: PMC7688182 DOI: 10.1371/journal.pone.0242511
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Schematic view of the pipeline of the Classroom Interaction Detection and Recognition (CIDR) system.
Abbreviations correspond to child-directed speech (CDS), validation branch (V) and training branch (T). Boxes in light gray correspond to featurization modules deployed using Amazon Web Services (AWS).
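As a rough illustration of how the featurization modules in Fig 1 might be invoked, the sketch below calls AWS Rekognition for per-frame facial attributes and AWS Transcribe for speech-to-text. The region, media format, and helper names are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the AWS featurization step; parameters are assumptions.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")
transcribe = boto3.client("transcribe", region_name="us-east-1")

def detect_faces(frame_bytes):
    """Return face bounding boxes and high-level facial attributes for one video frame."""
    response = rekognition.detect_faces(
        Image={"Bytes": frame_bytes},
        Attributes=["ALL"],  # pose, landmarks, emotions, etc.
    )
    return response["FaceDetails"]

def start_transcription(job_name, media_uri):
    """Start an asynchronous speech-to-text job on an uploaded audio file."""
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},
        MediaFormat="wav",
        LanguageCode="en-US",
    )
```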
Leave-one-subject-out (LOSO) performance results for the detector of focal child interactions with adults and peers.
| Feature set | Adult detector: Acc | Adult detector: Prec | Adult detector: Rec | Adult detector: F1 | Peer detector: Acc | Peer detector: Prec | Peer detector: Rec | Peer detector: F1 |
|---|---|---|---|---|---|---|---|---|
| Reduced features | 80.8 | 90.0 | 85.4 | 87.6 | 86.1 | 98.4 | 87.0 | 92.3 |
| Full features | 81.1 | 92.3 | 84.2 | 88.1 | 85.2 | 96.5 | 87.4 | 91.7 |
Note: The reduced-features model is a bidirectional long short-term memory (BiLSTM) network trained with face-size and speech-activity data, while the full-features model is a BiLSTM network trained with the full set of facial features and speech activity. The performance metrics considered are accuracy (Acc), precision (Prec), recall (Rec), and F1 score (F1).
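For a concrete picture of the detector and its evaluation, the following Python sketch pairs a small BiLSTM binary classifier with a leave-one-subject-out loop. Layer sizes, window length, and training settings are illustrative assumptions, not the settings used in the study.

```python
# Minimal BiLSTM interaction detector with LOSO cross-validation (illustrative only).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_detector(n_timesteps, n_features):
    """Binary detector over short windows of face-size and speech-activity features."""
    model = keras.Sequential([
        layers.Input(shape=(n_timesteps, n_features)),
        layers.Bidirectional(layers.LSTM(32)),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def loso_accuracy(windows, labels, subject_ids):
    """Train on all children but one, test on the held-out child, and repeat."""
    scores = []
    for held_out in np.unique(subject_ids):
        train = subject_ids != held_out
        test = subject_ids == held_out
        model = build_detector(windows.shape[1], windows.shape[2])
        model.fit(windows[train], labels[train], epochs=10, verbose=0)
        _, acc = model.evaluate(windows[test], labels[test], verbose=0)
        scores.append(acc)
    return float(np.mean(scores))
```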
Leave-one-subject-out (LOSO) cross-validation results for the CDS recognition system.
| Speaker | Adaptation | TNU mARE | TNU r | TNW mARE | TNW r | NDW mARE | NDW r | MLU mARE | MLU r | TTR mARE | TTR r |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Raw | 39.0 | 0.75 | 51.3 | 0.79 | 24.3 | 0.84 | 19.7 | 0.32 | 47.2 | 0.88 |
| | Adapted | 12.9 | 0.69 | 34.9 | 0.73 | 9.7 | 0.80 | 7.0 | 0.18 | 12.7 | 0.85 |
| | Raw | 92.9 | 0.46 | 96.1 | 0.50 | 93.3 | 0.49 | 47.6 | 0.44 | 88.7 | 0.52 |
| | Adapted | 40.6 | 0.36 | 40.3 | 0.38 | 20.8 | 0.37 | 6.9 | 0.34 | 25.5 | 0.42 |
| | Raw | 76.4 | 0.44 | 88.7 | 0.44 | 75.0 | 0.54 | 37.2 | -0.11 | 100.0 | -0.56 |
| | Adapted | 40.9 | 0.38 | 47.8 | 0.37 | 27.6 | 0.48 | 24.0 | -0.58 | 23.5 | 0.51 |
Note: Language measures are total number of utterances (TNU), total number of words (TNW), number of different words (NDW), mean length of utterance (MLU), and type-token ratio (TTR). The performance metrics reported are median absolute relative error (mARE) and linear correlation (r). Raw refers to measures calculated from the unadjusted output of the automatic speech recognition service, while Adapted refers to measures after applying the trained linear mapping.
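The sketch below illustrates how the five language measures can be computed from an ASR transcript, how a per-measure linear adaptation could map raw ASR-derived values onto manually coded references, and how mARE and r are obtained. Utterance segmentation and the exact form of the mapping are assumptions.

```python
# Illustrative computation of CDS measures, linear adaptation, and error metrics.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

def language_measures(utterances):
    """utterances: list of strings, one per ASR-segmented utterance (assumed segmentation)."""
    tokens = [u.lower().split() for u in utterances]
    all_words = [w for u in tokens for w in u]
    tnu = len(utterances)                # total number of utterances
    tnw = len(all_words)                 # total number of words
    ndw = len(set(all_words))            # number of different words
    mlu = tnw / tnu if tnu else 0.0      # mean length of utterance, in words
    ttr = ndw / tnw if tnw else 0.0      # type-token ratio
    return np.array([tnu, tnw, ndw, mlu, ttr])

def fit_adaptation(raw_values, reference_values):
    """Fit a linear mapping from raw ASR-derived values to manually coded values."""
    return LinearRegression().fit(raw_values.reshape(-1, 1), reference_values)

def mare_and_r(predicted, reference):
    """Median absolute relative error (as a percentage) and linear correlation."""
    mare = np.median(np.abs(predicted - reference) / np.abs(reference)) * 100
    r, _ = pearsonr(predicted, reference)
    return mare, r
```

In a LOSO setup, `fit_adaptation` would be trained on the children in the training folds and applied, via `predict`, to the held-out child's raw measures before scoring.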
Fig 2. Total number of words (TNW) by adult speakers during interactions with the focal child.
Box edges represent the lower and upper quartiles, while the middle line corresponds to the median. Whiskers depict minimum and maximum non-outlier values. Measures are normed to 10 minutes.
Fig 3. Total number of words (TNW) uttered by peers during interactions with the focal child.
Box edges represent the lower and upper quartiles, while the middle line corresponds to the median. Whiskers depict minimum and maximum non-outlier values. Measures are normed to 10 minutes.
Fig 4. Total number of words (TNW) by the focal child during interactions with adults and peers.
Box edges represent the lower and upper quartiles, while the middle line corresponds to the median. Whiskers depict minimum and maximum non-outlier values. Measures are normed to 10 minutes.
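A minimal sketch of the 10-minute norming applied in Figs 2-4, assuming raw counts and the durations of the coded interaction segments are available; the function and variable names are illustrative.

```python
# Scale a raw word or utterance count to a rate per 10 minutes (600 seconds).
def norm_per_10_minutes(count, duration_seconds):
    return count * 600.0 / duration_seconds
```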