Peter Washington, Qandeel Tariq, Emilie Leblanc, Brianna Chrisman, Kaitlyn Dunlap, Aaron Kline, Haik Kalantarian, Yordan Penev, Kelley Paskov, Catalin Voss, Nathaniel Stockham, Maya Varma, Arman Husic, Jack Kent, Nick Haber, Terry Winograd, Dennis P Wall.
Abstract
Standard medical diagnosis of mental health conditions requires licensed experts who are increasingly outnumbered by those at risk, limiting reach. We test the hypothesis that a trustworthy crowd of non-experts can efficiently annotate the behavioral features needed for accurate machine learning detection of the common childhood developmental disorder Autism Spectrum Disorder (ASD) in children under 8 years old. We implement a novel process for identifying and certifying a trustworthy distributed workforce for video feature extraction, selecting a workforce of 102 workers from a pool of 1,107. Two previously validated ASD logistic regression classifiers, evaluated against parent-reported diagnoses, were used to assess the accuracy of the trusted crowd's ratings of unstructured home videos. A representative, balanced sample of videos (N = 50) was evaluated with and without face box and pitch shift privacy alterations, with AUROC and AUPRC scores > 0.98. With both privacy-preserving modifications, sensitivity is preserved (96.0%) while maintaining specificity (80.0%) and accuracy (88.0%) at levels comparable to prior classification methods without alterations. We find that machine learning classification from features extracted by a certified non-expert crowd achieves high performance for ASD detection from natural home videos of the child at risk and maintains high sensitivity when privacy-preserving mechanisms are applied. These results suggest that privacy-safeguarded crowdsourced analysis of short home videos can help enable rapid and mobile machine-learning detection of developmental delays in children.
Year: 2021 PMID: 33828118 PMCID: PMC8027393 DOI: 10.1038/s41598-021-87059-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overview of the crowd-powered AI detection process. (a) A trustworthy crowd is selected through a filtration process involving an evaluation set of videos. (b) A diagnosis- and gender-balanced set of unstructured videos is evaluated both with and without a set of privacy-preserving alterations: pitch shift and face obfuscation. (c) The curated crowd extracts behavioral features about the children in the videos by answering a set of multiple choice questions about the child’s behavior exhibited in the video, with each worker assigned to a random subset of the videos. (d) A classifier trained on electronic medical records (the “training set”) corresponding to the multiple choice answers to behavioral questions is used to predict the diagnosis from the aggregated video-wide annotations (the “test set”), and the classifications are compared against the known diagnoses of the videos in the test set.
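The aggregation step in (c) and (d) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each multiple choice answer is encoded as a non-negative integer, and the question ids, the tie-breaking rule for the mode, and the round-half-up rule for the mean are all assumptions made here. The three methods (mode, median, rounded mean) are the ones named in the figures and tables of this record.

```python
import math
from collections import Counter
from statistics import mean, median


def aggregate(ratings, method="mode"):
    """Collapse several workers' ordinal answers to one question into one value."""
    if not ratings:
        raise ValueError("need at least one rating")
    if method == "mode":
        counts = Counter(ratings)
        top = max(counts.values())
        # Assumption: ties go to the smallest tied answer; the source does not say.
        return min(value for value, count in counts.items() if count == top)
    if method == "median":
        # With an even number of workers this can fall halfway between two answers.
        return median(ratings)
    if method == "mean":
        # Assumption: round half up. Python's built-in round() rounds halves to even.
        return math.floor(mean(ratings) + 0.5)
    raise ValueError(f"unknown method: {method}")


def aggregate_video(answers, method="mode"):
    """answers maps a question id to the list of worker ratings for one video."""
    return {question: aggregate(r, method) for question, r in answers.items()}


# Hypothetical ratings from five workers for two made-up questions.
video_answers = {"q1": [0, 1, 1, 3, 3], "q2": [2, 2, 2, 1, 0]}
for name in ("mode", "median", "mean"):
    print(name, aggregate_video(video_answers, name))
```

On this toy input the mode and median agree, while the rounded mean gives a different answer for both questions, which is one way the choice of aggregation method can change the feature vector passed to the classifier.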
Figure 2. Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves of the classifiers evaluated on aggregated features from the filtered crowd raters. The blue line shows the performance of the LR5 classifier and the green line shows the performance of the LR10 classifier. ROC curves are shown for classifier input features aggregated using the (A) mode, (B) round of the mean, and (C) median of the crowd worker responses. The true positive rate is plotted against the false positive rate for different class cutoffs of the logistic regression classifier’s output probability. PR curves are shown for classifier input features aggregated using the (D) mode, (E) round of the mean, and (F) median of the crowd worker responses. Precision is plotted against recall for different class cutoffs of the logistic regression classifier’s output probability. For both ROC and PR curves, an area under the curve closer to 1.0 indicates better performance, and a value of 0.5 indicates random guessing by the classifier.
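As a rough sketch of how an ROC curve of this kind is traced, the snippet below sweeps the class cutoff over a classifier's output probabilities and integrates the resulting points with the trapezoidal rule. The labels and scores are made-up toy values, not data from the study, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct cutoff, strictest cutoff first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is called positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: 1 = ASD, 0 = neurotypical; scores are predicted probabilities.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
curve = roc_points(labels, scores)
print(curve)
print(area_under_curve(curve))  # 0.9375
```

The area equals the fraction of (positive, negative) pairs that the scores rank correctly: here 15 of the 16 pairs, since only the negative scored 0.7 outranks the positive scored 0.6.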
Performance of the machine learning classifiers on aggregated crowd features when using the majority rules (mode), median, and mean aggregation methods.
| Aggregation method | Accuracy, LR10 (%) | Accuracy, LR5 (%) | Precision, LR10 (%) | Precision, LR5 (%) | Sensitivity/recall, LR10 (%) | Sensitivity/recall, LR5 (%) | Specificity, LR10 (%) | Specificity, LR5 (%) |
|---|---|---|---|---|---|---|---|---|
| Mode | 96.0 ± 5.0 | 92.0 ± 7.0 | 100.0 ± 0.0 | 95.7 ± 7.0 | 92.0 ± 10.0 | 88.0 ± 13.0 | 100.0 ± 0.0 | 96.0 ± 6.7 |
| Median | 92.0 ± 7.0 | 92.0 ± 7.0 | 88.9 ± 12.0 | 92.0 ± 10.4 | 96.0 ± 6.8 | 92.0 ± 10.0 | 88.0 ± 13.0 | 92.0 ± 10.4 |
| Mean (rounded) | 90.0 ± 8.0 | 98.0 ± 3.0 | 85.7 ± 12.4 | 100.0 ± 0.0 | 96.0 ± 6.8 | 96.0 ± 6.8 | 84.0 ± 13.7 | 100.0 ± 0.0 |
Performance metrics from the LR10 and LR5 classifiers are shown respectively. A probability threshold of 0.5 was used to distinguish the ASD and neurotypical classes.
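The four reported metrics follow from the confusion matrix at the 0.5 cutoff. The sketch below is illustrative only: the function name is hypothetical, the choice of `>=` rather than `>` at the cutoff is an assumption, and the toy probabilities are invented so that the counts (23 true positives, 2 false negatives, 25 true negatives, 0 false positives) match what the Mode/LR10 percentages imply if the balanced 50-video set is split 25/25, which is itself an assumption.

```python
def classification_metrics(labels, probabilities, threshold=0.5):
    """Accuracy, precision, sensitivity and specificity at one probability cutoff."""
    # Assumption: a probability exactly at the threshold counts as ASD (label 1).
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else nan,
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
    }


# Invented probabilities: 25 ASD videos (2 missed) and 25 neurotypical videos.
labels = [1] * 25 + [0] * 25
probabilities = [0.9] * 23 + [0.2] * 2 + [0.1] * 25
metrics = classification_metrics(labels, probabilities)
print({name: round(100 * value, 1) for name, value in metrics.items()})
```

With these counts the sketch gives 96.0% accuracy, 100.0% precision, 92.0% sensitivity, and 100.0% specificity, the same point estimates as the Mode/LR10 entries in the table. It does not attempt to reproduce the ± intervals, whose derivation is not described in this record.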
Figure 3. ROC curves of the classifiers evaluated on aggregated features from the filtered crowd raters under each privacy condition. The true positive rate is plotted against the false positive rate for different class cutoffs of the logistic regression classifier’s output probability. The color of the curve represents the privacy condition: blue represents unaltered videos, green represents face obfuscation, red represents pitch shift, and purple represents face obfuscation and pitch shift. Plots show aggregated results using the (A,D) mode, (B,E) median, and (C,F) round of the mean of the crowd worker responses. The ROC curves are shown for both the LR5 (A–C) and LR10 (D–F) classifiers. An area under the curve closer to 1.0 indicates better performance, and a value of 0.5 indicates random guessing by the classifier.
Figure 4. PR curves of the classifiers evaluated on aggregated features from the filtered crowd raters under each privacy condition. Precision is plotted against recall for different class cutoffs of the logistic regression classifier’s output probability. The color of the curve represents the privacy condition: blue represents unaltered videos, green represents face obfuscation, red represents pitch shift, and purple represents face obfuscation and pitch shift. Plots show aggregated results using the (A,D) mode, (B,E) median, and (C,F) round of the mean of the crowd worker responses. The PR curves are shown for both the LR5 (A–C) and LR10 (D–F) classifiers. An area under the curve closer to 1.0 indicates better performance, and a value of 0.5 indicates random guessing by the classifier.
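The 0.5 reference level for a PR curve depends on the class balance, which a short sketch makes concrete. The function name and toy data below are hypothetical and are not taken from the study; the snippet only shows how precision and recall move as the cutoff is lowered.

```python
def pr_points(labels, scores):
    """Return (recall, precision) pairs, one per distinct cutoff, strictest first."""
    pos = sum(labels)
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        # True labels of every item the classifier calls positive at this cutoff.
        called_positive = [y for y, s in zip(labels, scores) if s >= cutoff]
        tp = sum(called_positive)
        points.append((tp / pos, tp / len(called_positive)))
    return points


# Toy balanced example: 1 = ASD, 0 = neurotypical.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
for recall, precision in pr_points(labels, scores):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

At the loosest cutoff every video is called positive, so recall is 1.0 and precision equals the fraction of positives in the set. For a balanced set that fraction is 0.5, which is why 0.5 serves as the chance reference here; on an imbalanced set the reference level would shift accordingly.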
Performance of the LR10 classifier on aggregated crowd features across privacy-preserving mechanisms when using the mode, median, and mean aggregation methods, respectively.
| Privacy mechanism | Accuracy, mode (%) | Accuracy, median (%) | Accuracy, mean (%) | Precision, mode (%) | Precision, median (%) | Precision, mean (%) | Sensitivity [recall], mode (%) | Sensitivity [recall], median (%) | Sensitivity [recall], mean (%) | Specificity, mode (%) | Specificity, median (%) | Specificity, mean (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unaltered | 96.0 | 92.0 | 90.0 | 100.0 | 88 | 85.7 | 92.0 | 96.0 | 96.0 | 100.0 | 88.0 | 84.0 |
| Face box | 94.0 | 88.0 | 82.0 | 95.8 | 85.2 | 73.5 | 92.0 | 92.0 | 100.0 | 96.0 | 84.0 | 96.0 |
| Pitch shift | 82.0 | 82.0 | 88.0 | 83.3 | 73.6 | 71.4 | 80.0 | 100.0 | 100.0 | 84.0 | 64.0 | 60.0 |
| Face box and pitch shift | 86.0 | 78.0 | 80.0 | 84.6 | 70.1 | 71.4 | 88.0 | 96.0 | 100.0 | 84.0 | 60.0 | 60.0 |
Sensitivity of the classifier is retained even with the most stringent privacy-preserving mechanisms. A probability threshold of 0.5 was used to distinguish the ASD and neurotypical classes.
Performance of the LR5 classifier on aggregated crowd features across privacy-preserving mechanisms when using the mode, median, and mean aggregation methods, respectively.
| Privacy mechanism | Accuracy, mode (%) | Accuracy, median (%) | Accuracy, mean (%) | Precision, mode (%) | Precision, median (%) | Precision, mean (%) | Sensitivity [recall], mode (%) | Sensitivity [recall], median (%) | Sensitivity [recall], mean (%) | Specificity, mode (%) | Specificity, median (%) | Specificity, mean (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unaltered | 92.0 | 92.0 | 98.0 | 95.7 | 92.0 | 100.0 | 88.0 | 92.0 | 96.0 | 96.0 | 92.0 | 100.0 |
| Face box | 86.0 | 88.0 | 86.0 | 87.5 | 82.8 | 80.0 | 84.0 | 96.0 | 96.0 | 88.0 | 80.0 | 76.0 |
| Pitch shift | 84.0 | 92.0 | 88.0 | 84.0 | 86.2 | 82.8 | 84.0 | 100.0 | 96.0 | 84.0 | 84.0 | 80.0 |
| Face box and pitch shift | 76.0 | 82.0 | 88.0 | 74.1 | 73.5 | 82.8 | 80.0 | 100.0 | 96.0 | 72.0 | 64.0 | 80.0 |
Sensitivity of the classifier is retained even with the most stringent privacy-preserving mechanisms. A probability threshold of 0.5 was used to distinguish the ASD and neurotypical classes.