Rob Argent, Antonio Bevilacqua, Alison Keogh, Ailish Daly, Brian Caulfield.
Abstract
Machine learning models are being utilized to provide wearable sensor-based exercise biofeedback to patients undertaking physical therapy. However, most systems are validated at a technical level using lab-based cross-validation approaches. These results do not necessarily reflect the performance levels that patients and clinicians can expect in the real-world environment. This study aimed to conduct a thorough evaluation of an example wearable exercise biofeedback system from laboratory testing through to clinical validation in the target setting, illustrating the importance of context when validating such systems. Each of the various components of the system was evaluated independently, and then in combination as the system is designed to be deployed. The results show a reduction in overall system accuracy between lab-based cross-validation (>94%), testing on healthy participants (n = 10) in the target setting (>75%), through to test data collected from the clinical cohort (n = 11) (>59%). This study illustrates that the reliance on lab-based validation approaches may be misleading key stakeholders in the inertial sensor-based exercise biofeedback sector, makes recommendations for clinicians, developers and researchers, and discusses factors that may influence system performance at each stage of evaluation.
Keywords: biofeedback; biomedical technology; exercise therapy; human factors; inertial measurement unit; machine learning; wearables
Year: 2021 PMID: 33801763 PMCID: PMC8037109 DOI: 10.3390/s21072346
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. An example of a correctly segmented time-series of triaxial accelerometer data.
Figure 2. An example of classification of a time-series of triaxial accelerometer data with a sub-optimal repetition highlighted in red.
Figure 3. Steps of evaluation for the various machine learning components of the exercise biofeedback system.
Description of exercises and the errors assessed within the machine learning models.
| Exercise | Description of Exercise | Error Assessed |
|---|---|---|
| Heel Slide (HS) | In supine lying, the exercise is performed by flexing the hip and knee to slide the foot closer to the ipsi-lateral hip. | Excessive hip external rotation |
| Inner Range Quadriceps (IRQ) | In supine lying, a roll is placed under the knee to be exercised. The exercise is performed by contracting the quadriceps muscles to bring the knee from a position of slight flexion into full extension. | Hip flexion (raising knee off the towel) |
| Straight Leg Raise (SLR) | In supine lying, the exercise is performed by flexing the hip, lifting the leg off the supporting surface while keeping the knee in full extension, raising to a height above the contralateral toes. | Knee flexion (lag) |
| Seated Active Knee Extension (SAKE) | In sitting with the upper thigh supported on a chair, the exercise is performed by contracting the quadriceps to bring the knee from a position of flexion into full extension. | Lack of full knee extension |
Characteristics of the classification training data.
| Exercise | Participants | Exercise Sets | Total Repetitions | Correctly Performed Repetitions | Sub-Optimally Performed Repetitions |
|---|---|---|---|---|---|
| HS | 36 | 71 | 711 | 350 (49.2%) | 361 (50.8%) |
| IRQ | 35 | 68 | 679 | 351 (51.7%) | 328 (48.3%) |
| SLR | 37 | 69 | 689 | 370 (53.7%) | 319 (46.3%) |
| SAKE | 38 | 76 | 754 | 380 (50.4%) | 374 (49.6%) |
Figure 4. Illustration of IMU placement, orientation and user setup. Figure taken from Argent et al. (2019) [6].
Figure 5. Illustration of threshold for segmentation for points identified as the start of a repetition. The manual annotation for reference is highlighted in blue and the area for TP in green.
Figure 6. Illustration of threshold for segmentation for points identified as the end of a repetition. The manual annotation for reference is highlighted in blue and the area for TP in green.
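
Figures 5 and 6 describe a region around each manual annotation within which a detected start or end point counts as a true positive (TP). The following is a minimal sketch of that kind of matching, not the authors' implementation: the function name `match_boundaries`, the symmetric `tolerance` in samples, the greedy one-to-one matching rule and the sample indices are all assumptions made for illustration.

```python
def match_boundaries(detected, annotated, tolerance):
    """Count (TP, FP, FN) for detected boundary indices against annotations.

    Assumption: a detection is a true positive if it lies within `tolerance`
    samples of an annotation that has not already been matched.
    """
    unmatched = sorted(annotated)
    tp = 0
    fp = 0
    for point in sorted(detected):
        # Closest annotation that is still available for matching.
        best = min(unmatched, key=lambda a: abs(a - point), default=None)
        if best is not None and abs(best - point) <= tolerance:
            unmatched.remove(best)
            tp += 1
        else:
            fp += 1
    fn = len(unmatched)  # annotations that no detection matched
    return tp, fp, fn


# Hypothetical sample indices for repetition starts in one exercise set.
annotated_starts = [100, 400, 700, 1000]
detected_starts = [105, 390, 850, 1010]
print(match_boundaries(detected_starts, annotated_starts, tolerance=25))
```

In this hypothetical set, three detections fall inside a tolerance region, one detection (850) falls outside every region, and one annotation (700) is missed.
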
Lab-based results following leave-one-subject-out cross-validation.
| Exercise | Best Performing Algorithm | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| HS | Logistic Regression | 98.45 | 99.43 | 97.51 |
| IRQ | Logistic Regression | 92.05 | 93.73 | 90.24 |
| SLR | SVM | 94.78 | 96.22 | 93.10 |
| SAKE | Random Forest | 96.29 | 96.52 | 96.05 |
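
Leave-one-subject-out cross-validation holds out every repetition from one participant in turn, so no participant contributes to both the training and test data of the same fold. The sketch below shows only the splitting step in plain Python; the function name and the subject labels are hypothetical, and the model training and scoring performed inside each fold are omitted.

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject."""
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    for held_out, test_idx in by_subject.items():
        # Training data is every repetition from all other subjects.
        train_idx = [
            i
            for subject, indices in by_subject.items()
            if subject != held_out
            for i in indices
        ]
        yield held_out, train_idx, test_idx


# Hypothetical subject labels, one per recorded repetition.
subjects = ["p1", "p1", "p2", "p2", "p2", "p3"]
for held_out, train_idx, test_idx in leave_one_subject_out(subjects):
    print(held_out, train_idx, test_idx)
```

scikit-learn's `LeaveOneGroupOut` splitter provides the same behaviour when subject identifiers are passed as `groups`.
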
Data collected to form the test sets.
| Cohort | Exercise | Participants | Exercise Sets | Total Repetitions | Correctly Performed Repetitions | Sub-Optimally Performed Repetitions |
|---|---|---|---|---|---|---|
| Healthy | HS | 10 | 10 | 148 | 148 (100%) | 0 (0%) |
| Healthy | IRQ | 10 | 10 | 150 | 150 (100%) | 0 (0%) |
| Healthy | SLR | 10 | 10 | 150 | 150 (100%) | 0 (0%) |
| Healthy | SAKE | 10 | 10 | 150 | 150 (100%) | 0 (0%) |
| Clinical | HS | 10 | 23 | 320 | 320 (100%) | 0 (0%) |
| Clinical | IRQ | 10 | 18 | 270 | 203 (75.2%) | 67 (24.8%) |
| Clinical | SLR | 10 | 21 | 297 | 148 (49.8%) | 149 (50.2%) |
| Clinical | SAKE | 11 | 17 | 241 | 103 (42.7%) | 138 (57.3%) |
Classification performance following manual segmentation of test data.
| Cohort | Exercise | Accuracy, mean (%) | Accuracy, 95% CI | Sensitivity, mean (%) | Sensitivity, 95% CI | Specificity, mean (%) | Specificity, 95% CI |
|---|---|---|---|---|---|---|---|
| Healthy | HS | 100.00 | (100.00, 100.00) | 100.00 | (100.00, 100.00) | N/A * | N/A * |
| Healthy | IRQ | 84.67 | (67.83, 100.00) | 84.67 | (67.83, 100.00) | N/A * | N/A * |
| Healthy | SLR | 84.67 | (68.14, 100.00) | 84.67 | (68.14, 100.00) | N/A * | N/A * |
| Healthy | SAKE | 100.00 | (100.00, 100.00) | 100.00 | (100.00, 100.00) | N/A * | N/A * |
| Clinical | HS | 98.99 | (96.46, 100.00) | 98.99 | (96.46, 100.00) | N/A * | N/A * |
| Clinical | IRQ | 58.49 | (41.50, 75.45) | 52.70 | (31.93, 73.46) | 78.17 | (51.33, 100.00) |
| Clinical | SLR | 66.01 | (48.93, 83.10) | 49.14 | (20.16, 78.13) | 81.35 | (62.38, 100.00) |
| Clinical | SAKE | 65.17 | (43.92, 86.41) | 61.12 | (22.23, 100.00) | 68.00 | (37.36, 98.64) |
* Due to the unbalanced test set with no sub-optimal repetitions, it was not possible to calculate specificity.
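
The footnote follows directly from the definitions: specificity is TN / (TN + FP), so it is undefined when a test set contains no repetitions of the negative class. The sketch below assumes that a correctly performed repetition is the positive class, which is inferred from the table (sensitivity equals accuracy wherever every repetition is correct) rather than stated in this text. The function name and label strings are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive="correct"):
    """Return (sensitivity, specificity); None marks an undefined metric."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Each metric needs at least one example of the class it describes.
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Hypothetical test set with only correctly performed repetitions.
y_true = ["correct"] * 5
y_pred = ["correct", "correct", "correct", "correct", "suboptimal"]
print(sensitivity_specificity(y_true, y_pred))
```

With only positive-class repetitions in the set, sensitivity is computable (4 of 5 here) while specificity comes back as `None`, mirroring the N/A entries in the table.
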
Segmentation model performance with use of the test data.
| Cohort | Exercise | Precision, mean (%) | Precision, 95% CI | Recall, mean (%) | Recall, 95% CI | Accuracy, mean (%) | Accuracy, 95% CI |
|---|---|---|---|---|---|---|---|
| Healthy | HS | 96.21 | (94.03, 98.39) | 95.56 | (92.97, 98.15) | 92.24 | (87.97, 96.51) |
| Healthy | IRQ | 96.00 | (92.67, 99.34) | 96.00 | (92.67, 99.34) | 92.64 | (86.69, 98.60) |
| Healthy | SLR | 96.23 | (94.15, 98.32) | 93.67 | (89.86, 97.47) | 90.48 | (85.64, 95.33) |
| Healthy | SAKE | 94.33 | (90.27, 98.39) | 94.33 | (90.27, 98.39) | 89.76 | (82.62, 96.90) |
| Clinical | HS | 86.84 | (77.35, 96.34) | 74.95 | (62.87, 87.03) | 70.64 | (58.16, 83.12) |
| Clinical | IRQ | 79.28 | (60.85, 97.71) | 78.08 | (59.96, 96.20) | 75.03 | (57.11, 92.94) |
| Clinical | SLR | 91.21 | (81.19, 100.00) | 84.48 | (71.88, 97.08) | 81.50 | (68.77, 94.23) |
| Clinical | SAKE | 91.53 | (87.85, 95.21) | 82.05 | (74.19, 89.92) | 77.01 | (68.45, 85.57) |
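
For reference, precision and recall can be computed from counts of true-positive (TP), false-positive (FP) and false-negative (FN) segment boundaries. This text does not define segmentation accuracy; because true negatives have no natural meaning for boundary detection, the sketch below assumes accuracy = TP / (TP + FP + FN). That definition, the function name and the counts are assumptions for illustration only.

```python
def segmentation_metrics(tp, fp, fn):
    """Return (precision, recall, accuracy) as fractions, or None if undefined."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    # Assumed definition: true negatives are not counted for boundaries.
    accuracy = tp / (tp + fp + fn) if (tp + fp + fn) else None
    return precision, recall, accuracy


# Hypothetical counts for one participant's exercise set.
precision, recall, accuracy = segmentation_metrics(tp=24, fp=1, fn=1)
print(f"precision={precision:.2%} recall={recall:.2%} accuracy={accuracy:.2%}")
```

Under this definition accuracy can never exceed either precision or recall, since its denominator includes both kinds of error.
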
Figure 7. Comparison of segmentation accuracy between healthy and clinical test data.
Biofeedback model performance: classification results of healthy test data with segments generated automatically.
| Cohort | Exercise | Accuracy, mean (%) | Accuracy, 95% CI | Sensitivity, mean (%) | Sensitivity, 95% CI | Specificity, mean (%) | Specificity, 95% CI |
|---|---|---|---|---|---|---|---|
| Healthy | HS | 100.00 | (100.00, 100.00) | 100.00 | (100.00, 100.00) | N/A * | N/A * |
| Healthy | IRQ | 86.00 | (68.67, 100.00) | 86.00 | (68.67, 100.00) | N/A * | N/A * |
| Healthy | SLR | 76.47 | (56.09, 96.86) | 76.47 | (56.09, 96.86) | N/A * | N/A * |
| Healthy | SAKE | 100.00 | (100.00, 100.00) | 100.00 | (100.00, 100.00) | N/A * | N/A * |
| Clinical | HS | 98.49 | (96.46, 100.00) | 98.49 | (96.46, 100.00) | N/A * | N/A * |
| Clinical | IRQ | 59.90 | (44.56, 75.24) | 53.46 | (34.35, 72.56) | 79.23 | (58.98, 99.49) |
| Clinical | SLR | 67.30 | (51.66, 82.94) | 45.34 | (19.14, 71.55) | 87.26 | (77.02, 97.50) |
| Clinical | SAKE | 68.86 | (47.29, 90.43) | 60.48 | (18.60, 100.00) | 74.72 | (45.49, 100.00) |
* Due to the unbalanced test set with no sub-optimal repetitions, it was not possible to calculate specificity.
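
The tables report each metric as a mean with a 95% confidence interval, and several upper bounds sit at exactly 100.00. How the intervals were calculated is not stated here. The sketch below shows one plausible approach, assumed for illustration: a Student's t interval over per-participant scores, clipped to the 0 to 100 range. The function name and the scores are hypothetical; the critical values cover only 10 or 11 participants, matching the cohort sizes in the abstract.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {9: 2.262, 10: 2.228}


def mean_ci_percent(values):
    """Return (mean, lower, upper) for percentage scores, clipped to [0, 100]."""
    n = len(values)
    centre = mean(values)
    half_width = T_CRIT_95[n - 1] * stdev(values) / sqrt(n)
    lower = max(0.0, centre - half_width)
    upper = min(100.0, centre + half_width)  # a percentage cannot exceed 100
    return centre, lower, upper


# Hypothetical per-participant accuracies (%) for ten participants.
scores = [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 50.0, 50.0]
centre, lower, upper = mean_ci_percent(scores)
print(f"{centre:.2f} ({lower:.2f}, {upper:.2f})")
```

When a few participants score far below the rest, the unclipped upper bound exceeds 100 and is capped, which is one way an interval ending at exactly 100.00 can arise.
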
Figure 8. The difference in classification accuracy between the lab-based cross-validation, manually segmented classification performance, and automatically segmented biofeedback model performance when testing with clinical data.
Figure 9. Comparison of overall biofeedback model performance between lab-based cross-validation, and healthy and clinical test data.