Jessamyn Dahmen, Diane Cook.
Abstract
Creating realistic synthetic behavior-based sensor data is an important aspect of testing machine learning techniques for healthcare applications. Many existing approaches for generating synthetic data are limited in complexity and realism. We introduce SynSys, a machine learning-based synthetic data generation method, to address these limitations. We use this method to generate synthetic time series data composed of nested sequences, using hidden Markov models and regression models that are initially trained on real datasets. We test our synthetic data generation technique on a real annotated smart home dataset. We use time series distance measures as a baseline to determine how realistic the generated data is compared to real data, and we demonstrate that SynSys produces data that is closer to the real data, in terms of distance, than random data generation, data from another home, and data from another time period. Finally, we apply our synthetic data generation technique to the problem of generating data when only a small amount of ground truth data is available. Using semi-supervised learning, we demonstrate that SynSys improves activity recognition accuracy compared to using the small amount of real data alone.
Keywords: Synthetic data; activity recognition; healthcare data; hidden Markov models; regression; smart homes
Year: 2019 PMID: 30857130 PMCID: PMC6427177 DOI: 10.3390/s19051181
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Activity-labeled smart home sensor data that exhibit a nested sequence structure.
Summary statistics for each of the smart home samples used in our experiments.
| Dataset | Number of Residents | Number of Days | Total Number of Motion Sensors | Total Number of Door Sensors |
|---|---|---|---|---|
| 1 | 1 | 6 | 16 | 1 |
| 2 | 1 | 5 | 23 | 3 |
| 3 | 1 | 10 | 10 | 1 |
| 4 | 1 | 4 | 25 | 4 |
| 5 | 1 | 6 | 18 | 4 |
| 6 | 1 | 6 | 24 | 4 |
| 7 | 2 | 3 | 24 | 1 |
| 8 | 1 | 4 | 22 | 3 |
| 9 | 1 | 3 | 17 | 4 |
| 10 | 1 | 5 | 15 | 4 |
Summary of activity recognition features.
| Feature | Description |
|---|---|
| lastSensorEventHours | Hour of day |
| lastSensorEventSeconds | Seconds since midnight |
| windowDuration | Length of window (seconds) |
| timeSinceLastSensorEvent | Seconds since previous event |
| prevDominantSensor1 | Most frequent sensor in previous window |
| prevDominantSensor2 | Most frequent sensor two windows ago |
| lastSensorID | Most recent sensor identifier |
| lastLocation | Most recent sensor location |
| sensorCount | Number of events in window for each sensor |
| sensorElTime | Elapsed time since each sensor fired |
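
To make a few of the features in the table concrete, here is a minimal sketch of how they might be computed for one window of sensor events. The event layout (a time-ordered list of `(datetime, sensor_id)` pairs), the function name `window_features`, and the sensor IDs are assumptions made for illustration; in particular, `timeSinceLastSensorEvent` is interpreted here as the gap between the last two events in the window, which the table does not spell out.

```python
from collections import Counter
from datetime import datetime


def window_features(window):
    """Compute a subset of the table's features for one window.

    `window` is a time-ordered list of (datetime, sensor_id) pairs with at
    least two events. This layout is an assumption for illustration.
    """
    first_time, _ = window[0]
    prev_time, _ = window[-2]
    last_time, last_sensor = window[-1]
    midnight = last_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return {
        "lastSensorEventHours": last_time.hour,
        "lastSensorEventSeconds": (last_time - midnight).total_seconds(),
        "windowDuration": (last_time - first_time).total_seconds(),
        # Assumed meaning: gap between the final two events in the window.
        "timeSinceLastSensorEvent": (last_time - prev_time).total_seconds(),
        "lastSensorID": last_sensor,
        "sensorCount": dict(Counter(sensor for _, sensor in window)),
    }


# Hypothetical three-event window.
window = [
    (datetime(2019, 1, 1, 7, 30, 0), "M001"),
    (datetime(2019, 1, 1, 7, 30, 12), "M002"),
    (datetime(2019, 1, 1, 7, 31, 5), "M001"),
]
features = window_features(window)
```

The remaining features (`prevDominantSensor1`, `prevDominantSensor2`, `lastLocation`, `sensorElTime`) need state from earlier windows or a sensor-to-location map, so they are left out of this sketch.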
Figure 2. Overview of the SynSys system for example activities Cook and Sleep.
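
The nested sequence idea can be illustrated with a toy generator: an outer Markov chain chooses the activity, and an inner hidden Markov model for that activity emits sensor events. This is only a sketch under stated assumptions, not the SynSys implementation. Every probability, hidden state name, and sensor name below is hypothetical, the models are hand-written rather than trained on real data, and the regression models used for timing are omitted.

```python
import random

# Outer chain: activity -> next-activity probabilities (hypothetical values).
ACTIVITY_TRANSITIONS = {
    "Sleep": {"Sleep": 0.6, "Cook": 0.4},
    "Cook": {"Cook": 0.5, "Sleep": 0.5},
}

# Inner HMM per activity: hidden states, their transitions, and the sensors
# each hidden state can emit (all names and numbers are hypothetical).
INNER_HMMS = {
    "Sleep": {
        "start": "in_bed",
        "transitions": {
            "in_bed": {"in_bed": 0.9, "restless": 0.1},
            "restless": {"in_bed": 0.7, "restless": 0.3},
        },
        "emissions": {
            "in_bed": {"BedroomBed": 1.0},
            "restless": {"BedroomBed": 0.5, "BedroomArea": 0.5},
        },
    },
    "Cook": {
        "start": "at_stove",
        "transitions": {
            "at_stove": {"at_stove": 0.6, "at_fridge": 0.4},
            "at_fridge": {"at_stove": 0.8, "at_fridge": 0.2},
        },
        "emissions": {
            "at_stove": {"KitchenStove": 0.8, "KitchenArea": 0.2},
            "at_fridge": {"KitchenFridge": 0.7, "KitchenArea": 0.3},
        },
    },
}


def draw(rng, dist):
    """Sample one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def generate(n_activities, events_per_activity, seed=0):
    """Return a list of (activity, sensor) pairs with nested structure."""
    rng = random.Random(seed)
    activity = "Sleep"  # arbitrary starting activity for the sketch
    events = []
    for _ in range(n_activities):
        hmm = INNER_HMMS[activity]
        state = hmm["start"]
        for _ in range(events_per_activity):
            # Emit a sensor from the current hidden state, then move on.
            events.append((activity, draw(rng, hmm["emissions"][state])))
            state = draw(rng, hmm["transitions"][state])
        activity = draw(rng, ACTIVITY_TRANSITIONS[activity])
    return events


events = generate(n_activities=4, events_per_activity=3, seed=1)
```

Each run of consecutive events shares one activity label, which mirrors the nested structure: the activity sequence is the outer layer and the sensor events within each activity are the inner layer.
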
Summary statistics of activity occurrence frequencies for the smart home data sets used in our experiments. Datasets contained an average of 227,511 sensor events with a standard deviation of 93,572.
| Predefined Activity Label | Mean Frequency | Standard Deviation |
|---|---|---|
| Sleep | 10,075 | 9400 |
| Other Activity | 96,762 | 53,116 |
| Bed Toilet Transition | 1949 | 2497 |
| Personal Hygiene | 20,646 | 14,229 |
| Take Medicine | 2463 | 1870 |
| Relax | 17,878 | 15,032 |
| Eat | 11,873 | 7654 |
| Work | 12,208 | 12,325 |
| Leave Home | 1708 | 867 |
| Enter Home | 1561 | 629 |
| Wash Dishes | 13,908 | 7375 |
| Cook | 33,915 | 15,946 |
| Bathe | 2561 | 3275 |
Distances between real data, synthetic data, and randomly generated data. Clustering is employed to group events within the Other Activity category. * = difference between SynSys and Real and other approach is statistically significant (p < 0.05).
| Data Comparison | Euclidean | DTW |
|---|---|---|
| Random and Real | 2048 * | 364,392 * |
| Real and Real from Another Home | 1799 * | 302,221 * |
| Real and Real from Another Time Period | 1781 * | 311,438 * |
| Real and Simple HMM&Poisson | 1846 * | 310,964 * |
| SynSys and Real | 1474 | 194,011 |
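
For readers unfamiliar with the two measures in the table, the following is a minimal sketch of Euclidean distance and dynamic time warping (DTW) on plain numeric sequences. How the sensor event streams were encoded as numeric series before comparison is not stated here, so the inputs below are hypothetical, and the absolute difference used as the local DTW cost is an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Uses absolute difference as the local cost (an assumption) and allows
    match, insertion, and deletion steps with no warping window.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Hypothetical series: the second is the first with one repeated sample.
series = [0, 1, 2, 3]
stretched = [0, 0, 1, 2, 3]
shifted = [1, 2, 3, 4]

d_euclid = euclidean(series, shifted)
d_dtw_stretched = dtw(series, stretched)
d_dtw_shifted = dtw(series, shifted)
```

Unlike Euclidean distance, DTW tolerates sequences of different lengths and local time shifts, which is why the stretched copy of a series is at distance zero from the original.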
Activity recognition accuracies, using semi-supervised learning with a varying amount of real data and one month of synthetic data. * = difference from accuracy trained on real data is statistically significant (p < 0.05). The list of activities learned is mentioned in the introduction.
| Number of Real Sensor Events used for Training | Accuracy When Trained on Real Data Only | Accuracy When Trained on Real Data Plus 200,000 Synthetically-Generated Sensor Events |
|---|---|---|
| 5000 | 35.54% ± 5.10% | 47.90% * ± 3.60% |
| 10,000 | 45.50% ± 5.75% | 61.19% * ± 4.96% |
| 15,000 | 41.68% ± 5.49% | 52.83% * ± 3.23% |
| 20,000 | 47.61% ± 5.91% | 57.57% ± 2.99% |
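
As a quick reading aid, the gain from adding synthetic data can be computed in percentage points for each row. The accuracy values are copied from the table above; the variable names are illustrative only.

```python
# (real events, accuracy with real data only, accuracy with real + synthetic)
rows = [
    (5000, 35.54, 47.90),
    (10000, 45.50, 61.19),
    (15000, 41.68, 52.83),
    (20000, 47.61, 57.57),
]

# Gain in percentage points for each training-set size.
gains = {n: round(with_synth - real_only, 2) for n, real_only, with_synth in rows}

for n, gain in gains.items():
    print(f"{n} real events: +{gain:.2f} percentage points")
```

Only the first three rows are marked as statistically significant in the table, so the gain at 20,000 events should be read with that in mind.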