Liangyuan Na, Cong Yang, Chi-Cheng Lo, Fangyuan Zhao, Yoshimi Fukuoka, Anil Aswani.
Abstract
Importance: Despite data aggregation and removal of protected health information, there is concern that deidentified physical activity (PA) data collected from wearable devices can be reidentified. Organizations collecting or distributing such data suggest that the aforementioned measures are sufficient to ensure privacy. However, no studies, to our knowledge, have been published that demonstrate the possibility or impossibility of reidentifying such activity data.

Objective: To evaluate the feasibility of reidentifying accelerometer-measured PA data, which have had geographic and protected health information removed, using support vector machines (SVMs) and random forest methods from machine learning.

Design, Setting, and Participants: In this cross-sectional study, the National Health and Nutrition Examination Survey (NHANES) 2003-2004 and 2005-2006 data sets were analyzed in 2018. The accelerometer-measured PA data were collected in a free-living setting for 7 continuous days. NHANES uses a multistage probability sampling design to select a sample that is representative of the civilian noninstitutionalized household (both adult and children) population of the United States.

Exposures: The NHANES data sets contain objectively measured movement intensity as recorded by accelerometers worn during all walking for 1 week.

Main Outcomes and Measures: The primary outcome was the ability of the random forest and linear SVM algorithms to match demographic and 20-minute aggregated PA data to individual-specific record numbers, and the percentage of correct matches by each machine learning algorithm was the measure.
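The partial aggregation mentioned under Main Outcomes and Measures can be illustrated with a minimal sketch. It assumes per-minute accelerometer counts and assumes that aggregation means summing counts within each window; the abstract does not say whether sums or means were used, and the function name `aggregate_counts` and the synthetic data are hypothetical.

```python
from typing import List


def aggregate_counts(minute_counts: List[int], window_min: int = 20) -> List[int]:
    """Sum per-minute counts into consecutive windows of window_min minutes.

    Assumption: aggregation is a sum over each window. A trailing partial
    window, if any, is summed as-is.
    """
    if window_min <= 0:
        raise ValueError("window_min must be positive")
    return [
        sum(minute_counts[start:start + window_min])
        for start in range(0, len(minute_counts), window_min)
    ]


# Synthetic one-day record of 1440 per-minute counts (not NHANES data).
day = [minute % 60 for minute in range(1440)]

features_20min = aggregate_counts(day, 20)  # 72 values per day
features_1h = aggregate_counts(day, 60)     # 24 values per day
```

Coarser windows yield fewer features per participant, which is the sense in which the time resolution of the aggregation is varied.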
Year: 2018 PMID: 30646312 PMCID: PMC6324329 DOI: 10.1001/jamanetworkopen.2018.6040
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
Figure 1. Threat Model for Reidentification of Health Data Using Physical Activity and Demographic Data
Figure 2. Block Diagram Showing the Main Steps of the Reidentification Procedure
Sociodemographic Characteristics of the Physical Activity Monitor Data Subset
| Characteristic | NHANES 2003-2004, Adults (n = 4720) | NHANES 2003-2004, Children (n = 2427) | NHANES 2005-2006, Adults (n = 4765) | NHANES 2005-2006, Children (n = 2539) |
|---|---|---|---|---|
| Age, mean (SD), y | 40.0 (20.6) | 12.3 (3.4) | 45.2 (19.9) | 12.1 (3.4) |
| Sex | | | | |
| Male | 2274 (48.2) | 1236 (50.9) | 2272 (47.7) | 1264 (49.8) |
| Female | 2446 (51.8) | 1191 (49.1) | 2493 (52.3) | 1275 (50.2) |
| Educational level | | | | |
| High school or less | 2645 (56.0) | 2425 (99.9) | 2568 (53.9) | 2538 (99.9) |
| More than high school | 2069 (43.8) | 2 (0.1) | 2192 (46.0) | 1 (0.1) |
| Missing | 6 (0.1) | 0 | 5 (0.1) | 0 |
| Annual household income, $ | | | | |
| <25 000 | 1574 (33.3) | 767 (31.6) | 1368 (28.7) | 698 (27.5) |
| 25 000-55 000 | 1831 (38.8) | 935 (38.4) | 1793 (37.6) | 912 (35.9) |
| >55 000 | 1315 (27.9) | 725 (30.0) | 1604 (33.7) | 929 (36.6) |
| Race/ethnicity | | | | |
| Hispanic | 1160 (24.6) | 839 (34.6) | 1150 (24.1) | 907 (35.7) |
| White | 2392 (50.7) | 636 (26.2) | 2267 (47.6) | 667 (26.3) |
| Black | 973 (20.6) | 856 (35.3) | 1156 (24.3) | 815 (32.1) |
| Other | 195 (4.1) | 96 (3.9) | 192 (4.0) | 150 (5.9) |
| Country of birth | | | | |
| United States | 3774 (79.9) | 2200 (90.6) | 3747 (78.6) | 2264 (89.2) |
| Outside the United States | 945 (20.0) | 227 (9.4) | 1016 (21.3) | 275 (10.8) |
| Missing | 1 (0.1) | 0 | 2 (0.1) | 0 |
| Daily physical activity intensity, counts, mean (SD) | | | | |
| Monday | 172.0 (762.1) | 285.1 (951.4) | 208.6 (1237.0) | 277.4 (1021.3) |
| Tuesday | 183.3 (954.8) | 279.4 (1072.2) | 225.0 (1425.7) | 283.6 (1214.8) |
| Wednesday | 177.9 (934.1) | 269.4 (1108.6) | 230.9 (1536.8) | 296.1 (1478.4) |
| Thursday | 176.0 (1009.7) | 263.4 (1146.0) | 229.9 (1572.0) | 334.3 (1883.8) |
| Friday | 180.4 (1100.9) | 257.1 (1186.3) | 245.6 (1789.1) | 328.0 (1949.2) |
Abbreviation: NHANES, National Health and Nutrition Examination Survey.
Data are presented as number (percentage) of participants unless otherwise indicated.
The ActiGraph AM-7164 returns data in digital units of counts.
Number of Correctly Reidentified Matches in Testing Data With Physical Activity Data Partially Aggregated Into 20-Minute Intervals
| Machine Learning Algorithm | Adults, Demographics Only, No. (%) | Adults, Demographics With Physical Intensity, No. (%) | Children, Demographics Only, No. (%) | Children, Demographics With Physical Intensity, No. (%) |
|---|---|---|---|---|
| NHANES 2003-2004 | | | | |
| Linear SVM | 3880 (81.2) | 4043 (85.6) | 1496 (61.6) | 1695 (69.8) |
| Random forest | | 4478 (94.9) | | 2120 (87.4) |
| NHANES 2005-2006 | | | | |
| Linear SVM | 3827 (80.3) | 4041 (84.8) | 1514 (59.6) | 1705 (67.2) |
| Random forest | | 4470 (93.8) | | 2172 (85.5) |
Abbreviations: NHANES, National Health and Nutrition Examination Survey; SVM, support vector machine.
P < .001.
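The percentages in the table above can be checked against the group sizes in the sociodemographic table. The sketch below makes two assumptions that the flattened table does not state outright: that the first pair of algorithm rows belongs to NHANES 2003-2004 and the second to NHANES 2005-2006, and that the random forest counts refer to demographics with physical intensity. The dictionary keys and the helper `percent_correct` are hypothetical names.

```python
# Group sizes from the sociodemographic table.
group_sizes = {
    "adults_2003_2004": 4720,
    "children_2003_2004": 2427,
    "adults_2005_2006": 4765,
    "children_2005_2006": 2539,
}

# Random forest match counts, assumed to be "demographics with physical
# intensity" at 20-minute aggregation.
rf_matches = {
    "adults_2003_2004": 4478,
    "children_2003_2004": 2120,
    "adults_2005_2006": 4470,
    "children_2005_2006": 2172,
}


def percent_correct(correct: int, total: int) -> float:
    """Share of correctly reidentified records, as a percentage to 1 decimal."""
    return round(100.0 * correct / total, 1)


rf_percent = {
    group: percent_correct(rf_matches[group], group_sizes[group])
    for group in group_sizes
}
```

Under these assumptions the recomputed values are 94.9, 87.4, 93.8, and 85.5, which agree with the percentages reported next to the counts.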
Percentage of Correctly Reidentified Matches at Different Time Resolutions of Partial Aggregation of Physical Activity Data
| Machine Learning Algorithm | 15 min | 20 min | 30 min | 1 h | 2 h | 4 h | 6 h | 8 h | 12 h | 24 h |
|---|---|---|---|---|---|---|---|---|---|---|
| Adults, NHANES 2003-2004 | | | | | | | | | | |
| Demographics only | 81.1 | | | | | | | | | |
| Physical activity | | | | | | | | | | |
| Linear SVM | 0 | 0 | 0 | 0.02 | 0 | 0.04 | 0 | 0.06 | 0.02 | 0.02 |
| Random forest | 5.97 | 5.93 | 6.52 | 5.61 | 4.24 | 2.14 | 1.14 | 0.74 | 0.23 | 0.23 |
| Demographics and physical activity | | | | | | | | | | |
| Linear SVM | 85.2 | 85.6 | 87.0 | 86.9 | 87.7 | 87.4 | 86.8 | 86.7 | 85.7 | 84.8 |
| Random forest | 94.3 | 94.9 | 94.5 | 94.0 | 93.0 | 91.7 | 91.3 | 90.6 | 89.0 | 87.0 |
| Adults, NHANES 2005-2006 | | | | | | | | | | |
| Demographics only | 80.3 | | | | | | | | | |
| Physical activity | | | | | | | | | | |
| Linear SVM | 0 | 0.02 | 0.02 | 0 | 0.02 | 0 | 0.02 | 0.02 | 0.02 | 0.02 |
| Random forest | 6.40 | 6.48 | 6.55 | 6.04 | 4.55 | 2.29 | 1.09 | 0.55 | 0.23 | 0.08 |
| Demographics and physical activity | | | | | | | | | | |
| Linear SVM | 84.5 | 84.8 | 84.7 | 85.1 | 86.3 | 86.1 | 85.9 | 85.5 | 85.0 | 83.1 |
| Random forest | 93.5 | 93.8 | 93.2 | 92.8 | 91.9 | 91.0 | 90.5 | 89.4 | 87.9 | 85.8 |
| Children, NHANES 2003-2004 | | | | | | | | | | |
| Demographics only | 61.6 | | | | | | | | | |
| Physical activity | | | | | | | | | | |
| Linear SVM | 0.08 | 0 | 0.04 | 0.04 | 0 | 0.04 | 0 | 0.08 | 0.04 | 0.04 |
| Random forest | 11.1 | 10.5 | 11.0 | 7.83 | 4.45 | 1.98 | 0.95 | 0.58 | 0.37 | 0.04 |
| Demographics and physical activity | | | | | | | | | | |
| Linear SVM | 70.3 | 69.8 | 70.8 | 71.5 | 68.9 | 69.9 | 67.0 | 68.6 | 67.5 | 64.3 |
| Random forest | 87.2 | 87.4 | 87.1 | 84.8 | 83.1 | 80.0 | 78.6 | 76.9 | 73.8 | 70.2 |
| Children, NHANES 2005-2006 | | | | | | | | | | |
| Demographics only | 59.4 | | | | | | | | | |
| Physical activity | | | | | | | | | | |
| Linear SVM | 0.08 | 0 | 0.04 | 0.08 | 0 | 0.08 | 0.04 | 0 | 0.08 | 0 |
| Random forest | 10.6 | 11.5 | 10.0 | 7.17 | 4.25 | 1.81 | 1.02 | 0.91 | 0.32 | 0.32 |
| Demographics and physical activity | | | | | | | | | | |
| Linear SVM | 67.2 | 67.2 | 67.0 | 67.2 | 66.0 | 67.0 | 66.4 | 67.0 | 64.7 | 62.4 |
| Random forest | 84.8 | 85.5 | 84.4 | 82.6 | 80.3 | 78.2 | 75.9 | 74.5 | 70.9 | 67.3 |
Abbreviations: NHANES, National Health and Nutrition Examination Survey; SVM, support vector machine.
P < .001.
The demographics-only results do not depend on the time resolution; thus, the same number applies to every time column.
P > .99.