Katherine E Boronow1, Laura J Perovich1,2, Latanya Sweeney3, Ji Su Yoo3, Ruthann A Rudel1, Phil Brown4, Julia Green Brody1.
Abstract
BACKGROUND: Sharing research data uses resources effectively; enables large, diverse data sets; and supports rigor and reproducibility. However, sharing such data increases privacy risks for participants who may be re-identified by linking study data to outside data sets. These risks have been investigated for genetic and medical records but rarely for environmental data.
Year: 2020 PMID: 31922426 PMCID: PMC7015543 DOI: 10.1289/EHP4817
Source DB: PubMed Journal: Environ Health Perspect ISSN: 0091-6765 Impact factor: 9.031
Study characteristics and data types that may contribute to re-identification risk in selected environmental health studies.
| Study characteristic: focus on specific locations | Study characteristic: family members in study | Data type: medical data | Data type: genetic data | Data type: occupation data | Data type: housing data | Data type: exposure data from biological samples | Data type: exposure data from home/personal environment samples | |
|---|---|---|---|---|---|---|---|---|
| x | x | x | x | x | x | x | x | |
| — | — | — | — | x | x | — | x | |
| x | — | x | x | — | — | x | — | |
| x | — | x | x | x | — | x | — | |
| x | x | x | x | x | x | x | x | |
| x | — | x | — | — | x | x | x | |
| x | — | x | x | x | x | x | x | |
| — | — | x | x | x | x | x | — | |
| x | — | — | — | x | x | — | x | |
| x | x | x | — | x | x | — | x | |
| x | — | — | — | x | x | x | x | |
| — | — | x | x | x | x | x | x | |
Note: —, not a study characteristic or a data type collected in the study.
One enrollment criterion was living or working in a publicly defined geographic area. In addition, NHANES samples from 15 locations per year, although these locations are not intended to be a focus of study.
The study enrolled family members as part of its study design. Additional studies, for example NHANES and the Sister Study, allow enrollment of multiple members of the same family.
Characteristics of participants' homes, such as number or type of rooms; square footage; year built; information about heating, ventilation, and air conditioning; presence of certain furnishings or appliances, etc.
Figure 1. Individual homes plotted by principal component scores (PC1 and PC2) of residential chemical concentration data and overlaid on gray convex hulls indicating the bounds of two clusters generated using unsupervised k-means cluster analysis of the same data. All panels show homes from two regions. Homes were classified as correctly clustered (white symbols) if they were grouped in the cluster containing the majority of homes from their region; otherwise, they were classified as incorrectly clustered (black symbols). The shape of the symbol indicates the home’s true location. (A) and (B): k-means classification of 122 homes in the Household Exposure Study (72 from Massachusetts, triangles; 50 from California, circles) based on chemical concentrations in indoor air using original (A) or censored (B) reporting limits. (C) and (D): k-means classification of 120 homes in the Household Exposure Study (71 from Massachusetts, triangles; 49 from California, circles) based on chemical concentrations in indoor dust using original (C) or censored (D) reporting limits. (E) and (F): k-means classification of 77 homes in the Green Housing Study (33 from Cincinnati, Ohio, triangles, and 44 from Boston, Massachusetts, circles) based on chemical concentrations in indoor air using original (E) or constant (F) reporting limits.
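The two-cluster grouping described in the caption can be illustrated with a minimal k-means sketch. This is not the study's code: the function names `squared_distance` and `kmeans` are hypothetical, the deterministic seeding (the first k points) is a simplification of the random starts that k-means software normally uses, and any transformation or scaling of concentrations before clustering is left out because the text here does not specify it.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=2, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point.

    Assumption for illustration: the first k points seed the centers,
    which keeps the example deterministic.
    """
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each home joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Each point would be one home's vector of chemical concentrations; the returned labels carry no regional meaning until they are compared with the homes' true locations.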
Accuracy of k-means cluster analysis for subgrouping homes by region in the Household Exposure Study (HES; Massachusetts and California) and Green Housing Study (GHS; Boston, Massachusetts, and Cincinnati, Ohio) using concentrations of chemicals detected in at least 10% of residential indoor air or dust samples.
| Study | Homes (n) | Sample matrix | Chemicals in study | Chemicals in cluster analysis | Reporting limits | Accuracy (%) | Adjusted Rand index |
|---|---|---|---|---|---|---|---|
| HES | 122 | Air | 24 | 13 | Original | 98.4 | 0.93 |
| HES | 122 | Air | 24 | 13 | Censored | 92.6 | 0.72 |
| HES | 120 | Dust | 44 | 25 | Original | 96.7 | 0.87 |
| HES | 120 | Dust | 44 | 18 | Censored | 55.8 | 0 |
| GHS | 77 | Air | 35 | 28 | Original | 80.5 | 0.36 |
| GHS | 77 | Air | 35 | 28 | Constant | 80.5 | 0.36 |
Number of chemicals measured in the same medium in all homes in each cluster analysis.
Number of chemicals detected in at least 10% of homes given the reporting limits used in each analysis.
Number of homes correctly grouped by region using k-means clustering divided by the total number of homes analyzed.
The adjusted Rand index measures similarity between the two clusters identified by k-means analysis and the two true regional subgroups in the data. It has an expected value of zero for random clusters and a maximum value of 1 in the case of perfect agreement.
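As a rough illustration of the two agreement measures defined in these notes, the sketch below computes the accuracy (share of homes falling in the majority cluster of their region) and the adjusted Rand index from two label lists. The function names and the toy labels are assumptions made for this example, not taken from the study.

```python
from collections import Counter
from math import comb


def clustering_accuracy(regions, clusters):
    """Fraction of homes grouped in the cluster holding most homes of their region."""
    correct = 0
    for region in set(regions):
        # Cluster labels of the homes that truly belong to this region.
        labels = [c for r, c in zip(regions, clusters) if r == region]
        # Homes in the region's majority cluster count as correctly clustered.
        correct += Counter(labels).most_common(1)[0][1]
    return correct / len(regions)


def adjusted_rand_index(regions, clusters):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    sum_cells = sum(comb(n, 2) for n in Counter(zip(regions, clusters)).values())
    sum_rows = sum(comb(n, 2) for n in Counter(regions).values())
    sum_cols = sum(comb(n, 2) for n in Counter(clusters).values())
    expected = sum_rows * sum_cols / comb(len(regions), 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        # Only happens when both labelings are the same trivial partition.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Toy data: eight homes, one Massachusetts home lands in the "California" cluster.
regions = ["MA"] * 4 + ["CA"] * 4
clusters = [0, 0, 0, 1, 1, 1, 1, 1]
print(clustering_accuracy(regions, clusters))            # 0.875
print(round(adjusted_rand_index(regions, clusters), 3))  # 0.495
```

The toy case shows why the two columns can diverge: a single misplaced home lowers accuracy only slightly, while the chance-corrected index drops much further.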
In analyses using the original reporting limits, concentrations that were not detected were substituted with the sample-specific reporting limit (SSRL). We calculated the SSRL as the method reporting limit (MRL) divided by the sample-specific volume of air or sample-specific mass of dust.
In analyses with censored reporting limits, we calculated the most frequent MRL reported at each site (in cases of ties we used the lower value). We defined MRLmax as the higher of the two modal MRLs and calculated censored sample-specific reporting limits (SSRLmax) as MRLmax divided by the sample-specific volume of air or sample-specific mass of dust. For all records where the original SSRL or detected or estimated concentration was lower than SSRLmax, the concentration was substituted with SSRLmax.
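The censoring procedure can be sketched for a single chemical as follows. This is an illustrative reading of the note, not the authors' code: the record keys (`conc`, `mrl`, `amount`), the function names, and the variable names `mrl_max` and `ssrl_max` are hypothetical, and `amount` stands for either the air volume or the dust mass of a sample.

```python
from collections import Counter


def modal_mrl(mrls):
    """Most frequent MRL at one site; ties resolve to the lower value."""
    counts = Counter(mrls)
    top = max(counts.values())
    return min(value for value, n in counts.items() if n == top)


def censor_concentrations(records, site_mrls):
    """Return concentrations after applying the common, censored reporting limit.

    records: list of dicts with 'conc' (detected or estimated concentration,
        or None for a nondetect), 'mrl', and 'amount' (air volume or dust mass).
    site_mrls: dict mapping each site to the list of MRLs reported there.
    """
    # Higher of the per-site modal MRLs.
    mrl_max = max(modal_mrl(mrls) for mrls in site_mrls.values())
    censored = []
    for rec in records:
        ssrl = rec["mrl"] / rec["amount"]      # original sample-specific limit
        ssrl_max = mrl_max / rec["amount"]     # censored sample-specific limit
        # Nondetects start at the original SSRL, as in the uncensored analyses.
        value = ssrl if rec["conc"] is None else rec["conc"]
        # Anything below the censored limit is raised to that limit.
        censored.append(max(value, ssrl_max))
    return censored
```

Because every value below the censored limit is replaced by the limit itself, chemicals measured with a more sensitive method at one site lose that extra resolution, which is the intended effect of censoring.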
Cluster analysis was performed on 77 homes comprising 105 samples. A total of 49 homes were sampled once, and 28 homes were sampled twice approximately six months apart. For homes sampled twice, we used the average exposure for each chemical.
Nondetects were substituted with the MRL divided by the median volume of air across Boston, Massachusetts, and Cincinnati, Ohio.
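The two preceding notes (averaging homes sampled twice, and a constant reporting limit based on the median air volume) can be combined in a short sketch for one chemical. The tuple layout and the function name `home_level_exposure` are hypothetical, and substituting nondetects before averaging is an assumption; the notes do not state the order of the two steps.

```python
from collections import defaultdict
from statistics import mean, median


def home_level_exposure(samples):
    """Collapse samples to one value per home for a single chemical.

    samples: list of (home_id, concentration_or_None, mrl, air_volume) tuples,
        where None marks a nondetect.
    """
    # One median volume across all samples from both cities.
    median_volume = median(volume for _, _, _, volume in samples)
    per_home = defaultdict(list)
    for home_id, conc, mrl, _ in samples:
        # Nondetects get a constant limit: MRL divided by the median air volume.
        per_home[home_id].append(mrl / median_volume if conc is None else conc)
    # Homes sampled twice are represented by the average of their visits.
    return {home: mean(values) for home, values in per_home.items()}
```

With this layout, a home sampled once keeps its single value, so the 77 homes in the analysis each contribute exactly one record regardless of how many of the 105 samples came from them.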