Clinical Data: Sources and Types, Regulatory Constraints, Applications.

Stanley C Ahalt, Christopher G Chute, Karamarie Fecho, Gustavo Glusman, Jennifer Hadlock, Casey Overby Taylor, Emily R Pfaff, Peter N Robinson, Harold Solbrig, Casey Ta, Nicholas Tatonetti, Chunhua Weng.

Abstract

Year:  2019        PMID: 31074176      PMCID: PMC6617834          DOI: 10.1111/cts.12638

Source DB:  PubMed          Journal:  Clin Transl Sci        ISSN: 1752-8054            Impact factor:   4.689


Access to clinical data is critical for the advancement of translational research. However, the numerous regulations and policies that surround the use of clinical data, although critical to ensure patient privacy and protect against misuse, often present challenges to data access and sharing. In this article, we provide an overview of clinical data types and associated regulatory constraints and inferential limitations. We highlight several novel approaches that our team has developed for openly exposing clinical data.

Background

To respect and protect patient privacy, numerous regulations govern the use of clinical data by researchers, including the US federal Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the European Union General Data Protection Regulation. Institution‐specific guidelines and governing bodies such as institutional review boards (IRBs) also address research involving patient data and other sensitive data available in electronic medical records (e.g., administrative data), in part as a result of concerns regarding the liability of healthcare providers and institutions [1, 2].

The Biomedical Data Translator (Translator) program, funded by the National Center for Advancing Translational Sciences, aims to facilitate the transformation of basic science discoveries into clinically actionable knowledge and to leverage clinical expertise to drive research innovations [3, 4]. Access to clinical data is central to the vision of the program. Yet, the program's dedication to open science adds complexity to the regulatory, technical, and cultural challenges that already surround access to clinical data.

We review here the types of clinical data sets that can be derived from paper or electronic medical records, their applications and limitations, and their associated regulatory constraints, focusing primarily on compliance requirements mandated in the United States under HIPAA (Table 1). We briefly describe several clinical data types that are commonly employed in clinical and translational research: fully identified clinical data, HIPAA‐limited clinical data, deidentified clinical data, and synthetic data. We then highlight several novel approaches for openly exposing clinical data that we have developed as part of the Translator program, namely, HIPAA Safe Harbor Plus (HuSH+) clinical data, clinical profiles, Columbia Open Health Data (COHD), and the Integrated Clinical and Environmental Exposures Service (ICEES).
Table 1

Clinical data types, regulatory access restrictions, and applications

| Clinical data type | Brief description | Regulatory access restrictions | Applications |
| --- | --- | --- | --- |
| Fully identified clinical data sets | Observational patient data derived from paper‐based or electronic medical records | IRB approval is required; an executed data use agreement is possibly required (a) | Clinical interpretation and scientific inference and discovery |
| HIPAA‐limited clinical data sets | Observational patient data containing only a limited set of HIPAA‐defined PHI | IRB approval is required; an executed data use agreement is possibly required (a) | Clinical interpretation and scientific inference and discovery, but with the understanding that certain data elements have been removed from the data and/or transformed |
| Deidentified clinical data sets | Observational patient data, but with all HIPAA‐defined PHI elements removed | IRB approval is not required (b); IRB “Request for Determination of Human Subjects Research” is typically recommended; an executed data use agreement is possibly required | Clinical interpretation and scientific inference and discovery, but with the understanding that inferences regarding time and potentially other factors cannot be made |
| HuSH+ clinical data sets | Observational patient data, fully compliant with HIPAA Safe Harbor, but unlike deidentified clinical data sets, HuSH+ clinical data sets have been altered such that (i) real patient identifiers (including geocodes) have been replaced with random patient identifiers and (ii) dates (including birth dates) have been shifted by a random number of days (maximum of ± 50 days), with all dates for a given patient shifted by the same number of days. Data are derived from UNC Health Care System | An executed data use agreement is required (c) | Clinical interpretation and scientific inference and discovery, but with the understanding that any inferences based on date/time and location (geocode) cannot be made with precision, and all other inferences must consider date/time and location as potentially hidden covariates |
| Clinical profiles | Statistical profiles of disease and associated phenotypic presentation derived from observational patient data. Data are derived from Johns Hopkins Medicine | IRB approval is required to generate clinical profiles; no other restrictions apply | Clinical interpretation and scientific inference, but with the understanding that the data represent statistical profiles |
| Synthetic clinical data sets | Realistic, but not real, observational patient data generated statistically using population distributions of observational patient data | None | Feasibility assessments and algorithm validation; generation of clinical profiles |
| COHD | Counts of observational clinical co‐occurrences (e.g., co‐occurrences of specific diagnoses and prescribed medications), as well as their relative frequency and observed–expected frequency ratio. Data are derived from Columbia University Irving Medical Center | None | Clinical interpretation and scientific inference, but with the understanding that the data are restricted to co‐occurrences |
| ICEES | Patient‐level or visit‐level counts of observational patient data integrated at the patient and visit level with a variety of environmental exposures derived from multiple public data sources. Data are derived from UNC Health Care System and a variety of public data sources on environmental exposures | IRB approval is required to generate ICEES integrated feature tables; no other restrictions apply | Clinical interpretation and scientific inference, but with the understanding that the raw data have been transformed (e.g., binned or categorized) |

COHD, Columbia Open Health Data; HIPAA, Health Insurance Portability and Accountability Act; HuSH+, HIPAA Safe Harbor Plus; ICEES, Integrated Clinical and Environmental Exposures Service; IRB, institutional review board; PHI, protected health information; UNC, University of North Carolina.

(a) Individual institutions may require a secure workspace for data access and use.
(b) While HIPAA and IRB regulations do not apply, institutional approvals may be required.
(c) HuSH+ clinical data sets were conceptualized and created by UNC as part of the National Center for Advancing Translational Sciences–funded Biomedical Data Translator program. The institution requires a fully executed data use agreement for access to the data.

Types of Clinical Data Sets

Fully identified clinical data sets

Fully identified clinical data sets comprise observational patient data, including direct patient identifiers (i.e., protected health information (PHI)), as defined in the privacy rule issued under HIPAA. Access requires a specific research hypothesis, study approval by an IRB, a full or partial waiver of HIPAA‐informed consent, and typically a secure workspace. For investigators not affiliated with a specific institution, additional regulations and approvals may apply, including a data use agreement (DUA) with the provider institution. Fully identified clinical data sets may be used for clinical interpretation and scientific inference and discovery. However, as with all data sets but especially observational administrative data sets, issues of data quality and integrity must be taken into account when drawing conclusions [1].

HIPAA‐limited clinical data sets

HIPAA‐limited clinical data sets comprise observational patient data with limited PHI: dates such as admission, discharge, and service dates and dates of birth and death; city, state, and zip codes of five or more digits; and ages in years, months, days, or hours. HIPAA‐limited clinical data sets may be used or disclosed for purposes of research, public health, or healthcare operations without obtaining patient authorization or a waiver of HIPAA‐informed consent but with IRB approval and (in some cases) a fully executed DUA. HIPAA‐limited clinical data sets may be used for clinical interpretation and scientific inference and discovery but with the understanding that certain data elements have been removed from the data and/or transformed (e.g., age vs. birth date).

Deidentified clinical data sets

Deidentified clinical data sets comprise observational patient data from which all PHI elements have been removed. Access to deidentified clinical data sets does not require IRB approval, although an IRB Request for Determination of Human Subjects Research is advised. In addition, a fully executed DUA is sometimes required. Deidentified clinical data sets may be used for clinical interpretation and scientific inference and discovery but to a lesser extent than HIPAA‐limited clinical data sets because key variables or covariates may have been removed from the data. For instance, dates are required to make inferences regarding seasonal patterns in clinical outcomes and correlations with natural disasters, system‐related issues such as protocol changes, and regulatory issues such as new black‐box warnings.

HuSH+ clinical data sets

HuSH+ clinical data sets were created by Translator team members as a hybrid deidentification approach that is fully compliant with HIPAA and provides restricted access to observational patient data from the UNC Health Care System. HuSH+ clinical data sets differ from deidentified clinical data sets in that (i) real patient identifiers (including geocodes) have been replaced with random patient identifiers and (ii) dates (including birth dates) have been shifted by a random number of days (maximum of ± 50 days), with all dates for a given patient shifted by the same number of days. Access to HuSH+ clinical data does not require IRB approval but does require a fully executed DUA per institutional mandate. HuSH+ clinical data sets may be used in a limited fashion for clinical interpretation and scientific inference and discovery. The main considerations are that any inferences based on date/time and location (geocode) cannot be made with precision or correlated with seasonal trends or specific events, and all other inferences must consider date/time and location as potentially hidden covariates.
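The two alterations described above (random patient identifiers and a single per‐patient date shift of at most ± 50 days) can be illustrated with a minimal sketch. This is not the UNC implementation: the record layout, the function name `hush_plus_style_transform`, the identifier format, and the uniform distribution of the shift are all assumptions made for illustration, and geocode handling is omitted.

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 50  # from the text: shifts are at most +/- 50 days


def hush_plus_style_transform(records, rng):
    """Replace patient IDs with random IDs and shift each patient's dates.

    `records` is a list of dicts with a "patient_id" key; every value that is
    a `datetime.date` is treated as a date to shift (hypothetical layout).
    """
    new_ids = {}  # real ID -> random ID
    shifts = {}   # real ID -> one offset reused for all of that patient's dates
    transformed = []
    for record in records:
        real_id = record["patient_id"]
        if real_id not in new_ids:
            new_ids[real_id] = f"P{rng.getrandbits(64):016x}"
            # Assumption: the offset is drawn uniformly from [-50, +50] days.
            shifts[real_id] = timedelta(
                days=rng.randint(-MAX_SHIFT_DAYS, MAX_SHIFT_DAYS)
            )
        shifted = {}
        for key, value in record.items():
            if key == "patient_id":
                shifted[key] = new_ids[real_id]
            elif isinstance(value, date):
                shifted[key] = value + shifts[real_id]
            else:
                shifted[key] = value
        transformed.append(shifted)
    return transformed


# Hypothetical records, invented for illustration only.
records = [
    {"patient_id": "MRN-001", "birth_date": date(1980, 3, 14), "visit_date": date(2010, 6, 1)},
    {"patient_id": "MRN-001", "birth_date": date(1980, 3, 14), "visit_date": date(2010, 9, 15)},
    {"patient_id": "MRN-002", "birth_date": date(1975, 11, 2), "visit_date": date(2010, 6, 1)},
]
shifted_records = hush_plus_style_transform(records, random.Random(0))
```

Because every date for a patient moves by the same offset, intervals within a patient's record (e.g., days between two visits) are preserved, whereas absolute calendar dates, and therefore seasonal or event‐based comparisons across patients, are not.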

Clinical profiles

Clinical profiles have been developed as part of the Translator program and represent statistical profiles of disease and associated phenotypic presentations derived from observational patient data from Johns Hopkins Medicine using the Health Level Seven International Fast Healthcare Interoperability Resources common data model. At present, clinical profiles include data on demographics, diagnoses, disease comorbidities, symptoms, medications, procedures, and laboratory measures. IRB approval is required to generate clinical profiles, but once generated, clinical profiles can be openly shared. Institutional restrictions may apply, however. Clinical profiles can be used for clinical interpretation and scientific inference and discovery but with the understanding that they represent statistical summaries of patient populations and only indirectly represent patient‐level observations. Multiple computational tools and example output files are openly available for creating and using clinical profiles (see the Supplemental Information on Clinical Profiles).

Synthetic clinical data sets

Synthetic clinical data sets comprise realistic (but not real) data generated statistically by applying simulation techniques to population distributions of observational patient data. Synthetic clinical data sets can be openly shared. A publicly available example, the Synthetic Mass data set, was generated using the Synthea method [5] to simulate patient‐level and population‐level data on patients who reside in the state of Massachusetts. A similar open effort is Simulacrum, which is based on observational patient data held by Public Health England's National Cancer Registration and Analysis Service. The data include realistic patient histories with clinically relevant patient encounters; as such, the data can be used for feasibility assessments and algorithm validation but not for clinical interpretation or scientific inference and discovery.
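The general idea of sampling synthetic patients from population distributions can be sketched as follows. This is not the Synthea method or the Simulacrum procedure, which are considerably more elaborate; the distributions, field names, and the function `synthesize_patients` are hypothetical and were invented purely for illustration.

```python
import random

# Hypothetical population distributions, invented purely for illustration.
SEX_DISTRIBUTION = {"female": 0.52, "male": 0.48}
DIAGNOSIS_PREVALENCE = {"asthma": 0.08, "type 2 diabetes": 0.10, "hypertension": 0.30}
AGE_MEAN, AGE_SD = 45.0, 18.0


def synthesize_patients(n, rng):
    """Draw `n` synthetic patients; no real patient record is used as input."""
    patients = []
    for i in range(n):
        sex = rng.choices(
            list(SEX_DISTRIBUTION), weights=list(SEX_DISTRIBUTION.values())
        )[0]
        # Age from a normal distribution, clamped to the range 0-100 years.
        age = min(100, max(0, round(rng.gauss(AGE_MEAN, AGE_SD))))
        # Each diagnosis is assigned independently with its own prevalence.
        diagnoses = [dx for dx, p in DIAGNOSIS_PREVALENCE.items() if rng.random() < p]
        patients.append(
            {"synthetic_id": i, "sex": sex, "age": age, "diagnoses": diagnoses}
        )
    return patients


cohort = synthesize_patients(1000, random.Random(2019))
```

Sampling each feature independently, as done here, discards correlations between features (e.g., age and hypertension), which is one reason such data suit feasibility assessments and algorithm validation rather than scientific inference.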

COHD

Translator team members have pioneered the use of clinical co‐occurrence tables as part of the COHD initiative [6]. COHD provides open access to observational patient data from Columbia University Irving Medical Center in the form of co‐occurrence counts of pairs of concepts or clinical feature variables (e.g., medications and diagnoses), as well as their relative frequency and observed–expected frequency ratio. The data are publicly accessible via an open web interface or Application Programming Interface. Risks to patient privacy are mitigated by excluding rare features (counts ≤ 10) and perturbing the counts according to the Poisson distribution. The data can be used to derive insights into questions of clinical relevance and importance for translational research. For instance, an individual user may wish to know the frequency of asthma among African American patients (Figure 1a). A search of the COHD service reveals that there are 11,716 African American patients with a diagnosis of asthma among 208,438 African American patients (5.62%). For comparison, a second search reveals that there are 29,913 white patients with a diagnosis of asthma among 601,167 white patients (4.98%).
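The quantities mentioned above can be sketched in a few lines. The code reproduces the two percentages quoted in the text and illustrates one plausible form of the privacy safeguards and of the observed–expected ratio. It does not call the COHD service; the function names are hypothetical, and two details are assumptions rather than statements from the text: that the perturbed count is a Poisson draw whose mean is the true count, and that the expected co‐occurrence count is computed under independence with the ratio reported on a natural‐log scale.

```python
import math
import random

MIN_COUNT = 10  # from the text: features with counts <= 10 are excluded


def poisson_sample(mean, rng):
    """Draw one Poisson variate (Knuth's method, chunked to avoid underflow)."""
    total, remaining = 0, mean
    while remaining > 0:
        chunk = min(remaining, 500.0)  # a sum of Poisson draws is Poisson
        threshold = math.exp(-chunk)
        k, p = 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                break
            k += 1
        total += k
        remaining -= chunk
    return total


def release_count(true_count, rng):
    """Suppress rare counts; otherwise return a Poisson-perturbed count."""
    if true_count <= MIN_COUNT:
        return None
    # Assumption: the perturbed value is drawn from Poisson(true_count).
    return poisson_sample(float(true_count), rng)


def relative_frequency(pair_count, base_count):
    """Share of patients with the base concept who also have the second one."""
    return pair_count / base_count


def ln_observed_expected_ratio(pair_count, count_a, count_b, n_patients):
    """ln(observed / expected), with expected assuming independence (assumed)."""
    expected = count_a * count_b / n_patients
    return math.log(pair_count / expected)


# Counts quoted in the text for the asthma example.
print(f"{100 * relative_frequency(11_716, 208_438):.2f}%")  # 5.62%
print(f"{100 * relative_frequency(29_913, 601_167):.2f}%")  # 4.98%
```

Because released counts are perturbed, percentages derived from them are approximate; for counts in the thousands, as in this example, a Poisson perturbation changes the result only slightly.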
Figure 1

Example queries, including input parameters and output, for Columbia Open Health Data (COHD) (a) and the Integrated Clinical and Environmental Exposures Service (ICEES) (b). AvgDailyPM2.5Exposure = average daily patient exposure to PM2.5 (μg/m3) over a 1‐year study period; TotalEDInpatientVisits = total number of emergency department or inpatient visits for respiratory issues during a 1‐year study period. The study period shown here is for calendar year 2010. AvgDailyPM2.5Exposure <3 range: 1.58, 9.63 μg/m3; AvgDailyPM2.5Exposure ≥3 range: 9.63, 17.33 μg/m3. ID, identifier; PM2.5, airborne particulate matter ≤2.5 μm in diameter.

ICEES

ICEES was designed by Translator team members as a novel extension of COHD [7]. Specifically, ICEES permits open access to observational patient data from the UNC Health Care System that have been integrated at the patient and visit level with environmental exposures data (e.g., airborne and roadway pollutants, socioeconomic factors) derived from multiple public sources. A complex data extraction and integration software pipeline has been developed to create ICEES integrated feature tables [8]. The tables are generated using PHI (geocodes and dates), but the data are then binned or recoded and stripped of PHI. Thus, the ICEES pipeline must be executed under an approved IRB protocol, but subsequent steps are not subject to IRB regulation, and ICEES is publicly accessible via an Application Programming Interface.

ICEES provides a number of functionalities for clinical interpretation and scientific inference and discovery. For example, Figure 1b demonstrates that for COHORT:60 (African Americans with asthma‐like conditions in calendar year 2010), the percentage of patients with two or more annual emergency department or inpatient visits for respiratory issues is higher among patients with high average daily exposure to particulate matter ≤ 2.5 μm in diameter than among patients with low average daily exposure to particulate matter ≤ 2.5 μm in diameter (21.10% vs. 8.90%, P < 0.0001, N = 6,379), thus replicating published literature on the association between airborne pollutant exposures and asthma exacerbations [9]. The data additionally suggest that African Americans with asthma‐like conditions have relatively high exposure to particulate matter, with ~ 95% of the cohort exposed to ≥ 9.63 μg/m3 average daily particulate matter ≤ 2.5 μm in diameter.
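The binning step and the kind of two‐group comparison reported above can be sketched as follows. This does not reproduce the ICEES pipeline or its API: the patient rows are hypothetical, the two‐level split at 9.63 μg/m3 is a simplification based on the Figure 1 caption, and the text does not state which statistical test produced the reported P value, so a Pearson chi‐square test on a 2 × 2 table is assumed here.

```python
import math
from bisect import bisect_right


def bin_value(value, cut_points):
    """Return a 1-based bin index; a value equal to a cut point goes upward."""
    return bisect_right(cut_points, value) + 1


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the upper-tail
    probability is erfc(sqrt(chi2 / 2)). All four margins must be nonzero.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical patient-level rows, invented for illustration:
# (average daily PM2.5 in ug/m3, respiratory ED or inpatient visits in the year)
rows = [(5.2, 0), (8.9, 1), (10.4, 3), (12.1, 2), (9.7, 0), (15.0, 4)]

PM25_CUTS = [9.63]  # low vs. high exposure, cut point from the Figure 1 caption
VISIT_CUTS = [2]    # fewer than two visits vs. two or more

# Only the binned counts, not the raw values, would be exposed to users.
table = {(e, v): 0 for e in (1, 2) for v in (1, 2)}
for pm25, visits in rows:
    table[(bin_value(pm25, PM25_CUTS), bin_value(visits, VISIT_CUTS))] += 1

statistic, p_value = chi_square_2x2(
    table[(2, 2)], table[(2, 1)], table[(1, 2)], table[(1, 1)]
)
```

Six rows are far too few for a meaningful test; the point of the sketch is that once exposures and outcomes are binned, the comparison needs only the four cell counts.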

Clinical fingerprints

Although not a new clinical data type per se, Translator teams have been working to develop privacy‐preserving analytic approaches to visualize and compare patient data, including genomic data and clinical records in semistructured JavaScript Object Notation or eXtensible Markup Language formats. Genomic data typically consist of lists of variants relative to a reference allele sorted by position. Genome fingerprints capture the unique patterns generated by pairs of consecutive single‐nucleotide variants as patient‐level matrices or fingerprints [10]. The correlation between two fingerprints reflects the degree of relatedness between two genomes. Clinical fingerprints similarly transform clinical records from the Fast Healthcare Interoperability Resources format into numerical vectors that greatly simplify their comparison. Translator team members are working to adapt this methodology for application to the ICEES integration pipeline and incorporation into the ICEES integrated feature tables.
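Comparing two fingerprints by correlation can be sketched with a few lines of code. The text does not say which correlation coefficient the fingerprint methods use, so the Pearson coefficient is assumed here; the fingerprint values are hypothetical, and the computation of a fingerprint from variants or clinical records is not shown.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical fingerprints, flattened from matrices into plain vectors.
fingerprint_a = [0.12, 0.40, 0.05, 0.33, 0.10]
fingerprint_b = [0.10, 0.42, 0.07, 0.30, 0.11]
fingerprint_c = [0.35, 0.02, 0.41, 0.08, 0.14]

similarity_ab = pearson_correlation(fingerprint_a, fingerprint_b)
similarity_ac = pearson_correlation(fingerprint_a, fingerprint_c)
```

Only the fixed‐length vectors are compared, so the underlying variant lists or clinical records do not need to be exchanged to compute a similarity score.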

Conclusion

In this article, we described various types of clinical data sets and associated inferential limitations and regulatory constraints, focusing primarily on compliance requirements mandated in the United States under HIPAA. We highlighted several novel approaches that we have developed as part of the Translator program to openly expose observational patient data, while respecting and protecting patient privacy. We recognize that each of these approaches retains a residual risk of patient reidentification; thus, we continue to work with experts in regulatory protections and computer security to ensure that those risks remain minimal. Although the Translator approaches are designed to be disease‐agnostic and generalizable, they were developed to comply with HIPAA and institutional guidelines; as such, our approaches may need to be modified prior to adoption elsewhere. Nonetheless, through these open services, we hope to accelerate clinical and translational science and foster biomedical discovery.

Funding

Support for this project was provided by the National Center for Advancing Translational Sciences, National Institutes of Health through the Biomedical Data Translator program (awards 1OT3TR002019, 1OT3TR002020, 1OT3TR002025, 1OT3TR002026, 1OT3TR002027, 1OT2TR002514, 1OT2TR002515, 1OT2TR002517, 1OT2TR002520, 1OT2TR002584) and the Clinical and Translational Sciences Award program (award UL1TR002489).

Conflict of Interest

All authors declared no competing interests for this work.
References

1.  Electronic health records: privacy, confidentiality, and security.

Authors:  Laurinda B Harman; Cathy A Flite; Kesa Bond
Journal:  Virtual Mentor       Date:  2012-09-01

2.  A novel approach for exposing and sharing clinical data: the Translator Integrated Clinical and Environmental Exposures Service.

Authors:  Karamarie Fecho; Emily Pfaff; Hao Xu; James Champion; Steve Cox; Lisa Stillwell; David B Peden; Chris Bizon; Ashok Krishnamurthy; Alexander Tropsha; Stanley C Ahalt
Journal:  J Am Med Inform Assoc       Date:  2019-10-01       Impact factor: 4.497

3.  Toward A Universal Biomedical Data Translator.

Authors: 
Journal:  Clin Transl Sci       Date:  2018-11-09       Impact factor: 4.689

4.  The Biomedical Data Translator Program: Conception, Culture, and Community.

Authors: 
Journal:  Clin Transl Sci       Date:  2018-11-09       Impact factor: 4.689

5.  Outdoor PM2.5, Ambient Air Temperature, and Asthma Symptoms in the Past 14 Days among Adults with Active Asthma.

Authors:  Maria C Mirabelli; Ambarish Vaidyanathan; W Dana Flanders; Xiaoting Qin; Paul Garbe
Journal:  Environ Health Perspect       Date:  2016-07-06       Impact factor: 9.031

6.  Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints.

Authors:  Gustavo Glusman; Denise E Mauldin; Leroy E Hood; Max Robinson
Journal:  Front Genet       Date:  2017-09-26       Impact factor: 4.599

7.  Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records.

Authors:  Casey N Ta; Michel Dumontier; George Hripcsak; Nicholas P Tatonetti; Chunhua Weng
Journal:  Sci Data       Date:  2018-11-27       Impact factor: 6.444

8.  Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning.

Authors:  Liangyuan Na; Cong Yang; Chi-Cheng Lo; Fangyuan Zhao; Yoshimi Fukuoka; Anil Aswani
Journal:  JAMA Netw Open       Date:  2018-12-07

