
Merging heterogeneous clinical data to enable knowledge discovery.

Martin G Seneviratne, Michael G Kahn, Tina Hernandez-Boussard.

Abstract

The vision of precision medicine relies on the integration of large-scale clinical, molecular and environmental datasets. Data integration may be thought of along two axes: data fusion across institutions, and data fusion across modalities. Cross-institutional data sharing that maintains semantic integrity hinges on the adoption of data standards and a push toward ontology-driven integration. The goal should be the creation of query-able data repositories spanning primary and tertiary care providers, disease registries, research organizations and similar bodies, in order to produce rich longitudinal datasets. Cross-modality sharing involves the integration of multiple data streams, from structured EHR data (diagnosis codes, laboratory tests) to genomics, imaging, monitors and patient-generated data including data from wearable devices. This integration presents unique technical, semantic, and ethical challenges; however, recent work suggests that multi-modal clinical data can significantly improve the performance of phenotyping and prediction algorithms, powering knowledge discovery at the patient and population levels.

Year:  2019        PMID: 30864344      PMCID: PMC6447393     

Source DB:  PubMed          Journal:  Pac Symp Biocomput        ISSN: 2335-6928


The quantity of digitized health information has increased exponentially over the past decade, with growing data repositories across all sectors of the health system [1]. The rise of electronic health records has enabled the creation of large datasets containing structured, semi-structured and unstructured data, ranging from diagnostic codes and laboratory results to continuous monitoring signals, clinical notes, medical imaging and pathology. However, there are also rich clinical, molecular and environmental datasets held by government agencies, disease registries, employers, pharmaceutical companies and research organizations. Meanwhile, the proliferation of health tracking apps, wearables and home sensors has created new clinical data streams controlled by the patient, which capture granular information about lifestyle and micro-environmental exposures. Even an individual’s social media footprint may be considered a source of clinical insights. Weber et al. have described the spectrum of clinical data available for an individual as a “tapestry of high-value information sources” ranging from the micro (genomic/molecular data) through to the macro (behavioral/lifestyle data) [2]. Many have predicted that the convergence of rich clinical, molecular and environmental data streams will accelerate knowledge discovery in biomedicine and help us to move toward the high-level goal of precision medicine [3,4]. Certainly, larger datasets combining information from numerous sources will improve the performance of diagnostic and prognostic machine learning algorithms, fuelling observational research and improving clinical decisions at the point of care. The critical challenge is how to integrate disparate clinical data streams in a flexible, query-able format while preserving patient privacy and data governance. This integration challenge may be thought of along two axes: data fusion across institutions, and data fusion across modalities.
The first challenge involves cross-institutional data sharing. Federal incentive programs launched through the Health Information Technology for Economic and Clinical Health (HITECH) Act supported the creation of health information exchanges (HIEs) as a platform for clinical data sharing; however, in a 2015 survey only 23% of HIEs supported research at the time, with a further 47% planning to support secondary use in the future [5]. Furthermore, a 2016 review found that the number of HIEs had declined between 2012 and 2014 and that only half reported being financially sustainable [6]. In 2015, the Office of the National Coordinator for Health Information Technology (ONC) published an Interoperability Roadmap, which outlines a national agenda for improving health information exchange [7]. One key objective is achieving syntactic and semantic interoperability through the adoption of common vocabularies, including SNOMED-CT and RxNorm, and common data formats, including Consolidated Clinical Document Architecture (C-CDA) and Fast Healthcare Interoperability Resources (FHIR). The roadmap also calls for the adoption of secure transport standards and outlines best practices for matching patient identities between sites. In parallel, there have been a number of academic endeavors to build platforms for observational clinical research, including the Observational Health Data Sciences and Informatics (OHDSI) network [8], the SHARPn project [9], and the Informatics for Integrating Biology and the Bedside (i2b2) initiative [10]. An emerging theme throughout these cross-institutional data fusion efforts, from industry to academia, is the power of ontology-driven data integration, inspired by the rise of semantic web technologies [11-13]. This approach has a number of distinct advantages, including the ability to synthesize across many disparate data sources via high-level ontologies and the ability to reason over a knowledge base [14].
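To make the idea of ontology-driven integration concrete, the following is a minimal sketch rather than a description of any of the platforms cited above. It assumes two hypothetical sites that record diagnoses with their own local codes, a hand-written mapping from those codes to shared concepts, and a toy is-a hierarchy. All identifiers (`site_a`, `DX-0417`, `C:diabetes` and so on) are placeholders invented for illustration and are not real SNOMED-CT or RxNorm codes.

```python
from collections import defaultdict

# Hypothetical mapping from (site, local code) to a shared concept.
# Placeholder identifiers only; not real SNOMED-CT or RxNorm codes.
LOCAL_TO_CONCEPT = {
    ("site_a", "DX-0417"): "C:type2_diabetes",
    ("site_b", "B-7731"): "C:type2_diabetes",
    ("site_a", "DX-0290"): "C:type1_diabetes",
    ("site_b", "B-1002"): "C:hypertension",
}

# Hypothetical is-a hierarchy, stored as child -> parent.
IS_A = {
    "C:type2_diabetes": "C:diabetes",
    "C:type1_diabetes": "C:diabetes",
    "C:diabetes": "C:metabolic_disorder",
}


def ancestors(concept):
    """Return the concept followed by all of its ancestors."""
    chain = [concept]
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def harmonize(records):
    """Map (site, patient, local_code) rows to shared concepts.

    Rows with no known mapping are returned separately so that they
    can be reviewed instead of being silently dropped.
    """
    mapped, unmapped = [], []
    for site, patient, code in records:
        concept = LOCAL_TO_CONCEPT.get((site, code))
        if concept is None:
            unmapped.append((site, patient, code))
        else:
            mapped.append((site, patient, concept))
    return mapped, unmapped


def patients_with(mapped, query_concept):
    """Find patients coded with the query concept or any descendant."""
    hits = defaultdict(set)
    for site, patient, concept in mapped:
        if query_concept in ancestors(concept):
            hits[site].add(patient)
    return dict(hits)


records = [
    ("site_a", "p1", "DX-0417"),
    ("site_a", "p2", "DX-0290"),
    ("site_b", "p3", "B-7731"),
    ("site_b", "p4", "B-1002"),
    ("site_b", "p5", "ZZ-99"),  # no mapping available
]
mapped, unmapped = harmonize(records)
diabetes = patients_with(mapped, "C:diabetes")
print(diabetes)
print(unmapped)
```

A single query for the parent concept retrieves patients from both sites even though neither site uses that code directly, which is the simplest form of reasoning over a hierarchy. In practice the mapping and hierarchy would come from a curated terminology rather than hand-written dictionaries.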
Ongoing technical challenges include representing data provenance, temporal relationships and data quality [15]; however, the prevailing challenge is operational: how to shift organizational culture toward interoperability and data sharing [16]. Beyond this, the infrastructure for interoperability may vary, with successful examples of centralized data warehouses [17], decentralized blockchain-based health records systems [18], and patient-controlled health records [19].
The second major component of data fusion is cross-modality integration. Most EHRs contain a diversity of data types that have traditionally been analyzed independently, ranging from structured diagnosis codes to signal data, clinical notes and imaging. Furthermore, the interoperability advances mentioned above are making it possible to harmonize traditional EHR data with novel clinical data streams including genomic, microbiome, metabolic and patient-generated health data (PGHD). There is an expanding evidence base showing that multi-modal data integration can support precision medicine by stratifying patients based on their ‘deep phenotype’ [20]; improving the performance of clinical decision support algorithms for diagnosis and prediction [21]; and uncovering new phenotypes altogether [22]. For example, Zhao et al. developed a risk prediction model for cardiovascular events using EHR data, but found a significant performance boost when those data were fused with patient-level genomic information [23]. Meanwhile, by using unsupervised learning on a combined dataset of metabolome, microbiome, genetics and imaging data, Shomorony et al. were able to identify a signature of biomarkers that identified diabetic patients more accurately than traditional clinical metrics (glucose, insulin resistance, and body-mass index), suggesting novel pathways that may be involved in the development of diabetes [24].
The combination of traditional health data with PGHD or social media data has enabled knowledge discovery in the realms of both precision medicine and population health. Santillana et al. combined hospital visit data with Twitter, Google searches, and posts on an online health forum to predict influenza incidence [25]. Vilar et al. describe efforts to identify drug-drug interactions by combining social media posts with the biomedical literature [26]. On a more granular level, there is a push to integrate patient-reported outcomes (PROs) into EHRs as a way to promote patient-centric care (an example of heterogeneous data fusion potentially driving behavior change) [27], which has fueled interesting insights into the relationship between PROs and clinical outcomes such as mortality [28]. The rise of the ‘Internet of Things’ in healthcare (the ecosystem of connected monitoring devices that surround a patient), together with ambient information such as geo-location, is creating opportunities for even richer multi-modal datasets [29-31]. These data no longer reside exclusively in hospitals. Private sector initiatives such as Verily’s Project Baseline and Apple’s HealthKit program are enabling patients to aggregate multiple medical data sources [32,33]. Meanwhile, the All of Us initiative is a National Institutes of Health program to collect molecular, clinical and environmental data on a diverse cohort of volunteers for research purposes [34]. As the pathophysiology behind chronic disease is a complex interplay of clinical, molecular and behavioral factors acting over extended time periods, the datasets required to tackle the global epidemic of chronic disease will need to be similarly layered and sophisticated. There is both a clinical opportunity and an economic one, with increasing evidence to suggest that data integration can reduce overall healthcare costs [35].
Cross-modality data integration is associated with a number of challenges, of which we highlight three below. First, there is the issue of how to harmonize data from distant parts of a knowledge graph that reflect radically different levels of abstraction, e.g. diagnosis codes (high-level) with proteomic data (low-level). This creates challenges for data storage and makes it difficult to generate feature vectors to train classifiers. Several recent studies have shown that deep learning can be used to create efficient abstract representations of structured and unstructured EHR data, for example the DeepPatient representation using stacked denoising autoencoders [36]. A similar approach might be considered for a broader range of input data. A second caveat is around data stewardship, particularly with respect to privacy and security [37]. Fusion of data streams may accelerate scientific discovery and clinical care, but this comes with an increased risk of patient re-identification. Further work is needed around de-identification, consent processes and access control when data are contributed to shared repositories. The increasing volume of digital health information available to clinicians also raises questions around liability and duty of care, i.e. the extent to which clinicians are responsible for the full expanse of information in an aggregated health repository. A third challenge is around equity and inclusion. A 2018 report by Ferryman et al. on ‘Fairness in precision medicine’ highlights the potential for bias in large-scale biomedical training data, stemming from historical discrimination in the health system and recruitment biases at academic medical centers [38]. Data-fusion efforts must be cognizant of the distribution of important demographic variables, such as gender, ethnicity and socioeconomic status, in their input data.
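The feature-vector problem named in the first challenge above can be illustrated with a deliberately simple baseline. The sketch below is not the DeepPatient method; it shows plain feature concatenation (often called early fusion), in which categorical diagnosis codes become a multi-hot vector and continuous molecular measurements are standardized before the two are joined. The patient records, the code names (`dx_a`, `dx_b`) and the `omics` values are invented for illustration.

```python
from statistics import mean, pstdev


def multi_hot(codes, vocabulary):
    """Encode a set of codes as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if code in codes else 0.0 for code in vocabulary]


def zscore_columns(rows):
    """Standardize each column to zero mean and unit variance.

    Constant columns carry no information and are mapped to 0.0 to
    avoid dividing by zero.
    """
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(value - m) / s if s > 0 else 0.0 for value, (m, s) in zip(row, stats)]
        for row in rows
    ]


# Invented example data: high-level codes plus low-level measurements.
patients = {
    "p1": {"codes": {"dx_a", "dx_b"}, "omics": [2.0, 10.0]},
    "p2": {"codes": {"dx_b"}, "omics": [4.0, 10.0]},
    "p3": {"codes": set(), "omics": [6.0, 10.0]},
}

# Fix the ordering of codes and patients so vectors are comparable.
vocabulary = sorted({c for p in patients.values() for c in p["codes"]})
ids = sorted(patients)

scaled = zscore_columns([patients[i]["omics"] for i in ids])

# One fused vector per patient: [code indicators..., scaled measurements...].
features = {
    i: multi_hot(patients[i]["codes"], vocabulary) + z
    for i, z in zip(ids, scaled)
}

for i in ids:
    print(i, features[i])
```

Concatenation keeps every modality visible to a downstream classifier but scales poorly as vocabularies and measurement panels grow, which is one motivation for the learned, lower-dimensional representations mentioned in the text.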
The fusion of heterogeneous datasets from different institutions and across different modalities presents a powerful opportunity to drive knowledge discovery in biomedicine. There are technical and operational challenges to enabling data sharing across borders of institutional ownership, which we are beginning to overcome with interoperability standards and data sharing platforms. Arguably the more nuanced problem today is how to grapple with extremely diverse data types that encompass the micro and macro scales of a patient’s data signature, including how to create flexible data storage and machine learning architectures, and how to design stewardship processes to govern these data appropriately. Holzinger et al. claimed in 2014 that “biomedical research is drowning in data, yet starving for knowledge”. Today we have more health data than ever before, but the challenge remains how to harmonize, structure and learn from multi-modal datasets [39].
  31 in total

1.  An integrated, ontology-driven approach to constructing observational databases for research.

Authors:  William Hsu; Nestor R Gonzalez; Aichi Chien; J Pablo Villablanca; Päivi Pajukanta; Fernando Viñuela; Alex A T Bui
Journal:  J Biomed Inform       Date:  2015-03-26       Impact factor: 6.317

Review 2.  "Big data" and the electronic health record.

Authors:  M K Ross; W Wei; L Ohno-Machado
Journal:  Yearb Med Inform       Date:  2014-08-15

3.  Apple HealthKit and Health App: Patient Uptake and Barriers in Primary Care.

Authors:  Frederick North; Rajeev Chaudhry
Journal:  Telemed J E Health       Date:  2016-05-12       Impact factor: 3.536

4.  Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2).

Authors:  Shawn N Murphy; Griffin Weber; Michael Mendis; Vivian Gainer; Henry C Chueh; Susanne Churchill; Isaac Kohane
Journal:  J Am Med Inform Assoc       Date:  2010 Mar-Apr       Impact factor: 4.497

5.  Predicting the Future - Big Data, Machine Learning, and Clinical Medicine.

Authors:  Ziad Obermeyer; Ezekiel J Emanuel
Journal:  N Engl J Med       Date:  2016-09-29       Impact factor: 91.245

6.  Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers.

Authors:  George Hripcsak; Jon D Duke; Nigam H Shah; Christian G Reich; Vojtech Huser; Martijn J Schuemie; Marc A Suchard; Rae Woong Park; Ian Chi Kei Wong; Peter R Rijnbeek; Johan van der Lei; Nicole Pratt; G Niklas Norén; Yu-Chuan Li; Paul E Stang; David Madigan; Patrick B Ryan
Journal:  Stud Health Technol Inform       Date:  2015

7.  The Number Of Health Information Exchange Efforts Is Declining, Leaving The Viability Of Broad Clinical Data Exchange Uncertain.

Authors:  Julia Adler-Milstein; Sunny C Lin; Ashish K Jha
Journal:  Health Aff (Millwood)       Date:  2016-07-01       Impact factor: 6.301

8.  Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance.

Authors:  Mauricio Santillana; André T Nguyen; Mark Dredze; Michael J Paul; Elaine O Nsoesie; John S Brownstein
Journal:  PLoS Comput Biol       Date:  2015-10-29       Impact factor: 4.475

9.  Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records.

Authors:  Riccardo Miotto; Li Li; Brian A Kidd; Joel T Dudley
Journal:  Sci Rep       Date:  2016-05-17       Impact factor: 4.379

10.  A Framework for Classification of Electronic Health Data Extraction-Transformation-Loading Challenges in Data Network Participation.

Authors:  Toan Ong; Rosina Pradhananga; Erin Holve; Michael G Kahn
Journal:  EGEMS (Wash DC)       Date:  2017-06-13
