Michael G Kahn, Joyce Y Mui, Michael J Ames, Anoop K Yamsani, Nikita Pozdeyev, Nicholas Rafaels, Ian M Brooks.
Abstract
OBJECTIVE: Clinical research data warehouses (RDWs) linked to genomic pipelines and open data archives are being created to support innovative, complex data-driven discoveries. The computing and storage needs of these research environments may quickly exceed the capacity of on-premises systems. New RDWs are migrating to cloud platforms for the scalability and flexibility needed to meet these challenges. We describe our experience in migrating a multi-institutional RDW to a public cloud.
Keywords: big data; cloud computing; data warehousing; research data governance
Year: 2022 PMID: 34919694 PMCID: PMC8922165 DOI: 10.1093/jamia/ocab278
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. Key findings from 2016 pilot studies comparing Google Cloud Platform with existing on-premises systems as presented to nontechnical executive sponsors. Superlatives were used to emphasize particularly distinctive findings that supported the migration proposal.
Figure 2. Top, The executive view of Health Data Compass highlighting data inputs, outputs, and key GCP technologies for nontechnical audiences. Bottom, Technical view of data flows, network boundaries, and internal GCP technologies used in the current Health Data Compass research data warehouse. Google Cloud icon labels are available at https://docs.google.com/presentation/d/1aGOTpNdCoO4GXZ2es38ZFO5qPGEAjTtDSVeHaDpwsas/edit#slide=id.g5e923c6224_190_56. Abbreviations: APCD: Colorado All Payers Claims Database; CDPHE: State death registry; GCP: Google Cloud Platform; Melissa: Melissa Inc.
Figure 3. Data flows and key Google Cloud Platform (GCP) technologies used by the Translational Informatics Service (TIS). Although TIS uses fewer GCP technologies, TIS deploys more “forward-facing” (App Engine GUI, R Studio), high-performance computing (Eureka HPC), and cloud storage resources than does the RDW.
Health Data Compass key performance indicators as of June 30, 2021
| Health Data Compass key metric | Value |
|---|---|
| Tables | 790 |
| Storage/clinical | 16 TB |
| Storage/genomic | 55 TB |
| Extraction-transform-load jobs | 2000+ |
| Data sources | 6 (3 internal; 3 external) |
| Unique persons | 7.3M |
| Visits (all types) | 51M |
| Conditions/Diagnoses (all types) | 171M |
| Medications (ordered, administered, dispensed) | 240M |
| Measurements (laboratory test) | 1.3B |
| Observations (includes flowsheets) | 6.6B |
| Clinical notes (all types) | 210M |
| Custom data sets delivered | 1286 |
| Custom data marts/registries (local, national) | 15 |
| End-user applications | 9 |
Figure 4. Top, Growth in Google Cloud Platform (GCP) total spend across all GCP services from July 2017. Middle, Growth of GCP monthly costs by specific GCP service October 2020–March 2021. Bottom, Proportion of charges across GCP services January–March 2021.
Categories of underappreciated challenges that emerged during migration from on-premises to cloud data warehouse
| Category | Underappreciated challenges |
|---|---|
| Networking/Network security | Integration with enterprise networking; System security plan/HIPAA compliance; Network access |
| Data engineering | Performance mismatch between source and cloud-based environments |
| Computation | Compute engines; Managed services; Cluster computing |
| Storage | Storage costs; Tiered storage strategies; Data provenance |
| Secure analytics | Cloud-based analytics; Analytics repos vs security |
| Sandboxes/Public data | Sandboxes for low-barrier access; Oversight; Understanding public data sets |
| Innovation/Consulting services | Hyper-innovation/legacy architectures; Development environments; Consulting knowledge |
| Costs/utilization | Oversight and monitoring; Leveraging cost-savings opportunities |