| Literature DB >> 32805036 |
Melissa A Haendel1,2, Christopher G Chute3, Tellen D Bennett4, David A Eichmann5, Justin Guinney6, Warren A Kibbe7, Philip R O Payne8, Emily R Pfaff9, Peter N Robinson10, Joel H Saltz11, Heidi Spratt12, Christine Suver6, John Wilbanks6, Adam B Wilcox13, Andrew E Williams14, Chunlei Wu15, Clair Blacketer16, Robert L Bradford9, James J Cimino17, Marshall Clark9, Evan W Colmenares18, Patricia A Francis19, Davera Gabriel19, Alexis Graves20, Raju Hemadri21, Stephanie S Hong19, George Hripcsak22, Dazhi Jiao19, Jeffrey G Klann23, Kristin Kostka24, Adam M Lee25, Harold P Lehmann19, Lora Lingrey26, Robert T Miller27, Michele Morris28, Shawn N Murphy29, Karthik Natarajan30, Matvey B Palchuk26, Usman Sheikh21, Harold Solbrig19, Shyam Visweswaran28, Anita Walden1,6, Kellie M Walters9, Griffin M Weber31, Xiaohan Tanner Zhang19, Richard L Zhu19, Benjamin Amor32, Andrew T Girvin32, Amin Manna32, Nabeel Qureshi32, Michael G Kurilla33, Sam G Michael34, Lili M Portilla35, Joni L Rutter36, Christopher P Austin34, Ken R Gersing21.
Abstract
OBJECTIVE: Coronavirus disease 2019 (COVID-19) poses societal challenges that require expeditious data and knowledge sharing. Though organizational clinical data are abundant, these are largely inaccessible to outside researchers. Statistical, machine learning, and causal analyses are most successful with large-scale data beyond what is available in any given organization. Here, we introduce the National COVID Cohort Collaborative (N3C), an open science community focused on analyzing patient-level data from many centers.Entities:
Keywords: COVID-19; EHR data; SARS-CoV-2; clinical data model harmonization; collaborative analytics; open science
Year: 2021 PMID: 32805036 PMCID: PMC7454687 DOI: 10.1093/jamia/ocaa196
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 7.942
Figure 1. Establishing National COVID Cohort Collaborative (N3C) sociotechnical processes and infrastructure via community workstreams. Each workstream includes representatives from National Center for Advancing Translational Sciences (NCATS), the Clinical and Translational Science Awards hubs, the Center for Data to Health, sites contributing data, and other members of the research community. (1) Data Partnership and Governance: This workstream designs governance and makes regulatory recommendations to National Institutes of Health (NIH) for their execution. Organizations sign a Data Transfer Agreement (DTA) with NCATS and may use the central institutional review board. (2) Phenotype and Data Acquisition: The community defines inclusion criteria for the N3C COVID-19 (coronavirus disease 2019) cohort and supports organizations in customized data export. (3) Data Ingest and Harmonization: Data reside within different organizations in different common data models. This workstream quality-assures and harmonizes data from different sources and common data models into a unified dataset. (4) Collaborative Analytics: Data are made accessible for collaborative use by the N3C community. A secure data enclave (N3C Enclave), from which data cannot be removed, houses analytical tools and supports reproducible and transparent workflows. Formulation of clinical research questions and development of prototype machine learning and statistical workflows are collaboratively coordinated; portals and dashboards support resource, data, expertise, and results navigation and reuse. (5) Synthetic Clinical Data: A pilot to determine the degree to which synthetic derivatives of the Limited Data Set are able to approximate analyses derived from original data, while enhancing shareable data outside the N3C Enclave. ACT: Accrual to Clinical Trials; OMOP: Observational Medical Outcomes Partnership; PCORnet: National Patient-Centered Clinical Research Network.
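The Data Ingest and Harmonization workstream above maps records from several source common data models into one unified model. As a conceptual sketch only, the snippet below transforms a hypothetical source-CDM condition record into an OMOP-shaped row; the field names and the single-entry code map are illustrative placeholders, not the actual N3C mapping tables.

```python
# Illustrative sketch: harmonizing a source-CDM condition record into a
# minimal OMOP-style row. Field names and the concept map are hypothetical,
# not the real N3C/OMOP vocabulary mappings.

# Hypothetical map from a source diagnosis code to an OMOP concept_id
SOURCE_TO_OMOP_CONCEPT = {
    "U07.1": 37311061,  # example entry: an ICD-10 COVID-19 code -> an OMOP concept_id
}

def harmonize_condition(source_row: dict) -> dict:
    """Transform a source-CDM condition record into an OMOP-shaped dict."""
    code = source_row["dx_code"]
    return {
        "person_id": source_row["patient_id"],
        "condition_concept_id": SOURCE_TO_OMOP_CONCEPT.get(code, 0),  # 0 = unmapped
        "condition_source_value": code,  # original code kept for provenance
        "condition_start_date": source_row["dx_date"],
    }

row = harmonize_condition(
    {"patient_id": 42, "dx_code": "U07.1", "dx_date": "2020-04-01"}
)
```

In practice this step is driven by validated vocabulary maps per source CDM (see Figure 3), rather than a hand-written dictionary.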
Figure 2. Panel A. Regulatory steps and user access. Organizations can operate as data contributors, data users, or both; contribution is not required for use. For contributing organizations, the first step is a Data Transfer Agreement (DTA), which is executed between National Center for Advancing Translational Sciences (NCATS) and the contributing organization (and its affiliates where applicable). For organizations using data, a separate, umbrella/institute-wide Data Use Agreement (DUA) is executed between organizations and NCATS. Interested investigators submit a Data Use Request (DUR) for each project proposal, which is reviewed by a Data Access Committee (DAC). The DUR includes a brief description of how the data will be used; a signed User Code of Conduct (UCoC) that articulates fundamental actions and prohibitions on data user activities; and, if requesting access to patient-level data, proof of additional institutional review board (IRB) approval. The DAC reviews the DUR and, upon approval, grants access to the appropriate data level within the National COVID Cohort Collaborative (N3C) Enclave. Synthetic data currently follow the same procedure, but if the pilot is successful, we aim to make access available by simple registration if provisioned by the organizations. The lock symbol references steps where multiple conditions must be met. HIPAA: Health Insurance Portability and Accountability Act; LDS: Limited Data Set; NIH: National Institutes of Health. Panel B. Features and requirements for each level of data in the N3C Enclave: Synthetic, De-identified, and Limited Data Set.
Scale comparison of 3 sites’ positive COVID-19 cases, their N3C-relevant cohort, and their denominator (number of patients seen in a 1-year period)
| | Site 1 | Site 2 | Site 3 |
|---|---|---|---|
| COVID-19–positive patients as publicly reported by site | 2550 | 5540 | 390 |
| N3C-relevant cohort | 67 350 | 46 500 | 12 000 |
| Denominator | 1 271 510 | 1 259 330 | 172 000 |
All numbers rounded to the nearest 10.
COVID-19: coronavirus disease 2019; N3C: National COVID Cohort Collaborative.
COVID-19–positive patients: the number of COVID-19–positive patients publicly reported by each site as of the week of June 8, 2020.
N3C-relevant cohort: the number of patients qualifying for the N3C COVID-19–relevant phenotype at each site as of the week of June 8, 2020.
Denominator: the number of unique patients seen in a 1-year period at each site.
Data extraction and transfer methods that sites may use to submit data to N3C
| Method | Human (Manual) Steps | Automated Steps |
|---|---|---|
| R package | Download the R and SQL code. Configure local variables (DB connection, schema names, etc.). | Run phenotype and extract scripts. Extract results to individual files, following N3C naming and structure conventions. sFTP extract to N3C. |
| Python package | Download the Python and SQL code. Configure local variables (DB connection, schema names, etc.). | Run phenotype and extract scripts. Extract results to individual files, following N3C naming and structure conventions. sFTP extract to N3C. |
| TriNetX | (Automated step runs first.) 1. Download data from TriNetX. 2. sFTP extract to N3C. | TriNetX runs phenotype and extract scripts on the site's behalf. |
| SQL | Download the SQL code. Configure local variables (schema names, etc.). Run phenotype script. Run extract scripts, one at a time. Extract results to individual files using the N3C directory structure, naming conventions, and file format. sFTP extract to N3C. | None |
DB: database; N3C: National COVID Cohort Collaborative.
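The R, Python, and SQL paths above share the same shape: run extract scripts against the local database, write each result to its own delimited file under N3C naming conventions, and sFTP the files to N3C. The sketch below illustrates only the file-writing step; the `<SITE>_<TABLE>.csv` naming pattern and pipe delimiter are illustrative assumptions, not the exact N3C conventions.

```python
# Hedged sketch of the site-side export step: write one extract result to a
# pipe-delimited file named per a simple <SITE>_<TABLE>.csv convention
# (illustrative; not the actual N3C naming specification).
import csv
import io

def export_table(rows, header, table_name, site_abbrev):
    """Render one extract as pipe-delimited text plus its conventional filename."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="|")
    writer.writerow(header)   # column header row
    writer.writerows(rows)    # one line per extracted record
    filename = f"{site_abbrev}_{table_name}.csv"
    return filename, buf.getvalue()

name, payload = export_table(
    rows=[(1, "U07.1"), (2, "U07.1")],
    header=("person_id", "dx_code"),
    table_name="CONDITION_OCCURRENCE",
    site_abbrev="SITE1",
)
# The real packages would then sFTP these files (plus a manifest) to N3C.
```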
Data quality tools and methods evaluated
| Tool Type | Tool |
|---|---|
| Native CDM DQ Processes | Data Check Scripts (v8.0) |
| | "Smoke" tests |
| | Focused Curation Process |
| | Process automation support |
| | Data & map validation functions |
| OHDSI Collaborative Tools | Data quality tests of OMOP databases |
| | Design/execute analytics on OMOP databases |
| | Data characterization of OMOP databases |
| | ETL preparation and support |
| | SQL, R |
ACT: Accrual to Clinical Trials; CDM: common data model; DQ: data quality; ETL: extract-transform-load; OHDSI: Observational Health Data Sciences and Informatics; OMOP: Observational Medical Outcomes Partnership; PCORnet: National Patient-Centered Clinical Research Network.
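The tools above run conformance- and plausibility-style tests against each CDM. As a minimal sketch in that spirit (not the actual OHDSI DataQualityDashboard implementation), two representative checks are a not-null test on a required key and a value-set test against an expected vocabulary:

```python
# Hedged sketch of two conformance-style data quality checks. The OMOP gender
# concept ids used below (8507 = male, 8532 = female) are standard, but the
# check functions themselves are illustrative, not any tool's real API.
def check_not_null(rows, field):
    """Return rows where a required field (e.g., a primary key) is missing."""
    return [r for r in rows if r.get(field) is None]

def check_value_set(rows, field, allowed):
    """Return rows whose field value falls outside the expected vocabulary."""
    return [r for r in rows if r.get(field) not in allowed]

rows = [
    {"person_id": 1, "gender_concept_id": 8507},
    {"person_id": None, "gender_concept_id": 8532},  # fails not-null check
    {"person_id": 3, "gender_concept_id": 99},       # fails value-set check
]
null_failures = check_not_null(rows, "person_id")
set_failures = check_value_set(rows, "gender_concept_id", {8507, 8532})
```

In the ingestion pipeline (Figure 3), failing rows like these would be recorded in the payload documentation rather than silently dropped.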
Examples of community contributed tools integrated within the N3C computing environment
| Tool | Description |
|---|---|
| OHDSI Atlas | Open-source web application for cohort definition and standardized analytics on OMOP CDM databases |
| LOINC2HPO | Converts LOINC-coded laboratory results into Human Phenotype Ontology (HPO) terms |
| NCATS Biomedical Data Translator | Knowledge graph (KG)-based APIs and reasoning tools for integrating biomedical knowledge |
| Leaf | Web-based clinical data query and patient cohort discovery tool |
API: application programming interface; CDM: common data model; HPO: Human Phenotype Ontology; KG: knowledge graph; N3C: National COVID Cohort Collaborative; NCATS: National Center for Advancing Translational Sciences; OHDSI: Observational Health Data Sciences and Informatics; OMOP: Observational Medical Outcomes Partnership.
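LOINC2HPO's core idea is to interpret a coded lab result (LOINC code plus an abnormal-flag interpretation) as a phenotype term. The sketch below shows that mapping shape conceptually; the table entries are illustrative placeholders, not the curated LOINC2HPO annotations.

```python
# Conceptual sketch of LOINC2HPO-style conversion: (LOINC code, interpretation
# flag) -> HPO term. Entries are illustrative examples, not the curated
# annotation set shipped with the real tool.
LOINC_TO_HPO = {
    ("6690-2", "L"): "HP:0001882",  # leukocyte count low  -> Leukopenia (example)
    ("6690-2", "H"): "HP:0001974",  # leukocyte count high -> Leukocytosis (example)
}

def lab_to_phenotype(loinc_code: str, interpretation: str):
    """Return the HPO term for a lab observation, or None if unannotated."""
    return LOINC_TO_HPO.get((loinc_code, interpretation))

term = lab_to_phenotype("6690-2", "L")
```

Representing lab results as HPO terms lets them be queried alongside diagnosis-derived phenotypes within the Enclave.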
Figure 3. National COVID Cohort Collaborative (N3C) Data Quality Checks. At the sites, the extraction script performs a check for duplicate primary keys; if duplicate keys are found, the extraction fails until the site resolves the error. When extraction is successfully completed, a data “manifest” is created that contains metadata about the site and the payload. Site personnel then sFTP the data to N3C to be queued for ingestion. The first step in the ingestion process checks that the payload is consistent with the formatting requirements and the manifest file. Next, the payload is loaded into a database modeled after the payload’s native common data model (CDM), which ensures source data model conformance. Next, a series of data quality checks including a subset of coronavirus disease 2019 (COVID-19)–specific code validations are performed, and if needed, minimal corrections are made. Any corrections are recorded and added to the payload documentation. Next, the payload is transformed to Observational Medical Outcomes Partnership (OMOP) 5.3.1 using the validated maps from the payload’s native CDM. Once in OMOP 5.3.1, a subset of the Observational Health Data Sciences and Informatics (OHDSI) Data Quality Dashboard tests are run, and the results of these are added to the payload documentation. The payload is then exported to a merged database containing all the previously harmonized site data payloads, where it is then checked for conformance again before export to the analytics pipeline. DC: Data Characterization; DQD: Data Quality Dashboard.
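The first site-side gate in this pipeline is the duplicate-primary-key check, followed by generation of the payload manifest. A minimal sketch, assuming simple in-memory rows (the manifest field names here are illustrative, not the actual N3C manifest schema):

```python
# Sketch of the site-side extraction gate: extraction fails if any primary key
# is duplicated, and a small manifest of payload metadata accompanies the
# files. Manifest field names are illustrative assumptions.
from collections import Counter

def find_duplicate_keys(rows, key_field):
    """Return the sorted set of primary-key values that appear more than once."""
    counts = Counter(r[key_field] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def build_manifest(site_name, cdm_name, tables):
    """Summarize the payload: contributing site, source CDM, per-table row counts."""
    return {
        "site": site_name,
        "source_cdm": cdm_name,  # e.g., OMOP, PCORnet, ACT, TriNetX
        "row_counts": {t: len(rows) for t, rows in tables.items()},
    }

persons = [{"person_id": 1}, {"person_id": 2}, {"person_id": 2}]
dupes = find_duplicate_keys(persons, "person_id")  # non-empty -> extraction halts
manifest = build_manifest("SITE1", "OMOP", {"person": persons})
```

Downstream, the ingestion process compares the received files against this manifest before loading them into the native-CDM staging database.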
Figure 4. National COVID Cohort Collaborative (N3C) Enclave. The analytical environment for N3C is a secure, virtualized, cloud-based platform. Within the N3C Enclave, researchers have access to raw data, as well as transformed and harmonized datasets derived by other researchers. Analytical tools hosted within the environment support complex ETL (extract-transform-load), generation of coronavirus disease 2019 (COVID-19)–specific data elements, statistical analysis, machine learning, and rich visualizations. Third-party tools contributed by the community can be integrated into the environment; current tools include Observational Health Data Sciences and Informatics (OHDSI) tools and the Leaf patient cohort builder. N3C is developing methods for integration of genomic, imaging, and other data modalities.
Figure 5. The Contributor Attribution Model. In the National COVID Cohort Collaborative Enclave, the Contributor Attribution Model aggregates all contributions to any given workflow or report, with a specific declaration of exactly what each person contributed, supporting the notion of transitive credit. ORCID identifiers are used to identify users. An example contributor to an artifact used in the National COVID Cohort Collaborative is shown in the lower panel.