Emily R Pfaff, Andrew T Girvin, Davera L Gabriel, Kristin Kostka, Michele Morris, Matvey B Palchuk, Harold P Lehmann, Benjamin Amor, Mark Bissell, Katie R Bradwell, Sigfried Gold, Stephanie S Hong, Johanna Loomba, Amin Manna, Julie A McMurry, Emily Niehaus, Nabeel Qureshi, Anita Walden, Xiaohan Tanner Zhang, Richard L Zhu, Richard A Moffitt, Melissa A Haendel, Christopher G Chute, William G Adams, Shaymaa Al-Shukri, Alfred Anzalone, Ahmad Baghal, Tellen D Bennett, Elmer V Bernstam, Mark M Bissell, Brian Bush, Thomas R Campion, Victor Castro, Jack Chang, Deepa D Chaudhari, Wenjin Chen, San Chu, James J Cimino, Keith A Crandall, Mark Crooks, Sara J Deakyne Davies, John DiPalazzo, David Dorr, Dan Eckrich, Sarah E Eltinge, Daniel G Fort, George Golovko, Snehil Gupta, Melissa A Haendel, Janos G Hajagos, David A Hanauer, Brett M Harnett, Ronald Horswell, Nancy Huang, Steven G Johnson, Michael Kahn, Kamil Khanipov, Curtis Kieler, Katherine Ruiz De Luzuriaga, Sarah Maidlow, Ashley Martinez, Jomol Mathew, James C McClay, Gabriel McMahan, Brian Melancon, Stephane Meystre, Lucio Miele, Hiroki Morizono, Ray Pablo, Lav Patel, Jimmy Phuong, Daniel J Popham, Claudia Pulgarin, Carlos Santos, Indra Neil Sarkar, Nancy Sazo, Soko Setoguchi, Selvin Soby, Sirisha Surampalli, Christine Suver, Uma Maheswara Reddy Vangala, Shyam Visweswaran, James von Oehsen, Kellie M Walters, Laura Wiley, David A Williams, Adrian Zai.
Abstract
OBJECTIVE: In response to COVID-19, the informatics community united to aggregate as much clinical data as possible to characterize this new disease and reduce its impact through collaborative analytics. The National COVID Cohort Collaborative (N3C) is now the largest publicly available HIPAA limited dataset in US history with over 6.4 million patients and is a testament to a partnership of over 100 organizations.
Keywords: COVID-19; data accuracy; electronic health records
Year: 2022 PMID: 34590684 PMCID: PMC8500110 DOI: 10.1093/jamia/ocab217
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 7.942
Figure 1. The N3C data ingestion and harmonization pipeline. Participating sites regularly submit data in their native CDM format to an ingest server. A parsing step validates whether the data are formatted properly and checks the contents of the payload against its package description, or “manifest.” The pipeline then transforms the submitted data to the OMOP model; data provenance is automatically maintained such that transformed data can be traced back to source at any time. The transformed data are then reviewed for DQ by a team of subject matter experts using a suite of data characterization and visualization tools. Every week, the latest data from all sites passing DQ checks are published as a versioned “release” for use by investigators. DQ: data quality; N3C: National COVID Cohort Collaborative; OMOP: Observational Medical Outcomes Partnership.
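The manifest check in the parsing step can be illustrated with a minimal sketch. The source does not describe the manifest format, so the structure assumed here (a mapping from table name to declared row count) and the names `ManifestCheck` and `check_payload` are hypothetical, not the N3C implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ManifestCheck:
    """Result of comparing a payload with its manifest (hypothetical structure)."""
    missing_tables: list[str]
    unexpected_tables: list[str]
    row_count_mismatches: dict[str, tuple[int, int]]  # table -> (declared, actual)

    @property
    def passed(self) -> bool:
        return not (self.missing_tables or self.unexpected_tables
                    or self.row_count_mismatches)


def check_payload(manifest: dict[str, int],
                  payload: dict[str, list[dict]]) -> ManifestCheck:
    """Compare submitted tables with the tables and row counts the manifest declares.

    Assumption: the manifest maps each table name to its expected row count.
    """
    missing = sorted(set(manifest) - set(payload))
    unexpected = sorted(set(payload) - set(manifest))
    mismatches = {
        table: (declared, len(payload[table]))
        for table, declared in manifest.items()
        if table in payload and len(payload[table]) != declared
    }
    return ManifestCheck(missing, unexpected, mismatches)
```

A payload that fails any of the three comparisons would be returned to the site before transformation; a real manifest would likely carry more metadata (for example, the source CDM name and version) than this sketch assumes.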
Data quality issue types
| Check type | Data checks |
|---|---|
| Source CDM conformance | |
| Demographics | |
| COVID tests | |
| Conditions | |
| Encounters | |
| Measurements/observations | |
| Coding completeness | |
| Fitness for use | Use of the data by researchers often reveals additional DQ issues for one or more sites (eg, sparsely populated body mass index data, in the context of a study of obesity and COVID). In these cases, we report the findings to sites so that they can take action in their local data if they wish to have their site’s data included in the study |
“Must Pass” and “Heads Up” data checks for release into the N3C Data Enclave.
DQ: data quality; N3C: National COVID Cohort Collaborative; OMOP: Observational Medical Outcomes Partnership.
Figure 2. Vital sign coverage visualization, N3C OMOP sites. This heatmap is representative of those that we sent to sites to provide them with benchmarked lab and vital coverage information. The rows represent concept sets for vital signs and the columns are individual sites. The cell colors reflect the z-score of the percentage of COVID inpatients at each site that have at least 1 lab or vital of that type recorded during their hospitalization. The bluer the color, the higher the percentage of COVID inpatients that have that vital sign at that site; redder shades mean a lower percentage of patients with that vital. Rows and columns are hierarchically clustered, bringing similar sites closer together, and similar vitals closer together. This visualization enables sites to compare their data coverage with other sites using the same data model. (Site numbers are anonymized and have been changed from the site numbers used inside the N3C Enclave.) N3C: National COVID Cohort Collaborative; OMOP: Observational Medical Outcomes Partnership.
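The cell values of such a heatmap can be sketched as follows. This assumes the z-score is taken per vital sign (row) across sites and uses the population standard deviation; the caption does not specify either choice, and the function name `coverage_z_scores` is hypothetical.

```python
from __future__ import annotations

from statistics import mean, pstdev


def coverage_z_scores(
    coverage: dict[str, dict[str, float]],
) -> dict[str, dict[str, float]]:
    """Z-score each site's coverage percentage within each vital sign.

    `coverage[vital][site]` is the percentage of COVID inpatients at `site`
    with at least one record of `vital` during hospitalization.
    """
    z: dict[str, dict[str, float]] = {}
    for vital, by_site in coverage.items():
        mu = mean(by_site.values())
        sigma = pstdev(by_site.values())
        z[vital] = {
            # A row with identical coverage everywhere has no spread; report 0.
            site: 0.0 if sigma == 0 else (pct - mu) / sigma
            for site, pct in by_site.items()
        }
    return z
```

Positive values correspond to above-average coverage (the bluer cells in the figure) and negative values to below-average coverage (the redder cells). The hierarchical clustering of rows and columns is a separate step not shown here.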
Data quality heuristics
| No. | Heuristic | Type | No. of sites | Sites (%)ᵃ |
|---|---|---|---|---|
| 1 | Not using (or improperly using) source CDM’s controlled vocabulary in one or more fields | Source CDM conformance | 13 | 23.2 |
| 2 | COVID test result values not standardized or null | COVID tests | 11 | 19.6 |
| 3 | Lacking/incorrectly populating field(s) required by source CDM | Source CDM conformance | 9 | 16.1 |
| 4 | Implausible distribution of visit types (eg, 75% inpatient) | Encounters | 7 | 12.5 |
| 5 | Large number of “No Matching Concept” records (OMOP source only) | Coding completeness | 6 | 10.7 |
| 6 | Lacking table(s) required by source CDM | Source CDM conformance | 5 | 8.9 |
| 7 | Many or all inpatient visits lacking valid end dates | Encounters | 5 | 8.9 |
| 8 | Few or no clinical encounters coded with U07.1 | Conditions | 5 | 8.9 |
| 9 | Implausible count of patients qualifying for phenotype | Demographics | 3 | 5.4 |
| 10 | Small number of unique measurement/observation types | Measurements/observations | 2 | 3.6 |
| 11 | PERSON_IDs in fact tables that are not in the PERSON table | Coding completeness | 2 | 3.6 |
| 12 | Primary keys are not unique | Coding completeness | 2 | 3.6 |
| 13 | Inconsistent local date shifting causing implausible timelines | Coding completeness | 2 | 3.6 |
| 14 | Implausible demographics (eg, 100% male patients) | Demographics | 2 | 3.6 |
| 15 | Data utility challenges (eg, missing mortality data) | Fitness for use | N/A | N/A |
Items compiled here are from a qualitative analysis of the “Must Pass” data issues filed on any one of the 56 currently released N3C sites that resulted in a fix by the site. Fitness for Use is an additional heuristic that applies to all sites and is thus also included here. Simple formatting errors (eg, incorrect delimiters) and noncritical “Heads Up” issues are excluded from this analysis.
ᵃ Denominator: 56 sites; 37 unique sites are represented across these categories.
N3C: National COVID Cohort Collaborative; OMOP: Observational Medical Outcomes Partnership.
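Heuristics 11 and 12 in the table above are mechanical enough to sketch in code. The example below assumes tables are held as lists of row dictionaries keyed by OMOP column names such as `person_id`; the function names are hypothetical and this is not the N3C check suite.

```python
from __future__ import annotations

from collections import Counter


def duplicate_keys(rows: list[dict], key: str) -> list:
    """Return primary-key values that occur more than once (heuristic 12)."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def orphan_person_ids(person_rows: list[dict], fact_rows: list[dict]) -> list:
    """Return PERSON_IDs used in a fact table but absent from PERSON (heuristic 11)."""
    known = {row["person_id"] for row in person_rows}
    return sorted({row["person_id"] for row in fact_rows} - known)
```

An empty list from both functions means the table passes; any returned values identify the specific keys a site would need to investigate.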
Figure 3. Improved percentages of valid COVID-19 test results across 11 N3C sites. The 11 sites shown here each had initial N3C submissions with high numbers of invalid (null, nonstandard) COVID test results. As time moves forward (left to right on the x-axis), drastic improvements are made following feedback from N3C. The blue line and shaded area represent the mean and standard deviation across all sites. N3C: National COVID Cohort Collaborative.
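The quantity on the y-axis, the percentage of valid test results in a submission, can be computed along these lines. The set of accepted values below is a placeholder: the source says only that invalid results are null or nonstandard and does not list the standardized values.

```python
from __future__ import annotations

# Hypothetical standardized values; the source does not enumerate them.
STANDARD_RESULTS = {"positive", "negative"}


def percent_valid(results: list[str | None]) -> float:
    """Percentage of COVID test results that are non-null and standardized."""
    if not results:
        return 0.0
    valid = sum(1 for r in results if r in STANDARD_RESULTS)
    return 100.0 * valid / len(results)
```

Computing this value for each successive submission from a site yields one line of the kind plotted in the figure.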
Figure 4. In A, one site’s initial N3C submission had a proportion of inpatient visits far above that of similar sites; in B, 4 sites’ initial submissions had no (or nearly no) inpatient visits. Our feedback encouraged the sites to re-examine and revise their source-to-CDM visit type mappings. In these cases, proportions improved. The shaded area reflects the mean and standard deviation of all sites. N3C: National COVID Cohort Collaborative.
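One simple way to surface candidates for this kind of review is to compare each site's inpatient share with the cross-site mean and standard deviation. The sketch below is an illustration only: the threshold `k`, the `"inpatient"` label, and both function names are assumptions, and in N3C the judgment was made by subject matter experts rather than by a fixed cutoff.

```python
from __future__ import annotations

from statistics import mean, stdev


def inpatient_share(visit_types: list[str]) -> float:
    """Fraction of a site's visits whose type is 'inpatient' (assumed label)."""
    if not visit_types:
        return 0.0
    return sum(1 for v in visit_types if v == "inpatient") / len(visit_types)


def sites_outside_band(shares: dict[str, float], k: float = 1.0) -> list[str]:
    """Sites whose share lies more than k sample standard deviations from the mean.

    Requires at least two sites, because the sample standard deviation is
    undefined for a single value.
    """
    mu = mean(shares.values())
    sd = stdev(shares.values())
    return sorted(site for site, p in shares.items() if abs(p - mu) > k * sd)
```

Because an extreme site inflates the mean and standard deviation it is compared against, a band like this can miss sites at the opposite extreme (such as those with almost no inpatient visits), which is one reason visual review remains useful.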
Figure 5. Comparing sites within centralized data. One of the starkest differences we have observed among sites is the way that a “visit” (or encounter) is defined. Indeed, inpatient visits at several N3C sites are made up of a number (at times hundreds) of “microvisits”: consults with different specialists, imaging, infusions, et cetera. Because sites define inpatient visits so differently, they are difficult to harmonize. Centralized data make it easier to compare how sites define visits and develop derivative variables to enable harmonization. N3C: National COVID Cohort Collaborative.
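One possible derivative variable is a merged visit that rolls overlapping microvisits into a single stay. The sketch below uses plain interval merging for one patient's visits; the source does not describe the actual N3C derivation, so the merging rule and the name `merge_microvisits` are assumptions.

```python
from __future__ import annotations

from datetime import date


def merge_microvisits(
    visits: list[tuple[date, date]],
) -> list[tuple[date, date]]:
    """Merge one patient's (start, end) visits that overlap or touch on a date."""
    merged: list[tuple[date, date]] = []
    for start, end in sorted(visits):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous stay: extend it if needed.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

Under this rule, hundreds of same-stay consults, imaging visits, and infusions collapse into one interval per hospitalization, which gives a more comparable unit across sites than the raw visit records.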