Victor M Castro1, Vivian Gainer1, Nich Wattanasin1, Barbara Benoit1, Andrew Cagan1, Bhaswati Ghosh1, Sergey Goryachev1, Reeta Metta1, Heekyong Park1, David Wang1, Michael Mendis1, Martin Rees1, Christopher Herrick1, Shawn N Murphy1,2.
Abstract
OBJECTIVE: Integrating and harmonizing disparate patient data sources into one consolidated data portal enables researchers to conduct analysis efficiently and effectively.
Keywords: Information storage and retrieval; data curation; data science; electronic health records; genomics; i2b2
Year: 2022 PMID: 34849976 PMCID: PMC8922162 DOI: 10.1093/jamia/ocab264
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. The Biobank Portal architecture is based on Informatics for Integrating Biology and the Bedside (i2b2). Investigators access data through the web client, which interacts with the i2b2 application server using application programming interfaces (APIs). Most data are ingested into the data repository directly, but other data are accessed using external APIs at query time. PM: Project management cell; ONT: Ontology cell; CRC: Data repository cell; OMOP: Observational Medical Outcomes Partnership; CDM: common data model; VCF: variant call format; ETL: extract-transform-load.
Mass General Brigham Biobank Participant Characteristics compared with all health system patients.
| Characteristic | Biobank Portal | All other patients | P value |
|---|---|---|---|
| Demographics | | | |
| Gender | | | <.001 |
| Female | 71 360 (57%) | 2 094 150 (55%) | |
| Male | 54 234 (43%) | 1 716 594 (45%) | |
| Other/Unknown | 4 (<0.1%) | 800 (<0.1%) | |
| Age at last visit | 59 (42, 70) | 46 (26, 64) | <.001 |
| Race | | | <.001 |
| Asian | 3785 (3.0%) | 190 804 (5.0%) | |
| Black | 5960 (4.7%) | 225 069 (5.9%) | |
| Other | 7711 (6.1%) | 681 710 (18%) | |
| Unknown | 2251 (1.8%) | 121 111 (3.2%) | |
| White | 105 891 (84%) | 2 592 850 (68%) | |
| Ethnicity | | | <.001 |
| Hispanic | 3387 (2.7%) | 214 407 (5.6%) | |
| Non-Hispanic | 122 211 (97%) | 3 597 137 (94%) | |
| ACS median income | $70 245 ($57 313, $90 673) | $69 576 ($55 652, $88 829) | <.001 |
| Healthcare utilization | | | |
| Number of visit days | 125 (44, 268) | 9 (2, 38) | <.001 |
| Number of diagnosis codes | 405 (138, 962) | 29 (7, 124) | <.001 |
| Number of clinical notes | 124 (38, 314) | 20 (6, 70) | <.001 |
| Number of diagnostic reports | 122 (47, 262) | 18 (4, 60) | <.001 |
| Available data | | | |
| Electronic health records | 124 760 (99%) | 3 811 544 (100%) | — |
| Health information survey | 55 121 (44%) | — | — |
| Genomic data | 43 552 (35%) | — | — |
| Biospecimens | 88 527 (71%) | — | — |
N (%) or median (IQR).
Other patients are defined as patients with a health system visit since 2010 and not enrolled in the MGB Biobank.
Pearson’s Chi-squared test or Wilcoxon rank-sum test.
2018 American Community Survey (ACS) median income in the patient's zip code.
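To make the footnoted Pearson's chi-squared test concrete, the following minimal sketch recomputes it for the Gender rows of the table. It rests on assumptions: the source does not say which test was applied to which variable, so the sketch assumes the chi-squared test was used for categorical characteristics, with no continuity correction. The function name `pearson_chi_squared` is illustrative and is not the authors' code.

```python
import math

# Gender counts copied from the table: (Biobank Portal, all other patients).
gender_counts = {
    "Female": (71360, 2094150),
    "Male": (54234, 1716594),
    "Other/Unknown": (4, 800),
}


def pearson_chi_squared(table):
    """Return (statistic, degrees of freedom) for a dict of row -> counts."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


chi2, dof = pearson_chi_squared(gender_counts)

# With exactly 2 degrees of freedom, the chi-squared survival function
# has the closed form exp(-x / 2), so no statistics library is needed here.
p_value = math.exp(-chi2 / 2) if dof == 2 else None

# Column percentages as reported in the table (share of Biobank participants).
biobank_total = sum(biobank for biobank, _ in gender_counts.values())
female_share = gender_counts["Female"][0] / biobank_total

print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
print(f"Female share of Biobank participants = {female_share:.0%}")
```

The closed-form p-value only applies to this 3 x 2 table; for tables with other dimensions, or for the Wilcoxon rank-sum test on continuous characteristics, a statistics library would be the usual choice.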
Figure 2. Overview of Biobank Portal Data. Investigators see this screen at every login with information on available data, date of last update, help, and quick start query examples.
Figure 3. Example analysis file specification to download limited datasets.
Biobank Portal example use cases and publications
| Research use case | Investigator type | Data types |
|---|---|---|
| Research feasibility for grant application | All types | All |
| Multicenter genome-wide association studies | Clinical/bioinformatics | EHR and Genomics |
| Machine learning disease subgroup detection using NLP and genetics | Data scientist | EHR, Genomics, and notes |
| Polygenic risk score integration with EHR data for phenotyping | Psychology fellow | EHR, Genomics, and Health Survey |
| Population cohort discovery based on gene variants and laboratory results | Population epidemiologist | Genomic and EHR |
| Obtain biospecimens for control group | Basic scientist | Biospecimen |
| Developing and validating phenotype algorithms | All types | EHR and notes |
| Case-control association study of disease comorbidity | Population epidemiologist | EHR |
| Evaluating the clinical utility of polygenic risk scores | Clinical/bioinformatics | Genomics |