The Mass General Brigham Biobank Portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics.

Victor M Castro¹, Vivian Gainer¹, Nich Wattanasin¹, Barbara Benoit¹, Andrew Cagan¹, Bhaswati Ghosh¹, Sergey Goryachev¹, Reeta Metta¹, Heekyong Park¹, David Wang¹, Michael Mendis¹, Martin Rees¹, Christopher Herrick¹, Shawn N Murphy^1,2.

Abstract

OBJECTIVE: Integrating and harmonizing disparate patient data sources into one consolidated data portal enables researchers to conduct analysis efficiently and effectively.
MATERIALS AND METHODS: We describe an implementation of Informatics for Integrating Biology and the Bedside (i2b2) to create the Mass General Brigham (MGB) Biobank Portal data repository. The repository integrates data from primary and curated data sources and is updated weekly. The data are made readily available to investigators in a data portal where they can easily construct and export customized datasets for analysis.
RESULTS: As of July 2021, there are 125 645 consented patients enrolled in the MGB Biobank. 88 527 (70.5%) have a biospecimen, 55 121 (43.9%) have completed the health information survey, 43 552 (34.7%) have genomic data and 124 760 (99.3%) have EHR data. Twenty machine learning computed phenotypes are calculated on a weekly basis. There are currently 1220 active investigators who have run 58 793 patient queries and exported 10 257 analysis files.
DISCUSSION: The Biobank Portal allows noninformatics researchers to conduct study feasibility by querying across many data sources and then extract data that are most useful to them for clinical studies. While institutions require substantial informatics resources to establish and maintain integrated data repositories, they yield significant research value to a wide range of investigators.
CONCLUSION: The Biobank Portal and other patient data portals that integrate complex and simple datasets enable diverse research use cases. i2b2 tools to implement these registries and make the data interoperable are open source and freely available.

Keywords: Information storage and retrieval; data curation; data science; electronic health records; genomics; i2b2

Year: 2022 PMID: 34849976 PMCID: PMC8922162 DOI: 10.1093/jamia/ocab264

Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497

INTRODUCTION

Observational cohort studies are an important study design to complement interventional studies and answer a wide variety of research questions. Prospective and retrospective cohort studies are often designed to support well-defined research hypotheses and also test new hypotheses developed after study initiation. To do so, these studies often collect data beyond what is defined in the endpoints in the analysis plan. These data may include broad electronic health records (EHRs), genomics, complex assays, imaging studies, and more. Examples include large national observational cohort studies such as All of Us, the Million Veteran Program, and the UK Biobank that are collecting data and biospecimens to advance biological discoveries in medicine. In addition, large disease-specific observational cohorts have also been collecting high-dimensional datasets outside of the data required to test their initial hypotheses. These cohorts link primary data collection in electronic case report forms to EHR data, questionnaires, imaging studies, and more. Integrating these disparate and high-dimensional data types into a research data repository and making them discoverable and accessible by a diverse set of investigators is complex and requires specialized tools and informatics skills. Informatics for Integrating Biology and the Bedside (i2b2) is a software platform and data model initially built for cohort discovery from EHR data. Since its initial release over 10 years ago, the platform has been used to support an increasingly broad range of use cases, including creating networks of data registries, supporting disease-specific research registries, and aggregating data from clinical studies. The flexibility of the platform and its data model enables a wide range of customizations to support patient data. Collected data are harmonized and made available using the FAIR principles of findability, accessibility, interoperability, and reusability.

Objective

In this article, we describe an implementation of i2b2 to create the Mass General Brigham (MGB) Biobank Portal data repository. The repository integrates data from many different types and schemas for over 125 000 consented patients and is updated weekly. The resulting structured data are made available to investigators in an easy-to-use data portal where they can construct and export customized datasets for analysis.

MATERIALS AND METHODS

Data repository population

The MGB Biobank (formerly named Partners Biobank) is an ongoing observational research project that enrolls patients and employees of a multicenter health system in Eastern Massachusetts. Participants are enrolled using a broad-based consent process by up to 30 research coordinators located at health system practices, in public hospital locations, as part of a collaborating study (dual consent), or electronically through Patient Gateway, the MGB patient portal. Biobank recruitment materials are available in print and electronically at https://biobank.massgeneralbrigham.org. MGB investigators conducting their own studies also consent patients to the Biobank (using the Biobank consent language) at the same time as they consent them for their own studies. Investigators with limited resources often consent research participants into the Biobank to take advantage of centralized resources for EHR data and sample collection, management, and genotyping. Demographic data and blood samples are collected at baseline and linked to EHR data and self-reported health surveys for ongoing research. Biobank enrollment and biospecimen collection are supported through institutional funding, but most data and biospecimen analyses are supported by public and private grant funding. All adult patients able to provide informed consent are eligible to participate. A small number of children have also been enrolled as part of a collaborating study with IRB-approved assent forms. Once a child turns 18, they are recontacted to consent using the adult form. If they refuse consent or cannot be contacted, they are removed from the Biobank and their data are no longer available in the Biobank Portal. The Human Research Committee of MGB approved the Biobank research protocol (2009P002312).

i2b2 software and data model

i2b2 is both a modular software platform and a data meta-model. The software components include a middleware module (the Project Management cell, PM) that provides authentication and authorization at the project level based on HIPAA guidelines. The data repository module (Clinical Research Chart, CRC) drives much of the interaction with the source data using well-defined application programming interfaces (APIs). The ontology module manages the metadata attached to the data and enables powerful query capabilities across various local and external data sources. The web user interface (webclient) is a JavaScript application that runs in the user's browser and communicates with the i2b2 APIs. The webclient is highly adaptable and extensible using plugins. The Biobank Portal webclient is a customized version of the standard i2b2 webclient. i2b2 is completely open source, distributed under the Mozilla Public License, version 2.0, and has an active developer and academic user community. The i2b2 data model is described as a metamodel in that it defines a database schema for storing large-scale health data with the patient as the single point of reference. Beyond that, there is no prescribed schema for specific types of data in the model; rather, sites can customize how data are stored and define ontology rules to access those data. The data model is based on a star schema developed by Kimball and Ross and optimized for data warehousing use cases. The ontology is decoupled from the data, enabling a flexible and extensible data repository that can access data locally and remotely. The Biobank Portal leverages these features to make disparate data sources interoperable and queryable across the many different types of data described below. Figure 1 illustrates the overall architecture of the Biobank Portal using the i2b2 platform.
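The star schema and the decoupled ontology can be illustrated with a minimal sketch. It assumes a heavily reduced version of the i2b2 tables (`observation_fact`, `patient_dimension`, `concept_dimension`) with only a few columns, uses forward slashes in concept paths for readability, and contains made-up patients and facts; it is not the Biobank Portal schema or query engine.

```python
import sqlite3

# Reduced i2b2-style star schema: one fact table plus two dimensions.
# Column subset, concept paths, and all rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dimension (patient_num INTEGER PRIMARY KEY, sex_cd TEXT);
CREATE TABLE concept_dimension (
    concept_path TEXT PRIMARY KEY, concept_cd TEXT, name_char TEXT);
CREATE TABLE observation_fact (
    patient_num INTEGER, concept_cd TEXT, start_date TEXT, nval_num REAL);
""")
conn.executemany(
    "INSERT INTO patient_dimension VALUES (?, ?)",
    [(1, "F"), (2, "M"), (3, "F")],
)
conn.executemany(
    "INSERT INTO concept_dimension VALUES (?, ?, ?)",
    [
        ("/Diagnoses/Endocrine/Type2Diabetes/", "ICD10CM:E11.9", "Type 2 diabetes"),
        ("/Diagnoses/Circulatory/AFib/", "ICD10CM:I48.91", "Atrial fibrillation"),
        ("/Labs/HbA1c/", "LOINC:4548-4", "Hemoglobin A1c"),
    ],
)
conn.executemany(
    "INSERT INTO observation_fact VALUES (?, ?, ?, ?)",
    [
        (1, "ICD10CM:E11.9", "2019-12-01", None),
        (1, "LOINC:4548-4", "2019-11-15", 8.2),
        (2, "ICD10CM:I48.91", "2020-02-10", None),
        (3, "ICD10CM:E11.9", "2018-05-03", None),
        (3, "ICD10CM:I48.91", "2021-01-20", None),
    ],
)


def patients_under(path_prefix):
    """Return patients with any fact whose concept sits under an ontology path."""
    rows = conn.execute(
        """
        SELECT DISTINCT f.patient_num
        FROM observation_fact AS f
        JOIN concept_dimension AS c ON c.concept_cd = f.concept_cd
        WHERE c.concept_path LIKE ? || '%'
        ORDER BY f.patient_num
        """,
        (path_prefix,),
    ).fetchall()
    return [row[0] for row in rows]


print(patients_under("/Diagnoses/Endocrine/"))  # [1, 3]
print(patients_under("/Diagnoses/"))            # [1, 2, 3]
```

Because the query is expressed against ontology paths rather than against specific codes, adding a new data type in this sketch only requires new concept rows and facts, not a schema change.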

Figure 1.

The Biobank Portal architecture is based on Informatics for Integrating Biology and the Bedside (i2b2). Investigators access data through the webclient, which interacts with the i2b2 application server using application programming interfaces (APIs). Most data are ingested into the data repository directly, but other data are accessed using external APIs at query time. PM: Project Management cell; ONT: Ontology cell; CRC: Data repository cell; OMOP: Observational Medical Outcomes Partnership; CDM: common data model; VCF: variant call format; ETL: extract-transform-load.

Primary data

Demographic, consent, and biospecimen data

Participant identity and demographic characteristics are managed as part of the core identity management functionality of the Biobank Portal platform. Demographic information is sourced from the Enterprise Master Patient Index (EMPI) and transformed to normalized age, race, gender, and ethnicity US Census categories. Each patient is mapped to one or more medical record numbers, a Biobank participant ID, and other identifiers, ensuring that data linkage for each patient is maintained across data sources and time. Patient consent information is provided by consent tracking software (CONSTRACK), which manages enrollment and withdrawal. Biospecimens are tracked in a laboratory information management system that records sample type, accession date, and quantity for plasma, serum, buffy coat, and other sample types. Consent and biospecimen data updates are sent via HL7 and ingested into the main i2b2 data model tables.

EHR data

EHRs used for research have been extensively utilized for a wide range of studies. The i2b2 software began primarily as a data warehousing and query tool for EHR data before expanding to many more types of data. The Biobank Portal integrates EHR data retrieved from the MGB Research Patient Data Registry (RPDR). Data are ingested using extract-transform-load (ETL) procedures and stored in the i2b2 data model. i2b2 can also query data stored on the observational medical outcomes partnership (OMOP) common data model (CDM). Information on OMOP on i2b2 is available at https://community.i2b2.org/wiki/display/OMOP/OMOP+Home.

Case report and survey data

Each Biobank participant is asked to complete a comprehensive health information survey (available in English and Spanish) at the time of enrollment. These data are collected in REDCap, a widely used electronic data capture tool, using an online portal, either by the patient alone or together with a research coordinator. The data from REDCap are retrieved using a native API, transformed, and loaded into a staging table on a quarterly basis to be ingested into the core Biobank Portal data repository at each build. The latest i2b2 version provides native REDCap import functionality, described at https://community.i2b2.org/wiki/display/RM/1.7.12+Release+Notes. The most recent version of the Biobank Health Information Survey is included in the Supplementary Material. Although all participants are sent the survey, only about 43% complete the questionnaire. Barriers to survey completion include time and workflow issues. Many patients are enrolled in waiting rooms and do not have time to complete the 20- to 30-min survey, and they then do not follow up to complete it online or return a paper version.

Genomic data

Biobank samples are genotyped using 3 versions of the Multi-Ethnic Global BeadChip SNP array offered by Illumina, which is designed to capture the diversity of genetic backgrounds across the globe. These arrays cover over 1.7 million unphased variants, which are annotated for dbSNP rs identifier, gene location, and protein and variant effect using Alamut-Batch (Interactive Biosoftware, France, https://www.interactive-biosoftware.com/alamut-batch/). Genotype calls and annotations are made available to investigators in VCF and PED file formats. Imputed genotypes are also available. To support querying by patient-level variant and zygosity, we extended the i2b2 webclient to enable genomic queries by rsid and gene and to allow constraints by variant effect (for gene queries) and zygosity (see Supplementary Figure S1). The VCF files are indexed using an optimized binary index, exposed as a REST API web service, and integrated into the i2b2 application. Additional details and code for extending i2b2 for genomic queries are available at https://community.i2b2.org/wiki/display/IGD.
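A query by rsid and zygosity can be sketched as follows. The sketch assumes diploid genotype calls encoded as in the VCF `GT` field (allele indexes separated by `/` or `|`, with `.` for a missing call); the rsid `rs123`, the patient identifiers, and the in-memory dictionary are hypothetical stand-ins for the indexed VCF files and REST service described above.

```python
def zygosity(gt):
    """Classify a diploid VCF GT value such as '0/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return "missing"
    if alleles[0] == alleles[1]:
        return "homozygous_ref" if alleles[0] == "0" else "homozygous_alt"
    return "heterozygous"


# Hypothetical genotype calls keyed by patient and rsid.
GENOTYPES = {
    "P1": {"rs123": "0/1"},
    "P2": {"rs123": "1/1"},
    "P3": {"rs123": "0/0"},
    "P4": {"rs123": "./."},
}


def patients_with(rsid, wanted):
    """Return patients whose call at `rsid` has one of the wanted zygosities."""
    return sorted(
        patient
        for patient, calls in GENOTYPES.items()
        if zygosity(calls.get(rsid, "./.")) in wanted
    )


print(patients_with("rs123", {"heterozygous", "homozygous_alt"}))  # ['P1', 'P2']
```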

Unstructured clinical text and reports

The MGB Notes Repository is a Microsoft SQL Server (MSSQL) relational database that hosts all clinical notes and reports since 1990. These notes are updated nightly and indexed using the MSSQL full-text service to allow text searches (Microsoft, https://docs.microsoft.com/en-us/sql/relational-databases/search/full-text-search?view=sql-server-2016). While i2b2 supports full-text searches using SQL CONTAINS statements directly in the observation_fact table, we implement another REST API query framework to search the external notes repository for patients enrolled in the Biobank. Users can enter search phrases such as “diabetes” or “atrial fibrillation” and optionally choose to exclude matches near certain negation words (eg, not, negative, denies). The API restricts searches by Census names and numbers to prevent inadvertent release of PHI. While this is not a full-featured NLP pipeline for named entity extraction (ie, negation functionality is limited to prespecified negation terms and no additional context exclusions), investigators can run exploratory queries for cohort selection or for applying broad exclusion criteria. Most text queries can run in <20 s across the full corpus of over 250 million notes and reports. Supplementary Figure S2 illustrates the user interface for querying notes in the Biobank Portal.
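The limited negation handling described above can be sketched in a few lines. This is an assumption-laden illustration rather than the portal's API: the negation terms are the three examples given in the text, and the rule that a negation term must appear within 4 tokens before the phrase is a hypothetical choice made only for this example.

```python
import re

# Negation terms taken from the examples in the text; the window size is assumed.
NEGATION_TERMS = ("not", "negative", "denies")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_hits(note, phrase, exclude_negated=True, window=4):
    """Return True if `phrase` occurs in `note` at least once without a
    negation term among the `window` tokens immediately before it."""
    tokens = tokenize(note)
    target = tokenize(phrase)
    if not target:
        return False
    size = len(target)
    for start in range(len(tokens) - size + 1):
        if tokens[start:start + size] != target:
            continue
        if not exclude_negated:
            return True
        preceding = tokens[max(0, start - window):start]
        if not any(token in NEGATION_TERMS for token in preceding):
            return True
    return False


print(phrase_hits("Patient denies atrial fibrillation.", "atrial fibrillation"))  # False
print(phrase_hits("History of diabetes, not on insulin.", "diabetes"))            # True
```

A rule of this kind misses negations that follow the phrase or span sentence boundaries, which is consistent with the exploratory use described in the text.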

Derived and curated data

Computed phenotypes

EHR data are multifaceted and often complex, reflecting disease state, healthcare processes, and data collection and processing workflows. As such, we devote significant effort to validating and improving EHR data quality using quality assurance as well as machine learning computed phenotypes. Computed phenotypes are derived from both structured and unstructured EHR data and allow researchers to accurately select a disease population for genomic or other analyses. The Biobank Portal computed phenotypes (also called “Curated Disease Populations”) are trained using PheCAP, a well-defined supervised learning workflow. Once a model is trained, it is operationalized in the data repository build process using SQL scripts that define features based on ontology paths and run the prediction to estimate the probability of each patient having a disease. In the Biobank Portal webclient, users can select these phenotype patient cohorts at different levels of precision and sensitivity depending on their use case. Supplementary Figure S3 illustrates the query interface for phenotypes. In addition, each phenotype algorithm has a dedicated Wiki page that describes the training data and evaluation sets and reports performance on test data. Supplementary Table S1 includes disease prevalence and performance of Biobank phenotype algorithms.
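The trade-off between precision and sensitivity when selecting a phenotype cohort can be shown with a small worked example. The predicted probabilities, chart-review labels, and thresholds below are invented for illustration and do not come from any Biobank phenotype algorithm.

```python
def cohort_at(probabilities, threshold):
    """Patients whose predicted disease probability meets the threshold."""
    return sorted(p for p, prob in probabilities.items() if prob >= threshold)


def precision_and_sensitivity(probabilities, truth, threshold):
    """Compare the selected cohort with gold-standard labels."""
    selected = set(cohort_at(probabilities, threshold))
    cases = {p for p, is_case in truth.items() if is_case}
    true_positives = len(selected & cases)
    precision = true_positives / len(selected) if selected else 0.0
    sensitivity = true_positives / len(cases) if cases else 0.0
    return precision, sensitivity


# Hypothetical model output and chart-review labels for four patients.
probs = {"P1": 0.95, "P2": 0.80, "P3": 0.60, "P4": 0.20}
truth = {"P1": True, "P2": True, "P3": False, "P4": True}

print(cohort_at(probs, 0.9))  # ['P1']
print(cohort_at(probs, 0.5))  # ['P1', 'P2', 'P3']
```

In this toy data, the strict threshold of 0.9 yields perfect precision but finds only 1 of 3 true cases, whereas the threshold of 0.5 finds 2 of 3 cases at the cost of one false positive.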

Composite variables

In addition to more advanced machine learning phenotypes, we also build composite variables derived from primary data. A key example is the computation of Charlson comorbidity indexes derived from ICD-9 and ICD-10 codes to identify levels of illness and select relatively healthy controls. A more recent example is a composite variable of COVID-19 positive or negative status, based on PCR lab test results or infection control flag status, which we built to quickly support the large number of COVID-19 studies conducted with Biobank Portal data. While these derived variables often rely on simple rule-based logic that could be run as a user query, they significantly increase the usability of the portal.
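A rule-based composite of this kind might look like the following sketch. The precedence rule (any positive PCR result or a raised infection control flag makes the patient positive) and the `unknown` category are assumptions made for illustration; the text does not specify how conflicting evidence is resolved.

```python
def covid_status(pcr_results, infection_flag):
    """Derive a single COVID-19 status from PCR results and an infection flag.

    pcr_results: list of 'positive' / 'negative' strings (may be empty).
    infection_flag: True if the infection control flag is set.
    """
    if infection_flag or "positive" in pcr_results:
        return "positive"
    if "negative" in pcr_results:
        return "negative"
    return "unknown"


print(covid_status(["negative", "positive"], False))  # positive
print(covid_status(["negative"], False))              # negative
print(covid_status([], False))                        # unknown
```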

Quantitative imaging

Advances in computer vision have enabled high-throughput segmentation and analysis of radiology imaging. In the Biobank Portal, we have integrated machine learning-extracted quantitative areas of skeletal muscle, visceral adipose tissue, and subcutaneous adipose tissue from abdominal CT scans. The data are generated by a robust, fully automated, externally validated body composition pipeline consisting of 2 deep learning algorithms. The first algorithm selects a single axial CT slice through the third lumbar vertebral body (L3) of the spine; the second algorithm anatomically identifies the boundaries of muscle, subcutaneous adipose tissue, and visceral adipose tissue in that slice and quantifies the area of each (cm²). Absolute and normalized values for each CT scan are loaded into the data repository for querying and dataset generation.

Ontology

The i2b2 ontology of the Biobank Portal is the main mechanism of data harmonization and quality. As a flexible data model, the i2b2 CDM relies heavily on predefined ontologies. Many of these ontologies are standard, to support interoperability with other data sources and network data consortia. For EHR data, the Biobank Portal primarily relies on ICD-10-CM and CPT-4 diagnosis and procedure vocabularies to provide minimal transformation from the source data. Medications use a combination of RxNorm and the VA National Drug Reference, while lab tests and vital signs chiefly rely on LOINC codes. Local ontologies are developed for Biobank-specific data. The Health Information Survey variables collected from participants in REDCap format are mapped to an ontology based on events, forms, and question ordering. Because the data are completely decoupled from the ontology, it is possible to have multiple ontologies covering the same data. For example, we also map ICD diagnoses to the PheCode ontology to support computable phenotype development and phenotype-genotype correlation studies. There is typically a usability and analytic trade-off between granular ontologies (eg, SNOMED-CT) and higher-level ontologies (eg, PheCode, CCS). Granular ontologies better support cohort definitions for study design; however, aggregated ontologies provide improved power and performance for developing high-dimensional machine learning risk prediction. i2b2 also provides the ability to share queries within study groups and among all users of the system. Users can drag queries, ontology items, and patient sets into Workplace folders for sharing.
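Layering a second, coarser ontology over the same facts can be sketched as a simple roll-up. The three ICD-10-CM to PheCode pairs below are illustrative and should be checked against the published PheCode map before any real use; the patient facts are invented.

```python
from collections import defaultdict

# Illustrative subset of an ICD-10-CM to PheCode map (verify before real use).
ICD10CM_TO_PHECODE = {
    "E11.9": "250.2",    # type 2 diabetes
    "E11.65": "250.2",   # type 2 diabetes
    "I48.91": "427.21",  # atrial fibrillation
}

# Hypothetical (patient, ICD-10-CM code) facts; the stored data never change.
FACTS = [("P1", "E11.9"), ("P2", "E11.65"), ("P2", "I48.91"), ("P3", "Z00.00")]


def patients_by_phecode(facts, mapping):
    """Group patients under the higher-level code each granular code maps to."""
    groups = defaultdict(set)
    for patient, icd_code in facts:
        phecode = mapping.get(icd_code)
        if phecode is not None:  # unmapped codes are simply not rolled up
            groups[phecode].add(patient)
    return {code: sorted(patients) for code, patients in groups.items()}


print(patients_by_phecode(FACTS, ICD10CM_TO_PHECODE))
# {'250.2': ['P1', 'P2'], '427.21': ['P2']}
```

Two distinct granular codes collapse into one PheCode group here, which is the aggregation that improves power for high-dimensional modeling at the cost of cohort-definition detail.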

Data updates and quality

The Biobank Portal data are rebuilt on a weekly basis. A set of ETL scripts run every Tuesday to load the CDM-ingested data sources. A set of data quality scripts is run on Wednesday to ensure all data are populated and values fall within the expected ranges and have increased or remained stable compared with the previous build. Any data quality variances are dealt with on Wednesdays and Thursdays before the new build is promoted to production on Friday mornings. Additional data quality and end-to-end testing are conducted on Friday after each build is promoted to production. The data quality workflow is integral to avoiding data errors in the setting of ingesting and harmonizing disparate data sources.
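The two kinds of checks described above (values within expected ranges, and counts that have increased or remained stable since the previous build) can be sketched as follows. The data source names, counts, and the HbA1c range are hypothetical and are not the Biobank's actual quality rules.

```python
def count_regressions(current_counts, previous_counts):
    """Data sources whose row count dropped relative to the previous build."""
    return sorted(
        name
        for name, previous in previous_counts.items()
        if current_counts.get(name, 0) < previous
    )


def out_of_range(values, low, high):
    """Values falling outside the inclusive expected range [low, high]."""
    return [value for value in values if not (low <= value <= high)]


# Hypothetical counts for two consecutive weekly builds.
previous_build = {"ehr_facts": 99, "survey_responses": 41}
current_build = {"ehr_facts": 100, "survey_responses": 40}

print(count_regressions(current_build, previous_build))  # ['survey_responses']
print(out_of_range([5.6, 7.1, 71.0], 3.0, 20.0))         # [71.0]
```

Any non-empty result would correspond to a variance to be resolved before the build is promoted to production.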

Exporting analysis datasets

The Biobank Portal provides 2 formats for data export: a 1-row-per-patient “standard” analysis file (configured to be downloaded immediately as a HIPAA-coded limited dataset) or a complete and detailed longitudinal data file that includes identifiers but must have an attached IRB protocol. All users must sign a data use agreement at first login, which allows access to the limited dataset available in the portal. Users can download data for any patient with data in the portal. The analysis file download tool provides many options for constructing an analysis file. Users select a patient cohort (defining the rows in the file), drag over concepts from the ontology (the columns in the file), and select aggregation options that determine the values in the cells. Aggregation options vary by data type and can include count of events, count of unique dates, most recent date, mean, median, max, min, and most recent value (for numeric concepts), or simply the presence or absence of a concept. This format was used in the 2018 Biobank Disease Challenge, an open competition for developing computable phenotypes from Biobank Portal data. A new version of the download tool also allows users to define an index date from a concept and constrain other concepts based on the index date. For example, users can define the index date as the first diagnosis of diabetes and then define a column with the maximum hemoglobin A1C value within 6 months prior to the index date. The detailed data format provides tables with integrated metadata and the full time-series data. It can contain full-text notes and genetic sequences. Because it is difficult to provide this kind of data as a limited dataset, we require every study to make a request only after obtaining IRB approval for the study. Users request PHI data, including clinical notes, using a detailed data request wizard that requires IRB approval and principal investigator sign-off. Each request is reviewed and validated.
Data are delivered via secure, encrypted network file shares. It takes 2–3 days to put all the data together for these requests as opposed to the instantly available limited dataset requests. The detailed requests require more back-end informatics resources to fulfill.
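The index-date example for the standard analysis file (maximum hemoglobin A1C within 6 months before the first diabetes diagnosis) can be sketched as follows. The event tuples are invented, 6 months is approximated as 183 days, and the window is assumed to exclude the index date itself; none of these choices is taken from the download tool.

```python
from datetime import date, timedelta

# Hypothetical events for one patient: (concept, date, numeric value or None).
EVENTS = [
    ("HbA1c", date(2019, 1, 10), 6.1),
    ("HbA1c", date(2019, 9, 1), 7.4),
    ("HbA1c", date(2019, 11, 15), 8.2),
    ("diabetes", date(2019, 12, 1), None),
    ("HbA1c", date(2020, 1, 5), 9.0),
    ("diabetes", date(2020, 3, 1), None),
]


def first_date(events, concept):
    """Index date: earliest occurrence of the concept, or None if absent."""
    dates = [when for name, when, _ in events if name == concept]
    return min(dates) if dates else None


def max_value_before(events, concept, index_date, days=183):
    """Maximum value of `concept` in the `days` before (not on) the index date."""
    window_start = index_date - timedelta(days=days)
    values = [
        value
        for name, when, value in events
        if name == concept and window_start <= when < index_date
    ]
    return max(values) if values else None


index_date = first_date(EVENTS, "diabetes")
row = {
    "index_date": index_date,
    "max_hba1c_6mo_prior": max_value_before(EVENTS, "HbA1c", index_date),
}
print(row["index_date"], row["max_hba1c_6mo_prior"])  # 2019-12-01 8.2
```

The January 2019 value falls before the window and the January 2020 value falls after the index date, so only the September and November measurements contribute to the cell.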

RESULTS

As of July 2021, there are 125 645 consented patients enrolled in the MGB Biobank with data in the Biobank Portal. Enrolled patients tend to be older, more likely to be female and white non-Hispanic and have significantly higher healthcare utilization compared with the overall patient population (Table 1).

Table 1.

Mass General Brigham Biobank Participant Characteristics compared with all health system patients.

Characteristic^a	Biobank Portal (N = 125 645)	All other patients (N = 3 811 544)^b	P-value^c
Demographics
Gender			<.001
Female	71 360 (57%)	2 094 150 (55%)
Male	54 234 (43%)	1 716 594 (45%)
Other/Unknown	4 (<0.1%)	800 (<0.1%)
Age at last visit	59 (42, 70)	46 (26, 64)	<.001
Race			<.001
Asian	3785 (3.0%)	190 804 (5.0%)
Black	5960 (4.7%)	225 069 (5.9%)
Other	7711 (6.1%)	681 710 (18%)
Unknown	2251 (1.8%)	121 111 (3.2%)
White	105 891 (84%)	2 592 850 (68%)
Ethnicity			<.001
Hispanic	3387 (2.7%)	214 407 (5.6%)
Non-Hispanic	122 211 (97%)	3 597 137 (94%)
ACS median income^d	$70 245 ($57 313, $90 673)	$69 576 ($55 652, $88 829)	<.001
Healthcare utilization
Number of visit days	125 (44, 268)	9 (2, 38)	<.001
Number of diagnosis codes	405 (138, 962)	29 (7, 124)	<.001
Number of clinical notes	124 (38, 314)	20 (6, 70)	<.001
Number of diagnostic reports	122 (47, 262)	18 (4, 60)	<.001
Available data
Electronic health records	124 760 (99%)	3 811 544 (100%)	—
Health information survey	55 121 (44%)	—	—
Genomic data	43 552 (35%)	—	—
Biospecimens	88 527 (71%)	—	—

^a N (%) or median (IQR).

^b Other patients are defined as patients with a health system visit since 2010 and not enrolled in the MGB Biobank.

^c Pearson’s Chi-squared test or Wilcoxon rank-sum test.

^d 2018 American Community Survey median income in the patient’s zip code.

Of the Biobank-enrolled patients, 88 527 (70.5%) have a biospecimen, 55 121 (43.9%) have completed the health information survey, 43 552 (34.7%) have genomic data and 124 760 (99.3%) have linked EHR data. We have deployed 20 machine learning computed disease phenotypes that are calculated on a weekly basis. Figure 2 illustrates characteristics of the Biobank population and data status in an overview screen that all users see when logging in to the Biobank Portal.

Figure 2.

Overview of Biobank Portal Data. Investigators see this screen at every login with information on available data, date of last update, help, and quick start query examples.

Since its inception in 2015, the Biobank Portal has been rebuilt 327 times. There are currently 1220 active Biobank investigators who have run 58 793 patient queries and exported 10 257 analysis files (Figure 3 shows an example analysis file specification). The Biobank Portal has been used for a diverse set of research projects and use cases, leading to many awarded research grants and publications. Table 2 lists example research use cases and publications generated using Biobank Portal data. Active research initiatives using Biobank data and specimens are listed at https://biobank.massgeneralbrigham.org/research-initiatives.

Figure 3.

Example analysis file specification to download limited datasets.

Table 2.

Biobank Portal example use cases and publications

Research use case	Investigator type	Data types
Research feasibility for grant application	All types	All
Multicenter genome-wide association studies³³	Clinical/bioinformatics	EHR and Genomics
Machine learning disease subgroup detection using NLP and genetics³⁴	Data scientist	EHR, Genomics, and notes
Polygenic risk score integration with EHR data for phenotyping³⁵	Psychology fellow	EHR, Genomics, and Health Survey
Population cohort discovery based on gene variants and laboratory results	Population epidemiologist	Genomic and EHR
Obtain biospecimens for control group	Basic scientist	Biospecimen
Developing and validating phenotype algorithms³⁶	All types	EHR and notes
Case-control association study of disease comorbidity³⁷	Population epidemiologist	EHR
Evaluating the clinical utility of polygenic risk scores³⁸	Clinical/bioinformatics	Genomics

DISCUSSION

Integrating and harmonizing disparate data sources into one interoperable and consolidated data portal is critical to enabling researchers to conduct analysis efficiently and effectively. The Biobank Portal allows a wide variety of researchers to conduct study feasibility by querying across these data sources and then extract data that are most useful to them. Democratizing the data retrieval process by providing tools to bridge the data acquisition and engineering gap has proven to be an effective model for Mass General Brigham. Providing informatics infrastructure support centrally using i2b2, and avoiding data gatekeeping limited to highly technical users, has scaled more effectively and generated meaningful publications, many of which the informatics team has not been directly involved with. Secondary use of EHR and other data for which there were no predefined outcomes is a key strength of this data portal and federated analysis model. Large biobank and consortium efforts collecting similar types of data have also provided functionality similar to that of the Biobank Portal. The eMERGE network provides a “record counter” for submitted phenotype data that allows researchers to query demographic, diagnosis, and procedure data using simple queries. The All of Us project also provides a data explorer to view data availability across various data types, including survey and digital health data. These approaches rely on standard coding systems only and lack the flexibility of the ontology-driven i2b2 queries. The Observational Health Data Sciences and Informatics (OHDSI) initiative provides a number of analytic tools for study cohort feasibility and generation, focused primarily on EHR data with some extensions into NLP and other data. This approach requires a universal up-front agreement to define every element of the data model. This is less flexible than the i2b2 method of defining the data model through the ontology.
The ability to derive queries through the ontology and quickly create complex multimodal queries has not been achieved with the OMOP model extensions, which require advanced analytic expertise; in i2b2, by contrast, ontology-driven queries allow any kind of extension to be easily incorporated into the query tools. The Biobank Portal does have some limitations. First, the data integration workflow depends on the existence of a single master patient index. The Biobank Portal relies on the MGB EMPI, but such an index may not be available in all cases or in multisite data repositories. Emerging work in privacy-preserving patient linking enables cross-site patient identification using hashed tokens generated from patient identifiers (names, dates of birth, address, etc.) and would enable data integration across sites and data sources. In addition, the Biobank population is relatively small compared with a large multisite study or all patients in a healthcare system. However, we have shown that the i2b2 meta-model star schema will scale to much larger populations. Finally, we find that the Biobank population is more highly enriched for non-Hispanic white patients than the general health system patient population. This may lead to bias and poor generalizability in both data completeness and analytic findings. Additional efforts to enroll more representative participants are required for all patients to realize the benefit of research findings.
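The hashed-token approach to privacy-preserving patient linking mentioned among the limitations can be sketched with a keyed hash. This is a generic illustration, not the method of any particular linkage product: the choice of fields (first name, last name, date of birth), the normalization rules, and the use of HMAC-SHA256 with a secret shared between sites are all assumptions.

```python
import hashlib
import hmac
import unicodedata


def normalize(value):
    """Lowercase, strip accents, and remove whitespace so that trivial
    formatting differences between sites do not change the token."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(without_accents.lower().split())


def link_token(first_name, last_name, dob_iso, secret):
    """Return a hex token derived from identifiers; `secret` is a bytes key
    that participating sites are assumed to share out of band."""
    message = "|".join(
        [normalize(first_name), normalize(last_name), dob_iso]
    ).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


site_a = link_token(" José", "GARCIA ", "1970-01-31", b"shared-secret")
site_b = link_token("Jose", "garcia", "1970-01-31", b"shared-secret")
print(site_a == site_b)  # True
```

Sites would exchange only the tokens, so records can be matched without sharing names or dates of birth; in practice, typographical errors in the source identifiers would still prevent a match under an exact-token scheme like this one.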

CONCLUSION

The Biobank Portal and other patient data portals that integrate complex and simple datasets enable a diverse set of research use cases. i2b2 tools to implement these registries are open source and freely available. While institutions require substantial informatics resources to establish and maintain these types of data registries, they yield significant research value to a wide range of investigators.

FUNDING

This study was supported by Mass General Brigham institutional funds. The work was supported by the National Human Genome Research Institute grants R01-HG009174 and U01HG008685-05S1 and the National Heart, Lung, and Blood Institute award 1OT2HL161841-01. The authors had the final responsibility for the decision to submit for publication.

AUTHOR CONTRIBUTIONS

VMC was the primary author of this article. VSG, NW, and SNM also contributed substantially to the article. BB, AC, BG, SG, RM, HP, DW, MM, MR, and CH all contributed intellectual value, technical support, and text to the article.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.
