A System for Phenotype Harmonization in the National Heart, Lung, and Blood Institute Trans-Omics for Precision Medicine (TOPMed) Program.

Adrienne M Stilp, Leslie S Emery, Jai G Broome, Erin J Buth, Alyna T Khan, Cecelia A Laurie, Fei Fei Wang, Quenna Wong, Dongquan Chen, Catherine M D'Augustine, Nancy L Heard-Costa, Chancellor R Hohensee, William Craig Johnson, Lucia D Juarez, Jingmin Liu, Karen M Mutalik, Laura M Raffield, Kerri L Wiggins, Paul S de Vries, Tanika N Kelly, Charles Kooperberg, Pradeep Natarajan, Gina M Peloso, Patricia A Peyser, Alex P Reiner, Donna K Arnett, Stella Aslibekyan, Kathleen C Barnes, Lawrence F Bielak, Joshua C Bis, Brian E Cade, Ming-Huei Chen, Adolfo Correa, L Adrienne Cupples, Mariza de Andrade, Patrick T Ellinor, Myriam Fornage, Nora Franceschini, Weiniu Gan, Santhi K Ganesh, Jan Graffelman, Megan L Grove, Xiuqing Guo, Nicola L Hawley, Wan-Ling Hsu, Rebecca D Jackson, Cashell E Jaquish, Andrew D Johnson, Sharon L R Kardia, Shannon Kelly, Jiwon Lee, Rasika A Mathias, Stephen T McGarvey, Braxton D Mitchell, May E Montasser, Alanna C Morrison, Kari E North, Seyed Mehdi Nouraie, Elizabeth C Oelsner, Nathan Pankratz, Stephen S Rich, Jerome I Rotter, Jennifer A Smith, Kent D Taylor, Ramachandran S Vasan, Daniel E Weeks, Scott T Weiss, Carla G Wilson, Lisa R Yanek, Bruce M Psaty, Susan R Heckbert, Cathy C Laurie.   

Abstract

Genotype-phenotype association studies often combine phenotype data from multiple studies to increase statistical power. Harmonization of the data usually requires substantial effort due to heterogeneity in phenotype definitions, study design, data collection procedures, and data-set organization. Here we describe a centralized system for phenotype harmonization that includes input from phenotype domain and study experts, quality control, documentation, reproducible results, and data-sharing mechanisms. This system was developed for the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program, which is generating genomic and other -omics data for more than 80 studies with extensive phenotype data. To date, 63 phenotypes have been harmonized across thousands of participants (recruited in 1948-2012) from up to 17 studies per phenotype. Here we discuss challenges in this undertaking and how they were addressed. The harmonized phenotype data and associated documentation have been submitted to National Institutes of Health data repositories for controlled access by the scientific community. We also provide materials to facilitate future harmonization efforts by the community, which include 1) the software code used to generate the 63 harmonized phenotypes, enabling others to reproduce, modify, or extend these harmonizations to additional studies, and 2) the results of labeling thousands of phenotype variables with controlled vocabulary terms.
© The Author(s) 2021. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health.

Keywords:  cardiovascular disease; common data elements; hematologic disease; information dissemination; lung diseases; phenotypes; sleep-wake disorders

Year:  2021        PMID: 33861317      PMCID: PMC8485147          DOI: 10.1093/aje/kwab115

Source DB:  PubMed          Journal:  Am J Epidemiol        ISSN: 0002-9262            Impact factor:   5.363


Abbreviations

dbGaP: database of Genotypes and Phenotypes
DCC: Data Coordinating Center
QC: quality control
TOPMed: Trans-Omics for Precision Medicine
UMLS: Unified Medical Language System
WG: Working Group

To increase statistical power in epidemiologic analyses, multiple studies are often combined for pooled or meta-analysis. Heterogeneity among studies is generally addressed by means of careful selection and harmonization of study data to include in the analyses. In this report, we describe a system for phenotype harmonization which was developed for the National Heart, Lung, and Blood Institute’s Trans-Omics for Precision Medicine (TOPMed) program (https://www.nhlbiwgs.org/). We define phenotype harmonization as the process by which data variables, each representing a specified phenotype concept, are selected from multiple studies and transformed as needed so that they can be combined and analyzed together. In principle, phenotype harmonization can be achieved prospectively when all contributing studies use the same standardized protocols (1, 2). However, retrospective harmonization is often needed in order to use valuable data previously collected in multiple studies using different phenotype definitions, study designs, data collection procedures, and data structures.

A key goal of the TOPMed program is to identify genetic risk factors for heart, lung, blood, and sleep disorders. To date, the program has generated whole-genome sequence data for over 140,000 participants from more than 80 different studies (3). Investigators in the participating studies have previously gathered extensive phenotype data, including physical measurements, clinical chemistry, questionnaires, clinical registries, and medical imaging. Information on many of the same phenotypes has been collected in multiple studies, which provides the potential for combined analyses to increase power for detecting the effects of low-frequency and rare-sequence variants.
However, because of substantial heterogeneity in phenotype data among studies and over time, harmonization is required for combined analyses. Our system for retrospective harmonization of phenotype data in TOPMed includes a collaborative framework, domain expertise, high-quality data inputs, validation of data outputs, rigorous documentation, and respect for stakeholders (i.e., features of the Maelstrom Research guidelines (Research Institute of the McGill University Health Centre, Montreal General Hospital, Montreal, Quebec, Canada) (2)), as well as reproducibility, updating, and sharing of harmonized results derived from controlled-access human data. We describe these features in detail, along with examples of applications to TOPMed study data. We also describe a system for tagging study variables with phenotype concepts for use in future harmonization efforts.

METHODS

Overview

The TOPMed Data Coordinating Center (DCC) developed a collaborative process for phenotype harmonization that was integrated with the activities of TOPMed Working Group (WG) members, who include phenotype experts, genetic epidemiologists, and data analysts. Initially, WG members established a specific objective, which was usually to identify DNA sequence variants associated with variation in a defined phenotype concept. The DCC identified relevant data from up to 17 TOPMed studies (see Web Appendix 1 and Web Table 1, available at https://doi.org/10.1093/aje/kwab115) per phenotype and performed harmonization to fit the WG’s phenotype concept, using the steps described below. This concept was often refined to provide greater homogeneity across studies as data from each study were explored, often in collaboration with WG investigators and data managers who had detailed knowledge of their study’s data. We also consulted periodically with the TOPMed Steering Committee and the TOPMed Phenotype Harmonization Committee on the overall process. Table 1 provides definitions of terms used in this paper.
Table 1

Specific Terminology Used in This Article, in Web Appendices 1–11, and in Documentation Distributed With Harmonized Phenotype Data

Term | Definition
Participant or subject | Studies generally refer to an individual participating in their study as a “participant,” while dbGaP uses “subject” as the equivalent term.
Cohort and subcohort | A sample of study participants enrolled in the study together at a given time (or clinic visit). The term “subcohort” refers to a distinct group of participants within a study, as defined by that study (e.g., a different recruitment wave or targeted demographic group).
Phenotype or trait | Observable characteristics of an organism. “Phenotype” and “trait” are used synonymously.
Phenotype concept | Broad definition of a phenotype, such as “quantitative measure of high-density lipoprotein concentration in blood” or “qualitative indicator of diabetes mellitus status.”
Phenotype variable | A vector of data values representing a measurement or other aspect of a phenotype concept, where each item in the vector corresponds to the value for a specific participant and/or repeated measure for a participant.
dbGaP study variable | An unharmonized phenotype variable from a given study’s dbGaP accession.
Candidate variable | A phenotype variable from a given study to be evaluated for use as a component phenotype variable. Such evaluation includes consideration of factors such as how well it represents the target phenotype concept, how well it can be harmonized with candidate variables from other studies, and the quality of the data.
Component variable | A phenotype variable selected for inclusion in a single harmonization, either because it directly represents the target phenotype (e.g., biomarker concentration) or because it is useful in constructing the harmonized variable (e.g., biomarker assay quality).
Harmonized variable | A phenotype variable constructed from a set of component variables from different studies, after performing whatever harmonization steps are considered to be important for a valid pooled analysis or meta-analysis.
Harmonization algorithm and function | The algorithm is a series of steps to be applied to the group of component variables to produce harmonized phenotype values for a single harmonization unit. Algorithms are implemented in R (a) functions.
Harmonization unit | A group of subjects from a single study (e.g., subcohort) with the same component variables, to which a single harmonization algorithm is applied to produce harmonized phenotype values. A harmonized variable is typically constructed by combining multiple harmonization units from one or more studies.
Harmonized data set | A data set consisting of a set of harmonized variables representing various aspects of phenotype concepts. It may also contain harmonized variables for multiple related phenotype concepts. For example, the “lipids” data set contains phenotype variables for concentrations of each of several lipid compounds assayed from the same blood draw, as well as age at blood draw, fasting status, and use of lipid-lowering medication.

Abbreviation: dbGaP, database of Genotypes and Phenotypes.

(a) R Foundation for Statistical Computing, Vienna, Austria (5).

Figure 1. Overview of the data harmonization process used by the Trans-Omics for Precision Medicine (TOPMed) Data Coordinating Center (DCC). A) Existing study data in diverse formats are curated by the database of Genotypes and Phenotypes (dbGaP), including accessioning and conversion to a consistent file format. B) Formatted data and associated metadata (e.g., variable descriptions) are stored in a TOPMed DCC relational database. C) The harmonized phenotype variable is defined, and metadata for multiple studies are searched to identify candidate phenotypic variables that potentially can be harmonized together to produce the desired harmonized variable (harmonization steps 1 and 2). D) Analytical tools that interact with the DCC database are used for quality control (QC) of study variables, implementation of harmonization algorithms, and documentation; harmonized results are added to the same DCC database as that shown in step B (harmonization steps 3–5). E) Files containing a multistudy, harmonized data set and associated documentation are produced. F) Data, metadata, and documentation are submitted to a National Institutes of Health (NIH) repository for controlled access by the scientific community, while documentation files in JavaScript Object Notation format containing software code and provenance tracking are submitted to a publicly available GitHub repository.

The DCC’s system for implementing harmonization is outlined in Figure 1. Although we describe the harmonization process as a linear sequence of steps, the results of later steps often required going back and modifying earlier steps. The system tracked the harmonization of each phenotype separately, along with participant age at measurement or biosample collection.
Each harmonized phenotype variable was assigned a controlled-vocabulary term from the Unified Medical Language System (UMLS) (4). Analysts worked on a group of related phenotype variables at the same time (e.g., high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, total cholesterol, triglycerides, fasting status, and use of lipid-lowering medication), which were generally released together in a single data set (e.g., “Lipids”; Table 2). When harmonizing a group of related phenotypes, it is important to use phenotype variables that were measured or collected from a participant at the same time point.
Table 2

Harmonized Variables Produced by the TOPMed Data Coordinating Center for 17 Studies With Recruitment Dates Spanning 1948–2012 (a)

Data Set and Phenotype Concept | Harmonized Variable Name (b) | No. of Participants | No. of Studies
Atherosclerosis
 CAC volume | cac_volume_1 | 11,098 | 2
 CAC score | cac_score_1 | 15,042 | 6
 Common carotid IMT | cimt_1 | 35,420 | 6
 Common carotid IMT | cimt_2 | 30,473 | 5
 Carotid stenosis | carotid_stenosis_1 | 15,098 | 3
 Presence of carotid plaque | carotid_plaque_1 | 27,344 | 5
Baseline common covariates
 Standing body height | height_baseline_1 | 230,287 | 16
 Body weight | weight_baseline_1 | 230,657 | 16
 Ever smoker status | ever_smoker_baseline_1 | 225,271 | 14
 Current smoker status | current_smoker_baseline_1 | 228,688 | 16
 Body mass index | bmi_baseline_1 | 230,918 | 17
Blood cell count
 Basophil concentration in blood | basophil_ncnc_bld_1 | 36,586 | 7
 Eosinophil concentration in blood | eosinophil_ncnc_bld_1 | 37,426 | 7
 Lymphocyte concentration in blood | lymphocyte_ncnc_bld_1 | 39,702 | 7
 Hematocrit level in blood | hematocrit_vfr_bld_1 | 193,469 | 9
 Hemoglobin concentration in blood | hemoglobin_mcnc_bld_1 | 193,367 | 9
 Monocyte concentration in blood | monocyte_ncnc_bld_1 | 39,647 | 7
 Neutrophil concentration in blood | neutrophil_ncnc_bld_1 | 38,285 | 7
 Mean corpuscular volume in blood | mcv_entvol_rbc_1 | 44,593 | 7
 Mean corpuscular hemoglobin concentration in blood | mchc_mcnc_rbc_1 | 51,293 | 8
 Mean corpuscular hemoglobin in blood | mch_entmass_rbc_1 | 39,649 | 7
 Platelet concentration in blood | platelet_ncnc_bld_1 | 190,177 | 9
 Mean platelet volume in blood | pmv_entvol_bld_1 | 13,816 | 3
 Red blood cell concentration in blood | rbc_ncnc_bld_1 | 39,710 | 7
 Red cell distribution width | rdw_ratio_rbc_1 | 28,034 | 4
 White blood cell concentration in blood | wbc_ncnc_bld_1 | 192,346 | 9
Blood pressure
 Systolic blood pressure | bp_systolic_1 | 225,934 | 14
 Diastolic blood pressure | bp_diastolic_1 | 225,934 | 14
 Use of antihypertensive medication | antihypertensive_meds_1 | 207,130 | 12
Demographic characteristics
 Hispanic subgroup | hispanic_subgroup_1 | 18,612 | 4
 Subcohort identifier | subcohort_1 | 218,747 | 15
 Reported race | race_1 | 230,994 | 17
 Reported sex | annotated_sex_1 | 233,030 | 17
 Reported Hispanic/Latino indicator | ethnicity_1 | 188,905 | 11
 Geographic recruitment site | geographic_site_1 | 212,529 | 12
Inflammation
 CD40 protein concentration in blood | cd40_1 | 4,238 | 2
 CRP concentration in blood | crp_1 | 49,536 | 10
 E-selectin concentration in blood | eselectin_1 | 1,215 | 1
 ICAM-1 concentration in blood | icam1_1 | 15,876 | 5
 IL-1β concentration in blood | il1_beta_1 | 708 | 1
 IL-6 concentration in blood | il6_1 | 20,390 | 5
 IL-10 concentration in blood | il10_1 | 3,455 | 2
 IL-18 concentration in blood | il18_1 | 3,159 | 1
 Isoprostane 8-epi-PGF2α concentration in urine | isoprostane_8_epi_pgf2a_1 | 7,523 | 1
 Activity of LP-PLA2 in blood | lppla2_act_1 | 18,117 | 3
 Mass of LP-PLA2 in blood | lppla2_mass_1 | 18,049 | 3
 MCP-1 concentration in blood | mcp1_1 | 7,557 | 1
 MMP-9 concentration in blood | mmp9_1 | 964 | 1
 Myeloperoxidase concentration in blood | mpo_1 | 3,162 | 1
 Osteoprotegerin concentration in blood | opg_1 | 7,648 | 1
 P-selectin concentration in blood | pselectin_1 | 8,037 | 1
 TNF-α concentration in blood | tnfa_1 | 5,075 | 3
 TNF-α receptor 1 concentration in blood | tnfa_r1_1 | 2,802 | 1
 TNF receptor 2 concentration in blood | tnfr2_1 | 7,962 | 1
Lipids
 Fasting status | fasting_lipids_1 | 64,895 | 11
 High-density lipoprotein concentration in blood | hdl_1 | 65,676 | 11
 Total cholesterol concentration in blood | total_cholesterol_1 | 65,707 | 11
 Triglyceride concentration in blood | triglycerides_1 | 65,706 | 11
 Low-density lipoprotein concentration in blood | ldl_1 | 64,715 | 11
 Use of lipid-lowering medication | lipid_lowering_medication_1 | 58,962 | 9
VTE
 Age at beginning of follow-up | vte_followup_start_age_1 | 61,692 | 4
 Prior history of VTE | vte_prior_history_1 | 62,445 | 5
 VTE case status | vte_case_status_1 | 63,092 | 6

Abbreviations: CAC, coronary artery calcium; CD40, cluster of differentiation 40; CRP, C-reactive protein; 8-epi-PGF2α, 8-epi-prostaglandin F2α; ICAM-1, intercellular adhesion molecule 1; IL-1β, interleukin 1β; IL-6, interleukin 6; IL-10, interleukin 10; IL-18, interleukin 18; IMT, intima-media thickness; LP-PLA2, lipoprotein-associated phospholipase A2; MCP-1, monocyte chemoattractant protein 1; MMP-9, matrix metalloproteinase 9; TNF-α, tumor necrosis factor α; TOPMed, Trans-Omics for Precision Medicine; VTE, venous thromboembolism.

(a) See Web Table 1 for descriptions of the 17 studies. Additional documentation about each harmonized variable can be found in the GitHub repository (14).

(b) The “concept variant number” at the end of each harmonized variable name differentiates among different implementations of harmonization for the same basic phenotype concept (e.g., cimt_1 and cimt_2 are names for carotid IMT variables calculated with slightly different harmonization algorithms).

The information technology supporting phenotype harmonization consisted of a locally hosted relational database and associated applications. A custom R (R Foundation for Statistical Computing, Vienna, Austria) package (5) was used to interact with the database, and a series of Python (Python Software Foundation, Wilmington, Delaware) and R scripts were run by analysts to perform harmonization. The codebase also allowed addition of new study and harmonized data to the database, retrieval of existing study data in their original structure, and production of harmonized data sets and documentation for distribution to investigators. A custom Web application was used to search the publicly available metadata for relevant study variables. This report describes the TOPMed DCC system.
It does not document other harmonization efforts involving TOPMed studies that were performed independently of the DCC (e.g., by Oelsner et al. (6) or the independent efforts of various TOPMed WGs).

Obtaining and processing study data

All study phenotype data and associated metadata were obtained from the National Institutes of Health database of Genotypes and Phenotypes (dbGaP) (7), which provides controlled access for the scientific community. Use of dbGaP data provides a mechanism for tracking the provenance of a harmonized phenotype variable using dbGaP accession numbers assigned to multiple data entities, including studies, data sets, and individual variables within data sets. The harmonization system leverages work performed by dbGaP to curate data into a consistent file format, include metadata (e.g., variable descriptions and types), and perform some value-checking based on data type. Use of dbGaP data enables reproducibility of harmonized phenotypes, as scientific investigators can obtain the same data sets. For each study, the harmonization process included all participants with data available in dbGaP, rather than only those being sequenced in TOPMed. After obtaining approval for access to a study’s dbGaP accession, all available phenotype data and associated metadata were imported into a relational database (Web Appendix 2). Studies participating in TOPMed were approved by all relevant institutional review boards, and participants provided informed consent, including information regarding data-use limitations and guidelines for sharing data via dbGaP. Even though the DCC-harmonized data for all participants are available in dbGaP, the resulting harmonized phenotypes may only be analyzed for participants whose dbGaP consent group allows research in that area. Investigators must obtain approval (via dbGaP application) to obtain access to the studies and consent groups that match their intended use.

Harmonization steps

The following harmonization steps are focused on producing each individual harmonized phenotype variable (although several related phenotypes may be harmonized in parallel and provided to users within a single data set). Web Appendices 3–7 provide details about these steps using 4 harmonized phenotypes as examples.

Step 1: Define the harmonized phenotype variable.

The first step, usually performed by WG members intending to use the harmonized data, was to develop a precise definition of the target harmonized phenotype variable that includes key features needed to address their primary objectives. These features often included references to specific assay or measurement methods, time points in longitudinal studies, and other relevant factors. For example, for low-density lipoprotein cholesterol concentration in blood, the definition might specify calculation according to the Friedewald equation (8) using high-density lipoprotein cholesterol, total cholesterol, and triglyceride measurements, all from the same blood sample drawn at the baseline clinic visit after a period of fasting. The initial definition was sometimes modified to accommodate heterogeneity in the data available in different studies as it was discovered in subsequent steps. (Also see Web Appendix 3.)
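As an illustration of how such a definition translates into a calculation, the following is a minimal Python sketch of the Friedewald estimate (LDL = total cholesterol − HDL − triglycerides/5, all in mg/dL). The function name, the fasting requirement, and the 400 mg/dL triglyceride cutoff are assumptions made for this example; the cutoff is the conventional applicability limit for the equation and is not stated in this article as part of the TOPMed definition.

```python
from typing import Optional


def friedewald_ldl(total_chol: float, hdl: float, triglycerides: float,
                   fasting: bool) -> Optional[float]:
    """Estimate LDL cholesterol (mg/dL) with the Friedewald equation.

    LDL = total cholesterol - HDL - triglycerides / 5, all in mg/dL.
    Returns None when the estimate is treated as not applicable.
    """
    # Assumption for this sketch: only fasting samples are used.
    if not fasting:
        return None
    # Assumption: skip the estimate when triglycerides exceed 400 mg/dL,
    # the conventional limit for applying the equation.
    if triglycerides > 400:
        return None
    return total_chol - hdl - triglycerides / 5.0


print(friedewald_ldl(200.0, 50.0, 150.0, fasting=True))  # 120.0
```

All three inputs would need to come from the same blood draw for the result to match the definition given above.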

Step 2: Identify “candidate” phenotype variables across contributing studies.

The next step was to identify candidate dbGaP study variables that could potentially be used for calculating the target harmonized phenotype variable, as well as corresponding variables containing age at measurement or biosample collection (Web Appendix 4 and Web Tables 2–6). Because controlled vocabulary usage is limited in dbGaP data sets, this process consisted of searching variable names, descriptions, and encoded values. WG members were closely involved in determining whether a study variable met the phenotype definition. The tagging project described below was implemented to facilitate this process for both DCC harmonization and similar efforts by the scientific community. Once an initial set of candidate variables was identified, the selection was refined by assessing compatibility with the definition of the target harmonized phenotype and for methodological equivalence across studies. This process often involved selecting among different methods of measuring the phenotype and/or choosing the most appropriate variable from a set of repeated measurements. Analysts generally consulted publicly available study protocols, phenotype domain experts in the relevant WG, and study liaisons, who know the intricacies of their study’s data. Some studies were omitted because candidate variables that met the definition could not be identified. Candidate variable selection is critical because phenotype heterogeneity in a combined analysis can lead to loss of power and thereby defeat the purpose of combining data across studies. In some cases, a new harmonized variable was constructed from previously harmonized component variables (e.g., a harmonized body mass index variable from previously harmonized height and weight variables).
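The metadata search described in this step can be pictured with a short Python sketch. It is not the DCC's Web application; the `StudyVariable` record, the `find_candidates` function, and the example catalog are all hypothetical, and the matching is a plain case-insensitive substring search over names, descriptions, and encoded values.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class StudyVariable:
    """Hypothetical metadata record for one study variable."""
    study: str
    name: str
    description: str
    encoded_values: Tuple[str, ...] = ()


def find_candidates(variables: Sequence[StudyVariable],
                    keywords: Sequence[str]) -> List[StudyVariable]:
    """Return variables whose name, description, or encoded values
    contain any keyword (case-insensitive substring match)."""
    needles = [k.lower() for k in keywords]
    hits = []
    for var in variables:
        haystack = " ".join([var.name, var.description, *var.encoded_values]).lower()
        if any(needle in haystack for needle in needles):
            hits.append(var)
    return hits


# Made-up catalog entries, for illustration only.
catalog = [
    StudyVariable("StudyA", "HDL1", "HDL cholesterol (mg/dL), exam 1"),
    StudyVariable("StudyA", "SMK", "Smoking status",
                  ("0=never", "1=former", "2=current")),
    StudyVariable("StudyB", "lab_17", "High-density lipoprotein, baseline"),
]

hits = find_candidates(catalog, ["hdl", "high-density lipoprotein"])
print([(v.study, v.name) for v in hits])
```

A keyword search of this kind only produces a first list; as the text notes, deciding whether each hit actually meets the phenotype definition still requires protocol review and expert input.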

Step 3: Perform quality control on candidate variables.

Quality control (QC) on selected candidate variables was performed to assess data reliability by checking whether the observed values were consistent with expected ranges, investigating any unexpected distributions, and checking that the data were internally consistent with other related study variables (Web Appendix 5). Batch effects and consistency of data collection methods were also evaluated when relevant information was available (e.g., Web Figure 1). Mistakes in data management and/or documentation (e.g., un- or misspecified missing-value codes, incorrect units of measurement, and errors in variable labeling or description) can be identified when a specific data set differs notably from expectation. If QC issues were identified for a candidate variable, analysts decided, in consultation with the WG and study liaisons, whether an alternative variable from the same study could be used or whether the study should be excluded from the harmonization for this phenotype. Individual data points with impossible values (such as a negative analyte concentration) were excluded from the harmonized phenotype variable. Extreme but theoretically possible values were noted in the documentation but were not excluded because 1) defining what counts as extreme is often difficult and subjective; 2) TOPMed whole-genome sequencing has discovered millions of rare variants, some of which may be causing extreme phenotypic values; and 3) users may prefer to handle extreme values differently (e.g., by excluding or winsorizing at different values). Therefore, the decision about how to handle extreme values in analyses was left to downstream users of the data. QC results for candidate study variables were used to determine which ones would be used as “component” variables in subsequent harmonization steps. The final set of component variables was chosen only after QC of the multistudy harmonized variable (see step 5).
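The distinction drawn here between impossible values (excluded) and extreme but possible values (kept and noted) can be sketched in Python as follows. The function name, the thresholds, and the example values are assumptions chosen for illustration; the actual limits used for any phenotype are not given in this section.

```python
import math
from typing import List, Optional, Tuple


def qc_candidate(values: List[Optional[float]],
                 impossible_below: float, impossible_above: float,
                 expected_low: float, expected_high: float
                 ) -> Tuple[List[Optional[float]], List[float], int]:
    """Set impossible values to missing and list extreme-but-possible ones.

    Returns (cleaned values, extreme values that were kept,
    count of impossible values set to missing).
    """
    cleaned: List[Optional[float]] = []
    extreme: List[float] = []
    n_impossible = 0
    for v in values:
        if v is None:
            cleaned.append(None)  # already missing
        elif v < impossible_below or v > impossible_above:
            cleaned.append(None)  # impossible: excluded
            n_impossible += 1
        else:
            if v < expected_low or v > expected_high:
                extreme.append(v)  # noted for documentation, not removed
            cleaned.append(v)
    return cleaned, extreme, n_impossible


# Made-up HDL values (mg/dL); a negative concentration is impossible.
hdl = [45.0, -3.0, 210.0, None, 60.0]
cleaned, extreme, n_impossible = qc_candidate(hdl, 0.0, math.inf, 10.0, 150.0)
print(cleaned)       # [45.0, None, 210.0, None, 60.0]
print(extreme)       # [210.0]
print(n_impossible)  # 1
```

Keeping the extreme value in the output while reporting it separately mirrors the choice described above to leave its handling to downstream users.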

Step 4: Construct harmonization algorithms.

The next step was to specify the algorithms to be used in transforming component variables into the harmonized variable (Web Appendix 6). An algorithm was developed for each “harmonization unit,” which consists of a group of participants from a single study with component variables that can be harmonized in the same way. Each algorithm was implemented as an R function that accepts the component variables as input and returns the harmonized values and age at measurement. The algorithm might be as simple as giving each component variable a consistent name across studies or converting to a common unit of measurement, but it often included more complicated steps, such as averaging repeated measurements or creating a smoking status variable from multiple questionnaire responses. See Web Figures 2–7 for examples.
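To make the idea of one function per harmonization unit concrete, here is a minimal sketch. The DCC implemented its algorithms as R functions; Python is used here only for illustration, and the two units, their field names, and the target variable `height_cm` are hypothetical. Each function takes the component variables for one participant and returns the harmonized value together with age at measurement.

```python
from statistics import mean
from typing import Dict, Optional

INCH_TO_CM = 2.54  # exact conversion factor


def harmonize_unit_a(row: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
    """Hypothetical unit A: height recorded once, in inches."""
    inches = row["height_in"]
    value = inches * INCH_TO_CM if inches is not None else None
    return {"height_cm": value, "age": row["age_exam1"]}


def harmonize_unit_b(row: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
    """Hypothetical unit B: two repeated measures in cm; average the non-missing ones."""
    measures = [m for m in (row["ht_cm_1"], row["ht_cm_2"]) if m is not None]
    value = mean(measures) if measures else None
    return {"height_cm": value, "age": row["age_visit"]}


a = harmonize_unit_a({"height_in": 70.0, "age_exam1": 52.0})
b = harmonize_unit_b({"ht_cm_1": 165.0, "ht_cm_2": 166.0, "age_visit": 47.0})
print(a, b)
```

Because both functions return the same keys, their outputs can be stacked into a single multistudy variable regardless of how each unit stored its component data.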

Step 5: Produce and perform QC on the multistudy harmonized phenotype.

After harmonization algorithms were implemented for each contributing study, the harmonized values were calculated and combined across harmonization units and studies using in-house R scripts (Web Appendix 7 and Web Figure 8). This draft of the multistudy harmonized variable was then assessed for homogeneity of values among studies and harmonization units within studies. This process included a comparison of mean values and standard deviations for continuous variables or frequencies for categorical variables, by study, subcohort, and other relevant subgroups within each study. For continuous variables, we also inspected the distributions of residuals after fitting a linear model with age, sex, and harmonized race (e.g., see Web Figures 9 and 10). The goal of this step was to identify issues in the harmonization functions or in studies’ component variables or metadata. If any issues arose in this process, analysts evaluated whether the harmonization unit in question should be excluded or whether different component variables should be used for harmonization. When QC checks were complete and the set of contributing studies was finalized, analysts summarized the results and any additional information relevant for analysis in a free text document named “Harmonization Comments.” This document may include 1) notes about the presence of a notable cluster of outliers; 2) differences among studies that were not considered important enough for removal of a study from harmonization; 3) errors encountered in the component variable metadata during the QC process; or 4) variation among studies or subcohorts in assay or other collection methodology. These notes allow users to flexibly choose how to account for potential effects or exclude specific studies in analysis. See Web Figures 11–14 for examples of harmonization comments. The final multistudy harmonized variable was then added to the DCC’s phenotype database. 
The information added included metadata and data values for the harmonized variable, the set of component variables and harmonization functions used to generate the harmonized data values, and the harmonization comments.
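The by-study comparison of means and standard deviations described in this step can be sketched as below. This is an illustrative simplification, not the DCC's in-house R scripts: the function names, the ratio-to-median rule, and the threshold of 2 are assumptions, the rule presumes positive-valued measurements, and the residual-based checks mentioned above are not covered.

```python
from collections import defaultdict
from statistics import mean, median, stdev
from typing import Dict, Iterable, List, Optional, Tuple


def summarize_by_study(records: Iterable[Tuple[str, Optional[float]]]) -> Dict[str, dict]:
    """Compute n, mean, and SD of the harmonized values for each study."""
    by_study = defaultdict(list)
    for study, value in records:
        if value is not None:
            by_study[study].append(value)
    return {
        study: {"n": len(vals), "mean": mean(vals),
                "sd": stdev(vals) if len(vals) > 1 else None}
        for study, vals in by_study.items()
    }


def flag_outlying_studies(summary: Dict[str, dict], max_ratio: float = 2.0) -> List[str]:
    """Flag studies whose mean differs from the median study mean by more
    than a factor of max_ratio (assumes positive-valued measurements)."""
    center = median(s["mean"] for s in summary.values())
    flagged = []
    for study, stats in summary.items():
        ratio = stats["mean"] / center
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            flagged.append(study)
    return sorted(flagged)


# Made-up HDL values: study C looks as if it were recorded in different units.
records = [("A", 50.0), ("A", 55.0), ("A", 60.0),
           ("B", 48.0), ("B", 52.0), ("B", 56.0),
           ("C", 1.3), ("C", 1.4), ("C", 1.5)]
summary = summarize_by_study(records)
print(flag_outlying_studies(summary))  # ['C']
```

A flag from a rule like this is only a prompt for investigation; as described above, the decision to keep, fix, or exclude a harmonization unit rested with analysts and study experts.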

Distributing harmonization results to the scientific community

The DCC provides a package of data sets and documentation using information stored in the database (Web Appendix 8). Each data set generally includes a group of related harmonized variables plus age at measurement for each variable. The documentation includes files in JavaScript Object Notation format containing code that allows a user to reproduce or modify harmonized variables once they obtain access to the specified study data from dbGaP (see Web Appendix 9). In addition, the harmonized variables described in Table 2 have been submitted to dbGaP and to the National Heart, Lung, and Blood Institute’s BioData Catalyst repository (https://biodatacatalyst.nhlbi.nih.gov/) for distribution to the scientific community via application to dbGaP.

Updating harmonized variables

A harmonized phenotype variable often needed to be updated to include additional studies and/or to incorporate dbGaP updates to the component study variables from previously included studies (Web Appendix 10). These updates were semiautomated because the relational database contained all of the information necessary to recreate the harmonized phenotype. Updates to all variables in a data set were typically made at the same time when requested by WGs, often because additional studies were sequenced by TOPMed.

Tagging phenotype variables to facilitate future harmonization

While the detailed harmonization process described above produces well-documented, reproducible, and updateable harmonized phenotype variables, other investigators may want to carry out harmonization differently (e.g., using different component variables, a different harmonization algorithm, a different harmonized phenotype definition, or different time points). They may also need to harmonize a phenotype the DCC has not worked on yet. To facilitate identification of candidate variables for harmonization, we worked with study and domain experts to label TOPMed dbGaP study variables with controlled vocabulary terms to indicate the phenotype concept they represent (i.e., “variable tagging”). Study variables were tagged with 65 important phenotype concepts from heart, lung, blood, sleep, and demographic domains (Web Table 7). Harmonized phenotype variables for 27 of the 65 concepts have been constructed already, but many more are possible, even for the same concept. The remaining DCC-harmonized variables represent phenotype concepts that were not directly included among the 65 originally identified concepts. Study variable tagging was done via a database-backed Web application with built-in data validation. The DCC worked with representatives from 7 large cohort studies to identify all of their studies’ dbGaP study variables that fit 1 or more of the 65 phenotype concepts and to label them with the appropriate phenotype concept tag(s) and corresponding UMLS term(s). DCC phenotype team members also tagged variables for the remaining studies available at the time. We performed careful quality review of all tagged variables to ensure consistency and accuracy of the tagging across studies. Details on this process are provided in Web Appendix 11.
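A tagging record with built-in validation, in the spirit of the Web application described above, might look like the following Python sketch. Everything named here is hypothetical: the tag names, the `TaggedVariable` record, and the `tag_variable` function are invented for illustration, and the UMLS concept identifiers are placeholders that only follow the usual "C plus seven digits" format rather than real assignments.

```python
import re
from dataclasses import dataclass
from typing import Dict

# UMLS concept unique identifiers have the form "C" followed by seven digits.
CUI_PATTERN = re.compile(r"^C\d{7}$")

# Hypothetical controlled vocabulary: phenotype concept tag -> placeholder CUI.
VOCABULARY: Dict[str, str] = {
    "hdl_in_blood": "C0000001",     # placeholder, not a real assignment
    "body_mass_index": "C0000002",  # placeholder, not a real assignment
}


@dataclass(frozen=True)
class TaggedVariable:
    study: str
    variable: str
    tag: str
    umls_cui: str


def tag_variable(study: str, variable: str, tag: str,
                 vocabulary: Dict[str, str] = VOCABULARY) -> TaggedVariable:
    """Attach a controlled-vocabulary tag to a study variable, rejecting
    tags outside the vocabulary and malformed identifiers."""
    if tag not in vocabulary:
        raise ValueError(f"unknown phenotype concept tag: {tag!r}")
    cui = vocabulary[tag]
    if not CUI_PATTERN.match(cui):
        raise ValueError(f"malformed UMLS identifier for {tag!r}: {cui!r}")
    return TaggedVariable(study, variable, tag, cui)


tagged = tag_variable("StudyA", "HDL1", "hdl_in_blood")
print(tagged)
```

Restricting tags to a fixed vocabulary at entry time is one way to keep labels consistent across studies before the manual quality review described above.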

RESULTS

Phenotypes harmonized

A total of 63 harmonized variables, belonging to 8 phenotype data sets, were constructed across multiple TOPMed studies (up to 17 studies for some variables) (Table 2). Within each data set, the variables generally represent related phenotypes that are analyzed together (except for common covariates and demographic variables). Web Figure 15 shows histograms of the harmonized variables.

QC issues in harmonization

QC was generally the most time-consuming step in the process, as it directly tested the reliability of component variables and could not be meaningfully automated. Four types of issues arose frequently during QC of study and harmonized phenotype variables: 1) notable differences among studies/subcohorts in the distributions of quantitative measures or frequencies of categorical phenotypes; 2) variation among studies/subcohorts in methods for how the same phenotype was assessed or measured; 3) extreme (sometimes impossible) values of quantitative measures; and 4) inconsistencies in the values of related phenotypes. Distributional differences among studies/subcohorts were occasionally found to be due to errors in data management, such as a misspecified missing value code (see example on smoking below) or incorrect units in the data dictionary; such issues were resolved in consultation with study data managers. In general, the resolution of QC issues was highly phenotype-dependent and relied on expertise from the WG members and study liaisons. Here we give some examples of how these issues were detected and resolved, with more detail and examples in Web Appendices 5 and 7.

When producing a harmonized variable, we compared distributions across studies and subcohorts within studies to identify differences that might be due to errors or unusual features of a given study. We show an example of this type of comparison in Figure 2 for the “ever smoker” harmonized variable. It is clear that 2 studies/subcohorts, F and G1, had a much higher proportion of smokers than average, while a third study/subcohort, E, had a much lower proportion of smokers than average. In 2 of these 3 cases, the proportions can be explained by the studies’ recruitment strategies; study/subcohort F targeted smokers for enrollment in the study (9), while study/subcohort E included children (10). Because these differences can be explained by recruitment strategy, no modification of the harmonization process was needed. Further exploration of subcohort G1 showed that its high proportion was due to an unlabeled missing-value code in one of the component variables. We corrected the harmonization algorithm to account for this missing code, and the differences between the proportions of smokers by subcohort were then much smaller.
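The comparison described above can be illustrated with a minimal sketch. It is not the DCC's code: the 0/1 coding, the value 9 as an unlabeled missing-value code, the 0.25 tolerance, and the function names are all assumptions made for illustration, and the records are invented toy data.

```python
from collections import defaultdict
from statistics import median


def ever_smoker_proportions(records, missing_codes=frozenset()):
    """Return {subcohort: proportion of ever smokers}.

    records: iterable of (subcohort, raw_value) pairs. In this hypothetical
    coding, 0 means never smoker and any other non-missing value means ever
    smoker, so an unlabeled missing code such as 9 is miscounted as a smoker
    unless it is listed in missing_codes.
    """
    counts = defaultdict(lambda: [0, 0])  # subcohort -> [ever smokers, non-missing]
    for subcohort, value in records:
        if value is None or value in missing_codes:
            continue  # excluded from numerator and denominator
        counts[subcohort][0] += 1 if value != 0 else 0
        counts[subcohort][1] += 1
    return {s: ever / n for s, (ever, n) in counts.items() if n > 0}


def flag_outliers(proportions, tolerance=0.25):
    """Return subcohorts whose proportion is more than `tolerance` from the median."""
    centre = median(proportions.values())
    return sorted(s for s, p in proportions.items() if abs(p - centre) > tolerance)


# Toy data: subcohort G1 uses 9 as an undocumented missing-value code.
records = (
    [("A", v) for v in (1, 0, 1, 0)]
    + [("B", v) for v in (1, 0, 0, 1)]
    + [("G1", v) for v in (1, 0, 9, 9, 9, 9)]
)

before = ever_smoker_proportions(records)
after = ever_smoker_proportions(records, missing_codes={9})
print("flagged before correction:", flag_outliers(before))
print("flagged after correction:", flag_outliers(after))
```

A flag from a check like this only prompts investigation. As the examples of subcohorts E and F show, a real difference in recruitment strategy can produce the same signal as a coding error, so the decision still rests with study and domain experts.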
Figure 2

Proportion of ever smokers from the harmonized “ever_smoker_baseline_1” variable in the TOPMed DCC harmonized common covariates data set, by (anonymized) study subcohort. In both plots, different studies are labeled by a letter (e.g., B), and different subcohorts within each study (if applicable) are labeled by appending a number to the study letter (e.g., B1 and B2). A) Proportion of smokers by study/subcohort after initial harmonization. Three studies/subcohorts (E, F, and G1) have much smaller or larger proportions than the majority of other studies. B) Proportion of smokers by study/subcohort after correcting study/subcohort G1 (shown in black) for an unlabeled missing-value code. DCC, Data Coordinating Center; TOPMed, Trans-Omics for Precision Medicine.

A second example of harmonized phenotype QC is shown in Figure 3. The final QC for interleukin 6 concentration included inspection of the distribution of values by study and subcohort, as well as the residuals, after adjustment for age, sex, and race. The distribution for 1 study was much wider in range and generally had higher values than the other studies/subcohorts (study E in Figure 3). These differences remained even after adjustment for age, sex, and race. The DCC consulted with study liaisons and decided to remove that study from harmonization because the reason for the unusual distribution could not be determined.
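A simple way to screen for a study whose distribution is unusually wide is to compare interquartile ranges across studies. The sketch below is illustrative only: the data are invented, the factor of 2 is an arbitrary assumption, and it covers only the unadjusted comparison, not the adjustment for age, sex, and race described above.

```python
from statistics import median, quantiles


def iqr(values):
    """Interquartile range using the default method of statistics.quantiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def wide_studies(values_by_study, ratio=2.0):
    """Return studies whose IQR exceeds `ratio` times the median IQR across studies."""
    iqrs = {study: iqr(vals) for study, vals in values_by_study.items()}
    typical = median(iqrs.values())
    return sorted(study for study, width in iqrs.items() if width > ratio * typical)


# Invented values standing in for a quantitative measure in three studies.
example = {
    "A": [1.0, 1.5, 2.0, 2.5, 3.0],
    "D": [1.2, 1.6, 2.1, 2.4, 2.9],
    "E": [2.0, 5.0, 9.0, 14.0, 20.0],
}
print(wide_studies(example))
```

To mirror the adjusted comparison, the same screen could be applied to residuals from a regression on the covariates instead of to the raw values.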
Figure 3

Distribution of harmonized interleukin 6 (IL-6) values in the TOPMed DCC harmonized inflammation data set, by (anonymized) study/subcohort. In both plots, different studies are labeled by a single letter (e.g., D), and different subcohorts within each study (if applicable) are labeled by appending a number to the study letter (e.g., D1 and D2). A) Harmonized IL-6 values. The interquartile range for study E is much larger than that for the other studies/subcohorts. B) Residuals from a linear model (IL-6 ~ age + sex + race). The large differences between study E and the other studies/subcohorts remain after adjusting the values for age, sex, and race. DCC, Data Coordinating Center; TOPMed, Trans-Omics for Precision Medicine.

There is often a trade-off between the homogeneity of a harmonized variable and achieving a large sample size by including many studies (11–13). This issue generally arose when studies measured different aspects of a harmonized variable (e.g., measurements of the thickness of different carotid artery walls for calculating common carotid intima-media thickness) or used different methods to collect a similar measurement (e.g., different assay methods for inflammation phenotypes). In these cases, WG and study members were involved in decisions about whether to exclude studies or modify the definition of the harmonized phenotype.

We sometimes found biologically invalid data points, such as diastolic blood pressure greater than systolic blood pressure for some participants, or unexpected relationships between variable values, such as white blood cell subtype counts not adding up to the total count. Other inconsistencies were found in participant responses to questionnaires (e.g., participants who report that they have never smoked but also report smoking a nonzero number of cigarettes per day). As noted in the Methods section, impossible data values are typically not included in the harmonized variable, while the potentially valid but extreme values are retained but noted in the harmonization comments.
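The three kinds of inconsistency mentioned above can be expressed as record-level checks. The following is a sketch under stated assumptions, not the DCC's implementation: the field names, the 5% tolerance for the white blood cell sum, and the 0/1 smoking code are hypothetical.

```python
def consistency_issues(participant, tolerance=0.05):
    """Return a list of human-readable inconsistencies for one participant record.

    participant: dict with hypothetical keys sbp, dbp, wbc_total, wbc_subtypes,
    ever_smoker (0 = never, 1 = ever), and cigs_per_day. Missing keys are skipped.
    """
    issues = []

    sbp, dbp = participant.get("sbp"), participant.get("dbp")
    if sbp is not None and dbp is not None and dbp > sbp:
        issues.append("diastolic blood pressure exceeds systolic")

    total, subtypes = participant.get("wbc_total"), participant.get("wbc_subtypes")
    if total is not None and subtypes:
        # Allow a small relative difference for rounding in the source data.
        if abs(sum(subtypes) - total) > tolerance * total:
            issues.append("white blood cell subtype counts do not sum to total")

    if participant.get("ever_smoker") == 0 and (participant.get("cigs_per_day") or 0) > 0:
        issues.append("never smoker reports nonzero cigarettes per day")

    return issues


consistent = {"sbp": 120, "dbp": 80, "wbc_total": 6.0,
              "wbc_subtypes": [3.6, 1.8, 0.4, 0.15, 0.05],
              "ever_smoker": 0, "cigs_per_day": 0}
inconsistent = {"sbp": 70, "dbp": 110, "wbc_total": 6.0,
                "wbc_subtypes": [3.0, 1.0],
                "ever_smoker": 0, "cigs_per_day": 10}

print(consistency_issues(consistent))
print(consistency_issues(inconsistent))
```

The function only reports issues. Whether a flagged value is excluded as impossible or retained and noted in the harmonization comments is a separate decision, as described in the text.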

Reproducibility of harmonized phenotype variables

We have successfully reproduced several of our harmonized variables exactly using only the JavaScript Object Notation documentation provided in our public GitHub repository (14), along with the specified study data files from dbGaP (via controlled access). The repository also includes a fully reproducible example using simulated dbGaP data that instructs users about how to reproduce the harmonized variables using the documentation.

DCC phenotype tagging results

We tagged dbGaP study variables with UMLS terms representing 65 phenotype concepts in 16 domains. Web Table 7 provides descriptions, detailed tagging instructions, and UMLS terms for each phenotype concept. A total of 16,671 dbGaP study variables from 17 studies were tagged with relevant UMLS phenotype terms. Table 3 shows the number of variables available in each study, the number tagged, and the proportion tagged. The proportion tagged differs among studies according to the breadth and depth of the phenotype domains for which each study's investigators collected data. For example, the Framingham Heart Study has many variables in domains that are not part of the 65 phenotype concepts chosen for tagging, such as bone mineral density measurements. Further details are provided in Web Appendix 11, Web Table 8, and Web Figures 16 and 17.
Table 3

Numbers and Proportions of Variables Tagged With Controlled Vocabulary Phenotype Concepts for Each of the 17 TOPMed Studies Included in This Article

Study | dbGaP Accession No. | No. of dbGaP Variables | No. of Variable-Tag Pairs b | Proportion Tagged
Genetics of Cardiometabolic Health in the Amish | phs000956.v2.p1 | 53 | 40 | 0.75
ARIC Study c | phs000280.v3.p1 | 14,430 | 1,713 | 0.12
CARDIA Study c | phs000285.v3.p2 | 9,036 | 1,608 | 0.18
Cleveland Family Study | phs000284.v1.p1 | 2,325 | 371 | 0.16
Cardiovascular Health Study c | phs000287.v6.p1 | 14,657 | 2,175 | 0.15
COPDGene Study | phs000179.v5.p2 | 332 | 103 | 0.31
CRA Study | phs000988.v2.p1 | 15 | 13 | 0.87
Framingham Heart Study c | phs000007.v29.p10 | 61,195 | 6,579 | 0.11
GENOA Study | phs001238.v1.p1 | 1,072 | 441 | 0.41
GOLDN Study | phs000741.v2.p1 | 107 | 9 | 0.08
HCHS/SOL | phs000810.v1.p1 | 274 | 132 | 0.48
Heart and Vascular Health Study | phs001013.v2.p2 | 23 | 20 | 0.87
Jackson Heart Study c | phs000286.v5.p1 | 4,084 | 745 | 0.18
Mayo VTE | phs000289.v2.p1 | 41 | 17 | 0.41
MESA c | phs000209.v13.p3 | 22,044 | 1,943 | 0.09
Samoan Adiposity Study | phs000914.v1.p1 | 167 | 48 | 0.29
Women’s Health Initiative c | phs000200.v11.p3 | 6,117 | 1,106 | 0.18

Abbreviations: ARIC, Atherosclerosis Risk in Communities; CARDIA, Coronary Artery Risk Development in Young Adults; COPD, chronic obstructive pulmonary disease; COPDGene, Genetic Epidemiology of COPD; CRA, Genetic Epidemiology of Asthma in Costa Rica; GENOA, Genetic Epidemiology Network of Arteriopathy; GOLDN, Genetics of Lipid Lowering Drugs and Diet Network; HCHS/SOL, Hispanic Community Health Study/Study of Latinos; MAYO VTE, Mayo Clinic Venous Thromboembolism Study; MESA, Multi-Ethnic Study of Atherosclerosis; TOPMed, Trans-Omics for Precision Medicine.

a Participants were recruited during the years 1948–2012. See Web Table 1 for additional study information, including each study’s recruitment period.

b Number of variable-tag pairs. In some cases, a variable can be tagged with multiple different tags. The sum of all pairs in this column is 17,063, while the number of variables paired with 1 or more tags is 16,671.

c Initial tagging was done by study data experts; other studies in this table were tagged by analysts at the TOPMed Data Coordinating Center.
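The relationship between variable-tag pairs and distinct tagged variables in footnote b can be checked with simple arithmetic. The counts below are transcribed from Table 3; the short study labels and variable names are ours. That the published proportions match pairs divided by variables, rounded to 2 decimals, is an observation from the numbers and not a definition stated in the article.

```python
# (number of dbGaP variables, number of variable-tag pairs) per study, from Table 3
TABLE_3 = {
    "Amish": (53, 40),
    "ARIC": (14430, 1713),
    "CARDIA": (9036, 1608),
    "Cleveland Family Study": (2325, 371),
    "Cardiovascular Health Study": (14657, 2175),
    "COPDGene": (332, 103),
    "CRA": (15, 13),
    "Framingham": (61195, 6579),
    "GENOA": (1072, 441),
    "GOLDN": (107, 9),
    "HCHS/SOL": (274, 132),
    "Heart and Vascular Health Study": (23, 20),
    "Jackson Heart Study": (4084, 745),
    "Mayo VTE": (41, 17),
    "MESA": (22044, 1943),
    "Samoan Adiposity Study": (167, 48),
    "Women's Health Initiative": (6117, 1106),
}
UNIQUE_TAGGED_VARIABLES = 16671  # stated in footnote b

# Sum of the variable-tag pair column across all 17 studies.
total_pairs = sum(pairs for _, pairs in TABLE_3.values())

# Each distinct tagged variable contributes one pair for its first tag, so the
# surplus counts additional tags on variables that carry more than one tag.
additional_tags = total_pairs - UNIQUE_TAGGED_VARIABLES

# Pairs divided by variables, rounded as in the table.
ratios = {study: round(pairs / n_vars, 2) for study, (n_vars, pairs) in TABLE_3.items()}

print(total_pairs, additional_tags)
print(ratios["Framingham"], ratios["CRA"])
```

The total reproduces the 17,063 pairs given in footnote b, which leaves 392 pairs attributable to variables with more than one tag.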


Data availability

The study data used as input for harmonization are available to the scientific community from dbGaP via controlled access. In a single application, a user can apply for access to all dbGaP study accessions provided in the documentation. In addition, the harmonized data in Table 2 have been submitted to dbGaP and to the National Heart, Lung, and Blood Institute’s new data repository, BioData Catalyst (https://biodatacatalyst.nhlbi.nih.gov). In both cases, access will be through application to dbGaP. We worked with dbGaP scientists to make the tagging results available in dbGaP searches and visible on the dbGaP variable pages. Detailed information on how to access and use this information is available on the TOPMed website (15).

DISCUSSION

The TOPMed program was designed to add cutting-edge genomics and other -omics data to over 80 studies with extensive characterization of heart, lung, blood, and sleep phenotypes (3). Because phenotype data in the contributing studies are quite heterogeneous, retrospective harmonization is critical to achieving the program’s goals. The harmonization system described in this article has been used to harmonize 63 phenotypes for several WGs, members of which are using them in many different analyses, primarily genotype-phenotype association studies. Some of these studies have been published (e.g., 16–19), and many others are in preparation.

An area of phenotype harmonization previously noted as needing further research is how to determine whether the harmonized data are adequate to address the intended research goals (20). In genome-wide association studies, one measure of success is replication of novel genotype-phenotype associations using independent data sets, which is a standard for publication in the field and a major component of TOPMed research. Assessing loss of power due to phenotype heterogeneity is more difficult but could potentially be addressed through sensitivity tests, for example, by excluding studies with phenotypes that do not fit the idealized concept as well as others. In addition, failure to replicate strong, previously identified genotype-phenotype associations in a newly harmonized data set may suggest data heterogeneity (among other possible causes).

An important consideration in the design of our harmonization system was the ability to share harmonized phenotypes with the broader scientific community. This goal is challenging because the study data, and any individual-level derivations thereof, require controlled access due to human subject privacy and consent restrictions. We addressed this problem by obtaining the study data for harmonization from dbGaP, which can be accessed by the scientific community; by providing detailed documentation about the component variables and algorithms for each variable; and by returning the harmonized data to National Institutes of Health–designated repositories. These repositories track the type of consent given by each study participant for the use of their data. The harmonized data are given the consent type previously assigned to the dbGaP components used in the harmonization.

It is difficult to ensure reproducibility of results with confidential data (21). Harmonized data produced by our system are fully reproducible because of the availability of study data, provenance tracking, harmonization code, and other documentation. However, exact reproducibility can only be ensured if a user has access to the same version of the data that was used in harmonization, as study investigators can update or even remove variables from their dbGaP accessions.

A limitation of our process for phenotype harmonization is that it is very labor-intensive and does not scale readily to the thousands of phenotypes available in TOPMed and other similar programs. Selection of study variables and subsequent QC are largely manual and would be very difficult to automate. Furthermore, as others have noted previously (2, 20), the utility of results may be seriously compromised without careful attention to these steps. Because of these scalability issues, we provide the following materials to help other investigators perform their own harmonizations:

1) Software code and documentation sufficient to allow others to reproduce, modify, or expand upon our harmonizations.

2) Detailed examples of the types of QC performed, issues that arose, and how they were resolved (Web Appendices 5 and 7). We expect this information will prove useful to investigators working on a broad range of phenotypes and may also be helpful to funding agencies regarding the level of resources required for useful harmonization efforts.

3) Thousands of dbGaP variables tagged with 65 phenotype concepts, which can be used directly by other investigators for the largely manual and time-consuming step of identifying the study variables needed for harmonization. The tagging data also provide a gold-standard, human-curated data set for developing automated approaches to identifying variables that fit a specific phenotype concept.

Figure 4 summarizes some of the challenges and lessons learned in developing the DCC’s harmonization system. These findings suggest a few key ways to reduce the effort required for future phenotype harmonizations. Studies sharing phenotype data with the community should structure their data tables so that each phenotype variable (i.e., table column) contains data corresponding to only 1 phenotype concept, and they should provide controlled vocabulary terms from standard ontologies for each phenotype variable. Researchers harmonizing phenotypes should provide full documentation, including code, procedures, and input data provenance, so that others can reproduce and extend their work. Sharing this documentation can benefit the scientific community without sharing the actual harmonized phenotype values (which often requires complicated data-sharing arrangements). Finally, investigators in studies currently collecting data should consider using standardized protocols, such as those developed by the PhenX consortium (1), to reduce the need for retrospective harmonization.
Figure 4

Lessons learned from phenotype harmonization in the National Heart, Lung, and Blood Institute Trans-Omics for Precision Medicine (TOPMed) program.
