Adrienne M Stilp, Leslie S Emery, Jai G Broome, Erin J Buth, Alyna T Khan, Cecelia A Laurie, Fei Fei Wang, Quenna Wong, Dongquan Chen, Catherine M D'Augustine, Nancy L Heard-Costa, Chancellor R Hohensee, William Craig Johnson, Lucia D Juarez, Jingmin Liu, Karen M Mutalik, Laura M Raffield, Kerri L Wiggins, Paul S de Vries, Tanika N Kelly, Charles Kooperberg, Pradeep Natarajan, Gina M Peloso, Patricia A Peyser, Alex P Reiner, Donna K Arnett, Stella Aslibekyan, Kathleen C Barnes, Lawrence F Bielak, Joshua C Bis, Brian E Cade, Ming-Huei Chen, Adolfo Correa, L Adrienne Cupples, Mariza de Andrade, Patrick T Ellinor, Myriam Fornage, Nora Franceschini, Weiniu Gan, Santhi K Ganesh, Jan Graffelman, Megan L Grove, Xiuqing Guo, Nicola L Hawley, Wan-Ling Hsu, Rebecca D Jackson, Cashell E Jaquish, Andrew D Johnson, Sharon L R Kardia, Shannon Kelly, Jiwon Lee, Rasika A Mathias, Stephen T McGarvey, Braxton D Mitchell, May E Montasser, Alanna C Morrison, Kari E North, Seyed Mehdi Nouraie, Elizabeth C Oelsner, Nathan Pankratz, Stephen S Rich, Jerome I Rotter, Jennifer A Smith, Kent D Taylor, Ramachandran S Vasan, Daniel E Weeks, Scott T Weiss, Carla G Wilson, Lisa R Yanek, Bruce M Psaty, Susan R Heckbert, Cathy C Laurie.
Abstract
Genotype-phenotype association studies often combine phenotype data from multiple studies to increase statistical power. Harmonization of the data usually requires substantial effort due to heterogeneity in phenotype definitions, study design, data collection procedures, and data-set organization. Here we describe a centralized system for phenotype harmonization that includes input from phenotype domain and study experts, quality control, documentation, reproducible results, and data-sharing mechanisms. This system was developed for the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program, which is generating genomic and other -omics data for more than 80 studies with extensive phenotype data. To date, 63 phenotypes have been harmonized across thousands of participants (recruited in 1948-2012) from up to 17 studies per phenotype. Here we discuss challenges in this undertaking and how they were addressed. The harmonized phenotype data and associated documentation have been submitted to National Institutes of Health data repositories for controlled access by the scientific community. We also provide materials to facilitate future harmonization efforts by the community, which include 1) the software code used to generate the 63 harmonized phenotypes, enabling others to reproduce, modify, or extend these harmonizations to additional studies, and 2) the results of labeling thousands of phenotype variables with controlled vocabulary terms.
Keywords: cardiovascular disease; common data elements; hematologic disease; information dissemination; lung diseases; phenotypes; sleep-wake disorders
Year: 2021 PMID: 33861317 PMCID: PMC8485147 DOI: 10.1093/aje/kwab115
Source DB: PubMed Journal: Am J Epidemiol ISSN: 0002-9262 Impact factor: 5.363
Specific Terminology Used in This Article, in Web Appendices 1–11, and in Documentation Distributed With Harmonized Phenotype Data
| Term | Definition |
|---|---|
| Participant or subject | Studies generally refer to an individual participating in their study as a “participant,” while dbGaP uses “subject” as the equivalent term. |
| Cohort and subcohort | A sample of study participants enrolled in the study together at a given time (or clinic visit). The term “subcohort” refers to a distinct group of participants within a study, as defined by that study (e.g., a different recruitment wave or targeted demographic group). |
| Phenotype or trait | Observable characteristics of an organism. “Phenotype” and “trait” are used synonymously. |
| Phenotype concept | Broad definition of a phenotype, such as “quantitative measure of high-density lipoprotein concentration in blood” or “qualitative indicator of diabetes mellitus status.” |
| Phenotype variable | A vector of data values representing a measurement or other aspect of a phenotype concept, where each item in the vector corresponds to the value for a specific participant and/or repeated measure for a participant. |
| dbGaP study variable | An unharmonized phenotype variable from a given study’s dbGaP accession. |
| Candidate variable | A phenotype variable from a given study to be evaluated for use as a component phenotype variable. Such evaluation includes consideration of factors such as how well it represents the target phenotype concept, how well it can be harmonized with candidate variables from other studies, and the quality of the data. |
| Component variable | A phenotype variable selected for inclusion in a single harmonization, either because it directly represents the target phenotype (e.g., biomarker concentration) or because it is useful in constructing the harmonized variable (e.g., biomarker assay quality). |
| Harmonized variable | A phenotype variable constructed from a set of component variables from different studies, after performing whatever harmonization steps are considered to be important for a valid pooled analysis or meta-analysis. |
| Harmonization algorithm and function | The algorithm is a series of steps to be applied to the group of component variables to produce harmonized phenotype values for a single harmonization unit. Algorithms are implemented as R functions (see footnote a). |
| Harmonization unit | A group of subjects from a single study (e.g., subcohort) with the same component variables, to which a single harmonization algorithm is applied to produce harmonized phenotype values. A harmonized variable is typically constructed by combining multiple harmonization units from one or more studies. |
| Harmonized data set | A data set consisting of a set of harmonized variables representing various aspects of phenotype concepts. It may also contain harmonized variables for multiple related phenotype concepts. For example, the “lipids” data set contains phenotype variables for concentrations of each of several lipid compounds assayed from the same blood draw, as well as age at blood draw, fasting status, and use of lipid-lowering medication. |
Abbreviation: dbGaP, database of Genotypes and Phenotypes.
a R Foundation for Statistical Computing, Vienna, Austria (5).
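As a rough illustration of how the terms "harmonization unit," "component variable," and "harmonization function" fit together, the following sketch applies one function per unit and pools the results into a single harmonized variable. The article's algorithms are implemented in R; Python is used here only for illustration, and the unit names, component variable names (`height_in`, `height_cm`), and data values are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Component variable values for one subject (hypothetical structure).
Record = Dict[str, Optional[float]]
# A harmonization unit: one algorithm plus the subjects it applies to.
Unit = Tuple[Callable[[Record], Optional[float]], Dict[str, Record]]


def height_from_inches(rec: Record) -> Optional[float]:
    """Hypothetical algorithm for a unit that recorded height in inches."""
    value = rec.get("height_in")
    return None if value is None else round(value * 2.54, 1)


def height_from_cm(rec: Record) -> Optional[float]:
    """Hypothetical algorithm for a unit that already recorded centimeters."""
    return rec.get("height_cm")


def harmonize(units: Dict[str, Unit]) -> List[dict]:
    """Apply each unit's algorithm to its own subjects and pool the rows."""
    rows = []
    for unit_name, (algorithm, subjects) in units.items():
        for subject_id, rec in subjects.items():
            rows.append({
                "unit": unit_name,
                "subject_id": subject_id,
                "height_cm": algorithm(rec),
            })
    return rows


# Two made-up units with different component variables.
units = {
    "studyA_subcohort1": (height_from_inches,
                          {"A001": {"height_in": 70.0},
                           "A002": {"height_in": None}}),
    "studyB": (height_from_cm, {"B001": {"height_cm": 165.5}}),
}
harmonized = harmonize(units)
```

Keeping the unit label on every row makes it possible to trace each harmonized value back to the algorithm that produced it.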
Harmonized Variables Produced by the TOPMed Data Coordinating Center for 17 Studies With Recruitment Dates Spanning 1948–2012
| Phenotype concept, by domain | Harmonized variable name (b) | No. of participants | No. of studies |
|---|---|---|---|
| Atherosclerosis | |||
| CAC volume | cac_volume_1 | 11,098 | 2 |
| CAC score | cac_score_1 | 15,042 | 6 |
| Common carotid IMT | cimt_1 | 35,420 | 6 |
| Common carotid IMT | cimt_2 | 30,473 | 5 |
| Carotid stenosis | carotid_stenosis_1 | 15,098 | 3 |
| Presence of carotid plaque | carotid_plaque_1 | 27,344 | 5 |
| Baseline common covariates | |||
| Standing body height | height_baseline_1 | 230,287 | 16 |
| Body weight | weight_baseline_1 | 230,657 | 16 |
| Ever smoker status | ever_smoker_baseline_1 | 225,271 | 14 |
| Current smoker status | current_smoker_baseline_1 | 228,688 | 16 |
| Body mass index | bmi_baseline_1 | 230,918 | 17 |
| Blood cell count | |||
| Basophil concentration in blood | basophil_ncnc_bld_1 | 36,586 | 7 |
| Eosinophil concentration in blood | eosinophil_ncnc_bld_1 | 37,426 | 7 |
| Lymphocyte concentration in blood | lymphocyte_ncnc_bld_1 | 39,702 | 7 |
| Hematocrit level in blood | hematocrit_vfr_bld_1 | 193,469 | 9 |
| Hemoglobin concentration in blood | hemoglobin_mcnc_bld_1 | 193,367 | 9 |
| Monocyte concentration in blood | monocyte_ncnc_bld_1 | 39,647 | 7 |
| Neutrophil concentration in blood | neutrophil_ncnc_bld_1 | 38,285 | 7 |
| Mean corpuscular volume in blood | mcv_entvol_rbc_1 | 44,593 | 7 |
| Mean corpuscular hemoglobin concentration in blood | mchc_mcnc_rbc_1 | 51,293 | 8 |
| Mean corpuscular hemoglobin in blood | mch_entmass_rbc_1 | 39,649 | 7 |
| Platelet concentration in blood | platelet_ncnc_bld_1 | 190,177 | 9 |
| Mean platelet volume in blood | pmv_entvol_bld_1 | 13,816 | 3 |
| Red blood cell concentration in blood | rbc_ncnc_bld_1 | 39,710 | 7 |
| Red cell distribution width | rdw_ratio_rbc_1 | 28,034 | 4 |
| White blood cell concentration in blood | wbc_ncnc_bld_1 | 192,346 | 9 |
| Blood pressure | |||
| Systolic blood pressure | bp_systolic_1 | 225,934 | 14 |
| Diastolic blood pressure | bp_diastolic_1 | 225,934 | 14 |
| Use of antihypertensive medication | antihypertensive_meds_1 | 207,130 | 12 |
| Demographic characteristics | |||
| Hispanic subgroup | hispanic_subgroup_1 | 18,612 | 4 |
| Subcohort identifier | subcohort_1 | 218,747 | 15 |
| Reported race | race_1 | 230,994 | 17 |
| Reported sex | annotated_sex_1 | 233,030 | 17 |
| Reported Hispanic/Latino indicator | ethnicity_1 | 188,905 | 11 |
| Geographic recruitment site | geographic_site_1 | 212,529 | 12 |
| Inflammation | |||
| CD40 protein concentration in blood | cd40_1 | 4,238 | 2 |
| CRP concentration in blood | crp_1 | 49,536 | 10 |
| E-selectin concentration in blood | eselectin_1 | 1,215 | 1 |
| ICAM-1 concentration in blood | icam1_1 | 15,876 | 5 |
| IL-1β concentration in blood | il1_beta_1 | 708 | 1 |
| IL-6 concentration in blood | il6_1 | 20,390 | 5 |
| IL-10 concentration in blood | il10_1 | 3,455 | 2 |
| IL-18 concentration in blood | il18_1 | 3,159 | 1 |
| Isoprostane 8-epi-PGF2α concentration in urine | isoprostane_8_epi_pgf2a_1 | 7,523 | 1 |
| Activity of LP-PLA2 in blood | lppla2_act_1 | 18,117 | 3 |
| Mass of LP-PLA2 in blood | lppla2_mass_1 | 18,049 | 3 |
| MCP-1 concentration in blood | mcp1_1 | 7,557 | 1 |
| MMP-9 concentration in blood | mmp9_1 | 964 | 1 |
| Myeloperoxidase concentration in blood | mpo_1 | 3,162 | 1 |
| Osteoprotegerin concentration in blood | opg_1 | 7,648 | 1 |
| P-selectin concentration in blood | pselectin_1 | 8,037 | 1 |
| TNF-α concentration in blood | tnfa_1 | 5,075 | 3 |
| TNF-α receptor 1 concentration in blood | tnfa_r1_1 | 2,802 | 1 |
| TNF receptor 2 concentration in blood | tnfr2_1 | 7,962 | 1 |
| Lipids | |||
| Fasting status | fasting_lipids_1 | 64,895 | 11 |
| High-density lipoprotein concentration in blood | hdl_1 | 65,676 | 11 |
| Total cholesterol concentration in blood | total_cholesterol_1 | 65,707 | 11 |
| Triglyceride concentration in blood | triglycerides_1 | 65,706 | 11 |
| Low-density lipoprotein concentration in blood | ldl_1 | 64,715 | 11 |
| Use of lipid-lowering medication | lipid_lowering_medication_1 | 58,962 | 9 |
| VTE | |||
| Age at beginning of follow-up | vte_followup_start_age_1 | 61,692 | 4 |
| Prior history of VTE | vte_prior_history_1 | 62,445 | 5 |
| VTE case status | vte_case_status_1 | 63,092 | 6 |
Abbreviations: CAC, coronary artery calcium; CD40, cluster of differentiation 40; CRP, C-reactive protein; 8-epi-PGF2α, 8-epi-prostaglandin F2α; ICAM-1, intercellular adhesion molecule 1; IL-1β, interleukin 1β; IL-6, interleukin 6; IL-10, interleukin 10; IL-18, interleukin 18; IMT, intima-media thickness; LP-PLA2, lipoprotein-associated phospholipase A2; MCP-1, monocyte chemoattractant protein 1; MMP-9, matrix metalloproteinase 9; TNF-α, tumor necrosis factor α; TOPMed, Trans-Omics for Precision Medicine; VTE, venous thromboembolism.
a See Web Table 1 for descriptions of the 17 studies. Additional documentation about each harmonized variable can be found in the GitHub repository (14).
b The “concept variant number” at the end of each harmonized variable name differentiates among different implementations of harmonization for the same basic phenotype concept (e.g., cimt_1 and cimt_2 are names for carotid IMT variables calculated with slightly different harmonization algorithms).
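The naming convention described in footnote b can be handled programmatically. The following is a minimal sketch, assuming that every harmonized variable name ends in an underscore followed by an integer variant number; the helper `split_variable_name` is hypothetical and is not part of the published code.

```python
from collections import defaultdict


def split_variable_name(name):
    """Split a harmonized variable name into (concept, variant number)."""
    concept, _, variant = name.rpartition("_")
    if not concept or not variant.isdigit():
        raise ValueError(f"no concept variant number in {name!r}")
    return concept, int(variant)


# Names taken from the table above.
names = ["cimt_1", "cimt_2", "cac_score_1", "isoprostane_8_epi_pgf2a_1"]

# Group the variant numbers available for each phenotype concept.
variants = defaultdict(list)
for name in names:
    concept, number = split_variable_name(name)
    variants[concept].append(number)
```

Splitting on the last underscore matters because several concept names, such as `isoprostane_8_epi_pgf2a`, themselves contain underscores and digits.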
Figure 2. Proportion of ever smokers from the harmonized “ever_smoker_baseline_1” variable in the TOPMed DCC harmonized common covariates data set, by (anonymized) study subcohort. In both plots, different studies are labeled by a letter (e.g., B), and different subcohorts within each study (if applicable) are labeled by appending a number to the study letter (e.g., B1 and B2). A) Proportion of smokers by study/subcohort after initial harmonization. Three studies/subcohorts (E, F, and G1) have much smaller or larger proportions than the majority of other studies. B) Proportion of smokers by study/subcohort after correcting study/subcohort G1 (shown in black) for an unlabeled missing-value code. DCC, Data Coordinating Center; TOPMed, Trans-Omics for Precision Medicine.
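The kind of check shown in this figure can be sketched as follows: compute the proportion of ever smokers in each study/subcohort, flag groups that sit far from the others, and recompute after recoding a suspected missing-value code. This is only an illustration under stated assumptions. The raw values, the code `9`, the tolerance of 0.1, and the median-based flagging rule are all invented here, and the source does not say how the G1 problem was actually detected or what the unlabeled code was.

```python
from statistics import median


def recode(values, missing_codes=()):
    """Map raw codes to 0/1, treating any listed code as missing (None)."""
    return [None if v in missing_codes else int(v != 0) for v in values]


def proportion_by_group(values_by_group):
    """Proportion of 1s among non-missing values in each group."""
    props = {}
    for group, values in values_by_group.items():
        observed = [v for v in values if v is not None]
        if observed:  # skip groups with no usable data
            props[group] = sum(observed) / len(observed)
    return props


def flag_outlying_groups(props, tolerance=0.1):
    """Groups whose proportion differs from the median by more than tolerance."""
    center = median(props.values())
    return sorted(g for g, p in props.items() if abs(p - center) > tolerance)


# Hypothetical raw data; 9 is an assumed undocumented missing-value code.
raw = {"A": [0, 1, 1, 0], "B1": [1, 0, 0, 1, 1, 0], "G1": [0, 9, 9, 9, 1, 1, 0]}

initial = {g: recode(v) for g, v in raw.items()}
corrected = dict(initial, G1=recode(raw["G1"], missing_codes=(9,)))

flagged_initial = flag_outlying_groups(proportion_by_group(initial))
flagged_corrected = flag_outlying_groups(proportion_by_group(corrected))
```

A flag is only a prompt for review. A group can differ from the others for legitimate reasons, which is consistent with the caption reporting a correction for G1 only.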
Figure 3. Distribution of harmonized interleukin 6 (IL-6) values in the TOPMed DCC harmonized inflammation data set, by (anonymized) study/subcohort. In both plots, different studies are labeled by a single letter (e.g., D), and different subcohorts within each study (if applicable) are labeled by appending a number to the study letter (e.g., D1 and D2). A) Harmonized IL-6 values. The interquartile range for study E is much larger than that for the other studies/subcohorts. B) Residuals from a linear model (IL-6 ~ age + sex + race). The large differences between study E and the other studies/subcohorts remain after adjusting the values for age, sex, and race. DCC, Data Coordinating Center; TOPMed, Trans-Omics for Precision Medicine.
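The adjustment described in panel B amounts to fitting an ordinary least squares model and comparing the spread of the residuals across studies. The sketch below uses only the Python standard library; the dummy coding of sex and race, the function names, and the use of the interquartile range as the comparison statistic are assumptions made for illustration and are not taken from the article's R code.

```python
from statistics import quantiles


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes the system is non-singular (design matrix of full column rank).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def design_row(age, sex, race, race_levels):
    """Intercept, age, female indicator, and race dummies (first level = reference)."""
    return ([1.0, float(age), 1.0 if sex == "F" else 0.0]
            + [1.0 if race == level else 0.0 for level in race_levels[1:]])


def ols_residuals(y, design):
    """Residuals from regressing y on the columns of design via the normal equations."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return [yi - sum(b * x for b, x in zip(beta, row))
            for yi, row in zip(y, design)]


def iqr(values):
    """Interquartile range using the default method of statistics.quantiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1
```

In use, one would fit the model on the pooled data, group the residuals by study/subcohort, and compare `iqr` across groups; a group that still stands out after adjustment, as study E does here, is a candidate for follow-up with study experts.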
Numbers and Proportions of Variables Tagged With Controlled Vocabulary Phenotype Concepts for Each of the 17 TOPMed Studies Included in This Article
| Study | dbGaP accession | No. of dbGaP study variables | No. of tagged variables (b) | Proportion tagged |
|---|---|---|---|---|
| Genetics of Cardiometabolic Health in the Amish | phs000956.v2.p1 | 53 | 40 | 0.75 |
| ARIC Study | phs000280.v3.p1 | 14,430 | 1,713 | 0.12 |
| CARDIA Study | phs000285.v3.p2 | 9,036 | 1,608 | 0.18 |
| Cleveland Family Study | phs000284.v1.p1 | 2,325 | 371 | 0.16 |
| Cardiovascular Health Study | phs000287.v6.p1 | 14,657 | 2,175 | 0.15 |
| COPDGene Study | phs000179.v5.p2 | 332 | 103 | 0.31 |
| CRA Study | phs000988.v2.p1 | 15 | 13 | 0.87 |
| Framingham Heart Study | phs000007.v29.p10 | 61,195 | 6,579 | 0.11 |
| GENOA Study | phs001238.v1.p1 | 1,072 | 441 | 0.41 |
| GOLDN Study | phs000741.v2.p1 | 107 | 9 | 0.08 |
| HCHS/SOL | phs000810.v1.p1 | 274 | 132 | 0.48 |
| Heart and Vascular Health Study | phs001013.v2.p2 | 23 | 20 | 0.87 |
| Jackson Heart Study | phs000286.v5.p1 | 4,084 | 745 | 0.18 |
| Mayo VTE | phs000289.v2.p1 | 41 | 17 | 0.41 |
| MESA | phs000209.v13.p3 | 22,044 | 1,943 | 0.09 |
| Samoan Adiposity Study | phs000914.v1.p1 | 167 | 48 | 0.29 |
| Women’s Health Initiative | phs000200.v11.p3 | 6,117 | 1,106 | 0.18 |
Abbreviations: ARIC, Atherosclerosis Risk in Communities; CARDIA, Coronary Artery Risk Development in Young Adults; COPD, chronic obstructive pulmonary disease; COPDGene, Genetic Epidemiology of COPD; CRA, Genetic Epidemiology of Asthma in Costa Rica; GENOA, Genetic Epidemiology Network of Arteriopathy; GOLDN, Genetics of Lipid Lowering Drugs and Diet Network; HCHS/SOL, Hispanic Community Health Study/Study of Latinos; Mayo VTE, Mayo Clinic Venous Thromboembolism Study; MESA, Multi-Ethnic Study of Atherosclerosis; TOPMed, Trans-Omics for Precision Medicine.
a Participants were recruited during the years 1948–2012. See Web Table 1 for additional study information, including each study’s recruitment period.
b Number of variable-tag pairs. In some cases, a variable can be tagged with multiple different tags. The sum of all pairs in this column is 17,063, while the number of variables paired with 1 or more tags is 16,671.
c Initial tagging was done by study data experts; other studies in this table were tagged by analysts at the TOPMed Data Coordinating Center.
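The distinction in footnote b between variable-tag pairs and tagged variables, and the arithmetic behind the table, can be checked with a short script. In this sketch the per-study counts are copied from the table above (third and fourth numeric columns plus the reported proportion); the variable identifiers and tags in the toy example are made up.

```python
# Toy example: one variable with two tags yields two pairs but one tagged variable.
pairs = [("var_001", "height"), ("var_001", "bmi"), ("var_002", "weight")]
n_pairs = len(pairs)
n_tagged_variables = len({variable for variable, _ in pairs})

# (study variables, variable-tag pairs, reported proportion), from the table above.
table = {
    "Amish": (53, 40, 0.75), "ARIC": (14430, 1713, 0.12),
    "CARDIA": (9036, 1608, 0.18), "Cleveland Family Study": (2325, 371, 0.16),
    "Cardiovascular Health Study": (14657, 2175, 0.15),
    "COPDGene": (332, 103, 0.31), "CRA": (15, 13, 0.87),
    "Framingham Heart Study": (61195, 6579, 0.11), "GENOA": (1072, 441, 0.41),
    "GOLDN": (107, 9, 0.08), "HCHS/SOL": (274, 132, 0.48),
    "Heart and Vascular Health Study": (23, 20, 0.87),
    "Jackson Heart Study": (4084, 745, 0.18), "Mayo VTE": (41, 17, 0.41),
    "MESA": (22044, 1943, 0.09), "Samoan Adiposity Study": (167, 48, 0.29),
    "Women's Health Initiative": (6117, 1106, 0.18),
}

# Sum of the pair counts across all 17 studies.
total_pairs = sum(tagged for _, tagged, _ in table.values())

# Studies whose reported proportion differs from pairs / variables after rounding.
mismatches = [study for study, (n_vars, tagged, reported) in table.items()
              if round(tagged / n_vars, 2) != reported]
```

The summed pair counts reproduce the total of 17,063 given in footnote b, and each reported proportion equals the pair count divided by the number of study variables, rounded to two decimal places.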
Figure 4. Lessons learned from phenotype harmonization in the National Heart, Lung, and Blood Institute Trans-Omics for Precision Medicine (TOPMed) program.