Ming Wai Yeung, Pim van der Harst, Niek Verweij.
Abstract
The complexity and volume of data associated with population-based cohorts mean that generating health-related outcomes can be challenging. Using one such cohort, the UK Biobank (a major open-access resource), we present a protocol to efficiently integrate the main dataset and record-level data files and to harmonize and process the data using an R package named "ukbpheno". We describe how to use the package to generate binary phenotypes in a standardized and machine-actionable manner. For complete details on the use and execution of this protocol, please refer to Yeung et al. (2022).
Keywords: Bioinformatics; Health Sciences; Systems biology
Year: 2022 PMID: 35769930 PMCID: PMC9234069 DOI: 10.1016/j.xpro.2022.101471
Source DB: PubMed Journal: STAR Protoc ISSN: 2666-1667
Figure 1. Capture of health-related outcome from multiple data sources in UK Biobank using ukbpheno
Figure 2. Overview of the workflow
Definitions of the cardiometabolic traits to be generated
| Variable | ICD-9 | ICD-10 | OPCS-4 | Self-reported fields | READ2 | CTV3 |
|---|---|---|---|---|---|---|
| Atrial fibrillation or flutter | 4273 | I48 | K621, K622, K623 | 20002(1471, 1483) | G5730, G573., G5731, G573z, G5733, G5732, 793M0, 793M1, 793M3, 79345, 79348 | XaEga, G573., G5730, G5731, G573z, X202R, X202S, Xa2E8, XaMmd, XaMmc, XaMrB, XaLgF, XaMrA |
| Coronary artery disease | 414, 410, 412 | I24, I25, Z955, I21, I22, I23, I252, Z951, Z955 | K40, K41, K42, K43, K44, K45, K46, K49, K50, K75 | 20002(1075), 20004(1070, 1095, 1523),6150(1) | G34y1, G34.., G3..., ZV45L, G34z0, ZV458, 793G., 79280, 79281, 79282, 7928y, 7928z, 79292, 7929y, 7929z, 792.., 7A547, 793Gy, 793Gz, 79283 | G34y1, XE0WG, XE2uV, XaC1g, XaG1Q, XaQiY, ZV458, G34.., X200b, Xa1dP, XaLgU, 79280, 79281, 79282, 7928y, 7928z, 79292, 7929y, 7929z, X00tT, X013N, XE0Em, XaLgZ, XaLga, XaMKE |
| Hypertrophic cardiomyopathy | 4251 | I421, I422 | | 20002(1588) | G551. | G551., X201Y |
| Heart failure | 428 | I50, I110, I130, I132, Z941, T862 | K02 | 20002(1076), 20004(1098) | 14S3., G2101, G2111, G21z1, G232., G234., G58.., G5800, G5802, G5803, G5810, G582., SP084, SP111, G581., 1O1.., G583., ZV421 | 14S3., G2101, G2111, G21z1, G232., G234., G58.., G5800, G5802, G5803, G5810, G582., SP084, X202k, X202v, X202w, XE2QG, XaIpn, XaWyi, ZV421, X00y3 |
| Type 2 diabetes | 25000, 25002, 25010, 25012, 25020, 25022, 25030, 25032, 25040, 25042, 25050, 25052, 25060, 25062, 25070, 25072, 25080, 25082, 25090, 25092 | E11 | | 20002(1223) | C1001, C10F9, C10F., C1093, C1094, C1095, C1097, C10y1, C10z1, C10FJ, C109J, C1099, C10FD, C109D, C10FR, C10FK | C1001, C1011, C1031, C1071, C1021, C1072, C1093, C1094, C1095, C1097, C10y1, X40J5, X40J6, XaELQ, XaFWI, XaIrf, XaKyX, X40J5 |
| Hypertension | 402, 403, 404, 405, 401 | I11, I12, I13, I15, O10, I10 | | 20002(1065, 1072), 6150(4), 6177(2), 6153(2), 20003 | G21.., G220., G221., G23.., G24.., G240., G240z, G241., G241z, G24z., L12.., G22.., G22z., G2z.., G2y.., G21z0, G20.., G20z., x01QX | G21.., G220., G221., G23.., G24.., G240., G240z, G241., G241z, G24z., L12.., XE0Uf, XE0Ug, G2z.., G2y.., Xa0lt, Xa3fQ, Xa0kX, XE0Uc, x01QX |
| Hyperlipidemia | 272 | E78 | | 20002(1473), 20003 | C32.., Cyu8D, Cyu8E | X40Uu, XE11R, C32.., C32z., X40Wx |
Variable definitions constructed using ICD-9, ICD-10, OPCS-4, READ2 and CTV3 codes as well as self-report data fields with disease- or procedure-specific codes between brackets are shown.
Abbreviations: CTV3, Clinical Terms Version 3; ICD, International Classification of Diseases; OPCS, Office of Population Censuses and Surveys Classification of Interventions and Procedures.
For type 2 diabetes, cases with type 1 diabetes-specific codes are excluded, and controls with any diabetes-related codes are excluded.
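The "Self-reported fields" column pairs a UK Biobank field ID with the disease- or procedure-specific codes in brackets. As a rough illustration of how such a cell can be read programmatically, the sketch below splits one cell into field IDs and their codes. It is written in Python purely for illustration (the protocol itself runs in R), and `parse_self_report` is a hypothetical helper, not a ukbpheno function.

```python
import re

def parse_self_report(cell):
    """Map each field ID to the list of codes given in brackets (empty if none)."""
    parsed = {}
    # A field ID is a run of digits, optionally followed by "(code, code, ...)".
    for field, codes in re.findall(r"(\d+)\s*(?:\(([^)]*)\))?", cell):
        parsed[field] = [c.strip() for c in codes.split(",") if c.strip()]
    return parsed

# Cell taken from the coronary artery disease row of the table above.
cad = parse_self_report("20002(1075), 20004(1070, 1095, 1523),6150(1)")
print(cad)
```

A field listed without brackets, such as `20003` in the hypertension row, comes back with an empty code list in this sketch; how the package itself treats such fields is not shown here.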
Figure 3. Basic syntax for filling in the definition tables
Figure 4. The interface of the shiny app “ukb code explorer”
Syntax accepted by the package to describe conditions
| Condition symbol | Meaning |
|---|---|
| = | Equal to (value) |
| != | Not equal to |
| < | Smaller than |
| <= OR ≤ | Smaller than or equal to |
| > | Larger than |
| >= OR ≥ | Larger than or equal to |
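To make the condition symbols concrete, the following sketch turns a condition string into a test on a numeric value. It is an illustration in Python under stated assumptions: thresholds are assumed to be plain numbers, and `parse_condition` is a hypothetical name rather than part of ukbpheno, whose own parsing is not described here.

```python
import operator
import re

# Symbols from the table above, mapped to Python comparison functions.
OPERATORS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le, "≤": operator.le,
    ">": operator.gt, ">=": operator.ge, "≥": operator.ge,
}
# Two-character symbols come first so "<=" is not read as "<" followed by "=".
CONDITION = re.compile(r"^\s*(<=|>=|!=|≤|≥|<|>|=)\s*(-?\d+(?:\.\d+)?)\s*$")

def parse_condition(text):
    """Return a function that tests a numeric value against the condition."""
    match = CONDITION.match(text)
    if match is None:
        raise ValueError(f"unrecognized condition: {text!r}")
    symbol, threshold = match.group(1), float(match.group(2))
    return lambda value: OPERATORS[symbol](value, threshold)

at_least_48 = parse_condition(">=48")
print(at_least_48(50.2), at_least_48(47.9))
```

Listing the two-character symbols before the single-character ones in the pattern is the main design point: it keeps `<=` and `>=` from being split into two tokens.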
Example usage of composite phenotype columns
| Exclude_from_cases | Study_population | Exclude_from_controls | Include_definitions |
|---|---|---|---|
| DmT1 | | RxDm | RxDmOr |
In this example usage, cases with records of “DmT1” (type 1 diabetes) are excluded; controls with records indicating “RxDm” (use of antidiabetic medication) are excluded; and participants with records indicating “RxDmOr” (use of oral antidiabetic medication) are considered cases for this composite phenotype.
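The logic of the composite columns can be sketched with plain set operations. The Python example below is only an illustration of the rules described above, not ukbpheno code: `composite_groups` and the participant IDs are hypothetical, the Study_population column is left out because its semantics are not described here, and excluded participants are assumed to belong to neither group.

```python
def composite_groups(all_ids, base_cases, records, rule):
    """Split participants into cases, controls and the two excluded sets."""
    def union(names):
        return set().union(*(records.get(name, set()) for name in names))

    # Participants matching an included definition count as candidate cases.
    candidate_cases = set(base_cases) | union(rule["Include_definitions"])
    excluded_cases = candidate_cases & union(rule["Exclude_from_cases"])
    cases = candidate_cases - excluded_cases

    candidate_controls = set(all_ids) - candidate_cases
    excluded_controls = candidate_controls & union(rule["Exclude_from_controls"])
    controls = candidate_controls - excluded_controls
    return cases, controls, excluded_cases, excluded_controls

# Toy data: participant IDs with a record for each named definition.
records = {"DmT1": {2}, "RxDm": {3, 4}, "RxDmOr": {3}}
rule = {
    "Exclude_from_cases": ["DmT1"],
    "Exclude_from_controls": ["RxDm"],
    "Include_definitions": ["RxDmOr"],
}
groups = composite_groups({1, 2, 3, 4, 5, 6}, {1, 2}, records, rule)
print(groups)
```

In this toy run, participant 3 becomes a case through “RxDmOr”, participant 2 is dropped from the cases because of “DmT1”, and participant 4 is dropped from the controls because of “RxDm”.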
Figure 5. Screenshot of the lst.harmonized.data object
Figure 6. Screenshot of the harmonized records
Figure 7. Screenshot of the result obtained from the get_cases_controls() function
Column description of the case-control summary table
| Column name | Information |
|---|---|
| identifier | Unique identifier of the participant |
| reference_date | Reference date supplied by the user |
| count | Number of episodes/events for that participant related to the target phenotype (diagnosis) |
| sum.epidur | Total number of days hospitalized due to the diagnosis according to secondary care data |
| median.epidur | The median days of hospitalization according to secondary care data |
| max.epidur | Number of days of the longest hospital stay due to the diagnosis |
| survival_days | Days of survival from the reference date if the participant had died of the diagnosis as evidenced from the death registry |
| Death_primary | Indicates if the participant has died with the diagnosis as primary cause |
| Death_any | Indicates if the participant has died with the diagnosis as either primary or secondary cause |
| Hx_days | Duration of the diagnosis in days as of the reference date |
| Fu_days | Follow-up time from the reference date until the participant receives the diagnosis |
| Hx | Indicates whether the participant had the diagnosis before the reference date (prevalent case) |
| Fu | Indicates whether the participant received the diagnosis after the reference date (incident case) |
| Ref | Indicates whether the diagnosis was made within a window around the reference date (default: 0 days) |
| first_diagnosis_days | Difference between the first occurrence and the reference date (including both Hx and Fu) |
| Any | Indicates whether the participant has the diagnosis (including both Hx and Fu) |
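How the prevalent and incident flags relate to the reference date can be sketched as follows. This Python example is a simplified reading of the column descriptions, not the package's implementation: `summarize_timing` is a hypothetical helper, the flags are derived from the first event only, and the sign of `first_diagnosis_days` (negative before the reference date) is an assumption.

```python
from datetime import date

def summarize_timing(event_dates, reference_date, window_days=0):
    """Derive Any, Hx, Fu, Ref and first_diagnosis_days for one participant."""
    if not event_dates:
        return {"Any": False, "Hx": False, "Fu": False, "Ref": False,
                "first_diagnosis_days": None}
    first = min(event_dates)
    # Assumed sign convention: negative means before the reference date.
    offset = (first - reference_date).days
    return {
        "Any": True,
        "Hx": offset < -window_days,        # prevalent case
        "Fu": offset > window_days,         # incident case
        "Ref": abs(offset) <= window_days,  # diagnosed within the window
        "first_diagnosis_days": offset,
    }

baseline = date(2008, 6, 1)
prevalent = summarize_timing([date(2004, 5, 20), date(2010, 1, 1)], baseline)
incident = summarize_timing([date(2008, 6, 11)], baseline)
print(prevalent, incident)
```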
Column description of the case event table
| Column name | Information |
|---|---|
| .id | Source of this event |
| identifier | Unique identifier of the participant |
| code | Event code |
| eventdate | Event date |
| event | Indicates whether this episode has a true event date (1 = event date from linked data or a self-reported operation; 2 = self-reported event date, except for operations; 0 = not a true event date) |
| epidur | Days hospitalized in this episode documented in secondary care data |
| classification | Refers to the classification system of the code |
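The event table holds one row per episode, while the summary table reports per-participant totals such as `count`, `sum.epidur`, `median.epidur` and `max.epidur`. The Python sketch below shows one plausible way to collapse event rows into those columns. `summarize_episodes`, the source labels and the classification labels are illustrative assumptions, and rows without a hospital stay are assumed to carry no `epidur` value.

```python
from statistics import median

def summarize_episodes(events):
    """Collapse one participant's event rows into count and epidur statistics."""
    durations = [e["epidur"] for e in events if e.get("epidur") is not None]
    return {
        "count": len(events),
        "sum.epidur": sum(durations) if durations else None,
        "median.epidur": median(durations) if durations else None,
        "max.epidur": max(durations) if durations else None,
    }

# Toy rows for one participant; source and classification labels are made up.
events = [
    {".id": "secondary_care", "code": "E119", "epidur": 3, "classification": "ICD10"},
    {".id": "secondary_care", "code": "E119", "epidur": 1, "classification": "ICD10"},
    {".id": "primary_care", "code": "C10F.", "epidur": None, "classification": "READ2"},
]
summary = summarize_episodes(events)
print(summary)
```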
Figure 8. Disease timeline of type 2 diabetes by different data sources
Figure 9. UpSet plot of type 2 diabetes at baseline showing the overlaps between different data sources
Figure 10. Frequency plots of type 2 diabetes diagnosis codes from secondary care
Left: y-axis in linear scale; Right: y-axis in logarithmic scale.
Figure 11. Barplot of type 2 diabetes diagnosis code count from secondary care per individual
Figure 12. Diagnosis timeline of a hypothetical participant
Inclusion and exclusion criteria for the phenotype type 2 diabetes
| Exclude_from_cases | Study_population | Exclude_from_controls | Include_definitions |
|---|---|---|---|
| DmT1 | DmRx |
Censoring dates of different data sources
| Data (provider) | Censoring date |
|---|---|
| Primary care - England (TPP) | 31 May 2016 |
| Primary care - England (Vision) | 31 May 2017 |
| Primary care - Scotland | 31 Mar 2017 |
| Primary care - Wales | 31 Aug 2017 |
| Secondary care - England | 31 Mar 2021 |
| Secondary care - Scotland | 31 Mar 2021 |
| Secondary care - Wales | 28 Feb 2018 |
| Cancer - England/Wales | 31 Jul 2019 |
| Cancer - Scotland | 31 Oct 2015 |
| Death - England/Wales | 28 Feb 2021 |
| Death - Scotland | 28 Feb 2021 |
The censoring dates for the current release can be found in Showcase (https://biobank.ndph.ox.ac.uk/showcase/exinfo.cgi?src=Data_providers_and_dates).
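Censoring dates matter mainly for time-to-event analyses, where follow-up has to stop at the last date a data source is complete. The Python sketch below shows one common way to apply such a date. It is an illustration only: `follow_up` is a hypothetical helper, death and loss to follow-up are ignored, and the source does not state that ukbpheno applies censoring in exactly this way.

```python
from datetime import date

# Censoring dates copied from the table above for two secondary care providers.
CENSORING = {
    "Secondary care - England": date(2021, 3, 31),
    "Secondary care - Wales": date(2018, 2, 28),
}

def follow_up(reference_date, event_date, provider):
    """Return (days of follow-up, event observed?) with administrative censoring."""
    end = CENSORING[provider]
    if event_date is not None and event_date <= end:
        return (event_date - reference_date).days, True
    # No event, or an event after the censoring date: follow-up stops at censoring.
    return (end - reference_date).days, False

observed = follow_up(date(2010, 1, 1), date(2010, 1, 31), "Secondary care - England")
censored = follow_up(date(2018, 1, 1), None, "Secondary care - Wales")
print(observed, censored)
```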
Clinical characteristics table stratified by type 2 diabetes at baseline visit
| Characteristic | Case inclusion | Case exclusion | Control inclusion | Control exclusion | p-value |
|---|---|---|---|---|---|
| n | 13286 | 4029 | 478864 | 6279 | |
| Age (mean (SD)) | 60.32 (6.75) | 59.37 (7.33) | 56.40 (8.10) | 56.32 (8.20) | <0.001 |
| BMI (mean (SD)) | 31.81 (5.90) | 31.13 (6.06) | 27.26 (4.67) | 29.12 (5.51) | <0.001 |
| Glucose (median [IQR]) | 6.42 [5.28, 8.50] | 7.64 [5.46, 11.30] | 4.91 [4.59, 5.27] | 5.36 [4.79, 6.70] | <0.001 |
| Glycated hemoglobin, HbA1c (median [IQR]) | 49.50 [43.10, 57.60] | 58.20 [48.90, 69.10] | 35.10 [32.70, 37.50] | 40.70 [36.00, 51.00] | <0.001 |
| Years since type 2 diabetes diagnosis (median [IQR]) | 4.05 [1.82, 7.16] | NA [NA, NA] | -6.27 [-9.09, -3.31] | NA [NA, NA] | <0.001 |
| Sex male (%) | 8426 (63.4) | 2370 (58.8) | 215111 (44.9) | 3198 (50.9) | <0.001 |
| Family history of diabetes (%) | 6219 (46.8) | 1869 (46.4) | 103336 (21.6) | 2363 (37.6) | <0.001 |
| Family history of heart disease (%) | 6870 (51.7) | 2005 (49.8) | 217257 (45.4) | 2900 (46.2) | <0.001 |
| Family history of hypertension (%) | 6565 (49.4) | 1927 (47.8) | 234153 (48.9) | 3185 (50.7) | 0.009 |
| Hypertension (%) | 10406 (78.5) | 3120 (77.5) | 140044 (29.3) | 3275 (52.5) | <0.001 |
| Hyperlipidemia (%) | 11038 (83.4) | 3232 (80.4) | 82921 (17.4) | 2964 (48.3) | <0.001 |
| Atrial fibrillation (%) | 345 (2.6) | 84 (2.1) | 4511 (0.9) | 65 (1.0) | <0.001 |
| Hypertrophic cardiomyopathy (%) | 10 (0.1) | 3 (0.1) | 185 (0.0) | 3 (0.0) | 0.13 |
| Heart failure (%) | 380 (2.9) | 201 (5.0) | 1981 (0.4) | 63 (1.0) | <0.001 |
| On oral diabetes medication (%) | 9057 (69.3) | 1975 (49.4) | 3910 (0.8) | 1361 (22.3) | <0.001 |
| On insulin (%) | 1371 (10.4) | 2672 (67.0) | 481 (0.1) | 1243 (19.9) | <0.001 |
| Started insulin within 1 year of diagnosis (%) | | | | | <0.001 |
| No | 11642 (94.9) | 2072 (59.3) | 5988 (95.5) | 2280 (69.7) | |
| Yes | 508 (4.1) | 1401 (40.1) | 193 (3.1) | 932 (28.5) | |
| Do not know | 106 (0.9) | 21 (0.6) | 79 (1.3) | 42 (1.3) | |
| Prefer not to answer | 11 (0.1) | 3 (0.1) | 10 (0.2) | 17 (0.5) | |
Negative values of “Years since type 2 diabetes diagnosis” indicate diagnoses made after the baseline visit (incident cases).
Figure 13. Kaplan-Meier plot for new-onset heart failure stratified by type 2 diabetes status at baseline
The between-group difference in survival was assessed by log-rank test.
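For readers unfamiliar with the curve behind such a plot, the Kaplan-Meier (product-limit) estimator can be sketched in a few lines. The Python example below uses made-up follow-up times and is not the code used to produce the figure. It covers a single group only and does not implement the log-rank test.

```python
def kaplan_meier(times, observed):
    """Return [(time, survival probability)] at each distinct event time."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        events = sum(1 for ti, oi in zip(times, observed) if ti == t and oi)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if events:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Toy data: follow-up in years; False marks a censored participant.
times = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(times, observed)
print(curve)
```

The curve only steps down at times where an event was observed. Censored participants lower the number at risk without changing the survival estimate at that time.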
Figure 14. Result summary on case-control matching for type 2 diabetes status at baseline
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| R Project for Statistical Computing | R Core Team | RRID: SCR_001905 |
| RStudio | RStudio Team | RRID: SCR_000432 |
| ukbpheno | this paper | |
| ComplexUpset | The R Foundation | |
| data.table | The R Foundation | |
| devtools | The R Foundation | |
| dplyr | The R Foundation | |
| fasttime | The R Foundation | |
| ggdendro | The R Foundation | |
| ggforce | The R Foundation | |
| ggplot2 | The R Foundation | |
| ggpubr | The R Foundation | |
| ggrepel | The R Foundation | |
| glue | The R Foundation | |
| jsonlite | The R Foundation | |
| lubridate | The R Foundation | |
| magrittr | The R Foundation | |
| MatchIt | The R Foundation | |
| matrixStats | The R Foundation | |
| readxl | The R Foundation | |
| RColorBrewer | The R Foundation | |
| stringr | The R Foundation | |
| survminer | The R Foundation | |
| tableone | The R Foundation | |
| tictoc | The R Foundation | |
| XML | The R Foundation | |
| UK Biobank | UK Biobank | RRID: SCR_012815 |