Soo Kyung Park, Yea Bean Kim, Sangsoo Kim, Chil Woo Lee, Chang Hwan Choi, Sang-Bum Kang, Tae Oh Kim, Ki Bae Bang, Jaeyoung Chun, Jae Myung Cha, Jong Pil Im, Min Suk Kim, Kwang Sung Ahn, Seon-Young Kim, Dong Il Park.
Abstract
Almost half of patients show no primary or secondary response to monoclonal anti-tumor necrosis factor α (anti-TNF) antibody treatment for inflammatory bowel disease (IBD), yet the exact mechanisms of a non-durable response (NDR) remain inadequately defined. We used genome-wide genotype data to impute gene expression values, which served as features for training machine learning models to predict an NDR. Blood samples from various IBD cohorts were genotyped with the Korea Biobank Array. A total of 234 patients with Crohn's disease (CD) who received their first anti-TNF therapy were enrolled. The expression profiles of 6294 genes in whole-blood tissue, imputed from the genotype data, were combined with clinical parameters to train a logistic model to predict the NDR. The top two and top three most significant features were genetic (DPY19L3, GSTT1, and NUCB1), not clinical. Logistic regression of the NDR vs. DR status in our cohort on the imputed expression levels showed that the β coefficients were positive for DPY19L3 and GSTT1 and negative for NUCB1, concordant with the known eQTL information. Machine learning models using imputed gene expression features effectively predicted NDR to anti-TNF agents in patients with CD.
Keywords: Crohn’s disease; anti-TNF; genetic features; genotype
Year: 2022 PMID: 35743732 PMCID: PMC9224874 DOI: 10.3390/jpm12060947
Source DB: PubMed Journal: J Pers Med ISSN: 2075-4426
Figure 1. Overview of imputing gene expression from genotype. The PrediXcan models (round-cornered box) were downloaded from https://predictdb.org (accessed on 21 April 2021). They were developed for a number of tissues using the matched genotype and gene expression datasets compiled by the GTEx consortium. A given gene's expression value was linearly modeled from the genotypes of neighboring single-nucleotide polymorphisms (SNPs), which were selected through elastic net. By applying the linear regression weights to our genotype data (upper right), we were able to impute the gene expression in our tissue of interest (lower right).
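The imputation step described in this caption amounts to a weighted sum of genotypes. The following is a minimal sketch, assuming allele dosages coded 0 to 2 and a per-gene table of SNP weights; the function name, SNP identifiers, and weight values are hypothetical placeholders and are not taken from the PrediXcan models or from the study data.

```python
def impute_expression(weights, dosages):
    """Impute one gene's expression for one patient.

    weights: dict mapping SNP ID -> linear model weight (hypothetical values)
    dosages: dict mapping SNP ID -> allele dosage (0, 1, or 2)
    SNPs absent from the genotype data are skipped, so they contribute nothing.
    """
    return sum(w * dosages[snp] for snp, w in weights.items() if snp in dosages)


# Hypothetical weights for a single gene and one patient's genotypes.
gene_weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.1}
patient_dosages = {"rs_a": 2, "rs_b": 1}  # rs_c was not genotyped

predicted = impute_expression(gene_weights, patient_dosages)
print(predicted)  # 0.5 * 2 + (-0.25) * 1 = 0.75
```

Repeating this for every gene with a model in the chosen tissue gives one imputed expression profile per patient, which can then be used as a feature vector.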
Patients’ clinical characteristics.
| Characteristic | Non-Durable Response | Durable Response | p-Value |
|---|---|---|---|
| Age at diagnosis, year (SD) | 26.3 (9.2) | 28.2 (9.1) | 0.45 |
| Gender, male (%) | 8 (57.1%) | 162 (73.6%) | 0.21 |
| History of smoking, n (%) | 5 (35.7%) | 47 (21.4%) | 0.20 |
| Family history of IBD, n (%) | 0 (0%) | 7 (3.2%) | 1.0 |
| Disease duration, year (SD) | 9.1 (5.5) | 7.5 (3.8) | 0.16 |
| Disease location, n (%) | | | 0.07 |
| Ileal | 7 (50%) | 52 (23.6%) | |
| Colonic | 2 (14.3%) | 31 (14.1%) | |
| Ileocolonic | 5 (35.7%) | 137 (62.3%) | |
| Upper GI involvement, n (%) | 0 (0%) | 11 (5.0%) | 0.39 |
| Disease behavior, n (%) | | | 0.35 |
| Inflammatory | 9 (64.3%) | 163 (74.1%) | |
| Stricturing | 1 (7.1%) | 25 (11.4%) | |
| Penetrating | 4 (28.6%) | 32 (14.5%) | |
| Perianal disease, n (%) | 6 (42.9%) | 85 (38.6%) | 0.75 |
| Combination immunosuppressants, n (%) | 10 (71.4%) | 206 (93.6%) | <0.001 |
| Intestinal resection, n (%) | 6 (42.9%) | 61 (27.7%) | 0.22 |
IBD, inflammatory bowel disease; GI, gastrointestinal.
Figure 2. The overall model training scheme. The dataset was split into training and test sets in an 8:2 ratio, and this random split was repeated 100 times. For each split, model training consisted of Least Absolute Shrinkage and Selection Operator (LASSO) regression followed by recursive feature elimination (RFE). In the LASSO regression, the C parameter was scanned from 10 to 100 in multiples of 10 to find the value giving the highest 5-fold cross-validation (5-CV) area under the receiver operating characteristic curve (AUC-ROC). For the best C parameter, typically 10 to 1000 features survived. From these, a fixed number of features (from 1 to 10) was selected through RFE. For a given number of selected features, the 100 random data splits yielded 100 different models. Model performance was evaluated on the test set using logistic regression of the selected feature(s). For 5-CV, the training and test sets are shown in light yellow and blue, respectively.
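The training scheme in this caption can be sketched for a single random split as follows. This is an illustration under stated assumptions, not the authors' code: it assumes scikit-learn, treats L1-penalized `LogisticRegression` as the LASSO step (the caption's C parameter suggests an inverse regularization strength, but the software is not named), and runs on synthetic data. The function name `run_one_split` and the fallback for the case of too few surviving features are hypothetical additions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split


def run_one_split(X, y, n_select, c_grid, seed):
    """One 8:2 split: LASSO screening, RFE down to n_select features, test AUC-ROC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    def lasso(c):
        # Assumption: L1-penalized logistic regression stands in for "LASSO".
        return LogisticRegression(penalty="l1", solver="liblinear", C=c)

    # Scan C and keep the value with the highest mean 5-CV AUC-ROC.
    best_c = max(c_grid, key=lambda c: cross_val_score(
        lasso(c), X_tr, y_tr, cv=cv, scoring="roc_auc").mean())

    # Features with non-zero coefficients survive the LASSO step.
    coef = lasso(best_c).fit(X_tr, y_tr).coef_.ravel()
    survivors = np.flatnonzero(coef)
    if survivors.size < n_select:
        # Hypothetical fallback so the sketch still runs on sparse solutions.
        survivors = np.argsort(-np.abs(coef))[:n_select]

    # RFE reduces the survivors to a fixed number of features.
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_select)
    rfe.fit(X_tr[:, survivors], y_tr)
    selected = survivors[rfe.support_]

    # Evaluate plain logistic regression on the selected features.
    final = LogisticRegression(max_iter=1000).fit(X_tr[:, selected], y_tr)
    auc = roc_auc_score(y_te, final.predict_proba(X_te[:, selected])[:, 1])
    return selected, auc


# Synthetic data for illustration only: feature 0 carries the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0.8).astype(int)

selected, auc = run_one_split(X, y, n_select=2, c_grid=range(10, 101, 10), seed=0)
print(selected, round(auc, 3))
```

Calling `run_one_split` with 100 different seeds and tallying how often each feature set is returned would mirror the selection frequencies reported in the tables.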
Figure 3. Performance of the three tissue models used for gene expression imputation in classifying the non-durable response (NDR) vs. the durable response (DR). The model training by 5-fold cross-validation and feature selection via LASSO and recursive feature elimination was repeated 100 times (see main text for details). The single most significant gene was selected from each trial. The training and test performances, given as area under the receiver operating characteristic curve (AUC-ROC), are shown as boxplots. The AUC-ROC of the whole-blood model was significantly higher than that of the colon transverse model (p-test = 6.5 × 10⁻³⁸ and 3.7 × 10⁻¹² for the training and test sets, respectively) or the small intestine terminal ileum model (p-test = 1.8 × 10⁻⁴⁶ and 7.5 × 10⁻¹⁵ for the training and test sets, respectively).
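For readers less familiar with the AUC-ROC metric used in this figure, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison; the labels and scores are made-up illustrative values, not study data, and the function name is hypothetical.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative values only: 1 marks the positive class, 0 the negative class.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]

auc = auc_roc(labels, scores)
print(auc)  # 5 of 6 pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.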
Most frequently selected single feature for each tissue expression imputation model and its performance in classifying NDR vs. DR.
| Tissue Expression Model | Selected Feature | Selection Frequency | AUC-ROC (SD), Training 5-CV Set | AUC-ROC (SD), Test Set |
|---|---|---|---|---|
| Whole blood | | 79/100 | 0.845 (0.027) | 0.839 (0.070) |
| Colon transverse | | 40/100 | 0.728 (0.060) | 0.711 (0.150) |
| Small intestine terminal ileum | | 14/100 | 0.738 (0.050) | 0.720 (0.120) |
5-CV, five-fold cross-validation; AUC-ROC, area under the receiver operating characteristic curve; SD, standard deviation; NDR, non-durable response; DR, durable response.
The most frequently selected combination of two or three genes for the whole-blood expression imputation model and its performance for classifying NDR vs. DR.
| No. of Features | Selected Feature Set | Selection Frequency | AUC-ROC (SD), Training 5CV Set | AUC-ROC (SD), Test Set |
|---|---|---|---|---|
| 1 | | 79 | 0.845 (0.027) | 0.839 (0.070) |
| 2 | | 32 | 0.918 (0.023) | 0.919 (0.040) |
| 3 | | 9 | 0.935 (0.024) | 0.935 (0.040) |
5CV, five-fold cross-validation; AUC-ROC, area under the receiver operating characteristic curve; SD, standard deviation; NDR, non-durable response; DR, durable response.
Figure 4. Performance evaluation of the models built with varying sets of features for classifying the non-durable response (NDR) vs. the durable response (DR). The feature genes were the same as those listed in Table 3. A total of eight patient clinical parameters (see Methods Section), denoted as CRF in the figure, were also included in the model without further feature selection.
Univariate logistic regression analysis of the association between gene expression and NDR/DR status.
| Gene Name | Chr | p-Value | β Value |
|---|---|---|---|
| DPY19L3 | 19 | 0.000965 | 2.703 |
| GSTT1 | 22 | 0.00343 | 1.735 |
| NUCB1 | 19 | 0.00684 | −2.142 |
Chr, chromosome; NDR, non-durable response; DR, durable response.
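One way to read the β values in this table: in a univariate logistic regression, exponentiating β gives the odds ratio per one-unit increase in the predictor. The sketch below applies that to the three reported coefficients in table order. It assumes that NDR was coded as the positive class and that a unit refers to the scale of the imputed expression values; neither detail is stated in this excerpt.

```python
import math

# β values in the order they appear in the table above.
betas = [2.703, 1.735, -2.142]

# Odds ratio per one-unit increase in imputed expression: OR = exp(β).
# Assumption: NDR is the positive class, so OR > 1 means higher odds of NDR.
odds_ratios = [math.exp(b) for b in betas]

for beta, odds_ratio in zip(betas, odds_ratios):
    direction = "higher" if odds_ratio > 1 else "lower"
    print(f"beta = {beta:+.3f} -> OR = {odds_ratio:.3f} ({direction} odds of NDR)")
```

Under these assumptions the two positive coefficients correspond to odds ratios of roughly 14.9 and 5.7, and the negative coefficient to roughly 0.12.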