Lin Zhang1, Rui Mao1, Chung Tai Lau2, Wai Chak Chung2, Jacky C P Chan3, Feng Liang2, Chenchen Zhao4, Xuan Zhang5,6, Zhaoxiang Bian7,8.
Abstract
Ulcerative colitis (UC) is a chronic, relapsing inflammatory bowel disease whose incidence and prevalence are increasing worldwide. The diagnosis of UC relies mainly on clinical symptoms and laboratory examinations. Because previous studies have reported an association between gene expression signatures and disease severity, we aimed to assess whether genes can help to diagnose UC and how they correlate with immune regulation. Ten eligible microarrays (387 UC patients and 139 healthy subjects) were included in this study: six (GSE48634, GSE6731, GSE114527, GSE13367, GSE36807, and GSE3629) in the training group and four (GSE53306, GSE87473, GSE74265, and GSE96665) in the testing group. After data processing, we found 87 differentially expressed genes (DEGs). Six machine learning methods (support vector machine, least absolute shrinkage and selection operator, random forest, gradient boosting machine, principal component analysis, and neural network) were then used to identify potentially useful genes. The synthetic minority oversampling technique (SMOTE) was used to adjust for any imbalance in sample size between the two groups. Consequently, six genes were selected for model establishment. Based on receiver operating characteristic analysis, two genes, OLFM4 and C4BPB, were finally identified. The average area under the curve for each of these two genes was higher than 0.8 in both the original and the SMOTE-adjusted datasets. In addition, these two genes correlated significantly with six immune cell types, namely Macrophages M1, Macrophages M2, Mast cells activated, Mast cells resting, Monocytes, and NK cells activated (P < 0.05). OLFM4 and C4BPB may help to identify patients with UC. Further verification studies are warranted.
Year: 2022 PMID: 35705632 PMCID: PMC9200771 DOI: 10.1038/s41598-022-14048-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
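The SMOTE step mentioned in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: the function name `smote_oversample`, the neighbour count `k`, and the toy expression values are all assumptions made for illustration. The idea is that each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import random

def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [s for s in minority if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy expression values for two hypothetical genes (not data from the study).
healthy = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [1.1, 2.1]]
uc = [[3.0, 0.5], [3.2, 0.7], [2.8, 0.4], [3.1, 0.6],
      [2.9, 0.8], [3.3, 0.5], [3.0, 0.9]]

# Oversample the smaller (healthy) group until both groups have equal size.
extra = smote_oversample(healthy, n_new=len(uc) - len(healthy))
balanced_healthy = healthy + extra
print(len(balanced_healthy), len(uc))
```

Because every synthetic point is an interpolation, it stays inside the range spanned by the real minority samples; that is why SMOTE balances the classes without simply duplicating rows.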
Figure 1. The workflow of the analysis steps.
Figure 2. The 87 DEGs distributed in both the UC group and the healthy group (A. Heatmap; B. Volcano diagram). Note: R software (version 4.1.0; https://www.r-project.org/) was used to create the plots, with the R package pheatmap (version 1.0.12; https://cran.r-project.org/web/packages/pheatmap/index.html) for the heatmap and ggplot2 (version 3.35; https://cran.r-project.org/web/packages/ggplot2/index.html) for the volcano plot.
Figure 3. Functional enrichment analysis. (A. The top 10 most significantly enriched GO terms; B. The top 30 most significantly enriched DO terms; C. The 17 significantly enriched KEGG pathways; D. The top 5 GSEA-KEGG enrichment in healthy group; E. The top 5 GSEA-KEGG enrichment in UC group).
Figure 4. Comparison of DEGs across the six machine-learning methods. (A. LASSO for 27 prognostic DEGs; B. SVM for 16 prognostic DEGs; C. PCA classification in 2 dimensions; D. PCA classification in 3 dimensions; E. The error rate of RF with 100 trees; F. The top 10 weighted genes in GBM).
Error rates of the different machine-learning methods.
| Machine-learning method | Error rate (%) |
|---|---|
| SVM | |
| RF | 0.65 |
| GBM | 0.98 |
| NN | 0.17 |
SVM, Support Vector Machine; RF, Random forest; GBM, Gradient boosting machine; NN, Neural network.
Bold value indicates the lowest value.
The top 20 weighted genes selected by each machine-learning method.
| LASSO gene | LASSO weight | PCA gene | PCA weight | GBM gene | GBM weight | RF gene | RF weight | NN gene | NN weight | SVM gene | SVM weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| S100P | 0.52 | C4BPA | 1.85 | OLFM4 | 1 | OLFM4 | 3.89 | TUBB2A | −2.39 | OLFM4 | 8.87 |
| RARRES3 | 0.42 | RIPK2 | 1.85 | HLA-DMA | 0.23 | C4BPB | 3.7 | TIMP1 | 2.25 | C4BPB | 3.37 |
| IFITM3 | −0.31 | PYY | 1.85 | C4BPB | 0.2 | ISG20 | 1.63 | CCL19 | −2.25 | NMI | 2.13 |
| CD19 | 0.29 | REG3A | 1.85 | NMI | 0.18 | DMBT1 | 2.43 | DEFA6 | −2.01 | HLA-DMA | 1.96 |
| CHAD | −0.28 | DUSP10 | 1.85 | CLDN8 | 0.13 | CXCL1 | 1.08 | CD55 | 1.87 | VNN1 | 1.78 |
| NMI | 0.24 | CNTNAP2 | 1.84 | VNN1 | 0.12 | CLDN8 | 2.46 | CXCL9 | 1.77 | DEFA5 | 1.78 |
| PLA2G2A | −0.24 | ATP2C2 | 1.84 | HYOU1 | 0.11 | LCN2 | 0.69 | IFITM1 | 1.7 | S100P | 1.77 |
| C4BPB | 0.19 | LRRN2 | 1.84 | DEFA5 | 0.1 | PRDX1 | 2.67 | PCBP1 | 1.65 | PRDX1 | 1.65 |
| HYOU1 | 0.19 | CHI3L2 | 1.83 | PRDX1 | 0.1 | GNA15 | 1.01 | AQP8 | 1.64 | CLDN8 | 1.55 |
| VNN1 | 0.18 | TRIM22 | 1.83 | NPTX2 | 0.08 | S100P | 2.44 | FTL | 1.48 | REG3A | 1.38 |
| NPTX2 | 0.18 | ALOX5 | 1.83 | S100P | 0.08 | IFITM1 | 2.03 | ASS1 | 1.4 | IRF9 | 1.34 |
| DMBT1 | 0.17 | OAZ1 | 1.83 | RARRES3 | 0.08 | NMI | 3.47 | HSPA5 | 1.34 | HYOU1 | 1.32 |
| OLFM4 | 0.15 | ZNF189 | 1.82 | CXCL1 | 0.07 | RARRES3 | 1.96 | ADM | −1.34 | CXCL1 | 1.2 |
| CSF2RB | 0.15 | STAT3 | 1.82 | DEFA6 | 0.05 | MAP2K1 | 0.93 | C4BPB | 1.33 | NPTX2 | 1.14 |
| COL6A3 | −0.12 | ZNF143 | 1.82 | REG3A | 0.05 | LYN | 1.54 | ISG20 | 1.31 | CD55 | 1.1 |
| PCK1 | −0.11 | GPR161 | 1.82 | CHAD | 0.05 | STAT3 | 1.35 | SDCBP | 1.25 | RARRES3 | 0.94 |
| SERPINA3 | −0.08 | SWAP70 | 1.82 | VOPP1 | 0.04 | TIMP1 | 1.23 | REG1B | −1.19 | ISG20 | 0.86 |
| CLDN8 | −0.05 | ME1 | 1.82 | CD19 | 0.04 | CD55 | 1.45 | TRIM22 | −1.17 | CD19 | 0.86 |
| COL4A2 | 0.04 | BIRC3 | 1.82 | PCK1 | 0.04 | HLA-DMA | 2.11 | SERPINA3 | 1.09 | HLA-DRA | 0.85 |
| SPINK4 | −0.04 | ADRA2A | 1.81 | HLA-DRA | 0.04 | S100A8 | 0.64 | CTSK | 1.07 | SELL | 0.81 |
LASSO, Least Absolute Shrinkage and Selection Operator; PCA, Principal component analysis; GBM, Gradient boosting machine; RF, Random forest; NN, Neural network; SVM, Support Vector Machine.
Different machine-learning methods produce weights on different scales. Because LASSO and NN can assign negative weights, the weighted genes for these methods are sorted by absolute value.
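The note on absolute-value sorting, and the overlap between the columns of the table, can be sketched in a few lines of Python. The gene names and the five LASSO weights are copied from the table above; counting how many methods list a gene in their top 20 is only one illustrative way to read the table and is not claimed to be the authors' selection rule.

```python
from collections import Counter

# First five LASSO rows from the table above (gene, signed weight).
lasso_top = [("S100P", 0.52), ("RARRES3", 0.42), ("IFITM3", -0.31),
             ("CD19", 0.29), ("CHAD", -0.28)]

# Sorting on abs(weight) keeps strongly negative genes near the top;
# sorting on the signed weight would push them to the bottom.
by_abs = sorted(lasso_top, key=lambda gw: abs(gw[1]), reverse=True)
by_signed = sorted(lasso_top, key=lambda gw: gw[1], reverse=True)
print([g for g, _ in by_abs])
print([g for g, _ in by_signed])

# Top-20 gene names per method, as listed in the table above.
top20 = {
    "LASSO": "S100P RARRES3 IFITM3 CD19 CHAD NMI PLA2G2A C4BPB HYOU1 VNN1 "
             "NPTX2 DMBT1 OLFM4 CSF2RB COL6A3 PCK1 SERPINA3 CLDN8 COL4A2 SPINK4",
    "PCA": "C4BPA RIPK2 PYY REG3A DUSP10 CNTNAP2 ATP2C2 LRRN2 CHI3L2 TRIM22 "
           "ALOX5 OAZ1 ZNF189 STAT3 ZNF143 GPR161 SWAP70 ME1 BIRC3 ADRA2A",
    "GBM": "OLFM4 HLA-DMA C4BPB NMI CLDN8 VNN1 HYOU1 DEFA5 PRDX1 NPTX2 "
           "S100P RARRES3 CXCL1 DEFA6 REG3A CHAD VOPP1 CD19 PCK1 HLA-DRA",
    "RF": "OLFM4 C4BPB ISG20 DMBT1 CXCL1 CLDN8 LCN2 PRDX1 GNA15 S100P "
          "IFITM1 NMI RARRES3 MAP2K1 LYN STAT3 TIMP1 CD55 HLA-DMA S100A8",
    "NN": "TUBB2A TIMP1 CCL19 DEFA6 CD55 CXCL9 IFITM1 PCBP1 AQP8 FTL "
          "ASS1 HSPA5 ADM C4BPB ISG20 SDCBP REG1B TRIM22 SERPINA3 CTSK",
    "SVM": "OLFM4 C4BPB NMI HLA-DMA VNN1 DEFA5 S100P PRDX1 CLDN8 REG3A "
           "IRF9 HYOU1 CXCL1 NPTX2 CD55 RARRES3 ISG20 CD19 HLA-DRA SELL",
}

# Count, for each gene, how many of the six methods list it in their top 20.
method_counts = Counter(g for genes in top20.values() for g in genes.split())
print(method_counts["C4BPB"], method_counts["OLFM4"])
```

On the table as listed, C4BPB appears in the top 20 of five methods and OLFM4 in four, which is consistent with both genes being carried forward, although the record does not state the exact selection criterion.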
Figure 5. Comparison of results for the 6 DEGs in the testing groups.
Figure 6. The ROC curves of OLFM4 and C4BPB between the two groups. (A1. The ROC curve of OLFM4 in training group; A2. The ROC curve of OLFM4 in SMOTE-training group; B1. The ROC curve of OLFM4 in the testing group; B2. The ROC curve of OLFM4 in SMOTE-testing group; C1. The ROC curve of OLFM4 in GSE87473 group; C2. The ROC curve of OLFM4 in SMOTE-GSE87473 group; D1. The ROC curve of C4BPB in training group; D2. The ROC curve of C4BPB in SMOTE-training group; E1. The ROC curve of C4BPB in the testing group; E2. The ROC curve of C4BPB in the SMOTE-testing group; F1. The ROC curve of C4BPB in the GSE87473 group; F2. The ROC curve of C4BPB in the SMOTE-GSE87473 group).
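For readers unfamiliar with how a single gene yields an ROC curve and an AUC, the following minimal sketch may help. It uses the Mann-Whitney reading of AUC: the probability that a randomly chosen case has a higher expression value than a randomly chosen control. The function name `auc_single_marker` and the expression values are hypothetical and are not taken from the study.

```python
def auc_single_marker(cases, controls):
    """AUC as the probability that a random case scores above a random
    control, counting ties as one half (Mann-Whitney formulation)."""
    wins = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(cases) * len(controls))

# Hypothetical expression values for one gene; not data from the study.
uc_expr = [7.9, 8.4, 6.1, 9.0, 7.2]
healthy_expr = [5.8, 6.5, 6.0, 7.0]

auc = auc_single_marker(uc_expr, healthy_expr)
print(round(auc, 2))  # 18 of the 20 case-control pairs are ordered correctly -> 0.9
```

An AUC of 0.5 corresponds to a marker no better than chance and 1.0 to perfect separation, which is the scale on which the abstract's threshold of 0.8 should be read.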
Figure 7. The immune correlation landscape for the ten microarrays. (A. Barplot for the 22 immune cells; B. Violin plot comparing the two groups across 7 immune cells).
Figure 8. Lollipop plots of the immune correlations of C4BPB and OLFM4. (A. The immune correlation in C4BPB; B. The immune correlation in OLFM4).