Soo-Kyung Park1,2, Sangsoo Kim3, Gi-Young Lee3, Sung-Yoon Kim3, Wan Kim3, Chil-Woo Lee2, Jong-Lyul Park4, Chang-Hwan Choi5, Sang-Bum Kang6, Tae-Oh Kim7, Ki-Bae Bang8, Jaeyoung Chun9, Jae-Myung Cha10, Jong-Pil Im11, Kwang-Sung Ahn12, Seon-Young Kim4, Dong-Il Park1,2.
Abstract
Crohn's disease (CD) and ulcerative colitis (UC) can be difficult to differentiate. As differential diagnosis is important in establishing a long-term treatment plan for patients, we aimed to develop a machine learning model for the differential diagnosis of the two diseases using RNA sequencing (RNA-seq) data from endoscopic biopsy tissue from patients with inflammatory bowel disease (n = 127; CD, 94; UC, 33). Biopsy samples were taken from inflammatory lesions or normal tissues. The RNA-seq dataset was processed by mapping reads to the human reference genome (GRCh38) and quantifying the corresponding gene models, which comprised 19,596 protein-coding genes. An unsupervised learning model showed distinct clusters of four classes: CD inflammatory, CD normal, UC inflammatory, and UC normal. A supervised learning model based on partial least squares discriminant analysis was able to distinguish inflammatory CD from inflammatory UC after pruning the strong classifiers of normal CD vs. normal UC. The error rate was minimal with two components, which comprised 20 and 50 genes for the first and second components, respectively. The corresponding overall error rate was 0.147. RNA-seq analysis of tissue and the two components revealed in this study may be helpful for distinguishing CD from UC.
Keywords: Crohn’s disease; RNA sequencing; inflammatory bowel disease; machine learning; ulcerative colitis
Year: 2021 PMID: 34943601 PMCID: PMC8700628 DOI: 10.3390/diagnostics11122365
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
Figure 1. Principal component analysis (PCA) plot of IBD samples. Normalized log-transformed expression values from edgeR were used in the PCA with up to 10 components. The first two components, which explained 48% of the total variance, are shown in the plot, drawn with the Bioconductor R package mixOmics.
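To make the "explained variance" figure in this caption concrete, the following is a minimal pure-Python sketch of how the share of total variance captured by the leading principal components can be computed. It is not the authors' pipeline (the caption states that mixOmics was used); the function names and the toy matrix `toy_expression` are hypothetical, and the values are invented for illustration only.

```python
import math
import random


def center_columns(rows):
    """Subtract each column (gene) mean from a samples x genes matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (genes x genes) of a centered matrix."""
    n = len(centered)
    p = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_eigenpair(matrix, iterations=500, seed=0):
    """Leading eigenvalue and unit eigenvector by power iteration."""
    rng = random.Random(seed)
    p = len(matrix)
    v = [rng.random() + 0.1 for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue estimate.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v


def explained_variance_ratios(rows, n_components=2):
    """Fraction of total variance explained by each leading component."""
    cov = covariance(center_columns(rows))
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))
    ratios = []
    for _ in range(n_components):
        lam, v = top_eigenpair(cov)
        ratios.append(lam / total)
        # Deflate: remove the component just found before the next pass.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    return ratios


# Hypothetical log-expression values: 4 samples (rows) x 3 genes (columns).
toy_expression = [
    [2.0, 1.9, 0.5],
    [4.0, 4.1, 0.4],
    [6.0, 6.2, 0.6],
    [8.0, 7.8, 0.5],
]
ratios = explained_variance_ratios(toy_expression, n_components=2)
print([round(r, 4) for r in ratios])
```

Summing the first two ratios gives the quantity reported in the caption for the real data (48%); the toy matrix is far more structured, so its first component dominates.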
Figure 2. KEGG pathways enriched in the lists of differentially expressed genes (DEGs). The four panels correspond to the DEGs of inflammatory CD vs. normal CD (labeled “CD”), inflammatory UC vs. normal UC (labeled “UC”), inflammatory CD vs. inflammatory UC (labeled “Inflamed”), and normal CD vs. normal UC (labeled “Normal”). The horizontal bars represent the -log10(FDR) values from the pathway enrichment analysis calculated with the DAVID web service (red for upregulated DEGs and blue for downregulated DEGs). Only pathways with FDR < 0.05 are shown. There were no significantly enriched pathways for the downregulated genes in either the “Inflamed” or the “Normal” panel. See Supplementary Table S3 for details of the DEGs assigned to each pathway.
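The bar lengths described in this caption can be illustrated with a short sketch that turns raw enrichment p-values into FDR values, keeps those below 0.05, and converts them to -log10(FDR). This assumes a Benjamini-Hochberg adjustment; DAVID computes its own statistics and may differ in detail. The pathway names and p-values are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_bars(pathway_pvalues, alpha=0.05):
    """Map each pathway with FDR < alpha to its -log10(FDR) bar length."""
    names = list(pathway_pvalues)
    fdr = benjamini_hochberg([pathway_pvalues[name] for name in names])
    bars = {name: -math.log10(q) for name, q in zip(names, fdr) if q < alpha}
    # Longest bar first, as in a typical enrichment plot.
    return dict(sorted(bars.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical raw p-values for four pathways (not taken from the paper).
example_pvalues = {
    "pathway_A": 0.0001,
    "pathway_B": 0.004,
    "pathway_C": 0.03,
    "pathway_D": 0.5,
}
bars = significant_bars(example_pvalues)
for name, height in bars.items():
    print(f"{name}: {height:.3f}")
```

A pathway with a larger -log10(FDR) value has stronger evidence of enrichment; pathways that fail the 0.05 cutoff (here `pathway_D`) are dropped from the plot entirely.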
Figure 3. The clustering diagrams of the final sparse partial least-squares discriminant analysis (sPLS-DA) model that classifies inflammatory CD vs. inflammatory UC. (a) The PLS projection onto the subspace spanned by the two components. The ellipses for each class represent the 95% confidence level of discrimination. (b) The heatmap hierarchical clustering of 70 genes (rows) and 49 samples (columns). The bottom 20 and top 30 genes belong to component 2, while the middle 20 genes belong to component 1. All plots were drawn with the Bioconductor R package mixOmics.
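The idea behind a sparse PLS-DA model with a fixed number of genes per component can be sketched as follows. This is a simplified two-class illustration in pure Python, not the mixOmics implementation used for the figure: the class label is coded as a single centered ±1 vector, sparsity is imposed by soft-thresholding the weight vector so that only `keep` genes remain, and the data are deflated between components. All names and the toy data are hypothetical; the study kept 20 and 50 genes, whereas the toy example keeps one gene per component.

```python
import math


def center(rows):
    """Subtract each column (gene) mean from a samples x genes matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def soft_threshold_keep(weights, keep):
    """Soft-threshold so at most `keep` weights stay non-zero, then rescale."""
    magnitudes = sorted((abs(w) for w in weights), reverse=True)
    cutoff = magnitudes[keep] if keep < len(weights) else 0.0
    shrunk = [math.copysign(max(abs(w) - cutoff, 0.0), w) for w in weights]
    norm = math.sqrt(sum(w * w for w in shrunk))
    return [w / norm for w in shrunk] if norm > 0 else shrunk


def sparse_pls_da(x_rows, labels, keep_per_component):
    """Return selected gene indices and sample scores for each component."""
    x = center(x_rows)
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    y_raw = [1.0 if lab == classes[1] else -1.0 for lab in labels]
    y_mean = sum(y_raw) / len(y_raw)
    y = [v - y_mean for v in y_raw]
    n_genes = len(x[0])
    selected, scores = [], []
    for keep in keep_per_component:
        # Weight of each gene: its covariance direction with the class code.
        w = [sum(row[j] * yi for row, yi in zip(x, y)) for j in range(n_genes)]
        w = soft_threshold_keep(w, keep)
        # Sample scores on this component.
        t = [sum(a * b for a, b in zip(row, w)) for row in x]
        tt = sum(v * v for v in t)
        if tt == 0.0:
            break
        # Loadings, then deflation so the next component finds new signal.
        p = [sum(row[j] * ti for row, ti in zip(x, t)) / tt
             for j in range(n_genes)]
        x = [[row[j] - ti * p[j] for j in range(n_genes)]
             for row, ti in zip(x, t)]
        selected.append([j for j, wj in enumerate(w) if wj != 0.0])
        scores.append(t)
    return selected, scores


# Hypothetical expression values: 6 samples (rows) x 4 genes (columns).
toy_x = [
    [5.0, 2.0, 1.0, 3.0],
    [5.2, 2.3, 1.2, 2.9],
    [4.8, 1.9, 0.9, 3.1],
    [1.0, 3.0, 1.1, 3.0],
    [1.2, 3.2, 1.0, 3.2],
    [0.8, 2.9, 1.1, 2.8],
]
toy_labels = ["CD", "CD", "CD", "UC", "UC", "UC"]
selected, scores = sparse_pls_da(toy_x, toy_labels, keep_per_component=[1, 1])
print("genes per component:", selected)
```

In practice the number of components and the number of genes kept per component are tuned by cross-validation against the classification error rate, which is how a configuration such as "two components with 20 and 50 genes" would be chosen; that tuning step is omitted here.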