Saraswati Koppad, Annappa Basava, Katrina Nash, Georgios V Gkoutos, Animesh Acharjee.
Abstract
BACKGROUND: Colorectal cancer (CRC) is the third leading cause of cancer-related death and the fourth most commonly diagnosed cancer worldwide. Due to a lack of diagnostic biomarkers and a limited understanding of the underlying molecular mechanisms, CRC's mortality rate continues to grow. CRC occurrence and progression are dynamic processes. The expression levels of specific molecules vary across the stages of CRC, which makes early detection and diagnosis challenging and the need for accurate and meaningful CRC biomarkers more pressing. Advances in high-throughput sequencing technologies have been used to explore novel gene expression, targeted treatments, and colon cancer pathogenesis. Such approaches are now routinely applied and produce large datasets whose analysis increasingly depends on machine learning (ML) algorithms, which have been demonstrated to be computationally efficient platforms for identifying variables across such high-dimensional datasets.
Keywords: biomarker identification; machine learning; prediction; transcriptomics; variable selection
Year: 2022 PMID: 35336739 PMCID: PMC8944988 DOI: 10.3390/biology11030365
Source DB: PubMed Journal: Biology (Basel) ISSN: 2079-7737
List of the datasets and platforms used in this study.
| GEO Dataset | Normal Samples | CRC Samples | Total Samples | Platform ID | References |
|---|---|---|---|---|---|
| GSE44861 | 55 | 56 | 111 | GPL3921 | [ |
| GSE20916 | 44 | 46 | 90 | GPL570 | [ |
| GSE113513 | 14 | 14 | 28 | GPL15207 | [ |
Figure 1. A schematic representation of the biomarker identification workflow.
Figure 2. A comparison of accuracy (blue) and AUROC (orange) values obtained across the different classifiers using combinations of the GEO datasets as training and test datasets. (A) GSE44861 (training) and GSE20916 (test); (B) GSE44861 (training) and GSE113513 (test); (C) GSE20916 (training) and GSE44861 (test); (D) GSE20916 (training) and GSE113513 (test); (E) GSE113513 (training) and GSE44861 (test); (F) GSE113513 (training) and GSE20916 (test).
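As a rough illustration of the two metrics compared in Figure 2, the sketch below computes accuracy and AUROC from a classifier's outputs on a held-out test dataset. It is a minimal, standard-library-only example and not the authors' pipeline: the labels, scores, the 0.5 decision threshold, and the function names `accuracy` and `auroc` are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive sample outscores a random negative one.

    This is the pairwise (Mann-Whitney) formulation of the AUROC;
    tied scores count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels (1 = CRC, 0 = normal) and predicted CRC scores.
y_test = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.9]

# Assumed decision threshold of 0.5 to turn scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

test_accuracy = accuracy(y_test, y_pred)
test_auroc = auroc(y_test, scores)
print(f"accuracy={test_accuracy:.3f}, AUROC={test_auroc:.3f}")
```

Accuracy depends on the chosen threshold, whereas AUROC is computed from the ranking of the scores alone, which is one reason the two values can differ for the same classifier.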
Figure 3. ROC curves for the different classifiers. (A) Performance of the logistic regression model with GSE44861 as training data and GSE20916 and GSE113513 as test data; (B) performance of the random forest model with GSE20916 as training data and GSE44861 and GSE113513 as test data; (C) performance of the ExtraTrees model with GSE20916 as training data and GSE44861 and GSE113513 as test data; (D) performance of the naïve Bayes model with GSE20916 as training data and GSE44861 and GSE113513 as test data; (E) performance of the XGBoost model with GSE44861 as training data and GSE20916 and GSE113513 as test data; (F) performance of the AdaBoost model with GSE44861 as training data and GSE20916 and GSE113513 as test data.
Figure 4. Important genes selected using the mean decrease in impurity (MDI) technique in combination with the random forest classifier. The x-axis represents the gene names, and the y-axis represents the importance scores for the GSE44861 (A), GSE20916 (B), and GSE113513 (C) datasets. (D) The genes common to all three datasets.
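To make the MDI scores in Figure 4 more concrete, the sketch below computes the Gini impurity decrease produced by a single threshold split on one gene. This is only the per-split building block: in a random forest, such decreases are commonly weighted by the fraction of samples reaching each node, summed per feature, and averaged over all trees. The expression values, the threshold, and the names `gene_a` and `gene_b` are hypothetical, and the code is not the authors' implementation.

```python
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a set of binary labels (0 = normal, 1 = CRC)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def impurity_decrease(
    values: Sequence[float], labels: Sequence[int], threshold: float
) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children


# Hypothetical expression values for two genes across six samples.
labels = [0, 0, 0, 1, 1, 1]
gene_a = [1.0, 1.2, 0.9, 3.1, 2.8, 3.3]  # separates the classes cleanly
gene_b = [1.0, 3.0, 1.1, 2.9, 1.2, 3.2]  # only weakly related to the class

decrease_a = impurity_decrease(gene_a, labels, threshold=2.0)
decrease_b = impurity_decrease(gene_b, labels, threshold=2.0)
print(f"gene_a: {decrease_a:.4f}, gene_b: {decrease_b:.4f}")
```

A gene whose split separates normal from CRC samples well yields a larger decrease, and therefore a higher importance score, than a gene whose split leaves both child nodes mixed.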
Figure 5. (A) Pathway enrichment analysis with the genes selected using the MDI method; (B) mapping of the 19 most interacting genes out of 34 genes with miRNAs and their interconnections; (C) the two clusters of genes, with representative genes CA7 and TEAD4, for the GSE44861 dataset, visualized by the largest effect size. The effect size of each assessed variable is shown along the y-axis, with a series of sample sizes along the x-axis.