Nguyen Phuoc Long, Seongoh Park, Nguyen Hoang Anh, Tran Diem Nghi, Sang Jun Yoon, Jeong Hill Park, Johan Lim, Sung Won Kwon.
Abstract
The advancement of bioinformatics and machine learning has facilitated the discovery and validation of omics-based biomarkers. This study employed a novel approach combining multi-platform transcriptomics and cutting-edge algorithms to introduce novel signatures for the accurate diagnosis of colorectal cancer (CRC). Different random forest (RF)-based feature selection methods, including the area under the curve (AUC)-RF, Boruta, and Vita, were used, and the diagnostic performance of the proposed biosignatures was benchmarked using RF, logistic regression, naïve Bayes, and k-nearest neighbors models. All models showed satisfactory performance, with RF appearing to be the best. For instance, the RF model achieved a mean accuracy of 0.998 (standard deviation (SD) < 0.003), a mean specificity of 0.999 (SD < 0.003), and a mean sensitivity of 0.998 (SD < 0.004). Moreover, the proposed biomarker signatures were highly associated with multifaceted hallmarks of cancer. Some biomarkers were found to be enriched in epithelial cell signaling in Helicobacter pylori infection and in inflammatory processes. The overexpression of TGFBI and S100A2 was associated with poor disease-free survival, while the down-regulation of NR5A2, SLC4A4, and CD177 was linked to worse overall survival of the patients. In conclusion, novel transcriptome signatures to improve the diagnostic accuracy in CRC are introduced for further validation in various clinical settings.
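The accuracy, sensitivity, and specificity figures quoted in the abstract are summaries over repeated model evaluations. The following is a minimal sketch of how such summaries can be computed from predicted and true labels. The function names, the toy labels, and the choice of sample standard deviation are assumptions for illustration; they are not taken from the study's code or data.

```python
from statistics import mean, stdev


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def diagnostic_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity (assumes both classes are present)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


def summarize(runs):
    """Mean and sample SD of each metric across repeated evaluations."""
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        summary[name] = (mean(values), stdev(values))
    return summary


# Hypothetical toy labels (1 = cancer, 0 = non-cancerous), two evaluation runs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
run_1 = diagnostic_metrics(y_true, [1, 1, 1, 0, 0, 0, 0, 0])
run_2 = diagnostic_metrics(y_true, [1, 1, 1, 1, 0, 0, 0, 1])
summary = summarize([run_1, run_2])
```

Each entry of `summary` is a `(mean, SD)` pair, which is the form in which the abstract reports the RF results.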
Keywords: biomarker; colorectal cancer; diagnosis; machine learning; transcriptomics; variable selection
Year: 2019 PMID: 30642095 PMCID: PMC6358915 DOI: 10.3390/ijms20020296
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Workflow of the biomarker candidate selection. (a) The process of selecting and validating diagnostic candidates with three different variable selection algorithms. (b) The Venn diagram demonstrating the relationships of selected biomarker candidates among three methods. VarSel: Variable selection; RF: Random Forest; AUCRF: the area under the curve (AUC)-RF; nBayes: naïve Bayes; logit: logistic regression; kNN: k-nearest neighbors.
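The overlap between the candidate lists of three selection methods, as a Venn diagram displays it, can be derived with plain set operations. The sketch below is illustrative only: `venn3_regions` is a hypothetical helper, and the gene labels are placeholders rather than the candidates actually selected by AUCRF, Boruta, or Vita.

```python
def venn3_regions(a, b, c):
    """Split three sets into the seven disjoint regions of a Venn diagram."""
    return {
        "only_a": a - b - c,
        "only_b": b - a - c,
        "only_c": c - a - b,
        "a_and_b_only": (a & b) - c,
        "a_and_c_only": (a & c) - b,
        "b_and_c_only": (b & c) - a,
        "all_three": a & b & c,
    }


# Placeholder candidate lists (hypothetical, not the study's selections).
aucrf = {"G1", "G2", "G3", "G4"}
boruta = {"G2", "G3", "G5"}
vita = {"G3", "G4", "G5", "G6"}

regions = venn3_regions(aucrf, boruta, vita)

# The seven regions are disjoint, so their sizes add up to the union size.
total = sum(len(genes) for genes in regions.values())
```

Candidates in `all_three` are those chosen by every method; the remaining regions show which candidates are specific to one or two methods.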
Characteristics of the included data set for variable selection and model fitting.

| Section | Comparison | Author | Data Set | Year | Platform | Samples (normal) | Samples (tumor) |
|---|---|---|---|---|---|---|---|
| Variable selection | | Ryan BM et al. | GSE44861 ¹ | 2013 | Affymetrix U133A | 55 | 56 |
| Variable selection | | Sheffer M et al. | GSE41258 ¹ | 2012 | Affymetrix U133A | 44 | 183 |
| Variable selection | | Kwon Y et al. | GSE83889 ¹ | 2016 | Illumina HumanHT-12 V4.0 | 35 | 101 |
| Variable selection | | Marra G et al. | GSE8671 ¹ | 2007 | Affymetrix U133 2.0 | 32 | 32 |
| Model fitting and validation | | TCGA, GTEx | coad-rsem-fpkm-tcga, coad-rsem-fpkm-tcga-t, read-rsem-fpkm-tcga, read-rsem-fpkm-tcga-t, colon-rsem-fpkm-gtex | | | 390 ² | 372 ³ |

¹ Paired samples; ² 41 from coad-rsem-fpkm-tcga, 10 from read-rsem-fpkm-tcga, 339 from GTEx; ³ 285 from coad-rsem-fpkm-tcga-t, 87 from read-rsem-fpkm-tcga-t.
Figure 2. Data exploration of three sets of biomarker candidates. (a) Principal component analysis of three sets of biomarker candidates between cancer samples and non-cancerous samples. (b) Heatmap analysis of three sets of biomarker candidates between cancer samples and non-cancerous samples. TCGA-READ: normal rectum, TCGA-COAD: normal colon, TCGA-T-READ: rectum adenocarcinoma, TCGA-T-COAD: colon adenocarcinoma, GTEx: normal colon and rectum.
Figure 3. Correlation analysis of biomarker candidates in cancer samples and non-cancerous samples. (a) Correlation network of biomarkers in cancer samples and non-cancerous samples. Blurred edges in the network are those with an absolute correlation strength below the cut-off value of 0.7. Blue indicates positive correlations and red indicates negative correlations. (b) Correlation matrix of biomarkers in cancer samples and non-cancerous samples.
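The edge rule described for the correlation network (absolute correlation compared against a cut-off of 0.7, with sign determining color) can be sketched as follows. Pearson correlation is an assumption here, since the caption does not name the coefficient used; the helper names and the toy expression values are likewise hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_edges(expression, cutoff=0.7):
    """Describe every gene pair as a network edge.

    An edge is 'strong' when |r| >= cutoff; weaker edges correspond to
    the blurred edges in the figure. The sign of r gives the edge color.
    """
    edges = []
    for gene_1, gene_2 in combinations(sorted(expression), 2):
        r = pearson(expression[gene_1], expression[gene_2])
        edges.append({
            "pair": (gene_1, gene_2),
            "r": r,
            "sign": "positive" if r >= 0 else "negative",
            "strong": abs(r) >= cutoff,
        })
    return edges


# Toy expression profiles (hypothetical values, five samples per gene).
expression = {
    "gene_a": [1, 2, 3, 4, 5],
    "gene_b": [2, 4, 6, 8, 10],
    "gene_c": [5, 4, 3, 2, 1],
    "gene_d": [2, 1, 2, 1, 2],
}
edges = correlation_edges(expression)
strong_pairs = {edge["pair"] for edge in edges if edge["strong"]}
```

Running this separately on cancer and non-cancerous samples would give the two networks that the figure compares.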
Figure 4. Performance metrics of classification models and variable importance scores from three tested signatures. (a) Accuracy, sensitivity, and specificity of various machine learning classification models. (b) Top 10 most important candidates of the random forests models.
Figure 5. Overall survival and disease-free survival analysis of TGFBI, S100A2, NR5A2, SLC4A4, and CD177.