| Literature DB >> 30537927 |
Jorge Martinez-Romero1,2, Santiago Bueno-Fortes1, Manuel Martín-Merino1,3, Ana Ramirez de Molina2, Javier De Las Rivas4.
Abstract
BACKGROUND: Identification of biomarkers associated with the prognosis of different cancer subtypes is critical to achieve better therapeutic assistance. In colorectal cancer (CRC) the discovery of stable and consistent survival markers remains a challenge due to the high heterogeneity of this class of tumors. In this work, we identified a new set of gene markers for CRC associated to prognosis and risk using a large unified cohort of patients with transcriptomic profiles and survival information.Entities:
Keywords: Bioinformatics; Cancer; Colon; Colorectal cancer; Gene Expression; Gene marker; Kaplan-Meier analysis; Survival; Transcriptomics
Mesh:
Substances:
Year: 2018 PMID: 30537927 PMCID: PMC6288855 DOI: 10.1186/s12864-018-5193-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Summary information about the series of colorectal cancer (CRC) samples that were collected to produce the integrated data set analyzed in this work
| GEO dataset | Sample Source | Sample Description | Total samples in dataset | PubMed PMID | Authors and Year | Samples discarded | Samples processed |
|---|---|---|---|---|---|---|---|
| GSE14333 | Royal Melbourne Hospital, Western Hospital and Peter MacCallum Cancer Center, AUSTRALIA. H Lee Moffitt Cancer Center, USA | primary colorectal cancers | 290 | 19996206 | Jorissen RN et al. (2009) | 64 | 226 |
| GSE17536 | Moffitt Cancer Center, USA | colorectal cancer patients | 177 | 19914252 | Smith JJ et al. (2010) | 0 | 177 |
| GSE31595 | Roskilde Hospital, DENMARK | patients with stage II and III colorectal cancer | 37 | – | Thorsteinsson M et al. (2011) | 0 | 37 |
| GSE33113 | Academic Medical Center in Amsterdam, NETHERLANDS | primary tumor resections from stage II colorectal patients | 90 | 22496204 | Kemper K et al. (2012) | 0 | 90 |
| GSE38832 | Vandervilt University Medical Center, USA | tumor samples collected from colorectal patients | 122 | 25320007 | Tripathi MK et al. (2014) | 0 | 122 |
| GSE39084 | Toulouse Hospital, FRANCE | sporadic early onset primary colorectal carcinomas | 70 | 25083765 | Kirzin S et al. (2014) | 1 | 69 |
| GSE39582 | Institut G. Roussy (Villejuif), Hosp. Saint Antoine (Paris), Hosp. G.Pompidou (Paris), Hosp. Hautepierre (Strasbourg), Hosp. Purpan (Toulouse), Institut P. Calmettes (Marseille), Centre Antoine Lacassagne (Nice), FRANCE | colorectal cancer samples | 566 | 23700391 | Marisa L et al. (2013) | 14 | 552 |
| Total number | 1352 | 1273 |
All the CRC samples were tested for global gene expression profiling using high-density microarrays Human Genome U133 Plus 2.0 from Affymetrix (that measure the signal of 20,141 human genes). The total collection included 1352 samples, but only 1273 were finally used. A group of 79 samples were discarded because they did not have survival data or they presented anomalous data distributions with respect to the other samples of the same series
Fig. 1Symmetric heatmaps representing the similarity between the overall gene expression signal of the samples compared with each other. Each heatmap is composed of 210 samples (30 × 7, 30 samples random selected from each batch, i.e. from each one of the 7 GSE datasets). The samples of each batch are identified by a color in the top bar below the top dendrograms (following the colors legend). Each heatmap represents a different preprocessing and normalization method performed to merge the datasets in one batch. The methods applied were: a RMA; b RMA plus ComBat; c fRMA; d fRMA plus ComBat; e fRMA plus scaling of the data using mean-centered expression values
Fig. 2Plots presenting the distribution of the 1273 samples from 7 datasets (GSEs) obtained by Principal Component Analysis (PCA) of the global gene expression profile of each sample; that converts the signal of each sample using an orthogonal transformation in linearly uncorrelated variables called principal components or dimensions. Each plot presents the values of the two main dimensions (dim 1 versus dim 2) and corresponds to the PCA results obtained using the expression data calculated with different preprocessing and normalization methods. The methods applied were: a RMA; b RMA plus ComBat; c fRMA; d fRMA plus ComBat; e fRMA plus scaling of the data using mean-centered expression values. The samples of each batch are identified by color dots following the colors legend
Results of the linear regression analyses on the global expression matrix calculated for the 1273 samples from 7 datasets (GSEs) combined using 5 different preprocessing and normalization methods
| FACTORS considered | Estimated coefficients | std. error | t value | Factor effect | |
|---|---|---|---|---|---|
| (A) RMA | |||||
| Intercept | 6.925 | 0.014 | 512.610 | <2e-16 | – |
| (GSE14333+) GSE17536 | 0.387 | 0.019 | 20.230 | <2e-16 | yes |
| GSE31595 | −1.212 | 0.019 | −63.440 | <2e-16 | yes |
| GSE33113 | −0.577 | 0.019 | −30.210 | <2e-16 | yes |
| GSE38832 | −0.355 | 0.019 | −18.570 | <2e-16 | yes |
| GSE39084 | −0.978 | 0.019 | −51.180 | <2e-16 | yes |
| GSE39582 | −1.375 | 0.019 | −71.970 | <2e-16 | yes |
| (B) RMA plus Combat | |||||
| Intercept | 6.219 | 0.013 | 473.582 | <2e-16 | – |
| (GSE14333+) GSE17536 | 0.000 | 0.019 | 0.001 | 0.999 | no |
| GSE31595 | 0.002 | 0.019 | 0.122 | 0.903 | no |
| GSE33113 | 0.001 | 0.019 | 0.051 | 0.959 | no |
| GSE38832 | −0.001 | 0.019 | −0.033 | 0.973 | no |
| GSE39084 | 0.002 | 0.019 | 0.092 | 0.927 | no |
| GSE39582 | 0.001 | 0.019 | 0.029 | 0.977 | no |
| (C) fRMA | |||||
| Intercept | 6.535 | 0.015 | 450.434 | <2e-16 | – |
| (GSE14333+) GSE17536 | −0.011 | 0.021 | −0.553 | 0.580 | no so much |
| GSE31595 | 0.089 | 0.021 | 4.329 | 0.000 | yes |
| GSE33113 | 0.071 | 0.021 | 3.455 | 0.001 | yes |
| GSE38832 | 0.054 | 0.021 | 2.641 | 0.008 | yes |
| GSE39084 | 0.096 | 0.021 | 4.695 | 0.000 | yes |
| GSE39582 | 0.089 | 0.021 | 4.336 | 0.000 | yes |
| (D) fRMA plus Combat | |||||
| Intercept | 6.590 | 0.014 | 457.338 | <2e-16 | – |
| (GSE14333+) GSE17536 | 0.000 | 0.020 | 0.001 | 1.000 | no |
| GSE31595 | 0.002 | 0.020 | 0.093 | 0.926 | no |
| GSE33113 | 0.001 | 0.020 | 0.072 | 0.942 | no |
| GSE38832 | 0.000 | 0.020 | 0.019 | 0.985 | no |
| GSE39084 | 0.002 | 0.020 | 0.089 | 0.929 | no |
| GSE39582 | 0.000 | 0.020 | 0.007 | 0.994 | no |
| (E) fRMA plus mean centered | |||||
| Intercept | 0.000 | 0.000 | −1.638 | 0.101 | – |
| (GSE14333+) GSE17536 | 0.000 | 0.000 | 1.264 | 0.206 | yes |
| GSE31595 | 0.000 | 0.000 | 0.288 | 0.773 | no so much |
| GSE33113 | 0.000 | 0.000 | 1.605 | 0.108 | yes |
| GSE38832 | 0.000 | 0.000 | 1.449 | 0.147 | yes |
| GSE39084 | 0.000 | 0.000 | −0.076 | 0.940 | no |
| GSE39582 | 0.000 | 0.000 | 1.395 | 0.163 | yes |
The methods applied were: (A) RMA; (B) RMA plus ComBat; (C) fRMA; (D) fRMA plus ComBat; (E) fRMA plus scaling of the data using mean-centered expression values. The linear regression is done to evaluate the “batch effect” (i.e. considering that the tested factors are the fact of “belonging” to a given dataset). Thus, when the p-value of the factors are significant (< 0.05), the “batch effect” remains on the overall expression signal. A marginal low significance was considered when p-values were < 0.20 in the case E
Fig. 3Kaplan-Meier plots of the survival analysis of the set of 1273 samples from colorectal cancer (CRC) patients. The patients are separated in two groups (high in red and low in green) according to the expression profiles of 4 genes: a DCBLD2, b PTPN14, c EPHB2, d DUS1L. These genes provided the best split between patients of high and low risk based in their expression levels. In the case of genes DCBLD2 and PTPN14 (labelled in red) the over-expression is correlated with poor survival; and in the case of genes EPHB2 and DUS1L (labelled in green) the over-expression is correlated with good survival. In all cases the adjusted p.values of the analyses are very significant (as indicated inside each plot), indicating that the two populations represented by the two curves have a very clear difference in their overall survival
Genes selected as top-50 best survival markers of colorectal cancer (CRC)
| Number | GENE ENSEMBL_ID | GENE Symbol | KM. | HR (all-dt) | N-signf-in-100i (KM.p.value) | HR (mean-in-100i) | GENE HGNC_ID | GENE DESCRIPTION |
|---|---|---|---|---|---|---|---|---|
| 1 | ENSG00000057019 | DCBLD2 | 0.0000000000 | 2.02 | 99 | 2.106 | 24627 | discoidin; CUB and LCCL domain containing 2 [HGNC:24627] |
| 2 | ENSG00000152104 | PTPN14 | 0.0000000000 | 1.99 | 99 | 2.082 | 9647 | protein tyrosine phosphatase; non-receptor type 14 |
| 3 | ENSG00000125869 | LAMP5 | 0.0000000000 | 1.99 | 93 | 2.046 | 16097 | lysosomal associated membrane prot.member 5 [HGNC:16097] |
| 4 | ENSG00000169908 | TM4SF1 | 0.0000000001 | 1.96 | 93 | 2.031 | 11853 | transmembrane 4 L six family member 1 [HGNC:11853] |
| 5 | ENSG00000113389 | NPR3 | 0.0000000002 | 1.95 | 97 | 2.136 | 7945 | natriuretic peptide receptor 3 [HGNC:7945] |
| 6 | ENSG00000186007 | LEMD1 | 0.0000000003 | 1.95 | 85 | 1.937 | 18,725 | LEM domain containing 1 [HGNC:18725] |
| 7 | ENSG00000135338 | LCA5 | 0.0000000003 | 1.89 | 97 | 2.021 | 31,923 | LCA5; lebercilin [HGNC:31923] |
| 8 | ENSG00000169826 | CSGALNACT2 | 0.0000000008 | 1.91 | 92 | 1.974 | 24,292 | chondroitin sulfate N-acetylgalactosaminyltransferase 2 |
| 9 | ENSG00000059804 | SLC2A3 | 0.0000000014 | 1.93 | 89 | 1.993 | 11,007 | solute carrier family 2 member 3 [HGNC:11007] |
| 10 | ENSG00000099860 | GADD45B | 0.0000000018 | 1.92 | 97 | 2.074 | 4096 | growth arrest and DNA damage inducible beta [HGNC:4096] |
| 11 | ENSG00000136155 | SCEL | 0.0000000018 | 1.88 | 87 | 1.928 | 10,573 | sciellin [HGNC:10573] |
| 12 | ENSG00000100625 | SIX4 | 0.0000000019 | 1.89 | 91 | 1.951 | 10,890 | SIX homeobox 4 [HGNC:10890] |
| 13 | ENSG00000131016 | AKAP12 | 0.0000000028 | 1.85 | 95 | 2.092 | 370 | A-kinase anchoring protein 12 [HGNC:370] |
| 14 | ENSG00000158270 | COLEC12 | 0.0000000028 | 1.84 | 92 | 1.941 | 16,016 | collectin subfamily member 12 [HGNC:16016] |
| 15 | ENSG00000154553 | PDLIM3 | 0.0000000047 | 1.84 | 91 | 1.985 | 20,767 | PDZ and LIM domain 3 [HGNC:20767] |
| 16 | ENSG00000082781 | ITGB5 | 0.0000000049 | 1.82 | 88 | 1.911 | 6160 | integrin subunit beta 5 [HGNC:6160] |
| 17 | ENSG00000144366 | GULP1 | 0.0000000050 | 1.81 | 88 | 1.911 | 18,649 | engulfment adaptor PTB domain containing 1 [HGNC:18649] |
| 18 | ENSG00000171951 | SCG2 | 0.0000000051 | 1.81 | 93 | 2.034 | 10,575 | secretogranin II [HGNC:10575] |
| 19 | ENSG00000185567 | AHNAK2 | 0.0000000066 | 1.80 | 87 | 1.896 | 20,125 | AHNAK nucleoprotein 2 [HGNC:20125] |
| 20 | ENSG00000138061 | CYP1B1 | 0.0000000075 | 1.84 | 85 | 1.884 | 2597 | cytochrome P450 family 1 subfamily B member 1 [HGNC:2597] |
| 21 | ENSG00000184304 | PRKD1 | 0.0000000451 | 1.74 | 87 | 1.872 | 9407 | protein kinase D1 [HGNC:9407] |
| 22 | ENSG00000152583 | SPARCL1 | 0.0000000471 | 1.74 | 85 | 1.863 | 11,220 | SPARC like 1 [HGNC:11220] |
| 23 | ENSG00000147883 | CDKN2B | 0.0000000717 | 1.73 | 84 | 1.847 | 1788 | cyclin dependent kinase inhibitor 2B [HGNC:1788] |
| 24 | ENSG00000213190 | MLLT11 | 0.0000001989 | 1.70 | 84 | 1.813 | 16,997 | myeloid/lymphoid or mixed-lineage leukemia; translocated to 11 |
| 25 | ENSG00000135218 | CD36 | 0.0000002751 | 1.69 | 85 | 1.891 | 1663 | CD36 molecule [HGNC:1663] |
| 1 | ENSG00000133216 | EPHB2 | 0.0000000000 | 0.43 | 100 | 0.426 | 3393 | EPH receptor B2 [HGNC:3393] |
| 2 | ENSG00000169718 | DUS1L | 0.0000000000 | 0.49 | 98 | 0.481 | 30,086 | dihydrouridine synthase 1 like [HGNC:30086] |
| 3 | ENSG00000163545 | NUAK2 | 0.0000000001 | 0.51 | 96 | 0.495 | 29,558 | NUAK family kinase 2 [HGNC:29558] |
| 4 | ENSG00000158169 | FANCC | 0.0000000002 | 0.51 | 95 | 0.498 | 3584 | Fanconi anemia complementation group C [HGNC:3584] |
| 5 | ENSG00000277972 | CISD3 | 0.0000000002 | 0.51 | 87 | 0.511 | 27,578 | CDGSH iron sulfur domain 3 [HGNC:27578] |
| 6 | ENSG00000099800 | TIMM13 | 0.0000000003 | 0.53 | 95 | 0.511 | 11,816 | translocase of inner mitochondrial membrane 13 [HGNC:11816] |
| 7 | ENSG00000116771 | AGMAT | 0.0000000005 | 0.52 | 95 | 0.515 | 18,407 | agmatinase [HGNC:18407] |
| 8 | ENSG00000118513 | MYB | 0.0000000006 | 0.52 | 93 | 0.508 | 7545 | MYB proto-oncogene. Transcription factor [HGNC:7545] |
| 9 | ENSG00000016391 | CHDH | 0.0000000006 | 0.53 | 90 | 0.520 | 24,288 | choline dehydrogenase [HGNC:24288] |
| 10 | ENSG00000137460 | FHDC1 | 0.0000000008 | 0.52 | 96 | 0.505 | 29,363 | FH2 domain containing 1 [HGNC:29363] |
| 11 | ENSG00000132846 | ZBED3 | 0.0000000009 | 0.52 | 88 | 0.522 | 20,711 | zinc finger BED-type containing 3 [HGNC:20711] |
| 12 | ENSG00000162408 | NOL9 | 0.0000000015 | 0.54 | 92 | 0.527 | 26,265 | nucleolar protein 9 [HGNC:26265] |
| 13 | ENSG00000109534 | GAR1 | 0.0000000017 | 0.50 | 99 | 0.479 | 14,264 | GAR1 ribonucleoprotein [HGNC:14264] |
| 14 | ENSG00000133477 | FAM83F | 0.0000000019 | 0.54 | 93 | 0.518 | 25,148 | family with sequence similarity 83 member F [HGNC:25148] |
| 15 | ENSG00000100348 | TXN2 | 0.0000000036 | 0.53 | 88 | 0.527 | 17,772 | thioredoxin 2 [HGNC:17772] |
| 16 | ENSG00000108479 | GALK1 | 0.0000000036 | 0.55 | 88 | 0.525 | 4118 | galactokinase 1 [HGNC:4118] |
| 17 | ENSG00000110917 | MLEC | 0.0000000045 | 0.55 | 96 | 0.476 | 28,973 | malectin [HGNC:28973] |
| 18 | ENSG00000114738 | MAPKAPK3 | 0.0000000048 | 0.55 | 92 | 0.520 | 6888 | mitogen-activated protein kinase-activated 3 [HGNC:6888] |
| 19 | ENSG00000137752 | CASP1 | 0.0000000180 | 0.56 | 87 | 0.523 | 1499 | caspase 1 [HGNC:1499] |
| 20 | ENSG00000131844 | MCCC2 | 0.0000000183 | 0.57 | 93 | 0.516 | 6937 | methylcrotonoyl-CoA carboxylase 2 [HGNC:6937] |
| 21 | ENSG00000178409 | BEND3 | 0.0000000193 | 0.55 | 88 | 0.529 | 23,040 | BEN domain containing 3 [HGNC:23040] |
| 22 | ENSG00000114737 | CISH | 0.0000000216 | 0.55 | 87 | 0.508 | 1984 | cytokine inducible SH2 containing protein [HGNC:1984] |
| 23 | ENSG00000011376 | LARS2 | 0.0000000239 | 0.55 | 91 | 0.528 | 17,095 | leucyl-tRNA synthetase 2; mitochondrial [HGNC:17095] |
| 24 | ENSG00000164045 | CDC25A | 0.0000000481 | 0.57 | 90 | 0.539 | 1725 | cell division cycle 25A [HGNC:1725] |
| 25 | ENSG00000154655 | L3MBTL4 | 0.0000000606 | 0.54 | 90 | 0.506 | 26,677 | l(3)mbt-like 4 (Drosophila) [HGNC:26677] |
The first part of the table corresponds to the top-25 genes where up-regulation corresponds to shorter survival and higher risk (i.e., HR > 1); the second part of the table corresponds to the top-25 genes where UP-regulation corresponds to longer survival and lower risk (HR < 1). The genes were ranked by their KM adjusted p values and the Hazard Ratio values calculated for the whole dataset, i.e. for all the 1273 samples (all-dt). The stability and robustness of the gene survival markers was assessed by cross-validation, applying to each gene a resampling strategy with random selection of 80% of the samples 100 times (i.e. doing 100 iterations). For the ranking we also considered that the genes had to give a significant adjusted p-value in more than 80 iterations (N-sinf-in-100i > 80)
Fig. 4Risk prediction done for the cohort of 1273 patients of CRC based in the multivariate analysis using the top 100 genes that showed up-regulation correlated with poor prognosis (i.e. overexpressed in low survival cases). a Plot presenting the patients according to their risk score, from Low (blue) to High (red) risk. A recursive algorithm using 10-fold cross-validation finds the value of risk score (marked with a vertical black line) that allows the best splitting of the cohort in two groups. b Kaplan-Meier plot showing the separation of these two groups: a high-risk group including 425 individuals (in red) and a low-risk group including 848 individuals (in blue). The analysis has been done using a multivariate Cox proportional-hazards regression. As shown, the division is very significant (p-value = 8.25e-14) and allows an optimal separation of individuals according to their survival. The analysis of the beta factors assigned by the regression to each of the top 100 genes (i.e. to each variable within the multivariate vector) allows the identification of the genes that are the most influential factors in this risk analysis and therefore it helps in the selection of the best “gene survival markers”