Jia Zhao1,2,3,4, Ying Chen1,2,3,4. 1. Department of Medical Oncology, the First Hospital of China Medical University, Shenyang, China. 2. Key Laboratory of Anticancer Drugs and Biotherapy of Liaoning Province, The First Hospital of China Medical University, Shenyang, China. 3. Liaoning Province Clinical Research Center for Cancer, Shenyang, China. 4. Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, Shenyang, China.
Abstract
Colorectal cancer (CRC) has a high incidence rate and poor prognosis, and the available treatment approaches have limited therapeutic benefits. Therefore, understanding the underlying mechanisms of occurrence and development is particularly crucial. Increasing attention has been paid to the pathophysiological role of cancer-associated fibroblasts (CAFs) in the heterogeneous tumour microenvironment. CAFs play a crucial role in tumorigenesis, tumour progression and treatment response. However, routine tissue sequencing cannot adequately reflect the heterogeneity of tumours. In this study, single-cell sequencing was used to examine the fibroblast population in CRC. After cluster analysis, the fibroblast population was divided into four subgroups. The distribution and role of these four subgroups in CRC were found to be different. Based on differential gene expression and lasso regression analysis of the main marker genes in these subgroups, four representative genes were obtained, namely, TCF7L1, FLNA, GPX3 and MMP11. Patients with CRC were divided into the low- and high-risk groups using the prognostic risk model established based on the expression of these four genes. The prognosis of patients in different risk groups varied significantly; patients with low-risk scores had a greater response to PDL1 inhibitors, significant clinical benefits and significantly prolonged overall survival. These effects may be attributed to inhibition of the function of T cells in the immune microenvironment and promotion of the function of tumour-associated macrophages.
Colorectal cancer (CRC) has a high incidence rate and poor prognosis, and the available treatment approaches have limited therapeutic benefits. Therefore, understanding the underlying mechanisms of occurrence and development is particularly crucial. Increasing attention has been paid to the pathophysiological role of cancer-associated fibroblasts (CAFs) in the heterogeneous tumour microenvironment. CAFs play a crucial role in tumorigenesis, tumour progression and treatment response. However, routine tissue sequencing cannot adequately reflect the heterogeneity of tumours. In this study, single-cell sequencing was used to examine the fibroblast population in CRC. After cluster analysis, the fibroblast population was divided into four subgroups. The distribution and role of these four subgroups in CRC were found to be different. Based on differential gene expression and lasso regression analysis of the main marker genes in these subgroups, four representative genes were obtained, namely, TCF7L1, FLNA, GPX3 and MMP11. Patients with CRC were divided into the low- and high-risk groups using the prognostic risk model established based on the expression of these four genes. The prognosis of patients in different risk groups varied significantly; patients with low-risk scores had a greater response to PDL1 inhibitors, significant clinical benefits and significantly prolonged overall survival. These effects may be attributed to inhibition of the function of T cells in the immune microenvironment and promotion of the function of tumour-associated macrophages.
As the third most common malignancy, colorectal cancer (CRC) causes more than 8% of all deaths worldwide each year (1, 2). Routine treatment of CRC includes surgery, radiotherapy and chemotherapy, which are invasive and may have a greater impact on the quality of life of patients (3). After comprehensive treatment, the 5-year survival rate of patients with early-stage CRC is 90%; however, treatment options for patients with advanced-stage CRC who are ineligible for surgery are limited (4). Immunotherapy may be beneficial for patients with advanced-stage CRC. Because of its strong anti-tumour activity, immunotherapy is used for treating several solid tumours, including melanoma, kidney cancer, non-small cell lung cancer and prostate cancer (5). In addition to targeted and anti-vascular therapies, immunotherapeutic strategies have gradually improved. PD-1/L1 and CTLA-4 are the main immunotherapeutic agents; however, their clinical efficacy remains unclear. Studies have shown that only patients with CRC with defective mismatch repair (dMMR) or high microsatellite instability (MSI-H) are eligible for checkpoint inhibition and may benefit from it (6, 7). Therefore, dMMR/MSI-H is considered a predictive biomarker for the application and efficacy of immunosuppressants (8). However, the efficacy of dMMR/MSI-H is only 30–40% (9), which considerably limits the application of immune checkpoint inhibitors for the treatment of colon cancer. Therefore, understanding the underlying mechanisms of the occurrence and development of CRC is necessary to screen for more effective predictors and improve the currently available treatment approaches.In addition to the role of tumour cells, the tumour microenvironment (TME) is another major auxiliary factor in the onset and growth of tumours. Several studies have associated TME with the occurrence and growth of tumours, survival and clinical treatment sensitivity (10, 11). TME has an extremely complex system comprising stromal cells, tumour cells, various cytokines and an extracellular matrix (ECM) (12). Fibroblasts are the main cellular component of the matrix and are called cancer-associated fibroblasts (CAFs). They interact with cancer cells (13, 14) and are significantly associated with the prognosis of tumours (15). A recent study has demonstrated that CAFs play a significant role in various tumours. For example, matrix SOX2 upregulation promotes tumorigenesis by producing CAFs expressing SFRP1/2 (16), and Wnt-induced phenotypic transformation of CAFs inhibits EMT in CRC (17). However, most studies have focused only on the involvement of tumour cells in fibroblast remodelling or the effects of fibroblasts on tumour cells, and systematic analysis of tumours and TME including the whole fibroblast population is lacking.In this study, we identified fibroblast subsets based on single-cell sequencing analysis and identified hub genes significantly related to fibroblasts by differential analysis, correlation analysis, univariate cox analysis and lasso cox analysis. Further, we analysed the roles of hub genes in tumors from various aspects by studying the mutations and immunity of these genes. Finally, we constructed a multi-gene signature and confirmed its role in predicting patient outcomes and immunotherapy predictions.
Materials and methods
Extraction and preprocessing of scRNA data
The read count expression profile data of 16 cancer tissues and 7 adjacent tissues were extracted from the single-cell sequencing dataset GSE200997 from the NCBI database Gene Expression Omnibus (GEO). First, the single-cell data were filtered by ensuring that each gene was expressed in at least three cells, and at least 250 genes were expressed per cell. The PercentageFeatureSet function was used to determine the proportion of mitochondria and rRNA and ensure that <3000 genes are expressed per cell and the Unique molecular identifier (UMI) of each cell is at least >100.The data were standardised through log-normalisation, and highly variable genes were identified using the FindVariableFeatures function (variance-stabilising transformation was used to identify variable characteristics). Subsequently, the ScaleData function was used to scale all genes, and Principal components analysis (PCA) was used for dimensionality reduction to identify anchor points (dim = 40). The FindNeighbors and FindClusters functions (resolution = 0.2) were used to cluster the cells, and the RunTSNE function was used to reduce t-SNE dimensionality to screen for fibroblasts.
Extraction and preprocessing of the cancer genome atlas data
The clinical phenotype data of CRC were downloaded from TCGA database, and samples lacking data on survival time and survival status were removed. Samples were further filtered to ensure that the survival time in each sample was >0 days. In addition, the gene expression profile data were downloaded from TCGA database, and 431 tumour samples and 41 para-cancerous samples were selected for further analysis.The copy number variation (CNV) of CRC samples were downloaded from TCGA database and integrated using the GISTIC2 software.The single nucleotide variants (SNVs) data of TCGA-COAD cohort were downloaded from TCGA database and integrated using the Mutect2 software.
Extraction and preprocessing of GEO data
The GSE17536 and GSE17537 datasets were downloaded from GEO, and the probe IDs were converted to gene symbols according to the annotation files. A probe ID that corresponded to multiple genes was deleted, and the expression of several probes for a gene was averaged. Normal tissue samples were removed, and only tumour samples were retained. In addition, samples without clinical follow-up and OS data were removed to ensure that the survival time of all patients was >0 days. A total of 177 tumour samples and 21,655 genes were obtained from the GSE17536 dataset, and 55 tumour samples and 21,655 genes were obtained from the GSE17537 dataset.
Single-cell clustering dimensionality reduction
The R language Seurat package was first used to filter the single-cell data by setting each gene to be expressed in at least 3 cells, and each cell expresses at least 250 genes, calculating the proportion of mitochondria and rRNA through the PercentageFeatureSet function, and ensuring that each cell The expressed genes are less than 3000, and the UMI of each cell is at least greater than 100. Then, we normalized the data of 23 samples separately by log-normalization.The FindVariableFeatures function was used to find highly variable genes [identify variable features based on variance stabilizing transformation (“vst”)], then scaled all genes using the ScaleData function, and perform PCA dimensionality reduction to find anchors, we chose dim=40, pass The FindNeighbors and FindClusters functions cluster the cells (set Resolution=0.2), divided the subgroups, and used the RunTSNE function for TSNE dimensionality reduction,
Annotation and further segmentation of fibroblasts
The fibroblasts were screened with the four genes of ACTA2, FAP, PDGFRB and NOTCH3, and then the fibroblasts were extracted and clustered by the functions of FindNeighbors and FindClusters (setting Resolution=0.2), and the fibroblasts were further divided into 4 groups subpopulations and re-TSNE dimensionality reduction of fibroblasts using the RunTSNE function.
Identification of marker genes
The FindAllMarkers function of the Seurat package was used to identify marker genes of fibroblasts by LogFC=0.5, Minpct=0.35 (minimum expression ratio of differential genes) and identified marker genes with a corrected p<0.05.
Functional annotation of subgroups
KEGG enrichment analysis was performed on marker genes of fibroblast subpopulations using the compareCluster function of the clusterProfiler package in R language, and screening was performed with pvalue Cutoff=0.05.
Identification of malignant and non-malignant cells
Four fibroblast subpopulations were analyzed using the R language copykat package to differentiate between tumor cells/malignant cells and normal cells/non-malignant cells in each sample by changes in the cnv of the cells.Copykat’s statistical workflow combines Bayesian methods with hierarchical clustering to calculate genomic copy number profiles of individual cells and to define clonal substructures from high-throughput 3’ scRNA-seq data. The workflow takes a gene expression matrix of Unique Molecular Identifier (UMI) counts as input to the calculation. Analysis begins with rows of gene annotations, ordered by their genomic coordinates. Freeman-Tukey transformation (FTT) was performed to stabilize variance, followed by polynomial dynamic linear modeling (DLM) to smooth out outliers in single-cell UMI counts. A subset of diploid cells with high confidence was then examined to infer baseline copy number values for normal 2N cells. To do this, we pooled individual cells into several small hierarchical clusters and estimated the variance of each cluster using a Gaussian mixture model (GMM). By following strict classification criteria, the cluster with the smallest estimated variance was defined as “confident diploid cells”. Potential misclassification can occur when the data have only a few normal cells, or when tumor cells have near-diploid genomes and limited CNA events. In this context, Copykat provides a “GMM-defined” model to identify diploid normal cells one by one, where a mixture of three Gaussian models of gene expression in a single cell is assumed to represent genomic gain, loss, and neutral states. A single cell is defined as a “confident diploid cell” when the genes in the neutral state account for at least 99% of the expressed genes.
Tumour-related pathways.
As reported in a previous study, the 10 pathways related to tumours and genes associated with these pathways are shown in
. The scores of each cell for the 10 pathways were calculated via Single-sample GSEA (ssGSEA). The proportion of malignant and non-malignant cells and the MSI status in fibroblast subpopulations were compared via the chi-square test, and the scores of different fibroblast subpopulations associated with the 10 tumour-related pathways were compared via the Wilcoxon test.
Potential regulatory pathways of key genes
Using h.all.v7.5.1.symbols.gmt as a background, the enrichment scores of patients in TCGA cohort for each pathway were calculated using the GSVA package in R. Subsequently, the correlation between gene expression and pathway enrichment scores was analysed using the Hmisc package.
Construction of a risk model for predicting the response to PD-L1 inhibitor immunotherapy
The PD-L1 cohort (IMvigor210) was used to assess the relationship between risk scores and immunotherapy. The effects of PD-L1 inhibitors were different among 348 patients in the IMvigor210 cohort, which were characterised by stable disease (SD), progressive disease (PD), partial response (PR) and complete response (CR). In addition, differences between immunotherapy and chemotherapy were analysed in the IMvigor210 cohort. The risk model was used to evaluate the possible clinical outcomes of immunotherapy using the TIDE (http://tide.dfci.harvard.edu/) software. The likelihood of immune escape increased with increasing TIDE prediction scores, indicating that immunotherapy is less likely to benefit patients.
Statistical analyses
The Shapiro–Wilk test was used to compare the normality of variables between two groups. The unpaired Student’s t-test was used to determine the statistical significance of differences between normally distributed variables, and the Mann–Whitney U test was used to analyse non-normally distributed variables. The Kruskal–Wallis test and one-way ANOVA were employed as non-parametric and parametric methods, respectively, for comparing more than two groups. Spearman and distance correlation analyses were used to examine the correlation. The Kaplan–Meier method was used to compute survival rates, and the log-rank test was used to assess the significance of variations in survival curves.
Results
Identification of fibroblasts from scRNA-seq data
A total of 49,698 cells were obtained after filtering single-cell sequencing data. The PercentageFeatureSet function was used to calculate the proportion of mitochondria and rRNA, and 48,755 cells were obtained. As shown in
, a significant correlation was observed between the number of UMI and mRNA but not between the number of UMI/mRNA and the content of mitochondrial genes. A violin diagram created before and after QC analysis is shown in
.Furthermore, the data of 23 samples were standardised via log-normalisation. A total of 16 subgroups were obtained, and the RunTSNE function was used to reduce t-SNE dimensionality. Fibroblasts were screened based on the expression of ACTA2, FAP, PDGFRB and NOTCH3. Because these four genes were mainly expressed by cells in subgroup 9, the cells were defined as fibroblasts (
) and extracted for cluster analysis. These fibroblasts were further divided into four subgroups, and the RunTSNE function was used to reduce t-SNE dimensionality. The t-SNE map of the four fibroblast subpopulations and marker gene expression is shown in
.shows the t-SNE diagram of 23 samples,
shows the t-SNE diagram of different tissues (cancer and adjacent tissues),
shows the t-SNE diagram of the MSI status and
shows the t-SNE diagram of fibroblast subsets after cluster analysis. The number of cells in each sample before and after data filtration is shown in
.
Figure 1
(A) t-SNE diagram of 23 samples; (B) Distribution of t-SNE in cancer and adjacent tissues; (C) t-SNE distribution diagram of MSI status; (D) t-SNE map of four fibroblast subpopulations after cluster analysis; (E) Dot map of the expression of the top five marker genes in the subpopulations; (F) The proportion and cell number of subpopulations in cancer and adjacent tissues; (G) KEGG enrichment analysis of the four fibroblast subpopulations; (H) Distribution of t-SNE in malignant and non-malignant cells predicted using the copykat package.
Table 1
Counting of cell counts before and after sample filtration.
Samples
raw_count
clean_count
Percentage (%)
B_cac10
2823
2823
100
B_cac11
4644
4611
99.29
B_cac14
4764
4722
99.12
B_cac15
1034
1030
99.61
B_cac4
2666
2652
99.47
B_cac6
717
712
99.3
B_cac7
1565
1554
99.3
T_cac1
1692
1586
93.74
T_cac10
697
690
99
T_cac11
2865
2761
96.37
T_cac12
4038
4018
99.5
T_cac13
2642
2642
100
T_cac14
4071
4020
98.75
T_cac15
3675
3651
99.35
T_cac16
1381
1243
90.01
T_cac2
1674
1649
98.51
T_cac3
1183
1093
92.39
T_cac4
1584
1575
99.43
T_cac5
169
169
100
T_cac6
1690
1643
97.22
T_cac7
1494
1480
99.06
T_cac8
990
903
91.21
T_cac9
1640
1528
93.17
(A) t-SNE diagram of 23 samples; (B) Distribution of t-SNE in cancer and adjacent tissues; (C) t-SNE distribution diagram of MSI status; (D) t-SNE map of four fibroblast subpopulations after cluster analysis; (E) Dot map of the expression of the top five marker genes in the subpopulations; (F) The proportion and cell number of subpopulations in cancer and adjacent tissues; (G) KEGG enrichment analysis of the four fibroblast subpopulations; (H) Distribution of t-SNE in malignant and non-malignant cells predicted using the copykat package.Counting of cell counts before and after sample filtration.The marker genes of the four subpopulations were identified using the FindAllMarkers function (logfc = 0.5 [difference multiple], Minpct = 0.35 [minimum expression ratio of different genes] and corrected p-value < 0.05). The expression of the top five marker genes with the most prominent contribution was analysed in each subgroup (
).Furthermore, the proportion of the four fibroblast subpopulations was analysed in each sample (
), and the clusterprofiler package in R was used for KEGG enrichment analysis of marker genes in each subgroup (
).The copykat package in R was used to screen for tumour/malignant cells and normal/non-malignant cells in each sample based on CNVs (to ensure that normal cells were not included). A total of 297 cancer cells (malignant cells) and 491 normal cells (non-malignant cells) were eventually identified (
).
Expression of fibroblasts in tumour-related pathways
Genes involved in 10 important pathways associated with tumorigenesis and development were extracted from previous studies.
shows the enrichment of fibroblasts in the 10 tumour-related pathways. In addition, the proportion of malignant and non-malignant cells and the MSI status in the fibroblast subpopulations were compared (
), and the scores of different fibroblasts in the 10 pathways were compared (
).
Figure 2
(A) Heat map of the scores of 10 tumour-related pathways enriched in CAFs; (B) Comparison of CAF subpopulations in malignant and non-malignant cells; (C) Comparison of CAF subpopulations in terms of MSI status; (D) Comparison of the scores of 10 tumour-related pathways between malignant and non-malignant cells in the CAF_0 subgroup; (E) Comparison of the scores of 10 tumour-related pathways between malignant and non-malignant cells in the CAF_1 subgroup; (F) Comparison of the scores of 10 tumour-related pathways between malignant and non-malignant cells in the CAF_2 subgroup; (G) Comparison of the scores of 10 tumour-related pathways between malignant and non-malignant cells in the CAF_3 subgroup; (Wilcoxon test; *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001). ns, no significant.
(A) Heat map of the scores of 10 tumour-related pathways enriched in CAFs; (B) Comparison of CAF subpopulations in malignant and non-malignant cells; (C) Comparison of CAF subpopulations in terms of MSI status; (D) Comparison of the scores of 10 tumour-related pathways between malignant and non-malignant cells in the CAF_0 subgroup; (E) Comparison of the scores of 10 tumour-related pathways between malignant and non-malignant cells in the CAF_1 subgroup; (F) Comparison of the scores of 10 tumour-related pathways between malignant and non-malignant cells in the CAF_2 subgroup; (G) Comparison of the scores of 10 tumour-related pathways between malignant and non-malignant cells in the CAF_3 subgroup; (Wilcoxon test; *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001). ns, no significant.
Identification of key genes in fibroblasts
A total of 1424 upregulated and 1245 downregulated genes were identified in TCGA dataset using the limma package (FDR < 0.05 and |log2 (fold change)| > 1).
shows a volcano map of differential analysis.
Figure 3
(A) Volcano map of differential gene expression between cancer and adjacent tissues in TCGA dataset; (B) The scores of the four fibroblast subgroups were compared between cancer and adjacent tissues (Wilcoxon test); (C) KM curve of the high- and low-score groups in the CAF_0 subgroup; (D) KM curve of the high- and low-score groups in the CAF_1 subgroup; (E) KM curve of the high- and low-score groups in the CAF_2 subgroup; (F) KM curve of the high- and low-score groups in the CAF_3 subgroup. **P < 0.01, ****P < 0.0001.
(A) Volcano map of differential gene expression between cancer and adjacent tissues in TCGA dataset; (B) The scores of the four fibroblast subgroups were compared between cancer and adjacent tissues (Wilcoxon test); (C) KM curve of the high- and low-score groups in the CAF_0 subgroup; (D) KM curve of the high- and low-score groups in the CAF_1 subgroup; (E) KM curve of the high- and low-score groups in the CAF_2 subgroup; (F) KM curve of the high- and low-score groups in the CAF_3 subgroup. **P < 0.01, ****P < 0.0001.Based on the results of single-cell sequencing analysis, the scores of the CAF subgroups in TCGA dataset were calculated using ssGSEA to screen for marker genes in each subgroup. The results revealed that the scores of the CAF_0 subgroup were higher in cancer tissues, whereas those of CAF_1, CAF_2 and CAF_3 subgroups were higher in paracancerous tissues (
). Subsequently, the survminer package was used to select optimal truncation based on the total survival time, and the scores of the four fibroblast subgroups were divided into the high- and low-score groups. The KM curve revealed that the high-score group of the four subgroups had a poor prognosis (
).Furthermore, the Hmisc package was used to examine the correlation between 2669 DEGs associated with tumorigenesis and development and the scores of the four CAF subgroups. A total of 248 key genes significantly associated with the four fibroblast subpopulations were identified (p < 0.001; cor > 0.7) and subjected to univariate cox analysis using the coxph function of the survival package. The results revealed 36 genes with a high prognostic impact, which were considered prognostic risk factors (p < 0.01) (
).
Figure 4
(A) A total of 248 candidate DEGs were identified; (B) The locus of each independent variable changing with lambda; (C) Confidence interval under lambda; (D) Multivariate Cox regression analysis (coefficient of prognosis-related genes).
(A) A total of 248 candidate DEGs were identified; (B) The locus of each independent variable changing with lambda; (C) Confidence interval under lambda; (D) Multivariate Cox regression analysis (coefficient of prognosis-related genes).These 36 key genes were further filtered using lasso regression to decrease the number of genes used for constructing a risk model. Lasso regression is a compression estimation technique. By creating a penalty function, which causes certain coefficients to be compressed and some coefficients to be set to zero, lasso regression helps to create a more refined model. Therefore, lasso regression retains the benefit of subset contraction and is a biased estimation for analysing data with complex collinearity. It selects variables during parameter estimation and improves the method of dealing with multicollinearity in regression analysis. In this study, the R software package glmnet was used to perform lasso–Cox regression. The change in each independent variable was assessed (
), and the number of independent variable coefficients tending to 0 was found to gradually increase with the increase in lambda. The risk model was constructed using 10-fold cross-validation, and the confidence interval of each lambda was evaluated (
). The performance of the model was optimal at a lambda of 0.0251. The four genes obtained based on this value were selected as target genes for further analysis, and multivariate cox analysis revealed that the genes were prognostic risk factors (
).
Mutation analysis of key genes
The SNVs of the four genes were examined in TCGA dataset, and FLNA was found to have the highest mutation frequency (
). Subsequently, we examined the collinearity and mutual exclusiveness of these four and the top 10 genes with most mutations in CRC and found that the mutations of these four genes did not exhibit significant collinearity (
). Furthermore, the CNVs of the four genes were analysed, and only a few samples were found to have copy number amplification/deletion (
).
Figure 5
(A) Waterfall diagram of SNVs in the 4 key genes; (B) Collinearity and mutual exclusion analysis of the 4 key genes and 10 genes with the most mutations in CRC; (C) CNVs in the 4 key genes; (D) Heat map of the correlation between the 4 key genes and aneuploidy scores, homologous recombination defects, fraction altered, number of segments and non-silent mutation rates.
(A) Waterfall diagram of SNVs in the 4 key genes; (B) Collinearity and mutual exclusion analysis of the 4 key genes and 10 genes with the most mutations in CRC; (C) CNVs in the 4 key genes; (D) Heat map of the correlation between the 4 key genes and aneuploidy scores, homologous recombination defects, fraction altered, number of segments and non-silent mutation rates.The molecular characteristics of TCGA-COAD cohort were obtained from previous pan-cancer studies. Correlation analysis revealed that MMP11 and TCF7L1 were significantly positively correlated with aneuploidy scores, homologous recombination defects and the fraction altered (
).The enrichment scores of each pathway in TCGA cohort were calculated using the gsva package in R, and Pearson correlation analysis between the expression of the four genes and the pathway enrichment scores was performed using the Hmisc package in R. A total of 22 significantly related pathways were identified (|cor| > 0.4 and p < 0.001).
shows a heat map of the relationship between the 4 genes and 22 pathways.
shows a heat map of the enrichment scores of 22 pathways.
Figure 6
(A) Heat map of the correlation between genes and pathways; (B) Heat map of the enrichment scores of key pathways. *P< 0.05, **P < 0.01, ***P < 0.001.
(A) Heat map of the correlation between genes and pathways; (B) Heat map of the enrichment scores of key pathways. *P< 0.05, **P < 0.01, ***P < 0.001.
Relationship between key genes and immunity
The immune scores of each sample in TCGA dataset were evaluated using the ESTIMATE algorithm and were found to have a significant positive correlation with the four genes (
). The samples were divided into the high- and low-expression groups based on the median expression level of the four genes, and significant differences in immune scores were observed between the high- and low-expression groups (
).
Figure 7
(A) Heat map of the correlation between key genes and immune scores predicted using the ESTIMATE algorithm; (B) Comparison of immune scores in the high- and low-expression groups (Wilcoxon test); (C) Heat map of the correlation between key genes and immune cell scores predicted using the CIBERSORT algorithm; (D) Comparison of the scores of 22 immune cells between the high- and low-expression groups (Wilcoxon test; *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001). ns, no significant.
(A) Heat map of the correlation between key genes and immune scores predicted using the ESTIMATE algorithm; (B) Comparison of immune scores in the high- and low-expression groups (Wilcoxon test); (C) Heat map of the correlation between key genes and immune cell scores predicted using the CIBERSORT algorithm; (D) Comparison of the scores of 22 immune cells between the high- and low-expression groups (Wilcoxon test; *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001). ns, no significant.The CIBERSORT method was used to determine the immune cell scores of samples in TCGA dataset. Correlation analysis revealed that the expression of the four genes was significantly negatively correlated with T cell scores but was significantly positively correlated with macrophage-related scores (
). The samples were divided into the high- and low-expression groups based on the median expression level of the four genes, and significant differences in some immune cell scores between the high- and low-expression groups (
).
Construction of a risk model based on key genes
The results of multivariate Cox analysis are shown in
. The risk scores of samples were calculated using the following formula: RiskScore = Σ βi × Expi, where i refers to the expression levels of the four key genes, and β is the multivariate Cox regression coefficient of the corresponding genes. The final formula for calculating risk scores based on the 4-gene signature is as follows:TCGA cohort was used as the training dataset to determine the risk score of each sample. ROC analysis was performed to examine the efficiency of the risk model in predicting prognosis at 1–5 years using the R software package timeROC (
). The AUC value for predicting prognosis at 4 and 5 years was 0.7. In addition, z-scores were evaluated for risk scores, and samples with risk scores of >0 were included in the high-risk group, whereas those with risk scores of <0 were included in the low-risk group. Subsequently, KM curves were plotted, and significant differences were observed between the two groups (p < 0.0001) (
).
Figure 8
(A) ROC curve of the risk model constructed based on 4 genes in TCGA dataset; (B) KM curve of the risk model constructed based on 4 genes in TCGA dataset; (C) ROC curve of the risk model constructed based on 4 genes in the GSE17536 dataset; (D) KM curve of the risk model constructed based on 4 genes in the GSE17536 dataset; (E) ROC curve of the risk model constructed based on 4 genes in the GSE17537 dataset; (F) KM curve of the risk model constructed based on 4 genes in the GSE17537 dataset.
(A) ROC curve of the risk model constructed based on 4 genes in TCGA dataset; (B) KM curve of the risk model constructed based on 4 genes in TCGA dataset; (C) ROC curve of the risk model constructed based on 4 genes in the GSE17536 dataset; (D) KM curve of the risk model constructed based on 4 genes in the GSE17536 dataset; (E) ROC curve of the risk model constructed based on 4 genes in the GSE17537 dataset; (F) KM curve of the risk model constructed based on 4 genes in the GSE17537 dataset.The GSE17536 dataset was used to verify the robustness of the model using the abovementioned method. A risk model was constructed, and its efficiency in predicting prognosis at 1–5 years was analysed using the R software package timeROC (
). The AUC value for predicting prognosis at 1 year was 0.7. In addition, z-scores were evaluated for risk scores, and samples with risk scores of >0 were included in the high-risk group, whereas those with risk scores of <0 were included in the low-risk group. Subsequently, KM curves were plotted, and significant differences were observed between the two groups (p < 0.05) (
).The GSE17537 dataset was analysed using the same method. As shown in
, the AUC value for predicting prognosis at 1–4 years was >0.7, and substantial differences were observed between the high- and low-risk groups.
Combination of risk scores and clinicopathological features to improve survival prediction
Multivariate and univariate Cox regression analyses of the risk score and clinicopathological features showed that the risk score was the most significant prognostic factor (
). A nomogram integrating the risk scores and other clinicopathological parameters was constructed for quantifying the risk assessment and survival probability of patients with CRC (
). The risk score had the most influence on survival rate prediction. The predictive accuracy of the risk model was further assessed using a calibration curve (
). The calibration curve plotted for predicting prognosis at 1, 3 and 5 years and the standard curve yielded similar results, indicating the good predictive performance of the nomogram. Additionally, decision curve analysis was performed to assess the reliability of the model, and the benefits of the nomogram and risk score were found to be considerably greater than those of the extreme curve. The performance of the nomogram and risk score in predicting survival was superior to that of other clinicopathological features (
).
Figure 9
(A, B) Univariate and multivariate Cox analyses of the risk score and clinicopathological features; (C) Nomogram model; (D) Calibration curve of the nomogram (1, 3 and 5 years); (E) Decision curve of the nomogram; (F) Compared with other clinicopathological features, the nomogram exhibited a superior capacity for survival prediction.
(A, B) Univariate and multivariate Cox analyses of the risk score and clinicopathological features; (C) Nomogram model; (D) Calibration curve of the nomogram (1, 3 and 5 years); (E) Decision curve of the nomogram; (F) Compared with other clinicopathological features, the nomogram exhibited a superior capacity for survival prediction.
Prediction of the response to PD-L1 inhibitor immunotherapy via the risk model
The capability of the risk score to predict the response of patients to ICB therapy was assessed to study its association with immunotherapy. The results showed that patients with low risk scores had significant clinical benefits and prolonged OS in the anti-PD-L1 cohort (IMvigor210 cohort) (
, p < 0.05). PD-L1 inhibitors had different effects among 348 patients in the IMvigor210 cohort, which were characterised by progressive disease (PD), stable disease (SD), partial response (PR) and complete response (CR). The risk scores of patients with SD/PD were higher than that of patients with other types of reactions (
). Additionally, patients with low-risk scores experienced considerably superior treatment outcomes (
). In addition, differences in survival among patients with different CRC stages in the IMvigor210 samples were analysed. The results revealed that stage I+II samples showed substantial survival differences (
); however, stage III+IV samples did not show significant survival differences (
).
Figure 10
(A) Differences in immunotherapy responses and risk scores in the IMvigor210 cohort; (B) Immunotherapy response among different risk groups in the IMvigor210 cohort; (C) Prognostic differences between different risk groups in the IMvigor210 cohort; (D) Prognostic differences between different risk groups of patients with early-stage CRC in the IMvigor210 cohort; (E) Prognostic differences between different risk groups of patients with middle- and late-stage CRC in the IMvigor210 cohort; (F) Differences in immunotherapy response and different risk scores in the IMvigor210 cohort were analysed using the TIDE software; (G) Differences in TIDE scores and immunotherapy responses in the IMvigor210 cohort; (H) Correlation analysis between the risk and TIDE scores in the IMvigor210 cohort. *P< 0.05, **P < 0.01, ****P < 0.0001.
(A) Differences in immunotherapy responses and risk scores in the IMvigor210 cohort; (B) Immunotherapy response among different risk groups in the IMvigor210 cohort; (C) Prognostic differences between different risk groups in the IMvigor210 cohort; (D) Prognostic differences between different risk groups of patients with early-stage CRC in the IMvigor210 cohort; (E) Prognostic differences between different risk groups of patients with middle- and late-stage CRC in the IMvigor210 cohort; (F) Differences in immunotherapy response and different risk scores in the IMvigor210 cohort were analysed using the TIDE software; (G) Differences in TIDE scores and immunotherapy responses in the IMvigor210 cohort; (H) Correlation analysis between the risk and TIDE scores in the IMvigor210 cohort. *P< 0.05, **P < 0.01, ****P < 0.0001.Furthermore, differences in immunotherapy and chemotherapy responses were analysed among patients in the IMvigor210 cohort. The risk model was used to assess the potential clinical impacts of immunotherapy using the TIDE (http://tide.dfci.harvard.edu/) software. The likelihood of immune escape increased with increasing TIDE prediction scores, indicating that patients were less likely to benefit from immunotherapy. With regard to immunotherapy, the risk and TIDE scores of patients unresponsive to immunotherapy were found to be higher, which also showed that the high-risk group was less likely to benefit from immunotherapy (
). In addition, Pearson correlation analysis revealed a strong positive correlation between the TIDE and risk scores (
).
Discussion
The proliferation of connective tissue is one of the key hallmarks of tumours, and the components involved in proliferation include fibroblasts, macrophages, immune cells and dense ECM (18). Fibroblasts are the main cell type in ECM, which are called CAFs. Recently, a consensus statement was issued, which stated that cancer cells with slender morphology; a lack of mutations and negative markers of epithelial cells, endothelial cells and leukocytes may be considered CAFs (19). The characteristic markers of CAFs are α-SMA and fibroblast-activating protein (FAP), and the expression of fibroblast-specific protein 1 (FSP1), platelet-derived growth factor receptor (PDGFR)-α/β and vimentin is high in CAFs. These proteins are transcribed from ACTA2, FAP, PDGFRB and NOTCH3 genes, respectively. Because morphological features are subjective and not conducive to quantification, we used ACTA2, FAP, PDGFRB and NOTCH3 genes as markers to screen for CAFs in CRC samples via single-cell sequencing. Compared with single-cell sequencing technology, the traditional transcriptome sequencing technology (bulk RNA-seq) is based on tissue samples (cell population), which reflects the average expression level of genes in the cell population. However, several studies have indicated that CAF is heterogeneous, and certain CAF subtypes stimulate tumour growth, whereas some inhibit it. For instance, in a study by Costa et al., CAF subgroup 1 created an immunosuppressive microenvironment by suppressing CD4+CD25+ T cells in breast cancer (20). Su et al. (21) reported that the new subset, CD10+GPR77+ CAFs, can facilitate the formation of tumours in patients with breast and lung cancers. Therefore, conventional sequencing technology cannot reflect the role of CAFs in tumours. In this study, cells in subgroup 9 mainly expressed ACTA2, FAP, PDGFRB and NOTCH3 and were, therefore, defined as fibroblasts. The fibroblasts of subgroup 9 were extracted, subjected to cluster analysis and further divided into four subgroups. KEGG enrichment analysis of marker genes in each subgroup revealed that the genes were mainly enriched in pathways associated with ‘ECM’ and ‘focal adhesion’, which play an important role in tumours. However, this finding does not indicate that the four CAF subgroups play the same role in tumours.Consistent with previous studies, this study revealed that the four CAF subpopulations may play different or contradictory roles in tumours. The distribution of malignant and non-malignant cells among the CAF subpopulations was significantly different. In the CAF_0 subpopulation, the proportion of malignant cells was higher, and that of cells with MSI-H was lower. However, in the other three subpopulations, the proportion of malignant cells was lower, and that of cells with MSI-H was higher. Furthermore, single-cell sequencing was used to screen for marker genes in the CAF subgroups, and the scores of the subgroups in TCGA dataset were calculated via ssGSEA. The results showed that the scores of the CAF_0 subgroup were higher in cancer tissues, and those of CAF_1, CAF_2 and CAF_3 subgroups were higher in adjacent tissues, which was consistent with the previous results, that is, the proportion of malignant tumour cells was higher in the in CAF_0 subgroup and lower in the other three subgroups. However, no significant differences in prognosis were observed among the four subgroups, and subgroups with high gene expression had a better prognosis. This finding indicates that CAFs in the same subgroup have some heterogeneity and hence cannot adequately predict the survival of patients in different subgroups.Several studies have shown that CAFs promote tumour progression in various ways, such as by remodelling ECM (22, 23), interfering with drug delivery (24), producing collagen in ECM and regulating the hardness of the tumour matrix (25). CAFs can secrete chemokines (26, 27) and cytokines (28), leading to lymphatic angiogenesis (29), so as to promote the endocrine function of cancer cells. In addition, they change the immune cell environment by recruiting immunosuppressive cells and inhibiting the activity of immune effector cells (30). In this study, the role of different CAF subtypes in tumorigenesis and development of CRC was examined, and the scores of 10 tumour-related pathways in 4 CAF subtypes were compared between malignant and non-malignant cells. The PI3K pathway was found to be highly expressed in malignant cells. Studies have shown that the PI3K pathway promotes tumour progression. The EphA2–PI3K signal can simulate angiogenesis induced by CAFs in gastric cancer cells (31). CAF-derived HGF promotes cell proliferation and drug resistance by upregulating the c-Met/PI3K/Akt and GRP78 signalling pathways in ovarian cancer cells (32). The results of this study are consistent with those of previous studies, suggesting that CAFs promote tumour progression through the PI3K pathway.To decrease the heterogeneity among subgroups, the marker genes of different CAF subgroups were used to classify CAFs. After differential expression analysis, four genes were selected via lasso regression analysis, namely, TCF7L1, FLNA, GPX3 and MMP11. TCF7L1 is a member of the TCF/lymphoid enhancer (LEF) family of transcription factors, which is involved in maintaining stem cell pluripotency (33) and skin epithelial tissue homeostasis (34). Studies have shown that ectopic TCF7L1 expression impairs the growth and invasion of highly metastatic breast cancer cells (35). In addition, overexpression of TCF7L1 can induce the growth of colorectal tumour cells (36). FLNA, the most abundant and widely distributed member of the filamin family, is a non-muscle actin filament cross-linked protein (37). Some studies have shown that FLNA is associated with multiple functional non-cytoskeletal proteins and participates in several related pathways regulating cell migration and adhesion (38). FLNA acts as a pro-oncoprotein in various human malignancies, including metastatic melanoma and hepatocellular carcinoma (39, 40). However, the expression of FLNA is decreased in breast cancer, which is negatively correlated with lymph node metastasis. FLNA knockout can promote cell migration and invasion (41). In CRC, FLNA promotes chemotherapy resistance by inducing epithelial–mesenchymal transformation and the Smad2 signalling pathway (42). Therefore, the controversial role of FLNA in human malignant tumours has been reported in several studies. GPX3 is a tumour suppressor gene and the main antioxidant enzyme in plasma. It plays an important role in scavenging hydrogen peroxide and other oxygen free radicals and protecting cells from oxidative stress-induced damage (43–45). As an important member of the MMP family, MMP11 regulates a series of physiological processes and signalling events, manipulates some bioactive molecules on the cell surface, changes the biological behaviour of cells and plays an important role in TME (46, 47). In addition, studies have shown that MMP is closely related to tumorigenesis. The most important MMP is MMP11, which is overexpressed in tumours and participates in the proliferation and malignant development of tumour cells (48, 49). However, according to previous studies, CAFs can also degrade ECM by releasing MMPs and synthesising new matrix proteins to provide structural support for tumour invasion and angiogenesis (50, 51). Therefore, MMP11 can be used for the evaluation of prognosis.The four genes identified via lasso regression were subjected to enrichment analysis, and 22 significantly related pathways were identified including those associated with ‘angiogenesis’, ‘apical junction’, ‘apoptosis’ and ‘IL2–STAT5’. The four key genes were used to establish a prognostic risk model, which had good stability and accuracy in predicting prognosis in both training and validation sets. The prognosis of patients in the high-risk group was worse. To quantify the risk assessment and survival probability of patients, the risk score was combined with other clinicopathological features, and it was found that the risk score adequately predicted clinicopathological features, especially the M stage, indicating that patients with high risk scores may be more predisposed to distant metastasis. In addition, to examine the relationship between the risk score and immunotherapy, the ability of risk score to predict the response of patients to ICB therapy was examined. Patients with low risk scores had significant clinical benefits and significantly prolonged OS in the anti-PD-L1 cohort. Furthermore, mutation analysis of the four genes in TCGA cohort revealed that FLNA had the highest mutation frequency, and there was no significant collinearity among the mutations of the four genes. Moreover, only a few samples had copy number amplification/deletion. Because the mutation frequency of the four genes is not significant, their role may be directly realised through their expression levels.Furthermore, the correlation between the prognostic risk model and infiltrating immune cells was analysed, and a significant positive correlation was observed between the four genes and immune scores, indicating that high gene expression increased the abundance of infiltrating immune cells in ECM. Moreover, these four genes had a significant negative correlation with T cell-related scores. Therefore, CAFs labelled by these genes can promote tumour progression by inhibiting T-cell function. This result is consistent with that of previous studies. CAFs can induce immune evasion of cancer cells (52, 53) and restrict the recruitment of immune effector cells (such as CD8+ T cells) to tumour tissues by secreting different chemokines (54). In this study, a significant positive correlation was observed between the four genes and the score of macrophages, which is consistent with the finding of a previously reported study, indicating that CAF can induce M2 polarisation (55). These results suggest that the interaction between stromal cells and immune-related cells in TME promotes tumour progression.However, this study has certain limitations. First, the results of single-cell sequencing were not verified in actual clinical samples. The screened key genes lack basic in vivo and in vitro experimental verification, and the prognostic model should be verified in actual clinical samples, which is our next research direction. In addition, there are some contradictory and unexplained results. For example, the distribution of different CAF subpopulations among malignant and normal cells is different; however, the prognosis among these populations was not different. Whether their distribution in malignant cells also plays an important role warrants further investigation and verification.In conclusion, the fibroblast population screened via single-cell sequencing in CRC was divided into four subpopulations through cluster analysis. The distribution and role of these four subpopulations are different in CRC. In addition, by analysing the differential expression of the main marker genes in these subpopulations, four representative genes were identified via lasso regression, namely, TCF7L1, FLNA, GPX3 and MMP11. Using the prognostic risk model constructed based on the expression of these four genes, patients with CRC were divided into the high- and low-risk groups. Patients with low risk scores had significant clinical benefits from immunotherapy and had significantly prolonged OS, which may be attributed to inhibition of T-cell function in the immune microenvironment and promotion of the function of tumour-associated macrophages.
Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/
.
Author contributions
JZ performed the study and wrote the paper. YC edited and proofread the paper. All authors contributed to the article and approved the submitted version.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Authors: Angela N Giaquinto; Kimberly D Miller; Katherine Y Tossas; Robert A Winn; Ahmedin Jemal; Rebecca L Siegel Journal: CA Cancer J Clin Date: 2022-02-10 Impact factor: 286.130
Authors: Michal Slyper; Amit Shahar; Anat Bar-Ziv; Roy Z Granit; Tamar Hamburger; Bella Maly; Tamar Peretz; Ittai Ben-Porath Journal: Cancer Res Date: 2012-10-22 Impact factor: 12.701
Authors: Michael A Jacobetz; Derek S Chan; Albrecht Neesse; Tashinga E Bapiro; Natalie Cook; Kristopher K Frese; Christine Feig; Tomoaki Nakagawa; Meredith E Caldwell; Heather I Zecchini; Martijn P Lolkema; Ping Jiang; Anne Kultti; Curtis B Thompson; Daniel C Maneval; Duncan I Jodrell; Gregory I Frost; H M Shepard; Jeremy N Skepper; David A Tuveson Journal: Gut Date: 2012-03-30 Impact factor: 23.059
Authors: Alexandra Fullár; József Dudás; Lászlóné Oláh; Péter Hollósi; Zoltán Papp; Gábor Sobel; Katalin Karászi; Sándor Paku; Kornélia Baghy; Ilona Kovalszky Journal: BMC Cancer Date: 2015-04-11 Impact factor: 4.430