
Identification of a 13-gene-based classifier as a potential biomarker to predict the effects of fluorouracil-based chemotherapy in colorectal cancer.

Zuhuan Gan, Qiyuan Zou, Yan Lin, Zihai Xu, Zhong Huang, Zhichao Chen, Yufeng Lv.

Abstract

The aim of the current study was to develop a predictor classifier for response to fluorouracil-based chemotherapy in patients with advanced colorectal cancer (CRC) using microarray gene expression profiles of primary CRC tissues. Using two expression profiles downloaded from the Gene Expression Omnibus database, differentially expressed genes (DEGs) between responders and non-responders to fluorouracil-based chemotherapy were identified. A total of 791 DEGs, including 303 that were upregulated and 488 that were downregulated in responders, were identified. Functional enrichment analysis revealed that the DEGs were primarily involved in 'cell mitosis', 'DNA replication' and 'cell cycle' signaling pathways. Following feature selection using two methods, a random forest classifier for response to fluorouracil-based chemotherapy with 13 DEGs was constructed. The accuracy of the 13-gene classifier was 0.930 in the training set and 0.810 in the validation set. The receiver operating characteristic curve analysis revealed that the area under the curve was 1.000 in the training set and 0.873 in the validation set (P=0.227). The 13-gene-based classifier described in the current study may be used as a potential biomarker to predict the effects of fluorouracil-based chemotherapy in patients with CRC.

Keywords:  colorectal cancer; differential expression genes; fluorouracil-based chemotherapy; random forest classifier

Year:  2019        PMID: 31186717      PMCID: PMC6507297          DOI: 10.3892/ol.2019.10159

Source DB:  PubMed          Journal:  Oncol Lett        ISSN: 1792-1074            Impact factor:   2.967


Introduction

Colorectal cancer (CRC) is the third most commonly diagnosed cancer in males and the second in females, and it is one of the most common causes of cancer mortality (1). Localized CRCs are amenable to curative surgical resection; however, ~25% of patients present with metastatic disease and ~50% of patients will develop metastases (2). Fluorouracil-based chemotherapy remains the primary treatment for metastatic CRC (3). 5-fluorouracil (5-FU) alone has an objective response rate of ~20% (4). The addition of irinotecan or oxaliplatin to 5-FU increases the objective response rate to ~50% (5). The effects of 5-FU/leucovorin combined with irinotecan (FOLFIRI) or oxaliplatin (FOLFOX) in the first-line treatment of metastatic CRC are comparable (6). In the last decade, the addition of targeted therapies to these chemotherapy regimens has improved the therapeutic approach and significantly increased progression-free survival and overall survival times (7–9). However, ~50% of patients are resistant to fluorouracil-based chemotherapy. In addition, the side effects of systemic chemotherapy, including neurotoxicity, myelotoxicity and gastrointestinal toxicity, may have a major impact on the quality of life of the patients and may lead to life-threatening complications (3). Therefore, effective strategies that predict response to chemotherapy are required. Using these strategies, patients who are predicted not to respond to chemotherapy may receive other potentially effective treatments as early as possible and avoid unnecessary side effects. Gene expression profiling is used to predict the clinical outcome of patients with CRC (10–12). Previous studies have revealed that gene expression profiling may be used to predict response to chemotherapy in cancers including breast cancer and CRC (13–15).
The aim of the present study was to develop a predictor classifier for response to fluorouracil-based chemotherapy in patients with advanced CRC using microarray gene expression profiles of primary CRC tissues.

Materials and methods

Data processing

The raw microarray data (CEL files) of three datasets [GSE52735 (16), GSE62080 (15) and GSE69657 (17)] and corresponding clinical data were downloaded from the Gene Expression Omnibus database (www.ncbi.nlm.nih.gov/geo). The microarray data of the three datasets were based on the GPL570 Affymetrix Human Genome U133 Plus 2.0 Array platform (Affymetrix; Thermo Fisher Scientific Inc., Waltham, MA, USA). The GSE52735 dataset contained 37 advanced CRC samples treated with a fluoropyrimidine-based chemotherapy regimen (specific chemotherapy regimens were not available). A total of 23 of the samples were classified as responders and 14 samples were classified as non-responders to the chemotherapy regimen according to Response Evaluation Criteria in Solid Tumors (RECIST) (18). The GSE62080 dataset contained 21 advanced CRC samples treated with the FOLFIRI regimen. A total of 9 samples were classified as responders and 12 samples were classified as non-responders according to the World Health Organization (WHO) criteria (19). The GSE69657 dataset contained 30 advanced CRC samples treated with the FOLFOX4 regimen. However, the raw microarray data were available for only 16 samples. A total of 7 of these samples were classified as responders and 9 samples were classified as non-responders according to RECIST. Two different evaluation criteria were used in these three studies owing to the long time intervals between them; however, previous studies have revealed that the RECIST criteria are comparable with the WHO criteria in evaluating the response of solid tumors (20–23). Preprocessing and normalization of the raw data were performed using the ‘affy’ (version 3.8) package (24) in R (www.r-project.org; version 3.5), using robust multi-array average for background correction and quantiles for normalization. Kernel and nearest neighbor averaging methods were used to impute the missing values using the ‘impute’ package (bioconductor.org/packages/impute; version 3.8) in R.
The ComBat function in the ‘sva’ (version 3.8) package (25) was applied to remove batch effects. If one gene matched multiple probes, the average value of the probes was calculated as the expression of the corresponding gene. To build a robust predictive classifier, the GSE52735 and GSE62080 datasets were used as the training set (n=58), while the GSE69657 dataset was used as the validation set (n=16).
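The probe-averaging step described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the function name, probe IDs, gene symbols and expression values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_expr, probe_to_gene):
    """Average the expression of all probes that map to the same gene.

    probe_expr:    dict of probe ID -> list of expression values (one per sample)
    probe_to_gene: dict of probe ID -> gene symbol
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_expr.items():
        gene = probe_to_gene.get(probe_id)
        if gene is not None:  # skip probes without a gene annotation
            grouped[gene].append(values)
    # zip(*rows) walks the samples column by column across a gene's probes
    return {gene: [mean(col) for col in zip(*rows)] for gene, rows in grouped.items()}


# Hypothetical toy data: two probes for GENE_A, one for GENE_B, three samples
probe_expr = {
    "probe_1": [2.0, 4.0, 6.0],
    "probe_2": [4.0, 6.0, 8.0],
    "probe_3": [1.0, 1.5, 2.0],
}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
gene_expr = collapse_probes_to_genes(probe_expr, probe_to_gene)
```

A gene with a single probe keeps that probe's values unchanged, so the step only affects genes covered by several probes.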

Screening of differentially expressed genes (DEGs) and enrichment analysis

Following preprocessing of the raw expression data, the DEGs between responders and non-responders in the training set were screened using the unpaired t-test in the ‘limma’ (version 3.8) package (26) in R. A DEG was defined as |log2 fold change (FC)|≥0.263 and P<0.05. The Gene Ontology (GO; http://geneontology.org/) and Kyoto Encyclopedia of Genes and Genomes (KEGG; http://www.genome.jp/kegg/) pathway enrichment analyses of DEGs were performed using the ‘clusterProfiler’ (version 3.8) package (27) in R with a cut-off of q<0.01.
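The DEG cut-off can be illustrated with a short sketch. It assumes the log2 FC and P-value for each gene have already been computed (in the study, by ‘limma’), and that a positive log2 FC means higher expression in responders, as in the responder/non-responder ratio reported in Table I. The gene names and values below are hypothetical. Note that a log2 FC of 0.263 corresponds to roughly a 1.2-fold change.

```python
LOG2FC_CUTOFF = 0.263  # threshold stated in the text
P_CUTOFF = 0.05        # threshold stated in the text


def classify_deg(log2_fc, p_value):
    """Return 'up' or 'down' for a DEG, or None if the gene fails the cut-off."""
    if p_value >= P_CUTOFF or abs(log2_fc) < LOG2FC_CUTOFF:
        return None
    return "up" if log2_fc > 0 else "down"


# Hypothetical (log2 FC, P-value) pairs, not taken from the study
genes = {
    "GENE_A": (0.957, 0.003),    # passes both thresholds, upregulated
    "GENE_B": (-0.625, 0.0004),  # passes both thresholds, downregulated
    "GENE_C": (0.200, 0.001),    # fold change too small
    "GENE_D": (-0.900, 0.200),   # P-value too large
}
calls = {gene: classify_deg(fc, p) for gene, (fc, p) in genes.items()}

# Linear fold change implied by the log2 threshold (about 1.2)
fold_change = 2 ** LOG2FC_CUTOFF
```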

Principal component analysis (PCA) prior to and following feature selection using the least absolute shrinkage and selection operator (LASSO) method

The expression values of DEGs in each sample were extracted. The LASSO logistic regression model analysis was performed using the ‘glmnet’ package (CRAN.R-project.org/package=glmnet; version 2.0-16) in R. The LASSO method selects features from high-dimensional microarray data that have strong predictive value and low correlation with one another, which helps to prevent over-fitting (28). In the training set, the LASSO logistic regression model was used to select the optimal predictive markers. PCA using the expression profiles of the DEGs was performed prior to feature selection using the LASSO method. PCA was subsequently performed using the expression profiles of the optimal DEGs identified by the LASSO method. Samples were plotted in two-dimensional plots across the first two principal components.
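For reference, the objective minimized by LASSO-penalized logistic regression can be written as below. The article does not state the formula; this is the standard form for a binary outcome, under the assumption that y_i = 1 denotes a responder, x_i is the vector of DEG expression values for sample i, N is the number of training samples and λ is the tuning parameter.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\left[\, y_i\left(\beta_0 + x_i^{\top}\beta\right)
- \log\!\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right) \right]
+ \lambda \lVert \beta \rVert_1
```

The L1 penalty shrinks some coefficients exactly to zero, so the genes with non-zero coefficients at the chosen λ form the selected feature set.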

Feature selection using Boruta and random forest classifier construction

A lower-dimensional model may reduce costs and is more likely to be used by clinicians (29). Following DEG selection by the LASSO method, a second feature selection was performed using the ‘Boruta’ package (www.jstatsoft.org/article/view/v036i11; version 6.0.0) in R. Boruta is a random forest-based feature selection method, which provides an unbiased and stable selection of important and non-important attributes from an information system. A variable importance (VIMP) measure may be calculated and visualized based on Boruta. In the current study, DEGs selected by Boruta were used to develop a gene-based classifier for response to fluorouracil-based chemotherapy in advanced CRCs. The random forest classifier was developed using the ‘randomForest’ package (CRAN.R-project.org/package=randomForest; version 4.6-14) in R. The validation set (GSE69657) was used to confirm the robustness and transferability of the classifier. The performance of the classifier was assessed by accuracy, sensitivity (Se), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV) and receiver operating characteristic (ROC) curves in the training and validation sets. The ROC curves were drawn and compared using the ‘pROC’ (version 1.13.0) package (30) in R.
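The performance measures listed above follow directly from the confusion matrix. The sketch below uses the standard definitions and assumes that responders are treated as the positive class, which the article does not state explicitly; the counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp/fn: positives (assumed here to be responders) predicted correctly/incorrectly
    tn/fp: negatives (assumed non-responders) predicted correctly/incorrectly
    """
    return {
        "sensitivity": tp / (tp + fn),  # Se: fraction of true positives detected
        "specificity": tn / (tn + fp),  # Sp: fraction of true negatives detected
        "ppv": tp / (tp + fp),          # PPV: predicted positives that are correct
        "npv": tn / (tn + fn),          # NPV: predicted negatives that are correct
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for a 20-sample cohort
metrics = classification_metrics(tp=8, fp=2, tn=7, fn=3)
```

PPV and NPV depend on the proportion of responders in the cohort, so they are not directly comparable between cohorts with different response rates.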

Results

DEGs in responders and non-responders and enrichment analysis

The training set included 32 responders and 26 non-responders. According to the cut-off criteria (|log2FC|≥0.263 and P<0.05), 791 genes were identified as differentially expressed between responders and non-responders. A total of 303 genes were upregulated and 488 genes were downregulated in responders. Functional enrichment analysis revealed that the DEGs were primarily involved in the ‘cell mitosis’, ‘DNA replication’ and ‘cell cycle’ biological processes and signaling pathways. The results of enrichment analysis are presented in Fig. 1.
Figure 1.

Significantly enriched GO annotation and enriched KEGG pathways of differentially expressed genes. GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes.

PCA and feature selection using LASSO

For the first feature selection, LASSO logistic regression was performed using the expression data of DEGs in the training set. The tuning parameter was selected by 10-fold cross-validation, the default setting. A total of 31 DEGs with non-zero regression coefficients were identified as optimal genes (Fig. 2A; Table I). Fig. 2B presents the results of PCA prior to feature selection using LASSO and Fig. 2C presents the results of PCA following feature selection using LASSO. As demonstrated in Fig. 2C, responders and non-responders are easily distinguished using the 31 DEGs selected by LASSO.
Figure 2.

LASSO model and principal component analysis. (A) 10-fold cross-validation for tuning parameter selection in the LASSO model. (B) PCA prior to and (C) following LASSO variable reduction. LASSO, least absolute shrinkage and selection operator; PCA, principal component analysis.

Table I.

Overview of the 31 optimal genes.

| Gene | Log2 fold change (responder/non-responder) | P-value | Coefficient provided by LASSO | Variable importance provided by Boruta |
|---|---|---|---|---|
| Matrix metallopeptidase 12 | −1.069 | 0.002 | −0.154 | Tentative |
| C-X-C motif chemokine ligand 11 | −1.016 | 0.015 | −0.184 | Rejected |
| Forkhead box P2 | 0.957 | 0.003 | 0.032 | Tentative |
| Small muscle protein X-linked | 0.766 | 0.003 | 0.575 | Confirmed |
| Pleckstrin homology like domain family A member 1 | −0.625 | 0.000 | −0.584 | Confirmed |
| Prostaglandin reductase 2 | −0.602 | 0.000 | −0.792 | Confirmed |
| Chitinase 1 | 0.569 | 0.002 | 0.976 | Confirmed |
| S100 calcium binding protein A2 | −0.541 | 0.039 | −0.091 | Rejected |
| Histone cluster 1 H2B family member c | 0.539 | 0.001 | 0.927 | Confirmed |
| RP1-74M1.3 | −0.515 | 0.005 | −0.023 | Tentative |
| Formin homology 2 domain containing 3 | 0.469 | 0.013 | 0.855 | Confirmed |
| RNA binding motif protein 3 | −0.451 | 0.001 | −0.555 | Tentative |
| Tubulin polymerization promoting protein family member 3 | 0.412 | 0.011 | 0.442 | Rejected |
| Cadherin related family member 2 | 0.411 | 0.047 | 0.913 | Tentative |
| OTUD6B antisense RNA 1 (head to head) | 0.387 | 0.012 | 0.651 | Confirmed |
| Teashirt zinc finger homeobox 1 | 0.384 | 0.004 | 0.347 | Tentative |
| Cholinergic receptor nicotinic β1 subunit | −0.365 | 0.000 | −3.574 | Confirmed |
| Stromal antigen 3-like 4 (pseudogene) | −0.364 | 0.005 | −0.554 | Rejected |
| RPA interacting protein | −0.343 | 0.000 | −0.825 | Confirmed |
| Leucine rich repeat neuronal 1 | −0.334 | 0.017 | −0.331 | Rejected |
| Heparan-α-glucosaminide N-acetyltransferase | 0.334 | 0.006 | 1.374 | Rejected |
| MINDY lysine 48 deubiquitinase 3 | −0.320 | 0.001 | −0.253 | Tentative |
| THAP domain containing 5 | −0.308 | 0.016 | −0.432 | Rejected |
| DNA ligase 4 | 0.298 | 0.002 | 1.692 | Confirmed |
| Zinc finger protein 2 | −0.291 | 0.004 | −1.885 | Tentative |
| ASAP1 intronic transcript 2 | 0.289 | 0.004 | 0.045 | Confirmed |
| Small integral membrane protein 30 | −0.287 | 0.001 | −0.973 | Confirmed |
| c-Maf inducing protein | 0.282 | 0.001 | 0.208 | Confirmed |
| ADAMTS like 2 | 0.278 | 0.005 | 1.088 | Tentative |
| Nucleoporin 133 | −0.273 | 0.011 | −1.718 | Tentative |
| DEAD-box helicase 28 | −0.267 | 0.003 | −0.063 | Tentative |

The Boruta function was used to further select features among the 31 DEGs. A total of 13 genes were confirmed as important, 7 genes were rejected and 11 genes remained tentative (Table I).

Fig. 3 presents the variables' importance. These 13 important DEGs included small muscle protein X-linked, pleckstrin homology like domain family A member 1 (PHLDA1), prostaglandin reductase 2 (PTGR2), chitinase 1 (CHIT1), histone cluster 1 H2B family member c, formin homology 2 domain containing 3, OTUD6B antisense RNA 1 (head to head), cholinergic receptor nicotinic β1 subunit (CHRNB1), RPA interacting protein, DNA ligase 4 (LIG4), ASAP1 intronic transcript 2, small integral membrane protein 30 and c-Maf inducing protein. A random forest classifier was constructed using these 13 important DEGs.
Figure 3.

Z score evolution during the Boruta run. Green lines correspond to confirmed attributes, yellow lines to tentative attributes and red lines to rejected attributes; the blue lines correspond to the minimal, average and maximal shadow attribute importance, respectively.

Performance of the gene-based classifier

The accuracy of the 13-gene classifier was 0.930 in the training set and 0.810 in the validation set. Based on accuracy, Se, Sp, PPV, NPV and area under curve (AUC) values, the sample recognition efficiency of the classifier was high (Table II). ROC curve analysis revealed that the AUC was 1.000 in the training set and 0.873 in the validation set (P=0.227; Fig. 4).
Table II.

Performance of the 13-gene classifier.

| Cohort | Sensitivity | Specificity | Positive predictive value | Negative predictive value | Accuracy | Area under the curve |
|---|---|---|---|---|---|---|
| Training set | 0.970 | 0.960 | 0.910 | 0.960 | 0.930 | 1.000 |
| Validation set | 0.860 | 0.880 | 0.750 | 0.880 | 0.810 | 0.873 |

Figure 4.

Receiver operating characteristic curves for training and validation sets. AUC, area under the curve.
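The AUC can be read as the probability that a randomly chosen responder receives a higher classifier score than a randomly chosen non-responder. The sketch below computes it by direct pairwise comparison; it is an illustration of the quantity rather than the ‘pROC’ implementation used in the study, it does not reproduce the statistical comparison of the two curves, and the scores are hypothetical.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    A pair counts 1 when the positive sample scores higher than the
    negative sample, 0.5 when the scores are tied, and 0 otherwise.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted probabilities of response
responder_scores = [0.9, 0.8, 0.6, 0.4]
non_responder_scores = [0.7, 0.3, 0.2, 0.4]
auc = roc_auc(responder_scores, non_responder_scores)

# Perfect separation: every responder outscores every non-responder
auc_perfect = roc_auc([0.9, 0.8], [0.1, 0.2])
```

Under this reading, an AUC of 1.000 means that every responder in the cohort was scored above every non-responder.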

Discussion

Personalized treatment may improve the treatment outcome of patients with tumors (31). In CRC, the gene expression levels of vascular endothelial growth factor (VEGF) and epidermal growth factor receptor (EGFR) provide the basis for selecting EGFR and VEGF inhibitor combinations (32–36). Monoclonal antibodies against VEGF and EGFR have been approved for treatment of metastatic CRC in combination with 5-FU-based regimens (3). The identification of subsets of patients that respond to specific chemotherapy regimens remains a challenge (3). A previous study demonstrated that tumors with microsatellite instability (MSI) respond well to 5-FU-based therapies; however, further studies are required to substantiate these results (37). Another previously published study suggested that MSI status does not affect the outcome of the treatment (38). Therefore, effective tools for predicting the outcome of chemotherapy are currently lacking. The present study identified 13 genes from 791 DEGs using two feature selection algorithms and developed a 13-gene predictor classifier for response to fluorouracil-based chemotherapy in CRC. The predictor classifier demonstrated high accuracy in the training and validation sets. The training set included two datasets from different centers, and the validation set was from an additional independent center. ROC curve analysis revealed that the AUC was 1.000 in the training set and 0.873 in the validation set, and their difference was not significant (P=0.227). These results suggested that the classifier was robust. The study established a foundation for further research into personalized treatment of CRC. Previous studies have attempted to identify a single biomarker to predict response to fluorouracil-based chemotherapy in CRC (17,39–42). However, there is currently no single biomarker that is routinely applied in clinical practice. 
CRC is a heterogeneous disease, and this heterogeneity is compounded by changes in the molecular profile of the tumor as it progresses (3). An in vitro study demonstrated that the measurement of multiple marker genes, rather than a single marker gene, may provide a more accurate assessment of drug response in colon carcinoma (43). Previous studies have been designed to identify a pattern of gene expression capable of predicting response to fluorouracil-based chemotherapy in CRC (15,16). One study identified a set of 14 genes for predicting response to the FOLFIRI regimen based on 21 samples (15), and an expression profile of 7 genes was identified in another study (16). Compared with the two aforementioned studies, the current study performed a comprehensive analysis of more samples (n=58) from two centers and validated the predictor classifier in an independent dataset (n=16). Furthermore, to the best of the authors' knowledge, the current study is the first to construct a random forest classifier to predict response to chemotherapy in CRC. Considering the limited ability of Cox regression analysis to process high-dimensional data (44), it was not performed in the current study. A random forest algorithm was used to construct the classifier, which was subsequently validated with an independent dataset. The results obtained in the current study suggest that the robust classifier developed warrants further investigation. Functional enrichment analysis revealed that certain DEGs identified in the present study are involved in DNA replication and cell cycle pathways; however, none of the 13 genes were involved in these two signaling pathways. A previous study suggested that PHLDA1 may be associated with CRC progression (45). Another study demonstrated that PTGR2 knockdown rendered gastric cancer cells more sensitive to cisplatin and 5-FU compared with PTGR2-overexpressing cells (46).
In addition, two variants of CHIT1, rs61745299 and rs35920428, may increase expression of the gene and have been associated with CRC (47). CHRNB1 may be a biomarker for the detection of relapsed and early relapsed CRC (48). In addition, LIG4 may mediate Wnt signaling-induced radioresistance in CRC (49). With the exception of the aforementioned studies, the association between the 13 genes identified in the current study and CRC or chemotherapy has not been investigated. Therefore, it is not clear whether these genes are causal or merely markers for response to fluorouracil-based chemotherapy in CRC. Although the current study provides novel insights into the treatment of CRC, it has some limitations. The present study was based on a relatively small sample size; however, it is worth noting that the sample size in the present study is relatively large compared with previous studies (15,16). Future studies are required to verify and improve the 13-gene signature in a larger independent cohort of patients. In conclusion, the current study identified a 13-gene predictor classifier for the response to fluorouracil-based chemotherapy in patients with advanced CRC.

References

1.  New guidelines to evaluate the response to treatment in solid tumors. European Organization for Research and Treatment of Cancer, National Cancer Institute of the United States, National Cancer Institute of Canada.

Authors:  P Therasse; S G Arbuck; E A Eisenhauer; J Wanders; R S Kaplan; L Rubinstein; J Verweij; M Van Glabbeke; A T van Oosterom; M C Christian; S G Gwyther
Journal:  J Natl Cancer Inst       Date:  2000-02-02       Impact factor: 13.506

2.  affy--analysis of Affymetrix GeneChip data at the probe level.

Authors:  Laurent Gautier; Leslie Cope; Benjamin M Bolstad; Rafael A Irizarry
Journal:  Bioinformatics       Date:  2004-02-12       Impact factor: 6.937

3.  Comparison of TP53 mutations identified by oligonucleotide microarray and conventional DNA sequence analysis.

Authors:  W H Wen; L Bernstein; J Lescallett; Y Beazer-Barclay; J Sullivan-Halley; M White; M F Press
Journal:  Cancer Res       Date:  2000-05-15       Impact factor: 12.701

4.  Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays.

Authors:  D A Notterman; U Alon; A J Sierk; A J Levine
Journal:  Cancer Res       Date:  2001-04-01       Impact factor: 12.701

5.  Prediction of docetaxel response in human breast cancer by gene expression profiling.

Authors:  Kyoko Iwao-Koizumi; Ryo Matoba; Noriko Ueno; Seung Jin Kim; Akiko Ando; Yasuo Miyoshi; Eisaku Maeda; Shinzaburo Noguchi; Kikuya Kato
Journal:  J Clin Oncol       Date:  2005-01-20       Impact factor: 44.544

6.  FOLFIRI followed by FOLFOX6 or the reverse sequence in advanced colorectal cancer: a randomized GERCOR study.

Authors:  Christophe Tournigand; Thierry André; Emmanuel Achille; Gérard Lledo; Michel Flesh; Dominique Mery-Mignard; Emmanuel Quinaux; Corinne Couteau; Marc Buyse; Gérard Ganem; Bruno Landi; Philippe Colin; Christophe Louvet; Aimery de Gramont
Journal:  J Clin Oncol       Date:  2003-12-02       Impact factor: 44.544

7.  A randomized controlled trial of fluorouracil plus leucovorin, irinotecan, and oxaliplatin combinations in patients with previously untreated metastatic colorectal cancer.

Authors:  Richard M Goldberg; Daniel J Sargent; Roscoe F Morton; Charles S Fuchs; Ramesh K Ramanathan; Stephen K Williamson; Brian P Findlay; Henry C Pitot; Steven R Alberts
Journal:  J Clin Oncol       Date:  2003-12-09       Impact factor: 44.544

8.  Gene expression profiles and molecular markers to predict recurrence of Dukes' B colon cancer.

Authors:  Yixin Wang; Tim Jatkoe; Yi Zhang; Matthew G Mutch; Dmitri Talantov; John Jiang; Howard L McLeod; David Atkins
Journal:  J Clin Oncol       Date:  2004-03-29       Impact factor: 44.544

9.  Gene expression profiling-based prediction of response of colon carcinoma cells to 5-fluorouracil and camptothecin.

Authors:  John M Mariadason; Diego Arango; Qiuhu Shi; Andrew J Wilson; Georgia A Corner; Courtney Nicholas; Maria J Aranes; Martin Lesser; Edward L Schwartz; Leonard H Augenlicht
Journal:  Cancer Res       Date:  2003-12-15       Impact factor: 12.701

10.  Measuring response in solid tumors: comparison of RECIST and WHO response criteria.

Authors:  Joon Oh Park; Soon Il Lee; Seo Young Song; Kihyun Kim; Won Seog Kim; Chul Won Jung; Young Suk Park; Young-Hyuk Im; Won Ki Kang; Mark Hong Lee; Kyung Soo Lee; Keunchil Park
Journal:  Jpn J Clin Oncol       Date:  2003-10       Impact factor: 3.019
