
Hybrid method for prediction of metastasis in breast cancer patients using gene expression signals.

Alireza Mehri Dehnavi, Mohammad Reza Sehhati, Hossein Rabbani.

Abstract

Primary tumor gene expression has been shown to be capable of revealing metastasis-driving gene markers for prediction of breast cancer recurrence (BCR). However, difficulties associated with the analysis of microarray data have led to poor predictive power and inconsistency among previously introduced gene signatures. In this study, a hybrid method is proposed for identifying more predictive gene signatures from microarray datasets. Initially, the parameters of a Rough-Set (RS) theory based feature selection method were tuned to construct a customized gene extraction algorithm. Afterward, the most informative genes were selected from six independent breast cancer datasets using the RS gene selection method. Then, the combined set of these six signature sets, containing 114 genes, was evaluated for prediction of BCR. Finally, a meta-signature containing 18 genes was selected from the combination of datasets, and its prediction accuracy was compared to that of the combined signature. The results of a 10-fold cross-validation test showed an acceptable misclassification error rate (MCR) over 1338 breast cancer cases. In comparison to a recent similar work, our approach achieved a reduction of more than 5% in MCR using fewer genes for prediction. The results also demonstrated a 7% improvement in average accuracy across the six utilized datasets when using the combined set of 114 genes in comparison with the 18-gene meta-signature. In this study, a more informative gene signature was selected for prediction of BCR using an RS based gene extraction algorithm. To conclude, combining different signatures yielded more stable prediction over independent datasets.


Keywords:  Breast cancer recurrence prediction; gene expression signature; meta-signature; rough-set theory

Year:  2013        PMID: 24098861      PMCID: PMC3788197     

Source DB:  PubMed          Journal:  J Med Signals Sens        ISSN: 2228-7477


INTRODUCTION

The proportion of patients who experience breast cancer recurrence (BCR) can range from 20% to 90%, depending on the stage at which the cancer is diagnosed and treated.[1] Although certain treatments can reduce the risk of recurrence, BCR prediction can help to prevent overtreatment.[2] In recent decades, poor prediction of BCR from clinical factors has prompted researchers to identify cancer markers through examination of genome-wide expression profiles.[3] Numerous studies have been performed to extract combinations of genes from messenger ribonucleic acid (mRNA) microarrays as biomarkers or cancer-related genes. All of them have used the expression levels of these genes to predict relapse or distant metastasis in breast cancer.[4,5,6,7] However, the low accuracy of the published methods,[8,9] or their limitation to clinically specific types of breast tumors,[9] has encouraged researchers to identify more general, robust and accurate prognostic markers. Meanwhile, they have considered combining previously introduced gene signatures[8] and meta-analysis of all available datasets[5] to obtain a unified set of biomarkers. In this paper, a hybrid method was built to predict BCR by finding gene sets that are informative about the recurrence event, based on Rough-Set (RS) theory.[10] Then, an appropriate supervised classifier was applied to the discovered genes. The RS method is preferred for gene selection in this study because of its proven power in dealing with vagueness in microarray data.[11,12] This method also preserves the meaning of the original feature sets, which gives it an interpretability advantage over traditional transform-based feature selection techniques.[13] A key requirement for successful use of RS is appropriate data discretization. Therefore, all expression signals were discretized into three levels, following previously successful applications.[14] Recently, different techniques have been applied to the RS algorithm to relieve its inadequacies.
However, these methods are data dependent and should be modified for gene selection from microarray data. In this regard, various criteria such as fuzzy entropy,[15,16,17] dependency and consistency[11] were investigated, and the final RS method was constructed using appropriate parameters for the generalized dependency function based on the theta-eta model of the RS domain.[11] For this purpose, different gene sets were extracted from the Wang dataset[3] using different parameters. Thereafter, the optimized algorithm was applied to the different datasets to extract the indicator genes. There is only a small overlap between published gene signatures for BCR prediction, which implies a small chance of reaching a general and robust indicator set for judging all experimental data from different studies. We therefore decided to combine the genes extracted from the different datasets. This approach gives lower prediction performance than applying the genes extracted for a specific dataset, but overall performance across all datasets increases. Among published gene-expression-based prediction methods, we chose the work of Li et al.,[7] which makes one of the strongest recent claims of robustness and accuracy, for overall comparison with our algorithm. The remainder of this paper is organized as follows. In Section 2, the utilized datasets are introduced and the applied normalization procedure is described, followed by details of the gene selection procedure and the classification of samples. Experimental results are reported in Section 3. In Section 4, we discuss the obtained results, conclude our work and outline some future work directions.

MATERIALS AND METHODS

This section gives an overview of the whole study: it introduces the data utilized in this work and then describes the proposed approach step by step.

Breast Tumor Datasets

We utilized human breast cancer microarray datasets from six studies performed on the same platform, comprising 1338 samples. The datasets are publicly available from the Gene Expression Omnibus (GEO) database and are referenced later in this paper by their GEO series code (GSE xxx), as listed in Table 1. Microarrays of the Affymetrix GeneChip Human Genome U133 Array Set (HG-U133A), also known as GEO platform 96 (GPL96), were used because of their frequent usage and the abundance of samples. This selection helps to overcome many problems that arise among different platforms, such as non-overlapping transcripts, different precision, varying relative scales, and distinct dynamic ranges of gene expression. As a first step, the transcripts present on the microarray were pruned: genes that were not significantly present across all samples,[3] and samples that were censored and had no information about relapse or metastasis, were removed from all datasets. Finally, the metadata was constructed by combining all utilized datasets.
Table 1

Summary of breast cancer microarray datasets


Microarray Data Normalization

The most important pre-processing step is normalization of the expression signals. Expression values were log2 transformed, and the base signal (log2(600) or log2(500)) was then subtracted from all data. After this, the maximum positive signal was mapped to +1 and the minimum negative signal was mapped to −1. Before determining the differentially expressed genes in the next step, we discretized the expression values into three levels. This lessens the undesirable effect of noisy data and produces more reliable results when working with different datasets.[23] Following the fuzzy discretization technique,[14] the performance of the fuzzy RS gene selection step improved, as described in the next subsection.
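
The normalization and discretization steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the separate scaling of positive and negative values, and the symmetric discretization threshold of 0.25 are assumptions, since the text does not specify them.

```python
import math

def normalize_expression(values, base_signal=600.0):
    """Log2-transform, subtract log2(base_signal), then scale into [-1, +1].

    Assumption: positive and negative values are scaled separately, so the
    largest positive value maps to +1, the most negative value maps to -1,
    and the base signal itself stays at 0.
    """
    base = math.log2(base_signal)
    centered = [math.log2(v) - base for v in values]
    max_pos = max((c for c in centered if c > 0), default=0.0)
    min_neg = min((c for c in centered if c < 0), default=0.0)
    scaled = []
    for c in centered:
        if c > 0:
            scaled.append(c / max_pos)
        elif c < 0:
            scaled.append(c / abs(min_neg))
        else:
            scaled.append(0.0)
    return scaled

def discretize_three_levels(scaled, threshold=0.25):
    """Map each scaled value to -1 (low), 0 (baseline) or +1 (high).

    The threshold value is hypothetical and chosen only for illustration.
    """
    return [1 if s > threshold else -1 if s < -threshold else 0 for s in scaled]
```

For example, raw signals of 150, 600, 2400 and 1200 with a base signal of 600 lie at −2, 0, +2 and +1 on the log2 scale, so they scale to approximately −1, 0, +1 and +0.5 before discretization.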

Selection of Prognostic Indicator Genes

RS theory is a relatively new approach for dealing with uncertainty and incompleteness, and it has become a favored technique for feature selection, especially in microarray data analysis.[11,12,13] Using RS theory for gene selection, we can screen for genes that are informative and related to the desired output. We can thus find a set of genes whose expression across diverse samples has significant predictive power regarding the recurrence event. Drawing on the various RS based feature selection methods proposed previously,[11,12,13] we generated multiple subsets of genes by applying various sub-methods with different parameters to the whole gene set. After a preliminary investigation of the specific parameters of the algorithm,[11] these parameters were tuned to obtain the best classification performance on the Wang dataset. It should be noted that a 10-fold cross-validation (CV) test was used for tuning the algorithm's parameters. The customized RS gene selection algorithm was applied to the different datasets independently and yielded different gene signatures with different accuracies. The high dimensionality of the data and the sensitivity of the feature selection method to sample cohorts are significant issues in this procedure, and they resulted in different signature genes being extracted from different datasets. A set of 18 genes was also selected from the normalized combination of all samples from all datasets. Li et al. considered voting by different sets and claimed that simply combining the sets did not improve the results. However, as discussed, we observed an acceptable improvement in robustness when combining gene sets extracted from different datasets.
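
To illustrate the idea of dependency-driven gene selection, the sketch below computes the classical rough-set dependency degree on discretized data and adds genes greedily until a dependency threshold is reached. It is a simplified stand-in, not the tuned generalized (theta-eta) dependency function used in this study. All function names, the data layout and the default threshold are hypothetical.

```python
from collections import defaultdict

def dependency(rows, labels, genes):
    """Classical rough-set dependency of the labels on a gene subset.

    rows: list of dicts mapping gene name -> discretized expression level.
    Returns the fraction of samples whose equivalence class (the samples
    sharing identical levels on `genes`) contains a single label.
    """
    classes = defaultdict(list)
    for row, label in zip(rows, labels):
        key = tuple(row[g] for g in genes)
        classes[key].append(label)
    consistent = sum(len(ls) for ls in classes.values() if len(set(ls)) == 1)
    return consistent / len(rows)

def greedy_select(rows, labels, candidates, threshold=1.0):
    """Add the gene that most increases dependency until the threshold is met."""
    selected, best = [], 0.0
    remaining = list(candidates)
    while remaining and best < threshold:
        score, gene = max(
            (dependency(rows, labels, selected + [g]), g) for g in remaining
        )
        if score <= best:
            break  # no remaining gene improves the dependency
        selected.append(gene)
        remaining.remove(gene)
        best = score
    return selected, best
```

In this sketch, lowering `threshold` stops the search earlier and yields a smaller gene set, which mirrors the role the dependency threshold plays in determining signature size.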

Classification by Neuro-fuzzy Inference System

Choosing an appropriate classifier is also of great importance, since it affects the overall performance of the proposed algorithm for BCR prediction. Numerous supervised classifiers have been used for classifying cancer-related gene expression data.[24] However, less attention has been paid to fuzzy inference systems (FISs), which have been successfully applied in many different areas.[17] They have been suggested as a powerful technique for dealing with noisy data with complex interactions.[17] In this regard, the adaptive neuro-fuzzy inference system (ANFIS), first proposed by Jang,[25] was used for identifying the parameters of a FIS. The combination of RS theory and ANFIS is called the hybrid method in this work. We utilized the MATLAB fuzzy toolbox to generate a Sugeno FIS with its default settings, as previously proposed by Cetisli.[26] The obtained ANFIS model structure is shown in [Supplementary Figure 1]. The grid partitioning procedure was used for subdividing the input space and generating the rules. The hybrid learning algorithm, which combines back-propagation gradient descent with the least-squares method, was applied for training the FIS membership function parameters. The performance evaluation plot for training ANFIS on the meta-signature for 100 epochs is shown in [Supplementary Figure 2].
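
As a rough illustration of how a grid-partitioned Sugeno system produces an output, the sketch below evaluates a zero-order Sugeno FIS with Gaussian membership functions, product as the AND operator and weighted-average defuzzification. It covers inference only, not the ANFIS hybrid training, and it uses constant rule outputs for brevity. The function names and parameter layout are assumptions and do not reproduce the MATLAB toolbox interface.

```python
import math
from itertools import product

def gauss(x, center, sigma):
    """Gaussian membership degree of x."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def sugeno_predict(x, mfs, consequents):
    """Evaluate a zero-order Sugeno system built by grid partitioning.

    x: list of input values.
    mfs: for each input, a list of (center, sigma) Gaussian parameters.
    consequents: dict mapping a tuple of membership-function indices
        (one index per input) to the constant output of that rule.
    """
    num = den = 0.0
    # Grid partitioning: one rule per combination of membership functions.
    for combo in product(*(range(len(m)) for m in mfs)):
        weight = 1.0
        for xi, input_mfs, idx in zip(x, mfs, combo):
            weight *= gauss(xi, *input_mfs[idx])  # AND = product
        num += weight * consequents[combo]
        den += weight
    return num / den  # weighted-average defuzzification
```

Because the number of rules grows as the product of the membership-function counts over all inputs, grid partitioning is only practical with a small number of input genes per model.
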
Supplementary Figure 1

The adaptive neuro-fuzzy inference system model structure. In this structure there are 16 Gaussian membership functions for input nodes and two Gaussian MFs for middle and output nodes. Settings for fuzzy inference system are: And = “prod”; Or = “probor”; Defuzzifier = “wtaver”; Implication = “prod”; Aggregation = “sum”

Supplementary Figure 2

Performance evaluation in 100 epochs of training


RESULTS

We utilized microarray datasets from six human breast cancer studies for gene set extraction using the RS feature selection algorithm with parameters tuned on the Wang dataset. Then, a 10-fold CV test was performed using a neuro-fuzzy classifier to evaluate the gene sets extracted from all datasets. It should be noted that the 10-fold CV test was also used in preparing the RS feature selection method and in the mapping steps of the ANFIS model. Table 2 reports the MCR obtained for prediction of recurrence over the validation dataset, using the signature selected from the reference dataset. The number of genes extracted from each dataset differed, as shown in the second column of Table 2. The number of biomarkers extracted by the RS method depends on the dependency threshold in the algorithm, which was obtained by preliminary optimization of the classification error rate on the Wang dataset. In this way, a local maximum in accuracy was obtained with the selected number of genes. Adopting looser thresholds, or increasing the number of selected features, would help to reach more general results with more overlap among datasets. However, as the dimension increases, given the low number of available samples, the overfitting problem would dominate the classification results.[27]
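
The MCR figures reported here come from 10-fold CV. A minimal sketch of such an estimate is given below, assuming a pooled error count over all folds. The `nearest_mean` classifier is a toy stand-in for the neuro-fuzzy classifier, and every name in the block is hypothetical.

```python
import random

def k_fold_mcr(samples, labels, train_fn, k=10, seed=0):
    """Pooled misclassification rate over k cross-validation folds.

    train_fn(train_x, train_y) must return a function that maps one
    sample to a predicted label.
    """
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint index groups
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = train_fn([samples[i] for i in train],
                           [labels[i] for i in train])
        errors += sum(predict(samples[i]) != labels[i] for i in fold)
    return errors / len(samples)

def nearest_mean(train_x, train_y):
    """Toy classifier on scalar inputs: pick the label with the closest class mean."""
    means = {}
    for label in set(train_y):
        vals = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(vals) / len(vals)
    return lambda x: min(means, key=lambda lab: abs(x - means[lab]))
```

Any step that looks at the labels, such as gene selection or parameter tuning, would normally be placed inside `train_fn` so that the held-out fold does not influence it.
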
Table 2

Misclassification error rate of 10-fold independent cross-validation in six breast cancer studies

We also evaluated the classification accuracies after constructing two new signature sets. The first, containing 114 genes, was constructed by combining all six gene signatures. The second, containing 18 genes, was constructed by applying the RS feature selection algorithm to the metadata. Table 3 shows the MCR results of the 10-fold CV test for the combined signature and the meta-signature over all datasets. As shown in Table 3, the best result was obtained for the Sotiriou dataset, with MCR = 0.27 using the combined gene signature.
Table 3

Misclassification error rate of 10-fold independent cross-validation in six breast cancer studies with combined set and meta-selected genes

We also applied the gene sets introduced by Li et al.,[7] called the National Research Council (NRC) sets, to the fuzzy classifier in order to compare their performance with the signatures resulting from this work. It should be noted that the genes selected in the two studies have no gene in common. Because Li et al.[7] used the Wang dataset as training data and obtained their best results on it, we chose this dataset for a fair comparison. Table 4 reports the 10-fold CV MCR for the nine different gene signatures introduced by Li et al., evaluated over the Wang dataset. As shown in Table 4, the best classification accuracy was reached at MCR = 0.29 by the NRC7 signature, which was extracted from estrogen receptor-negative (ER−) samples. Li et al.[7] reached 87% accuracy for classification of low-risk patients. In this work, however, the overall MCR on all samples was evaluated and reported. The two rows of Table 4 correspond to the two different normalization methods: the one used by Li et al. [second row of Table 4], and the one used to obtain the results presented in Tables 2 and 3. We also applied the nearest shrunken centroid classification method used by Li et al. to our genes extracted from the Wang dataset. We obtained almost the same overall MCR (<0.32) after performing a leave-one-out CV test with all six resultant gene sets of the proposed approach.
Table 4

Misclassification error rate of 10-fold cross-validation in Wang dataset (GSE2034) with Li gene sets (NRCx)

We also considered the effect of ER status on prediction accuracy by separating samples according to ER status. In concordance with other studies, accuracy improved by up to 6% when ER status was taken into account. The genes selected from the different samples and the related gene ontology analysis are described in Supplementary File 1. To show the association between the introduced gene sets and survival, the samples were grouped according to the output of the classifier applied to the expression of the selected genes. The two groups, with known time of relapse or censoring, were then contrasted in a Kaplan-Meier plot. Figure 1 shows the probability of survival versus time from primary detection of the tumor for the two groups of patients classified by the proposed method. Figure 1a shows the survival curve obtained from the 286 samples of the Wang dataset using the full 114-gene combined signature. We also plotted the survival curve for the combined data of all datasets, containing 1338 samples, in Figure 1b. Figure 1 shows that the two groups are clearly distinct for both datasets, with a slight decrease in the significance of the distinction between samples with poor and good outcomes in the metadata. We used the Wilcoxon rank sum test for calculating the P value. Survival curves for the other five datasets [Supplementary Figures 3–8] indicate the power of the combined gene signatures to distinguish significantly between patients with distant metastasis and relapse-free ones, as classified by the proposed method.
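
For readers unfamiliar with the Kaplan-Meier estimate behind these curves, the following sketch computes the relapse-free survival probability for one group of patients from follow-up times and censoring flags. It is a generic product-limit estimator written for illustration, with hypothetical names, and it does not include the test used for the P values.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct relapse time.

    times: follow-up time of each patient.
    events: 1 if relapse was observed at that time, 0 if the patient
        was censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        relapses = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if relapses:
            # Product-limit update: multiply by the fraction surviving time t.
            survival *= 1.0 - relapses / at_risk
            curve.append((t, survival))
        # Relapsed and censored patients both leave the risk set after t.
        at_risk -= leaving
    return curve
```

Applying the function separately to the patients predicted as good outcome and as poor outcome gives the two curves that are contrasted in each plot.
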
Figure 1

Kaplan-Meier relapse-free survival curves. (a) Classified samples on Wang dataset with 114-genes combined signature; (b) Classified samples on combined dataset of 1338 samples with 114-genes combined signature

Supplementary Figure 3

KM plot for classified samples of dataset GSE3494 using 23 genes extracted from this dataset (P = 3e-05)

Supplementary Figure 4

KM plot for classified samples of dataset GSE3494 using 114 genes from six datasets (P = 1e-07)

Supplementary Figure 5

KM plot for classified samples of dataset GSE4922 using 114 genes from six datasets (P = 3e-09)

Supplementary Figure 6

KM plot for classified samples of dataset GSE2990 using 114 genes from six datasets (P = 0.01)

Supplementary Figure 7

KM plot for classified samples of dataset GSE7390 using 114 genes from six datasets (P = 6e-06)

Supplementary Figure 8

KM plot for classified samples of dataset GSE6532 using 114 genes from six datasets (P = 0.09)

DISCUSSION

In this paper, the selection of informative genes and the choice of an appropriate classifier allowed us to present a successful method for prediction of breast cancer relapse. An RS based feature selection method was proposed, and a fuzzy classifier was then used to classify metastatic and non-metastatic cohorts in six independent breast cancer microarray datasets. According to Table 2, our approach reached an acceptable MCR when testing the cross-validation accuracy of the introduced gene signatures. We also combined the genes extracted by the RS algorithm and compared the results with those of the features extracted from the metadata. Table 3 demonstrates that combining signatures extracted from independent datasets yielded a 7% improvement in average accuracy compared with the 18-gene meta-signature selected from the metadata. Recently, Li et al.[7] introduced nine gene signatures, each of size 30, and reported good results for classification of low-risk patients. According to Table 4, the best classification accuracy was reached at MCR = 0.29 by the NRC7 signature, which was extracted from ER-negative (ER−) samples. Comparing the results in Tables 2, 3 and 4 demonstrates the better accuracy of our approach. Specifically, the MCR after 10-fold CV on the Wang dataset was 0.22 for our single gene set and 0.28 for the combined set, both better than the minimum value of 0.29 obtained by the NRC sets. Wei and Li[24] compared the average MCR of their proposed classifier with that of nine commonly used procedures, based on 10-fold CV for three breast cancer datasets. Their results show that the best MCR among these ten classifiers was 0.29 for the Wang dataset, which is less accurate than the result we reached with the proposed hybrid method.
Based on the literature and the presented results, whichever algorithm is used for extracting informative genes from microarrays, diverse sets of genes are selected from different datasets, all of which lead to nearly acceptable classification performance. Because of the wide variety and heterogeneity of microarray experiments, we cannot reason soundly about the extracted genes or present the genes extracted from a specific study as general biomarkers. According to the results of the survival analysis [Figure 1] and the CV test on different combinations of gene signatures [Tables 2 and 3], there is a trade-off between robustness and accuracy of prediction. Combining gene signatures led to a bigger gene signature with more robust prediction results. However, adding extra genes that carry no useful information for classification to the signature set decreased the prediction accuracy. Moreover, when a unique gene set cannot be found, only a general model can help us to interpret the selected biomarkers biologically. In this regard, model-based algorithms and their limitations should be considered in future work.

BIOGRAPHIES

Alireza Mehri Dehnavi was born in Isfahan province in 1961. He was educated in Electronic Engineering at Isfahan University of Technology in 1988. He finished a Master of Engineering in Measurement and Instrumentation at the Indian Institute of Technology Roorkee (IIT Roorkee), India, in 1992, and his PhD in Medical Engineering at Liverpool University, UK, in 1996. He is an Associate Professor of Medical Engineering in the Medical Physics and Engineering Department of the Medical School of Isfahan University of Medical Sciences. He is currently visiting the School of Optometry and Visual Science at the University of Waterloo, Canada. His research interests are medical optics, devices and signal processing.
E-mail: mehri@med.mui.ac.ir

Mohammad Reza Sehhati was born in Isfahan, Iran, in 1981. He received the B.S. and M.S. degrees in Biomedical Engineering from Shahed University and the University of Tehran, Tehran, Iran, respectively. He is now a Ph.D. candidate in Biomedical Engineering at Isfahan University of Medical Sciences, Isfahan, Iran. His research interests include Bioinformatics, Machine Learning, Data Mining, Image Processing, and Hospital Information Systems.
E-mail: sehhati@resident.mui.ac.ir

Hossein Rabbani is an Associate Professor at Isfahan University of Medical Sciences, in the Biomedical Engineering Department and the Medical Image & Signal Processing Research Center. His research topics include medical image/volume processing, noise reduction and estimation problems, image enhancement, blind deconvolution, video restoration, and probability models of sparse-domain coefficients, especially complex wavelet coefficients. He is a member of the IEEE, the Signal Processing Society, the Engineering in Medicine and Biology Society, and the Circuits and Systems Society.
E-mail: h_rabbani@med.mui.ac.ir
  18 in total

Review 1.  Normalization and quantification of differential expression in gene expression microarrays.

Authors:  Christine Steinhoff; Martin Vingron
Journal:  Brief Bioinform       Date:  2006-03-07       Impact factor: 11.622

2.  Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series.

Authors:  Christine Desmedt; Fanny Piette; Sherene Loi; Yixin Wang; Françoise Lallemand; Benjamin Haibe-Kains; Giuseppe Viale; Mauro Delorenzi; Yi Zhang; Mahasti Saghatchian d'Assignies; Jonas Bergh; Rosette Lidereau; Paul Ellis; Adrian L Harris; Jan G M Klijn; John A Foekens; Fatima Cardoso; Martine J Piccart; Marc Buyse; Christos Sotiriou
Journal:  Clin Cancer Res       Date:  2007-06-01       Impact factor: 12.531

3.  Concordance among gene-expression-based predictors for breast cancer.

Authors:  Cheng Fan; Daniel S Oh; Lodewyk Wessels; Britta Weigelt; Dimitry S A Nuyten; Andrew B Nobel; Laura J van't Veer; Charles M Perou
Journal:  N Engl J Med       Date:  2006-08-10       Impact factor: 91.245

4.  Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer.

Authors:  Yixin Wang; Jan G M Klijn; Yi Zhang; Anieta M Sieuwerts; Maxime P Look; Fei Yang; Dmitri Talantov; Mieke Timmermans; Marion E Meijer-van Gelder; Jack Yu; Tim Jatkoe; Els M J J Berns; David Atkins; John A Foekens
Journal:  Lancet       Date:  2005 Feb 19-25       Impact factor: 79.321

5.  An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival.

Authors:  Lance D Miller; Johanna Smeds; Joshy George; Vinsensius B Vega; Liza Vergara; Alexander Ploner; Yudi Pawitan; Per Hall; Sigrid Klaar; Edison T Liu; Jonas Bergh
Journal:  Proc Natl Acad Sci U S A       Date:  2005-09-02       Impact factor: 11.205

6.  Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade.

Authors:  Sherene Loi; Benjamin Haibe-Kains; Christine Desmedt; Françoise Lallemand; Andrew M Tutt; Cheryl Gillet; Paul Ellis; Adrian Harris; Jonas Bergh; John A Foekens; Jan G M Klijn; Denis Larsimont; Marc Buyse; Gianluca Bontempi; Mauro Delorenzi; Martine J Piccart; Christos Sotiriou
Journal:  J Clin Oncol       Date:  2007-04-01       Impact factor: 44.544

7.  Genomic analysis identifies unique signatures predictive of brain, lung, and liver relapse.

Authors:  J Chuck Harrell; Aleix Prat; Joel S Parker; Cheng Fan; Xiaping He; Lisa Carey; Carey Anders; Matthew Ewend; Charles M Perou
Journal:  Breast Cancer Res Treat       Date:  2011-06-14       Impact factor: 4.872

8.  Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures.

Authors:  Pratyaksha Wirapati; Christos Sotiriou; Susanne Kunkel; Pierre Farmer; Sylvain Pradervand; Benjamin Haibe-Kains; Christine Desmedt; Michail Ignatiadis; Thierry Sengstag; Frédéric Schütz; Darlene R Goldstein; Martine Piccart; Mauro Delorenzi
Journal:  Breast Cancer Res       Date:  2008-07-28       Impact factor: 6.466

9.  Minimal gene selection for classification and diagnosis prediction based on gene expression profile.

Authors:  Alireza Mehridehnavi; Lia Ziaei
Journal:  Adv Biomed Res       Date:  2013-03-06

10.  Fuzzy logic for elimination of redundant information of microarray data.

Authors:  Edmundo Bonilla Huerta; Béatrice Duval; Jin-Kao Hao
Journal:  Genomics Proteomics Bioinformatics       Date:  2008-06       Impact factor: 7.691

