Screening of feature genes in distinguishing different types of breast cancer using support vector machine.

Abstract

OBJECTIVE: To screen the feature genes in estrogen receptor-positive (ER+) breast cancer in comparison with estrogen receptor-negative (ER-) breast cancer.
METHODS: Nine microarray datasets of ER+ and ER- breast cancer samples were collected from the Gene Expression Omnibus database. After preprocessing, data in five training sets were analyzed using significance analysis of microarrays to screen the differentially expressed genes (DEGs). The DEGs were further analyzed via the support vector machine (SVM) function in the e1071 package of R to construct an SVM classifier, the efficacy of which was verified on four testing sets and on their combination with the training sets using leave-one-out cross-validation. Feature genes obtained by the SVM classifier were subjected to function- and pathway-enrichment analyses via the Database for Annotation, Visualization and Integrated Discovery and the KEGG Orthology Based Annotation System, respectively.
RESULTS: A total of 526 DEGs were screened between ER+ and ER- breast cancer. The SVM classifier demonstrated that these genes could distinguish samples of the two subtypes with accuracy greater than 90%, and also showed good sensitivity, specificity, positive/negative predictive value, and area under the receiver operating characteristic curve. Inflammatory and hormone-related biological processes were enriched in both the function and pathway analyses, indicating that inflammatory (eg, IL8) and hormone regulation (eg, CGA) genes may be feature genes that distinguish ER+ from ER- breast cancer.
CONCLUSION: Gene-expression profile data can provide feature genes to distinguish ER+ and ER- samples, and the identified genes may serve as biomarkers for ER+ samples.

Keywords: biomarker; classification; differentially expressed genes

Year: 2015 PMID: 26347014 PMCID: PMC4556031 DOI: 10.2147/OTT.S85271

Source DB: PubMed Journal: Onco Targets Ther ISSN: 1178-6930 Impact factor: 4.147

Introduction

Breast cancer is the most common invasive cancer in females worldwide, with an estimated 232,670 newly diagnosed cases and approximately 40,000 deaths in 2014 in the USA.1 Breast cancer is a hormone-dependent malignancy. At primary diagnosis, approximately 75%–80% of breast cancer patients present as estrogen receptor-positive (ER+), while 20%–30% are estrogen receptor-negative (ER−).2 It is reported that ER− breast cancer is associated with poor prognosis, whereas breast cancer patients who are ER+ have a favorable outcome.3 This indicates the importance of distinguishing between these two subtypes of breast cancer in order to inform prognosis and guide targeted treatment. Traditional classification based on the histochemical analysis of ER expression is often limited and cannot discern subtle differences between subtypes of breast cancer.4 Thus, molecular identification is advocated. For example, Lim et al demonstrated that lysine-specific demethylase 1 is highly expressed in ER− breast cancer,5 while Mehta et al showed that forkhead box protein A1 is an independent prognostic marker for ER+ breast cancer.6 However, research on genes that could distinguish the different subtypes of breast cancer remains limited and requires further study.
Recent studies have shown that gene-expression profiles generated by high-throughput platforms may provide comprehensive molecular characteristics of tumors and may be informative for tumor classification.4,7 For example, Parker et al identified a 50-gene transcriptional signature and demonstrated that it has good prognostic performance for the “intrinsic” subtypes of breast cancer (luminal A, luminal B, HER2-enriched, and basal-like).8 Haibe-Kains et al reported a three-gene expression model that classifies tumors into four molecular entities (ER+/HER2−/low proliferative, ER+/HER2−/high proliferative, HER2+, and ER−/HER2−),9 which displays relatively lower prognostic ability than the 50-gene transcriptional signature.10 However, a gene model that specifically distinguishes ER+ from ER− breast cancer remains poorly investigated. Several data-mining technologies have recently been developed for feature gene extraction and selection, among which the support vector machine (SVM) algorithm performs particularly well in two-class classification.11–13 In the present study, we used an SVM to identify biomarkers for two subtypes of breast cancer, ER− and ER+, using gene-expression profiling data.

Materials and methods

Microarray data and data preprocessing

Gene-expression data of breast cancer were downloaded from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo).14 The following criteria were used to select appropriate gene-expression data: 1) samples had not been treated with any drugs; 2) all samples were breast cancer samples; and 3) samples were classified by ER status. Nine datasets were finally included, consisting of 1,289 samples, of which five expression profiles (626 samples, 425 ER+ and 201 ER−) were randomly assigned to the training sets and the other four expression profiles (663 samples, 492 ER+ and 171 ER−) were used as testing sets (Table 1). The raw downloaded data first underwent background correction,15 log2 transformation, and then quantile normalization16 using the affy package in R.
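
The log2 transformation and quantile normalization steps can be illustrated with a minimal Python sketch. This is not the affy implementation used in the study: background correction is omitted, ties are broken by row order, and the function names and the toy intensity matrix are assumptions made only for illustration.

```python
import math

def log2_transform(matrix):
    """Apply log2 to every intensity; all values must be positive."""
    return [[math.log2(v) for v in row] for row in matrix]

def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows.

    Each sample (column) is ranked, the mean across samples at each rank
    forms a reference distribution, and every value is replaced by the
    reference value at its rank. Ties are broken by row order, which is a
    simplification compared with production implementations.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    # For each sample, gene indices ordered by increasing value.
    orders = [sorted(range(n_genes), key=lambda g, col=col: col[g]) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [
        sum(columns[s][orders[s][r]] for s in range(n_samples)) / n_samples
        for r in range(n_genes)
    ]
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for s in range(n_samples):
        for rank, g in enumerate(orders[s]):
            normalized[g][s] = reference[rank]
    return normalized

# Toy raw intensities (made up): 3 genes (rows) x 3 samples (columns).
raw = [[8.0, 16.0, 4.0],
       [2.0, 4.0, 32.0],
       [32.0, 64.0, 8.0]]
normalized = quantile_normalize(log2_transform(raw))
```

After normalization, every sample shares the same set of values and only the assignment of values to genes differs between samples, which is what makes expression levels comparable across arrays.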

Table 1

Summary of the nine included microarray data

Datasets ID	Number of samples	ER+	ER−	Average age (years)
Training sets
E-GEOD-3494	241	209	32	62.004
E-GEOD-24185	94	56	38	49.03
E-GEOD-22597	82	37	45	51.23
E-GEOD-22093	82	41	41	48.51
E-GEOD-45255	127	82	45	–
Total	626	425	201	52.69
Testing sets
E-GEOD-4922	245	211	34	62.12
E-GEOD-32518	71	40	31	47.69
E-GEOD-23988	61	32	29	48.69
E-GEOD-2034	286	209	77	–
Total	663	492	171	52.83

Abbreviations: ER+, estrogen receptor-positive; ER−, estrogen receptor-negative.

Screening of differentially expressed genes

To select the differentially expressed genes (DEGs) in ER+ samples compared with ER− samples after data preprocessing, the significance analysis of microarrays package in R (www.r-project.org) was utilized.17 Genes with a false discovery rate (FDR), estimated by the permutation method,18 of less than 0.05 and an absolute log2 fold change (|log2 FC|) >1 were considered DEGs.
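
The thresholding step can be sketched in Python as follows. The sketch only applies the two cutoffs to statistics that are already computed; it does not perform the significance analysis of microarrays or the permutation-based FDR estimation. The fold-change cutoff is applied to the absolute value, an assumption made here because the reported DEGs include downregulated genes. The IL8 and CGA values are those reported in this paper, whereas `GENE_A` and `GENE_B` are hypothetical entries added only to show genes that fail a cutoff.

```python
def select_degs(stats, fdr_cutoff=0.05, log2fc_cutoff=1.0):
    """Return (upregulated, downregulated) gene lists.

    stats maps a gene symbol to (log2 fold change, FDR). A gene passes
    when FDR < fdr_cutoff and |log2 FC| > log2fc_cutoff.
    """
    up, down = [], []
    for gene, (log2fc, fdr) in stats.items():
        if fdr < fdr_cutoff and abs(log2fc) > log2fc_cutoff:
            (up if log2fc > 0 else down).append(gene)
    return sorted(up), sorted(down)

stats = {
    "IL8": (-2.25, 1.04e-70),  # reported in this paper
    "CGA": (2.63, 2.82e-34),   # reported in this paper
    "GENE_A": (0.40, 1e-6),    # hypothetical: significant but small change
    "GENE_B": (1.80, 0.20),    # hypothetical: large change, not significant
}
up, down = select_degs(stats)
```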

Sample classification using SVM classifier

It was not certain whether the selected DEGs could distinguish the two types of breast cancer well; thus, an SVM was used to build a model on the “training” data and search for similar patterns in the “testing” data. Based on the normalized expression values of the 526 DEGs identified using the training data, an SVM classifier was constructed via the SVM function in the e1071 package of R (www.r-project.org) with the nonlinear radial basis function as the kernel and the penalty parameter set at 1,000. The predictive results of the SVM model for the training set itself were evaluated by leave-one-out cross-validation,19 in which each of the n samples is held out in turn as the testing set while the other n–1 samples serve as the training set. The error rate (1 – accuracy) obtained once every sample in the training sets has been held out is the accuracy reference of the SVM classifier: the lower the error rate, the more accurate the classifier. Besides accuracy, another five indices were also utilized: sensitivity (Se), specificity (Sp), positive predictive value, negative predictive value, and area under the receiver operating characteristic curve. For Se and Sp, P=0.5 was the cutoff criterion, whereas the area under the curve served as a comprehensive assessment criterion. Subsequently, the accuracy of the SVM classifier was further verified using the testing sets and the combined datasets according to the leave-one-out cross-validation method. If all of these results suggested that the constructed SVM classifier was highly reliable, the DEGs collected from the training sets were regarded as the feature genes that distinguish the two subtypes of breast cancer.
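
The leave-one-out procedure itself can be sketched in Python independently of the classifier. To keep the example self-contained, a nearest-centroid rule is used as a stand-in; it is not an SVM and not the e1071 model used in the study, and the two-gene expression values are made up for illustration.

```python
import math

def nearest_centroid_fit(X, y):
    """Stand-in classifier (NOT an SVM): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Predict the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

def loocv_error_rate(X, y, fit, predict):
    """Hold out each sample once, train on the other n-1, count errors."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors / len(X)

# Toy, made-up expression values for two genes in six samples.
X = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1], [4.0, 1.0], [4.2, 0.8], [3.9, 1.1]]
y = ["ER+", "ER+", "ER+", "ER-", "ER-", "ER-"]
error_rate = loocv_error_rate(X, y, nearest_centroid_fit, nearest_centroid_predict)
accuracy = 1 - error_rate
```

Because `fit` and `predict` are passed in as functions, any classifier with the same two-step interface could be substituted for the stand-in without changing the cross-validation loop.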

Function- and pathway-enrichment of feature genes

The feature genes were subjected to enrichment analysis to identify their roles in breast cancer. The Database for Annotation, Visualization and Integrated Discovery was used for function enrichment,20 while the KEGG Orthology Based Annotation System was applied for pathway enrichment using the hypergeometric distribution algorithm.21 Terms with P<0.05 were considered enriched.
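
A hypergeometric enrichment P-value of the kind mentioned above can be computed as the upper tail of the hypergeometric distribution. The following Python sketch shows the general calculation only; it is not the code of either annotation tool, and the counts in the example are made up rather than taken from the paper.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the background
    K: background genes annotated to the term or pathway
    n: feature genes submitted for enrichment
    k: feature genes annotated to the term or pathway
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Made-up counts for illustration only (not taken from the paper).
p = hypergeom_enrichment_p(N=20, K=5, n=4, k=2)
```

A term would then be reported as enriched when this P-value falls below the chosen threshold.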

Results

Screening for DEGs

A total of 526 DEGs, consisting of 239 upregulated and 287 downregulated genes, were identified in ER+ samples compared with ER− samples in the five training datasets. Using the normalized expression values of the DEGs, an SVM classifier was constructed (Figure 1), and its accuracy was then assessed. For the training, testing, and combined datasets, two (one ER− and one ER+), 29 (16 ER− and 13 ER+), and 22 (seven ER− and 15 ER+) samples were wrongly classified by the SVM classifier, respectively. Nevertheless, the accuracies were all greater than 90% (99.7% [99.5% for ER− and 99.8% for ER+], 95.6% [90.6% for ER− and 97.3% for ER+], and 98.2% [98.1% for ER− and 98.4% for ER+], respectively), indicating the reliability of the classifier. Moreover, the Se, Sp, positive predictive value, negative predictive value, and area under the receiver operating characteristic curve of the SVM classifier showed that it could distinguish not only the training datasets but also the testing datasets well (Table 2; Figure 2).

Figure 1

Classification of three sample datasets by constructed support vector machine classifier.

Notes: (A) Six hundred and twenty-six samples for training; (B) 663 samples for testing; (C) 1,289 combined samples for testing. (Aa, Ba, and Ca) indicate the sample distribution for ER+ and ER−. (Ab, Bb, and Cb) indicate the scatterplot of the classification, in which black dots represent ER− while red dots represent ER+ breast cancer samples.

Abbreviations: ER+, estrogen receptor-positive; ER−, estrogen receptor-negative.

Table 2

Effect evaluation of the support vector machine classifier on training and testing datasets

	Number of samples	Correct rate	Sensitivity	Specificity	PPV	NPV	AUROC
Training	626	0.9968	0.9976	0.9950	0.9976	0.9950	0.999
Testing	663	0.9563	0.9736	0.9064	0.9677	0.9226	0.816
Combined	1,289	0.9829	0.9836	0.9812	0.9923	0.9605	0.890

Abbreviations: AUROC, area under receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value.
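
The correct rate, sensitivity, specificity, PPV, and NPV in Table 2 follow from the confusion counts implied by the class sizes and the misclassified samples reported in the text. The Python sketch below recomputes them, assuming ER+ is treated as the positive class (an assumption, although it is consistent with the sensitivity column matching the ER+ accuracy). AUROC is not recomputed because it requires per-sample scores that are not reported.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute summary metrics with ER+ as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Counts derived from the class sizes and misclassified samples in the text:
# fn = misclassified ER+ samples, fp = misclassified ER- samples.
datasets = {
    "Training": dict(tp=424, fn=1, tn=200, fp=1),
    "Testing": dict(tp=479, fn=13, tn=155, fp=16),
    "Combined": dict(tp=902, fn=15, tn=365, fp=7),
}
results = {name: classification_metrics(**c) for name, c in datasets.items()}

for name, m in results.items():
    print(name, {key: round(value, 4) for key, value in m.items()})
```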

Figure 2

Receiver operating characteristic curve used for training, testing, and combined datasets by support vector machine classifier.

Significantly related functions of feature genes

A total of eight biological functions were enriched among the feature genes (Table 3), of which response to inorganic substance was the most significant. Furthermore, the majority of feature genes fell into several biological functions, including response to organic substance, cell–cell signaling, response to wounding, behavior, and inflammatory response, each of which accounted for more than 10%.

Table 3

Significantly enriched biological functions of feature genes

Term	Count	P-value	Enriched genes
GO:0010035, response to inorganic substance	24	1.28×10⁻⁷	SYT1, ERBB4, CRYAB, S100A7, EEF1A2, ALDOB, TRPA1, NR4A2, GGH, SOD2, …, IGFBP2
GO:0007267, cell–cell signaling	43	1.02×10⁻⁶	LALBA, CGA, SYT1, CXCL5, NDP, S100A9, CXCL9, CCL8, CACNB2, GABBR2, …, GDF15
GO:0009611, response to wounding	37	1.2×10⁻⁵	CXCL1, TF, S100A8, ERBB2, S100A9, CXCL9, CCL8, CXCL11, CDH3, CXCL10, …, IGFBP4
GO:0006954, inflammatory response	27	1.29×10⁻⁵	CXCL1, TF, S100A8, S100A9, CXCL9, CCL8, CXCL11, CXCL10, FOS, IL17B, …, IGFBP4
GO:0010038, response to metal ion	16	1.3×10⁻⁵	SYT1, ALDOB, GGH, PCSK1, FGG, CCND1, PLA2G4A, GRIA2, FGB, CYBRD1, …, IGFBP2
GO:0007610, behavior	34	1.38×10⁻⁵	CXCL1, ADCY1, CXCL5, S100A9, UCHL1, CXCL9, CCL8, TRH, ZIC1, CXCL11, …, CARTPT
GO:0010033, response to organic substance	45	1.94×10⁻⁵	CGA, TF, KYNU, ADCY1, CYP1B1, ERBB4, IL6ST, ERBB2, ARNT2, ALDOB, …, IGFBP2
GO:0010817, regulation of hormone levels	17	2.2×10⁻⁵	KLK6, CGA, CYP1B1, TBX3, FOXA1, AFP, PCSK1, DHRS2, WNT4, SERPINA6, …, SNAP25

Abbreviation: GO, Gene Ontology.

Significantly related pathways of feature genes

Eight pathways were enriched among the feature genes, of which cytokine–cytokine receptor interaction was the most significant, involving 17 genes. The other pathways included drug metabolism and the GnRH signaling pathway, among others (Table 4).

Table 4

Significantly enriched pathways of feature genes

ID	Pathway	P-value	Enriched genes
hsa04060	Cytokine–cytokine receptor interaction	2.03×10⁻²	CXCL1, IL1R2, IL2RA, IL8, CXCL5, IL6ST, CXCL9, CCL8, CXCL11, CCL18, CXCL10, IL17B, CXCL14, CCL20, IL20RA, BMPR1B, LTB
hsa00982	Drug metabolism	2.05×10⁻²	GSTA1, FMO5, FMO3, UGT2B4, CYP2A6, CYP2A7, GSTP1
hsa04912	GnRH signaling pathway	2.07×10⁻²	CGA, ADCY1, PLA2G4A, ADCY9, MAPK14, CAMK2B, CALML5, CACNA1D, ITPR1
hsa00232	Caffeine metabolism	2.28×10⁻²	NAT1, CYP2A6, CYP2A7
hsa04062	Chemokine signaling pathway	3.00×10⁻²	CXCL1, ADCY1, VAV3, IL8, CXCL5, CXCL9, CCL8, CXCL11, CCL18, CXCL10, CXCL14, CCL20, ADCY9
hsa04914	Progesterone-mediated oocyte maturation	3.01×10⁻²	PGR, IGF1R, ADCY1, ADCY9, MAPK14, BUB1, IGF2, CDC25A
hsa04110	Cell cycle	3.03×10⁻²	CCNE1, CCND1, CDC45, CDKN2A, BUB1, TTK, CDC20, SFN, MCM4, CDC25A
hsa00380	Tryptophan metabolism	4.98×10⁻²	WARS, KYNU, CYP1B1, IDO1, TPH1

Discussion

In the present study, ER+ and ER− breast cancer samples were investigated to screen the DEGs, which were further used to train an SVM classifier. The SVM classifier could distinguish not only the training dataset but also the testing and combined datasets well, with accuracy higher than 90%. Thus, the DEGs could be considered feature genes for ER+ and ER− types of breast cancer. Subsequently, function- and pathway-enrichment analyses were conducted, in which inflammatory- and hormone-related biological processes were the common results of the two analyses, indicating that inflammatory (eg, interleukin-8 [IL8]) and hormone regulation (eg, glycoprotein hormones, alpha polypeptide [CGA]) genes may be important for distinguishing ER+ and ER− types of breast cancer.
It is well reported that the cytokine IL8 plays an important role in malignant tumor progression. IL8 is highly expressed in breast cancer and is associated with an accelerated clinical course, a higher tumor load, and the presence of distant metastasis, ultimately leading to poor survival.22,23 Depletion of IL8 expression may promote cell cycle arrest and inhibit migration and invasion in breast cancer cells, resulting in a better response to chemotherapy.24,25 Thus, IL8 may be a biomarker for distinguishing subtypes of breast cancer, given the lower survival in the ER− type. This hypothesis is supported by several studies.26,27 For example, Lin et al showed that IL8 expression is low in ER+ cells but high in ER− cells.28 Specifically, knockdown of IL8 significantly reduces cell invasion by suppressing the PI3K/Akt/NF-κB/integrin β3 pathway in ER− breast cancer cell lines,29,30 and reduces microvessel density and neutrophil infiltration into tumors in vivo.29 Furthermore, exogenous addition of ERα to ER− cells may also downregulate IL8 expression.31 These findings all suggest a negative relationship between IL8 and ER status, which was also observed in our study (IL8 was downregulated in ER+ samples, log2 FC =−2.25, FDR =1.04×10⁻⁷⁰).
CGA codes for the common alpha subunit of four glycoprotein hormones (chorionic gonadotropin, luteinizing hormone, follicle stimulating hormone, and thyroid stimulating hormone) that have a cystine knot motif formed by three of the five disulfide bonds.32,33 This cystine knot motif is known to be a characteristic feature of growth factors and, thus, the expression of CGA may be associated with the development of cancer, as demonstrated in previous studies.34,35 However, compared with the beta subunit of glycoprotein hormones,36 the alpha subunit may be a marker of tumors with low aggressiveness, eg, ER+ but not ER− cancer cells.37,38 In this study, we also found upregulated expression of CGA in ER+ breast cancer patients (log2 FC =2.63, FDR =2.82×10⁻³⁴).
Despite the good classification of the different types of breast cancer samples and the satisfactory accuracy, the SVM classifier showed decreased recognition ability on the testing sets. The potential reasons for this are: 1) breast cancer varies between individuals, and this variation among samples will affect gene expression; and 2) the samples used for training and testing were obtained from different experiments, which introduces some experimenter-dependent error. This kind of error can hardly be eliminated by normalization of the data. However, all the other indices supported the reliability of our SVM classifier.

Conclusion

Based on a set of gene-expression profiles, 526 DEGs were identified in ER+ samples in comparison with ER− samples and were further used to construct an SVM classifier. When tested on the other microarray data, the SVM classifier showed satisfactory efficacy. The selected feature genes (such as IL8 and CGA) could distinguish these two subtypes of breast cancer well. However, further experimental studies are needed to confirm the value of the other involved genes.

37 in total

1. Analysis of high density expression microarrays with signed-rank call algorithms.

Authors: W-m Liu; R Mei; X Di; T B Ryder; E Hubbell; S Dee; T A Webster; C A Harrington; M-h Ho; J Baid; S P Smeekens
Journal: Bioinformatics Date: 2002-12 Impact factor: 6.937

2. Estrogen receptor-positive breast cancer in Japanese women: trends in incidence, characteristics, and prognosis.

Authors: H Yamashita; H Iwase; T Toyama; S Takahashi; H Sugiura; N Yoshimoto; Y Endo; Y Fujii; S Kobayashi
Journal: Ann Oncol Date: 2010-11-30 Impact factor: 32.976

Review 3. Gene expression profiling in breast cancer: classification, prognostication, and prediction.

Authors: Jorge S Reis-Filho; Lajos Pusztai
Journal: Lancet Date: 2011-11-19 Impact factor: 79.321

4. The depletion of interleukin-8 causes cell cycle arrest and increases the efficacy of docetaxel in breast cancer cells.

Authors: Nan Shao; Liu-Hua Chen; Run-Yi Ye; Ying Lin; Shen-Ming Wang
Journal: Biochem Biophys Res Commun Date: 2013-01-12 Impact factor: 3.575

5. Supervised risk predictor of breast cancer based on intrinsic subtypes.

Authors: Joel S Parker; Michael Mullins; Maggie C U Cheang; Samuel Leung; David Voduc; Tammi Vickery; Sherri Davies; Christiane Fauron; Xiaping He; Zhiyuan Hu; John F Quackenbush; Inge J Stijleman; Juan Palazzo; J S Marron; Andrew B Nobel; Elaine Mardis; Torsten O Nielsen; Matthew J Ellis; Charles M Perou; Philip S Bernard
Journal: J Clin Oncol Date: 2009-02-09 Impact factor: 44.544

6. Dachshund inhibits oncogene-induced breast cancer cellular migration and invasion through suppression of interleukin-8.

Authors: Kongming Wu; Sanjay Katiyar; Anping Li; Manran Liu; Xiaoming Ju; Vladimir M Popov; Xuanmao Jiao; Michael P Lisanti; Antonella Casola; Richard G Pestell
Journal: Proc Natl Acad Sci U S A Date: 2008-05-08 Impact factor: 11.205

7. IL-8 expression and its possible relationship with estrogen-receptor-negative status of breast cancer cells.

Authors: Ariane Freund; Corine Chauveau; Jean-Paul Brouillet; Annick Lucas; Matthieu Lacroix; Anne Licznar; Françoise Vignon; Gwendal Lazennec
Journal: Oncogene Date: 2003-01-16 Impact factor: 9.867

8. Interleukin-8 modulates growth and invasiveness of estrogen receptor-negative breast cancer cells.

Authors: Chen Yao; Ying Lin; Mei-Sze Chua; Cai-Sheng Ye; Jiong Bi; Wen Li; Yi-Fan Zhu; Shen-Ming Wang
Journal: Int J Cancer Date: 2007-11-01 Impact factor: 7.396

Review 9. Comparative structure analyses of cystine knot-containing molecules with eight aminoacyl ring including glycoprotein hormones (GPH) alpha and beta subunits and GPH-related A2 (GPA2) and B5 (GPB5) molecules.

Authors: Eva Alvarez; Claire Cahoreau; Yves Combarnous
Journal: Reprod Biol Endocrinol Date: 2009-08-31 Impact factor: 5.211

10. A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance.

Authors: Shunpu Zhang
Journal: BMC Bioinformatics Date: 2007-06-29 Impact factor: 3.169
