
GSEA-SDBE: A gene selection method for breast cancer classification based on GSEA and analyzing differences in performance metrics.

Hu Ai

Abstract

MOTIVATION: Selecting the most relevant genes for sample classification is a common process in gene expression studies. Moreover, determining the smallest set of relevant genes that can achieve the required classification performance is particularly important in diagnosing cancer and improving treatment.
RESULTS: In this study, I propose a novel method to eliminate irrelevant and redundant genes and thus determine the smallest set of relevant genes for breast cancer diagnosis. The method is based on random forest models, gene set enrichment analysis (GSEA), and my newly developed Sort Difference Backward Elimination (SDBE) algorithm; hence, it is named GSEA-SDBE. Using this method, genes are first filtered according to their importance following random forest training, and GSEA is then used to select genes through the core enrichment of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that are strongly related to breast cancer. Subsequently, the SDBE algorithm is applied to eliminate redundant genes and identify the most relevant genes for breast cancer diagnosis. In the SDBE algorithm, the differences in the Matthews correlation coefficients (MCCs) of the random forest models are computed before and after the deletion of each gene to indicate the degree of redundancy of the corresponding deleted gene with respect to the remaining genes during backward elimination. Next, the resulting MCC difference list is divided into two parts at a set position and each part is sorted separately. By continuously iterating and changing the set position, the most relevant genes are stably assembled on the left side of the gene list, facilitating their identification, while the redundant genes are gathered on the right side for easy elimination. The SDBE algorithm was cross-compared by computing differences in either MCC or ROC_AUC_score and by using several 10-fold cross-validated classification models, i.e., random forest (RF), support vector machine (SVM), k-nearest neighbor (KNN), extreme gradient boosting (XGBoost), and extremely randomized trees (ExtraTrees). Finally, the classification performance of the proposed method was compared with that of three advanced algorithms on five cancer datasets.
Results showed that analyzing MCC differences and using random forest models was the optimal solution for the SDBE algorithm. Accordingly, three consistently relevant genes (i.e., VEGFD, TSLP, and PKMYT1) were selected for the diagnosis of breast cancer. The performance metrics (MCC and ROC_AUC_score, respectively) of the random forest models based on 10-fold verification reached 95.28% and 98.75%. In addition, survival analysis showed that VEGFD and TSLP could be used to predict the prognosis of patients with breast cancer. Moreover, the proposed method significantly outperformed the other methods tested as it allowed selecting a smaller number of genes while maintaining the required classification accuracy.


Year:  2022        PMID: 35472078      PMCID: PMC9041804          DOI: 10.1371/journal.pone.0263171

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.752


Introduction

Selecting relevant genes to distinguish patients with or without cancer is a common task in gene expression research [1,2]. For genetic diagnosis in clinical practice, it is important to efficiently identify relevant genes and eliminate irrelevant and redundant genes to obtain the smallest possible gene set that can achieve good predictive performance [3]. To this end, gene selection methods are of great importance. These methods can be roughly divided into three categories: filter, wrapper, and hybrid methods [4]. In a previous study, I focused on a hybrid approach that combines the advantages of filter and wrapper methods [5]. For cancer classification, previous hybrid approaches have utilized symmetrical uncertainty to analyze the relevance of genes based on support vector machines [6], employed minimum redundancy and maximum relevance feature selection to select a subset of relevant genes [7], and applied Cuckoo search to select genes from microarray data [8]. The hybrid approach essentially includes two processes: selecting relevant genes and eliminating redundant genes. To select relevant genes, previous research has utilized semantic similarity measurements of gene ontology terms based on definitions for similarity analysis of gene function [9], applied the concept of global and local gene relevance to calculate the equivalent principal component analysis load of nonlinear low-dimensional embedding [10], and obtained relevant features from the Cancer Genome Atlas (TCGA) transcriptome dataset by cooperative embedding [11]. Because relevant gene sets often contain redundant genes, the gene elimination process is important for obtaining the minimal number of relevant genes that can function effectively in a classification model. Many methods can be applied, including feature similarity estimated by explicitly building a linear classifier on each gene [12], homology searching against a gene or protein database [13], and the Cox-filter model [14].
In the present study, I propose a novel hybrid method that can determine the smallest set of relevant genes required to achieve accurate breast cancer diagnosis. Breast cancer transcriptome data were downloaded from the TCGA database; these unbalanced data were used in the current analyses. Random forest (RF) [15] and gene set enrichment analysis (GSEA) [16] were applied to select relevant breast cancer genes, and the proposed Sort Difference Backward Elimination (SDBE) algorithm was then used to eliminate redundant genes from these relevant genes; hence, the proposed method was named GSEA–SDBE. First, a random forest model was constructed and trained with all the differential gene expression data, and the genes whose importance was almost zero were deleted. Subsequently, GSEA was applied to analyze the remaining differentially expressed genes (DEGs) according to Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment, and the genes strongly related to breast cancer were selected from the enriched KEGG pathways. Then, the SDBE algorithm was applied to identify the important relevant genes from the selected genes. In the SDBE algorithm, the difference in the Matthews correlation coefficients (MCCs) of random forest models is calculated before and after the deletion of a given gene, which indicates the degree of redundancy of the corresponding deleted gene with respect to the remaining genes during backward elimination. Using the SDBE algorithm, the most relevant genes are stably collected on the left side of the gene list while the redundant genes are gathered on the right side. Through the GSEA–SDBE method, an optimal model was created that could determine the smallest set of relevant genes for breast cancer diagnosis. Results showed that this method could achieve excellent classification performance for breast cancer.
Furthermore, some of the selected relevant genes could be used to predict prognosis in patients with breast cancer.

Materials and methods

Data preparation

Breast cancer transcriptome data

Transcriptome data from breast cancer samples and the clinical data of corresponding patients were downloaded from TCGA database (https://gdc.cancer.gov/). A total of 1222 transcriptome samples, wherein each sample contained expression of 18584 genes, were obtained. This unbalanced dataset, which includes 113 normal and 1109 tumor tissues, was named BCT_1222 (113: 1109). In addition, the clinical data of 1109 patients with breast cancer were obtained.

Differential expression analysis and normalization

By performing the Mann–Whitney–Wilcoxon test (wilcox.test) in R 3.6.2 with |logFC| > 1.0 and p.FDR < 0.05 as the thresholds, 4579 DEGs were screened between the normal and tumor samples of the BCT_1222 dataset. The samples were randomly shuffled, and the expression values of each DEG across all samples were standardized via min–max normalization.

Selecting genes by importance based on a random forest model

The random forest method can provide an assessment of variable importance for variable selection [17,18]. A random forest model was constructed and trained using Sklearn 0.22.2.post1 in Python 3.6 with the 4579 DEGs. The model was used to calculate the importance of the variables (genes), and the genes were sorted by importance in descending order. From these genes, a certain number of top genes were selected based on experience to reduce the burden of subsequent procedures.
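A minimal sketch of this filtering step is shown below. It is illustrative only: the function name, the tree count, and the out-of-bag scoring are my own choices, not the paper's exact settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_genes_by_importance(X, y, gene_names, n_top, random_state=0):
    """Train a random forest and return the n_top genes by impurity-based
    importance, in descending order, plus the out-of-bag accuracy."""
    rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=random_state)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]  # descending importance
    return [gene_names[i] for i in order[:n_top]], rf.oob_score_
```

With real expression data, `X` would hold the normalized DEG matrix and `n_top` the experience-based cutoff (2000 in this study).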

Gene selection by GSEA

GSEA [19] can be used to determine whether a group of genes shows statistically significant, concordant differences between two biological states according to enrichment analysis; here, it was performed with the Java GSEA program. The KEGG database includes a collection of manually drawn graphical maps known as KEGG pathway maps [20]. The KEGG collection in the Molecular Signatures Database (MSigDB) [21] was chosen as the back-end database for GSEA. GSEA was run and genes were selected through the core enrichment [22] of KEGG pathways strongly related to breast cancer. Therefore, it was possible to screen for DEGs that were closely associated with breast cancer; genes weakly associated with or unrelated to breast cancer were filtered out, even if they had high importance in the random forest model.

Metrics and benchmark methods

The performance of every classification model applied in this study was evaluated by 10-fold cross-validation: the models were trained and tested fold by fold, and the predictions and held-out test labels were merged in a fixed order. By comparing the merged predictions with the test labels, true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) were obtained; normal samples were negatives and tumor samples were positives. Tests were conducted on a real dataset with unbalanced data. Therefore, the effectiveness of the binary classification model was measured by several performance metrics [23], including accuracy (Acc), recall (Re), F1_score (F1), false positive rate (FPR), area under the receiver operating characteristic curve computed from prediction scores (ROC_AUC_score), and MCC. The formulas are as follows: Acc = (TP + TN)/(TP + FP + FN + TN); Re = TP/(TP + FN); FPR = FP/(FP + TN); F1 = 2TP/(2TP + FP + FN); MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)). MCC [24,25] and ROC_AUC_score [26,27] have been shown to better handle numerically unbalanced datasets.
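These metrics can be computed directly from the four confusion counts; a small sketch (the function name is mine):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    re = tp / (tp + fn)                      # recall / sensitivity
    fpr = fp / (fp + tn)                     # false positive rate
    pre = tp / (tp + fp)                     # precision
    f1 = 2 * pre * re / (pre + re)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0 when undefined
    return {"Acc": acc, "Re": re, "FPR": fpr, "F1": f1, "MCC": mcc}
```

Because MCC uses all four confusion counts, it stays informative when one class dominates, which is why it is emphasized for this unbalanced dataset.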

SDBE algorithm

The training, testing, and calculation of all performance metrics for every classification model were based on 10-fold cross-validation. The focus was on finding a high-performance classification model with the fewest variables (genes); to this end, a novel algorithm, SDBE, was proposed. The underlying principle of the SDBE algorithm is that the performance metrics of the classification model will not change significantly after a redundant gene is deleted. Therefore, the difference in the chosen performance metric was computed before and after the deletion of each gene to indicate the degree of redundancy of the corresponding deleted gene with respect to the remaining genes during backward elimination based on the random forest method. The deleted genes were collected into a list in reverse order during backward elimination [28]. The metric difference list was then divided into two parts at a set position, each part was sorted in descending order, and the two parts were merged. Through continuous iteration and changing of the set position, the important relevant genes were stably assembled on the left side of the gene list to facilitate their identification, whereas the redundant genes were gathered on the right side of the gene list for easy elimination. The procedure underlying the SDBE algorithm is provided in Fig 1. The SDBE algorithm consists of eight stages, as follows.
Fig 1

Procedure of the Sort Difference Backward Elimination (SDBE) algorithm.

Stage 1: In each loop of backward elimination, 10-fold random forest models were trained and tested to calculate the various performance metrics and the average importance of each variable (gene). Next, the genes were sorted in descending order of average importance. After each loop, the deleted gene with the least importance and the various metrics of the model were added to dedicated lists. Thus, by transposing all the lists, a list of genes in descending order of importance and the various metric lists were obtained; these lists were provided to the stages that followed. Importantly, the gene g1 at the first position of the gene list was determined at this stage because its position would not change in subsequent stages.

Stage 2: One of the model performance metrics, such as MCC or ROC_AUC_score, was chosen as the object of difference analysis for the subsequent stages, and the index variable ST was initialized to 0.

Stage 3: The following formula was used to compute the difference in the performance metric before and after gene deletion during backward elimination based on random forest modeling: di = mi − mi′, where mi and mi′ respectively denote the metric before and after deleting gene gi from the sublist Gs of the gene list G. Only one gene was deleted from the end of the list Gs at each loop of backward elimination. The performance metric difference indicates the degree of redundancy of the corresponding deleted gene with respect to the remaining genes of the sublist Gs.

Stage 4: The value of the variable ST was used as the starting index to search forward in the metric difference list until an element < 0 was encountered; the index of this element was used to update ST.

Stage 5: The metric difference list DM was split by the index ST into two parts, part1 and part2 (part2 including the element at index ST), and the elements of part1 and part2 were each sorted in descending order.
Stage 6: The elements of part1 and part2 were replaced with genes according to the correspondence between the metric differences and the deleted genes, and the two parts were merged into a new gene list NG. Subsequently, the gene g1 from the list G was added to the end of the new list NG, and the list NG was then transposed.

Stage 7: The genes of the list NG were analyzed by backward elimination. At each step, a 10-fold classification model, e.g., random forest (RF), support vector machine (SVM), k-nearest neighbor (KNN), extreme gradient boosting (XGBoost), or extremely randomized trees (ExtraTrees), was trained and tested to calculate the various performance metrics, which were added to the corresponding metric lists after each step. If the number of iterations set based on experience had been reached, the iteration was terminated and the data were saved; otherwise, the transposed metric lists and the list NG were sent back to stage 3 to start a new iteration.

Stage 8: Mapping analysis of the metric lists and the list NG was performed, and the smallest set of relevant genes needed to achieve the required sample classification performance was determined.
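The per-iteration mechanics of the difference computation and the split-sort-merge step can be sketched as follows. This is a schematic re-implementation, not the author's code: `evaluate` stands in for training and scoring 10-fold random forest models on a gene subset, and both function names are mine.

```python
def metric_differences(genes, evaluate):
    """Stage 3 (schematic): backward elimination over `genes`, ordered from
    most to least important.  `evaluate(subset)` returns the chosen metric
    (e.g., 10-fold MCC).  Each difference (metric before minus metric after
    deleting the last gene) indicates how redundant that gene is given the
    genes that remain."""
    diffs, deleted, current = [], [], list(genes)
    while len(current) > 1:
        before = evaluate(current)
        deleted.append(current.pop())        # delete the least important gene
        diffs.append(before - evaluate(current))
    # reverse so entries align left-to-right with the remaining gene order
    return deleted[::-1], diffs[::-1]

def sort_difference_step(genes, diffs, st):
    """Stages 4-6 (schematic): from position `st`, search forward for the
    first difference < 0 and update `st`; split the difference list there
    (part2 includes index `st`), sort each part in descending order, and
    map the sorted differences back to genes."""
    while st < len(diffs) and diffs[st] >= 0:
        st += 1
    pairs = list(zip(diffs, genes))
    part1 = sorted(pairs[:st], key=lambda p: p[0], reverse=True)
    part2 = sorted(pairs[st:], key=lambda p: p[0], reverse=True)
    return [g for _, g in part1 + part2], st
```

A large positive difference means deleting the gene hurt the model (the gene is relevant), so descending sorting pushes relevant genes leftward while redundant genes drift right over the iterations.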

The entire pipeline of the GSEA–SDBE method

The gene selection procedure followed in the GSEA–SDBE method is provided in Fig 2.
Fig 2

Gene selection procedure in the GSEA–SDBE method.

Results

Differential expression analysis and normalization

Of the 4579 DEGs identified in the BCT_1222 dataset, 2702 were upregulated and 1877 were downregulated. These genes are represented in a volcano plot in Fig 3.
Fig 3

Volcano plot of differentially expressed genes.

The red and blue dots represent upregulated and downregulated genes, respectively.


Random forest models

After a random forest model was trained with the data on the 4579 DEGs, the out-of-bag error was 0.01%. Genes were sorted by importance in descending order, as shown in Fig 4. Selecting the top 2000 of the 4579 DEGs proved optimal in the experiments; thus, the remaining 2579 genes, whose importance was close to zero, were deleted.
Fig 4

Genes sorted by importance in descending order.

GSEA

GSEA 3.0 was applied to analyze the 2000 DEGs with KEGG pathway enrichment; the gene sets database was set to c2.cp.kegg.v7.1.symbols.gmt of the MSigDB. The enrichment results comprised 30 gene sets, including five upregulated and 15 downregulated gene sets in the phenotype "Tumor" (S1 Table). Four gene sets (Table 1) that were strongly associated with breast cancer were selected (Fig 5). Altogether, 60 genes were identified, comprising 20 upregulated and 40 downregulated genes, after deleting 12 repeated downregulated genes from the 72 genes in the core enrichment of the four gene sets.
Table 1

Gene sets (pathways) that were strongly related to breast cancer.

Gene set name | ES | NES | NOM p value | FDR q value | Gene number (core enrichment)
KEGG_CELL_CYCLE | 0.60 | 1.37 | 0.201 | 0.319 | 20
KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION | −0.29 | −0.96 | 0.496 | 0.726 | 17
KEGG_JAK_STAT_SIGNALING_PATHWAY | −0.48 | −1.34 | 0.143 | 1.000 | 11
KEGG_PATHWAYS_IN_CANCER | −0.23 | −0.84 | 0.720 | 0.790 | 24

ES: Enrichment score; NES: Normalized enrichment scores; NOM p-val: Nominal p value; FDR: False discovery rate.

Fig 5

Enrichment plots for the four gene sets (pathways) that were strongly related to breast cancer.

ES: Enrichment score; NES: Normalized enrichment scores; NOM p-val: Nominal p value; FDR: False discovery rate.

In the SDBE algorithm, the training, testing, and calculation of the various performance metrics for all classification models were based on 10-fold cross-validation. The expression data of the 60 genes from the GSEA enrichment results were used in the SDBE algorithm. From stage 1 of the algorithm, the 60 genes were listed in descending order of importance, as shown in S2 Table, and the various metric lists (including Acc, Re, FPR, F1_score, ROC_AUC_score, and MCC) were plotted using matplotlib in Python 3.6 for comparison. It was difficult to select the smallest gene set that could still achieve good predictive performance by sorting genes by their importance alone, although ranking genes by importance was vital to the process. The most important part of this stage was determining the top gene in the list, as this gene does not change in subsequent stages. From this stage, the gene and metric lists were passed to the stages that followed. In stage 2 of the SDBE algorithm, the performance metrics ROC_AUC_score and MCC were respectively chosen as the objects of difference analysis for the subsequent iterations; each iteration included stages 3–7, and the number of iterations was set at 19. To compare the influence of different classification models in the SDBE algorithm, RF, SVM, KNN, XGBoost [29], and ExtraTrees [30] were each used as the classification model; in this way, the SDBE algorithm was cross-tested. Regardless of the object chosen for difference analysis (ROC_AUC_score or MCC; Fig 6A and 6B) and the classification model (RF, SVM, KNN, XGBoost, or ExtraTrees) used, as the iteration progressed the most relevant genes were assembled in a stepwise manner on the left side of the gene list, whereas the redundant genes were gathered in a stepwise manner on the right side (Fig 6).
On the left side of the gene list, the identity and number of stable relevant genes differed depending on the analysis target and classification model, with three stable relevant genes being the maximum (S3 Table).
Fig 6

Polylines of classification metrics, MCC, and ROC_AUC_score in 19 iterations.

(a) MCC as the object of difference analysis. (b) ROC_AUC_score as the object of difference analysis.

To cross-compare the SDBE algorithm, I used the 19th iteration of the algorithm and compared the same performance metrics across the classification models (RF, SVM, KNN, XGBoost, and ExtraTrees; Fig 7). As shown by the shapes of the polylines in Fig 7A, using MCC as the object of difference analysis produced better results than using ROC_AUC_score (Fig 7B). With MCC, the performance metrics of the RF model were better than those of the other classification models; the blue polyline of the RF model was always above the other polylines. Therefore, I assessed the polyline of RF and found that the top three genes did not reach the peak or trough of the polyline but were close to each other (Fig 7A). More importantly, the top three genes were stable and repeatable. Therefore, I extracted the performance metrics of the classification models trained and tested using the top three genes for comparison (Tables 2 and 3). Except for FPR (1.77%), the performance metrics of the RF model in Table 2 (MCC as the object) were superior to those in Table 3 (ROC_AUC_score as the object); moreover, the top three genes from the classification models RF, KNN, XGBoost, and ExtraTrees were identical when MCC was the object (Table 2) but typically differed among the models when ROC_AUC_score was the object (Table 3). Because the data used to train and test the classification models were unbalanced (113 vs. 1109 samples), the performance metrics MCC and ROC_AUC_score of the RF model were focused upon.
Fig 7

Polylines of classification metrics at the 19th iteration of the Sort Difference Backward Elimination (SDBE) algorithm.

(a) MCC as the object of difference analysis. (b) ROC_AUC_score as the object of difference analysis. The various metric lists from stage 1 of the algorithm are illustrated by red polylines (RF_importance).

Table 2

MCC as the object of difference analysis: 10-fold cross-validation classification metrics of the top three genes.

Models | ROC_AUC_score | MCC | Recall | FPR | F1_score | Accuracy | Top three genes
RF | 0.9875 | 0.9528 | 0.9928 | 0.0177 | 0.9955 | 0.9918 | VEGFD, TSLP, PKMYT1
SVM | 0.9684 | 0.8832 | 0.9810 | 0.0442 | 0.9882 | 0.9787 | VEGFD, PKMYT1, BUB1B*
XGBoost | 0.9861 | 0.9396 | 0.9900 | 0.0177 | 0.9941 | 0.9893 | VEGFD, TSLP, PKMYT1
KNN | 0.9653 | 0.8897 | 0.9837 | 0.0531 | 0.9891 | 0.9803 | VEGFD, TSLP, PKMYT1
ExtraTrees | 0.9818 | 0.9345 | 0.9900 | 0.0265 | 0.9937 | 0.9885 | VEGFD, TSLP, PKMYT1

Genes marked with * are unstable genes in the SDBE algorithm.

Table 3

ROC_AUC_score as the object of difference analysis: 10-fold cross-validation classification metrics of the top three genes.

Models | ROC_AUC_score | MCC | Recall | FPR | F1_score | Accuracy | Top three genes
RF | 0.9799 | 0.8840 | 0.9774 | 0.0177 | 0.9877 | 0.9779 | VEGFD, SPRY2, BUB1B*
SVM | 0.9828 | 0.8501 | 0.9657 | 0.0 | 0.9825 | 0.9689 | VEGFD, CCNB1*, TSLP*
XGBoost | 0.9812 | 0.8952 | 0.9801 | 0.0177 | 0.9890 | 0.9803 | VEGFD, CCL14, TSLP
KNN | 0.9771 | 0.8627 | 0.9720 | 0.0177 | 0.9849 | 0.9710 | VEGFD, TSLP, CCL14
ExtraTrees | 0.9809 | 0.9260 | 0.9883 | 0.0265 | 0.9927 | 0.9869 | VEGFD, TSLP, CDC25C

Genes marked with * are unstable genes in the SDBE algorithm.

In summary, using MCC as the object of difference analysis and RF as the classification model in the SDBE algorithm was optimal. In addition, three stable relevant genes, namely VEGFD, TSLP, and PKMYT1, were chosen for the diagnosis of breast cancer. Moreover, based on 10-fold verification, the performance metrics MCC and ROC_AUC_score for the RF models were 95.28% and 98.75%, respectively.

Survival analysis of patients

First, patients were divided into high-risk and low-risk groups based on the median expression of a given gene (S4 Table). If the gene was downregulated, patients whose expression of the gene was lower than the median were classified as high risk, whereas the remaining patients were classified as low risk; if the gene was upregulated, the grouping was reversed. Kaplan–Meier survival analysis [31] and log-rank tests were used to determine the prognostic significance of the expression of the three genes VEGFD, TSLP, and PKMYT1 in patients with breast cancer. VEGFD and TSLP were downregulated, whereas PKMYT1 was upregulated. The log-rank tests revealed that patients with low VEGFD and TSLP expression had significantly shorter overall survival (OS) times than patients with high expression of these genes (P = 0.0466 and P = 0.0003, respectively); the median OS times in months (with 95% confidence intervals) were 129 (114–142) and 116 (102–132), respectively (Fig 8 and Table 4). In contrast, the log-rank test for PKMYT1 was not significant (P = 0.2095), and the survival curves of the high-risk and low-risk groups for this gene crossed at 120 months (Fig 8). Therefore, VEGFD and TSLP could be used to predict prognosis in patients with breast cancer, whereas PKMYT1 is not suitable for this purpose.
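The median-split grouping rule can be sketched as follows (illustrative only; the function name is mine, and the Kaplan–Meier estimation and log-rank tests themselves are not reproduced here):

```python
import numpy as np

def risk_groups(expression, downregulated):
    """Split patients into high/low risk by the median expression of a gene.

    For a gene downregulated in tumors, below-median expression marks high
    risk; for an upregulated gene the grouping is reversed.  Returns a
    boolean array, True = high risk."""
    expression = np.asarray(expression, dtype=float)
    below = expression < np.median(expression)
    return below if downregulated else ~below
```

Note that patients whose expression exactly equals the median fall on the not-below side; how ties are assigned is a convention not specified in the text.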
Fig 8

Kaplan–Meier survival graphs for expression of VEGFD, TSLP, and PKMYT1.

Red and blue curves denote high-risk and low-risk groups, respectively.

Table 4

Results of survival analysis for high-risk and low-risk groups according to three genes.

Gene name | Expression in tumor | P value | High risk: SP (5 y) | High risk: M-OS [95% CI] | High risk: N | Low risk: SP (5 y) | Low risk: M-OS [95% CI] | Low risk: N
VEGFD | Downregulated | 0.0466 | 0.8088 | 129 [114–142] | 846 | 0.8552 | 149 [122–inf] | 262
TSLP | Downregulated | 0.0003 | 0.7896 | 116 [102–132] | 786 | 0.8837 | 248 [122–inf] | 322
PKMYT1 | Upregulated | 0.2095 | 0.7743 | 149 [102–inf] | 419 | 0.8494 | 131 [115–215] | 689

P value: Comparison between high risk and low risk; Inf: Data points not obtained; SP (5 y): 5-year survival probability; M-OS (95% CI): Median overall survival time in months with 95% confidence intervals; N: Number of patients.


Relevance of the selected genes to cancer

VEGF-D induces the formation of lymphatics within tumors, thereby facilitating the spread of the tumor to lymph nodes, and promotes tumor angiogenesis and growth [32-36]. TSLP is an interleukin-7 (IL-7)-like cytokine that is involved in the progression of various cancers and is a key mediator of breast cancer progression [37-40]. Human PKMYT1 is an important regulator of the G2/M transition in the cell cycle. Studies have demonstrated that PKMYT1 might be a therapeutic target in hepatocellular carcinoma and neuroblastoma [41-43].

Performance comparison of GSEA–SDBE with that of other models

To test the feature selection performance of the GSEA–SDBE method, a simplified version, named Pre-SDBE, which does not use GSEA to filter out genes weakly associated with or unrelated to cancer, was used. The three advanced gene selection algorithms were the genetic algorithm (GA), particle swarm optimization (PSO) algorithm, and cuckoo optimization algorithm and harmony search (COA-HS). These algorithms use 100 relevant genes selected via the minimum redundancy and maximum relevance (MRMR) as input data and the SVM as a classifier [7]. The classification performance of Pre-SDBE was compared with that of the three advanced algorithms for five cancer datasets composed of DEGs in breast, lung, and liver cancers and genes expressed in prostate and colon cancers (Table 5).
Table 5

Information on the datasets used for performance comparison.

Name | Data source | #Genes | #DEGs | #Samples | Normal | Tumor
Breast | TCGA a | 56,536 | 4,579 | 1,222 | 113 | 1,109
Lung | TCGA a | 56,536 | 7,483 | 1,146 | 108 | 1,038
Liver | TCGA a | 56,536 | 8,772 | 465 | 58 | 407
Prostate | Microarray dataset b | 12,600 | – | 102 | 50 | 52
Colon | Microarray dataset c | 7,457 | – | 62 | 22 | 40

a Database (https://gdc.cancer.gov/)

b Singh et al. [44]

c Alon et al. [45].

#Genes: Number of genes; #DEGs: Number of differentially expressed genes (obtained using wilcox.test with |logFC| > 1.0 and p.FDR < 0.05); #Samples: Number of selected samples.

In the step of the Pre-SDBE algorithm that selects genes by importance, the top 50 relevant genes were selected based on a random forest model (S1 Fig). Next, these genes were fed into the SDBE algorithm to identify the most relevant genes with the highest accuracy. The number of iterations in the SDBE algorithm was set at 6, 7, 23, 3, and 10 for the breast, lung, liver, colon, and prostate cancer datasets, respectively. The fitness of PSO, GA, and COA-HS over 100 iterations for each cancer dataset is shown in S2 Fig. Table 6 shows that for the unbalanced data (breast, lung, and liver cancers), the classification metrics (MCCs) of the PSO, GA, and COA-HS algorithms were much lower than those of Pre-SDBE (98.07, 97.45, and 96.98 for breast, lung, and liver cancers, respectively). This indicates that the PSO, GA, and COA-HS algorithms did not perform well on unbalanced data.
Table 6

Classification metrics (%) of four optimization algorithms for five cancer datasets.

Breast:
Algorithm | #Genes | MCC | RA | F1 | SE | SP
Pre-SDBE | 4 | 98.07 | 99.42 | 99.82 | 99.73 | 99.12
PSO a | 30 | 82.98 | 95.56 | 98.18 | 97.00 | 94.12
GA a | 18 | 88.87 | 98.80 | 98.78 | 97.60 | 100
COA-HS a | 11 | 90.93 | 97.78 | 99.09 | 98.50 | 97.06

Lung:
Algorithm | #Genes | MCC | RA | F1 | SE | SP
Pre-SDBE | 3 | 97.45 | 98.93 | 99.76 | 99.71 | 98.15
PSO a | 29 | 88.29 | 98.72 | 98.70 | 97.44 | 100
GA a | 15 | 90.88 | 99.04 | 99.03 | 98.08 | 100
COA-HS a | 8 | 89.56 | 98.88 | 98.87 | 97.76 | 100

a Elyasigomari et al. [7]; Pre-SDBE: Simplified version of the GSEA–SDBE method; RA: ROC_AUC_score; F1: F1_score; AC: Accuracy; SE: Sensitivity; SP: Specificity

#Genes: Number of selected genes.

Note: For unbalanced (breast, lung, and liver) and balanced data (colon and prostate), the performance metrics of the model are different.

For the five cancer datasets, whether the data were balanced or unbalanced, Pre-SDBE outperformed the other three algorithms, achieving the highest classification accuracy while identifying fewer genes (Table 6). More details are shown in S3 Fig and S5 and S6 Tables.

Discussion

In this study, DEGs were extracted from a breast cancer dataset. Genes that are not significantly differentially expressed but have important biological significance for breast cancer could easily be missed in this process; moreover, even if such genes survived the initial screen, they might be deleted in subsequent processing. Indeed, such genes would be ignored by the classification model used in the GSEA–SDBE method described here. Nevertheless, this did not affect the ability of the method to identify key genes for the diagnosis of breast cancer. Dimensionality reduction runs through the entire GSEA–SDBE method; each step prepares for dimensionality reduction in the next. In my experience, selecting too few genes leads to some important pathways not being enriched, whereas selecting too many genes floods the core enrichment of the pathways with genes, which makes subsequent gene elimination difficult and GSEA time consuming. Therefore, the list of DEGs was sorted in descending order of variable importance according to a random forest model; the top 2000 genes were selected for analysis, and the genes with importance close to zero were removed based on experience. Although the selection of KEGG pathways in GSEA based on experience is subjective, it still allows obvious DEGs with no important biological significance for breast cancer to be filtered out; otherwise, such genes might spuriously enhance the performance of the classification models and compromise the selection of important genes. To eliminate redundant genes from the selected genes, the SDBE algorithm was applied. This algorithm computes the difference in the performance metrics of the classification model before and after gene deletion during backward elimination, which indicates the degree of redundancy of the deleted gene with respect to the remaining genes.
When deleting a gene in this manner did not significantly change the performance metric of the classification model, the deleted gene was similar to some of the remaining genes and was therefore considered redundant. Given this underlying principle, the top gene in the gene list does not participate in the sorting process and is never recognized as redundant; likewise, the first gene of a group of similar genes in the list is never recognized as redundant or deleted. Therefore, stage 1 of the SDBE algorithm is particularly important, because at this stage genes are sorted by their importance in RF during backward elimination. At stage 5, to speed up sorting and reduce the number of cycles, the metric difference list was divided into two parts at a set position, and each part was sorted separately in descending order. The set position was updated at stage 4: starting from the current set position in the metric difference list, a forward search was conducted until an element with a value below the threshold, which was set at zero, was encountered; the index of this element became the new set position. Setting the threshold to a value slightly greater than zero might be more conducive to sorting. However, as the 19 iterations shown in Figs 2 and 3 demonstrate, the polylines of the performance metrics for the classification models, particularly RF with MCC as the object of difference analysis, met the requirements; running many more iterations would have been more time consuming. Using ROC_AUC_score as the object of difference analysis was less effective than using MCC, which might be related to the complexity of the ROC_AUC_score formula. In contrast to Pre-SDBE, the three advanced algorithms (GA, PSO, and COA-HS) did not filter out genes without biological significance for cancer and were much more time consuming.
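The staged procedure just described can be sketched roughly as follows. This is a hedged reconstruction from the text, not the published implementation: the helper `evaluate` (10-fold cross-validated MCC of a random forest), the sign convention of the differences, and the exact stage boundaries are my assumptions, and the special handling of the top gene is omitted for brevity.

```python
# Hedged sketch of one SDBE iteration as described in the text.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import cross_val_predict

def evaluate(X, y, genes):
    """10-fold cross-validated MCC of a random forest on the given gene columns."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    pred = cross_val_predict(clf, X[:, genes], y, cv=10)
    return matthews_corrcoef(y, pred)

def sdbe_iteration(X, y, genes, set_pos):
    """One iteration: score deletions, update the set position, re-sort."""
    base = evaluate(X, y, genes)
    # Backward elimination: MCC difference after deleting each gene in turn.
    # A difference near zero suggests the deleted gene is redundant.
    diffs = [base - evaluate(X, y, [g for g in genes if g != gene])
             for gene in genes]
    # Stage 4 (as described): search forward from the set position until an
    # element below the threshold (zero) is found; its index is the new position.
    while set_pos < len(diffs) and diffs[set_pos] >= 0:
        set_pos += 1
    # Stage 5: split the difference list at the set position and sort each part
    # in descending order, so relevant genes drift left and redundant ones right.
    left = sorted(range(set_pos), key=lambda i: diffs[i], reverse=True)
    right = sorted(range(set_pos, len(genes)), key=lambda i: diffs[i], reverse=True)
    return [genes[i] for i in left + right], set_pos
```

Iterating this step while changing the set position is what stabilizes the most relevant genes on the left of the list and gathers redundant ones on the right for elimination.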
This is likely because these three algorithms used MRMR to select their input genes (S6 Table). Selecting fewer than 50 genes by importance from a random forest model as the input to the SDBE algorithm might save time; however, 10-fold cross-validation remained the main time-consuming factor in both the GSEA–SDBE method and its simplified version (Pre-SDBE). Here, the proposed GSEA–SDBE method was used to analyze breast cancer datasets and enabled determination of the smallest set of biologically relevant genes for cancer diagnosis. The simplified method (Pre-SDBE) was used to select genes for classifying additional cancer datasets to test the feature selection performance of GSEA–SDBE. The results showed that both GSEA–SDBE and Pre-SDBE performed excellently. In the future, I will apply the GSEA–SDBE method to many types of cancer data and Pre-SDBE to feature selection for various types of data.
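Since the 10-fold cross-validation dominates the runtime, it is worth seeing what a single evaluation involves. The sketch below is a hypothetical helper under my assumptions (it is not the author's code): both "objects of difference analysis" (MCC and ROC_AUC_score) are computed from one set of out-of-fold predictions, and thresholding the probabilities at 0.5 for the MCC is an illustrative choice.

```python
# Sketch: computing MCC and ROC_AUC_score from 10-fold out-of-fold
# predictions. Every candidate gene deletion in SDBE triggers a full
# re-run of this evaluation, which is why cross-validation dominates
# the total runtime.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import cross_val_predict

def cv_metrics(X, y, cv=10):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # Out-of-fold class-1 probabilities from cross-validation.
    proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
    mcc = matthews_corrcoef(y, (proba >= 0.5).astype(int))
    auc = roc_auc_score(y, proba)
    return mcc, auc
```

For a list of n candidate genes and k SDBE iterations, roughly n x k such evaluations are needed, each fitting cv models.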

Genes sorted by importance in descending order (Pre-SDBE).

(TIF) Click here for additional data file.

Fitness over 100 iterations for breast, lung, and liver cancers (PSO, GA, and COA-HS).

(TIF) Click here for additional data file.

Polylines of classification metrics of the Sort Difference Backward Elimination (SDBE) algorithm (Pre-SDBE).

(TIF) Click here for additional data file.

Gsea_report_for_Tumor_and_Normal.

(XLS) Click here for additional data file.

The 60 genes listed in descending order of importance.

(XLSX) Click here for additional data file.

Genes sorted in descending order in 19 iterations.

(XLS) Click here for additional data file.

Information about survival of patients.

(XLS) Click here for additional data file.

Genes sorted by the SDBE algorithm in descending order (Pre-SDBE).

(XLSX) Click here for additional data file.

Classification performance information of three advanced algorithms (PSO, GA, and COA-HS) for three cancer datasets.

(DOCX) Click here for additional data file.

(TIF) Click here for additional data file.

17 Jun 2021

PONE-D-21-09636

GSEA–SDBE: A gene selection method for breast cancer classification based on GSEA and analyzing differences in performance metrics

PLOS ONE

Dear Dr. Ai,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 01 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results.
Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Khanh N.Q. Le
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Partly

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: No

3.
Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript is well written and the subject is very interesting. It has been written clearly and understandable. I think if the author provides a practical flowchart for readers it would be more applicable. Also please provide some information about survival of the patients in the results section.

Reviewer #2: The paper does contribute to body of the existing knowledge. It needs some more work for further improvement. My main concerns are:

1.
The paper is missing a comparison with other state-of-the-art methods for gene selection.

2. A discussion on the computational complexity of the method after comparison with the other methods is missing.

Some minor comments.

i) The paper needs language corrections.

ii) References are not coherent. Some references are missing authors' names (et al. should not be in the bibliography)

iii) Add mathematical description of the methods covered. Algorithm are messy, please make them clear and easy to understand.

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

28 Aug 2021

Responses to the reviewers' comments:

Reviewer #1:

1.
Response to comment: (I think if the author provides a practical flowchart for readers it would be more applicable.)

Response: I am very sorry for my negligence. The revised practical flowcharts are as follows. Fig 1. Procedure of the Sort Difference Backward Elimination (SDBE) algorithm. Fig 2. Gene selection procedure in the GSEA–SDBE method.

2. Response to comment: (Also please provide some information about survival of the patients in the results section.)

Response: I am very sorry for my negligence in providing information about the survival of the patients. I have added S4 Table, which provides this information.

Reviewer #2:

1. Response to comment: (The paper is missing a comparison with other state-of-the-art methods for gene selection.)

Response: I am very sorry for my negligence. The performance comparison with other models is as follows.

Performance comparison with other models

Pre-SDBE is a simplified version of the GSEA–SDBE method that does not use GSEA to filter out genes weakly associated with or unrelated to cancer. The other three state-of-the-art methods [7], the genetic algorithm (GA), the particle swarm optimization (PSO) algorithm, and the cuckoo optimization algorithm and harmony search (COA-HS), also do not filter out DEGs that have no biological significance for cancer. Therefore, the performance of Pre-SDBE can be compared with that of these three algorithms, which use relevant genes selected by minimum redundancy maximum relevance (MRMR) as input data. In the Pre-SDBE algorithm, the top 50 relevant genes from the DEGs were selected in the step that selects genes by their importance based on a random forest model (S1 Fig). Next, these selected genes were fed into the SDBE algorithm to pick the best genes while maintaining the highest accuracy. The number of iterations in the SDBE algorithm was set at 3 and 10 for the colon and prostate cancer datasets, respectively (S5 Table).

Table 5.
Information on the microarray datasets used in this comparison.

Microarray dataset   Number of genes   Number of samples   Normal   Tumor
Prostate [32]        12,600            102                 50       52
Colon [33]           7,457             62                  22       40

In the case of the colon cancer dataset, the Pre-SDBE method outperformed the other algorithms, reaching 100% accuracy with the fewest genes (2 genes) of all algorithms (S2 Fig). For the prostate cancer dataset, the Pre-SDBE method also achieved high accuracy (98.99%) with the fewest genes (5 genes) (Table 6 and S3 Fig).

Table 6. Performance comparison of Pre-SDBE with three models for two microarray datasets.

             Colon                                Prostate
Algorithm    #Genes   AC       SE       SP       #Genes   AC       SE       SP
Pre-SDBE     2        1.0000   1.0000   1.0000   5        0.9899   0.9899   0.9899
PSO [7]      11       0.9642   0.8580   1.0000   19       0.9804   0.9180   1.0000
GA1 [7]      14       0.9516   0.8460   1.0000   28       0.9804   0.9180   1.0000
COA-HS [7]   5        1.0000   1.0000   1.0000   5        1.0000   1.0000   1.0000

AC: accuracy; SE: sensitivity; SP: specificity; #Genes: number of selected genes.

2. Response to comment: (A discussion on the computational complexity of the method after comparison with the other methods is missing.)

Response: I am very sorry for my negligence. The discussion about computational complexity is as follows. In the Pre-SDBE method, the top 50 relevant genes were first selected by their importance based on a random forest model; this number is only half the number of genes selected using MRMR [7]. Perhaps even fewer genes could be selected as the input to the SDBE algorithm to reduce computation time. However, for both the GSEA–SDBE method and its simplified version (Pre-SDBE), 10-fold cross-validation is the main time-consuming factor.

Some minor comments

1. The paper needs language corrections.

Response: I am very sorry for my incorrect writing. I have carefully corrected the language; please check the revised paper.

2. References are not coherent. Some references are missing authors' names (et al.
should not be in the bibliography)

Response: I am very sorry for my incorrect writing and have revised the references as follows.

References

1. Hartmaier R, Albacker LA, Chmielecki J, Bailey M, He J, Goldberg ME, et al. High-throughput genomic profiling of adult solid tumors reveals novel insights into cancer pathogenesis. Cancer Research. 2017;77:2464–2475. doi: 10.1158/0008-5472.CAN-16-2479.

2. Giovannantonio MD, Harris BH, Zhang P, Kitchen-Smith I, Xiong L, Sahgal N, et al. Heritable genetic variants in key cancer genes link cancer risk with anthropometric traits. Journal of Medical Genetics. 2020;0:1–8. doi: 10.1136/jmedgenet-2019-106799.

3. Díaz-Uriarte R, de Andrés SA. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7(3):1–13. doi: 10.1186/1471-2105-7-3.

4. Pok G, Liu J-CS, Ryu KH. Effective feature selection framework for cluster analysis of microarray data. Bioinformation. 2010;4(8):385–389. doi: 10.6026/97320630004385.

5. Xie J, Wang C. Using support vector machines with a novel hybrid feature selection method for diagnosis of erythemato-squamous diseases. Expert Syst Appl. 2011;38(5):5809–5815. doi: 10.1016/j.eswa.2010.10.050.

6. Piao Y, Piao M, Park K, Ryu KH. An ensemble correlation-based gene selection algorithm for cancer classification with gene expression data. Bioinformatics. 2012;28(24):3306–3315. doi: 10.1093/bioinformatics/bts602.

7. Elyasigomari V, Lee DA, Screen HRC, Shaheed MH. Development of a two-stage gene selection method that incorporates a novel hybrid approach using the cuckoo optimization algorithm and harmony search for cancer classification. Journal of Biomedical Informatics. 2017;67:11–20. doi: 10.1016/j.jbi.2017.01.016.

8. Sampathkumar A, Rastogi R, Arukonda S, Shankar A, Kautish S, Sivaram M. An efficient hybrid methodology for detection of cancer-causing gene using CSC for micro array data. J Ambient Intell Humaniz Comput. 2020;11(3):4743–4751. doi: 10.1007/s12652-020-01731-7.

9.
Pesaranghader A, Matwin S, Sokolova M, Beiko RG. SimDEF: definition-based semantic similarity measure of gene ontology terms for functional similarity analysis of genes. Bioinformatics. 2016;32(9):1380–1387. doi: 10.1093/bioinformatics/btv755.

10. Angerer P, Fischer DS, Theis FJ, Scialdone A, Marr C. Automatic identification of relevant genes from low-dimensional embeddings of single-cell RNA-seq data. Bioinformatics. 2020;36(15):4291–4295. doi: 10.1093/bioinformatics/btaa198.

11. Kuang S, Wei Y, Wang L. Expression-based prediction of human essential genes and candidate lncRNAs in cancer cells. Bioinformatics. 2021;37(3):396–403. doi: 10.1093/bioinformatics/btaa717.

12. Zeng XQ, Li GZ, Yang JY, Yang MQ, Wu GF. Dimension reduction with redundant gene elimination for tumor classification. BMC Bioinformatics. 2008;9(Suppl 6):S8. doi: 10.1186/1471-2105-9-S6-S8.

13. Ono H, Ishii K, Kozaki T, Ogiwara I, Kanekatsu M, Yamada T. Removal of redundant contigs from de novo RNA-Seq assemblies via homology search improves accurate detection of differentially expressed genes. BMC Genomics. 2015;16(1):1031–1044. doi: 10.1186/s12864-015-2247-0.

14. Suyan T. Identification of subtype-specific prognostic signatures using Cox models with redundant gene elimination. Oncology Letters. 2018;15:8545–8555. doi: 10.3892/ol.2018.8418.

15. Pashaei E, Aydin N. Binary black hole algorithm for feature selection and classification on biological data. Applied Soft Computing. 2017;56:94–106. doi: 10.1016/j.asoc.2017.03.002.

16. Xiao Y, Hsiao T-H, Suresh U, Chen H-IH, Wu X, Wolf SE, et al. A novel significance score for gene selection and ranking. Bioinformatics. 2014;30(6):801–807. doi: 10.1093/bioinformatics/btr671.

17. Deng H, Runger G. Gene selection with guided regularized random forest. Pattern Recognition. 2013;46(12):3483–3489. doi: 10.1016/j.patcog.2013.05.018.

18. Aličković E, Subasi A. Breast cancer diagnosis using GA feature selection and Rotation Forest.
Neural Computing and Applications. 2017;28(4):753–763. doi: 10.1007/s00521-015-2103-9.

19. Subramanian A, Kuehn H, Gould J, Tamayo P, Mesirov JP. GSEA-P: a desktop application for gene set enrichment analysis. Bioinformatics. 2007;23(23):3251–3253. doi: 10.1093/bioinformatics/btm369.

20. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research. 1999;27(1):29–34. doi: 10.1093/nar/27.1.29.

21. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–1740. doi: 10.1093/bioinformatics/btr260.

22. Reimand J, Isserlin R, Voisin V, Kucera M, Tannus-Lopes C, Rostamianfar A, et al. Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nature Protocols. 2019;14(2):482–517. doi: 10.1038/s41596-018-0103-9.

23. Robinson D. The statistical evaluation of medical tests for classification and prediction by M. Sullivan Pepe. Appl Stat. 2010;169(3):656–656. doi: 10.1111/j.1467-985X.2006.00430_9.x.

24. Khoury P, Gorse D. Investing in emerging markets using neural networks and particle swarm optimisation. International Joint Conference on Neural Networks. IEEE. 2015;1–7. doi: 10.1109/IJCNN.2015.7280777.

25. Boughorbel S, Jarray F, El-Anbari M. Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLoS One. 2017;12(6):e0177678. doi: 10.1371/journal.pone.0177678.

26. Chawla NV, Karakoulas G. Learning from labeled and unlabeled data: an empirical study across techniques and domains. Journal of Artificial Intelligence Research. 2005;23:331–366. doi: 10.1613/jair.1509.

27. Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett. 2006;27(8):861–874.

28. John GH, Kohavi R, Pfleger K. Irrelevant features and the subset selection problem. Machine Learning Proceedings 1994;121–129. doi: 10.1016/B978-1-55860-335-6.50023-4.

29.
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). New York: ACM; 2016. p. 785–794.

30. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Machine Learning. 2006;63(1):3–42. doi: 10.1007/s10994-006-6226-1.

31. Foldvary N, Nashold B, Mascha E, Thompson EA, Lee N, McNamara JO, et al. Seizure outcome after temporal lobectomy for temporal lobe epilepsy: a Kaplan-Meier survival analysis. Neurology. 2000;54(3):630–634. doi: 10.1212/WNL.54.3.630.

32. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1(2):203–209. doi: 10.1016/S1535-6108(02)00030-2.

33. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999;96(12):6745–6750. doi: 10.1073/pnas.96.12.6745.

3. Response to comment: (Add mathematical description of the methods covered. Algorithm are messy, please make them clear and easy to understand.)

Response: I am very sorry. I'm afraid I don't have this ability yet, but I will work hard in the future.

Submitted filename: Response to Reviewers.docx Click here for additional data file.

25 Oct 2021
PONE-D-21-09636R1
GSEA–SDBE: A gene selection method for breast cancer classification based on GSEA and analyzing differences in performance metrics
PLOS ONE

Dear Dr. Ai,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Dec 09 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org.

We look forward to receiving your revised manuscript.

Kind regards,
Khanh N.Q. Le
Academic Editor
PLOS ONE

Journal Requirements:

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #2: (No Response)

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: Partly

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #2: No

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: No

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics.
(Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: I suggest the following for further improving the article:

1. The article is lacking an academic way of writing. It needs a thorough revision. Improve language and presentation of the paper.

2. Please consider various breast cancer datasets for showing the efficacy of the method.

3. Add redundancy assessment measures in addition to MCC.

4. Provide biological significance of the selected genes by various methods.

5. Provide experimental setup for the analysis in more details. What were the criteria used for parameter selection.

7. PLOS authors have the option to publish the peer review history of their article. Do you want your identity to be public for this peer review?

Reviewer #2: No
25 Dec 2021

Responses to the reviewer's comments:

Reviewer #2:

1. Response to comment: (The article is lacking an academic way of writing. It needs a thorough revision. Improve language and presentation of the paper.)

Response: I am very sorry for my incorrect writing. I have carefully corrected the language; please check the revised paper.

2. Response to comment: (Please consider various breast cancer datasets for showing the efficacy of the method.)

Response: I am very sorry for my negligence. To show the efficacy of the method, the classification performance was compared with that of the three advanced algorithms for five cancer datasets (breast, lung, liver, colon, and prostate).

Performance comparison of GSEA–SDBE with that of other models

To test the feature selection performance of the GSEA–SDBE method, a simplified version, named Pre-SDBE, which does not use GSEA to filter out genes weakly associated with or unrelated to cancer, was used. The three advanced gene selection algorithms were the genetic algorithm (GA), particle swarm optimization (PSO) algorithm, and cuckoo optimization algorithm and harmony search (COA-HS). These algorithms use 100 relevant genes selected via minimum redundancy maximum relevance (MRMR) as input data and an SVM as the classifier [7]. The classification performance of Pre-SDBE was compared with that of the three advanced algorithms for five cancer datasets composed of DEGs in breast, lung, and liver cancers and genes expressed in prostate and colon cancers (Table 5).

Table 5. Information on the datasets used for performance comparison.
    Name      Data sources           #Genes   #DEGs   #Samples (Normal / Tumor)
    Breast    TCGA a                 56,536   4,579   1,222 (113 / 1,109)
    Lung      TCGA a                 56,536   7,483   1,146 (108 / 1,038)
    Liver     TCGA a                 56,536   8,772   465 (58 / 407)
    Prostate  Microarray dataset b   12,600   -       102 (50 / 52)
    Colon     Microarray dataset c   7,457    -       62 (22 / 40)

a TCGA database (https://cancergenome.nih.gov/); b Singh et al. [44]; c Alon et al. [45]; #Genes: number of genes; #DEGs: number of differentially expressed genes (obtained using wilcox.test with logFC > 1 and FDR < 0.05); #Samples: number of selected samples.

In the step of the Pre-SDBE algorithm that selects genes by their importance, the top 50 relevant genes were selected based on a random forest model (S1 Fig). These genes were then fed into the SDBE algorithm to identify the most relevant genes with the highest accuracy. The number of iterations in the SDBE algorithm was set at 6, 7, 23, 3, and 10 for the breast, lung, liver, colon, and prostate cancer datasets, respectively. The fitness of PSO, GA, and COA-HS over 100 iterations for each cancer dataset is shown in S2 Fig. Table 6 shows that for unbalanced data (breast, lung, and liver cancers), the classification metrics (MCCs) of the PSO, GA, and COA-HS algorithms were much lower than those of Pre-SDBE (98.07, 97.45, and 96.98 for breast, lung, and liver cancers, respectively), indicating that these algorithms did not perform well on unbalanced data. Across the five cancer datasets, whether the data were balanced or unbalanced, Pre-SDBE outperformed the other three algorithms, achieving the highest classification accuracy while selecting fewer genes (Table 6). More details are shown in S3 Fig and S5 and S6 Tables.

Table 6. Classification metrics of four optimization algorithms for five cancer datasets.
Breast (unbalanced):
    Algorithm   #Genes  MCC    RA     F1     SE     SP
    Pre-SDBE    4       98.07  99.42  99.82  99.73  99.12
    PSO a       30      82.98  95.56  98.18  97.00  94.12
    GA a        18      88.87  98.80  98.78  97.60  100
    COA-HS a    11      90.93  97.78  99.09  98.50  97.06

Lung (unbalanced):
    Algorithm   #Genes  MCC    RA     F1     SE     SP
    Pre-SDBE    3       97.45  98.93  99.76  99.71  98.15
    PSO a       29      88.29  98.72  98.70  97.44  100
    GA a        15      90.88  99.04  99.03  98.08  100
    COA-HS a    8       89.56  98.88  98.87  97.76  100

Liver (unbalanced):
    Algorithm   #Genes  MCC    RA     F1     SE     SP
    Pre-SDBE    3       96.98  98.12  99.63  99.75  96.49
    PSO a       24      62.03  91.87  91.15  83.74  100
    GA a        16      68.30  93.90  93.51  87.80  100
    COA-HS a    9       72.73  95.12  94.87  90.24  100

Colon (balanced):
    Algorithm   #Genes  AC     SE     SP
    Pre-SDBE    2       100    100    100
    PSO a       11      96.42  85.80  100
    GA a        14      95.16  84.60  100
    COA-HS a    5       100    100    100

Prostate (balanced):
    Algorithm   #Genes  AC     SE     SP
    Pre-SDBE    5       98.99  98.99  98.99
    PSO a       19      98.04  91.80  100
    GA a        28      98.04  91.80  100
    COA-HS a    5       100    100    100

a Elyasigomari et al. [7]; Pre-SDBE: simplified version of the GSEA-SDBE method; MCC: Matthews correlation coefficient; RA: ROC_AUC_score; F1: F1_score; AC: accuracy; SE: sensitivity; SP: specificity; #Genes: number of selected genes. Note: different performance metrics are reported for unbalanced (breast, lung, and liver) and balanced (colon and prostate) data.

3. Response to comment: (Add redundancy assessment measures in addition to MCC.)
Response: I apologize, but I have not been able to find redundancy assessment measures beyond the MCC; I will continue to investigate this in future work.

4. Response to comment: (Provide biological significance of the selected genes by various methods.)
Response: The biological significance of the selected genes is as follows.

Relevance of the selected genes to cancer

VEGF-D induces the formation of lymphatics within tumors, thereby facilitating the spread of the tumor to lymph nodes, and promotes tumor angiogenesis and growth [32-36]. TSLP is an interleukin-7 (IL-7)-like cytokine that is involved in the progression of various cancers and is a key mediator of breast cancer progression [37-40]. Human PKMYT1 is an important regulator of the G2/M transition in the cell cycle.
Studies have demonstrated that PKMYT1 might be a therapeutic target in hepatocellular carcinoma and neuroblastoma [41-43].

5. Response to comment: (Provide experimental setup for the analysis in more details. What were the criteria used for parameter selection.)
Response: I apologize for this omission. More details have been added for the SDBE algorithm, as shown below.

Stage 7: The genes in the list NG were analyzed by backward elimination. At each step of backward elimination, a 10-fold classification model, e.g., random forest (RF), support vector machine (SVM), k-nearest neighbor (KNN), extreme gradient boosting (XGBoost), or extremely randomized trees (ExtraTrees), was trained and tested to calculate the performance metrics, and after each step these metrics were appended to the corresponding metric lists. If the number of iterations, set based on experience, had been reached, the iteration was terminated and the data were saved; otherwise, the transposed metric lists and the list NG were sent back to Stage 3 to start a new iteration.

Stage 8: Mapping analysis of the metric lists and the list NG was performed, and the smallest set of relevant genes needed to achieve the required sample classification performance was determined.

Submitted filename: Response to Reviewers.docx

14 Jan 2022

GSEA-SDBE: A gene selection method for breast cancer classification based on GSEA and analyzing differences in performance metrics
PONE-D-21-09636R2

Dear Dr. Ai,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you'll receive an e-mail detailing the required amendments.
When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.

To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Nguyen Quoc Khanh Le
Academic Editor
PLOS ONE

18 Apr 2022

PONE-D-21-09636R2
GSEA-SDBE: A gene selection method for breast cancer classification based on GSEA and analyzing differences in performance metrics

Dear Dr. Ai:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.
Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Nguyen Quoc Khanh Le
Academic Editor
PLOS ONE
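As a reading aid, the core of the SDBE step described in the response to comment 5 above, scoring each gene by the change in cross-validated MCC when that gene alone is deleted, then splitting the score list at a set position and sorting each part, can be sketched in Python with scikit-learn. This is a rough illustration under stated assumptions, not the author's implementation: the names `cv_mcc` and `sdbe_round`, the choice of 50 trees, and the descending sort within each part are all illustrative.

```python
# Rough sketch (not the author's code) of the SDBE MCC-difference scoring
# and split-and-sort reordering, using scikit-learn. All names and
# parameter choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import cross_val_predict


def cv_mcc(X, y, cols, seed=0):
    """10-fold cross-validated MCC of a random forest restricted to `cols`."""
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    pred = cross_val_predict(clf, X[:, cols], y, cv=10)
    return matthews_corrcoef(y, pred)


def sdbe_round(X, y, genes, split_pos):
    """One pass of the SDBE idea: score each gene by the MCC drop observed
    when it alone is deleted, split the score list at `split_pos`, and sort
    each part by that drop so that relevant genes drift to the left of the
    list and redundant genes gather on the right for elimination."""
    base = cv_mcc(X, y, genes)
    # Positive drop: deleting the gene hurts performance -> gene is relevant.
    drops = [base - cv_mcc(X, y, [c for c in genes if c != g]) for g in genes]
    pairs = list(zip(genes, drops))
    left = sorted(pairs[:split_pos], key=lambda t: t[1], reverse=True)
    right = sorted(pairs[split_pos:], key=lambda t: t[1], reverse=True)
    return [g for g, _ in left + right]


# Toy demonstration on synthetic data (not a cancer dataset).
X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)
genes = list(range(8))
order = sdbe_round(X, y, genes, split_pos=4)
print(order)  # a permutation of the 8 column indices
```

In the full method, this reordering would be iterated with a changing split position, and genes at the right end of the list would be removed until the smallest set meeting the required classification performance remains.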
References (33 in total)

1.  Thymic stromal lymphopoietin is a key mediator of breast cancer progression.

Authors:  Purevdorj B Olkhanud; Yrina Rochman; Monica Bodogai; Enkhzol Malchinkhuu; Katarzyna Wejksza; Mai Xu; Ronald E Gress; Charles Hesdorffer; Warren J Leonard; Arya Biragyn
Journal:  J Immunol       Date:  2011-04-13       Impact factor: 5.422

2.  Molecular signatures database (MSigDB) 3.0.

Authors:  Arthur Liberzon; Aravind Subramanian; Reid Pinchback; Helga Thorvaldsdóttir; Pablo Tamayo; Jill P Mesirov
Journal:  Bioinformatics       Date:  2011-05-05       Impact factor: 6.937

3.  Development of a two-stage gene selection method that incorporates a novel hybrid approach using the cuckoo optimization algorithm and harmony search for cancer classification.

Authors:  V Elyasigomari; D A Lee; H R C Screen; M H Shaheed
Journal:  J Biomed Inform       Date:  2017-02-03       Impact factor: 6.317

4.  Vascular endothelial growth factor-C and vascular endothelial growth factor-d messenger RNA expression in breast cancer: association with lymph node metastasis.

Authors:  Yu Koyama; Kouji Kaneko; Kohei Akazawa; Chizuko Kanbayashi; Tatsuo Kanda; Katsuyoshi Hatakeyama
Journal:  Clin Breast Cancer       Date:  2003-12       Impact factor: 3.225

5.  Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap.

Authors:  Jüri Reimand; Ruth Isserlin; Veronique Voisin; Mike Kucera; Christian Tannus-Lopes; Asha Rostamianfar; Lina Wadi; Mona Meyer; Jeff Wong; Changjiang Xu; Daniele Merico; Gary D Bader
Journal:  Nat Protoc       Date:  2019-02       Impact factor: 13.491

6.  TSLP: from allergy to cancer.

Authors:  Jonathan Corren; Steven F Ziegler
Journal:  Nat Immunol       Date:  2019-11-19       Impact factor: 25.606

7.  Gene selection and classification of microarray data using random forest.

Authors:  Ramón Díaz-Uriarte; Sara Alvarez de Andrés
Journal:  BMC Bioinformatics       Date:  2006-01-06       Impact factor: 3.169

8.  Elevated VEGF-D Modulates Tumor Inflammation and Reduces the Growth of Carcinogen-Induced Skin Tumors.

Authors:  Hanne-Kaisa Honkanen; Valerio Izzi; Tiina Petäistö; Tanja Holopainen; Vanessa Harjunen; Taina Pihlajaniemi; Kari Alitalo; Ritva Heljasvaara
Journal:  Neoplasia       Date:  2016-07       Impact factor: 5.715

9.  Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric.

Authors:  Sabri Boughorbel; Fethi Jarray; Mohammed El-Anbari
Journal:  PLoS One       Date:  2017-06-02       Impact factor: 3.240

10.  Identification of subtype-specific prognostic signatures using Cox models with redundant gene elimination.

Authors:  Suyan Tian
Journal:  Oncol Lett       Date:  2018-04-04       Impact factor: 2.967

