Literature DB >> 20925936

Robust assignment of cancer subtypes from expression data using a uni-variate gene expression average as classifier.

Martin Lauss¹, Attila Frigyesi, Tobias Ryden, Mattias Höglund.

Abstract

BACKGROUND: Genome wide gene expression data is a rich source for the identification of gene signatures suitable for clinical purposes and a number of statistical algorithms have been described for both identification and evaluation of such signatures. Some employed algorithms are fairly complex and hence sensitive to over-fitting whereas others are more simple and straight forward. Here we present a new type of simple algorithm based on ROC analysis and the use of metagenes that we believe will be a good complement to existing algorithms.
RESULTS: The basis for the proposed approach is the use of metagenes, instead of collections of individual genes, and a feature selection using AUC values obtained by ROC analysis. Each gene in a data set is assigned an AUC value relative to the tumor class under investigation and the genes are ranked according to these values. Metagenes are then formed by calculating the mean expression level for an increasing number of ranked genes, and the metagene expression value that optimally discriminates tumor classes in the training set is used for classification of new samples. The performance of the metagene is then evaluated using LOOCV and balanced accuracies.
CONCLUSIONS: We show that the simple uni-variate gene expression average algorithm performs as well as several alternative algorithms such as discriminant analysis and the more complex approaches such as SVM and neural networks. The R package rocc is freely available at http://cran.r-project.org/web/packages/rocc/index.html.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2010 PMID： 20925936 PMCID： PMC2966465 DOI： 10.1186/1471-2407-10-532

Source DB: PubMed Journal: BMC Cancer ISSN： 1471-2407 Impact factor: 4.430

Background

One of the most promising clinical applications of genome wide expression studies is the construction of robust and reliable disease classifiers. Correct identification and sub-classification of diseases such as cancer is a prerequisite for proper and efficient treatment. To date a large number of different algorithms for disease classification have been described. They range in complexity from neural network approaches [1] to the simpler nearest-neighbor classification algorithms [2]. Even though some of the more complex approaches such as neural networks and self organized maps (SOM) [3] have proved to be very efficient, these methods often rely on the tuning of several parameters and hence are liable for over-fitting. Furthermore, simple classifiers seem to perform remarkably well when compared to more sophisticated ones [4]. In the present investigation our aim has been to design a simple predictor system useful for cancer subtype classification. Features to be included in the predictor signatures are selected based on their classification capacity as determined by a receiver operating characteristic (ROC) analysis and area under the curve (AUC) estimates [5,6]. After selection of the appropriate number of genes in the predictor signature, the mean expression level of all genes included is calculated, transforming the ensemble of genes into one vector and used as a uni-variate gene expression average, or a metagene, as classifier. Two features of gene expression are exploited by the merging of genes, genes are often co-regulated and hence correlated, and by using the expression level of the metagene, effects by random noise from single genes are minimized. Most of the commonly used algorithms such as SVM [7] and PAM [8] apply specifications such as support vectors and weights to the individual features included in the predictor gene signatures which potentially complicate their application to independent data [9]. Hence in this investigation we use an alternative way to evaluate the results by using the obtained training set gene signature genes only and then establish new parameters in the validation set to evaluate the performance of the classifier. We show that the proposed metagene classifier produces excellent accuracies, similar to what is obtained with a SVM approach, in several types of cancer data sets using a variety of tumor classification criteria.

Implementation

Data sets

To establish the classifier we used bladder cancer datasets produced by Sanchez-Carbayo et al. [10] (Supplementary Table 10 in [10]) "SanchezC", Stransky et al. [11] (ArrayExpress: E-TABM-147) "Stransky"; and Blaveri et al. [12] (Supplementary Table 4 in [12]) "Blaveri". The remaining datasets were obtained from Gene Expression Omnibus (GEO) [13], except for the vandeVijver breast cancer dataset [14]. The following datasets were downloaded from GEO; for breast GSE2034 (WangY), GSE2990 (Sotiriou), for neuroblastoma GSE3960 (WangQ), GSE12460 (JanoueixL), GSE19274 (Attiyeh), for lung GSE8569 (Angulo), GSE11969 (Takeuchi). For a detailed description of the datasets see Additional file 1. Normal urothelium samples, recurring tumors from the same patient, cell lines, and technical replicates were not included in the final bladder cancer data sets. The SanchezC dataset was quantile-normalized using the normalizeBetweenArrays function of the R package limma [15]. Robust Multi-array Average (RMA) was performed separately for two samples sets of the Stransky dataset (on U95A and U95Av2 respectively) using the affy package [16]. Obtained RMA expression values were de-logged, the samples sets combined, and quantile normalized using limma. The SanchezC and Stransky datasets were both transformed to log2 scale. To obtain gene-centered values the gene expression values were subtracted by the mean expression of the gene in each dataset separately. The Blaveri dataset was imputed for missing values using k-nearest neighbors (k = 10) for genes that had no more than 20% missing data, and genes with >20% missing data were omitted [17]. The HGNC GeneSymbols were updated in all datasets with the official HGNC GeneSymbols from the HGNC webpage [18]. The expression values of GeneSymbols with multiple reporters were merged by taking the median expression value. All reporters in the datasets without a GeneSymbol were discarded. The final SanchezC dataset contained 90 patients and 12761 genes, the Stransky dataset 56 patients and 8955 genes, and the Blaveri dataset 74 patients and 4430 genes. The SanchezC and Stransky datasets share a total of 8518 GeneSymbols and were used to explore the AUC characteristics. For classification, Ta and T1 cases were considered non-muscle invasive (NMI), and ≥T2 cases as muscle-invasive (MI). Grade is discriminated between Grade 2 and 3 in SanchezC, and between Grade1+2 and Grade 3 in Sanchez. Randomized versions of the datasets were generated using the mean and standard deviation of the original datasets. Non-bladder cancer Affymetrix datasets not already normalized were downloaded as cel files and normalized using RMA. All other datasets were downloaded as normalized 'series matrix files'. In the case of missing values, k-nearest neighbor imputation was performed (k = 10). Gene Symbols were updated using the official HGNC nomenclature file, and then expression values for reporters with the same GeneSymbols were merged. The data was mean-centered, except for two-color array data, as this data comes already in ratios. Reporters with no GeneSymbols were excluded from the final data. The vandeVijver data was imputed for missing values by k-nearest neighbor, transformed to log2 ratios and GeneSymbols were updated and merged.

ROC analyses

The receiver operating characteristic (ROC) curve is the plot of sensitivity (true positive rate) vs. 1-specificity (false positive rate), for predicting a binary classification variable z using some covariate x. That is, if x >t for some threshold t then z is predicted as 1, otherwise as 0. As the threshold t ranges from +ℕ to -ℕ the fraction of true positive and false positive predictions will both increase from 0 to 1, yielding the ROC curve. The area under this curve (AUC) is an overall measure of the predictor's performance. An ideal predictor obtains true positive rate 1 and false positive rate 0 for some threshold t, and then AUC = 1. Ignoring covariate information and guessing randomly by predicting z = 1 with some probability q yields AUC = 1/2 by letting q range from 0 to 1, so 1/2 is a worst-case AUC. If AUC < 1/2 the covariate x is however negatively correlated with z, and replacing x by -x turns AUC into 1-AUC (which will be >1/2). For a given covariate x we can thus view max(AUC,1-AUC) as its performance. For a finite sample of x's and z's, AUC can be computed as a function of the Mann-Whitney (or two-sample Wilcoxon) statistic for comparing the x's associated z = 0 and z = 1 respectively.

Supervised classification using the uni-variate gene expression average classifier

Genes to be included in the gene signatures are selected in order of their ranked max(AUC,1-AUC) values. To merge the gene expression of a given gene signature to a single metagene expression value, an arithmetic mean is computed by summing up the expression values after multiplying expression values for genes negatively associated with the feature (AUC < 0.5) by -1 (Additional file 2). The resulting metagene expression values are then used in ROC analyses, i.e. by ranking the samples according to their metagene expression values. The optimal split of positive (i.e., 1) and negative (i.e., 0) samples is determined as the metagene expression threshold which produces the highest accuracy i.e., correct class assignments in respect to the real class, in the training set. More precisely, the threshold is computed as the mean metagene expression value of the two samples that constitute the border of the split. A new sample to be classified has its metagene expression value determined with the same genes to be multiplied by -1. The new sample is classified according to which side of the threshold the sample falls in, with a sample having higher metagene expression being classified as positive (i.e., 1) and with lower expression as negative (i.e., 0). The split yielding optimal accuracy in the ROC curve is determined using the R package ROCR [19]. The outlined approach is to some respect similar to the approach described by Rosenwald et al [20] except that we have simplified the use of metagenes further by using one single metagene and thus do not have to assign any specific weights to individual metagenes. In addition we optimize the threshold for each dataset.

Additional classification algorithms

We compared our classifier initially to a Support Vector Machine (SVM) with a linear kernel. SVMs have been shown to perform considerably well in microarray data [21] and SVMs with a linear kernel has been suggested to perform better in gene expression data than more complex SVM versions (Manual BRB Array Tools, [22]), and additionally, no parameter tuning is necessary for linear SVMs. Briefly, Support Vector Machines (SVMs) identify the maximum margin hyperplane that optimally separates the training samples (based on support vectors) and then classify unseen samples according to the side of the hyperplane they fall into [23]. We used the 'svm' function of the R package e1071 [24]. Additional classification algorithms were implemented using the R package MLInterfaces [25] using the default settings and included SVM with radial kernel (SVMradial), SVM with polynomial kernel (SVMpoly), k-nearest neighbor (knn, k = 3), random forest (rforest), recursive partitioning trees (rpart), bagging (bagging), linear discriminant analysis (lda), diagonal linear discriminant analysis (dlda), stabilized linear discriminant analysis (slda), neural network (neural net, with 3 hidden layers), except for the nearest centroid classifier (ncc) that was implemented using the pamr package [8].

Performance Evaluation

As the accuracies of prediction are dependent on the prior distribution of samples, we used balanced accuracies computed by (sensitivity + specificity)/2. Balanced accuracies are independent of the prior distributions [26]. Unbiased accuracies were obtained by leave-one-out-cross-validation (LOOCV). Feature selection was repeated in each loop of LOOCV. When testing the performance of a gene signature in independent validation data only the genes were used in the validation data, i.e. not the information on AUC and 1-AUC or the classification threshold. In a clinical validation platform, as qPCR and IHC, we assume that the classifier specification might not be taken over. We applied a LOOCV loop to determine max(AUC,1-AUC) and optimal classification threshold for the signature metagene in each loop. Accuracies obtained from cross-validation loops are an estimate for the accuracy obtained from the whole dataset, i.e. all samples.

R package rocc

Briefly, the package includes the functions tr.rocc, p.rocc, and o.rocc. The function tr.rocc uses a training set with a given phenotype to determine the metagene threshold. The function p.rocc predicts the class of a new sample using the classifier specification of the tr.rocc output. The function o.rocc performs a LOOCV loop using a metagene of given size, e.g., top 200 genes, with feature selection in each loop separately. tr.rocc (data,out,xgenes = 200) p.rocc (tr.rocc.object,newsample) o.rocc (data,out,xgenes = 200) data = dataset as a matrix file with samples as columns and genes as rows out = phenotype as factor with levels 0 and 1 xgenes = number of genes that constitute the metagene; can be a numeric vector. newsample = sample to be classified using a classifier obtained from tr.rocc()

Results and Discussion

Feature selection

Two bladder cancer datasets, SanchezC and Stransky, were selected to explore the efficiency of predictor gene signatures based on ROC statistics. For each gene the association with muscle-invasiveness (MI) and tumor grade (G) as feature variables was estimated by calculating the AUC value in each dataset. The obtained AUC values ranged between 0.06 and 0.95, and 0.07 and 0.94 for MI, and between 0.14 and 0.85, and 0.10 and 0.89 for high grade (G3) in the SanchezC and Stransky data, respectively. The distribution of the AUC values deviated from the normal distribution and showed heavy tails (Figure 1). As a comparison, AUC values were also estimated in randomized versions of the SanchezC and Stransky datasets. In the randomized data 99% of the obtained AUC values were approximately between 0.7 and 0.3 for MI and tumor grade. A total of 31% and 23%, and 11% and 14% of genes fall outside these 99% borders for MI and Grade variables in the SanchezC and Stransky dataset, respectively. In contrast, 1% of genes are expected to fall outside for a randomized variable. The genes in excess to this 1% of false discovery genes are informative, and hence may be considered as the maximum size of a gene signature. Hence, a large proportion of genes show informative AUC values. Furthermore, as more genes are associated with MI than with grade, the major difference in bladder cancer phenotype seems to be associated with stage.

Figure 1

Distribution of AUC values for all genes in a dataset. AUC values in respect to MI-Status are plotted in solid lines for SanchezC data in A) and Stransky data in B). AUC values in respect to high grade (G3) are plotted in solid lines for SanchezC data in C) and the Stransky data in D). Distributions of AUC values for randomized versions are plotted in dashed lines and the intervals that contain 99% of randomized data are indicated with vertical dashed lines. We then investigated the robustness of obtained AUC values by comparing AUC values from two different datasets. In Figure 2A we have plotted the AUC values for NMI/MI status and in Figure 2B the equivalent data for grade, for the SanchezC and Stransky data respectively. The correlations between the two datasets were moderate, 0.60 and 0.66, respectively. In fact, many genes change from significant AUC values (> 0.7 or <0.3) in one data set to insignificant values in the other, and even obtain AUC values in the opposite direction. Very few genes showed AUC values outside the 95% confidence interval of the randomized data in both datasets (Figure 2).

Figure 2

Scatter plot of AUC values for 8518 genes shared by SanchezC and Stransky. AUC values for MI-status are plotted in A and AUC values for grade are plotted in B. Genes with lower or higher AUC values than 99% of randomized data (dashed lines) in both datasets are depicted in green or red, respectively. r = Pearson correlation coefficient. We then investigated the possibility to identify robust predictor genes i.e., genes with informative AUC values in more than one dataset, by using more information during the feature selection process. We investigated deviations from a normal distribution as a possible additional criterion. In an optimal scenario an informative gene should show high expression in one and low expression in a second group in a two class situation and hence produce a bimodal distribution, or a distribution with a heavy tail. We used the Shapiro test for normality to identify genes with skewed or otherwise distorted distributions; the analysis revealed no differences between the two groups (Figure 3A and 3B). We investigated the variance of the informative genes as genes with large variance are expected to produce more robust results and did note a significant shift to larger variance for genes with AUC values >0.7. This shift was however too small to be of any practical use (Figure 3C and 3D). Similar results were obtained for the Stransky data (Additional file 3). From this we conclude that deviation from the normal distribution or large variance cannot be used to preselect genes with robust AUC values and hence no such functions were added to the software.

Figure 3

Deviation from normal distribution and standard deviation of genes from SanchezC with significantly high or low AUC values in both datasets, SanchezC and Stransky. A) Deviation from normal distribution of genes grouped by their association to MI in Figure 2A. B) Deviation from normal distribution of genes grouped by their association to grade in Figure 2B. C) Box plot for standard deviation of genes grouped by association to MI. D) Box plot for standard deviation of genes grouped by association to grade. -log10 p-value = logarithmic p-value of Shapiro test for normality. Gene expression is inherently noisy and random noise is expected to reduce the performance of predictors based on single genes. We therefore reasoned that the random noise effect could be counteracted by using the mean expression level for more than one gene. We consequently calculated the AUC values for all genes with respect to MI and grade in the Sanchez data. Before ranking, AUC values for genes with negative correlation, and hence showing AUC values <0.5, were inverted to values >0.5 by assigning AUC = 1-AUC-. We then designed gene signatures with increasing sizes by adding genes in order of their rank with steps of 1 at a time. For each signature the average expression level was calculated. In cases when genes showed negative association with MI or grade, the gene expression levels were inverted. The AUC values were then estimated using the expression levels of the created metagenes. Figure 4A and 4B show that the use of uni-variate gene expression average classifier results in considerably higher AUC values than those obtained for single and top ranking AUC genes. Furthermore, the AUC values turn stable when the metagenes become large enough.

Figure 4

Area under the Curve (AUC) values of the ranked single genes and of the metagene obtained from taking the mean expression of the single genes. A) AUC for association to MI. B) AUC for association to grade. Classification performance of MI status for the metagene based predictor (ROC) and the SVM using LOOCV in the SanchezC (C) and Stransky data (D), respectively. For SVM we used two different features selection criteria, t-test and AUC. Accuracy = Balanced Accuracy (see Methods), dashed line = Balanced Accuracy obtained by random class assignment (= 0.5). ROC = metagene-based predictor. SVM = Support Vector Machine.

Classification performance

We next compared the prediction performance of AUC-based metagenes with the standard and frequently used SVM-method. To accomplish this we constructed AUC metagenes with increasing number of genes in steps of 5 up to 500 genes. For comparison, we applied the most differentially expressed genes as determined by a t-test or AUC values to SVM, also in increasing steps of 5 genes. The accuracies were estimated using LOOCV in both cases. In Figure 4C and 4D we tested the NMI/MI classification in the SanchezC and in the Stransky data, respectively. As can be seen the AUC metagene approach is as efficient as the SVM approach using both t-test and AUC as feature selection criteria; only slightly poorer in the SanchezC data, with accuracies close to 0.9 in both cases, but slightly better in the Stransky data, with accuracies close to 0.85 for the AUC metagene and 0.80 for the SVM approach. Hence, the simple AUC metagene approach seems to be as efficient as the more sophisticated SVM. The robustness of the AUC metagene approach was then tested in independent data. For these purposes, we designed AUC metagene signatures with increasing sizes in one of the three datasets, SanchezC, Stransky, and Blaveri. The accuracy for each AUC metagene was first determined in the respective training set using LOOCV (Figure 5). Genes were then ranked according to AUC values using all cases in the training set and used as a source for producing gene signatures of increasing size applied to the independent data sets. These metagenes were then used to establish thresholds for maximum accuracies in the validation data sets within LOOCV loops. Hence, the signature genes used to produce the metagene were derived from the training set but the actual thresholds were derived from the validation sets. To estimate classification performance in the validation data sets we again performed LOOCV. In Figure 5A we have used metagenes obtained from the SanchezC dataset to predict NMI/MI in the Blaveri and Stransky datasets. As can be seen, when applied to the training set, i.e. the SanchezC dataset itself, accuracies close to 0.9 are obtained whereas in the Blaveri and Stransky accuracies slightly above 0.9 are obtained. Similar results were obtained for the other combinations of training and validation data, and hence, the AUC metagene approach is robust to independent data.

Figure 5

Performance of the metagene-based classifier when applied to independent data. A) Genes with highest AUC values of the SanchezC data are first tested by LOOCV in the SanchezC dataset (black circles) and then taken over to establish a classifier in the Stransky (green) and Blaveri (datasets), again by LOOCV. Gene signatures of 5-500 gene members are used. Balanced Accuracies are plotted; the dashed line at 0.5 indicates the balanced accuracy obtained by chance. B) and C): Procedure repeated with Blaveri and Stransky as training data, respectively. D) Changes in balanced accuracies from one gene signature size to the next biggest size are plotted for all balanced accuracies obtained in the validation datasets in A-C). A close inspection of Figure 5A-C seems to indicate that signatures for MI in bladder cancer smaller than 100 genes are less robust i.e. show a large variance in accuracy when applied to validation data. To explore this further, we calculated the change in accuracies between each step of signature size for signatures derived from the same dataset. These values were then plotted (Figure 5D). It is obvious form Figure 5D that signatures smaller than 100 are sensitive to the composition of the signature, whereas signatures above 200 show robust performance, seen as small or no changes in accuracies for different AUC metagenes. From this one may conclude that gene signatures of at least 150 genes are needed to produce robust predictors of NMI/MI in bladder. The metagene classifier approach was then evaluated further using two approaches. First, we evaluated the algorithm performance regarding different endpoints. For this purpose we used three breast cancer datasets (vandeVijver, WangY, Sotiriou), three neuroblastoma datasets (WangQ, JanoueixL, Attiyeh) and two lung cancer datasets (Angulo, Takeuchi), and six different endpoints; ER status, tumor grade, tumor size, tumor stage, MYCN amplification status, and histological subtype. Second, we compared the uni-variate gene expression average classifier with 12 alternative classification algorithms using two feature selection procedures. In these comparisons we predetermined the number of genes to be included in the metagene to 50, as this is close to the median number of genes included in published gene signatures. For each endpoint the selected features (genes) were determined in one dataset and then applied to the remaining ones (Table 1). For classification of e.g. ER status in breast cancer the gene list was derived from one dataset and then applied to the two remaining ones. This procedure was repeated using each dataset for feature selection resulting in a total of six tests. Overall the accuracies were high, ranging from 0.77 to 0.94 and similar to those obtained by SVM (range 0.71 to 0.95) when applied to validation data. Instances resulting in lower range accuracies using the metagene predictor where also low when using the SVM approach, indicating that the obtained lower accuracies were dependent on dataset and not on the algorithm used. Tumor grade was only available for two breast cancer datasets, vandeVijver and Sotiriou. Grade was predicted in validation data with accuracies ranging from 0.82 to 0.85 using the metagene predictor, similar to what was obtained by SVM (Table 1). The uni-variate gene exression average classifier could also faithfully predict MYCN status in neuroblastoma and histopathological subtype in lung cancers. Tumor stage in neuroblastoma and tumor grade in lung cancers were predicted with lower but significant accuracies. As can be seen from the computed average accuracies for each phenotype, the simple metagene predictor on average performs just as well, or better, than the more complex SVM approach. All predictor evaluations were repeated using 10 and 200 member metagenes, the obtained accuracies were however not influenced by the size of the metagenes (Additional file 4).

Table 1

Balanced accuracies of prediction obtained in various phenotypes using the metagene classifier (ROC) and linear SVM (SVM).

Phenotype/Method	Training¹	Validation	Validation	Training	Validation	Validation	Training	Validation	Validation	Mean
ER²(BC)	vandeVijver	WangY	Sotiriou	WangY	vandeVijver	Sotiriou	Sotiriou	vandeVijver	WangY
ROC	0.95	0.89	0.74	0.89	0.94	0.77	0.77	0.90	0.88	0.85
SVM	0.94	0.82	0.71	0.82	0.95	0.73	0.69	0.95	0.82	0.83
Grade³(BC)	vandeVijver	Sotiriou		Sotiriou	vandeVijver
ROC	0.79	0.85		0.84	0.82					0.84
SVM	0.72	0.78		0.84	0.78					0.78
Size⁴(BC)	vandeVijver	Sotiriou		Sotiriou	vandeVijver
ROC	0.63	0.65		0.70	0.60					0.63
SVM	0.57	0.51		0.61	0.48					0.49
MYCN⁵(NB)	WangQ	JanoueixL	Attiyeh	JanoueixL	WangQ	Attiyeh	Attiyeh	WangQ	JanoueixL
ROC	0.94	0.91	0.82	0.87	0.89	0.79	0.81	0.95	0.89	0.88
SVM	0.98	0.83	0.75	0.88	0.97	0.67	0.85	0.98	0.87	0.84
Stage⁶(NB)	WangQ	JanoueixL	Attiyeh	JanoueixL	WangQ	Attiyeh	Attiyeh	WangQ	JanoueixL
ROC	0.74	0.58	0.82	0.66	0.65	0.70	0.69	0.62	0.48	0.64
SVM	0.79	0.58	0.74	0.61	0.73	0.61	0.60	0.72	0.54	0.65
AD/SQ⁷(LC)	Angulo	Takeuchi		Takeuchi	Angulo
ROC	0.93	0.96		0.95	0.96					0.96
SVM	0.89	0.91		0.93	0.84					0.88
Grade⁸(LC)	Angulo	Takeuchi		Takeuchi	Angulo
ROC	0.67	0.69		0.69	0.68					0.68
SVM	0.63	0.64		0.63	0.59					0.61

1 Accuracies given in italics indicates accuracies obtained in the training set using LOOCV

2 ER+ vs. ER-

3 grade 1 vs. grade 3

4 tumors < = 2 cm vs. tumors > 2 cm

5 MYCN amplification positive vs. MYCN amplification negative

6 stage 1 and 2 vs. stage 3 and 4

7 adenocarcinoma vs. squamous cell carcinoma

8 grade 1 vs. grade 2 and 3

Abbreviations; BC = Breast Cancer, NB = Neuroblastoma, LC = Lung Cancer

Balanced accuracies of prediction obtained in various phenotypes using the metagene classifier (ROC) and linear SVM (SVM). 1 Accuracies given in italics indicates accuracies obtained in the training set using LOOCV 2 ER+ vs. ER- 3 grade 1 vs. grade 3 4 tumors < = 2 cm vs. tumors > 2 cm 5 MYCN amplification positive vs. MYCN amplification negative 6 stage 1 and 2 vs. stage 3 and 4 7 adenocarcinoma vs. squamous cell carcinoma 8 grade 1 vs. grade 2 and 3 Abbreviations; BC = Breast Cancer, NB = Neuroblastoma, LC = Lung Cancer For the extensive comparison with other classification algorithms we limited the analysis to ER status in breast cancer, MYCN status in neuroblastoma, and adenocarcinoma/squamous cell carcinoma status in lung cancer using a t-test as feature selection criteria (Table 2). For comparison we also used AUC as feature selection criteria (Additional file 5). As can be seen in Table 2, all classification algorithms show similar performance. The performance is more dependent on the endpoint/dataset than the actual algorithms used. Furthermore, simple classifiers such as the discriminant analysis methods (lda, dlda, slda) perform just as well as the more sophisticated ones, such as the neural network algorithm neuralnet. Our uni-variate gene expression average classifier shows good performance, ranking first, third, and fifth in the lung cancer, breast cancer and neuroblastoma datasets, respectively. Almost identical results were obtained when using AUC as feature selection criteria (Additional file 5).

Table 2

Balanced accuracies of prediction using the metagene classifier (ROC) and various classification algorithms and a t-test as feature selection criteria.

Phenotype/Method	Training¹	Validation	Validation	Training	Validation	Validation	Training	Validation	Validation	Mean
ER²(BC)	vandeVijver	WangY	Sotiriou	WangY	vandeVijver	Sotiriou	Sotiriou	vandeVijver	WangY
ROC⁵	0.95	0.89	0.74	0.89	0.94	0.77	0.77	0.90	0.88	0.85
SVM	0.94	0.82	0.71	0.82	0.95	0.73	0.69	0.95	0.82	0.83
SVMradial	0.94	0.86	0.70	0.87	0.93	0.73	0.76	0.93	0.84	0.83
SVMpoly	0.92	0.80	0.67	0.77	0.89	0.65	0.75	0.84	0.72	0.76
knn	0.94	0.83	0.70	0.84	0.93	0.78	0.77	0.95	0.78	0.83
rforest	0.98	0.85	0.73	0.84	0.93	0.73	0.76	0.98	0.83	0.84
rpart	1.00	0.79	0.61	0.82	0.90	0.63	0.72	1.00	0.75	0.78
bagging	1.00	0.84	0.72	0.86	0.91	0.72	0.81	1.00	0.84	0.84
lda	0.95	0.85	0.76	0.86	0.94	0.73	0.76	0.96	0.83	0.84
dlda	0.96	0.89	0.79	0.88	0.94	0.77	0.87	0.94	0.85	0.86
slda	0.94	0.86	0.74	0.85	0.94	0.77	0.77	0.94	0.85	0.85
neuralnet	0.94	0.78	0.69	0.81	0.92	0.72	0.76	0.97	0.80	0.81
ncc	0.96	0.89	0.78	0.88	0.94	0.75	0.82	0.95	0.87	0.86
MYCN³(NB)	WangQ	JanoueixL	Attiyeh	JanoueixL	WangQ	Attiyeh	Attiyeh	WangQ	JanoueixL
ROC	0.94	0.91	0.82	0.87	0.89	0.79	0.81	0.95	0.89	0.88
SVM	0.98	0.83	0.75	0.88	0.97	0.67	0.85	0.98	0.87	0.84
SVMradial	0.95	0.93	0.82	0.93	0.94	0.81	0.87	0.95	0.86	0.88
SVMpoly	0.95	0.86	0.74	0.79	0.88	0.69	0.82	0.90	0.79	0.81
knn	0.95	0.93	0.86	0.88	0.92	0.85	0.86	0.98	0.85	0.90
rforest	0.95	0.93	0.80	0.91	0.94	0.76	0.84	0.95	0.88	0.88
rpart	1.00	0.93	0.90	0.83	0.89	0.72	0.90	1.00	0.72	0.86
bagging	1.00	0.90	0.89	0.81	0.91	0.72	0.87	1.00	0.86	0.88
lda	1.00	0.70	0.83	0.69	0.94	0.73	0.86	0.98	0.65	0.80
dlda	0.94	0.90	0.86	0.90	0.94	0.86	0.84	0.95	0.85	0.89
slda	0.95	0.93	0.85	0.93	0.98	0.83	0.89	0.98	0.91	0.91
neuralnet	0.99	0.81	0.89	0.85	0.96	0.80	0.88	0.97	0.82	0.87
ncc	0.94	0.91	0.86	0.93	0.94	0.85	0.86	0.98	0.87	0.90
AD/SQ⁴(LC)	Angulo	Takeuchi		Takeuchi	Angulo
ROC	0.93	0.96		0.95	0.96					0.96
SVM	0.89	0.91		0.93	0.84					0.88
SVMradial	0.97	0.93		0.96	0.96					0.95
SVMpoly	0.89	0.92		0.90	0.96					0.94
knn	0.94	0.94		0.95	0.94					0.94
rforest	0.94	0.95		0.96	0.94					0.94
rpart	0.86	0.93		0.95	0.82					0.87
bagging	0.91	0.96		0.95	0.93					0.94
lda	0.80	0.91		0.93	0.83					0.87
dlda	0.94	0.95		0.95	0.92					0.93
slda	0.93	0.95		0.95	0.96					0.96
neuralnet	0.91	0.95		0.94	0.92					0.94
ncc	0.93	0.96		0.95	0.93					0.95

1 Accuracies given in italics indicates accuracies obtained in the training set using LOOCV

2 ER+ vs. ER-

3 MYCN amplification positive vs. MYCN amplification negative

4 adenocarcinoma vs. squamous cell carcinoma

5 the metagene classifier uses AUC values as feature selection criteria.

Abbreviations; BC = Breast Cancer, NB = Neuroblastoma, LC = Lung Cancer

Balanced accuracies of prediction using the metagene classifier (ROC) and various classification algorithms and a t-test as feature selection criteria. 1 Accuracies given in italics indicates accuracies obtained in the training set using LOOCV 2 ER+ vs. ER- 3 MYCN amplification positive vs. MYCN amplification negative 4 adenocarcinoma vs. squamous cell carcinoma 5 the metagene classifier uses AUC values as feature selection criteria. Abbreviations; BC = Breast Cancer, NB = Neuroblastoma, LC = Lung Cancer

Conclusions

We have developed a new algorithm for tumor classification based on the formation of gene expression metagenes and of feature selection using AUC values obtained by ROC analysis. This simple classification algorithm shows good performance and is robust in independent validation data. The described approach has the potential to be a valuable complement to algorithms based on alternative principles.

Availability and Requirements

• Project name: rocc • Project homepage: http://cran.r-project.org/web/packages/rocc/index.html • Operating systems: Platform independent • Programming language: R • Other requirements: R 2.9.2 or higher is recommended. The installation of the R package ROCR is required. • License: GPL2 or higher • Any restrictions to use by non-academics: none

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

ML performed the majority of the computations, wrote the R package and parts of the manuscript. AF and TR performed the first evaluations of the predictor algorithm. MH conceived and planned the investigation, and wrote parts of the manuscript. All authors read and approved the final manuscript.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2407/10/532/prepub

Additional file 1

Description of the used cancer datasets. Click here for file

Additional file 2

Scheme of the generation of metagenes. Click here for file

Additional file 3

Deviation from normal distribution and standard deviation of genes from Stransky with significantly high or low AUC values in both datasets, SanchezC and Stransky. A) Deviation from normal distribution of genes grouped by their association to MI in Figure 2A. B) Deviation from normal distribution of genes grouped by their association to grade in Figure 2B. C) Box plot for standard deviation of genes grouped by association to MI. D) Box plot for standard deviation of genes grouped by association to grade. -log10 p-value = logarithmic p-value of Shapiro test for normality. Click here for file

Additional file 4

Balanced accuracies of prediction obtained using the metagene-based classifier and Support Vector Machine for gene signatures that consist of 10, 50 and 200 genes. Click here for file

Additional file 5

Balanced accuracies of prediction using the metagene classifier (ROC) and various classification algorithms and AUC as feature selection criteria. Click here for file

23 in total

1. Selecting differentially expressed genes from microarray experiments.

Authors: Margaret Sullivan Pepe; Gary Longton; Garnet L Anderson; Michel Schummer
Journal: Biometrics Date: 2003-03 Impact factor: 2.571

2. affy--analysis of Affymetrix GeneChip data at the probe level.

Authors: Laurent Gautier; Leslie Cope; Benjamin M Bolstad; Rafael A Irizarry
Journal: Bioinformatics Date: 2004-02-12 Impact factor: 6.937

3. Linear models and empirical bayes methods for assessing differential expression in microarray experiments.

Authors: Gordon K Smyth
Journal: Stat Appl Genet Mol Biol Date: 2004-02-12

4. ROCR: visualizing classifier performance in R.

Authors: Tobias Sing; Oliver Sander; Niko Beerenwinkel; Thomas Lengauer
Journal: Bioinformatics Date: 2005-08-11 Impact factor: 6.937

5. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction.

Authors: Digna R Velez; Bill C White; Alison A Motsinger; William S Bush; Marylyn D Ritchie; Scott M Williams; Jason H Moore
Journal: Genet Epidemiol Date: 2007-05 Impact factor: 2.135

6. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models.

Authors: Leming Shi; Gregory Campbell; Wendell D Jones; Fabien Campagne; Zhining Wen; Stephen J Walker; Zhenqiang Su; Tzu-Ming Chu; Federico M Goodsaid; Lajos Pusztai; John D Shaughnessy; André Oberthuer; Russell S Thomas; Richard S Paules; Mark Fielden; Bart Barlogie; Weijie Chen; Pan Du; Matthias Fischer; Cesare Furlanello; Brandon D Gallas; Xijin Ge; Dalila B Megherbi; W Fraser Symmans; May D Wang; John Zhang; Hans Bitter; Benedikt Brors; Pierre R Bushel; Max Bylesjo; Minjun Chen; Jie Cheng; Jing Cheng; Jeff Chou; Timothy S Davison; Mauro Delorenzi; Youping Deng; Viswanath Devanarayan; David J Dix; Joaquin Dopazo; Kevin C Dorff; Fathi Elloumi; Jianqing Fan; Shicai Fan; Xiaohui Fan; Hong Fang; Nina Gonzaludo; Kenneth R Hess; Huixiao Hong; Jun Huan; Rafael A Irizarry; Richard Judson; Dilafruz Juraeva; Samir Lababidi; Christophe G Lambert; Li Li; Yanen Li; Zhen Li; Simon M Lin; Guozhen Liu; Edward K Lobenhofer; Jun Luo; Wen Luo; Matthew N McCall; Yuri Nikolsky; Gene A Pennello; Roger G Perkins; Reena Philip; Vlad Popovici; Nathan D Price; Feng Qian; Andreas Scherer; Tieliu Shi; Weiwei Shi; Jaeyun Sung; Danielle Thierry-Mieg; Jean Thierry-Mieg; Venkata Thodima; Johan Trygg; Lakshmi Vishnuvajjala; Sue Jane Wang; Jianping Wu; Yichao Wu; Qian Xie; Waleed A Yousef; Liang Zhang; Xuegong Zhang; Sheng Zhong; Yiming Zhou; Sheng Zhu; Dhivya Arasappan; Wenjun Bao; Anne Bergstrom Lucas; Frank Berthold; Richard J Brennan; Andreas Buness; Jennifer G Catalano; Chang Chang; Rong Chen; Yiyu Cheng; Jian Cui; Wendy Czika; Francesca Demichelis; Xutao Deng; Damir Dosymbekov; Roland Eils; Yang Feng; Jennifer Fostel; Stephanie Fulmer-Smentek; James C Fuscoe; Laurent Gatto; Weigong Ge; Darlene R Goldstein; Li Guo; Donald N Halbert; Jing Han; Stephen C Harris; Christos Hatzis; Damir Herman; Jianping Huang; Roderick V Jensen; Rui Jiang; Charles D Johnson; Giuseppe Jurman; Yvonne Kahlert; Sadik A Khuder; Matthias Kohl; Jianying Li; Li Li; Menglong Li; Quan-Zhen Li; Shao Li; Zhiguang Li; Jie Liu; Ying Liu; Zhichao Liu; Lu Meng; Manuel Madera; Francisco Martinez-Murillo; Ignacio Medina; Joseph Meehan; Kelci Miclaus; Richard A Moffitt; David Montaner; Piali Mukherjee; George J Mulligan; Padraic Neville; Tatiana Nikolskaya; Baitang Ning; Grier P Page; Joel Parker; R Mitchell Parry; Xuejun Peng; Ron L Peterson; John H Phan; Brian Quanz; Yi Ren; Samantha Riccadonna; Alan H Roter; Frank W Samuelson; Martin M Schumacher; Joseph D Shambaugh; Qiang Shi; Richard Shippy; Shengzhu Si; Aaron Smalter; Christos Sotiriou; Mat Soukup; Frank Staedtler; Guido Steiner; Todd H Stokes; Qinglan Sun; Pei-Yi Tan; Rong Tang; Zivana Tezak; Brett Thorn; Marina Tsyganova; Yaron Turpaz; Silvia C Vega; Roberto Visintainer; Juergen von Frese; Charles Wang; Eric Wang; Junwei Wang; Wei Wang; Frank Westermann; James C Willey; Matthew Woods; Shujian Wu; Nianqing Xiao; Joshua Xu; Lei Xu; Lun Yang; Xiao Zeng; Jialu Zhang; Li Zhang; Min Zhang; Chen Zhao; Raj K Puri; Uwe Scherf; Weida Tong; Russell D Wolfinger
Journal: Nat Biotechnol Date: 2010-07-30 Impact factor: 54.908

7. Diagnosis of multiple cancer types by shrunken centroids of gene expression.

Authors: Robert Tibshirani; Trevor Hastie; Balasubramanian Narasimhan; Gilbert Chu
Journal: Proc Natl Acad Sci U S A Date: 2002-05-14 Impact factor: 11.205

8. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks.

Authors: J Khan; J S Wei; M Ringnér; L H Saal; M Ladanyi; F Westermann; F Berthold; M Schwab; C R Antonescu; C Peterson; P S Meltzer
Journal: Nat Med Date: 2001-06 Impact factor: 53.440

9. Regional copy number-independent deregulation of transcription in cancer.

Authors: Nicolas Stransky; Céline Vallot; Fabien Reyal; Isabelle Bernard-Pierrot; Sixtina Gil Diez de Medina; Rick Segraves; Yann de Rycke; Paul Elvin; Andrew Cassidy; Carolyn Spraggon; Alexander Graham; Jennifer Southgate; Bernard Asselain; Yves Allory; Claude C Abbou; Donna G Albertson; Jean Paul Thiery; Dominique K Chopin; Daniel Pinkel; François Radvanyi
Journal: Nat Genet Date: 2006-11-12 Impact factor: 38.330

10. The HUGO Gene Nomenclature Database, 2006 updates.

Authors: Tina A Eyre; Fabrice Ducluzeau; Tam P Sneddon; Sue Povey; Elspeth A Bruford; Michael J Lush
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

4 in total

1. Analysis of Deregulated microRNAs and Their Target Genes in Gastric Cancer.

Authors: Simonas Juzėnas; Violeta Saltenienė; Juozas Kupcinskas; Alexander Link; Gediminas Kiudelis; Laimas Jonaitis; Sonata Jarmalaite; Limas Kupcinskas; Peter Malfertheiner; Jurgita Skieceviciene
Journal: PLoS One Date: 2015-07-14 Impact factor: 3.240

2. Whole blood gene expression profiling of neonates with confirmed bacterial sepsis.

Authors: Paul Dickinson; Claire L Smith; Thorsten Forster; Marie Craigon; Alan J Ross; Mizan R Khondoker; Alasdair Ivens; David J Lynn; Judith Orme; Allan Jackson; Paul Lacaze; Katie L Flanagan; Benjamin J Stenson; Peter Ghazal
Journal: Genom Data Date: 2014-11-15

3. Classification of tumor samples from expression data using decision trunks.

Authors: Benjamin Ulfenborg; Karin Klinga-Levan; Björn Olsson
Journal: Cancer Inform Date: 2013-02-13

4. Identification of a human neonatal immune-metabolic network associated with bacterial infection.

Authors: Claire L Smith; Paul Dickinson; Thorsten Forster; Marie Craigon; Alan Ross; Mizanur R Khondoker; Rebecca France; Alasdair Ivens; David J Lynn; Judith Orme; Allan Jackson; Paul Lacaze; Katie L Flanagan; Benjamin J Stenson; Peter Ghazal
Journal: Nat Commun Date: 2014-08-14 Impact factor: 14.919

4 in total