
ellipsoidFN: a tool for identifying a heterogeneous set of cancer biomarkers based on gene expressions.

Xianwen Ren, Yong Wang, Luonan Chen, Xiang-Sun Zhang, Qi Jin.

Abstract

Computationally identifying effective biomarkers for cancers from gene expression profiles is an important and challenging task. The challenge lies in the complicated pathogenesis of cancers, which often involves the dysfunction of many genes and regulatory interactions. Thus, sophisticated classification models are urgently needed. In this study, we propose an efficient approach, called ellipsoidFN (ellipsoid Feature Net), that models disease complexity with ellipsoids and seeks a set of heterogeneous biomarkers. Our approach achieves a non-linear classification scheme for mixed samples through the ellipsoid concept and, at the same time, uses a linear programming framework to efficiently select biomarkers from a high-dimensional space. ellipsoidFN reduces the redundancy and improves the complementarity among the identified biomarkers, thus significantly enhancing the distinctiveness between cancers and normal samples, and even between cancer types. Numerical evaluation on real prostate cancer, breast cancer and leukemia gene expression datasets suggests that ellipsoidFN outperforms state-of-the-art biomarker identification methods and can serve as a useful tool for cancer biomarker identification. The Matlab code of ellipsoidFN is freely available from http://doc.aporc.org/wiki/EllipsoidFN.

Substances: Biomarkers, Tumor

Year: 2012 PMID: 23262226 PMCID: PMC3575836 DOI: 10.1093/nar/gks1288

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Computationally identifying cancer biomarkers that indicate specific cancer types is an important and challenging topic in current biomedical research, because it can provide insight into cancer pathogenesis and also support accurate diagnosis and prognosis. With the development of high-throughput technologies, e.g. microarrays and next-generation sequencing, thousands of genes can be measured simultaneously. How to select the most meaningful biomarkers from this large number of genes is a question that scientists and clinicians often face.

The most straightforward method for identifying cancer biomarkers is to calculate the fold changes of gene expression between classes of samples, given that gene expression data are used to characterize the biological states. The larger the fold change, the more likely the gene is a biomarker. However, this method does not consider the variation among samples of the same class. Hence, methods based on or similar to Student's t-test or the Wilcoxon rank-sum test were introduced to eliminate irrelevant or noisy features (1,2). To address multiple testing, methods such as SAM, which provide fine control of the false discovery rate, were developed (3). All these methods score genes one by one based on their expression levels and can generate many redundant biomarkers. Peng et al. (4) proposed a criterion based on mutual information (MI) to find a set of biomarkers that have maximal relevance to the class labels but minimal redundancy among themselves. However, the underlying assumptions of this method are not very clear.

In essence, biomarker identification is intrinsically linked to the assignment of samples to classes (5–7). From a machine learning viewpoint, biomarker identification is a feature selection problem, given the biological states of the samples (e.g. cancer or normal). The aim of feature selection is to find a set of features that maximizes the prediction accuracy of a classifier (8,9). With different classifiers, the identified biomarkers may differ. Many supervised or semi-supervised machine learning methods, such as support vector machines and Bayesian networks, can be used as classifiers to guide the identification of biomarkers (1,10–14). Support vector machines assume that biological states are linearly separable in the feature space, whereas Bayesian networks use graphs to model the complicated relationships among features. However, biomarker identification is not explicitly embedded in these methods. A model that performs biomarker identification, especially non-redundant biomarker identification, and classification simultaneously is needed to explicitly model the properties of biological states.

In this study, we explicitly considered the heterogeneity of cancers and proposed a novel model based on linear programming. In the gene expression space, we used ellipsoids to model cancers and normal samples and sought a minimal set of genes that maximizes the distinctiveness between cancers and normal samples and between cancer types. Unlike general biomarker identification approaches, our method produces a set of non-redundant but complementary biomarkers that maintain maximal classification ability. Computational results on prostate cancer, breast cancer and leukemia gene expression datasets suggest that our method significantly outperforms state-of-the-art biomarker identification methods.

MATERIALS AND METHODS

Overview of ellipsoidFN

We construct our method on two assumptions: (i) cancer and normal samples are stable biological states in the gene expression space; (ii) the differences between cancers and normal samples, or between cancer types, are heterogeneous across samples, i.e. one patient may develop cancer because of the dysfunction of one gene, whereas another patient may develop cancer because of the dysfunction of a second gene. We seek a minimal set of genes such that cancers and normal samples are represented by different ellipsoids and the distances between ellipsoids are maximized (Figure 1).

Figure 1.

The schematic diagram of ellipsoidFN. ellipsoidFN represents each cancer type by an ellipsoid in the gene space and maximizes the distance between ellipsoids. A meta-ellipsoid (black) can be added to represent the relationship between cancer types.

Given a gene expression data set $X = (x_{ij})_{m \times n}$, in which the expression of n genes is measured for m samples and $x_{ij}$ denotes the expression level of gene j in sample i, we set $w = (w_1, \dots, w_n)$, denoting the weight for each gene to be determined. Supposing that there are in total c sample classes, the formulation of our method can be described as follows:

$$\min_{w,\, r,\, R,\, \xi} \; \sum_{j=1}^{n} w_j + \alpha \sum_{a=1}^{c} (r_a - R_a) + C \sum_{a=1}^{c} \sum_{i=1}^{m} \xi_{ia} \qquad (1)$$

subject to

$$\sum_{j=1}^{n} w_j (x_{ij} - \mu_{aj})^2 \le r_a + \xi_{ia}, \quad i \in S_a, \; a = 1, \dots, c \qquad (2)$$

$$\sum_{j=1}^{n} w_j (x_{ij} - \mu_{aj})^2 \ge R_a - \xi_{ia}, \quad i \notin S_a, \; a = 1, \dots, c \qquad (3)$$

where $\mu_{aj}$ is the average/median expression level of gene j in class a, $S_a$ is the set of samples belonging to class a, $r_a$ and $R_a$ are variables defining the inner and outer radius of the ellipsoid representing class a, and $\xi_{ia}$ are slack variables that tolerate data errors.

Equation (1) is the objective function of the optimization problem. It consists of three terms. The first term, $\sum_j w_j$, is the sum of the weights of the selected genes. By minimizing it, we aim to select only a few genes as biomarkers, which enhances interpretability. The second term, $\alpha \sum_a (r_a - R_a)$, is minimized to enlarge the difference between the inner and outer radii of the ellipsoid of each class, so that the classes are better separated. The third term is the total classification error over all samples. It should be minimized to achieve high classification accuracy. Here α and C are two parameters introduced to balance these three goals and unify them into a single objective function. Equation (2) implements assumption (i), i.e. samples from the same cancer type are enclosed by one ellipsoid, which minimizes the distance of a sample from its class center. Equation (3) implements assumption (ii), i.e. every sample from the other classes lies outside the ellipsoid representing the current class. The divergence of one cancer from another cancer or from normal samples is measured by the weighted sum of the divergences of gene expression, so that heterogeneity is modeled. The goal is to identify a minimal set of genes that maximizes the distances between ellipsoids. We used the quadratic function in constraints (2) and (3). Other non-negative functional forms, e.g. the absolute values used in (15), can be applied in a similar way.

We tuned the two parameters, α and C, by grid search in the parameter space. For α, we tested 0.1, 0.5, 1, 2, 5, 10 and 100. For C, we tested 10, 100, 1000 and 10 000. The model generates a trivial zero solution when α is small enough or C is large enough. A smaller α means fewer biomarkers, whereas a larger C means fewer classification errors. Thus, the parameter pair that leads to a non-trivial solution and at the same time has a smaller α and a larger C was selected as the optimal parameter setting. α can be further decomposed into two separate parameters for $r_a$ and $R_a$, respectively, so that the weights of $r_a$ and $R_a$ can be tuned separately. Here, we used the same parameter for $r_a$ and $R_a$ to reduce the total number of parameters in the model.

We name our method ellipsoidFN (ellipsoid Feature Net). Unlike methods that identify biomarkers one by one (like fishing with a rod), ellipsoidFN simultaneously identifies a minimal set of genes that represent different cancer types and normal samples as discrete ellipsoids (like fishing with a net). Altering the parameters adjusts the number of identified biomarkers (like adjusting the mesh size of the net). Mathematically, ellipsoidFN is a linear programming model that can be solved efficiently in polynomial time. Thus, it can be applied to high-dimensional datasets.

ellipsoidFN is flexible. It can deal with any number of classes with any relationship (unordered, linearly ordered, tree-ordered, etc.) as long as computer memory and processing power allow. For unordered multiple classes, the formulation is as given above. For cases in which there are complicated relationships among classes, additional ellipsoids can be added to the model to represent meta-classes that denote the class relationships.
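As a rough illustration of how a model of this kind can be set up, the following Python sketch (not the authors' Matlab code) builds one linear program consistent with the description above and solves it with SciPy's `linprog`. Several details are assumptions made only for this sketch: the class center is the mean, there is one slack variable per sample-class pair, all variables are non-negative, and the weights are bounded by `0 <= w_j <= 1` so that the linear program stays bounded. The function names `ellipsoid_lp` and `select_parameters` are hypothetical, as is the search order used to prefer a smaller α and then a larger C.

```python
import numpy as np
from scipy.optimize import linprog


def ellipsoid_lp(X, y, alpha=1.0, C=10.0):
    """Solve a small ellipsoid-style LP; returns (w, r, R, result)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m, n = X.shape
    classes = np.unique(y)
    c = len(classes)
    # Variable layout: [w (n) | r (c) | R (c) | xi (m*c)]
    n_var = n + 2 * c + m * c
    cost = np.concatenate([
        np.ones(n),               # sum of gene weights
        alpha * np.ones(c),       # + alpha * r_a
        -alpha * np.ones(c),      # - alpha * R_a
        C * np.ones(m * c),       # + C * slack
    ])
    A = np.zeros((m * c, n_var))
    for a, label in enumerate(classes):
        center = X[y == label].mean(axis=0)   # assumption: mean as class center
        D = (X - center) ** 2                 # squared deviations, shape (m, n)
        for i in range(m):
            row = i * c + a
            if y[i] == label:
                # in-class sample: sum_j w_j d_ij - r_a - xi <= 0
                A[row, :n] = D[i]
                A[row, n + a] = -1.0
            else:
                # out-of-class sample: -sum_j w_j d_ij + R_a - xi <= 0
                A[row, :n] = -D[i]
                A[row, n + c + a] = 1.0
            A[row, n + 2 * c + row] = -1.0
    # Assumption: 0 <= w_j <= 1 keeps the LP bounded; other variables >= 0.
    bounds = [(0.0, 1.0)] * n + [(0.0, None)] * (2 * c + m * c)
    res = linprog(cost, A_ub=A, b_ub=np.zeros(m * c), bounds=bounds,
                  method="highs")
    if res.status != 0:           # infeasible or unbounded for these parameters
        return None, None, None, res
    x = res.x
    return x[:n], x[n:n + c], x[n + c:n + 2 * c], res


def select_parameters(X, y, alphas=(0.1, 0.5, 1, 2, 5, 10, 100),
                      Cs=(10000, 1000, 100, 10), tol=1e-8):
    """Return the first (alpha, C) pair that gives a non-trivial solution."""
    for alpha in alphas:          # prefer smaller alpha
        for C in Cs:              # then prefer larger C
            w, _, _, _ = ellipsoid_lp(X, y, alpha, C)
            if w is not None and w.sum() > tol:
                return alpha, C
    return None
```

On a toy data set in which only the first gene differs between the two classes, the sketch puts all weight on that gene, and the grid search skips the smallest α because it yields the trivial zero solution.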

Data sets and metrics for evaluation

We compared ellipsoidFN with widely used state-of-the-art biomarker identification methods. For two-class cases, we compared ellipsoidFN with minimum Redundancy Maximum Relevance feature selection (mRMR) (4) and the t-test. For multiple classes, we compared ellipsoidFN with mRMR and the F-test-based gene weighting scheme. Evaluations were done on three different cancers (prostate cancer, breast cancer and leukemia). The prostate cancer gene expression data set (16) was downloaded from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database (17) with accession number GDS3289. The breast cancer gene expression data set (18) was downloaded from the NCBI GEO database with accession number GSE10797. The leukemia gene expression data set was from (19).

Four metrics were used to compare the results of ellipsoidFN and the state-of-the-art methods. The first metric is the mean redundancy score among the identified biomarkers. Given two genes, their redundancy was measured by the Pearson correlation coefficient (PCC) and the MI of the two genes' expression profiles. The second and third metrics are the inter-class and in-class similarity scores (measured by the PCC and MI of two samples' gene expression profiles). The fourth metric is the leave-one-out cancer classification error rate based on the identified biomarkers (naïve Bayes classifier). The smaller the redundancy, the inter-class similarity and the error rate are, the better the method is. The larger the in-class similarity is, the better the method performs.

All calculations were conducted in Matlab 7.13 on a computer with a 2.26 GHz Intel Core 2 Duo CPU and 3 GB memory. For a two-class data set with 72 samples and 1000 genes, ellipsoidFN took <1 min to identify the optimal biomarker set.
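The PCC-based versions of the first three metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `mean_redundancy` and `sample_similarity` are hypothetical, scores are averaged over all distinct pairs, and the `absolute` switch is an added option because the text does not say whether signed or absolute correlations were averaged. The MI-based variants are not shown.

```python
import numpy as np


def mean_redundancy(X_sel, absolute=False):
    """Mean pairwise PCC between selected genes (columns of X_sel)."""
    R = np.corrcoef(np.asarray(X_sel, dtype=float), rowvar=False)
    upper = np.triu_indices_from(R, k=1)      # each gene pair counted once
    values = np.abs(R[upper]) if absolute else R[upper]
    return float(values.mean())


def sample_similarity(X_sel, y):
    """Mean PCC between sample pairs: (in-class mean, inter-class mean)."""
    y = np.asarray(y)
    S = np.corrcoef(np.asarray(X_sel, dtype=float))   # samples are rows
    upper = np.triu_indices_from(S, k=1)
    same_class = (y[:, None] == y[None, :])[upper]
    pair_scores = S[upper]
    return float(pair_scores[same_class].mean()), float(pair_scores[~same_class].mean())
```

Under this convention, a good biomarker set gives a low `mean_redundancy`, a high in-class similarity and a low inter-class similarity.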

RESULTS

Comparisons on prostate cancer data set

There are in total 104 samples and 9483 genes in the prostate cancer dataset. The 104 samples consist of 22 normal samples and prostate cancer samples from five different stages. Starting from the raw data, we filtered out genes with missing values and genes with low information content (measured by the entropy of the gene expression distribution, <0.5). The normal and metastatic prostate cancer samples were extracted to evaluate the performance of ellipsoidFN, mRMR and t-test. The normal, metastatic prostate cancer and localized prostate cancer samples were extracted to assess the performance of ellipsoidFN, mRMR and F-test. In each situation, the top 50 genes were selected as the most promising biomarkers for comparison.

For the two-class case, the biomarker redundancy of ellipsoidFN was lower than those of mRMR and t-test (Figure 2 and Table 1). The mean biomarker redundancy score (measured by PCC) of ellipsoidFN was 0.2350, whereas the mean redundancy score of mRMR was 0.2530 (PCC). The difference was significant (P = 0.0012, Student's t-test). The mean biomarker redundancy score of t-test was 0.4952 (PCC), much larger than that of ellipsoidFN (P < 10^−20, Student's t-test). When MI was used to measure the biomarker redundancy, ellipsoidFN still identified the most heterogeneous biomarkers. When we randomly sampled 1000 sets of biomarkers (50 genes per set), the redundancy scores of all 1000 random sets were smaller than those of ellipsoidFN (except for two random sets under PCC), mRMR (no exception) and t-test (no exception) with regard to both PCC and MI, suggesting that ellipsoidFN identified a more heterogeneous set of biomarkers than mRMR and t-test.
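The random-set comparison described above can be sketched as follows. This is a minimal illustration, not the original analysis: the function names, the fixed random seed and the use of the signed mean pairwise PCC as the redundancy score are assumptions.

```python
import numpy as np


def mean_pairwise_pcc(X_sel):
    """Mean PCC over all distinct pairs of genes (columns of X_sel)."""
    R = np.corrcoef(X_sel, rowvar=False)
    return float(R[np.triu_indices_from(R, k=1)].mean())


def count_random_sets_below(X, score, set_size=50, n_sets=1000, seed=0):
    """Count random gene sets whose redundancy is smaller than `score`."""
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    count = 0
    for _ in range(n_sets):
        # draw one random biomarker set without repeating a gene
        idx = rng.choice(n_genes, size=set_size, replace=False)
        if mean_pairwise_pcc(X[:, idx]) < score:
            count += 1
    return count
```

Calling `count_random_sets_below(X, score)` with the redundancy score of a method's top 50 genes returns how many of the 1000 random sets are less redundant than that method's selection.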

Figure 2.

The biomarker redundancy heatmap of ellipsoidFN, mRMR, and t-test on the prostate cancer dataset when only two classes were considered. Red means high redundancy. Blue means no redundancy.

Table 1.

Comparison of ellipsoidFN, mRMR, and t-test for two-class situations

	ellipsoidFN	mRMR	t-test
Prostate cancer
B_redundancy	0.235^a/0.0021	0.253/0.0059	0.4952/0.0066
S_in-class	0.3632/0.1494	0.2849/0.0064	0.3483/0.0028
S_inter-class	−0.1733/0.1037	−0.0788/0.0013	−0.1274/0.0013
Error rate	0	0.0238	0.0238
Breast cancer
B_redundancy	0.2136/0.0526	0.2097/0.0271	0.3462/0.0837
S_in-class	0.3586/0.1552	0.7328/0.3433	0.4893/0.2164
S_inter-class	0.3576/0.1726	0.6962/0.3459	0.4401/0.2179
Error rate	0.0303	0.0152	0.0455
Leukemia
B_redundancy	0.322/0.0058	0.4912/0.0158	0.5804/0.0196
S_in-class	0.7249/0.0532	0.5537/0.0347	0.6778/0.0557
S_inter-class	0.3396/0.0150	−0.2819/0.0184	−0.0765/0.0122
Error rate	0.0139	0.0278	0.0417

^a Bold font indicates the best performer. Values in cells are PCC/MI, where PCC is Pearson correlation coefficient and MI is mutual information.

By exploiting the complementarity among the identified biomarkers, ellipsoidFN improved the in-class similarity and reduced the inter-class similarity of normal and prostate cancer samples (Figure 3 and Table 1). The in-class similarity of ellipsoidFN was 0.3632 (PCC), whereas the in-class similarity of mRMR was 0.2849 (PCC). The difference was statistically significant (P < 10^−8, Student's t-test). The in-class similarity of t-test was similar to that of ellipsoidFN (0.3483, PCC). The inter-class similarity of ellipsoidFN was −0.1733 (PCC), whereas those of mRMR and t-test were −0.0788 (P < 10^−20, Student's t-test) and −0.1274 (P < 10^−5, Student's t-test), respectively. MI also supports the highest in-class similarity for ellipsoidFN, but mRMR and t-test obtained the lowest MI inter-class similarity.

Figure 3.

The sample similarity based on biomarkers identified by ellipsoidFN, mRMR, and t-test on the prostate cancer dataset when only two classes were considered. Red means high similarity. Blue means the opposite.

To evaluate the predictive power of the identified biomarkers, we used the naïve Bayes classifier to predict the sample types by leave-one-out cross-validation. The error rates of ellipsoidFN, mRMR and t-test were 0 (0/42), 0.0238 (1/42) and 0.0238 (1/42), respectively, suggesting the effectiveness of ellipsoidFN. We also plotted the receiver operating characteristic (ROC) curve to evaluate the true positive rate and false positive rate (Supplementary Figure S1). The ROC curve suggested that ellipsoidFN and mRMR perform almost identically and that both are better than t-test.

For the multiple-class case, ellipsoidFN also showed excellent performance compared with mRMR and F-test. The biomarker redundancy of ellipsoidFN was 0.1895 (PCC), whereas those of mRMR and F-test were 0.2284 (P < 10^−13, Student's t-test) and 0.3247 (P < 10^−20, Student's t-test), respectively (Figure 4 and Table 2). When we randomly sampled 1000 sets of biomarkers (50 genes per set), 426 biomarker sets had redundancy scores smaller than that of ellipsoidFN, whereas 996 had scores smaller than that of mRMR, and no set had a score smaller than that of t-test with regard to PCC. Measured by MI, four random biomarker sets had redundancy scores larger than that of ellipsoidFN, but all random biomarker sets had redundancy scores smaller than those of mRMR and t-test. The in-class similarity of ellipsoidFN was 0.2520 (PCC), whereas that of mRMR was 0.1852 (P < 10^−11, Student's t-test). The in-class similarity of F-test was 0.3109 (P < 10^−9, Student's t-test), larger than that of ellipsoidFN (Figure 5 and Table 2). This is reasonable because F-test selected more redundant biomarkers, whereas ellipsoidFN is designed to handle sample heterogeneity. The inter-class similarity of ellipsoidFN was smaller than that of mRMR (P < 10^−9, Student's t-test) but larger than that of F-test (P < 10^−7, Student's t-test). The leave-one-out error rates of ellipsoidFN, mRMR and F-test with the naïve Bayes classifier were 0.0135, 0.1351 and 0.0946, respectively. This further supports the effectiveness of ellipsoidFN.
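The leave-one-out evaluation can be illustrated with a small NumPy sketch. It assumes a Gaussian naïve Bayes classifier with class priors estimated from the training fold; the text does not state which naïve Bayes variant was used in Matlab, so the likelihood model, the `var_floor` safeguard and the function name are assumptions for illustration only.

```python
import numpy as np


def loo_naive_bayes_error(X, y, var_floor=1e-9):
    """Leave-one-out error rate of a Gaussian naive Bayes classifier."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m = len(y)
    errors = 0
    for i in range(m):
        train = np.arange(m) != i             # hold out sample i
        X_train, y_train = X[train], y[train]
        best_label, best_score = None, -np.inf
        for label in np.unique(y_train):
            X_class = X_train[y_train == label]
            mu = X_class.mean(axis=0)
            var = X_class.var(axis=0) + var_floor   # avoid division by zero
            # log prior + sum of per-gene Gaussian log-likelihoods
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X[i] - mu) ** 2 / var)
            score = np.log(len(X_class) / len(y_train)) + log_lik
            if score > best_score:
                best_label, best_score = label, score
        errors += int(best_label != y[i])
    return errors / m
```

Here `X` would hold the expression values of the selected biomarkers only (samples in rows), so the returned rate reflects the predictive power of that biomarker set.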

Figure 4.

The biomarker redundancy heatmap of ellipsoidFN, mRMR, and F-test on the prostate cancer dataset when three classes were considered. Red means high redundancy. Blue means no redundancy.

Table 2.

Comparison of ellipsoidFN, mRMR, and F-test for multiple-class situations

	ellipsoidFN	mRMR	F-test
Prostate cancer
B_redundancy	0.1895^a/0.0013	0.2284/0.0053	0.3247/0.0036
S_in-class	0.252/0.0028	0.1852/0.0054	0.3109/0.0036
S_inter-class	−0.1096/0.0013	−0.0774/0.0023	−0.1396/0.0015
Error rate	0.0135	0.1351	0.0946
Breast cancer
B_redundancy	0.1924/0.0457	0.3386/0.0709	0.5047/0.1075
S_in-class	0.4461/0.2335	0.5901/0.2100	0.7914/0.4279
S_inter-class	0.3833/0.1975	0.4219/0.1361	0.7047/0.3576
Error rate	0.1818	0.2121	0.2121
Leukemia
B_redundancy	0.2645/0.0048	0.4127/0.0173	0.4391/0.0164
S_in-class	0.7188/0.0496	0.6775/0.0522	0.8038/0.0855
S_inter-class	0.3276/0.0142	−0.0459/0.0086	0.1442/0.0093
Error rate	0.0417	0.0694	0.0417

^a Bold font indicates the best performer. Values in cells are PCC/MI, where PCC is Pearson correlation coefficient and MI is mutual information.

Figure 5.

The sample similarity based on biomarkers identified by ellipsoidFN, mRMR, and t-test on the prostate cancer dataset when three classes were considered. Red means high similarity. Blue means the opposite.

We compared the biomarkers identified by the different methods (Figure 6). For the two-class case, ellipsoidFN shared 12 biomarkers with t-test and 10 with mRMR. Three biomarkers were shared by all three methods. Most biomarkers identified by the three methods were method specific. Among the 30 ellipsoidFN-specific biomarkers (see Supplementary Data Sets 1–9 for the full lists of the biomarkers identified by ellipsoidFN on all three data sets), PCBP1 regulates the expression of the androgen receptor (20). ALDH1A1 has been demonstrated to be a marker for malignant prostate stem cells and a predictor of prostate cancer patients' outcome (21). Aberration of RPL15 is frequently observed in multiple tumor types, including prostate cancer (22). Overexpression of NCOR2 has been demonstrated to activate the androgen receptor in a cell type-specific context (23). Targeting JunD has been suggested as a potential strategy to counteract hormone-refractory prostate cancer (24). MDM2 has been shown to mediate the interaction between USP2a and MYC in prostate cancer (25). SPRY1 is a potential tumor suppressor in prostate cancer (26).

For the multiple-class case, ellipsoidFN, mRMR and F-test identified two common biomarkers, TCN2 and C5orf13. TCN2 is associated with reduced risk of prostate cancer (27). TP53BP2, ALDH1A3, RPL15, ANXA1, COMP and IGF2, among the 32 ellipsoidFN-specific biomarkers, are reported to be associated with cancers (22,28–32).

Figure 6.

(A) biomarkers identified by ellipsoidFN, mRMR, and t-test on the prostate cancer dataset when two classes were considered; (B) biomarkers identified by ellipsoidFN, mRMR, and F-test on the prostate cancer dataset when three classes were considered.

Comparisons on breast cancer and leukemia data set

We further evaluated ellipsoidFN on a breast cancer and a leukemia data set. The breast cancer data set consists of 22 277 probes and 66 samples, including five normal stromal samples, five normal epithelial samples, 28 stromal samples of breast cancers and 28 epithelial samples of breast cancers. After removing the probes with missing values and low information content, we retained 1000 informative probes for biomarker identification. Then, we applied ellipsoidFN, mRMR and t-test to identify biomarkers distinguishing the 10 normal samples from the 56 breast cancer samples. F-test, mRMR and ellipsoidFN were applied to identify biomarkers discriminating normal stromal, normal epithelial, breast cancer stromal and breast cancer epithelial samples. For each method, the top 50 biomarkers were extracted for comparison.

For the two-class case of the breast cancer data set (Table 1), the biomarker redundancy of ellipsoidFN was smaller than that of t-test (P < 10^−20, Student's t-test, PCC) but larger than that of mRMR (P = 0.4481, Student's t-test, PCC). The in-class sample similarity of ellipsoidFN was lower than those of mRMR (P < 10^−20, Student's t-test, PCC) and t-test (P < 10^−20, Student's t-test, PCC), possibly owing to in-class heterogeneity, because we mixed stromal and epithelial samples in the same pseudo-classes. The inter-class sample similarity of ellipsoidFN was lower than those of mRMR (P < 10^−20, Student's t-test, PCC) and t-test (P < 10^−12, Student's t-test, PCC). Based on the three identified biomarker sets, we evaluated the leave-one-out prediction accuracy with the naïve Bayes classifier. The error rates of ellipsoidFN, mRMR and t-test were 0.0303, 0.0152 and 0.0455, respectively. The ROC curve suggested that ellipsoidFN and mRMR perform almost identically and better than t-test (Supplementary Figure S2). When we randomly sampled 1000 sets of biomarkers (50 genes per set), only 1 of the 1000 redundancy scores measured by PCC was smaller than those of ellipsoidFN and mRMR. In all, 997 scores were smaller than that of t-test. Measured by MI, no random biomarker set had a redundancy score smaller than that of mRMR, and no random biomarker set had a redundancy score larger than that of t-test. In all, 519 of the 1000 random biomarker sets had a redundancy score smaller than that of ellipsoidFN.

For the multiple-class case of the breast cancer data set (Table 2), the biomarker redundancy of ellipsoidFN was still significantly lower than those of mRMR (P < 10^−20, Student's t-test) and F-test (P < 10^−20, Student's t-test). The in-class sample similarity of ellipsoidFN was lower than those of mRMR (P < 10^−20, Student's t-test) and F-test (P < 10^−20, Student's t-test), possibly owing to the intrinsic in-class heterogeneity. The inter-class sample similarity of ellipsoidFN was lower than those of mRMR (P < 10^−20, Student's t-test) and F-test (P < 10^−20, Student's t-test). The leave-one-out prediction error rates of ellipsoidFN, mRMR and F-test were 0.1818, 0.2121 and 0.2121, respectively. When we randomly sampled 1000 sets of biomarkers (50 genes per set), no PCC redundancy score was smaller than that of ellipsoidFN; 990 were smaller than that of mRMR, and no score was larger than that of t-test.

The leukemia data set is composed of 7129 probes and 72 samples, including 25 acute myeloid leukemia samples, 38 B-cell acute lymphoblastic leukemia (ALL) samples and nine T-cell ALL samples. After removing probes with missing values, three preprocessing steps (flooring/ceiling, filtering and log10-transformation) were applied to select informative probes (33). Finally, 1000 of the most informative probes were retained for evaluation.

First, ellipsoidFN, mRMR and t-test were applied to discriminate acute myeloid leukemias from ALLs (Table 1). The redundancy of the top 50 biomarkers of ellipsoidFN was lower than those of mRMR (P < 10^−20, Student's t-test, PCC) and t-test (P < 10^−20, Student's t-test, PCC). The in-class sample similarity of ellipsoidFN was larger than those of mRMR (P < 10^−20, Student's t-test, PCC) and t-test (P < 10^−20, Student's t-test, PCC). The inter-class sample similarity of ellipsoidFN was larger than those of mRMR (P < 10^−20, Student's t-test, PCC) and t-test (P < 10^−20, Student's t-test, PCC). The leave-one-out prediction error rates of ellipsoidFN, mRMR and t-test were 0.0139, 0.0278 and 0.0417, respectively. The ROC curve suggested that ellipsoidFN reached the highest true positive rate at a low false positive rate (Supplementary Figure S3). Randomly sampling 1000 sets of biomarkers (50 genes per set) suggested that no random redundancy score (PCC or MI) was larger than those of ellipsoidFN, mRMR and t-test.

For the multiple-class case of the leukemia data set (Table 2), the biomarker redundancy of ellipsoidFN was smaller than those of mRMR (P < 10^−20, Student's t-test, PCC) and F-test (P < 10^−20, Student's t-test, PCC). The in-class sample similarity of ellipsoidFN was larger than that of mRMR (P < 10^−20, Student's t-test, PCC) but smaller than that of F-test (P < 10^−20, Student's t-test, PCC). The inter-class sample similarity of ellipsoidFN was larger than those of mRMR (P < 10^−20, Student's t-test, PCC) and F-test (P < 10^−20, Student's t-test, PCC) because ellipsoidFN included biomarkers shared by B-cell ALLs and T-cell ALLs. The leave-one-out prediction error rates of ellipsoidFN, mRMR and F-test were 0.0417, 0.0694 and 0.0417, respectively. Randomly sampling 1000 sets of biomarkers (50 genes per set) suggested that no random redundancy score (PCC or MI) was larger than those of ellipsoidFN, mRMR and t-test.
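Because the error rates come from leave-one-out cross-validation, each reported rate should correspond to a whole number of misclassified samples divided by the number of samples in the data set. The short sketch below checks this arithmetic for the sample counts stated in the text (42 for the two-class prostate comparison, 66 for breast cancer, 72 for leukemia). The helper names are hypothetical, and the implied error counts are an inference from the reported rates rather than figures given in the source.

```python
def error_rate(n_errors, n_samples):
    """Leave-one-out error rate rounded to four decimals, as reported."""
    return round(n_errors / n_samples, 4)


def implied_errors(rate, n_samples):
    """Number of misclassified samples implied by a reported rate."""
    return round(rate * n_samples)


# Breast cancer, two-class case (66 samples): ellipsoidFN, mRMR, t-test
breast_two_class = [implied_errors(r, 66) for r in (0.0303, 0.0152, 0.0455)]
# Leukemia, multiple-class case (72 samples): ellipsoidFN, mRMR, F-test
leukemia_multi = [implied_errors(r, 72) for r in (0.0417, 0.0694, 0.0417)]
```

For example, `error_rate(1, 42)` reproduces the 0.0238 reported for mRMR and t-test on the two-class prostate comparison.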

DISCUSSIONS

Identifying effective biomarkers for cancers is a challenging task because of the complexity of cancer pathogenesis. As many genes and gene interactions are involved in cancer progression, it is especially challenging to identify cancer biomarkers from a small number of samples (34). Samples of the same cancer type may carry different aberrations. Thus, effective cancer biomarkers need to be considered from a gene-set view. Peng et al. first introduced mRMR to identify a biomarker set with minimum redundancy and maximum relevance, but the underlying assumptions of the method are not clear. We modeled the heterogeneity of cancer samples and sought a minimal biomarker set, which resulted in a less redundant and more relevant biomarker set than that of mRMR in most cases. Thus, the assumptions in ellipsoidFN may reflect, at least partially, the reality of cancer generation and progression, and the implementation of ellipsoidFN may be more efficient.

We modeled the stable states of cancer types and normal samples by the average gene expression of the samples in ellipsoidFN. This is somewhat arbitrary, but it makes ellipsoidFN easier to solve. Future work should optimize the representation of cancer types and normal samples. In addition, ellipsoids may not model the classes in some data sets perfectly, e.g. classes with non-convex shapes in the geometric space. These situations may be handled by other modeling functions or approximated by ellipsoids.

We demonstrated the performance of ellipsoidFN in two-class and multiple-class cases in this study. We observed that the biomarkers it identified are robust, to some extent, to the labels assigned to samples. For example, in the leukemia data set, we merged the B-cell ALLs and T-cell ALLs to test the performance of ellipsoidFN in the two-class situation. In the sample similarity heatmap (Supplementary Figure S1), the distinction between B-cell ALLs and T-cell ALLs revealed by ellipsoidFN was still obvious. However, the distinction became very weak in the sample similarity heatmaps of mRMR and t-test (Supplementary Figure S1). The reason may lie in the inclusion of B-cell ALL-specific and T-cell ALL-specific biomarkers. Thus, ellipsoidFN is capable of reflecting the substructures of cancer types. Moreover, ellipsoidFN can flexibly incorporate complicated relationships among cancer types by introducing meta-ellipsoids (not demonstrated).

The solution to the cancer biomarker identification problem is not unique. Many combinations of genes can distinguish cancer types and normal samples (33) because of the curse of dimensionality (a small number of samples but a large number of genes). Unlike the biomarkers identified by t-test or F-test, which are statistically significant, ellipsoidFN can identify biomarkers that may not be statistically significant individually but that enhance the explanatory power of the identified biomarker set. This is useful for identifying new oncogenes and tumor suppressor genes (as demonstrated in the prostate cancer example).

The rapid development of cancer research has elucidated more and more details of cancer pathogenesis, which can be organized as dynamic biological networks. ellipsoidFN was built solely on the gene expression profiles of samples. A promising direction for extending ellipsoidFN is to integrate current knowledge of cancer pathogenesis. Integrating biomolecular networks to identify network biomarkers (34) or dynamical network biomarkers (35) is also an important future topic.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Figures 1–3 and Supplementary Data Sets 1–9.

FUNDING

The National Natural Science Foundation of China [11131009, 61171007, 61134013 and 91029301]; Chief Scientist Program of Shanghai Institutes for Biological Sciences, CAS (in part) [2009CSP002]. Funding for open access charge: The National Natural Science Foundation of China [11131009]. Conflict of interest statement. None declared.

33 in total

1. Cell type specific expression of the apoptosis stimulating protein (ASPP-2) in human tissues.

Authors: Faris Q Alenzi
Journal: Acta Microbiol Immunol Hung Date: 2010-12 Impact factor: 2.048

Review 2. A review of feature selection techniques in bioinformatics.

Authors: Yvan Saeys; Iñaki Inza; Pedro Larrañaga
Journal: Bioinformatics Date: 2007-08-24 Impact factor: 6.937

Review 3. Specific changes in the expression of imprinted genes in prostate cancer--implications for cancer progression and epigenetic regulation.

Authors: Teodora Ribarska; Klaus-Marius Bastian; Annemarie Koch; Wolfgang A Schulz
Journal: Asian J Androl Date: 2012-02-27 Impact factor: 3.285

4. The causal roles of vitamin B(12) and transcobalamin in prostate cancer: can Mendelian randomization analysis provide definitive answers?

Authors: Simon M Collin; Chris Metcalfe; Tom M Palmer; Helga Refsum; Sarah J Lewis; George Davey Smith; Angela Cox; Michael Davis; Gemma Marsden; Carole Johnston; J Athene Lane; Jenny L Donovan; David E Neal; Freddie C Hamdy; A David Smith; Richard M Martin
Journal: Int J Mol Epidemiol Genet Date: 2011-11-28

5. Molecular signatures suggest a major role for stromal cells in development of invasive breast cancer.

Authors: Theresa Casey; Jeffrey Bond; Scott Tighe; Timothy Hunter; Laura Lintault; Osman Patel; Jonathan Eneman; Abigail Crocker; Jeffrey White; Joseph Tessitore; Mary Stanley; Seth Harlow; Donald Weaver; Hyman Muss; Karen Plaut
Journal: Breast Cancer Res Treat Date: 2008-03-29 Impact factor: 4.872

6. Androgen regulation of aldehyde dehydrogenase 1A3 (ALDH1A3) in the androgen-responsive human prostate cancer cell line LNCaP.

Authors: Steven E Trasino; Earl H Harrison; Thomas T Y Wang
Journal: Exp Biol Med (Maywood) Date: 2007-06

7. A unified computational model for revealing and predicting subtle subtypes of cancers.

Authors: Xianwen Ren; Yong Wang; Jiguang Wang; Xiang-Sun Zhang
Journal: BMC Bioinformatics Date: 2012-05-01 Impact factor: 3.169

8. NCBI GEO: archive for functional genomics data sets--10 years on.

Authors: Tanya Barrett; Dennis B Troup; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Rolf N Muertter; Michelle Holko; Oluwabukunmi Ayanbule; Andrey Yefanov; Alexandra Soboleva
Journal: Nucleic Acids Res Date: 2010-11-21 Impact factor: 16.971
