Feature selection for predicting tumor metastases in microarray experiments using paired design.

Qihua Tan, Mads Thomassen, Torben A Kruse.

Abstract

Among the major issues in gene expression profile classification, feature selection is an important and necessary step in building good classification rules, given the high dimensionality of microarray data. Although different feature selection methods have been reported, no method has been proposed specifically for paired microarray experiments. In this paper, we introduce a simple feature selection procedure, based on a modified t-statistic, for microarray experiments that use the popular matched case-control design, and apply it to our recent study on tumor metastasis in a low-malignant group of breast cancer patients to select genes that best predict metastases. Gene or feature selection is optimized by thresholding in a leaving one-pair out cross-validation. Model comparison through empirical application has shown that our method achieves improved efficiency with high sensitivity and specificity.

Keywords: feature selection; gene expression microarray; metastasis; prediction

Year: 2007 PMID: 19455244 PMCID: PMC2675839

Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351

Introduction

Characterized by simultaneous profiling for the transcriptional activities of thousands of mRNA species in a human tissue, the DNA microarray technology represents an important high-throughput platform for analyzing and understanding human diseases. The tremendous potential provided by the new technology is serving us not only as a molecular tool for investigating disease mechanisms but also for classification and clinical outcome prediction (Dudda-Subramanya et al. 2003). Application of the technology in clinical oncology is demonstrating it as a powerful tool for refining diagnosis and improving prognostic prediction accuracy of cancer patients (Pusztai et al. 2003). Bioinformatics and biostatistics play important roles in such practices in establishing gene expression signatures or prognostic markers and in building up efficient classifiers (Asyali et al. 2006). Among the major issues in gene expression profile classification, feature selection is an important and necessary step in achieving and creating good classification rules given the high dimensionality of microarray data. There are various approaches for feature selection in the literature among which one common approach is the univariate selection scheme for selecting only genes with the highest statistical significance. Such an approach can be inadequate because (1) it tends to include elements that contribute highly redundant information and (2) it ignores the co-regulatory network in gene function. As a result, the univariate approach does not necessarily guarantee a best classifier (Ein-Dor et al. 2005; Baker and Kramer, 2006). Tibshirani et al. (2002) proposed a Nearest Shrunken Centroids (NSC) method for both feature selection and tumor classification. In NSC, weak elements of the class centroids are shrunk or deleted via soft-thresholding to identify genes that best characterize each class. 
The method, implemented in an R package (PAM, Prediction Analysis of Microarrays), performs well in identifying subsets of genes that can be used for classification and prediction. Although different feature selection methods have been reported for tumor classification (Inza et al. 2004), no method has been proposed specifically for paired microarray experiments. In this paper, we introduce a simple feature selection procedure, based on a modified t-statistic, for microarray experiments that use the popular matched case-control design, and apply it to our recent study on tumor metastasis in a low-malignant group of breast cancer patients to select genes that best predict metastases. Gene or feature selection is optimized by thresholding in a leaving one-pair out cross-validation procedure using support vector machines (SVM) (Brown et al. 2000). Such an approach is necessary to exploit the advantages of a matched design, because multiple factors (nodal status, tumor size, age, etc.) have important implications for tumor outcomes. Performance of the feature selection method is compared with that of PAM and of the ordinary paired t-test using receiver operating characteristic (ROC) analysis (Fawcett, 2006).

Methods

Suppose that in a paired microarray experiment we have gene expression values (usually on the log scale) from n pairs of samples, j = 1, 2, …, n. For each gene i (i = 1, 2, …, p), we obtain the differential gene expression in pair j, $d_{ij}$, by subtracting the expression value of the control from that of the case, and calculate the mean difference as

$$\bar{d}_i = \frac{1}{n}\sum_{j=1}^{n} d_{ij}$$

and the standard error of $\bar{d}_i$ as

$$s_i = \sqrt{\frac{\sum_{j=1}^{n}\left(d_{ij}-\bar{d}_i\right)^2}{n(n-1)}}.$$

Now we can calculate the t-test statistic for the paired data as

$$t_i = \frac{\bar{d}_i}{s_i}. \tag{1}$$

Similar to Tusher et al. (2001), we add a positive constant $s_0$ to the denominator of (1) so that (1) becomes

$$t'_i = \frac{\bar{d}_i}{s_i + s_0}. \tag{2}$$

From (2) we can see that our modified t-statistic is a down-scaled t-statistic, with the scaling determined by the ratio between $s_0$ and $s_i$. Once $s_0$ is specified, the scaling has a large effect on genes with small standard errors. Following Tibshirani et al. (2002), we set $s_0$ to the median value of $s_i$ (i = 1, 2, …, p).

For the purpose of feature selection, we specify a threshold Δ and pick up genes with $|t'_i| - \Delta > 0$. The optimal subset of genes is obtained through a leaving one-pair out cross-validation procedure using SVM. Similar to PAM, the optimal threshold Δ is determined through a grid search in which, for each given Δ, the performance of the classifier is judged by leaving one-pair out cross-validation to ensure that the training set and the prediction set are independent. The Δ that corresponds to the lowest classification error is taken as the optimal threshold. Once the optimal threshold Δ is determined, the overall optimal subset of genes is selected by applying the optimal Δ to the whole sample. The SVM is fitted using the svm procedure in the R package e1071 (http://cran.at.r-project.org/src/contrib/PACKAGES).

In order to assess and compare our model performance with that of PAM and the ordinary paired t-test, we introduce ROC analysis and calculate the area under the ROC curve (AUC). An ROC curve is a two-dimensional depiction of classifier performance that plots sensitivity on the Y axis and 1-specificity on the X axis. As such, a high-AUC classifier has better average performance than a low-AUC classifier (Fawcett, 2006), with AUC = 0.5 for a random classifier. ROC analysis is performed using the free R package caTools.
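The modified paired t-statistic and the thresholding rule described in this section can be illustrated with a short sketch. It is written in Python using only the standard library, not the R code used in the study, and the function names and toy expression values are hypothetical. The constant s0 is set to the median standard error, as stated above.

```python
import math
import statistics


def paired_stats(case, control):
    """Per-gene mean difference and its standard error.

    case[i][j] and control[i][j] hold the log expression of gene i
    in the case and control sample of pair j.
    """
    means, ses = [], []
    for x_case, x_ctrl in zip(case, control):
        d = [a - b for a, b in zip(x_case, x_ctrl)]  # case minus control
        means.append(statistics.fmean(d))
        ses.append(statistics.stdev(d) / math.sqrt(len(d)))
    return means, ses


def ordinary_t(case, control):
    """Ordinary paired t-statistic: mean difference over its standard error."""
    means, ses = paired_stats(case, control)
    return [m / s for m, s in zip(means, ses)]


def modified_t(case, control):
    """Modified statistic: s0 (median standard error) added to the denominator."""
    means, ses = paired_stats(case, control)
    s0 = statistics.median(ses)
    return [m / (s + s0) for m, s in zip(means, ses)]


def select_genes(t_values, delta):
    """Indices of genes whose absolute statistic exceeds the threshold delta."""
    return [i for i, t in enumerate(t_values) if abs(t) - delta > 0]


# Hypothetical toy data: 3 genes measured in 4 matched pairs.
case = [
    [5.0, 6.0, 7.0, 8.0],      # gene 0: large, consistent difference
    [4.10, 4.12, 4.10, 4.12],  # gene 1: tiny difference, tiny standard error
    [5.0, 3.0, 5.0, 3.0],      # gene 2: no mean difference
]
control = [
    [4.0, 4.0, 6.0, 6.0],
    [4.0, 4.0, 4.0, 4.0],
    [4.0, 4.0, 4.0, 4.0],
]

print("ordinary t:", ordinary_t(case, control))
print("modified t:", modified_t(case, control))
print("selected at delta = 1.0:", select_genes(modified_t(case, control), 1.0))
```

In this toy example, gene 1 has the largest ordinary t-statistic purely because its standard error is very small, whereas the modified statistic ranks gene 0 first. This is the down-scaling effect on genes with small standard errors noted above.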

Application

We apply our method to a microarray dataset on tumor metastasis from low-malignant breast cancer patients collected in our lab (Thomassen et al. 2006a). In this study, 13 low-malignant T1 (tumor diameter T ≤ 20 mm) and 17 low-malignant T2 (20 mm < T ≤ 50 mm) tumors from patients who developed metastases were matched to metastasis-free tumors from patients (followed up for about 12 years after diagnosis) of the same tumor type, according to year of surgery, tumor size, and age. Gene expression analysis was performed on 29K oligonucleotide arrays with duplicated measurements for each gene (Thomassen et al. 2006b). Data were normalized using the variance stabilization normalization method (Huber et al. 2002) implemented in the free R package vsn in Bioconductor (http://www.bioconductor.org). The study by Thomassen et al. (2006a) identified a 32-gene signature that classifies the 60 tumor samples with a mean accuracy of 78% (specificity 77%; sensitivity 80%) using leaving one-pair out cross-validation (Figure 1a). In that analysis, feature selection was done using the nearest shrunken centroids method in the R package pamr (Tibshirani et al. 2002) and classification was done using SVM in the R package e1071. Note that the feature selection procedure in pamr does not take the pairing into account when identifying the subset of genes for training and prediction.
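The leaving one-pair out cross-validation and the grid search over Δ can be sketched as follows. This is an illustration in plain Python under several stated assumptions, not the code used in the study. A simple nearest-centroid rule stands in for the SVM. Gene selection is repeated inside every fold on the training pairs only, which is one reading of the requirement that training and prediction sets be independent. A fold in which no gene passes the threshold is scored as chance level. All function names and the toy data are hypothetical.

```python
import math
import statistics


def modified_t(diffs):
    """Modified paired t per gene; diffs[g][j] = case - control for gene g, pair j."""
    means = [statistics.fmean(d) for d in diffs]
    ses = [statistics.stdev(d) / math.sqrt(len(d)) for d in diffs]
    s0 = statistics.median(ses)
    return [m / (s + s0) for m, s in zip(means, ses)]


def centroid(samples, genes):
    """Mean expression of the selected genes over a list of samples."""
    return [statistics.fmean([s[g] for s in samples]) for g in genes]


def sq_dist(sample, center, genes):
    """Squared Euclidean distance from a sample to a centroid on the selected genes."""
    return sum((sample[g] - c) ** 2 for g, c in zip(genes, center))


def lopo_error(cases, controls, delta):
    """Leave-one-pair-out error rate for a given threshold delta.

    cases[j] and controls[j] are the per-gene expression vectors of pair j.
    Requires at least 3 pairs so that each training fold has 2 or more.
    """
    n, p = len(cases), len(cases[0])
    errors = 0
    for held in range(n):
        train = [j for j in range(n) if j != held]
        # Feature selection uses the training pairs only.
        diffs = [[cases[j][g] - controls[j][g] for j in train] for g in range(p)]
        t_mod = modified_t(diffs)
        genes = [g for g in range(p) if abs(t_mod[g]) - delta > 0]
        if not genes:
            errors += 1  # assumption: no features means chance level (1 of 2 wrong)
            continue
        c_case = centroid([cases[j] for j in train], genes)
        c_ctrl = centroid([controls[j] for j in train], genes)
        # Both members of the held-out pair are predicted.
        for sample, is_case in ((cases[held], True), (controls[held], False)):
            pred_case = sq_dist(sample, c_case, genes) < sq_dist(sample, c_ctrl, genes)
            errors += int(pred_case != is_case)
    return errors / (2 * n)


def best_delta(cases, controls, grid):
    """Threshold on the grid with the lowest cross-validated error (first one on ties)."""
    return min(grid, key=lambda d: lopo_error(cases, controls, d))


# Hypothetical toy data: 4 matched pairs, 3 genes; only gene 0 is informative.
cases = [[3.0, 1.0, 0.2], [3.2, 0.8, 0.1], [2.9, 1.1, 0.3], [3.1, 0.9, 0.2]]
controls = [[1.0, 0.9, 0.3], [1.1, 1.0, 0.1], [0.9, 1.0, 0.2], [1.2, 0.8, 0.3]]

for delta in (1.0, 100.0):
    print("delta =", delta, "error =", lopo_error(cases, controls, delta))
print("best delta:", best_delta(cases, controls, [100.0, 1.0]))
```

Holding out both members of a pair together keeps the matched partner of a test sample out of the training set. Once the best Δ is found, the final gene list is obtained by applying that Δ to all pairs, as described in Methods.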

Figure 1.

Probability of metastasis calculated by SVM using leaving one-pair out cross-validation based on the 32-gene signature by PAM (1a), the 5-gene signature by our new method (1b) and the 43-gene signature by paired t-test (1c) for the 13 pairs of low-malignant T1 (asterisk) and 17 pairs of low-malignant T2 (triangle) patients. The best performance is achieved by our 5-gene signature with improved prediction accuracy and better separation.

Using the method described above, we re-analyze the data by introducing the modified t-statistic for paired data in defining the gene expression signature for predicting metastases. Our analysis achieved an overall accuracy of 83% (Δ = 0.396) with a specificity of 83% and a sensitivity of 83% using a subset of only 5 genes (Figure 1b). Comparing Figure 1a with 1b, one can see that our method has improved separation based on prediction probability and increased efficiency (median of correct prediction probability: 0.88 versus 0.86 for metastasis and 0.84 versus 0.81 for non-metastasis). Interestingly, all 5 selected genes are within the 32-gene list identified by PAM in Thomassen et al. (2006a). To extend the comparison, we additionally introduce the ordinary paired t-test for gene selection. Here the thresholding is imposed upon the ordinary paired t-statistic, i.e. we pick up genes with | t | − Δ > 0. Likewise, we again select the optimal subset of genes through cross-validation by leaving one-pair out. The classifier based on the expression signature specified by the ordinary paired t-test yields an average accuracy of 74% (specificity 74%; sensitivity 74%) when Δ is set to 3.1 (43 genes selected). The cross-validation probabilities plotted in Figure 1c show that the model based on the ordinary paired t-test has the lowest efficiency (median of correct prediction probability: 0.85 for metastasis and 0.83 for non-metastasis) even though the method makes use of the paired design. We finally evaluate the overall performance of the 3 methods using ROC analysis. Based on the cross-validation probability of metastasis from SVM and the observed metastasis status for each sample, we draw the ROC curves and show them in Figure 2, with the dotted curves for the new method in black, for PAM in red and for the paired t-test in green. Inspection of Figure 2 indicates that, since the black curve runs on top of the other curves in the upper-left triangle of the figure, our new method exhibits higher efficiency than the others. This is further confirmed by calculating the AUC, a standard summary metric for assessing the overall performance of a classifier. The high AUC for our new method (0.86) again shows that it outperforms PAM (AUC = 0.83) and the ordinary paired t-test (AUC = 0.80).
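To make the ROC and AUC calculation concrete, the following sketch builds an ROC curve from predicted probabilities and observed labels and integrates it with the trapezoidal rule. It is written in plain Python as an illustration and is not the caTools routine used in the study. The function names, scores and labels are hypothetical and are not data from the study.

```python
def roc_points(scores, labels):
    """ROC points (1 - specificity, sensitivity) as the cutoff moves from high to low.

    scores: predicted probability of metastasis per sample.
    labels: 1 for observed metastasis, 0 otherwise.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Samples with tied scores cross the cutoff together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical cross-validation probabilities and observed status.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print("ROC points:", curve)
print("AUC:", auc(curve))
```

A classifier that gives every sample the same score produces the diagonal from (0, 0) to (1, 1), and this calculation returns 0.5 for it, matching the random-classifier baseline mentioned in Methods.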

Figure 2.

ROC analysis for model comparison with the dotted curves for the new method in black, for PAM in red and for the paired t-test in green. Since the black curve runs on top of the others in the upper-left triangle of the figure, our new method exhibits higher efficiency in its performance. The high AUC for our new method (0.86) indicates that it outperforms PAM (AUC = 0.83) and the paired t-test (AUC = 0.80).

Discussion

We have introduced a simple feature selection method for predicting tumor metastases in paired microarray experiments. Model comparison through empirical application has shown that our method achieves high efficiency and outperforms the two existing methods it was compared with (PAM and the ordinary paired t-test). As shown in the Application section, the ordinary paired t-test has the worst performance compared with the other two methods, which use modified t-statistics for thresholding to eliminate genes that do not contribute towards class prediction. Although both the modified and the ordinary paired t-statistics make use of the matched design, the better performance of our method is achieved by thresholding upon a new metric that is less dependent on gene-specific variances, which helps to filter out genes that are statistically significant only because of small standard errors in their differential expression. It is more interesting to compare the performance of our method with that of PAM. Although both methods use modified versions of the t-statistic, our method exploits the following advantages of the paired design in selecting informative features. First, as a popular method in cancer research (Breslow and Day, 1990), the paired design helps to minimize the influence on tumor metastasis of non-transcriptomic factors such as age, clinical stage, treatment, etc. (Gonzalez-Angulo et al. 2005). Second, in a transcriptomic study on tumor metastasis, these confounding factors not only affect the metastasis phenotype, which is of primary interest, but could also influence the transcriptional profiles of genes. Ignoring these influences will simply introduce noise in feature selection, resulting in low accuracy of the classifier. A good classification signature should be a minimal subset of genes that is not only differentially expressed but also contains the most relevant genes without redundancy (Peng et al. 2006; Baker and Kramer, 2006). A comparative analysis of data across several studies has found that classification rules based on 5 genes can achieve performance comparable to that of rules based on 20 or 50 genes (Baker and Kramer, 2006). In our analysis, the high performance is achieved by a classifier that, coincidentally, is also based on 5 informative genes. It is interesting that all 5 genes overlap with the 32-gene signature identified by PAM (Thomassen et al. 2006a) and that 2 of the 5 genes overlap with the 70-gene signature from van 't Veer et al. (2002) in their study on breast cancer metastases. Further information on the 5 selected genes is provided in Table 1.

Table 1.

Information on the 5 selected genes.

Gene symbol	GenBank accession	Description	Gene Ontology
FLJ20354	NM_017779	Hypothetical protein FLJ20354, mRNA.	Intracellular signaling cascade
IMAGE:4081483	BC005998	Clone IMAGE:4081483, mRNA	Unknown
UBE2R2	NM_017811	Ubiquitin-conjugating enzyme E2R 2, mRNA.	Ligase activity; ubiquitin conjugating enzyme activity; Ubiquitin cycle; ubiquitin-ligase activity
ZNF533	NM_152520	Zinc finger protein 533	Unknown
DTL	NM_016448	Denticleless homolog	Unknown

Finally, it is worth pointing out that the paired experimental design for studying tumor metastasis with two-channel cDNA microarrays has the further advantage of reduced experimental cost when, for example, metastasis mRNA is directly labeled with cy5 and non-metastasis mRNA with cy3 in each matched pair. Since our method works with the pairwise difference in the log expression values, the feature selection algorithm is valid for both one- and two-channel microarray platforms. Overall, given the popularity of the pair-matched design in cancer studies, we hope that our new method for feature selection can be of use in identifying efficient and informative gene expression signatures for predicting tumor metastases in clinical cancer research.

References (14 in total)

1. Knowledge-based analysis of microarray gene expression data by using support vector machines.

Authors: M P Brown; W N Grundy; D Lin; N Cristianini; C W Sugnet; T S Furey; M Ares; D Haussler
Journal: Proc Natl Acad Sci U S A Date: 2000-01-04 Impact factor: 11.205

Review 2. Clinical application of cDNA microarrays in oncology.

Authors: Lajos Pusztai; Mark Ayers; James Stec; Gabriel N Hortobágyi
Journal: Oncologist Date: 2003

Review 3. Clinical applications of DNA microarray analysis.

Authors: Raghunandan Dudda-Subramanya; Guglielmo Lucchese; Darja Kanduc; Animesh A Sinha
Journal: J Exp Ther Oncol Date: 2003 Nov-Dec

Review 4. Filter versus wrapper gene selection approaches in DNA microarray domains.

Authors: Iñaki Inza; Pedro Larrañaga; Rosa Blanco; Antonio J Cerrolaza
Journal: Artif Intell Med Date: 2004-06 Impact factor: 5.326

5. Outcome signature genes in breast cancer: is there a unique set?

Authors: Liat Ein-Dor; Itai Kela; Gad Getz; David Givol; Eytan Domany
Journal: Bioinformatics Date: 2004-08-12 Impact factor: 6.937

6. Spotting and validation of a genome wide oligonucleotide chip with duplicate measurement of each gene.

Authors: Mads Thomassen; Vibe Skov; Freyja Eiriksdottir; Qihua Tan; Kirsten Jochumsen; Niels Fritzner; Klaus Brusgaard; Jesper Dahlgaard; Torben A Kruse
Journal: Biochem Biophys Res Commun Date: 2006-04-19 Impact factor: 3.575

7. Gene expression profiling predicts clinical outcome of breast cancer.

Authors: Laura J van 't Veer; Hongyue Dai; Marc J van de Vijver; Yudong D He; Augustinus A M Hart; Mao Mao; Hans L Peterse; Karin van der Kooy; Matthew J Marton; Anke T Witteveen; George J Schreiber; Ron M Kerkhoven; Chris Roberts; Peter S Linsley; René Bernards; Stephen H Friend
Journal: Nature Date: 2002-01-31 Impact factor: 49.962

8. Factors predictive of distant metastases in patients with breast cancer who have a pathologic complete response after neoadjuvant chemotherapy.

Authors: Ana M Gonzalez-Angulo; Sean E McGuire; Thomas A Buchholz; Susan L Tucker; Henry M Kuerer; Roman Rouzier; Shu-Wan Kau; Eugene H Huang; Paolo Morandi; Alberto Ocana; Massimo Cristofanilli; Vicente Valero; Aman U Buzdar; Gabriel N Hortobagyi
Journal: J Clin Oncol Date: 2005-10-01 Impact factor: 44.544

9. Diagnosis of multiple cancer types by shrunken centroids of gene expression.

Authors: Robert Tibshirani; Trevor Hastie; Balasubramanian Narasimhan; Gilbert Chu
Journal: Proc Natl Acad Sci U S A Date: 2002-05-14 Impact factor: 11.205

10. Prediction of metastasis from low-malignant breast cancer by gene expression profiling.

Authors: Mads Thomassen; Qihua Tan; Freyja Eiriksdottir; Martin Bak; Søren Cold; Torben A Kruse
Journal: Int J Cancer Date: 2007-03-01 Impact factor: 7.396
