
Selecting genes by test statistics.

Dechang Chen, Zhenqiu Liu, Xiaobin Ma, Dong Hua.

Abstract

Gene selection is an important issue in analyzing multiclass microarray data. Among the many proposed selection methods, the traditional ANOVA F test statistic has been employed to identify informative genes for both class prediction (classification) and class discovery problems. However, the F test statistic assumes equal variances across classes, an assumption that may not be realistic for gene expression data. This paper explores alternative test statistics which can handle heterogeneity of the variances. We study five such test statistics, which include the Brown-Forsythe test statistic and the Welch test statistic. Their performance is evaluated and compared with that of the F statistic over different classification methods applied to publicly available microarray datasets.


Year: 2005 PMID: 16046818 PMCID: PMC1184045 DOI: 10.1155/JBB.2005.132

Source DB: PubMed Journal: J Biomed Biotechnol ISSN: 1110-7243

INTRODUCTION

Microarrays provide information about the expression level of the genes represented on the array. Such gene expression profiling has been successfully applied to class prediction, where the purpose is to classify and predict the diagnostic category of a sample by its gene expression profile [1, 2, 3, 4]. Various machine learning methods are currently used for class prediction. However, prediction from microarrays is challenging because the problem involves a large number of genes (features) and a small number of samples. As a consequence, one has to identify a small subset of informative genes contributing most to the classification task. Feature selection is essential for microarray prediction problems, since high-dimensional problems usually involve higher computational complexity and larger prediction errors.

Many methods have been proposed to select informative genes. One category of such work depends on the traditional t test statistic [5, 6, 7] and the analysis of variance (ANOVA) F test statistic [8, 9]. While t is used for two-class prediction problems, F is used for multiclass problems. The statistics t and F are used not only in class prediction but also in class discovery [10, 11]. The main goal of class discovery is to identify subtypes of diseases. The major difference between class prediction and class discovery is that the former uses labeled samples while the latter uses unlabeled samples.

Although t and F have been commonly used in the analysis of gene expression data, their roles are often misunderstood. The test statistic t is used to detect a difference between the means of two populations, and it has two versions depending on whether or not the variances of the two populations are equal. The test statistic F is often used to detect differences among the means of three or more populations under the assumption that the variances of the involved populations are equal. Of course, the F statistic can also be used to detect a difference between the means of two populations. In that case, one can show that the F statistic is equivalent to the t statistic based on equal variances, that is, one procedure rejects the null hypothesis that the two populations have the same mean if and only if the other procedure does. In analyzing gene expression data, however, the t statistic in use is the one based on unequal variances, so its extension to more than two classes is not the ANOVA F. Therefore, for multiclass prediction problems, it is natural to explore other statistics which do not assume equal variances.

In this paper, we study the effect on multiclass prediction results of gene selection from six test statistics: the ANOVA F test statistic, Brown-Forsythe test statistic, Welch test statistic, adjusted Welch test statistic, Cochran test statistic, and Kruskal-Wallis test statistic. The last five test statistics can be viewed as extensions of the t statistic used in two-class prediction problems. Their performance will be compared with that of the F statistic.

This paper is organized as follows. In “models and methods,” we describe the statistical model for gene expression levels, the test statistics, and our method to select genes. In “experimental results,” we investigate the effect of the test statistics on the classification results by using our gene selection approach and different machine learning techniques, applied to five publicly available microarray datasets. Our conclusion is given in “conclusion.”

MODELS AND METHODS

In this section, we will first introduce a general statistical model for gene expression values and describe test statistics for testing the equality of the class means. We then present our approach to select genes using power and correlation.

Statistical model

Assume there are k (≥ 2) distinct tumor tissue classes for the problem under consideration and there are p genes (inputs) and n tumor mRNA samples (observations). Suppose $X_{gs}$ is the measurement of the expression level of gene g from sample s for g = 1, ..., p and s = 1, ..., n. In terms of an expression matrix G, we may write

$$G = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{p1} & X_{p2} & \cdots & X_{pn} \end{pmatrix}.$$

It is seen that the columns and rows of the expression matrix G correspond to samples and genes, respectively. Note that G is a matrix consisting of data highly processed through preprocessing techniques that include image analysis and normalization and often logarithmic transformations. We assume that the data G are standardized so that the genes have mean 0 and variance 1 across samples.

Given a fixed gene, let $Y_{ij}$ be the expression level from the jth sample of the ith class. Note that these $Y_{ij}$ come from the corresponding row of G. For example, for gene 1, the $Y_{ij}$ are a rearrangement of the first row of G. We consider the following general model for $Y_{ij}$:

$$Y_{ij} = \mu_i + \epsilon_{ij}, \qquad i = 1, 2, \ldots, k;\ j = 1, 2, \ldots, n_i,$$

with $n_1 + n_2 + \cdots + n_k = n$. In the model, $\mu_i$ is a parameter representing the mean expression level of the gene in class i, and the $\epsilon_{ij}$ are the error terms, assumed to be independent normal random variables with $E(\epsilon_{ij}) = 0$ and $\mathrm{Var}(\epsilon_{ij}) = \sigma_i^2$ for i = 1, 2, ..., k; j = 1, 2, ..., $n_i$. Schematically, the expression levels $Y_{ij}$ look like the following:

Class 1: $Y_{11}, Y_{12}, \ldots, Y_{1n_1}$
Class 2: $Y_{21}, Y_{22}, \ldots, Y_{2n_2}$
...
Class k: $Y_{k1}, Y_{k2}, \ldots, Y_{kn_k}$

Note that if the variances are equal, that is, $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$, then the above model is simply the commonly used one-way ANOVA model. For the microarray data, we believe that heterogeneity in the variances is more realistic, since different $\sigma_i^2$ may describe different variations of the gene expression across classes.

One of the main tasks associated with the above model is to detect whether or not there is some difference among the means $\mu_1, \mu_2, \ldots, \mu_k$. For the case of homogeneity of variances, the well-known ANOVA F test is the optimal test to accomplish the task [12, 13]. However, with heterogeneity of the variances, the task is challenging and is closely related to the well-known Behrens-Fisher problem [14]. When the sample sizes in all classes are equal, that is, $n_1 = n_2 = \cdots = n_k$, the presence of heterogeneous variances of the errors only slightly affects the F test. When the sample sizes are unequal, the effect is serious [15]. The actual type-I error is inflated if smaller sizes $n_i$ are associated with larger variances $\sigma_i^2$. Conversely, the actual significance levels are smaller than anticipated if larger sizes $n_i$ are associated with larger variances $\sigma_i^2$. The above indicates that for our model, the F test may not be appropriate for testing $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$ versus $H_a$: not all the $\mu_i$ are equal. Therefore some alternatives to the F test are worth investigating.
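The sensitivity of the F test to unequal variances combined with unequal class sizes can be illustrated with a small Monte Carlo sketch. It is not taken from the paper: the class sizes (5, 10, 15), the standard deviations, the number of replications, and the hard-coded 5% critical value of F(2, 27) are all assumptions chosen for illustration.

```python
import random
from statistics import mean, variance

# Upper 5% point of F(2, 27), taken from standard tables (approximate).
F_CRIT = 3.354

def anova_f(groups):
    """Classical one-way ANOVA F statistic for a list of samples."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    within = sum((len(g) - 1) * variance(g) for g in groups)
    return (n - k) * between / ((k - 1) * within)

def rejection_rate(sizes, sigmas, reps=2000, seed=0):
    """Fraction of simulated null datasets (all class means 0) rejected at 5%."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        groups = [[rng.gauss(0.0, s) for _ in range(m)]
                  for m, s in zip(sizes, sigmas)]
        hits += anova_f(groups) > F_CRIT
    return hits / reps

sizes = (5, 10, 15)  # n = 30 and k = 3, so the reference law is F(2, 27)
small_n_large_var = rejection_rate(sizes, sigmas=(3.0, 2.0, 1.0))
large_n_large_var = rejection_rate(sizes, sigmas=(1.0, 2.0, 3.0))
equal_var = rejection_rate(sizes, sigmas=(1.0, 1.0, 1.0))
print(small_n_large_var, large_n_large_var, equal_var)
```

Under these assumed settings, the first rate should land well above the nominal 5% and the second well below it, while the equal-variance case stays close to 5%.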

Test statistics

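After introducing the statistical model for gene expression values, we now turn to the test statistics used to test the equality of the class means for a fixed gene. We will consider the following six test statistics. The first five are parametric test statistics, while the last one is nonparametric. For simplicity, we use ∑ to indicate that the sum is taken over the index i.

(a) ANOVA F test statistic. The definition of this test is

$$F = \frac{(n-k)\sum n_i\,(\bar{Y}_{i\cdot}-\bar{Y}_{\cdot\cdot})^2}{(k-1)\sum (n_i-1)\,s_i^2},$$

where $\bar{Y}_{i\cdot} = \sum_{j=1}^{n_i} Y_{ij}/n_i$, $\bar{Y}_{\cdot\cdot} = \sum n_i \bar{Y}_{i\cdot}/n$, and $s_i^2 = \sum_{j=1}^{n_i} (Y_{ij}-\bar{Y}_{i\cdot})^2/(n_i-1)$. Under $H_0$ and assuming variance homogeneity, this well-known test statistic has a distribution of $F_{k-1,\,n-k}$ [13].

(b) Brown-Forsythe test statistic [16]. This is given by

$$B = \frac{\sum n_i\,(\bar{Y}_{i\cdot}-\bar{Y}_{\cdot\cdot})^2}{\sum (1-n_i/n)\,s_i^2}.$$

Under $H_0$, B is distributed approximately as $F_{k-1,\,\nu_B}$, where

$$\nu_B = \frac{\left[\sum (1-n_i/n)\,s_i^2\right]^2}{\sum (1-n_i/n)^2 s_i^4/(n_i-1)}.$$

(c) Welch test statistic [17]. This is defined as

$$W = \frac{\sum w_i\,\bigl(\bar{Y}_{i\cdot}-\sum_l h_l \bar{Y}_{l\cdot}\bigr)^2}{(k-1) + 2(k-2)(k+1)^{-1}\sum (n_i-1)^{-1}(1-h_i)^2},$$

with $w_i = n_i/s_i^2$ and $h_i = w_i/\sum w_i$. Under $H_0$, W has an approximate distribution of $F_{k-1,\,\nu_W}$, where

$$\nu_W = \frac{k^2-1}{3\sum (n_i-1)^{-1}(1-h_i)^2}.$$

(d) Adjusted Welch test statistic [18]. It is similar to the Welch test statistic and defined to be

$$W^* = \frac{\sum w_i^*\,\bigl(\bar{Y}_{i\cdot}-\sum_l h_l^* \bar{Y}_{l\cdot}\bigr)^2}{(k-1) + 2(k-2)(k+1)^{-1}\sum (n_i-1)^{-1}(1-h_i^*)^2},$$

where $w_i^* = n_i/(\phi_i s_i^2)$ with $\phi_i$ chosen such that $1 \le \phi_i \le (n_i-1)/(n_i-3)$, and $h_i^* = w_i^*/\sum w_i^*$. Under $H_0$, W* has an approximate distribution of $F_{k-1,\,\nu_{W^*}}$, where

$$\nu_{W^*} = \frac{k^2-1}{3\sum (n_i-1)^{-1}(1-h_i^*)^2}.$$

In this paper, we choose $\phi_i = (n_i+2)/(n_i+1)$, since this choice provides reliable results for small sample sizes $n_i$ and a large number (k) of populations [18].

(e) Cochran test statistic [19]. This test statistic is simply the quantity appearing in the numerator of the Welch test statistic W, that is,

$$C = \sum w_i\,\bigl(\bar{Y}_{i\cdot}-\sum_l h_l \bar{Y}_{l\cdot}\bigr)^2,$$

where $w_i$ and $h_i$ are given in (c). Under $H_0$, C has an approximate distribution of $\chi^2_{k-1}$.

(f) Kruskal-Wallis test statistic. This is the well-known nonparametric test and is given by

$$H = \frac{12}{n(n+1)}\sum \frac{R_i^2}{n_i} - 3(n+1),$$

where $R_i$ is the rank sum for the ith class. The ranks assigned to the $Y_{ij}$ are those obtained from ranking the entire set of $Y_{ij}$ (use the average rank in case of tied values). Assuming each $n_i \ge 5$, then under $H_0$, H has an approximate distribution of $\chi^2_{k-1}$ [20].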
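The six statistics described above can be sketched in plain Python as follows. This is an illustrative implementation based on the standard textbook forms of these statistics, not the authors' code; all function names are hypothetical, and the sketch assumes every class has at least two samples and a nonzero sample variance.

```python
from statistics import mean, variance

def _summary(groups):
    """Per-class sizes n_i, means, and sample variances s_i^2."""
    return ([len(g) for g in groups], [mean(g) for g in groups],
            [variance(g) for g in groups])

def _between(ns, ms):
    """Sum of n_i * (class mean - grand mean)^2."""
    grand = sum(ni * mi for ni, mi in zip(ns, ms)) / sum(ns)
    return sum(ni * (mi - grand) ** 2 for ni, mi in zip(ns, ms))

def anova_f(groups):
    ns, ms, vs = _summary(groups)
    n, k = sum(ns), len(groups)
    within = sum((ni - 1) * vi for ni, vi in zip(ns, vs))
    return (n - k) * _between(ns, ms) / ((k - 1) * within)

def brown_forsythe(groups):
    ns, ms, vs = _summary(groups)
    n = sum(ns)
    return _between(ns, ms) / sum((1 - ni / n) * vi for ni, vi in zip(ns, vs))

def _weighted(groups, phi):
    """Weighted sum of squares and the Welch correction sum."""
    ns, ms, vs = _summary(groups)
    w = [ni / (phi(ni) * vi) for ni, vi in zip(ns, vs)]
    h = [wi / sum(w) for wi in w]
    centre = sum(hi * mi for hi, mi in zip(h, ms))
    ss = sum(wi * (mi - centre) ** 2 for wi, mi in zip(w, ms))
    lam = sum((1 - hi) ** 2 / (ni - 1) for hi, ni in zip(h, ns))
    return ss, lam

def cochran(groups):
    return _weighted(groups, lambda ni: 1.0)[0]

def welch(groups, phi=lambda ni: 1.0):
    k = len(groups)
    ss, lam = _weighted(groups, phi)
    return ss / ((k - 1) + 2 * (k - 2) / (k + 1) * lam)

def adjusted_welch(groups):
    # phi_i = (n_i + 2) / (n_i + 1), the choice stated in the text.
    return welch(groups, phi=lambda ni: (ni + 2) / (ni + 1))

def kruskal_wallis(groups):
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank, i = {}, 0
    while i < n:  # assign the average rank to each run of tied values
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2
        i = j
    total = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
for stat in (anova_f, brown_forsythe, welch, adjusted_welch,
             cochran, kruskal_wallis):
    print(stat.__name__, round(stat(groups), 3))
```

For k = 2 the Brown-Forsythe, Welch, and Cochran forms all reduce to the square of the unequal-variance t statistic, which is one way to see them as extensions of that t statistic to more than two classes.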

Gene selection

With the test statistics introduced above, we are able to discuss the issue of gene selection. It has been well demonstrated in the literature that gene selection is an important issue in microarray data analysis. It is also known that with a large number of genes (usually in the thousands) present, no practical method is available to locate the best set of genes, that is, the smallest subset of genes that offers optimal prediction accuracy. In this paper, the focus lies in comparing the performance of different test statistics in selecting genes for the classification of tumors based on gene expression profiles. Identifying a gene selection process to achieve good classification results is not the purpose of this paper.

To make the comparison straightforward, we adopt the simplest gene selection approach as follows. First, we formulate the expression levels of a given gene by a one-way ANOVA model, as shown in “statistical model.” We then use the test statistics in “test statistics” to determine the power of genes in discriminating between tumor types. Given a test statistic ℱ, we define the discrimination power of a gene as the value of ℱ evaluated at the n expression levels of the gene. This definition is based on the fact that with larger ℱ the null hypothesis $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$ will be more likely rejected. Therefore, the higher the discrimination power is, the more powerful the gene is in discriminating between tumor types. Finally, we choose as informative genes those genes having high power of discrimination.

We note that the discrimination power of genes could be determined equally well by the p value from ℱ. However, due to the small sizes $n_i$, it is hard to justify the approximation of the distribution of ℱ by the known reference distribution. Therefore the p values may not reflect the actual functionality of ℱ. This drawback is overcome by using the value of ℱ to determine the power of discrimination. Another obvious benefit is that using the value of ℱ greatly simplifies the calculation.

In [18], extensive simulations have been conducted to examine the behavior of some test statistics for testing the equality of population means. The test statistics studied include B, W, W*, F, and C. The results show that with homogeneity of the variances, the ANOVA F test is the optimal test, as stated in “statistical model.” However, this assumption of homogeneity is rarely met in practice. Under heterogeneity of variances, the simulation results in [18] show that the test statistics B, W, and W* provide acceptable control of type I errors. This implies that the genes identified by B, W, and W* are more likely to be powerful than those identified by F and C in discriminating between tumor types, and thus the prediction errors resulting from B, W, and W* are expected to be lower than those from F and C. The nonparametric test statistic H can be applied to data with fewer restrictions, for example, ordinal data; because it uses only the ranks of the expression levels, it is expected to perform worse than test statistics such as B, W, W*, and C. The above discussion will be further verified by our experiments on gene expression data conducted in “experimental results.”
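The selection rule described above (score each gene by the value of ℱ and keep the highest-scoring genes) can be sketched as follows. The function `select_genes`, the toy expression matrix, and the use of the Brown-Forsythe form as the example ℱ are illustrative assumptions, not the authors' code.

```python
from statistics import mean, variance

def brown_forsythe(groups):
    """Brown-Forsythe statistic; stands in here for a generic statistic."""
    ns = [len(g) for g in groups]
    n = sum(ns)
    ms = [mean(g) for g in groups]
    grand = sum(ni * mi for ni, mi in zip(ns, ms)) / n
    num = sum(ni * (mi - grand) ** 2 for ni, mi in zip(ns, ms))
    den = sum((1 - ni / n) * variance(g) for ni, g in zip(ns, groups))
    return num / den

def select_genes(expr, labels, stat, m):
    """Indices of the m genes with the largest statistic value.

    expr   -- list of rows, one per gene, each holding n expression levels
    labels -- class label of each sample (column), length n
    stat   -- function mapping a list of per-class samples to a number
    """
    classes = sorted(set(labels))
    scored = []
    for g, row in enumerate(expr):
        groups = [[x for x, c in zip(row, labels) if c == cls]
                  for cls in classes]
        scored.append((stat(groups), g))
    scored.sort(key=lambda t: -t[0])  # largest discrimination power first
    return [g for _, g in scored[:m]]

# Toy data: 3 genes, 9 samples, 3 classes (made up for illustration).
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
expr = [
    [0.1, 0.0, -0.1, 1.0, 1.1, 0.9, 2.0, 2.1, 1.9],  # separates classes well
    [0.3, -0.2, 0.5, 0.1, 0.4, -0.3, 0.2, 0.0, 0.1],  # mostly noise
    [0.0, 0.5, 1.0, 0.5, 1.0, 1.5, 1.0, 1.5, 2.0],    # weaker separation
]
top = select_genes(expr, labels, brown_forsythe, m=2)
print(top)
```

Because only the ordering of the statistic values matters, no reference distribution or p value is needed, which matches the simplification noted in the text.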

EXPERIMENTAL RESULTS

In this section we investigate the effect on gene selection of the six test statistics introduced in “test statistics.” Five gene expression datasets and five prediction methods are used for this purpose. The performances of the test statistics are evaluated in terms of class prediction errors.

Datasets

We considered five multiclass gene expression datasets: leukemia72 [1], ovarian [21], NCI [22, 23], lung cancer [24], and lymphoma [25]. Table 1 presents more details of the datasets.

Table 1

Multiclass gene expression datasets.

Dataset	Leukemia72	Ovarian	NCI	Lung cancer	Lymphoma

No. of genes	6817	7129	9703	918	4026
No. of samples	72	39	60	73	96
No. of classes	3	3	9	7	9

Comparison of test statistics

The gene selection procedure described above depends on the test statistic. Given a gene selection process from a test statistic, different classification methods may lead to different prediction errors. In our experiments, we used the following five prediction methods: naive Bayes, nearest neighbor, linear perceptron, multilayer perceptron neural network with 5 nodes in the middle layer, and support vector machines with a second-order polynomial kernel. All the algorithms are from Matlab PRTools 3.01 by Robert P. W. Duin.

To calculate the overall prediction error, we used leave-one-out (LOO) cross-validation. For a dataset with n samples, this method involves n separate runs. For each of the runs, n − 1 data points are used to train the model and then prediction is performed on the remaining data point. The overall prediction error is the sum of the errors over all n runs, that is, the number of misclassified samples.

Table 2 presents a comparison of the six test statistics when 50 informative genes were used. In the table, F, B, W, W*, C, and H represent the ANOVA F test statistic, Brown-Forsythe test statistic, Welch test statistic, adjusted Welch test statistic, Cochran test statistic, and Kruskal-Wallis test statistic, respectively. For each dataset, the first number in each cell denotes the average of the 5 prediction errors from the 5 different classification methods, and the second number is the median of those 5 prediction errors. The results in the table suggest that B, W, W*, and C perform better than F and H. Similar to Table 2, Tables 3 and 4 present comparison results with 100 and 200 informative genes, respectively.
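The leave-one-out error count described above can be sketched as follows. The experiments in the paper used Matlab PRTools; this Python version with a simple 1-nearest-neighbor rule and made-up data is only an illustration. The source does not say whether gene selection was repeated inside each run, so the sketch covers the classification step only.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Label of the training sample closest to x (squared Euclidean distance)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]

def loo_errors(samples, labels, predict=nearest_neighbor_predict):
    """Number of misclassified samples over the n leave-one-out runs."""
    errors = 0
    for i in range(len(samples)):
        # Train on the other n - 1 samples, then predict the held-out one.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) != labels[i]:
            errors += 1
    return errors

# Toy data: two well-separated classes plus one "B" sample lying near class "A".
samples = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
           [2.0, 2.0], [2.2, 2.0], [2.0, 2.2],
           [0.6, 0.6]]
labels = ["A", "A", "A", "B", "B", "B", "B"]
n_errors = loo_errors(samples, labels)
print(n_errors)
```

Any classifier with the same `predict(train_x, train_y, x)` shape could be passed in place of the nearest-neighbor rule.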

Table 2

Performances of the test statistics with 50 informative genes.

Dataset	F	B	W	W*	C	H

Leukemia (average)	3.4	2.4	2.8	2.8	3.2	3.0
Leukemia (median)	3	2	3	3	3	3
Ovarian (average)	0.2	0.0	0.0	0.0	0.0	0.0
Ovarian (median)	0	0	0	0	0	0
NCI (average)	36.0	32.0	27.4	26.0	27.0	35.4
NCI (median)	35	29	27	27	27	35
Lung cancer (average)	17.6	17.0	17.6	17.6	18.0	18.0
Lung cancer (median)	17	17	18	18	18	18
Lymphoma (average)	23.8	19.8	14.0	14.0	12.8	22.0
Lymphoma (median)	23	19	12	12	13	20

Table 3

Performances of the test statistics with 100 informative genes.

Dataset	F	B	W	W*	C	H

Leukemia (average)	3.4	3.0	3.0	3.0	3.2	3.0
Leukemia (median)	3	3	4	3	3	3
Ovarian (average)	0.2	0.0	0.0	0.0	0.0	0.0
Ovarian (median)	0	0	0	0	0	0
NCI (average)	33.0	22.6	23.8	25.2	25.2	31.6
NCI (median)	33	22	25	26	26	31
Lung cancer (average)	12.2	12.2	11.4	12.2	12.2	15.8
Lung cancer (median)	12	12	11	11	11	14
Lymphoma (average)	21.8	19.2	13.0	13.8	14.4	18.2
Lymphoma (median)	17	16	12	12	12	18

Table 4

Performances of the test statistics with 200 informative genes.

Dataset	F	B	W	W*	C	H

Leukemia (average)	3.0	3.0	2.4	2.8	1.8	2.4
Leukemia (median)	3	3	2	3	1	2
Ovarian (average)	0.4	0.2	0.2	0.2	0.2	0.4
Ovarian (median)	0	0	0	0	0	0
NCI (average)	25.6	22.6	22.6	22.8	22.2	25.6
NCI (median)	26	22	24	25	24	25
Lung cancer (average)	15.2	12.6	14.2	13.2	12.8	13.2
Lung cancer (median)	13	11	12	12	12	11
Lymphoma (average)	21.2	18.8	12.0	12.6	12.8	16.2
Lymphoma (median)	15	14	8	9	8	14

The results in Tables 2, 3, and 4 can be summarized graphically. Consider the average errors in the tables. For a fixed dataset and a fixed number of informative genes, the six test statistics can be ranked by their performance. Each test statistic thus receives fifteen ranks (five datasets times three gene-set sizes), which can be used to obtain a 95% confidence interval for its mean rank. The corresponding bar chart plotting the six confidence intervals is given in Figure 1. The bar chart based on the median errors in Tables 2, 3, and 4 is presented in Figure 2. Both figures show that B, W, W*, and C outperform F and H. These results indicate that the model proposed in “statistical model,” which does not assume equal variances, is preferred to one that assumes equal variances.
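The rank summary described above can be reproduced approximately from the average errors in Tables 2, 3, and 4. The text does not say how ties were ranked or how the interval was formed, so this sketch assumes average ranks for ties and a Student t interval with 14 degrees of freedom (critical value hard-coded as 2.145).

```python
from statistics import mean, stdev

STATS = ["F", "B", "W", "W*", "C", "H"]

# Average errors from Tables 2, 3, and 4; rows are Leukemia, Ovarian, NCI,
# Lung cancer, Lymphoma for 50, 100, and 200 informative genes.
AVERAGE_ERRORS = [
    [3.4, 2.4, 2.8, 2.8, 3.2, 3.0], [0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
    [36.0, 32.0, 27.4, 26.0, 27.0, 35.4], [17.6, 17.0, 17.6, 17.6, 18.0, 18.0],
    [23.8, 19.8, 14.0, 14.0, 12.8, 22.0],
    [3.4, 3.0, 3.0, 3.0, 3.2, 3.0], [0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
    [33.0, 22.6, 23.8, 25.2, 25.2, 31.6], [12.2, 12.2, 11.4, 12.2, 12.2, 15.8],
    [21.8, 19.2, 13.0, 13.8, 14.4, 18.2],
    [3.0, 3.0, 2.4, 2.8, 1.8, 2.4], [0.4, 0.2, 0.2, 0.2, 0.2, 0.4],
    [25.6, 22.6, 22.6, 22.8, 22.2, 25.6], [15.2, 12.6, 14.2, 13.2, 12.8, 13.2],
    [21.2, 18.8, 12.0, 12.6, 12.8, 16.2],
]

T_CRIT = 2.145  # approx. 97.5% point of Student's t with 15 - 1 = 14 d.f.

def average_ranks(values):
    """Ranks 1..m (1 = smallest error); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for pos in order[i:j]:
            ranks[pos] = (i + 1 + j) / 2
        i = j
    return ranks

def mean_rank_intervals(table):
    """Map each statistic to (mean rank, lower limit, upper limit)."""
    ranks = [average_ranks(row) for row in table]
    out = {}
    for c, name in enumerate(STATS):
        col = [r[c] for r in ranks]
        half = T_CRIT * stdev(col) / len(col) ** 0.5
        out[name] = (mean(col), mean(col) - half, mean(col) + half)
    return out

intervals = mean_rank_intervals(AVERAGE_ERRORS)
for name in STATS:
    m, lo, hi = intervals[name]
    print(f"{name:>2}: mean rank {m:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

A lower mean rank corresponds to fewer prediction errors, so smaller values are better.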

Figure 1

Relative performances of test statistics based on the average errors.

Figure 2

Relative performances of test statistics based on the median errors.

We note that in the above experiments, the performance of C is comparable to those of B, W, and W*. This does not appear consistent with the discussion in “gene selection.” One reason might be that we only examined 5 datasets in this paper. Our opinion is that if more datasets are explored, the overall performance of C should be worse than that of B, W, or W*. We leave this as future work.

Before concluding, we point out that it is useful to assess the importance of the genes selected by the test statistics from a biological perspective. Since this is not the focus of this paper, below we only provide a simple example examining some genes selected by the Brown-Forsythe test statistic for the leukemia dataset. This dataset was also studied by Getz et al [26]. They extracted stable clusters of genes by coupled two-way clustering analysis and concluded that genes grouped into the same cluster share certain biological significance, such as belonging to the same pathway. Among the top 50 informative genes from the Brown-Forsythe test statistic, 12 were mapped to the clusters of genes of interest given in [26]. Table 5 shows the gene descriptions, accession numbers, and corresponding clusters, as well as the values of the Brown-Forsythe statistic. For details on the biological significance of clusters LG1, LG5, and LG6, readers are referred to [26].

Table 5

Mapping from genes selected by the Brown-Forsythe test statistic for the leukemia data to clusters of genes of interest provided by Getz et al [26].

Gene description	Accession number	Cluster by Getz et al [26]	B

GB DEF = T-cell antigen receptor gene T3-delta	X03934	LG5	70.808014
Protein tyrosine kinase related mRNA sequence	L05148	LG5	43.676056
CD33 CD33 antigen (differentiation antigen)	M23197	LG1	42.435883
GB DEF = T-lymphocyte specific protein tyrosine kinase p56lck (lck) abberant mRNA	U23852 s	LG5	35.120228
T-cell surface glycoprotein CD3 epsilon chain precursor	M23323 s	LG5	35.028965
CTSD (cathepsin D) (lysosomal aspartyl protease)	M63138	LG1	34.865067
HLA class II histocompatibility antigen, DR alpha chain precursor	X00274	LG6	31.882597
HLA class I histocompatibility antigen, F alpha chain precursor	X17093	LG6	31.83585
Leukotriene C4 synthase (LTC4S) gene	U50136 rna1	LG1	31.183104
RNS2 (ribonuclease 2) (eosinophil-derived neurotoxin (EDN))	X16546	LG1	29.52516
TIMP2 (tissue inhibitor of metalloproteinase 2)	M32304 s	LG1	28.233025
LMP2 gene extracted from Homo sapiens genes TAP1, TAP2, LMP2, LMP7, and DOB	X66401 cds1	LG6	27.11849

CONCLUSION

In this paper, we have compared the performance of different test statistics in selecting genes for multiclass classification of tumors using gene expression data. Our experiments show that (a) the model for gene expression values that does not assume equal variances is more appropriate than the one that assumes equal variances, and (b) the Brown-Forsythe, Welch, adjusted Welch, and Cochran test statistics perform much better than the ANOVA F and Kruskal-Wallis test statistics.

DISCLAIMER

The opinions expressed herein are those of the authors and do not necessarily represent those of the Uniformed Services University of the Health Sciences and the Department of Defense.

REFERENCES

Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. Proc Natl Acad Sci U S A. 2000-10-24.

Ghosh D. Singular value decomposition regression models for classification of tumors from microarray experiments. Pac Symp Biocomput. 2002.

Li W, Fan M, Xiong M. SamCluster: an integrated scheme for automatic discovery of sample classes using gene expression profile. Bioinformatics. 2003-05-01.

Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de Rijn M, Waltham M, Pergamenschikov A, Lee JC, Lashkari D, Shalon D, Myers TG, Weinstein JN, Botstein D, Brown PO. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet. 2000-03.

Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y, Botstein D, Brown PO, Weinstein JN. A gene expression database for the molecular pharmacology of cancer. Nat Genet. 2000-03.

Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A. 2001-12-11.

Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000-02-03.

Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A. 2002-05-14.

Nguyen DV, Rocke DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002-01.

Nguyen DV, Rocke DM. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics. 2002-09.
