A Simple Rank Product Approach for Analyzing Two Classes.

Tae Young Yang

Abstract

The rank product statistic has been widely used to detect differentially expressed genes in replicated microarrays under a one-class setting. The objective of this article is to apply the rank product statistic to approximate the P-value of differential expression in a two-class setting, such as normal versus cancer cells. For this purpose, we introduce a simple statistic that compares the P-values of each class's rank product statistic. Its null distribution is derived directly using the change-of-variable technique.

Keywords:  change of variable; chi-squared approximation; log-transformation; rank product statistic; two-class setting

Year:  2015        PMID: 26244016      PMCID: PMC4507469          DOI: 10.4137/BBI.S26414

Source DB:  PubMed          Journal:  Bioinform Biol Insights        ISSN: 1177-9322


Introduction

The rank product statistic [1] is a robust nonparametric approach proposed to detect differentially expressed genes in replicated microarrays with just one class or condition. Because the rank product statistic transforms expression intensities into ranks, it has several advantages, including fewer assumptions and easy handling of noisy data or a small number of microarrays [2]. Although the rank product statistic has been used mainly for microarrays, it is also applicable to meta-analyses [3,4] and proteomics [5]. The rank product statistic ranks genes according to expression intensity within each microarray and calculates the product of these ranks across multiple microarrays. This technique can identify genes that are consistently found among the most differentially expressed genes in a number of replicated microarrays. However, a very large number of permutations and a substantial amount of computation time are required to accurately calculate the P-value to test for differential expression.

Alternatively, Koziol [6] proposed a log-transformed rank product statistic and used a continuous gamma distribution to approximate its P-value. The computation time needed for this approximation is negligible compared with that required by the permutation approach. To extend the rank product statistic to approximate the P-value of differential expression under a two-class setting, such as in cancer cells and normal cells, Koziol [7] used the difference between two averaged gamma variables. However, calculating the null density of this difference is mathematically complicated. In contrast, this article proposes a simple variable for comparing the P-values of each class's log-transformed rank product statistic and describes its null distribution, which is easily derived by a change-of-variable technique.

Background of One-Class Rank Product Statistic

Assume that we have $m$ replicate microarrays representing one class, with each microarray measuring the expression of $n$ genes. For each microarray $j$ ($j = 1, \ldots, m$), Koziol [6] ranked the expression levels $X_{1j}, \ldots, X_{nj}$ and denoted $R_{ij} = \mathrm{rank}(X_{ij})$ such that the most highly expressed gene is assigned rank 1 and the least expressed gene is assigned rank $n$; thus $R_{ij} \in \{1, \ldots, n\}$. For each gene $i$, we have a rank tuple $\{R_{i1}, \ldots, R_{im}\}$. The original rank product statistic for gene $i$ is

$$RP_i = \prod_{j=1}^{m} R_{ij},$$

which is the product of the ranks of gene $i$ over $m$ independent microarrays. Assuming that each rank occurs only once with independent samples, $RP_i$ takes discrete values from 1 to $n^m$. When $(R_{i1}, \ldots, R_{im})$ is small, $RP_i$ is small, indicating that gene $i$ is differentially expressed. To calculate the P-value for the test that gene $i$ is differentially expressed, $RP_i$ is compared with its permutation distribution under the null hypothesis that $R_{ij}$ for $i = 1, \ldots, n$ are exchangeable within each microarray $j$ [1]. However, to accurately approximate the distribution, a very large number of permutations is required, which is computationally very time-consuming. Thus, a simpler approximation approach is needed to calculate the P-value of $RP_i$.
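As a minimal sketch of the definition above, the following Python function ranks the genes within each microarray (rank 1 for the highest expression) and multiplies each gene's ranks across arrays. The function name `rank_products` and the toy matrix are illustrative assumptions, and ties within an array are assumed not to occur.

```python
from math import prod

def rank_products(expr):
    """expr[i][j] is the expression of gene i on microarray j.

    Returns the rank product of each gene, where rank 1 goes to the most
    highly expressed gene on an array. Assumes no ties within an array.
    """
    n, m = len(expr), len(expr[0])
    ranks = [[0] * m for _ in range(n)]
    for j in range(m):
        # Gene indices sorted by decreasing expression on array j.
        order = sorted(range(n), key=lambda i: expr[i][j], reverse=True)
        for rank, i in enumerate(order, start=1):
            ranks[i][j] = rank
    return [prod(row) for row in ranks]

# Toy data: 4 genes on 3 arrays (hypothetical values).
expr = [
    [9.1, 8.7, 9.4],  # highest on every array -> ranks (1, 1, 1)
    [5.0, 6.2, 4.8],
    [7.3, 5.9, 6.6],
    [2.1, 3.0, 2.5],  # lowest on every array -> ranks (4, 4, 4)
]
rp = rank_products(expr)
print(rp)
```

A gene that is consistently near the top of every array receives a small rank product, which is the signal the statistic is designed to detect.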

Log-Transformed Rank Product Statistic

An individual P-value given by $R_{ij}/(n + 1)$ is approximately uniformly distributed on the unit interval (0, 1), with the approximation improving as $n$ (the number of genes) increases. If $R_{ij}/(n + 1)$ is continuously uniform on (0, 1), the transformation $-2\ln(R_{ij}/(n + 1))$ has a chi-squared distribution with two degrees of freedom, denoted as $\chi^2(2)$. In contrast, Koziol [6] used the transformation $-\ln(R_{ij}/(n + 1))$, which has an exponential density Exp(1). Because chi-squared tables are readily available, we adopt the chi-squared form here. We can combine the individual chi-squared variables as follows:

$$-2\sum_{j=1}^{m} \ln\frac{R_{ij}}{n+1} = -2\ln RP_i + 2m\ln(n+1),$$

which has a $\chi^2(2m)$ density. Because the monotonicity of the log function ensures that the significance levels of $RP_i$ and $\ln RP_i$ are identical, the chi-squared density provides a simple calculation to obtain the P-value of $RP_i$. Let $(r_{i1}, \ldots, r_{im})$ and $rp_i = \prod_{j=1}^{m} r_{ij}$ be the observed values of $(R_{i1}, \ldots, R_{im})$ and $RP_i$, respectively. The P-value of $rp_i$ for testing the differential expression of gene $i$ is

$$P\left(\chi^2(2m) > -2\ln rp_i + 2m\ln(n+1)\right). \qquad (1)$$

When $(r_{i1}, \ldots, r_{im})$ is small, $rp_i$ and its P-value are also small, indicating that gene $i$ is differentially expressed.
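The chi-squared approximation described above can be sketched with the Python standard library alone. For even degrees of freedom 2m, the upper tail probability has the closed form exp(-x/2) * sum over k < m of (x/2)^k / k!, so no statistics package is needed. The function names and the example ranks are illustrative assumptions.

```python
from math import exp, log

def chi2_sf_even(x, m):
    """P(chi-squared with 2m degrees of freedom > x), for integer m >= 1.

    Uses the closed form exp(-x/2) * sum_{k<m} (x/2)^k / k!, which holds
    for even degrees of freedom.
    """
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return exp(-half) * total

def rank_product_pvalue(ranks, n):
    """Approximate P-value of one gene from its ranks on m arrays of n genes."""
    m = len(ranks)
    # -2 * sum_j ln(r_j / (n + 1)) equals -2 ln(rp) + 2 m ln(n + 1).
    x = -2.0 * sum(log(r / (n + 1)) for r in ranks)
    return chi2_sf_even(x, m)

# Hypothetical ranks out of n = 1000 genes on m = 3 arrays.
p_top = rank_product_pvalue([1, 2, 1], n=1000)        # consistently near the top
p_mid = rank_product_pvalue([500, 480, 510], n=1000)  # unremarkable ranks
print(p_top, p_mid)
```

Summing the logs of the scaled ranks, rather than forming the product first, gives the same statistic and avoids very large intermediate integers when m is large.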

A New Statistic for Analyzing Two Classes

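Suppose we extend the analysis to two classes, with $m_1$ independent microarrays in class 1 and $m_2$ independent microarrays in class 2. Each microarray measures $n$ genes. Going forward, for simplicity, the gene label $i$ is omitted. Let

$$RP_1 = \prod_{j=1}^{m_1} R_{1j} \quad \text{and} \quad RP_2 = \prod_{j=1}^{m_2} R_{2j}$$

be the rank product statistics of classes 1 and 2, respectively, where $R_{kj}$ is the rank of the gene in microarray $j$ of class $k$. Note that $rp_1$ and $rp_2$ are the observed values of $RP_1$ and $RP_2$, respectively. Let $X_1$ and $X_2$ be

$$X_1 = -2\ln RP_1 + 2m_1\ln(n+1), \qquad X_2 = -2\ln RP_2 + 2m_2\ln(n+1).$$

Note that the two independent random variables $X_1$ and $X_2$ have $\chi^2(2m_1)$ and $\chi^2(2m_2)$ distributions, respectively, under the null hypothesis that the ranks for $i = 1, \ldots, n$ are exchangeable within each microarray $j$. To calculate the P-value of differential expression of gene $i$ under a two-class setting, we define a new statistic

$$V = \frac{P(\chi^2(2m_1) > x_1)}{P(\chi^2(2m_2) > x_2)}, \qquad (2)$$

where $x_1 = -2\ln rp_1 + 2m_1\ln(n+1)$ and $x_2 = -2\ln rp_2 + 2m_2\ln(n+1)$ are the observed values of $X_1$ and $X_2$, respectively. Genes associated with sufficiently small $V$ would be differentially expressed for testing $H_0$: class 1 = class 2 vs. $H$: class 1 > class 2. The distributions of $P(\chi^2(2m_1) > x_1)$ and $P(\chi^2(2m_2) > x_2)$ are uniform (0, 1) under the null hypothesis. Then, the density of $V$ is

$$f(v) = \begin{cases} \dfrac{1}{2}, & 0 < v \le 1, \\ \dfrac{1}{2v^2}, & v > 1. \end{cases}$$

The proof is presented in the Appendix. The P-value for testing $H_0$: class 1 = class 2 vs. $H$: class 1 > class 2 can be obtained by

$$P(V \le v) = \begin{cases} \dfrac{p_1}{2p_2}, & p_1 \le p_2, \\ 1 - \dfrac{p_2}{2p_1}, & p_1 > p_2, \end{cases} \qquad (3)$$

where $p_1 = P(\chi^2(2m_1) > x_1)$ and $p_2 = P(\chi^2(2m_2) > x_2)$. Similarly, the P-value for testing $H_0$: class 1 = class 2 vs. $H$: class 1 < class 2 can be obtained by

$$P(V \ge v) = \begin{cases} 1 - \dfrac{p_1}{2p_2}, & p_1 \le p_2, \\ \dfrac{p_2}{2p_1}, & p_1 > p_2. \end{cases}$$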
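The two-class calculation can be sketched as follows. The sketch assumes that V is taken as the ratio p1/p2 of the two per-class P-values; for the ratio of two independent uniform (0, 1) variables, P(V ≤ v) is v/2 for v ≤ 1 and 1 − 1/(2v) for v > 1. The function names and example ranks are hypothetical, and this is not the author's R program.

```python
from math import exp, log

def chi2_sf_even(x, m):
    """P(chi-squared with 2m degrees of freedom > x), closed form for even df."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return exp(-half) * total

def class_pvalue(ranks, n):
    """One-class chi-squared P-value from a gene's ranks on len(ranks) arrays."""
    x = -2.0 * sum(log(r / (n + 1)) for r in ranks)
    return chi2_sf_even(x, len(ranks))

def two_class_pvalues(ranks1, ranks2, n):
    """Return (v, p_greater, p_less) for one gene.

    ASSUMPTION: V = p1 / p2. p_greater = P(V <= v) tests class 1 > class 2;
    p_less = P(V >= v) tests class 1 < class 2.
    """
    v = class_pvalue(ranks1, n) / class_pvalue(ranks2, n)
    cdf = v / 2.0 if v <= 1.0 else 1.0 - 1.0 / (2.0 * v)
    return v, cdf, 1.0 - cdf

# Hypothetical gene: near the top in class 1 (4 arrays), mid-range in class 2 (3 arrays).
v, p_gt, p_lt = two_class_pvalues([3, 5, 2, 4], [400, 650, 520], n=1000)
print(v, p_gt, p_lt)
```

Because rank 1 denotes the highest expression, a small p1 relative to p2 gives a small V and therefore a small P-value for the alternative class 1 > class 2.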

Numerical examples

Simulation study

We evaluated the performance of the proposed statistic V in Equation (2) by comparing its specificity (or 1 − false-positive rate) and sensitivity (or power) in detecting differential expression with those of the Wilcoxon rank-sum statistic, which is widely used for nonparametric testing to calculate the P-value of differential expression under a two-class setting. For the following specifications, we conducted 1,000 simulation experiments to assess the specificity and sensitivity of the statistic. To assess the specificity of the proposed statistic, we simulated 10,000 genes such that the expression of each gene in 40 microarrays was drawn independently from a standard normal distribution, where the first 20 samples (m1 = 20) were the control group and the second 20 (m2 = 20) were the treatment group. This specification represents a situation in which no genes are differentially expressed. The false-positive rate was then calculated as follows: the number of genes found to be differentially expressed at nominal level α was counted and divided by 10,000 (the number of genes). Table 1 presents the false-positive rates of the proposed statistic for various α, m1, and m2. As can be seen from the table, the statistic maintained appropriate α-levels.
Table 1

False-positive rates of the proposed statistic for various nominal α-levels and numbers of samples, where m1 and m2 are the sample numbers of the control group and the treatment group, respectively.

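| m1, m2 | α = 0.01 | α = 0.05 | α = 0.10 | α = 0.25 |
|--------|----------|----------|----------|----------|
| 10, 10 | 0.0097 | 0.0494 | 0.0997 | 0.2501 |
| 20, 20 | 0.0099 | 0.0496 | 0.0993 | 0.2495 |
| 30, 30 | 0.0095 | 0.0492 | 0.0994 | 0.2491 |
| 10, 20 | 0.0097 | 0.0496 | 0.0994 | 0.2493 |
| 20, 10 | 0.0095 | 0.0496 | 0.0997 | 0.2499 |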

Note: The numbers denote the rates of genes that were identified by the proposed statistic as differentially expressed at α.
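The specificity experiment can be sketched on a smaller scale as follows. This is a single replicate with 5,000 genes rather than the 1,000 replicates of 10,000 genes reported above, and it assumes V is the ratio p1/p2 of the per-class chi-squared P-values. The function names, seed, and sizes are illustrative assumptions.

```python
import random
from math import exp, log

def chi2_sf_even(x, m):
    """P(chi-squared with 2m degrees of freedom > x), closed form for even df."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return exp(-half) * total

def ranks_desc(values):
    """Rank 1 for the largest value; ties are assumed absent."""
    order = sorted(range(len(values)), key=values.__getitem__, reverse=True)
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def class_statistic(n, m, rng):
    """Per-gene X = -2 * sum_j ln(R_j / (n + 1)) from m null N(0, 1) arrays."""
    x = [0.0] * n
    for _ in range(m):
        arr = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for i, r in enumerate(ranks_desc(arr)):
            x[i] -= 2.0 * log(r / (n + 1))
    return x

def false_positive_rate(n, m1, m2, alpha, seed=0):
    """Share of null genes declared significant for class 1 > class 2."""
    rng = random.Random(seed)
    x1 = class_statistic(n, m1, rng)
    x2 = class_statistic(n, m2, rng)
    hits = 0
    for a, b in zip(x1, x2):
        v = chi2_sf_even(a, m1) / chi2_sf_even(b, m2)  # assumed ratio form of V
        p = v / 2.0 if v <= 1.0 else 1.0 - 1.0 / (2.0 * v)
        hits += p <= alpha
    return hits / n

fpr = false_positive_rate(n=5000, m1=10, m2=10, alpha=0.05)
print(fpr)
```

If the null distribution holds, the printed rate should fall close to the nominal level of 0.05, up to sampling variation from the single replicate.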

To assess the power of the proposed statistic, 10,000 genes were simulated such that the expression of each gene in 40 microarrays was drawn independently from a standard normal distribution, where the first 20 samples were the control group and the second 20 were the treatment group. Next, 5% of the genes were randomly selected, and a constant of 0.25 was added to their treatment-group values. These selected genes had a higher average expression in the treatment group, whereas there was no difference between the two groups for the remaining 95% of genes. We repeated the same procedure with larger constants: 0.5, 1.0, and 1.5. In Table 2, the numbers represent the proportions of the selected 5% of differentially expressed genes that were found to be differentially expressed at various significance levels α. The results of the proposed statistic were compared with those obtained from the Wilcoxon rank-sum test statistic. In every setting shown in the table, the proposed statistic was at least as powerful as the Wilcoxon statistic, with the largest gains at small α and added constants of 0.5 and 1.0.
Table 2

Power of the proposed statistic for various nominal α-levels.

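| α-level | Added constant 0.25 | Added constant 0.5 | Added constant 1.0 | Added constant 1.5 |
|---------|---------------------|--------------------|--------------------|--------------------|
| 0.01 | 0.08 (0.06) | 0.32 (0.20) | 0.91 (0.74) | 1.0 (0.98) |
| 0.05 | 0.24 (0.19) | 0.56 (0.44) | 0.97 (0.91) | 1.0 (1.0) |
| 0.1 | 0.35 (0.30) | 0.69 (0.58) | 0.98 (0.96) | 1.0 (1.0) |
| 0.2 | 0.57 (0.54) | 0.83 (0.80) | 0.99 (0.99) | 1.0 (1.0) |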

Notes: We simulated 10,000 genes such that the expression of each gene in 40 microarrays was drawn independently from a standard normal distribution, where the first 20 samples were the control group and the second 20 were the treatment group. We randomly selected 5% of the genes and added a constant of 0.25 to their treatment-group values. These selected genes had a higher average expression in the treatment group, whereas there was no difference between the two groups for the remaining 95% of genes. We repeated the same procedure with larger constants: 0.5, 1.0, and 1.5. The numbers denote the proportions of differentially expressed genes that were identified by the proposed statistic as differentially expressed. For comparison, the numbers inside parentheses denote the proportions of differentially expressed genes identified by the Wilcoxon rank-sum statistic.
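A reduced version of the power experiment can be sketched as below: one replicate with 2,000 genes, an added constant of 1.0, and α = 0.05. It assumes V is the ratio p1/p2 of the per-class P-values and uses a one-sided Wilcoxon rank-sum test with the normal approximation; whether the reported comparison used exactly this variant of the Wilcoxon test is not stated, so that choice is an assumption. All names and parameter values are illustrative.

```python
import random
from math import erfc, exp, log, sqrt

def chi2_sf_even(x, m):
    """P(chi-squared with 2m degrees of freedom > x), closed form for even df."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return exp(-half) * total

def ranks_desc(values):
    """Rank 1 for the largest value; ties are assumed absent."""
    order = sorted(range(len(values)), key=values.__getitem__, reverse=True)
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def simulate_power(n=2000, m1=20, m2=20, shift=1.0, frac=0.05, alpha=0.05, seed=1):
    rng = random.Random(seed)
    shifted = set(rng.sample(range(n), int(frac * n)))
    # data[j][i]: gene i on array j; the first m1 arrays are controls.
    data = []
    for j in range(m1 + m2):
        arr = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if j >= m1:
            for i in shifted:
                arr[i] += shift
        data.append(arr)
    # Per-class chi-squared statistics built from within-array ranks.
    x1, x2 = [0.0] * n, [0.0] * n
    for j, arr in enumerate(data):
        target = x1 if j < m1 else x2
        for i, r in enumerate(ranks_desc(arr)):
            target[i] -= 2.0 * log(r / (n + 1))
    hits_v = hits_w = 0
    total = m1 + m2
    for i in shifted:
        # Proposed statistic (assumed ratio form); alternative: class 1 < class 2.
        v = chi2_sf_even(x1[i], m1) / chi2_sf_even(x2[i], m2)
        p_v = 1.0 - v / 2.0 if v <= 1.0 else 1.0 / (2.0 * v)
        # Wilcoxon rank-sum, one-sided, normal approximation.
        gene = [data[j][i] for j in range(total)]
        asc = [total + 1 - r for r in ranks_desc(gene)]  # rank 1 = smallest value
        w = sum(asc[m1:])
        z = (w - m2 * (total + 1) / 2.0) / sqrt(m1 * m2 * (total + 1) / 12.0)
        p_w = 0.5 * erfc(z / sqrt(2.0))
        hits_v += p_v <= alpha
        hits_w += p_w <= alpha
    return hits_v / len(shifted), hits_w / len(shifted)

power_v, power_w = simulate_power()
print(power_v, power_w)
```

The two returned values are the shares of truly shifted genes detected by each method, which correspond to one cell of Table 2 (proposed statistic, with Wilcoxon in parentheses).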

Real data analysis

The widely used data set of Golub et al. [8] came from a study of gene expression in two classes of acute leukemia: acute lymphocytic leukemia (ALL) and acute myelogenous leukemia (AML). Gene expression levels were measured using Affymetrix high-density oligonucleotide microarrays containing 6,817 human genes. Three preprocessing procedures were applied to the gene expression levels, which are available at http://www.genome.wi.mit.edu/MPR. These preprocessing procedures were (i) thresholding: floor of 100 and ceiling of 16,000; (ii) filtering: exclusion of genes with (max/min) ≤ 5 or (max − min) ≤ 500, where max and min refer, respectively, to the maximum and minimum levels for a particular gene across mRNA samples; and (iii) log10 transformation [9]. The data were then summarized by a 3,051 × 38 matrix, which is included in the multtest package from http://www.bioconductor.org/biocLite.R. Table 3 presents the top 25 AML-significant genes from Equation (3). Eleven genes marked with * were also reported among the top 25 AML-specific genes in Golub et al. We also compared the P-values of the proposed statistic with those of the Wilcoxon rank-sum statistic. The proposed P-values were obtained under the overall null hypothesis that the expression levels are exchangeable within each of the independent microarrays.
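The three preprocessing steps listed above (thresholding, filtering, and log10 transformation) can be sketched as follows. The function name `preprocess`, its parameter names, and the toy intensities are hypothetical; the sketch only mirrors the stated thresholds and is not the code used to build the multtest data set.

```python
from math import log10

def preprocess(expr, floor=100.0, ceiling=16000.0, min_ratio=5.0, min_diff=500.0):
    """expr[i] holds the raw intensities of gene i across samples.

    Returns (kept_indices, rows), where rows are the log10-transformed,
    thresholded intensities of the genes that pass the filter.
    """
    kept, rows = [], []
    for i, row in enumerate(expr):
        # (i) Thresholding: clip to [floor, ceiling].
        clipped = [min(max(x, floor), ceiling) for x in row]
        hi, lo = max(clipped), min(clipped)
        # (ii) Filtering: drop genes with max/min <= 5 or max - min <= 500.
        if hi / lo <= min_ratio or hi - lo <= min_diff:
            continue
        # (iii) Base-10 log transformation.
        kept.append(i)
        rows.append([log10(x) for x in clipped])
    return kept, rows

# Toy raw intensities for 3 genes across 3 samples (hypothetical values).
raw = [
    [50.0, 120.0, 20000.0],  # clipped to [100, 120, 16000]; passes the filter
    [300.0, 350.0, 400.0],   # max/min is about 1.33; excluded
    [100.0, 900.0, 700.0],   # max/min = 9 and max - min = 800; passes
]
kept, rows = preprocess(raw)
print(kept)
```

Thresholding is applied before filtering here, so the max and min used in the filter are those of the clipped values; the source lists the steps in this order but does not state this detail explicitly.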
Table 3

Our P-values obtained under the overall null hypothesis that the expression levels are exchangeable within each of the independent microarrays.

| Affymetrix ID | Description | Our top 25 P-values | Wilcoxon's P-value |
|---------------|-------------|---------------------|--------------------|
| Y00787* | interleukin-8 precursor | 6.07 × 10^-11 | 3.39 × 10^-06 |
| M27891* | CST3 Cystatin C | 9.69 × 10^-09 | 3.32 × 10^-09 |
| M96326* | Azurocidin gene | 6.69 × 10^-08 | 8.28 × 10^-06 |
| M28130* | Interleukin 8 gene | 2.85 × 10^-07 | 2.67 × 10^-06 |
| M63438 | glutamine synthase | 7.17 × 10^-07 | 1.10 × 10^-04 |
| X17042* | PRG1 Proteoglycan 1, secretory granule | 2.51 × 10^-06 | 2.74 × 10^-05 |
| U01317 | Delta-globin gene | 4.47 × 10^-06 | 6.42 × 10^-04 |
| M19507 | mpo myeloperoxidase | 5.95 × 10^-06 | 1.53 × 10^-05 |
| M91036 | G-gamma globin | 8.83 × 10^-06 | 1.37 × 10^-03 |
| M87789 | hybridoma H210 | 1.00 × 10^-05 | 2.06 × 10^-04 |
| X95735* | Zyxin | 1.14 × 10^-05 | 8.31 × 10^-10 |
| M19045* | LYZ | 1.27 × 10^-05 | 2.67 × 10^-06 |
| X14008 | Lysozyme gene | 1.81 × 10^-05 | 6.67 × 10^-06 |
| X64072 | SELL Leukocyte adhesion protein beta subunit | 2.09 × 10^-05 | 2.74 × 10^-05 |
| J04990 | cathepsin g precursor | 2.38 × 10^-05 | 1.53 × 10^-05 |
| J03801 | LYZ | 2.59 × 10^-05 | 1.63 × 10^-06 |
| X62320 | GRN Granulin | 4.60 × 10^-05 | 4.16 × 10^-07 |
| X04085* | Catalase 5′flank and exon 1 mapping to chr 11 | 5.59 × 10^-05 | 2.67 × 10^-06 |
| M21119 | LYZ | 7.99 × 10^-05 | 9.49 × 10^-04 |
| M84526* | DF D component of complement | 1.09 × 10^-04 | 3.30 × 10^-05 |
| M57710* | galectin 3 | 1.11 × 10^-04 | 9.37 × 10^-05 |
| L09209 | APLP2 Amyloid beta (A4) precursor-like protein 2 | 1.33 × 10^-04 | 5.56 × 10^-08 |
| L08246* | induced myeloid leukemia cell differentiation protein mcl1 | 1.53 × 10^-04 | 2.67 × 10^-06 |
| X62654 | ME491 | 2.21 × 10^-04 | 1.15 × 10^-07 |
| X65965 | manganese superoxide dismutase | 3.26 × 10^-04 | 7.78 × 10^-03 |

Notes: The top 25 P-values from Equation (3) for AML-specific genes in the leukemia data of Golub et al. Among them, 11 genes marked with * were reported among the top 25 AML-specific genes in Golub et al. Our P-values are compared with the P-values of the Wilcoxon rank-sum test. Ten genes of the Wilcoxon rank-sum statistic were reported among the top 25 AML-specific genes in Golub et al.

Conclusion

To approximate the P-value of differential expression under a two-class setting, Koziol [7] derived the density of the difference between two averaged gamma variables, which is mathematically complex. In contrast, we provided a simple, nonparametric statistic V in Equation (2). Its null distribution was easily derived by the change-of-variable technique. In the sensitivity analysis presented in the Simulation study section, the proposed statistic was more powerful than the Wilcoxon statistic. In the specificity analysis, it also maintained appropriate α-levels. We developed an R program for this statistic, available at http://home.mju.ac.kr/home/index.action?siteId=tyang.

Koziol [6] noted that the P-values of ln RP in Equation (1) were well approximated by the corresponding continuous gamma approximation (or, in our case, the chi-squared approximation) over most of the data range; however, the estimation of extremely small P-values was rather imprecise. Specifically, the gamma approximation is conservative in that it tends to overestimate extremely small P-values, leading to false-negative results. This is because the discrete rank products take values between 1 and $n^m$, whereas the continuous chi-squared distribution is defined over the positive real numbers [10]. Because p1 and p2 in Equation (3) are based on the gamma approximation, the P-value of the proposed statistic V may be imprecise, particularly when both p1 and p2 are extremely small.
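The conservativeness in the extreme tail can be illustrated on a deliberately tiny case, where the exact null distribution of the rank product is available by full enumeration. The sketch assumes each rank is uniform on 1, …, n and independent across arrays under the null; n = 10 and m = 3 are chosen only so that all 1,000 rank tuples can be listed, and the function names are hypothetical.

```python
from itertools import product
from math import exp, log, prod

def chi2_sf_even(x, m):
    """P(chi-squared with 2m degrees of freedom > x), closed form for even df."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return exp(-half) * total

def exact_pvalue(rp, n, m):
    """Exact P(RP <= rp) by enumerating all n**m equally likely rank tuples."""
    count = sum(1 for t in product(range(1, n + 1), repeat=m) if prod(t) <= rp)
    return count / n ** m

def approx_pvalue(rp, n, m):
    """Chi-squared approximation P(chi2(2m) > -2 ln rp + 2 m ln(n + 1))."""
    return chi2_sf_even(-2.0 * log(rp) + 2.0 * m * log(n + 1), m)

n, m = 10, 3
exact = exact_pvalue(1, n, m)    # only the tuple (1, 1, 1) has product 1
approx = approx_pvalue(1, n, m)
print(exact, approx)
```

For the smallest possible rank product, the continuous approximation returns a P-value well above the exact one, which is the overestimation described above; with realistic n the gap narrows over most of the range but persists in the extreme tail.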

1. Fangxin Hong; Rainer Breitling; Connor W McEntee; Ben S Wittner; Jennifer L Nemhauser; Joanne Chory. RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics. 2006-09-18.

2. Huub C J Hoefsloot; Suzanne Smit; Age K Smilde. A classification model for the Leiden proteomics competition. Stat Appl Genet Mol Biol. 2008-02-19.

3. Fangxin Hong; Rainer Breitling. A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics. 2008-01-18.

4. James A Koziol. Comments on the rank product method for analyzing replicated experiments. FEBS Lett. 2010-01-20.

5. Rob Eisinga; Rainer Breitling; Tom Heskes. The exact probability distribution of the rank product statistics for replicated experiments. FEBS Lett. 2013-02-08.

6. T R Golub; D K Slonim; P Tamayo; C Huard; M Gaasenbeek; J P Mesirov; H Coller; M L Loh; J R Downing; M A Caligiuri; C D Bloomfield; E S Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999-10-15.

7. James A Koziol. The rank product method with two samples. FEBS Lett. 2010-10-14.

8. Anna Campain; Yee Hwa Yang. Comparison study of microarray meta-analysis methods. BMC Bioinformatics. 2010-08-03.

9. Rainer Breitling; Patrick Armengaud; Anna Amtmann; Pawel Herzyk. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 2004-08-27.
