Abstract
BACKGROUND: The Significance Analysis of Microarrays (SAM) is a popular method for detecting significantly expressed genes and controlling the false discovery rate (FDR). Recently, it has been reported in the literature that the FDR is not well controlled by SAM. Because SAM is so widely applied in microarray data analysis, an extensive evaluation of SAM and its associated R package (sam2.20) is of great importance.
Year: 2007 PMID: 17603887 PMCID: PMC1955751 DOI: 10.1186/1471-2105-8-230
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The SAM plot obtained by using the SAM algorithm. The red points are the points declared significant by SAM. The two horizontal lines mark the lower cutoff δ (= cutlo) and the upper cutoff δ (= cutup) from SAM. The threshold used is Δ = 0.099.
Figure 2. The sam plot obtained from sam2.20. The red points are the points declared significant by sam2.20. The horizontal line marks the upper cutoff δ (= cutup) from sam2.20. The horizontal line corresponding to the lower cutoff δ (= cutlo) does not appear in the plot since δ = -10^10. The threshold Δ used is the same as that used in producing Figure 1.
Estimated numbers of FP from SAM, formula (7) and sam2.20, with s0 set to the default choice of SAM. Table 1 displays the average numbers of estimated TP, true FP and the estimated FP from SAM, formula (7) and sam2.20 from 100 simulations at different levels of estimated TP.
| Mean of estimated TP | Mean of true FP | Mean of estimated FP (SAM) | Mean of estimated FP (formula (7)) | Mean of estimated FP (sam2.20) |
| 248.31 | 59.32 | 20.73 | 7.96 | 68.71 |
| 203.67 | 24.06 | 12.37 | 6.21 | 28.98 |
| 152.03 | 4.61 | 5.55 | 4.34 | 5.41 |
Figure 3. Plot of the variance of the ordered test statistics d(i). The vertical axis displays the values of the variances of the order statistics d(i) for i = 1, ..., 500.
Results obtained under Setups (1) – (3). The medians reported in the table were calculated from 100 simulations. The median est.FDR values reported in Column 4 were obtained from sam2.20 and the symmetric cutoff method directly. The median est.FDRc values reported in Column 5 are the estimated FDR with FP correction (12).
| Setup | Median | Median true FDR sam2.20/sym.cutoff | Median est.FDR sam2.20/sym.cut | Median est.FDRc sam2.20/sym.cut |
| 1 | 199 | 0.1005/0.1005 | 0.1188/0.1238 | 0.0981/0.1020 |
| 2(i) | 210.5 | 0.7082/0.7779 | 0.6959/0.8130 | 0.6875/0.7945 |
| 2(ii) | 211.5 | 0.6699/0.6748 | 0.6730/0.7219 | 0.6331/0.6933 |
| 2(iii) | 207 | 0.6578/0.6533 | 0.6738/0.7039 | 0.6376/0.6616 |
| 2(iv) | 200 | 0.5025/0.4371 | 0.5305/0.4846 | 0.5004/0.4403 |
| 3(i) | 399 | 0.0599/0.0800 | 0.0596/0.1017 | 0.0598/0.0779 |
| 3(ii) | 399 | 0.1827/0.1754 | 0.1874/0.2135 | 0.1862/0.1758 |
| 3(iii) | 399 | 0.2130/0.1989 | 0.2175/0.2462 | 0.2168/0.2007 |
| 3(iv) | 400.5 | 0.4913/0.3950 | 0.5067/0.4511 | 0.4975/0.3994 |
Numbers of significant positive and negative genes identified by sam2.20 at Δ = 0.035. Table 2 displays the numbers of significant positive and negative genes from 10 simulations under the same setup as that used in producing Figures 1 and 2.
| Simulation | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Number of sig. genes | 230 | 158 | 294 | 75 | 468 | 394 | 41 | 168 | 74 | 206 |
| sig. pos | 73 | 27 | 140 | 75 | 285 | 380 | 34 | 86 | 26 | 129 |
| sig. neg | 157 | 131 | 154 | 0 | 183 | 14 | 7 | 82 | 48 | 77 |
Figure 5. Comparison between sam2.20 and the symmetric cutoff method under Setup 2(i). The histogram and the parallel boxplots in Figure 5 are defined the same as in Figure 4. The histogram shows that sam2.20 produces a significantly smaller number of true FP than the symmetric cutoff method across all 100 simulations. The first two boxplots at the right of Figure 5 show that sam2.20 under-estimates the true FDR and that FP correction (12) makes the under-estimation even worse. The last two boxplots at the right of Figure 5 show that the symmetric cutoff method over-estimates the true FDR and that the over-estimation is corrected by FP correction (12).
Simulation setups. Under each setup, there are 5000 genes. Table 3 shows how the genes were simulated. For example, under setup 1, the first 100 genes were generated from N(0,1) and N(3,1) under experimental conditions 1 and 2, respectively, the middle 4800 genes were generated from N(0,1) regardless of experimental condition and the last 100 genes were generated from N(0,1) and N(-3,1) under experimental conditions 1 and 2, respectively. The third column displays the ratio of induced to repressed genes. If the number of repressed genes is 0, the ratio is defined as ∞.
| Setup | Genes First/middle/last | Ratio | Experimental condition 1 | Experimental condition 2 |
| 1 | 100/4800/100 | 1/1 | | |
| 2(i) | 200/4800/0 | ∞ | | |
| 2(ii) | 167/4800/33 | 5/1 | | |
| 2(iii) | 160/4800/40 | 4/1 | | |
| 2(iv) | 100/4800/100 | 1/1 | | |
| 3(i) | 0/4600/400 | 0 | | |
| 3(ii) | 66/4600/334 | 1/5 | | |
| 3(iii) | 80/4600/320 | 1/4 | | |
| 3(iv) | 200/4600/200 | 1/1 | | |
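The Setup 1 description in the Table 3 caption can be sketched in Python as follows. This is only an illustrative sketch, not the authors' simulation code: the function name `simulate_setup1`, the number of replicates per condition (`n_per_condition`) and the seed are assumptions, since the caption does not state the sample size.

```python
import random


def simulate_setup1(n_per_condition=4, seed=0):
    """Simulate 5000 genes under two conditions as described for Setup 1.

    n_per_condition is an assumption; the caption does not give the
    number of replicates per experimental condition.
    """
    rng = random.Random(seed)
    # (number of genes, mean under condition 1, mean under condition 2)
    blocks = [(100, 0.0, 3.0), (4800, 0.0, 0.0), (100, 0.0, -3.0)]
    cond1, cond2, labels = [], [], []
    for n_genes, mu1, mu2 in blocks:
        for _ in range(n_genes):
            # Every gene has unit variance under both conditions.
            cond1.append([rng.gauss(mu1, 1.0) for _ in range(n_per_condition)])
            cond2.append([rng.gauss(mu2, 1.0) for _ in range(n_per_condition)])
            # +1 = induced, -1 = repressed, 0 = unchanged
            labels.append((mu2 > mu1) - (mu2 < mu1))
    return cond1, cond2, labels


cond1, cond2, labels = simulate_setup1()
print(len(cond1), labels.count(1), labels.count(0), labels.count(-1))
```

The `labels` list records which genes are truly changed, which is what makes it possible to count true FP in a simulation.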
Figure 4. Comparison between sam2.20 and the symmetric cutoff method under Setup 1. The histogram at the left of Figure 4 shows the number of true FP from sam2.20 minus the number of true FP from the symmetric cutoff method. The parallel boxplots at the right of Figure 4 show the values of 1) est. FDR from sam2.20 minus the true FDR (sam.diff), 2) est. FDR from sam2.20 with FP correction minus the true FDR (sam.diffc), 3) est. FDR from the symmetric cutoff method minus the true FDR (sym.diff) and 4) est. FDR from the symmetric cutoff method with FP correction minus the true FDR (sym.diffc).
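The quantities plotted in Figure 4 can be illustrated with a small Python sketch. It assumes the usual simulation definitions, which this excerpt does not spell out: true FP is the number of truly unchanged genes called significant, and true FDR is true FP divided by the number of genes called significant. The function name `true_fp_and_fdr`, the toy statistics and the estimated FDR value are all hypothetical.

```python
def true_fp_and_fdr(statistics, is_null, cutlo, cutup):
    """Return (true FP, true FDR) for the genes called significant.

    A gene is called significant when its statistic lies above cutup or
    below cutlo. is_null[i] is True when gene i is truly unchanged.
    """
    called = [d > cutup or d < cutlo for d in statistics]
    n_sig = sum(called)
    fp = sum(1 for c, null in zip(called, is_null) if c and null)
    # Convention assumed here: FDR is 0 when nothing is called significant.
    fdr = fp / n_sig if n_sig else 0.0
    return fp, fdr


# Toy data (hypothetical, not from the paper): six test statistics.
stats = [4.0, 3.5, -3.6, 0.2, -0.1, 3.4]
is_null = [False, True, False, True, True, True]

# Symmetric cutoff: cutlo = -cutup (3.3073 is one cutoff listed in Table 5).
fp, fdr = true_fp_and_fdr(stats, is_null, cutlo=-3.3073, cutup=3.3073)

est_fdr = 0.6  # hypothetical estimated FDR for the same gene list
sym_diff = est_fdr - fdr  # the quantity labelled sym.diff in Figure 4
print(fp, fdr, round(sym_diff, 4))
```

A positive difference corresponds to over-estimation of the true FDR and a negative one to under-estimation, which is how the boxplots are read in the captions.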
Figure 6. Comparison between sam2.20 and the symmetric cutoff method under Setup 3(iv). The histogram and the parallel boxplots in Figure 6 are defined the same as in Figure 4. It is clear from the histogram that sam2.20 tends to produce a significantly higher number of true FP than the symmetric cutoff method. The parallel boxplots show that both methods over-estimate the true FDR. It can also be seen that the over-estimation is more serious for the symmetric cutoff method than for sam2.20. The second and fourth boxplots at the right of Figure 6 show that the over-estimation is corrected by FP correction (12) for both methods.
Results obtained for the leukaemia data. Table 5 reports the number of significant positive and negative genes (columns 2, 6), the cutoffs (columns 3, 7), the estimated FDR (FDR) and the estimated FDR with FP correction (FDR-c) from sam2.20 and the symmetric cutoff method (columns 4, 8). Column 5 reports the number of genes found significant by sam2.20 and the symmetric cutoff method from the list of informative genes [21].
| | sam2.20 | | | # of genes from Golub et al.'s list | The symmetric cutoff method | | |
| | sig. pos | cutup | FDR | sam2.20 | sig. pos | cutup | FDR |
| | sig. neg | cutlo | FDR-c | sym.cut | sig. neg | cutlo | FDR-c |
| 316 | 227 | 3.2023 | 0.0065 | 14 | 215 | 3.3073 | 0.0043 |
| | 89 | -3.3888 | 0.0068 | 14 | 101 | -3.3073 | 0.0045 |
| 191 | 154 | 3.8648 | 0.0036 | 9 | 143 | 3.9612 | 0.0036 |
| | 37 | -4.2450 | 0.0037 | 11 | 48 | -3.9612 | 0.0037 |
| 92 | 87 | 4.0787 | 0 | 7 | 73 | 4.1906 | 0 |
| | 5 | -5.1595 | 0 | 9 | 19 | -4.1906 | 0 |
| 29 | 29 | 5.3143 | 0 | 5 | 27 | 5.3685 | 0 |
| | 0 | - | 0 | 6 | 2 | -5.3685 | 0 |
| 23 | 23 | 5.5514 | 0 | 5 | 22 | 5.5678 | 0 |
| | 0 | - | 0 | 5 | 1 | -5.5678 | 0 |