| Literature DB >> 19068485 |
Jeremy D Silver, Matthew E Ritchie, Gordon K Smyth.
Abstract
Background correction is an important preprocessing step for microarray data that attempts to adjust the data for the ambient intensity surrounding each feature. The "normexp" method models the observed pixel intensities as the sum of 2 random variables, one normally distributed and the other exponentially distributed, representing background noise and signal, respectively. Using a saddle-point approximation, Ritchie and others (2007) found normexp to be the best background correction method for 2-color microarray data. This article develops the normexp method further by improving the estimation of the parameters. A complete mathematical development is given of the normexp model and the associated saddle-point approximation. Some subtle numerical programming issues are solved that caused the original normexp method to fail occasionally when applied to unusual data sets. A practical and reliable algorithm is developed for exact maximum likelihood estimation (MLE) using high-quality optimization software and using the saddle-point estimates as starting values. MLE is shown to outperform heuristic estimators proposed by other authors, both in terms of estimation accuracy and in terms of performance on real data. The saddle-point approximation is an adequate replacement in most practical situations. The performance of normexp for assessing differential expression is improved by adding a small offset to the corrected intensities.
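The estimation strategy described in the abstract can be illustrated in miniature. The sketch below (Python with NumPy/SciPy; the paper's own implementation is in R/limma, so everything here is an illustrative stand-in) simulates the normexp convolution X = B + S, with background B ~ N(μ, σ²) and signal S exponential with mean α, then maximizes the exact log-likelihood with a general-purpose optimizer. Crude moment-based starting values stand in for the paper's saddle-point estimates, and Φ is evaluated in log space, the kind of numerical care the abstract alludes to.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulate the normexp convolution X = B + S:
# background B ~ N(mu, sigma^2), signal S ~ Exponential with mean alpha.
mu_t, sigma_t, alpha_t = 100.0, 20.0, 1000.0
x = rng.normal(mu_t, sigma_t, 5000) + rng.exponential(alpha_t, 5000)

def negloglik(theta, x):
    # Exact negative log-likelihood of the normexp density
    #   f(x) = (1/alpha) * exp((mu - x)/alpha + sigma^2/(2 alpha^2))
    #          * Phi((x - mu - sigma^2/alpha) / sigma).
    # sigma and alpha are optimized on the log scale to keep them positive,
    # and Phi enters through logcdf to avoid underflow in the tails.
    mu, sigma, alpha = theta[0], np.exp(theta[1]), np.exp(theta[2])
    z = (x - mu - sigma**2 / alpha) / sigma
    ll = (-np.log(alpha) + (mu - x) / alpha
          + sigma**2 / (2 * alpha**2) + norm.logcdf(z))
    return -ll.sum()

# Crude moment-based starting values (a stand-in for the paper's
# saddle-point estimates): E[X] = mu + alpha, Var[X] = sigma^2 + alpha^2.
alpha0 = x.std()
mu0 = x.mean() - alpha0
sigma0 = max(x.std() / 10.0, 1.0)
start = [mu0, np.log(sigma0), np.log(alpha0)]

fit = minimize(negloglik, start, args=(x,), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
mu_hat, sigma_hat, alpha_hat = fit.x[0], np.exp(fit.x[1]), np.exp(fit.x[2])
print(mu_hat, sigma_hat, alpha_hat)
```

This reproduces the σ = 20, α = 10³ scenario from the tables below; the recovered parameters should land close to the truth. The paper's point is that a high-quality optimizer seeded with good starting values makes exact MLE reliable even on unusual data sets.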
Year: 2008 PMID: 19068485 PMCID: PMC2648902 DOI: 10.1093/biostatistics/kxn042
Source DB: PubMed Journal: Biostatistics ISSN: 1465-4644 Impact factor: 5.899
Bias and standard deviation (shown in brackets) in estimating μ for the 4 estimation methods in 9 different scenarios. The true values of σ and α in each scenario are shown in the first 2 columns, and μ = 100 for all scenarios. All values are given to 2 significant figures
| σ | α | MLE | Saddle | RMA-75 | RMA |
| 5 | 10² | 0.0079 (0.22) | –0.25 (0.22) | 1.7 (1.6) | 12 (2.7) |
| 20 | 10² | 0.0024 (0.47) | 0.013 (0.50) | 5.4 (2.3) | 25 (2.6) |
| 100 | 10² | 0.013 (1.6) | 11.0 (1.5) | 4.8 (11) | 47 (9.0) |
| 5 | 10³ | –0.023 (0.67) | –0.37 (0.65) | 4.2 (7.5) | 44 (23) |
| 20 | 10³ | –0.025 (1.4) | –1.3 (1.4) | 6.6 (11) | 69 (24) |
| 100 | 10³ | –0.098 (3.1) | –3.4 (3.1) | 26.0 (18) | 170 (24) |
| 5 | 10⁴ | 0.022 (2.3) | –0.36 (2.2) | 32.0 (64) | 380 (230) |
| 20 | 10⁴ | 0.20 (4.2) | –1.3 (4.0) | 32.0 (66) | 390 (220) |
| 100 | 10⁴ | 0.069 (9.2) | –6.5 (9.0) | 41.0 (85) | 520 (240) |
Bias and standard deviation (shown in brackets) in estimating σ for the 4 estimation methods in 9 different scenarios. The true values of σ and α in each scenario are shown in the first 2 columns, and μ = 100 for all scenarios. All values are given to 2 significant figures. ∞a and ∞b indicate, respectively, where 32.4% and 0.3% of replicates yielded infinite estimates
| σ | α | MLE | Saddle | RMA-75 | RMA |
| 5 | 10² | 0.00059 (0.20) | –0.40 (0.19) | 1.5 (0.71) | 7.0 (1.9) |
| 20 | 10² | –0.0069 (0.40) | –0.46 (0.43) | 5.6 (1.0) | 15.0 (1.6) |
| 100 | 10² | 0.003 (1.0) | 7.3 (0.99) | 25.0 (5.0) | 45.0 (5.1) |
| 5 | 10³ | –0.067 (0.62) | –0.56 (0.56) | 3.2 (4.7) | 32.0 (19) |
| 20 | 10³ | –0.11 (1.2) | –1.9 (1.1) | 6.0 (5.3) | 44.0 (19) |
| 100 | 10³ | –0.00048 (2.8) | –5.9 (2.7) | 27.0 (7.8) | 100.0 (16) |
| 5 | 10⁴ | –0.72 (2.4) | –1.2 (2.1) | ∞a (∞a) | 310.0 (190) |
| 20 | 10⁴ | –0.40 (4.0) | –2.5 (3.6) | ∞b (∞b) | 300.0 (180) |
| 100 | 10⁴ | –0.52 (8.5) | –10.0 (7.8) | 36.0 (46) | 360.0 (190) |
Bias and standard deviation (shown in brackets) in estimating α for the 4 estimation methods in 9 different scenarios. The true values of σ and α in each scenario are shown in the first 2 columns, and μ = 100 for all scenarios. All values are given to 2 significant figures
| σ | α | MLE | Saddle | RMA-75 | RMA |
| 5 | 10² | –0.00013 (0.75) | 0.25 (0.75) | –1.1 (1.5) | –80 (0.31) |
| 20 | 10² | –0.013 (0.82) | –0.023 (0.84) | –2.5 (1.9) | –79 (0.40) |
| 100 | 10² | –0.046 (1.6) | –11.0 (1.5) | 27.0 (8.1) | –69 (4.4) |
| 5 | 10³ | 0.021 (6.8) | 0.37 (6.8) | –2.8 (10) | –800 (2.9) |
| 20 | 10³ | 0.11 (6.8) | 1.4 (6.8) | –4.6 (12) | –800 (2.9) |
| 100 | 10³ | –0.16 (7.5) | 3.2 (7.5) | –15.0 (16) | –790 (3.2) |
| 5 | 10⁴ | 0.50 (72) | 1.0 (72) | –28.0 (100) | –8000 (28) |
| 20 | 10⁴ | –3.2 (69) | –1.6 (69) | –29.0 (100) | –8000 (28) |
| 100 | 10⁴ | 3.1 (71) | 9.5 (71) | –23.0 (110) | –8000 (30) |
Fig. 1. Box plots of parameter estimates for the 3 best-performing methods. The true values of the parameters are indicated by dashed vertical lines. Estimates from RMA were so far from those of the other methods that they do not appear when plotted on this scale (see Tables 1–3).
Fig. 2. Left panel: smoothed log2-ratio of the true to the estimated signal versus the true signal. The black line shows this relationship when the true parameter values are used instead of estimates. The data used for this figure comprise 100 000 observations simulated with μ = 0, σ = 20, and α = 1000. Quantiles of the signal distribution are marked. The curves were smoothed using the lowess function in R (Cleveland, 1979). Right panel: smoothed σ̂² from the nonlinear fits versus intensity for the mixture experiment. The A-values have been standardized between methods and plotted from the 5th to the 95th percentiles. The quantiles of the A-values are marked.
Fig. 3. MA-plots obtained using different background correction methods for a self–self hybridization from the mixture experiment.
Fig. 4. Number of false discoveries from the mixture data set using moderated t-statistics from (a) limma and (b) SAM. Each curve is an average over the 5 mixtures.
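The background-corrected intensities behind these figures are posterior-mean signals under the fitted normexp model: E[S | X = x] = a + σ·φ(a/σ)/Φ(a/σ) with a = x − μ − σ²/α, which is always positive. Given parameter estimates, the correction and the small positive offset mentioned in the abstract can be sketched as follows (Python; the function name and the offset value of 50 are illustrative choices, not taken from the paper):

```python
import numpy as np
from scipy.stats import norm

def normexp_correct(x, mu, sigma, alpha, offset=50.0):
    """Posterior-mean signal E[S | X = x] under the normexp model,
    plus a small positive offset before taking log2.
    The offset value here is illustrative only."""
    a = x - mu - sigma**2 / alpha
    # E[S|X=x] = a + sigma * phi(a/sigma) / Phi(a/sigma); the Mills-ratio
    # term is computed in log space for numerical stability at low a.
    log_ratio = norm.logpdf(a / sigma) - norm.logcdf(a / sigma)
    signal = a + sigma * np.exp(log_ratio)
    return np.log2(signal + offset)

# Corrected log2-intensities for a few observed values, using the
# sigma = 20, alpha = 10^3 scenario from the tables (mu = 100).
x = np.array([60.0, 100.0, 150.0, 1000.0])
y = normexp_correct(x, mu=100.0, sigma=20.0, alpha=1000.0)
print(y)
```

Because the posterior-mean signal is strictly positive and increasing in x, the log2 transform is always defined, and the offset damps the variance of the log-ratios at low intensities, which is the mechanism behind the improved false-discovery curves in Fig. 4.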