A Posekany, K Felsenstein, P Sykacek.
Abstract
MOTIVATION: Although several recently proposed analysis packages for microarray data can cope with heavy-tailed noise, many applications rely on Gaussian assumptions. Gaussian noise models foster computational efficiency. This comes, however, at the expense of increased sensitivity to outlying observations. Assessing potential insufficiencies of Gaussian noise in microarray data analysis is thus important and of general interest.
Year: 2011 PMID: 21252077 PMCID: PMC3051324 DOI: 10.1093/bioinformatics/btr018
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. We represent the proposed model as a DAG, with rectangular nodes denoting observed quantities and circular nodes denoting random variables. Hyperparameters associated with priors are shown in brackets. Sheets indicate replication. With n denoting the sample index and g the gene index, we denote the measurements as y, the group indicators as x, the group-specific means as β, the differential expression indicators as I, their prior probability as p, the noise precision as τ, the precision of the coefficients prior as λ, the auxiliary variables of the Student's-t density as φ, the degrees of freedom as ν and the corresponding model indicator as J. All hyperparameters are discussed in Section 2.1.
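The role of the auxiliary variables φ can be illustrated with the standard scale-mixture representation of the Student's-t density. The block below is a sketch under assumptions: the shape-rate Gamma parameterization, the per-gene index on τ and the generic mean μ are not taken from the caption and may differ from the paper's Section 2.1.

```latex
y_{ng} \mid \phi_{ng} \sim \mathcal{N}\!\left(\mu_{ng},\, (\tau_g \phi_{ng})^{-1}\right),
\qquad
\phi_{ng} \sim \mathrm{Gamma}\!\left(\tfrac{\nu}{2},\, \tfrac{\nu}{2}\right)
\;\Longrightarrow\;
p(y_{ng}) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}
\sqrt{\frac{\tau_g}{\pi\nu}}
\left(1 + \frac{\tau_g\,(y_{ng}-\mu_{ng})^2}{\nu}\right)^{-\frac{\nu+1}{2}}
```

Conditional on φ the model is Gaussian, which is what makes this representation convenient for sampling; as ν grows, φ concentrates at 1 and the Gaussian noise model is recovered.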
Depending on the sample type, which is either 1 or 2, genes from subset i are drawn from distributions with means equal to μ1 and μ2, respectively
| Subset i | μ1 | μ2 | % |
|---|---|---|---|
| 1 | −12.0 | 12.0 | 20 |
| 2 | −5.0 | 5.0 | 10 |
| 3 | −1.0 | 1.0 | 30 |
| 4 | −0.5 | 0.5 | 20 |
| 5 | 0.0 | 0.0 | 20 |
The proportion of genes in subset i is shown in column %.
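A simulation along these lines could be sketched as follows. Only the subset means and proportions come from the table; the number of genes, the number of samples per type, the unit noise scale, the Student's-t noise with `nu` degrees of freedom and the function names are illustrative assumptions.

```python
import random

# (mean for sample type 1, mean for sample type 2, fraction of genes),
# copied from the table above.
SUBSETS = [
    (-12.0, 12.0, 0.20),
    (-5.0, 5.0, 0.10),
    (-1.0, 1.0, 0.30),
    (-0.5, 0.5, 0.20),
    (0.0, 0.0, 0.20),
]


def student_t(rng, nu):
    """Draw a standard Student's-t variate as Gaussian / sqrt(chi2_nu / nu)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-squared with nu d.o.f.
    return z / (chi2 / nu) ** 0.5


def simulate(n_genes=1000, n_per_type=5, nu=4.0, seed=0):
    """Return a list of (subset index, type-1 samples, type-2 samples).

    n_genes, n_per_type and nu are assumed values, not from the source.
    """
    rng = random.Random(seed)
    data = []
    for i, (mu1, mu2, frac) in enumerate(SUBSETS, start=1):
        for _ in range(round(frac * n_genes)):
            type1 = [mu1 + student_t(rng, nu) for _ in range(n_per_type)]
            type2 = [mu2 + student_t(rng, nu) for _ in range(n_per_type)]
            data.append((i, type1, type2))
    return data
```

Genes in subset 5 have identical means in both sample types and therefore act as the non-differentially expressed reference group.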
Overview of the biological datasets describing the organism (Org.), the GEO ID (CAMDA 08 refers to the Endothelial Apoptosis contest datasets of the meeting), the preprocessing method (Prep.), the overall number of arrays (N), the average degrees of freedom, the number of common genes (Comm. genes), the number of genes whose differential expression assessment depends on the noise model (Diff. genes), the number of common GO terms (Comm. GO terms) and finally the number of noise-model-dependent GO terms (Diff. GO terms)
| Org. | GEO ID | Reference | Prep. | N | Avg. d.f. | Comm. genes | Diff. genes | Comm. GO terms | Diff. GO terms |
|---|---|---|---|---|---|---|---|---|---|
|  | GDS3216 |  | MAS5.0 | 12 | 4.71 | 1176 | 150 | 111 | 78 |
|  | GDS3225 |  | MAS5.0 | 4 | 5.50 | 832 | 290 | 161 | 21 |
|  | GDS1404 |  | PathStat | 10 | 13.58 | 1776 | 136 | 11 | 14 |
|  | GDS1686 (I) |  | RMA | 9 | 3.62 | 136 | 174 | 11 | 96 |
|  | CAMDA 08 |  | CLSS4.1 | 24 | 4.04 | 400 | 304 | 26 | 67 |
|  | GDS1375 |  | MAS5.0 | 70 | 3.25 | 6861 | 3561 | 160 | 316 |
|  | GDS810 |  | MAS5.0 | 31 | 4.37 | 72 | 135 | 9 | 51 |
|  | GDS2960 |  | RGP3.0 | 101 | 4.33 | 318 | 166 | 51 | 2 |
|  | GDS660 |  | MAS5.0 | 22 | 10.48 | 584 | 126 | 20 | 26 |
|  | GDS3221 | Somel | RMA | 24 | 4.21 | 180 | 119 | 108 | 52 |
|  | GDS3162 |  | MAS5.0 | 10 | 4.38 | 797 | 446 | 112 | 66 |
|  | GDS1555 |  | MAS5.0 | 8 | 3.90 | 131 | 183 | 24 | 110 |
|  | GDS2946 |  | MAS5.0 | 15 | 4.57 | 146 | 157 | 14 | 306 |
|  | GDS972 |  | MAS5.0 | 44 | 4.98 | 369 | 163 | 94 | 71 |
|  | golden-spike |  | MAS5.0 | 6 | 3.74 | 401 | 1748 | − | − |
The GEO entry GDS1686 (I) refers to the behavioural subset of the data (only the sleep-deprived flies). In column Prep., we use MAS5.0 to refer to the Affymetrix MAS 5.0 quantization method, RMA to refer to the ‘Robust Multi-array Average’ method by Irizarry (both used for Affymetrix arrays), PathStat to refer to the package described in Middleton, CLSS4.1 to refer to the Codelink Software Suite 4.1 and RGP3.0 to refer to Research Genetics' Pathway software v. 3.0.
Fig. 2. The hyperparameters c and d in the prior p(λ|c, d) have to be chosen carefully to avoid side effects. The graphs show the ordered posterior probabilities of differential expression P(I|X, Y, a, b, c, d, e, h, K) with the legend denoting the corresponding c, d pair for a Gaussian and Student's-t noise model. Choices larger than around 100 increase the influence of these hyperparameters on the posterior of interest. This motivates our choice of an improper prior (c = 0, d = 0).
An assessment of robustness levels as a function of normalization and preprocessing, showing the expected degrees of freedom parameters for various datasets
| GEO ID | Loess | Quantile | mmgMOS | PPLR |
|---|---|---|---|---|
| GDS3216 | 2.02 | 1.13 | 2.23 | 1.17 |
| GDS810 | 1.13 | 1.18 | 3.23 | 1.14 |
| GDS3225 | 1.24 | 1.29 | – | – |
| CAMDA 08 | 1.06 | 1.11 | – | – |
| GDS1375 | 1.14 | 1.15 | – | – |
| GDS2960 | 2.94 | 2.85 | – | – |
| GDS1555 | 1.15 | 1.17 | – | – |
| GDS972 | 1.38 | 1.40 | 3.67 | 1.15 |
Dashes indicate unavailable results: mmgMOS and PPLR require Affymetrix CEL files, which were not available for these datasets. The results confirm that neither normalization nor sophisticated preprocessing removes the need for heavy-tailed noise models.
Fig. 3. Noise model dependencies of posterior probabilities. Subplot (A) illustrates the Arabidopsis data (GDS3216) ranked by the posterior probabilities of differential expression obtained with the most probable Student's-t distribution (probabilities shown as black line). The probabilities obtained for the same genes by a Gaussian-based analysis are shown as dots. Subplot (B) illustrates the Human melanoma data (GDS1375) ranked by the posterior probabilities of differential expression obtained by a Gaussian-based analysis (probabilities shown as black line). The probabilities obtained for the same genes from an optimally adjusted robust analysis are shown as dots.
To compare non-parametric robust methods with robust parametric methods, we report the percentage agreement on differentially expressed genes
| GEO ID | KW perm. 𝒯 (%) | KW perm. 𝒩 (%) | RANOVA 𝒯 (%) | RANOVA 𝒩 (%) |
|---|---|---|---|---|
| GDS3216 | 39 | 37 | – | – |
| GDS1375 | 86 | 84 | 86 | 83 |
| GDS2960 | 76 | 71 | 76 | 72 |
The columns under KW perm. illustrate the agreements of the Kruskal–Wallis permutation test with the robust parametric method (column ‘𝒯’) and with a Gaussian-based analysis (column ‘𝒩’). The two columns under RANOVA show the same information for the robust ANOVA method. Dashes indicate that the non-parametric method did not find differentially expressed genes. These results allow the conclusion that non-parametric methods are viable for analysing microarray data robustly, as long as we have sufficiently many samples.
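To make the comparison concrete, the following is a minimal sketch of a Kruskal–Wallis permutation test for a single gene, together with one plausible definition of percentage agreement (the share of genes on which two methods make the same call). The function names, the number of permutations and the agreement definition are assumptions for illustration; the source does not specify how the agreement figures were computed.

```python
import random


def midranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kw_statistic(values, labels):
    """Kruskal-Wallis H statistic (without tie correction)."""
    n = len(values)
    sums, counts = {}, {}
    for r, g in zip(midranks(values), labels):
        sums[g] = sums.get(g, 0.0) + r
        counts[g] = counts.get(g, 0) + 1
    h = sum(sums[g] ** 2 / counts[g] for g in sums)
    return 12.0 / (n * (n + 1)) * h - 3.0 * (n + 1)


def kw_permutation_pvalue(values, labels, n_perm=2000, seed=0):
    """Permutation p-value: share of label shuffles with H >= observed H."""
    rng = random.Random(seed)
    observed = kw_statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if kw_statistic(values, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def percent_agreement(calls_a, calls_b):
    """Percentage of genes with identical calls (assumed definition)."""
    same = sum(a == b for a, b in zip(calls_a, calls_b))
    return 100.0 * same / len(calls_a)
```

With only a handful of samples per group, the number of distinct label permutations is small, which limits the smallest attainable p-value. This is consistent with the observation that non-parametric methods need sufficiently many samples.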