Keith A Baggerly, Li Deng, Jeffrey S Morris, C Marcelo Aldaz.
Abstract
BACKGROUND: Two major identifiable sources of variation in data derived from the Serial Analysis of Gene Expression (SAGE) are within-library sampling variability and between-library heterogeneity within a group. Most published methods for identifying differential expression focus on just the sampling variability. In recent work, the problem of assessing differential expression between two groups of SAGE libraries has been addressed by introducing a beta-binomial hierarchical model that explicitly deals with both of the above sources of variation. This model leads to a test statistic analogous to a weighted two-sample t-test. When the number of groups involved is more than two, however, a more general approach is needed.
Year: 2004 PMID: 15469612 PMCID: PMC524524 DOI: 10.1186/1471-2105-5-144
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Methods of summarizing data. The effects of pooling and reduction to proportions on a single tag measured across four libraries. Pooling reduces the data to the summed counts at the right, and focusing on proportions reduces the data to the proportions on the bottom. In both cases, information is lost.
| Summed Counts | |||||
| Tag Count | |||||
| Library Size | |||||
| Proportions |
Tag counts from sample SAGE libraries. Counts and proportions of tags ATTTGAGAAG, TGCTGCCTGT and GCGAAACCCT in 8 colon libraries from Zhang et al. [2]; two normal colon (NC), two primary tumors (TU) and four cell lines.
| Group | Normal Colon | Normal Colon | Primary Tumor | Primary Tumor | Cell Lines | Cell Lines | Cell Lines | Cell Lines |
| Library | NC1 | NC2 | TU98 | TU102 | CACO2 | HCT116 | RKO | SW837 |
| ATTTGAGAAG | 320 | 600 | 312 | 549 | 246 | 65 | 41 | 52 |
| TGCTGCCTGT | 0 | 1 | 1 | 15 | 9 | 1 | 12 | 27 |
| GCGAAACCCT | 167 | 566 | 64 | 98 | 33 | 47 | 40 | 27 |
| Library Size | 49610 | 48479 | 41371 | 55700 | 60682 | 55641 | 51294 | 61148 |
| Proportion of ATTTGAGAAG (%) | 0.65 | 1.24 | 0.75 | 0.99 | 0.41 | 0.12 | 0.08 | 0.09 |
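The two reductions described earlier, pooling and reduction to proportions, can be illustrated on the ATTTGAGAAG row. The following is a minimal Python sketch, not the authors' code; the variable and function names are illustrative assumptions, and the counts and library sizes are copied from the table above.

```python
# Counts of tag ATTTGAGAAG and library sizes, copied from the table above.
libraries = ["NC1", "NC2", "TU98", "TU102", "CACO2", "HCT116", "RKO", "SW837"]
counts = [320, 600, 312, 549, 246, 65, 41, 52]
sizes = [49610, 48479, 41371, 55700, 60682, 55641, 51294, 61148]

# Reduction to proportions: one percentage per library (library size is dropped).
proportions_pct = [100.0 * y / n for y, n in zip(counts, sizes)]


def pooled_pct(indices):
    """Pooling: one summed count over one summed library size (hypothetical helper)."""
    total_count = sum(counts[i] for i in indices)
    total_size = sum(sizes[i] for i in indices)
    return 100.0 * total_count / total_size


# Pool the two normal colon libraries (NC1 and NC2).
normal_pooled_pct = pooled_pct([0, 1])

for name, pct in zip(libraries, proportions_pct):
    print(f"{name}: {pct:.2f}%")
print(f"pooled normal colon: {normal_pooled_pct:.2f}%")
```

The pooled normal colon value is a single number lying between the NC1 and NC2 proportions, so the spread between the two libraries is no longer visible once the counts are summed.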
Logistic regression models for two groups. Logistic regression fits contrasting normal colon with cancer samples for tag ATTTGAGAAG from Table 2. The first model makes no allowance for overdispersion, and the latter two introduce it in different ways. The introduction of overdispersion is important as it dramatically affects the results, but the choice of overdispersion method is less crucial.
| Model 1: | No overdispersion | |||
| Coefficients | Estimate | (s.e.) | z-value | p-value |
| β0 | -4.660 | 0.033 | -140.68 | < 2 |
| β1 | -0.888 | 0.043 | -20.41 | < 2 |
| Model 2: | Quasilikelihood | |||
| Coefficients | Estimate | (s.e.) | t-value | p-value |
| β0 | -4.660 | 0.454 | -10.261 | 5 |
| β1 | -0.888 | 0.595 | -1.489 | 0.187 |
| Model 3: | Hierarchical | |||
| Coefficients | Estimate | (s.e.) | t-value | p-value |
| β0 | -4.656 | 0.428 | -10.874 | 3.6 |
| β1 | -0.850 | 0.570 | -1.492 | 0.186 |
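The effect of allowing for overdispersion can be checked by hand for the first two models. With a single group indicator, the logistic regression estimates have a closed form (the logit of each pooled group proportion), and a quasi-likelihood fit rescales the standard errors by the square root of a dispersion estimate. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the cancer group consists of all six tumor and cell-line libraries, and it estimates the dispersion as the Pearson chi-square statistic divided by the residual degrees of freedom. All variable names are hypothetical.

```python
import math

# Tag ATTTGAGAAG counts and library sizes from Table 2.
counts = [320, 600, 312, 549, 246, 65, 41, 52]
sizes = [49610, 48479, 41371, 55700, 60682, 55641, 51294, 61148]
# Assumption: group 0 = normal colon (NC1, NC2); group 1 = the other six libraries.
group = [0, 0, 1, 1, 1, 1, 1, 1]


def logit(p):
    return math.log(p / (1.0 - p))


# Pooled tag counts, library sizes, and proportions per group.
y = [sum(c for c, g in zip(counts, group) if g == k) for k in (0, 1)]
n = [sum(s for s, g in zip(sizes, group) if g == k) for k in (0, 1)]
p = [y[k] / n[k] for k in (0, 1)]

# Closed-form maximum likelihood estimates for a two-group logistic model.
beta0 = logit(p[0])
beta1 = logit(p[1]) - beta0
se0 = math.sqrt(1 / y[0] + 1 / (n[0] - y[0]))
se1 = math.sqrt(se0 ** 2 + 1 / y[1] + 1 / (n[1] - y[1]))

# Quasi-likelihood dispersion: Pearson chi-square over residual d.f. (8 libraries - 2).
pearson = sum(
    (c - s * p[g]) ** 2 / (s * p[g] * (1 - p[g]))
    for c, s, g in zip(counts, sizes, group)
)
dispersion = pearson / (len(counts) - 2)
se0_quasi = se0 * math.sqrt(dispersion)
se1_quasi = se1 * math.sqrt(dispersion)

print(f"no overdispersion: beta1 = {beta1:.3f}, s.e. = {se1:.3f}, z = {beta1 / se1:.2f}")
print(f"quasi-likelihood:  beta1 = {beta1:.3f}, s.e. = {se1_quasi:.3f}, t = {beta1 / se1_quasi:.3f}")
```

Under these assumptions the estimates are unchanged between the two fits; only the standard errors grow, which is why the test statistic for the group difference shrinks so sharply once overdispersion is admitted.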
Expanding contrasts from two to three groups. Logistic regression models testing the significance of a difference between normal colon and primary tumor (β1) for tag TGCTGCCTGT from Table 2. In the first model, only data from the four libraries directly involved are used. In the second model, data from the four cell line libraries are also included, providing a more stable estimate of the overdispersion parameter φ.
| Model 1: | Two Groups | |||
| Coefficients | Estimate | (s.e.) | t-value | p-value |
| β0 | -11.484 | 2.309 | -4.973 | 0.038 |
| β1 | 2.681 | 2.388 | 1.123 | 0.378 |
| Model 2: | Three Groups | |||
| Coefficients | Estimate | (s.e.) | t-value | p-value |
| β0 | -11.484 | 2.574 | -4.462 | 0.007 |
| β1 | 2.676 | 2.661 | 1.005 | 0.361 |
| β2 | 3.020 | 2.604 | 1.159 | 0.299 |
Analysis of deviance. Deviance table for various submodels fit to the data for tag TGCTGCCTGT given in Table 2. All of these models use the value for overdispersion found for the most extensive model, φ = 1.160e-04.
| Terms Fitted | Deviance | d.f. |
| β0 | 9.7433 | 7 |
| β0, β1 | 9.7418 | 6 |
| β0, β2 | 7.9826 | 6 |
| β0, β1, β2 | 5.7866 | 5 |
Incorporating covariates into the model. Models fitted to the counts for tag GCGAAACCCT from Table 2, with the cell lines hypothetically allocated to normal tissue B (libraries 5 and 6) and cancer tissue B (libraries 7 and 8). This division illustrates how the effects of two differences, normal vs cancer and tissue A vs tissue B (β1 and β2 respectively), can be partitioned according to their importance. In Model 2, we further introduce a continuous covariate (β3), corresponding to the levels of a biomarker, to show how such a covariate can be incorporated as well.
| Model 1: | Hypothetical Cov. | df = 5 | ||
| Coefficients | Estimate | (s.e.) | t-value | p-value |
| β0 | -4.928 | 0.291 | -16.921 | 1.318e-05 |
| β1 | -1.293 | 0.593 | -2.181 | 0.0810 |
| β2 | -1.956 | 0.738 | -2.650 | 0.0454 |
| Model 2: | Hypothetical Biom. | df = 4 | ||
| Coefficients | Estimate | (s.e.) | t-value | p-value |
| β0 | -4.167 | 0.608 | -6.851 | 1.012e-03 |
| β1 | -1.423 | 0.611 | -2.328 | 0.0674 |
| β2 | -2.031 | 0.752 | -2.700 | 0.0428 |
| β3 | -1.365 | 1.028 | -1.328 | 0.2417 |
Fitting nested deviance models. Fitting nested models to the data in order to get deviance scores. The difference in deviance between models is a better indicator of the significance of the associated effect (β1) when the logistic regression fits are near the boundary of the space, giving proportions close to zero.
| Model 1: | Full Model | Deviance = 5.0742 | ||
| Coefficients | Estimate | (s.e.) | t-value | p-value |
| β0 | -11.494 | 13.518 | -0.06 | 0.9519 |
| β1 | 5.987 | 13.524 | -0.03 | 0.9750 |
| Model 2: | Null Model | Deviance = 8.7541 |||
| Coefficients | Estimate | (s.e.) | t-value | p-value |
| β0 | -5.794 | 0.392 | -14.772 | 6.05e-06 |
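The comparison of nested models reduces to a subtraction of the two reported deviances. The sketch below is a minimal illustration in Python, not the authors' procedure: referring the drop in deviance to a chi-square distribution with one degree of freedom is an assumption made here for concreteness, and the source may use a different reference distribution. Variable names are hypothetical.

```python
import math

# Deviances copied from the table above.
deviance_full = 5.0742  # model containing the effect of interest (beta1)
deviance_null = 8.7541  # model without it

# Drop in deviance attributable to the one extra parameter.
deviance_drop = deviance_null - deviance_full

# Assumption: compare the drop with a chi-square distribution on 1 d.f.
# For 1 d.f. the upper tail probability is P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(deviance_drop / 2.0))

print(f"drop in deviance = {deviance_drop:.4f}")
print(f"approximate p-value (chi-square, 1 d.f.) = {p_value:.3f}")
```

Under this assumption the deviance comparison gives far stronger evidence for the effect than the coefficient-wise p-values in the full model, which is the behaviour the caption describes for fits near the boundary.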