Jun Lu, John K Tomfohr, Thomas B Kepler.
Abstract
BACKGROUND: In testing for differential gene expression involving multiple serial analysis of gene expression (SAGE) libraries, it is critical to account for both between- and within-library variation. Several methods have been proposed, including the t test, the t_w test, and an overdispersed logistic regression approach. The merits of these tests, however, have not been fully evaluated, and it remains an open question whether further improvements can be made.
Year: 2005 PMID: 15987513 PMCID: PMC1189357 DOI: 10.1186/1471-2105-6-165
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Comparisons of t- and deviance tests in overdispersed logistic regression and log-linear models and a test based on a Bayesian model
| | Group 1 a | | logistic regression | | log-linear model | | Bayesian model |
| | library 1 | library 2 | t-test c | deviance test | t-test c | deviance test | E d |
| 1b | 0 | 0 | 0.645 | 0.115 | 0.003 | 0.001 | 0.01 |
| 2 | 2 | 2 | 0.485 | 0.122 | 0.002 | 0.002 | 0.02 |
| 3 | 5 | 5 | 0.383 | 0.133 | 0.003 | 0.005 | 0.04 |
| 4 | 10 | 10 | 0.324 | 0.149 | 0.007 | 0.01 | 0.05 |
| 5 | 20 | 20 | 0.291 | 0.183 | 0.02 | 0.025 | 0.07 |
| 6 | 50 | 50 | 0.324 | 0.29 | 0.104 | 0.117 | 0.11 |
| 7 | 100 | 100 | 0.494 | 0.508 | 0.376 | 0.404 | 0.12 |
a Tag counts in group 1 are artificially increased towards the levels observed in group 2 (which are held fixed). Tag counts in group 2 are 312, 549, 246, 65, 41, and 52. The library sizes and tag counts in group 2 are taken from Baggerly et al. [15].
b The empirical tag counts 0.506 and 0.494 are used to replace the zero counts in group 1 [15].
c The t-test here tests the hypothesis that β = 0.
d E, the Bayes error rate, is listed [26].
A list of parameter values used in the simulations
| Distribution | binomial (i.e. no overdispersion); beta-binomial; negative-binomial |
| overdispersion parameter (φ) | 8e-06, 2e-05, 4.3e-05 for beta-binomial; 0.17, 0.42, 0.95 for negative binomial |
| number of samples in groups A and B | 5 in each group |
| mean proportion in group A (p_A) | 1, 5, 10, 20, 50, and 100 out of 50,000 |
| ratio of mean proportions (p_B/p_A) | 1, 2 and 4 |
Note: the library sizes are 66148, 67094, 53338, 80124, 64984, 70452, 74052, 60086, 52966 and 45377, each of which was determined by a draw from a uniform distribution over the interval from 30,000 to 90,000.
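The beta-binomial setting in the table above can be turned into a small simulation. The following is a minimal sketch, not the authors' code. It assumes that φ acts as the intraclass correlation of the beta distribution, so that α + β = 1/φ − 1, and that the first five library sizes in the note form group A. Neither assumption is stated in the text, and the function names are hypothetical.

```python
import random

# Library sizes from the note to the parameter table.
LIBRARY_SIZES = [66148, 67094, 53338, 80124, 64984,
                 70452, 74052, 60086, 52966, 45377]


def beta_binomial_count(n, p, phi, rng):
    """Draw one tag count for a library of size n.

    Assumption: phi is the intraclass correlation, so alpha + beta = 1/phi - 1.
    phi = 0 falls back to the plain binomial (no overdispersion).
    """
    if phi > 0:
        total = 1.0 / phi - 1.0
        p_lib = rng.betavariate(p * total, (1.0 - p) * total)
    else:
        p_lib = p
    # Binomial draw as a sum of Bernoulli trials (slow but dependency-free).
    return sum(rng.random() < p_lib for _ in range(n))


def simulate_tag(p_a, ratio, phi, rng):
    """Simulate one tag: five libraries per group, with p_B = ratio * p_A."""
    # Assumption: the first five library sizes belong to group A.
    group_a = [beta_binomial_count(n, p_a, phi, rng) for n in LIBRARY_SIZES[:5]]
    group_b = [beta_binomial_count(n, p_a * ratio, phi, rng) for n in LIBRARY_SIZES[5:]]
    return group_a, group_b


rng = random.Random(0)
counts_a, counts_b = simulate_tag(10 / 50000, 2, 2e-05, rng)
print(counts_a, counts_b)
```

Repeating `simulate_tag` over many tags, half with ratio 1 and half with ratio 2 or 4, would give a dataset of the kind the table describes.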
Figure 1. Comparisons based on simulated data from the beta-binomial distribution. This figure shows the receiver operating characteristic (ROC) curves of the four tests applied to datasets generated from the beta-binomial distribution with various magnitudes of overdispersion (φ) (shown on the top of each graph). For a specific φ, 10,000 observations (tags) are simulated; 5,000 are generated under the assumption that p_B = p_A and the remaining 5,000 from p_B = 2 p_A, where p_A and p_B are the mean proportions of the two groups and p_A = 0.0002 (i.e. 10 out of 50,000). For figures generated under other conditions, see Additional file 1.
Figure 2. Comparisons based on simulated data from the negative binomial distribution. The ROC curves of the four tests are based on datasets generated from the negative binomial distribution with various magnitudes of overdispersion (φ). The data are simulated by the same strategy as used in Figure 1, except that p_B = 4 p_A. Note that the overdispersion parameter here is not directly comparable with that in Figure 1, because the parameter φ for the negative binomial is not directly related to that for the beta-binomial. For figures generated under other conditions, see Additional file 2.
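The ROC curves in Figures 1 and 2 compare tests on simulated tags whose true status is known. As an illustration only, the sketch below shows one common way to trace such a curve from p-values. The function names and the toy inputs are hypothetical, and tied p-values are processed one at a time rather than grouped.

```python
def roc_points(p_values, is_differential):
    """Return (false positive rate, true positive rate) pairs, one per cutoff.

    Tags are called in order of increasing p-value; each call moves the
    curve up (truly differential tag) or right (null tag).
    """
    n_pos = sum(is_differential)
    n_neg = len(is_differential) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, truth in sorted(zip(p_values, is_differential)):
        if truth:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy input: both truly differential tags have the smallest p-values.
p_values = [0.01, 0.20, 0.03, 0.50]
truth = [True, False, True, False]
curve = roc_points(p_values, truth)
print(area_under_curve(curve))  # 1.0
```

A test whose curve lies closer to the top-left corner separates the differential tags from the null tags better at every false positive rate.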
Library information on 5 cancer and 2 normal pancreas SAGE libraries
| | Cancer cell lines | | | | | Normal cells | |
| Library | ASPC | PL45 | CAPAN1 | CAPAN2 | Panc-1 | HX | H126 |
| Library size | 31,224 | 29,557 | 37,674 | 23,042 | 24,749 | 31,985 | 32,223 |
| Unique tags | 10,622 | 11,121 | 14,815 | 10,157 | 10,293 | 12,392 | 12,360 |
Pair-wise comparisons of the four tests
| | t | t_w | logit-t |
| t_w | 39(12) a | - | |
| logit-t | 42(17) | 66(29) | - |
| log-t | 36(16) | 63(25) | 82(43) |
a Number of genes shared among the lists of the top 100 and top 50 (in parentheses) genes identified by the two tests. For the t and t_w tests, the genes were ranked by the absolute t or t_w statistic rather than by p-values.
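The shared-gene counts in the table above come from intersecting ranked lists. A minimal sketch of that bookkeeping follows. The tag names, scores, and function names are made up for illustration, and ranking by p-value for one test and by absolute statistic for the other mirrors the footnote.

```python
def top_by_p_value(p_values, k):
    """Top k tags ranked by increasing p-value."""
    return set(sorted(p_values, key=p_values.get)[:k])


def top_by_abs_statistic(statistics, k):
    """Top k tags ranked by decreasing absolute test statistic."""
    ranked = sorted(statistics, key=lambda tag: abs(statistics[tag]), reverse=True)
    return set(ranked[:k])


# Hypothetical scores for four tags from two tests.
log_t_p = {"TAG_A": 0.001, "TAG_B": 0.004, "TAG_C": 0.020, "TAG_D": 0.300}
t_stat = {"TAG_A": -5.1, "TAG_B": 0.4, "TAG_C": 3.2, "TAG_D": 1.0}

shared = top_by_p_value(log_t_p, 2) & top_by_abs_statistic(t_stat, 2)
print(len(shared))  # 1
```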
Figure 3. Comparing the logit-t and log-t tests. Of the top 100 tags (ranked according to p-values) identified by the logit-t test and by the log-t test, 82 are common to both, leaving 18 tags from each test that are not within the top 100 identified by the other. The p-values from both tests for these 36 remaining tags are plotted here. The circles represent the 18 tags in the top 100 by the logit-t test and the triangles those from the log-t test. While all the tags identified by the logit-t test also have reasonably low p-values according to the log-t test, the tags identified by the log-t test show a much wider range of p-values according to the logit-t test.
A set of genes identified as significantly differentially expressed (p < 0.05 and also among the list of top 100 genes) according to the log-t test but not by the logit-t test (p > 0.05)
| | | | Normal | | Cancer | | | | |
| Tag | p (log-t) | p (logit-t) | HX | H126 | ASPC | PL45 | CAPAN1 | CAPAN2 | Panc-1 |
| AGCAGATCAG* | 0.003 | 0.088 | 16 | 9 | 272 | 152 | 138 | 135 | 384 |
| TTGGTGAAGG | 0.003 | 0.069 | 6 | 0 | 90 | 267 | 194 | 187 | 238 |
| CCCATCGTCC | 0.003 | 0.309 | 13 | 34 | 2047 | 1333 | 364 | 456 | 408 |
| CCTCCAGCTA | 0.006 | 0.465 | 3 | 16 | 452 | 1766 | 292 | 265 | 364 |
| ACTTTTTCAA | 0.008 | 0.096 | 25 | 43 | 413 | 379 | 226 | 200 | 65 |
| CAAACCATCC* | 0.01 | 0.463 | 9 | 9 | 439 | 1235 | 154 | 143 | 133 |
| TGCCCTCAGG | 0.011 | 0.219 | 16 | 6 | 80 | 196 | 276 | 339 | 4 |
| GCTGTTGCGC* | 0.011 | 0.151 | 3 | 3 | 35 | 30 | 82 | 126 | 133 |
| GACATCAAGT* | 0.013 | 0.554 | 0 | 0 | 183 | 548 | 85 | 126 | 20 |
| TTCACTGTGA | 0.014 | 0.149 | 0 | 3 | 128 | 105 | 77 | 91 | 16 |
| TTGGGGTTTC | 0.015 | 0.142 | 69 | 37 | 701 | 507 | 173 | 195 | 230 |
| TGCCCTCAAA | 0.016 | 0.246 | 3 | 6 | 32 | 112 | 135 | 178 | 0 |
| GGGGAAATCG | 0.017 | 0.066 | 100 | 71 | 339 | 423 | 119 | 291 | 226 |
Note: Tag counts have been converted to number of tags per 100,000 for comparison purposes. This scaling is not used in any statistical tests. Tags with (*) are those also identified by Ryu et al. [12].
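The per-100,000 scaling mentioned in the note is a simple rescaling by library size. As a sketch, using the library sizes listed in the library information table and a hypothetical raw count of 5 tags (the raw counts themselves are not given here):

```python
# Library sizes as listed in the library information table.
LIBRARY_SIZES = {
    "ASPC": 31224, "PL45": 29557, "CAPAN1": 37674, "CAPAN2": 23042,
    "Panc-1": 24749, "HX": 31985, "H126": 32223,
}


def per_100k(raw_count, library):
    """Scale a raw tag count to tags per 100,000 for the given library."""
    return raw_count / LIBRARY_SIZES[library] * 100_000


# A hypothetical raw count of 5 tags in HX is about 15.6 per 100,000.
print(round(per_100k(5, "HX")))  # 16
```

Because each library has roughly 30,000 tags, a single raw tag corresponds to about 3 per 100,000, which is why the scaled counts in the table move in steps of about 3.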
A list of top 40 genes differentially expressed between pancreatic cancer and normal ductal epithelium
| Tag | Description | p | HX | H126 | ASPC | PL45 | CAPAN1 | CAPAN2 | Panc-1 |
| Up-regulated in pancreatic cancer | |||||||||
| CTTCCAGCTA | annexin A2 | 0.0011 | 19 | 25 | 128 | 217 | 143 | 148 | 170 |
| AAAAAAAAAA | - | 0.0018 | 6 | 3 | 128 | 210 | 180 | 165 | 133 |
| AGCAGATCAG | S100 calcium binding protein A10 (annexin II ligand, calpactin I, light polypeptide (p11)) | 0.0027 | 16 | 9 | 272 | 152 | 138 | 135 | 384 |
| TTGGTGAAGG | thymosin, beta 4, X-linked | 0.003 | 6 | 0 | 90 | 267 | 194 | 187 | 238 |
| CCCATCGTCC | mitochondrial gene | 0.0032 | 13 | 34 | 2047 | 1333 | 364 | 456 | 408 |
| CCTCCAGCTA | keratin 8 | 0.0059 | 3 | 16 | 452 | 1766 | 292 | 265 | 364 |
| GGAAAAAAAA | ATP synthase, H+ transporting, mitochondrial F1 complex, epsilon subunit | 0.0063 | 3 | 6 | 64 | 61 | 74 | 74 | 57 |
| CCCCAGTTGC | calpain, small subunit 1 | 0.0066 | 22 | 22 | 64 | 88 | 77 | 61 | 113 |
| AACTAAAAAA | ribosomal protein S27a | 0.0078 | 19 | 16 | 45 | 85 | 80 | 61 | 61 |
| TTCAATAAAA | RPLP1, Ribosomal protein, large, P1 | 0.0079 | 9 | 25 | 147 | 179 | 135 | 104 | 40 |
| GCAAAAAAAA | chromosome 21 open reading frame 97 | 0.0079 | 6 | 3 | 58 | 68 | 40 | 65 | 65 |
| ACTTTTTCAA | mitochondrial gene | 0.0081 | 25 | 43 | 413 | 379 | 226 | 200 | 65 |
| CAAACCATCC | KRT18, Keratin 18 | 0.0095 | 9 | 9 | 439 | 1235 | 154 | 143 | 133 |
| GTGTGGGGGG | Junction plakoglobin | 0.0096 | 6 | 3 | 29 | 64 | 50 | 56 | 61 |
| TGCCCTCAGG | LCN2, Lipocalin 2 (oncogene 24p3) | 0.0106 | 16 | 6 | 80 | 196 | 276 | 339 | 4 |
| GCTGTTGCGC | - | 0.0108 | 3 | 3 | 35 | 30 | 82 | 126 | 133 |
| AAGAAGATAG | ribosomal protein L23a | 0.0116 | 16 | 9 | 77 | 108 | 85 | 65 | 24 |
| GAAAAAAAAA | SMAD, mothers against DPP homolog 3 (Drosophila) | 0.0118 | 6 | 0 | 74 | 47 | 40 | 56 | 44 |
| ACCTGTATCC | IFITM3, interferon induced transmembrane protein 3 (1-8U) | 0.0123 | 13 | 3 | 26 | 81 | 64 | 82 | 53 |
| CAACTTAGTT | myosin regulatory light chain MRLC2 | 0.0128 | 6 | 6 | 51 | 61 | 53 | 48 | 16 |
| Down-regulated in pancreatic cancer | |||||||||
| GACGACACGA | ribosomal protein S28 | 0.0001 | 428 | 388 | 109 | 122 | 90 | 117 | 154 |
| GGACCACTGA | ribosomal protein L3 | 0.0002 | 310 | 270 | 102 | 105 | 101 | 104 | 61 |
| GATCTCTTGG | S100 calcium binding protein A2 | 0.0002 | 188 | 174 | 3 | 10 | 8 | 4 | 0 |
| AGCAGGAGCA | S100 calcium binding protein A16 | 0.0005 | 144 | 152 | 26 | 41 | 45 | 26 | 16 |
| AGCTGTCCCC | capping protein (actin filament) muscle Z-line, beta | 0.0005 | 219 | 254 | 13 | 3 | 3 | 4 | 0 |
| GACTGCGCGT | tumor necrosis factor receptor superfamily, member 12A | 0.0007 | 103 | 93 | 10 | 10 | 24 | 22 | 16 |
| GTGGTGTGTG | congenital dyserythropoietic anemia, type I | 0.0011 | 59 | 87 | 10 | 10 | 8 | 13 | 8 |
| TAGGCATTCA | - | 0.0012 | 119 | 115 | 0 | 0 | 0 | 0 | 0 |
| TGAGTGGTCA | microtubule-associated protein 1 light chain 3 beta | 0.0017 | 66 | 53 | 0 | 7 | 5 | 13 | 8 |
| GGCGGCTGCA | excision repair cross-complementing rodent repair deficiency, group 1 | 0.0017 | 66 | 53 | 6 | 7 | 3 | 4 | 0 |
| AAGTTTGCCT | glutaredoxin (thioltransferase) | 0.0022 | 66 | 62 | 0 | 3 | 3 | 0 | 4 |
| AGCTCTCCCT | Ribosomal protein L17 | 0.0023 | 335 | 357 | 77 | 145 | 82 | 143 | 125 |
| CCGAAGTCGA | transcriptional regulating factor 1 | 0.0024 | 53 | 56 | 0 | 7 | 5 | 0 | 0 |
| GCTGCTGCGC | - | 0.0024 | 228 | 320 | 0 | 0 | 0 | 0 | 4 |
| TTGGGAGCAG | isoleucine-tRNA synthetase | 0.0031 | 72 | 43 | 10 | 10 | 19 | 4 | 8 |
| TAAGGAGCTG | Ribosomal protein S26 | 0.0031 | 344 | 329 | 138 | 85 | 96 | 43 | 101 |
| AACAGAAGCA | hypothetical protein FLJ25692 | 0.0031 | 75 | 59 | 13 | 24 | 24 | 9 | 16 |
| CCTCCACCTA | peroxiredoxin 2 | 0.0031 | 56 | 43 | 16 | 10 | 3 | 9 | 4 |
| TGTGAGTCAC | - | 0.0038 | 31 | 62 | 0 | 0 | 0 | 0 | 0 |
| TCAGGGATCT | - | 0.0038 | 41 | 53 | 0 | 0 | 0 | 0 | 0 |
Note: tag counts have been converted to tags per 100,000 for comparison purposes. The p values listed are from the log-t test.
Figure 4. Plot of standardized residuals against estimated proportions. Standardized Pearson's residuals (y-axis) are plotted against the proportion estimates (x-axis) for the two groups. The standardized Pearson's residuals are asymptotically distributed as a standard normal. The model fits of two tags (among the list of genes in Table 5) are shown here; the left panel shows the fit from the overdispersed logistic model and the right panel the fit from the overdispersed log-linear model. A lower variance of residuals in the group (normal) with lower mean proportion is an indication of poor model fit.
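To make the residuals in Figure 4 concrete, the sketch below computes a plain Pearson residual for one library. It is an illustration under stated assumptions rather than the model used in the paper: the variance is assumed to take the quadratic form μ + φμ², the leverage adjustment that turns a Pearson residual into a standardized one is omitted, and the function name and inputs are hypothetical.

```python
import math


def pearson_residual(count, library_size, proportion, phi):
    """Pearson residual for one library.

    Assumption: variance = mu + phi * mu**2 (quadratic overdispersion).
    No leverage adjustment is applied.
    """
    mu = library_size * proportion          # expected tag count
    variance = mu + phi * mu ** 2
    return (count - mu) / math.sqrt(variance)


# The same excess of 4 tags looks less extreme once overdispersion is allowed.
print(pearson_residual(14, 50000, 0.0002, 0.0))
print(pearson_residual(14, 50000, 0.0002, 0.5))
```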
Figure 5. The distribution of overdispersion estimates (φ). The estimates are from the overdispersed log-linear model fit to the pancreas data. Tags with an overdispersion estimate of 0 are not shown in the figure.