Li Cai, Haiyan Huang, Seth Blackshaw, Jun S Liu, Connie Cepko, Wing H Wong.
Abstract
Serial analysis of gene expression (SAGE) data have been poorly exploited by clustering analysis owing to the lack of appropriate statistical methods that consider their specific properties. We modeled SAGE data by Poisson statistics and developed two Poisson-based distances. Their application to simulated and experimental mouse retina data shows that the Poisson-based distances are more appropriate and reliable for analyzing SAGE data than other commonly used distance or similarity measures, such as Pearson correlation or Euclidean distance.
Year: 2004 PMID: 15239836 PMCID: PMC463327 DOI: 10.1186/gb-2004-5-7-r51
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
List of simulated data
| Group A | | | | | |
| a1 | 0 | 0 | 0 | 19 | 145 |
| a2 | 0 | 0 | 0 | 13 | 146 |
| a3 | 0 | 0 | 0 | 13 | 154 |
| Group B | | | | | |
| b1 | 16 | 33 | 31 | 60 | 12 |
| b2 | 8 | 23 | 23 | 59 | 18 |
| b3 | 11 | 30 | 39 | 76 | 14 |
| b4† | 109 | 306 | 296 | 620 | 93 |
| Group C | | | | | |
| c1 | 10 | 11 | 9 | 2 | 11 |
| c2 | 12 | 11 | 10 | 12 | 7 |
| c3 | 4 | 10 | 16 | 14 | 6 |
| c4 | 10 | 8 | 8 | 7 | 12 |
| c5 | 9 | 6 | 9 | 18 | 12 |
| c6‡ | 99 | 84 | 77 | 102 | 106 |
| Group D | | | | | |
| d1 | 19 | 0 | 0 | 0 | 154 |
| d2 | 17 | 0 | 0 | 0 | 148 |
| d3 | 12 | 0 | 0 | 0 | 173 |
| d4 | 10 | 0 | 0 | 0 | 148 |
| d5 | 12 | 0 | 0 | 0 | 152 |
| d6 | 15 | 0 | 0 | 0 | 146 |
| d7 | 13 | 0 | 0 | 1 | 149 |
* P(λ) denotes a Poisson distribution with mean λ; for example, P(0.05) has mean 0.05. †b4 is generated by P(100), P(300), P(300), P(600), P(100). ‡c6 is generated by P(100), P(100), P(100), P(100), P(100).
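The simulated data make the scale problem concrete: b4 has a profile similar to b1 through b3 at roughly ten times the counts, while c6 is flat. The following is a minimal sketch, not the paper's definition of its Poisson-based distances, that contrasts Euclidean distance with a standard chi-square homogeneity statistic, used here as an assumed stand-in for a count-aware measure. The function names are hypothetical.

```python
import math

# Count vectors copied from the simulated data table.
b1 = [16, 33, 31, 60, 12]
b4 = [109, 306, 296, 620, 93]
c6 = [99, 84, 77, 102, 106]


def euclidean(x, y):
    """Plain Euclidean distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def chi_square_homogeneity(x, y):
    """Chi-square statistic for 'x and y share one expression profile'.

    Expected counts come from the pooled profile scaled by each tag's
    own total, so overall abundance does not drive the statistic.
    This is an illustrative assumption, not the paper's formula.
    """
    total_x, total_y = sum(x), sum(y)
    if total_x == 0 or total_y == 0:
        raise ValueError("each tag needs at least one count")
    grand = total_x + total_y
    stat = 0.0
    for a, b in zip(x, y):
        pooled = a + b
        if pooled == 0:
            continue  # both counts are zero: no contribution
        for observed, total in ((a, total_x), (b, total_y)):
            expected = total * pooled / grand
            stat += (observed - expected) ** 2 / expected
    return stat


for name, other in (("b4", b4), ("c6", c6)):
    print(name,
          round(euclidean(b1, other), 1),
          round(chi_square_homogeneity(b1, other), 2))
```

On these vectors, Euclidean distance places b1 nearer to c6 than to b4, whereas the chi-square statistic places b1 nearer to b4, which matches the grouping in the table.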
Figure 1. Graphs of clustering results for simulation data. The x-axis represents the different time points; the y-axis represents the expression level scaled as a percentage. Data were normalized before plotting: for each tag, the count vector is rescaled so that its elements sum to 1. For example, b4 = (109, 306, 296, 620, 93) is rescaled to b4' = b4/θ, where θ = 109 + 306 + 296 + 620 + 93 = 1424.
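The rescaling described in the Figure 1 caption can be sketched as follows. The function name `rescale_to_unit_sum` is hypothetical; only the b4 counts come from the source.

```python
def rescale_to_unit_sum(counts):
    """Divide each count by the vector total so the elements sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot rescale a tag with no counts")
    return [c / total for c in counts]


b4 = [109, 306, 296, 620, 93]
b4_scaled = rescale_to_unit_sum(b4)

# Express the rescaled profile as percentages, as on the y-axis.
print([round(100 * v, 1) for v in b4_scaled])
```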
Figure 2. Graphs of clustering results for mouse retinal SAGE data. The x-axis represents the time points of the developing mouse retina SAGE libraries; the y-axis represents the relative frequency for each tag scaled as a percentage. Data were normalized before plotting: the counts of each tag across the 10 libraries were rescaled to sum to 1. Different colors represent different tags. See Additional data file 1 for more details.
Statistics of photoreceptor-generated clusters by four different algorithms
| Algorithm | Number of total members | Number of specific genes | Percentage of specific genes | Number of rhodopsin tags |
| PoissonC | 28 | 22 | 78.6 | 5 of 5 |
| PearsonC | 67 | 24 | 35.8 | 2 of 5 |
| Eucli | 12 | 8 | 66.7 | 2 of 5 |
| Eucli on normalized data | 17 | 12 | 70.6 | 2 of 5 |
See Additional data file 1 for more details.
Statistics of the 34 cell-specific genes
| Cell-specific genes | Total | Sensitivity | Specificity |
| PoissonC | | | |
| 13 | 50 | **38.2%** | **26.0%** |
| 1 | 7 | 2.9% | 14.3% |
| 5 | 42 | 14.7% | 11.9% |
| 3 | 68 | 8.8% | 4.4% |
| 3 | 90 | 8.8% | 3.3% |
| PearsonC | | | |
| 12 | 86 | **35.3%** | **14.0%** |
| 3 | 52 | 8.8% | 5.8% |
| 3 | 55 | 8.8% | 5.5% |
| 3 | 75 | 8.8% | 4.0% |
| 3 | 81 | 8.8% | 3.7% |
| Eucli | | | |
| 2 | 13 | 5.9% | **15.4%** |
| 7 | 77 | 20.6% | 9.1% |
| 12 | 206 | **35.3%** | 5.8% |
| 1 | 22 | 2.9% | 4.5% |
| 4 | 142 | 11.8% | 2.8% |
| Eucli on normalized data | | | |
| 10 | 48 | **29.4%** | **20.8%** |
| 5 | 53 | 14.7% | 9.4% |
| 7 | 77 | 20.6% | 9.1% |
| 2 | 24 | 5.9% | 8.3% |
| 2 | 47 | 5.9% | 4.3% |
The numbers in the first column are the numbers of cell-specific genes in each cluster. Total, the total number of cluster members; sensitivity, the number of cell-specific genes in the cluster divided by 34; specificity, the number of cell-specific genes in the cluster divided by the total number of cluster members. For each method, the top five clusters that contain the 34 cell-specific genes are listed. Numbers in bold are the highest sensitivity and specificity for that method. See Additional data file 2 for more details.
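The two ratios defined in this note can be reproduced with a few lines. This is a minimal sketch; the function name `cluster_stats` and the constant name are hypothetical, and the inputs are two PoissonC rows from the table.

```python
N_CELL_SPECIFIC = 34  # number of known cell-specific genes, from the table title


def cluster_stats(n_specific, cluster_size, n_known=N_CELL_SPECIFIC):
    """Return (sensitivity, specificity) as percentages, per the table note.

    sensitivity = cell-specific genes in the cluster / all known genes
    specificity = cell-specific genes in the cluster / cluster size
    """
    sensitivity = 100 * n_specific / n_known
    specificity = 100 * n_specific / cluster_size
    return round(sensitivity, 1), round(specificity, 1)


print(cluster_stats(1, 7))   # (2.9, 14.3)
print(cluster_stats(5, 42))  # (14.7, 11.9)
```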
Comparison of algorithms on 143 tags
| Algorithm | Number of tags in incorrect clusters | Percentage of tags in incorrect clusters |
| PoissonL | 4 | 2.8 |
| PoissonC | 6 | 4.2 |
| Eucli on normalized data | 14 | 9.8 |
| PearsonC | NA | NA |
| Eucli | NA | NA |
NA: clusters generated by PearsonC and Eucli were too messy to assess.