Olena Morozova, Vyacheslav Morozov, Brad G Hoffman, Cheryl D Helgason, Marco A Marra.
Abstract
BACKGROUND: Serial Analysis of Gene Expression (SAGE) is a DNA sequencing-based method for large-scale gene expression profiling that provides an alternative to microarray analysis. Most analyses of SAGE data aimed at identifying co-expressed genes have been accomplished using various versions of clustering approaches that often result in a number of false positives. PRINCIPAL
Year: 2008 PMID: 18787709 PMCID: PMC2527533 DOI: 10.1371/journal.pone.0003205
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Performance of seriation on simulated SAGE data.
(A). Seriation results of the three rounds of simulations with increasing amounts of noise, from round 1 (34 tags) to round 3 (384 tags). The dark red squares along the diagonal indicate tags with the expression patterns 1–3 that were grouped together by seriation. (B). Seriation of 10 expression profiles with a limited amount of noise. The dark red squares along the diagonal indicate tags in each expression profile that were grouped together. The numbers indicate expression patterns from Figure S1 that were grouped into each contig. Note that the two contigs in the middle (5 and 1) appear more similar to each other than any other contig pair, indicating similarity of the corresponding expression patterns.
Table 1. Effect of the amount of noise in SAGE data on the performance of seriation and PoissonC.
| Round | Algorithm | Pattern 1 TP | Pattern 1 FP | Pattern 2 TP | Pattern 2 FP | Pattern 3 TP | Pattern 3 FP |
| Round 1: 34 noise tags | Seriation | 41 | 1 | 38 | 4 | 37 | 4 |
| Round 1: 34 noise tags | PoissonC | 41 | 2 | 38 | 2 | 37 | 5 |
| Round 2: 120 noise tags | Seriation | 41 | 6 | 38 | 6 | 37 | 3 |
| Round 2: 120 noise tags | PoissonC | 41 | 13 | 38 | 14 | 37 | 15 |
| Round 3: 384 noise tags | Seriation | 41 | 5 | 38 | 3 | 37 | 3 |
| Round 3: 384 noise tags | PoissonC | 41 | 43 | 38 | 99 | 37 | 61 |
Seriation and PoissonC were applied to a simulated SAGE data set containing three expression patterns and increasing numbers of noise tags. The data set is described in more detail in the text and in Table S1. TP (True Positives) are tags that were classified into the correct expression group (expression pattern 1, 2, or 3, or noise), that is, assigned to the cluster (PoissonC) or contig (seriation) containing the other members of that group. FP (False Positives) are noise tags that were erroneously assigned to a cluster or contig with tags that conform to expression pattern 1, 2, or 3.
The false positive rate is significantly higher for the PoissonC algorithm than it is for seriation mostly due to the erroneous assignment of noise tags to an expression pattern (p<0.05).
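The TP and FP definitions in the table note can be made concrete with a small tally. The following is a minimal sketch, not the authors' evaluation code: it assumes each tag carries a known simulated label and a label for the group it was assigned to, and the function name `tally_pattern`, the label strings, and the toy data are all hypothetical.

```python
def tally_pattern(true_labels, assigned_labels, pattern):
    """Count TP and FP for one expression pattern.

    TP: tags that truly follow `pattern` and were assigned to its group.
    FP: noise tags that were assigned to the group for `pattern`.
    """
    tp = 0
    fp = 0
    for truth, assigned in zip(true_labels, assigned_labels):
        if assigned != pattern:
            continue  # tag was placed in some other group
        if truth == pattern:
            tp += 1
        elif truth == "noise":
            fp += 1
    return tp, fp


# Hypothetical toy data: 4 pattern-1 tags, 2 pattern-2 tags, 3 noise tags.
true_labels = ["p1", "p1", "p1", "p1", "p2", "p2", "noise", "noise", "noise"]
assigned_labels = ["p1", "p1", "p1", "p1", "p2", "p2", "p1", "noise", "p2"]

p1_counts = tally_pattern(true_labels, assigned_labels, "p1")
p2_counts = tally_pattern(true_labels, assigned_labels, "p2")
print(p1_counts, p2_counts)
```

In this toy case one noise tag leaks into each pattern group. That is the kind of error the FP columns count, and in the table it grows with the number of noise tags for PoissonC.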
Table 2. Comparative performance of seriation and PoissonC on a simulated SAGE data set with 10 expression patterns.
| Algorithm | TP | FP |
| Seriation | 549 | 1 |
| PoissonC | 528 | 22 |
Seriation and PoissonC were applied to the analysis of a simulated SAGE data set containing 10 expression patterns each including 50 tags, and 50 noise tags. TP (True Positives) are tags that were correctly classified as belonging to the right expression pattern or noise. FP (False Positives) are tags that were assigned to the wrong pattern or noise tags that were assigned to an expression pattern.
The false positive rate is significantly higher for the PoissonC algorithm than it is for seriation (p<0.05).
Figure 2. Seriation of genes expressed in mouse retinal SAGE libraries.
SAGE data from Blackshaw et al. [20] were subjected to seriation analysis as described in the text. The resulting reordered correlation matrix, which contains the correlation coefficient computed for each tag pair to measure the similarity of their retinal expression profiles, is color-coded red to blue to represent decreasing correlation values. Ten contigs, including two supercontigs, recognizable as the squares of high (red) correlation values along the diagonal, are evident from the color-coded correlation matrix. The panel on the right provides a zoomed-in view of the contigs.
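The idea of a reordered correlation matrix can be illustrated with a short sketch. It assumes Pearson correlation between tag profiles and uses a simple greedy nearest-neighbor ordering as a stand-in; the caption does not specify the seriation algorithm, so `greedy_order` is not the authors' method, and the profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_matrix(profiles):
    """Correlation coefficient for every pair of tag profiles."""
    return [[pearson(p, q) for q in profiles] for p in profiles]


def greedy_order(corr, start=0):
    """Stand-in ordering: repeatedly append the unplaced tag that is
    most correlated with the last placed tag."""
    order = [start]
    remaining = set(range(len(corr))) - {start}
    while remaining:
        last = order[-1]
        nxt = max(sorted(remaining), key=lambda j: corr[last][j])
        order.append(nxt)
        remaining.remove(nxt)
    return order


def reorder(corr, order):
    """Permute rows and columns of the matrix by the same ordering."""
    return [[corr[i][j] for j in order] for i in order]


# Hypothetical tag counts across four libraries.
profiles = [
    [10, 8, 2, 1],   # tag 0: early expression
    [1, 2, 9, 12],   # tag 1: late expression
    [9, 7, 3, 1],    # tag 2: early expression
    [2, 1, 8, 10],   # tag 3: late expression
]
corr = correlation_matrix(profiles)
order = greedy_order(corr)
reordered = reorder(corr, order)
print(order)
```

After reordering, the two early tags sit next to each other, as do the two late tags, so high correlation values collect in blocks along the diagonal. Those blocks are what the figure shows as red squares (contigs).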
Figure 3. Analysis of seriation contigs of genes expressed in mouse retinal SAGE libraries.
(A). Comparison of seriation contigs to the original clusters from Blackshaw et al. [20]. Seriation contigs are color-coded and plotted on the x-axis of the 3D graph. The peaks on the z-axis represent the percentage of cluster members (y-axis) present in a particular contig. Most seriation contigs are composed of one or several predominant clusters (also see Table 3). (B). Expression profiles of genes in seriation contigs. The relative expression levels from 0% to 100% are plotted on the y-axis for each contig, while the retinal libraries derived from developmental stages E12.5, E14.5, E16.5, E18.5, P0.5, P2.5, P4.5, P6.5, P10, and adult are on the x-axis. The ordering of contigs is temporal: genes expressed in earlier developmental stages tend to be in the first contigs, while genes expressed in later stages are in later contigs. This partitioning is particularly evident from the expression patterns of genes in the supercontigs.
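One way to obtain relative expression levels on a 0% to 100% scale is to divide each tag's counts by its maximum across libraries. The caption does not state the normalization used, so the sketch below is an assumption for illustration; `relative_expression` and the counts are hypothetical, while the stage names come from the caption.

```python
def relative_expression(counts):
    """Scale a tag's counts so that its highest library equals 100%."""
    peak = max(counts)
    if peak == 0:
        return [0.0 for _ in counts]  # tag not observed in any library
    return [100.0 * c / peak for c in counts]


stages = ["E12.5", "E14.5", "E16.5", "E18.5", "P0.5",
          "P2.5", "P4.5", "P6.5", "P10", "adult"]

# Hypothetical counts for a tag that switches on late in development.
counts = [0, 0, 1, 2, 4, 8, 12, 16, 20, 20]

profile = dict(zip(stages, relative_expression(counts)))
print(profile)
```

Scaling to the per-tag maximum puts abundant and rare tags on the same axis, so profiles within a contig can be compared by shape rather than by absolute count.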
Table 3. Comparison of seriation and PoissonC analyses of genes expressed in retinal SAGE libraries.
| Contig | Predominant clusters | Percent of predominant cluster members in contig | Percent of contig members in predominant cluster | Top Gene Ontology annotations enriched in predominant clusters | Top Gene Ontology annotations enriched in contigs |
| Contig1 | 4 | 32.59% | 58.4% | Mitochondrial | Ribonucleoprotein complex, p = 8.68E-03 |
| Contig2 | 5 | 79.77% | 74.19% | Ribosomal | Cytosolic ribosome ( |
| Contig3 | 12 | 96.36% | 42.74% | None | Biosynthesis, p = 1.11E-02 |
| Supercontig 1 (contigs 1, 2, 3) | 4, 5, 12 | 33.48%, 93.64%, 100% | 17.24%, 37.24%, 12.64% | Mitochondrial, Ribosomal | Ribonucleoprotein complex, p = 7.17E-14 |
| Contig4 | 6 | 31.94% | 32.86% | RNA processing | Ligase activity, p = 3.28E-02 |
| Contig5 | 6 | 33.33% | 68.57% | RNA processing | N/A |
| Contig6 | N/A | N/A | N/A | N/A | Structural molecule activity, p = 3.29E-05 |
| Contig7 | 15 | 41.07% | 67.65% | Ribosomal | Cytosolic ribosome ( |
| Contig8 | 8, 22, 24 | 100%, 37.5%, 51.56% | 15.09%, 2.8%, 46.7% | Ribosomal, Vision, Vision | Metal ion binding, p = 2.97E-03 |
| Contig9 | 1, 10, 21, 22 | 100%, 32.2%, 92.86%, 62.5% | 4.27%, 40.28%, 43.13%, 4.73% | Vision, Transporter activity, Vision, Vision | Vision, p = 4.74E-15 |
| Supercontig 2 (contigs 8, 9) | 1, 8, 10, 21, 22, 24 | 100%, 100%, 41.67%, 100%, 100%, 56.25% | 2.13%, 7.57%, 26%, 23.17%, 3.78%, 25.53% | Vision, Ribosomal, Transporter activity, Vision, Vision | Perception of light, p = 7.77E-14 |
| Contig10 | 2 | 100% | 15% | Lens proteins | Structural constituent of eye lens, p = 1.17E-19 |
Column 2 contains Blackshaw et al. [20] clusters that are predominant in seriation contigs in column 1. GO categories enriched in clusters/contigs were determined using EASE software [34]; p<0.05.
Data from Blackshaw et al. [20].
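The two percentage columns in the table describe the same overlap from opposite directions. A minimal sketch follows, assuming clusters and contigs are available as sets of tag identifiers; the function name and the tag identifiers are hypothetical.

```python
def overlap_percentages(cluster_tags, contig_tags):
    """Return (percent of cluster members in contig,
               percent of contig members in cluster)."""
    cluster = set(cluster_tags)
    contig = set(contig_tags)
    shared = cluster & contig
    pct_cluster_in_contig = 100.0 * len(shared) / len(cluster)
    pct_contig_in_cluster = 100.0 * len(shared) / len(contig)
    return pct_cluster_in_contig, pct_contig_in_cluster


# Hypothetical example: a 4-tag cluster and a 10-tag contig sharing 3 tags.
cluster_tags = {"t1", "t2", "t3", "t4"}
contig_tags = {"t1", "t2", "t3", "t5", "t6", "t7", "t8", "t9", "t10", "t11"}

pct_cluster, pct_contig = overlap_percentages(cluster_tags, contig_tags)
print(pct_cluster, pct_contig)
```

The two values can differ widely when the groups differ in size. The Contig10 row is an example: all of cluster 2 falls inside the contig (100%), yet those tags make up only 15% of the contig.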
Figure 4. Seriation of transcription factors expressed in Mouse Atlas pancreatic libraries.
SAGE data for transcription factors expressed in the pancreatic libraries from the Mouse Atlas project were subjected to seriation analysis as described in the text. The reordered correlation matrix, which contains the correlation coefficient computed for each tag pair to measure the similarity of their pancreatic expression profiles, is color-coded red to blue to represent decreasing correlation values. (A). Five contigs, recognizable as red squares along the diagonal, are evident. (B). Expression profiles of transcription factors in the contigs in (A). The relative expression levels from 0% to 100% are plotted on the y-axis for each contig, while the pancreatic libraries derived from stages TS17, TS19, TS20, TS21, TS22, and P70 are on the x-axis.
Table 4. Summary of seriation analysis of transcription factors expressed in pancreas.
| Contig | Tags in contig | Characterized transcription factors in contig | Gene Ontology annotations enriched in contig | SwissProt keywords enriched in contig | KEGG annotations enriched in contig |
| Contig1 | 28 | | Anatomical structure development, p = 1.52E-03; System development, p = 2.98E-03; Organ development, p = 5.91E-03 | N/A | N/A |
| Contig2 | 55 | Kin, Pole4, Foxa3, Tox3, | Defense response, p = 2.21E-02; Receptor activity, p = 2.13E-02 | N/A | N/A |
| Contig3 | 61 | | Multicellular organismal development, p = 1.45E-06; pattern specification process, p = 4.64E-05; regulation of cell differentiation, p = 9.97E-03 | Zinc-finger, p = 1.46E-03; Homeobox, p = 2.19E-03 | Adherens junction, p = 4.25E-02 |
| Contig4 | 52 | Hmga1, Foxp1, Mum1, Lin28, Msh6, Dnase2a, Mxd3, Rest, Gata4, Klf4, Cdx2, | Transmembrane receptor protein serine/threonine kinase signaling pathway, p = 2.96E-02; regulation of signal transduction, p = 2.12E-02; negative regulation of cellular process, p = 2.14E-02 | N/A | TGF-beta signaling pathway, p = 7.66E-04; Wnt signaling pathway, p = 3.65E-02 |
| Contig5 | 97 | | Apoptosis, p = 1.23E-03; Programmed cell death, p = 1.43E-03; Regulation of apoptosis, p = 2.19E-03 | Apoptosis, p = 6.08E-03; Coiled coil, p = 6.28E-03 | N/A |
The number of tags falling into each seriation contig is shown along with representative contig members and their representative functional annotations from GO categories, SwissProt keywords, and KEGG pathways. For a full list of annotations enriched in the contigs, see Figure S5. Known regulators of pancreatic development as well as the transcription factors discussed in the text are bolded.
GO, SwissProt keyword, and KEGG pathway annotations enriched in contigs were computed using the web-based FatiGO+ tool [35] and are provided with raw scores.