Evert-Jan Blom, Sacha A F T van Hijum, Klaas J Hofstede, Remko Silvis, Jos B T M Roerdink, Oscar P Kuipers.
Abstract
BACKGROUND: A typical step in the analysis of gene expression data is the determination of clusters of genes that exhibit similar expression patterns. Researchers are confronted with the seemingly arbitrary choice between numerous algorithms to perform cluster analysis.
Year: 2008 PMID: 19087282 PMCID: PMC2661003 DOI: 10.1186/1471-2105-9-535
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flow diagram. The DISCLOSE application uses functional categories to evaluate the cluster results given a dataset (A), clustering algorithms (B), and clustering parameters (C). Clustering can be performed by the DISCLOSE application or imported from external clustering programs (D). Each clustering run is evaluated (E) for overrepresented functional categories using different annotation sources (F), and optionally by a motif identification algorithm (G). Lastly, the results of the clustering analysis are accumulated in a tabular display in which each row summarizes the application of one clustering method to the data (H). Selecting the results of an individual clustering from the tabular display (I) allows for a cluster-based analysis (see Fig. 2A).
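Step (E) of the flow diagram, testing a cluster for overrepresented functional categories, can be sketched as follows. The caption does not say which statistical test DISCLOSE applies, so this minimal sketch assumes a one-sided hypergeometric test, a common choice for enrichment analysis. The function names, the gene identifiers, and the `alpha` threshold are all hypothetical.

```python
from math import comb


def hypergeom_tail(k, cluster_size, category_size, genome_size):
    """Probability of seeing at least k category members in a cluster
    of cluster_size genes drawn without replacement from the genome."""
    upper = min(cluster_size, category_size)
    total = comb(genome_size, cluster_size)
    hits = sum(
        comb(category_size, i) * comb(genome_size - category_size, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def overrepresented(cluster, categories, genome_size, alpha=0.05):
    """Return {category name: p-value} for categories enriched in the cluster.

    `categories` maps a category name to the genes annotated with it
    (an assumed input format, not the DISCLOSE file format).
    """
    cluster = set(cluster)
    results = {}
    for name, members in categories.items():
        k = len(cluster & set(members))
        p = hypergeom_tail(k, len(cluster), len(members), genome_size)
        if k > 0 and p <= alpha:
            results[name] = p
    return results


# Toy example with made-up gene identifiers.
example = overrepresented(
    ["g1", "g2", "g3"],
    {"cat1": ["g1", "g2", "g3"], "cat2": ["g7", "g8", "g9"]},
    genome_size=10,
)
```

In this toy example only `cat1` is reported, because all three cluster genes belong to it; `cat2` shares no genes with the cluster and is skipped.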
Figure 2. Visualization. A. Genes from the DNA microarray data were clustered. The size of each cluster is displayed in blue underneath the cluster name. The number in each colored rectangle is the absolute number of occurrences. The significance of the overrepresentation is visualized with a color gradient, which is displayed at the bottom of the plot. The description of each category is placed at the right. Multiple testing correction results are visualized using five different symbols to distinguish between the individual corrections. The number of symbols placed in each rectangle corresponds to the number of multiple testing corrections after which the annotation is still found significant (see [13] for more details concerning this visualization). The graphical representation of the overrepresented DNA binding sites from the SCOPE algorithm consists of several components, shown here for the SCOPE results of a single cluster: B. The expression graph of the genes in the cluster. C. Information concerning overrepresented functional categories and a link to the results of DISCLOSE. D. Link to the results of SCOPE. E. The highest scoring motif found in the cluster. F. The highest scoring motif compared with existing binding site information. The known motif that best matches the putative motif is displayed.
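The symbol count described in panel A can be illustrated with a short sketch. The caption mentions five multiple testing corrections without naming them, so the three procedures below (Bonferroni, Holm, and Benjamini-Hochberg) are stand-ins chosen only because they are widely used; the function names and the `alpha` threshold are likewise assumptions.

```python
def bonferroni(pvals, alpha=0.05):
    """Significant if p * m <= alpha."""
    m = len(pvals)
    return [p * m <= alpha for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm step-down: stop at the first p-value that fails its threshold."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] > alpha / (m - rank):
            break
        significant[i] = True
    return significant


def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: accept every p-value up to the largest passing rank."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            cutoff = rank
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            significant[i] = True
    return significant


def symbol_counts(pvals, alpha=0.05):
    """For each annotation, count the corrections under which it stays significant."""
    calls = [f(pvals, alpha) for f in (bonferroni, holm, benjamini_hochberg)]
    return [sum(c[i] for c in calls) for i in range(len(pvals))]


counts = symbol_counts([0.001, 0.02, 0.03, 0.5])
```

Each entry of `counts` plays the role of the number of symbols drawn in one rectangle: the more corrections an annotation survives, the more symbols it would receive.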
Figure 3. A customizable graphical representation of DNA binding sites. The Scalable Vector Graphics visualization displays the genomic context of putative and known motifs in the upstream sequences of the operons. The user interface allows users to interact with the visualization. A) Hide de novo motifs. B) Hide known motifs from literature. C) Hide upstream regions without any putative or known motifs. D) Use standard coloring of putative motifs. E) Use coloring of putative motifs based on best hit with known motif. F) Every found motif can be displayed or hidden from the visualization using checkboxes. G) Known motifs can be displayed or hidden from the visualization using checkboxes. H) The scaling slider adjusts the width between the upstream sequences. I) The zooming slider allows for zooming of the visualization. J) The first structural gene of each operon is a large polygon, whilst the other genes are represented using smaller polygons. K) Genes coding for a putative regulator are colored red. Hovering with the mouse over the genes creates a tooltip displaying the function of the gene. L) Open polygons represent known binding sites derived from literature sources. M) Filled polygons depict putative motifs.
Table 1. Biological phenomena discussed in the original article
| Functional category | Matched by DISCLOSE | Significance frequency |
|---|---|---|
| Purine biosynthesis | X | 91% |
| Cell growth | X | 88% |
| General stress response | X | 88% |
| tricarboxylic acid cycle | X | 86% |
| Sigma D regulon (motility) | X | 85% |
| Glycolysis | X | 72% |
| cell division | X | 71% |
| pyrimidine biosynthesis | X | 70% |
| DNA replication and DNA repair functions | X | 66% |
| Sulfur amino acid metabolism | X | 19% |
| aspartate metabolism | X | 16% |
| serine metabolism | X | 12% |
| fatty acid biosynthesis | X | 12% |
| drug transporter activity | X | 3.1% |
| Na+/H+ antiporters | X | 0.6% |
| RNA modification | X | 0.3% |
| Multidrug transporters | - | |
The original analysis by Keyser et al. [17] revealed several biological phenomena that are induced during the time-course DNA microarray experiment. These phenomena were matched with the results of the robustness analysis (the complete analysis is listed in Table 2). Phenomena discussed in the original analysis are listed in the first column. A match with the results of DISCLOSE is indicated in the second column. The significance frequency is shown in the third column.
Table 2. Results of robustness analysis of DISCLOSE
| Functional category | Member size | Significance frequency | In original study |
|---|---|---|---|
| GO-0006164 : purine nucleotide biosynthetic process | 28 | 94.03% | Y |
| COG-F : Nucleotide transport and metabolism | 84 | 93.56% | Y |
| GO-0003735 : structural constituent of ribosome | 59 | 88.06% | Y |
| INT-SigB : general stress sigma factor | 66 | 87.91% | Y |
| COG-J : Translation, ribosomal structure and biogenesis | 161 | 87.12% | Y |
| PW-path-bsu00020 : Citrate cycle | 18 | 86.18% | N |
| GO-0003723 : RNA binding | 107 | 85.87% | Y |
| COG-N : Cell motility and secretion | 57 | 85.71% | Y |
| INT-SigG : late forespore-specific gene expression | 61 | 84.30% | N |
| PW-path-bsu00193 : ATP synthesis | 8 | 83.04% | Y |
| UP-67 : Ligase | 78 | 79.74% | Y |
| GO-0006935 : chemotaxis | 26 | 73.46% | Y |
| UP-56 : Glycolysis | 19 | 72.21% | Y |
| UP-29 : Cell division | 32 | 71.11% | Y |
| PW-path-bsu00720 : Reductive carboxylate cycle | 13 | 70.32% | N |
| PW-path-bsu00240 : Pyrimidine metabolism | 51 | 68.60% | Y |
| PW-path-bsu00970 : Aminoacyl-tRNA biosynthesis | 23 | 68.28% | Y |
| COG-L : DNA replication, recombination and repair | 138 | 65.93% | Y |
| UP-15 : Threonine biosynthesis | 3 | 58.55% | N |
| COG-G : Carbohydrate transport and metabolism | 246 | 55.88% | Y |
| GO-0006520 : amino acid metabolic process | 184 | 53.53% | Y |
| UP-124 : Sporulation | 180 | 49.92% | Y |
| INT-SigK : late mother cell-specific gene expression | 57 | 47.40% | N |
| PW-path-bsu03070 : Type III secretion system | 9 | 44.27% | N |
| INT-PurR : negative regulation of the purine operons | 10 | 43.32% | N |
| GO-0015293 : symporter activity | 82 | 35.63% | Y |
| UP-25 : Porphyrin biosynthesis | 13 | 27.62% | N |
| UP-179 : Folate biosynthesis | 6 | 27.31% | N |
| PW-path-bsu00190 : Oxidative phosphorylation | 31 | 26.05% | N |
| PW-path-bsu02060 : Phosphotransferase system (PTS) | 27 | 24.96% | N |
| COG-D : Cell division and chromosome partitioning | 33 | 24.64% | Y |
| GO-0008360 : regulation of cell shape | 36 | 24.17% | Y |
| PW-path-bsu00740 : Riboflavin metabolism | 5 | 23.54% | N |
| PW-path-bsu00030 : Pentose phosphate pathway | 24 | 23.39% | N |
| GO-0009086 : methionine biosynthetic process | 15 | 22.76% | Y |
| UP-17 : Hydrogen ion transport | 15 | 21.66% | Y |
| INT-SigE : early mother cell-specific gene expression | 82 | 21.66% | N |
| PW-path-bsu00920 : Sulfur metabolism | 15 | 19.62% | N |
| INT-SigA : RNA polymerase major sigma-43 factor | 320 | 18.52% | N |
| COG-O : Posttranslational modification, protein turnover, chaperones | 98 | 17.11% | N |
| PW-path-bsu00400 : Phenylalanine, tyrosine and tryptophan biosynthesis | 28 | 16.16% | N |
| PW-path-bsu00252 : Alanine and aspartate metabolism | 21 | 16.01% | Y |
| GO-0009252 : peptidoglycan biosynthetic process | 32 | 15.22% | Y |
| PW-path-bsu00260 : Glycine, serine and threonine metabolism | 34 | 12.55% | Y |
| UP-84 : Fatty acid biosynthesis | 11 | 12.55% | Y |
| GO-0000103 : sulfate assimilation | 7 | 10.67% | N |
| GO-0000105 : histidine biosynthetic process | 11 | 10.36% | N |
| PW-path-bsu00670 : One carbon pool by folate | 11 | 10.20% | N |
The performance of DISCLOSE was evaluated by comparing the published clustering analysis of a time-course DNA microarray experiment of Bacillus subtilis [17] with the results obtained by DISCLOSE. This analysis recapitulated most of the findings of the original study. In addition, DISCLOSE found several significantly overrepresented categories that were not discussed by the authors. The table lists the functional categories identified by DISCLOSE in the robustness analysis with a significance frequency of at least 10%. The last column indicates whether a category was found significant in the original study. Redundant functional categories were removed from the table (e.g., functional category GO:0006096-glycolysis is not listed because it is already covered by the UniProt category UP:56-glycolysis). Abbreviations: GO, Gene Ontology; INT, regulon; PW, metabolic pathway; UP, UniProt; COG, Clusters of Orthologous Groups.
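The 10% cutoff used to build the table can be sketched in a few lines. The text does not define significance frequency formally, so this sketch assumes it is the percentage of clustering runs in which a category was found significantly overrepresented. The input format (one set of significant category names per run) and the function names are hypothetical.

```python
from collections import Counter


def significance_frequency(runs):
    """Percentage of clustering runs in which each category was significant.

    `runs` is a list with one collection of significant category names per run.
    """
    counts = Counter(cat for run in runs for cat in set(run))
    n = len(runs)
    return {cat: 100.0 * c / n for cat, c in counts.items()}


def robust_categories(runs, min_percent=10.0):
    """Categories at or above the cutoff, most frequent first."""
    freq = significance_frequency(runs)
    kept = [(cat, pct) for cat, pct in freq.items() if pct >= min_percent]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Toy input: four clustering runs with made-up category names.
toy_runs = [{"A", "B"}, {"A"}, {"A", "C"}, {"A"}]
table = robust_categories(toy_runs)
```

Sorting by descending frequency mirrors the ordering of the table, where the categories that are significant in most runs appear first.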
Figure 4. Non-validated results of overrepresented DNA binding sites. DISCLOSE was also able to detect several motifs in clusters that could not be matched with motifs from literature. The motifs identified by DISCLOSE are visualized as sequence logos [16] and are displayed in the first column. An optimized version of each motif is placed in the second column, whilst the genomic context of its instances is displayed in the third column.