Kai Wang, Qing Zhao, Jianwei Lu, Tianwei Yu.
Abstract
With modern technologies such as microarray, deep sequencing, and liquid chromatography-mass spectrometry (LC-MS), it is possible to measure the expression levels of thousands of genes/proteins simultaneously to unravel important biological processes. A first step towards elucidating hidden patterns and understanding such massive data is the application of clustering techniques. Nonlinear relations, which unlike linear correlations have been largely unutilized, are prevalent in high-throughput data. In many cases, nonlinear relations can model the biological relationship more precisely and reflect critical patterns in the biological systems. Using the general dependency measure that we introduced before, Distance Based on Conditional Ordered List (DCOL), we designed the nonlinear K-profiles clustering method, which can be seen as the nonlinear counterpart of the K-means clustering algorithm. The method has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles. Results from extensive simulation studies showed that K-profiles clustering not only outperformed the traditional linear K-means algorithm, but also performed significantly better than our previous General Dependency Hierarchical Clustering (GDHC) algorithm. We further analyzed a gene expression dataset, on which K-profiles clustering generated biologically meaningful results.
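The abstract does not define DCOL, so the following is only a minimal sketch under an assumption: that DCOL of Y given X is computed by sorting the observations by X and averaging the absolute differences between consecutive Y values in that order. The function name `dcol` and the toy data are hypothetical and are not taken from the authors' software. The sketch illustrates why such a measure can pick up a nonlinear relation (here y = x²) that linear correlation would miss.

```python
import random


def dcol(x, y):
    """Assumed DCOL form: mean absolute difference between consecutive
    y values after ordering the observations by x (hypothetical helper)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    # Indices that sort the observations by x.
    order = sorted(range(len(x)), key=lambda i: x[i])
    y_sorted = [y[i] for i in order]
    # Sum of |y_(i+1) - y_(i)| along the x-ordering, averaged over n - 1 gaps.
    total = sum(abs(b - a) for a, b in zip(y_sorted, y_sorted[1:]))
    return total / (len(y) - 1)


random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(500)]
y_dependent = [xi ** 2 for xi in x]   # nonlinear in x, near-zero linear correlation
y_noise = y_dependent[:]              # same values, dependency destroyed
random.shuffle(y_noise)

# A smaller value suggests y varies smoothly along the ordering induced by x.
print(dcol(x, y_dependent), dcol(x, y_noise))
```

Under this assumed definition, the dependent pair yields a much smaller value than the shuffled pair, and the measure is asymmetric: `dcol(x, y)` and `dcol(y, x)` generally differ.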
Year: 2015 PMID: 26339652 PMCID: PMC4538770 DOI: 10.1155/2015/918954
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Illustration of the four functions used in simulations.
Figure 2. Simulation results with nonlinear data.
Figure 3. An example of confusion matrices shown as images. Cleaner pictures indicate better agreement between true clusters and clustering results. The left-most column of each subplot represents the pure noise gene group. (a) K-profiles clustering result. (b) GDHC result. (c) K-means result.
Figure 4. Simulation results from data with linear associations only.
Figure 5. Selecting the number of clusters for the Spellman dataset by plotting the sum of negative log p values against the number of clusters.
Figure 6. Significance levels of GO slim terms. Brighter colors indicate significance using the hypergeometric test for overrepresentation analysis.
Figure 7. An example cluster with mostly periodically expressed genes.
Biological pathways significantly associated with clusters 2, 5, 7, and 12.

| Cluster | # genes | GO Biological Process ID# | P value* | Name of GO term |
|---|---|---|---|---|
| 2 | 228 | GO:0051301 | 1.03 | Cell division |
| | | GO:0006468 | 0.0001307 | Protein phosphorylation |
| | | GO:0010696 | 0.00163665 | Positive regulation of spindle pole body separation |
| | | GO:0030473 | 0.00584256 | Nuclear migration along microtubule |
| | | GO:0005977 | 0.00628021 | Glycogen metabolic process |
| 5 | 116 | GO:0006301 | 5.94 | Postreplication repair |
| | | GO:0043570 | 1.87 | Maintenance of DNA repeat elements |
| | | GO:0006272 | 4.90 | Leading strand elongation |
| | | GO:0000070 | 0.00043025 | Mitotic sister chromatid segregation |
| | | GO:0009263 | 0.00067342 | Deoxyribonucleotide biosynthetic process |
| | | GO:0006298 | 0.00074914 | Mismatch repair |
| | | GO:0007131 | 0.00077629 | Reciprocal meiotic recombination |
| | | GO:0045132 | 0.00300391 | Meiotic chromosome segregation |
| | | GO:0006284 | 0.0034725 | Base-excision repair |
| | | GO:0006273 | 0.0041114 | Lagging strand elongation |
| | | GO:0006348 | 0.00415626 | Chromatin silencing at telomere |
| | | GO:0009200 | 0.00485315 | Deoxyribonucleoside triphosphate metabolic process |
| | | GO:0051301 | 0.00750912 | Cell division |
| 7 | 69 | GO:0006334 | 4.57 | Nucleosome assembly |
| | | GO:0030473 | 6.32 | Nuclear migration along microtubule |
| | | GO:0030148 | 0.00299059 | Sphingolipid biosynthetic process |
| | | GO:0000032 | 0.00650292 | Cell wall mannoprotein biosynthetic process |
| | | GO:0009225 | 0.00774684 | Nucleotide-sugar metabolic process |
| 12 | 155 | GO:0007020 | 1.07 | Microtubule nucleation |
| | | GO:0000070 | 0.0006474 | Mitotic sister chromatid segregation |
| | | GO:0006284 | 0.00078868 | Base-excision repair |
| | | GO:0006493 | 0.00078868 | Protein O-linked glycosylation |
| | | GO:0006273 | 0.00099378 | Lagging strand elongation |
| | | GO:0006337 | 0.00099378 | Nucleosome disassembly |
| | | GO:0000724 | 0.00151593 | Double-strand break repair via homologous recombination |
| | | GO:0000086 | 0.00242563 | G2/M transition of mitotic cell cycle |
| | | GO:0006368 | 0.00243303 | Transcription elongation from RNA polymerase II promoter |
| | | GO:0006338 | 0.0038366 | Chromatin remodeling |
| | | GO:0008156 | 0.00743106 | Negative regulation of DNA replication |

#Total number of GO Biological Process terms under study: 430.
* P value threshold: 0.01.
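
The Figure 6 caption names the hypergeometric test for overrepresentation analysis, which is presumably how P values such as those in the table are obtained. A minimal sketch of that test is given below. The function name `hypergeom_upper_tail` and all example counts other than the cluster size of 228 and the 0.01 threshold are hypothetical; in particular, the background gene count and the per-term annotation counts are not stated in this text.

```python
from math import comb


def hypergeom_upper_tail(k, cluster_size, annotated, population):
    """P(X >= k), where X counts annotated genes in a cluster drawn
    without replacement from the background (hypothetical helper)."""
    upper = min(cluster_size, annotated)
    total = comb(population, cluster_size)
    # Sum the hypergeometric probability mass from k up to the largest possible overlap.
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Illustrative numbers only: 228 is the size of cluster 2 in the table;
# the other three counts are assumptions made up for this example.
p_value = hypergeom_upper_tail(k=15, cluster_size=228, annotated=100, population=6000)
print(p_value, p_value < 0.01)  # compare against the 0.01 threshold
```

Exact integer arithmetic with `math.comb` avoids the rounding problems of multiplying many small probabilities; for large gene sets a library routine such as a survival function from a statistics package would be the more usual choice.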