Michelle Carey, Shuang Wu, Guojun Gan, Hulin Wu.
Abstract
Many pragmatic clustering methods have been developed to group data vectors or objects into clusters so that the objects in one cluster are very similar and objects in different clusters are distinct based on some similarity measure. The availability of time course data has motivated researchers to develop methods, such as mixture and mixed-effects modelling approaches, that incorporate the temporal information contained in the shape of the trajectory of the data. However, there is still a need for the development of time-course clustering methods that can adequately deal with inhomogeneous clusters (some clusters are quite large and others are quite small). Here we propose two such methods, iterative hierarchical clustering (IHC) and iterative pairwise-correlation clustering (IPC). We evaluate and compare the proposed methods to the Markov Cluster Algorithm (MCL) and the generalised mixed-effects model (GMM) using simulation studies and an application to a time course gene expression data set from a study containing human subjects who were challenged by a live influenza virus. We identify four types of temporal gene response modules to influenza infection in humans, i.e., single-gene modules (SGM), small-size modules (SSM), medium-size modules (MSM) and large-size modules (LSM). The LSM contain genes that perform various fundamental biological functions that are consistent across subjects. The SSM and SGM contain genes that perform either different or similar biological functions that have complex temporal responses to the virus and are unique to each subject. We show that the temporal responses of the genes in the LSM have either simple patterns with a single peak or trough, a consequence of transient stimuli, or sustained or state-transitioning patterns pertaining to developmental cues, and that these modules can differentiate the severity of disease outcomes.
Additionally, the size of gene response modules follows a power-law distribution with a consistent exponent across all subjects, which reveals the presence of universality in the underlying biological principles that generated these modules.
Keywords: Clustering; Inhomogeneous clusters; Power law
Year: 2016 PMID: 29928719 PMCID: PMC5963321 DOI: 10.1016/j.idm.2016.07.001
Source DB: PubMed Journal: Infect Dis Model ISSN: 2468-0427
Comparison of the GMM, MCL, IHC and IPC methods for each of the 9 symptomatic subjects. The Davies-Bouldin criterion (DB) determines the performance of the clustering methods. The smaller the DB is, the better the clustering method performs. The homogeneity of the clusters is examined by computing the average, standard deviation and minimum of the within-cluster correlation (WCC). The separation of the clusters is examined by computing the average, standard deviation and maximum of the between-cluster correlation (BCC).
| Subject | No of clusters | WCC mean (std) [min] | BCC mean (std) [max] | DB |
|---|---|---|---|---|
| IHC | | | | |
| 1 | 115 | 0.702 (0.069) [0.554] | −0.004 (0.321) [0.700] | 0.733 |
| 5 | 96 | 0.717 (0.075) [0.489] | −0.003 (0.323) [0.696] | 0.761 |
| 6 | 74 | 0.724 (0.054) [0.596] | −0.004 (0.340) [0.700] | 0.695 |
| 7 | 68 | 0.707 (0.058) [0.631] | −0.002 (0.347) [0.700] | 0.718 |
| 8 | 64 | 0.716 (0.057) [0.614] | −0.008 (0.352) [0.697] | 0.755 |
| 10 | 95 | 0.711 (0.077) [0.562] | −0.005 (0.323) [0.696] | 0.726 |
| 12 | 78 | 0.715 (0.057) [0.567] | −0.000 (0.340) [0.700] | 0.746 |
| 13 | 105 | 0.718 (0.079) [0.565] | −0.005 (0.331) [0.698] | 0.776 |
| 15 | 133 | 0.702 (0.083) [0.548] | −0.005 (0.313) [0.700] | 0.766 |
| IPC | | | | |
| 1 | 125 | 0.697 (0.067) [0.555] | −0.003 (0.327) [0.700] | 0.766 |
| 5 | 96 | 0.713 (0.074) [0.546] | −0.001 (0.324) [0.700] | 0.752 |
| 6 | 75 | 0.722 (0.063) [0.609] | −0.005 (0.344) [0.696] | 0.695 |
| 7 | 65 | 0.701 (0.064) [0.541] | 0.006 (0.355) [0.700] | 0.796 |
| 8 | 61 | 0.697 (0.058) [0.559] | 0.016 (0.339) [0.692] | 0.761 |
| 10 | 84 | 0.692 (0.073) [0.558] | −0.003 (0.328) [0.696] | 0.712 |
| 12 | 85 | 0.714 (0.059) [0.603] | 0.006 (0.334) [0.700] | 0.717 |
| 13 | 98 | 0.692 (0.076) [0.500] | −0.002 (0.330) [0.698] | 0.808 |
| 15 | 135 | 0.672 (0.065) [0.493] | −0.000 (0.312) [0.700] | 0.779 |
| MCL | | | | |
| 1 | 48 | 0.637 (0.106) [0.432] | 0.005 (0.332) [0.821] | 1.063 |
| 5 | 51 | 0.662 (0.095) [0.404] | 0.004 (0.319) [0.764] | 1.004 |
| 6 | 23 | 0.657 (0.075) [0.485] | −0.020 (0.330) [0.789] | 0.760 |
| 7 | 24 | 0.654 (0.068) [0.492] | −0.022 (0.332) [0.757] | 0.747 |
| 8 | 30 | 0.622 (0.097) [0.453] | −0.003 (0.358) [0.833] | 1.198 |
| 10 | 46 | 0.629 (0.098) [0.400] | 0.014 (0.334) [0.875] | 1.199 |
| 12 | 39 | 0.662 (0.092) [0.418] | 0.025 (0.344) [0.821] | 0.996 |
| 13 | 47 | 0.629 (0.099) [0.400] | −0.005 (0.331) [0.802] | 1.223 |
| 15 | 87 | 0.657 (0.087) [0.413] | −0.002 (0.338) [0.879] | 1.296 |
| GMM | | | | |
| 1 | 16 | 0.492 (0.186) [0.168] | −0.034 (0.467) [0.885] | 2.771 |
| 5 | 15 | 0.492 (0.194) [0.178] | −0.053 (0.489) [0.946] | 3.481 |
| 6 | 12 | 0.538 (0.141) [0.188] | −0.059 (0.388) [0.707] | 1.370 |
| 7 | 8 | 0.533 (0.150) [0.216] | −0.116 (0.574) [0.864] | 2.420 |
| 8 | 12 | 0.527 (0.205) [0.161] | −0.053 (0.490) [0.833] | 1.748 |
| 10 | 16 | 0.522 (0.180) [0.177] | −0.03 (0.430) [0.896] | 1.545 |
| 12 | 12 | 0.502 (0.169) [0.168] | −0.039 (0.493) [0.842] | 2.392 |
| 13 | 14 | 0.465 (0.158) [0.155] | −0.018 (0.406) [0.829] | 1.653 |
| 15 | 14 | 0.456 (0.145) [0.138] | −0.044 (0.395) [0.760] | 1.643 |
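The quantities reported in the table above can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not confirm: WCC is taken as the mean Pearson correlation between each profile and its cluster's mean trajectory, BCC as the Pearson correlation between pairs of cluster mean trajectories, and the Davies-Bouldin criterion uses Euclidean distance. The function names (`pearson`, `centre`, `wcc_bcc`, `davies_bouldin`) are hypothetical, and the authors' exact definitions may differ.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def centre(profiles):
    """Pointwise mean trajectory of a list of time-course profiles."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def wcc_bcc(clusters):
    """Return (wcc, bcc) for a list of clusters of profiles.

    Assumption: wcc[i] is the mean correlation of cluster i's members with
    the cluster centre; bcc holds the correlation of every pair of centres.
    """
    centres = [centre(c) for c in clusters]
    wcc = [sum(pearson(p, m) for p in c) / len(c)
           for c, m in zip(clusters, centres)]
    bcc = [pearson(a, b) for a, b in combinations(centres, 2)]
    return wcc, bcc


def davies_bouldin(clusters):
    """Davies-Bouldin index with Euclidean distance (assumed metric).

    Smaller values indicate compact, well-separated clusters.
    Requires at least two clusters with distinct centres.
    """
    centres = [centre(c) for c in clusters]
    # Scatter: average distance of each member to its cluster centre.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, centres)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.dist(centres[i], centres[j])
                     for j in range(k) if j != i)
    return total / k


# Toy data: two clusters with opposite trends over four time points.
toy = [
    [[1, 2, 3, 4], [2, 4, 6, 8]],
    [[4, 3, 2, 1], [8, 6, 4, 2]],
]
wcc, bcc = wcc_bcc(toy)
db = davies_bouldin(toy)
```

The mean, standard deviation and minimum of `wcc`, and the mean, standard deviation and maximum of `bcc`, would then correspond to the summary columns of the table under these assumed definitions.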
Fig. 1 Boxplots showing the accuracy of each method. Boxplots of the percentage of genes that are clustered into the correct functional modules for all three clustering procedures and σ = 0.1, 0.2, 0.3.
The distribution of the clusters and genes across the four categories (LSM, MSM, SSM and SGM). The percentage of clusters and the percentage of genes (in parentheses) in each category of modules for the IHC method for each of the 9 subjects.
| Subject | LSM % clusters (% genes) | MSM % clusters (% genes) | SSM % clusters (% genes) | SGM % clusters (% genes) |
|---|---|---|---|---|
| 1 | 6% (79%) | 13% (14%) | 37% (5%) | 44% (2%) |
| 5 | 6% (78%) | 10% (15%) | 49% (6%) | 34% (1%) |
| 6 | 14% (88%) | 12% (8%) | 34% (3%) | 41% (1%) |
| 7 | 9% (87%) | 15% (9%) | 32% (3%) | 44% (1%) |
| 8 | 8% (89%) | 14% (7%) | 47% (4%) | 31% (1%) |
| 10 | 5% (82%) | 13% (12%) | 44% (5%) | 38% (1%) |
| 12 | 8% (86%) | 12% (9%) | 44% (4%) | 38% (1%) |
| 13 | 5% (72%) | 19% (20%) | 44% (7%) | 32% (1%) |
| 15 | 5% (74%) | 15% (18%) | 38% (6%) | 42% (2%) |
Fig. 2 The time course patterns of the clustering centres grouped by module size. Single-gene modules (SGM) with only one gene in each cluster, small-size modules (SSM) that contain between 2 and 10 genes in each cluster, medium-size modules (MSM) that consist of 11–99 genes in each of the clusters and large-size modules (LSM) which contain over 100 genes in each cluster.
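The size thresholds in the caption above, and the percentage breakdown reported per subject, can be sketched in Python as follows. The function names `module_category` and `category_shares` are hypothetical. The caption does not say how a cluster of exactly 100 genes is classified, so this sketch assumes it counts as an LSM.

```python
from collections import Counter


def module_category(size):
    """Map a cluster size to SGM, SSM, MSM or LSM using the caption's ranges."""
    if size == 1:
        return "SGM"
    if size <= 10:
        return "SSM"
    if size < 100:
        return "MSM"
    # Assumption: a cluster of exactly 100 genes is treated as an LSM.
    return "LSM"


def category_shares(cluster_sizes):
    """Return {category: (% of clusters, % of genes)} for a list of sizes."""
    n_clusters = len(cluster_sizes)
    n_genes = sum(cluster_sizes)
    clusters, genes = Counter(), Counter()
    for size in cluster_sizes:
        cat = module_category(size)
        clusters[cat] += 1
        genes[cat] += size
    return {
        cat: (100 * clusters[cat] / n_clusters, 100 * genes[cat] / n_genes)
        for cat in ("LSM", "MSM", "SSM", "SGM")
    }


# Illustrative sizes only, not data from the study.
shares = category_shares([150, 40, 5, 3, 1, 1])
```

Even in this toy input, one large cluster holds most of the genes while small and single-gene clusters account for most of the cluster count, which is the kind of inhomogeneity the per-subject table describes.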
Fig. 3 The average time course patterns of the LSM gene clusters and the reported symptom scores for each subject.
The enriched GO biological process terms that are related to the genes in each of the LSM clusters for each of the 9 subjects.
| Subject | The most enriched gene ontology (GO) BP terms |
|---|---|
| 1 | (i) negative regulation of cell growth, (ii) translation, (iii) adaptive immune response, (iv) negative regulation of epithelial cell proliferation, (v) translational elongation, (vi) inflammatory response, (vii) sphingolipid metabolic process |
| 5 | (i) positive regulation of cell motion, (ii) negative regulation of cell growth, (iii) regulation of transcription, (iv) cellular macromolecule catabolic process, (v) DNA replication, (vi) protein modification by small protein conjugation |
| 6 | (i) protein amino acid phosphorylation, (ii) cholesterol metabolic process, (iii) response to virus, (iv) innate immune response, (v) negative regulation of transcription, (vi) innate immune response, (vii) DNA metabolic process, (viii) positive regulation of immune response, (ix) translational elongation, (x) mRNA metabolic process |
| 7 | (i) phosphate metabolic process, (ii) hexose metabolic process, (iii) carboxylic acid catabolic process, (iv) defense response, (v) monosaccharide metabolic process, (vi) translation |
| 8 | (i) chromatin modification, (ii) positive regulation of I-kappaB kinase/NF-kappaB cascade, (iii) cellular amide metabolic process, (iv) proteolysis, (v) positive regulation of macromolecule biosynthetic process |
| 10 | (i) response to wounding, (ii) chromatin modification, (iii) cofactor metabolic process, (iv) positive regulation of I-kappaB kinase/NF-kappaB cascade, (v) icosanoid metabolic process |
| 12 | (i) translational elongation, (ii) nucleobase, nucleoside, nucleotide and nucleic acid biosynthetic process, (iii) cellular carbohydrate catabolic process, (iv) small GTPase mediated signal transduction, (v) inflammatory response, (vi) proteolysis |
| 13 | (i) membrane organization, (ii) translational elongation, (iii) regulation of protein amino acid phosphorylation, (iv) inflammatory response, (v) histone acetylation |
| 15 | (i) regulation of transcription, (ii) cellular protein complex assembly, (iii) nicotinamide metabolic process, (iv) Wnt receptor signaling pathway, (v) innate immune response, (vi) steroid metabolic process |
Fig. 4 The percentage of the TRGs in the LSMs or SGMs that are common to each subject pair across all 9 symptomatic subjects. The (i, j) block represents the percentage of genes in the LSM that are common to subjects i and j.
Fig. 5 The semantic similarity of the gene biological process, cellular component and molecular function annotations, as defined by the GO terms, for the four categories of temporal gene response modules (LSM, MSM, SSM and SGM).
The estimates of the scaling exponent β for the power-law model of the size of the clusters. The p-value (in parentheses) of the corresponding Kolmogorov-Smirnov goodness-of-fit statistic tests the hypothesis that the power-law model is feasible for the size of the clusters generated by all three clustering procedures with α = 0.70. If the p-value is greater than 0.1, we can infer that it is viable that the size of the clusters follows a power-law distribution.
| Subject | MCL | IHC | IPC | GMM |
|---|---|---|---|---|
| 1 | 1.64 (0.273) | 1.61 (0.980) | 1.58 (0.985) | 1.73 (0.591) |
| 5 | 1.76 (0.603) | 1.66 (0.199) | 1.59 (0.797) | 2.12 (0.882) |
| 6 | 1.50 (0.003) | 1.50 (0.513) | 1.50 (0.564) | 3.20 (0.977) |
| 7 | 1.50 (0.216) | 1.53 (0.827) | 1.50 (0.748) | 1.90 (0.774) |
| 8 | 1.64 (0.713) | 1.59 (0.625) | 1.51 (0.807) | 1.76 (0.938) |
| 10 | 1.65 (0.041) | 1.58 (0.485) | 1.56 (0.653) | 1.79 (0.981) |
| 12 | 1.61 (0.642) | 1.54 (0.219) | 1.56 (0.579) | 2.06 (0.727) |
| 13 | 1.78 (0.423) | 1.65 (0.724) | 1.53 (0.322) | 1.86 (0.781) |
| 15 | 1.79 (0.609) | 1.60 (0.859) | 1.61 (0.330) | 1.85 (0.256) |
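As a rough illustration of how a scaling exponent and a Kolmogorov-Smirnov distance can be obtained from a list of cluster sizes, the Python sketch below fits a continuous power law by maximum likelihood. This is an assumption rather than the authors' procedure: the estimator, the lower cut-off `x_min` and the function name `fit_power_law` are illustrative choices, and turning the KS distance into a p-value would additionally need a resampling step that is not shown.

```python
import math


def fit_power_law(sizes, x_min=1.0):
    """Fit p(x) proportional to x**(-beta) for x >= x_min.

    Returns (beta, ks_distance). Uses the continuous maximum-likelihood
    estimator beta = 1 + n / sum(log(x_i / x_min)); this is a simplifying
    assumption for integer-valued cluster sizes. Raises ZeroDivisionError
    if every retained size equals x_min.
    """
    xs = sorted(x for x in sizes if x >= x_min)
    n = len(xs)
    log_sum = sum(math.log(x / x_min) for x in xs)
    beta = 1.0 + n / log_sum

    def cdf(x):
        # Model CDF of the fitted continuous power law.
        return 1.0 - (x / x_min) ** (1.0 - beta)

    # KS distance: largest gap between the empirical and model CDFs,
    # checked just before and just after each observed point.
    ks = max(
        max(abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
        for i, x in enumerate(xs)
    )
    return beta, ks


# Illustrative sizes chosen so that the logarithms sum to 3 (not study data).
example_sizes = [math.exp(0), math.exp(1), math.exp(2)]
beta, ks = fit_power_law(example_sizes)
```

A small KS distance indicates that the fitted curve tracks the empirical size distribution closely. The table's decision rule (p-value greater than 0.1) applies to the p-value derived from such a statistic, not to the distance itself.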