Abstract
The identification of modules or communities in sets of related variables is a key step in the analysis and modeling of biological systems. Procedures for this identification are usually designed to allow fast analyses of very large data sets and may produce suboptimal results when these sets are of a small to moderate size. This article introduces BoCluSt, a new, somewhat more computationally intensive, community detection procedure that is based on combining a clustering algorithm with a measure of stability under bootstrap resampling. Both computer simulation and analyses of experimental data showed that BoCluSt can outperform current procedures in the identification of multiple modules in data sets with a moderate number of variables. In addition, the procedure provides users with a null distribution of results to evaluate the support for the existence of community structure in the data. BoCluSt takes individual measures for a set of variables as input, and may be a valuable and robust exploratory tool of network analysis, as it provides 1) an estimation of the best partition of variables into modules, 2) a measure of the support for the existence of modular structures, and 3) an overall description of the whole structure, which may reveal hierarchical modularity, in which modules are composed of smaller sub-modules.
Year: 2016 PMID: 27258041 PMCID: PMC4892581 DOI: 10.1371/journal.pone.0156576
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. Cases considered in the computer simulations in Fig 1.
| Case | Number of variables | Number of modules | Module sizes | Sample size | Components distribution | v(c) |
|---|---|---|---|---|---|---|
| a | 8 | 2 | 4, 4 | 100 | Normal | 0.030 |
| b | 8 | 4 | 2, 2, 2, 2 | 100 | Normal | 0.030 |
| c | 8 | 2 | 4, 4 | 100 | Normal | 0.010 |
| d | 8 | 2 | 4, 4 | 100 | Normal | 0.015 |
| e | 8 | 2 | 4, 4 | 25 | Normal | 0.030 |
| f | 8 | 2 | 4, 4 | 50 | Normal | 0.030 |
| g | 8 | 7 | 1, 1, 1, 1, 1, 1, 2 | 100 | Normal | 0.030 |
| h | 8 | 3 | 5, 2, 1 | 100 | Normal | 0.030 |
| i | 8 | 2 | 4, 4 | 100 | Beta* | 0.030 |
| j | 8 | 2 | 4, 4 | 100 | Uniform** | 0.030 |
| k | 4 | 2 | 2, 2 | 100 | Normal | 0.030 |
| l | 16 | 2 | 8, 8 | 100 | Normal | 0.030 |
The cases differed in the number of variables, the number and sizes of modules (there were as many modules as sizes listed), the number of individuals measured, the distributions of the variables, and the variance of c, v(c). In all cases, e had a variance of 0.050. The correlations corresponding to the three v(c) values considered (0.030, 0.015 and 0.010) were 0.375, 0.231 and 0.167, respectively.
* c was generated as a beta variable with parameters α = 0.246 and β = 2, and e as a beta variable with α = 0.625 and β = 2, using R function rbeta; the resulting x, c and e distributions were markedly asymmetric.
** c was generated as a uniform variable with the range 0 to 0.600 and e as a uniform variable with the range 0 to 0.775 using R function runif.
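The variances and correlations quoted in these notes can be checked with a few lines of arithmetic. The sketch below assumes that each variable is the sum of a shared module component and an individual component (x = c + e, with c and e independent), so that two variables in the same module correlate as v(c) / (v(c) + v(e)). That model and the function names are illustrative assumptions, not taken from the authors' code.

```python
# Assumption: x = c + e, with c shared within a module and e independent.

def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


def uniform_variance(low, high):
    """Variance of a continuous uniform distribution on [low, high]."""
    return (high - low) ** 2 / 12


def within_module_correlation(var_c, var_e):
    """Correlation of two variables sharing c under the assumed model."""
    return var_c / (var_c + var_e)


VAR_E = 0.050  # variance of e, as stated for all cases

# Correlations for the three v(c) values listed in the table.
for var_c in (0.030, 0.015, 0.010):
    print(var_c, round(within_module_correlation(var_c, VAR_E), 3))

# Variances implied by the beta and uniform parameters in the footnotes.
print(round(beta_variance(0.246, 2), 3), round(beta_variance(0.625, 2), 3))
print(round(uniform_variance(0, 0.600), 3), round(uniform_variance(0, 0.775), 3))
```

Under these assumptions the three correlations come out as 0.375, 0.231 and 0.167, and the beta and uniform parameters give variances of about 0.030 for c and 0.050 for e, matching the normal cases.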
Fig 1. Values for the variance criterion (circles) in the cases listed in Table 1.
A single, randomly chosen simulated data set is shown per case, with 100 randomized null data sets and 500 bootstrap resamples per data set, along with the lower 2.5 percentile (plain lines) for the corresponding null situation of no correlation between variables. Grey circles mark the value for the true number of modules.
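The null reference described in this caption can be illustrated with a generic randomization scheme: shuffling each variable independently removes any correlation between variables while keeping each marginal distribution. The sketch below is a minimal example of that idea and not the BoCluSt implementation. The statistic `lack_of_correlation` is a hypothetical stand-in for the variance criterion (chosen only because it is also smaller when structure is stronger), and the simulated data assume the x = c + e model with normal components.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lack_of_correlation(columns):
    """Hypothetical stand-in statistic: 1 - mean |r| over all variable pairs."""
    k = len(columns)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    mean_abs_r = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return 1 - mean_abs_r


def randomized_null(columns, statistic, n_null, rng):
    """Statistic values for data sets in which each column is shuffled independently."""
    return [statistic([rng.sample(col, len(col)) for col in columns])
            for _ in range(n_null)]


def lower_percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def simulate_two_modules(n_records, rng, var_c=0.030, var_e=0.050):
    """Eight variables in two modules of four, assuming x = c + e."""
    columns = []
    for _ in range(2):
        c = [rng.gauss(0, math.sqrt(var_c)) for _ in range(n_records)]
        for _ in range(4):
            columns.append([ci + rng.gauss(0, math.sqrt(var_e)) for ci in c])
    return columns


rng = random.Random(0)
data = simulate_two_modules(100, rng)
observed = lack_of_correlation(data)
null_values = randomized_null(data, lack_of_correlation, n_null=100, rng=rng)
threshold = lower_percentile(null_values, 2.5)
print(observed, threshold, observed < threshold)
```

An observed value below the threshold would be read as support for structure in the data, in the same spirit as circles falling below the plain lines in the figure.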
Fig 2. Analysis of hierarchical communities.
A single, randomly chosen simulated data set is shown per case, with 100 records, 100 randomized null data sets and 500 bootstrap resamples per data set. The graphs (left) show the variance criterion for all possible numbers of clusters along with the lower 2.5 percentile for the corresponding null situation of no correlation between variables (plain lines). Grey circles mark correct clustering results for regular partitions. The diagrams to the right represent the different situations. The grayscale indicates the value of the correlation between variables (triangles) in the same ellipse. These were 0.273 and 0.545 in the eight-variable cases (a–c), and 0.214, 0.429 and 0.643 in the 16-variable cases (d–f).
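One reading that reproduces the correlations quoted in this caption, offered as an assumption rather than as the authors' stated design, is that every level of the hierarchy contributes a shared component of variance 0.03, on top of an individual component e of variance 0.05. Two variables sharing k of the nested components would then correlate as 0.03k divided by the total variance. The function name below is hypothetical.

```python
def nested_correlations(levels, var_shared=0.03, var_e=0.05):
    """Correlations between variables sharing 1..levels nested components.

    Assumption: x is the sum of `levels` shared components, each of variance
    var_shared, plus an independent component of variance var_e.
    """
    total = levels * var_shared + var_e
    return [round(k * var_shared / total, 3) for k in range(1, levels + 1)]


print(nested_correlations(2))  # two nested levels (eight-variable cases)
print(nested_correlations(3))  # three nested levels (16-variable cases)
```

With two levels this gives 0.273 and 0.545, and with three levels 0.214, 0.429 and 0.643, which agrees with the values in the caption.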
Fig 3. Frequency of numbers of modules detected in computer simulations comparing the different procedures.
There were eight variables, 100 records and 1000 replicates per case. 2C and 4C are cases with two and four modules of variables, respectively, and 2/2C are hierarchical situations of two modules, each divided into two sub-modules. The numbers to the right of the Cs mark the variance of component c, common to variables in the same module or sub-module: 1, variance = 0.01; 3, variance = 0.03 (the variance of component e was 0.05 in all cases). Circles, upward triangles, plus signs, x signs, rhombs, squares, asterisks, and downward triangles show results for the BoCluSt, Edge betweenness, Infomap, Label propagation, linkcomm, Louvain, silhouette, and Walktrap procedures, respectively. In these simulations, the results for the Fastgreedy, Leading eigenvector, Optimal modularity and Spinglass procedures were very similar to those of Louvain; only the latter is shown for clarity. Also for clarity, counts lower than 25 are shown for BoCluSt only. Grey symbols mark the true number(s) of modules. The values in smaller font are the numbers of replicates in which BoCluSt found results that were both correct and significant (below the lower 2.5 percentile of the null distribution). In the hierarchical cases, these results consisted of two significant minima in the variance criterion, for two and four clusters. In italics, the number of replicates finding the same two minima, whether significant or not.
Fig 4. BoCluSt analysis of Drosophila melanogaster microarray data.
Top, analysis of probes for 23 genes annotated in the ontologies “positive control of cell growth” and “negative control of cell growth” (in normal and italic fonts, respectively). The gene composition of the three-cluster partition with the lowest value for the variance criterion is shown in the grey boxes to the right of the graph. Asterisks mark second probes of genes having two different probes in the array. Middle, analysis of 104 probe sets of genes annotated in the ontology “DNA repair”. The partition showing the lowest variance criterion is magnified in the circle. Bottom, frequencies of pairwise correlation values between the probe sets corresponding to DNA repair genes (5356 correlations among 104 probe sets). The analysis identified a big cluster of 98 genes with closely correlated expression. The bars correspond to the correlations a) of these 98 probe sets among themselves (“Big/big” box); b) of these probe sets with the EndoGl, Mlh1 and Hus1-like probe sets (left box); and c) with the CG1390, RpA-70 and Gnf1 probe sets (the three boxes at center; see the text for more detailed explanations).
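The correlation counts in this caption follow from simple pair counting, as the short check below shows. The variable names are illustrative only.

```python
from math import comb

n_probe_sets = 104   # DNA repair probe sets analysed
big_cluster = 98     # probe sets in the big cluster

all_pairs = comb(n_probe_sets, 2)        # every pairwise correlation
big_big_pairs = comb(big_cluster, 2)     # correlations within the big cluster
big_vs_three = big_cluster * 3           # big cluster against a group of three probe sets

print(all_pairs, big_big_pairs, big_vs_three)
```

The first count reproduces the 5356 correlations stated in the caption; the other two are the numbers of correlations that would enter the “Big/big” box and a comparison of the big cluster with any three remaining probe sets.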
Table 2. Number and sizes of clusters found by the compared procedures in the analysis of gene expression.
| Procedure | Cell growth: # clusters | Cell growth: sizes | DNA repair: # clusters | DNA repair: sizes/composition |
|---|---|---|---|---|
| | 3 | 16, 6, 2 | 2 | (Big), (c, C1, q) |
| | 8 | 17, 1, 1, 1, 1, 1, 1, 1 | 4 | (Big), (c), (C1), (q) |
| | 2 | 16, 8 | 1 | - |
| | 1 | 24 | 1 | - |
| | 1 | 24 | 1 | - |
| | 2 | 16, 8 | 2 | (Big), (A, B, c, C1, C2, p, q) |
| | 1 | 24 | 1 | - |
| | 2 | 16, 8 | 2 | (Big), (A, c, C1, q) |
| | 2 | 16, 8 | 2 | (Big), (B, c, C1, q) |
| | 2 | 15, 9 | 2 | (Big), (T, R) |
| | 2 | 16, 8 | 2 | (Big), (B, c, C1, q) |
| | 3 | 16, 6, 2 | 2 | (Big), (c, C1, q) |
The simpler partitions found for genes in the “DNA repair” ontology made it possible to list the genes in the smaller clusters (A, Arp5; B, Blm; C1, CG10694; C2, CG18004; c, cry; p, p53; q, qjt; R, RpLP0; T, Tctp; Big, all genes in the data set except those listed for each procedure). Genes within the same set of parentheses were in the same cluster.