Markku Kuismin, Fatemeh Dodangeh, Mikko J Sillanpää.
Abstract
We introduce a new model selection criterion for sparse complex gene network modeling where gene co-expression relationships are estimated from data. This is a novel formulation of the gap statistic and it can be used for the optimal choice of a regularization parameter in graphical models. Our criterion favors gene network structure which differs from a trivial gene interaction structure obtained totally at random. We call the criterion the gap-com statistic (gap community statistic). The idea of the gap-com statistic is to examine the difference between the observed and the expected counts of communities (clusters) where the expected counts are evaluated using either data permutations or reference graph (the Erdős-Rényi graph) resampling. The latter represents a trivial gene network structure determined by chance. We put emphasis on complex network inference because the structure of gene networks is usually nontrivial. For example, some of the genes can be clustered together or some genes can be hub genes. We evaluate the performance of the gap-com statistic in graphical model selection and compare its performance to some existing methods using simulated and real biological data examples.
Keywords: cluster; co-expression; complex network; gap statistic; high-dimensional data; model selection
Year: 2022 PMID: 35100338 PMCID: PMC9210289 DOI: 10.1093/g3journal/jkab437
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.542
Fig. 1. Permutation strategy. In this toy example, the ground truth graph is a star graph with 100 distinct hubs (stars) (p = 500, n = 1,000). We use hard thresholding of pairwise correlation coefficients to compute sparse network estimates. a) Curves of the estimated number of communities (blue solid line) and the permutation-based estimator of its expected value (red dashed line), both determined with the Walktrap community detection algorithm, as a function of the hard-threshold parameter λ. b) Gap-com curve as a function of estimated k. We set the number of permutations to 50 to estimate the expected number of communities. The vertical dashed line in the left-hand figure corresponds to the parameter value that maximizes the gap-com statistic. The vertical dashed line in the right-hand figure corresponds to the true number of hubs.
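The permutation strategy described in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: connected components stand in for the Walktrap community count, isolated nodes are counted as communities, the sign convention (expected minus observed) is an assumption, and the toy data and all function names are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def threshold_graph(columns, lam):
    """Hard thresholding: keep edge i-j only if |corr(i, j)| > lam."""
    p = len(columns)
    adj = [set() for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            if abs(pearson(columns[i], columns[j])) > lam:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def count_components(adj):
    """Number of connected components (assumed stand-in for a community count)."""
    seen, count = set(), 0
    for start in range(len(adj)):
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return count

def gap_permutation(columns, lam, n_perm=50, rng=None):
    """Return (gap, observed count, mean count over permuted data)."""
    if rng is None:
        rng = random.Random(0)
    observed = count_components(threshold_graph(columns, lam))
    total = 0
    for _ in range(n_perm):
        # Shuffle every variable independently to destroy co-expression.
        permuted = [rng.sample(col, len(col)) for col in columns]
        total += count_components(threshold_graph(permuted, lam))
    expected = total / n_perm
    return expected - observed, observed, expected  # sign is an assumption

# Hypothetical toy data: 3 blocks of 4 variables sharing a latent signal.
rng = random.Random(1)
n = 80
columns = []
for _ in range(3):
    latent = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(4):
        columns.append([z + 0.3 * rng.gauss(0, 1) for z in latent])

gap, k_obs, k_exp = gap_permutation(columns, lam=0.6, n_perm=20, rng=rng)
print(gap, k_obs, k_exp)
```

In practice one would evaluate `gap_permutation` over a grid of λ values and pick the value with the largest gap, which is how the caption describes the vertical dashed line.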
Fig. 2. E-R model strategy. In this toy example, the ground truth graph is a star graph with 100 distinct hubs (stars) (p = 500, n = 1,000). We use hard thresholding of pairwise correlation coefficients to compute sparse network estimates. a) Curves of the estimated number of communities (blue solid line) and the reference number of communities from the E-R graphs (red dashed line), both determined with the Walktrap community detection algorithm, as a function of the hard-threshold parameter λ. b) Gap-com curve as a function of estimated k. We generated 50 copies of the E-R graph to estimate the expected number of communities. The vertical dashed line in the left-hand figure corresponds to the parameter value that maximizes the gap-com statistic. The vertical dashed line in the right-hand figure corresponds to the true number of hubs.
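The E-R reference strategy can be illustrated in the same spirit. The sketch below assumes that each reference graph has the same number of nodes and edges as the estimated graph, which the caption does not state; connected components again replace Walktrap, and the function names and the toy star graph are hypothetical.

```python
import random

def count_components(adj):
    """Number of connected components (assumed stand-in for a community count)."""
    seen, count = set(), 0
    for start in range(len(adj)):
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return count

def er_graph(p, m, rng):
    """Erdős-Rényi G(p, m) graph: m distinct edges drawn uniformly at random."""
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    adj = [set() for _ in range(p)]
    for i, j in rng.sample(pairs, m):
        adj[i].add(j)
        adj[j].add(i)
    return adj

def gap_er(adj, n_ref=50, rng=None):
    """Return (gap, observed count, mean count over E-R reference graphs)."""
    if rng is None:
        rng = random.Random(0)
    p = len(adj)
    m = sum(len(nbrs) for nbrs in adj) // 2  # each edge is stored twice
    observed = count_components(adj)
    total = sum(count_components(er_graph(p, m, rng)) for _ in range(n_ref))
    expected = total / n_ref
    return expected - observed, observed, expected  # sign is an assumption

# Hypothetical estimate: 3 disjoint stars, one hub with 4 leaves each.
star_adj = [set() for _ in range(15)]
for hub in (0, 5, 10):
    for leaf in range(hub + 1, hub + 5):
        star_adj[hub].add(leaf)
        star_adj[leaf].add(hub)

gap, k_obs, k_exp = gap_er(star_adj, n_ref=50)
print(gap, k_obs, k_exp)
```

Because no data are resampled and no correlations are recomputed, this variant only needs the estimated graph itself, which is one plausible reason for the two strategies having different running times.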
Fig. 3. The elapsed times of both resampling strategies as a function of the number of parallel threads. We use hard thresholding of pairwise correlation coefficients to compute the sparse network estimates (p = 1,000, n = 200). We set the number of permutations to 50 in the first strategy and likewise generated 50 copies of the E-R graph in the second strategy to evaluate the gap-com statistic.
Averaged model evaluation metrics of binary classification tests (mean), NMI (mean), and the number of clusters [median (IQR) after isolated nodes are removed] over 100 replications from different sparse network models while using hard thresholding (Threshold) when p = 500 and n = 200.
| Graph model | Method | Criterion | Sen | Pre | MCC | NMI | No. of clusters |
|---|---|---|---|---|---|---|---|
| Cluster | Threshold | gap-com (perm) | 0.68 | 0.17 | 0.33 | 0.75 | 16 (4.00) |
| Cluster | Threshold | gap-com (E-R) | 0.68 | 0.17 | 0.33 | 0.75 | 16 (4.00) |
| Cluster | Threshold | StARS | 0.57 | | | | 11 (1.00) |
| Cluster | Threshold | PC | | 0.14 | 0.31 | 0.73 | 17 (4.00) |
| Cluster | Threshold | AGNES | 0.66 | 0.19 | 0.34 | 0.76 | 15 (4.00) |
| Star | Threshold | gap-com (perm) | 0.70 | 0.80 | 0.74 | 0.75 | 10 (0.00) |
| Star | Threshold | gap-com (E-R) | 0.69 | 0.80 | 0.74 | 0.75 | 10 (0.00) |
| Star | Threshold | StARS | 0.37 | | 0.58 | 0.65 | 10.50 (1.00) |
| Star | Threshold | PC | | 0.77 | | | 10 (0.00) |
| Star | Threshold | AGNES | 0.52 | 0.87 | 0.67 | 0.69 | 10 (0.25) |
| Scale-free | Threshold | gap-com (perm) | | 0.18 | | | 29 (17.25) |
| Scale-free | Threshold | gap-com (E-R) | 0.41 | 0.19 | | 0.62 | 30 (12.75) |
| Scale-free | Threshold | StARS | 0.22 | | 0.24 | 0.50 | 54 (12.00) |
| Scale-free | Threshold | PC | 0.24 | 0.26 | 0.25 | 0.52 | 50 (13.00) |
| Scale-free | Threshold | AGNES | 0.23 | | 0.24 | 0.51 | 52 (12.25) |
| Random | Threshold | gap-com (perm) | 0.53 | | | 0.17 | 55 (15.00) |
| Random | Threshold | gap-com (E-R) | 0.53 | | | 0.17 | 55 (14.25) |
| Random | Threshold | StARS | 0.52 | | | 0.18 | 57.5 (16.00) |
| Random | Threshold | PC | | 0.01 | 0.05 | | 45 (11.25) |
| Random | Threshold | AGNES | 0.54 | 0.01 | 0.05 | 0.21 | 47.5 (12.00) |
The highest averaged values are boldfaced in each column. The true numbers of communities are 10, 10, 40, and 5 for the cluster, star, scale-free, and random graph models, respectively.
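For readers who want to see how the edge-level columns (Sen, Pre, MCC) relate to a true and an estimated graph, the following sketch computes them from two edge lists using the standard definitions. It is an illustration only; the function name and the tiny example graphs are hypothetical, and NMI and the cluster counts are not covered here.

```python
from math import sqrt

def edge_metrics(true_edges, est_edges, p):
    """Sensitivity, precision and MCC of an estimated undirected graph on p nodes."""
    # Store each undirected edge with its endpoints in sorted order.
    true_set = {tuple(sorted(e)) for e in true_edges}
    est_set = {tuple(sorted(e)) for e in est_edges}
    total_pairs = p * (p - 1) // 2

    tp = len(true_set & est_set)      # edges found correctly
    fp = len(est_set - true_set)      # spurious edges
    fn = len(true_set - est_set)      # missed edges
    tn = total_pairs - tp - fp - fn   # correctly absent edges

    sen = tp / (tp + fn) if tp + fn else 0.0
    pre = tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sen, pre, mcc

# Hypothetical 4-node example: a path graph versus an estimate with one wrong edge.
sen, pre, mcc = edge_metrics([(0, 1), (1, 2), (2, 3)], [(1, 0), (1, 2), (0, 3)], p=4)
print(sen, pre, mcc)
```

Sparse graphs have far more absent than present edges, so MCC, which uses all four counts, is usually the more informative single summary in this setting.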
Averaged model evaluation metrics of binary classification tests (mean), NMI (mean), and the number of clusters [median (IQR) after isolated nodes are removed] over 100 replications from different sparse network models while using BigQuic (BQ) when p = 500 and n = 200.
| Graph model | Method | Criterion | Sen | Pre | MCC | NMI | No. of clusters |
|---|---|---|---|---|---|---|---|
| Cluster | BQ | gap-com (perm) | 0.68 | 0.17 | 0.33 | 0.75 | 16 (3.25) |
| Cluster | BQ | gap-com (E-R) | 0.68 | 0.17 | 0.33 | 0.75 | 16 (5.00) |
| Cluster | BQ | StARS | 0.53 | | | | 10 (1.00) |
| Cluster | BQ | PC | | 0.14 | 0.30 | 0.72 | 18 (4.00) |
| Cluster | BQ | AGNES | 0.65 | 0.19 | 0.34 | 0.76 | 15 (3.50) |
| Star | BQ | gap-com (perm) | 0.60 | 0.80 | 0.68 | | 10 (0.00) |
| Star | BQ | gap-com (E-R) | 0.59 | 0.80 | 0.68 | 0.75 | 10 (0.00) |
| Star | BQ | StARS | | 0.72 | | 0.75 | 10 (0.00) |
| Star | BQ | PC | 0.63 | 0.78 | 0.69 | 0.75 | 10 (0.00) |
| Star | BQ | AGNES | 0.40 | | 0.57 | 0.70 | 10 (1.00) |
| Scale-free | BQ | gap-com (perm) | 0.41 | 0.18 | | | 26.50 (12.00) |
| Scale-free | BQ | gap-com (E-R) | | 0.18 | | | 27 (12.25) |
| Scale-free | BQ | StARS | 0.13 | | 0.20 | 0.37 | 48 (27.00) |
| Scale-free | BQ | PC | 0.36 | 0.21 | 0.24 | 0.57 | 43 (31.25) |
| Scale-free | BQ | AGNES | 0.17 | 0.26 | 0.20 | 0.52 | 53 (18.00) |
| Random | BQ | gap-com (perm) | 0.53 | 0.02 | 0.06 | 0.18 | 55 (22.00) |
| Random | BQ | gap-com (E-R) | 0.53 | 0.02 | 0.06 | 0.18 | 54.50 (23.25) |
| Random | BQ | StARS | 0.48 | | | 0.13 | 53 (23.50) |
| Random | BQ | PC | | 0.01 | 0.05 | | 49 (18.25) |
| Random | BQ | AGNES | 0.50 | | 0.05 | 0.20 | 50 (16.25) |
The highest averaged values are boldfaced in each column. The true numbers of communities are 10, 10, 40, and 5 for the cluster, star, scale-free, and random graph models, respectively.
Nodes among the 100 most connected that correspond to potential transcription factors (TFs) in the DREAM5 S. aureus network, by model selection criterion.
| Criterion | TFs |
|---|---|
| gap-com | SAV1228 |
| StARS | None |
| PC | SAV0044 |
| AGNES | SAV1228, SAV1686, SAV2046 |
The 10 most connected genes (nonzero degree), ranked by degree, in the DREAM5 S. aureus networks selected by each model selection criterion.
| Gene ID (gap-com) | Degree (gap-com) | Gene ID (PC) | Degree (PC) | Gene ID (AGNES) | Degree (AGNES) |
|---|---|---|---|---|---|
| SAV0996 | 915 | SAV1355 | 152 | SAV1215 | 1,524 |
| SAV1215 | 911 | SAV1596 | 135 | SAV0996 | 1,498 |
| SAV2347 | 911 | SAV2596 | 130 | SAV0568 | 1,482 |
| SAV0669 | 903 | SAV0398 | 127 | SAV0669 | 1,477 |
| SAV0952 | 901 | SAV0537 | 124 | SAV1192 | 1,462 |
| SAV0014 | 893 | SAV0069 | 124 | SAV2071 | 1,442 |
| SAV1892 | 890 | SACOL0041 | 123 | SAV2347 | 1,442 |
| SAV1900 | 890 | SACOL0054 | 122 | SAV0602 | 1,438 |
| SAV0654 | 889 | SAV0400 | 121 | SAV0390 | 1,432 |
| SAV2071 | 871 | SAV0407 | 120 | SAV2221 | 1,425 |
The graph selected with StARS is left out because all nodes in the graph had a degree of zero.
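A degree ranking of this kind is straightforward to compute from an edge list. The sketch below is illustrative only: the function name and the gene labels are hypothetical, and ties are broken alphabetically, which is an assumption rather than the authors' rule.

```python
def top_degree_nodes(edges, k=10):
    """Return the k highest-degree nodes as (node, degree) pairs."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Sort by decreasing degree, then by node name to make ties deterministic.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Hypothetical edge list with made-up gene labels.
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g3")]
top = top_degree_nodes(edges, k=2)
print(top)
```

Nodes that appear in no edge never enter the dictionary, so a graph in which every node has degree zero yields an empty ranking, consistent with the StARS graph being omitted from the table.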
Fig. 4. The gap-com statistic used to determine the optimal value of the tuning parameter for the sparse network estimate of the DREAM5 S. aureus network (the vertical dashed line). The gap-com statistic is maximized at a tuning parameter value that results in 137 clusters in the network. Of these identified clusters, 4 contain more than a single node. The standard errors are illustrated with horizontal lines around the corresponding values of the gap-com statistic.