Wouter Saelens, Robrecht Cannoodt, Yvan Saeys.
Abstract
A critical step in the analysis of large genome-wide gene expression datasets is the use of module detection methods to group genes into co-expression modules. Because of limitations of classical clustering methods, numerous alternative module detection methods have been proposed, which improve upon clustering by handling co-expression in only a subset of samples, modelling the regulatory network, and/or allowing overlap between modules. In this study we use known regulatory networks to do a comprehensive and robust evaluation of these different methods. Overall, decomposition methods outperform all other strategies, while we do not find a clear advantage of biclustering and network inference-based approaches on large gene expression datasets. Using our evaluation workflow, we also investigate several practical aspects of module detection, such as parameter estimation and the use of alternative similarity measures, and conclude with recommendations for the further development of these methods.
Year: 2018 PMID: 29545622 PMCID: PMC5854612 DOI: 10.1038/s41467-018-03424-4
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Overview of our evaluation methodology. a The nine different datasets used in this evaluation. b We used three different module definitions to extract known modules from known regulatory networks for the evaluation on E. coli, yeast and synthetic data. c To avoid parameter overfitting on characteristics of particular datasets, we first optimized the parameters on every dataset using a grid search, and then used the optimal parameters on one dataset (training score) to assess the performance of a method on another dataset (test score). d We evaluated a total of 42 methods, which can be classified into five categories: clustering, biclustering, direct network inference (NI), decomposition, and iterative NI. e For the evaluation on human data, we compared how well the targets of each regulator are enriched in at least one of the modules. f We used four different regulatory networks in our evaluation, each generated from different types of data
Module detection methods evaluated in this study
Clustering
| A | FLAME: fuzzy clustering by selecting cluster supporting objects based on the K-nearest neighbor density estimation |
| B | K-medoids: iteratively refines the centers (which are individual genes) and the average dissimilarity within the cluster |
| C | K-medoids (see B) but with automatic module number estimation |
| D | Fuzzy c-means: similar to k-means (see F), but using fuzzy instead of crisp cluster memberships |
| E | Self-organizing maps: maps each gene on a node embedded in a two-dimensional graph structure |
| F | K-means: iteratively refines the mean expression within each cluster so as to minimize the within-cluster sum of squares |
| G | MCL: simulates random walks within the co-expression graph by alternating steps of expansion and inflation |
| H | Spectral clustering: applies K-means in the subspace defined by the eigenvectors of the Pearson’s correlation affinity matrix |
| I | Affinity propagation: clustering by exchange of messages between genes |
| J | Spectral clustering: applies K-means in the subspace defined by the eigenvectors of the K-nearest-neighbor graph |
| K | Transitivity clustering: tries to find the transitive co-expression graph in which the total cost of added and removed edges is minimized |
| L | WGCNA: agglomerative hierarchical clustering (see M), but using the topological overlap measure and a dynamic tree cutting algorithm to implicitly determine the number of modules |
| M | Agglomerative hierarchical clustering: generates a hierarchical structure by progressively grouping genes and clusters based on their similarity |
| N | Hybrid hierarchical clustering: combination of agglomerative and divisive hierarchical clustering |
| O | Divisive hierarchical clustering: generates a hierarchical structure by progressively splitting the genes into clusters |
| P | Agglomerative hierarchical clustering (see M), but with automatic module number estimation |
| Q | SOTA: combination of self-organizing maps and divisive hierarchical clustering |
| R | First finds cluster centers by searching for high-density regions; each gene is then assigned to the cluster of its nearest neighbor of higher density |
| S | CLICK: uses density estimation to find tight groups of similar genes, after which these are expanded into modules |
| T | DBSCAN: partitions genes into core, non-core, and outlier genes based on the number of neighbors |
| U | Clues: first applies a shrinking procedure which moves each gene towards nearby high-density regions, after which the genes are partitioned into an automatically determined number of clusters using the silhouette width |
| V | Mean shift: moves each gene towards nearby high density regions until convergence |
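The shared recipe behind the clustering methods above is to partition the rows (genes) of a genes × samples expression matrix. A minimal sketch with k-means (method F) using scikit-learn; the random matrix and the choice of k = 5 are purely illustrative, not the benchmarked setup:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# toy expression matrix: 100 genes x 20 samples (random, illustrative only)
expr = rng.normal(size=(100, 20))

# cluster genes (rows) into k modules; for k-means, k must be chosen by the user
k = 5
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(expr)

# each module is the set of gene indices sharing a cluster label
modules = [np.flatnonzero(labels == m) for m in range(k)]
```

Methods marked with automatic module number estimation (e.g., C, P, U) remove the need to fix k in advance, which Fig. 3 shows matters in practice.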
Decomposition
| A | Independent component analysis: decomposes the expression matrix into a set of independent components using the FastICA algorithm, detects potentially overlapping modules within each source signal using false-discovery rate (FDR) estimation |
| B | Similar to A, but detects two modules per independent component depending on whether genes have positive or negative weights |
| C | Similar to A, but detects modules within each source signal using |
| D | Combination of principal component analysis and independent component analysis, uses FDR estimation to find modules |
| E | Principal component analysis: decomposes the expression matrix into a set of linearly uncorrelated components, detects potentially overlapping modules within each component using FDR estimation |
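The decomposition methods all factor the expression matrix into components and then extract genes with extreme weights in each component. A sketch of the ICA variant (method A) using scikit-learn's FastICA; note the paper's methods use FDR estimation to choose the cutoff, whereas this sketch substitutes a simple, hypothetical z-score threshold of 2:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# toy expression matrix: 200 genes x 30 samples (random, illustrative only)
expr = rng.normal(size=(200, 30))

# decompose into independent components; each gene gets a weight per component
n_components = 5
ica = FastICA(n_components=n_components, random_state=0)
S = ica.fit_transform(expr)  # shape: (genes, components)

# illustrative stand-in for FDR estimation: keep genes with |z| > 2 per component;
# because thresholding is done per component, modules may overlap
z = (S - S.mean(axis=0)) / S.std(axis=0)
modules = [np.flatnonzero(np.abs(z[:, c]) > 2.0) for c in range(n_components)]
```

Method B would instead split each component into two modules, one for strongly positive and one for strongly negative weights.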
Biclustering
| A | Spectral biclustering: detecting checkerboard patterns within the gene expression matrix |
| B | ISA: iteratively refines a set of genes and samples based on high or low expression in both the gene and sample dimension |
| C | QUBIC: finds biclusters in which the genes have similar high or low expression levels in a discretized expression matrix |
| D | Bi-Force: finds biclusters with over- or under-expression by solving the bicluster editing problem |
| E | FABIA: builds a multiplicative model of the expression matrix layer by layer. Every layer represents a bicluster |
| F | Plaid: builds an additive model of the expression matrix layer by layer. Every layer represents a bicluster |
| G | MSBE: finds additive biclusters starting from randomly sampled reference genes and conditions |
| H | Cheng & Church: minimizes the mean squared residue within every bicluster |
| I | OPSM: searches for biclusters where the expression changes in the same direction between genes and samples |
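Unlike clustering, biclustering groups genes and samples simultaneously. A sketch of spectral biclustering (method A) with scikit-learn, which looks for the checkerboard pattern the description mentions; the implanted block and cluster counts are toy choices, not the benchmarked configuration:

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering

rng = np.random.default_rng(0)
# toy positive expression matrix: 60 genes x 24 samples
expr = rng.uniform(low=1.0, high=2.0, size=(60, 24))
expr[:20, :8] += 3.0  # implant one over-expressed gene/sample block

# fit a checkerboard model with 3 gene groups x 2 sample groups
model = SpectralBiclustering(n_clusters=(3, 2), random_state=0)
model.fit(expr)

gene_labels = model.row_labels_       # bicluster row-group per gene
sample_labels = model.column_labels_  # bicluster column-group per sample
```

Each bicluster is then the intersection of one gene group and one sample group, so a gene's module membership depends on only a subset of samples.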
Iterative network inference
| A | MERLIN: iteratively refines a direct regulatory network and modules within a probabilistic graphical network framework |
| B | Genomica: starts from an initial hierarchical clustering and iteratively refines this clustering and an inferred module network using a model based on Bayesian regression trees |
Direct network inference
| A | GENIE3: predicts the expression of each target gene based on random forest regression |
| B | CLR: scores each mutual information estimate against the background distribution of its network neighborhood |
| C | Pearson’s correlation between regulator and target gene |
| D | TIGRESS: network inference using a combination of Lasso sparse regression and stability selection |
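The simplest direct NI strategy above (method C) scores each regulator–target edge by Pearson correlation and keeps the strongest edges. A minimal sketch on toy data; the regulator indices and the 0.5 cutoff are hypothetical, and the benchmarked pipelines additionally extract modules from the resulting network:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 50, 40
# toy expression matrix (random, illustrative only)
expr = rng.normal(size=(n_genes, n_samples))
regulators = [0, 1, 2]  # hypothetical indices of known regulators

# Pearson correlation between every pair of genes (rows)
corr = np.corrcoef(expr)  # shape: (n_genes, n_genes)

# predicted targets per regulator: genes exceeding an absolute-correlation cutoff
edges = {
    r: np.flatnonzero((np.abs(corr[r]) > 0.5) & (np.arange(n_genes) != r))
    for r in regulators
}
```

GENIE3 and TIGRESS replace the correlation score with a regression-based importance, and CLR replaces it with a background-corrected mutual information score, but all produce the same kind of weighted regulator–target edge list.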
Within each category, methods are ranked according to their average test score (Fig. 2). We refer the reader to Supplementary Note 2 for details regarding the implementations and parameters.
Fig. 2 Overall performance of 42 module detection methods (Table 1) based on the agreement between observed modules and known modules in gene regulatory networks. The methods can be divided into five categories: clustering, decomposition, biclustering, direct network inference (direct NI) and iterative network inference (iterative NI) methods. Clustering and biclustering methods were further classified into subcategories (see Methods). a Average test and training scores across datasets and module definitions. The score represents a fold improvement over permutations of the known modules. *Automatic estimation of number of modules. b Different properties of the module detection methods (see Supplementary Note 2). A + (green background) denotes that a method can handle a certain property listed on the left. We distinguish between explicit (−), implicit (±), and automatic (+) module number estimation. Note that running times strongly depend on the implementation, hardware, dataset dimensions, and parameter settings, and are therefore only indicative. c Test scores on each of the four datasets, averaged over module definitions. d Test scores for each of the three module definitions, averaged over the different datasets
Fig. 3 Effect of automatic parameter estimation using four different cluster validity indices and two measures based on functional enrichment on the performance of top module detection methods. Shown are changes in test scores after parameter estimation (either using measures based on functional enrichment in blue or cluster validity indices in red–orange), averaged over datasets and module definitions, of the top module detection methods in every category
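To make concrete what parameter estimation with a cluster validity index looks like, here is a sketch that picks the number of modules for k-means by maximizing the silhouette width, one of the four indices compared in Fig. 3; the data and the search grid of 2–8 are toy choices:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# toy expression matrix: 120 genes x 15 samples (random, illustrative only)
expr = rng.normal(size=(120, 15))

# evaluate a grid of candidate module numbers with the silhouette width
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(expr)
    scores[k] = silhouette_score(expr, labels)

# keep the k with the highest average silhouette width
best_k = max(scores, key=scores.get)
```

The functional-enrichment measures in Fig. 3 follow the same grid-search pattern but score each candidate clustering by enrichment of annotation terms instead of by cluster geometry.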
Fig. 4 Influence of the number of samples on the performance of the top module detection methods. Shown are average training scores (left) and test scores (right) over all datasets and module definitions at different levels of random subsampling (five repeats)
Fig. 5 Practical guidelines for module detection in gene expression data. Module detection in gene expression data has three main applications (left; panel a). For each application, we suggest different module detection methods (b), which in turn influences the way parameters are estimated (c), how the modules can be visualized (d), and how they can be functionally interpreted (e)