
Connecting the dots: The boons and banes of network modeling.

Abstract

Network modeling transforms data into a structure of nodes and edges such that edges represent relationships between pairs of objects, then extracts clusters of densely connected nodes in order to capture high-dimensional relationships hidden in the data. This efficient and flexible strategy holds potential for unveiling complex patterns concealed within massive datasets, but standard implementations overlook several key issues that can undermine research efforts. These issues range from data imputation and discretization to correlation metrics, clustering methods, and validation of results. Here, we enumerate these pitfalls and provide practical strategies for alleviating their negative effects. These guidelines increase prospects for future research endeavors as they reduce type I and type II (false-positive and false-negative) errors and are generally applicable for network modeling applications across diverse domains.

Entities: Chemical

Keywords: clustering; community detection; correlation; gene co-expression analysis; high-dimensional patterns; network analysis

Year: 2021 PMID: 34950902 PMCID: PMC8672149 DOI: 10.1016/j.patter.2021.100374

Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899

Introduction

Humans have aspired to infer knowledge by collecting and analyzing data for millennia. Such works include a data table created by an ancient Sumerian scientist c. 2000 BCE, which included row and column headers and delineated information for a number of animals. As the size of our global datasphere approaches 100 zettabytes, researchers in virtually every domain strive to harvest valuable information buried in a deep ocean of numerical and categorical data. Monumental data analysis advances have been achieved using machine learning, statistical, and operations research methods, yet accurately capturing complex patterns continues to challenge progress due to multiple factors. Key impediments include subtle assumptions underlying data analysis methods that may compromise outcomes and the sheer size of the search space: identification of high-dimensional patterns in data is inherently difficult due to the combinatorial explosion of the number of possible patterns (Table 1).

Network modeling, also known as community detection, has emerged as a leading strategy in this quest due to its scalability, flexibility, and ability to capture relationships of any order. In this realm, a dataset is modeled as a network composed of nodes representing objects and edges representing relationships between the objects (Figure 1B). In general, the edges can be directed to capture asymmetric relationships. In this article we are interested in undirected pairwise relationships and clustering methods based on undirected edges, so we focus on symmetric relationships. Network analyses typically involve data pre-processing, computation of pairwise relationships, network construction, identification of clusters/communities within the network, and validation of results (Figure 1A).

Table 1

Example of combinatorial explosion

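Size	1	2	3	4	k
No. of combinations	$n$	$\binom{n}{2} = \frac{n^2 - n}{2}$	$\binom{n}{3} = \frac{n^3 - 3n^2 + 2n}{6}$	$\binom{n}{4} = \frac{n!}{4!\,(n-4)!}$	$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$
n = 1,000,000	1,000,000	499,999,500,000	1.7 × 10¹⁷	4.2 × 10²²	$\frac{1{,}000{,}000!}{k!\,(1{,}000{,}000-k)!}$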

Shown are the number of unique combinations for patterns comprising 1, 2, 3, 4, and k objects drawn from n objects, along with an example for a dataset with n = 1,000,000 objects.
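The growth described in Table 1 can be checked with exact integer arithmetic. The following is a minimal sketch using Python's standard `math.comb`; the helper name `pattern_counts` is a hypothetical name introduced here for illustration and is not part of the article.

```python
from math import comb

def pattern_counts(n, sizes=(1, 2, 3, 4)):
    """Return {k: number of unique k-object patterns drawn from n objects}."""
    return {k: comb(n, k) for k in sizes}

# Example dataset size used in Table 1.
counts = pattern_counts(1_000_000)
for k, c in counts.items():
    print(f"pattern size {k}: {c:.2e} combinations")
```

Because `math.comb` works on arbitrary-precision integers, the same call can be used for larger k, although the counts quickly exceed anything that could be enumerated.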

Figure 1

Network modeling examples

(A) Typical steps in a network analysis.

(B) An example Facebook network (left) and gene co-expression network (right). For the Facebook network, each node represents a Facebook friend of a given individual, and an edge is placed between two nodes if the corresponding individuals are Facebook friends. For the gene co-expression network, nodes represent genes, and edges are placed between two genes that exhibit correlated expression across a set of individuals.

(C) Four example network modeling applications. “Hub nodes” are nodes with exceptionally high degree.

Figure 1B illustrates Facebook and gene co-expression networks and Figure 1C describes characteristics for these networks, along with networks representing warehouse order picking and weather prediction. These examples illustrate the versatility of network modeling and show how real-world problems can be transferred to a network structure. The Facebook network is a case in point of the strengths of network modeling. The input data are simply a list of an individual's Facebook friends and a list of pairs of these individuals that are Facebook friends with each other. Once these data are transformed into a network, clusters spontaneously arise. The numerous intra-cluster edges within a cluster indicate a high-ordered relationship and are the basis of the “guilt-by-association” postulation in this domain. The transitivity assumption is at the heart of network modeling and provides the mechanism to infer high-ordered relationships from simple pairwise information. 
Network modeling is capable of efficiently capturing high-ordered relationships, yet each step, from data pre-processing to validation of results, holds subtle impediments that arise due to intrinsic and extrinsic characteristics that may confound research progress. Here, we examine benefits and encumbrances of network modeling and demonstrate these characteristics in a popular application domain, gene co-expression analysis.4, 5, 6, 7, 8, 9, 10, 11 A brief description of this application follows.

Example network modeling problem: Gene co-expression analysis

A vigorous application domain for network modeling is gene co-expression analysis, which explores gene expression level data to identify patterns of genes that are synchronously expressed within one group of individuals more than another (Figure 1C).12, 13, 14, 15 Complex traits, such as disease states, arise due to aberrant biological pathways, many of which are not well understood. For example, the characteristic plaques that are hallmarks of late-onset Alzheimer disease are composed of amyloid-β that is being overproduced, misfolded, and/or ineffectively cleared.16, 17, 18 Identification of the deviant pathways underlying such processes facilitates understanding of the pathogenesis of diseases, revelation of unknown genetic functions, and recognition of potential drug targets. Given expression levels of genes for a group of affected cases and a group of normal controls, the goal is to find patterns of co-expressed genes that appear significantly more often in one group than the other. Note that each individual gene may have similar mean levels in both groups. The challenge is twofold. First, synchronized patterns of multiple, perhaps hundreds of, genes that are co-expressed within individuals must be extracted. Second, if an association with a trait is pursued, the percentages of individuals carrying the synchronized genetic pattern must be significantly different between the two groups. Exhaustive enumeration is not feasible due to the combinatorial explosion (Table 1). Gene co-expression analysis typically casts genes as nodes and places edges between pairs of genes that exhibit correlated expression across the individuals (Figure 1C). Clusters of co-expressed genes are identified and then evaluated for potential interactions and/or associations with the trait of interest. The organization of this article follows the steps usually taken for network modeling, with caveats for each step highlighted and potential remedies presented. 
We begin with data pre-processing, then discuss pairwise relationship computations, network construction, clustering, and validation. A brief discussion concludes the article.

Data pre-processing

Due to the massive size of most datasets of interest, it is not possible to manually inspect data before starting an analysis. Typos and improperly formatted data can silently sabotage a study, so it is important that software packages exit with meaningful error messages when such problems are encountered. Furthermore, outliers and missing data hold potential to quietly distort results. In general, data cleaning is challenging, and many steps are domain specific. Here, we consider matters of general concern for network modeling: missing data and discretization, the latter of which mitigates the effect of outliers.

Missing data

Missing data reduce power and potentially may lead to spurious correlations. Furthermore, some downstream analyses may require complete data. An approach that is receiving increasing popularity is data imputation, whereby the missing values are imputed based upon information drawn from the data. A wide range of methods have been developed, from simply replacing the missing values with the mean or median, to sophisticated methods designed to minimize the root-mean-squared error.20, 21, 22 Local methods, such as K-nearest neighbors (KNNimpute) and local least squares (LLSimpute), identify similar objects via correlation metrics or Euclidean distance, to infer missing values. Global methods, such as Bayesian principal component analysis (BPCA), disassemble the data and impute while rebuilding it. Classical methods, such as expectation maximization (EMimpute), utilize incremental refinements while iteratively maximizing likelihood. In general, these sophisticated methods outperform replacement with mean or median when assessed using the root-mean-squared error of the imputed values with the true values. However, this improvement may come with a cost for subsequent analyses which rely on correlations within the data. We next discuss three studies that investigate the impact of imputation error on downstream analyses. Souto et al. ran a series of trials to assess the impact of the four aforementioned imputation methods on downstream analyses. Using 12 cancer gene expression datasets, they imputed values with each method and then evaluated results for three network clustering algorithms. Interestingly, they observed that simply replacing values with the mean or median held similar performance as the four more elaborate techniques. 
They suggested that this observation may be due to the fact that clusters of co-expressed genes tend to be highly correlated and are likely to have some genes with no missing data, hence high accuracy of imputed values is not critical in downstream analyses. We propose an alternative viewpoint. A key stumbling block for data imputation prior to network modeling is that error in the imputations is not random for approaches that use correlations, such as KNNimpute, LLSimpute, BPCA, and EMimpute. When relationships within the data are used, exceptions to the trends are erroneously replaced with values that match the observed patterns. These biased errors can falsely boost pairwise relationships that are used to create edges for the network. In short, while the overall root-mean-squared error may be lower when one of these methods is utilized as opposed to simply using the mean or median, the inaccuracies that do arise tend to increase downstream correlation values and false-positive errors.

A second study focused on the effects of imputation on an analysis of questionnaire data based on stress and health for older adults. This 20-page survey instrument included questions for computing scores for symptoms of depression, anxiety, and self-assessed health. A set of 96 cases with no missing data had the computed score for symptoms of depression removed from the data, along with 19.5% of data points used to compute this score. The missing data were imputed using simple regression (SR), regression with added error term (RET), and expectation maximization (EM), and the imputed score for depression symptoms was obtained. The correlations between the imputed depression score and three of the variables included in the score calculations (sex, age, and self-assessed health) were computed for the original data and for the data following the three imputation methods. 
The authors also computed correlations between the depression score and two scores not included in the imputations: anxiety and functional health. While these two scores had strong correlations with the depression scores (p ≤ 0.001 for anxiety and p ≤ 0.01 for functional health) in the original data, the imputed depression scores exhibited dramatic differences. EM showed high significance (p ≤ 0.01) in the opposite direction for anxiety and SR showed high significance (p ≤ 0.05) in the opposite direction for functional health. The three variables with imputed values (sex, age, and self-assessed health) did not exhibit this type of reversed correlation. The correlation between depression scores and sex for EM imputation was similar to the original score, while SR and RET failed to capture any significant correlation. Both age and self-assessed health demonstrated strong inflation of the correlation. Age was not correlated with depression for the original data and was significantly correlated for RET (p ≤ 0.05) and EM (p ≤ 0.01) in the imputed data. Self-assessed health was significantly anti-correlated (p ≤ 0.05) with depression in the original data, uncorrelated for RET, and jumped to very strong anti-correlation for both SR and EM (p ≤ 0.001). In short, the variables with imputed values tended to boost correlation values, while those without imputations exhibited unpredictable correlations with the imputed depression score.

The third study examined the effects of imputation on mass spectrometry data taken across various tissues. For each tissue, data values were imputed using seven different imputation methods (half minimum substitution, mean substitution, k-nearest neighbors, local least-squares regression, BPCA, singular value decomposition, and random forest). Following imputations at levels of missingness ranging from 10% to 50%, correlations between the matrices were computed and MANOVA trials were run. The authors observed two primary outcomes. 
First, the magnitude of the pairwise inter-matrix correlations declined and in some cases reversed in direction. Presumably this is due to erroneous inflation of the correlation patterns within each of the matrices induced by the imputations. Second, the number of false-positive errors in the MANOVA tests increased in accordance with the level of missingness for all seven imputation methods. In summary, data imputation methods that draw on patterns that exist in the known data points tend to reinforce these relationships, thereby inflating correlation structures, and hold potential to produce false-positive edges in network models. Data imputation may be more useful in approaches that are not based on network construction. For example, genome-wide association studies, whereby each genetic marker is directly analyzed for association with a trait and the relationships between the markers are not computed, may be more resilient to bias in imputation error. In lieu of imputation, a common approach is to remove any rows or columns of the data table with excessive missing data values. The downside is that a lot of known data points are lost in this process. It should be noted that starting with a relaxed threshold for missingness and iteratively removing rows and columns in an alternating fashion while gradually tightening the threshold often leads to higher data retention than applying the target threshold to all rows and columns simultaneously. We use an Alzheimer disease gene expression dataset generated by Amanda Myers' lab to demonstrate. These data include expression levels for 8,650 genes drawn from 363 individuals' postmortem brain. Directly cleaning to a maximum of 5% missing values for all individuals and for all genes eliminates 1,243 genes and 46 individuals. On the other hand, using the iterative procedure while striving to retain individuals eliminates 1,219 genes and 6 individuals. 
We offer open-source software for facilitating this iterative process at www.blocbuster.org. Another consideration is the relative distribution of the missing values between the two objects being measured for a relationship, as will be presented in the next section.
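The iterative cleaning idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the software offered at www.blocbuster.org: the function names `direct_clean` and `iterative_clean`, the starting threshold of 50%, the linear tightening schedule, and the choice to filter rows before columns in each pass are all hypothetical. Missing values are assumed to be encoded as NaN.

```python
import numpy as np

def direct_clean(data, target=0.05):
    """Apply the target missingness threshold to all rows and columns at once."""
    missing = np.isnan(data)
    rows = np.flatnonzero(missing.mean(axis=1) <= target)
    cols = np.flatnonzero(missing.mean(axis=0) <= target)
    return rows, cols

def iterative_clean(data, target=0.05, start=0.50, steps=10):
    """Alternate row and column removal while gradually tightening the threshold."""
    rows = np.arange(data.shape[0])
    cols = np.arange(data.shape[1])
    for thr in np.linspace(start, target, steps):  # relaxed -> target
        changed = True
        while changed and rows.size and cols.size:
            # Drop rows that exceed the current threshold.
            missing = np.isnan(data[np.ix_(rows, cols)])
            new_rows = rows[missing.mean(axis=1) <= thr]
            if new_rows.size == 0:
                rows = new_rows
                break
            # Re-evaluate columns on the surviving rows only.
            missing = np.isnan(data[np.ix_(new_rows, cols)])
            new_cols = cols[missing.mean(axis=0) <= thr]
            changed = new_rows.size < rows.size or new_cols.size < cols.size
            rows, cols = new_rows, new_cols
    return rows, cols

# Toy table: one row is entirely missing, everything else is observed.
toy = np.ones((4, 4))
toy[0, :] = np.nan
direct_rows, direct_cols = direct_clean(toy)
iter_rows, iter_cols = iterative_clean(toy)
print("direct keeps", direct_rows.size, "rows and", direct_cols.size, "columns")
print("iterative keeps", iter_rows.size, "rows and", iter_cols.size, "columns")
```

In the toy table, the single empty row makes every column 25% missing, so the direct approach discards all columns, whereas the iterative approach removes the empty row first and then finds every column complete. Swapping the order of the row and column steps would favor retention of the other dimension.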

Discretization

Discretization of data values, whereby real values are binned into a set of discrete values such as low, average, and high, is performed in many analyses. Such techniques can facilitate computations, tolerate differences in scales across objects, and eliminate outlier concerns. However, the choice of discretization thresholds may have dramatic effects on results. When re-running entire analyses using different discretization thresholds is impractical, it is advisable to check the sensitivity of the results using different thresholds. While network analysis methods may benefit from the use of discretized data, it may be practical to assess the results found using the original continuous-valued data. When this type of validation is conducted, outliers should be carefully treated using an appropriate method that accounts for specific intricacies arising in the given research area. For example, in the gene expression domain, rare genetic variants can yield outlier gene expression values that are indeed biologically relevant.
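A simple way to probe threshold sensitivity is to discretize the same values under two threshold choices and count how many labels change. The sketch below assumes quantile-based thresholds, which the article does not prescribe; the function name `discretize`, the quantile pairs, and the toy values are hypothetical. Missing values (NaN) fall into the middle bin in this sketch, which is a simplification.

```python
import numpy as np

def discretize(values, low_q=0.25, high_q=0.75):
    """Bin values into -1 (low), 0 (average), +1 (high) using quantile thresholds."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanquantile(values, [low_q, high_q])
    labels = np.zeros(values.shape, dtype=int)
    labels[values < lo] = -1
    labels[values > hi] = 1
    return labels

# Toy array with one extreme outlier (1000).
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
labels_quartile = discretize(toy_values, 0.25, 0.75)
labels_decile = discretize(toy_values, 0.10, 0.90)
n_changed = int(np.sum(labels_quartile != labels_decile))
print(labels_quartile, labels_decile, n_changed)
```

The outlier is simply labeled "high" under both settings, which illustrates how discretization tempers outliers, while four of the ten labels change between the two threshold choices, which illustrates why a sensitivity check is worthwhile.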

Pairwise relationship calculations

After pre-processing data, pairwise relationships are computed to generate edges in the network. The number of computations to assess all pairs is equal to $(n^2 - n)/2$, where n is the number of objects. Given an efficient algorithm and adequate resources, this number is feasible for many datasets of interest. When the computation time is too burdensome, these independent pairwise computations can be run in parallel across many processors, and cloud services are readily available for such tasks. As illustrated in Figure 1C, edges may be binary or carry a discrete or real-valued weight. The Facebook network example includes binary edges, where an edge exists if the individuals are Facebook friends and does not exist otherwise. Most network models of interest require a more complex evaluation of pairwise relationships. Similarity or correlation measures computed across arrays of values representing each object, such as Euclidean distance or Pearson's correlation coefficient (PCC), are commonly utilized, but some applications may require a domain-specific relationship computation. We discuss four challenges regarding this step: subset heterogeneity, sample size, spurious correlations, and edge retention.
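As a rough illustration of the all-pairs computation, the sketch below enumerates every unordered pair of objects and computes PCC with `numpy.corrcoef`. The generator name `pairwise_pcc` and the toy matrix are hypothetical; rows are assumed to be objects and columns attribute values, with no missing data.

```python
from itertools import combinations
import numpy as np

def pairwise_pcc(data):
    """Yield (i, j, r) for every unordered pair of rows in `data`."""
    for i, j in combinations(range(data.shape[0]), 2):
        r = np.corrcoef(data[i], data[j])[0, 1]
        yield i, j, r

# Four toy objects, each with four attribute values.
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0],
                 [4.0, 3.0, 2.0, 1.0],
                 [1.0, 3.0, 2.0, 4.0]])
pairs = list(pairwise_pcc(data))
print(len(pairs), "pairs for", data.shape[0], "objects")
```

Each pair is computed independently of the others, so the list of index pairs could be split into chunks and dispatched to separate workers when the number of objects is large.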

Subset heterogeneity

Many network modeling domains exhibit subset heterogeneity, and such heterogeneity should be addressed by the correlation metric utilized. Examples of subset heterogeneity include different weather patterns preceding a common severe weather event and subtypes of diseases, such as breast cancer, in which different biological pathways are manifesting a shared cancer phenotype. Not only is it valuable to tease out these different subgroups to increase weather prediction accuracy and facilitate precision medicine, but failure to account for this heterogeneity can also yield false-negative correlations, as shown in Figure 2A. Prominent correlation measures, such as PCC and Euclidean distance, return a single scalar value that must account for the correlation over all of the data points in the arrays. This is problematic because, when heterogeneity exists, one subgroup may exhibit high correlation while there is no reason to expect other subgroups to hold any correlation, and this lack of correlation tends to weaken the correlation score. The only correlation measures that we are aware of that account for subset heterogeneity are Hamming distance and its variants, and the two vector-based correlation measures that we have introduced: the custom correlation coefficient for single-nucleotide polymorphism data and Duo for general real-valued data.
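The dilution effect can be reproduced with a small constructed example. The arrays below are hypothetical values chosen for illustration (they are not the data behind Figure 2A), and the sketch only demonstrates how PCC weakens; it does not implement Duo or the custom correlation coefficient.

```python
import numpy as np

# First five positions: perfectly anti-correlated subgroup.
# Last five positions: constructed to have zero correlation.
a = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
b = np.array([5, 4, 3, 2, 1, 2, 5, 3, 1, 4], dtype=float)

r_full = np.corrcoef(a, b)[0, 1]            # over all ten positions
r_first = np.corrcoef(a[:5], b[:5])[0, 1]   # correlated subgroup only
r_second = np.corrcoef(a[5:], b[5:])[0, 1]  # uncorrelated subgroup only
print(f"full: {r_full:.2f}, subgroup 1: {r_first:.2f}, subgroup 2: {r_second:.2f}")
```

The subgroup with a perfect relationship is reported at only half strength when PCC is computed over the full arrays, so a fixed edge threshold could easily discard the pair.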

Figure 2

Subset heterogeneity, effective sample size, and permutation examples

Examples for pairs of objects, each with ten attribute values. Red upward arrow, dash, and blue downward arrow indicate high, neutral, and low data values, respectively. An “×” indicates missing data value.

(A) The first five attribute values are perfectly correlated for objects A and B, while the other five are not correlated at all. Such a situation may be expected in the presence of subset heterogeneity. The absolute value of Pearson's correlation coefficient is only 0.44 due to the uncorrelated values. Duo returns a high score of 0.80 for the high/low relationship and low scores for high/high, low/high, and low/low relationships.

(B) Objects C, D, E, and F each have 20% missing data. When computing a pairwise correlation measure for objects C and D, 40% of the value pairs contain missing values and do not contribute to the score. On the other hand, only 20% of the value pairs contain missing values for objects E and F.

(C) A′ and B′ represent random permutations of objects A and B, respectively. Each object retains the same values while the inherent correlation between A and B is broken up.


Sample size

Inadequate sample size increases the likelihood of observing spurious correlations and false-positive signals. Spurious correlations generally fall into two categories: those that arise from an indirect relationship and those that arise by mere chance. The first of these types can be expected in network analysis. For example, two genes may be exhibiting high expression together due to an underlying biological condition. Here, we consider spurious correlations that arise by mere chance. In general, a sample size that will adequately diminish spurious correlations can be difficult to correctly ascertain, as it is highly dependent upon the properties of the given dataset and the correlation metric employed. Moreover, for a given sample size and correlation metric, the effective sample size can be reduced due to missing data, with the reduction being dependent upon the relative locations of the missing data values. For example, consider PCC. This popular correlation measure is based on the covariance of the two arrays divided by the product of the standard deviations for the arrays. It should be noted that the percentage of missing values for each object is only a lower bound on the level of missing values used in correlation computations, as they range from the maximum percentage of the two objects to the sum of the percentages for the two objects, as shown in Figure 2B. In essence, the effective sample size can vary between each pair of objects. It is desirable for software to report warnings when the effective sample size drops below a given threshold, yet such features are rare.
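The dependence of effective sample size on where the missing values fall can be made concrete as follows. This is a minimal sketch assuming missing values are encoded as NaN; the function names, the warning threshold `min_n`, and the toy arrays (loosely patterned after the situation in Figure 2B, not its actual values) are hypothetical.

```python
import warnings
import numpy as np

def effective_sample_size(x, y):
    """Count positions at which both arrays have an observed value."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return int(np.sum(~np.isnan(x) & ~np.isnan(y)))

def pcc_with_warning(x, y, min_n=8):
    """PCC over jointly observed positions, warning when few positions remain."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    both = ~np.isnan(x) & ~np.isnan(y)
    n_eff = int(both.sum())
    if n_eff < min_n:
        warnings.warn(f"effective sample size {n_eff} is below {min_n}")
    return np.corrcoef(x[both], y[both])[0, 1], n_eff

nan = np.nan
# Each object has 20% missing values (2 of 10).
obj_c = [nan, nan, 3, 4, 5, 6, 7, 8, 9, 10]   # missing at positions 0, 1
obj_d = [1, 2, nan, nan, 5, 6, 7, 8, 9, 10]   # missing at positions 2, 3
obj_e = [nan, nan, 3, 4, 5, 6, 7, 8, 9, 10]   # missing at positions 0, 1
obj_f = [nan, nan, 4, 3, 6, 5, 8, 7, 10, 9]   # missing at positions 0, 1

n_cd = effective_sample_size(obj_c, obj_d)  # disjoint gaps: 40% of pairs lost
n_ef = effective_sample_size(obj_e, obj_f)  # overlapping gaps: 20% of pairs lost
print(n_cd, n_ef)
```

Reporting `n_eff` alongside each correlation value, as `pcc_with_warning` does, lets downstream steps treat edges built on few jointly observed values with appropriate caution.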

Spurious correlations

In addition to inadequate effective sample size, spurious correlations can arise due to characteristics of the data and the algorithm employed. An agile approach to dynamically test for these errors is to run permutation trials for each pair of objects, thereby testing the null hypothesis for the given pair. For each correlation measurement above a given threshold, the corresponding pair of objects has their values permuted as shown in Figure 2C for an adequately large number of trials (e.g., 1,000). These permutations break up inherent correlations that might exist while retaining sample size and other statistical properties of each array, such as median and variance, as they are composed of exactly the same values but in different relative ordering. Correlation is measured over the permuted arrays and sorted to yield a p value for the degree of correlation for the array pair.
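A per-pair permutation trial along these lines might look like the sketch below. PCC is used as a stand-in for whatever correlation measure is employed, both arrays are shuffled in each trial as in Figure 2C, and the add-one adjustment in the returned p value is a common convention rather than something specified in the article. The function name, the fixed seed, and the toy arrays are hypothetical.

```python
import numpy as np

def permutation_p_value(x, y, n_trials=1000, seed=0):
    """Estimate a p value for |PCC(x, y)| by permuting both arrays."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_trials):
        # Shuffling keeps each array's values but breaks their alignment.
        r = abs(np.corrcoef(rng.permutation(x), rng.permutation(y))[0, 1])
        if r >= observed:
            hits += 1
    # Add-one adjustment avoids reporting a p value of exactly zero.
    return (hits + 1) / (n_trials + 1)

x_demo = np.arange(20.0)
y_demo = 2.0 * x_demo + 1.0   # perfectly linked to x_demo
p_linked = permutation_p_value(x_demo, y_demo)
print(f"p value for linked pair: {p_linked:.4f}")
```

Since trials are only run for pairs whose correlation already exceeds a threshold, the added cost is proportional to the number of candidate edges rather than to all pairs.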

Edge retention

The number of possible edges in a network with n nodes is $(n^2 - n)/2$. As it is not practical to hold all edges of a complete network in main memory for all but small n, a large proportion of edges is not retained. Assuming permutation trials are run, a minimum criterion for edge retention might be to require a p value of less than 0.05.

Network construction

The construction of a network once the edges have been identified is relatively straightforward. However, there is an insidious fundamental mistake that is practiced nearly universally, as described in this section. Another challenge is assessing the structure of the network, which is also addressed herein.

Duality nodes

Networks are normally constructed by assigning a node to represent each object and placing edges between pairs of nodes that are correlated. This practice leads to false-positive signals due to the transitivity assumption upon which network modeling is based. For a given object, correlations with other objects can arise due to high or low values in the object's array of data. For instance, high and low values of temperature, atmospheric pressure, wind, precipitation, cloudiness, and/or humidity are each associated with different weather events. Note that high values for one object and low values of another may be involved in important anti-correlations. Typical scalar correlation metrics indicate the degree of correlation/anti-correlation but do not indicate whether high or low values are contributing to the relationship, creating an environment for the generation of what we refer to as duality nodes (Figure 3). Duality nodes lead to the merging of unassociated clusters. Moreover, these clusters may be the opposite of each other. For example, if high expression of gene A is correlated with a cluster of genes that lie in a biological pathway leading to disease progression and low expression of A is correlated with a healthy biological pathway, the genes for both of these opposing pathways will be connected via A. Consider the β-site amyloid precursor protein (APP)-cleaving enzyme 1 (BACE1). BACE1 competes with α-secretase ADAM10 for cleaving APP. While ADAM10 cleavage has not been associated with deleterious effects, BACE1 cleavage yields β-amyloid peptides, which aggregate to form the amyloid plaques that are characteristic of Alzheimer disease. High expression of BACE1 has been observed in peripheral blood of Alzheimer disease cases when compared with normal controls. 
Consequently, a network in which each gene is represented by a single node will tend to connect the pathological pathway yielding production of excess β-amyloid peptides with analytes in healthy pathways that include low BACE1 levels. We have addressed this issue by expanding network scaffolding to include two nodes per object, representing high and low values, respectively. As illustrated in Figure 3B, this expansion separates the clusters and justifies the use of transitivity.

Figure 3

Duality node

Assume that low values of object A are correlated with low values of object B, high values of object A are correlated with low values of object C, and no other correlations exist for objects A, B, and C.

(A) In a standard network for which each object is represented by a single node, the transitivity assumption would falsely suggest that B and C are correlated.

(B) In an expanded network for which each object is represented by two nodes, one for high values and one for low values (red and blue, respectively), B and C are not joined by an intermediate node.

Allocating two nodes for each object doubles the number of nodes needed, but the number of edges is not increased. Indeed the resulting network is somewhat sparsified, and large connected components may be separated into smaller connected components. Network clustering is typically the most computationally demanding step during a network analysis, and each separate connected component can be clustered independently without any loss of accuracy. Identification of the connected components can be quickly computed using a modified breadth-first search (BFS) that runs in O(n + e) time, where n is the number of nodes and e is the number of edges. (We provide open-source code for this purpose at www.blocbuster.org.) In summary, while network expansion doubles the number of nodes, it eliminates false-positive signals due to duality nodes while retaining the same number of edges and may reduce the computational demands for downstream clustering analyses.
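The scenario in Figure 3 can be traced with a plain breadth-first search over connected components. This is a generic textbook BFS written for illustration, not the code provided at www.blocbuster.org; the function name `connected_components` and the tuple encoding of high/low nodes are hypothetical.

```python
from collections import defaultdict, deque

def connected_components(nodes, edges):
    """Return a list of components (lists of nodes) via BFS in O(n + e) time."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(comp)
    return components

# Standard network: one node per object, so A bridges B and C.
standard = connected_components(["A", "B", "C"], [("A", "B"), ("A", "C")])

# Expanded network: one node for high values and one for low values per object.
expanded_nodes = [(obj, level) for obj in "ABC" for level in ("high", "low")]
expanded_edges = [(("A", "low"), ("B", "low")),    # low A correlated with low B
                  (("A", "high"), ("C", "low"))]   # high A correlated with low C
expanded = connected_components(expanded_nodes, expanded_edges)
print(len(standard), "component(s) vs", len(expanded), "component(s)")
```

In the standard network B and C land in the same component through A, whereas in the expanded network they fall into separate components, and each component can then be clustered on its own.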

Network structure assessment

Large-scale networks are difficult to visualize due to their complexity and high dimensionality. Many visualization tools exist, such as Gephi and Cytoscape, along with Python, R, and MATLAB tools, but they tend to be computationally demanding and typically are unable to render large networks of interest. Moreover, these programs attempt to flatten a high-dimensional network into two-dimensional (2D) space, and this dimension squashing can obscure interesting characteristics. There are many different algorithms for laying out a network in two dimensions, such as Force Atlas, Fruchterman-Reingold, and Yifan Hu, and these methods generally yield vastly different visualizations that do not even appear to represent a common network. Consequently, it is advisable to view multiple layouts and to also consider other resources, as follows. To gain insights into large-scale network structure, one can identify properties such as edge density, node degree distributions, reciprocity, bridge counts, and centrality. In our genetics research, we have observed many networks that contain large numbers of singleton nodes without any edges connecting them to any other nodes, and completely disconnected components, in which no edges connect the components to each other. Knowledge of such structures can simplify downstream analysis by removing singletons and assessing each component separately, thereby reducing computational burden. As previously mentioned, a BFS of the network can be adapted to explore the network, thereby providing the numbers of nodes and edges for each disconnected component and a count of singleton nodes. Networks with disconnected subcomponents can be separated into smaller networks, each of which may be manageable for visualization tools.
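A few of the structural properties mentioned above can be computed directly from an edge list without rendering the network. The sketch below assumes an undirected network with unique edges and no self-loops; the function name `network_summary` and the toy edge list are hypothetical.

```python
from collections import Counter

def network_summary(n_nodes, edges):
    """Report edge density, singleton count, and a degree histogram."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    max_edges = n_nodes * (n_nodes - 1) // 2
    return {
        "edge_density": len(edges) / max_edges if max_edges else 0.0,
        "singletons": n_nodes - len(degree),          # nodes with no edges
        "degree_histogram": dict(Counter(degree.values())),
    }

# Six nodes (0..5): a triangle, a separate pair, and one singleton.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
summary = network_summary(6, toy_edges)
print(summary)
```

Summaries like this are cheap to compute even for networks that are too large to visualize, and the singleton count indicates how many nodes can be set aside before clustering.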

Clustering

Typical clustering algorithms are not easily parallelized, and the computational bottleneck in a study may arise in this step. For this reason, it is common for researchers to prune the objects that appear the least promising. However, it is difficult to know a priori which objects to choose, as excluded objects may play roles in valuable synergistic interactions. An alternative approach is to increase the edge retention stringency to decrease the number of edges until the network breaks into disconnected components. After the components are identified, the discarded edges within each component can be replaced. Consequently, each component will require less computation time than the original network and can be run in parallel on different processors.

Identifying an unknown number of clusters, also referred to as communities or modules, each with an unknown number of tightly connected nodes, can be a daunting task. A plethora of algorithms have arisen using diverse computational tools. Once an algorithm is selected, there are typically multiple adjustable parameters yielding a great variety of outputs. Taken together, a vast number of clustering results are possible, which raises the question: which is correct for your network? Many researchers rely on precedent and simply use clustering algorithms and parameter settings that have been published in their domains previously. However, those previous selections may have been somewhat arbitrary and/or differences in network structures may invalidate this reuse. For algorithms that are not based on a specific objective, underlying assumptions and objectives are often difficult to assess, despite their importance for method selection. For example, many popular clustering methods, including k-means and hierarchical clustering, assume clusters have hyperspherical shapes and tend to minimize the overall diameters of the clusters.
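The divide-and-conquer strategy described at the start of this section (tighten edge retention until the network splits, then restore the discarded edges inside each component) might be sketched as follows. This is an illustration under stated assumptions: edge weights are taken to be association strengths where larger means stronger, the caller chooses the strict threshold, and the name `split_then_restore` is hypothetical.

```python
def split_then_restore(nodes, weighted_edges, strict_threshold):
    """Split on strong edges only, then restore weaker edges within each part."""
    parent = {node: node for node in nodes}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Step 1: only edges passing the strict threshold define the components.
    for u, v, weight in weighted_edges:
        if weight >= strict_threshold:
            parent[find(u)] = find(v)

    # Step 2: group nodes by component.
    subnetworks = {}
    for node in nodes:
        part = subnetworks.setdefault(find(node), {"nodes": [], "edges": []})
        part["nodes"].append(node)

    # Step 3: restore every original edge whose endpoints share a component.
    # Edges that cross components stay discarded.
    for u, v, weight in weighted_edges:
        root = find(u)
        if root == find(v):
            subnetworks[root]["edges"].append((u, v, weight))

    return list(subnetworks.values())


# Toy network: two triangles joined by one weak edge (c-d).
toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [
    ("a", "b", 0.90), ("b", "c", 0.80), ("a", "c", 0.50),
    ("c", "d", 0.40),
    ("d", "e", 0.90), ("e", "f", 0.85), ("d", "f", 0.45),
]
parts = split_then_restore(toy_nodes, toy_edges, strict_threshold=0.7)
for part in parts:
    print(part["nodes"], len(part["edges"]))
```

Each returned subnetwork is independent of the others, so the subnetworks could be dispatched to separate worker processes for clustering.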
Many practical applications may yield elongated or complex structures that are likely to be cut apart by the sphericity assumption. Also, differences in densities of clusters within a single network can impede some algorithms, such as DBSCAN.

Some clustering methods are based on clearly stated objectives. For example, a large group of clustering algorithms aim to maximize the modularity function that was proposed by Newman and Girvan in 2004. Modularity measures the number of edges within assigned clusters minus the number expected if the edges were placed randomly while node degrees remain constant. This objective does not enforce sphericity and gained rapid popularity. Optimally maximizing the modularity objective function is NP-hard, so many approximate implementations have arisen, including greedy methods, divisive optimization, simulated annealing, hierarchical clustering, and spectral partitioning. Sixteen different modularity implementations have been compared by Danon et al. Fortunato and Barthélemy observed a resolution limit for modularity wherein distinct clusters will be merged together when the network is sufficiently large. We have observed that modularity is strongly biased against singletons, regardless of network size, and will sometimes split a dense cluster in two to avoid creating a singleton cluster. Consequently, modularity-based methods may be problematic for networks in which singletons are expected and for very large networks.

In many research endeavors it is not clear what clustering objective is suitable, and it is tempting to apply many different clustering methods. However, multiple testing corrections must then be applied across all of the resulting clusters, which can make this approach prohibitive. Lea and Climer developed a solution to this dilemma by applying many different clustering techniques and sorting the clusters by desirable properties to select the most promising for validation testing, thereby managing multiple testing corrections.
Another resource is VICTOR (http://bib.fleming.gr:3838/VICTOR/), a visual analytics web application for comparing cluster sets produced by different clustering algorithms, which can aid in cluster selection.
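To make the modularity objective discussed in this section concrete, the sketch below evaluates it for a given cluster assignment using the standard form Q = sum over clusters of [L_c / m - (d_c / 2m)^2], where m is the total number of edges, L_c is the number of edges inside cluster c, and d_c is the summed degree of its nodes. It only scores a partition and does not search for the best one; the function name `modularity` and the toy network are assumptions for illustration.

```python
def modularity(edges, community_of):
    """Score a partition of an unweighted, undirected network by modularity."""
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a network without edges")

    internal_edges = {}  # L_c: edges with both endpoints in cluster c
    degree_sum = {}      # d_c: summed degree of the nodes in cluster c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1

    # Observed within-cluster edge fraction minus the fraction expected when
    # edges are placed at random with node degrees held constant.
    return sum(
        internal_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy network: two triangles joined by a single bridge edge.
bridge_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in two_clusters}

q_two = modularity(bridge_edges, two_clusters)
q_one = modularity(bridge_edges, one_cluster)
print(round(q_two, 4), round(q_one, 4))
```

Placing every node in a single cluster scores zero, while separating the two triangles scores higher, which is the behavior the objective is designed to reward.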

Validation

Using an adequate number of permutation trials for pruning false-positive correlations, representing each object by two nodes to eliminate duality nodes, and utilizing an appropriate correlation metric and clustering technique will increase the likelihood of correct results. However, noise in the data and overfitting can sabotage outcomes, and it is imperative that results are validated. Depending on research design, validation via data generated by a different study may be problematic due to differences in data collection. In the realm of gene expression data, differences in platforms used to measure gene expression alone can be drastic enough to undermine efforts, as different variants of each gene may be captured. Furthermore, sample preparation, technician experience, and equipment settings can yield inconsistencies between studies. Alternatively, many publications report gene enrichment p values as validation of the results. Various reference databases, such as DAVID and Metascape, provide software to estimate the probability of seeing a group of biologically related genes appearing in a given module of genes. These results are dependent upon the clustering algorithm utilized by the software and the number and sizes of clusters. Furthermore, the analysis is based entirely on known biological relationships and, consequently, novel discoveries will not fare well in these evaluations. In general, it is ideal to split the data samples into discovery and validation sets, use the discovery data to generate the network and clusters, and test these clusters in the held-out validation data. For example, 70% of the samples can be used to build the network and discover patterns associated with the trait of interest, then each of these patterns can be tested for associations on the held-out samples, with multiple testing corrections applied. 
Unfortunately, this approach can diminish the power needed to identify true correlations and clusters in the discovery dataset, because enough samples must be held out for the validation dataset to be representative of the true patterns in the data. However, many data collection methods are becoming increasingly affordable, and datasets are growing to suitable sizes in many domains.
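The discovery/validation design described above can be outlined in a few lines. This sketch assumes a simple random 70/30 split and uses a Bonferroni correction as one possible multiple testing correction; the function names, the sample identifiers, and the p values are hypothetical placeholders rather than results.

```python
import random


def split_samples(sample_ids, discovery_fraction=0.7, seed=0):
    """Randomly partition samples into discovery and held-out validation sets."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * discovery_fraction)
    return shuffled[:cut], shuffled[cut:]


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each tested pattern after a Bonferroni correction."""
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical study with 100 samples.
sample_ids = [f"sample_{i:03d}" for i in range(100)]
discovery, validation = split_samples(sample_ids)

# Clusters are built using the discovery samples only, then each cluster is
# tested for association with the trait in the validation samples.
# The p values below are placeholders standing in for those tests.
validation_p_values = {"cluster_1": 0.001, "cluster_2": 0.03, "cluster_3": 0.2}
validated = bonferroni_significant(validation_p_values)
print(len(discovery), len(validation), validated)
```

Only the clusters carried forward to the validation step count toward the correction, which is why limiting the number of candidate clusters preserves power.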

Discussion

The pearls and pitfalls of network modeling are numerous. The beauty of the approach is that arbitrarily high-dimensional patterns can be identified based upon simple pairwise relationships. Given an efficient implementation and adequate computational resources, it is feasible to build and analyze networks for most datasets of interest. Another advantage is that components of a network can be visualized using 2D and 3D plotting software. These visualizations capture complex interactions and may reveal interesting characteristics worthy of further exploration, such as hub nodes that are connected to large numbers of other nodes and/or dense subclusters that are loosely connected. As detailed herein, there are numerous caveats that are commonly overlooked in network modeling. First, imputation of missing data can lead to false-positive signals for downstream correlation measurements. An alternative strategy is to iteratively remove objects and attributes with excessive numbers of missing values while gradually tightening the threshold until a desired threshold is reached. Second, discretization of data values, when utilized, needs to be evaluated for robustness of the discretization thresholds utilized. Third, the pairwise relationship metric must align with the specific properties of the domain. In particular, a common error is to apply a general-purpose correlation measure when subset heterogeneity exists, thereby leading to false-negative signals. Fourth, the “sample size” for a study is not necessarily equal to the effective sample size. For each pairwise relationship computation, the effective sample size is dependent upon the amount and the relative positioning of the missing data for the pair. Fifth, spurious correlations may arise. A straightforward strategy for assessing significance is to use permutation trials and then base edge retention on the p values derived from an ample number of such trials.
Sixth, duality nodes are pervasive and dicey actors hidden in the network modeling realm. It is natural to represent each object as a node, yet, in hindsight, it is clear that “high” and “low” values of an object should not be compressed into a single node, as it invalidates the transitivity assumption upon which network modeling is based. Seventh, although plotting network subcomponents can be insightful, visualization of high-dimensional networks in 2D space is somewhat arbitrary. Evaluating network properties can yield meaningful information while providing statistical characteristics to inform the next step: clustering. Eighth, properly clustering the network can be a daunting task. It is possible to ameliorate computational demands using divide-and-conquer strategies. However, selection of a valid clustering algorithm from the profusion of offerings, along with appropriate parameter settings, is challenging and requires careful consideration of the particular structure of the given network. Finally, despite best practices in network analysis, false-positive signals may arise due to noise in the data and overfitting. Stringent unbiased validation is indispensable and can be achieved using independent data. While these many challenges can be assuaged using the prescribed techniques, one pressing issue is that regardless of the perfection of the analysis, network modeling is an approximation method. Even if a dataset is analyzed a very large number of times using many different choices, there is never any guarantee that all useful patterns are revealed, and the most beneficial signals may remain hidden within the sea of values. The number of possible patterns grows exponentially with the pattern size (e.g., for n nodes, the numbers of possible patterns of sizes 2, 3, and k are on the order of n^2, n^3, and n^k, respectively, as shown in Table 1). Consequently, ensuring optimality is expected to be intractable for problem sizes of interest given currently available methods.
Fortunately, when properly applied, the guilt-by-association basis of network modeling provides a scalable and flexible vehicle for releasing insightful high-dimensional relationships from otherwise incomprehensible datasets.
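The fifth point above, basing edge retention on p values from permutation trials, can be sketched as follows. Pearson correlation is used purely as a stand-in for whichever pairwise metric suits the domain, missing data are not handled, and the function names, trial count, and toy vectors are assumptions for illustration.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, trials=1000, seed=0):
    """Estimate how often shuffled data match or exceed the observed |correlation|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Toy pair of objects with a perfect linear relationship.
values_a = list(range(20))
values_b = [2 * v + 1 for v in values_a]
p_linked = permutation_p_value(values_a, values_b)
print(p_linked)
```

An edge would be retained only when its p value falls below the chosen cutoff, and the smallest attainable p value is bounded by the number of trials, which is why an ample number of trials matters.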

41 in total

1. A gene-coexpression network for global discovery of conserved genetic modules.

Authors: Joshua M Stuart; Eran Segal; Daphne Koller; Stuart K Kim
Journal: Science Date: 2003-08-21 Impact factor: 47.728

2. Community detection in complex networks using extremal optimization.

Authors: Jordi Duch; Alex Arenas
Journal: Phys Rev E Stat Nonlin Soft Matter Phys Date: 2005-08-24

3. Resolution limit in community detection.

Authors: Santo Fortunato; Marc Barthélemy
Journal: Proc Natl Acad Sci U S A Date: 2006-12-26 Impact factor: 11.205

4. VICTOR: A visual analytics web application for comparing cluster sets.

Authors: Evangelos Karatzas; Maria Gkonta; Joana Hotova; Fotis A Baltoumas; Panagiota I Kontou; Christopher J Bobotsis; Pantelis G Bagos; Georgios A Pavlopoulos
Journal: Comput Biol Med Date: 2021-06-08 Impact factor: 4.589

5. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

6. Evaluation of the psoriasis transcriptome across different studies by gene set enrichment analysis (GSEA).

Authors: Mayte Suárez-Fariñas; Michelle A Lowes; Lisa C Zaba; James G Krueger
Journal: PLoS One Date: 2010-04-20 Impact factor: 3.240

Review 7. BACE1: the beta-secretase enzyme in Alzheimer's disease.

Authors: Robert Vassar
Journal: J Mol Neurosci Date: 2004 Impact factor: 3.444

8. Genetic control of human brain transcript expression in Alzheimer disease.

Authors: Jennifer A Webster; J Raphael Gibbs; Jennifer Clarke; Monika Ray; Weixiong Zhang; Peter Holmans; Kristen Rohrer; Alice Zhao; Lauren Marlowe; Mona Kaleem; Donald S McCorquodale; Cindy Cuello; Doris Leung; Leslie Bryden; Priti Nath; Victoria L Zismann; Keta Joshipura; Matthew J Huentelman; Diane Hu-Lince; Keith D Coon; David W Craig; John V Pearson; Christopher B Heward; Eric M Reiman; Dietrich Stephan; John Hardy; Amanda J Myers
Journal: Am J Hum Genet Date: 2009-04 Impact factor: 11.025

9. Allele-specific network reveals combinatorial interaction that transcends small effects in psoriasis GWAS.

Authors: Sharlee Climer; Alan R Templeton; Weixiong Zhang
Journal: PLoS Comput Biol Date: 2014-09-18 Impact factor: 4.475

10. The impact of rare variation on gene expression across tissues.

Authors: Xin Li; Yungil Kim; Emily K Tsang; Joe R Davis; Farhan N Damani; Colby Chiang; Gaelen T Hess; Zachary Zappala; Benjamin J Strober; Alexandra J Scott; Amy Li; Andrea Ganna; Michael C Bassik; Jason D Merker; Ira M Hall; Alexis Battle; Stephen B Montgomery
Journal: Nature Date: 2017-10-11 Impact factor: 49.962