Abstract
Network modeling transforms data into a structure of nodes and edges such that edges represent relationships between pairs of objects, then extracts clusters of densely connected nodes in order to capture high-dimensional relationships hidden in the data. This efficient and flexible strategy holds potential for unveiling complex patterns concealed within massive datasets, but standard implementations overlook several key issues that can undermine research efforts. These issues range from data imputation and discretization to correlation metrics, clustering methods, and validation of results. Here, we enumerate these pitfalls and provide practical strategies for alleviating their negative effects. These guidelines increase prospects for future research endeavors as they reduce type I and type II (false-positive and false-negative) errors and are generally applicable for network modeling applications across diverse domains.
Keywords: clustering; community detection; correlation; gene co-expression analysis; high-dimensional patterns; network analysis
Year: 2021; PMID: 34950902; PMCID: PMC8672149; DOI: 10.1016/j.patter.2021.100374
Source DB: PubMed; Journal: Patterns (N Y); ISSN: 2666-3899
Example of combinatorial explosion
| Size | 1 | 2 | 3 | 4 | k |
|---|---|---|---|---|---|
| No. of combinations | n | n(n−1)/2 | n(n−1)(n−2)/6 | n(n−1)(n−2)(n−3)/24 | n!/(k!(n−k)!) |
| n = 1,000,000 | 1,000,000 | 499,999,500,000 | 1.7 × 10^17 | 4.2 × 10^22 | |
Shown are the number of unique combinations for patterns comprising 1, 2, 3, 4, and k objects drawn from n objects, along with an example for a dataset with n = 1,000,000 objects.
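The counts in this table follow from the binomial coefficient, which gives the number of unordered k-object subsets of n objects. The following is a minimal sketch using Python's standard library; the helper name `pattern_counts` is introduced here for illustration and is not from the paper.

```python
from math import comb

def pattern_counts(n, sizes=(1, 2, 3, 4)):
    """Return {k: C(n, k)}: the number of unordered k-object subsets of n objects."""
    return {k: comb(n, k) for k in sizes}

# Dataset size taken from the table's example row.
counts = pattern_counts(1_000_000)
for k, c in counts.items():
    # Scientific notation makes the growth with pattern size easy to see.
    print(f"size {k}: {c:.1e} combinations")
```

Because the count grows roughly as n^k / k!, exhaustively testing patterns of more than two or three objects quickly becomes infeasible for large n.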
Figure 1. Network modeling examples
(A) Typical steps in a network analysis.
(B) An example Facebook network (left) and gene co-expression network (right). For the Facebook network, each node represents a Facebook friend of a given individual, and an edge is placed between two nodes if the corresponding individuals are Facebook friends. For the gene co-expression network, nodes represent genes, and an edge is placed between two genes that exhibit correlated expression across a set of individuals.
(C) Four example network modeling applications. “Hub nodes” are nodes with exceptionally high degree.
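The co-expression construction and the notion of a hub node described in this caption can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the Pearson measure, the 0.5 edge threshold, the degree cutoff, and the profile values are all assumptions chosen for the example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def build_network(profiles, threshold):
    """Place an edge between two objects when |r| meets the (assumed) threshold."""
    adjacency = {name: set() for name in profiles}
    for a, b in combinations(profiles, 2):
        if abs(pearson(profiles[a], profiles[b])) >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def hub_nodes(adjacency, min_degree):
    """Nodes whose degree (number of neighbors) is at least min_degree."""
    return sorted(n for n, nbrs in adjacency.items() if len(nbrs) >= min_degree)

# Hypothetical expression profiles across four individuals.
# geneH is the sum of the other three, which are mutually uncorrelated.
profiles = {
    "geneH": [3.0, -1.0, -1.0, -1.0],
    "geneA": [1.0, -1.0, 1.0, -1.0],
    "geneB": [1.0, 1.0, -1.0, -1.0],
    "geneC": [1.0, -1.0, -1.0, 1.0],
}

adjacency = build_network(profiles, threshold=0.5)
hubs = hub_nodes(adjacency, min_degree=3)
print(adjacency)
print(hubs)
```

In this toy, geneH correlates with each of the other genes while they do not correlate with one another, so it ends up as the only high-degree node. In practice both the threshold and the degree cutoff would need to be justified for the dataset at hand.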
Figure 2. Subset heterogeneity, effective sample size, and permutation examples
Examples for pairs of objects, each with ten attribute values. Red upward arrow, dash, and blue downward arrow indicate high, neutral, and low data values, respectively. An “×” indicates a missing data value.
(A) The first five attribute values are perfectly correlated for objects A and B, while the other five are not correlated at all. Such a situation may be expected in the presence of subset heterogeneity. The absolute value of Pearson's correlation coefficient is only 0.44 due to the uncorrelated values. Duo returns a high score of 0.80 for the high/low relationship and low scores for high/high, low/high, and low/low relationships.
(B) Objects C, D, E, and F each have 20% missing data. When computing a pairwise correlation measure for objects C and D, 40% of the value pairs contain missing values and do not contribute to the score, because the missing positions of C and D do not overlap. On the other hand, only 20% of the value pairs contain missing values for objects E and F, whose missing positions coincide.
(C) A′ and B′ represent random permutations of objects A and B, respectively. Each object retains the same values while the inherent correlation between A and B is broken up.
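The three panels can be illustrated with a small sketch. All data values below are hypothetical and do not reproduce the figure (so the coefficients differ from the 0.44 quoted in panel A), and the permutation p-value shown is one common way to use shuffled objects as a null reference, assumed here rather than taken from the paper.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def complete_pairs(x, y):
    """Keep only positions where both values are present (None marks missing)."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]

def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Share of random permutations whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    xp, yp = list(x), list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(xp)  # each object keeps its values, only the order changes
        rng.shuffle(yp)
        if abs(pearson(xp, yp)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# (A) Subset heterogeneity: first five values perfectly correlated, last five not.
obj_a = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
obj_b = [1, 2, 3, 4, 5, 4, 1, 5, 1, 4]
r_subset = pearson(obj_a[:5], obj_b[:5])
r_all = pearson(obj_a, obj_b)  # diluted by the uncorrelated half

# (B) Effective sample size: 20% missing per object in every case.
obj_c = [1, None, 3, 4, None, 6, 7, 8, 9, 10]
obj_d = [2, 3, None, 5, 6, 7, None, 9, 10, 11]   # gaps differ from obj_c
obj_e = [1, None, 3, 4, None, 6, 7, 8, 9, 10]
obj_f = [2, None, 4, 5, None, 7, 8, 9, 10, 11]   # gaps coincide with obj_e
n_cd = len(complete_pairs(obj_c, obj_d))  # 6 of 10 pairs usable
n_ef = len(complete_pairs(obj_e, obj_f))  # 8 of 10 pairs usable

# (C) Permutation reference for the perfectly correlated subset.
p_value = permutation_p_value(obj_a[:5], obj_b[:5])
print(r_subset, r_all, n_cd, n_ef, p_value)
```

The sketch shows why a single coefficient over all attributes can understate a strong relationship confined to a subset, and why scores for different object pairs may rest on different numbers of usable value pairs.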
Figure 3. Duality node
Assume that low values of object A are correlated with low values of object B, high values of object A are correlated with low values of object C, and no other correlations exist for objects A, B, and C.
(A) In a standard network for which each object is represented by a single node, the transitivity assumption would falsely suggest that B and C are correlated.
(B) In an expanded network for which each object is represented by two nodes, one for high values and one for low values (red and blue, respectively), B and C are not joined by an intermediate node.
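The contrast between the two panels can be sketched with plain adjacency sets. The representation below (tuples of object name and level) and the helper names are assumptions made for illustration; only the two correlations stated in the caption are encoded.

```python
from collections import defaultdict

# The two correlations assumed in the caption, as (object, level) endpoints.
correlations = [
    (("A", "low"), ("B", "low")),
    (("A", "high"), ("C", "low")),
]

def standard_network(correlations):
    """One node per object: the high/low level of each endpoint is discarded."""
    adj = defaultdict(set)
    for (u, _), (v, _) in correlations:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def expanded_network(correlations):
    """Two nodes per object: (name, 'high') and (name, 'low') stay separate."""
    adj = defaultdict(set)
    for u, v in correlations:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def common_neighbors(adj, u, v):
    """Intermediate nodes that join u and v by a two-step path."""
    return adj.get(u, set()) & adj.get(v, set())

std = standard_network(correlations)
exp = expanded_network(correlations)

shared_std = common_neighbors(std, "B", "C")
shared_exp = common_neighbors(exp, ("B", "low"), ("C", "low"))
print(shared_std)  # B and C are both joined to the single node A
print(shared_exp)  # no shared intermediate node in the expanded network
```

In the standard network B and C share the neighbor A, which invites the false transitive inference. In the expanded network B's low node attaches to A's low node and C's low node attaches to A's high node, so no two-step path links them.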