Elizabeth A Hobson, Matthew J Silk, Nina H Fefferman, Daniel B Larremore, Puck Rombach, Saray Shai, Noa Pinter-Wollman.
Abstract
Analysing social networks is challenging. Key features of relational data require the use of non-standard statistical methods such as developing system-specific null, or reference, models that randomize one or more components of the observed data. Here we review a variety of randomization procedures that generate reference models for social network analysis. Reference models provide an expectation for hypothesis testing when analysing network data. We outline the key stages in producing an effective reference model and detail four approaches for generating reference distributions: permutation, resampling, sampling from a distribution, and generative models. We highlight when each type of approach would be appropriate and note potential pitfalls for researchers to avoid. Throughout, we illustrate our points with examples from a simulated social system. Our aim is to provide social network researchers with a deeper understanding of analytical approaches to enhance their confidence when tailoring reference models to specific research questions.
Keywords: agent-based model; animal sociality; configuration model; permutation; randomization; social network analysis
Year: 2021 PMID: 34216192 PMCID: PMC9292850 DOI: 10.1111/brv.12775
Source DB: PubMed Journal: Biol Rev Camb Philos Soc ISSN: 0006-3231
Fig. 1. Methods for creating reference models increase in level of abstraction. Methods progress from reference models that rely strongly on the empirical observations of sociality (left, Level 1) to methods that make assumptions about the generative processes that underlie the observed sociality and do not use the observed social associations when producing a reference model (right, Level 4).
Example of two research teams and their approach to studying burbil sociality
Team 1: do burbils socially associate by nose colour?
Team 2: do burbils associate at random?
Team 1: to determine if burbils associate based on nose colour, the researchers decide to preserve the observed network structure (Fig.
Team 2: to determine if burbils associate at random, the researchers generate random networks with the same number of nodes and edges and then, for each random network, they draw edge weights from a normal distribution with the same mean and standard deviation as the observed adjacency matrix.
Team 1: the researchers use a weighted assortativity coefficient to measure the tendency of burbils to associate with those of the same nose colour.
Team 2: the researchers choose a measure of variation in the weighted degree (strength) distribution, the coefficient of variation (CV), as the test statistic to compare the observed and reference networks.
Both teams generate a reference distribution by running 9999 iterations of their randomization procedure, to which they compare the observed test statistic. Using 9999 iterations means their full reference data set (including the observed value) is
Team 1: after each shuffle of nose colour, the weighted assortativity coefficient is calculated to obtain 9999 reference values to compare with the observed value.
Team 2: after the creation of each new interaction network, the CV of the weighted degree distribution is calculated to obtain 9999 reference values to compare with the observed value.
Team 1: the observed assortativity coefficient falls above the 95% confidence interval of the reference distribution, indicating that burbils do indeed assort by nose colour – tending to associate more with burbils with the same colour noses (Fig.
Team 2: the observed weighted degree CV falls inside the 95% interval of the reference distribution, indicating that the network is not different from random with regard to this particular network measure.
Team 1 asked a specific question, used a permutation procedure that shuffled only the one aspect of burbil society that was of interest, and they chose a test statistic that was well matched to their question.
Team 2 asked a vague question (what does it mean for a network to be non‐random? What is the biological meaning of ‘random’ and how is it measured?). They found it difficult to define a satisfactory reference model and they chose a test statistic that was not as directly linked to their question. Team 2 is therefore uncertain about the biological conclusions they can draw. Most importantly, they failed to determine how the way in which they generated their reference distribution matches their research question. This failure stems from the lack of specificity of their biological question. Further, they missed the fact that they included zero values for self‐loops in their calculation of the mean and standard deviation of the edge weights when generating their reference networks. These edge weights had a biased representation and inflated their importance compared to the observed edge weights.
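The self‐loop pitfall described for Team 2 can be illustrated with a minimal Python sketch. The 4 × 4 association matrix below is invented for illustration and the helper name `edge_weight_summary` is hypothetical; the sketch only compares the mean and standard deviation of the matrix entries with and without the zero diagonal, and is not the authors' code.

```python
import statistics

# Hypothetical symmetric association matrix for four burbils (invented values).
# The diagonal is zero because an individual cannot associate with itself.
adjacency = [
    [0, 4, 2, 5],
    [4, 0, 3, 1],
    [2, 3, 0, 6],
    [5, 1, 6, 0],
]


def edge_weight_summary(matrix, include_diagonal):
    """Return (mean, sample standard deviation) of the entries of a square matrix."""
    n = len(matrix)
    values = [
        matrix[i][j]
        for i in range(n)
        for j in range(n)
        if include_diagonal or i != j  # optionally drop the self-loop entries
    ]
    return statistics.mean(values), statistics.stdev(values)


mean_all, sd_all = edge_weight_summary(adjacency, include_diagonal=True)
mean_off, sd_off = edge_weight_summary(adjacency, include_diagonal=False)
print(f"with self-loop zeros:    mean={mean_all:.2f}, sd={sd_all:.2f}")
print(f"without self-loop zeros: mean={mean_off:.2f}, sd={sd_off:.2f}")
```

In this toy matrix the diagonal zeros lower the mean and widen the standard deviation, so edge weights drawn from a normal distribution with those parameters would not resemble the observed off‐diagonal weights.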
Fig. 2. An example of study approaches: do burbils socially assort by nose colour? (A) Association network of burbils, with nodes colour‐coded by nose colour and (B) distribution of values based on the permutation procedure of Team 1; observed value of the test statistic shown as a red solid line and the 2.5 and 97.5% quantiles of the reference distribution as blue dashed lines.
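Team 1's node‐label permutation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ten‐node network and its weights are invented, the function names are hypothetical, and the test statistic is assumed to be a Newman‐style assortativity coefficient for a categorical trait computed on edge weights, since the record does not give the exact formula.

```python
import random


def weighted_assortativity(edges, colour):
    """Assumed Newman-style assortativity for a categorical trait on a weighted,
    undirected network. `edges` maps (u, v) pairs to edge weights."""
    total = 2.0 * sum(edges.values())  # each edge counted in both directions
    same = 0.0                         # weight joining same-colour nodes
    ends = {}                          # total edge-end weight per colour
    for (u, v), w in edges.items():
        cu, cv = colour[u], colour[v]
        if cu == cv:
            same += 2.0 * w
        ends[cu] = ends.get(cu, 0.0) + w
        ends[cv] = ends.get(cv, 0.0) + w
    trace = same / total
    expected = sum((x / total) ** 2 for x in ends.values())
    return (trace - expected) / (1.0 - expected)


def node_label_permutation_test(edges, colour, n_perm=9999, seed=1):
    """Shuffle colours across nodes, keeping the network structure fixed."""
    rng = random.Random(seed)
    nodes = sorted(colour)
    labels = [colour[n] for n in nodes]
    observed = weighted_assortativity(edges, colour)
    reference = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        reference.append(weighted_assortativity(edges, dict(zip(nodes, labels))))
    reference.sort()
    lower = reference[int(0.025 * n_perm)]  # approximate 2.5% quantile
    upper = reference[int(0.975 * n_perm)]  # approximate 97.5% quantile
    # One-sided p-value; the observed value is counted as part of the reference set.
    p_value = (sum(r >= observed for r in reference) + 1) / (n_perm + 1)
    return observed, lower, upper, p_value


# Hypothetical data: two tightly knit colour groups joined by three weak ties.
edges = {}
for group in (range(0, 5), range(5, 10)):
    for u in group:
        for v in group:
            if u < v:
                edges[(u, v)] = 3.0
edges.update({(0, 5): 1.0, (1, 6): 1.0, (2, 7): 1.0})
colour = {n: ("red" if n < 5 else "blue") for n in range(10)}

observed, lower, upper, p_value = node_label_permutation_test(edges, colour)
print(f"observed r = {observed:.3f}; reference 2.5-97.5% = [{lower:.3f}, {upper:.3f}]")
```

Only the colour labels are shuffled, so the edges and their weights are identical in every reference network; this mirrors the idea of randomizing only the aspect of the data that the question is about.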
Fig. 3. An illustration of how incorrect use of Markov chain Monte Carlo (MCMC) methods can lead to biased sampling from the configuration model when using data stream permutations. When permuting a bipartite group‐by‐individual network there are 11 possible configurations, depicted at the bottom of the figure. Of these possibilities, five (coloured yellow and orange) are acceptable because they do not contain double edges (shown as a thick edge in the green and blue possibilities). Double edges indicate that the same individual occurred in the same grouping event twice, which is impossible. (A) The ‘graph of graphs’, or the Markov chain. (B) The distribution of samples obtained when permutations are conducted and every state, including those that are impossible (green and blue), is accepted. (C) The distribution of samples obtained when rejecting swaps that result in double edges and then rewiring a randomized network. Note that a sampling bias arises here – the orange state is oversampled – because it has more routes to other acceptable states, as seen in A. (D) The distribution of the samples obtained when swaps that make double edges are resampled (i.e. the correct unbiased sampling approach). Note that in D the sampling of the five acceptable states is uniform – as it should be.
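The contrast between panels C and D can be sketched with a toy Markov chain in Python. The sketch rests on assumptions: the four‐state ‘graph of graphs’ below is hypothetical (it is not the 11‐state chain in the figure), and it reads panel D as counting the current state again whenever a proposed swap is rejected, and panel C as redrawing proposals until an acceptable one is found.

```python
import random
from collections import Counter

# Hypothetical toy 'graph of graphs': every state proposes one of two swaps.
# States "a", "b" and "c" are acceptable; "x" stands in for a configuration
# that contains a double edge.
PROPOSALS = {
    "a": ["b", "x"],
    "b": ["a", "c"],
    "c": ["b", "x"],
}
ACCEPTABLE = set(PROPOSALS)


def sample_chain(n_steps, on_reject, seed=1):
    """Return visit frequencies of the acceptable states.

    on_reject="stay":   a rejected swap leaves the chain in place and the
                        current state is recorded again.
    on_reject="redraw": proposals are redrawn until one is acceptable.
    """
    rng = random.Random(seed)
    state = "a"
    visits = Counter()
    for _ in range(n_steps):
        proposal = rng.choice(PROPOSALS[state])
        if on_reject == "redraw":
            while proposal not in ACCEPTABLE:
                proposal = rng.choice(PROPOSALS[state])
        if proposal in ACCEPTABLE:
            state = proposal
        visits[state] += 1  # under "stay", a rejection re-counts the same state
    return {s: visits[s] / n_steps for s in sorted(ACCEPTABLE)}


stay = sample_chain(200_000, "stay")
redraw = sample_chain(200_000, "redraw")
print("stay on rejection:  ", stay)
print("redraw on rejection:", redraw)
```

In this toy chain, state "b" has two routes to acceptable states while "a" and "c" have one each, so redrawing visits "b" about twice as often, whereas re‐counting the current state after a rejection gives roughly equal frequencies.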
Fig. 4. Drawing random degree sequences from the distribution‐based model. (A) Histogram of the degree sequence of the network shown in the inset and a fitted lognormal distribution (red line). (B) Random samples of different sizes (100, 200, 500, 1000 randomization iterations) drawn from the fitted lognormal distribution (orange) and by resampling the original degree sequence (grey). Network visualization was done using Gephi (Bastian, Heymann & Jacomy, 2009) with force atlas, a force‐directed layout. Node colour and size correspond to degree.
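The two sampling routes compared in panel B can be sketched in Python. This is an illustration under assumptions: the observed degree sequence is invented, the function names are hypothetical, the lognormal parameters are estimated simply as the mean and standard deviation of the log degrees, and draws are rounded to integers of at least 1, a choice made here that the caption does not specify.

```python
import math
import random
import statistics


def fit_lognormal(degrees):
    """Estimate lognormal parameters as the mean and SD of log(degree)."""
    logs = [math.log(d) for d in degrees]  # requires every degree > 0
    return statistics.mean(logs), statistics.stdev(logs)


def sample_from_fit(mu, sigma, size, rng):
    """Draw degrees from the fitted distribution (rounded, minimum 1)."""
    return [max(1, round(rng.lognormvariate(mu, sigma))) for _ in range(size)]


def resample_observed(degrees, size, rng):
    """Draw degrees with replacement from the observed sequence."""
    return rng.choices(degrees, k=size)


# Hypothetical observed degree sequence (invented values).
observed_degrees = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12]
rng = random.Random(1)
mu, sigma = fit_lognormal(observed_degrees)

for size in (100, 200, 500, 1000):
    fitted = sample_from_fit(mu, sigma, size, rng)
    resampled = resample_observed(observed_degrees, size, rng)
    print(size, "max fitted:", max(fitted), "max resampled:", max(resampled))
```

Resampling can only return degrees that were observed, whereas the fitted distribution can produce degrees that never appeared in the data.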
Fig. 5. An illustration of covariance between two network properties in a burbil association network generated in Appendix S1, Section 3.3. (A) The degree distribution of the network. (B) The distribution of the clustering coefficient – the fraction of a node's friends that are friends with each other. (C) A visualization of the network where node size corresponds to degree and node colour corresponds to clustering coefficient [network visualization was done using Gephi (Bastian et al., 2009) with force atlas, a force‐directed layout]. (D) The correlation between clustering coefficient and degree in the network.
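The quantities in panels A, B and D can be computed with a short Python sketch. The eight‐node network is invented for illustration (it is not the Appendix S1 network), the function names are hypothetical, and nodes with fewer than two neighbours are assigned a clustering coefficient of 0 by assumption.

```python
def local_clustering(neighbours, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = sorted(neighbours[node])
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbours
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in neighbours[nbrs[i]]
    )
    return links / (k * (k - 1) / 2)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical undirected association network (invented edges).
edge_list = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5),
             (4, 5), (2, 6), (3, 6), (6, 7)]
neighbours = {}
for u, v in edge_list:
    neighbours.setdefault(u, set()).add(v)
    neighbours.setdefault(v, set()).add(u)

nodes = sorted(neighbours)
degree = [len(neighbours[n]) for n in nodes]
clustering = [local_clustering(neighbours, n) for n in nodes]
r = pearson(degree, clustering)
print("degree:    ", degree)
print("clustering:", [round(c, 2) for c in clustering])
print("correlation between degree and clustering:", round(r, 3))
```

Because the two properties are correlated in a network like this one, a reference model that fixes one of them also constrains the other to some extent.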