
Revealing mechanisms of infectious disease spread through empirical contact networks.

Pratha Sah¹, Michael Otterstatter²,³, Stephan T Leu⁴, Sivan Leviyang⁵, Shweta Bansal¹.

Abstract

The spread of pathogens fundamentally depends on the underlying contacts between individuals. Modeling the dynamics of infectious disease spread through contact networks, however, can be challenging due to limited knowledge of how an infectious disease spreads and its transmission rate. We developed a novel statistical tool, INoDS (Identifying contact Networks of infectious Disease Spread), that estimates the transmission rate of an infectious disease outbreak, establishes the epidemiological relevance of a contact network in explaining the observed pattern of infectious disease spread, and enables model comparison between different contact network hypotheses. We show that our tool is robust to incomplete data and can be easily applied to datasets where infection timings of individuals are unknown. We tested the reliability of INoDS using simulation experiments of disease spread on a synthetic contact network and find that, compared with previous approaches, it is reliable under different settings of network dynamics and disease contagiousness. We demonstrate the applicability of our method in two host-pathogen systems: Crithidia bombi in bumblebee colonies and Salmonella in wild Australian sleepy lizard populations. INoDS thus provides a novel and reliable statistical tool for identifying transmission pathways of infectious disease spread. In addition, INoDS can help in understanding the spread of novel or emerging infectious diseases, offers an alternative to laboratory transmission experiments, and overcomes common data-collection constraints.

Year: 2021; PMID: 34928936; PMCID: PMC8758098; DOI: 10.1371/journal.pcbi.1009604

Source DB: PubMed; Journal: PLoS Comput Biol; ISSN: 1553-734X; Impact factor: 4.475

Introduction

Host contacts, whether direct or indirect, play a fundamental role in the spread of infectious diseases [1-4]. Traditional epidemiological models make assumptions of homogeneous social structure and mixing among hosts which can yield unreliable predictions of infectious disease spread [3, 5–7]. Network approaches provide an alternative to modeling infection transmission by explicitly incorporating host interactions that mediate pathogen transmission. Formally, in a contact network model, individuals are represented as nodes, and an edge between two nodes represents an interaction that has the potential to transmit infection. A dynamic contact network model tracks interactions evolving over time due to social, demographic or environmental processes as well as perturbations [8]. Constructing a complete contact network model requires (i) knowledge about the transmission route(s) of a pathogen, (ii) a sampling of all individuals in a population, and (iii) a sampling of all interactions among the sampled individuals that may lead to infection transfer. In addition, accuracy of disease predictions depends on the precise epidemiological knowledge about the pathogen, including the rate of pathogen transfer given a contact between two individuals, and the rate of recovery of infected individuals. The use of modern technology in recent years, including RFID, GPS, radio tags, proximity loggers and automated video tracking has enabled the collection of detailed movement and contact data, making network modeling feasible. Despite the technology, logistical and financial constraints still prevent data collection on all individuals and their social contacts [9-14]. More importantly, limited knowledge about a host-pathogen system makes it challenging to identify the mode of infection transmission, define the relevant contacts between individuals that may lead to infection transfer, and measure the per-contact rate of infection transmission [15-17]. 
Laboratory techniques of unraveling transmission mechanisms usually take years to resolve [18-20]. Defining accurate contact networks underlying infection transmission in human infectious disease has been far from trivial [3, 21, 22]. For animal infectious disease, limited information on host behavior and the epidemiological characteristics of the spreading pathogen makes it particularly difficult to define a precise contact network, which has severely limited the scope of network modeling in animal and wildlife epidemiology [15, 23]. Lack of knowledge about disease transmission mechanisms has prompted the use of several indirect approaches to identify the link between social structure and disease spread. A popular approach has been to explore the association between social network position (usually quantified as network degree) of an individual and its risk of acquiring infection [24-27]. Another approach is to use proxy behaviors, such as movement, spatial proximity or home-range overlap, to measure direct and indirect contact networks occurring between individuals [28-30]. A recent approach, called the k-test procedure, explores a direct association between infectious disease spread and a contact network by comparing the number of infectious contacts of infected cases to that of uninfected cases [31]. However, several challenges remain in identifying the underlying contact networks of infection spread that are not addressed by these approaches. First, it is often unclear how contact intensity (e.g. duration, frequency, distance) relates to the risk of infection transfer unless validated by transmission experiments [19]. Furthermore, the role of weak ties (i.e., low-intensity contacts) in pathogen transfer is ambiguous [21, 32]. The interaction network of any social group will appear as a fully connected network if monitored for a long period of time. 
As fully-connected contact networks rarely reflect the dynamics of infectious disease spread through a host population, one may ask whether weak ties can be ignored, or what constitutes an appropriate intensity threshold below which interactions are epidemiologically irrelevant. Second, many previous approaches ignore the dynamic nature of host contacts. The formation and dissolution of contacts over time is crucial in determining the order in which contacts occur, which in turn regulates the spread of infectious diseases through host networks [8, 33, 34]. Finally, none of the existing approaches allow direct comparison of competing hypotheses about disease transmission mechanisms, which may generate distinct contact patterns and consequently different contact network models. All of these challenges demand an approach that allows direct comparison between competing hypotheses on transmission pathways while taking into account the dynamics of host interactions and the constraints of data sampling. We introduce a statistical tool called INoDS (Identifying contact Networks of infectious Disease Spread) that establishes the epidemiological relevance of observed contact networks in explaining the patterns of infectious disease spread. INoDS also allows testing competing hypotheses on the mode of disease transmission by performing model comparison between different contact networks. The tool can estimate the per-contact transmission rate and recovery rate for various disease progressions (e.g. SI, SIS and SIR) and can be extended to incorporate complexities in transmission (e.g. individual-level heterogeneity in susceptibility; latent period following infection). INoDS provides inference on static and dynamic contact networks, and is robust to common forms of missing data. 
Using two empirical datasets, we highlight the two-fold application of INoDS: (i) to identify whether observed patterns of infectious disease spread are likely given an empirical contact network, and (ii) to identify transmission routes, the role of contact intensity, and the per-contact transmission rate of a host-pathogen system. The epidemiological insights into infectious disease provided by INoDS can be invaluable in implementing immediate disease control measures in the event of an emerging epidemic outbreak.

Results

The primary purpose of INoDS is to evaluate whether an observed contact network is likely to generate an infection time-series observed in a particular host population. INoDS also provides epidemiological insights into the spreading pathogen by estimating the per-contact rate of transmission (β). The pattern of infectious disease spread in a host population depends on the mode of transmission, and the epidemiological relevance of a contact network is sensitive to the amount of collected data on nodes and edges. The tool therefore treats each empirically collected contact network as a unique network hypothesis, and facilitates hypothesis testing between different contact networks. The INoDS algorithm follows a three-step procedure (Fig 1). First, the tool estimates a per-contact transmission rate (β) of the pathogen and a background transmission parameter (ϵ). The β parameter quantifies the per-contact rate of pathogen transmission, and the ϵ parameter quantifies components of infection transmission that are unexplained by the contact network (S1 Fig). In the second step, Bayesian hypothesis testing is performed to establish the epidemiological relevance of the observed contact network. The null hypothesis is expressed as a uniform distribution over randomized networks generated by permuting 10%–100% of the edge connections of each time-slice using a double-edge swap procedure [35] and assigning each edge the average edge weight of the original snapshot. In the final step, model selection between multiple contact network hypotheses is performed using the Bayes factor.
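
The randomization used to build the null ensemble can be illustrated with a minimal sketch. It assumes a single undirected snapshot stored as a dictionary of node pairs and weights; the function name `randomize_snapshot`, the data layout and the rule for converting a permutation fraction into a number of swaps are illustrative assumptions, not taken from the INoDS code base.

```python
import random

def randomize_snapshot(weighted_edges, fraction, seed=0, max_tries=10_000):
    """Rewire roughly `fraction` of the edges of one undirected snapshot with
    double-edge swaps, then assign every edge the snapshot's mean weight.

    weighted_edges: dict mapping (u, v) -> weight (hypothetical layout).
    """
    rng = random.Random(seed)
    edges = {tuple(sorted(e)) for e in weighted_edges}
    mean_weight = sum(weighted_edges.values()) / len(weighted_edges)
    n_swaps = round(fraction * len(edges) / 2)  # assumption: one swap rewires two edges
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        (a, b), (c, d) = rng.sample(sorted(edges), 2)
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible rewirings at random
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        new1, new2 = tuple(sorted((a, c))), tuple(sorted((b, d)))
        if new1 in edges or new2 in edges:
            continue  # swap would create a duplicate edge
        edges -= {(a, b), tuple(sorted((c, d)))}
        edges |= {new1, new2}
        done += 1
    return {e: mean_weight for e in edges}
```

Because each swap replaces edges a-b and c-d with a-c and b-d, every node keeps its degree, so the randomized snapshot differs from the original only in who is connected to whom.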

Fig 1

A schematic of our algorithm.

Observed data (left panel): INoDS utilizes observed infection time-series data to estimate statistical evidence towards a static or dynamic contact network hypothesis (or hypotheses) using a three-step procedure. Shown here is an example of two competing network hypotheses based on behaviors A and B that potentially cause infection transfer. Inferential steps (right panel): In the first step, the tool estimates the per-contact transmission rate parameter β, and the background transmission rate parameter ϵ, which captures the components of infection propagation unexplained by the edge connections of the network hypothesis. Here, the total number of infected connections of the focal node i (k) is 2. Second, to estimate the epidemiological relevance of the network hypothesis, Bayesian hypothesis testing is performed. The prior distribution shows that the null hypothesis (M = 1) assumes a uniform distribution over randomized networks generated by permuting 10%–100% of edge connections in the contact network (H), whereas the alternate hypothesis (M = 2) is a spike-shaped distribution such that only the contact network (H, 0% permutation) has non-zero probability. The distribution on the model index shifts to M = 2 if the alternate hypothesis has higher posterior probability than the null. Third, model selection of competing network hypotheses is performed using the Bayes factor (BF). A log Bayes factor above 2.44 is considered to be decisive support for one hypothesis over the other.


Validating INoDS performance

We validated the performance of INoDS using a simulated dataset. This dataset was generated by performing numerical disease simulations on a synthetic dynamic network (henceforth called the true synthetic network) where per-contact transmission rate, β, ranged from 0.01 to 0.1. Validation of step 1: Fig 2a shows INoDS estimates of β and ϵ each for 10 independent disease simulations of the synthetic pathogen with disease contagiousness ranging from 0.01 to 0.1. We found that INoDS accurately estimated the per-contact transmission rate β, and background transmission rate ϵ for the simulated dataset. The accuracy was independent of the pathogen’s contagiousness. For example, we estimated an average β value of 0.039 (SD = 0.003) using INoDS for disease simulations involving a simulated pathogen with β value of 0.04. The background transmission parameter, ϵ, was accurately estimated as zero for all simulations since all infection transmission events in the simulated dataset were perfectly explained by the edge connections of the contact network (Fig 2a).

Fig 2

Validation of the three steps of INoDS.


(a) Step 1: Absolute error in estimates of the per-contact transmission rate parameter β (orange circles) and background transmission rate ϵ (purple circles) for the simulated dataset with disease transmission rate (β*) ranging from 0.01 to 0.1. The true value of the background transmission rate (ϵ*) is zero. The filled black circle indicates the average absolute error and the error bars indicate the standard deviation around the mean value. (b) Step 2: establishing epidemiological relevance of the observed contact network. Each box summarizes the log Bayes factor of the observed network compared to the null hypothesis (viz. a prior of networks with 10% to 100% permuted edges). (c) Step 3: model selection between the observed contact networks (0% randomization level) and networks with increasing edge randomization (25%, 50%, 75% and 100%). The log Bayes factor was calculated by subtracting the log marginal evidence of randomized networks from the log marginal evidence of the true (0% randomized) synthetic network. A log Bayes factor of more than 2.44 (dashed line) is considered to be decisive evidence in favor of the observed contact network. The middle black line in each box plot is the median, the boxed area extends from the 25th to the 75th percentile, and whiskers extend from the hinge to the largest/smallest value no further than 1.5 times the inter-quartile range.

Validation of step 2: The second step, which involves establishing the epidemiological relevance of the contact network, was evaluated by performing Bayesian hypothesis testing. The prior was defined as an ensemble of randomized networks with 10%–100% permuted edge connections. We found that the log Bayes factor of the observed network was more than 2.44 for all values of β, indicating that INoDS accurately detected the epidemiological relevance of the contact network irrespective of the contagiousness of the spreading pathogen (Fig 2b). 
Validation of step 3: Where multiple contact network hypotheses exist, model selection is performed by computing the ratio of marginal likelihoods (Bayesian evidence). We validated this step by comparing the Bayesian evidence of the true synthetic network with that of networks generated by shuffling 25%, 50%, 75% and 100% of the edge connections present in the true synthetic network. We found that the log Bayes factor of the true synthetic contact network was more than 2.44 for most replicates compared with randomized networks when β* = 0.01. The log Bayes factor exceeded 10 for synthetic pathogens with transmission rates β* > 0.01, suggesting decisive evidence for the true synthetic network (Fig 2c).
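
The model-selection arithmetic in step 3 can be sketched in a few lines. The example assumes that a log marginal evidence has already been estimated for each network hypothesis; the function name `compare_hypotheses` and the numbers in the usage note are hypothetical, and the base of the logarithm is whatever the evidence estimates use.

```python
DECISIVE_LOG_BF = 2.44  # threshold quoted in the text

def compare_hypotheses(log_evidence):
    """Pick the hypothesis with the highest log marginal evidence and report
    its log Bayes factor against every other hypothesis.

    log_evidence: dict mapping hypothesis name -> log marginal evidence.
    """
    best = max(log_evidence, key=log_evidence.get)
    # A log Bayes factor is the difference of two log marginal evidences.
    log_bf = {
        name: log_evidence[best] - value
        for name, value in log_evidence.items()
        if name != best
    }
    return best, log_bf
```

For instance, with made-up evidences `{"observed": -120.0, "rand25": -125.5}`, the observed network is preferred with a log Bayes factor of 5.5, which exceeds the 2.44 threshold.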

Robustness to missing network data

Next, we tested the robustness of INoDS against two potential sources of error in network data collection: incomplete sampling of individuals in a population (missing nodes) and incomplete sampling of interactions between individuals (missing edges). To create networks with missing data, we randomly deleted 25–75% of nodes and edges from the true synthetic network that were not a part of the path of simulated infection spread. We focus on this type of missing data because infected individuals and their contacts are more likely to be observed (particularly for infections with observable symptoms). Methodologically, this approach also allows us to tease apart the performance of INoDS when both the complete and incomplete networks have equal ability to explain the propagation of disease. If INoDS is robust to missing data, we expect it to recover the same parameter estimates and model evidence as for the true and complete synthetic network. We found that even when 75% of network data is missing, the estimated transmission rate β is identical to the β values estimated for the complete synthetic network (Fig 3a). For all incompletely sampled networks, the log Bayes factor was greater than 2.44 compared to the null hypothesis. INoDS thus correctly identified all incompletely sampled networks as epidemiologically relevant (Fig 3b). As expected, the log Bayes factor of the true synthetic network relative to the incompletely sampled networks was less than 2.44 for all degrees of missing data, indicating that the true synthetic network does not have higher evidence than the incompletely sampled networks. Together, we found that the performance of INoDS was unaffected by missing network data if the missing nodes/edges are not a part of the outbreak path.
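
A rough sketch of the node-deletion scenario follows. It simplifies "not a part of the path of infection spread" to "never infected", which is an assumption made here for illustration; the function name and the edge representation are also hypothetical.

```python
import random

def drop_uninfected_nodes(edges, ever_infected, fraction, seed=0):
    """Remove a random `fraction` of never-infected nodes from one snapshot,
    together with all edges that touch a removed node.

    edges: list of (u, v) pairs; ever_infected: nodes infected at any time.
    """
    rng = random.Random(seed)
    nodes = {n for e in edges for n in e}
    removable = sorted(nodes - set(ever_infected))
    dropped = set(rng.sample(removable, round(fraction * len(removable))))
    # Edges between infected nodes are never affected by the deletion.
    return [e for e in edges if dropped.isdisjoint(e)]
```

Applying the function to every time-slice of a dynamic network would mimic incomplete sampling of individuals while leaving the contacts among infected hosts intact.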

Fig 3

Robustness of INoDS to missing network data.


Robustness of INoDS to missing nodes and missing edges in the network hypothesis. Networks with missing nodes/edges were created by randomly removing 25–75% of nodes/edges not involved in the infection spread path at each time-step from the dynamic synthetic network. (a) Step 1: Δβ is the relative deviation of the estimated transmission parameter β from the true transmission rate β*. (b) Step 2: Epidemiological relevance of the observed network with missing data. Each box summarizes the log Bayes factor of the observed network with missing data compared to the null hypothesis (viz. a prior of networks with 10% to 100% permuted edges). (c) Evidence for the true synthetic network over datasets with missing data. A log Bayes factor of more than 2.44 (dashed line) is considered to be strong support in favor of the observed contact network. The middle black line in each box plot is the median, the boxed area extends from the 25th to the 75th percentile, and whiskers extend from the hinge to the largest/smallest value no further than 1.5 times the inter-quartile range.

In the supplement, we tested the performance of INoDS when data is missing at random for nodes, edges or cases (S3 Fig). We found that the deviation of the estimated β increases with the degree of missing data, and networks with missing data had lower evidence compared with the true synthetic network. This is because removing nodes/edges involved in the transmission process lowers the network's ability to explain the propagation of the disease outbreak. Additionally, removing cases (i.e., infected status from nodes) from the infection time-series results in lower quality infection data compared to data where all infection events are documented.

Comparison with previous approaches

Next, we compared INoDS with two previous approaches that have been used to establish an association between infection spread and the contact network in a host population: the k-test and the network position test. The k-test procedure involves estimating the mean infected degree (i.e., number of direct infected contacts) of each infected individual in the network, called the k-statistic. The p-value in the k-test is calculated by comparing the observed k-statistic to a distribution of null k-statistics, which is generated by randomizing the node-labels of infection cases in the network [31]. The network position test compares the degree of infected individuals to that of uninfected individuals [24, 25, 27]. The observed network is considered to be epidemiologically relevant when the difference in average degree between infected and uninfected individuals exceeds the degree difference in an ensemble of randomized networks at the 5% significance level. Both of these previous approaches only provide evidence for static networks by comparison with a null expectation. We therefore performed comparisons with step 2 of INoDS, where the epidemiological relevance of a network hypothesis is evaluated. To do so, we performed simulations of infection with β ranging from 0.01 to 0.1, corresponding to disease prevalence of 8.1% to 99.8%, respectively. We found that INoDS accurately established epidemiological relevance across a wide range of β when no network or disease data was missing. At β = 0.01, however, the power of the model is lower compared to the k-test (Fig 4). For values of β beyond 0.01, the performance of INoDS in establishing epidemiological relevance surpasses that of the two previous approaches, the k-test procedure and the network position test. The power of the k-test in detecting the epidemiological relevance of an observed contact network decreases with an increasing amount of missing data and with increasing transmission rate of the pathogen. 
Of the three approaches, the network position test has the lowest power in detecting epidemiological relevance.
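
The k-test described above can be sketched as a simple permutation test on a static network. This is an illustrative reading of the verbal description, not the reference implementation from [31]; the adjacency layout, the one-sided comparison and the add-one p-value correction are assumptions.

```python
import random

def k_statistic(adjacency, infected):
    """Mean number of infected neighbours across infected nodes.

    adjacency: dict mapping node -> set of neighbouring nodes.
    """
    infected = set(infected)
    return sum(len(adjacency[i] & infected) for i in infected) / len(infected)

def k_test(adjacency, infected, n_perm=999, seed=0):
    """Compare the observed k-statistic with k-statistics obtained after
    randomly reassigning the infection labels among all nodes."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    n_cases = len(set(infected))
    observed = k_statistic(adjacency, infected)
    exceed = sum(
        k_statistic(adjacency, rng.sample(nodes, n_cases)) >= observed
        for _ in range(n_perm)
    )
    # One-sided p-value with the usual add-one correction (assumption).
    return observed, (exceed + 1) / (n_perm + 1)
```

A small p-value indicates that infected individuals are connected to each other more often than expected if cases were scattered over the network at random.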

Fig 4

Comparison of INoDS performance with previous approaches.

Statistical power of INoDS, k-test and network position test in establishing epidemiological relevance of the “true” contact network against three common forms of missing data—missing nodes, missing edges and missing infected cases. Statistical power of INoDS, k-test and network position test was calculated as the proportion of disease simulations where the observed contact network was detected as epidemiologically relevant (INoDS: log(B10) > 2.44; k-test and network position test: p < 0.05).


Applications to empirical data-sets

We next demonstrate the application of INoDS to perform hypothesis testing on contact networks, identify transmission mechanisms and infer the transmission rate using two empirical datasets. The first dataset is derived from the study by Otterstatter & Thomson [36] that examines the spread of an intestinal pathogen (Crithidia bombi) within colonies of the social bumble bee, Bombus impatiens. The second dataset documents the spread of Salmonella enterica within two wild populations of Australian sleepy lizards, Tiliqua rugosa [37]. We chose these two empirical datasets because they differ in (i) host taxonomic class, (ii) model of disease spread (SI vs SIS), (iii) disease data collection methodology (infection timing known for the bumble bee dataset vs unknown for the sleepy lizard dataset), and (iv) network connectivity (fully connected in bumble bees vs sparsely connected in sleepy lizards).

Determining transmission mechanism and the role of contact intensity: Case study of Crithidia bombi in bumble bees

Otterstatter & Thomson [36] showed that the transmission of the gut protozoan, Crithidia bombi, in bumble bee colonies is associated with the frequency of contacts with infected nest-mates rather than the duration of contacts. The dynamic contact networks in the experiments were fully connected, i.e., all individuals were connected to each other in the network at all time steps. We extended the previous analysis by answering two specific questions: (1) Does the type of contact (frequency vs duration) matter in transmission? and (2) Does contact intensity (i.e., the edge weights) between individuals contribute to infection transfer? We performed analyses on two types of contact network hypotheses (those described by frequency of contacts and those described by duration of contacts) and compared the results with the findings reported in [36]. To answer the two questions, we constructed dynamic contact networks where edges represent close proximity between individuals. Since fully connected networks rarely describe the dynamics of infection spread, we sequentially removed edges with weights less than 5–50% of the highest edge weight to generate contact network hypotheses at different edge weight thresholds. Corresponding to the two types (frequency and duration) of weighted networks, unweighted contact networks were also constructed by replacing weighted edges in the thresholded weighted networks with binary edges (i.e., edges with an edge weight of one). Fig 5 shows the estimates of the pathogen transmission rate β for the four types of contact network hypotheses at different edge weight thresholds. We found that only a few contact network hypotheses were epidemiologically relevant (bars with asterisks). Weighted frequency networks had the highest evidence for colonies QC6 and UN1, at 35% and 5% edge-weight thresholds respectively (bars with red asterisks). 
Our results therefore show that contact frequency, rather than duration, better explained the spread of Crithidia bombi through bumble bee colonies. We also found that contact intensity, quantified through edge weights, is important in explaining pathogen spread.
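
The construction of thresholded and binary network hypotheses can be sketched as follows. The sketch handles a single snapshot and takes the maximum weight within that snapshot as the reference; whether the original analysis used a per-snapshot or a global maximum is not stated here, so that choice, like the function name, is an assumption.

```python
def threshold_network(weighted_edges, threshold, binary=False):
    """Drop edges weaker than `threshold` times the largest edge weight.

    weighted_edges: dict mapping (u, v) -> weight (contact frequency or duration).
    threshold: fraction of the maximum weight, e.g. 0.05 to 0.5.
    binary: if True, replace every retained weight with one.
    """
    cutoff = threshold * max(weighted_edges.values())
    kept = {e: w for e, w in weighted_edges.items() if w >= cutoff}
    return {e: 1 for e in kept} if binary else kept
```

Sweeping `threshold` over a grid and toggling `binary` for both frequency and duration weights yields the four families of network hypotheses compared in Fig 5.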

Fig 5

Identifying the contact network model of Crithidia spread in two bumble bee colonies (QC6 and UN1) described in [36].

Edges in the contact network models represent physical interaction between the bees. Since the networks were fully connected, a series of filtered contact networks were constructed by removing weak weighted edges in the network. The x-axis represents the edge weight threshold used to remove weak edges in the network. Two types of edge weights were tested: duration and frequency of contacts. In addition, both types of weighted edges were converted to binary to create binary networks. The results shown are estimated values of the per-contact transmission rate, β, for the two colonies. Asterisks above bars indicate that the networks were epidemiologically relevant in explaining the spread of Crithidia (single asterisk: Log(B10) = 0.5–1, substantial evidence; double asterisks: Log(B10) = 1–2, strong evidence). We note that model convergence was not achieved for several network hypotheses, which were removed from our final analysis.


Identifying transmission mechanisms with imperfect disease data: Case study of Salmonella enterica in Australian sleepy lizards

Spatial proximity is known to be an important factor in the transmission of Salmonella enterica within Australian sleepy lizard populations [37]. However, it is not known whether the transmission risk increases with the frequency of proximate encounters between infectious and susceptible lizards. We therefore tested two contact network hypotheses to explain the spread of Salmonella at two sites of wild sleepy lizard populations. The first contact network hypothesis placed binary edges between lizards if they were ever within 14 m of each other during a day (24 hours). We constructed the second contact network by assigning edge weights proportional to the number of times two lizards were recorded within 14 m of each other during a day. Because disease sampling was performed at regular fortnightly intervals, the true infection time (day) of individuals at both study sites was unknown. We therefore used a data augmentation method in INoDS (see Materials & methods) to sample unobserved infection timings along with the per-contact transmission rate, β, and error, ϵ. We found that the weighted network was epidemiologically relevant at site 2 but not at site 1 (Fig 6). Proximity networks with weighted edges had higher marginal (Bayesian) evidence compared with binary networks at both sites. This suggests that the occurrence of repeated contacts between two spatially proximate individuals, rather than just the presence of contact between individuals, is more explanatory of Salmonella transmission in sleepy lizards.
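
The two proximity-network hypotheses can be sketched from a table of pairwise distance records. The record layout `(day, lizard_a, lizard_b, distance_m)`, the function name and the inclusive 14 m cut-off are assumptions for illustration; the original tracking data may be organized differently.

```python
from collections import Counter

def daily_proximity_networks(records, max_dist=14.0):
    """Build binary and weighted daily proximity networks.

    records: iterable of (day, lizard_a, lizard_b, distance_m) tuples, one per
             simultaneous position fix of a pair (hypothetical layout).
    Returns two dicts keyed by (day, u, v) with u < v.
    """
    counts = Counter()
    for day, a, b, dist in records:
        if dist <= max_dist:
            u, v = sorted((a, b))
            counts[(day, u, v)] += 1
    weighted = dict(counts)            # weight = number of close encounters that day
    binary = {key: 1 for key in counts}  # edge present if ever close that day
    return binary, weighted
```

Both networks share the same edges on every day; they differ only in whether repeated encounters add weight, which is the contrast tested by the two hypotheses.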

Fig 6

Identifying transmission mechanisms of Salmonella spread in Australian sleepy lizards.

Dynamic network of proximity interactions for a total duration of 70 days between (A) 43 lizards at site 1, and (B) 44 lizards at site 2. Each temporal slice summarizes interactions within a day (24 hours). Edges indicate that the pair of individuals were within 14 m of each other, and the edge weights are proportional to the frequency of physical interactions between the node pair. For ease of visualization, four networks summarizing interactions at day 15, 30, 57 and 70 are shown out of a total of 70 static network snapshots. Green nodes are the animals that were diagnosed to be uninfected at that time-point, red nodes are the animals that were diagnosed to be infected, and grey nodes are the individuals with unknown infection status at that time-point. We hypothesized that the spatial proximity networks could explain the observed spread of Salmonella in the population. The results are summarized as a table. Bold numbers indicate that the network hypothesis was found to be epidemiologically relevant compared to an ensemble of randomized networks. The network hypothesis with the highest log Bayesian (marginal) evidence at each site is marked with an asterisk (*).


Discussion

In this study we present INoDS as a tool that performs network model selection and establishes the statistical significance of a contact network model to describe the spread of infectious diseases. Our method also provides epidemiological insights about the host-pathogen system by enabling hypothesis testing on different transmission mechanisms, and estimating pathogen transmission rates. Unlike previous approaches that rely on social network position [24-27], proxy behaviors [28-30] or connectivity [31], our method is robust to missing network data and imperfect disease surveillance, and can provide inference for dynamic networks and a range of disease progression models. Additionally, our tool overcomes a common challenge of imperfect knowledge of infection acquisition by assuming infection times to be unobserved and using data on infection diagnosis instead to provide inference on contact networks. In principle, the background transmission rate parameter, ϵ, in INoDS is similar to the asocial learning rate used in the network-based diffusion analysis approach in the behavior learning literature [38, 39]. The background transmission parameter in our model serves to approximately assess transmission that is unexplained by the edge connections of the network hypothesis. Under the scenario of a network hypothesis with no edges, the best fit ϵ would indicate that all disease data is generated by unobserved transmission. Any transmission events that can be better explained by a network edge reduce the model's expectation from this maximum ϵ estimate. The relative deviation of ϵ from the maximum possible value is therefore highest for a network hypothesis with no missing data. The relative deviation of ϵ declines with increasing missing data, although it is more sensitive to missing network information than to missing information in infection time-series data (S4 Fig). 
Our work thus addresses a growing subfield in network epidemiology that leverages statistical tools to infer contact networks using all available host and disease data [9, 31, 40–42]. Our approach can be used to tackle several fundamental challenges in the field of infectious disease modeling [21, 22]. First, INoDS can be used to perform model selection on contact network models that quantify different transmission modes; this approach facilitates the identification of infection-transmitting contacts and does not rely on laboratory experimentation (or subjective expert knowledge). Second, INoDS can be used to establish the statistical significance of proxy measures of contact (such as spatial proximity, home-range overlap or asynchronous refuge use) in cases where data on direct interactions between hosts are limited. Third, INoDS can establish the epidemiological role of edge weights in a contact network by performing model selection of contact networks with similar edge connections but different edge weighting criteria. In the first empirical example involving the spread of the Crithidia gut protozoan in bumble bee colonies, we demonstrate that contact networks weighted with respect to frequency, rather than duration, explain the observed patterns of transmission. Our results therefore support the original finding of the study [36], where individual risk of infection was found to be correlated with contact rate with infected nest-mates. We further found that weak ties below a certain edge weight threshold do not play an important role in infection transfer for this empirical system. In the second empirical dataset, we found that the frequency of contacts between closely located lizards allows better, i.e. more consistent, predictions of Salmonella transmission. Our results support those of a previous study, which suggests that bacterial transmission in Australian sleepy lizards occurs between closely located animals [37]. 
As with all models, the results from INoDS should be considered within the context of model assumptions. First, we assumed that the infection process has no latent period. In the future, disease latency can be incorporated into the model by using a data augmentation technique, similar to what we use for inferring infection times. Second, we assumed that the infectiousness of infected hosts and susceptibility of naive hosts is equal for all individuals in the population. Heterogeneity in infectiousness of infected hosts and the susceptibility of naive hosts can be incorporated as random effects in the model. Third, in the current version of our model, we do not consider re-infection of hosts. Re-infectivity of hosts can be easily incorporated by allowing multiple time-points of infection acquisition in Eq (2). Our results show that the data-collection efforts should aim to sample as many individuals in the population as possible, since missing nodes have the greatest impact (rather than missing edges) on the statistical significance of network models. Since data-collection for network analysis can be labor-intensive and time-consuming, our approach can be used to make essential decisions on how limited data collection resources should be deployed. For example, under a limited capability of recording real-time interactions between hosts, INoDS can identify the minimum time-resolution required during data collection for a network model with sufficient statistical ability to establish epidemiological relevance. Our approach can also be used to improve targeted disease management and control by identifying high-risk behaviors and super-spreaders of a novel pathogen without relying on intensive transmission experiments that take years to resolve.

Materials & methods

Here we describe INoDS (Identifying contact Networks of infectious Disease Spread), a computational tool that (i) estimates the per-contact transmission rate (β) of an infectious disease for empirical contact networks, (ii) establishes the epidemiological relevance of a contact network by performing Bayesian hypothesis testing, and (iii) enables discrimination of competing contact network hypotheses, including those based on pathogen transmission mode, edge weight criteria and data collection techniques. Two types of data are required as input for INoDS: infection time-series data, which include infection diagnoses (coded as 0 = not infected and 1 = infected) and the time-step of diagnosis for all available individuals in the population; and an edge-list of a dynamic (or static) contact network. An edge-list is a list of node pairs (each node pair represents an edge of the network), along with the weight assigned to the interaction and the time-step of the interaction, with one node pair per line. The tool can also be used for unweighted contact networks, in which case an edge weight of one is assigned to all edges. Time-steps of interactions are not required when analysis is performed on static contact networks. The software, implemented in Python, is platform independent, and is freely available at https://github.com/bansallab/INoDS-model. Empirical datasets used in this study are available at https://doi.org/10.7910/DVN/YAHRDJ.
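As a rough illustration of the two inputs described above, the sketch below parses a small edge-list and a small infection time-series into Python structures. The column names and their order (`node1`, `node2`, `weight`, `timestep`; `node`, `timestep`, `diagnosis`) are assumptions made for this example and may not match the file layout expected by the INoDS software.

```python
import csv
import io

# Hypothetical edge-list: one node pair per line, with weight and time-step.
EDGE_LIST = """node1,node2,weight,timestep
a,b,1,0
b,c,3,0
a,c,2,1
"""

# Hypothetical infection time-series: 0 = not infected, 1 = infected.
DIAGNOSES = """node,timestep,diagnosis
a,0,1
b,0,0
b,1,1
c,1,0
"""

def read_edges(text):
    """Return {timestep: {(u, v): weight}} with each pair stored in sorted order."""
    edges = {}
    for row in csv.DictReader(io.StringIO(text)):
        pair = tuple(sorted((row["node1"], row["node2"])))
        edges.setdefault(int(row["timestep"]), {})[pair] = float(row["weight"])
    return edges

def read_diagnoses(text):
    """Return {node: [(timestep, diagnosis), ...]} sorted by time-step."""
    series = {}
    for row in csv.DictReader(io.StringIO(text)):
        series.setdefault(row["node"], []).append(
            (int(row["timestep"]), int(row["diagnosis"]))
        )
    for records in series.values():
        records.sort()
    return series

edges = read_edges(EDGE_LIST)
diagnoses = read_diagnoses(DIAGNOSES)
```

For a static or unweighted network, the same structure could be used with a single time-step key and all weights set to one.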

INoDS formulation

For a susceptible individual, the potential of acquiring infection at each time-step depends on the per contact transmission rate, the total strength of interactions with its infected neighbors at the previous time-step, and a parameter ϵ that captures the force of infection that is not explained by the individual’s social connections. The probability of receiving infection, λ_i(t), of a susceptible individual i at time t is thus calculated as:

$$\lambda_i(t) = 1 - \exp\{-\beta\, w_i(t-1) - \epsilon\} \qquad (1)$$

where both β and ϵ parameters are > 0; w_i(t − 1) denotes the total strength of association between the focal individual i and its infected associates at the previous time-step (t − 1). For binary (unweighted) contact network models w_i = k_i, where k_i is the total infected connections of the focal individual. The log-likelihood of the observed infection time-series data given the contact network hypothesis (H) can therefore be estimated as:

$$\log L(D \mid H, \Theta) = \sum_{n} \log\left[\lambda_n(t_n)\right] + \sum_{m} \sum_{t \in t_m} \log\left[1 - \lambda_m(t)\right] \qquad (2)$$

where D is the infection time-series data, the set of unknown parameters is Θ = {Θ1, Θ2, …}, t_n is the time of infection of individual n, and t_m are all time-steps at which individual m is naive and has the ability to contract infection. The first part of Eq 2 estimates the log-likelihood of all observed infection acquisition events. The second part of the equation represents the log-likelihood of susceptible individuals m remaining uninfected at time t. Following Bayes’ theorem, the posterior distribution of the set of parameters is given as:

$$P(\Theta \mid D, H) = \frac{P(D \mid H, \Theta)\, P(\Theta \mid H)}{P(D \mid H)} \qquad (3)$$

where D is the infection time-series data, H is the contact network hypothesis, and P(Θ | D, H), P(D | H, Θ), P(Θ | H) and P(D | H) are the shorthands for the posterior, the likelihood, the prior and the marginal likelihood, respectively.
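The likelihood described above can be sketched in a few lines of Python. This is a minimal illustration, not the INoDS implementation: it assumes the infection probability takes the exponential form λ = 1 − exp(−(βw + ϵ)), and the function and argument names are hypothetical.

```python
import math

def infection_probability(beta, epsilon, w_prev):
    """Assumed form of Eq 1: 1 - exp(-(beta * w_prev + epsilon))."""
    return 1.0 - math.exp(-(beta * w_prev + epsilon))

def log_likelihood(beta, epsilon, infection_events, susceptible_exposures):
    """Sketch of Eq 2 for given beta and epsilon.

    infection_events: one value w(t_n - 1) per observed infection, i.e. the
        total edge weight to infected neighbors just before infection.
    susceptible_exposures: one value w(t - 1) per (individual, time-step)
        at which a susceptible individual remained uninfected.
    """
    total = 0.0
    for w in infection_events:
        # First sum: log-probability of each observed infection event.
        total += math.log(infection_probability(beta, epsilon, w))
    for w in susceptible_exposures:
        # Second sum: log(1 - lambda) simplifies to -(beta * w + epsilon).
        total += -(beta * w + epsilon)
    return total
```

Under this form, an individual with no infected contacts (w = 0) can still become infected with probability 1 − exp(−ϵ), which is how ϵ absorbs transmission that the network does not explain.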

Parameter estimation and data augmentation of infection timings

We used the DynamicNestedSampler of the dynesty package implemented in Python to estimate Bayesian posteriors and evidences [43]. Nested sampling is a numerical method of simultaneously estimating both the posterior and evidence by maintaining a set of samples from the prior, and iteratively updating them subject to the constraint that new samples have higher likelihoods. Dynamic nested sampling allows the samples to be allocated adaptively, maximizing both accuracy and efficiency [43]. We assumed a uniform prior distribution for β and ϵ parameters with range [0, 10]. Calculation of the likelihood in Eq 2 requires knowledge of the exact timing of infection, t_1, …, t_n, for n infected individuals in the population. However, in many cases the only data available are the timings of when individuals in a population were diagnosed to be infected, d_1, …, d_n. We therefore employ a Bayesian data augmentation approach to estimate the actual infection timings in the disease dataset [44]. Since in this case the infection time t_i for an individual i is unobserved, we only know that t_i lies in the interval (L_i, d_i], where L_i is the time of the last negative diagnosis of individual i before infection acquisition. Within this interval, the individual could have potentially acquired infection at any time-step where it was in contact with other individuals in the network. Assuming the incubation period to be one time-step, the potential set of infection timings can be represented as t_i ∈ {t : g_i(t − 1) > 0, L_i < t ≤ d_i}, where g_i(t − 1) is the degree (number of contacts) of individual i at time t − 1. For infections that follow an SIS or SIR disease model, it is also essential to impute the recovery time of infected individuals for accurate estimation of infected degree. To do so, we adopt a similar data augmentation approach as described to sample from the set of possible recovery time-points. The data augmentation proceeds in two steps. In the first step, the missing infection times are imputed conditional on the possible set of infection times. In the next step, the posterior distributions of the unknown parameters are sampled based on the imputed data. We performed data imputation using the inverse transform sampling method, which is a technique for drawing random samples from any probability distribution given its cumulative distribution function [45].
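The imputation step can be illustrated with a small sketch that builds the candidate set of infection times for one individual and then draws from it by inverse transform sampling. The function names, the dictionary layout for degrees and the choice of weights are assumptions for this example; the text does not specify the distribution used over the candidate times.

```python
import bisect
import itertools
import random

def candidate_infection_times(last_negative, diagnosis, degree_by_time):
    """Time-steps t with last_negative < t <= diagnosis and degree at t - 1 > 0."""
    return [
        t for t in range(last_negative + 1, diagnosis + 1)
        if degree_by_time.get(t - 1, 0) > 0
    ]

def inverse_transform_sample(values, weights, rng):
    """Draw one value by inverting the cumulative distribution of the weights."""
    cdf = list(itertools.accumulate(weights))
    u = rng.random() * cdf[-1]              # uniform draw on [0, total weight)
    index = bisect.bisect_right(cdf, u)     # first position with cdf > u
    return values[min(index, len(values) - 1)]

# Hypothetical individual: last negative test at t = 0, positive at t = 4.
degree = {0: 0, 1: 2, 2: 0, 3: 1}
times = candidate_infection_times(0, 4, degree)
imputed = inverse_transform_sample(times, [1.0] * len(times), random.Random(7))
```

Equal weights give a uniform draw over the candidate times; unequal weights (for example, proportional to infected degree) would be one way to make the imputation depend on contact history.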

Interpretation of the ϵ parameter

In principle, inclusion of the ϵ parameter in Eq 1 is similar to the asocial learning rate used in the network-based diffusion analysis approach in the behavior learning literature [38, 39]. The background transmission parameter ϵ in the INoDS formulation serves to approximately assess transmission that is unexplained by the edge connections of the network hypothesis. The estimated value of the ϵ parameter increases with missing data, although it is more sensitive to missing network information than to missing information in infection time-series data (S3 Fig).

Epidemiological relevance of a contact network

We performed Bayesian hypothesis testing to establish the epidemiological relevance of a contact network in explaining the observed pattern of infectious disease spread. The null hypothesis (H_0) is expressed as a uniform prior distribution over networks with permuted edge connections of varying degree (10% to 100%). The prior distribution of the alternative hypothesis (H_A) puts all credibility in an infinitely dense spike at the 0% permutation level (viz, the observed contact network). The Bayes factor of the alternative hypothesis vs the null was defined as:

$$BF = \frac{P(D \mid H_A)}{P(D \mid H_0)}$$

The log Bayes factor was calculated as the difference between the log marginal likelihoods of the alternative and null hypotheses. A log Bayes factor of 0.5–1, 1–2 and >2.44 is considered as substantial, strong and decisive evidence, respectively, towards the epidemiological relevance of the observed contact network [46, 47].
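One way to read the test above is sketched below. It assumes that, because the null places a uniform prior over the permuted networks, the null marginal likelihood is the average of the marginal likelihoods obtained at each permutation level; this averaging step and the function names are assumptions of the sketch rather than details given in the text.

```python
import math

def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    peak = max(log_values)
    total = sum(math.exp(v - peak) for v in log_values)
    return peak + math.log(total / len(log_values))

def log_bayes_factor(log_evidence_observed, log_evidence_permuted):
    """Log Bayes factor of the observed network against the permuted-network null.

    log_evidence_observed: log marginal likelihood for the observed network.
    log_evidence_permuted: log marginal likelihoods, one per permuted network.
    """
    return log_evidence_observed - log_mean_exp(log_evidence_permuted)

# Hypothetical log evidences, for illustration only.
lbf = log_bayes_factor(-10.0, [-12.0, -12.5, -13.0])
```

The result is in the same logarithmic base as the evidences passed in, so that base has to match the scale on which the thresholds are interpreted.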

Model selection of competing network hypotheses

To facilitate model selection in cases where there is more than one network hypothesis, we compute the marginal likelihood of the infection data given each contact network model. The marginal likelihood, also called the Bayesian evidence, measures the model fit, i.e., to what extent the infection time-series data can be simulated by a network hypothesis (H1). Bayesian evidence is based on the average model fit, and is calculated by integrating the model fit over the entire parameter space. Dynamic nested sampling calculates the evidence by integrating the prior within nested contours of constant likelihood. Model selection can then be performed by computing the pair-wise Bayes factor, i.e. the ratio of the marginal likelihoods of two network hypotheses. The log Bayes factor to assess the performance of network hypothesis H1 over network hypothesis H2 is expressed as:

$$\log BF_{12} = \log P(D \mid H_1) - \log P(D \mid H_2)$$

The contact network with a higher marginal likelihood is considered to be more plausible, and a log Bayes factor of more than 2.44 is considered to be decisive support in favor of the alternative network model (H1) [46, 47].

We validated the performance of INoDS by evaluating its accuracy in estimating the unknown transmission parameter β and its robustness to missing data, and by comparing the performance of the tool with previous approaches. To do so we first constructed a dynamic synthetic network using the following procedure. At time-step t = 0, a static network of 100 nodes, mean degree 4, and Poisson degree distribution was generated using the configuration model [35]. At each subsequent time-step, 10% of edge-connections present in the previous time-step were permuted, for a total of 100 time-steps. That is, the dynamic synthetic network (called true synthetic network in the results) consists of contacts (i, j, t) between node i and j at time t, lasting for a duration of 1 to 100 time steps; each contact is capable of disease transmission [48]. Next, through the synthetic dynamic network, we performed 10 independent SI disease simulations with per contact rate of infection transmission (β*) ranging from 0.01 to 0.1. Model accuracy was evaluated by comparing the INoDS estimate of the transmission parameter, β, with the true transmission rate β* that was used to perform disease simulations. Since the synthetic network completely described the disease simulations, model accuracy was also tested by evaluating the deviation of the estimated error parameter ϵ from the expected value of zero. We next tested robustness of the tool against two types of missing network data: (a) incomplete sampling of individuals in a population (missing nodes), and (b) incomplete sampling of interactions between individuals (missing edges). To do so, we simulated two independent SI disease simulations each for ten β* values ranging from 0.01 to 0.1 with increment of 0.01 through the synthetic dynamic network. The two scenarios of missing data were created by randomly removing 25–75% of nodes and edges from the true synthetic network that were not a part of the path of simulated infection spread. This approach allowed us to investigate INoDS performance for various levels of incompletely sampled networks with the infection path preserved. If INoDS is robust to missing network data, we expect to recover similar parameter estimates, epidemiological relevance and model evidence for all networks.
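The disease simulations described above can be sketched as a discrete-time SI process on a dynamic contact list. This is an illustrative sketch only: it assumes that each contact with an infected individual transmits independently with probability 1 − exp(−β), and the function name and data layout are hypothetical.

```python
import math
import random

def simulate_si(contacts, beta, seed_node, n_steps, rng):
    """Simulate SI spread on a dynamic, unweighted contact network.

    contacts: {t: [(i, j), ...]} listing the contacts present at time-step t.
    Returns {node: infection_time} for every node that became infected.
    """
    p_transmit = 1.0 - math.exp(-beta)   # assumed per-contact probability
    infected_at = {seed_node: 0}
    for t in range(1, n_steps + 1):
        newly_infected = set()
        # Infection at time t depends on contacts at the previous time-step.
        for i, j in contacts.get(t - 1, []):
            for source, target in ((i, j), (j, i)):
                if source in infected_at and target not in infected_at:
                    if rng.random() < p_transmit:
                        newly_infected.add(target)
        for node in newly_infected:
            infected_at[node] = t
    return infected_at

# Hypothetical three-node chain observed for three time-steps.
chain = {t: [(0, 1), (1, 2)] for t in range(3)}
outbreak = simulate_si(chain, beta=0.05, seed_node=0, n_steps=3, rng=random.Random(42))
```

The returned infection times can then be thinned (for example, by keeping only periodic diagnoses) to mimic the incomplete surveillance scenarios discussed in the text.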

Comparisons with previous approaches

We compared INoDS performance with two previous approaches that are popularly used to establish epidemiological significance of contact networks: the k-test [31] and the network position test. The k-test estimates the average number of infected connections for each infected individual in the network. To establish the epidemiological significance of a contact network, this metric, called the k-statistic, is compared to a distribution of null k-statistics obtained by randomly swapping the node labels. A p-value is calculated as the proportion of permutations that produce k-statistics more extreme than the observed k-statistic. The network position test compares the average degree (i.e., the average number of connections) of infected cases with the degree of uninfected individuals. The difference in average degree in the observed network is compared to the degree difference in networks where edge connections are randomized. A p-value is calculated as the proportion of random networks where the degree difference between infected and uninfected individuals is higher than in the observed network. Both of these previous approaches only provide epidemiological evidence of the observed contact network and do not estimate transmission parameters or enable model comparisons. We therefore performed comparisons with step 2 of INoDS, where epidemiological relevance of the observed network is evaluated. We also note that the previous approaches are limited to static and unweighted networks; we therefore performed model comparisons on a “true” synthetic static network with 100 nodes, Poisson degree distribution and a mean network degree of 3. Simulations of disease spread were performed with a broad range of per contact transmission rates (β). Null expectation in INoDS and the network position test was generated by permuting the edge connections of the observed networks, creating an ensemble of null networks. In the k-test, the locations of infection cases within the observed network are permuted, creating a permuted distribution of the k-statistic [31].
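The k-test described above can be sketched as a simple permutation test. This is an illustrative reading of the description, not the implementation from [31]: the function names are hypothetical, and "more extreme" is taken here to mean a permuted k-statistic at least as large as the observed one.

```python
import random

def k_statistic(edges, infected):
    """Mean number of infected neighbors per infected node."""
    counts = {node: 0 for node in infected}
    for u, v in edges:
        if u in infected and v in infected:
            counts[u] += 1
            counts[v] += 1
    return sum(counts.values()) / len(counts)

def k_test(edges, nodes, infected, n_permutations, rng):
    """Return the observed k-statistic and a one-sided permutation p-value.

    nodes must be a list so that infection labels can be reshuffled over it.
    """
    observed = k_statistic(edges, infected)
    hits = 0
    for _ in range(n_permutations):
        # Reassign the same number of infection labels to random nodes.
        shuffled = set(rng.sample(nodes, len(infected)))
        if k_statistic(edges, shuffled) >= observed:
            hits += 1
    return observed, hits / n_permutations

# Hypothetical path network with a cluster of three infected nodes.
path = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
observed_k, p_value = k_test(path, list(range(6)), {0, 1, 2}, 200, random.Random(0))
```

Because only the infection labels are shuffled, the network structure is held fixed, which is the main contrast with the edge-permutation null used by INoDS and the network position test.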

Applications to empirical datasets

We demonstrate the applications of our approach using two datasets from the empirical literature. The first dataset comprises dynamic networks of bee colonies (N = 5–7 individuals), where edges represent direct physical contacts that were recorded using color-based video tracking software. A bumble bee colony consists of a single queen bee and infertile workers. Here, we focus on the infection experiments in two colonies (colony QC6 and UN1). Infection progression through the colonies was tracked by daily screening of individual feces, and the infection timing was determined using knowledge of the rate of replication of C. bombi within its host intestine. The second dataset monitors the spread of the commensal bacterium Salmonella enterica in two separate wild populations of the Australian sleepy lizard Tiliqua rugosa. The two sites consisted of 43 and 44 individuals respectively, and these represented the vast majority of all resident individuals at the two sites (i.e., no other individuals were encountered during the study period). Individuals were fitted with GPS loggers and their locations were recorded every 10 minutes for 70 days. Salmonella infections were monitored using cloacal swabs on each animal once every 14 days. Consequently, the disease data in this system do not identify the onset of each individual’s infection. We used an SIS (susceptible-infected-susceptible) disease model to reflect the fact that sleepy lizards can be reinfected with Salmonella. Proximity networks were constructed by assuming a contact between individuals whenever the locations of two lizards were recorded to be within 14 m of each other [26]. The dynamic networks at both sites consisted of 70 static snapshots, with each snapshot summarizing a day of interactions between the lizards. We constructed two contact network hypotheses to explain the spread of Salmonella. The first contact network hypothesis placed binary edges between lizards if they were ever within 14 m of each other during a day. The second contact network assigned edge weights proportional to the number of times two lizards were recorded within 14 m of each other during a day. Specifically, edge weights between two lizards were equal to their frequency of contacts during a day normalized by the maximum edge weight observed in the dynamic network.
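The two lizard network hypotheses can be sketched as follows. The input layout (planar coordinates in metres keyed by day and GPS fix) and the function names are assumptions for this example, and "within 14 m" is treated as a distance of at most 14 m.

```python
import itertools
import math

def daily_contact_counts(fixes, threshold_m=14.0):
    """Count, per day and lizard pair, the GPS fixes recorded within threshold_m.

    fixes: {(day, fix_index): {lizard_id: (x_m, y_m)}} (hypothetical layout).
    Returns {day: {(a, b): count}} with a < b.
    """
    counts = {}
    for (day, _), positions in fixes.items():
        for a, b in itertools.combinations(sorted(positions), 2):
            if math.dist(positions[a], positions[b]) <= threshold_m:
                day_counts = counts.setdefault(day, {})
                day_counts[(a, b)] = day_counts.get((a, b), 0) + 1
    return counts

def binary_network(counts):
    """Edge weight 1 for any pair that was ever in contact during the day."""
    return {day: {pair: 1 for pair in pairs} for day, pairs in counts.items()}

def frequency_network(counts):
    """Daily contact counts divided by the maximum count in the whole network."""
    peak = max(c for pairs in counts.values() for c in pairs.values())
    return {
        day: {pair: c / peak for pair, c in pairs.items()}
        for day, pairs in counts.items()
    }

# Hypothetical fixes for three lizards over two days.
fixes = {
    (0, 0): {"A": (0, 0), "B": (10, 0), "C": (100, 0)},
    (0, 1): {"A": (0, 0), "B": (5, 0), "C": (100, 0)},
    (1, 0): {"A": (0, 0), "B": (50, 0), "C": (60, 0)},
}
counts = daily_contact_counts(fixes)
```

Both hypotheses share the same edges and differ only in their weights, which is what allows the model comparison to isolate the role of contact frequency.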

Corner plot of the posterior sample showing relationship between estimates of β and ϵ.

True values of the parameters are indicated with red lines. Contours contain 25%, 50% and 75% of the sample points, respectively. (EPS)

Estimate of ϵ with increasing percentages of nodes/edges removal that were not involved in infection spread path.

(EPS)

Robustness of INoDS to random missing data. Networks with missing data were created by randomly removing 10–60% of nodes, edges and cases.

(a) Step 1: Δβ is the deviation of the estimated transmission parameter β from the true transmission rate β*. (b) Step 2: Epidemiological relevance of the observed network with missing data. Each box summarizes the log Bayes factor of the observed network with missing data compared to the null hypothesis (viz a prior of networks with 10% to 100% permuted edges). (c) Evidence for the true synthetic network over datasets with missing data. A log Bayes factor of more than 2.44 (dashed line) is considered to be strong support in favor of the observed contact network. The middle black line in each box plot is the median, the boxed area extends from the 25th to the 75th percentile, and whiskers extend from the hinge to the largest/smallest value no further than 1.5 times the inter-quartile range. (EPS)

Relative deviation of estimated ϵ from the maximum possible value with increasing percentage of random missing data.

Maximum ϵ was determined by maximizing the likelihood function in Eq (2) assuming a network with no edges. Each boxplot summarizes the results of 20 independent SI disease simulations through a synthetic dynamic network: two each for ten β* values ranging from 0.01 to 0.1 with increment of 0.01. Missing data were created by randomly removing 10–60% of nodes, edges or cases from the dataset. The middle black line in each box plot is the median, the boxed area extends from the 25th to the 75th percentile, and whiskers extend from the hinge to the largest/smallest value no further than 1.5 times the inter-quartile range. (EPS)

Disease prevalence corresponding to different values of β used in our simulations for Fig 4.

(EPS)

18 Jan 2021

Dear Dr. Sah,

Thank you very much for submitting your manuscript "Revealing mechanisms of infectious disease outbreak through empirical contact networks" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Benjamin Muir Althouse
Associate Editor
PLOS Computational Biology

Rob De Boer
Deputy Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: This manuscript presents a method of inferring model parameters for disease spreading on empirical contact networks. It is also capable of testing the robustness of the network representations given data about who is infected at what time. I think it looks like a contribution of practical importance, even though getting the data needed for input is a difficult task. Below are some specific concerns:

[Line 9–12] "Constructing a complete contact network model requires (i) knowledge about the transmission route(s) of a pathogen, (ii) a sampling of all individuals in a population, and (iii) a sampling of all interactions among the sampled individuals that may lead to infection transfer." First, if you have done (iii), haven't you also done (ii)? Second, why do you need to know (i)? Yes, you would need to know it to infer the parameters of the compartmental model, but not to construct a contact network?

[Line 61–65] You say you can use a "dynamic contact network." Above, you defined a contact network as "individuals are represented as nodes, and an edge between two nodes represents an interaction that has the potential to transmit infection". Exactly how does dynamism enter this definition?

[Line 62–63] If you have a temporal network version about the SIS or SIR model, it is a two-parameter model. You cannot reduce parameters and omit the recovery rate as in static network epidemiology.
Thus, your tool also needs to estimate the recovery rate (or disease duration, depending on your model of infection duration).

[Line 63] In the light of the previous question, you can omit the SI model since it is contained in the SIS and SIR models.

[Fig. 1] The smallest text is not readable.

[Around line 87] How do you randomize the network? If you randomize a dynamical network, there are many options that all destroy different structures of the original data. See Gauvin et al., Randomized reference models for temporal networks https://arxiv.org/abs/1806.04032

[Around line 87] What is the logic of randomizing different percentages of the links? When you randomize the network, you destroy the particular structure that affects the disease spreading. It is not telling you whether your network is robust to errors because random changes make the network more random, i.e., less like the structured, real networks. If you consider networks obtained by random changes as a model, this is systematically biased toward randomness and should certainly be a bad model.

[Line 104] " 0.039 ( 0.003 SD)" What is this notation? Do you mean 0.003 times the standard deviation? or that 0.003 is the standard deviation?

[Line 118] What is an "edge" in a dynamic network? Please conform to the temporal-network literature. Masuda & Holme, Predicting and controlling infectious disease epidemics using temporal networks, F1000Prime Rep. 5, 6 (2013).

Continuing the previous point. There are papers in the temporal-network epidemiology literature that relate to this work. One such is: Holme & Rocha, Impact of misinformation in temporal network epidemiology, Network Science 7, 52-69 (2019).

[Line 122] How do you define "true synthetic network"?

[Fig. 3] What is the point of plotting panel A? You could just state what we see in a sentence.

[Fig. 3 caption] If you remove edges not involved in a spreading path, aren't you introducing a bias? Those edges should, by definition, be less important for the epidemics.

Reviewer #2: Sah, et al. present an exciting technique that is leaps and bounds beyond the "comparable" approaches they test it against in the study of contact network relevance to disease spread. This work has great potential to address some of the most pressing concerns in disease ecology, namely a rigorous way to assess and compare contact networks as explanatory tools in the spread of infectious disease. Despite my excitement and generally positive impression, I have one main concern and two secondary ones that I outline below.

# Validation with respect to missing data:

The authors perform three main tests to validate their approach in the face of missing data: they remove nodes, edges, or cases and then run the model to see if it still performs well in each of its three tasks: estimating parameters, determining that the network is relevant to disease spread, and performing model selection between the "true network" and the pruned ones.

My main issue is that under the current methods, the authors only consider nodes and edges that are not involved in the simulated disease transmission for removal. Besides being unrealistic from a data-scarcity perspective (we rarely know whether the individuals we have missed were involved in the transmission chain or not), this additionally (I would argue undesirably) alters the model-perceived disease prevalence. Because (by definition) none of the removed edges were involved in the transmission process, many (most?) of them were connecting two susceptible nodes and thus wouldn't have come into the likelihood calculation anyway.
The removal of edges should functionally increase the perception of beta, since edges that remain are more likely to result in further infections, but this is not what we see in Figure 3 -- I wonder if the authors have an intuition for why their model continues to perform so well.

The removal of nodes is not really a distinct treatment from the removal of edges, as when nodes are removed, the edges connected to them are removed as well. We would thus expect nodes to have (at least) as big of an effect as just removing edges. Importantly, this removal also increases the prevalence of disease within the network (because we are only removing healthy nodes). This combined with the above is likely why we see an increase in epsilon, though the lack of a response in beta is surprising, as before.

Finally, removing cases is the only change that alters the actual infection data, but in this case, nodes are not removed but rather their infectious status is never recognized (assuming they follow the procedure from VanderWaal, et al. (2016)). This seems like it would have the biggest effect on beta, but those results are not presented in this work. It does yield the smallest absolute epsilon, but might have something to do with effectively reducing prevalence in the population. Thus, less extrinsic forcing is needed to explain the data, even in the absence of beta (more on this below).

I would like to see two additional pruning procedures: removing nodes truly at random (infected or not), and the same for edges. This would keep the prevalence approximately equal and would be a more realistic scenario. I expect these will have a larger effect on beta, since we are losing information about the actual spread of disease. It might be interesting to connect the neighbors of removed nodes via the addition of edges, since there is still a path for infection even if we don't observe it, but this might be beyond the scope of this work.

In Figure S2, the authors consider the absolute deviation of epsilon from 0, but I am not convinced this is the most informative way to present the model's reliance on unobserved transmission. Given a timeseries and a network without edges, there is some best-fit (i.e. maximizing the likelihood function in equation 2) epsilon for regenerating the data. Any transmission events that can be better explained by a network edge (and therefore beta parameter) reduce the model's expectation for epsilon. I am curious how a figure like S2 would look if the y-axis were the relative error (to this maximum) rather than absolute error (from 0).

Finally, my understanding is that the time of infection is assigned (uniformly) randomly between the two timepoints of an individual being susceptible and infectious. It seems a reasonable assumption that this distribution is non-uniform insofar as the likelihood of infection will depend on the node's degree as it changes throughout this window.

# Comparison to previous methods:

One of the variables noted to be important to success of the approach of VanderWaal, et al. (2016) is pathogen prevalence. Did the authors detect a similar dependence in their method? Along these lines, I am having trouble reconciling Figure 4 with Figure 3 from VanderWaal, et al. (2016) -- how do the authors find nearly 0 power under (presumably) similar parameterizations that VanderWaal, et al. find nearly 1 (i.e. beta = 0.04 and 0.133)?

l 378 -- the authors use the Kass & Raftery (1995) interpretations of BF (i.e. > 0.5 is "significant"), but this seemed to be the most generous of interpretations that I've encountered. In most cases, this doesn't matter, as the authors report the BF directly, however in Figure 3, I wonder how much the left column would differ if 2, 3, or 10 (sensu Lee and Wagenmakers (2014)) were used instead.
# Code/reproducibility: The link provided in the text (https://bansallab.github.io/INoDS-model/) leads to a page that does not contain a link to download the software, requiring the reader to have sufficient knowledge of github to navigate to https://github.com/bansallab/INoDS-model/ before gaining access. It would additionally greatly improve accessibility if some sample data were provided in the repository such that a user can get it running right out-of-the-box, so to speak. As it is, even the referenced datasets are not readily available. I was unable to find either dataset within the Harvard Dataverse (as referenced in the text), and could find no record of the bee dataset elsewhere either. I did find the lizard dataset in DataDryad (https://datadryad.org/stash/dataset/doi:10.5061/dryad.jk87h), but the data format in this repository is not easily converted into one acceptable for INoDS. I don't think the randomization code used in this work was provided -- does your particular configuration model algorithm allow non-simple graphs? If so, I would recommend re-running with one that does not in order to make the random ensemble representative of empirically obtainable data. # Other Minor Points My read of Eq. 2 is that it this formulation is particular to a disease model without re-infection -- if so, this should be stated explicitly l 136-138 -- I'm not sure I understand why this was "expected" l 157 -- this reference should be to figure 4, I believe l 362 -- should this section heading have an "epsilon" in it? l 411 -- do I understand these methods correctly that there was only one simulation per missing data scenario-beta combination? If so, I assume the paired outliers in Figure 3c are thus corresponding to particular beta values. I highly recommend performing greater replication in this analysis. ## Figure 2 Contrary to the caption, the boxplot whiskers do not encompass the range (as evidenced by the presence of points outside of these ranges). 
More likely: "The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. Data beyond the end of the whiskers are called "outlying" points and are plotted individually." -ggplot2 geom_boxplot help page

This description is also present in other captions.

a) Plotting the beta/epsilon values directly here makes it somewhat difficult to gauge any deviation of the mean estimate from the ground truth, or any trends in deviation along beta*. Perhaps switch the vertical axis to show the absolute error or add a 1:1 line for reference. I slightly prefer the former approach, since that would allow a better understanding of any change in epsilon with increasing beta as well, which is hard to note in the current figure due to the difference in scales.

b) Do these boxplots aggregate log BF across networks that vary in their degree of permutation (i.e. combining what is subdivided in panel c)?

## Figure 3

I wasn't sure why omitted cases is not presented here, despite being in Figure 4 and in the SI.

a) Are the differences from expectation actually 0 or is this a vertical-axis scaling issue? And is this absolute error? If so, I would recommend relative error here, since beta spans an order of magnitude.

## Figure 4

caption -- "Statistical power of INoDS was ," [delete was]

## Figure 6

Is there a missing network in this figure? I expected 5 networks given biweekly sampling for 70 days.

I'm a little confused by the table here -- shouldn't frequency weighted at site 2 have an asterisk after it? And why does it have "frequency" in parentheses? Finally, am I reading this correctly insofar as the lower CI bound for the beta and epsilon of the site 1 binary network are equal to the estimates themselves (0.017 and 0.046, respectively)?
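The whisker convention quoted in the Figure 2 comment above can be illustrated with a minimal Python sketch. `tukey_whiskers` is a hypothetical helper, the sample data are invented, and the quartiles come from `statistics.quantiles` with the "inclusive" method; that is an assumption, and ggplot2's hinges may differ slightly for some sample sizes.

```python
import statistics
from typing import List, Sequence, Tuple


def tukey_whiskers(
    data: Sequence[float], k: float = 1.5
) -> Tuple[float, float, List[float]]:
    """Return (lower whisker, upper whisker, outlying points).

    Whiskers end at the most extreme data values lying within
    k * IQR of the first and third quartiles.
    """
    values = sorted(data)
    if len(values) < 2:
        raise ValueError("need at least two data points")
    # Quartiles by linear interpolation (an assumption about the hinge rule).
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return inside[0], inside[-1], outliers


# Invented sample: nine evenly spaced values plus one extreme point.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
low, high, outlying = tukey_whiskers(sample)

if __name__ == "__main__":
    print(f"whiskers: {low} to {high}; outlying points: {outlying}")
    print(f"full range: {min(sample)} to {max(sample)}")
```

In this invented sample the upper whisker stops at 9 while the maximum is 50, which is the situation the comment describes: points drawn beyond the whiskers mean the whiskers cannot be showing the full range.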
Reviewer #3: This is an interesting and well written paper describing the development of a tool, INoDS (Identifying contact Networks of infectious Disease Spread), that can be used to perform network model selection and establishes the statistical significance of a contact network model.

General comments:

This work is very timely. The introduction, methods, and results are well described and easy to follow. However, there are a few key components in the methods and results that need more explanation. Something that the authors need to explain in more detail is why the bumble bee and lizard case studies were chosen and why there was a need to examine both and not just one, for example.

Additionally, while the introduction, methods, and results were well developed, the discussion is currently a bit weak, as it essentially re-iterates much of the information given in the abstract, introduction, and results. The authors do not discuss their work in the context of other studies (or do so at a very shallow level), and there is no discussion of limitations nor of future directions and next steps. I would suggest cutting much of the information that repeats the introduction (e.g., why this tool is needed) and results, and providing a deeper discussion of how this tool should be used, which systems might need it most, and what the limitations are.

Finally, the authors need to check some of their references and figure citations in the main text before publication as there are errors (e.g., some figures are incorrectly referenced in the main text (e.g., figure 3 instead of 4) and figure 6 was not referred to at all in the main text).

Specific comments:

Line 63-64: provide examples in parentheses of the type of complex models

Line 94-95: More detail on why those two empirical datasets were chosen, and why the need to look at both. This can be stated either here or in the methods.

Line 157: I believe the authors are referring to figure 4 here not figure 3.

Line 292-293: In what way?
Perhaps give examples.

Line 458-459: This needs to be stated earlier in the manuscript as it helps understand why this dataset was used as an example.

Figures:

Figure 1 – In the first inferential step, need to define k. Since the methods are at the end of the manuscript, it would be good to define it here as well. In the third inferential step, might be good to explain that a Bayes factor above 3 provides support for one hypothesis over the other (for those less familiar with Bayesian statistics).

Figure 3 – panels a-c are not listed in the figure but are when the figure is referenced in the text.

Figure 6 – not referred to in main text

Figure 6 – suggest adding days in parentheses for the time on the x-axis.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes
Reviewer #2: No: See full comments for details
Reviewer #3: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: Yes: Matthew J Michalska-Smith
Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user.
Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see

21 Jun 2021

Submitted filename: INoDS_ response to PLOS comp bio reviews.docx

30 Jul 2021

Dear Dr Bansal,

Thank you very much for submitting your manuscript "Revealing mechanisms of infectious disease outbreak through empirical contact networks" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.
When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Benjamin Muir Althouse
Associate Editor
PLOS Computational Biology

Rob De Boer
Deputy Editor
PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: I thank the authors for their work. They have comprehensively addressed all of my major concerns. Two minor lingering comments:

1) The axes in figure 2a could be better fit to the data (right now the data only fill the lower ~1/3 of the plot).
2) As noted in my original review, the replication count for figure 3 feels quite small (only 2 replicates per parameter combination) -- while I would have liked to see more, this is not a sticking point for my endorsement

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool.
If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

30 Sep 2021

Submitted filename: cover_letter_INoDS_revision.pdf
31 Oct 2021

Dear Dr Bansal,

We are pleased to inform you that your manuscript 'Revealing mechanisms of infectious disease outbreak through empirical contact networks' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Benjamin Muir Althouse
Associate Editor
PLOS Computational Biology

Rob De Boer
Deputy Editor
PLOS Computational Biology

***********************************************************

8 Dec 2021

PCOMPBIOL-D-20-02017R2

Revealing mechanisms of infectious disease spread through empirical contact networks

Dear Dr Bansal,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.
The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Olena Szabo
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
