Reconstructing contact network structure and cross-immunity patterns from multiple infection histories.

Abstract

Interactions within a population shape the spread of infectious diseases but contact patterns between individuals are difficult to access. We hypothesised that key properties of these patterns can be inferred from multiple infection data in longitudinal follow-ups. We developed a simulator for epidemics with multiple infections on networks and analysed the resulting individual infection time series by introducing similarity metrics between hosts based on their multiple infection histories. We find that, depending on infection multiplicity and network sampling, multiple infection summary statistics can recover network properties such as degree distribution. Furthermore, we show that by mining simulation outputs for multiple infection patterns, one can detect immunological interference between pathogens (i.e. the fact that past infections in a host condition future probability of infection). The combination of individual-based simulations and analysis of multiple infection histories opens promising perspectives to infer and validate transmission networks and immunological interference for infectious diseases from longitudinal cohort data.

Year: 2021; PMID: 34525092; PMCID: PMC8475980; DOI: 10.1371/journal.pcbi.1009375

Source DB: PubMed; Journal: PLoS Comput Biol; ISSN: 1553-734X; Impact factor: 4.475

Introduction

Host populations are often assumed to be ‘well-mixed’, even though individuals tend to interact with only a small subset of the whole population, and these contact patterns can dramatically affect the way epidemics spread [1-3]. For instance, it was shown during the early phase of the HIV pandemic that both the average number of sexual partners and the variance in the number of partners increase the basic reproduction number (R0) of sexually transmitted infections [4]. Since then, studies have identified how key parameters of the host contact network affect the risk of outbreak [5, 6] and the spread of an epidemic [7-13]. Network reconstruction, i.e. the inference of adjacency weights based on observations of a dynamical system acting on the network, is a well-established research topic in engineering [14], and has recently been applied to infectious disease dynamics [15-17]. In the field of epidemiology, measuring contact networks is a lively research topic [18], ranging from definition issues [19] (defining a ‘contact’) to assessing the appropriateness of various types of data [20]. For livestock [21], interactions between interconnected farms can be well approximated through shipping logs. Other settings, such as wild populations or human populations, are more challenging to analyse. For humans, the field has traditionally relied on self-reported data, but new insights have been provided by airline transportation registries [22] or cell phone data [23]. More recently, parasite sequence data were used to infer network properties by analysing phylogenies of infections [24]. Importantly, these genetic data, which by definition pre-date new outbreaks, can be used to make relevant predictions [25]. Note that we use the terms parasite and pathogen interchangeably, encompassing both micro- and macro-parasites.
We hypothesise that host contact network properties can be inferred from individual longitudinal data about infection status (Fig 1). Such longitudinal data are classically used in epidemiological studies to measure the odds that a specific event may occur [26]; however, they are rarely coupled to mathematical models of disease spread (but see [27, 28]). Our idea is that multiple infection status can be used as a unique descriptor of a host’s position within the network. For instance, a host connected to many other hosts is expected to experience more infections than a host connected to only one other host. Furthermore, hosts close in the network are expected to be infected at closer dates than hosts far away in the network.

Fig 1

Diagram illustrating the workflow for reconstruction of network properties and immunological interference.

In the first step, we obtain multiple infection histories of individual hosts from our simulator. Second, we calculate similarity metrics between hosts based on their multiple infection histories. Third, by comparing with simulated networks, we determine statistical associations between infection history similarity and contact network adjacency. Fourth, by calculating occurrences of infection overlaps across multiple infection histories, we obtain a matrix of immunological interference between pathogens.

Tracking multiple parasite strains or species in these individual histories is both feasible and highly relevant, as the increasing affordability and power of sequencing strengthen the assertion that most infections are genetically diverse [29]. In fact, in the case of genital infections by human papillomaviruses (HPVs), not only do we know that coinfections, i.e. the simultaneous infection by multiple genotypes, are the rule rather than the exception, we also know that they strongly correlate with the number of lifetime sexual partners [30]. Another classical example is the distribution of the number of macroparasites per host, which has been used to infer population structure [31]. This makes multiple infections an ideal candidate to measure epidemic properties but also to detect potential within-host interactions between parasite strains or species [32, 33]. Practically, this simulation study is intended to mirror analyses that could be performed using field data. These analyses could be based on PCR-based detection tests for different genotypes of a parasite species, performed in the context of longitudinal follow-ups of seasonal (e.g. [34] for respiratory viruses) or prevalent and long-lasting infections (e.g. [35] for HPV). Importantly, the inference could also be based on longitudinal serological data (i.e. information on past infection). This would be particularly suited to short infections (e.g. influenza) or infections that are likely to be asymptomatic. Overall, we hypothesise that such studies can inform us both about the contact network on which parasite species (or strains) are spreading and about their immunological interactions. The latter can be of interest between genotypes of the same species (e.g. influenza variants) but also between different species (e.g. cross-immunity between Zika and dengue viruses [36]).

To show how individual infection histories for multiple parasite strains can inform us about the underlying transmission contact network, we conduct a simulation study, which requires overcoming two obstacles. First, we need to simulate multiple infections on a network, a task few studies have attempted [12, 37–42]. For this, we take advantage of recent developments in stochastic epidemiological modelling and implement a non-Markovian version of the well-known Gillespie algorithm [43]. This allows us to make sure a host’s infection history is not lost every time it acquires a new strain. It also addresses statistical evidence that for many parasites infection duration does not follow exponential but heavy-tailed distributions [44-49]. The second main obstacle resides in extracting information from longitudinal follow-up data. To compare these infection histories, we use barcode theory inspired by computational topology [50], where sets of intervals (in our case delimited by infection onset and clearance) are compared with each other. We integrate infection histories into our multiple infection modelling framework by accounting for the documented fact that recovery from infection becomes more likely with increasing ‘age’ of infection [51-54]. Then, we simulate individual multi-strain infection histories for epidemiological models with genotype-specific immunity on random clustered networks to demonstrate the impact of network topology on infection diversity.
We refer to ‘genotypes’ to describe the parasite diversity, but our results apply to different genotypes of the same species or of different species of parasites provided that their mode of spreading (e.g. airborne, sexually transmitted) on the host network is the same. We include immunological interference between genotypes in our model by constraining the probability of acquiring a new infection in terms of the host’s infection history, akin to hemagglutination inhibition assays (e.g. multi-season influenza strains [55]). Finally, we show that infection barcodes can inform us both about the connectivity in the network and about immunological interference between genotypes in several ways. First, similarity matrices between individual infection histories highly correlate with the network adjacency matrices. Second, an individual host’s network connectivity can be inferred from its connectivity in terms of infection barcodes. Third, by quantifying genotype co-occurrence based on infection histories, we recover model inputs for immunological interference. Fourth, mining frequent infection sequence motifs allows quantifying patterns induced by immunity settings in the spirit of more widely used cross-sectional prevalence surveys. Taken together, our simulation study provides a proof of concept for reconstructing network and immunity characteristics from novel summary statistics based on multiple infection histories. The demonstrated robustness and limitations of our approach with respect to network properties, host population sampling and genotyping open novel avenues towards computational epidemiology of multiple infections.

Materials and methods

Simulation algorithm

We developed an event-driven stochastic model of multiple infections on networks in Python 3.7. For the purpose of this simulation study, we considered static, random networks to model contact (i.e. edges with binary weights) between hosts (i.e. nodes). Contact networks were generated using the class of random clustered graphs [56-58] implemented in the networkx package in Python [59]. Given a propensity list of edge degrees and triangle degrees as input, this algorithm generates a random graph with predefined average degree and average clustering coefficient. The clustering coefficient of a graph was defined as the local clustering coefficient (i.e. the number of edges between a node’s neighbours divided by the number of all possible edges in the node’s neighbourhood) averaged over all nodes. Degree dispersion was defined as the ratio of degree variance and degree mean. Degree assortativity (i.e. the propensity for nodes in the network that have many connections to be connected to other nodes with many connections) was defined following [60]. For comparison purposes, using the same Python package, we also created random regular graphs with a fixed degree of four. By definition, these graphs have zero degree dispersion.

Upon network initialisation, we randomly seeded outbreaks of one infection per genotype simultaneously, allowing for up to four genotypes per host at high multiplicity. Disease dynamics followed Gillespie’s stochastic simulation algorithm (SSA) [61, 62] adapted to a non-Markovian setting [63] to incorporate memory-dependent processes. Here, in particular, we considered recovery from infection as a process depending on the age of infection. The simulations also feature potential immunological interference between genotypes. This was defined by an immunity input matrix $J = (J_{ij}) \in [0, 1]^{N \times N}$, where $J_{ij}$ is the probability not to acquire an infection with genotype $j$ given prior infection with genotype $i$, i.e. upon exposure from an infectious edge with genotype $j$.
We sampled independent Bernoulli random variables with probabilities $J_{ij}$ for all genotypes $i$ contained in the infection history of the exposed host. In these simulations, we explored five immunity settings defined as follows. The matrix J1 corresponds to partial cross-immunity, with sterilising immunity between the first two genotypes. J2 models decreasing homologous immunity, where genotype g1 induces sterilising immunity, whereas immunity is decreasing for the remaining genotypes. J3 models homologous immunity, i.e. when each genotype induces sterilising immunity to future infection with the same genotype. J4 is an example of asymmetric cross-immunity, i.e. immunity to infection with g1 after clearing an infection with g2 is stronger than immunity to infection with g2 after clearing an infection with g1. Finally, J5 models sterilising cross-immunity. Transmission and recovery rates were considered equal for all genotypes and fixed in advance.

The underlying parasite life-cycle in our epidemiological models can be interpreted as a variation of a classical SIRS model, where susceptible hosts (S) can be infected upon contact with an infected host (I). Infected hosts become immune to infection after recovering (R) but this immunity can wane, making the hosts susceptible again (S). In the absence of host immunity or in the presence of lifelong immunity, the life-cycle reduces to that of an SIS or an SIR model, respectively. However, there are two important differences with these canonical models [64]. First, our model allows for multiple infections, meaning that host immunity to a given genotype depends on past infections (i.e. the J matrices). Second, our model is non-Markovian, meaning that the probability that an event occurs (e.g. recovering from an infection) depends on the history of the infection (i.e. the number of days elapsed since the inoculation). The Gillespie SSA allowed us to simulate these disease dynamics by performing regular updates in the values of the rates.
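The Bernoulli immunity check described above can be sketched as follows. This is a minimal illustration rather than the published implementation: the function name `blocks_infection`, the index convention (`J[i][j]` protects against genotype `j` after prior infection with genotype `i`) and the rule that a single successful draw is enough to block the new infection are all assumptions made for this example.

```python
import random

def blocks_infection(J, history, challenger, rng=random):
    """Return True if acquired immunity blocks infection with `challenger`.

    J[i][j] is read as the probability that prior infection with genotype i
    prevents infection with genotype j (assumed index convention).
    One independent Bernoulli draw is made per genotype in the host's
    infection history; any success blocks the infection (assumed rule).
    """
    return any(rng.random() < J[i][challenger] for i in set(history))

# Homologous sterilising immunity for N = 4 genotypes, following the verbal
# description of J3: each genotype fully protects against itself only.
N = 4
J_homologous = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
```

With `J_homologous`, a host whose history contains genotypes 0 and 2 is always protected against genotype 0 and never against genotype 1, because the corresponding entries are exactly one and zero.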
To perform these updates, at each time point, we first created a rate vector $\{r_e\}_{e \in E}$ indexed by the set $E$ of all possible events (recovery from an ongoing infection or acquisition of a new infection from an infectious host in the graph neighbourhood). If the node $i$ had spent time $t$ in an infection with genotype $g$, then the rate of recovery from this infection was assumed to be given by a Weibull hazard rate function with shape parameter α and scale parameter μ. The default setting for this recovery rate function was α = 2, with μ chosen such that the mean was equal to one and the variance equal to 2.27 (both moments being expressed in terms of the well-known gamma function Γ). The probability of clearance increased with the age of infection, motivated by infection duration literature (see Introduction). For normalisation purposes, we assumed that the rate of infection for any node currently not infected with genotype $g$ was unity, such that the average number of secondary infections was uniquely determined by the average node degree.

To better handle the computational workload of multiple infections on networks, we stored the infection histories in an interval tree data structure containing the times of onset and end of each infection episode, the genotype, and the host number (this is an individual-based model and not an event-based model). We refer to this data structure as infection barcodes (see below). At a given time, from the rate vector, we first drew an exponentially distributed random variate with rate $\sum_{e \in E} r_e$ to determine the time increment to the next event. Certain rates $r_e$ depended on the infection age but, as shown in [63], drawing exponentially distributed random variables with respect to these rates provides good Markovian approximations of non-Markovian processes, provided the number of events is sufficiently large. We then drew a random variate according to the total probability vector $\{r_e / \sum_{e' \in E} r_{e'}\}_{e \in E}$ to determine the nature of the next event. Depending on the event, we updated the list of possible events.
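The age-dependent recovery rate and a single event draw can be sketched as below. The hazard formula is not spelled out in the text, so the standard Weibull parametrisation $h(t) = (\alpha/\mu)(t/\mu)^{\alpha-1}$ with mean $\mu\,\Gamma(1 + 1/\alpha)$ is assumed here, and the function names are hypothetical.

```python
import math
import random

def weibull_scale_for_unit_mean(alpha):
    # Mean of Weibull(shape=alpha, scale=mu) is mu * Gamma(1 + 1/alpha),
    # so this scale gives a mean infection duration of one.
    return 1.0 / math.gamma(1.0 + 1.0 / alpha)

def weibull_hazard(t, alpha, mu):
    # Assumed standard form: h(t) = (alpha / mu) * (t / mu)**(alpha - 1).
    return (alpha / mu) * (t / mu) ** (alpha - 1)

def next_event(rates, rng=random):
    """Draw (waiting time, event index) from the current rate vector.

    The waiting time is exponential with rate sum(rates); the event is
    chosen with probability proportional to its rate. Returns None when
    no event is possible (the epidemic is extinct).
    """
    total = sum(rates)
    if total <= 0:
        return None
    dt = rng.expovariate(total)
    u = rng.random() * total
    acc = 0.0
    for k, r in enumerate(rates):
        acc += r
        if u < acc:
            return dt, k
    return dt, len(rates) - 1  # guard against floating-point round-off
```

For α = 2 this gives μ = 2/√π ≈ 1.13, and the hazard grows linearly with the age of infection. Under this parametrisation the variance is $\mu^2[\Gamma(1 + 2/\alpha) - \Gamma(1 + 1/\alpha)^2] = 4/\pi - 1 \approx 0.27$ for α = 2, so the variance quoted in the text may rest on a different convention. In a full simulation the recovery entries of `rates` would be re-evaluated with `weibull_hazard` after every event, since they depend on the current age of each infection.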
In the case of recovery of host i from infection with genotype g, we wrote the end time of the infection episode to the interval tree, removed the recovered infection from the rate vector, and added potential infections with genotypes other than g from the host’s network neighbourhood. In the case where a host i was newly infected with g, we wrote the time of onset of the infection episode to the interval tree, removed all possible infection edges from g-infected neighbours of i, and added possible infection edges from i to all neighbours that had not been or were not currently infected with g. We then updated the rate vector again and proceeded until the epidemic became extinct. Unless stated otherwise, we simulated disease transmission with four genotypes introduced simultaneously on the giant component of random clustered graphs with 250 nodes, an average degree of 4 and an average clustering coefficient of 0.34 (referred to as ‘clustered’ networks). For a given parameter set, we seeded a random outbreak at the beginning of each simulation with one infection per genotype and ran 50 stochastic replicates until the disease-free state or a pre-defined time horizon was reached. Epidemic outcomes were reported as averages and 95% confidence intervals for equidistant time bins. The source code and configuration files used for simulations are publicly available at the Zenodo repository: https://doi.org/10.5281/zenodo.5159448.
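The network summary statistics defined at the start of this section can be illustrated on a toy graph. The sketch below uses plain adjacency dictionaries rather than networkx, takes the population variance for degree dispersion (an assumption, since the text does not say which variance estimator was used), and uses the standard local clustering coefficient; the function names are hypothetical.

```python
from itertools import combinations

def degree_dispersion(adj):
    # Ratio of degree variance (population variance, assumed) to degree mean.
    degrees = [len(nbrs) for nbrs in adj.values()]
    mean = sum(degrees) / len(degrees)
    var = sum((d - mean) ** 2 for d in degrees) / len(degrees)
    return var / mean

def average_clustering(adj):
    # Local coefficient: realised links among a node's neighbours divided by
    # the number of possible links; nodes with fewer than two neighbours
    # contribute zero. The result is the average over all nodes.
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Toy example: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# A 4-cycle is 2-regular, so its degree dispersion is zero.
ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
```

The `ring` example mirrors the statement that regular graphs have zero degree dispersion by definition.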

Genotype diversity and multiplicity of infection

We measured genotype diversity during the course of epidemics with N = 4 genotypes by Shannon’s diversity index [65] defined as $H(t) = \exp\left(-\sum_{i=1}^{N} p_i(t) \log p_i(t)\right)$, where $p_i(t)$ is the frequency of infections with genotype $i$ relative to all infections present at time $t$. The index $H \in [1, 4]$ is maximised when all genotypes have equal frequency and minimised when only one genotype is present. We used the implementation of the Python module ecopy. We emphasise that this index pertains to diversity at the population level and does not distinguish between a multiple infection within a single host and single infections in several hosts. To highlight the importance of multiple infections, we also considered the multiplicity of infection (MOI) given by the number of genotypes present within infected hosts at a given time. We reported MOI averaged over all infected nodes.
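Both summary statistics are simple to compute from a snapshot of the epidemic. The sketch below is a plain-Python illustration (it does not use ecopy); the function names and the input layout are assumptions made for the example.

```python
import math
from collections import Counter

def shannon_diversity(infections):
    """Exponential of the Shannon entropy of genotype frequencies.

    `infections` lists one genotype label per ongoing infection at time t.
    Returns 0.0 when no infection is present (a convention chosen here).
    """
    counts = Counter(infections)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

def mean_moi(host_genotypes):
    """Average number of genotypes per infected host.

    `host_genotypes` maps each host to the set of genotypes it currently
    carries; uninfected hosts (empty sets) are excluded from the average.
    """
    infected = [g for g in host_genotypes.values() if g]
    if not infected:
        return 0.0
    return sum(len(g) for g in infected) / len(infected)
```

Four equally frequent genotypes give a diversity of 4 and a single circulating genotype gives 1, matching the stated range of the index. The same list of infections yields the same diversity whether the infections sit in one host or in several, which is why MOI is reported alongside it.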

Infection barcodes and network properties

We summarised each individual multiple infection history by a barcode, i.e. the set of intervals describing the onset and clearance of infection episodes accumulated within a host during the course of an epidemic. The notion of barcode first arose in computational topology [66] to succinctly summarise and compare topological properties of metric spaces. Mathematically, the infection barcode of a host $A$ is a set of triples $A = \{A_1, \ldots, A_n\}$, where each $A_k = (b_k, d_k, g_k)$ defines the birth $0 \le b_k < \infty$ (i.e. infection onset) and death $0 \le d_k \le \infty$ (i.e. infection clearance) of an infection with a particular genotype $g_k$. We explicitly included the presence of multiple genotypes in a host’s infection history, i.e. a host can simultaneously have several infections, and the same genotype can appear several times in the course of an epidemic.

To compare infection histories between hosts, we considered the metric space of infection barcodes endowed with two complementary notions of distance. In the first approach, we considered a similarity index [67] such that two hosts with largely overlapping infection episodes tended to be similar to each other, regardless of the respective genotypes. More precisely, a given pair of barcodes of any two hosts $A = \{A_i\}$ and $B = \{B_j\}$ was transformed into a weighted bipartite graph. The nodes of this graph consisted of infection episodes and these nodes were partitioned according to the two hosts. The weight of an edge $(i, j)$ of the graph was given by the overlap length between the infection episodes $A_i$ and $B_j$ constituting the edge, taking also into account overlaps between distinct genotypes. To obtain the similarity index $s_1(A, B)$, we calculated the maximum bipartite graph matching of this graph. The similarity index was transformed into a distance by $d_1 = -2 s_1 + \sum_i |A_i| + \sum_j |B_j|$. For the second approach, we considered the overlap length of infection episodes between two hosts with matching genotype $g$. By summing over all genotypes we obtained the similarity index $s_2(A, B)$.
This index was transformed into a distance by setting $d_2 = -2 s_2 + \sum_i |A_i| + \sum_j |B_j|$.

To determine whether similarity between infection histories implies proximity in the transmission network, we compared the metric space of infection barcodes to the metric space of the network, restricted to nodes that had been infected during the course of the epidemic. For the network, we considered two different distance notions, i.e. the discrete metric based on the binary adjacency matrix and the shortest path distance of the graph. For the shortest path distance, we expected negative correlations with barcode similarity, since the further two nodes are apart from each other (i.e. the longer the shortest path distance), the less similar their infection barcodes would be. The converse holds for the adjacency, since adjacent nodes with unit weight would be similar in terms of infection histories. We used two-sided p-values from the Mantel test [68] between the barcode similarity (resp. distance) matrix and the adjacency or shortest path length matrix, respectively, to assess for each model the correlation between infection barcode and network topology [69]. Since Mantel permutation tests have been scrutinised for underestimating type I errors in the presence of spatial autocorrelation [70-72], we tested adjacency and shortest path matrices for auto-correlation, using the R package statGraph [73]. While adjacency matrices were not significantly auto-correlated for most lags, the shortest path matrices had significant auto-correlations at all lags (Fig G in S1 Text).

To assess whether spatial correlations between similarity matrices also translated to local properties such as a host’s node degree (i.e. number of network neighbours) or its localisation within the network (i.e. the sum of shortest path lengths to other hosts), we developed a measure of infection barcode connectivity and performed regression analyses of network characteristics on infection connectivity.
More precisely, for a given simulation with maximum length L, N different genotypes and D + 1 nodes in the giant component, the normalised infection barcode connectivity of a host n was defined, for each similarity index $s_i$ ($i = 1, 2$), as the sum of all barcode similarity values between n and the other hosts within the network of infected hosts, relative to the maximum possible barcode similarity. Since shortest path length is a continuous measure, we performed linear regression using infection barcode connectivity as a regressor and assessed significance for coefficients by p-values from two-sided Z-tests for each of the immunity and network clustering settings. The relationship between node degree and infection connectivity was assessed using a multinomial logistic regression model with a linear predictor in which β0 is the intercept, β1 is the coefficient for the continuous variable of the normalised barcode similarity degree, β2 is the coefficient for the categorical variable of N = 2 relative to the base level N = 4, and β3 is the coefficient for the categorical variable of clustering coefficient c = 0.17 relative to the base level c = 0.34. We evaluated the model for genotype-agnostic and genotype-specific similarity degrees for each of the immunity settings. The multinomial model was trained on 80% of the stochastic replicates and tested on the remaining 20%. The area under the curve (AUC) from multinomial precision-recall curves [74] was used to evaluate predictive power with the R package pROC.
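The two barcode similarity indices introduced in this section can be sketched on small examples. The code below is an illustration under stated assumptions: barcodes are lists of `(onset, clearance, genotype)` triples with finite clearance times, the matching behind $s_1$ is taken to be a maximum-weight matching, and it is found by brute force over permutations, which is only practical for very short barcodes. All function names are hypothetical.

```python
from itertools import permutations

def overlap(a, b):
    # Length of the intersection of two infection episodes (b, d, g).
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def s2(A, B):
    # Genotype-specific similarity: total overlap of matching genotypes.
    return sum(overlap(a, b) for a in A for b in B if a[2] == b[2])

def s1(A, B):
    # Genotype-agnostic similarity: best one-to-one pairing of episodes
    # (maximum-weight matching assumed), found by exhaustive search.
    small, large = (A, B) if len(A) <= len(B) else (B, A)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        weight = sum(overlap(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, weight)
    return best

def total_length(A):
    return sum(d - b for b, d, _ in A)

def barcode_distance(similarity, A, B):
    # d = -2 s + sum_i |A_i| + sum_j |B_j|
    return -2.0 * similarity(A, B) + total_length(A) + total_length(B)

host_a = [(0, 4, "g1"), (2, 6, "g2")]
host_b = [(1, 3, "g1"), (5, 8, "g3")]
```

Here `s2(host_a, host_b)` counts only the two time units shared by the g1 episodes, whereas `s1` also pairs the g2 episode of the first host with the g3 episode of the second, illustrating why the genotype-agnostic index is never smaller than the genotype-specific one in this example. For realistic barcode sizes an assignment solver would replace the exhaustive search.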

From infection barcodes to immunity

In order to determine immunological interference between genotypes based on individual infection histories, we quantified the non-overlapping co-occurrence C(G, H) of any two genotypes G and H within the host population. Large values of C(G, H) indicate that the presence of genotype G in the infection history did not preclude subsequent infection with H, i.e. the immunological interference between the two genotypes was weak. Conversely, if C(G, H) is zero, then infection with G did prevent infection with H, such that G conferred full cross-immunity to H.

As an alternative, we used the sequence mining algorithm cSPADE [75] to determine the most frequent infection patterns from infection histories in the network. This algorithm was originally designed to mine and classify the most frequent patterns in sets of sequences, e.g. products purchased by customers over multiple time points. In the machine learning literature, the frequency at which an item (e.g. a product, or a set of products) is encountered across a set of time-stamped customer data is referred to as ‘support’. In order to determine the most frequent multiple infection patterns at various snapshots during an epidemic, we interpreted multiple infection data as such sequences. More precisely, we sampled five random observation times τ1, …, τ5 during a simulation, and, for each host (i.e. ‘customer’), we defined the sequence of infections (i.e. ‘purchases’) by the set of genotypes the host was infected with at each observation time point. We deliberately assumed for this approach that multiple infection sequence motifs were independent snapshots of the infection state, such that re-infections and persistent infections were indistinguishable. Since our simulator allowed for multiple infections, elements of a sequence comprised the empty set, singleton sets (with only one genotype), or sets of several genotypes. The length of a motif is equivalent to the number of observations (e.g. the motif of length two ⟨{g1, g3}, {g3}⟩ indicated that a double infection was observed within hosts at the first time point and that a single infection was observed at the second time point). We used the R package implementation arules of the cSPADE algorithm and calculated the frequency (i.e. the ‘support’) averaged across simulation replicates, for sequence motifs with a minimum support of 0.02 and a maximal length of 5.
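The construction of snapshot sequences and the notion of support can be sketched directly from barcodes. This is a simplified stand-in for cSPADE, not the algorithm itself: it only counts the support of one given motif, allows arbitrary gaps between matched observations, and treats an element as matched when it is a subset of the observed genotype set. These matching rules and all names are assumptions for illustration.

```python
def snapshot_sequence(barcode, times):
    """Genotype sets carried by one host at each observation time."""
    return [frozenset(g for b, d, g in barcode if b <= t < d)
            for t in sorted(times)]

def contains_motif(sequence, motif):
    """True if the motif elements occur in order within the sequence."""
    pos = 0
    for element in motif:
        # Advance to the first remaining snapshot containing this element.
        while pos < len(sequence) and not set(element) <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(sequences, motif):
    """Fraction of hosts whose snapshot sequence contains the motif."""
    return sum(contains_motif(s, motif) for s in sequences) / len(sequences)

# Two hypothetical hosts observed at three time points.
times = [1, 3, 5]
host_1 = [(0, 2, "g1"), (0.5, 4, "g3")]
host_2 = [(2, 6, "g2")]
sequences = [snapshot_sequence(h, times) for h in (host_1, host_2)]
```

In this toy data set the motif ⟨{g1, g3}, {g3}⟩ is found in the first host only, so its support is 0.5.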

Results

Network topology, genotype diversity, and MOI

To better understand how the topology of a contact network impacts multiple infection dynamics, we simulated epidemics by simultaneously introducing infections on random clustered graphs with distinct summary statistics. By default, we simulated multiple infections on random clustered graphs with high clustering coefficient, which we refer to as ‘clustered’ networks (see Table 1). For comparison, we considered ‘dispersed’ random clustered graphs, emulating contact networks with a relatively low but variable number of contacts spread out evenly in the network due to highly over-dispersed degree and low clustering coefficient. To determine whether these networks would result in contrasting epidemic dynamics, we also considered single genotype simulations, which resulted in a larger final epidemic size for dispersed networks (Fig A in S1 Text). In order to uncover potential biases stemming from degree dispersion and clustering, we also simulated multiple infections on random regular graphs, which had a fixed degree of four and a clustering coefficient of 0.014.

Table 1

Summary statistics of the two types of networks simulated.

Quantities are averaged across 250 nodes and 50 stochastic replicates.

Network type	degree	degree assortativity	degree dispersion	clustering coefficient
dispersed	4.6	0.58	51.51	0.13
clustered	4.1	0.53	2.36	0.34

Infection multiplicity (MOI) was highest for both homologous immunity settings, which also maintained a high diversity over the course of the simulated epidemic (Fig 2). In general, the network topology only had a marginal impact on MOI and diversity, with dispersed networks (i.e. high degree dispersion and low clustering coefficient) showing slightly higher diversity (e.g. for homologous decreasing and asymmetric settings).

Fig 2

Genotype diversity index, multiplicity of infection (MOI), and epidemic prevalence as a function of network type and cross-immunity patterns.

The figure shows the output of 50 stochastic epidemics with four genotypes on dispersed and clustered networks (see Table 1), and five immunological interference settings. Black lines show the time-averaged fraction of nodes infected with at least one genotype (95% confidence intervals shaded grey), blue lines show the average genotype diversity index at the population level, and green lines show the average MOI of individual nodes.

Diversity decreased initially in all the runs, which was likely due to stochasticity (some strains grow more than others). The striking differences in diversity appeared later on, with three groups showing distinct shapes (Fig 2): (i) partial cross-immunity maintained moderate diversity throughout, (ii) homologous and homologous decreasing immunity had diversity following the fraction of infected nodes, with a sharp increase and decrease towards the end of the simulation, and (iii) asymmetric and full cross-immunity showed continuously declining diversity. As expected, full cross-immunity prevented multiple infections and sharply decreased diversity, especially for clustered networks. Infection diversity on random regular networks in general mirrored that on clustered networks, with markedly lower levels for partial cross-immunity settings (Fig B in S1 Text). Multiple infections could be sustained in the partial cross-immunity setting, under which the fraction of infected nodes showed an SIS-like shape due to reinfections by a subset of the circulating genotypes.

Infection barcodes and network structure

To infer properties of the host contact network from individual infection histories, we summarised each of these through an infection barcode, i.e. a set containing all intervals of infection episodes with different parasite genotypes. In order to measure similarity (or distance) between infection barcodes of any two hosts, we used the length and frequency of overlaps between infection episodes. As detailed in the Materials and methods, we distinguished overlaps that were genotype-specific (i.e. only for matching genotypes between two hosts) from those that were genotype-agnostic (i.e. regardless of the genotype). The resulting barcode matrices quantified the similarity (or distance) between any two hosts. These were tested for correlation to more classical matrices that capture global (shortest path between nodes) or local properties (binary adjacency) of contact networks restricted to infected individuals. Simulation results with four different parasite genotypes show that, if the network of infected individuals was fully observed, the barcode similarity matrix significantly correlated with neighbourhood properties of the contact network given by its adjacency matrix (Fig 3). In more realistic situations, however, infected nodes would only be partially observed. Our results showed that spatial correlations were less significant as the sampling rate of network nodes decreased but remained significant for sampling proportions of 25% or more.

Fig 3

Correlations between a matrix based on infection barcodes similarity and the network’s adjacency (top) or shortest path (bottom) matrices.

For simulated epidemics on clustered networks with 2 or 4 circulating genotypes and a variety of cross-immunity settings, we tested for correlations using p-values of two-sided Mantel tests with $10^4$ permutations. For each setting, we randomly re-sampled 100, 50, 25, or 10% of the infected nodes 20 times and report the average p-value.

When pairwise comparisons between hosts were only based on the number of genotypes (green bars) but not on their nature (blue bars), our ability to detect significant correlations decreased. Unsurprisingly given the importance of multiple infection histories for our approach, decreasing the number of circulating genotypes from four (dark bars) to two (light bars) also decreased correlation significance. Immunity assumptions allowing for multiple infections with highly diverse infection barcodes (e.g. partial cross, homologous decreasing, homologous) resulted in higher significance levels. Likewise, when tested for global network properties given by the shortest path matrix (bottom in Fig 3), significance levels and correlation values (Fig C in S1 Text) appeared to be strongest for the same set of immunity assumptions. Overall, correlations with shortest path matrices were consistently less robust towards multiplicity than those with adjacency matrices (especially for the full cross-immunity setting). This suggests that in order to distinguish distances between hosts beyond their graph neighbourhood with barcode similarities, one needs sufficiently rich multiple infection histories. Compared to clustered networks, we observed similar trends on dispersed networks regarding the immunity setting (Fig D in S1 Text), and correlations tended to be stronger than in the clustered network setting, especially for comparisons of barcode similarity matrices with shortest path matrices. Overall, we found that the genotype-specific similarity index for infection barcodes was more informative than the distance metric (Fig E in S1 Text).
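A two-sided Mantel permutation test of the kind used for these comparisons can be sketched in plain Python. This is an illustration only: the Pearson statistic, the "+1" correction in the p-value and the function names are assumptions, not details taken from the analysis pipeline.

```python
import random

def upper_triangle(M, order):
    # Off-diagonal entries above the diagonal after relabelling nodes.
    n = len(order)
    return [M[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mantel(X, Y, n_perm=10_000, rng=None):
    """Correlation between two symmetric matrices and a two-sided p-value.

    Rows and columns of X are permuted jointly, which keeps each host's
    row intact while breaking its correspondence with the same host in Y.
    """
    rng = rng or random.Random()
    ids = list(range(len(X)))
    y = upper_triangle(Y, ids)
    r_obs = pearson(upper_triangle(X, ids), y)
    hits = 0
    for _ in range(n_perm):
        perm = ids[:]
        rng.shuffle(perm)
        r_perm = pearson(upper_triangle(X, perm), y)
        if abs(r_perm) >= abs(r_obs) - 1e-12:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)
```

In the setting above, `X` would be a barcode similarity matrix and `Y` the adjacency or shortest path matrix restricted to the sampled infected hosts; repeating the call on random subsets of hosts and averaging the p-values mirrors the re-sampling described for Fig 3.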
The spatial correlations between barcode similarity indices and both local and global graph properties also translated to the individual level. From the barcode similarity matrix, we calculated each host’s connectivity (i.e. degree) in terms of infection history by summing the host’s barcode similarity with respect to all other infected hosts and normalising appropriately (see Materials and methods). We hypothesised that the infection barcode connectivity within the network of infected hosts could inform the degree of connectivity in the contact network. First, the significant relationship between the barcode and shortest path connectivity indicated that a host with an infection history similar to those of many other hosts also had shorter paths to other hosts, and was hence more centrally located in the network (Fig 4).
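As an illustration of the two host-level quantities compared in Fig 4, here is a minimal Python sketch. The normalisation by the number of other hosts is an assumption (the exact normalisation is the one given in Materials and methods), and the function names `barcode_connectivity` and `mean_shortest_path` are hypothetical. The contact network is assumed to be given as an unweighted adjacency matrix.

```python
from collections import deque

import numpy as np


def barcode_connectivity(sim):
    """Mean barcode similarity of each host to all other hosts.

    The diagonal (self-similarity) is excluded; dividing by n - 1 is an
    assumed normalisation.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    return (sim.sum(axis=1) - np.diag(sim)) / (n - 1)


def mean_shortest_path(adj):
    """Mean breadth-first-search distance from each node to the nodes it can reach."""
    adj = np.asarray(adj)
    n = adj.shape[0]
    out = np.full(n, np.nan)          # isolated nodes stay NaN
    for source in range(n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(adj[u]):
                v = int(v)
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) > 1:
            out[source] = sum(dist.values()) / (len(dist) - 1)
    return out
```

Under the relationship described above, hosts with a high `barcode_connectivity` would be expected to have a low `mean_shortest_path`.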

Fig 4

Link between the host connectivity estimated via the barcodes and that estimated via the shortest path.

Hosts that are found to have a high connectivity according to the multiple infection history (the barcode metric) tend to be closer to other hosts in the contact network. Significance levels correspond to p-values of a linear regression (***: p < 0.001 and **: p < 0.01). The variance explained (in percent) by barcode connectivity refers to the sum of squares from an ANOVA.

Multinomial regression tests showed that the odds of having higher degree in the contact network (i.e. more neighbours) increased with infection barcode connectivity. The power of the model to predict node degree was dependent on the immunity settings and highest for partial cross-immunity (Fig 5), mirroring what was seen for the continuous measure of shortest path lengths (Fig 4). For this particular immunity setting, the area under the curve was 0.74, indicating that predictive power increased when using multiple infection data, especially because in this case diversity was maintained over the entire time horizon, yielding highly specific barcode similarity scores. Similar trends, but with less predictive power, were observed for the remaining immunity settings.

Fig 5

Precision-recall curve for multinomial models under five distinct immunity assumptions.

Precision is defined as the percentage of true positives among all predicted positives, and recall as the percentage of true positives among true positives plus false negatives. The area under the curve (AUC) indicates increased (compared to random AUC of 0.5) power to classify a host’s node degree using infection barcode connectivity information.
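The definitions in this caption can be written out as a short Python sketch. It treats a single degree class against all others as a binary problem, which is a simplifying assumption relative to the multinomial models of Fig 5. The function names and the convention of returning a precision of 1.0 when nothing is predicted positive are illustrative choices.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall for binary labels at one score threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t and p)
    fp = sum(1 for t, p in zip(y_true, predicted) if not t and p)
    fn = sum(1 for t, p in zip(y_true, predicted) if t and not p)
    # Convention (assumption): precision is 1.0 when no positive is predicted.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(y_true, scores):
    """(recall, precision) points, one per distinct threshold, highest first."""
    return [precision_recall(y_true, scores, t)[::-1]
            for t in sorted(set(scores), reverse=True)]
```

Here `scores` would be the model's predicted probability that a host belongs to the degree class of interest, and `y_true` marks whether it actually does.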

Infection history and immunological interference

To test whether immunological interference between genotypes could be inferred from individual infection histories, we simulated epidemics on random clustered networks with four distinct genotypes under a variety of immunity assumptions (see Materials and methods). The co-occurrence score C between genotypes resembled the immunity input matrix from the simulations in two ways. First, whenever immunity was sterile, we did not observe co-occurring genotypes (red squares in Fig 6). Second, when the probability of developing protective immunity following infection decreased, the co-occurrence score increased.
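To give an intuition for how a genotype-by-genotype score can be read off infection histories, here is a minimal Python sketch. It does not reproduce the score C defined in Materials and methods. As a loudly stated assumption, it uses a simplified ordered score: the share of hosts ever infected by genotype g that acquired genotype h after their first g episode. The episode format `(host, genotype, start, end)` and the function name `cooccurrence` are hypothetical.

```python
from collections import defaultdict


def cooccurrence(episodes, genotypes):
    """Simplified ordered co-occurrence score (illustrative assumption).

    episodes: iterable of (host, genotype, start, end) tuples.
    Returns C with C[g][h] = share of hosts ever infected by g that
    started an h episode strictly after their first g episode.
    The end time is carried along but not used in this simplified score.
    """
    by_host = defaultdict(list)
    for host, g, start, end in episodes:
        by_host[host].append((start, g))

    counts = {g: {h: 0 for h in genotypes} for g in genotypes}
    hosts_with = {g: 0 for g in genotypes}
    for eps in by_host.values():
        eps.sort()
        first = {}
        for start, g in eps:
            first.setdefault(g, start)        # first onset per genotype
        for g, t0 in first.items():
            hosts_with[g] += 1
            later = {h for start, h in eps if start > t0}
            for h in later:
                counts[g][h] += 1

    return {g: {h: (counts[g][h] / hosts_with[g] if hosts_with[g] else 0.0)
                for h in genotypes}
            for g in genotypes}
```

Because the score is ordered, C[g][h] and C[h][g] can differ, which is the kind of asymmetry one would look for under asymmetric immunity. A diagonal entry C[g][g] counts reinfection with the same genotype.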

Fig 6

Co-occurrence score from infection barcode outputs of simulations with various assumptions on immunological interference.

The underlying cross-immunity matrices are shown in the Materials and methods. The combinations with 100% cross-immunity correspond to the red cells with 0 co-occurrence score.

The partial cross-immunity setting allowed for co-occurrence between genotypes g3 and g4, while cross-immunity between g1 and g2 also lowered possible co-occurrence with genotypes g3 and g4. Similarly, decreasing homologous immunity strongly limited co-occurrence of the same genotype and, interestingly, also limited co-occurrence between genotypes which were a priori not assumed to interfere immunologically (e.g. g1 and g4) due to population dynamics effects. The co-occurrence score was also able to capture asymmetric immunity, stressing the possibility to estimate the order of acquisition from infection barcodes.

The sequence motif mining approach yielded infection patterns that were consistent with the immunity input (Fig 7). For partial cross-immunity, multiple infections including genotypes g1 and g2 were absent from the motifs, whereas infections including the remaining genotypes occurred at high frequency. The homologous decreasing setting was mirrored by a continuous increase in motifs from genotypes g1 to g4. In the homologous setting, motifs were dominated by single infections, whereas double infections were equally frequent between genotypes. Unsurprisingly, sterilising cross-immunity excluded all multiple infections. In order to distinguish the order of genotype acquisition, we had to consider motifs of length two (Fig F in S1 Text). In this case, differences in motif frequency with asymmetric immunity assumptions (e.g. the motif < {g4}, {g1} > was more frequent than < {g1}, {g4} >) could be detected. In general, the higher computational efficiency of sequence mining compared to the co-occurrence score was offset by the fact that, in the former, motifs are independent snapshots of the infection state, which makes reinfections and persistent infections indistinguishable.
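Motifs of length one can be pictured as cross-sectional snapshots: at each sampling time, the set of genotypes currently present in a host. The following Python sketch counts the relative frequency of each such set. It is a simplified illustration, not the sequence mining procedure used for Fig 7. The episode format `(host, genotype, start, end)`, the half-open interval convention, and the function name `snapshot_motifs` are assumptions.

```python
from collections import Counter


def snapshot_motifs(episodes, hosts, sample_times):
    """Relative frequency of each genotype set observed in a host snapshot.

    An episode (host, genotype, start, end) is treated as active at time t
    when start <= t < end (assumed convention). Snapshots of uninfected
    hosts are skipped.
    """
    counts = Counter()
    for t in sample_times:
        for host in hosts:
            present = frozenset(
                g for h, g, start, end in episodes
                if h == host and start <= t < end
            )
            if present:
                counts[present] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {motif: c / total for motif, c in counts.items()}
```

Since each snapshot is counted independently, a host that carries the same genotype at two sampling times contributes twice, whether it was reinfected or persistently infected. This is the limitation noted in the paragraph above.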

Fig 7

Effect of cross-immunity patterns on the average frequency of all possible sequence motifs with length one.

The motifs represent cross-sectional multiple infection snapshots. Here, samplings are performed at five random time points during the simulation. Note that coinfection by all 4 genotypes is only found in the homologous decreasing case. The corresponding cross-immunity matrices are shown in the Materials and methods. See also Fig 6 for additional information regarding co-occurrence between the genotypes.

Discussion

Understanding the properties of host contact networks is key to predicting epidemic spread but raises many practical challenges [18]. Phylodynamic studies have shown that some of these properties can be inferred from microbial sequence data [24, 76]. Here, we use individual infection histories to gain insights into the contact network structure. The first obstacle was to simulate epidemics on contact networks while allowing for coinfections, i.e. the simultaneous infection of hosts by multiple parasites [77]. To enable clearance events based on a host’s multiple infection history, we implemented a non-Markovian version of the Gillespie algorithm following recent developments in computational physics [43, 78]. As expected, network topology directly impacted multiple infection dynamics, with an increased level of clustering leading to higher parasite strain diversity. Being able to simulate epidemics of multiple infections on networks, we sought to compare complex individual infection histories in order to reconstruct transmission networks. To address this issue, we captured these histories using barcodes and compared infection histories between hosts using tools from computational topology [50]. We could show that global properties of the network are correlated with these barcodes, i.e. similarity matrices inferred from the barcodes correlate with matrices inferred from the network adjacency matrix. Furthermore, individual-centred properties such as a host’s degree can also be inferred from infection barcodes. Detecting within-host interactions between pathogens from population-level data is relevant both for persistent (e.g. HPV [79, 80]) and acute (e.g. influenzavirus [81, 82]) infections. In this simulation study, we focused on possible immunological interference between pathogens leading to exclusion mechanisms for multiple infections. We show that some patterns can be identified using multiple infection data. 
Overall, we provide proof-of-principle that multiple infections and infection history can be used to gain insight into host contact network properties and immunological interference. This approach is both biologically sound, given that infections by multiple parasite genotypes are extremely frequent, and realistic. Indeed, for many human infections, there are longitudinal follow-ups with a detailed record of parasite diversity, one of the clearest examples being human papillomaviruses [83]. Studies also exist that follow individuals in natural populations [84, 85]. Individual-based models are particularly well suited to developing new ways to analyse multiple infection data from community-based routine surveillance cohorts [81, 82]. Our model is tailored towards quantities that could actually be observed in the field, e.g. the average number of contacts and the clustering coefficient, which can be extracted from surveys [86]. How precise genotyping should be is a highly relevant question. Furthermore, the time component is essential to distinguish ongoing infection from reinfection. However, for many acute infectious diseases, this issue can be addressed by spacing sampling time points.

There are several limitations in our work with possible extensions. First, we only reported results obtained on random clustered networks. We also obtained similar results on other types of topologies (e.g. random regular graphs) but it would be valuable to know whether infection history is more or less informative depending on the type of network considered and if network comparison can be performed. We also assumed that the contact network was static, but in reality, it can be highly dynamic. Recent evidence suggests that these dynamics could be detected in sequence data [87] and it would be interesting to explore this with infection histories. Also, real-world multiple infection histories might be incomplete or noisy. 
Determining how sensitive the statistical associations between barcode similarity and network properties are to such noise could help delineate the practical limits of our approach. We assumed the life-cycle and transmission mechanisms of the parasites to be equivalent. Simulating parasite spread with distinct infectivity and clearance parameters could show the robustness of barcode metrics. Also, including within-host dynamics such as viral load and immune memory could help identify relevant mechanisms for population-level dynamics of multiple infections. Following many existing studies, e.g. in phylodynamics, we used a neutrality assumption. However, genotypes are known to interact [88] and multiple infections can be a means to detect these interactions [79, 89]. In general, exploring richer life-cycles is a promising path for future studies. We considered that infection history was based on parasite genotype presence in the host (e.g. PCR detection test) but the same methodology could be applied to serological data, i.e. evidence of past presence in the host via antibody detection. A clear advantage of this type of data is that it is more abundant. Another advantage is that it does not require a detailed longitudinal follow-up. However, the downside is that the origin of the infection is then unknown. Simulation studies could be instrumental in assessing our ability to infer network properties from serological data. A separate line of future research has to do with the inference per se. One possibility could be to use Approximate Bayesian Computation to get a more precise idea of the accuracy of the predictions we can make on key network topology parameters. This could also allow us to compare between different classes of networks, e.g. using random forest algorithms [90]. Finally, with the increased power and decreasing cost of sequencing, it may be possible in the future to have both the information about infection history and the sequence data for multiple pathogens. 
It would then be worth determining whether we can get the best out of the two types of data, infection history being less precise but also less noisy than phylodynamic inference.

Fig A. Final epidemic size (in percent) for single genotype infections on dispersed (red) and clustered (blue) networks.
Fig B. Genotype diversity index, multiplicity of infection (MOI), and epidemic prevalence for random regular graphs for five cross-immunity patterns.
Fig C. Spatial correlations for clustered networks for five cross-immunity patterns.
Fig D. Spatial correlations for dispersed networks for five cross-immunity patterns.
Fig E. Correlation tests between a matrix based on infection barcodes distance and the network’s adjacency (top) or shortest path (bottom) matrices.
Fig F. Mining for motifs of length two.
Fig G. Spatial auto-correlations for dispersed (left) and clustered (right) networks.
(PDF)

23 Jun 2021

Dear Dr. Selinger,

Thank you very much for submitting your manuscript "Reconstructing contact network structure and cross-immunity patterns from multiple infection histories" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

The reviewers and editors were excited about this innovative approach to an important question in infectious disease dynamics, and the combination of simple models with advanced computational methods. Overall the suggested edits are minor, and most of them are explicitly optional, and can be addressed - or not - at the authors' discretion. 
The editors request that you focus on a) fixing typos/sources of confusion/missing SI figures and expanding figure captions, b) consider adding an initial diagram which explains the model and the idea behind "barcodes", c) provide slightly more context into how this theoretical work relates to real-world inference scenarios of interest for particular types of infections.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Alison L. Hill
Associate Editor
PLOS Computational Biology

Virginia Pitzer
Deputy Editor-in-Chief
PLOS Computational Biology
Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: In this study the authors show that multiple infection histories of individuals can be used to identify characteristics of the underlying transmission network structure and that they can also be used to detect immunological interference between pathogens. They do this by developing a simulator of a stochastic epidemic model consisting of multiple infections on a network and introduce the idea of ‘infection barcodes’ (taken from barcode theory in computational topology) that encodes individual infection histories in the form of times of infection onset and recovery, and the type of pathogen. They are able to show that the similarity matrices obtained by comparing infection barcodes between pairs of individuals are correlated with network adjacency matrices and that an individual’s degree of connectivity in the network can be inferred from their connectivity in the space of the infection barcodes. Furthermore, they also show that analyzing pathogen co-occurrence patterns within hosts can be used to detect immunological interference.

Overall, I find the proof of concept described in this work to be very interesting and useful as the transmission network plays an important role in the spread of epidemics but it is generally very difficult to characterize. The methodology used in the paper is clear and I only have a few minor comments. 
Introduction
• Line 18: ‘not only’ might be missing between ‘that’ and ‘the average’
• Line 27: Opening quotation mark on ‘contact’
• Line 65: Opening quotation mark on ‘age’

Methods
• Line 104: ‘Upon network initialization, we randomly seeded outbreaks of one infection per genotype, allowing for up to four genotypes per host.’ Given that the outbreaks were seeded at the same time, how do you expect the results to be affected when the seeding occurs at different points in time? Naively I would expect this to decrease the difference between the genotype specific and the genotype agnostic case.
• Line 138: ‘To better handle the computational workload of multiple infections on networks, we stored the infection histories in an interval tree data structure containing the times of onset and end of each infection episode, the genotype, and the host number’. In reality there might be uncertainties in the times of onset and end of infection episodes or the knowledge of infection episodes for individuals might be incomplete. How sensitive are the results to such uncertainties?
• Lines 149-157: Line 149 for example says, ‘In the case of recovery of host i from infection with genotype g, we wrote the end time of the infection episode to the interval tree, removed the recovered infection from the rate vector, and added potential infections with genotypes other than g from the host's network neighborhood.’ This implies that an individual can’t get reinfected by g once they have recovered. Was this written just as an example? As for instance this is not the case for all the genotypes in immunity setting I_2.
• Line 161: Opening quotation mark on ‘clustered’

Results
• Fig 1: x-axis is missing units of time
• Fig 2: I think the results of this section would become clearer if the Mantel test r-values were also added to this figure. 
Also, on first glance I found it a little confusing to see the correlation being positive for the adjacency matrix (in Fig.S1) but negative for the shortest path. Might help if this difference is described briefly in the Methods.
• Line 320: ‘Cross-immunity assumptions allowing for multiple infections with highly diverse infection barcodes (e.g. full cross-immunity, asymmetric decreasing) resulted in higher significance levels’. Is there a typo in this line? As it seems that the first three columns in Fig.2 (i.e. apart from ‘full cross-immunity’ and ‘asymmetric decreasing’) have higher significance levels.
• Line 322: ‘Significance levels and correlation values appeared to be higher when tested for global network properties given by the shortest path matrix (bottom), which is consistent with the barcode similarity being a quantitative measure.’ Is this statement also only restricted to the results in the first three columns of Fig.2? Also, is there an intuitive reason why the significance levels are so low (even for 100% sampling) for the shortest path full-cross immunity case?
• Line 325: ‘We found that the correlations were highly significant even at a network sampling rate of 10%’. Looking at Fig.2 this sentence comes across a bit strong, specifying the cases for which this is true will help.

Reviewer #2:

Overview

Selinger & Alizon present a wonderfully well-written manuscript and simulation study to explore how multiple infection histories can be used to infer network properties and immunological interference among co-circulating pathogens. This paper was justified, with very clearly written methods, and intuitive results. I have a few “major” comments, although they are more questions out of curiosity than necessary items to implement.

Major comments

- Have the authors considered differences in susceptibility across hosts? For example, younger individual may be more susceptible to infection with influenza due to less (or no) pre-existing immunity. 
This could also be the case for individuals with co-morbidities or individuals that are immunocompromised. Could these types of individual-level differences be implemented in this model?
- I am wondering if the authors considered including a conceptual diagram for the barcode concept. I think this could be very useful to include in the methods, especially for those readers that are not familiar with barcode theory (probably most readers and infectious disease dynamics folks).
- Throughout the results, it would be very helpful to contextualize the findings more. Specifically, I think the authors could explain if results were intuitive or not; they do this to some degree, but I think even more would be very useful. For example, did the authors expect the partial cross-immunity scenario to most accurately predict node degree? If so why or why not?

Minor comments

- Line 27: Could you add one reference for each: definition issues and assessing the appropriateness of various types of data?
- Line 38: upon first definition, it is not immediately clear to me what “taking advantage of multiple infection as a unique descriptor of a host’s position within the network means.” Could you please clarify this?
- Line 41: although I agree that the norm is for longitudinal data to be used in more simple statistical models rather than dynamical models, there are a few very nice recent works that incorporated longitudinal data into individual-level dynamical models (Ranjeva et al. 2017- PMID: 29208707 and Ranjeva et al. 2019- doi: 10.1038/s41467-019-09652-6)
- Line 61: upon first introduction, I don’t know what barcode theory is. It would be helpful to provide a very brief definition of barcode theory upfront in the introduction.
- Fig 1. I am surprised that diversity decreases as much in the homologous setting as in the full cross-immunity setting? Is this expected? 
It might be nice to quantify the differences between panels in Figure 1… like are the diversity, MOI, and time-averaged fraction of nodes infected statistically significantly different across panels?
- it would be helpful in Figure 5 to include the immunity input matrix too, maybe there could be one additional panel with the 5 immunity input matrices shown, this would allow an easier comparison between co-occurrence score and the potential (or lack thereof) for homo / heterologous protection

Additional comments from the Editors: The Author Summary is well-written but it might be a little too technical for the intended audience.

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: No: They have not yet, but they say this will upon publication.

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. 
Reviewer #1: No

Reviewer #2: No
13 Aug 2021

Submitted filename: response_to_reviewers2.pdf

23 Aug 2021

Dear Dr. Selinger,

We are pleased to inform you that your manuscript 'Reconstructing contact network structure and cross-immunity patterns from multiple infection histories' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Alison L. Hill
Associate Editor
PLOS Computational Biology

Virginia Pitzer
Deputy Editor-in-Chief
PLOS Computational Biology

8 Sep 2021

PCOMPBIOL-D-21-00635R1

Reconstructing contact network structure and cross-immunity patterns from multiple infection histories

Dear Dr Selinger,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Amy Kiss
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
