Literature DB >> 32881984

Inferring transmission heterogeneity using virus genealogies: Estimation and targeted prevention.

Yunjun Zhang^1,2, Thomas Leitner³, Jan Albert^4,5, Tom Britton².

Abstract

Spread of HIV typically involves uneven transmission patterns where some individuals spread to a large number of individuals while others to only a few or none. Such transmission heterogeneity can impact how fast and how much an epidemic spreads. Further, more efficient interventions may be achieved by taking such transmission heterogeneity into account. To address these issues, we developed two phylogenetic methods based on virus sequence data: 1) to generally detect if significant transmission heterogeneity is present, and 2) to pinpoint where in a phylogeny high-level spread is occurring. We derive inference procedures to estimate model parameters, including the amount of transmission heterogeneity, in a sampled epidemic. We show that it is possible to detect transmission heterogeneity under a wide range of simulated situations, including incomplete sampling, varying levels of heterogeneity, and including within-host genetic diversity. When evaluating real HIV-1 data from different epidemic scenarios, we found a lower level of transmission heterogeneity in slowly spreading situations and a higher level of heterogeneity in data that included a rapid outbreak, while R0 and Sackin's index (overall tree shape statistic) were similar in the two scenarios, suggesting that our new method is able to detect transmission heterogeneity in real data. We then show by simulations that targeted prevention, where we pinpoint high-level spread using a coalescence measurement, is efficient when sequence data are collected in an ongoing surveillance system. Such phylogeny-guided prevention is efficient under both single-step contact tracing as well as iterative contact tracing as compared to random intervention.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2020 PMID： 32881984 PMCID： PMC7494101 DOI： 10.1371/journal.pcbi.1008122

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

Introduction

Allocation of prevention resources to where they are needed most is important for effective disease control. Thus, identifying where disease spread has its highest intensity would allow for efficient resource allocation. The intensity of the spread is typically uneven in time and space, causing episodic and local outbreaks. One type of spread-heterogeneity comes from the situation when some individuals spread at a much higher rate than others, causing superspreading [1], and invoking the 20-80 rule, where 20% of infected persons transmit to 80% of new infecteds [2, 3]. Such transmission heterogeneity can have impact on how fast epidemics spread, and may affect the outcome of control efforts [4]. While epidemiological methods using incidence data can pick up the signal of increased number of cases, which may indicate an outbreak or be the result of increased surveillance, they can not identify individual-level transmission heterogeneity. To assess transmission heterogeneity on the individual-level transmissibility, contact tracing based on interviews, reviews of clinical records, and partner follow-up have traditionally been used. Such follow-up may be slow, expensive and inaccurate, however. Several studies have reported that interview-based information about sexual contacts where HIV-1 transmission might have taken place often was not in agreement with the phylogenetic history of the transmitted virus [5, 6]. Early studies showed that virus phylogenies reflect their underlying transmission histories [7], and more recent work has shown that phylogenies carry information about the underlying population structure [8-10] and the degree distributions of the sexual contact network [11-13] both of which contribute to the transmission heterogeneity. The Multi-state birth-death (MSBD) models have been used to model population structure and infer the related transmission heterogeneity [8, 14]. The rationale of these models is to associate each cluster (or subpopulation) to a state in the MSBD model, with clusters differing in their transmission dynamics through time. An important limitation of the MSBD models is either requiring the correspondence between the tips and the states is known in advance [15] or using fixed positions of state changes [10]. An efficient method of model-based genetic clustering has recently been proposed in [9], which is based on fitting a Markov-modulated Poisson process representing the evolution of transmission rates along the tree relating different infections. Though it performs well in clustering lineages of different risky groups, the method is not designed to infer basic reproduction number which is also of certain interest in epidemiloigcal practice. In addition, because both incidence and pathogen phylogenies carry information about the epidemic, several efforts have been made to combine these data into comprehensive heterogeneity inference [16-18]. Interestingly, Li et al [19] recently showed that quantification of transmission heterogeneity was more accurately estimated using the pathogen phylogeny alone rather than in combination with incidence data. They describe the transmission heterogeneity as the variation among the offspring distributions, i.e., the number of secondary cases caused by each infected individual. The offspring distribution, however, is not only affected by the individual-level transmissibility but also by the length of the infectious period of each infected individual. In this study, we propose a heterogeneous birth-death model to describe the variation of the individual-level transmissibility, with birth rates corresponding to transmissibility (or transmission rates) and death rates to diagnosis rates (or removal rates). We model the transmission heterogeneity by letting each individual draw an independent random transmission rate from a continuous distribution, so the transmission heterogeneity is captured in a parameter quantifying the amount of variation in this distribution. Transmission heterogeneity may exist due to varying degrees of social activity as wells varying transmission risk upon contact, e.g., due to varying viral load in different disease stages. We then develop a general inference procedure to estimate model parameters and to test whether or not there is significant transmission heterogeneity, and thereafter we develop a method that identifies evolutionary lineages that likely have been involved in many transmission events of the epidemic. We use the second method to direct targeted prevention efforts, by means of (additional) contact tracing, towards individuals that have been involved in elevated spread as compared to random intervention. To address the realistic situation where not all persons have been sampled in an epidemic, we investigate the effects of sampling fractions, situations where samples are taken cross-sectionally and when samples are gathered in an ongoing surveillance system. We investigate the effects of within-host diversity in simulations on both the general heterogeneity test as well as the targeted prevention. We also test our estimate of transmission heterogeneity in facing of realistic complexities such as autocorrelation in transmission rates and varying transmission rate through time. We apply the heterogeneity model and estimate parameters on 3 real datasets sampled from different HIV-1 epidemics.

Materials and methods

A transmission model with heterogenous transmissibility

In order to estimate transmission heterogeneity in a population and, further, the effect of targeted prevention on individuals associated with high transmissibility, we develop a new transmission model, referred to as the heterogeneous birth-death (HBD) model, in which individuals in a community may have different rates of infecting new individuals. The difference in transmissibility could be caused by variable virus load leading to variable infection probability upon contact, and/or variable social contact rate. Here, we refer to any combination of these concepts as transmissibility. Specifically, in our model each infected individual i gets assigned a transmissibility rate λ by randomly drawing the rate from a Gamma distribution with mean μλ and standard deviation σλ. The gamma distribution is a common choice when modeling individual heterogeneity for a parameter. While infectious, this individual infects new individuals randomly in time, with rate λ per unit of time. Also we assume that the transmission rates of the infector and the infectee in a transmission event are independent. Later we will evaluate our model under the condition with autocorrelation of transmission rates. Infected individuals remain infectious until they are diagnosed—it is then assumed that individuals are immediately treated such that they no longer are infectious and instead immune [20]. The time to diagnosis, or more precisely to successful treatment, is assumed to be exponentially distributed with rate γ common for all individuals (hence random and with no systematic heterogeneity). The diagnosed individuals are sampled for virus sequencing with probability ρ (referred to as the sequencing ratio). Consequently, there are three categories of infected people: a) the undiagnosed (who continue to spread the disease), b) the diagnosed/treated who have not been sampled for sequencing, and c) the sequenced diagnosed/treated individuals. The ratio of the sequenced diagnosed over all the infected (i.e. c/(a + b + c)) is referred to as the sampling ratio in this study. In contrast with the sequencing ratio (i.e. c/(b + c)) which is usually assumed to be known in advance, the sampling ratio is unknown in the analysis of virus sequence data. In our model, a transmission history is generated with our heterogeneous birth-death transmission model under a given set of parameters (Θ = (μλ, σλ, γ)). The transmission history (denoted as ) contains all information about who-infected-whom in calendar time (Fig 1). The corresponding tree without information about who-infected-whom is called the transmission tree and denoted as . In general, the transmission tree cannot be observed from virus sequence data. However, what we can infer is the tree that describes how the virus sequences from all the infected individuals are related, i.e. the virus genealogy . If there is no within-host virus diversity, then the virus genealogy and transmission trees are identical: . If within-host diversity is taken into account (while still assuming only one strain is transferred at transmission), then the coalescence times of the sampled virus sequences may coalesce further back in time than the time of infection, with the effect that branch lengths and even tree topologies may differ from the underlying transmission tree [21]. The estimation method in this study was first developed under the former situation (i.e. without within-host diversity), and was extended to the latter case in the simulation.

Fig 1

The connection between transmission history (a), complete genealogy (b), and sampled genealogy (c).

The connection between transmission history (a), complete genealogy (b), and sampled genealogy (c).

In the model, a transmission history (a) is simulated with the heterogeneous birth-death transmission model under a given set of parameters (Θ). The i-th (i = I, ⋯, V in Roman numeral notation) individual was infected and stopped infection at t and s (i = 1, ⋯, 5 in Arabic numerals) respectively. There are three categories of individuals at the end of the simulation: sampled individuals, i.e. diagnosed and sequenced (tips labeled with −*), diagnosed but unsampled individuals (tips labeled with − ○), and undiagnosed individual (unlabeled tips). A complete virus genealogy (b) contains the genealogical lineages from all the infected individuals. In the case of no within host diversity it coincides with the transmission tree being the transmission history but without the direction of transmission. The sampled genealogy (c) is derived from the complete genealogy by removing edges from unsampled individuals. The input to our analysis is the pathogen (virus) genealogy (denoted as ), i.e. the time-scaled phylogeny that can be inferred from the viral DNA sequences from sampled patients (i.e. the diagnosed and sequenced individuals) in an epidemic spanning from time 0 to t. The sampled virus genealogy is obtained by removing the unobserved (undiagnosed and diagnosed but not sampled) individuals and the corresponding coalescent events from (Fig 1). The parameters to be estimated in analysis are the mean μλ and standard deviation σλ of the transmissibility and the mean time to diagnosis/treatment γ−1, with the main focus on the degree of heterogeneity of transmissibility, either measured in absolute terms as σλ or in relative terms as the coefficient of variation CVλ = σλ/μλ. The CV is a dimensionless quantity where standard deviation, as a measure of dispersion, is normalized by the mean that facilitates comparisons across data sets. When either σλ or CVλ are large, then that implies that a few individuals will tend to infect at a much higher rate than others. We also infer the basic reproduction number R0, which is the ratio of the mean transmissibility rate and the diagnosis rate, i.e. R0 = μλ/γ, being the average number of infections caused by an infected before diagnosis and treatment. Our transmission model is an extension of the homogeneous birth-death model in [22], which does not include transmission heterogeneity. Moreover, Leventhal et al [23] explicitly take the finite population into account by considering depletion of susceptible. Here, we neglect this depletion but investigate this simplification in the sensitivity analysis. Recently, Li et al [19] studied a related model where heterogeneity in the number of infections was described in a generation sense; while they assumed that all infections happen at the end of the infectious period, in our model infectious individuals infect new individuals at rate λ during the infectious period. On the other hand, their model also allow for the possibility to have a highly concentrated distribution of number of infections, for example caused by a social network with most individuals having very similar number of contacts.

Inferring transmission heterogeneity from sampled viral genealogy

There are two difficulties of using the sampled genealogy to infer transmission heterogeneity: one is the lack of transmission direction (who-infected-whom) in , and the other that has incomplete transmission chains [19]. We deal with these two difficulties by using a coordinate-ascent algorithm to estimate both the parameters and the direction of transmission in a genealogy, and, additionally, a data imputation correction to handle the incomplete transmission history. We will first introduce the coordinate-ascent algorithm and consequently give a description about the analysis of the sampled genealogy .

Inference with coordinate-ascent algorithm

The coordinate-ascent (CA) algorithm works with a complete genealogy as input, that is, the algorithm assumes there is no within-host diversity and that we observe and sequence all infected individuals up to a certain time t. Later, these assumptions have been relaxed to analyze the sampled genealogy. The CA algorithm has been developed on the basis of the likelihood function for a transmission history . Under the assumptions of our HBD model, not only the transmission events caused by different individuals occur independently, but also the multiple transmission events caused by one individual are independent. So the likelihood for the whole transmission history can be evaluated in terms of individuals. Let x and d (i = 1, ⋯, n) denote the number of infections and the duration of the infectious period for individual i, information which is contained in the transmission history . The likelihood function is then given by where Γ(⋅) is the gamma function, and the indicator in (1) indicates if the infectious period d has ended before the end of the observation period. The parameters of (k, θ) are parameterizations of Gamma distribution under which the mean and standard deviation can be represented as μλ = kθ and respectively. Consequently, the maximum likelihood estimates of the model parameters are obtained by equating the gradient of the log-likelihood function to zero. Particularly, the estimates of and are obtained by solving the following equations: Moreover, the estimate of is which is simply the observed number of recoveries divided by the total duration of the infectious periods in the transmission history . When we observe the virus genealogy rather than the transmission history , we can no longer estimate the parameters as described above since does not contain information of transmission direction, i.e. who infected whom. In other words, we can then not determine , so the estimates in (2) are not available. However, the estimator based on (3) is still available without the information of transmission directions: even if each infectious period cannot be determined, the sum of infectious periods ∑ d is independent of transmission directions and equals the sum of branch lengths in (denoted as ). The estimator can hence be calculated before the CA algorithm is introduced below, and it is given by: where is the number of diagnosis events in . In the following, we focus on the estimation of parameter k and θ. We treat the transmission directions of the branching events in as latent variables, and propose the CA algorithm which alternates between estimating the model parameters (Θ = {k, θ}) and reconstructing the transmission history . Here the reconstruction is implemented in a form of labeling the branches of (as illustrated in Fig 2). The branching events following the labeled branches are the transmission events caused by the individuals with the corresponding labels. The CA algorithm traverse the genealogy slice-by-slice from the leaves up to the root (Fig 2). Each slice (illustrated in Fig 2(a)) consists of branches with the same height, defined as the number of edges on the longest downward path from this branch to a tip (i.e. similar to the definition of the height of a node). For example, the external branches have the height of 0, so they consist of a slice. The branches within a particular slice come from different infected individuals, and the transmission event following these branches are hence independent. Therefore, labeling these branches can be performed in parallel.

Fig 2

Schematic representation of the coordinate-ascent algorithm.

Schematic representation of the coordinate-ascent algorithm.

(a) The branches in the genealogy have been divided into 4 slices (labeled as ⓪ to ③). (b) The algorithm performs estimation of model parameters (denoted as Θ) and reconstructing the transmission history (or labeling the branches (with different letters) of a genealogy) simultaneously under the likelihood framework. The algorithm starts from the external branches (labeled as ⓪) which come from different individuals, and proceeds backwards slice by slice. The internal branches with the same label represent the infectious period of a particular individual, who is the infector of the transmission events during that period. (c) Given the present estimate of model parameters and transmission history (,), the transmission histories of individual a and b are (d = l1, e = 0) and (d = l2 + l3, e = 1) respectively. For the unlabeled branch l, its labeling coefficient is evaluated as . The inference process begins with the external branches which are assigned to different individuals, yielding the initial estimate of the transmission history . Because of the independence assumption in the HBD model, the likelihood function in Eq (1) is also valid for the partially reconstructed transmission history , and the initial estimate of is hence obtained via (2). Further, for all unlabeled branches in a slice, the CA algorithm alternates between updating transmission history (i.e. labeling the branches based on the present estimate of model parameters) and updating the estimation of model parameter based on the updated transmission history, until convergence. More precisely, we evaluate the labeling coefficient for an unlabeled branch as the ratio of the probabilities of labeling the branch to its left and right descendant branches. Suppose that the unlabeled branch has a length of l, and that its left and right descendant branches were already labeled as a and b respectively, that is, with reconstructed transmission history of (d, e) and (d, e), respectively. Labeling the new branch as a or b stands for updating the corresponding transmission history by prolonging for another l time unit and adding one more transmission event. Here the labeling coefficient β is calculated as (illustrated in Fig 2(c)): where the probabilities such as is evaluated via (1) by plugging in these present estimates of , , and . Consequently, if β > 1, then the new branch is labeled as its left descendant branch (a in this example); otherwise, as its right descendant branch (b in this example).

Analysis of sampled virus genealogy

In this part, we describe the analysis of the sampled genealogy based on the CA algorithm. Since suffers from partial sampling, the CA algorithm cannot be applied directly. For a given tuning parameter p (0 < p < 1, often slightly smaller than 1), we focus on the part of up to time t − L (denoted as ), where L is the p × 100 quantile of the lengths of the external branches of . We set p to a high value (p = 0.9, 0.85, and 0.8) to get a local genealogy which contains a relative large part of information about the epidemic dynamics up to time t − L. Specifically, our analysis consists of three steps: firstly, we estimate the sampling ratio (i.e. the number of the diagnosed and sequenced divided by all the infected) of the local genealogy ; secondly, we apply the CA algorithm to the local genealogy by assuming that all individuals infected within [0, t − L] are sampled; thirdly, we correct these parameter estimates by taking advantage of the estimated sampling ratio in the first step. Also, the results under different values of p are averaged out to generate the final estimation. Firstly, the sampling ratio of the local genealogy , or in other words, the number of the unsampled individuals up to time t − L, has been estimated. As defined above L is the p × 100 percentile of the lengths of the external branches of , so L provides a conservative estimate of the p × 100 percentile for the distribution of infectious periods. Hence, we assume that a fraction p of the individuals infected within [0, t − L] will be diagnosed before time t. In addition, a fraction ρ of the diagnosed individuals was selected to do sequencing (i.e. the sequencing ratio is ρ), so the sampling ratio of is ρ * p. Denoting as the number of individuals collected in the local genealogy , so there are individuals which were infected before time t − L but were not sampled. Next, we apply the CA algorithm to the local genealogy and obtain the raw estimates of the mean (denoted as ) and the standard deviation (denoted as ) of the transmissibility rate. Furthermore, the raw estimate of the transmission heterogeneity is . Last, we study the correction of the raw estimates by considering the unsampled individuals. This correction is performed on the basis of data imputation. For the sampled individuals, their estimated transmissibility rates shall be upscaled based on the sampling ratio by allowing for partial sampling. For the unsampled individuals, we assume that they have the same infectious period and the same transmissibility rate because of no further information about their difference. The corrected values for the sampled individuals together with the substituted values for the unsampled individuals are combined to yield the corrected estimation of model parameters as follows (Please see S1 Text for details about the derivation of (6)-(8)): The corrected estimation of the diagnosis rate is where is the mean length of the external branches in , and is used as the infectious period of the unsampled individuals. In addition, the corrected estimates of the mean and standard deviation of transmissibility rate are and where λ is the substituted transmissibility rate for the unsampled individuals. Furthermore, the corrected estimation of transmission heterogeneity is . The value of λ shall vary with the level of heterogeneity. Under the homogeneous situation (i.e. CVλ = 0), all the infected individuals have the same transmissibility rates, so we set the transmissibility rates of the unsampled individuals to be the same as the average rate of the sampled individuals, that is, . On the other hand, under the situation with extremely high level of heterogeneity, a small proportion of individuals have very high level of transmissibility rates, while a large proportion of individuals have small level of transmissibility rates. The former will cause most of the infections, while the latter will cause few or no infections. Note that the transmission events caused by an individual could be reconstructed by the information from its descendants. The sampled individuals in the genealogy are more likely to be the descendants of the highly infectious individuals. Hence the coalescent events in the genealogy are more likely corresponding to the transmission events caused by these highly infectious individuals, even if some of them are not sampled while their descendants are sampled instead. Hence the transmissibility rates estimated from the sampled genealogy reflect the rates with which highly infectious individuals spread the disease. Based on this consideration, under the situation with high level of heterogeneity, we assume that unsampled individuals have low transmissibility rates and that they cause few or no infections before being diagnosed. In other words, we set the constant transmissibility rate λ for the unsampled individuals to , where is the estimated recovery rate. For the general case we compromise between the two extreme situations on the basis of the level of heterogeneity as follows: where is a raw estimation about the level of heterogeneity as defined above.

Targeted prevention based on phylogenetic information

An important application for heterogeneity detection is to optimize the epidemic control policy when transmission heterogeneity exists. As pointed out in [1], focusing on the highly infectious individuals will greatly increase the efficiency of disease control policy. To facilitate such an application, we propose a phylogenetic index—the Number of Coalescent Events (NCE) which helps to identify potentially highly infectious individuals. We investigate the effectiveness of using this index with the intention of controlling the spread of disease in our heterogeneous birth-death model. Specifically, for each tip in the sampled genealogy , we calculate the NCE within a fixed time-period t prior to the time of sampling of the individual. The taxa (diagnosed individuals) with higher NCE-values are more likely to be phylogenetically linked with unsampled highly infectious individuals. Thus, performing contact tracing on the individuals with high NCE-values should identify undiagnosed and highly infectious individuals and/or their undiagnosed descendants, and therefore efficiently reduce further disease spread. We refer to contact tracing based on NCE-value as phylogeny-guided contact tracing. We evaluate different threshold values m, i.e. where individuals who have NCE ⩾ m within t prior to being sampled are selected for contact tracing. We also evaluate both single-step and iterative contact tracing schemes [24] to show the full beneficial effects of phylogeny-guided contact tracing policies.

Overview of simulation study

To validate our general transmission heterogeneity detection method, we tested the performance of the inference algorithm on simulated data. As reported the average time between infection and diagnosis/treatment for HIV-1 in Sweden is around 2.5 years [11, 25, 26], so we set the rate of diagnosis at γ = 1/2.5 with year being our unit of time. We set R0 = 2.5 (implying that μλ = 1) which is in agreement with the previous study on the transmission of HIV in Switzerland [22]. We varied the level of transmission heterogeneity (i.e. CV = σλ/μλ) from 0 (no heterogeneity) to 5 (extremely high heterogeneity). Unless otherwise stated, each simulation began with one infection and was stopped when there were 100 diagnosed individuals (the average number of infected but not yet diagnosed were 130). Among these 100 diagnosed individuals, 90 diagnosed individuals were sampled for sequencing to reconstruct the viral genealogy, that is, the sequencing ratio was set to ρ = 0.9. In all, the average sampling ratio in our simulations was 90/230 ≈ 0.39. We assume that there was no delay between the time of diagnosis and sampling. Please see S1 Text for details of simulation. All the codes for simulation and inference are available at github.com/yunPKU/HeteroInfer.

Results

Estimation of heterogeneity and other model parameters on simulated datasets

The epidemic parameters describing the recovery rate (γ), average infectivity rate (μλ), and the transmission heterogeneity (σλ and CVλ = σλ/μλ) were estimated from our heterogeneous birth-death model and the described inference methodology (Fig 3A, 3B, 3D and 3E). While there was a slight bias trend over CVλ = σλ/μλ, the bias was well within the corresponding 95% confidence interval in each case. Hence, under a wide range of epidemic situations with different levels of heterogeneity (CVλ) the means of the epidemic parameter estimates were close to the corresponding true values. The inference of the basic reproduction number (R0), showed slight upward biases under low level of heterogeneity and slight downward biases under high level of heterogeneity (Fig 3B). However, these biases were also covered by the 95% confidence intervals.

Fig 3

Estimates of the heterogeneity and other epidemiological parameters from simulated virus genealogy.

Estimates of the heterogeneity and other epidemiological parameters from simulated virus genealogy.

The simulations were performed under various true levels of heterogeneity (). In each panel, the black line denotes the true value that was used to generate the simulated data, and the colored curve and the shaded area denote the mean and 95% confidence interval from 200 simulations, respectively. A estimate of recovery rate (γ), B estimate of average infectivity rate (μλ), C estimate of the basic reproduction number (R0), D estimate of the standard deviation of infectivity rate (σλ), and E estimate of the coefficient of variation (CVλ). The estimated CVλ showed non-negligible downward bias. Since the coefficient of variation CVλ is the inverse square-root of the shape parameter in the Gamma distribution, this observation agrees with earlier findings showing that the maximum likelihood estimate of the shape parameter tend to upward biased [27], leading to CVλ being underestimated. The upward bias trend of over heterogeneity level was potentially due to our simulation process conditioning on reaching 100 tips. Given this condition, the average branch length of the simulated tree decreased with the heterogeneity level CVλ (as shown in S1 Fig), indicating the simulation datasets were biased in favor of ‘shorter’ trees under higher level of heterogeneity. Therefore, the estimate of was upward biased under high CVλ. On the other hand, the estimate of remained relatively stable. This is due to the fact that we used λ as the substituted transmissibility rate for the unsampled individuals. We explained in section of Analysis of sampled virus genealogy that, under the high level of heterogeneity, λ was close to the estimated recovery rate which is considerably smaller than the true value of μλ. So this correction method reduced the upward bias of under high heterogeneity situation, and hence resulted in a stable estimate. In all, the upward bias trend of together with the stable estimate of μλ over CVλ resulted in the downward bias trend of the estimated over heterogeneity levels. In the simulation study we also vary time scales by fixing the average infectivity rate μλ = 1 and gradually increasing the average length of infectious period γ−1 from 1.5 to 3, corresponding to the basic reproduction number R0 increased from 1.5 to 3. As before the simulation began with one infection and was stopped when there were 100 diagnosed individuals. The results are summarized in S2 and S3 Figs. The estimates of transmission heterogeneity (σλ and CVλ) were robust to differences in the average length of infectious period γ−1 (S2 Fig). The effect on the estimated CVλ was small unless the true heterogeneity level was higher than 3 (S2A Fig). We also found that the estimated remained upward biased under the scenarios with shorter infectious period (γ−1 = 1.5 and 2). As our simulation process being conditioning on reaching 100 tips, the simulated datasets were likely to be biased in favor of “faster” trees which resulted in the overestimated R0 [28].

Inference of heterogeneity when transmission rates are autocorrelated

Our new method was developed under the assumption that each infected individual has an independent transmission rate. In reality, however, risk factors of transmissibility such as behavior often exhibit a high degree of homophily, such that the transmission rate of a newly infected individual may be similar to that of the donor. We therefore evaluated our method when transmission rates are autocorrelated and made a comparison with the previously developed Markov-Modulated Poisson Process (MMPP) method [9] which was developed by exploiting the autocorrelation of transmission rates to genetic clustering as well as to estimate different transmission rates of different subpopulations. We adopted the simulation scenario in [9] to divide the whole susceptible population into two subpopulations with different transmission rates (λ1 and λ2 respectively) and simulated the outbreak in two ways, that is, with and without autocorrelation of transmission rates. In the former case, each newly infected individual has the same transmission rate as its infectee with the probability 1 − π and switch to another transmission rate with the probability π. Here we set the transmission rates as λ1 = 0.9 and λ2 = 8.1 (9-fold faster than the other as in [9]). The switching probability is π = 0.2 or π = 0.8, corresponding to high level of autocorrelation with low heterogeneity (CVλ = 0.46) and low level of autocorrelation with high heterogeneity (CVλ = 1.05) respectively. The evaluation under each condition is performed based on 100 simulated replicates. In the case without autocorrelation, each newly infected individual draws its transmission rate independently from a binary probabilistic distribution, that is, choosing λ1 with probability 1 − π, and choosing a larger transmission rate λ2 with probability π. We set π = 0.1, corresponding to the proportion of risky subpopulation in [9]. We made the comparison for two settings, one with lower level of heterogeneity: λ1 = 2 and λ2 = 6 (corresponding to the mean transmission rate μλ = 2.4 and the heterogeneity CVλ = 0.5); and the other with higher level of heterogeneity: λ1 = 1 and λ2 = 15, corresponding to the same μλ = 2.4 but now CVλ = 1.75. Under the situation of no autocorrelation of transmission rates, it is clear that the new method generates more accurate estimates of the mean (μλ) and the heterogeneity (CVλ) of two transmission rates (S9A and S9B Fig). For the low heterogeneity condition with true values of μλ = 2.4 and CVλ = 0.5 (S9A Fig), the medians of the estimates for μλ and CVλ are 2.32 and 0.7 from our new method and are 4.65 and 0.74 from the MMPP respectively. For the high heterogeneity condition with true values of μλ = 2.4 and CVλ = 1.75 (S9B Fig), the medians are 2.19 and 1.56 from our new method and are 6.35 and 1.04 from the MMPP respectively. Under the situation with autocorrelation of transmission rates, the new method is comparable with the MMPP in terms of estimation accuracy (S9C and S9D Fig). For the low switching probability (π = 0.2) condition with true values of μλ = 6.49 and CVλ = 0.46 (S9C Fig), the medians of the estimates for μλ and CVλ are 5.22 and 0.79 from the new method and are 8.73 and 0.99 from the MMPP respectively. For the high switching probability π = 0.8 condition with true values of μλ = 3.18 and CVλ = 1.05 (S9D Fig), the medians of the estimates are 2.64 and 1.27 from our new method and are 5.75 and 1.00 from the MMPP respectively.

Parameter estimates remain accurate under finite population sizes and biased sequencing ratio ρ

The above results show good performance of the proposed inference method under the heterogeneous birth-death model assuming that there is no depletion of susceptibles (corresponding to an infinite population) and that the sequencing ratio ρ was known (but the sampling ratio is unknown). In real applications, however, these two assumptions do not hold. While the size of the susceptible population at risk is very difficult to know in real epidemics, it is never infinite. The sequencing ratio ρ is often known to some extent, but there can be uncertainty in the number of diagnosed but not sequenced, leading to uncertainty in ρ. To investigate if the susceptible population size and biased ρ affect the estimation of the model parameters, we tested the inference method under a relative realistic scenario, i.e. setting the population sizes N = 1000 and introducing 10 percent bias in ρ both upwards and downwards. As before we began with one infection and stopped when there were 100 diagnosed individuals. When the simulations were stopped, the average number of infected but not yet diagnosed was 120, that is, the average prevalence was 220/1000 = 22% which is a relatively higher level in reality [29]. The estimation of transmission heterogeneity was robust to finite populations at risk (Fig 4) and bias in ρ, with only small effects unless the true level of heterogeneity was very high (CVλ > 3). The effects were not significant as they was covered by the variability of the estimate under infinite population size. The same conclusion applied also to estimation of the basic reproduction number R0. This is encouraging, as the susceptible population size is typically unknown, and may vary over time in real epidemics.

Fig 4

Performance of parameter estimation under finite population size and biased ρ.

The estimates equipped with “hat” (e.g.,) are based on the infinite population size and the exact value of ρ. The estimates equipped with “tilde” (e.g., and ) are based on finite population size and 10% bias in ρ ( is on upwards bias and is with downwards bias in ρ). The colored curves in (A) and (B) are the mean of 200 simulations, while the shaded areas denote the 95% confidence intervals based on these simulation replicates.

Performance of parameter estimation under finite population size and biased ρ.

Inference of heterogeneity is robust to transmission heterogeneity through time

In the above simulations, a constant transmissibility rate for each infected individual was assumed. However, it is well known that for HIV, infected individuals initially have a short high-transmission-risk phase (the acute infection phase when viral load is very high), followed by a much longer low-transmission-risk phase (the chronic phase during which viral load is relatively low), which, in the absence of treatment, is followed by another high-transmission-risk phase (progression to AIDS) [30-32] To investigate if this transmission-risk heterogeneity through time affect the inference of heterogeneity among individuals and other model parameters, we tested the proposed inference methodology under a more realistic simulation scenario by allowing for a time-varying transmissibility rate for each infected individual. As before, each individual i drew a random transmission rate from the Gamma distribution with mean μλ and standard deviation σλ, and a duration of the infectious period being Exp(γ) distributed. The time-varying transmission rate was modeled by starting with transmission rate during acute phase and then dropping and rising again if not yet diagnosed. More specifically, we followed the four-phase model in [31] to model the change of transmission rate over time since infection that is, where t was the time since infection for the i-th individual. As before each simulation was started with one infected individual and stopped when there were 100 diagnosed individuals. The results are summarized in S4 Fig. Relevant to the estimate of CVλ (i.e. heterogeneity of individual-level transmissibility) which was shown in S4B Fig, we found that adding transmission-risk heterogeneity due to disease progression leads to a small overestimation when the true level of CVλ was low (CVλ < 1.5) and a slight underestimation when CVλ > 1.5. These effects, however, were not significant as they were covered by the 95% confidence interval. In addition, as shown in S4A Fig, adding heterogeneity through time tended to underestimate R0, and the bias became significant under lower level of heterogeneity (CVλ < 1.0). This makes sense because adding heterogeneity through time reduced the average transmission rate over the whole infectious period for each individual, and hence resulted in a drop of the estimate of basic reproduction number.

Within-host diversity causes underestimation of transmission heterogeneity

Because HIV, and many other rapidly evolving pathogens, accumulate significant levels of within-host diversity, pathogen phylogenies may differ from the transmission history [21]. Using a coalescent-based framework that allows for within-host pathogen diversification over time [11, 21, 33], we investigated the effect of different diversification rates on the ability to detect transmission heterogeneity (please see S1 Text for details of simulation). We found that within-host diversity leads to underestimation of the level of transmission heterogeneity (Fig 5A), which becomes significant at very high heterogeneity levels (CVλ > 3). Within-host diversity adds an additional source of phylogenetic variation, and this variation partially overshadows the heterogeneity resulting from differences in transmission contacts. Nevertheless, even with within-host diversity, the trend of the inferred heterogeneity is monotonically increasing as the true level of heterogeneity increases, leading to a positive response of the inferred level of heterogeneity. Hence, when a pathogen generates significant levels of within-host diversity, we still detect increased levels of transmission heterogeneity if it occurs, but the true level of transmission heterogeneity may become underestimated.

Fig 5

Performance of parameter estimation in the presence of within-host diversity.

Performance of parameter estimation in the presence of within-host diversity.

Here r denotes the rate of within-host population size increases (per day). Four levels of within-host diversity have been calculated: r = 1(blue), 0.5(orange), 0.3(green), 0.15(red), and r = 0 (purple) corresponding to no heterogeneity. Results are the mean of 100 simulations. Also the estimates (purple) and the corresponding 95% confidence intervals (shaded area) under the condition without within-host diversity are also calculate for comparison. Increasing levels of within-host diversity also had an effect on the estimated R0 (Fig 5B). While increasing levels of within-host diversification tended to increase the estimated R0, only at the highest level of within-host diversity (r = 1) did the overestimation become significantly higher than under no diversification. As within-host diversity pushes the phylogenetic node heights backwards in time, known as the pre-transmission interval [34], the mean infectious period (γ−1) appears prolonged. However, the average time interval between neighboring transmission events, which corresponds to the mean transmissibility rate μλ, is less affected by the within-host diversity. Therefore, the estimated R0 = μλ/γ in the presence of within-host diversity will increase with the level of diversification.

The power of detecting transmission heterogeneity increases with heterogeneity level and sample size

For μλ = 1 and γ = 1/2.5, we estimated the expected power to detect heterogeneity for various sample sizes and heterogeneity levels. We reject homogeneity if the estimated heterogeneity exceeds the upper-bound of the 95% confidence interval which was estimated under the homogenous situation by setting CVλ = 0. This upper-bound was calculated from 500 simulations. The power to detect transmission heterogeneity depends on the sample size and true level of transmission heterogeneity (Fig 6). We investigated two different sampling size effects; 1) the sequencing ratio at the time of stopping simulation (when 100 persons have been diagnosed), and 2) with a fixed sequencing ratio ρ at a time when a certain number of persons have been diagnosed (from 20 to 100 persons). The effects on detection power of these two sampling scenarios were similar. For a given sample size (i.e. a particular colored line in Fig 6), the power naturally increased with the level of heterogeneity. From a public health action point of view, this is again encouraging as higher heterogeneity levels suggest that intervention may be more beneficial, which we address further below. Naturally, the power also increased with larger sample size (#(D) in panel A, and ρ in panel B).

Fig 6

Comparison of power of detecting heterogeneity under different sample sizes.

Here #(D) denotes the number of diagnosed individuals. A. The sequencing ratio is fixed as 1, while the simulation stopped when there were different number of diagnosed individuals. B. The simulation was stopped when there were 100 diagnosed individuals (#(D) = 100), while the sequencing ratio increases from ρ = 0.2 to 1, corresponding to a sample size going from 20 to 100 (colored curves). The power is estimated based on 200 simulations.

Comparison of power of detecting heterogeneity under different sample sizes.

Computing time

There is a growing demand of analyzing large sequence databases where transmission heterogeneity may exists and be desirable to detect [35]. Hence, we evaluated the computing time required to process trees with the number of tips (individuals) varied from 100 to 4,000. These results are summarized in supplementary S8 Fig. Our result indicated that the average computing time (over heterogeneity level CVλ varying from 1 to 5) scales linearly with the size of the tree. For instance, it required about 54.8 minutes (averaging over heterogeneity levels from 1 to 5) to process a tree with 4,000 tips. We also observed that the computing time for a tree with low level of heterogeneity (i.e. CVλ = 1) is much longer than that of analyzing a tree with high level of heterogeneity (i.e. CVλ ≥ 3). And the difference between these two computing times increases with the size of the tree. This is due to the fact that the proposed algorithm needs to reconstruct the transmission history while estimating the model parameters. This reconstruction is easier to proceed under the situation with high level of heterogeneity where few super-spreaders contribute to most of the transmission events. Hence, we expect that more computing time is needed to process a tree with heterogeneity smaller than 1 (CVλ < 1).

Analysis of real HIV outbreak data

We investigate HIV-1 DNA sequences from three local epidemics in Sweden ranging from fast explosive outbreak to more slow spread over longer time periods (Fig 7). The three corresponding genealogies were selected from a recent publication on HIV-1 spread in Sweden and neighboring Scandinavian countries [36], but the datasets used here contain only Swedish sequences. The first genealogy, denoted IDU_AE, represents a rapid outbreak of CRF01_AE (circulating recombinant form number 1 of subtype A and E viruses) infections among intravenous drug users (IDUs) in Stockholm in 2006 and 2007 [37]. The genealogy contains 83 taxa which were sampled from 2003 to 2010. The second genealogy, denoted IDU_B, represents slower local spread of subtype B infections among IDUs in Stockholm with sampling dates ranging from 2003 to 2010 [38]. This genealogy contains 51 taxa which were sampled from 2002 to 2009. The last genealogy, denoted MSM_B, represents longstanding local spread of subtype B virus among men who have sex with men (MSM) with sampling dates ranging from 1994 to 2010. This genealogy contains 18 taxa which were sampled from 1994 to 2010. All genealogies were monophyletic relative to other sequences in the Swedish-Scandinavian study and BLAST’ed international sequences [36].

Fig 7

Estimation of heterogeneity and other epidemiological parameters for three HIV outbreaks in Sweden: IDU_AE (red), IDU_B(blue) and MSM_B (grey).

Estimation of heterogeneity and other epidemiological parameters for three HIV outbreaks in Sweden: IDU_AE (red), IDU_B(blue) and MSM_B (grey).

On the top four panels, the distributions of the estimated parameters from a sample of 10000 posterior genealogies of each dataset are presented: the rate of diagnosis (A), the basic reproduction number (B), the Coefficient of Variation (C) and the Sackin Index normalized with the ‘PDA’ method (D). On the bottom, the time-scaled genealogies with the highest posterior probability are presented. For each data set, we analyzed a random sample of 10000 genealogies from the posterior distribution (generated using BEAST [39]). Fig 7 summarizes the results of the parameter estimates. Overall, the IDU_B data showed the lowest diagnosis rate (), meaning the individuals infected in this outbreak had a longer transmission period before being diagnosed (and prevented to spread further). Interestingly, the IDU_B outbreak also showed the highest R0, suggesting that the low value of may have driven up R0 = μλ/γ. R0 did not seem directly linked to transmission heterogeneity, however. Similarly, Sackin’s index, which measures tree balance (high values indicate unbalanced trees where one side of a split carries more taxa than the other side), did not find any significant differences between the data that contained an outbreak (IDU_AE) and the data from more even spread (IDU_B). Instead, as one might have expected, correlates with the speed of the different outbreaks, i.e. the highest heterogeneity was inferred in the data that contained the fast outbreak, which occurred in the IDU_AE epidemic, and the lowest heterogeneity was inferred in the slowest spreading epidemic (MSM_B). These observations agree with previous analyses of these local epidemics, where it was reported that there was a rapid outbreak, during a limited time period, among IDU infected with CRF01_AE in Sweden following a similar outbreak in Finland [38]. This suggests that it took some time for HIV-1 CRF01_AE to reach one or several individuals with high contact numbers after it had entered the Swedish IDU community. Once the individuals with high contact rates were reached, a rapid outbreak followed, infecting many susceptibles. Finally, as shown in [38], the outbreak subsequently slowed down to pre-outbreak levels, presumably because the recipients of the high-contact persons had been exhausted. Lower levels of transmission heterogeneity was observed in the slower spread that took place in the IDU_B community [37], suggesting that the contact patterns in this community were more even among members.

Simulation of targeted contact tracing: Continuous monitoring and cross-sectional sampling

Next, we evaluated phylogeny-guided contact tracing policies based on simulation. We again set the mean transmissibility rate at μλ = 1 and the rate of diagnosis at γ = 1/2.5, which resulted in the basic reproduction number as R0 = μλ/γ = 2.5. Here, one unit of time corresponded to one year as before, and we set the time period t in which we calculate the NCE-value to t = 1.25 (corresponding to half of the average infectious period). In addition, the heterogeneity varied from CVλ = 0 to 5. We define ‘continuous monitoring’ as the typical public health situation where samples (sequences of diagnosed cases) are continuously collected throughout an epidemic and where for each collected sample it is decided to perform (additional) contact tracing or not. The second situation is denoted ‘cross-sectional sampling’ which is the situation where all samples are collected (or re-analyzed) at some common time t, for instance when the epidemiological status is investigated in an otherwise unmonitored population or when additional preventive measures are decided upon or in retrospective contact tracing.

Using a pathogen phylogeny improves disease prevention under continuous monitoring

When no transmission heterogeneity is present (CVλ = 0), targeting individuals with higher than average NCE for contact tracing has no effect on the resulting epidemic size (Fig 8A), i.e. randomly selecting the same number of persons for contact tracing has the same effect as a phylogenetically informed strategy (the dashed and solid lines are on top of each other in this case). The reason why preventing (i.e. contact tracing) persons with NCE ⩾ 1 has higher effect than NCE ⩾ 2 is simply because NCE ⩾ 1 involves a larger number of persons that are contact traced, shown in the sidebar ‘Fraction of contact traced’. The ‘no prevention’ line shows the epidemic growth when no prevention (i.e. no additional contact tracing) was administered and is thus the maximum size of the simulated epidemic at any time point. Reduction from this level indicates the prevention effect.

Fig 8

Comparison of NCE-based contact tracing and random contact tracing under the situation of continuous monitoring.

Comparison of NCE-based contact tracing and random contact tracing under the situation of continuous monitoring.

The relative infection sizes (defined in S1 Text) of the NCE ⩾ m contact tracing (solid lines) and the corresponding random contact tracing (dashed lines) are compared for four threshold values: m = 1 (lines with ●), m = 2 (lines with triangledown), m = 3 (lines with +), and m = 4 (lines with *). The subplots on the right side of all panels show the fraction of contact traced for all these prevention policies. Four panels show the comparison under four levels of heterogeneity respectively, i.e. CVλ = 0 (A), CVλ = 1 (B), CVλ = 3 (C), and CVλ = 5 (D). These results are the mean of 300 simulations. When transmission heterogeneity was present (i.e. CVλ > 0), the relative epidemic size under a phylogeny-guided prevention (based on NCE) was always smaller than that of random prevention at the same proportion of persons (Fig 8B–8D). At very high level of heterogeneity CVλ = 3, phylogeny-guided prevention typically reduced the epidemic by an additional 10% over random prevention at 2.5 years out, and approximately an additional 10% for each NCE step. Moreover, the advantage of a NCE ⩾ m strategy over the corresponding random prevention strategy increased with the level of heterogeneity. For instance, if extremely high heterogeneity occurred (CVλ = 5), by preventing spread at NCE ⩾ 4 one would only need to contact trace 20% of the population to get about 75% reduction in total number of infections 2.5 years later. This means that a phylogenetically informed prevention strategy (using the NCE concept) becomes more valuable the more heterogeneous transmission rates are. From a public health point of view, where resources always are limited, using the pathogen phylogeny to allocate resources becomes more efficient the more heterogeneity that exists in the epidemic, and the NCE guided prevention can pinpoint where contact tracing and prevention would make the largest impact in reducing future infections. This becomes clear when comparing the fraction of contact traced and the relative effect of NCE-guided prevention in the different CVλ and NCE cases, i.e. the fraction of contact traced in the NCE ⩾ m strategies decreased as the threshold value m increased while the relative effect over random prevention in the NCE ⩾ m strategies increased as the threshold value m increased (see the supplementary S5 Fig). Therefore, under low levels of heterogeneity (estimated using our general heterogeneity detection method), one may choose a small threshold m to gain more effect on reducing the epidemic size, while under high levels of heterogeneity, one may instead choose a high threshold m, involving less resources, to invoke efficient control. For pathogens that accumulate within-host diversity, such as HIV-1, we also evaluated the phylogeny-guided prevention strategy when significant within-host diversity accumulates. The effect of a NCE-strategy was, in fact, higher in the presence of within-host diversity (r = 1) than when no within-host diversity accumulated (r = 0) (Fig 9). This happens because large NCE values are associated with short branches, and the effect of within host diversity is to make short branches only slightly longer whereas longer branches are increased more due to within host diversity. As a consequence, within-host diversity has the effect of accentuating the difference between short and long branches thus leading to a better distinction between subtrees with a relatively large number of coalescent events (large NCE) compared to subtrees with lower number of coalescent events.

Fig 9

Comparison of the performance of NCE-based policies under the situations of with/without within-host evolutionary dynamics.

Comparison of the performance of NCE-based policies under the situations of with/without within-host evolutionary dynamics.

Here r is the diversification rate of within-host virus population. The solid lines (r = 0 lineage/day) and the dashed lines (r = 1 lineage/day) are the results with/without within-host diversity respectively. The performance of the NCE ⩾ m policy is measured by the relative effect over random prevention (please see text for definition). Three levels of threshold m are calculated: m = 1 (blue), m = 2 (orange), and m = 2 (green). Within-host diversity also has the effect of reducing the fraction of contact traced for given threshold m in NCE (S6 Fig). Because within-host diversity pushes the phylogenetic node heights backwards in time, known as the pre-transmission interval [34], there will be fewer nodes at any given time-distance into the tree from the tips. Within-host diversity therefore reduces the NCE-count on all lineages, leading to a reduction in the fraction of contact traced. Hence, one may adjust the depth t at which NCE is calculated depending on how much diversity the pathogen under study typically accumulates. Finally, we evaluated the phylogeny-guided prevention under two different models of contact tracing: single-step tracing (where only direct recent contacts of the diagnosed are tested and prevented from further spread) or iterative tracing (where new cases from the single-step tracing are also contact traced, and so on) [24, 40]. The relative effect of phylogeny-guided iterative contact tracing was more effective than single-step contact tracing at higher levels of the NCE-threshold (m = 2 and 3) in Fig 10. Contrary, at NCE ⩾ 1, single-step contact tracing did better at lower levels of transmission heterogeneity. Hence, iterative contact tracing is more advantageous when medium to high levels of transmission heterogeneity exist, where higher NCE-thresholds can pick up transmission chains where a super-spreader exists. At very high heterogeneity levels (CVλ ⩾ 3), the iterative contact tracing advantage gradually diminishes because the probability to find the super-spreaders increases after just a single-step contact tracing.

Fig 10

Comparison of the performance of NCE-based policies under different modes of prevention.

Comparison of the performance of NCE-based policies under different modes of prevention.

The solid lines/dashed lines are the results under the mode of single-step contact tracing/iterative contact tracing respectively. The performance of the NCE ⩾ m policy is measured by the relative effect over random prevention (please see text for definition). Three levels of threshold m are calculated: m = 1 (blue), m = 2 (orange), and m = 2 (green). As somewhat surprizing observation was that the fraction of contact traced under iterative contact tracing was actually found to besmaller than under the single-step contact tracing (S7 Fig). For each selected individual to contact trace there will of course be more (or at least not fewer) to contact trace under iterative contact tracing. However, iterative contact tracing has an even bigger negative effect on the spreading, such that the overall fraction (or number) being contact traced in the iterative setting is smaller than when applying single-step contact tracing.

Time-to-diagnosis is informative for contact tracing under cross-sectional sampling

When cross-sectional samples have been collected, phylogeny-guided prevention has no advantage over a random prevention strategy (Fig 11). Thus, to take advantage of the epidemic information captured by the pathogen phylogeny, continuous monitoring provides better public health prevention capacity. If only a cross-sectional sample is available, however, we find that targeting the most recently diagnosed individuals (MRD) for additional contact tracing performs better in preventing future spread than a random selection of the same number of individuals (Fig 11) or by selecting individuals to contact trace based on the NCE-value. At prevention fractions p = 0.2 and p = 0.5 (20% and 50% being contact traced, respectively), the relative effect of the MRD-strategy was constantly larger than random selection (of which the relative effect was set to 1) as well as that of the NCE-based strategy.

Fig 11

Relative effect of NCE-based strategy (blue lines) and MRD strategy (red lines) over random prevention under the situation of Cross-sectional prevention.

The fraction of contact traced is fixed as p = 0.2 (lines with ▽) and p = 0.5 (lines with ⋆).

Relative effect of NCE-based strategy (blue lines) and MRD strategy (red lines) over random prevention under the situation of Cross-sectional prevention.

The fraction of contact traced is fixed as p = 0.2 (lines with ▽) and p = 0.5 (lines with ⋆).

Discussion

In this study we have shown that when transmission heterogeneity exists, a phylogeny-guided prevention strategy is more efficient than randomly selecting cases for contact tracing in order to reduce the number of future infections. We first developed a general method to identify the level of transmission heterogeneity in a population, i.e. how much variation in the rate of transmission contacts that exist in a human population where a pathogen spreads. We then developed a phylogenetic measure, the NCE—the Number of Coalescent Events over a fixed time-interval ending at time of diagnosis—to identify which infected individuals to target for (additional) contact tracing. It was shown that, under the typical situation where pathogen samples are continuously collected (continuous monitoring), a phylogeny-guided prevention approach was clearly superior in effectively allocating public health resources to reduce future disease spread. We show that our heterogeneity detection method and phylogeny-guided approach work under a variety of realistic conditions. The estimates of heterogeneity and basic reproduction number are robust to assumptions about the susceptible population size and exact sampling fraction (the fraction of diagnosed individuals that are sequenced), both of which may be difficult to know exactly in a real epidemic. Also when changing the recovery rate of the infected individual, allowing autocorrelation in transmission rates, or adding heterogeneity through time to individual-level transmissibility, the estimate of heterogeneity is still less affected. Of particular importance is that our method also works when a pathogen accumulates genetic variation within each host. Such within-host diversity is significant in e.g. HIV infections. If not taken into account, tree measurements may significantly mislead epidemiological estimates by implicitly assuming identity between transmission history and pathogen phylogeny. It is well established that the between-host pathogen phylogeny will display nodes corresponding to the transmitted lineages biased backwards in time, as described by the pre-transmission interval [34], as well as potential lineage disordering relative to the transmission history [21]. While previously often ignored, the within-host diversity has recently been shown to have significant impact on inferences about pathogen epidemics, sometimes rendering results completely as compared to when within-host diversity was ignored [11, 21, 41, 42]. Here, we show that taking within-host diversity into account may, in fact, improve the prevention effect (Fig 9). When applying our method for detecting possible spread heterogeneity to three separate sub-epidemics from the larger Swedish HIV-1 epidemic it was found that the highest heterogeneity was inferred from the data that included the episodic IDU_AE outbreak. Interestingly, this outbreak did not show significant differences from the other sub-epidemics when analyzing Sackin’s index or R0 (which is also affected by the estimated diagnosis rate ). This suggests that our methodology is useful for identifying heterogeneous spread when typical tree balance or well-known epidemiological parameters fail. In some countries and settings where HIV transmission is criminalized there may be ethical limitations to the use of phylogenetics and other ways to identify clusters with active HIV transmission. We acknowledge this complication, but show that our approach could limit HIV transmissions in settings where such limitations do not exist. Our model treats heterogeneity as a single feature, while in real life there are several factors that may contribute to heterogeneous spread rates. It is possible that modeling one or several of these factors separately may be valuable. One such factor relates to social network structure, modeling heterogeneities in terms of contacts. Secondly, the current model assumes that the transmissibility of different individuals are independent. A (more complicated) alternative could be to assume that transmissibility of infectors and their cases are positively correlated, which would imply assortativity in sexual activity between sex-partners. Finally, individuals may also differ in terms of testing (i.e. diagnosis) rates; in our model all individuals were assumed to have equal rates of testing. For example, among MSM, individuals with risky behavior (high contact rates) may also test themselves more often. This simplifying assumption may lead to a systematic loss of sensitivity as sampling may not be complete and be biased towards some group of infected persons. Again, the heterogeneity in our model may be interpreted as a combination of all these and other aspects. In conclusion, with the increase of pathogen genetic data in public health databases, phylogeny-guided prevention will benefit public health efforts. As relevant epidemic information about differences in number of contacts is effectively “recorded” in a phylogeny by the evolutionary process of the pathogen, it only makes sense to use this information to the public health benefit. Hence, phylogeny-guided prevention results in more efficient epidemic control, which should reduce disease burden and costs to society.

Average branch length of the simulated tree under different conditions.

Average branch length of the simulated tree under varied basic reproduction number (x-axis) and varied levels of heterogeneity (y-axis). Under each condition, the simulation started from one infection and stopped when there were 100 recovered individuals. The average branch length was calculated as the total branch length of the reconstructed tree over the number of tips. The results were the average over 1000 simulations. (PDF) Click here for additional data file.

Inference of heterogeneity under various lengths of mean infectious period γ−1.

In each panel, the black line denotes the true value that was used to generate the simulated data. The colored curves are the means of the estimates under different levels of γ−1. The shaded area denotes 95% confidence interval estimated when γ−1 = 2.5. These results are obtained from 100 simulation replicates where the average transmissibility rate μλ was fixed as 1, the sequencing ratio ρ = 0.9, and the simulation stopped when there were 100 diagnosed individuals. The panel A is coefficient of variation (CVλ) and the panel B is standard deviation of infectivity rate (σλ). (PDF) Click here for additional data file.

Performance of parameter estimation under various lengths of mean infectious period γ−1.

In each panel, the colored curves are the means of relative biases (the bias over the true value) under different levels of γ−1. The shaded area denotes the relative bias of the 95% confidence interval estimated when γ−1 = 2.5. These results are obtained from 100 simulation replicates where the average transmissibility rate μλ was fixed as 1, the sequencing ratio ρ = 0.9, and the simulation stopped when there were 100 diagnosed individuals. A. basic reproduction number (R0), B. average infectivity rate (μλ), and C. recovery rate (γ). (PDF) Click here for additional data file.

Comparison of parameter estimation under constant/time-varying transmissibility.

The lines with circles “○” are the estimate under the situation of constant transmissibility, and the shaded area denote the 95% confidence interval from 100 simulations. The lines with triangles “▽” are the estimates under the situation with time-varying transmissibility (TVT). The comparison is performed under the situation of R0 = 2.5 and the heterogeneity (CVλ) varies from 0 to 5. Sample size n = 100 and simulation runs = 100. (PDF) Click here for additional data file.

Relative effect over random prevention of NCE strategy under the continuous monitoring scenario.

Four levels of threshold c have been calculated: m = 1 (blue), m = 2 (orange), m = 3 (green), and m = 4 (red). Results in (a) and (b) are the mean of 300 simulations. (PDF) Click here for additional data file.

Comparison of fraction of contact traced of NCE strategy under the situation of with/without within-host diversity (solid/dash lines respectively).

Three levels of threshold m have been calculated: m = 1 (blue), m = 2 (orange), and m = 3 (green). Results are the mean of 300 simulations. (PDF) Click here for additional data file.

Comparison of fraction of contact traced of NCE strategy under the situation of single-step/iterative contact tracing (solid/dash lines respectively).

Three levels of threshold m have been calculated: m = 1 (blue), m = 2 (orange), and m = 3 (green). Results are the mean of 300 simulations. (PDF) Click here for additional data file.

Computing times required for inferring heterogeneity from trees with varying numbers of tips.

Computing times (y-axis) required for processing trees with varying numbers of tips (each tip represents a diagnosed individual). For each number of tips (x-axis), we simulated trees with different levels of heterogeneity (i.e., CVλ = 1, 2, ⋯, 5). Each point represents the average computing time based on 10 simulations with given heterogeneity level and the size of tree. The number of tips was varied along the following sequence: 100, 200, 400, 1000, 2000, 4000. The dashed line represents the average computing time over all levels of heterogeneity. All runs were executed on an Intel Core i7 processor (Mac mini 2018). (PDF) Click here for additional data file.

Performance of new method and the Markov-modulated Poisson process (MMPP) based genetic clustering method on simulated data.

Virus genealogies were simulated under two scenarios: without autocorrelation in transmission rates (A and B) and with autocorrelation (C and D). In the former cases, each infected individual independently draws its transmission rate from a binomial distribution, i.e., choosing the slow rate λ1 with probability 1 − π = 0.9 and choosing the fast rate λ2 with probability π = 0.1. In (A), λ1 = 2 and λ2 = 6, corresponding to mean rate of μλ = 2.4 and a low level of heterogeneity (CVλ = 0.5). In (B)λ1 = 1 and λ2 = 15, corresponding to mean rate of μλ = 2.4 and a high level of heterogeneity (CVλ = 1.75). In the cases with autocorrelation in transmission rates, each newly infected individual switch its transmission rate with probability π and remains its infectee’s rate with probability 1 − π. In both (C) and (D), λ1 = 0.9 and λ2 = 8.1 with different switching probability, i.e., π = 0.2 in (C), corresponding to μλ = 5.22 and low level heterogeneity CVλ = 0.79, and π = 0.8 in (D) corresponding to μλ = 2.64 and low level heterogeneity CVλ = 1.27. And the diagnosis rate is fixed as γ = 1. The x- and y-axes correspond to the estimated μλ and the estimated CVλ respectively. Each point represents the outcome when analyzing one of 100 replicates. (PDF) Click here for additional data file.

Supporting information on models, simulation methods and the phylogenetic analysis of HIV data in Sweden.

Includes details on simulating the virus genealogy, how the correction for incomplete transmission chain were derived, the preprocess of the real epidemiological data from Sweden, how the evaluation of control measures was performed based on simulation, and the pseudo-code of the proposed algorithm. (PDF) Click here for additional data file. 6 Nov 2019 Dear Dr zhang, Thank you very much for submitting your manuscript 'Inferring transmission heterogeneity using virus genealogies: estimation and targeted prevention' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts. In addition, when you are ready to resubmit, please be prepared to provide the following: (1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors. (2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible to show clearly where changes have been made to their manuscript e.g. by highlighting text. (3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution. Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are: - Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition). - Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video. - Funding information in the 'Financial Disclosure' box in the online system. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here. We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us. Sincerely, Roger Dimitri Kouyos Associate Editor PLOS Computational Biology Virginia Pitzer Deputy Editor PLOS Computational Biology A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: This manuscript describes a new method to detect variation in transmission rates from the shape of the phylogenetic tree reconstructed from pathogen sequences. This quantity has public health relevance because it may enable one to optimize the allocation of prevention resources by focusing on those individuals with the highest probability of onward transmission, based on past dynamics. More specifically, the "heterogeneous birth-death" model presented by the authors is an extension of the standard birth-death model to accommodate variation in transmission rates among individuals through a random-effects approach, where these rates are drawn from a gamma distribution to be parameterized from the data. This random-effects approach is distinct from that of otherwise similar models, such as Kuhnert’s two-type birth-death model (PLOS Pathog 14(2): e1006895), Barido-Sottani’s multi-state birth-death model (J R Soc Interface 15(146)) or McCloskey’s pure-birth model (PLOS Comput Biol 13(11): e1005868), where lineages “evolve” between latent discrete branching rate states. Notably, none of these publications are cited by the authors. (See also a preprint from Volz et al.; bioRxiv, https://doi.org/10.1101/704528). Since the authors are presenting a new method, they should also evaluate their method alongside at least one of these other approaches that are generally designed to accomplish the same objective. Furthermore, the rates in the heterogeneous model are distributed among individuals in an uncorrelated fashion with respect to the transmission history, whereas the rates are autocorrelated in the two-type and multi-state models. This difference highlights the focus of this manuscript on the individual host, rather than groups in a risk-structured population. The assumption of uncorrelated transmission rates (absence of homophily by risk) needs to be justified (e.g., is it appropriate for HIV-1), as it is a distinctive feature of this model. The authors should also evaluate the impact of variation in diagnosis rates among individuals (rate parameter $\\gamma$) on their power to detect heterogeneity in transmission rates (more below). An important starting assumption of the model is that transmission unfolds in an infinite population of susceptible hosts, which supports the approximation that “transmission events occur independently” (line 164), for instance. The authors evaluate their model when this assumption is relaxed - however, this was carried out in the initial exponential growth phase of the epidemic (starting from a single infected individual to an average of 120 infections in a total population of 1000 hosts). This minimizes the effect of limited susceptibles, i.e., the “collision” of infected lineages, and it would be helpful to see the robustness of the model when the epidemic is at a more intermediate stage. Another, more problematic assumption of this model is that the expected time to diagnosis is the same across all individuals. Incomplete and inevitably biased sampling (diagnosis) is widely recognized as a severe confounding problem in measuring variation in transmission rates. For example, if individuals with higher transmission rates are also more likely to be lost to follow-up for sampling (sequencing after diagnosis), then the model (and most other models as well) will suffer from a systematic loss of sensitivity. This limitation should be addressed in the Discussion section. While the authors evaluated the assumption of incomplete sampling by allowing for transmission to unsampled individuals, they also assume that there is no onward transmission from unsampled to sampled individuals - viz., from the Supplementary Text S1: “Furthermore, we assume that the unsampled individuals have small level of transmissibility rates and that they caused no infection before being diagnosed.” This is an unrealistic assumption and especially problematic in the case of targeting specific individuals for public health intervention – to say nothing of the potential exposure of individuals to stigma and criminalization – and it should be described more transparently in the manuscript (for instance, around lines 221-223) and rationale given. The GitHub repository should incorporate the documentation in the Word document (which is a binary file) into the README in the conventional Markdown format. In addition, I found that the Python scripts do not adhere to PEP8 (e.g., lack of whitespace separating arguments) and there is a general lack of inline or block comments that would make it easier to review the code. Variables such as “mu_RecEst” need to be defined within the source code. There are multiple uses of undocumented magic numbers (e.g., 9999) where it is more conventional to use the Python Nonetype or math.inf. The module documentation needs to provide installation instructions, including module dependencies such as matplotlib and scipy. The tutorial is incomplete and should provide instructions for loading the module prior to calling functions such as “singleSimu”. In my small experiments with these scripts, I found that mean $\\lambda$ was substantially underestimated when setting $\\mu$ to 3.0 - the average estimate of 5 trials was 1.57 ($\\gamma$ set to 0.3). This may be masked by reporting accuracy for CV and not the mean, particularly when the variation parameter is underestimated the same way. Please provide results for these parameters individually. There also seems to be some interaction with $\\gamma$; if I set this parameter to 0.7 then the mean estimate of $\\mu$ increased to 1.99. This effect on parameter identifiability should be explored. I could not find details on run times of the analysis or estimates of time complexity. It took roughly 10 seconds to simulate and analyze a single tree with sample size 100 (default setting) on a rather outdated iMac (late 2013), and about 1 minute for a sample of 200, so I suspect that the complexity will be quadratic or more with sample size. An assessment of computing time should be incorporated into the revision. As developers and users of phylogenetic methods in the context of HIV-1, we are obligated to discuss the ethical implications of work that can be potentially used for source attribution (i.e., “who infected whom”). Specifically, this study proposes to identify individuals who are associated with higher numbers of coalescent events, which can be loosely interpreted as transmission events, as targets for public health intervention. In light of ongoing criminalization of HIV in many countries worldwide, and where punishments are often disproportionately severe, it is inevitable that methods such as the one presented by the authors here will also be used in the prosecution of individuals for transmission without disclosure. The authors should address this problem in the discussion section of their revised manuscript. Overall, I found this an interesting manuscript. It is written well and I found the model description mostly straight-forward. As discussed above, there are a number of potentially (or almost certainly) problematic model assumptions that need to be explored further, and the authors should address other work in this area and perform some comparison of methods. - line 64: Is there any empirical motivation for using the gamma distribution to model variation in transmission rates among individuals, other than this being a relatively flexible distribution for continuous non-negative values? - Minor point (line 94): I prefer to reserve the term “genealogy” for describing the ancestral relationships among individuals in a population, and “phylogeny” for the ancestral relationships among populations. Thus the variable G represents a phylogeny relating infections sampled from different hosts. However, I also recognize that others refer to this tree as a “genealogy” or use these terms interchangeably. - line 102: “… or in relative terms as the coefficient of variation CV_\\lambda = \\frac{\\sigma_\\lambda}{\\mu_\\lambda}.” For readers who are less familiar with the CV, it would be helpful to note that this is a dimensionless quantity where standard deviation, as a measure of dispersion, is normalized by the mean that facilitates comparisons across data sets. - lines 165-174: This section would greatly benefit from a diagram explaining the relationships among d_i, e_i, l_b, etc., as there is a substantial amount of notation being introduced here. - line 369: Please provide methodological details on the BEAST analysis, i.e., model assumptions, prior distributions, assessment of convergence. - Text S4: “During this simulation, when there is a newly diagnosed individual, the corresponding NCE value is calculated and the decision of contact tracing has been made based on the NCE value.” Is there a delay in this simulation for the time between diagnosis and sampling (sequencing), which would be required to compute the NCE value associated with that individual? - Figure 1: The notation in the figure does not correspond to the figure legend. For example, “the i-th individual was infected at t_i” – neither symbol appears in any of the figures. It seems that the authors switch between Roman numeral notation for branches and Arabic numerals for the time axis. - Figure 2. “The internal branches with the same label represent the infectious period of a particular individual.” But what if the start of the infectious period does not coincide with the branching point in the virus genealogy? This statement seems to assume that there is no within-host variation and that individuals are infectious immediately following transmission. - Figure 4. The data series for D=100 and \\rho_{SD} = 1.0 should be identical between plots A and B, but they do not seem to be. Reviewer #2: With this manuscript the authors propose a method to estimate the amount of transmission heterogeneity based on virus sequence data in a sampled epidemic and suggest useful applications for the method. While the method is very interesting and the implementation of heterogeneity is simple but novel, the range of simulated situations is rather limited, e.g. it was only tested for a single R0 value (2.5). Hence, I suggest to ensure the method is indeed applicable to a wide range of situations. Further comments and more detailed suggestions for simulation scenarios: ll 115-117: how does that compare to the model presented here? l 260: please specify how the R0=2.5 is derived from the estimates in [22] (which was 2.29 and 2.92, with and without rate heterogeneity among subepidemics resp). ll 282ff & Fig 3: the upward vs downward bias changes at CV~SQRT(2) for the 3 parameters R0, mu, gamma. and the same is seen in Fig 5. I doubt this is a coincidence, but that it has to do with the characteristics of the gamma function employed to model heterogeneity (if I understand correctly CV_lambda is the inverse squareroot of the shape parameter, is that right?). the authors should explore this and discuss how this may be used to assess in which direction they expect bias to occur. simulations: - what if sampling is delayed? - this would probably make the inference harder but would be more realistic. I suggest to add a simulation scenario to test that. real data analyses (page 10 / Fig 7): - please provide more information on the real datasets, at least the sampling period and number of samples - please perform additional simulations that test significance of the real dataset analyses, i.e. (i) a slow outbreak with high heterogeneity and (ii) a fast outbreak with low heterogeneity, and compare the results to each other and the real data analyses resp. - why was a coalescent skyline used in the BEAST analyses? the tree prior may impact the tree characteristics esp. in the MSM_B dataset, which appears quite small, such that the coalescent skyline may be "overfitted" - I would guess that the low gamma estimate in IDU_B is an artefact of the root of the tree ranging much further back than the oldest sample. this translates to a very long infectious period for the first cases, but is more likely a bias in the sampling process. can the model account for that by letting the sampling process start later? phylogeny guided contact tracing: - how robust is the NCE to biased sampling processes? parts of the tree that are sampled more heavily will have higher NCEs. - this needs to be tested using simulations - applying heterogenous sampling schemes - with large enough datasets a more complex birth-death model may be used to infer such differences, which could then be used to adjust the NCE accordingly. ll 452-453 "the effect of within host diversity is to make short branches only slightly longer whereas longer branches are increased more due to within host diversity" - why is that? Fig 8 & ll 491ff are 20% and 50% realistic proportions for contact tracing? please provide some references for that - Fig 8 would be easier to read if the authors used a colour gradient for the fraction contact traced ll 509-510: this has only been tested for very specific cases. please phrase this specifically referring to what you've tested, it's an unjustified generalisation Discussion: there may also be heterogeneity through time, not only among individuals, particularly regarding the time to diagnosis Parts of the supplement are confusing, e.g. the last sentence on page 3, most of Section 4.2 etc. Reviewer #3: This manuscript introduces a new method to estimate transmission heterogeneity from viral sequence data. It is an important question and a successful such method is appropriate for publication in PLOS CB. I have some concerns about the method and its assumptions. Line 72 I don’t understand why you can say that the ratio of sequenced diagnosed over all infected (c /(a+b+c)) is the sampling ratio (c/(b+c)) unless a=0. Please rephrase, as this obviously isn’t what you mean. Line 86 as you later clarify, you still can’t say G=T even when there is no in-host diversity, because T needs information about which host internal nodes are in (ie direction of transmission). A birth-death model with constant rates (even rates that vary between individuals) is a very poor match for the within-host natural history of HIV infection, which has an early high-transmission phase, followed by a long chronic infection phase, and then followed by progression to AIDS (in the absence of treatment). With treatment, the assumption that transmission ceases is good in some settings (with great treatment adherence and little resistance, for example), and poorer in others. But in pretty much all settings, assuming a constant infectivity for 2.5 years is not a great model; this needs to be discussed. The simulation study could explore model mis-specification in this direction. Line 140 surely the likelihood is proportional to, not equal to, as k and theta have moved across the conditioning. Does * mean multiplication in Eq (1), not convolution? What is gamma^1_DG? - Ah, in line 143 you state that the indicator is for whether the infectious period ended before the observation period; surely the probability of this depends on when the infectious period started. Pairwise transmission models have a long history and are well-liked in the community, but I have always wondered what the independence assumption means. Clearly if some individual A is in the dataset, then someone infected A (or A is the index case). If J infected A, then B did not infect A, nor C, D, …, because A can only have one infector. When we assume independence of all the transmission events, we cannot take these constraints into account. Line 153 i don’t know what $(d_i, x_i)_{i=1}^n G$ is. Line 163 - why would eq (1) be valid for a partially reconstructed transmission history? Or, what does that validity mean, given that the partially sampled history does not contain the information required to use (1) as it does not contain d_i and x_i (since it’s partially sampled)? Line 183 - what does “each slice consists of branches having the same maximum number of branches to the tips” mean? I like the observation in line 186 that the sum of the infectious periods has to add up to the total branch length; this applies to other models relating transmission to inference (phybreak, beastlier etc also) as far as I can tell. Please state exactly what the CA algorithm does - alternating between relabelling and re-estimating sounds good but please give more details; pseudo-code. It’s the section on analysis of the sampled genealogy where I really don’t follow and begin to doubt the modeling. First - is $t$ counted from t=0 at the root of the tree, or back into the past from the tips (and if so, which one)? With no in-host diversity, are you assuming that the external branches represent time from infection to sampling? Why use L_p in the way that you do? Or do you mean the heights of the tips, in line 196? You split the tree into the part before time t-Lp. There you estimate the sampling ratio; then you apply the CA algorithm.; then you correct the parameter estimates for Lp, and then “the results under different values of p are averaged” -- again, please clarify; give pseudo-code if possible. I really don’t see why the sampling ratio of the G(s) (t-Lp) is rho_SD *p . You have *set* p, so p could be whatever you choose. How does Lp provide an estimation about the p percentile for the distribution of infectious periods, and how does this mean that you know the sampling ratio of G(s)(t-Lp)? Also, if the sequencing ratio (and sampling fraction) is fixed over time, then you know from the times of sequenced cases quite a lot about the system. Line 220 what is “upscaled” here? Line 244 why should taxa with higher NCE values be more likely to be phylogenetically linked to unsampled highly infectious individuals? Does this happen in your simulations? I might have thought that in densely-branching parts of the phylogeny, sampling is likely to be more dense, not less. R0=2.5 seems a high estimate for HIV in Switzerland; most estimates for developed countries without generalised epidemics are less than 1. How robust is the method to model mis-specification? As above I would be most concerned about the non-constant transmission rate. I think the main text should state more about the simulation method, as it is inconvenient to have to go and dig it out of a supplement; in particular, the fact that you simulate trees rather than sequences is quite important as it means that you cannot see how robust the method is to tree uncertainty; given that you required timed trees, there is also the potential for the nuances of timed tree reconstruction to contribute to bias or noise in the estimates. Similarly, the main text should have information about how the within-host diversity was simulated (perhaps with a figure of a tree from such a simulation). There are various plots with “coefficient of variation” as the title and throughout there are plot titles but no y axis labels, which left me confused. Fig 6 - CV_lambda is on the x axis; what is y. I can’t tell from the caption what the vertical axis is showing. In Fig 3, CV_\\lambda is defined as “levels of heterogeneity” - what is it, in terms of the parameters of the simulation model? Figs 9, 10 what is on the y axis? The title “comparison of relative effect” doesn’t tell us much Supplement S2 Why is the adjusted transmissibility just lambda_i/rho ? ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: None Reviewer #2: No: The authors state that "The alignments used for this study are available from the authors upon request" - I believe the code, analysis and alignment files should be published with the study, or a justification otherwise is necessary. Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Art Poon Reviewer #2: Yes: Denise Kühnert Reviewer #3: No 23 Feb 2020 Submitted filename: Re to Reviewer0204.docx Click here for additional data file. 27 May 2020 Dear Dr. zhang, Thank you very much for submitting your manuscript "Inferring transmission heterogeneity using virus genealogies: estimation and targeted prevention" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Roger Dimitri Kouyos Associate Editor PLOS Computational Biology Virginia Pitzer Deputy Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: I thank the authors for their work in revising the manuscript and providing some documentation for their source code on the GitHub repository. There are a number of remaining issues that need to be corrected before this manuscript can be considered for publication. Introduction, lines 24-34: The authors’ characterization of the multi-state birth-death (MSBD) model is incorrect. The MSBD model does not require the positions and times of state changes in the phylogeny to be known in advance — instead, the optimization algorithm adds and removes state changes in the phylogeny. Results, lines 364-384: Your comparison of the uncorrelated model to the MMPP model is problematic because you have simulated transmission rates as being drawn individually (in an uncorrelated fashion) from a gamma distribution. This violates a central assumption of the MMPP, which is premised on the existence of correlated rates (which is how clusters are identified). It is therefore not at all surprising that your model performs well on data simulated under the same model, and MMPP performs poorly when its core assumption is broken. The honest approach would be to evaluate the methods under both uncorrelated and correlated scenarios, recognizing that MMPP should be unable to detect variation in transmission rates in the uncorrelated scenario. In addition, MMPP does not attempt to estimate R0 — it is misleading to present results (Figure S9) that indicate otherwise. Introduction, lines 45-52: The addition of this text helps clarify that the authors’ model assumes that variation in transmission rates are uncorrelated (independent draws from some parametric distribution). However, the rationale provided by the authors — “varying degrees of social activity” — implies auto-correlation in transmission rates, since risk behaviours exhibit a high degree of homophily. Varying viral loads is time-heterogeneous within individuals, whereas your model assumes rates are time-invariant at the level of individuals. Thus, the rationale for your model remains unclear. Methods, lines 261-280: The addition of this section helps clarify the core assumptions of the authors’ model. Another way to put this is that the authors assume that, in the setting of high variation in transmission rates, all “superspreader” individuals have been sampled and diagnosed. The authors need to reconcile this assumption with another central assumption of their model, that the sampling and diagnosis of cases from the epidemic is occurring at an early exponential phase of the epidemic. (This would involve a tremendous sampling effort because most individuals will be uninfected when sampled.) For equation (9), I can see that the authors intend to scale the constant transmission rate for unsampled individuals from \\mu_lambda^0 at CV=0 to \\gamma as CV\\leftarrow\\infty. I cannot reconcile this with the statement at the top of this paragraph that \\lambda _{us} = \\mu_lambda^0 / \\rho. It is also important to clarify for the reader that there is no variation in transmission rates among unsampled individuals under any level of CV in this model. ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: None ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see 8 Jun 2020 Submitted filename: re2reviewers.docx Click here for additional data file. 2 Jul 2020 Dear Dr. zhang, We are pleased to inform you that your manuscript 'Inferring transmission heterogeneity using virus genealogies: estimation and targeted prevention' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. In addition, the reviewer pointed out a few typographical errors that you may want to address. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Roger Dimitri Kouyos Associate Editor PLOS Computational Biology Virginia Pitzer Deputy Editor PLOS Computational Biology *********************************************************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: Thank you for addressing my comments with this revised manuscript, which is substantially improved. Regarding the authors' assertion that "When A was unsampled while B and C were sampled, the coalescent event of B and C in the sampled tree (with tips B and C) is coincide with that of A and B in the full tree (with tips A~C)." -- I think this requires a strong assumption about the coalescence of lineages within hosts. While I agree that lineages are more likely to be traced back to a "superspreader" host, there is no guarantee that the coalescent events involving different sampled hosts will coincide (especially if the transmission bottleneck is incomplete). This is a minor point, however. I only have some typographical or stylistic errors to report that I would leave to the authors' discretion to amend. line 396: "it is clearly that" should be "it is clear that" line 400: dropped backslash resulted in exposed "mu_\\lambda". lines 396-403: it would be helpful to note the number of replicate simulations evaluated here Figure S9: "Virus genealogies were simulated under two scenarios: with autocorrelation in transmission rates (A and B) and without autocorrelation (C and D)." -- according to the figure titles, this is backwards; A and B are *without* autocorrelation. ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: None ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No 26 Aug 2020 PCOMPBIOL-D-19-01410R2 Inferring transmission heterogeneity using virus genealogies: estimation and targeted prevention Dear Dr zhang, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Matt Lyles PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

41 in total

1. Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis.

Authors: T Leitner; D Escanilla; C Franzén; M Uhlén; J Albert
Journal: Proc Natl Acad Sci U S A Date: 1996-10-01 Impact factor: 11.205

2. Rates of coalescence for common epidemiological models at equilibrium.

Authors: Katia Koelle; David A Rasmussen
Journal: J R Soc Interface Date: 2011-09-15 Impact factor: 4.118

3. Heterogeneities in the transmission of infectious agents: implications for the design of control programs.

Authors: M E Woolhouse; C Dye; J F Etard; T Smith; J D Charlwood; G P Garnett; P Hagan; J L Hii; P D Ndhlovu; R J Quinnell; C H Watts; S K Chandiwana; R M Anderson
Journal: Proc Natl Acad Sci U S A Date: 1997-01-07 Impact factor: 11.205

4. Rates of HIV-1 transmission per coital act, by stage of HIV-1 infection, in Rakai, Uganda.

Authors: Maria J Wawer; Ronald H Gray; Nelson K Sewankambo; David Serwadda; Xianbin Li; Oliver Laeyendecker; Noah Kiwanuka; Godfrey Kigozi; Mohammed Kiddugavu; Thomas Lutalo; Fred Nalugoda; Fred Wabwire-Mangen; Mary P Meehan; Thomas C Quinn
Journal: J Infect Dis Date: 2005-03-30 Impact factor: 5.226

Review 5. Acute HIV-1 Infection.

Authors: Myron S Cohen; George M Shaw; Andrew J McMichael; Barton F Haynes
Journal: N Engl J Med Date: 2011-05-19 Impact factor: 91.245

6. The effectiveness of contact tracing in emerging epidemics.

Authors: Don Klinkenberg; Christophe Fraser; Hans Heesterbeek
Journal: PLoS One Date: 2006-12-20 Impact factor: 3.240

7. HIV-1 transmission between MSM and heterosexuals, and increasing proportions of circulating recombinant forms in the Nordic Countries.

Authors: Joakim Esbjörnsson; Mattias Mild; Anne Audelin; Jannik Fonager; Helena Skar; Louise Bruun Jørgensen; Kirsi Liitsola; Per Björkman; Göran Bratt; Magnus Gisslén; Anders Sönnerborg; Claus Nielsen; Patrik Medstrand; Jan Albert
Journal: Virus Evol Date: 2016-04-27

8. Social and Genetic Networks of HIV-1 Transmission in New York City.

Authors: Joel O Wertheim; Sergei L Kosakovsky Pond; Lisa A Forgione; Sanjay R Mehta; Ben Murrell; Sharmila Shah; Davey M Smith; Konrad Scheffler; Lucia V Torian
Journal: PLoS Pathog Date: 2017-01-09 Impact factor: 6.823

9. Phylodynamics on local sexual contact networks.

Authors: David A Rasmussen; Roger Kouyos; Huldrych F Günthard; Tanja Stadler
Journal: PLoS Comput Biol Date: 2017-03-28 Impact factor: 4.475

10. Mapping HIV prevalence in sub-Saharan Africa between 2000 and 2017.

Authors: Laura Dwyer-Lindgren; Michael A Cork; Amber Sligar; Krista M Steuben; Kate F Wilson; Naomi R Provost; Benjamin K Mayala; John D VanderHeide; Michael L Collison; Jason B Hall; Molly H Biehl; Austin Carter; Tahvi Frank; Dirk Douwes-Schultz; Roy Burstein; Daniel C Casey; Aniruddha Deshpande; Lucas Earl; Charbel El Bcheraoui; Tamer H Farag; Nathaniel J Henry; Damaris Kinyoki; Laurie B Marczak; Molly R Nixon; Aaron Osgood-Zimmerman; David Pigott; Robert C Reiner; Jennifer M Ross; Lauren E Schaeffer; David L Smith; Nicole Davis Weaver; Kirsten E Wiens; Jeffrey W Eaton; Jessica E Justman; Alex Opio; Benn Sartorius; Frank Tanser; Njeri Wabiri; Peter Piot; Christopher J L Murray; Simon I Hay
Journal: Nature Date: 2019-05-15 Impact factor: 49.962

2 in total

1. Combining biomarker and virus phylogenetic models improves HIV-1 epidemiological source identification.

Authors: Erik Lundgren; Ethan Romero-Severson; Jan Albert; Thomas Leitner
Journal: PLoS Comput Biol Date: 2022-08-26 Impact factor: 4.779

2. What Should Health Departments Do with HIV Sequence Data?

Authors: Ethan Romero-Severson; Arshan Nasir; Thomas Leitner
Journal: Viruses Date: 2020-09-12 Impact factor: 5.048

2 in total