
Exact vs. approximate computation: reconciling different estimates of Mycobacterium tuberculosis epidemiological parameters.

R Zachariah Aandahl, Tanja Stadler, Scott A Sisson, Mark M Tanaka.

Abstract

Exact computational methods for inference in population genetics are intuitively preferable to approximate analyses. We reconcile two starkly different estimates of the reproductive number of tuberculosis from previous studies that used the same genotyping data and underlying model. This demonstrates the value of approximate analyses in validating exact methods.


Keywords:  IS6110; Mycobacterium tuberculosis; approximate Bayesian computation; reproductive number; summary statistics


Year:  2014        PMID: 24496011      PMCID: PMC3982679          DOI: 10.1534/genetics.113.158808

Source DB:  PubMed          Journal:  Genetics        ISSN: 0016-6731            Impact factor:   4.562


Two previous methods for analyzing Mycobacterium tuberculosis infection and evolution produced conflicting estimates of the effective reproductive number, R. Tanaka et al. (2006) used approximate Bayesian computation (ABC) (Beaumont 2010; Csilléry et al. 2010) with two summary statistics to estimate this parameter using data from San Francisco (Small et al. 1994), yielding R = 3.4 (95% C.I. 1.4, 79.7). Stadler (2011) derived an exact likelihood to analyze the same data within a Bayesian framework, giving the estimate R = 1.02 (95% C.I. 1.01, 1.04). If this discrepancy is due to the approximation in ABC methods, it would call into question the reliability of ABC in other studies using similar summary statistics and models. We therefore investigate and resolve this discrepancy here.
In both methods, the underlying process is a continuous-time birth–death process with mutations occurring (at rate θ per infection per year) according to the infinite alleles assumption. A birth event corresponds to a transmission event of tuberculosis (with rate λ per infection per year), while a death event represents death or recovery (with rate μ per infection per year).
Under the method of Tanaka et al. (2006) (henceforth “ABC06 method”), inference is performed using ABC implemented with Markov chain Monte Carlo (MCMC) (Marjoram et al. 2003; Sisson and Fan 2011). The process is simulated from a single infectious individual until either extinction occurs or the infectious population reaches a size N, at which point a sample of size n is taken. Two summary statistics are computed: the number of distinct genotypes in the sample and the virtual heterozygosity or gene diversity. A distance between observed and simulated statistics is computed to assess whether a parameter set should be accepted, leading to an approximate posterior parameter distribution.
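The simulation step and the two summary statistics described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `simulate_sample` and `summary_stats` are hypothetical, and gene diversity is taken here as 1 − Σ(n_i/n)², which is an assumption about the exact definition used. Because the process stops when the population first reaches N, only the sequence of events matters, so event times are not simulated.

```python
import random
from collections import Counter


def simulate_sample(lam, mu, theta, N, n, rng):
    """Birth-death-mutation process under the infinite alleles assumption.

    Starts from one infectious case. Returns a list of n sampled genotype
    labels once the infectious population reaches N, or None on extinction.
    """
    counts = {0: 1}      # genotype label -> number of infectious cases
    size = 1
    next_label = 1
    total = lam + mu + theta
    while 0 < size < N:
        # Every case has the same total rate, so the next event hits a
        # uniformly chosen case; find the genotype that case belongs to.
        r = rng.randrange(size)
        for g, k in counts.items():
            if r < k:
                break
            r -= k
        u = rng.random() * total
        if u < lam:                  # transmission: one more case of genotype g
            counts[g] += 1
            size += 1
        elif u < lam + mu:           # death or recovery
            counts[g] -= 1
            size -= 1
        else:                        # mutation: the case acquires a new genotype
            counts[g] -= 1
            counts[next_label] = 1
            next_label += 1
        if counts[g] == 0:
            del counts[g]
    if size == 0:
        return None
    population = [g for g, k in counts.items() for _ in range(k)]
    return rng.sample(population, n)   # sample n cases without replacement


def summary_stats(sample):
    """Number of distinct genotypes and gene diversity (assumed definition)."""
    n = len(sample)
    sizes = Counter(sample).values()
    diversity = 1.0 - sum((k / n) ** 2 for k in sizes)
    return len(sizes), diversity
```

In an ABC scheme, the pair returned by `summary_stats` for a simulated sample would be compared with the same pair computed on the observed isolates through some distance; the choice of distance and tolerance is not specified here.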
The method of Stadler (2011) (henceforth “Tree11 method”) derives an expression for the likelihood of a transmission tree with associated mutations, giving rise to a sample of genotypes (Equation 3 in Stadler 2011). It is assumed that the epidemic started at a random time t0 in the past and that each presently infected individual is included in the genotype sample with probability ρ = n/N. MCMC is used to explore the space of parameters and obtain a Bayesian posterior parameter distribution.
We highlight here that the ABC06 and Tree11 methods rely on the same model, up to the length of the epidemic and the exact sampling procedure. The ABC06 method assumes the epidemic spreads until N individuals are infected and then n isolates are taken. The Tree11 method assumes that the epidemic starts at a random time in the past, and an isolate is sampled from an individual with probability ρ.
Stadler (2011) proposed that the discrepancy between the methods was due to a loss of information from the data when using nonsufficient summary statistics in ABC06. Here, we assess the choice of summary statistics in Tanaka et al. (2006) by comparing the ABC method against the exact likelihood of observing data sampled from a population using the same simulation process as the ABC method. We call this the “Exact method.” Following Stadler (2011), who showed that the mutation rate, θ, cannot be estimated from snapshot genotyping data, we fix θ = 0.198. We also found that ABC06 with uninformative priors for the correlated parameters λ and μ consistently leads to similar estimates of R, regardless of the parameters used to simulate the data. We were able to rectify this problem either by setting μ to a constant or by using an informed prior (we call this form the “ABC method”). Here, we fix μ = 0.52 as the sum of estimates of the rates of self cure, death from causes other than tuberculosis, and death from untreated tuberculosis (Cohen and Murray 2004; Luciani et al. 2009). The Exact method is as follows.
Define the observed data $s_0$ as a sample of isolates of size $n$, $c$ as the number of distinct genotypes in $s_0$, and $n_i$ as the number of instances of genotype $i$ in $s_0$, so that $\sum_{i=1}^{c} n_i = n$. Let $s$ be the unobserved population of size $N$, $G$ the number of distinct genotypes in $s$, and $X_i$ the number of instances of genotype $i$ in $s$, so that $\sum_{i=1}^{G} X_i = N$. The posterior distribution of the effective reproductive number, $R = \lambda/\mu$, given $s_0$, is
$$\pi(R \mid s_0) \propto \sum_{s} \pi(s_0 \mid s)\,\pi(s \mid R)\,\pi(R). \tag{1}$$
Conditional on $G \ge c$, we define the set $\mathcal{P}$ as all of the $c$-sized subsets of $\{1, 2, \ldots, G\}$ and $p(i)$ as the $i$th value of subset $p$ in $\mathcal{P}$. The probability that $s_0$ came from $s$ is
$$\pi(s_0 \mid s) = \binom{N}{n}^{-1} \sum_{p \in \mathcal{P}} \prod_{i=1}^{c} \binom{X_{p(i)}}{n_i}.$$
We used Equation 1 to sample from $\pi(R, s \mid s_0)$ and estimate $\pi(R \mid s_0)$ for each of 100 simulated data sets generated from a known value of R and used standard MCMC methods. We compared the resulting posterior distributions to those obtained using the ABC and Tree11 methods via a two-sample Kolmogorov–Smirnov test, based on posterior samples of size 100. Box plots of the resulting P-values (Figure 1A) indicate that the posteriors from the ABC method are similar to those from the Exact approach, while the posteriors from the Tree11 method are clearly different in each case. More precisely, we found that posteriors estimated using the ABC method were centered on the true, known values of R, but those estimated using the Tree11 method were shifted to the left (e.g., Figure 1, B–E). We identified two problems that affect inference when using the model from Stadler (2011).
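The probability that an observed sample came from a given population, as used in the Exact method, can be illustrated by brute force on a small example. The sketch below is an assumption-laden reading rather than the paper's code: the hypothetical function `sample_probability` treats genotypes in the sample as unlabelled, enumerates every assignment of observed clusters to population genotypes, and divides out the double counting that arises when two observed clusters have the same size. It is only practical for small G.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import comb, factorial, prod


def sample_probability(cluster_sizes, population_counts):
    """P(sample of size n drawn without replacement from the population
    shows exactly the observed cluster sizes), genotypes unlabelled.

    cluster_sizes:      observed counts n_1..n_c (sum n)
    population_counts:  population counts X_1..X_G (sum N)
    """
    n, N = sum(cluster_sizes), sum(population_counts)
    c, G = len(cluster_sizes), len(population_counts)
    if c > G or n > N:
        return Fraction(0)
    total = 0
    # Each p maps observed cluster i to population genotype p[i].
    for p in permutations(range(G), c):
        total += prod(
            comb(population_counts[p[i]], cluster_sizes[i]) for i in range(c)
        )
    # Swapping two observed clusters of equal size yields the same outcome,
    # so each distinct outcome was counted prod(m!) times.
    symmetry = prod(factorial(m) for m in Counter(cluster_sizes).values())
    return Fraction(total, symmetry * comb(N, n))
```

For a population with genotype counts (2, 1) and a sample of size 2, the two possible observations are a single cluster of size 2 and two singletons; a useful sanity check is that their probabilities add up to one.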
Figure 1

Estimation of the effective reproductive number R using the (corrected) ABC, Exact, Tree11, and Tree methods. In all analyses, θ = 0.198. (A) Boxplots of 100 replicates of P-values from two-sample Kolmogorov–Smirnov tests, comparing the posterior distribution of R under the ABC, Tree11, and Tree methods with the Exact method. Each replicate P-value was based on data generated with R = 4, μ = 0.52, and ρ = 0.1. (B) Estimates of the posterior distribution of R using the ABC, Tree11, and Tree methods, based on simulated data with R = 1.60 (indicated by the vertical dashed line), μ = 0.52, and ρ = 0.05. (C) As for B, but using R = 1.55, μ = 0.34. (D) As for B, but using R = 1.48, μ = 0.62. (E) As for B, but using R = 1.89, μ = 0.75. (F) As for B, but using data from Small et al. (1994), and with the prior μ ∼ N(0.52, 0.004167).

First, f(· | t0) (cf. Stadler 2011, p. 666) gives the probability of an oriented tree, while the sampler operates on vectors of branching times, v (one vector per genotype). To correct this we derived the distribution of the vectors v | t0. We calculated the probability of a labeled tree, summed over all within-genotype labeled trees, and summed over the number of ways (m) in which a genotype cluster (i) may connect to a tree to obtain the corrected density [display equation not preserved in the source].
Second, we found that the state of the MCMC sampler would become trapped in local maxima due to an inefficient proposal distribution. To address this, we modified the proposal to uniformly resample the genotype cluster vectors of branching times at each stage of the algorithm. We refer to this adjusted form of the Tree11 approach as the Tree method.
We tested the accuracy of the ABC, Tree11, and Tree methods by computing the posterior distribution for R based on data generated from TreeSim (Stadler 2010) with an infinite alleles model of mutation. We then calculated the mean squared error (MSE) of the resulting posteriors compared to the true value of R.
Table 1 presents the mean MSE and standard errors for each method based on 10 replicate data sets. An example of the posterior distributions resulting from one of the replicated data sets is shown in Figure 1B. Additional posterior distributions using different parameter combinations are shown in Figure 1, C–E. Very clearly, the ABC and Tree methods perform similarly well, and both outperform the Tree11 method (see also Figure 1A).
Table 1

Average mean squared error (MSE) estimates of the posterior distribution of R, based on 10 replicate analyses, using data simulated with θ = 0.198, N = 5000

Method    Mean MSE        SE of mean MSE
ABC       14.6 × 10^-3    3.0 × 10^-3
Tree11    86.9 × 10^-3    39.3 × 10^-3
Tree      13.9 × 10^-3    4.4 × 10^-3

The parameter μ for each of the 10 tests was chosen uniformly between 0.3 and 8, and R was chosen uniformly between 1 and 2.
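The quantities reported in Table 1 can be computed along the following lines. This is a minimal sketch under stated assumptions: the MSE of a posterior is taken to be the average squared deviation of posterior draws from the true R, and the standard error is the sample standard deviation of the per-replicate MSEs divided by the square root of the number of replicates. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def posterior_mse(posterior_samples, true_r):
    """Mean squared error of posterior draws of R around the true value."""
    return mean([(r - true_r) ** 2 for r in posterior_samples])


def mean_and_se(values):
    """Mean of per-replicate values and the standard error of that mean."""
    return mean(values), stdev(values) / sqrt(len(values))
```

With 10 replicate data sets, `posterior_mse` would be called once per replicate and the resulting 10 values passed to `mean_and_se` to give one row of the table.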

Finally, we reanalyzed the observed data taken from the IS6110 isolates in San Francisco in Small et al. (1994), fixing the mutation rate at θ = 0.198 and using the Gaussian prior μ ∼ N(0.52, σ² = 0.0125/3) for the death/recovery rate. The prior standard deviation corresponds to the standard deviation of the triangular distribution used in Dye and Espinal (2001). Figure 1F shows the resulting posterior distributions of R using the ABC, Tree11, and Tree methods. The original Tanaka et al. (2006) estimate using the unmodified ABC method, trying to estimate all parameters, is R = 3.4 (95% C.I. 1.4, 79.7). The estimate from the model from Stadler (2011) is R = 1.63 (95% C.I. 1.32, 1.94). However, using the corrected methods, the estimate using the ABC method is R = 2.10 (95% C.I. 1.54, 2.66), and the estimate using the Tree method is R = 2.05 (95% C.I. 1.55, 2.53). The point estimates and credible intervals from the posteriors from the ABC and the Tree method are in close agreement.
We have shown that the ABC analysis of Tanaka et al. (2006), based on the method of Marjoram et al. (2003), is valid as long as an informative prior is used for two of the parameters (here, the mutation rate θ and the death and recovery rate μ). The modified priors eliminate potential problems in the ABC and Tree approaches due to the strong correlation between μ and λ. This correction addresses the concern raised by Stadler (2011); that is, there is no substantial loss of information through the choice of summary statistics in the ABC method. Finally, we have improved the method of Stadler (2011) by modifying the mechanism of proposing new trees within the MCMC sampler to prevent it from converging to local maxima. In combination, these adjustments have reconciled the discrepancies between Tanaka et al. (2006) and Stadler (2011); the methods now perform equivalently.
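As a quick arithmetic check on the Gaussian prior for μ used in the reanalysis (a worked calculation, not taken from the paper's own derivation), the stated variance and the implied standard deviation are:

```latex
\sigma^{2} = \frac{0.0125}{3} \approx 0.004167,
\qquad
\sigma = \sqrt{0.004167} \approx 0.0645 .
```

The variance agrees with the value 0.004167 quoted for the prior in the caption of Figure 1F.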
Exact likelihood methods such as that of Stadler (2011) are generally preferable to ABC, which is an approximate inferential procedure. Here, however, we have demonstrated the value of using approximate methods to validate exact computational methods based on models with high-dimensional latent variables. For this setting, the ABC method has similar accuracy to and better computational efficiency than the Tree method. A further advantage of the ABC method is that it can easily be extended to more complex models. Recent work generalizing the coalescent to incorporate SIR dynamics (Volz et al. 2009; Rasmussen et al. 2011) presents a promising alternative approach for estimating parameters from genetic data under more realistic epidemiological models. Comparison of the coalescent SIS approach to fully stochastic models has been addressed elsewhere (Leventhal et al.) and would be an important issue to explore further in the future.
  11 in total

1.  Markov chain Monte Carlo without likelihoods.

Authors:  Paul Marjoram; John Molitor; Vincent Plagnol; Simon Tavare
Journal:  Proc Natl Acad Sci U S A       Date:  2003-12-08       Impact factor: 11.205

2.  Will tuberculosis become resistant to all antibiotics?

Authors:  C Dye; M A Espinal
Journal:  Proc Biol Sci       Date:  2001-01-07       Impact factor: 5.349

3.  Approximate Bayesian Computation (ABC) in practice.

Authors:  Katalin Csilléry; Michael G B Blum; Oscar E Gaggiotti; Olivier François
Journal:  Trends Ecol Evol       Date:  2010-05-18       Impact factor: 17.712

4.  Using approximate Bayesian computation to estimate tuberculosis transmission parameters from genotype data.

Authors:  Mark M Tanaka; Andrew R Francis; Fabio Luciani; S A Sisson
Journal:  Genetics       Date:  2006-04-19       Impact factor: 4.562

5.  Inferring epidemiological parameters on the basis of allele frequencies.

Authors:  Tanja Stadler
Journal:  Genetics       Date:  2011-05-05       Impact factor: 4.562

6.  The epidemiological fitness cost of drug resistance in Mycobacterium tuberculosis.

Authors:  Fabio Luciani; Scott A Sisson; Honglin Jiang; Andrew R Francis; Mark M Tanaka
Journal:  Proc Natl Acad Sci U S A       Date:  2009-08-13       Impact factor: 11.205

7.  Modeling epidemics of multidrug-resistant M. tuberculosis of heterogeneous fitness.

Authors:  Ted Cohen; Megan Murray
Journal:  Nat Med       Date:  2004-09-19       Impact factor: 53.440

8.  The epidemiology of tuberculosis in San Francisco. A population-based study using conventional and molecular methods.

Authors:  P M Small; P C Hopewell; S P Singh; A Paz; J Parsonnet; D C Ruston; G F Schecter; C L Daley; G K Schoolnik
Journal:  N Engl J Med       Date:  1994-06-16       Impact factor: 91.245

9.  Phylodynamics of infectious disease epidemics.

Authors:  Erik M Volz; Sergei L Kosakovsky Pond; Melissa J Ward; Andrew J Leigh Brown; Simon D W Frost
Journal:  Genetics       Date:  2009-09-21       Impact factor: 4.562

10.  Inference for nonlinear epidemiological models using genealogies and time series.

Authors:  David A Rasmussen; Oliver Ratmann; Katia Koelle
Journal:  PLoS Comput Biol       Date:  2011-08-25       Impact factor: 4.475
