| Literature DB >> 33948608 |
Vladimir Shchur1, Vadim Spirin1, Dmitry Sirotkin1, Evgeni Burovski1, Nicola De Maio2, Russell Corbett-Detig1,3.
Abstract
Accurate simulation of complex biological processes is an essential component of developing and validating new technologies and inference approaches. As an effort to help contain the COVID-19 pandemic, large numbers of SARS-CoV-2 genomes have been sequenced from most regions in the world. More than 5.5 million viral sequences are publicly available as of November 2021. Many studies estimate viral genealogies from these sequences, as these can provide valuable information about the spread of the pandemic across time and space. Additionally such data are a rich source of information about molecular evolutionary processes including natural selection, for example allowing the identification of new variants with transmissibility and immunity evasion advantages. To our knowledge, there is no framework that is both efficient and flexible enough to simulate the pandemic to approximate world-scale scenarios and generate viral genealogies of millions of samples. Here, we introduce a new fast simulator VGsim which addresses the problem of simulation genealogies under epidemiological models. The simulation process is split into two phases. During the forward run the algorithm generates a chain of population-level events reflecting the dynamics of the pandemic using an hierarchical version of the Gillespie algorithm. During the backward run a coalescent-like approach generates a tree genealogy of samples conditioning on the population-level events chain generated during the forward run. Our software can model complex population structure, epistasis and immunity escape. The code is freely available at https://github.com/Genomics-HSE/VGsim.Entities:
Year: 2021 PMID: 33948608 PMCID: PMC8095227 DOI: 10.1101/2021.04.21.21255891
Source DB: PubMed Journal: medRxiv
Figure 1:The scheme of the nested family Gillespie algorithm used to generate an event in the forward run. The corresponding reactions are listed in Table 1. Black squares correspond to the consecutive steps, where the subfamilies are chosen with the weights given by their propensities. The propensities for each step are cached and updated only if they change due to an event. For migration propensities, the rejection approach is used instead (SI 3).
The list of reactions and corresponding epidemiological events simulated by the Gillespie algorithm in our model, and the number of reactions in each category in function of the number of mutable sites U, number of susceptible individuals T, and the number of populations K.
| Reaction | Description | Number of reactions |
|---|---|---|
|
| 4 | |
|
| 2 · 4 | |
| 4 | ||
| 4 |
List of features which determine the simulation scenario. All the rates are normalized by the number of individuals in a particular group (i.e. the number of individuals infected with a particular haplotype or individuals of a certain susceptibility type). The rates are measured in terms of events per time unit.
| Model | Feature | Description | Value |
|---|---|---|---|
| Epidemiological model: every parameter can be set individually for each haplotype. | Transmission rate | The expected number of new infections per time unit caused by an individual infected with haplotype | |
| Recovery rate | Rate at which an infectious individual becomes recovered after being infected with haplotype | ||
| Sampling rate | Rate at which an infectious individual is sampled after being infected with haplotype | ||
| Mutation rate | Rate at which a genetic site | ||
| Substitution probabilities | The probabilities of particular nucleotide substitution at haplotype | ||
| Susceptibility | The multiplier which allows to change the relative susceptibility to haplotype | ||
| Susceptibility transition rate | The rate at which susceptible individuals move from one susceptibility type to another without being infected. This allows to model the loss of immunity with time or vaccination efforts. | [0; ∞] | |
| Population model | Population size | Total number of individuals in population | ℕ |
| Contact density | The multiplicative modifier of transmission rate corresponding to the relative number of contacts in population | ||
| Lockdown | Conditions to impose and lift lockdown in population | ||
| Sampling effort | This modifier increases or decreases the sampling rate in population | ||
| Migration probability | The probability that an individual from population |
Run time to generate 10 million events. The second number is the percentage of discarded events (due to migration acceptance/rejection). There are 16 = 42 haplotypes and 3 susceptible compartments. The sampling rate is set to ζ = 0.1, recovery rate is ρ = 0.9, transmission rate is λ = 2.5. The tests were run on a server node with Intel Xeon Gold 6152 2.1–3.7 GHz processor and 1536GB of memory.
| Cumulative migration probability | Number of demes | |||||
|---|---|---|---|---|---|---|
| 2 | 5 | 10 | 20 | 50 | 100 | |
| 0.001 | 28.7s | 30.0s | 31.9s | 35.1s | 47.2s | 69.3s |
| 0.002 | 29.2s | 30.4s | 32.3s | 35.3s | 47.0s | 70.1s |
| 0.005 | 29.4s | 30.7s | 32.5s | 35.6s | 48.1s | 70.4s |
| 0.01 | 29.4s | 30.6s | 32.9s | 35.5s | 48.0s | 70.9s |
| 0.1 | 30.3s | 31.9s | 33.8s | 37.0s | 50.3s | 73.0s |
Run time in seconds to generate a random genealogy for a sample of a certain size for different sampling rates. The execution time is shown split into the time demand for the forward run and the one for the backward run only. We simulated 16 = 42 haplotypes and no host immunity. The recovery rate is ρ = 1.0 − ζ, with ζ the sampling rate, while the transmission rate is λ = 2.5 for all 16 haplotypes. The tests were run on a server node with an Intel Xeon Gold 6152 2.1–3.7 GHz processor and 1536GB of memory.
| Sampling rate | Sample size | ||||||
|---|---|---|---|---|---|---|---|
| 105 | 106 | 5 · 106 | 107 | 5 · 107 | 1.5 · 108 | ||
| 0.01 | Forward time | 27.84s | 290.86s | 1275.53s | 2487.73s | 11295.01s | 34558.86s |
| Backward time | 0.85s | 7.44s | 26.93s | 50.27s | 217.51s | 813.25s | |
| Memory | 1.67MB | 10.87GB | 49.54GB | 94.64GB | 442.69GB | 1.34TB | |
| Total number of generated events | 34,038,092 | 286,381,088 | 1,120,365,070 | 2,121,897,004 | 9,878,131,708 | 30,152,423,891 | |
| Total number of infections | 24,040,769 | 185,954,943 | 619,559,504 | 1,119,957,985 | 4,994,200,627 | 15,121,211,248 | |
| 0.1 | Forward time | 2.18s | 29.89s | 154.15s | 296.43s | 1283.01s | 3470.47s |
| Backward time | 0.1s | 0.96s | 4.68s | 8.99s | 34.2s | 90.29s | |
| Memory | 1.68MB | 1.68MB | 5.51GB | 12.5GB | 53.27GB | 143.32GB | |
| Total number of generated events | 3,491,562 | 34,125,248 | 155,922,768 | 285,874,161 | 1,120,657,092 | 3,122,658,422 | |
| Total number of infections | 2,489,943 | 24,101,573 | 105,814,516 | 185,656,462 | 619,705,716 | 1,619,831,406 | |
| 1.0 | Forward time | 0.23s | 2.2s | 13.63s | 30.32s | 154.99s | 405.39s |
| Backward time | 0.01s | 0.15s | 0.92s | 2.08s | 11.35s | 32.48s | |
| Memory | 1.67MB | 1.68MB | 1.66MB | 1.67MB | 5.54GB | 20.9GB | |
| Total number of generated events | 350,517 | 3,492,789 | 17,271,140 | 34,113,125 | 155,899,482 | 401,912,500 | |
| Total number of infections | 250,290 | 2,490,805 | 12,261,217 | 24,093,104 | 105,799,613 | 251,613,148 | |