Ales Varabyou, Christopher Pockrandt, Steven L Salzberg, Mihaela Pertea.
Abstract
The ability to detect recombination in pathogen genomes is crucial to the accuracy of phylogenetic analysis and consequently to forecasting the spread of infectious diseases and to developing therapeutics and public health policies. However, previous methods for detecting recombination and reassortment events cannot handle the computational requirements of analyzing tens of thousands of genomes, a scenario that has now emerged in the effort to track the spread of the SARS-CoV-2 virus. Furthermore, the low divergence of near-identical genomes sequenced in short periods of time presents a statistical challenge not addressed by available methods. In this work we present Bolotie, an efficient method designed to detect recombination and reassortment events between clades of viral genomes. We applied our method to a large collection of SARS-CoV-2 genomes and discovered hundreds of isolates that are likely of a recombinant origin. In cases where raw sequencing data were available, we were able to rule out the possibility that these samples represented co-infections by analyzing the underlying sequence reads. Our findings further show that several recombinants appear to have persisted in the population.
Year: 2020 PMID: 32995774 PMCID: PMC7523100 DOI: 10.1101/2020.09.21.300913
Source DB: PubMed Journal: bioRxiv
Figure 1. An unrooted topological cladogram of 4,249 SARS-CoV-2 genomes including 225 recombinants labeled as red bars. Arcs link each recombinant to both inferred parental genomes. The color of the arc corresponds to the color of the clade to which a recombinant was clustered within the tree. Clades correspond to the GISAID clades GR (0), GH (1), G (2) and all minor lineages combined (4).
Figure 2. Four examples of inferred recombinant sequences: A. EPI_ISL_439137; B. EPI_ISL_468407; C. EPI_ISL_509874; D. EPI_ISL_417420. The top section of each plot shows conditional probabilities of a clade given a nucleotide at each position. Bars are plotted for the two parent clades and the other clades are shown as dots of the corresponding color. Each peak >0.1 above the baseline (0.25) is labeled with the number of genomes it appears in. An average is reported whenever there are multiple variants in close proximity on the plot, listing the number of averaged variants in parentheses. The three lower panels of each plot show the frequency of variants at each position for parental clades (top and bottom rows) and variants observed on the recombinant genome (middle row).
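The conditional probabilities plotted in the top section can be illustrated with a minimal sketch. It assumes equal prior weight for every clade, which is one way to arrive at the 0.25 baseline with four clades; the record does not state the exact formula Bolotie uses. The function name `clade_posteriors` and the input layout are hypothetical.

```python
from collections import Counter


def clade_posteriors(columns_by_clade, base):
    """Return P(clade | base) at one alignment position.

    Assumption (not from the source): every clade has the same prior,
    so P(clade | base) is proportional to P(base | clade).

    columns_by_clade: dict mapping a clade label to a string holding the
    base observed at this position in each genome of that clade
    (hypothetical input layout).
    """
    likelihoods = {}
    for clade, column in columns_by_clade.items():
        counts = Counter(column)
        # Frequency of `base` within the clade, i.e. P(base | clade)
        likelihoods[clade] = counts[base] / len(column) if column else 0.0

    total = sum(likelihoods.values())
    if total == 0:
        # Base not seen in any clade: fall back to the uniform baseline
        return {clade: 1.0 / len(likelihoods) for clade in likelihoods}
    return {clade: lik / total for clade, lik in likelihoods.items()}
```

Under this assumption, a base that is equally common in all four clades gives 0.25 for each clade, while a base seen almost exclusively in one clade pushes that clade's probability toward 1.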
Mutational signature of the EPI_ISL_439137 recombinant isolate (Figure 2B). The table shows all positions with defining conditional probabilities for each of the parental clades. Read counts extracted from the data deposited in EBI are provided to illustrate the likely single-isolate origin of the genome. Positions are colored according to the respective clade colors used in the manuscript.
| Position | Reference | Observed | P(0\|Base) | P(1\|Base) | A | C | G | T |
|---|---|---|---|---|---|---|---|---|
| 240 | C | C | 0.3975 | 0.2358 | 4 | 5 | 1 | |
| 3036 | C | C | 0.9998 | 0.0001 | 1 | 1 | 9 | |
| 8781 | C | T | 0.9929 | 0.0012 | 0 | 0 | 0 | |
| 14407 | C | T | 0.0092 | 0.3297 | 1 | 3 | 2 | |
| 17125 | T | C | 0.0217 | 0.9783 | 1 | 0 | 6 | |
| 20267 | A | G | 0.0061 | 0.9935 | 0 | 0 | 0 | |
| 23402 | A | G | 0.0066 | 0.3311 | 0 | 0 | 0 | |
| 28143 | T | C | 0.9988 | 0.0005 | 2 | 0 | 4 | |
Figure 3. Effects of sequence composition on the topology of the phylogenetic trees for SARS-CoV-2. A tree obtained directly from NextStrain (A) is first compared to (B) the tree computed using Bolotie consensus sequences for the same set of isolates. Panel (C) shows a tree computed for the same set of isolates with 210 additional recombinant sequences as identified by Bolotie. Leaf nodes that correspond to recombinant genomes are labeled with red dots.
Figure 4. The maximum conditional probability for each nucleotide is highlighted in gray, while the path with the maximum likelihood is highlighted in bold. Penalizing clade switches prevents the path from changing clade over insignificant differences in probability or for short windows that briefly favor a different clade. For clarity, transitions between nodes on non-optimal paths are indicated in gray without labeled probabilities.
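The path selection described in this caption resembles a Viterbi-style dynamic program, and a minimal sketch under that assumption follows. The function name `best_clade_path`, the `switch_prob` parameter and its default value are hypothetical and are not taken from Bolotie; the record does not specify how the penalty is parameterized.

```python
import math


def best_clade_path(probs, switch_prob=1e-3, floor=1e-9):
    """Return the most likely clade label for each position.

    probs: list with one dict per position, mapping clade label to
    P(clade | observed base) at that position.
    switch_prob: assumed probability of changing clade between adjacent
    positions; small values penalize switching heavily.
    floor: lower bound that keeps log() defined for zero probabilities.
    """
    clades = list(probs[0])
    log_switch = math.log(switch_prob)
    log_stay = math.log(1.0 - switch_prob)

    # Best log-score of any path ending in each clade at the first position
    score = {c: math.log(max(probs[0][c], floor)) for c in clades}
    back = []  # back[i][c] = best previous clade for clade c at position i + 1

    for column in probs[1:]:
        new_score, pointers = {}, {}
        for c in clades:
            # Pick the predecessor that maximizes score plus transition cost
            prev = max(
                clades,
                key=lambda p: score[p] + (log_stay if p == c else log_switch),
            )
            trans = log_stay if prev == c else log_switch
            new_score[c] = score[prev] + trans + math.log(max(column[c], floor))
            pointers[c] = prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final clade
    path = [max(clades, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With a small `switch_prob`, a single position that slightly favors another clade does not change the path, whereas a sustained run of positions that strongly favor another clade does, which matches the behavior the caption describes.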