Katrina A Lythgoe, Matthew Hall, Luca Ferretti, Mariateresa de Cesare, George MacIntyre-Cockett, Amy Trebes, Monique Andersson, Newton Otecko, Emma L Wise, Nathan Moore, Jessica Lynch, Stephen Kidd, Nicholas Cortes, Matilde Mori, Rebecca Williams, Gabrielle Vernet, Anita Justice, Angie Green, Samuel M Nicholls, M Azim Ansari, Lucie Abeler-Dörner, Catrin E Moore, Timothy E A Peto, David W Eyre, Robert Shaw, Peter Simmonds, David Buck, John A Todd, Thomas R Connor, Shirin Ashraf, Ana da Silva Filipe, James Shepherd, Emma C Thomson, David Bonsall, Christophe Fraser, Tanya Golubchik.
Abstract
Extensive global sampling and sequencing of the pandemic virus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) have enabled researchers to monitor its spread and to identify concerning new variants. Two important determinants of variant spread are how frequently they arise within individuals and how likely they are to be transmitted. To characterize within-host diversity and transmission, we deep-sequenced 1313 clinical samples from the United Kingdom. SARS-CoV-2 infections are characterized by low levels of within-host diversity when viral loads are high and by a narrow bottleneck at transmission. Most variants are either lost or occasionally fixed at the point of transmission, with minimal persistence of shared diversity, patterns that are readily observable on the phylogenetic tree. Our results suggest that transmission-enhancing and/or immune-escape SARS-CoV-2 variants are likely to arise infrequently but could spread rapidly if successfully transmitted.
Year: 2021 PMID: 33688063 PMCID: PMC8128293 DOI: 10.1126/science.abg0821
Source DB: PubMed Journal: Science ISSN: 0036-8075 Impact factor: 47.728
Fig. 1. Characterization of iSNV frequencies.
(A) Distribution of the number of identified intrahost single-nucleotide variant (iSNV) sites in each sample against the number of unique mapped reads. The colors represent different minor allele frequency (MAF) thresholds. An iSNV site is identified within a sample if the MAF is greater than the threshold. (B) Distribution of the mean MAF in each sample against the number of unique mapped reads, with no MAF threshold applied. The black line is the mean value estimated by linear regression. The green ribbon is the 95% confidence interval (CI). (C) Distribution of the number of identified iSNV sites at the 3% MAF threshold when subsampling from high-depth samples. Each color represents a different high-depth sample.
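The thresholding rule in panel (A) can be illustrated with a minimal Python sketch. It assumes that the MAF at a site is the frequency of the second most common base among the reads covering that site. The function names and the per-site count dictionaries are hypothetical and are not taken from the study's pipeline.

```python
from typing import Dict, List


def minor_allele_frequency(counts: Dict[str, int]) -> float:
    """Frequency of the second most common base at one site (assumed MAF definition)."""
    depth = sum(counts.values())
    if depth == 0:
        return 0.0
    ordered = sorted(counts.values(), reverse=True)
    second = ordered[1] if len(ordered) > 1 else 0
    return second / depth


def count_isnv_sites(sites: List[Dict[str, int]], threshold: float = 0.03) -> int:
    """Count sites whose MAF is strictly greater than the threshold."""
    return sum(1 for counts in sites if minor_allele_frequency(counts) > threshold)


# Hypothetical base counts at three sites of one sample.
example_sites = [
    {"A": 960, "G": 40},   # MAF 0.04
    {"C": 975, "T": 25},   # MAF 0.025
    {"G": 1000},           # no minor allele
]
n_at_3_percent = count_isnv_sites(example_sites, threshold=0.03)
n_at_2_percent = count_isnv_sites(example_sites, threshold=0.02)
```

Lowering the threshold from 3% to 2% admits the second site as well, which is the kind of threshold dependence the colors in panel (A) are meant to show.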
Fig. 2. Comparison of allele frequencies between sequencing replicates of the same sample and multiple time points from the same individual.
(A) Comparison of MAFs from 27 replicate pairs resequenced from RNA, with each point representing a single genomic position in a pair of replicates. The plot represents all MAF comparisons for the 27 samples where both replicates had >50,000 unique mapped reads, limited to genomic sites with MAF >0.02 in at least one of the 54 replicates. The blue lines mark the MAF threshold of 0.03. (B and C) Comparison of allele frequencies from 41 individuals sampled on different days, with each point representing a genomic position in a pair of samples from the same individual. Each individual is represented by a different color, and for each individual, all genomic positions are considered where the MAF was >0.03 at either sampling time point and/or a change in consensus was observed. In all cases, the poly-A tail and sites variable in RNA synthetic controls were excluded, as were sites observed to be variable in >20 samples at MAF >3% because these are unlikely to represent genomic variants. (C) is an enlargement of the region of (B) near the origin.
Fig. 3. Small transmission bottleneck size within households.
(A) Estimated bottleneck size in 14 households calculated using the exact beta-binomial method described in (). Bottleneck sizes for both combinations of potential source and recipient were calculated if the first positive samples from each individual in the household were collected within a week of each other. No estimate was recorded if there were no identified iSNVs >3% MAF in the source individual (household 8) or if the two individuals in the household had more than two consensus differences (household 15). The error bars represent the 95% CI determined by the likelihood ratio test. (B) Fate of the identified iSNVs within households. Each line links the allele frequency of a given variant in one household member with that in the second member. Points and lines are colored by household. Each variant was identified as an iSNV in at least one individual but not necessarily (and usually not) both. Where the dates of sample collection differed by at least a week, we also indicate the assumed source and recipient members of the household.
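To see why a small bottleneck leaves most variants either lost or fixed, consider a deliberately simplified model in which the founding population consists of n virions drawn independently from the source. This is not the exact beta-binomial method cited in the caption: it ignores viral growth after transmission and sequencing noise, and the function name is hypothetical.

```python
def founding_outcomes(p: float, n: int) -> dict:
    """Outcome probabilities for one variant at source frequency p
    when n virions are drawn independently to found the new infection."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a frequency between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    lost = (1.0 - p) ** n   # no founding virion carries the variant
    fixed = p ** n          # every founding virion carries the variant
    return {"lost": lost, "fixed": fixed, "shared": 1.0 - lost - fixed}


# A variant at 5% in the source, for a few assumed bottleneck sizes.
for n in (1, 2, 10):
    print(n, founding_outcomes(0.05, n))
```

Under this sketch a bottleneck of one virion leaves no room for shared diversity: the variant is lost with probability 1 - p and fixed with probability p. The probability that it remains polymorphic in the founding population grows only as n increases.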
Fig. 4. iSNV sites were often found in multiple samples and most samples had at least one iSNV.
(A) Histogram showing the number of iSNV sites that were found in N samples. All samples in our dataset are included. (B) Stacked histogram showing the number of samples that had n iSNV sites for all samples with >50,000 mapped reads (dark red) and samples with <50,000 mapped reads (light red). All 563 sites identified for variant analysis were included (see main text), including sites in the 3′UTR and 5′UTR but excluding the poly-A tail and the 18 sites variable in 20+ individuals.
iSNVs and dN/dS by gene and over the whole genome.
| Region | Length (nt) | iSNVs | Nonsynonymous | Synonymous | Mean iSNVs per 100 sites | dN/dS (CI) |
| --- | --- | --- | --- | --- | --- | --- |
| 5′UTR | 265 | 82 | - | - | 0.0223 | - |
| ORF1a | 13218 | 572 | 369 | 203 | 0.0031 | 0.51 (0.43, 0.61) |
| nsp1 | 540 | 54 | 39 | 15 | 0.0072 | 0.79 (0.44, 1.47) |
| nsp2 | 1914 | 105 | 65 | 40 | 0.0039 | 0.46 (0.31, 0.69) |
| nsp3 | 5835 | 175 | 108 | 67 | 0.0022 | 0.45 (0.33, 0.61) |
| nsp4 | 1500 | 101 | 61 | 40 | 0.0048 | 0.44 (0.3, 0.66) |
| nsp5A | 918 | 25 | 22 | 3 | 0.002 | 2.08 (0.72, 8.77) |
| nsp6 | 870 | 62 | 42 | 20 | 0.0051 | 0.58 (0.35, 1.01) |
| nsp7 | 249 | 6 | 2 | 4 | 0.0017 | 0.14 (0.02, 0.73) |
| nsp8 | 594 | 13 | 7 | 6 | 0.0016 | 0.32 (0.11, 0.98) |
| nsp9 | 339 | 15 | 9 | 6 | 0.0032 | 0.46 (0.17, 1.37) |
| nsp10 | 417 | 16 | 14 | 2 | 0.0028 | 1.99 (0.56, 12.67) |
| nsp12* | 2795 | 122 | 69 | 53 | 0.0031 | 0.34 (0.24, 0.49) |
| ORF1b | 8088 | 349 | 212 | 137 | 0.0031 | 0.42 (0.34, 0.52) |
| nsp13 | 1803 | 59 | 33 | 26 | 0.0024 | 0.37 (0.22, 0.63) |
| nsp14 | 1581 | 92 | 59 | 33 | 0.0042 | 0.48 (0.31, 0.74) |
| nsp15 | 1038 | 31 | 21 | 10 | 0.0021 | 0.57 (0.27, 1.26) |
| nsp16 | 894 | 45 | 30 | 15 | 0.0036 | 0.54 (0.29, 1.03) |
| S | 3822 | 190 | 129 | 61 | 0.0036 | 0.6 (0.45, 0.82) |
| ORF3a | 828 | 108 | 96 | 12 | 0.0094 | 2.29 (1.31, 4.4) |
| E | 228 | 13 | 4 | 9 | 0.0041 | 0.15 (0.04, 0.47) |
| M | 669 | 32 | 20 | 12 | 0.0034 | 0.51 (0.25, 1.08) |
| ORF6 | 186 | 10 | 8 | 2 | 0.0039 | 0.97 (0.24, 6.43) |
| ORF7a | 366 | 41 | 34 | 7 | 0.0081 | 1.43 (0.67, 3.52) |
| ORF7b | 132 | 8 | 8 | 0 | 0.0044 | (0.93, ) |
| ORF8 | 366 | 49 | 19 | 30 | 0.0096 | 0.17 (0.09, 0.3) |
| N | 1260 | 145 | 106 | 39 | 0.0083 | 0.81 (0.56, 1.18) |
| ORF10 | 117 | 11 | 6 | 5 | 0.0068 | 0.32 (0.09, 1.09) |
| 3′UTR | 229 | 74 | - | - | 0.0232 | - |
| All coding regions | 29260 | 1526 | 1009 | 517 | 0.0038 | 0.55 (0.49, 0.61) |
| Full genome | 29903 | 1708 | - | - | 0.0041 | - |
All genome positions are relative to the Wuhan-Hu-1 reference sequence. iSNVs at the 18 highly shared sites and those identified from the synthetic controls are excluded, as are those in the poly-A tail (positions 29865 to 29903). The mean iSNVs per 100 sites column is the mean number in each gene over all 1390 sequenced genomes. Note that because of gene overlap and noncoding intergenic regions, the total number of iSNVs (1708) cannot be obtained as the sum of any column in this table, even if the rows for nonstructural proteins in ORF1ab are excluded.
*nsp12 overlaps the boundary between ORF1a and ORF1b.
Intergenic regions are excluded from the "All coding regions" row.
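The "mean iSNVs per 100 sites" values in the table appear consistent with dividing each region's iSNV count by its length and by the 1390 sequenced genomes, then scaling to 100 sites. The sketch below checks that reading against three rows of the table. The formula is inferred from the tabulated numbers rather than quoted from the authors, and the function name is hypothetical.

```python
N_GENOMES = 1390  # number of sequenced genomes stated in the table note


def mean_isnvs_per_100_sites(n_isnvs: int, length: int, n_genomes: int = N_GENOMES) -> float:
    """Mean iSNVs per genome per 100 sites for a region (inferred formula)."""
    return 100.0 * n_isnvs / (length * n_genomes)


# (region, iSNV count, length in nucleotides) copied from the table.
rows = [("5'UTR", 82, 265), ("ORF1a", 572, 13218), ("S", 190, 3822)]
for region, n_isnvs, length in rows:
    print(region, round(mean_isnvs_per_100_sites(n_isnvs, length), 4))
```

Rounded to four decimal places, these rows reproduce the tabulated values of 0.0223, 0.0031, and 0.0036.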
Fig. 5. Consensus phylogeny of all isolates.
In (A), tips are colored by sampling center (Oxford = orange; Basingstoke = green). The tree scale is in substitutions per site. (B to D) Distribution of samples with iSNVs at three loci. The genomic coordinate (with respect to the Wuhan-Hu-1 reference sequence) appears in the top left. Tree branches are colored by the consensus base at that position, and filled circles indicate iSNVs present at a minimum of 3% frequency in samples with depth of at least 100 at that position, and are colored by the most common minor variant present. For sites 28580 (B) and 20796 (C), an inset panel enlarges the section of the phylogeny where a consensus change is in close proximity to iSNVs with the relevant pair of nucleotides involved. The highlighted samples were prepared in separate batches and the patterns were not caused by contamination. (D) Variants at site 21575 (L5F) occurred in 14 samples but with no phylogenetic association with consensus changes at this site, which may represent independent emergence of this variant in multiple individuals. The phylogeny was constructed by maximum likelihood according to the robust procedure outlined by Morel et al. ().