Gregory P Harhay, Dayna M Harhay, James L Bono, Sarah F Capik, Keith D DeDonder, Michael D Apley, Brian V Lubbers, Bradley J White, Robert L Larson, Timothy P L Smith.
Abstract
The virulence and pathogenicity of bacterial pathogens are related to their adaptability to changing environments. One process enabling adaptation is based on minor changes in genome sequence, as small as a few base pairs, within segments of the genome called simple sequence repeats (SSRs) that consist of multiple copies of a short sequence (from one to several nucleotides), repeated in series. SSRs are found in eukaryotes as well as prokaryotes, and length variation in them occurs at frequencies up to a million-fold higher than bacterial point mutations through the process of slipped strand mispairing (SSM) by DNA polymerase during replication. The characterization of SSR length by standard sequencing methods is complicated by the appearance of length variation introduced during the sequencing process that obscures the lower-abundance repeat number variants in a population. Here we report a computational approach to correct for sequencing process-induced artifacts, validated for tetranucleotide repeats by use of synthetic constructs of fixed, known length. We apply this method to a laboratory culture of Histophilus somni, prepared from a single colony, and demonstrate that the culture consists of populations of distinct sequence phase and length variants at individual tetranucleotide SSR loci.
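To make the SSR definition concrete, the following minimal Python sketch locates the longest tandem run of a repeat unit (RU) in a sequence and reports its length in base pairs and in RUs. The function name `find_longest_ssr`, the flanking sequences, and the `min_units` threshold are illustrative assumptions and are not taken from the paper; the sketch also ignores partial trailing units.

```python
import re

def find_longest_ssr(seq: str, unit: str, min_units: int = 3):
    """Return (start, length_bp, repeat_units) for the longest tandem run
    of `unit` in `seq`, or None if no run has at least `min_units` copies.
    Hypothetical helper for illustration only."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_units},}}")
    best = None
    for m in pattern.finditer(seq):
        length = m.end() - m.start()
        if best is None or length > best[1]:
            best = (m.start(), length, length // len(unit))
    return best

# Toy sequences (made up): the same locus before and after losing one RU,
# the kind of length change attributed to slipped strand mispairing.
flank5, flank3 = "GATTACA", "TTGGCC"
parent = flank5 + "AAGC" * 16 + flank3
slipped = flank5 + "AAGC" * 15 + flank3

print(find_longest_ssr(parent, "AAGC"))   # (7, 64, 16)
print(find_longest_ssr(slipped, "AAGC"))  # (7, 60, 15)
```

A single slippage event changes the tract length by a whole RU (4 bp for a tetranucleotide repeat), which is why length variants are expected at integer RU distances from the parent length.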
Year: 2019 PMID: 31792233 PMCID: PMC6889271 DOI: 10.1038/s41598-019-53866-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Histograms showing the effect of the fraction base composition (FBC) filter, at various z-score settings, on CCS mapping to the AAGC SSR in control duplexes, to assess the effects of SSM in CCS sequencing. Only CCS passing initial filters for length and quality are shown. Panels a–c show histograms of CCS from a single 63 bp (16 RU-1 bp) synthetic control SSR. (a) No z-score filter applied (inset is same histogram with linear ordinate). (b) Z-score = 2.5; CCS in bins at non-integer RU distances from the control SSR length consist of adenine deletions or insertions. (c) Z-score = 1.5, showing a dominant mode of 63 bp corresponding to control SSR length and minor modes at integer RU distances due to SSM during sequencing (inset is same histogram with linear ordinate). Panels d–f show histograms of CCS from a mixture of libraries of synthetic duplexes with 63, 67, and 71 bp length SSRs. (d) No z-score filter applied. (e) Z-score = 2.25, with CCS comprising single base indels relative to CCS at the control SSR lengths or in sidebands of bins at an integer number of RUs different from the control SSR lengths. (f) Z-score = 1.5, showing occupancy of control SSR length modes at 63, 67, 71 bp and minor modes at 51, 55, 59, 75 bp due to SSM of the sequencing polymerase.
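The caption does not spell out how the FBC filter is computed, so the Python sketch below shows only one plausible reading, stated as an assumption: each read's fraction of A, C, G and T is compared with the population mean, and a read is kept only if every base fraction lies within `z_max` standard deviations. The names `base_fractions`, `fbc_filter` and `ru_offset` are hypothetical, and the reads are toy data.

```python
from collections import Counter
from statistics import mean, pstdev

BASES = "ACGT"

def base_fractions(read):
    """Fraction of each base in a read, in A, C, G, T order."""
    counts = Counter(read)
    return [counts[b] / len(read) for b in BASES]

def fbc_filter(reads, z_max):
    """ASSUMED interpretation of an FBC filter: keep reads whose base
    fractions all lie within z_max population std devs of the mean."""
    fracs = [base_fractions(r) for r in reads]
    mus = [mean(col) for col in zip(*fracs)]
    sds = [pstdev(col) for col in zip(*fracs)]
    kept = []
    for read, f in zip(reads, fracs):
        # A base with zero spread across reads cannot flag an outlier.
        if all(sd == 0 or abs(x - mu) / sd <= z_max
               for x, mu, sd in zip(f, mus, sds)):
            kept.append(read)
    return kept

def ru_offset(length_bp, control_bp, unit_len=4):
    """Offset from the control length in whole RUs, or None when the
    length sits at a non-integer RU distance."""
    diff = length_bp - control_bp
    return diff // unit_len if diff % unit_len == 0 else None

# Toy reads: 18 intact tracts, one tract missing a whole RU, and one
# read carrying a run of extra adenines.
reads = ["AAGC" * 16] * 18 + ["AAGC" * 15] + ["AAGC" * 8 + "A" * 10 + "AAGC" * 8]
kept = fbc_filter(reads, z_max=2.5)
print(len(kept), sorted({len(r) for r in kept}))  # 19 [60, 64]
print(ru_offset(51, 63), ru_offset(64, 63))       # -3 None
```

Under this assumed definition, a read that gains or loses whole RUs keeps the same base composition and passes the filter, whereas reads with single-base indels shift composition and are removed as the threshold tightens. That behaviour is in line with the minor modes at 51, 55, 59 and 75 bp surviving in panel f, all of which are whole-RU offsets from 63 bp.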
Figure 2. Histograms and sequence contexts for tetranucleotide SSRs, where SSR sequence phase is defined by the first and last tetramers in the SSR. The percentage of CCS in non-consensus SSR length (cSSRlen) modes relative to those in the dominant cSSRlen modes are designated above their respective bins. (a) The CCS mapping to the 79 bp AAGC SSR using an FBC z-score = 1.55. (b) The genome context of the 79 bp AAGC SSR. (c,d) Same as the preceding panel pair but for the 107 bp ACTG SSR with FBC z-score = 1.0. (e,f) As with the preceding panel pairs, but for the 147 bp ACTG SSR. (g,h) As with preceding panel pairs, but for the 154 bp AATC SSR. (i,j) As with the preceding panel pairs, but for the 250 bp AACC SSR with no FBC filter applied.
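The two quantities this caption relies on, sequence phase and the percentage of CCS in minor length modes relative to the dominant mode, can be sketched in Python as follows. The helper names and the read counts are invented for illustration, and taking the "last tetramer" to be the final four bases of the tract is an assumption.

```python
from collections import Counter

def ssr_phase(ssr, unit_len=4):
    """Phase label: the first and last tetramers of the SSR tract.
    Assumes the 'last tetramer' is simply the final unit_len bases."""
    return ssr[:unit_len], ssr[-unit_len:]

def minor_mode_percentages(lengths):
    """Return (dominant_length, {length: percent}) where percent is the
    read count in a minor bin relative to the dominant bin."""
    counts = Counter(lengths)
    dominant_len, dominant_n = counts.most_common(1)[0]
    minor = {
        length: 100.0 * n / dominant_n
        for length, n in sorted(counts.items())
        if length != dominant_len
    }
    return dominant_len, minor

# Toy tract with a partial final unit (15 bp = 4 RU - 1 bp).
print(ssr_phase("GCAA" * 3 + "GCA"))  # ('GCAA', 'AGCA')

# Toy length counts (not from the paper): 200 reads at 79 bp,
# 10 at 75 bp and 4 at 83 bp.
lengths = [79] * 200 + [75] * 10 + [83] * 4
print(minor_mode_percentages(lengths))  # (79, {75: 5.0, 83: 2.0})
```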
Figure 3. Comparison of the 3′-ends of CCS mapping to two SSRs composed of ACTG RUs. (a) The 107 bp SSR terminates with CCS ending with TCAG and is initiated with a CAGT upstream starting sequence (Fig. 2d). (b) In contrast, the 147 bp SSR terminates with CCS ending with GTCA and is initiated with a TCAG upstream starting sequence (Fig. 2f).
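The tetramers named in this caption (CAGT, TCAG, GTCA) are all cyclic rotations of CAGT, which is the reverse complement of the ACTG RU. The short Python check below confirms this and shows one way to reduce any observed tetramer to a canonical RU name; the helper names are hypothetical, and choosing the lexicographically smallest rotation as the canonical form is an assumed convention, not one stated in the source.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def rotations(unit):
    """All cyclic rotations (phases) of a repeat unit."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}

def canonical_unit(unit):
    """Smallest rotation over both strands (assumed naming convention)."""
    return min(rotations(unit) | rotations(revcomp(unit)))

print(revcomp("ACTG"))                      # CAGT
print(sorted(rotations(revcomp("ACTG"))))   # ['AGTC', 'CAGT', 'GTCA', 'TCAG']
print([canonical_unit(t) for t in ("CAGT", "TCAG", "GTCA")])  # ['ACTG', 'ACTG', 'ACTG']
```

Two SSRs built from the same RU can therefore start and end on different rotations, which is the phase difference the two panels contrast.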