Literature DB >> 19497932

Detecting SNPs and estimating allele frequencies in clonal bacterial populations by sequencing pooled DNA.

Kathryn E Holt1, Yik Y Teo, Heng Li, Satheesh Nair, Gordon Dougan, John Wain, Julian Parkhill.   

Abstract

SUMMARY: Here, we present a method for estimating the frequencies of SNP alleles present within pooled samples of DNA using high-throughput short-read sequencing. The method was tested on real data from six strains of the highly monomorphic pathogen Salmonella Paratyphi A, sequenced individually and in a pool. A variety of read mapping and quality-weighting procedures were tested to determine the optimal parameters, which afforded > or =80% sensitivity of SNP detection and strong correlation with true SNP frequency at poolwide read depth of 40x, declining only slightly at read depths 20-40x. AVAILABILITY: The method was implemented in Perl and relies on the opensource software Maq for read mapping and SNP calling. The Perl script is freely available from ftp://ftp.sanger.ac.uk/pub/pathogens/pools/.

Entities:  

Mesh:

Substances:

Year:  2009        PMID: 19497932      PMCID: PMC2722999          DOI: 10.1093/bioinformatics/btp344

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


The discovery of assayable genetic variation is essential in order to study the population structure of bacteria, which is crucial for addressing important research questions including evolution, transmission and epidemiology of bacterial pathogens and associated disease (Keim et al., 2004; Nübel et al., 2008). Multi-locus sequence typing (MLST) has been widely adopted for the study of bacterial population structure (Maiden, 2006), however this technique is not sensitive enough to detect variation among highly monomorphic bacteria such as Salmonella Typhi (Kidgell et al., 2004). Furthermore, while many bacterial pathogens display a much more diverse population structure, disease outbreaks are often associated with the spread of a single clone associated with a single sequence type—e.g. methicillin resistant Staphylococcus aureus MRSA-15 and MRSA-16 (Enright et al., 2000). Here, we propose the high-throughput sequencing of pools containing equal amounts of genomic DNA extracted from multiple bacterial isolates. This approach facilitates genome-wide SNP detection among closely related isolates and is cheaper than sequencing isolates individually (even using multiplex libraries), allowing extensive sampling of a population at low cost. This is crucial for unbiased detection of genetic variation within a population, which in turn is required to give an unbiased picture of the underlying population structure (Pearson et al., 2004). Two likely applications of SNP detection in pooled DNA samples from bacteria are: (i) identifying SNPs for typing in a much larger population (Pearson et al., 2004), and (ii) identification of SNPs associated with a particular phenotype, e.g. increased virulence (Falush and Bowden, 2006). In either application it will be important to estimate the allele frequencies of each SNP within each pool. We used the Illumina GAI to sequence pools of DNA, and Maq (http://maq.sourceforge.net/) to map short reads to a reference sequence and make initial SNP calls. Maq is ideally suited to this task as it employs a user-specified number of haplotypes (in this case the number of strains in the pool) to detect the presence of SNPs (Li et al., 2008). The frequency of each SNP k in pool p containing S strains was estimated using information (read from Maq's pileup output) on each read i of N reads mapped to the SNP locus, including the phred-like base quality q which was used to calculate weights w (see below). Frequencies were calculated according to the following formulae, implemented in a Perl script which calls Maq to do the initial read mapping and SNP calling (here x=1 if SNP allele, 0 otherwise): To optimize and validate the method for allele frequency estimation, we chose six strains of the monomorphic bacterial pathogen Salmonella Paratyphi A to be sequenced individually and in pools. The strains include ATCC9150 and five novel isolates. Illumina sequencing was used to generate 35 bp reads at 20–40× depth (European Read Archive: ERA000083). Maq was used to map reads to the reference genome AKU_12601 (EMBL:FM200053) and call SNPs in each isolate. SNPs between the finished sequence ATCC9150 and the reference were also included [identified using MUMmer (Kurtz et al., 2004)]. SNPs lying in repetitive or phage sequences (5% of genomic sequence) were excluded from the analysis, as were SNP calls with low quality (quality score <20 or read depth <10). For each SNP locus identified in any strain (550 loci) alleles were checked in all six strains, resulting in a set of 403 SNPs with a reliable frequency estimate among the six strains (depth ≥10, quality score ≥20 in each strain). A pooled DNA sample (400 ng of DNA from each of the six strains) was also sequenced, generating 40× coverage of the pool. SNP detection and frequency estimation were performed on the pooled reads and the results compared with the SNP loci (550) and allele frequencies (for 403 loci) expected among the six strains. Different parameters were trialled in order to determine the optimal method for accurate estimation of allele frequencies within pooled DNA samples. The maximum mismatches allowed during mapping (maq assemble -m option) was varied from 1 to 7 bases (i.e. up to 20% mismatches per 35 bp read); the minimum mapping quality for reads to be included in frequency estimation (maq pileup -q) was varied from 10 to 50. In order to avoid interpreting base calling errors as SNPs, the contribution of each base to the estimate of prop [Equation (1)] was weighted according to its phred-like quality score q. Since quality scores are calibrated for each sequencing run, the simple and squared ratios of q to the maximum possible quality score max Q were considered, resulting in four alternative weighting schemes: All combinations of parameters and weighting measures were tested, and the following measures calculated (after removing SNP calls in repetitive or phage sequences): (i) sensitivity of SNP detection, i.e. proportion of the 550 known SNPs that were detected with an estimated frequency of ≥1 strain, (ii) false positive rate of SNP detection, i.e. proportion of the SNPs detected with estimated frequency at ≥1 strain that were not expected to be present in the pool, (iii) correlation (Pearson R2) between the expected and estimated allele frequencies and (iv) the proportion (among the 403 SNPs with reliable frequency estimates) of loci for which estimated and expected allele frequencies differed by ≥1 strain, i.e. the rate of incorrect frequency estimates. Five additional weighting measures were trialled, but were excluded from further analysis as they gave highly insensitive or inaccurate results (see Supplementary Fig. 1). Analysis of variance tables for each measure are given in Supplementary Tables 1–4. Using any combination of weights [Equations (3–6)], mismatches (1–7 per read) and mapping qualities (10–50), SNP detection was quite sensitive (78–84% of expected SNPs detected) and the experimentally observed frequencies were strongly correlated with the expected frequencies (Pearson R2 0.92–0.95) (Supplementary Fig. 2). Specifying the number of haplotypes in the pool (maq assemble -N parameter) was crucial to maintain sensitivity as without this, sensitivity dropped to 70%. Detection sensitivity was highly dependent on SNP frequency, with 37% detection for SNPs present in just one strain, compared with 95% and 100% detection, respectively, for SNPs present in 2 or ≥3 strains. The false positive (f.p.) rate varied between 5% and 18% using different methods and was closely correlated with number of mismatches allowed during mapping (Supplementary Fig. 3). However, setting the number of mismatches ≤1 reduced sensitivity too low (78%), thus the optimal setting was ≤2 mismatches per read (mean f.p. 8.8%, mean sensitivity 82.7%). The proportion of incorrect frequency estimates was reduced by using any of the weighting methods (4–6) and was also dependent on mapping quality. The lowest rate of incorrect estimates (19%) was seen with a minimum mapping quality 40; lowering or raising the cutoff increased the rate to >20%, while offering very little improvement in f.p. or sensitivity (Supplementary Fig. 4). The most accurate measurements (low f.p., low error rate, high R2) were obtained using the weighting method shown in Equation (6), regardless of other parameters, and the difference between expected and estimated frequencies was never more than one strain. Thus, the optimal parameters were mismatches ≤2 during mapping, mapping quality ≥40 to be included in frequency estimate, and weighting as shown in Equation (6). These parameters are implemented in the script, which includes 95% confidence intervals for frequency estimates that may be used to guide selection of SNPs for further typing. In this experiment, the pool was sequenced to 40× read depth, which using the selected parameters gave 83% sensitivity, 9% false positives, and 81% correct frequency estimates. To test how accuracy declines at lower read depth, data sets of 1×, 2×,…, 39× coverage were simulated by randomly sampling 50 subsets each of 1/40, 2/40,… 39/40 of the total available reads. The results, plotted in Supplementary Figure 5, show that accuracy declines only slightly with reduced read depth down to about 25×. Six strains were included in our pool, resulting in ∼6.5× read depth per strain across the 4.5 Mbp genome, but frequency estimates would still be fairly accurate down to ∼4× read depth per strain. This may be used to guide the design of similar experiments involving genomes of different size, although precise estimates would require further experimentation.
  9 in total

1.  Genome-wide association mapping in bacteria?

Authors:  Daniel Falush; Rory Bowden
Journal:  Trends Microbiol       Date:  2006-06-19       Impact factor: 17.079

Review 2.  Multilocus sequence typing of bacteria.

Authors:  Martin C J Maiden
Journal:  Annu Rev Microbiol       Date:  2006       Impact factor: 15.500

3.  Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Authors:  Heng Li; Jue Ruan; Richard Durbin
Journal:  Genome Res       Date:  2008-08-19       Impact factor: 9.043

4.  Frequent emergence and limited geographic dispersal of methicillin-resistant Staphylococcus aureus.

Authors:  Ulrich Nübel; Philippe Roumagnac; Mirjam Feldkamp; Jae-Hoon Song; Kwan Soo Ko; Yhu-Chering Huang; Geoffrey Coombs; Margaret Ip; Henrik Westh; Robert Skov; Marc J Struelens; Richard V Goering; Birgit Strommenger; Annette Weller; Wolfgang Witte; Mark Achtman
Journal:  Proc Natl Acad Sci U S A       Date:  2008-09-04       Impact factor: 11.205

Review 5.  Anthrax molecular epidemiology and forensics: using the appropriate marker for different evolutionary scales.

Authors:  Paul Keim; Matthew N Van Ert; Talima Pearson; Amy J Vogler; Lynn Y Huynh; David M Wagner
Journal:  Infect Genet Evol       Date:  2004-09       Impact factor: 3.342

6.  Phylogenetic discovery bias in Bacillus anthracis using single-nucleotide polymorphisms from whole-genome sequencing.

Authors:  Talima Pearson; Joseph D Busch; Jacques Ravel; Timothy D Read; Shane D Rhoton; Jana M U'Ren; Tatum S Simonson; Sergey M Kachur; Rebecca R Leadem; Michelle L Cardon; Matthew N Van Ert; Lynn Y Huynh; Claire M Fraser; Paul Keim
Journal:  Proc Natl Acad Sci U S A       Date:  2004-09-03       Impact factor: 11.205

7.  Versatile and open software for comparing large genomes.

Authors:  Stefan Kurtz; Adam Phillippy; Arthur L Delcher; Michael Smoot; Martin Shumway; Corina Antonescu; Steven L Salzberg
Journal:  Genome Biol       Date:  2004-01-30       Impact factor: 13.583

8.  Salmonella typhi, the causative agent of typhoid fever, is approximately 50,000 years old.

Authors:  Claire Kidgell; Ulrike Reichard; John Wain; Bodo Linz; Mia Torpdahl; Gordon Dougan; Mark Achtman
Journal:  Infect Genet Evol       Date:  2002-10       Impact factor: 3.342

9.  Multilocus sequence typing for characterization of methicillin-resistant and methicillin-susceptible clones of Staphylococcus aureus.

Authors:  M C Enright; N P Day; C E Davies; S J Peacock; B G Spratt
Journal:  J Clin Microbiol       Date:  2000-03       Impact factor: 5.948

  9 in total
  23 in total

1.  Ecological genomics of Anopheles gambiae along a latitudinal cline: a population-resequencing approach.

Authors:  Changde Cheng; Bradley J White; Colince Kamdem; Keithanne Mockaitis; Carlo Costantini; Matthew W Hahn; Nora J Besansky
Journal:  Genetics       Date:  2011-12-29       Impact factor: 4.562

2.  Multi-sample pooling and illumina genome analyzer sequencing methods to determine gene sequence variation for database development.

Authors:  Rebecca L Margraf; Jacob D Durtschi; Shale Dames; David C Pattison; Jack E Stephens; Rong Mao; Karl V Voelkerding
Journal:  J Biomol Tech       Date:  2010-09

3.  The next generation of molecular markers from massively parallel sequencing of pooled DNA samples.

Authors:  Andreas Futschik; Christian Schlötterer
Journal:  Genetics       Date:  2010-05-10       Impact factor: 4.562

4.  Genetic signature of bacterial pathogen adaptation during chronic pulmonary infections.

Authors:  Steve P Bernier; Matthew L Workentine; Michael G Surette
Journal:  Nat Genet       Date:  2014-01       Impact factor: 38.330

5.  Whole-genome comparison of Aspergillus fumigatus strains serially isolated from patients with aspergillosis.

Authors:  Daisuke Hagiwara; Hiroki Takahashi; Akira Watanabe; Azusa Takahashi-Nakaguchi; Susumu Kawamoto; Katsuhiko Kamei; Tohru Gonoi
Journal:  J Clin Microbiol       Date:  2014-09-17       Impact factor: 5.948

6.  Single Nucleotide Polymorphism (SNP) Detection and Genotype Calling from Massively Parallel Sequencing (MPS) Data.

Authors:  Yun Li; Wei Chen; Eric Yi Liu; Yi-Hui Zhou
Journal:  Stat Biosci       Date:  2013-05

7.  Evolutionary conservation of essential and highly expressed genes in Pseudomonas aeruginosa.

Authors:  Andreas Dötsch; Frank Klawonn; Michael Jarek; Maren Scharfe; Helmut Blöcker; Susanne Häussler
Journal:  BMC Genomics       Date:  2010-04-09       Impact factor: 3.969

8.  GenHtr: a tool for comparative assessment of genetic heterogeneity in microbial genomes generated by massive short-read sequencing.

Authors:  Gongxin Yu
Journal:  BMC Bioinformatics       Date:  2010-10-12       Impact factor: 3.169

9.  Single-feature polymorphism discovery by computing probe affinity shape powers.

Authors:  Wayne Wenzhong Xu; Seungho Cho; S Samuel Yang; Yung-Tsi Bolon; Hatice Bilgic; Haiyan Jia; Yanwen Xiong; Gary J Muehlbauer
Journal:  BMC Genet       Date:  2009-08-26       Impact factor: 2.797

10.  Empirical validation of pooled whole genome population re-sequencing in Drosophila melanogaster.

Authors:  Yuan Zhu; Alan O Bergland; Josefa González; Dmitri A Petrov
Journal:  PLoS One       Date:  2012-07-26       Impact factor: 3.240

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.