Sensitive, specific polymorphism discovery in bacteria using massively parallel sequencing.

Chad Nusbaum, Toshiro K Ohsumi, James Gomez, John Aquadro, Thomas C Victor, Robert M Warren, Deborah T Hung, Bruce W Birren, Eric S Lander, David B Jaffe.

Abstract

Our variant ascertainment algorithm, VAAL, uses massively parallel DNA sequence data to identify differences between bacterial genomes with high sensitivity and specificity. VAAL detected approximately 98% of differences (including large insertion-deletions) between pairs of strains from three species while calling no false positives. VAAL also pinpointed a single mutation between Vibrio cholerae genomes, identifying an antibiotic's site of action by identifying sequence differences between drug-sensitive strains and drug-resistant derivatives.

Year:  2008        PMID: 19079253      PMCID: PMC2613166          DOI: 10.1038/nmeth.1286

Source DB:  PubMed          Journal:  Nat Methods        ISSN: 1548-7091            Impact factor:   28.547


The bacterial world that surrounds us is of enormous importance for studies ranging from human medicine to environmental ecology. Even within ‘species’ of bacteria, there is enormous genetic variation. The ability to routinely compare the genomes of many bacterial strains and identify their differences is thus of tremendous value for understanding their impact on human health and biology. In this work we address two key applications: comparison of a new strain with a previously sequenced reference strain, and comparison of a laboratory-induced mutant strain with its parental strain. An example of the former would be the comparison of multiple clinical isolates of tuberculosis, which might differ in their resistance to various drugs. We expect to find genetic variation at a number of sites across the genome; the critical genetic differences underlying a phenotype become apparent because they are seen in multiple strains. An example of the latter would be a mutant strain resistant to a particular antibiotic. We expect only one mutation in such a comparison, which can be used to pinpoint the protein target of the antibiotic. This is valuable because there is currently no simple way to identify the site of action of a drug.
Pioneering work on these applications used available technology, often in multi-stage processes. For example, whole-genome shotgun Sanger sequencing was followed by directed resequencing [1,2], or an initial microarray screen was followed by a validation microarray [3]. In each case the second stage was designed based on the results of the first stage. With the advent of massively parallel sequencing methods [4–7], a streamlined approach is feasible, facilitating routine comparison at dramatically lower cost. Indeed, substantial sequence for a bacterial strain may now be generated for roughly $1,000, and the costs are likely to drop further.
We require new computational methods to take advantage of the new inexpensive data types and deliver highly accurate results. Methods are needed that can find the full spectrum of variation, including insertions and deletions of arbitrary size, since all may be phenotypically important. Yet a single base change can confer antibiotic resistance in bacteria, a ‘needle in the haystack’ problem that exemplifies the need for high specificity. While initial work [8–13] on this problem has been promising, there has been no systematic and controlled investigation of general variation detection using new technology data. For example, insertion/deletion (indel) detection has not yet been systematically tested on real data.
We developed a new ‘variant ascertainment algorithm’, VAAL, and applied it to a series of comparisons, including between naturally occurring strains, and between antibiotic-sensitive and derived resistant strains. All the experiments were controlled, facilitating rigorous assessment of answers. For each strain we generated a single lane of unpaired 36 base reads from the Illumina platform, thereby minimizing costs. The source code for VAAL and all experimental data sets are made freely available (Supplementary Resources). All applications of VAAL were carried out with the distributed version of the code and default arguments.
VAAL takes as input short reads from a ‘sample genome’ and a known sequence for a related ‘reference genome’. VAAL assembles the reads [14,15], assisted by the reference genome. This ‘assisted assembly’ indicates which bases are ‘trusted’ and which are not. By comparing the assisted assembly of the sample genome to the reference genome, VAAL deduces as output a list of differences between them. There are several steps in the algorithm (Supplementary Methods).
Briefly, we assign each sample read an approximate position on the reference genome and then group reads by position. Next we assemble each group, then glue the assemblies along regions where the sample and reference genomes agree, yielding a single assembly of the sample genome. Next we designate as trusted those bases in the assembly that are strongly supported by the reads. Finally, to identify polymorphisms, we compare the assembly of the sample reads to the reference genome and identify differences between them that lie in trusted parts of the assembly. To do this, we identify regions containing disagreements between the assembly and the reference that are flanked by regions where the two agree. These disagreements include substitutions, insertions, deletions, and more complex changes (multiple nearby differences) that we refer to as composites. All are reported as polymorphisms by VAAL. In the text, composites are treated as single polymorphisms, but VAAL also provides output in which each composite is parsed into substitutions, insertions, and deletions.
We designed VAAL to discover sequence differences between related genomes with high sensitivity and specificity. To demonstrate this, we compared sequence data from ‘sample genomes’ to finished reference sequences for related isolates of the same species. Finished genomes were used so that the true answer was known in all cases. The dataset for this work consisted of 36 base Illumina reads from three previously finished bacterial genomes (S. aureus, E. coli and M. tuberculosis; Supplementary Table 1).
Genomes contain repetitive sequences longer than the reads, within which polymorphisms cannot be called unambiguously. To avoid these ambiguities, VAAL defines uncallable regions of the genome, in which polymorphisms cannot be called unambiguously with the given data, and declares the remainder of the genome to be callable (Supplementary Methods). The size of the callable fraction depends on the amount of duplicated sequence in the genome, on read length, and on the minimum overlap K (here, 28). For the data and genomes used here, the callable fraction ranges from 91.6% to 96.7% (Table 1). In evaluating the algorithm, we defined the set of callable polymorphisms, which reflect unambiguous differences, as those residing in the callable regions, and used these to evaluate sensitivity (Table 1).
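The final comparison step described above (finding disagreements flanked by agreement, and reporting nearby events as one composite) can be illustrated with a minimal sketch. This is not the VAAL implementation: it assumes that a gapped pairwise alignment of reference and assembly is already available as two equal-length strings with `-` for gaps, it ignores the notion of trusted bases, and the names `Event`, `classify` and `call_differences` are hypothetical. The merge distance of 28 is taken from the text; treating "within 28 bases" as "at most 28 agreeing bases apart" is an assumption.

```python
from dataclasses import dataclass

K = 28  # merge distance from the text; the exact boundary rule is an assumption


@dataclass
class Event:
    ref_start: int    # 0-based reference coordinate where the event begins
    ref_end: int      # 0-based reference coordinate just past the event
    kind: str         # "substitution", "insertion", "deletion" or "composite"
    ref_seq: str      # reference bases spanned by the event
    sample_seq: str   # assembly bases replacing them


def classify(ref_seq, sample_seq):
    """Name an event from the ungapped sequences on each side."""
    if not ref_seq:
        return "insertion"
    if not sample_seq:
        return "deletion"
    if len(ref_seq) == 1 and len(sample_seq) == 1:
        return "substitution"
    return "composite"


def call_differences(ref_aln, sample_aln, k=K):
    """List differences between two gapped, equal-length alignment rows."""
    assert len(ref_aln) == len(sample_aln)
    # Pass 1: maximal runs of disagreeing columns, as (col_start, col_end).
    runs, start = [], None
    for col, (r, s) in enumerate(zip(ref_aln, sample_aln)):
        if r != s and start is None:
            start = col
        elif r == s and start is not None:
            runs.append((start, col))
            start = None
    # Runs touching either end of the alignment are not flanked by agreement:
    # a run still open at the end was never appended; one at column 0 is dropped.
    runs = [run for run in runs if run[0] > 0]
    # Pass 2: merge runs separated by at most k agreeing columns.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= k:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    # Pass 3: convert alignment columns into reference coordinates and sequences.
    events = []
    for c0, c1 in merged:
        ref_seq = ref_aln[c0:c1].replace("-", "")
        sample_seq = sample_aln[c0:c1].replace("-", "")
        ref_start = c0 - ref_aln[:c0].count("-")
        events.append(Event(ref_start, ref_start + len(ref_seq),
                            classify(ref_seq, sample_seq), ref_seq, sample_seq))
    return events


# Toy alignment: one substitution and one 2-base insertion, 4 bases apart.
ref_row = "AAAACAAAA--AAAATAAAA"
sample_row = "AAAAGAAAAGGAAAATAAAA"
for event in call_differences(ref_row, sample_row, k=2):
    print(event)
for event in call_differences(ref_row, sample_row):  # default k merges them
    print(event)
```

With a small merge distance the toy alignment yields two separate events; with the default distance the same two events fall within range of each other and are reported as a single composite.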
Table 1

Results of polymorphism discovery using VAAL

| sample organism | related strain (Genbank accession) | callable fraction of genome | callable SNPs | called SNPs | callable indels | called indels | callable composites | called composites | false positives |
|---|---|---|---|---|---|---|---|---|---|
| S. aureus USA 300 | COL NC_002951.2 | 96.5% | 687 | 684 | 63 | 60 | 116 | 104 | 0 |
| E. coli K12 MG1655 | DH10B CP000948.1 | 91.6% | 116 | 116 | 10 | 10 | 13 | 13 | 0 |
| M. tuberculosis F11 | H37Rv AL123456.2 | 96.7% | 757 | 736 | 65 | 54 | 19 | 18 | 0 |

Reads from three bacterial strains were compared to reference sequences from other strains using VAAL. Polymorphisms were grouped in three categories: SNP, indel or complex changes (composite). Events that are within 28 bases of each other are reported as a single composite event. False positives: incorrect polymorphisms reported by the algorithm. None were found.
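As a worked check of the arithmetic behind Table 1, the short sketch below computes per-category sensitivity as called divided by callable. The counts are copied from the table; the dictionary layout and the function names `sensitivity_percent` and `summarize` are illustrative choices, and treating every called polymorphism as a member of the callable set is a simplifying assumption.

```python
# Counts copied from Table 1 as (callable, called) per polymorphism category.
TABLE1 = {
    "S. aureus USA 300": {"SNP": (687, 684), "indel": (63, 60), "composite": (116, 104)},
    "E. coli K12 MG1655": {"SNP": (116, 116), "indel": (10, 10), "composite": (13, 13)},
    "M. tuberculosis F11": {"SNP": (757, 736), "indel": (65, 54), "composite": (19, 18)},
}


def sensitivity_percent(callable_count, called_count):
    """Percentage of callable polymorphisms that were called."""
    return 100.0 * called_count / callable_count


def summarize(table):
    """Map each strain and category to its sensitivity in percent."""
    return {
        strain: {cat: sensitivity_percent(*counts) for cat, counts in cats.items()}
        for strain, cats in table.items()
    }


for strain, cats in summarize(TABLE1).items():
    for cat, pct in cats.items():
        print(f"{strain:22s} {cat:10s} {pct:5.1f}%")
```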

First, we used VAAL to compare the sample reads to the finished sequences of the genomes from which they were generated. Importantly, no polymorphisms were called. This demonstrates high specificity: any reported polymorphisms would have been false positives.
Next we performed the actual between-strain comparison with VAAL, and evaluated the performance on the callable polymorphisms (Table 1). For E. coli (moderate GC content, 51%), VAAL found 100% of callable polymorphisms. The called insertions included three exceeding the read length, of sizes 93, 390, and 970 bp. For S. aureus (low GC content, 33%), VAAL found >99% of substitutions, 95% of indels, and 90% of composites. For M. tuberculosis (high GC content, 66%), VAAL found 97% of substitutions, 83% of indels, and 95% of composites. In summary, VAAL showed high sensitivity, finding nearly all of the callable polymorphisms. In no case were any false positives called. There were two cases where bona fide polymorphisms were called that lay outside the set of callable polymorphisms. The mechanism for these rare events is explained in Supplementary Methods. VAAL performed well even on complex composite events (Supplementary Methods). We also investigated the failure modes of the few polymorphisms that were missed, and how coverage influences the fraction of missed polymorphisms (Supplementary Methods).
We note that the three bacterial genomes employed have GC content ranging from 33% to 66%, a range encompassing ~80% of previously sequenced bacteria (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). Testing behavior over a range of sequence content is important because the behavior of sequencing systems varies considerably as a function of GC content.
We next attempted to discover single base differences between a parental strain and a derived mutant strain. In this case, we sequenced both the parental and mutant strains and compared them to a known reference strain.
We compared each of five Vibrio cholerae isolates sensitive to the antibiotic rifampicin with derived rifampicin-resistant isolates (Supplementary Methods), anticipating that each resistant isolate would likely have acquired a single resistance mutation. To do this, for each of the five sensitive-resistant pairs, we sequenced both strains and assembled them with VAAL, assisted by a single finished reference (Supplementary Table 4). Then we used VAAL to compare the two assemblies to each other, yielding a list of differences (Table 2). For all five sensitive-resistant pairs, we found exactly one difference in the entire genome. In all cases, the difference was in the rpoB gene, which encodes the β subunit of RNA polymerase, rifampicin’s known target (Supplementary Results).
Table 2

Vibrio cholerae: observed differences between resistant isolates and their sensitive parents

| resistant isolate | difference with sensitive culture |
|---|---|
| JA-G-02 | 341920 (rpoB) A → T, D516 Asp → Val |
| RIF3-1 | 341920 (rpoB) A → T, D516 Asp → Val |
| RIF4-1 | 341949 (rpoB) C → T, H526 His → Tyr |
| RIF5-1 | 341920 (rpoB) A → T, D516 Asp → Val |
| RIF6-1 | 341949 (rpoB) C → T, H526 His → Tyr |

Five rifampicin-resistant isolates were compared to the five sensitive cultures in which the mutations originated (pairs of rows in Supplementary Table 4), using an intermediate reference (AE003852.1 and AE003853.1) to facilitate the comparison. In each case VAAL found exactly one mutation, and it was on AE003852.1. For each mutation, the table shows its one-based coordinate, the gene in which it occurred (rpoB), the E. coli-based numbering of the corresponding codon (D516 or H526), and the resulting amino acid change.
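The base and amino acid changes in Table 2 can be checked for internal consistency against the standard genetic code. The sketch below enumerates which codons allow a given single-base change to produce a given amino acid change. It does not use the actual V. cholerae rpoB codons, which are not given here, and it assumes that the base change is reported on the coding strand; the function name `single_base_routes` is hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def single_base_routes(aa_from, aa_to, base_from, base_to):
    """Return (codon, mutant codon, 1-based codon position) for every way a
    single base_from -> base_to change turns aa_from into aa_to."""
    routes = []
    for codon, aa in CODON_TABLE.items():
        if aa != aa_from:
            continue
        for pos, base in enumerate(codon):
            if base != base_from:
                continue
            mutant = codon[:pos] + base_to + codon[pos + 1:]
            if CODON_TABLE[mutant] == aa_to:
                routes.append((codon, mutant, pos + 1))
    return routes


# Asp -> Val via A -> T, and His -> Tyr via C -> T (one-letter codes D, V, H, Y).
print(single_base_routes("D", "V", "A", "T"))
print(single_base_routes("H", "Y", "C", "T"))
```

Under these assumptions, both reported amino acid changes are reachable by the reported single-base substitutions: Asp to Val through the second codon position, and His to Tyr through the first.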

We compared VAAL to other algorithms (Supplementary Results): MAQ and VAAL call different numbers of single-nucleotide polymorphisms (SNPs), suggesting distinct advantages; MAQ does not call indels from unpaired reads, the read type used in this paper; and a Velvet-VAAL hybrid yielded results comparable to VAAL at high coverage but inferior results at lower coverage, as expected since Velvet assembles de novo.
Sensitive and specific polymorphism discovery with single-molecule sequencing data is a major open problem. VAAL approaches it by grouping reads and then performing a local assembly, taking into account the relationships between individual reads and the reference, and between the reads themselves. Thus, the algorithm can find differences (e.g. long insertions) that cannot be found by methods dependent upon alignments of single reads to the reference. Importantly, VAAL does not contain special rules for each polymorphism type (SNP, SNP cluster, indel, etc.) and thus can discover all differences without presuppositions.
We have demonstrated the application of massively parallel sequencing to two critical problems in bacterial genetics. First, we showed reliable detection of sequence differences between two related bacterial strains, including long insertions and deletions, with very high sensitivity and no observed false positives, across a wide range of genome compositions, using data from only a single lane of unpaired 36 base Illumina reads. Second, we demonstrated discovery of a single mutation conferring antibiotic resistance, without prior knowledge of its location. The first application enables rapid discovery of variations underlying medically or biologically important phenotypes. The second application enables the rapid discovery of the targets of current or newly developed antibiotics, by analyzing a series of resistant mutants. While VAAL requires relatively high coverage, this coverage is inexpensive.
Costs are now sufficiently low that analysis of hundreds or thousands of bacterial strains is practical. In its present form, VAAL works on haploid genomes, providing highly sensitive and specific polymorphism detection for bacteria. Generalization of the algorithm to large, complex, diploid genomes is the next important goal. In particular, it should soon be practical to sequence targeted subsets (such as regions or exons) from hundreds of human samples.
Supplementary Figure 1. Regional assembly
Supplementary Table 1. Bacterial data sets used to test polymorphism discovery with VAAL
Supplementary Table 2. Nature of called, uncalled polymorphisms of M. tuberculosis F11 vs H37Rv
Supplementary Table 3. Effect of depth of sequence coverage on discovery of polymorphisms in M. tuberculosis F11 vs H37Rv
Supplementary Table 4. Vibrio cholerae isolates used to find rifampicin-resistance mutations
Supplementary Tables 5–6. Comparison of three methods for detecting polymorphisms
Supplementary Table 7. Effect of depth of sequence coverage on discovery of polymorphisms in M. tuberculosis F11 vs H37Rv, using VAAL and Velvet + VAAL steps 3,4
Supplementary Table 8. Vibrio cholerae: observed differences between resistant isolates and their sensitive parents, found using Velvet (+ VAAL steps 3,4)
Supplementary Methods
Supplementary Results
Supplementary References
Supplementary Resources
References (15 in total)

1.  An Eulerian path approach to DNA fragment assembly.

Authors:  P A Pevzner; H Tang; M S Waterman
Journal:  Proc Natl Acad Sci U S A       Date:  2001-08-14       Impact factor: 11.205

2.  Microbiology. TB--a new target, a new drug.

Authors:  Stewart T Cole; Pedro M Alzari
Journal:  Science       Date:  2005-01-14       Impact factor: 47.728

3.  Accurate multiplex polony sequencing of an evolved bacterial genome.

Authors:  Jay Shendure; Gregory J Porreca; Nikos B Reppas; Xiaoxia Lin; John P McCutcheon; Abraham M Rosenbaum; Michael D Wang; Kun Zhang; Robi D Mitra; George M Church
Journal:  Science       Date:  2005-08-04       Impact factor: 47.728

4.  Comprehensive mutation identification in an evolved bacterial cooperator and its cheating ancestor.

Authors:  Gregory J Velicer; Günter Raddatz; Heike Keller; Silvia Deiss; Christa Lanz; Iris Dinkelacker; Stephan C Schuster
Journal:  Proc Natl Acad Sci U S A       Date:  2006-05-17       Impact factor: 11.205

5.  Gene sequencing. The race for the $1000 genome.

Authors:  Robert F Service
Journal:  Science       Date:  2006-03-17       Impact factor: 47.728

6.  Whole-genome sequencing and variant discovery in C. elegans.

Authors:  LaDeana W Hillier; Gabor T Marth; Aaron R Quinlan; David Dooling; Ginger Fewell; Derek Barnett; Paul Fox; Jarret I Glasscock; Matthew Hickenbotham; Weichun Huang; Vincent J Magrini; Ryan J Richt; Sacha N Sander; Donald A Stewart; Michael Stromberg; Eric F Tsung; Todd Wylie; Tim Schedl; Richard K Wilson; Elaine R Mardis
Journal:  Nat Methods       Date:  2008-01-20       Impact factor: 28.547

7.  Single-molecule DNA sequencing of a viral genome.

Authors:  Timothy D Harris; Phillip R Buzby; Hazen Babcock; Eric Beer; Jayson Bowers; Ido Braslavsky; Marie Causey; Jennifer Colonell; James Dimeo; J William Efcavitch; Eldar Giladi; Jaime Gill; John Healy; Mirna Jarosz; Dan Lapen; Keith Moulton; Stephen R Quake; Kathleen Steinmann; Edward Thayer; Anastasia Tyurina; Rebecca Ward; Howard Weiss; Zheng Xie
Journal:  Science       Date:  2008-04-04       Impact factor: 47.728

8.  Mutation discovery in bacterial genomes: metronidazole resistance in Helicobacter pylori.

Authors:  Thomas J Albert; Daiva Dailidiene; Giedrius Dailide; Jason E Norton; Awdhesh Kalia; Todd A Richmond; Michael Molla; Jaz Singh; Roland D Green; Douglas E Berg
Journal:  Nat Methods       Date:  2005-11-18       Impact factor: 28.547

9.  Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori.

Authors:  R A Alm; L S Ling; D T Moir; B L King; E D Brown; P C Doig; D R Smith; B Noonan; B C Guild; B L deJonge; G Carmel; P J Tummino; A Caruso; M Uria-Nickelsen; D M Mills; C Ives; R Gibson; D Merberg; S D Mills; Q Jiang; D E Taylor; G F Vovis; T J Trust
Journal:  Nature       Date:  1999-01-14       Impact factor: 49.962

10.  Accurate whole human genome sequencing using reversible terminator chemistry.

Authors:  David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; Kevin P Hall; Dirk J Evers; Colin L Barnes; Helen R Bignell; Jonathan M Boutell; Jason Bryant; Richard J Carter; R Keira Cheetham; Anthony J Cox; Darren J Ellis; Michael R Flatbush; Niall A Gormley; Sean J Humphray; Leslie J Irving; Mirian S Karbelashvili; Scott M Kirk; Heng Li; Xiaohai Liu; Klaus S Maisinger; Lisa J Murray; Bojan Obradovic; Tobias Ost; Michael L Parkinson; Mark R Pratt; Isabelle M J Rasolonjatovo; Mark T Reed; Roberto Rigatti; Chiara Rodighiero; Mark T Ross; Andrea Sabot; Subramanian V Sankar; Aylwyn Scally; Gary P Schroth; Mark E Smith; Vincent P Smith; Anastassia Spiridou; Peta E Torrance; Svilen S Tzonev; Eric H Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D Alam; Carole Anastasi; Ify C Aniebo; David M D Bailey; Iain R Bancarz; Saibal Banerjee; Selena G Barbour; Primo A Baybayan; Vincent A Benoit; Kevin F Benson; Claire Bevis; Phillip J Black; Asha Boodhun; Joe S Brennan; John A Bridgham; Rob C Brown; Andrew A Brown; Dale H Buermann; Abass A Bundu; James C Burrows; Nigel P Carter; Nestor Castillo; Maria Chiara E Catenazzi; Simon Chang; R Neil Cooley; Natasha R Crake; Olubunmi O Dada; Konstantinos D Diakoumakos; Belen Dominguez-Fernandez; David J Earnshaw; Ugonna C Egbujor; David W Elmore; Sergey S Etchin; Mark R Ewan; Milan Fedurco; Louise J Fraser; Karin V Fuentes Fajardo; W Scott Furey; David George; Kimberley J Gietzen; Colin P Goddard; George S Golda; Philip A Granieri; David E Green; David L Gustafson; Nancy F Hansen; Kevin Harnish; Christian D Haudenschild; Narinder I Heyer; Matthew M Hims; Johnny T Ho; Adrian M Horgan; Katya Hoschler; Steve Hurwitz; Denis V Ivanov; Maria Q Johnson; Terena James; T A Huw Jones; Gyoung-Dong Kang; Tzvetana H Kerelska; Alan D Kersey; Irina Khrebtukova; Alex P Kindwall; Zoya Kingsbury; Paula I Kokko-Gonzales; Anil Kumar; Marc A Laurent; Cynthia T Lawley; Sarah E Lee; Xavier Lee; 
Arnold K Liao; Jennifer A Loch; Mitch Lok; Shujun Luo; Radhika M Mammen; John W Martin; Patrick G McCauley; Paul McNitt; Parul Mehta; Keith W Moon; Joe W Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M Novo; Michael J O'Neill; Mark A Osborne; Andrew Osnowski; Omead Ostadan; Lambros L Paraschos; Lea Pickering; Andrew C Pike; Alger C Pike; D Chris Pinkard; Daniel P Pliskin; Joe Podhasky; Victor J Quijano; Come Raczy; Vicki H Rae; Stephen R Rawlings; Ana Chiva Rodriguez; Phyllida M Roe; John Rogers; Maria C Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K Roth; Natalie J Rourke; Silke T Ruediger; Eli Rusman; Raquel M Sanches-Kuiper; Martin R Schenker; Josefina M Seoane; Richard J Shaw; Mitch K Shiver; Steven W Short; Ning L Sizto; Johannes P Sluis; Melanie A Smith; Jean Ernest Sohna Sohna; Eric J Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L Tregidgo; Gerardo Turcatti; Stephanie Vandevondele; Yuli Verhovsky; Selene M Virk; Suzanne Wakelin; Gregory C Walcott; Jingwen Wang; Graham J Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C Mullikin; Matthew E Hurles; Nick J McCooke; John S West; Frank L Oaks; Peter L Lundberg; David Klenerman; Richard Durbin; Anthony J Smith
Journal:  Nature       Date:  2008-11-06       Impact factor: 49.962
