
The advantages of SMRT sequencing.

Richard J Roberts, Mauricio O Carneiro, Michael C Schatz.

Abstract

Of the current next-generation sequencing technologies, SMRT sequencing is sometimes overlooked. However, attributes such as long reads, modified base detection and high accuracy make SMRT a useful technology and an ideal approach to the complete sequencing of small genomes.

Year: 2013 PMID: 23822731 PMCID: PMC3953343 DOI: 10.1186/gb-2013-14-6-405

Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583

Pacific Biosciences' single molecule, real-time sequencing technology, SMRT, is one of several next-generation sequencing technologies that are currently in use. In the past, it has been somewhat overlooked because of its lower throughput compared with methods such as Illumina and Ion Torrent, and because of persistent rumors that it is inaccurate. Here, we seek to dispel these misconceptions and show that SMRT is indeed a highly accurate method with many advantages when used to sequence small genomes, including the possibility of facile closure of bacterial genomes without additional experimentation. We also highlight its value in being able to detect modified bases in DNA.

Extending read lengths

So-called next-generation technologies for sequencing DNA are penetrating every aspect of biology thanks to the immense amount of information that is encoded within nucleic acid sequences. However, today's next-generation sequencing technologies, such as Illumina, 454 and Ion Torrent, have several significant limitations, especially short read lengths and amplification biases, that restrict our ability to fully sequence genomes. Unfortunately, with the rise of next-generation sequencing, even less emphasis is being placed on trying to understand at the biological and biochemical levels just what functions newly discovered genes have and how those functions allow an organism to work, which is surely why we are sequencing DNA in the first place. Now a new technology, SMRT sequencing from Pacific Biosciences [1], has been developed that not only produces considerably longer and highly accurate DNA sequences from individual unamplified molecules, but can also show where methylated bases occur [2] (and thereby provide functional information about the DNA methyltransferases encoded by the genome). SMRT sequencing is a sequencing-by-synthesis technology based on real-time imaging of fluorescently tagged nucleotides as they are synthesized along individual DNA template molecules. Because the technology uses a DNA polymerase to drive the reaction, and because it images single molecules, there is no degradation of signal over time. Instead, the sequencing reaction ends when the template and polymerase dissociate. As a result, instead of the uniform read length seen with other technologies, the read lengths have an approximately log-normal distribution with a long tail. The average read length from the current PacBio RS instrument is about 3,000 bp, but some reads may be 20,000 bp or longer. 
This is roughly 30 to 200 times longer than the read length from a next-generation sequencing instrument, and more than a four-fold improvement since the original release of the instrument two years ago. It is notable that the recently announced PacBio RS II platform claims to have a further four-fold improvement, with twice the mean read length and twice the throughput of the current machine.
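
The shape of such a read-length distribution can be explored with a short simulation. The sketch below assumes a log-normal distribution whose mean is the 3,000 bp quoted above; the spread parameter SIGMA = 0.8 and the function names are illustrative assumptions, not measured properties of the instrument, so the tail fraction it reports is indicative only.

```python
import math
import random

def lognormal_mu(mean_bp, sigma):
    """Return mu so that a log-normal with this mu and sigma has mean mean_bp."""
    return math.log(mean_bp) - sigma ** 2 / 2

def tail_fraction(threshold_bp, mu, sigma):
    """Expected fraction of reads at least threshold_bp long (log-normal survival function)."""
    z = (math.log(threshold_bp) - mu) / (sigma * math.sqrt(2))
    return 0.5 * math.erfc(z)

def simulate_read_lengths(n_reads, mu, sigma, seed=0):
    """Draw n_reads read lengths (in bp) from the log-normal model."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n_reads)]

SIGMA = 0.8                      # assumed spread, chosen only for illustration
MU = lognormal_mu(3000, SIGMA)   # mean read length of about 3,000 bp, from the text

reads = simulate_read_lengths(100_000, MU, SIGMA)
mean_bp = sum(reads) / len(reads)
median_bp = sorted(reads)[len(reads) // 2]

print(f"simulated mean read length:   {mean_bp:,.0f} bp")
print(f"simulated median read length: {median_bp:,.0f} bp")
print(f"expected fraction >= 20 kb:   {tail_fraction(20_000, MU, SIGMA):.4f}")
```

Because the distribution is right-skewed, the median falls below the mean, and a small fraction of reads extends far beyond both. That tail is what makes very long reads available even when the average is a few kilobases.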

Applications of SMRT sequencing

The SMRT approach to sequencing has several advantages. First, consider the impact of the longer reads, especially for de novo assemblies of novel genomes. While typical next-generation sequencing can provide abundant coverage of a genome, the short read lengths and amplification biases of those technologies can lead to fragmented assemblies whenever a complex repeat or poorly amplified region is encountered. As a result, GC-rich and GC-poor regions, which tend to be poorly amplified, are particularly susceptible to poor quality sequencing. Resolving fragmented assemblies requires additional costly bench work and further sequencing. By also including the longer reads of SMRT sequencing runs, the read set will span many more repeats and missing bases, thereby closing many of the gaps automatically and simplifying, or even eliminating, the finishing time (Figure 1). It is becoming routine for bacterial genomes to be completely assembled using this approach [3,4], and we expect this practice will translate to larger genomes in the near future. A complete genome is far more useful than the poor quality draft sequences that litter GenBank because it provides a complete blueprint for the organism; the genes encoded therein represent the full biological potential of that organism. With only draft assemblies available, one is always left with the nagging feeling that some crucial gene is missing - perhaps the one in which you are most interested! The long read lengths also have more power to reveal complex structural variations present in DNA samples, such as pinpointing precisely where copy number variations have occurred relative to the reference sequence [5]. They are also extremely powerful for resolving complex RNA splicing patterns from cDNA libraries, since a single long read may contain the entire transcript end-to-end, thus eliminating the need to infer the isoforms [6].
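
The link between read length and gap closure can be sketched with a simple rule: a repeat can be resolved only when a single read covers the whole repeat plus some unique sequence on each side. The repeat lengths and the 50 bp anchor in the example below are hypothetical values chosen for illustration, and the rule ignores coverage and sequencing errors.

```python
def unresolved_repeats(repeat_lengths_bp, read_length_bp, anchor_bp=50):
    """Return the repeats that a read of read_length_bp cannot span.

    A repeat counts as spanned when the read covers it plus anchor_bp of
    unique flanking sequence on both sides (anchor_bp is an assumed value).
    """
    return [r for r in repeat_lengths_bp if r + 2 * anchor_bp > read_length_bp]

# Hypothetical repeat catalogue for a small genome, in bp.
repeats_bp = [120, 450, 900, 1500, 5200, 5200, 7000]

for read_length_bp in (100, 1000, 5000, 20000):
    left = unresolved_repeats(repeats_bp, read_length_bp)
    print(f"{read_length_bp:>6} bp reads: {len(left)} of {len(repeats_bp)} repeats unresolved")
```

Each unresolved repeat is a place where an assembler may have to break a contig, so the count shrinks as reads lengthen and reaches zero once reads exceed the longest repeat plus its anchors.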

Figure 1

Idealized assembly graphs [18] of a 5.2 megabase-pair circular chromosome. The graphs encode the compressed de Bruijn graph derived from infinite coverage error-free reads, effectively representing the repeats in the genome and the upper bound of what could be achieved in a real assembly. Increasing the read length decreases the number of contigs because the longer reads will span more of the repeats. Note the assembly with 5,000 bp reads has a self-edge because the chromosome is circular.

Second, consider DNA methyltransferases. These can exist as solitary entities or as parts of restriction-modification systems. In both cases, they methylate relatively short sequence motifs that can easily be recognized from SMRT sequencing data, because the presence of epigenetic modifications changes the kinetics of the DNA polymerase as it moves along the template molecule. The altered kinetics cause a change in the timing of when the fluorescent colors are observed, thus enabling direct detection of epigenetic modifications, which can ordinarily only be inferred, and bypassing the usual necessity of enrichment or chemical conversion. Often, thanks to bioinformatics, the gene responsible for any given modification can be matched to the sequence motif in which the modification lies [7,8]. When it cannot, then simply cloning the gene into a plasmid, which is subsequently grown in a non-modifying host and re-sequenced, can provide the match [9]. Moreover, SMRT sequencing has also been able to identify RNA base modifications through the same approach as DNA base modifications, but using a reverse transcriptase in place of the DNA polymerase [10]. In fact, SMRT sequencing represents an important step toward uncovering the biology that happens between DNA and proteins, including not only the study of mRNA sequences but also the regulation of translation [11,12]. Thus, functional information emerges directly from the SMRT sequencing approach.

Third, we must consider the persistent rumor that SMRT sequencing is much less accurate than other next-generation sequencing platforms, which has now been demonstrated to be untrue in several ways. First, a direct comparison of several approaches to determining genetic polymorphisms has shown that SMRT sequencing has comparable performance to other sequencing technologies [13]. Second, the accuracy of assembling a complete genome using SMRT sequencing in combination with other technologies has proved to be as reliable and accurate as more traditional approaches [3,6,14]. Moreover, Chin et al. [15] showed that an assembly using only long SMRT sequencing reads achieves comparable or even higher performance than other platforms (99.999% accuracy in three organisms with known reference sequences), including 11 corrections to the Sanger reference of these genomes. Koren et al. [6] showed that most microbial genomes could be assembled into a single contig per chromosome with this approach; it is by far the least expensive option for doing so.

Debunking the error myth

The power of SMRT sequencing data lies both in its long read lengths and in the random nature of the error process (Figure 2). It is true that individual reads contain a higher number of errors: approximately 11% to 14% or Q12 to Q15, compared with Q30 to Q35 from Illumina and other technologies. However, given sufficient depth (8x or more, say), SMRT sequencing provides a highly accurate statistically averaged consensus perspective of the genome, as it is highly unlikely that the same error will be randomly observed multiple times. Notoriously, other platforms have been found to suffer from systematic errors that need to be resolved by complementary methods before the final sequence is produced [16].
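
The effect of depth on randomly distributed errors can be illustrated with a simple binomial model. The sketch below is a deliberately pessimistic simplification and not the consensus algorithm used by any real assembler: it assumes errors are independent between reads, that every erroneous read makes the same wrong call, and that ties count as consensus errors. The per-read error rate of 0.13 is taken from the range quoted above; the function names are illustrative.

```python
import math

def consensus_error(per_read_error, depth):
    """Upper bound on the chance that a majority vote over `depth` reads is wrong.

    Sums the binomial probability that at least half of the reads are
    erroneous at a site, treating all erroneous reads as agreeing with
    each other (a worst-case assumption).
    """
    return sum(
        math.comb(depth, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (depth - k)
        for k in range(depth + 1)
        if 2 * k >= depth
    )

def phred(error_probability):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10 * math.log10(error_probability)

PER_READ_ERROR = 0.13  # within the 11% to 14% per-read range quoted in the text

for depth in (8, 16, 32, 64):
    p = consensus_error(PER_READ_ERROR, depth)
    print(f"{depth:>3}x  consensus error bound {p:.2e}  (about Q{phred(p):.0f})")
```

Under this model the bound falls steeply as depth increases. Real consensus callers should do better than the bound at a given depth, because random errors at a site seldom agree with one another. A systematic error, by contrast, recurs in the same context in every read, so adding depth does not average it away.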

Figure 2

A sequencing context breakdown of the empirical insertion error rate of the two platforms on NA12878 whole genome data. In this figure we show all contexts of size 8 that start with AAAAA. The empirical insertion quality score (y-axis) is PHRED scaled. Despite the higher error rate (approximately Q12) of the PacBio RS instrument, the error is independent of the sequencing context. Other platforms are known to have different error rates for different sequencing contexts. Illumina's HiSeq platform, shown here, has a lower error rate (approximately Q45 across eight independent runs), but contexts such as AAAAAAAA and AAAAACAG have extremely different error rates (Q30 versus Q55). This context-specific error rate creates bias that is not easily clarified by greater sequencing depth. Empirical insertion error rates were measured using the Genome Analysis Toolkit (GATK) - Base Quality Score Recalibration tool.

Another approach that benefits from the stochastic nature of the SMRT error profile is the use of circular consensus reads, where a sequencing read produces multiple observations of the same base in order to generate high-accuracy consensus sequence from single molecules [17]. This strategy trades read length for accuracy, which can be effective in some cases (targeted re-sequencing, small genomes) but is not necessary if one can achieve some redundancy in the sequencing data (8x is recommended). With this redundancy, it is preferable to benefit from the improved mapping of longer inserts than opt for circular consensus reads, because the longer reads will be able to span more repeats and high accuracy will still be achieved from their consensus.
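
The trade-off between insert length and the number of observations per base in circular consensus sequencing can be made concrete with a little arithmetic. The sketch below assumes a fixed polymerase read length of 3,000 bp (the average quoted earlier) and ignores adapter sequence, so the pass counts are rough estimates; the function name and the insert sizes are illustrative.

```python
def full_passes(read_length_bp, insert_bp):
    """Number of complete passes one read makes over a circular insert.

    Adapter sequence is ignored, which slightly overestimates the count.
    """
    if insert_bp <= 0:
        raise ValueError("insert_bp must be positive")
    return read_length_bp // insert_bp

READ_LENGTH_BP = 3000  # average read length quoted in the text

for insert_bp in (250, 500, 1000, 2000, 5000):
    passes = full_passes(READ_LENGTH_BP, insert_bp)
    print(f"insert {insert_bp:>5} bp: {passes:>2} full passes per molecule")
```

Short inserts are read many times, giving a high-accuracy single-molecule consensus, whereas an insert longer than the read is not covered even once in full. In that case accuracy has to come from the overlap of different molecules instead.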

Conclusions

The considerations above make a strong case for combining the more traditional, sequence-dense data from other technologies with at least moderate coverage of SMRT data so that genomes can be improved, their methylation patterns obtained, and the functional activity of their methyltransferase genes deduced. We would especially urge all groups currently sequencing bacterial genomes to adopt this policy. That said, SMRT sequencing has also substantially improved eukaryotic genome assemblies, and we expect it to become more widely applied in this context over time, in light of the greater read lengths and throughput of the PacBio RS II instrument. Perhaps it would even be worth redoing many genomes so that existing shotgun dataset-based assemblies could be closed and their complete methylomes obtained. The resultant assembled (epi)genomes would be inherently more valuable: the usefulness of a closed genome with associated functional annotation of its methyltransferase genes is far greater than the uncertainties left with a shotgun data set. Whereas we currently know much about the importance of epigenetic phenomena for higher eukaryotes, very little is known about the epigenetics of bacteria and the lower eukaryotes. SMRT sequencing opens a new window that may have a dramatic effect on our understanding of this biology.

Abbreviations

bp: base pair.

Competing interests

None of the authors have competing financial interests, but RJR and MCS have collaborated extensively with scientists from Pacific Biosciences leading to several publications cited in the text.

Authors' contributions

All three authors contributed to the writing of this article.

18 in total

1. Direct detection of DNA methylation during single-molecule, real-time sequencing.

Authors: Benjamin A Flusberg; Dale R Webster; Jessica H Lee; Kevin J Travers; Eric C Olivares; Tyson A Clark; Jonas Korlach; Stephen W Turner
Journal: Nat Methods Date: 2010-05-09 Impact factor: 28.547

2. A flexible and efficient template format for circular consensus sequencing and SNP detection.

Authors: Kevin J Travers; Chen-Shan Chin; David R Rank; John S Eid; Stephen W Turner
Journal: Nucleic Acids Res Date: 2010-06-22 Impact factor: 16.971

3. Real-time DNA sequencing from single polymerase molecules.

Authors: John Eid; Adrian Fehr; Jeremy Gray; Khai Luong; John Lyle; Geoff Otto; Paul Peluso; David Rank; Primo Baybayan; Brad Bettman; Arkadiusz Bibillo; Keith Bjornson; Bidhan Chaudhuri; Frederick Christians; Ronald Cicero; Sonya Clark; Ravindra Dalal; Alex Dewinter; John Dixon; Mathieu Foquet; Alfred Gaertner; Paul Hardenbol; Cheryl Heiner; Kevin Hester; David Holden; Gregory Kearns; Xiangxu Kong; Ronald Kuse; Yves Lacroix; Steven Lin; Paul Lundquist; Congcong Ma; Patrick Marks; Mark Maxham; Devon Murphy; Insil Park; Thang Pham; Michael Phillips; Joy Roy; Robert Sebra; Gene Shen; Jon Sorenson; Austin Tomaney; Kevin Travers; Mark Trulson; John Vieceli; Jeffrey Wegener; Dawn Wu; Alicia Yang; Denis Zaccarin; Peter Zhao; Frank Zhong; Jonas Korlach; Stephen Turner
Journal: Science Date: 2008-11-20 Impact factor: 47.728

4. Finished bacterial genomes from shotgun sequence data.

Authors: Filipe J Ribeiro; Dariusz Przybylski; Shuangye Yin; Ted Sharpe; Sante Gnerre; Amr Abouelleil; Aaron M Berlin; Anna Montmayeur; Terrance P Shea; Bruce J Walker; Sarah K Young; Carsten Russ; Chad Nusbaum; Iain MacCallum; David B Jaffe
Journal: Genome Res Date: 2012-07-24 Impact factor: 9.043

5. A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors: Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal: Nat Genet Date: 2011-04-10 Impact factor: 38.330

6. Characterization of DNA methyltransferase specificities using single-molecule, real-time DNA sequencing.

Authors: Tyson A Clark; Iain A Murray; Richard D Morgan; Andrey O Kislyuk; Kristi E Spittle; Matthew Boitano; Alexey Fomenkov; Richard J Roberts; Jonas Korlach
Journal: Nucleic Acids Res Date: 2011-12-07 Impact factor: 16.971

7. Real-time tRNA transit on single translating ribosomes at codon resolution.

Authors: Sotaro Uemura; Colin Echeverría Aitken; Jonas Korlach; Benjamin A Flusberg; Stephen W Turner; Joseph D Puglisi
Journal: Nature Date: 2010-04-15 Impact factor: 49.962

8. Assembly complexity of prokaryotic genomes using short reads.

Authors: Carl Kingsford; Michael C Schatz; Mihai Pop
Journal: BMC Bioinformatics Date: 2010-01-12 Impact factor: 3.169

9. The birth of the Epitranscriptome: deciphering the function of RNA modifications.

Authors: Yogesh Saletore; Kate Meyer; Jonas Korlach; Igor D Vilfan; Samie Jaffrey; Christopher E Mason
Journal: Genome Biol Date: 2012-10-31 Impact factor: 13.583

10. Reducing assembly complexity of microbial genomes with single-molecule sequencing.

Authors: Sergey Koren; Gregory P Harhay; Timothy P L Smith; James L Bono; Dayna M Harhay; Scott D Mcvey; Diana Radune; Nicholas H Bergman; Adam M Phillippy
Journal: Genome Biol Date: 2013 Impact factor: 13.583
