J C Kwong, N McCallum, V Sintchenko, B P Howden.
Abstract
Genomics and whole genome sequencing (WGS) have the capacity to greatly enhance knowledge and understanding of infectious diseases and clinical microbiology. The growth and availability of bench-top WGS analysers has facilitated the feasibility of genomics in clinical and public health microbiology. Given current resource and infrastructure limitations, WGS is most applicable to use in public health laboratories, reference laboratories, and hospital infection control-affiliated laboratories. As WGS represents the pinnacle for strain characterisation and epidemiological analyses, it is likely to replace traditional typing methods, resistance gene detection and other sequence-based investigations (e.g., 16S rDNA PCR) in the near future. Although genomic technologies are rapidly evolving, widespread implementation in clinical and public health microbiology laboratories is limited by the need for effective semi-automated pipelines, standardised quality control and data interpretation, bioinformatics expertise, and infrastructure.
Year: 2015 PMID: 25730631 PMCID: PMC4389090 DOI: 10.1097/PAT.0000000000000235
Source DB: PubMed Journal: Pathology ISSN: 0031-3025 Impact factor: 5.306
Fig. 1 Whole genome sequencing workflow. (1) DNA extraction from homogeneous microbial samples, e.g., single bacterial colony from a pure culture. (2) Whole genome sequencing using next-generation sequencers. Most high-throughput sequencers produce short reads (e.g., Illumina MiSeq), although long reads from Pacific Biosciences RS II or Illumina TruSeq technology may facilitate de novo assembly more readily. (3) SNPs called from read mapping to a reference genome can be used for phylogenetic comparisons to assist in epidemiological and outbreak analyses. Reads can also be assembled de novo into longer contiguous sequences (contigs), and orientated and aligned to form scaffolds. (4) The resulting de novo assemblies can be used for further analyses such as typing and resistance detection based on local alignment tools (e.g., BLAST), or can be further finished into a completed or closed genome. This finishing stage usually requires gap closure through extensive ‘wet-lab’ techniques such as primer walking, and so is generally performed for research purposes, although WGS long reads are increasingly being used to produce more complete de novo assemblies and minimise the amount of laboratory work required. (5) Data analysis for outbreak investigation, typing, or resistance detection. Closed annotated genomes can be used as reference genomes for comparison, or can be analysed in further detail.
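Step 3 of the workflow, comparing isolates by the SNPs called against a common reference, can be illustrated with a minimal Python sketch. It assumes each isolate has already been reduced to an equal-length sequence in reference coordinates (for instance, a consensus derived from read mapping). The isolate names and sequences are invented for illustration, and skipping `N` and gap characters is a simplifying assumption, not the behaviour of any particular pipeline.

```python
from itertools import combinations

# Characters treated as "no call" and skipped when counting differences
# (an assumption for this sketch; real pipelines apply their own masking rules).
AMBIGUOUS = {"N", "-"}


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences carry different called bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in AMBIGUOUS and b not in AMBIGUOUS
    )


def pairwise_snp_distances(aligned):
    """Return the SNP distance for every unordered pair of isolates."""
    return {
        (name_a, name_b): snp_distance(aligned[name_a], aligned[name_b])
        for name_a, name_b in combinations(sorted(aligned), 2)
    }


# Hypothetical isolates, already expressed in the same reference coordinates.
isolates = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAT",
    "isolate_C": "ACCTACNTAT",
}

distances = pairwise_snp_distances(isolates)
for pair, dist in distances.items():
    print(pair, dist)
```

A matrix of this kind is the usual starting point for distance-based tree building such as neighbour-joining; the tree construction itself is left to dedicated phylogenetic software.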
Popular sequencing technology
| Traditional sequencing | |
| Sanger sequencing | • Still widely used for sequencing short segments of DNA (up to 1000 bp) due to the ease and accuracy of sequencing |
| | • Labour, time and cost intensive for sequencing entire genomes on a regular basis |
| Shotgun sequencing | • Involved fragmentation of long strands of DNA into numerous smaller segments for Sanger sequencing |
| | • Facilitated initial whole genome sequencing efforts |
| | • Shotgun approach still utilised by ‘next-generation’ sequencing methods |
Comparison of popular next-generation sequencers: high-end sequencing platforms for high throughput/long reads
| | Illumina HiSeq 2500 | Illumina NextSeq 500 | Roche 454 GS FLX+ | Pacific Biosciences RS II |
| Configuration | Rapid-run mode, dual flow-cell | High output flow cell | Titanium XL+ | RS II |
| Dimensions | 119 × 76 × 94 cm | 59 × 53 × 64 cm | Upper 74 × 70 × 36 cm; lower 75 × 91 × 93 cm | 200 × 77 × 158 cm |
| Weight | 221 kg | 83 kg | 242 kg | 1091 kg |
| Preparation time | 8 hours | 8 hours | 8 hours | 8 hours |
| Sequencing time | 60 hours | 30 hours | 24 hours | 4 hours |
| Data output (Gb per run) | 250–300 Gb/run | 100–120 Gb/run | 0.7 Gb/run | 0.5–1 Gb/run |
| Sequence read length | 2 × 250 bp | 2 × 150 bp | 700 bp | 1,000–40,000 bp |
| Number of S. aureus (∼2.9 Mb genome) per run at 30x coverage | 1200 | 480 | 5 | 1 |
| Error rate | 0.1% | 0.1% | 0.2–1.0% | 14% |
| Accuracy | Mostly Q30 | Mostly Q30 | Q20-Q30 | Mostly Q30 |
| Cost of platform | $650,000 | $250,000 | $500,000 | $750,000 |
| Advantages | • Massive throughput (though better suited to human genome sequencing) | • High throughput suitable for microbial genomes | • Read length up to 1000 bp facilitates | • Long reads facilitate |
| • Low cost per output | • Lower instrument cost | • Able to sequence regions of high GC content (results in more uniform coverage of the genome) | ||
| • High output and rapid run modes | • Low cost per output | • Detects modified DNA bases, e.g., DNA methylation patterns | ||
| • Dimensions suitable for ‘benchtop’ | ||||
| • Potential for expansion/upgrades | ||||
| Disadvantages | • Longer run time | • Short reads limit | • More ‘hands-on’ time – requires manually amplified sequence libraries by emPCR | • Lower output |
| • Short reads limit | • Higher cost per output | • Higher error rate in individual reads | ||
| • Higher instrument cost | • Roche closing sequencing operations and ceasing production | • Higher instrument cost and cost per output | ||
| • Large instrument size |
*TruSeq Long Read technology allows sequencing reads of 10,000 bp in length.
†N50 = 14,000 bp; i.e., half of the sequence data is contained in reads >14,000 bp.
‡Theoretical number for comparison only – requires custom-synthesised indices. Current Illumina Index Kits (Nextera XT) allow up to 384 samples per flow cell.
§Error rate is based on raw read error rate. However, as the error model for SMRT sequencing is stochastic, combining reads can produce high quality consensus sequence across all bases. Our experience is that in comparison with sequencing on the Illumina MiSeq, the RS II produces high quality consensus sequences with an error rate approximately 1 per 1000 bases (predominantly homopolymers).
‖Costs are only approximate at time of writing, and may vary substantially – intended only as a rough guide.
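The N50 statistic defined in the footnote above (half of the sequence data is contained in reads at least as long as the N50) can be computed as in the following minimal Python sketch. The read lengths are hypothetical and chosen only so that the result matches the 14,000 bp figure quoted in the footnote.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest read downwards until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical read lengths in bp (46,000 bp in total).
read_lengths = [2_000, 4_000, 6_000, 14_000, 20_000]
print(n50(read_lengths))
```

The same function applies unchanged to contig lengths, where N50 is commonly reported as a summary of assembly contiguity.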
Comparison of popular next-generation sequencers: benchtop sequencing platforms for low-moderate throughput
| | Illumina MiSeq | Ion Torrent PGM (Life Technologies) | Ion Proton (Life Technologies) | Roche 454 GS Junior |
| Configuration | Nextera Reagent Kit v3 | Ion 318™ Chip v2 | Proton I chip | GS Junior Plus |
| Dimensions | 69 × 57 × 52 cm | 61 × 51 × 53 cm | 54 × 78 × 47 cm | 40 × 60 × 40 cm |
| Weight | 54.5 kg | 30 kg | 59 kg | 25 kg |
| Preparation time | 8 hours | 8 hours | 8 hours | 8 hours |
| Sequencing time | 60 hours | 4–7 hours | 2–4 hours | 18 hours |
| Data output (Gb per run) | 13–16 Gb/run | 600 Mb – 2 Gb/run | 10 Gb/run | 50–70 Mb/run |
| Sequence read length | 2 × 300 bp | 200 / 400 bp | 200 bp | 700 bp |
| Number of S. aureus (∼2.9 Mb genome) per run at 30x coverage | 75 | 15 | 60 | 1 (at 15x coverage) |
| Error rate | Overall 0.1%; indel error rate 0.001 per 100 bp | Overall 0.5–2.5%; indel error rate 1.5 per 100 bp | Not reported | Overall 0.2–1.0%; indel error rate 0.4 per 100 bp |
| Accuracy | Mostly Q30 | Mean Q20 (Q10-Q30) | Not reported | Q20-Q30 |
| Cost of platform (approximate) | $150,000 | $100,000 | $150,000 | $100,000 |
| Advantages | • Higher accuracy and data output | • Low platform cost | • Low cost per output | • Smaller instrument size |
| • Low cost per output | • Short run time | • Rapid run time | • Longer read length (up to 800 bp with GS Junior+) | |
| • Library amplification incorporated | ||||
| Disadvantages | • Longer run time | • Requires separately amplified sequence libraries by emPCR | • Requires separately amplified sequence libraries by emPCR | • More ‘hands-on’ time – requires manually amplified sequence libraries by emPCR |
| • Higher platform cost | • Higher indel error rate, particularly with homopolymers | • Higher indel error rate, particularly with homopolymers | • Higher indel error rate, particularly with homopolymers | |
| • Shorter read length | • Quality of sequence deteriorates at ends of reads, though can be improved with post-sequencing read clipping | • Quality of sequence deteriorates at ends of reads, though can be improved with post-sequencing read clipping | • Higher cost per output | |
| • Poor coverage of AT-rich regions | • Poor coverage of AT-rich regions | • Requires manually amplified sequence libraries | ||
| • Can be more difficult to assemble | • Can be more difficult to assemble | • Roche closing sequencing operations and ceasing production |
*Based on Loman et al.[7] and Jünemann et al.[9]
†Costs are only approximate at time of writing, and may vary substantially – intended only as a rough guide.
‡emPCR = emulsion PCR. Slow and complicated process; automated amplification systems are available for Ion Torrent/Ion Proton (Ion Chef).
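The "Accuracy" rows in the two tables above are expressed as Phred quality scores, where a score Q corresponds to a per-base error probability of 10^(-Q/10). The short Python sketch below converts between the two, which shows why Q30 lines up with an error rate of about 0.1% and Q20 with about 1%. The function names are illustrative only.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the per-base error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert a per-base error probability to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: about 1 error in {round(1 / p)} bases ({p:.2%})")
```

Note that these scores describe individual base calls in raw reads; consensus accuracy after combining overlapping reads can be considerably higher.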
Common software for bioinformatic analysis
| • |
| • Contigs can be visualised in the Java-based program Mauve ( |
| Annotation |
| • Genome annotation includes identification of DNA segments of known and probable open reading frames (ORF) that contain gene coding DNA, and matching the identified segments to a database of known gene sequences. Tools include the web-based RAST ( |
| Genome visualisation and comparison |
| • Once assembled and annotated, genomes can be viewed using a genome browser to display the structure and embedded genetic elements of a genome in a graphical format, and manipulate the genome sequence if required. The Wellcome Trust Sanger Institute's Artemis ( |
| • Visual comparisons of multiple genomes can also be made using the above utilities. |
| Alignment and read mapping |
| • Read mapping is the process of aligning reads to a reference, using a combination of local and global alignment. Bowtie2 ( |
| • BLAST ( |
| • Whole genome alignment is a computationally intensive process, but can be performed using Mauve or Mugsy/MUMmer ( |
| SNP/variant calling |
| • Single nucleotide differences identified from aligning comparator sequences to a reference can be used to describe genetic relationships between isolates. Multiple tools are available,[ |
| • We use the Nesoni suite of tools ( |
| Phylogenetic analysis |
| • Phylogenetic trees can be used to analyse and visualise the SNP differences between isolates, although the true phylogeny of a group of isolates is never known. Popular methods include the simpler but rapid neighbour-joining method (most phylogenetic software), and the more complex maximum likelihood approach (RAxML |
| • SplitsTree and FigTree are examples of phylogenetic software that can calculate neighbour-joining or display trees produced by other software. |
| Utilities for clinical microbiology |
| • Species identification can be performed on WGS data by either 16S characterisation, or by identifying short strings of DNA used in genome assembly (k-mer identification). Both options can be performed on the Danish Center for Genomic Epidemiology Java-based website |
| • A number of other clinically useful tools are available on this site, including ResFinder for the detection of antimicrobial resistance, and Multi-Locus Sequence Typing. Command-line based tools such as BLAST using |
| Databases |
| • NCBI GenBank ( |
| • European Nucleotide Archive ( |
| • DNA Databank of Japan ( |
| Typing databases |
| • MLST database ( |
| Antibiotic resistance gene databases |
| • ARG-ANNOT ( |
| • ResFinder ( |
| Multifunction bioinformatic suites |
| • Geneious Pro ( |
| • CLC Genomics ( |
| • Bionumerics ( |
| • Nesoni ( |
| • Harvest ( |
| • Galaxy ( |
A more extensive list of software can be found at http://seqanswers.com/wiki/Software/list.
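The k-mer identification approach mentioned under "Utilities for clinical microbiology" can be illustrated with a toy Python sketch: a query sequence is broken into overlapping k-mers, and each candidate reference is scored by the fraction of query k-mers it contains. This is not the algorithm used by any of the tools listed above. The species names and sequences are invented, and k = 5 is chosen only to suit these very short examples.

```python
def kmers(sequence, k):
    """Return the set of all overlapping substrings of length k."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def best_match(query, references, k=5):
    """Score each reference by the fraction of query k-mers it contains."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        raise ValueError("query is shorter than k")
    scores = {
        name: len(query_kmers & kmers(ref, k)) / len(query_kmers)
        for name, ref in references.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical reference sequences standing in for whole genomes.
references = {
    "species_X": "ATGGCGTACGTTAGCCGATA",
    "species_Y": "TTGACCGGTAACCGGTTAAC",
}

# Hypothetical query, e.g. a contig from the isolate being identified.
name, scores = best_match("GCGTACGTTAGC", references)
print(name, scores)
```

In practice, the references would be databases of complete genomes and the k-mers would be longer, so that chance matches between unrelated species become rare.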
Fig. 2 Key considerations in quality assessment of whole genome sequencing analyses. Contigs, contiguous sequences; GC, genome coverage; SNP, single nucleotide polymorphism; wgMLST, whole genome multi-locus sequence typing.