Kerensa McElroy, Torsten Thomas, Fabio Luciani.
Abstract
Deep sequencing harnesses the high-throughput nature of next-generation sequencing technologies to generate population samples, treating information contained in individual reads as meaningful. Here, we review applications of deep sequencing to pathogen evolution. Pioneering deep sequencing studies from the virology literature are discussed, such as whole-genome Roche-454 sequencing analyses of the dynamics of the rapidly mutating pathogens hepatitis C virus and HIV. Extension of the deep sequencing approach to bacterial populations is then discussed, including the impacts of emerging sequencing technologies. While it is clear that deep sequencing has unprecedented potential for assessing the genetic structure and evolutionary history of pathogen populations, bioinformatic challenges remain. We summarise current approaches to overcoming these challenges, in particular methods for detecting low-frequency variants in the context of sequencing error and reconstructing individual haplotypes from short reads.
Year: 2014 PMID: 24428920 PMCID: PMC3902414 DOI: 10.1186/2042-5783-4-1
Source DB: PubMed Journal: Microb Inform Exp ISSN: 2042-5783
Representative examples of deep sequencing applied to viral populations
| Virus | Design | Technology | Ref. Seq. | Filter | Align | SNV | Hap. | Application | Ref. |
| HIV | RT-PCR, nested PCR of pol fragment | Roche-454 GS-FLX amplicon sequencing | Sanger sequenced pol gene | In-house software: removes reads with ambiguous bases, < 80% similarity to reference, or outside region of interest | GS amplicon software (Roche, Penzberg, Germany), Needleman-Wunsch | In house scripts, manual inspection: remove gaps, remove reads with frameshift indels or stop codons, remove variants only contained in reads in one direction, positional variant cut-off values based on control sequences | Individual reads (40 bp region of interest) | Longitudinal emergence of drug resistance during treatment failure | [ |
| HIV | RT, PCR amplification of 4 fragments (3.5 kb each). Full genome analysis | Roche-454 GS-FLX Titanium | NS | Mosaik | RC454 / V-Phaser | V-Phaser (one read length only) | Longitudinal emergence of CD8+ T cell escape variants, viral adaptation | [ | |
| HCV | RT, PCR amplification of HVR-1, nested PCR using sequencing adapters | Roche-454 GS-FLX Titanium amplicon sequencing | 358 HCV HVR-1 representative sequences from Los Alamos National Laboratory HCV | Flow clustering as implemented in QIIME, only reads covering entire region of interest | MAFFT (multiple sequence alignment) | NA | Individual reads | Identification of a transmission event | [ |
| HCV | Whole-genome library prep direct from RNA isolated from human serum, using mRNA-seq sample prep kit (Illumina, San Diego, CA) | Illumina GA IIx 76 bp single end reads | 970 reference HCV sequences registered at the Hepatitis Virus Database server | Primer stripping using CLC Genomics Workbench (4.6), remove reads aligning to human genome, removal of duplicate reads | BWA 0.5.9-r16 | Samtools (0.1.16) | NA | PCR-free whole genome HCV sequencing from human serum; variant comparison between treatment naïve and treatment experienced patients | [ |
| HCV | RT-PCR using genotype specific primers, nested PCR of full genome, followed by random shearing and library preparation | Roche-454 GS-FLX Titanium | Sanger-sequenced consensus | In house software (discard reads with Phred quality scores below 20 or length < 55nt) | Mosaik | ShoRAH, manual cleaning | ShoRAH (up to 1600 bp reconstructions) | Within-host evolution/genetic bottleneck | [ |
| HRV | Duplicate whole-genome RT-PCR of overlapping primer pairs, nebulisation of pooled fragments and library prep | Illumina GA IIx 76 bp single end reads | Sanger-sequenced consensus | Illumina software: RTA SCS.2.6 and CASAVA 1.6 | MAQ v0.7.1 | In house scripts; cut-off based on statistical analyses of base frequencies along reference. Comparison between replicates. | NA | Within-host evolution during immunosuppression | [ |
| Dengue | RT, PCR amplification of four different fragments, random shearing and adapter ligation | Roche-454 GS-FLX Titanium | NS | Mosaik | RC454/ V-Phaser. Manual removal of variants in primer binding sites or only in ends of reads | NA | Intra-host viral diversity | [ | |
| Poliovirus | RT-PCR and nested PCR of target amplicon, followed by random shearing and library preparation | Roche-454 FLX Titanium and Illumina GA IIx 76 bp single end reads | Known amplicon sequences | Proprietary Roche/Illumina software. In house software (discard reads with Phred quality scores below 20). | NS | Custom made scripts – disregard variants with strand bias, as well as insertions and deletions adjacent to homopolymers for Roche-454 data. | NA | Detection of emerging resistant variants in a vaccine stock | [ |
Details of the experimental design and analysis pipeline for various applications of deep sequencing to different viruses are given. ‘Design’ describes the types of samples used and any sample processing up to library preparation. ‘Technology’ indicates the type of sequencing employed. ‘Filter’ details any pre-alignment read processing steps. ‘Ref. Seq.’ describes what kinds of reference sequences were used for read alignment, while ‘Align’ gives the actual alignment software used. ‘SNV’ and ‘Hap.’ indicate software used for SNV detection and haplotype reconstruction respectively. ‘Application’ describes the biological motivation for the study. ‘NS’ indicates the method was not specified in the cited publication, while ‘NA’ means not attempted.
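Several of the ‘Filter’ entries above describe simple per-read rules, such as discarding reads with ambiguous bases, with Phred quality scores below 20, or shorter than 55 nt. A minimal sketch of such a filter is given below. It assumes Phred+33 encoded quality strings and interprets the quality threshold as a mean over the read; the cited studies used in-house software whose exact rules are not given here, so the function names and default values are illustrative only.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of an ASCII-encoded quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(sequence: str, quality: str,
              min_mean_quality: float = 20.0, min_length: int = 55) -> bool:
    """Return True if a read passes the assumed length, ambiguity and quality rules."""
    if len(sequence) < min_length:
        return False
    if "N" in sequence.upper():  # ambiguous base call
        return False
    return mean_phred(quality) >= min_mean_quality


# Toy reads as (sequence, quality) pairs; not taken from any cited study.
reads = [
    ("ACGT" * 15, "I" * 60),  # 60 nt, Phred 40 throughout: kept
    ("ACGT" * 15, "+" * 60),  # 60 nt, Phred 10 throughout: dropped on quality
    ("ACGT" * 5, "I" * 20),   # 20 nt: dropped on length
]
passed = [read for read in reads if keep_read(*read)]
print(len(passed))
```

Whether a threshold applies to the mean, the minimum, or a sliding window of base qualities changes which reads survive, so the interpretation should be checked against the software actually used.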
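Sequencing technologies, features and errors
| Technology | Manufacturer | Reads and read length | Error rates | Depth (viral) | Depth (bacterial) | Ref. |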
| 454 GS Junior | Roche | ~135 K reads @ ~520 nt | ~0.38% indels | 7 K | 14 | [ |
| GS-FLX Titanium | Roche | ~1 M reads @ ~500 nt | ~0.28% indels; ~0.12% substitution (max 1.07%) | 50 K | 100 | [ |
| MiSeq | Illumina | ~ 11 M reads @ ~ 150 nt | < 0.001% indels, ~0.1% substitutions | 165 K | 330 | [ |
| GA IIx | Illumina | ~ 640 M reads @ 100 nt | ~0.001% indels; ~0.31% substitutions (max ~5.85%) | 6 M | 13 K | [ |
| HiSeq 2000 | Illumina | ~ 6G reads @ 100 nt | ~0.002% indels; ~0.32% substitutions (max ~8.2%) | 60 M | 120 K | * |
| Ion Torrent PGM | Life Technologies | ~2 M reads @ ~121 nt | ~1.5% indels | 24 K | 48 | [ |
| SOLiD | Life Technologies | ~120 M reads @ ~50 nt | ~0.09% substitutions (max > 5%) | 600 K | 1 K | [ |
| RS | Pacific Biosciences | ~200 K reads @ ~2000 nt (max > 15000 nt) | ~14% indels, ~1% substitutions | 40 K | 80 | [ |
| tSMS | Helicos | ~1G reads @ 35 nt | ~3% indels, ~0.2% substitutions | 3 M | 7 K | [ |
Indel errors are largely associated with homopolymers for Roche and Ion Torrent. This can have a significant impact on the detection of variants associated with homopolymers, as was recently shown for the 2184delA mutation of the cystic fibrosis transmembrane conductance regulator (CFTR) using Ion Torrent PGM [33]. Sequencing errors are also highly dependent on the sequence context and thus can influence variant calling in a biased, but potentially predictable way. For example, certain GC-rich motifs have been reported to have substitution error rates of close to 6% [27] for the Illumina sequencing technology. Depth columns give anticipated read depth for a typical viral (~10 kb) or bacterial (~5 Mb) genome.
*Calculated for this review from control PhiX data using GemSIM v1.6 [27].
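The depth columns can be reproduced with a simple calculation: number of reads multiplied by read length, divided by genome size. The sketch below assumes uniform coverage and that every read maps to the genome, which real data will not satisfy; the function and variable names are illustrative.

```python
def expected_depth(n_reads: float, read_length: float, genome_size: float) -> float:
    """Average read depth, assuming all reads map and coverage is uniform."""
    return n_reads * read_length / genome_size


VIRAL_GENOME = 10_000         # ~10 kb, as stated in the table note
BACTERIAL_GENOME = 5_000_000  # ~5 Mb, as stated in the table note

# (number of reads, read length in nt), taken from the table above.
platforms = {
    "454 GS Junior": (135e3, 520),
    "MiSeq": (11e6, 150),
    "Ion Torrent PGM": (2e6, 121),
}

for name, (n_reads, read_length) in platforms.items():
    viral = expected_depth(n_reads, read_length, VIRAL_GENOME)
    bacterial = expected_depth(n_reads, read_length, BACTERIAL_GENOME)
    print(f"{name}: ~{viral:,.0f}x viral, ~{bacterial:,.0f}x bacterial")
```

For the 454 GS Junior this gives 7,020× and about 14×, in line with the rounded 7 K and 14 listed in the table.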
Studies applying deep sequencing to within-population bacterial variation
| Chemical shearing of pooled PCR-amplified target genes ( | Ion Torrent 314 PGM, generating 60-70 bp reads at 300-500× | NS | NS | NS | NS | NA | Detection of low-frequency drug resistance mutations | [ | |
| Extraction of genomic DNA followed by whole genome standard SOLiD mate-pair library construction, with 3 kb fragment size | SOLiD 3 plus, 2 times 50 bp reads at ~5000× | SOCS package: quality threshold of Q15 and trimming to 42 bp | SOCS package | Detect and filter using SOCS package (min. av. qual 20, 500 < read depth < 15000, apply Bernoulli test ( | NA | Genome evolution | [ |
Details of the experimental design and analysis pipeline for the two examples of deep sequencing applied to bacterial populations identified in this review. ‘Design’ describes the types of samples used and any sample processing up to library preparation. ‘Technology’ indicates the type of sequencing employed. ‘Filter’ details any pre-alignment read processing steps. ‘Ref. Seq.’ describes what kinds of reference sequences were used for read alignment, while ‘Align’ gives the actual alignment software used. ‘SNV’ and ‘Hap.’ indicate software used for SNV detection and haplotype reconstruction respectively. ‘Application’ describes the biological motivation for the study. ‘NS’ indicates the method was not specified in the cited publication, while ‘NA’ means not attempted.
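The second study above filters candidate SNVs by read depth and a statistical test. One common way to frame low-frequency variant detection against sequencing error is a one-sided binomial test: given a per-base error rate, how likely is it to see at least the observed number of variant reads by error alone? The sketch below illustrates that idea only; it is not the SOCS implementation, and the error rate, significance threshold and function names are assumptions (the depth bounds echo the 500 to 15000 range quoted in the table).

```python
import math


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def is_candidate_variant(alt_count: int, depth: int,
                         error_rate: float = 0.001,  # assumed substitution error rate
                         alpha: float = 1e-6,        # assumed significance threshold
                         min_depth: int = 500, max_depth: int = 15000) -> bool:
    """Flag a site if its depth is in range and errors alone are an unlikely explanation."""
    if not (min_depth < depth < max_depth):
        return False
    return binomial_tail(alt_count, depth, error_rate) < alpha


print(is_candidate_variant(25, 5000))  # 25 variant reads where ~5 errors are expected
print(is_candidate_variant(8, 5000))   # 8 variant reads, plausible by error alone
```

A single global error rate is a simplification: as noted for the sequencing technologies above, errors depend on sequence context, so position-specific or control-derived error rates would be a more careful choice.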
Figure 1. Flowchart detailing pipeline steps required for deep sequencing projects. After extracting genomic material, PCR amplification may be required prior to library preparation. For sequencing of a target region (‘amplicon sequencing’), multiple, ‘nested’ PCR rounds may be performed. Sequencing adapters and primers may be included in the primer for the final round, or may be annealed to the ends of fragments after amplification. For whole genome sequencing, multiple, overlapping PCR products are randomly sheared before annealing of sequencing adapters and primers. Alternatively, if sufficient genomic material is available, shearing and annealing may be performed directly without PCR amplification. If sequencing RNA, RT must be performed before library preparation. For amplicon sequencing, this may take the form of an initial RT-PCR. Choice of sequencing technology is dependent on the project’s aims: for instance, the longer reads of Roche-454 may be more appropriate for reconstructing haplotypes, while the high data volume afforded by Illumina is more suitable for detecting very low frequency SNVs. After sequencing, reads must be aligned, either via multiple sequence alignment or to a reference. Choice of reference is critical; if available, a published reference or references may be used; alternatively, a consensus sequence may be used, generated through de novo assembly, or by alignment to a published reference followed by replacement of fixed variants, or by Sanger sequencing the same sample as submitted for deep sequencing. Following alignment, a number of bioinformatic tools are available for SNV calling, haplotype reconstruction, and downstream analysis.
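One of the reference-building routes described in the legend, alignment to a published reference followed by replacement of fixed variants, can be sketched as follows. The sketch assumes per-position base counts have already been extracted from an alignment and handles substitutions only (no indels); the majority threshold of 0.5 and the function name are illustrative assumptions rather than a prescribed method.

```python
from collections import Counter
from typing import List


def consensus_from_counts(reference: str, base_counts: List[Counter],
                          min_fraction: float = 0.5) -> str:
    """Replace reference bases with the sample's majority base at each position."""
    consensus = []
    for ref_base, counts in zip(reference, base_counts):
        depth = sum(counts.values())
        if depth == 0:
            consensus.append(ref_base)  # no coverage: keep the published base
            continue
        base, count = counts.most_common(1)[0]
        consensus.append(base if count / depth > min_fraction else ref_base)
    return "".join(consensus)


# Toy example: one Counter of observed bases per reference position.
reference = "ACGT"
base_counts = [
    Counter({"A": 98, "G": 2}),   # matches the reference
    Counter({"T": 95, "C": 5}),   # variant T dominates: replace C with T
    Counter(),                    # uncovered position
    Counter({"T": 60, "C": 40}),  # reference base is the majority
]
print(consensus_from_counts(reference, base_counts))  # ATGT
```

Realigning reads to the resulting consensus then leaves only within-sample variation to be called, which is the motivation for building a sample-specific reference in the first place.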