ve-SEQ: Robust, unbiased enrichment for streamlined detection and whole-genome sequencing of HCV and other highly diverse pathogens.

David Bonsall¹, M Azim Ansari¹,², Camilla Ip³, Amy Trebes³, Anthony Brown¹, Paul Klenerman¹,⁴, David Buck³, Paolo Piazza³, Eleanor Barnes¹,⁴, Rory Bowden³.

Abstract

The routine availability of high-depth virus sequence data would allow the sensitive detection of resistance-associated variants that can jeopardize HIV or hepatitis C virus (HCV) treatment. We introduce ve-SEQ, a high-throughput method for sequence-specific enrichment and characterization of whole-virus genomes at up to 20% divergence from a reference sequence and 1,000-fold greater sensitivity than direct sequencing. The extreme genetic diversity of HCV led us to implement an algorithm for the efficient design of panels of oligonucleotide probes to capture any sequence among a defined set of targets without detectable bias. ve-SEQ enables efficient detection and sequencing of any HCV genome, including mixtures and intra-host variants, in a single experiment, with greater tolerance of sequence diversity than standard amplification methods and greater sensitivity than metagenomic sequencing, features that are directly applicable to other pathogens or arbitrary groups of target organisms, allowing the combination of sensitive detection with sequencing in many settings.

Entities: Chemical

Keywords: Anti-viral resistance; Hepatitis C virus; Sequence capture and enrichment; Virus genome sequencing

Year: 2015 PMID: 27092241 PMCID: PMC4821293 DOI: 10.12688/f1000research.7111.1

Source DB: PubMed Journal: F1000Res ISSN: 2046-1402

Introduction and background

With a world-wide prevalence estimated at 2.8% [1, 2], hepatitis C virus (HCV) poses a global health challenge unrivalled by any curable viral infection. In recent years, direct-acting antiviral (DAA) combination therapies have substantially improved outcomes, but fundamental barriers to eradication remain, including reduced efficacy against genotype 3 infections [3, 4] and a cost of modern treatments that is out of reach of even middle-income countries. Newer DAAs such as those targeting HCV’s polymerase and NS5a proteins augment protease inhibitors [5], but genotype-limited efficacy and the possibility of resistance mean that HCV genotyping and periodic monitoring of viral load (VL) will remain important in the selection and monitoring of DAA therapies. Resistance testing by PCR and sequencing of relevant genes is routinely used before initiation of HIV treatment and after its virological failure [6]. Similar testing in HCV is an exciting prospect, with potential benefits in efficacy and cost. With some notable exceptions, resistance-associated variant (RAV) status at baseline has not been shown to be strongly predictive of treatment success (https://www.nice.org.uk/guidance/ta331); however, the role of resistance testing in informing choice and timing of therapy after HCV treatment failure is an active area of clinical research (e.g. HCV-TARGET [7]). It is clear from clinical trials in which RAVs were assessed via amplicon sequencing that the relevance of particular mutations depends on both the drug in question and the genetic background of the virus, and attempts have been made to summarise these data as more drugs enter clinical practice [8]. As more data are acquired through phase 4, post-marketing studies, our ability to predict treatment success from viral genetic information is likely to improve, leading to higher cure rates across a greater variety of antiviral agents, with potential long-term benefits in treatment cost. 
However, several questions remain unanswered, including the relevance of variants detected at low frequency within the viral quasispecies and the impact of combinations of mutations on viral fitness, drug susceptibility and the genetic barrier to resistance. To date, these questions have escaped formal investigation owing to the technological challenges in obtaining whole-genome HCV sequences. A complete evaluation of prospective RAV characterization in guiding therapeutic options requires a comprehensive method for high-sensitivity variant detection, for which the development of efficient, unbiased, and cost-effective whole-genome sequencing methods seems a key requirement. Recent advances in genotype-agnostic whole-genome sequencing of HCV have been promising [9], but there is still room for improvement in sensitivity, throughput and cost. HCV strains fall into seven recognized genotypes which differ from each other at an average of 30–35% of nucleotide sites across the ~9650 nt genome [10], which is divided into highly conserved and extremely diverse regions of sequence. Genotypes are classified into approximately 67 subtypes, which differ at up to approximately 15% of nucleotide sites and include the globally distributed subtypes 1a, 1b, 2a, and 3a [10]. Available methods for the characterization of genetically diverse viruses such as HCV in clinical samples present several technical challenges. Amplification of reverse-transcribed virus RNA by PCR relies on a close match between primers and relatively conserved regions of the target, including an absolute match at the 3’ end of each primer, necessitating the design of multiple, genotype-specific sets of overlapping amplicons to recover complete genome sequences. 
In practical terms, PCR-based whole genome sequencing for HCV is complex and prone to technical failure, requiring a genotyping stage for primer selection, followed by genotype-specific amplification of several fragments and sequencing [11, 12], typically using a next-generation platform such as Illumina. The results can include high-depth coverage of the identified genotype, useful for the identification of known drug-related and immune escape variants, but the technique is less appropriate for the detection of low-frequency co-infections, uncovering novel diversity, or high-throughput analysis. An alternative approach, and the starting point of this research, is a method termed virus RNA-seq [13], which efficiently obtains direct “metagenomic” sequence data in the form of Illumina sequence reads from clinical material such as plasma [14] and which we used recently to identify a genotype 4 – genotype 1 chimeric isolate from a patient in Cameroon [15]. Virus RNA-seq is demonstrably unbiased with respect to the detection of any virus genotype, but relatively insensitive and costly for the recovery of whole virus genomes, even with modern sequencing technologies, because in many cases >99% of all sequence data generated derives from the host and is discarded [9, 13]. Strategies to deplete host-derived nucleic acids in virus metagenomic whole-genome sequencing have been applied successfully but are intrinsically limited in their effectiveness by the often-variable characteristics of the input sample. Using DNase digestion of plasma before reverse transcription-based RNA amplification and a modified low-input Illumina library preparation, HCV-specific read proportions of 1.5%–47.7% have been reported [9] for samples with relatively high VLs (>1.8 × 10⁵ IU/ml), sequenced in small multiplexes of eight samples per Illumina MiSeq run. 
Oligonucleotide-targeted RNase H digestion of host rRNA has been used to improve the yield of Lassa and Ebola virus sequences, but virus-specific sequencing efficiency remains close to 1% [16]. More promisingly, enrichment using biotinylated probes that target viral sequences has significantly improved sensitivity and efficiency of herpesvirus [17], Lassa virus [16] and Mycobacterium tuberculosis [18] sequencing. The ideal methodology for one-step, high-throughput clinical virus sequencing would combine the benefits of high-throughput sequencing with the sensitivity of PCR, while avoiding the pitfalls of PCR-based amplification and the inefficiencies of RNA-seq based metagenomic approaches. We report a comprehensive approach to virus-specific, genotype-agnostic, probe-based enrichment and sequencing of whole HCV genomes at a depth sufficient to call minor variants without bias and at a cost compatible with routine clinical HCV genotyping; in principle, the approach can also be applied to other pathogens.

Materials and methods

Sample collection and preparation

Samples for optimization of sequencing methods were acquired from HCV Research UK ( http://www.hcvresearchuk.org/), whose clinical samples were used with informed consent, conforming to the ethical guidelines of the 1975 Declaration of Helsinki. Study protocols were approved by the NRES Committee East Midlands, Derby (Ethics reference 11/EM/0323). Samples for resistance testing were obtained from patients enrolled and consented as part of the OxBRC Prospective Cohort Study in Hepatitis C (Ethics reference 09/H0604/20) at the Oxford University Hospitals NHS Trust. Patient plasma was collected from EDTA blood tubes by centrifugation for 10 minutes at 600g in a Heraeus Megafuge, and stored at -80°C. RNA was isolated from 500µl plasma volumes using the NucliSENS magnetic extraction system (bioMerieux) and collected in 30µl of kit elution buffer for storage in aliquots at -80°C.

Sequencing library construction, enrichment and sequencing

Libraries were prepared for Illumina sequencing using the NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina® (New England Biolabs) with 5µl sample (maximum 10ng total RNA) and previously published modifications of the manufacturer’s guidelines (v2.0) [13], briefly: fragmentation for 5 or 12 minutes at 94°C, omission of Actinomycin D at first-strand reverse transcription, library amplification for 15–18 PCR cycles using custom indexed primers [19] and post-PCR clean-up with 0.85× volume Ampure XP (Beckman Coulter). Libraries were quantified using Quant-iT™ PicoGreen® dsDNA Assay Kit (Invitrogen) and analysed using Agilent TapeStation with D1K High Sensitivity kit (Agilent) for equimolar pooling, then re-normalized by qPCR using the KAPA SYBR® FAST qPCR Kit (Kapa Biosystems) for sequencing. Metagenomic virus RNA-Seq libraries were sequenced with 100b paired-end reads on the Illumina HiSeq 2500 with v3 Rapid chemistry. A 500ng aliquot of the pooled library was enriched using the xGen® Lockdown® protocol from IDT (Rapid Protocol for DNA Probe Hybridization and Target Capture Using an Illumina TruSeq® or Ion Torrent® Library (v1.0), Integrated DNA Technologies) with equimolar-pooled 120nt DNA oligonucleotide probes (IDT) followed by a 12-cycle, modified, on-bead, post-enrichment PCR re-amplification. The cleaned post-enrichment ve-SEQ library was normalized with the aid of qPCR and sequenced with 100b paired-end reads on a single run of the Illumina MiSeq using v2 chemistry.

Sequence data analysis

De-multiplexed sequence read-pairs were trimmed of low-quality bases using QUASR v7.01 [20] and adapter sequences with CutAdapt version 1.7.1 [21] and subsequently discarded if either read had less than 50b remaining sequence or if both reads matched the human reference sequence using Bowtie version 2.2.4 [22]. The remaining read pool was screened against a BLASTn database containing all 165 ICTV (International Committee on the Taxonomy of Viruses) HCV genomes ( http://talk.ictvonline.org/ictv_wikis/m/files_flavi/default.aspx) both to choose an appropriate reference and to select those reads which formed a majority population for de novo assembly with Vicuna v1.3 [23] and finishing with V-FAT v1.0 ( http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/v-fat). Reads were mapped back to the assembly using Mosaik v2.2.28 [24], variants were called by V-Phaser v2.0 [25] and intra-host diversity was explored with V-Profiler v1.0 [26].
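The two discard rules stated above (either read shorter than 50b after trimming, or both reads matching the human reference) can be written as a small predicate. This is only a sketch of the rule as described in the text, not the authors' pipeline: the `ReadPair` container and its human-match flags are hypothetical, and the flags are assumed to have been filled in by a separate alignment step such as Bowtie 2.

```python
from dataclasses import dataclass

MIN_READ_LENGTH = 50  # bases that must remain after trimming, per the text


@dataclass
class ReadPair:
    """Hypothetical container for one trimmed read pair."""
    read1: str
    read2: str
    read1_maps_human: bool  # assumed output of a prior alignment to human
    read2_maps_human: bool


def keep_pair(pair: ReadPair) -> bool:
    """Return True if the pair survives both discard rules."""
    # Rule 1: drop the pair if EITHER read is shorter than the minimum.
    too_short = (len(pair.read1) < MIN_READ_LENGTH
                 or len(pair.read2) < MIN_READ_LENGTH)
    # Rule 2: drop the pair only if BOTH reads match the human reference.
    both_human = pair.read1_maps_human and pair.read2_maps_human
    return not (too_short or both_human)
```

Note the asymmetry in the rule: a single short read is enough to discard a pair, whereas a pair with only one human-matching read is retained for the later HCV screening step.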

Results

Virus RNA-seq detection of RNA viruses in plasma

We first evaluated the performance of a conventional, “metagenomic” approach to virus whole-genome sequencing [13]. Indexed sequencing libraries were constructed in duplicate from plasma RNA of 29 subjects infected with diverse HCV subtypes (1a, 1b, 2a, 2b, 3a, 4a and 4d) and a 3.5-log range of VLs (2,200–4.9 million IU/mL; 1 IU = 2.7 copies on the instrument we use) and sequenced on a single Illumina HiSeq 2500 Rapid run, producing a median of 8.0 million reads per sample (range 6.0–24.9 million), of which 0.37% originated from HCV (range 0.03%–2.8%) ( Supplementary Table S1). There was a linear relationship between HCV VL and the yield of HCV reads with high mapping quality ( Figure 1). Mapping the HCV reads for each sample to the closest available reference (either a database reference or a de novo assembly of the same reads) produced patterns of peaks and troughs in sequence coverage along the genome that showed some similarity between samples of different subtypes and were highly reproducible between library and sequencing technical replicates; we therefore infer patterns of coverage are caused mainly by genomic features such as secondary structure and melting temperature [27].
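As a rough cross-check of how read yield translates into coverage, mean read depth can be approximated as the number of HCV bases sequenced divided by the genome length. The sketch below plugs in the median values quoted above together with the 100b read length and ~9650 nt genome length given elsewhere in the article. It assumes that the read counts refer to individual reads and that every HCV read maps over its full length, so the result is an order-of-magnitude illustration only and not a figure reported by the authors.

```python
def mean_read_depth(total_reads: float, hcv_fraction: float,
                    read_length: int, genome_length: int) -> float:
    """Approximate mean depth = HCV bases sequenced / genome length."""
    hcv_bases = total_reads * hcv_fraction * read_length
    return hcv_bases / genome_length


# Median sample in the metagenomic run: 8.0 million reads, 0.37% from HCV.
depth = mean_read_depth(total_reads=8.0e6, hcv_fraction=0.0037,
                        read_length=100, genome_length=9650)
print(f"approximate mean depth: {depth:.0f}")
```

Because the HCV fraction scales with viral load, the same arithmetic shows why low-VL samples fall to depths that are too shallow for minor-variant calling.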

Figure 1.

HCV metagenomic sequence yield is proportional to viral load.

The yield of reads that map to any HCV genome and the probability of successful de novo assembly of a complete genome sequence both depend on viral load (VL). Samples were prepared as replicate libraries that were sequenced simultaneously with consistent yield. Blue circles: successful de novo assembly (>90% complete genome length recovered); red circles: incomplete genome assembly. a. With standard mapping criteria, up to 2.8% of reads match HCV and a background 0.02–0.1% of low-complexity human-derived sequences overwhelms the HCV signal in low-VL samples. Linear trend is plotted for samples with VL > 10⁵ IU/ml. b. Under stringent mapping criteria (mapping Q > 40), lower complexity human and HCV reads are excluded and yield is proportional to VL (slope of linear trend in log-log space not significantly different from 1) across the VL range.

In its standard form, metagenomic sequencing of a batch of up to 96 samples costs <£100 per sample. In this experiment, a VL of approximately 2×10⁵ IU/mL was sufficient to attain a mean read depth across the genome of ~30 and a high probability of successful de novo assembly, but higher read depths are necessary for precise characterization of minor variants. Results are better with high-VL samples, and measures to increase library complexity and improve release of virus during RNA isolation may improve variant-calling sensitivity, but the low efficiency of metagenomic sequencing poses a fundamental problem.

ve-SEQ: Probe-based enrichment increases HCV sequence yield

When the sequence of interest comprises only a small fraction of the starting material, probe-based sequence capture, as used in exome sequencing, can dramatically increase sequencing efficiency [17, 28]. Anticipating the challenge posed by the extreme diversity of HCV, we drew on a representative genome sequence from each of four common genotypes (1a, 2b, 3a and 4a) to construct a combined panel of biotinylated DNA oligonucleotides (xGen® Lockdown® probes, IDT) comprising four sets of 155–157 probes, each a 120 nt sequence fragment overlapping the next by 60 nt, and excluding the 3’ poly-(U) tract to avoid enrichment of low-complexity non-HCV sequences. We enriched the previously-sequenced pool of libraries for HCV sequences by solution hybridization with the 4-genotype probe panel and sequenced it on the Illumina MiSeq platform. This yielded a more than 16-fold increase in the total number of HCV reads produced, even with an output of ~14-fold fewer reads than the previous metagenomic sequencing on the higher-output HiSeq (Supplementary Table S1). HCV sequence content reached 86% in the enriched pool (range 1–98% among samples), equivalent to a median genomic average read depth of 1,660 (range 10–75,700) or >10³-fold enrichment for samples with mid-range VL (Supplementary Figure S1), and reached saturation (near-100% HCV reads) for samples with higher starting HCV content. Although probe panels can be expensive to synthesize, they can be used for many (hundreds of) pooled captures, so the lower sequencing costs in ve-SEQ more than offset the extra costs of the enrichment step.
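The tiling scheme described above (120 nt probes, each overlapping the next by 60 nt) can be sketched as follows. This is an illustrative reconstruction, not the vendor's or the authors' design tool: the function name is hypothetical, the input is assumed to be a genome string from which the 3’ poly-(U) tract has already been removed, and the handling of the final partial window (one extra probe flush with the 3’ end) is an assumption.

```python
def tile_probes(genome: str, probe_len: int = 120, step: int = 60) -> list[str]:
    """Cut a genome into fixed-length probes whose start sites are `step` apart."""
    if len(genome) < probe_len:
        return []
    starts = list(range(0, len(genome) - probe_len + 1, step))
    # Assumption: add one final probe flush with the 3' end if regular
    # tiling stops short of it, so no terminal bases are left uncovered.
    last_start = len(genome) - probe_len
    if starts[-1] != last_start:
        starts.append(last_start)
    return [genome[s:s + probe_len] for s in starts]
```

With these defaults, a trimmed genome of roughly 9,400 nt yields a probe count in the 155–157 range quoted in the text, which is a useful sanity check on the stated probe length and overlap.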

Probe-target dissimilarity reduces enrichment efficiency

We used a single-genome, subtype 1a subset of the 4-genotype probe panel to investigate the effect of varying probe-target sequence identity on ve-SEQ enrichment success ( Figure 2). When a sample is enriched with probes derived from that sample’s consensus sequence, there is no detectable bias in read depth with genomic position (i.e. coverage across the genome for enriched data follows a pattern almost identical to unenriched data, albeit at much higher read depth). When a non-identical sample of the same subtype is enriched, coverage patterns coincide, but are not identical. When a sample from the same genotype but a different subtype to the probe panel is enriched, large sections of the genome are adequately sequenced, but the most divergent regions are covered poorly and whole-genome assembly fails for samples with low viral load ( Supplementary Table S1). When the sample and the enrichment probe set come from different genotypes, only the most conserved parts of the genome are adequately represented with ve-SEQ data and read depth is essentially zero for divergent regions.

Figure 2.

Enrichment efficiency decreases with phylogenetic distance.

Read depth across the genome before (blue, left axis) and after (red, right axis) enrichment with a single-sequence subtype 1a probe set. a. The HCV genome comprises 5’ and 3’ untranslated regions (UTRs) and a large central segment encoding a single polyprotein that is cleaved into ten proteins. b. A subtype 1a sample enriched with probes derived from its own consensus sequence yields coverage patterns across the genome essentially identical to metagenomic sequencing. c. A distinct subtype 1a sample produces highly similar but non-identical patterns of pre- and post-enrichment genomic coverage. d. A subtype 1b sample yields low read depths at loci that are relatively divergent from the 1a probe sequence (E1, E2, NS2 and NS5a). e. Sequence capture of a sample from a different genotype, 3a, is poor across large segments of the genome. f. Heat map representing average diversity (calculated as Shannon entropy) among 165 HCV reference genomes. Nucleotide diversity varies dramatically across the genome and tracks drops in enrichment efficiency between phylogenetically distinct probe-target combinations.

In order to rationalize our approach to probe choice and enable the design of an efficient, comprehensive HCV enrichment probe set, we analysed the relationship between probe-target similarity and the relative efficiency of enrichment (Figure 3). Noting a strong inflection point, we deduced that a minimum 80% identity between a 120 nt segment of sample sequence and its closest matching probe was sufficient to ensure near-maximal enrichment, assuming that each sequencing library molecule interacted with a single probe molecule and ignoring the potential effects of bridging capture (i.e. successful sequencing of a poorly matching fragment effected by hybridization of an adjacent target sequence on the same library molecule to a better-matching probe). 
The 20% divergence cutoff for successful enrichment falls between the mean inter-subtype (<15%) and inter-genotype (30–35%) divergence levels, explaining why enrichment with a subtype-mismatched probe set leads to only localized bias, while genotype-mismatch results in failure across most of the genome. It also follows from this analysis that when enrichment is performing well, there should be no detectable bias in the representation of single nucleotide variant alleles such as RAVs.
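The 80% identity criterion can be expressed as a simple check on each 120 nt target window. The sketch below is a simplification rather than the analysis actually performed: it assumes that window and probes are already aligned to the same coordinates and compared position by position without gaps, and all function names are hypothetical. A real analysis would need a proper alignment step to handle insertions and deletions.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_probe_identity(window: str, probes: list[str]) -> float:
    """Highest identity between a target window and any same-length probe."""
    return max((identity(window, p) for p in probes if len(p) == len(window)),
               default=0.0)


def well_captured(window: str, probes: list[str],
                  threshold: float = 0.80) -> bool:
    """True if the closest probe meets the identity threshold from the text."""
    return best_probe_identity(window, probes) >= threshold
```

For a 120 nt window, the 80% threshold corresponds to at most 24 mismatches against the closest probe; a typical inter-genotype window (30–35% divergence) would carry 36 or more and fail the check.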

Figure 3.

Enrichment efficiency is directly related to probe-target identity.

A set of 10 HCV samples with highest VL was sequenced before and after enrichment with a single-genome, subtype 1a probe set, and for each sample the relative read depth for each probe window was plotted against the maximum identity between target and any probe. Read depth ratio was normalized by giving the most efficiently enriched probe position (in the highly conserved 5’ UTR) a value of 1. Maximal enrichment is observed where probe-target identity exceeds approximately 80% and enrichment decreases dramatically as identity falls below 80%.

Design of a comprehensive probe set for HCV

As is evident from the previous section, a probe panel based on just four subtype-representative sequences cannot perfectly capture HCV global diversity. Exploiting the observation that some regions of the HCV genome (e.g. the 5’UTR) are well-enough conserved to not require multiple probe sets, together with the 20% divergence cutoff for efficient capture, we implemented an algorithm for efficient probe set design that would facilitate a comprehensive HCV enrichment panel as well as, in principle, efficient probe sets for other organisms. We started with the 4-genotype probe panel and added extra probes to improve coverage for already-included subtypes 1a, 2b, 3a and 4a as well as the extra subtypes 1b, 2a, 2c, 5a and 6a, using a database of 482 reference whole-genome sequences. First we calculated a consensus sequence for each subtype. Then, starting with the existing probe set and the first genome in the most common subtype (1b), we identified genomic regions with less than 80% identity to any of the probes already in the panel. For each such region the subtype consensus sequence was considered as a potential probe but only used if it was ≥80% identical to the genomic sequence it replaced; otherwise the genomic sequence fragment was added as a new probe. The process was repeated for each 1b reference sequence and then similarly for each subtype. In contrast to the naïve design of probe sets with the standard IDT approach that requires 155–157 probes per HCV target genome, we were able to augment our 4-genome probe panel to represent the known diversity of nine subtypes spanning six of the seven recognized genotypes with only another 491 probes (1,116 total). Our algorithm substantially and automatically reduces redundancy: a completely naïve approach that simply encoded every genome in the reference set, without accounting for similarity between genomes, would have dictated a prohibitively expensive set of ~75,000 probes. 
In contrast, if we had instead started from scratch, we estimate that our simple algorithm could have produced an equally effective combined panel for nine subtypes with as few as 955 probes. In informal testing, a typical sample from the newly added subtype 1b achieved near-zero bias even though its exact sequence was not encoded in the probe set but was instead covered by reference to recorded sequence diversity ( Supplementary Figure S2), and a sample from subtype 4d, not included in the revised probe set, achieved adequate although imperfect enrichment ( Supplementary Figure S3), consistent with previous subtype-mismatched captures. Although feasible and relatively inexpensive, we have deferred the addition of probes for remaining rare subtypes.
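The augmentation procedure described in this section can be sketched as a greedy loop over reference genomes. This is a minimal reconstruction from the prose, not the authors' implementation. It assumes that all genomes are pre-aligned to a common coordinate system with equal length and no gaps, that windows sit at fixed tiling positions, that the panel is stored as a hypothetical mapping from window start to a list of probes, and that the caller supplies subtypes in the desired order (the text starts with 1b).

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def consensus(genomes: list[str]) -> str:
    """Most common base at each column of equal-length, pre-aligned genomes."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*genomes))


def augment_panel(panel: dict[int, list[str]],
                  subtypes: dict[str, list[str]],
                  probe_len: int = 120, step: int = 60,
                  threshold: float = 0.80) -> dict[int, list[str]]:
    """Add probes until every window of every genome has a probe >= threshold."""
    # Work on a copy so the caller's starting panel is left untouched.
    panel = {start: list(probes) for start, probes in panel.items()}
    for genomes in subtypes.values():          # subtypes in caller's order
        cons = consensus(genomes)
        for genome in genomes:
            for start in range(0, len(genome) - probe_len + 1, step):
                window = genome[start:start + probe_len]
                existing = panel.setdefault(start, [])
                if any(identity(window, p) >= threshold for p in existing):
                    continue                    # region already covered
                candidate = cons[start:start + probe_len]
                if identity(window, candidate) >= threshold:
                    existing.append(candidate)  # prefer the subtype consensus
                else:
                    existing.append(window)     # fall back to the genome itself
    return panel
```

Preferring the subtype consensus over the individual genome fragment is what keeps the panel small: one consensus probe can cover many later genomes of the same subtype, so most of them pass the coverage check without adding anything.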

Detection of resistance-associated variants in clinical samples

To explore the potential utility of high-depth RAV data in predicting the clinical effectiveness of HCV treatment, we used ve-SEQ to analyse retrospectively plasma samples collected from 33 genotype 1-infected patients before NS3-targeting DAA therapy with Boceprevir (14 patients) or Telaprevir (19 patients) (Supplementary Table S2). We obtained whole-genome sequences for all samples, with a mean read depth of 4600 across the NS3 gene. We first confirmed that our sequence data (28 subtype 1a and 5 subtype 1b) matched clinical subtyping data where the latter was available. Mutations in the NS3 gene, denoted T54S and V55I, were detected in patient P23, in whom Boceprevir treatment failed to suppress HCV. Only one other patient had relevant baseline resistance: P6 possessed a single T54S mutation, yet cleared infection with 48 weeks of Boceprevir. Additionally, Simeprevir RAVs Q80K/R were detected in five patients with genotype 1a virus, consistent with the reported prevalence of these mutations in PI-naïve patients [29]. Variants associated with NS5A inhibitor resistance were detected in 11 patients, including nine with combinations of two or more RAVs, previously associated with higher relapse rates with Ledipasvir/Sofosbuvir [30]. In samples taken after treatment cessation, five patients carried both V36M and R155K NS3 variants, associated with drug resistance but also reduced virus fitness in the absence of treatment [31, 32], including three patients illustrated in Figure 4. RAVs V36M and R155K were each detected independently of the other (in P30 and P33, respectively) and virus sampled in P27 during treatment revealed approximately 2-fold more V36M variants than R155K, confirming that V36M alone was sufficient to confer resistance on individual genomes. Telaprevir had failed to suppress virus in subject P24 by week 4, when V36M and R155K variants circulated in approximately half of the virus population. 
It is therefore not surprising that a subsequent treatment attempt also failed, providing a real-world clinical example of where sequencing might have prevented futile retreatment. Six weeks after the second treatment attempt had failed, the R155K mutation had reverted to the wild-type arginine residue in all sequence reads. Partial reversion was also observed in P18, although in this instance, reversion of V36M occurred some 20 or more weeks after the cessation of treatment and R155K was still present in 100% of variants 1 year later.

Figure 4.

Detection of resistance-associated variants after DAA treatment failure.

VL and RAV status for three patients who failed to achieve sustained virological response after Telaprevir-based therapy. Grey shading: duration of therapy (weeks starting at time 0); squares: VL measurements; inverted triangles: samples sequenced using the comprehensive probe panel (open: no Telaprevir RAVs detected, black: RAVs and supporting read proportions, where <100%).

Discussion

Our ve-SEQ method provides improvements over other approaches currently used for rapid, high-throughput, high-sensitivity characterization of complete virus sequences from clinical samples. These advantages include sequencing efficiency for low-VL samples not available from metagenomic approaches [9] and robustness to extreme sequence diversity such as that found in HCV that is not available from PCR-based methods [8]. Our approach is similar to published methods [16–18, 28] but benefits from low enrichment costs and defined performance that come from efficient probe design and non-proprietary, high throughput sample processing. In this study, treatment-naïve individuals carried RAVs to NS3 and NS5A inhibitors and emerging resistance was shown to persist 1 year after treatment failure, which stands to complicate empiric selection and timing of HCV treatment, particularly in previously treated patients. Stratification by viral genotype is currently the best strategy for successful treatment; ve-SEQ performs as well as current routine subtyping techniques at comparable cost while additionally offering high-depth, high-throughput and unbiased detection of RAVs, enabling future large-scale evaluation of resistance testing in clinical studies and offering the possibility of replacing current practice with a single highly informative test. Our preliminary analyses reveal cases in which such data may be clinically useful, and the cost of the test compared with that of a failed DAA treatment (e.g. ~£40K for HARVONI®, https://www.nice.org.uk/guidance/gid-tag484) suggests potential for ve-SEQ to be cost-effective in a clinical setting. 
Our general approach also has clear application in the detection and sequencing in a single protocol of other pathogens – none is as diverse as HCV – including the potential for multi-pathogen, sub-genomic panels that might replace multiplex PCR-based screening and diagnostic techniques with more comprehensive, higher resolution data at comparable sensitivity [33]. ve-SEQ works at high-throughput scales, with a standard, plate-based format that makes it affordable and comparable in overall cost to less informative assays. To avoid turnaround delays while maintaining efficiency for routine use, in principle the HCV assay could be combined with assays for other pathogens, and plasma RNA-seq libraries could be pooled with RNA- and DNA-originating libraries from other sample types, for a routine test run on sequencing platforms like the Illumina MiSeq, that are becoming more generally available in large-hospital diagnostic labs. The more a pool of libraries is enriched, the more individual library complexity (broadly, the number of starting molecules of HCV included) becomes important: since the ve-SEQ approach can be used with any library methodology we have now turned our attention to ways of optimizing the yield of HCV in plasma RNA, increasing the amount of library input material and improving library efficiency. The robustness of probe-based enrichment provides a practical alternative to PCR and similar amplification-based approaches that require a close match between primer and target. We envisage that enrichment could provide almost-hypothesis-free detection for all plausibly present pathogens in clinical samples, both for low-diversity target genomes in which a single representative probe set is sufficient, and by using algorithms such as the one we implement here to efficiently capture more diverse pathogens. 
Because less sequencing effort is required, the overall cost of an enrichment-based protocol is lower than that of a no-enrichment approach and achieves a greater yield of useful data, more efficiently and robustly than PCR.

Data availability

The data referenced by this article are under copyright with the following copyright statement: Copyright: © 2015 Bonsall D et al. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication). Sequence data, filtered to remove human reads, is available from the European Nucleotide Archive (ENA) under accession PRJEB9338. Bonsall et al. describe an improved metagenomic approach for sequencing HCV, which is adequately described by the title “-ve-SEQ: Robust,unbiased enrichment for streamlined detection and whole-genome sequencing of HCV and other highly divers pathogens”. The article provides proof-of principle data to the detection of known HCV associated resistance associated variants in a small number of subjects across a range of genotypes. This is a valuable addition to the limited repertoire of sequencing methods available for full HCV genome sequencing. The article is clearly written, the abstract provides an accurate summary of the article and the overall conclusions are justified on the basis of the results. However, considering this manuscript describes a method, the paper does need more detail regarding the description of methods (including additional information outlining the probe design) and results would allow the reader to reproduce the data, draw their own conclusion and add value to this paper. Additional specific comments are listed below for the authors consideration. Introduction: First paragraph – the authors might wish to consider adding ‘reinfection’ as major challenge to controlling the HCV epidemic. Many of the at-risk populations become re-infected with HCV which will limit long-term DAA success and also necessitate good robust sequencing techniques that distinguish reinfection from relapse. 4th paragraph – the authors may wish to consider the recent publication by Bull et al. BMC Genomics 2016 in this discussion. 
It offers a slight improvement on some of the other amplification-based methods, including detection of co-infection, but, as correctly stated by the authors, it is still biased by primer design, as are all targeted amplification-based methods. Typo in the following statement: “with relatively high VLs (>1.8 105 IU/ml)” needs an ‘x’.

Methods: Given that this is a methods paper, the methods lack sufficient detail and the probe design is vague. For example, how was the reference sequence for each genotype selected? Is this a prototype strain, in which case the GenBank ID should be provided, or a constructed genotype consensus sequence? I assume that each of the 4 sets of probes represents a specific genotype.

Sample selection: Could the authors provide more detail on the criteria by which a "representative" 1a sequence was selected? The authors show a single comparison of how the 1a probe set compares with the other subtypes/genotypes, but it is hard to estimate how this might perform in the real world. The authors mention "informal testing of a typical sample", but more information is required to support the claims in the paper.

Limit of detection: The authors should define the threshold below which a sample was called “no resistance”. What was the lowest percentage threshold considered reliable to call variants? What is the minimum viral load at which this quantification is reliable?

Results: Sequence success: Could the authors please clarify that only 29 samples were tested and all 29 were successfully sequenced?

Cost breakdown: Regarding the statement “In its standard form, metagenomic sequencing of a batch of up to 96 samples costs <£100 per sample”: as the reduced cost from rational probe design is discussed as one of the main advantages of this method, could the authors please clarify exactly what is included in the ‘standard metagenomics cost’ and provide a disaggregation of the costs?
I.e., is it just the cost of sequencing 96 samples on the HiSeq, or did they also include library prep costs in that estimate? Is there an estimate of the number of samples that would be required to make this cost-effective in comparison to bulk sequencing of the NS3-NS5B regions?

Figure 2 needs a key to describe the heat map (does yellow = higher entropy?) and a more detailed description of how genetically "distinct" genotype 1a is in panel c.

Probe design: Did the authors, in their probe design, consider or attempt to target the relatively conserved sequence after the poly U/C tract at the 3’ end of the genome? Perhaps it is too short, or there is a lack of reliable sequence for probe design? While the 3’UTR is unlikely to be of interest for RAV analysis, it has been proposed to be important in viral pathogenesis and induction of innate immunity, and the failure to obtain 3’UTR sequence with this method does present a small limitation for a subset of viral diversity studies. The authors mention that "higher read depths" are required for precise characterisation of minor variants; could the authors describe the minimum number of reads that would be required for this? The authors refer to a database from which they obtained the 482 reference whole-genome sequences used to design their probes. What is this database and is it publicly available? In supplementary table 1, what does probe panel “G123456” mean? According to the methods, probes were only designed for Gt 1, 2, 3 and 4. Having continued reading the results, I now understand, as the design of the G123456 probes is described in the results on page 6. I suggest that a section outlining the expanded probe design be added to the methods section so that there is no confusion in understanding the tables. This expanded probe design offers much more potential than the original probe design and needs highlighting.
Figure 4: The labeling of Figure 4 and its description in the text need to be improved to allow the reader to follow which subject is being described. Figure 4a: was reinfection with a different variant ruled out? For example, what was the genetic distance between the pre-treatment and relapse consensus sequences? At what percentage were the RAVs present after treatment? The authors indicate NS5A variants were found, but don't provide any more data (which RAVs, and at what level they were found).

Sup figures 2 and 3: the results look great, but it would help comparison of the different probe panels if the right y-axes were put on the same scale, as has been done for the left y-axes. This is unlikely to be possible with figure 2, as there is an order of magnitude difference between the plots. For the RAV analysis, which probe set was used?

Discussion: First sentence: the reviewers agree that the method is a valuable improvement in comparison to other metagenomic approaches, but it still has some limitations (and some advantages, as already discussed in the intro) when compared to amplicon approaches, and these should be acknowledged and discussed. Specifically, sequencing of samples with low viral loads and detection of low-frequency RAVs are currently more sensitive with targeted amplicon approaches.

Sensitivity: There is a potential sensitivity issue with this assay in regard to RAVs, as mentioned in the results section, that has not been addressed. Unless the authors add data to show high sensitivity, this should be discussed in paragraph 2 of the discussion.

Minor point, Paragraph 3: “none is as diverse as HCV”. I would change this to “few are as diverse as HCV”. It is debatable depending on your classification, but the Enteroviruses are an extremely diverse group.

We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.
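The questions raised in this report about the minimum number of reads and the lowest reliable variant-calling threshold can be framed with a simple binomial model. What follows is a minimal sketch, not the authors' analysis: it assumes that reads covering a site are independent draws, it ignores sequencing and PCR error, and the parameter names (`min_reads`, `power`, `max_depth`) and default thresholds are illustrative assumptions that do not come from the paper.

```python
from math import comb


def prob_at_least(k: int, n: int, f: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, f)."""
    if k <= 0:
        return 1.0
    below = sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k))
    return 1.0 - below


def min_depth(freq: float, min_reads: int = 5, power: float = 0.95,
              max_depth: int = 1_000_000) -> int:
    """Smallest depth at which a variant present at frequency `freq` is
    seen in at least `min_reads` reads with probability >= `power`.

    Assumption: error-free, independent reads (an idealisation).
    """
    if not 0 < freq <= 1:
        raise ValueError("freq must be in (0, 1]")
    if prob_at_least(min_reads, max_depth, freq) < power:
        raise ValueError("target not reachable within max_depth")
    lo, hi = max(min_reads, 1), max_depth
    # P(X >= k) never decreases as depth grows, so binary search is valid.
    while lo < hi:
        mid = (lo + hi) // 2
        if prob_at_least(min_reads, mid, freq) >= power:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Under these assumptions, `min_depth(0.01, min_reads=1)` returns 299, so roughly 300 reads are needed to see a 1% variant at least once with 95% probability. Requiring more supporting reads, or allowing for sequencing error, raises that figure.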
This is a potentially important protocol for sequencing viral genomes using an adapted nucleic acid bead capture method. The results are impressive and I think the technique is likely to be useful to many who are undertaking viral sequencing. The article is written well and I believe acceptable for indexing in its current version. I have a number of suggestions that may improve the utility of the article to readers:

I would appreciate a longer description of the NEBNext protocol. I assume this is first strand cDNA synthesis using random hexamers, but this would be useful to spell out.

The main innovation is the use of the IDT xGen Lockdown protocol, but the protocol is not described in much detail. I would appreciate a flowchart or textual description of the protocol, because I would like to know how long it takes, what steps are involved and what equipment is needed to carry it out.

The per-sample cost of sequencing is given; I would like to see this broken down by component.

I have a major issue with Figure 2 as presented, due to the use of multiple y-axes, which I think makes it very hard to interpret. Please split this out into panels with enriched and unenriched data presented separately. Also please decide on consistent use of scientific notation or regular numbers (I prefer the latter). In fact, rather than reporting read depth, it would be more informative to report it as a fraction of the number of reads from that barcode.

I'm afraid I can't get on with the "ve-Seq" name, because I read it like "negative-Seq" each time! The authors might consider a more informative and easier to communicate name.

The authors may consider citing some of the recently published pan-viral capture papers such as VirCAP-Seq [1] and relating ve-Seq to this technique.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
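The suggestion in this report to express read depth as a fraction of the reads from each barcode amounts to a per-sample normalisation. A minimal sketch follows, assuming per-position depths and per-barcode read totals are already available. The function name, the dictionary layout and the barcode labels are hypothetical and are not taken from the paper.

```python
def depth_as_fraction(depth_by_barcode, total_reads_by_barcode):
    """Divide each per-position depth by the total reads for that barcode.

    depth_by_barcode: dict mapping barcode -> list of per-position depths
    total_reads_by_barcode: dict mapping barcode -> total read count
    """
    fractions = {}
    for barcode, depths in depth_by_barcode.items():
        total = total_reads_by_barcode[barcode]
        if total <= 0:
            raise ValueError(f"no reads recorded for barcode {barcode!r}")
        fractions[barcode] = [d / total for d in depths]
    return fractions
```

Reporting depth this way lets enriched and unenriched libraries of different sizes be compared on one scale.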
In this original research work the authors have developed a novel method for detection and sequencing of HCV genomes from clinical samples, adopting a DNA probe approach. Similar methods have been developed previously for targeted genomes. This approach has been adapted for HCV and has great potential to be applied to other RNA viruses. Overall, in my opinion, this paper is an interesting and novel application which resolves a few long-standing issues with identification and sequencing of complex viral populations from clinical samples. It would be very helpful for readers if the authors would address the comments below.

It would be helpful to have more details on how the first set of probes was chosen. The authors state that these were 155-157 probes, each of length 120 nt, which roughly equates to 2 full HCV genomes. Can the authors describe exactly what algorithm was used to identify those fragments from the total genomes considered? It is clear that the second set, for the rare genotypes, was constructed with 80% dissimilarity from the first set.

The critical message is that this approach seems to break the barrier of sequencing very low viral loads in an unbiased way. I found this a very important result. It is however clear from the data that the attempt is not fully successful, as only partial genomes are obtained. Perhaps the authors could add some clearer statements highlighting where we are up to with this method and what can be done to improve it.

I would recommend moving Supp Figure 1 into the main text, as this is a rather interesting result showing that there is better enrichment for low viral loads.

I don't fully agree with the authors' conclusions. This method is still not reliable for detecting near-full-length genomes at low viral load, and therefore classical PCR with genotype-specific primers is still needed.
Rather, I would encourage the authors to discuss further the implications of such an approach (and of improved ones in the future) for sequencing more complex scenarios, such as recombination, multiple infections, reinfections, superinfection etc.

A comment that I hope will generate some feedback: DAA treatments are now much better than those considered in this manuscript. Harvoni and GS-5816 are breaking the barrier of 95% SVR pan-genotype. This is the first time in the history of antiviral therapy that drug resistance has been so limited. I think the proposed method will have a higher chance of being applied in other settings (as mentioned before, to study complex genomic rearrangements). It may be worth thinking about this.

Finally, this work also made me wonder what limitations still exist that this method does not address. It would be interesting to mention that, for understanding viral evolution, including drug resistance, there is a need to identify compensatory mutations and epistatic interactions, which may occur between viral mutations that are far apart in the genome. This is a problem of haplotype reconstruction, which has proven very difficult to solve if the starting points are short reads.

Thank you for the opportunity to comment on such a novel and interesting work.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
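This report asks what algorithm selected the probe fragments. The paper's actual procedure is not reproduced here, so the following is only a hedged sketch of one plausible greedy scheme: walk along each target genome in fixed windows and add a window as a new probe only when no existing probe matches it at or above an identity threshold. The 120 nt probe length and the 80% figure echo numbers mentioned in the reports. The step size, the ungapped position-by-position identity measure and all function names are assumptions made for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def design_panel(targets, probe_len=120, step=60, min_identity=0.8, panel=None):
    """Greedily extend `panel` so every window of every target is covered.

    A window counts as covered when some probe already in the panel has
    ungapped identity >= `min_identity` with it. Hypothetical sketch: a
    real design would compare aligned sequences and allow for gaps.
    """
    panel = list(panel or [])
    for seq in targets:
        for start in range(0, len(seq) - probe_len + 1, step):
            window = seq[start:start + probe_len]
            if not any(identity(window, p) >= min_identity for p in panel):
                panel.append(window)
    return panel
```

Passing an existing panel through the `panel` argument mimics extending a first probe set to cover rarer genotypes: only windows that the current probes do not already match are added.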

