Carmen F Manso, David F Bibby, Kieren Lythgow, Hodan Mohamed, Richard Myers, David Williams, Renata Piorkowska, Yuen T Chan, Rory Bowden, M Azim Ansari, Camilla L C Ip, Eleanor Barnes, Daniel Bradshaw, Jean L Mbisa.
Abstract
Choice of direct acting antiviral (DAA) therapy for Hepatitis C Virus (HCV) in the United Kingdom and similar settings usually requires knowledge of the genotype and, in some cases, antiviral resistance (AVR) profile of the infecting virus. To determine these, most laboratories currently use Sanger technology, but next-generation sequencing (NGS) offers potential advantages in throughput and accuracy. However, NGS poses unique technical challenges, which require idiosyncratic development and technical validation approaches. This applies particularly to virology, where sequence diversity is high and the amount of starting genetic material is low, making it difficult to distinguish real data from artifacts. We describe the development and technical validation of a sequence capture-based HCV whole genome sequencing (WGS) assay to determine viral genotype and AVR profile. We use clinical samples of known subtypes and viral loads, and simulated FASTQ datasets to validate the analytical performances of both the wet laboratory and bioinformatic pipeline procedures. We show high concordance of the WGS assay compared to current "gold standard" Sanger assays. Specificity was 92.3 and 96.1% for AVR and genotyping, respectively. Discordances were due to the inability of Sanger assays to assign the correct subtype or accurately call mixed drug-resistant variants. We show high repeatability and reproducibility with >99.8% sequence similarity between sequence runs as well as high precision for variant frequency detection at >98.8% in the 95th percentile. Post-sequencing bioinformatics quality control workflows allow the accurate distinction between mixed infections, cross-contaminants and recombinant viruses at a threshold of >5% for the minority population. The sequence capture-based HCV WGS assay is more accurate than legacy AVR and genotyping assays. 
The assay has now been implemented in the clinical pathway of England's National Health Service HCV treatment programs, representing the first validated HCV WGS pipeline in clinical service. The data generated will additionally provide granular national-level genomic information for public health policy making and support the WHO HCV elimination strategy.
Keywords: antiviral resistance; direct acting antivirals; genotyping; hepatitis C virus; next generation sequencing; target enrichment; technical validation; whole genome sequencing
Year: 2020 PMID: 33162957 PMCID: PMC7583327 DOI: 10.3389/fmicb.2020.576572
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
FIGURE 1. Determination of the sensitivity of the HCV WGS assay. Detection sensitivity was assessed as a function of viral load binned in 0.1 log10 IU/mL ranges. (A) Percentage of samples generating AVR data in NS3, NS5a, and NS5b genes at a minimum read depth of 30 in the different viral load ranges. (B) Percentage of samples generating genotyping data at a minimum read depth of 30 in the different viral load ranges. For each viral load range, the percentage of samples is shown by a black dot and the number of samples by the black and gray bars.
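The binning described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the input layout (pairs of viral load and read depth), and the sample values are hypothetical and are not taken from the study; only the 0.1 log10 IU/mL bin width and the minimum read depth of 30 come from the caption.

```python
import math
from collections import defaultdict

MIN_DEPTH = 30   # minimum read depth stated in the caption
BIN_WIDTH = 0.1  # log10 IU/mL bin width stated in the caption


def sensitivity_by_bin(samples, min_depth=MIN_DEPTH, bin_width=BIN_WIDTH):
    """Return {bin_lower_edge: (percent_passing, n_samples)}.

    `samples` is an iterable of (viral_load_iu_ml, read_depth) pairs
    (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(int)
    passing = defaultdict(int)
    for viral_load, depth in samples:
        # Index of the log10 bin; the small epsilon guards against
        # floating-point error exactly at a bin edge.
        idx = math.floor(math.log10(viral_load) / bin_width + 1e-9)
        totals[idx] += 1
        if depth >= min_depth:
            passing[idx] += 1
    return {
        round(idx * bin_width, 1): (100.0 * passing[idx] / totals[idx], totals[idx])
        for idx in sorted(totals)
    }


# Hypothetical (viral load in IU/mL, read depth) pairs, not data from the paper.
example_samples = [(1_000, 12), (1_100, 45), (10_000, 250), (11_000, 30), (100_000, 900)]
result = sensitivity_by_bin(example_samples)
```

Each dictionary entry corresponds to one point (percentage) and one bar (sample count) of the kind plotted in the figure.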
FIGURE 2. Determination of the accuracy of the HCV WGS assay. (A) Accuracy was assessed by determining the degree of agreement between the consensus FASTA sequences generated by the assay at a variant frequency threshold of 15% compared with the consensus FASTA sequence generated by the “gold standard” amplicon-based Sanger assay in NS5a and NS5b genes at nucleotide and amino acid level. (B) Determination of the variant frequency at discordant positions as determined by the WGS assay where the variant was called by either WGS or Sanger assay.
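To make the idea of a consensus call at a 15% variant frequency threshold concrete, here is a minimal Python sketch for a single nucleotide position. It is not the pipeline's implementation. It assumes that every base at or above the threshold is reported and that mixtures are written with IUPAC ambiguity codes; the function name, the inclusive threshold, and the reuse of the minimum depth of 30 from Figure 1 are all assumptions made for illustration.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus_base(counts, threshold=0.15, min_depth=30):
    """Call one consensus symbol from per-base read counts at a position.

    `counts` maps "A"/"C"/"G"/"T" to read counts. Positions below
    `min_depth` are reported as "N" (an assumption of this sketch).
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"
    # Keep every base whose frequency reaches the threshold (inclusive here).
    called = frozenset(base for base, n in counts.items() if n / depth >= threshold)
    return IUPAC.get(called, "N")


minor_below = consensus_base({"A": 94, "C": 6})   # 6% minority is not called
mixture = consensus_base({"A": 21, "G": 79})      # both bases reach 15%
low_depth = consensus_base({"A": 5})              # too few reads to call
```

Under these assumptions a 6% minority base is dropped from the consensus, while a 21%/79% mixture is reported with the ambiguity code for both bases.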
Subtype distribution and positive percent agreement for samples used for validation of HCV genotyping.

| Genotype | Subtype | Samples, n (%) | Positive percent agreement (%) |
| --- | --- | --- | --- |
| 1 | a | 41 (34.2) | 100 |
| | b | 11 (9.2) | 90.9 |
| | l | 2 (1.7) | 100 |
| | unclassified | 2 (1.7) | 50 |
| 2 | a | 2 (1.7) | 100 |
| | b | 2 (1.7) | 100 |
| | j | 1 (0.8) | 0 |
| | unclassified | 1 (0.8) | 100 |
| 3 | a | 35 (29.2) | 100 |
| | b | 4 (3.3) | 100 |
| | c | 1 (0.8) | 0 |
| | g | 1 (0.8) | 100 |
| | k | 1 (0.8) | 100 |
| 4 | a | 1 (0.8) | 100 |
| | d | 7 (5.8) | 100 |
| | k | 2 (1.7) | 100 |
| | v | 1 (0.8) | 100 |
| 6 | a | 1 (0.8) | 100 |
| | f | 1 (0.8) | 100 |
| | h | 1 (0.8) | 100 |
| | r | 1 (0.8) | 100 |
| | unclassified | 1 (0.8) | 100 |
| Total | | 120 (100.0) | 96.7 |
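The overall positive percent agreement in the table can be reproduced from the per-subtype rows. The short Python sketch below does this. The sample counts are copied from the table, while the concordant counts are back-calculated from the reported percentages (for example, 90.9% of 11 samples is taken to mean 10 concordant), which is an assumption of this sketch rather than data stated in the record.

```python
# (genotype, subtype, samples, concordant). Sample counts are from the table;
# concordant counts are back-calculated from the reported percentages.
TABLE = [
    ("1", "a", 41, 41), ("1", "b", 11, 10), ("1", "l", 2, 2), ("1", "unclassified", 2, 1),
    ("2", "a", 2, 2), ("2", "b", 2, 2), ("2", "j", 1, 0), ("2", "unclassified", 1, 1),
    ("3", "a", 35, 35), ("3", "b", 4, 4), ("3", "c", 1, 0), ("3", "g", 1, 1), ("3", "k", 1, 1),
    ("4", "a", 1, 1), ("4", "d", 7, 7), ("4", "k", 2, 2), ("4", "v", 1, 1),
    ("6", "a", 1, 1), ("6", "f", 1, 1), ("6", "h", 1, 1), ("6", "r", 1, 1),
    ("6", "unclassified", 1, 1),
]


def ppa(concordant, total):
    """Positive percent agreement: concordant results as a percentage of all samples."""
    return 100.0 * concordant / total


total_samples = sum(row[2] for row in TABLE)
total_concordant = sum(row[3] for row in TABLE)
overall_ppa = ppa(total_concordant, total_samples)

print(f"{total_concordant}/{total_samples} concordant, PPA = {overall_ppa:.1f}%")
```

With these counts, 116 of 120 samples agree, which matches the 96.7% in the Total row.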
Genotype and AVR of discordant samples.

| Sample | Legacy assay result (method) | WGS assay result |
| --- | --- | --- |
| TV50 | 1-unknown subtype (NS5b sequencing) | 1a |
| TV106 | 1b (NS5b sequencing) | 1-novel subtype |
| TV5 | 2j (NS5b sequencing) | 2-novel subtype |
| TV101 | 3c (line-probe) | 3-novel subtype |
| TV62 | Y93CY (Sanger sequencing) | Y93 (C at 6% variant frequency) |
| TV51 | M28AV (Sanger sequencing) | M28MV (V at 79% variant frequency) |
Determination of assay precision and linearity.
| Intra-run repeatability | TV22 (1a) | 1,790,000 | 99.98 | 3 | 0.000211 | 93.81 | 99.43 | 99.91 |
| | TV20 (1b) | 2,470,000 | 99.99 | 1 | 0.00007 | 93.57 | 98.61 | 99.27 |
| | TV33 (1a) | 106,000 | 100 | 0 | 0 | 86.61 | 98.28 | 99.95 |
| | TV32 (1a) | 301,000 | 99.98 | 4 | 0.000282 | 87.65 | 97.19 | 99.64 |
| | TV37 (3a) | 425,000 | 99.98 | 4 | 0.000281 | 97.61 | 99.51 | 99.94 |
| | Mean | – | 99.97 | 2.4 | 0.000169 | 91.85 | 98.6 | 99.74 |
| Inter-run repeatability | TV33 (1a) | 106,000 | 100 | 0 | 0 | 97.71 | 99.9 | 99.98 |
| | TV22 (1a) | 1,790,000 | 100 | 0 | 0 | 98.6 | 99.93 | 100 |
| | TV20 (1b) | 2,470,000 | 100 | 0 | 0 | 95.43 | 99.91 | 100 |
| | TV32 (1a) | 301,000 | 99.99 | 2 | 0.000155 | 94.98 | 99.74 | 100 |
| | TV37 (3a) | 425,000 | 100 | 0 | 0 | 98.75 | 99.97 | 100 |
| | Mean | – | 100 | 0.4 | 0.000031 | 97.09 | 99.89 | 100 |
| Reproducibility | TV65 (1a) | 910,000 | 99.93 | 13 | 0.000997 | 88.7 | 95.82 | 98.84 |
| | TV58 (1a) | 995,000 | 99.96 | 8 | 0.000613 | 92.69 | 97.67 | 99.62 |
| | TV48 (1a) | 664,000 | 99.76 | 44 | 0.003375 | 81.4 | 90.72 | 97.54 |
| | TV56 (1a) | 3,340,000 | 99.8 | 37 | 0.002838 | 91.96 | 96.95 | 99.29 |
| | TV46 (1a) | 98,200 | 99.98 | 3 | 0.00023 | 81.13 | 89.36 | 96.8 |
| | TV78 (1b) | 63,700 | 100 | 0 | 0 | 94.24 | 98.74 | 99.9 |
| | TV83 (1b) | 857,000 | 99.65 | 63 | 0.004833 | 88.92 | 94.98 | 98.86 |
| | TV74 (3a) | 150,000 | 99.99 | 1 | 0.000076 | 93.67 | 98.47 | 99.93 |
| | TV77 (3a) | 111,000 | 99.78 | 40 | 0.003056 | 86.49 | 94.69 | 98.92 |
| | Mean | – | 99.87 | 23.2 | 0.00178 | 88.8 | 95.27 | 98.86 |
FIGURE 3. The effect of sampling bias on variant frequency detection using serial dilutions of different samples. Stacked bar graph comparing the frequency of amino acid (dark colors) and corresponding nucleotide (light colors) discordances across the region covered by each dilution relative to the corresponding neat sample. Blue = sample 1 (gt4a); Orange = sample 2 (gt2b); Gray = sample 3 (gt1a); Yellow = sample 4 (gt3a).
FIGURE 4. Graphical representation of how the pipeline outputs compare to quasispecies input at resistant loci with mixed amino acid frequencies. The x-axis comprises the frequency of each variant within the input quasispecies; the y-axis shows the difference between the input value and the pipeline output frequency at that variant. Shaded triangle regions represent the area where false positives (red) and false negatives (green) would be observed, using the pipeline variant calling threshold of 15%.
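The false-positive and false-negative regions described in this caption follow from comparing the input and output frequencies against the 15% calling threshold. The Python sketch below spells out that logic. The function name and the inclusive comparison at exactly 15% are assumptions made for illustration, and the example frequencies are hypothetical.

```python
THRESHOLD = 15.0  # percent, the pipeline variant calling threshold in the caption


def classify_call(input_freq, output_freq, threshold=THRESHOLD):
    """Classify one variant by comparing simulated input and pipeline output.

    Frequencies are percentages. A variant is "expected" if its input
    frequency reaches the threshold and "observed" if its output does.
    """
    expected = input_freq >= threshold
    observed = output_freq >= threshold
    if expected and observed:
        return "true positive"
    if not expected and not observed:
        return "true negative"
    # Exactly one side reaches the threshold.
    return "false positive" if observed else "false negative"


# Hypothetical (input %, output %) pairs, not values from the study.
examples = [(10.0, 16.0), (17.0, 13.0), (50.0, 48.0), (3.0, 4.0)]
labels = [classify_call(i, o) for i, o in examples]
```

Only variants whose input frequency lies close to 15% can change class from a small measurement error, which is why the shaded regions in the figure are triangles anchored at the threshold.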
Summary of Splitpops module output showing representative examples of different clinical sample scenarios.
| Single infection | 44-S11 | 197,162 | 196,010 | 100 | 3a | 99.7 | other | 0.2 |
| | 44-PC2 | 23,918 | 23,506 | 100 | 1a | 91.7 | 1l | 2.7 |
| Contamination | 44-S19 | 20,160 | 4,641 | 93.76 | 1a | 67.5 | 3a | 22.9 |
| | 44-S10 | 19,529 | 18,977 | 99.19 | 1a | 84.8 | 3a | 6.3 |
| Mixed infection | 49-S64 | 25,566 | 25,454 | 100 | 2b | 56.3 | 1a | 39.3 |
| | 59-S5 | 43,225 | 39,630 | 98.37 | 3a | 81.3 | 1a | 15.4 |
| Recombinant virus | 10-S43 | 22,798 | 21,754 | 96.29 | 1b | 58.0 | 2k | 33.9 |
| | 41-S22 | 254,285 | 253,624 | 99.84 | 1a | 71.0 | 4o | 10.5 |
| Unclassified subtype | 33-S10 | 174,341 | 160,437 | 99.88 | 1c | 34.5 | 1b | 15.5 |
| | 41-S10 | 48,461 | 47,621 | 93.56 | 2q | 26.6 | 2 | 21.8 |
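The abstract states a threshold of >5% for the minority population. As a rough illustration of how such a threshold separates the rows above, the Python sketch below flags samples for follow-up analysis. It assumes that the last four columns of the table are the primary subtype, its percentage, the secondary subtype, and its percentage; the record does not label the columns, so this reading, like the function name, is an assumption.

```python
MINORITY_THRESHOLD = 5.0  # percent, from the abstract (">5% for the minority population")

# (sample, primary subtype, primary %, secondary subtype, secondary %),
# read from the last four columns of the table under the stated assumption.
ROWS = [
    ("44-S11", "3a", 99.7, "other", 0.2),
    ("44-PC2", "1a", 91.7, "1l", 2.7),
    ("44-S19", "1a", 67.5, "3a", 22.9),
    ("44-S10", "1a", 84.8, "3a", 6.3),
    ("49-S64", "2b", 56.3, "1a", 39.3),
    ("59-S5", "3a", 81.3, "1a", 15.4),
    ("10-S43", "1b", 58.0, "2k", 33.9),
    ("41-S22", "1a", 71.0, "4o", 10.5),
    ("33-S10", "1c", 34.5, "1b", 15.5),
    ("41-S10", "2q", 26.6, "2", 21.8),
]


def needs_follow_up(secondary_pct, threshold=MINORITY_THRESHOLD):
    """True when the secondary population exceeds the minority threshold."""
    return secondary_pct > threshold


flagged = [row[0] for row in ROWS if needs_follow_up(row[4])]
not_flagged = [row[0] for row in ROWS if not needs_follow_up(row[4])]
```

Under this reading only the two single-infection examples stay below the threshold. A flag alone does not say whether a sample is contaminated, mixed, recombinant, or of an unclassified subtype; that distinction relies on the additional analyses described in Figures 5 and 6.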
FIGURE 5. Additional bioinformatics analyses undertaken on samples with Splitpops outputs that did not satisfy established primary and secondary population thresholds. (A) Phylogenetic reconstruction of NS5B sequences showing a cross-contamination event. Three samples, 44-S10, 44-S11, and 44-S19 (red font), in experiment no. 44 were involved and are analyzed together with sequences from five previous experiments, nos. 39-43 (blue font). Green font = positive controls and black font = subtype references. (B) The positions of the samples involved in the cross-contamination event are mapped on the library prep 96-well plate for experiment no. 44. Sample 44-S11, the source of contamination (dark red), and the contaminated samples 44-S10 and 44-S19 (light red) are indicated. Blue = other samples, gray = negative controls, green = positive controls and black = blank wells. The pair of adapter indexes used is shown adjacent to each row. (C) Phylogenetic reconstruction of NS5B sequences showing a mixed infection. Sample 49-S64 (red font) in experiment no. 49 was a mixed infection (subtypes 1a and 2b) and is analyzed together with sequences from five previous experiments, nos. 43-48 to rule out cross-contamination. Green font = positive controls and black font = subtype references. (D) Graphical representation of the distribution of the reads in sample 49-S64 by subtype shown as a percentage (y-axis) across the HCV genome (x-axis). The HCV genome is shown on top of the graph for reference.
FIGURE 6. Additional bioinformatics analyses undertaken on samples with Splitpops outputs that did not satisfy established primary and secondary population thresholds. Phylogenetic reconstruction using core (A) and NS5B (B) gene sequences from experiment no. 10 showing clustering of sample S43 with subtype 2k and 1b reference sequences, respectively. Blue font = other sequences from experiment no. 10 and black font = subtype references. (C) Graphical representation of a similarity plot of S43 sequence showing its similarity to subtype 2k in the 5′-end structural genome region (blue) and to subtype 1b in the 3′-end non-structural genome region (red). The HCV genome map is shown on top of the graph for reference.