Alan J Mueller-Breckenridge, Fernando Garcia-Alcalde, Steffen Wildum, Saskia L Smits, Robert A de Man, Margo J H van Campenhout, Willem P Brouwer, Jianjun Niu, John A T Young, Isabel Najera, Lina Zhu, Daitze Wu, Tomas Racek, Gadissa Bedada Hundie, Yong Lin, Charles A Boucher, David van de Vijver, Bart L Haagmans.
Abstract
Chronic infection with Hepatitis B virus (HBV) is a major risk factor for the development of advanced liver disease including fibrosis, cirrhosis, and hepatocellular carcinoma (HCC). The relative contribution of virological factors to disease progression has not been fully defined, and tools aiding the deconvolution of complex patient virus profiles are an unmet clinical need. Variable viral mutant signatures develop within individual patients due to the low-fidelity replication of the viral polymerase, creating 'quasispecies' populations. Here we present the first comprehensive survey of the diversity of HBV quasispecies through ultra-deep sequencing of the complete HBV genome across two distinct European and Asian patient populations. Seroconversion to the HBV e antigen (HBeAg) represents a critical clinical waymark in infected individuals. Using a machine learning approach, a model was developed to determine the viral variants that accurately classify HBeAg status. Serial surveys of patient quasispecies populations and advanced analytics will facilitate clinical decision support for chronic HBV infection and direct therapeutic strategies through improved patient stratification.
Year: 2019 PMID: 31827222 PMCID: PMC6906359 DOI: 10.1038/s41598-019-55445-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Study summary and motivation. Patients infected with HBV have complex and dynamic clinical profiles. The diagnostic and clinical decision paradigm (dashed box) in HBV-infected patients involves classification by plasma markers of viral activity (a) and biochemical and histopathological evidence of liver damage (b). This approach divides patients broadly into four classes that inform the clinical decision for standard of care, including the use of interferon and/or nucleoside analogues. Using virus whole genome sequencing to catalogue all nucleotide variants occurring at >1%, machine learning approaches are explored to determine whether classification of HBeAg status can be recapitulated from a diverse patient population, to extend our understanding of the virological factors associated with HBeAg status, and to evaluate whether this type of approach may be extended to novel markers of clinical status. Such markers would inform clinical decision making for the stratification of patients in clinical trials and the selection of appropriate patients for next-generation treatment modalities for HBV. The study sought to answer three questions (solid boxes): Q1 – whether plasma HBV quasispecies profiles were representative of those in the liver; Q2 – whether machine learning approaches could accurately recapitulate classification by a routinely used clinical marker; and Q3 – whether this approach has wider utility in clinical decision support and the deconvolution of complex clinical history.
Summary descriptors for patients included in the study.
| Description | Dataset A | Dataset B |
|---|---|---|
| Female | 55 | 67 |
| Male | 127 | 140 |
| Duration of infection (years) | 33.2 (±13.2) | Unknown* |
| Age at inclusion (years) | 35.5 (±13.5) | 30.6 (±7.2) |
| HBV DNA load (log10 copies/mL) | 6.74 (±1.96) | 7.75 (±0.46) |
| AST (IU/mL) | 53.1 (±33.07) | 134.7 (±125.3) |
| ALT (IU/mL) | 87.8 (±80.02) | 265.4 (±272.9) |
| HBeAg negative | 85 | 26 |
| HBeAg positive | 97 | 181 |
| Treatment naive | 182 | 170 |
| Treatment established | 0 | 37 |
Datasets A and B represent the European and Asian cohorts respectively. Pertinent demographic, virological and biochemical data are provided. Continuous data are given as the mean and standard deviation for the appropriate dataset. *The estimated duration of infection was not available for the Asian cohort because of the way patients typically present at the recruiting clinic.
Figure 2. Allele frequency differences between liver and plasma. (A) Percentage differences (y-axis) in allele frequency for variants found more abundantly in the plasma (red points, positive values) or liver (blue points, negative values); HBV genome nucleotide positions are given on the x-axis. (B) Cumulative frequency plot for differences (>1%) in allele frequency for variants in n = 10 paired liver and plasma samples. Each trace represents the running total of variants with a defined difference in allele frequency between liver and plasma samples. The majority of variants found in liver and plasma show only small differences in the adjusted allele frequency (1–2 percentage points), as demonstrated by the steep rise of the traces in most patients. Few variants (n = 56) show differences >10% across all samples.
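The comparison summarised in Figure 2 can be illustrated with a minimal sketch. It assumes each sample is a mapping from variant name to allele frequency (as a fraction) and that a variant absent from one compartment has a frequency of 0 there; both are assumptions made for illustration, and the variant frequencies below are hypothetical, not study data.

```python
LIMIT_OF_DETECTION = 0.01  # the 1% threshold quoted in the caption


def allele_frequency_differences(plasma, liver, threshold=LIMIT_OF_DETECTION):
    """Return {variant: plasma_af - liver_af} for differences above threshold.

    Positive values mean the variant is more abundant in plasma,
    negative values mean it is more abundant in liver.
    Assumption: a variant missing from one sample has frequency 0.0 there.
    """
    diffs = {}
    for variant in set(plasma) | set(liver):
        diff = plasma.get(variant, 0.0) - liver.get(variant, 0.0)
        if abs(diff) > threshold:
            diffs[variant] = diff
    return diffs


def cumulative_counts(diffs, edges):
    """Running total of variants whose absolute difference is <= each edge."""
    magnitudes = [abs(d) for d in diffs.values()]
    return [sum(1 for m in magnitudes if m <= edge) for edge in edges]


# Hypothetical paired sample (illustrative values only).
plasma = {"n1896GA": 0.42, "n1762AT": 0.10, "n1764GA": 0.055}
liver = {"n1896GA": 0.30, "n1762AT": 0.085, "n1764GA": 0.05}

diffs = allele_frequency_differences(plasma, liver)
trace = cumulative_counts(diffs, edges=[0.02, 0.10, 0.20])
```

A trace that reaches most of its final height at the first edges corresponds to the "steep rise" described in panel B.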
Figure 3. Graphical overview of the distribution of samples from both datasets. Dataset A (A) consisted of n = 182 patients; Dataset B (B) was derived from n = 207 patients. The overlap of variants identified between different genotypes for Dataset A (C, Venn plot not scaled) demonstrated n = 208 variants common to genotypes A–D. In Dataset B, genotypes B and C shared a larger number of variants (D). (E) Frequency distribution: the majority of variants were either rare or uncommon, with an adjusted allele frequency just above the limit of detection (1%), or represented the most prevalent allele in the samples. (F) Violin plots show the distribution of variants within each genotype, with the frequency of variant occurrence within a genotype presented as the coverage (count/number of samples), where 1 represents a variant present in all samples of the same genotype (upper limits of the plot). Variants at the lower limits of the plots were rare and unique mutations and represented the sequence diversity in a genotype.
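The coverage measure used in panel F (count/number of samples) can be sketched as follows. The input layout, one set of variant names per sample of a single genotype, and the example variants are assumptions for illustration only.

```python
from collections import Counter


def variant_coverage(samples):
    """Return {variant: fraction of samples carrying it} for one genotype.

    `samples` is a list of sets of variant names (a hypothetical layout).
    A coverage of 1.0 means the variant is present in every sample.
    """
    if not samples:
        return {}
    counts = Counter(variant for sample in samples for variant in set(sample))
    n_samples = len(samples)
    return {variant: count / n_samples for variant, count in counts.items()}


# Hypothetical genotype with four samples (illustrative values only).
genotype_samples = [
    {"n1896GA", "n1762AT"},
    {"n1896GA"},
    {"n1896GA", "n1764GA"},
    {"n1896GA", "n1762AT"},
]
coverage = variant_coverage(genotype_samples)
```

Here the variant present in all four samples sits at the upper limit of the violin plot (coverage 1.0), while the variant seen once sits near the lower limit (0.25).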
Drug Resistance-Associated Mutations and associated allele frequencies.
| Dataset | Gene | Amino Acid alteration | Nucleotide | Allele Frequency | Absolute Number of Samples | Genotype | Resistance association |
|---|---|---|---|---|---|---|---|
| A | RT | Leu80Ile | 367T > A | 0.72 | 1/19 | B | L |
| A | RT | Val84Met | 379G > A | 0.13–0.16 | 1/62 | D | A |
| | | | | | 1/43 | C | |
| A | RT | Asn238Asp | 841A > G | 0.02–0.99 | 6/43 | C | A/L |
| | | | | 0.17 | 1/56 | A | |
| A | RT | Ala181Thr | 670G > A | 0.02–0.06 | 3/62 | D | A/L/Tb/Tf |
| | | Ala181Val | 671C > T | 0.02 | 1/62 | | |
| A | RT | Met204Ile | 741G > T | <0.02 | 3/62 | D | L/Tb ± E,A |
| | | | | 0.01 | 1/43 | C | |
| | | | | 0.02 | 1/56 | A | |
| | | | 741G > T/C | 0.01/0.73 | 2/19 | B | |
| A | RT | Val214Ala | 770T > C | 0.02–0.05 | 4/62 | D | A/Tf |
| | | | | 0.34 | 1/56 | A | |
| | | | | 0.02/0.05 | 2/43 | C | |
| | | Val214Glu | 770T > A | 0.08 | 1/43 | C | |
| B | RT | Val173Leu | 646G > C/T | 0.09/0.04 | 1/37 | C | L/E |
| | | | 646G > C | 0.01 | 1/133 | B | |
| B | RT | Ala181Thr | 670G > A | 0.01 | 1/133 | B | A/L/Tb/Tf |
| | | | | 0.24 | 1/23^A | B | |
| | | | | 0.02 | 1/14^E | C | |
| B | RT | Ala194Thr | 709G > A | 0.04 | 1/133 | B | Tf |
| B | RT | Met204Ile | 741G > T | 0.98 | 1/23^A | B | L/Tb/Tf |
| B | RT | Val214Ala | 770T > C | 0.04 | 1/14^E | C | A/Tf |
| | | | | 0.02 | 1/14^IFN | C | |
| | | | | 0.03 | 1/133 | B | |
| | | | | 0.02/0.04 | 2/37 | C | |
| | | Val214Glu | 770T > A | 0.8 | 1/14^IFN | C | |
| B | RT | Asn238Asp | 841A > G | 0.02–0.03 | 3/37 | C | A/L |
In Dataset B, patients receiving treatment are marked with a superscript letter in the Absolute Number of Samples column. Amino acid changes and related nucleotide positions are provided. The allele frequency for a mutation relative to the reference genomes is provided as a range where the mutation was found in more than two patients. In some individuals more than one mutation is present at the same locus. Abbreviations for therapeutics: A – Adefovir; E – Entecavir; IFN – Interferon; L – Lamivudine; Tf – Tenofovir; Tb – Telbivudine.
Figure 4. Circular cladogram based on n = 404 whole genome consensus nucleotide sequences. Phylogenetic analysis of sequences from Dataset A (n = 192, including 10 liver sample sequences), Dataset B (n = 207), and n = 5 reference strains. The figure key indicates genotype (by colour) and data source (by size and shape). Reference strains are marked by a ‘+’ and highlighted with dark-coloured arrow heads. Samples derived from n = 10 liver biopsies in Dataset A are marked by squares.
Figure 5. Highest ranking variants in machine learning model. (A,C) Plots of the mean decrease in Gini index for the top 20 variables contributing to the best-performing models for Dataset A (5A) and the combined data (5C); (B,D) plots of model metrics from each of ten models developed from Dataset A (5B) or the combined dataset (Datasets A and B) (5D). Variant nomenclature: e.g. n1896GA represents a G > A mutation at nucleotide position 1896. (A) Gini plot of variant importance based on n = 4215 variants from Dataset A, associated with Model F (B). The model achieves a balanced accuracy of 1 with defined data partitioning. Variants highlighted in orange represent the stop-gain mutation (G1896A) and the two mutations of the basal core promoter (G1764A and A1762T). (C) Gini plot for the top 20 variants contributing to the model developed from n = 432 variants to predict HBeAg status, combining samples from untreated patients in Datasets A and B. The best model accuracy was found in Model A (balanced accuracy = 0.98) (D). Variants depicted in grey (5C) were common to both datasets; those represented in orange were the mutations defined in Dataset A alone and are genotype-associated.
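The two quantities reported in Figure 5 can be made concrete with a small sketch. Balanced accuracy is the mean of sensitivity and specificity. The decrease in Gini impurity is shown here for a single split on the presence of one variant; tree-ensemble importance scores of the "mean decrease in Gini" kind average such decreases over many splits and trees. The caption does not name the implementation used, so this is an illustration of the metrics rather than the study's pipeline, and the labels and variant calls below are hypothetical.

```python
def balanced_accuracy(y_true, y_pred, positive):
    """Mean of sensitivity and specificity for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) over the classes in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(labels, has_variant):
    """Impurity decrease from splitting samples on presence of one variant."""
    with_variant = [l for l, h in zip(labels, has_variant) if h]
    without = [l for l, h in zip(labels, has_variant) if not h]
    n = len(labels)
    weighted = (len(with_variant) / n) * gini_impurity(with_variant) \
        + (len(without) / n) * gini_impurity(without)
    return gini_impurity(labels) - weighted


# Hypothetical HBeAg labels and presence of a variant such as n1896GA.
hbeag = ["neg", "neg", "neg", "pos", "pos", "pos", "pos", "neg"]
carries_variant = [True, True, True, False, False, False, False, False]

# Toy rule for illustration: predict HBeAg-negative when the variant is present.
predicted = ["neg" if h else "pos" for h in carries_variant]

score = balanced_accuracy(hbeag, predicted, positive="pos")
decrease = gini_decrease(hbeag, carries_variant)
```

In this toy example the rule misclassifies one HBeAg-negative sample, which lowers specificity but not sensitivity; balanced accuracy weights the two classes equally regardless of their sizes.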
Figure 6. Circos plots depict the distribution of high-ranking variants across the HBV genome. (A) Dataset A genotypes B and C. From outer to inner layer: layer 1, Circos plot representing the HBV genome for genotypes B and C (3215 nucleotides); layers 2–4, representation of the relative position of the transcribed genes; layer 5, mean entropy for each nucleotide position (0–0.15), with vertical-axis marks at 0, 0.03, 0.06 and 0.09 (outer to inner); layer 6, read coverage per nucleotide position (maximum 60000), with vertical-axis marks at (i) 5000 (red), (ii) 10000 (green), (iii) 20000 and (iv) 30000; layer 7, individual nucleotide positions for the top 50 ranked variants contributing to the generic machine learning model for prediction of HBeAg status; layer 8, nucleotide positions of variants, found in untreated patients, associated with resistance to therapeutics. Arrows mark the approximate location of the proprietary primers used. (B) Circos plot generated from Dataset B follows the same topology as plot A. Gene annotation is provided by the colour key.
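The per-position entropy shown in layer 5 can be illustrated with a minimal sketch of Shannon entropy computed from base counts at one nucleotide position, then averaged across samples. The logarithm base (natural log here), the input layout, and the averaging across samples are assumptions; the caption does not specify how the values were derived, and the counts below are hypothetical.

```python
import math


def shannon_entropy(base_counts):
    """Shannon entropy (natural log) of the base distribution at one position.

    `base_counts` maps a base ("A", "C", "G", "T") to its read count.
    A position with a single observed base has entropy 0.
    """
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in base_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy


def mean_entropy(per_sample_counts):
    """Average the position entropy over samples (assumed aggregation)."""
    if not per_sample_counts:
        return 0.0
    values = [shannon_entropy(counts) for counts in per_sample_counts]
    return sum(values) / len(values)


# Hypothetical read counts at one position in two samples.
position_counts = [
    {"G": 50, "A": 50},   # evenly mixed position
    {"G": 100},           # fully conserved position
]
position_mean = mean_entropy(position_counts)
```

Conserved positions contribute zero, so a low mean entropy across the genome indicates that most positions are dominated by a single base in most samples.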