Interferon lambda 4 impacts the genetic diversity of hepatitis C virus.

M Azim Ansari¹, Elihu Aranday-Cortes², John McLauchlan², Vincent Pedergnana¹,³, Camilla Lc Ip¹, Ana da Silva Filipe², Siu Hin Lau², Connor Bamford², David Bonsall⁴, Amy Trebes¹, Paolo Piazza¹, Vattipally Sreenu², Vanessa M Cowton², Emma Hudson⁴, Rory Bowden¹, Arvind H Patel², Graham R Foster⁵, William L Irving⁶, Kosh Agarwal⁷, Emma C Thomson², Peter Simmonds⁴, Paul Klenerman⁴, Chris Holmes⁸, Eleanor Barnes⁴, Chris Ca Spencer¹.

Abstract

Hepatitis C virus (HCV) is a highly variable pathogen that frequently establishes chronic infection. This genetic variability is affected by the adaptive immune response but the contribution of other host factors is unclear. Here, we examined the role played by interferon lambda-4 (IFN-λ4) on HCV diversity; IFN-λ4 plays a crucial role in spontaneous clearance or establishment of chronicity following acute infection. We performed viral genome-wide association studies using human and viral data from 485 patients of white ancestry infected with HCV genotype 3a. We demonstrate that combinations of host genetic variants, which determine IFN-λ4 protein production and activity, influence amino acid variation across the viral polyprotein - not restricted to specific viral proteins or HLA restricted epitopes - and modulate viral load. We also observed an association with viral di-nucleotide proportions. These results support a direct role for IFN-λ4 in exerting selective pressure across the viral genome, possibly by a novel mechanism.

Keywords: genetics; genome-to-genome analysis; genomics; hepatitis C virus; host-pathogen interactions; human; infectious disease; innate immunity; interferon lambda 4; microbiology; virus

Year: 2019 PMID: 31478835 PMCID: PMC6721795 DOI: 10.7554/eLife.42463

Source DB: PubMed Journal: eLife ISSN: 2050-084X Impact factor: 8.713

Introduction

Hepatitis C virus (HCV) infects an estimated 71 million people worldwide (World Health Organization, 2017) and can lead to severe liver disease in chronically infected patients. The virus is highly variable and has been classified into seven distinct genotypes, and further divided into 67 subtypes, based on nucleotide sequence diversity (Simmonds, 2004). The factors that have driven the evolutionary path of HCV are multifactorial but undoubtedly are also shaped by host genetics. Because of its major health burden, determining how both host and viral genetics contribute to the outcomes of infection is critical for a better understanding of HCV-mediated pathogenesis (Ploss and Dubuisson, 2012) and the immune response to viral infection. Using a systematic genome-to-genome approach in a cohort of chronically infected patients, we recently reported associations between an intronic single nucleotide polymorphism (SNP) rs12979860 in the interferon lambda 4 (IFNL4) gene (CC vs. non-CC) and 11 amino acid polymorphisms on the HCV polyprotein (Ansari et al., 2017) at a 5% false discovery rate (FDR) (Benjamini and Hochberg, 1995). This observation was unexpected since IFNL4 is a member of the type III interferon (IFN-λ) family that act as cytokines as part of the innate immune system and therefore lack apparent epitope specificity (Bruening et al., 2017). These associations between polymorphisms on the HCV polyprotein and host IFNL4 SNP rs12979860 genotypes are further intriguing given that variants within the IFNL4 locus (including SNP rs12979860) contribute to HCV clinical and biological outcomes, including spontaneous virus clearance, response to IFN-based treatment, viral load and liver disease progression (Aoki et al., 2015; Ge et al., 2009; Noureddin et al., 2013; Patin et al., 2012; Rauch et al., 2010; Suppiah et al., 2009; Tanaka et al., 2009; Thomas et al., 2009). 
It is possible that the associations between the outcomes of HCV infection and the IFNL4 locus are directly linked to its impact on the viral genome and the encoded polyprotein. The intronic IFNL4 SNP rs12979860 is in high linkage disequilibrium with other SNPs that may be more biologically relevant, including the exonic dinucleotide variant rs368234815 (r² = 0.975 in the CEU population, 1000 Genomes dataset) in IFNL4. This variant [ΔG > TT] causes a frameshift, abrogating production of functional IFN-λ4 protein (Prokunina-Olsson et al., 2013). Moreover, several amino acid substitutions within IFN-λ4 have been shown to alter its antiviral activity (Bamford et al., 2018; Terczyńska-Dyla et al., 2014). In particular, a common amino acid substitution (coded by the SNP rs117648444 [G > A]) in the IFN-λ4 protein, which changes a proline residue at position 70 (P70) to a serine residue (S70), reduces its antiviral activity in vitro (Terczyńska-Dyla et al., 2014). Thus, the combination of alleles at rs368234815 and rs117648444 creates four potential haplotypes, two that do not produce IFN-λ4 protein (TT/G or TT/A; IFN-λ4-Null) and two that result in production of two IFN-λ4 protein variants (ΔG/G; IFN-λ4-P70 and ΔG/A; IFN-λ4-S70). Patients harbouring the impaired IFN-λ4-S70 variant display lower hepatic interferon-stimulated gene (ISG) expression levels, which is associated with increased viral clearance following acute infection and a better response to IFN-based therapy, compared to patients carrying the more active IFN-λ4-P70 variant (Eslam et al., 2017). In this study, we report a large number of associations between HCV-encoded amino acids across the viral polyprotein and host IFNL4 SNP rs12979860, with 42 significant associations at a 5% FDR, increasing to 76 viral sites at a 10% FDR. 
The associations are observed in both structural and non-structural viral proteins and no enrichment of association signals is observed in any of the viral proteins or HLA restricted epitope regions. We also find an association with viral nucleotide content and certain dinucleotide frequencies, such as UpA (uracil base followed by adenine base). Finally, we demonstrate that IFNL4 haplotypes coding for IFN-λ4-S70 and IFN-λ4-P70 variants differ in terms of their impact on viral load and viral amino acid polymorphisms, in agreement with the reduced antiviral activity of IFN-λ4-S70. Together these observations suggest that IFN-λ4 is a driver of HCV sequence diversity and modulator of viral load.

Results

Host and virus genetic structures

To ensure that host and virus population structures had a minimal impact on our results, we used paired genome-wide human and viral genetic data in a homogenous group of 485 patients with self-reported white ancestry, infected with HCV genotype 3a from two cohorts [BOSON (Foster et al., 2015) and Expanded Access Program (EAP) (Foster et al., 2016) cohorts; BOSON N = 411, EAP N = 74; see Supplementary file 1 and Materials and methods for a description of the cohorts]. To control for both human and virus population structures, we performed principal component analysis (PCA) on each of the host and viral genetic data (Materials and methods). The host PCA defined a largely homogenous group corresponding to self-reported white ancestry (Figure 1—figure supplement 1a). The first and second viral principal components (PCs) explained around 3% and 2% of variance in HCV nucleotide diversity respectively (Figure 1—figure supplement 1b), indicating a homogenous group of isolates as observed by the long terminal and short internal branches of the phylogenetic tree (Figure 1—figure supplement 2a). The viral sequences from the two cohorts were non-randomly distributed on the tree as one clade was underrepresented in the EAP cohort sequences; this clade corresponded to isolates in the BOSON cohort from outside the United Kingdom (treeBreaker Bayes factor = 249, see Materials and methods for an explanation on how to interpret Bayes factor and Figure 1—figure supplement 3a). This observation was not reflected in host IFNL4 SNP rs12979860 genotypes, which were randomly distributed on the viral phylogenetic tree (treeBreaker Bayes factor = 1.1, Figure 1—figure supplement 3b). 
However, we did observe associations between the host IFNL4 SNP rs12979860 and the fifth and seventh viral PCs (p = 1.3 × 10⁻¹⁵ and p = 7.2 × 10⁻⁹, respectively), which were not driven by host-virus population co-structuring, suggesting that the IFNL4 locus drives HCV nucleotide diversity (Figure 1—figure supplement 2b–d and Appendix 1).

Figure 1—figure supplement 1.

Host and viral principal components.

(a) Host first and second PCs. (b) Proportion of variance explained by the viral principal components (PCs). The first and second PCs explain 3% and 2% of variation in the nucleotide sequences respectively which indicates there is little clustering of the viral sequences. This is also consistent with the virus phylogenetic tree, which is star-like.

Figure 1—figure supplement 2.

Association between viral PCs and IFNL4 SNP rs12979860 genotypes in the combined cohort (N = 485).

(a) Virus phylogenetic tree, cohort (black EAP, grey BOSON), IFNL4 SNP rs12979860 (CC white, non-CC black) and the first 10 viral PCs (the colours are mapped such that dark blue represents the smallest number and bright yellow represents the largest number for each PC). (b) P-value of univariate association tests between viral PCs and the host SNPs. Black and grey dots are for association tests between the viral PCs and the IFNL4 SNP rs12979860 and 500 SNPs with minor allele frequency (MAF) similar to IFNL4 SNP rs12979860 MAF, respectively. Dashed line shows the 10% FDR line and the dotted line shows the nominal significance of p=0.05. Distribution of the fifth (c) and seventh (d) PCs stratified by the host IFNL4 SNP rs12979860 genotypes. Black dot and lines show the mean and 95% confidence interval for each group.

Figure 1—figure supplement 3.

Distribution of (a) cohort from which the sequences were obtained (BOSON N = 411 or EAP N = 74) and (b) host IFNL4 SNP rs12979860 genotypes (CC or non-CC) on the virus phylogenetic tree.

The thickness and redness of the branches are proportional to the posterior probability that the distribution of the trait of interest on the tips of the tree is different under that clade. (a) Bayes factor of the alternative (where there is one or more branches which have a different distribution of sequences from the BOSON and EAP cohorts) to null model (where the distribution of sequences from the BOSON and EAP cohorts is the same everywhere on the tree) is 249, indicating that the alternative model is supported. There is a branch (thick and red) on the tree, representing a clade within which there are very few EAP sequences as well as sequences from UK patients in the BOSON cohort. (b) There is no evidence that any part of the tree has a distinct distribution in terms of the host IFNL4 SNP rs12979860 genotypes, the estimated Bayes factor for alternative model (where there is one or more branches which have a different distribution of host IFNL4 genotypes) to null model (where the distribution of IFNL4 SNP rs12979860 genotypes is the same everywhere on the tree) is 1.1 which indicates that the null model that host IFNL4 SNP rs12979860 genotypes are randomly distributed on the virus tree is better supported.

The IFNL4 locus affects virus-encoded amino acids at specific sites across the HCV polyprotein

A major advantage of determining entire HCV genomic sequence data is the possibility to perform footprinting analysis at a genome-wide scale. The nucleotide and amino acid frequencies at polymorphic viral sites in the two cohorts were similar and no systematic differences were observed (Figure 1—figure supplement 4). We used logistic regression to test for association between IFNL4 SNP rs12979860 genotypes (CC vs. non-CC) and virus-encoded amino acids, including the first two viral PCs and the first three host PCs as covariates to account for host-virus population co-structuring. Presence or absence of each viral amino acid was used as the response variable; 977 tests were performed at 471 viral sites. To test for possible confounders we separately added each of the cirrhosis status of patients, cohorts (BOSON vs. EAP), gender and age to the model as covariates. These covariates were not associated with any specific amino acids at a 10% FDR (data not shown).
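The way the response variables for these tests are set up can be sketched as follows. This is an illustration only, not the authors' pipeline: the function name `build_tests`, the toy alignment and the `min_count` argument are hypothetical, and the default of 20 mirrors the frequency filter mentioned in the Discussion (an amino acid had to be present in at least 20 samples to be tested). Each retained (site, amino acid) pair yields one binary presence/absence vector that would then serve as the response in a logistic regression.

```python
from collections import Counter


def build_tests(alignment, min_count=20):
    """Return {(site, amino_acid): [0/1 per patient]} for testable amino acids.

    alignment: list of equal-length amino acid strings, one per patient.
    min_count: hypothetical filter; an amino acid must be present in at
               least this many sequences to be tested.
    """
    n_sites = len(alignment[0])
    tests = {}
    for site in range(n_sites):
        column = [seq[site] for seq in alignment]
        counts = Counter(column)
        if len(counts) < 2:
            continue  # invariant site: nothing to test
        for amino_acid, count in counts.items():
            if count >= min_count:
                # Sites are reported with 1-based positions
                tests[(site + 1, amino_acid)] = [
                    1 if residue == amino_acid else 0 for residue in column
                ]
    return tests


# Toy alignment of four patients and three sites (made-up data)
toy = ["MST", "MSA", "MTA", "MSA"]
tests = build_tests(toy, min_count=2)
```

A single site can contribute more than one test when several amino acids pass the filter, which is why the number of tests (977) exceeds the number of sites (471).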

Figure 1—figure supplement 4.

Viral allele frequencies in the BOSON (N = 411) and EAP (N = 74) cohorts.

(a) Viral nucleotide and (b) amino acid frequencies in the BOSON and EAP cohorts. The red dots represent the 10 amino acids previously reported to be associated with IFNL4 genotype. The black dot represents position 2576, which was identified in our previous study as a site associated with IFNL4 genotype (Ansari et al., 2017) but was not tested in the present study.

At a 5% FDR, 42 of the viral sites tested were associated with IFNL4 SNP rs12979860, increasing to 76 sites at a 10% FDR (Figure 1 and Supplementary file 2). This represented 1.4% at a 5% FDR and 2.5% at a 10% FDR of all the viral amino acids in the HCV genotype 3a polyprotein (N = 3021), reflecting a large impact of the host IFNL4 locus on the amino acids encoded at variable sites on the viral polyprotein. The most associated viral site was at position 2570 in the NS5B protein (p = 1.32 × 10⁻⁸, log(OR) = 1.19), as previously reported (Ansari et al., 2017). Notably, 26 of the 76 sites (34%) associated with the IFNL4 SNP rs12979860 at a 10% FDR lie within the HCV E2 glycoprotein (Appendix 1 and Figure 1—figure supplement 5). However, we did not observe significant enrichment or depletion of association signals in any specific viral protein, or in previously reported HLA restricted epitope regions in HCV genotype 3a (von Delft et al., 2016) (Supplementary file 3 and Materials and methods).
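The FDR thresholds used above follow the Benjamini and Hochberg (1995) step-up procedure cited in the Introduction. A minimal sketch of that procedure is given here for orientation; the function name and the toy p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests declared significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * fdr
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant
    return sorted(order[:cutoff_rank])


# Made-up p-values for ten tests
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
hits_5 = benjamini_hochberg(p, fdr=0.05)
hits_10 = benjamini_hochberg(p, fdr=0.10)
```

Relaxing the threshold from 5% to 10% can only add discoveries, which is the pattern seen in the increase from 42 to 76 associated sites.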

Figure 1.

HCV genome-wide association study with IFNL4 SNP rs12979860 genotypes (CC vs. non-CC).

(a) Manhattan plot. The dashed line indicates 5% FDR. At this level 42 sites on the virus polyprotein are significantly associated with IFNL4 SNP. (b) Schematic of the HCV polyprotein.

Figure 1—figure supplement 5.

IFNL4-associated residues on the core E2 structure.

The core E2 structure (PDB 4MWF) is shown in grey; (A) shows the protein surface as a mesh and (B) shows the ribbon structure. IFNL4-associated residues are highlighted as follows: L438 and F442 in Epitope 2 (red), K500 and S501 in orange, R521 in cyan, A524 and L546 in Epitope 3 (blue), T558 in purple and D576, N577, T578 and L580 in the igVR (green).

To ensure that host-virus population co-structuring or some other systematic bias was not confounding our results, we performed the same tests against the HCV amino acids for 500 host SNPs from across the human genome with a minor allele frequency (MAF) similar to IFNL4 SNP rs12979860 MAF, further referred to as ‘the 500 frequency-matched SNPs’. In effect we performed 500 viral GWASs, one for each of the 500 frequency-matched SNPs. Using a 5% FDR (calculated independently for each of the 500 viral GWASs), we observed no significant associations for 491 of the host SNPs tested against the HCV polyprotein. The remaining nine host SNPs were associated with one HCV amino acid each (Supplementary file 4). However, these associations are likely to be false positives as multiple testing corrections were performed for each viral GWAS independently. Additionally, the distribution of P-values for the association tests between HCV amino acids and the 500 frequency-matched SNPs followed the null distribution of no associations (Figure 1—figure supplement 6a), confirming that there was no systematic bias in our analysis. By comparison, the distribution of the P-values for the association tests between HCV amino acids and IFNL4 SNP rs12979860 deviated from the null distribution of no associations. This observation and the large number of HCV amino acids significantly associated with IFNL4 SNP rs12979860 genotypes highlighted that the broad impact of the IFNL4 locus on HCV-encoded amino acids was authentic and not driven by host-virus population co-structuring.

Figure 1—figure supplement 6.

QQ-plots for association studies between viral amino acids and viral codons and host IFNL4 SNP rs12979860 and 500 host SNPs chosen across the human genome with a minor allele frequency (MAF) similar to the IFNL4 SNP rs12979860 MAF.

QQ-plots for association tests between the host SNPs and (a) viral amino acids, and between the host SNPs and changes from the most common viral codon to (b) non-synonymous codons and (c) synonymous codons. First two viral PCs and first three host PCs were used as covariates in all three analyses. The black circles show the QQ-plot for the virus GWAS against IFNL4 SNP rs12979860 and the grey circles show the QQ-plot for the virus GWASs against 500 frequency-matched SNPs.

We then explored nucleotide sequences at the codon level to distinguish the impact of the IFNL4 SNP rs12979860 on viral nucleotides as distinct from its effect on viral amino acids. We tested for associations between host IFNL4 SNP rs12979860 and HCV codon changes from the most common codon to synonymous and non-synonymous codons (Materials and methods). For each HCV codon site with at least 20 synonymous and 20 non-synonymous codons (N = 348), we performed a logistic regression including the first two viral PCs and the first three host PCs to test for association between IFNL4 SNP rs12979860 (and the 500 frequency-matched SNPs as in the previous section) and changes from the most common codon to synonymous and non-synonymous codons (Materials and methods). We observed that non-synonymous changes at 16 viral codons were significantly associated with IFNL4 SNP rs12979860 at a 5% FDR, increasing to 35 viral codons at a 10% FDR (Supplementary file 5). As expected the 500 frequency-matched SNPs did not show the same level of associations with HCV non-synonymous codon changes (Figure 1—figure supplement 6b). We also observed that synonymous changes at two viral codons were significantly associated with IFNL4 SNP rs12979860 at a 5% FDR, increasing to four viral codons at a 10% FDR (Supplementary file 6 and Figure 1—figure supplement 6c). This indicates that the effect of the IFNL4 locus on virus sequence diversity is mostly at the amino acid level, although a small impact on nucleotide substitutions cannot be excluded. We hypothesised that the observed impact of IFNL4 SNP rs12979860 on viral nucleotide sequences might be induced through dinucleotide sensing mechanisms. Most viruses suppress genomic CpG and UpA dinucleotide frequencies, supposedly to mimic host mRNA composition and avoid the immune response (Simmonds et al., 2013). 
To explore this possibility, we tested the association between the dinucleotide frequencies in each viral sequence and the host IFNL4 SNP rs12979860 (Materials and methods). The viral UpA dinucleotide frequency (estimated as the ratio of observed to expected frequencies) was significantly lower in the IFNL4 SNP rs12979860 non-CC group compared to the CC group (p = 1.5 × 10⁻⁶, Figure 1—figure supplement 7). By contrast, the viral UpG dinucleotide frequency was significantly higher in the IFNL4 SNP rs12979860 non-CC group compared to the CC group (p = 1.5 × 10⁻⁵). The viral CpC and CpA dinucleotide frequencies were also significantly different between the individuals with IFNL4 SNP rs12979860 CC and non-CC genotypes (p = 3.3 × 10⁻⁴ and p = 3.3 × 10⁻⁴, respectively). Similar results were observed by analysing the cohorts independently (Appendix 1 and Figure 1—figure supplement 8).
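An observed-to-expected dinucleotide ratio of the kind described above can be sketched as follows. The definition of the expected frequency used here (the product of the two mononucleotide frequencies) is an assumption made for illustration; the exact normalisation used in the study is described in its Materials and methods and may differ. The function name and the toy sequence are hypothetical.

```python
def dinucleotide_ratio(seq, dinuc):
    """Observed/expected ratio for one dinucleotide in an RNA sequence.

    Assumption: expected frequency = product of the two mononucleotide
    frequencies; the study's exact definition may differ.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input
    n = len(seq)
    first, second = dinuc[0], dinuc[1]
    # Count overlapping occurrences over all n - 1 adjacent positions
    hits = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc)
    observed = hits / (n - 1)
    expected = (seq.count(first) / n) * (seq.count(second) / n)
    if expected == 0:
        return float("nan")  # ratio undefined when a base is absent
    return observed / expected


# Toy sequence (made up): UpA occurs at 2 of 7 adjacent positions
ratio = dinucleotide_ratio("UAUAGGCC", "UA")
```

A ratio below one indicates that the dinucleotide is under-represented relative to base composition, which is the sense in which UpA is described as suppressed.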

Figure 1—figure supplement 7.

Effect size (beta) of IFNL4 SNP rs12979860 genotypes (CC vs. non-CC) on the proportion of dinucleotide frequencies in the combined cohort (N = 485).

The effect size is measured for non-CC relative to CC group.

Figure 1—figure supplement 8.

Effect size (beta) of IFNL4 SNP rs12979860 genotypes (CC vs. non-CC) on the proportion of dinucleotide frequencies in the BOSON (N = 411) and EAP (N = 74) cohorts.

The effect size is measured for non-CC relative to CC group. The square indicates the effect size in the BOSON cohort and the circle indicates the effect size in the EAP cohort.

IFN-λ4 protein impacts on viral amino acid variation and viral load

We then investigated the impact of the different haplotypes of IFNL4 on HCV amino acid diversity and viral load to refine its possible role. After imputing and phasing IFNL4 rs368234815 and rs117648444 (Materials and methods), we observed three haplotypes: TT/G (IFN-λ4-Null); ΔG/G (IFN-λ4-P70) and ΔG/A (IFN-λ4-S70). HCV-infected patients were classified into three groups according to their predicted ability to produce IFN-λ4 protein: (i) no IFN-λ4 (two allelic copies of IFN-λ4-Null; BOSON N = 145, EAP N = 41), (ii) IFN-λ4-S70 (two copies of IFN-λ4-S70 or one copy of IFN-λ4-S70 and one copy of IFN-λ4-Null; BOSON N = 48, EAP N = 7), and (iii) IFN-λ4-P70 (at least one copy of IFN-λ4-P70; BOSON N = 218, EAP N = 26) (Supplementary file 7). Since IFN-λ4-S70 can be distinguished phenotypically from IFN-λ4-P70 both in vivo and in vitro, we examined whether the IFNL4 haplotypes had distinct effects on viral amino acid polymorphisms and viral load. We estimated the effect size of IFN-λ4-S70 and IFN-λ4-P70 relative to the IFN-λ4-Null haplotype on the presence and absence of the 76 amino acids associated with IFNL4 SNP rs12979860 genotypes at a 10% FDR. We found that the estimated effect sizes of IFN-λ4-S70 were consistently smaller than those for IFN-λ4-P70 (Figure 2). Under the null hypothesis that there is no difference in the effect sizes of IFN-λ4-P70 and IFN-λ4-S70 variants on viral amino acid polymorphisms, we would expect the slope of the linear regression line (Figure 2) to have a value of one. However, the estimated slope of the best-fit line was significantly different from one (slope = 0.77, p = 9.6 × 10⁻⁷, Figure 2). Additionally, we used bootstrapping to account for the uncertainty associated with the estimated effect sizes of IFN-λ4-P70 and IFN-λ4-S70 variants on HCV amino acid polymorphisms (Materials and methods). When estimating the slope of the line, we observed that the bootstrap 95% confidence interval [0.69, 0.99] for the slope of the line did not include one. 
Thus, we concluded that the impact of the host IFN-λ4-S70 variant on HCV-encoded amino acids was significantly smaller than the IFN-λ4-P70 variant.
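The three-group classification described above can be written as a small lookup, sketched here under stated assumptions: the function and dictionary names are hypothetical, haplotypes are represented as (rs368234815 allele, rs117648444 allele) pairs, and the ASCII string "dG" stands in for the ΔG allele.

```python
# Map each phased haplotype to the IFN-λ4 variant it encodes.
# "dG" is an ASCII stand-in for the ΔG allele at rs368234815.
HAPLOTYPE_TO_VARIANT = {
    ("TT", "G"): "Null",  # frameshift: no IFN-λ4 protein
    ("TT", "A"): "Null",  # frameshift: no IFN-λ4 protein
    ("dG", "G"): "P70",   # more active variant
    ("dG", "A"): "S70",   # less active variant
}


def classify_patient(hap1, hap2):
    """Assign a patient to one of the three groups from two haplotypes."""
    variants = {HAPLOTYPE_TO_VARIANT[hap1], HAPLOTYPE_TO_VARIANT[hap2]}
    if "P70" in variants:
        return "IFN-λ4-P70"   # at least one copy of IFN-λ4-P70
    if "S70" in variants:
        return "IFN-λ4-S70"   # S70/S70 or S70/Null
    return "IFN-λ4-Null"      # two copies of IFN-λ4-Null


group = classify_patient(("dG", "A"), ("TT", "G"))
```

The order of the checks encodes the precedence in the text: any copy of IFN-λ4-P70 places the patient in the P70 group, including S70/P70 heterozygotes.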

Figure 2.

Comparison of the effect sizes (log(OR)) of host IFN-λ4 variants (IFN-λ4-S70 and IFN-λ4-P70 relative to IFN-λ4-Null) on HCV-encoded amino acids.

The circles show the log(OR) estimates and the grey lines indicate the 95% confidence intervals. The dashed line is the y = x line, which has a slope of one. The solid black line shows the linear regression line, which has a slope of 0.77 that is significantly different from one (y = x line, p = 9.6 × 10⁻⁷).

We then investigated the effects of IFN-λ4 haplotypes on viral load. For this analysis, the EAP cohort was excluded as these patients had advanced liver disease with consistently lower viral loads relative to the BOSON cohort (Figure 3—figure supplement 1). We observed no difference in mean viral load between patients carrying IFN-λ4-S70 and IFN-λ4-Null haplotypes (p = 0.61). However, the viral load in patients carrying IFN-λ4-P70 was significantly lower than in the other two groups (p = 1.6 × 10⁻⁴ and p = 3.9 × 10⁻¹⁰), with IFN-λ4-P70 conferring an approximately 2.3-fold decrease in viral load compared to IFN-λ4-S70 (mean for IFN-λ4-P70 = 2,905,333, IFN-λ4-S70 = 6,703,875 and IFN-λ4-Null = 6,256,523 IU/ml, Figure 3a).

Figure 3—figure supplement 1.

Pre-treatment viral load stratified by cohort.

The viral load in the EAP cohort (N = 74) is significantly lower than that of the BOSON cohort (N = 411) (p = 1.49 × 10⁻¹³).

Figure 3.

Bayesian model comparison of effect sizes of IFN-λ4 variants on viral load in the BOSON cohort (N = 411).

(a) Pre-treatment viral load stratified by the host IFN-λ4 variants. The black dots and lines indicate the mean and 95% confidence interval (CI) for each group. (b) The posterior probability of the five tested models from (a) stacked on top of each other (from model 1 to model 5; posterior probabilities of models 1, 2 and 5 are too small to be labelled on this plot). Models where the posterior probability is higher or lower than the prior probability are coloured as dark grey and light grey respectively. Only model 3 (as indicated) has a posterior probability greater than its prior probability and it assumes that the mean viral load is identical in IFN-λ4-Null and IFN-λ4-S70 groups and different from the mean viral load of IFN-λ4-P70 group. (c) Viral load stratified by the host IFN-λ4 variants and the presence and absence of serine at the viral amino acid site 2414. The black dots and lines indicate the mean and 95% CI for each group. (d) The posterior probability of the 58 tested models from (c) stacked on top of each other (from model 1 to model 58; only models where the posterior probability is higher than the prior probability are labelled on this plot). Models where the posterior probability is higher or lower than the prior probability are coloured as dark grey and light grey respectively. Model 5 has the highest posterior probability and it assumes that the mean viral load is only different in ‘IFN-λ4-P70 + 2414 serine’ group and identical in other groups.

We used a Bayesian approach to investigate the relationship between the effect sizes of the three IFN-λ4 haplotypes on viral load (Figure 3b). In essence, this method weighs up the evidence that the genetic effects of the IFN-λ4-Null, IFN-λ4-S70 and IFN-λ4-P70 haplotypes are the same or not relative to each other (Materials and methods). We tested five models: the effects of the three haplotypes are identical (model 1), the effects of IFN-λ4-P70 and IFN-λ4-Null are identical and different from the effect of IFN-λ4-S70 (model 2), the effects of IFN-λ4-S70 and IFN-λ4-Null are identical and different from the effect of IFN-λ4-P70 (model 3), all three haplotypes have different effect sizes (model 4) and the effects of IFN-λ4-P70 and IFN-λ4-S70 are the same but different from the effect of IFN-λ4-Null haplotype (model 5). Equal prior probabilities were used for all models. Model 3 had the highest posterior probability of 0.82 (Figure 3b). We had previously reported an association between IFNL4 SNP rs12979860 genotypes, the HCV-encoded amino acids at position 2414 in the polyprotein and viral load (Ansari et al., 2017). In the present study, we further stratified the analysis by IFNL4 haplotypes. The viral serine (S) residue at site 2414 (S2414) was significantly enriched in patients coding for the IFN-λ4-P70 variant compared to IFN-λ4-S70 (p = 3.39 × 10⁻³) and IFN-λ4-Null (p = 5.94 × 10⁻⁹) coding patients. S2414 was present in 86% (211/244) of IFN-λ4-P70 carrying patients, 69% (38/55) of IFN-λ4-S70 carrying patients and 62% (114/185) of IFN-λ4-Null carrying patients. 
In HCV-infected patients with a serine residue at position 2414, we observed no significant difference in mean viral load between IFN-λ4-Null and IFN-λ4-S70 carriers (p = 0.31, Figure 3c), but both groups had a significantly higher viral load than IFN-λ4-P70 carriers (IFN-λ4-Null: p = 2.7 × 10⁻⁹; IFN-λ4-S70: p = 1.6 × 10⁻¹⁰). However, no such association (p = 0.49) was found in HCV-infected patients with a non-serine residue at this site (Figure 3c). We performed a Bayesian analysis that compared 58 possible models against each other (from all effect sizes being the same to all being different from each other). The model where only the ‘IFN-λ4-P70 + S2414’ group had an effect size different from the other groups (model 5) had the highest posterior probability of 0.33 (Figure 3d and Appendix 1). Taken together, the combination of IFN-λ4-P70 and S2414 conferred a 2.6-fold decrease in viral load compared to IFN-λ4-S70 and S2414 (mean viral load for IFN-λ4-P70 and S2414 = 2,376,747 IU/ml, and mean viral load for IFN-λ4-S70 and S2414 = 6,093,167 IU/ml).
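The general logic of comparing models by posterior probability under equal priors can be sketched as follows. This is a generic illustration and not the method described in the study's Materials and methods: the function name is hypothetical, and the log marginal likelihoods are made-up numbers chosen only so that the third model is favoured.

```python
import math


def posterior_model_probabilities(log_marginal_likelihoods, priors=None):
    """Posterior probability of each model given log marginal likelihoods."""
    k = len(log_marginal_likelihoods)
    if priors is None:
        priors = [1.0 / k] * k  # equal prior probability for every model
    log_joint = [
        lml + math.log(prior)
        for lml, prior in zip(log_marginal_likelihoods, priors)
    ]
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(log_joint)
    weights = [math.exp(value - peak) for value in log_joint]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical log marginal likelihoods for five models (not study data)
toy_log_ml = [-110.0, -108.0, -100.0, -101.6, -109.0]
post = posterior_model_probabilities(toy_log_ml)
best_model = post.index(max(post)) + 1  # 1-based model number
```

With equal priors, a model is supported by the data when its posterior probability exceeds 1/k, which is the comparison drawn in Figure 3b and 3d between posterior and prior probabilities.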

Discussion

Here, we show that genetic variants in the human IFNL4 locus drive widespread sequence changes across the entire HCV polyprotein - much larger than we previously reported (Ansari et al., 2017). We did not observe statistically significant enrichment of association signals in any specific HCV protein nor in HLA restricted epitope regions. This indicates that the host IFNL4 locus, and hence the innate immune response, can influence the amino acid residues that are encoded at specific sites on the HCV polyprotein. The mechanism of action of IFN-λ4 in determining such selectivity is not known. We also report an association of the IFNL4 locus with synonymous codon variants, suggesting that this locus might also affect the HCV genome at the nucleotide sequence level. Finally, we report that the IFNL4 haplotype coding for the IFN-λ4-S70 variant has a smaller impact on viral load and viral amino acid variation at associated sites compared to the haplotype coding for the more active IFN-λ4-P70 variant. This indicates that the IFNL4 gene not only mediates HCV amino acid variation but also modulates viral load in patients. Hence, our findings extend the association between genetic variation in the IFNL4 locus and outcome of HCV infection as well as hepatic disease (Aoki et al., 2015; Ge et al., 2009; Noureddin et al., 2013; Patin et al., 2012; Rauch et al., 2010; Suppiah et al., 2009; Tanaka et al., 2009; Thomas et al., 2009; Eslam et al., 2017). We selected patients chronically infected with HCV genotype 3a and of self-reported white ancestry to limit the impact of human and viral population co-structuring in our analyses. We observed that 2.5% of HCV-encoded amino acids across the polyprotein were significantly associated with host IFNL4 SNP at a 10% FDR. No such association was observed for 500 host SNPs chosen from across the human genome with a MAF similar to that for IFNL4 SNP rs12979860. 
This indicated that the observed impact of IFNL4 SNP rs12979860 on the viral sequences was not due to population structure or other systematic bias. Compared to our previous report (Ansari et al., 2017), the number of sites associated with IFNL4 genotype increased from 11 to 42 at a 5% FDR. One of the 11 previously reported associations was not reproduced (position 2576) as it was not tested in this study, due to more stringent frequency filtering (an amino acid had to be present in at least 20 samples to be tested). Thus, we have identified a further 32 associated sites at a 5% FDR compared to our previous report. There are two key factors that have contributed to this increased number of associated sites. Firstly, only HCV genotype 3a sequences found in those of white ancestry have been included in the analysis; our previous report included HCV genotype 2 and other genotype 3 subtypes from a broader ethnic mix. Secondly, we have used logistic regression and accounted for population structure by including viral and host genetic PCs as covariates in the analysis. This approach has recently been shown to be more powerful to detect associations in genome-to-genome analysis (Naret et al., 2018). Thus, focusing on a homogenous population (white ethnicity infected with genotype 3a) and using logistic regression has increased the power of our study to detect many more associations between the IFNL4 locus and amino acid variation in the HCV polyprotein. To distinguish the effect of IFNL4 SNP rs12979860 on viral amino acid variation from nucleotide sequence variability, we investigated the association of the host IFNL4 SNP rs12979860 with synonymous and non-synonymous codon changes in the HCV genome. IFNL4 SNP rs12979860 genotypes were associated with 35 non-synonymous codon changes at a 10% FDR, but we also observed four significant associations with synonymous codon changes. 
This indicated that in addition to a widespread impact at the amino acid level, the IFNL4 locus may also independently drive nucleotide diversity but to a lesser extent. To further explore the impact of the IFNL4 variants on HCV genomic sequences, we investigated viral dinucleotide frequencies. The HCV UpA dinucleotide frequency was significantly associated with the IFNL4 SNP rs12979860 genotypes; interestingly, ribonuclease L (RNase-L), an ISG that cleaves viral RNA to control viral infections in animals (Ding and Voinnet, 2007), targets both UpA and UpU dinucleotides (Wreschner et al., 1981). Moreover, HCV genotype 1, which is relatively resistant to IFN-based therapy, has a lower frequency of UpA and UpU dinucleotides than the more IFN-sensitive HCV genotypes 2 and 3 (Dao Thi et al., 2012; Kong et al., 2012). Indeed, the UpA dinucleotide is targeted by RNA-degrading enzymes and its presence in an RNA sequence accelerates its degradation in the cytoplasm (Simmonds et al., 2013). It is possible that the more stringent immune environment of non-CC patients (with higher hepatic expression of ISGs) selects virus with a lower UpA frequency. However, we note that the IFNL4 SNP rs12979860 non-CC patients have a modest reduction (0.9%) in their viral UpA frequencies relative to the CC patients and that this reduction could be mediated by the widespread amino acid changes associated with the IFNL4 SNP rs12979860. Previous studies have shown that IFN-λ4-P70 and IFN-λ4-S70 variants of IFN-λ4 protein have distinct phenotypes, both in vivo and in vitro. We hypothesized that if IFN-λ4 contributes to changes in the viral polyprotein, then the IFN-λ4-P70 and IFN-λ4-S70 variants should have different effect sizes on HCV-encoded amino acids and viral load. By imputing and phasing IFNL4 SNPs in our cohort, we inferred the haplotypes consisting of the IFNL4 rs368234815 (ΔG/TT) and rs117648444 (G/A) variants. 
Using these data, we observed that the haplotype coding for the IFN-λ4-S70 variant had a smaller effect on viral amino acid variability compared to the haplotype coding for the IFN-λ4-P70 variant. This observation agrees with previous independent studies showing reduced antiviral activity for IFN-λ4-S70 in vitro (Bamford et al., 2018; Terczyńska-Dyla et al., 2014). Moreover, the mean viral load in patients carrying IFN-λ4-Null was similar to that in patients carrying the IFN-λ4-S70 variant; by contrast, those carrying an IFN-λ4-P70 variant had a reduced viral load, which correlates with its higher antiviral activity. Taken together, these observations reinforce the hypothesis that IFNL4 is a functional gene with a major role in HCV infection. We conclude that production of IFN-λ4 protein drives an altered immune response that mediates reduced viral load with a broad impact on viral amino acid diversity. In this study, we demonstrated by using paired host-HCV genomic data that the host IFNL4 gene, which encodes a cytokine that is part of the innate immune response and is not expected to target specific viral residues, can mediate selection of amino acids at specific sites on the HCV polyprotein. We report that 1.4% (42/3021) of the HCV amino acids across the viral polyprotein are associated with host IFNL4 SNP rs12979860 at a 5% FDR and that the impact on amino acid variation is spread across all viral proteins. In comparison, we previously reported that 0.7% of the HCV amino acids were associated with HLA class I and II alleles (Ansari et al., 2017) at a 5% FDR. The only other major driver of HCV amino acid variation is the host B cell response, which is largely restricted to modifying amino acids in the envelope glycoproteins, in particular E2 (Ball et al., 2014). 
Thus, although both arms of the adaptive immune system direct amino acid selection through pressure on epitopes recognised by T and B cell responses to infection, our results show that the innate immune response also has the capacity to drive particular polymorphisms in HCV-encoded proteins and that its impact is comparable to that of the T cell response (as indicated by HCV amino acid variants associated with HLA alleles). Given that we did not observe any significant enhancement or depletion of association signals in a specific viral protein or in HLA-restricted epitopes, IFN-λ4 may exert its impact through a previously unknown mechanism or at more than one stage of the virus life cycle. We anticipate that this would result from distinct host responses in those who carry genetic variants that lead to IFN-λ4 synthesis as compared to individuals who carry the pseudogenized form of the gene. In common with other IFNs, IFN-λ4 induces a large number of ISGs, many of which are largely unstudied or poorly characterised. Within such an environment, it is possible that subtle selection of certain amino acids along the polyprotein will provide an advantage for viral entry, RNA replication, virion assembly and release. Although there was no significant enrichment of associations comparing the structural with the non-structural proteins, the E2 glycoprotein contained the highest proportion of sites affected by IFNL4 SNP rs12979860. From mapping these sites onto previously known functional domains on E2, we found that many residues were located either in hypervariable region 1 (HVR1) or on the surface of the protein. Indeed, some sites coincided with epitopes that are targets for the antibody response or have a role in virus entry (Appendix 1 and Figure 1—figure supplement 5). 
Since the host response to HCV genotype 3a infection induces pathways including those affecting B cell development (Robinson et al., 2015), we do not exclude the possibility that IFNL4 SNP rs12979860 genotype either influences B cell response to infection or the process of virus binding and entry. Indeed, a recent report has demonstrated the emergence of variants in E2 during establishment of chronic infection that give enhanced resistance to interferon-induced transmembrane (IFITM) proteins (Wrensch et al., 2019). One of the E2 residues that encoded a different amino acid as chronicity developed was at position 500, which is an IFNL4-associated site (Supplementary file 2). Given that IFITM proteins potently inhibit virus entry, it is possible that differences in IFITM induction between those who do and do not produce IFN-λ4 may account for some of the footprint observed in the E2 protein sequence. Moreover, IFITM proteins cooperate with anti-HCV neutralising antibodies to enhance restriction of virus entry. Therefore, there may be interaction between genes differentially regulated between IFN-λ4 producers and non-producers and components of the adaptive immune system, which together influence the amino acids encoded at certain IFNL4-associated sites (Wrensch et al., 2019). To reduce any confounding effects due to population stratification, we limited our analysis to a homogenous population of self-reported white origin infected with HCV genotype 3. Thus, our observations cannot directly be extended to other human populations or other HCV genotypes. This study provides a foundation for future analysis on whether IFNL4 genetic variation also drives diversity in other HCV genotypes and subtypes across other ethnic populations (also see ‘Adaptation of hepatitis C virus to interferon lambda polymorphism across multiple viral genotypes’ by Chaturvedi et al. in this issue). 
Additionally, our observations were performed with in vivo data; further functional studies using appropriate in vitro model systems are needed to understand how the IFNL4 locus drives HCV amino acid variation and modulates viral load. Such studies may also help to inform the basis for diversity and evolution of HCV in the presence or absence of IFNL4. There are now multiple studies suggesting that the IFNL3-IFNL4 locus could be a key player in the defense against viruses other than HCV. In HIV-infected patients, the rs368234815 variant has been associated with long-term non-progressor HIV-1 controllers (Dominguez-Molina et al., 2017). In influenza virus infection, IFNL4 SNP rs8099917 was associated with increased sero-conversion after influenza vaccination (Egli et al., 2014a). IFNL4 variants have also been associated with bronchiolitis (Scagnolari et al., 2012), cytomegalovirus (Egli et al., 2014b) and Andes virus (Angulo et al., 2015) infections. These observations suggest that IFNL4 possibly plays a role in many viral infections and immune related diseases in the liver and other organs. Investigating how IFN-λ4 (a cytokine without epitope specificity) drives amino acid selectivity in the HCV polyprotein would add a new dimension to how the human innate immune system interacts with viruses and controls infectious diseases.

Materials and methods

Patient cohorts

For this study, we used patient data from the BOSON and EAP cohorts that have been described elsewhere (Foster et al., 2015; Foster et al., 2016). All patients provided written informed consent before undertaking any study-related procedures. The BOSON study protocol was approved by each institution’s review board or ethics committee before study initiation. The study was conducted in accordance with the International Conference on Harmonisation Good Clinical Practice Guidelines and the Declaration of Helsinki (clinical trial registration number: NCT01962441). The EAP study conforms to the ethical guidelines of the 1975 Declaration of Helsinki as reflected in a priori approval by the institution’s human research committee. The EAP patients were enrolled by consent into the HCV Research UK registry. Ethics approval for HCV Research UK was given by NRES Committee East Midlands - Derby 1 (Research Ethics Committee reference 11/EM/0314). Patients from the BOSON cohort were recruited in five different countries (Australia, Canada, New Zealand, United Kingdom and United States). Patients from the EAP cohort were recruited exclusively in the United Kingdom. To limit the potential impact of population structure, we restricted the analysis to patients of self-reported white ancestry infected with HCV genotype 3a for which we had obtained both host genome-wide SNP data and full-length HCV genome sequences. In total we included 485 patients in the study, 411 from the BOSON cohort and 74 from the EAP cohort. The majority of the patients from the BOSON cohort have no or mild liver disease (compensated liver cirrhosis). The EAP cohort on the other hand consists of HCV-infected patients with advanced liver disease, the majority of whom had decompensated cirrhosis.

Host genotyping and imputation

Informed consent for genetic analysis was obtained from all patients. Genomic DNA was extracted from buffy coat using the Maxwell RSC Buffy Coat DNA Kit (Promega) as per the manufacturer's protocol and quantified using Qubit (Thermofisher). DNA samples from patients were genotyped using the Affymetrix UK Biobank array, as described elsewhere (Ansari et al., 2017). Phasing and imputation were performed using SHAPEIT2 (Delaneau et al., 2013) and IMPUTE2 (Howie et al., 2009) version 2.3.1 using default settings and the 1000 Genomes Phase III dataset as a reference population (Auton et al., 2015). Imputation quality was high for both rs117648444 and rs368234815 variants (information 0.974 and 0.994 respectively and certainty 0.995 and 0.997 respectively). Patients from the EAP cohort from whom enough DNA was available (62/74) were also independently genotyped for both rs117648444 and rs368234815 variants. Genotyping of IFNL4 rs368234815 and rs117648444 was performed on DNA using the TaqMan SNP genotyping assay and sequences described previously (Prokunina-Olsson et al., 2013) with Type‐it Fast SNP Probe PCR Master Mix (Qiagen). The concordance between genotyped and imputed genotypes was 100% for both variants.

Virus sequencing

The generation and assembly of viral sequences from HCV-infected clinical samples for the BOSON and EAP cohorts have been described previously (Ansari et al., 2017; Singer et al., 2019).

Statistical analysis

Human and viral population structure

For the viral data, principal component analysis (PCA) was performed on the nucleotide data as follows. The presence and absence of each viral nucleotide at all variable sites in the alignment was coded as a binary variable such that a bi-allelic site on the viral genome was converted into two binary variables (one for each nucleotide), a tri-allelic site into three binary variables and a quad-allelic site into four binary variables. R (version 3.4.3, https://www.r-project.org) was then used to perform the PCA using the prcomp function with default settings. For the human genotype data, PCA was performed using flashpca (Abraham and Inouye, 2014). Whole-genome viral consensus sequences for each patient were aligned using MAFFT (Katoh and Standley, 2013) with default settings. This alignment was used to create a maximum-likelihood tree using RAxML (Stamatakis, 2014), assuming a general time-reversible model of nucleotide substitution under the gamma model of rate heterogeneity. The resulting tree was rooted at midpoint. We used treeBreaker software (Ansari and Didelot, 2016) (https://github.com/ansariazim/treeBreaker) to measure the association between the virus phylogenetic tree and the host IFNL4 SNP (CC vs. non-CC) and with the cohort from which the viral sequence was obtained (BOSON vs. EAP). This software uses a Bayesian model to infer whether the phenotype of interest is randomly distributed on the tips of the tree and to estimate which clades, if any, have a distinct distribution of the phenotype of interest from the rest of the tree. This software also performs Bayesian model comparison. It provides a Bayes factor for the alternative model (there is at least one branch with a distinct distribution of the phenotype of interest) relative to the null model (there are no branches with a distinct phenotype distribution on the tree). The Bayes factor is a summary of the evidence provided by the data in favour of one model compared to another. 
In other words, the higher the Bayes factor, the more support there is for one model against another. Assuming that we are testing an alternative model against a null model, it has been suggested that the Bayes factor can be divided into four categories (Kass and Raftery, 1995): a Bayes factor of 1 to 3.2 indicates very little evidence against the null model; 3.2 to 10, substantial evidence against the null model and in favour of the alternative model; 10 to 100, strong evidence against the null model and in favour of the alternative model; and greater than 100, decisive evidence against the null model and in favour of the alternative model.
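The four Bayes factor categories can be written as a small lookup, which may help when reading reported values. This is a minimal sketch: the function name is hypothetical, and the treatment of the shared boundaries (3.2, 10 and 100 are assigned to the lower category) and of values below 1 are assumptions, since the text does not specify them.

```python
def bayes_factor_category(bayes_factor: float) -> str:
    """Map a Bayes factor (alternative vs. null) to the four categories above.

    Assumption: boundary values belong to the lower category, and values
    below 1 are reported as favouring the null model.
    """
    if bayes_factor <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bayes_factor < 1:
        return "favours the null model"
    if bayes_factor <= 3.2:
        return "very little evidence against the null model"
    if bayes_factor <= 10:
        return "substantial evidence against the null model"
    if bayes_factor <= 100:
        return "strong evidence against the null model"
    return "decisive evidence against the null model"


# Example inputs: 249 and 1.1 are the two Bayes factors quoted in the
# reviewers' comments on this paper.
for value in (249, 1.1):
    print(value, "->", bayes_factor_category(value))
```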

Association tests

The univariate association between the IFNL4 SNP rs12979860 (CC vs. non-CC) and the viral PCs was tested using logistic regression in R. We used the qvalue function from the qvalue package in R to perform the FDR analysis. To choose 500 SNPs across the human genome with minor allele frequency (MAF) similar to the IFNL4 SNP rs12979860 MAF, we used Fisher’s exact test to compare all SNPs against the IFNL4 SNP rs12979860 (2 × 3 contingency table where the columns indicate the number of 0, 1 and 2 copies of the minor allele and the rows indicate the IFNL4 SNP and the target SNP counts) and chose the 500 SNPs with the largest P-values (least significance). SNPs in the IFNL3-IFNL4 region were not included. To test for association between the virus amino acids and the host SNPs we used logistic regression in R including as covariates the first three host and the first two viral PCs. We investigated presence and absence of each amino acid at all variable sites, given that the amino acid was present in at least 20 HCV sequences in our dataset. The presence and absence of the viral amino acid was used as the response variable and the host SNP coded as 0 (homozygous for major alleles) and 1 (all other genotypes) as the explanatory variable (the same coding as the IFNL4 SNP rs12979860 CC vs. non-CC). To test for enrichment or depletion of the association signals in a viral protein or the epitope regions, we used Fisher’s exact test. Each tested site is either within the target region or not and it is either classified as significant or not. The resulting 2 × 2 contingency table was tested using fisher.test function in R.
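The enrichment test on the 2 × 2 table can be illustrated without R. The sketch below computes a two-sided Fisher's exact p-value by summing the probabilities of all tables that are no more likely than the observed one, which is the usual definition of the two-sided test; it is an illustration in Python rather than the authors' code, and the counts in the example are invented.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows: site inside / outside the target region.
    Columns: site significant / not significant.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denominator = comb(total, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability that the top-left cell equals k,
        # with all margins held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    probabilities = [prob(k) for k in range(lowest, highest + 1)]
    observed = prob(a)
    # Small relative tolerance so ties with the observed table are included.
    p_value = sum(p for p in probabilities if p <= observed * (1 + 1e-7))
    return min(1.0, p_value)


# Invented counts, for illustration only: 3 of 4 sites in the region are
# significant, versus 1 of 4 sites outside it.
print(fisher_exact_two_sided(3, 1, 1, 3))
```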

Codon level analysis

To separate the impact of the IFNL4 SNP rs12979860 on amino acids from nucleotides we investigated the nucleotide sequences at the codon level. At each codon site (where the most common codon had at least 20 synonymous and 20 non-synonymous codons) we used logistic regression to test for association between IFNL4 SNP rs12979860 (CC vs. non-CC) and the changes from the most common codon to synonymous and non-synonymous codons. The IFNL4 SNP rs12979860 was denoted as the response variable and the codons were used as a categorical explanatory variable with three levels. The effect sizes (log(OR)) and P-values were estimated for the synonymous and non-synonymous codons relative to the most common codon. We included the first two viral PCs and the first three host PCs as covariates in this analysis.
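The three-level codon variable described above can be sketched as follows. This is a minimal illustration assuming the standard genetic code and unambiguous codons; the function names and the toy input are hypothetical, and the regression step itself is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codons(codons):
    """Label each sample's codon at one site relative to the most common codon.

    Returns the most common codon and one label per sample: 'reference',
    'synonymous' or 'non-synonymous'. Ambiguous codons raise KeyError.
    """
    codons = [codon.upper().replace("U", "T") for codon in codons]
    reference = Counter(codons).most_common(1)[0][0]
    reference_aa = CODON_TABLE[reference]
    labels = []
    for codon in codons:
        if codon == reference:
            labels.append("reference")
        elif CODON_TABLE[codon] == reference_aa:
            labels.append("synonymous")
        else:
            labels.append("non-synonymous")
    return reference, labels


def site_is_testable(labels, min_count=20):
    """Apply the frequency filter: at least `min_count` of each change type."""
    counts = Counter(labels)
    return counts["synonymous"] >= min_count and counts["non-synonymous"] >= min_count


# Toy site: GCT (Ala) is most common, GCC is synonymous, GTT (Val) is not.
reference, labels = classify_codons(["GCT", "GCT", "GCT", "GCC", "GTT"])
print(reference, labels)
```

The resulting labels correspond to the three levels of the categorical explanatory variable, with 'reference' as the base level.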

Di-nucleotides analysis

To estimate the viral dinucleotide frequencies, the observed proportion of each dinucleotide was normalised by its expected proportion (assuming the nucleotides are independent the expected proportion can be calculated by multiplying the observed proportions for the relevant nucleotides). To test for association with IFNL4 SNP rs12979860 genotype we used a linear regression where the normalised dinucleotide proportions were used as the response variable and the IFNL4 SNP rs12979860 genotype as a categorical explanatory variable. We included the first two viral PCs and the first three host PCs as covariates.
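The normalisation described above is the ratio of the observed dinucleotide proportion to the product of the two single-nucleotide proportions. The sketch below is one way to compute it; counting overlapping dinucleotides along the whole sequence and returning NaN when the expected proportion is zero are assumptions, as the text does not specify these details.

```python
def normalised_dinucleotide_frequency(sequence: str, dinucleotide: str) -> float:
    """Observed/expected ratio for one dinucleotide in one sequence.

    Assumptions: overlapping dinucleotides are counted along the whole
    sequence, and T is treated as U so DNA or RNA input both work.
    """
    sequence = sequence.upper().replace("T", "U")
    dinucleotide = dinucleotide.upper().replace("T", "U")
    if len(dinucleotide) != 2 or len(sequence) < 2:
        raise ValueError("Need a 2-letter dinucleotide and a sequence of length >= 2.")

    first, second = dinucleotide
    n_pairs = len(sequence) - 1
    observed = sum(
        1 for i in range(n_pairs) if sequence[i:i + 2] == dinucleotide
    ) / n_pairs
    # Expected proportion if the two nucleotides occurred independently.
    expected = (sequence.count(first) / len(sequence)) * (
        sequence.count(second) / len(sequence)
    )
    if expected == 0:
        return float("nan")
    return observed / expected


# Toy sequence: UpA occurs in 2 of 3 dinucleotide positions.
print(normalised_dinucleotide_frequency("UAUA", "UA"))
```

A ratio below one indicates that the dinucleotide is under-represented relative to the independence expectation, and a ratio above one that it is over-represented.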

Effect of the three IFN-λ4 protein variants

To estimate the effect of the IFN-λ4 protein variants on the encoded HCV amino acids, we used the 76 sites associated with IFNL4 SNP rs12979860 at a 10% FDR. HCV-infected patients were classified into three groups according to their predicted ability to produce IFN-λ4 protein: (i) no IFN-λ4 (two allelic copies of IFN-λ4-Null, NBOSON = 145, NEAP = 41), (ii) IFN-λ4–S70 (two copies of IFN-λ4-S70 or one copy of IFN-λ4-S70 and one copy of IFN-λ4-Null, NBOSON = 48, NEAP = 7), and (iii) IFN-λ4-P70 (at least one copy of IFN-λ4-P70, NBOSON = 218, NEAP = 26). We then used logistic regression to estimate the effect sizes (log(OR)) for IFN-λ4-P70 and IFN-λ4-S70 on the virus amino acids relative to IFN-λ4-Null. The presence and absence of the reported viral amino acid was used as the response variable and the host IFN-λ4 status was used as the explanatory variable with the IFN-λ4-Null used as the base level and the log(OR) for IFN-λ4-P70 and IFN-λ4-S70 were estimated relative to the IFN-λ4-Null base level. We included the first two viral PCs and the first three host PCs as covariates to account for host-virus populations co-structuring. To test whether the effect sizes of IFN-λ4-P70 and IFN-λ4-S70 on viral amino acids are the same, we used the above estimated effect sizes and fitted a linear regression line to it. One viral site (position 2371) was excluded from this analysis as it had unreliable effect size estimate (log(OR) = −17) for IFN-λ4-S70 variant. Under the null hypothesis that IFN-λ4-P70 and IFN-λ4-S70 have the same effect sizes, we would expect that the linear regression line to have a slope of one. To test whether the slope of the fitted line is different from one, we used R to fit a linear regression line with intercept of zero. We used bootstrapping to account for the uncertainty associated with the estimated effect sizes of IFN-λ4-P70 and IFN-λ4-S70 on viral amino acids. 
We simulated 10,000 bootstrap datasets where the effect sizes of each IFN-λ4 variant on each HCV amino acid were simulated using a normal distribution with mean set to the estimated effect size of the variant on that HCV amino acid and standard deviation set to the standard error of the estimate. For each dataset we fitted a linear regression with intercept of zero and estimated the slope of the fit. The empirical bootstrap 95% confidence interval of the slope of the line was estimated as [2* best_estimate_slope – 97.5%_quantile_of_bootstrap_slopes, 2* best_estimate_slope – 2.5%_quantile_of_bootstrap_slopes]. The ‘best_estimate_slope’ is the slope of the line estimated from the effect sizes without accounting for uncertainty. To assess whether the mean viral load was different in the three patient groups of IFN-λ4-Null, IFN-λ4-P70 and IFN-λ4-S70, we used a Bayesian framework to perform model comparison (see Appendix 1 for further details). The models we considered comprised fixed and independent effects between the IFN-λ4 variants. We standardised the log10(viral load) so that it had a mean of zero and standard deviation of one. We used linear regression to get maximum likelihood estimates of the effects of IFN-λ4-S70 and IFN-λ4-P70 variants relative to the IFN-λ4-Null variant. The estimates were adjusted for cirrhosis status and population structure including the first two viral PCs and the first three host PCs in the regression as covariates. For each effect size, we assumed a normally distributed prior on the log(OR) of association with mean of zero. The prior covariance matrix determined the prior model assumptions. The elements of the covariance matrix were chosen such that the relevant prior model was set (see Appendix 1 for details). To assess the evidence for interaction between host IFN-λ4 variants and viral amino acid site 2414, we used the same Bayesian framework detailed above. 
The patients were grouped into six categories based on the host IFN-λ4 variants and the presence or absence of serine at viral site 2414. We standardised the log10(viral load) so that it had a mean of zero and standard deviation of one. We used linear regression to get maximum likelihood estimates of the effects of ‘IFN-λ4-Null + 2414 not serine’, ‘IFN-λ4-P70 + 2414 not serine’, ‘IFN-λ4-P70 + 2414 serine’, ‘IFN-λ4-S70 + 2414 not serine’, ‘IFN-λ4-S70 + 2414 serine’ groups relative to the ‘IFN-λ4-Null + 2414 serine’ group. The estimates were adjusted for cirrhosis status and population structure including the first two viral PCs and the first three host PCs in the regression as covariates. The prior covariance matrix determined the prior model assumptions. The elements of the covariance matrix were chosen such that the relevant prior model was set (see Appendix 1 for details).
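The bootstrap procedure for the slope of the effect-size comparison, described earlier in this subsection, can be sketched as below. The code is an illustration, not the authors' R script: all names are hypothetical, placing the IFN-λ4-P70 effect sizes on the x-axis and the IFN-λ4-S70 effect sizes on the y-axis is an assumption, and the quantiles use linear interpolation.

```python
import random


def slope_through_origin(x, y):
    """Least-squares slope of y on x with the intercept fixed at zero."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def quantile(values, q):
    """Sample quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def bootstrap_slope_ci(x, x_se, y, y_se, n_boot=10_000, seed=1):
    """Empirical bootstrap 95% CI for the zero-intercept slope.

    x, y: estimated log(OR) per viral site (assumed: x = P70, y = S70).
    x_se, y_se: standard errors of those estimates.
    """
    rng = random.Random(seed)
    best = slope_through_origin(x, y)
    slopes = []
    for _ in range(n_boot):
        # Redraw every effect size from a normal centred on its estimate.
        xs = [rng.gauss(mean, se) for mean, se in zip(x, x_se)]
        ys = [rng.gauss(mean, se) for mean, se in zip(y, y_se)]
        slopes.append(slope_through_origin(xs, ys))
    lower = 2 * best - quantile(slopes, 0.975)
    upper = 2 * best - quantile(slopes, 0.025)
    return best, (lower, upper)


# Invented effect sizes and standard errors, for illustration only.
p70 = [1.0, 2.0, -1.5, 0.8]
s70 = [0.4, 1.1, -0.7, 0.5]
best, (low, high) = bootstrap_slope_ci(p70, [0.2] * 4, s70, [0.3] * 4, n_boot=2000)
print(f"slope = {best:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

If the resulting interval excludes one, the data are inconsistent with the null hypothesis that the two variants have equal effect sizes.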

Materials and correspondence

Correspondence and material requests should be addressed by contacting STOP-HCV http://www.stop-hcv.ox.ac.uk/contact. In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included. Thank you for submitting your article "Broad Impact of Interferon Lambda 4 on Hepatitis C Virus Diversity" for consideration by eLife. Your article has been reviewed by four peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Wenhui Li as the Senior Editor. The reviewers have opted to remain anonymous. The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission. Summary: Previously, Ansari et al. reported results of a genome-to-genome study of 542 individuals who were chronically infected with HCV (predominantly viral genotype (VGT) 3, but also VGT 2; Ansari et al., 2017). Genotype for the IFNL4 rs12979860 SNP marker associated with 11 amino acid polymorphisms on the HCV polyprotein (60, 109 [C]; 500, 501, 576d, 578, 741 [E2]; 2414 [NS5A]; 2570, 2576, 2991 [NS5B]) based on a 5% false discovery rate (FDR). The strongest association was for 2570 in NS5B; residue 2414 in NS5A associated with HCV RNA levels amongst patients infected with VGT 3a. The present paper, which is restricted to the 485 subjects who were infected with VGT 3a, contains important new data, but the analytical approach used to arrive at these conclusions is complicated and often confusing. A more focused and unified presentation of the results is needed. Essential revisions: Ansari et al. 
have now imputed genotype for IFNL4 rs368234815, which controls generation of the IFN-λ4 protein, and IFNL4 rs117648444, a non-synonymous polymorphism that defines a structural variant of the IFN-λ4 protein. Haplotypes comprised of these variants generate different versions of the IFN-λ4 protein. The authors report that variation in the HCV genome and HCV RNA levels associate with these different haplotypes, consistent with previous associations between these functional IFNL4 haplotypes and HCV clearance. This finding is novel and potentially important, as it would provide additional support that the IFN activity of IFNL4 affects the observed phenotype. This finding raises some additional questions: Re: the effect of the IFNL4 loci on viral load, does that translate directly to a lower level of viral replication in patients with IFNL4 P70? Might this be investigated from the diversity of the viral quasi-species found in the patient? Re: the coupling between IFNL4 P70 and S2414 in the HCV genome, is IFNL4 P70 driving selection of a specific amino acid at this position? In the Nature Genetics paper, IFNL4 rs12979860 associated with 11 amino acid polymorphisms, but now, in a smaller cohort, that SNP apparently associates with 42 sites (both based on a 5% FDR). The authors should explain the reason for that striking difference and comment more generally about how findings from the current paper differ from the previous publication. Similarly, the paper should be clearer on which data and results are original to this paper with previously published data referenced to the Nature Genetics paper. The value of including the Expanded Access Programme (EAP) subjects is unclear. EAP contributes only ~15% of the total subjects and differs from the BOSON group regarding important demographic and clinical characteristics (sex, prevalence of cirrhosis, HCV RNA levels), as well as genotype frequencies of rs12979860. 
If EAP subjects are retained in the revised paper, these differences should be considered in the analysis and addressed in the Discussion. The reviewers have many questions and concerns about the statistical methods and the analyses. Multiple testing: FDRs of either 5% or 20%, as well as Bonferroni correction are employed. Unless there are compelling reasons for these different approaches (which should be stated), a uniform approach to multiple testing adjustment should be used. The Abstract states rs12979860 genotype associated with 4% of viral amino acid sites across the HCV polyprotein. Based on the discussion (that finding does not appear in the results), this result is based on a 20% FDR, whereas findings presented in the Results (and the previous paper) are based on a 5% FDR. A 20% FDR seems very high and needs to be justified. Genomic inflation factor for IFNL4 rs12979860 and 500 SNPs with similar frequencies: The genome inflation factor (represented by λ) is used to examine assumptions re: cryptic relatedness when a large set of SNP markers are tested for association with a dichotomous trait in a GWAS (https://www.ncbi.nlm.nih.gov/pubmed/11315092 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0019416), but it is unclear what is being analyzed here – association of viral variants with a trait (yes or no host SNP)? Are the assumptions about the distribution of viral variants the same as for distribution of host germline variants, for which this approach was developed? This is not a standard approach to test for unaccounted population structure or other biases and the logic for doing this is unclear. Can the authors provide a reference to support use of this method? A λ value of 2.16 is extremely high and indicates cryptic relatedness, but in this case that statistic is impossible to interpret it because the approach is not described adequately. Principal component analysis: What is used for PCA in the host? 
There is no explanation and the plot differs from those used for GWAS, where study samples are plotted in relation to reference populations. Which viral PCs were included? Were the PCs used as continuous variables in any models? (Note: "principal" is spelled as "principle" in several places.) Assessment of Confounding, Interaction and Mediation: Assessment as to whether adjustment for potential confounders (e.g. sex, age, study [EAP or BOSON], cirrhosis) is needed. Were genotypes associated with any patient characteristics? E.g. age or cirrhosis status? Stratified analyses should be performed to identify possible interactions for those variables, especially sex (interaction between IFNL4 genotype and sex has been reported for associations with hepatic fibrosis). The association of IFNL4 genotype with the frequency of HCV polymorphisms could reflect an effect of IFNL4 on viral replication rates. To assess that possibility, the investigators should compare the results of two logistic regression models: one that does and one that does not include HCV RNA as an additional covariate to IFNL4. Otherwise these paired models should include identical adjustments. Other comments re: statistical analyses: Subsection “IFNL4 SNP has a widespread impact on the viral amino acids”, first paragraph: how was the "expected median" computed? Subsection “IFNL4 SNP has a widespread impact on the viral amino acids”, second paragraph: what does "frequency matched" mean? With the same MAF? Subsection “IFNL4 SNP has a widespread impact on the viral amino acids”, third paragraph: please state outcome variable for the logistic regression models. Subsection “IFNL4 SNP has a widespread impact on the viral amino acids”, third paragraph: please define FDR and give reference. Subsection “Statistical analysis”, third paragraph: please state clearly that the SNP was the outcome for the logistic models. 
Subsection “Statistical analysis”, eleventh paragraph: To obtain maximum likelihood estimates one needs to assume a normal distribution for log10(viral load) transformed data. In my experience this assumption is not true for log10(viral load) transformed data. However, least squares estimates do not require this assumption.

Subsection “Statistical analysis”, eleventh paragraph: when a line was fit through the log(OR) estimates, how were the standard deviations of the log(ORs) used? That uncertainty needs to be accommodated.

Other general comments: The presentation is hard to follow with most data presented as minimally annotated supplementary materials with limited legends provided separately. Providing more detailed legends next to corresponding figures and tables should make it easier to follow. In the Discussion, the limitations of the study should be explored.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Interferon lambda 4 impacts on the genetic diversity of hepatitis C virus" for further consideration at eLife. Your revised article has been favorably evaluated by Wendy Garrett as the Senior Editor and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:

Overall, the authors were responsive to reviewer comments and the paper is much improved. The analytical approach remains complicated and the paper is still challenging to read. The authors should consider and address the following comments.

Multiple testing: The authors eliminated use of the Bonferroni correction and a false discovery rate (FDR) of 20%. The paper still presents two different FDR thresholds (5% and 10%) for many analyses and the reason for doing so is unclear. It would be simpler to report a single set of results based on a 5% FDR, the threshold used in the previous publication from this group.
Normality of viral load data: It is not clear from a visual inspection of the Q-Q plot that these data are normally distributed. A P-value for fit would be a more objective measure.

Second paragraph of the Introduction “substitutes a proline for a serine […]”: Terczyńska-Dyla et al. state, “an amino-acid substitution in the IFNλ4 protein changing a proline at position 70 to a serine (P70S) […]”. To this reader, that means a serine is substituted for a proline. Alternatively, the authors might use the language of Terczyńska-Dyla et al. to describe this variant.

Subsection “Host and virus genetic structures”: Without any explanation, it is unclear how to interpret the Bayes factors of 249 and 1.1.

How the patients are divided into IFNL4-null, S70 and P70 groupings could be clearer. Supplementary file 7 would present that information if the groups were arranged together and labeled.

Subsection “Host and virus genetic structures”, ninth paragraph – The co-submission by Chaturvedi et al. has been accepted for publication and might be referenced here.

Essential revisions:

Ansari et al. have now imputed genotypes for IFNL4 rs368234815, which controls generation of the IFN-λ4 protein, and IFNL4 rs117648444, a non-synonymous polymorphism that defines a structural variant of the IFN-λ4 protein. Haplotypes comprised of these variants generate different versions of the IFN-λ4 protein. The authors report that variation in the HCV genome and HCV RNA levels associate with these different haplotypes, consistent with previous associations between these functional IFNL4 haplotypes and HCV clearance. This finding is novel and potentially important, as it would provide additional support that the IFN activity of IFNL4 affects the observed phenotype. This finding raises some additional questions:

Re: the effect of the IFNL4 loci on viral load, does that translate directly to a lower level of viral replication in patients with IFNL4 P70?
Might this be investigated from the diversity of the viral quasi-species found in the patient?

We note the comment from the reviewers on whether the viral load translates directly to a lower level of viral replication. While lower viral replication (i.e. replication of viral RNA) could contribute to reduced viral load, we would not exclude a contribution from other parts of the viral life cycle, especially since the IFNL4 footprint extends across both structural and non-structural proteins. Studies on the effects on viral RNA replication would require further analysis. However, during the review process, we became aware of a new report from Thomas Baumert’s group which reveals interplay between the innate immune response (IFITM proteins) and neutralizing antibodies (Wrensch et al., Hepatology in press) that affect virus entry. Moreover, their study shows there is a change in the amino acids encoded by E2 from acute to chronic infection that generates IFITM-resistance. One of the sites that changes is at position 500 in the E2 protein, which is a position associated with IFNL4 genotype in our study (see Figure 1—figure supplement 2). Thus there could be differences in the interferon response dependent on IFNL4 genotype (i.e. producers vs. non-producers) that affect stages of the virus life cycle other than RNA replication (e.g. virus entry as in the Wrensch et al. paper or also translation, assembly and release). All of the above is addressed in the Discussion.

As to the second point raised by the reviewer about whether lower viral load may influence the diversity of quasispecies, our analysis has not investigated the viral quasispecies in individuals (i.e. intra-host viral diversity); all of our analysis has been conducted on the consensus sequences and in this way, we have examined inter-host viral diversity. We agree with the reviewer that any reduction in viral load mediated by IFNL4-P70 could influence the quasispecies at the intra-host level.
However, this would require a deeper analysis of the sequence data, preferably combined with functional in vitro studies.

Re: the coupling between IFNL4 P70 and S2414 in the HCV genome, is IFNL4 P70 driving selection of a specific amino acid at this position?

Since there is a significant association between serine at position 2414 and the IFNL4-P70 variant, our interpretation would be that IFNL4-P70 does exert selection of S2414. However, this is potentially true also for the other associated sites that have statistical significance. This point is made in the revised Abstract (see final sentence). However, specifically in relation to S2414, we have added the following sentence to the paper:

“The viral serine (S) residue at site 2414 (S2414) was significantly enriched in IFNL4-P70 carrying patients compared to IFNL4-S70 (P = 3.39×10⁻³) and IFNL4-Null (P = 5.94×10⁻⁹) carrying patients. S2414 was present in 86% (211/244) of IFNL4-P70 carrying patients, 69% (38/55) of IFNL4-S70 carrying patients and 62% (114/185) of IFNL4-Null carrying patients.”

In the Nature Genetics paper, IFNL4 rs12979860 associated with 11 amino acid polymorphisms, but now, in a smaller cohort, that SNP apparently associates with 42 sites (both based on a 5% FDR). The authors should explain the reason for that striking difference and comment more generally about how findings from the current paper differ from the previous publication. Similarly, the paper should be clearer on which data and results are original to this paper with previously published data referenced to the Nature Genetics paper.

Our previous analysis in the Nature Genetics paper used 542 samples from individuals with different ethnic backgrounds (the two main ethnic groups were white [n=452] and Asian [n=74]). Additionally these individuals were infected by HCV genotypes 2 (n=45) and 3 (n=496) with different subtypes (gt3a, gt3b and gt3c).
That study was aimed mainly at identifying any human genes that may be associated with viral polymorphisms. Not surprisingly, we identified viral variants that were linked to HLA alleles. Surprisingly, we found that IFNL4 genotype was also associated with variants in the virus polyprotein. However, the depth of the study was limited by the mixture of ethnicities and viral genotypes/subtypes in the BOSON cohort. We believe that our current report does not simply confirm the findings in the Nature Genetics paper but expands and enhances the extent of the footprint of IFNL4 genotype on variants in the viral polyprotein. With regard to the increased number of associated sites, there are two major factors compared to our previous report that contribute to this enhancement. Firstly, in the current paper, we have limited the dataset to individuals with self-reported white ancestry infected with HCV gt3a to avoid potential bias due to co-structuring between host and virus populations. To increase the power of the study, we included 74 patients from another cohort, the EAP cohort, who were also of white ethnicity and infected with gt3a. This gave a total of 485 patients with identical ethnic background and infection with the same HCV genotype/subtype. As with any GWAS study, confidence in statistically significant associations is greater with more data and this was a major factor for inclusion of the EAP group. The EAP and BOSON cohorts do differ with regard to severity of liver disease; the EAP cohort consists largely of patients with decompensated cirrhosis while the BOSON cohort were recruited on the basis of milder disease. We did not find that differences in disease severity affected our analysis (see below). Therefore, the associations between IFNL4 genotype and viral variants in gt3a apparently apply irrespective of the extent of liver disease. 
The second important difference between the two studies is that in the Nature Genetics paper we used a different methodology to analyse the impact of SNP rs12979860 on the viral variants. In the Nature Genetics paper, we used phylogenetically corrected Fisher’s exact test to measure associations and account for population structure. For this method we inferred the phylogenetic tree of the viral sequence data, estimated ancestral viral sequences and compared them to the viral sequences at the tips of the tree. The presence and absence of changes was then tested for association with SNP rs12979860. However, in the current study we apply logistic regression and account for population structure by including viral and host genetic PCs as covariates in the analysis. This approach has recently been shown to be more powerful to detect association in genome-to-genome analysis (Correcting for Population Stratification Reduces False Positive and False Negative Results in Joint Analyses of Host and Pathogen Genomes; DOI=10.3389/fgene.2018.00266). Thus, focusing on a homogenous population (white ethnicity infected with gt3a) and using logistic regression has increased the power of our study to detect many more associations between the SNP rs12979860 and the HCV amino acids.

We have now expanded the second paragraph in the Discussion to compare the findings in this study with those in the Nature Genetics paper.

The value of including the Expanded Access Programme (EAP) subjects is unclear. EAP contributes only ~15% of the total subjects and differs from the BOSON group regarding important demographic and clinical characteristics (sex, prevalence of cirrhosis, HCV RNA levels), as well as genotype frequencies of rs12979860. If EAP subjects are retained in the revised paper, these differences should be considered in the analysis and addressed in the Discussion.
Although the EAP group represents a lesser proportion of the cohort, our aim has been to gather as many well-characterised samples in terms of viral sequence and clinical data as possible. As stated above, increasing the scale of the group under study enhances the confidence in any statistical analysis for identifying associations. Moreover, the viral sequence data was generated by similar methods in Oxford and Glasgow using target enrichment probe sets and next generation sequencing. Generating the viral sequence data does not rely on HCV-specific primers and so the sequences produced are not biased. Putting together such a large number of sequences covering almost the entire length of the viral genome is not a trivial task. Additionally, we tested and found no association between the HCV phylogenetic tree (which estimates the population structure of the virus) and the host IFNL4 SNP rs12979860 (using treeBreaker as indicated in the subsection “Host and virus genetic structures” and Figure 1—figure supplement 3B). Furthermore, we also performed all of the analysis in the BOSON cohort alone and found similar association results that are reported in the Materials and methods. Finally, we have performed an analysis testing for association between the cohort of origin of the patient and viral amino acid variants and found no significant associations (see below). Therefore, we believe that there are sound reasons for combining the BOSON and EAP cohorts.

The reviewers have many questions and concerns about the statistical methods and the analyses.

Multiple testing: FDRs of either 5% or 20%, as well as Bonferroni correction are employed. Unless there are compelling reasons for these different approaches (which should be stated), a uniform approach to multiple testing adjustment should be used. The Abstract states rs12979860 genotype associated with 4% of viral amino acid sites across the HCV polyprotein.
Based on the discussion (that finding does not appear in the results), this result is based on a 20% FDR, whereas findings presented in the Results (and the previous paper) are based on a 5% FDR. A 20% FDR seems very high and needs to be justified.

Firstly, we have modified the Abstract to remove the statement on association with 4% of viral amino acid sites across the viral polyprotein. In addition, we have modified the text to use 5% and 10% FDR thresholds throughout the manuscript.

Genomic inflation factor for IFNL4 rs12979860 and 500 SNPs with similar frequencies: The genomic inflation factor (represented by λ) is used to examine assumptions re: cryptic relatedness when a large set of SNP markers are tested for association with a dichotomous trait in a GWAS (https://www.ncbi.nlm.nih.gov/pubmed/11315092 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0019416), but it is unclear what is being analyzed here – association of viral variants with a trait (yes or no host SNP)? Are the assumptions about the distribution of viral variants the same as for distribution of host germline variants, for which this approach was developed? This is not a standard approach to test for unaccounted population structure or other biases and the logic for doing this is unclear. Can the authors provide a reference to support use of this method? A λ value of 2.16 is extremely high and indicates cryptic relatedness, but in this case that statistic is impossible to interpret because the approach is not described adequately.

We have modified the text to simplify this section. We have tested presence and absence of each viral variant (amino acid or codon) against host SNPs (the rs12979860 SNP was converted to binary variables to indicate CC vs. non-CC genotypes and the other 500 SNPs with similar frequencies were similarly converted to binary variables to indicate homozygous for major allele vs. other genotypes).
In other words, we have performed 501 viral GWASs and in each GWAS the trait of interest is a host SNP (SNP rs12979860 and 500 other SNPs with similar frequencies). In this revised version of the manuscript, we decided to present the results differently and to remove the genomic inflation factor. We have observed 42 association signals with IFNL4 SNP rs12979860 at a 5% FDR, whilst 491 of the 500 SNPs with a similar allele frequency showed no association signal and 9 showed only one association signal. If host-virus population co-structuring was contributing to our results, we would expect to observe similar patterns between HCV amino acid variants and the other host SNPs frequency-matched to IFNL4 SNP. Similarly, only the IFNL4 SNP rs12979860 showed a distribution of observed P-values different from the distribution of P-values for a null hypothesis of no associations, as shown on the qqplots. We moved the qqplot figure to a figure supplement and added a Manhattan plot of the associations (Figure 1).

To answer the reviewer’s question, we detail our assumptions for the genomic inflation factor, as presented in the earlier version, as follows. For each of the viral GWASs against the 500 frequency-matched host SNPs, we calculated a genomic inflation factor. Only the viral GWAS with SNP rs12979860 as the outcome variable had inflated P-values, while none of the viral GWASs with any other host SNP (outside of the IFNLs loci) as the outcome variable had inflated P-values. If co-structuring of host and virus populations or cryptic relatedness or other systematic biases were driving the inflation in the P-values of viral GWAS against host rs12979860 SNP, we would expect to observe the same effect for other host SNPs. As Figure 1—figure supplement 6A shows, our viral GWAS P-values against all other host SNPs follow the expected null hypothesis of no association.

Principal component analysis: What is used for PCA in the host?
There is no explanation and the plot differs from those used for GWAS, where study samples are plotted in relation to reference populations.

We used FlashPCA to perform PCA on the host genotype data. We performed this analysis only in the studied cohorts (all with self-reported white ethnicity) and did not include reference populations. This was to ensure that we were capturing lower level population structures in our cohort that may otherwise be lost in the context of global population structure. It is a standard analysis to do in human genetics with a homogenous cohort, even if it might not be the most reported one in the literature.

Which viral PCs were included? Were the PCs used as continuous variables in any models?

Viral sequence data were converted into binary variables, which were then used for PCA. We used the first two PCs (as continuous variables) in all our viral GWASs unless indicated otherwise in the text.

(Note: "principal" is spelled as "principle" in several places.)

We modified the text accordingly.

Assessment of Confounding, Interaction and Mediation: The authors should assess whether adjustment for potential confounders (e.g. sex, age, study [EAP or BOSON], cirrhosis) is needed. Were genotypes associated with any patient characteristics, e.g. age or cirrhosis status? Stratified analyses should be performed to identify possible interactions for those variables, especially sex (interaction between IFNL4 genotype and sex has been reported for associations with hepatic fibrosis).

We assessed the impact of the above potential confounders on the viral amino acid variation. We performed four separate viral GWASs using logistic regression (one for each possible confounder). Presence and absence of viral amino acids was used as the response variable and 2 viral PCs, 3 host PCs and IFNL4 SNP rs12979860 (CC vs. non-CC genotype) were added as covariates. FDR was calculated for each viral GWAS independent of the others.
At 10% FDR there were no significant associations between cirrhosis status, cohorts (BOSON vs. EAP), gender and age, and any of the viral amino acid variants. We included the following sentence in the Results section: “To test for possible confounders we added separately the cirrhosis status, cohorts (BOSON vs. EAP), gender and age to the model as covariates. These covariates were not associated with any specific amino acids at 10% FDR (data not shown).”

In a separate viral GWAS, we investigated whether the interaction between IFNL4 SNP rs12979860 genotypes and gender had an impact on viral amino acid variation. We used the same procedure as above, adding 2 viral PCs, 3 host PCs, IFNL4 SNP rs12979860 (CC vs. non-CC genotype) and gender as covariates. At 10% FDR there was no significant association between any viral amino acids and the interaction term of IFNL4 SNP rs12979860 and gender.

We also performed another analysis where we compared the impact of IFNL4 SNP rs12979860 (CC vs. non-CC genotypes) on viral amino acid variation, using two different logistic regression models. In model one, we used 2 viral PCs and 3 host PCs as covariates and in the second model we also added age, gender, cirrhosis status and cohort indicator as covariates. The P-values for the association between IFNL4 SNP rs12979860 and the viral amino acid variants are highly correlated between the two models as shown in Author response image 1.

Author response image 1.
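
As an illustration of the multiple-testing step used throughout this response (FDR calculated separately for each viral GWAS), the following is a minimal sketch of the step-up procedure of Benjamini and Hochberg (1995), the reference the authors cite for FDR. It is not the authors' code: the response does not say which software was used, the function name `benjamini_hochberg` is hypothetical, and the P-values in the usage note are invented for illustration only.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag which tests are significant at the given FDR level.

    Implements the Benjamini-Hochberg step-up rule: sort the m P-values,
    find the largest rank k with p_(k) <= (k / m) * fdr, and declare the
    k smallest P-values significant. Returns booleans in input order.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the P-values, from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank  # keep the largest rank that passes
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical P-values for six viral sites tested against one host SNP.
example_p = [0.001, 0.008, 0.039, 0.041, 0.27, 0.6]
hits_5 = benjamini_hochberg(example_p, fdr=0.05)
hits_10 = benjamini_hochberg(example_p, fdr=0.10)
```

With these invented values, two sites pass at a 5% FDR and four pass at a 10% FDR, which mirrors how the number of reported associations in the response grows when the threshold is relaxed.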

Since the results were negative, we decided not to include the two former interaction analyses in the revised paper to keep it as simple as possible.

The association of IFNL4 genotype with the frequency of HCV polymorphisms could reflect an effect of IFNL4 on viral replication rates. To assess that possibility, the investigators should compare the results of two logistic regression models: one that does and one that does not include HCV RNA as an additional covariate to IFNL4. Otherwise these paired models should include identical adjustments.

We agree that IFNL4 SNP rs12979860 is associated with viral load such that CC patients on average have higher viral load and this could indicate the impact of IFNL4 on viral replication rates. Higher levels of replication in CC patients could lead to more within-patient diversity, but we have shown that non-synonymous codon changes are much more likely to be associated with IFNL4 SNP rs12979860 compared to synonymous codon changes. Additionally, unless there is a form of selection acting on specific amino acids, one cannot explain consistent changes to the same amino acid across multiple patients with the same IFNL4 genotype. However, we performed the above comparison suggested by the reviewer. In the first model we included 2 viral PCs and 3 host PCs as covariates and in the second model we also added log10(viral load) as a covariate to the model. We plotted the P-values of associations between IFNL4 SNP rs12979860 and the viral amino acids from the two models against each other. Adding log10(viral load) as a covariate reduced the P-values very slightly as shown in Author response image 2, but the impact on the analysis was minimal. At a 5% FDR 37 viral sites were associated with host IFNL4 SNP, increasing to 68 at 10% FDR.

Author response image 2.

Other comments re: statistical analyses

Subsection “IFNL4 SNP has a widespread impact on the viral amino acids” first paragraph: how was the "expected median" computed?

To estimate the genomic inflation factor we calculated the median of the observed chi-squared test statistics and divided it by the median of the chi-squared distribution with one degree of freedom, which is 0.4549 (expected median). The P-values were converted to chi-squared test statistics using qchisq(1-p, df=1) in R. However, we removed this section from the current version to simplify the presentation of the Results as indicated above.

Subsection “IFNL4 SNP has a widespread impact on the viral amino acids” second paragraph: what does "frequency matched" mean? With the same MAF?

We modified the text as follows: we performed the same tests against HCV amino acids for 500 host SNPs from across the human genome with a minor allele frequency (MAF) similar to IFNL4 SNP rs12979860 MAF, further referred to as the “500 frequency-matched SNPs”.

Subsection “IFNL4 SNP has a widespread impact on the viral amino acids”, third paragraph: please state outcome variable for the logistic regression models.

We have modified the text to indicate that the presence and absence of viral amino acids was used as the outcome variable in the logistic regression models.

Subsection “IFNL4 SNP has a widespread impact on the viral amino acids”, third paragraph: please define FDR and give reference.

We have modified the text as follows: at 5% false discovery rate (FDR) which indicates the expected proportion of false positives among discoveries (significant associations). We added a reference (Benjamini and Hochberg, 1995).

Subsection “Statistical analysis”, third paragraph: please state clearly that the SNP was the outcome for the logistic models.

In the last paragraph of the subsection “Human and viral population structure”, we did not use logistic regression.
We used treeBreaker software (Ansari and Didelot, 2016). This software uses Bayesian methodology to measure association between a tree and a phenotype of interest. In brief, given a tree and tip phenotypes, it infers which clades on the tree, if any, have a distinct distribution of the tip phenotype from the rest of the tree.

Subsection “Statistical analysis”, eleventh paragraph: To obtain maximum likelihood estimates one needs to assume a normal distribution for log10(viral load) transformed data. In my experience this assumption is not true for log10(viral load) transformed data. However, least squares estimates do not require this assumption.

The qqplot of the model residuals against standard normal distribution indicates that the assumption of normality is justified.

Subsection “Statistical analysis”, eleventh paragraph: when a line was fit through the log(OR) estimates, how were the standard deviations of the log(ORs) used? That uncertainty needs to be accommodated.

We have modified our analysis to account for the uncertainty associated with the log(ORs). To do this we used bootstrapping. We simulated 10,000 bootstrap datasets where the log(ORs) were simulated using a normal distribution with mean set to the log(ORs) and standard deviation set to the standard error of the estimate. For each dataset we fitted a linear regression and estimated the slope of the fit. The empirical bootstrap 95% confidence interval of the slope of the line was (0.69, 0.99), which does not include 1.

Other general comments: The presentation is hard to follow with most data presented as minimally annotated supplementary materials with limited legends provided separately. Providing more detailed legends next to corresponding figures and tables should make it easier to follow.

The eLife submission process requires us to submit supplementary figures and legends separately.
We hope that in the final version of this manuscript, legends will be provided next to the corresponding figure or table.

In the Discussion, the limitations of the study should be explored.

We have included a paragraph (second last) in the Discussion on the limitations of the study.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:

Overall, the authors were responsive to reviewer comments and the paper is much improved. The analytical approach remains complicated and the paper is still challenging to read. The authors should consider and address the following comments.

Multiple testing: The authors eliminated use of the Bonferroni correction and a false discovery rate (FDR) of 20%. The paper still presents two different FDR thresholds (5% and 10%) for many analyses and the reason for doing so is unclear. It would be simpler to report a single set of results based on a 5% FDR, the threshold used in the previous publication from this group.

False discovery rate allows us to quantify the expected number of false positives in our results. Choosing a low FDR will ensure that there are very few false positives, but setting a low FDR value also risks reducing the number of true positives. In the present version of the manuscript, we decided to report both 5% and 10% FDR thresholds. The reason for doing so is that a large number of associated viral amino acids were required for analyses to be performed with sufficient power. This could only be achieved with a 10% FDR. In our previous publication, we also used two different FDR thresholds (5% and 20%). The lower 5% FDR is applied in this present work to help with comparisons to our previous report.

Normality of viral load data: It is not clear from a visual inspection of the Q-Q plot that these data are normally distributed.
A P-value for fit would be a more objective measure.

This comment is related to the first set of reviews that we received. “Subsection “Statistical analysis”, eleventh paragraph: To obtain maximum likelihood estimates one needs to assume a normal distribution for log10(viral load) transformed data. In my experience this assumption is not true for log10(viral load) transformed data. However, least squares estimates do not require this assumption.” We answered this comment in the rebuttal by presenting a qq-plot. No modification was added to the manuscript. Here, we explain the reason why we are confident that the tests performed in the manuscript are valid. In small samples most statistical methods do require distributional assumptions, and the case for distribution-free rank-based tests is relatively strong. However, in large data sets (which is the case for our study), most statistical methods rely on the Central Limit Theorem, which states that the average of a large number of independent random variables is approximately normally distributed around the true population mean. It is this normal distribution of an average that underlies the validity of the t-test and linear regression, but also of logistic regression and of most software for the Wilcoxon and other rank tests. The t-test and linear regression compare the mean of an outcome variable for different subjects. While these are valid even in very small samples if the outcome variable is normally distributed, their major usefulness comes from the fact that in large samples they are valid for any distribution (Lumley et al. The importance of the normality assumption in large public health data sets. Annu Rev Public Health. 2002;23:151-69).

Second paragraph of the Introduction “substitutes a proline for a serine […]”: Terczyńska-Dyla et al. state, “an amino-acid substitution in the IFNλ4 protein changing a proline at position 70 to a serine (P70S) […]”.
To this reader, that means a serine is substituted for a proline. Alternatively, the authors might use the language of Terczyńska-Dyla et al. to describe this variant.

We modified the text to read “the IFN-λ4 protein, which changes a proline residue at position 70 (P70) to a serine residue (S70)”. We are not using the P70S nomenclature elsewhere in the article so would prefer not to introduce it here. We hope the change will make the description of the variant clearer and more accurate.

Subsection “Host and virus genetic structures”: Without any explanation, it is unclear how to interpret the Bayes factors of 249 and 1.1.

We added the following sentence: “see Materials and methods for an explanation on how to interpret Bayes factor” and, in the Materials and methods section: “Bayes factor is a summary of the evidence provided by the data in favour of one model compared to another. […] Bayes factor: 10 to 100, strong evidence against null model and in favour of the alternative model. Bayes factor: >100, decisive evidence against null model and in favour of the alternative model.”

How the patients are divided into IFNL4-null, S70 and P70 groupings could be clearer. Supplementary file 7 would present that information if the groups were arranged together and labeled.

We modified the Supplementary file 7 to include a row labeling the different haplotypes into the grouping that we made. Each haplotype is now linked to the protein predicted to be produced.
We also added the following sentence in the Materials and methods section: “HCV-infected patients were classified into three groups according to their predicted ability to produce IFN-λ4 protein: (i) no IFN-λ4 (two allelic copies of IFN-λ4-Null, N_BOSON=145, N_EAP=41), (ii) IFN-λ4-S70 (two copies of IFN-λ4-S70 or one copy of IFN-λ4-S70 and one copy of IFN-λ4-Null, N_BOSON=48, N_EAP=7), and (iii) IFN-λ4-P70 (at least one copy of IFN-λ4-P70, N_BOSON=218, N_EAP=26).”

Subsection “Host and virus genetic structures”, ninth paragraph – The co-submission by Chaturvedi et al. has been accepted for publication and might be referenced here.

We added the following sentence: (also see ‘Adaptation of hepatitis C virus to interferon lambda polymorphism across multiple viral genotypes’ by Chaturvedi et al. in this issue).
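
The three-group classification quoted above from the Materials and methods can be written as a small decision rule. The sketch below is only an illustration of that rule as worded in the response: the function name `classify_ifnl4_group` and the haplotype labels "Null", "S70" and "P70" are assumptions made for this example, and the step that infers each patient's two haplotypes from rs368234815 and rs117648444 is not shown.

```python
def classify_ifnl4_group(haplotype_a, haplotype_b):
    """Assign a patient to a predicted IFN-λ4 group from two haplotype copies.

    Each argument is one of the hypothetical labels "Null", "S70" or "P70",
    naming the IFN-λ4 protein that haplotype is predicted to produce.
    """
    allowed = {"Null", "S70", "P70"}
    pair = (haplotype_a, haplotype_b)
    for haplotype in pair:
        if haplotype not in allowed:
            raise ValueError(f"unknown haplotype label: {haplotype!r}")
    # (iii) at least one copy of IFN-λ4-P70
    if "P70" in pair:
        return "IFN-λ4-P70"
    # (ii) two copies of S70, or one S70 copy and one Null copy
    if "S70" in pair:
        return "IFN-λ4-S70"
    # (i) two copies of IFN-λ4-Null
    return "no IFN-λ4"


# Usage: the order of the two copies does not matter.
groups = [
    classify_ifnl4_group("Null", "Null"),
    classify_ifnl4_group("S70", "Null"),
    classify_ifnl4_group("Null", "P70"),
]
```

Checking for P70 first means that a patient carrying one P70 copy and one S70 copy falls into the IFN-λ4-P70 group, which is how "at least one copy of IFN-λ4-P70" reads in the quoted sentence.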

51 in total

1. A variant upstream of IFNL3 (IL28B) creating a new interferon gene IFNL4 is associated with impaired clearance of hepatitis C virus.

Authors: Ludmila Prokunina-Olsson; Brian Muchmore; Wei Tang; Ruth M Pfeiffer; Heiyoung Park; Harold Dickensheets; Dianna Hergott; Patricia Porter-Gill; Adam Mumy; Indu Kohaar; Sabrina Chen; Nathan Brand; McAnthony Tarway; Luyang Liu; Faruk Sheikh; Jacquie Astemborski; Herbert L Bonkovsky; Brian R Edlin; Charles D Howell; Timothy R Morgan; David L Thomas; Barbara Rehermann; Raymond P Donnelly; Thomas R O'Brien
Journal: Nat Genet Date: 2013-01-06 Impact factor: 38.330

2. IFN-λ is able to augment TLR-mediated activation and subsequent function of primary human B cells.

Authors: Rik A de Groen; Zwier M A Groothuismink; Bi-Sheng Liu; André Boonstra
Journal: J Leukoc Biol Date: 2015-06-30 Impact factor: 4.962

3. Genetic variation in IL28B predicts hepatitis C treatment-induced viral clearance.

Authors: Dongliang Ge; Jacques Fellay; Alexander J Thompson; Jason S Simon; Kevin V Shianna; Thomas J Urban; Erin L Heinzen; Ping Qiu; Arthur H Bertelsen; Andrew J Muir; Mark Sulkowski; John G McHutchison; David B Goldstein
Journal: Nature Date: 2009-08-16 Impact factor: 49.962

4. Immunomodulatory Function of Interleukin 28B during primary infection with cytomegalovirus.

Authors: Adrian Egli; Aviad Levin; Deanna M Santer; Michael Joyce; Daire O'Shea; Brad S Thomas; Luiz F Lisboa; Khaled Barakat; Rakesh Bhat; Karl P Fischer; Michael Houghton; D Lorne Tyrrell; Deepali Kumar; Atul Humar
Journal: J Infect Dis Date: 2014-03-11 Impact factor: 5.226

5. IFN-λ3, not IFN-λ4, likely mediates IFNL3-IFNL4 haplotype-dependent hepatic inflammation and fibrosis.

Authors: Mohammed Eslam; Duncan McLeod; Kebitsaone Simon Kelaeng; Alessandra Mangia; Thomas Berg; Khaled Thabet; William L Irving; Gregory J Dore; David Sheridan; Henning Grønbæk; Maria Lorena Abate; Rune Hartmann; Elisabetta Bugianesi; Ulrich Spengler; Angela Rojas; David R Booth; Martin Weltman; Lindsay Mollison; Wendy Cheng; Stephen Riordan; Hema Mahajan; Janett Fischer; Jacob Nattermann; Mark W Douglas; Christopher Liddle; Elizabeth Powell; Manuel Romero-Gomez; Jacob George
Journal: Nat Genet Date: 2017-04-10 Impact factor: 38.330

6. Fast principal component analysis of large-scale genome-wide data.

Authors: Gad Abraham; Michael Inouye
Journal: PLoS One Date: 2014-04-09 Impact factor: 3.240

7. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.

Authors: Alexandros Stamatakis
Journal: Bioinformatics Date: 2014-01-21 Impact factor: 6.937

8. Modelling mutational and selection pressures on dinucleotides in eukaryotic phyla--selection against CpG and UpA in cytoplasmically expressed RNA and in RNA viruses.

Authors: Peter Simmonds; Wenjun Xia; J Kenneth Baillie; Ken McKinnon
Journal: BMC Genomics Date: 2013-09-10 Impact factor: 3.969

9. Genetic variation in IL28B and spontaneous clearance of hepatitis C virus.

Authors: David L Thomas; Chloe L Thio; Maureen P Martin; Ying Qi; Dongliang Ge; Colm O'Huigin; Judith Kidd; Kenneth Kidd; Salim I Khakoo; Graeme Alexander; James J Goedert; Gregory D Kirk; Sharyne M Donfield; Hugo R Rosen; Leslie H Tobler; Michael P Busch; John G McHutchison; David B Goldstein; Mary Carrington
Journal: Nature Date: 2009-10-08 Impact factor: 49.962

10. A polymorphic residue that attenuates the antiviral potential of interferon lambda 4 in hominid lineages.

Authors: Connor G G Bamford; Elihu Aranday-Cortes; Ines Cordeiro Filipe; Swathi Sukumar; Daniel Mair; Ana da Silva Filipe; Juan L Mendoza; K Christopher Garcia; Shaohua Fan; Sarah A Tishkoff; John McLauchlan
Journal: PLoS Pathog Date: 2018-10-11 Impact factor: 6.823
