Guy Ling1, Danielle Miller1, Rasmus Nielsen2,3,4, Adi Stern1,5. 1. School of Molecular Cell Biology and Biotechnology, Tel-Aviv University, Tel-Aviv, Israel. 2. Department of Integrative Biology, University of California, Berkeley, Berkeley, CA. 3. Department of Statistics, University of California, Berkeley, Berkeley, CA. 4. Center for Computational Biology at UC Berkeley (CCB), Berkeley, CA. 5. Edmond J. Safra Center for Bioinformatics at Tel Aviv University, Tel-Aviv, Israel.
Abstract
The probability of point mutations is expected to be highly influenced by the flanking nucleotides that surround them, known as the sequence context. This phenomenon may be mainly attributed to the enzyme that modifies or mutates the genetic material, because most enzymes tend to have specific sequence contexts that dictate their activity. Here, we develop a statistical model that allows for the detection and evaluation of the effects of different sequence contexts on mutation rates from deep population sequencing data. This task is computationally challenging, as the complexity of the model increases exponentially as the context size increases. We established our novel Bayesian method based on sparse model selection methods, with the leading assumption that the number of actual sequence contexts that directly influence mutation rates is minuscule compared with the number of possible sequence contexts. We show that our method is highly accurate on simulated data using pentanucleotide contexts, even when accounting for noisy data. We next analyze empirical population sequencing data from polioviruses and HIV-1 and detect a significant enrichment in sequence contexts associated with deamination by the cellular deaminases ADAR 1/2 and APOBEC3G, respectively. In the current era, where next-generation sequencing data are highly abundant, our approach can be used on any population sequencing data to reveal context-dependent base alterations and may assist in the discovery of novel mutable sites or editing sites.
Single-base modifications, which include point mutations and DNA or RNA modifications, are often caused by enzymatic activity. Base alterations can include either standard point mutations or modifications such as cytosine methylation at the DNA level (Cooper and Krawczak 1989), or adenine to inosine at the RNA level (Wulff et al. 2011). For DNA/RNA modifications, specific sequence contexts influence the probability that the enzyme will modify a base within this context (Lehmann and Bass 2000; Feltus et al. 2006; Wang et al. 2013). This is reflected by hotspots of mutation in different genomes driven by specific sequence contexts (Hwang and Green 2004; Hodgkinson and Eyre-Walker 2011; Aggarwala and Voight 2016). One well-known example is that of C→T mutations that occur at high rates in vertebrate genomes by spontaneous deamination of methylated cytosines in CpG positions, that is, positions in which a guanine follows a cytosine (Coulondre et al. 1978; Razin and Riggs 1980). Moreover, specific cellular enzymes belonging to the APOBEC3 family increase the rate of deamination of cytosine bases as a means of viral restriction. These enzymes increase the mutation rate of HIV by several orders of magnitude at specific sequence contexts (Cuevas et al. 2015). Nevertheless, most evolutionary models assume that every position evolves independently, implying that neighboring positions do not affect the rate of mutation at a given position. Here we present a novel Bayesian method for the analysis of deep population sequencing data, which detects the effect of context on the rate of single-base modifications.

Previous efforts to detect the effects of context on substitution rates were often phylogeny-based (Siepel and Haussler 2003). In order to keep the number of parameters small and computationally tractable, most such methods consider two to three sites (Krawczak et al. 1998; Dunson and Tindall 2000; Lunter and Hein 2004; Hernandez et al.
2007; Zhang et al. 2007; Rodrigue et al. 2009, 2010; Simmonds et al. 2013; Sung et al. 2015; Figliuzzi et al. 2016; Harris and Pritchard 2017) (but see Berikov and Rogozin 1999). Notably, methods that attempt to find both local and global explicit context dependencies usually suffer from a high false positive rate, and hence require additional statistical validation (Rogozin et al. 2005).

Recently, novel models for analyzing context from polymorphism data in humans were suggested (Aggarwala and Voight 2016; Zhu et al. 2017). However, these models are very high-dimensional and, in essence, assume that a very large number of sequence motifs may influence the rate of substitution. An additional complication is that within the high-dimensional space of sequence motifs there is a high level of correlation between motifs, which can lead to faulty inferences of context effects (but see Aikens et al. 2019). For example, the motif CGCXX is contained within the motif XGCXX (where X is any one of the four nucleotides), and this can make it difficult to infer which motif truly has an effect.

Here, we develop a model that addresses these problems and includes two main novelties. First, it is tailored for next-generation sequencing (NGS) of population data, which are becoming increasingly abundant. Notably, our method is inspired by ours and others' experiments that sequence virus populations at great depth (Acevedo et al. 2014; Stern et al. 2017); typically such experiments yield over 100,000–1,000,000 sequenced viral genomes, with sequencing accuracy that allows detection of very rare mutations. Moreover, in these experiments the very high mutation rate of the viruses is the predominant force: genetic drift is mitigated by the large population size (but see Discussion), and the short time frame of the experiment mostly allows for only one mutation per genome.
The second novelty is the use of Bayesian variable selection methods to identify the few sequence contexts that significantly influence the substitution rate, thereby addressing the combinatorial increase in parameter number with a larger context. This approach relies on the biological motivation that only a limited number of enzymes influence the rate of base alteration, and they are often defined by a very specific context. For example, adenosines in mRNA may be methylated by a methyltransferase enzyme, but only in the context RGACU (where R is a purine) (Narayan et al. 1994). There are over 1,000 possible contexts in a window size of 5 (4^5 = 1,024), but as only two of them exert an effect, sparsity is an assumption worth taking into consideration.

The problem of detecting contexts that affect the rate of base alteration is a special case of the standard statistical problem of identifying a subset of possibly correlated covariates that affect a response variable. One example is quantitative trait locus (QTL) mapping, where the objective is to detect a limited set of markers that affect a specific phenotype, and our method is inspired by Bayesian variable selection solutions to this problem (see, e.g., Yi 2004). We test the method on simulated data generated to mimic an increase or decrease of mutation rate caused by specific nucleotide motifs in a population of replicating viruses, and show that the method accurately captures these changes. We next analyze data from NGS experiments of polioviruses and HIV-1 and describe how our method captures intriguing biological signals. Finally, we discuss the applicability of our method to any type of NGS data.
Theory
In brief, our method searches for a correlation between the presence of sequence motifs surrounding a site and elevated or decreased mutation frequencies at that site. The method uses a Bayesian approach to infer the parameters of the model and employs sparse shrinkage to account for the fact that only a small number of motifs affect the mutation rate.
Context-Dependent Base Alteration Model
For the sake of simplicity, we hereby refer to a base alteration as a mutation. This is often convenient because a base alteration may be captured in NGS experiments as an observed mutation (e.g., adenine to inosine alterations are observed as adenine to guanine mutations following sequencing [Levanon et al. 2004]). Let the full sequence (e.g., the genome) we are interested in be denoted as s = s_1 … s_n, where s_i ∈ {A, C, G, T}. Note that upon analyzing RNA genomes T is replaced by U; however, for convenience, we keep the notation of T throughout the text. We denote the k-long context of a focal position as the nucleotides flanking that position, where k is assumed to be an odd number. For example, for k = 3, three distinct contexts are illustrated in figure 1.
Fig. 1.
Model definitions. For the labeled focal position with nucleotide A, we present its genomic context, and illustrate the context, the context together with a mutation, and a possible motif embedded in this context. Motifs are associated with a context pattern. We exemplify how enzymatic activity operating on a specific sequence context may result in an increase in the mutation rate at this context.
We can further decompose sequence contexts into motifs (which will be the features in our feature selection algorithm). In the example above, a possible motif would be AGX (where X is any of the four nucleotides). For k = 1, there are four motifs (the four possible nucleotides); for k = 3 there are 12 motifs consisting of one nucleotide, 48 motifs consisting of two nucleotides, and 64 motifs consisting of three nucleotides. In general, for k-long contexts, there are 5^k − 1 motifs (fig. 1 and table 1). Next, we consider a mutation operating on a context to be an ordered pair of a context and a mutation type. For simplicity, we denote this as a context + mutation (fig. 1).
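The motif combinatorics above (4 motifs for k = 1, 124 for k = 3, and 5^k − 1 in general) can be checked with a short enumeration. A minimal sketch in Python; the function name is ours, not part of CIPI:

```python
from itertools import product

def enumerate_motifs(k):
    """All motifs for a k-long context: patterns over {A, C, G, T, X}
    (X = any nucleotide) with at least one specified, non-X position."""
    alphabet = ("A", "C", "G", "T", "X")
    return ["".join(p) for p in product(alphabet, repeat=k)
            if any(c != "X" for c in p)]

motifs3 = enumerate_motifs(3)
# Group motifs by the number of specified (non-X) nucleotides.
by_size = {m: sum(c != "X" for c in m) for m in motifs3}
```

Each position independently takes one of five values (four nucleotides or the wildcard X), and only the all-wildcard pattern is excluded, which gives the 5^k − 1 total.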
Table 1.
An Example of Sequencing Data and Associated Breakdown into Context, Motifs, and Counts.
note.—The main variables, the indicators for presence/absence of motifs within a context, are shaded in gray. In the first row of this example, we observe C→T mutations within the context ACA in 110 of the individuals in the population sequenced, out of a total of 20,000 sequences (reads) covering all ACA contexts.
Now, let X be a matrix that indicates, for each context + mutation, the motifs it contains such that:
where c is the number of all unique contexts + mutations in the genome, which is bounded because not all contexts may be present in a given genome, and the number of columns is the number of possible motifs defined above (table 1).

We assume sequencing of a population of homologous genomes and the availability of a known reference genome (fig. 2). These assumptions allow us to uniquely define which mutation occurred at which sequence context. We define a vector of the empirical counts of contexts + mutations for each type of mutation, and a vector of the total observed number of occurrences of each context in the sequencing data (i.e., the sequencing coverage of each context). In other words, we pool all the mutations that have the exact same context across different positions in the genome, so that in figure 2 we would count three mutations for the red context. Note that when using counts we implicitly assume a lack of genetic drift, which may have increased the copy number of an allele in the population. Alternatively, the input data may be the number of polymorphic loci with a specific context, that is, a position is then counted at most once. Accordingly, in figure 2 we would count two polymorphisms for the red context. The former representation fits our viral data set; the latter enables flexibility for other data sets, supporting the method's general applicability. Table 1 illustrates an example showing the matrix and count vectors for a context of size k = 3 for simplicity, using the pooled mutation approach.
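The indicator matrix can be assembled by checking, for each context, which motifs it contains: a motif matches a context when every non-X position agrees. A sketch with hypothetical helper names:

```python
def motif_matches(motif, context):
    """A motif matches a context if every specified (non-X) position agrees."""
    return all(m == "X" or m == c for m, c in zip(motif, context))

def build_indicator_matrix(contexts, motifs):
    """X[i][j] = 1 iff motif j is contained in context i, with one row per
    context + mutation, as in Table 1."""
    return [[1 if motif_matches(m, ctx) else 0 for m in motifs]
            for ctx in contexts]

# Example mirroring Table 1: two 3-long contexts, four candidate motifs.
contexts = ["ACA", "ACC"]
motifs = ["XCA", "AXX", "XCX", "GXX"]
X = build_indicator_matrix(contexts, motifs)
```

In practice each row would correspond to one context + mutation pair, and the mutation and coverage count vectors would be aligned to the same row order.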
Fig. 2.
Population mutations as a function of the genomic contexts in the ancestral genome. We start with an ancestral genome (also referred to here as a reference genome). In this original genome there are several sequence contexts, three of which are illustrated here in colored boxes: the red context, the blue context, and the yellow context. A context might be present in the genome once (e.g., the blue and yellow contexts), might have multiple appearances (the red context), or might not be present at all. After t generations the population is no longer homogeneous, and mutations might arise in different contexts (colored x marks). An elevated mutation rate or genetic drift may lead to more mutations at a specific locus (see text for more details).
Model
We introduce CIPI: context-based inference of point-mutation influence. We use a logistic regression model with latent indicator variables, one per possible motif, which relate the motifs in a context to the probability of observing a mutation in that context, for each mutation type. We assume the observed mutation counts are binomially distributed given this probability (see below). For convenience, from now on we omit the mutation-type subscript, given that the inference is performed per mutation type. The introduction of the indicator vector γ allows us to better penalize models with a large number of influencing motifs, that is, it allows us to use shrinkage in the model to avoid overfitting. We further introduce κ, a factor that controls shrinkage and is elaborated on below. Thus, our logistic regression model is defined by γ, a vector of indicator variables indicating whether each motif contributes significantly to the model or not; β, the vector of regression coefficients that increase or decrease the probability of a mutation in a context; α, the baseline mutation rate in the absence of any context; and κ, a shrinkage factor. We use the properties of the logistic function, which maps the real line to (0, 1) and thereby guarantees that the modeled mutation probability is always a valid probability.
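As a sketch of the model, the per-context mutation probability can be assembled from a linear predictor α + Σ_j x_ij γ_j β_j passed through the logistic function. This is our illustrative reading of the model (κ enters through the priors and is omitted here), and all names are ours:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a valid probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mutation_prob(x_row, gamma, beta, alpha):
    """Mutation probability for one context: only motifs present in the
    context (x = 1) and switched on (gamma = 1) contribute their beta."""
    z = alpha + sum(x * g * b for x, g, b in zip(x_row, gamma, beta))
    return sigmoid(z)
```

With all indicators at zero the probability collapses to sigmoid(α), the baseline mutation rate; switching on a motif with positive β raises the probability for every context containing it.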
Inference
We will use Bayesian variable selection to estimate . Our approach is similar to classical methods used for QTL mapping. These methods aim to identify correlations between a set of genetic markers (e.g., single nucleotide polymorphisms) and a continuous phenotype (Yi 2004). There is usually a very large number of single nucleotide polymorphisms (which take the role of features) in a sample and a much smaller number of samples. Moreover, often the different features are correlated, mainly due to linkage in a genetic cross or linkage disequilibrium in outbred populations. In our case, the motifs are the features and the number of possible motifs is often much larger than the number of mutations observed in a given genome, especially when considering microbial genomes that tend to be relatively small. Furthermore, there is also a strong correlation between the different motifs (features), as these may be nested within each other or may be mutually exclusive. We address these problems by assigning a prior distribution to each latent variable. The model is too complex to allow analytical solutions, but we can infer the posterior distribution using a Markov chain Monte Carlo (MCMC) algorithm.
Posterior Probability Calculations
The posterior probability of the four latent variables can be written as

P(γ, β, α, κ | D) ∝ P(D | γ, β, α) P(β | γ) P(γ | κ) P(κ) P(α),

where D denotes the observed counts and the normalizing constant is the sum of the numerator over all possible parameter values. The expressions in the numerator can easily be computed when values for the latent variables are given, where c is defined above as the number of contexts + mutations and the mutation counts are assumed to be binomially distributed given the per-context mutation probabilities. Notice that we assume here that the number of mutations observed for each context is small, so that different mutations in the same context are approximately independent of each other. On the other hand, calculating the sum in the denominator over all possible combinations of the latent variables is intractable.
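Under the binomial assumption, the likelihood term of the numerator is a sum of per-context binomial log-probabilities. A self-contained sketch (our naming, not the paper's code):

```python
import math

def binom_loglik(n, N, p):
    """Log-likelihood of observing n mutations out of N reads covering a
    context, under n ~ Binomial(N, p)."""
    return (math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)
            + n * math.log(p) + (N - n) * math.log1p(-p))

def total_loglik(counts, coverages, probs):
    """Sum of per-context binomial log-likelihoods, treating contexts as
    independent (the likelihood factor of the posterior's numerator)."""
    return sum(binom_loglik(n, N, p)
               for n, N, p in zip(counts, coverages, probs))
```

Working in log space keeps the computation stable at the very deep coverages (N of order 100,000) described in the text.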
Prior Probability Specification
Prior for κ
κ is the model shrinkage factor controlling the model's sparsity. Small values will result in no signal at all (shrinkage is too high), whereas larger values will increase the model complexity. We define a uniform prior on κ, with boundaries set in order to allow high shrinkage.

We define a simple Bernoulli prior probability distribution for each element of γ, parameterized through the number of "1" entries in γ. Taking a success probability of 1/2 defines a prior that gives equal weight to any γ. In practical estimation, collinearity between the predictors can lead to instability. The collinearity for this problem has been described previously as a "dilution" effect (George 2010). For example, if a given motif increases the mutation rate for C→T mutations, naïve estimation of effects by counting might also find a significant effect for any motif that contains it. To address this problem, we add a penalty to the prior for γ. We use a method based on determinantal point processes (DPPs) that has been shown to be effective in other Bayesian variable selection problems (Ročková and George 2014; Kojima and Komaki 2014, 2016). According to the DPP method, the prior is weighted by powers of the determinant of the correlation matrix of X_γ, where X_γ is a matrix including only the columns of the original matrix selected by γ, and the power acts as a weight factor. The weighting provides a computationally tractable approach for mitigating the effect of the dependencies between the features. If all features are completely independent the determinant equals 1, whereas in the case of full dependency (collinearity) between at least one pair of columns the matrix will be singular and the determinant equals 0.

We define the prior probability for the regression coefficients, β, as:
where the prior for each coefficient depends on its indicator, with some variance and some constant. It is possible either to infer the variance from the data or to define it as a constant; here we chose to infer it from the data. Notably, when an indicator equals zero the corresponding regression coefficient does not enter the model and is unidentifiable.

α represents the mean rate of base modification, which we refer to here as the base mutation rate. We assume α is normally distributed, and use the empirical mean and variance inferred from the data. The variance can be set as a hyperparameter or can be inferred from the data with a variety of possible priors; a simple choice is a uniform prior, where the "low" and "high" boundaries are set arbitrarily.
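The DPP weighting described above can be illustrated by computing the determinant of the correlation matrix of the selected columns: 1 for orthogonal features, 0 for collinear ones, and intermediate values for partially correlated (e.g., nested) motifs. A self-contained sketch with the weight/power factor omitted:

```python
import math

def correlation_det(columns):
    """Determinant of the Gram matrix of unit-normalized columns, i.e. the
    DPP penalty term: 1 if features are orthogonal, 0 if any are collinear."""
    norms = [math.sqrt(sum(x * x for x in col)) for col in columns]
    unit = [[x / n for x in col] for col, n in zip(columns, norms)]
    k = len(unit)
    gram = [[sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(k)]
            for i in range(k)]
    # Determinant by Gaussian elimination with partial pivoting.
    det = 1.0
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(gram[r][i]))
        if abs(gram[pivot][i]) < 1e-12:
            return 0.0  # singular: at least one pair of columns is collinear
        if pivot != i:
            gram[i], gram[pivot] = gram[pivot], gram[i]
            det = -det
        det *= gram[i][i]
        for r in range(i + 1, k):
            f = gram[r][i] / gram[i][i]
            gram[r] = [a - f * b for a, b in zip(gram[r], gram[i])]
    return det
```

Raising this determinant to a power and multiplying it into the γ prior downweights sets of motifs that carry redundant information, which is exactly the nested-motif problem described in the text.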
MCMC Implementation
To traverse the posterior distribution, we use MCMC with a Metropolis–Hastings algorithm (Hastings 1970). Our MCMC algorithm works as follows. We start from an arbitrary point. We then define a transition kernel, that is, the set of probabilities for proposing a new parameter value given the previous value. Given the parameters at the ith step, a new set of parameters is sampled based on Gaussian distributions for each continuous variable. For γ, the update kernel randomly chooses entries to flip. For κ, we sample based on a Gaussian distribution whose standard deviation is scaled by the number of accepted steps in the previous 100 iterations, doubled since there are two possible directions (i.e., increasing or decreasing κ). As we aim for lower κ values, we set a lower bound on the distribution's standard deviation. This adaptive scheme results in bigger steps upon extensive acceptance, and smaller steps toward convergence. We then accept the proposed parameters with the Metropolis–Hastings acceptance probability, which incorporates the prior. After the chain has completed a predetermined number of iterations (following burn-in), we use the ergodic averages of each parameter to approximate the posteriors.
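A generic Metropolis–Hastings update with a symmetric proposal, as used by the algorithm above, can be sketched as follows; the toy target here is a standard normal, not the CIPI posterior:

```python
import math
import random

def metropolis_step(theta, log_post, propose, rng=random):
    """One Metropolis-Hastings update with a symmetric proposal: accept the
    proposed state with probability min(1, posterior ratio)."""
    proposal = propose(theta)
    log_ratio = log_post(proposal) - log_post(theta)
    if math.log(rng.random()) < min(0.0, log_ratio):
        return proposal, True   # accepted
    return theta, False         # rejected

# Toy target: standard normal log-density (illustration only).
log_post = lambda x: -0.5 * x * x
propose = lambda x: x + random.gauss(0.0, 1.0)

random.seed(0)
samples, x = [], 0.0
for _ in range(5000):
    x, _ = metropolis_step(x, log_post, propose)
    samples.append(x)
```

The adaptive step size described in the text would be layered on top of this by tracking the `accepted` flag over a sliding window and rescaling the proposal's standard deviation.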
Results
Simulated Data Sets
In order to verify the performance of our method, we simulated population NGS data. Our aim was to mimic evolving populations of viruses, where rare mutations are often observed widely across the genome. Accordingly, our simulations mimicked population sequencing of oral poliovirus (Stern et al. 2017), with a genome length of ∼7,500 bases at a sequencing depth of ∼100,000 reads per locus. Using an adaptation of the Moran model (Materials and Methods), we introduced an increased or reduced mutation frequency based on k = 5 contexts. We assume a given ancestral sequence and simulate each position independently, in line with our short-term evolutionary experiments, in which we do not expect more than one mutation per genome over the course of the experiment. Fourteen generations of mutations only (no selection) were simulated, with a fixed population size and mutation rate. In each of 500 simulated data sets, we introduced a different number (between 0 and 3) of motifs influencing the mutation rate of a specific mutation type. We then used our inference framework to infer which were the influential motifs in each of the simulations.

We defined a threshold β such that if the posterior probability that a motif has an effect on the mutation rate is larger than β, we predict that motif to affect the mutation rate. The posterior probability is estimated as the average occupancy time of the Markov chain in the state where the motif's indicator equals 1. Figure 3 summarizes the results of the simulations for different thresholds β. We consider a motif a true positive if it was originally simulated and determined as having an effect on the mutation rate for a given threshold β, and a false positive if it was not originally simulated but was detected as having such an effect. A motif that was not simulated and was not determined as affecting mutation rates is considered a true negative.
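The simulation design can be illustrated with a simplified sketch that replaces the Moran model with its expectation under "at most one mutation per genome": the mutant frequency grows roughly linearly with the number of generations, scaled by a context effect, and reads are then sampled at the sequencing depth. All parameter values and names below are illustrative, not the paper's:

```python
import random

def simulate_counts(mu, effect, generations, depth, rng):
    """Sketch: expected mutant frequency after `generations` replication
    cycles at per-site rate mu (scaled by a context `effect`), followed by
    binomial sampling of sequencing reads at the given depth. Assumes no
    selection and at most one mutation per genome."""
    freq = min(1.0, generations * mu * effect)
    return sum(1 for _ in range(depth) if rng.random() < freq)

rng = random.Random(1)
baseline = simulate_counts(1e-5, 1.0, 14, 100_000, rng)   # no context effect
elevated = simulate_counts(1e-5, 10.0, 14, 100_000, rng)  # 10x context effect
```

With these illustrative parameters the baseline context yields on the order of 14 mutant reads out of 100,000, while a tenfold context effect yields on the order of 140, mirroring the rare-mutation regime the simulations target.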
All in all, the accuracy rate was very high across all thresholds tested, demonstrating the power of the approach to accurately detect the influence of a motif on the rate of mutation. The high accuracy is driven largely by a remarkably high true negative rate (specificity), as expected for predictions on imbalanced data: given the set of all possible motifs for a chosen k, the number of influencing motifs passing a large enough threshold β will be minuscule compared with the noninfluencing ones. For example, a model that predicts no effect for every motif will naturally attain a high true negative rate (most motifs are indeed labeled 0) and consequently high accuracy. Thus, we chose precision and recall as our evaluation metrics. We computed the precision and the recall (also known as sensitivity), and used the F1 score, the harmonic mean of precision and recall, to evaluate the trade-off between the two.
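These evaluation metrics reduce to simple set operations over the true and predicted motif sets; a sketch:

```python
def precision_recall_f1(true_motifs, predicted_motifs):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), and their harmonic
    mean (the F1 score), computed over motif sets."""
    tp = len(true_motifs & predicted_motifs)
    fp = len(predicted_motifs - true_motifs)
    fn = len(true_motifs - predicted_motifs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because true negatives do not appear in any of the three formulas, these metrics are insensitive to the overwhelming majority class of noninfluential motifs, which is exactly why they are preferred over raw accuracy here.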
Fig. 3.
Accuracy analysis of CIPI on simulated data, as a function of the threshold β. (A) From left to right: F1 score, Bayesian FDR, precision, and recall. At the lowest threshold the F1 score is substantially low, as zero contribution is imposed on all motifs. Increasing β greatly improves the predictions, which converge to almost the same scores. (B) Precision–recall curve.
In order to approximate the false discovery rate (FDR), we measured the Bayesian FDR, defined as the expected probability that a motif has no effect given that the motif is considered significant at threshold β. More formally, it is computed from the ergodic averages calculated to approximate the posterior, as defined in the MCMC implementation. Given a threshold of 0.8, our false positive rate is <3%, suggesting that the method has high specificity, that is, we correctly reject noninfluential motifs (fig. 3). The low false positive rate promotes high precision; however, not all motifs are successfully identified, leading to lower recall. When analyzing the few false positives of the method, we noticed that the vast majority of incorrectly inferred motifs were motifs embedded in the true simulated motif (e.g., AXXXX when the true motif was AAXXX). This suggests that the DPP method used to remove correlated motifs still has its limitations, although the incorrectly inferred motifs still carry partial signal (see supplementary fig. S1, Supplementary Material online).
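Under this definition, the Bayesian FDR is the average posterior probability of no effect among the motifs whose posterior exceeds the threshold β; a sketch:

```python
def bayesian_fdr(posteriors, beta):
    """Bayesian FDR: the mean posterior probability of *no* effect among
    motifs selected at significance threshold beta."""
    selected = [p for p in posteriors if p > beta]
    if not selected:
        return 0.0
    return sum(1.0 - p for p in selected) / len(selected)
```

For example, selecting two motifs with posteriors 0.9 and 0.95 at β = 0.8 gives a Bayesian FDR of (0.1 + 0.05)/2 = 0.075.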
Introducing Noise into the Analysis
In the simulations above, all sites were simulated as neutral, and the observed frequencies were assumed to be the true frequencies. However, in real biological data, both assumptions are likely to be problematic. For one, there is no set of genomic sites known to be completely neutral. Typically, synonymous sites are assumed to be neutral, yet a subset of these sites may be under selection, particularly in viruses (Chamary et al. 2006). Moreover, the observed mutation frequencies will be affected by sampling and thus will likely deviate from the true mutation frequencies. Thus, in order to test how our method fares with noisier data, we tested the inference of CIPI on "noisy" data, defined here as data in which (i) a proportion of sites deviate from neutrality, and (ii) sampling is applied as described above. The accuracy of inference was examined as a function of the proportion of nonneutral sites and of sequencing noise. Setting all context effects to zero yields no signal, and thus we expect the F1 score to asymptotically approach zero. We show that across the range of values tested, the effects of noise are negligible.

We were interested in comparing our method to alternative methods. However, many methods are based on assumptions that are violated in our case, most notably the absence of a phylogenetic tree for NGS data (Siepel and Haussler 2003) or the fact that we are dealing with short genomes in which not all contexts exist (Aggarwala and Voight 2016). We thus implemented a naïve method for inferring sequence context, in which the motifs with the highest mutation rates are picked (Materials and Methods). Notably, variations of this naïve method are widely employed in the literature today (Schneider and Stephens 1990; Sandelin et al. 2004; Dey et al. 2018). We also compared our results with a "random" model, which selects motifs at random. Our results show dramatically superior inference relative to both the naïve and random approaches (fig. 4).
(For precision and recall as a function of the threshold, and performance under extreme selection, see supplementary figs. S2 and S3, Supplementary Material online.)
Fig. 4.
Performance comparison for analysis of noisy and non-noisy simulated data sets. F1 scores (upper panels) and Bayesian FDR (lower panels) are shown for non-noisy (solid lines) and noisy (dashed lines) data for our CIPI model (left panels). The middle and right panels, respectively, demonstrate the reduced performance of a random model (Luck) and a naïve model (main text).
Oral Polio Vaccine Virus Experimental Evolution Data Set
To further examine the biological applicability of the method, we applied it to oral poliovirus 2 (OPV2) sequencing data (Stern et al. 2017). The experiment was designed to record the mutation frequencies of OPV as it was serially passaged in tissue culture. Notably, the very high sequencing depth of this experiment (∼100,000 reads per position), combined with the very high mutation rates of the virus, made these data ideal for our model. In order to rule out the effects of selection, which can easily mimic the effects of an increased or decreased mutation rate, our analysis focused only on synonymous mutations, which are mostly (although not always) neutral. We further focused only on transition mutations, as transversions are less frequent and hence inferred with less reliability. We analyzed the last (seventh) passage, which corresponded to 14 viral replication cycles.

Table 2 presents the motifs detected in passage 7, at a threshold of 0.8 for the probability that the motif affects mutation rates. Intriguingly, many of the motifs detected are compatible with editing by the enzymes ADAR1 or ADAR2 (Eggington et al. 2011). Both enzymes edit adenosine to inosine, which is detected as an A→G mutation, and prefer A or U (T) upstream of the edited A. As poliovirus replication involves synthesis of both the positive and negative RNA strands in the cell (Schulte et al. 2015), a T→C mutation on the reverse-complement negative strand will be read as an A→G on the positive strand (which is the reference genome against which all reads are mapped). Accordingly, table 2 shows enrichment for A→G and T→C mutations as compared with the composition of the OPV2 genome (P < 0.001, Fisher exact test).
Table 2.
Motifs Affecting Mutation Rates Detected in Empirical Data of Passage 7 of OPV2.
Mutation Type | Motif | Reverse Complement Motif | Mean Gamma | Mean Mutation Rate | Context Mean Mutation Rate | Increase/Decrease
A→G  | XXAGA | TCTXX | 1    | 0.00041 | 0.00079 | Increase
A→G  | XTACX | XGTAX | 1    | 0.00041 | 0.00113 | Increase
A→G  | XXAAG | CTTXX | 1    | 0.00041 | 0.00098 | Increase
A→G  | CXAGX | XCTXG | 1    | 0.00041 | 0.00019 | Decrease
A→G  | XAAGX | XXTTT | 1    | 0.00041 | 0.00113 | Increase
A→G  | XXAGG | CCTXX | 1    | 0.00041 | 0.00021 | Decrease
C→T  | GCCXX | XXGGC | 1    | 0.00072 | 0.00027 | Decrease
C→T  | CXCCX | XGGXG | 1    | 0.00072 | 0.00025 | Decrease
G→A  | XXGCX | XGCXX | 1    | 0.00015 | 0.00085 | Increase
G→A  | XCGXX | XXCGX | 1    | 0.00015 | 0.00085 | Increase
T→C  | TXTAX | XTAXA | 1    | 0.00048 | 0.0012  | Increase
T→C  | AXTXG | CXAXT | 1    | 0.00048 | 0.0015  | Increase
T→C  | XXTAA | TTAXX | 1    | 0.00048 | 0.0008  | Increase
T→C  | XCTXG | CXAGX | 0.95 | 0.00048 | 0.00016 | Decrease
T→C  | XGTTX | XAACX | 0.99 | 0.00048 | 0.00021 | Decrease
T→C  | XTTXA | TXAAX | 0.83 | 0.00048 | 0.0003  | Decrease
note.—Motifs exceeding a threshold (mean gamma) of 0.8 are presented. Motifs in the context of ADAR are shown in bold.
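The positive/negative strand correspondence described above can be made concrete with a short helper (a minimal sketch, not the authors' published code; the function name is ours, and the wildcard 'X' is taken to complement to itself):

```python
# Complement map for motif characters; the wildcard 'X' maps to itself.
COMP = {"A": "T", "C": "G", "G": "C", "T": "A", "X": "X"}

def reverse_complement(motif):
    """Reverse-complement a motif, so that, e.g., a T→C motif on the
    negative strand corresponds to an A→G motif on the positive
    (reference) strand."""
    return "".join(COMP[b] for b in reversed(motif))
```

For example, reverse_complement("XXAGA") yields "TCTXX", matching the first row of table 2.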
APOBEC Signatures Discovered in HIV-1 Data Set
To demonstrate the applicability of the method to different types of NGS data, we obtained sequencing data of HIV-1 known to contain strong APOBEC3G signatures (Pollpeter et al. 2018). APOBEC3G (A3G) is an antiviral host factor with cytidine deaminase activity, which in our terms is a C→U (T) mutation. A3G was shown to deaminate the nascent minus DNA strand of HIV-1, leading to an observed G→A hypermutation on the +RNA genomic strand (Harris et al. 2003). The DNA editing function of A3G is known to be context dependent, where a focal G is observed to be followed by a “G” or an “A” (GG or GA). We set out to analyze data from an experiment that measured A3G activity directly on nascent DNA strands (Pollpeter et al. 2018) and ran our inference on these data. Out of seven significant motifs, four were related to the A3G context, suggesting that we are also capable of recognizing known enzymatic signatures in empirical data with strong selection effects. The results are presented in table 3.
Table 3. Motifs with Positive Effect on Transition Mutation Rates Detected in HIV-1 Empirical Data.

Mutation Type | Motif | Mean Gamma | Mean Mutation Rate | Context Mean Mutation Rate | Increase/Decrease
G→A | AGGXX | 1    | 0.08 | 0.22 | Increase
G→A | TTGXX | 0.93 | 0.08 | 0.1  | Increase
G→A | XTGXT | 1    | 0.08 | 0.14 | Increase
G→A | CXGGX | 1    | 0.08 | 0.43 | Increase
G→A | TXGGX | 1    | 0.08 | 0.22 | Increase
G→A | XXGGG | 1    | 0.08 | 0.59 | Increase
G→A | XXGAG | 1    | 0.08 | 0.37 | Increase
note.—Motifs exceeding a threshold (mean gamma) of 0.8 are presented. All presented motifs are of mutation type G→A. The transitions C→T, T→C, and A→G did not contain motifs which were significant for positive effects. The motifs in A3G context are shown in bold.
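Under the convention suggested by tables 2 and 3, in which the mutated base sits at the center of each 5-mer motif, membership in the A3G dinucleotide context can be checked with a small helper (an illustrative sketch with a hypothetical function name, not part of the published method):

```python
def is_a3g_context(motif):
    """True if the focal (center) G is immediately followed by G or A,
    i.e. the GG/GA dinucleotide context associated with A3G editing.
    Assumes the mutated base is at index 2 of a 5-character motif."""
    return len(motif) == 5 and motif[2] == "G" and motif[3] in ("G", "A")
```

Applied to table 3, this flags CXGGX, TXGGX, XXGGG, and XXGAG, consistent with four of the seven significant motifs being A3G related.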
Running Times and Resources
We report running times of our method, obtained on a 2.5 GHz Intel Core i7 processor with 16 GB of RAM. The analysis ran at a constant rate of 400 iterations per second; at this rate, running one chain required ∼50 min for a ∼7,000-base-long genome and utilized ∼300 MB of available RAM.
Discussion
We developed here a novel approach for the detection and evaluation of the effects of sequence context on mutation rates. Our prime motivation was to develop a method that is able to analyze high-resolution deep sequencing data sets. One of the main challenges in these data sets is the high dimensionality of the data when accounting for sequence context. Thus, one of the main novelties of the new approach is the use of Bayesian shrinkage to take into account the fact that the number of sequence motifs that affect mutation rates is likely much smaller than the number of possible motifs. We conclude that the method provides highly accurate results on simulated data. In particular, we precisely identify the motifs that do not influence mutation rate. Our remarkably low false positive rate promotes high confidence in inferring influential motifs; however, we also fail to detect some true motifs, suggesting that our method is conservative. We believe that this trend is due to a combination of the combinatorial complexity of the inference and the Bayesian shrinkage leading to a sparse vector. Overly strong shrinkage might dilute an existing effect, as an effect shrunk to zero will show no effect at all.

Our analysis of empirical poliovirus data revealed the potential effect of ADAR, a protein known to edit virus genomes. Remarkably, we were able to detect this effect despite the very low mutation frequencies of ADAR-associated motifs (table 2). Further experimental work is required to validate this finding. We also demonstrated that our method fares well with other types of empirical data, where the effects of selection are much stronger. We successfully captured motifs associated with APOBEC3G editing in HIV-1 data, and we thus conclude that the method can be generally applicable to virus NGS data.

Having said that, we would like to recognize some limitations that could restrict the reliability of inferences based on our method.
The assumption of neutrality is essential to the analysis, as point mutations that strongly deviate from neutrality might promote the appearance of a spurious context effect. This might be especially true for empirical data. The more data that are available (e.g., the more synonymous sites analyzed), the more the effect of a handful of nonneutral sites will be diminished.

In the analysis described here, we assumed the absence of genetic drift by pooling counts of all mutations in a given context. Our experiments were performed using very large population sizes; however, the copy number of the mutations was very low, and thus genetic drift is most likely very prominent, mostly in the earlier passages of the experiment. To this end, we focused on the last passage, where we noted fewer fluctuations of mutation frequencies associated with drift (Zinger et al. 2019). Although random genetic drift is not directly modeled, our simulations do incorporate genetic drift. Reassuringly, we did not observe false positives due to drift, most likely because drift will not affect one specific context consistently. We note that our model does allow analysis of data where one counts polymorphisms rather than mutations; for our setup, where the high mutation rates of viruses lead to a polymorphism at most sites of the genome, this distinction is irrelevant.

To summarize, we have developed a robust method that has the potential and the strength to identify influential sequence contexts, and this may shed light on the underlying mechanisms of both polymerases and other enzymes that generate genetic modifications. The method is flexible, compatible with a wide variety of applications and data sets, and should be fairly easy to execute for the analysis of NGS data. Although the method was designed with our recent experiments on virus populations in mind, it is also generic enough for other types of NGS data.

The Python code for the method is publicly available via GitHub at https://github.com/SternLabTAU/CIPI. The data used in this study are available through Zenodo at https://zenodo.org/record/3408598, with a direct link through the GitHub repository.
Materials and Methods
Moran Model for Simulating Context Dependence
To simulate viral sequences where the rate of mutation depends on the context, we use an adaptation of the four-allelic Moran model previously described for simulating the evolution of cancer cells (Zhu et al. 2011). This model is a continuous-time birth–death model. Usefully, this model takes into account large population sizes and allows for different mutation rates, both of which are relevant in our case. We simulate each position independently, while taking into account the context of the position as described below. Each allele is one of the four nucleotides.

Let N be the population size. For a given position i, at time t we define a vector Xi(t) = (Xi,1(t), Xi,2(t), Xi,3(t), Xi,4(t)). Thus, each entry Xi,j(t) is the number of genomes with the jth allele in the ith position. We initialize Xi(0) = (N, 0, 0, 0), with the first entry corresponding to the allele in the original sequence, meaning that we start the simulations from a homogeneous population of genomes defined by the original sequence. When an individual of type j dies, it is replaced by randomly selecting a parent from the N options and copying it. Notably, the model we used (Zhu et al. 2011) also defines, for each phenotype j, a fitness value fj and mutation rates uj→j′ for each j′ ≠ j that change the allele of the individual. In the first set of simulations reported in the Results section, we assume neutrality for all mutations, and hence set fj = 1 for every j. In order to model the influence of context, we assume that the mutation rates uj→j′ differ between sequence contexts: for each position i at a given context c, uj→j′ = uj→j′(c), similar to the way context was assumed to affect mutation rates in our model above.

The evolutionary process defined above is a Markov chain, and we assume that births/deaths occur at a constant rate in time such that time can be measured continuously. At time t the state of the chain is Xi(t). For a four-allelic model, we have 12 possible events that can change the state of the chain, by decreasing the value of Xi,j and increasing the value of Xi,j′ for some values of j and j′, j ≠ j′. The rate of such a transition is

r(j→j′) = Xi,j (fj′ Xi,j′ / Σk fk Xi,k + uj→j′).

Let r = (r1, …, r12) be the rates of all 12 possible transitions that can occur at any given point in time. Furthermore, define a 4 × 12 matrix A, where entry (j, k) of the matrix equals +1 if transition k increases the count of allele j, −1 if it decreases it, and 0 otherwise. Then, the change in allele counts through time is given by dXi(t) = A dY(t), where Y(t) is the vector of counting processes of the 12 transitions, with rates r. Using standard stochastic simulation algorithms (Gillespie 1977), we then simulate this process to generate samples of genetic data. As the mutations are all assumed to be segregating at low frequencies, which is true for the experimental viral sequence evolution data that we hope to emulate, we approximate the sequence evolution as the union of multiple independently evolving sequence sites.
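The simulation scheme for a single position can be sketched as follows (a minimal Gillespie-style sketch, not the authors' published code; the function name and the replacement-plus-mutation form of the transition rate are our reading of the model described above):

```python
import numpy as np

def simulate_position(N=1000, u=None, f=None, t_max=1.0, rng=None):
    """Gillespie simulation of one position under a four-allelic Moran model.

    N     : population size
    u     : 4x4 matrix of (context-dependent) mutation rates u[j, j2]
    f     : fitness of each allele (all ones => neutrality)
    Returns the final allele-count vector X.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = np.zeros((4, 4)) if u is None else np.asarray(u, dtype=float)
    f = np.ones(4) if f is None else np.asarray(f, dtype=float)
    X = np.array([N, 0, 0, 0], dtype=float)  # homogeneous starting population
    events = [(j, j2) for j in range(4) for j2 in range(4) if j != j2]  # 12 events
    t = 0.0
    while t < t_max:
        W = float(f @ X)  # total fitness of the population
        # rate of event (j -> j2): a type-j individual dies and is replaced by
        # a fitness-weighted random parent of type j2, or mutates j -> j2
        rates = np.array([X[j] * (f[j2] * X[j2] / W + u[j, j2])
                          for j, j2 in events])
        total = rates.sum()
        if total == 0.0:
            break
        t += rng.exponential(1.0 / total)  # waiting time to the next event
        j, j2 = events[rng.choice(len(events), p=rates / total)]
        X[j] -= 1.0
        X[j2] += 1.0
    return X
```

Context dependence enters only through the matrix u passed in for each position, so the whole genome can be simulated as independent per-position runs, as described above.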
Simulating Noisy Frequencies and the Effects of Purifying Selection
In order to simulate mutation frequencies that underwent sampling, we began with the set of simulated frequencies described above, which we assumed to be our true frequencies, with N = 1,000,000 genomes. We then applied binomial sampling to the frequencies and sampled 100,000 genomes for sequencing. We further tested the method when a proportion of sites deviate from neutrality. We simulated 500 data sets with a varying number of loci under selection, where selection coefficients were sampled based on a distribution of fitness effects from an empirical data set of RNA viruses (Sanjuan et al. 2004).
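The binomial resampling step above can be written compactly (a minimal sketch, not the authors' code; the function name is ours):

```python
import numpy as np

def sample_observed_frequencies(true_freqs, n_sampled=100_000, seed=None):
    """Resample true per-site mutation frequencies by binomially drawing
    n_sampled genomes for sequencing, mimicking the sampling noise step."""
    rng = np.random.default_rng(seed)
    counts = rng.binomial(n=n_sampled, p=np.asarray(true_freqs, dtype=float))
    return counts / n_sampled
```

Each observed frequency is then counts/n_sampled, so rarer mutations carry proportionally more sampling noise.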
Comparison with a Naïve Approach
We wished to compare our method with other approaches. However, all existing methods were inappropriate for our data: we cannot use phylogenetic methods, as the short-read NGS data do not allow reconstructing a phylogenetic tree, and we cannot use the regression analysis described in Aggarwala and Voight (2016), originally used for analyzing the human genome, as the transition into polymorphisms in a small genome violates several of the model’s assumptions. Thus, we set out to implement a naïve solution for inferring the motifs in which the mutation rate is elevated. Given the frequencies of all mutations in a genome and a defined mutation type, we considered the top 1% of contexts with the highest average mutation rates. Formally, for a context c that appears nc times in a genome, we define fc,i as the frequency of the mutation at the ith position carrying context c. Then, we average all frequencies of all positions that share the same context. The average mutation rate in context c is thus

f̄c = (1/nc) Σi fc,i.

For the top 1% of contexts by average mutation rate, we obtain all the contained motifs and label them as significant.
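The naïve ranking above reduces to a per-context average followed by a cutoff (an illustrative sketch; function and variable names are ours, not from the paper's code):

```python
from collections import defaultdict

def top_contexts(freqs, contexts, top_frac=0.01):
    """freqs[i]: mutation frequency at position i; contexts[i]: the context
    of position i. Averages frequencies over all positions sharing a context
    and returns the top `top_frac` contexts by average mutation rate."""
    by_context = defaultdict(list)
    for f, c in zip(freqs, contexts):
        by_context[c].append(f)
    avg = {c: sum(v) / len(v) for c, v in by_context.items()}
    k = max(1, int(top_frac * len(avg)))
    return sorted(avg, key=avg.get, reverse=True)[:k]
```

Any motif contained in one of the returned contexts would then be labeled significant under this naïve scheme.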
Empirical Data Analysis—Poliovirus Data Set
We analyzed the oral poliovirus 2 (OPV2) sequencing data reported by Stern et al. (2017), using mutation frequencies as reported therein.
Empirical Data Analysis—HIV Data Set
We analyzed HIV-1 data reported by Pollpeter et al. (2018). Raw fastq files were downloaded and mapped using an in-house computational pipeline to the HIV-1 reference genome pNL4-3 (accession number AF324493.2) at positions 1–180, as per the original paper. The coverage obtained matched the paper’s description.