MOTIVATION: Accurate probabilistic models of sequence evolution are essential for a wide variety of bioinformatics tasks, including sequence alignment and phylogenetic inference. The ability to realistically simulate sequence evolution is also at the core of many benchmarking strategies. Yet, mutational processes have complex context dependencies that remain poorly modeled and understood. RESULTS: We introduce EvoLSTM, a recurrent neural network-based evolution simulator that captures mutational context dependencies. EvoLSTM uses a sequence-to-sequence long short-term memory model trained to predict mutation probabilities at each position of a given sequence, taking into consideration the 14 flanking nucleotides. EvoLSTM can realistically simulate mammalian and plant DNA sequence evolution and reveals unexpectedly strong long-range context dependencies in mutation probabilities. EvoLSTM brings modern machine-learning approaches to bear on sequence evolution. It will serve as a useful tool to study and simulate complex mutational processes. AVAILABILITY AND IMPLEMENTATION: Code and dataset are available at https://github.com/DongjoonLim/EvoLSTM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: Accurate probabilistic models of sequence evolution are essential for a wide variety of bioinformatics tasks, including sequence alignment and phylogenetic inference. The ability to realistically simulate sequence evolution is also at the core of many benchmarking strategies. Yet, mutational processes have complex context dependencies that remain poorly modeled and understood. RESULTS: We introduce EvoLSTM, a recurrent neural network-based evolution simulator that captures mutational context dependencies. EvoLSTM uses a sequence-to-sequence long short-term memory model trained to predict mutation probabilities at each position of a given sequence, taking into consideration the 14 flanking nucleotides. EvoLSTM can realistically simulate mammalian and plant DNA sequence evolution and reveals unexpectedly strong long-range context dependencies in mutation probabilities. EvoLSTM brings modern machine-learning approaches to bear on sequence evolution. It will serve as a useful tool to study and simulate complex mutational processes. AVAILABILITY AND IMPLEMENTATION: Code and dataset are available at https://github.com/DongjoonLim/EvoLSTM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Probabilistic models of sequence evolution are among the earliest areas of research in bioinformatics. These models aim to quantify the probability of specific types of mutations such as substitutions (Jukes ; Kimura, 1980) and small indels (Thorne ) in an evolving sequence. They provide the necessary probabilistic framework that allows formulating meaningful biological questions such as sequence alignment and phylogenetic inference. They also enable the realistic simulation of sequence evolution, and increasingly accurate tools have been introduced for this purpose over the years, including Rose (Stoye ), INDELible (Fletcher and Yang, 2009) and Evolver (Edgar ). By enabling the simulation of fake but evolutionarily related sequences, these approaches enable benchmarking and validating bioinformatics tools for problems such as multiple sequence alignment (Earl ; Edgar, 2004; Lassmann and Sonnhammer, 2005; Papadopoulos and Agarwala, 2007; Ramakrishnan ), ancestral genome reconstruction (Blanchette ) and phylogenetic inference (Price ).Mutation probabilities have long been known to be dependent on sequence context: the probability of a mutation happening at a certain site does not only depend on the type of mutation (e.g. transition versus transversion), but also on the nucleotides around it (Arenas, 2015). In some cases, this dependency is intrinsic to the mutational process itself. For example, perhaps the strongest type of context dependencies is the elevated C-to-T substitution rate in the context of a CpG dinucleotide, caused by the deamination of a methylated cytosine, transforming it to a thymine (Bird, 1980; Ehrlich and Wang, 1981). In other cases, it is the result of context-dependent DNA repair. Surrallés found that the transcription-coupled repair machinery shows high localization on gene rich regions of the human genome, and Feng found that transcription-coupled repair of cyclobutanepyrimidine dimers in Chinese hamster genome is context-dependent. Finally, it may be the result of context-dependent selective pressure. For example, coding sequence evolution is best captured by a 3-nucleotide codon model that accounts for the consequences of a nucleotide change on the amino acid encoded (Goldman and Yang, 1994). There is also a rich literature on co-evolution models in DNA (Makova and Hardison, 2015), RNA (Holmes, 2004) and proteins (Rodrigue ). Insertions and deletion rates also exhibit strong context dependencies, often linked to DNA polymerase slippage in locally repetitive sequences (Messer and Arndt, 2007).Several types of models have been introduced to capture mutational context dependencies. Jensen and Pedersen (2000) proposed a Markov chain Monte Carlo method to calculate the substitution rate dependent on its two flanking bases inside a protein-coding DNA region. Later, a maximum likelihood analysis to infer nucleotide or dinucleotide mutation frequencies in non-coding regions of the genome was proposed (Arndt ). Siepel and Haussler (2003) introduced a context-dependent substitution model that has been developed to reflect context dependencies in both coding and non-coding regions of the genome, and a parameter-rich Bayesian network substitution model with parameters defined from flanking base contexts for genome-wide ancestral reconstruction were developed (Chachick and Tanay, 2012). However, these methods often limit the context effect to one flanking nucleotide/amino acid on each side due to the computational cost and sample size requirements growing exponentially with context size. The extent of longer-range dependencies in mutation rates is thus poorly understood, although sparse Bayesian models (Ling ) have recently been introduced for that purpose and applied to viral evolution to determine the significance of certain configurations of nucleotide around mutation sites by inferring the mutation rate of 5-mers. Aggarwala and Voight (2016) proposed a statistical model that takes parameters from observed frequency of mutations within 5-mer or 7-mer, while Zhu introduced a log-linear model for mutation frequency analysis that also considers up to 5-mers. It has also been found that variability of mutation in human genome between different populations of human genome are dependent on context beyond the immediate flanking bases (Aikens ).To overcome the limitations of these probabilistic context-dependent substitution models, we propose an evolutionary model that harnesses recent developments in machine learning. In recent years, recurrent neural networks (RNNs) have enabled models to learn in the context of time series data much more efficiently than previously possible. In particular, the long short-term memory (LSTM) model architecture (Gers ) made it possible for the deep neural network in the area of natural language modeling to overcome the vanishing gradient problem of standard RNN (Sundermeyer ). More recently, RNN-encoder–decoder (Cho ) architectures, which separate the network into an RNN-encoder and RNN-decoder to better capture the context of the input sequence, were introduced, with applications to automated language translation. With a similar idea, sequence-to-sequence models (Sutskever et al., 2014) have shown that using LSTM networks in the encoder–decoder architecture can further improve the capacity to capture the context of the input in neural machine translation.In this paper, we introduce EvoLSTM, a sequence-to-sequence LSTM model of sequence evolution inspired by the aforementioned recent work in language modeling. We trained EvoLSTM from entire whole-genome primate alignments and show that it is able to capture context dependencies that are longer in range than what had been previous reported, for both substitutions and short indels. EvoLSTM’s RNN evolutionary model paves the way for a variety of applications that rely on realistic modeling of sequence evolution.
2 Materials and methods
EvoLSTM is a machine-learning based probabilistic model of context-dependent sequence evolution. In its simplest form, it takes as input a sequence S of length K and outputs a randomly generated descendant sequence T that may either be identical to S or may differ from it through one or more substitutions or indels. In this section, we explain EvoLSTM’s architecture, its training and evaluation.
2.1 Training data
EvoLSTM is trained from a set of pairs of aligned ancestral/descendant sequences. We used a 100-way alignment of whole vertebrate genomes (Blanchette ; Miller ) and applied the Ancestors1.0 program (Blanchette ; Diallo ) to infer maximum likelihood ancestral sequences genome-wide based on a simple context-independent substitution model. The approach has previously been shown to be highly accurate for most ancestors and, in particular, for primates (Blanchette ). We then extracted induced pairwise alignments between the human genome and various primate ancestors: the old-world monkey ancestor (catarrhini; 0.03 expected substitution per site), and the simian ancestor (simiiformes; 0.05 exp. subst./site). Since our analyses did not reveal significant differences between using one or the other as training data, which proceeded to use the old-world monkey ancestor, which has the benefit of being the close enough to human to limit the risks of double mutations at the same site, while providing us with sufficiently many mutations to learn from.Each pairwise alignment was then processed as follows (Fig. 1). First, gaps present in both sequences were removed. Second, each portion of 1 or 2 nucleotides in the descendant sequence that is aligned to consecutive gaps in the ancestors (hence the results of a 1 or 2 bp insertion) were combined with the previous descendant nucleotide and represented as a single character from an extended alphabet of size 86 (4 normal nucleotides + 16 dinucleotides for 1 bp insertions + 64 trinucleotides for 2 bp insertions + gap + dummy character). This allows treating insertions of size 1 or 2 as a substitution, which means that we do not need to assume we know ahead of time the position and size of an insertion. Since our focus is on short indels, we did not consider regions with larger indels, due to the exponential blow up in extended alphabet size that would be required.
Fig. 1.
Data preprocessing pipeline. An ancestor/descendant alignment is converted to a set of overlapping K-mer pairs from an extended alphabet of meta-nucleotides and gaps. Those K-mer pairs are used to train EvoLSTM
Data preprocessing pipeline. An ancestor/descendant alignment is converted to a set of overlapping K-mer pairs from an extended alphabet of meta-nucleotides and gaps. Those K-mer pairs are used to train EvoLSTMThese modified ancestor-descendant alignments are sliced into K-mer pairs (for K ranging from 1 to 39). Those aligned K-mer pairs constitute the data from which EvoLSTM is trained. We selected the first 10 000 000 K-mer pairs from human chromosome 2 as the training set and the next 2 000 000 as validation set. Our test set consists of 149 860 432 K-mer pairs selected from all chromosomes except human chromosome 2, hence ensuring that the test set is entirely disjoint from the training and validation sets. This very large test set enables us to accurately estimate the accuracy of the trained models.
2.2 EvoLSTM’s LSTM architecture
EvoLSTM is an RNN that takes as input a K-mer and outputs the probability of each of the nucleotides (and meta-nucleotides) in a hypothetical descendant sequence (Fig. 2). It is based on the LSTM architecture for RNNs, which was introduced to address the vanishing gradient problem of classical RNNs (Sundermeyer ). The LSTM cell corresponding to position t in the K-mer takes in an input value (the one-hot encoded version of meta-nucleotide at position t), as well as two types of recurrent states: the hidden state and the cell state , which are vectors of predetermined dimensions. Following Gers , these states are combined with the input and are regulated by trainable input (W and W), output (W and W) and forget gates (W and W) weight vectors. Each LSTM cell remembers values of the hidden state and cell state over time intervals and the three gates modify the information received from the previous time step (Sundermeyer ).
Fig. 2.
EvoLSTM is a sequence-to-sequence bidirectional LSTM model made of an encoder (left portion, shown in red) and a decoder (right portion, shown in blue box). An ancestral sequences (here, ACT) are given as input to the encoder. Each cell recurrently receives information from the previously cell, in the form of a hidden state vector (blue arrows) and cell state vector (red arrows), and combines it with the one-hot encoded ancestral nucleotide to update those two vectors. The output of the encoder is the concatenation of the hidden and cell state vectors produced by the forward and reverse directions. Those vectors are an encoding of the sequence context, optimized so as to be maximally informative for the decoder. In the decoder, each cell receives as input (i) the one-hot encoded ancestral nucleotide and the previous descendant nucleotide generated by the model, as well as the state and cell vectors from the previous cell. It updates those two vectors and passes them to the next cell, but also feeds the hidden state vector to an MLP, which output a probability distribution over descendant characters (nucleotides or gap) at that position. A character is randomly drawn from that distribution, emitted and passed onto the next LSTM decoder cell
EvoLSTM is a sequence-to-sequence bidirectional LSTM model made of an encoder (left portion, shown in red) and a decoder (right portion, shown in blue box). An ancestral sequences (here, ACT) are given as input to the encoder. Each cell recurrently receives information from the previously cell, in the form of a hidden state vector (blue arrows) and cell state vector (red arrows), and combines it with the one-hot encoded ancestral nucleotide to update those two vectors. The output of the encoder is the concatenation of the hidden and cell state vectors produced by the forward and reverse directions. Those vectors are an encoding of the sequence context, optimized so as to be maximally informative for the decoder. In the decoder, each cell receives as input (i) the one-hot encoded ancestral nucleotide and the previous descendant nucleotide generated by the model, as well as the state and cell vectors from the previous cell. It updates those two vectors and passes them to the next cell, but also feeds the hidden state vector to an MLP, which output a probability distribution over descendant characters (nucleotides or gap) at that position. A character is randomly drawn from that distribution, emitted and passed onto the next LSTM decoder cellSpecifically, we haveFinally, the outputs of each gate are combined to update the hidden state and the cell state of the current cell:
where denotes the element-wise product.
2.3 Sequence-to-sequence model
Borrowing the idea from sequence-to-sequence learning (Sutskever et al., 2014), EvoLSTM is composed of connected LSTM-based encoder and decoder networks (Fig. 2). The encoder is made of two LSTM looking at the same input K-mer but in opposite directions. Their last output hidden states h and cell states c are concatenated and passed to the decoder network, which uses it, along with the ancestral K-mer itself, to generate a descendant sequence. h and c capture the context information that will be used to inform the decoder.The decoder consists of an LSTM similar to the encoder, coupled with a fully connected neural network. Each cell receives the cell and hidden states from the previous cell, except for the first cell, which obtains those from the encoder (h and c). Cell t from the decoder is used to predict a probability distribution over meta-nucleotides at position t of the descendant sequence. The hidden state is fed to a fully connected multi-layer perceptron (MLP) with 86 outputs, to which a softmax function is applied to obtain a normalized probability distribution. Also, in addition to receiving meta-nucleotide x as input, cell t (for t > 1) receives the observed (during training) or sampled (when using the network as a generative model) descendant nucleotide from position t − 1, providing it with information about the evolutionary event that took place at the previous position. This is particularly critical to enable indels spanning more than one nucleotide.
2.4 Training EvoLSTM
We trained our model using the cross-entropy (CE) (negative log-likelihood) as the loss function to minimize. In short, we aim to minimize
where A is an input ancestral K-mer and D is its descendant K-mer. Trainable weights include the input, output, forget, cell state and hidden state weight matrices and bias terms for both the encoder and decoder, as well as the weights of the MLP in the decoder. Training is carried out using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001. We used batch learning with batch size 1024. To reduce overfitting, we used early stopping, ending training when the validation loss did not decrease for five consecutive epochs. To investigate the effect of context size, models were trained with six different values of K: 1, 5, 11, 15, 21 and 39.Hyper-parameters of the model were set based on a compromise between training time and accuracy on the validation set. The size of the hidden and cell state vectors was set to 512. The MLP consists of 1 hidden layer of 86 neurons with ReLU (rectified linear activation unit) (Nair and Hinton, 2010) activation function and the output layer of 86 neurons.
2.5 Using EvoLSTM as an evolution simulator
To use a trained EvoLSTM model to simulate the evolution of a given ancestral K-mer sequence A, the model is used as described previously, with a few small modifications. At each position t, a meta-nucleotide is sampled from the distribution generated by the model at that position. It is that nucleotide (rather than the true descendant nucleotide, as was done during training phase) that is passed as input to the next cell. In contrast, in the context machine translation (Sutskever et al., 2014), the goal is generally not to sample from a distribution over translations, but instead to identify the maximum-likelihood translation, which is achieved via a greedy or beam search algorithm (Neubig, 2017).In order to use EvoLSTM to simulate the evolution of ancestral sequences longer than K, we proceed as follows. EvoLSTM breaks down the ancestral genome sequence of interest into K-mers similarly to the data preprocessing step described in Figure 1 but without overlap between each K-mer. Denote those ancestral K-mers as . Each K-mer is given as input to EvoLSTM, in the order in which they appear in the sequence. To sample descendant K-mer , EvoLSTM uses the cell and hidden state vectors obtained from passing K-mer k to the encoder network together with the last simulated nucleotide obtained from to be passed down to the decoder network. Once all descendant K-mers are obtained, they are concatenated to yield the complete descendant sequence.The simulation processed described until now only allows mimicking sequence evolution over a branch of the same length as that corresponding to the pair of ancestor/descendant genomes used for training. Let us call that branch length the unit branch length. To simulate the evolution of branches of non-unit length, we proceed as follows. Suppose the target branch length is λ units long. Decompose λ into its integer and factional portions: . We apply EvoLSTM times, each time using it to re-evolve the sequence produced from the previous iteration. The resulting K-mer (which would be the ancestral K-mer if ) is then fed one last time through EvoLSTM, but this time rejecting a proposed mutation with probability . Supplementary Figure S1 shows that this iterative process captures well context dependencies of evolutionary events over longer branches.
2.6 Evaluation
2.6.1 Baseline approaches
Past studies on evolution models have mainly focused on immediate neighbor context-dependent substitutions and are thus not directly comparable to the EvoLSTM. Instead, we implemented two baseline models. Our table-based approach is perhaps the most natural context-dependent evolutionary model: It simply approximates and saves as , where N(A, D) and N(A) are the observed frequencies of (D, A) and alignments in the training set. As K increases, N(A) can become too small to provide a meaningful probability estimate. If , we symmetrically trim the context sequences (by one nucleotide at each end), until we get . This approach enables this algorithm to use large context sizes when sufficient data exist, but to returns to smaller context sizes when it does not.We also implemented a second machine-learning approach using a standard bidirectional LSTM network (Fig. 3) coupled to a 1-hidden layer MLP (ReLU activation), with an output layer of 86 neurons. Unlike our EvoLSTM model, this model does not consider the prediction made in the previous time step and can make predictions based only on the bidirectional input context. The learning rate, hidden and cell state weight size, optimizer, initial learning rate, batch size and all other training details are identical to the sequence-to-sequence EvoLSTM.
Fig. 3.
Baseline bidirectional LSTM model structure. The two hidden states emitted from both directions are concatenated and passed down to an MLP
Baseline bidirectional LSTM model structure. The two hidden states emitted from both directions are concatenated and passed down to an MLP
2.7 Implementation
EvoLSTM was implemented using tf.keras (Chollet ) LSTM layers in Tensorflow 2.0 (Abadi ). Biopython (Cock ) was used for reading the MAF file and preprocessing genome sequences. All relevant code is available at https://github.com/DongjoonLim/EvoLSTM.EvoLSTM is easy to train from whole-genome alignments and inferred ancestral sequences. It comes with code for interpretation and visualization of the models learned. It is also able to use a trained model to randomly evolve a given input sequence using substitutions and short indels. As such, it will easily integrate into more general genome evolution simulators such as Evolver (Edgar ).
3 Results
This section begins with the assessment of the accuracy of EvoLSTM and baseline models, followed by an empirical analysis of context-dependent mutation probabilities learned by EvoLSTM.
3.1 Model performance
We first evaluated the ability of different approaches to properly estimate mutation probability in a context-dependent manner (Fig. 4). The models investigated (see Section 2) included our biLSTM and EvoLSTM seq2seq models, as well as a simpler adaptive frequency-based approach, each with context sizes . Each model was trained on 10 000 000 aligned K-mer pairs extracted from the whole-genome alignment of the computationally reconstructed old-world monkey genome to the human genome (from human chromosomes 2), and evaluated on a similar but much larger test set obtained from chromosomes 1 and 3–22, containing >149 million K-mer pairs. This atypical imbalance between the size of the training and test sets is intentional; having a very large test set allows us to accurately estimate context-dependent mutation probabilities using the simple frequency-based approach, to then be able to assess how the different approaches proposed can learn context dependencies from a relatively smaller training set.
Fig. 4.
CE loss (lower is better) of the test set (>149 million K-mers, corresponding to alignments of portions of the old-world monkeys’ ancestral genome against the human genome), with different context sizes K. The models compared are two baseline models (adaptive frequency table and biLSTM) as well as two versions of EvoLSTM, with one or two biLSTM layers
CE loss (lower is better) of the test set (>149 million K-mers, corresponding to alignments of portions of the old-world monkeys’ ancestral genome against the human genome), with different context sizes K. The models compared are two baseline models (adaptive frequency table and biLSTM) as well as two versions of EvoLSTM, with one or two biLSTM layersWe assessed the ability of a model to accurately capture context-dependent mutation probabilities using the CE of the test data, which are equivalent to the negative log-likelihood of the data given the model. CE values that are reached using no context at all (K = 1) are much worse than those obtained with larger values of K, confirming that context dependencies are strong. The adaptive frequency table-based approach is limited in capturing long-range dependencies because of the exponential amount of data it would require in order to do so. In contrast, EvoLSTM (1 layer) is able to fully take advantage of large context sizes, reaching a minimal CE at K = 15. Larger values of K result in worse CE values, possibly because as the LSTM network becomes larger, gradients become unstable (Greff ). However, this does not mean that context sizes larger than K = 15 do not have an incidence on mutation probabilities. Overall, the difference in CE values obtained [0.18 for EvoLSTM (K = 15) versus 0.195 for table-based (K = 21)] is notable and reliably reproduced over different restarts. Note that the reason the performance of the adaptive table-based approach remains stable for large values of K is that this approach adaptively trims each K-mer until its count in the training set is sufficiently large to make accurate probability estimations; in practice, for , most K-mers get trimmed to 15-mers or shorter.We also tested whether it may be beneficial to add a second LSTM layer to EvoLSTM, which is an approach that has been shown to be beneficial in language modeling (Sutskever et al., 2014). However, this did not improve the performance, and was much longer to train. We also observe that the effect of capturing contexts by separating the encoder from the decoder in the sequence-to-sequence model used in EvoLSTM is important since the CE of EvoLSTM is much lower than that of the baseline biLSTM, across all context sizes. This suggests that relatively long-range context dependencies exist, and that EvoLSTM is able to capture many.
3.2 Flanking context strongly impacts mutation probability
We next aimed to characterize the impact of long-range context on mutation probability, and better understand the ability of EvoLSTM to take this context into consideration. We used a trained EvoLSTM (K = 15) to simulate the evolution of a set of 149 860 432 K-mers from an ancestral old-world monkey sequence. We then tabulated the frequency of simulated mutations in each of the 256 different contexts of the form wxNyz, i.e. including two flanking bases on each side of the mutating base, and contrasted those frequencies to those observed in actual alignments of the same regions.Figure 5 shows the results for a subset of the 16 possible substitutions, 4 possible deletions and 4 possible insertions. See also Supplementary Tables S1–S4 for full results. Several observations emerge. First, EvoLSTM efficiently captures and simulates context dependencies of that size, with correlation coefficients ranging from 0.83 to 0.99 for substitutions. In particular, it clearly and accurately captures the well-known CpG to TpG mutation of methylated C’s (Jabbari and Bernardi, 2004), or the equivalent CpG to CpA from the reverse strand, assigning significantly higher mutation probability (0.1–0.25) to these mutations when in the right dinucleotide context versus in other contexts (<0.05). Notably, even within the CpG dinucleotide context, the probability of C-to-T mutation is strongly dependent on the broader context. For example, CGCGz contexts are two times less mutagenic than AT-rich contexts. This may be explained by the presence of CpG islands in the genome, which is generally unmethylated and hence both CpG rich and substitution poor. Non-CpG contexts also exhibit strong context dependencies. For example, A-to-G transitions show a greater than 10-fold increase in the probability in the contexts of CAATw versus xAAAw.
Fig. 5.
Comparison of the observed and EvoLSTM-predicted mutation probabilities for context size K = 5. Each graph shows the results for a given mutation occurring at the center of the K-mer. Each plot has 256 points, corresponding to the 256 possible contexts for that mutation. Some of the contexts are labeled. The Pearson correlation coefficient r is given for each case
Comparison of the observed and EvoLSTM-predicted mutation probabilities for context size K = 5. Each graph shows the results for a given mutation occurring at the center of the K-mer. Each plot has 256 points, corresponding to the 256 possible contexts for that mutation. Some of the contexts are labeled. The Pearson correlation coefficient r is given for each caseFigure 6 illustrates this phenomenon in more details in the case of CpG dinucleotide (left) and nucleotide deletion (right). The massive difference in CpG versus non-CpG C-to-T substitution probabilities is only the beginning of a dive into further and further refinements of context dependencies. The ACG context is 70% more mutagenic than the TCG context. There is then a 35% difference in C-to-T mutagenicity between different xACGy contexts, and a 2-fold difference between high and low mutagenic xGACGGy contexts. Throughout, the correspondence between the predicted and observed mutation probabilities remains quite high, until the context size considered becomes less relevant to mutation probability (e.g. for xACCTTy). Deletions (right panel) exhibit these long-range dependencies even more strikingly, with a 20-fold difference in A-deletion probabilities between highly mutagenic AAAAAAA context and conservative context TCCAGCG.
Fig. 6.
EvoLSTM-predicted mutation probability and observed mutation probability in increasingly large contexts. Each dot represents one of the 16 possible context extensions. Out of those flanking base configurations, the top row shows the effect of adding the most mutagenic flanking bases, the second row shows the effect of adding the median mutagenic flanking bases and the bottom row shows the effect of the least mutagenic flanking bases
EvoLSTM-predicted mutation probability and observed mutation probability in increasingly large contexts. Each dot represents one of the 16 possible context extensions. Out of those flanking base configurations, the top row shows the effect of adding the most mutagenic flanking bases, the second row shows the effect of adding the median mutagenic flanking bases and the bottom row shows the effect of the least mutagenic flanking basesTransversions generally show a slightly weaker context-dependency (5-fold variation between most and least mutagenic contexts). Because they are also rarer in general, prediction accuracy is slightly worse due to the relatively small number of training examples.Figure 5 (bottom row) also shows the context-dependency of 1-nucleotide insertions and deletions. Those also display a very strong context-dependency. For example deletions of a T are roughly 8–12 times more likely in T-rich contexts than in GC-rich contexts such as zCTGz or zGTCz contexts. Insertions show a similar pattern of elevated probabilities for insertions of nucleotides in a context that resembles them, which is consistent with the well-known DNA polymerase slippage model (Messer and Arndt, 2007).To study the impact of the broader context, we investigated how the probabilities of specific substitutions and indels vary as we look at increasingly large context sizes (Supplementary Fig. S7). Consider mutation , where M and N are nucleotides or gaps, which is taking place in the context , where x and y are context nucleotides sequences of length greater or equal to zero (hence context size ). A meaningful assessment of the degree to which considering an extra context nucleotide at each end (i.e. ) affects the predicted mutation probability is the log-odds ratio of the long context model to the short context model:Increasing context size allows capturing strong dependencies not captured at lower context sizes. For each type of mutations (substitutions, deletions, insertions) and each context size K = 3, 5, 7, we show the log-odds ratio of a mutation/context pair in a model with context K versus one with context size K − 2. Each dot corresponds to a mutation/context pair, with x-coordinate being the log-odds ratio of the mutation in those two context sizes (K versus K − 2) and the y-axis being the same measure, but for the reverse complement of that mutation/context pair. Since most context-dependencies would be expected to be strand-independent, one would expect a strong correlation between those two valuesIf the mutation does not depend on distant context nucleotides a and b, we get . However, if the presence of a and b significantly increases (resp. decreases) the odds of mutation, LOR takes on a positive (resp. negative) values.When K is relatively large (K > 5), verifying EvoLSTM’s predictions about the frequencies of mutations in specific contexts become difficult, because our test data does not have sufficiently many examples of each mutation/context pairs for accurate estimation. Here, we take advantage of the fact that most mutational processes are agnostic of strandedness, which should result in context-dependencies to be invariant to reverse complementation:
where prime indicates reverse complement. This provides us with a way to verify the internal consistency of the model, without the need for a test set. This serves as a proxy for evaluating the tool’s accuracy, because high correlations would be unlikely to arise by chance.Figure 7 shows how LOR values for mutation/context pairs relate to the LOR values for their reverse complement. For , LOR values are highly consistent between a mutation/context pair and its reverse complement. For example, , suggesting that a T-to-A substitution is approximately times more likely in the context TTTAA than in the shorter context TTA. The reverse-complement mutation/context pair displays a similarly strong bias, with . With context size K = 7, the correspondence between reverse complements becomes less strong, suggesting that EvoLSTM’s estimations may be less accurate. Nonetheless, several results are striking and reproducible across reverse complements. Surprisingly, a C-to-A substitution in the context CCCCCCC is 3 times more likely than in the shorter CCCCC context. A-to-T substitutions are 5 times less likely in the context of ATTAAAG than in the context of TTAAA. These results show that long-range context dependencies are strong for certain mutations, and that EvoLSTM is able to capture many of them, although the task becomes increasingly difficult as K increases. Similar results are observed for deletions, but long-range context-dependencies for insertions appear to be less reliably captured, probably because of their rarity in our training set.
Fig. 7.
Increasing context size allows capturing strong dependencies not captured at lower context sizes. For each type of mutations (substitutions, deletions, insertions) and each context size K = 3, 5, 7, we show the log-odds ratio of a mutation/context pair in a model with context K versus one with context size K − 2. Each dot corresponds to a mutation/context pair, with x-coordinate being the log-odds ratio of the mutation in those two context sizes (K versus K − 2) and the y-axis being the same measure, but for the reverse complement of that mutation/context pair. Since most context-dependencies would be expected to be strand-independent, one would expect a strong correlation between those two values
3.3 Context-dependencies in other mammals and in plants
To demonstrate the applicability of EvoLSTM outside of primates, we used it to learn mutation context-dependencies in other species. First, we trained a model on bats sequences (training set: 10 million K-mers; test set: 158 million K-mers), using the branch from the most recent common ancestor David’s Myotis bat (Myotis davidii) and Microbat (Myotis lucifugus) to descendant Microbat. Supplementary Figure S9 shows that EvoLSTM is able to capture the same type of dependencies as in primates, although the correlations observed are weaker. We attribute those differences to the fact that the number of mutated sites available for training is substantially lower in bats, despite the branch lengths being similar to those used primates. This is likely an artifact of the way the multiple genome alignment used for ancestral genome reconstruction was built, using human as a reference, which results in highly diverged bat sequences to sometimes be missing from the alignment. Nevertheless, the predicted context dependencies learned in primates and bats are quite similar (Supplementary Fig. S3).We then repeated the analysis on plants (Brassicaceae), using a whole-genome alignment produced by Haudry . Here, we used as ancestral sequence the most recent ancestor of Arabidopsis thaliana and Arabidopsis lyrata, reconstructed using Capsella rubella as an outgroup, and studied its evolution toward the A.lyrata genome. We excluded coding regions, resulting in 10 million examples being used for training, but only 6 million for testing. Again, EvoLSTM is able to detect strong context dependencies, especially for insertions and deletions. The correlation coefficients of predicted and observed (in the test set) mutation frequencies are lower than in human, which we attribute in part to the fact that the test set used to estimate observed mutation frequencies is >20 times smaller than in mammals. Notably, we observe an absence of CpG to TpG elevated substitution rate, due to the fact that DNA methylation is rare in plants, outside of transposable elements (Zhang ).
3.4 Running time
EvoLSTM was trained on an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz CPU and NVIDIA GeForce RTX 2080Ti GPU. Training on a set of 10 000 000 15-mer examples took on average 1374 s per epoch with the batch size of 1024 and the hidden state size of 512. The training was stopped after 132 epochs.
4 Discussion and conclusion
Context-dependent mutation rates have been known and documented for a long time, starting with the elevated CpG to TpG substitution rate (Bird, 1980; Ehrlich and Wang, 1981), and more recently in many other cases (Arndt and Hwa, 2005; Averof ; Messer and Arndt, 2007; Morton, 2003; Siepel and Haussler, 2003). While several computational models have been proposed to characterize these dependencies [e.g. hidden Markov models (Siepel and Haussler, 2003) or Bayesian networks (Chachick and Tanay, 2012; Cohn )], those are generally unable to capture complex long-range dependencies. The EvoLSTM model introduced here builds on prior work in deep learning and natural language processing to learn and reveal such dependencies.The ability to efficiently learn and model context-dependent mutation rates is of high importance for several tasks. First, substitution models are at the core of sequence alignment tasks. Most commonly used alignment algorithms use context-independent models [e.g. Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), Blast (Altschul ) and their variations (Schwartz ; Smith and Waterman, 1981)] or consider limited context for codon-based alignment (Ranwez ). This is partly due to the algorithmic challenge linked to computing maximum likelihood alignments under context-dependent substitution or indel models (Hickey and Blanchette, 2011). Yet many scoring-scheme agnostic pairwise and multiple alignment heuristics have been proposed recently, e.g. using reinforcement learning (Jafari ; Mircea ; Ramakrishnan ). This paves the way for the possible adoption of complex mutation models such as EvoLSTM. Using accurate substitution and indel models is particularly important when aligning highly diverged sequences, where using the right model enables both more accurate alignment computation and higher remote homology detection, both of which are of high importance for whole-genome alignment (Blanchette ) and ancient transposable element detection (RepeatMasker; Smit et al.http://www.repeatmasker.org). Accurate modeling of context dependencies is also important to obtain improved sequence evolution simulators [such as Evolver (Edgar )], which are instrumental in benchmarking a variety of bioinformatics tools such as whole-genome aligners (Earl ). These models are also relevant to phylogenetic inference, where the choice of mutation models has been shown to have a high impact on the accuracy of the trees inferred, especially for highly divergent species (Delsuc ). Finally, a more detailed study of the models learned by EvoLSTM is likely to reveal valuable information about mutagenesis and DNA repair. In particular, different types of cancer have been shown to be associated with different mutational signatures (Helleday ); a detailed analysis of EvoLSTM models trained on such cancer mutations may help reveal the mechanisms at play.Many potentially fruitful directions may be explored to improve EvoLSTM’s accuracy and scalability. Attention mechanisms have shown promising results in neural machine translation (Bahdanau ) for capturing larger sentence contexts and could be beneficial in our context. Other directions may include using transformers (Vaswani ), which have recently revolutionized the field of natural language processing, as well as word embedding (Mikolov ). Code optimization should also enable EvoLSTM to be trained on larger datasets; memory requirements currently limit us to using at most 10 million training examples.Several new biological applications would also be of interest. First, one may consider training an ensemble of models, to capture different types of genomic contexts (methylated versus non-methylated, or transcribed versus non-transcribed, protein-coding versus non-coding regions, etc.), which are believed to have different mutational signatures either due to different mutational or DNA repair processes, or to natural selection. Extending EvoLSTM to other types of mutational events such as transposable element insertions [which have been shown to be highly context-dependent (Beggs ; Wall )] and tandem or segmental duplication would also be worthwhile.Applying EvoLSTM to the study of the mutational processes at play outside of primates would also be valuable. One challenge in that direction is data availability. To train EvoLSTM, one needs the possibility of accurately reconstructing an ancestral sequence, which is only feasible if at least two relatively closely related species and a close outgroup are available. These genomes need to be sufficiently closely related that they can be accurately aligned to each other, but diverged enough that the number of mutational events available for training is sufficient. As such, it works best for large genomes with a densely populated phylogenetic tree.In conclusion, machine-learning advances have only begun to impact evolutionary biology and genomics, but this represents an application area of potentially high impact, due to the complexity of the mechanisms at play and a large amount of genomic data available to train sophisticated models. EvoLSTM represents the first step in that direction, enabling a detailed study of context-dependent mutational mechanisms and their integration in sequence evolution simulations, with applications in genomics, evolution, phylogenetics and potentially human health.Click here for additional data file.
Authors: Mathieu Blanchette; W James Kent; Cathy Riemer; Laura Elnitski; Arian F A Smit; Krishna M Roskin; Robert Baertsch; Kate Rosenbloom; Hiram Clawson; Eric D Green; David Haussler; Webb Miller Journal: Genome Res Date: 2004-04 Impact factor: 9.043
Authors: Klaus Greff; Rupesh K Srivastava; Jan Koutnik; Bas R Steunebrink; Jurgen Schmidhuber Journal: IEEE Trans Neural Netw Learn Syst Date: 2016-07-08 Impact factor: 10.451
Authors: Annabelle Haudry; Adrian E Platts; Emilio Vello; Douglas R Hoen; Mickael Leclercq; Robert J Williamson; Ewa Forczek; Zoé Joly-Lopez; Joshua G Steffen; Khaled M Hazzouri; Ken Dewar; John R Stinchcombe; Daniel J Schoen; Xiaowu Wang; Jeremy Schmutz; Christopher D Town; Patrick P Edger; J Chris Pires; Karen S Schumaker; David E Jarvis; Terezie Mandáková; Martin A Lysak; Erik van den Bergh; M Eric Schranz; Paul M Harrison; Alan M Moses; Thomas E Bureau; Stephen I Wright; Mathieu Blanchette Journal: Nat Genet Date: 2013-06-30 Impact factor: 38.330
Authors: Dent Earl; Ngan Nguyen; Glenn Hickey; Robert S Harris; Stephen Fitzgerald; Kathryn Beal; Igor Seledtsov; Vladimir Molodtsov; Brian J Raney; Hiram Clawson; Jaebum Kim; Carsten Kemena; Jia-Ming Chang; Ionas Erb; Alexander Poliakov; Minmei Hou; Javier Herrero; William James Kent; Victor Solovyev; Aaron E Darling; Jian Ma; Cedric Notredame; Michael Brudno; Inna Dubchak; David Haussler; Benedict Paten Journal: Genome Res Date: 2014-10-01 Impact factor: 9.043