Literature DB >> 36108047

Graph-based algorithms for Laplace transformed coalescence time distributions.

Abstract

Extracting information on the selective and demographic past of populations that is contained in samples of genome sequences requires a description of the distribution of the underlying genealogies. Using the Laplace transform, this distribution can be generated with a simple recursive procedure, regardless of model complexity. Assuming an infinite-sites mutation model, the probability of observing specific configurations of linked variants within small haplotype blocks can be recovered from the Laplace transform of the joint distribution of branch lengths. However, the repeated differentiation required to compute these probabilities has proven to be a serious computational bottleneck in earlier implementations. Here, I show that the state space diagram can be turned into a computational graph, allowing efficient evaluation of the Laplace transform by means of a graph traversal algorithm. This general algorithm can, for example, be applied to tabulate the likelihoods of mutational configurations in non-recombining blocks. This work provides a crucial speed up for existing composite likelihood approaches that rely on the joint distribution of branch lengths to fit isolation with migration models and estimate the parameters of selective sweeps. The associated software is available as an open-source Python library, agemo.

Entities: Chemical

Mesh：

Year: 2022 PMID： 36108047 PMCID： PMC9514611 DOI： 10.1371/journal.pcbi.1010532

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.779

This is a PLOS Computational Biology Software paper.

Introduction

The Laplace transform is often introduced as a formal tool to solve differential equations. Yet in mathematical fields like queuing theory, the integral transform is used to simplify the analysis of the studied probabilistic problems. This is due to two key properties of the Laplace. One, the transform of the distribution of the sum of independent random random variables becomes the product of their respective Laplace transformed marginal distributions. Two, the Laplace transform of a random variable X describing the length of an interval, has a clear probabilistic interpretation. Let a Poisson process with intensity ω mark this interval, then the Laplace transform, , is the probability of not observing any marks in the considered interval [1]. If we translate this idea to the standard coalescent framework [2-4] and let the marking process describe the arrival of mutations, then the Laplace transform of the distribution of the total branch length gives the probability of not observing any mutations along the modelled genealogies. Conversely, the probability of not observing any mutations between two coalescence events gives us the transformed distribution of the total branch length spanned by those two events. This means that, given a state-space graph that describes all possible transitions during the coalescent process, one can easily write down all associated expressions in the Laplace domain [5]. This is not the case for the time domain. The Laplace transformed description of the distribution of coalescence times has been used to tackle two major problems in population genetics: fitting explicit models of population history [5-8] as well as estimating sweep parameters [9]. Note that because the Laplace transform can be interpreted as the generating function (GF) of a continuous random variable, this method is often referred to as the GF approach. These studies have used composite likelihood-based approaches that summarize mutational information as counts of the (joint) site frequency spectrum within blocks of a fixed length [7]. Note that the GF framework allows for the inclusion of multiple recombining loci [5]. However, because of the associated computational complexity, current applications all calculate likelihoods based on the mutational information in blocks small enough for recombination to be negligible. Also note that likelihoods are not only composited over all blocks, but also over all possible subsamples of size at most 6 [8]. This is due to the fact that the GF grows superexponentially with the number of samples, but also because of the number of possible joint site frequency spectra within small blocks (see Mutation configuration probabilities). The probabilities of observing all block-wise mutational configurations are given by a multivariate Poisson distribution mixed over the joint distribution of branch lengths. The Laplace transform is well suited to compute these probabilities because this Poisson distribution can be written as a function of the Laplace transform of the branch length distribution. Specifically, the probability of observing k mutations along each of the k branch types is proportional to the k derivative with respect to the associated variable in the Laplace domain [5]. Previous implementations have always used a computer algebra system (CAS) to compute these higher-order derivatives of the Laplace transform. Such an approach suffers from an explosion in the number of terms due to the product or the Leibniz rule. This computational bottleneck has limited the usability of the framework. One way to solve the computational bottleneck is to replace the recursive description of the generating function by matrix manipulations. This has been done using phase-type theory [10]. Phase-type distributions are the result of a mixture or convolution of exponential distributions. The theory provides an alternate way of translating the state transition diagram into a description of the branch length distribution of genealogical trees. One of the major computational advantages of this description is that the matrices that define the distribution are typically preserved up to the point of evaluation [10]. Because computationally efficient algorithms for linear algebra operations already exist, phase-type theory lends itself to efficient implementations. With increasing sample size however, matrix operations become computationally unfeasible. Given the sparsity of most real-world state spaces, graph-based representations of these matrices can alleviate this issue to some extent [11]. However, currently no algorithms have been described to extract information from the joint branch length distribution. Graphs reveal the relationships between the base entities or nodes. Therefore, the more complex these relationships are, the more connected the graph will be and the more useful a graph-based representation becomes. In machine learning for example, computational graphs representing mathematical equations are used to efficiently calculate derivatives [12]. What I have implemented here takes inspiration from automatic differentiation in that we will use the recursively generated state-space graph as a computational graph to avoid both symbolic computation and repetitively evaluating the same expressions. Note that representing the GF as a graph does not remove the exponential growth with sample size of both the state space and the number of possible mutational configurations within small blocks. Here, I present the key algorithms underlying this approach as well as agemo, an open-source non-symbolic implementation of the GF framework. The paper is structured as follows. First, I will summarize the description of the GF as a large symbolic expression as defined in [5] and show how the GF can be represented more succinctly. Secondly, I lay out the graph traversal algorithm that allows efficient evaluation of the GF. I then show how this algorithm can be used to query the joint distribution of branch lengths and tabulate the probabilities of observing mutation configuration in blocks of non-recombining sequence.

Methods

Recursive description of the branch length distribution in the Laplace domain

Given a sample of uniquely labeled lineages from k distinct populations, we can represent all possible coalescent histories of that sample in a single rooted directed graph [13]. By labelling lineages by the samples they subtend, we can associate each node of the transition diagram with a vector Ω = (Ω1, …, Ω) uniquely describing that state. Here Ω represents all lineages present in deme i. As we move through the graph from the source node, representing the set of all sampled lineages, to the absorbing state or most recent common ancestor (mrca), we track the movement and coalescence of lineages. Coalescence events reduce the number of lineages in a deme, while events like a (mass) migration will move lineages from one deme to another. Fig 1 shows the state space graph for a toy example. Here, edges 1 and 3 are associated with mass migration events moving all lineages from population B to A, leaving Ω empty. Note that, in general, any state where all but Ω are empty and |Ω| = 1, can be an absorbing state.

Fig 1

Coalescent state space graph for two populations A and B with 2 and 1 unphased sample(s), respectively.

Coalescent state space graph for two populations A and B with 2 and 1 unphased sample(s), respectively.

Each node in the graph is labeled by the lineages present in each population as indicated by the square brackets (Ω: top, Ω: bottom). The demographic model assumes a single mass migration event from population B to A back in time. Vector consists of the rates of all possible events: coalescence (c0) in population A, a dummy variable δ associated with the mass migration (see Discrete events), and a dummy variable ω for each possible branch type. The Laplace transform associated with edge i, , can be retrieved as the ratio of the dotproduct of and with . Red indices on the graph indicate edges associated with an equation containing δ. Edges 1 and 3 represent a mass migration event, moving all lineages from B to A. All other transitions represent coalescence events. All paths can be described by enumerating the indices of the associated matrices: . The state space graph as described above can be generated recursively. All state transitions are conditionally independent. They only depend on the lineages in the current state and on the set of competing processes that define how one moves from one state to the next. Along each path through the graph, the time to the mrca is distributed as the sum of the inter-event times. In the Laplace domain, the sum of independent random variables is equal to the product of their respective Laplace transformed distributions. This general property of generating functions turns this graph into more than just a visual aid. The graph is now a description of how to generate the Laplace transform , of the random variable , describing all branch lengths. Given the expressions that detail the time to go from one state to the next, the distribution associated with a single path through the graph can be retrieved by multiplying all the expressions associated with the edges found along that path. The entire Laplace transform is then a simple sum of the equations describing all paths through the graph. A minimal representation of the GF thus consists of a list of the unique equations, each associated with a single edge, and a list of lists enumerating for each path b through the graph the index a of each edge/equation along that path (see Fig 1). As a result can be written as . Because of the probabilistic interpretation of the Laplace transform, the expression associated with each edge equates to the probability of observing the event of interest before any other event happening at rate ω. In the standard coalescent framework, coalescence events are exponentially distributed with rate when there are n lineages remaining. So in the Laplace domain, the distribution of the waiting time until the next coalescence is given by . To incorporate more than one process with an exponentially distributed waiting time, it suffices to observe that min(X, Y) ∼ exp(ω + ω) when both X ∼ exp(ω) and Y ∼ exp(ω). This means one can incorporate as many events with exponentially distributed waiting times as computationally possible. The probability that the event with rate λ happens before the other j − 1 competing events will still have the same general form. In this equation, we associate a unique dummy variable (ω) with each of the k branch types along which all j competing processes (λ) happen. Roman letters represent integers that count the number of branches of a particular type (o), or the number of ways a certain (coalescence or other) event can modify the current state (l). Note that in the case of multiple populations, coalescence rates are given relative to the rate in a reference population, i.e. . Let = (λ, )′ = (λ1, …, λ, …, λ, ω1, …, ω)′, denote the vector containing the rates of all j competing processes as well as the k dummy variables, and let = (0, …, l, …, 0) and = (l1, …, l, …, l, o1, …, o), then Eq 1 can be written as the ratio of * and * . Looking at our example (Fig 1), edge 0 is associated with a coalescence event of two lineages in A. There is only 1 possible way these lineages can coalesce, so the first entry of p0 is 1. q0 encodes all possible events (coalescence or mass migration) given the set of sampled lineages and a count of each of the branch types present prior to coalescence. Starting out, we have two a lineages and a single b lineage. All other equations, and their corresponding arrays can be deduced in the same way. Storing the equations in this way ensures that we can efficiently substitute in parameter values by taking the dot product with a vector representing a point in parameter space once the Laplace transform needs to be evaluated. Also, storing the equation coefficients in matrix form allows us to efficiently perform operations on the equations (see Discrete events).

Discrete events

For the general description of the GF, we have assumed that all competing processes have exponentially distributed waiting times. As outlined by [5], discrete events that only happen once can be included by treating them initially as a competing exponentially distributed process with rate δ. The Laplace transform of the joint branch length distribution , given the discrete event happened at time T, can be recovered by taking the inverse transform of the GF as described in the previous section, , divided by its associated dummy variable, δ. To see this, note that can be written as a compound distribution integrating over the probability density function of the time to the discrete event: . Dividing both sides by δ we recognize . This procedure has been used to incorporate population divergence, admixture events, and bottlenecks [5, 7], as well as selective sweeps [9]. Previous implementations have relied on a CAS to obtain an analytic solution for the inverse transform of the GF. However, as long as we limit ourselves to a single discrete event, the GF will always be a sum of the products of factors of the form . Using partial fraction expansion, we can formulate a closed-form solution, to the inverse Laplace, , with respect to δ, where T represents the time to the discrete event. Also, having stored all equation coefficients as an array, we can do so in a way that allows for efficient substitution of all parameter values. Looking at a single path along the graph, only the equations associated with edges leading up to the node representing the discrete event will contain δ. Equations associated with edges past that point can be treated as constants. The resulting inverse of this path will therefore be an expression given by Eq 2 times the unchanged equations associated with all edges positioned downstream, i.e. moving through the graph backwards in time, of the node associated with the discrete event.

Adding in new events

Currently, in addition to coalescence, two types of events have been implemented in the Python library agemo: unidirectional migration at a constant rate and population splits (forwards in time). Because of the recursive description, adding in more event types is straightforward and only requires a description of all possible state transitions due to that event given the current configuration of lineages. Note that the library does currently not accommodate events that generate cycles in the graph. This means that bi-directional migration, for example, is not supported. This does not mean that the GF framework cannot handle events that generate cycles. Using a Taylor series expansion, the GF can be decomposed into histories with 1, 2, …, m cyclic events [8].

Graph traversal algorithm

The one-to-one correspondence between the state space graph and the Laplace transform means the state space graph can be thought of as a computational graph. Evaluating the transform at a single point s in the Laplace domain equates to substituting the value into the expression associated with each edge, followed by multiplying the results along each path and adding the results across all paths. This is the general idea that will be used in the next paragraph. However, note that because of the general form of the inverse in the case of a discrete event (Eq 2), the graph needs to be modified slightly so that, ultimately, the nodes of the computational graph again represent the factors of a multiplication. This is achieved by, for each path, reducing the part of the path leading up to the node associated with the discrete event to a single edge and pairing that edge with the result of Eq 2. This is demonstrated in Fig 2A and 2B. Edges marked with red indices are associated with equations containing the dummy variable setting the rate of the discrete event and require inversion. Edge 0 and 1 are collapsed into a single edge and are now associated with as defined in Eq 2. Also note that to simplify the evaluation algorithm, equations are represented by nodes instead of edges (Fig 2C).

Fig 2

From coalescent state space to computational graph: State space graph and model identical to Fig 1.

A: Unmodified state-space graph. B: Collapsed form, grouping all parts of each path that require inverting with respect to the dummy variable associated with the discrete event. The integers in red are the indices of the equations containing δ. C: To simplify the formulated algorithms, nodes represent the equations previously associated with the edges. The graph has been annotated to demonstrate the general propagation algorithm of the evaluated equations associated with each node towards the root. is the inverse Laplace of with respect to δ.

From coalescent state space to computational graph: State space graph and model identical to Fig 1.

General algorithm

Given the computational graph (as described in Graph traversal algorithm and see Fig 2C), a general algorithm to propagate any evaluation of the equations associated with each node is given by Alg 1. The algorithm relies on the fact that, implicitly, the edges of the graph represent multiplication. The evaluated values associated with each node do not need to be single floats. They can be the coefficients of a generating function, for example, representing the probabilities of seeing particular mutation types (see Mutation configuration probabilities). In these cases, multiplication and addition operators will need to be defined for propagation. We can then rely on the commutative property to efficiently traverse the graph towards the root. Especially in the case where addition is a less costly operation than multiplication (as is the case for polynomials, see Mutation configuration probabilities), it will pay off to add the values associated with the children of a node prior to multiplication. The annotations on Fig 2C demonstrate the algorithm for our toy example. Algorithm 1: Propagate values through graph 1 function PROPAGATE; Input: A (adjacency list of graph), N (evaluated equations for each node), E (topological sorting of graph) 2 foreach parent in E do 3 children = A[parent]; 4 temp = 0; 5 if children then 6 foreach child in children do 7 temp+=N[child]; 8 end 9 if parent not root then 10 N[parent] = PRODUCT(temp, N[parent]); 11 else 12 return temp; 13 end 14 end 15 end

Mutation configuration probabilities

Assuming branches are labelled by the samples they subtend, 2 − 2 branch types can be distinguished for a sample of n lineages. Along each of these branch types, mutations might occur. Under an infinite-sites mutation model, the joint probability of seeing k mutations along each of these i branch types in short blocks of a given length can be derived using the GF. Each mutation configuration is then defined as a vector of the form where each entry is a count within the interval . is used to group all mutation configurations with more than mutations. The array representing all possible block-wise mutation configuration counts is of size . This is a more general description of the block-wise site frequency spectrum or bSFS as introduced by [7]. The bSFS only distinguishes mutations along branches with the same number of descendants. By ignoring both phase and root information, all mutation configurations are essentially instances of the folded (joint) site frequency spectrum for blocks of a fixed length. The bSFS is a tally of all observed (joint) SFS instances. Note that the absence of phase and root information can be accommodated via a simple relabeling of all branches, and therefore their associated dummy variables in the GF [8]. By labeling all samples by the population they were collected from, one can incorporate unknown phase. Removing root information is identical to the concept of folding the SFS, i.e. combining branch types on either side of the root. Quite often, the array containing the probabilities associated with all block-wise mutation configurations will be sparse. Some branch types can simply never be jointly observed along any of the possible genealogies. As a consequence, we know the probability of a configuration indicating the presence of mutations along these branch types will always be zero without having to perform any computations. agemo therefore pre-determines incompatible branch types and only reserves memory and performs computations for mutation configurations with a non-zero probability. This implies both time and significant memory savings. In the presence of phase information further savings can be made. Because of the symmetries inherent to the coalescent, some mutation configurations can be equally likely. For example, in a single population with samples (a, b, c) observing a single mutation along both branches ab and b is equiprobable to observing a single mutation along both branches bc and c. Here again, agemo will only compute the probability of a single representative of the set of all equiprobable mutation configurations. The probability of observing mutation configuration k under a specified model is proportional to a term in a (truncated) Taylor series expansion (see Eq (1) in [5] for details). Any naive approach, based on calculating all higher order derivatives using a CAS, will suffer from an explosion in the number of terms due to the Leibniz or product rule when differentiating. Generally, a CAS will fail to take into account the fact that the same partial derivatives of the functions that constitute the expression are computed multiple times. This problem has been well studied for the purpose of automatic differentiation algorithms [14-17]. In fact, it has been shown that a set of recurrence relations can be defined on the coefficients of truncated Taylor series to efficiently compute higher-order derivatives [18]. Departing from the elementary functions as represented by a computational graph, a complex Taylor series expansion can be performed without recalculating the same derivatives. Algorithm 2: Product of two truncated Taylor series [18]. 1 function SERIES_PRODUCT; Input: Two arrays A and B with same shape Output: array C of same shape as A and B 2 foreach multi-index < shape(A) do 3 sum = 0; 4 foreach multi-index ≤ multi-index 5 sum = ADD(sum, A[] * B[ − ]); 6 end 7 C[] = sum; 8 end 9 return C; Translating this to the graph traversal algorithm outlined above requires us to first obtain all coefficients for a truncated Taylor series of the equation associated with each node in our computational graph. We can then use the algorithm defining the product of two truncated Taylor series (see Alg 8 in [18] and Alg 2) to propagate the coefficients of the series associated with each node. Note that adding two truncated Taylor series simply amounts to the pairwise addition of all corresponding coefficients. To obtain the higher-order derivatives needed for the first step, we could use the recurrence relations defined in [18]. Note however, that the computational graph representation of the GF we have constructed is not at the level of the elementary functions. Because all equations associated with each of the nodes are well characterized, we can define a closed-form implementation of the derivatives with respect to the distinguished branch types. The equations all fall into one of two categories, depending on whether an inversion step was needed. Given a first-degree multivariate polynomial of the form f(x) = ∑ cx + b, non-inverted equations can be written as 1/f(x). Inverted equations on the other hand have building blocks that take on the form of e/(∏ f(x)). Using Alg 2 and Eq 4, we can come up with all partial derivatives for the inverted equations as well. With s = ∑ k and x representing the branch type vector, Note that Alg 2 contains an explicit ADD function. Care needs to be taken when summing (a subset of) the coefficients of a Taylor series: these will be both positive and negative, and as such, catastrophic cancellation might occur, leading to accuracy loss. To counteract this, I implemented the compensated summation algorithm of Ogita-Rump-Oishi [19]. The loss of precision is bounded by keeping track of small errors and adjusting the result using the error term. An alternative way of handling this would be to temporarily increase numeric precision at the crucial steps. Lastly, an advantage of using Taylor series coefficients rather than the corresponding derivatives is that the coefficients will always be smaller by a factor (∑ )!, leading to less cumulative rounding error [18].

Results and discussion

The work presented here constitutes a CAS-independent, open-source implementation of the GF approach. A general outline has been given on how the correspondence between the event state-space graph and the GF can be used to query the distribution of Laplace-transformed coalescence times efficiently. In particular, an algorithm has been laid out to calculate the probability of block-wise mutation configurations by propagating the calculation of series coefficients down the graph of ancestry states. The fact that this automation does not rely on a CAS and that it has been implemented in Python makes agemo an ideal back-end for likelihood calculations. agemo relies on numba [20] just-in-time compilation to speed up the critical parts of the code. Compiling the code using numba has a few consequences. Firstly, compilation happens the first time the code is run and the resulting compiled code is written into a file-based cache. Secondly, some numerical operations are implemented differently in numba than in numpy. In the case of summation this can lead to a loss of precision and has required the implementation of a compensated sum algorithm. A potentially faster solution would be to temporarily increase machine precision for the evaluation of particular sums. However, this is not possible using numba and would therefore require translating part of the code to C. To evaluate accuracy and performance, I calculated the bSFS for an isolation with migration (IM) model with 2 populations and 2 lineages in each population. Here two populations are descended from a common ancestral population at some time in the past, and since then unidirectional gene flow is assumed to have happened at a constant rate [21]. When discarding root and phase information, this leaves just 4 branch types. For each branch type I set k = 2, which means that the final result will contain 44 elements. This is the most complex model for which there exists a CAS-based implementation. Note that the original Mathematica implementation [5] can only calculate the bSFS for a simplified IM model with two N parameters, meaning that at least two populations must have the same size. An implementation using open source CAS Sagemath [22] takes about 75 s for a non-simplified model where each population has a unique N, while agemo evaluates a single point in parameter space in 181 ms. Increasing the sample size to 3 lineages in each population increases the number of nodes in the graph from 76 to 4449. The bSFS now contains 47 elements. Run time goes up accordingly to 134 s. The Python module contains multiple test suites testing all functions. Accuracy is assessed against an independent Sagemath implementation (Fig 3 left).

Fig 3

Accuracy: These scatter plots compare the negative logarithm of the computed probability (−logP) of observing each block-wise mutation configuration with a non-zero probability.

Accuracy: These scatter plots compare the negative logarithm of the computed probability (−logP) of observing each block-wise mutation configuration with a non-zero probability.

The output of an independent Sagemath implementation (left) or of the Monte Carlo simulation-based approach [23] (right) is plotted against the output of agemo. 1000 simulation replicates were used. Model parameters: IM model, 2 samples per population, migration from A to B (backwards in time). , , , m = 7e − 7, T = 1e7, θ = 1.152, = (2, 2, 2, 2). This IM model can be simplified to only include migration. We assume migration has been going on for an infinitely long time. Without any discrete events the graph is now maximally connected. For 2 lineages per population agemo takes 5 ms. Table 1 shows how performance scales with an increase in the number of samples per population.

Table 1

Run times migration-only model, 2 populations, unphased and unrooted branchtypes, k = 2.

samples per population	nodes in graph	non-zero entries in bSFS	size bSFS	time
2	30	112	256	5 ms
3	196	1408	16384	480 ms
4	1106	21952	16777216	52.3 s

I also benchmarked the evaluation time and accuracy (Fig 3 right) against the simulation-based approach as described in [23]. Using msprime [24], coalescent trees can be simulated under (almost) any demographic model. Without having to simulate mutations, we can calculate the probability of observing each mutation type. Given that mutations on each branch type happen independently, the probability of seeing mutation configuration (k1, k2, …, k) is given by the product of n probabilities as given by a Poisson distribution with rate θ/2 * t. Here, t is the total branch length of branch type i and θ = 4Nμ. When averaged across many replicates, the true value will be approximated. Note that particular entries of the bSFS might require fewer/more replicates to get at a good approximation than others [23, 25]. For the IM-model, with 1000 replicates one can already approach the true bSFS quite well [23] (see also Fig 3). Scaling linearly with the number of replicates, this takes about 450 ms. For 3 lineages per population run time goes up to 917 ms. With 4 lineages per population, there are 412 entries in the bSFS, making a non-sparse approach prohibitively slow. Using simulations, run times are the same for both the IM and the migration model. Note that I aimed to make the comparison as fair as possible by optimizing the code and compiling the critical parts with numba. All calculations have been done on the same MacbookPro (2.2 GHz 6-Core Intel Core i7). Also note that the bottleneck of simulating the bSFS is not the actual simulation itself but the inherent combinatorial explosion of an ever increasing number of mutation configurations with increasing sample size. I attempted to alleviate this by means of sparse matrices, but this came at a speed cost. This issue is inherent to the way the array of block-wise mutation configuration counts is defined and also applies to agemo. In part, this is solved by only calculating and storing the values associated with each unique mutation configuration that has a non-zero probability. However, computing the residual probabilities (observing more than mutations along each branch type i) in the last step currently still requires us to populate an array of size . Memory usage quickly becomes an issue here, and solving this requires a general sparse array implementation of the existing function. This suggests that the mutation configuration counts array would benefit from a dedicated sparse-data structure. Ideally, this data structure would also enable us to take advantage of the dependency structure of all higher-order derivatives. A last inherent limitation to the GF approach is that although we can include discrete events, retrieving the expression parametrized by the time to that discrete event requires us to take an inverse Laplace transform. Unfortunately, translating the mathematical description into a computational graph does not simplify this issue. As discussed, with the inclusion of discrete events the state space graph can no longer be translated into a computational graph without modification. A node must be added to the computational graph for each path leading to a discrete event, thus increasing the number of nodes and decreasing the connectivity of the graph, making a graph-based approach less efficient. agemo will therefore always do better in scenarios without discrete events (Table 1). As indicated in the Methods section, extending the GF approach to include new event types can easily be done. Because of its recursive nature, it only requires defining a function that describes the impact of the event on the extant lineages. All implemented events can then be combined to define a structured coalescent model. Note however that the current implementation only contains closed-form expressions to efficiently evaluate the GF associated with at most a single discrete event. The general algorithm outlined here should enable users to query the Laplace transform to extract, for example, topology information, the SFS or the time to the first coalescence event. These functionalities have not been explicitly implemented yet. But, they can be computed using the described graph and associated expressions. Also, agemo was designed with extensibility in mind. Future work on this library will enable a more diverse range of structured coalescent models as well as the ability to dynamically restrict the graph to those paths that are compatible with a specified topology. The work described in this paper shows significant similarities with recent progress in phase-type theory [11]. The authors present a general graph-based description of multivariate phase-type distributions and demonstrate the ability of their approach by calculating the SFS for an IM-type model sampling 7 lineages from each population. There are two main advantages of the phase-type theoretic approach. Firstly, incorporating discrete events does not require taking an inverse transform. Second, the paper contains algorithms using Gaussian elimination to translate cyclic graphs into an acyclic phase-type distribution, thus taking care of issues associated with, for example, bi-directional migration. Dealing with bi-directional migration thus no longer requires a costly matrix inversion. On the other hand, agemo allows users to take full advantage of the information present within the joint distribution of coalescence times (e.g. block-wise mutation configuration counts). Including short-range linkage information comes at a computational cost, limiting the applicability to smaller sample sizes. However, previous work has demonstrated that this approach maximizes the information contained in small samples compared to relying on the SFS [7, 9]. More importantly however, both frameworks have (independently) combined the same two basic ingredients to efficiently describe coalescent models: a recursive state-space construction and a graph representation for fast evaluation of the represented distributions. 28 Jun 2022 Dear Mr Bisschop, Thank you very much for submitting your manuscript "Graph-based algorithms for Laplace transformed coalescence time distributions." for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations. Two of the reviewers see great value in the new algorithm presented in the study. Reviewer 2's view is that there is too much focus on algorithms and not enough on biological questions; the reviewer also found the notation problematic and poorly defined, making the paper difficult to follow. Both Reviewer 1 and Reviewer 2 remark that the accuracy of the method should be checked in some way. We agree that this would be useful. Overall, given how widely the coalescent is used and how significant it can be to improve computational efficiency in inference methods for complex biological questions, the potential usefulness of this paper is clear. However, the flaws in the notation, figures and general presentation need to be addressed; please note suggestions made by all three reviewers. Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Mark M. Tanaka Associate Editor PLOS Computational Biology William Noble Deputy Editor PLOS Computational Biology *********************** A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] Two of the reviewers see great value in the new algorithm presented in the study. Reviewer 2's view is that there is too much focus on algorithms and not enough on biological questions; the reviewer also found the notation problematic and poorly defined, making the paper difficult to follow. Both Reviewer 1 and Reviewer 2 remark that the accuracy of the method should be checked in some way. We agree that this would be useful. Overall, given how widely the coalescent is used and how significant it can be to improve computational efficiency in inference methods for complex biological questions, the potential usefulness of this paper is clear. However, the flaws in the notation, figures and general presentation need to be addressed; please note suggestions made by all three reviewers. Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: In this paper Bisschop develops a more efficient algorithm to obtain the probability of observing a particular mutational configuration at a non-recombining locus by using a graphical representation of generating function computations. The material is obviously technical and the contributions of the present work are clearly explained. This is an important and difficult problem in population genetics, and the present manuscript represents a nice advance. I have a number of minor comments below. I prefer to sign my reviews, Jeffrey P. Spence Minor comments: The author should feel free to ignore this if they consider it to be too far outside the scope of the paper, but I imagine that it would considerably improve the impact of the work to show some real-world implications of the speedup offered by the present method. As an example, the present method allows for scaling to larger sample sizes. Does that buy anything in terms of inference? Does the Monte Carlo error in the simulation-based methods used in Beeravolu et al. 2018 result in worse inference compared to exact computation? These questions could be addressed using a small simulation study and one of the previously defined inference tasks. In retrospect it is obvious, but I was struck by the results in Table 1 for how difficult the generating function is to compute for even relatively small sample sizes (e.g., computing likelihoods for 5 diploids seems like it would be prohibitively expensive). I think it would be good to mention in the introduction the overall limitations of the generating function approach. While the present work certainly makes this approach more scalable, it would be good to state up front that it is extending something that applies to only extremely small sample sizes to something that can be applied to sample sizes that would still be considered small in most cases. A bit of additional background would be nice around lines 115-117 regarding how to obtain the generating function in the case of one or more discrete events. The material is present in Lohse et al. (2011) but it would make the present paper easier to follow if enough of the relevant material was included to be self-contained. It would be good to include a figure or table to show that the calculated bSFSs match those computed using previous generating function (or phase-type distribution theory), just to show that the implementation is correct and there are no issues with numerical precision. More explanation around equation 2 is required. What is g? What is \\mathcal{L} and what is its inverse? Both the alternating signs and the division by a difference in equation (2) suggests that the formula could be quite numerically unstable in certain regimes. It is mentioned a bit later in the manuscript, but how does agemo deal with the numerical instability? Can it detect if it's in a regime where higher numerical precision is needed? Throughout it's assumed that all processes are time-homogeneous. This is violated in the (common) case of time-varying population sizes. Can inhomogeneous processes be modeled or must they be represented as piecewise homogeneous and then dealt with as "discrete events"? Similarly, how does the computation scale as the number of discrete events increases? Is it feasible to allow populations sizes to change a few times while also modeling a simple IM scenario? The caption in Figure 2 refers to flipping the role of nodes and edges in a graph as "inverting". I have only heard of the "inverse" of a graph in the context of the complement of a graph -- where nodes in the inverse graph have an edge between them if and only if they do not have an edge between them in the original graph, which is quite different from the usage here. As a very minor comment -- numba allows caching computations (e.g., adding `cache=True` to any @njit calls) which may be able to prevent the cost of repeated compilation discussed on line 219-220. Typos: line 30: "respect to associated variable" --> "respect to the associated variable" Reviewer #2: This article describes a new software implementation for computing the Laplace transform associated with coalescence time distributions, under demographic models that allow for a certain amount of structure, including ongoing or discrete migration events between demes. The code is available from Github. Although the paper falls within the field of computational biology, it does not seem well suited to this journal. There is a great focus on algorithmic detail but little on any specific biological question. There is a missing 'second half' of this paper which would apply the software to an interesting biological problem and to provide further insight. This would probably require some downstream implementation which includes how to associate the Laplace transform to likelihood computations by considering placement of mutations along branches. I would also like to see more careful benchmarking - not just in runtime, but some evidence that the output is accurate. Altogether, as it stands this work might be more suited as an 'application note' at a venue such as Bioinformatics. The description of the algorithm is very hard to follow in several places. Partly this is due to mathematical imprecision. To give one example, equation (1) seems to be fundamental to the paper, but we are not told what l_i' represents, nor why the right-hand side seems to depend on index i (but not the left), nor what is m, nor what each summand is over. The phrase 'branch type' is used here and throughout the paper without a careful definition. Equation (2) is similarly afflicted - I'm afraid I could not work out what any of g, script-L, or, f_i are supposed to represent, nor what are the ranges of all of the summands. I can make a reasonable guess for some of them, but I don't think a typical reader should have to. In aggregate the accumulation of missing details like this make the paper very difficult to follow. At the very least there should be a complete, mathematically precise description of what is going on either in the main text or in a supplement, with simple worked examples. The accompanying figures do not help very much. Figure 1 seems to use completely different notation to the main text, and the graph in that figure is not explained at all. I am sorry to sound so negative. I think the software could be useful to researchers in this area, and it is clear that a lot of work has gone into getting the coding right. It is just that, from this paper, one cannot see the forest for the trees. Minor comments: - Figure referencing has not worked properly. e.g. Fig 1 is called Fig 2.1 in some places. - Some references are incomplete (lines 321, 328, 330, 373). Reviewer #3: This paper describes an efficient method for computing the distribution of branch lengths, and hence, of the probabilities of sets of mutations in a non-recombining region. This will make it possible to use the blockwise site frequency spectrum (bSFS) without using symbolic computation, allowing the method to be widely used. This is a substantial contribution, of considerable practical value, and well suited to publication in PLoS Computational Biology. Overall, the paper is well written, and makes clear the connection with the "phase type" method, which has been developed independently. However, the explanations could be yet clearer. In particular, the captions to the figures are too brief: it would be really helpful to walk the reader through the figures in the main text. In particular, the text above 2.1.1 needs to be clarified and expanded. Specific comments: - Note that the GF method also can include multiple recombining loci, though there is still the same computational difficulty as with two-way migration. - Bidirectional process can be handled using a series expansion - which may be worth noting 119 -Define the term CAS Eq 2 - Was the inverse Laplace Transform defined? Fig 2 caption should refer to Fig 1 not Fig 2.2; similarly in line 150 there is a refrence to Fig 2.2C - should be Fig 2C 166 - The explanation of the bSFS is hard to follow. 236 What is a 3Ne model?? 240 - specify "one-way migration" 253 - delete "a" 289 "But can be" - word missing? 299 - There is a statement here that suggests that phase-type theory avoids computational difficulties with bi-directional migration. However, I suspect that it requires matrix inversion, which is expensive. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Jeffrey P. Spence Reviewer #2: No Reviewer #3: Yes: Nick Barton Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. 11 Aug 2022 Submitted filename: agemo_response_to_reviewers.pdf Click here for additional data file. 1 Sep 2022 Dear Mr Bisschop, We are pleased to inform you that your manuscript 'Graph-based algorithms for Laplace transformed coalescence time distributions.' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Mark M. Tanaka Academic Editor PLOS Computational Biology William Noble Section Editor PLOS Computational Biology *********************************************************** Reviewer 2 has some remaining issues to for you to consider if you wish. Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The author has satisfactorily addressed all of my previous concerns. Reviewer #2: The author has addressed nearly all of my comments. The mathematical descriptions are now clearer, and the notation is carefully spelled out and internally consistent. There is a convincing experiment to verify the accuracy of output. The software will be useful and offer some noticeable gains in efficiency in the generating function approach to computing sample configuration probabilities. There are still some obvious limitations, such as in scalability with sample size and the current lack of the possibility of cycles in the underlying graph, but these are clearly flagged and discussed. I would still like to see an enquiry into a question that is biologically driven. The author notes their intention to report a real world application elsewhere, but there is room for something here without having to go into all the details about data pre-processing. This paper is still mainly focused on algorithms, data structures, and computational issues. These are all important and relevant, but for papers appearing in a journal such as PLOS Computational Biology my feeling is that it should go further than this. What have we learned about, say, historical migration events that we didn't know before? (Okay, I'll leave it there. I can see that the highly esteemed other reviewers do not seem so exercised about this, and I won't make it a hill to die on.) Minor comments: l6 - duplicated 'random random' l129-130 - is this usual notation for dot product? I would expect to see p.r or . Fig 3 caption - 7e-7 formatting looks a bit misleading. Better to write something in full, e.g. $7 \\times 10^{-7}$. Reviewer #3: This revision deals well with my suggestions for clarification. Also, I think that Bisschop's argument for publishing this as a separate paper, rather than bundling it with the gimble paper, is convincing. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Jeffrey P. Spence Reviewer #2: No Reviewer #3: No 10 Sep 2022 PCOMPBIOL-D-22-00768R1 Graph-based algorithms for Laplace transformed coalescence time distributions. Dear Dr Bisschop, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Zsofia Freund PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

13 in total

1. Distinguishing migration from isolation: a Markov chain Monte Carlo approach.

Authors: R Nielsen; J Wakeley
Journal: Genetics Date: 2001-06 Impact factor: 4.562

2. Phase-type distributions in population genetics.

Authors: Asger Hobolth; Arno Siri-Jégousse; Mogens Bladt
Journal: Theor Popul Biol Date: 2019-02-26 Impact factor: 1.570

3. TESTING THE CONSTANT-RATE NEUTRAL ALLELE MODEL WITH PROTEIN SEQUENCE DATA.

Authors: Richard R Hudson
Journal: Evolution Date: 1983-01 Impact factor: 3.694

4. Evolutionary relationship of DNA sequences in finite populations.

Authors: F Tajima
Journal: Genetics Date: 1983-10 Impact factor: 4.562

5. A general method for calculating likelihoods under the coalescent process.

Authors: K Lohse; R J Harrison; N H Barton
Journal: Genetics Date: 2011-09-06 Impact factor: 4.562

6. Efficient ancestry and mutation simulation with msprime 1.0.

Authors: Franz Baumdicker; Gertjan Bisschop; Daniel Goldstein; Graham Gower; Aaron P Ragsdale; Georgia Tsambos; Sha Zhu; Bjarki Eldon; E Castedo Ellerman; Jared G Galloway; Ariella L Gladstein; Gregor Gorjanc; Bing Guo; Ben Jeffery; Warren W Kretzschumar; Konrad Lohse; Michael Matschiner; Dominic Nelson; Nathaniel S Pope; Consuelo D Quinto-Cortés; Murillo F Rodrigues; Kumar Saunack; Thibaut Sellinger; Kevin Thornton; Hugo van Kemenade; Anthony W Wohns; Yan Wong; Simon Gravel; Andrew D Kern; Jere Koskela; Peter L Ralph; Jerome Kelleher
Journal: Genetics Date: 2022-03-03 Impact factor: 4.402

7. Testing models of speciation from genome sequences: divergence and asymmetric admixture in Island South-East Asian Sus species during the Plio-Pleistocene climatic fluctuations.

Authors: Laurent A F Frantz; Ole Madsen; Hendrik-Jan Megens; Martien A M Groenen; Konrad Lohse
Journal: Mol Ecol Date: 2014-11-05 Impact factor: 6.185