Literature DB >> 34516547

An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes.

Thomas K F Wong¹, Teng Li^1,2, Louis Ranjard^1,3, Steven H Wu⁴, Jeet Sukumaran⁵, Allen G Rodrigo^1,2.

Abstract

A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual is identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide-polymorphisms (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian inference model to estimate the phylogeny of the haplotypes and their relative abundances, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and relative abundances of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.

Entities: Chemical

Mesh：

Substances：
DNA, Mitochondrial

Year: 2021 PMID： 34516547 PMCID： PMC8460051 DOI： 10.1371/journal.pcbi.1008949

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

This is a PLOS Computational Biology Methods paper.

Introduction

Molecular phylogenetic reconstruction is the mainstay of modern evolutionary biology [1, 2]. To use a particularly relevant and recent example, tracing the spread of the COVID-19 pandemic, and understanding the emergence of new variants, has required the use of reliably constructed phylogenies of SARS-CoV-2 genomes [3]. DNA sequencing is used to produce the data from which such valuable phylogenies can be inferred. However, because modern sequencing technologies can produce several gigabases of nucleotide sequences in a single day, one of the challenges for the molecular phylogeneticist is to deal with this quantity of data in a timely manner while still reconstructing accurate phylogenies. To this end, phylogeneticists have developed rapid alignment and tree reconstruction algorithms [4, 5], using pre-processed and curated sequences. Pre-processing and sequence curation can be laborious, but are necessary tasks because a great deal of sequence data are generated using next generation short-read sequencing technologies. Such sequences are often barcoded using unique DNA identifier tags, and then collectively pooled and sequenced in a single run. The unique barcode allows sequences belonging to different samples to be separated computationally, before additional error-correction and subsequent down-stream analyses are performed. Quite apart from the costs incurred by data pre-processing and curation, the preparation of barcoded sequence libraries is itself costly. More importantly, there are some samples where barcoding is impractical. For instances, rapidly evolving viruses (e.g., Human Immunodeficiency Virus (HIV) and Hepatitis C Virus (HCV)) typically exist as a collection of genetically diverse genomes within an infected individual. In some instances, hosts are infected with different strains of the same virus (i.e., coinfection). To sequence one or more target genes from a collection of these viruses using barcoding, one would need to isolate individual viral genomes before library preparation. This can be done, but again, is time-consuming and laborious. In this paper, we describe a novel approach, AFPhyloMix (Assembly-Free Phylogenetics for Mixtures), to reconstruct the phylogeny of non-barcoded amplicons in a mixture that has been sequenced using short-read sequencing. More precisely, the input sample consists of mixtures of anonymous (i.e., non-barcoded) amplicons of a targeted locus, obtained from multiple individuals, each amplicon being longer than the read-length of sequenced fragments. We assume that all short-reads can be aligned to the same reference sequence. We have developed our method to work on samples drawn from a population of closely related individuals (i.e., from individuals within a species). In any mixture of individuals drawn from such populations, some amplicons may be identical to others. We refer to a group of identical amplicons as a haplotype [6]. The mixture, therefore, contains a collection of haplotypes, each haplotype being represented by a relative abundance between 0 and 1 (non-inclusive). AFPhyloMix estimates the phylogeny of the haplotypes and their relative abundances, without the need for reconstructing the haplotype sequences. To validate our approach, we evaluate the efficiency of the method on simulated and real data, and we discuss the conditions under which the method performs well, and its limitations.

Methods

Overview

The algorithm, AFPhyloMix, proceeds as follows. Given a mixture of n haplotypes, with relative abundances (r1, r2, …, r), short-read sequencing generates k sequences that can be aligned to a reference sequence. AFPhyloMix then identifies the potential sites with single-nucleotide-polymorphisms (SNPs) from this alignment of reads. Under an infinite-sites model of evolution [7], where each mutation occurs at a new site and any given SNP can have a maximum of two nucleotides, we distinguish between the frequency of a given nucleotide at a given SNP, and the number of SNPs with the same frequency distribution of nucleotides. We refer to these two types of frequencies as the SNP ratio and the SNP frequency, respectively. For example, assume that in an alignment with three SNPs, site i has nucleotides A and G with ratios 0.75 and 0.25, respectively; site j has nucleotides C and T, with ratios 0.6 and 0.4, respectively; and site k has nucleotides G and T with ratios 0.75 and 0.25, respectively. We will adopt the convention of using the smaller nucleotide ratios when identifying the value of a SNP ratio. Therefore, the SNP ratio for site i and k is 0.25 and site j is 0.4. Sites i and k have the same frequency distribution of nucleotides, even though they may have different constituent nucleotides. In this case, the SNP frequency for the nucleotide distribution instantiated in sites i and k is 0.67 or 2/3. (We note that the SNP ratios and frequencies are related to the Site Frequency Spectrum [8]; however, because coverage of short-reads vary across the alignment, nucleotide ratios at each SNP vary as a continuous rational variable rather than as an integer). In AFPhyloMix, a likelihood function computes the probability of observing the distributions of SNP ratios (data, D) along the alignment given their expected distributions, which is itself conditional on a specified tree topology, haplotype relative abundances, and sequencing error, assuming an infinite-sites model of evolution. A Bayesian approach is used to compute the posterior probability P(R, T, e|D) of a set of parameters: the relative abundance of haplotypes (R), the tree topology (T), and the sequencing error (e), given the observed pattern of the data (D), as follows. L(D|R, T, e) is the likelihood of the observed pattern of SNPs given the relative abundances of haplotypes, tree topology, and the sequencing error. P(R), P(T), and P(e) are the prior probabilities of the relative abundances of haplotypes, tree topology, and the sequencing error, respectively. A Bayesian Metropolis-coupled Markov chain Monte Carlo (MCMCMC) inference engine is implemented, to deliver the joint posterior probability distribution of tree topologies and haplotype relative abundances. After the Bayesian computation, the edge lengths on the tree are computed according to the SNP frequencies and the tree topology with the highest posterior probabilities. To illustrate this approach, consider Fig 1 which shows the relationship between observed and expected SNP ratios and frequencies, along a specified tree. Given a 5-tip tree with tip relative abundances (i.e. the relative abundances of the corresponding haplotypes represented by the tips) shown in Fig 1A, a mutation x ∈ {A, C, G, T} that occurs on the edge XA over evolutionary time would lead to a different nucleotide on a SNP site in haplotype A relative to other haplotypes. The expected SNP ratio of any mutation along the edge XA would be 0.075, which is the relative abundance of tip A. The number of SNPs with this mutational pattern—the SNP frequency—would depend on the length of the edge XA. Fig 1B shows the expected SNP ratio and the expected SNP frequencies (i.e. the expected ratio of the occurrences for the mutations on different edges of the tree). For example, the high expected SNP frequency of sites with SNP ratio of 0.485 is due to the mutation on the long edge XZ.

Fig 1

SNP frequencies vs SNP ratios.

(A) An example of a 5-tip tree with expected tip relative abundances (i.e. haplotype relative abundances). (B) The expected SNP frequencies against the expected SNP ratio of the occurrences for the mutations on different edges of the tree. (C) The observed distribution of SNP ratios from the short read sequences generated from five simulated genomic sequences with expected relative abundances based on the tree in (A). To obtain observed frequencies and relative abundances, paired-end error-free reads of length 150bp with total coverage of 15,000x were simulated by ART [9] from five 50k-long haplotype sequences generated by INDELible [10] with JC model [11].

SNP frequencies vs SNP ratios.

Consideration of the connection between two SNP sites

In Fig 1B, every SNP site is treated independently. The fact that reads cover multiple SNP sites means that there is an association amongst the observed ratios for multiple SNPs; we found that modelling this association improved the accuracy of the estimation on the tree topology and the tip relative abundances. Where there is no sequencing error, as illustrated in Fig 2A, two SNP sites can generate three different combinations (patterns of nucleotides) on the nucleotide sequences if the mutations of the two SNP sites occur on different edges of a tree, while there are only two patterns of nucleotides if their mutations happen on the same edge of the tree. For example, two SNP sites with one mutation on the edge ZE and another on the edge XY, as shown in Fig 2A, lead to three different patterns of nucleotides on these two SNP sites with expected ratios 0.125, 0.41, and 0.465. Different locations of the mutations on the tree can result in different sets of expected ratios (Fig 2B). Similar to the compatibility problem of two sets of binary characters [12], since the infinite site model only allows one mutation along the tree for every SNP site, two SNP sites can create either two or three, but not four patterns of nucleotides. Moreover, how often these patterns of nucleotides happen depends on the product of the lengths of the edges on which the two SNP sites occur.

Fig 2

Consideration of connections between two SNP sites.

Consideration of connections between two SNP sites.

(A) Under the infinite site model, by allowing one mutation along the tree for every SNP site, two SNP sites may make three different patterns of nucleotides. (B) Different locations of the mutations on the tree can result in different set of expected ratios. Considering the possible three patterns of nucleotides (with ratios f1, f2, f3 = 1 − f1 − f2 where f1 ≤ f2 ≤ f3) due to the mutations of two SNP sites on different edges of the tree, Fig 3A shows the distribution of all possible pairs of f1 and f2 according to the tree in Fig 2A. The size of the circle represents the expected probability of occurrence. For example, the largest circle at (f1 = 0.125, f2 = 0.39) refers to the patterns of nucleotides created by two SNP sites with mutations on the edges XZ and ZE, and the probability is relatively high because of their long edge lengths.

Fig 3

Distribution of the first and the second ratios of three patterns of nucleotides created by two SNP sites.

Considering the tree in Fig 2A, two SNP sites having mutations on different pair of edges can lead to three patterns of nucleotides with ratios f1, f2, f3 = 1 − f1 − f2, where f1 ≤ f2 ≤ f3. (A) The distribution of all possible pairs of f1 and f2. The size of the circle represents the expected chance of occurrence. (B) The distribution of pairs of observed values of f1 and f2 obtained from the short read sequences generated from five simulated genomic sequences referenced in Fig 1C. Every pair of SNP sites close enough to be covered by the same short reads were checked. Considering the possible three patterns of nucleotides, the pairs of observed values of f1 and f2 were obtained based on the set of short reads covering the pairs of SNP sites. The observed distribution is consistent with the expected distribution plotted in (A).

Distribution of the first and the second ratios of three patterns of nucleotides created by two SNP sites.

Estimation of tree topology and tip relative abundances

Assume that there are n haplotypes. If there is no sequencing error, there should be only two types of nucleotides on each SNP site; for convenience, we will refer to the two allowable states at a given SNP location canonically as ‘0’ and ‘1’. Considering two sites i and j, let s(ij) be the nucleotides of the same read covering the sites i and j. Also let the states of the root of the tree be r and r, where r, r ∈ {0, 1}, on the site i and the site j, respectively. Given a n-tip rooted tree topology T, a set of n tip relative abundances R, and the edges of T: {e1, ⋯, e2}, let , where p, q ∈ {0, 1} and u, v ∈ {ε, e1, ⋯, e2} (the empty string ε represents no mutation on the site), be the expected probability of the same sequence having nucleotide p on the site i and nucleotide q on the site j given the mutations of the site i and j are on the edge u and the edge v, and the states of the root of the tree are r and r. For example, for the topology and tip relative abundances in Fig 2, when u = ZE, v = XY, r = r = 0, , , , and . Ideally, if there is no sequencing error, the number of combinations between the nucleotides of the reads covering the sites i and j should either be one (if u = v = ε), two (for example, when u = v ≠ ε, or v ≠ u = ε, or u ≠ v = ε), or three. However, in a data set with sequencing error, the number of combinations observed may well be more (up to a maximum of 16). We will compute the expected probabilities taking account of sequencing errors. With the sequencing errors, each SNP site may contain 4 nucleotide types, say 0, 1, 2, and 3. Without loss of generality, we assume 0 and 1 are the major characters, while 2 and 3 are the characters created by the sequencing errors. Given a tree topology T, a set of tip relative abundances R, and a sequencing error rate e, define P(s(ij) = pq|u, v, r, r), where p, q ∈ {0, 1, 2, 3}, as the expected probability of observing the same read having nucleotide p on the site i and nucleotide q on the site j, when the mutations of the sites i and j are on the edges u and v, and the states of the root of the tree are r and r, respectively. The function ψ(p′ → p) where p′ ∈ {0, 1} and p ∈ {0, 1, 2, 3} denotes the probability of observing a nucleotide p on the read when the actual nucleotide should be p′. Again, consider two sites i and j, and let n(p, q), where p, q ∈ {0, 1, 2, 3} be the number of reads observed having nucleotide p on site i and nucleotide q on site j in the data set. Let A = {n(p, q)|p, q ∈ {0, 1, 2, 3}} be the observed combinations of characters on the reads covering the site i and the site j. Given a n-tip rooted tree topology T, a set of relative abundances of n tips R, and a sequencing error rate e, define L(A|u, v, r, r, T, R, e) as the likelihood function of the alignment with sites i and j, where i ≠ j, provided that the mutations of the SNP sites i and j are on the edges u and v, and the states of the root of the tree on the SNP sites i and j are r and r, respectively. We assume that the ratios of the reads having different patterns of nucleotides for the sites i and j follow the multinomial distribution. Practically, when performing an analysis on the alignment of the reads, for each site of the alignment, AFPhyloMix assigns the nucleotide supported by the largest number of reads to 0, the second largest to 1, the third largest to 2, and the one with the least supports to 3. There are three reasons to observe two or more nucleotides at a site: A site truly has a single mutational event only in its evolutionary history, but sequencing errors and other technical artifacts can introduce more than two nucleotides in the alignment of short-reads; A site is truly invariable over the evolutionary tree, but sequencing errors/artifacts introduce more than a single observed nucleotide in the alignment at that site; A site truly has experienced multiple mutational events in its events in its evolutionary history (and thus, violates the assumption of an infinite sites model). We will address the second and third of these cases later, but if a site truly has only a single mutational event in its history, then nucleotides 0 and 1 should dominate, while nucleotides 2 and 3 will be due to sequencing errors. Let the n SNP sites be {S1, S2, S3, S4, ⋯, S, S}. One approach is to consider the patterns observed with pairs of adjacent SNP sites [i.e., if n is even, then consider (S1, S2)(S3, S4), ⋯, (S, S)]. This approach allows, at most, only n/2 pairs of SNP sites to be considered. On the other hand, if we nominate a reference site, and pair each non-reference site with the reference, we can use ≈ n pairs of SNP sites. We have used this approach, as follows: the whole alignment is partitioned into m non-overlapping windows (W1, W2, ⋯, W, ⋯, W) of size d (d was set to 100 in our implementation). In each window W a reference position c ∈ W is selected. Let the average of the read coverage along the alignment be cov. The reference site is selected arbitrarily among those sites covered by at least max{50, r * cov} reads (where r was set to 0.2). Thus, the selected reference sites will have reasonably high levels of support. For every site i inside the window W, its association with the reference position c is considered. This approach allows us to consider n − m pairs of SNP sites. If such reference positions cannot be found (because the coverage of the whole window is not high enough), AFPhyloMix selects a reference covered by the highest number of reads in the window. The likelihood of the whole alignment (A) given the tree topology (T), the tip relative abundances (R) and the sequencing error rate (e) is: The patterns obtained from the pair of site i and the reference c depends on the tree topology, the tip relative abundances, the sequencing error rate, the root states, and the edges on which the mutations occur for both the site i and the reference c. Amongst all sites paired with the reference site, pattern ratios are independent after conditioning on the tree topology, the tip relative abundances, the sequencing error rate, the root state and the edge on which the mutation occurs for the reference c. Therefore, the likelihood of the window (W) given the tree topology (T), the tip relative abundances (R), the sequencing error (e), the root state () and the edge on which the mutation occurs (v) for the reference c is: Using this approach, the likelihood essentially integrates across all observed patterns, thus negating the need to identify observed nucleotide ratios that are the consequence of sequencing errors. By substituting from Eq 5 into Eq 4, and setting the probability of the character r on position i (Pr(r|T, R, e)) equal to the observed proportion of the nucleotide r in the alignment, the likelihood of the alignment (A) given the tree topology (T), the tip relative abundances (R) and the sequencing error rate (e) becomes: Pr(u|T, R, e), which is the probability of having a mutation on the edge u given T, F, and e, can be approximated as the length of the edge u. The following approximation has fewer parameters and allows MCMC to converge at a faster rate: For each site, this equation considers all possible edges u (and v) on which the mutation occurs and assumes that the edge the mutation occurs on should have the highest likelihood value. This equation does not contain the parameters of Pr(u|T, R, e), and is defined only by the tree topology (T), the tip relative abundances (R) and the error rate (e). This equation allows MCMC to converge much faster. After the estimation on the tree topology, tip frequencies and error rate, the edge lengths will be computed in the next step. The details will be shown in the later subsection “Estimation on edge lengths”. We now define P(T, R, e|A) as the posterior probability of the tree topology T, the tip relative abundances R, and the sequencing error e given the alignment A.

Identifying invariable sites

As noted above, a truly invariable site may appear to be a SNP because of sequencing errors. To speed up the computation, AFPhyloMix skips all the sites which are identified as invariable sites. Let the maximum value of the sequencing error rate be e (in our application of AFPhyloMix, e is set to 0.01). A site is regarded as an invariable site if there exists only one nucleotide such that the percentage of the reads having that nucleotide covering the site is higher than e. Let I be the set of these sites which are identified as invariable sites; then

Markov chain Monte Carlo implementation

AFPhyloMix adopts a Bayesian model of inference to obtain an estimate of the joint posterior probability of phylogenies, haplotype relative abundances, and sequencing error, using Markov chain Monte Carlo (MCMC) sampling [13, 14]. The MCMC approach has been extensively used in phylogenetic analysis [15, 16], but sampling chains may not mix as well as they should. To overcome this, the Metropolis-coupled Markov chain Monte Carlo (MCMCMC) approach was developed by [17]. Although MCMCMC requires multiple parallel sampling chains to be run simultaneously (and thus, has demanding computational overheads), the approach has been demonstrated to improve mixing and convergence to a stationary posterior probability distribution [18]. We implemented MCMCMC in AFPhyloMix, and the details of implementation are listed in the Supporting information. AFPhyloMix reports the posterior probabilities, the tree topology, and the tip relative abundances along the cold chain every, by default, 100 rounds. Because the tip relative abundances may change along the chain, the consensus of the trees is not easy to compute. For the sake of simplicity, AFPhyloMix also reports the tree topology and the tip relative abundances which gives the highest value among all the resulting posterior probabilities along with the computation. The performance of AFPhyloMix was then evaluated based on this tree topology and the tip relative abundances.

Estimation on edge lengths

After AFPhyloMix estimates the topology T, the tip relative abundances R, and the sequencing error e for the mixture of short read sequences from n haplotypes, AFPhyloMix calculates the edge lengths in T (with edges e1, ⋯, e2) by the following method. Let length(u) be the length of edge e. In AFPhyloMix, length(u) is approximated as the probability of having mutation on edge u along the tree. Note that length(u = 0) is the probability of no mutation along the tree (i.e. ∑0≤ length(u) = 1). where Pr(edge u at site i) is the probability of a mutation on edge u of the tree at site i. As described above, the whole genome is partitioned into non-overlapping windows (W1, W2, ⋯, W, ⋯) of size d (d was set to 100), and inside each window W a reference site c ∈ W is selected. For every site i (say it is inside the window W), we consider the connection between the site i and the reference position c. For i ≠ c, let , where p, q ∈ {0, 1, 2, 3}, be the number of reads observed having nucleotide p at site i and nucleotide q at site c on the same read. Define as the observed combinations of characters on the same reads at the sites i and c. Similarly, for i = c, let , where q ∈ {0, 1, 2, 3} be the number of reads observed having nucleotide q at site c, let be the observed pattern of characters on the reads at the site c, and let , , be the estimated values of topology, tip relative abundances, and sequencing error, respectively. To calculate Pr(edge u at site i), two cases are considered: Case 1: i ≠ c (i.e. the site i is not the reference site of the window); Case 2: i = c (i.e. the site i is exactly the reference site of the window). For case 1, For case 2, Pr(r) and are the probabilities of the root states r and , which are set to the observed proportions of r and in A, respectively. is the likelihood value of the observed combinations of characters on the same reads at the sites i and c given the estimated topology , the estimated tip relative abundances , the estimated sequencing error , the mutations on edges u and v, and the root states r and for the sites i and c, respectively, while is the likelihood value of the observed pattern of characters on the reads at the site c given , , , the mutation on edge u and the root state for the site c. The computation of is listed in Eq 3, while the calculation of can be done similarly. Whereas it is theoretically possible to simultaneously infer edge lengths and topologies, in preliminary simulations, we have found that this does not deliver accurate results. We think that this is because, in the absence of sequence information at the tips, inferred mutations will naturally favour long branches, thus lowering the probability of seeing short branches. Instead, we adopt a two-step approach, in which the first step focuses on optimization of the tree topology, the tip relative abundances, and the sequencing error rate, while the second step computes the values of the edge lengths. Because of the fewer number of parameters needed to be estimated in the first step, we see that the MCMC chain converges at a faster rate, especially as the number of haplotypes increases.

Removal of the sites that violate the infinite sites model

As noted previously, AFPhyloMix uses an infinite sites model of evolution, and thus assumes that every site has no or only one mutation in its evolutionary history. Before AFPhyloMix proceeds to estimate the topology and the tip relative abundances, AFPhyloMix examines the read alignments and attempts to identify SNP sites which have more than one mutation. The procedure to identify these sites is as follows: If the SNP site has more than two different nucleotides supported by at least e (i.e. 1%) of the reads, then the SNP site is ignored. We consider SNP sites with only two different nucleotides, with each nucleotide supported by at least 1% of the reads. If there is no back/hidden mutation, two SNP positions will give at most three combinations (as mentioned before). Based on this observation, the following simple method has been developed to identify the sites which are likely to have back/hidden mutations: For a SNP site i, we check all other SNP sites j such that there are sufficient number of reads covering both sites i and j. If there are at least d sites (i.e. j1, j2, ⋯, j) which separately have more than three combinations supported by at least 1% of the reads when paired with site i, then site i is regarded having back/hidden mutations and it is discarded. We have tested for different values of d on simulated data and found that when d = 3 the method performed well in terms of accuracy.

Experiment and results

Both simulated and real data were used to evaluate AFPhyloMix.

Simulated data

To evaluate AFPhyloMix using simulated data sets, six hundred data sets were simulated and each data set was a mixture of various numbers (n) of haplotypes, where n ∈ {5, 7, 9, 11, 13, 15} (100 data sets each). In each data set, the n haplotypes were mixed in different random concentrations (with the difference between any two haplotypes ≥ 0.0033). Paired-end reads of length 150bp with total coverage of 15,000x (where the coverage of the haplotype with the least concentration is 250x) were simulated by ART with the Illumina HiSeq 2500 error model—HS25, from n 50k-long haplotype sequences, which were generated by INDELible using JC model from a n-tip tree with 0.03 root-to-tip distance randomly created by Evolver [19] from PAML package [20]. The root sequence reported by INDELible [10] was used as the reference sequence. After using BWA [21] to align the reads against the reference sequence, we ran AFPhyloMix under the default settings on the read alignments to estimate the tree topology and the tip relative abundances (i.e., haplotype concentrations) for the mixture. By default, AFPhyloMix runs 8 MCMC processes in parallel: one cold chain and seven hot chains, and each chain runs 650K (for mixture with 5 sub-samples) to 1150K (for 15 sub-samples) iterations, depending on the number of edges in the resulting tree. Fig 4 shows a result from AFPhyloMix on a simulated data set with a mixture of 15 sub-samples. Fig 4A displays the posterior probabilities along the cold chain of MCMCMC process while running the AFPhyloMix. The posterior probabilities increased rapidly during the burn-in period and then appeared to stabilise to an equilibrium distribution. Fig 4B shows the distribution of tip relative abundances along the cold chain of MCMCMC process from AFPhyloMix. The distribution was found to match the expected tip relative abundances (marked with red dots) in the real tree used to simulate the data set. Fig 4C is the resulting tree from AFPhyloMix with tip relative abundances. This is the tree with the highest posterior probability along the cold chain of the MCMCMC process. The tree is topologically congruent with the real tree (Fig 4D) and the tip relative abundances are also very similar with the real sub-sample concentrations.

Fig 4

The result of AFPhyloMix on a simulated data set with a mixture consisting of 15 sub-samples.

The result of AFPhyloMix on a simulated data set with a mixture consisting of 15 sub-samples.

AFPhyloMix was run under the default settings on the read alignments of a simulated mixture of 15 sub-samples in order to estimate the tree topology and the tip relative abundances. (A) The posterior probabilities along the cold chain of MCMCMC process while running the AFPhyloMix, which increased rapidly during the burn-in period and then frustrated steadily over a range of values and was reaching a convergence. (B) The distribution of tip relative abundances (i.e. sub-sample concentration) along the cold chain of MCMCMC process from AFPhyloMix. The actual sub-sample concentrations are marked by the red dots. (C) The tree with tip relative abundances having highest posterior probability along the computation reported by AFPhyloMix. (D) The real tree with the actual sample concentration used for simulating the data set. Fig 5A shows the summary on the accuracy of AFPhyloMix running on the simulated data sets. Among the data sets with the same number of haplotypes, the figure shows the percentage of data sets with correct estimation on both topologies and tip relative abundances. An estimated result is regarded as correct if the tips on an estimated tree can be paired up with the tips on the actual tree satisfying the following conditions: (1) the difference between the estimated tip relative abundances and the corresponding true tip relative abundances is less than 0.01; and (2) their unrooted topologies are identical. Details of the method on pairing between the tips on an estimated tree and those on the actual tree is listed in Supporting information. From the figure, AFPhyloMix achieved at least 80% accuracy for the mixtures with up to 15 haplotypes.

Fig 5

Performance of AFPhyloMix on simulated data.

Performance of AFPhyloMix on simulated data.

(A) Accuracy of AFPhyloMix—The percentage of data sets (out of 100), for different number of haplotypes, with correct estimated topologies and predicted tip relative abundances. (B) Root-mean-square differences between the actual- and the predicted tip relative abundances. (C) Root-mean-square differences between the actual and the predicated edge lengths. The accuracy decreased gradually while the variation between the estimated and the actual haplotype relative abundances and the edge lengths increased with the number of tips. To further examine the deviation between the predicted tip relative abundances and the actual sample concentrations, and that between the estimated and the actual edge lengths, for each data set with correct estimation, we computed the root-mean-square value of the differences between the estimated and the actual values. Fig 5B and 5C show the summary on the distributions of the root-mean-square of the differences for the tip frequencies and the edge lengths. The values rise gradually as the number of haplotypes increases. For tip relative abundances, the root-mean-square values were all below 0.0012, while over 75% of the cases were below 0.0008. For edge lengths, the root-mean-square values were all below 0.005. The pooled 95% highest posterior density of the MCMCMC estimate of error rates on these simulated data sets was between 0.00199 and 0.00204 with an average 0.00202. ART [9] reports the errors for the reads simulated by the error model HS25; in our simulations, we obtained a simulated error rate of 0.00190. AFPhyloMix relies on the alignment of reads against a root sequence to obtain the marginal posterior probability distribution of error rates. We expect, therefore, that the slightly higher value (≈ +0.00012) of the MCMCMC estimate of error rates on these simulated data sets, compared with the simulated error rate reported by ART, is likely due to imperfect alignment between simulated reads and the root sequences. The experiments were conducted on machines with 4 x 16-core Intel Xeon 2.10GHz CPUs. The running times of AFPhyloMix were listed in the Table 1.

Table 1

Average running time of AFPhyloMix on the simulated data sets.

Number of tips	Running time
5	4.0 hours
7	14.5 hours
9	22.3 hours
11	39.8 hours
13	76.8 hours
15	112.6 hours

Real data

Apart from the simulated data sets, mixtures of reads from kangaroo haplotypes were also used to evaluate the performance of AFPhyloMix.

DNA material collection and extraction

In Australia, kangaroo are not farmed, but are culled annually to control population numbers. Culled animals are butchered by certified butchers, and the meat is sold in supermarkets. As the exact provenance of kangaroo meat sold at supermarkets is unknown, meat (produced by Macro Meats) was purchased at local supermarkets in the Australian Capital Territory (Coles Supermarkets Australia Pty Ltd and Woolworths Ltd) on 10 separate occasions over one year (from 29th May 2018 to 26th July 2019), to avoid sampling the same animal. According to statistics from the Department of the Environment and Energy of Australia, they should be of genus Macropus, and most likely eastern grey kangaroo, Macropus giganteus, as it is the largest population with the highest quota in New South Wales, Australia [22]. Approximately 30 mg of meat was excised and homogenized for each individual. Genomic DNA was then extracted using the DNeasy Blood & Tissue Kits (Qiagen) following the manufacturer’s protocol.

Amplification and sequencing

Long PCR amplification of complete kangaroo mitochondrial genomes was carried out using the pair of primers (Lt12cons: 5’- GGGATTAGATACCCCACTAT -3’, HtPhe: 5’-CCATCTAAGCATTTTCAGT -3’), which was selected from a previous study [23]. PCR reactions was performed using Takara PrimeSTAR GXL DNA Polymerase under the following conditions: 1 min initial denaturation at 95°C, followed by 30 cycles of 10 s at 98°C, 15 s at 55°C, and 15 min at 68°C. The PCR products were electrophoresed in 1% agarose gel, purified the fragments, and then randomly fragmented to 650 bp by sonication (Covaris S220). Library preparation and sequencing were performed by GENEWIZ. Amplified fragments of all 10 individuals were sequenced under the same run. In order to obtain a highly reliable phylogenetic tree of these sub-samples for evaluating our method, each individual was barcoded with unique indices before multiplexing and sequencing, so that each short read sequence could be identified to the corresponding sub-sample. The relative concentrations of the sub-samples are listed in the Table 2 (2nd column). Sequencing was performed on an Illumina MiSeq machine with paired-end read length of 2 x 300 bp.

Table 2

Concentration of kangaroo sub-samples.

Sub-sample ID	Concentration	Concentration without K01
K01	0.019	-
K02	0.029	0.030
K03	0.045	0.046
K04	0.047	0.048
K05	0.094	0.096
K06	0.102	0.104
K07	0.125	0.127
K08	0.130	0.132
K09	0.188	0.192
K10	0.221	0.225
Total	1.000	1.000

The relative concentration between different kangaroo sub-samples.

Phylogenetic tree reconstruction

To start with, a reliable phylogenetic tree of the haplotypes was constructed, so that this gold-standard result could be used to evaluate our method. First, all the short read sequences were demultiplexed into sub-samples according to the barcodes appended on the sequences. Then de-novo assembly was performed on the short reads for each sub-sample separately by SOAPdenovo-Trans-127mer from SOAPdenovo-Trans package [24] with parameter: kmer-size = 91. After the mitochondria DNA sequences of the 10 haplotypes were constructed, a multiple sequences alignment was computed by MAFFT [4] with G-INS-i strategy in Geneious 11.1.5 [25]. The phylogenetic analysis was conducted using IQ-TREE [26] with the evolutionary model HKY+F+I, and the ML phylogenetic tree (shown in Fig 6B) was used as the reference tree to evaluate our method.

Fig 6

Result on a real data set with mixture of 10 sub-samples.

Result on a real data set with mixture of 10 sub-samples.

(A) The tree with tip relative abundances reported by AFPhyloMix. (B) The tree reported by IQ-Tree [26]. Note that the tip labels inside the brackets in (A) were added manually after comparing with the tree reported by IQ-Tree, indicating the corresponding sample that each tip should be assigned to according to the topology and the tip relative abundances. The short read sequences from 10 haplotypes were then mixed, and the barcode of each haplotype was removed. Trimmomatic [27] was run on the mixture of reads to remove adaptors, leading and trailing low quality bases by using the options: “ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36”. Next, the reads were aligned to the reference sequence of the eastern gray kangaroo mitochondrial genome (GenBank Accession Number: NC_027424) by BWA [21] and the alignments with mapping quality score (MAPQ) lower than 20 were discarded. AFPhyloMix was used with the short-read alignment to estimate the phylogenetic tree and the relative abundances of each haplotype. For each read, AFPhyloMix discarded the nucleotides with base quality score lower than 25. Fig 6A shows the reported tree, as well as the tip relative abundances, with the highest posterior probability obtained. The tip labels inside the brackets were added manually after comparing with the tree reported by IQ-Tree (in Fig 6B), indicating the corresponding sample that each tip should be assigned according to the topology and the tip relative abundances. Overall, the topology and the tip relative abundances recovered by AFPhyloMix matched the tree from IQ-Tree, except that AFPhyloMix combined two haplotypes K01 and K07 into one. From the tree reported by IQ-TREE, the distance between the tips K01 and K07 is 0.00054 (99.946% similarity between these two sequences). The method could not distinguish the two too-similar sequences, and thus regarded them as the same sequence. To determine whether these two nearly identical sequences affected the performance of AFPhyloMix, we removed the sub-sample of K01 and re-estimated the phylogeny with the same method described above. The updated relative concentrations of the remaining haplotypes among the mixture is shown in Table 2 (3rd column). Both AFPhyloMix and IQ-TREE analyses resulted in the same topology associated with tip relative abundances well matched the concentrations of those 9 haplotypes (in Fig 7).

Fig 7

Result on a real data set with mixture of 9 sub-samples.

Result on a real data set with mixture of 9 sub-samples.

The haplotype K01 was removed from the mixture and the experiment was repeated. (A) The tree with tip relative abundances reported by AFPhyloMix. (B) The tree reported by IQ-TREE [26]. Note that the tip labels inside the brackets in (A) were added manually after comparing with the tree reported by IQ-Tree, indicating the corresponding sample that each tip should be assigned to according to the topology and the tip relative abundances. The 95% highest posterior density of the MCMCMC estimate of error rates on the real data sets was between 0.000722 and 0.000735 with an average 0.000729. In order to compute the underlying actual error rate on the real data set, reads which had been processed by Trimmomatic [27], were compared with the corresponding assembled haplotype and the error rate of 0.000713 was obtained, after discarding the read bases with base quality scores lower than 25 (the same criteria AFPhyloMix used to filter out the low-quality read bases). Again, the slightly higher value (≈ +0.000016) of the MCMCMC estimate of error rates on the real data sets compared with the obtained error rate is consistent with what we observed with simulated data, and is likely due to the imperfect alignment between the reads and the reference sequence.

Discussion

This research demonstrates the feasibility of reconstructing a phylogenetic tree directly from the short read sequences obtained from a mixture of closely related amplified sequences, without barcoding. The results indicate that our methods work well on the simulated data set for a mixture of reads generated from up to 15 haplotypes and on a real data set of a mixture with 10 haplotypes. Perhaps unsurprisingly, AFPhyloMix worked better on the simulated data sets than in the real data sets when it came to estimating haplotype concentrations. The root-mean-square difference between the estimated sub-samples concentrations and the expected concentrations in the real data sets was 0.0059 (from Fig 7), which was larger than those in the simulated data sets (i.e. all were below 0.0027 in Fig 5B). Of course, the expected haplotype concentrations may have differed from the true concentrations in the mixture: the physical act of mixing small volumes could have led to differences in the relative concentrations of haplotypes, and this may have contributed to a higher-than-expected root-mean-square. Another factor affecting the performance of the method is the varying coverage of short read sequences along the genome. Fig 8A shows the actual distribution of the read alignments of 10 haplotypes along the genome. We expected read coverage to vary along the genome randomly, without any association to haplotypes. Surprisingly, we found that, when all the haplotypes were sequenced under the same run (i.e. all haplotypes were pooled into the same library before sequencing), the read coverages of the haplotypes had similar trends: all had relatively high (or low) read coverages at the same regions of the genome. The similar trends of the read coverages along the genome between the haplotypes led to a consistent distribution of the ratios of haplotypes along the genome (Fig 8B). The consistent read coverage ratios along the genome worked in our method’s favor; however, we noticed that a few short regions on the genome had sudden changes in the ratios of the read coverage amongst haplotypes (Fig 8B). In these regions, some reads were trimmed after the alignment against the reference sequence or could not be aligned to the reference sequence due to the dissimilarity between the read sequences and the reference sequence. Another preprocessing step was therefore developed in AFPhyloMix, in which read alignments were examined and problematic regions removed if more than r reads in those regions were trimmed (r was set to 2% of the read coverage).

Fig 8

Distribution of the read alignments of 10 haplotypes against the reference genome.

(A) The absolute read coverages along the genome. (B) The ratios on read coverages between sub-samples along the genome.

Distribution of the read alignments of 10 haplotypes against the reference genome.

(A) The absolute read coverages along the genome. (B) The ratios on read coverages between sub-samples along the genome. AFPhyloMix’s performance also depends on the SNP frequency and the read length. Too short a read length or too low a SNP frequency will mean that any given read will be unlikely to have more than a single SNP. If this happens, it will confound AFPhyloMix’s use of SNP pairs. AFPhyloMix only considers the association between two SNP sites, but it is sensible to consider the association between more SNP sites in order to acquire higher sensitivity of the methods especially when the number of haplotypes increases. The time complexity of the algorithm is O(mn) where n is the number of haplotypes, m is the number of potential SNP sites, and d is the number of SNP sites used to construct patterns of nucleotides. In our current implementation of AFPhyloMix, d = 2. The running time of the algorithm will increase exponentially as d increases. It will be a challenge to come up with a faster algorithm and consider the association between more SNP sites. Finally, it is worth noting that we have applied AFPhyloMix to sequences of closely related individuals—in our simulations, we set a root-to-tip distance of 0.03. The assumption of an infinite sites model that is applied in AFPhyloMix is appropriate for closely-related individuals. Amongst other things, the infinite sites model allows us to constrain the number of site patterns we expect to see, and use deviations from these expected patterns to error-correct. To extend our algorithm to more divergent sequences will require a different model of mutation. This remains a work in progress.

Conclusion

AFPhyloMix is designed to estimate the concentration of haplotypes and reconstruct the phylogenetic tree directly from the short read sequences in the mixture of haplotypes with no barcode, given that the number of haplotypes is known. This research demonstrates the feasibility of our approach, and is a first attempt to infer the phylogenetic tree from the mixture of reads unidentifiable to haplotypes, bypassing the assembly process of multiple genomic sequences. The experimental results have demonstrated that AFPhyloMix works reasonably well for both simulated data and real data.

Performance of AFPhyloMix on simulated data with different read coverage.

(A) Accuracy of AFPhyloMix between data sets of which the least abundant haplotype has at least 250x and 100x read coverages. (B) Root-mean-square differences between the actual and the predicted tip relative abundances for data sets with different read coverages. (C) Root-mean-square differences between the actual and the predicated edge lengths for data sets with different read coverages. (TIF) Click here for additional data file.

Performance of AFPhyloMix on simulated data evolved under two different substitution models.

(A) Accuracy of AFPhyloMix between data sets evolved under a simple model—JC and a model with site variation—JC+G. (B) Root-mean-square differences between the actual and the predicted tip relative abundances for data sets evolved under JC and JC+G. (C) Root-mean-square differences between the actual and the predicated edge lengths for data sets evolved under JC and JC+G. (TIF) Click here for additional data file.

Evaluation of AFPhyloMix on other simulated data sets.

(PDF) Click here for additional data file.

Implementation details in Markov chain Monte Carlo.

(PDF) Click here for additional data file.

To check the correctness of an estimated tree and report the minimum root-mean-square difference between the estimated and the actual haplotype relative abundances.

(PDF) Click here for additional data file. 27 May 2021 Dear Dr Wong, Thank you very much for submitting your manuscript "An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes" for consideration at PLOS Computational Biology. Your manuscript was reviewed by members of the editorial board and by three independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations. Multiple reviewers pointed out the practical limitations of this approach, while highlighting the theoretical advancement made in this study. This limitation should be explicit in the revised version of your manuscript. Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Joel O. Wertheim Associate Editor PLOS Computational Biology Daniel Beard Deputy Editor PLOS Computational Biology *********************** A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The authors present an interesting and novel approach to phylogeny reconstruction. I would like to see a few changes prior to publication, some textual, but I'd also like to see two more kinds of simulation study to clarify the behavior under certain conditions. Text changes: 1) Authors need to make it more prominently clear (maybe even in the abstract?) that this approach will estimate the tree and the haplotype frequencies, but not the actual sequence associated with each haplotype. This, along with the requirement that you must know, a priori, the number of haplotypes, is a rather severe limitation, and I struggle to think of a practical use for the proposed method. The typical case where we know the number of haplotypes in advance is when we explicitly mix a number of individually homogenous samples, but then we'd get a tree and frequencies, but have no way of saying which sample was associated to which tree tip, nor obtain a haplotype sequence for each sample. My particular concern is that, as it is currently written, a biologist may read the manuscript and not appreciate this limitation, design a sequencing experiment without barcodes, run the software, and then become immensely frustrated that they can't recover the haplotype sequences with each of their mixed components. 2) "We implemented MCMCMC in AFPhyloMix, which reports a the tree topology and the tip frequencies which gives the highest value among all the resulting posterior probabilities along the computation." This sounds like you're reporting a single sample from the chain? This is not very Bayesian in spirit, and could suffer from high variance, possibly differing substantially from one run to another (far more so than usual approaches to summarize MCMC chains). Given that tips are not tied to sequences in the usual way, so the "identity" of a tip can just drift, un-anchored, during sampling, I can see that you might have difficulties using other approaches to extract a consensus tree from the chain, but your chosen approach feels like it requires a little more justification and clarification. 3) "During the experiments, the following modified formula was found to have a better accuracy on the estimation of the tree topology and the tip frequencies:" Could the authors explain this a little? How much better? Why did they try this formula in the first place? What is the explanation for its better performance? 4) "Whereas it is theoretically possible to simultaneously infer edge lengths and topologies, we have found that this does not deliver accurate results." How inaccurate were they? This seems like strange behavior. 5) "The estimated result is regarded as correct if the tips on an estimated tree can be paired up with the tips on the actual tree satisfying the following conditions: (1) the difference between the predicted tip frequency and the corresponding actual tip frequency paired with is less than 0.01; and (2) their topologies are the same." It isn't clear that this will produce a unique pairing of estimated to true tips, especially if the tip frequencies are similar (even at a single "cherry" - https://rdrr.io/cran/ape/man/cherry.html). If not, how do you efficiently choose the optimal tip-to-tip configuration to minimize later deviation estimates? 6) Typo: "AFPhyloMix adopts a Bayesian model of inference to estimate[s] the phylogeny" 7) Possible typo: "To further examine the [derivation] between the predicted tip frequencies 328 and the actual sample concentrations" "deviation"? 8) Simulation request This approach might not be especially robust to violations of the underlying model. The simplest of these is where rates vary over sites. Standard phylogeny inference is quite robust to such variation, but something that relies on SNP frequency distributions in the manner described here may not be. I suggest two kinds of studies to investigate such model violations: A) Identical to your current simulation setup, but with varying levels of site to site variation in the Jukes Cantor mutation rate. B) Take a set of "real" haplotypes that have evolved through a real evolutionary process (eg. a genome segment for a random set of mammals, or bacteria, or...), and use these as the basis of the simulation, where each simulation replicate involves the assignment of frequencies to each haplotype, and then the paired-end read simulation, etc. The "true" tree is the phylogeny inferred from the full haplotypes. I suggest finding a biological example where many genomes are available, so you can randomly select N each time, rather than using the same N over and over. Note: perfect performance in these cases shouldn't be a requirement for publication - I just think it should be well-characterized exactly how things break down in these scenarios. Reviewer #2: This manuscript addresses a critical problem in working with sequence data - separating haplotypes that are all sequenced together and are difficult to separate through pre-processing. This new approach addresses this challenge and also identifies the frequency of each haplotype and the relationship among haplotypes. The manuscript does a nice job explaining the method, and then using both simulations and an empirical example to test the method. However, there are some important aspects of this paper that would benefit from further explanation and rewriting. First, I can see very important applications of this method that could be added to the introduction to greatly improve the motivation for the work. Primarily, understanding haplotype frequencies and the phylogeny of haplotypes in a mixed sample would seem to benefit naturally mixed samples, such as coinfections (e.g. malaria or SARS CoV-2). Focusing in the introduction dealing with quantities of data is irrelevant, while concern about pre-processing time and cost seems secondary to being able to sort our coinfections to which there is currently no solution. Rewriting isn't strictly necessary but would provide stronger motivation for the research. The methods section begins with the expectations and calculations. However, this section then begins to explain some results (e.g. Figure 1C, 3B, ln 119+). The flow into the results is logical, except that the observed distribution is shown prior to any explanation of how it was generated. The simulations are not explained until the later section. I suggest significantly rearranging the Methods and Experiment sections so the theoretical expectations are provided first, then the simulations, then the results. Note that there are several other results integrated into methods (e.g. line 94, 212, 288) that should be moved as well. I found the kangaroo data to be an odd example. It is clearly easy enough to tag these samples and get separate sequences for each sample. That said, it is an example that allows testing this method. To motivate its use better I suggest removing the discussion of kangaroo meat, which is completely irrelevant to AFPhyloMix, and just say this is a useful example where we could compare barcoded and not - you really could have picked any example where you mix samples at different concentrations. Consider combining Fig 1B and C - overlaying the dots (expectation) on histogram (observation) would make the comparison clearer. Additionally, the points in 1B should be different shapes. B&C may need to be a separate figure in results. Similarly, Fig 3B is also results from simulations and may need to be moved, and overlaying A on B might make the comparison easier. The caption for Fig 1 should explain frequency v ratio so that that the caption can stand independently from the text. The terms ratio and frequency are challenging, primarily because the authors make statements like "the expected SNP ratio ... would be... which is the frequency of tip A" (line 84). Ratio and frequency mean very specific things in this paper so the use of those words should be avoided otherwise. Ratio (an individual SNP) depends on sample proportion while frequency (across SNPs) depends on branch length. Simulations were done with unrealistically high coverage. It would be helpful to know how well this approach works with lower coverage. Why is the root sequence used as the reference for alignment? This seems unrealistic. This choice should be justified or a reference should be chosen from the "extant taxa". The abstract and introduction refer to amplicon sequencing, while the simulations are for small genomes. I suggest changing the abstract and introduction to match what was actually done. There are statements throughout such as Fig X shows... This phrasing is very awkward as it makes the results unclear. Please state actual results and refer the reader to the figure. Fig 5B and C - consider normalizing these values. It's unclear what y-values are meaningful. Also, describe the quality of the results in the caption. There are some minor typos and odd word choices throughout that I hope will be caught on rereading / in proofs. Reviewer #3: Wong and colleagues present an interesting method to co-estimate the topology, branch lengths, tip frequencies and error rates (sequencing errors or others) from mapped reads, sequenced from a pooled DNA library, without barcoding for subsequent demultiplexing. As a population geneticists working on ancient DNA, I am a bit outsider to the field, and need a few additional (mostly minor, I think) clarifications: 1) the applicability of the method. Authors say in the introduction that: "More importantly, there are some samples where barcoding is impractical. For instances, rapidly evolving viruses (e.g., Human Immunodeficiency Virus (HIV) and Hepatitis C Virus (HCV)) typically exist as a collection of genetically diverse genomes within an infected individual.". I truly understand your point here, but ...at the same time: "In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and frequencies of the constituent haplotypes, for mixtures with up to 15 haplotypes.". How many virus haplotypes do the authors expect to coexist during an infection? Probably way many more than 15, isn't it? How high is the mutation rate in viruses? Probably incompatible with an infinite sites model (?). How to justify then the use of AFPhyloMix? What is the typical problem then were we can apply AFPhyloMix? Only mtDNA, given the poor reference assemblies of Y chromosomes? How expensive it is to sequence 10-15 mitochondrial genomes with barcodes? Not much really. Why do we need AFPhyloMix then? I am playing the devils advocate to motivate the authors to better justify the biological problems that could be addressed with AFPhyloMix in the introduction (I am sure there are). At the end of the day, this is what will condition the utility and the impact of their program. 2) the authors digest very well their equations making easy the reading and the understanding of the model, up until equation 6. They justify theoretically one equation, but suddenly "the following modified formula was found to have a better accuracy on the estimation of the tree topology and the tip frequencies", without much further explanation, despite equation 6 being only an approximation. Could the authors please elaborate this a bit more, including a non-mathematical (more intuitive) description of equation 6 and why (they think) it works better (no simulations shown proving this). It might be obvious, but I am bit lost there, and I suspect that other readers will be as well. 3) what is the running time of the program? The authors discuss about the algorithm complexity, but a minimal description on the absolute scale would be informative. I guess AFPhyloMix is relatively fast given the number of simulations done, but readers might wonder if takes 1 week or more with eg. 15 long haplotypes. 4) Why not using the information of the pair ends (if paired-end data is available) to help defining the haplotype? Imagine there are linked mutations at both reads of each pair. This would provide additional information on the read<->haplotype assignation, which could especially help refining estimates of the haplotype frequencies at the tips. 5) What is the impact of the read length? In their empirical experiment, the authors test AFPhyloMix with reads that are 150bp-long and mtDNA , where the mutation rate is probably greater also in this species. I guess however that shorter reads and lower mutation rates reduce the chance of having more than one mutation per read, perhaps impacting the accuracy of the method. I am just asking whether the authors could clarify/elaborate a bit on this. Also, what is the substitution model or mutation rate used in their simulations? 6) Is there any chance to add an independent function, once topology, branch length and tip frequencies are estimated (ie. given the full phylogeny) to classify each read (or read pair) as belonging to one haplotype or another? That would be helpful for demultiplexing without barcodes. I understand this can be too much work for a revision, but I encourage the authors to think in implementing it after the publication (if the rest of concerns are satisfactorily addressed). Thanks again for your efforts and congratulations for this nice approximation. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. 17 Aug 2021 Submitted filename: cover_letter.pdf Click here for additional data file. 1 Sep 2021 Dear Dr Wong, We are pleased to inform you that your manuscript 'An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Joel O. Wertheim Associate Editor PLOS Computational Biology Daniel Beard Deputy Editor PLOS Computational Biology *********************************************************** 9 Sep 2021 PCOMPBIOL-D-21-00630R1 An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes Dear Dr Wong, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Andrea Szabo PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

21 in total

1. Phylogenetics and speciation.

Authors: T G. Barraclough; S Nee
Journal: Trends Ecol Evol Date: 2001-07-01 Impact factor: 17.712

2. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.

Authors: Kazutaka Katoh; Kazuharu Misawa; Kei-ichi Kuma; Takashi Miyata
Journal: Nucleic Acids Res Date: 2002-07-15 Impact factor: 16.971

Review 3. Molecular phylogenetics: principles and practice.

Authors: Ziheng Yang; Bruce Rannala
Journal: Nat Rev Genet Date: 2012-03-28 Impact factor: 53.242

Review 4. Haplotype-resolved genome sequencing: experimental methods and applications.

Authors: Matthew W Snyder; Andrew Adey; Jacob O Kitzman; Jay Shendure
Journal: Nat Rev Genet Date: 2015-05-07 Impact factor: 53.242

5. SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads.

Authors: Yinlong Xie; Gengxiong Wu; Jingbo Tang; Ruibang Luo; Jordan Patterson; Shanlin Liu; Weihua Huang; Guangzhu He; Shengchang Gu; Shengkang Li; Xin Zhou; Tak-Wah Lam; Yingrui Li; Xun Xu; Gane Ka-Shu Wong; Jun Wang
Journal: Bioinformatics Date: 2014-02-13 Impact factor: 6.937

6. PAML: a program package for phylogenetic analysis by maximum likelihood.

Authors: Z Yang
Journal: Comput Appl Biosci Date: 1997-10

7. Radiation of extant marsupials after the K/T boundary: evidence from complete mitochondrial genomes.

Authors: Maria A Nilsson; Anette Gullberg; Angel E Spotorno; Ulfur Arnason; Axel Janke
Journal: J Mol Evol Date: 2003 Impact factor: 2.395

8. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data.

Authors: Matthew Kearse; Richard Moir; Amy Wilson; Steven Stones-Havas; Matthew Cheung; Shane Sturrock; Simon Buxton; Alex Cooper; Sidney Markowitz; Chris Duran; Tobias Thierer; Bruce Ashton; Peter Meintjes; Alexei Drummond
Journal: Bioinformatics Date: 2012-04-27 Impact factor: 6.937

9. ModelFinder: fast model selection for accurate phylogenetic estimates.

Authors: Subha Kalyaanamoorthy; Bui Quang Minh; Thomas K F Wong; Arndt von Haeseler; Lars S Jermiin
Journal: Nat Methods Date: 2017-05-08 Impact factor: 28.547

10. INDELible: a flexible simulator of biological sequence evolution.

Authors: William Fletcher; Ziheng Yang
Journal: Mol Biol Evol Date: 2009-05-07 Impact factor: 16.240