Lei Zhao1, Rasmus Nielsen1,2,3, Thorfinn Sand Korneliussen1. 1. Section for Geogenetics, Globe Institute, University of Copenhagen, Øster Voldgade 5-7, 1350 København K, Denmark. 2. Department of Integrative Biology, University of California, 3040 Valley Life Sciences Building 3140, Berkeley, CA 94720-3140, USA. 3. Department of Statistics, University of California, 3040 Valley Life Sciences Building 3140, Berkeley, CA 94720-3140, USA.
Abstract
Commonly used methods for inferring phylogenies were designed before the emergence of high-throughput sequencing and can generally not accommodate the challenges associated with noisy, diploid sequencing data. In many applications, diploid genomes are still treated as haploid through the use of ambiguity characters; while the uncertainty in genotype calling-arising as a consequence of the sequencing technology-is ignored. In order to address this problem, we describe two new probabilistic approaches for estimating genetic distances: distAngsd-geno and distAngsd-nuc, both implemented in a software suite named distAngsd. These methods are specifically designed for next-generation sequencing data, utilize the full information from the data, and take uncertainty in genotype calling into account. Through extensive simulations, we show that these new methods are markedly more accurate and have more stable statistical behaviors than other currently available methods for estimating genetic distances-even for very low depth data with high error rates.
Commonly used methods for inferring phylogenies were designed before the emergence of high-throughput sequencing and can generally not accommodate the challenges associated with noisy, diploid sequencing data. In many applications, diploid genomes are still treated as haploid through the use of ambiguity characters; while the uncertainty in genotype calling-arising as a consequence of the sequencing technology-is ignored. In order to address this problem, we describe two new probabilistic approaches for estimating genetic distances: distAngsd-geno and distAngsd-nuc, both implemented in a software suite named distAngsd. These methods are specifically designed for next-generation sequencing data, utilize the full information from the data, and take uncertainty in genotype calling into account. Through extensive simulations, we show that these new methods are markedly more accurate and have more stable statistical behaviors than other currently available methods for estimating genetic distances-even for very low depth data with high error rates.
While much of biology has been revolutionized by the availability of high-throughput next-generation sequencing, the state-of-art methods for calculating genetic distances used in phylogeny estimation have not changed much for the past several decades, and are still not properly modeling the uncertainty and idiosyncrasies associated with next-generation sequencing data. Modern sequencing technologies produce millions or billions of small DNA fragments through a massively parallel sequencing process. In re-sequencing studies, these fragments are then aligned to a (typically) haploid representation of the target genome. As the DNA sequences in these fragments have non-negligible error-rates, genotype likelihoods are calculated in order to model the uncertainty of genotypes arising from sequencing errors and from varying sequencing depth. These genotype likelihoods are defined as the probability of the observed sequencing data—at a particular position of the genome—as a function of the true (but unknown) genotype, which often is assumed to be diploid. Genotype likelihoods, therefore, capture all the uncertainty in the data regarding the true genotype.In phylogenetics, the estimation of genetic distances is based on continuous time Markov chain models of nucleotide substitution. The core concept is to model nucleotide substitution as a Markov process, while allowing for differences in substitution rates among the four different nucleotides and, possibly, variation in the substitution rate among sites. The simplest model is the Jukes and Cantor model, also called the JC69 model (Jukes and Cantor 1969), which assumes equal rates of substitutions between all base pairs. A series of other models developed in the 1980s and 1990s including the K80 (Kimura 1980), F81 (Felsenstein 1981), HKY85 (Hasegawa et al. 1985), F84 (Felsenstein and Churchill 1996), TN93 (Tamura and Nei 1993) relaxes this assumption by incorporating additional parameters, such as differences in the rate of transitions and transversions, and unequal equilibrium nucleotide frequencies. The most general commonly used model is the Generalized Time Reversible (GTR) model (Tavaré 1986), which is the most parameter-rich model that is still time-reversible. Relaxing the requirement of time-reversibility of the Markov chain allows for more parameter-rich models, in particular the unrestricted UNREST model of Yang (1994) which is a fully parameterized model with 12 parameters—one for each of the 12 possible substitution types. However, because of computational simplicity and tractability, most work in phylogenetics focuses on the GTR model. In this work we will also assume the GTR model as the basic model of nucleotide substitution—however, the results and methodology we develop can in principle be generalized to non-reversible models.The previously discussed substitution models were developed for haploid sequences of nucleotides. However, much data currently generated arise from the nuclear DNA of diploid organisms. Applying these models designed for haploid sequences to diploid data has been a challenge in many studies, especially due to the considerable genotyping uncertainty in much of the published sequencing data. Different approaches exist for handling heterozygous sites and uncertainty in genotype calling for phylogenetic analyses of diploid data. One approach is to construct haploid data by phasing the diploid data using computational approaches—e.g., using various extensions of PHASE (Stephens et al. 2001; Stephens and Donnelly 2003) or BEAGLE (Browning and Browning 2007). This is an approach that may work well for species where many individuals (>100) have been re-sequenced or where large reference panels of individuals from the same species are already available. Unfortunately, for most phylogenetic analyses, such data will often not be available. Other approaches include: (1) ignoring all possible heterozygous sites, i.e. treating them as missing data (Nilsson et al. 2017; Árnason et al. 2018; Maldonado et al. 2019)—in this work, we call this approach NoAmbiguityGT; (2) representing possible heterozygous sites as IUPAC (The International Union of Pure and Applied Chemistry, Cornish-Bowden 1985) ambiguity codes (Klicka et al. 2014; Martin et al. 2014; Uckele et al. 2021)—termed AmbiguityGT in this work; (3) choosing a random nucleotide from the sequenced reads at each position (Skoglund et al. 2016; Yang et al. 2020)—RandomSEQ in this work; (4) making a consensus call of a single nucleotide based on the raw sequencing data, i.e., converting the diploid sequencing data into a haploid sequence by, for each site, selecting (one of) the most frequent nucleotides in the sequencing data (Manthey et al. 2016; Sass et al. 2016; Yuan et al. 2016)—ConsensusSEQ in this work; or (5) incorporating uncertainty using genotype likelihoods or other measurements of uncertainty, e.g., ngsDist (Vieira et al. 2015), which was applied in Choi and Purugganan (2018), Gaunitz et al. (2018), and Hu et al. (2018). ngsDist does not perform genotype calling, but instead uses the diploid diallelic genotype likelihood to compute the average per site allelic differences by averaging over the joint posterior genotype probabilities. ngsDist was not devised for phylogenetic analyses and does not use an explicit model of nucleotide substitution.As we will show in the results section, all existing methods have serious drawbacks in simulated scenarios. All simulated scenarios presented in the manuscript are based on the topology shown in figure 1. The first two approaches, AmbiguityGT and NoAmbituityGT, which perhaps are the most commonly used approaches, have previously been shown to cause strong biases in phylogenetic estimation (Lischer et al. 2014; Potts et al. 2014; Schrempf et al. 2016). The bias is not unidirectional: at low read depth (1) these two methods will overestimate the genetic distance; whereas higher read depths (5) will cause underestimation (see fig. 2 and supplementary fig. S8, and Section S5.1, Supplementary Material online for further discussion).
Fig. 1.
Simulated divergence tree: Tree structure used for simulations. Two diploid individuals are of the genotypes ij and kl, respectively. The most recent common ancestor of i and j is A, and that of k and l is B. The divergence time of A and B is denoted as , while the time from A to i or j is , and the time from B to k or l is . The divergence time between the two diploid individuals, t, is thus defined as .
Fig. 2.
The (scaled) genetic distance estimations for (a) AmbiguityGT and (b) NoAmbiguityGT based on the simulation results. The average read depth 0.5, 0.75, 1, 5, 10, 20. True genetic distance , 0.01, 0.016. JC69 and GTR models were used. , , see figure 1. The default simulation setting was applied.
Simulated divergence tree: Tree structure used for simulations. Two diploid individuals are of the genotypes ij and kl, respectively. The most recent common ancestor of i and j is A, and that of k and l is B. The divergence time of A and B is denoted as , while the time from A to i or j is , and the time from B to k or l is . The divergence time between the two diploid individuals, t, is thus defined as .The (scaled) genetic distance estimations for (a) AmbiguityGT and (b) NoAmbiguityGT based on the simulation results. The average read depth 0.5, 0.75, 1, 5, 10, 20. True genetic distance , 0.01, 0.016. JC69 and GTR models were used. , , see figure 1. The default simulation setting was applied.The methods RandomSEQ and ConsensusSEQ do not perform genotype calling, but instead choose a representative nucleotide among all the available data for every site. RandomSEQ samples a random allele per site and the resulting data is equivalent to a sequencing study with a deterministic read depth of 1. ConsensusSEQ is based on constructing a haploid sequence by choosing the most frequently observed nucleotide in each site. As we will later show (figs. 3, 4 and supplementary fig. S9, Supplementary Material online), both methods will be strongly affected by sequencing errors at low depths, although ConsensusSEQ can perform well at high sequencing depths.
Fig. 3.
The (scaled) genetic distance estimations () for distAngsd-nuc, distAngsd-geno, RandomSEQ, and ConsensusSEQ based on the simulation results under JC69 model. Average read depth 20, 10, 5, 1, 0.75, 0.5 for (a)–(f), respectively. True genetic distance , 0.01, 0.016. , , see figure 1. The default simulation setting was applied.
Fig. 4.
The (scaled) genetic distance estimations () for distAngsd-nuc, distAngsd-geno, RandomSEQ, and ConsensusSEQ based on the simulation results under GTR model. Average read depth 20, 10, 5, 1, 0.75, 0.5 for (a) to (f), respectively. True genetic distance , 0.01, 0.016. , , see figure 1. The default simulation setting was applied.
The (scaled) genetic distance estimations () for distAngsd-nuc, distAngsd-geno, RandomSEQ, and ConsensusSEQ based on the simulation results under JC69 model. Average read depth 20, 10, 5, 1, 0.75, 0.5 for (a)–(f), respectively. True genetic distance , 0.01, 0.016. , , see figure 1. The default simulation setting was applied.The (scaled) genetic distance estimations () for distAngsd-nuc, distAngsd-geno, RandomSEQ, and ConsensusSEQ based on the simulation results under GTR model. Average read depth 20, 10, 5, 1, 0.75, 0.5 for (a) to (f), respectively. True genetic distance , 0.01, 0.016. , , see figure 1. The default simulation setting was applied.The Bayesian approach in ngsDist assumes an infinite sites model that does not take recurrent mutations into account but more importantly it averages over the posterior probability distribution of genotypes when calculating distances. We will in the Results section show that this approach can result in highly biased estimates of genetic distances.Motivated by the statistical limitations of previous methods, we develop two new methods, distAngsd-geno and distAngsd-nuc, which are both maximum-likelihood (ML) estimators of the genetic distance. They differ from previous methods as they do not attempt genotype or haploid calling, but instead model the sequencing uncertainty while also using an explicit nucleotide substitution model. Through extensive simulations, and by applying our methods to real sequencing data, we show that the two novel methods outperform previous methods by having significantly less biases and smaller variances—especially in the context of low read depth, or small genome sizes. When using the methods for estimating phylogenetic trees on a real data set, we show that the phylogenetic trees estimated using the new methods have higher phylogenetic concordance with existing taxonomic categories than trees estimated using previous methods.
Materials and Methods
Both distAngsd-geno and distAngsd-nuc share similar characteristics. They both pre-compute a genome-wide estimate of the joint distribution of either genotypes (distAngsd-geno) or nucleotides (distAngsd-nuc) in two individuals. This is accomplished by using genotype likelihoods or by directly using the quality scores associated with each nucleotide across all reads. The estimation of these joint distributions is free of assumptions regarding evolutionary models. Inference of genetic distance, d, using maximum likelihood based on models of molecular evolution, e.g., the JC69 model or the more realistic GTR model, then proceeds by treating the inferred joint distributions as pseudo-data (i.e., and below). The inference procedure, therefore, has two steps: (1) Maximum-likelihood inference of genome-wide joint distribution of alleles or genotypes, and (2) maximum-likelihood inference of genetic distances based on the results of step (1). This two-step approach ignores the statistical uncertainty introduced in step (1) and merely uses point estimates of joint distributions. However, for genome-wide data, the variances in the estimates of genotype or allele frequencies should be negligible, and as we will show, this procedure does in fact have good statistical properties for realistic parameters’ values.
Estimation of Joint Distributions Using EM
As accurate genotype calling is not possible with low read depth sequencing data, e.g., Nielsen et al. (2011), we employ a likelihood approach to estimate the joint global distribution using the expectation maximization algorithm (EM) (Dempster et al. 1977).In distAngsd-geno, the joint genotype distribution is represented by a 10 by 10 matrix where 10 is the total number of possible unphased genotypes in a diploid individual, i.e., {AA, AC, AG, AT, CC, CG, CT, GG, GT, TT}. Both rows and columns of are indexed by these genotypes, and every element of , , represents the proportion of the informative sites (sites where we have data for both individuals) where the true genotypes are given by and . Similarly in distAngsd-nuc, a 4 by 4 matrix representing the joint distribution of nucleotides is estimated. Here, is indexed by nucleotides, i.e., {A,C,G,T}, rather than the 10 possible genotypes.The matrix used in distAngsd-geno is inferred via Algorithm 1, while distAngsd-nuc applies Algorithm 2 in order to estimate . The input of Algorithm 1 are the genotype likelihoods, while the input in Algorithm 2 are the likelihoods of all nucleotides given by the Phred-scaled base quality score in the read data. The objective function (i.e., eq. S3 for and eq. S4 for ) is essentially the same as equation (1) in Korneliussen et al. (2014), which is the two dimensional equivalent of the likelihood function presented in Keightley and Halligan (2011), Li (2011), and Nielsen et al. (2012).
Algorithm 1: EM estimation for M. EM algorithm for estimating the joint genotype distribution matrix M, where is the matrix M in the tth iteration and the function is the genotype likelihood of genotype at site s in sample k ().
Algorithm 2: EM estimation for matrix N. EM algorithm for estimating the joint nucleotide distribution matrix , where is the matrix in the tth iteration, and the function is the likelihood function of the nucleotide in the ’th read at site s in sample l, or 2. can be obtained from the read quality scores.
Algorithm 1: EM estimation for M. EM algorithm for estimating the joint genotype distribution matrix M, where is the matrix M in the tth iteration and the function is the genotype likelihood of genotype at site s in sample k ().Algorithm 2: EM estimation for matrix N. EM algorithm for estimating the joint nucleotide distribution matrix , where is the matrix in the tth iteration, and the function is the likelihood function of the nucleotide in the ’th read at site s in sample l, or 2. can be obtained from the read quality scores.
Log-likelihood of distAngsd-geno Inference
The log-likelihood function in distAngsd-geno given the inferred pseudo-observation matrix is defined as,Here, the sum is over all possible unphased diallelic genotype nucleotide configurations and and are the ith and jth nucleotides of genotypes and , respectively. is the transition probability from nucleotide to given the genetic distance d between the two samples. is the stationary distribution of nucleotide .
Log-likelihood of distAngsd-nuc Inference
The log-likelihood function of distAngsd-nuc given the inferred pseudo-observation matrix is,where is the transition probability from nucleotide to given the genetic distance d between the two samples. is the stationary distribution of nucleotide .
Parameters of the Substitution Model
The forms of , and d in both equations (1) and (2) will be determined by the nucleotide substitution model (e.g., the JC69 or GTR model, see supplementary Section S3, Supplementary Material online for details). The genetic distance between two individuals, d, is a product of the divergence time (as shown in fig. 1) and the substitution rate, , . Notice that this method, like other phylogenetic methods, estimate d but cannot independently estimate t and . We then scale the nucleotide substitution rate matrices of the JC69 model and GTR model so that the mean number of substitutions occurring per base per scaled time unit is fixed to be 1, and represents the mean number of substitutions per site in the time interval. In the following, we will not distinguish between estimating d and estimating t. Also, notice that, similarly to classical phylogenetic methods, we here ignore complications from varying coalescence time and incomplete lineage sorting (See supplementary Section S6, Supplementary Material online for some further investigations on the effects on the estimation method of incomplete lineage sorting) between the two individuals. If the method is applied in settings where the coalescent process causes significant variation in t, the method is expected to estimate an average value of t for the two individuals.The detailed derivation of the EM algorithm for both proposed methods are given in supplementary Section S2, Supplementary Material online.
Simulations and Comparisons to Previous Methods
Simulations
We will focus on the JC69 model and GTR model assuming all sites are variable. (i.e., no site is invariable). However, we also explore models where a proportion of sites are invariable in supplementary Section S3.5, Supplementary Material online.We simulate data following Algorithm 3 assuming a diploid species. While distAngsd-geno assumes diploidy, distAngsd-nuc is applicable to species with other ploidies, but we here only compare the methods for the diploid case. The simulated phylogeny is described in figure 1. The genotype likelihood model used in the simulations is the canonical genotype likelihood model (see also McKenna et al. 2010 and supplementary Section S1, Supplementary Material online):Here i, j, , and is the error rate of the kth read in the focal site. is the probability that the kth read at the focal site is called as nucleotide l given the true genotype in the site is ij.
Algorithm 3: Simulation of sequencing data. Simulation scheme of two individuals with a shared ancestor. A simulation scheme considers both variable and invariable sites can be found in supplementary alg. S1, Supplementary Material online.
Input: Substitution rate matrix R (the stationary distribution of π⋅ is known), divergence time t, t1 and t2, error rate e, mean read depth RD, genome length l;
Output: Read data and Genotype likelihoods across sites;
Ancestral sequences construction:
The ancestral nucleotide for sample A and B, denoted by a1 and a2, are simulated given R, π⋅, divergence time t−t1−t2, and l. Sites are assumed to evolve independently;
Sequence construction and calling:
initialization: GL1(s,g)←GL2(s,g)←0;
for j←1 to 2 do
Two sequences qj1 and qj2 are simulated from aj given R, l, and time tj;
for every site s in sample j do
Generate read depth n∼Poisson(RD).
Sample n reads from the true genotype at the site s of qj1 and qj2 with symmetric errors at rate e, to generate data r1,…,rn.
Define genotype likelihoods as GLj(s,g)=∑k=1nlog[Prob(rk|g)].
end
end
Algorithm 3: Simulation of sequencing data. Simulation scheme of two individuals with a shared ancestor. A simulation scheme considers both variable and invariable sites can be found in supplementary alg. S1, Supplementary Material online.In the simulations, it is assumed that the error rates are identical across different reads and sites (i.e., for all k). The read depths are assumed to be i.i.d., Poisson random variables in all sites, and only parameter needed to define this distribution is the average read depth. It would also be possible to simulate varying error rates by sampling them from a different distribution, but as long as the base error rates are calibrated correctly, variation in the error rate among sites should not affect the behavior of the new methods qualitatively.
Parameter Settings
Unless otherwise stated, we simulate data with fixed values of , and (see fig. 1), but vary the total divergence time , such that , and .In the main text, most of the simulations are conducted with the genome length and base calling error set to 1 Mb and . And for each scenario, 200 replicates are simulated. We will refer to these as the default simulation setting.The parameters of the GTR model are , , , , , , , , . These values are suggested by the table 1.3 in Yang (2006) and were originally inferred from the human and orangutan 12S rRNA genes.Other parameter values with larger base calling error, longer genetic distance, etc., are explored in the Supplementary material online but are not presented in the main text.
Comparisons with the Previous Methods
To assess the performance of the new methods, we compare their performance against the five previous methods (RandomSEQ, ConsensusSEQ, NoAmbiguityGT, AmbiguityGT, and ngsDist).1. RandomSEQ and ConsensusSEQThe likelihood function for both RandomSEQ and ConsensusSEQ is given by:where is the joint nucleotide counts matrix of the size obtained by either sampling the nucleotide (RandomSEQ) or by using the Consensus (ConsensusSEQ). The consensus sequence is created by choosing the most common base in each site, choosing randomly if multiple bases are equally common, and ignoring the site if no bases are observed in one or both of the samples. The definitions of and are the same as those in equation (2).2. AmbiguityGT and NoAmbiguityGTAmbiguityGT and NoAmbiguityGT perform genotype calling and, thereby, have an implicit assumption of known ploidy level. We assume diploid genotypes obtained through standard genotype calling where a posterior probability is computed for each of the 10 possible genotypes under the assumption of a uniform prior. In the NoAmbiguityGT approach, heterozygous sites are discarded, i.e. the data are treated as haploid data ignoring heterozygous sites. However, in the AmbiguityGT method heterozygous sites are represented as ambiguity characters and the likelihood function is calculated by assigning equal likelihood to each of the two nucleotides, i.e., heterozygous sites are treated as sites that are haploid but with uncertainty regarding the nucleotide in the site. This is the standard way of dealing with missing data or uncertainty regarding nucleotide state in phylogenetic inference. We can think of the AmbiguityGT approach as being based on the following likelihood function:where equals , is a count matrix of the called genotype pairs, and M1 and M2 are the sets of nucleotides observed in the two sites. will contain two nucleotides if sample i is heterozygous in that site, and one nucleotide if homozygous. We elaborate more on the details of these likelihood functions in supplementary Section S5.1, Supplementary Material online.3. ngsDistAll methods mentioned above including RandomSEQ, ConsensusSEQ, AmbiguityGT, and NoAmbiguityGT were implemented in distAngsd, and we used this implementation for the inferences for the simulated data.We also compare the new methods with ngsDist (Vieira et al. 2015), which is the only pre-existing method that models the uncertainty of the data to estimate genetic distances. However, this method assumes a di-allelic model. We, therefore, convert the 10-genotype likelihoods simulated by Algorithm 3 to 3-genotype likelihoods, by first inferring the major and minor alleles as the most frequently, and second most frequently observed nucleotides, respectively. The resulting 3-genotype likelihoods are then used as input to ngsDist to calculate the pairwise distances (see supplementary Section S5.2, Supplementary Material online for details).
Experimental Data Analyses
We apply the new distAngsd-geno method in a phylogenetic analysis of a previously published dataset of RADseq reads from oak trees (Fitz-Gibbon et al. 2017). This data set is comprised of 83 oak samples representing 16 taxa located around the USA. We obtained the raw fastq sequences and followed the reference mapping approach described in detail in Fitz-Gibbon et al. (2017). The sites retained for downstream analyses are based on the callable-locus BED files shared by the authors. For each site, we use the raw genotype likelihood (bcftools genotype likelihood model). The data processing pipeline differs from that of the original authors by allowing heterozygote sites, which had been masked in the original analyses. These sites are either encoded as IUPAC ambiguity codes for the AmbiguityGT method, or used in the form of raw genotype likelihoods for our new methods. We obtained nucleotide consensus sequences for the ConsensusSEQ method by using the bcftools consensus command.The fastq files for the 83 samples were aligned to the reference genome (Reference of the genome v0.5, Sork et al. 2016; available from https://valleyoak.ucla.edu/genomicresources/) to obtain per sample BAM files. For the distAngsd-geno inference, VCF files were generated by applying bcftools mpileup to all pairs of samples. Similarly, bcftools mpileup commands were applied to obtain per sample information for use in the AmbiguityGT and ConsensusSEQ analyses. Genotype calling was performed with “bcftools call” followed by “bcftools consensus -I” to obtain the putative genotype calls with IUPAC ambiguity codes for the AmbiguityGT analyses whereas we did not apply “bcftools call” but solely used the “bcftools consensus” for the ConsensusSEQ analyses (see supplementary Section S10, Supplementary Material online for details).The distAngsd-geno, ConsensusSEQ, and AmbiguityGT methods were applied on each pair of oak samples using the JC69 substitution model. Based on the resulting pairwise distance matrices, we performed neighbor joining estimation for each method using the program PhyD (Criscuolo and Gascuel 2008) with the BioNJ (Gascuel 1997) algorithm. The trees were then plotted with FigTree v1.4.4. (http://tree.bio.ed.ac.uk/software/figtree/).To compare the methods in low-coverage scenarios, we down-sampled each of the 83 samples to lower coverage. The original 83 samples have an average read depth of and lowest depth of . Each sample is down-sampled to read depths of 10, 5, 1, , , and . Phylogenetic trees were then estimated using the same methods as previously described.
Results
We implemented the new methods in a program: distAngsd (https://github.com/lz398/distAngsd). This program is threaded, scales linearly in the number of cores allocated, and is supplied as an open source c/c++ program under the GPL license hosted on Github. Importantly, it is user friendly and allows for various standard formats that will enable researchers to integrate these methods in standard data-analysis pipelines. In the following, we compare the performance of the new methods to previous methods using extensive simulations and an application to a real data set.
Simulation-based Inference
The quantities we compared across different methods in this section are mainly the scaled genetic distances, and the scaled mean squared error, . The distributions of the genetic distance estimates, , are presented as boxplots illustrating the variability among simulation replicates. The deviations of the mean of from 1 reflects the estimation biases, and the magnitudes of the boxes illustrate the variances of the estimates. is an overall measure of the accuracy of the estimates combining both bias and variance. Given the same true genetic distance d, the lower is, the better the inference of the method is.1. AmbiguityGT and NoAmbiguityGTAs shown in figure 2 and , the AmbiguityGT and NoAmbiguityGT approaches, which mimic the currently most commonly used methods, are generally found to be highly biased, and are affected by two oppositely directed biases: For example, for under the JC69 model with average read depth is , AmbiguityGT and NoAmbiguityGT overestimate the true distance and times, respectively. However, if the average read depth is large, e.g., 20, both methods underestimate the true distance ( for AmbiguityGT and for NoAmbiguityGT). Similar patterns of overestimation for low read depth and underestimation for high read depth are observed across simulation settings (see also supplementary figs. S8 and S12, and table S4, Supplementary Material online).The reasons that AmbiguityGT and NoAmbiguityGT are biased upwards for low depth and downwards for high depth are as follows: When the average read depth is low, most heterozygous sites appear as homozygous, however, sequencing errors will appear as additional fixed differences that cause overestimation of the divergence time. As the sequencing depth increases, the genotype calling becomes more accurate and less affected by sequencing errors. However, for the NoAmbiguityGT methods, the removal of heterozygous sites leads to underestimation of the genetic distance. Instead of estimating in figure 1, effectively only is being estimated. There will be a similar effect for the AmbiguityGT method, which also ends up effectively just estimating . This effect is explored in more detail in supplementary S5.1, Supplementary Material online.Since the biases of AmbiguityGT and NoAmbiguityGT are quite large, we do not include these methods in future comparisons.2. RandomSEQ and ConsensusSEQSimulation results using the same parameter settings as in the previous section, for distAngsd-geno, distAngsd-nuc, RandomSEQ, and ConsensusSEQ are plotted in figures 3 (JC69) and 4 (GTR) with different read depths in different panels.The results are qualitatively similar for both the GTR and JC69 models across all methods (including distAngsd-geno, distAngsd-nuc, RandomSEQ, and ConsensusSEQ, and the previously mentioned AmbiguityGT and NoAmbiguityGT methods see fig. 2). However, the biases tend to be larger under the GTR model due to the asymmetric nature of the GTR model. We summarize the common features of the results of distAngsd-geno, distAngsd-nuc, RandomSEQ, and ConsensusSEQ under both nucleotide substitution models as follows:Both distAngsd-geno and distAngsd-nuc have higher accuracy and precision (smaller biases and variances) than RandomSEQ and ConsensusSEQ. This is especially clear for low depth scenarios (figs. 3 and 4) and shorter genome length scenarios (see also supplementary figs. S13 and S14, Supplementary Material online). RandomSEQ and ConsensusSEQ are strongly affected by errors at lower sequencing depths. This sensitivity to errors also affects RandomSEQ at higher sequencing depths, while the performance of ConsensusSEQ improves strongly with increasing depth. It should be noted that when the mean read depth is small and the base calling error is relatively large (relative to the true genetic divergence), both of the new methods can still be biased. However, these biases are much smaller than those observed for the previous methods (e.g., see fig. 4). We refer readers to supplementary Section S4, Supplementary Material online for a general discussion on bias analyses.As expected, since the new methods take advantage of the full information of the sequencing data (see discussion in supplementary Section S5.1, Supplementary Material online), they also have the smallest variance and the smallest mean squared error (MSE, see fig. 5, and supplementary fig. S15, Supplementary Material online for shorter genome scenarios). Furthermore since distAngsd-geno also uses prior knowledge of ploidy level, it always has the least variance in our diploid simulations.
Fig. 5.
The (scaled) mean squared errors for six different methods (distAngsd-geno, distAngsd-nuc, RandomSEQ, ConsensusSEQ, AmbiguityGT, and NoAmbiguityGT) in the simulation results. True genetic distance d: (a) . (b) . (c) . , , see figure 1. The default simulation setting was applied. Solid lines correspond to the results of previous methods, while the dashed ones represent those of the two proposed methods.
The (scaled) mean squared errors for six different methods (distAngsd-geno, distAngsd-nuc, RandomSEQ, ConsensusSEQ, AmbiguityGT, and NoAmbiguityGT) in the simulation results. True genetic distance d: (a) . (b) . (c) . , , see figure 1. The default simulation setting was applied. Solid lines correspond to the results of previous methods, while the dashed ones represent those of the two proposed methods.Supplementary figures S13 and S14, Supplementary Material online correspond to figures 3 and 4, with the only difference being the number of variable sites used in the actual simulation (1 MB in the main text, MB in Supplementary material online).3. ngsDistngsDist is based on a model which measures pairwise differences and does not distinguish between different types of nucleotide substitutions, we therefore compare the results of ngsDist with disAngsd-geno only under the symmetric JC69 model. ngsDist is also the only previous method that makes use of genotype likelihoods. We should also emphasize that ngsDist was not developed for phylogenetic analysis and the results presented is therefore not the recommended scenarios for ngsDist. We therefore present the result separately (fig. 6). With an average read depth of 1, ngsDist overestimates genetic distance d when a relatively small true d is simulated (35-fold difference when true ). This is due to ngsDist averaging over the posterior genotype distribution when calculating genetic distances (eq. 2 in Vieira et al. 2015). Averaging over the posterior leads to overestimation of genetic distances when the true distances are small, because uncertainty regarding the genotype is effectively interpreted as a high probability of nucleotide differences. For example, for a uniform prior on diallelic genotypes and with no data, so that only the prior contributes to the posterior, the expected number of nucleotide differences per site is 0.5. If the sequencing depth is low, the genetic distance will therefore be biased towards large values especially when the true genetic distance is small. Incorporating statistical uncertainty by averaging over a posterior can, in general, lead to highly biased estimates and will be very sensitive to the choice of prior. DistAngsd-geno remains approximately unbiased but has increasing variance as depth decreases. In contrast, ngsDist has an obvious depth-dependent bias in the tested scenarios. As the depth becomes smaller, the genotype prior increasingly influences the estimated genetic distance (see fig. 7 for the genome length 1 Mb case and supplementary fig. S6, Supplementary Material online for the genome length Mb scenario).
Fig. 6.
Comparison of estimated genetic distances by distAngsd-geno and ngsDist for different true distances ranging from to 1. The JC69 model was used. , , see figure 1. At each true d value, 200 replicates were simulated and estimated. The default simulation setting was applied and the average read depth is set to 1. Blue points: distAngsd-geno. Red points: ngsDist.
Fig. 7.
Comparison of estimated genetic distances by ngsDist and distAngsd-geno for different read depths ranging from to 20. (a) Genetic distance was estimated by ngsDist. (b) Genetic distance was estimated by distAngsd-geno. JC69 model was used. True genetic distance (d) was assumed to be , , , see figure 1. The default simulation setting was applied.
Comparison of estimated genetic distances by distAngsd-geno and ngsDist for different true distances ranging from to 1. The JC69 model was used. , , see figure 1. At each true d value, 200 replicates were simulated and estimated. The default simulation setting was applied and the average read depth is set to 1. Blue points: distAngsd-geno. Red points: ngsDist.Comparison of estimated genetic distances by ngsDist and distAngsd-geno for different read depths ranging from to 20. (a) Genetic distance was estimated by ngsDist. (b) Genetic distance was estimated by distAngsd-geno. JC69 model was used. True genetic distance (d) was assumed to be , , , see figure 1. The default simulation setting was applied.
Inference based on Experimental Data
To compare the methods on real data, we used a previously published data set of RADseq data from oak trees (Fitz-Gibbon et al. 2017). We estimate neighbor joining trees using genetic distances based on distAngsd-geno and two more popular previous methods, ConsensusSEQ and AmbiguityGT under the JC69 model (fig. 8). We used the JC69 model to facilitate a more fair comparison among methods.
Fig. 8.
Neighbor-Joining Tree of 83 oak samples inferred using genetic distances estimated by (a) distAngsd-geno (b) ConsensusSEQ , and (c) AmbiguityGT. All inferences are based on the JC69 model. The colors of the leaf-node labels correspond to the species shown in the legend. All three trees are estimated as unrooted trees but roots are placed between the known outgroup Quercus kelloggii and other oak species.
Neighbor-Joining Tree of 83 oak samples inferred using genetic distances estimated by (a) distAngsd-geno (b) ConsensusSEQ , and (c) AmbiguityGT. All inferences are based on the JC69 model. The colors of the leaf-node labels correspond to the species shown in the legend. All three trees are estimated as unrooted trees but roots are placed between the known outgroup Quercus kelloggii and other oak species.We only compare distAngsd-geno to the ConsensusSEQ and AmbiguityGT methods since (1) they perform equally as well or better than RandomSEQ and NoAmbiguity in simulations. (2) ngsDist overestimates the pairwise genetic distances and is not appropriate for phylogenetic inference (figs. 6 and 7).Distances estimated using distAngsd-geno results in trees that are more concordant with existing taxonomic assignments, as they are closer at identifying Quercus berberidifolia, Quercus durata var. gabrielensis and Quercus durata var. durata (fig. 8) as monophyletic groups.To further examine the relative performance of the methods on real data, we down-sample to obtain mean read depths of 10, 5, 1, , , and . The original mean depth per sample was with a lowest mean depth of any sample of . We then, again, estimated neighbor-joining trees using distances inferred using distAngsd-geno, ConsensusSEQ, and AmbiguityGT. To compare the compatibility of the results with the existing taxonomy, we developed a compatibility measurement, m, that measures the amount of taxonomic compatibility observed in the trees, i.e., higher values correspond to better performance (see supplementary Section S8, Supplementary Material online). Results for all scenarios and methods can be found in table 1.
Table 1.
The Compatibility Measurement m between the Inferred Pairwise Distance Trees and the Prior Species Knowledge.
Downsampling Depth
Full
10
5
1
0.75
0.5
0.25
Criterion |Ω1∩Ω2|=0
distAngsd-geno
m
17
17
17
16
16
17
13
ConsensusSEQ
17
17
17
16
16
16
13
AmbiguityGT
16
16
16
16
16
16
12
Criterion |Ω1∩Ω2|≤1
distAngsd-geno
m
75
75
73
74
73
70
57
ConsensusSEQ
74
75
76
72
70
66
58
AmbiguityGT
75
75
74
74
70
67
56
Note:—The trees are inferred based on the original samples as well as the downsampled samples. The original 83 samples have mean read depth with lowest depth , and each sample is downsampled to mean read depth 10, 5, 1, , and .
The Compatibility Measurement m between the Inferred Pairwise Distance Trees and the Prior Species Knowledge.Note:—The trees are inferred based on the original samples as well as the downsampled samples. The original 83 samples have mean read depth with lowest depth , and each sample is downsampled to mean read depth 10, 5, 1, , and .Clearly, trees estimated using distAngsd-geno are more compatible with existing taxonomy than both ConsensusSEQ and AmbiguityGT, particularly as the read depth decreases.
Discussion
We have here presented two novel methods for estimating genetic distances: distAngsd-geno and distAngsd-nuc. Both of these methods incorporate the uncertainty that is inherently associated with high-throughput sequencing data. The uncertainty is either modeled through standard diploid genotype likelihoods or, equivalently, through the application of the base quality scores in a framework that facilitates inferences in a polyploid or unknown ploidy context. Both methods can estimate the genetic distance between samples with high sequencing error rates and low average read depth.The key characteristic of both methods is to decompose the likelihood optimization process into two parts: (1) estimation of global pseudo-observations (i.e. joint distributions of genotypes or nucleotides between each pair of samples); (2) the maximum-likelihood estimation of genetic distance (and other related parameters) based on the previously calculated pseudo-observations. This decomposition reduces the complexity of the maximum-likelihood estimation.However, we note that this decomposition can introduce biases (See supplementary Section S4, Supplementary Material online), particularly when the sequences compared are short. The proposed methods are not intended for data consisting of very short sequences. However, the bias for our proposed methods is still smaller—by a large margin—compared to any previous method.We also simulated and inferred genetic distances based on tree topologies different from figure 1 (See supplementary fig. S7, Supplementary Material online). Such topologies can occur due to incomplete lineage sorting, i.e. when coalescence times are shorter between alleles from different individuals than from the same individual. We find that for a realistic range of divergence levels, the results of the new methods still offer higher accuracy than the previous methods (See supplementary Section S6, Supplementary Material online). The biases of the previous methods for topologies resulting from incomplete linage sorting are similar to those observed for the standard topology in figure 1, and the general conclusions from the standard simulation scenario carries over to the case of incomplete lineage sorting. One exception is for extremely large divergence levels (see supplementary fig. S10, Supplementary Material online), where the new methods will be increasingly biased.While the distAngsd-geno results presented here assume diploidy, it can in principle be extended to any ploidy level. However, unlike the distAngsd-geno, the distAngsd-nuc does not require prior knowledge of ploidy level, and the size of joint distribution matrix of nucleotides N remains a regardless of ploidy level. We observe that distAngsd-geno produces more accurate results for our diploid simulation scenarios but speculate that in the context of unknown ploidy distAngsd-nuc will yield more robust estimates and would therefore be more suitable.Click here for additional data file.
Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043
Authors: Charleen Gaunitz; Antoine Fages; Kristian Hanghøj; Anders Albrechtsen; Naveed Khan; Mikkel Schubert; Andaine Seguin-Orlando; Ivy J Owens; Sabine Felkel; Olivier Bignon-Lau; Peter de Barros Damgaard; Alissa Mittnik; Azadeh F Mohaseb; Hossein Davoudi; Saleh Alquraishi; Ahmed H Alfarhan; Khaled A S Al-Rasheid; Eric Crubézy; Norbert Benecke; Sandra Olsen; Dorcas Brown; David Anthony; Ken Massy; Vladimir Pitulko; Aleksei Kasparov; Gottfried Brem; Michael Hofreiter; Gulmira Mukhtarova; Nurbol Baimukhanov; Lembi Lõugas; Vedat Onar; Philipp W Stockhammer; Johannes Krause; Bazartseren Boldgiv; Sainbileg Undrakhbold; Diimaajav Erdenebaatar; Sébastien Lepetz; Marjan Mashkour; Arne Ludwig; Barbara Wallner; Victor Merz; Ilja Merz; Viktor Zaibert; Eske Willerslev; Pablo Librado; Alan K Outram; Ludovic Orlando Journal: Science Date: 2018-02-22 Impact factor: 63.714
Authors: Pontus Skoglund; Cosimo Posth; Kendra Sirak; Matthew Spriggs; Frederique Valentin; Stuart Bedford; Geoffrey R Clark; Christian Reepmeyer; Fiona Petchey; Daniel Fernandes; Qiaomei Fu; Eadaoin Harney; Mark Lipson; Swapan Mallick; Mario Novak; Nadin Rohland; Kristin Stewardson; Syafiq Abdullah; Murray P Cox; Françoise R Friedlaender; Jonathan S Friedlaender; Toomas Kivisild; George Koki; Pradiptajati Kusuma; D Andrew Merriwether; Francois-X Ricaut; Joseph T S Wee; Nick Patterson; Johannes Krause; Ron Pinhasi; David Reich Journal: Nature Date: 2016-10-03 Impact factor: 49.962