Literature DB >> 29196497

Powerful Inference with the D-Statistic on Low-Coverage Whole-Genome Data.

Samuele Soraggi¹, Carsten Wiuf², Anders Albrechtsen³.

Abstract

The detection of ancient gene flow between human populations is an important issue in population genetics. A common tool for detecting ancient admixture events is the D-statistic. The D-statistic is based on the hypothesis of a genetic relationship that involves four populations, whose correctness is assessed by evaluating specific coincidences of alleles between the groups. When working with high-throughput sequencing data, calling genotypes accurately is not always possible; therefore, the D-statistic currently samples a single base from the reads of one individual per population. This implies ignoring much of the information in the data, an issue especially striking in the case of ancient genomes. We provide a significant improvement to overcome the problems of the D-statistic by considering all reads from multiple individuals in each population. We also apply type-specific error correction to combat the problems of sequencing errors, and show a way to correct for introgression from an external population that is not part of the supposed genetic relationship, and how this leads to an estimate of the admixture rate. We prove that the D-statistic is approximated by a standard normal distribution. Furthermore, we show that our method outperforms the traditional D-statistic in detecting admixtures. The power gain is most pronounced for low and medium sequencing depth (1-10×), and performances are as good as with perfectly called genotypes at a sequencing depth of 2×. We show the reliability of error correction in scenarios with simulated errors and ancient data, and correct for introgression in known scenarios to estimate the admixture rates.

Entities: CellLine Chemical Disease Gene Species

Keywords: ABBA–BABA test; ANGSD; D-statistic; admixture; four-population test; gene flow; introgression; low depth; next-generation sequencing data; tree test

Mesh：

Year: 2018 PMID： 29196497 PMCID： PMC5919751 DOI： 10.1534/g3.117.300192

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

An important part of the understanding of a population’s history and its genetic variability is past contacts with other populations. Such contacts could result in gene flow and admixture between populations and leave traces of a population’s history in genomic data. In fact, the study of gene flow between populations has been used to uncover demographic histories of many species, including human and archaic human populations (Patterson ; Raghavan , 2015; Green ; Reich , 2010, 2011; Wall ; Rasmussen , 2014; Lalueza-Fox and Gilbert 2011; Skoglund ). The study of the history of human populations using both modern and ancient human genomes has become increasingly topical with the recent availability of new high-throughput sequencing technologies (Stoneking and Krause 2011) such as next-generation sequencing (NGS) (Black ). These technologies have made it possible to obtain massive quantities of sequenced DNA data even from ancient individuals, including an Anzick Clovis individual from the late Pleistocene (Rasmussen ), a Neandertal individual (Green ), and a Paleoamerican individual (Chatters 2000). There are many different methods for inferring and analyzing admixture events using genome-scale data. Popular methods such as STRUCTURE (Pritchard ) and ADMIXTURE (Alexander ) estimate how much a sampled individual belongs to K clusters, which can often be interpreted as the individual’s admixture proportional to the K populations. However, these approaches are not appropriate for detecting ancient gene flow and do not work well with a limited number of individuals per population. A recent alternative to the above methods is the D-statistic. The D-statistic is based on the di-allelic patterns of alleles between four groups of individuals, and provides a way to test the correctness of a hypothetical genetic relationship between the four groups (see Figure 1). A variant of the D-statistic (called the F4-statistic) was first used in Reich to show that subgroups of the Indian Cline group are related to external populations in terms of gene flow. The amount of gene flow can also be estimated using the F4-statistic (Wall ).

Figure 1

Tree topology for the D-statistic. Hypothesis of genetic relationship between four populations,

Tree topology for the D-statistic. Hypothesis of genetic relationship between four populations, In a pivotal study (Green ), the D-statistic was used to show that three non-African individuals were more genetically similar to the Neandertal sequence than African San and Yoruban individuals were. Moreover, it has been shown that the East Asian populations have greater amounts of Neandertal shared genetic material (Wall ). Based on use of the D-statistic on many Old World and Native American individuals, gene flow into some Native American populations has been suggested, for instance, evidence of admixture from Australasian populations into New World populations (Raghavan ; Skoglund ). In another study, the affinity between the Anzick genome and the Native American genome was analyzed with the D-statistic to compare different hypotheses regarding their ancestry (Rasmussen ). Using the D-statistic, it has been reported based on the remains of an individual from the Mal’ta population in south-central Siberia, that this population contributed to the gene pool of modern-day Native Americans, with no close affinity to East Asians (Raghavan ). The first use of the D-statistic was based on a sampling approach that allowed the test to be performed without the need to call single-nucleotide polymorphisms (SNPs) or genotypes (Green ). This approach is still widely used, and the available computational tools implementing it include the doAbbababa program ANGSD (Analysis of Next Generation Sequencing Data) (Nielsen ) (supporting low-depth NGS data) and the fourpop program of TreeMix (Pickrell and Pritchard 2012) (supporting di-allelic genotype data and microsatellite data). The program qpDstat of ADMIXTOOLS (Patterson ) computes the D-statistic from populations with multiple individuals from di-allelic genotype data. The program doAbbababa relies on sampling one base from every locus, using the sequenced reads to define the sampling probabilities. The D-statistic is often applied to scenarios involving ancient individuals, which are commonly affected by deamination, i.e., the natural degradation of DNA after death of the organism, which leads to there being few molecules remaining in ancient specimens and often results in a low sequencing depth. Furthermore, deamination can cause high frequency of specific transitions of the bases, low quality of the SNPs, and very low depth of the data. The current methods for the D-statistic can be very ineffective and unreliable when applied to ancient data, since both sampling and genotype calling procedures are subject to high uncertainty. The focus of this paper is to address the problems stated above. We propose a D-statistic—implemented in the program doAbbababa2 of ANGSD—that supports low-depth NGS data and is calculated using all reads of the genomes, and therefore allows for the use of >1 individual per group. We prove that the improved D-statistic is approximated by a standard normal distribution and, using both simulated and real data, we show how this approach greatly increases the sensitivity of gene-flow detection and thus improves the reliability of the method, in comparison with sampling a single read. We also illustrate that it is possible to correct for type-specific error rates in the data, so that the reads used to calculate the D-statistic will not bias the result owing to type-specific errors. Moreover, our improved D-statistic can remove the effects of known introgression from an external population into , or and indirectly estimates the admixture rate.

Materials and Methods

This section introduces the traditional D-statistic and the theory that leads to its approximation as a normal distribution. Thereafter, we explain how to extend the D-statistic to use multiple individuals per population, without genotype calling and while still preserving the same approximation property of the D-statistic. Last, we will show how to deal with type-specific errors and introgression from a population external to the tree topology.

Standard D-statistic

The objective of the D-statistic is to assess whether the tree of Figure 1 that relates four present-day populations, is correct. When is an outgroup, the correctness of the tree corresponds to the absence of gene flow between and either or This objective is achieved by developing a statistical test based on the allele frequencies and a null hypothesis that the tree is correct and without gene flow. We limit the explanation to a di-allelic model with alleles A and B to keep the notation uncluttered; the extension to a four-allele model is fairly straightforward. We do not make an assumption on which allele is derived, but we assume that B is the nonoutgroup allele. Population is an outgroup that splits off from the other branches at the root of the tree. For each population in the tree, we consider the related allele frequencies For each population the observed data consist of a certain number of individuals sequenced without error. At every locus i there are sequenced bases observed from aligned reads. We consider only the M loci for which there is at least one sequenced base from aligned reads in all four groups. Moreover, in this theoretical treatment, we allow the number M of loci to grow to infinity. Assume that at a locus i the allele frequencies in the four groups of individuals and let be an unbiased estimator of such as the relative frequencies of the allele A in each population. The D-statistic focuses on di-allelic sites where the differences are observed within the pairs and Consider a random allele drawn from each of the four groups of genomes and the resulting combination of the four alleles. We are interested in two patterns: ABBA, meaning that we have the same allele in populations and , and another allele from the individuals in populations and BABA, where one allele is shared by individuals in populations and , and the other allele by individuals in populations and The tree of Figure 1 is subject to independent genetic drifts of the allele frequencies along each of its branches. Consequently, the probabilities of ABBA and BABA patterns, which are conditional only on population frequencies, would rarely be the same. Therefore, it is interesting to focus on their expected values with respect to the frequency distribution:To verify that allele A is shared between genomes in as often as it is shared between genomes in we require as null hypothesis that at each ith locus the probability (1) equals the probability (2). This condition can be written as:where the expectation is the difference between (1) and (2). Using the empirical frequencies of the alleles as unbiased estimators for the population frequencies, we define the D-statistic as the following normalized test statistic:The values and are the numerator and denominator, respectively. Using to normalize the numerator leads to the interpretation of , as the difference over all loci of the probabilities of having an ABBA or a BABA event, conditional on the assumption that only ABBA or BABA events are possible. Appendix 1 shows that under the hypothesis , the test statistic can be approximated by a standard normal variable. Specifically, the approximation holds with a proper rescaling, since would narrow the peak of the Gaussian around zero for large M (note that this rescaling is an embedded factor in the estimation of the variance of using the block jackknife method (Busing ) in the software implementation of ANGSD). More generally, the treatment could be extended to blockwise independence of the allele counts to take into account linkage disequilibrium. The convergence results of Appendix 1 apply to the following special cases of the D-statistic: the original D-statistic calculated by sampling a single base from the available reads (Green ) to estimate the sampling probabilities; the D-statistic evaluated by substituting the frequencies with the estimated population frequencies defined in equation 4 for multiple individuals (see Appendix 2); the D-statistic evaluated only over loci where the outgroup is mono-allelic, such as when the chimpanzee is set as an outgroup to test for gene flow from the Neandertal population into modern out-of-Africa populations (Green ).

Multiple individuals per group

The D-statistic defined in equation 3 is calculated using population frequencies. In the case where only one individual per population is chosen, it is easy to get an estimate of the populations’ frequencies by simply counting observed bases. In what follows, we are interested in getting a meaningful estimate of the frequencies in the case where we want to use all the available sequenced individuals without calling genotypes. This is done using a weighted sum of the estimated allele frequencies for each individual in every group. Assume that given the allele frequency at locus i for the jth population, we model the observed data as independent binomial trials with parameters and where is the number of trials. We take the frequency of allele A in the reads of each jth population as an unbiased estimator of the population frequency. Let be the number of individuals in population j. For the first individual within the jth population, let be the frequency of allele A at locus i, with estimator the frequency of allele A for . Define as the weighted sumwhere each is a weight that is proportional to a quantity depending on , the number of sequenced bases at locus i for individual l:The estimator in equation (4) is an estimator for the jth population frequency at locus i with minimal variance (the derivation of the weights as minimizer of the frequency estimator’s variance can be found in Appendix 2). Substituting the estimated population frequencies in equation (3) with the weighted estimators determined by equation (4), it is possible to account for multiple individuals per population. Since the weighted estimator is also unbiased, it does not affect the approximation of the D-statistic with a standard normal distribution. A first application of this method has been the estimation of population frequencies to reveal signatures of natural selection (Li ). The weights have a strong impact on loci with a low number of reads, where they assume a low value, leading to a stronger impact of population frequency estimated from high-depth individuals in each group.

Error estimation and correction

The study of genetic relationships between populations often involves the use of ancient genomes that are subject to high error rates. We introduce error correction following the idea illustrated in Orlando , to take errors into account and to obtain a more reliable D-statistic. Estimation of type-specific error rates is possible using two individuals (one affected by type-specific errors and one sequenced without errors) and an outgroup, denoted by T, R, and O, respectively. Those individuals are considered in the tree ((T,R),O) (see Appendix 3). After the error matrix is estimated for each individual, it is possible to obtain error-adjusted frequencies of alleles in locus i through the following matrix–vector product:where and are the true and observed vectors of allele frequencies locus i, respectively, and is the 4 × 4 type-specific error matrix whose entry is the probability of observing a base of type b when the true base is of type a. Note that estimating and correcting the allele frequencies is a process best applied before the calculation of weighted allele frequencies for multiple individuals. Using error-corrected estimators of the population frequencies to calculate the D-statistic does not prevent it from being approximated by a standard normal distribution, because the error-corrected estimators are unbiased for the true population frequency (see Appendix 3). According to equation (6), one is able to perform the error correction at every locus for every individual. In this way, it is possible to build a weighted frequency estimator for each population after the error correction. However, the implementation of equation (6) involves the inversion of a matrix and a matrix–vector multiplication at every locus for each individual in all populations. Moreover, as a consequence of the error estimation, there might be negative entries of the inverse , which might cause the product of equation (6) to result in negative entries in the vector Consequently, we decided to implement a less precise version of the error correction that is applied to each whole group of individuals instead of every single individual. Assume that the populations’ frequencies have been estimated from equation (4), and that it is possible to estimate the probabilities of the 256 allele combinations AAAA, AAAC,…, TTTT between the four populations. In each jth population of individuals, let be the mean of their error matrices. Then build the error matrix for the four groups, This has dimension 256 × 256, and its entry where and are two possible allele patterns of the four populations, is defined as the probability of observing instead of assuming independence of the error rates between the four populations:The equation states that the change from pattern to happens with a probability that is the product of the error rates of each population. Note that each error rate is the sum of the error rates of each individual in that population, and so does not take into account how every individual is weighted according to the frequency estimator of equation (4). Let be the vector of length 256 that contains the estimated probabilities of observing allele patterns between the four populations, affected by type-specific errors. Denote by the vector containing the estimated probabilities of patterns not affected by errors. With an approach similar to the one leading to equation 6, it holds thatUsing the error-corrected estimated probabilities of combinations of alleles of the type ABBA and BABA, it is then possible to calculate the numerator and denominator of the D-statistic. This procedure is fast, but it has the drawback that in every group the error matrix takes into account every individual within a population without its associated weight from equation 5. This means that the portion of alleles related to individuals with lower weights might undergo an excessive error correction.

Correction for introgression from an external population

The improved D-statistic proves to be very sensitive to introgression, but a hypothesized genetic relationship might be rejected because of an admixture involving a population not part of the considered tree. We propose a way to correct this issue and obtain an estimate of the amount of introgression when the source of gene flow is available. In this section, we analyze the case in which the null hypothesis might be rejected in favor of the alternative hypothesis, but the cause of rejection is not the presence of gene flow between and either or but instead gene flow between an external population and either or Consider the case of Supplemental Material, Figure S3A in File S1, where the null hypothesis might be rejected because of introgression from an external population into with rate α. We assume that the external sample for represents the population that is the source of introgression. Consider to be the population subject to introgression from and define as the same population when it has not undergone admixture. The four-population subtrees of interest (see Figure S3 in File S1) are which includes the four-population tree excluding the admixing population; where the population source of introgression replaces the admixed population; and in which has not yet undergone admixture and therefore reflects the null hypothesis H0. Consider the patterns of four alleles for the three subtrees mentioned above, whose estimated probabilities are respectively denoted as , and Using the frequency estimators of equation (4), it is possible to estimate and but not since is not an observed population. Assume that testing with the D-statistic on the tree leads to a rejection of because the allele frequencies of are altered by the gene flow from In fact, any combination of four alleles observed in has probabilityBy solving for it follows thatNote that if the admixture proportion α is known, then admixture correction is possible. If α is not known and we assume the tree is accepted for [D] = 0, where is the D-statistic related to the tree then α can be estimated. In this case, has to be determined for all values of α, and the correct value will be the one for which [D] = 0. In this way, an estimate of the admixture rate was obtained for the topology of Figure S3A in File S1.

Simulations

Different scenarios have been generated using the coalescent simulator msms (Ewing and Hermisson 2010) to reproduce the trees of Figure 2, A–C, in which times are in units of generations. Each topology has been simulated 100 times for a constant population size of Mutation and recombination of the simulations are consistent with human data (Ewing and Hermisson 2010). Migrations and admixtures, respectively, for the scenarios of Figure 2, A and C, were simulated with specific options of msms. For each simulation, we generated 200 regions of size 5 Mb for each individual and considered only variable sites, except for the case of Figure 2B, where the null hypothesis was affected by type-specific errors on some of the individuals. We used a type-specific error of e = 0.005 in populations The choice of the region size is compatible with the one estimated for applications with human genomes in Rasmussen . The regions are used by the jackknife estimator (Busing ) to estimate the SD of the D-statistic accommodating the nonindependence of loci.

Figure 2

Simulated scenarios. (A) Simulation of a tree in which migration occurs from population to The variable m is the (rescaled) migration rate varying from 0, 8, 16, 24, 32, 40, up to 280 with steps of size 20. Expressed as a percentage, the migration rate varies from 0, 0.02, 0.04, 0.06, 0.08, 0.1% up to 0.7%. Command: msms -N 10000 -ms 40 200 -I 4 10 10 10 10 0 -t 100 -r 100 1000 -em 0.2 3 1 $m -em 0.201 3 1 0 -ej 0.5 1 2 -ej 0.75 2 3 -ej 1 3 4. The same command line has been applied with the option -I 4 40 40 40 40 0 to generate populations of 20 diploid individuals, used to study the power of the method using subsets of 1, 2, 5, 10, and 20 individuals of such populations. (B) Simulation of a tree in which no migration occurs, but type-specific errors on some individuals provide a rejection when testing for correctness of the null hypothesis. Command: msms -N 10000 -ms 8 200 -I 4 2 2 2 2 0 -t 100 -r 100 1000 -ej 0.5 1 2 -ej 0.75 2 3 -ej 1 3 4. (C) Simulation of a tree in which admixes with with an instantaneous unidirectional admixture of rate α = 0.1. In this case, we expect the null hypothesis to be rejected since will alter the counts of ABBA and BABA patterns, but the alternative hypothesis does not involve gene flow with Command: msms -N 10000 -ms 50 200 -I 5 10 10 10 10 10 0 -t 100 -r 100 1000 -es 0.1 1 0.9 -ej 0.2 6 5 -ej 0.25 1 2 -ej 0.5 2 3 -ej 0.75 3 4 -ej 30 4 5.

Sequenced human populations

For the real data scenarios of Figure 3, A–C we used Illumina-sequenced individuals from several human populations. See Table 1 for an overview of the data. The depth of each individual has been calculated using the program doDepth of ANGSD. The Peruvian individuals used in our study were unadmixed with proportion ≥0.95. Estimation of the admixture proportions of these individuals was performed using ADMIXTURE (Alexander ). In each individual, only the autosomal regions of all individuals were taken into consideration, and bases were filtered out according to a minimum base quality score of 20 and a mapping quality score of 30. Type-specific error estimates for the Saqqaq, Mi’kmaq, and French individuals were performed using the program doAncError of ANGSD, where the chimpanzee was used as the outgroup and the consensus sequence of human NA12778 as an error-free individual (see Figure S4 in File S1 for the bar plot of the estimates of the type-specific error).

Figure 3

Real data scenarios. (A) Tree representing the southwestern European migration into the Americas during the European colonization. (B) Tree representing two independent migrations into northwestern Canada and Greenland. (C) Tree representing the presence of Neandertal genome in a modern non-African population, specifically the Han Chinese.

Table 1

List of the genomes used in real data scenarios

Genome ID	Major Population Division	Depth	Reference
HG01923	Peruvian (PEL)	6.3×	Altshuler et al. (2010)
HG01974	Peruvian (PEL)	11.9×	Altshuler et al. (2010)
HG02150	Peruvian (PEL)	7.3×	Altshuler et al. (2010)
HG02259	Peruvian (PEL)	6.5×	Altshuler et al. (2010)
HG02266	Peruvian (PEL)	3.8×	Altshuler et al. (2010)
NA18526	Han Chinese (CHB)	6.6×	Altshuler et al. (2010)
NA18532	Han Chinese (CHB)	7.3×	Altshuler et al. (2010)
NA18537	Han Chinese (CHB)	2.9×	Altshuler et al. (2010)
NA18542	Han Chinese (CHB)	7.3×	Altshuler et al. (2010)
NA18545	Han Chinese (CHB)	6.2×	Altshuler et al. (2010)
NA06985	CEPH (CEU)	12.8×	Altshuler et al. (2010)
NA06994	CEPH (CEU)	5.5×	Altshuler et al. (2010)
NA07000	CEPH (CEU)	9.4×	Altshuler et al. (2010)
NA07056	CEPH (CEU)	4.9×	Altshuler et al. (2010)
NA07357	CEPH (CEU)	5.7×	Altshuler et al. (2010)
NA12778	CEPH (CEU)	6.9×	Altshuler et al. (2010)
NA18501	Yoruba (YRI)	6.4×	Altshuler et al. (2010)
NA18502	Yoruba (YRI)	4.9×	Altshuler et al. (2010)
NA18504	Yoruba (YRI)	10.1×	Altshuler et al. (2010)
NA18505	Yoruba (YRI)	6.1×	Altshuler et al. (2010)
NA18507	Yoruba (YRI)	3×	Altshuler et al. (2010)
HGDP00778	Han Chinese (CHB)	23.4×	International HapMap Consortium (2003)
DNK02	Dinka	25.8×	Meyer et al. (2012)
HGDP00927	Yoruban (YRI)	28×	International HapMap Consortium (2003)
AltaiNea	Neandertal	44.9×	Green et al. (2010)
pantro2	Chimpanzee	—	Kent et al. (2002)
saqqaq	Saqqaq	15.7×	Rasmussen et al. (2010)
MARC1492	Ancient Canadian Dorset	1.1×	Raghavan et al. (2014)
MARC1492	(Mi’kmaq—New England)		Raghavan et al. (2014)
HGDP00521	French	23.8×	International HapMap Consortium (2003)

CEPH

Utah Resident with Northern and Western European Ancestry

Data availability

The real data used is specified in Table 1. The simulated data has been produced using msms (Ewing and Hermisson 2010). The msms code for simulations is in the caption of Figure 2. From the output of msms, NGS pileup files were simulated with the tool msToGlf integrated in ANGSD (Nielsen ). The one-sample D-statistic and the extended D-statistic implemented in this paper were performed on both real and simulated data with the program doAbbababa2 of ANGSD. ANGSD can be downloaded from https://github.com/ANGSD/angsd. A detailed guide including a tutorial for the program doAbbababa2 can be found at http://www.popgen.dk/angsd/index.php/Abbababa2.

Results and Discussion

In the study of our results, we compare different implementations of the D-statistic on simulated and real scenarios. We briefly define as the extended D-statistic that we implemented, as the D-statistic calculated by sampling one sequenced base per locus (Green ), and the D-statistic calculated with equation (3) using the allele frequencies estimated from the true genotype (the true genotype is only available in the case of simulated data). The D-statistic is computed on blocks of 5 Mb, to ensure that no block is subject to linkage disequilibrium from the other blocks, and that the number of loci in each block is large enough to make the D-statistic approach the approximation by a standard normal distribution (see Appendix 1). The use of blocks allows for estimation of a proper normalization constant for the D-statistic using the m-block jackknife method (Busing ). The threshold for rejection of the null hypothesis is set to a p-value of 0.001, corresponding approximately to the two-tailed acceptance region [−3, 3]. The formula for calculating the D-statistic is given in equation (3). Its current implementations include those in Patterson and Nielsen , with sampling of one base per locus from only one individual in each population. Such an implementation is computationally fast but has many drawbacks: when genomes are sequenced at low or medium depth (1–10×), sampling one base might lead to a process with high uncertainty; base transition errors might affect the sampling of the base, adding more uncertainty; only one individual per population is used; for a chosen individual from a population, the reads are not used to evaluate the D-statistic, but only to sample one base. We have proposed a solution to these problems with the extended version of the D-statistic implemented in ANGSD, and we will show in the following results how all the problems mentioned above are addressed.

Comparison of power among the different methods

Using simulated and real data, we compare the different types of D-statistics to study their sensitivity to gene flow. We illustrate how the improved D-statistic is not affected by the issues faced by the current D-statistic and how it even reaches the performances of the D-statistic based on the true genotype at a rather low sequencing depth. To evaluate the power of the different methods, we first simulated NGS data based on coalescent simulations with mutation and recombination rates consistent with human populations (Ewing and Hermisson 2010). We simulated without sequencing error four populations with a varying amount of migration from to (see Figure 2A) and applied the D-statistic based on five individuals from each population for two different sequencing depths. Figure 4, A and B show the power of the methods for depths 0.2× and 2×. Here, the power is the rejection rate of the null hypothesis when there is a migration from to in the tree

Figure 4

Detection of admixture and migration. (A and B) Rejection rate of the null hypothesis as a function of the migration rate in the tree where a migration from to occurs. The yellow and blue solid lines represent, respectively, the power of the method related to and The yellow dashed line represents the rejection rate when the genotypes of the five individuals in each population are known and thus equation (3) can be applied. The blue dashed line illustrates the power of the method when only one genome per population has known genotypes. performs almost as well as knowing the true genotypes already with depth 2×. (C) Value of lack square) and values of (black circles) using, respectively, five genomes per population and one from each population. Each D-statistic shows its associated SD multiplied by 1 and 3. On the left side of the graph, the stick men represent for each column the composition of the group by number of individuals. The extended D-statistic proves to be effective in detecting gene flow even when the simulated depth is very low. For the scenario with sequencing depth 0.2×, detects hardly any cases of migration from whereas reacts with an acceptable rejection rate, even for a migration rate as low as Of course, such a very low depth does not allow the D-statistic to perform as well as In the case of sequencing depth 2×, does not always detect the alternative hypothesis and has also a considerable delay in terms of the migration rate necessary to do that, when compared with Furthermore, follows almost exactly the behavior of the power related to This means that with a depth above 2× we can expect the D-statistic to perform as well as knowing the exact genotypes of the data. A deeper analysis to study the effects of using multiple individuals per group is illustrated in Figure S1 in File S1. Here, we simulated again the scenario with depth 0.2×, and compared the use of 1, 2, 5, 10, and 20 individuals per population. The graph shows that using multiple individuals increases the power of the method and at the same time decreases the SD of From 5000 simulations of the null hypothesis at depth 0.2×, we produced the quantile–quantile plot shown in Figure S2 in File S1. Here we can see that, despite us having simulated only 200 blocks of 5 Mb in length for each individual, the D-statistic already shows its asymptotic property of convergence to a standard normal distribution. The powers of and are compared in a real data scenario using Illumina-sequenced modern human populations from the 1000 Genomes Project, with a varying sequencing depth in the range 3 to 13×. We specifically used Peruvian, European, Han Chinese, and African Yoruban individuals to form the tree (((Peruvian,Han Chinese)European)Yoruban) shown in Figure 3A. This scenario represents the southwestern European gene flow into the ancestors of the Native Americans (Raghavan ). Each of the four populations consists of five sequenced individuals when evaluating and a distinct one of those individuals when evaluating five times (see Figure 4C). The extended D-statistic has much lower SE, which corresponds to a smaller p-value than in the case of and therefore a more significant rejection. See Table S1 in File S1 for a better comparison of the values of the different D-statistics. It is worth underlining that the presence of structured populations might lead to false positives, because the structure is not considered in the model. If there is structure within the properties of the D-statistic are preserved. However, if the population was structured prior to the split of H and H, then it will affect the D-statistic.

Error impact and correction

Sequencing or genotyping errors are known to have a large impact on the D-statistic (Orlando ). Using simulation, we show that if the type-specific error rates are known then we can correct the D-statistic accordingly. We simulate the tree under the null hypothesis. However, we add a base A → G error rate of 0.005 in populations and in order to alter the observed number of ABBA and BABA combination of alleles, leading to a possible rejection of the null hypothesis. The plot in Figure 5A represents the estimated distributions of the Z-scores related to before and after error estimation and error correction, for 100 simulations of a tree without any gene flow, where we have also introduced type-specific errors for transitions from allele A to another allele for the individuals in at different rates. The test statistic has high values owing to the error, whereas all simulations fall in the acceptance interval if we perform error correction.

Figure 5

Effect of error estimation and correction. (A) Estimated distributions of the Z-scores related to for the null hypothesis in which and have probabilities 0.005 and 0.01 of transition from base A, respectively. The blue polygon represents the interval where a Z-score would accept the null hypothesis. The red line represents the distribution of Z-scores before type-specific errors are corrected. In blue, we have the Z-scores after correction. (B) Values of in four different cases for the tree (((Saqqaq,Dorset)French)chimpanzee). The black circles are the values of the uncorrected D-statistic, removal of ancient transitions, error correction, error correction and ancient transition removal. The red and blue lines represent the SD and the value they need to reach the threshold of respectively. The uncorrected D-statistic performs poorly because of the errors in the data that cause rejection of the null hypothesis in all simulations. It is remarkable to observe that has good performance even at depth 0.5×. This means that even small error rates in the data make the D-statistic very susceptible to the rejection of H. Therefore, we need to apply error correction to our data. The result is that the Z-scores fall into the acceptance threshold and the null hypothesis is fulfilled. The distribution of corrected Z-scores is not perfectly centered on 0 because of imperfect error correction. The most obvious need for error correction in real applications is in the use of ancient genomes, which have large numbers of errors, especially transitions. To illustrate the effects of errors in real data and our ability to correct for them, we use two ancient genomes that contain high sequencing error rates owing to post mortem deamination. The tree (((Saqqaq,Dorset)French)chimpanzee) of Figure 3B illustrates the migrations to western Canada (Canadian Dorset Mi’kmaq genome) and southwestern Greenland (Saqqaq genome). Owing to the effects of deamination prior to sequencing (Rasmussen ; Raghavan ), the two ancient genomes have high type-specific error rates, as shown in Figure S4 and Table S2 in File S1. The error rates alter the counts of ABBA and BABA patterns, which bias the uncorrected D-statistic. We expect the tree to be true under the null hypothesis, since Saqqaq and Dorset have a recent common ancestor (Raghavan ). In Figure 5B, we compare the extended D-statistic in four cases: first, using observed data; second, removing all transitions, which are related to most of the errors; third, applying error correction; and, last, combining error correction and transition removal. Note that the removal of transitions related to the pairs of alleles A,C and G,T is the current standard technique to avoid high error rates when calculating the D-statistic from damaged low-coverage data. The uncorrected D-statistic rejects the null hypothesis, whereas correction or transition removal gives a nonsignificant test. Error correction performs better than transition removal, providing a value of the D-statistic that is closer to 0 and has smaller SD. Table S3 in File S1 shows the values related to the four D-statistics in this scenario. Figure S5 in File S1 illustrates the effects of increasing and decreasing the removal of error for the base transitions C → G and C → T for one of the Saqqaq, Dorset, and French genomes. This corresponds to adding a value to the estimated error rate matrix of one of the individuals. Observe that the French individual is less affected by the addition or removal of error than the other two individuals. Moreover, all three individuals are more sensitive to the error rate in the case of transversion C → T.

Correction for external introgression

We use simulations of a scenario with external introgression to verify the performance of correction for gene flow in restoring a four-population tree configuration that leads to the acceptance of the null hypothesis H0. In the simulation case, we know the value of α, that is, the amount of introgression; therefore, correction is possible. Thereafter, we use a known genetic relationship involving the Neandertal introgression into out-of-Africa modern individuals in Europe and Asia (Green ; Wall ) to correct for the effect of admixture. In addition, we show that, if we assume the absence of gene flow in the tree topology, we can estimate the amount of introgression and compare it with the estimation involving the original D-statistic tools. For some species, there are introgression events from an external source, which can affect the D-statistic when performing tests for admixture among the species. We performed 100 simulations of the null hypothesis of Figure 2C, for which an external population is admixed with with rate α = 0.1. The plot in Figure 6A shows the estimated distribution of the Z-scores related to the observed and admixture-corrected The observed D-statistic is positive and has Z-scores that reject the null hypothesis. Applying equation (7), we are able to remove the effect of gene flow from The result of removal of the gene flow effect is that the estimated probabilities of ABBA and BABA combinations of alleles are altered, and the resulting calculated values of the D-statistic lead to acceptance of the null hypothesis H0.

Figure 6

Effect of correction from external introgression. (A) Estimated distribution of the Z-scores related to from the 100 simulations of the null hypothesis with introgression of rate α = 0.1 from an external population into The Z-scores of the observed tree are far off the acceptance interval because of the admixture from Once the portion of genome from the external population is removed from the tree fulfills the null hypothesis and the Z-scores all fall in the acceptance interval defined by . . (B) Behavior of the of the tree (((Han Chinese,Dinka)Yoruban)chimpanzee) as a function of the admixture rate α used to correct for the introgression of the Neandertal population into the Han Chinese population. The red polygon is the interval in which (Green ) estimates α to fall. The black dot coincides with the value of α = 0.0307 calculated by Wall using the tree (((Han Chinese,Yoruban)Neandertal)chimpanzee), with SD 0.0049. The blue polygon is three times the SD of When is 0, we estimate α = 0.03 with SD 0.0042. For human populations, it is problematic to apply the D-statistics to both African and non-African populations because of ancient gene-flow from other hominids into non-Africans. Therefore, H0 might not be fulfilled for any tree , where an ingroup consists of both an African and a non-African population. This leads to rejection of the tree and to the natural conclusion that there is gene flow between (respectively, ). However, if there is known external admixture from a population it is possible to correct for admixture from this external contribution. We illustrate the problem and our ability to correct for it using the tree shown in Figure 3C, which shows introgression of the Neandertal genome into the ancestors of the Han Chinese population. The correction is performed for the admixture proportion α in the range [0, 0.05] in steps of 0.01. The value of α for which is closest to 0 might be considered as an estimate of the admixture rate. We chose these populations because we could compare our result with the estimate from previous studies of the same populations (Green ; Wall ). The study of Green estimated α to be in the range [0.01, 0.04], whereas (Wall ) estimated it as being α = 0.0307 with SD 0.0049. The result is shown in Figure 6B for the tree (((Han Chinese,Dinka)Yoruban)chimpanzee) for different admixture rates α used to correct for the introgression of the Neandertal population into the Han Chinese population. The red polygon is the interval in which α is estimated to be (Green ). The black dot coincides with the value of α = 0.0307 calculated in Wall . The blue polygon is three times the SD of For almost the whole range of reported admixture proportions, the tree is not rejected after adjustment for admixture, indicating that the uncorrected D-statistic concluded the presence of gene flow. When is 0, we estimate α = 0.03 with SD 0.0042, which is similar to previous estimates. In the cases of both simulated and real data, we have thus been able to distinguish the case in which the alternative hypothesis is due to an external introgression and not to admixture from In our simulations, the admixture correction seems not to suffer from the effect of drift, which is not modeled in the correction. In fact, the branch leading to splits 8000 generations in the past and admixes 4000 generations in the past on the branch leading to Thus, there is a drift affecting gene frequencies of both the admixing and admixed populations. In the case of real data, the exact amount of admixture α was not previously known. Therefore, we calculated the D-statistic for the tree (((Han Chinese,Dinka)Yoruban)chimpanzee) using admixture-corrected values of the probabilities of allele patterns, considering values of the admixture rate falling in the interval estimated in Green . Without admixture correction, the obvious conclusion would have been that for the tree (((Han Chinese,Dinka)Yoruban)chimpanzee) there is gene flow between the Yoruban and Dinka populations.

Conclusions

In summary, we have implemented a different D-statistic that addresses the drawbacks of the current implementations of the D-statistic, but still preserves the approximation as a standard normal distribution (see Appendix 1) that allows for a statistical test. The extended D-statistic allows for multiple individuals per population and, instead of sampling one base according to the estimated allele frequencies, uses all the available sequenced bases. Using both simulations and real data we have shown that: the extended D-statistic has more power than the alternative methods, with an increased sensitivity to admixture events. Moreover, even without a large amount of data, the extended D-statistic shows a good asymptotic convergence and, therefore, a low false positive rate; the performance of the extended D-statistic is the same as when the true genotype is known, for a depth of at least 2×; we can accommodate type-specific errors to prevent an eventual wrong acceptance or rejection of the null hypothesis caused by error-affected allele frequencies. The error estimation and correction appear to be especially suited to the case of ancient genomes, where error rates might be high owing to chemical treatments prior to sequencing and degradation over time; we can calculate the D-statistic after correcting for admixture from an external known population, such as in the case of Neandertal gene flow into the Han Chinese population. The extended D-statistic is especially effective compared with the standard D-statistic when applied to data with low or variable depth, multiple individuals, and ancient DNA.

Supplementary Material

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.300192/-/DC1. Click here for additional data file.

26 in total

1. Inference of population structure using multilocus genotype data.

Authors: J K Pritchard; M Stephens; P Donnelly
Journal: Genetics Date: 2000-06 Impact factor: 4.562

2. The human genome browser at UCSC.

Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal: Genome Res Date: 2002-06 Impact factor: 9.043

3. The recovery and first analysis of an Early Holocene human skeleton from Kennewick, Washington.

Authors: J C Chatters
Journal: Am Antiq Date: 2000

4. Recalibrating Equus evolution using the genome sequence of an early Middle Pleistocene horse.

Authors: Ludovic Orlando; Aurélien Ginolhac; Guojie Zhang; Duane Froese; Anders Albrechtsen; Mathias Stiller; Mikkel Schubert; Enrico Cappellini; Bent Petersen; Ida Moltke; Philip L F Johnson; Matteo Fumagalli; Julia T Vilstrup; Maanasa Raghavan; Thorfinn Korneliussen; Anna-Sapfo Malaspinas; Josef Vogt; Damian Szklarczyk; Christian D Kelstrup; Jakob Vinther; Andrei Dolocan; Jesper Stenderup; Amhed M V Velazquez; James Cahill; Morten Rasmussen; Xiaoli Wang; Jiumeng Min; Grant D Zazula; Andaine Seguin-Orlando; Cecilie Mortensen; Kim Magnussen; John F Thompson; Jacobo Weinstock; Kristian Gregersen; Knut H Røed; Véra Eisenmann; Carl J Rubin; Donald C Miller; Douglas F Antczak; Mads F Bertelsen; Søren Brunak; Khaled A S Al-Rasheid; Oliver Ryder; Leif Andersson; John Mundy; Anders Krogh; M Thomas P Gilbert; Kurt Kjær; Thomas Sicheritz-Ponten; Lars Juhl Jensen; Jesper V Olsen; Michael Hofreiter; Rasmus Nielsen; Beth Shapiro; Jun Wang; Eske Willerslev
Journal: Nature Date: 2013-06-26 Impact factor: 49.962

5. A high-coverage genome sequence from an archaic Denisovan individual.

Authors: Matthias Meyer; Martin Kircher; Marie-Theres Gansauge; Heng Li; Fernando Racimo; Swapan Mallick; Joshua G Schraiber; Flora Jay; Kay Prüfer; Cesare de Filippo; Peter H Sudmant; Can Alkan; Qiaomei Fu; Ron Do; Nadin Rohland; Arti Tandon; Michael Siebauer; Richard E Green; Katarzyna Bryc; Adrian W Briggs; Udo Stenzel; Jesse Dabney; Jay Shendure; Jacob Kitzman; Michael F Hammer; Michael V Shunkov; Anatoli P Derevianko; Nick Patterson; Aida M Andrés; Evan E Eichler; Montgomery Slatkin; David Reich; Janet Kelso; Svante Pääbo
Journal: Science Date: 2012-08-30 Impact factor: 47.728

6. A draft sequence of the Neandertal genome.

Authors: Johannes Krause; Adrian W Briggs; Tomislav Maricic; Udo Stenzel; Martin Kircher; Nick Patterson; Richard E Green; Heng Li; Weiwei Zhai; Markus Hsi-Yang Fritz; Nancy F Hansen; Eric Y Durand; Anna-Sapfo Malaspinas; Jeffrey D Jensen; Tomas Marques-Bonet; Can Alkan; Kay Prüfer; Matthias Meyer; Hernán A Burbano; Jeffrey M Good; Rigo Schultz; Ayinuer Aximu-Petri; Anne Butthof; Barbara Höber; Barbara Höffner; Madlen Siegemund; Antje Weihmann; Chad Nusbaum; Eric S Lander; Carsten Russ; Nathaniel Novod; Jason Affourtit; Michael Egholm; Christine Verna; Pavao Rudan; Dejana Brajkovic; Željko Kucan; Ivan Gušic; Vladimir B Doronichev; Liubov V Golovanova; Carles Lalueza-Fox; Marco de la Rasilla; Javier Fortea; Antonio Rosas; Ralf W Schmitz; Philip L F Johnson; Evan E Eichler; Daniel Falush; Ewan Birney; James C Mullikin; Montgomery Slatkin; Rasmus Nielsen; Janet Kelso; Michael Lachmann; David Reich; Svante Pääbo
Journal: Science Date: 2010-05-07 Impact factor: 47.728

7. Denisova admixture and the first modern human dispersals into Southeast Asia and Oceania.

Authors: David Reich; Nick Patterson; Martin Kircher; Frederick Delfin; Madhusudan R Nandineni; Irina Pugach; Albert Min-Shan Ko; Ying-Chin Ko; Timothy A Jinam; Maude E Phipps; Naruya Saitou; Andreas Wollstein; Manfred Kayser; Svante Pääbo; Mark Stoneking
Journal: Am J Hum Genet Date: 2011-09-22 Impact factor: 11.025

8. The genetic prehistory of the New World Arctic.

Authors: Maanasa Raghavan; Michael DeGiorgio; Anders Albrechtsen; Ida Moltke; Pontus Skoglund; Thorfinn S Korneliussen; Bjarne Grønnow; Martin Appelt; Hans Christian Gulløv; T Max Friesen; William Fitzhugh; Helena Malmström; Simon Rasmussen; Jesper Olsen; Linea Melchior; Benjamin T Fuller; Simon M Fahrni; Thomas Stafford; Vaughan Grimes; M A Priscilla Renouf; Jerome Cybulski; Niels Lynnerup; Marta Mirazon Lahr; Kate Britton; Rick Knecht; Jette Arneborg; Mait Metspalu; Omar E Cornejo; Anna-Sapfo Malaspinas; Yong Wang; Morten Rasmussen; Vibha Raghavan; Thomas V O Hansen; Elza Khusnutdinova; Tracey Pierre; Kirill Dneprovsky; Claus Andreasen; Hans Lange; M Geoffrey Hayes; Joan Coltrain; Victor A Spitsyn; Anders Götherström; Ludovic Orlando; Toomas Kivisild; Richard Villems; Michael H Crawford; Finn C Nielsen; Jørgen Dissing; Jan Heinemeier; Morten Meldgaard; Carlos Bustamante; Dennis H O'Rourke; Mattias Jakobsson; M Thomas P Gilbert; Rasmus Nielsen; Eske Willerslev
Journal: Science Date: 2014-08-29 Impact factor: 47.728

Review 9. Genotype and SNP calling from next-generation sequencing data.

Authors: Rasmus Nielsen; Joshua S Paul; Anders Albrechtsen; Yun S Song
Journal: Nat Rev Genet Date: 2011-06 Impact factor: 53.242

10. The genome of a Late Pleistocene human from a Clovis burial site in western Montana.

Authors: Morten Rasmussen; Sarah L Anzick; Michael R Waters; Pontus Skoglund; Michael DeGiorgio; Thomas W Stafford; Simon Rasmussen; Ida Moltke; Anders Albrechtsen; Shane M Doyle; G David Poznik; Valborg Gudmundsdottir; Rachita Yadav; Anna-Sapfo Malaspinas; Samuel Stockton White; Morten E Allentoft; Omar E Cornejo; Kristiina Tambets; Anders Eriksson; Peter D Heintzman; Monika Karmin; Thorfinn Sand Korneliussen; David J Meltzer; Tracey L Pierre; Jesper Stenderup; Lauri Saag; Vera M Warmuth; Margarida C Lopes; Ripan S Malhi; Søren Brunak; Thomas Sicheritz-Ponten; Ian Barnes; Matthew Collins; Ludovic Orlando; Francois Balloux; Andrea Manica; Ramneek Gupta; Mait Metspalu; Carlos D Bustamante; Mattias Jakobsson; Rasmus Nielsen; Eske Willerslev
Journal: Nature Date: 2014-02-13 Impact factor: 49.962

27 in total

1. Genomic data and multi-species demographic modelling uncover past hybridization between currently allopatric freshwater species.

Authors: Maria M Coelho; Vitor C Sousa; Sofia L Mendes; Miguel P Machado
Journal: Heredity (Edinb) Date: 2021-08-30 Impact factor: 3.832

2. Ghost Lineages Highly Influence the Interpretation of Introgression Tests.

Authors: Théo Tricou; Eric Tannier; Damien M de Vienne
Journal: Syst Biol Date: 2022-08-10 Impact factor: 9.160

3. Dsuite - Fast D-statistics and related admixture evidence from VCF files.

Authors: Milan Malinsky; Michael Matschiner; Hannes Svardal
Journal: Mol Ecol Resour Date: 2020-10-24 Impact factor: 7.090

4. Early diversification and permeable species boundaries in the Mediterranean firs.

Authors: Francisco Balao; María Teresa Lorenzo; José Manuel Sánchez-Robles; Ovidiu Paun; Juan Luis García-Castaño; Anass Terrab
Journal: Ann Bot Date: 2020-03-09 Impact factor: 5.040

5. RAD-Seq and Ecological Niche Reveal Genetic Diversity, Phylogeny, and Geographic Distribution of Kadsura interior and Its Closely Related Species.

Authors: Yuqing Dong; Xueping Wei; Tingyan Qiang; Jiushi Liu; Peng Che; Yaodong Qi; Bengang Zhang; Haitao Liu
Journal: Front Plant Sci Date: 2022-04-26 Impact factor: 6.627

6. Ancient Genomes Reveal the Evolutionary History and Origin of Cashmere-Producing Goats in China.

Authors: Yudong Cai; Weiwei Fu; Dawei Cai; Rasmus Heller; Zhuqing Zheng; Jia Wen; Hui Li; Xiaolong Wang; Akil Alshawi; Zhouyong Sun; Siqi Zhu; Juan Wang; Miaomiao Yang; Songmei Hu; Yan Li; Zhirui Yang; Mian Gong; Yunan Hou; Tianming Lan; Kui Wu; Yulin Chen; Yu Jiang; Xihong Wang
Journal: Mol Biol Evol Date: 2020-07-01 Impact factor: 16.240

7. Genomic differentiation tracks earth-historic isolation in an Indo-Australasian archipelagic pitta (Pittidae; Aves) complex.

Authors: Per G P Ericson; Yanhua Qu; Pamela C Rasmussen; Mozes P K Blom; Frank E Rheindt; Martin Irestedt
Journal: BMC Evol Biol Date: 2019-07-24 Impact factor: 3.260

8. Codweb: Whole-genome sequencing uncovers extensive reticulations fueling adaptation among Atlantic, Arctic, and Pacific gadids.

Authors: Einar Árnason; Katrín Halldórsdóttir
Journal: Sci Adv Date: 2019-03-20 Impact factor: 14.136

9. The origin of domestication genes in goats.

Authors: Zhuqing Zheng; Xihong Wang; Ming Li; Yunjia Li; Zhirui Yang; Xiaolong Wang; Xiangyu Pan; Mian Gong; Yu Zhang; Yingwei Guo; Yu Wang; Jing Liu; Yudong Cai; Qiuming Chen; Moses Okpeku; Licia Colli; Dawei Cai; Kun Wang; Shisheng Huang; Tad S Sonstegard; Ali Esmailizadeh; Wenguang Zhang; Tingting Zhang; Yangbin Xu; Naiyi Xu; Yi Yang; Jianlin Han; Lei Chen; Joséphine Lesur; Kevin G Daly; Daniel G Bradley; Rasmus Heller; Guojie Zhang; Wen Wang; Yulin Chen; Yu Jiang
Journal: Sci Adv Date: 2020-05-20 Impact factor: 14.136

Review 10. Genomic approaches for studying crop evolution.

Authors: Mona Schreiber; Nils Stein; Martin Mascher
Journal: Genome Biol Date: 2018-09-21 Impact factor: 13.583