Frantz Depaulis, Ludovic Orlando, Catherine Hänni.
Abstract
BACKGROUND: New polymorphism datasets from heterochroneous data have arisen thanks to recent advances in experimental and microbial molecular evolution, and the sequencing of ancient DNA (aDNA). However, classical tools for population genetics analyses do not take into account heterochrony between subsets, despite potential bias on neutrality and population structure tests. Here, we characterize the extent of such possible biases using serial coalescent simulations.
Year: 2009 PMID: 19440242 PMCID: PMC2678253 DOI: 10.1371/journal.pone.0005541
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Heterochrony effects on gene genealogies.
(A) Contemporaneous dataset. (B) Heterochroneous dataset. Lineages of sequences cannot reach a common ancestor before they are contemporaneous.
Summary of the main statistics, parameters and notations.
| Statistics/parameter | Definitions | Ref. |
| WF/ | Wright-Fisher population genetics model, assuming in particular a constant-size, well-mixed, neutral population. | |
| N | Effective population size: equivalent size for an ideal WF population. The relevant time scale for population genetics processes is in | |
| MRCA | Most recent common ancestor, root of an intraspecific tree. | |
| IMSM | Infinitely many sites mutational model, adapted to nucleotide polymorphism sequence data. | |
| θ | Mutational parameter of the population θ = 2 | |
| n | Sample size, subscript ‘ | |
| t | Time to the subset | |
| S | Number of polymorphic sites. | |
| | Watterson's estimator of θ | |
| | Diversity estimator of θ | |
| D | Nei's net distance between two populations | |
| Fst | Population differentiation (genetic distance) index | |
| | Scenario leading to star-shaped genealogical trees, with long external branches: strong bottleneck, population expansion, recent fixation of a closely linked advantageous mutation (selective sweep) or complex population structure such as a collection of small samples from a large number of populations. | |
| | Leading to balanced trees with long internal branches: population contraction, simple population structure between a small number of populations, each with similar, substantial sampling effort. | |
| DT | Tajima's D | |
| D*FL | Fu and Li's D* | |
| HFW | Fay and Wu's H | |
| K | Number of haplotypes elevated (with respect to | |
| H | Haplotype diversity (sensitive to their frequency) elevated for star scenario. | |
| LD | Linkage disequilibrium: statistical association of mutations, tendency of various mutations to be carried by the same individuals. | |
| ZnS | Average LD between pairs of polymorphic sites measured through allelic correlation | |
| Pearson | Recombination test: correlation between pairwise LD ( | |
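As an illustration of the frequency-spectrum statistics listed in this table, the following minimal Python sketch computes S, Watterson's estimator, the pairwise diversity and Tajima's D for a contemporaneous sample. The function name `summary_stats` and the toy alignment are hypothetical. The variance constants follow the standard published formula for Tajima's D. No correction for heterochrony is applied, and the sketch assumes at least two aligned sequences of equal length.

```python
from itertools import combinations
from math import sqrt


def summary_stats(seqs):
    """Return S, Watterson's theta, pi and Tajima's D for aligned sequences."""
    n = len(seqs)
    length = len(seqs[0])
    # S: number of polymorphic (segregating) sites in the alignment.
    s = sum(1 for j in range(length) if len({seq[j] for seq in seqs}) > 1)
    # pi: average number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Watterson's estimator: S divided by the harmonic number a1.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = s / a1
    # Variance constants of Tajima's D (standard formula).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    var = e1 * s + e2 * s * (s - 1)
    tajima_d = (pi - theta_w) / sqrt(var) if var > 0 else float("nan")
    return {"S": s, "theta_w": theta_w, "pi": pi, "D": tajima_d}


toy = ["AAAA", "AAAT", "AATT", "ATTT"]  # hypothetical alignment
print(summary_stats(toy))
```

Both estimators are returned per locus here; dividing by the alignment length would give per-site values.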
Figure 2. Outline of the main models simulated.
(A) Single panmictic population with a variable proportion of ancient data (n1/n; one third in this example) of moderate age (t1 = 0.1) with respect to the time unit of 2N generations (i.e. the average age of the root of a population tree in the homogeneous, contemporaneous case; see figure 3). (B) Corresponding simulations for the population differentiation (F) analysis: a single homogeneous set of individuals, randomly split into two equally sampled populations, one of which shows a variable proportion of ancient data (n1/n again; one third in this example) of moderate age (again t1 = 0.1; see figure 3, F). (C) Single panmictic population with equal proportions of ancient and modern data (n1/n = 1/2, equivalent to the two population samples for the F analysis); the age of the ancient samples ranges from 0 to 20 N generations (see figure 4). See text for more explanations.
Figure 3. Effect of subset size on statistical tests.
The temporal spacing between the two subsets is set to 0.2 N generations. Ten thousand runs were simulated for each set of parameter values. The X axis corresponds to the proportion of the ancient subset. DT: Tajima's D [39]; D*FL: Fu and Li's D* [40]; HFW: Fay and Wu's H [49] (note that this statistic is not standardized by its variance and can thus potentially show high absolute values, hence a rather erratic behavior on fig. 2a); ZnS: Kelly's Z [50]; K and H: Depaulis and Veuille's haplotype tests [51] (K is scaled to S+1, its expected maximal value in the absence of recombination and homoplasy); Slope: recombination test, Pearson correlation coefficient between pairwise allelic correlation and distance between mutations, tested by permutations according to Awadalla and colleagues [53]; Fst: Hudson, Slatkin and Maddison's F [31] between two population subsamples of equal size (50:50), in which case the X axis corresponds to the proportion of ancient sequences in the second subset. This F is tested by permutations between subsets [33]. Five hundred permutations were used in these last two tests. (A) Mean (bias). (B) Proportion of significant runs that show deviation from the standard coalescent expectations (rate of false positives). Only portions of curves above 6% (an arbitrary threshold of marginal significance) are shown for clarity. Note the different scale of the Y axis on the top part of panel B. Empty symbols: deficit of the statistic; filled symbols: excess of the statistic.
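The permutation test on F mentioned in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, F is taken as 1 − Hw/Hb with Hw the average of the two within-subset mean pairwise differences (an assumed convention), and the P value is simply the fraction of label permutations giving an F at least as large as the observed one. Each subset is assumed to hold at least two sequences.

```python
import random
from itertools import combinations, product


def mean_pairwise(pairs):
    """Mean number of differences over an iterable of sequence pairs."""
    diffs = [sum(a != b for a, b in zip(x, y)) for x, y in pairs]
    return sum(diffs) / len(diffs)


def hsm_fst(pop1, pop2):
    """F = 1 - Hw / Hb between two subsets of aligned sequences."""
    hw = (mean_pairwise(combinations(pop1, 2))
          + mean_pairwise(combinations(pop2, 2))) / 2
    hb = mean_pairwise(product(pop1, pop2))
    return 1 - hw / hb if hb > 0 else 0.0


def permutation_test(pop1, pop2, n_perm=500, seed=0):
    """Return (observed F, fraction of permutations with F >= observed)."""
    rng = random.Random(seed)
    observed = hsm_fst(pop1, pop2)
    pooled = list(pop1) + list(pop2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign sequences to subsets at random
        if hsm_fst(pooled[:len(pop1)], pooled[len(pop1):]) >= observed:
            hits += 1
    return observed, hits / n_perm


# Hypothetical, fully differentiated subsets.
print(permutation_test(["AAAA"] * 5, ["TTTT"] * 5))
```

Shuffling labels keeps the subset sizes fixed, so the null distribution reflects a single homogeneous population sampled with the same design.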
Figure 4. Effect of the time spacing with a 50% subset on statistical tests.
n = 50, whole second population subsample in the F analysis. The X axis is expressed in units of 2N generations. Same labeling and other parameter values as in figure 3.
Figure 5. Effect of time sampling schemes on the statistics.
(a) Means. For comparison, statistics with non-null means in the contemporaneous case are scaled to the upper bound of their confidence interval under that null hypothesis. (b) Proportions of significant runs. Only the direction of deviation (if any) found in the heterochroneous case is shown, the other one remaining below 5%. ‘inf’: deficit of the statistic; ‘sup’: excess of the statistic. Open bars: contemporaneous; striped bars: regular in the range [0–0.2]; homogeneous gray bars: uniform, same range; gradient-filled bars: exponential with mean 0.1 (truncated at 10 to limit CPU time, assuming that there is no chance of obtaining DNA that old from a species that may not even have existed at that time).
Polymorphism estimates from caves of Cave Bear.
| Cave | n | S | Time range (KY) | Average pairwise time difference (KY) | | | | %bias |
| Ach | 20 | 13 | 25–39 | 3.3 | 0.028 | 0.047 | 0.047 | 0.7 |
| Herdengel | 8 | 10 | 55–130 | 10.7 | 0.030 | 0.030 | 0.030 | 4.6 |
| Scladina | 20 | 15 | 30–130 | 36.6 | 0.052 | 0.040 | 0.037 | 8.7 |
| S Alps | 22 | 9 | 22–130 | 22.7 | 0.036 | 0.017 | 0.016 | 8.2 |
| N Alps | 33 | 21 | 22–130 | 32.9 | 0.039 | 0.047 | 0.046 | 3.0 |
| Total | 118 | 19 | 22–130 | 31.3 | 0.052 | 0.068 | 0.063 | 8.7 |
Corrected from equation (1).
Neutrality tests on Cave Bears.
| Cave | Time range (APTD) | T | DT | P (%) | D*FL | P (%) | HFW | P (%) | ZnS | P (%) | Pearson | P (%) | K | P (%) | H | P (%) |
| Ach | 25–39 (3.3) | c | 2.60 | 0** | 1.48 | 4* | −0.67 | 24 | 0.79 | 1** + | −0.05 | 23 | 5 | 11− | 0.72 | 23− |
| | | h | | 0** | | 3* | | 23 | | 0** + | | 23 | | 9− | | 22− |
| | | hu | | 0** | | 3* | | 24 | | 1** + | | 22 | | 9− | | 22− |
| Herdengel | 55–130 (10.7) | c | 0.96 | 14 | 1.50 | 4* | 1.50 | 31 | 0.51 | 23+ | 0.06 | 55 | 4 | 33− | 0.72 | 39− |
| | | h | | 11 | | 3* | | 34 | | 18+ | | 53 | | 25− | | 32− |
| | | hu | | 9 | | 2* | | 35 | | 15+ | | 52 | | 19− | | 27− |
| Scladina | 30–130 (36.6) | c | −0.82 | 22 | −1.55 | 10 | −1.32 | 20 | 0.24 | 43+ | −0.39 | 4* | 7 | 35− | 0.79 | 41− |
| | | h | | 35 | | 18 | | 17 | | 27+ | | 5* | | 16− | | 30− |
| | | hu | | 41 | | 24 | | 20 | | 23+ | | 4* | | 12− | | 29− |
| S Alps | 22–130 (22.7) | c | −1.75 | 2* | −2.35 | 4* | −5.84 | 1** | 0.63 | 3* + | / | | 4 | 8− | 0.38 | 2* − |
| | | h | | 4* | | 6 | | 1** | | 1* + | | | | 4* − | | 1** − |
| | | hu | | 6 | | 9 | | 1** | | 1* + | | | | 3* − | | 1** − |
| N Alps | 22–130 (32.9) | c | 0.86 | 15 | −1.17 | 18 | 1.83 | 17 | 0.14 | 37+ | 0.08 | 77 | 22 | 0** + | 0.89 | 1* + |
| | | h | | 4* | | 58 | | 6 | | 14+ | | 77 | | 0** + | | 1** + |
| | | hu | | 2* | | 39 | | 4* | | 11+ | | 77 | | 0** + | | 0** + |
| Total | 22–130 (31.3) | c | 1.12 | 10 | −0.80 | 29 | 2.05 | 8 | 0.15 | 31+ | 0.11 | 81 | 23 | 0** + | 0.87 | 5+ |
| | | h | | 2* | | 42 | | 2* | | 11+ | | 81 | | 0** + | | 3* + |
| | | hu | | 1** | | 23 | | 1** | | 7+ | | 80 | | 0** + | | 2* + |
All simulations are conditioned on the observed number of variable sites (S value). For each statistic, the observed value is indicated on the left.
Time range and Average pairwise time difference in KY.
On the right of each statistic, probability (%) on 3 lines (corresponding to the legend in the T column): first line, ‘c’, assuming a contemporaneous sample; second line, ‘h’, taking into account heterochrony, with an average time for each sequence; third line, ‘hu’, also including time and parameter uncertainty with uniform deviates (ranges detailed in figure S2). ‘*’: P<0.05; ‘**’: P<0.01.
The direction of deviation is indicated when not obvious (the contemporaneous expectation for frequency spectrum and Pearson statistics is 0): ‘+’ excess, ‘−’ deficit.
Pearson correlation between LD and distance test, permutation test.
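The two linkage disequilibrium statistics used in this table can be illustrated with a short Python sketch: the average squared allelic correlation over all pairs of sites, and the Pearson correlation between pairwise LD and physical distance, tested by permuting site positions. The function names, the 0/1 allele coding and the one-sided P value (fraction of permutations with a correlation at least as negative as observed) are assumptions made for this illustration. Every site is assumed to be polymorphic and biallelic, and at least three sites are needed.

```python
import random
from itertools import combinations
from math import sqrt


def r_squared(col_a, col_b):
    """Squared allelic correlation between two biallelic sites coded 0/1."""
    n = len(col_a)
    pa, pb = sum(col_a) / n, sum(col_b) / n
    pab = sum(a * b for a, b in zip(col_a, col_b)) / n
    d = pab - pa * pb
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ld_summary(haplotypes, positions, n_perm=500, seed=0):
    """Return (mean r^2, Pearson r of r^2 vs distance, permutation P)."""
    cols = list(zip(*haplotypes))  # one tuple of alleles per site
    pairs = list(combinations(range(len(cols)), 2))
    r2 = [r_squared(cols[i], cols[j]) for i, j in pairs]

    def slope(pos):
        return pearson(r2, [abs(pos[i] - pos[j]) for i, j in pairs])

    observed = slope(positions)
    rng = random.Random(seed)
    shuffled = list(positions)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between sites and positions
        if slope(shuffled) <= observed + 1e-12:
            hits += 1
    return sum(r2) / len(r2), observed, hits / n_perm


# Hypothetical data: two tightly linked pairs of sites, far apart.
haps = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
print(ld_summary(haps, [0, 1, 100, 101]))
```

A negative correlation indicates LD decaying with distance, the pattern expected under recombination.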
Population differentiation between caves of Cave Bears.
| Cave | Line | Ach recent | Ach old | Gamsulzen | Herdengel | Ramesh | Salzofen | Scladina | Vindija | Winden | S Alps | N Alps |
| Time range (KY) | | 25–28 | 27–39 | 31–50 | 55–130 | 30–130 | 22–130 | 30–130 | 22–51 | 22–130 | 22–130 | 22–130 |
| Number of sequences | | (7) | (13) | (7) | (8) | (9) | (4) | (20) | (12) | (7) | (22) | (33) |
| Ach recent | 1 | | (15) | (3) | (12) | (12) | (12) | (15) | (4) | (1) | | |
| | 2 | | 0.083 | 0.015 | 0.041 | 0.089 | 0.089 | 0.080 | 0.007 | 0.007 | | |
| | 3 | | 0.083 | 0.014 | 0.037 | 0.086 | 0.084 | 0.080 | 0.006 | 0.003 | | |
| Ach old | 1 | 0** | | (12) | (16) | (12) | (7) | (14) | (13) | (12) | | |
| | 2 | 0** | | 0.064 | 0.028 | 0.031 | 0.024 | 0.002 | 0.069 | 0.076 | | |
| | 3 | 0** | | 0.063 | 0.025 | 0.028 | 0.029 | 0.002 | 0.069 | 0.072 | | |
| | 4 | 2.18 | | | | | | | | | | |
| Gamsulzen | 1 | 0** | 0** | | (11) | (11) | (11) | (19) | (3) | (2) | | |
| | 2 | 0** | 0** | | 0.024 | 0.068 | 0.065 | 0.057 | 0.000 | 0.007 | | |
| | 3 | 0** | 0** | | 0.021 | 0.066 | 0.062 | 0.057 | 0.000 | 0.003 | | |
| | 4 | 1.87 | 2.93 | | | | | | | | | |
| Herdengel | 1 | 0** | 0** | 0.1** | | (10) | (10) | (20) | (12) | (11) | | |
| | 2 | 0** | 0** | 1.7* | | 0.014 | 0.001 | 0.014 | 0.033 | 0.033 | | |
| | 3 | 0** | 0** | 1.4* | | 0.012 | 0.001 | 0.013 | 0.030 | 0.033 | | |
| | 4 | 7.90 | 5.37 | 6.27 | | | | | | | | |
| Ramesh | 1 | 0** | 0** | 0** | 0** | | (0) | (14) | (12) | (16) | | |
| | 2 | 0** | 0** | 0** | 0** | | 0 | 0.035 | 0.071 | 0.082 | | |
| | 3 | 0** | 0** | 0** | 0** | | 0 | 0.033 | 0.069 | 0.079 | | |
| | 4 | 6.16 | 3.78 | 4.83 | 2.17 | | | | | | | |
| Salzofen | 1 | 0** | 0** | 0.2** | 29.4 | / | | (4) | (12) | (11) | | |
| | 2 | 4.0* | 0** | 0.2** | 27.4 | / | | 0.017 | 0.068 | 0.082 | | |
| | 3 | 3.0* | 0** | 1.0** | 41.2 | / | | 0.016 | 0.065 | 0.082 | | |
| | 4 | 9.84 | 8.82 | 10.43 | 12.19 | 10.79 | | | | | | |
| Scladina | 1 | 0** | 23 | 0.2** | 0.3** | 0** | 1.1* | | (20) | (19) | | |
| | 2 | 0** | 59 | 0.2** | 2.4* | 0** | 3.6* | | 0.069 | 0.072 | | |
| | 3 | 0** | 55 | 0.4** | 3.0* | 0** | 5.7 | | 0.069 | 0.069 | | |
| | 4 | 9.64 | 4.43 | 9.56 | 9.95 | 8.20 | 19.52 | | | | | |
| Vindija | 1 | 0** | 0** | 26.4 | 0** | 0** | 0.0** | 0** | | (3) | | |
| | 2 | 0** | 0** | 27.8 | 0** | 0** | 1.4* | 0** | | 0 | | |
| | 3 | 0** | 0** | 29.7 | 0** | 0** | 0.9** | 0** | | 0 | | |
| | 4 | 3.30 | 0.91 | 3.26 | 5.39 | 3.71 | 9.19 | 4.72 | | | | |
| Winden | 1 | 0** | 0** | 0** | 0.1** | 0.1** | 0.2** | 0** | / | | | |
| | 2 | 0** | 0** | 0** | 0.2** | 0.3** | 0.2** | 0** | / | | | |
| | 3 | 0** | 0** | 0** | 1.7* | 2.2* | 0.9** | 0** | / | | | |
| | 4 | 7.09 | 5.71 | 5.16 | 2.89 | 3.58 | 8.14 | 11.28 | | | | |
| S Alps | 1 | | | | | | | | | | | (19) |
| | 2 | | | | | | | | | | | 0.015 |
| | 3 | | | | | | | | | | | 0.018 |
| N Alps | 1 | | | | | | | | | | 0** | |
| | 2 | | | | | | | | | | 0** | |
| | 3 | | | | | | | | | | 0** | |
| | 4 | | | | | | | | | | 2.70 | |
Top right: line 1, in parentheses, number of polymorphic sites in the pairwise alignment; lines 2 and 3, Nei's net distances D, line 2 uncorrected; line 3 corrected for heterochrony with equations (1) and (3). Bottom left: P values from permutation tests, significance level; line 1 neglecting heterochrony; line 2 taking it into account; line 3 including uncertainty; line 4: Inter-population average pairwise time difference (KY). The number of sequences used for each population is given in parentheses at the top of the columns.
Variable number of sequences depending on the alignment chosen to maximize information.
n = 20.
n = 6.
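For reference, the uncorrected Nei's net distance reported in this table can be computed as in the minimal Python sketch below: the mean pairwise difference between the two populations minus the average of the two within-population means. The function names are hypothetical and the result is expressed per site, which is an assumption about the scale used in the table. The heterochrony correction of equations (1) and (3) is not reproduced here, since those equations are not shown in this section.

```python
from itertools import combinations, product


def mean_diff(pairs):
    """Mean number of differences over an iterable of sequence pairs."""
    diffs = [sum(a != b for a, b in zip(x, y)) for x, y in pairs]
    return sum(diffs) / len(diffs)


def nei_net_distance(pop_x, pop_y):
    """Uncorrected net distance per site: d_xy - (d_x + d_y) / 2."""
    n_sites = len(pop_x[0])
    d_xy = mean_diff(product(pop_x, pop_y))    # between populations
    d_x = mean_diff(combinations(pop_x, 2))    # within population x
    d_y = mean_diff(combinations(pop_y, 2))    # within population y
    return (d_xy - (d_x + d_y) / 2) / n_sites


# Hypothetical populations of two aligned sequences each.
print(nei_net_distance(["AAAA", "AAAT"], ["TTTT", "TTTA"]))
```

Subtracting the within-population diversity removes the part of the divergence that was already present as polymorphism in the ancestral population.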
Figure 6. Heterochrony-driven biases on summary statistics: a synthesis.
(A) Contemporaneous case. (B) Heterochroneous dataset with limited time range. Lineages of sequences cannot reach a common ancestor before they are contemporaneous, leading to genealogies with proportionally longer external branches and an excess of rare mutations, thus mimicking bottlenecks, expansions or tightly linked selection. (C) Two subsets separated by a large time lapse. The coalescence process is finished within the most recent subset before reaching the ancient subset sampling point, leading to a genealogy with a long internal branch, more variation, especially for intermediate frequency mutations, and a genetically isolated subset, thus mimicking simple population structure or contraction. t1: time lapse; n1: oldest subset's size.
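The three cases summarized in this figure can be explored with a small serial-coalescent sketch. It is a minimal illustration under stated assumptions, not the simulation program used in the study: the function names are hypothetical, time is measured in units in which each pair of lineages coalesces at rate 1 (a scaling that may differ from the 2N-generation unit of the figures), and the sample sizes and ages are arbitrary. Going backward in time, modern lineages coalesce alone until the ancient sampling point, where the ancient lineages join the genealogy.

```python
import random


def simulate_lengths(n_modern, n_ancient, t_ancient, rng):
    """Simulate one genealogy; return (external, total) branch lengths."""
    lineages = [True] * n_modern  # True while a lineage is still external
    pending = n_ancient           # ancient sequences not yet reached
    t = external = total = 0.0
    while len(lineages) > 1 or pending:
        k = len(lineages)
        rate = k * (k - 1) / 2
        wait = rng.expovariate(rate) if rate > 0 else float("inf")
        sampling_first = pending > 0 and t + wait >= t_ancient
        dt = t_ancient - t if sampling_first else wait
        external += dt * sum(lineages)
        total += dt * k
        t += dt
        if sampling_first:
            # Ancient sequences become available for coalescence.
            lineages.extend([True] * pending)
            pending = 0
        else:
            # Two random lineages merge into one internal lineage.
            i, j = sorted(rng.sample(range(k), 2))
            lineages.pop(j)
            lineages.pop(i)
            lineages.append(False)
    return external, total


def mean_lengths(n_modern, n_ancient, t_ancient, reps=4000, seed=1):
    """Average external and total tree lengths over independent replicates."""
    rng = random.Random(seed)
    ext = tot = 0.0
    for _ in range(reps):
        e, length = simulate_lengths(n_modern, n_ancient, t_ancient, rng)
        ext += e
        tot += length
    return ext / reps, tot / reps


for label, args in [("contemporaneous", (20, 0, 0.0)),
                    ("limited time range", (10, 10, 0.2)),
                    ("large time lapse", (10, 10, 10.0))]:
    ext, tot = mean_lengths(*args)
    print(label, "external fraction:", round(ext / tot, 3),
          "internal length:", round(tot - ext, 2))
```

Under these assumptions the external fraction should rise for a limited time range, and the internal length should grow sharply for a large time lapse, in line with panels (B) and (C).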