Literature DB >> 31725722

Inference of recombination maps from a single pair of genomes and its application to ancient samples.

Gustavo V Barroso1, Nataša Puzović1, Julien Y Dutheil1.   

Abstract

Understanding the causes and consequences of recombination landscape evolution is a fundamental goal in genetics that requires recombination maps from across the tree of life. Such maps can be obtained from population genomic datasets, but require large sample sizes. Alternative methods are therefore necessary to research organisms where such datasets cannot be generated easily, such as non-model or ancient species. Here we extend the sequentially Markovian coalescent model to jointly infer demography and the spatial variation in recombination rate. Using extensive simulations and sequence data from humans, fruit-flies and a fungal pathogen, we demonstrate that iSMC accurately infers recombination maps under a wide range of scenarios-remarkably, even from a single pair of unphased genomes. We exploit this possibility and reconstruct the recombination maps of ancient hominins. We report that the ancient and modern maps are correlated in a manner that reflects the established phylogeny of Neanderthals, Denisovans, and modern human populations.

Entities:  

Mesh:

Year:  2019        PMID: 31725722      PMCID: PMC6879166          DOI: 10.1371/journal.pgen.1008449

Source DB:  PubMed          Journal:  PLoS Genet        ISSN: 1553-7390            Impact factor:   5.917


Introduction

Meiotic recombination is a major driver of the evolution of sexually-reproducing species [1]. The crossing-over of homologous chromosomes creates new haplotypes and breaks down linkage between neighbouring loci, thereby impacting natural selection [2,3] and consequently the genome-wide distribution of diversity [4]. The distribution of such cross-over events is heterogeneous within and among chromosomes [5,6], and commonly referred to as the recombination landscape–a picture of how often genetic variation is shuffled in different parts of the genome. Interestingly, this picture is not static, but instead is an evolving trait that varies between populations [7,8] and species [9]. Moreover, the proximate mechanisms responsible for shaping the recombination landscape vary among taxa. For example, among primates (where the PRDM9 gene is a key player determining the location of so-called recombination hotspots [10]) the landscape is conserved at the mega-base (Mb) scale, but not at the kilo-base (kb) scale [11]. In birds, which lack PRDM9, the hotspots are found near transcription start sites in the species that have been studied so far [8,12]. In Drosophila (where clear hotspots appear to be absent [13]), inter-specific changes are associated with mei-218 variants [14], a gene involved in the positioning of double-strand breaks [15]. The molecular machinery influencing the distribution of cross-over events is still poorly understood in many other groups, where a survey of the recombination landscape in closely related species is lacking. Aside from their intrinsic value in genetics, accurate recombination maps are needed to interpret the distribution of diversity along the genome. Since the rate of recombination determines the extent to which linked loci share a common evolutionary history [16], inferring selection [17-19], introgression [18,20] and identifying causal loci in association studies requires knowledge of the degree of linkage between sites [21]. Furthermore, recombination can cause GC-biased gene conversion [22,23], which can mimic the effect of selection [24] or interfere with it [25]. Although important, obtaining recombination maps remains a challenging task. Due to the typically low density of markers, experimental approaches provide broad-scale estimates and are limited in the number of amenable taxa. Conversely, population genomic approaches based on coalescent theory [26,27] have proved instrumental in inferring recombination rates from polymorphism data. Traditionally, population genomic methods infer recombination maps from variation in linkage disequilibrium (LD) between pairs of single nucleotide polymorphisms (SNPs) [28-30]. However, since “LD-based” methods typically require large sample sizes per population (from a dozen haplotypes [31]), their application is restricted to a few model organisms where such data are available. Here we introduce a new modeling framework (iSMC) to infer the variation in the recombination rate along the genome, using a single pair of unphased genomes. Using simulations, we show that iSMC is able to accurately recover the recombination landscape under diverse scenarios. We further demonstrate its efficacy with case studies in Humans, Fruit-flies and the fungal pathogen Zymoseptoria tritici, where experimental genetic maps are available. Finally, we exploit our new method to investigate the recombination landscape of ancient hominin samples: Ust’Ishim, the Vindija Neanderthal, the Altai Neanderthal, and the Denisovan. Because it allows inference from datasets for which sample size is intrinsically limited, such as ancient DNA samples, our method opens a new window in the study of the recombination landscape evolution.

Results

Overview of iSMC

Besides its common interpretation as a backwards-in-time process, the coalescent with recombination [32,33] can also be modelled as unfolding spatially along chromosomes [34]. Starting from a genealogy at the first position of the alignment, the process moves along the chromosome, adding recombination and coalescence events to the ensuing ancestral recombination graph (ARG) [35,36] (). Due to long-range correlations imposed by rare recombination events that happen outside the ancestry of the sample (in so-called trapped non-ancestral material [37]), the genealogy after a recombination event cannot be entirely deduced from the genealogy before, rendering the process non-Markovian. The sequentially Markovian coalescent process (SMC) [38,39] ignores such recombination events but captures most of the properties of the original coalescent [40] while being computationally tractable. This model is the foundation of recent tools for demographic inference [41-43] and has been used to infer the broad-scale recombination map of the human-chimpanzee ancestor based on patterns of incomplete lineage sorting [44,45].

Schematic representation of iSMC for one pair of genomes, with five time intervals and three recombination rate categories.

A, In the SMC process, the spatial distribution of TMRCAs can be described by a matrix of transition probabilities that depend on the population recombination rate ρ and the ancestral coalescence rates. B, variation in ρ along the genome, modelled as a Markovian process and described by a matrix of transition probabilities. C, the combination of both Markovian processes leads to a Markov-modulated Markovian process. The hidden states of the resulting hidden Markov model are all pairwise combinations of discretized classes in A and B. In the SMC, transition probabilities between genealogies are functions of ancestral coalescence rates and–of key relevance to this study–the population recombination rate (ρ) [42,43]. An important limitation of current implementations, however, is that they either neglect variation in ρ along the genome and model its genome-wide average or they require the recombination landscape to be input in order to aid inference of other population genetic parameters (e.g., [46]). Heterogeneous recombination landscapes affect the SMC process by modulating the frequency of genealogy transitions: genomic regions with higher recombination rate are expected to harbour relatively more genealogies than regions with lower recombination rate (). We leverage this information by extending the SMC to accommodate spatial heterogeneity in ρ (see Methods). In brief, our new model combines the discretised distribution of times to the most recent common ancestor (TMRCAs) of the pairwise SMC [42] () with a discretised distribution of ρ () to jointly model their variation along the genome. We model the transition between discretised ρ categories as another spatially Markovian process along the genome. Therefore, combining the SMC with the Markov model of recombination variation leads to a Markov-modulated Markov model. We cast it as a hidden Markov model [47,48] (HMM) to generate a likelihood function, where the observed states are orthologous nucleotides and the hidden states are {TMRCA, ρ-category} pairs (). We name our new approach “integrative sequentially Markov coalescent (iSMC)”, as it enables jointly capturing the effect of time and space in the coalescent. Importantly, in contrast with methods that can take a pre-specified demographic history into account when reconstructing the recombination map [49,50], iSMC simultaneously infers coalescence rates and spatial variation in the recombination rate, potentially leading to more accurate estimates. This framework explicitly connects the genealogical process with the classical definition of LD as the non-random association of alleles at different loci [51], which has been formulated in terms of covariances in coalescence times [52]. Henceforth, we restrict the use of the term LD to its “topological” interpretation [53]. iSMC is based on a neutral model where time is re-scaled and measured in units of the effective population size (Ne). Thus, information about the recombination rate is obtained in the form of the compound parameter ρ = 4×Ne×r. Since under neutrality and panmixia, Ne is constant along autosomes, we use the inferred ρ landscape as a proxy for the spatial variation in the molecular rate r that measures the historical rate of cross-over events per-nucleotide per generation. (Note that local variation in TMRCA primarily reflects genealogical and sampling variance and cannot, on its own, be used to tease apart Ne and r.) Our approach is to model spatial variation in ρ using a Gamma distribution with k discretised categories (see Methods). Setting k equal to 1 is equivalent to the “standard” PSMC’ where ρ is homogeneous along the genome [43], and can be used as a null model to formally test the existence of spatial variation in the recombination rate. To this end, after fitting both homogeneous and heterogeneous models to the data, Akaike's Information Criterium (AIC) [54] is employed as a means of model selection. If AIC favours a spatially heterogeneous model over the null model where ρ is constant along the genome, iSMC can then be used to estimate a recombination landscape of single-nucleotide resolution by weighting the discretised values of the Gamma distribution of ρ with their local posterior probabilities. In the following section, we benchmark our model on different simulated scenarios. Therein, we computed the proportion of variance in simulated maps that is explained by inferred maps (R2) after binning the landscapes into windows of 50 kb, 200 kb, 500 kb and 1 Mb by averaging single-nucleotide estimates of ρ.

Simulation study

To assess iSMC’s overall performance, we simulated five recombination landscapes corresponding to different patterns of magnitude and frequency of change in ρ and a “null” scenario with constant recombination rate along the genome (see Methods). For each of the five scenarios, we simulated 10 replicate ARGs using SCRM [55], each describing the ancestry of 2 haploid chromosomes. When fitting a model to these simulated data, we tested two discretisation schemes for the joint distribution of TMRCAs and Gamma-distributed recombination rates (see Methods): the first with 40 time intervals, five ρ categories; the second with 20 time intervals, 10 ρ categories, leading to a total of 200 hidden states in both configurations. Model selection based on AIC favours the correct model in 45 of the 50 datasets (), with the five exceptions belonging to the scenario where changes are frequent and of small magnitude. In this regime, transitions to regions of slightly different recombination rates do not significantly skew the distribution of genealogies, and the short length of blocks with constant ρ leaves little signal in the data. Accordingly, among the four heterogeneous landscapes that we considered, median R2 is 38%, 44%, 49% and 51% for increasing window sizes in the five identifiable replicates of such challenging scenario, and 66%, 77%, 77% and 83% for increasing window sizes in the other three (, ). Overall, the results are consistent between replicates and robust to the choice of discretisation, although the 40x5 configuration performs better in the scenario with the challenging parameter combination (). Therefore, in the following we focus on the 40x5 configuration, noting that it implements a finer discretisation of time that is more adequate to capture the effect of ancestral demography. As we introduce new simulated scenarios, we focus on the recombination landscape with frequent changes of large magnitude.

Recombination map recovery under various simulated scenarios according to bin size.

Dot plots show the distribution of squared Pearson correlation coefficients (R2) between the simulated and inferred recombination maps. See Methods section for details about the simulations. A, four scenarios of spatial variation in the recombination rate, corresponding to different combinations of parameters (colour, see text), and comparison between two discretisation schemes (panels). B-C, comparison between a model where demography is mis-specified and another where it is jointly inferred (colour), in scenarios of recent growth (B) or ancient bottleneck (C). D, two scenarios of spatial variation in the mutation rate, varying its frequency of change (colour). E, two scenarios of introgression, varying the time of gene-flow (colour, CTA stands for Coalescent Time units Ago).

Demographic history

The random sampling of haplotypes during population bottlenecks and expansions affects LD between SNPs, thus creating spurious signals of variation in ρ [49,56,57]. To test whether iSMC could capture the effect of demography on the inference of recombination maps, we simulated a heterogeneous recombination landscape coupled with either a recent 20-fold increase or ancient 20-fold decrease in population size (see Methods). We then fitted our model twice for each scenario: first, erroneously assuming a flat demographic history; second, allowing iSMC to infer piece-wise constant coalescence rates in order to accommodate population size changes. Overall, R2 is high (median 67%, 78%, 79% and 83% for increasing window sizes in all scenarios, ), showing that the inferred recombination landscape is relatively robust to misspecification of the demographic scenario, but is systematically higher when demography is jointly inferred (55%, 69%, 75% and 79% when demography is assumed constant versus 71%, 79%, 80% and 85% when demography is jointly inferred, ). The difference is stronger at the fine scale, where, in the presence of complex demography the distribution of genealogies can get locally confined to a time period, and ignorance about differential coalescence rates reflects poorly on local ρ estimates. We conclude that the joint-inference approach of iSMC can disentangle the signal that variable recombination and fluctuating population sizes leave on the distribution of SNPs.

Inference of recombination maps in the presence of recent population growth.

Each column represents a different bin size (increasing from left to right, 50 kb, 200 kb, 500 kb and 1 Mb). A-D, scatter-plots of inferred versus simulated maps, coloured according to the θ / ρ ratio (green > = 1, blue < 1). The dashed magenta line represents ordinary least squares regression. E-H, corresponding simulated (black) and inferred (orange) maps.

Introgression events

Recent studies suggested that introgression is a frequent phenomenon in nature [58,59]. The influx of a subset of chromosomes from a “source” into a “target” population (in a process analogous to a genetic bottleneck) introduces long stretches of SNPs in strong LD. Past introgression events will thus affect runs of homozygosity, biasing the distribution of genealogies. To test the robustness of iSMC to the confounding effect of introgression, we simulated two scenarios of admixture which differ in their time of secondary contact between populations (see Methods). The proportion of variance explained remains high (median 57%, 64%, 69% and 75% for increasing window sizes, , ) and depends on the time when introgression occurred. Recombination maps are less accurately recovered in case of recent introgression (52%, 62%, 68% and 75% versus 58%, 65%, 70% and 75% for increasing window sizes), because in such case there has been less time for recombination to break SNP associations that do not reflect local ρ in the target, sampled population.

Variation in mutation rate

The rate of de novo mutations varies along the genome of many species. For example, CpG di-nucleotides experience an increase in mutation rate (μ) as a result of methylation followed by deamination into thymine, whereas the efficiency of the molecular repair machinery is negatively correlated with the distance from the DNA replication origin, causing μ to vary accordingly [60]. Such heterogeneity could bias iSMC's estimates because the transition into a region of higher μ mimics the transition to a genealogy with a more ancient common ancestor since in both cases the outcome is locally increased genetic diversity. To assess the impact of variation of mutation rate on the estimation of recombination rate, we simulated two scenarios of variation of θ = 4×Ne×μ along the genome, corresponding to low and high frequencies of change, relative to the frequency of change in the recombination rate. We report that transitions to different mutation rates along the genome globally do not introduce substantial biases in our estimates (median R2 76%, 83%. 88% and 88% for increasing window sizes when the mutation landscape changes often, and 80%, 83%, 88% and 83% for increasing window sizes when the mutation landscape changes rarely, , ). We also investigated the impact of the genome-wide average mutation rate μ on iSMC’s accuracy. Since SNPs are informative about the distribution of genealogies along the genome, the ability of population genetic methods to identify ancestral events depends on their abundance. In this case, the actual parameter of interest is the θ / ρ ratio, where values smaller than 1 imply less mutation than recombination events in the history of the sample (meaning some cross-overs will be effectively “invisible” in the sequence data) and values higher than 1 imply more mutations per recombination, thereby increasing resolution in our inference. As expected, we find that accuracy is reduced for a θ / ρ ratio equal to 0.5 (median R2 40%, 47%, 50% and 60% for increasing window sizes, ). Reassuringly, however, iSMC retains high accuracy for ratios > = 1.0, (median R2 51%, 56%, 61% and 63%; 60%, 74%, 75% and 75%; 75%, 79%, 82% and 83% for increasing ratios and window sizes, ), especially starting from 1.5 –a value commonly assumed for human data [61].

Comparison with state-of-the-art methods

Simulation studies are useful to understand the theoretical limitations of a method under idealised conditions. On the other hand, inference in real datasets provides the most realistic assessment of performance. For this reason, we benchmarked iSMC in two organisms with contrasting genomic architectures and evolutionary histories, and for which experimental genetic maps are available [62,63]. Using a 40x5 configuration of hidden states, we fitted iSMC independently to each of three pairs of genomes from each species (see Methods). In all six samples, AIC favoured a heterogeneous recombination landscape. We then computed R2 between the iSMC-inferred maps and the experimental genetic maps and used these as a proxy for iSMC's accuracy. In the Z. tritici samples R2 estimated with iSMC maps are 24% (+/- 13%), 30% (+/- 12%) and 38% (+/- 16%) respectively at the 20 kb scale for chromosome 1. Using the same chromosome and window size, the recombination map obtained with LDhat by Stukenbrock and Dutheil [64] with 13 haploid sequences (of which our samples are a subset) explains 25.35% of the variance in the same genetic map. In the D. melanogaster samples, R2 estimated with iSMC maps are 71% (+/- 27%), 56% (+/- 19%) and 73% (+/- 22%) respectively at the 1 Mb scale for chromosome 2L. Using the same chromosome and window size, the recombination maps obtained by Chan et al. [30] with LDhelmet and by Adrion et al. [50] with ReLERN with 10 haploid sequences explain 55% and 46% of the variance in the same genetic map, respectively. We thus conclude that remarkably, iSMC can outperform methods that use larger samples sizes in both Z. tritici and D. melanogaster, which are species with a large density of SNPs.

Application to modern and ancient human samples

The positioning of double-strand breaks leading to cross-over events is directed by several proteins that exert their influence at different scales. Evolution of such proteins results in differentiation of recombination maps with an increasing divergence between populations and species [65]. To gain insight into the evolution of the recombination landscape in hominins, we built a dataset of 22 individuals from 11 modern human populations sampled in the Simons Genome Diversity Panel [66] together with four ancient samples [67]: the Altai Neanderthal [68], the Vindija Neanderthal [69], the Denisovan [70] and the Ust’Ishim individual [61], a 45,000-year-old modern human from Siberia. Due to the overall low polymorphism in hominins, we sought to increase accuracy in our inference by fitting iSMC to whole-genome sequences of each individual using a 40x10 hidden states configuration. Estimation of the parameters of the prior distributions illustrated in (the genome-wide average recombination rate ρ, the shape of the Gamma distribution α and the transition probability between recombination categories δ, see Methods) yielded consistent values across extant samples (). Conversely, jointly inferring such parameters in the ancient samples proved challenging because of widespread missing data. To address this issue, we first inferred the demographic parameters in each of the four ancient samples with a homogeneous recombination model (40x1 hidden states configuration). Then, we obtained the recombination maps of the ancient samples using the resulting sample-specific coalescence rates together with extant-human estimates of ρ, α and δ. Although this procedure implies that extant and ancient samples share their prior distributions of recombination rate variation, computing posterior probabilities with an HMM confronts such distributions with the data such that the resulting ancient maps should reflect their specific characteristics of LD. We note that any biases introduced by this procedure should increase the similarity between modern and ancient recombination maps. The reconstructed phylogenetic distance between maps of modern and ancient hominins outlined below is therefore conservative. To quantify the similarity among these recombination landscapes, we computed Spearman’s rank correlation (rs) between all pairs of individual maps at the 50 kb, 200 kb and 1 Mb scales. We further added the sex-averaged genetic map from DECODE [7] to the dataset in order to assess iSMC’s accuracy (). Notwithstanding the low density of SNPs along single human genomes, correlations between the maps from iSMC and DECODE increase rapidly with window size. We then used 1 – rs as a measure of the pair-wise distance between maps to construct dendrograms () and discovered that the evolution of the recombination landscape globally reflects the currently accepted evolutionary history of hominins. Specifically, at the 50 kb and 200 kb scales, the major geographic groups are recovered with high bootstrap support (with the exception of the Punjabi individuals, see argumentation below): Africans, Asians and Europeans each form a distinct cluster; the 45,000-year-old Ust’Ishim is sister to modern humans, depicting ancestral similarities that have been frozen by his demise soon after the out-of-Africa migration; and all modern humans have diverged from the monophyletic Neanderthal-Denisovan group (). We note that, while Ust’Ishim is branching together with other H. sapiens samples in a well supported group, it is not placed immediately outside the Europe + Asian clade, as expected given that he is part of the out-of-Africa migration. One possibility is that the high level of missing data from this sample increases the noise in the infered recombination map–therefore decreasing the similarity with other H. sapiens maps. Further analyses are necessary to pinpoint the precise reason for this branching pattern. Interestingly, the Punjabi individuals, who speak an Indo-European language, fall within the European clade. The Punjabi population contains a large proportion of Ancestral North Indian ancestry, a ghost population that shows similarity to West-Eurasian populations [66,71]. Moreover, African samples can be further differentiated at the population level with high confidence, likely because population structure (hence recombination landscape evolution and differentiation) extends deeper into the history of this continent compared to the non-African populations. At the 1 Mb scale, the differentiation among clades of modern human populations becomes blurrier (), as expected due to the slower evolution of the recombination landscape at larger scales. These results cannot be accounted for by patterns of LD left by population-specific demographic histories because iSMC infers the recombination landscape while jointly modelling ancestral coalescence rates. They could, however, be driven by ancestral recombination events that are shared between sampled individuals. This is because the number of such events is proportional to the relatedness between individuals and can lead to a signal of similarity in the inferred maps. Hence the recombination maps would be expected to be correlated in a manner that reflects the hominin phylogeny, even in the absence of evolution of the recombination landscape. In order to mitigate this effect, we took advantage of iSMC’s explicit modelling of coalescence times to consider only recent recombination events. Because the discretisation of time is embedded in the hidden states of our HMM, we were able to restrict the posterior decoding to the first 10 time intervals (see Methods), thereby discarding recombination events older than ~0.25 coalescent time units and reducing the relevance of shared recombination events in our analysis. Overall, we were still able to recover the hominin phylogeny using pair-wise correlations between such time-restricted maps (), suggesting that the patterns depicted in reflect the accumulation of differences in population-specific recombination landscapes over time. We further assessed whether the spurious signal of similarity imposed by shared recombination events would be strong enough to cause a reflection of the population phylogeny. To this end, we performed simulations following Fu et al.’s [61] study of a hominin-specific demographic history (see Methods). On top of the parameters chosen by Fu et al., we imposed a common recombination landscape to all simulated haplotypes and fitted iSMC to non-overlapping genomes pairs belonging to the same population. We then used the inferred maps at the 50 kb scale to build a dendrogram using the same procedure outlined above. Since the recombination landscape does not evolve in these simulations, then in the absence of a confounding signal, we would expect the similarity between such maps to be uniform across the populations resulting in a star phylogeny. Instead, we observe that the pair-wise correlations between such simulated maps still reflect the underlying phylogeny with relatively high bootstrap support for the major branches (), indicating the presence of substantial confounding. As a proof-of-principle that iSMC can disentangle these events, we obtained time-restricted recombination maps from the same simulated dataset. The correlation patterns of these maps no longer reflect the underlying history of populations, and the resulting dendrogram displays very low bootstrap support (). In summary, our results provide evidence that the divergence of the recombination landscape in hominins mirrors the divergence of the species.

Evolution of the recombination landscape in hominins.

Left: Correlograms depicting pair-wise Spearman’s rank correlations between individual recombination maps. Right: corresponding dendrograms, highlighting the phylogenetic clustering of different groups (colour legend). The clades with less than 60% bootstrap support (over 1000 replicates) were collapsed. A, B: 50 kb windows. C, D:, 200 kb windows. E, F: 1 Mb windows.

Discussion

Our improved SMC model is able to infer accurate recombination maps from high-quality pairs of genomes. Nevertheless, the agreement between iSMC and experimental maps in all three species studied is lower than that obtained in simulations. There can be several factors driving this difference. First, technical artefacts (e.g., sequencing errors and missing data) will increase the noise in the estimates, in particular at the fine scale. Second, introgression and natural selection [72] influence Ne locally but are unaccounted for by our model, thus introducing bias in the estimation of the compound parameter ρ = 4×Ne×r. In particular, strong positive selection (or adaptive introgression) increases LD between loci, and the breadth of such effect should depend on the local recombination rate per nucleotide per generation, r. Such complex interplay can introduce severe biases in species where selection is widespread along the genome, particularly at narrow window sizes. Third, the distinct data types used by experimental and statistical methods imply that they measure different facets of recombination [13]. While experimental maps are a snapshot of the landscape at present-day generation, the historical map estimated by population genomic methods reflects the time-average cross-over rate at each position of the genome because ancient recombination events also influence the TMRCA distribution, as highlighted by our time-restricted analysis. As a result of this contrast between experimental and statistical approaches, the ensuing maps are not expected to be perfectly correlated, even in the absence of noise and model violations, because of the evolution of the recombination landscape (). A promising perspective of future research using our model lies in the potential to focus the reconstruction of the recombination landscape to particular epochs, allowing more detailed investigation of its evolution. Evolution of the recombination landscape implies that the present-day distribution of cross-over events may carry little information about linkage that influenced long-term processes such as linked selection and introgression. Therefore, historical maps are more meaningful than present-day maps in the context of assessing the evolutionary consequences of recombination rate variation [58]. Due to its power with restricted sample sizes, iSMC is well suited to extract LD information from population genomic datasets with high-quality whole-genome sequences from a small number of individuals [73-76]. We have demonstrated its accuracy in species with contrasting levels of diversity, demographic histories and selective pressures, and posit that it will be useful for investigation in other species. Not only will such maps aid the interpretation of diversity in non-model organisms, but a picture of the recombination landscape in different groups will tell us about the nature of recombination itself. Open questions include whether the recombination landscape is associated with large-scale genome architecture and how variation in the recombination landscape relates to life history and ecological traits [77]. Finally, as ancient DNA samples become more common (including species other than humans [78]), it will be possible to obtain maps from extinct taxa, granting the opportunity to study the evolution of the recombination landscape with unprecedented resolution [44,79].

Methods

The Markov-modulated hidden Markov model framework

We now describe our current implementation of the model that focuses on the case of a single pair of genomes, noting that the Markov-modulated framework is general and can be extended to include multiple pairs. The original pair-wise SMC implementation [42] discretises a distribution of scaled coalescence times into t TMRCA intervals to implement a discrete space Hidden Markov Model (HMM) with t × t transition matrix given by: where Gij (the transition probabilities between genealogies i and j, each being a TMRCA interval) is a function of ancestral coalescence rates and the global parameter ρ, which is assumed to be constant along the genome [39,42]. The key innovation we introduce in iSMC is to relieve this assumption by letting ρ vary along the genome, following its own Markov process, where values drawn from a prior distribution are used to compute the transition probabilities between genealogies. (We use the equations of the SMC’ model [39] as derived in [43].) Let R be a strictly positive Gamma distribution with a single parameter (α = β, which constrains the distribution to have mean equal to 1.0) describing the variation in recombination along the genome. The shape of this prior distribution is flexible since α is estimated by iSMC together with demographic parameters. If R is discretised into k categories of equal density, the possible values that ρ can assume in the Markov-modulated process are all r×ρ0, where rj is the mean value inside the jth R category and ρ0 is the genome-wide average population recombination rate. Our Markov model (inspired by the observation that the distribution of cross-over events is not random, but clustered in regions of similar values) states that the probability distribution of R at position i + 1 only depends on the distribution at position i. We consider the case where the transition probability between any two R categories (Pij) is identical and equivalent to one auto-correlation parameter (δ), which is also estimated by iSMC. The transition matrix of this Markovian process is simply: Because ρ is a parameter of the SMC, spatial variation in the recombination rate affects the transition probabilities between genealogies (Q). Since spatial variation in ρ is itself modelled as a Markovian process, the combined process is said to be Markov-modulated by Q, leading to a Markov-modulated HMM. If t is the number of discretised TMRCA intervals of the SMC, and k is the number of discretised ρ categories, then the Markov-modulated HMM is an HMM with n = t × k hidden states (). The transition matrix of the Markov-modulated process, Q, has dimension n × n and is obtained from Q and Q: In brief, Q is a composition of k2 sub-matrices of dimension t × t, each being a Q assembled using ρ0 scaled the corresponding category of R to compute transition probabilities between genealogies. The main diagonal sub-matrices are further scaled by 1 –δ, and the off-diagonal sub-matrices by δ / (k– 1). The Markov structure allows efficient integration over all recombination rates by means of the forward recursion ().

Model selection and computation of the posterior recombination landscape

iSMC works in two steps: (1) fitting models of recombination rate variation and demography, and (2) inferring recombination maps based on the selected model. During step 1, the model parameters (ρ, δ, α and a set of splines controlling the shape of the demography, see below) are optimised by maximizing the likelihood using the Powell multi-dimensions procedure [80], which is computed for the entire sequence by applying the forward recursion of the HMM [48] as implemented in the zipHMMlib [81] at every position i of the alignment: where we integrate over all k discretised values of ρ and over all t TMRCA intervals. The transition between genealogies (G) is a function of both the focal recombination rate (ρl) and the ancestral coalescence rates, whereas G→S represents the emission probability from G to the observed state at position i–which is greatly simplified in the case of a single pairs of genomes [42]. In case AIC favours the heterogeneous model, in step 2 iSMC uses the estimated parameters to calculate the posterior average ρ for all sites in the genome. To this end, it first uses the so-called posterior decoding method [48] as implemented in zipHMMlib [81] to compute the posterior probability of every hidden state at each position in the sequence. Since in the Markov-modulated HMM the hidden states are pairs of ρ categories and TMRCA intervals (), this results, for all sites i in the genome, in an n-dimensional vector with posterior probabilities for each hidden state. We can interpret these vectors as joint probability distributions P(x,y) where {xl }1 ≤ l ≤ k is the k-dimensional vector of recombination rates () and {yj } 1 ≤ j ≤ t is the t-dimensional vector of TMRCA intervals (). Thus, if rl is the value of R inside discretised category l and ρ0 the genome-wide average recombination rate, the posterior average ρ at position i is given by

Time-restricted posterior decoding

In the computation of ρi described in Eq 5, we obtain the marginal probability distribution of recombination rates by summing the posterior probabilities of each category l over all TMRCA intervals. To focus the inferred maps on the recent past, we further constrain the computation of ρi using a three-step procedure: first, we decode site-specific TMRCA’s by choosing the interval with maximum posterior probability, integrating over all recombination rate classes; second we compute site-specific posterior-average recombination rates by considering the probability of each category of the Gamma prior exclusively inside the most likely TMRCA. The posterior average ρ for site i is then given by where x is the most likely TMRCA at site i. These first two steps lead to a reconstructed landscape consisting of a TMRCA and the average recombination rate within such TMRCA, for each site in the genome. Finally, when binning recombination maps, for each window we only compute the average ρ over sites with a TMRCA within a predefined range, e.g., 0 < τ < 10.

Testing hidden state configurations

The hidden states of iSMC are pairs of genealogies and recombination rates where both elements are drawn from discretised distributions. Since the complexity of the forward algorithm (which computes the likelihood) is quadratic in the number of hidden states, there is a limit to the discretisation scheme that can be adopted, as too fine a discretisation would lead to impractical execution times. We thus fixed the number of hidden states to 200 and used this limit to run iSMC in all simulated datasets. Within this maximum, however, it is possible to devise several combinations of hidden states by changing the way in which we discretise the TMRCA and ρ distributions. The goal of reconstructing the recombination landscape would in principle make natural the choice of investing in a fine-grained discretisation of the distribution of ρ. However, this would mean a coarse-grained discretisation of time and, since the signal for fitting the distribution of ρ comes from the expected number of TMRCA transitions, this strategy could reduce iSMC's power to detect such changes. Therefore, in the simulation study, we tested the performance of two configurations: 20 TMRCA intervals and 10 ρ categories and 40 TMRCA intervals and 5 ρ categories ().

Modelling complex demographic histories

HMM implementations of the SMC [41-43] typically use the expectation-maximisation algorithm to optimise transition probabilities, where the actual targets of inference–the coalescence rates at each time interval–are latent variables of the model. Here we use cubic spline interpolation (similarly to [41]) to map coalescence rates at time boundaries, which are then assumed to be piecewise constant for the duration of each interval. Because we use three internal splines knots (i.e., the demographic history is divided into four epochs wherein a cubic curve is fitted), the number of parameters is substantially reduced in our model–in particular when a fine discretisation of TMRCA is employed. Importantly, in the spline implementation, the number of model parameters is independent of the number of classes in the discretisation scheme.

Four scenarios of spatial variation in ρ

We simulated a piece-wise constant recombination rate along the genome by drawing values from a continuous Gamma distribution with parameters α and β, and segment lengths from a geometric distribution with mean length g. We considered four possible scenarios where α = β = 0.5 or 5.0, and g = 100 kb or 1 Mb. For each of the four combinations, we simulated 10 independent pairs of two 30 Mb haploid chromosomes under a constant population size model, assuming θ = 0.003 and ρ0 = 0.0012. For each of the following simulated scenarios, we focus on the landscape with α = 0.5 and g = 100 kb. All scenarios share the same sequence length, sample size, as well as θ and ρ parameter values.

Demographic history

We simulated two demographic scenarios. First, a 20-fold population expansion 0.01 coalescent time units ago; second, a 20-fold population bottleneck 0.5 coalescent time units ago. Assuming an effective population size of 30,000, these coalescent times correspond to 1,200 and 60,000 generations ago, for the expansion and bottleneck events, respectively.

Introgression events

We simulated two introgression scenarios where a source population introduces a pulse of genetic material into a target population. In both scenarios, the split between source and target populations happened 2.0 coalescent time units ago, and the source replaces 10% of the genetic pool of the target. In the first scenario, secondary contact happened 0.125 coalescent time units ago; in the second, it happened 0.25 coalescent time units ago. Assuming an effective population size of 30,000, the split between population happened 240,000 generations ago, and the introgression events happened 15,000 and 30,000 generations ago, respectively.

Variation in the mutation rate

We simulated a piece-wise constant mutation rate along the genome by drawing rate values from a uniform distribution between 0.1 and 10.0 and segment lengths from a geometric distribution with mean length f, where f is either 20 kb or 500 kb. The uniform distribution generating scaling factors of θ has a mean equal to 5.05 instead of 1.0. In this case, the expected genome-wide average θ = 0.015. The reason for that is our focus on the spatial distribution of θ itself. If the landscape had mean = 0.003, its highly heterogeneous nature would scale θ down to values well below ρ (0.0012) too often along the 30 Mb sequence. The ensuing loss of signal (due to low SNP density) would result in poorly inferred maps that display low correlations with the simulations, not because of spatial heterogeneity in θ (local transitions), but instead because the ratio θ / ρ would be too low in many windows across the chromosome. In all the above scenarios, the proportion of variance (R2) in simulated maps that is explained by inferred maps was computed after binning the landscapes into non-overlapping windows of 50 kb, 200 kb, 500 kb and 1 Mb, that is, the analysis is agnostic to the true breakpoints of the simulated landscapes.

Hominin demography

We simulated 100 Mb-long haploid chromosomes according to sample size and parameters provided by Fu et al. [61]. We chose this sequence length because focusing on a 100 Mb segment of chromosome 1 was sufficient to reconstruct the hominin phylogeny based on the recombination maps. In other words, there is enough signal on a 100 Mb segment of human genomes. We added a simulated recombination landscape with α = β = 0.5 and g = 100 kb to the ms arguments provided by the authors.

Data analyses

Model selection followed by inference of recombination maps in D. melanogaster and Z. tritici was performed using publicly available data (chromosome 2L from haploid pairs ZI161 / ZI170, ZI179 / ZI191 and ZI129 / ZI138 in the Drosophila Population Genomics Project Phase 3 [82]; chromosome 1 from haploid pairs Zt09 / Zt150, Zt154 / Zt155 and Zt05 / Zt07 for Z. tritici [64]. Gaps and unknown nucleotides in these FASTA sequences were assigned as missing data. R2 were computed as the square of the Pearson correlation coefficient between the resulting (binned) recombination landscape and available genetic maps, and their 95% confidence intervals (CI) estimated using 100 bootstrap replicates by re-sampling map windows with replacement. For the benchmarking in Z. tritici, we used the average between the two cross-over maps given in [62]. Individual IDs for the 22 modern human samples from the Simons Genome Diversity Panel [66] are presented in . Data were downloaded from Harvard's public repository (https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/vcfs.variants.public_samples.279samples.tar) in July 2018. We used the available strict mask to set low-quality regions as missing data. The four ancient DNA samples were downloaded from the server at the Max Planck Institute for Evolutionary Anthropology in Leipzig (http://cdna.eva.mpg.de/neandertal/Vindija/VCF) in May 2018. Details of the model fitting procedure in the ancient samples are described in the Results section. Based on estimates obtained from extant humans (), posterior decoding was performed by fixing ρ to 0.00015, α to 0.5 and δ to 0.000025. After binning the maps, all downstream analyses were conducted after discarding windows with more than 50% missing data in any of the hominin sequences and considering only positions present in the DECODE genetic map. The correlograms and dendrograms presented in were obtained by hierarchical clustering (using UPGMA) of pair-wise distances computed from 1 –r, where r is the Spearman correlation of ranks between two individual recombination maps. The clades with less than 60% of bootstrap support (over 1000 replicates) were collapsed in the dendrograms. Recombination maps for all samples are available as a resource on FigShare under the DOI: http://10.0.23.196/m9.figshare.8269703.

Computing performance

The limiting computing resources are different between the optimisation and decoding steps performed by iSMC. During optimisation, execution time is key: the program typically takes about two weeks to fit a 40x10 model on the largest human chromosome, using a standard desktop machine. iSMC treats chromosomes independently, meaning that if a whole-genome sequence is input, it can parallelise computation across the multiple chromosomes. Therefore, fitting the model on the whole-genome sequence of a human individual takes roughly the same time as it takes to fit the model on its largest chromosome, provided that enough CPUs are available for full parallelisation. During computation of the posterior average recombination rate, the limiting resource is memory. On the same desktop machine and under the same 40x10 model, it takes around 20 min and 24 Gb of RAM to decode a 2 Mb fragment of the genome. Since these operations can also be parallelised over distinct chromosomes, decoding the full human genome uses a peak 528 Gb of RAM (24 Gb for each of 22 chromosomes) and takes about 42 hours to complete (20 min for each of the 125 windows of size 2 Mb in its largest chromosome). Crucially, the optimisation and decoding steps can be decoupled, such that the user can estimate parameters on a first run of the program and (arbitrarily) later decode the recombination maps. This separation between optimisation and posterior decoding allows the user to request the appropriate computer resources at each step of the inference, e.g., when using a computing cluster. We note that both the speed and accuracy of the optimisation depend on the overall genetic diversity as well as on the proportion of “missing data” in the sequences. On the one hand, higher density of SNPs should increase iSMC’s ability to identify recombination events, thus traversing the likelihood surface more efficiently and taking less optimisation steps to find a local optimum. On the other hand, large amounts of missing data will “flatten” the likelihood surface, therefore reducing the power to distinguish between likelihood peaks and taking more time to converge. We recommend running iSMC on datasets with less than 15% of missing data.

AIC value for each replicate of each of the simulated landscapes.

(ODS) Click here for additional data file.

R2 for each replicate of each parameter of the simulated landscape, according to discretisation scheme.

(ODS) Click here for additional data file.

R2 for each replicate of each simulated demographic history, according to whether coalescence rates were jointly-inferred or not.

(ODS) Click here for additional data file.

R2 for each replicate of each scenario of introgression.

(ODS) Click here for additional data file.

R2 for each replicate of each scenario of mutation rate variation.

(ODS) Click here for additional data file.

Metadata of the extant human samples.

(ODS) Click here for additional data file.

Estimates of recombination-related parameters in modern human samples.

A: the genome-wide recombination rate (ρ); B: the shape of the Gamma distribution (α); C: the average frequency of change in the recombination rate along the genome (δ). (PDF) Click here for additional data file.

R2 for each replicate as a function of the θ / ρ ratio.

(PDF) Click here for additional data file.

Dendrograms computed from pair-wise similarities in recombination maps at the 50 kb scale.

A: time-restricted maps based on human whole-genome sequences; B: regular (non time-restricted) maps based on a simulation of hominin specific demography; C: time-restricted maps based on the same data as B. (PDF) Click here for additional data file.

R scripts required to reproduce our analyses.

(GZ) Click here for additional data file. 19 Jul 2019 * Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. * Dear Dr Valadares Barroso, Thank you very much for submitting your Research Article entitled 'Inference of recombination maps from a single pair of genomes and its application to ancient samples' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. As you will see, the reviewers were all quite positive in their assessments, but there are several issues that should be addressed. We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should respond to the specific points made by each reviewer. A carefully revised manuscript might not need to be seen again by the reviewers. I will make that judgment based on the quality of your revisions. In addition we ask that you: 1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. 2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images. We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org. If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission. PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process. To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder. [LINK] Please let us know if you have any questions while making these revisions. Yours sincerely, Bret Payseur Section Editor: Evolution PLOS Genetics Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: In this paper Barroso and colleagues present a novel method, based on the sequentially Markov coalescent, that can jointly infer spatial variation in recombination rate and past population sizes. The paper represents an advance in both the area of coalescent hidden Markov models--it is the first model to attempt to account for spatial heterogeneity in this framework--and recombination rate inference--making inference of recombination rates possible from even single diploid samples. Both of these contributions are important and will be useful to a broader audience of population geneticists, especially empiricists working in non-model organisms and the ancient DNA community. The paper is well-written, enjoyable to read, and represents and important advance but I do have a small number of comments that I list below. Major Comments: -On p. 13 the authors claim "these results cannot be accounted for by patterns of LD left by population-specific demographic histories because iSMC infers the recombination landscape while jointly modeling ancestral coalescence rates". I think that this statement conflates two issues, and is misleading. On the one hand, bias introduced by assuming a constant demography (see [1,2,3]; references listed at the end of this review) could cause unrelated populations with similar past population size histories to have similar inferred recombination rates--by accounting for past population sizes iSMC certainly deals with this potential issue. On the other hand, this issue is distinct from the more subtle genealogical LD hinted at in the manuscript. There is some "true" ARG relating all organisms (see [4]), and recombinations that happen in ancient parts of this ARG are likely shared by many extant individuals. The more closely related individuals are, the more likely they are to share these ancient recombination events. This would cause the inferred recombination rates to be correlated, even if the true underlying recombination rates are not (see the discussion in [5]). No method that infers historical recombination rates, including iSMC, can avoid this confounding. And particularly, because iSMC is using the signal from only a pair of haplotypes and so most events are expected to occur on the order of 1 coalescent unit ago (i.e. fairly anciently) many of these events will be shared across different samples, meaning that this confounding should be particularly prevalent in iSMC. -In a previous version of this manuscript (https://www.biorxiv.org/content/10.1101/452268v2) the authors included a discussion of the runtime, which is absent in the present manuscript. Any paper presenting a computational method should provide an empirical analysis of the runtime of the method (e.g., how long did it take to perform the whole genome analyses for the humans in the present manuscript?). In general, a runtime of days or weeks is not prohibitive, but users should certainly be aware of what to expect from reading the manuscript. There are also a number of potential speedups to the HMM that while certainly not necessary for publication, could dramatically improve the runtime of iSMC, which would make the method more broadly useful. Specifically, the transition matrix Q_rho has a special structure where conditioned on changing recombination rates, it is equally likely to change to any other recombination rate. In a number of previous methods [6, 7, 8], the exact same structure (although in a different context) has been exploited by computing the transition as a combination of the transition conditioned on changing state (whereby it does not matter which state the HMM was in previously) and conditioned on not changing state (whereby the state can only remain in the same state). Such considerations should reduce the runtime from quadratic to linear in the number of recombination rate bins. Similarly, the genealogical part of the HMM has sufficient structure to reduce the runtime from quadratic to linear (e.g., [9, 10]). An alternative approach is to use the spectral decomposition method of [11], which allows for the "skipping'" of homozygous sites. -On pp. 13-14 there is a discussion of why the correlation is higher in simulations than in the real data. While natural selection, introgression, and evolution of the recombination rate certainly play a role, there are simpler reasons for the improved performance on simulated data (discussed below in the minor comments). In particular, the evidence in favor of evolving recombination rates is presented in terms of the patterns of correlation at different scales in the real data and in simulated data. This is not a fair comparison because the r^2 on the simulated data is likely inflated for reasons discussed below and since r^2 is bounded by 1, the slope must necessarily be less steep. Minor Comments: -p. 8 and pp. 10-11, results should be reported for each window size (50kb, 200kb, 500kb, 1Mb), instead of just the median across window sizes on p. 8, and for a single window size for pp. 10-11. Furthermore, since hotspots are on the order of kb in length it would be good to compute these metrics at even finer scales such as 1kb. -p. 10 Saying that the "variance explained is significant" should be removed. Proportion of variance explained is non-negative, and so it only takes on the value 0 if the observed correlation (r) is exactly 0. For continuous observations, this should happen on a set of measure 0, and so no bootstrap-based test would fail to reject this null. A better test would be to see if the bootstrap confidence interval for the correlation itself (i.e., r instead of r^2) contains 0. Furthermore, by providing the observed proportions of variance explained (with confidence intervals) the authors already convey this information. -The presented simulation results are biologically unrealistic in such a way that would optimize the performance of iSMC. In particular, in humans and many other species the ratio of the mutation rate to the recombination rate is approximately 1, whereas in the simulations this ratio is >2, which will facilitate inference. This is further exacerbated in the variable mutation rate case where the average mutation rate over the recombination rate is further increased to >10. The length scale of changing recombination rates is also rather unrealistic, with changes occurring either every 100kb or 1Mb on average. Because hotspots tend to be on the order of a few kb in species with PRDM9, the length scales used in simulation are 2-3 orders of magnitude too large to accurately reflect realistic recombination maps. These smoother true recombination maps would make inference substantially easier. Additionally, the simulations involving introgression only test fairly ancient introgression -- for human-like values of Ne and generation time the more recent introgression simulation would correspond to an introgression time of about 70kya. A far more concerning issue in practice would be recent gene flow which would affect longer chunks of the genome, resulting in the sort of long-range LD patterns discussed in the manuscript. It is not absolutely necessary to perform new simulations to account for any of these issues (elevated mutation rate to recombination rate ratio, overly smooth true recombination maps, and ancient introgression events) but it would be good to explicitly point out that these simulation scenarios suggest a much better performance of iSMC than is likely to be seen in practice (see the aforementioned major comment). -The performance of the method on data from regions under selection is never benchmarked. While weak background selection can be approximated by a local reduction in Ne in terms of the amount of polymorphism, it is unclear that this should have a monotonic, linear effect on the inferred rho as implied on p. 13. The effect is likely to be even more complex in the presence of recent selective sweeps, stronger background selection, etc... I do not think that this is a major issue or needs to be directly addressed (e.g. by a simulation study), but perhaps a note of caution should be included, especially for species where selection is likely to be pervasive. Typos and very minor comments: -p. 3 "where a survey of the recombination landscape in closely related species are lacking" --> "where a survey of the recombination landscape in closely related species is lacking" -p. 8 "and the short length of blocks with constant [rho] leaves few signal in the data" --> "and the short length of blocks with constant [rho] leaves little signal in the data" -p. 12, the authors discuss dealing with the ancient samples in a different way by first inferring a demography and mean recombination rate, and then using parameters inferred from modern humans as a prior for the posterior decoding. This approach seems reasonable, but it should bias the inferred maps to be closer to modern humans. Since the ancient individuals still form a clade to the exclusion of modern humans, such bias would only imply that Neanderthals and Denisovans have even more diverged recombination maps. It would be good to discuss that this bias is in the "conservative" direction and so it does not affect the major conclusions of this section. References: [1] H. R. Johnston and D. J. Culter, "Population demographic history can cause the appearance of recombination hotspots". The American Journal of Human Genetics, 2012. [2] J. A. Kamm, J. P. Spence, J. Chan, Y. S. Song, "Two-locus likelihoods under variable population size and fine-scale recombination rate estimation". Genetics, 2016. [3] A. L. Dapper, B. A. Payseur, "Effects of demographic history on the detection of recombination hotspots from linkage disequilibrium". Molecular Biology and Evolution, 2017. [4] P. L. Ralph, "An empirical approach to demographic inference with genomic data". Theoretical Population Biology, 2019. [5] J. P. Spence and Y. S. Song, "Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations". BioRxiv, 2019. [6] P. Fearnhead and P. Donnelly, "Estimating recombination rates from population genetic data". Genetics, 2001. [7] N. Li and M. Stephens, "Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data". Genetics, 2003. [8] S. Sheehan, K. Harris, and Y. S. Song, "Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach". Genetics, 2013. [9] K. Harris, S. Sheehan, J. A. Kamm, and Y. S. Song, "Decoding coalescent hidden Markov models in linear time". RECOMB, 2014. [10] P. F. Palamara, J. Terhorst, Y. S. Song, and A. L. Price, "High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability". Nature Genetics, 2018. [11] J. Terhorst, J. A. Kamm, and Y. S. Song, "Robust and scalable inference of population history from hundreds of unphased whole genomes". Nature Genetics, 2017. Reviewer #2: This paper presents a useful new method for jointly inferring population size changes with the recombination rate landscape of a single genome. The method is very well described, and based on benchmarking studies summarized by the authors, it infers maps that are at least as well correlated with pedigree-based maps as those produced by state of the art methods that require input data from many genomes. I do think that although the results presented on pages 10 and 11 are impressive, they don’t seem strong enough to justify the section heading “iSMC outpeforms the state of the art…” Although the point estimate of variance explained by the iSMC is higher than the point estimate of variance explained by LD helmet, the confidence interval of the iSMC estimate includes the LD helmet estimate. Is there a stronger case to be made that the iSMC is doing better than LD helmet overall? Or do certain genomes yield better estimates because those genomes are closely related to the genomes that were used to make the family-based estimates? One parameter of the method that is not very well described is the gamma prior probability distribution that the recombination rate will change from one locus to the next. There’s nothing in the coalescent model that dictates how fast the recombination rate is likely to change as a function of sequence, so this choice seems somewhat arbitrary and might not be well optimized as is. Unless I missed it, the manuscript doesn’t seem to discuss how the shape and scale of the gamma distribution have been chosen, and it seems worth exploring whether varying them could improve iSMC’s performance. Figure 3E-H suggests that the inferred recombination landscape is systematically too spikey and biased toward high recombination rates, and perhaps this problem could be fixed or reduced by lowering the prior probability of recombination rate change between loci. It is interesting to see that recombination landscapes cluster within human populations, but I’m concerned that this result could be misleading because of misidentification of genetic drift or demographic change as recombination landscape change. This result would be more convincing if it were accompanied by a proof of concept simulation study. If you were to simulate data from human populations that diverged under a realistic demographic scenario but share the same recombination map, would iSMC infer similar recombination maps from all individuals or instead infer some variation that comes from misidentified demographic variation? In other words, would inferred recombination maps from these populations also appear to cluster by population even though this structure is not built into the simulation? Reviewer #3: This paper presents a method, iSMC, that is adapted from the PSMC framework and allows for the joint inference of demography and spatial variation in recombination rate along the genome. The method applies to a single unphased diploid sample, and aims to uncover a recombination map in order to aid studies of recombination in non-model systems. The paper considers an interesting problem, and while there are important caveats, the results are generally promising, particularly it seems for species with high diversity. Beginning with the positive results, the authors demonstrate that iSMC has fairly good to very good squared correlations between their inferred maps and the true map in simulated data. The real data results from Z. tritici are not great, but those from D. melanogaster are reasonably good and the correlations of both are better than those from LDhat given data from more than one sample. The results from humans are less reliable, with R^2 to the deCODE map of 0.166 at the 50 kb scale and 0.550 at the 1 Mb scale. While the text says that, “This suggests that the differences in R^2 between simulation and case studies are not only driven by noise, but also by biological factors,” this is hard to be certain of. For example, the simulations use higher diversity rates with a genome-wide average theta of 0.015 (pg. 21), or >10-fold higher than human diversity. Since SNPs are the fundamental signal for the analysis, the simulation results need not be informative of the results in humans. Moreover, the deCODE, pedigree-based map is strongly correlated with the LD-based HapMap map (Kong et al. 2010) -- with a correlation (seemingly just R) of 0.729 at the 10 kb scale and 0.920 at 3 Mb. Given the above, the focus of the manuscript on ancient hominin samples is hard to justify. Instead, the authors could focus on species and potentially ancient samples that have a much higher diversity rate. A related concern is the inferred phylogenies, particularly the locations of the Ust’Ishim ancient sample, which seems more reasonable to place as an out-of-Africa sample, but instead is a sister group to other modern humans. To help understand the contexts in which iSMC can function, providing more detailed simulation results including a wider range of theta (diversity) values, and potentially expanding on the demographic scenarios tested would be informative. Lastly, please quote runtimes of iSMC. Minor: Pg. 2: “sequences from the ancestral population” - here, “ancestral population” is not well defined and could be clarified. Pg. 5: “extending the SMC to accommodate spatial heterogeneity in rho ...” - this is about PSMC, not the SMC, right? Pg. 6: “potentially leading to more accurate estimates” - it seems like the text here could be a bit more specific and conclusive about the estimation accuracy. More specifically, the word “potentially” could be omitted and the context of greater accuracy provided. Pgs. 8-10: while this is in the Results section, a few more details about the simulation scenarios, such as the meaning of the A (alpha) and g parameters (Fig 2) and what the simulation parameters are would be good to include here. (Also see pg. 12, which mentions rho, alpha, and delta without definition.) Pg. 12 “with 100% bootstrap support (with one exception, see below)” - please give the true % including the exception. Eq. (3): the rho_i notation doesn’t seem to be introduced; only rho_0 and r_j. Eq. (4): the probabilities here, e.g., Pr(rho_l -> rho_m), use a different notation than earlier in the text, where P_lm represents this probability. Would be helpful to be more consistent. Fig. 3: While it seems likely that A-D and E-H represent increasing bin sizes, it would be good to make this more explicit. Pg. 11 / citation 67: This is not the high coverage Vindija Neanderthal paper. See https://science.sciencemag.org/content/358/6363/655. ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Jeffrey P. Spence Reviewer #2: No Reviewer #3: No 11 Sep 2019 Submitted filename: ResponseToReviewers.odt Click here for additional data file. 30 Sep 2019 Dear Dr Valadares Barroso, We are pleased to inform you that your manuscript entitled "Inference of recombination maps from a single pair of genomes and its application to ancient samples" has been editorially accepted for publication in PLOS Genetics. Congratulations! Please note the comment from Reviewer 3. Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made. Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org. In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date. Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics! Yours sincerely, Bret Payseur Section Editor: Evolution PLOS Genetics www.plosgenetics.org Twitter: @PLOSGenetics ---------------------------------------------------- Comments from the reviewers (if applicable): Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The authors have addressed all of my concerns. Reviewer #2: The authors have revised the article to my satisfaction. I have no further comments. Reviewer #3: The authors have addressed my prior concerns, including more thorough evaluations of iSMC’s performance for smaller values of theta / rho. The model and paper are both very interesting. I have only one comment (and two other textual comments). On pg. 116-9: “In the SMC, transition probabilities between genealogies are functions of ancestral coalescence rates and – of key relevance to this study – the population recombination rate (ρ) [42,43]. An important limitation of current implementations, however, is that they neglect variation in ρ along the genome and model its genome-wide average instead.” The above isn’t true of ARGweaver, which uses an input genetic map to aid inference. This is worth noting. Minor: Line 133: “Coalescent” need not be capitalized. Line 348-9: extra “of” in “to non-overlapping of genomes pairs belonging to the same population” ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Jeffrey P. Spence Reviewer #2: No Reviewer #3: No ---------------------------------------------------- Data Deposition If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website. The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-19-00987R1 More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support. Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present. ---------------------------------------------------- Press Queries If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org. 22 Oct 2019 PGENETICS-D-19-00987R1 Inference of recombination maps from a single pair of genomes and its application to ancient samples Dear Dr Valadares Barroso, We are pleased to inform you that your manuscript entitled "Inference of recombination maps from a single pair of genomes and its application to ancient samples" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work! With kind regards, Matt Lyles PLOS Genetics On behalf of: The PLOS Genetics Team Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom plosgenetics@plos.org | +44 (0) 1223-442823 plosgenetics.org | Twitter: @PLOSGenetics
  8 in total

1.  Population-Specific Recombination Maps from Segments of Identity by Descent.

Authors:  Ying Zhou; Brian L Browning; Sharon R Browning
Journal:  Am J Hum Genet       Date:  2020-06-12       Impact factor: 11.025

2.  Accurate recombination estimation from pooled genotyping and sequencing: a case study on barley.

Authors:  Michael Schneider; Federico Casale; Benjamin Stich
Journal:  BMC Genomics       Date:  2022-06-25       Impact factor: 4.547

Review 3.  Inferring recombination patterns in African populations.

Authors:  Gerald van Eeden; Caitlin Uren; Marlo Möller; Brenna M Henn
Journal:  Hum Mol Genet       Date:  2021-04-26       Impact factor: 6.150

4.  Population Genomics of the Maize Pathogen Ustilago maydis: Demographic History and Role of Virulence Clusters in Adaptation.

Authors:  Gabriel Schweizer; Muhammad Bilal Haider; Gustavo V Barroso; Nicole Rössel; Karin Münch; Regine Kahmann; Julien Y Dutheil
Journal:  Genome Biol Evol       Date:  2021-05-07       Impact factor: 3.416

5.  The distribution of waiting distances in ancestral recombination graphs.

Authors:  Yun Deng; Yun S Song; Rasmus Nielsen
Journal:  Theor Popul Biol       Date:  2021-06-26       Impact factor: 1.514

6.  A community-maintained standard library of population genetic models.

Authors:  Jeffrey R Adrion; Christopher B Cole; Noah Dukler; Jared G Galloway; Ariella L Gladstein; Graham Gower; Christopher C Kyriazis; Aaron P Ragsdale; Georgia Tsambos; Simon Gravel; Ryan N Gutenkunst; Kirk E Lohmueller; Peter L Ralph; Daniel R Schrider; Adam Siepel; Jerome Kelleher; Andrew D Kern; Franz Baumdicker; Jedidiah Carlson; Reed A Cartwright; Arun Durvasula; Ilan Gronau; Bernard Y Kim; Patrick McKenzie; Philipp W Messer; Ekaterina Noskova; Diego Ortega-Del Vecchyo; Fernando Racimo; Travis J Struck
Journal:  Elife       Date:  2020-06-23       Impact factor: 8.140

7.  Estimating the rates of crossover and gene conversion from individual genomes.

Authors:  Derek Setter; Sam Ebdon; Ben Jackson; Konrad Lohse
Journal:  Genetics       Date:  2022-08-30       Impact factor: 4.402

8.  The Theory and Applications of Measuring Broad-Range and Chromosome-Wide Recombination Rate from Allele Frequency Decay around a Selected Locus.

Authors:  Kevin H-C Wei; Aditya Mantha; Doris Bachtrog
Journal:  Mol Biol Evol       Date:  2020-12-16       Impact factor: 16.240

  8 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.