Milan Malinsky1,2, Emiliano Trucchi3, Daniel John Lawson4, Daniel Falush5. 1. Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland. 2. Wellcome Trust Sanger Institute, Cambridge, United Kingdom. 3. Department of Life Sciences and Biotechnology, University of Ferrara, Ferrara, Italy. 4. School of Social and Community Medicine, University of Bristol, Bristol, United Kingdom. 5. Milner Centre for Evolution, University of Bath, Bath, United Kingdom.
Abstract
Powerful approaches to inferring recent or current population structure based on nearest neighbor haplotype "coancestry" have so far been inaccessible to users without high quality genome-wide haplotype data. With a boom in nonmodel organism genomics, there is a pressing need to bring these methods to communities without access to such data. Here, we present RADpainter, a new program designed to infer the coancestry matrix from restriction-site-associated DNA sequencing (RADseq) data. We combine this program together with a previously published MCMC clustering algorithm into fineRADstructure-a complete, easy to use, and fast population inference package for RADseq data (https://github.com/millanek/fineRADstructure; last accessed February 24, 2018). Finally, with two example data sets, we illustrate its use, benefits, and robustness to missing RAD alleles in double digest RAD sequencing.
Powerful approaches to inferring recent or current population structure based on nearest neighbor haplotype "coancestry" have so far been inaccessible to users without high quality genome-wide haplotype data. With a boom in nonmodel organism genomics, there is a pressing need to bring these methods to communities without access to such data. Here, we present RADpainter, a new program designed to infer the coancestry matrix from restriction-site-associated DNA sequencing (RADseq) data. We combine this program together with a previously published MCMC clustering algorithm into fineRADstructure-a complete, easy to use, and fast population inference package for RADseq data (https://github.com/millanek/fineRADstructure; last accessed February 24, 2018). Finally, with two example data sets, we illustrate its use, benefits, and robustness to missing RAD alleles in double digest RAD sequencing.
Understanding of shared ancestry in genetic data sets is often key to their interpretation. The chromoPainter/fineSTRUCTURE package (Lawson et al. 2012) represents a powerful model-based approach to investigating population structure using genetic data. It offers especially high resolution in inference of recent shared ancestry, as shown for example in the investigations of worldwide human population history (Hellenthal et al. 2014) and of genetic structure of the British population (Leslie et al. 2015). Further advantages, when compared with other model-based methods, such as STRUCTURE (Pritchard et al. 2000) and ADMIXTURE (Alexander et al. 2009), include the ability to deal with a very large number of populations, explore relationships between them, and to quantify ancestry sources in each population.The high resolution of chromoPainter/fineSTRUCTURE and related methods derives from utilizing haplotype linkage information and from focusing on the most recent coalescence (common ancestry) among the sampled individuals. This approach derives a “coancestry matrix,” a summary of nearest neighbor haplotype relationships in the data set; that is, of the cases where pairs of individuals had the most similar haplotypes one to another. However, the existing pipeline for coancestry matrix inference was designed for large scale human genetic single nucleotide polymorphism (SNP) data sets, where chromosomal locations of the markers are known, haplotypes are typically assumed to be correctly phased (although it is possible to perform the analysis without this assumption), and missing data also need to have been imputed. Therefore, these methods have so far been generally inaccessible for investigations beyond model organisms.With no requirements for prior genomic information (e.g., no need for a reference genome) and relatively low cost, RADseq data are fuelling a boom in ecological and evolutionary genomics, especially for nonmodel organisms (Andrews et al. 2016), where questions on population structure and relative ancestry are among the most commonly asked. Therefore, we have developed RADpainter, a simple method designed to infer the coancestry matrix from RADseq data. RADpainter is designed to take full advantage of the sequence of all the SNPs from each RAD locus, finding (one or more) closest relatives for each allele. The information about the nearest neighbors of each individual is then summed up into the coancestry similarity matrix. We package RADpainter together with the fineSTRUCTURE Markov chain Monte Carlo (MCMC) clustering algorithm into an easy to use population inference package for RADseq data called fineRADstructure.
New Approaches
Briefly, the coancestry matrix is calculated as follows: for each RAD locus and each individual (a recipient), we calculate the number of sequence differences (i.e., SNPs) between that individual’s allele(s) and the alleles in all other individuals (potential donors). The closest relative (donor) for each allele is its nearest neighbor allele, that is, the allele with the least number of differences. In the case of multiple equally distant nearest neighbors, an equal proportion of coancestry is assigned to each “donor.” Finally, we sum these local coancestry values across all loci to obtain the coancestry matrix for the full data set. A basic outline of the coancestry estimation procedure for haploid individuals is shown in supplementary Algorithm S1, Supplementary Material online.Differences in ploidy are handled by averaging coancestry across alleles in the same individual. Figure 1 provides a further illustration of the method, showing co-ancestry matrix calculation at a single locus for four diploid individuals. However, we have implemented RADpainter in a way to handle arbitrary number of alleles per individual (i.e., any ploidy levels), depending on the input; ploidy can even vary across individuals in a single analysis. We expect these features to be of particular use to the plant research community.
. 1
Coancestry matrix example with four diploid individuals and no missing data. (A) For each RAD locus (stack), we use all the variable sites in the stack to define the stack haplotypes H1 and H2. The calculation proceeds by finding the closest relatives for each haplotype. (B) With individual 1 as the recipient (filling the first row of the coancestry matrix), we find that the single closest relative of H1 in individual 1 is the haplotype H1 in individual 2. These haplotypes are identical (ACTG; red). Haplotype H2 in individual 1 has two closest relatives, both with an identical sequence (ATTT; green). (C) Focusing on individual 2 as the recipient (filling the second row of the coancestry matrix) we find that haplotype H1 in individual 2 has a single closest relative: H1 in individual 1, in a mirror image of (B). However, H2 in individual 2 does not have an identical closest relative. Its “donors” are the four haplotypes that differ from it by a single SNP. (D) The coancestry matrix corresponding to the nearest-neighbor haplotype relationships at this locus. Note that the matrix is not symmetrical: row totals are identical for each individual and correspond to the number of RAD loci analyzed. In contrast, column totals vary. Individuals with “outlier” sequences are the least likely donors, CCTT in individual 4 being a clear outlier haplotype.
Coancestry matrix example with four diploid individuals and no missing data. (A) For each RAD locus (stack), we use all the variable sites in the stack to define the stack haplotypes H1 and H2. The calculation proceeds by finding the closest relatives for each haplotype. (B) With individual 1 as the recipient (filling the first row of the coancestry matrix), we find that the single closest relative of H1 in individual 1 is the haplotype H1 in individual 2. These haplotypes are identical (ACTG; red). Haplotype H2 in individual 1 has two closest relatives, both with an identical sequence (ATTT; green). (C) Focusing on individual 2 as the recipient (filling the second row of the coancestry matrix) we find that haplotype H1 in individual 2 has a single closest relative: H1 in individual 1, in a mirror image of (B). However, H2 in individual 2 does not have an identical closest relative. Its “donors” are the four haplotypes that differ from it by a single SNP. (D) The coancestry matrix corresponding to the nearest-neighbor haplotype relationships at this locus. Note that the matrix is not symmetrical: row totals are identical for each individual and correspond to the number of RAD loci analyzed. In contrast, column totals vary. Individuals with “outlier” sequences are the least likely donors, CCTT in individual 4 being a clear outlier haplotype.Concatenating all the variable sites (SNPs) at each locus to define alleles increases the resolution of RADpainter when compared with methods that only use a single SNP from each locus. This can be seen clearly by considering the fact that a single biallelic SNP splits the alleles at a locus into only two groups, whereas multiple linked SNPs typically enable more refined assignment of nearest neighbor relationships, as illustrated in figure 1. Given the short length of each RAD locus (typically < 250 bp), we assume that all SNPs within the locus are in almost complete linkage disequilibrium (LD) (i.e., that D′ 1) and, therefore, historical recombination can be disregarded.On the other hand, the summation across loci assumes frequent recombination between different RAD sequences—each RAD locus is counted as if providing independent evidence with regards to coancestry. This assumption is not always realistic, in particular when the data set contains pairs of loci from the two sides of each restriction site. However, we account for any linkage between RAD loci by adjusting the normalizing constant c which is passed on to the fineSTRUCTURE clustering algorithm together with the co-ancestry matrix and determines its sensitivity (Lawson et al. 2012).Briefly, the fineSTRUCTURE clustering algorithm (Lawson et al. 2012) uses a MCMC scheme to explore the space of population configurations (sample “partitions”) by proposing merging or splitting populations, merging then resplitting, or moving individuals. A proposed population configuration is accepted with a probability derived from the ratio of the likelihood with the previous configuration, a likelihood that in turn depends on the terms of the coancestry matrix scaled by the c value. Given the final fineSTRUCTURE output, we can infer the number of clusters, deal with a very large number of potential clusters, quantify ancestry sources in each group, and also explore relationships between groups, including with a simple tree-building algorithm (Lawson et al. 2012). Our new fineRADstructure package includes a set of R scripts that can be used to plot the results, including the clustered coancestry matrix, the tree with posterior population assignment probabilities, and a matrix with coancestry averages per population.As in the original chromoPainter tool, the aim of estimating the c parameter is to correct for the true underlying variance of the entries of the coancestry matrix (), so that the multinomial likelihood used in the clustering step matches the true statistical uncertainty in the matrix. In order to estimate an appropriate c value for each data set, RADpainter calculates empirical variance () for each entry of the coancestry matrix by jackknife, by default in blocks of 100 consecutive RAD loci. The entries are divided by the theoretical variance under the multinomial distribution for the coancestry matrix, that is, assuming that an element follows the binomial distribution: where is the total number of RAD loci in the data set and is the probability of individual j being the closest relative of individual i at any particular locus.If LD between RAD loci is weak or if the data have been mapped and sorted according to genome coordinates (so that loci in LD are grouped together within the jackknife blocks), then RADpainter correctly estimates the effective number of independent loci. However, when loci are not sorted and LD between them is strong then the estimation procedure may underestimate c and so be overconfident in population splits. To combat this we provide a script which reorders the RAD loci according to LD. We recommend users with unmapped data to run this before the RADpainter procedure in order to ensure they obtain a conservative upper bound on the number of statistically identifiable clusters. To test this approach, we constructed a RAD data set with pairs of loci in perfect LD by duplicating all loci in a RAD data set and randomly shuffling the positions of the duplicates. We found that c doubled after LD-based reordering of loci, estimating the same number of effectively independent loci correctly.Unknown nucleotides (Ns) can be present within alleles and their positions are ignored in all pairwise sequence comparisons. Where the entire donor and/or the recipient alleles are missing, their coancestry is assumed to be proportional to the amount of coancestry observed between them in the rest of the data (i.e., “missing data” coancestry is shared proportionally to “observed data” coancestry). However, it is well-known that one of the causes of missing alleles in RAD data is the presence of genetic polymorphisms in the restriction sites, referred to as allele dropout. Because allele dropout can lead to nonrandom missingness and thus influence inferences of population divergence (Arnold et al. 2013; Gautier et al. 2013), we suggest caution. The most problematic loci may be removed by filtering loci with a large excess of null alleles. In addition, we recommend that users should check for occurrence of large systematic differences in missingness between putative populations, and urge caution in interpreting the results if such differences occur. In some cases, it may be appropriate to remove outlier individuals with large amount of missingness prior to a rerun of the analysis pipeline. To assist these steps, each run of RADpainter outputs missingness (the proportion of missing alleles) per individual.To facilitate the use of fineRADstructure with existing RAD-seq processing pipelines, the input file can be generated by the widely used Stacks RAD-seq tool set (Catchen et al. 2013). This can be done directly, via output from the export_sql.pl Stacks program. In addition, for users who do not use the Stacks SQL database, we provide a data conversion and filtering script Stacks2fineRAD.py for processing the output from the core Stacks populations program. A third party utility script (https://github.com/edgardomortiz/fineRADstructure-tools; last accessed February 24, 2018) also enables conversion from the format of the pyRAD and ipyrad toolkits (Eaton 2014). Finally, our input format is a simple flat text file and we provide an example data set to enable the users of other pipelines to prepare their data.
Results
We applied fineRADstructure to a single-digestion RAD data set including 120 individuals from 12 populations of the alpine plant species complex Heliosperma pusillum. The data set comprises 1,097 loci which have been assembled through the Stacks pipeline without a reference genome (Trucchi et al. 2017). The complicated network of relationships among these twelve populations belonging to two phylogenetically intertwined species (H. pusillum and Heliosperma veselskyi), with contrasting ecology and a postglacial history of divergence in some of the six sampled localities, make it an excellent case to study the performance of our approach (fig. 2).
. 2
Method comparison: fineRADstructure, fastSTRUCTURE, and adegenet PCA. The analyses include individuals from six localities (A–F) and two sibling species (Heliosperma pusillum: solid line, Heliosperma veselskyi: dashed line). (A) Clustered fineRADstructure coancestry matrix. Individuals within the twelve populations share more coancestry with each other than between populations. Hierarchical structure among and between localities is clearly inferred—populations at localities B and C cluster by species, whereas populations at localities A, D, E, and F cluster by locality. Missingness is shown below each population label, confirming that it does not vary systematically between inferred populations. (B) fastSTRUCTURE results are shown at the suggested model complexity K = 6; the inflection point where the goodness-of-fit curve—BIC estimated for increasing number of K changes its slope. (C) PCA as implemented by the adegenet package. The main plot uses all samples, while the inset shows PCA performed only on individuals from population pairs A–D.
Method comparison: fineRADstructure, fastSTRUCTURE, and adegenet PCA. The analyses include individuals from six localities (A–F) and two sibling species (Heliosperma pusillum: solid line, Heliosperma veselskyi: dashed line). (A) Clustered fineRADstructure coancestry matrix. Individuals within the twelve populations share more coancestry with each other than between populations. Hierarchical structure among and between localities is clearly inferred—populations at localities B and C cluster by species, whereas populations at localities A, D, E, and F cluster by locality. Missingness is shown below each population label, confirming that it does not vary systematically between inferred populations. (B) fastSTRUCTURE results are shown at the suggested model complexity K = 6; the inflection point where the goodness-of-fit curve—BIC estimated for increasing number of K changes its slope. (C) PCA as implemented by the adegenet package. The main plot uses all samples, while the inset shows PCA performed only on individuals from population pairs A–D.The fineRADstructure results (a clustered coancestry matrix; fig. 2) make the presence of twelve populations immediately clear, with substructure suggested in some of the populations. The relationships between some of the populations at localities A, B, C, and D are clearly not tree-like with strong evidence of heterogeneous gene flow between the species (Trucchi et al. 2017). A variable level of intrapopulation co-ancestry, likely related to different degree of isolation is also visible across the populations (e.g., the two populations sampled at locality F are highly isolated from each other and also the most distinct from populations sampled at all the other localities).A number of benefits of our method can be seen in comparison against other approaches commonly used for population inference. In particular, the analysis using fastSTRUCTURE (Raj et al. 2014) at K = 6 supported a clustering mainly by locality, with the exception of populations in B and C which clustered by ecology (fig. 2). The choice of model complexity K = 6 was based on the rate of decrease in the value of the Bayesian Information Criterion (BIC; fig. 2) as returned by the find.clusters function in the R package adegenet (Jombart and Ahmed 2011) and follows the methodology used in the original manuscript that presented the data (Trucchi et al. 2017). Principal Component Analysis (PCA) implemented by the glPCA function in adegenet (Jombart and Ahmed 2011) was also unable to clearly partition the genetic diversity in twelve populations (fig. 2). The genetic structure was partly resolved by PCA by rerunning the analysis including only samples from populations in localities A, B, C, D (fig. 2 inset), but substantial overlaps remained between the population clusters, thus still providing lower resolution than fineRADstructure.Overall, in terms of biological insight, a major improvement of our approach over previous results in this data set is the clear visibility, at the same time, of both a global structure produced by a postglacial history of recolonization of newly ice-free mountain areas and a local structure related to parallel ecological divergence; previously these two phenomena were inferred only by applying a combination of multiple different analyses (Trucchi et al. 2017).Next we used a test data set produced using the double digest RAD (ddRAD) approach (Peterson et al. 2012) to assess the robustness of fineRADstructure to nonrandom data missingness, specifically to the batch effect commonly present especially in ddRAD data sets, and to allele dropout. The data set includes 76 samples from two different subspecies and three hybrid individuals. The ddRAD protocol is very sensitive to inconsistent size selection across different libraries as shifts in the size range directly influences which loci are included and sequenced in each library. Five different libraries (L1–L5), each of which underwent size selection (on a ThermoFischer E-gel) and sequencing on a PGM Ion torrent machine separately, are included in this data set. We preprocessed the data using the common Stacks RADseq pipeline (Catchen et al. 2013). Briefly, after trimming to 200 bp, raw reads were demultiplexed and quality filtered using the process_radtags.pl program, RAD loci were then assembled without a reference genome and SNPs called using the denovo_map.pl program (parameters -m5,-M6,-n8) and, finally, the populations program was used to filter and export the loci (parameters -r0.8,-p1,--min_maf0.05--vcf_haplotypes).We the used our script Stacks2fineRAD.py (a part of our software package) to remove samples with >20% missing loci (fig. 3) and loci with more than five SNPs. To assess the effect of missing data, our script transforms the resulting genotype matrix into a presence/absence matrix (replacing any genotype with 1 and missing data with 0) and produces a PCA ordination of the samples (sklearn.decomposition package in python).
. 3
Assessing the robustness of fineRADstructure inference to systematic missingness differences. (A) Per sample missingness in the original data set. We removed all individuals with missingness >20% (dashed line). (B) A “missingness PCA,” based on allele presence/absence matrix. Samples clearly cluster by library (L1–L5; identified by color as in (A)) due to batch effect and by subspecies due to allele dropout. (C) Clustered fineRADstructure coancestry matrix. Individuals from the two subspecies cluster together, with further substructure visible within the subspecies. Importantly, the structure within the subspecies does not relate to the missingness PCA clusters (i.e., it does not relate to the library batch). Hybrid individuals are clearly identifiable, sharing relatively more equal coancestry levels with both subspecies.
Assessing the robustness of fineRADstructure inference to systematic missingness differences. (A) Per sample missingness in the original data set. We removed all individuals with missingness >20% (dashed line). (B) A “missingness PCA,” based on allele presence/absence matrix. Samples clearly cluster by library (L1–L5; identified by color as in (A)) due to batch effect and by subspecies due to allele dropout. (C) Clustered fineRADstructure coancestry matrix. Individuals from the two subspecies cluster together, with further substructure visible within the subspecies. Importantly, the structure within the subspecies does not relate to the missingness PCA clusters (i.e., it does not relate to the library batch). Hybrid individuals are clearly identifiable, sharing relatively more equal coancestry levels with both subspecies.To evaluate the effect of systematic missingness biases on fineRADstructure results, we compared the “missingness” PCA with the co-ancestry matrix (fig. 3). This comparison clearly shows that nonrandom missingness due to batch (i.e., library) effect is not at all mirrored in the genetic structure inferred by fineRADstructure; individual samples from the different sequencing libraries are intermixed within the inferred subspecies clusters (fig. 3). Additionally, genetic hybrids between the two subspecies are clearly identified by fineRADstructure (as having more equal levels of coancestry with both of the subspecies; fig. 3), despite their missing data profile being more similar to subspecies 1. Thus, despite considerable nonrandom missingness due to genetic divergence (i.e., allele dropout), the true genetic relationship is reflected in the coancestry matrix and the missingness signal does not dominate.
Discussion
In this manuscript, we have described software that enables fine population structure inference based on nearest neighbor (first coalescence) relationship between haplotypes inferred from RAD-seq data. Thus, our software brings the benefits of this approach especially to genomics of nonmodel organisms, which currently relies heavily on RAD-seq. In this context, it is useful to emphasize two considerations related to the use of fineRADstructure across a range of species.First, because we assume independence between RAD loci, the coancestry matrix entries do not depend on RAD loci being ordered along the genome. Therefore, analyses with and without a reference genome will produce identical coancestry matrices. Second, our method benefits from linkage (LD) between multiple polymorphisms within the same RAD locus, enabling a more specific assignment of nearest neighbor relationships between individuals, as shown in figure 1. Therefore, unlike many methods which assume markers are unlinked and require removal of polymorphisms that are in high LD with each other (e.g., PCA based methods, STRUCTURE), fineRADstructure gains additional power from their presence within a RAD locus, as will be the case especially in species with high nucleotide polymorphism levels.fineRADstructure aims to detect and report all excess sample relatedness that rises above statistical noise in the coancestry matrix, that is, a population is defined as a group of individuals with indistinguishable genetic ancestry in the data set. This approach resulted in biologically meaningful clustering for all the data sets we tested. However, we would caution against interpreting the number of clusters as a definite estimate of the “correct” number of populations. As seen for example in figure 2, species often contain different levels of population substructure. Therefore, we suggest that especially in complex data sets users may sometimes obtain a better intuition about the relationships between their samples by repeating the clustering step several times with varying sensitivity (c parameter values) and interpreting the results jointly.Finally, we would like to point out that although the ad hoc tree building approach generally performs well, it is intended to be illustrative of the relationships between the populations, rather than to represent the true population history. Therefore, we suggest that where the true historical relationships are known (e.g., as a result of independent phylogenetic analysis), the clustering step can be skipped and the coancestry matrix ordered according to the pre-existing information. Under those circumstances, the entries of the coancestry matrix provide an independent source of information on recent sample relatedness, complementing the information about older historical relationships; this may be especially informative in cases of recent gene-flow between populations.
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.Click here for additional data file.
Authors: Kimberly R Andrews; Jeffrey M Good; Michael R Miller; Gordon Luikart; Paul A Hohenlohe Journal: Nat Rev Genet Date: 2016-01-05 Impact factor: 53.242
Authors: Daniel Falush; Simon Myers; Garrett Hellenthal; George B J Busby; Gavin Band; James F Wilson; Cristian Capelli Journal: Science Date: 2014-02-14 Impact factor: 47.728
Authors: Stephen Leslie; Bruce Winney; Garrett Hellenthal; Dan Davison; Abdelhamid Boumertit; Tammy Day; Katarzyna Hutnik; Ellen C Royrvik; Barry Cunliffe; Daniel J Lawson; Daniel Falush; Colin Freeman; Matti Pirinen; Simon Myers; Mark Robinson; Peter Donnelly; Walter Bodmer Journal: Nature Date: 2015-03-19 Impact factor: 49.962
Authors: Emiliano Trucchi; Božo Frajman; Thomas H A Haverkamp; Peter Schönswetter; Ovidiu Paun Journal: New Phytol Date: 2017-08-07 Impact factor: 10.151
Authors: Vadim B Fedorov; Emiliano Trucchi; Anna V Goropashnaya; Eric Waltari; Susan Erin Whidden; Nils Chr Stenseth Journal: Proc Natl Acad Sci U S A Date: 2020-01-27 Impact factor: 11.205
Authors: Scott Hotaling; Daniel H Shain; Shirley A Lang; Robin K Bagley; Lusha M Tronstad; David W Weisrock; Joanna L Kelley Journal: Proc Biol Sci Date: 2019-06-19 Impact factor: 5.349
Authors: Jurjan P van der Zee; Marjolijn J A Christianen; Martine Bérubé; Mabel Nava; Kaj Schut; Frances Humber; Alonzo Alfaro-Núñez; Leontine E Becking; Per J Palsbøll Journal: Heredity (Edinb) Date: 2021-10-11 Impact factor: 3.821
Authors: Joshua I Brown; Flor Hernández; Andrew Engilis; Blanca E Hernández-Baños; Dan Collins; Philip Lavretsky Journal: Sci Rep Date: 2022-06-17 Impact factor: 4.996
Authors: J A Kolmer; A Herman; M E Ordoñez; S German; A Morgounov; Z Pretorius; B Visser; Y Anikster; M Acevedo Journal: Heredity (Edinb) Date: 2019-12-20 Impact factor: 3.821
Authors: Francisco Balao; María Teresa Lorenzo; José Manuel Sánchez-Robles; Ovidiu Paun; Juan Luis García-Castaño; Anass Terrab Journal: Ann Bot Date: 2020-03-09 Impact factor: 5.040
Authors: Gábor Sramkó; Ovidiu Paun; Marie K Brandrud; Levente Laczkó; Attila Molnár; Richard M Bateman Journal: Ann Bot Date: 2019-10-18 Impact factor: 5.040
Authors: Nihal Oztolan-Erol; Andrew J Helmstetter; Asuman İnan; Richard J A Buggs; Stuart J Lucas Journal: Front Plant Sci Date: 2021-07-01 Impact factor: 5.753