Literature DB >> 25852748

SNeP: a tool to estimate trends in recent effective population size trajectories using genome-wide SNP data.

Mario Barbato¹, Pablo Orozco-terWengel¹, Miika Tapio², Michael W Bruford¹.

Abstract

Effective population size (Ne ) is a key population genetic parameter that describes the amount of genetic drift in a population. Estimating Ne has been subject to much research over the last 80 years. Methods to estimate Ne from linkage disequilibrium (LD) were developed ~40 years ago but depend on the availability of large amounts of genetic marker data that only the most recent advances in DNA technology have made available. Here we introduce SNeP, a multithreaded tool to perform the estimate of Ne using LD using the standard PLINK input file format (.ped and.map files) or by using LD values calculated using other software. Through SNeP the user can apply several corrections to take account of sample size, mutation, phasing, and recombination rate. Each variable involved in the computation such as the binning parameters or the chromosomes to include in the analysis can be modified. When applied to published datasets, SNeP produced results closely comparable with those obtained in the original studies. The use of SNeP to estimate Ne trends can improve understanding of population demography in the recent past, provided a sufficient number of SNPs and their physical position in the genome are available. Binaries for the most common operating systems are available at https://sourceforge.net/projects/snepnetrends/.

Entities: Chemical Disease Gene Species

Keywords: SNPChip; demography; effective population size; large scale genotyping; linkage disequilibrium

Year: 2015 PMID： 25852748 PMCID： PMC4367434 DOI： 10.3389/fgene.2015.00109

Source DB: PubMed Journal: Front Genet ISSN： 1664-8021 Impact factor: 4.599

Introduction

Effective population size (N) is an important genetic parameter that estimates the amount of genetic drift in a population, and has been described as the size of an idealized Wright–Fisher population expected to yield the same value of a given genetic parameter as in the population under study (Crow and Kimura, 1970). N sizes can be influenced by fluctuations in census population size (N), by the breeding sex ratio and the variance in reproductive success. N estimation can be achieved using approaches that fall into three methodological categories: demographic, pedigree-based, or marker-based (Flury et al., 2010). Pedigree data have been traditionally used to obtain N estimates in livestock. However, reliable estimates of N depend on the pedigree being complete. This state of knowledge is feasible in some domestic populations, the demographic parameters of which have been accurately monitored for a sufficiently large number of generations. However, in practice, the applicability of this approach remains limited to a few cases involving highly managed breeds (Flury et al., 2010; Uimari and Tapio, 2011). One solution to overcome the limitation of an incomplete pedigree is to estimate the recent trend in N using genomic data. Several authors have recognized that N could be estimated from information on linkage disequilibrium (LD) (Sved, 1971; Hill, 1981). LD describes the non-random association of alleles in different loci as a function of the recombination rate between the physical positions of the loci in the genome. However, LD signatures can also result from demographic processes such as admixture and genetic drift (Wright, 1943; Wang, 2005), or through processes such as “hitchhiking” during selective sweeps (Smith and Haigh, 1974) or background selection (Charlesworth et al., 1997). In such scenarios alleles at different loci become associated independently of their proximity in the genome. Assuming that a population is closed and panmictic, the LD value calculated between neutral unlinked loci depends exclusively on genetic drift (Sved, 1971; Hill, 1981). This occurrence can be used to predict N due to the known relationship between the variance in LD (calculated using allele frequencies) and effective population size (Hill, 1981). Recent advances in genotyping technology (e.g., using SNP bead arrays with tens of thousands of DNA probes) have enabled the collection of vast amounts of genome-wide linkage data ideal for estimating N in livestock and humans among others (e.g., Tenesa et al., 2007; de Roos et al., 2008; Corbin et al., 2010; Uimari and Tapio, 2011; Kijas et al., 2012). However, a software tool that enables estimation of N from LD is lacking, and researchers currently rely on a combination of tools to manipulate data, infer LD, and tend to use bespoke scripts to perform the appropriate calculations and estimate N. Here we describe SNeP, a software tool that allows the estimation of N trends across generation using SNP data that corrects for sample size, phasing and recombination rate.

Materials and methods

The method SNeP uses to calculate LD depends on the availability of phased data. When the phase is known the user can select Hill and Robertson (1968) squared correlation coefficient that makes use of haplotype frequencies to define LD between each pair of loci (Equation 1). However, in the absence of a known phase, squared Pearson's product-moment correlation coefficient between pairs of loci can be selected. While these two approaches are not the same, they are highly comparable (McEvoy et al., 2011): where p and p are respectively the frequencies of alleles A and B at two separate loci (X, Y) measured for n individuals, p is the frequency of the haplotype with alleles A and B in the population studied, and are the mean genotype frequencies for the first and second locus respectively, X is the genotype of individual i at the first locus and Y is the genotype of individual i at the second locus. Equation (2) correlates the genotypic allele counts instead of the haplotype frequencies and is not influenced by double heterozygotes (this approach results in the same estimates as the --r2 option in PLINK). SNeP estimates the historic effective population size based on the relationship between r2, N, and c (recombination rate), (Equation 3—Sved, 1971), and enabling users to include corrections for sample size and uncertainty of the gametic phase (Equation 4—Weir and Hill, 1980): where n is the number of individual sampled, β = 2 when the gametic phase is known and β = 1 if instead the phase is not known. Several approximations are used to infer the recombination rate using the physical distance (δ) between two loci as a reference and translating it into linkage distance (d), which is usually described as Mb(δ) ≈ cM(d). For small values of d the latter approximation is valid, but for larger values of d the probability of multiple recombination events and interference increases, moreover the relationship between map distance and recombination rate is not linear, as the maximum recombination rate possible is 0.5. Thus, unless using very short δ, the approximation d ≈ c is not ideal (Corbin et al., 2012). We therefore implemented mapping functions to translate the estimated d into c, following Haldane (1919), Kosambi (1943), Sved (1971), and Sved and Feldman (1973). Initially SNeP infers d for each pair of SNPs as directly proportional to δ according to d = kδ where k is a user defined recombination rate value (default value is 10−8 as in Mb = cM). The inferred value of δ can then be subjected to one of the available mapping functions if required by the user. Solving Equation (3) for N and including all the corrections described, allows the prediction of N from LD data using (Corbin et al., 2012): where N is the effective population size t generations ago calculated as t = (2f(c))−1 (Hayes et al., 2003), c is the recombination rate defined for a specific physical distance between markers and optionally adjusted with the mapping functions mentioned above, r2 is the LD value adjusted for sample size and α:= {1, 2, 2.2} is a correction for the occurrence of mutations (Ohta and Kimura, 1971). Therefore, LD over greater recombinant distances is informative on recent N while shorter distances provide information on more distant times in the past. A binning system is implemented in order to obtain averaged r2 values that reflect LD for specific inter-locus distances. The binning system implemented uses the following formula to define the minimum and maximum values for each bin: Where b (ℕ1) is the i bin of the total number of bins (totBins), minD, and maxD are respectively the minimum and the maximum distance between SNPs and x is a positive real number (ℝ0) When x equals 1, the distribution of distances between the bins is linear and each bin has the same distance range. For larger values of x the distribution of distances changes allowing a larger range on the last bins and a smaller range on the first bins. Varying this parameter allows the user to have a sufficient number of pairwise comparisons to contribute to the final N estimate for each bin.

Example application

We tested SNeP with two published datasets that had been previously used to describe trends in N over time using LD, Bos indicus [54,436 SNPs of 423 East African Shorthorn Zebu (SHZ)–Mbole-Kariuki et al., 2014, data available at Dryad Digital Repository: doi:10.5061/dryad.bc598.] and Ovis aries [49,034 SNPs genotyped in 24 Swiss White Alpine (SWA), 24 Swiss Black-Brown Mountain sheep (SBS), 24 Valais Blacknose sheep (VBS), 23 Valais Red sheep (VRS), 24 Swiss Mirror sheep (SMS) and 24 Bundner Oberländer sheep (BOS)–Burren et al., 2014]. The r2 estimates for the cattle datasets were obtained by the authors using GenABLE (Aulchenko et al., 2007) using a minimum allele frequency (MAF) < 0.01 and adjusting the recombination rate using Haldane's mapping function (Haldane, 1919). The r2 estimates of the sheep data were calculated by the authors using PLINK-1.07 (Purcell et al., 2007), with a MAF < 0.05 and no further corrections. For both autosomal datasets r2 estimates where corrected for sample size using equation (4) with β = 2. For these comparative analyses the SNeP command line included the same parameters used for the published data apart from the r2 estimates, calculated through genotype count and the use of SNeP's novel binning strategy.

Results

SNeP is a multithreaded application developed in C++ and binaries for the most common operating systems (Windows, OSX, and Linux) can be downloaded from https://sourceforge.net/projects/snepnetrends/. The binaries are accompanied by a manual describing the step-by-step use of SNeP to infer trends in N as described here. SNeP produces an output file with tab delimited columns showing the following for each bin that was used to estimate N: the number of generations in the past that the bin corresponds to (e.g., 50 generations ago), the corresponding N estimate, the average distance between each pair of SNPs in the bin, the average r2 and the standard deviation of r2 in the bin, and the number of SNPs used to calculate r2 in the bin. This file can be easily imported in Microsoft Excel, R or other software to plot the results. The plots shown here (Figures 1, 3) correspond to the columns of generations ago and N from the output file. The column with the r2 standard deviation is provided for users to inspect the variance in the N estimate in each bin, particularly for those bins reflecting older time estimates and which are less reliable as the number of SNPs used to estimate r2 becomes smaller.

Figure 1

Comparison of .

Comparison of . The format required for the input files is the standard PLINK format (ped and map files) (Purcell et al., 2007). SNeP allows the users to either calculate LD on the data as described above, or use a custom precalculated LD matrix to estimate N using Equation (5). The software interface allows the user to control all parameters of the analysis, e.g., the distance range between SNPs in bp, and the set of chromosomes used in the analysis (e.g., 20–23). Additionally, SNeP includes the option to choose a MAF threshold (default 0.05), as it has been shown that accounting for MAF results in unbiased r2 estimates irrespective of sample size (Sved et al., 2008). SNeP's multithreaded architecture allows fast computation of large datasets (we tested up to ~100K SNPs for a single chromosome), for example the BOS data described here was analyzed with one processor in 2′43″, the use of two processors reduced the time to 1′43″, four processors reduced the analysis time to 1′05″.

Zebu example

For the zebu analysis, the shapes of the N curves obtained with SNeP and their published data trends showed the same trajectory with a smooth decline until around 150 generations ago, followed by an expansion with a peak around 40 generations ago and ending in a steep decline on the most recent generations (Figure 1). However, while the trends in both curves were the same, the two approaches resulted in different N estimates, with SNeP's values being approximately three-fold larger than those in the original paper. While we attempted to use the authors' parameters in our analyses, some differences were inevitable, i.e., the original publication of the cattle data estimated r2 with a different approach to that implemented in SNeP. Analyses with SNeP were based on genotypes, while the original analysis was based on inferred two locus haplotypes, which results in the published data showing an expected r2 of 0.32 at the minimum distance, while our estimates was 0.23. Similarly, Mbole-Kariuki et al. (2014) obtained a background level r2 = 0.013 around 2 Mb, while our estimate at the same distance was 0.0035 (data not shown). Consequently, as our estimates of LD were consistently smaller than Mbole-Kariuki et al. (2014) it is expected that our N estimates should be larger. While this observation highlights the importance of a careful choice of the parameters and their thresholds, it is important to highlight that although the absolute magnitude of the N values is different, the trends are almost identical.

Swiss sheep example

The six Swiss sheep breeds analyzed with SNeP produced comparable results with those from the original paper (Figure 2), with mostly overlapping N trend curves (Figure 3). However, the general trend in N showed a decline toward the present. SNeP produced slightly larger values of N for the more distant past (700–800 generations). This is due to the different binning system used in SNeP, which allows the user to obtain a more even distribution of pairwise comparisons within each bin (i.e., the number of SNP pairwise comparisons within each bin is comparable). For the time span extending beyond 400 generations ago, Burren et al. (2014) used only three bins in their analysis (centered at 400, 667, and 2000 generations ago) while for the same time span SNeP used 5 bins with a number of pairwise comparisons dependent to the range defined with formulae 6a,b. Consequently, Burren and colleagues' approach ends with a higher density of data describing the most recent generations than describing the oldest generations. Therefore, the use of fewer bins tends to increase the presence of smaller values of N in each bin, consequently lowering the average N value for each bin. The N values for the recent past, compared at the 29th generation in the past, gave very similar results. The largest difference (50) was obtained for the SBS breed.

Figure 2

Comparison between recent .

Figure 3

Comparison of .

Comparison between recent . Comparison of .

Discussion

Analysis of N using LD data was first demonstrated 40 years ago, and has been applied, developed and improved since (Sved, 1971; Hayes et al., 2003; Tenesa et al., 2007; de Roos et al., 2008; Corbin et al., 2012; Sved et al., 2013). The traditionally small number of SNPs analyzed is no longer a limitation, since SNP Chips comprise an extremely large number of SNPs, available in a short time and at a reasonable price. This has boosted the use of the method, which has been applied to humans (Tenesa et al., 2007; McEvoy et al., 2011) as well as to several domesticated species (England et al., 2006; Uimari and Tapio, 2011; Corbin et al., 2012; Kijas et al., 2012). Along with these improvements, methodological limitations have become apparent and have been addressed here, with the majority of the efforts pointing to the correct estimation of recent N. Yet, the quantitative value of the estimate is highly dependent on sample size, the type of LD estimation and the binning process (Waples and Do, 2008; Corbin et al., 2012), while its qualitative pattern depends more on the genetic information than on data manipulation. So far this method has been applied using a variety of software, no standardized approach exists to bin the results and each study has applied a more or less arbitrary approach, e.g., binning for generation classes in the past (Corbin et al., 2012), binning for distance classes with a constant range for each bin (Kijas et al., 2012) or binning per distance classes in a linear fashion but with larger bins for the more recent time points (Burren et al., 2014). To our knowledge the only software available that estimate N through LD is NeEstimator (Do et al., 2014), an upgraded version of the former LDNE (Waples and Do, 2008) allowing the analysis of large dataset (as 50k SNPChip). Importantly, while SNeP focuses on estimating historical N trends, NeEstimator's aim is to produce contemporary unbiased N estimates, the latter should therefore be considered as a complementary tool while investigating demography through LD. We used SNeP to analyze two datasets where the method was previously applied. The results we obtained for the sheep data were both quantitatively and qualitatively comparable with those obtained by Burren et al. (2014), while for the Zebu data we obtained a N trend estimate that closely matched that of Mbole-Kariuki et al. (2014) although our point estimates of N were larger than those described for the data (Mbole-Kariuki et al., 2014). The discrepancy between these two results reflects that Burren and colleagues produced their r2 estimates using PLINK (the standard software for large scale SNP data manipulation) which uses the same approach used to estimate r2 by SNeP, while Mbole-Kariuki et al. followed Hao et al. (2007) for r2 estimation. The use of different estimates for LD is critical for the quantitative aspect of the N curve, where due to the hyperbolic correlation between N and r2, a decrease in r2 on its range closer to 0 can lead to a very large change in N estimates, while differences in estimates are less significant when the r2 value is high, i.e., closer to 1. Therefore, although in one of the datasets the N values where substantially different, in both cases the N curves overlapped with those originally published. As already suggested by other authors, the reliability of the quantitative estimates obtained with this method must be taken with caution, especially for N values related to the most recent and the oldest generations (Corbin et al., 2012) because for recent generations, large values of c are involved, not fitting the theoretical implications that Hayes proposed to estimate a variable N over time (Hayes et al., 2003). Estimates for the oldest generations might also be unreliable as coalescent theory shows that no SNP can be reliably sampled after 4N generations in the past (Corbin et al., 2012). Further, N estimates, and especially those related to generations further in the past, are strongly affected by data manipulation factors, such as the choice of MAF and alpha values. Additionally, the binning strategy applied can interfere with the general precision of the method, for example where an insufficient number of pairwise comparisons are used to populate each bin. One of the applications of method is to compare breed demographies. In this case the shape of the N curves would be the optimal tool to differentiate different demographic histories, more than their numerical values, by using them as a potential demographic fingerprint for that breed or species, yet taking into consideration that mutation, migration, and selection can influence the N estimation through LD (Waples and Do, 2010). Additionally, careful consideration of the data analyzed with SNeP (and other software to estimate N) is very important, as the presence of confounding factors such as admixture, may result in biased estimates of N (Orozco-terWengel and Bruford, 2014). The aim of SNeP is therefore to provide a fast and reliable tool to apply LD methods to estimate N using high throughput genotypic data in a more consistent way. It allows two different r2 estimation approaches plus the option of using r2 estimates from external software. The use of SNeP does not overcome the limits of the method and the theory behind it, yet it allows the user to apply the theory using all corrections suggested to date.

Author contributions

MB conceived and wrote the software and the manuscript. MB, MT, and POtW tested the software and performed the analyses. MT, POtW, and MWB revised the manuscript. All authors approved the final manuscript.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

28 in total

1. Effective population size of an indigenous Swiss cattle breed estimated from linkage disequilibrium.

Authors: C Flury; M Tapio; T Sonstegard; C Drögemüller; T Leeb; H Simianer; O Hanotte; S Rieder
Journal: J Anim Breed Genet Date: 2010-10 Impact factor: 2.380

2. Extent of linkage disequilibrium and effective population size in Finnish Landrace and Finnish Yorkshire pig breeds.

Authors: P Uimari; M Tapio
Journal: J Anim Sci Date: 2010-10-29 Impact factor: 3.159

3. LdCompare: rapid computation of single- and multiple-marker r2 and genetic coverage.

Authors: K Hao; X Di; S Cawley
Journal: Bioinformatics Date: 2006-12-05 Impact factor: 6.937

4. GenABEL: an R library for genome-wide association analysis.

Authors: Yurii S Aulchenko; Stephan Ripke; Aaron Isaacs; Cornelia M van Duijn
Journal: Bioinformatics Date: 2007-03-23 Impact factor: 6.937

5. Divergence between human populations estimated from linkage disequilibrium.

Authors: John A Sved; Allan F McRae; Peter M Visscher
Journal: Am J Hum Genet Date: 2008-11-13 Impact factor: 11.025

6. ldne: a program for estimating effective population size from data on linkage disequilibrium.

Authors: Robin S Waples; Chi DO
Journal: Mol Ecol Resour Date: 2008-07 Impact factor: 7.090

7. The hitch-hiking effect of a favourable gene.

Authors: J M Smith; J Haigh
Journal: Genet Res Date: 1974-02 Impact factor: 1.588

8. Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle.

Authors: A P W de Roos; B J Hayes; R J Spelman; M E Goddard
Journal: Genetics Date: 2008-07-13 Impact factor: 4.562

9. Estimation of historical effective population size using linkage disequilibria with marker data.

Authors: L J Corbin; A Y H Liu; S C Bishop; J A Woolliams
Journal: J Anim Breed Genet Date: 2012-05-24 Impact factor: 2.380

10. Linkage disequilibrium and historical effective population size in the Thoroughbred horse.

Authors: L J Corbin; S C Blott; J E Swinburne; M Vaudin; S C Bishop; J A Woolliams
Journal: Anim Genet Date: 2010-12 Impact factor: 3.169

102 in total

1. Estimating contemporary effective population size in non-model species using linkage disequilibrium across thousands of loci.

Authors: R K Waples; W A Larson; R S Waples
Journal: Heredity (Edinb) Date: 2016-08-24 Impact factor: 3.821

Review 2. Prediction and estimation of effective population size.

Authors: J Wang; E Santiago; A Caballero
Journal: Heredity (Edinb) Date: 2016-06-29 Impact factor: 3.821

Review 3. Microbial sequence typing in the genomic era.

Authors: Marcos Pérez-Losada; Miguel Arenas; Eduardo Castro-Nallar
Journal: Infect Genet Evol Date: 2017-09-21 Impact factor: 3.342

4. Genome-wide genotyping uncovers genetic profiles and history of the Russian cattle breeds.

Authors: Andrey Yurchenko; Nikolay Yudin; Ruslan Aitnazarov; Alexandra Plyusnina; Vladimir Brukhin; Vladimir Soloshenko; Bulat Lhasaranov; Ruslan Popov; Ivan A Paronyan; Kirill V Plemyashov; Denis M Larkin
Journal: Heredity (Edinb) Date: 2017-12-08 Impact factor: 3.821

5. Genomic diversity and population structure of three autochthonous Greek sheep breeds assessed with genome-wide DNA arrays.

Authors: S Michailidou; G Tsangaris; G C Fthenakis; A Tzora; I Skoufos; S C Karkabounas; G Banos; A Argiriou; G Arsenos
Journal: Mol Genet Genomics Date: 2018-01-25 Impact factor: 3.291

6. Climate-driven flyway changes and memory-based long-distance migration.

Authors: Zhongru Gu; Shengkai Pan; Zhenzhen Lin; Li Hu; Xiaoyang Dai; Jiang Chang; Yuanchao Xue; Han Su; Juan Long; Mengru Sun; Sergey Ganusevich; Vasiliy Sokolov; Aleksandr Sokolov; Ivan Pokrovsky; Fen Ji; Michael W Bruford; Andrew Dixon; Xiangjiang Zhan
Journal: Nature Date: 2021-03-03 Impact factor: 49.962

7. Genome sequence and genetic diversity of European ash trees.

Authors: Elizabeth S A Sollars; Andrea L Harper; Laura J Kelly; Christine M Sambles; Ricardo H Ramirez-Gonzalez; David Swarbreck; Gemy Kaithakottil; Endymion D Cooper; Cristobal Uauy; Lenka Havlickova; Gemma Worswick; David J Studholme; Jasmin Zohren; Deborah L Salmon; Bernardo J Clavijo; Yi Li; Zhesi He; Alison Fellgett; Lea Vig McKinney; Lene Rostgaard Nielsen; Gerry C Douglas; Erik Dahl Kjær; J Allan Downie; David Boshier; Steve Lee; Jo Clark; Murray Grant; Ian Bancroft; Mario Caccamo; Richard J A Buggs
Journal: Nature Date: 2016-12-26 Impact factor: 49.962

8. Multiple genomic signatures of selection in goats and sheep indigenous to a hot arid environment.

Authors: E-S Kim; A R Elbeltagy; A M Aboul-Naga; B Rischkowsky; B Sayre; J M Mwacharo; M F Rothschild
Journal: Heredity (Edinb) Date: 2015-11-11 Impact factor: 3.821

9. Genetic diversity and the application of runs of homozygosity-based methods for inbreeding estimation in German White-headed Mutton sheep.

Authors: Sowah Addo; Stefanie Klingel; Georg Thaller; Dirk Hinrichs
Journal: PLoS One Date: 2021-05-06 Impact factor: 3.240

10. Whole-genome resequencing of large yellow croaker (Larimichthys crocea) reveals the population structure and signatures of environmental adaptation.

Authors: Tetsuo Kon; Liyi Pei; Ryota Ichikawa; Chunyan Chen; Ping Wang; Ikuyo Takemura; Yingying Ye; Xiaojun Yan; Baoying Guo; Weiye Li; Hagai Nsobi Lauden; Hiromasa Tabata; Hao Pan; Yoshihiro Omori; Atsushi Ogura; Lihua Jiang
Journal: Sci Rep Date: 2021-05-27 Impact factor: 4.379