Jonás A Aguirre-Liguori, Javier A Luna-Sánchez, Jaime Gasca-Pineda, Luis E Eguiarte.
Abstract
Massive parallel sequencing (MPS) is revolutionizing the field of molecular ecology by allowing us to better understand the evolutionary history of populations and species, and to detect genomic regions that could be under selection. However, the economic and computational resources needed generate a tradeoff between the number of loci that can be obtained and the number of populations or individuals that can be sequenced. In this work, we analyzed and compared two simulated genomic datasets fitting a hierarchical structure, two extensive empirical genomic datasets, and a dataset comprising microsatellite information. For all datasets, we generated different subsampling designs by changing the number of loci, individuals, populations, and individuals per population to test for deviations in classic population genetics parameters (H_S, F_IS, F_ST). For the empirical datasets, we also analyzed the effect of sampling design on landscape genetic tests (isolation by distance and environment, central abundance hypothesis). We also tested the effect of sampling a different number of populations on the detection of outlier SNPs. We found that the microsatellite dataset is very sensitive to the number of individuals sampled when obtaining summary statistics. F_IS was particularly sensitive to a low sampling of individuals in the simulated, genomic, and microsatellite datasets. For the empirical and simulated genomic datasets, we found that as long as many populations are sampled, few individuals and loci are needed. For the empirical datasets, we found that increasing the number of populations sampled was important in obtaining precise landscape genetic estimates. Finally, we corroborated that outlier tests are sensitive to the number of populations sampled. We conclude by proposing different sampling designs depending on the objectives.
Keywords: Mexican wild maize; genomics of populations; landscape genomics; local adaptation; massive parallel sequencing; sampling design
Year: 2020 PMID: 33193568 PMCID: PMC7531271 DOI: 10.3389/fgene.2020.00870
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Summary of 19 studies that have evaluated sampling designs using different markers (Microsatellites, AFLPs, SNPs); empirical vs. simulated data; and varying the number of loci, individuals, and populations.
| References | Dataset | Type of sampling | No. of populations | No. of individuals | No. of Loci | Principal conclusions |
| | Microsatellite | Empirical | 1 | 480 | 4 | > 30 individuals increases the precision in |
| | Microsatellite | Empirical | 1 | 200 | 8 | Precision in summary statistics is increased when > 20 individuals are genotyped. |
| | Microsatellite | Empirical | 2 | 64 | 15 | More than 6 polymorphic markers are enough to adequately define the genetic structure between populations. |
| | Microsatellite | Empirical | 5 | 80 | 15 | Increasing the number of loci does not change the mean summary statistics, but increases the precision across replicates. IBD patterns are sensitive to genotyping fewer loci. |
| | Microsatellite | Empirical | 17–21 (different species) | 547, 652, and 516 | 18, 16, and 15 | > 20 individuals and between 50 and 80 individuals per population are needed to estimate |
| | Microsatellite | Simulation | 17 and 34 (different species) | 5,000 and 3,000 | 20 | Spatial sampling design (random, systematic, cluster) affects IBD patterns. Increasing the number of loci, rather than individuals, increases the accuracy of IBD estimates. |
| | Microsatellite | Simulation | 1 | 10,000 | 15 | Different sampling designs generate different |
| | Microsatellite | Simulation | 1 | 1,000 | 25 | Increasing the number of polymorphic loci increases the precision of patterns of isolation by resistance (IBR). |
| | Microsatellite | Simulation | 1 | 1,000 | 25 | Increasing the number of polymorphic loci, individuals, and alleles increases the precision and accuracy of estimated IBR patterns. |
| | Microsatellite | Simulation | 64 | 64 | 20 | Increasing the number of populations (even if fewer individuals are sampled) increases the possibility of finding correct patterns of IBD. |
| | Microsatellite | Simulation | 3 | 100 | 100 | Reducing the number of samples does not affect |
| | Microsatellite | Mixed (Simulation and empirical) | 4 | 100 | 9, 5, 7, and 8 | For four different species, sampling between 25 and 30 individuals is enough to accurately estimate |
| | Microsatellite | Mixed (Simulation and empirical) | 4 | 4 different taxa: 726, 408, 372, 384 | 16 | Sex proportions do not affect summary statistics estimates. > 20 individuals increase the precision of summary statistics. Empirical and simulated data show different patterns of deviation. |
| | AFLPs | Empirical | 6 | 159 | 59 and 117 | > 30 individuals per population are needed to accurately estimate |
| | SNPs | Simulation | 2 | 1,000 | 21,000 | Fewer individuals are needed to accurately estimate |
| | SNPs | Simulation | 1 | 1,000 | 20,000 | Low individual sampling with high genome coverage underestimates the number of segregating sites, |
| | SNPs | Empirical | 2 | 70 | 3,500 | Fewer individuals (8) but with a large number of SNPs (> 1,000) increase the precision of |
| | SNPs | Empirical | 4 | 120 | 14,000 | > 25 individuals (with 10,000 SNPs) are needed to estimate accurate kinship indexes, identifying as identical by descent alleles and |
| | Mixed (SNPs and Microsatellite) | Empirical | 34 | Microsatellite dataset: 506; SNP dataset: 96 | Microsatellite dataset: 15; SNP dataset: 1,000 | 1,000 SNPs are more precise than microsatellites for assigning birth areas, even if fewer individuals are sampled. |
Summary statistics estimated for the DTS, 50K, and microsatellite datasets of Mexican wild maize.
| Mean estimate | Hierarchical high flow | Hierarchical low flow | DTS | 50K | Microsatellite |
| H_S | 0.26 (0.05) | 0.32 (0.03) | 0.130 (0.05) | 0.225 (0.04) | 0.691 |
| F_IS | 0.02 (0.18) | 0.01 (0.18) | | 0.069 (0.04) | 0.182 |
| F_ST | | | 0.393 | 0.246 | 0.106 |
| MRM: geographic (β) | | | 0.027 | 0.025 | 0.013 |
| MRM: environmental (β) | | | 0.011 | 0.011 | 0.004 |
| CAH: geographic (β) | | | −0.014 | −0.014 | −0.041 |
| CAH: environmental (β) | | | −0.006 | −0.008 | −0.031 |
FIGURE 1. The effect of sampling designs on the estimation of summary statistics for genomic (left panels) and microsatellite (right panels) datasets: (A) H_S; (B) F_IS; (C) F_ST. Boxplots show the distribution of mean summaries estimated for 1,000 replicate simulations varying the number of individuals, number of SNPs, and number of populations sampled. F_IS could not be obtained for the DTS dataset because it is based on pooled data.
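The subsampling procedure summarized in this figure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `expected_heterozygosity` and `subsample_hs`, the 0/1/2 genotype coding for biallelic SNPs, and the use of the uncorrected 2pq estimator (no small-sample correction) are all choices made for the example.

```python
import numpy as np

def expected_heterozygosity(genotypes):
    """Mean expected heterozygosity (2pq) over loci for one population.

    genotypes: (individuals x loci) array of 0/1/2 alternate-allele counts.
    Assumption: uncorrected estimator, no small-sample adjustment.
    """
    p = genotypes.mean(axis=0) / 2.0  # alternate-allele frequency per locus
    return float(np.mean(2.0 * p * (1.0 - p)))

def subsample_hs(populations, n_ind, n_loci, n_pops, n_reps=1000, seed=0):
    """Distribution of mean H_S across replicate subsamples.

    populations: list of (individuals x loci) arrays sharing the same loci.
    Each replicate draws populations, loci, and individuals without replacement.
    """
    rng = np.random.default_rng(seed)
    total_loci = populations[0].shape[1]
    estimates = np.empty(n_reps)
    for r in range(n_reps):
        pop_idx = rng.choice(len(populations), size=n_pops, replace=False)
        loci_idx = rng.choice(total_loci, size=n_loci, replace=False)
        hs_values = []
        for i in pop_idx:
            pop = populations[i]
            ind_idx = rng.choice(pop.shape[0], size=n_ind, replace=False)
            hs_values.append(
                expected_heterozygosity(pop[np.ix_(ind_idx, loci_idx)])
            )
        estimates[r] = np.mean(hs_values)  # H_S: mean within-population value
    return estimates
```

Comparing the spread of the returned estimates across different values of `n_ind`, `n_loci`, and `n_pops` gives the kind of boxplot comparison described in the caption.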
FIGURE 2. The effect of sampling designs on the analysis of patterns of isolation for genomic and microsatellite datasets: (A) IBD-MRM test; (B) IBE-MRM test. Boxplots show the distribution of associations estimated for 1,000 simulations varying the number of individuals, number of SNPs, and number of sampled populations. The dotted gray line shows the 0 value.
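For readers unfamiliar with multiple regression on distance matrices (MRM), the sketch below shows the general idea: pairwise genetic distances are regressed on pairwise geographic and environmental distances, and significance is assessed by jointly permuting the rows and columns of the response matrix. The function name `mrm`, the two-sided test, and the default number of permutations are assumptions for illustration; the software and settings used in the study are not stated in this record.

```python
import numpy as np

def mrm(response, predictors, n_perm=999, seed=0):
    """Multiple regression on distance matrices with a permutation test.

    response:   (n x n) symmetric matrix, e.g. pairwise genetic distance.
    predictors: list of (n x n) symmetric matrices, e.g. geographic and
                environmental distance.
    Returns the regression slopes (one per predictor) and their p-values.
    """
    rng = np.random.default_rng(seed)
    n = response.shape[0]
    iu = np.triu_indices(n, k=1)  # use each population pair once
    X = np.column_stack([np.ones(iu[0].size)] + [m[iu] for m in predictors])

    def fit(resp):
        coef, *_ = np.linalg.lstsq(X, resp[iu], rcond=None)
        return coef[1:]  # drop the intercept

    beta = fit(response)
    exceed = np.zeros_like(beta)
    for _ in range(n_perm):
        perm = rng.permutation(n)
        # Permute populations, i.e. rows and columns together
        permuted = response[np.ix_(perm, perm)]
        exceed += np.abs(fit(permuted)) >= np.abs(beta)
    p_values = (exceed + 1) / (n_perm + 1)
    return beta, p_values
```

Running this on subsampled sets of populations and collecting the slopes would reproduce the structure of the comparison in the figure, with the geographic slope corresponding to IBD and the environmental slope to IBE.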
FIGURE 3. The effect of sampling designs on the estimation of the central abundance hypothesis for genomic and microsatellite datasets: (A) the association between distance to the geographic centroid and H_S; (B) the association between distance to the niche centroid and H_S. Boxplots show the distribution of associations estimated for 1,000 simulations varying the number of individuals, number of SNPs, and number of sampled populations. The dotted gray line shows the 0 value.
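The association in panel (A) can be illustrated with a simple regression of H_S on distance to a centroid. The sketch below is only an approximation under stated assumptions: the function name `cah_slope` is hypothetical, the centroid is taken as the mean of the sampled population coordinates, and distances are Euclidean in whatever coordinate system is supplied. The study may define the centroid and distances differently.

```python
import numpy as np

def cah_slope(coords, hs):
    """Slope of H_S regressed on distance to the centroid of `coords`.

    coords: (populations x dimensions) array, e.g. projected x/y positions
            or standardized environmental variables for a niche centroid.
    hs:     within-population expected heterozygosity, one per population.
    A negative slope means diversity declines away from the centroid.
    """
    coords = np.asarray(coords, dtype=float)
    hs = np.asarray(hs, dtype=float)
    centroid = coords.mean(axis=0)  # assumption: centroid of sampled points
    dist = np.linalg.norm(coords - centroid, axis=1)
    slope, _intercept = np.polyfit(dist, hs, 1)
    return float(slope)
```

Passing environmental variables instead of geographic coordinates gives the analogous niche-centroid version in panel (B).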
FIGURE 4. The tradeoff between the number of individuals and the number of populations sampled for all summary statistics using the 50K dataset. We tested the effect of sampling more individuals in fewer populations and fewer individuals in many populations.
FIGURE 5. The effect of sampling a different number of populations on the identification of outlier SNPs. (A) Distribution of the highest F_ST identified for a locus across simulations. (B) Number of outlier SNPs (q-val) identified for each replicate for a different number of sampled populations. (C) The Venn diagram shows the number of shared SNPs identified across replicates and the number of populations sampled using q-val to identify outliers.
Recommendations for sampling designs depending on study objectives.
| Estimate | Number of individuals | Number of loci | Number of populations | Considerations |
| H_S | Not sensitive (> 6 individuals) | Sensitive (> 1,000 SNP loci; > 15 microsatellite loci) | Sensitive (> 20 populations) | Increase the number of loci and populations. The genomic dataset is less sensitive than the microsatellite dataset. |
| F_IS | Very sensitive (> 9 individuals) | Sensitive (> 1,000 SNP loci; > 15 microsatellite loci) | Sensitive (> 20 populations) | Increase the number of individuals over loci and populations. If fewer populations are available, increase the number of individuals in those populations. |
| F_ST | Microsatellite dataset: very sensitive (> 20 individuals). 50K dataset: not sensitive (> 9 individuals) | Not very sensitive (> 1,000 SNPs; > 15 microsatellite loci) | Sensitive (> 20 populations) | Increase the number of populations over the number of SNPs or individuals. |
| IBD and IBE MRM tests | Not sensitive (> 3 individuals) | Not sensitive (>1,000 SNPs, > 15 loci) | Very sensitive (>20 populations) | Sample as many populations as possible even if fewer individuals or loci are sampled. |
| CAH tests | Not very sensitive (> 3 individuals for genomic datasets; > 6 individuals for microsatellite datasets) | Sensitive depending on the dataset (>1,000 DTS SNPs, > 100 50K SNPs, > 15 microsatellite loci) | Very sensitive (> 30 populations) | Increasing the number of populations is more important than increasing the number of loci or individuals. Microsatellites are more sensitive than genomic datasets to the number of loci and individuals, although less sensitive to the number of populations sampled. |
| Tests of selection using bayescenv | Not tested | Sensitive (as many as possible) | Very sensitive (>30–40 populations) | As many SNPs as possible are needed to differentiate outlier loci, also to increase the probability of finding loci within selective regions. Increase as much as possible the number of populations, covering the largest geographic and environmental distribution. A possibility is to use pooled-sample DNA. |