Richard G J Hodel, Shichao Chen, Adam C Payton, Stuart F McDaniel, Pamela Soltis, Douglas E Soltis.
Abstract
The widespread adoption of RAD-Seq data in phylogeography means that genealogical relationships previously evaluated using relatively few genetic markers can now be addressed with thousands of loci. One challenge, however, is that RAD-Seq generates complete genotypes for only a small subset of loci or individuals. Simulations indicate that loci with missing data can produce biased estimates of key population genetic parameters, although the influence of such biases in empirical studies is not well understood. Here we compare microsatellite data (8 loci) and RAD-Seq data (six datasets ranging from 239 to 25,198 loci) from red mangroves (Rhizophora mangle) in Florida to evaluate how different levels of data filtering influence phylogeographic inferences. For all datasets, we calculated population genetic statistics and evaluated population structure, and for RAD-Seq datasets, we additionally examined population structure using coalescence. F_ST was higher for microsatellites than for RAD-Seq, but RAD-Seq-based estimates approached the microsatellite-based values as more loci with more missing data were included. Analyses of RAD-Seq datasets resolved the classic Gulf-Atlantic coastal phylogeographic break, which was not significant in the microsatellite analyses. Applying multiple levels of filtering to RAD-Seq datasets can provide a more complete picture of potential biases in the data and elucidate subtle phylogeographic patterns.
Year: 2017 PMID: 29242627 PMCID: PMC5730610 DOI: 10.1038/s41598-017-16810-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
The twelve sampling locations (each containing eight individuals), their codes, GPS coordinates, and the percentage of loci that have missing data for each sampling location before any filtering.
| Sampling Location | Code | Latitude (N) | Longitude (W) | % Loci Missing |
|---|---|---|---|---|
| Bahia Honda Key | BHKFl | 24.55286 | 81.76776 | 73.5 |
| Convoy Point | CvPFl | 25.46347 | 80.33133 | 81.2 |
| Cape Canaveral | CpCFl | 28.82173 | 80.75594 | 83.0 |
| Hollywood | HwdFl | 26.03841 | 80.11780 | 79.4 |
| Islamorada | IsmFl | 24.90031 | 80.65690 | 81.0 |
| Key Largo | KyLFl | 25.09569 | 80.42957 | 88.9 |
| Melbourne | MlbFl | 28.07435 | 80.60526 | 79.8 |
| New Port Richey | NPRFl | 28.25432 | 82.75723 | 69.5 |
| Seahorse Key | ShKFl | 29.10040 | 83.06185 | 65.8 |
| Terra Ceia Bay | TCBFl | 27.59172 | 82.57524 | 81.7 |
| Vaca Key | VKyFl | 24.71154 | 81.06992 | 85.1 |
| West Palm Beach | WPBFl | 26.67505 | 80.04259 | 83.9 |
Figure 1. The 12 sampling locations (each with eight individuals) are indicated by orange circles. Sampling location codes are provided in Table 1. The map was generated using R (citation: R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/) and the R package ‘maps’ (citation: Original S code by Richard A. Becker, Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn. (2017). maps: Draw Geographical Maps. R package version 3.2.0. https://CRAN.R-project.org/package=maps).
The seven data sets used in this study; RAD-Seq data sets were generated by filtering loci from the largest data set (RAD_25198). For all data sets (six RAD and one microsatellite), the total number of loci used is indicated.
| Dataset | Individuals required to retain a locus | Number of loci | % individuals required to retain a locus |
|---|---|---|---|
| RAD_239 | 83 | 239 | 86.5 |
| RAD_1180 | 75 | 1180 | 78.1 |
| RAD_2317 | 65 | 2317 | 67.7 |
| RAD_3831 | 50 | 3831 | 52.1 |
| RAD_6255 | 30 | 6255 | 31.3 |
| RAD_25198 | 1 | 25198 | 1.0 |
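The filtering rule in the table above (retain a locus only if at least a given number of individuals are genotyped at it) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the list-of-lists genotype layout, and the use of `None` for a missing call are all assumptions made for the example. The default of 96 individuals follows from the twelve sampling locations with eight individuals each.

```python
def filter_loci(genotypes, min_individuals):
    """Return indices of loci genotyped in at least `min_individuals` individuals.

    `genotypes` is a list with one inner list per individual (hypothetical
    layout); each inner list has one entry per locus, and None marks a
    missing genotype call.
    """
    n_loci = len(genotypes[0])
    kept = []
    for locus in range(n_loci):
        # Count individuals with a non-missing call at this locus.
        called = sum(1 for individual in genotypes if individual[locus] is not None)
        if called >= min_individuals:
            kept.append(locus)
    return kept


def percent_required(min_individuals, n_individuals=96):
    """Express a retention threshold as a percentage of all sampled individuals."""
    return 100.0 * min_individuals / n_individuals


# Toy data: 4 individuals x 3 loci, genotypes coded as alternate-allele counts.
toy = [
    [0, 1, None],
    [1, None, None],
    [2, 1, 0],
    [0, None, None],
]
kept = filter_loci(toy, min_individuals=2)  # locus 2 has only one call and is dropped
threshold_pct = percent_required(83)        # threshold used for RAD_239
```

Raising `min_individuals` shrinks the retained set, which is the trade-off the table shows: the strictest threshold (83 of 96 individuals) leaves 239 loci, while the most permissive (1 individual) leaves 25,198.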
Relevant population genetic statistics for each of the seven data sets used in this study. For each column, warmer colors indicate lower values and cooler colors indicate higher values. The 95% confidence interval for each statistic appears immediately to the right of each of the four columns (F_ST, F_IS, H_O, H_E).
Figure 2. Stacked histograms of per-locus estimates of F_ST, F_IS, and H for each of the RAD datasets. Datasets with more loci are stacked on top of datasets with fewer loci.
Pairwise F_ST for each sampling location (i.e., one sampling location versus all others) for each of the seven datasets. Within each data set, lower (warmer colors) and higher (cooler colors) values of F_ST are shown using color-coding.
The variation in average inbreeding coefficient (F_IS) among data sets and populations. Within each data set, lower (warmer colors) and higher (cooler colors) values of F_IS are shown using color-coding. The average value of F_IS across all data sets for each population is shown in the last column of the table.
The variation in observed heterozygosity (H_O) among data sets and populations. Within each data set, lower (warmer colors) and higher (cooler colors) values of H_O are shown using color-coding. The average value of H_O across all data sets for each population is shown on the bottom row of the table.
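The inbreeding coefficient and observed heterozygosity summarized in the two preceding tables are linked by the standard relation F_IS = 1 - H_O / H_E, where H_E is the expected heterozygosity. The sketch below computes all three for a single locus. It assumes a simple uncorrected estimator of H_E (one minus the sum of squared allele frequencies) and a hypothetical genotype layout of allele pairs with `None` for missing calls; the source does not state which estimator or software the authors used, so this should not be read as their method.

```python
from collections import Counter


def locus_stats(genotypes):
    """Return (H_O, H_E, F_IS) for one locus, or None if no genotype is called.

    `genotypes` is a list of (allele, allele) tuples; None marks a missing call.
    H_E is the uncorrected estimate 1 - sum(p_i ** 2), an assumption of this sketch.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    # Observed heterozygosity: fraction of called individuals that are heterozygous.
    h_obs = sum(1 for a, b in called if a != b) / len(called)
    # Allele frequencies among called genotypes only.
    counts = Counter(allele for g in called for allele in g)
    total = sum(counts.values())
    h_exp = 1.0 - sum((c / total) ** 2 for c in counts.values())
    # F_IS is undefined at a monomorphic locus (H_E = 0).
    f_is = 1.0 - h_obs / h_exp if h_exp > 0 else float("nan")
    return h_obs, h_exp, f_is


h_obs, h_exp, f_is = locus_stats(
    [("A", "A"), ("A", "T"), ("T", "T"), ("A", "A"), None]
)
```

Missing calls are skipped rather than imputed, so at loci with few genotyped individuals the estimates rest on small samples, which is one route by which missing data can influence these statistics.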
Figure 3. Principal component analysis (PCA) for all seven data sets. Note that the scales of the axes of the SSR_8 plot differ from those of all the RAD plots.
Figure 4. Trees estimated using every individual for each RAD dataset in SVDQuartets. Orange branches indicate individuals from sampling locations in the Gulf of Mexico, and blue branches represent individuals from Atlantic sampling locations.
Figure 5. Histograms showing the distribution of estimates from 100 random samplings of loci from a larger data set. In the first two panels, six and seven SSR loci, respectively, were randomly sampled 100 times from the SSR_8 data set, and the distribution of the 100 calculations of F_ST is shown. The solid blue line indicates the parameter value estimated using all eight loci, and the dashed blue lines show the 95% confidence interval. In the remaining five panels, the histograms show parameter estimates from 100 random samples of 239, 1,180, 2,317, 3,831, and 6,255 loci, respectively, drawn from RAD_25198. The solid blue lines indicate the F_ST value estimated using all 25,198 loci, and the dashed blue lines show the 95% confidence interval. The solid orange lines indicate F_ST estimated using the original data set (RAD_239, RAD_1180, RAD_2317, RAD_3831, and RAD_6255, respectively), and the dashed orange lines show the 95% confidence interval for this estimate.
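The resampling design described in this caption (draw a fixed number of loci at random from a larger data set, recompute the statistic, repeat 100 times) can be sketched as below. Several things here are assumptions made for illustration: the function name, sampling without replacement, and especially the use of a plain mean of per-locus values as the statistic, which stands in for whatever multi-locus estimator the authors applied. The toy per-locus values are randomly generated and are not data from the study.

```python
import random
import statistics


def resample_statistic(per_locus_values, n_loci, n_replicates=100, seed=0):
    """Recompute a summary statistic on repeated random subsets of loci.

    Each replicate draws `n_loci` values without replacement and takes their
    mean (a placeholder for the real multi-locus estimator).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_replicates):
        subset = rng.sample(per_locus_values, n_loci)
        estimates.append(statistics.mean(subset))
    return estimates


# Toy per-locus values standing in for a large data set (not real data).
value_rng = random.Random(42)
values = [value_rng.random() * 0.1 for _ in range(500)]

estimates = resample_statistic(values, n_loci=50)
full_estimate = statistics.mean(values)          # analogue of the solid blue line
spread = (min(estimates), max(estimates))        # range of the 100 resampled estimates
```

Comparing `full_estimate` and the spread of `estimates` with the value obtained from a filtered data set of the same size separates the effect of locus number from the effect of which loci survive filtering, which is the comparison the blue and orange lines in the figure make.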