| Literature DB >> 31159724 |
Belinda Wright1, Katherine A Farquharson1, Elspeth A McLennan1, Katherine Belov1, Carolyn J Hogg1, Catherine E Grueber2,3.
Abstract
BACKGROUND: Recent advances in genomics have greatly increased research opportunities for non-model species. For wildlife, a growing availability of reference genomes means that population genetics is no longer restricted to a small set of anonymous loci. When used in conjunction with a reference genome, reduced-representation sequencing (RRS) provides a cost-effective method for obtaining reliable diversity information for population genetics. Many software tools have been developed to process RRS data, though few studies of non-model species incorporate genome alignment in calling loci. A commonly-used RRS analysis pipeline, Stacks, has this capacity and so it is timely to compare its utility with existing software originally designed for alignment and analysis of whole genome sequencing data. Here we examine population genetic inferences from two species for which reference-aligned reduced-representation data have been collected. Our two study species are a threatened Australian marsupial (Tasmanian devil Sarcophilus harrisii; declining population) and an Arctic-circle migrant bird (pink-footed goose Anser brachyrhynchus; expanding population). Analyses of these data are compared using Stacks versus two widely-used genomics packages, SAMtools and GATK. We also introduce a custom R script to improve the reliability of single nucleotide polymorphism (SNP) calls in all pipelines and conduct population genetic inferences for non-model species with reference genomes.Entities:
Keywords: DArTseq; GATK; Pink-footed goose; Population differentiation; Population genomics; Reference genome; SAMtools; Stacks; Tasmanian devil
Mesh:
Year: 2019 PMID: 31159724 PMCID: PMC6547446 DOI: 10.1186/s12864-019-5806-y
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1Overview of methods used in this study to process reduced representation sequencing data with reference genomes, with some alternatives to software used indicated where appropriate. * Reproducibility filtering only possible if replicates or technical replicates are performed. ** Possible sex-linked SNP filter requires knowledge of sex of samples and is based on XX/XY system, but could be reversed for ZZ/ZW systems
Summary statistics for the resultant SNP loci datasets of three pipelines, filtered at a 70% call rate (see Additional file 1: Table S1 for data filtered on 30% call rate), for Tasmanian devil (N = 131) and pink-footed goose (N = 40), including total number of loci (total loci), average number of loci sequenced across individuals (mean loci), amount of missing data (%), calculated error rates (%), mean observed heterozygosity across loci (HO), mean expected heterozygosity across loci (HE), and average multilocus heterozygosity of individuals (MLH)
| Dataset | Pipeline | CPU hoursa | Total loci | Mean loci (min; max) | % missing | Error rate (%)b | HO (± SD) | HE (± SD) | MLH (± SD) |
|---|---|---|---|---|---|---|---|---|---|
| Devil | Stacks | 16 | 1359 | 1177.3 (500; 1326) | 13.4 | 2.9 | 0.207 (0.149) | 0.248 (0.163) | 0.205 (0.043) |
| SAMtools | 55 | 251 | 205.8 (96; 236) | 18.0 | 6.6 | 0.308 (0.160) | 0.327 (0.115) | 0.298 (0.092) | |
| GATK | 325 | 1464 | 1297.2 (604; 1442) | 11.4 | 5.3 | 0.185 (0.139) | 0.256 (0.161) | 0.184 (0.040) | |
| Goose | Stacks | 11 | 52,053 | 44,914.4 (954; 50,517) | 13.7 | NA | 0.132 (0.127) | 0.156 (0.136) | 0.127 (0.026) |
| SAMtools | 14 | 26,437 | 22,035.0 (732; 23,732) | 16.7 | NA | 0.256 (0.160) | 0.307 (0.142) | 0.563 (0.158) | |
| GATK | 65 | 277,362 | 245,412.2 (6787; 270,0084) | 11.5 | NA | 0.137 (0.121) | 0.187 (0.149) | 0.132 (0.034) |
aCPU hours represent total computational time for each pipeline excluding alignment and the further filtering in R. Note that while some steps can be parallelised for quicker computation, not all steps allow for this
bError rates could not be calculated for the pink-footed goose dataset as no replicates were included in the current analysis. Error rate is calculated after filtering on SNPs with > 85% reproducibility, so is lower than initial error rates
Genotypic differences between loci common to the 3 pipelines for devils (155 loci) and geese (3283 loci). Concordance rates (identical genotype calls between samples) between pipelines are in parentheses. Discordant genotype calls are presented as the percent of total genotypes
| Stacks:SAMtools | Stacks:GATK | GATK:SAMtools | |
|---|---|---|---|
| Devil | (97.77) | (98.15) | (98.92) |
| Homozygous → Homozygous | 0.00005 | 0.00010 | 0.00005 |
| Homozygous → Heterozygous | 0.00039 | 0.00227 | 0.00197 |
| Heterozygous → Homozygous | 0.01719 | 0.01369 | 0.00670 |
| Goose | (97.06) | (97.64) | (97.85) |
| Homozygous → Homozygous | 0.00019 | 0.00018 | 0.00043 |
| Homozygous → Heterozygous | 0.00395 | 0.00628 | 0.00394 |
| Heterozygous → Homozygous | 0.01920 | 0.01355 | 0.01327 |
Homozygous → Homozygous refers to those loci where an AA is called a TT in the other pipeline for example. Homozygous → Heterozygous are any genotype calls that are homozygous in the first pipeline but called heterozygous in the other for that sample at the same locus. Heterozygous → Homozygous are those calls that are heterozygous in one pipeline but called homozygous in the other for that sample at the same locus
Fig. 2PCoAs of the two datasets after processing through three pipelines with a call rate of 70% and the custom R script as outlined in Fig. 1. For devils, red is the “west” (N = 47) and blue is the “east” (N = 18) population. For goose, red is the “Iceland” (N = 20) and blue is the “Denmark” (N = 20) population. Inertia ellipses illustrate groupings and do not necessarily indicate confidence
Population pairwise FST values for each analysis with 95% confidence intervals generated over 2000 bootstraps. In devils, Pop1 refers to the Western population (N = 47), Pop2 refers to the Eastern population (N = 18), and Pop3 refers to the insurance population (N = 66). In geese, Pop1 refers to the Iceland population (N = 20) and Pop2 refers to the Denmark population (N = 20)
| Dataset | Pipeline | Pop 1:Pop 2 | Pop 1:Pop 3 | Pop 2:Pop 3 |
|---|---|---|---|---|
| Devil (70% call rate) | Stacks | 0.100 (0.090, 0.110) | 0.030 (0.027, 0.034) | 0.025 (0.021, 0.029) |
| SAMtools | 0.071 (0.056, 0.088) | 0.019 (0.014, 0.025) | 0.025 (0.017, 0.033) | |
| GATK | 0.094 (0.084, 0.103) | 0.029 (0.026, 0.033) | 0.025 (0.021, 0.029) | |
| Devil (30% call rate) | Stacks | 0.091 (0.084, 0.100) | 0.026 (0.023, 0.029) | 0.025 (0.021, 0.029) |
| SAMtools | 0.067 (0.057, 0.078) | 0.026 (0.022, 0.030) | 0.015 (0.011, 0.021) | |
| GATK | 0.091 (0.083, 0.099) | 0.028 (0.025, 0.031) | 0.026 (0.022, 0.030) | |
| Goose (70% call rate) | Stacks | 0.034 (0.032, 0.035) | ||
| SAMtools | 0.038 (0.036, 0.039) | |||
| GATK | 0.033 (0.032, 0.033) | |||
| Goose (30% call rate) | Stacks | 0.017 (0.016, 0.019) | ||
| SAMtools | 0.092 (0.091, 0.093) | |||
| GATK | 0.046 (0.045, 0.047) |