Literature DB >> 34270863

Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species.

Brandon M Lind¹, Mengmeng Lu², Dragana Obreht Vidakovic¹, Pooja Singh², Tom R Booker^1,3, Sally N Aitken¹, Sam Yeaman².

Abstract

Despite their suitability for studying evolution, many conifer species have large and repetitive giga-genomes (16-31 Gbp) that create hurdles to producing high coverage SNP data sets that capture diversity from across the entirety of the genome. Due in part to multiple ancient whole genome duplication events, gene family expansion and subsequent evolution within Pinaceae, false diversity from the misalignment of paralog copies creates further challenges in accurately and reproducibly inferring evolutionary history from sequence data. Here, we leverage the cost-saving benefits of pool-seq and exome-capture to discover SNPs in two conifer species, Douglas-fir (Pseudotsuga menziesii var. menziesii (Mirb.) Franco, Pinaceae) and jack pine (Pinus banksiana Lamb., Pinaceae). We show, using minimal baseline filtering, that allele frequencies estimated from pooled individuals show a strong, positive correlation with those estimated by sequencing the same population as individuals (r > .948), on par with such comparisons made in model organisms. Further, we highlight the utility of haploid megagametophyte tissue for identifying sites that are probably due to misaligned paralogs. Together with additional minor filtering, we show that it is possible to remove many of the loci with large frequency estimate discrepancies between individual and pooled sequencing approaches, improving the correlation further (r > .973). Our work addresses bioinformatic challenges in non-model organisms with large and complex genomes, highlights the use of megagametophyte tissue for the identification of paralogous artefacts, and suggests the combination of pool-seq and exome capture to be robust for further evolutionary hypothesis testing in these systems.

Entities: Chemical

Keywords: zzm321990Pinaceaezzm321990; exome-capture; non-model; paralogy; pool-seq

Mesh：

Year: 2021 PMID： 34270863 PMCID： PMC9292622 DOI： 10.1111/1755-0998.13474

Source DB: PubMed Journal: Mol Ecol Resour ISSN： 1755-098X Impact factor: 8.678

INTRODUCTION

Quantifying the spatial structure of neutral and adaptive genetic variation within ecologically and economically important tree species and their close relatives is fundamental to forecasting and managing their response to changing selection pressures from pests, pathogens, and climate (Aitken et al., 2008; Alberto et al., 2013; Holliday et al., 2017; Janes & Hamilton, 2017; Sniezko & Winn, 2017). Prerequisite to this information is the ability to produce high quality and cost‐effective data from which to generate reliable inference. While the life history of many tree species offers some ideal circumstances for studying adaptive evolution at the genetic level (Neale & Kremer, 2011; Neale & Savolainen, 2004), two ancient whole‐genome duplication events in the progenitors of the Pinaceae lineages (Li et al., 2015), transposable element dynamics (Morse et al., 2009; Voronova et al., 2017), tandemly arrayed genes (Pavy et al., 2017), subsequent gene duplication (Casola & Koralewski, 2018; Krutovsky et al., 2004) and gene family expansion (e.g., Liu et al., 2016) have led to giga‐genomes (>16 Gb in size) recalcitrant to chromosome‐level genome assembly under current sequencing and computational constraints (Neale, Martínez‐García, et al., 2017; but see Scott et al., 2020). For example, analysis of Pinus taeda L. (Pinaceae) has yielded estimates that upwards of 82% of its 22 Gb genome is repetitive, and 75% of the repetitive sequence is due to retrotransposons (Nystedt et al., 2013; Wegrzyn et al., 2014). It is also thought to be rich in pseudogenes (Kovach et al., 2010). Such large genome sizes have hampered production of dense SNP data sets across a large number of individuals (Lind et al., 2018). Most recent sequencing efforts in conifers have either used some form of reduced representation sequencing such as restriction‐site associated DNA sequencing (i.e., RADseq; reviewed in Andrews et al., 2016 and Parchman et al., 2018), which relies upon relatively few genomic resources, or targeted capture (e.g., Lu et al., 2016; Suren et al., 2016), which requires significant genomic and budgetary resources including the design of capture arrays (but see Puritz & Lotterhos, 2018). To capture population‐level polymorphism information while minimizing cost, sequencing pooled individuals (i.e., pool‐seq approaches) has emerged as a cost‐effective alternative to sequencing individuals (Gautier et al., 2013; Schlötterer et al., 2014). Further, pool‐seq can be combined with targeted capture approaches to both reduce cost and sample specific areas of the genome that are a priori considered functionally relevant (e.g., Rellstab et al., 2019). The pooling of biological samples has been commonplace for decades (Dorfman, 1943), owing to the cost‐efficiency of analysing multiple samples together. Such methods have expanded to other purposes, such as the estimation of allele frequencies of nucleotide polymorphisms in next‐generation sequence data (i.e., pool‐seq). Pool‐seq approaches use read counts across pooled individuals to estimate allele frequencies, generally for a single population, with individuals pooled with equimolar contributions. A number of studies have empirically evaluated the congruence between individual and pool‐seq allele frequency estimates across various taxa (e.g., Fracassetti et al., 2015; Futschik & Schlötterer, 2010; Rellstab et al., 2013, 2019). Such studies have led to broad agreement on the accuracy of pool‐seq when following best practices for the organism and study design. Of exceptional significance for the estimation of allele frequency from read count data is the proper alignment of reads to the reference. Misalignments, which may be particularly important for exome capture data from members of Pinaceae, can be due to reads from paralog gene copies in the data mapping to the incorrect copy in the reference, or from paralog copies being collapsed into a single sequence in the reference assembly where copies in the data map to this single sequence. Such misalignments can be exacerbated by assembly errors in the reference, particularly for organisms with repetitive genomes. These misalignments will skew allele ratios and bias allele frequency estimates downstream. In particular for non‐model species with histories of whole genome duplication or gene family expansion, steps must be taken to categorize misalignments so that there are not substantial allele frequency biases in downstream data sets. Indeed, methods by which to detect such loci have received considerable attention (see Table 1 in McKinney et al., 2017). Among these, one such method uses haploid samples and the presence of heterozygote genotype calls to identify potential paralogous artefacts (Limborg et al., 2016), since haploid samples can only be monoallelic. Another uses read ratio depths among heterozygote individuals from individual sequence data to identify deviations expected from duplicated loci (McKinney et al., 2016). While multicellular haploid tissue is not present in vertebrates, such tissue is readily accessible from gametophytic life stages in many plant species, and in particular from the maternally‐derived megagametophyte tissue that can be excised from the seeds of conifer species.

TABLE 1

Data set	Ploidy per sample (number of samples)	SNP Caller	Purpose
indSeq	2 (20)	GATK4	Validate poolSeq allele frequency estimates; calculate read ratio statistics to validate candidate paralog misalignments
poolSeq	2 (20)	VarScan	Compare with indSeq SNP set to determine filtering strategy
megaSeq	1 (1)	VarScan	Identify heterozygous sites as candidates for false SNPs due to misalignment of diverged/duplicated paralogs

Note that we use camelCase to denote our data sets, and reserve hyphens (e.g., pool‐seq) to denote methodologies.

Description of datasets used to call SNPs for both Douglas‐fir and jack pine. indSeq and poolSeq data sets for a given species share the same individuals from a single population. The megaSeq data set consists of haploid megagametophyte tissue from a single individual not included in the indSeq or poolSeq data sets Note that we use camelCase to denote our data sets, and reserve hyphens (e.g., pool‐seq) to denote methodologies. Here we harness the multicellular haploid megagametophyte of conifers to aid in mapping and analysing pool‐seq data from diploid individuals. We use this pooled exome capture approach for two conifers: coastal Douglas‐fir (Pseudotsuga menziesii var. menziesii (Mirb.) Franco, Pinaceae) and jack pine (Pinus banksiana Lamb., Pinaceae), to evaluate the utility of pool‐seq approaches in these systems. We use sequence data from haploid samples to identify misalignments from paralogous sites, and use individual sequence data to validate both the allele frequency estimates of the same individuals in pools and the candidate regions affected by paralog misalignments detected with haploid data (Table 1). We then use this information to quantify their effects on the congruence between individual and pool‐seq allele frequency estimates. Together, these data sets provide a path forward for filtering pool‐seq data of this kind, particularly for studies of non‐model organisms using a diverged, and potentially fragmented, reference genome. Our methods further highlight a cost‐effective means to empirically isolate potentially misaligned paralogs in species with accessible haploid tissue, which to date has not been widely used for such purposes in conifers.

MATERIALS AND METHODS

Focal species and population sampling

Coastal Douglas‐fir (Pseudotsuga menziesii var. menziesii) is a temperate species occupying primarily coastal habitat along the west coast of North America from California to British Columbia as well as inland habitat in the Cascade and Klamath ranges of Washington, Oregon, and California. It is important to the ecology and economical value of many of these forests. Jack pine (Pinus banksiana) has a vast distribution across the Canadian boreal forest, stretching from Atlantic Canada into western Alberta and Northwest Territories, and is important to the ecology of many of these systems and to the forest industry in some regions. For both Douglas‐fir and jack pine, we sampled 20 individuals for use in individual and pooled sequencing sets from operational reforestation seedlots created from open‐pollinated seeds from tens or hundreds of seed parents from a single provenance (see Appendix S1: Section 1.1). We used a single jack pine seed to extract megagametophyte haploid tissue. For Douglas‐fir haploid data, we downloaded paired‐end fastq files from a previously sequenced Douglas‐fir megagametophyte taken from a single individual (NCBI SRA accession SAMN0333061, Neale, McGuire, et al., 2017) to match our sequencing effort for jack pine haploid tissue (Appendix S1: Section 1.2).

Exome capture probe design

The capture probes were designed based on the genes identified using RNA sequencing (RNA‐seq) data for Douglas‐fir and jack pine. De novo transcriptome assembly was performed for each species using RNA‐seq reads. For jack pine, RNA‐seq reads were sequenced from a frozen sample of young needles taken from a recently flushed bud of a single tree grown in a growth chamber with a simulated climate corresponding to a mean annual temperature of 6℃ (Appendix S1: Section 1.3). For Douglas‐fir, RNA‐seq reads were obtained from two sources: one source was the read sets deposited in NCBI SRA, including SRX1851630 (Little et al., 2016), SRX1286745 (Hess et al., 2016), SRX1341335 (Cronn et al., 2017a), and SRX136240 (Cronn et al., 2017b). The other source was the reads sequenced from two needle samples infected by the fungal pathogen Phaeocryptopus gaeumannii, which causes Swiss needle cast disease in Douglas‐fir (Appendix S1: Section 1.3). The raw reads were processed by the software fastx toolkit (v0.0.13, http://hannonlab.cshl.edu/fastx_toolkit), including clipping the adaptors (‐l 25), filtering the artefacts, and keeping the reads with a minimum quality score of 20. The filtered reads were used to perform de novo transcriptome assembly using the software trinity v2.4.0 (‐‐bowtie2, Grabherr et al., 2011). Among the assembled transcripts, only the longest isoforms with a length of at least 300 bp for each gene were retained, which were then used as reference to align the reads using the software rsem (v1.3.0 Li & Dewey, 2011). From the expression quantification of transcripts, transcripts with aligned reads and transcript per million (TPM) ≥1 were retained. The completeness of the filtered transcripts was examined using the 1375 orthologues in the Benchmarking Universal Single Copy Orthologues (busco: v3.0), set of embryophyta_odb10 (‐‐evalue 1e‐10, Simão et al., 2015). To avoid probes spanning exon‐intron boundaries, exons were targeted to design probes. Using the software gmap (v2017‐06‐20, Wu & Watanabe, 2005), the filtered transcripts from Douglas‐fir were aligned to the convarietal reference (P. menziesii var. menziesii (coastal Douglas‐fir; v1.0, Neale, McGuire, et al., 2017). The jack pine transcripts were aligned to the congeneric loblolly pine (Pinus taeda) reference genome (v.1.01, Wegrzyn et al., 2014) as there is no available jack pine reference genome, and both loblolly and jack pine belong to Pinus subgenus Pinus, the hard pines. Exon sequences with a length of at least 100 bp were submitted to Roche NimbleGen for Custom SeqCap EZ probe design. To evaluate the capture efficiency of the probes, the captured sequences were aligned to reference genomes and numbers of reads on‐target, near‐target (≤500 bp from target regions), and off‐target regions were counted using “intersect” function in the software bedtools v2.28.0 (‐f 0.75, Quinlan & Hall, 2010). The depth of captured sequences was counted using “depth” function in the software samtools v1.3 (‐q 30 ‐Q 20, Li et al., 2009). The cumulative depth was calculated and plotted using r (R Core Team, 2018).

DNA extraction, library preparation, and sequencing

In total, three data sets were created for each of the two species (Table 1)—note that we use camelCase (e.g., poolSeq) to denote our data sets, and reserve hyphens (e.g., pool‐seq) to denote methodologies. These data sets included individual sequencing of 20 diploid individuals from a single population (hereafter indSeq), the same individuals pooled together with equimolar contributions prior to sequencing (hereafter poolSeq), and haploid megagametophyte tissue sequenced from a single individual (hereafter megaSeq). We use the indSeq data set to validate allele frequency estimates from our poolSeq data, and the megaSeq data to probe our data for apparent heterozygote SNPs (i.e., potential false‐positive SNPs) caused by the misalignment of diverged paralogs that could affect our allele frequency estimates (Table 1; see also Section 2.6). For each data set we extracted DNA from either diploid needle tissue or haploid megagametophyte tissue (see Appendix S1: Section 1.3). From these extractions, approximately 100 ng of DNA from each individual or pooled DNA sample was used for a barcoded (Kapa, Dual‐Indexed Adapter Kit) library with an approximately 350‐bp mean insert size. SeqCap library preparation was performed using custom NimbleGen SeqCap probes (described above in 2.1) according to the NimbleGen SeqCap EZ HyperCap Workflow User's Guide Ver 2 (Roche Sequencing Solutions, Inc.). Following capture, each library was sequenced in a 150 bp paired‐end format on an Illumina HiSeq4000 instrument at the Centre d'expertise et de services Génome Québec, Montreal, Canada.

Bioinformatic SNP calling pipelines

Raw paired‐end sequence reads from all data sets were trimmed with fastp (v0.19.5, Chen et al., 2018) by trimming reads that did not pass quality filters of <20 Ns, a minimum mean Phred quality score of 30 for sliding windows of five base pairs (bp), and a final length of 75 bp with no more than 20 bp called as N (‐n 20 ‐M 30 ‐W 5 ‐l 75 ‐g ‐3). Trimmed reads were mapped with bwa mem (v0.7.17, Li & Durbin, 2009) to reference assemblies; we mapped jack pine to the loblolly reference (v2.01, Wegrzyn et al., 2014) and Douglas‐fir to the convarietal reference (v1.0, Neale, McGuire, at al., 2017). The resulting .sam files were converted to binary with samtools v1.9 (view, sort, index; Li et al., 2009) and subsequently filtered for proper pairs and a mapping quality score of 20 or greater (view ‐q 20 ‐f 0x0002 ‐F 0x0004). Using picardtools v2.18.9 (http://picard.sourceforge.net), read groups were added and duplicates subsequently removed from filtered bam files. We then called SNPs using the Genome Analysis Toolkit (gatk v4.1.0.0; McKenna et al., 2010) for indSeq data, and VarScan (v2.4.3; Koboldt et al., 2012) for both poolSeq and megaSeq data sets (Table 1) for comparisons since data sets that stem from a larger project are all poolSeq (and we will therefore only be using VarScan). For SNPs called with GATK4, we used HaplotypeCaller (‐‐genotyping‐mode DISCOVERY ‐ERC GVCF) and GenotypeGVCFs. We then filtered data with SelectVariants (‐‐select‐type‐to‐include SNP), VariantFiltration (‐‐filter‐expression “QD <2.0 || FS >60.0 || MQ <40 || MQRankSum < ‐12.5”), and finally with vcftools v0.1.14 (‐‐maf 0.00 –minGQ 20 –max‐missing 0.75; Danecek et al., 2011). BQSR was not carried out in our analysis due to the lack of a high‐quality reference set of SNPs for our species. Note that no further filtering (e.g., for depth) was done for this initial baseline filtering strategy (further filtering is described in 3.4). Before calling SNPs with VarScan, we first realigned indels with gatk 3.8 (McKenna et al., 2010)—RealignerTargetCreator then IndelRealigner—and then passed a samtools mpileup object directly to VarScan::mpileup2cns with a minimum coverage set to 8, p‐value < .05, minimum variant frequency of 0.00, ignoring variants with >90% support on one strand, a minimum average genotype quality of 20, and a minimum allele frequency of 0.80 to call a site homozygous (‐‐min‐coverage 8 ‐‐). Output was then filtered with a custom python (v3.7, www.python.org) script to filter out indels, keep only biallelic loci, and to ensure a genotype quality score >20. From the megaSeq data, we then isolated heterozygous SNP calls (hereafter megaSNPs) that represent errors in genotype calling given the haploid nature of the tissue sequenced—to keep only heterozygous calls, we ignored any biallelic cases where only the nonreference allele was called. Such apparent SNPs are probably false due to misalignments. We have published our complete SNP calling pipelines in publicly available repositories (Lind, 2021a; Lind, 2021b).

Validation of megaSNPs as indicators of paralogy artefacts

To check whether heterozygous sites (megaSNPs) called from VarScan megaSeq are following expectations of patterns from paralogs, we investigated read ratio deviations from a binomial expectation for these VarScan megaSNP sites at the same sites in our GATK indSeq data using heterozygous diploid individuals (sensu McKinney et al., 2017; see also Rellstab et al., 2019). For true positive SNPs, heterozygous diploid individuals should have, on average, an even ratio of reference (REF) and alternative (ALT) read counts. If the SNP is due to a bioinformatic error arising from the misalignment of paralogs (i.e., a false positive SNP), the read ratio will differ significantly from this expectation when there is a SNP at a given position in only one paralog copy (McKinney et al., 2017). Similarly, if there is a fixed difference at a given position between two copies, then all individuals in a population will present as heterozygotes with balanced read counts for REF and ALT at that site. If we are sequencing (and then post‐hoc correctly identifying) paralogs in our poolSeq data using megaSNP sites, misalignment of either duplicated or diverged paralogs will cause read ratio deviations in these loci (and affect allele frequency estimates from poolSeq, and downstream analyses), which we should be able to detect in our indSeq data. As described by McKinney et al. (2017), subsequent to whole‐genome duplication during the rediploidization phase as homeologous chromosomes diverge, tetrasomically inherited sets of paralogs (duplicates) organize into distinct disomic loci (diverged duplicates). We calculated these read ratio statistics for sites within the intersection of (1) megaSNPs, indSeq, and poolSeq SNPs, and (2) poolSeq and indSeq SNPs alone; hereafter intersections I1 and I2. The purpose of (1) is to see how paralogs could affect our poolSeq data (leveraging information in our indSeq data to do so), and of (2) is to visualize the potential influence of paralogs in our data independent of sites identified as megaSNP sites, as well as to compare poolSeq allele frequency estimates with those estimated from the indSeq data set. For these sites, we queried the indSeq data to record the frequency of heterozygous individuals (H), the allele depth ratio , and the deviation of allele depth from expectation standardized by properties of a binomial distribution with n = depth of coverage, and p = .5 (i.e., the z‐score for the allele ratio deviation) following McKinney et al. (2016) and McKinney et al. (2017) with modifications to correctly account for missing data when calculating the proportion of heterozygotes at a particular locus. We compare our results with simulations carried out by McKinney et al. (2017). We used custom python code to replicate the methods of McKinney et al. (2016) with modifications, which is available on GitHub (003_testdata_validate_megaSNPs.ipynb, Lind, 2021c).

Comparison of sequencing approaches

To study the utility of our pooled exome capture approach, we compared estimates of allele frequency from our indSeq data with estimates from our poolSeq data. To do so, we took the baseline‐filtered SNPs from poolSeq and indSeq (see Section 2.4) and identified common SNPs (i.e., intersection I2). To quantify and visualize congruence between allele frequencies estimated with these methods, we report Pearson's correlation coefficient, plot histograms to visualize the congruence across the minor allele frequency (MAF) spectrum, and further plot 2D histograms to visualize congruence of allele frequency estimates. To visualize how filtering poolSeq SNPs affects the congruence between indSeq and poolSeq allele frequency estimates, we plot the allele frequency differences between methods (hereafter , calculated as the difference in allele frequency methods of poolSeq and indSeq: ) against poolSeq MAF, poolSeq depth of coverage, H, and the z‐score of read ratio deviation (where H and z were calculated using indSeq data). The code for this section can be found on GitHub (002_testdata_compare_AFs.ipynb, Lind, 2021c).

RESULTS

Sequencing, mapping, and probe efficiency

Sequencing of the prepared libraries resulted in high quality data sets, with the average base quality above 30 before trimming having a mean of 86.99% across data sets and species, and a mean of 89.43% after trimming (Table S1). The number of sequenced reads varied across data sets but was similar within data sets (on average 405 million reads for indSeq, 130 million reads for poolSeq, 202 million reads for megaSeq). Mapping rates generally reflected the phylogenetic relationship between the sequenced individuals and the reference used, where rates were high for all coastal Douglas‐fir data sets mapping to the convarietal reference (mean 85.11%) with lower rates for jack pine data sets mapping to the congeneric Pinus taeda reference (mean 35.36%; Table S1). After filtering, the jack pine transcriptome has a size of 53 Mbp and contains 31,282 transcripts ranging from 300 to 16,688 bp with a mean length of 1695 bp; the Douglas‐fir transcriptome has a size of 51 Mbp and contains 39,616 transcripts ranging from 300 to 15,302 bp with a mean length of 1310 bp. The BUSCO analysis to assess completeness of transcripts used in exome‐capture probe design resulted in recovery of 87% of the 1375 BUSCOs in Douglas‐fir transcripts, including 70% complete and single‐copy BUSCOs, 2% complete and duplicated BUSCOs, and 15% fragmented BUSCOs. For jack pine transcripts, 93% of the BUSCOs were recovered, including 85% complete and single‐copy BUSCOs, 2% complete and duplicated BUSCOs, and 6% fragmented BUSCOs. We aligned the transcripts to reference genomes to select exons and design probes. The final capture probe size are 41 Mbp for jack pine (design name: 180215_jackpine_v1_EZ_HX1) and 39 Mbp for Douglas‐fir (design name: 80215_DOUGFIR_V1_EZ), corresponding to 32,208 genes in jack pine and 37,787 genes in Douglas‐fir. We counted the number of captured reads on‐target, near‐target (≤500 bp from target), and off‐target regions for indSeq and poolSeq samples. As the DNA of each sample was sheared to approximately 350 bp, the 500 bp up‐ or downstream of target regions (near‐target) can be directly captured by probes, whereas reads arise from outside the 500 bp margin are most often the unintended regions of the genome (off‐target). The poolSeq samples had more reads than the indSeq samples and off‐target regions had the most aligned reads (Figure 1). Reliable SNP calling is dependent on sequencing depth, so we calculated the cumulative numbers of bases on different regions. For Douglas‐fir poolSeq, we obtained over 40 million bases in on‐target regions with at least 20× sequencing depth (Figure 2a). For jack pine poolSeq, we obtained over 30 million bases in on‐target regions with at least 20× sequencing depth (Figure 2b). Sequencing depths in near‐ and off‐target regions were dramatically diminished compared to the on‐target regions.

FIGURE 1

FIGURE 2

Cumulative numbers of bases in Douglas‐fir (a) and jack pine (b) on target, near‐target (≤500 bp from target), and off‐target regions. Dashed lines represent the cumulative numbers of bases in each of 20 indSeq samples. Solid lines represent the cumulative numbers of bases in the poolSeq sample

Numbers of captured reads from Douglas‐fir (a) and jack pine (b) that mapped on target, near‐target (≤500 bp from target) and off‐target regions. On the x‐axis, from left to right, the first 20 bars represent indSeq samples, and the last bars represents the poolSeq samples Cumulative numbers of bases in Douglas‐fir (a) and jack pine (b) on target, near‐target (≤500 bp from target), and off‐target regions. Dashed lines represent the cumulative numbers of bases in each of 20 indSeq samples. Solid lines represent the cumulative numbers of bases in the poolSeq sample

SNP calling

The total number of SNPs after baseline filtering varied across data sets and species (Table 2). Douglas‐fir generally had a higher number of SNPs called than jack pine, except for poolSeq data. However, there were more jack pine megaSNPs intersecting with poolSeq (25,500 SNPs) and indSeq (7408 SNPs) than for Douglas‐fir data sets (825 SNPs and 293 SNPs, respectively). Given that megaSNPs are cases where a heterozygote call was made from a haploid sample and are therefore indicators of bioinformatic‐paralogy errors, this suggests that this error rate is much higher in jack pine. In total, several hundred thousand SNPs were found in the intersection of poolSeq and indSeq for Douglas‐fir (636,279 SNPs) and jack pine (255,706 SNPs; Table 2).

TABLE 2

Output of SNPs from the conifer data sets

Data set	Species	Baseline‐filtered SNPs	Baseline Intersecting SNPs
Data set	Species	Baseline‐filtered SNPs	poolSeq	megaSeq ^a
indSeq	DF	1,526,554	636,279	293
indSeq	JP	377,080	255,706	7408
poolSeq	DF	1,601,285	—	—
poolSeq	JP	3,686,528	—	—
megaSeq ^a	DF	398,774	825	—
megaSeq ^a	JP	32,751	25,500	—

The intersection across all three baseline‐filtered data sets were 7006 SNPs for jack pine (JP) and 248 SNPs for Douglas‐fir (DF).

These numbers reflect only heterozygous SNPs (i.e., megaSNPs).

Output of SNPs from the conifer data sets The intersection across all three baseline‐filtered data sets were 7006 SNPs for jack pine (JP) and 248 SNPs for Douglas‐fir (DF). These numbers reflect only heterozygous SNPs (i.e., megaSNPs). Upon inspection of our intersecting sets, patterns expected for duplicated but not diverged duplicate paralogs (McKinney et al., 2017) were apparent in both intersection I1 (megaSNPs, indSeq and poolSeq SNPs) and intersection I2 (indSeq and poolSeq SNPs), and megaSNPs did not generally typify patterns expected from nonduplicated (singleton) genes. For instance, SNPs in duplicated genes should be most distinct from SNPs in singletons when the derived allele is at intermediate frequency, and diverged duplicates are most distinct from singletons when the derived allele is fixed (Figure 3a). Sites consistent with expectations for singletons and duplicates (but not diverged duplicates) were apparent from intersection of poolSeq and indSeq sites (i.e., intersection I2; Figure 3d,e), while the indSeq sites intersecting with candidate paralog sites (megaSNPs, i.e., intersection I1) displayed elevated levels of heterozygosity as expected from paralogs (Figure 3b,c). Indeed, patterns of deviated allele ratios were also seen in our data (Figures S1d,e and S2d,e), where the vast majority of megaSNP sites were considerably different than the 1:1 read ratio expected of heterozygous diploids (Figures S1b,c and S2b,c) as would otherwise be expected for singletons (Figures S1a and S2a). Lastly, when considering the standardized allele ratio deviation (z‐score) we recover the same patterns of point clouds classified by McKinney et al. (2017). We observe a dense set of SNPs around the ‐score of 0.0 for H values of 0.0–0.6 (Figure 4d,e) expected from singleton sites (blue in Figure 4a), another set of SNPs with elevated H and/or absolute z‐score (Figure 4b,c) that is expected from duplicate loci (red in Figure 4a), and a third set of SNPs with H > 0.9 (Figure 4b,c) that is expected for diverged duplicates (green in Figure 4a; compare to Figures 5 and 8 in McKinney et al., 2017).

FIGURE 3

FIGURE 4

Standard deviation of read ratio (‐score) and the percentage of heterozygotes () calculated from indSeq data distinguish paralog misalignments according to expectations (a, figure from McKinney et al., 2017), and empirically for Douglas‐fir (b, d) and jack pine (c, e). (b, c) Empirical distribution of megaSNP sites (candidate paralog sites identified as heterozygote calls from haploid tissue) calculated using indSeq data for those sites that were also called in poolSeq data (i.e., intersection I1). (d, e) Empirical distribution of intersection I2 (indSeq and poolSeq intersection) calculated using indSeq data. Some of the distribution found in the grey box in 4a will be found in the upper white panel because we used the reference allele instead of randomly choosing the allele for each locus. Note colour scale changes for each figure to accentuate patterns in the data. Code to create these figures is available on GitHub (003_testdata_validate_megaSNPs.ipynb, Lind 2021c)

The proportion of heterozygotes, H, and the alternative (ALT) allele frequency calculated from indSeq data distinguish paralog misalignments according to expectations (a, Figure 1 from McKinney et al., 2017—q is the frequency of the ALT allele), and empirically for Douglas‐fir (b, d) and jack pine (c, e). (b, c) Empirical distribution of megaSNP sites (candidate paralog sites identified as heterozygote calls from haploid tissue) calculated using indSeq data for those sites that were also called in poolSeq data (i.e., intersection I1). (d, e) Empirical distribution of intersection I2 (indSeq and poolSeq intersection) calculated using indSeq data. Note colour scale changes for each figure to accentuate patterns in the data. Frequency of ALT was binned for visualization purposes. Code to create these figures is available on GitHub (003_testdata_validate_megaSNPs.ipynb, Lind 2021c) Standard deviation of read ratio (‐score) and the percentage of heterozygotes () calculated from indSeq data distinguish paralog misalignments according to expectations (a, figure from McKinney et al., 2017), and empirically for Douglas‐fir (b, d) and jack pine (c, e). (b, c) Empirical distribution of megaSNP sites (candidate paralog sites identified as heterozygote calls from haploid tissue) calculated using indSeq data for those sites that were also called in poolSeq data (i.e., intersection I1). (d, e) Empirical distribution of intersection I2 (indSeq and poolSeq intersection) calculated using indSeq data. Some of the distribution found in the grey box in 4a will be found in the upper white panel because we used the reference allele instead of randomly choosing the allele for each locus. Note colour scale changes for each figure to accentuate patterns in the data. Code to create these figures is available on GitHub (003_testdata_validate_megaSNPs.ipynb, Lind 2021c)

Comparisons of sequencing approaches

Loci within the intersection of baseline‐filtered indSeq and poolSeq data sets (i.e., intersection I2) showed a strong positive association between allele frequencies estimated from indSeq and poolSeq (Pearson's = .9760, = .0000 for jack pine; Pearson's = .9483, = .0000 for Douglas‐fir; Figure 5a,b). Comparison of the MAF spectrum from these estimates also revealed good agreement by frequency bins (Figure S3). After exploring various filtering strategies (see Appendix S1: Section 1.4, Figures S4–S7), we applied filters that (1) showed a positive effect on congruence between allele frequency estimates in our data (removing megaSNP sites and indSeq sites with ), (2) that resulted in removing sites with extreme values of (removing indSeq sites with ‐score ; filtering alone also had this effect), and (3) that gave us the best estimate of indSeq allele frequency—our standard of comparison—and thus the best impression of the performance of our poolSeq approach (removing indSeq sites with >20% missing data). The correlation of allele frequencies estimated from indSeq and poolSeq data increased after this filtering (Pearson's = .9876, = .0000 for jack pine; Pearson's = .9703, = .0000 for Douglas‐fir) with relatively fewer sites with extreme differences in the estimates from each method (see top‐left and bottom‐right corners of 2D histograms, Figure 5a–d). While some differences remain in the estimates of the minor allele frequency spectrum (Figure 5e,f), these two methods largely agree, suggesting a robust poolSeq data set for further biological hypothesis testing.

FIGURE 5

Congruence between indSeq and poolSeq (x‐ and y‐axes, respectively a–d) minor allele frequency (MAF) estimates from Douglas‐fir (a, c, e) and jack pine (b, d, f). (a, b) Two‐dimensional (2D) histogram of baseline‐filtered intersection between indSeq and poolSeq (i.e., intersection I2). (c, d) 2D histogram for SNPs after filtering intersection I2 for megaSNP sites,, ‐score, and indSeq sites with >20% missing data. (e, f) Congruence of minor allele frequency spectra from SNPs in (c, d). Colour scale is standardized to visualize differences in density between filtering steps. Minor allele frequency was binned for visualization purposes. Code to create these figures is available on GitHub (002_testdata_compare_AFs.ipynb, Lind 2021c)

DISCUSSION

The pooling of individuals to obtain next‐generation sequence data is often motivated by cost savings at the expense of losing (phased) haplotype information, direct estimates of linkage disequilibrium, and rare alleles. The validation of pool‐seq approaches, however, commonly involves model organisms with complete or near‐complete chromosome‐scale reference genomes (e.g., see Table 1 in Rellstab et al., 2013). Indeed, there are few studies that explore this congruence in non‐model organisms such as conifers with large and highly fragmented reference genomes, and histories of whole genome duplications, repetitive elements, and gene family evolution (which could exacerbate misalignments through assembly errors in the reference). Here we show that combining exome capture and pool‐seq can be an efficient method for quantifying genetic polymorphisms in two such species, and that heterozygous SNPs from haploid data (megaSNPs) consistently uncover sites with patterns expected from the misalignment of paralogs (Figures 3 and 4, S1 and S2). Further, we appear to uncover more false‐positive variation in jack pine than in Douglas‐fir (Table 2), likely due to the relative divergence between the species sequenced and reference genome used. Yet, concordance of allele frequency estimates from baseline‐filtered indSeq and poolSeq data sets (i.e., intersection I2) was strong in both species ( > .948). Despite this high correlation, there were many loci that had extreme differences in the estimated minor allele frequency. The correlation improved further after these sites were removed with increased filtering, including the filtering of potential false‐positive sites ( > .970, Figure 5), highlighting the utility of this method across taxa with differing demographic histories and genomic resources. These values are well within the range expected from previous pool‐seq studies (Table 1 in Rellstab et al., 2013), and in some cases perform better than these model organisms. Despite their role in adaptation and speciation (Allendorf et al., 2015; Lynch & Conery, 2000), the exclusion of potentially paralogous sites from next generation sequencing data sets is commonplace due to the difficulty in differentiating genetic polymorphisms from differences present among copies from single or diverged gene families (Dou et al., 2012; Dufresne et al., 2014; Hohenlohe et al., 2012). There are several methods by which to detect such problematic sites, such as filtering by coverage (Dou et al., 2012), disomic models such as Hardy‐Weinberg proportions (Catchen et al., 2013; Chen et al., 2014; Hohenlohe et al., 2011), or gene annotation, though there are several shortcomings (see descriptions of these shortcomings in Table 1 of McKinney et al., 2017). When individual sequencing data is available for the same individuals or populations, such information can be used to isolate potentially paralogous sites from pool‐seq exome capture studies (e.g., Rellstab et al., 2019; Shu & Moran, 2020). However, a potentially cost‐saving alternative would be to sequence the haploid tissue of a single individual (if available). Even so, there may be reduced power to detect recently diverged paralogs (i.e., when derived alleles are at low frequency and therefore not readily detected in a single individual), and an exploration varying the number and source population of haploid tissue for future studies could be used to more precisely quantify the effect and consistency of such data across sample sizes. As such, heterozygous SNPs called from our haploid data (megaSNPs) allowed us to identify variation from putative paralogous misalignments that infrequently displayed patterns expected of singleton gene copies. Indeed, high quality heterozygous calls from haploid sequencing are a reliable method for identifying misalignments due the known monoallelic state of the sequenced site (Limborg et al., 2016). While metrics from sequences of individuals are reliable (McKinney et al., 2017), they can falsely flag potentially paralogous sites as SNPs due to the stochastic nature of the sequencing process and may result in the exclusion of biologically meaningful information. The accurate estimation of allele frequencies from pool‐seq data will often depend on adequate depth of coverage and individuals, as well as thoughtful consideration of wetlab procedures and aspects of genomic resources and organismal biology. As pointed out by Rellstab et al. (2019), use of exome capture in many Pinaceae species will require particular care to exclude potentially paralogous sites from downstream analysis to avoid biased results. This is particularly true for pool‐seq data sets relying on read counts for allele frequency estimation or population genetic inferences such as genotype‐environment associations. While individually sequenced data sets may be one path forward to identifying such problematic sites (as in Rellstab et al., 2019; Shu & Moran, 2020), the sequencing of sufficient quantities of DNA from haploid gametophyte tissue available for some plants, including conifers, seedless vascular plants, and bryophytes, offers an alternate path forward to balance sequencing cost and data reliability, particularly for organism using diverged and or highly fragmented reference genomes.

AUTHOR CONTRIBUTIONS

Sally Aitken and Sam Yeaman obtained funding for the study. Sally Aitken, Sam Yeaman, Brandon Lind, and Mengmeng Lu conceived the research design with contributions from Tom Booker. Dragana Obreht Vidakovic generated the sequence data which Brandon Lind analysed, with contributions from Mengmeng Lu, Pooja Singh, and Tom Booker. Mengmeng Lu designed probes with contributions from Sam Yeaman. Brandon Lind and Mengmeng Lu designed SNP calling pipelines, which were coded by Brandon Lind and reviewed by Mengmeng Lu. Brandon Lind wrote the manuscript with contributions from Mengmeng Lu. All authors contributed to the editing of this manuscript.

CONFLICT OF INTERESTS

The authors declare no conflicts of interest. Supplementary Material Click here for additional data file.

49 in total

1. Population genomic analysis of model and nonmodel organisms using sequenced RAD tags.

Authors: Paul A Hohenlohe; Julian Catchen; William A Cresko
Journal: Methods Mol Biol Date: 2012

2. Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout.

Authors: Paul A Hohenlohe; Stephen J Amish; Julian M Catchen; Fred W Allendorf; Gordon Luikart
Journal: Mol Ecol Resour Date: 2011-03 Impact factor: 7.090

3. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

4. Comparative mapping in the Pinaceae.

Authors: Konstantin V Krutovsky; Michela Troggio; Garth R Brown; Kathleen D Jermstad; David B Neale
Journal: Genetics Date: 2004-09 Impact factor: 4.562

5. Exome capture from the spruce and pine giga-genomes.

Authors: H Suren; K A Hodgins; S Yeaman; K A Nurkowski; P Smets; L H Rieseberg; S N Aitken; J A Holliday
Journal: Mol Ecol Resour Date: 2016-08-05 Impact factor: 7.090

6. Reference-free SNP calling: improved accuracy by preventing incorrect calls from repetitive genomic regions.

Authors: Jinzhuang Dou; Xiqiang Zhao; Xiaoteng Fu; Wenqian Jiao; Nannan Wang; Lingling Zhang; Xiaoli Hu; Shi Wang; Zhenmin Bao
Journal: Biol Direct Date: 2012-06-08 Impact factor: 4.540

7. The variant call format and VCFtools.

Authors: Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal: Bioinformatics Date: 2011-06-07 Impact factor: 6.937

8. Unique features of the loblolly pine (Pinus taeda L.) megagenome revealed through sequence annotation.

Authors: Jill L Wegrzyn; John D Liechty; Kristian A Stevens; Le-Shin Wu; Carol A Loopstra; Hans A Vasquez-Gross; William M Dougherty; Brian Y Lin; Jacob J Zieve; Pedro J Martínez-García; Carson Holt; Mark Yandell; Aleksey V Zimin; James A Yorke; Marc W Crepeau; Daniela Puiu; Steven L Salzberg; Pieter J Dejong; Keithanne Mockaitis; Doreen Main; Charles H Langley; David B Neale
Journal: Genetics Date: 2014-03 Impact factor: 4.562

2. Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species.

Authors: Brandon M Lind; Mengmeng Lu; Dragana Obreht Vidakovic; Pooja Singh; Tom R Booker; Sally N Aitken; Sam Yeaman
Journal: Mol Ecol Resour Date: 2021-08-14 Impact factor: 8.678

2 in total