A targeted sequence capture array for phylogenetics and population genomics in the Salicaceae.

Brian J Sanderson¹,², Stephen P DiFazio², Quentin C B Cronk³, Tao Ma⁴, Matthew S Olson¹.

Abstract

PREMISE: The family Salicaceae has proved taxonomically challenging, especially in the genus Salix, which is speciose and features frequent hybridization and polyploidy. Past efforts to reconstruct the phylogeny with molecular barcodes have failed to resolve the species relationships of many sections of the genus.
METHODS: We used the wealth of sequence data in the family to design sequence capture probes that target 300–1200‐bp exonic regions of 972 genes.
RESULTS: We recovered sequence data for nearly all of the targeted genes in three species of Populus and three species of Salix. We present a species tree, discuss concordance among gene trees, and present population genomic summary statistics for these loci.
CONCLUSIONS: Our sequence capture array has extremely high capture efficiency within the genera Populus and Salix, resulting in abundant phylogenetic information. Additionally, these loci show promise for population genomic studies.

Keywords: Populus; Salicaceae; Salix; phylogenetics; targeted sequence capture

Year: 2020 PMID: 33163293 PMCID: PMC7598885 DOI: 10.1002/aps3.11394

Source DB: PubMed Journal: Appl Plant Sci ISSN: 2168-0450 Impact factor: 1.936

Although the cost of whole‐genome sequencing has continued to decrease dramatically over the past decade, the cost and complexity of whole‐genome analyses still limit their utility and accessibility for answering evolutionary questions in novel taxa (Richards, 2018). However, a polished genome assembly is not necessary to address many questions. In this context, several methods have been developed to reduce the cost and effort required to obtain genomic information in novel species (McKain et al., 2018). The recent development of targeted sequence capture presents an affordable method for consistently isolating specific, long, phylogenetically informative regions in the taxa of interest (Gnirke et al., 2009; Mamanova et al., 2010; Hale et al., 2020). Targeted sequence capture uses biotinylated RNA baits to target prepared sequencing library fragments. The baited library fragments can then be pulled out of the solution using streptavidin‐coated magnetic beads to selectively enrich the fragments that contain loci of interest, while discarding the majority of library fragments that do not. Two advantages of this method over other methods of genome sequence partitioning, such as genome skimming and restriction site–associated DNA sequencing (RAD‐seq), are (1) it does not necessarily depend on a highly polished, annotated reference genome, and (2) the same loci can be consistently sequenced at a high depth across individuals without requiring comprehensive, concurrent sequencing of all individuals (Mamanova et al., 2010; Grover et al., 2012; Jones and Good, 2016). In this paper, we report on the design and implementation of a targeted sequence capture array to collect data for phylogenetic analysis within the Salicaceae, the plant family that includes poplars and willows. 
Understanding species relationships within this family, and in particular within the genus Salix L., has presented challenges to taxonomists as early as Linnaeus, who noted that “species of this genus are extremely difficult to clarify” (Linnaeus, 1753; Skvortsov, 1999). Salix species present challenges to classification due to their wide geographic ranges, hysteranthous phenology, extensive interspecific hybridization, polyploidy, and the lack of well‐defined flower characters for morphological circumscription of taxa (Raup, 1959; Skvortsov, 1999; Percy et al., 2014; Wang et al., 2020). Species of Salix exhibit holarctic distributions, and there are several classifications that differ among continents and are challenging to synthesize due to non‐overlapping taxonomic treatment of species (Dickmann and Kuzovkina, 2014). Past efforts to reconstruct the phylogeny of Salix using nuclear amplified fragment length polymorphism (AFLP) markers and plastid barcode sequences have resulted in a lack of clearly resolved species relationships, especially in the subgenus Vetrix (Dumort.) Dumort. (Trybush et al., 2008; Percy et al., 2014). A more recent study using a supermatrix approach with RAD‐seq data showed resolution within a subset of species of the subgenera Vetrix and Chamaetia (Dumort.) Nasarov, highlighting the potential of large‐scale molecular data to resolve this phylogenetically challenging group (Wagner et al., 2018). The utility of RAD‐seq for collecting data for phylogeny, however, is limited by several issues. First, RAD‐seq does not consistently screen homologous regions across species and across different experiments, which limits its utility for adding species to a phylogeny at a later time. 
Second, because RAD‐seq assesses diversity in very short segments of the genome that contain little potential phylogenetic information independently, this type of data requires the concatenation of loci and the use of supermatrix phylogenetic analyses (de Queiroz and Gatesy, 2007), which do not allow the separate exploration of gene and species phylogenies using supertree methods (Sanderson et al., 1998). Additionally, concatenation approaches are likely to exacerbate problems associated with maximum‐likelihood methods for species with rapid diversification (Edwards et al., 2007; Edwards, 2009). Targeted sequence capture does not have these limitations, and thus may be a more appropriate genotyping platform for phylogenetics. Species of Populus L. and Salix have been of great interest for the development of forestry and biofuel products, resulting in polished reference genomes for P. trichocarpa Torr. & A. Gray, P. tremula L., P. euphratica Olivier, S. purpurea L., and S. suchowensis W. C. Cheng, as well as shallow resequencing data for many additional species (Tuskan et al., 2018). Our design strategy leveraged this abundance of existing genomic information to quantify polymorphism and the distribution of insertion/deletion polymorphisms (indels) within and among species in order to maximize capture efficiency. Furthermore, because we consistently target exon regions, we are able to leverage information about nucleotide‐site degeneracy to quantify population genomic summary statistics. We demonstrate the utility of this resource for Populus and Salix species by presenting a fully resolved phylogenetic tree for six species and an outgroup, and by estimating the distribution of nucleotide diversity within species for our targeted genes.

METHODS

Probe design

Our goal was to identify regions that could be efficiently captured using RNA bait hybridization for diverse species across the family Salicaceae. The family Salicaceae is thought to have diverged from other clades approximately 92.5 mya (Zhang et al., 2018b). Our primary focus was on the genera Populus and Salix, which diverged approximately 48 mya; we used the species Idesia polycarpa Maxim., which diverged from other clades approximately 56 mya, as an outgroup (Zhang et al., 2018b). Although we were interested in using these probes for phylogenetics with both Populus and Salix species, we focused on maximizing capture efficiency for the species in Salix, because the phylogeny for Populus is already much better resolved than that for Salix (Trybush et al., 2008; Percy et al., 2014; Wang et al., 2014, 2020; Liu et al., 2017). For this reason, the capture baits were designed to target regions in S. purpurea that also would have high capture efficiency across the Salicaceae.

The efficiency of RNA bait binding, and thus capture efficiency, is reduced as target regions diverge due to sequence polymorphism (Lemmon and Lemmon, 2013). To improve capture efficiency, we quantified sequence polymorphism among whole‐genome resequencing data from a diverse array of Populus and Salix species (Appendix S1). The whole‐genome short reads of the Populus and Salix species were aligned to the P. trichocarpa genome assembly version 3 (Tuskan et al., 2006) using BWA MEM version 0.7.12 with default parameters (Li, 2013). We used the P. trichocarpa genome as our initial reference because it was the most polished and annotated genome in the genus. Variable sites and indels were identified using SAMtools mpileup (Li, 2011), and read depth for the variant calls was quantified using vcftools (Danecek et al., 2011). Custom Python scripts were used to identify variant and indel frequencies for all exons in the P. trichocarpa genome annotation (scripts available at https://github.com/BrianSanderson/phylo-seq-cap [see Data Availability]; Sanderson, 2020).

Orthologs for our candidate loci in the Salix purpurea 94006 genome assembly version 1 (DOE‐JGI, 2016; Carlson et al., 2017; Zhou et al., 2018) were identified from a list of orthologs shared by the P. trichocarpa and S. purpurea genomes prepared using a tree‐based approach by JGI using Phytozome version 12 software (Goodstein et al., 2012). We further screened candidate regions to exclude high‐similarity duplicated regions by accepting only loci with single BLAST (Camacho et al., 2009) hits against the highly contiguous assembly of S. purpurea 94006 version 5 (Zhou et al., 2020), which is less fragmented than the S. purpurea 94006 version 1 genome assembly. Genes from the Salicoid whole‐genome duplication were identified using MCScanX (Wang et al., 2012) with default parameters, and segments for which the average number of synonymous substitutions (Ks value) for paralogous genes was between 0.2 and 0.8 were selected. Genes for which at least 600 bp of exon sequence contained 2–12% polymorphism and fewer than two indels were selected for probe design by Arbor Biosciences (Ann Arbor, Michigan, USA). Probes were designed with 50% overlap across the targeted regions, so that each nucleotide position would potentially be captured by two probes. Finally, to ensure that loci with high divergence across the family would be captured, we identified targets with less than 95% identity (based on BLAST results) between S. purpurea and P. trichocarpa and designed supplementary probes from orthologs of these genes in the I. polycarpa genome.
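The selection thresholds and the 50% probe overlap described above can be illustrated with a minimal Python sketch. This is not the authors' design code: the `ExonCandidate` record, the function names, and the handling of the final probe are assumptions made for illustration, and the 120‐bp probe length is taken from the bait length mentioned in the Discussion.

```python
from dataclasses import dataclass


@dataclass
class ExonCandidate:
    """Hypothetical per-gene summary; field names are illustrative only."""
    gene_id: str
    exon_bp: int         # exon sequence available for the gene, in bp
    polymorphism: float  # fraction of variable sites among species (0-1)
    indel_count: int     # indels observed among the aligned species


def passes_design_filter(c, min_bp=600, min_poly=0.02, max_poly=0.12,
                         max_indels=1):
    """Apply the stated criteria: >=600 bp, 2-12% polymorphism, <2 indels."""
    return (c.exon_bp >= min_bp
            and min_poly <= c.polymorphism <= max_poly
            and c.indel_count <= max_indels)


def tile_probes(target_len, probe_len=120, overlap=0.5):
    """Return half-open (start, end) probe intervals tiled across a target.

    With 50% overlap, interior positions are covered by two probes.
    Assumption: a last probe is placed flush with the target end when the
    regular tiling would otherwise leave the final bases uncovered.
    """
    step = int(probe_len * (1 - overlap))
    if step <= 0:
        raise ValueError("overlap must be smaller than 1")
    if target_len < probe_len:
        return []
    starts = list(range(0, target_len - probe_len + 1, step))
    if starts[-1] + probe_len < target_len:
        starts.append(target_len - probe_len)
    return [(s, s + probe_len) for s in starts]


if __name__ == "__main__":
    candidate = ExonCandidate("gene_A", exon_bp=900, polymorphism=0.05,
                              indel_count=1)
    if passes_design_filter(candidate):
        print(len(tile_probes(candidate.exon_bp)), "probes for",
              candidate.gene_id)
```

Under these assumptions, the first and last 60 bp of each target are covered by a single probe, which is why the text says each position is only "potentially" captured by two probes.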

Library preparation and sequence capture

Libraries for two individuals from each of P. balsamifera L., P. tremula, P. mexicana Wesm., S. nigra Marshall, S. exigua Nutt., and S. phlebophylla Andersson (Appendix S2) were prepared using the NEBNext Ultra II DNA Prep Kit (New England Biolabs, Ipswich, Massachusetts, USA) following the manufacturer’s protocol, and quantified using an Agilent Bioanalyzer 2100 DNA 1000 kit (Agilent Technologies, Santa Clara, California, USA). Libraries were pooled at equimolar concentrations into two pools of six prior to probe hybridization following the Arbor Biosciences myBaits protocol version 3.0.1 and Hale et al. (2020). The hybridized samples were subsequently pooled at equimolar ratios and sequenced at the Texas Tech Center for Biotechnology and Genomics using a MiSeq with the v2 Micro kit and 150‐bp paired‐end reads (Illumina, San Diego, California, USA).

Analysis of sequence capture data

Read data were trimmed for primer sequences and low quality scores using Trimmomatic version 0.36 (Bolger et al., 2014). The trimmed read data, as well as the whole‐genome reads for I. polycarpa, were assembled into gene sequences using the HybPiper pipeline (Johnson et al., 2016). We estimated the depth of read coverage across all targeted genes as well as at off‐target sites in R version 3.3.0 (R Core Team, 2016). The assembled amino acid sequences were aligned with MAFFT version 7.310 using the parameters --localpair and --maxiterate 1000 (Katoh and Standley, 2013), converted into codon‐aligned nucleotide alignments with PAL2NAL version 14 (Suyama et al., 2006), and trimmed for quality and large gaps with trimAl version 1.4.rev15 with the parameter -gt 0.5 (Capella‐Gutiérrez et al., 2009). HybPiper provides warnings for genes that have multiple competing assemblies that are within 80% of the length of the target region, because the alternate alignments may indicate that those genes have paralogous copies in the genome.

We estimated phylogenetic relationships using the full set of gene sequences recovered from our sequence capture data, as well as a restricted set of putatively single‐copy genes, based on our a priori list of shared paralogs between S. purpurea and P. trichocarpa, and supplemented by the list of paralog warnings from HybPiper. We estimated gene trees using RAxML version 8.2.10 (Stamatakis, 2014), specifying a GTRΓ model of sequence evolution. A set of 250 bootstrap replicates was generated for each gene tree. We used ASTRAL‐III (Zhang et al., 2018a; Rabiee et al., 2019) to infer the species tree from the RAxML gene trees. Because all nodes are weighted equally during quartet decomposition in ASTRAL‐III, we used sumtrees in the Python package DendroPy version 4.4.0 (Sukumaran and Holder, 2010) to collapse nodes with less than 33% bootstrap support values prior to species tree estimation. A set of 100 multilocus bootstrap replicates was generated for the species tree. We used phyparts (Smith et al., 2015) to determine the extent of congruence among gene trees for each node in the species tree. Cladograms representing the gene tree congruence and alternate topologies were plotted with the scripts phypartspiecharts.py and minority_report.py (scripts written by Matt Johnson [Texas Tech University], available at https://github.com/mossmatters/phyloscripts).

Finally, we used custom Python scripts to quantify nucleotide diversity at synonymous and nonsynonymous sites between the individuals of the same species, as well as correlations in values of per‐site nucleotide diversity between all species. The scripts described above, as well as the full details of these analyses, are available at https://github.com/BrianSanderson/phylo-seq-cap (Sanderson, 2020).
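As an illustration of the nucleotide diversity calculation mentioned above, the following Python sketch computes π as the mean proportion of differing sites over all pairs of aligned sequences. It is not the authors' script: the function name, the optional `sites` argument (a list of alignment columns, for example synonymous positions identified elsewhere), and the choice to skip gaps and ambiguous bases are assumptions.

```python
from itertools import combinations


def pairwise_pi(seqs, sites=None):
    """Mean pairwise proportion of differing sites among aligned sequences.

    seqs  : equal-length aligned nucleotide strings
    sites : optional iterable of 0-based column indices to restrict the
            calculation to one site class (assumed to be supplied by a
            separate degeneracy annotation step)
    Columns where either sequence has a gap or ambiguity code are skipped.
    """
    seqs = [s.upper() for s in seqs]
    if len(seqs) < 2:
        return float("nan")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    columns = range(len(seqs[0])) if sites is None else list(sites)
    valid = set("ACGT")
    per_pair = []
    for a, b in combinations(seqs, 2):
        compared = 0
        diffs = 0
        for i in columns:
            if a[i] in valid and b[i] in valid:
                compared += 1
                if a[i] != b[i]:
                    diffs += 1
        if compared:
            per_pair.append(diffs / compared)
    return sum(per_pair) / len(per_pair) if per_pair else float("nan")


if __name__ == "__main__":
    # Toy alignment, not data from this study
    print(pairwise_pi(["ACGTACGT", "ACGTACGA"]))
```

Running the same function once with synonymous columns and once with nonsynonymous columns would give the two site-class estimates; how the columns are classified is outside the scope of this sketch.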

RESULTS

Sequence capture efficiency

The final capture kit targets 972 genes covered by 12,951 probes based on the S. purpurea reference genome, and an additional 7049 (redundant) probes based on the I. polycarpa genome that target genes with the highest divergence between S. purpurea and P. trichocarpa (<95% identity in BLAST results of probes against the P. trichocarpa v3 genome sequence). This included an average of 680 ± 309 (mean ± SD) probes on each S. purpurea chromosome (Appendix S3), with an average of 1098 ± 489 bp (mean ± SD) of exon sequence per gene. Of the 972 target genes, 593 are putatively single copy based on our identification of paralogs in the S. purpurea genome assembly, 142 represent pairs of paralogs from the shared Salicoid whole‐genome duplication (i.e., 71 pairs of genes), and 237 are genes that have known paralogs for which we were not able to design targets in this kit (i.e., each of these genes has one or more paralogs in the S. purpurea genome that is not targeted by probes). We included a total of 1219 genes in the target file used to assemble the capture data, which includes the 972 targeted genes as well as paralogous copies for which probes were not designed. Because the issues of paralogy become more complex when we add species other than S. purpurea and P. trichocarpa, we advise using the HybPiper warnings of multiple competing long assemblies to assess paralogy in novel species following guidance from Johnson (2017a). The capture probe sequences and the target reference file are accessible at https://github.com/BrianSanderson/phylo-seq-cap (Sanderson, 2020). The sequence capture kit is available from Arbor Biosciences (Ref #170424‐30 “Salicaceae”).

Sequence capture efficiency was high among the libraries. We recovered 805,820 ± 178,482 reads (mean ± SD) from our Populus and Salix target capture libraries, of which 86.7% ± 1.15% (mean ± SD) mapped to the target sequence reference (Table 1). An average of 94.48% ± 1.37% of targeted exon sequences were covered by ≥10 reads. The average read depth was 44.65 ± 1.61 for on‐target sites, and 14.48 ± 2.10 for off‐target sites (Appendix S4).
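The read-count and mapping summaries for the capture libraries can be recomputed from the counts listed in Table 1. The Python sketch below does this for the 12 Populus and Salix libraries, leaving out the I. polycarpa whole‐genome library. The variable names are illustrative, and the use of the sample standard deviation is an assumption about how the reported SD was calculated.

```python
from statistics import mean, stdev

# (number of reads, reads mapped) for each capture library in Table 1
capture_libraries = {
    "P_balsamifera_MGR-01": (614_093, 523_321),
    "P_balsamifera_MGR-04": (769_303, 659_712),
    "P_mexicana_PM3": (843_032, 739_728),
    "P_mexicana_PM5": (880_962, 768_927),
    "P_tremula_R01-01": (749_220, 638_002),
    "P_tremula_R04-01": (634_625, 539_805),
    "S_exigua_SE002": (1_139_616, 998_698),
    "S_exigua_SE053": (843_120, 741_938),
    "S_nigra_SG037": (1_166_615, 1_028_635),
    "S_nigra_SG051": (602_649, 524_993),
    "S_phlebophylla_SP15M": (753_628, 651_791),
    "S_phlebophylla_SP7F": (672_975, 581_147),
}

read_counts = [reads for reads, _ in capture_libraries.values()]
proportions = [mapped / reads for reads, mapped in capture_libraries.values()]

mean_reads = mean(read_counts)
sd_reads = stdev(read_counts)          # sample SD (n - 1 denominator)
mean_prop = mean(proportions)
sd_prop = stdev(proportions)

if __name__ == "__main__":
    print(f"reads per library: {mean_reads:,.0f} +/- {sd_reads:,.0f}")
    print(f"proportion mapped: {mean_prop:.3f} +/- {sd_prop:.4f}")
```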

Table 1

					No. of genes with % targeted sequences
Name	No. of reads	Reads mapped	Proportion mapped	Genes mapped	25%	50%	75%	100%
I_polycarpa_WGS‐2 ^a	223,470,714	1,653,494	0.007	971	970	966	944	123
P_balsamifera_MGR‐01	614,093	523,321	0.852	972	971	960	884	122
P_balsamifera_MGR‐04	769,303	659,712	0.858	972	972	965	915	145
P_mexicana_PM3	843,032	739,728	0.878	972	972	964	917	140
P_mexicana_PM5	880,962	768,927	0.873	972	972	967	913	142
P_tremula_R01‐01	749,220	638,002	0.852	971	970	960	907	134
P_tremula_R04‐01	634,625	539,805	0.851	971	969	956	876	122
S_exigua_SE002	1,139,616	998,698	0.876	969	969	966	937	229
S_exigua_SE053	843,120	741,938	0.880	969	969	964	928	195
S_nigra_SG037	1,166,615	1,028,635	0.882	971	971	967	932	205
S_nigra_SG051	602,649	524,993	0.871	971	970	961	903	136
S_phlebophylla_SP15M	753,628	651,791	0.865	972	972	967	939	204
S_phlebophylla_SP7F	672,975	581,147	0.864	972	972	967	925	203

^a The I_polycarpa_WGS‐2 data is from whole‐genome sequencing data, rather than targeted sequence capture, and thus the low percentage of read mapping reflects the lack of target enrichment (although the read coverage across targets was comparable to the sequence capture libraries [Appendix S4]).

Coverage summary statistics for sequence capture read data. For each library, values represent the number of reads in the sequenced library, the number of those reads that mapped to the reference file for the targeted genes, the proportion of mapped reads, the number of targeted genes (out of 972) that had read data mapped to them, and the number of genes that had 25%, 50%, 75%, and 100% of the targeted sequences covered with >10× reads.

Phylogenetics

The species tree estimated with putatively single‐copy genes correctly paired all individuals of the same species and revealed a fully resolved phylogeny for the Populus and Salix species with 100% multilocus bootstrap support for all nodes (Fig. 1A). At least 85% of gene trees support the topology of the species tree (Fig. 1B), with the exceptions of the bipartition that separates P. balsamifera and P. tremula, and the bipartition that separates S. phlebophylla from the other Salix species, which had dominant alternate topologies that were supported by a large number of gene trees (Appendices S5, S6). The topology of the species tree estimated with the full set of genes and known paralogs was nearly identical to the tree estimated with only the putatively single‐copy genes. The major difference between these trees was evident in the bipartition separating P. balsamifera and P. tremula, where there were a large number of alternative topologies supported by small numbers of gene trees (the top three were supported by 13, 11, and 10 gene trees; Appendix S7).

Figure 1

Species trees estimated for the 432 putatively single‐copy genes that did not have paralog warnings reported by HybPiper. (A) Species tree generated by ASTRAL‐III for the gene trees. Node values represent bootstrap support from 100 multilocus bootstrap replicates in ASTRAL‐III. Branch lengths represent coalescent units. (B) Cladogram showing the congruence of gene trees for all nodes in the ASTRAL‐III species tree. The numbers above each node represent the number of gene trees that support the displayed bipartition, and numbers below the node represent the number of gene trees that support all alternate bipartitions. Purple wedges represent the proportion of gene trees that support the displayed bipartition. Blue wedges represent the proportion of gene trees that support a single alternative bipartition (see Appendices S5, S6). Green wedges represent the proportion of gene trees that have multiple conflicting bipartitions. Yellow wedges represent the proportion of gene trees that have no supported bipartition. Plotting code and its interpretation were provided by Matt Johnson (for more detail, see Johnson, 2017b).
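The concordance counts shown on each node can be thought of as a tally over gene trees. The Python sketch below is a simplified illustration of that idea and not the phyparts implementation: each gene tree is reduced to a set of rooted clades, and a gene tree is counted as concordant if it contains the species-tree clade, conflicting if it contains a clade that overlaps the species-tree clade without nesting, and uninformative otherwise. The function names and the clade representation are assumptions. The example input reuses the gene tree counts reported for the three Populus species in Appendix S5.

```python
def conflicts(clade_a, clade_b):
    """Two rooted clades conflict if they overlap but neither nests in the other."""
    return bool(clade_a & clade_b) and not (clade_a <= clade_b or clade_b <= clade_a)


def tally_support(species_clade, gene_trees):
    """Count gene trees that support, conflict with, or say nothing about a clade.

    species_clade : frozenset of taxon names defining a species-tree clade
    gene_trees    : iterable of sets of frozensets (the clades of each gene tree,
                    after any low-support nodes have been collapsed)
    """
    counts = {"concordant": 0, "conflicting": 0, "uninformative": 0}
    for clades in gene_trees:
        if species_clade in clades:
            counts["concordant"] += 1
        elif any(conflicts(species_clade, c) for c in clades):
            counts["conflicting"] += 1
        else:
            counts["uninformative"] += 1
    return counts


# Clades among the three Populus species, one set per topology
BAL, TRE, MEX = "P_balsamifera", "P_tremula", "P_mexicana"
species_topology = {frozenset({BAL, TRE})}
alt_tremula_mexicana = {frozenset({TRE, MEX})}
alt_balsamifera_mexicana = {frozenset({BAL, MEX})}

# Counts from Appendix S5: 158, 114, and 94 gene trees
gene_trees = ([species_topology] * 158
              + [alt_tremula_mexicana] * 114
              + [alt_balsamifera_mexicana] * 94)

populus_counts = tally_support(frozenset({BAL, TRE}), gene_trees)

if __name__ == "__main__":
    print(populus_counts)
```

A gene tree in which the relevant node was collapsed for low support contributes no clade at that position and so falls into the uninformative class in this sketch.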

Population genomics

Patterns of nucleotide diversity, measured as Nei’s π (Nei and Li, 1979), varied among species, with the greatest variation at synonymous sites (Appendices S8, S9). Populus tremula had the highest average values of π at both synonymous and nonsynonymous sites (Fig. 2). The values of π among species were highly correlated for species within genera and exhibited lower correlations between genera (Fig. 3).

Figure 2

Means and 95% confidence intervals of values of nucleotide diversity (Nei’s π) within each species at synonymous (yellow) and nonsynonymous (purple) sites.

Figure 3

Pairwise correlation (Pearson’s r) of values of Nei’s π between all species. Values above the diagonal represent the correlation of π at synonymous sites, values below the diagonal represent nonsynonymous sites. Boxes outlined in black represent within‐genus comparisons.
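For readers who want to reproduce this kind of comparison, the following Python sketch computes Pearson's r between two lists of per-gene π values. It is a generic implementation written for illustration, not the authors' code. Dropping genes with a missing value (`None`) in either species is an assumption, and the example values are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of values.

    Pairs in which either value is None (e.g., a gene not recovered in one
    species) are dropped before the calculation.
    """
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Invented per-gene pi values for two hypothetical species
    pi_species_1 = [0.010, 0.004, None, 0.021, 0.007]
    pi_species_2 = [0.012, 0.003, 0.009, 0.018, 0.008]
    print(round(pearson_r(pi_species_1, pi_species_2), 3))
```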

DISCUSSION

The decreasing cost of obtaining genomic and transcriptomic sequence data holds great promise for unlocking our understanding of phylogenetic relationships and population genetic patterns within and among complex taxonomic groups. However, assembling complete genomes is still not a trivial task, and there exist relatively few polished plant reference genomes onto which genome skimming data can be mapped. Many methods have been developed to reduce the sequencing and analytical burdens associated with obtaining genome data. We believe that targeted sequence capture is one of the most promising contemporary methods of inexpensively generating genomic information.

The efficiency of our targeted sequence capture array was extremely high, which yielded abundant phylogenetic information for six species of Populus and Salix. Overall, the phylogeny was fully resolved and conformed to our general understanding of the relationships among the taxa (Wu et al., 2015; Wang et al., 2020). One strength of the sequence capture approach is that it provides contiguous segments of gene sequence that are long enough to estimate individual gene trees, enabling the use of supertree methods, which can overcome the problems introduced by concatenation of multiple gene regions with divergent histories (Edwards et al., 2007; Edwards, 2009). The supertree approach also allowed for the identification of alternative evolutionary histories that are supported by different regions of the genome, as often occurs during historical hybridization and introgression (Zhang et al., 2018a; Rabiee et al., 2019). Our species tree identified three alternative gene tree relationships among the three Populus species (Appendix S5). Previous studies have provided evidence of historical introgression among these species, including a history of chloroplast capture and hybridization between P. mexicana and species in the section Tacamahaca Spach (including P. balsamifera; Wang et al., 2014, 2020; Liu et al., 2017). 
The second most‐supported of these topologies (114 gene trees; Appendix S5) placed P. mexicana and P. tremula as sister taxa, a pattern that does not support this hypothesis, likely due to incomplete lineage sorting (Wang et al., 2020). Populus tremula likely has a greater long‐term effective population size than P. balsamifera (Wang et al., 2016), and so coalescence times may be shorter on average in P. balsamifera. Among the Salix species, we identified three alternative gene tree relationships between the S. phlebophylla and S. exigua individuals, which may reflect the histories of rapid speciation and hybridization that have long vexed attempts at phylogenetic reconstruction in the genus Salix (Appendix S6; Trybush et al., 2008; Percy et al., 2014). Both of these patterns in Populus and Salix may be better understood once additional taxa are added to this phylogeny.

We have also shown that this sequence capture design can be applied to address questions related to population genomics in the Salicaceae. Many of the advantages of targeted sequence capture over competing methods are of particular relevance for population genomic studies, including specific knowledge of loci being sequenced; the ability to differentiate among synonymous, nonsynonymous, intronic, and intergenic loci; and the ability to collect data on the same set of loci across different experiments, either within species or across species, for comparative studies. In particular, synonymous sites, especially fourfold synonymous sites, are among the fastest‐evolving regions of the genome and the sites within genic regions least influenced by selection (Wright and Andolfatto, 2008), and are thus among the best regions for estimating patterns of historical demography. Our estimates of nucleotide diversity are similar to those that have been previously reported for P. balsamifera and P. tremula using Sanger sequencing data (Ingvarsson, 2005; Olson et al., 2010) and whole‐genome sequencing data (Wang et al., 2016). The high estimates of diversity in S. phlebophylla compared to the other two Salix species are curious and may result from a history with relatively little migration due to the absence of glaciation over a large portion of its Beringian distribution (Hultén, 1937).

The current study is based on a small sample size per species (n = 2), and so our ability to account for population structure or robustly perform population genomic inferences with these data is limited. Additionally, a potential limitation for using this sequence capture array for comparative population genomics is that we screened loci for among‐species variability of 2–12%, which excludes loci that exhibit extremely high or low values of nucleotide diversity. This may bias estimates of nucleotide diversity arising from these probes toward greater evenness. The ability to identify synonymous sites, which are the closest to neutral among all classes of sites (Wright and Andolfatto, 2008), should partially address this bias. However, the potential effects of hitchhiking selection on synonymous site variation will likely remain apparent.

Another feature of sequence capture data is the recovery of “off‐target” sequences that result from the fact that the insert size of libraries is larger than the 120‐bp bait length, and so regions upstream and downstream of the target will be sequenced as well. These regions may include intronic and intergenic regions, as well as exonic sequences that deviate from the constraints we used for our design. The results we report here only incorporate the “on‐target” sites that we sequenced, but HybPiper implements methods to assemble intronic sequences as well.
We also found that it was straightforward to integrate the targeted sequence capture data with whole‐genome sequence data using the HybPiper pipeline by simply including the FASTQ files from whole‐genome reads in the pipeline. This strategy was used to successfully incorporate whole‐genome sequencing data from I. polycarpa, to act as our outgroup. The proportion of gene coverage as well as the read depth for the I. polycarpa data was similar to the sequence capture libraries (Table 1).

A whole‐genome duplication occurred prior to the divergence of Salix and Populus, and there are at least 8000 known paralog pairs in the P. trichocarpa reference genome (Tuskan et al., 2006). Genes with paralogous copies in the genome can complicate gene assemblies, because sequence data from both copies may alternately align to the same target sequence. We identified paralogous sequences in the S. purpurea genome assembly using MCScanX, and used that information to assist in the design of the sequence capture array. The final array includes 593 putatively single‐copy genes, 142 genes that represent 71 pairs of paralogs, and 237 genes that have paralogs but for which we were not able to include both paralogs in the kit due to our selection criteria. The target reference file we used to map the sequence capture data thus includes 1219 genes, including the single‐copy and known paralogs from S. purpurea. In addition to this, HybPiper provides warnings for genes that have multiple competing alignments that cover the majority of the target sequence, which may indicate the presence of multiple paralogous copies in the genome (Johnson et al., 2016). This will be particularly useful because the genes that have maintained paralogous copies are likely to differ among species throughout the diversification of willows. We estimated evolutionary relationships using both the full set of 1219 single‐copy and known paralog genes, as well as a limited set of only single‐copy genes that did not report paralog warnings. 
The results from both analyses were nearly the same, but this will likely not be true for a more complex phylogenetic analysis that includes more than six species and an outgroup. For those more complex phylogenetic analyses, the ability to compare trees constructed with single‐copy genes with those using paralogous copies may provide crucial information for reconciling evolutionary relationships. This sequence capture array will provide the community with an excellent resource to consistently sequence a set of variable regions of the genome for phylogenetic and population genomic investigations in the Salicaceae. The rate of read mapping and coverage of target genes was remarkably consistent across both genera, despite the fact that the taxa were selected to maximize sampling of phylogenetic diversity within each genus. The Salicaceae are important plants in the Northern Hemisphere both ecologically and economically and have been the subjects of numerous population genetic and population genomic investigations of speciation, hybridization, introgression, selection, and historical population size and migration. This resource will allow phylogenetic and comparative population genomic studies to assess the same loci across different studies, which will allow us to build a worldwide diversity database and facilitate more precise comparative research questions. Our results demonstrate that the rate of gene capture is extremely high, such that it would be unnecessary to filter data and determine appropriate overlapping genotype thresholds, as is necessary with random genome partitioning methods such as RAD‐seq.

AUTHOR CONTRIBUTIONS

S.P.D. and M.S.O. conceived the study. S.P.D., Q.C.B.C., T.M., and M.S.O. secured funding to support the project. B.J.S. and S.P.D. designed the sequence capture array. Q.C.B.C. and T.M. provided whole‐genome sequence data. B.J.S. and M.S.O. prepared and sequenced the DNA samples, analyzed the data, interpreted the results, and wrote the manuscript. All authors edited drafts of the manuscript and approved the final version.

APPENDIX S1. Coverage summary statistics for whole‐genome reads used to design the sequence capture array. For each library, values represent the name of the sequenced individual, the number of reads in the sequenced library, the number of reads that mapped to the Populus trichocarpa v3 reference genome, the proportion of reads that mapped to the reference genome, and the mean and standard deviation of read depth.

APPENDIX S2. Collection details for Populus and Salix species.

APPENDIX S3. Distribution of probes across the Salix purpurea genome.

APPENDIX S4. Summary of read depth at on‐ and off‐target sites. For each library, values represent the 5%, 25%, 50%, 75%, and 99% quantiles of read depth, the maximum number of reads mapped to a site, and the mean and standard deviation of read depth.

APPENDIX S5. Alternate bipartitions for the three species of Populus, based on gene tree concordance. The cladogram in all three panels is that of the ASTRAL‐III species tree (Fig. 1), and the blue color represents the bipartition supported by the indicated number of gene trees in each panel. A total of 158 gene trees support the displayed ASTRAL‐III species tree topology (A), 114 gene trees support a bipartition that places P. tremula and P. mexicana together (B), and 94 gene trees support a bipartition that places P. balsamifera and P. mexicana together (C).

APPENDIX S6. Alternate bipartitions for the three species of Salix, based on gene tree concordance. The cladogram in all three panels is that of the ASTRAL‐III species tree (Fig. 1), and the blue color represents the bipartition supported by the indicated number of gene trees in each panel. A total of 327 gene trees support the displayed ASTRAL‐III species tree topology (A), while 44 gene trees (B) and 39 gene trees (C) place one of the S. phlebophylla individuals within S. exigua.

APPENDIX S7. Species trees estimated for all genes and known paralogs. (A) Species tree generated by ASTRAL‐III for the gene trees. Node values represent bootstrap support from 100 multilocus bootstrap replicates in ASTRAL‐III. Branch lengths represent coalescent units. (B) Cladogram showing the congruence of gene trees for all nodes in the ASTRAL‐III species tree. The numbers above each node represent the number of gene trees that support the displayed bipartition, and numbers below the node represent the number of gene trees that support all alternate bipartitions. Purple wedges represent the proportion of gene trees that support the displayed bipartition. Blue wedges represent the proportion of gene trees that support a single alternative bipartition. Green wedges represent the proportion of gene trees that have multiple conflicting bipartitions. Yellow wedges represent the proportion of gene trees that have no supported bipartition. Plotting code and its interpretation were provided by Matt Johnson (for more detail, see Johnson, 2017b).

APPENDIX S8. Distributions of values of nucleotide diversity (Nei’s π) within each species at synonymous (yellow) and nonsynonymous (purple) sites.

APPENDIX S9. Nucleotide diversity expressed as Nei’s π for nonsynonymous and synonymous sites.
