Fishing for DNA? Designing baits for population genetics in target enrichment experiments: Guidelines, considerations and the new tool supeRbaits.

Belén Jiménez-Mena¹, Hugo Flávio¹, Romina Henriques¹, Alice Manuzzi¹, Miguel Ramos², Dorte Meldrup¹, Janette Edson³, Snaebjörn Pálsson⁴, Guðbjörg Ásta Ólafsdóttir⁵, Jennifer R Ovenden⁶, Einar Eg Nielsen¹.

Abstract

Targeted sequencing is an increasingly popular next-generation sequencing (NGS) approach for studying populations that involves focusing sequencing efforts on specific parts of the genome of a species of interest. Methodologies and tools for designing targeted baits are scarce but in high demand. Here, we present specific guidelines and considerations for designing capture sequencing experiments for population genetics for both neutral genomic regions and regions subject to selection. We describe the bait design process for three diverse fish species: Atlantic salmon, Atlantic cod and tiger shark, which was carried out in our research group, and provide an evaluation of the performance of our approach across both historical and modern samples. The workflow used for designing these three bait sets has been implemented in the R-package supeRbaits, which encompasses our considerations and guidelines for bait design for the benefit of researchers and practitioners. The supeRbaits R-package is user-friendly and versatile. It is written in C++ and implemented in R. supeRbaits and its manual are available from Github: https://github.com/BelenJM/supeRbaits.

Entities: Chemical

Keywords: R-package; ancient DNA; baits; capture sequencing; genomics; population genetics

Substances：
DNA

Year: 2022 PMID: 35178874 PMCID: PMC9313901 DOI: 10.1111/1755-0998.13598

Source DB: PubMed Journal: Mol Ecol Resour ISSN: 1755-098X Impact factor: 8.678

INTRODUCTION

Genomic information is increasingly available for population genetics analyses due to the rapid development of next‐generation sequencing (NGS) methods. A multitude of wild species are studied in this way; however, the method is particularly important for endangered or commercially exploited species, where knowledge generated from genome‐wide data can greatly aid in conservation and sustainable management efforts (Supple & Shapiro, 2018). Many of these species are not widely used for general research questions, so reference genomic resources to initiate NGS studies are rarely available (Russell et al., 2017). Different NGS approaches are currently available, with their suitability varying with the study question and type of organism at hand. For example, while sequencing whole genomes provides very detailed data on the genomic architecture of a species, this approach remains time‐consuming and expensive, given the high cost of producing, analysing and storing the large quantity of data obtained. Alternatively, methods of reduced‐representation sequencing allow investigation of specific regions of the genome in a large number of conspecific individuals at relatively low cost and in a short time (Mamanova et al., 2010), especially when the genome size of the organism of interest is large or complex (McCartney‐Melstad et al., 2016). One of these reduced‐representation methods is the so‐called “target enrichment” approach, which targets specific areas of interest within the genome (Mamanova et al., 2010). There are multiple methods of target enrichment, for example, PCR‐based enrichment or hybridization‐based capture sequencing (see Mamanova et al., 2010). Hybridization‐based capture sequencing (hereafter referred to as CS) is currently one of the quickest and most flexible methods for target enrichment (Mamanova et al., 2010) and can be performed using fixed predefined solid‐arrays or in‐solution (Horn, 2012). 
The latter is based on in vitro hybridization of the target genomic regions with designed synthetic probes of DNA or RNA, that is, “baits”, that will “capture” the complementary sequence that the bait was designed for (Horn, 2012). In principle, only the desired genomic areas for which the baits were designed will be captured and sequenced; thus, CS has commonly been used for historical (hDNA) and ancient DNA (aDNA) studies, as it increases the yield of sequence from the study species by reducing the probability of sequencing contaminants (Willerslev & Cooper, 2005). Likewise, it has been suggested that CS could drive the transition from conservation genetics to conservation genomics (Meek & Larson, 2019), given its flexibility and cost effectiveness (between $43 and $65 per sample in plants; see Hale et al., 2020). One potential economic caveat of CS is the high percentage of off‐target reads, that is, reads that map to nontarget regions. In exon‐capture experiments it has been estimated that on average 40%–60% of the total number of reads sequenced are off‐target (Samuels et al., 2013). Nevertheless, although a priori a disadvantage, the off‐target reads can still provide useful insights in subsequent data analysis. For instance, mitochondrial DNA is very often sequenced in CS experiments as an off‐target (Picardi & Pesole, 2012), and it has been used to identify and remove contaminated individuals belonging to other species or samples, as well as to provide additional markers (e.g., Manuzzi et al., 2021). Other uses of off‐target data reported in the literature include, for example, identification of new SNPs (Guo et al., 2012) or repeat regions (Costa et al., 2021). 
Discussion of pros and cons of CS in comparison with other genomic approaches (i.e., high‐ or low‐coverage whole‐genome sequencing, other reduced‐representation approaches) is outside the scope of this article, but we refer to the following studies/reviews for further comparisons (e.g., low‐coverage sequencing: Lou et al., 2021; Therkildsen & Palumbi, 2017; whole‐genome sequencing: Schwarze et al., 2020).

Why capture sequencing for population genetics?

In population genetics studies, documenting neutral processes is of particular interest from a conservation and resource management point of view (Zhou & Holliday, 2012). Neutral processes such as gene flow, population divergence and demographic history, allow the study of how populations are connected through time and space. The ability to select putative neutral genomic regions (and discard others that may be under natural selection or linked to such sites) provides an advantage for processing genetic data for conservation purposes (Zhou & Holliday, 2012). Alternatively, working with putative loci under selection can disentangle adaptive and neutral processes. For example, studies of loci supposedly under selection may identify local adaptation among populations subject to spatially varying selective pressures (e.g., Nielsen et al., 2009), or assess adaptation over time in response to temporally changing selective pressures (e.g., Franks et al., 2016; Therkildsen et al., 2013), such as global climate change. For many instances in population genetic studies, there is a greater benefit from increasing the number of individuals analysed as opposed to increasing genomic coverage per individual (Fumagalli, 2013), for example, Benestan et al. (2015). CS permits data collection from an increased number of sampled individuals per sequencing lane for the same cost compared to whole‐genome approaches (Jones & Good, 2016), at the expense of having fewer sequenced genomic regions. Another useful application of CS is for studies of historical or ancient DNA samples. For example, if one is interested in assessing the loss of genetic diversity through time, it is common to use historical or ancient samples, which generally yield smaller amounts of endogenous DNA that may be degraded and contaminated with DNA from a variety of organisms, including bacteria, fungi and viruses (Willerslev & Cooper, 2005). 
Therefore, baits designed to capture DNA from the target species have a huge potential to increase the success of retrospective population genomic studies. Likewise, CS is advantageous when working with endangered or nearly‐extinct species where samples could be scarce and of low‐quality (Glenn & Faircloth, 2016), as well as in environmental samples (Giebner et al., 2020). In conclusion, CS is powerful because it specifically targets DNA from the species of interest, including in contaminated or mixed samples, and specifically selects regions within those genomes to answer particular research questions at a reasonable cost. Because CS methods rely on targeting a narrow set of desired regions in the genome, the design of the oligonucleotide baits is key for the success of any CS‐based study. To date, bait design has focused mainly on comparing conserved regions in humans (Hodges et al., 2009) or in phylogenetic studies (Andermann et al., 2020; Hancock‐Hanser et al., 2013; Hugall et al., 2016; Lemmon et al., 2012). Recent bait design software has reflected this trend by focusing mainly on exonic regions that tend to be ultra‐conserved (Campana, 2018; Chafin et al., 2018; Faircloth, 2017; Mayer et al., 2016) or long regions (>20 kb, Jayaraman et al., 2020) of the genome. However, little attention has been given to bait design processes for population genetic studies using CS, where the focus lies largely on determining within‐species genetic structure and diversity. There is also currently an absence of guidelines, pipelines and specific software to help in this endeavor. In most cases, the process of designing baits is outsourced to manufacturers who ensure the baits are compatible and of the best quality, but it is time‐consuming and expensive (Meek & Larson, 2019). 
For example, in the case of medical genomics, several manufacturers have predesigned panels for genomic regions of interest, as well as tools for creating “custom capture reagents” for enrichment of genomic regions specified by the laboratory (see Glenn & Faircloth, 2016; Hagemann et al., 2013, for a review). For nonmedical related studies, predesigned panels do not generally exist, meaning that each project needs to create its own set of baits (Glenn & Faircloth, 2016). Exceptions can be found within phylogenetics (see Table S1 of Andermann et al., 2020), and a few of the databases have been successfully used for population genetic purposes, for example, in reptiles (Singhal et al., 2017) or frogs (Chan et al., 2020, 2021; Hutter et al., 2019). In addition, there is very limited literature available describing how to successfully design baits for population genetics, including the reasoning behind such bait design (but see Puritz & Lotterhos, 2018). Accordingly, bait design is usually left to short methodological descriptions in individual research manuscripts (see examples in Table S1). Here, we present novel guidelines and considerations for designing baits for population genetics that will save time and effort. We also discuss three empirical examples of bait design that investigated changes through time in genetic diversity using time‐series data from three fish species to inform conservation and management strategies. The aim of the SDPAS (“Strengthening the Danish Populations of Atlantic Salmon: increasing populations, genetic resources and recreational fishing”) project was to investigate the temporal variation in proxies of genetic diversity in the population of Atlantic salmon (Salmo salar) in Denmark over the span of a century. For this, we used ~1000 samples with a temporal range from 1913 to 2017, and targeted different genomic areas for elucidating both neutral and adaptive changes over time. 
In the project CODSTORY we investigated genetic changes in Atlantic cod (Gadus morhua) in Icelandic waters, to assess possible association with changes in fisheries practices over almost 1000 years. Finally, as part of the GENOJAWS project, we wanted to understand whether tiger sharks (Galeocerdo cuvier) had suffered a recent historical (since 1820) loss of genetic diversity, associated with climate change or human‐induced ecological perturbances. For these three studies, DNA was extracted from jaws, vertebrae, scales, bones and tissue samples collected from our local institute and museums and excavations across the world. We briefly discuss the suitability and broad applicability of our novel bait design approach using these three examples. Finally, we explain the main functionalities of supeRbaits, an R‐package designed for researchers and practitioners to design their own bait sets for CS experiments. Along with the R‐package, we also make the designed panels of bait sequences available for the research community.

WHAT TO TAKE INTO ACCOUNT WHEN DESIGNING YOUR BAITS?

Available genomic resources

An increasing number of non‐model species have reference genomes available (Hohenlohe et al., 2021). An annotated reference genome of the species of interest allows researchers to more accurately select genomic regions for bait design. The annotation helps to locate intron/exon boundaries allowing identification of coding/non‐coding regions of the genome subject to different evolutionary forces (Warr et al., 2015). In our research projects dealing with Atlantic salmon and Atlantic cod, we made use of the available reference genome for the Atlantic salmon (ICSASG_v2, Lien et al., 2016) and the latest genome assembly for the Atlantic cod (GadMor2, Tørresen et al., 2017), respectively. If full genomes are not available, transcriptomes can also be used for bait design (e.g., Bailey et al., 2016; Capblancq et al., 2020; Ehlers et al., 2020; Förster et al., 2018). For the tiger shark, we used the available transcriptome assembly from white muscle (Swift et al., 2016). However, even if genomic resources from the species of interest are not available, those from a closely related species or species group can still be used (e.g., Cosart et al., 2011; Nielsen et al., 2017), and even be used to generate new genomic resources. For instance, Förster et al. (2018) found 686 candidate SNPs in the Eurasian Lynx (Lynx lynx) using baits designed from the domestic cat (Felis catus), to generate a 96‐SNP panel to monitor the presence of the species in the wild. When using genomic information from another species, it is important to take the evolutionary distance into account (Jones & Good, 2016), as this will influence the effectiveness of the bait hybridization. However, one could also choose to study divergent variation within a family of species, thus the design of the baits should target areas with some level of divergence between the species; for example, Sanderson et al. 
(2020) designed baits in regions that were less than 95% identical between two species targeted in their study. In the case of complete lack of genomic resources, which is still common for many non‐model species, there are other methods available to circumvent the problem, mostly by generating new genomic resources. For example, PCR capture, de novo assembly capture and divergent reference capture (see Jones & Good, 2016), or more recently the combination of RAD‐sequencing with capture, that is, RAD‐capture (Ali et al., 2016), and expressed‐exome capture (Puritz & Lotterhos, 2018) can all be used when no genomic resources are available. In addition to finding and selecting genomic regions for CS from available genomic resources, in this study we also designed baits for regions containing already identified SNPs and reused capture baits that were previously designed from other species, for example, baits designed from the cat shark (Scyliorhinus canicula) transcriptome that had previously captured tiger sharks’ sequences successfully (Manuzzi et al., 2021). However, this is not an ideal approach, as biases can be introduced by not knowing the genomic location of cross‐species designed baits in the species of interest (e.g., linkage between markers when assuming independence), which should be kept in mind in downstream analyses.

The research question

As with other aspects of planning a research project, the specific questions and hypotheses should also guide the bait design process. Thus, designing a bait panel should provide the opportunity to generate enough data to address the specific research question in a cost‐effective way. For instance, to focus on population genetics of antelopes and measure genetic diversity, Gooley et al. (2020) designed 5000 baits outside exonic areas to target 5000 putatively neutral SNPs. Else, researchers can choose to divide the bait panel into different sets aimed at answering various questions within a population genomics‐related project; including addressing both neutral processes and selection/adaptation and therefore focusing on non‐coding/coding regions of the genome. In our projects, bait sets were designed to target: (i) Single nucleotide polymorphisms of interest (SNPs) from published SNP panels. In the SDPAS and CODSTORY projects we used information from SNP chips (Hubert et al., 2010; Karlsson et al., 2011; Moen et al., 2008) or SNPs previously applied to population genetic studies in our laboratory (Therkildsen et al., 2013); (ii) genes of interest or regions with known quantitative trait loci (QTL). As this approach provides no prior SNP knowledge, we allocated baits randomly around each gene/QTL regions, for example, genomic regions identified as related to parasite‐driven evolution in Atlantic salmon (Zueva et al., 2014), or genes related to survival in the wild (Besnier et al., 2015); (iii) genes of interest identified from available transcriptomes, exemplified with the bait design for the tiger shark project; (iv) particularly interesting genomic regions, such as the four known inversions in Atlantic cod that characterize the different ecotypes (Barney et al., 2017; Kirubakaran et al., 2016). 
Other studies also used this approach for generating some of the baits, for example, for genes associated with environmental stress responses (Bi et al., 2019); and (v) putative “neutral” areas of the genome (i.e., not in or adjacent to genes), in order to obtain sufficient data on neutral genomic processes to allow estimation of neutral indices such as effective population size (Ne) and other measures of genetic diversity through time. In this instance, we generated sequences placed throughout the genome, but excluded repetitive areas; a similar approach was used in Gooley et al. (2020). Figure 1 shows a scheme of the classes of targeted areas. More details on the distribution of different bait classes for the three projects and species can be found in Table S2.

FIGURE 1

(a) Illustration of the design of the bait set. Different types of areas are taken into account for the design: exclusion areas, where no baits will be placed upon; regions of interest, typically genes or other areas to explore in the research questions; and points of interest, typically SNPs. (b) Diagram showing the “on‐target” area. A read was considered “on‐target” if it was located within 350 bp up or downstream of the genomic position of the designed capture bait of 120 bp
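The on‐target rule in panel (b) can be written as a simple interval test. The following is a minimal sketch, not the code used in the projects: the function name, the 0‐based half‐open coordinates and the choice to count any overlap with the window as "on‐target" are assumptions made for illustration.

```python
BAIT_LENGTH = 120  # bait length reported for the three projects (bp)
FLANK = 350        # distance up/downstream of the bait counted as on-target (bp)


def is_on_target(read_start, read_end, bait_start,
                 bait_length=BAIT_LENGTH, flank=FLANK):
    """Return True if a read overlaps the bait or its flanking window.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    window_start = max(0, bait_start - flank)
    window_end = bait_start + bait_length + flank
    # Two half-open intervals overlap when each starts before the other ends.
    return read_start < window_end and read_end > window_start
```

A stricter reading of the caption would require the read to lie entirely inside the window; that variant only changes the final comparison.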

Length and number of baits

The impact of the choice of bait length remains understudied (Glenn & Faircloth, 2016). It is currently unknown what optimal, minimum or maximum sequence length is needed for the bait to capture the desired sequence. In some cases, bait design may only be guided or limited by the choice of sequencing platform and the size of the sequencing reads, as well as the length of the sequence fragment to be captured (Horn, 2012). The CS method captures sequences ranging from a few hundred base pairs (bp) to a few thousand base pairs (kbp), and also allows a relatively high proportion of degenerate sites, in contrast to PCR primers (Glenn & Faircloth, 2016; Horn, 2012). A bait length of 120 bp is generally considered to represent a good balance between cost and efficiency (Glenn & Faircloth, 2016). Therefore, this was the chosen length of bait in our three projects. One can also choose to design bait sets of different lengths, for example for historical and modern samples; Joubran and Cassin‐Sackett (2021) had a separate bait panel for the historical collection with shorter length (100 bp) than for the modern collection (120 bp). Given that some of our samples were historical and hence likely to be degraded (Table S3), we expected the captured DNA fragments to consist primarily of short sequences, and accordingly, we chose a short‐read sequencing platform (Illumina). Fragmentation of extracted DNA to the desired size can be achieved using mechanical or enzymatic techniques (Hale et al., 2020), for example, when working with well‐preserved DNA or modern samples. Coupled with the development of new technologies related to the sequencing of long genomic regions (e.g., PacBio), CS is also evolving towards capturing longer regions (up to 20 kb), in the so‐called region‐specific extraction (RSE) (Dapprich et al., 2016). The number of baits to use will be a trade‐off between different factors related to the budget of the project and the research question in mind. 
Frandsen et al. (2020) used >59,000 baits to have enough power to obtain a high resolution when studying admixture levels of subspecies in the European ex situ population of the chimpanzee (Pan troglodytes), whereas a lower number of baits (8922) were needed to discover enough SNPs to identify the presence of Lynx from samples collected noninvasively in the wild. The chosen number of baits will often depend on the desired mean coverage depth for each sequenced individual for each targeted area, the expected efficiency of the CS approach and the sequencing platform capacity per lane (Grover et al., 2012). When deciding on the total number of baits to aim for, it is important to take into account the expected efficacy of the capture in the species of interest. Although this depends on multiple factors, efficacy will be lower for instance when designing baits based on a distant species (Bragg et al., 2016). CS allows increasing the number of samples at the expense of covering a smaller fraction of the genome, but targeting a sufficient number of SNPs to answer the desired research questions is essential. For example, we did not choose all the SNPs in the published SNP chips of Atlantic cod (Hubert et al., 2010; Moen et al., 2008) or salmon (Karlsson et al., 2011), but only those for which we had hypothesis‐driven questions. For our projects, we designed 20,000 baits for each of the three species; this number was a balance between the cost of baits, the number of samples processed in the laboratory (~1000 samples for SDPAS, ~300 for CODSTORY and ~400 for tiger sharks) and the predefined bait sets offered by the company used to produce the baits (MYBaits, now Arbor Biosciences). 
We first designed baits for targeted regions and previously identified SNPs of interest (between 2 and 5 thousand (K) baits per project) to ensure they were sufficiently covered, and designed the remainder of baits as “random”, targeting non‐coding and putative neutral regions throughout the genome (between 15 and 18 K baits per project; see Table S2). We expected these numbers of baits to generate sufficient SNPs for drawing a multitude of genomic inferences. Simulations show that, in some cases, ~1000 SNPs may be enough to reliably estimate levels of genetic diversity (Nazareno et al., 2017), while as few as ~100 SNPs often suffice to confidently analyse population structure and conduct population assignments (Turakulov & Easteal, 2003). When estimating Ne, a recent study found that ~2,000 random SNPs provided consistent Ne estimates across different missing data levels and minor allele frequency (MAF) thresholds (Marandel et al., 2020). Nevertheless, further work is necessary to estimate the number of SNPs needed for Ne estimation of specific genomic regions in order to account for heterogeneity of Ne across the genome (Jiménez‐Mena et al., 2016; Jiménez‐Mena, Tataru, et al., 2016). Prior knowledge about the expected level of genomic variation (e.g., number of SNPs per Mbp) could serve as a starting point to guide the number of baits to aim for. However, not all baits will capture fragments with a SNP and this will be most pronounced for species with less overall genomic variation.
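The trade‐off between number of baits, number of samples, capture efficiency and sequencing capacity can be explored with back‐of‐the‐envelope arithmetic before committing to a design. The sketch below is only an illustration under simplifying assumptions that are not taken from the projects: samples are pooled equally, all reads are unique, capture is uniform across baits, and the targeted footprint per bait is supplied by the user. The function and parameter names are hypothetical.

```python
def expected_mean_depth(reads_per_lane, read_length, on_target_fraction,
                        n_samples, n_baits, target_bp_per_bait):
    """Rough mean depth per sample per targeted base for one sequencing lane.

    Assumes equal pooling of samples, no duplicate reads and uniform
    capture across baits (all simplifications of this sketch).
    """
    on_target_bases = reads_per_lane * read_length * on_target_fraction
    targeted_bases = n_samples * n_baits * target_bp_per_bait
    return on_target_bases / targeted_bases


# Illustrative values only: 400 million reads of 100 bp, 50% on-target,
# 100 samples per lane, 20,000 baits of 120 bp.
depth = expected_mean_depth(400_000_000, 100, 0.5, 100, 20_000, 120)
print(round(depth, 1))
```

Halving the on‐target fraction or doubling the number of samples per lane halves the expected depth, which is why the expected capture efficacy matters when choosing the number of baits.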

Duplicated regions

Duplicated regions have been highlighted as a drawback of CS as these regions are captured and amplified more often than nonrepeat regions (Ávila‐Arcos et al., 2011), thereby swamping sequencing reads. Accordingly, it has been recommended to design baits for targets outside repetitive areas (Horn, 2012). This may be challenging for species that have experienced genomic duplication events (e.g., Atlantic salmon; Lien et al., 2016) and with different ploidy levels (e.g., strawberries: Kamneva et al., 2017; black cottonwood: Zhou & Holliday, 2012). The same reasoning applies for not targeting both mitochondrial and nuclear genes in the same bait panel, as nuclear‐mitochondrial homologues are abundant (Woischnik & Moraes, 2002). Thus, it is recommended to use available data on repeat regions of the species of interest as well as to apply bioinformatics tools allowing identification of such genomic regions and filtering out baits that fall within those areas. For example, in our projects we excluded repeat regions for the Atlantic salmon bait panels, using the Repeat Library report published with the ICSASG_v2 salmon genome (Lien et al., 2016). For tiger sharks, repeats were excluded by the company who later on produced the bait set (see below, Arbor Biosciences) using the Carcharhiniformes repeat database. For the cod, we used the Repeat Library report published with the gadMor2 genome (Tørresen et al., 2017). Although recommended, it is not always possible to filter out these regions, if there are not any suitable genomic resources. Therefore, researchers should keep this in mind for downstream analyses. 
Another good practice is to double‐check whether baits match multiple genomic regions. This can be achieved by using tools like BLAST (Camacho et al., 2009) and excluding baits that map to the genome more than once (e.g., Sanderson et al., 2020), or by using departures from the expected mean coverage and heterozygosity along the genome to filter out duplicated areas (e.g., Harpe et al., 2019).
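Excluding baits that fall within annotated repeats reduces to an interval‐overlap filter once the repeat annotation is available as coordinates. The following is a minimal sketch of that idea, assuming baits and repeats are given as (chromosome, start, end) tuples in 0‐based half‐open coordinates; the function names and the data layout are hypothetical and do not reproduce the pipeline used for the three species.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def drop_baits_in_repeats(baits, repeats):
    """Keep only baits that do not overlap any repeat on the same chromosome.

    baits, repeats: lists of (chrom, start, end) tuples.
    """
    # Group repeat intervals by chromosome so each bait is only compared
    # with repeats on its own chromosome.
    repeats_by_chrom = {}
    for chrom, start, end in repeats:
        repeats_by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in baits:
        hits = repeats_by_chrom.get(chrom, [])
        if not any(overlaps(start, end, r_start, r_end)
                   for r_start, r_end in hits):
            kept.append((chrom, start, end))
    return kept
```

For genome‐scale annotations a sorted or indexed interval structure would be faster than this linear scan; the simple version is kept here for readability.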

Tiling

Using more than one bait to cover an area of interest (“tiling”) is likely to increase the chances of efficiently capturing sequences from a specific genomic area. Thus, if a given study aims to target a number of genes of particularly high interest, then tiling may be an efficient approach to ensure successful capture and higher coverage. In a comparative study of different exome bait panels that consisted of (i) adjacent, (ii) nonadjacent and (iii) overlapping baits, it was shown that overlapping baits increased the performance of targeted sequence capture. In this case, fewer sequencing reads were needed to obtain a good resolution of the variability of the specific genomic region, thereby increasing the sensitivity of the method (Clark et al., 2011). It is difficult to provide general guidelines on selecting the number of overlapping baits as well as their overlap and density (Clark et al., 2011; Glenn & Faircloth, 2016); however, Cruz‐Dávalos et al. (2017) recommend three‐ to five‐fold tiling densities for enriching degraded DNA libraries including aDNA. Different bait tiling strategies for the various genomic regions covered by the panel can also be applied, in order to ensure that the essential genomic regions of interest are successfully captured. For example, in our case studies, we designed baits with 3× tiling for prioritized regions of interest, in contrast to randomly selected genomic areas (Figure 2). These included SNPs known to be linked to “sea‐age” in the Atlantic salmon (i.e., the number of years a salmon stays at sea before returning to the river to reproduce; Barson et al., 2015), and SNPs related to salinity preference (Berg et al., 2015) and sex determination (Star et al., 2016) for Atlantic cod. By contrast, for some other areas (e.g., “random” areas), the priority was to cover a large number of regions and potentially as many different SNPs as possible, and we thus used only a single bait per region. 
One can also choose to use a homogeneous tiling strategy throughout the bait design; in a study on the population structure and genetic diversity of the ex‐situ population of sable antelope in North America, the tiling strategy was 4× for all the 5000 neutral SNPs targeted (Gooley et al., 2020).

FIGURE 2

Examples of different options of tiling to design baits for a region of interest. (a) Tiling using a given offset distance between baits (e.g., 40 bp), (b) exact tiling (e.g., 3×)
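Offset‐based tiling, as in panel (a), amounts to sliding a fixed‐length window along the region in steps of the offset. The sketch below illustrates this with hypothetical names and 0‐based half‐open coordinates; it is not the supeRbaits implementation.

```python
def tile_with_offset(region_start, region_end, bait_length=120, offset=40):
    """Return (start, end) coordinates of baits tiled across a region.

    Baits are placed every `offset` bp and must fit entirely within the
    region (coordinates are 0-based, half-open).
    """
    last_start = region_end - bait_length
    return [(start, start + bait_length)
            for start in range(region_start, last_start + 1, offset)]


# A 200 bp region with 120 bp baits and a 40 bp offset yields three baits.
print(tile_with_offset(0, 200))
```

With 120 bp baits and a 40 bp offset, bases in the interior of a long region are covered by 120/40 = 3 baits. Reading this as equivalent to 3× tiling is an assumption of this sketch rather than a definition taken from the figure.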

Base composition

The GC content (i.e., the proportion of guanine [G] and cytosine [C] nucleotides in the sequence) has a direct influence on the capture efficiency for targeted exonic regions, with very low and very high GC content regions having negative effects on the efficiency of hybridization (Chilamakuri et al., 2014). Whether GC content also affects capture efficiency outside coding areas (which typically have lower GC content; Fortes et al., 2007; Vinogradov, 2001) has been less studied, but some studies indicate a similar negative effect (see Jones and Good (2016), and references therein). Cruz‐Dávalos et al. (2017) evaluated baits designed along the nuclear genome of the horse and found that increasing GC content (>53%) reduced the number of baits with at least 1 read coverage, as well as the mean coverage. Accordingly, as a rule of thumb, it is generally accepted that GC content of the bait panel should be kept at intermediate levels, avoiding areas with very low (<30%) or very high (>70%) GC content in order to compensate as much as possible for the capture efficiency bias. For our study species, we only used baits with GC content within a range of 40%–55%. In order to facilitate selection of baits within that range, we initially generated a larger number of baits than the desired 20,000 (~5 times more), and of a larger size (200 bp) than the final length of each bait (120 bp), whenever possible. For each of the 200 bp‐sequences, we designed overlapping baits of 120 bp with a 40 bp‐offset between baits, in order to have a broad selection to choose from (Figure 2). This approach allowed us to design baits meeting the GC criteria for almost all of the initial 200 bp‐sequences.
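The selection step described above (cutting each 200 bp sequence into overlapping 120 bp candidates and keeping those within the GC range) can be sketched as follows. The function names are hypothetical, the thresholds are treated as inclusive by assumption, and ambiguous bases such as N are simply counted as non‐GC; this is an illustration rather than the code used for the published panels.

```python
def gc_content(seq):
    """Proportion of G and C bases in a sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def baits_in_gc_range(region_seq, bait_length=120, offset=40,
                      gc_min=0.40, gc_max=0.55):
    """Return (start, bait) pairs for candidate baits within the GC range.

    Candidates are cut from `region_seq` every `offset` bp; thresholds are
    inclusive (an assumption of this sketch).
    """
    kept = []
    for start in range(0, len(region_seq) - bait_length + 1, offset):
        bait = region_seq[start:start + bait_length]
        if gc_min <= gc_content(bait) <= gc_max:
            kept.append((start, bait))
    return kept
```

Generating several candidates per region gives a second or third chance to find a bait inside the GC window when the first candidate falls outside it.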

Other considerations

The thermodynamic properties of the nucleotide sequences can impact the annealing specificity of the designed bait sequence to the desired target (see review by Noguera et al., 2014). The affinity of two sequences can be quantified by measuring the Gibbs free energy change of sequence binding (ΔG). ΔG is used to account for the tendency of both the target and the designed bait to form self‐folding structures that prevent correct binding between them, and to penalize such structures during bait design. Melting temperature (Tm), defined as the temperature at which 50% of the bait sequences are hybridized, should also be considered. In particular, Tm should be relatively homogeneous across baits to allow optimal capture conditions. There are other chemical properties of importance for bait design, and we refer to the work of Cruz‐Dávalos et al. (2017) for more detailed considerations. Finally, baits can be built from RNA or DNA (Horn, 2012), where RNA baits seem to give higher stability than DNA baits when binding to DNA (see Hale et al., 2020). For our three experiments, we exclusively used RNA baits produced by an external manufacturer (Arbor Biosciences). For our case studies, the final preselected bait sets for all three species (see Table S2) were sent to Arbor Biosciences, who provided a review of the chemical properties of the sequences, as well as suggested filtered baits following their own thresholds on Tm and BLAST cutoffs, and when applicable, the options described above for the 200‐bp sequences (see Table S3). After filtering for GC content, Tm and sequence specificity (i.e., a score that characterizes the bait specificity when blasting it against the genome of the target species), we selected the final set consisting of 20,000 sequence baits subsequently produced by Arbor Biosciences (Supporting Information S1). The main guidelines described in this section are summarized in Table 1.
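A quick way to screen a candidate panel for Tm homogeneity is to compute an approximate Tm per bait and inspect the spread. The sketch below uses a widely quoted GC‐based approximation for DNA oligonucleotides longer than about 13 nt, Tm ≈ 64.9 + 41 × (G + C − 16.4) / N. This is a crude stand‐in chosen for illustration: it ignores salt and probe concentrations, was not derived for 120 nt RNA baits, and is not the manufacturer's method. The function names are hypothetical.

```python
def approx_tm(seq):
    """Crude GC-based melting temperature estimate in degrees Celsius.

    Uses Tm = 64.9 + 41 * (G + C - 16.4) / N, an approximation for DNA
    oligonucleotides; treat the result as a relative ranking only.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def tm_spread(baits):
    """Difference between the highest and lowest approximate Tm in a panel."""
    tms = [approx_tm(bait) for bait in baits]
    return max(tms) - min(tms)
```

Because this estimate depends only on GC content and length, a panel of equal‐length baits restricted to a narrow GC range will necessarily show a narrow spread; nearest‐neighbour thermodynamic models would be needed to capture sequence‐order effects.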

TABLE 1

Summary table of the main considerations on the design of baits for population genetics

Consideration	Type	Example
Available genomic resources	Genome; Transcriptome; De novo assemblies; Other (close) species	Atlantic salmon and Atlantic cod; Tiger shark
Question	Neutral vs. adaptive processes; Population substructuring; Estimates of effective population size; Retrospective genomics; Environmental DNA	Coding/non‐coding regions; Anonymous regions of the genome/transcriptome; Neutral areas of the genome; Coding/non‐coding regions, anonymous regions
Type of targeted region	Known SNPs; Genes of interest/quantitative trait loci; Inversions; Neutral areas of the genome	Baits in SNPs (e.g., from SNP‐chips); Randomly allocated baits in genes or regions of interest; Baits in known inversions; Randomly allocated baits
Bait length	~70–200 bp; Up to 20 Kbp	120 bp
GC content	Avoid very low (<30%) or very high (>70%) areas	40%–55%
Tiling	Tiling; Mixed tiling/no tiling; No tiling	Tiling for areas of interest/No tiling for random areas
Other considerations	Sequence binding (ΔG); Melting temperature (Tm); BLAST hits

TESTING THE PERFORMANCE OF BAIT SETS – INSIGHTS FROM THE THREE CASE STUDIES

Before proceeding with the capture of all samples, it is recommended to conduct a capture trial on a subset of individuals. The trial should cover the range of DNA quality (fragmentation) and quantity likely to be experienced throughout the project in order to get a good overview of the performance of the bait set. In our projects, we captured DNA from 20 individuals for each species, of which 10 were contemporary samples and the other 10 were historical or “ancient” samples, that is, with lower concentration and more fragmented DNA. The capture in the laboratory was conducted following Arbor Biosciences guidelines. More information about the samples can be found in Table S3, including type of tissue and year of sampling/catch. Captured libraries were sequenced at two external sequencing facilities: on a HiSeq4000 using 2 × 125 bp paired‐end (PE) reads (tiger shark, salmon), and on a HiSeq X using 2 × 150 bp PE reads (cod). Raw sequencing data were filtered for adaptors and potential bacterial and human contamination, and subsequently mapped back to their respective genomic resources. We filtered for mapping quality and PCR duplicates, and obtained BAM files, which were used for statistical analysis of the bait panels’ performance. Further details on the laboratory DNA extraction, sequencing and bioinformatics filtering are outside the scope of this manuscript, but we followed a procedure similar to that of Manuzzi et al. (2021). We evaluated capture sensitivity (i.e., the percentage of targets covered by at least one mapped read; Jones & Good, 2016), coverage (i.e., mean number of reads per bait) and depth of targeted base pairs in BEDtools (functions intersect and coverage with the -hist option; Quinlan & Hall, 2010; Figure 3). We defined a read as “on‐target” if the read overlapped the bait region (i.e., target area, 120 bp) or a flanking sequence of 350 bp on each side of the bait (Figure 1b), allowing partial overlap in both cases. 
The flanking sequence was included in the “on‐target” area in case a sequencing read had mapped to the ends of a bait, thus extending beyond the bait length. For each of the three species, more than 75% of the baits had at least one read “on‐target”, and all groups presented similar value ranges, except the historical cod (Figure 3a). For all studies, clear differences in efficiency according to the age of the samples were observed. As expected, modern samples had a higher success in the total number of captured target regions (overall mean: 17,920 for modern samples and 16,209 for historical samples), as well as more reads per bait than historical/ancient samples (overall mean: 101 for modern samples and 52.4 for historical samples). As the most extreme case, the samples of historical cod had the widest range of capture efficiency (min: 3,498; max: 17,077 baits capturing), although the median was 14,610 baits, which was similar to the contemporary samples for cod and the other species. A similarly wide range was observed for the mean number of reads per bait, where the tiger shark (both historical and modern) and the modern cod presented the broadest range (threefold) among samples (Figure 3b). Modern and historical samples of salmon displayed relatively little variation in read number among samples, but had the lowest mean values of all six groups captured (with the exception of historical cod). In contrast, the modern cod samples exhibited the largest fraction of the targeted regions captured by the baits (Figure 3c).
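The “on‐target” definition and the two summary metrics can be made concrete with a small Python sketch. It is a simplified stand‐in for the BEDtools workflow, not a reproduction of it: intervals are assumed to be half‐open (scaffold, start, end) tuples, the toy coordinates are hypothetical, and the quadratic loop is only suitable for small examples. The 350 bp flank and the “at least one read” criterion follow the text.

```python
FLANK = 350  # bp added on each side of a bait, as in the text


def on_target_window(bait_start, bait_end, flank=FLANK):
    """Half-open target window: the bait plus a flank on each side (clipped at 0)."""
    return max(0, bait_start - flank), bait_end + flank


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base (partial overlap counts)."""
    return a_start < b_end and b_start < a_end


def capture_summary(baits, reads, flank=FLANK):
    """Return (sensitivity in %, mean reads per bait, per-bait read counts)."""
    counts = []
    for scaffold, b_start, b_end in baits:
        w_start, w_end = on_target_window(b_start, b_end, flank)
        n = sum(1 for r_scaffold, r_start, r_end in reads
                if r_scaffold == scaffold and overlaps(r_start, r_end, w_start, w_end))
        counts.append(n)
    covered = sum(1 for n in counts if n > 0)
    return 100.0 * covered / len(counts), sum(counts) / len(counts), counts


# Hypothetical toy data: three 120 bp baits and five 125 bp reads.
baits = [("s1", 1000, 1120), ("s1", 5000, 5120), ("s2", 200, 320)]
reads = [("s1", 700, 825), ("s1", 1400, 1525), ("s1", 3000, 3125),
         ("s2", 0, 125), ("s2", 5000, 5125)]

sensitivity, mean_reads, counts = capture_summary(baits, reads)
print(counts, round(sensitivity, 1), mean_reads)
```

In this toy example the first two reads on scaffold s1 do not touch the bait itself but still count as on‐target because they fall within the 350 bp flanks.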

FIGURE 3

(a) Number of baits with more than one read on target, per species (x‐axis) and category explored (modern and historical, y‐axis). (b) Mean number of reads per bait, per species (x‐axis) and category explored (modern and historical, y‐axis). Black lines in (a) and (b) correspond to the median of the samples. (c) Cumulative distribution that describes the fraction of targeted bp covered by a certain number of reads (x‐axis, represented by depth); each coloured line represents an individual from each population and category explored.

Our trial runs revealed different degrees of capture efficiency. These included not only differences among species, but also between historical and modern samples, as well as among tissue sources, sample ages and preservation methods, suggesting that the type of sample can have an effect on the success of capture experiments. Similar findings have been reported across the literature (Derkarabetian et al., 2019; Nielsen et al., 2017), further illustrating that capture efficiency is not “one‐measure‐fits‐all” and that capture experiments should be tailored to the species and type of samples at hand, although broader bait sets can also work successfully across large phylogenetic scales (e.g., Hutter et al., 2019). We highly recommend that researchers conduct trial runs before embarking on a full capture study. If capture efficiency is considered too low, hybridization conditions (e.g., temperature and bait/template concentrations) can be modified to optimize it.

THE R‐PACKAGE supeRbaits AND ITS APPLICATION

The considerations for designing baits described in this article have been collated and implemented in the user‐friendly R‐package supeRbaits. supeRbaits consists of a main function, do_baits, that reads genomic information provided by the user to design baits for the species of interest. The only mandatory input is a file containing genomic information, typically a genome or transcriptome in FASTA format (a “database” in supeRbaits terms). If available, other types of genomic information can also be used, for example, previously identified SNPs, regions of particular interest, and areas to exclude (i.e., masked regions). For illustration purposes, we made a short comparison of the time supeRbaits takes to load the genomic resources (Figure 4a,b) and to design different numbers of baits for the three species used in this study (salmon, cod and tiger shark), using the most basic parameters (i.e., do_baits(n = n, size = 120, database)) (Figure 4c). The speed test was performed using an Intel Core i5‐7200U 2.50 GHz with two cores and 8 GB RAM. The databases from these three species differ in size (Atlantic salmon: 3 Gb, Atlantic cod: 613 Mb, tiger shark: 155 Mb) and in number of scaffolds (Atlantic salmon: 232,155, Atlantic cod: 8310, tiger shark: 179,867). As supeRbaits is written in C++, it can effectively handle a variety of data set sizes; however, the larger the data set, the longer it takes to load (Figure 4a). The smaller the data set, the more other factors (e.g., storing length values to a table) start playing a significant role in the total time spent loading, which in turn lowers the kBP/s (Figure 4b). This explains, for example, why the tiger shark database has a lower kBP/s despite being the fastest data set to load. The time that it takes supeRbaits to create baits depends on the number of scaffolds in the database (Figure 4c).

FIGURE 4

Analysis of the speed at which supeRbaits loads different genomic resources and retrieves baits. (a) Total time spent to import each of the three genomic databases (Atlantic cod, Gadus morhua; tiger shark, Galeocerdo cuvier; and Atlantic salmon, Salmo salar). (b) Average kBP counted per second for each of the genomic databases. (c) Total time required to choose bait locations and extract the respective number of baits from the genomic database, tested with basic conditions.

The arguments within the main function of the package (Table 2) allow the user to define how many baits to design, and where they should be placed within the genome. This can be done by specifying the total number of baits, the number/percentage of baits per category of input file, and whether different categories are to be included, for example, known SNPs, genomic regions of particular interest, and masked regions. The tiling can also be specified, including different bait characteristics per input file category (e.g., 2× tiling for known SNP areas, and 3× tiling in regions of particular interest). If genomic regions to exclude are specified, no baits will be placed in those regions. The user can also define the GC content range within a bait, specifying a minimum and maximum allowed content. The output of the package is a set of baits for each specified type. The output also includes different statistics along with the DNA sequence that can be used for follow‐up filtering analyses. If desired, the bait list generated by the do_baits() function can be used as input for further filtering (e.g., based on chemical properties, see Jayaraman et al., 2020), either within supeRbaits, through the blast_baits() function that utilizes BLAST software (Camacho et al., 2009), or with ad hoc scripts or already available software (Markham & Zuker, 2008; Zuker, 2003). 
Other options for further filtering and selecting the final set of baits are online tools such as SciTools Web Tools from IDT or ArrayOligoSelector (Bozdech et al., 2003), simulation programs (Cao et al., 2018), or external providers (e.g., Arbor Biosciences; Roche). Therefore, by following the short pipeline of supeRbaits, large bait sets for population genomics can be generated with the desired bait properties and placement, in a fast and transparent way.
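To make the placement logic described above more tangible, the following Python sketch draws random bait positions along a single sequence, skips excluded regions and enforces a GC range, with a seed for repeatability. It is a conceptual illustration only and does not reproduce the supeRbaits implementation (which is written in C++ and called from R): the function `random_baits`, its `max_tries` parameter and the toy sequence are hypothetical, and the parameter names merely echo the ideas of a bait number, bait size, exclusions, GC range and seed.

```python
import random


def gc_fraction(seq):
    """Proportion of G and C nucleotides in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def in_excluded(start, end, exclusions):
    """True if the half-open interval [start, end) overlaps any excluded interval."""
    return any(start < e_end and e_start < end for e_start, e_end in exclusions)


def random_baits(sequence, n, size=120, exclusions=(), gc=(0.0, 1.0),
                 seed=None, max_tries=10000):
    """Draw up to n distinct random baits that avoid exclusions and respect the GC range."""
    rng = random.Random(seed)
    chosen, used, tries = [], set(), 0
    while len(chosen) < n and tries < max_tries:
        tries += 1
        start = rng.randrange(0, len(sequence) - size + 1)
        end = start + size
        if start in used or in_excluded(start, end, exclusions):
            continue
        bait = sequence[start:end]
        if not gc[0] <= gc_fraction(bait) <= gc[1]:
            continue
        used.add(start)
        chosen.append((start, end, bait))
    return sorted(chosen)


# Hypothetical 2,000 bp toy "scaffold" with one masked region to exclude.
toy = "".join(random.Random(1).choices("ACGT", k=2000))
masked = [(500, 900)]
baits = random_baits(toy, n=5, size=120, exclusions=masked, gc=(0.40, 0.55), seed=42)
for start, end, bait in baits:
    print(start, end, round(gc_fraction(bait), 3))
```

Fixing the seed returns the same bait set on every run, which is what makes a randomly allocated panel repeatable and easy to document.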

TABLE 2

Main arguments of the supeRbaits main function

Argument name	Description
n	Total number of desired baits
size	Length (in bp) of each bait
database	Genomic reference
n_per_seq	Number of baits for each sequence in the database
min_per_seq	Minimum number of baits for each sequence in the database
exclusions	Areas of the database to exclude
regions	Specific areas of the database to include
regions.tiling	Choice of tiling for baits allocated in regions
regions.prop	Proportion of baits allocated in regions
targets	Specific points of the database to include (e.g., SNPs)
targets.tiling	Choice of tiling for baits allocated in targets
targets.prop	Proportion of baits allocated in targets
seed	Seed to be set for a repeatable set of baits
restrict	Areas of the database to restrict the baits to
gc	Desired range of the proportion of G and C nucleotides within the bait area
force	Option to request a very large number of baits to be generated

CONCLUSION

Capture sequencing is a useful, cost‐effective tool to generate thousands of genomic markers for population genomics and conservation studies in non‐model species. However, designing the baits necessary for a capture experiment is challenging, with few resources and guidelines available. Here, we present the first user‐friendly R‐package created specifically for bait design, supeRbaits, as well as a discussion of the main parameters that influence the success of a DNA capture project for population genomics, with both contemporary and historical samples. We show that the method for designing baits that is implemented in supeRbaits facilitates fast, robust and efficient bait design. The three successful examples described here should be seen as proof of concept for the general practical applicability of supeRbaits. Although we did not discuss in great detail all the factors that might influence the success of a capture experiment (e.g., levels of endogenous DNA, quality of samples; but see Cruz‐Dávalos et al., 2017), our guidelines contain key criteria regarding both the overarching experimental setup of a capture‐based study and the specific design of CS bait sets. In conclusion, CS is a powerful approach for spatiotemporal population genetics, providing the flexibility to design panels of baits targeting a high number of specific genomic regions of interest. Bait sets can be adapted specifically to each species and research question, thus enabling researchers to make better use of the resources available. For this purpose, supeRbaits is a fast and versatile tool for facilitating bait design.

BENEFIT‐SHARING STATEMENT

All samples used in this manuscript are in compliance with national laws and the Nagoya Protocol.

AUTHOR CONTRIBUTIONS

Designed the study: Belén Jiménez‐Mena and Einar Eg Nielsen. Performed the research: Belén Jiménez‐Mena, Hugo Flávio, Miguel Ramos, Alice Manuzzi, Romina Henriques, Dorte Meldrup, Janette Edson, Jennifer R. Ovenden, Snæbjörn Pálsson, and Guðbjörg Ásta Ólafsdóttir. Developed supeRbaits: Hugo Flávio, Miguel Ramos, and Belén Jiménez‐Mena. Analysed the data: Belén Jiménez‐Mena, Hugo Flávio, Miguel Ramos, Alice Manuzzi, and Romina Henriques. Wrote the manuscript with suggestions from all authors: Belén Jiménez‐Mena and Einar Eg Nielsen.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

90 in total

1. BaitFisher: A Software Package for Multispecies Target DNA Enrichment Probe Design.

Authors: Christoph Mayer; Manuela Sann; Alexander Donath; Martin Meixner; Lars Podsiadlowski; Ralph S Peters; Malte Petersen; Karen Meusemann; Karsten Liere; Johann-Wolfgang Wägele; Bernhard Misof; Christoph Bleidorn; Michael Ohl; Oliver Niehuis
Journal: Mol Biol Evol Date: 2016-03-23 Impact factor: 16.240

2. Generic genetic differences between farmed and wild Atlantic salmon identified from a 7K SNP-chip.

Authors: Sten Karlsson; Thomas Moen; Sigbjørn Lien; Kevin A Glover; Kjetil Hindar
Journal: Mol Ecol Resour Date: 2011-03 Impact factor: 7.090

3. Comparing diversity levels in environmental samples: DNA sequence capture and metabarcoding approaches using 18S and COI genes.

Authors: Hendrik Giebner; Kathrin Langen; Sarah J Bourlat; Sandra Kukowka; Christoph Mayer; Jonas J Astrin; Bernhard Misof; Vera G Fonseca
Journal: Mol Ecol Resour Date: 2020-06-24 Impact factor: 7.090

4. Microevolution in time and space: SNP analysis of historical DNA reveals dynamic signatures of selection in Atlantic cod.

Authors: Nina O Therkildsen; Jakob Hemmer-Hansen; Thomas D Als; Douglas P Swain; M Joanne Morgan; Edward A Trippel; Stephen R Palumbi; Dorte Meldrup; Einar E Nielsen
Journal: Mol Ecol Date: 2013-03-28 Impact factor: 6.185

5. A beginner's guide to low-coverage whole genome sequencing for population genomics.

Authors: Runyang Nicolas Lou; Arne Jacobs; Aryn P Wilder; Nina Overgaard Therkildsen
Journal: Mol Ecol Date: 2021-08-31 Impact factor: 6.185

6. Exome sequencing generates high quality data in non-target regions.

Authors: Yan Guo; Jirong Long; Jing He; Chung-I Li; Qiuyin Cai; Xiao-Ou Shu; Wei Zheng; Chun Li
Journal: BMC Genomics Date: 2012-05-20 Impact factor: 3.969

7. Performance comparison of exome DNA sequencing technologies.

Authors: Michael J Clark; Rui Chen; Hugo Y K Lam; Konrad J Karczewski; Rong Chen; Ghia Euskirchen; Atul J Butte; Michael Snyder
Journal: Nat Biotechnol Date: 2011-09-25 Impact factor: 68.164

8. Simulating the dynamics of targeted capture sequencing with CapSim.

Authors: Minh Duc Cao; Devika Ganesamoorthy; Chenxi Zhou; Lachlan J M Coin
Journal: Bioinformatics Date: 2018-03-01 Impact factor: 6.937

9. Conservation of biodiversity in the genomics era.

Authors: Megan A Supple; Beth Shapiro
Journal: Genome Biol Date: 2018-09-11 Impact factor: 13.583

10. Temporal genomic contrasts reveal rapid evolutionary responses in an alpine mammal during recent climate change.

Authors: Ke Bi; Tyler Linderoth; Sonal Singhal; Dan Vanderpool; James L Patton; Rasmus Nielsen; Craig Moritz; Jeffrey M Good
Journal: PLoS Genet Date: 2019-05-03 Impact factor: 5.917
