Literature DB >> 30891252

Towards population genomics in non-model species with large genomes: a case study of the marine zooplankton Calanus finmarchicus.

Marvin Choquet1, Irina Smolina1, Anusha K S Dhanasiri1, Leocadio Blanco-Bercial2, Martina Kopp1, Alexander Jueterbock1, Arvind Y M Sundaram3, Galice Hoarau1.   

Abstract

Advances in next-generation sequencing technologies and the development of genome-reduced representation protocols have opened the way to genome-wide population studies in non-model species. However, species with large genomes remain challenging, hampering the development of genomic resources for a number of taxa including marine arthropods. Here, we developed a genome-reduced representation method for the ecologically important marine copepod Calanus finmarchicus (haploid genome size of 6.34 Gbp). We optimized a capture enrichment-based protocol based on 2656 single-copy genes, yielding a total of 154 087 high-quality SNPs in C. finmarchicus including 62 372 in common among the three locations tested. The set of capture probes was also successfully applied to the congeneric C. glacialis. Preliminary analyses of these markers revealed similar levels of genetic diversity between the two Calanus species, while populations of C. glacialis showed stronger genetic structure compared to C. finmarchicus. Using this powerful set of markers, we did not detect any evidence of hybridization between C. finmarchicus and C. glacialis. Finally, we propose a shortened version of our protocol, offering a promising solution for population genomics studies in non-model species with large genomes.

Entities:  

Keywords:  Calanus spp.; genome reduced-representation; sequence capture enrichment

Year:  2019        PMID: 30891252      PMCID: PMC6408391          DOI: 10.1098/rsos.180608

Source DB:  PubMed          Journal:  R Soc Open Sci        ISSN: 2054-5703            Impact factor:   2.963


Background

Assessment of population genetic metrics for non-model species, and in particular marine zooplankton, has usually been limited to a small number of loci (mostly mitochondrial DNA) [1,2] that may not reflect genome-wide diversity and differentiation [3]. Recent technological advances in next generation sequencing (NGS) have dramatically increased sequencing throughput, reduced associated costs, and together with the development of bioinformatics tools, have opened the door for population genomics studies in any species [4]. Nevertheless, whole-genome sequencing for many individuals of species with genomes greater than 1 Gb remains hampered by cost and bioinformatics challenges associated with the volume of data generated [4,5]. However, as many biological questions can be answered with only a fraction of the genome, genome reduction sequencing methods have become increasingly popular. These methods include amplicon, transcriptome, restriction digest, and capture enrichment sequencing [6-8]. Such methods, not only allow the analysis of 1000s of single nucleotide polymorphism (SNPs) in many individuals [6], but also usually result in higher coverage per locus, and increased accuracy of polymorphism detection compared to whole-genome sequencing approach [9]. Restriction site-associated DNA sequencing protocols (e.g. RAD-seq, [10]; ddRAD-seq, [11]; 2b-RAD, [12]) appear to be suitable for non-model species, as they allow low-cost genotyping of SNPs throughout the genome without allele-specific expression bias in contrast to RNA-seq, and do not require existing genomic resources nor species-specific reagents [4,6,7]. RAD-seq protocols involve an enzymatic digestion of the DNA followed by the selective sequencing of the fragments flanked by restriction enzymes' recognition sites [10]. The double digest RAD-seq uses a double enzymatic digestion of DNA and allows to adjust the number of fragments to be sequenced via the choice of restriction enzymes and the size selection of digested fragments [11]. Although this method presents several advantages, especially when dealing with species with large genomes, the initial requirements in terms of DNA amount and quality may represent a limiting factor for some organisms, such as small planktonic organisms. Alternatively, sequence capture enrichment, also called targeted resequencing, is a genome-reduced representation protocol that requires only a small amount of DNA for library preparation [13], a great advantage when working with tiny organisms. Different strategies of capture have been developed and are reviewed in Mamanova et al. [14]. Overall, the method consists of capturing specific fragments of the genome by hybridization with probes that contain complementary sequences of the targeted sequences [15,16], followed by NGS. Prior knowledge of the sequences targeted is therefore required in order to design the corresponding capture probe set [8,17]. As this can represent a real challenge in the case of non-model species, alternative strategies have been developed, such as using a transcriptome as reference because it is usually easier to assemble than a genome [18] and particularly in the case of species with large genomes. The capture enrichment method offers valuable advantages such as the possibility to use a set of capture probes developed for one species on closely related species with satisfying performance [19-23]. Capture enrichment approaches have also proven effective on historical and degraded DNA [24-27]. Several studies reported high quality of resulting data, consistent loci coverage and, subsequently, accurate SNP calling, when using a capture enrichment-based protocol for reduced genome representation [15,28-30]. In the present study, we developed a genome-reduced representation protocol to pave the way for population genomics studies in the marine copepod Calanus finmarchicus. This species dominates the mesozooplankton assemblage of the North Atlantic Ocean in terms of biomass [31] and plays an important role in linking lower and higher trophic levels [32]. Despite C. finmarchicus paramount ecological importance, genome-wide studies of the species have been hampered by its large genome (6.34 Gbp haploid; [33]). Its population genetic structure and connectivity have been long-standing subjects of research, reflecting the history of genetic marker development from allozymes [34] and mitochondrial genes [35,36] to microsatellites [37] and a few nuclear SNPs [38]. All studies have suggested high levels of polymorphism and gene flow. However, conclusions have ranged from lack of population genetic structure using six microsatellite loci [37] to a large-scale structure based on 24 SNPs in three nuclear genes [38]. The question of whether there are genetically differentiated populations of C. finmarchicus across the North Atlantic Ocean thus remains open and requires a genome-wide approach. We first applied a ddRAD-seq protocol on pooled Calanus individuals from different locations. This protocol requires a high amount and high quality of DNA to start with, but as the amount of DNA extracted from one individual of Calanus finmarchicus is rather low, due to the body-size of the organism (typically between 2 and 3 mm), pooling several individuals together was the only option. The enzyme pair to be used for the digestion was selected based on the results from an in silico digestion of the very small fraction of the genome sequenced so far (less than 0.5%). Although we would normally expect a small portion of genome to be sufficient for in silico digestion, it seems obvious that the large (and probably duplicated) genome of C. finmarchicus may have altered the success of this approach in selecting an optimal restriction enzyme pair. Indeed, in the SimRAD-based method [39], the correspondence between in silico and actual digested fragments was not evaluated for cases of large duplicated genomes. Thus, the actual digestion of C. finmarchicus’s DNA pools resulted in a very high number of fragments, requiring a costly sequencing effort in order to achieve sufficient coverage for all of them. Therefore, we considered the results of this pilot study not promising enough given the limitations and decided to attempt a different approach. The protocols and results associated with our ddRAD-seq pilot study are available as electronic supplementary material of this paper (supplementary material 1). Next, we decided to focus on a sequence capture enrichment protocol, and we also tested for cross-species capture hybridization on the closely related C. glacialis. The present paper describes the corresponding results. Based on our experience, we propose a simplified method to obtain an informative SNP panel for population genomic studies in non-model species with large genomes.

Material and methods

Samples and DNA extraction

Zooplankton samples were collected from four locations (table 1) by vertical tows between 0 and 200 m depth using WP2 [40] or similar nets with mesh size of 200 µm. Samples were immediately preserved in 95% undenatured ethanol, with subsequent change of ethanol after 24 h. Genomic DNA was extracted individually using the E.Z.N.A. Insect DNA Kit (Omega Bio-Tek) according to the manufacturer's instructions. Calanus species identification was performed for each individual using a set of six nuclear insertion–deletion markers (InDels) in a multiplex PCR following the protocol described in Smolina et al. [41].
Table 1.

Calanus finmarchicus and C. glacialis sample information.

locationmethodspeciesncollection datelat.long.
Barents SeaTranscriptomic captureC. finmarchicus16 Aug 201270.50° N19.99° E
Isfjord (Is)Genomic captureC. finmarchicus85 June 201678.32° N15.15° E
C. glacialis3
Skjerstadfjord (Skj)Genomic captureC. finmarchicus826 Feb 201660.72° N5.10° E
C. glacialis6
Lurefjord (Lure)Genomic captureC. finmarchicus822 June 201667.18° N15.43° E
C. glacialis3
Calanus finmarchicus and C. glacialis sample information.

Development of a genomic reference for Calanus finmarchicus

Probe set design for transcriptome-based sequence capture

So far, no good quality genomic reference is available for C. finmarchicus, but three transcriptomes have been published [42-44]. We used the transcriptome from Lenz et al. [43], which is the most complete currently available, to design a set of probes to capture, sequence and assemble genes of interest into a custom genomic reference. From the transcriptome, we selected all sequences ≥750 bp long (=29 518 sequences), to which we added the 38 unique transcripts known to be involved in thermal stress response of C. finmarchicus [42]. We blasted (blastn in Geneious v. 9.1.8) each of these transcripts against the whole transcriptome and kept only unique sequences in order to reduce false-positive SNPs from paralogous and repeated regions. Then, we trimmed the resulting 18 588 sequences to the first 200 bp, to target the 5′UTR regions, supposedly enriched in SNPs [45]. Our design of 3 717 600 bp in total was then sent to Roche NimbleGen Inc. (Madison, WI) to produce 120-mer sequence-capture probes.

Library preparation and sequence capture

A library was prepared from a single individual of C. finmarchicus from the Barents Sea (table 1), according to the manufacturer's protocol (NimbleGen SeqCap EZ Library SR version 4.2) (see details of library preparation and capture in supplementary material 2). The captured DNA was sequenced on a MiSeq sequencer (Illumina) with the 2 × 300 bp v. 3 chemistry.

Evaluation of the capture efficiency

We mapped the 15 556 070 raw reads obtained from the sequencing to the 29 556 full-length transcriptomic contigs initially used for the capture design using the BWA-MEM (v. 0.7.16) tool in default mode [46]. The fact that only 30.12% of the reads mapped uniquely with high-quality score strengthens the need for a genomic-based reference. Therefore, raw reads were filtered to remove duplicates and low complexity sequences using PRINSEQ (v. 0.20.4) [47] and then assembled using the MaSuRCA assembler (v. 3.2.2) [48] to be used as a reference for a second probe design (figure 1).
Figure 1.

Alternative capture-enrichment based workflows for SNP discovery in non-model species with large genomes.

Alternative capture-enrichment based workflows for SNP discovery in non-model species with large genomes.

Genome-based sequence capture

Probe set design

From the genomic data generated by the previous sequencing, we identified all the transcripts successfully captured and sequenced. To achieve this, we first mapped all the transcriptomic reads available for C. finmarchicus on NCBI (https://www.ncbi.nlm.nih.gov/; Ref: PRJNA236528) to the 29 556 full-length transcriptomic sequences using Bowtie2 (version 2.2.3) [49]. Then, to identify targeted genes that were successfully captured and sequenced, the 33 294 898 RNAseq reads that mapped to the selected transcripts were mapped to our MaSuRCA assembly of genomic data using TopHat RNAseq splice-aware mapper (version 2.1.1) [50]. This resulted in 9 225 593 reads that were mapped to 36 223 contigs. These 36 223 contigs were then blasted (blastn in Geneious v9.1.8) against themselves in order to keep only single-copy genes, resulting in 3500 contigs with only 1 hit (self-hit). We performed the second blast of these 3500 contigs against the full MaSuRCA assembly (generated in previous section), and we selected the 2223 contigs with 1 hit and 433 other contigs with more than one hit but having 97% or more pairwise identity. We finally obtained a total of 2656 contigs with length from 302 to 3029 bp that we trimmed to a maximum length of 1500 bp. The final design of 2656 sequences, representing 2 106 591 bp, was then sent to the MYcroarray MYbaits company (Inc., MI, USA) to produce 80-mer sequence-capture probes.

Library preparation

The second run of capture was performed on a total of 36 individual libraries, including 24 C. finmarchicus individuals from three locations, and 12 C. glacialis individuals from the same three locations (table 1). Libraries were prepared using the NEXTflex™ Rapid Pre-Capture Combo Kit (Bioo Scientific, Austin, TX, USA) (see details in supplementary material 2). Individually indexed libraries were then pooled per species, before proceeding to capture. As C. finmarchicus has an estimated genome size of 6.34 Gbp (haploid), while C. glacialis has an estimated genome size of 11.83 Gbp [33], we reduced the number of libraries of C. glacialis to be pooled for the capture reaction to ensure that similar genome copy numbers are present. The sequence capture was performed for each pool/species according to the MYcroarray MYbaits protocol (http://www.mycroarray.com/pdf/MYbaits-manual-v3.pdf) with a few adjustments (detailed in supplementary material 2) to maximize efficiency. Finally, the two pools were mixed together in equal proportions and sequenced on a NextSeq 550 (Illumina) with the 2 × 150 bp mid-output kit v. 2. The NextSeq sequences were demultiplexed and mapped directly to the MaSuRCA assembly using BWA-MEM (v. 0.7.16) [46]. Only the reads mapping uniquely to the reference, concordantly, and in pairs were kept. Duplicates were removed using Picard tools (http://broadinstitute.github.io/picard), and mapped reads were realigned around InDels using GATK (v. 3.7) [51]. The percentage of high-quality reads mapping back to the reference was more satisfying than previously with 38% of C. finmarchicus reads on average mapping uniquely and without duplicates, and 23% for C. glacialis.

SNP genotyping and application for population genomics

Genomic variation analyses

Variants were called for all individuals of both species together at once using the HaplotypeCaller [52] implemented in GATK (v. 3.7). In order to make accurate estimates of genetic diversity, we forced HaplotypeCaller (GATK) to also output the non-variant sites, together with the variants, in the resulting VCF file. Using VCFtools (v. 0.1.13) [53], we filtered the sites to keep only those with mean depth values (over all individuals) greater than or equal to 5×. Among these, sites with more than 20% of missing data were excluded, which means that we kept only the sites represented in at least 80% of the genotypes. The resulting file was used to estimate nucleotide diversity (π) for each species and location separately. Nucleotide diversity was estimated on a per-site basis and averaged in 780 bp windows (average of contig size distribution) using only the sites that passed the filtering. We reported the mean of π across windows for each population, with VCFtools (v. 0.1.13). Observed heterozygosities (proportion of heterozygous sites) at variant sites were calculated on a per-SNP level in each individual and averaged over all positions present in both species together, using VCFtools (option --het; v. 0.1.13).

Population structure and gene flow analyses

Once more, variants were called for all individuals of both species together at once, using the HaplotypeCaller [52] implemented in GATK (v. 3.7). With GATK and VCFtools (v. 0.1.13) [53], raw variants were hard-filtered for different quality parameters (see details in supplementary material 3), InDels were removed, variants phased and only SNPs covered between 5× and [average + 2*standard deviation]× were kept. Sites present in less than 80% of genotypes were filtered out. SNPs with minor allele frequency less than 0.05 were removed. The numbers of SNPs present in each species, in each location, and shared by both species and among locations were then calculated with BCFtools (v. 1.6). The command line scripts used for data processing are supplied in supplementary material 3. The filtered SNPs were pruned based on linkage disequilibrium (LD) in sliding windows of 50 markers, five markers at a time with a R2 threshold of 0.8. This dataset was used to investigate the potential presence of hybrids between C. finmarchicus and C. glacialis by running ADMIXTURE (v. 1.3.0) [54]. For the next analysis, we re-used the VCF file containing all the filtered SNPs before the pruning, and we split it per species. The two resulting files were then LD-pruned in the same way as in the previous step. The resulting markers were used in two principal component analyses (PCA) (one per species), performed with PLINK (v. 1.9) [55,56]. For calculating the global weighted FST [56] in each species, only one variant site per contig was randomly selected using a PERL script [57], to avoid giving more weight to contigs with more variants (i.e. probably linked variants). Global weighted FST was then calculated in PLINK. Distributions of the FST values were obtained after 1000 iterations of the procedure (therefore different combinations of SNPs from each contig), and median, average and quartiles calculated for each species (supplementary material 3).

Test for selection

Candidate SNP loci under selection were identified using BayeScan (v. 2.1) [58-60] for each species separately from the non-LD-pruned SNPs. The software compares allele frequencies among populations to determine which genetic markers are outliers and thus most likely to be under selection. In complement, we used VCFtools (v. 0.1.13) for calculating a site frequency spectrum of all SNPs per locations and species.

Results

Genome-based capture efficiency

The 36 libraries (table 1) yielded on average 4.3 million reads per individual for C. finmarchicus (N = 24), and 16.8 million reads for C. glacialis (N = 12) (table 2). For C. finmarchicus, an average of 1.6 million reads mapped uniquely to the reference. This represents on average 38% (32.7% to 43%) of the initial number of reads sequenced per individual (table 2). For C. glacialis, 3.8 million reads mapped on average per individual. This represents on average 23% (20.9% to 25.3%) of the initial number of reads sequenced per individual (table 2).
Table 2.

Efficiency of the transcriptome-based and genome-based capture enrichment for Calanus finmarchicus and C. glacialis. Raw reads: total number of sequenced reads used for mapping; % HQ-mapped reads: reads that mapped to a unique site in the genome reference and in proper pairs without duplicates; on-target rate: proportion of reads on target within HQ-mapped reads; global % reads on target: proportion of reads on target extrapolated to the total number of reads sequenced; mean depth of coverage on target: mean depth of coverage of targeted contigs.

individualraw reads% HQ-mapped readson-target rateglobal % reads on targetmean depth of coverage on targetNCBI BioSample accessions
C. finmarchicus—transcriptomic capture
 CfinPC13_pop115 556 07030.12%83.81%25.24%SAMN08924867
C. finmarchicus—genomic capture
 CF_Is_14 181 93839.06%79.17%30.93%83.36SAMN08924868
 CF_Is_24 219 26840.54%79.53%32.24%117.65SAMN08924869
 CF_Is_33 667 22838.55%82.53%31.82%87.71SAMN08924870
 CF_Is_45 119 05640.44%79.98%32.34%106.95SAMN08924871
 CF_Is_55 872 09640.62%77.46%31.46%117.65SAMN08924872
 CF_Is_65 184 25840.28%80.36%32.37%107.92SAMN08924873
 CF_Is_74 678 72043.04%73.70%31.72%93.8SAMN08924874
 CF_Is_82 702 24841.00%76.36%31.31%54.04SAMN08924875
 CF_Lure_172 093 34038.17%70.69%26.98%35.43SAMN08924876
 CF_Lure_181 329 22235.97%78.53%28.25%24.16SAMN08924877
 CF_Lure_193 563 37236.63%79.45%29.10%66.81SAMN08924878
 CF_Lure_202 395 55033.47%76.04%25.45%38.74SAMN08924879
 CF_Lure_213 031 52632.69%74.53%24.36%47.57SAMN08924880
 CF_Lure_222 800 91833.67%72.88%24.54%44.07SAMN08924881
 CF_Lure_231 267 78638.08%76.00%28.94%23.55SAMN08924882
 CF_Lure_243 518 31440.29%74.37%29.96%67.84SAMN08924883
 CF_Skj_333 741 46636.32%69.88%25.38%60.14SAMN08924884
 CF_Skj_343 438 88639.34%72.75%28.62%62.89SAMN08924885
 CF_Skj_353 028 59835.66%75.73%27.01%52.10SAMN08924886
 CF_Skj_369 028 83635.55%71.53%25.43%145.25SAMN08924887
 CF_Skj_378 244 40034.43%72.07%24.82%131.03SAMN08924888
 CF_Skj_386 805 15033.96%73.29%24.89%108.55SAMN08924889
 CF_Skj_396 287 26235.64%66.94%23.86%92.97SAMN08924890
 CF_Skj_407 023 83639.19%62.85%24.63%106SAMN08924891
average4 300 97038%75%28%78.17
C. glacialis—genomic capture
 CG_Is_1013 819 53820.95%56.00%11.73%96.82SAMN08924892
 CG_Is_117 741 98825.34%62.48%15.83%74.29SAMN08924893
 CG_Is_165 230 85224.20%64.09%15.51%49.31SAMN08924894
 CG_Lure_285 132 51825.02%61.66%15.43%47.59SAMN08924895
 CG_Lure_2927 796 63623.88%54.63%13.04%215.97SAMN08924896
 CG_Lure_3220 645 63822.50%57.83%13.01%160.93SAMN08924897
 CG_Skj_4318 412 87021.08%51.11%10.77%115.29SAMN08924898
 CG_Skj_4420 791 73422.84%53.91%12.32%150.80SAMN08924899
 CG_Skj_4520 389 80023.18%58.13%13.48%164.03SAMN08924900
 CG_Skj_4618 812 85022.14%53.55%11.86%131.58SAMN08924901
 CG_Skj_4719 203 88421.96%55.19%12.12%137.95SAMN08924902
 CG_Skj_4823 482 63422.98%49.16%11.30%153.75SAMN08924903
average16 788 41223%56%13%124.86
Efficiency of the transcriptome-based and genome-based capture enrichment for Calanus finmarchicus and C. glacialis. Raw reads: total number of sequenced reads used for mapping; % HQ-mapped reads: reads that mapped to a unique site in the genome reference and in proper pairs without duplicates; on-target rate: proportion of reads on target within HQ-mapped reads; global % reads on target: proportion of reads on target extrapolated to the total number of reads sequenced; mean depth of coverage on target: mean depth of coverage of targeted contigs. After variant calling and hard-filtering, 154 087 SNPs with sufficient coverage were identified for C. finmarchicus, ranging from 95 453 to 108 131 SNPs per location (table 3) and distributed across 4603 contigs (supplementary material 2: electronic supplementary material, figure S3). A total of 62 372 SNPs were in common among all three locations (table 3). For C. glacialis, 121 872 SNPs passed the hard-filtering steps and were sufficiently covered, ranging from 91 923 to 107 752 SNPs per location (table 3). These SNPs were distributed across 5363 contigs (supplementary material 2: electronic supplementary material, figure S3). A total of 80 319 SNPs were in common among all three locations (table 3). Furthermore, 60 452 SNPs were shared between C. finmarchicus and C. glacialis (table 3).
Table 3.

Summary of discovered SNPs using genome-based capture enrichment after hard-filtering, phasing and coverage filtering.

species
C. finmarchicus
C. glacialis
n indiv.total # SNPsn indiv.total # SNPs
location
 Isfjord8104 346391 923
 Skjerstadfjord8108 1316107 752
 Lurefjord895 453398 331
SNPs per species154 087121 872
SNPs in common among three locations62 37280 319
SNPs in common between species60 452
Summary of discovered SNPs using genome-based capture enrichment after hard-filtering, phasing and coverage filtering.

Population genomics results

Genomic variation

After filtering steps, nucleotide diversity estimates were calculated from a total of 316 019 sites (variants and non-variants), for each population in each species. The index π revealed similar levels of genetic diversity between species and among locations (figure 2).
Figure 2.

Nucleotide diversity (π) in each population of Calanus finmarchicus (red) and C. glacialis (blue) estimated from 780 bp non-overlapping windows of variant and non-variant sites. Each box plot notch represents the median. Mean values per location are written in each box.

Nucleotide diversity (π) in each population of Calanus finmarchicus (red) and C. glacialis (blue) estimated from 780 bp non-overlapping windows of variant and non-variant sites. Each box plot notch represents the median. Mean values per location are written in each box. A total of 118 196 variant sites were used for calculating the mean individual observed heterozygosities. The obtained averages were very similar between species and among locations (figure 3), ranging from 0.089 to 0.16 for C. finmarchicus and from 0.1 to 0.147 for C. glacialis.
Figure 3.

Individual heterozygosity levels within Calanus finmarchicus (red) and C. glacialis locations (blue). Mean proportion of heterozygous sites observed per individual.

Individual heterozygosity levels within Calanus finmarchicus (red) and C. glacialis locations (blue). Mean proportion of heterozygous sites observed per individual.

Population structure and gene flow

The ADMIXTURE analysis, based on 37 710 SNPs shows a very clear clustering per species, without apparent gene flow (figure 4).
Figure 4.

ADMIXTURE analysis of SNP markers from co-occurring Calanus finmarchicus and C. glacialis individuals from the same three geographical locations (K = 2). The analysis was performed using a total of 37 710 SNPs. Each group of individuals from the same geographical location are represented by a vertical bar, in red for C. finmarchicus and in blue for C. glacialis. For C. finmarchicus, there are eight individuals per location. For C. glacialis, there are three individuals for the locations CG_Is (Isfjord) and CG_Lure (Lurefjord) and six individuals for CG_Skj (Skjerstadfjord). This plot shows two distinct clusters, in two different colours, corresponding to the two different species. This clear distinction proves there is no hybrid in the dataset.

ADMIXTURE analysis of SNP markers from co-occurring Calanus finmarchicus and C. glacialis individuals from the same three geographical locations (K = 2). The analysis was performed using a total of 37 710 SNPs. Each group of individuals from the same geographical location are represented by a vertical bar, in red for C. finmarchicus and in blue for C. glacialis. For C. finmarchicus, there are eight individuals per location. For C. glacialis, there are three individuals for the locations CG_Is (Isfjord) and CG_Lure (Lurefjord) and six individuals for CG_Skj (Skjerstadfjord). This plot shows two distinct clusters, in two different colours, corresponding to the two different species. This clear distinction proves there is no hybrid in the dataset. The PCA performed for C. finmarchicus, based on 34 449 SNPs, shows no noticeable differentiation among individuals from different locations. Two outliers were identified as individuals from Lurefjord (CF_Lure_18 and CF_Lure_23) (figure 5a). The PCA performed for C. glacialis, based on 17 035 SNPs, shows the differentiation of two groups of individuals, corresponding to the locations of Isfjord and Skjerstadfjord. Individuals from Isfjord are differentiated from the two other locations on the PC1 (11.91%) (figure 5b).
Figure 5.

Principal component analyses (plot of 2 first components) performed with SNP markers from Calanus finmarchicus (a) and C. glacialis (b) individuals from three locations. The 24 individuals of C. finmarchicus are displayed in red colour while the 12 C. glacialis individuals are displayed in blue colour. Each shape represents a distinct location.

Principal component analyses (plot of 2 first components) performed with SNP markers from Calanus finmarchicus (a) and C. glacialis (b) individuals from three locations. The 24 individuals of C. finmarchicus are displayed in red colour while the 12 C. glacialis individuals are displayed in blue colour. Each shape represents a distinct location. Estimation of genetic differentiation (weighted FST) for each species among the same three locations was much higher (about six times higher) for C. glacialis (mean = median, FST = 0.019; 4113 SNPs per iteration) compared to C. finmarchicus (mean = median, FST = 0.003; 4216 SNPs per iteration), and statistically significant in both species (p < 0.001) (figure 6).
Figure 6.

Distribution of the global weighted index of genetic differentiation FST within Calanus finmarchicus and C. glacialis. The distribution of the global weighted FST within each species was calculated after 1000 iterations, selecting one random SNP per contig for all contigs for each iteration. Boxes indicate the first, second (median) and third quartiles, with the average FST indicated by the ‘x’; whiskers show 1.5 times the interquartile range above and below the third and first quartile respectively. Data above or below the whiskers range were considered outliers, indicated as circles. Only two iterations marginally reached values less than 0 for C. finmarchicus.

Distribution of the global weighted index of genetic differentiation FST within Calanus finmarchicus and C. glacialis. The distribution of the global weighted FST within each species was calculated after 1000 iterations, selecting one random SNP per contig for all contigs for each iteration. Boxes indicate the first, second (median) and third quartiles, with the average FST indicated by the ‘x’; whiskers show 1.5 times the interquartile range above and below the third and first quartile respectively. Data above or below the whiskers range were considered outliers, indicated as circles. Only two iterations marginally reached values less than 0 for C. finmarchicus.

Selection

Test for SNP loci under selection using BayeScan revealed no loci under recent and strong positive selection in C. finmarchicus out of 46 544 SNPs analysed (figure 7a). In C. glacialis, three loci out of 49 742 (0.006%) are likely to be under recent and strong positive selection (figure 7b).
Figure 7.

Identification of SNPs under recent and/or strong positive selection with BayeScan in Calanus finmarchicus (a) and C. glacialis (b). Each locus's FST value is plotted against the log10 of the corresponding q-value (false discovery rate (FDR), analogue of the p-value). The vertical bar indicates the threshold for FDR = 0.05 value used to identify outlier SNPs, represented on the right side of the bar.

Identification of SNPs under recent and/or strong positive selection with BayeScan in Calanus finmarchicus (a) and C. glacialis (b). Each locus's FST value is plotted against the log10 of the corresponding q-value (false discovery rate (FDR), analogue of the p-value). The vertical bar indicates the threshold for FDR = 0.05 value used to identify outlier SNPs, represented on the right side of the bar. The site frequency spectrum revealed no apparent selection in either species (supplementary material 2, electronic supplementary material, figures S4 and S5); however, the low number of individuals should be taken into account when drawing conclusions from the site frequency spectrum diagrams.

Discussion

Zooplankton organisms represent a key link in marine food webs and play a crucial role in marine ecosystems. They are often used as beacons of climate changes, therefore understanding their population structure and genetic connectivity is critical. However, this task may be challenging, as gene flow can be high in zooplankton species and often results in subtle patterns of genetic structure not necessarily detectable with only a few markers [61,62], thus requiring a genomic approach [63]. So far, technical difficulties linked to the large genome sizes of many of these organisms, particularly in the Arthropoda phylum, have hampered population genomics studies (reviewed by [64]). In the present study, our aim was to identify an efficient genome reduction method to obtain a sufficiently large number of SNPs to conduct robust population structure studies on Calanus finmarchicus. Our results suggest that a sequence capture protocol may be the easiest and most effective way to deal with non-model species with large genomes, especially when it comes to small-sized organisms. Indeed, our optimized protocol yielded more than 154 000 SNP markers for the targeted species. This number represents on average seven times more high-quality SNPs than what we obtained with our ddRAD-seq tentative approach for a comparable sequencing effort (supplementary material 1). Furthermore, the capture-based protocol yielded 70 times more contigs bearing SNPs (on average), thus resulting in a higher number of unlinked loci. One of the main challenges with the RAD-seq method was the DNA requirement, forcing us to pool individuals due to the limited amount of DNA available per individual. This is clearly an advantage of capture enrichment protocols [13], as a very small amount (less than 10 ng) or even partially degraded DNA can be used [27]. Sequence capture was also very successful for the congeneric species C. glacialis, with ca 122 000 SNPs identified. Besides, the physical proximity of many of the SNPs identified with sequence capture (4603 contigs for C. finmarchicus; 5363 contigs for C. glacialis) opens up the possibility to infer the precise sequence (phase) of alleles on each homologous copy of a chromosome [65,66]. Such phased haplotype can then be used to infer ancestry and demographic history [67] or to detect selection [68]. Although transcriptome-based capture sequencing can be successful (e.g. [18,69]), it typically requires a reference genome of a closely related species to identify intron-exon boundaries. Absence of such genomic information for most of zooplankton species (reviewed in Bucklin et al. [64]) and limited success of transcriptome-based capture of Calanus exemplified in the present study, suggest that the two-step capture protocol we used, offers a good compromise (figure 1). Moreover, with the constant reduction of sequencing costs, this method can be further simplified by generating genomic reference data directly by shotgun sequencing and aligning genomic and transcriptomic sequences in order to target mainly genic or anonymous intergenic regions, depending on the purpose of the study. The corresponding shortened workflow, illustrated in figure 1, has been tested on another zooplankton species, the pteropod Limacina bulimoides, and preliminary results are promising [70]. The vast majority (greater than 99.99%) of the SNPs identified with the capture-based protocol did not show any sign of recent and/or strong positive selection. Assessment of genetic diversity and heterozygosity levels revealed very similar results between the two species of Calanus (figures 2 and 3). Although the levels of genomic variation are comparable between the two species, the PCAs show contrasting preliminary patterns of genetic structure within the two species. Indeed, there is higher inter-individual variation in C. glacialis and also higher inter-location differentiation than for C. finmarchicus. Individuals of C. finmarchicus appear genetically close to one another independently of their geographical origin, except for two outliers, CF_Lure_18 and CF_Lure_23, both from the Lurefjord location. Their position in the PCA can easily be explained by the relative lack of usable data for these two individuals, as they have the lowest numbers of raw reads and lowest values of mean depth of coverage among all individuals sequenced (table 2). Samples from the Lurefjord (southern Norway) were collected in June, when temperatures were high, and samples may have suffered from the summer conditions before they could be appropriately stored at cold temperature. This could have led to some degradation of copepods' DNA resulting in lower success of sequencing. In contrast to C. finmarchicus, gene flow among C. glacialis locations seems more limited. In particular, and interestingly, the Isfjord (Svalbard) appears genetically well differentiated from the two other locations on the first axis of the PCA (PC1: approx. 12%). Individuals from the Skjerstadfjord are clustered closely together, while the individuals from the Lurefjord are more distanced from one another. Both FST and PCAs suggest a more recent and stronger genetic structure for C. glacialis compared to C. finmarchicus populations. It is important to keep in mind though, that we have been sampling C. glacialis genome using C. finmarchicus originated probes. Consequently, we may have missed some naturally more variable regions in C. glacialis by capturing mostly regions conserved enough between species to be recognized by the capture probes; another possibility is that due to the lower number of individuals we would be missing more variants (especially those less frequent) in C. glacialis (see fig. 2 in [71]). These are cases of ascertainment bias [71,72]. Only the investigation of more individuals from the entire distribution range of the species will help to evaluate the significance of this effect. Nonetheless, the obtained results are in line with microsatellite data validating the usefulness of the SNPs. Indeed, Choquet et al. [73] reported a global FST 7.5 times higher for C. glacialis populations compared to C. finmarchicus for the same three locations. The SNPs dataset shows a 6× difference between the two species, but higher precision is expected given the number of markers (4000 SNPs versus six microsatellite loci) [63]. However, the present study focused at developing a suitable method for investigating genetic connectivity in a non-model species with a large genome and is limited by the sampling size. A larger sampling scale is required to understand the population structure of the different Calanus species. Population genomics studies of marine zooplankton have been very scarce [64]. In the copepod Centropages typicus, a 2b-RAD-seq approach yielding 675 SNPs revealed genetic structure between the northwest and the northeast Atlantic Ocean [61], which was in contrast with results from a previous study based on COI and ITS markers [74]. Another study used RAD-seq on the Antarctic krill Euphausia superba [75], and found no population structure across the whole Southern Ocean. However, the authors reported on the many challenges they went through by using RAD-seq on a very large and complex genome (ca 24 Gbp haploid) with no primary reference available, particularly due to the fact that most of the markers they discovered were from multicopy genomic regions and had to be removed from downstream analyses [75]. Finally, the obtained SNP set that is shared by C. finmarchicus and C. glacialis represents a very powerful tool to investigate the potential for hybridization and introgression between the two species. Indeed, using microsatellites, the presence of hybrids between C. finmarchicus and C. glacialis has been suggested at the Canadian east coast [76]. However, the genotyping of more than 8000 individuals using six co-dominant nuclear InDel markers developed from both species never detected any hybrids, in any of the 85 locations investigated in the North-Atlantic and Arctic Ocean [73,77]. The change of scale, between a few markers (six InDels or 10 microsatellites), and tens of thousands of markers described here, is considerable, and from the limited dataset obtained in the present study, there is no indication of inter-species hybridization. However, this question needs to be addressed further using samples from the two species' entire distribution ranges, and the presently identified set of genome-wide SNPs will be a powerful instrument in this pursuit.
  11 in total

1.  Evaluating the monophyly of Mammillaria series Supertextae (Cactaceae).

Authors:  Cristian R Cervantes; Silvia Hinojosa-Alvarez; Ana Wegier; Ulises Rosas; Salvador Arias
Journal:  PhytoKeys       Date:  2021-04-28       Impact factor: 1.635

2.  Daily transcriptomes of the copepod Calanus finmarchicus during the summer solstice at high Arctic latitudes.

Authors:  Laura Payton; Céline Noirot; Claire Hoede; Lukas Hüppe; Kim Last; David Wilcockson; Elizaveta A Ershova; Sophie Valière; Bettina Meyer
Journal:  Sci Data       Date:  2020-11-24       Impact factor: 6.444

3.  Widely rhythmic transcriptome in Calanus finmarchicus during the high Arctic summer solstice period.

Authors:  Laura Payton; Lukas Hüppe; Céline Noirot; Claire Hoede; Kim S Last; David Wilcockson; Elizaveta Ershova; Sophie Valière; Bettina Meyer
Journal:  iScience       Date:  2020-12-11

4.  Insights into the species evolution of Calanus copepods in the northern seas revealed by de novo transcriptome sequencing.

Authors:  Apollo Marco Lizano; Irina Smolina; Marvin Choquet; Martina Kopp; Galice Hoarau
Journal:  Ecol Evol       Date:  2022-02-22       Impact factor: 2.912

5.  Annual transcriptome of a key zooplankton species, the copepod Calanus finmarchicus.

Authors:  Laura Payton; Céline Noirot; Kim S Last; Jordan Grigor; Lukas Hüppe; David V P Conway; Mona Dannemeyer; Amandine Suin; Bettina Meyer
Journal:  Ecol Evol       Date:  2022-02-22       Impact factor: 2.912

6.  Generalizing Bayesian phylogenetics to infer shared evolutionary events.

Authors:  Jamie R Oaks; Perry L Wood; Cameron D Siler; Rafe M Brown
Journal:  Proc Natl Acad Sci U S A       Date:  2022-07-15       Impact factor: 12.779

7.  Pseudoreplication in genomic-scale data sets.

Authors:  Robin S Waples; Ryan K Waples; Eric J Ward
Journal:  Mol Ecol Resour       Date:  2021-09-07       Impact factor: 8.678

8.  Taxon-specific or universal? Using target capture to study the evolutionary history of rapid radiations.

Authors:  Gil Yardeni; Juan Viruel; Margot Paris; Jaqueline Hess; Clara Groot Crego; Marylaure de La Harpe; Norma Rivera; Michael H J Barfuss; Walter Till; Valeria Guzmán-Jacob; Thorsten Krömer; Christian Lexer; Ovidiu Paun; Thibault Leroy
Journal:  Mol Ecol Resour       Date:  2021-10-10       Impact factor: 8.678

9.  Novel genomic resources for shelled pteropods: a draft genome and target capture probes for Limacina bulimoides, tested for cross-species relevance.

Authors:  Le Qin Choo; Thijs M P Bal; Marvin Choquet; Irina Smolina; Paula Ramos-Silva; Ferdinand Marlétaz; Martina Kopp; Galice Hoarau; Katja T C A Peijnenburg
Journal:  BMC Genomics       Date:  2020-01-03       Impact factor: 3.969

10.  Oceanic dispersal barriers in a holoplanktonic gastropod.

Authors:  Le Qin Choo; Thijs M P Bal; Erica Goetze; Katja T C A Peijnenburg
Journal:  J Evol Biol       Date:  2020-11-21       Impact factor: 2.411

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.