Literature DB >> 23936375

SNP discovery in European anchovy (Engraulis encrasicolus, L) by high-throughput transcriptome and genome sequencing.

Iratxe Montes¹, Darrell Conklin, Aitor Albaina, Simon Creer, Gary R Carvalho, María Santos, Andone Estonba.

Abstract

Increased throughput in sequencing technologies has facilitated the acquisition of detailed genomic information in non-model species. The focus of this research was to discover and validate SNPs derived from the European anchovy (Engraulis encrasicolus) transcriptome, a species with no available reference genome, using next generation sequencing technologies. A cDNA library was constructed from four tissues of ten fish individuals corresponding to three populations of E. encrasicolus, and Roche 454 GS FLX Titanium sequencing yielded 19,367 contigs. Additionally, the European anchovy genome was sequenced for the same ten individuals using an Illumina HiSeq2000. Using a computational pipeline for combining transcriptome and genome information, a total of 18,994 SNPs met the necessary minor allele frequency and depth filters. A series of further stringent filters were applied to identify those SNPs likely to succeed in genotyping assays, and for filtering of those in potential duplicated genome regions. A novel method for detecting potential intron-exon boundaries in areas of putative SNPs has also been applied in silico to improve genotyping success. In all, 2,317 filtered putative transcriptome SNPs suitable for genotyping primer design were identified. From those, a subset of 530 were selected, with the genotyping results showing the highest reported conversion and validation rates (91.3% and 83.2%, respectively) reported to date for a non-model species. This study represents a promising strategy to discover genotypable SNPs in the exome of non-model organisms. The genomic resource generated for E. encrasicolus, both in terms of sequences and novel markers, will be informative for research into this species with applications including traceability studies, population genetic analyses and aquaculture.

Entities: Chemical Disease Mutation Species

Mesh：

Substances：
Genetic Markers

Year: 2013 PMID： 23936375 PMCID： PMC3731364 DOI： 10.1371/journal.pone.0070051

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

European anchovy (Engraulis encrasicolus, L.1758) is a small pelagic teleost with major economic and cultural importance. It has been the focus of numerous ecological and genetic studies and major research efforts have been conducted to understand population dynamics from an ecological point of view. Genetic studies have been focused on population genetic structure [1], [2], phylogeography [3], [4], species traceability [5], [6] and marker discovery [7], [8]. However, despite the economic and ecological importance of this species, there is a deficiency in available genomic information for E. encrasicolus, and this issue has impeded progress in molecular marker development. Genetic markers are important for many applications [9]. More specifically, SNP markers are very informative for population assignment (facilitating the identification of migrants and estimation of current rates of dispersal), for estimates of effective population size (Ne, an important concept in the management of threatened species), and for detecting significant reductions in population size or population bottlenecks (informative in populations which have suffered a collapse) [10]. The unraveling of these questions in the European anchovy is essential, especially for the Bay of Biscay population, which suffered a collapse during last decade. The potential loss of genetic variability, with consequent reduced adaptability, population persistence, and productivity is unknown. The advance in the genetics fields of the European anchovy had reached a plateau until the development of the next generation sequencing (NGS) technologies, which has now made possible the implementation of SNPs as standard genetic markers in non-model species [11]. In a recent non-NGS study, 62 SNPs (including 47 nuclear and 15 mitochondrial) were validated for European anchovy using random cloning and comparative Sanger sequencing [3]. This study provided insights into European anchovy population structure [2], describing ten homogeneous population groups. In some cases, these populations differed from the stocks currently defined for management in the species. For marine exploited species, a better understanding of the population structure is relevant since the chief value of genetic data to management is the identification of demographically independent populations with different patterns of recruitment, mortality and productivity [12]. This traditional method for SNP discovery (cloning and comparative Sanger sequencing) has been used for decades, but it can be expensive and time-consuming. The number of validated (meaning reliably scored and polymorphic) SNPs is still scarce and further SNP discovery would reinforce their application in prospective studies in the fields of population genetics, traceability, aquaculture and conservation. In recent years, next generation sequencing (NGS) technologies have emerged as a cost-effective way to very rapidly generate a large amount of valuable genomic information [13] and for discovering SNPs in non-model organisms [9], increasing throughput and reducing the cost and time involved in SNP development. This is particularly crucial for non-model species with limited or no available genomic resources. For SNP discovery in non-model organisms, a “genome reduction” step may be applied in order to increase genome coverage and reach the deep assemblies of redundant reads required for SNP detection [14]; a practical and popular approach for SNP discovery in non-model species is based on reduction to the transcriptome. Recently, several studies have successfully used this approach for marker discovery in fish species relevant to fishery and aquaculture: catfish [15], lake sturgeon [16], rainbow trout [17], [18], lake whitefish [19], Atlantic cod [20], salmonids [21], [22], hake [23], turbot [24], Atlantic herring [25] and Pacific herring [26]. Indeed, transcriptome sequencing has the advantage of directly identifying expressed genes, which are often the main research focus for population genetics and aquaculture [10]. The use of transcriptome sequencing to identify SNPs presents issues all of which need to be considered in a SNP selection process. For the proper identification and validation of transcriptome derived SNPs it is necessary to filter in silico those markers potentially close to intron-exon boundaries (IEBs), as the proximity of an intron is known to be the main cause of genotyping assay failure due to the inability to handle large PCR amplicons [15]. Therefore, genotyping primers and probes must be designed within a single exon, completely avoiding introns. Many studies of transcriptome derived SNPs discovery in non-model fishes either do not consider the IEB problem [15], [19], [21], [22], [26] or use the standard approach to BLAST the transcriptome contigs against a closely related annotated genome [23], [25] to infer IEB positions. However, for most non-model fish species, the closest related annotated genome with IEB information may be phylogenetically divergent, and therefore most of the contigs using this standard approach end up not having a significant BLAST match. The deficient prediction of IEB ultimately leads to the low SNP validation rate, reported in previous SNP discovery studies for non-model fish species. Finally, duplicated regions must also be filtered because they usually are assembled into same contig due to their high levels of sequence similarity. This might lead to paralogous sequence variants (PSVs: single nucleotide differences between duplicated loci in the genome but invariant at the population or species level [27]) or multisite sequence variants (MSVs: single nucleotide variants with complex characteristics due to polymorphisms within duplicated regions [27]), both of which remain indistinguishable from SNPs in the discovery process. In the current study, we characterized the transcriptome of E. encrasicolus and discovered SNP markers based on the combination of transcriptome and genome information. Importantly, next generation sequences from the European anchovy genome were used to supplement the transcriptome, in a novel strategy designed to avoid IEBs as well as potential repeated regions during SNP genotyping primer design and SNP validation. This sequencing and computational pipeline, which does not require a prior application of genome assembly, has resulted in the highest conversion and validation rates reported to date in a non-model species. The overall objectives of this study are to improve current SNP discovery procedures from transcriptome sequences through a new and accurate IEB prediction pipeline which could be reproducible for another non-model species, and to validate new SNP markers to be applicable in prospective genetic studies in European anchovy.

Materials and Methods

The flowchart in Figure 1 summarizes the methodological process described in this section by showing the main steps of RNA and DNA sequencing, sequence processing, de novo assembly, filtering and mapping, SNP discovery and selection.

Figure 1

Flowchart for Engraulis encrasicolus SNP discovery.

Sample collection

Ten individuals from three genetically divergent populations [2] were collected (Bay of Biscay (BIS1 and BIS2), Mediterranean Sea (MED) and Atlantic (CAD), Figure 2). Brain, gonad, muscle and liver tissues from each fish were immediately conserved in RNAlater after collection and stored at −20°C until further processing.

Figure 2

Map with sampling locations.

Map with sampling locations.

Stars indicate sample locations used for 454 GS FLX and HiSeq2000 sequencing: 1 (BIS2; N = 2) and 2 (BIS1; N = 3) represent sampling points from Bay of Biscay population, 3 (TAR; N = 2) is the sampling point from Mediterranean population and 4 (CAD; N = 3) is the sample from the Atlantic population. Every sampling point (stars and black dots) was used for validation including N = 30 individuals. Apart from locations 1–4, two additional populations were included in this step: 5 (CAN) is the sampling location for Canary Islands population and 6 (NOR) is the sample representing North Sea population. All surveys followed local regulations and guidelines for such research. For Spanish territories, no specific permission is needed for sampling aquatic fauna for scientific objectives, and the European anchovy is not considered a threatened species according to the International Union for Conservation of Nature and Natural Resources (IUCN Red List of Threatened Species, www.iucnredlist.org). For surveying in French territories permission was received from the French Ministère des Affaires Etrangères et Européennes (document no. 1233/DGM/ATT/ENT). Anchovies were collected following fishing without unnecessary suffering of the animals and following usual procedures: samples were obtained as part of faunal surveys with trawl nets; immediately after collection, anchovies are sorted from the bulk of the catch and a sample of 2 kg was selected at random, for which extracted tissues of 30 individuals were stored in ethanol or at −20°C. No experimentation with live animals was performed. No other ethical issues applied to the present research project.

RNA and DNA sequencing

For transcriptome sequencing, total RNA from the ten sampled individuals and four tissues was extracted using Trizol Reagent (Invitrogen) and quantified with Agilent 2100 Bioanalyzer combined with Agilent RNA 6000 Nano chips (Agilent Technologies, Inc.) at the Gene Expression Unit (SGIker) at the University of the Basque Country (UPV/EHU). Isolated RNA (four tissues for ten individuals) was combined in equimolar quantities into a single pool in an attempt to maximize the diversity of transcriptional units sampled and RNA was normalized by Evrogen (Russia) to prevent over-representation of the most common transcripts, using the DSN normalization method [28]. The normalized RNA pool was used for double-stranded (ds) complementary DNA (cDNA) synthesis following the Evrogen CS010–1C protocol using SMART technology [29], and was precipitated as recommended by Evrogen. cDNA libraries for 454 sequencing were prepared from the normalized and digested cDNA pool according to Roche's protocol (cDNA Rapid Library preparation protocol). Finally, 454 sequencing was performed at the Centre for Genomic Research at the University of Liverpool (United Kingdom) on one half of a PicoTiterPlateTM using the 454 GS FLX Titanium System (454 Life Sciences, Branford, CT, USA). All 454 sequence data have been submitted to the NCBI Sequence Read Archive (SRA) under the BioProject accession number PRJNA193183. For the genome sequencing, total DNA from muscle tissue of ten individuals (the same individuals used for transcriptome sequencing) was isolated using NucleoSpin® 96 Tissue Kit (Macherey-Nagel) according to the manufacturer's instructions and DNA quantity and purity were measured with a Nanodrop ND-1000. The Illumina TruSeq® DNA sample preparation kit was used to generate a barcoded genomic library for each individual. The ten barcoded genomic libraries were pooled and sequenced with a 2×100bp paired-end module on 4 lanes of a HiSeq2000 (Illumina). Standard post-processing was applied (adaptor clipping and quality checking) and sequences were de-multiplexed based on the specific barcoding tags used for each individual. The genome sequencing and post-processing was carried out at the Laboratory of Biodiversity and Evolutionary Genomics at the Katholieke Universiteit Leuven (Belgium). All HiSeq2000 sequence data have been submitted to the NCBI Sequence Read Archive (SRA) under the BioProject accession number PRJNA202430.

Sequence processing

To improve the 454 native base-calling error rate, PyroBayes [30] was used to transform the native 454 quality values into the standard Phred64 quality scores [31]. Following this, transcriptome raw reads were trimmed using clean_reads [32] and SnoWhite with the TagDust option [33]. In this trimming process SMART adaptors, PCR primer sequences, and poly(A/T) tails were removed, and a quality and length-based trimming was done according to custom parameters. Trimmed cDNA reads were aligned against the E. encrasicolus mitochondrial genome (NCBI Accession Number: AP009137) to identify and isolate all mitochondrial transcripts using GS Reference Mapper v2.6 (‘Mapper’, 454 Life Sciences) with custom parameters. Additionally, a local BLASTn search on the trimmed dataset was performed to identify ribosomal transcripts (homologous to teleostei ribosomal gene nucleotide sequences) and the SeqClean tool was used to screen out UniVec database contaminating sequences. The European anchovy genome raw reads were trimmed using clean_reads [32] for quality and length-based trimming according to custom decided parameters, as well as for screening out UniVec database contaminating sequences. Additionally, quality was visually checked before and after the trimming process with the FastQC tool.

De novo assembly, filtering and mapping

Transcriptome trimmed sequences were assembled using GS de novo Assembler v2.6 (454 Life Sciences) with the cDNA assembly option, by setting a minimum overlap length of 50 and a minimum overlap similarity of 95%. Reads were re-trimmed with the GS de novo Assembler trimming tool, including in the GS de novo Assembler exclude filter file the previously created file with names from cDNA reads corresponding to mitochondrial, ribosomal or contaminant sequences. The assembly quality was verified by visual examination of a random subset of contigs with the Tablet assembly and alignment visualization tool [34]. A transcriptome reference assembly was created by filtering transcriptome contigs for (1) homologous genes, (2) low-complexity regions and (3) duplicated regions. For the removal of homologous genes (two or more potentially homologous genes which are incorrectly assembled into one contig) and low-complexity regions, gDNA reads were aligned to the transcriptome reference using Bowtie2 [35] and contigs with a disproportionately large number of aligned gDNA reads were removed. For the avoidance of duplicated regions, an additional filter based on the identification of multimap reads was applied. Multimap reads are defined as those gDNA reads that align to multiple positions in the transcriptome reference, and as a consequence can suggest ambiguous or repetitive regions [36]. For detecting multimap reads, an alignment was performed using Bowtie2 (setting k = 2, local alignment mode). Then, contigs containing one or more multimap reads (detected by inspecting the bitwise FLAG string of each read alignment within the SAM file produced by Bowtie2) were defined as potentially duplicated regions and removed for posterior analyses. Finally, Bowtie2 [35] was used to create two mappings to the transcriptome reference for SNP discovery purposes. In the first mapping, called G2T (genome to transcriptome), gDNA reads were aligned to the transcriptome reference, and in the second mapping, called T2T (transcriptome to transcriptome), cDNA trimmed reads were aligned to the filtered transcriptome reference. Both mappings (G2T and T2T) are referenced during the SNP discovery and selection process.

SNP discovery and selection

With the aim of avoiding false SNPs (monomorphic loci) due to sequencing errors, SNP discovery was performed using both T2T and G2T alignments. For SNP discovery and selection we developed a method for automatically extracting SNPs amenable to genotyping from the SAM output files of the T2T and G2T alignments, and the VCF (variant call format) output of the SAMtools package [37]. For SNP discovery, the T2T and G2T alignments were processed with SAMtools bcftools and duplicated reads (arising from errors in the PCR step prior to sequencing) from each dataset (cDNA and gDNA trimmed reads) were removed with rmdup option. Putative SNP discovery was accomplished by filtering all T2T and G2T observed variants in order to reveal only biallelic SNPs (no indels or complex SNPs). To avoid false positives due to sequencing errors (which may therefore be monomorphic loci), only T2T variants with a minimum variant count of 2 high quality (HQ) bases and a minimum site depth of 8 (HQ bases) were called as putative T2T SNPs (no depth filter was applied for a variant count of 3 HQ bases or more). In parallel, from the G2T SNPs only variants with a minimum variant count of 2 HQ bases and a minimum site depth of 20 (HQ bases) were called, but no limitation of depth was required for a variant count of more than 2 HQ bases. Finally, a maximum site depth threshold of 200 for each SNP was applied. This final step was performed also with the aim of avoiding duplicated regions due to homologous genes (PSVs or MSVs). For each filter (T2T and G2T) all SNPs meeting the respective requirements described above were marked and the T2T and then the G2T SNPs sets were joined, resulting in only those SNPs discovered from both approaches. After the SNP discovery step, SNPs were in silico assessed to select several SNP markers for validation with the TaqMan® OpenArrayTM platform (Life Technologies). The selection criteria for SNPs were based on an analysis of putative intron-exon boundaries (IEBs) within each contig and on the compatibility of flanking sequence with the Taqman® method. Initial efforts of IEB finding through a BLAST search of European anchovy contigs against other teleost genomes showed that most contigs had no significant matches to other teleost genomes. Therefore, a novel approach to IEB detection [38] was designed, which can be applied when gDNA reads are available. The basis of this algorithm is the observation that the alignment of genome reads to a transcriptome contig produces distinctive patterns at the areas of IEBs, which emerge as change points: locations where a number reads either start or terminate their local alignment at an internal position (Figure 3). Taking advantage of this property, the algorithm processes the G2T SAM files (using the Perl Bio-Samtools library v1.36), computing a p-value for every change point within a contig. Low p-values indicate an unexpectedly large number of reads supporting the change point, and are suggestive of IEB (for example, for the contig shown in Figure 3 all indicated locations have a computed p-value less than a calibrated discriminative threshold, with no other location below this threshold), and every contig was divided into several sequences (putative exons) using each predicted IEB as a breakpoint. Second, for designing 530 TaqMan® OpenArrayTM SNPs genotyping assays, one sequence was built for each detected SNP. In that sequence, every putative SNP but the target was masked as Ns, including those called in only one approach), which were also masked as Ns in every sequence. We ensured that target sites were not in regions of similarity with any other contig in the European anchovy transcriptome, as identified by a BLASTn search (E-value <10−25). Based on the genotyping technology requirements, we rejected any sequence with less than 30 available bases upstream or downstream of the target site (due to the start/end of the contig or due to a predicted IEB), to comply with the minimum requirement for primer design, and any target site without complete alignment conservation (no variation or sequencing Ns) within +/−5 bases.

Figure 3

Output from the Tablet alignment visualizer [34] showing a G2T alignment for which 4 IEBs (arrows) have been detected (upper part of the display).

The bottom part of the display focuses on the magnified area around the first IEB alignment pattern. See Materials and Methods for further explanation.

Output from the Tablet alignment visualizer [34] showing a G2T alignment for which 4 IEBs (arrows) have been detected (upper part of the display).

The bottom part of the display focuses on the magnified area around the first IEB alignment pattern. See Materials and Methods for further explanation. For TaqMan® OpenArrayTM SNP genotyping assay construction, SNPs were divided by two selection criteria. A first SNP subset was selected based on homology to zebrafish (Danio rerio) exome. For this, conservation of gene structure between European anchovy and zebrafish was assumed, and each contig was assessed using a database of coding sequences for all transcripts from the Ensembl zebrafish genome (Zv9). Anchovy filtered transcriptome contigs were aligned at the protein level using BLASTx (e-value <10−10) to identify a possible unique orthologous zebrafish gene. SNPs were annotated in 4 categories: no homology (no homology found to zebrafish), cSNP (contig with homology and SNP in a coding region), ncSNP (contig with homology and SNP outside of coding region) and tSNP (ambigious cases where a contig has a homology to a coding region and another homology outside a coding region). All cSNPs were selected and further SNPs were randomly selected from the tSNPs list in order to have potentially coding SNPs, interesting for adaptative and evolutionary studies. For the second SNP subset, a preliminary Fst value was calculated for suggesting how well the SNPs may discriminate populations. In this case, individual genotypes for each called SNP were determined by inspecting the VCF output of bcftools. This genotype information was used to calculate a preliminary Fst value [38], [39] for each marker in the 10 individuals from three populations. Markers were sorted from high to low Fst and those SNPs of higher Fst values were selected.

SNP genotyping and validation

A total of 180 samples of E. encrasicolus from five different populations (Bay of Biscay (BIS1 and BIS2), Mediterranean Sea (MED), Atlantic (CAD), Canary Islands (CAN) and North Sea (NSE); (Figure 2) according to [2] were used for genotyping and validating the 530 selected SNPs. From each population 30 individuals were included in the study, although 60 individuals were genotyped from Bay of Biscay (30 individuals from 2 sampling locations, BIS1 and BIS2; Figure 2). DNA extractions were performed from muscle tissue using NucleoSpin® 96 Tissue Kit (Macherey-Nagel) according to manufacturer instructions, and DNA quantity and purity were measured using Nanodrop ND-8000. A submission file with the sequences specifying the target SNPs was created and sent to the Applied Biosystems Assay Service for primer and probe design. Genomic DNA (66 ng per sample) was used as template at the required DNA starting concentration (22 ng/µl). Subsequent reactions for the amplification and detection of the SNPs were carried out following TaqMan® OpenArray™ Genotyping System User Guide at the Sequencing and Genotyping Service (SGIker) of the University of the Basque Country (UPV/EHU). Scoring of individual genotypes was performed using TaqMan® Genotyper software v2.1 (Life Technologies). After default clustering was performed, data was viewed in the scatter plot and genotype calls were reviewed and manually adjusted for producing the final cluster assignments. Based on these assignments, SNPs were classified as no signal (no amplification), disperse (less than 80% of individuals assigned to a cluster), monomorphic (minor allele frequency, MAF <0.01), PSV/MSV (≥99% of all individuals heterozygous), and polymorphic (MAF ≥0.01). Once no signal and disperse SNPs were removed, the remaining SNPs were used for the calculation of the SNP conversion rate, while only polymorphic SNPs were used for the calculation of SNP validation rate and for further analysis of descriptive statistics. For each polymorphic SNP, the genotyping percentage was calculated using Genepop v4.0 [40]. For each polymorphic locus and population, Fisher's exact test was used to test deviations from Hardy-Weinberg equilibrium (HWE) across samples, as implemented in Genepop v4.0 (p-value <0.001). Moreover, in order to identify only independent markers, linkage disequilibrium (LD) was tested with genetics package from R [41]. Then, for each independent marker, the expected heterozygosity (He), observed heterozygosity (Ho), and minor allele frequency (MAF) were calculated using GeneClass2 [42]. The BayeScan 2.1 software [43] was used to identify candidate loci under natural selection. In this test, because multiple comparisons were involved, critical values for the test were adjusted with false discovery rate (FDR) procedure (q-value <0.1) [44].

Microsatellite loci discovery

Transcriptome contigs were independently searched for microsatellite repeats and primers using the software QDD2 [45]. Additionally, repeats and their flanking sequences were BLASTed against the non-redundant section of NCBI (nt) for identifying the taxonomic lineage of the organism of the best hit. The microsatellite set was filtered for those meeting the following criteria: (1) at least five uninterrupted repetitions for di-, tri-, tetra-, penta-, and hexa-nucleotides, (2) a pure motif and (3) a primer design “A”, which means that neither homopolymers nor nanosatellites were allowed in the primer sequence, or in the target microsatellite flanking region.

Gene annotation

We used BLASTx (E-value <10−6) to align the contigs to the manually curated protein database Uniprot/Swissprot [46] using Blast2GO tool [47] against the zebrafish proteome. Blast2GO is an automated tool for the assignment of gene ontology terms to BLAST hits, designed for use with novel sequence data [47]. Assignment of gene ontology terms to contigs with significant BLASTx match was also performed using Blast2GO.

Results

Anchovy transcriptome sequencing yielded 889,772 reads with a length average of 293 bp comprising a total of 244 Mbp. Anchovy genome sequencing produced 1,598,669,378 paired-end sequences of 100 bp for each read. Raw transcriptome reads were trimmed (see Materials and Methods section) resulting in 821,107 reads with a length average of 304 bp comprising 225 Mbp. These trimmed sequences were considered as high-quality (HQ) sequences (92.3% of raw reads). Only nucleotides with Phred “high quality bases” were considered (HQ Phred score >20) [48]. A total of 1,446 mitochondrial transcriptome sequences were identified, matching 93% of the mitochondrial genome reference with a coverage (mean depth per reference base) of 24.51. Additionally, 1,701 reads were found to have a BLASTn hit with teleostei organism ribosomal RNA sequences. Finally, 1,000 reads were removed due to a BLASTn hit to the UniVec database. The exclusion of all these sequences resulted in removing a total of 3,688 reads from the trimmed dataset. Therefore, a completely trimmed dataset of 817,419 (91.9% of raw reads) anchovy nuclear transcriptome reads were obtained for anchovy nuclear transcriptome de novo assembly. Genome sequencing contaminant removal, and length and quality trimming, yielded 1,364,994,151 HQ paired-end sequences (85.4% of initial reads), with read lengths ranging from 61 to 100 bp. The trimming results for each individual are detailed in Table 1. In terms of sequencing success, individual GIR-4 had the highest percentage of valid sequences (93.7%); individual CAD-1 (182,750,832 trimmed sequences) had the highest number of valid sequences, and was also the one with the highest number of sequenced raw reads (211,055,936 sequences). Positions 70 to 100 of genome reads had quality values as low as 0 that were completely removed after the trimming process (Figure S1). After sequence cleaning, read quality values were also lower at the beginning and the end of the sequences, but quality average always stayed between 30 and 40, defined as high-quality region. FastQC tools analyses showed that the trimming process substantially improved the genome sequence dataset, especially in terms of sequence quality, which is essential for the success of the SNP calling process.

Table 1

Sequenced individual, number of sequences obtained from HiSeq2000 sequencing (Raw sequences), number of trimmed sequences, and percentage of valid (or trimmed) sequences.

Individual	Raw sequences	Valid sequences	Valid sequences (%)
BIS2–4	124,890,134	116,962,879	93.65%
BIS2–5	136,203,704	120,970,864	88.82%
BIS1–3	203,056,696	180,588,178	88.93%
BIS1–4	199,138,100	153,625,124	77.15%
BIS1–5	125,842,196	111,469,177	88.58%
TAR-4	166,733,374	144,736,461	86.81%
TAR-6	160,620,594	142,320,609	88.61%
CAD-1	211,055,936	182,750,832	86.59%
CAD-3	181,504,151	160,144,743	84.82%
CAD-5	151,952,366	128,884,436	69.38%
TOTAL	1,598,669,378	1,364,994,151	85.38%

GS de novo Assembler software assembled 657,778 reads (80.5% of trimmed reads) into 24,494 contigs with a N50 of 459, and an average coverage of 15.13. The assembled contigs had an average length of 498bp, comprising more than 180 Mbp. The longest contig was 3,336bp length and 10,095 contigs (41.2%) were longer than 500 bp. The N50 of these large contigs was 766. The total assembled reads length was 12,209,523bp. Reads that were not assembled constituted 143,537 singletons (17.6%), which were excluded for further analyses. The remaining 1.9% of reads was identified by GS de novo Assembler software as repeats, outliers, or reads too short for use in the assembly. The transcriptome reference was filtered for low-complexity regions, spurious contigs, chimeric contigs, and duplicated regions in the genome, yielding a total of 19,367 high-confident consensus sequences. From these, 10,402 contigs containing multimap reads were identified. In the T2T alignment, 593,122 reads successfully mapped, with a coverage of ∼15. For the G2T local alignment, 11,361,696 reads aligned one or more times, with a coverage of ∼78. A summary of SNP discovery and selection statistics is presented in Table 2. From 41,542 T2T biallelic variants, 32,373 remained after filtering based on the minimum variant count and site depth threshold. For G2T biallelic variants, 208,016 were filtered based on the maximum site depth threshold, resulting in 192,129 SNPs. Within the T2T and G2T SNPs sets, 18,994 were common to both. These SNPs were found within 7,426 distinct contigs. Transitions were the most common SNP type, with a ts/tv ratio of 2.31. Regarding IEB avoidance, a total of 4,031 of the 7,426 contigs with a common SNP contained one or more predicted IEB (Table 2), and a total of 14,186 IEBs were identified in these 4,031 contigs (on average 3.52 predicted IEB per contig). These predicted IEBs were annotated and further avoided.

Table 2

Summary statistics of SNP discovery and selection.

			T2T	G2T
Biallelic variants			41,542	208,016
In silico putative SNPs (after filters)			32,373	192,129
Contigs with putative SNPs			13,671	17,406
Total predicted IEB			10,688
	contigs with one or more predicted IEB		4,031
Common SNPs			18,994
	contigs with a common SNP		7,426
	transitions		13,255
	transversions		5,739
	SNPs suitable for TaqMan® OpenArray^TM		2,317
		cSNPs	195
		ncSNPs	423
		tSNPs	274
		no homology	1,425
Selected for validation			530 (100%)
	failed
		disperse	16 (3.0%)
		no signal	30 (5.7%)
	false
		monomorphic	40 (7.5%)
		PSV/MSV	3 (0.6%)
	polymorphic		441 (83.2%)

TaqMan® OpenArrayTM SNPs genotyping system requirements were passed by 2,317 SNPs which appeared within 1,617 contigs. From these, 892 contigs (55.16%) showed homology to the D. rerio proteome: 195 cSNPs (21.86%), 423 ncSNPs (47.42%) and 274 tSNPs (30.72%); for the first SNP subset every cSNP was selected and it was completed with markers annotated as tSNPs. Regarding the SNP selection criteria based on preliminary Fst, the second SNP subset consisted on those SNPs – not coincident with those selected from the zebrafish homology criteria – with the highest Fst values (from 0.31 to 0.83) were selected. The two SNPs subsets led to a total of 530 SNPs for genotyping and validation. The final set of selected and genotyped 530 SNPs is listed in Table S1, and results are shown in Table 2. A total of 484 SNPs amplified and produced clear clusters (16 no signal, 30 disperse); which constitute a conversion rate of 91.3%. From those SNPs, 441 were polymorphic (40 monomorphic, 3 PSV/MSV) resulting in a validation rate of 83.2%. Information for each selected SNP can be found in Table S1. Deviation from HWE for each locus and population after correction for multiple testing revealed 15 markers retaining significant deviation. Linkage disequilibrium (LD) was assessed for each pair of loci meeting HWE and 11 SNPs in LD formed 5 haplotypes. Therefore a total of 420 independent markers meeting HWE were found to be potentially used in prospective studies. The expected heterozygosity (He) of these markers ranged from 0.012 to 0.495, while observed heterozygosity (Ho) fluctuated between 0.007 and 0.550, and MAF values ranged from 0.003 to 0.498. The distribution of SNPs frequencies over the range of MAF categories does not suggest an elevated non-random exclusion of SNPs with low MAF, adequately showing even representation over the entire MAF range (Figure S2). Finally, 31 candidate loci under natural selection were identified; 30 of these markers had positive alpha values, suggesting diversifying selection [43]; and 1 SNP showed a negative value of alpha, suggesting balancing or purifying selection [43] (Table S1). In total, 510 microsatellite markers matching the quality criteria implemented in QDD2 and posterior filters were detected in the European anchovy filtered transcriptome (19,367 high-confidence contigs) (Table 3). The most common motif was the di-nucleotide AC, appearing in 46.9% of detected best microsatellites. The total sequence length of di-, tri-, tetra-, penta- and hexa-nucleotide repeats found in the anchovy transcriptome was 6,692bp, representing approximately 0.1% of the total assembled transcriptome contig sequences.

Table 3

Distribution of microsatellite repeat sizes and lengths.

repeat type	number of repeat units						maximum repeat units	total
	5	6	7	8	9	10
dinucleotide	204	91	37	27	8	5	10	372
trinucleotide	82	22	10	2	0	0	8	116
tetranucleotide	14	3	0	0	0	0	6	17
pentanucleotide	4	0	0	0	0	0	5	4
hexanucleotide	1	0	0	0	0	0	5	1
all	305	116	47	29	8	5		510

Approximately 10% of the detected repeats matched 49 sequences isolated from 12 different species, all of them fishes (Class Actinopterygii) except one amphibian (Xenopus tropicalis), which matched only one microsatellite sequence (e-value of 10−45). All detected microsatellite loci are listed in Table S2. The 19,367 transcriptome contigs reference were annotated using the Blast2GO tool against the zebrafish proteome. The BLASTx analysis resulted in 6,100 sequences with at least one BLASTx hit. This result indicated that 31.5% of E. encrasicolus transcriptome sequences could be annotated with a putative function (E-value <10−6). These sequences were assessed for gene ontology terms (with low MAF, adequately showing even representation over the entire MAF range (Figure S3). The vast majority of genes, within the biological process category, were included in the categories of cellular (18.6%), metabolic (15.9%) and biological regulation (11.1%). Most molecular functions found in this study were related to binding (16.7%) and catalytic activity (10.8%). Finally, regarding the cellular component gene ontology category, the most common components were the very general term cell (19.6%) and organelle (16.5%).

Discussion

This study presents one of the largest combined (transcriptome and genome) sequencing projects for a non-model species and is the most extensive genomic analysis performed on the ecologically and economically important E. encrasicolus. Moreover, we report the highest SNP conversion and validation rates described to date for a non-model species, demonstrating a method for rapid and cost-effective SNP discovery in the exome of non-model organisms. The success on the recovery and validation rates of the SNP markers in the present study relies on the strategies adopted to (1) avoid ascertainment bias, (2) trim and quality filter the transcripts, (3) establish criteria for accurate SNP calling and (4) accurately identify duplicated regions and intron-exon boundaries. Regarding ascertainment bias, it has been previously shown that the deviation towards detecting only SNPs with high or intermediate allele frequencies might be a problem as it influences the precision of estimates related to demographic parameters such as migration or Ne, which could lead to mistaken assumptions about demographic history of the species [49]. In this study, E. encrasicolus individuals from genetically distant populations [2] were sequenced (both in the transcriptome and the genome) as it has been reported that biases related to allele frequencies could be minimized if the individuals selected to discover putative SNPs are geographically, genetically, and phenotypically diverse [50], [51]. In the present study, the obtained MAF values ranged from 0.006 to 0.498 showing an even representation over the entire allele frequency range (Figure S2), which demonstrates the high efficacy of polling genetically heterogeneous samples to avoid ascertainment bias using NGS approaches and the suitability of the markers discovered to study demographic history of populations. It is well known that one of the major challenges for SNP discovery studies in non-model organisms is to achieve a high quality de novo assembly for SNP discovery. To deal with this issue, in the present study, a transcriptome rather than a genome assembly was chosen as the primary substrate for SNP discovery. The chosen strategy has the additional advantage that SNP markers derive directly from exonic regions of the genome, which are especially relevant for fisheries genetics, traceability, adaptative studies, conservation studies and aquaculture applications. For non-model species, high-throughput technologies are currently the most recommended since they generate a massive quantity of sequences and, specifically, the main advantage of the 454-FLX system is the production of longer reads than other sequencing systems, which helps the de novo assembly step. Additionally, performing suitable bioinformatic analyses such as base-call accuracy improvement, trimming, and the assembly itself (as well as its posterior filtering), is also important. This study has carefully taken into account all these issues. First, regarding the reduction of the error rate of base calls and improvement of accuracy in quality scores, PyroBayes algorithm produced a significant improvement in terms of trimming on quality results (data not shown). This correction may have led to a significant reduction in false SNP calls and may have facilitated the assembly [52]. Second, in connection with the trimming step, mitochondrial, ribosomal and contaminant trimming from transcriptome reads was successful since no gene from these categories was assembled, as observed in cellular component Gene Ontology terms from the annotation step. Indeed, the beginning and the end of genome reads had very low quality values prior to the trimming process. Such a feature is common to all high-throughput sequencing methods (see [53] for a review) and additional filtering should be performed to remove low quality, very short, or highly repetitive reads [54]. The use of FastQC trimming tools analyses proved that trimming process substantially improved the genome sequence dataset, especially in terms of sequence quality, which is essential for future SNP calling. These good results may arise from the variety of software used for trimming because, although several tools have been developed for NGS sequencing data trimming (e.g. [32], [33]), each utilizes a different algorithm and has limitations, which are reduced by employing a combination of trimming tools. Finally, with regard to the assembly step, in this study a high quality de novo transcriptome reference has been assembled as is confirmed by the SNP validation results. The high SNP conversion rate (484/530) is a result of the carefully quality controlled assembly, since the sequence for primer and probe design was correctly constructed in at least 91.3% of the cases. The assembly issue remains challenging mainly due to two factors: first, sequencing errors can lead to mismatching sequences between reads that came from the same location in the genome/transcriptome [55]; and second, duplicated and repetitive sequences may cause omissions, or even concatenation of reads that should not be assembled into the same contig [56]. In this study, excessively short reads and sequencing errors have both been avoided through a stringent trimming based on length and quality, as recommended by [54]. Additionally, further filtering was carried out in order to remove low-complexity regions, spurious contigs, chimeras, and duplicated genome regions from the constructed transcriptome reference dataset. The third key factor of this study is the criteria followed for putative SNP calling. One of the most important challenges for SNP discovery is to differentiate sequencing errors from potentially real differences due to polymorphisms [52]. In the absence of a reference sequence, distinguishing true polymorphisms from sequencing errors (false positives) is difficult. Therefore, a highly conservative pipeline for the rigorous avoidance of false SNPs has been performed. For this, two different sequencing technologies (each with its inherent sequencing error type), variant site depth, and the alternative allele count have been taken into account. The use of two technologies has allowed dealing with the two main problems of SNP calling from NGS technologies: (1) obtaining a sufficient read coverage and (2) the avoidance of sequencing errors. In this study, 454 platform has been used for generating a good reference, while Illumina sequencing has compensated for the coverage issue. Second, in terms of sequencing errors, 454 technology has compensated Illumina sequencing errors in order to avoid false SNP discovery. In this study monomorphic markers represented just 7.5% of the putative SNPs (40/530). Compared to other de novo transcriptome sequencing and SNP discovery studies (see Table 4 for examples in fish species), this study has the lowest false SNPs discovery rate, very similar to those reported for species with a close reference genome and it has achieved the highest validation rate (83.2%). The combination of both technologies, joined with the stringent filters applied in each of the two parallel SNP discovery strategies (G2T from Illumina reads, and T2T from 454-FLX reads) is an effective SNP discovery procedure.

Table 4

Approaches to transcriptome SNP discovery and validation in fish species.

Organism and study	Sequences	Putative SNPs	Conversion rate	False SNPs rate	Validation rate	IEB method	Comments
No reference genome
Catfish [15]	Sanger-EST	384	69.3%	28.6%	40.6%	none	no NGS
Lake whitefish [19]		31
Salmon [21]	454	202	40.6%	22.3%	18.3%	none
Sockeye salmon [22]	SOLiD	96	53.1%	41.7%	11.5%	none	RRL*1
Hake [23]	454	944	43.3%	15.9%	27.4%	homology
	GAII*2	684	43.3%	14.0%	29.2%
Atlantic herring [25]	454	1,536	50.7%	13.1%	37.6%	homology
Pacific herring [26]	454	96	47.9%	33.3%	14.6%	none
This study	454 and HiSeq2000	530	91.3%	8.1%	83.2%	read mapping
Draft reference genome
Atlantic cod [59]	Sanger-EST	594	69.0%	15.5%	53.5%	none	no NGS
Atlantic cod [20]	GAII	3,072	74.6%	19.8%	54.8%	none

RRL: Reduced Representation Libraries (method for the selection of a subset of the genome for assembly).

GAII: Genome Analyzer II (Illumina NGS sequencer).

RRL: Reduced Representation Libraries (method for the selection of a subset of the genome for assembly). GAII: Genome Analyzer II (Illumina NGS sequencer). The final essential component of this study is the novel bioinformatic processing for SNP discovery including PSVs/MSVs detection and a solution to the IEB problem in a non-model organism transcriptome sequencing [38]. One of the advantages of sequencing both the genome and transcriptome (which may be the target reference for SNP discovery), is the possibility of optimally exploiting genome information for detecting duplicated regions and intron-exon boundaries. The high failure rate of SNP selection in some projects has previously been attributed to duplicated and repetitive sequences, within PSVs and/or MSVs [57]. This is because paralogs sharing high levels of sequence similarity usually will be assembled into the same contig, and SNPs become indistinguishable; but they do not provide the same information. In the present study, PSVs and MSVs were successfully avoided (0.6% of the putative SNPs). This success is a result of filtering for duplicated genome regions by three different filters applied along the study: (1) transcriptome reference filtering, (2) maximum site depth for variants discovered in the G2T approach, and (3) multimap reads identification (see Materials and Methods). Even prior to genotyping, one clue for the accuracy of our SNP discovery process arises from the observed ts/tv ratio. It is generally assumed for humans that ts/tv ratio is around 3.0 for exonic SNPs and about 2.0 elsewhere in the genome [58]. As ts/tv differs from species to species, the ts/tv ratio of 2.3 observed for exonic SNPs in this study (between 2.0 and 3.0) reflects the high precision of the SNP discovery process. The key to achieving high SNP conversion rates from transcriptome data is the identification of IEBs and their avoidance in primer and probe sequences for posterior marker validation [15]. Until now, IEBs have been either ignored during the marker validation phase, or identified by homology of transcriptome contigs with the sequenced genome of another species through BLAST searches (see Table 4 for references). This strategy has worked well in species with a close reference genome; but it has been very weak for fish species (Table 4) because the only nine fish with complete genome coverage (Danio rerio, Gadus morhua, Gasterosteus aculeatus, Latimeria chalumnae, Oreochromis niloticus, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis, and Xiphophorus maculatus) are too divergent from Engraulis encrasicolus for significant BLAST matches at the nucleotide level. For Atlantic cod transcriptome SNP discovery [20], [59], conversion and validation rates are among the highest reported in fish species, presumably due to the availability of the cod draft genome. In the present study a new IEB detection method [38] was applied and only 16 SNPs have been classified as no signal, which means that only the 3.0% of the putative markers have not amplified. The new approach for IEB detection described in this study provides successful detection of IEB within transcriptome assemblies while bypassing the need to construct an assembled genome, a complicated and time-consuming task. Most studies using high-throughput sequencing technologies for discovering SNPs from non-model fish transcriptomes have obtained conversion rates between 43% and 79%, and validation rates between 12% and 55% (Table 4), which are low values considering the scale, time and money involved in such studies. Indeed, these values reduce to 43–50% and 12–38% (conversion and validation rates, respectively) excluding species with a draft genome or large genomic resources due to decades of genetic research as cod, salmon and catfish. Here, we have significantly improved conversion rates to 91.3%, accompanied by a validation rate of 83.2%. Notably, such high rates conversion and validation are among the top of the range even for studies having access to a close reference genome (77–95% and 66–95%, respectively; e.g. [60], [61]–[63]). Furthermore, our study revealed a very small number and percentage of false positives, representing only 8.11% of the 530 SNPs set (40 monomorphic and 3 PSVs or MSVs). This data reflects an improvement in SNP selection, comparable to results obtained in other non-model fish species where between 13–33% of SNPs were false positives (Table 4). Moreover, based on the validation rates obtained on the 530 tested SNPs subset, the extrapolation of the 83.2% of validation rate obtained on the 530 tested SNPs to the whole 2,317 putative SNPs discovered, would yield around 2,000 SNP validated markers in the exonic regions of Engraulis encrasicolus, a non-model organism (assuming that all 83.2% of the 2,317 putative SNPs would be suitable for genotyping primer design). In all, these percentages reflect the accuracy and effectiveness of the described strategy.

Conclusions

The SNP discovery pipeline described in this paper has identified over 19,000 putative SNPs in E. encrasicolus. The technique is based on a single half-plate run on a 454 GS FLX (Roche) sequencing instrument using titanium chemistry for transcriptome sequencing, and two lanes of a HiSeq2000 (Illumina) instrument for genome sequencing. This approach can be used for rapid, comparatively low-cost SNP discovery and high conversion and validation rates in any non-model organism. While the cost of the method described here is comparable to traditional alternatives for SNP discovery, the approach has the added benefit of detecting a large number of reliable SNPs in non-model organisms. As these loci are derived directly from transcribed sequences, gene function annotation of the discovered markers is possible and markers under selection are expected. The value of SNPs under selection for fisheries management is that these are more informative than neutral ones when aiming for population/origin assignment of individuals. Therefore these SNPs could be informative in studies on adaptation, origin assignment or aquaculture. Regarding European anchovy, the target non-model species of this study, the 441 validated SNPs may be useful in prospective genetic studies for understanding and estimating effective population size and detecting significant reductions in population size or population bottlenecks in populations which have suffered a loss of genetic variability. Since markers linked with genes influencing fitness might generally provide a good indicator of levels of adaptive variation within populations and their potential to respond to changing environmental conditions [14], the new markers reported here could be very informative in terms of conservation studies in the European anchovy. FastQC tool generated quality plot for CAD-1 individual genome sequences before (left) and after (right) contaminants removal, and length and quality trimming. In the plot X axis represents position in the read (bp) from 0 to 100, and Y axis represents quality values in Phred+33 scale (from 0 to 40). (PDF) Click here for additional data file. Minor allele frequencies (MAF) values obtained from 435 independent validated markers (in H-W equilibrium or not). (PDF) Click here for additional data file. Level 2 gene ontology terms, divided in the three categories, and the percentage of genes for each term. (PDF) Click here for additional data file. 530 genotyped SNPs sequences and their descriptive statistics. For each marker following information is provided: NCBI Submitter SNP (ss) accession numbers, reference and alternative alleles, flanking sequence, category, genotyping percentage, HWE, linkage disequilibrium (LD), expected and observed heterozygosities (He and Ho, respectively), minor allele frequency (MAF) and natural selection state. (XLSX) Click here for additional data file. Best microsatellite markers detected in the European anchovy transcriptome. (XLSX) Click here for additional data file.

54 in total

1. QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects.

Authors: Emese Meglécz; Caroline Costedoat; Vincent Dubut; André Gilles; Thibaut Malausa; Nicolas Pech; Jean-François Martin
Journal: Bioinformatics Date: 2009-12-10 Impact factor: 6.937

2. Toward fish and seafood traceability: anchovy species determination in fish products by molecular markers and support through a public domain database.

Authors: Marc Jérôme; Jann Thorsten Martinsohn; Delphine Ortega; Philippe Carreau; Véronique Verrez-Bagnis; Olivier Mouchel
Journal: J Agric Food Chem Date: 2008-05-02 Impact factor: 5.279

Review 3. Next-generation transcriptome assembly.

Authors: Jeffrey A Martin; Zhong Wang
Journal: Nat Rev Genet Date: 2011-09-07 Impact factor: 53.242

4. Simple cDNA normalization using kamchatka crab duplex-specific nuclease.

Authors: Pavel A Zhulidov; Ekaterina A Bogdanova; Alex S Shcheglov; Laura L Vagner; George L Khaspekov; Valery B Kozhemyako; Mikhail V Matz; Ella Meleshkevitch; Leonid L Moroz; Sergey A Lukyanov; Dmitry A Shagin
Journal: Nucleic Acids Res Date: 2004-02-18 Impact factor: 16.971

5. Tablet--next generation sequence assembly visualization.

Authors: Iain Milne; Micha Bayer; Linda Cardle; Paul Shaw; Gordon Stephen; Frank Wright; David Marshall
Journal: Bioinformatics Date: 2009-12-04 Impact factor: 6.937

6. SNP discovery using Next Generation Transcriptomic Sequencing in Atlantic herring (Clupea harengus).

Authors: Sarah J Helyar; Morten T Limborg; Dorte Bekkevold; Massimiliano Babbucci; Jeroen van Houdt; Gregory E Maes; Luca Bargelloni; Rasmus O Nielsen; Martin I Taylor; Rob Ogden; Alessia Cariani; Gary R Carvalho; Frank Panitz
Journal: PLoS One Date: 2012-08-07 Impact factor: 3.240

7. A direct comparison of the KB™ Basecaller and phred for identifying the bases from DNA sequencing using chain termination chemistry.

Authors: Richard W Hyman; Hui Jiang; Marilyn Fukushima; Ronald W Davis
Journal: BMC Res Notes Date: 2010-10-08

8. Next-generation pyrosequencing of gonad transcriptomes in the polyploid lake sturgeon (Acipenser fulvescens): the relative merits of normalization and rarefaction in gene discovery.

Authors: Matthew C Hale; Cory R McCormick; James R Jackson; J Andrew Dewoody
Journal: BMC Genomics Date: 2009-04-29 Impact factor: 3.969

9. Evaluation of next generation sequencing platforms for population targeted sequencing studies.

Authors: Olivier Harismendy; Pauline C Ng; Robert L Strausberg; Xiaoyun Wang; Timothy B Stockwell; Karen Y Beeson; Nicholas J Schork; Sarah S Murray; Eric J Topol; Samuel Levy; Kelly A Frazer
Journal: Genome Biol Date: 2009-03-27 Impact factor: 13.583

10. Quality assessment parameters for EST-derived SNPs from catfish.

Authors: Shaolin Wang; Zhenxia Sha; Tad S Sonstegard; Hong Liu; Peng Xu; Benjaporn Somridhivej; Eric Peatman; Huseyin Kucuktas; Zhanjiang Liu
Journal: BMC Genomics Date: 2008-09-30 Impact factor: 3.969

8 in total

1. Comparative transcriptome analysis of two oysters, Crassostrea gigas and Crassostrea hongkongensis provides insights into adaptation to hypo-osmotic conditions.

Authors: Xuelin Zhao; Hong Yu; Lingfeng Kong; Shikai Liu; Qi Li
Journal: PLoS One Date: 2014-11-04 Impact factor: 3.240

2. Genetic fingerprinting of salmon louse (Lepeophtheirus salmonis) populations in the North-East Atlantic using a random forest classification approach.

Authors: A Jacobs; M De Noia; K Praebel; Ø Kanstad-Hanssen; M Paterno; D Jackson; P McGinnity; A Sturm; K R Elmer; M S Llewellyn
Journal: Sci Rep Date: 2018-01-19 Impact factor: 4.379

3. Large-scale SNP screenings identify markers linked with GCRV resistant traits through transcriptomes of individuals and cell lines in Ctenopharyngodon idella.

Authors: Zhiwei Liao; Quanyuan Wan; Xueying Shang; Jianguo Su
Journal: Sci Rep Date: 2017-04-26 Impact factor: 4.379

4. A novel transcriptome-derived SNPs array for tench (Tinca tinca L.).

Authors: Girish Kumar; Jorge Langa; Iratxe Montes; Darrell Conklin; Martin Kocour; Klaus Kohlmann; Andone Estonba
Journal: PLoS One Date: 2019-03-19 Impact factor: 3.240

5. Authoritative subspecies diagnosis tool for European honey bees based on ancestry informative SNPs.

Authors: Jamal Momeni; Melanie Parejo; Rikke Vingborg; Maria Bouga; Per Kryger; Marina D Meixner; Andone Estonba; Rasmus O Nielsen; Jorge Langa; Iratxe Montes; Laetitia Papoutsis; Leila Farajzadeh; Christian Bendixen; Eliza Căuia; Jean-Daniel Charrière; Mary F Coffey; Cecilia Costa; Raffaele Dall'Olio; Pilar De la Rúa; M Maja Drazic; Janja Filipi; Thomas Galea; Miroljub Golubovski; Ales Gregorc; Karina Grigoryan; Fani Hatjina; Rustem Ilyasov; Evgeniya Ivanova; Irakli Janashia; Irfan Kandemir; Aikaterini Karatasou; Meral Kekecoglu; Nikola Kezic; Enikö Sz Matray; David Mifsud; Rudolf Moosbeckhofer; Alexei G Nikolenko; Alexandros Papachristoforou; Plamen Petrov; M Alice Pinto; Aleksandr V Poskryakov; Aglyam Y Sharipov; Adrian Siceanu; M Ihsan Soysal; Aleksandar Uzunov; Marion Zammit-Mangion
Journal: BMC Genomics Date: 2021-02-03 Impact factor: 3.969

6. Molecular Inversion Probes for targeted resequencing in non-model organisms.

Authors: M Niedzicka; A Fijarczyk; K Dudek; M Stuglik; W Babik
Journal: Sci Rep Date: 2016-04-05 Impact factor: 4.379

7. Transcriptomic SNP discovery for custom genotyping arrays: impacts of sequence data, SNP calling method and genotyping technology on the probability of validation success.

Authors: Emily Humble; Michael A S Thorne; Jaume Forcada; Joseph I Hoffman
Journal: BMC Res Notes Date: 2016-08-26

8. Chloroplast genome analyses and genomic resource development for epilithic sister genera Oresitrophe and Mukdenia (Saxifragaceae), using genome skimming data.

Authors: Luxian Liu; Yuewen Wang; Peizi He; Pan Li; Joongku Lee; Douglas E Soltis; Chengxin Fu
Journal: BMC Genomics Date: 2018-04-04 Impact factor: 3.969

8 in total