Iratxe Montes, Darrell Conklin, Aitor Albaina, Simon Creer, Gary R Carvalho, María Santos, Andone Estonba.
Abstract
Increased throughput in sequencing technologies has facilitated the acquisition of detailed genomic information in non-model species. The focus of this research was to discover and validate SNPs derived from the transcriptome of the European anchovy (Engraulis encrasicolus), a species with no available reference genome, using next generation sequencing technologies. A cDNA library was constructed from four tissues of ten individuals from three populations of E. encrasicolus, and Roche 454 GS FLX Titanium sequencing yielded 19,367 contigs. Additionally, the genomes of the same ten individuals were sequenced using an Illumina HiSeq2000. Using a computational pipeline that combines transcriptome and genome information, a total of 18,994 SNPs met the necessary minor allele frequency and depth filters. A series of further stringent filters was applied to identify those SNPs likely to succeed in genotyping assays and to filter out those in potentially duplicated genome regions. A novel method for detecting potential intron-exon boundaries in areas of putative SNPs was also applied in silico to improve genotyping success. In all, 2,317 filtered putative transcriptome SNPs suitable for genotyping primer design were identified. From those, a subset of 530 was selected, and the genotyping results showed the highest conversion and validation rates (91.3% and 83.2%, respectively) reported to date for a non-model species. This study represents a promising strategy to discover genotypable SNPs in the exome of non-model organisms. The genomic resource generated for E. encrasicolus, both in terms of sequences and novel markers, will be informative for research into this species, with applications including traceability studies, population genetic analyses and aquaculture.
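The abstract names minor allele frequency (MAF) and depth filters but gives no thresholds. As a rough illustration only, the sketch below shows what such a filter could look like for a single candidate site. The function name `passes_maf_and_depth`, the per-site read-count input, and the default thresholds (`min_depth=10`, `min_maf=0.1`) are all hypothetical and are not taken from the study, which may have computed MAF differently (for example, across individuals rather than across reads).

```python
def passes_maf_and_depth(allele_counts, min_depth=10, min_maf=0.1):
    """Return True if a candidate site passes simple depth and MAF filters.

    allele_counts maps each observed base to its read count at the site.
    The thresholds are hypothetical defaults, not values from the study.
    """
    depth = sum(allele_counts.values())
    if depth < min_depth:
        return False  # too few reads to trust the call

    observed = [count for count in allele_counts.values() if count > 0]
    if len(observed) != 2:
        return False  # keep biallelic sites only

    # Minor allele frequency estimated from read counts at this site.
    maf = min(observed) / depth
    return maf >= min_maf


print(passes_maf_and_depth({"A": 12, "G": 8}))   # deep, balanced site
print(passes_maf_and_depth({"A": 19, "G": 1}))   # minor allele too rare
print(passes_maf_and_depth({"A": 3, "G": 2}))    # too shallow
```

In a real pipeline these checks would be applied per site to the transcriptome and genome alignments, with thresholds chosen for the data at hand.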
Year: 2013 PMID: 23936375 PMCID: PMC3731364 DOI: 10.1371/journal.pone.0070051
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Flowchart for Engraulis encrasicolus SNP discovery.
Figure 2. Map with sampling locations.
Stars indicate sample locations used for 454 GS FLX and HiSeq2000 sequencing: 1 (BIS2; N = 2) and 2 (BIS1; N = 3) represent sampling points from Bay of Biscay population, 3 (TAR; N = 2) is the sampling point from Mediterranean population and 4 (CAD; N = 3) is the sample from the Atlantic population. Every sampling point (stars and black dots) was used for validation including N = 30 individuals. Apart from locations 1–4, two additional populations were included in this step: 5 (CAN) is the sampling location for Canary Islands population and 6 (NOR) is the sample representing North Sea population.
Figure 3. Output from the Tablet alignment visualizer [34] showing a G2T alignment for which 4 IEBs (arrows) have been detected (upper part of the display).
The bottom part of the display focuses on the magnified area around the first IEB alignment pattern. See Materials and Methods for further explanation.
Sequenced individual, number of sequences obtained from HiSeq2000 sequencing (Raw sequences), number of trimmed sequences, and percentage of valid (or trimmed) sequences.
| Individual | Raw sequences | Valid sequences | Valid sequences (%) |
| BIS2-4 | 124,890,134 | 116,962,879 | 93.65% |
| BIS2-5 | 136,203,704 | 120,970,864 | 88.82% |
| BIS1-3 | 203,056,696 | 180,588,178 | 88.93% |
| BIS1-4 | 199,138,100 | 153,625,124 | 77.15% |
| BIS1-5 | 125,842,196 | 111,469,177 | 88.58% |
| TAR-4 | 166,733,374 | 144,736,461 | 86.81% |
| TAR-6 | 160,620,594 | 142,320,609 | 88.61% |
| CAD-1 | 211,055,936 | 182,750,832 | 86.59% |
| CAD-3 | 181,504,151 | 160,144,743 | 84.82% |
| CAD-5 | 151,952,366 | 128,884,436 | 69.38% |
| TOTAL | 1,598,669,378 | 1,364,994,151 | 85.38% |
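The last column of the table is presumably the number of valid (trimmed) sequences divided by the number of raw sequences. A minimal sketch of that arithmetic, using two rows copied from the table (the helper name `percent_valid` is made up for illustration):

```python
def percent_valid(raw, valid):
    """Percentage of raw sequences that remain after trimming."""
    return 100.0 * valid / raw


# (raw sequences, valid sequences) copied from the table above.
rows = {
    "BIS2-4": (124_890_134, 116_962_879),
    "TOTAL": (1_598_669_378, 1_364_994_151),
}

for name, (raw, valid) in rows.items():
    print(f"{name}: {percent_valid(raw, valid):.2f}%")
```

For these two rows the computed values agree with the tabulated 93.65% and 85.38%.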
Summary statistics of SNP discovery and selection.
| | T2T | G2T |
| Biallelic variants | 41,542 | 208,016 |
| | 32,373 | 192,129 |
| Contigs with putative SNPs | 13,671 | 17,406 |
| Total predicted IEB | 10,688 | |
| contigs with one or more predicted IEB | 4,031 | |
| Common SNPs | 18,994 | |
| contigs with a common SNP | 7,426 | |
| transitions | 13,255 | |
| transversions | 5,739 | |
| SNPs suitable for TaqMan® OpenArray™ | 2,317 | |
| | 195 | |
| | 423 | |
| | 274 | |
| | 1,425 | |
| Selected for validation | 530 (100%) | |
| failed | | |
| | 16 (3.0%) | |
| | 30 (5.7%) | |
| false | | |
| | 40 (7.5%) | |
| | 3 (0.6%) | |
| | 441 (83.2%) | |
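The conversion, false SNP and validation rates quoted for this study can be recovered from the counts in the table. The sketch below assumes that the conversion rate is the share of selected SNPs whose assays did not fail, and that the 441 SNPs in the last row are the validated ones (their 83.2% matches the validation rate given in the abstract). The function name `genotyping_rates` is hypothetical.

```python
def genotyping_rates(selected, failed, false_snps, validated):
    """Return conversion, false SNP and validation rates as percentages."""
    if failed + false_snps + validated != selected:
        raise ValueError("outcome counts do not add up to the selected SNPs")
    return {
        "conversion": 100.0 * (selected - failed) / selected,
        "false": 100.0 * false_snps / selected,
        "validation": 100.0 * validated / selected,
    }


# Counts from the table: two "failed" rows, two "false" rows, one final row.
rates = genotyping_rates(selected=530, failed=16 + 30,
                         false_snps=40 + 3, validated=441)
for name, value in rates.items():
    print(f"{name}: {value:.1f}%")

# Transitions and transversions add up to the 18,994 common SNPs.
transitions, transversions = 13_255, 5_739
assert transitions + transversions == 18_994
print(f"Ts/Tv ratio: {transitions / transversions:.2f}")
```

Under these assumptions the rates come out at 91.3%, 8.1% and 83.2%, the values reported for this study.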
Distribution of microsatellite repeat sizes and lengths.
| repeat type | number of repeat units: 5 | 6 | 7 | 8 | 9 | 10 | maximum repeat units | total |
| dinucleotide | 204 | 91 | 37 | 27 | 8 | 5 | 10 | 372 |
| trinucleotide | 82 | 22 | 10 | 2 | 0 | 0 | 8 | 116 |
| tetranucleotide | 14 | 3 | 0 | 0 | 0 | 0 | 6 | 17 |
| pentanucleotide | 4 | 0 | 0 | 0 | 0 | 0 | 5 | 4 |
| hexanucleotide | 1 | 0 | 0 | 0 | 0 | 0 | 5 | 1 |
| all | 305 | 116 | 47 | 29 | 8 | 5 | | 510 |
Approaches to transcriptome SNP discovery and validation in fish species.
| Organism and study | Sequences | Putative SNPs | Conversion rate | False SNPs rate | Validation rate | IEB method | Comments |
| | | | | | | | |
| Catfish | Sanger-EST | 384 | 69.3% | 28.6% | 40.6% | none | no NGS |
| Lake whitefish | 31 | | | | | | |
| Salmon | 454 | 202 | 40.6% | 22.3% | 18.3% | none | |
| Sockeye salmon | SOLiD | 96 | 53.1% | 41.7% | 11.5% | none | RRL |
| Hake | 454 | 944 | 43.3% | 15.9% | 27.4% | homology | |
| Hake | GAII | 684 | 43.3% | 14.0% | 29.2% | | |
| Atlantic herring | 454 | 1,536 | 50.7% | 13.1% | 37.6% | homology | |
| Pacific herring | 454 | 96 | 47.9% | 33.3% | 14.6% | none | |
| This study | 454 and HiSeq2000 | 530 | 91.3% | 8.1% | 83.2% | read mapping | |
| | | | | | | | |
| Atlantic cod | Sanger-EST | 594 | 69.0% | 15.5% | 53.5% | none | no NGS |
| Atlantic cod | GAII | 3,072 | 74.6% | 19.8% | 54.8% | none | |
RRL: Reduced Representation Libraries (method for the selection of a subset of the genome for assembly).
GAII: Genome Analyzer II (Illumina NGS sequencer).