Shikai Liu, Zunchun Zhou, Jianguo Lu, Fanyue Sun, Shaolin Wang, Hong Liu, Yanliang Jiang, Huseyin Kucuktas, Ludmilla Kaltenboeck, Eric Peatman, Zhanjiang Liu.
Abstract
BACKGROUND: Single nucleotide polymorphisms (SNPs) have become the marker of choice for genome-wide association studies. In order to provide the best genome coverage for the analysis of performance and production traits, a large number of relatively evenly distributed SNPs are needed. Gene-associated SNPs may fulfill these requirements of large numbers and genome-wide distribution. In addition, gene-associated SNPs could themselves be causative SNPs for traits. The objective of this project was to identify large numbers of gene-associated SNPs using high-throughput next-generation sequencing.
Year: 2011 PMID: 21255432 PMCID: PMC3033819 DOI: 10.1186/1471-2164-12-53
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Summary of Illumina expressed short reads production and filtration
| Catfish species | No. of tissues | No. of fish | Sequencer | Sequence length* | No. of reads (millions) | No. of bases (Gb) | No. of reads after filtration (millions) | No. of bases after filtration (Gb) |
|---|---|---|---|---|---|---|---|---|
| Channel | 11 | 47 | Illumina GA-II | 36 bp | 48.6 | 1.8 | 47.2 | 1.7 |
| | | | HiSeq 2000 | 100 bp | 173.9 | 17.4 | 171.6 | 13.9 |
| Blue | 11 | 19 | Illumina GA-II | 36 bp | 66.9 | 2.3 | 62.1 | 2.2 |
| | | | HiSeq 2000 | 100 bp | 216.6 | 21.7 | 212.5 | 17.4 |
| Total | - | - | - | - | 506.0 | 43.2 | 493.4 | 35.2 |
Eleven tissues were used for RNA preparation including brain, gill, head kidney, intestine, liver, muscle, skin, spleen, stomach, heart, and trunk kidney. *Paired-end reads were generated in different lengths of either 36 bp or 100 bp as a result of different sequencers, Illumina GA-II or HiSeq 2000.
Summary of reference assembly of expressed short reads of channel catfish and blue catfish
| Catfish species | No. of reads used for assembly | No. of reads assembled | % sequences assembled | No. of contigs | Average contig length (bp) | Average contig size* | Average coverage# |
|---|---|---|---|---|---|---|---|
| Channel | 218.8 × 10⁶ | 152.6 × 10⁶ | 69.8% | 103,650 | 670 | 1,473 | 137.4 |
| Blue | 274.6 × 10⁶ | 183.8 × 10⁶ | 66.7% | 104,475 | 775 | 1,760 | 164.2 |
*Number of reads per contig. #Total number of assembled read bases/Total number of bases in consensus sequence.
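The footnote definitions can be checked against the table with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical and are not part of the authors' pipeline, and the inputs are the rounded values from the channel catfish row, so results agree with the table only up to rounding.

```python
def assembly_stats(reads_used, reads_assembled, n_contigs):
    """Return (% of reads assembled, average contig size in reads per contig)."""
    pct_assembled = 100.0 * reads_assembled / reads_used
    avg_contig_size = reads_assembled / n_contigs
    return pct_assembled, avg_contig_size


def average_coverage(assembled_read_bases, consensus_bases):
    """Coverage as defined in the footnote:
    total assembled read bases / total bases in the consensus sequence."""
    return assembled_read_bases / consensus_bases


# Channel catfish row of the reference assembly table (rounded inputs).
pct, size = assembly_stats(218.8e6, 152.6e6, 103_650)
print(f"{pct:.1f}% of reads assembled, about {size:.0f} reads per contig")
```

The coverage column cannot be recomputed from the table alone, because the total number of assembled read bases and consensus bases is not listed.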
Summary of de novo assembly of the unassembled expressed short reads from reference assembly of channel catfish and blue catfish
| Catfish species | No. of reads used for assembly | No. of reads assembled | % sequences assembled | No. of contigs | Average contig length (bp) | Average contig size* | Average coverage# |
|---|---|---|---|---|---|---|---|
| Channel | 66.2 × 10⁶ | 46.8 × 10⁶ | 70.7% | 420,165 | 298 | 111 | 19.7 |
| Blue | 90.8 × 10⁶ | 64.3 × 10⁶ | 70.8% | 420,953 | 315 | 153 | 26.4 |
All the newly generated expressed short reads were first assembled using reference assembly (Table 2), and those that were not assembled, i.e., they did not align in silico to the existing catfish ESTs, were used for the de novo assembly. *Number of reads per contig. #Total number of assembled read bases/Total number of bases in consensus sequence.
Summary of assembly of all catfish expressed short reads
| Assembly | No. of reads used for assembly | No. of reads assembled | % sequences assembled | No. of contigs | Avg. contig length | Max length | No. of large contigs (>1 kb) | Average contig size* | Average coverage# |
|---|---|---|---|---|---|---|---|---|---|
| Reference¹ | 493.4 × 10⁶ | 336.0 × 10⁶ | 68.1% | 104,870 | 686 | 6,849 | 17,756 | 3,204 | 330.8 |
| De novo² | 157.4 × 10⁶ | 107.2 × 10⁶ | 68.2% | 421,229 | 340 | 4,615 | 4,133 | 255 | 44.1 |
¹ All expressed short reads from both channel catfish and blue catfish were first assembled using existing ESTs as references. ² Those that were not assembled into contigs with the reference ESTs were then assembled de novo. *Number of reads per contig. #Total number of assembled read bases/Total number of bases in consensus sequence.
Summary of BLASTX searches to annotated protein databases
| Assembly | Contigs hit Uniprot | % contigs with hits | Unique protein hits | Contigs hit zebrafish Refseq | % contigs with hits | Unique zebrafish Refseq hits |
|---|---|---|---|---|---|---|
| Reference | 32,350 | 30.9% | 17,766 | | | |
| De novo | 24,168 | 5.7% | 12,331 | | | |
| Total | 56,518 | 10.7% | 24,440 | | | |
Contigs of two assemblies, the reference assembly with 104,870 contigs and the de novo assembly with 421,229 contigs, were used to search the Uniprot database and the zebrafish Refseq protein database to assess the number of related genes represented by catfish expressed sequences.
Figure 1. Similarity of GO-term assignments for catfish and zebrafish genes. Proportions of GO terms assigned to annotated contigs from the catfish assembly are compared with the proportions found in the zebrafish genome annotation, which serves as an indicator of the extent to which the catfish transcriptome has been characterized.
Summary of putative SNP identification from the catfish expressed short reads assembly
| | Channel catfish | Blue catfish | All catfish |
|---|---|---|---|
| Contigs under analysis | 523,815 | 525,428 | 526,099 |
| Total SNPs | 2,030,410 | 2,497,806 | 4,236,135 |
| Transitions | 1,311,220 | 1,616,477 | 2,751,244 |
| Transversions | 719,190 | 881,329 | 1,484,891 |
| SNP/100 bp | 1.6 | 1.8 | 3.0 |
Putative SNPs include all base variations involved in the sequence assemblies with at least four sequences present at the SNP position with minor allele sequences represented at least twice. All catfish represents both intra-specific and inter-specific SNPs. Note that the total SNPs from all catfish assembly is fewer than the sum of total SNPs from channel catfish and blue catfish due to shared SNP positions in the two catfish species.
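The putative-SNP criteria in the note above (at least four sequences at the position, minor allele seen at least twice) and the transition/transversion split in the table can be sketched as follows. This is a minimal illustration, not the authors' software: the function names are hypothetical, and treating the second most frequent base as the minor allele is an assumption.

```python
from collections import Counter

VALID_BASES = set("ACGT")
# Purine<->purine (A/G) and pyrimidine<->pyrimidine (C/T) changes are transitions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_putative_snp(bases, min_depth=4, min_minor=2):
    """bases: base calls from the reads aligned at one position."""
    counts = Counter(b for b in bases if b in VALID_BASES)
    if sum(counts.values()) < min_depth or len(counts) < 2:
        return False
    # Assumption: the minor allele is the second most frequent base.
    minor_count = counts.most_common(2)[1][1]
    return minor_count >= min_minor


def substitution_type(a, b):
    """Classify a two-allele SNP as a transition or a transversion."""
    if a == b:
        raise ValueError("alleles must differ")
    return "transition" if frozenset((a, b)) in TRANSITIONS else "transversion"


print(is_putative_snp("AAGG"), substitution_type("A", "G"))
```

With these thresholds, a position covered by `AAAG` is rejected because the minor allele appears only once, and `AAG` is rejected because fewer than four sequences are present.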
Quality SNPs selected from the putative SNPs with a set of criteria as described in the Methods section
| | Intra-specific SNPs (channel catfish)¹ | Intra-specific SNPs (blue catfish)² | Inter-specific SNPs³ |
|---|---|---|---|
| Total SNPs | 342,104 | 366,269 | 420,727 |
| Transitions | 208,517 | 230,031 | 262,048 |
| Transversions | 133,587 | 136,238 | 158,679 |
| No. of contigs with SNPs | 168,458 | 190,197 | 232,972 |
| No. of contigs with Uniprot hits & SNPs | 28,067 | 30,376 | 32,515 |
| No. of unique known genes containing SNPs | 16,562 | 17,423 | 18,085 |
¹ SNPs identified at positions where there were SNPs within channel catfish; ² SNPs identified at positions where there were SNPs within blue catfish; ³ SNPs identified at positions where there were no intra-specific channel catfish SNPs or intra-specific blue catfish SNPs, but the bases differed between the two species.
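The three SNP categories defined in the footnotes can be expressed as a small decision rule. The sketch below assumes that the set of alleles observed in each species at a position is already known; the function name and labels are hypothetical.

```python
def classify_position(channel_alleles, blue_alleles):
    """Return the SNP categories that apply to one aligned position.

    channel_alleles / blue_alleles: iterables of bases observed in each species.
    """
    channel, blue = set(channel_alleles), set(blue_alleles)
    labels = []
    if len(channel) > 1:
        labels.append("intra-specific (channel catfish)")
    if len(blue) > 1:
        labels.append("intra-specific (blue catfish)")
    # Inter-specific: each species is fixed for one base, and the bases differ.
    if len(channel) == 1 and len(blue) == 1 and channel != blue:
        labels.append("inter-specific")
    return labels


print(classify_position("AG", "A"))  # polymorphic within channel catfish only
print(classify_position("A", "G"))   # fixed difference between the species
```

A position can carry both intra-specific labels at once, which is consistent with the shared SNP positions mentioned for the putative SNP counts.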
Figure 2. Distribution of minor allele frequencies of SNPs identified for channel catfish, blue catfish and inter-species, as derived from analysis of sequence tags from the Illumina sequencing. A: Intra-specific SNPs in channel catfish; B: Intra-specific SNPs in blue catfish; and C: Inter-specific SNPs between the two species. The X-axis represents the sequence-derived minor allele frequency of the SNP as a percentage, while the Y-axis represents the number of SNPs with a given minor allele frequency. Note that the majority of SNPs have minor allele frequencies greater than 15%.
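A sequence-derived minor allele frequency of the kind plotted on the X-axis can be computed from the allele counts of the reads at each SNP. This is a minimal sketch with a hypothetical function name; it assumes the minor allele is the second most frequent base and that the denominator is the total read depth at the position.

```python
def minor_allele_frequency(allele_counts):
    """allele_counts: mapping of base -> number of reads carrying it.

    Returns the minor allele frequency as a percentage of total depth.
    """
    counts = sorted(allele_counts.values(), reverse=True)
    if len(counts) < 2:
        return 0.0  # monomorphic position
    return 100.0 * counts[1] / sum(counts)


print(minor_allele_frequency({"A": 17, "G": 3}))
```

Because the estimate comes from read counts in pooled transcriptome data, it reflects both allele frequency and expression level, so it is best read as a rough indicator.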
Summary of microsatellite marker identification from the all-catfish expressed short reads assembly
| Number of contigs of sequences surveyed | 526,099 |
| Number of contigs containing microsatellites | 49,883 |
| Total number of microsatellites identified | 57,379 |
| Di-nucleotide repeats | 31,657 |
| Tri-nucleotide repeats | 16,925 |
| Tetra-nucleotide repeats | 8,235 |
| Penta-nucleotide repeats | 506 |
| Hexa-nucleotide repeats | 56 |
| Number of microsatellites with sufficient flanking sequences | 39,516 |
| Number of contigs containing microsatellites with sufficient flanking sequences | 34,539 |
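
The repeat classes and the flanking-sequence check in the table can be illustrated with a regular-expression scan over a contig. This is a sketch under stated assumptions, not the tool used in the study: the minimum repeat counts and the 50 bp flank threshold are hypothetical values chosen for illustration, and the function names are invented.

```python
import re

MOTIF_CLASS = {2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}
# Hypothetical thresholds; the source does not state the values used.
MIN_REPEATS = {2: 8, 3: 5, 4: 4, 5: 4, 6: 4}
MIN_FLANK = 50


def is_multiple_of_shorter_unit(motif):
    """True for motifs such as 'AA' or 'ACAC' that repeat a shorter unit."""
    k = len(motif)
    return any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0)


def find_microsatellites(seq, min_flank=MIN_FLANK):
    """Return perfect di- to hexa-nucleotide repeats found in seq."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-base motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_multiple_of_shorter_unit(motif):
                continue  # counted under the shorter motif class instead
            hits.append({
                "class": MOTIF_CLASS[k],
                "motif": motif,
                "repeats": len(m.group(0)) // k,
                "start": m.start(),
                "sufficient_flank": m.start() >= min_flank
                and len(seq) - m.end() >= min_flank,
            })
    return hits


flank = "ACGTTGCA" * 7  # 56 bp of non-repetitive filler for the example
for hit in find_microsatellites(flank + "AC" * 10 + flank):
    print(hit)
```

Only perfect repeats are detected here; compound or interrupted microsatellites would need a more permissive pattern.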
Figure 3. Comparative analysis of the genes containing SNPs on the 25 chromosomes of the zebrafish genome. Each of the 25 zebrafish chromosomes is laid out on the X-axis in intervals of one million base pairs, and the number of genes containing filtered SNPs in each interval is plotted on the Y-axis.
Figure 4. Frequency of contigs of various sizes from the all-catfish reference assembly. The X-axis represents contig size (number of reads per contig). The curved line denotes the cumulative percentage of reads assembled. Note that a small number of very large contigs account for the majority of total reads. For instance, contigs with over 100,000 reads each make up less than 0.3% of all contigs but represent over 32% of all sequence reads assembled.
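The kind of statement made in this caption (a small fraction of contigs holding a large share of reads) can be derived from a list of per-contig read counts. The sketch below uses made-up toy numbers, not data from the study, and a hypothetical function name.

```python
def share_of_large_contigs(reads_per_contig, threshold):
    """Return (% of contigs above threshold, % of all reads they contain)."""
    large = [n for n in reads_per_contig if n > threshold]
    pct_contigs = 100.0 * len(large) / len(reads_per_contig)
    pct_reads = 100.0 * sum(large) / sum(reads_per_contig)
    return pct_contigs, pct_reads


# Toy example: 99 small contigs and one very large contig.
toy = [10] * 99 + [1000]
pct_contigs, pct_reads = share_of_large_contigs(toy, threshold=100)
print(f"{pct_contigs:.1f}% of contigs hold {pct_reads:.1f}% of reads")
```

Sorting the contigs by size and accumulating `pct_reads` across thresholds would give a cumulative curve like the one described for the figure.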
Figure 5. Distribution of filtered SNPs per contig. Histograms depict frequency of contigs with a given number of SNPs identified. Note that the majority of contigs have 5 or fewer SNPs per contig.
Figure 6. Schematic presentation of the catfish transcriptome analysis.