Alexei A Adzhubei, Anna V Vlasova, Heidi Hagen-Larsen, Torgeir A Ruden, Jon K Laerdahl, Bjørn Høyheim.
Abstract
BACKGROUND: To identify as many different transcripts/genes in the Atlantic salmon genome as possible, it is crucial to acquire good cDNA libraries from different tissues and developmental stages, to obtain their sequences (ESTs or full-length sequences), and to attempt to predict function. Such libraries allow identification of a large number of different transcripts and can provide valuable information on genes expressed in a particular tissue at a specific developmental stage. These data are important for constructing a microarray chip, identifying SNPs in coding regions, and for future identification of genes in the whole genome sequence. An important factor that determines the usefulness of generated data for biologists is efficient data access. Public searchable databases play a crucial role in providing such a service.

DESCRIPTION: Twenty-three Atlantic salmon cDNA libraries were constructed from 15 tissues, yielding nearly 155,000 clones. From these libraries 58,109 ESTs were generated, of which 57,212 were used for contig assembly. Following deletion of mitochondrial sequences, 55,118 EST sequences were submitted to GenBank. In all, 20,019 unique sequences, consisting of 6,424 contigs and 13,595 singlets, were generated. The Norwegian Salmon Genome Project Database has been constructed and annotation performed by the annotation transfer approach. Annotation was successful for 50.3% (10,075) of the sequences, and 6,113 sequences (30.5%) were annotated with Gene Ontology terms for molecular function, biological process and cellular component.
Year: 2007 PMID: 17605782 PMCID: PMC1913521 DOI: 10.1186/1471-2164-8-209
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
cDNA pre-smolt libraries used for contig assembly for the SGP dataset.
| Library name | Number of sequences: pre-processed (failed) | Number of sequences: submitted to GenBank | Number of sequences: selected for clustering |
| Brain | 3991 (781) | 3847 | 3991 |
| Brain II | 2263 (400) | 1882 | 2263 |
| Eye | 5052 (1060) | 4870 | 5052 |
| Gills | 3075 (762) | 2964 | 3075 |
| Gills II | 1113 (164) | 1053 | 1113 |
| Head kidney | 4358 (609) | 4210 | 4358 |
| Heart | 1999 (288) | 1732 | 1999 |
| Heart II | 1211 (405) | 1141 | 1211 |
| Intestine | 2865 (708) | 2737 | 2865 |
| Kidney | 1528 (1122) | 1433 | 1528 |
| Liver | 2383 (292) | 2312 | 2383 |
| Ovaries | 4142 (826) | 4113 | 4142 |
| Red muscle | 661 (96) | 629 | 661 |
| SSH Gills down-regulated | 446 (28) | 262 | 269 |
| SSH Gills up-regulated | 415 (59) | 247 | 248 |
| SSH Intestine down-regulated | 222 (65) | 218 | 222 |
| SSH Intestine up-regulated | 252 (30) | 247 | 252 |
| Skin | 1084 (64) | 1010 | 1084 |
| Spleen | 4624 (517) | 4576 | 4624 |
| Swimbladder | 2245 (419) | 2182 | 2245 |
| Testes | 4910 (732) | 4854 | 4910 |
| White muscle | 7316 (653) | 6656 | 6763 |
| White muscle II | 1954 (294) | 1943 | 1954 |
| Total | 58109 (10374) | 55118 | 57212 |
Full statistics on the SGP and other libraries are available at SGP data resource > Data and results > cDNA libraries > SGP libraries list, or individual libraries.
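The totals row of the table can be cross-checked against the per-library rows. The following is a minimal sketch in Python: the numbers are copied from the table above, while the names `LIBRARIES` and `column_totals` are illustrative only and are not part of the SGP data resource.

```python
# Rows copied from the table above:
# (library, pre-processed, failed, submitted to GenBank, selected for clustering)
LIBRARIES = [
    ("Brain", 3991, 781, 3847, 3991),
    ("Brain II", 2263, 400, 1882, 2263),
    ("Eye", 5052, 1060, 4870, 5052),
    ("Gills", 3075, 762, 2964, 3075),
    ("Gills II", 1113, 164, 1053, 1113),
    ("Head kidney", 4358, 609, 4210, 4358),
    ("Heart", 1999, 288, 1732, 1999),
    ("Heart II", 1211, 405, 1141, 1211),
    ("Intestine", 2865, 708, 2737, 2865),
    ("Kidney", 1528, 1122, 1433, 1528),
    ("Liver", 2383, 292, 2312, 2383),
    ("Ovaries", 4142, 826, 4113, 4142),
    ("Red muscle", 661, 96, 629, 661),
    ("SSH Gills down-regulated", 446, 28, 262, 269),
    ("SSH Gills up-regulated", 415, 59, 247, 248),
    ("SSH Intestine down-regulated", 222, 65, 218, 222),
    ("SSH Intestine up-regulated", 252, 30, 247, 252),
    ("Skin", 1084, 64, 1010, 1084),
    ("Spleen", 4624, 517, 4576, 4624),
    ("Swimbladder", 2245, 419, 2182, 2245),
    ("Testes", 4910, 732, 4854, 4910),
    ("White muscle", 7316, 653, 6656, 6763),
    ("White muscle II", 1954, 294, 1943, 1954),
]


def column_totals(rows):
    """Sum each numeric column across all libraries."""
    names = ("pre_processed", "failed", "genbank", "clustering")
    return {name: sum(row[i + 1] for row in rows) for i, name in enumerate(names)}


totals = column_totals(LIBRARIES)
print(len(LIBRARIES), totals)
```

The sums agree with the Total row (58109, 10374, 55118 and 57212) and with the 23 libraries reported in the abstract.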
Figure 1. SGP data flow. The SGP data resource includes sequence processing and annotation pipelines, project and publicly available tools, and the project database.
Access to the Salmon Genome Project web resources.
| Data and results > Results > General statistics; Sequence length statistics | |
| Database search > Simple search; Advanced search; Sequence search | |
| Data and results > Results > EST sequences in GenBank; Sequences in UniGene | |
| Data and results > cDNA libraries > SGP libraries list short; SGP libraries list full > select library | |
| Data and results > Clustered data datasets | |
| Data and results > Clustered data summary > SGP > | |
| Data and results > Clustered data datasets > Contigs; Singlets | |
| Data and results > Annotations | |
| SGP workbench > preAssemble | |
| SGP workbench > Blast search | |
Figure 2. Clustered data summary menu. "Contigs and best annotation hits" display for the SGP dataset. A similar display, "Singlets and best annotation hits", is available for singlets. Other options are "Contigs length and number of reads" and "Distribution of average length and number of reads in contigs". The Clustered data summary provides a current snapshot of the SGP database.
Figure 3. Detailed automatic annotations display, showing results of the GO-GOA and NCBI BLAST annotations. Query and target sequences and complete BLAST results can be accessed from this page. The detailed automatic annotations display for the full SGP annotation is available from SGP data resource > Data and results > Annotations > SGP full annotation. Links to the detailed annotation for specific contig and singlet sequences are provided in the Annotation best hits display, and for sequences accessed from the Clustered data datasets display and SGP database search results. An explanation of the format, including BLAST parameters, table columns and colour coding, is given at the top of each annotation page.
Figure 4. Clustered data datasets menu. The menu gives access to dataset descriptions, the full sets of contig and singlet sequences, and the automatic annotation results for each dataset. Searches in the SGP database run on three linked categories of data: sequences (including their descriptions), libraries and annotations. When a match occurs in any of these categories, all three are shown in the results. The "Matches in datasets" display provides access to the subset of contigs, singlets and annotations selected in the search. The "Annotations best hits" display is the same for the Clustered data datasets menu and for database search results. The Clustered data datasets menu provides a current snapshot of the SGP database.
SGP automatic annotation statistics.
| Database | % Annotated (automatic annotation): contigs and singlets | % Annotated (automatic annotation): contigs | % Annotated (automatic annotation): singlets |
| GO-GOA | 30.5 | 48.5 | 22.0 |
| pdb or swiss-prot | 32.6 | 51.1 | 23.9 |
| pdb | 17.3 | 31.4 | 10.7 |
| swiss-prot | 31.7 | 50.1 | 23 |
| nr | 41.1 | 60.3 | 32.1 |
| nt | 36.7 | 53.7 | 28.6 |
| Distribution between databases | |||
| pdb | 17.3 | 31.4 | 10.7 |
| swiss-prot | 15.3 | 19.7 | 13.2 |
| nr | 8.9 | 9.6 | 8.6 |
| nt | 8.8 | 7.7 | 9.3 |
| any database (pdb + swiss-prot + nr + nt) | 50.3 | 68.4 | 41.8 |
| no hits | 49.7 | 31.6 | 58.2 |
Detailed SGP dataset annotation statistics are available at SGP data resource > Data and results > Annotations > SGP full annotation, statistics.
Databases. NCBI databases – pdb: RCSB-PDB; swiss-prot: SWISS-PROT protein sequence database; nr: all non-redundant GenBank CDS translations + RefSeq Proteins + PDB + SwissProt + PIR + PRF; nt (nucleotide sequences): all GenBank + RefSeq Nucleotides + EMBL + DDBJ + PDB sequences (excluding HTGS0,1,2, EST, GSS, STS, PAT, WGS), no longer "non-redundant". GO-GOA: Gene Ontology assignments for the UNIPROT database produced by the GOA project.
% Annotated. Calculated as the ratio of the number of sequences with successful annotations for a given subset (i.e. contigs and singlets, contigs, singlets) in a given database to the total number of sequences of this subset for which annotation was attempted; GO annotation statistics were calculated separately from the other databases as GO hits/GO no-hits. BLAST threshold E-values of 10^-10 for PDB and 10^-15 for the other databases were used.
Distribution between databases. Only one successful annotation per sequence was counted, in the following ranking order: pdb OR swiss prot OR nr OR nt.
SGP dataset. 20019 contig consensus and singlet sequences.
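The two kinds of statistics described in the notes above (per-database "% Annotated" and the ranked "Distribution between databases") can be illustrated with a short Python sketch. This is not the project's pipeline: the input format (best E-value per database for each sequence), the sequence names, the toy E-values and the function names are all assumptions made for illustration, as is the choice to treat an E-value equal to the threshold as a pass. Only the thresholds and the ranking order come from the notes.

```python
from collections import Counter

# E-value cut-offs quoted in the table notes (PDB: 10^-10, others: 10^-15).
THRESHOLDS = {"pdb": 1e-10, "swiss-prot": 1e-15, "nr": 1e-15, "nt": 1e-15}
# Ranking order used for the "Distribution between databases" rows.
RANKING = ["pdb", "swiss-prot", "nr", "nt"]


def passing_databases(best_evalues):
    """Databases whose best hit meets the threshold (<= is an assumption)."""
    return {db for db, e in best_evalues.items() if e <= THRESHOLDS[db]}


def first_ranked(hits):
    """Count a sequence once, under the highest-ranked database with a hit."""
    for db in RANKING:
        if db in hits:
            return db
    return "no hits"


def summarise(sequences):
    """Return (% annotated per database, % distribution between databases)."""
    n = len(sequences)
    if n == 0:
        raise ValueError("no sequences to summarise")
    per_db, distribution = Counter(), Counter()
    for best_evalues in sequences.values():
        hits = passing_databases(best_evalues)
        per_db.update(hits)
        distribution[first_ranked(hits)] += 1

    def pct(count):
        return round(100.0 * count / n, 1)

    per_db_pct = {db: pct(per_db[db]) for db in RANKING}
    distribution_pct = {k: pct(distribution[k]) for k in RANKING + ["no hits"]}
    return per_db_pct, distribution_pct


# Hypothetical toy input: best BLAST E-value per database for four sequences.
toy = {
    "contig_1": {"pdb": 1e-30, "swiss-prot": 1e-40, "nr": 1e-50},
    "contig_2": {"swiss-prot": 1e-20, "nt": 1e-5},
    "singlet_1": {"pdb": 1e-8, "nt": 1e-18},
    "singlet_2": {},
}
per_db_pct, distribution_pct = summarise(toy)
print(per_db_pct)
print(distribution_pct)
```

Because each sequence is counted only once in the ranked distribution, those percentages plus "no hits" add up to 100%, which is also the case in the table (17.3 + 15.3 + 8.9 + 8.8 = 50.3, and 50.3 + 49.7 = 100).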
Figure 5. GO-GOA automatic annotation results. Upper-level GO category assignments, and the breakdown for each subset, are shown for the three GO subsets: [GO: Molecular Function], [GO: Biological Process] and [GO: Cellular Component]. The annotation was performed for contig and singlet sequences. The complete SGP GO-GOA annotation is available at SGP data resource > Data and results > Annotations > SGP full annotation > GO term tables.
Annotations search results for a representative sample of potentially important functions.
| immune response | 94 | 64 | 153 | |
| immune response GO:0006955 | 91 | 60 | 147 | GO:0007582 : physiological process |
| immune cell activation GO:0045321 | 0 | 1 | 1 | GO:0007582 : physiological process |
| response to stress | 28 | 28 | 56 | |
| response to stress GO:0006950 | 14 | 13 | 27 | GO:0050896 : response to stimulus |
| inflammatory response | 29 | 24 | 53 | |
| inflammatory response GO:0006954 | 26 | 21 | 47 | GO:0007582 : physiological process |
| response to virus | 7 | 8 | 15 | |
| response to virus GO:0009615 | 7 | 5 | 12 | GO:0050896 : response to stimulus |
| wounding | 2 | 3 | 5 | |
| response to wounding GO:0009611 | 2 | 3 | 5 | GO:0050896 : response to stimulus |
| development | 139 | 192 | 331 | |
| development GO:0007275 | 73 | 120 | 193 | GO:0007275 : development |
| growth | 74 | 128 | 202 | |
| growth GO:0040007 | 1 | 0 | 1 | GO:0040007 : growth |
| cell growth | 32 | 41 | 73 | |
| cell growth GO:0016049 | 6 | 1 | 7 | GO:0040007 : growth |
| regulation of growth | 26 | 41 | 67 | |
| regulation of growth GO:0040008 | 0 | 2 | 2 | GO:0040007 : growth |
| differentiation | 71 | 90 | 161 | |
| regulation of differentiation | 35 | 48 | 83 | |
| cell differentiation GO:0030154 | 41 | 52 | 93 | GO:0009987 : cellular process |
| regulation of cell differentiation GO:0045595 | 0 | 2 | 2 | GO:0009987 : cellular process |
| proliferation | 86 | 87 | 173 | |
| regulation of proliferation | 61 | 65 | 126 | |
| cell proliferation GO:0008283 | 37 | 37 | 74 | GO:0009987 : cellular process |
| regulation of cell proliferation GO:0042127 | 5 | 4 | 9 | GO:0007582 : physiological process |
| sex | 14 | 7 | 21 | |
| sex determination GO:0007530 | 2 | 1 | 3 | GO:0007275 : development |
Searches were done for most of the queries on a wider term and on an exact GO term in the subset [GO: biological process]. Although wider search terms are bound to produce some mismatches, they can be more useful in identifying all important matches.
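The difference between a wider text search and an exact GO term search can be sketched as follows. This is an illustrative Python example only, not the SGP database search: the GO identifiers and term names are taken from the table above, but the sequence identifiers, the record layout and the function names are hypothetical.

```python
# Hypothetical annotation records: (sequence id, GO id, GO term name).
# GO ids and names are from the table above; sequence ids are made up.
ANNOTATIONS = [
    ("contig_1", "GO:0008283", "cell proliferation"),
    ("contig_2", "GO:0042127", "regulation of cell proliferation"),
    ("singlet_1", "GO:0008283", "cell proliferation"),
    ("singlet_2", "GO:0006955", "immune response"),
]


def wide_search(records, text):
    """Match any record whose term name contains the search text."""
    needle = text.lower()
    return sorted(seq for seq, _, name in records if needle in name.lower())


def exact_search(records, go_id):
    """Match only records annotated with exactly this GO identifier."""
    return sorted(seq for seq, go, _ in records if go == go_id)


wide = wide_search(ANNOTATIONS, "proliferation")
exact = exact_search(ANNOTATIONS, "GO:0008283")
print(wide)
print(exact)
```

In this toy data the wider term also returns the sequence annotated with "regulation of cell proliferation", which the exact GO term search does not, mirroring the larger counts for wider terms in the table.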