Alex Di Génova, Andrés Aravena, Luis Zapata, Mauricio González, Alejandro Maass, Patricia Iturra.
Abstract
SalmonDB is a new multiorganism database containing EST sequences from Salmo salar and Oncorhynchus mykiss and the whole genome sequences of Danio rerio, Gasterosteus aculeatus, Tetraodon nigroviridis, Oryzias latipes and Takifugu rubripes, built with core components from the GMOD project, the GOPArc system and the BioMart project. The information provided by this resource includes Gene Ontology terms, metabolic pathways, SNP prediction, CDS prediction, ortholog prediction, several precalculated BLAST searches and domains. It also provides a BLAST server for matching user-provided sequences to any of the databases and an advanced query tool (BioMart) that allows easy browsing of EST databases with user-defined criteria. These tools make SalmonDB a valuable resource for researchers searching for transcripts and genomic information regarding S. salar and other salmonid species. The database is expected to grow in the near future, particularly with the S. salar genome sequencing project. Database URL: http://genomicasalmones.dim.uchile.cl/
Year: 2011 PMID: 22120661 PMCID: PMC3225076 DOI: 10.1093/database/bar050
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. EST workflow: Phase I is preprocessing and assembly; Phase II, sequence annotation and characterization; and Phase III, storage of biological information.
Final assembly details
| Assembly statistics | S. salar | O. mykiss |
|---|---|---|
| Number of total reads | 495 257 | 285 359 |
| Total Unigenes in first assembly | 150 720 | 125 077 |
| Total Unigenes in reassembly (BLAST-CAP3) | 103 221 | 97 667 |
| Total Unigenes after CDS prediction | 59 336 | 62 233 |
| Number of reads in final assembly | 387 294 | 213 218 |
| Number of singletons | 31 915 | 38 884 |
| Average read length | 619 | 666 |
| Unigene length (average ± SD) | 872 ± 434 | 880 ± 322 |
| Average unigene depth | 7 | 3 |
| Maximum unigene depth | 2005 | 1444 |
Total number of reads and unigenes assembled using the described pipeline.
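Two derived figures help when reading this table: the share of input reads that survive into the final assembly, and the mean number of reads per final unigene. The sketch below computes both from the values above. It assumes the first numeric column is S. salar and the second is O. mykiss, and the dictionary keys are illustrative names, not SalmonDB identifiers. Dividing final reads by final unigenes gives values close to the reported average depths of 7 and 3, although the table does not state how depth was calculated.

```python
# Values copied from the assembly statistics table.
# Assumption: column 1 is S. salar and column 2 is O. mykiss.
assembly = {
    "S. salar": {"total_reads": 495_257, "final_reads": 387_294, "final_unigenes": 59_336},
    "O. mykiss": {"total_reads": 285_359, "final_reads": 213_218, "final_unigenes": 62_233},
}


def summarize(stats):
    """Return (percent of reads retained, mean reads per final unigene)."""
    retained = 100 * stats["final_reads"] / stats["total_reads"]
    reads_per_unigene = stats["final_reads"] / stats["final_unigenes"]
    return round(retained, 1), round(reads_per_unigene, 1)


summary = {species: summarize(stats) for species, stats in assembly.items()}
for species, (retained, depth) in summary.items():
    print(f"{species}: {retained}% of reads retained, {depth} reads per unigene")
```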
General SalmonDB statistics
| Database | S. salar | O. mykiss |
|---|---|---|
| Unigenes | 59 336 | 62 233 |
| Total SNP | 35 879 | 42 238 |
| UNIREF | 50 067 | 52 351 |
| KEGG | 30 085 | 31 908 |
| SWISSPROT | 41 472 | 44 803 |
| KOG | 33 000 | 35 436 |
| PFAM | 20 625 | 22 306 |
| TIGRFAM | 3191 | 3715 |
| SMART | 10 493 | 11 088 |
| PIRSF | 1658 | 1978 |
| SUPERFAMILY | 24 394 | 25 447 |
Total number of unigenes matching a database hit. On average each S. salar unigene has 4.2 attributes, while O. mykiss unigenes have 4.4.
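The per-unigene averages quoted above can be checked against the table. The sketch below sums every row other than Unigenes and divides by the unigene count. Whether the authors computed the averages this way, and in particular whether the Total SNP row belongs in the sum, is an assumption; it is included here because the result then rounds to the reported 4.2 and 4.4. Column order (S. salar first, O. mykiss second) is also assumed.

```python
# Hit counts copied from the General SalmonDB statistics table.
# Assumption: column 0 is S. salar and column 1 is O. mykiss.
species = ("S. salar", "O. mykiss")
unigenes = (59_336, 62_233)
hits = {
    "Total SNP": (35_879, 42_238),
    "UNIREF": (50_067, 52_351),
    "KEGG": (30_085, 31_908),
    "SWISSPROT": (41_472, 44_803),
    "KOG": (33_000, 35_436),
    "PFAM": (20_625, 22_306),
    "TIGRFAM": (3_191, 3_715),
    "SMART": (10_493, 11_088),
    "PIRSF": (1_658, 1_978),
    "SUPERFAMILY": (24_394, 25_447),
}


def mean_attributes(column):
    """Sum all hit counts for one column and divide by its unigene count."""
    total_hits = sum(counts[column] for counts in hits.values())
    return round(total_hits / unigenes[column], 1)


averages = {name: mean_attributes(i) for i, name in enumerate(species)}
print(averages)
```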
Figure 2. Snapshots of the SalmonDB web interface. (a) Unigene browser: the Unigene SS2U057650 is shown with several tracks (features); the BLAST alignment can be shown for each hit. (b) BioMart: the MartView interface is shown using the S. salar dataset, with several filters selected in the left navigation panel. It also shows the output table with multiple attributes shown on the left. (c) GO Browser: result of the search for GO term GO:003872 in the S. salar Unigene database. (d) KEGG Browser: the pathway associated with alanine and aspartate metabolism is shown using the S. salar Unigene database.
Figure 3. Frequency of aligned Unigenes plotted against percent identity. The figure (modified from [45]) shows the frequency of top pairwise alignments (E < 1e-10; query and subject coverage = 0.9) between Unigenes generated through our assembly pipeline plotted against identity score (SalmonDB, orange). It also shows the relationships among the contig consensus sequences of the Gene Index EST assembly (Gene Index, blue) and the cGRASP EST assembly (CGRASP, yellow) for Atlantic salmon. The same analysis is included for Fugu (Takifugu rubripes, light blue) and Medaka (Oryzias latipes, dark red) mRNAs obtained from Ensembl and for African clawed frog (Xenopus laevis, green) Unigenes obtained from NCBI. Since there is no standard methodology for comparing EST assemblies (genome assemblies, by contrast, have the N50 value), a good approximation is to check for the pattern expected of a duplicated genome using this strategy. We include the African clawed frog because it has a well-documented recent genome duplication. The expected pattern is shown in the figure, with a peak around 93–94%. The same is expected for salmon, which underwent a whole genome duplication ∼100 million years ago. The SalmonDB and Gene Index assemblies show this accumulation of paralogs around 93–94% identity.
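The analysis behind Figure 3 can be sketched as a filter followed by a histogram: keep each top pairwise alignment that passes the E-value and coverage thresholds, then count hits per identity bin. The code below is a minimal illustration, not the authors' pipeline. The `Hit` record layout, the function name, the 1% bin width and the reading of "coverage = 0.9" as "at least 0.9" are all assumptions, and the toy records are invented for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    """Top pairwise alignment for one unigene (hypothetical record layout)."""
    identity: float          # percent identity, 0-100
    evalue: float
    query_coverage: float    # fraction of the query aligned, 0-1
    subject_coverage: float  # fraction of the subject aligned, 0-1


def identity_frequencies(hits, max_evalue=1e-10, min_coverage=0.9):
    """Relative frequency of filtered hits per 1% identity bin."""
    kept = [
        h for h in hits
        if h.evalue < max_evalue
        and h.query_coverage >= min_coverage
        and h.subject_coverage >= min_coverage
    ]
    if not kept:
        return {}
    bins = Counter(int(h.identity) for h in kept)
    return {b: n / len(kept) for b, n in sorted(bins.items())}


# Toy records for illustration only; these are not SalmonDB data.
toy_hits = [
    Hit(93.4, 1e-50, 0.95, 0.97),
    Hit(93.9, 1e-40, 0.92, 0.91),
    Hit(88.0, 1e-30, 0.99, 0.93),
    Hit(97.2, 1e-5, 0.99, 0.99),   # dropped: E-value above threshold
    Hit(94.1, 1e-60, 0.50, 0.95),  # dropped: query coverage too low
]
freqs = identity_frequencies(toy_hits)
print(freqs)
```

In a duplicated genome, a plot of these frequencies would be expected to show the paralog peak described in the caption.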
Global comparison of available salmon databases
| | SalmonDB | GRASP | ASALBASE | Gene index |
|---|---|---|---|---|
| Data | ||||
| Data source | All public ESTs | Public ESTs, BAC ends | BAC clones, BAC ends and EST cluster | NCBI ESTs |
| Base pair quality | No | Yes | No | No |
| EST assembly | CAP3, clustering | Phrap | No | Clustering, CAP3 |
| Physical map | No | No | Yes | No |
| Genetic map | No | No | Yes | No |
| Expression data | No | Yes | No | No |
| Tools | ||||
| Blast homology search | Yes | Yes | No | Yes |
| Quick search box | Yes | No | Yes | No |
| Primer design | Yes | No | No | No |
| RepeatMasking | No | Yes | No | No |
| GO annotation browser | Yes | No | No | Yes |
| KEGG annotation browser | Yes | No | No | Yes |
| Advanced search with Biomart | Yes | No | No | No |
| Analysis | ||||
| Ortholog prediction | Yes | Yes | Yes | No |
| Paralog prediction | Yes | No | No | No |
| SNP prediction | Yes | No | No | Yes |
| CDS prediction | Yes | Yes | No | Yes |
| Other markers | No | No | Yes | No |
| Full-length cDNA prediction | Yes | Yes | Yes | No |
| Alternative splicing forms prediction | No | No | No | Yes |
| Others | ||||
| Web interface | Gbrowse | Gbrowse | Gbrowse, custom | custom |
| Other organism data | 5 fish species | Other salmonids and salmon lice | 4 fish species and Human | Other TIGR organisms |
cGRASP information was extracted directly from the http://web.uvic.ca/grasp/ website that includes features from external links. Gene index information was obtained from the website http://compbio.dfci.harvard.edu/tgi/.
Assembly statistics comparison of available salmon databases
| | SalmonDB | Gene index | cGRASP |
|---|---|---|---|
| Unigenes | 59 336 | 99 285 | 81 236 |
| Total length (Mb) | 51 | 84 | 71 |
| Min length | 100 | 100 | 75 |
| Max length | 4563 | 5828 | 4780 |
| Average length | 872 | 854 | 881 |
| Median length | 771 | 755 | 758 |
| Full-length cDNA | 5939 | 7124 | 7625 |
| % Full-length protein | 10.01 | 7.18 | 9.39 |
The table shows statistics for the three Atlantic salmon assemblies: the total number of unigenes constructed using each database pipeline, the total sequence length of all unigenes and their length statistics. We also show the number of full-length cDNAs calculated using blastx against the nr database (a unigene is counted as full-length when it covers 99% or more of the protein).
a. Number of full-length cDNAs from the SalmonDB BioMart is 7465. This number was calculated using translated sequences (blastp) instead of blastx against nr.
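The "% Full-length protein" row follows directly from the two count rows: full-length cDNAs divided by total unigenes. The short check below reproduces the tabulated percentages and shows what the SalmonDB figure would be with the blastp-based count of 7465 from the footnote. The dictionary keys and function name are illustrative only.

```python
# Counts copied from the assembly statistics comparison table.
assemblies = {
    "SalmonDB": {"unigenes": 59_336, "full_length": 5_939},
    "Gene index": {"unigenes": 99_285, "full_length": 7_124},
    "cGRASP": {"unigenes": 81_236, "full_length": 7_625},
}


def percent_full_length(full_length, unigenes):
    """Percentage of unigenes counted as full-length cDNAs."""
    return round(100 * full_length / unigenes, 2)


percentages = {
    name: percent_full_length(a["full_length"], a["unigenes"])
    for name, a in assemblies.items()
}

# Alternative SalmonDB count from the footnote (blastp on translated sequences).
blastp_percent = percent_full_length(7_465, 59_336)

print(percentages)
print(f"SalmonDB with blastp-based count: {blastp_percent}%")
```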