Roddy Jorquera, Rodrigo Ortiz, F Ossandon, Juan Pablo Cárdenas, Rene Sepúlveda, Carolina González, David S Holmes.
Abstract
Eukaryotic genes are typically interrupted by intragenic, noncoding sequences termed introns. However, some genes lack introns in their coding sequence (CDS) and are generally known as 'single exon genes' (SEGs). In this work, a SEG is defined as a nuclear, protein-coding gene that lacks introns in its CDS. Whereas many public databases of eukaryotic multi-exon genes are available, there are only two specialized databases for SEGs. The present work addresses the need for a more extensive and diverse database by creating SinEx DB, a publicly available, searchable database of predicted SEGs from 10 completely sequenced mammalian genomes, including human. SinEx DB houses the DNA and protein sequence information of these SEGs and includes their functional predictions (KOG) and the relative distribution of these functions within species. The information is stored in a relational database built with MySQL Server 5.1.33, and the complete dataset of SEG sequences and their functional predictions is available for download. SinEx DB can be interrogated by: (i) a browsable phylogenetic schema, (ii) BLAST searches against the in-house SinEx DB of SEGs and (iii) an advanced search mode in which the database can be searched by keywords and by any combination of species and predicted functions. SinEx DB provides a rich source of information for advancing our understanding of the evolution and function of SEGs.
Database URL: www.sinex.cl
Year: 2016 PMID: 27278816 PMCID: PMC4897596 DOI: 10.1093/database/baw095
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Bioinformatic pipeline outlining the strategy for SinEx DB construction. Ten sequenced mammalian genomes (see text for list) were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/). Nucleotide sequences were translated in silico into the corresponding amino acid sequences. Using Perl scripts and the BioPerl Application Programming Interface (API), genes were parsed into single exon genes (SEGs) and multi-exon genes (MEGs). MEGs and annotated pseudogenes were binned and stored separately.
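The parsing step in Figure 1 can be illustrated with a minimal Python sketch. The authors report using Perl scripts and BioPerl, so this is not their code; the `Gene` record, its field names and the bin labels are hypothetical and only mirror the SEG definition given in the abstract (a protein-coding gene with no introns in its CDS).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Gene:
    """Hypothetical, simplified gene record (not a BioPerl object)."""
    gene_id: str
    cds_segments: List[Tuple[int, int]]  # (start, end) of each CDS segment
    is_pseudogene: bool = False


def classify(gene: Gene) -> str:
    """Assign a gene to one bin based on its annotated CDS segments."""
    if gene.is_pseudogene:
        return "pseudogene"
    if not gene.cds_segments:
        return "non_coding"
    # One contiguous CDS segment means no intron interrupts the CDS.
    # Introns located only in UTRs do not change this classification.
    return "SEG" if len(gene.cds_segments) == 1 else "MEG"


def bin_genes(genes: List[Gene]) -> Dict[str, List[str]]:
    """Group gene identifiers by bin, keeping input order within each bin."""
    bins: Dict[str, List[str]] = {
        "SEG": [], "MEG": [], "pseudogene": [], "non_coding": []
    }
    for gene in genes:
        bins[classify(gene)].append(gene.gene_id)
    return bins


# Made-up records for illustration only.
example = [
    Gene("geneA", [(100, 1000)]),
    Gene("geneB", [(100, 300), (500, 900)]),
    Gene("geneC", [(50, 400)], is_pseudogene=True),
]
bins = bin_genes(example)
print(bins)
```

In this sketch pseudogenes are checked first, so an annotated pseudogene with a single CDS segment is never counted as a SEG, which matches the caption's statement that pseudogenes were stored separately.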
Total CDS annotated by NCBI, gene density [gene number/genome size (Mb)] and single exon genes predicted with an in-house Perl script in mammals.
| Name | Total CDS | SEG number | SEG percentage (Δ) | Av. SEG length (a) | Genome size (Mb) (b) | Gene number (b) | Gene density (gene/Mb) | Total gene number with 5′ and/or 3′-UTRs (c) |
|---|---|---|---|---|---|---|---|---|
| Human | 35 195 | 3128 | 8.9 | 341 | 2670.42 | 27 155 | 10.2 | 21 838 |
| Chimpanzee | 33 726 | 3522 | 10.4 | 306 | 2528.45 | 24 440 | 9.7 | 20 583 |
| Macaque | 29 288 | 4713 | 16.1 | 220 | 1412.47 | 28 770 | 20.4 | 17 376 |
| Mouse | 28 789 | 4858 | 16.9 | 302 | 2474.93 | 22 900 | 9.3 | 25 553 |
| Rat | 19 402 | 3355 | 17.3 | 297 | 3095.69 | 37 150 | 12.0 | 18 679 |
| Dog | 21 894 | 2392 | 10.9 | 305 | 3600.5 | 21 583 | 6.0 | 1161 |
| Horse | 20 210 | 1953 | 9.7 | 328 | 3097.59 | 29 413 | 9.5 | 7567 |
| Pig | 22 663 | 2703 | 11.9 | 302 | 2654.91 | 34 293 | 12.9 | 5277 |
| Cow | 18 577 | 2551 | 13.7 | 286 | 3160.37 | 30 235 | 9.6 | 18 039 |
| Opossum | 18 410 | 2449 | 13.3 | 330 | 2725.99 | 29 100 | 10.7 | 1329 |

Δ: Percentage of predicted SEGs as a function of total annotated CDS per genome.
a: Average CDS length of SEGs in amino acids.
b: Obtained from NCBI web page (http://www.ncbi.nlm.nih.gov/genome/).
c: Obtained from UTRdb web page (http://utrdb.ba.itb.cnr.it/home/statistics).
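The two derived columns of the table follow directly from the raw counts, as the short Python sketch below shows for three of the ten species. The function names are illustrative; the input numbers are copied from the table, and rounding to one decimal place is an assumption made to match the table's precision.

```python
def seg_percentage(total_cds: int, seg_number: int) -> float:
    """Predicted SEGs as a percentage of total annotated CDS."""
    return round(100 * seg_number / total_cds, 1)


def gene_density(gene_number: int, genome_size_mb: float) -> float:
    """Genes per megabase of genome."""
    return round(gene_number / genome_size_mb, 1)


# (total CDS, SEG number, genome size in Mb, gene number), taken from the table
rows = {
    "Human": (35195, 3128, 2670.42, 27155),
    "Macaque": (29288, 4713, 1412.47, 28770),
    "Dog": (21894, 2392, 3600.5, 21583),
}

derived = {
    name: (seg_percentage(cds, seg), gene_density(genes, size))
    for name, (cds, seg, size, genes) in rows.items()
}
for name, (pct, density) in derived.items():
    print(f"{name}: SEG percentage = {pct}, gene density = {density} gene/Mb")
```

For these three rows the computed values agree with the tabulated ones (Human 8.9 and 10.2, Macaque 16.1 and 20.4, Dog 10.9 and 6.0).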
Figure 2. Screenshot of the web interface of SinEx DB. There are three ways to access SinEx DB data: (i) exploring the database content through the browsable phylogenetic schema, (ii) using a protein sequence in FASTA format as a query against SinEx DB and (iii) running an advanced search to interrogate one or more genomes (see text for more details). Nucleotide and protein sequences of SEGs and protein sequences of MEGs from 10 mammalian genomes are downloadable in FASTA format. A tutorial is also available on the website (www.sinex.cl/tutorial.app).
Figure 3. SEG/MEG proportion in different KOG functional categories for mammals, represented as a combined z-score from multiple tests. A high-dimensional analysis of all categorized sequences from SinEx DB shows that, in mammals, CDSs with predicted functions related to chromatin structure (B), signal transduction mechanisms (T) and translation (J) are enriched in SEGs (high proportion of SEGs to MEGs), whereas CDSs with predicted functions related to envelope biogenesis and to amino acid, nucleotide, secondary metabolite and lipid metabolism have the lowest proportion of SEGs to MEGs. The P value was obtained using Pearson's chi-squared test and corrected by the Šidák multiple testing method (31). Asterisks indicate statistical significance (P < 0.05).
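The testing scheme named in the caption, Pearson's chi-squared test followed by Šidák correction, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' analysis: it assumes a 2×2 table per KOG category (SEG vs MEG, in the category vs in all others), the counts and the number of tests `m` are made up, and the combined z-score shown in the figure is not reproduced.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column total must be non-zero")
    return n * (a * d - b * c) ** 2 / denom


def chi2_pvalue_1df(stat: float) -> float:
    """Upper-tail p-value of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


def sidak(p: float, m: int) -> float:
    """Šidák-adjusted p-value for m independent tests."""
    return 1 - (1 - p) ** m


# Hypothetical counts: rows = in category / not in category, columns = SEG / MEG.
stat = chi2_2x2(30, 70, 50, 50)
p_raw = chi2_pvalue_1df(stat)
m = 25  # hypothetical number of categories tested
p_adj = sidak(p_raw, m)
print(f"chi2 = {stat:.3f}, raw P = {p_raw:.4f}, Sidak-adjusted P = {p_adj:.4f}")
```

The closed form used for the p-value holds only for one degree of freedom, which is what a 2×2 table gives; larger tables would need a general chi-squared distribution function.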