Genome SEGE: a database for 'intronless' genes in eukaryotic genomes.

Meena Kishore Sakharkar, Pandjassarame Kangueane.

Abstract

BACKGROUND: Sequence data for a number of completely sequenced eukaryotic genomes are available in the public domain. Eukaryotic genes are either 'intron containing' or 'intronless'. Eukaryotic 'intronless' genes are interesting datasets for comparative genomics and evolutionary studies. The SEGE database, a collection of eukaryotic single exon genes, is already available. However, SEGE is derived from GenBank, and the redundant, incomplete and heterogeneous nature of GenBank data is a bottleneck for comparative genomics and evolutionary studies. Such studies often require representative gene sets from each genome, which is possible only by deriving specific datasets from completely sequenced genome data. Thus Genome SEGE, a database for 'intronless' genes in completely sequenced eukaryotic genomes, has been constructed.

AVAILABILITY: http://sege.ntu.edu.sg/wester/intronless

DESCRIPTION: Eukaryotic 'intronless' genes are extracted from nine completely sequenced genomes (four unicellular and five multi-cellular). The complete dataset is available for download. Data subsets are also available for 'intronless' pseudo-genes. The database provides information on the distribution of 'intronless' genes in different genomes together with their length distributions in each genome. A search facility is available through the web server, and the search tool provides pre-computed PROSITE motifs for each sequence in the database with appropriate hyperlinks to InterPro.

CONCLUSIONS: The unique feature that distinguishes Genome SEGE from SEGE is that it provides representative 'intronless' datasets for completely sequenced genomes. The 'intronless' gene sets available in this database will be of use for subsequent bio-computational analysis in comparative genomics and evolutionary studies. Such analysis may help to revisit the original genome data for re-examination and re-annotation.

Year: 2004 PMID: 15175116 PMCID: PMC434494 DOI: 10.1186/1471-2105-5-67

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Eukaryotic genes are often interrupted by intragenic, non-coding sequences called introns [1]. Prokaryotic genes, by contrast, lack introns, so 'intronless' genes are a characteristic feature of prokaryotes. Interestingly, many eukaryotic histone [2,3] and GPCR [4] genes are predominantly 'intronless'. A number of vertebrate 'intronless' genes have been compiled [5]. The human genome report identified 901 single exon genes predicted by Otto, the Celera approach to gene prediction [6]. The presence of a sizeable number of single exon genes (SEG) in eukaryotic genomes is intriguing. The SEGE database contains eukaryotic SEG derived from GenBank [7]. For most genomes, SEGE does not provide representative 'intronless' gene sets because GenBank often contains redundant sequences from the same species deposited by different authors. It should also be noted that not all sequences obtained from genome projects are available in GenBank. Representative sets of SEG from specific genomes will provide meaningful biological insights in subsequent bio-computational analysis for comparative and evolutionary studies. To facilitate such research we developed Genome SEGE, a database containing all putative SEG from completely sequenced eukaryotic genomes. Here, we describe the construction and usefulness of Genome SEGE.

Construction and content

Data source and methodology

The annotated eukaryotic genome sequence data were downloaded from NCBI [8]. 'Intronless' genes were identified using the 'CDS' annotation in the FEATURES table of each record, as described elsewhere [7]. Organellar sequences (annotated as 'chloroplast', 'plastid', 'mitochondrial' or 'mitochondrion') were excluded from further analysis. A flowchart describing the construction of the database is shown in Figure 1.

Figure 1

Database construction. A flowchart describing the development of the database is shown. CDS = coding sequence.
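The selection step can be illustrated with a minimal Python sketch. The exact rules of the original pipeline are described elsewhere [7] and are not reproduced here, so this sketch rests on two assumptions: a CDS is treated as single exon when its GenBank location string is one contiguous span (no join(...) of several segments), and organellar records are recognised by the four annotation terms listed above. The function names and the dictionary keys 'location' and 'description' are hypothetical.

```python
import re

# Annotation terms used in the text to flag organellar sequences.
ORGANELLE_TERMS = ("chloroplast", "plastid", "mitochondrial", "mitochondrion")

def is_single_exon(location):
    """Return True if a GenBank CDS location is one contiguous span.

    Assumption: 'join(...)' locations denote multi-exon genes, while a
    plain 'start..end' span (optionally wrapped in 'complement(...)')
    denotes a single exon CDS.
    """
    loc = location.strip()
    if loc.startswith("complement(") and loc.endswith(")"):
        loc = loc[len("complement("):-1]
    # '<' and '>' mark partial ends in GenBank locations.
    return re.fullmatch(r"<?\d+\.\.>?\d+", loc) is not None

def is_organellar(description):
    """Return True if the annotation text mentions an organelle term."""
    text = description.lower()
    return any(term in text for term in ORGANELLE_TERMS)

def select_intronless(cds_records):
    """Keep single exon, non-organellar CDS records (hypothetical schema)."""
    return [
        rec for rec in cds_records
        if is_single_exon(rec["location"]) and not is_organellar(rec["description"])
    ]
```

For example, a record with location 'complement(100..400)' would be kept, while 'join(1..50,120..300)' or any record described as 'mitochondrion' would be dropped.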

'Intronless' pseudogenes

Data processing and cleaning are an essential part of biological knowledge discovery. Hence, we eliminated all identifiable processed pseudogenes by scanning for a polyadenylation signal (AATAAA) and a polyadenylation tail using a modified version of the procedure of Harrison and colleagues [9]. In this procedure, we define a sequence as a pseudogene if it contains a polyadenylation tail (>15 A) within 1000 nucleotides of the stop codon, preceded by a polyadenylation signal.
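The rule stated above can be sketched in Python as follows. This is an illustration of the stated definition only, not the authors' modified procedure, whose details are not given here. It assumes that the input is the genomic sequence immediately downstream of the stop codon, that the tail must begin within the first 1000 nucleotides of that region, and that the signal must occur anywhere upstream of the tail within the same region. The function name is hypothetical.

```python
import re

POLYA_SIGNAL = "AATAAA"
# More than 15 consecutive A's, i.e. a run of at least 16.
POLYA_TAIL = re.compile(r"A{16,}")
WINDOW = 1000  # nucleotides scanned after the stop codon

def looks_like_processed_pseudogene(downstream):
    """Apply the poly(A) tail plus preceding signal rule.

    'downstream' is assumed to be the sequence that follows the stop
    codon; only its first WINDOW nucleotides are examined.
    """
    window = downstream.upper()[:WINDOW]
    for tail in POLYA_TAIL.finditer(window):
        # The signal has to appear before the start of the tail.
        if POLYA_SIGNAL in window[:tail.start()]:
            return True
    return False
```

A sequence flagged by this test would go into the pseudo-gene subset; everything else would remain in the cleaned 'intronless' set.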

Prosite motifs and InterPro

We characterized 'intronless' gene products using PROSITE, which is a method of identifying the functions of uncharacterized proteins translated from genomic sequences [10]. We chose PROSITE because it is complete, highly specific, fully documented and regularly updated. The search tool provides pre-computed PROSITE motifs for each sequence in the database with appropriate hyperlinks to InterPro [11].
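To show what a PROSITE pattern scan involves, here is a minimal Python sketch that translates a PROSITE pattern into a regular expression and reports its matches in a protein sequence. It is not the tool used to pre-compute the motifs in the database, which is not specified here. It handles only the basic pattern syntax (x, [..], {..}, repeat counts, and the < and > anchors), and the function names are hypothetical. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        counted = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if counted:                      # e.g. x(2,4) or A(3)
            element = counted.group(1)
            repeat = "{" + counted.group(2) + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # excluded residues
        else:
            core = element               # single residue or [..] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def find_motifs(sequence, pattern):
    """Return (start, matched_text) for non-overlapping pattern hits."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]
```

Because re.finditer reports non-overlapping matches, a production scanner would need extra handling for overlapping hits; that is left out to keep the sketch short.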

Content

A database was created to store all eukaryotic 'intronless' sequences derived from completely sequenced genomes. The database contains three sets of data for each genome: (1) the 'intronless' sequence set, (2) the 'intronless' pseudo-gene sequence set, and (3) the 'intronless' sequence set without pseudo-genes.
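The relationship between the three datasets can be sketched in Python as below. The storage layout of the real database is not described here, so the dictionary keys, the function name and the caller-supplied is_pseudogene predicate are all hypothetical; the only point is that sets (2) and (3) partition set (1).

```python
def build_datasets(records, is_pseudogene):
    """Split one genome's 'intronless' records into the three sets.

    'records' is any list of per-gene records; 'is_pseudogene' is a
    caller-supplied test (hypothetical) that flags processed pseudogenes.
    """
    pseudo = [rec for rec in records if is_pseudogene(rec)]
    clean = [rec for rec in records if not is_pseudogene(rec)]
    return {
        "intronless": list(records),                # set (1)
        "intronless_pseudogenes": pseudo,           # set (2)
        "intronless_without_pseudogenes": clean,    # set (3)
    }
```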

Caveats

Genome annotation is an inherently dynamic process that draws on many different sources of data, which are not updated in a rigorous fashion. Annotation is also not generally uniform or consistent, because different groups use different procedures. During genome annotation, a gene may have been annotated with a single exon CDS in the FEATURES table for three main reasons: (1) the gene is truly 'intronless' and functional, (2) the gene is of retroposition origin [12,13], or (3) the gene is a false positive prediction by a gene finding algorithm. False positives are not removed from the current dataset because no method is available for identifying them. Nevertheless, the gene finding algorithms are reasonably optimized to find SEG.

Update

The database will be refreshed quarterly, or whenever an update to the genome files in the public domain is noticed.

Utility and discussion

Genome SEGE is an extension of SEGE [7], and the two databases complement each other in their biological utility and application. They differ primarily in content, because their datasets are created from different sources; the degree and quality of annotation also vary between them. SEGE can be used for general purpose studies involving 'intronless' genes from different genomes, while Genome SEGE is of particular interest to researchers in comparative genomics. Comparing 'intronless' gene sequences between two or more genomes can identify features conserved or diverged during evolution, and comparison of more closely related genomes can reveal similarities in gene order. Such analysis could also shed light on genome architecture and help explain why a genome is arranged the way it is and how its structure affects function. A systematic mapping between functional genes and their 'intronless' paralogs can provide a matrix for genomic rearrangement and gene duplication. The different 'intronless' gene sets available in the database provide an opportunity to perform many-to-many comparisons between genomes, which will yield information on paralogy and orthology at the molecular level. Analysis of the datasets using non-linear probabilistic models may provide evidence for retroposition events during evolution.

The search tool in the database provides options to scan each dataset by gene name or protein name. The result page gives the chromosomal location, organism name, gene name, product name, GenBank Index, nucleotide sequence and protein sequence. It also shows all PROSITE motifs in the sequence, with specific hyperlinks to PROSITE documentation and InterPro (Figure 2).

Figure 2

Illustration of an example search. This example illustrates a search for human 'G protein' in the database. The interface, search page and results (annotation, sequence, Prosite, InterPro links) are shown.

Conclusions

The biological role of 'intronless' genes in the genomes of higher organisms is perplexing. The 'intronless' gene sets available in the database will be of use for subsequent bio-computational analysis in comparative genomics and evolutionary studies. Such analysis may help to revisit the original genome data for re-examination and re-annotation. Different eukaryotic genomes have varying proportions of 'intronless' genes, and a sizeable fraction of them are found in many intron-rich multi-cellular genomes. We believe that these estimates will improve our understanding of the differential selection (as a process or force) of 'intronless' genes in different eukaryotic genomes. The different datasets made available in the database can serve as a data source for evolutionary and functional studies. They will also help to answer questions such as: (1) How many 'intronless' genes are expressed in each genome? (2) How many of them are of prokaryotic origin? (3) How many of them have multi-exon counterparts within the same genome? (4) Do they evolve by retroposition? It is our hope that the database will encourage molecular biologists and computational molecular evolutionists to address these questions. The unique feature that distinguishes Genome SEGE from SEGE is that it provides representative 'intronless' datasets for completely sequenced genomes. We expect this service to encourage researchers to use representative datasets when investigating biologically significant evolutionary phenomena. We also hope to extend the service to other completely sequenced genomes as they become available in the public domain, after appropriate examination and analysis. We further intend to compare the contents of SEGE and Genome SEGE on a genome by genome basis to examine data bias in SEGE.

Availability and requirements

The database is freely available at http://sege.ntu.edu.sg/wester/intronless.

List of abbreviations

GPCR: G-protein coupled receptors

SEGE: Single exon genes in eukaryotes

CDS: Coding sequence

12 in total

1. Many G-protein-coupled receptors are encoded by retrogenes.

Authors: J Brosius
Journal: Trends Genet Date: 1999-08 Impact factor: 11.639

2. The PROSITE database, its status in 2002.

Authors: Laurent Falquet; Marco Pagni; Philipp Bucher; Nicolas Hulo; Christian J A Sigrist; Kay Hofmann; Amos Bairoch
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

3. Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22.

Authors: Paul M Harrison; Hedi Hegyi; Suganthi Balasubramanian; Nicholas M Luscombe; Paul Bertone; Nathaniel Echols; Ted Johnson; Mark Gerstein
Journal: Genome Res Date: 2002-02 Impact factor: 9.043

4. Genomes were forged by massive bombardments with retroelements and retrosequences.

Authors: J Brosius
Journal: Genetica Date: 1999 Impact factor: 1.082

5. The InterPro Database, 2003 brings increased coverage and new features.

Authors: Nicola J Mulder; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Daniel Barrell; Alex Bateman; David Binns; Margaret Biswas; Paul Bradley; Peer Bork; Phillip Bucher; Richard R Copley; Emmanuel Courcelle; Ujjwal Das; Richard Durbin; Laurent Falquet; Wolfgang Fleischmann; Sam Griffiths-Jones; Daniel Haft; Nicola Harte; Nicolas Hulo; Daniel Kahn; Alexander Kanapin; Maria Krestyaninova; Rodrigo Lopez; Ivica Letunic; David Lonsdale; Ville Silventoinen; Sandra E Orchard; Marco Pagni; David Peyruc; Chris P Ponting; Jeremy D Selengut; Florence Servant; Christian J A Sigrist; Robert Vaughan; Evgueni M Zdobnov
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

6. Targeted gene deletion in Zygosaccharomyces bailii.

Authors: M Mollapour; P Piper
Journal: Yeast Date: 2001-01-30 Impact factor: 3.239

7. Pseudogenes in yeast?

Authors: G R Fink
Journal: Cell Date: 1987-04-10 Impact factor: 41.582

8. Histone genes: not so simple after all.

Authors: R W Old; H R Woodland
Journal: Cell Date: 1984-10 Impact factor: 41.582

9. Why genes in pieces?

Authors: W Gilbert
Journal: Nature Date: 1978-02-09 Impact factor: 49.962

10. The sequence of the human genome.

Authors: J C Venter; M D Adams; E W Myers; P W Li; R J Mural; G G Sutton; H O Smith; M Yandell; C A Evans; R A Holt; J D Gocayne; P Amanatides; R M Ballew; D H Huson; J R Wortman; Q Zhang; C D Kodira; X H Zheng; L Chen; M Skupski; G Subramanian; P D Thomas; J Zhang; G L Gabor Miklos; C Nelson; S Broder; A G Clark; J Nadeau; V A McKusick; N Zinder; A J Levine; R J Roberts; M Simon; C Slayman; M Hunkapiller; R Bolanos; A Delcher; I Dew; D Fasulo; M Flanigan; L Florea; A Halpern; S Hannenhalli; S Kravitz; S Levy; C Mobarry; K Reinert; K Remington; J Abu-Threideh; E Beasley; K Biddick; V Bonazzi; R Brandon; M Cargill; I Chandramouliswaran; R Charlab; K Chaturvedi; Z Deng; V Di Francesco; P Dunn; K Eilbeck; C Evangelista; A E Gabrielian; W Gan; W Ge; F Gong; Z Gu; P Guan; T J Heiman; M E Higgins; R R Ji; Z Ke; K A Ketchum; Z Lai; Y Lei; Z Li; J Li; Y Liang; X Lin; F Lu; G V Merkulov; N Milshina; H M Moore; A K Naik; V A Narayan; B Neelam; D Nusskern; D B Rusch; S Salzberg; W Shao; B Shue; J Sun; Z Wang; A Wang; X Wang; J Wang; M Wei; R Wides; C Xiao; C Yan; A Yao; J Ye; M Zhan; W Zhang; H Zhang; Q Zhao; L Zheng; F Zhong; W Zhong; S Zhu; S Zhao; D Gilbert; S Baumhueter; G Spier; C Carter; A Cravchik; T Woodage; F Ali; H An; A Awe; D Baldwin; H Baden; M Barnstead; I Barrow; K Beeson; D Busam; A Carver; A Center; M L Cheng; L Curry; S Danaher; L Davenport; R Desilets; S Dietz; K Dodson; L Doup; S Ferriera; N Garg; A Gluecksmann; B Hart; J Haynes; C Haynes; C Heiner; S Hladun; D Hostin; J Houck; T Howland; C Ibegwam; J Johnson; F Kalush; L Kline; S Koduru; A Love; F Mann; D May; S McCawley; T McIntosh; I McMullen; M Moy; L Moy; B Murphy; K Nelson; C Pfannkoch; E Pratts; V Puri; H Qureshi; M Reardon; R Rodriguez; Y H Rogers; D Romblad; B Ruhfel; R Scott; C Sitter; M Smallwood; E Stewart; R Strong; E Suh; R Thomas; N N Tint; S Tse; C Vech; G Wang; J Wetter; S Williams; M Williams; S Windsor; E Winn-Deen; K Wolfe; J Zaveri; K Zaveri; J F Abril; R Guigó; M J Campbell; K V 
Sjolander; B Karlak; A Kejariwal; H Mi; B Lazareva; T Hatton; A Narechania; K Diemer; A Muruganujan; N Guo; S Sato; V Bafna; S Istrail; R Lippert; R Schwartz; B Walenz; S Yooseph; D Allen; A Basu; J Baxendale; L Blick; M Caminha; J Carnes-Stine; P Caulk; Y H Chiang; M Coyne; C Dahlke; A Deslattes Mays; M Dombroski; M Donnelly; D Ely; S Esparham; C Fosler; H Gire; S Glanowski; K Glasser; A Glodek; M Gorokhov; K Graham; B Gropman; M Harris; J Heil; S Henderson; J Hoover; D Jennings; C Jordan; J Jordan; J Kasha; L Kagan; C Kraft; A Levitsky; M Lewis; X Liu; J Lopez; D Ma; W Majoros; J McDaniel; S Murphy; M Newman; T Nguyen; N Nguyen; M Nodell; S Pan; J Peck; M Peterson; W Rowe; R Sanders; J Scott; M Simpson; T Smith; A Sprague; T Stockwell; R Turner; E Venter; M Wang; M Wen; D Wu; M Wu; A Xia; A Zandieh; X Zhu
Journal: Science Date: 2001-02-16 Impact factor: 47.728

14 in total

1. Genome-wide analysis of intronless genes in rice and Arabidopsis.

2. A critical analysis of Atoh7 (Math5) mRNA splicing in the developing mouse retina.

3. Definition of global and transcript-specific mRNA export pathways in metazoans.

4. The roles and evolutionary patterns of intronless genes in deuterostomes.

5. The (in)dependence of alternative splicing and gene duplication.

6. The consensus 5' splice site motif inhibits mRNA nuclear export.

Review 7. Mammalian introns: when the junk generates molecular diversity.

8. Genome-wide analysis of European sea bass provides insights into the evolution and functions of single-exon genes.

9. Full-length minor ampullate spidroin gene sequence.

10. Blueprint for a high-performance biomaterial: full-length spider dragline silk genes.