
AlterORF: a database of alternate open reading frames.

Inti Pedroso, Gustavo Rivera, Felipe Lazo, Max Chacón, Francisco Ossandón, Felipe A Veloso, David S Holmes.

Abstract

AlterORF is a searchable database that contains information regarding alternate open reading frames (ORFs) for over 1.5 million genes in 481 prokaryotic genomes. The objective of the database is to provide a platform for improving genome annotation and to serve as an aid for the identification of prokaryotic genes that potentially encode proteins in more than one reading frame. The AlterORF Database can be accessed through a web interface at www.alterorf.cl.

Year: 2007 PMID: 18096612 PMCID: PMC2238990 DOI: 10.1093/nar/gkm886

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

A DNA sequence contains six potential open reading frames (ORFs), three on one strand and three on the reverse strand. Typically, only one of the six is actually expressed, because it is associated with the genetic signals that specify the DNA strand and the reading frame to be transcribed and translated. Exceptions occur in which more than one open reading frame is translated into a protein, as has long been observed in viral genes, where it was suggested that this property permits a high packing density of information (1).

However, analysis of the coding potential of 481 prokaryotic genomes revealed a surprisingly high frequency of alternate ORFs within annotated genes, especially in G + C-rich genomes, where almost every annotated ORF exhibits an alternative ORF that could potentially encode a protein of 100 amino acids or more (2). This frequency raises the possibility that alternate ORFs could be exploited to evolve novel genetic information, and it is important to be able to detect this potential. The same high frequency also creates serious problems for gene annotation, because the wrong ORF may inadvertently be annotated as the coding sequence. This potential for error is especially problematic when automatic gene prediction programs are used to annotate genomes, but errors can also slip past human annotators. The problem is exacerbated if an alternative ORF is mis-annotated and the error is propagated into subsequent genome annotations.

AlterORF provides a searchable database of all possible alternative ORFs in sequenced prokaryotic genomes that are potentially capable of encoding proteins of 100 amino acids or more. The objectives are 2-fold: to improve genome annotation by indicating possible errors in ORF identification and, perhaps more important in the long term, to predict instances of genes that could potentially give rise to more than one protein.
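As a minimal illustration of the six reading frames described at the start of this section, the sketch below splits a toy DNA sequence into codons in each frame. The frame labels (+1 to +3 on the given strand, −1 to −3 on its reverse complement) and the function names are assumptions made for this example; they are not taken from the AlterORF implementation.

```python
# Illustrative sketch: enumerate the six reading frames of a DNA sequence.
# Frame labels are an assumed convention: +1..+3 read the given strand,
# -1..-3 read its reverse complement, each starting at offset 0, 1 or 2.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Map each frame label to the list of complete codons read in that frame."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


frames = six_frames("ATGGCCTAA")  # toy sequence, not a real gene
for name, codons in frames.items():
    print(name, codons)
```

Only the frame carrying the annotated gene is normally expressed; the other five are the candidates examined for alternate ORFs.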

DATABASE CONSTRUCTION

Annotated protein-coding genes were extracted from completely sequenced prokaryotic genomes in the Genome Database of NCBI. All alternative ORFs potentially encoding 100 amino acids or more were extracted from each gene sequence using Perl scripts and the BioPerl Application Programming Interface (API) (3). Each alternative ORF was translated in silico using the standard genetic code, and the resulting amino acid sequence was searched for similarity against completely sequenced prokaryotic genomes (4) and for conserved domains and motifs using CDD (5), PFAM (6), COG (7), KOG (8), SMART (9) and UniProt (10). Sequence families of alternate ORFs were built by hierarchical clustering with hcluster_sg, software developed as part of the TreeFam project (11). BLAST e-values were normalized to a scale of 0 to 100 (with 100 corresponding to an e-value of 0.0). The resulting information was stored in a relational database built with Microsoft SQL Server 2005.
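The extraction step can be sketched as follows. The original pipeline used Perl and BioPerl; this Python version is only an illustration and rests on several assumptions that the text does not confirm: an alternate ORF is taken to be any stop-free run of at least `min_aa` codons (no start codon is required), the annotated gene is assumed to occupy frame +1, and coordinates are 0-based offsets on the strand being read. The function name `alternate_orfs` is hypothetical.

```python
# Sketch under stated assumptions, not the AlterORF implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def alternate_orfs(gene, min_aa=100):
    """Yield (frame, start, end) for stop-free runs of at least min_aa codons.

    `gene` is the annotated coding sequence, so frame +1 (the annotated
    frame) is skipped and only the five alternate frames are scanned.
    `start` and `end` are 0-based, end-exclusive offsets on the strand read.
    """
    for sign, strand in (("+", gene), ("-", reverse_complement(gene))):
        for offset in range(3):
            frame = f"{sign}{offset + 1}"
            if frame == "+1":
                continue
            run_start = offset
            for pos in range(offset, len(strand) - 2, 3):
                if strand[pos:pos + 3] in STOP_CODONS:
                    # A stop codon closes the current stop-free run.
                    if (pos - run_start) // 3 >= min_aa:
                        yield frame, run_start, pos
                    run_start = pos + 3
            # Report the run that reaches the last complete codon, if long enough.
            end = offset + ((len(strand) - offset) // 3) * 3
            if (end - run_start) // 3 >= min_aa:
                yield frame, run_start, end


# Toy usage with a lowered threshold; real genes would use the default of 100.
toy_gene = "ATG" + "GCC" * 5 + "TAA"
for frame, start, end in alternate_orfs(toy_gene, min_aa=3):
    print(frame, start, end)
```

A production pipeline would additionally translate each reported run and pass the protein sequence to the similarity and domain searches described above.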

DATABASE CONTENTS

Release 1.0 (September 2007) contains approximately 1.5 million annotated genes from 481 organisms and about 3 million alternate ORFs. Of these, 942 856 (33%) occur in frame −1, 621 306 (21%) in frame −2, 322 284 (11%) in frame −3, 350 805 (12%) in frame +2 and 675 525 (23%) in frame +3. The following are provided for each alternate ORF sequence: (i) conserved domains and motifs from CDD (5), PFAM (6), COG (7), KOG (8), SMART (9) and UniProt (10) and (ii) BLAST results against annotated sequences in completely sequenced prokaryotic genomes and against alternate ORFs identified in AlterORF. The cross-genera conservation of some alternate ORFs suggests that they might represent new protein families or domains. Hierarchical clustering (11) was therefore used to build sequence families from conserved alternate ORFs.
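The frame distribution can be rechecked from the counts quoted above. The short calculation below sums the five per-frame counts and recomputes each share; the counts come from the text, while the variable names are chosen for this example. The total is 2 912 776, consistent with "about 3 million", and the recomputed share for frame −1 is roughly 32.4%, so the quoted 33% appears to be rounded up.

```python
# Per-frame counts of alternate ORFs as quoted for Release 1.0.
counts = {"-1": 942856, "-2": 621306, "-3": 322284, "+2": 350805, "+3": 675525}

total = sum(counts.values())
shares = {frame: 100 * n / total for frame, n in counts.items()}

print(f"total alternate ORFs: {total}")
for frame, pct in shares.items():
    print(f"frame {frame}: {pct:.1f}%")
```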

WEB INTERFACE AND SERVICES

The AlterORF database can be accessed through a simple web interface at www.alterorf.cl. The database can be searched by protein ID (derived from NCBI), by organism and by sequence using a sequence search service. In addition, an option is provided to analyze complete genome sequences not present in the database.

Searching by protein ID: a protein ID can be used to recover the original annotated gene as it appeared in the source database (e.g. GenBank), together with any alternate ORF(s) associated with that gene. If alternate ORFs are detected, tables providing information regarding domains, motifs and protein family are displayed with links to further information.

Searching by organism: the user can select an organism from a pull-down menu or index to obtain a pre-analyzed list of annotated protein-coding genes with alternate ORFs.

Searching by protein sequence: a search using a protein sequence can be carried out against all sequences stored in AlterORF using WU-BLAST (blast.wustl.edu/).

Downloading data: all data in the AlterORF database can be freely downloaded by FTP. Additional information on the use of AlterORF can be found in the FAQs and Tutorial sections.

REFERENCES

1. The Bioperl toolkit: Perl modules for the life sciences.

Authors: Jason E Stajich; David Block; Kris Boulez; Steven E Brenner; Stephen A Chervitz; Chris Dagdigian; Georg Fuellen; James G R Gilbert; Ian Korf; Hilmar Lapp; Heikki Lehväslaiho; Chad Matsalla; Chris J Mungall; Brian I Osborne; Matthew R Pocock; Peter Schattner; Martin Senger; Lincoln D Stein; Elia Stupka; Mark D Wilkinson; Ewan Birney
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

2. Evolution of the three overlapping gene systems in G4 and phi X174.

Authors: J C Fiddes; G N Godson
Journal: J Mol Biol Date: 1979-09-05 Impact factor: 5.469

3. A genomic perspective on protein families.

Authors: R L Tatusov; E V Koonin; D J Lipman
Journal: Science Date: 1997-10-24 Impact factor: 47.728

4. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

5. SMART 5: domains in the context of genomes and networks.

Authors: Ivica Letunic; Richard R Copley; Birgit Pils; Stefan Pinkert; Jörg Schultz; Peer Bork
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

6. Pfam: clans, web tools and services.

Authors: Robert D Finn; Jaina Mistry; Benjamin Schuster-Böckler; Sam Griffiths-Jones; Volker Hollich; Timo Lassmann; Simon Moxon; Mhairi Marshall; Ajay Khanna; Richard Durbin; Sean R Eddy; Erik L L Sonnhammer; Alex Bateman
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

7. TreeFam: a curated database of phylogenetic trees of animal gene families.

Authors: Heng Li; Avril Coghlan; Jue Ruan; Lachlan James Coin; Jean-Karim Hériché; Lara Osmotherly; Ruiqiang Li; Tao Liu; Zhang Zhang; Lars Bolund; Gane Ka-Shu Wong; Weimou Zheng; Paramvir Dehal; Jun Wang; Richard Durbin
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

8. The Universal Protein Resource (UniProt).

Authors: UniProt Consortium
Journal: Nucleic Acids Res Date: 2006-11-16 Impact factor: 16.971

9. CDD: a conserved domain database for interactive domain family analysis.

Authors: Aron Marchler-Bauer; John B Anderson; Myra K Derbyshire; Carol DeWeese-Scott; Noreen R Gonzales; Marc Gwadz; Luning Hao; Siqian He; David I Hurwitz; John D Jackson; Zhaoxi Ke; Dmitri Krylov; Christopher J Lanczycki; Cynthia A Liebert; Chunlei Liu; Fu Lu; Shennan Lu; Gabriele H Marchler; Mikhail Mullokandov; James S Song; Narmada Thanki; Roxanne A Yamashita; Jodie J Yin; Dachuan Zhang; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2006-11-29 Impact factor: 16.971

10. The COG database: an updated version includes eukaryotes.

Authors: Roman L Tatusov; Natalie D Fedorova; John D Jackson; Aviva R Jacobs; Boris Kiryutin; Eugene V Koonin; Dmitri M Krylov; Raja Mazumder; Sergei L Mekhedov; Anastasia N Nikolskaya; B Sridhar Rao; Sergei Smirnov; Alexander V Sverdlov; Sona Vasudevan; Yuri I Wolf; Jodie J Yin; Darren A Natale
Journal: BMC Bioinformatics Date: 2003-09-11 Impact factor: 3.169
