J L Ashurst, C-K Chen, J G R Gilbert, K Jekosch, S Keenan, P Meidl, S M Searle, J Stalker, R Storey, S Trevanion, L Wilming, T Hubbard.
Abstract
The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) has been designed to be a community resource for browsing manual annotation of finished sequences from a variety of vertebrate genomes. Its core database is based on an Ensembl-style schema, extended to incorporate curation-specific metadata. In collaboration with the genome sequencing centres, Vega attempts to present consistent high-quality annotation of the published human chromosome sequences. It is also possible to view various finished regions from other vertebrates, including mouse and zebrafish. Vega displays only manually annotated gene structures built using transcriptional evidence, which can be examined in the browser. Attempts have been made to standardize the annotation procedure across each vertebrate genome, which should aid comparative analysis of orthologues across the different finished regions.
Year: 2005 PMID: 15608237 PMCID: PMC540089 DOI: 10.1093/nar/gki135
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The VEGA annotation pipeline. The pipeline shown here is for human. The automated analysis for other species has slight differences. The searches are run on our computer farm and stored in an Ensembl MySQL database using the Ensembl analysis pipeline system (20). Nearly all searches and prediction algorithms are run on repeat-masked sequence, the exception being CpG island prediction [see cpgreport in the EMBOSS (21) application suite]. RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html) is used to mask interspersed repeats, followed by TRF (22) to mask tandem repeats. Nucleotide sequence databases are searched with wuBLASTN (http://blast.wustl.edu), and significant hits are re-aligned to the unmasked genomic sequence using est2genome (23). The UniProt protein database (http://www.uniprot.org) is searched with wuBLASTX, and the accession numbers of significant hits are looked up in the Pfam database (24). The hidden Markov models for Pfam protein domains are aligned against the genomic sequence using Genewise (25) to provide annotation of protein domains (Halfwise in the figure). We also run a number of ab initio prediction algorithms: genscan (26) and fgenesh (27) for genes, tRNAscan (28) to find tRNA genes and Eponine TSS (29), which predicts transcription start sites. The annotators use the Otterlace interface to create and edit genes, which are stored in the Otter database (13). Where predicted transcript structures from Ensembl are available, these can be viewed from within the Otterlace interface and may be used as starting templates for gene curation. Annotation in the Otter database is submitted to the EMBL/GenBank/DDBJ nucleotide database. The database for the VEGA website is periodically created by a publishing process that involves the copying and reformatting of data from the Otter genes and automated pipeline databases.
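The masking step described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the Vega or Ensembl pipeline code: the function names, the toy sequence and the repeat coordinates are all hypothetical, and the real tools (RepeatMasker, TRF) produce their own output formats that would need parsing first. The sketch only shows the idea of combining interspersed and tandem repeat intervals, masking them before a search, and keeping the unmasked sequence for later re-alignment.

```python
from typing import List, Tuple

# Hypothetical representation: 0-based, half-open [start, end) coordinates.
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def hard_mask(sequence: str, intervals: List[Interval], mask_char: str = "N") -> str:
    """Return a copy of `sequence` with every base inside `intervals` replaced."""
    chars = list(sequence)
    for start, end in merge_intervals(intervals):
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


# Toy data (assumed for illustration only).
genomic = "ACGT" + "ACGT" + "T" * 8 + "ACGT" + "AC" * 4 + "GG"
interspersed_repeats = [(4, 8)]        # stand-in for RepeatMasker output
tandem_repeats = [(8, 16), (20, 28)]   # stand-in for TRF output

# Searches would run on the masked copy; the original `genomic` string is
# kept so that significant hits can be re-aligned to unmasked sequence.
masked = hard_mask(genomic, interspersed_repeats + tandem_repeats)
print(masked)
```

Keeping the masked and unmasked sequences side by side mirrors the caption's description: similarity searches avoid spurious repeat hits, while re-alignment of significant hits uses the full genomic sequence.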
Figure 2. Curated Locus Report giving information about the PAX2 locus on chromosome 10.
Figure 3. ContigView webpage for human chromosome 6 in Vega, displaying poly(A) signals/sites and SNPs associated with the SLC29A1 and HSPCB loci.
Figure 4. Chromosomes and regions annotated for the three vertebrates currently available in Vega.