Jennifer Harrow, France Denoeud, Adam Frankish, Alexandre Reymond, Chao-Kung Chen, Jacqueline Chrast, Julien Lagarde, James G R Gilbert, Roy Storey, David Swarbreck, Colette Rossier, Catherine Ucla, Tim Hubbard, Stylianos E Antonarakis, Roderic Guigo.
Abstract
BACKGROUND: The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results.
Year: 2006 PMID: 16925838 PMCID: PMC1810553 DOI: 10.1186/gb-2006-7-s1-s4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. The GENCODE pipeline. This schematic diagram shows the flow of data between the three groups involved in the GENCODE consortium (HAVANA, IMIM and Geneva) to produce an experimentally verified annotation of the ENCODE region.
Figure 2. Experimental validation of HAVANA annotation. 'Known' and 'Novel_CDS' loci were submitted to 5' RACE, and 'Novel transcript' and 'Putative' loci were submitted to RT-PCR on all their exon junctions, followed by bi-directional RACE. Several steps of reannotation were performed during the process of experimental verification: the figure shows the update of the annotation between the first release in April 2005 and the release from October 2005.
Analysis of RefSeq and ENSEMBL ENCODE annotation compared with GENCODE
| | RefSeq | ENSEMBL |
| No. (unique) exons | 3,984 | 4,734 |
| No. transcripts | 577 | 738 |
| No. exons overlapping GENCODE exons (%) | 5,118 (98.6%) | 4,469 (94.4%) |
| No. transcripts overlapping GENCODE (%) | 567 (98.3%) | 675 (91.5%) |
| No. GENCODE exons overlapped (total = 8,865) (%) | 7,084 (80.0%) | 7,450 (84.0%) |
| No. GENCODE transcripts overlapped (total = 2,608) (%) | 2,327 (89.2%) | 2,395 (91.8%) |
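The percentages in the table can be recomputed from the counts it lists. The sketch below is only an arithmetic check, not part of the original analysis. It assumes that each percentage is the overlap count divided by the total shown for the same set (for example, 567 of 577 RefSeq transcripts). The RefSeq exon row is left out because its overlap count (5,118) is larger than the unique exon count listed (3,984), so its denominator cannot be recovered from the table as shown. The names `pct`, `ROWS`, and `mismatches` are illustrative.

```python
def pct(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)

# (label, overlapping count, assumed total, percentage printed in the table)
# All counts are copied from the table; the pairing of each count with its
# total is an assumption of this sketch.
ROWS = [
    ("ENSEMBL exons overlapping GENCODE exons", 4469, 4734, 94.4),
    ("RefSeq transcripts overlapping GENCODE", 567, 577, 98.3),
    ("ENSEMBL transcripts overlapping GENCODE", 675, 738, 91.5),
    ("GENCODE exons overlapped by RefSeq", 7084, 8865, 80.0),
    ("GENCODE exons overlapped by ENSEMBL", 7450, 8865, 84.0),
    ("GENCODE transcripts overlapped by RefSeq", 2327, 2608, 89.2),
    ("GENCODE transcripts overlapped by ENSEMBL", 2395, 2608, 91.8),
]

# Collect rows whose recomputed value differs from the printed one.
mismatches = []
for label, part, total, reported in ROWS:
    computed = pct(part, total)
    if abs(computed - reported) > 1e-9:
        mismatches.append((label, computed, reported))
    print(f"{label}: {part:,}/{total:,} = {computed}% (table: {reported}%)")
```

Under these assumptions every row reproduces the printed value except the RefSeq overlap of GENCODE exons, where 7,084 of 8,865 rounds to 79.9% rather than 80.0%.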
Figure 3. Comparison of GENCODE transcript annotation with RefSeq and ENSEMBL. The exact agreement between GENCODE and RefSeq and GENCODE and ENSEMBL exons, introns, and nucleotides (NT) for the full transcripts or only the coding parts of the transcripts (CDS) is represented: in blue is the fraction found only in GENCODE, in green the fraction common between GENCODE and the other set (RefSeq or ENSEMBL) and in red the fraction found only in the other set (RefSeq or ENSEMBL) but not in GENCODE. The RefSeq set only contained the curated transcripts tagged with the NM prefix.
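The three-way split described for Figure 3 (only in GENCODE, common, only in the other set) can be illustrated with plain set operations. This is a minimal sketch, not the method used in the paper. It assumes that a feature is an exon identified by a `(chromosome, start, end, strand)` tuple, that "exact agreement" means identical tuples, and that fractions are taken over the union of both sets. The function name `exact_agreement` and the coordinates are hypothetical.

```python
def exact_agreement(gencode, other):
    """Split the union of two feature sets into three fractions.

    Returns (only_gencode, common, only_other), each as a fraction of the
    union. Features match only when their tuples are identical.
    """
    gencode, other = set(gencode), set(other)
    union = gencode | other
    if not union:
        return (0.0, 0.0, 0.0)
    n = len(union)
    return (
        len(gencode - other) / n,  # "blue": found only in GENCODE
        len(gencode & other) / n,  # "green": common to both sets
        len(other - gencode) / n,  # "red": found only in the other set
    )

# Hypothetical exon coordinates, invented for illustration only.
gencode_exons = [
    ("chr21", 100, 200, "+"),
    ("chr21", 300, 400, "+"),
    ("chr21", 500, 650, "+"),
    ("chr21", 800, 900, "+"),
]
other_exons = [
    ("chr21", 100, 200, "+"),
    ("chr21", 300, 400, "+"),
    ("chr21", 500, 640, "+"),  # end differs by 10 bp, so not an exact match
]

only_gencode, common, only_other = exact_agreement(gencode_exons, other_exons)
print(only_gencode, common, only_other)
```

The same function applies to introns if each intron is given as a coordinate tuple. A nucleotide-level comparison would need interval arithmetic instead of tuple identity.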
Figure 4. Comparison of GENCODE annotation with automated gene prediction methods. Viewed in Fmap of Acedb. Panel A shows the MAPK1 gene in ENr221. The GENCODE annotated gene structure is represented in green and red, the circled region highlights the different first exon identified by Pairagon (dark pink/blue) and the expanded region shows tiny introns (indicated by arrows) predicted by Ensembl (orange/red). Panel B shows the TRIM22 locus in ENm009. The structure predicted by Pairagon differs from the GENCODE structure and incorporates an unprocessed pseudogene as the final exon (circled). Panel C shows the human ANKRD43 locus in ENr221 for which AceView (light pink/blue), Pairagon and Ensembl all predict a shorter CDS than GENCODE. C ii shows the mouse ANKRD43 locus in which the upstream ATG is conserved. Panel D shows the GENCODE unprocessed pseudogene locus AC087380.14 at which Ensembl predicts a coding gene. The arrow indicates a tiny intron introduced into the prediction to splice around an in-frame premature stop codon. Panel E shows the IFNAR2 locus in ENm005 with GENCODE coding (red/green) and non-coding (all red) variants and AceView predictions. The AceView CDSs differ from GENCODE in several respects: arrow 'a' indicates several transcripts that have their CDS extended to the start of the prediction, upstream of the GENCODE CDS start; arrow 'b' indicates a CDS starting in exon 5 despite the presence of an upstream ATG, which would seem to preclude (re-)initiation from this site; and arrow 'c' indicates a predicted stop codon in the fourth from last exon, which would be likely to make this transcript a target for nonsense-mediated decay (NMD). GENCODE annotation incorporates all these variants but keeps them as transcripts, because CDSs cannot be assigned with certainty. Panel F shows part of the olfactory receptor (OR) cluster in ENm009. Here Pairagon predicts a coding gene at the pseudogene locus OR52Z1P and a multi-exon gene that links separate OR loci (pseudogene locus OR51A1P, coding loci OR52A1 and OR52A5), indicated by arrows.