Sébastien Aubourg, Véronique Brunaud, Clémence Bruyère, Mark Cock, Richard Cooke, Annick Cottet, Arnaud Couloux, Patrice Déhais, Gilbert Deléage, Aymeric Duclert, Manuel Echeverria, Aimée Eschbach, Denis Falconet, Ghislain Filippi, Christine Gaspin, Christophe Geourjon, Jean-Michel Grienenberger, Guy Houlné, Elisabeth Jamet, Frédéric Lechauve, Olivier Leleu, Philippe Leroy, Régis Mache, Christian Meyer, Hafed Nedjari, Ioan Negrutiu, Valérie Orsini, Eric Peyretaillade, Cyril Pommier, Jeroen Raes, Jean-Loup Risler, Stéphane Rivière, Stéphane Rombauts, Pierre Rouzé, Michel Schneider, Philippe Schwob, Ian Small, Ghislain Soumayet-Kampetenga, Darko Stankovski, Claire Toffano, Michael Tognolli, Michel Caboche, Alain Lecharny.
Abstract
Genomic projects depend heavily on genome annotations and are limited by the current deficiencies in the published predictions of gene structure and function. It follows that improved annotation will allow better data mining of genomes, and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project is being performed exhaustively on every member of each gene family. Performing a family-wide annotation makes the task easier and more efficient than a gene-by-gene approach, since many features obtained for one gene can be extrapolated to some or all of the other genes of a family. A complete annotation procedure based on the most efficient prediction tools available is being used by 16 partner laboratories, each contributing annotated families from its field of expertise. A database, named GeneFarm, and an associated user-friendly interface to query the annotations have been developed. More than 3000 genes distributed over 300 families have been annotated and are available at http://genoplante-info.infobiogen.fr/Genefarm/. Furthermore, collaboration with the Swiss Institute of Bioinformatics is underway to integrate the GeneFarm data into the protein knowledgebase Swiss-Prot.
Year: 2005 PMID: 15608279 PMCID: PMC540069 DOI: 10.1093/nar/gki115
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Distribution of the gene families in the GeneFarm database according to the number of annotated paralogs in the Arabidopsis thaliana genome.
Figure 2. Distribution of the genes annotated in the GeneFarm database according to their scores at the structural and functional levels. The structural score depends on the origin of the annotated intron–exon structure: s1, prediction software only; s2, prediction software and similarities with homologous genes; s3, the gene structure is partially covered by a transcript (EST, RT–PCR product, etc.); s4, the whole CDS is covered by a transcript; and s5, a cognate full-length cDNA is available (TSS and UTR are known). The functional score depends on the evidence available for the gene function: f1, unknown function (no information); f2, some predicted clues (motif, signal, etc.); f3, similarities with a known gene; f4, biochemical function proved; and f5, biological function experimentally shown.
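The two five-level scores described in the Figure 2 legend can be represented as ordered enumerations. The following Python sketch is only an illustration under stated assumptions: the class names, the `Annotation` record, the helper functions and the placeholder gene identifiers are hypothetical and do not reflect the actual GeneFarm database schema or interface.

```python
from collections import Counter
from enum import IntEnum
from typing import Iterable, NamedTuple, Union


class StructuralScore(IntEnum):
    """Origin of the annotated intron-exon structure (levels from the Figure 2 legend)."""
    S1 = 1  # prediction software only
    S2 = 2  # prediction software and similarities with homologous genes
    S3 = 3  # gene structure partially covered by a transcript (EST, RT-PCR product, etc.)
    S4 = 4  # whole CDS covered by a transcript
    S5 = 5  # cognate full-length cDNA available (TSS and UTR known)


class FunctionalScore(IntEnum):
    """Level of evidence for the gene function (levels from the Figure 2 legend)."""
    F1 = 1  # unknown function (no information)
    F2 = 2  # some predicted clues (motif, signal, etc.)
    F3 = 3  # similarities with a known gene
    F4 = 4  # biochemical function proved
    F5 = 5  # biological function experimentally shown


class Annotation(NamedTuple):
    """Hypothetical record pairing a gene with its two scores."""
    gene_id: str
    structural: StructuralScore
    functional: FunctionalScore


def parse_score(label: str) -> Union[StructuralScore, FunctionalScore]:
    """Convert a label such as 's3' or 'f5' into the matching enum member."""
    kind, level = label[:1].lower(), int(label[1:])
    if kind == "s":
        return StructuralScore(level)
    if kind == "f":
        return FunctionalScore(level)
    raise ValueError(f"unknown score label: {label!r}")


def score_distribution(annotations: Iterable[Annotation]) -> Counter:
    """Count genes per (structural, functional) label pair, e.g. ('s5', 'f4')."""
    return Counter(
        (a.structural.name.lower(), a.functional.name.lower()) for a in annotations
    )


# Placeholder identifiers and scores, not real GeneFarm entries.
examples = [
    Annotation("gene_a", parse_score("s5"), parse_score("f4")),
    Annotation("gene_b", parse_score("s5"), parse_score("f4")),
    Annotation("gene_c", parse_score("s1"), parse_score("f2")),
]
distribution = score_distribution(examples)
```

Using `IntEnum` keeps the levels ordered, so a filter such as `a.structural >= StructuralScore.S4` selects the genes whose whole CDS is supported by transcript evidence.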
Figure 3. Examples of corrections to TIGR annotations proposed by GeneFarm. (A) Fusion of two PPR genes revealed by a detailed definition of the repeat motifs (four different matrices have been defined by GeneFarm annotators to exhaustively tag all the repeat motifs of the PPR family), the presence of C-terminal DYW motifs and cognate transcripts. (B) The consequence of this fusion of a PPR gene with a downstream gene is the attribution of a function on the basis of the presence of Pfam motifs PF03765 and PF00650. GeneFarm suggests two genes instead of one, based on the presence of a C-terminal DYW motif in the first gene. The second gene has not been re-annotated in the framework of GeneFarm. (C) Gene fusion and erroneous exon boundaries. The GeneFarm corrections are supported by the fact that the gene model is shared by other members of the CYP sub-group, by a cognate EST and by better scores with the Pfam motif PF00067. Blue arrows and lines: CDS exons and introns, respectively. Brown arrows: Pfam motifs mapped to exons. Pink arrows: transcript sequences. Other arrows: different types of PPR repeats.