
GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network of experts.

Sébastien Aubourg¹, Véronique Brunaud, Clémence Bruyère, Mark Cock, Richard Cooke, Annick Cottet, Arnaud Couloux, Patrice Déhais, Gilbert Deléage, Aymeric Duclert, Manuel Echeverria, Aimée Eschbach, Denis Falconet, Ghislain Filippi, Christine Gaspin, Christophe Geourjon, Jean-Michel Grienenberger, Guy Houlné, Elisabeth Jamet, Frédéric Lechauve, Olivier Leleu, Philippe Leroy, Régis Mache, Christian Meyer, Hafed Nedjari, Ioan Negrutiu, Valérie Orsini, Eric Peyretaillade, Cyril Pommier, Jeroen Raes, Jean-Loup Risler, Stéphane Rivière, Stéphane Rombauts, Pierre Rouzé, Michel Schneider, Philippe Schwob, Ian Small, Ghislain Soumayet-Kampetenga, Darko Stankovski, Claire Toffano, Michael Tognolli, Michel Caboche, Alain Lecharny.

Abstract

Genomic projects heavily depend on genome annotations and are limited by the current deficiencies in the published predictions of gene structure and function. It follows that improved annotation will allow better data mining of genomes, and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project is being performed exhaustively on every member of each gene family. Performing a family-wide annotation makes the task easier and more efficient than a gene-by-gene approach since many features obtained for one gene can be extrapolated to some or all of the other genes of a family. A complete annotation procedure based on the most efficient prediction tools available is being used by 16 partner laboratories, each contributing annotated families from its field of expertise. A database, named GeneFarm, and an associated user-friendly interface to query the annotations have been developed. More than 3000 genes distributed over 300 families have been annotated and are available at http://genoplante-info.infobiogen.fr/Genefarm/. Furthermore, collaboration with the Swiss Institute of Bioinformatics is underway to integrate the GeneFarm data into the protein knowledgebase Swiss-Prot.

Substances: Arabidopsis Proteins

Year: 2005 PMID: 15608279 PMCID: PMC540069 DOI: 10.1093/nar/gki115

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The GeneFarm project was launched in 2001 soon after the announcement of the near-completion of the Arabidopsis thaliana genome (1). The initial annotation released at the same time as the assembled sequence of the five chromosomes was largely a compilation of independent annotations from different members of the Arabidopsis Genome Initiative (AGI) consortium. The generally recognized drawback of this otherwise invaluable resource was that this annotation was often faulty and misleading. Important discrepancies have been identified, for example when these initial annotations were later compared to more time-consuming expert-driven annotations, especially in the definitions of intron–exon boundaries and in erroneous names for genes or gene products (2,3). Owing to the cost in time and money of human expertise, genome annotation has often been restricted to the prediction of coding exons and to the labelling of the deduced protein with the function of its closest homologue (4), resulting in under-annotated databases where errors are often multiplied by a snowball effect (5). Finally, the source of a specific annotation feature, such as whether it originates from external documented information or from prediction software, has rarely been stated (6). During the last four years, the TIGR institute has made available five updated versions of the Arabidopsis chromosome sequences with associated structural and functional annotation (7). The structural semi-automatic annotation has been greatly improved by the development of new prediction software using the rapidly expanding transcript resources, mainly expressed sequence tags (ESTs) and full-length cDNAs (8–10). For functional predictions, TIGR has made an important effort to search known protein motifs and to classify the predicted genes according to the Gene Ontology method (11).
Nevertheless, the computational part of the automatic gene annotation has been largely limited to the use of heterogeneous intrinsic (sequence composition, signals, etc.) and extrinsic (cognate transcripts, similarities, etc.) data. The outcome of the automated annotation process is constrained by general rules defined to limit the number of false positive and false negative predictions. The biological complexity, which includes many atypical situations in gene structure and organization along the chromosomes (alternative events, U12 splicing sites, pseudogenes, micro-exons, overlapping genes, etc.), cannot be described using satisfactory models and this constitutes a significant limitation to the annotation pipelines (7,12). An overview of the last release (TIGR R5.0) of the Arabidopsis chromosomes shows that the associated annotation is still not optimal and, considering its pivotal role as a reference plant resource and as a tool for genomic projects, an improved annotation would certainly be of wide general interest. It would allow better planning and design of future experiments such as high-throughput functional analysis of genes (13) and characterization of interaction networks. Completion and correction of the existing semi-automatic gene prediction will require a more in-depth approach and, for this, the manual intervention of expert biologists is unavoidable (14,15). An expert-based approach is the solution that has been chosen for the construction of the Swiss-Prot library in which the information associated with specific sequences is generated and rigorously controlled by expert annotators (16). This task is time-consuming and limits the quantity of proteins that can be processed. For instance, in July 2004, Swiss-Prot contained 2853 Arabidopsis entries as compared with the 10 times greater number of predicted genes in the Arabidopsis genome.
The goal of GeneFarm is to actively participate in this manual annotation effort and to extend it at the gene/nucleic acid level. The GeneFarm project is based on a network of scientists working in different fields of research allowing an extensive and curated annotation of Arabidopsis nuclear genes. In order to optimize the added value of this human expertise, the annotation process focuses on gene families since many of the features and much of the information mined in the literature or predicted for one gene can often be extrapolated to some or all of the homologous genes (17). Performing a gene family-based annotation makes the task easier and more efficient than a gene-by-gene approach. Indeed, due to their common origin, genes from the same family quite often share the same intron–exon structure. Furthermore, sequence comparisons of all the members of a protein family help to highlight conserved motifs responsible for shared biochemical function(s) and point to specific features characteristic of one or a subset of paralogous genes. The complete functional study of a given gene that belongs to a family of duplicated paralogs (as is frequently the case in plants) should take into consideration its evolutionary relationship with the other members of the family. Therefore, systematically characterizing gene families in Arabidopsis and identifying particular characteristics of each member is an essential step to define orthologous relationships with genes from other plant species.

THE GeneFarm PHILOSOPHY

The main motivating aims during the definition of the GeneFarm database were (i) to obtain a consistent annotation across the different annotators, (ii) to track the annotation sources and (iii) to use a common bioinformatics toolbox to reduce annotation heterogeneity to a minimum. Based on precise evaluations of both the automatic annotation bottlenecks (18) and of the performances of prediction software (19), a minimum annotation protocol (i.e. mandatory steps) was defined. For example, at the gene structure level, the minimum protocol uses the Eugene (20) and GeneMark.hmm (21) programs, which were specially trained with Arabidopsis datasets and showed the best results compared with other programs, both at the exon and model gene levels. Other examples of mandatory steps, but this time at the protein level, are the Predotar program used to predict targeting peptides (22) and a combination of DSC (23), PHD (24) and SOPMA (25) for the prediction of secondary structures. Whatever the annotation steps and the software used, the results are always checked and compared by the biologist partners before being accepted. When available, experimental results coming from participants' laboratories or from publications are given precedence over the results of prediction software. In order to make the loading task easy, robust and traceable, two web submission interfaces were developed for the annotators, one for the gene and a second for the family descriptions. In the GeneFarm database, each piece of information is clearly justified either by experimental proof (unpublished data or bibliographic references), an accession number (motifs, structure, sequence, etc.) or a reference to prediction software. Each biologist partner is in charge of annotating several Arabidopsis gene families that are targets of their own research field.
Often, results have been produced for another purpose, such as research into gene function, but have not been published in a form that is usable for the scientific community. The GeneFarm approach delivers an annotation of high quality with precise and detailed features and numerous links to the pertinent literature. Furthermore, the close examination by an expert annotator ensures that the best and most up-to-date nomenclature and ontology are used to name all the genes of the same family. In GeneFarm, the definition of a gene family is based on sequence similarities and on evidence for a common evolutionary origin (homology). The boundary between different families is not always easy to define and the expert annotators play an important role in defining this. Some of the GeneFarm partners are involved in methodological approaches, which provide additional aid for the identification of homologous genes. For example, the PHYTOPROT resource, in which all available plant proteins are clustered by an all-by-all systematic comparison (26), is being used as a starting point to define gene families, and a comparison of predicted secondary structures is also being exploited with the aim of detecting highly divergent homologous proteins (27).
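
The traceability and precedence rules described in this section (every piece of information carries its justification, and experimental results override software predictions) can be illustrated with a short Python sketch. This is not the GeneFarm implementation: the names `EvidenceKind`, `Annotation` and `select_annotation` are hypothetical, the placement of accession-based evidence between the two other kinds is an assumption made only for this example, and the sample values are invented.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class EvidenceKind(IntEnum):
    # Higher value = higher precedence. Only "experimental beats prediction"
    # comes from the text; the middle rank is an assumption of this sketch.
    PREDICTION = 1    # reference to prediction software
    ACCESSION = 2     # accession number (motif, structure, sequence, ...)
    EXPERIMENTAL = 3  # unpublished data or bibliographic reference


@dataclass(frozen=True)
class Annotation:
    feature: str         # e.g. "protein localization"
    value: str           # the annotated value
    kind: EvidenceKind   # what justifies this value
    source: str          # program name, accession number, or citation


def select_annotation(candidates: List[Annotation]) -> Optional[Annotation]:
    """Return the candidate with the strongest evidence, or None if empty."""
    if not candidates:
        return None
    return max(candidates, key=lambda a: a.kind)


# Invented example values, for illustration only.
candidates = [
    Annotation("protein localization", "mitochondrion",
               EvidenceKind.PREDICTION, "Predotar"),
    Annotation("protein localization", "chloroplast",
               EvidenceKind.EXPERIMENTAL, "unpublished data"),
]
best = select_annotation(candidates)
print(best.value, "justified by", best.source)
```

Storing the evidence kind and its source next to each value is what makes the annotation traceable: a reader can always tell whether a feature rests on an experiment or on a program.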

THE CONTENT OF THE GeneFarm DATABASE

The GeneFarm database contains gene entries and family entries. The family entries contain the description of the families including common features shared by all of the homologous genes (signature, biochemical function, keywords, paper review, etc.). The gene entries contain the complete annotation of the genes including data specific to each gene. This information is organized into different sections: target plant and genomic sequence, gene name and synonym(s), references to all cognate transcripts, intron–exon structure(s), deduced protein(s), regulatory motifs in promoters, biochemical function, protein localization, motifs and domains, secondary structure, post-translational maturation sites, biological function(s), mutant phenotype(s), expression condition(s), cross-references with other databases and bibliographic references. Each gene entry is linked to its corresponding family entry. Currently, the GeneFarm database contains more than 3000 gene entries distributed among 300 gene families (Figure 1). The sizes of the families range from 2 paralogous genes (40% of cases, a consequence of the ancient duplication which affected almost the entire Arabidopsis genome) to 270 members (the cytochrome P450 family). An overview of the GeneFarm database shows that the annotation of the gene entries includes more than 35 000 cross-references with GenBank/EMBL/DDBJ, 750 with Swiss-Prot, 6000 with motif databases, 2700 literature references, 3500 transcription proofs and detailed descriptions of more than 1700 expression conditions. The GeneFarm website contains a list of the annotator partners and their assigned families, lists of annotated gene and gene family entries and an interface to query the database. This interface allows access to genes and gene families using their names, their AGI or GeneFarm accession number (GF AC), keywords, expression conditions or sequence comparisons (with BLAST).
The results pages display the annotations with dynamic web links to the referenced databases. There are links between the genes and their corresponding family. In order to help users quickly gain a general idea of the extent of the annotation, in terms of detail and experimental support, two scores (from 1 to 5) have been defined for structural and functional annotations. Figure 2 shows the distribution of the annotated genes as a function of these two scores and describes the scoring system in more detail.
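
The two entry types and the link between them, as described in this section, might be modelled as follows. This is a minimal sketch under stated assumptions: the class and field names are hypothetical and do not reflect the actual GeneFarm relational schema, only a few of the listed sections are represented, and all identifiers and keywords in the toy data are invented.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FamilyEntry:
    accession: str                                      # invented family identifier
    keywords: List[str] = field(default_factory=list)   # features shared by the family


@dataclass
class GeneEntry:
    accession: str   # placeholder for a GeneFarm accession number (GF AC)
    locus: str       # placeholder for an AGI identifier
    family: str      # accession of the family entry this gene is linked to


def members(family: FamilyEntry, genes: List[GeneEntry]) -> List[GeneEntry]:
    """Follow the gene-to-family link in the reverse direction."""
    return [g for g in genes if g.family == family.accession]


def size_distribution(genes: List[GeneEntry]) -> Dict[int, int]:
    """Map each family size to the number of families of that size."""
    per_family = Counter(g.family for g in genes)
    return dict(Counter(per_family.values()))


# Toy data: identifiers and keywords are invented for illustration only.
fam_a = FamilyEntry("famA", ["example keyword"])
genes = [
    GeneEntry("g1", "locus1", "famA"),
    GeneEntry("g2", "locus2", "famA"),
    GeneEntry("g3", "locus3", "famB"),
    GeneEntry("g4", "locus4", "famB"),
    GeneEntry("g5", "locus5", "famB"),
]
print([g.accession for g in members(fam_a, genes)])
print(size_distribution(genes))
```

`size_distribution` computes the kind of summary plotted in Figure 1, where families are grouped by their number of annotated paralogs.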

Figure 1

Distribution of the gene families in the GeneFarm database according to the number of annotated paralogs in the Arabidopsis thaliana genome.

Figure 2

Distribution of the genes annotated in the GeneFarm database according to their scores at the structural and functional levels. The structural score depends on the origin of the annotated intron–exon structure: s1, prediction software only; s2, prediction software and similarities with homologous genes; s3, the gene structure is partially covered by a transcript (EST, RT–PCR product, etc.); s4, the whole CDS is covered by a transcript; and s5, a cognate full-length cDNA is available (TSS and UTR are known). The functional score: f1, unknown function (no information); f2, some predicted clues (motif, signal, etc.); f3, similarities with a known gene; f4, biochemical function proved; and f5, biological function experimentally shown.
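
The scoring scale in this caption can be expressed as two small functions. The sketch below is an illustration only: the class and field names are hypothetical, and the rule that a gene receives the highest level for which it has evidence is an assumption, since the caption does not state how overlapping evidence is resolved.

```python
from dataclasses import dataclass


@dataclass
class StructuralEvidence:
    homolog_similarity: bool = False   # model agrees with homologous genes
    partial_transcript: bool = False   # EST, RT-PCR product, etc. covers part of the gene
    cds_fully_covered: bool = False    # a transcript covers the whole CDS
    full_length_cdna: bool = False     # cognate full-length cDNA (TSS and UTR known)


@dataclass
class FunctionalEvidence:
    predicted_clues: bool = False              # motif, signal, etc.
    similar_to_known_gene: bool = False
    biochemical_function_proved: bool = False
    biological_function_shown: bool = False


def structural_score(ev: StructuralEvidence) -> int:
    """Return 1..5 following the s1..s5 scale; strongest evidence wins (assumed)."""
    if ev.full_length_cdna:
        return 5
    if ev.cds_fully_covered:
        return 4
    if ev.partial_transcript:
        return 3
    if ev.homolog_similarity:
        return 2
    return 1  # prediction software only


def functional_score(ev: FunctionalEvidence) -> int:
    """Return 1..5 following the f1..f5 scale; strongest evidence wins (assumed)."""
    if ev.biological_function_shown:
        return 5
    if ev.biochemical_function_proved:
        return 4
    if ev.similar_to_known_gene:
        return 3
    if ev.predicted_clues:
        return 2
    return 1  # unknown function


# A gene with a partial EST and a motif hit but no characterized homologue.
print(structural_score(StructuralEvidence(partial_transcript=True)),
      functional_score(FunctionalEvidence(predicted_clues=True)))
```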

EXAMPLES OF ADDED VALUE

One of the strong points concerning GeneFarm is that annotators are members of a coordinated project with regular work meetings. Therefore, the work is not redundant and is of controlled quality. We have tried to estimate the gain in annotation quality of the expert annotation compared to the semi-automatic annotation. It is evident that the gain should be higher for the functional annotation as compared to the structural one. Nevertheless, the former cannot be quantified and therefore we only present results of a systematic comparison of the GeneFarm and the TIGR CDS structures. Structural differences have been observed for 751 genes out of the 3501 that have been re-annotated (21%) within the framework of GeneFarm. Differences are more frequently observed for genes that do not have cognate cDNA or EST sequences. Indeed, 254 out of 870 (29%) genes without transcript support differ in their CDS structure between the TIGR and GeneFarm resources. A concrete example of the contribution of the GeneFarm effort is the collective ongoing annotation of the PPR family (PentatricoPeptide Repeat proteins). This huge family of 442 proteins is characterized by a complex arrangement of short motifs (28,29) deciphered using two different bioinformatics approaches, the MEME/MAST and the HMMER packages. In the TIGR annotation, most of the PPR genes are tagged by the motif PF01535 from the Pfam database (30). The structural annotation of this family is particularly poorly done by automatic procedures. Even the unique motif PF01535 does not cover all the repeats defined by the GeneFarm experts. Examples of the corrections proposed in GeneFarm for regions containing misleadingly annotated PPR genes are illustrated in Figure 3A and B.
GeneFarm contains the complete re-annotation of a subgroup of 89 genes of the PPR family, named PCMP-H, and will soon contain data for the whole PPR family and thus provide a unified annotation reflecting as accurately as possible the complex structural organization of these proteins.
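
A systematic comparison of CDS structures between two annotation sets, as reported above, could be sketched as follows. This is an assumption-laden illustration rather than the procedure used by the authors: the function names are hypothetical, two models are treated as identical only when all exon coordinates match, genes are paired by a shared identifier (which ignores the gene fusion and split cases discussed in the text), and the coordinates in the toy data are invented. The last lines simply recompute the two percentages quoted in the text.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]   # (start, end) coordinates of one CDS exon
CdsModel = List[Exon]    # all CDS exons of one gene model


def same_cds(a: CdsModel, b: CdsModel) -> bool:
    """Two models agree only if their exon coordinates are identical."""
    return sorted(a) == sorted(b)


def fraction_differing(ref: Dict[str, CdsModel],
                       other: Dict[str, CdsModel]) -> float:
    """Fraction of genes present in both sets whose CDS structures differ."""
    shared = ref.keys() & other.keys()
    if not shared:
        return 0.0
    differing = sum(1 for g in shared if not same_cds(ref[g], other[g]))
    return differing / len(shared)


# Toy data with invented coordinates: geneB differs at one exon boundary.
set_one = {"geneA": [(100, 200), (300, 400)], "geneB": [(10, 50)]}
set_two = {"geneA": [(100, 200), (300, 400)], "geneB": [(10, 60)]}
print(fraction_differing(set_one, set_two))

# Percentages quoted in the text, recomputed from the reported counts.
print(round(100 * 751 / 3501), round(100 * 254 / 870))
```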

Figure 3

Examples of corrections to TIGR annotations proposed by GeneFarm. (A) Fusion of two PPR genes revealed by a detailed definition of the repeat motifs (four different matrices have been defined by GeneFarm annotators to exhaustively tag all the repeat motifs of the PPR family), presence of C-terminal DYW motifs and cognate transcripts. (B) The consequence of this fusion of a PPR gene with a downstream gene is the attribution of a function on the basis of the presence of Pfam motifs PF03765 and PF00650. GeneFarm suggests two genes instead of one based on the presence of a C-terminal DYW motif in the first gene. The second gene has not been re-annotated in the framework of GeneFarm. (C) Gene fusion and erroneous exon boundaries. The GeneFarm corrections are supported by the fact that the gene model is shared by other members of the CYP sub-group, a cognate EST and better scores with the Pfam motif PF00067. Blue arrows and lines: CDS exons and introns, respectively. Brown arrows: Pfam motifs mapped to exons. Pink arrows: transcript sequences. Other arrows: different types of PPR repeats.

Incorrect structural annotations often lead to erroneous functional labelling of genes. For example, the gene PCMP-H16 (GF AC 3179) is annotated as being homologous to the yeast SEC14 cytosolic factor in the TIGR annotation due to the fusion of two genes to create a single predicted gene AT5G04780 (Figure 3B). This type of error is easily detected by expert analysis. More surprisingly, even in the case of well-known families with relatively high sequence conservation, erroneous gene predictions can be found that are contradicted by cognate transcripts, by the conserved positions of introns between paralogues and by the presence of Pfam motifs, as illustrated by the cytochrome P450 AT4G20240 (Figure 3C). Since molecular data are sometimes lacking, the gene models in the GeneFarm database should still be considered as predictions in many cases. However, we believe that owing to the manual comparisons performed between the different prediction approaches, including those carried out using TIGR and MIPS, together with the extensive analysis of the families by the annotators, the gene models have a high probability of corresponding to the real gene structure.

CONCLUSION

The GeneFarm network has carried out checked, curated, justified, homogeneous and deep annotation of more than 3000 nuclear genes distributed among 300 complete gene families in Arabidopsis. This resource is organized in a relational database and available on the GeneFarm website at http://genoplante-info.infobiogen.fr/Genefarm/. All the annotations corresponding to the protein sequences are also available in the UniProt knowledgebase (31). One of the partners of the project, the Swiss Institute of Bioinformatics (SIB), acts in synergy with GeneFarm annotators in order to improve annotations and to provide the scientific community with high-quality protein data via Swiss-Prot entries. To benefit from this dual expertise, a special DR (Database cross-Reference) line has been added to Swiss-Prot entries to point to the corresponding GeneFarm entries. Reciprocally, each GeneFarm entry is cross-referenced to the relevant Swiss-Prot entry. Furthermore, the FLAGdb++ database (32) provides a graphical visualization of the GeneFarm gene structures in the context of the TIGR annotation. The GeneFarm project aims to provide, during the year 2005, a complete and detailed biological description of about 5500 Arabidopsis nuclear genes and more than 450 gene families. GeneFarm participates in the demanding collection of expert annotations also performed using TAIR (33) and AtGDB (34). In the long term, it will be important to enlarge the GeneFarm effort to other plant species and, thus, to provide a database for curated orthologous relationships across the plant kingdom.


1. Open annotation offers a democratic solution to genome sequencing.

Authors: T Hubbard; E Birney
Journal: Nature Date: 2000-02-24 Impact factor: 49.962

2. Predotar: A tool for rapidly screening proteomes for N-terminal targeting sequences.

Authors: Ian Small; Nemo Peeters; Fabrice Legeai; Claire Lurin
Journal: Proteomics Date: 2004-06 Impact factor: 3.984

3. GeneMark.hmm: new solutions for gene finding.

Authors: A V Lukashin; M Borodovsky
Journal: Nucleic Acids Res Date: 1998-02-15 Impact factor: 16.971

4. The challenges of genome sequence annotation or "the devil is in the details".

Authors: T F Smith; X Zhang
Journal: Nat Biotechnol Date: 1997-11 Impact factor: 54.908

5. SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments.

Authors: C Geourjon; G Deléage
Journal: Comput Appl Biosci Date: 1995-12

6. Identification and application of the concepts important for accurate and reliable protein secondary structure prediction.

Authors: R D King; M J Sternberg
Journal: Protein Sci Date: 1996-11 Impact factor: 6.725

7. In Arabidopsis thaliana, 1% of the genome codes for a novel protein family unique to plants.

Authors: S Aubourg; N Boudet; M Kreis; A Lecharny
Journal: Plant Mol Biol Date: 2000-03 Impact factor: 4.076

8. Prediction of protein secondary structure at better than 70% accuracy.

Authors: B Rost; C Sander
Journal: J Mol Biol Date: 1993-07-20 Impact factor: 5.469

9. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Authors:
Journal: Nature Date: 2000-12-14 Impact factor: 49.962

10. Genome-wide analysis of Arabidopsis pentatricopeptide repeat proteins reveals their essential role in organelle biogenesis.

Authors: Claire Lurin; Charles Andrés; Sébastien Aubourg; Mohammed Bellaoui; Frédérique Bitton; Clémence Bruyère; Michel Caboche; Cédrig Debast; José Gualberto; Beate Hoffmann; Alain Lecharny; Monique Le Ret; Marie-Laure Martin-Magniette; Hakim Mireau; Nemo Peeters; Jean-Pierre Renou; Boris Szurek; Ludivine Taconnat; Ian Small
Journal: Plant Cell Date: 2004-07-21 Impact factor: 11.277
