
Génolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes.

David J Sherman, Tiphaine Martin, Macha Nikolski, Cyril Cayla, Jean-Luc Souciet, Pascal Durrens.

Abstract

The Génolevures online database (http://cbi.labri.fr/Genolevures/ and http://genolevures.org/) provides exploratory tools and curated data sets relative to nine complete and seven partial genome sequences determined and manually annotated by the Génolevures Consortium, to facilitate comparative genomic studies of Hemiascomycete yeasts. The 2008 update to the Génolevures database provides four new genomes in complete (subtelomere to subtelomere) chromosome sequences, 50,000 protein-coding and tRNA genes, and in silico analyses for each gene element. A key element is a novel classification of conserved multi-species protein families and their use in detecting synteny, gene fusions and other aspects of genome remodeling in evolution. Our purpose is to release high-quality curated data from complete genomes, with a focus on the relations between genes, genomes and proteins.

Year:  2008        PMID: 19015150      PMCID: PMC2686504          DOI: 10.1093/nar/gkn859

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Since 1999, the Génolevures Consortium has explored eukaryote genome evolution through the large-scale comparison of manually annotated yeast genomes. The Génolevures on-line database undergoes a major update with every significant data release: in 2004 with 13 partial genomes (1,2), in 2006 with four complete genomes (3,4) and in 2008 with four complete genomes (this work). Our purpose is to release high-quality data sets produced by the 13 partners in the Consortium, rather than to provide a site for integration of data available elsewhere, although we do integrate external data for comparison.

Yeasts are small eukaryotes covering an evolutionary range comparable to the entire Chordate phylum (5), and the unique combination of genetic and genomic tools available for yeasts makes them ideal candidates for experimental study of metabolism, genetic engineering and molecular genetics. All of the many yeasts sequenced so far have small genomes (10–20 Mb), which allows detailed comparative genomics at a relatively modest price. The economic impact of yeasts is widespread: different species are used for the production of beer, wine and bread and, more recently, of various metabolic products such as vitamins, ethanol, citric acid and lipids. Yeasts can degrade hydrocarbons (genera Candida, Yarrowia and Debaryomyces), metabolize xylose (Pichia stipitis), depolymerise tannin extracts (Zygosaccharomyces rouxii), and produce hormones and vaccines in industrial quantities through heterologous gene expression (6). Several human diseases are due to yeast species, among them the Hemiascomycetes Candida albicans, Candida glabrata, Candida tropicalis and even Saccharomyces cerevisiae in immunocompromised patients (7).

The biology of S. cerevisiae has been extensively studied for decades as a model organism for molecular genetics and cell biology studies, and as a cell factory. Its genome (8) is the most thoroughly annotated among eukaryotes, and is a common reference for the annotation of other species. Génolevures focuses on the Hemiascomycete yeasts, a homogeneous phylogenetic group which nonetheless covers a broad range of physiological and ecological lifestyles. Hemiascomycete yeast genes contain introns and alternative splicing is observed (9). Comparative genomics studies in this group have proven informative (10–16); see (5) for review.

The Génolevures Consortium sequences, annotates and analyzes complete genomes from various branches of the Hemiascomycete class, and subjects them to large-scale in silico and experimental comparisons (Table 1). From these comparisons, we produce classifications of genes, proteins and sequences to address questions of molecular evolution, such as gene conservation, specific genes, function conservation and genome remodeling. We do not provide detailed annotations of individual genes and proteins of S. cerevisiae, which are already carefully maintained by MIPS and CYGD (http://mips.gsf.de/projects/fungi) (17) and SGD (http://www.yeastgenome.org/) (18) as well as in general purpose databases such as UniProt (19) and EMBL (20).
Table 1. Summary of genomes available in Génolevures online, and for illustration, the number of protein-coding genes classified by a selection of the different analyses made available

The last five columns give the number of protein-coding genes in each category.

| Species | Assembly | Genome size (Mb) | Predicted proteome | Protein families | Synteny blocks | Gene fusions | Tandem arrays |
|---|---|---|---|---|---|---|---|
| Candida glabrata | Complete | 12.3 | 5210 | 4869 | 4850 | 152 | 120 |
| Debaryomyces hansenii var. hansenii | Complete | 12.2 | 6286 | 5213 | 1155 | 194 | 344 |
| Kluyveromyces lactis var. lactis | Complete | 10.6 | 5088 | 4810 | 4844 | 150 | 96 |
| Kluyveromyces thermotolerans | Complete | 10.4 | 5105 | 4898 | 4937 | | 117 |
| Saccharomyces kluyveri | Complete | 11.3 | 5308 | 5048 | 5177 | | 133 |
| Yarrowia lipolytica | Complete | 20.5 | 6456 | 5036 | 699 | 162 | 148 |
| Zygosaccharomyces rouxii | Complete | 9.8 | 4999 | 4802 | 4884 | | 112 |
| Candida tropicalis | Partial | 15.0 | 1130 (b) | 18.1% | | | |
| Kluyveromyces marxianus | Partial | 14.0 | 1546 (b) | 49.8% | | | |
| Pichia angusta | Partial | 9.0 | 2502 (a) | 18.9% | | | |
| Pichia sorbitophila | Partial | 13.9 | 1593 (a) | 59.2% | | | |
| Saccharomyces bayanus var. uvarum | Partial | 11.5 | 2887 (a) | 97.9% | | | |
| Saccharomyces exiguus | Partial | 18.0 | 1600 (b) | 70.7% | | | |
| Saccharomyces servazzii | Partial | 12.3 | 1535 (b) | 70.2% | | | |
| Saccharomyces cerevisiae | Complete | 12.1 | 5769 | 5381 | 5450 | 172 | 187 |
| Eremothecium (Ashbya) gossypii | Complete | 8.7 | 4718 | 4474 | 4584 | 146 | 79 |
| Pichia stipitis | Complete | 15.4 | 5816 | | | | |
| Schizosaccharomyces pombe | Complete | 14.1 | 5004 | | | 152 | |

The four genomes at the bottom of the table are not produced by Génolevures but are integrated into the online database for comparative purposes.

(a) 0.4 × coverage.

(b) 0.2 × coverage.

NEW GENOMES IN 2008

The Génolevures Consortium released four new genomes in 2008, sequenced and assembled at high coverage and completely annotated by a team of 20 curators (Génolevures Consortium, submitted for publication). These genomes are those of Zygosaccharomyces rouxii, Saccharomyces kluyveri and Kluyveromyces thermotolerans, previously only sequenced to low or partial coverage, and Debaryomyces hansenii, partly resequenced to improve coverage. The former three, plus K. lactis (3) and Ashbya gossypii (21), are members of the Saccharomycetaceae clades that are not descended from the ancestor thought to have undergone a whole-genome duplication (unlike C. glabrata and Saccharomyces cerevisiae). The availability in Génolevures of complete sequences for these five unduplicated genomes will allow close study of the core repertoire of protein families and functions and of the dynamics of genome remodeling.

PROTEIN FAMILIES

Extensive map reshuffling and a wide range of GC compositions from one species to another present a challenge for genome comparison in the yeasts. An essential tool for addressing this challenge is provided by ‘protein families’, a classification of protein-coding gene sequences into phylogenetic groups; members of a family are homologous and in many cases this homology is suggestive of functional similarity. The 48 889 proteins in the predicted proteomes of the nine complete genomes are classified into 7927 families, of which 4369 are common to at least two species. Of these latter families, 2591 are common to all nine species. From these families, other sub-classifications are made, such as the identification of syntenic orthologs in the hemiascomycetous yeasts (Génolevures Consortium, submitted for publication). Families are computed as follows: four complementary sets of all-against-all alignments are produced by the Blast (22) and Smith–Waterman (23) algorithms, with and without filtering for homeomorphy, that is, a common domain architecture along the full length of the proteins (24). Symmetric distance matrices derived from amino acid identities are constructed and submitted to algorithmic clustering using the MCL ‘Markov clustering’ method (25,26) with a range of statistical parameters. These competing partitions are reconciled using the consensus ensemble clustering method of Ref. (27) and manually curated using both literature search and systematic comparisons with the COG (28) and PIRSF (24) classifications. For a given yeast species, each family may be represented by one or several (paralogous) proteins. Protein families are available in Génolevures at http://genolevures.org/fam/ and tables describing all families are available in the Datasets area. Information on each family is accessible on individual pages presenting a graphical representation of the members’ relationships and links to individual members. (See URL construction rules, below.)
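As an illustration of the clustering step described above, the following is a minimal pure-Python sketch of Markov clustering: alternating expansion (matrix squaring) and inflation (element-wise power followed by column normalization) on a column-stochastic matrix. It is not the MCL program used by the authors; the function names, the inflation value, the self-loop weight and the toy similarity matrix are all assumptions made for this example.

```python
def normalize_columns(matrix):
    """Rescale every column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = matrix[i][j] / total if total else 0.0
    return out


def expand(matrix):
    """Expansion step: square the matrix (two-step random walk)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def inflate(matrix, power):
    """Inflation step: element-wise power, then column normalization."""
    return normalize_columns([[x ** power for x in row] for row in matrix])


def markov_cluster(similarity, inflation=2.0, max_iter=50, tol=1e-9):
    """Cluster items from a symmetric similarity matrix (assumed values in [0, 1])."""
    n = len(similarity)
    # Self-loops of weight 1.0 are an assumption of this sketch.
    m = [[1.0 if i == j else similarity[i][j] for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        delta = max(abs(new[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = new
        if delta < tol:
            break
    # Rows that keep non-zero mass act as attractors; the non-zero columns
    # of such a row are its cluster members. Identical member sets are
    # merged; overlapping clusters are not resolved in this sketch.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy identities: proteins 0-2 resemble each other, as do proteins 3-4,
# with no similarity between the two groups.
toy = [
    [0.0, 0.9, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0],
    [0.9, 0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.8, 0.0],
]
print(markov_cluster(toy))  # → [[0, 1, 2], [3, 4]]
```

Running the same function with several inflation values yields the kind of competing partitions that a consensus step would then have to reconcile; higher inflation generally gives finer clusters.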
Phyletic patterns and phylogenetic profiles are shown for each family; the former uses a simple one-letter code for each species as in Ref. (28) to indicate whether the family contains at least one member from that species, and the latter expands this to a numerical count of members for each species. Complementary information, such as GO terms, multiple sequence alignments and motif analyses, is available for almost all families.
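The difference between the two representations can be made concrete with a short sketch. The one-letter codes, the species subset and the example family below are hypothetical and chosen only for illustration; they are not the codes used on the Génolevures site.

```python
from collections import Counter

# Hypothetical one-letter codes for a subset of species (illustration only).
SPECIES_CODES = {
    "Saccharomyces cerevisiae": "S",
    "Candida glabrata": "C",
    "Kluyveromyces lactis": "K",
    "Debaryomyces hansenii": "D",
    "Yarrowia lipolytica": "Y",
}


def phyletic_pattern(member_species, codes=SPECIES_CODES):
    """Presence/absence string: the species letter if present, '-' otherwise."""
    present = set(member_species)
    return "".join(code if species in present else "-"
                   for species, code in codes.items())


def phylogenetic_profile(member_species, codes=SPECIES_CODES):
    """Number of family members per species, in the same species order."""
    counts = Counter(member_species)
    return [counts.get(species, 0) for species in codes]


# A made-up family with two paralogs in S. cerevisiae and one member each
# in K. lactis and Y. lipolytica.
family = [
    "Saccharomyces cerevisiae",
    "Saccharomyces cerevisiae",
    "Kluyveromyces lactis",
    "Yarrowia lipolytica",
]
print(phyletic_pattern(family))      # → S-K-Y
print(phylogenetic_profile(family))  # → [2, 0, 1, 0, 1]
```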

SYNTENY

Although synteny is in general poorly conserved in the Hemiascomycete yeasts (3,29,30), within phylogenetically delimited groups pairwise coverage is high and can be informative for studying genome dynamics. Pairs of homologous chromosomal regions between two species can be identified by comparing gene content and order, using protein families as an indication of gene-level homology within the regions. We identified these ‘pairwise syntenic blocks’ using the i-ADHoRe method (31) on gene homology relations identified using protein families. The approximately 19 000 pairwise synteny blocks obtained in this fashion can be interactively examined using the genome browser, as described in the Datasets area of the web site, and can be downloaded.
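The idea of using family membership as the homology signal when comparing gene order can be sketched as follows. This is a deliberately simplified greedy chaining of collinear anchors, not the method of Ref. (31); the function names, the gap and size thresholds, and the family labels are assumptions made for this example.

```python
def extend_chain(start, anchor_set, used, direction, max_gap):
    """Greedily follow anchors that advance along A and move in one direction along B."""
    chain = [start]
    i, j = start
    while True:
        candidates = [
            (di + dj, (i + di, j + direction * dj))
            for di in range(1, max_gap + 2)
            for dj in range(1, max_gap + 2)
            if (i + di, j + direction * dj) in anchor_set
            and (i + di, j + direction * dj) not in used
        ]
        if not candidates:
            return chain
        _, (i, j) = min(candidates)  # closest next anchor
        chain.append((i, j))


def syntenic_blocks(families_a, families_b, max_gap=1, min_anchors=3):
    """Find collinear runs of genes sharing a family between two chromosomes.

    families_a, families_b: family identifiers in chromosomal gene order.
    Returns a list of blocks, each a list of (index_in_a, index_in_b) anchors.
    """
    positions_b = {}
    for j, family in enumerate(families_b):
        positions_b.setdefault(family, []).append(j)
    anchors = sorted((i, j)
                     for i, family in enumerate(families_a)
                     for j in positions_b.get(family, ()))
    anchor_set, used, blocks = set(anchors), set(), []
    for start in anchors:
        if start in used:
            continue
        # Try both relative orientations and keep the longer chain.
        chain = max((extend_chain(start, anchor_set, used, d, max_gap)
                     for d in (1, -1)), key=len)
        if len(chain) >= min_anchors:
            used.update(chain)
            blocks.append(chain)
    return blocks


# Toy chromosomes: F1-F4 are conserved in the same order, F5-F7 are inverted.
chrom_a = ["F1", "F2", "F3", "F4", "F9", "F5", "F6", "F7"]
chrom_b = ["F7", "F6", "F5", "F8", "F1", "F2", "F3", "F4"]
for block in syntenic_blocks(chrom_a, chrom_b):
    print(block)
# → [(0, 4), (1, 5), (2, 6), (3, 7)]
# → [(5, 2), (6, 1), (7, 0)]
```

A greedy chain is easy to follow but can miss blocks that a statistically grounded method would recover, for example when paralogs create many competing anchors.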

FUSIONS

Chromosomal rearrangements and segmental duplications may result in the creation or destruction of genes, when the breakpoint falls within the boundary of existing genes. ‘Gene fission’ occurs when a gene is broken in this way, and the 5′ (or even the 3′) fragment continues to be transcribed—creating a new, albeit truncated, gene. ‘Gene fusion’ occurs when the chromosomal rearrangement or duplication event leads to a fortuitous combination of previously existing genes, creating a new, longer gene. Fusions and fissions are mechanisms by which new genes can be formed in radical steps that do not obey the tree-like phylogenetic relations between species, and which can lead to significant changes in protein function. These events have to date been rarely taken into account in genome comparisons. Using a new method for detection of genes involved in gene fusion and fission events that makes its computations at the level of paralogous groups (32), we computed a catalogue of such events for complete genomes from the Hemiascomycete yeasts as well as other fungi. Both the paralogous groups of proteins used in the computation, and complete lists of identified fusion/fission events, are now available in Génolevures and are downloadable in the Datasets area.

OTHER DATA SETS

Tandem gene repeats are another means of gene formation. Unlike gene copies resulting from segmental duplications or retrotransposons, truncated or chimeric genes are not observed at the boundaries of tandem gene arrays. Data on tandem gene repeats are available in the Datasets area. The Génosplicing database of spliceosomal introns and intron motifs in Hemiascomycete yeasts (C. Neuvéglise, unpublished data) is available from the Datasets area, and can be used both for the study of splicing patterns and for the development of methods for predicting gene architectures from genomic data. The YETI classification of yeast membrane and transporter proteins (33), defined using evolutionary relationships traced using non-ambiguous functional and phylogenetic criteria derived from the TCDB (34) classification system, is available in the Datasets area. Other analyses, such as the coverage of KEGG pathways (36) considered in (35), are also available.

EXPLORING GÉNOLEVURES DATA

The design of the Génolevures on-line database has been revamped to provide improved tools for gaining insight into the mechanisms of eukaryotic molecular evolution. The key questions in the Génolevures use cases are: What genes exist, as orthologs for my favorite gene or as members of a functional class (keyword search, alignment, homologs)? What is known about a given chromosomal element (chromosomal elements)? What relations exist in a protein family (protein families)? How are the individual genomes organized (maps and genome browser, synteny)? How are genes and proteins classified (data sets for fusions, tandem repeats, introns, pathways, YETI)? Additionally, the web site provides a query system that simultaneously searches for and can return: genes that have or may have a translation product, RNA and other genes that may have a transcription product only, cis-active elements and cross-genome protein families.

AVAILABILITY OF DATA AND URL CONSTRUCTION RULES

All data from the Génolevures web site are freely available, and instructions for proper citation are included in each section. The Génolevures web site is developed using a ‘representational state transfer’ architecture (37) and URLs for individual identified resources built from the database can be constructed systematically. Identified resources include chromosomal elements, such as genes (prefix/elt/Abbrev/Element_identifier), protein families (prefix/fam/Family_identifier), DNA sequences (prefix/seq/Sequence_identifier) and biomolecular pathways (prefix/pathway/Pathway_identifier). Nomenclatures for genome abbreviations, systematic chromosomal element names, and protein family names are described in the Documentation area of the web site. Génolevures uses a bespoke object model mapped to a relational database, and uses the SOFA (38) and GO (39) ontologies extensively. The web interface is developed in Mason. Key software components are provided by the NCBI Blast program suite (22) and the Stein Lab's Generic Genome Browser (40).
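The construction rules above lend themselves to a small helper. The sketch below assumes that "prefix" is the site base URL http://genolevures.org (consistent with the family address given earlier); the function name, the dictionary keys and the placeholder identifiers are hypothetical, and real identifiers should follow the nomenclature in the site's Documentation area.

```python
from urllib.parse import quote

# Assumed value of "prefix": the site base URL given in the article.
PREFIX = "http://genolevures.org"

# Path patterns as stated in the text; the dictionary keys are illustrative.
RESOURCE_PATHS = {
    "element": "elt/{abbrev}/{identifier}",
    "family": "fam/{identifier}",
    "sequence": "seq/{identifier}",
    "pathway": "pathway/{identifier}",
}


def resource_url(kind, identifier, abbrev=None, prefix=PREFIX):
    """Build the URL of an identified resource from its kind and identifier."""
    if kind not in RESOURCE_PATHS:
        raise ValueError(f"unknown resource kind: {kind!r}")
    if kind == "element" and abbrev is None:
        raise ValueError("chromosomal elements need a genome abbreviation")
    fields = {"identifier": quote(identifier, safe="")}
    if abbrev is not None:
        fields["abbrev"] = quote(abbrev, safe="")
    return f"{prefix}/" + RESOURCE_PATHS[kind].format(**fields)


# Placeholder identifiers, not real database entries.
print(resource_url("element", "ELEMENT_ID", abbrev="ABBR"))
# → http://genolevures.org/elt/ABBR/ELEMENT_ID
print(resource_url("family", "FAMILY_ID"))
# → http://genolevures.org/fam/FAMILY_ID
```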

ONGOING DEVELOPMENTS

The Génolevures Consortium continues its effort to sequence and annotate other complete genomes of Hemiascomycete yeasts. These genomes as well as the incorporation of other genomes annotated by third parties will help to refine the classifications presented above.

SUPPLEMENTARY DATA

Supplementary Data are available online at http://genolevures.org/.

FUNDING

French National Center for Scientific Research (CNRS) (GDR 2354, partial); French National Research Agency (ANR) (ANR-05-BLAN-0331; GENARISE, partial); Région Aquitaine (‘Pôle de Recherche en Informatique’) (2005-1306001AB, partial) and ACI IMPBIO (IMPB114, partial) (‘Génolevures En Ligne’). Funding for open access charge: CNRS GDR 2354.

Conflict of interest statement. Large-scale Smith-Waterman alignments used in constructing protein families were performed using GenCore 6 software licensed from Biocceleration Inc., 77 Milltown Road, Suite B4, East Brunswick, NJ 08816, USA.
  36 in total

1.  KEGG: kyoto encyclopedia of genes and genomes.

Authors:  M Kanehisa; S Goto
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  PIRSF: family classification system at the Protein Information Resource.

Authors:  Cathy H Wu; Anastasia Nikolskaya; Hongzhan Huang; Lai-Su L Yeh; Darren A Natale; C R Vinayaka; Zhang-Zhi Hu; Raja Mazumder; Sandeep Kumar; Panagiotis Kourtesis; Robert S Ledley; Baris E Suzek; Leslie Arminski; Yongxing Chen; Jian Zhang; Jorge Louie Cardenas; Sehee Chung; Jorge Castro-Alvear; Georgi Dinkov; Winona C Barker
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

3.  The generic genome browser: a building block for a model organism system database.

Authors:  Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal:  Genome Res       Date:  2002-10       Impact factor: 9.043

4.  More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy.

Authors:  Antonis Rokas; Sean B Carroll
Journal:  Mol Biol Evol       Date:  2005-03-02       Impact factor: 16.240

5.  Family relationships: should consensus reign?--consensus clustering for protein families.

Authors:  Macha Nikolski; David J Sherman
Journal:  Bioinformatics       Date:  2007-01-15       Impact factor: 6.937

6.  Life with 6000 genes.

Authors:  A Goffeau; B G Barrell; H Bussey; R W Davis; B Dujon; H Feldmann; F Galibert; J D Hoheisel; C Jacq; M Johnston; E J Louis; H W Mewes; Y Murakami; P Philippsen; H Tettelin; S G Oliver
Journal:  Science       Date:  1996-10-25       Impact factor: 47.728

7.  The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome.

Authors:  Fred S Dietrich; Sylvia Voegeli; Sophie Brachat; Anita Lerch; Krista Gates; Sabine Steiner; Christine Mohr; Rainer Pöhlmann; Philippe Luedi; Sangdun Choi; Rod A Wing; Albert Flavier; Thomas D Gaffney; Peter Philippsen
Journal:  Science       Date:  2004-03-04       Impact factor: 47.728

8.  MIPS: analysis and annotation of genome information in 2007.

Authors:  H W Mewes; S Dietmann; D Frishman; R Gregory; G Mannhaupt; K F X Mayer; M Münsterkötter; A Ruepp; M Spannagl; V Stümpflen; T Rattei
Journal:  Nucleic Acids Res       Date:  2007-12-23       Impact factor: 16.971

9.  The COG database: an updated version includes eukaryotes.

Authors:  Roman L Tatusov; Natalie D Fedorova; John D Jackson; Aviva R Jacobs; Boris Kiryutin; Eugene V Koonin; Dmitri M Krylov; Raja Mazumder; Sergei L Mekhedov; Anastasia N Nikolskaya; B Sridhar Rao; Sergei Smirnov; Alexander V Sverdlov; Sona Vasudevan; Yuri I Wolf; Jodie J Yin; Darren A Natale
Journal:  BMC Bioinformatics       Date:  2003-09-11       Impact factor: 3.169

10.  Fusion and fission of genes define a metric between fungal genomes.

Authors:  Pascal Durrens; Macha Nikolski; David Sherman
Journal:  PLoS Comput Biol       Date:  2008-10-24       Impact factor: 4.475
