
Génolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes.

David J Sherman, Tiphaine Martin, Macha Nikolski, Cyril Cayla, Jean-Luc Souciet, Pascal Durrens.

Abstract

The Génolevures online database (http://cbi.labri.fr/Genolevures/ and http://genolevures.org/) provides exploratory tools and curated data sets relative to nine complete and seven partial genome sequences determined and manually annotated by the Génolevures Consortium, to facilitate comparative genomic studies of Hemiascomycete yeasts. The 2008 update to the Génolevures database provides four new genomes in complete (subtelomere to subtelomere) chromosome sequences, 50,000 protein-coding and tRNA genes, and in silico analyses for each gene element. A key element is a novel classification of conserved multi-species protein families and their use in detecting synteny, gene fusions and other aspects of genome remodeling in evolution. Our purpose is to release high-quality curated data from complete genomes, with a focus on the relations between genes, genomes and proteins.

Substances:
Fungal Proteins
Proteome

Year: 2008 PMID: 19015150 PMCID: PMC2686504 DOI: 10.1093/nar/gkn859

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Since 1999, the Génolevures Consortium has explored eukaryote genome evolution through the large-scale comparison of manually annotated yeast genomes. The Génolevures on-line database undergoes a major update with every significant data release: in 2004 with 13 partial genomes (1,2), in 2006 with four complete genomes (3,4) and in 2008 with four complete genomes (this work). Our purpose is to release high-quality data sets produced by the 13 partners in the Consortium, rather than to provide a site for integration of data available elsewhere, although we do integrate external data for comparison. Yeasts are small eukaryotes covering an evolutionary range comparable to the entire Chordate phylum (5), and the unique combination of genetic and genomic tools available for yeasts makes them ideal candidates for experimental study of metabolism, genetic engineering and molecular genetics. All of the many yeasts sequenced so far have small genomes (10–20 Mb), which allows detailed comparative genomics at a relatively modest cost. The economic impact of yeasts is widespread: different species are used for the production of beer, wine and bread and, more recently, of various metabolic products such as vitamins, ethanol, citric acid and lipids. Yeasts can degrade hydrocarbons (genera Candida, Yarrowia and Debaryomyces), metabolize xylose (Pichia stipitis), depolymerise tannin extracts (Zygosaccharomyces rouxii), and produce hormones and vaccines in industrial quantities through heterologous gene expression (6). Several human diseases are caused by yeast species, among them the Hemiascomycetes Candida albicans, Candida glabrata, Candida tropicalis and even Saccharomyces cerevisiae in immunocompromised patients (7). The biology of S. cerevisiae has been extensively studied for decades, both as a model organism for molecular genetics and cell biology and as a cell factory.
Its genome (8) is the most thoroughly annotated among eukaryotes, and is a common reference for the annotation of other species. Génolevures focuses on the Hemiascomycete yeasts, a homogeneous phylogenetic group which nonetheless covers a broad range of physiological and ecological lifestyles. Hemiascomycete yeast genes contain introns and alternative splicing is observed (9). Comparative genomics studies in this group have proven informative (10–16); see (5) for review. The Génolevures Consortium sequences, annotates and analyzes complete genomes from various branches of the Hemiascomycete class, and subjects them to large-scale in silico and experimental comparisons (Table 1). From these comparisons, we produce classifications of genes, proteins and sequences to address questions of molecular evolution, such as gene conservation, specific genes, function conservation and genome remodeling. We do not provide detailed annotations of individual genes and proteins of S. cerevisiae, which are already carefully maintained by MIPS and CYGD (http://mips.gsf.de/projects/fungi) (17) and SGD (http://www.yeastgenome.org/) (18) as well as in general purpose databases such as UniProt (19) and EMBL (20).

Table 1.

Summary of genomes available in Génolevures online, and for illustration, the number of protein-coding genes classified by a selection of the different analyses made available

			Number of protein coding genes in:
Species	Assembly	Genome size (Mb)	Predicted proteome	Protein families	Synteny blocks	Gene fusions	Tandem arrays
Candida glabrata	Complete	12.3	5210	4869	4850	152	120
Debaryomyces hansenii var. hansenii	Complete	12.2	6286	5213	1155	194	344
Kluyveromyces lactis var. lactis	Complete	10.6	5088	4810	4844	150	96
Kluyveromyces thermotolerans	Complete	10.4	5105	4898	4937		117
Saccharomyces kluyveri	Complete	11.3	5308	5048	5177		133
Yarrowia lipolytica	Complete	20.5	6456	5036	699	162	148
Zygosaccharomyces rouxii	Complete	9.8	4999	4802	4884		112
Candida tropicalis	Partial	15.0	1130^b		18.1%
Kluyveromyces marxianus	Partial	14.0	1546^b		49.8%
Pichia angusta	Partial	9.0	2502^a		18.9%
Pichia sorbitophila	Partial	13.9	1593^a		59.2%
Saccharomyces bayanus var. uvarum	Partial	11.5	2887^a		97.9%
Saccharomyces exiguus	Partial	18.0	1600^b		70.7%
Saccharomyces servazzii	Partial	12.3	1535^b		70.2%
Saccharomyces cerevisiae	Complete	12.1	5769	5381	5450	172	187
Eremothecium (Ashbya) gossypii	Complete	8.7	4718	4474	4584	146	79
Pichia stipitis	Complete	15.4	5816
Schizosaccharomyces pombe	Complete	14.1	5004			152

The four genomes at the bottom of the table are not produced by Génolevures but are integrated into the online database for comparative purposes.

^a 0.4× coverage.

^b 0.2× coverage.


NEW GENOMES IN 2008

The Génolevures Consortium released four new genomes in 2008, sequenced and assembled at high coverage and completely annotated by a team of 20 curators (Génolevures Consortium, submitted for publication). These are the genomes of Zygosaccharomyces rouxii, Saccharomyces kluyveri and Kluyveromyces thermotolerans, previously sequenced only to low or partial coverage, and Debaryomyces hansenii, partly resequenced to improve coverage. The former three, plus K. lactis (3) and Ashbya gossypii (21), are members of the Saccharomycetaceae clades that are not descended from the ancestor thought to have undergone a whole-genome duplication (unlike C. glabrata and Saccharomyces cerevisiae). The availability in Génolevures of complete genomes for these five unduplicated species will allow close study of the core repertoire of protein families and functions and of the dynamics of genome remodeling.

PROTEIN FAMILIES

Extensive map reshuffling and a wide range of GC compositions from one species to another present a challenge for genome comparison in the yeasts. An essential tool for addressing this challenge is provided by ‘protein families’, a classification of protein-coding gene sequences into phylogenetic groups; members of a family are homologous and in many cases this homology is suggestive of functional similarity. The 48 889 proteins in the predicted proteomes of the nine complete genomes are classified into 7927 families, of which 4369 are common to at least two species. Of these latter families, 2591 are common to the nine species. From these families, other sub-classifications are made, such as the identification of syntenic orthologs in the hemiascomycetous yeasts (Génolevures Consortium, submitted for publication). Families are computed as follows. Four complementary sets of all-against-all alignments are produced by the Blast (22) and Smith–Waterman (23) algorithms, each with and without filtering for homeomorphy, that is, common domain architecture along the full length of the proteins (24). Symmetric distance matrices derived from amino acid identities are constructed and submitted to algorithmic clustering using the MCL ‘Markov clustering’ method (25,26) with a range of statistical parameters. These competing partitions are reconciled using the consensus ensemble clustering method of Ref. (27) and manually curated using both literature search and systematic comparisons with the COG (28) and PIRSF (24) classifications. For a given yeast species, each family may be represented by one or several (paralogous) proteins. Protein families are available in Génolevures at http://genolevures.org/fam/ and tables describing all families are available in the Datasets area. Information on each family is accessible on individual pages presenting a graphical representation of the members’ relationships and links to individual members. (See URL construction rules, below.)
Phyletic patterns and phylogenetic profiles are shown for each family; the former uses a simple one-letter code for each species as in Ref. (28) to indicate whether the family contains at least one member from that species, and the latter expands this to a numerical count of members for each species. Complementary information, such as GO terms, multiple sequence alignments and motif analyses, is available for almost all families.
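The difference between a phyletic pattern and a phylogenetic profile can be illustrated with a minimal Python sketch. The one-letter species codes, the protein names and the five-species subset below are assumptions made for illustration only; they are not the codes or identifiers used by Génolevures.

```python
from collections import Counter

# Hypothetical one-letter codes; the codes used on the web site may differ.
SPECIES_CODES = {
    "Saccharomyces cerevisiae": "S",
    "Candida glabrata": "C",
    "Kluyveromyces lactis": "K",
    "Debaryomyces hansenii": "D",
    "Yarrowia lipolytica": "Y",
}


def phylogenetic_profile(members, species_order):
    """Return the number of family members per species, in a fixed order."""
    counts = Counter(species for _, species in members)
    return [counts.get(sp, 0) for sp in species_order]


def phyletic_pattern(members, species_order, codes=SPECIES_CODES):
    """Return one letter per species with at least one member, '-' otherwise."""
    profile = phylogenetic_profile(members, species_order)
    return "".join(
        codes[sp] if n > 0 else "-" for sp, n in zip(species_order, profile)
    )


# Toy family: (hypothetical protein name, species) pairs.
family = [
    ("protA1", "Saccharomyces cerevisiae"),
    ("protA2", "Saccharomyces cerevisiae"),
    ("protB1", "Kluyveromyces lactis"),
    ("protC1", "Yarrowia lipolytica"),
]
order = list(SPECIES_CODES)

print(phyletic_pattern(family, order))      # S-K-Y
print(phylogenetic_profile(family, order))  # [2, 0, 1, 0, 1]
```

The pattern records presence or absence only, so the two S. cerevisiae paralogs collapse to a single letter, whereas the profile keeps the count.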

SYNTENY

Although synteny is in general poorly conserved in the Hemiascomycete yeasts (29,3,30), within phylogenetically delimited groups pairwise coverage is high and can be informative for studying genome dynamics. Pairs of homologous chromosomal regions between two species can be identified by comparing gene content and order, using protein families as an indication of gene-level homology within the regions. We identified these ‘pairwise syntenic blocks’ using the i-ADHoRe method (31) on gene homology relations identified using protein families. The approximately 19 000 pairwise synteny blocks obtained in this fashion can be interactively examined using the genome browser, as described in the Datasets area of the web site, and can be downloaded.
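The idea of comparing gene order through family membership can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the i-ADHoRe method: it assumes each family occurs at most once per chromosome, allows no gaps, and handles one chromosome pair at a time. The function name and the family labels are hypothetical.

```python
def collinear_blocks(chrom_a, chrom_b, min_size=3):
    """Find runs of adjacent genes in chrom_a whose families are also
    adjacent in chrom_b, in the same or in reversed order.

    chrom_a, chrom_b: lists of family identifiers in chromosomal order.
    Returns (start_a, end_a, start_b, end_b) tuples of inclusive indices.
    """
    pos_b = {fam: i for i, fam in enumerate(chrom_b)}
    blocks = []
    i = 0
    while i < len(chrom_a):
        if chrom_a[i] not in pos_b:
            i += 1
            continue
        start = i
        direction = 0  # +1 same order, -1 inverted, 0 not yet decided
        while i + 1 < len(chrom_a) and chrom_a[i + 1] in pos_b:
            step = pos_b[chrom_a[i + 1]] - pos_b[chrom_a[i]]
            if step not in (1, -1) or (direction and step != direction):
                break
            direction = step
            i += 1
        if i - start + 1 >= min_size:
            b_start, b_end = sorted((pos_b[chrom_a[start]], pos_b[chrom_a[i]]))
            blocks.append((start, i, b_start, b_end))
        i += 1
    return blocks


# Toy chromosomes described by family identifiers (hypothetical labels).
chrom_a = ["F1", "F2", "F3", "F9", "F7", "F6", "F5"]
chrom_b = ["F5", "F6", "F7", "F8", "F1", "F2", "F3"]
print(collinear_blocks(chrom_a, chrom_b))  # [(0, 2, 4, 6), (4, 6, 0, 2)]
```

The second block is an inverted segment: F7, F6, F5 on the first chromosome matches F5, F6, F7 on the second. A production method additionally has to tolerate gaps, paralogs and multiple chromosomes.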

FUSIONS

Chromosomal rearrangements and segmental duplications may result in the creation or destruction of genes, when the breakpoint falls within the boundary of existing genes. ‘Gene fission’ occurs when a gene is broken in this way, and the 5′ (or even the 3′) fragment continues to be transcribed—creating a new, albeit truncated, gene. ‘Gene fusion’ occurs when the chromosomal rearrangement or duplication event leads to a fortuitous combination of previously existing genes, creating a new, longer gene. Fusions and fissions are mechanisms by which new genes can be formed in radical steps that do not obey the tree-like phylogenetic relations between species, and which can lead to significant changes in protein function. These events have to date been rarely taken into account in genome comparisons. Using a new method for detection of genes involved in gene fusion and fission events that makes its computations at the level of paralogous groups (32), we computed a catalogue of such events for complete genomes from the Hemiascomycete yeasts as well as other fungi. Both the paralogous groups of proteins used in the computation, and complete lists of identified fusion/fission events, are now available in Génolevures and are downloadable in the Datasets area.

OTHER DATA SETS

Tandem gene repeats are another means of gene formation. Unlike gene copies resulting from segmental duplications or retrotransposons, truncated or chimeric genes are not observed at the boundaries of tandem gene arrays. Data on tandem gene repeats are available in the Datasets area. The Génosplicing database of spliceosomal introns and intron motifs in Hemiascomycete yeasts (C. Neuvéglise, unpublished data) is available from the Datasets area, and can be used both for the study of splicing patterns and for the development of methods for predicting gene architectures from genomic data. The YETI classification of yeast membrane and transporter proteins (33), defined using evolutionary relationships traced with non-ambiguous functional and phylogenetic criteria derived from the TCDB (34) classification system, is available in the Datasets area. Other analyses, such as (35) considering coverage of KEGG pathways (36), are available.
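As a rough illustration of how tandem arrays might be extracted from family assignments, the Python sketch below groups strictly adjacent genes that belong to the same protein family. The criterion (no intervening genes, at least two copies) is an assumption made here; the text does not state the rules used to build the Génolevures data set, and the gene and family names are hypothetical.

```python
from itertools import groupby


def tandem_arrays(genes, min_copies=2):
    """Group strictly adjacent genes that share a protein family.

    genes: list of (gene_name, family_id) in chromosomal order;
           family_id is None for genes not assigned to any family.
    Returns a list of (family_id, [gene names]) for each array found.
    """
    arrays = []
    for fam, run in groupby(genes, key=lambda g: g[1]):
        names = [name for name, _ in run]
        if fam is not None and len(names) >= min_copies:
            arrays.append((fam, names))
    return arrays


# Toy chromosome with hypothetical gene and family names.
genes = [
    ("g1", "FamA"),
    ("g2", "FamB"),
    ("g3", "FamB"),
    ("g4", "FamB"),
    ("g5", None),
    ("g6", None),
    ("g7", "FamA"),
]
print(tandem_arrays(genes))  # [('FamB', ['g2', 'g3', 'g4'])]
```

The two FamA genes are not reported because they are not adjacent, and the unclassified genes g5 and g6 are ignored even though they sit next to each other.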

EXPLORING GÉNOLEVURES DATA

The design of the Génolevures on-line database has been revamped to provide improved tools for gaining insight into the mechanisms of eukaryotic molecular evolution. The key questions in the Génolevures use cases are: What genes exist, as orthologs for my favorite gene or as members of a functional class (keyword search, alignment, homologs)? What is known about a given chromosomal element (chromosomal elements)? What relations exist in a protein family (protein families)? How are the individual genomes organized (maps and genome browser, synteny)? How are genes and proteins classified (data sets for fusions, tandem repeats, introns, pathways, YETI)? Additionally, the web site provides a query system that simultaneously searches for and can return: genes that have or may have a translation product, RNA and other genes that may have a transcription product only, cis-active elements and cross-genome protein families.

AVAILABILITY OF DATA AND URL CONSTRUCTION RULES

All data from the Génolevures web site are freely available, and instructions for proper citation are included in each section. The Génolevures web site is developed using a ‘representational state transfer’ architecture (37) and URLs for individual identified resources built from the database can be constructed systematically. Identified resources include chromosomal elements, such as genes (prefix/elt/Abbrev/Element_identifier), protein families (prefix/fam/Family_identifier), DNA sequences (prefix/seq/Sequence_identifier) and biomolecular pathways (prefix/pathway/Pathway_identifier). Nomenclatures for genome abbreviations, systematic chromosomal element names, and protein family names are described in the Documentation area of the web site. Génolevures uses a bespoke object model mapped to a relational database, and uses the SOFA (38) and GO (39) ontologies extensively. The web interface is developed in Mason. Key software components are provided by the NCBI Blast program suite (22) and the Stein Lab's Generic Genome Browser (40).
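The URL construction rules lend themselves to a small helper. The Python sketch below assumes that the prefix is http://genolevures.org, which is consistent with the family URL http://genolevures.org/fam/ given in the text; the function name and the placeholder identifiers are hypothetical, and real identifiers follow the nomenclature described in the site's Documentation area.

```python
from urllib.parse import quote

# Assumed prefix, inferred from http://genolevures.org/fam/ in the text.
PREFIX = "http://genolevures.org"

# Number of identifier components expected after each resource kind.
RESOURCE_PARTS = {
    "elt": 2,      # prefix/elt/Abbrev/Element_identifier
    "fam": 1,      # prefix/fam/Family_identifier
    "seq": 1,      # prefix/seq/Sequence_identifier
    "pathway": 1,  # prefix/pathway/Pathway_identifier
}


def resource_url(kind, *parts, prefix=PREFIX):
    """Build the URL of an identified resource from its kind and identifiers."""
    if kind not in RESOURCE_PARTS:
        raise ValueError(f"unknown resource kind: {kind!r}")
    if len(parts) != RESOURCE_PARTS[kind]:
        raise ValueError(
            f"{kind!r} expects {RESOURCE_PARTS[kind]} identifier(s), got {len(parts)}"
        )
    path = "/".join(quote(p, safe="") for p in parts)
    return f"{prefix}/{kind}/{path}"


# Placeholder identifiers, not real Génolevures names.
print(resource_url("fam", "FAMILY_ID"))
print(resource_url("elt", "ABBR", "ELEMENT_ID"))
```

Because each resource has a single predictable address, scripts can link to or fetch a gene, family, sequence or pathway page without first querying the search interface.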

ONGOING DEVELOPMENTS

The Génolevures Consortium continues its effort to sequence and annotate other complete genomes of Hemiascomycete yeasts. These genomes as well as the incorporation of other genomes annotated by third parties will help to refine the classifications presented above.

SUPPLEMENTARY DATA

Supplementary Data are available online at http://genolevures.org/.

FUNDING

French National center for Scientific Research (CNRS) (GDR 2354, partial); French National Research Agency (ANR) (ANR-05-BLAN-0331; GENARISE, partial); Région Aquitaine (‘Pôle de Recherche en Informatique’) (2005-1306001AB, partial) and ACI IMPBIO (IMPB114, partial) (‘Génolevures En Ligne’). Funding for open access charge: CNRS GDR 2354. Conflict of interest statement. Large-scale Smith-Waterman alignments used in constructing protein families were performed using GenCore 6 software licensed from Biocceleration Inc., 77 Milltown Road, Suite B4, East Brunswick, NJ 08816, USA.

36 in total

1. KEGG: kyoto encyclopedia of genes and genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. PIRSF: family classification system at the Protein Information Resource.

Authors: Cathy H Wu; Anastasia Nikolskaya; Hongzhan Huang; Lai-Su L Yeh; Darren A Natale; C R Vinayaka; Zhang-Zhi Hu; Raja Mazumder; Sandeep Kumar; Panagiotis Kourtesis; Robert S Ledley; Baris E Suzek; Leslie Arminski; Yongxing Chen; Jian Zhang; Jorge Louie Cardenas; Sehee Chung; Jorge Castro-Alvear; Georgi Dinkov; Winona C Barker
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

3. The generic genome browser: a building block for a model organism system database.

Authors: Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

4. More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy.

Authors: Antonis Rokas; Sean B Carroll
Journal: Mol Biol Evol Date: 2005-03-02 Impact factor: 16.240

5. Family relationships: should consensus reign?--consensus clustering for protein families.

Authors: Macha Nikolski; David J Sherman
Journal: Bioinformatics Date: 2007-01-15 Impact factor: 6.937

Review 6. Life with 6000 genes.

Authors: A Goffeau; B G Barrell; H Bussey; R W Davis; B Dujon; H Feldmann; F Galibert; J D Hoheisel; C Jacq; M Johnston; E J Louis; H W Mewes; Y Murakami; P Philippsen; H Tettelin; S G Oliver
Journal: Science Date: 1996-10-25 Impact factor: 47.728

7. The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome.

Authors: Fred S Dietrich; Sylvia Voegeli; Sophie Brachat; Anita Lerch; Krista Gates; Sabine Steiner; Christine Mohr; Rainer Pöhlmann; Philippe Luedi; Sangdun Choi; Rod A Wing; Albert Flavier; Thomas D Gaffney; Peter Philippsen
Journal: Science Date: 2004-03-04 Impact factor: 47.728

8. MIPS: analysis and annotation of genome information in 2007.

Authors: H W Mewes; S Dietmann; D Frishman; R Gregory; G Mannhaupt; K F X Mayer; M Münsterkötter; A Ruepp; M Spannagl; V Stümpflen; T Rattei
Journal: Nucleic Acids Res Date: 2007-12-23 Impact factor: 16.971

9. The COG database: an updated version includes eukaryotes.

Authors: Roman L Tatusov; Natalie D Fedorova; John D Jackson; Aviva R Jacobs; Boris Kiryutin; Eugene V Koonin; Dmitri M Krylov; Raja Mazumder; Sergei L Mekhedov; Anastasia N Nikolskaya; B Sridhar Rao; Sergei Smirnov; Alexander V Sverdlov; Sona Vasudevan; Yuri I Wolf; Jodie J Yin; Darren A Natale
Journal: BMC Bioinformatics Date: 2003-09-11 Impact factor: 3.169

10. Fusion and fission of genes define a metric between fungal genomes.

Authors: Pascal Durrens; Macha Nikolski; David Sherman
Journal: PLoS Comput Biol Date: 2008-10-24 Impact factor: 4.475
