
The Candida genome database incorporates multiple Candida species: multispecies search and analysis tools with curated gene and protein information for Candida albicans and Candida glabrata.

Diane O Inglis, Martha B Arnaud, Jonathan Binkley, Prachi Shah, Marek S Skrzypek, Farrell Wymore, Gail Binkley, Stuart R Miyasato, Matt Simison, Gavin Sherlock.

Abstract

The Candida Genome Database (CGD, http://www.candidagenome.org/) is an internet-based resource that provides centralized access to genomic sequence data and manually curated functional information about genes and proteins of the fungal pathogen Candida albicans and other Candida species. As the scope of Candida research, and the number of sequenced strains and related species, has grown in recent years, the need for expanded genomic resources has also grown. To answer this need, CGD has expanded beyond storing data solely for C. albicans, now integrating data from multiple species. Herein we describe the incorporation of this multispecies information, which includes curated gene information and the reference sequence for C. glabrata, as well as orthology relationships that interconnect Locus Summary pages, allowing easy navigation between genes of C. albicans and C. glabrata. These orthology relationships are also used to predict GO annotations of their products. We have also added protein information pages that display domains, structural information and physicochemical properties; bibliographic pages highlighting important topic areas in Candida biology; and a laboratory strain lineage page that describes the lineage of commonly used laboratory strains. All of these data are freely available at http://www.candidagenome.org/. We welcome feedback from the research community at candida-curator@lists.stanford.edu.

Substances: Fungal Proteins

Year: 2011 PMID: 22064862 PMCID: PMC3245171 DOI: 10.1093/nar/gkr945

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Candida albicans is the most common fungal pathogen causing invasive and bloodstream infections in immunocompromised patients, although in recent years, several non-albicans species and other yeasts have also emerged as major opportunistic pathogens (1,2). Studies in the US identify Candida glabrata as the second most common Candida species involved in invasive fungal infections. Moreover, antifungal drug resistance, especially to azoles, is common among C. glabrata clinical strains isolated from patients with prior azole treatment (1). The availability of genome sequences for these pathogenic fungi has made it possible to study genes that play a role in pathogenesis and drug resistance in Candida species, thereby increasing our understanding of the mechanisms of virulence in fungal pathogens.

The Candida Genome Database (CGD, http://www.candidagenome.org/) is an online resource for the scientific research community studying fungal molecular biology and pathogenesis. The primary mission of CGD is to facilitate and accelerate Candida research by providing both an extensively curated compendium of Candida gene, protein and sequence information, and easy-to-use web-based tools for accessing, analyzing and exploring these data. When the CGD project began in 2004, our initial efforts focused on curation of C. albicans, because it is the best-characterized species of the group and has the largest corpus of gene-specific scientific literature. We have now expanded the scope of the project to include other Candida species, and provide an extensive suite of tools and resources that have been redesigned to facilitate the analysis of multiple species concurrently.

The CGD Locus Summary Page (LSP) has been updated with information about the identity of orthologous genes in C. glabrata, and with orthology-based functional predictions and gene descriptions. We currently display both manual and computational gene, protein and sequence information about C. albicans and the recently added species, C. glabrata. We also provide genomic and protein sequence downloads and BLAST (3) resources for multiple Candida species and strains, including C. albicans strains SC5314 (4) and WO-1 (5), C. dubliniensis (6), C. guilliermondii (5), C. lusitaniae (5), C. parapsilosis (5), C. tropicalis (5), Debaryomyces hansenii (7) and Lodderomyces elongisporus (5). We will be adding curated information for all these other Candida species in the future. All of the data in CGD are freely available. We also have an extensive suite of online user documentation, and provide advice and user support by e-mail at candida-curator@lists.stanford.edu.

LITERATURE CURATION FOR MULTIPLE CANDIDA SPECIES

At CGD, PhD-level curators perform ongoing manual curation of the scientific literature to collect, organize, summarize and present a comprehensive picture of each characterized gene. Manual curation includes the recording of gene names, additions and updates to our summary gene descriptions, capture of mutant phenotype data and the assignment of relevant GO annotations with evidence and citations. The manual curation of the previously published literature pertaining to genes of C. albicans and C. glabrata is now complete (Table 1). We have combed the scientific literature for gene-specific information and gene bibliographies; Gene Ontology (GO) annotations describing the function, role and localization of gene products; and mutant phenotypes. These are now reported in CGD for all of the genes for which this information is available. At this time, there are 6203 predicted C. albicans protein-encoding genes localized to chromosomes in the current (Assembly 21) reference gene set, 22% with manually annotated gene and protein information. For C. glabrata, the reference annotation set contains 5212 predicted genes, each of which has an LSP (Figure 1), and 3% of which have manually curated annotations. CGD now includes a detailed Genome Snapshot for C. glabrata in addition to C. albicans, which provides a graphical and tabular summary of information about the total number of chromosomal features and feature types, changes to the reference sequence and a distribution of gene products by functional categories and cellular localization (Figure 2).

Table 1.

CGD curation statistics

	Candida albicans	Candida glabrata
Number of ORFs	6108	5212
Number of tRNAs	156	230
Verified ORFs	1403	178
Uncharacterized ORFs	4705	5034
Dubious ORFs	152	N/A
Manual GO annotations	4697	4689
Features with manual GO annotations	13 707	2622
Orthology-based GO annotations	13 246	19 655
Features with orthology-based GO annotations	3099	4157
Protein-domain (InterPro)-based GO annotations	6048	5087
Features with protein-domain (InterPro)-based GO annotations	2963	2583
Features with orthology-based description lines	1352	3982
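
As a quick arithmetic check on the ORF rows of Table 1, the following minimal Python sketch recomputes the fraction of Verified ORFs per species and confirms that the Verified and Uncharacterized counts add up to the ORF totals. The dictionary layout and function names are illustrative assumptions, not part of CGD; only the counts are taken from the table. The prose quotes its percentage against a 6203-gene total, whereas this sketch uses only the table rows.

```python
# Counts copied from the ORF rows of Table 1; the data layout is hypothetical.
TABLE_1 = {
    "Candida albicans": {"orfs": 6108, "verified": 1403, "uncharacterized": 4705},
    "Candida glabrata": {"orfs": 5212, "verified": 178, "uncharacterized": 5034},
}


def verified_fraction(counts):
    """Return the fraction of ORFs classified as Verified."""
    return counts["verified"] / counts["orfs"]


def is_consistent(counts):
    """Check that Verified + Uncharacterized equals the ORF total."""
    return counts["verified"] + counts["uncharacterized"] == counts["orfs"]


for species, counts in TABLE_1.items():
    print(
        f"{species}: {verified_fraction(counts):.1%} verified, "
        f"consistent={is_consistent(counts)}"
    )
```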

Figure 1.

Updates to the CGD Locus Summary Page (LSP). The LSP is the hub around which the CGD gene information is organized. LSPs for both C. albicans and C. glabrata now feature new expanded orthology information sections, orthology-based description lines for uncharacterized genes, orthology-based GO term predictions and protein domain-based GO term predictions.

Figure 2.

CGD genome snapshots. Pie chart from the CGD Genome Snapshots, comparing the current extent of the characterization of the predicted protein-coding genes in the C. albicans and C. glabrata genomes. ORFs are classified as ‘Verified’ if there is experimental evidence for a functional gene product. ‘Uncharacterized’ ORFs are predicted based on sequence analysis but currently lack experimental characterization. Candida albicans ORFs labeled as ‘Dubious’ have no experimental characterization and appear to be indistinguishable from random non-coding sequences (5).

In addition, CGD curators have composed in-depth descriptive Locus Summaries for 272 selected C. albicans genes, which, in contrast to the very concise Locus Descriptions, are more detailed enumerations of the characteristics of each gene, presented in a bullet-point format on the CGD LSPs. They provide additional experimental details and gene regulatory information that cannot be accommodated within the space limits of the Locus Description line. These lists are displayed in the Locus Summary section located near the bottom of the page and are fully searchable through the CGD Text Search tool. The curation of the entire body of scientific literature for these organisms is a large and ongoing endeavor as new papers are published, and we welcome suggestions from users as to papers that should be prioritized or other data that should be included.
We greatly appreciate the beneficial interactions with members of the Candida research community who have already volunteered to review specific LSPs and provide feedback on the curation content for specific genes. The comments we have received have resulted in refinement of description lines, improvements to phenotype and GO annotations, and addition of new references that we had not encountered in our literature searches—improvements that benefit the entire community of CGD users.

TOOLS FOR SEARCH AND DISPLAY OF MULTISPECIES INFORMATION IN CGD

CGD was originally modeled after the Saccharomyces Genome Database (SGD) (8), a database that provides the Saccharomyces cerevisiae reference sequence with literature curation, and gene, protein and sequence analysis tools for the S. cerevisiae research community. SGD, and initially CGD, were designed to store and display data for only a single species at a time. To accommodate the incorporation of additional species in the database, user interface and analysis tools, significant design modifications to the software and the underlying database structure were necessary. The CGD search tools, such as Quick Search, Text Search, Gene/Sequence Resources, Ortholog Search and Pattern Match have been redesigned to search multiple species. In order to accommodate search results for multiple species, the new results page for the CGD Quick Search and Text Search tools now displays three sections. Search results that apply to all species (e.g. GO terms, authors and reference information, colleagues) are displayed at the top, with sections for species-specific search results displayed below. All of the tools that perform species- or sequence-specific searches (e.g. Gene/Sequence Resources, Pattern Match, Advanced Search, Batch Download, Restriction Mapper, GO Term Finder, GO Slim Mapper) have been updated, and they now prompt users to select the species of interest. The Ortholog Search now retrieves ortholog and best-hit matches among all of the species in CGD and SGD (currently C. albicans, C. glabrata and S. cerevisiae). BLAST searches at CGD have also been redesigned to allow queries against any combination of the several Candida species for which we have complete sequence sets (C. albicans, C. glabrata, C. dubliniensis, C. guilliermondii, C. lusitaniae, C. parapsilosis, C. tropicalis, Debaryomyces hansenii and Lodderomyces elongisporus). In addition, the curation tools have been extensively modified to facilitate the curation of multiple species. 
Each gene in CGD is represented on an LSP, which is the central organizing unit of the CGD web site. The LSP contains the basic information that describes the gene and provides access to tools for retrieval, analysis and visualization of gene data. We have reengineered the LSPs to accommodate multispecies information (Figure 1). LSPs for each C. albicans and C. glabrata gene now feature an expanded orthology section, by which the LSPs of each C. albicans gene are hyperlinked to the LSPs of their C. glabrata orthologs, and vice versa. The LSPs for C. glabrata genes also provide external links to gene pages available at Génolevures (http://www.genolevures.org/cagl.html#). This section also serves as a gateway to information about the orthologs in Saccharomyces cerevisiae, providing hyperlinks to the LSP of each ortholog in the SGD. Including S. cerevisiae ortholog information is especially useful for the C. glabrata LSPs: the evolutionary divergence between C. glabrata and S. cerevisiae is considerably more recent (100-300 million years ago) (7,9) than the divergence between these two species and C. albicans (700-800 million years ago) (10), and thus C. glabrata shares a larger number of orthologs with S. cerevisiae than with C. albicans, 4372 and 3201, respectively (as predicted by InParanoid). To define orthology relationships, we use the InParanoid algorithm, which identifies reciprocal best BLAST hits between species (11). These mappings and links are updated quarterly in order to reflect changes in gene models and annotations at CGD and SGD. In addition to the new orthology relationships displayed in CGD, another level of similarity-based information is provided via the new Protein tab on the LSP of each protein-coding gene (Figure 3). This tab opens the Protein Information page that provides descriptions and a graphical display of conserved protein domains and motifs identified using InterProScan software (12,13).
The Protein Information pages also display the structure of the most similar protein in the Protein Data Bank (14), and contain information about the predicted protein length, molecular weight, sequence and a link to a table of calculated physicochemical properties.
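
The reciprocal-best-hit idea behind the orthology mappings described above can be illustrated with a minimal Python sketch. This is not the InParanoid implementation, which additionally clusters in-paralogs and applies its own scoring rules; it only shows the core step of keeping gene pairs that are each other's top-scoring hit. The input format (tuples of query, subject and score) and all gene identifiers are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples, for example
    parsed from tabular similarity-search output. Ties keep the first hit seen.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical identifiers and scores, for illustration only.
a_vs_b = [
    ("calb_1", "cgla_1", 250.0),
    ("calb_1", "cgla_2", 90.0),
    ("calb_2", "cgla_2", 180.0),
    ("calb_3", "cgla_2", 120.0),
]
b_vs_a = [
    ("cgla_1", "calb_1", 245.0),
    ("cgla_2", "calb_2", 175.0),
    ("cgla_2", "calb_3", 110.0),
]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy input, calb_3 finds cgla_2 as its best hit, but cgla_2 prefers calb_2, so calb_3 is left without a reciprocal partner.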

Figure 3.

Protein information page. The Protein Information page provides data including structural information inferred from homologs in PDB (RCSB Protein Data Bank), an interactive domains/motifs browser, protein sequence and physicochemical property details, BLASTP search against other CGD sequences and links to external protein resources such as UniProt.

LEVERAGING MULTISPECIES INFORMATION IN CGD: HOMOLOGY-BASED FUNCTIONAL PREDICTIONS

The GO is a structured vocabulary that is used to describe three aspects of gene products: their molecular function or activity, the broader biological process in which they participate, and the cellular location in which they reside (15). A gene product can be annotated with any number of terms about any of the three aspects, depending on the available data. Each GO term assignment is associated with an evidence code that describes the type of data the assignment is based on, and with a reference to its source. The GO is in wide use in genomic research and, because it is rigorously structured, it ensures consistency in capturing functional information about genes from different organisms and thus enables reliable analysis of the biological significance of genomic data (15–21).

For the fully curated species, C. albicans and C. glabrata, all of the available gene-related literature pertaining to these two species has been read and all possible GO assignments from these papers have been made. To augment the manual curation, we have leveraged the orthology relationships to infer GO annotations for genes having an experimentally characterized ortholog in SGD or CGD. Predictions for C. albicans are made based on S. cerevisiae and C. glabrata orthologs, whereas predictions for C. glabrata are based on orthologs from S. cerevisiae and C. albicans. Despite the evolutionary distances between C. albicans, C. glabrata and S. cerevisiae, the use of orthology relationships to infer GO annotations between C. albicans and C. glabrata allows a significant number of important pathogenesis-related terms to be transferred between these two fungal pathogens. Candidate GO annotations to be used as the basis for these inferences are limited to those with experimental evidence, i.e. those associated with the evidence codes ‘Inferred from Direct Assay (IDA)’, ‘Inferred from Physical Interaction (IPI)’, ‘Inferred from Genetic Interaction (IGI)’, or ‘Inferred from Mutant Phenotype (IMP)’. Any annotations that are themselves predicted in S. cerevisiae or in Candida, either based on sequence similarity or by some other methods, are excluded from this group to avoid transitive propagation of predictions. Also excluded from the predicted annotation set are annotations that are redundant with existing, manually curated annotations, or those that assign a related but less specific GO term other than candidate annotations. These orthology-based GO assignments are associated with the evidence code ‘Inferred from Electronic Annotation (IEA)’ and displayed with the source species and gene name they are derived from, along with a hyperlink to the appropriate LSP at CGD or SGD.

CGD has also taken advantage of protein domain and motif homology to assign GO annotations for C. albicans and C. glabrata genes. We systematically predict conserved domains in CGD protein sequences using InterProScan (12), and then use the InterPro-to-GO mappings (12,13) provided by the GO Consortium to provide molecular function annotations for those proteins. These annotations are assigned the evidence code IEA and are displayed with the InterPro identifier of the protein that serves as the basis for the annotation. The identifier is linked to the EMBL-EBI database to provide access to more extensive information about each structural domain. We have also used the tRNAscan-SE software to predict tRNA genes, and have inferred predicted GO annotations for these tRNAs (22). The new annotations that have been transferred from S. cerevisiae to C. albicans and C. glabrata, and between C. albicans and C. glabrata, are summarized in Table 1.
In addition to having the evidence code IEA, all these orthology-based annotations are identified as being derived computationally, rather than manually extracted from the scientific literature. Predictions are updated several times a year to make sure they remain current with annotation updates and new curation in CGD, SGD and in the protein domain datasets. Now that all literature-based GO assignments for C. albicans and C. glabrata, and all orthology-based and protein domain-based predictions have been made, we consider curation of both species to be ‘GO-complete’. For the remaining uncharacterized genes, we have explicitly assigned ‘unknown’ annotations to indicate that to the best of our knowledge no data are available. We have also used the multispecies information to create informative descriptions for those Candida genes that lack any experimental characterization, and which therefore have no literature-based description on the LSP, incorporating orthology relationships and orthology-based functional predictions into the gene description in cases where there would otherwise be no information available.
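
The orthology-based transfer rules described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the CGD pipeline: the `Annotation` class, the function name and all gene identifiers are hypothetical, and the GO identifiers are arbitrary placeholders. The sketch keeps only source annotations with experimental evidence codes, skips predictions that duplicate an existing manual annotation, and labels every prediction IEA with its source gene. It omits the filter on related but less specific terms, which would require traversing the ontology graph.

```python
from dataclasses import dataclass

# Evidence codes treated as experimental, as listed in the text.
EXPERIMENTAL_CODES = {"IDA", "IPI", "IGI", "IMP"}


@dataclass(frozen=True)
class Annotation:
    gene: str
    go_id: str
    evidence: str
    source: str = ""  # ortholog the annotation was inferred from, if any


def transfer_annotations(orthologs, source_annotations, existing):
    """Predict IEA annotations for target genes from their orthologs.

    orthologs: dict mapping a target gene to a list of source-species genes.
    source_annotations: iterable of Annotation objects on source genes.
    existing: set of (gene, go_id) pairs already manually curated on targets.
    """
    experimental = {}
    for ann in source_annotations:
        # Predicted source annotations are dropped to avoid transitive propagation.
        if ann.evidence in EXPERIMENTAL_CODES:
            experimental.setdefault(ann.gene, []).append(ann)

    seen = set(existing)
    predicted = []
    for target, sources in orthologs.items():
        for src in sources:
            for ann in experimental.get(src, []):
                key = (target, ann.go_id)
                if key in seen:  # redundant with a manual or earlier prediction
                    continue
                seen.add(key)
                predicted.append(Annotation(target, ann.go_id, "IEA", src))
    return predicted


# Hypothetical genes and placeholder GO identifiers, for illustration only.
orthologs = {"target_1": ["source_A"], "target_2": ["source_B"]}
source_annotations = [
    Annotation("source_A", "GO:0003677", "IDA"),
    Annotation("source_A", "GO:0005634", "IEA"),  # itself predicted: excluded
    Annotation("source_B", "GO:0005634", "IMP"),
]
existing = {("target_2", "GO:0005634")}  # already manually curated

predictions = transfer_annotations(orthologs, source_annotations, existing)
for p in predictions:
    print(p.gene, p.go_id, p.evidence, "from", p.source)
```

Recording the source gene on each prediction mirrors the display described above, where each orthology-based assignment is shown with the gene it was derived from.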

CURATED INFORMATIONAL PAGES AT CGD

Additional CGD resources for the Candida research community include a new collection of bibliographies on topics relevant to Candida biology, which is accessible under ‘Community Resources’ from the navigation sidebar on the CGD Home page. These Highlights in Candida Biology contain lists of important references, including many key reviews, and are designed to provide an overview of selected subject areas in C. albicans and C. glabrata biology. This resource will be particularly valuable for those new to Candida research. As new species are curated at CGD, Highlights in Candida Biology will expand to include bibliographies on these species as well. The curated bibliographies are available at http://candidagenome.org/TopicBiblios.shtml. We have also curated a directory of strains, which provides descriptions and references for commonly used Candida laboratory strains, along with a lineage diagram that graphically depicts the relationship among these strains. This information is available on the CGD web site at http://candidagenome.org/Strains.shtml. This resource is especially important for researchers because differences in strain background are known to have a significant impact on observed mutant phenotypes. In some cases, genes have been found to be lethal in one genetic background while successful gene disruption is possible in another. An example of this is the C. albicans UME6 gene, for which homozygous mutants are viable in the SN152 genetic background (23) yet inviable in the BWP17 strain background (24). Because of its importance, we also provide all available strain background information along with all of the curated phenotypes for each gene.

FUTURE DIRECTIONS

Now that the underlying database has been re-tooled to accommodate the curation of multiple species, we will add curated information for other Candida-related species including C. dubliniensis, C. guilliermondii, C. lusitaniae, C. parapsilosis, C. tropicalis, Debaryomyces hansenii and Lodderomyces elongisporus. In order to facilitate navigation across multiple genomes, we will provide links to an interactive comparative visualization tool, which will allow users to explore ortholog clusters in their genomic context. Recent advances in genomics technologies have created a deluge of information that poses a significant challenge of making all these data organized and readily available to researchers. We have adapted our genome browser, GBrowse, to enable users to visualize unannotated transcripts in C. albicans that have been identified by RNAseq (25–27). These transcripts are aligned to the reference genome and displayed alongside the existing set of features in the reference annotation. We will further develop and/or integrate existing software to incorporate and visualize more types of data and more data sets from high-throughput studies.

FUNDING

Funding for open access charge: National Institute of Dental and Craniofacial Research at the US National Institutes of Health (grant no. R01 DE015873).

Conflict of interest statement. None declared.

27 in total

1. Creating the gene ontology resource: design and implementation.

Authors:
Journal: Genome Res Date: 2001-08 Impact factor: 9.043

2. InterProScan--an integration platform for the signature-recognition methods in InterPro.

Authors: E M Zdobnov; R Apweiler
Journal: Bioinformatics Date: 2001-09 Impact factor: 6.937

Review 3. Emerging opportunistic yeast infections.

Authors: Marisa H Miceli; José A Díaz; Samuel A Lee
Journal: Lancet Infect Dis Date: 2011-02 Impact factor: 25.071

4. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

5. Comprehensive annotation of the transcriptome of the human fungal pathogen Candida albicans using RNA-seq.

Authors: Vincent M Bruno; Zhong Wang; Sadie L Marjani; Ghia M Euskirchen; Jeffrey Martin; Gavin Sherlock; Michael Snyder
Journal: Genome Res Date: 2010-09-01 Impact factor: 9.043

6. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons.

Authors: M Remm; C E Storm; E L Sonnhammer
Journal: J Mol Biol Date: 2001-12-14 Impact factor: 5.469

7. Genome evolution in yeasts.

Authors: Bernard Dujon; David Sherman; Gilles Fischer; Pascal Durrens; Serge Casaregola; Ingrid Lafontaine; Jacky De Montigny; Christian Marck; Cécile Neuvéglise; Emmanuel Talla; Nicolas Goffard; Lionel Frangeul; Michel Aigle; Véronique Anthouard; Anna Babour; Valérie Barbe; Stéphanie Barnay; Sylvie Blanchin; Jean-Marie Beckerich; Emmanuelle Beyne; Claudine Bleykasten; Anita Boisramé; Jeanne Boyer; Laurence Cattolico; Fabrice Confanioleri; Antoine De Daruvar; Laurence Despons; Emmanuelle Fabre; Cécile Fairhead; Hélène Ferry-Dumazet; Alexis Groppi; Florence Hantraye; Christophe Hennequin; Nicolas Jauniaux; Philippe Joyet; Rym Kachouri; Alix Kerrest; Romain Koszul; Marc Lemaire; Isabelle Lesur; Laurence Ma; Héloïse Muller; Jean-Marc Nicaud; Macha Nikolski; Sophie Oztas; Odile Ozier-Kalogeropoulos; Stefan Pellenz; Serge Potier; Guy-Franck Richard; Marie-Laure Straub; Audrey Suleau; Dominique Swennen; Fredj Tekaia; Micheline Wésolowski-Louvel; Eric Westhof; Bénédicte Wirth; Maria Zeniou-Meyer; Ivan Zivanovic; Monique Bolotin-Fukuhara; Agnès Thierry; Christiane Bouchier; Bernard Caudron; Claude Scarpelli; Claude Gaillardin; Jean Weissenbach; Patrick Wincker; Jean-Luc Souciet
Journal: Nature Date: 2004-07-01 Impact factor: 49.962

8. The diploid genome sequence of Candida albicans.

Authors: Ted Jones; Nancy A Federspiel; Hiroji Chibana; Jan Dungan; Sue Kalman; B B Magee; George Newport; Yvonne R Thorstenson; Nina Agabian; P T Magee; Ronald W Davis; Stewart Scherer
Journal: Proc Natl Acad Sci U S A Date: 2004-05-03 Impact factor: 11.205

9. The RCSB Protein Data Bank: redesigned web site and web services.

Authors: Peter W Rose; Bojan Beran; Chunxiao Bi; Wolfgang F Bluhm; Dimitris Dimitropoulos; David S Goodsell; Andreas Prlic; Martha Quesada; Gregory B Quinn; John D Westbrook; Jasmine Young; Benjamin Yukich; Christine Zardecki; Helen M Berman; Philip E Bourne
Journal: Nucleic Acids Res Date: 2010-10-29 Impact factor: 16.971

10. A molecular timescale of eukaryote evolution and the rise of complex multicellular life.

Authors: S Blair Hedges; Jaime E Blair; Maria L Venturi; Jason L Shoe
Journal: BMC Evol Biol Date: 2004-01-28 Impact factor: 3.260
