Jonathan Binkley, Martha B Arnaud, Diane O Inglis, Marek S Skrzypek, Prachi Shah, Farrell Wymore, Gail Binkley, Stuart R Miyasato, Matt Simison, Gavin Sherlock.
Abstract
The Candida Genome Database (CGD, http://www.candidagenome.org/) is a freely available online resource that provides gene, protein and sequence information for multiple Candida species, along with web-based tools for accessing, analyzing and exploring these data. The goal of CGD is to facilitate and accelerate research into Candida pathogenesis and biology. The CGD Web site is organized around Locus pages, which display information collected about individual genes. Locus pages have multiple tabs for accessing different types of information; the default Summary tab provides an overview of the gene name, aliases, phenotype and Gene Ontology curation, whereas other tabs display more in-depth information, including protein product details for coding genes, notes on changes to the sequence or structure of the gene and a comprehensive reference list. Here, in this update to previous NAR Database articles featuring CGD, we describe a new tab that we have added to the Locus page, entitled the Homology Information tab, which displays phylogeny and gene similarity information for each locus.
Year: 2013 PMID: 24185697 PMCID: PMC3965001 DOI: 10.1093/nar/gkt1046
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
CGD multispecies curation statistics
| Species | Verified genes | Uncharacterized genes | Manually curated GO | Orthology-based GO | Domain-based GO | Phenotypes |
|---|---|---|---|---|---|---|
|  | 1504 | 4558 | 8555 | 22 496 | 5041 | 15 205 |
|  | 13 | 5849 | 33 | 27 765 | 5271 | 56 |
|  | 207 | 5006 | 669 | 27 150 | 4434 | 659 |
|  | 25 | 5812 | 62 | 27 155 | 5351 | 35 |
We currently perform manual literature curation for four species; this set of reference genomes comprises C. albicans SC5314, C. glabrata CBS138, C. dubliniensis CD36 and C. parapsilosis CDC317. We provide sequence files and protein domain files for an additional seven strains, covering 11 genomes and 10 species in total: C. albicans SC5314, C. albicans WO-1, C. dubliniensis CD36, C. glabrata CBS138, C. guilliermondii ATCC 6260, C. lusitaniae ATCC 42720, C. orthopsilosis Co 90-125, C. parapsilosis CDC317, C. tropicalis MYA-3404, D. hansenii CBS767 and L. elongisporus NRRL YB-4239. Within curated species, we define a gene to be ‘Verified’ if there is some experimental evidence for function (e.g. a mutant phenotype, or enzymatic activity); otherwise, we define the gene to be ‘Uncharacterized.’
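The Verified/Uncharacterized rule in the note above can be sketched in a few lines of Python. This is only an illustration of the stated definition, not CGD's curation code: the `Gene` class, the `curation_status` and `tally` helpers and the gene names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Hypothetical record: a gene name plus any experimental evidence for function."""
    name: str
    experimental_evidence: list = field(default_factory=list)


def curation_status(gene):
    """Apply the stated rule: any experimental evidence makes a gene 'Verified'."""
    return "Verified" if gene.experimental_evidence else "Uncharacterized"


def tally(genes):
    """Count genes per status, as in the first two numeric columns of the table."""
    return Counter(curation_status(g) for g in genes)


# Placeholder genes (names are invented for illustration only).
genes = [
    Gene("geneA", ["mutant phenotype"]),
    Gene("geneB"),
    Gene("geneC", ["enzymatic activity"]),
]
counts = tally(genes)
```

Under this reading, the two status counts for a species sum to the number of genes considered for that species.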
Figure 1. Ortholog cluster and Gene Links on the CGD Homology Information tab. The section entitled ‘Ortholog Cluster’ contains a link to the corresponding CGOB page for the ortholog group. Each of the clustered sequences is listed with links to its source database (e.g. SGD, the Broad Institute, EMBL-EBI or CGD itself). The experimental status of each CGD and SGD gene is also given in this section, indicating whether there is published evidence for the existence of the gene as a functional entity. Links are also provided to download sequence files. In cases where a CGD-curated species is not included in the ortholog cluster but nevertheless has a high-scoring BLAST hit, that sequence is included in the next section of the page, entitled ‘Best hits in CGD species.’ Additional related proteins, from both more distantly related fungi and from non-fungal species, are listed along with links to gene information pages at their respective organism database sites.
Figure 2. Phylogenetic Tree Display on the CGD Homology Information tab. The phylogenetic trees are computed from the protein multiple sequence alignment for each ortholog cluster, using the SEMPHY program (29). The species name is displayed in a hover box when the cursor is placed above the gene name, and the full species and gene names are also listed directly above the tree in the Ortholog Cluster section of the page. This section of the Homology Information page may be hidden or expanded using the small plus-or-minus glyph located to the left of the header in the gold-colored sidebar.
Figure 3. Protein Alignment Display on the CGD Homology Information tab. The Protein Sequence Alignment is a decorated multiple sequence alignment of the members of the ortholog cluster, generated using MUSCLE (32). The alignment display is generated with MView (33). The overall percentage identity to the reference sequence is displayed adjacent to the gene name. Alignment columns with <80% identity to the reference are displayed in black font. In columns with >80% consensus, the residues are color-coded by physicochemical properties as follows: hydrophobic residues (A, I, L, M, V) in light green, aromatic residues (F, W, Y) in dark green, polar residues (N, Q, S, T) in pink, residues with negative charge (D, E) in red, residues with positive charge (H, K, R) in blue, residues associated with backbone change (G, P) in red and cysteines (C) in yellow. A nucleotide alignment of the coding sequence is displayed below the protein alignment, with purine bases (A, G) color-coded in red and pyrimidines (C, T) displayed in blue. Like the Phylogenetic Tree, each sequence alignment may be hidden or expanded using the small plus-or-minus glyph located to the left of the header in the gold-colored sidebar.
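The coloring scheme described in the Figure 3 caption can be approximated with a short Python sketch. It is not MView's implementation and makes several assumptions: the first sequence is taken as the reference, a column's conservation is measured as the fraction of sequences matching the reference residue, a column at exactly 80% is left black (the caption does not say how that case is handled), and every residue in a conserved column is colored by its own class. The function names and the toy alignment are hypothetical.

```python
# Residue classes and colors exactly as listed in the caption
# (note that D/E and G/P are both given as red there).
RESIDUE_COLORS = {}
for residues, color in [
    ("AILMV", "light green"),  # hydrophobic
    ("FWY", "dark green"),     # aromatic
    ("NQST", "pink"),          # polar
    ("DE", "red"),             # negative charge
    ("HKR", "blue"),           # positive charge
    ("GP", "red"),             # backbone change
    ("C", "yellow"),           # cysteine
]:
    for residue in residues:
        RESIDUE_COLORS[residue] = color

# Nucleotide coloring from the caption: purines red, pyrimidines blue.
NUCLEOTIDE_COLORS = {"A": "red", "G": "red", "C": "blue", "T": "blue"}

THRESHOLD = 0.8  # assumption: strictly greater than 80% is required for coloring


def percent_identity(reference, other):
    """Percentage of non-gap reference positions at which `other` matches."""
    pairs = [(a, b) for a, b in zip(reference, other) if a != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def column_colors(alignment):
    """Return one list of color names per sequence; alignment[0] is the reference."""
    reference = alignment[0]
    colored = [[] for _ in alignment]
    for col, ref_residue in enumerate(reference):
        column = [seq[col] for seq in alignment]
        conserved = sum(r == ref_residue for r in column) / len(column)
        for row, residue in enumerate(column):
            if conserved > THRESHOLD and residue in RESIDUE_COLORS:
                colored[row].append(RESIDUE_COLORS[residue])
            else:
                colored[row].append("black")  # weakly conserved column, gap or unknown
    return colored


# Toy alignment (invented): column 2 matches the reference in only 3 of 5 rows.
toy_alignment = ["AKC", "AKC", "ARC", "AHC", "AKC"]
toy_colors = column_colors(toy_alignment)
```

In the toy alignment, the first and third columns are fully conserved and therefore colored, while the middle column falls below the threshold and stays black in every row.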