
The Candida Genome Database (CGD): incorporation of Assembly 22, systematic identifiers and visualization of high throughput sequencing data.

Marek S Skrzypek¹, Jonathan Binkley¹, Gail Binkley¹, Stuart R Miyasato¹, Matt Simison¹, Gavin Sherlock².

Abstract

The Candida Genome Database (CGD, http://www.candidagenome.org/) is a freely available online resource that provides gene, protein and sequence information for multiple Candida species, along with web-based tools for accessing, analyzing and exploring these data. The mission of CGD is to facilitate and accelerate research into Candida pathogenesis and biology, by curating the scientific literature in real time, and connecting literature-derived annotations to the latest version of the genomic sequence and its annotations. Here, we report the incorporation into CGD of Assembly 22, the first chromosome-level, phased diploid assembly of the C. albicans genome, coupled with improvements that we have made to the assembly using additional available sequence data. We also report the creation of systematic identifiers for C. albicans genes and sequence features using a system similar to that adopted by the yeast community over two decades ago. Finally, we describe the incorporation of JBrowse into CGD, which allows online browsing of mapped high throughput sequencing data, and its implementation for several RNA-Seq data sets, as well as the whole genome sequencing data that was used in the construction of Assembly 22.

Substances: Fungal Proteins

Year: 2016 PMID: 27738138 PMCID: PMC5210628 DOI: 10.1093/nar/gkw924

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The Candida Genome Database (CGD, http://www.candidagenome.org/) is a freely available online resource, based on the Saccharomyces Genome Database (SGD, www.yeastgenome.org; (1)), which collects, organizes, and makes available Candida gene, protein and sequence information to the fungal research community. CGD also provides web-based tools for the visualization and analysis of biological data. Candida albicans is the most thoroughly studied of the human fungal pathogens and also serves as a model organism for the study of other more experimentally challenging fungal pathogens (2). The frequency of fungal infections has risen dramatically over the past few decades, with the annual incidence of invasive Candida infections recently estimated to range between 72 and 228 infections per million (3,4). These infections have high mortality rates (>35%) that, despite the availability of antifungal drugs, exceed those associated with bacterial infections among intensive care unit patients (5). As a result of medical progress, particularly with respect to chemotherapy, organ transplantation, intensive care and neonatal care, the population of immunocompromised patients susceptible to fungal infection is expanding (6–8), and resistance to antifungal drugs is increasing (3,9,10). Furthermore, C. albicans is not the only disease-causing species in the genus; of great concern is an emerging clinical prevalence of non-albicans Candida species (11,12). Among these, C. tropicalis is common, virulent, and increasingly resistant to antifungal therapy (13), C. parapsilosis is particularly problematic, causing severe infections in neonates (14), and C. glabrata exhibits a special ability to evade the immune system and survive after cellular engulfment, and can resist antifungal treatment (15–17).

In fungal taxonomy, C. albicans is placed in the order Endomycetales, to which the family Saccharomycetaceae also belongs (18,19), though Saccharomyces cerevisiae and C. albicans are separated by 140–850 million years of evolution (20). Despite sharing many similarities, including 3453 pairs of orthologous genes, the two fungi inhabit very different environmental niches, with S. cerevisiae existing as a saprophyte, and C. albicans living in close association with its mammalian hosts. Consequently, the two fungi show significant divergence in their genome structure and function. Extensive ‘transcriptional rewiring’ has occurred in their regulatory networks, such that evolved divergence in the activity of orthologous transcription factors, or divergence in cis-regulatory elements, has created distinct regulatory programs in each class (e.g. galactose metabolism (21), mating type (22,23) and ribosomal proteins (24)). Unlike S. cerevisiae, C. albicans exists primarily in a diploid state, with no conventional sexual reproduction that leads to meiosis. Instead, C. albicans undergoes a parasexual cycle that involves mating between two diploid cells of the opposite mating type and formation of a tetraploid cell that subsequently returns to diploidy by chromosome loss (25). Alternatively, chromosome loss may occur prior to mating, yielding transient haploids that mate and return to a diploid state (26). Meiosis has not been observed in either case, and the homologous chromosomes remain mostly intact, making the haplotypes much more stable than in other organisms. Thus, in order to understand and combat Candida pathogenesis, it is important to study Candida biology directly, and having access to the phased genome assembly opens up multiple new research opportunities.

Genome history, and the incorporation of C. albicans assembly 22 into CGD

Since its inception in 2004, CGD has been tasked with maintenance of the primary genome sequence and annotation for the C. albicans reference strain SC5314. CGD was originally based upon Assembly 19, a diploid, contig-level assembly built from the genomic sequence that had just been completed (27), and which included gene models, sequence corrections, and functional annotations provided by a group of volunteer researchers known as the Annotation Working Group (28). CGD then took over the task of curating all the Candida-related information based on published experimental data, comparative genomics predictions and improved sequence information (29). In 2006, Assembly 20 was incorporated into CGD (30); Assembly 20 was the first chromosome-level assembly for C. albicans, though it consisted of what are referred to as reftigs, wherein each of the chromosomes in the assembly consists of a mosaic of the two haplotypes (31). Assembly 20 was superseded by Assembly 21, because a sequence from a different strain, WO-1, had been inadvertently incorporated into Assembly 20. A major update to Assembly 21, based on additional sequence data and comparative analyses (32), was subsequently also incorporated into CGD. The most recent major update to the sequence is Assembly 22 (33), which was generated using Illumina-based sequencing of SC5314 as well as a collection of congenic strains homozygous for specific chromosomes, which allowed the construction of a phased diploid assembly of the genome. Assembly 22 was incorporated into CGD and became the default genome sequence in 2014. All the previous assemblies, however, remain available at the CGD website.

Incorporation of Assembly 22 into CGD entailed careful reconciliation of sequence ambiguities, potential assembly errors and differences from previous assemblies. We have established a pipeline that takes advantage of all available data, including the Illumina-based sequences from Muzzey et al. (33), Roche 454 sequences generated at CGD, as well as independently published, gene-specific sequence data derived from our curation of the Candida literature. We have used this pipeline to correct the sequences for more than 300 features, eliminating many reading frame errors and resolving multiple sequence ambiguities (Table 1). We have also re-analyzed the haplotype assignments for the entire Chromosome 3, which led to an exchange of 845 kb between chromosome 3A and chromosome 3B. Improving the reference sequence is an ongoing project and we are currently focusing our efforts on intergenic regions, repetitive sequences and segmental duplications.

Table 1.

Number of Assembly 22 features corrected, by error type (in either or both haplotypes)

Error Type	Features Corrected
Boundary/annotation	455
Ambiguous sequence	268
Nonsense codons	48
Missing stop codons	46
Misc. coding sequence	8
Misc. non-coding sequence	5
Missing start codons	2

Sequence versioning

Since 2010, CGD has used a versioning system to track genome sequence releases, and their associated annotations. The version designation appears in the name of each of the relevant sequence and feature annotation files that are available at CGD, so the exact source of the sequence data is always clear. Version designations appear in the following format: sXX-mYY-rZZ where XX, YY, and ZZ are zero-padded integers. XX is incremented when there is any change to the underlying genomic (i.e., chromosome) sequence. YY is incremented when there is any change to the coordinates of any feature annotated in the genome (e.g. any change in location or boundary, or addition or removal of a feature from the annotation). YY is reset to ‘01’ when XX is incremented (when a sequence change is made). ZZ is incremented in response to curatorial changes that affect information that appears in the GFF file, specifically gene names, gene aliases, gene IDs, gene descriptions, feature types (e.g. gene or pseudogene), and ORF classifications or qualifiers (e.g. Verified, Uncharacterized, Deleted, Merged). Files are checked on a weekly basis, as well as any time that a GFF file is regenerated manually, to determine if changes have occurred that warrant a change in the ZZ number. ZZ is reset to ‘01’ when either XX or YY is incremented (when a sequence change is made, or when the coordinates of any feature are updated). All versions are archived on the CGD download site.

As a hypothetical example, say that we start with s05-m01-r01 as the current version. When the next weekly file check is performed, and the new file is noted to contain curatorial updates to gene names in the database, but no new changes to the structural annotation or to the sequence itself, the new version designation becomes s05-m01-r02. Subsequently, the chromosomal coordinates of a gene are changed, based on curation of a paper that provides evidence for updating the gene model. Consequently, the new version designation becomes s05-m02-r01. Later, a change to correct a sequencing error is made, and the new version designation becomes s06-m01-r01.

CGD recommends that authors note in their materials and methods sections of published papers what version of the genome they were working with when they performed an analysis, so that the analysis is reproducible.
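The increment and reset rules above can be illustrated with a short Python sketch. This is not CGD's own software: the class name GenomeVersion, the field names seq, coords and curation, and the change labels passed to bump() are hypothetical names chosen here for illustration. Only the sXX-mYY-rZZ format and the reset rules come from the text.

```python
import re
from dataclasses import dataclass

# sXX-mYY-rZZ, where XX, YY and ZZ are zero-padded integers (format from the text).
_VERSION_RE = re.compile(r"s(\d+)-m(\d+)-r(\d+)")


@dataclass(frozen=True)
class GenomeVersion:
    """Hypothetical container for a version designation; field names are illustrative."""
    seq: int       # XX: changes to the chromosome sequence
    coords: int    # YY: changes to feature coordinates
    curation: int  # ZZ: curatorial changes visible in the GFF file

    @classmethod
    def parse(cls, text: str) -> "GenomeVersion":
        match = _VERSION_RE.fullmatch(text)
        if match is None:
            raise ValueError(f"not a version designation: {text!r}")
        return cls(*(int(part) for part in match.groups()))

    def __str__(self) -> str:
        return f"s{self.seq:02d}-m{self.coords:02d}-r{self.curation:02d}"

    def bump(self, change: str) -> "GenomeVersion":
        """Return the next designation for one kind of change.

        The labels 'sequence', 'coordinates' and 'curation' are assumptions
        made for this sketch, not identifiers used by CGD.
        """
        if change == "sequence":
            # XX is incremented; YY and ZZ are reset to 01.
            return GenomeVersion(self.seq + 1, 1, 1)
        if change == "coordinates":
            # YY is incremented; ZZ is reset to 01.
            return GenomeVersion(self.seq, self.coords + 1, 1)
        if change == "curation":
            # Only ZZ is incremented.
            return GenomeVersion(self.seq, self.coords, self.curation + 1)
        raise ValueError(f"unknown change type: {change!r}")


if __name__ == "__main__":
    version = GenomeVersion.parse("s05-m01-r01")
    for change in ("curation", "coordinates", "sequence"):
        version = version.bump(change)
        print(change, "->", version)
```

Run as a script, the loop replays the hypothetical example from the text: a curatorial update, then a coordinate change, then a sequence correction.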

Systematic ORF nomenclature

With Assembly 22, it became possible to implement a new position-based systematic nomenclature for chromosomal features. The previous systematic names, dating back to Assembly 19, consisted of the ‘orf19’ prefix followed by a unique but somewhat arbitrary number. The new systematic name is based on the known chromosomal location and haplotype, and it consists of the chromosome (C1-C7 and CR for the eight nuclear chromosomes, CM for the mitochondrial chromosome), a unique number indicating the order of features along chromosomes, the strand (W for Watson or C for Crick) and the haplotype (A or B). For example, C4_03570W_A denotes a feature located on chromosome 4, Watson strand and haplotype A. Feature numbers start at the left end of the chromosome and increase by 10 to allow for adding new features in the intervening spaces as they are discovered. Since systematic identifiers from all the previous assemblies remain in the literature, and some, especially ‘orf19’ names, continue to be used by researchers, CGD includes all the previous identifiers as searchable aliases to allow seamless and unambiguous transition between various nomenclature systems. Assembly 19/21 identifiers, ‘orf19 names’, are also prominently displayed on each Locus Summary page. Locus Summary pages (LSP) exist for every chromosomal feature in CGD and provide a primary way to access gene-specific information. In addition to locus-specific data, such as description, functional annotations (Gene Ontology terms, mutant phenotypes), or orthologous genes in other species, LSPs now also provide access to allele-specific data: DNA and protein sequences for both alleles, as well as a listing of allelic variations if they exist. LSPs also show graphical representations of the chromosomal context for each allele as thumbnails that lead to genome browser windows and allow further exploration of a chosen haplotype.
Assembly 22 allows powerful, allele- and haplotype-specific analyses of the overall genome structure, function and evolution. To facilitate genome-wide research, CGD provides access to all the data in downloadable files that can be used by researchers' own bioinformatics tools. All DNA sequences for the entire chromosomes, coding and non-coding regions, as well as translation products, are available for download at http://www.candidagenome.org/download/.
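For readers working with the downloadable files, the systematic naming scheme described in this section can be decomposed programmatically. The Python sketch below is an illustration, not a CGD tool. The regular expression is inferred from the single example given in the text (C4_03570W_A), so the five-digit number width is an assumption, as are the class and field names. The sketch requires all four parts named in the text, so any identifier without a haplotype suffix is rejected.

```python
import re
from dataclasses import dataclass

# Chromosome (C1-C7, CR, CM), feature number, strand (W/C), haplotype (A/B).
# The five-digit width is inferred from the example C4_03570W_A (an assumption).
_NAME_RE = re.compile(r"(C[1-7RM])_(\d{5})([WC])_([AB])")

# Watson is shown as '+' and Crick as '-' in the genome browser display.
_STRAND_SIGN = {"W": "+", "C": "-"}


@dataclass(frozen=True)
class SystematicName:
    """Hypothetical parsed form of an Assembly 22 systematic identifier."""
    chromosome: str  # e.g. 'C4', 'CR', 'CM'
    number: int      # order of the feature along the chromosome
    strand: str      # 'W' (Watson) or 'C' (Crick)
    haplotype: str   # 'A' or 'B'

    @classmethod
    def parse(cls, name: str) -> "SystematicName":
        match = _NAME_RE.fullmatch(name)
        if match is None:
            raise ValueError(f"not an Assembly 22 systematic name: {name!r}")
        chromosome, number, strand, haplotype = match.groups()
        return cls(chromosome, int(number), strand, haplotype)

    @property
    def strand_sign(self) -> str:
        return _STRAND_SIGN[self.strand]

    def __str__(self) -> str:
        return f"{self.chromosome}_{self.number:05d}{self.strand}_{self.haplotype}"


if __name__ == "__main__":
    feature = SystematicName.parse("C4_03570W_A")
    print(feature.chromosome, feature.number, feature.strand_sign, feature.haplotype)
```

Because feature numbers increase along the chromosome, sorting parsed names by (chromosome, number) gives left-to-right order within each chromosome; older ‘orf19’ aliases are outside the scope of this pattern and raise ValueError.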

Incorporation of JBrowse

An important part of CGD's mission is to incorporate and annotate large-scale datasets from published experiments on Candida species, and to make them easily available for exploration and analysis by our users. For several years we have provided GeneXplorer (34) to display and analyze microarray datasets. Recently, to provide access to the growing number of datasets from experiments that use high-throughput sequencing technologies, we have deployed the JBrowse genome browser (35). JBrowse is a mature and widely used application that is fast, intuitive, and compatible with most web browsers (36). JBrowse allows users to quickly view large-scale sequence data in a genomic context, at multiple zoom-levels of resolution, from base pairs in individual sequence reads to read-density summaries across large genomic regions. The display includes parallel tracks of annotated sequence features, allowing seamless navigation between JBrowse and the Locus Summary pages for each feature (Figure 1). Quantitative tracks graphically display comparative information, such as relative expression level or sequence conservation. JBrowse is highly flexible and customizable: users may easily load their own sequence datasets and analysis tracks, for display in the context of genomic features, or for comparison with datasets and tracks provided by CGD.

Figure 1.

JBrowse Visualization of RNA-Seq data at CGD. JBrowse display of the region around the C. albicans serum-inducible gene HWP1, showing aligned RNA-Seq reads from serum-treated cells (37). The red and blue bars in the top track of the main display window show genes annotated at CGD: red for genes encoded on the ‘W’ strand (+), blue for genes on the ‘C’ strand (–). HWP1 is the second gene from the left. Clicking on a bar brings up an information window for that gene, and includes a link to its CGD Locus Summary Page. The green bar graph below the gene track shows the density of aligned RNA-Seq reads along the chromosome, plotted on a log scale. The bottom track shows all the aligned RNA-Seq reads along the chromosome: each short bar in the bottom track represents a unique read. In this example the sequence reads are strand-specific: pink bars indicate reads transcribed from the ‘W’ strand, and light-blue bars indicate reads transcribed from the ‘C’ strand. Clicking on a bar brings up information about the read, including the sequence and quality score for each base. Menus and controls at the top of the browser provide navigation, zoom and search functionalities, and allow users to load their own data.

CGD currently offers a number of C. albicans datasets for viewing in JBrowse. We provide the high-throughput DNA sequence data (33) that was the basis of Assembly 22, described above. Optional tracks highlight the sequence variation between the two Assembly 22 haplotypes, as well as the variation between the common strains SC5314 and WO-1 (32). We also make available RNA-Seq datasets from a number of gene expression studies in C. albicans, including comparisons of different stress conditions (37), hyphae-inducing conditions (37), biofilm vs. planktonic growth (38), white-opaque switching (39), and allele-specific expression differences (40). In addition, we provide chromatin occupancy data (ChIP-Seq) for the Wor1p transcription factor during white-opaque switching (41). We also have RNA-Seq datasets for two other Candida species: gene expression under pH and nitrosative stress in C. glabrata (42), and during biofilm vs. planktonic growth in C. parapsilosis (43). We will add new datasets to JBrowse as they become available, and in response to user requests.

Future directions

The reference sequence for an organism is not static; as sequencing technologies continue to advance (both in accuracy and read lengths), there is always the possibility that updates will improve the reference sequence. For example, either PacBio or Oxford Nanopore sequencing (very long reads, though with high error rates), coupled with existing Illumina data (short reads, but low error rates) may better resolve telomeric and other repeat sequences, which are hard to resolve even with the original Sanger reads. Another question is whether the sequence of a single instance of a strain is a reasonable representation of the reference sequence. Indeed, given Candida albicans’ propensity to undergo rearrangement and loss of heterozygosity under stress, different lab isolates of ostensibly the same strain might have different sequences. It is likely that in the near future many distinct strains of Candida species will be sequenced, and CGD will endeavor to incorporate these into the database as well. They will provide insight into the genomic variation that exists in each of the Candida species. Additional high throughput sequencing datasets may also allow refinement of the genome annotations, such as the annotation of novel transcripts, or the addition of 5′ and 3′ UTRs to each of the genes. They may also allow the annotation of functional elements within non-transcribed regions, such as transcription factor binding sites from ChIP-Seq studies. CGD will strive to incorporate such refinements as soon as they become available, to keep the reference sequence and annotation current.

43 in total

Review 1. Insights into Candida tropicalis nosocomial infections and virulence factors.

Authors: M Negri; S Silva; M Henriques; R Oliveira
Journal: Eur J Clin Microbiol Infect Dis Date: 2011-10-30 Impact factor: 3.267

2. Comprehensive annotation of the transcriptome of the human fungal pathogen Candida albicans using RNA-seq.

Authors: Vincent M Bruno; Zhong Wang; Sadie L Marjani; Ghia M Euskirchen; Jeffrey Martin; Gavin Sherlock; Michael Snyder
Journal: Genome Res Date: 2010-09-01 Impact factor: 9.043

3. Evolution of alternative transcriptional circuits with identical logic.

Authors: Annie E Tsong; Brian B Tuch; Hao Li; Alexander D Johnson
Journal: Nature Date: 2006-09-28 Impact factor: 49.962

4. Opportunistic candidiasis: an epidemic of the 1980s.

Authors: S P Fisher-Hoch; L Hutwagner
Journal: Clin Infect Dis Date: 1995-10 Impact factor: 9.079

5. The diploid genome sequence of Candida albicans.

Authors: Ted Jones; Nancy A Federspiel; Hiroji Chibana; Jan Dungan; Sue Kalman; B B Magee; George Newport; Yvonne R Thorstenson; Nina Agabian; P T Magee; Ronald W Davis; Stewart Scherer
Journal: Proc Natl Acad Sci U S A Date: 2004-05-03 Impact factor: 11.205

6. White-opaque switching in natural MTLa/α isolates of Candida albicans: evolutionary implications for roles in host adaptation, pathogenesis, and sex.

Authors: Jing Xie; Li Tao; Clarissa J Nobile; Yaojun Tong; Guobo Guan; Yuan Sun; Chengjun Cao; Aaron D Hernday; Alexander D Johnson; Lixin Zhang; Feng-Yan Bai; Guanghua Huang
Journal: PLoS Biol Date: 2013-03-26 Impact factor: 8.029

7. The 'obligate diploid' Candida albicans forms mating-competent haploids.

Authors: Meleah A Hickman; Guisheng Zeng; Anja Forche; Matthew P Hirakawa; Darren Abbey; Benjamin D Harrison; Yan-Ming Wang; Ching-hua Su; Richard J Bennett; Yue Wang; Judith Berman
Journal: Nature Date: 2013-01-30 Impact factor: 49.962

Review 8. Candida albicans pathogenicity mechanisms.

Authors: François L Mayer; Duncan Wilson; Bernhard Hube
Journal: Virulence Date: 2013-01-09 Impact factor: 5.882

9. Extensive and coordinated control of allele-specific expression by both transcription and translation in Candida albicans.

Authors: Dale Muzzey; Gavin Sherlock; Jonathan S Weissman
Journal: Genome Res Date: 2014-04-14 Impact factor: 9.043

10. Identification and Characterization of Wor4, a New Transcriptional Regulator of White-Opaque Switching.

Authors: Matthew B Lohse; Alexander D Johnson
Journal: G3 (Bethesda) Date: 2016-01-15 Impact factor: 3.154
