Literature DB >> 25474325

Genome update of the dimorphic human pathogenic fungi causing paracoccidioidomycosis.

José F Muñoz¹, Juan E Gallo², Elizabeth Misas¹, Margaret Priest³, Alma Imamovic³, Sarah Young³, Qiandong Zeng³, Oliver K Clay⁴, Juan G McEwen⁵, Christina A Cuomo³.

Abstract

Paracoccidiodomycosis (PCM) is a clinically important fungal disease that can acquire serious systemic forms and is caused by the thermodimorphic fungal Paracoccidioides spp. PCM is a tropical disease that is endemic in Latin America, where up to ten million people are infected; 80% of reported cases occur in Brazil, followed by Colombia and Venezuela. To enable genomic studies and to better characterize the pathogenesis of this dimorphic fungus, two reference strains of P. brasiliensis (Pb03, Pb18) and one strain of P. lutzii (Pb01) were sequenced [1]. While the initial draft assemblies were accurate in large scale structure and had high overall base quality, the sequences had frequent small scale defects such as poor quality stretches, unknown bases (N's), and artifactual deletions or nucleotide duplications, all of which caused larger scale errors in predicted gene structures. Since assembly consensus errors can now be addressed using next generation sequencing (NGS) in combination with recent methods allowing systematic assembly improvement, we re-sequenced the three reference strains of Paracoccidioides spp. using Illumina technology. We utilized the high sequencing depth to re-evaluate and improve the original assemblies generated from Sanger sequence reads, and obtained more complete and accurate reference assemblies. The new assemblies led to improved transcript predictions for the vast majority of genes of these reference strains, and often substantially corrected gene structures. These include several genes that are central to virulence or expressed during the pathogenic yeast stage in Paracoccidioides and other fungi, such as HSP90, RYP1-3, BAD1, catalase B, alpha-1,3-glucan synthase and the beta glucan synthase target gene FKS1. The improvement and validation of these reference sequences will now allow more accurate genome-based analyses. To our knowledge, this is one of the first reports of a fully automated and quality-assessed upgrade of a genome assembly and annotation for a non-model fungus.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2014 PMID： 25474325 PMCID： PMC4256289 DOI： 10.1371/journal.pntd.0003348

Source DB: PubMed Journal: PLoS Negl Trop Dis ISSN： 1935-2727

Introduction

Paracoccidioides spp. is a thermally dimorphic pathogenic fungus that causes paracoccidioidomycosis (PCM), a neglected health-threatening human systemic mycosis endemic to Latin America where up to ten million people are infected. Disease can progress slowly, with roughly five new cases of disease per million infected individuals per year, with a male to female ratio of 13 to 1. About 80% of PCM cases occur in Brazil, followed by Colombia and Venezuela [2]. Within the Paracoccidioides genus, the three characterized phylogenetic lineages of P. brasiliensis (PS2, PS3, S1) and the one characterized lineage of P. lutzii (Pb01-like) can infect humans, and these groups can vary in virulence and induce different immune responses by the host [3], [4]. To better understand the pathogenesis and to enable genomics-based studies, the genomes of Paracoccidioides spp. were sequenced, analyzed and made publicly available in 2011 [1]. The Broad Institute of MIT and Harvard in partnership with the Paracoccidioides research community selected three reference isolates for sequencing and genomic analysis; assembly size for these strains varied between 29.1 and 32.9 Mb, and between 7,875 and 9,132 genes were identified in each strain [1]. These included two strains of P. brasiliensis (Pb18 representing the S1 lineage and Pb03 representing the PS2 lineage) and one strain of P. lutzii (Pb01) [1]. These sequenced isolates are extensively referenced in molecular biology and experimental mycology laboratories working with Paracoccidioides spp. and also other pathogenic fungi, including those working with yeast phase specific genes expressed during host infection. These sequences also serve as a reference to analyze high-throughput data increasingly generated by genomic, metagenomic, transcriptomic and proteomic approaches. Additionally, accurate sequences are critical for evolutionary analyses, e.g., to identify positively selected genes, as well as to provide new targets for the design of diagnostic assays. The P. brasiliensis Pb18 and Pb03 strains and the P. lutzii Pb01 strain were sequenced using the sequencing technology and computational methods available at the time, which produced high quality draft assemblies. However, the assemblies included a large number of gaps and uncertain or low quality nucleotides in the final consensus sequences. Also, the annotation pipelines flagged only the most extreme annotation errors for curation and did not address the larger number of smaller scale errors in the gene models and underlying sequence [5]. Correction of such errors requires re-evaluation of the assembly consensus sequence and associated annotation. Assembly errors that could not be detected in previous data and passed standard quality control criteria at that time can now be corrected using next generation sequencing (NGS) for systematic assembly improvement. These include errors in gene-containing regions of the original genomic assembly affected by poor quality sequence or ambiguities, which can cause incorrect gene structure predictions. Since predicted genes of reference genomes are now frequently used for homology-based inference or confirmation of gene structures in closely related species, errors in the reference sequence may be propagated to other genomes [6], [7]. Therefore, systematic improvement of a genome assembly and annotation can impact not just the understanding of that particular species, but also that of other related species for which it is used as a reference for comparison. Here, we present an update of the three Paracoccidioides reference genome sequences achieved using Illumina re-sequencing to correct assembly errors and document the improvements obtained. The improved and updated reference genome assemblies and annotations of this important human fungal pathogen now allow more accurate SNP analyses, genome-wide evolutionary (e.g., selection) analyses that depend on high-quality sequences, phylogenetic footprinting studies of regulatory regions, or primer and probe design for diagnostic assays.

Methods

Paracoccidioides reference strains and previous sequencing

Three reference isolates of Paracoccidioides spp. (Pb01, Pb03 and Pb18), representing two species, were previously sequenced. The isolate of P. lutzii (Pb01) was a clinical isolate originating from an acute form of paracoccidioidomycosis (PCM) in an adult male. The two P. brasiliensis isolates were from individuals presenting chronic PCM; Pb03 represents the PS2 phylogenetic group and Pb18 the S1 group [1], [4]. In a partnership between the Broad Institute and the Paracoccidioides research community, these genomes were previously sequenced using multiple whole genome shotgun libraries constructed from genomic DNA for each strain; paired-end sequences were generated for each with Sanger technology and assembled using Arachne [1] (assembly v1; S1 Table).

Re-sequencing of Paracoccidioides reference strains

The reference strains Pb01 (previously sequenced DNA sample) and Pb03 and Pb18 (newly extracted DNA samples) were re-sequenced using Illumina technology. For library construction, 100 ng of genomic DNA was sheared to ∼250 bp using a Covaris LE instrument and prepared for sequencing as previously described [8]. A library for each of the three samples was used to generate 101 base paired-end reads on the Illumina HiSeq2000 platform, producing an average genome coverage of 165X.

Assembly improvement using Pilon

To improve the genome sequence of Paracoccidioides spp. strains Pb01, Pb03 and Pb18, Illumina paired-end reads were aligned to the draft reference assemblies (assembly v1) using BWA version 0.5.9 with default settings [9]. The assembly consensus sequence was re-evaluated by providing these alignments as input to the automated assembly improvement program Pilon (version 1.4, default parameters, www.broadinstitute.org/software/pilon/). Pilon uses the Illumina read alignments for multiple classes of assembly correction. First, Pilon scans the read alignments for positions where the sequencing data disagree with the input genome (assembly v1) and corrects small errors such as single nucleotide differences and small insertion/deletion events. Second, Pilon looks for coverage and alignment discrepancies to identify potential mis-assemblies and larger variants. Finally, Pilon uses reads anchored adjacent to discrepant regions and gaps in the input genome to reassemble the region, attempting to fill in the true sequence including large insertions. As output, Pilon provides the sequence of this improved genome assembly (assembly v2; Fig. 1) along with files summarizing the changes and quality measures used in the assessment.

Figure 1

Overview of genome assembly and annotation improvement process.

Gene prediction and annotation

Protein-coding genes were predicted in the improved assemblies (assembly v2) using a combination of gene models from the prediction programs Augustus [10], Genemark-ES [11], GlimmerHMM [12], Genewise [13], and Snap [14], as well as automated revision based on EST data (e.g., from [15]) and manual gene revision of flagged calls. The predicted gene sets were then provided as input to EvidenceModeler (EVM) [16] to obtain the best consensus model for a given locus. The consistency of the gene models was evaluated by examining alignments of protein orthology groups identified using OrthoMCL [17]. EVMLite was used to rescue orphan genes not captured in EVM; only those genes with additional evidence such as overlap to Genewise or non-repeat HMMER3 PFAM domains were rescued, as well as non-redundant genes overlapping the OrthoMCL genes in clusters containing 2 or more genomes. Lastly spurious gene models matching repetitive or low-complexity sequences were removed. For each Paracoccidioides genome, we compared the original annotation (v1) with the updated annotation (v2) to evaluate the changes in the new gene sets. To precisely characterize the types of changes across the v1 and v2 annotations, we first mapped the corresponding gene between the two assemblies. The v1 and v2 assemblies were aligned using nucmer [18], and the alignment coordinates were used to assign gene correspondence between the initial annotation v1 and the new annotation v2. This mapping also allowed us to preserve locus numbers in the updated gene set. Each annotated gene was assigned a locus number, keeping where appropriate the previous locus number of the form PAAG_##### (Pb01), PABG_##### (Pb03) or PADG_##### (Pb18), which serves as a unique identifier within each genome and across assemblies. New genes, merged genes, and genes with large structure and sequence changes in transcripts were assigned new and unique locus numbers following the last locus number of annotation v1. Locus numbers of deleted genes do not appear in the final gene sets.

Evaluation of gene annotation improvements between Paracoccidioides strains

To evaluate whether the changes in gene sequence and structure produced a more accurate gene set, the gene sequences of annotation v1 and of the updated annotation v2 were compared via sequence similarity and orthology analysis. To evaluate the consistency of gene structures for orthologs as well as their conservation between species, OrthoMCL version 1.4 with a Markov inflation index of 1.5 and a maximum e-value of 1e-5 was used to identify orthologous clusters across the six total protein sets corresponding to annotations v1 and v2 of each Paracoccidioides strain. For each cluster group representing putative orthologs, we compared the maximum length difference among the three Paracoccidioides genes in annotation v2 to that in the annotation v1. To compare the functional content of the v1 and v2 gene sets, we evaluated both protein domain families (PFAM) and pathway information (KEGG). Using HMMER3 [19], we mapped v27 of the PFAM domain database [20] to both the v1 and v2 gene sets. KEGG domains [21] from release 65 were also mapped to both gene sets using BLAST. To evaluate changes in the gene structure, the corresponding transcripts from annotation v1 and v2 were identified as described above. We also aligned gene sets v1 and v2 using BLASTn [22] version 2.2.28+ with default parameters, using an in house Perl script to determine the types of modification for each gene, which included changes in gene length, gene coverage and percent nucleotide identity. We manually checked a random gene sample of each type of change (up to 10 genes) from both gene correspondence and BLAST analyses to verify that changes in gene set v2 were actually gene improvements. To evaluate changes in the coding regions of genes of high interest to the community, we selected known specific yeast-phase genes or virulence factors of Paracoccidioides spp., as well as other genes that are generally considered relevant for research on Paracoccidioides or related dimorphic pathogens, for manual review. The sequences of these genes' coding regions were aligned at the protein level with CLUSTALW [23] version 2.1, using both the v1 and v2 annotations.

Gene annotation improvements using genes from CEGMA and from related dimorphic pathogenic fungi

The coverage of Core Eukaryotic Genes defined by CEGMA [24] was evaluated using the CoreAlyze tool (http://sourceforge.net/projects/corealyze/) to summarize results for all the v1 and v2 gene sets. BLASTp version 2.2.28+ was run with default settings using protein sets from annotations v1 and v2 as the database, with Saccharomyces cerevisiae and Schizosaccharomyces pombe CEGMA proteins as the query. We also included the protein gene sets of two close relatives of Paracoccidioides, the dimorphic fungal pathogens Blastomyces dermatitidis and Histoplasma capsulatum. In order to obtain a detailed picture of the changes where gene annotations were modified but not completely overridden, we compared protein sequences between the two versions, excluding proteins that were added or deleted from the final gene set v2, as well as proteins for which the new annotation was for a completely different transcript at the same locus. For a hit to be counted, the protein needed to match a protein in the reference set with at least 75% identity for the v1 and v2 annotations. This percent identity cutoff was determined empirically to eliminate spurious low similarity alignments. The percent identity and the bit score between the query protein and each version of the Paracoccidioides annotations (v1 and v2) were compared.

Results

Genome resequencing with NGS technology

The strains of the genomes of Paracoccidioides spp. previously sequenced [1], Pb18 and Pb03 and Pb01, were re-sequenced using Illumina 101 bp paired-end reads. This sequencing generated 93.6 million reads for Pb18 with an average coverage of 198X, 124.2 million reads for Pb03 with an average coverage of 150X and 110.0 million reads for Pb01 with an average coverage of 148X. This high coverage sequence data was then used to refine the consensus sequence of the original assembly by assessing differences between the new sequence and the previous assemblies. This can target a wide range of improvements, including correcting base calls, resolving ambiguous bases and closing gaps within scaffolds.

Genome assembly improvement using Pilon

Fig. 1 shows a simplified overview of the workflow of genome improvement. The new Illumina data were used to systematically improve the three Paracoccidioides spp. assemblies using Pilon (http://www.broadinstitute.org/software/pilon/). Pilon bases its improvement calls on an alignment of the reference genome and the sequenced reads. The aligned bases and depth at each sequenced position provides evidence for the reference base or for an alternative; where changes are supported they can result in single base differences, insertion or deletion of single bases or larger regions, identification of collapsed regions and more complex changes and gap filling based on local reassembly. Reads of each of the genomes of Pb18, Pb03 and Pb01 were aligned to the corresponding reference assembly using BWA [9] and the resulting bam file was used as input for Pilon. In each of the Paracoccidioides assemblies, Pilon identified and fixed base errors in the consensus sequence. The statistical improvements for the assemblies v2 of Paracoccidioides spp. are summarized in Table 1. The most frequent class of changes was single base substitutions, identified as single nucleotide polymorphisms (SNPs) between the assembly and reads. Between 3,018 and 3,290 single base errors were corrected in each assembly. Small insertions and deletions were also incorporated into each assembly. The major classes of changes can be attributed to bases added and removed in reassembly fixes, collapsed bases in the new assembly and the closing of gaps (Table 1). Regions of misassembly identified and fixed by Pilon resulted in bases added or removed but no new gaps opened. Across all the assemblies, 20% of all initial gaps were closed by Pilon; the number of gaps closed were 113, 56 and 212, for Pb18, Pb03 and Pb01, respectively. Overall, the assembly improvement process led to an increase of contig N50 for all strains. About ∼99% of low quality nucleotides in assemblies v1 were well supported or fixed with high flag coverage in assemblies v2.

Table 1

Summary of assembly metrics after Pilon improvement.

Pilon summary metrics	P. brasiliensis		P. lutzii
	Pb18	Pb03	Pb01
Read depth of coverage	127	146	148
SNPs	3,290	3,018	3,072
Ambiguous bases	246	222	221
Small insertion bases	957	1,083	1,062
Small deletion bases	725	714	628
Bases added in reassembly fixes	109,312	89,243	118,931
Bases removed in reassembly fixes	37,417	38,906	41,822
Gaps opened	0	0	0
Gaps closed	113	56	212
Collapsed regions	3	1	2
Collapsed bases	64,378	20,918	43,967
Increase in contig N50 (kb)	14.16	4.98	29.17

Overall, the P. lutzii Pb01 genome assembly was most substantially improved, based on comparing statistics for all v1 and v2 assemblies (S1 Table). The contig N50 for Pb01 v2 increased by 29.1 kb; more bases were added and removed after re-assembly fixes and more gaps were closed than in the other two genomes. The genome size and number of scaffolds of Pb18 and Pb03 were essentially unchanged. The Pb01 genome size decreased slightly from 32.94 to 32.93 Mb; the updated assembly contains one scaffold fewer, as two scaffolds were merged by closing the gap between them. The number of contigs was reduced in all three strains, which considering the increase in the contig N50 indicates that the assemblies v2 for Pb18, Pb03 and Pb01 were less fragmented. All these changes indicate that the genome assemblies v2 after Pilon improvements were more contiguous, contained more bases with high quality, and had fewer gaps and errors.

Gene annotation improvement in updated assemblies

The gene annotations of the reference strains Pb18, Pb03 and Pb01 were updated using a pipeline to transfer and revise gene structures (Methods). The implemented annotation pipeline was an updated and improved version of the previous protocol used to annotate Paracoccidioides spp. assemblies v1. The current pipeline includes an updated set of gene prediction programs, including the EVM caller used to select the best call for each locus. Databases used for training these gene prediction programs are also more comprehensive, with more sequences available from the dimorphic fungi group for comparison. Also, the databases used for homology inference and functional annotation were updated since the previous annotation. In addition, we identified orthologs to evaluate the gene calls for consistency (see below). The incorporation of these new methods and data improved the evidence supporting gene prediction. The updated gene sets are more consistent across the three Paracoccidioides genomes (S1 Table). The total gene count for the two P. brasiliensis genomes now only differs by 37 in v2 whereas the v1 gene counts differed by 866; overall the update removed 351 genes from Pb18 and added 552 genes to Pb03. P. lutzii (Pb01) also has a more similar gene count, due to 306 fewer genes in the v2 compared to v1. A more detailed view of the gene structure changes by major categories is provided in Table 2; these statistics were calculated by mapping the transcripts from the previous annotation to the corresponding locus on assembly v2. Notably, this analysis helped recover a large number of genes missed by the original annotation in each genome; the total genes newly added to a region was 840 in Pb18, 933 in Pb03 and 936 in Pb01. In addition, dubious genes were removed from each genome; the number of genes no longer present at the same locus was 1187 in Pb18, 490 in Pb03 and 1265 in Pb01. Other changes include extending or truncating transcripts, merging or splitting transcripts, changes to splice sites, and changes to UTRs (Table 2). Only 23% of genes in the v2 annotations were unchanged from v1; the primary transcripts were identical for 1816 genes in Pb18, 2599 in genes Pb03 and 1581 in genes Pb01. Genes with any type of change in their coding sequences represent a smaller subset, in total 5734 (68%) in Pb18, 4895 (58%) in Pb03 and 6309 (71%) in Pb01.

Table 2

Summary of annotation changes in protein coding genes.

Change type	P. brasiliensis		P. lutzii	Change description
	Pb18	Pb03	Pb01
Add	840	933	936	Gene added to a region that previously had none
Splice site	1,124	1,122	1,000	Same start and same stop; internally, a splice site moved
Extended	3	6	2	Splice agreement, new model is longer; upstream start and downstream stop
Start Extended	262	329	307	Splice agreement and same stop; new model is longer, upstream start
Stop Extended	46	38	30	Splice agreement and same start; new model is longer, downstream stop
Shift	15	12	18	Splice agreement; new model has upstream, or downstream, start and stop
Truncated	5	4	2	Splice agreement, new model is shorter; downstream start and upstream stop
Start Truncated	237	208	276	Splice agreement and same stop; new model is shorter, downstream start
Stop Truncated	7	11	6	Splice agreement and same start; new model is shorter, upstream stop
UTR	1,504	802	1,679	Splice agreement and same start and same stop, but differ in UTR
Cluster	12	9	16	Multiple old genes map to multiple new genes; complex change
Merge	118	88	102	Multiple old genes have been merged into one
Split	402	509	495	Single old gene has been split into multiple new genes
Other CDS change	1,999	1,757	2,376	Other model not covered by another category or multiple models
None	1,816	2,599	1,581	Primary transcript is identical
Total	8,390	8,427	8,826	Total genes in current annotation version 2

Both sequence addition (gap filling and local reassembly) and small changes in the genome assemblies (single-nucleotide substitutions or insertion/deletion events (indels)), contributed to the improvement of the gene annotation in the update v2. Two examples of how indel correction fixed gene structures are shown in Fig. 2. In the first case (left panel), an extra C was inserted at a polyC tract in PABG_00129 of Pb03; correction of this position resulted in extending the coding DNA sequence (CDS) of this gene by 423 bases. In the second case (right panel), an A was deleted at a polyA tract in PABG_00790; correction of this position also corrected the reading frame, allowing for removal of a false intron that was needed to step over a stop codon and extension of the CDS of this gene by 252 bases. While these are small changes to the underlying assembly, both have had larger impact on correcting these gene structures.

Figure 2

Examples of an artifactual insertion and an artifactual deletion that were corrected during the update of the P. brasiliensis Pb03 genome sequence.

Examples of an artifactual insertion and an artifactual deletion that were corrected during the update of the P. brasiliensis Pb03 genome sequence.

Screenshots of Pilon-generated genome browser tracks in GenomeView v1.0 [35] show the evidence used by Pilon to recognize and correct an incorrect insertion in the gene PABG_00120 (left) and an incorrect deletion in the gene PABG_00790 (right). Tracks (top panels) depict paired-end reads (green) aligned to the corresponding region of the reference assembly v1, a subset of the total depth of ∼150X or ∼170X; these alignments were used by Pilon to refine the consensus sequence, generating the improved Pb03 assembly v2. Positions in the v1 assembly where aligned reads suggest a change due to either a gap (red box) or an insertion (black line) are indicated with dashed red boxes. The changes suggested by Pilon are also supported by conservation of the changed bases in a multiple alignment (bottom panels) with the corresponding region of P. brasiliensis Pb18 and P. lutzii Pb01. The annotation improvements were also analyzed by comparing the alignments of orthologs for all three Paracoccidioides genomes, identified by OrthoMCL (Methods). For orthologs identified either from the v1 or v2 assemblies, maximum and minimum gene length was computed for each ortholog cluster. In comparing these gene lengths (Fig. 3A), the v1 gene annotations (red points in scatterplot) exhibited a higher variation among Pb18, Pb03 and Pb01 orthologs compared to annotation v2 (blue points). The positions that are closer to the diagonal correspond to smaller differences in gene length between orthologs; as expected for an improved annotation, the v2 points are closer to the diagonal than v1. These differences between maximum and minimum length of the genes within each orthologous cluster group were also plotted on a logarithmic scale (Fig. 3B), based on sorting cluster differences from smallest to largest. The v2 annotation differences (blue curve) were lower and well separated from the v1 annotation (red curve), providing additional support of the increased length concordance in the v2 annotation.

Figure 3

Improved consistency of gene annotation in v2 genomes.

Improved consistency of gene annotation in v2 genomes.

The final predicted gene sets of the three Paracoccidioides strains were clustered using OrthoMCL, in v1 and v2. The scatterplots (A) compare, for each clustered group, the maximum length versus the minimum length of the three Paracoccidioides genes in the same cluster, for each of the two versions. The scatterplot contrasts the maximum-minimum pairs from annotation v1 (red points) and those from annotation v2 (blue points). The location of blue points closer to the diagonal illustrates that the annotation v2 was more consistent across the three genomes with smaller differences in gene length. In the same sense, the rank plots (B) show the difference between maximum and minimum length for each clustered group, for each of the two versions; again annotation v2 (blue line) showed fewer (later increase) and smaller (more gradual increase) differences, corresponding to the improvement of the genome annotation in v2. Further analysis of gene conservation also supported the greater consistency among the Paracoccidioides spp. genomes in the v2 annotation. The number of genes found in all three genomes increased, whereas the number of unique genes specific to only one genome decreased; this has produced a more uniform set of protein coding genes (Figure S2). The improved structural annotation also led to improvements in functional annotation. The v2 annotation had more genes with assigned protein domain families (PFAM) and pathway information (KEGG), using the same version of these databases for the v1 and v2 gene sets (Figure S2). This supports the higher functional content of the revised gene sets, despite the lower total gene counts in two of the genomes. We also manually reviewed and curated the predicted structures of a number of protein-coding genes that are of importance to the Paracoccidioides research community, including well-characterized yeast-phase specific genes and other virulence factors. This introduces changes to the transcript sequence of 27 of these genes (Table 3). The improvements to the assemblies resulted in updated transcript predictions for the vast majority of genes of the three reference strains, with substantially corrected gene structures for several virulence-associated or yeast-phase specific genes of central importance in Paracoccidioides or other dimorphic fungi, including HSP90 [25], PbGP43 [26], PbP27 [27], RYP1-3 [28], BAD1 [29], catalase B, alpha 1,3 glucan synthase and the beta glucan synthase target gene FKS1 [30], [31]. An extreme example is the HSP90 gene, where corrections were made to the sequence of each of the three Paracoccidioides genomes (Fig. 4). This example illustrates the annotation errors in v1 of all Paracoccidioides reference strains that were fixed in v2 after Pilon improvement and re-annotation. In this case one or more single-nucleotide errors, unknown single nucleotides (N's), and/or single nucleotides that were erroneously reported as absent or duplicated by Sanger sequencing resulted in radically different predicted gene structures (intron/exon and/or gene boundary errors). This is shown in detail for a cluster of errors present at the end of HSP90 in Pb03 (Fig. 4B), which included alteration of the proper stop codon, resulting in premature truncation of this gene. Another example of coding sequence updates to multiple genomes is shown for FKS1, where different regions of the Pb03 and Pb18 proteins were restored in the updated assemblies and annotations (Figure S1).

Table 3

Changes in updated annotations of known yeast-phase specific genes or virulence factors of Paracoccidioides and other dimorphic human pathogenic fungi.

Gene name description	Annotation v2 IDs			Type of change			Ref.
	P. lutzii	P. brasiliensis		P. lutzii	P. brasiliensis
	Pb01	Pb03	Pb18	Pb01	Pb03	Pb18
	PAAG_	PABG_	PADG_	PAAG_	PABG_	PADG_
1,3-beta-glucan synthase component GLS1 (FKS1)	05071	04524	11846*	None	Splice site	CDS/ID	[36]
1,3-beta-glucanosyltransferase (GEL1)	03782	00831	03286	Splice site	Splice site	Splice site	[31]
4-hydroxyphenylpyruvate dioxygenase (4-HPPD)	02615	03102	01636	UTR	Splice site	UTR	[37]
Alpha-1,3-glucan synthase (AGS1)	03297	00726	03169	CDS	CDS	Splice site	[38], [39]
Alternative oxidase (AOX)	01078	01661	03747	UTR	UTR	UTR	[40], [41]
bZIP transcription factor (HAPX)	04257	04038	07492	UTR	UTR	UTR	[42], [43]
Catalase B (CATB)	01553	03611	00225	None	None	Stop extended	[44], [45]
Catalase peroxixomal (CATP)	01454	01943	00324	Stop extended	Start truncated	Start truncated	[46]
Conserved hypothetical protein (P27)	08096	07332	08402	UTR	None	UTR	[27]
Cu Zu superoxide dismutase (SOD1)	04164	03954	07418	CDS	CDS	CDS	[47], [48]
Cu Zu superoxide dismutase (SOD3)	02971	00431	02842	UTR	None	None	[36], [47]
Dimorphism regulator histidine kinase (DRK1)	05810	06372	07579	UTR	CDS	CDS	[49]
Glucan 1,3-beta-glucosidase (GP43)	05770	06340	07615	UTR	UTR	UTR	[26]
Glyceraldehyde-3-phosphate dehydrogenase (GAPDH)	08468	00022	02411	Splice site	Extended	UTR	[48], [50]
HAD-superfamily hydrolase (HAD32)	00503	06765	02181	UTR	Splice site	Splice site	[51]
Heat shock protein 60 Kda (HSP60)	08059	07300	08369	UTR	UTR	UTR	[36], [52]
Heat shock protein 90 Kda (HSP90)	05679	06249	07715	Splice site	CDS	CDS	[25]
L-ornithine 5-monooxygenase (SID1)	01682	03730	00097	Splice site	None	None	[43], [53]
Tubulin beta chain (TUB)	03031	00486	02900	CDS	CDS	CDS	[15]
Adhesin WI-1 (BAD1)	08980	07814	12525°	UTR	None	New gene	[29], [54]
cAMP-independent regulatory protein pac2 (RYP1)	11652**	06919	06243	CDS/ID	None	UTR	[28], [55]
Ornithine decarboxylase (ODC)	03153	00600	03032	None	None	UTR	[56]
Required for yeast phase growth 2 (RYP2)	02671	03151	11302***	Splice site	Splice site	CDS/ID	[28], [57]
Required for yeast phase growth 3 (RYP3)	06081	06575	08037	Start extended	CDS	Stop extended	[28], [57]
Urease	00954	01291	03871	CDS	Start truncated	CDS	[58]
Urease accessory protein (UREG)	06237	05255	07010	UTR	Start truncated	UTR	[59]
Ureidoglycolate hydrolase (UGH)	04751	00102	02493	UTR	UTR	UTR	[59]

*IDs from version 1 *PADG_04920; **PAAG_03579; ***PADG_01695.

°New gene that was not reported previously.

Figure 4

Diverse error correction for the 90 kDa heat shock protein (HSP90 gene) of Paracoccidioides spp.

Diverse error correction for the 90 kDa heat shock protein (HSP90 gene) of Paracoccidioides spp.

(A) In this example different annotation errors were present in v1 of all three Paracoccidioides reference strains, all of which were fixed in v2 after Pilon improvement and re-annotation. The example also illustrates how one or more single-nucleotide errors, unknown single nucleotides (N's), or single nucleotides that were erroneously reported as absent or duplicated by a Sanger sequencer can amplify across annotations, generating radically different gene structure (intron/exon and/or gene boundary) predictions. (B) Five changes are shown at assembly (DNA sequence) level, one of which was a single nucleotide error in a stop codon; as a result, the gene-calling program did not recognize the end of an exon and it was not reported. *IDs from version 1 *PADG_04920; **PAAG_03579; ***PADG_01695. °New gene that was not reported previously. The improvements in the annotation v2 were also analyzed for completeness by comparing to a set of highly conserved fungal genes defined by CEGMA and to protein sets of related dimorphic human pathogenic fungi. Genes in the v2 annotation showed a higher coverage of both the CEGMA and related dimorphic fungal data sets in comparison with annotation v1, suggesting these v2 genes are more complete (Figure S3). Furthermore, we examined the level of conservation to other fungi, by analyzing the difference of the BLASTp score between the v1 and v2 protein sets compared to those of B. dermatitidis, H. capsulatum and the CEGMA genes of S. cerevisiae and S. pombe. We observed that in all cases the v2 annotation had more hits greater than the minimum-similarity cutoff, and that the vast majority of genes of the v2 annotation had higher BLAST score values than their counterparts from v1 annotation (Figure S4).

Discussion

The initial draft genomes of three isolates of Paracoccidioides (P. brasiliensis isolates Pb03 and Pb18, and P. lutzii isolate Pb01) served as the first complete genome references for this fungal species [1]. Although these assemblies were obtained using the best technology available at that time, they included gaps and low quality sequence in genic and intergenic regions, which in turn resulted in a number of suboptimal gene structures, coding sequences and predicted protein sequences. This work has revised these reference genomes, providing the Paracoccidioides community with more complete and accurate sequences; this provides a more accurate foundation for future genome-based, molecular biological or genetic research on paracoccidioidomycosis and the fungal strains that cause it. The strategy we have followed will more widely be useful also for other groups wishing to update fungal and other microbial genomes in future. The updating of a reference genome, in particular of the underlying assembly and annotation, can be thought of as a largely computational form of deep sequence curation. The success of the update we present here shows that next-generation sequencing (NGS) together with publicly available software tools can markedly enhance the quality of a eukaryotic genome resource. Indeed, the availability of affordable NGS sequencing opportunities makes such endeavors accessible to small bioinformatics groups. The massively computer-assisted component of such an update, which can include tabular and graphical views for monitoring improvements and performing quality control, can be complemented by choosing and following a few ‘guide genes’ to evaluate the process. This focused analysis provides tangible examples of how the update affected predicted properties of important genes, such as gene structure or encoded proteins. The accuracy of a genome sequence and associated annotations are critically important for many types of analysis; therefore validating and improving the accuracy of the sequence and annotation can have wide impact, especially for methods highly sensitive to sequence errors. One example involves examining a genome sequence for evidence of genes and genic regions likely to be under positive selection. Such genes and genic regions, which are believed to be relatively rare in many eukaryotic genomes (see, e.g., [32]), are sometimes associated in pathogenic organisms with virulence or rapid adaptation to host conditions, including resistance to defense by the host or avoidance of the host immune system. An example of such adaptation has been found for surface proteins of diverse pathogens [32] and in fungi of the proline-rich antigen gene in Coccidioides spp. [33]. Positive selection can also occur in response to antimicrobial drugs, as in chloroquine resistance in Plasmodium falciparum [34]. Candidate regions under positive selection are commonly identified as sections of coding regions having unusually high rates of nonsynonymous (amino-acid changing) substitutions. Precisely because such regions are quite rare, a coding region of low sequence quality having several sequencing errors could be categorized as a region under positive selection, and if there are several such regions in a genome, an automated genome-wide screen will report a high percentage of false positives. Conversely, assembly and annotation improvements such as we describe here can effectively evaluate and fix such regions of a genome so that even error-sensitive evolutionary analyses become realistic. Similar considerations also apply to analyses that have more obvious clinical relevance. For example, improving a DNA sequence's accuracy can bring it closer to being ‘clinical grade’ or ‘diagnostic grade’. Indeed, identification of a clinical sample of a human pathogenic fungus isolated in a hospital using sequence comparison requires certainty that any nucleotide differences (e.g., resulting from single nucleotide polymorphisms/SNPs) observed between the sequenced sample and trusted reference strain(s) or isolate(s) are not simply errors in the reference. For fungi encountered in clinical contexts, only one or a few traditionally used loci are typically represented by reliable reference sequences, which are often from the ribosomal DNA, or from one or two protein-coding genes known beforehand to be diagnostically informative. Reference diagnostics, as well as diagnostic PCR assays (e.g., primer/probe design in real-time PCR assays), depend on such regions that have been reliably characterized at the molecular level in a fair number of related species or strains that could be present in clinical settings. Whole-genome gene sets offer, however, new perspectives; if their sequence quality is high, one could then systematically and exhaustively screen alignments of the full gene sets for diagnostically promising genes and genic regions that are likely to be informative for the identification task at hand. Such genome-wide screens should be able to identify new, candidate target loci, and molecular assays could then be developed for them and validated. Genome sequences also allow for metagenomic or metatranscriptomic analysis, where reference genomes enable identification of the pool random sequence from the population of microbes in a sample. Such wide applications will be better powered by efforts such as this to improve the set of reference genomes that form a fundamental basis of comparison and analysis. By re-sequencing three reference strains of Paracoccidioides spp., using deep sequencing depth of Illumina paired-end reads, we have been able to substantially improve the assemblies and annotations for this important human fungal pathogen. Here we have presented the updated and improved annotated genome sequences, which constitute new references that can be used in diverse future molecular projects by those working in the field of medical mycology. Since the process leading to the new sequences is largely automated using publicly available programs and the NGS technology used is cost-effective, the success of our strategy represents a proof of concept that may stimulate similar updates of other genomes in future. Regions of FKS1 protein sequence alignment highlighting changes between genome versions. This example is of a gene relevant to medical and experimental mycology, the 1,3-beta glucan synthase component FKS1. The colored regions (blue text for bases, red ‘-‘ for gaps in the alignment) illustrate the improvements in the annotation (‘CDS’ category) that led to improved protein sequences in two of the three Paracoccidioides strains. (TIF) Click here for additional data file. Comparison of ortholog conservation and annotation in v1 and v2 genomes. The final gene sets of each annotation version were clustered using OrthoMCL. (A) Bar chart showing the relative contributions of core, auxiliary and unique genes to the final gene clusters of versions 1 and 2. The clustered groups were categorized as ‘core’ if present in all strains, ‘aux’ if present in two strains and ‘uniq’ if present in one strain. (B) Venn diagram showing numbers of shared and unique genes in annotation v2. (C) Total numbers of predicted genes, and total numbers of predicted genes that were assigned functional annotation from PFAM and/or KEGG. In all cases, the annotation v2 is more consistent or homogeneous across the three strains than annotation v1: there are more core genes in annotation v2 and also in v2 the two strains Pb03 and Pb18 from the same species (P. brasiliensis) give more similar results (bar charts). Furthermore, the new annotation has more genes with assigned functional annotation even using the same version of the databases. (TIF) Click here for additional data file. Coverage of Core Eukaryotic Genes (CEGs) in original and updated genomes. More genes in annotation v2 had higher percent of coverage of CEGs in comparison with annotation v1. This analysis was performed and plotted using the CoreAlyze tool (http://sourceforge.net/project/corealyze). (TIF) Click here for additional data file. Difference in BLAST scores using the protein sets of annotations v1 and v2. The references used for the BLAST matching were the protein sets of two dimorphic pathogenic fungi that are closely related to Paracoccidioides (top row) and the Core Eukaryotic genes of the fungi in CEGMA (bottom row). The comparison shows that the vast majority of proteins with any change, included in the comparison, have higher BLAST score in annotation v2. Here the graphs depict the results for the Pb03 strain; Pb18 and Pb01 showed the same pattern. The horizontal axis shows the genes numbered in order of increasing score difference (v2 minus v1). (TIF) Click here for additional data file. Summary of assembly and gene statistics of v1 and the updated v2 of the three reference genomes of (XLSX) Click here for additional data file.

59 in total

1. KEGG: kyoto encyclopedia of genes and genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Transcriptome analysis of Paracoccidioides brasiliensis cells undergoing mycelium-to-yeast transition.

Authors: Luiz R Nunes; Regina Costa de Oliveira; Daniela Batista Leite; Vivian Schmidt da Silva; Everaldo dos Reis Marques; Márcia Eliana da Silva Ferreira; Diógenes Custódio Duarte Ribeiro; Luciano Angelo de Souza Bernardes; Maria Helena S Goldman; Rosana Puccia; Luiz R Travassos; Wagner L Batista; Marina Pasetto Nóbrega; Francisco G Nobrega; Ding-Yah Yang; Carlos A de Bragança Pereira; Gustavo H Goldman
Journal: Eukaryot Cell Date: 2005-12

3. Redundant catalases detoxify phagocyte reactive oxygen and facilitate Histoplasma capsulatum pathogenesis.

Authors: Eric D Holbrook; Katherine A Smolnycki; Brian H Youseff; Chad A Rappleye
Journal: Infect Immun Date: 2013-04-15 Impact factor: 3.441

Review 4. Paracoccidioidomycosis: an update.

Authors: E Brummer; E Castaneda; A Restrepo
Journal: Clin Microbiol Rev Date: 1993-04 Impact factor: 26.132

5. Temperature-induced switch to the pathogenic yeast form of Histoplasma capsulatum requires Ryp1, a conserved transcriptional regulator.

Authors: Van Q Nguyen; Anita Sil
Journal: Proc Natl Acad Sci U S A Date: 2008-03-13 Impact factor: 11.205

6. Histoplasma capsulatum heat-shock 60 orchestrates the adaptation of the fungus to temperature stress.

Authors: Allan Jefferson Guimarães; Ernesto S Nakayasu; Tiago J P Sobreira; Radames J B Cordero; Leonardo Nimrichter; Igor C Almeida; Joshua Daniel Nosanchuk
Journal: PLoS One Date: 2011-02-10 Impact factor: 3.240

7. Alternative oxidase mediates pathogen resistance in Paracoccidioides brasiliensis infection.

Authors: Orville Hernández Ruiz; Angel Gonzalez; Agostinho J Almeida; Diana Tamayo; Ana Maria Garcia; Angela Restrepo; Juan G McEwen
Journal: PLoS Negl Trop Dis Date: 2011-10-25

8. Targeted gene disruption reveals an adhesin indispensable for pathogenicity of Blastomyces dermatitidis.

Authors: T T Brandhorst; M Wüthrich; T Warner; B Klein
Journal: J Exp Med Date: 1999-04-19 Impact factor: 14.307

9. Histoplasma yeast and mycelial transcriptomes reveal pathogenic-phase and lineage-specific gene expression profiles.

Authors: Jessica A Edwards; Chenxi Chen; Megan M Kemski; Jinnan Hu; Thomas K Mitchell; Chad A Rappleye
Journal: BMC Genomics Date: 2013-10-10 Impact factor: 3.969

10. Comparative transcriptomics of the saprobic and parasitic growth phases in Coccidioides spp.

Authors: Emily Whiston; Hua Zhang Wise; Thomas J Sharpton; Ginger Jui; Garry T Cole; John W Taylor
Journal: PLoS One Date: 2012-07-20 Impact factor: 3.240

17 in total

1. Paracoccidioides brasiliensis PS2: First Autochthonous Paracoccidioidomycosis Case Report in Rio de Janeiro, Brazil, and Literature Review.

Authors: Priscila Marques de Macedo; Rodrigo Almeida-Paes; Mauro de Medeiros Muniz; Manoel Marques Evangelista Oliveira; Rosely Maria Zancopé-Oliveira; Regina Lana Braga Costa; Antonio Carlos Francesconi do Valle
Journal: Mycopathologia Date: 2016-05-09 Impact factor: 2.574

2. Species boundaries in the human pathogen Paracoccidioides.

Authors: David A Turissini; Oscar M Gomez; Marcus M Teixeira; Juan G McEwen; Daniel R Matute
Journal: Fungal Genet Biol Date: 2017-06-09 Impact factor: 3.495

3. Paracoccidioides spp. catalases and their role in antioxidant defense against host defense responses.

Authors: Diana Tamayo; José F Muñoz; Agostinho J Almeida; Juan D Puerta; Ángela Restrepo; Christina A Cuomo; Juan G McEwen; Orville Hernández
Journal: Fungal Genet Biol Date: 2017-01-16 Impact factor: 3.495

4. Novel taxa of thermally dimorphic systemic pathogens in the Ajellomycetaceae (Onygenales).

Authors: Karolina Dukik; Jose F Muñoz; Yanping Jiang; Peiying Feng; Lynne Sigler; J Benjamin Stielow; Joanna Freeke; Azadeh Jamalian; Bert Gerrits van den Ende; Juan G McEwen; Oliver K Clay; Ilan S Schwartz; Nelesh P Govender; Tsidiso G Maphanga; Christina A Cuomo; Leandro F Moreno; Chris Kenyon; Andrew M Borman; Sybren de Hoog
Journal: Mycoses Date: 2017-02-07 Impact factor: 4.377

5. Current Status on Extracellular Vesicles from the Dimorphic Pathogenic Species of Paracoccidioides.

Authors: Rosana Puccia
Journal: Curr Top Microbiol Immunol Date: 2021 Impact factor: 4.291

Review 6. One Century of Study: What We Learned about Paracoccidioides and How This Pathogen Contributed to Advances in Antifungal Therapy.

Authors: Erika Seki Kioshima; Patrícia de Souza Bonfim de Mendonça; Marcus de Melo Teixeira; Isis Regina Grenier Capoci; André Amaral; Franciele Abigail Vilugron Rodrigues-Vendramini; Bruna Lauton Simões; Ana Karina Rodrigues Abadio; Larissa Fernandes Matos; Maria Sueli Soares Felipe
Journal: J Fungi (Basel) Date: 2021-02-02

7. Identification and Analysis of the Role of Superoxide Dismutases Isoforms in the Pathogenesis of Paracoccidioides spp.

Authors: Diana Tamayo; José F Muñoz; Ángela Lopez; Martha Urán; Juan Herrera; Clayton L Borges; Ángela Restrepo; Celia M Soares; Carlos P Taborda; Agostinho J Almeida; Juan G McEwen; Orville Hernández
Journal: PLoS Negl Trop Dis Date: 2016-03-10

8. Genome Diversity, Recombination, and Virulence across the Major Lineages of Paracoccidioides.

Authors: José F Muñoz; Rhys A Farrer; Christopher A Desjardins; Juan E Gallo; Sean Sykes; Sharadha Sakthikumar; Elizabeth Misas; Emily A Whiston; Eduardo Bagagli; Celia M A Soares; Marcus de M Teixeira; John W Taylor; Oliver K Clay; Juan G McEwen; Christina A Cuomo
Journal: mSphere Date: 2016-09-28 Impact factor: 4.389

9. Differential Metabolism of a Two-Carbon Substrate by Members of the Paracoccidioides Genus.

Authors: Lilian C Baeza; Fabiana R da Mata; Laurine L Pigosso; Maristela Pereira; Gustavo H M F de Souza; Alexandre S G Coelho; Célia M de Almeida Soares
Journal: Front Microbiol Date: 2017-11-27 Impact factor: 5.640

Review 10. Paracoccidioides-host Interaction: An Overview on Recent Advances in the Paracoccidioidomycosis.

Authors: Haroldo C de Oliveira; Patrícia A Assato; Caroline M Marcos; Liliana Scorzoni; Ana C A de Paula E Silva; Julhiany De Fátima Da Silva; Junya de Lacorte Singulani; Kaila M Alarcon; Ana M Fusco-Almeida; Maria J S Mendes-Giannini
Journal: Front Microbiol Date: 2015-11-25 Impact factor: 5.640