José F Muñoz, Juan E Gallo, Elizabeth Misas, Margaret Priest, Alma Imamovic, Sarah Young, Qiandong Zeng, Oliver K Clay, Juan G McEwen, Christina A Cuomo.
Abstract
Paracoccidioidomycosis (PCM) is a clinically important fungal disease that can acquire serious systemic forms and is caused by the thermodimorphic fungus Paracoccidioides spp. PCM is a tropical disease that is endemic in Latin America, where up to ten million people are infected; 80% of reported cases occur in Brazil, followed by Colombia and Venezuela. To enable genomic studies and to better characterize the pathogenesis of this dimorphic fungus, two reference strains of P. brasiliensis (Pb03, Pb18) and one strain of P. lutzii (Pb01) were sequenced [1]. While the initial draft assemblies were accurate in large-scale structure and had high overall base quality, the sequences had frequent small-scale defects such as poor-quality stretches, unknown bases (N's), and artifactual deletions or nucleotide duplications, all of which caused larger-scale errors in predicted gene structures. Since assembly consensus errors can now be addressed using next-generation sequencing (NGS) in combination with recent methods allowing systematic assembly improvement, we re-sequenced the three reference strains of Paracoccidioides spp. using Illumina technology. We used the high sequencing depth to re-evaluate and improve the original assemblies generated from Sanger sequence reads, and obtained more complete and accurate reference assemblies. The new assemblies led to improved transcript predictions for the vast majority of genes of these reference strains, and often substantially corrected gene structures. These include several genes that are central to virulence or expressed during the pathogenic yeast stage in Paracoccidioides and other fungi, such as HSP90, RYP1-3, BAD1, catalase B, alpha-1,3-glucan synthase and the beta-glucan synthase target gene FKS1. The improvement and validation of these reference sequences will now allow more accurate genome-based analyses.
To our knowledge, this is one of the first reports of a fully automated and quality-assessed upgrade of a genome assembly and annotation for a non-model fungus.
Year: 2014 PMID: 25474325 PMCID: PMC4256289 DOI: 10.1371/journal.pntd.0003348
Source DB: PubMed Journal: PLoS Negl Trop Dis ISSN: 1935-2727
Figure 1. Overview of genome assembly and annotation improvement process.
Summary of assembly metrics after Pilon improvement.
| Pilon summary metrics | Pb18 | Pb03 | Pb01 |
| Read depth of coverage | 127 | 146 | 148 |
| SNPs | 3,290 | 3,018 | 3,072 |
| Ambiguous bases | 246 | 222 | 221 |
| Small insertion bases | 957 | 1,083 | 1,062 |
| Small deletion bases | 725 | 714 | 628 |
| Bases added in reassembly fixes | 109,312 | 89,243 | 118,931 |
| Bases removed in reassembly fixes | 37,417 | 38,906 | 41,822 |
| Gaps opened | 0 | 0 | 0 |
| Gaps closed | 113 | 56 | 212 |
| Collapsed regions | 3 | 1 | 2 |
| Collapsed bases | 64,378 | 20,918 | 43,967 |
| Increase in contig N50 (kb) | 14.16 | 4.98 | 29.17 |
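The last row of the table reports the increase in contig N50, the contig length at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly length. The following is a minimal sketch of that standard definition, not the authors' pipeline; the function name `n50` and the contig lengths are hypothetical and chosen only to show how closing a gap raises the metric.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum past half of the
        # assembly length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Hypothetical toy assemblies (not data from the paper).
before = [50, 40, 30, 20, 10, 10]  # contigs of 50 and 40 bp flank a gap
after = [90, 30, 20, 10, 10]       # same assembly with that gap closed

increase = n50(after) - n50(before)
print(n50(before), n50(after), increase)
```

Because gap closure merges neighbouring contigs without changing the total length much, the N50 can rise sharply even when only a few gaps are closed.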
Summary of annotation changes in protein coding genes.
| Change type | Pb18 | Pb03 | Pb01 | Change description |
| Add | 840 | 933 | 936 | Gene added to a region that previously had none |
| Splice site | 1,124 | 1,122 | 1,000 | Same start and same stop; internally, a splice site moved |
| Extended | 3 | 6 | 2 | Splice agreement, new model is longer; upstream start and downstream stop |
| Start Extended | 262 | 329 | 307 | Splice agreement and same stop; new model is longer, upstream start |
| Stop Extended | 46 | 38 | 30 | Splice agreement and same start; new model is longer, downstream stop |
| Shift | 15 | 12 | 18 | Splice agreement; new model has upstream, or downstream, start and stop |
| Truncated | 5 | 4 | 2 | Splice agreement, new model is shorter; downstream start and upstream stop |
| Start Truncated | 237 | 208 | 276 | Splice agreement and same stop; new model is shorter, downstream start |
| Stop Truncated | 7 | 11 | 6 | Splice agreement and same start; new model is shorter, upstream stop |
| UTR | 1,504 | 802 | 1,679 | Splice agreement and same start and same stop, but differ in UTR |
| Cluster | 12 | 9 | 16 | Multiple old genes map to multiple new genes; complex change |
| Merge | 118 | 88 | 102 | Multiple old genes have been merged into one |
| Split | 402 | 509 | 495 | Single old gene has been split into multiple new genes |
| Other CDS change | 1,999 | 1,757 | 2,376 | Other model not covered by another category or multiple models |
| None | 1,816 | 2,599 | 1,581 | Primary transcript is identical |
| Total | 8,390 | 8,427 | 8,826 | Total genes in current annotation version 2 |
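The one-to-one categories in the table can be read as a decision procedure over the start codon, stop codon, splice sites and transcript boundaries of the old and new gene models. The sketch below is an illustration under stated assumptions, not the comparison tool used in the study: the `GeneModel` class and `classify_change` function are hypothetical, coordinates are assumed to be in transcript orientation (smaller values are upstream), and "splice agreement" is simplified to identical intron sets. Add, Cluster, Merge and Split involve more than one gene and are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """Hypothetical minimal gene model in transcript orientation."""
    start: int          # position of the start codon
    stop: int           # position of the stop codon
    introns: frozenset  # set of (donor, acceptor) coordinate pairs
    transcript: tuple   # (transcript start, transcript end), UTRs included


def classify_change(old, new):
    """Assign one of the table's one-to-one change types."""
    if old == new:
        return "None"
    if old.introns != new.introns:
        if (old.start, old.stop) == (new.start, new.stop):
            return "Splice site"
        return "Other CDS change"
    d_start = new.start - old.start  # negative: start moved upstream
    d_stop = new.stop - old.stop     # positive: stop moved downstream
    if d_start == 0 and d_stop == 0:
        return "UTR"                 # only the transcript boundaries differ
    if d_start < 0 and d_stop > 0:
        return "Extended"
    if d_start > 0 and d_stop < 0:
        return "Truncated"
    if d_stop == 0:
        return "Start Extended" if d_start < 0 else "Start Truncated"
    if d_start == 0:
        return "Stop Extended" if d_stop > 0 else "Stop Truncated"
    return "Shift"                   # start and stop moved the same way


# Hypothetical models for illustration only.
introns = frozenset({(300, 400)})
old = GeneModel(100, 900, introns, (50, 950))
longer_start = GeneModel(70, 900, introns, (50, 950))
moved_splice = GeneModel(100, 900, frozenset({(300, 410)}), (50, 950))
utr_only = GeneModel(100, 900, introns, (20, 950))

print(classify_change(old, longer_start))
print(classify_change(old, moved_splice))
print(classify_change(old, utr_only))
```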
Figure 2. Examples of an artifactual insertion and an artifactual deletion that were corrected during the update of the P. brasiliensis Pb03 genome sequence.
Screenshots of Pilon-generated genome browser tracks in GenomeView v1.0 [35] show the evidence used by Pilon to recognize and correct an incorrect insertion in the gene PABG_00120 (left) and an incorrect deletion in the gene PABG_00790 (right). Tracks (top panels) depict paired-end reads (green) aligned to the corresponding region of the reference assembly v1, a subset of the total depth of ∼150X or ∼170X; these alignments were used by Pilon to refine the consensus sequence, generating the improved Pb03 assembly v2. Positions in the v1 assembly where aligned reads suggest a change due to either a gap (red box) or an insertion (black line) are indicated with dashed red boxes. The changes suggested by Pilon are also supported by conservation of the changed bases in a multiple alignment (bottom panels) with the corresponding region of P. brasiliensis Pb18 and P. lutzii Pb01.
Figure 3. Improved consistency of gene annotation in v2 genomes.
The final predicted gene sets of the three Paracoccidioides strains were clustered using OrthoMCL, in v1 and v2. The scatterplots (A) compare, for each clustered group, the maximum length versus the minimum length of the three Paracoccidioides genes in the same cluster, for each of the two versions. The scatterplot contrasts the maximum-minimum pairs from annotation v1 (red points) and those from annotation v2 (blue points). The location of blue points closer to the diagonal illustrates that the annotation v2 was more consistent across the three genomes with smaller differences in gene length. In the same sense, the rank plots (B) show the difference between maximum and minimum length for each clustered group, for each of the two versions; again annotation v2 (blue line) showed fewer (later increase) and smaller (more gradual increase) differences, corresponding to the improvement of the genome annotation in v2.
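The two panels described above rest on simple per-cluster statistics: the (minimum, maximum) gene length pair plotted in panel A and the sorted maximum-minus-minimum differences plotted in panel B. A minimal sketch of that computation follows, assuming the clustering output has already been reduced to a mapping from cluster name to the gene lengths of the three strains. The function name `length_spread` and all values are hypothetical and are not taken from the paper.

```python
def length_spread(clusters):
    """Return (min, max) length pairs and the sorted max-min differences.

    `clusters` maps a cluster name to the gene lengths of its members.
    """
    pairs = [(min(lengths), max(lengths)) for lengths in clusters.values()]
    # Sorting the differences gives the curve of a rank plot.
    diffs = sorted(longest - shortest for shortest, longest in pairs)
    return pairs, diffs


# Hypothetical gene lengths (bp) for three clusters in each annotation version.
v1 = {"c1": [1200, 1200, 900], "c2": [450, 800, 790], "c3": [300, 300, 300]}
v2 = {"c1": [1200, 1200, 1190], "c2": [800, 800, 790], "c3": [300, 300, 300]}

pairs_v1, diffs_v1 = length_spread(v1)
pairs_v2, diffs_v2 = length_spread(v2)
print(diffs_v1)
print(diffs_v2)
```

Pairs that lie on the diagonal of the scatterplot have a difference of zero, so a more consistent annotation shows up as a difference curve that rises later and more gradually.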
Changes in updated annotations of known yeast-phase specific genes or virulence factors of Paracoccidioides and other dimorphic human pathogenic fungi.
| Gene name description | Pb01 v2 ID (PAAG_) | Pb03 v2 ID (PABG_) | Pb18 v2 ID (PADG_) | Pb01 type of change | Pb03 type of change | Pb18 type of change | Ref. |
| 1,3-beta-glucan synthase component GLS1 | 05071 | 04524 | * | None | Splice site | CDS/ID | |
| 1,3-beta-glucanosyltransferase | 03782 | 00831 | 03286 | Splice site | Splice site | Splice site | |
| 4-hydroxyphenylpyruvate dioxygenase | 02615 | 03102 | 01636 | UTR | Splice site | UTR | |
| Alpha-1,3-glucan synthase | 03297 | 00726 | 03169 | CDS | CDS | Splice site | |
| Alternative oxidase | 01078 | 01661 | 03747 | UTR | UTR | UTR | |
| bZIP transcription factor | 04257 | 04038 | 07492 | UTR | UTR | UTR | |
| Catalase B | 01553 | 03611 | 00225 | None | None | Stop extended | |
| Catalase peroxisomal | 01454 | 01943 | 00324 | Stop extended | Start truncated | Start truncated | |
| Conserved hypothetical protein | 08096 | 07332 | 08402 | UTR | None | UTR | |
| Cu Zn superoxide dismutase | 04164 | 03954 | 07418 | CDS | CDS | CDS | |
| Cu Zn superoxide dismutase | 02971 | 00431 | 02842 | UTR | None | None | |
| Dimorphism regulator histidine kinase | 05810 | 06372 | 07579 | UTR | CDS | CDS | |
| Glucan 1,3-beta-glucosidase | 05770 | 06340 | 07615 | UTR | UTR | UTR | |
| Glyceraldehyde-3-phosphate dehydrogenase | 08468 | 00022 | 02411 | Splice site | Extended | UTR | |
| HAD-superfamily hydrolase | 00503 | 06765 | 02181 | UTR | Splice site | Splice site | |
| Heat shock protein 60 kDa | 08059 | 07300 | 08369 | UTR | UTR | UTR | |
| Heat shock protein 90 kDa | 05679 | 06249 | 07715 | Splice site | CDS | CDS | |
| L-ornithine 5-monooxygenase | 01682 | 03730 | 00097 | Splice site | None | None | |
| Tubulin beta chain | 03031 | 00486 | 02900 | CDS | CDS | CDS | |
| Adhesin WI-1 | 08980 | 07814 | ° | UTR | None | New gene | |
| cAMP-independent regulatory protein pac2 | ** | 06919 | 06243 | CDS/ID | None | UTR | |
| Ornithine decarboxylase | 03153 | 00600 | 03032 | None | None | UTR | |
| Required for yeast phase growth 2 | 02671 | 03151 | *** | Splice site | Splice site | CDS/ID | |
| Required for yeast phase growth 3 | 06081 | 06575 | 08037 | Start extended | CDS | Stop extended | |
| Urease | 00954 | 01291 | 03871 | CDS | Start truncated | CDS | |
| Urease accessory protein | 06237 | 05255 | 07010 | UTR | Start truncated | UTR | |
| Ureidoglycolate hydrolase | 04751 | 00102 | 02493 | UTR | UTR | UTR | |
IDs from version 1: *PADG_04920; **PAAG_03579; ***PADG_01695.
°New gene that was not reported previously.
Figure 4. Diverse error correction for the 90 kDa heat shock protein (HSP90 gene) of Paracoccidioides spp.
(A) In this example different annotation errors were present in v1 of all three Paracoccidioides reference strains, all of which were fixed in v2 after Pilon improvement and re-annotation. The example also illustrates how one or more single-nucleotide errors, unknown single nucleotides (N's), or single nucleotides that were erroneously reported as absent or duplicated by a Sanger sequencer can amplify across annotations, generating radically different gene structure (intron/exon and/or gene boundary) predictions. (B) Five changes are shown at assembly (DNA sequence) level, one of which was a single nucleotide error in a stop codon; as a result, the gene-calling program did not recognize the end of an exon and it was not reported.
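The effect of a single-nucleotide error in a stop codon can be illustrated with a toy open reading frame scan. This is a deliberately simplified sketch: it ignores introns and is not the gene-calling program used in the study, and the function name `orf_end` and both sequences are hypothetical. It only shows why one wrong base can make a predictor miss the true end of a coding region.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_end(seq, start=0):
    """Return the index just past the first in-frame stop codon, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


# Hypothetical sequences: the second differs by one base (TAA -> TAC),
# which removes the first in-frame stop codon.
correct = "ATGGCTTAAGGTCCTTGA"
with_error = "ATGGCTTACGGTCCTTGA"

print(orf_end(correct))
print(orf_end(with_error))
```

In the corrected sequence the reading frame ends at the first stop codon, whereas in the erroneous sequence the scan runs on to the next in-frame stop, so the predicted coding region has the wrong boundary.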