Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30-90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in approximately 4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative studies to address basic questions of fungal biology.
Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-genepan class="Species">ration sequencing techniques have revolutionized genome resequencing, e.g. for strain compaspan>risons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30-90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the pan class="Species">Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. pan class="Species">Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in approximately 4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative studies to address basic questions of fungal biology.
Fungi are heterotrophic eukaryotes found in nearly all ecosystems. About 100,000 fungi have been described to date, but conservative estimates predict at least 1.5 million different species [1],[2]. Fungi exhibit a wide range of different lifestyles, particularly as saprobes, pathogens or symbionts. As saprobes, fungi acquire nutrients from dead organic matter and are among the main recyclers on the planet. They play important roles in the degradation of pan class="Chemical">cellulose and span>n class="Chemical">lignin, contributing greatly to the global carbon cycle. However, their saprotrophic activities also cause severe problems with the degradation of man-made products and in causing food spoilage. Mortality from human fungal pathogens has increased in recent years, especially in immunocompromised patients. In plants, ∼90% of diseases are caused by fungi, and these result in massive losses in crop yield worldwide, with often profound socio-economic effects, sometimes resulting in severe famines [3]. Nevertheless, fungi also have beneficial effects in symbioses, such as mycorrhiza (fungus/plant root) and lichen (fungus/algae) associations. Greater than 80% of terrestrial plants have mycorrhizal relationships with fungi that allow the plants to access key nutrients such as nitrogen and phosphorus from the soil [4]. Fungi are also of great importance in biotechnology, e.g. in the production of drugs and enzymes [5],[6]. In addition, many fungi can be easily cultured and are amenable to microbiological, genetic, and molecular techniques. Therefore, fungi were some of the earliest model organisms for the study of genetics, biochemistry, cell and developmental biology. It is thus not surprising that the first eukaryotic organism for which a complete genome sequence was obtained is a fungus, the budding yeastSaccharomyces cerevisiae
[7]. Today, fungi are the eukaryotic group with the greatest number of completely, or nearly completely, sequenced genomes (http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi, [2]). This is not only owing to their ecological, medical, agricultural, biotechnological and economic significance, but also due to the fact that with a size of 10–90 Mb and 4,700–17,000 predicted genes, fungal genomes are among the smallest and most compact eukaryotic genomes known.
The sequences for almost all sequenced eukaryotic genomes have been obtained by conventional Sanger sequencing technology. Over the past five years “next-genepan class="Species">ration sequencing” techniques have revolutionized large-scale sequencing projects because of massively increased throughput, resulting in much reduced costs per base [8]. One major disadvantage of the current techniques is that none of them delivers read lengths that approach conventional Sanger technology: whereas Sanger sequencing routinely yields 900 nt, the longest next-generation reads obtained are in the range of ∼450 nt for Roche/454 pyrosequencing (from now on abbreviated as 454 sequencing), and the techniques with the highest throughput are with 36–80 nt still well below this. Short reads, e.g. as obtained by Illumina/Solexa sequencing (from now on abbreviated as Solexa sequencing) cause severe difficulties for the assembly of genome sequences that contain repetitive sequences, as is the case for many higher eukaryotes. Thus, next-generation sequencing techniques have so far mostly been used for the de novo sequencing of prokaryotic genomes or the re-sequencing of eukaryotic species with reference genomes, where the next-generation reads can be mapped on an existing genome sequence [8]–[11]. Recent improvements, e.g. paired-end sequencing (reads from matched ends of longer DNA fragments) and a steady increase in read length should make the de novo assembly of high-quality eukaryotic genomes possible. For example, the genome of the filamentous fungus Grosmannia clavigera was assembled from a combination of Sanger, 454, and Solexa sequence data [12] and a first draft of the 2.4 Gb Giant Panda genome has been assembled from Solexa sequence reads alone [13]. Because of their small size, fungal genomes are perfectly suited for the task of optimizing de novo assembly approaches to generate high-quality or even finished larger eukaryotic genomes.
Here, we present the de novo assembly and annotation of the genome sequence of the filamentous fungus pan class="Species">Sordaria macrospora. The genome was sequenced solely by next-genepan class="Species">ration techniques (Solexa sequencing by synthesis and 454 pyrosequencing). S. macrospora is an ascomycete with a long-standing history as a model organism for fungal sexual development and meiosis [14]–[18] (Figure 1). Development of a large set of genetic tools for this fungus [19]–[24] resulted in the discovery of novel proteins involved in central events of meiosis and organogenesis [25]–[32]. Similar to its close relative Neurospora crassa, S. macrospora is haploid with a nuclear genome of seven chromosomes and an estimated 39.5 Mb of DNA sequence [24], [33]–[35]. Previous studies found ∼90% nucleic acid identity within coding regions of orthologous genes from S. macrospora and N. crassa as well as a high degree of synteny over large genomic regions [36],[37]. Despite their close phylogenetic relationship, S. macrospora is homothallic (self-fertile) in contrast to the heterothallic (self-sterile) N. crassa. The natural habitat of S. macrospora is herbivore dung in temperate climates, whereas N. crassa is usually found on burned vegetation and the soil throughout the world [14], [38]–[41]. Thus, these two closely related fungi have evolved different life styles and inhabit different ecological niches. These differences may be at least partially reflected in their genomes.
Figure 1
S. macrospora as a model organism for the analysis of meiosis and fruiting body development.
(A) Segregation of the ascospore-color mutant pam2 from a cross; wild type (black ascospores) by pam2 (yellow ascospores). Arrow points to a gene conversion indicated by two black and six yellow ascospores. (B–D) Meiotic prophase. Chromosome axes are stained by the cohesin-associated Spo76/Pds5 protein tagged with GFP. (B) Prophase nucleus of a spo11 null mutant: the 14 chromosomes do not align or synapse and this asynaptic status is seen from leptotene through pachytene. (C, D) Pachytene nucleus from wild-type Sordaria: the seven bivalents are differentiated by their size (D). Chromosome 2 (yellow), which bears the nucleolar organizing region, is attached to the nucleolus (nu). (E) The seven bivalents at late diplotene, stained by DAPI. Note the difference in size when compared to the pachytene nucleus. Bar (B–E) = 5 µm. (F) An EGFP-HEX1 fusion protein localizes to Woronin bodies. Bar = 10 µm. (G) The GFP-tagged developmental protein PRO41 localizes to the endoplasmic reticulum. Plasma membrane stained with FM4-64. Bar = 10 µm. (H, I) In a young protoperithecium (H), the GFP-tagged developmentally induced protein APP accumulates (I). Bar = 20 µm.
S. macrospora as a model organism for the analysis of meiosis and fruiting body development.
(A) Segregation of the ascospore-color mutant pam2 from a cpan class="Chemical">ross; wild type (black ascospores) by pam2 (yellow ascospores). Arrow points to a gene conversion indicated by two black and six yellow ascospores. (B–D) Meiotic prophase. Chromosome axes are stained by the cohesin-associated Spo76/Pds5 protein tagged with GFP. (B) Prophase nucleus of a spo11 null mutant: the 14 chromosomes do not align or synapse and this asynaptic status is seen from pan class="Chemical">leptotene through pachytene. (C, D) Pachytene nucleus from wild-type Sordaria: the seven bivalents are differentiated by their size (D). Chromosome 2 (yellow), which bears the nucleolar organizing region, is attached to the nucleolus (nu). (E) The seven bivalents at late diplotene, stained by DAPI. Note the difference in size when compared to the pachytene nucleus. Bar (B–E) = 5 µm. (F) An EGFP-HEX1 fusion protein localizes to Woronin bodies. Bar = 10 µm. (G) The GFP-tagged developmental protein PRO41 localizes to the endoplasmic reticulum. Plasma membrane stained with FM4-64. Bar = 10 µm. (H, I) In a young protoperithecium (H), the GFP-tagged developmentally induced protein APP accumulates (I). Bar = 20 µm.
The pan class="Species">S. macrospora genome sequencing project had two aims: (1) to assemble a first, high-quality draft of the genome sequence after next-genepan class="Species">ration sequencing to show that this approach is feasible for filamentous fungi in general, and (2) to annotate the genome sequence by a community effort, with the goal of a better understanding of pan class="Species">S. macrospora biology and the idea of improving its value as a model organism for fungal development.
Results/Discussion
Sequencing and assembly of the S. macrospora genome
The genome of the pan class="Species">S. macrospora strain k-hell was sequenced by a combination of Solexa and 454 sequencing. First, a total of 3.4 Gb of DNA sequence in 95,153,034 Solexa 36-nt reads were obtained from one single-read lane (9,688,226 reads), four lanes of paired-end reads (55,337,284 reads) from a 300-bp insert library, and three lanes (30,172,524 reads) of paired-end reads from a 500-bp insert library (Table 1, Figure S1). This represents 85-fold coverage of the span>n class="Species">S. macrospora genome. Assembly of the Illumina/Solexa data with the Velvet assembler [42] resulted in 38.7 Mb of sequence data in 3,344 contigs with an N50 size of 51 kb (Table 2). As expected, these contigs contained a substantial number of internal gaps (17,956 gaps, Table 2), because paired-end data allows contigs to be scaffolded by inferred physical linkage of the matched pairs in the absence of contiguous coverage of intervening segments. Despite the internal gaps in some of the contigs, we decided not to call them scaffolds to differentiate between the Velvet output (referred to as contigs even when containing gaps) and a subsequent scaffolding step (see below). When compared with the N. crassa genome, we were able to map 8,350 of ∼10,000 predicted proteins to the 10,066 predicted N. crassa genes (e-value ≤10−20) which is only slightly lower than the number obtained with the final high-quality draft (8,519 proteins, see below). Thus, even this preliminary assembly covered most of the protein-coding genome.
Table 1
Main features of primary sequence data.
primary sequence data
Solexa
454
Solexa +454
no. of reads that were obtained
95,153,934
1,103,372
96,261,736
read length
36 nt
367 nt1
n.a.
total length of all sequence reads
3,426 Mb
415 Mb
3,879 Mb
1 average read length.
Table 2
Main features of S. macrospora genome assemblies from Solexa reads, 454 reads, a combination of both, and after comparative assembly with the N. crassa genome.
assembled genome
Solexa
454
Solexa +454
comp. assembly
N50 value of assembly1
51 kb
11 kb
117 kb
498 kb
maximum contig/scaffold length
267 kb
64 kb
991 kb
2.5 Mb
total length of assembly
38.7 Mb
42.1 Mb
39.9 Mb
39.9 Mb
no. of contigs/scaffolds
3,344
14,123
5,097
4,781
% of assembly in contigs >0.5 kb
99.1
95.6
98.1
98.1
% of assembly in contigs >10 kb
92.8
52.5
92.3
93.1
no. of gaps within contigs
17,956
1,681
624
9332
mean length of gaps
478 nt
1 nt
21 nt
150 nt2
1 The N50 is defined as the length for which 50% of all bases in the assembly are in a contig of at least that length. In other words, this means that 50% of the assembly is contained in contigs of at least the N50 length.
2 The higher number and greater length of gaps in the comparative assembly compared to the Solexa+454 assembly stems from the introduction of gaps while joining contigs to scaffolds.
1 average read length.1 The N50 is defined as the length for which 50% of all bases in the assembly are in a contig of at least that length. In other words, this means that 50% of the assembly is contained in contigs of at least the N50 length.2 The higher number and greater length of gaps in the compapan class="Species">rative assembly compaspan>red to the Solexa+454 assembly stems from the introduction of gaps while joining contigs to scaffolds.To close most gaps, we obtained additional sequence data by 454 sequencing. Because of longer reads, a relatively low coverage with 454 reads in combination with the previously obtained Solexa reads was expected to allow assembly with a higher N50 value and close internal gaps in the contigs. We obtained 415 Mb (∼10-fold coverage) of single-end 454 reads with an average read length of 367 bp (Table 1, Figure S1). Assembly of 454 reads only (with the Celera Assembler 5.3; Eurofins MWG GmbH, Ebersberg, Gerpan class="Species">many) yielded 14,123 contigs (N50 size 11 kb; 1,681 internal gaps; Table 2). Gaps in this assembly were primarily caused by sequencing ambiguities.
The combined raw data (Solexa and 454 reads) and the pre-assembled 454 data were used for constructing an assembly with the Velvet assembler version 0.7.31 [42] (Figure S1). This resulted in an assembly of 39.9 Mb of sequence data (5,097 contigs with an N50 size of 117 kb) and only 624 internal gaps within the contigs (Table 2). Thus, the combination of Solexa paired-end reads with 454 reads resulted in an increase of the N50 value and a drastic reduction in the number of gaps compared to assemblies where each data set was used alone. With a size of 39.9 Mb, this combined assembly corresponds well to previous analyses of the pan class="Species">S. macrospora genome by pulsed-field gel electrophoresis that estimated the genome size at 39.5 Mb [24].
To determine whether similar results might be obtained with fewer sequence reads, thereby further decreasing sequencing costs, we genepan class="Species">rated test assemblies with different combinations of coverage levels (Figure S2, Table S1). The addition of 454 reads had the most drastic effect on the number and length of gaps whereas addition of paired-end reads improved mostly the N50 value. The inclusion of fewer sequence reads resulted in suboptimal assemblies; however, at the number of reads used for our assembly, bench mark values were no longer changing dramatically, suggesting that a plateau had been reached where addition of this type of sequence reads did not significantly improve assemblies. Further improvement might be achieved by sequencing paired-end libraries with longer inserts. The genome sequence of the filamentous ascomycete span>n class="Species">Grosmannia clavigera was assembled from a combination of Sanger paired-end reads (0.3-fold coverage), 454 single reads (7.7-fold coverage), and Solexa paired-end reads (100-fold coverage) [12], resulting in a high-quality draft genome sequence of 32.5 Mb with an N50 size of 164 kb. Our data show that similar values can be obtained even without including Sanger sequencing data thereby drastically decreasing sequencing costs.
It has been previously demonstpan class="Species">rated that several regions of up to 50 kb of the pan class="Species">S. macrospora genome are syntenic to pan class="Species">N. crassa
[36],[37]. To extend this analysis to the newly assembled S. macrospora contigs, the five largest contigs from the Velvet assembly (519–991 kb) were compared to contigs of the N. crassa finished genome that have been assigned to specific linkage groups by mapped genetic markers (Assembly 9; http://www.broadinstitute.org/annotation/genome/neurospora/Regions.html). The results were visualized as dot plot (Figure 2A), and show that each contig maps to one or two linkage groups with only one to three breaks of synteny. Thus, the high degree of synteny between S. macrospora and N. crassa that was expected from previous studies was reflected in the Velvet assembly. To make use of this high degree of synteny and further improve the S. macrospora assembly, we generated a comparative assembly with Mercator by using the scaffolded chromosomes of the draft N. crassa genome (assembly 7, [43]) and the draft-sequences of the Neurospora discreta (http://genome.jgi-psf.org/Neudi1/Neudi1.home.html) and Neurospora tetrasperma (http://genome.jgi-psf.org/Neute1/Neute1.home.html) genomes to order and scaffold the S. macrospora contigs [44]. This resulted in a total of 152 scaffolds and 4,629 contigs with an N50 size of 498 kb (Table 2). Syntenic regions between the S. macrospora and N. crassa genomes were analyzed by dot plot analysis (Figure 2B). To verify that the scaffolded contigs represent the correct order within the S. macrospora genome, three regions spanning gaps between contigs on scaffolds 17, 58, and 98, respectively, were amplified by PCR and sequenced. In all cases, sequences between 0.8 and 1.2 kb were retrieved that close the gap between adjacent contigs thereby validating the scaffolding results (data not shown). This assembly represents the first high-quality draft version of the S. macrospora genome (“S. macrospora assembly 1”, acc. no. CABT01000001-CABT01004783, http://gb2.fungalgenomes.org/gb2/gbrowse/sordaria_macrospora).
Figure 2
Synteny between the genomes of S. macrospora and N. crassa.
(A) Synteny of contigs from the S. macrospora genome with the N. crassa genome before scaffolding along the N. crassa chromosomes. Dot plot of a comparison of the five largest contigs from the Velvet assembly (contigs 3467, 1588, 19727, 3369, and 12432, length given on the y-axis in descending order, total size of the five contigs 3.4 Mb, note that the Velvet contig numbers do not correspond to the contigs of the final assembly) against the Neurospora linkage groups (supercontigs I to VII in finished genome sequence, http://www.broadinstitute.org/annotation/genome/neurospora/Regions.html). The linkage group numbers for N. crassa are given above the dot plot. (B) Dot plot of a comparison of the S. macrospora scaffolds which cover 93% of the genomic sequence against the N. crassa supercontigs corresponding to linkage groups I to VII from the finished genome sequence. Comparisons for both analyses was done with BLASTN with e-value <10−150. Dot plot visualization was done with Combo [148].
Synteny between the genomes of S. macrospora and N. crassa.
(A) Synteny of contigs from the pan class="Species">S. macrospora genome with the pan class="Species">N. crassa genome before scaffolding along the N. crassa chromosomes. Dot plot of a comparison of the five largest contigs from the Velvet assembly (contigs 3467, 1588, 19727, 3369, and 12432, length given on the y-axis in descending order, total size of the five contigs 3.4 Mb, note that the Velvet contig numbers do not correspond to the contigs of the final assembly) against the Neurospora linkage groups (supercontigs I to VII in finished genome sequence, http://www.broadinstitute.org/annotation/genome/neurospora/Regions.html). The linkage group numbers for N. crassa are given above the dot plot. (B) Dot plot of a comparison of the S. macrospora scaffolds which cover 93% of the genomic sequence against the N. crassa supercontigs corresponding to linkage groups I to VII from the finished genome sequence. Comparisons for both analyses was done with BLASTN with e-value <10−150. Dot plot visualization was done with Combo [148].
Neither the rDNA repeat units nor the mitochondrial genome was represented in the Velvet assembly. We therefore searched the raw data as well as preassembled 454 and Solexa contigs for sequences with significant identity to rDNA or mitochondrial DNA from other fungi (Text S1). These reads were used to assemble both one rDNA unit as well as the mitochondrial DNA using CodonCode Aligner version 3.0.3 (http://www.codoncode.com/aligner/). The rDNA unit shows ∼98% DNA sequence identity to that of pan class="Species">N. crassa. Unlike in span>n class="Species">N. crassa, no additional smaller rDNA regions with point mutations were found by this method. Four shorter contigs had SNPs in various locations when compared to the full-length rDNA region. These SNPs all occurred as part of a homonucleotide run (4–6 nt), suggesting either sequencing errors or true polymorphisms in the rDNA repeats, which are considered to be rare in filamentous fungi but do exist in N. crassa because of the occurrence of RIP (see below; K.M. Smith and M. Freitag, unpublished data).
The mitochondrial genome encompasses 88.4 kb, and thus is larger than the 64.8 kb mitochondrial genome of pan class="Species">N. crassa and smaller than the 94.2 kb mitochondrial genome of span>n class="Species">Podospora anserina. With 33.6%, the GC content of the mitochondrial genome is in the same range as that of N. crassa (36.1%) and P. anserina (29.9%) (Text S1, Figure S3). Our data show that not only the single copy regions of the nuclear genome can be assembled from the next-generation sequencing data, but also multi-copy regions like the rDNA unit and the mitochondrial genome, even if they are not initially recovered in typical Velvet runs.
Comparisons between closely related species reduce the number of orphan genes
Gene models for the first draft of the pan class="Species">S. macrospora genome were predicted with four independent ab initio gene prediction programs trained on pan class="Species">N. crassa and evidence-based predictions with pan class="Species">N. crassa proteins (see Materials and Methods). The results were integrated with Evigan [45] to yield ∼12,000 gene models. Additionally, 455 tRNA genes were predicted, similar to the 424 tRNA genes predicted for N. crassa
[43]. The initially predicted ∼12,000 protein coding genes were screened for ORFs with internal stops, lack of initiation or termination codons, unusually long introns and insufficient support by sequence similarity. Such ORFs were corrected or removed resulting in a refined gene set of 10,789 genes with an average length of 1,432 bp for all predicted coding sequences (CDS, Table 3, Table S2). The overall GC content of the genome is 52.4%. This is changed to 56.5% in coding regions, which represent 38.4% of the genome, and 49.8% in non-coding regions, which make up 61.6% of the genome.
Table 3
Main features of the S. macrospora genome sequence.
Size of the final assembly
39.8 Mb
chromosomes
7
GC percentage (total genome)
52.4
GC percentage in coding regions
56.5
GC percentage in non-coding regions
49.8
tRNA genes
455
protein coding genes (CDSs)
10,789
percent coding
38.4
average CDS size (min/max)
1,423 bp (54 bp/33,321 bp)
To address the question of sequencing errors, we PCR-amplified and resequenced coding regions for six predicted genes (SMAC_01188, SMAC_01198, SMAC_6009, SMAC_07685, SMAC_07776, SMAC_09680) with frameshifts or internal stops. These were confirmed by resequencing in four cases, whereas in two cases, insertions or deletions of 1 nt were found in the assembled sequence which when corrected led to the prediction of functional open reading frames. In total, we tested 21 kb of coding sequence by resequencing and found four insertion/deletion errors (0.02%). Although it is difficult to compare errors and error pan class="Species">rates, this span>n class="Species">rate is similar to the 0.1–0.001% error rates achieved in microbial draft genomes sequenced by Sanger technology [46],[47].
With 10,789 predicted andpartially cupan class="Species">rated genes, the gene count in span>n class="Species">S. macrospora is similar to that of N. crassa (10,066 community-annotated and centrally curated genes). To determine how many predicted proteins in these two closely related species are orthologs, reciprocal BLASTP analysis was performed: At an e-value of ≤10−20, 8,519 S. macrospora proteins have at least one homolog among the N. crassa proteins; vice versa, 8,179 N. crassa proteins have at least one homolog among the S. macrospora proteins. In total, 7,855 proteins (72.7% of all S. macrospora proteins) have reciprocal best hits in both searches identifying them as likely orthologs (Table S3).
Sequencing of the first few eukaryotic genomes revealed relatively high frequencies of “orphan genes” (i.e. genes without apparent homologs in any of the already known sequence databases and proteomes). As more genomes become available, this number has been rapidly decreasing, e.g. for pan class="Species">N. crassa from ∼41% [43] to currently 22% (2,219/10,066 [48]). Because span>n class="Species">S. macrospora is more closely related to N. crassa than any other previously sequenced filamentous fungus, we compared the N. crassa orphan genes with the S. macrospora genome using TBLASTN and BLASTP to assess how many proteins are lineage-specific (Table S4). Of 2,112 N. crassa orphan genes that were retrieved from the current N. crassa MIPS protein list (http://mips.helmholtz-muenchen.de/genre/proj/ncrassa/), 870 do not have significant hits in the S. macrospora genome at an e-value of ≤10−20. Orphan genes might comprise more quickly evolving genes [48], and we therefore repeated our analysis at an e-value ≤10−5. This analysis still left 471 (4.7%) genes without significant hits, suggesting that these genes may constitute the remaining true orphan genes that separate the genus Sordaria from Neurospora (Table S4). The recent sequencing of additional Neurospora species is expected to further reduce the number of genus-specific genes.
In addition to assessing the conservation of protein-coding gene regions, we sought to investigate the conservation of non-coding regions between pan class="Species">S. macrospora and its closest relatives. Therefore, we performed comparisons of 5′ upstream regions in 1 kb blocks from 1 kb to 4 kb as well as compaspan>risons of introns and coding regions for span>n class="Species">S. macrospora, N. crassa, N. discreta and N. tetrasperma (Figure 3, Figure S4 and Table S5). We observed that introns are more conserved than upstream regions. Among the upstream regions, pairwise identity is slightly but significantly higher in the 1 kb upstream regions than in any of the other tested upstream regions (Table S5). This suggests that most regulatory (and therefore putatively conserved) elements in 5′ UTRs and promoters reside within the 1 kb upstream regions.
Figure 3
Pairwise identity between S. macrospora and N. crassa for different genomic regions.
CDSs, introns, and regions upstream of CDSs (in 1 kb steps ranging from 1 to 4 kb) were used for comparison. Only those upstream regions were used that do not overlap with a protein coding region. Each region was used only once even if it is upstream of two divergently transcribed genes to avoid double-counting. The box plots show the distribution of % pairwise identities with the median value as a horizontal line in the box between the first and third quartiles. Detailed information on the comparisons can be found in Figure S4 and Table S5.
Pairwise identity between S. macrospora and N. crassa for different genomic regions.
pan class="Chemical">CDSs, introns, and regions upstream of pan class="Chemical">CDSs (in 1 kb steps ranging from 1 to 4 kb) were used for comparison. Only those upstream regions were used that do not overlap with a protein coding region. Each region was used only once even if it is upstream of two divergently transcribed genes to avoid double-counting. The box plots show the distribution of % pairwise identities with the median value as a horizontal line in the box between the first and third quartiles. Detailed information on the comparisons can be found in Figure S4 and Table S5.
We also compared the predicted pan class="Species">S. macrospora proteins to the non-redundant GenBank and Swissprot databases (Table S2). Approximately 6% (631/10,789) of all predicted proteins did not have a significant hit against the non-redundant database at an e-value ≤10−5. This number is only slightly higher than that for span>n class="Species">N. crassa (4.7%, 471 genes, see above). Taking into account that no other Sordaria species have been sequenced yet, we suggest that the number of true orphan genes in ascomycetes might be less than 5% or 500 genes per genome.
A search for conserved protein domains in the predicted pan class="Species">S. macrospora proteins was performed with the HMMER program hmmpfam [49],[50] and with the InterPpan class="Chemical">roScan function from Blast2GO [51],[52]. With HMMER, one or more conserved domains were found in 5,471 predicted proteins (50.7%, Tables S2 and S6), the InterProScan found domains in 7,099 predicted proteins (65.7%, Table S2). These values might seem rather low when compared to the more than 10,000 proteins that have a hit in the non-redundant database, but it reflects the fact that many (predicted, hypothetical or conserved hypothetical) proteins have not yet been functionally characterized; therefore many domains remain to be identified.
In addition to a comparison to pan class="Species">N. crassa, an analysis of the predicted proteins from span>n class="Species">S. macrospora, N. crassa, N. discreta, P. anserina, and Chaetomium globosum was performed with OrthoMCL, a software that clusters orthologs and “recent” paralogs [53]. We identified 9,971 orthogroups, and among these 5,428 (54.4%) comprise single genes from each of the five species, i.e. single-copy genes that are conserved among all species investigated (Tables S7 and S8). 31 orthogroups contain genes with three or more paralogs in S. macrospora, but fewer or no paralogs in other fungi, and these were investigated further. Some of these orthogroups contained proteins suggestive of transposon activity (see below), whereas others have no homology to transposons or pseudogenes. Phylogenetic analysis of two orthogroups (99 and 79) indicates evolutionary histories of ancient gene family expansion and subsequent differential gene loss (Figure 4). Orthogoup 99 comprises three genes from S. macrospora and two genes from P. anserina, whereas in the Neurospora species and C. globosum, only one gene is present. The genes from this orthogroup encode putative P450 oxygenases, and one might speculate that these proteins are beneficial for a coprophilic lifestyle, because only the coprophilic fungi S. macrospora and P. anserina have retained more than one copy. A similar case of duplication and subsequent loss can be postulated for orthogroup 79, which contains genes encoding chitin binding and glycosyl hydrolase domains.
Figure 4
Phylogenetic analysis and expression of genes from different orthogroups from an OrthoMCL analysis of S. macrospora (SM), N. crassa (NC), N. discreta (ND), C. globosum (CHG), and P. anserina (PA).
(A) Species phylogeny with six concatenated genes that are single-copy orthologs in each of the five species. (B–E) Phylogenetic trees with five different orthogroups. Outgroups for the trees were homologs from either Nectria haematococca, Aspergillus fumigatus, Penicillium chrysogenum, or Pyrenophora tritici-repentis. Numbers at branches indicate bootstrap support (10,000 bootstrap replications) in % for neighbor joining trees. (D–E) Expression of the S. macrospora and N. crassa genes from orthogroups 49 and 180 during sexual development compared to vegetative growth. Expression data are the results of two independent experiments and were determined by quantitative real time PCR. The red and green dashed lines indicate two-fold up- and downregulation, respectively. n.e., no expression was detected during vegetative growth or sexual development.
Phylogenetic analysis and expression of genes from different orthogroups from an OrthoMCL analysis of S. macrospora (SM), N. crassa (NC), N. discreta (ND), C. globosum (CHG), and P. anserina (PA).
(A) Species phylogeny with six concatenated genes that are single-copy orthologs in each of the five species. (B–E) Phylogenetic trees with five different orthogroups. Outgroups for the trees were homologs from either pan class="Species">Nectria haematococca, pan class="Species">Aspergillus fumigatus, Penicillium chrysogenum, or Pyrenophora tritici-repentis. Numbers at branches indicate bootstrap support (10,000 bootstrap replications) in % for neighbor joining trees. (D–E) Expression of the S. macrospora and N. crassa genes from orthogroups 49 and 180 during sexual development compared to vegetative growth. Expression data are the results of two independent experiments and were determined by quantitative real time PCR. The red and green dashed lines indicate two-fold up- and downregulation, respectively. n.e., no expression was detected during vegetative growth or sexual development.
In contrast, orthogroups 49 and 180 contain one or no gene for the pan class="Species">Neurospora species, span>n class="Species">P. anserina, and C. globosum, but six and four members, respectively, in S. macrospora; and the S. macrospora genes cluster together in a phylogenetic tree (Figure 4). Thus, these genes seem to represent recent duplication events in S. macrospora. Both orthogroups are part of larger gene families, and to verify that placement of these subfamilies in different orthogroups was correct, an independent phylogenetic analysis was performed (Figure S5). This analysis supports the grouping by OrthoMCL. To test whether these genes are expressed genes and not simply annotation errors, quantitative real time PCR experiments were performed for ten S. macrospora genes from orthogroups 49 and 180 (Figure 4). For eight of the ten genes, transcripts were found under conditions of sexual development and/or vegetative growth, and all eight genes are upregulated during sexual development. In contrast, the homologous N. crassa genes are downregulated or not differentially regulated (Figure 4). Thus, the S. macrospora genes that are expressed under the conditions investigated might have gained developmental regulation after the split of the Neurospora and Sordaria lineages, probably as a result of gene family diversification after gene duplications. Whether these genes have a function during sexual morphogenesis in S. macrospora remains to be determined.
Repeated sequences, transposons, and genome integrity
Transposons and repeat elements have been identified in all eukaryotic groups investigated so far, and they can comprise large portions of a genome, e.g. 85% of the recently published pan class="Species">maize genome [54],[55]. In fungi they usually make up only a compaspan>ratively small part of the genome (usually ≤10%), because effective defense mechanisms against repeated sequences are in place and because smaller genomes are more streamlined [56]. Eukaryotic transposons can be divided into two classes, class I elements that transpose via an RNA intermediate, and class II elements that transpose at the DNA level by excision and reintegration [57]. To analyze the transposon content of the S. macrospora genome, several approaches were used. First, amino acid sequences of known transposon open reading frames were used for comparison with the predicted S. macrospora peptides as described previously [58]. Second, DNA sequences of randomly selected scaffolds were compared to Repbase data [59]. These two approaches will identify only those repeated sequences or transposons that are similar to previously described elements. Third, DNA sequences of randomly selected scaffolds were compared to the complete genome sequence in order to identify new repeated sequences without similar entities in the databases.Most interesting is the presence of five ORFs with amino acid sequence similarity to the pan class="Species">N. crassa Tad LINE-like transposon [60] (Table 4). In addition, there are ∼20 ORFs with sequence similarity to gypsy-type retrotransposons. However, these ORFs exhibit pan class="Species">rather diverse sequences and do not form element families. In contrast to these class I elements, there are only three ORFs with similarities to class II eukaryotic transposons; two of these represent a hAT-like element [61] that we called “Scarce”, and one ORF with amino acid similarity to the Fot1 transposon from Fusarium oxysporum
[62]. As the only full-length Scarce ORF SMAC_09680 contains a nonsense codon, it is likely that the element is no longer active, thus explaining the low copy number. Overall, the transposon load of S. macrospora is very low, much more resembling that of another homothallic fungus Gibberella zeae (anamorph Fusarium graminearum) [63] than that of N. crassa. This is also reflected in a search for regions of high similarity within the S. macrospora genome by performing a BLASTN analysis of the genome sequence versus itself (Figure S6). In this analysis, the prevalence of regions with high intragenomic similarity in S. macrospora is between those of N. crassa
[43] and F. graminearum
[63], all of which have significantly fewer intragenomic regions of high similarity than the repeat-rich genome of Magnaporthe grisea
[64]. This finding correlates well with the low transposon count of the S. macrospora genome. Taken together with the fact that we were able to assemble long contigs and that the assembly size correlates well with the genome size determined by pulsed-field gel electrophoresis, this suggests that the low repeat content is not an assembly artefact.
Table 4
Repeated sequences and transposons in the S. macrospora genome.
TSD: target site duplication present at least for some elements.
1 These elements show a very high degree of sequence variation; in addition Repbase analysis indicates additional DNA sequences with similarities to gypsy-like sequences.
2 Both elements exhibit partial sequence similarities.
3 Elements often inside ORFs or overlapping with ORFs.
4 Elements with at least 80% sequence similarity.
pan class="Disease">TSD: target site duplication present at least for some elements.
1 These elements show a very high degree of sequence variation; in addition Repbase analysis indicates additional DNA sequences with similarities to gypsy-like sequences.2 Both elements exhibit partial sequence similarities.3 Elements often inside ORFs or overlapping with ORFs.4 Elements with at least 80% sequence similarity.In addition to class I and II transposable elements, five different non-coding short repeat sequences (Smini1 to Smini5, 150–670 bp, Table 4) were detected. Two of these have partial sequence identity (Smini3 andSmini4) because of overlapping sequences. To verify that these repeats are real and not due to assembly problems, at least two copies for each repeat were PCR amplified and sequenced. For all tested repeats, their presence within the predicted genomic context was confirmed. Ten copies of repeat Smini5 are within ORFs, two are outside of ORFs, and another two overlap with ORFs. At least six of these ORFs show similarities to retrotransposon sequences. Some Smini1 andSmini3 repeats possess 5 bp target site duplications suggesting that they may be transposons or integpan class="Species">rated elements caused by transposition. In some cases, point mutations may have modified target site duplications, or recombination may have occurred as has been shown for span>n class="Species">Aspergillus niger
[65]. As Smini1 and Smini3 are both uniform in size (with the exception of some truncated elements), they may be solo-LTRs rather than mini-transposons such as the guest element of N. crassa
[66]. Altogether, these five short repeat types cover only 56.8 kb of the genome (0.14%).
In pan class="Species">N. crassa and a few other ascomycetes (e.g. pan class="Species">P. anserina
[67],[68], M. grisea
[69], F. graminearum
[63], and Leptosphaeria maculans
[70]), the RIP machinery detects pairs of repeated segments during premeiosis, introduces C∶G to T∶A mutations and can trigger DNA methylation of the mutated repeats in the vegetative cells resulting from ascospores, presumably by virtue of the increased AT content [71]. We analyzed the entire S. macrospora genome sequence for the presence of RIP footprints by calculating RIP indices [72] on the concatenated contigs and scaffolds (Figure S7). In contrast to the situation in N. crassa, where large regions mutated by RIP make up the centromeric DNA (K.M. Smith, L.R. Connolly and M. Freitag, unpublished data), we found no large blocks of AT-rich regions with the typical RIP bias (e.g., TpA/ApT >1.0). The only large region with atypical dinucleotide distribution was scaffold 0, which contains the mtDNA. Here, both TpA/ApT and (CpA+TpG)/(ApC+GpT) were close to 1, suggesting DNA composition more reminiscent of bacteria or budding yeast. Our results suggest the absence of large regions mutated by RIP in the S. macrospora genome. Previous analyses have shown that there is no active RIP in S. macrospora (Kück et al., unpublished data). However, an ortholog of the N. crassa rid gene, the only gene known to be important for RIP [73], is present in the S. macrospora genome (Table S9), indicating that S. macrospora might have been able to undergo RIP during some time of its evolution; alternatively, RIP may occur at such low levels that it is difficult to detect in typical transformation and selfing experiments. RID homologs are involved in sexual development in two other fungi, Ascobolus immersus
[74] and Aspergillus nidulans
[75], suggesting that the S. macrospora protein may carry out a function independent of RIP.
In pan class="Species">N. crassa, two other genome defense mechanisms in addition to RIP have been identified, namely meiotic silencing by unSpecies">paired DNA (span>n class="Disease">MSUD or “meiotic silencing”) and a form of RNAi (“quelling”) [76],[77]. All N. crassa genes identified in these processes have orthologs in pan class="Species">S. macrospora suggesting that S. macrospora might be able to perform different varieties of genome defense (Table S9). The fact that endogenous genes can be silenced via introduction of transgenic constructs that result in double-stranded RNA molecules indicates an active RNAi-like mechanism [78]. Nevertheless, transformants with ectopically integrated copies for genes involved in meiosis (which might be subject of MSUD) or other processes (which might be subject to RNAi) have been successfully generated in different laboratories working with S. macrospora for years. Silencing of the resident and/or ectopically located gene functions has never been observed or described (e.g. [21],[25],[30],[79],[80]). This suggests that S. macrospora might possess gene silencing mechanisms but that they are perhaps less active, at least with respect to transgenes, than in N. crassa.
Apart from genome defense mechanisms, there are a number of conserved processes in eukaryotes that are involved in maintaining genome integrity and regulating genome activity at the chromatin level [81]. We annotated chromatin-associated proteins, histone modification proteins, genes involved in the structural maintenance of chromosomes as well as centromere and kinetochore proteins and found that pan class="Species">S. macrospora contains essentially the same set of genes as span>n class="Species">N. crassa (Table S9). Like its close relative, S. macrospora has single genes for the histone H3 K9 methyltransferase (DIM5), the heterochromatin protein 1 (HP1) and the DNA methyltransferase DIM-2, suggesting that heterochromatin formation and DNA methylation in S. macrospora are similar to what has been observed in N. crassa
[43]. Taken together, these data indicate that S. macrospora contains the typical, conserved eukaryotic machinery for genome maintenance. Despite the absence of active RIP, this fungus appears to prevent the spreading of transposons and other repeated sequences as indicated by the low content of these elements within the genome.
Genes for regulatory networks, signaling, meiosis, and development
Since the 1950s, pan class="Species">S. macrospora has been used as a model system for the analysis of pan class="Disease">fungal sexual development and meiosis, and a number of developmental genes have been characterized at the molecular level [14],[82]. We searched for genes known to be involved in development or in signaling cascades in S. macrospora and other fungi and found that S. macrospora contains homologs to all conserved genes as expected, further confirming the quality of the genome sequence.
Specifically, we looked for orthologs to known genes for fungal sexual development, meiosis, pan class="Chemical">GTP-, pan class="Chemical">phospholipid- and pan class="Chemical">calcium-signaling, motor proteins, senescence, photoreceptors and light signaling (Tables S10, S11, S12, S13, S14). In the case of photoreceptor-coding genes, it was found that S. macrospora contains homologs to known or putative fungal photoreceptors (Table S10). S. macrospora is able to undergo sexual development both in the dark as well as under white light [83]; however, in the light perithecial necks of Sordaria and Neurospora species exhibit positive phototropism in order to aim the active discharge of ascospores away from the growth substrate [84],[85]. In N. crassa, this photoresponse is mediated by the blue light photoreceptor WC-1 [84],[86]–[88]. Photoresponses often involve multiple photoreceptors, e.g. photoreceptors for red and blue light are present in one protein complex in A. nidulans
[89],[90]. To test whether wavelengths other than blue light also play a role in regulating neck phototropism, we tested the photoresponse of S. macrospora to green and red light. Under red light, perithecial necks were oriented in random directions similar to that of perithecia grown in complete darkness, but perithecial necks showed a strong positive phototropism in response to green light (Figure S8). Our results suggest that perithecial neck phototropism in S. macrospora is regulated by blue light, similar to photoresponses in N. crassa
[91], and additionally by green light, a response not yet observed in other fungi. The photoreceptors responsible for this phenotype remain to be uncovered; possible candidates are two putative rhodopsin-like green light photoreceptors (SMAC_02424 and SMAC_06025) that are orthologs of ORP-1 and NOP-1 in N. crassa, respectively [81],[92].
Senescence in fungi has been observed in the model organismpan class="Species">P. anserina, in strains of span>n class="Species">N. crassa and N. intermedia
[93],[94], but not in S. macrospora. A search for homologs to genes that are known to be involved in the aging process in P. anserina revealed that for the majority of the genes clear homologs are present in S. macrospora (Table S11). This includes genes that are required for mitochondrial protein quality control, programmed cell death, DNA repair, ROS scavenging, mitochondrial dynamics, and respiration, among other processes. Two genes not identified in S. macrospora are the apoptosis-related genes PaAif1 and PaAmid2. PaAIF1 (apoptosis-inducing factor) and PaAMID2 (AIF-like mitochondrion-associated inducer of death) are putative NADH oxidoreductases. In mammals, AMID is present in mitochondria, and its overexpression induces cell death [95]. The third protein that is missing in S. macrospora is the SAM-dependent O-methyltransferase PaMth1. An accumulation of this protein was detected in the mitochondria and in total protein extracts of senescent P. anserina wild type strains [96],[97]. Investigation of substrate-specificity of the protein hints to a protecting role of this methyltransferase against the generation of reactive oxygen species [98],[99]. While PaMth1 overexpressing strains show a significantly elongated life span, PaMth1 deletion strains are short-lived. However, S. macrospora does not show a restricted lifespan despite the lack of a PaMth1 homolog, indicating that the aging process in P. anserina is not conserved in other members of the Sordariales, and that the P. anserina aging genes that are present in S. macrospora may function in other cellular pathways.
Fungi have long been used as model systems to study the molecular mechanisms of meiosis, andpan class="Species">S. macrospora has played a prominent role in these investigations due to its simple sexual life cycle, large meiotic products (ascospores) and the production of an ordered tetrad of ascospores that allows the differentiation between pre- and postreduction segregation of alleles [14],[82]. Compaspan>rison of the predicted span>n class="Species">S. macrospora genes with the S. cerevisiae and Schizosaccharomyces pombe genomes [100],[101] allowed the identification of 92 “meiotic” genes. Reciprocal best hit BLASTP similarity searches against the predicted ORFs of S. macrospora, N. crassa and P. anserina showed that the 92 genes display orthologs in all three species (Table S15) [81],[102]. Nine of the genes were already characterized in S. macrospora (Table S15). The most conserved proteins include enzymes that are implicated in the recombination process and the proteins involved in sister-chromatid cohesion. In contrast, structural proteins like the components of the synaptonemal complex (SC) are poorly conserved despite the fact that the SC is as conserved during evolution as meiosis itself. This is similar to findings in other groups of organisms, e.g. mammals and plants [103]. Remarkably, S. macrospora, N. crassa, and P. anserina, like other filamentous fungi [104] possess only the RecA ortholog RAD51 and lack a recognizable DMC1, the meiosis-specific homolog of RAD51, thought to play an essential role in strand invasion [105]. The meiotic regulators are also poorly conserved (Table S15): among the three meiotic-specific transcription factors in yeast (Abf1p, Ume6p and Ndt80p) only an Ndt80p homolog is identifiable. Thus, S. macrospora has a conserved set of meiotic core genes whereas the regulators are more diverged, probably indicating life style-specific adaptations.
We also searched for genes that may be involved in pan class="Chemical">GTP-dependent and/or pan class="Chemical">phospholipid or calcium signaling as well as known fungal developmental genes and genes encoding motor proteins, and found for all groups that the gene content of the S. macrospora genome is similar to that of N. crassa, and thus in most cases larger than that of S. cerevisiae (Tables S12, S13, S14). This shows that S. macrospora is a useful model organism for studying developmental processes because it contains the full repertoire of higher eukaryotic genes involved in signaling and regulatory networks. Nevertheless, there are several groups of genes where S. macrospora differs from other fungi and that warrant a closer look because they allow insights into fungal evolution and biology. These are described below.
Genes for conidiation and nonself recognition: a case of “cryptic” incompatibility?
Two features in which pan class="Species">S. macrospora differs from its close relative pan class="Species">N. crassa are the lack of both asexual spores (“mitospores” or conidia) and heterokaryon incompatibility reactions. Searches in the S. macrospora genome for conserved genes that are involved in these processes revealed that homologs for conidiation genes are present (Table S16). These homologs seem to encode functional proteins, as they are not enriched in missense or nonsense mutations. Furthermore, quantitative real time PCR analysis for orthologs of six genes involved in conidiation in N. crassa revealed that these genes are expressed both during vegetative growth and sexual development in S. macrospora (Figure S9). Of course, additional unknown genes that are essential for conidiation may be missing or mutated in S. macrospora. Another possibility is that S. macrospora is able to conidiate, but does not do so under laboratory conditions. This would be analogous to the situation of Aspergillus fumigatus, which was recently shown to undergo sexual development when grown under suitable conditions [106],[107]. A third possibility, discussed below, might be that S. macrospora no longer produces conidia due to an unfavorable combination of heterokaryon incompatibility genes.
Filamentous fungi can undergo hyphal fusion (anastomosis, [108]) between individuals of different genotypes leading to the formation of a mycelium containing genetically different nuclei (heterokaryon). In pan class="Species">many ascomycetes such as span>n class="Species">N. crassa, P. anserina, and A. nidulans, the viability of these heterokaryons is genetically controlled by a set of heterokaryon incompatibility (het) loci. A het locus can be defined as a locus at which heteroallelism cannot be tolerated in a heterokaryon [109], thus a fusion between two individuals that differ genetically at a het locus results in a nonself recognition reaction which leads to phenotypes ranging from inhibited, abnormal growth to cell death [110]. Heterokaryon incompatibility (HI) has been shown to prevent the spread of viruses and the exploitation of aggressive phenotypes and is believed to reduce the risk of resource plundering between individuals [111]–[114]. However, heterokaryon formation can also have benefits for the individuals involved, e.g. the formation of functional diploids and mitotic genetic exchange in the parasexual cycle [115].
Several het loci have been characterized at the molecular level, and a conserved region of about 150 residues has been identified within various pan class="Disease">HI proteins. This domain is termed the HET domain [116]. The parts of het genes not encoding the HET domain are highly polymorphic; they ensure nonself recognition and are evolving very rapidly whereas the HET domain triggers cell death [117]. In addition to het domain genes, several other genes function as het loci, among them the mating-type genes in pan class="Species">N. crassa, which act as het genes during vegetative cell fusion but are required to be different during sexual cell fusion [118]–[120].
Vegetative incompatibility has not been observed in pan class="Species">S. macrospora
[35]. Nevertheless, span>n class="Species">S. macrospora harbors genes for homologs to known het genes in other fungi (Table S17). A rather surprising finding was that in the case of het-c, pin-c, and a tol-related HET domain gene, not one, but two closely linked copies for each of these genes are present in the S. macrospora genome (Figure 5, Table S17). This is in contrast to all other filamentous ascomycetes which encode only one homolog of the het-c gene [121]. In addition to het-c, a second, closely linked HET domain-encoding gene named pin-c is essential for the HI reaction in N. crassa. It was shown that nonallelic genetic interactions between het-c and pin-c mediate nonself recognition while the severity of the HI depends on allelic interactions at the het-c locus [122]. In S. macrospora, the genomic region that is orthologous to the het-c/pin-c locus in N. crassa contains two copies of pin-c (SMAC_07217 and SMAC_07219) and one full-length (SMAC_07220) and one partial (SMAC_07218) copy of het-c (Figure 5). BLASTP comparison shows that the two PIN-C proteins from S. macrospora differ from each other to about the same degree as the N. crassa PIN-C allelic variants differ from each other (data not shown). The het-c/pin-c region is inverted in S. macrospora, and the genes at the ends of the inverted region, het-c and pin-c, are duplicated. To exclude the possibility that this is an assembly error, we amplified by PCR and end-sequenced DNA fragments spanning the regions between SMAC_07217 and SMAC_07218, between SMAC_07218 and SMAC_07219, between SMAC_07219 and SMAC_07220, and between SMAC_07228 and SMAC_07229. In all cases, we obtained PCR fragments of the expected size and sequence thereby validating that this gene order is not an assembly error but represents the wild type situation. Interestingly, the intergenic region between SMAC_07217 and SMAC_07218 contains two copies of the Smini1 repeat; thus, the duplication in this region may have originated from a transposition event. Phylogenetic analysis of the duplicated PIN-C homologs and the duplicated TOL-related proteins indicates that for pin-c, the duplication arose after the divergence of Sordaria from Neurospora, because the two pin-c copies are more similar to each other than to either of the three known pin-c alleles from N. crassa (Figure 5).
Figure 5
The het-c/pin-c locus of S. macrospora contains additional copies of putative heterokaryon incompatibility genes.
(A) Region from S. macrospora scaffold 98 and N. crassa contig 8 containing het-c and pin-c genes. A syntenic region containing the N. crassa het-c and pin-c genes and the orthologous region in S. macrospora is shaded in blue. In S. macrospora, this region is bordered by additional copies of pin-c and a partial het-c (left) and a TOL-related protein encoding gene (right). The tol-related gene SMAC_07228 contains an internal stop codon within the open reading frame (indicated by an asterisk) and therefore encodes a shortened TOL-related protein or is a pseudogene. (B) Phylogenetic tree of PIN-C and TOL-related proteins from the genomic region shown in (A). For N. crassa, three allelic variations of PIN-C (PIN-C1, PIN-C2, and PIN-C3) were used for tree construction. The PIN-C1 protein from Pyrenophora tritici-repentis was used as an outgroup to root the tree. Maximum parsimony and neighbor joining trees were calculated with 10,000 bootstrap replications each. The phylogenetic tree separates the PIN-C and TOL-related proteins, however, it is not conclusive with respect to the putative ancestral state of the PIN-C alleles.
The het-c/pin-c locus of S. macrospora contains additional copies of putative heterokaryon incompatibility genes.
(A) Region from pan class="Species">S. macrospora scaffold 98 and pan class="Species">N. crassa contig 8 containing het-c and pin-c genes. A syntenic region containing the pan class="Species">N. crassa het-c and pin-c genes and the orthologous region in S. macrospora is shaded in blue. In S. macrospora, this region is bordered by additional copies of pin-c and a partial het-c (left) and a TOL-related protein encoding gene (right). The tol-related gene SMAC_07228 contains an internal stop codon within the open reading frame (indicated by an asterisk) and therefore encodes a shortened TOL-related protein or is a pseudogene. (B) Phylogenetic tree of PIN-C and TOL-related proteins from the genomic region shown in (A). For N. crassa, three allelic variations of PIN-C (PIN-C1, PIN-C2, and PIN-C3) were used for tree construction. The PIN-C1 protein from Pyrenophora tritici-repentis was used as an outgroup to root the tree. Maximum parsimony and neighbor joining trees were calculated with 10,000 bootstrap replications each. The phylogenetic tree separates the PIN-C and TOL-related proteins, however, it is not conclusive with respect to the putative ancestral state of the PIN-C alleles.
In pan class="Species">N. crassa, two copies of het-c are only present in one cytoplasm after heterokaryon formation, and it has been shown that HET-C proteins encoded by different het-c alleles form a heterodimer complex at the plasma membrane during the HI reaction [123]. Thus, with respect to het-c and pin-c, the genomic situation in pan class="Species">S. macrospora resembles that of a heterokaryon in N. crassa (Figure 6), but no obvious signs of HI, e.g. compartmentalization and cell death, are evident in S. macrospora. However, mild HI reactions in N. crassa can lead to less severe phenotypes, e.g. aconidial strains [124]–[127]. In S. macrospora, the second het-c copy is incomplete and the ortholog of het-6, another gene involved in HI in N. crassa, contains internal stop codons so that a full HI reaction might be prevented by only partially functional het genes. Thus, we hypothesize that the lack of conidiation in S. macrospora may be due to “cryptic” or “mild” HI caused by the presence of more than one copy of putative HI genes in the genome (Figure 6). However, as indicated above, this is just one of several hypotheses to explain the fact that S. macrospora is aconidiate despite possessing orthologs to all known conidiation genes.
Figure 6
Model for the action of heterokaryon incompatibility.
Incompatibility in two incompatible strains of N. crassa (A) and in a single strain of S. macrospora (B). The VIB transcription factor regulates the expression of HET-domain genes tol, het-6, and pin-c
[170]. The het-6 gene of S. macrospora is mutated (m) and the second het-c gene (het-c2) is incomplete.
Model for the action of heterokaryon incompatibility.
Incompatibility in two incompatible strains of pan class="Species">N. crassa (A) and in a single strain of span>n class="Species">S. macrospora (B). The VIB transcription factor regulates the expression of HET-domain genes tol, het-6, and pin-c
[170]. The het-6 gene of S. macrospora is mutated (m) and the second het-c gene (het-c2) is incomplete.
Another point worth considering is that pan class="Species">S. macrospora is homothallic and encodes mating type genes in one locus that are present in sepaspan>n class="Species">rate mating-type idiomorphs in N. crassa
[128]. This situation would result in severe HI in vegetative cells of pan class="Species">N. crassa mediated by the TOL protein. Only in tol mutants both mating type idiomorphs are tolerated in one vegetative cytoplasm [120]. Introgression of the N. crassa tol into N. tetrasperma caused HI and disrupted the pseudohomothallic nature of this fungus indicating that the native N. tetrasperma tol does not mediate HI [129]. Interestingly, the S. macrospora TOL, SMAC_08253, has only 40% amino acid identity to its N. crassa ortholog, an extremely low value compared to the average 89% identity in coding regions at the DNA level [37]. Probably this very divergent TOL does not mediate HI and allows co-existence of all mating type genes within vegetative cells. Thus, HI in S. macrospora might be attenuated (“cryptic” HI) or abolished by mutations in critical HI-mediating genes to cope with or allow the presence of otherwise incompatible genes within one genome.
A second genomic locus that is important for HI in pan class="Species">N. crassa and span>n class="Species">N. tetrasperma contains the het-6 and un-24 (rnr-1) genes. In this case, the two known alleles, Oak Ridge (OR) and Panama (PA), of both genes in both species differ not only in the sequences of the alleles, but also in the gene order within the het-6/un-24 locus, which was caused by an inversion of a block of five genes including un-24
[116],[130],[131]. An analysis of the orthologous region in S. macrospora revealed the same gene order as in the OR allele (Figure S10A). Phylogenetic analysis of both genes showed that the different allelic versions of N. crassa and N. tetrasperma cluster together as has been shown previously [131], while the S. macrospora genes occupy a basal position relative to the two Neurospora species (Figure S10B). This suggests that the OR allele represents the ancient gene order, and that the PA allele arose from an inversion after separation of Sordaria and Neurospora, but before speciation of N. crassa and N. tetrasperma; otherwise one would have to postulate two independent inversion events of the same genomic region leading to the OR gene order which is rather unlikely.
Some genes for secondary metabolism may have been acquired by horizontal gene transfer
pan class="Chemical">Polyketides and non-ribosomal peptides are the most prominent classes of fungal secondary metabolites [6]. They comprise a wide variety of chemical structures, and a number of them have pharmaceutical applications, but their biological functions remain largely unknown [132],[133]. Most filamentous fungi harbor several genes encoding pan class="Chemical">polyketide synthases (PKS) as well as non-ribosomal peptide synthases (NRPS) in their genomes. Apart from the pks and nrps genes, the biosynthesis of a polyketide or non-ribosomal peptide usually requires additional genes that encode, for example, enzymes that modify the products of the PKSs and NRPSs. These genes are often clustered together with the corresponding pks or nrps gene within the genome [134]. In order to determine the potential of S. macrospora for the biosynthesis of secondary metabolites, we searched the predicted proteins for the occurrence of typical domains associated with PKS or NRPS proteins, and additionally also for fatty acid synthase (FAS) proteins as these have structural similarity to PKS proteins (Table S18). S. macrospora contains three putative nrps genes, three genes that fall into the fas class, and eleven putative pks genes. The numbers of nrps and fas genes are the same as in N. crassa, and the corresponding genes in the two fungi are orthologs. However, of the predicted eleven pks genes, only seven have an ortholog in N. crassa, whereas four PKS proteins have a higher sequence identity to other, more distantly related fungi. The N. crassa genome contains only eight putative pks genes one of which has no ortholog in S. macrospora
[36],[135]. Thus, with respect to pks genes and putative polyketides, S. macrospora appears to possess a greater potential for the production of secondary metabolites than its close relative N. crassa (Table S18, Figure S11).
Most of the pan class="Species">S. macrospora pan class="Chemical">polyketide biosynthesis genes that have been studied previously have been found to be upregulated during sexual development, and polyketides may play a role in fruiting body formation in S. macrospora
[36],[78]. Therefore, we determined the expression of the remaining five pks as well as the three nrps genes during sexual development (Figure 7). The nrps genes as well as eight of the eleven pks genes are transcriptionally upregulated during sexual development. The three pks genes that are not upregulated comprise the single type III pks gene as well as two pks genes without orthologs in N. crassa. These two pks genes, SMAC_01188 and SMAC_01198, are organized in a cluster of putative polyketide biosynthesis genes (Figure 8). Despite the fact that polyketide biosynthesis genes are often clustered in filamentous fungi [134], in S. macrospora only one such cluster has been found [36], and the genome sequence shows that most pks genes of S. macrospora do not occur clustered with other polyketide biosynthesis genes.
Figure 7
Expression of all predicted pks and nrps genes in S. macrospora during sexual development compared with vegetative growth.
Gene names for which a N. crassa ortholog is present are given in gray, gene names where no N. crassa ortholog exists are given in bold black (see also Table S18). All expression data are the results of at least two independent experiments and were determined by quantitative real time PCR. Data for six of the genes (the first six type I pks genes, SMAC_03130 to SMAC_05695) were taken from previous studies [36],[78], expression of the other eight genes was determined in the course of this investigation. The type of encoded protein (type I PKS, type III PKS, PKS/NRPS hybrid, and NRPS) is indicated. The red line indicates two-fold upregulation.
Figure 8
A partly orthologous polyketide biosynthesis cluster in S. macrospora and Phaeosphaeria nodorum.
(A) Comparison of partly orthologous polyketide biosynthesis clusters from S. macrospora (scaffold_17, S.m.) and P. nodorum (supercontig 16, P.n., data for P. nodorum are from the Stagonospora nodorum database at http://www.broadinstitute.org/annotation/genome/stagonospora_nodorum/Home.html
[136]). The six genes for which an ortholog is present in both clusters are shown in green, orthology is indicated by gray bars between the genes. Genes for which no orthologs are present in both clusters are given in blue. (B) Percent identity from BLASTP analysis (e-value ≤10−5) from a comparison of S. macrospora proteins versus P. nodorum proteins. Mean values of percent protein identity were calculated for (I) all proteins with a significant hit (e-value ≤10−5, 7424 proteins), (II) all proteins that contain a Pfam domain from one of the five Pfam domain families that are represented within the orthologous proteins from the cluster (137 proteins, the domains are adh_short, FAD_binding_3, p450, PAL, and UbiA, Table S8), (III) the orthologous proteins from the cluster (six proteins, indicated in green in A). The mean percent sequence identity for the orthologous proteins from the cluster is significantly higher (p = 0.001) than either of the other two mean sequence identity values as indicated by an asterisk.
Expression of all predicted pks and nrps genes in S. macrospora during sexual development compared with vegetative growth.
Gene names for which a pan class="Species">N. crassa ortholog is present are given in gray, gene names where no pan class="Species">N. crassa ortholog exists are given in bold black (see also Table S18). All expression data are the results of at least two independent experiments and were determined by quantitative real time PCR. Data for six of the genes (the first six type I pks genes, SMAC_03130 to SMAC_05695) were taken from previous studies [36],[78], expression of the other eight genes was determined in the course of this investigation. The type of encoded protein (type I PKS, type III PKS, PKS/NRPS hybrid, and NRPS) is indicated. The red line indicates two-fold upregulation.
Apart from them being clustered, the two pks genes SMAC_01188 andSMAC_01198 are interesting because they do not have orthologs in pan class="Species">N. crassa or any of the other sequenced Sordariomycete genomes (span>n class="Species">P. anserina, C. globosum, F. graminearum, M. grisea). This is true for most of the genes from the cluster spanning the region from SMAC_01188 to SMAC_01201 (Table S19). With the exception of SMAC_01192 and SMAC_01197, the clustered genes do not have identifiable homologs within the Sordariomycetes, rather their most similar homologs are found within the Eurotiomycetes (Aspergillus, Neosartorya, Penicillium) or Dothideomycetes (Phaeosphaeria). In the center of the cluster, six genes are orthologs to genes from a putative polyketide biosynthesis cluster of Phaeosphaeria nodorum (Figure 8, syn. Stagonospora nodorum, http://www.broadinstitute.org/annotation/genome/stagonospora_nodorum/Home.html
[136]). There are two likely explanations for these findings: (1) the cluster originated through gene duplication in a common ancestor of the Sordariomycetes and Dothideomycetes, and later on, massive gene loss occurred in the Sordariomycetes with the exception of S. macrospora; (2) S. macrospora acquired the cluster through horizontal gene transfer (HGT). To examine these two possibilities, we determined the sequence identity between the S. macrospora cluster proteins and their orthologs in the P. nodorum cluster as well as the sequence identity between all homologous S. macrospora and P. nodorum proteins, and found that the sequence identity between the proteins from the cluster is significantly higher (Figure 8B). This is also the case when looking at the sequence identity of proteins with the same domains as the orthologs in the cluster.
A partly orthologous polyketide biosynthesis cluster in S. macrospora and Phaeosphaeria nodorum.
(A) Comparison of partly orthologous pan class="Chemical">polyketide biosynthesis clusters from span>n class="Species">S. macrospora (scaffold_17, S.m.) and P. nodorum (supercontig 16, P.n., data for P. nodorum are from the Stagonospora nodorum database at http://www.broadinstitute.org/annotation/genome/stagonospora_nodorum/Home.html
[136]). The six genes for which an ortholog is present in both clusters are shown in green, orthology is indicated by gray bars between the genes. Genes for which no orthologs are present in both clusters are given in blue. (B) Percent identity from BLASTP analysis (e-value ≤10−5) from a comparison of S. macrospora proteins versus P. nodorum proteins. Mean values of percent protein identity were calculated for (I) all proteins with a significant hit (e-value ≤10−5, 7424 proteins), (II) all proteins that contain a Pfam domain from one of the five Pfam domain families that are represented within the orthologous proteins from the cluster (137 proteins, the domains are adh_short, FAD_binding_3, p450, PAL, and UbiA, Table S8), (III) the orthologous proteins from the cluster (six proteins, indicated in green in A). The mean percent sequence identity for the orthologous proteins from the cluster is significantly higher (p = 0.001) than either of the other two mean sequence identity values as indicated by an asterisk.
A phylogenetic analysis was performed with the cluster protein SMAC_01196 that encodes a putative phenylalanine ammonia lyase (PAL), a secondPAL protein SMAC_05651 present in pan class="Species">S. macrospora, and the homologs from seven other fungi (Figure 9). As expected, SMAC_05651 groups with the corresponding proteins from the span>n class="Species">Sordariales N. crassa, C. globosum, and P. anserina, each of which encodes only one PAL protein in their genomes. However, the “additional” PAL protein SMAC_01196 from the cluster groups among the Leotiomycetes/Dothideomycetes proteins and is closest to the P. nodorum cluster protein SNOG09914. Phylogenetic analysis of the cluster protein SMAC_01190 that encodes a putative member of the UbiA prenyltransferase family, its two other S. macrospora paralogs, SMAC_02313 and SMAC_06375, and the homologs from eleven other fungi gives a similar picture: SMAC_02313 and SMAC_06375 group within the Sordariales, whereas SMAC_01190 groups with the P. nodorum protein SNOG_09915 within a section of the tree that contains proteins from the Dothideomycetes, Eurotiomycetes, and Leotiomycetes, but not Sordariomycetes (Figure S12).
Figure 9
Phylogenetic analysis of the predicted phenylalanine ammonia lyase (PAL) proteins from eight fungi.
Numbers at branches indicate bootstrap support (10,000 bootstrap replications) in % for the neighbor joining tree, and clade credibilities for the Bayesian tree. Classes given on the right correspond to the taxonomy used by Liu and Hall [171], and in the NCBI Entrez Taxonomy Database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy). Sordariomycete proteins are given in blue with the exception of the S. macrospora protein SMAC_01196 that clusters with the Leotiomycete/Dothideomycete group and is given in red. Sequences for P. anserina were obtained from the Podospora anserina genome project (http://podospora.igmors.u-psud.fr/index.html) and for all other fungi from the Fungal Genome Initiative of the Broad Institute at (http://www.broad.mit.edu/ annotation/fungi/fgi/index.html). AN: Aspergillus nidulans, BC: Botrytis cinerea, CC: Coprinus cinereus (outgroup), CH: Chaetomium globosum, NC: Neurospora crassa, SM: Sordaria macrospora, PA: Podospora anserina, SN: Stagonospora nodorum (Phaeosphaeria nodorum).
Phylogenetic analysis of the predicted phenylalanine ammonia lyase (PAL) proteins from eight fungi.
Numbers at branches indicate bootstrap support (10,000 bootstrap replications) in % for the neighbor joining tree, and clade credibilities for the Bayesian tree. Classes given on the right correspond to the taxonomy used by Liu and Hall [171], and in the NCBI Entrez Taxonomy Database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy). Sordariomycete proteins are given in blue with the exception of the pan class="Species">S. macrospora protein SMAC_01196 that clusters with the Leotiomycete/span>n class="Species">Dothideomycete group and is given in red. Sequences for P. anserina were obtained from the Podospora anserina genome project (http://podospora.igmors.u-psud.fr/index.html) and for all other fungi from the Fungal Genome Initiative of the Broad Institute at (http://www.broad.mit.edu/ annotation/fungi/fgi/index.html). AN: Aspergillus nidulans, BC: Botrytis cinerea, CC: Coprinus cinereus (outgroup), CH: Chaetomium globosum, NC: Neurospora crassa, SM: Sordaria macrospora, PA: Podospora anserina, SN: Stagonospora nodorum (Phaeosphaeria nodorum).
The findings of (1) a conserved cluster of genes with closest homologs from the pan class="Species">Dothideomycete P. nodorum instead of members of the Sordariomycetes, (2) the significantly higher sequence similarity between span>n class="Species">S. macrospora and P. nodorum proteins from the cluster compared to the overall sequence similarity between other proteins from these species, and (3) the phylogenetic positioning of two of the clustered proteins within the Dothideomycetes rather than the Sordariomycetes are more consistent with HGT than with the hypothesis of gene duplication and subsequent gene loss even though the latter cannot be excluded [137]. Recent studies, made possible by the increasing number of fungal genome sequences, have indicated that HGT may be more common in fungi than previously thought, and that genes for secondary metabolism are especially prone to HGT [138]–[141]. Even though in many cases “non-canonic” phylogenetic tree topologies can be explained by a combination of duplication, diversification, and differential gene loss [138],[142], that still leaves a number of cases where a HGT model best fits the observed data [140],[141],[143],[144]. HGT may be one way for fungi to increase their biochemical repertoire, thereby increasing their ability to adapt to new ecological niches [137].
In the case of the pan class="Species">S. macrospora cluster presented here, it is interesting to note that it contains two putative pks genes (SMAC_01188 and SMAC_01198), one of which (SMAC_01198) has acquired 16 frame shifts/stop codons that interrupt the open reading frame whereas the other pks gene SMAC_01188 as well as the additional ten genes that comprise the putative pan class="Chemical">polyketide biosynthesis cluster represent functional genes. For seven of the twelve genes from the cluster (SMAC_01188 to SMAC_01991, SMAC_01194, SMAC_01196 and SMAC_01198), transcriptional expression was verified by cDNA sequencing, and spliced cDNAs were obtained for all of the genes including SMAC_01198 which is unlikely to yield a functional protein due to the frameshifts (data not shown). Thus, this cluster might represent a case of an evolutionary recent acquisition that was introduced into the S. macrospora genome since its divergence from the last common ancestor with N. crassa. While part of the cluster appears to be retained and under purifying selection in S. macrospora, the gene SMAC_01198 has drifted and accumulated frameshift/nonsense mutations, even though it is still transcribed. Further analyses are necessary to determine the function of this putative polyketide biosynthesis cluster in S. macrospora.
Conclusions
Due to their high throughput and low costs, next-genepan class="Species">ration sequencing techniques have greatly changed the way large-scale sequencing projects are done. This includes e.g. re-sequencing of existing genomes for the discovery of variations, “RNA-sequencing” for transcriptome analysis, or “ChIP-Seq” for the genome-wide analysis of DNA-protein interactions [8]. Until recently, de novo genome assembly from next-generation sequences has been restricted to prokaryotic genomes [10],[11]. This is due to the fact that eukaryotic genomes are larger and often contain high amounts of repetitive sequences that cannot be assembled from read lengths that are smaller than the length of the repeats. With the recent release of the Giant Panda genome [13] it has become obvious that even more complex eukaryotic genomes can be sequenced and assembled from short reads. Here, we present a high-quality draft of the S. macrospora genome, assembled solely from next-generation sequences, showing that de novo assembly from Solexa paired-end reads in combination with 454 sequence reads is feasible, cost-effective and fast, at least for compact eukaryotic genomes with few repetitive sequences.
Additionally, the pan class="Species">S. macrospora genome revealed several features that are of interest with respect to fungal evolution, namely its complement of het genes as well as pan class="Chemical">polyketide biosynthesis genes. In the case of the closely linked het-c and pin-c genes, it was found that S. macrospora contains additional copies that might have arisen from inversion/duplication events. In other fungi, the presence of non-identical het alleles within one cytoplasm leads to HI, which in its extreme results in cell death [109],[115]. In contrast, S. macrospora is able to cope with this situation as no obvious HI phenotypes are observed in this fungus. However, we suggest that the aconidial phenotype of S. macrospora may be the result of “cryptic HI” caused by the presence of incompatible het genes within a single genome. Furthermore, analysis of a second het gene locus shows how the analysis of closely related genome sequences can help to pinpoint evolutionary events, in this case the occurrence of an inversion after separation of Sordaria and Neurospora but before speciation of N. crassa and N. tetrasperma. The analysis of predicted polyketide biosynthesis genes showed that S. macrospora contains more pks genes than its close relative N. crassa, and therefore probably has a wider biochemical repertoire available. One putative polyketide biosynthesis cluster might have been acquired through HGT, and this fits with previous results that show that HGT is probably rather widespread in fungi both for the transfer of single genes, clustered genes like polyketide biosynthesis genes, or even larger stretches of DNA up to whole chromosomes as was found in the phytopathogenic fungus Nectria haematococca
[139]–[141],[143],[144]. These findings support the theory that HGT plays a role in fungal evolution and might be a source of genetic variation that allows fungi to adapt to different ecological niches [137].
Materials and Methods
Strains and culture conditions
The sequenced reference strain is pan class="Species">Sordaria macrospora k-hell from the strain collection of the Department of General and Molecular Botany at the Ruhr-Universität Bochum. The strain was grown on cornmeal medium as previously described [38].
DNA preparation for sequencing
Genomic DNA from pan class="Species">S. macrospora was prepared by following a modified previously published method [145]. Mycelium was frozen in liquid span>n class="Chemical">nitrogen, pulverized, and incubated in equal volumes of lysis buffer (0.2 M sodium borate, 30 mM EDTA, 1% SDS, pH 9.0) and phenol at 60°C for 5 min. After centrifugation, the supernatant was treated with RNase, and afterwards with an equal volume phenol/chloroform (1∶1). After centrifugation, genomic DNA was purified from the supernatant by cesium chloride density gradient centrifugation.
Illumina/Solexa sequencing by synthesis
To construct libraries of two different insert sizes, 5 µg DNA each were sonicated with a Branson sonicator. Sonicated DNA was sepapan class="Species">rated through 2% NuSieve span>n class="Chemical">agarose gels and fragments of ∼300 and ∼500 bp were purified. After generation of blunt-end fragments, A-overhangs were added, adaptors ligated, and the fragments were PCR amplified [146]. The resulting libraries were sequenced on an Illumina Genome Analyzer with a paired-end module generating reads of 36 bases. Four lanes from the 300 bp library and three lanes from the 500 bp library resulted in 3.4 Gb of sequence data (Table 1, Figure S1).
Roche/454 pyrosequencing
Roche/454 sequencing was performed with 50 µg genomic DNA at Eurofins MWG GmbH (Ebersberg, Gerpan class="Species">many). This resulted in 415 Mb of sequence data with an average read length of 367 bp (Table 1). The 454 raw data were extracted from the sff file and converted to a fasta file using sff_extract.py (written by Jose Blanca and Bastien Chevreux, http://bioinf.comav.upv.es/sff_extract/index.html).
Assembly
Assembly of the Solexa reads only as well as the combined Solexa and 454 reads was carried out with the Velvet assembler [42]. A description of the parameters used with Velvet can be found in Text S1 and Figure S1. An assembly of only the 454 data with the Celera Assembler 5.3 was performed by Eurofins MWG GmbH (Ebersberg, Gerpan class="Species">many). Compaspan>rison of the span>n class="Species">S. macrospora genome with the N. crassa genome [43] was done with BLAST [147] and visualized with Combo [148]. Comparative assembly of the S. macrospora genome along the genome sequences of N. crassa, N. discreta (http://genome.jgi-psf.org/Neudi1/Neudi1.home.html) and N. tetrasperma (http://genome.jgi-psf.org/Neute1/Neute1.home.html) genomes was done with Mercator [44]. Assembly of the mitochondrial genome and the rDNA unit was done with CodonCode Aligner version 3.0.3 (http://www.codoncode.com/aligner/), details can be found in Text S1 and Figure S3.
Annotation
Gene models were predicted independently with the ab initio predictors AUGUSTUS, GeneMark+ES, SNAP, and the evidence-based predictor Genewise [149]–[153]. The ab initio SNAP and AUGUSTUS parameters were trained on all pan class="Species">N. crassa gene models while GeneMark performs an itepan class="Species">rative self-training procedure. The Genewise predictions were generated from N. crassa proteins aligned to the genome by first aligning the proteins with TBLASTN, choosing the S. macrospora locus with only the best alignment for each protein and then refining the alignment and splice-sites with Genewise. The processing of outputs from these tools was completed with custom scripts utilizing tools from the BioPerl toolkit [154]. The resulting GFF annotation from each of the prediction programs was used as input to Evigan, a program that integrates the four sources of gene evidence [45].
For each of the predicted proteins, the protein with the highest sequence identity in GenBank was determined using BLASTP [147] (Table S2). Additionally, putative domains were predicted with the HMMER (version 2.3.2) program hmmpfam using the hidden Markov models from the pfam database [49],[50] and with the InterPpan class="Chemical">roScan function from Blast2GO [51],[52]. The resulting data can be found in Table S8. Putative localization of the predicted proteins was determined with WoLF PSORT [155], putative signal peptides and signal anchors were predicted with SignalP 3.0 [156], and transmembrane domains with HMMTOP [157] and TMHMM [158] (Table S2, Text S2). tRNAs were predicted using a combination of Infernal 1.0, tRNAscan-SE, and TFAM 1.0 [159]–[161]. Orthologous groups of genes among the five fungal species span>n class="Species">S. macrospora, N. crassa
[43], N. discreta (http://genome.jgi-psf.org/Neudi1/Neudi1.home.html), P. anserina
[102], and C. globosum (http://www.broadinstitute.org/annotation/genome/chaetomium_globosum, Chaetomium globosum Sequencing Project, Broad Institute of Harvard and MIT http://www.broad.mit.edu) were identified with OrthoMCL [53]. Searches for transposons and repeat elements were done with BLAST [147] and by searches in Repbase (http://www.girinst.org/) [59].
For comparison of different genomic regions (pan class="Chemical">CDSs, introns and upstream regions) from span>n class="Species">S. macrospora, N. crassa, N. discreta and N. tetrasperma, a Mercator alignment [44] of the genome sequences was performed and the parts of the alignment corresponding to the genomic regions were used to compute pairwise identities and evolutionary distances. Only those upstream regions were used that do not overlap with a protein coding region, and each region was used only once even if it is upstream of two divergently transcribed genes to avoid double-counting.
Accession numbers
The sequence and annotation data are available under the accession numbers CABT01000001-CABT01004783. The sequence reads that were used for the assembly of the pan class="Species">S. macrospora genome were submitted to the NCBI sequence read archive (accession number SRA010462).
RNA preparation and expression analysis
For comparison of vegetative growth versus sexual development, growth and harvesting of pan class="Species">S. macrospora and span>n class="Species">N. crassa, RNA preparation, reverse transcription and quantitative real time PCR were as described previously [19],[162].
Phylogenetic analysis
Multiple alignments were created in CLUSTALX [163] and trimmed with Jalview [164], and the same alignment was used for analysis by distance-matrix (DM), maximum parsimony (MP) or Bayesian methods. Phylogenetic analyses were made with PAUP version 4.0b10 for Windows (D.L. Swofford, distributed by Sinauer Associates, copyright 2001 Smithsonian Institution) for DM and MP analyses, and with MrBayes [165],[166]. DM and MP analyses were performed as described using 10,000 bootstrap replicates, Bayesian analysis was performed with at least 250,000 genepan class="Species">rations [167]. Consensus trees were graphically displayed with TREEVIEW or Dendspan>n class="Chemical">roscope [168],[169].
Next-genepan class="Species">ration sequencing of the pan class="Species">S. macrospora genome. (A) Summary of the Illumina/Solexa and 454 sequences that were obtained. (B) Maximum contig lengths (N max) and N50 values for assemblies with different k-values (hash length in Velvet) for the Illumina/Solexa data alone or in combination with the 454 data. The combined assembly with the highest N max and N50 value (k = 25) was used for further analyses.
(0.40 MB TIF)Click here for additional data file.Assemblies with different coverage levels of short reads. Assemblies were done with Velvet 0.7.56 with k = 25. N50, maximum contig length (Nmax), the number of gaps that Velvet introduced within contigs and the total length of gaps within the assembly are given for different combinations of 454 coverage (x-axis), coverage from a 300 bp Solexa paired-end library (25× coverage in (A–D), 50× coverage in (E–H), respectively), and coverage from a 500 bp Solexa paired-end library (0×, 9×, 18×, and 27× coverage as color-coded in the different panels). Addition of 454 reads has the most drastic effect on the number and length of gaps (note the logarithmic y-axis for these panels) whereas addition of paired-end reads influences mostly N50 and Nmax. A table with assembly information for additional coverage combinations can be found in Table S1.(1.07 MB TIF)Click here for additional data file.The mitochondrial genome of pan class="Species">S. macrospora. (A) Schematic map of the mitochondrial genome. Size 88,423 bp, scale in kb indicated on the inner circle. Blue: ribosomal RNAs, green: tRNAs, red: protein coding genes, orange on the inner circle: open reading frames within introns of protein coding genes. Note that all predicted genes are encoded on the same strand. (B, C) Compaspan>pan class="Species">rative analysis of the mitochondrial DNA of S. macrospora with N. crassa (B) and P. aserina (C). Dot plots of BLASTN analysis that was done at http://blast.ncbi.nlm.nih.gov/Blast.cgi with an e-value cutoff of 10−20.
(0.35 MB TIF)Click here for additional data file.Histograms of % pairwise identity between pan class="Species">S. macrospora and span>n class="Species">N. crassa for different genomic regions. CDSs, introns, and regions upstream of CDSs (in 1 kb steps ranging from 1 to 4 kb) were used for comparison. Only those upstream regions were used that do not overlap with a protein coding region. Each region was used only once even if it is upstream of two divergently transcribed genes to avoid double-counting. Detailed information on the comparisons can be found in Table S5.
(0.65 MB TIF)Click here for additional data file.Phylogenetic analysis of orthogroups 49 (A) and 180 (B) and related orthogroups. With the exception of orthogroup 6050, which is split in two parts, all orthogroups that were found by OrthoMCL are supported by phylogenetic analysis. Most orthogroups contain one member each in pan class="Species">S. macrospora, span>n class="Species">N. crassa, and N. discreta, but orthogroups 49 and 180 contain six and four members, respectively, from S. macrospora. Thus, both the OrthoMCL analysis as well as the phylogenetic trees constructed with maximum parsimony support the hypothesis that orthogroups 49 and 180 are part of larger gene families, but that in these branches of the gene families, recent gene duplication events occured specifically in S. macrospora. Numbers at branches indicate bootstrap support (10,000 bootstrap replications) in % for maximum parsimony trees. SM: S. macrospora, NC: N. crassa, ND: N. discreta, CHG: C. globosum, PA: P. anserina.
(0.72 MB TIF)Click here for additional data file.Regions of high similarity within four fungal genomes. Each genome sequence was compared to itself with BLASTN with e-value <10−150. Dot plot visualization was done with Combo (Engels et al. 2006, Bioinformatics 22: 1782–1783). The pan class="Species">M. grisea genome contains a high amount of repeated DNA (Dean et al. 2005, Nature 434 :980–986), and this is reflected in this compaspan>rison. The genomes of pan class="Species">N. crassa (Galagan et al. 2003, Nature 422: 859–868) and F. graminearum (Cuomo et al. 2007, Science 317: 1400–1402) contain only few repeat regions. The intragenomic similarities within the S. macrospora genome range between those for N. crassa and F. graminearum.
(0.87 MB TIF)Click here for additional data file.Comparison of RIP indices in the S.macpan class="Chemical">rospora and span>n class="Species">N. crassa genomes. (A) No evidence for large regions with RIP in the S. macrospora genome. The substrate ([CA+TG]/[AC+GT]; orange) and product (TpA/ApT; dark green) RIP indices were calculated for all unscaffolded contigs (nt 1 to 2,865,981), the mtDNA (nt 2,865,982 to 2,954,404) and random scaffolds (nt 2,954,405 to 5,000,000). The patterns for the remainder of the S. macrospora genome look similar to those shown here for the random scaffolds. We predicted that the unscaffolded contigs would be AT-rich and would show hallmarks of RIP. Instead, these contigs are GC-rich and show now evidence for RIP by this assay. The mtDNA (black arrow) has balanced AT and GC content, more resembling bacterial DNA and thus has a different pattern than S. macrospora nuclear DNA. (B) Evidence for RIP in N. crassa Linkage Group I. The first 5 Mb of LGI of N. crassa were analyzed as above. High values for the product RIP index (green), coupled with low values for the substrate RIP index (orange) reveal dispersed (blue arrows) and centromeric (red arrowhead) regions that have been subjected to RIP.
(0.57 MB TIF)Click here for additional data file.Perithecial neck phototropisms in response to unilateral light of different wavelengths. (A–C) Positive neck phototropisms in response to white light (A, fluence pan class="Species">rate 3.4 µM/m2*s), blue light (B, wavelength: 470 nm; fluence span>n class="Species">rate 6.3 µM/m2*s), and green light (C, wavelength: 530 nm; fluence rate 3.4 µM/m2*s). (D, E) No neck phototropisms under red light (D, wavelength: 680 nm; fluence rate 5.4 µM/m2*s) or complete darkness (E). The arrows indicate the direction of neck tropisms.
(1.08 MB TIF)Click here for additional data file.Expression of pan class="Species">S. macrospora genes that are orthologs of genes involved in conidiation in pan class="Species">N. crassa. Transcript levels were compared between sexual development and vegetative growth. Expression data are the results of two independent experiments and were determined by quantitative real time PCR. The red dashed line indicates two-fold upregulation. In N. crassa, the corresponding orthologs are regulators of conidiation (fluffy, csp-1, rco-1, rco-3) or encode structural proteins (ccg-2) or enzymes (al-1) that are important for conidiospore morphology. All six S. macrospora orthologs are transcribed both during vegetative growth and sexual development, and several are upregulated during sexual development. The strongest upregulation is observed in the ccg-2 ortholog SMAC_00022. In N. crassa, ccg-2 encodes a hydrophobin that forms the hydrophobic coat (rodlet layer) of the conidial cell wall (Bell-Pedersen et al. 1992 Genes Dev 6: 2382–2394). Generally, fungal hydrophobins are expressed when hyphae encounter an air/water interface (Wösten 1991 Annu Rev Microbiol 2001. 55:625–646), and this might be the reason why the S. macrospora ccg-2 ortholog is only weakly expressed in the submerged culture used to obtain vegetative mycelium but strongly upregulated under conditions for sexual development, i.e. during growth as a surface culture. The function of these genes in the aconidial S. macrospora is unknown. Several of the N. crassa orthologs have functions outside of conidiation, e.g. the putative transcriptional repressor rco-1 (Yamashiro et al. 1996 Mol Cell Biol 16: 6218–6228) or the glucose transporter rco-3 (Madi et al. 1997 Genetics 146: 499–508), but others like the transcription factor-encoding genes fluffy (Bailey and Ebbole 1998 Genetics 148: 1813–1820) and csp-1 (Lambreghts et al. 2009 Genetics 181: 767–781) are specific to conidiation and their function in S. macrospora remains to be elucidated.
(0.21 MB TIF)Click here for additional data file.The het-6/un-24 locus from pan class="Species">S. macrospora is syntenic to the OR allele of pan class="Species">N. crassa. (A) Region from S. macrospora scaffold 5 and N. crassa scaffold 8 containing het-6 and un-24 genes. Homologous genes are given in the same color. The two different allelic combinations of het-6 and un-24 in N. crassa, Oak Ridge (OR) and Panama (PA), are indicated. The S. macrospora gene order resembles that of the Oak Ridge strain. SMAC_07776 contains three stop codons within the open reading frame (indicated by asterisks above the gene) and is therefore probably a pseudogene or it encodes a shorter HET-6. (B) Phylogenetic analysis of partial HET-6 and UN-24 proteins from S. macrospora, N. crassa (NC) and N. tetrasperma (NT). For N. tetrasperma, OR alleles were taken from strain P514, PA alleles from strain P2361 (Powell et al., Fungal Genet Biol 2007, 44: 896–904). The homologous P. anserina proteins were used as an outgroup to root the trees. Bootstrap values in % (10,000 bootstrap replicates) are given for maximum parismony and neighbor joining trees above and below the branches, respectively. OR and PA alleles from the two different Neurospora species cluster together whereas S. macrospora is basal to the Neurospora proteins indicating that the Oak Ridge gene order is probably ancient and the Panama gene order has arisen from an inversion after separation of the genus Neurospora from the genus Sordaria and before speciation of N. crasssa and N. tetrasperma.
(0.18 MB TIF)Click here for additional data file.Summary of all proteins in pan class="Species">S. macrospora that are predicted to be PKSs or NRPSs.
(0.44 MB TIF)Click here for additional data file.Phylogenetic analysis of the UbiA prenyltransferase family proteins from 12 fungi. The pan class="Species">S. macrospora protein SMAC_01190 clusters in a group of “outsider” proteins with not quite clear phylogenetic resolution (bootstrap support of only 60 % as indicated in a red circle), most likely due to gene family exSpecies">pansion in span>n class="Species">Aspergillus/Neosartorya/Stagonospora/Botrytis/Sclerotinia. However, as it clusters with SNOG_09915, this might indicate horizontal gene transfer. Numbers at braches indicate bootstrap support (10,000 bootstrap replications) in % for the neighbor joining tree. Sordariales proteins are given in blue with the exception of the S. macrospora protein SMAC_01190 that clusters with the DothideomyceteStagonospora nodorum and is given in red. Sequences that belong to the protoheme farnesyl transferase group are shaded in dark gray, sequences that belong to the polyprenyl transferase group are shaded in light gray. Sequences for P. anserina were obtained from the Podospora anserina genome project (http://podospora.igmors.u-psud.fr/index.html) and for all other fungi from the Fungal Genome Initiative of the Broad Institute at (http://www.broad.mit.edu/ annotation/fungi/fgi/index.html) or from our own data (S. macrospora). AN: Aspergillus nidulans, BC: Botrytis cinerea, CH: Chaetomium globosum, FG: Fusarium graminearum, MG: Magnaporthe grisea, NC: Neurospora crassa, NF: Neosartorya fischeri, PA: Podospora anserina, SM: Sordaria macrospora, SN: Stagonospora nodorum (Phaeosphaeria nodorum), SS: Sclerotinia sclerotiorum, Y: Saccharomyces cerevisiae.
(0.33 MB TIF)Click here for additional data file.Assemblies with different coverage levels.(0.19 MB PDF)Click here for additional data file.Overview of predicted pan class="Species">S. macrospora genes.
(7.09 MB XLS)Click here for additional data file.Orthologs between pan class="Species">S. macrospora and pan class="Species">N. crassa.
(1.02 MB XLS)Click here for additional data file.Orphan genes.(0.37 MB XLS)Click here for additional data file.Comparisons of different genomic regions.(4.61 MB XLS)Click here for additional data file.hmmpfam predictions.(1.22 MB XLS)Click here for additional data file.OrthoMCL family species count.(1.18 MB XLS)Click here for additional data file.OrthoMCL orthogroups.(1.66 MB XLS)Click here for additional data file.Genome integrity genes.(0.35 MB XLS)Click here for additional data file.Homologs to genes involved in light signaling and regulation.(0.05 MB PDF)Click here for additional data file.Homologs to genes involved in senescence.(0.07 MB PDF)Click here for additional data file.Genes with putative functions in pan class="Chemical">calcium signaling.
(0.07 MB PDF)Click here for additional data file.Genes with putative functions as motor proteins.(0.07 MB PDF)Click here for additional data file.Genes with putative functions in MAP kinase andpan class="Chemical">phospholipid signaling.
(0.09 MB PDF)Click here for additional data file.Known and putative meiosis genes.(0.09 MB XLS)Click here for additional data file.pan class="Species">S. macrospora homologs of conidiation-related genes from different ascomycetes.
(0.07 MB PDF)Click here for additional data file.pan class="Species">S. macrospora homologs of genes involved in heterokaryon incompatibility from different ascomycetes.
(0.06 MB PDF)Click here for additional data file.Predicted pan class="Chemical">polyketide synthases (PKS), non-ribosomal peptide synthases (NRPS), and fatty acid synthases (FAS).
(0.06 MB PDF)Click here for additional data file.A putative pan class="Chemical">polyketide biosynthesis cluster that is partly conserved in span>n class="Species">S. macrospora and Phaeosphaeria nodorum.
(0.06 MB PDF)Click here for additional data file.pan class="Species">Parameters used for the assembly of the pan class="Species">S. macrospora genome.
(0.05 MB PDF)Click here for additional data file.Predicted subcellular localizations, signal peptides, and transmembrane domains for the pan class="Species">S. macrospora proteins.
(0.08 MB PDF)Click here for additional data file.
Authors: Ralph A Dean; Nicholas J Talbot; Daniel J Ebbole; Mark L Farman; Thomas K Mitchell; Marc J Orbach; Michael Thon; Resham Kulkarni; Jin-Rong Xu; Huaqin Pan; Nick D Read; Yong-Hwan Lee; Ignazio Carbone; Doug Brown; Yeon Yee Oh; Nicole Donofrio; Jun Seop Jeong; Darren M Soanes; Slavica Djonovic; Elena Kolomiets; Cathryn Rehmeyer; Weixi Li; Michael Harding; Soonok Kim; Marc-Henri Lebrun; Heidi Bohnert; Sean Coughlan; Jonathan Butler; Sarah Calvo; Li-Jun Ma; Robert Nicol; Seth Purcell; Chad Nusbaum; James E Galagan; Bruce W Birren Journal: Nature Date: 2005-04-21 Impact factor: 49.962
Authors: David J Jacobson; Jeremy R Dettman; Rachel I Adams; Cornelia Boesl; Shahana Sultana; Till Roenneberg; Martha Merrow; Margarida Duarte; Isabel Marques; Alexandra Ushakova; Patrícia Carneiro; Arnaldo Videira; Laura Navarro-Sampedro; María Olmedo; Luis M Corrochano; John W Taylor Journal: Mycologia Date: 2006 Jul-Aug Impact factor: 2.696
Authors: Andrew M Waterhouse; James B Procter; David M A Martin; Michèle Clamp; Geoffrey J Barton Journal: Bioinformatics Date: 2009-01-16 Impact factor: 6.937
Authors: Kristina M Smith; Jonathan M Galazka; Pallavi A Phatale; Lanelle R Connolly; Michael Freitag Journal: Chromosome Res Date: 2012-07 Impact factor: 5.239
Authors: Wolfgang Harreither; Christoph Sygmund; Manfred Augustin; Melanie Narciso; Mikhail L Rabinovich; Lo Gorton; Dietmar Haltrich; Roland Ludwig Journal: Appl Environ Microbiol Date: 2011-01-07 Impact factor: 4.792
Authors: Sante Gnerre; Iain Maccallum; Dariusz Przybylski; Filipe J Ribeiro; Joshua N Burton; Bruce J Walker; Ted Sharpe; Giles Hall; Terrance P Shea; Sean Sykes; Aaron M Berlin; Daniel Aird; Maura Costello; Riza Daza; Louise Williams; Robert Nicol; Andreas Gnirke; Chad Nusbaum; Eric S Lander; David B Jaffe Journal: Proc Natl Acad Sci U S A Date: 2010-12-27 Impact factor: 11.205