Minou Nowrousian, Jason E Stajich, Meiling Chu, Ines Engh, Eric Espagne, Karen Halliday, Jens Kamerewerd, Frank Kempken, Birgit Knab, Hsiao-Che Kuo, Heinz D Osiewacz, Stefanie Pöggeler, Nick D Read, Stephan Seiler, Kristina M Smith, Denise Zickler, Ulrich Kück, Michael Freitag.
Abstract
Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30-90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in approximately 4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group.
Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative studies to address basic questions of fungal biology.
Year: 2010 PMID: 20386741 PMCID: PMC2851567 DOI: 10.1371/journal.pgen.1000891
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Figure 1. S. macrospora as a model organism for the analysis of meiosis and fruiting body development.
(A) Segregation of the ascospore-color mutant pam2 from a cross of wild type (black ascospores) by pam2 (yellow ascospores). Arrow points to a gene conversion indicated by two black and six yellow ascospores. (B–D) Meiotic prophase. Chromosome axes are stained by the cohesin-associated Spo76/Pds5 protein tagged with GFP. (B) Prophase nucleus of a spo11 null mutant: the 14 chromosomes do not align or synapse and this asynaptic status is seen from leptotene through pachytene. (C, D) Pachytene nucleus from wild-type Sordaria: the seven bivalents are differentiated by their size (D). Chromosome 2 (yellow), which bears the nucleolar organizing region, is attached to the nucleolus (nu). (E) The seven bivalents at late diplotene, stained by DAPI. Note the difference in size when compared to the pachytene nucleus. Bar (B–E) = 5 µm. (F) An EGFP-HEX1 fusion protein localizes to Woronin bodies. Bar = 10 µm. (G) The GFP-tagged developmental protein PRO41 localizes to the endoplasmic reticulum. Plasma membrane stained with FM4-64. Bar = 10 µm. (H, I) In a young protoperithecium (H), the GFP-tagged developmentally induced protein APP accumulates (I). Bar = 20 µm.
Main features of primary sequence data.
| primary sequence data | Solexa | 454 | Solexa +454 |
| no. of reads that were obtained | 95,153,934 | 1,103,372 | 96,261,736 |
| read length | 36 nt | 367 nt | n.a. |
| total length of all sequence reads | 3,426 Mb | 415 Mb | 3,879 Mb |
1 average read length.
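The fold-coverage figures quoted in the abstract can be roughly reproduced from the totals in this table. The following is a minimal sketch, not the authors' calculation: it assumes coverage is simply total read length divided by genome size, and it uses the 39.9 Mb length of the Solexa+454 assembly as a stand-in for the genome size. The function name `fold_coverage` is hypothetical.

```python
def fold_coverage(total_read_mb: float, genome_mb: float) -> float:
    """Return average fold coverage: total sequenced bases / genome size."""
    if genome_mb <= 0:
        raise ValueError("genome size must be positive")
    return total_read_mb / genome_mb


# Assumption: the 39.9 Mb Solexa+454 assembly length approximates the genome size.
GENOME_MB = 39.9

# Total read lengths (Mb) as reported in the table above.
solexa_cov = fold_coverage(3426, GENOME_MB)
roche454_cov = fold_coverage(415, GENOME_MB)

print(f"Solexa: {solexa_cov:.1f}x, 454: {roche454_cov:.1f}x")
```

Under these assumptions the estimates come out close to the 85-fold and 10-fold coverage stated in the abstract.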
Main features of S. macrospora genome assemblies from Solexa reads, 454 reads, a combination of both, and after comparative assembly with the N. crassa genome.
| assembled genome | Solexa | 454 | Solexa +454 | comp. assembly |
| N50 value of assembly | 51 kb | 11 kb | 117 kb | 498 kb |
| maximum contig/scaffold length | 267 kb | 64 kb | 991 kb | 2.5 Mb |
| total length of assembly | 38.7 Mb | 42.1 Mb | 39.9 Mb | 39.9 Mb |
| no. of contigs/scaffolds | 3,344 | 14,123 | 5,097 | 4,781 |
| % of assembly in contigs >0.5 kb | 99.1 | 95.6 | 98.1 | 98.1 |
| % of assembly in contigs >10 kb | 92.8 | 52.5 | 92.3 | 93.1 |
| no. of gaps within contigs | 17,956 | 1,681 | 624 | 933 |
| mean length of gaps | 478 nt | 1 nt | 21 nt | 150 nt |
1 The N50 is defined as the length for which 50% of all bases in the assembly are in a contig of at least that length. In other words, 50% of the assembly is contained in contigs of at least the N50 length.
2 The higher number and greater length of gaps in the comparative assembly compared to the Solexa+454 assembly stems from the introduction of gaps while joining contigs to scaffolds.
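The N50 definition in footnote 1 can be illustrated with a short sketch. This is a generic implementation of that definition, not the code used in the study; the function name `n50` and the toy contig lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # at least 50% of all bases covered
            return length


# Hypothetical toy assembly (arbitrary units), not data from the study.
toy_contigs = [8, 8, 4, 3, 3, 2, 2, 2]
print(n50(toy_contigs))
```

In the toy example the two longest contigs already hold half of the 32 total units, so the N50 equals the length of the second contig.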
Figure 2. Synteny between the genomes of S. macrospora and N. crassa.
(A) Synteny of contigs from the S. macrospora genome with the N. crassa genome before scaffolding along the N. crassa chromosomes. Dot plot of a comparison of the five largest contigs from the Velvet assembly (contigs 3467, 1588, 19727, 3369, and 12432; length given on the y-axis in descending order; total size of the five contigs 3.4 Mb; note that the Velvet contig numbers do not correspond to the contigs of the final assembly) against the Neurospora linkage groups (supercontigs I to VII in finished genome sequence, http://www.broadinstitute.org/annotation/genome/neurospora/Regions.html). The linkage group numbers for N. crassa are given above the dot plot. (B) Dot plot of a comparison of the S. macrospora scaffolds which cover 93% of the genomic sequence against the N. crassa supercontigs corresponding to linkage groups I to VII from the finished genome sequence. Comparisons for both analyses were done with BLASTN with e-value <10^-150. Dot plot visualization was done with Combo [148].
Main features of the S. macrospora genome sequence.
| Size of the final assembly | 39.8 Mb |
| chromosomes | 7 |
| GC percentage (total genome) | 52.4 |
| GC percentage in coding regions | 56.5 |
| GC percentage in non-coding regions | 49.8 |
| tRNA genes | 455 |
| protein coding genes (CDSs) | 10,789 |
| percent coding | 38.4 |
| average CDS size (min/max) | 1,423 bp (54 bp/33,321 bp) |
Figure 3. Pairwise identity between S. macrospora and N. crassa for different genomic regions.
CDSs, introns, and regions upstream of CDSs (in 1 kb steps ranging from 1 to 4 kb) were used for comparison. Only those upstream regions were used that do not overlap with a protein coding region. Each region was used only once even if it is upstream of two divergently transcribed genes to avoid double-counting. The box plots show the distribution of % pairwise identities with the median value as a horizontal line in the box between the first and third quartiles. Detailed information on the comparisons can be found in Figure S4 and Table S5.
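The box plot summary described in this caption can be sketched as follows. This is an illustration only: it assumes each region pair is already aligned as two equal-length strings with `-` for gaps, and that a gap opposite a base counts as a mismatch (the source does not state how gaps were treated). The function names and toy alignments are hypothetical.

```python
from statistics import quantiles


def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent of aligned columns with identical bases.

    Columns where both sequences have a gap are ignored; a gap opposite
    a base is counted as a mismatch (an assumption of this sketch).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aln_a.upper(), aln_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("alignment has no informative columns")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def box_stats(values):
    """Return (first quartile, median, third quartile) for a box plot."""
    q1, median, q3 = quantiles(values, n=4)
    return q1, median, q3


# Hypothetical toy alignments, not sequences from the study.
pairs = [
    ("ACGTACGT", "ACGTTCGT"),
    ("AC-T", "ACGT"),
    ("GGCC", "GGCC"),
    ("ATAT", "ATGC"),
]
identities = [percent_identity(a, b) for a, b in pairs]
q1, median, q3 = box_stats(identities)
print(identities, q1, median, q3)
```

Quartile conventions differ between tools; `statistics.quantiles` uses its default "exclusive" method here, which may not match the plotting software used for the figure.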
Figure 4. Phylogenetic analysis and expression of genes from different orthogroups from an OrthoMCL analysis of S. macrospora (SM), N. crassa (NC), N. discreta (ND), C. globosum (CHG), and P. anserina (PA).
(A) Species phylogeny with six concatenated genes that are single-copy orthologs in each of the five species. (B–E) Phylogenetic trees with five different orthogroups. Outgroups for the trees were homologs from either Nectria haematococca, Aspergillus fumigatus, Penicillium chrysogenum, or Pyrenophora tritici-repentis. Numbers at branches indicate bootstrap support (10,000 bootstrap replications) in % for neighbor joining trees. (D–E) Expression of the S. macrospora and N. crassa genes from orthogroups 49 and 180 during sexual development compared to vegetative growth. Expression data are the results of two independent experiments and were determined by quantitative real time PCR. The red and green dashed lines indicate two-fold up- and downregulation, respectively. n.e., no expression was detected during vegetative growth or sexual development.
Repeated sequences and transposons in the S. macrospora genome.
| class | superfamily | TSD | name | copies | ORFs |
| I | LINE | ? | SmLINE1 | 5 | |
| I | gypsy | ? | Sinti1 | 5 | |
| I | gypsy | ? | Sinti2 | 15 | |
| II | hAT | ? | Scarce | 2 | |
| II | Fot1 | ? | — | 1 | |
| ? | — | 5 bp | Smini1 | 60 | — |
| ? | — | — | Smini2 | 34 | — |
| ? | — | 5 bp | Smini3 | 80 | — |
| ? | — | — | Smini4 | 74 | — |
| ? | — | — | Smini5 | 14 | — |
TSD: target site duplication present at least for some elements.
1 These elements show a very high degree of sequence variation; in addition Repbase analysis indicates additional DNA sequences with similarities to gypsy-like sequences.
2 Both elements exhibit partial sequence similarities.
3 Elements often inside ORFs or overlapping with ORFs.
4 Elements with at least 80% sequence similarity.
Figure 5. The het-c/pin-c locus of S. macrospora contains additional copies of putative heterokaryon incompatibility genes.
(A) Region from S. macrospora scaffold 98 and N. crassa contig 8 containing het-c and pin-c genes. A syntenic region containing the N. crassa het-c and pin-c genes and the orthologous region in S. macrospora is shaded in blue. In S. macrospora, this region is bordered by additional copies of pin-c and a partial het-c (left) and a TOL-related protein encoding gene (right). The tol-related gene SMAC_07228 contains an internal stop codon within the open reading frame (indicated by an asterisk) and therefore encodes a shortened TOL-related protein or is a pseudogene. (B) Phylogenetic tree of PIN-C and TOL-related proteins from the genomic region shown in (A). For N. crassa, three allelic variations of PIN-C (PIN-C1, PIN-C2, and PIN-C3) were used for tree construction. The PIN-C1 protein from Pyrenophora tritici-repentis was used as an outgroup to root the tree. Maximum parsimony and neighbor joining trees were calculated with 10,000 bootstrap replications each. The phylogenetic tree separates the PIN-C and TOL-related proteins; however, it is not conclusive with respect to the putative ancestral state of the PIN-C alleles.
Figure 6. Model for the action of heterokaryon incompatibility.
Incompatibility in two incompatible strains of N. crassa (A) and in a single strain of S. macrospora (B). The VIB transcription factor regulates the expression of HET-domain genes tol, het-6, and pin-c [170]. The het-6 gene of S. macrospora is mutated (m) and the second het-c gene (het-c2) is incomplete.
Figure 7. Expression of all predicted pks and nrps genes in S. macrospora during sexual development compared with vegetative growth.
Gene names for which an N. crassa ortholog is present are given in gray; gene names where no N. crassa ortholog exists are given in bold black (see also Table S18). All expression data are the results of at least two independent experiments and were determined by quantitative real time PCR. Data for six of the genes (the first six type I pks genes, SMAC_03130 to SMAC_05695) were taken from previous studies [36],[78]; expression of the other eight genes was determined in the course of this investigation. The type of encoded protein (type I PKS, type III PKS, PKS/NRPS hybrid, and NRPS) is indicated. The red line indicates two-fold upregulation.
Figure 8. A partly orthologous polyketide biosynthesis cluster in S. macrospora and Phaeosphaeria nodorum.
(A) Comparison of partly orthologous polyketide biosynthesis clusters from S. macrospora (scaffold_17, S.m.) and P. nodorum (supercontig 16, P.n.; data for P. nodorum are from the Stagonospora nodorum database at http://www.broadinstitute.org/annotation/genome/stagonospora_nodorum/Home.html [136]). The six genes for which an ortholog is present in both clusters are shown in green; orthology is indicated by gray bars between the genes. Genes for which no orthologs are present in both clusters are given in blue. (B) Percent identity from BLASTP analysis (e-value ≤10^-5) from a comparison of S. macrospora proteins versus P. nodorum proteins. Mean values of percent protein identity were calculated for (I) all proteins with a significant hit (e-value ≤10^-5, 7424 proteins), (II) all proteins that contain a Pfam domain from one of the five Pfam domain families that are represented within the orthologous proteins from the cluster (137 proteins; the domains are adh_short, FAD_binding_3, p450, PAL, and UbiA, Table S8), and (III) the orthologous proteins from the cluster (six proteins, indicated in green in A). The mean percent sequence identity for the orthologous proteins from the cluster is significantly higher (p = 0.001) than either of the other two mean sequence identity values, as indicated by an asterisk.
Figure 9. Phylogenetic analysis of the predicted phenylalanine ammonia lyase (PAL) proteins from eight fungi.
Numbers at branches indicate bootstrap support (10,000 bootstrap replications) in % for the neighbor joining tree, and clade credibilities for the Bayesian tree. Classes given on the right correspond to the taxonomy used by Liu and Hall [171], and in the NCBI Entrez Taxonomy Database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy). Sordariomycete proteins are given in blue with the exception of the S. macrospora protein SMAC_01196 that clusters with the Leotiomycete/Dothideomycete group and is given in red. Sequences for P. anserina were obtained from the Podospora anserina genome project (http://podospora.igmors.u-psud.fr/index.html) and for all other fungi from the Fungal Genome Initiative of the Broad Institute (http://www.broad.mit.edu/annotation/fungi/fgi/index.html). AN: Aspergillus nidulans, BC: Botrytis cinerea, CC: Coprinus cinereus (outgroup), CH: Chaetomium globosum, NC: Neurospora crassa, SM: Sordaria macrospora, PA: Podospora anserina, SN: Stagonospora nodorum (Phaeosphaeria nodorum).