
Building de novo reference genome assemblies of complex eukaryotic microorganisms from single nuclei.

Merce Montoliu-Nerin¹, Marisol Sánchez-García¹, Claudia Bergin², Manfred Grabherr³, Barbara Ellis¹, Verena Esther Kutschera⁴, Marcin Kierczak³, Hanna Johannesson⁵, Anna Rosling⁶.

Abstract

The advent of novel sequencing techniques has unraveled a tremendous diversity on Earth. Genomic data allow us to understand ecology and function of organisms that we would not otherwise know existed. However, major methodological challenges remain, in particular for multicellular organisms with large genomes. Arbuscular mycorrhizal (AM) fungi are important plant symbionts with cryptic and complex multicellular life cycles, thus representing a suitable model system for method development. Here, we report a novel method for large scale, unbiased nuclear sorting, sequencing, and de novo assembling of AM fungal genomes. After comparative analyses of three assembly workflows we discuss how sequence data from single nuclei can best be used for different downstream analyses such as phylogenomics and comparative genomics of single nuclei. Based on analysis of completeness, we conclude that comprehensive de novo genome assemblies can be produced from six to seven nuclei. The method is highly applicable for a broad range of taxa, and will greatly improve our ability to study multicellular eukaryotes with complex life cycles.



Year: 2020 PMID: 31992756 PMCID: PMC6987183 DOI: 10.1038/s41598-020-58025-3

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

Introduction

A large proportion of Earth’s biodiversity constitutes organisms that cannot be cultured, have cryptic life-cycles and/or live submerged within their substrates[1-4]. Genomic data are key to unravel both their identity and function[5]. The development of metagenomic methods[6,7] and the advent of single cell sequencing[8-10] have revolutionized the study of life and function of cryptic organisms by upending the need for large and pure biological material, and allowing generation of genomic data from complex or limited environmental samples. Genome assemblies from metagenomic data have so far been restricted to organisms with small genomes, such as bacteria[11], archaea[12] and certain eukaryotes[13]. On the other hand, single cell technologies have allowed the targeting of unicellular organisms, attaining a better resolution than metagenomics[8,9,14-16], and allowed the genomic study of cells from complex organisms one cell at a time[17,18]. However, single cell genomics are not easily applied to multicellular organisms formed by consortia of diverse taxa, and the generation of specific workflows for sequencing and data analyses is needed to expand genomic research to the entire tree of life, including sponges[19], lichens[3,20], intracellular parasites[21,22], and plant endophytes[23,24]. Among the most important plant endophytes are the obligate mutualistic symbionts, arbuscular mycorrhizal (AM) fungi, that pose an additional challenge with their multinucleate coenocytic mycelia[25]. Here, the development of a novel single nuclei sequencing and assembly workflow is reported. This workflow allows, for the first time, the generation of reference genome assemblies from large scale, unbiased sorted, and sequenced AM fungal nuclei, circumventing tedious and often impossible culturing efforts. 
This method opens up many possibilities for studies of evolution and adaptation in these important plant symbionts and demonstrates that reference genomes can be generated from complex non-model organisms by isolating only a handful of their nuclei.

AM fungi are a group of diverse obligate symbionts that have colonized root cells and formed mycelial networks in soil since plants first colonized land[25-27]. Their entire life-cycle is completed underground and they propagate with multinuclear asexual spores[28,29] (Fig. 1). Genomic research on AM fungi has been hampered by technical challenges involving isolation and culturing. Accordingly, reference nuclear genomes of only a few species have been published[30-35], representing taxa that can be grown in axenic culture, i.e., Rhizophagus irregularis, R. clarus, R. diaphanus, R. cerebriforme, Gigaspora rosea, and Diversispora epigaea.

Figure 1

(a) Schematic representation of the life-cycle in AM fungi. A spore detects a plant root in the vicinity and grows hyphae towards it. The hyphae penetrate the plant cell wall and form the characteristically branching haustoria with the shape of arbuscules. The arbuscules are used to exchange nutrients with the plant. New spores are produced in other hyphal terminations, bud off upon maturity and remain in a dormant state until the cycle starts again, while the first spore dies and the fungus retracts from the plant cell. (b) Schematic representation of a spore containing nuclei, lipid vesicles and endosymbiotic bacteria. The hyphae have very reduced compartmentalization with incomplete septa and nuclei appear to move freely.

Methodological overview

A method was developed in which genomic fungal DNA can be obtained, free of plant and prokaryotic DNA, directly from individual nuclei of multinucleate spores. In brief, spores from a trap culture fungal strain of Claroideoglomus claroideum/C. luteum (SA101) were obtained from the INVAM pot culture collection. After visually confirming that nuclear size was appropriate for the method (Fig. S1), an initial trial to sort AM nuclei was carried out using pools of spores in order to assess optimal settings. Spores were cleaned, crushed vigorously, and stained with a DNA stain, before being analyzed by fluorescence-activated cell sorting (FACS), recording the level of fluorescence as a measure of DNA content and light scattering as a proxy for size and particle granularity (Fig. 2a–h). A distinct cloud of particles was observed above the background in the scatter plot (Fig. 2h, inside the blue box), which by PCR verification with fungal and bacterial specific primers was confirmed to consist of biological structures containing mostly fungal DNA (Figs. S2–S3, Table S1). Hence, we concluded that these particles were fungal nuclei and restricted future sorting to this window. Thereafter, individual nuclei from a single spore of the same strain were sorted into wells of a 96-well plate (Fig. S4, Table S2) and whole genome amplified (WGA) using multiple displacement amplification (MDA; Fig. 2i,j). The amplified DNA was screened for pure fungal origin by parallel amplification of rDNA barcode regions for both fungi and bacteria (Figs. 2k, S5). Twenty-four amplified nuclei samples confirmed to contain only fungi (Fig. S4, Tables S3, S4) were sequenced with Illumina HiSeq X (Fig. 2l). Further, the MinION Nanopore-based sequencing device (Oxford Nanopore Technologies, ONT, UK) was used to obtain long read sequences from amplified DNA from multiple (5–100) nuclei separated from a pool of 30 spores of the same strain (Fig. 2i–k, m).

Figure 2

From a soil sample to AM fungal genome assemblies. (a) Whole inoculum from the culture collection INVAM is blended with water and (b) poured into a set of sieves; the material stuck in the 38 μm sieve is placed into a (c) tube that contains a solution of 60% sucrose, then centrifuged for 1 min. The supernatant is run through a 38 μm sieve and washed with water. (d) The sieve content is placed in a Petri dish for the spores to be manually picked using a glass pipette. (e) After cleaning the spores with ddH2O, these are placed one-by-one into tubes and crushed with a pestle. (f) The DNA from a broken spore is stained with SYBR Green, giving a strong fluorescent signal for the nuclei and a lighter signal for the background, organelles and microbes. (g) The stained spore content is loaded on the FACS, where the sample moves inside a constant flow of buffer and crosses a laser beam. An excitation laser of 488 nm and 530/40 band pass filter was used for the SYBR Green fluorescence detection. In addition, scattered light, forward scatter (FSC) and side scatter (SSC) were used as proxy for size and granularity to identify the nuclei. (h) The signals can be interpreted in a scatterplot, and particles of a selected cloud (e.g., R1, blue-box) can be sorted individually or pooled (i) into individual wells of a 96-well plate by directing them with a charge. (j) The content of each well is whole genome amplified using MDA. (k) The amplified products are tested for fungi and bacteria by PCR screening with specific rDNA primers. The products confirmed to be from fungal nuclei are sequenced with (l) Illumina HiSeqX, for single nuclei; and (m) Oxford Nanopore, for pools of nuclei. (n) In workflow 1, Illumina reads are assembled separately for individual nuclei using MaSuRCA[39]. (o) In workflow 2, reads from individual nuclei are normalized and assembled with SPADES[40]. (q) In workflow 3 reads from all nuclei are combined, then normalized and finally assembled with SPADES[40]. 
(p) Lingon[38] is used to produce a consensus assembly from individual nuclei assemblies in both workflows 1 and 2. (r) Nanopore data is assembled with Canu[41], polished with Pilon[53] using the Illumina raw-reads and used to (s) scaffold the three assemblies generated with workflows 1, 2 and 3 using Chromosemble, of Satsuma[55].

Three customized assembly workflows were developed to evaluate assembly quality in the light of coverage bias introduced by WGA, which is the biggest challenge when assembling sequence data from amplified single nuclei. The MDA method, however, has an advantage over PCR-based methods in that it produces longer fragments of DNA with a lower error rate and random coverage bias[36,37]. For the first two assembly workflows, individual nuclei assemblies were generated and subsequently combined to generate a consensus assembly using the workflow manager Lingon[38] (Fig. 2p), which consists of a motif-distance based long sequence overlap finder that merges sequences based on mutual maximal overlaps. In the first assembly workflow raw Illumina reads were assembled using MaSuRCA[39] (Fig. 2n) resulting in 24 assemblies, ranging in size from 14 to 69 Mb (Table S5). To overcome MDA-generated differences in coverage across the genome, the second workflow normalized raw reads to average 100X before assembling using SPADES[40] (Fig. 2o), generating 24 assemblies ranging in size from 11 to 50 Mb (Table S5). A third assembly was created using SPADES[40] after combining raw reads from 24 nuclei followed by normalization to 100X (Fig. 2q). One assembly with 24 nuclei was generated from each workflow and subsequently scaffolded with a Nanopore assembly built with Canu[41] (Fig. 2r,s).

To evaluate the number of nuclei needed for a complete assembly, results from BUSCO[42] analyses, assembly size, and N50 were plotted across assemblies resulting from an increasing number of assembled nuclei. Data from different nuclei were merged in random combinations of two to twelve nuclei and one random combination for 13–23 nuclei. The analysis was performed separately for the three workflows and the results were compared with the single- and 24-nuclei assemblies.
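The subsampling design described above can be sketched as follows. This is a minimal illustration and not the authors' script: the nucleus labels, the function name `draw_combinations`, the number of replicates per size (the Figure 3 caption mentions 3–6 replicates up to 12 nuclei) and the random seed are all assumptions.

```python
import random


def draw_combinations(nuclei, replicates_small=3, seed=0):
    """Return {k: [combination, ...]} for k = 2 .. len(nuclei) - 1.

    Sizes 2-12 get `replicates_small` random draws each; sizes 13 and
    above get a single random draw, mirroring the design in the text.
    Draws are independent, so the same combination could occur twice.
    """
    rng = random.Random(seed)  # fixed seed keeps the plan reproducible
    plan = {}
    for k in range(2, len(nuclei)):
        n_draws = replicates_small if k <= 12 else 1
        plan[k] = [tuple(sorted(rng.sample(nuclei, k))) for _ in range(n_draws)]
    return plan


# 24 hypothetical nucleus labels standing in for the sequenced samples.
nuclei = [f"nucleus_{i:02d}" for i in range(1, 25)]
plan = draw_combinations(nuclei)
```

Each tuple in `plan[k]` names the nuclei whose data would be merged for one assembly of size k; the single-nucleus and full 24-nuclei assemblies sit outside this plan because they involve no random choice.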

Results

The different assembly workflows resulted in assemblies that vary in size, fragmentation and completeness (Table 1). Based on BUSCO analyses, workflow 3 generated the most complete assembly, with 89% for assembly 3n, compared to 2n at 80%, and 1n at 78% (Table 1). Of the core single copy genes identified by BUSCO, few were fragmented or duplicated in assembly 3n, indicating that the set of 14,662 predicted genes is likely to be complete and a close representation of the genetic content in this strain (Table 1). This number is lower than the number of genes found in other sequenced AM fungi such as R. irregularis[30] and R. clarus[33], and also lower than those predicted in assemblies 1n and 2n (Table 1). Interestingly, assembly 3n is considerably smaller (70.8 Mb) than the other assemblies (92.4 Mb and 130.4 Mb for assembly 1n and 2n, respectively) and markedly smaller than the average estimated genome size of 119 Mb based on SGA-PreQC[43]. The smaller assembly size of 3n can be attributed to repeat sequences (20.6 Mb) that are captured to a lesser extent, compared to the other assembly workflows (41.3–58.6 Mb). Specifically, normalization is expected to disproportionally reduce high coverage genomic sequences such as repeat elements and collapse those regions when assembling. Note that this effect of normalization is avoided in assembly workflow 2, in which nuclei are normalized and assembled individually; repetitive regions will collapse but in different parts of the genome. Thus repeats end up being represented in the final assembly when single nuclei assemblies are combined. In contrast, workflow 1 is based on non-normalized reads. Due to uneven coverage, this workflow assembles less of the genome: on average, 55% of the raw reads align to the individual nuclei assemblies, as opposed to 96% of the reads mapping to the normalized individual nuclei assemblies (Table S5). However, workflow 1 generates contigs well supported by high coverage. Combining these incomplete assemblies from single nuclei using Lingon generates an accurate assembly 1 comparable to assembly 3 with a better representation of repeats (Table 1).

Scaffolding with nanopore data improves contiguity of all three assemblies by reducing the number of contigs and thus increasing N50. Furthermore, it decreases the number of genes, but does not affect BUSCO results or inferred repeat content in a major way (Table 1). Hence, in this study, nanopore data is not essential to produce biologically informative assemblies. The assembly from nanopore data alone gave a similar number of predicted genes compared to assembly 3, but captured more repeats (47.3 Mb). BUSCO results suggest a completeness of 77%, which is comparable to assemblies 1 and 2 (Table 1). It is important to note that this nanopore assembly was polished with Illumina reads and that the completeness based on BUSCO results increased from 17% before polishing[44] to 77% after three rounds of polishing.
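The role of repeats in the size differences can be made concrete with a short calculation on the values reported in Table 1 for the scaffolded assemblies. The dictionary layout and function names below are assumptions made for illustration; only the numbers come from the table.

```python
# Size and repeat content in Mb, copied from Table 1 (rows 1n, 2n, 3n).
table1 = {
    "1n": {"size_mb": 92.38, "repeats_mb": 41.32},
    "2n": {"size_mb": 130.41, "repeats_mb": 58.57},
    "3n": {"size_mb": 70.81, "repeats_mb": 20.64},
}


def repeat_fraction(entry):
    """Share of the assembly annotated as repeats (0-1)."""
    return entry["repeats_mb"] / entry["size_mb"]


def non_repeat_mb(entry):
    """Assembly length left after subtracting repeats, in Mb."""
    return entry["size_mb"] - entry["repeats_mb"]


for name, entry in table1.items():
    print(
        f"{name}: {100 * repeat_fraction(entry):.1f}% repeats, "
        f"{non_repeat_mb(entry):.2f} Mb non-repeat sequence"
    )
```

On these figures, repeats account for roughly 45% of assemblies 1n and 2n but about 29% of 3n, while the non-repeat portions of 1n and 3n are close to each other (about 51 and 50 Mb), which is consistent with the interpretation that 3n is smaller mainly because fewer repeats were captured.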

Table 1

Comparative assessment of the 3 assembly workflows.

Assembly		Size (Mb)	# Contigs	N50	Largest contig (Kb)	GC (%)	BUSCO (%)^a	# Genes (Mb)	Repeats (Mb)
1	Raw reads	90.16	11077	12714	94.39	27.01	C: 77 F: 10	18068 (49.42)	40.39
1n	+Nanopore	92.38	3899	37258	176.652	27.91	C: 78 F: 9	16680 (69.54)	41.32
2	Normalized to 100×	124.96	21934	16055	155.09	28.07	C: 79 F: 8	24930 (69.79)	57.77
2n	+Nanopore	130.41	4632	60974	338.42	28.07	C: 80 F: 7	22618 (105.48)	58.57
3	Combined, normalized to 100×	68.31	11246	15947	199.90	28.08	C: 88 F: 4	15882 (43.73)	21.71
3n	+Nanopore	70.81	3883	33135	220.22	28.08	C: 89 F: 3	14662 (55.44)	20.64
Nanopore	polished with Pilon	96.03	6409	20944	151.76	28.15	C: 77 F: 6	15858 (57.47)	47.31

^a Completeness estimated in % of 290 single copy genes in fungi, scored as complete (C) or fragmented (F).
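For readers less familiar with the N50 column in Table 1: N50 is the contig length at which contigs of that length or longer together contain at least half of the total assembly length. A minimal sketch of the calculation is given below; the function name and the toy contig lengths are illustrative and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until
    the running total reaches half of the summed length; the length of
    the contig that crosses that point is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy example: total length 30, half is 15; 10 + 8 = 18 crosses it.
toy_n50 = n50([2, 4, 6, 8, 10])
```

Because the statistic depends only on the length distribution, merging many small contigs into fewer long scaffolds raises N50 even when the total assembly size barely changes, which matches the effect of Nanopore scaffolding seen in the table.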

Combinations of increasing number (1–24) of randomly selected nuclei were produced for all the assembly workflows in order to evaluate the number of nuclei needed to produce a good final assembly. As shown in Fig. 3, single nuclei assemblies are most complete when using normalized reads in workflow 2, with an average of 40% BUSCO estimated completeness compared to 25% in workflow 1. Interestingly, the number of duplicated genes among the complete genes increases as more single nuclei assemblies are combined in workflow 2 compared to workflow 1 (Fig. 3a,b). The higher number of duplicated genes was confirmed by locating known single copy genes in all assemblies (Table S6). The duplications in workflow 2 are likely generated because read normalization allows for assembly of regions with low coverage that are prone to errors, and prevents contigs from being properly assembled by the workflow manager Lingon. Assemblies of increasing number of nuclei result in increasing assembly size, N50, and BUSCO estimated completeness (Fig. 3). In both workflows 1 and 3, BUSCO results reach maximum performance when assembling random combinations of six to seven nuclei (Fig. 3a,c). The same pattern is observed for assembly size and N50 (Fig. 3d). In workflow 2, on the other hand, assembly size continuously increases with increasing number of combined nuclei assemblies (Fig. 3d). This pattern is reflected by an increasing number of duplicated genes in the BUSCO results (Fig. 3b).

Figure 3

Summary statistics for different numbers of assembled nuclei (1–24) using three different assembly workflows. BUSCO estimates of completeness for (a) workflow 1: raw reads of individual nuclei assembled using MaSuRCA, consensus assembly using Lingon; (b) workflow 2: normalized reads of individual nuclei assembled using SPADES, consensus assembly using Lingon; and (c) workflow 3: reads from individual nuclei are pooled and normalized before assembling with SPADES. Percentage of single copy core genes detected as single copy (S: grey), duplicated (D: light grey) or fragmented (F: black). Average of 3–6 replicate assemblies up to 12 nuclei with error bars indicating SEM. (d) Assembly size (dashed lines) and N50 (solid lines) for the three workflows 1 (black), 2 (grey) and 3 (light grey).

Discussion

Methodological challenges in assembling genomes from amplified single nuclei or cells can be tackled by careful analysis of generated assemblies[9,16,23]. In this study, it is suggested that different assembly strategies can be useful for different downstream analyses. A genome assembly with a high coverage and a high-quality dataset of single copy genes can already be generated from only six individually sequenced nuclei when reads are combined and normalized, as done in workflow 3 (Fig. 3). As demonstrated by Ahrendt et al.[16], such an assembly generates high coverage genome data and is ideally suited for phylogenomics studies. When using non-normalized data, as in assembly workflow 1, repeat elements are better represented and hence, this assembly is likely better suited for identification and classification of repeats, which are known to represent a large proportion of AM fungal genomes[34]. Comparative genetic analyses between single nuclei are best done using assemblies from workflow 2, where single nuclei assemblies are generated from normalized reads. Estimated completeness of these assemblies is comparable to results from single cell sequencing of fungi with smaller genomes[16]. However, single nuclei assemblies based on normalized reads should not be merged into consensus assemblies, since the variable quality of their contigs makes them prone to duplication. To conclude, sequence data from single cell sequencing are challenging to work with, but as shown here, with the right combination of methods adapted to the data, de novo reference genomes can be generated. This opens the door for an expansion in genomic and phylogenomic research in organisms like AM fungi that have, for too long, evaded large scale genome sequencing efforts due to methodological limitations stemming from their complicated biology. 
With organism-specific modifications to the initial nuclei extraction step, the complete workflow can be adapted to investigate nuclei or other intraorganismal units, such as endosymbiotic bacteria or mitochondria, from taxonomically diverse groups of non-model organisms. Useful genomic information can be generated from a handful of single nuclei greatly improving our ability to study multicellular eukaryotes with complex life stages. The assembly method of choice will ultimately depend on the research questions asked and the kind of data needed or available.

Methods

Fungal strain and spore extraction

C. claroideum/C. luteum (SA101) was obtained as whole inoculum from the International culture collection of (vesicular) arbuscular mycorrhizal fungi (INVAM) at West Virginia University, Morgantown, WV, USA. Due to the unclear taxonomic status of the strain we have decided to adhere to the current INVAM name throughout the text. Soil (10–30 ml) was blended with 3 to 4 pulses using a blender half-filled with water (500 ml). The mix was filtered through a set of sieves (1 mm/500 μm/38 μm × 200 mm diameter (VWR, Sweden)). The content of the last sieve was transferred into a falcon tube containing 20 ml of 60% sucrose solution and centrifuged for 1 minute at 2500–3000 rpm. The supernatant was poured into a small sieve (50 mm diameter) of 38 μm and the sucrose was washed off with water. The contents were poured into a petri dish for better visualization under the stereomicroscope. Spores were transferred individually or in groups to an Eppendorf tube using modified glass pipettes with reduced tip diameter and subsequently cleaned by adding and removing ddH2O five times. The step-by-step protocol can be found in the OSF Repository for the project[44].

Nuclei extraction and sorting

After spore extraction from soil, individual spores were placed in 30 μl ddH2O in 1.5 ml Eppendorf tubes. One tube with 15 spores was used to establish the sorting window. A volume of 50 μl 1x PBS was added to each tube before crushing the spores using a sterile pestle. DNA was stained by adding 1 μl of 200x SYBR Green I Nucleic Acid stain (Invitrogen™, Thermo Fisher Scientific, MA, USA) and the sample was incubated for 20–50 min in the dark. More 1x PBS was added to increase the volume to 100–200 μl before loading the sample on the FACS. The nuclei were sorted on a MoFlo™ Astrios EQ sorter (Beckman Coulter, USA) using a 488 nm laser for excitation, 70 μm nozzle, sheath pressure of 60 psi, and 0.1 µm filtered 1x PBS as sheath fluid. The trigger channel was set to the forward scatter (FSC) at a threshold of 0.03% and sort regions were defined on SYBR Green I fluorescence (488–530/40) over side scatter (SSC). The samples were sorted in single cell mode with a drop envelope of 1 at 700 to 1200 events per second. Thus, if a particle fitting within the sorting window passes by the laser together with another particle, these would be discarded. Particles from region R1, assumed to be nuclei (Fig. S4), were sorted individually into 96 well plates containing 1 μl 1x PBS/well. Groups of 5 particles were collected for positive control and empty wells were kept as negative control (Table S2).
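The sorting logic described above (a sort region on fluorescence over side scatter, with coincident particles discarded) can be illustrated with a small sketch. This is a conceptual model only and not the sorter's software: the rectangular gate shape, the class and function names, and the threshold values used for `r1` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    fluorescence: float  # SYBR Green I signal, arbitrary units
    side_scatter: float  # SSC signal, arbitrary units


@dataclass(frozen=True)
class Gate:
    fl_min: float
    fl_max: float
    ssc_min: float
    ssc_max: float

    def contains(self, event):
        """True if the event falls inside the rectangular sort region."""
        return (
            self.fl_min <= event.fluorescence <= self.fl_max
            and self.ssc_min <= event.side_scatter <= self.ssc_max
        )


def sort_decision(drop_events, gate):
    """Sort a drop only if it holds exactly one event inside the gate.

    Drops with two or more events are rejected even when one of the
    events is inside the gate, as in single cell mode.
    """
    return len(drop_events) == 1 and gate.contains(drop_events[0])


# Hypothetical boundaries standing in for region R1.
r1 = Gate(fl_min=1e3, fl_max=1e4, ssc_min=10.0, ssc_max=200.0)
nucleus_like = Event(fluorescence=5e3, side_scatter=50.0)
debris_like = Event(fluorescence=50.0, side_scatter=20.0)
```

With these toy values, `sort_decision([nucleus_like], r1)` accepts the drop, while a drop containing both `nucleus_like` and `debris_like` is rejected.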

Whole genome amplification

Sorted nuclei were lysed and neutralized followed by whole genome amplification using Phi29 and MDA as described by Rinke et al.[45]. In short, the cells were incubated in an alkaline solution (buffer DLB and DTT, Qiagen, Germany) for 5 min at room temperature, followed by 10 min on ice. Lysis reactions were neutralized by adding 1 μL neutralization buffer (stop solution, Qiagen, Germany). Both the alkaline lysis solution and the neutralization buffer were UV treated with 2 Joule in a Biolinker. MDA was performed using the RepliPHI™ Phi29 Reagent set (RH031110, Epicenter, WI, USA) at 30 °C for 16 h in 15 μl reaction volumes with a final concentration of 1x reaction buffer, 0.4 mM dNTPs, 10 mM DTT, 5% DMSO, 50 μM hexamers with 3′-phosphorothioate modifications (IDT Integrated DNA Technologies, Iowa, USA), 40 U Phi29 enzyme, 0.5 μM SYTO13® (Invitrogen™, Thermo Fisher Scientific, MA, USA) and water. All reagents except SYTO13 were UV decontaminated with 3 Joule in a UV crosslinker as described in Rinke et al.[45]. 12 µl of MDA mix were then added to each well. The whole genome amplification was monitored in real time by detection of SYTO13 fluorescence every 15 minutes for 16 h using a Chromo4 real-time PCR instrument (Bio-Rad, USA) or a FLUOstar® Omega plate reader (BMG Labtech, Germany). The amplified genomic DNA was stored at −20 °C for short-term and transferred to −80 °C for long-term storage.

Selecting single amplified nuclei for sequencing

MDA products were diluted to approximately 5 ng/μl (40×) and screened for the presence of fungal and bacterial ribosomal genes using PCR. PCR reaction mixtures contained 10x Standard Taq Reaction buffer (Qiagen), 2 mM MgCl2, 0.2 mM deoxynucleoside triphosphates (dNTPs), 0.2 μM of each primer, and 1 U Taq DNA polymerase (Qiagen). The fungal-specific primers ITS9[46] and ITS4 were used. The PCR protocol had an initial denaturing step of 10 min at 95 °C, followed by 35 cycles of 30 s at 95 °C, 30 s at 58 °C, and 50 s at 72 °C for the fungi PCR. For the bacteria-specific 341F/805R[47] primer pair, a separate reaction mixture was used containing 10x Standard Taq Reaction buffer (Qiagen), 2 mM MgCl2, 0.2 mM deoxynucleoside triphosphates (dNTPs), 0.2 μM concentration of each primer and 1 U Taq DNA polymerase (Qiagen). For the fungal PCR, DNA extracted from commercially available Agaricus bisporus provided by Dr. Ylva Strid (Uppsala University, Sweden) was included as a positive control, and ddH2O as negative control. The bacterial PCR protocol consisted of an initial step of 5 min at 95 °C, followed by 30 cycles of 30 s at 95 °C, 30 s at 58 °C, and 50 s at 72 °C before a final elongation step of 7 min at 72 °C. Bacteria PCR included a positive control of DNA extracted from Legionella provided by Tiscar Graells (Universitat Autónoma de Barcelona, Spain), and ddH2O was used as negative control. The reaction was performed with a 2720 Thermocycler (Applied Biosystems, USA). The presence of amplification products was verified by gel electrophoresis by separation on a 2% agarose gel run for 35 min at 110 V (fungi) and 70 V (bacteria) including a Thermo Scientific GeneRuler 100 bp DNA Ladder (Fig. S5). The samples were identified as fungi positive, bacteria positive, fungi + bacteria positive or failed/empty (Table S3). From the samples that scored positive for presence of fungi, 24 undiluted samples were selected for sequencing and the DNA amount was measured using Qubit (Invitrogen, Austria) after addition of 30 μl ddH2O (Table S4).
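The four screening categories follow directly from the two PCR outcomes per well, as the sketch below shows. The function names, the well identifiers and the rule of taking the first fungi-only wells in plate order are assumptions for illustration; the text does not state how the 24 samples were chosen among the fungi-positive ones.

```python
def classify_sample(fungal_band, bacterial_band):
    """Map the two gel results for one well to a screening category."""
    if fungal_band and bacterial_band:
        return "fungi + bacteria positive"
    if fungal_band:
        return "fungi positive"
    if bacterial_band:
        return "bacteria positive"
    return "failed/empty"


def select_for_sequencing(results, n=24):
    """Return up to n wells scored as fungi positive only.

    `results` maps a well ID to (fungal_band, bacterial_band) booleans.
    Taking wells in insertion order is an arbitrary illustrative rule.
    """
    clean = [
        well
        for well, (fungal, bacterial) in results.items()
        if classify_sample(fungal, bacterial) == "fungi positive"
    ]
    return clean[:n]


# Hypothetical gel readings for four wells.
example = {
    "A1": (True, False),
    "A2": (True, True),
    "A3": (False, True),
    "A4": (False, False),
}
chosen = select_for_sequencing(example)
```

Only well `A1` passes in this toy plate, since wells with any bacterial signal are excluded along with failed or empty wells.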

Sequencing of single amplified nuclei

From the 24 selected samples, around 800 ng of DNA was transferred to sequencing plates. Library preparation and sequencing were performed by the SNP&SEQ Technology Platform in Uppsala at the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory. For each sample, an individual library was prepared using the TruSeq Nano DNA Library Prep Kit. Sequencing was performed by cluster generation and 150-cycle paired-end sequencing of the 24 libraries in 1 lane using the HiSeq X system and v2.5 sequencing chemistry (Illumina Inc., USA). Read data were delivered to us in FASTQ format.

Spore sorting for Nanopore sequencing

Spores were picked in groups of 30 with the help of a P10 and P100 pipette, then washed five times in nuclease-free water and transferred to Eppendorf tubes in 30 μl nuclease-free water. For the FACS sorting, spores were crushed, then 30 μl 1x PBS was added to the tube along with 1 μl of 200x SYBR Green for staining the DNA (20–50 min). Sample volume was increased to 200 μl with 1x PBS before loading on the FACS. Pools of 5 and 100 nuclei were sorted into either individual 1.5 ml Eppendorf tubes or into multi-well plates. The above-described WGA protocol was run, and the presence of fungal DNA in the samples was verified by PCR on diluted samples of amplified pooled nuclei before selecting fungi positive samples for library preparation. PCR reaction mixtures were made as described above. The fungal-specific ITS1F/ITS4 and bacteria-specific 341F/805R primer pairs were used for each sample in two independent PCR reactions. The PCR protocol included an initial denaturing step of 5 min at 95 °C, followed by either 35 cycles of 30 s at 95 °C, 30 s at 55 °C, and 50 s at 72 °C for the fungi PCR or by 30 cycles of 30 s at 95 °C, 30 s at 58 °C, and 50 s at 72 °C for the bacteria PCR before a final elongation step of 7 min at 72 °C. The reaction was performed with a 2720 Thermocycler of Applied Biosystems (USA). Amplification products were visualized and documented by gel electrophoresis as described above. Libraries were prepared by following the “Premium Whole Genome Amplification” protocol (version WAL_9030_v108_revJ_26Jan2017, Oxford Nanopore Technologies [ONT], Oxford, United Kingdom) in combination with the Ligation Sequencing Kit 1D (SQK-LSK108, ONT) with the following modifications: (a) an alternative WGA method was used (Qiagen Single Cell Kit instead of the Midi Kit); (b) samples were diluted to a 50 μl volume following WGA and quantified using Qubit (Invitrogen, Austria). 
Amounts of 1–2.5 μg DNA were then used for preparing individual libraries, starting with the first bead cleaning step explained in the whole genome amplification section. At the end of this step, samples were eluted in 19 μl nuclease-free water instead of 100 μl. 1 μl of the eluted sample was used for DNA quantification (Qubit fluorometer) while another 1 μl was used to measure DNA quality with Nanodrop (ND 2000); (c) no size selection or intentional shearing was performed, to achieve read lengths as long as possible; (d) 17 μl amplified DNA was added to the T7 endonuclease treatment; (e) an extended end-prep reaction was performed by incubating the samples for 30–30 mins at both 20 °C and 65 °C; (f) adapter ligation was allowed for 25–30 mins instead of 10; (g) elution buffer in the final step was incubated for 15 minutes instead of 10; (h) the loaded library contained no additional water but 14.5 μl DNA library instead of 12 μl. Additionally, flicking was used to mix reactions instead of pipetting to prevent DNA fragmentation. Further, eluates were removed and retained in a stepwise fashion (i.e. in multiple aliquots) after every cleaning step to ensure that no beads were brought forward with the DNA into the next library preparation step. In general, extending the clean-up, ligation and elution steps improved the quality of the library and thus pore occupancy during sequencing. A total of 3 libraries were sequenced on 3 separate ONT MinION R9.4 flow cells (FLO-MIN106) using live base-calling and the standard 48 h sequencing protocol (NC_48Hr_sequencing_FLO-MIN106_LSK-108_plus_Basecaller). One library was run on a fresh flow cell with ~1400 single pores available for sequencing at the beginning of the run. This 48 h run provided 1,686,715 reads. For the other two libraries, previously used and washed flow cells were re-used with only a fraction of sequencing pores being functional (402 vs. 256 pores), so the amount of acquired data was much lower (100,000 and 106,000 reads, respectively).

Computational analyses, assembly and annotation

The quality of the Illumina reads was assessed with FastQC[48]. Genome size was estimated for each set of paired raw reads from individual nuclei with SGA-PreQC[49]. Contamination was assessed with Kraken[50] in a subset of the raw reads. GC content was computed using the NBIS-UtilityCode[51] toolbox. Assembly workflow 1: Individual assemblies for each of the 24 nuclei were generated with MaSuRCA[39] using default options. The resulting assemblies were iteratively merged using Lingon[38], which computed overlaps based on the spacing of sequence motifs (CATG, CTAG, GTAC, GATC, TATA, ATAT, and GC), and merged contigs based on pairwise maximal extensions. Each motif was iterated over ten times. Three versions of the assembly were generated, in which contigs shorter than 500, 1000 and 2000 bp, respectively, were removed from the individual assemblies prior to Lingon. Assembly workflow 2: Each set of reads was normalized using bbnorm of BBMap[52] v. 38.08 with a target average depth of 100×. Normalized data were assembled individually into 24 assemblies using SPADES[40], and a consensus assembly was generated with Lingon[38], with the same sequence motifs as for assembly 1. Assembly workflow 3: The 24 datasets were combined and normalized with bbnorm of BBMap[52] v. 38.08 with a target average depth of 100× and subsequently assembled using SPADES[40]. Nanopore assembly: Nanopore reads were assembled using Canu[41] v.1.7–86da76b; this specific beta version made it possible to assemble a difficult dataset like ours, with highly uneven coverage across the genome. An assembly was created using default settings together with the known information (genomeSize=117m -nanopore-raw). The resulting assembly was polished with three rounds of Pilon[53] v.1.22 using the raw Illumina reads from the 24 nuclei mapped with Bowtie2[54]. The contigs of the final assemblies from single nuclei were scaffolded with the Nanopore assembly using Chromosemble from the Satsuma package[55].
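As a rough orientation, the normalization, short-read assembly and long-read assembly steps could be invoked along the following lines. This is a hedged sketch, not the commands used in the study: the text specifies only the tools, the 100× target depth and the Canu settings `genomeSize=117m -nanopore-raw`, so every file name, prefix and output directory below is hypothetical, and all other tool options are left at whatever the installed versions use by default. The functions only build argument lists and do not run anything.

```python
def bbnorm_command(r1, r2, out1, out2, target=100):
    """Argument list for BBTools bbnorm.sh on one paired-end read set."""
    return [
        "bbnorm.sh",
        f"in={r1}",
        f"in2={r2}",
        f"out={out1}",
        f"out2={out2}",
        f"target={target}",  # target average depth, 100x in the text
    ]


def spades_command(r1, r2, outdir):
    """Argument list for a paired-end SPAdes assembly."""
    return ["spades.py", "-1", r1, "-2", r2, "-o", outdir]


def canu_command(reads, prefix, outdir, genome_size="117m"):
    """Argument list for a Canu 1.x assembly of raw Nanopore reads."""
    return [
        "canu",
        "-p", prefix,
        "-d", outdir,
        f"genomeSize={genome_size}",
        "-nanopore-raw", reads,
    ]


# Hypothetical file names for a single nucleus (workflow 2 style).
norm = bbnorm_command("n01_R1.fastq.gz", "n01_R2.fastq.gz",
                      "n01_norm_R1.fastq.gz", "n01_norm_R2.fastq.gz")
asm = spades_command("n01_norm_R1.fastq.gz", "n01_norm_R2.fastq.gz", "n01_spades")
long_asm = canu_command("nanopore_reads.fastq", "sa101", "canu_out")
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the corresponding tool is installed; for workflow 3 the same two short-read steps would be applied once to the combined reads of all 24 nuclei rather than per nucleus.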

Comparative assembly analyses

A quantitative assessment of the assemblies was done with Quast[56] v.4.5.4, and contamination was checked with Kraken[50] v1.0. In addition, a BUSCO[42] analysis was done to assess the completeness of the genome, using the lineage set fungi_odb9 and the species set rhizopus_oryzae (Figs. 3, S6). Raw reads were mapped to the individual assemblies of workflows 1 and 2 (Table S5) with Bowtie2[54] v. 2.3.3.1 using the default settings. Two genes known to be single-copy in fungal genomes, elongation factor 1-alpha (EF1-alpha) and the largest subunit of RNA polymerase II (RPB1), were searched for in the genome assemblies to test for possible duplications generated by the assembly methods. Sequences from C. claroideum were used as BLASTn[57] queries (Table S6). GenBank sequences: EF1-alpha GQ205008.1, RPB1 HG316018.1.
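One of the contiguity statistics reported by assembly assessment tools such as Quast is the N50. As an illustration only, the following Python sketch computes it from a list of contig lengths; the function name and the example lengths are hypothetical and are not taken from the assemblies in this study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    The N50 is the length L of the contig at which, going from the longest
    contig downwards, the cumulative length first reaches at least half of
    the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Hypothetical contig lengths (total 54, half 27).
example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print("contigs:", len(example))
print("total length:", sum(example))
print("largest contig:", max(example))
print("N50:", n50(example))
```

Because the N50 depends on the set of contigs retained, comparing assemblies filtered at different minimum contig lengths requires applying the same cutoff before computing it.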

Genome annotation

Repeats and transposable elements (TEs) were de novo predicted in every assembly using RepeatModeler[58] v1.0.8. The repeat library from RepeatModeler was used to mask the genome assembly using RepeatMasker[59] v4.0.7. The classification reports can be found in the OSF Repository[44]. Protein coding genes were de novo predicted from the repeat-masked scaffolded genome assembly with GeneMark-ES[60] v4.33. GeneMark-ES uses unsupervised self-training and an algorithm that is optimized for fungal gene organization. To guide the gene predictions, we aligned UniProt/Swiss-Prot[61] protein sequences (downloaded 8 May 2018) to the repeat-masked genome assembly using MAKER[62] v3.01.1-beta and provided the genomic locations of the protein alignments to GeneMark-ES. The previously published transcriptomic data from C. claroideum[63] were not used because of the low mapping success of the reads to the assembly (25%), which could be related to the low BUSCO statistics reported in that study[63] and could have negatively affected the annotation quality. Protein and gene names were assigned to the gene predictions using a BLASTx[57] v2.6.0 search of predicted mRNAs against the UniProt/Swiss-Prot[61] database with default e-value parameters (1 × 10⁻⁵). The ANNotation Information Extractor, Annie[64], was used to extract BLAST matches and to reconcile them with the gene predictions. Sequences, assemblies, and annotations can be found in BioProject PRJNA528883.
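The idea of assigning a name to each predicted mRNA from its BLAST matches under an e-value threshold can be sketched as follows. This is not the Annie implementation; it is a minimal illustration that assumes BLAST tabular output (output format 6, with the e-value in column 11 and the bit score in column 12), which the text above does not specify. The function name and all identifiers in the demo data are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def best_hits(
    tabular_lines: Iterable[str], max_evalue: float = 1e-5
) -> Dict[str, Tuple[str, float, float]]:
    """Return the highest-scoring hit per query among hits passing the e-value cutoff.

    Each value is (subject_id, evalue, bitscore).
    """
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical tabular hits: two for mRNA_1, one weak hit for mRNA_2.
demo_hits = [
    "mRNA_1\tsp|X1|DEMO_A\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200",
    "mRNA_1\tsp|X2|DEMO_B\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-60\t230",
    "mRNA_2\tsp|X3|DEMO_C\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.01\t35",
]

for query, (subject, evalue, bitscore) in best_hits(demo_hits).items():
    print(query, subject, evalue, bitscore)
```

In this sketch, queries whose hits all fall above the threshold (mRNA_2 in the demo data) receive no name, which corresponds to leaving such gene predictions unannotated.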
