Literature DB >> 36205401

Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other nonstandard architectures in genome assemblies.

Miguel A Naranjo-Ortiz1,2,3,4, Manu Molina1,2,5, Diego Fuentes5,6, Verónica Mixão1,2,5,6, Toni Gabaldón1,2,5,6,7,8.   

Abstract

BACKGROUND: Recent technological developments have made genome sequencing and assembly highly accessible and widely used. However, the presence in sequenced organisms of certain genomic features such as high heterozygosity, polyploidy, aneuploidy, heterokaryosis, or extreme compositional biases can challenge current standard assembly procedures and result in highly fragmented assemblies. Hence, we hypothesized that genome databases must contain a nonnegligible fraction of low-quality assemblies that result from such type of intrinsic genomic factors.
FINDINGS: Here we present Karyon, a Python-based toolkit that uses raw sequencing data and de novo genome assembly to assess several parameters and generate informative plots to assist in the identification of nonchanonical genomic traits. Karyon includes automated de novo genome assembly and variant calling pipelines. We tested Karyon by diagnosing 35 highly fragmented publicly available assemblies from 19 different Mucorales (Fungi) species.
CONCLUSIONS: Our results show that 10 (28.57%) of the assemblies presented signs of unusual genomic configurations, suggesting that these are common, at least for some lineages within the Fungi.
© The Author(s) 2022. Published by Oxford University Press GigaScience.

Entities:  

Keywords:  aneuploidy; genome assembly; heterozygosity; hybridization; polyploidy

Mesh:

Year:  2022        PMID: 36205401      PMCID: PMC9540331          DOI: 10.1093/gigascience/giac088

Source DB:  PubMed          Journal:  Gigascience        ISSN: 2047-217X            Impact factor:   7.658


Findings

• We present Karyon, a Python-based bioinformatic pipeline that integrates genome assembly and a series of structural analyses for the diagnosis of problematic genomic structures. Karyon is freely available in GitHub and as a docker container (https://github.com/Gabaldonlab/karyon). • We applied Karyon to 35 highly fragmented, publicly available genome assemblies to identify putative undescribed deviations in genomic architecture that might have caused problems in a standard assembly process. From 35 assemblies, 10 presented features that suggested possible underlying biological factors as the likely cause of the observed assembly fragmentation. Even though our sample size is small and restricted to a single lineage (Mucoromycotina), our results suggest that the number of unreported deviations in genome architecture in Fungi is considerable. This is emphasized if we consider that most researchers that have produced low-quality assemblies are unlikely to publish their data.

Introduction

Recent developments in high-throughput sequencing and bioinformatic tools have made the process of sequencing the genome of a new organism a routine task for many laboratories, especially those working on groups with small compact genomes (prokaryotes, fungi, many parasitic lineages). The success of a genome assembly is limited by technical aspects as well as by intrinsic properties of the sequenced genome. A successful assembly depends on the quality, design, and depth of the sequencing libraries, which must typically adapt to budget limitations. Naturally, if the sequencing methodology or the computational approaches are inappropriate, the resulting assembly will be poor (i.e., highly fragmented, incomplete, or misassembled). However, additional difficulties might arise independently of the methodology employed, due to intrinsic properties of the genome that interfere with genome assembly algorithms.

Biological factors affecting genome assembly quality

The main intrinsic factors that compromise the success of a genome assembly are the genome size, the sequence heterozygosity, the abundance of low-complexity regions (i.e., highly repetitive sequences), and the presence of high or uneven ploidy, contaminating sequences or extreme nucleotide compositions (Fig. 1).
Figure 1:

Biological factors that confound genome assembly process. Ploidy and aneuploidy increase the number of possible states per site. Extreme GC% composition affects the information that different k-mers have, and extreme deviations are relatively common in extremophilic organisms. Transposable elements and other forms of repetitive elements increase genome size, affect GC% locally, and reduce sequence complexity. Hybridization, heterokaryosis, and chimerism introduce 2 genotypic signals that might be quite divergent, which increases heterozygosity. Finally, contamination introduces undesired sequences with uneven composition, heterozygosity, and stoichiometry.

Biological factors that confound genome assembly process. Ploidy and aneuploidy increase the number of possible states per site. Extreme GC% composition affects the information that different k-mers have, and extreme deviations are relatively common in extremophilic organisms. Transposable elements and other forms of repetitive elements increase genome size, affect GC% locally, and reduce sequence complexity. Hybridization, heterokaryosis, and chimerism introduce 2 genotypic signals that might be quite divergent, which increases heterozygosity. Finally, contamination introduces undesired sequences with uneven composition, heterozygosity, and stoichiometry. Genome size impacts computational costs, as many assembly algorithms scale nonlinearly [1-3]. Heterozygosity implies the existence of allelic differences within an individual. Standard assembly algorithms have difficulty in differentiating between highly heterozygous regions and distinct but highly similar genomic regions [4, 5]. This, in turn, results in fragmented assemblies with inflated size compared to empirical measurements, as many of these regions appear duplicated [5], often in short scaffolds. This is particularly problematic in the case of individuals or a population that derived from sexual recombination between 2 or more distinct phylogenetic lineages (Fig. 1A), as the component subgenomes often develop structural rearrangements after the split of the 2 parental lineages. Similarly, repetitive or low-complexity genomic regions (Fig. 1B) are difficult to resolve without the aid of expensive experimental approaches (e.g., genetic maps, bacterial artificial chromosomes, long read sequencing techniques, or chromosome conformation capture), particularly when they span large genomic regions. Duplicated regions introduce multiple possible solutions to the process of scaffolding, increasing assembly fragmentation and computational costs [3, 4]. Similarly, ploidy deviations can greatly affect genome assembly. The first possible ploidy deviation is polyploidy (Fig. 1C), which is the presence of more than 2 chromosomes for the majority of the genome. Polyploidy is generally associated with genome heterozygosity, as it increases the number of possible states per site [6, 7]. For a diploid site, only 2 states are possible: heterozygous or homozygous, depending on whether the 2 alleles are different or equal, respectively. For a triploid, however, there are 2 possible heterozygotic states (e.g., AAB and ABB), and differentiating between them depends on relative frequencies. Allele frequency is affected by stochastic variation, specially if depth of sequencing is low. Aneuploidy (Fig. 1D) tends to cause the same problems as polyploidy in assemblies, albeit with the effect being limited only to the aneuploid regions. Because of this, genes present in chromosomes with ploidy higher than 2 will have a higher likelihood of being unannotated. Animal and plant genomics have traditionally considered aneuploidies as rare events, due to their deleterious effects on many of these organisms, especially during embryonic development [8]. This paradigm is clearly false for many fungal [9-12]) and protist [13, 14] lineages. Eukaryotic genomics has only recently started to focus on pangenomes [15-19], but aneuploidies might be an important confounding factor for these studies. For example, genes located in aneuploid regions are more likely to be missed in annotations, which can inflate estimations of presence/absence variation. In syncytial organisms, such as filamentous fungi or slime molds, there is the possibility of coexistence of genetically different populations of nuclei within a cytoplasmic continuum, a condition known as heterokaryosis (Fig. 1E) [20-22]. Heterokaryosis is functionally similar to ploidy, although with some important differences. First, the relative proportions between heterozygous sites do not necessarily adjust to a simple fraction, as often one population is more abundant that the other. Second, since nuclei divide independently from each other, mitotic or meiotic recombination should be rare. This independence implies that any relative chromosomal rearrangements (i.e., duplications, deletions, translocations, and inversions) between the 2 nuclear populations, either pre- or postunion, would remain in nuclear populations for long periods of time. These rearrangements introduce the aforementioned complications in genome assemblies, and some of these might be difficult to differentiate from other chromosomal aberrations. A similar phenomenon is chimerism (Fig. 1F), in which the body of an organism is composed by 2 or more populations of genetically distinct cells. Certain lineages, especially colonial species, might arise by fusion of several genetically distinct individuals [23], but very little is known regarding the effect of chimerism in genome assemblies. The presence of sequence contamination (Fig. 1G) can greatly compromise the quality of the genome assembly [24-28]. Extraneous sequences introduce noise, create chimeric contigs, and might introduce errors in k-mer estimations. Highly diverse contaminations (e.g., from the gut microbiota) introduce sequences with highly variable level of coverage, heterozygosity, and composition. On the other hand, highly abundant contaminants (e.g., symbiotic bacteria) are typically more homogeneous in all these parameters but might still form chimeric contigs and would indirectly reduce the depth of coverage in the main genome. Contaminations reducing the signal of the main genome are particularly problematic for single-cell sequencing projects [29, 30]. This is normally prevented by methodological means, but contaminating sequences are intrinsic for certain samples or even organisms, such as the case of symbiotic organisms (e.g., Lichens). Finally, genomes with extreme compositions, typically very high or low GC content (GC%), can be difficult to assemble (Fig. 1H). For these genomes, the information contained by any AT positions is different from the information contained by a GC, as k-mers composed of the favored nucleotide pair will appear at higher frequencies. GC% has a well-documented effect on some sequencing technologies, most notably on the quality of Illumina reads [31, 32]. Fortunately, GC% is easy to measure from raw reads, and some genome assemblers include options specially adapted for these cases [33, 34]. Low GC% is typically associated with high abundance of low-complexity regions and transposable elements, but extreme GC% is also a hallmark of certain lineages, such as several groups of early diverging Fungi [35]. Despite their effects in genome analyses, GC% in eukaryotic genomes is often ignored. For example, neither NCBI nor Mycocosm report GC% in their assembly information statistics, unlike the Genome Online Database, which has a greater focus on prokaryotic sequences. If the presence of the factors outlined above is anticipated, specific technical approaches—both experimental and computational— can be used. Contaminating DNA can be identified easily because sequencing coverage, nucleotide composition, and phylogenetic signal are usually different from the main genome, and several programs have been developed to identify contamination [24-28]. Ploidy can be estimated with cytogenetic techniques, which has been used for animals and plants since the 19th century. Unfortunately, cytogenetic techniques are time-consuming and difficult to interpret for some groups, such as the Fungi. Computational approaches exist to estimate composition and ploidy from sequencing reads [36-38]. Similarly, hybridization can be detected based on phenotypic traits (intermediate phenotypes and hybrid vigor). Again, this is not feasible for most microbial eukaryotes due to the lack of easily identifiable phenotypes. Genomes of hybrid organisms are heterozygous, and some genome assembly software has been designed to be able to handle this situation [5, 39, 40], but proper identification of hybrid lineages cannot be done without adequate population and phylogenetic analyses. Thus, biological factors affecting genome assembly quality increase the overall costs of a project and require expertise that might not be available. Given the difficulty of performing analyses on low-quality assemblies, it is likely that published genomes are biased in favor of organisms with genomic characteristics that make them easier to work with. In contrast, genomic projects that choose organisms with nonstandard genomic architectures are more likely to suffer methodological obstacles that delay or even prevent analyses. Our inability to work around nonstandard genomic architectures distorts our perception of biological phenomena, relegating them to mere oddities.

Results

The Karyon toolkit

To aid in the identification of these noncanonical genomic architectures, we developed Karyon, a Python-based toolkit that assesses several parameters of sequencing data and their derived assemblies that are common indicators of different intrinsic genomic features leading to poor assemblies. Karyon comprises different modules that can be used independently or sequentially. Karyon is written in Python 3 and freely available to download as a Docker build or as a standalone project in https://github.com/Gabaldonlab/karyon. Karyon integrates Trimmomatic [41] as an optional step to eliminate low-quality positions and adapters from sequencing reads. It then uses that input to generate a de novo assembly using SPAdes v3.9.0 [33], dipSPAdes v3.9.0 [40], Platanus v1.2.4 [39], or SOAPdenovo2 v2.04-r240 [42]. Karyon then uses the de novo assembly to generate a reduced assembly using Redundans [5]. Redundans is a pipeline that collapses assembly fragments with high similarity to create an artificial haploid genome assembly. This assembly is then used as reference to map the original sequencing reads using BWA-MEM [43] and generate a variant calling file with GATK v4.1.9.0 [44]. A battery of analyses is then performed on the sequencing libraries, the assemblies, and the maps of coverage and genetic variation to generate plots that will aid in the diagnosis of the genomic structure. Fig. 2 summarizes the pipeline.
Figure 2:

Karyon pipeline. Schematic representation of the steps and program used by Karyon. Red circles represent possible user inputs. Blue boxes represent software used for each step. Orange hexagons represent files generated by the software. Red arrows indicate input to a program, and blue arrows represent output of a program. Thicker red arrows represent the standard pipeline, while thinner red arrows represent the different options the user can select to skip some of the steps. These options appear next to the arrow.

Karyon pipeline. Schematic representation of the steps and program used by Karyon. Red circles represent possible user inputs. Blue boxes represent software used for each step. Orange hexagons represent files generated by the software. Red arrows indicate input to a program, and blue arrows represent output of a program. Thicker red arrows represent the standard pipeline, while thinner red arrows represent the different options the user can select to skip some of the steps. These options appear next to the arrow. Karyon uses the K-mer analysis toolkit (KAT) [36] to provide a k-mer (all possible sequences of length k) spectrum analysis as part of its report. From this analysis, it produces frequency histograms representing coverage versus k-mer counts. These plots inform on ploidy and heterozygosity of a genome. In a haploid genome, k-mers of enough size will appear either 1 or 0 times, with unique k-mers having an average coverage roughly equal to the average global depth of coverage. Deviations from these patterns suggest alternative architectures. For instance, the presence of 2 peaks in the k-mer plot typically indicates a genome that is totally or partially a nonhomozygous diploid. To complement these analyses and provide further information on the features of the genome, Karyon assesses scaffold length distributions, relationships between scaffold length and coverage, sliding-window analysis of coverage, and genetic variation, as well as allele frequency distributions per scaffold (Fig. 2). In addition, Karyon uses nQuire [38] to estimate the likelihood of different ploidy levels in sliding windows per scaffold. Karyon also incorporates BUSCO completeness analysis [45] with automatic taxonomic assignment. Altogether, the interpretation of these analyses can be used to detect polyploidies, aneuploidies, hybridizations, heterokaryosis, large segmental duplications, unusual DNA composition, or the presence of symbiont or contaminating sequences. Karyon generates a report file that summarizes the results of these analyses and raises some warning messages in case certain metrics are problematic, such as low BUSCO completeness, low percentage of mapped reads, or extreme GC% values. Karyon generates a series of original plots that aim to provide valuable information regarding the architecture of the problem assembly: Scaffold length plots (Fig. 3A). These plots represent the distribution of scaffold length through a barplot, where each value represents a single scaffold versus its length, with all scaffolds sorted from shortest to longest. Karyon generates the results in linear and logarithmic distribution. Very short scaffolds (shorter than 1 Kbp) might introduce noise in the analyses and might be interesting to just filter them out.
Figure 3:

Summary of Karyon plots. (A) Scaffold length plot. (B) Scaffold length versus coverage plot. In the example, scaffolds form 2 populations with different coverage, which suggests aneuploidy behavior. (C) Variation versus coverage plot. In the example, the genome forms a clear population with low SNP density and approximately 30× of coverage and a second, more diffuse population with higher SNP density and approximately 60× coverage. This behavior suggests a mix of haploid and diploid regions. (D) Fair coin analysis. The red line represents a simulated distribution assuming perfect 50% distribution of reference and alternative SNPs. Each blue line represents the empirical distribution of reference versus alternative SNPs per scaffold, which in this case all follow a diploid distribution. (E) Per-scaffold nQuire plot. The plot represents nQuire-generated normalized values across sliding windows for a single scaffold. Most windows have a high diploid score, which suggests that this particular scaffold is diploid.

Scaffold versus coverage (Fig. 3B). Karyon generates a scatterplot representing the average coverage versus length for each scaffold in the assembly. Quite often, short scaffolds have different coverage from most of the genome, which might be indicative of contamination or repetitive regions. Variation versus coverage plot (Fig. 3C). This plot allows the user to observe overall patterns across the whole genome. It uses a kernel density estimation over a cloud of dots. Each dot represents the number of single-nucleotide polymorphisms (SNPs) (x-axis) versus the average coverage (y-axis) in a window of the genome, typically 1 Kb. Presence of more than 1 population of dots is indicative of genomes with dual behavior, such as aneuploidies or loss of heterozygosity. Fair coin plot (Fig. 3D). This plot represents the proportion of alternative versus reference SNPs for the whole genome and for each individual scaffold. Vertical lines indicate expected frequencies of 0.5, 0.33, and 0.25, corresponding to ideal diploids, triploids, and tetraploids, respectively. An expected frequency is drawn, which is based on a per-site simulation of proportions assuming ideal 0.5 relative frequencies and random sampling equal to the coverage of the site. The plot is generated for the whole genome, as well as per scaffold. nQuire per scaffold plot (Fig. 3E). Karyon will run nQuire across sliding windows of defined length (by default 1 Kbp) across different scaffolds. The plot for each scaffold contains 5 subplots. The first 3 represent the nQuire score for diploid, triploid, and tetraploid for that particular window. This allows the user to visualize patterns of aneuploidy per scaffold, especially with regards to diploid and triploid regions. The fourth and fifth subplots represent the location and coverage of SNPs across the scaffold. A color code is assigned to represent the density. Since nQuire requires information of SNPs, homozygous regions cannot be assessed and will appear as missing data. Each of the steps is optional and can be controlled with flags in the main script. Additionally, the script uses a configuration file, which allows to define the options of each dependency program. This configuration file is automatically created during the installation and can be modified with any text editor. We encourage the user to make a copy of the original configuration file for future modification. Installation is fully automated, requiring no user input during the process. Summary of Karyon plots. (A) Scaffold length plot. (B) Scaffold length versus coverage plot. In the example, scaffolds form 2 populations with different coverage, which suggests aneuploidy behavior. (C) Variation versus coverage plot. In the example, the genome forms a clear population with low SNP density and approximately 30× of coverage and a second, more diffuse population with higher SNP density and approximately 60× coverage. This behavior suggests a mix of haploid and diploid regions. (D) Fair coin analysis. The red line represents a simulated distribution assuming perfect 50% distribution of reference and alternative SNPs. Each blue line represents the empirical distribution of reference versus alternative SNPs per scaffold, which in this case all follow a diploid distribution. (E) Per-scaffold nQuire plot. The plot represents nQuire-generated normalized values across sliding windows for a single scaffold. Most windows have a high diploid score, which suggests that this particular scaffold is diploid.

Genomic survey in the Mucorales (Fungi)

To showcase the use of Karyon, we undertook an analysis of deposited fungal genomes in the order Mucorales. Fungi are in a particularly privileged position to assess the impact of noncanonical genomic architectures in genome assemblies. Fungi generally have small and compact genomes and can be often cultured under axenic conditions. As a result, the amount of sequenced fungal genomes is now in the order of thousands, including multiple strains for many species. Even more, comprehensive efforts to obtain a balanced coverage of the existing fungal diversity are ongoing, such as the 1,000 fungal genomes [46] and the 1,000 yeast genomes initiatives [47-50]. Thus, fungi provide an excellent system to study the incidence of different genomic accidents in evolution [10, 51, 52]. Despite this, the quality of fungal genomes is often suboptimal, and databases are rife with highly fragmented assemblies. Genomic factors such as those discussed above might complicate genome assembly and be responsible for this observed fragmentation, at least partially. Considering this, we hypothesized that genome databases must contain a fraction of low-quality assemblies from fungal organisms that are caused by intrinsic genomic factors. If that is true, reanalysis of the raw data should lead us to describe novel genomic accidents and obtain a minimum estimate of their relative abundance. We thus applied Karyon to a set of 35 publicly deposited genomes from the fungal order Mucorales. Our results suggest that nonstandard genomic organizations are not rare and that future studies on other groups are likely to uncover many new cases. We selected the order Mucorales because this group comprises several described examples of whole-genome duplication, both ancient and recent [53, 54]. Many sequenced members of the clade come from clinical samples, an environment that is known to promote the emergence of different genomic accidents [52, 55, 56]. Additionally, several represented species included 2 or more sequenced isolates, allowing to get a glimpse at their intraspecific diversity. We obtained 35 genome assemblies from 19 different Mucorales species deposited in GenBank between 1 January 2005 and 31 December 2015 (Table 1). For 4 of the species, dipSPADes was unable to generate an assembly.
Table 1:

NCBI Assembly statistics for the analyzed strains. Strains with darker background possessed some property that was affecting assembly quality and was diagnosed using Karyon. Fragmentation in all remaining strains is attributed to low sequencing depth.

SpeciesNCBI genome size (Mbp)NCBI number of scaffoldsGenBank accessionGenome size after Karyon (Mbp)Number of scaffolds after KaryonDiagnosisReference
Rhizopus microsporus ATCC 62 41749.61,386 GCA_900000135.1 40.15,521Contamination[57]
Rhizopus microsporus CBS_344.2949.21,554 GCA_000825725.1 32.13,037Contamination[57]
Rhizopus microsporus B973875.15,266 GCA_000697275.1 71.612,789Misidentification[58]
Rhizopus microsporus var. rhizopodiformis B745548.74,658 GCA_000738565.1 21.82,176Contamination[58]
Rhizopus delemar type I NRRL 2178942.03,921 GCA_000697155.1 33.44,824Unknown[58]
Rhizopus delemar type II NRRL 2144638.91,156 GCA_000738605.1 33.75,071Unknown[58]
Rhizopus delemar type II NRRL 2144738.71,177 GCA_000738595.1 28.96,683Unknown[58]
Rhizopus delemar type II NRRL 2147740.81,808 GCA_000738585.1 NoneNoneUnknown[58]
Rhizopus oryzae 99–89239.11,168 GCA_000697725.1 29.61,875Low GC% (below 35%)[58]
Rhizopus oryzae HUMC0240.32,313 GCA_000697605.1 NoneNoneUnknown[58]
Rhizopus oryzae B740743.34,683 GCA_000696915.1 34.73,720Low GC% (below 35%)[58]
Rhizopus oryzae type I NRRL 1344043.45,022 GCA_000697075.1 NoneNoneUnknown[58]
Rhizopus oryzae type I NRRL 1814847.514,653 GCA_000697095.1 NoneNoneUnknown[58]
Rhizopus oryzae type I NRRL 2139642.84,445 GCA_000697115.1 34.24,115Unknown[58]
Rhizopus oryzae 99–13341.54,317 GCA_000697135.1 27.21,332Unknown[58]
Rhizopus oryzae 97–119242.94,566 GCA_000697195.1 NoneNoneUnknown[58]
Rhizopus stolonifer B9770385,567 GCA_000697035.1 30.16,406Unknown[58]
Mucor circinelloides B898736.72,210 GCA_000696935.1 29.94,864Unknown[58]
Mucor indicus B740239.83,117 GCA_000697295.1 32.1691Unknown[58]
Mucor racemosus B964565.56,360 GCA_000697255.1 46.04,444Hemidiploid, low GC% (below 35%)[58]
Mucor velutinosus B532835.92,411 GCA_000696895.1 28.22,743Unknown[58]
Lichtheimia corymbifera 008–04936.61,629 GCA_000697175.1 42.83,575Unknown[58]
Lichtheimia corymbifera B254136.61,176 GCA_000697475.1 13.23,575Unknown[58]
Lichtheimia ramosa B539945.63,968 GCA_000738555.1 26.6861Aneuploid, hybrid[58]
Saksenaea oblongisporus B335340.81,702 GCA_000697495.1 29.7622Unknown[58]
Saksenaea vasiformis B407842.52,417 GCA_000697055.1 32.71,506Unknown[58]
Cokeromyces recurvatus B548329.32,637 GCA_000697235.1 26.65,213Low GC% (below 35%)[58]
Syncephalastrum monosporum B892229.61,284 GCA_000697355.1 24.15,271Unknown[58]
Syncephalastrum racemosum B610129.61,035 GCA_000696955.1 23.3311Unknown[58]
Cunninghamella elegans B976931.71,380 GCA_000697015.1 30.85,465Low GC% (below 35%)[58]
Apophysomyces elegans B776038.51,528 GCA_000696995.1 29.31,293Unknown[58]
Apophysomyces trapeziformis B932435.81,400 GCA_000696975.1 30.1898Unknown[58]
Thermomucor indicae-seudaticae HACC 24329.61,958 GCA_000787465.1 25.74,118UnknownBusk et al., unpublished. Genome submitted in 2014.
Parasitella parasitica CBS 412.66 isolate NGI315 ade- mutant44.915,637 GCA_000938895.1 23.53,295Unknown[59]
NCBI Assembly statistics for the analyzed strains. Strains with darker background possessed some property that was affecting assembly quality and was diagnosed using Karyon. Fragmentation in all remaining strains is attributed to low sequencing depth. Karyon was run using the complete default pipeline. Most of the analyzed genomes (27, 79.4%) presented very low levels of heterozygosity and a relatively homogeneous coverage across the genome, suggesting that those strains are haploid or, if presenting higher ploidy, extremely homozygous. Fragmentation in these cases might be caused by insufficient coverage, presence of repetitive regions, or some other methodological constraints. However, our pipeline uncovered cases that produced anomalous results in the different Karyon tests. Many zygomycetous fungi exhibit low or very low GC%. In our dataset, 5 species showed GC% below our threshold of 35%, with several others approximating that value. Additionally, some of the analyzed genomes show signs of ploidy anomalies or contamination. Below we describe these cases and propose a plausible scenario to explain each of the obtained results based on the data obtained from the Karyon pipeline.

Rhizopus microsporus species complex

At the time of this study, the NCBI database had deposited sequences for 8 Rhizopus microsporus strains. Interestingly, 3 of them presented a genome size estimated around 25 Mbp, 4 had a genome size close to 50 Mbp, and 1 presented a genome size of 75 Mbp. Only the 3 strains with a genome size of 25 Mbp had sufficiently good assemblies considering they were based on short reads, with a scaffold number below 1,000, and thus were not selected for further analyses. Additionally, the raw libraries for one of the strains presenting a 50-Mbp genome assembly size (R. microsporus var. chinensis CCTCC M201021) were not publicly available and thus could not be part of the survey. For the remaining 3 strains with genome size close to 50 Mbp (ATCC62417, CBS344.29, and var rhizopodiformis B7455), our de novo assembly pipeline recovered a genome size of approximately 40 Mb, which is smaller than the assemblies deposited in NCBI (Table 1). The heterozygosity distribution in these assemblies shows that most of the genome presented a relatively uniform behavior with low heterozygosity. In all 3 cases, though, a considerable proportion of the genome appears with a highly variable coverage and increased heterozygosity (Fig. 4). For these 3 strains, BlobTools [25] shows widespread bacterial contamination (Fig. 4B), and thus we conclude that contamination might be responsible for the observed assembly fragmentation.
Figure 4:

Analysis of Rhizopus microsporus ATCC62417. (A) Variation versus coverage plot reveals the existence of a highly variable portion of the genome that presents variable heterozygosity levels. (B) BlobTools analyses suggest that the genome presents a considerable portion of contaminating sequences. Coverage of the sequences assigned to bacteria is very low when the analyses are performed with other libraries (data not shown), which proves that the conflicting signal observed in this sample has its origin in a contaminated sequencing library. Results for R. microsporus CBS344.5 and var. rhizopodiformis B7455 show similar patterns of contamination (data not shown).

Analysis of Rhizopus microsporus ATCC62417. (A) Variation versus coverage plot reveals the existence of a highly variable portion of the genome that presents variable heterozygosity levels. (B) BlobTools analyses suggest that the genome presents a considerable portion of contaminating sequences. Coverage of the sequences assigned to bacteria is very low when the analyses are performed with other libraries (data not shown), which proves that the conflicting signal observed in this sample has its origin in a contaminated sequencing library. Results for R. microsporus CBS344.5 and var. rhizopodiformis B7455 show similar patterns of contamination (data not shown). The remaining strain, B9738, showed a surprisingly large genome size in both the assembly deposited in NCBI (75 Mbp) and the one reconstructed here (71 Mbp). The genome of R. microsporus B9738 presents an extremely low level of heterozygosity and a very homogeneous coverage. The k-mer spectrum also shows just 1 very clear peak. All in all, all this suggests that B9738 is haploid (or a highly homozygous diploid), despite presenting a 3-fold increase in genome size as compared to other strains of the same species (Fig. 5). Augustus gene prediction returned a total of 21,300 gene models, which is an unusually large number for a filamentous fungus. As a reference, the 7 genomes in the Rhizopodaceae, to which Rhizopus belongs, available in Mycocosm range from 25 to 46 Mbp and from 10,781 to 17,676 annotated genes. Contamination analysis does not suggest the presence of widespread contamination that could explain such an overinflated genome (Fig. 5). For this reason, we suggest that B9738 might be a misidentified strain that does not belong to the R. microsporus species complex. Indeed, phylogenomic analyses recover B9738 as sister to a clade containing Mucor and Parasitella, rather than allied with the rest of the R. microsporus species clade (Fig. 6), thus supporting a misidentification. It is noteworthy that no sequenced species of either Mucor or Parasitella have genomes above 49 Mbp or with more than 15,000 genes, at least from the available genomes in Mycocosm.
Figure 5:

Analysis of Rhizopus microsporus B9738. (A) KAT k-mer plot shows very low genome compaction (black area), suggestive of a haploid genome. (B) Variation versus coverage plot reveals a single main behavior for the genome with regards to its SNP density and coverage. (C) BlobTools analysis shows no sign of widespread contamination that might be inflating the genome.

Figure 6:

Phylogenetic tree of Rhizopus microsporus B9738. Phylogenetic tree inferred from OrthoFinder. The Rhizopus microsporus species complex is marked in blue. The problematic strain, B9738, is marked in yellow.

Analysis of Rhizopus microsporus B9738. (A) KAT k-mer plot shows very low genome compaction (black area), suggestive of a haploid genome. (B) Variation versus coverage plot reveals a single main behavior for the genome with regards to its SNP density and coverage. (C) BlobTools analysis shows no sign of widespread contamination that might be inflating the genome. Phylogenetic tree of Rhizopus microsporus B9738. Phylogenetic tree inferred from OrthoFinder. The Rhizopus microsporus species complex is marked in blue. The problematic strain, B9738, is marked in yellow.

Mucor racemosus B9645

Analyses on Mucor racemosus B9645 depicted a genome with a dual behavior. The distribution of heterozygosity and coverage showed 2 peaks with very low heterozygosity but with different coverage (Fig. 7B). This was further confirmed by the k-mer spectrum analysis, which revealed 2 clear peaks (Fig. 7A). The genome available in NCBI is 65.5 Mbp long, noticeably larger than the 45.9 Mbp we recovered in our analyses (Table 1). The reduction step of Redundans cannot explain this difference, as the assembly size prior to this step is already 46.8 Mbp, very close to the final result. Our analyses suggest that contaminating sequences are very minor and do not explain the observed pattern (Fig. 7). We hypothesize that M. racemosus B9645 is a hemidiploid, which presents a portion of its genome in a haploid state and the other portion in a highly homozygous diploid state. Due to the low heterozygosity exhibited by this strain, the observed genome architecture might have arisen by either autopolyploidization followed by chromosome loss or by chromosomal duplications. Additionally, GC% for this species was only 32.6%.
Figure 7:

Analysis of Mucor racemosus B9645. (A) KAT k-mer plot shows 2 peaks of coverage considerably affected by genome reduction (black area), suggestive of a highly heterozygous diploid genome. (B) Variation versus coverage plot reveals a bimodal behavior for the genome with regards to its coverage, but both peaks appear with very low SNP density. (C) BlobTools analysis shows no sign of widespread contamination that might be inflating the genome.

Analysis of Mucor racemosus B9645. (A) KAT k-mer plot shows 2 peaks of coverage considerably affected by genome reduction (black area), suggestive of a highly heterozygous diploid genome. (B) Variation versus coverage plot reveals a bimodal behavior for the genome with regards to its coverage, but both peaks appear with very low SNP density. (C) BlobTools analysis shows no sign of widespread contamination that might be inflating the genome.

Lichtheimia ramosa B5399

The Karyon assembly for this genome was only 26.6 Mbp, much smaller than the NCBI assembly (45.6 Mpb long, Table 1). Unlike other genomes, our assembly presented a considerable improved quality, going from 3,968 scaffolds and N50 of 33,650 in the NCBI assembly to 861 scaffolds and N50 of 133,635 in our own assembly. L. ramosa presents a heterozygosity level around 3% in its diploid peak (Fig. 8). All considered, we propose that L. ramosa B5399 is a mix of haploid and diploid with high heterozygosity, likely resulting from mating between 2 distantly related strains followed by genomic aneuploidization.
Figure 8:

Analysis of Lichtheimia ramosa B5399. (A) KAT k-mer plot shows 1 peak with considerable genome compaction (black area) suggestive of a diploid genome. (B) Variation versus coverage plot reveals a unimodal behavior for the genome with regards to its coverage, presenting a widespread heterozygosity of approximately 3% (maximum density around 30 SNP/Kbp). (C) Alternative allele frequency shows that all scaffolds present a behavior very similar to the ideal diploid. (D) Scaffold length plot shows that, except for a group of very low-coverage scaffolds, the entire genome presents a uniform coverage.

Analysis of Lichtheimia ramosa B5399. (A) KAT k-mer plot shows 1 peak with considerable genome compaction (black area) suggestive of a diploid genome. (B) Variation versus coverage plot reveals a unimodal behavior for the genome with regards to its coverage, presenting a widespread heterozygosity of approximately 3% (maximum density around 30 SNP/Kbp). (C) Alternative allele frequency shows that all scaffolds present a behavior very similar to the ideal diploid. (D) Scaffold length plot shows that, except for a group of very low-coverage scaffolds, the entire genome presents a uniform coverage.

Methods

Sequencing data

We downloaded raw data from libraries deposited at the Short Read Archive (SRA)⁠ of those species in the Mucorales with a highly fragmented assembly (>1,000 scaffolds), which included at least 1 paired-end Illumina library larger than 1 Gb after quality filtering (Table 1), to ensure at least a decent coverage. Since most of our genomes have typical assembly sizes around 40 Mbp, this measure ensures a bare minimum average coverage of 20. All available sequencing libraries were used for all the analyses.

De novo gene annotation

We used Augustus v3.1.0. [60] to obtain a de novo gene prediction using the Rhizopus oryzae generalized hidden Markov model included in the default installation of Augustus.

Contamination detection

For each of the conflictive assemblies, we generated an Augustus prediction. Then, we used Blastp [61] to query the whole proteome against Uniref100 [62]. Since the genomes come from public databases, their own proteins should appear as hits, and thus we retrieved the 10 best hits. We have used these hits to assign a taxonomic profile. Additionally, we have used the predicted Augustus CDS to map sequencing reads with GATK. With both the taxonomic profile and the variant calling file, we have run BlobTools [25] to identify the presence of widespread contamination in the sequencing libraries.

Phylogenomic analyses

In order to identify the phylogenetic position of R. microsporus B9738, we used the Augustus gene prediction and the proteome of 24 other zygomycetes to run OrthoFinder v.2.3.3 [63] with the flags -S blast and -m msa.

Discussion

As genome sequencing has moved away from model organisms, it has become apparent that many possible genomic architectures are possible, and many do exist in a wide range of organisms. Most of these genomic accidents are difficult to identify from sequencing data alone. As far as we know, Karyon is the first software developed with the intention of performing reference-free analyses for the presence of a wide array of genomic factors affecting the quality of de novo genome assembly. We have designed this software to be easy to install and use, with the possibility of installation from both GitHub and Docker. Despite the success in the implemented strategy, we consider our software has several limitations. Karyon requires an assembly step and variant calling protocol, for which some default options are included. However, the included programs might not suit every need. For example, extremely large genomes might require alternative assemblers that are not included in our pipeline, or some users might prefer a different set of programs for the variant calling protocol. For those cases, Karyon can still be used as independent steps (Fig. 2). Karyon is designed to work without any preexisting data, which limits the information it can predict. Comparing different genome assemblies, especially if at least one of them has good quality, can help detect many of these alternative genomic architectures and some others that are outside the capabilities of Karyon. If other reference genomes are available, tools like QUAST [64] can generate similar analyses to Karyon with higher accuracy and speed. Despite the increasing use of long-read technologies for assembly purposes, a large amount of genome assemblies available in public databases have been generated exclusively from short reads. As of October 2021, NCBI SRA contains 173,087 DNA libraries for Fungi, of which 157,144 are Illumina short reads, and only 6,163 are long reads (4,700 PacBio and 1,463 Nanopore). At this moment, the pipeline assumes the use of at least 1 Illumina paired-end sequencing library. Because of this, we recommend the use of other genome assemblers if other sequencing technologies (i.e., Nanopore or PacBio long reads) are to be used, and the same goes for variant calling protocols. We provided a practical example of the usage of Karyon on a publicly available set of fungal genomes from the order Mucorales. While most analyzed assemblies show no sign of any of the considered biological conditions, we were able to effectively find underlying nonstandard genomic architectures that had been previously unnoticed in these assemblies. These results suggest that many authors do not take into consideration these kinds of genomic accidents, which in turn greatly hampers the results that might be obtained from them. How common are these nonstandard genomic architectures? Our results suggest that they might be quite abundant, although so far they are restricted to a limited selection of species within a narrow clade of Fungi. As such, these genomic anomalies might, or might not, be common in other lineages. However, we consider that there are 3 important arguments in favor for considering our dataset an underestimation of the abundance of unorthodox fungal genomes, even within the limited taxonomic range we have selected. The first one is the fact that fungal biomass used for DNA extraction and subsequent sequencing typically comes from cultures. This implies an important ecological step in which the fungus grows at optimal speed and in the absence of most stressors. Aneuploidies, polyploidies, and other similar genomic rearrangements are common in the presence of stressors [9, 10, 52, 65] but seem to be outcompeted by euploid cells under optimal growth conditions [24, 66, 67]. Hence, isolates growing in rich medium will be selected to lose most chromosomal aberrations they might present. Analogously, many of these chromosomal aberrations might exist in nature but are unable to grow on optimal medium. The advance of environmental sequencing and single-cell-based technologies might cast some light in this matter in coming years. Supporting this argument, Ahrendt et al. [68] sequenced several environmental isolates of zoosporic and zygomycetous microfungi using these techniques and found several aneuploids and polyploids. The frequency of unconventional genomic architectures is very likely lineage dependent. While some of these are well known, such as the dikaryotic phase in Agaricomycetes or the macro- and micronuclei of ciliates, strange genomic architectures might be common in more obscure lineages. This not only represents a yet-to-know facet of the biology of these organisms but could also potentially complicate their study. The third factor to consider is purely human. The datasets we have analyzed were uploaded by researchers who considered they were good enough to be uploaded to a public repository. Thus, it is to be expected that many more low-quality assemblies would have never been never deposited and sit forgotten in the disks of laboratory computers, if not discarded completely. Even if we consider these possible biases as negligible, our results recover a significant fraction of publicly available genomes with unorthodox genomic configurations. These have been correlated in many fungal groups with adaptation to novel environments [69-71], resistance to antifungals [72, 73], pathogenic capabilities toward both animals [55, 74–76] and plants [77, 78], and adaptation to industrial settings [47, 79–83]. Beyond that, contamination in sequencing libraries is a problem that can affect any assembly project and might mislead downstream inferences if left unaddressed. Validation of published results goes far beyond the interest of discovering overlooked findings. Comparative genomic studies are limited in their scope and reliability by the quality of assembly and annotation of the genomes, factors that can be greatly compromised by these biological factors. Comparative studies commonly require the use of flagship genomes that represent a given taxon. Often, this generates a chronology of comparisons versus the reference that shapes the perspective on the group. As such, artifacts and errors in strategic genome assemblies, such as reference strains or strains in groups with few represented species, might have a domino effect, impacting future studies. Long-read sequencing technologies, which are increasingly being used for genome assembly projects, hold the promise of providing much more information that could be used to resolve many of these unorthodox genomic architectures. However, these approaches require novel computational approaches to fully employ their potential.

Availability of supporting source code and requirements

Project name: Karyon Project homepage: https://github.com/Gabaldonlab/karyon Operating system(s): e.g., Linux, any with the Docker image Programming language: Python, Bash Other requirements: Check the installation License: GNU GPL v3 RRID: SCR_022544 biotools ID: karyon

Data Availability

All data used for this study were downloaded from NCBI SRA. Table 1 contains accession numbers for all the data. An archival copy of the code and test data is available via the GigaScience database GigaDB [84].

Abbreviations

BUSCO: Benchmarking Universal Single-Copy Orthologs; NCBI: The National Center for Biotechnology Information; SNP: single-nucleotide polymorphism; SRA: Short Read Archive.

Competing Interests

The authors state that they have no conflicts of interests.

Funding

Supported by the Spanish Ministry of Science and Innovation (grant PGC2018-099921-B-I00), cofounded by the European Regional Development Fund (ERDF); the Catalan Research Agency (AGAUR), SGR423; the European Union's Horizon 2020 research and innovation program (ERC-2016–724173); the Gordon and Betty Moore Foundation (grant GBMF9742); and the Instituto de Salud Carlos III (IMPACT grant IMP/00019 and CIBERINFEC CB21/13/00061-ISCIII-SGEFI/ERDF). Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Zhong Wang -- 6/29/2021 Reviewed Click here for additional data file. Zhong Wang -- 12/2/2021 Reviewed Click here for additional data file. Michael F. Seidl -- 7/4/2021 Reviewed Click here for additional data file. Michael F. Seidl -- 11/29/2021 Reviewed Click here for additional data file. Kamil S. Jaron -- 7/10/2021 Reviewed Click here for additional data file. Kamil S. Jaron -- 1/10/202 Reviewed Click here for additional data file. Kamil S. Jaron -- 5/17/2022 Reviewed Click here for additional data file.
  82 in total

1.  A shift in nuclear state as the result of natural interspecific hybridization between two North American taxa of the basidiomycete complex Heterobasidion.

Authors:  Matteo Garbelotto; Paolo Gonthier; Rachel Linzer; Giovanni Nicolotti; William Otrosina
Journal:  Fungal Genet Biol       Date:  2004-11       Impact factor: 3.495

Review 2.  Towards plant pangenomics.

Authors:  Agnieszka A Golicz; Jacqueline Batley; David Edwards
Journal:  Plant Biotechnol J       Date:  2015-11-23       Impact factor: 9.803

3.  BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors:  Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal:  Bioinformatics       Date:  2015-06-09       Impact factor: 6.937

4.  Molecular evidence for the existence of natural hybrids in the genus Zygosaccharomyces.

Authors:  Stephen A James; Christopher J Bond; Malcolm Stratford; Ian N Roberts
Journal:  FEMS Yeast Res       Date:  2005-05       Impact factor: 2.796

5.  Hybridization and emergence of virulence in opportunistic human yeast pathogens.

Authors:  Verónica Mixão; Toni Gabaldón
Journal:  Yeast       Date:  2017-09-14       Impact factor: 3.239

6.  The Influence of Polyploidy on the Evolution of Yeast Grown in a Sub-Optimal Carbon Source.

Authors:  Amber L Scott; Phillip A Richmond; Robin D Dowell; Anna M Selmecki
Journal:  Mol Biol Evol       Date:  2017-10-01       Impact factor: 16.240

7.  SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors:  Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal:  Gigascience       Date:  2012-12-27       Impact factor: 6.524

8.  Insights into the Dekkera bruxellensis genomic landscape: comparative genomics reveals variations in ploidy and nutrient utilisation potential amongst wine isolates.

Authors:  Anthony R Borneman; Ryan Zeppel; Paul J Chambers; Chris D Curtin
Journal:  PLoS Genet       Date:  2014-02-13       Impact factor: 5.917

9.  MycoCosm portal: gearing up for 1000 fungal genomes.

Authors:  Igor V Grigoriev; Roman Nikitin; Sajeet Haridas; Alan Kuo; Robin Ohm; Robert Otillar; Robert Riley; Asaf Salamov; Xueling Zhao; Frank Korzeniewski; Tatyana Smirnova; Henrik Nordberg; Inna Dubchak; Igor Shabalov
Journal:  Nucleic Acids Res       Date:  2013-12-01       Impact factor: 16.971

10.  An integrated genomic and transcriptomic survey of mucormycosis-causing fungi.

Authors:  Marcus C Chibucos; Sameh Soliman; Teclegiorgis Gebremariam; Hongkyu Lee; Sean Daugherty; Joshua Orvis; Amol C Shetty; Jonathan Crabtree; Tracy H Hazen; Kizee A Etienne; Priti Kumari; Timothy D O'Connor; David A Rasko; Scott G Filler; Claire M Fraser; Shawn R Lockhart; Christopher D Skory; Ashraf S Ibrahim; Vincent M Bruno
Journal:  Nat Commun       Date:  2016-07-22       Impact factor: 14.919

View more
  1 in total

1.  Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other nonstandard architectures in genome assemblies.

Authors:  Miguel A Naranjo-Ortiz; Manu Molina; Diego Fuentes; Verónica Mixão; Toni Gabaldón
Journal:  Gigascience       Date:  2022-10-07       Impact factor: 7.658

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.