The existence of Metabolic Gene Clusters (MGCs) in plant genomes has recently raised increased interest. Thus far, MGCs were commonly identified for pathways of specialized metabolism, mostly those associated with terpene type products. For efficient identification of novel MGCs, computational approaches are essential. Here, we present PhytoClust; a tool for the detection of candidate MGCs in plant genomes. The algorithm employs a collection of enzyme families related to plant specialized metabolism, translated into hidden Markov models, to mine given genome sequences for physically co-localized metabolic enzymes. Our tool accurately identifies previously characterized plant MGCs. An exhaustive search of 31 plant genomes detected 1232 and 5531 putative gene cluster types and candidates, respectively. Clustering analysis of putative MGCs types by species reflected plant taxonomy. Furthermore, enrichment analysis revealed taxa- and species-specific enrichment of certain enzyme families in MGCs. When operating through our web-interface, PhytoClust users can mine a genome either based on a list of known cluster types or by defining new cluster rules. Moreover, for selected plant species, the output can be complemented by co-expression analysis. Altogether, we envisage PhytoClust to enhance novel MGCs discovery which will in turn impact the exploration of plant metabolism.
The existence of Metabolic Gene Clusters (MGCs) in plant genomes has recently raised increased interest. Thus far, MGCs were commonly identified for pathways of specialized metabolism, mostly those associated with terpene type products. For efficient identification of novel MGCs, computational approaches are essential. Here, we present PhytoClust; a tool for the detection of candidate MGCs in plant genomes. The algorithm employs a collection of enzyme families related to plant specialized metabolism, translated into hidden Markov models, to mine given genome sequences for physically co-localized metabolic enzymes. Our tool accurately identifies previously characterized plant MGCs. An exhaustive search of 31 plant genomes detected 1232 and 5531 putative gene cluster types and candidates, respectively. Clustering analysis of putative MGCs types by species reflected plant taxonomy. Furthermore, enrichment analysis revealed taxa- and species-specific enrichment of certain enzyme families in MGCs. When operating through our web-interface, PhytoClust users can mine a genome either based on a list of known cluster types or by defining new cluster rules. Moreover, for selected plant species, the output can be complemented by co-expression analysis. Altogether, we envisage PhytoClust to enhance novel MGCs discovery which will in turn impact the exploration of plant metabolism.
Metabolic Gene Clusters (MGCs) are genomically co-localized and potentially co-regulated genes of a particular metabolic pathway. In contrast to bacterial operons they are not under the control of a single transcriptional unit. MGCs are commonly found in fungal genomes and it was long assumed that MGCs occur only as an exception in plants (1). However, this assumption has been challenged in recent years; >20 MGCs have been experimentally characterized in a variety of species, the majority associated with plant specialized metabolism (2–6). Plant MGCs span genomes of both mono- and dicotyledonous species mediating the biosynthesis of end-products associated with different chemical classes including benzoxazinoids, cyanogenic glycosides, terpenoids and alkaloids (3). Significantly, the range of reported plant MGCs includes biosynthetic reactions for the synthesis of pharmaceutically and agronomically important chemicals, e.g., the anti-tumor alkaloid noscapine in poppy, the anti-nutritional steroidal alkaloids in potato or bitter cucurbitacintriterpenoids in cucumber (7–9).A common descriptor of plant MGCs is the adjacent localization of at least three, occasionally two (10), non-homologous biosynthetic genes that encode enzymes involved in the synthesis of specialized metabolites (4). It appears that plant MGCs evolved independently from prokaryotic operons. Typically, plant MGCs are composed of one gene encoding a so-called ‘signature enzyme’, i.e. an enzyme which catalyzes the first committed step of the biosynthetic pathway and synthesizes the scaffold of the following specialized metabolites. The remaining genes encode subsequent ‘tailoring enzymes’ which modify the scaffold to form the desired end-product (4). Moreover, as signature genes share homology with genes of plant primary metabolism it is widely assumed that plant MGCs were formed by the recruitment of additional tailoring enzymes through gene duplication and neo-functionalization.The finding of MGCs in plant genomes accelerates pathway discovery and the capacity to metabolically engineer desired specialized metabolites (11,12). At the same time, it also increases the demand for plant genome sequencing and genome-mining. Consequently, several in silico approaches were undertaken to systematically screen plant genomes for putative MGCs. In one of the earliest attempts, the authors presented co-expression analysis of neighboring genes in Arabidopsis across 1469 experimental conditions (13). Based on their analysis, 100 putative clusters were identified of which 34 were significantly co-expressed, containing 3–22 genes and 27 duplicated gene pairs.A different study focused merely on the large class of terpenoids by examining the distribution of pairs of terpene synthases (TS) and cytochrome p450s (CYPs) across 17 sequenced plant genomes (14). TSs and CYPs have been found in several reported MGCs as the product of TS activity is typically further decorated by CYPs enzymes. It appeared that physically co-localized pairs of TS/CYPs were found much more frequently than by chance. By further investigating the predominance of certain TS/CYPs pairs across species the authors uncovered different mechanisms of pathway assembly in eudicots and monocots.As part of a more extensive analysis that investigated diversification of metabolism in 16 plant species, Chae et al. (10) examined the clustering of metabolic genes across primary and specialized metabolism. The authors found approximately one-third of the metabolic genes in Arabidopsis, soybean, and sorghum and one-fifth of the genes in rice matching these criteria. In the case of Arabidopsis, the authors tested whether clustered genes exhibit significant co-expression patterns. The results indicated that putative MGCs associated with specialized metabolism are more likely to be co-expressed than their putative counterparts from primary metabolism. However, using the same data set, Omranian et al. (15) demonstrated that conclusions drawn from these results should be handled with care as patterns of co-expression between specialized and non-specialized metabolic pathways may differ depending on the considered correlation.In a recent study Schläpfer et al. (16) presented PlantClusterFinder, a pipeline for the identification of MGCs. The authors presented a case study in 18 plant species in which they employed the following criteria to define a MGC: a minimal contiguous stretch of the genome that includes (i) at least three enzyme-coding genes involved in small molecule metabolism that catalyze at least two reactions; (ii) more than a single group of tandem duplicated genes and (iii) an enrichment of metabolic genes. Across the investigated species, the authors detected 11 969 MGC candidates out of which >1700 clusters could potentially be involved in the generation of specialized metabolites. While the program can be downloaded from the Institute's webpage, it does not contain an easy-to-use graphical user interface.AntiSmash, the primary tool for online MGCs search was initially developed for the detection of clusters in fungi and bacteria (17–19). The tool employs a hidden Markov model (HMM) search algorithm to detect genes that are specific for certain types of known clusters, i.e. the tool is confined to already characterized gene cluster types.Despite the insights generated by these studies, a flexible and user friendly framework for identifying plant MGCs is required. Such an approach should be applicable to a large set of genomes as well as be comprehensive and comprise searching capabilities beyond the known MGCs types. Ideally, it should allow the user to query for particular enzyme classes of interest and to adjust search criteria to the species under consideration and the specific research question. Following the needs described above, we developed and applied PhytoClust, an in silico MGC prediction tool. PhytoClust allows the search for known plant MGC types as well as mining for novel types of clusters (i.e. in terms of enzyme class composition). Co-expression analysis of genes located in candidate clusters is available for selected plant species. We anticipate that PhytoClust will enhance the characterization of novel MGCs in a wide range of plant species as new genome assemblies become available.
MATERIALS AND METHODS
Choice of software for building the PhytoClust tool
The initial step for designing a computational tool to detect MGCs in plants was to screen the literature for existing tools for MGC discovery in bacteria and fungi. The microbial community is ahead of the plant community by thousands of sequenced genomes and many more experimentally characterized operons/gene clusters. After initial testing of several tools for cluster detection in microorganisms (also reviewed here (20)) we decided to build our tool on the Core-implementation of Antismash (18), as the tool was designed for both pro- and eukaryotic genomes and relies on an input DNA sequence or annotated nucleotide files only. The plug-and-play architecture of the program made it suitable for adjustment to plant genomes analysis.
The PhytoClust workflow
The PhytoClust workflow follows the core Antismash implementation detailed in (18). The cluster analysis pipeline uses as an input GBK, EMBL or FASTA files and either extracts genes from the input file, or if not supplied, predicts them from the input nucleotide sequence using GlimmerHMM (21). Representation of protein families known to catalyze reactions in plant specialized metabolism (‘marker enzymes’) as hidden Markov models (HMMs) enables the representation of combinations of these marker enzymes by cluster rules (see ‘Translating specialized metabolism enzyme families into hidden Markov models’, Table 1, and Supplementary Information 1 and 2). These cluster rules in turn are used to search the genome of interest for combinations of marker enzymes within a given distance (‘cluster range’). The cluster detection algorithm employs a HMM-based search (22) for the occurrence of marker enzymes according to the user-defined cluster rules. Additionally, a (user-defined) ‘flanking region’ on both sides of the cluster candidate is scanned for the occurrence of any other marker enzyme from the collection of specialized metabolism enzyme families and the results are added to the output. As a consequence of this procedure, the search algorithm is rendered comprehensive and enables the detection of novel MGC types that were not explicitly searched for. Optional co-expression analysis is detailed under ‘The co-expression module in PhytoClust’. The output of the results includes HTML, GBK, EMBL, TXT and XLS files.
Table 1.
List of enzyme families associated with specialized metabolism. Literature screening and manual curation to avoid redundancy resulted in a collection of enzyme families associated with plant specialized metabolism; see (4,23–25) and references within
Translating specialized metabolism enzyme families into hidden Markov models
We compiled a list of 26 enzyme families known to catalyze reactions in plant specialized metabolism (i.e. ‘marker enzymes’) through literature mining (4,23–25). This manually curated list was subsequently translated into a library of hidden Markov models by extracting relevant entries from the PFAM database (26) (Supplementary Information 1).
Input genomes and databases
A collection of 31 publicly available plant genomes was subjected to analysis by PhytoClust. The selected genomes cover a broad phylogenetic spectrum, including two green and a red algae, a lycophyte and a moss, as well as representative and agronomically important species from the monocot and the dicot lineage, including Arabidopsis, cabbage, tomato, potato, rice and maize. For reasons of reproducibility and runtime constraints we constrained our choice of genome assemblies to a selection of genomes with the highest assembly quality that required a relatively short runtime. All genomes (except for Lotus japonicus) used for this analysis were downloaded as .embl files from Ensembl plants release 31 (27,28). A comparative analysis with genomes downloaded previously from Ensembl plants release 25 resulted in the same outcome. The genome for L. japonicus was obtained from the PlantGDB (Version 1.0) (29). Calculations on a 64-bit Linux computer with 3.00 GHz × 16 took between minutes to days depending on genome size and quality as well as on the applied cluster rules.
Sensitivity of MGC prediction to different proximity ranges
We performed a sensitivity analysis for our MGC search algorithm by re-running the algorithm for 10, 20, 30, 40, 60, 80, 100 and 120 kb proximity of any combination of two marker enzymes. Significantly, the results, in terms of cluster number remained similar for proximity ranges between 10 and 40 kb (5355–5694 MGCs). Therefore, we chose the combination of 20kb cluster range and 20 kb flanking region as the default setting for the presented analysis. Note that for different cluster ranges, the detected MGCs do not necessarily need to be identical. Differences may arise as marker enzymes in the vicinity of the cluster might be detected and included in the putative MGC or not depending on the chosen proximity range. Note that the used cluster range in our search for combinations of any two enzymes is not equal to the length of the detected cluster as a whole but rather represents the maximum distance between two adjacent marker enzymes. Therefore, depending on the number of marker enzymes in the cluster the complete cluster can be longer.
Randomized MGC search
To obtain reference values for how many of the detected marker genes would be co-localized in the cluster proximity by chance a randomization study was performed based on the gene coordinates for A. thaliana, Z. mays, O. sativa and S. lycopersicum obtained from Plant Metabolic Network (PMN), RiceCyc (30) and SolCyc (31). The analysis was limited to those four species with sufficient gene coordinates available. During the randomization process, the detected marker genes were re-assigned to randomly chosen gene coordinates from the dataset. Based on this ‘shuffled genome’, we calculated the percentage of marker enzymes that form a ‘random MGC’ based on 10 000 randomizations. By following this approach, the randomization procedure accounts for the chromosome organization and the right representation of gene rich regions versus intergenic regions.
Clustering of species by MGC types
We performed clustering analysis of the 31 investigated genomes based on MGCs similarity. As an input for the clustering we generated a binary matrix that includes information about the absence or presence of a particular MGC type. The cosine similarity (R package ‘lsa’) was used to obtain a distance matrix on which hierarchical clustering was performed. As a clustering method agglomerative complete linkage method was chosen. The cluster similarity heat-map was generated using the ‘gplots’ package.
The co-expression module in PhytoClust
In this study, we performed co-expression analysis for Arabidopsis, rice and tomato using the publically available datasets described in (8, 32–34). These data were subjected to Pearson-correlation based co-expression analysis. Two main assays were carried out including: (i) comparison if there was any enrichment for co-expression of all marker genes located in putative MGCs with respect to the co-expression patterns of all marker genes detected (reference background co-expression) for any of the four species. (ii) Testing if individual MGC candidates exhibit significantly higher co-expression of marker genes with respect to the background distribution of the species.The same data are available in the co-expression module on the web server. If the co-expression option is selected the user can set a custom threshold for the correlation coefficient. Subsequently, for any of the MGC candidates, Pearson-based co-expression is performed for (i) the detected marker genes in the MGC candidate, (ii) all genes located in the range of the MGC candidate and (iii) all genes in the range of the MGC candidate and the flanking region. If any of the genes under investigation show higher correlation than the threshold, a correlation heatmap for the respective genes is generated and displayed in the results and additionally written to an output file. The co-expression results can only be treated as a support for the identification of a MGC candidate but not as disapproval. A gene cluster that does not show co-expression in our pipeline might as well exhibit coordinated expression when using either a more comprehensive dataset or data from different experimental conditions.
Statistical analysis
Statistical analyses were performed using R and the ‘stats’ package for Fisher Exact test (fisher.test) and Wilcoxon Rank test (wilcox.test).
RESULTS
The PhytoClust tool to detect metabolic gene clusters in plants
We developed PhytoClust, a software that uses a collection of enzyme families of plant specialized metabolism (Table 1) translated into HMMs and mines a given genome sequence for co-localized metabolic enzymes of these families (‘marker enzymes’) (Supplementary Information 1). Using the core implementation of AntiSmash (18) the search query parameters are defined by ‘cluster rules’ which include (i) the names of enzyme families of interest, (ii) the span of chromosomal region in which genes are clustered (‘cluster range’) and (iii) the flanking region to be searched for additional marker enzymes (Figure 1). PhytoClust enables the search for currently known gene cluster types as well as setting individual criteria with a combination of up to four enzyme families. It also allows searching for tandem repeat gene clusters. Once the search query has been processed the results can be examined in a web-browser or downloaded for further offline analysis. Additionally, for a small collection of well annotated genomes and available transcriptome data sets we developed a co-expression module that provides co-expression analysis for genes located in candidate clusters based on a user-defined co-expression threshold and visualizes the results as heat-maps. PhytoClust is available via a web-server at PhytoClust.weizmann.ac.il.
Figure 1.
Schematic representation of the cluster detection algorithm. PhytoClust searches a given plant genome for genes representing enzyme families associated with specialized metabolism (‘marker enzymes’) based on a hidden Markov model search algorithm. The search query is formalized by ‘cluster rules’ that contain the query enzyme families and parameters for the cluster range (sequence range in which the marker enzymes are to be detected) and the flanking region (proximity of the cluster). A hit is recorded if genes encoding the query enzymes are detected within the cluster range. Additional marker enzymes from the marker enzyme library which are detected within the cluster range or the flanking region of the cluster will be automatically added to the results. Analogously, neighboring gene clusters will be merged if they are within reach of each other's flanking region.
Schematic representation of the cluster detection algorithm. PhytoClust searches a given plant genome for genes representing enzyme families associated with specialized metabolism (‘marker enzymes’) based on a hidden Markov model search algorithm. The search query is formalized by ‘cluster rules’ that contain the query enzyme families and parameters for the cluster range (sequence range in which the marker enzymes are to be detected) and the flanking region (proximity of the cluster). A hit is recorded if genes encoding the query enzymes are detected within the cluster range. Additional marker enzymes from the marker enzyme library which are detected within the cluster range or the flanking region of the cluster will be automatically added to the results. Analogously, neighboring gene clusters will be merged if they are within reach of each other's flanking region.
PhytoClust accurately predicts known plant metabolic gene clusters
To demonstrate the quality of PhytoClust we first attempted to detect established plant MGCs located in genomes having high-quality assemblies. These included the Arabidopsis marneral (35) and thalianol clusters (2), Solanum lycopersicum (tomato) terpene cluster (36), Solanum tuberosum (potato) and Solanum lycopersicum steroidal glycoalkaloids cluster (8), Oryza sativa (rice) momilactone (37) and phytocassane cluster (38) and the Zea mays (maize) DIMBOA cluster (39–42). Based on the structure of these MGCs we defined cluster rules for the ‘known-gene-cluster-search-module’ (Supplementary Information 2). Significantly, we accurately find the above mentioned MGCs among few candidate MGCs that fit the search criteria (between one to five MGC candidates, depending on the cluster type and species). The search criteria and results are summarized in Figure 2.
Figure 2.
Graphical representation of known MGCs investigated in this study. ‘Type of cluster’ shows the cluster name, species, and the proximity in which the respective enzymes are located. ‘Structure’ gives a graphical representation of the cluster's physical map (approximation). ‘Number of detected cluster’ lists the total number of hits obtained when running PhytoClusts’ ‘known-gene-cluster-search’ including the characterized gene cluster. All eight gene clusters in species with high-quality genome assemblies were detected with high accuracy. Image adapted from (4).
Graphical representation of known MGCs investigated in this study. ‘Type of cluster’ shows the cluster name, species, and the proximity in which the respective enzymes are located. ‘Structure’ gives a graphical representation of the cluster's physical map (approximation). ‘Number of detected cluster’ lists the total number of hits obtained when running PhytoClusts’ ‘known-gene-cluster-search’ including the characterized gene cluster. All eight gene clusters in species with high-quality genome assemblies were detected with high accuracy. Image adapted from (4).
Benchmarking PhytoClust with existing tools for MGCs identification
To benchmark PhytoClust against existing MGCs prediction tools we compared our results to AntiSmash (version 3.0.5) and to the results obtained by the PlantClusterFinder. As plant genomes are very large and due to restrictions in upload file size of the AntiSmash server, we were not able to upload the whole plant genome assemblies, but had to limit our test cases to confined sequences within the plant genomes containing the actual MGCs shown in Figure 2. AntiSmash identified the marneral gene cluster in Arabidopsis (detected as Terpene cluster), the momilactone and phytocassane gene cluster in rice (detected as Terpene—Momilactone and biosynthetic gene cluster and Terpene—Phytocassane/Oryzalides biosynthetic gene cluster, respectively), the terpene gene cluster in tomato (detected as Terpene gene cluster) and the chromosome 12-located part of the steroidal glycoalkaloid gene cluster in tomato (detected as a Terpene gene cluster). MGCs not detected by AntiSmash included the thalianol gene cluster in Arabidopsis, the chromosome 7-located part of the glycoalkaloid gene cluster in tomato and potato, the chromosome-12-located part of the steroidal glycoalkaloid gene cluster in potato, and the DIMBOA gene cluster in maize. All of the mentioned clusters were accurately detected by PhytoClust, as outlined in the previous section. The results of the webserver search are summarized in Table 2. As our search tool is based on the core-implementation of AntiSmash, we conclude that the observed differences are mainly due to the different search criteria implemented in the cluster rules in PhytoClust (see also Materials and Methods).
Table 2.
Comparison between PhytoClust and AntiSmash (Version 3.0.5). As a test case, the eight characterized gene clusters in Arabidopsis, rice, tomato, potato, and maize were searched for. Note, that Solanine 1, Tomatine 1 refer to the split cluster part located on chromosome (Chr.) 7 and Solanine/Tomatine 2 refer to the split cluster part located on 12. The column ‘PhytoClust’ shows the results obtained using the ‘known-gene-cluster-search-modus’. Column ‘AntiSmash’ shows the results obtained when applying the AntiSmash pipeline to the same sequence within the respective plant genome
Solanum L Tomatine 1 - Solanum T/L Solanine/Tomatine 2
Terpene
Position: 788715-1062587
Solanum tuberosum
Chr. 7
Solanine 1
Solanum T Solanine 1
/
Position: 41670011-42005447
Chr. 12
Solanine/Tomatine 2
Solanum T solanine 1-Solanum T/L Solanine/Tomatine 2
Position: 5803461-5978242
Zea mays
Chr. 4
DIMBOA
Zea M DIMBOA
/
Position: 3001405-3322257
Schläpfer et al. (16) reported that 13 known MGCs could be found in the genomes analyzed in their study. The authors were unsuccessful in detecting the steroidal glycoalkaloids gene cluster in potato due to a large chromosomal assembly gap in this region. The remaining 12 clusters were detected partially. Whereas the marneral gene cluster in Arabidopsis was identical to the published cluster, the DIMBOA MGC in maize was missing known genes at both ends of the predicted cluster. The other 10 predicted MGCs included additional metabolic genes that have not yet been characterized.
PhytoClust identifies putative novel MGCs in plant and algal genomes
To assess the relationship between MGCs and plant taxonomy we applied PhytoClust to a selection of 31 high-quality genome sequences from the green lineage (Figure 3), including three algae species, a moss and a lycophyte, 12 monocotyledonous and 14 dicotyledonous species.
Figure 3.
Taxonomic tree of the 31 investigated plant species. The tree is based on NCBI taxonomy and highlights some of the investigated plant families and their taxonomic relationship.
Taxonomic tree of the 31 investigated plant species. The tree is based on NCBI taxonomy and highlights some of the investigated plant families and their taxonomic relationship.In our analysis, we used an artifice to detect putative new cluster types. Initially, the genomes under investigation were scanned for co-localized marker enzyme combinations of any two enzyme families. Additionally, we took advantage of the greediness of the search algorithm that automatically merges any marker enzymes detected within the cluster range and the flanking region with the cluster itself, as well as neighboring clusters, if they are in close proximity (see Figure 1 for a graphical illustration). Using this artifice we were able to detect not only known MGCs but also novel combinations of marker enzymes that represent putative new cluster types. As additional criteria, in our further analysis we only considered those clusters that contained at least three marker enzymes of which at least two belonged to different enzyme families. Note, that the large superfamily of CYPs represents a collection of CYP sub-families which should be regarded as different types of marker enzymes but are represented by the same homolog PFAM domain and therefore cannot be distinguished by the search algorithm. To investigate the robustness of the search we performed the analysis for cluster ranges of 10–120 kb in-between each two enzymes and allowed for a 20 kb flanking region (see Materials and Methods). Across this range, we obtained qualitatively similar results, more precisely; we obtained very similar results in terms of cluster number for cluster ranges between 10 and 40 kb. Across these investigated proximity ranges the total number of detected clusters varied between 5355 and 5694. Therefore, results from the 20 kb cluster range search will be further analyzed and discussed in the following.Moreover, we repeated main parts of the here presented study for 100 kb cluster ranges and compared the results (Supplementary Information 3). We would like to point out that the 20 kb is not an upper limit on the size of the MGC candidate but represents the maximum distance between two adjacent marker enzymes. The maximum size of the whole MGC candidate is thereby determined by the number of detected marker enzymes.Our exhaustive search for co-localized enzymes from 26 specialized metabolism enzyme families resulted in the detection of in total 1232 putative MGCs types and a total number of 5531 putative clusters across the 31 plant species under investigation. The detected MGCs contained between 3 to 30 marker enzymes, whereas the most common cluster length across all species investigated was 3 (Figure 4A). Note, that throughout the analysis, gene clusters were considered as belonging to the same type of cluster, if their marker enzymes and the order along the chromosome were identical. This requirement was applied for the following two reasons: (i) in gene clusters that were characterized in several species the order of marker enzymes was typically (partially) conserved (8,43,44) and (ii) it was shown that genes in the noscapine MGC appeared to exhibit temporal collinearity, i.e. the gene order on the chromosome corresponds roughly to the temporal order of gene activation in the pathway (7). While this is the first example of this phenomenon in plants, collinearity was repeatedly observed in bacteria and filamentous fungi and it might be observed in plants more often in future studies.
Figure 4.
Characteristics of the detected MGCs—cluster size (number of marker enzymes) and relationships between number of marker enzymes, number of coding genes, number of clusters and genome size. (A) Distribution of MGC sizes (number of marker enzymes) for all 31 investigated species. Shown is the absolute number of MGCs with different numbers of marker enzymes per species (common names are indicated in brackets if available). The most common cluster length is 3. (B) Relationships between genome size, number of detected gene clusters, and number of members of secondary metabolism enzyme families (marker enzymes) for the 31 investigated species. Species names are indicated by a three-letter abbreviation code. Only the relationship between the number of marker enzymes and the genome size shows a positive correlation (R2 = 0.65) for the 31 investigated species.
Characteristics of the detected MGCs—cluster size (number of marker enzymes) and relationships between number of marker enzymes, number of coding genes, number of clusters and genome size. (A) Distribution of MGC sizes (number of marker enzymes) for all 31 investigated species. Shown is the absolute number of MGCs with different numbers of marker enzymes per species (common names are indicated in brackets if available). The most common cluster length is 3. (B) Relationships between genome size, number of detected gene clusters, and number of members of secondary metabolism enzyme families (marker enzymes) for the 31 investigated species. Species names are indicated by a three-letter abbreviation code. Only the relationship between the number of marker enzymes and the genome size shows a positive correlation (R2 = 0.65) for the 31 investigated species.To extend our knowledge regarding the distribution and organization of individual marker enzymes across the genome, we performed a search for individual members of the 26 enzyme families across all genomes and compared it to the results of our MGC search. We found between 4.6% (Z. mays) to 57.1% (Arabidopsis) of all detected marker enzymes to be organized in putative MGCs (mean over all species) was 23.7% (see Supplementary Information 4 for all values). As a comparison, Chae et al. (10) found that in Arabidopsis 30.1%, in soybean 30.2%, in sorghum 30.5%, and in rice 22.4% of genes to be located in MGCs and Schläpfer et al. (16) reported between 0% (C. reinhardtii) and 24.1% (rice) of specialized metabolic pathways to contain clustered genes.As a reference value we calculated the probability to find these results by chance by randomly permuting the positions of the detected marker genes based on the gene coordinates (for four representative species) (see Materials and Methods). For Z. mays we found 0.3% and for Arabidopsis 47,6% of the detected marker enzymes to be located in the same proximity as the MGC candidates, termed ‘random MGCs’ (see Supplementary Information 4 for all values). The latter value is high and suggests a high false positive discovery rate for Arabidopsis. Nevertheless, it needs to be considered that the number of detected marker enzymes counts all individual occurrences of marker enzymes, including a large number of tandem gene repeats. While tandem repeats are excluded to count as MGC candidates in our analysis, they still contribute to random MGCs as they are redistributed along the chromosome.Moreover, we observed a weak correlation between the overall number of marker enzymes and the number of coding genes in a genome (R2 = 0.65) (Figure 4B, panel 1). We did not observe significant correlations between the number of MGCs and the number of coding genes (R2 = 0.12) nor between the number of MGCs and the overall number of marker enzymes (R2 = 0.39) (Figure 4B, panels 2 and 3, respectively).Furthermore, neither the number of marker enzymes nor the number of detected MGC candidates correlated with the genome size (R2 = 0.18 and 0.10, respectively) (Figure 4B, panels 4 and 5, respectively) nor does the number of clusters with the gene density (coding genes/genomes size) (R2 = 0.18) (Figure 4B, panel 6). Noteworthy, the latter results do not support the assumption that the size in bp of MGC candidates should necessarily correlate with the size of the genome or the gene density. We observed that genomes of similar size or gene density can have very diverging numbers of cluster candidates.Notably, both Arabidopsis and potato show a high number of MGCs with respect to both genome size and gene density. While this outlier might partially be explained by the fact that potato has a high number of marker enzymes, this is not the case for Arabidopsis. The underlying biological reasons for this phenomenon are not clear at this stage.
Plant taxonomy is reflected by the types of MGCs present in the genomes
We subsequently tested whether the taxonomy of the investigated plant species was reflected by differences in the detected MGC types. Clustering by MGC types (see Materials and Methods) reflected the taxonomy of several species under investigation. This could be observed for algae, grasses, the nightshades, the single moss species, the single lycophyte, and members of the Brassicaceae (cabbage) family (Figure 5).
Figure 5.
Green lineage taxonomy is reflected by MGCs. Clustering analysis of the 31 investigated species based on MGCs similarity. The cosine similarity based on the presence or absence of MGC types was used as a similarity measure to calculate the clustering profile. Related species cluster together, such as algae, grasses, the nightshades, the single moss species, the single lycophyte and members of the Brassicaceae (cabbage) family (compare to taxonomy in Figure 3).
Green lineage taxonomy is reflected by MGCs. Clustering analysis of the 31 investigated species based on MGCs similarity. The cosine similarity based on the presence or absence of MGC types was used as a similarity measure to calculate the clustering profile. Related species cluster together, such as algae, grasses, the nightshades, the single moss species, the single lycophyte and members of the Brassicaceae (cabbage) family (compare to taxonomy in Figure 3).
Enzyme families significantly enriched in MGCs
We next investigated which enzyme families are enriched in MGCs and whether taxa-specific differences could be found. When performing enrichment analysis (Fisher's-exact test) across all species we found 2-oxoglutarate-dependent dioxygenases (P = 7.2e–31), CYPs (P = 1.3e–02), glutathione-S-transferases (P = 1.9e–88), methylene bridge-forming enzymes (P = 6.6e–22), terpene synthases (P = 4.7e–62) and 2-isopropylmalate synthases (P = 7.3e–10) to be significantly enriched in MGCs across the plant kingdom. Notably, when performing the same analysis with a relaxed cluster definition (including clusters with only two marker genes and tandem repeats) we found all enzyme families but CYPs (P = 1.03e–164) to be less significant (Supplementary Information 5). This observation supports the notion of an accelerated rate of gene duplications in certain CYPs clans (45) which are likely to make up most of the tandem repeat gene clusters which are not excluded from the analysis when using the more relaxed constraints.Taxa-specific analysis revealed notable differences between taxonomic groups e.g., members of the glutathione-S-transferases family were found enriched in MGCs across the plant kingdom, whereas 2-oxoglutarate-dependent dioxygenases and methylene bridge-forming enzymes were enriched in angiosperms only and CYPs in monocots only. An overview of significantly enriched enzyme families in MGC across all investigated species and for algae, mosses, lycophytes, mono- and dicots separately is shown in Figure 6. Figure 7 shows which enzyme families are enriched in MGCs in each of the investigated species (see also Supplementary Information 6 for analysis of enriched enzyme combinations).
Figure 6.
Enrichment of marker enzymes in putative MGCs in different phylogenetic taxa. (A) Enrichment analysis for specialized metabolism associated enzyme families was performed (i) for all species together (left side) and (ii) the five investigated taxa (algae, mosses, lycophytes, monocots and dicots) separately (right side). The y-axis represents the negative log10 transformation of the P-value for the enrichment based on the Fisher's exact test. Shown are only significant enzyme families (P-value < 0.05). Members of the glutathione-S-transferases family were found enriched in MGCs across the plant kingdom, whereas 2-oxoglutarate-dependent dioxygenases and methylene bridge-forming enzymes were enriched in angiosperms only and CYPs in monocots only.
Figure 7.
Enrichment of marker enzymes in putative MGCs in the 31 investigated species genomes. The bar plots show the total number of marker enzymes (lighter color) and the number of marker enzymes organized in putative MGCs (darker color) for all marker enzyme families and all species separately. Shown are only enzyme families and species which exhibit a significant enrichment of the respective marker enzymes in putative MGCs based on Fisher's exact test (P-value < 0.05). Species names are indicated by a three-letter abbreviation code (common names indicated in brackets).
Enrichment of marker enzymes in putative MGCs in different phylogenetic taxa. (A) Enrichment analysis for specialized metabolism associated enzyme families was performed (i) for all species together (left side) and (ii) the five investigated taxa (algae, mosses, lycophytes, monocots and dicots) separately (right side). The y-axis represents the negative log10 transformation of the P-value for the enrichment based on the Fisher's exact test. Shown are only significant enzyme families (P-value < 0.05). Members of the glutathione-S-transferases family were found enriched in MGCs across the plant kingdom, whereas 2-oxoglutarate-dependent dioxygenases and methylene bridge-forming enzymes were enriched in angiosperms only and CYPs in monocots only.Enrichment of marker enzymes in putative MGCs in the 31 investigated species genomes. The bar plots show the total number of marker enzymes (lighter color) and the number of marker enzymes organized in putative MGCs (darker color) for all marker enzyme families and all species separately. Shown are only enzyme families and species which exhibit a significant enrichment of the respective marker enzymes in putative MGCs based on Fisher's exact test (P-value < 0.05). Species names are indicated by a three-letter abbreviation code (common names indicated in brackets).
MGC candidates for terpene diversification
It is non-trivial to predict pathway products from the enzyme classes of a given MGC candidate as plant specialized metabolism is chemically extremely complex and the vast majority of this metabolic diversity is still unknown. In the case of the terpenoids class (46), it might be possible to predict related MGCs. The terpenoid backbone is typically being synthesized by terpene synthases (TSs) and further modified by CYPs. Consequently, we examined how many of the MGC candidates could be associated with terpenoid biosynthesis. Across all species we found between 0 and 14 of the MGC candidates to include one or more TSs and 0–9 MGC candidates to additionally include one or more CYPs. Figure 8 summarizes the results for all genomes investigated. These numbers are in line with reports from a previous study which identified between 1 TS/CYP pairs in Z. mays and 13 in Arabidopsis for a search based on 30 kb maximum distance between adjacent pairs of TS and CYPs (14).
Figure 8.
Number of MGC candidates possibly involved in terpene diversification. Given are the numbers of MGC candidates that include one or more terpene synthases (left bar) and the number of MGC candidates additionally including one or more CYPs (right bar) for all species under investigation. The absolute numbers for MGC candidates including terpene synthases range between 0 and 14 and between 0 and 9 for pairs of terpene synthases and CYPs and are in accordance with previous findings (14).
Number of MGC candidates possibly involved in terpene diversification. Given are the numbers of MGC candidates that include one or more terpene synthases (left bar) and the number of MGC candidates additionally including one or more CYPs (right bar) for all species under investigation. The absolute numbers for MGC candidates including terpene synthases range between 0 and 14 and between 0 and 9 for pairs of terpene synthases and CYPs and are in accordance with previous findings (14).
Combining MGCs prediction with gene co-expression analysis results in the identification of likely candidates
We performed co-expression analysis for Arabidopsis, rice and tomato to investigate which MGCs contain significantly co-expressed genes with respect to the reference background co-expression. As a reference we used the co-expression patterns of all marker genes in the respective genome and examined (i) the mean correlation of marker genes in MGCs against the background distribution and (ii) the correlation distribution of marker genes in individual MGCs against the background distribution (see Materials and Methods for details). We found statistically significant higher co-expression between marker genes in MGCs with respect to the background distribution only for tomato (Wilcoxon Rank test; P = 0.001). Nevertheless, when investigating the correlation distribution of individual MGCs against the background we found genes in four clusters in Arabidopsis, three in rice and seven in tomato to be significantly higher co-expressed (Wilcoxon rank test, P < 0.05). The length of these clusters varied between 51 and 140 kb and they contained between 4 and 18 marker enzymes. The results of this analysis are summarized in Figure 9. A complete list of the detected marker enzymes and the respective genes are given in the Supplementary Information 7. Interestingly, when running the same analysis for the relaxed cluster definition (including also clusters with only two marker genes and tandem repeats) we found increased and statistically significant co-expression of marker genes in the MGCs for all three species examined (Wilcoxon Rank test, P = 0.002, 3.04e–4 and 2.8e–11 for Arabidopsis, rice and tomato, respectively).
Figure 9.
Putative MGCs with statistically significant gene co-expression. The schematic overview shows the candidate clusters that are significantly higher co-expressed with respect to the background distribution including all detected marker enzymes (Wilcoxon Rank test, P < 0.05). The left side shows the chromosomal location and length of the putative cluster and the right side depicts the organization of the marker enzymes along the chromosome.
Putative MGCs with statistically significant gene co-expression. The schematic overview shows the candidate clusters that are significantly higher co-expressed with respect to the background distribution including all detected marker enzymes (Wilcoxon Rank test, P < 0.05). The left side shows the chromosomal location and length of the putative cluster and the right side depicts the organization of the marker enzymes along the chromosome.
DISCUSSION
In this study, we developed a computational tool for the identification and analysis of putative MGCs in plants. The tool accurately detects known MGCs. An exhaustive search of 31 genome assemblies across the plant kingdom resulted in the detection of several thousand MGCs candidates. Furthermore, we found genes associated with specialized metabolism enzymes to exhibit taxa-specific co-location patterns, reflecting the underlying taxonomy of the species. Enrichment analysis revealed taxa- and species-specific enrichment of certain enzyme families. Co-expression analysis for Arabidopsis, rice and tomato revealed an increased level of co-expression of cluster candidates only in tomato. Nevertheless, statistical analysis of individual gene cluster candidates yielded a list of putative MGCs containing co-expressed genes of statistical significance in Arabidopsis, rice, and tomato.
The discovery of additional plant MGCs is crucial for refining cluster detection rules
PhytoClust represents a highly exhaustive search algorithm as any combination of specialized metabolism enzyme classes can be detected. Conversely, the ‘known-cluster search algorithm’ in PhytoClust is specific and detects only complete MGCs of a particular enzyme class composition. Therefore, evolutionary modifications of the same cluster in different species (e.g., loss of a certain marker enzyme or splitting of clusters) would not necessarily be detected. Future insights into the organization and evolution of MGCs in plants as well as deeper knowledge of metabolic pathways will help reducing the gap between these two approaches. As a result, cluster rules in PhytoClust could be attuned according to e.g., certain chemical classes produced by particular combinations of enzymes while being flexible with respect to additional marker enzymes in the cluster.Our study points at the importance of generating specific HMMs for members of the CYP family. More than 5100 sequences of the plant CYP family were described which are likely involved in numerous functions. For the larger families within the clan, enzymes with even <40% sequence similarity are grouped into sub-families (45). Despite this tremendous diversity, currently all CYPs have the same HMM in the PFAM database (26). A more sophisticated representation of these enzyme families would significantly improve the predictive power of the PhytoClust search algorithm. Likewise, we anticipate the repertoire of marker enzymes to expand as new MGCs will be discovered and other classes of gene products, such as enzymes involved in primary metabolism, transcription factors, signaling components or transporters, will be found as MGCs components.Moreover, any analysis of putative MGCs, e.g., enrichment of certain enzyme families in MGCs, heavily depends on the chosen set of putative MGCs. Relaxation of the cluster search definition from three to two marker enzymes and allowance for tandem repeats resulted in (i) higher over-representation of members of the CYPs class and (ii) rendered enzymes from putative MGCs to be significantly co-expressed in the case-study of Arabidopsis, rice, and tomato. This might explain different findings with respect to previous studies (10). Moreover, our findings emphasize the significant role of CYPs in specialized metabolism gene duplication and neo-functionalization that drives gene diversification and potentially MGCs evolution.
Main findings are in line with other computational approaches for MGC discovery
A recently presented study (16) investigated the occurrence of putative MGCs across primary and secondary metabolism and searched for co-located genes involved in small molecule metabolism. Reassuringly, despite differing computational approaches (E2P2 versus PFAM for enzyme annotation) and species under investigation the main results of the PlantClusterFinder and PhytoClust studies are in line and emphasize the importance of MGCs in the evolution of plant specialized metabolism. Both computational approaches detected several thousand MGC candidates which contain between three and five metabolic genes. Moreover, besides TSs, both studies find enrichment of CYPs and 2-oxoglutarate-dependent dioxygenases in putative MGCs, proposing a role as potential signature enzymes. Polyketide synthases (PKSs) were assigned a rather minor role in plant MGCs. Notably, both studies generated support for the important role of local gene duplication and in particular for the role of CYPs in MGC formation.
The potential of plant MGCs for rational design strategies
The discoveries of MGCs in plants through programs such as PhytoClust will likely speed-up metabolic pathway elucidation. Yet, it is also expected to provide the information required for subsequent metabolic engineering of pathways generating high-value products in plants or microorganisms (5). Already at this stage where a relatively small number of MGCs has been characterized in depth, MGCs associated with the biosynthesis of high-value molecules have been discovered, e.g., noscapine, an anti-tumor alkaloid from opium poppy (7). Moreover, MGCs provide a natural example for pathway engineering and could therefore teach us how to tackle issues of coinheritance, avoidance of toxic intermediates, spatial and temporal control of gene expression, metabolic channeling and likely much more. Hence, we expect that working with PhytoClust will boost the elucidation of metabolic pathways of specialized metabolism, understanding their evolution and engineering. These advances are nevertheless largely dependent on the quality and number of new plant genomes assembled in the coming years.
AVAILABILITY
PhytoClust is available at: http://phytoclust.weizmann.ac.il/.Click here for additional data file.
Authors: Alexander M Boutanaev; Tessa Moses; Jiachen Zi; David R Nelson; Sam T Mugford; Reuben J Peters; Anne Osbourn Journal: Proc Natl Acad Sci U S A Date: 2014-12-10 Impact factor: 11.205
Authors: Marnix H Medema; Kai Blin; Peter Cimermancic; Victor de Jager; Piotr Zakrzewski; Michael A Fischbach; Tilmann Weber; Eriko Takano; Rainer Breitling Journal: Nucleic Acids Res Date: 2011-06-14 Impact factor: 16.971
Authors: Xiangchao Gan; Oliver Stegle; Jonas Behr; Joshua G Steffen; Philipp Drewe; Katie L Hildebrand; Rune Lyngsoe; Sebastian J Schultheiss; Edward J Osborne; Vipin T Sreedharan; André Kahles; Regina Bohnert; Géraldine Jean; Paul Derwent; Paul Kersey; Eric J Belfield; Nicholas P Harberd; Eric Kemen; Christopher Toomajian; Paula X Kover; Richard M Clark; Gunnar Rätsch; Richard Mott Journal: Nature Date: 2011-08-28 Impact factor: 49.962
Authors: Paul Julian Kersey; James E Allen; Irina Armean; Sanjay Boddu; Bruce J Bolt; Denise Carvalho-Silva; Mikkel Christensen; Paul Davis; Lee J Falin; Christoph Grabmueller; Jay Humphrey; Arnaud Kerhornou; Julia Khobova; Naveen K Aranganathan; Nicholas Langridge; Ernesto Lowy; Mark D McDowall; Uma Maheswari; Michael Nuhn; Chuang Kee Ong; Bert Overduin; Michael Paulini; Helder Pedro; Emily Perry; Giulietta Spudich; Electra Tapanari; Brandon Walts; Gareth Williams; Marcela Tello-Ruiz; Joshua Stein; Sharon Wei; Doreen Ware; Daniel M Bolser; Kevin L Howe; Eugene Kulesha; Daniel Lawson; Gareth Maslen; Daniel M Staines Journal: Nucleic Acids Res Date: 2015-11-17 Impact factor: 16.971
Authors: M Saleem Dar; Bhushan B Dholakia; Abhijeet P Kulkarni; Pranjali S Oak; Dhanasekaran Shanmugam; Vidya S Gupta; Ashok P Giri Journal: Planta Date: 2021-02-04 Impact factor: 4.116
Authors: Zhenhua Liu; Hernando G Suarez Duran; Yosapol Harnvanichvech; Michael J Stephenson; M Eric Schranz; David Nelson; Marnix H Medema; Anne Osbourn Journal: New Phytol Date: 2019-12-28 Impact factor: 10.323