Literature DB >> 30674669

Robust predictions of specialized metabolism genes through machine learning.

Bethany M Moore1,2, Peipei Wang1, Pengxiang Fan3, Bryan Leong1, Craig A Schenck3, John P Lloyd4,5, Melissa D Lehti-Shiu1, Robert L Last1,3, Eran Pichersky6, Shin-Han Shiu7,2,8.   

Abstract

Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.
Copyright © 2019 the Author(s). Published by PNAS.

Entities:  

Keywords:  data integration; machine learning; predictive biology; specialized metabolism

Year:  2019        PMID: 30674669      PMCID: PMC6369796          DOI: 10.1073/pnas.1817074116

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


Gene duplication and subsequent divergence/loss events led to highly variable gene content between plant species (1, 2). These differential gain and loss events have given rise to diverse metabolic enzymes ranging from those involved in generally conserved, primary metabolic processes found in most species [referred to as general metabolism (GM) genes], to those that function in lineage-specific, specialized metabolism (SM) (3–6). The proliferation of SM genes in plants has resulted in an overall far larger number of specialized than general metabolites. These specialized metabolites are important for niche-specific interactions between plants and environmental agents that can be harmful (e.g., herbivores) or beneficial (e.g., pollinators) (3, 7, 8). They are also the basis for thousands of plant-derived chemicals, many of which are used for medicinal and/or nutritional purposes, such as carotenoid derivatives with antioxidant properties in tomato (9–11). Thus, identification of the genes encoding enzymes that produce specialized metabolites (referred to as SM genes) is key to understanding the causes underlying the diversity of plant specialized metabolites as well as for engineering plant-derived chemicals and pharmaceuticals. Despite their importance, most plant metabolites and the enzymes and genes involved in their biosynthesis are yet to be identified (12). Although many SM genes arise by duplication of GM genes (13, 14) or other SM genes (15), duplication itself is not sufficient for pinpointing SM genes for four reasons. First, genes encoding GM or SM enzymes can belong to the same family. Second, duplicated GM genes may not necessarily become specialized (1), and minor sequence changes can lead to substantially altered enzyme functions (16, 17). Third, SM genes may arise through lineage-specific loss of the GM function without duplication. Finally, convergent evolution may explain the presence of unrelated enzymes in different lineages that use the same substrate to make similar products (5). Consequently, it remains unresolved whether most plant enzyme genes are involved in GM or SM pathways, even in the best annotated plant species, Arabidopsis thaliana (3, 5, 18, 19). Therefore, in recent years there has been an enhanced focus on identifying SM genes (20, 21). Multiple properties have been shown to differ between SM and GM genes (4, 20–22). For example, whole genome duplications (WGDs) and tandem duplications both contribute to metabolic innovations in glucosinolate biosynthesis genes (23). In addition, compared with GM genes, SM genes tend to have a more restricted phylogenetic distribution, a higher family expansion rate, tandem clustering of paralogs, a propensity for genomic clustering (close physical proximity of genes encoding enzymes in the same pathways), higher degrees of expression variation, and higher degrees of coexpression. Coexpression with known SM genes (20, 24) or genomic neighborhood and gene–metabolite correlation (25) were also used to predict SM pathway genes. With the influx of more biochemical and omic data, there is an increasing number of gene properties that can be evaluated for their utility in distinguishing SM/GM genes. Furthermore, the studies published to date have mainly focused on specific SM or GM pathways, raising the question of how SM/GM genes differ globally. This prompted us to examine 10,243 gene properties (referred to as features), including new features and those evaluated in early studies, falling into five categories (gene function, expression/coexpression, gene networks, evolution/conservation, and gene duplication), and evaluate the ability of each feature to distinguish SM genes from GM genes. Earlier studies revealed that the association between features and SM genes is far from absolute (26), and—in most cases—the effect sizes (i.e., the extent to which these specific features can distinguish SM and GM genes) were not reported. To build on these studies, a machine learning approach (21), which jointly considers all five categories of heterogeneous features, was used to distinguish SM and GM genes. This approach led to machine learning models that were used to predict if an A. thaliana enzyme gene is likely an SM gene. Furthermore, we examined the properties of enzyme genes in cases where the annotations and predictions differed. Our findings provide a global estimate of SM gene content in the Arabidopsis thaliana genome, and the identified features may pave the way for further improvement of the modeling approach.

Results and Discussion

Benchmark SM and GM Genes.

Currently, there are two major resources for plant SM and GM gene annotations: Gene Ontology (GO; ref. 27) and AraCyc (28). For SM genes, we started with the 357 genes with the GO term “secondary metabolic process” and 649 enzyme-encoding genes in 129 AraCyc “secondary metabolism” pathways (Dataset S1). Initial GM genes included 2,009 annotated with the GO term “primary metabolic process” and 1,557 enzyme-encoding genes in 490 AraCyc nonsecondary metabolism pathways (Dataset S1). Although 32.4% of GO- and 41.8% of AraCyc-annotated GM genes overlapped, only 35 SM genes (15% of GO- and 8.3% of AraCyc-annotated SM genes) overlapped (Fig. 1). Although this is a significantly higher degree of overlap than expected by chance (), it indicates a greater inconsistency in SM annotation criteria than in GM annotation criteria between the GO and AraCyc datasets. Furthermore, 152 and 261 genes were annotated as both SM and GM in GO and AraCyc, respectively. This indicates that although SM and GM genes may have distinct properties, several genes can be both, and their properties may not be distinct. Here we focus on cases that are not ambiguous, but later we delve into this gene set to see if genes involved in both SM and GM pathways can be uniquely classified.
Fig. 1.

Gene Ontology and AraCyc annotation of specialized and primary metabolism genes. (A) Overlap between Gene Ontology (GO)/AraCyc primary metabolism (PM) and secondary metabolism (SM) gene annotations. The number of genes in an intersection or in a complement set are shown. Three benchmark SM/GM gene sets were defined: benchmark 1 (Union), benchmark 2 (AraCyc), and benchmark 3 (Intersection) (). The table to the right shows the genes (labeled with lowercase letters in the Venn diagram) included in each benchmark set. (B) GO term enrichment in SM genes (Left) and in GM genes (Right). The three columns show statistics for GM/SM genes that are GO-annotated, are AraCyc-annotated, or belong to a combined set (union between GO and AraCyc). Rows indicate GO terms. Color represents the q value (multiple testing corrected p value) of the Fisher’s exact test for a GO term enriched in either GM (blue) or SM (red) genes (Dataset S1). White indicates no significant enrichment.

Gene Ontology and AraCyc annotation of specialized and primary metabolism genes. (A) Overlap between Gene Ontology (GO)/AraCyc primary metabolism (PM) and secondary metabolism (SM) gene annotations. The number of genes in an intersection or in a complement set are shown. Three benchmark SM/GM gene sets were defined: benchmark 1 (Union), benchmark 2 (AraCyc), and benchmark 3 (Intersection) (). The table to the right shows the genes (labeled with lowercase letters in the Venn diagram) included in each benchmark set. (B) GO term enrichment in SM genes (Left) and in GM genes (Right). The three columns show statistics for GM/SM genes that are GO-annotated, are AraCyc-annotated, or belong to a combined set (union between GO and AraCyc). Rows indicate GO terms. Color represents the q value (multiple testing corrected p value) of the Fisher’s exact test for a GO term enriched in either GM (blue) or SM (red) genes (Dataset S1). White indicates no significant enrichment. To further assess the differences in AraCyc and GO annotations, we asked whether SM and GM genes annotated based on these two sources have different functional and pathway annotations and Pfam protein domains. We found that GO- and AraCyc-annotated SM genes have substantially different enriched GO categories (Fig. 1 and Dataset S1), AraCyc pathways (, and Dataset S1), and protein domains (, and Dataset S2). In contrast to SM genes, GO- and AraCyc-annotated GM genes tend be overrepresented in the same functional categories and pathways (Fig. 1). Considering the above findings, we defined three benchmark sets (Dataset S1). The first (benchmark 1) was defined to include as many annotated SM genes as possible. Here 393 benchmark 1 SM genes were defined as the union of GO and AraCyc SM annotations that have Enzyme Commission (EC) numbers. Similarly, 2,226 benchmark 1 GM genes are from the union of GO and AraCyc primary metabolism gene annotations associated with EC numbers. In the second set (benchmark 2), we used only AraCyc annotations, which were likely better annotated because the focus of AraCyc is on metabolic pathways (SM = 411, GM = 1,306; Fig. 1). In the third set (benchmark 3), we used the intersection between GO and AraCyc annotations (SM = 35, GM = 650; Fig. 1). When we examined which gene feature could distinguish benchmark SM and GM genes (described in the following four sections; Dataset S2), the p values from testing >10,000 features were highly correlated among the three benchmark definitions [R2 ≥ 0.55 () and all Pearson correlation coefficients (PCCs) ≥0.74 (Dataset S2)]. Therefore, we focus on comparing benchmark 1 (union-based) and benchmark 2 (AraCyc-only) genes, particularly when the conclusions (whether a feature can distinguish between SM and GM genes) were inconsistent.

Differences in Gene Expression and Epigenetic Marks Between SM and GM Genes.

A previous study showed that the expression of genes in some SM pathways tends to be more variable than the expression of genes in “essential pathways” (22). To further assess differences in SM and GM gene expression, we examined transcriptome datasets encompassing 25 tissue types (development dataset) and 16 abiotic/biotic stress conditions [stress dataset (); for all test p values, see Dataset S2]. In addition to confirming that benchmark 2 SM genes tend to have higher expression variability (P = 0.003; Fig. 2), we examined 23 additional expression features. We found that SM genes had significantly narrower breadths of expression (Mann–Whitney U tests, for all benchmark sets, P < 1e-35; Fig. 2), lower median expression levels (P = e-24; Fig. 2), and lower maximum expression levels (P = 0.04; Fig. 2). These findings are consistent with the fact that SM genes have more specialized roles, whereas GM genes are involved in basic cellular functions (3, 6). As expected with the established roles of some specialized metabolites in environmental interactions (e.g., refs. 8 and 29), we found that benchmark 1 SM genes tend to be up-regulated under a higher number of abiotic and biotic stress conditions compared with GM genes (all P < 2e-7; Fig. 2), largely similar to the results based on benchmark 2 (P = 0.24 ∼ 1e-8). Relatively fewer SM genes were down-regulated in the shoot under stress compared with GM genes (P = 0.18 ∼ 3.1e-5; Fig. 2), likely reflecting a growth–defense tradeoff (30) where GM genes involved in housekeeping functions are down-regulated under stress and SM genes with roles in abiotic and biotic interactions are not. We do not, however, see the same trend in roots. Because CG methylation and histone modification can influence gene expression (31, 32), we compared the numbers of these sites between SM and GM genes. We found that SM genes tend to have a lower degree of gene body CG-methylation than GM genes (Fisher’s exact tests, P < 3e-4; Dataset S2). On the other hand, the extent of histone modification did not significantly differ between SM and GM genes for seven of the eight histone marks ( and ).
Fig. 2.

Differences in expression and coexpression characteristics of benchmark 1 SM and GM genes. (A) Distributions of SM (red) and GM (blue) gene expression-related values calculated from the development dataset. Level is microarray intensity. Expression breadth is the number of tissues/developmental stages in which a gene is expressed. Expression variation is median absolute deviation/median. (B) Distributions of the number of conditions in which a gene is up- or down-regulated in the abiotic stress (root and shoot) and biotic stress (shoot) datasets. (C) Distributions of maximum PCC values between SM or GM genes and their paralogs in four expression datasets. All test statistics from A to C were generated using Mann–Whitney U tests. (D) Distributions of maximum PCC between GM–GM (light blue), GM–SM (dark blue), SM–GM (orange), and SM–SM (red) gene pairs using the same expression datasets as in C. (E) Clustering of SM and GM genes based on their expression patterns in the diurnal development and stress datasets using six algorithms: HA (hierarchical, average linkage), HC (hierarchical, complete linkage), HW (hierarchical, Ward’s method), CM (c-means), KM (k-means), and AK (approximate k-means). Row indicates a benchmark SM/GM gene. Blue and red shading indicates that the gene belongs to a cluster with an overrepresented number of GM genes and SM genes, respectively, compared with the background (P < 0.05, Fisher’s exact test).

Differences in expression and coexpression characteristics of benchmark 1 SM and GM genes. (A) Distributions of SM (red) and GM (blue) gene expression-related values calculated from the development dataset. Level is microarray intensity. Expression breadth is the number of tissues/developmental stages in which a gene is expressed. Expression variation is median absolute deviation/median. (B) Distributions of the number of conditions in which a gene is up- or down-regulated in the abiotic stress (root and shoot) and biotic stress (shoot) datasets. (C) Distributions of maximum PCC values between SM or GM genes and their paralogs in four expression datasets. All test statistics from A to C were generated using Mann–Whitney U tests. (D) Distributions of maximum PCC between GMGM (light blue), GM–SM (dark blue), SM–GM (orange), and SM–SM (red) gene pairs using the same expression datasets as in C. (E) Clustering of SM and GM genes based on their expression patterns in the diurnal development and stress datasets using six algorithms: HA (hierarchical, average linkage), HC (hierarchical, complete linkage), HW (hierarchical, Ward’s method), CM (c-means), KM (k-means), and AK (approximate k-means). Row indicates a benchmark SM/GM gene. Blue and red shading indicates that the gene belongs to a cluster with an overrepresented number of GM genes and SM genes, respectively, compared with the background (P < 0.05, Fisher’s exact test). Previous studies used expression correlation to evaluate how well genes in distinct SM pathways are correlated (20, 21). To see if similar correlation measures could be used to distinguish SM and GM genes, we used maximum PCCs to evaluate expression correlation between each SM/GM gene and its paralogs (Fig. 2) as well as to other SM and GM genes (Fig. 2) in each of four expression datasets (abiotic stress, biotic stress, development, and hormone treatment). We found SM paralogs to have a significantly higher expression correlation than GM paralogs in all four data sets (Mann–Whitney U test, all P < 0.05; Fig. 2). Because SM genes have undergone more recent expansion than GM genes (2, 4) and the degrees of sequence and expression divergence are positively correlated (33, 34), the higher expression similarities between SM paralogs than between GM paralogs may be partly explained by the more recent timing of SM duplication. We next looked at the maximum expression correlation between each SM gene and other SM genes (SM–SM) or GM genes (SM–GM), as well as between each GM gene and other GM genes (GMGM) or SM genes (GM–SM). The expression correlations ranked as follows: GMGM > SM–GM > SM–SM > GM–SM (all benchmark 1 P < 0.05, but all benchmark 2 P > 0.05 for correlation in the development and biotic stress datasets; Fig. 2). The higher expression correlation for GMGM compared with SM–SM is likely because GM genes tend to be more broadly expressed and at higher levels than SM genes (Dataset S2). The ratio between expression level variance and mean is higher for genes with lower expression levels, such as SM genes, which contributes to the comparatively lower correlation between these genes. Taken together, our findings indicate that expression correlation features can distinguish SM and GM genes. Because pathway genes tend to be coexpressed and belong to the same coexpression cluster (20, 21), we next assessed if benchmark 1 SM and GM genes that belong to distinct pathways were members of distinct coexpression modules (Fig. 2 and Dataset S2). Among these modules, 99 and 125 contained significantly more SM genes than randomly expected (α = 0.05) and are referred to as SM modules. Similarly, 125 GM modules were significantly enriched in GM genes (P < 0.05). Therefore, a subset of benchmark GM and SM genes tend to be coexpressed with other GM and SM genes, respectively. However, >50% of SM and GM genes did not belong to SM/GM modules (gray, Fig. 2). In addition, 0.3–14.0% of GM genes were found in SM modules, and 0–32% of SM genes were found in GM modules, depending on the dataset and algorithm (Fig. 2). This pattern reflects the fact that GM genes which function immediately upstream of an SM pathway may be coexpressed with genes in the SM pathway in question. Examples include 208 “junction” genes interfacing GM and SM pathways based on AraCyc annotations (Dataset S2). These findings further highlight the challenges in differentiating SM and GM genes globally using coexpression patterns alone.

Network Properties of SM and GM Genes.

SM genes tend to have specialized functions and are involved in one or a few pathways, leading us to hypothesize that SM genes would have fewer connections in biological networks than GM genes. To test this prediction, we first assessed the connectivity among SM genes and among GM genes in a protein–protein interaction network (35) and found that SM genes have a significantly smaller number of physical interactions (mean = 1.25) than GM genes (1.84; benchmark 1, P = 0.03; benchmark 2, P = 3.85e-8; ). The smaller number of SM gene interactions is not because SM genes have shorter coding regions (SM > GM, all P = 0.004; ) but is possibly due to the presence of fewer protein domains (SM < GM; benchmark 1, P = 0.35; benchmark 2, P = 4.3e-6; ). Our finding that significantly fewer protein–protein interactions are known for SM proteins is consistent with SM genes having more specific functions than GM genes (6). It is also possible that there have been more interaction experiments for GM genes or that GM genes tend to function in larger pathways compared with SM genes. Although GM genes tend to have more interactions than SM genes, SM genes with certain domains, such as cytochrome P450, have a higher median number of gene interactions (99.5) compared with their P450 GM counterparts (15.0). Thus, proteins in some domain families may deviate from the general trend we uncovered. Next, we examined the same relationships using the AraNet functional network (36), which connects genes with likely similar functions through the integration of multiple datasets, including expression and protein–protein interaction datasets. Although the number of protein–protein interactions was significantly higher for GM genes relative to SM genes (; all P < 0.05), the differences in network connectivity between GM and SM genes in benchmark 1 were not significant (P = 0.139; ) but were significant for benchmark 2 genes (P = 0.027). AraNet considers multiple gene features including protein interactions, coexpression, shared domains, and homologous genes to construct gene networks, so it is not surprising that this result differs from that for analysis of only protein–protein interactions. These findings suggest that the amount of network connectivity is dependent on the type of network, and this may be useful for distinguishing between SM and GM genes. We should also note that the results from the benchmark 1 and 2 sets are inconsistent, highlighting the effect of the benchmark definition on our analyses. In particular, benchmark 1 p values were higher than those of benchmark 2, despite the fact that benchmark 1 was substantially larger and would have lower P values compared with a smaller dataset with the same effect sizes. This suggests that the AraCyc-only–based benchmark 2 is likely of higher quality.

Evolutionary Rates of SM and GM Genes Based on Within- and Cross-Species Comparisons.

SM genes are frequently involved in plant adaptation to variable environments (8, 29, 37). In contrast, GM genes, which are involved in ancient and stable metabolic functions such as photosynthesis, are expected to be more highly conserved (38) and experience stronger negative selection (39, 40). An earlier study found a high degree of genetic variation in glucosinolate genes across A. thaliana accessions (21). Here, by comparing SM to GM genes globally, we found that SM genes tend to have higher nucleotide diversities than GM genes (P = 3.9e-19; ). In addition, we analyzed 15 evolutionary features based on within-species and across-species comparisons of SM and GM genes. First, we searched for A. thaliana SM and GM paralogs as well as homologs across six plant species spanning more than 300 million years of evolution (). A significantly higher proportion of SM genes have paralogs than GM genes (P = 1.2e-10; ). However, consistently fewer SM genes (14.8–54%) have homologs across species than GM genes (27–76%) (all P < 2e-4; ). In addition, as expected for lineage-specific functions, only 0.94% of SM genes have homologs in core eukaryotic genomes (41) compared with 14.7% of GM genes (). Finally, we determined the timing of GM and SM duplications over the course of land plant evolution using sequence similarity to determine the most recent duplication point (). We found that 75% of SM genes were products of duplication events after the divergence between the A. thaliana and Brassica rapa lineages compared with only 40% of GM genes (Fig. 3), indicating that SM genes tend to be more recently duplicated relative to GM genes. Additionally, 25% of SM genes were duplicated after the A. thalianaArabidopsis lyrata split, compared with only 7% of GM genes (Fig. 3). Thus, SM genes have higher duplication rates but do not persist in the long run, leading to the observation of fewer homologs across species.
Fig. 3.

Differences in the duplication timing, degree of selective pressure, paralog-related features, and functional likelihood between benchmark 1 SM and GM genes. (A) The distribution of duplication time points (y axis) for each GM/SM gene (x axis). (Left and Middle) A black line indicates that the GM (Left) or SM (Middle) gene in question likely duplicated before the divergence between the A. thaliana lineage and the species lineage to the left of the black line. Species order is based on the time of divergence from A. thaliana. (Right) Each bar represents the log2 ratio (x axis) between the proportions of SM and GM genes duplicated at each duplication time point (y axis). For full species names, see . (B–F) Density plots showing SM (pink) and GM (blue) gene feature distributions. Test statistics were generated using Mann–Whitney U tests. (B) Median nonsynonymous substitution rate/synonymous substitution rate (dN/dS) values between A. thaliana SM/GM genes and their A. thaliana paralogs or best matching homologs in six other species, arranged based on the time of divergence from A. thaliana. (C) The number of A. thaliana paralogs of SM or GM genes. (D) The maximum percent identity of an SM or GM gene to its paralogs. (E) The dS distribution between each SM or GM gene and its paralog. (F) The functional likelihood ranging from 0 to 1, which indicates the likelihood that a gene is under selection.

Differences in the duplication timing, degree of selective pressure, paralog-related features, and functional likelihood between benchmark 1 SM and GM genes. (A) The distribution of duplication time points (y axis) for each GM/SM gene (x axis). (Left and Middle) A black line indicates that the GM (Left) or SM (Middle) gene in question likely duplicated before the divergence between the A. thaliana lineage and the species lineage to the left of the black line. Species order is based on the time of divergence from A. thaliana. (Right) Each bar represents the log2 ratio (x axis) between the proportions of SM and GM genes duplicated at each duplication time point (y axis). For full species names, see . (B–F) Density plots showing SM (pink) and GM (blue) gene feature distributions. Test statistics were generated using Mann–Whitney U tests. (B) Median nonsynonymous substitution rate/synonymous substitution rate (dN/dS) values between A. thaliana SM/GM genes and their A. thaliana paralogs or best matching homologs in six other species, arranged based on the time of divergence from A. thaliana. (C) The number of A. thaliana paralogs of SM or GM genes. (D) The maximum percent identity of an SM or GM gene to its paralogs. (E) The dS distribution between each SM or GM gene and its paralog. (F) The functional likelihood ranging from 0 to 1, which indicates the likelihood that a gene is under selection. We also found that SM genes and their homologs had significantly higher nonsynonymous (dN) to synonymous (dS) substitution rate ratios (all P < 1e-06; Fig. 3) compared with GM genes. Together with other measures of selection (), both within- and cross-species comparisons suggest that SM genes are under weaker negative selection relative to GM genes. One reason for this pattern may be that these SM genes initially experienced positive selection (higher rate than GM) followed by negative selection (similar to GM). This would result in SM genes having a higher rate of evolution than GM genes, with the appearance of weaker negative selection. Another possible reason for this pattern is that some of these SM genes may have experienced strong negative selection (similar to GM) but are now neutrally evolving. This may be because the selective agent (e.g., a particular environmental factor) previously contributing to the selection no longer exists. This is consistent with the roles of SM genes mostly in the production of metabolites important for tolerance to rapidly changing abiotic stress conditions and defense against biotic agents (6).

Duplication Mechanisms and Genomic Clustering of SM and GM Genes.

Gene duplication mechanism, such as whole genome duplication (WGD), tandem duplication, and dispersed duplication, may affect subsequent functional divergence and ultimately influence whether a duplicate is under selection and retained (1). For example, genes in a few SM pathways, such as aliphatic glucosinolate biosynthesis, tend to be tandemly duplicated and have a higher degree of expression variation (22). To assess if SM and GM genes differ in their post-WGD retention rate, we compared the number of GM and SM WGD duplicates in the A. thaliana lineage. Although two different glucosinolate pathways arose in the α WGD event ∼50 Mya (15), these two pathways do not lead to a significantly higher number of SM WGD duplicates compared with the number of GM WGD duplicates. This indicates that SM genes from multiple SM pathways (not just those involved in glucosinolate metabolism) are not more likely to be derived from WGDs than GM genes (benchmark 1 P = 0.1, and benchmark 2 P = 0.85; ). This suggests that the likelihood of long-term retention of SM and GM WGD duplicates does not appear to differ significantly. In contrast, significantly more SM genes tend to be tandem duplicates than GM genes (P < 2e-43; ). Genes involved in response to the environment are more likely to be tandem duplicates (2, 42), and tandem duplication potentially allows for rapid evolution of SM gene families that are subject to selection in variable environments. The numbers of paralogs and pseudogenes were used as measures of the degree of SM and GM gene gains and losses, respectively. Our analysis revealed that SM genes tend to have more paralogs (P < 3e-72; Fig. 3), higher sequence similarities to their paralogs (benchmark 1, P = 3e-3, and benchmark 2, P = 0.3; Fig. 3), and lower synonymous substitution rates (dS) (P < 2e-19; Fig. 3) compared with GM genes. Furthermore, a higher percentage of SM genes duplicated since A. thaliana diverged from A. lyrata (P < 4e-8; ), and SM genes tended not to be found in single copies (P < 1e-3; ). These findings all point to more recent expansion of SM gene families. We also compared the functional likelihood, which is a measure of how likely it is that a gene is functional and, thus, under selection (39), between SM genes, GM genes, and pseudogenes. The functional likelihoods of SM genes are significantly lower than those of GM genes but higher than those of pseudogenes (ANOVA, Tukey’s test, P < 2e-16; Fig. 3 and ). Genes under strong negative selection have high functional likelihoods that are close to 1, whereas pseudogenes tend to have values close to 0 (39). In addition, most pseudogenes are eventually removed from the genome (43) and tend not to be under selection (44). Our finding that SM genes tend to have lower functional likelihood is consistent with the hypothesis that some SM genes are under weaker selection and may be in the process of becoming pseudogenes. The proportion of pseudogene paralogs for SM genes (between benchmarks, 9.8–11.1%) compared with GM genes (6.1–6.5%) is not significant overall (P = 0.04 ∼ 0.2; ). Considering that SM genes tend not to have cross-species homologs (), this finding suggests that pseudogenes are too short lived to be adequate indicators of gene loss. SM and GM genes that function in the same pathway are sometimes found in genomic clusters (21, 45–47), and we used two approaches to compare the occurrence of SM and GM genes in close physical proximity. In the first approach, we asked whether SM and GM genes tend to be located near other SM and GM genes, respectively, regardless of whether the neighboring genes are paralogous or not. We found that SM genes cluster near other SM genes (benchmark 1, P = 9.5e-121, and benchmark 2, P = 0.02; ) and GM genes tend to be close to GM genes (P < 2e-5; ). It is surprising that the P values for SM clustering differ so greatly between benchmark sets. This may indicate that AraCyc annotation (benchmark 2) is of higher quality. In the second approach, we defined metabolic clusters identified using Plant Cluster Finder (21), but the identified clusters were not enriched in either SM or GM genes (). Taken together, SM genes are more likely to be tandemly duplicated and tend to belong to large gene families. Our findings provide genome-wide confirmation of earlier studies (e.g., refs. 2, 15, and 22) that focused on a relatively small number of SM genes or pathways. These characteristics may be useful features in distinguishing SM and GM genes.

Machine Learning Model for Predicting SM and GM Genes.

In total, we examined 10,243 features (summarized in Dataset S2) that differ widely in their ability to distinguish benchmark SM and GM genes. For example, the best performing single feature—gene family size—led to a model with an area under receiver operating characteristic curve (AuROC) of 0.8. An AuROC of 0.5 indicates the performance of random guesses, and a value of 1 indicates perfect predictions. However, using this high-performing feature alone as the predictor resulted in a 43% false positive rate (FPR) and a 58% false negative rate (FNR). In addition, the majority of the features are not particularly informative (Dataset S3), because the average AuROC for single feature-based models was extremely low (0.5) with an average FPR of 89%. These findings indicate that SM and GM genes are highly heterogeneous and cannot be distinguished with high accuracy using single features. To remedy this, we next integrated all 10,243 features, regardless of whether they were significantly different between SM and GM genes or not, to build machine-learning models for predicting SM and GM genes. We used machine learning because it allowed us to build an integrated model where multiple features were considered simultaneously. Integrated models offer better predictive power than individual features by lowering FNR and FPR. Two machine learning algorithms, Support Vector Machine and Random Forest, were used to build predictive models using all three benchmark datasets (; Fig. 4; ; and Dataset S3). The best performing SM gene prediction model was based on benchmark 2 (AraCyc only) and Random Forest (AuROC = 0.87, FPR = 29.4%, FNR = 14.8%; Fig. 4). Randomizing SM/GM labels but maintaining the same feature values associated with the benchmark genes as the initial model resulted in AuROCs = 0.51 ∼ 0.57, as expected for random guesses (Dataset S3). Note that the performance measures reported above were based on models built with a 10-fold cross-validation scheme where 90% of the data were used for training the models and 10% for testing them. Based on the prediction outcomes, each gene was given an SM score ranging from 0 to 1 indicating the likelihood that the gene is an SM gene. Based on a threshold SM score defined by minimizing false predictions (), 85.6% of the training SM genes (Fig. 4) and 73.1% of the training GM genes were correctly predicted (Fig. 4), which reflects an improvement over the individual feature-based, naïve models.
Fig. 4.

SM gene prediction model performance based on benchmark 2. (A) AuROC curves of binary SM/GM prediction models built with Support Vector Machine (SVM) and Random Forest (RF) algorithms. FPR, false positive rate; TPR, true positive rate. (B) SM score distributions for benchmark GM, benchmark SM, holdout SM (not included in models), unannotated enzymes from AraCyc, glucosinolate pathway, p450, terpene synthase, and methyltransferase 2 (methyltr_2) domain-containing genes. Dotted line indicates SM score threshold (). Red and blue shading indicate genes predicted to be SM and GM genes, respectively. (C) The most important features for SM (red) and GM (blue) gene predictions. (D–G) Distributions of the values of representative, predictive features for correctly and incorrectly predicted SM and GM genes. Black horizontal bar indicates median. Overall P values are from Kruskal–Wallis tests used to evaluate differences among classes. The Dunn post hoc test was used to test differences between classes (Dataset S3). (D) Functional likelihood. (E) dN/dS between A. thaliana and P. trichocarpa homologs. (F) Sizes of the gene families the four categories of genes belong to. (G) Expression breadth in the development dataset.

SM gene prediction model performance based on benchmark 2. (A) AuROC curves of binary SM/GM prediction models built with Support Vector Machine (SVM) and Random Forest (RF) algorithms. FPR, false positive rate; TPR, true positive rate. (B) SM score distributions for benchmark GM, benchmark SM, holdout SM (not included in models), unannotated enzymes from AraCyc, glucosinolate pathway, p450, terpene synthase, and methyltransferase 2 (methyltr_2) domain-containing genes. Dotted line indicates SM score threshold (). Red and blue shading indicate genes predicted to be SM and GM genes, respectively. (C) The most important features for SM (red) and GM (blue) gene predictions. (D–G) Distributions of the values of representative, predictive features for correctly and incorrectly predicted SM and GM genes. Black horizontal bar indicates median. Overall P values are from Kruskal–Wallis tests used to evaluate differences among classes. The Dunn post hoc test was used to test differences between classes (Dataset S3). (D) Functional likelihood. (E) dN/dS between A. thaliana and P. trichocarpa homologs. (F) Sizes of the gene families the four categories of genes belong to. (G) Expression breadth in the development dataset.

Features Important for SM Gene Prediction and Model Application to Unannotated Enzyme Genes.

In addition to the SM score, the machine learning result included a list of feature importance values, where features with more positive values are more informative for predicting SM genes. In contrast, more negative feature weights are more informative for predicting GM genes (Fig. 4 and Dataset S3). Based on the AraCyc-only (benchmark 2) model, the most informative features for predicting SM genes included specific protein domains as well as multiple gene duplication-related features, such as duplication mechanism (higher degree of tandem duplication), gene family expansion (larger family size), and higher degrees of correlation in expression between an SM gene and other SM genes or its paralogs (Fig. 4). In addition, higher evolutionary rates were among the most informative for predicting A. thaliana SM genes based on comparison of an SM gene to its Populus trichocarpa and Vitis vinifera homologs but not to homologs from more closely related species. This pattern may reflect the fact that at these time points (postdivergence between A. thaliana and the P. trichocarpa or V. vinifera lineages) a number of SM genes experienced accelerated, potentially positive, selection that contributed to the diversification of major SM pathways. In contrast, wider expression breadth, measured using the development expression dataset, and higher connectivity in gene networks were among the most important features for predicting GM genes, indicating the more generalizable functions of GM genes and the tendency to interact with a greater number of genes/gene products relative to SM genes. Finally, specific histone marks as well as hierarchical, k-means, and approximate k-means coexpression clusters based on the stress, diurnal, and development datasets were informative for predicting both SM and GM genes (Dataset S2). While earlier studies established that genes belonging to a particular SM pathway tend to be coexpressed (20, 25), our findings demonstrate that there are global differences in expression patterns and properties between SM and GM genes. With the accuracy of the SM gene prediction models assessed through cross-validation and prominent features identified, we next applied these machine learning models to make predictions for 2,085 known enzymatic genes (with an EC number) not annotated to be SM or GM genes (Dataset S1). Of these genes, 58% (1,220 genes) were predicted to be SM genes. We took three approaches to assess the accuracy of these SM and GM gene predictions. First, we intentionally held out 10% of both known SM and GM genes (Fig. 4 and Dataset S1) from any model training. Upon application of the machine learning model, 84 and 85% of withheld GM and SM genes were correctly predicted, respectively, indicating that the model has an 84% true positive rate (or 16% FNR). Second, we tested how well genes in well-known SM pathways involved in glucosinolate biosynthesis (38, 39) could be predicted. To do this we built a model using the benchmark SM and GM genes but excluding genes from glucosinolate biosynthetic pathways () (Fig. 4 and Dataset S1). When applying this model to glucosinolate genes, 79% of known glucosinolate pathway genes were correctly predicted as SM genes. The FNR was 16% overall, which is much better than the 58% FNR when using the single best feature, gene family size. Finally, methyltransferase, terpene synthase, and cytochrome P450 families were identified based on their respective protein domains () and analyzed to test model performance within a specific family (Fig. 4 and Dataset S1). These families were chosen because they tend to be associated with SM. To this end, we built three models using our benchmark sets, excluding holdout genes from the families we planned to predict. Upon applying this model to each enzyme family, 97% of P450, 88% of terpene synthase, and 92% of methyltransferase genes were predicted as SM genes (Fig. 4). Thus, these models predicted the majority of holdout genes with known SM functions, glucosinolate pathway genes, and genes in enzyme families whose members predominantly play roles in SM pathways. In summary, our models allowed assessment of the relative importance of features in distinguishing SM and GM genes, as well as provided predictions for 1,220 SM genes among enzyme genes with no known SM/GM designation. In addition, our findings indicate that our models and this general approach are valuable for predicting unknown enzymes.

Characteristics of Mispredicted Genes.

Although our SM prediction model performed well, 354 (27.1%) AraCyc annotated GM genes were mispredicted as SM genes. In addition, 60 (15.3%) AraCyc annotated SM genes were mispredicted as GM genes. To assess the properties of mispredicted SM/GM genes, we determined how the values of a subset of the most informative features (Fig. 4 and Dataset S3) differed between four gene classes defined based on the consistency between the gene annotation and the benchmark 2 (AraCyc only)-based model prediction. These four classes included (i) annotated GM predicted as GM (GM [annotation] → GM [prediction]), (ii) annotated SM predicted as SM (SM → SM), (iii) annotated GM predicted as SM (GM → SM), and (iv) annotated SM predicted as GM (SM → GM). Genes in the mispredicted classes (3 and 4) tend to have feature values between those of genes in correctly predicted classes (1 and 2). For example, the median values of the functional likelihood among these four gene classes follow the order GMGM > SM → GM > GM → SM > SM → SM (Fig. 4). The opposite pattern (SM → SM has the highest value) was observed for dN/dS values (Fig. 4), gene family size (Fig. 4), the number of conditions expressed (Fig. 4), and values for other gene features we examined (). Thus, in the SM → GM mispredicted class, the annotated SM genes in fact possess multiple properties that are more similar to those of GM genes and vice versa, but no single feature can fully explain why these genes were mispredicted. These observations suggest that some of the mispredicted benchmark genes (Fig. 4) may in fact be misannotated, or alternatively, they may point to a deficiency in our model (addressed in the next section). To assess how many of the mispredictions are due to misannotation, we collated information from 25 genes with predictions (from the benchmark 2-based model) matching the AraCyc annotations (GMGM = 4, SM → SM = 21), and for 32 genes with predictions that were not consistent with their AraCyc annotations (SM → GM = 20, GM → SM = 12) (, and Dataset S1). We focused on genes in the P450/terpene synthase families because there is substantial biochemical and functional information available for these genes (3, 48, 49). For mispredicted genes, which were manually examined, five (42%) genes in the GM → SM class had supporting SM evidence (Dataset S1). In addition, 16 (80%) genes in the SM → GM class have supporting GM evidence (Dataset S1). These findings indicate that a subset of these genes (66%) are mispredicted due to misannotation, not due to prediction errors. For the benchmark 1 set, which is based on the union between AraCyc and GO annotations, a similar percentage of the mispredicted genes [5 of 11 GM → SM (45%) genes examined] were likely misannotated (Dataset S1). This is consistent with our finding that some SM genes enriched in AraCyc pathways and GO terms—such as carotene, leucine, suberin, and wax biosynthesis—are found across all major land plant lineages and should be considered GM genes (Fig. 1 and ). It is also possible that some of the erroneous annotations are based on in vitro biochemical activity and/or sequence similarities alone, criteria that may not accurately represent their in vivo functions. We should note that genes that were manually examined were mostly from the P450 and terpene synthase families. More enzyme families should be evaluated to obtain a more complete picture of the reasons behind inconsistent annotation and prediction. Together with the finding that nearly all (24/25) benchmark 2 genes with consistent annotations and predictions had biochemical evidence supporting their SM or GM classification (), these results further demonstrate the feasibility of using the model prediction outcome to prioritize future experiments to determine the in planta role of SM or GM genes, including those that may be misannotated or have functions in addition to their annotated activities.

Effect of Dual-Annotated Genes on Model Performance.

The number of genes mispredicted with our model that were not misannotated according to our manual examination (, and Dataset S1) indicate that our model can be further improved. Our original model focused on distinguishing SM and GM genes as binary classes, but genes with both SM and GM functions were excluded. However, there are 261 genes (Fig. 1) annotated as belonging to both SM and GM pathways in AraCyc (dual-annotated or DA genes; Fig. 5). We thus explored the possibility that DA genes have properties distinct from SM or GM genes and should be considered a distinct class. We first compared the SM scores between SM, GM, and DA genes based on our AraCyc-only binary model. If DA genes belong to a distinct class that is neither SM nor GM, the SM scores of DA genes should have a unimodal distribution with a median close to 0.5. Contrary to this expectation, the SM score distribution of DA genes is bimodal, where some DA genes resemble SM genes and others resemble GM genes (Fig. 5). Thus, based on a GM vs. SM binary model, DA genes do not appear to belong to a distinct class. These findings raise the question whether the dual annotation is valid.
Fig. 5.

Three-class models for classifying SM/GM/DA and SM/GM/JC genes. (A) Definition of DA (dual annotation) and JC (junction) genes. For JC genes, four subclasses were defined based on the degree of connectivity, defined as the number of connecting reactions in the metabolic network based on AraCyc annotations. The letters a–o indicate SM/GM enzymes that are annotated as GM (blue), SM (red), or DA (green outline) or are defined as JC (orange). JC reaction substrates and products are in black and gray, respectively. (B) Distributions of SM scores based on the binary model built using benchmark 2 data for GM, SM, DA, JC (all), and JC subclass genes. (C–F) Ternary plots showing the SM/GM/DA model-based score distributions for GM (C), SM (D), DA (E), and JC (F) genes. The g (blue), s (red), d (green), and j (orange) labels indicate the peak gene density areas (brighter yellow) occupied by GM, SM, DA, and JC genes, respectively. (G–J) Ternary plots showing the SM/GM/JC model-based score distributions for GM (G), SM (H), JC (I), and DA (J) genes. The color scheme follows that in C–F.

Three-class models for classifying SM/GM/DA and SM/GM/JC genes. (A) Definition of DA (dual annotation) and JC (junction) genes. For JC genes, four subclasses were defined based on the degree of connectivity, defined as the number of connecting reactions in the metabolic network based on AraCyc annotations. The letters a–o indicate SM/GM enzymes that are annotated as GM (blue), SM (red), or DA (green outline) or are defined as JC (orange). JC reaction substrates and products are in black and gray, respectively. (B) Distributions of SM scores based on the binary model built using benchmark 2 data for GM, SM, DA, JC (all), and JC subclass genes. (C–F) Ternary plots showing the SM/GM/DA model-based score distributions for GM (C), SM (D), DA (E), and JC (F) genes. The g (blue), s (red), d (green), and j (orange) labels indicate the peak gene density areas (brighter yellow) occupied by GM, SM, DA, and JC genes, respectively. (G–J) Ternary plots showing the SM/GM/JC model-based score distributions for GM (G), SM (H), JC (I), and DA (J) genes. The color scheme follows that in C–F. To assess whether our inability to distinguish DA genes from SM/GM genes is because the binary model is inadequate, we built a multiclass model assuming SM, GM, and DA genes as three distinct classes and plotted the SM scores for each class in a ternary plot (Fig. 5 ). If the three classes of genes can be perfectly separated, then the highest gene density areas will be toward different corners of the ternary plots. Although the GM/SM/DA model has an F1 score of 0.51 (higher than the F1 of 0.33 for a random model) and an accuracy of 0.53, the inclusion of DA genes as a third class significantly diminished the ability of the model to separate SM (Fig. 5) and GM (Fig. 5) genes. Note that SM and GM genes are not well separated in the ternary plots (Fig. 5 ), but in the binary model, their SM score distributions are highly distinct (Fig. 5). In addition, the DA gene distribution in the ternary plot overlapped with the distributions of both SM and GM genes (Fig. 5), consistent with the bimodal SM score distribution observed among DA genes. Thus, the DA genes belong to two subclasses, with each subclass resembling SM or GM genes, again raising the question whether the dual annotations in AraCyc are valid. Curiously, GM genes separate into two populations in the GM/SM/DA model where one population is located toward the GM corner of the ternary plot (arrow g1) and the second population (arrow g2) overlaps with areas of high SM (arrow s) and DA (arrow d) gene density (Fig. 5). Therefore, although this three-class model does not separate SM and GM genes well, it raises the question of how the two GM gene populations (g1/g2 peaks) differ and should be further examined.

Consideration of Junction Genes in Predictive Model Building.

Another potential way to improve our model is to consider metabolic network topologies. We hypothesized that SM and GM genes closer to pathway junctions ( and Fig. 5) are more likely to be mispredicted. We identified junction reactions connecting 15 GM (upstream) and 20 SM (downstream) pathways. The 212 genes encoding enzymes responsible for junction reactions were referred to as junction (JC) genes. By further classifying JC genes based on the connectivity of their associated reactions, four topological subclasses of junction genes were defined: 1 → J → 1, junction reactions, each connected with one reaction upstream and one reaction downstream; n → J → 1, multiple upstream reactions but only one downstream reaction; 1 → J → n, one upstream and multiple downstream reactions; and n → J → n, multiple upstream and downstream reactions (Fig. 5). Although junction genes as a whole also have a bimodal SM score distribution similar to that of DA genes (JC all; Fig. 5), the score distributions were distinct among the four topological subclasses, indicating that network topology is a distinguishing characteristic between SM and GM genes. Considering that products of GM pathways serve as substrates for many other pathways, it is expected that GM genes functioning in junction reactions would be connected to multiple downstream pathways. Consistent with this, JC genes in the n → J → n and 1 → J → n subclasses where n > 1 tend to be more similar to GM genes (Fig. 5). In contrast, SM enzymes are more likely involved in incorporating substrates from multiple reactions and serve as the committed step for producing specialized metabolites with an expected n → J → 1 topology. In addition, a typical SM pathway mostly contains a series of nonbranching reactions that lead to specialized metabolite products and is also expected to have a 1 → j → 1 topology. Consistent with these expectations, JC genes in the n → J → 1 and 1 → J → 1 subclasses are the most similar to SM genes (Fig. 5). The GM/SM/JC three-class model separated SM and GM genes significantly better (F1 score = 0.65, accuracy = 0.65; Fig. 5 ) than the GM/SM/DA model (Fig. 5 ), indicating that junction genes have unique characteristics and that some genes intersecting annotated SM and GM pathways can be considered a separate class. In addition, the four topological subclasses of JC genes are located in different areas in the ternary plots for the SM/GM/DA () and GM/SM/JC () models. We should emphasize that JC genes were defined based on a network constructed using AraCyc pathway annotations where the criteria for defining pathway boundaries may differ between research groups and/or annotators. Despite this, the GM/SM/JC model predictions demonstrate that JC genes are by and large distinct from GM/SM genes. Although we cannot be certain which JC genes were key enzymes in the committed steps entering SM pathways, the JC genes in the n → J → 1 subclass are clearly a class of their own with most genes at the JC-like corner (). Taken together, these findings demonstrate that further categorization of SM and GM genes based on biologically motivated criteria, such as network topology, could help further distinguish different types of SM or GM genes leading to modest improvement of our models. In addition, the binary classification of SM and GM genes, while meaningful, can be an oversimplification. Finally, the consideration of additional topological characteristics (e.g., pathway depth and terminal reaction) and additional biochemical features (e.g., substrate and product identities) may lead to further improvements in SM and GM predictions.

Conclusions

Machine learning models built using genomic features show considerable promise in predicting the functions of unclassified or unannotated genes (21, 39). Before establishing such models for predicting SM and GM genes, we first explored how SM and GM genes in A. thaliana differ in >10,000 conservation, protein domain, duplication, epigenetic, expression, and gene network-based features. Most of these features have not been examined by other studies contrasting SM and GM genes. We demonstrated that machine learning models in which these features are integrated to predict SM and GM genes perform well based on cross-validation performed using three benchmark datasets, three predominantly SM gene families, glucosinolate biosynthesis pathway genes, and 39 AraCyc-annotated SM genes that were deliberately withheld from the model building process. Focusing on the AraCyc-only benchmark (benchmark 2), although 380 individual features significantly differed between SM and GM genes, the effect sizes are small, and any individual feature does a poor job of distinguishing SM and GM genes compared with the machine learning models. In addition, machine learning models allow the global prediction of SM and GM genes in a plant genome. Based on the SM scores derived from these models, candidate SM genes can be prioritized for further experimental studies. Although the binary SM/GM gene prediction model performed well, the FPR and FNR were substantial at 28% and 19%, respectively. Through closer examination of experimental evidence for 10 genes annotated as GM genes but predicted as SM genes, we found ∼50% had evidence supporting classification as SM genes, indicating that a subset of the mispredictions is likely due to misannotation. Thus, in addition to predicting likely GM/SM functions of unannotated enzymes, our models can be used to pinpoint potentially misannotated GM/SM genes. Mispredictions can be avoided by further improving the model in two areas: the classes defined and the features used. Classifying enzyme genes as GM and SM may be an oversimplification. By building two three-class models (GM/SM/JC and GM/SM/DA), we found that SM and GM genes could be further categorized based on the metabolic network topology and, to a lesser extent, based on their dual-annotated roles in both SM and GM pathways. Future studies distinguishing genes at the pathway level can be carried out using similar multiclass modeling methods. Additional features that can distinguish SM and GM genes may also be needed to further improve model performance. One possibility is to incorporate topological information as features. Another possibility is to examine feature combinations (e.g., combining an expression and a duplication feature linearly or nonlinearly) using approaches such as deep learning. In summary, we have conducted a global analysis of gene features that are useful to distinguish SM and GM genes. We also established well-performing machine learning models that provide a global estimate of the SM gene content within a plant genome. The great majority of the predicted SM genes have not been assigned to pathways, highlighting the important next step of combining the GM/SM prediction scheme described here with approaches for pathway discovery and assignment. Considering that the most important features are related to gene duplication, evolutionary rate, and gene expression and that these types of data are readily available for an ever-expanding number of plant species, the machine learning workflow we have developed can be readily applied to any other species for predicting SM genes or, more generally, gene functions. Nonetheless, there is room for further improvement. Our prediction model serves as a baseline model for future studies incorporating additional features and algorithms that are anticipated to further improve the accuracy of predictions.

Methods

Specialized and General Metabolism Gene Annotation and Enrichment Analysis.

Gene sets were identified based on GO (ref. 27; www.geneontology.org/ontology/go.obo), and/or AraCyc (ref. 28; https://www.plantcyc.org/) annotations but not MapMan (50). We did not analyze MapMan annotations because all GO and AraCyc SM genes, which include a large number of well-known SM examples, were annotated as GM in MapMan, raising questions about the utility of MapMan SM/GM designations. GO annotations for A. thaliana were downloaded from The Arabidopsis Information Resource (TAIR) (51), and genes annotated with the secondary metabolism term (GO:0019748) and primary metabolism term (GO:0044238) were selected as potential SM genes and GM genes, respectively. Genes that were associated with more specialized primary and secondary metabolism child GO terms were also classified as GM and SM genes, respectively. Only genes annotated with either SM or PM terms, but not both, were included in the analysis, and only those with experimental evidence codes IDA, IEP, IGI, IPI, and/or IMP were included. For AraCyc genes, the v.15 pathway annotations were retrieved from the Plant Metabolic Network database (https://www.plantcyc.org/) (28). Potential SM genes were those associated with secondary metabolites biosynthesis pathways. Potential GM genes were those found in nonsecondary metabolite biosynthesis pathways. In addition, genes without experimental evidence in AraCyc (EV-EXP) were not included in the benchmark. Some genes were annotated in both SM and non-SM pathways and were defined as dual-annotated (DA) genes, not as SM or GM. Potential SM and GM genes from GO or AraCyc were required to have an EC number annotation from AraCyc or from Pfam v.30 (pfam.xfam.org/) (52). Five benchmark gene sets were defined. In addition, glucosinolate pathway genes were also defined to test model performance. The criteria for defining benchmarks and glucosinolate pathway genes are detailed in . Terpene synthase, P450, and methyltransferase genes were identified from A. thaliana annotated protein sequences using the following domain matches from Pfam: terpene_synth, p450 and methyltr_2. Details of gene set enrichment analysis are available in .

Expression Dataset Processing and Coexpression and Gene Network Analysis.

Expression datasets were downloaded from TAIR. Target datasets included plant development (53), biotic stress (54), abiotic stress (54, 55), hormone treatment (56), and diurnal expression (57). Genes that were considered significantly expressed relative to the background in the development expression dataset were those with a log2 microarray hybridization intensity value of ≥4 (the cutoff value is based on our earlier study, ref. 39). The median and maximum expression levels and expression variation and breadth across the developmental expression dataset were calculated as previously described (39). Differentially expressed genes under biotic stress, abiotic stress, and hormone treatments were defined as those that had an absolute log2 fold change ≥1 and adjusted P < 0.05 following analysis using the affy and limma packages in R (58, 59). For each gene, the number of conditions in which the gene in question was significantly differentially regulated was also calculated. This resulted in 16 expression values that were used as model features (Dataset S2). For each expression dataset (development, abiotic, biotic, and hormone), PCCs were calculated between each gene and genes in the same paralogous cluster as defined by ORTHOMCL v1.4 (60). For the gene in question, the maximum PCC < 1 for genes in the paralog cluster was used as the PCC value. In addition to examining expression correlation, coexpressed genes in the biotic stress, abiotic stress, diurnal, and developmental datasets were classified into coexpression clusters using K-means, approximate kernel K-means, c-means, and hierarchical clustering algorithms as described in our earlier study (26) resulting in 5,303 binary features. For K-means–related analyses, the within-cluster sum of squares was plotted against the number of clusters, and K was chosen based on the number of clusters at the elbow or bend of the plot. Gene clusters that were significantly enriched in SM or GM genes were identified using Fisher’s exact tests (adjusted P < 0.05). The number of AraNet gene network interactions (ref. 36; www.functionalnet.org/aranet/), number of protein interactions (35), domain number, and amino acid length were calculated in our earlier study (39). There were 23 model features related to PCC values, significant cluster membership, and gene network data (Dataset S2).

Conservation, Duplication, Methylation, Histone Modification, and Genome Location Related Features.

Nonsynonymous (dN)/synonymous (dS) substitution rates between plant homologs, core eukaryotic gene status, nucleotide diversity data, Fay and Wu’s H, and MacDonald–Kreitman test statistics were the same as used in our earlier studies (39, 61, 62). Details on determining the timing of duplication of an A. thaliana gene are available in . Pseudogenes were defined using a published pipeline (53). The lethal gene scores, which represent the relative likelihood that a mutation in a gene is lethal, and additional gene duplication-related features, including gene family size, rates of synonymous substitutions, α and β/γ whole genome duplication status, and tandem duplication status (Dataset S2), were obtained from (39). CG methylation and log2 fold change of histone marks relative to background were taken from ref. 39 (detailed in ). Three approaches were used to evaluate the degree of metabolic gene clustering ().

Machine Learning Classification of SM and GM Genes.

The prediction models were built based on 10,243 features using the Random Forest (RF) and Support Vector Machine (SVM) algorithms implemented using the Python package sci-kit learn (63). To build binary machine learning models, we used three benchmark sets (benchmarks 1, 2, and 3). For each benchmark set, SM and GM genes were first divided into a modeling set (90%) and a holdout set for independent validation (10%). Because there were significantly more GM genes than SM genes, 100 balanced datasets were constructed by randomly selecting GM genes equal to the number of SM genes in each balanced set. Additionally, 10-fold cross validation was performed for 100 random draws of a balanced dataset for each machine learning run, and grid searches were performed to obtain the best performing parameters for each model. Details for the performance measure are available in . A confidence score between 0 and 1 was produced by the model and was used as the SM prediction score. For the procedure to define threshold SM score classifying a gene as SM or not, the performance measures used, and the random background model, see . Dual-annotation (DA) genes are genes annotated as both GM and SM pathway genes in AraCyc. Junction (JC) genes were defined based on the pathway annotation data (pathway.dat) from the PlantCyc A. thaliana v.12 dataset. Two three-class models were built. The first SM/GM/DA model used SM, GM, and DA genes (benchmark 4) as the three classes. The second SM/GM/JC model used SM, GM, and JC genes (benchmark 5). Additional information for defining the JC gene type is available in .
  59 in total

Review 1.  Network analysis for gene discovery in plant-specialized metabolism.

Authors:  Yasuhiro Higashi; Kazuki Saito
Journal:  Plant Cell Environ       Date:  2013-02-21       Impact factor: 7.228

Review 2.  Growth-defense tradeoffs in plants: a balancing act to optimize fitness.

Authors:  Bethany Huot; Jian Yao; Beronda L Montgomery; Sheng Yang He
Journal:  Mol Plant       Date:  2014-04-27       Impact factor: 13.164

3.  Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Authors: 
Journal:  Nature       Date:  2000-12-14       Impact factor: 49.962

4.  Evidence for network evolution in an Arabidopsis interactome map.

Authors: 
Journal:  Science       Date:  2011-07-29       Impact factor: 47.728

5.  Transcriptional coordination of the metabolic network in Arabidopsis.

Authors:  Hairong Wei; Staffan Persson; Tapan Mehta; Vinodh Srinivasasainagendra; Lang Chen; Grier P Page; Chris Somerville; Ann Loraine
Journal:  Plant Physiol       Date:  2006-08-18       Impact factor: 8.340

6.  The Arabidopsis information resource: Making and mining the "gold standard" annotated reference plant genome.

Authors:  Tanya Z Berardini; Leonore Reiser; Donghui Li; Yarik Mezheritsky; Robert Muller; Emily Strait; Eva Huala
Journal:  Genesis       Date:  2015-08-04       Impact factor: 2.487

7.  Coselected genes determine adaptive variation in herbivore resistance throughout the native range of Arabidopsis thaliana.

Authors:  Benjamin Brachi; Christopher G Meyer; Romain Villoutreix; Alexander Platt; Timothy C Morton; Fabrice Roux; Joy Bergelson
Journal:  Proc Natl Acad Sci U S A       Date:  2015-03-16       Impact factor: 11.205

8.  Plant cell culture for production of paclitaxel and other taxanes.

Authors:  Jian-Jiang Zhong
Journal:  J Biosci Bioeng       Date:  2002       Impact factor: 2.894

9.  An overview of gibberellin metabolism enzyme genes and their related mutants in rice.

Authors:  Tomoaki Sakamoto; Koutarou Miura; Hironori Itoh; Tomoko Tatsumi; Miyako Ueguchi-Tanaka; Kanako Ishiyama; Masatomo Kobayashi; Ganesh K Agrawal; Shin Takeda; Kiyomi Abe; Akio Miyao; Hirohiko Hirochika; Hidemi Kitano; Motoyuki Ashikari; Makoto Matsuoka
Journal:  Plant Physiol       Date:  2004-04-09       Impact factor: 8.340

10.  Characteristics and significance of intergenic polyadenylated RNA transcription in Arabidopsis.

Authors:  Gaurav D Moghe; Melissa D Lehti-Shiu; Alex E Seddon; Shan Yin; Yani Chen; Piyada Juntawong; Federica Brandizzi; Julia Bailey-Serres; Shin-Han Shiu
Journal:  Plant Physiol       Date:  2012-11-06       Impact factor: 8.340

View more
  19 in total

Review 1.  Harnessing evolutionary diversification of primary metabolism for plant synthetic biology.

Authors:  Hiroshi A Maeda
Journal:  J Biol Chem       Date:  2019-09-26       Impact factor: 5.157

Review 2.  Using interdisciplinary, phylogeny-guided approaches to understand the evolution of plant metabolism.

Authors:  Craig A Schenck; Lucas Busta
Journal:  Plant Mol Biol       Date:  2021-11-23       Impact factor: 4.076

3.  Interplay between gene expression and gene architecture as a consequence of gene and genome duplications: evidence from metabolic genes of Arabidopsis thaliana.

Authors:  Dola Mukherjee; Deeya Saha; Debarun Acharya; Ashutosh Mukherjee; Tapash Chandra Ghosh
Journal:  Physiol Mol Biol Plants       Date:  2022-06-02

4.  Genetic variation, environment and demography intersect to shape Arabidopsis defense metabolite variation across Europe.

Authors:  Ella Katz; Jia-Jie Li; Benjamin Jaegle; Haim Ashkenazy; Shawn R Abrahams; Clement Bagaza; Samuel Holden; Chris J Pires; Ruthie Angelovici; Daniel J Kliebenstein
Journal:  Elife       Date:  2021-05-05       Impact factor: 8.140

5.  Polyploidy underlies co-option and diversification of biosynthetic triterpene pathways in the apple tribe.

Authors:  Wenbing Su; Yi Jing; Shoukai Lin; Zhen Yue; Xianghui Yang; Jiabao Xu; Jincheng Wu; Zhike Zhang; Rui Xia; Jiaojiao Zhu; Ning An; Haixin Chen; Yanping Hong; Yuan Yuan; Ting Long; Ling Zhang; Yuanyuan Jiang; Zongli Liu; Hailan Zhang; Yongshun Gao; Yuexue Liu; Hailan Lin; Huicong Wang; Levi Yant; Shunquan Lin; Zhenhua Liu
Journal:  Proc Natl Acad Sci U S A       Date:  2021-05-18       Impact factor: 11.205

Review 6.  Fruity, sticky, stinky, spicy, bitter, addictive, and deadly: evolutionary signatures of metabolic complexity in the Solanaceae.

Authors:  Paul D Fiesel; Hannah M Parks; Robert L Last; Cornelius S Barry
Journal:  Nat Prod Rep       Date:  2022-07-20       Impact factor: 15.111

7.  Integrative analysis of the shikonin metabolic network identifies new gene connections and reveals evolutionary insight into shikonin biosynthesis.

Authors:  Thiti Suttiyut; Robert P Auber; Manoj Ghaste; Cade N Kane; Scott A M McAdam; Jennifer H Wisecaver; Joshua R Widhalm
Journal:  Hortic Res       Date:  2022-01-20       Impact factor: 7.291

8.  A Tale of Two Families: Whole Genome and Segmental Duplications Underlie Glutamine Synthetase and Phosphoenolpyruvate Carboxylase Diversity in Narrow-Leafed Lupin (Lupinus angustifolius L.).

Authors:  Katarzyna B Czyż; Michał Książkiewicz; Grzegorz Koczyk; Anna Szczepaniak; Jan Podkowiński; Barbara Naganowska
Journal:  Int J Mol Sci       Date:  2020-04-08       Impact factor: 5.923

9.  An improved catalogue of putative synaptic genes defined exclusively by temporal transcription profiles through an ensemble machine learning approach.

Authors:  Flavio Pazos Obregón; Martín Palazzo; Pablo Soto; Gustavo Guerberoff; Patricio Yankilevich; Rafael Cantera
Journal:  BMC Genomics       Date:  2019-12-23       Impact factor: 3.969

Review 10.  Network-based approaches for understanding gene regulation and function in plants.

Authors:  Dae Kwan Ko; Federica Brandizzi
Journal:  Plant J       Date:  2020-08-28       Impact factor: 6.417

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.