Gabi Kastenmüller, Maria Elisabeth Schenk, Johann Gasteiger, Hans-Werner Mewes.
Abstract
Identifying the biochemical basis of microbial phenotypes is a main objective of comparative genomics. Here we present a novel method using multivariate machine learning techniques for comparing automatically derived metabolic reconstructions of sequenced genomes on a large scale. Applying our method to 266 genomes directly led to testable hypotheses such as the link between the potential of microorganisms to cause periodontal disease and their ability to degrade histidine, a link also supported by clinical studies.
Year: 2009 PMID: 19284550 PMCID: PMC2690999 DOI: 10.1186/gb-2009-10-3-r28
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Overview of the approach. The three major steps of our approach are: metabolic reconstruction of completely sequenced genomes resulting in pathway profiles; pathway selection resulting in lists of pathways ranked by relevance; and cross-checking of the resulting pathway rankings by classification in order to estimate their significance (Figure S1 in Additional data file 2).
Figure 2. Estimating the significance of pathway rankings provided by pathway selection. For phenotypes that are weakly associated with the presence or absence of specific metabolic pathways, the classification quality should be within the same range for classification based on randomly picked pathways (red), all pathways (marked by a horizontal line), and pathways highly ranked in attribute subset selection (green, ReliefF; yellow, SVMAttributeEval; blue, wrapper (naïve Bayes)). As an example, the right diagram shows the classification quality for the phenotype 'habitat: soil' as a function of the number of top-ranking pathways used for classification. In this case, the top-ranking pathways provided by attribute subset selection are considered not significant for the phenotype. The left diagram shows the classification quality values for the phenotype 'obligate intracellular'. Using the most relevant pathways for classification results in higher classification quality than using all pathways or randomly picked pathways. Furthermore, the quality values lie above 0.6. In this case, the most relevant pathways derived by attribute subset selection are considered significant.
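The decision described in this caption can be sketched for a single point on such a diagram. This is only an illustration: the function name and the way the three conditions are combined are assumptions, the caption states the criteria qualitatively, and only the 0.6 level is taken from it.

```python
def pathways_look_significant(q_top, q_all, q_random, threshold=0.6):
    """Apply the cross-check criteria from the caption to one set of values.

    q_top:     quality using the top-ranking pathways
    q_all:     quality using all pathways
    q_random:  mean quality using randomly picked pathways
    threshold: 0.6, the level mentioned in the caption
    Combining the three conditions with 'and' is an assumption.
    """
    return q_top > threshold and q_top > q_all and q_top > q_random


# Invented values for illustration only (not read from the figure).
strong = pathways_look_significant(q_top=0.9, q_all=0.5, q_random=0.2)
weak = pathways_look_significant(q_top=0.35, q_all=0.4, q_random=0.3)
print(strong, weak)
```

In practice the comparison would be repeated for each number of top-ranking pathways plotted along the horizontal axis.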
Figure 3. Cross-checking for the phenotype methanogenesis. The classification quality diagrams for the nearest neighbor classifier (IB1) and the naïve Bayes classifier show that the identified most relevant pathways are well suited to distinguish methanogens and non-methanogens (sensitivity × selectivity = 1.0). According to the cross-check, the most relevant pathways identified by pathway selection are considered significant. Except when the ReliefF top-ranking pathways (green) are used for classification with IB1, the maximum classification quality is already reached with the (up to) five most relevant pathways (these pathways are listed in Table 1).
Table 1. Relevant pathways for methanogenesis
| Dataset | ReliefF | SVMAttributeEval | Wrapper (naïve Bayes) |
|---|---|---|---|
| Complete (266) | Reduction of CO2 to CH4 (methane1) ↑ | Reduction of CO2 to CH4 (methane1) ↑ | Reduction of CO2 to CH4 (methane1) ↑ |
| | Biosynthesis of cardiolipin (phospholipids1) ↓ | Biosynthesis of cardiolipin (phospholipids1) ↓ | Degradation of L-lysine to crotonyl-CoA (lysine3) ↓ |
| | Biosynthesis of peptidoglycan I (aminosugars4) ↓ | beta-Oxidation of fatty acids (fa2) ↓ | Biosynthesis of coenzyme A (coa1) ↓ |
| | Heme biosynthesis (pyrrole3) ↓ | Degradation of L-threonine to L-2-aminoacetate (threonine2) ↓ | |
| | Pentose phosphate cycle (non-oxidative branch) (ppc3) ↓ | Biosynthesis of phosphatidylserine (phospholipids3) ↑ | |
| Archaea (23) | Biosynthesis of 2'-deoxythymidine-5'-triphosphate (dtn1) ↑ | Biosynthesis of 2'-deoxythymidine-5'-triphosphate (dtn1) ↑ | Reduction of CO2 to CH4 (methane1) ↓ |
| | Reduction of CO2 to CH4 (methane1) ↑ | Biosynthesis of L-phenylalanine from chorismate (aaa4) ↑ | Biosynthesis of 2'-deoxythymidine-5'-triphosphate (dtn1) ↑ |
| | Biosynthesis of phosphatidylserine (phospholipids3) ↑ | Reduction of CO2 to CH4 (methane1) ↑ | Degradation of L-threonine to L-2-aminoacetate (threonine2) ↓ |
| | Degradation of L-threonine to L-2-aminoacetate (threonine2) ↓ | Degradation of dGMP to deoxyguanosine (dgn2) ↓ | Degradation of L-lysine to crotonyl-CoA (lysine3) ↓ |
| | Degradation of tryptophane to 6-hydroxymelatonin (trp5) ↑ | Biosynthesis of phosphatidylserine (phospholipids3) ↑ | Biosynthesis of coenzyme B12 (coba1) ↑ |
| Archaea (23) (without methane1) | Biosynthesis of 2'-deoxythymidine-5'-triphosphate (dtn1) ↑ | Biosynthesis of 2'-deoxythymidine-5'-triphosphate (dtn1) ↑ | Biosynthesis of 2'-deoxythymidine-5'-triphosphate (dtn1) ↑ |
| | Biosynthesis of phosphatidylserine (phospholipids3) ↑ | Biosynthesis of L-phenylalanine from chorismate (aaa4) ↑ | Biosynthesis of coenzyme B12 (coba1) ↑ |
| | Degradation of L-threonine to L-2-aminoacetate (threonine2) ↓ | Degradation of L-threonine to L-2-aminoacetate (threonine2) ↓ | Degradation of L-valine (vas4) ↓ |
| | Degradation of tryptophane to 6-hydroxymelatonin (trp5) ↑ | Biosynthesis of phosphatidylserine (phospholipids3) ↑ | Degradation of L-threonine to L-2-aminoacetate (threonine2) ↓ |
| | Biosynthesis of coenzyme B12 (coba1) ↑ | Odd-numbered fatty acid metabolism (glf2) ↓ | Degradation of L-lysine to crotonyl-CoA (lysine3) ↓ |
The relevant pathways for methanogenesis were determined by applying three different attribute selection methods (ReliefF, SVMAttributeEval, and a wrapper for the naïve Bayes classifier) to three datasets. The (up to) five most relevant pathways obtained for the complete set of pathway profiles (266 genomes), the archaeal pathway profiles (23 genomes), and the archaeal profiles (23 genomes) without the attribute 'methane1' are shown. An upwards pointing arrow denotes pathways that are relevant due to higher pathway scores (that is, pathways are more complete) in methanogens compared to the other genomes in the investigated dataset. Analogously, a downwards pointing arrow denotes pathways that are relevant due to lower pathway scores (that is, pathways are less complete) in methanogens.
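As a rough illustration of the arrow notation, the direction for one pathway could be derived by comparing pathway scores inside and outside the phenotype group. Comparing group means is an assumption made for this sketch; the text only says that scores are higher or lower in methanogens. The function name and the toy scores are hypothetical.

```python
from statistics import mean


def score_direction(scores, has_phenotype):
    """Return an arrow for one pathway.

    scores:        pathway score per genome
    has_phenotype: True for genomes showing the phenotype (e.g. methanogens)
    Assumption: groups are compared by their mean score, and both groups
    contain at least one genome.
    """
    inside = [s for s, flag in zip(scores, has_phenotype) if flag]
    outside = [s for s, flag in zip(scores, has_phenotype) if not flag]
    diff = mean(inside) - mean(outside)
    if diff > 0:
        return "↑"  # more complete in the phenotype group
    if diff < 0:
        return "↓"  # less complete in the phenotype group
    return "="


# Invented scores for six genomes, the first three showing the phenotype.
flags = [True, True, True, False, False, False]
up = score_direction([1.0, 0.9, 1.0, 0.2, 0.1, 0.0], flags)
down = score_direction([0.1, 0.0, 0.2, 0.8, 1.0, 0.9], flags)
print(up, down)
```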
Table 2. Classification quality for the classification of 23 archaeal genomes into methanogens and non-methanogens using the 5 most relevant pathways
| Classifier | ReliefF | SVM | Wrapper | All pathways | Random |
|---|---|---|---|---|---|
| J48 | 0.88 | 0.88 | 0.59 | 0.83 | 0.17 |
| IB1 | 0.94 | 1.00 | 1.00 | 0.29 | 0.31 |
| Naïve Bayes | 0.94 | 1.00 | 0.83 | 0.83 | 0.38 |
| SMO | 1.00 | 1.00 | 1.00 | 1.00 | 0.01 |
The 23 archaeal genomes were classified into methanogens and non-methanogens using only the five most relevant pathways from Table 1. We applied four different classifiers (J48, IB1, naïve Bayes, and SMO) with tenfold cross-validation. In addition, the genomes were classified based on all pathways (290) in the pathway profile as well as on five randomly chosen pathways. To estimate the quality of classification, we calculated the product of classification selectivity and sensitivity, which is shown in this table. In the case of randomly chosen pathways, the value was derived by averaging the classification quality of 25 sets of 5 randomly chosen pathways.
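The quality measure used in this table can be sketched as follows. The formulas are an assumption: sensitivity is taken as TP / (TP + FN) and selectivity as TP / (TP + FP), since the definitions are not given in this excerpt. The function name and the toy labels are hypothetical.

```python
def classification_quality(actual, predicted, positive=True):
    """Return sensitivity * selectivity for a binary classification.

    Assumed definitions (not stated in this excerpt):
      sensitivity = TP / (TP + FN)
      selectivity = TP / (TP + FP)
    A ratio with a zero denominator is treated as 0.0.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    selectivity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity * selectivity


# Invented labels (not data from the study): 4 positives, 4 negatives.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]
quality = classification_quality(actual, predicted)  # TP=3, FP=1, FN=1
print(quality)
```

Under these definitions a classifier that never predicts the positive class scores 0.0, and a perfect separation scores 1.0.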
Figure 4. Classification quality for the classification of archaea into methanogens and non-methanogens using the nearest neighbor classifier. The classification based on the four most relevant pathways yields a perfect separation of methanogenic archaea and non-methanogenic archaea for all attribute subset selection methods used (green, ReliefF; yellow, SVMAttributeEval; blue, wrapper (naïve Bayes)). Classification based on all pathways (marked by a horizontal line) and based on randomly picked pathways (red) show lower classification quality.
Table 3. Classification quality for the classification of 23 archaeal genomes into methanogens and non-methanogens using the 5 most relevant pathways derived from pathway profiles without methane1
| Classifier | ReliefF | SVM | Wrapper | All pathways except methane1 | Random |
|---|---|---|---|---|---|
| J48 | 0.59 | 0.88 | 0.59 | 0.67 | 0.01 |
| IB1 | 0.94 | 1.00 | 1.00 | 0.29 | 0.40 |
| Naïve Bayes | 0.78 | 1.00 | 0.77 | 0.67 | 0.59 |
| SMO | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
The 23 archaeal genomes were classified into methanogens and non-methanogens using only the five most relevant pathways from Table 1. These relevant pathways were derived by attribute subset selection based on pathway profiles without the pathway methane1. We applied four different classifiers (J48, IB1, naïve Bayes, and SMO) with tenfold cross-validation. In addition, the genomes were classified based on all pathways (290) in the pathway profile as well as on five randomly chosen pathways. To estimate the quality of classification, we calculated the product of classification selectivity and sensitivity, which is shown in this table. In the case of randomly chosen pathways, the value was derived by averaging the classification quality of 25 sets of 5 randomly chosen pathways.
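The random baseline in the last column can be sketched as below: draw 25 sets of 5 pathways, score each set, and average. The `evaluate` callable is hypothetical and stands in for running a classifier with tenfold cross-validation on the chosen pathways; the pathway identifiers and the toy scoring function are invented.

```python
import random
from statistics import mean


def random_baseline(pathways, evaluate, n_sets=25, set_size=5, seed=0):
    """Average the classification quality over randomly drawn pathway sets.

    pathways: list of all pathway identifiers in the profile
    evaluate: hypothetical callable mapping a list of pathways to a
              classification quality in [0, 1]
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    qualities = [
        evaluate(rng.sample(pathways, set_size)) for _ in range(n_sets)
    ]
    return mean(qualities)


# Placeholder identifiers for a profile of 290 pathways.
all_pathways = [f"p{i}" for i in range(290)]


def toy_evaluate(subset):
    """Invented scoring: fraction of chosen pathways among the first ten."""
    informative = set(all_pathways[:10])
    return sum(1 for p in subset if p in informative) / len(subset)


baseline = random_baseline(all_pathways, toy_evaluate)
print(baseline)
```

Setting `set_size=10` gives the variant with ten randomly chosen pathways used for the periodontal phenotypes.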
Figure 5. Classification quality for the classification of archaea into methanogens and non-methanogens using the nearest neighbor classifier while omitting the pathway of methane synthesis. When the pathway of methane synthesis (methane1) is omitted from our analyses, the classification based on the most relevant pathways still reaches perfect separation of methanogenic archaea and non-methanogenic archaea for all attribute subset selection methods used (green, ReliefF; yellow, SVMAttributeEval; blue, wrapper (naïve Bayes)). Classification based on all pathways (marked by a horizontal line) and based on randomly picked pathways (red) show lower classification quality.
Figure 6. Classification quality for the phenotype periodontal disease causing. Left: classification of all genomes (266) into genomes related and not related to periodontal disease using the nearest neighbor classifier (IB1). Right: classification of oral genomes (15) into genomes related and not related to periodontal disease using the nearest neighbor classifier (IB1). Compared to classification based on all pathways (marked by a horizontal line) and based on randomly picked pathways (red), the classification based on the most relevant pathways yields better separation of periodontal species and other species in both genome datasets.
Table 4. Classification quality for the classification of genomes into genomes related and unrelated to periodontal disease using the corresponding 10 most relevant pathways
| Classifier | ReliefF | SVM | Wrapper | All pathways | Random |
|---|---|---|---|---|---|
| J48 | 0.50 | 0.00 | 0.00 | 0.00 | 0.01 |
| | (0.68) | (0.68) | (0.36) | (0.25) | (0.29) |
| IB1 | 0.75 | 0.75 | 0.74 | 0.50 | 0.08 |
| | (0.75) | (0.50) | (0.75) | (0.18) | (0.28) |
| Naïve Bayes | 0.65 | 0.50 | 0.72 | 0.50 | 0.05 |
| | (0.61) | (0.75) | (0.75) | (0.45) | (0.29) |
| SMO | 0.00 | 0.75 | 0.00 | 0.25 | 0.00 |
| | (0.75) | (0.68) | (0.36) | (0.50) | (0.11) |
The 266 (15 oral cavity) genomes of the complete data set were classified into genomes related and not related to periodontal disease using only the ten (eight in the case of the wrapper) most relevant pathways derived by the three attribute selection methods. We applied four different classifiers (J48, IB1, naïve Bayes, and SMO) with tenfold cross-validation. In addition, the genomes were classified based on all pathways (290) in the pathway profile as well as on ten randomly chosen pathways. To estimate the quality of classification, we calculated the product of classification selectivity and sensitivity, which is shown in this table. In the case of randomly chosen pathways, the value was derived by averaging the classification quality of 25 sets of 10 randomly chosen pathways. The data in parentheses are for the dataset containing the 15 oral cavity genomes.
Figure 7. Pathway scores of the relevant pathways for the periodontal species. Plotting the pathway scores of the relevant pathways (from Table 6) makes apparent how A. actinomycetemcomitans (black) differs from F. nucleatum (red), P. gingivalis (green), and T. denticola (blue). In contrast, the scores for F. nucleatum, P. gingivalis, and T. denticola are very similar.
Table 5. Classification quality for the classification of genomes into members and non-members of the red or orange cluster by using the corresponding 10 most relevant pathways
| Classifier | ReliefF | SVM | Wrapper | All pathways | Random |
|---|---|---|---|---|---|
| J48 | 0.67 | 0.00 | 0.67 | 0.00 | 0.00 |
| | (0.92) | (0.92) | (0.92) | (0.00) | (0.20) |
| IB1 | 1.00 | 1.00 | 1.00 | 0.67 | 0.08 |
| | (1.00) | (1.00) | (1.00) | (0.61) | (0.16) |
| Naïve Bayes | 1.00 | 0.67 | 1.00 | 0.67 | 0.12 |
| | (1.00) | (1.00) | (0.67) | (0.67) | (0.14) |
| SMO | 1.00 | 1.00 | 0.67 | 0.00 | 0.00 |
| | (1.00) | (1.00) | (0.92) | (0.67) | (0.08) |
The 266 (15 oral cavity) genomes of the complete data set were classified into members and non-members of the red or orange cluster using only the ten (in the case of wrapper, 6 (266 genomes) or 4 (15 genomes)) most relevant pathways (Additional data file 2) derived by the three attribute selection methods, respectively. We applied four different classifiers (J48, IB1, naïve Bayes, and SMO) with tenfold cross-validation. In addition, the genomes were classified based on all pathways (290) in the pathway profile as well as on ten randomly chosen pathways. To estimate the quality of classification, we calculated the product of classification selectivity and sensitivity, which is shown in this table. In the case of randomly chosen pathways, the value was derived by averaging the classification quality of 25 sets of 10 randomly chosen pathways. The data in parentheses are for the dataset containing the 15 oral cavity genomes.
Figure 8. Classification quality for the phenotype member of red or orange cluster. Left: classification of all genomes (266) into genomes that are members and non-members of the 'red/orange' cluster using the nearest neighbor classifier (IB1). Right: classification of oral genomes (15) into genomes that are members and non-members of the 'red/orange' cluster using the nearest neighbor classifier (IB1). Compared to classification based on all pathways (marked by a horizontal line) and based on randomly picked pathways (red), the classification based on the most relevant pathways yields better separation of the cluster members and other species in both genome datasets.
Table 6. Relevant pathways for the phenotype 'periodontal disease causing'
| Relevant pathway | Dataset | Attribute selection method |
|---|---|---|
| Biosynthesis of coenzyme B12 (coba1) ↑ | 4A, 4O, 3A, 3O | R, S, W |
| Biosynthesis of L-proline (proline1) ↓ | 4A, 4O, 3A, 3O | R, S, W |
| Glutamate fermentation (fnc1) ↑ | 4A, 4O, 3A, 3O | R, S, W |
| Biosynthesis of 5-formimino-THF (c2) ↑ | 4A, 4O, 3A, 3O | R, S |
| Urea cycle (part) (urea2) ↓ | 4A, 4O, 3A, 3O | R, S |
| Conversion of L-glutamate to L-proline (glutamate3) ↓ | 4A, 4O, 3A, 3O | R, S |
| Conversion of L-glutamate to L-ornithine (glutamate2) ↓ | 4A, 4O, 3A, 3O | R |
| Degradation of L-histidine to L-glutamate (histidine2) ↑ | 4O, 3A, 3O | R, S, W |
| Glycolysis and Gluconeogenesis (part) (gg13) ↑ | 4A, 4O, 3O | R, S |
The pathways listed are those that are among the ten most relevant pathways for at least one attribute selection method (R, ReliefF; S, SVMAttributeEval; W, wrapper (naïve Bayes)) and for at least three of the four investigated datasets (4A, 'periodontal disease causing' genomes in the complete dataset (266 genomes); 4O, 'periodontal disease causing' genomes in the oral cavity dataset (15 genomes); 3A, 'members of red or orange cluster' in the complete dataset; 3O, 'members of red or orange cluster' in the oral cavity dataset). An upwards pointing arrow denotes pathways that are relevant due to higher pathway scores (that is, pathways are more complete) for the periodontal species compared to the other genomes in the investigated dataset. Analogously, a downwards pointing arrow denotes pathways that are relevant due to lower pathway scores (that is, pathways are less complete) for the periodontal species.
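The inclusion rule for this table can be sketched as a filter over top-ten lists. The reading used here is an assumption: a pathway counts for a dataset if it appears in the top-ten list of at least one method for that dataset, and it is kept if it counts for at least three datasets. The function name, the placeholder pathway identifiers, and the toy lists are invented.

```python
def consensus_pathways(top_lists, min_datasets=3):
    """Select pathways that recur across datasets.

    top_lists maps (dataset, method) to the list of top-ranked pathway ids,
    e.g. ("4A", "R") for ReliefF on the complete 'periodontal' dataset.
    Returns, per selected pathway, the datasets and methods where it occurs.
    """
    hits = {}
    for (dataset, method), pathways in top_lists.items():
        for pathway in pathways:
            entry = hits.setdefault(
                pathway, {"datasets": set(), "methods": set()}
            )
            entry["datasets"].add(dataset)
            entry["methods"].add(method)
    return {
        pathway: entry
        for pathway, entry in hits.items()
        if len(entry["datasets"]) >= min_datasets
    }


# Invented top lists with placeholder ids (not rankings from the study).
toy_lists = {
    ("4A", "R"): ["pwA", "pwB"],
    ("4O", "S"): ["pwA", "pwC"],
    ("3A", "W"): ["pwA", "pwB"],
    ("3O", "R"): ["pwB", "pwC"],
}
selected = consensus_pathways(toy_lists)
print(sorted(selected))
```

Here `pwC` is dropped because it appears in only two of the four datasets.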
Figure 9. Degradation of histidine. The pathways glutamate fermentation (fnc1) (red) and degradation of histidine to L-glutamate (histidine2) (black) describe (amino acid) degradations producing ammonia as an end product. Due to the reversibility of all its reactions, this also includes the pathway of biosynthesis of 5-formimino-THF (blue), which, followed in reverse, describes the degradation of 5-formimino-THF to glutamate (c2). All three pathways are interconnected and can be interpreted as complete degradation of L-histidine to acetate and ammonia (NH3). In this process, three moles of ammonia are produced per mole of histidine (green or turquoise boxes, respectively). As histidine2 includes an alternative route from L-histidine to glutamate (dashed line), one mole of ammonia is produced either by the conversion of N-carbamoyl-L-glutamate to L-glutamate or by the conversion of N-formimino-tetrahydrofolate to 5,10-methenyl-tetrahydrofolate.