Joseph M Dale, Liviu Popescu, Peter D Karp.
Abstract
BACKGROUND: A key challenge in systems biology is the reconstruction of an organism's metabolic network from its genome sequence. One strategy for addressing this problem is to predict which metabolic pathways, from a reference database of known pathways, are present in the organism, based on the annotated genome of the organism.
Year: 2010 PMID: 20064214 PMCID: PMC3146072 DOI: 10.1186/1471-2105-11-15
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Procedure for applying machine learning methods to metabolic pathway prediction. Data from curated pathway/genome databases (PGDBs) are gathered into a "gold standard" collection. Features are defined using biological knowledge, and their values are computed for all pathways in the gold standard. The resulting dataset is split into training and test sets. Training data are used to perform feature selection and parameter estimation for multiple predictor types. Test data are used to evaluate the predictors. The predictor that performs best on the test set will be applied to data from newly sequenced and annotated genomes to perform metabolic network reconstruction.
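As a rough illustration of the split-and-evaluate steps described in Figure 1, the following Python sketch partitions a labeled dataset, scores candidate predictors on the held-out portion, and keeps the best one. The function names, the 50/50 split, the toy data, and the use of accuracy as the selection measure are assumptions made for illustration; they are not taken from the paper.

```python
import random

def split_gold_standard(rows, test_fraction=0.5, seed=0):
    """Shuffle (features, label) rows and split them into train and test lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(features) == label for features, label in rows) / len(rows)

def pick_best(predictors, test_rows):
    """Return the name of the predictor with the highest test-set accuracy."""
    return max(predictors, key=lambda name: accuracy(predictors[name], test_rows))

# Toy data (hypothetical): one numeric feature per pathway; label True = present.
gold = [({"fraction-reactions-present": i / 10}, i >= 5) for i in range(10)]

# Feature selection and parameter estimation would use `train`; omitted here.
train, test = split_gold_standard(gold)

# Two hypothetical fixed predictors standing in for trained models.
predictors = {
    "threshold-0.5": lambda f: f["fraction-reactions-present"] >= 0.5,
    "always-present": lambda f: True,
}
best = pick_best(predictors, test)
```

Keeping the test rows out of feature selection and parameter estimation is what makes the final comparison between predictor types meaningful.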
Number of positive and negative pathways for each organism in the gold standard dataset
| organism | positives | negatives | total |
|---|---|---|---|
| | 235 | 1035 | 1270 |
| | 297 | 971 | 1268 |
| | 119 | 777 | 896 |
| | 171 | 778 | 949 |
| | 203 | 754 | 957 |
| | 151 | 119 | 270 |
Best-performing Boolean features, ordered by information gain
| Feature | ACC | SN | SP | FM | PR | RC | IG |
|---|---|---|---|---|---|---|---|
| has-enzymes | 0.821 | 0.914 | 0.796 | 0.681 | 0.543 | 0.914 | 0.188 |
| has-reactions-present | 0.797 | 0.919 | 0.765 | 0.655 | 0.509 | 0.919 | 0.173 |
| majority-of-reactions-present | 0.872 | 0.707 | 0.916 | 0.699 | 0.69 | 0.707 | 0.165 |
| some-initial-reactions-present | 0.84 | 0.724 | 0.87 | 0.654 | 0.597 | 0.724 | 0.138 |
| some-initial-and-final-reactions-present | 0.864 | 0.605 | 0.933 | 0.651 | 0.706 | 0.605 | 0.136 |
| mostly-absent-not-unique | 0.215 | 0.163 | 0.229 | 0.08 | 0.053 | 0.163 | 0.133 |
| all-initial-reactions-present | 0.825 | 0.747 | 0.845 | 0.641 | 0.561 | 0.747 | 0.133 |
| every-reaction-present | 0.871 | 0.508 | 0.968 | 0.623 | 0.807 | 0.508 | 0.132 |
| every-reaction-present-or-orphaned | 0.871 | 0.508 | 0.968 | 0.623 | 0.807 | 0.508 | 0.132 |
| taxonomic-range-includes-target | 0.795 | 0.813 | 0.79 | 0.624 | 0.506 | 0.813 | 0.131 |
See section "Feature Extraction and Processing" and Section 1 of Additional file 2 for description of features.
Columns 2 through 8 correspond to various performance measures: ACC = accuracy; SN = sensitivity; SP = specificity; FM = F-measure; PR = precision; RC = recall; IG = information gain.
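The measures listed above can all be derived from the four cells of a confusion matrix. The sketch below shows one way to compute them for a Boolean feature used directly as a predictor. The function names are hypothetical, information gain is computed in bits (base-2 logarithm), which is an assumption since the log base is not stated here, and every row and column of the confusion matrix is assumed to be non-empty.

```python
import math

def entropy(p):
    """Binary entropy, in bits, of a class with positive fraction p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def boolean_feature_metrics(tp, fp, tn, fn):
    """Performance of a Boolean feature treated as a present/absent predictor."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to sensitivity
    # Information gain: class entropy minus class entropy given the feature value.
    n_true, n_false = tp + fp, tn + fn
    h_class = entropy((tp + fn) / total)
    h_given_feature = (n_true / total) * entropy(tp / n_true) \
        + (n_false / total) * entropy(fn / n_false)
    return {
        "ACC": (tp + tn) / total,
        "SN": recall,
        "SP": tn / (tn + fp),
        "FM": f_measure(precision, recall),
        "PR": precision,
        "RC": recall,
        "IG": h_class - h_given_feature,
    }

# Hypothetical counts, for illustration only.
m = boolean_feature_metrics(tp=8, fp=2, tn=6, fn=4)

# Consistency check against the has-enzymes row (PR = 0.543, RC = 0.914).
fm_has_enzymes = round(f_measure(0.543, 0.914), 3)
```

Sensitivity and recall are the same quantity, which is why the SN and RC columns carry identical values in every row; `fm_has_enzymes` reproduces the 0.681 reported in the FM column for has-enzymes.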
Best-performing numeric features, ordered by AUC
| Feature | AUC | max. ACC | SN (max. ACC) | SP (max. ACC) | max. FM | PR (max. FM) | RC (max. FM) |
|---|---|---|---|---|---|---|---|
| fraction-reactions-with-enzymes | 0.902 | 0.878 | 0.662 | 0.935 | 0.715 | 0.641 | 0.807 |
| fraction-reactions-present | 0.899 | 0.879 | 0.618 | 0.948 | 0.699 | 0.612 | 0.815 |
| fraction-reactions-present-or-orphaned | 0.899 | 0.879 | 0.619 | 0.948 | 0.7 | 0.69 | 0.709 |
| best-fraction-reactions-present-in-linear-path | 0.898 | 0.879 | 0.662 | 0.936 | 0.703 | 0.682 | 0.726 |
| evidence-info-content-norm-all | 0.894 | 0.866 | 0.638 | 0.927 | 0.689 | 0.617 | 0.781 |
| enzyme-info-content-norm | 0.89 | 0.855 | 0.69 | 0.899 | 0.69 | 0.584 | 0.844 |
| enzyme-info-content-unnorm | 0.88 | 0.847 | 0.665 | 0.895 | 0.683 | 0.556 | 0.887 |
| evidence-info-content-unnorm | 0.875 | 0.841 | 0.526 | 0.925 | 0.657 | 0.511 | 0.918 |
| num-reactions-with-enzymes | 0.873 | 0.838 | 0.635 | 0.892 | 0.681 | 0.543 | 0.914 |
| enzymes-per-reaction | 0.871 | 0.842 | 0.686 | 0.883 | 0.688 | 0.567 | 0.875 |
See section "Feature Extraction and Processing" and Section 1 of Additional file 2 for description of features.
Columns 2 through 8 correspond to various performance measures: AUC = area under the ROC curve; max. ACC = maximum thresholded accuracy; SN (max. ACC) = sensitivity at maximum-accuracy threshold; SP (max. ACC) = specificity at maximum-accuracy threshold; max. FM = maximum thresholded F-measure; PR (max. FM) = precision at maximum-F-measure threshold; RC (max. FM) = recall at maximum-F-measure threshold.
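For a numeric feature, the threshold-based measures above can be obtained by sweeping a cutoff over the observed values. The following sketch computes AUC through its pairwise-ranking interpretation and finds the maximum thresholded accuracy. The function names and the toy scores are hypothetical, and the sweep assumes that higher feature values indicate a present pathway.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def max_thresholded_accuracy(scores, labels):
    """Best accuracy over cutoffs t, predicting 'present' when score >= t.

    Only observed scores are tried as cutoffs, so the trivial 'predict all
    absent' rule is not considered. Ties keep the highest cutoff.
    """
    best_acc, best_t = 0.0, None
    for t in sorted(set(scores), reverse=True):
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_acc, best_t

# Toy data, for illustration only: feature values and true labels (1 = present).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

area = auc(scores, labels)
acc, cutoff = max_thresholded_accuracy(scores, labels)
```

The maximum-F-measure columns follow the same pattern, with the F-measure replacing accuracy inside the sweep; the two optima generally occur at different cutoffs, which is why sensitivity and recall are reported separately for each.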
Best-performing discretized numeric features, ordered by information gain
| Feature | ACC | SN | SP | FM | PR | RC | IG |
|---|---|---|---|---|---|---|---|
| enzyme-info-content-norm | 0.824 | 0.912 | 0.801 | 0.685 | 0.548 | 0.912 | 0.19 |
| enzymes-per-reaction | 0.822 | 0.914 | 0.798 | 0.683 | 0.545 | 0.914 | 0.189 |
| fraction-reactions-with-enzymes | 0.824 | 0.91 | 0.801 | 0.684 | 0.548 | 0.91 | 0.188 |
| num-reactions-with-enzymes | 0.821 | 0.914 | 0.796 | 0.681 | 0.543 | 0.914 | 0.188 |
| num-enzymes | 0.821 | 0.914 | 0.796 | 0.681 | 0.543 | 0.914 | 0.188 |
| enzyme-info-content-unnorm | 0.821 | 0.914 | 0.796 | 0.681 | 0.543 | 0.914 | 0.188 |
| evidence-info-content-norm-all | 0.821 | 0.893 | 0.802 | 0.677 | 0.545 | 0.893 | 0.179 |
| best-fraction-reactions-present-in-linear-path | 0.842 | 0.85 | 0.84 | 0.693 | 0.584 | 0.85 | 0.179 |
| fraction-reactions-present | 0.83 | 0.869 | 0.82 | 0.682 | 0.562 | 0.869 | 0.176 |
| fraction-reactions-present-or-orphaned | 0.852 | 0.817 | 0.861 | 0.698 | 0.609 | 0.817 | 0.176 |
See section "Feature Extraction and Processing" and Section 1 of Additional file 2 for description of features. See Table 2 for explanation of column headings.
Performance of the existing, manually crafted PathoLogic algorithm for pathway prediction
| ACC | SN | SP | FM | PR | RC | IG |
|---|---|---|---|---|---|---|
| 0.91 | 0.793 | 0.94 | 0.786 | 0.779 | 0.793 | 0.233 |
See Table 2 for explanation of column headings.
Naïve Bayes performance
| Predictor | AUC | max. ACC | SN (max. ACC) | SP (max. ACC) | max. FM | PR (max. FM) | RC (max. FM) |
|---|---|---|---|---|---|---|---|
| all features | 0.91 | 0.883 | 0.763 | 0.915 | 0.736 | 0.68 | 0.804 |
| random features ( | 0.916 | 0.884 | 0.686 | 0.935 | 0.725 | 0.67 | 0.792 |
| random forest ( | 0.924 | 0.888 | 0.709 | 0.936 | 0.737 | 0.693 | 0.791 |
| HC-BIC feature selection | 0.933 | 0.905 | 0.787 | 0.936 | 0.775 | 0.757 | 0.794 |
| HC-AIC feature selection | 0.938 | 0.905 | 0.78 | 0.938 | 0.777 | 0.759 | 0.796 |
| bagged HC-BIC ( | 0.945 | 0.908 | 0.751 | 0.949 | 0.782 | 0.761 | 0.805 |
| bagged HC-AIC ( | 0.946 | 0.909 | 0.757 | 0.949 | 0.78 | 0.767 | 0.796 |
See Table 3 for description of column headings. HC-BIC = hill-climbing on Bayes information criterion; HC-AIC = hill-climbing on Akaike information criterion.
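To make the HC-BIC and HC-AIC labels concrete, the sketch below shows a greedy forward hill-climb that adds one feature at a time while an information criterion keeps improving. It uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The forward-only search, the choice of one parameter per feature, and the toy log-likelihood function are all assumptions made for illustration; the search procedure actually used in the paper may differ.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion (lower is better)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayes information criterion for n training examples (lower is better)."""
    return k * math.log(n) - 2 * log_lik

def hill_climb(features, log_lik_fn, criterion):
    """Greedily add the feature that most lowers the criterion; stop when none does."""
    selected = []
    best = criterion(log_lik_fn(selected), 0)
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            return selected
        score, choice = min(
            (criterion(log_lik_fn(selected + [f]), len(selected) + 1), f)
            for f in candidates
        )
        if score >= best:
            return selected
        selected.append(choice)
        best = score

# Hypothetical stand-in for model fitting: each feature adds a fixed gain
# to the training log-likelihood.
gains = {"a": 20.0, "b": 3.0, "c": 0.5}
toy_log_lik = lambda feats: -100.0 + sum(gains[f] for f in feats)

aic_pick = hill_climb(list(gains), toy_log_lik, aic)
bic_pick = hill_climb(list(gains), toy_log_lik, lambda ll, k: bic(ll, k, n=1000))
```

In this toy case AIC keeps features `a` and `b` while BIC keeps only `a`, because the BIC penalty per parameter (ln 1000, about 6.9) exceeds the AIC penalty of 2.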
Logistic regression performance
| Predictor | AUC | max. ACC | SN (max. ACC) | SP (max. ACC) | max. FM | PR (max. FM) | RC (max. FM) |
|---|---|---|---|---|---|---|---|
| random features ( | 0.939 | 0.902 | 0.732 | 0.947 | 0.768 | 0.74 | 0.8 |
| random forest ( | 0.946 | 0.909 | 0.734 | 0.955 | 0.779 | 0.765 | 0.796 |
| HC-BIC feature selection | 0.948 | 0.91 | 0.738 | 0.956 | 0.785 | 0.765 | 0.808 |
| HC-AIC feature selection | 0.949 | 0.911 | 0.753 | 0.953 | 0.787 | 0.771 | 0.804 |
| bagged HC-BIC ( | 0.951 | 0.912 | 0.744 | 0.956 | 0.786 | 0.763 | 0.812 |
See Table 3 for description of column headings. HC-BIC = hill-climbing on Bayes information criterion; HC-AIC = hill-climbing on Akaike information criterion.
Decision tree performance
| Predictor | AUC | max. ACC | SN (max. ACC) | SP (max. ACC) | max. FM | PR (max. FM) | RC (max. FM) |
|---|---|---|---|---|---|---|---|
| single tree | 0.946 | 0.909 | 0.714 | 0.961 | 0.777 | 0.755 | 0.802 |
| bagged ( | 0.953 | 0.911 | 0.729 | 0.961 | 0.787 | 0.77 | 0.808 |
| random forest ( | 0.952 | 0.911 | 0.736 | 0.957 | 0.786 | 0.758 | 0.818 |
See Table 3 for description of column headings.
Predictor performance using PathoLogic prediction as a feature
| Predictor | AUC | max. ACC | SN (max. ACC) | SP (max. ACC) | max. FM | PR (max. FM) | RC (max. FM) |
|---|---|---|---|---|---|---|---|
| NB, bagged HC-BIC ( | 0.936 | 0.912 | 0.775 | 0.948 | 0.79 | 0.779 | 0.801 |
| LR, HC-AIC | 0.949 | 0.913 | 0.756 | 0.954 | 0.789 | 0.773 | 0.806 |
| DT, bagged ( | 0.953 | 0.914 | 0.763 | 0.954 | 0.794 | 0.782 | 0.807 |
Figure 2. This pathway is present in E. coli; PathoLogic excludes it while our machine learning methods consistently predict it to be present. See Table 10 for selected feature values for this pathway.
Feature values and log-odds ratios for a naïve Bayes predictor constructed with HC-AIC feature selection and trained on the entire gold standard, for pathway 5-aminoimidazole ribonucleotide biosynthesis II, shown in Figure 2.
| Feature | Value | Log-odds |
|---|---|---|
| num-reactions | 6 | -0.04 |
| enzyme-info-content-norm | 0.47 | 2.19 |
| is-subpathway | true | 0.67 |
| biosynthesis-pathway | true | 0.22 |
| majority-of-reactions-present-unique | false | -0.36 |
| has-key-reactions | false | -0.07 |
| some-key-reactions-are-present-alt | true | 0.04 |
| all-key-reactions-are-present-alt | true | 0.04 |
| taxonomic-range-includes-target-alt | true | 1.71 |
| subset-has-same-evidence | true | -0.44 |
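Under a naïve Bayes model, the per-feature log-odds contributions add, so the net evidence from the features in the table above is simply their sum. The short sketch below performs that sum. The prior log-odds term and the logarithm base are not given here, so the result is only the feature contribution, not the full posterior log-odds.

```python
# Log-odds values copied from the table above.
log_odds = {
    "num-reactions": -0.04,
    "enzyme-info-content-norm": 2.19,
    "is-subpathway": 0.67,
    "biosynthesis-pathway": 0.22,
    "majority-of-reactions-present-unique": -0.36,
    "has-key-reactions": -0.07,
    "some-key-reactions-are-present-alt": 0.04,
    "all-key-reactions-are-present-alt": 0.04,
    "taxonomic-range-includes-target-alt": 1.71,
    "subset-has-same-evidence": -0.44,
}

# Net feature evidence; a positive value favors "pathway present".
total = sum(log_odds.values())

# The two features contributing the most evidence in favor of presence.
top_two = sorted(log_odds, key=log_odds.get, reverse=True)[:2]
```

The sum is about 3.96, dominated by enzyme-info-content-norm and taxonomic-range-includes-target-alt, which is consistent with the pathway being predicted present despite several negative contributions.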