Abstract
A common biological pathway reconstruction approach -- as implemented by many automatic biological pathway services (such as the KAAS and RAST servers) and the functional annotation of metagenomic sequences -- starts with the identification of protein functions or families (e.g., KO families for the KEGG database and the FIG families for the SEED database) in the query sequences, followed by a direct mapping of the identified protein families onto pathways. Given a predicted patchwork of individual biochemical steps, some metric must be applied in deciding what pathways actually exist in the genome or metagenome represented by the sequences. Commonly, and straightforwardly, a complete biological pathway can be identified in a dataset if at least one of the steps associated with the pathway is found. We report, however, that this naïve mapping approach leads to an inflated estimate of biological pathways, and thus overestimates the functional diversity of the sample from which the DNA sequences are derived. We developed a parsimony approach, called MinPath (Minimal set of Pathways), for biological pathway reconstructions using protein family predictions, which yields a more conservative, yet more faithful, estimation of the biological pathways for a query dataset. MinPath identified far fewer pathways for the genomes collected in the KEGG database -- as compared to the naïve mapping approach -- eliminating some obviously spurious pathway annotations. Results from applying MinPath to several metagenomes indicate that the common methods used for metagenome annotation may significantly overestimate the biological pathways encoded by microbial communities.
Year: 2009 PMID: 19680427 PMCID: PMC2714467 DOI: 10.1371/journal.pcbi.1000465
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Schematic illustration of the MinPath method.
Assume 6 families (or orthologous groups, f1, …, f6) are identified from a given sample of genes (e.g., the genes could be from a genome, or sampled from a metagenome). The naïve mapping approach (shown on the left) leads to a reconstruction with 4 annotated pathways (p1, p2, p3, and p4). Because biological pathways overlap (see text for more details), pathway p3 shares function f3 with pathway p2. We claim that only three pathways, p1, p2, and p4, are sufficient to explain the existence of the 6 families annotated in the dataset, so a conservative reconstruction should contain only 3 pathways (shown on the right). As we show in the paper, such a conservative estimate of pathways provides a more reliable estimate of the functional diversity of a sample.
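The contrast drawn in this caption can be sketched in a few lines of Python. The family-to-pathway layout below is hypothetical (the caption only states that p3 shares f3 with p2), and the brute-force search for a smallest covering set of pathways is an illustrative stand-in for a parsimony criterion, not necessarily how MinPath itself computes its result.

```python
from itertools import combinations

# Hypothetical layout: only "p3 shares f3 with p2" comes from the caption;
# every other assignment is an assumption made for illustration.
PATHWAYS = {
    "p1": {"f1", "f2"},
    "p2": {"f3", "f4"},
    "p3": {"f3"},
    "p4": {"f5", "f6"},
}
OBSERVED = {"f1", "f2", "f3", "f4", "f5", "f6"}


def naive_mapping(observed, pathways):
    """Report every pathway that has at least one observed family."""
    return sorted(p for p, fams in pathways.items() if fams & observed)


def minimal_pathways(observed, pathways):
    """Return a smallest set of pathways that explains all mappable families.

    Brute force over subsets of increasing size; fine for a toy example,
    exponential in general.
    """
    candidates = naive_mapping(observed, pathways)
    # Only families that belong to some pathway can be explained at all.
    to_explain = {f for f in observed
                  if any(f in pathways[p] for p in candidates)}
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            covered = set().union(*(pathways[p] for p in subset))
            if to_explain <= covered:
                return list(subset)
    return candidates


print(naive_mapping(OBSERVED, PATHWAYS))     # four pathways
print(minimal_pathways(OBSERVED, PATHWAYS))  # three pathways, p3 dropped
```

In this layout p3 contributes no family that p2 does not already explain, so the smallest covering set omits it.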
Figure 2. Comparison of the number of pathways reconstructed for various genomes by different methods.
The color scheme is as follows: MinPath (red triangles), naïve mapping approach (green), and the pathway annotation maintained in the KEGG database after human evaluation (blue).
Selected spurious pathways of the human genome that are incorrectly identified by the naïve mapping approach.
| KEGG ID | Pathway description | Possible reason for being falsely identified by the naïve mapping approach | Removed by MinPath? | Additional notes |
| 00053 | ascorbate and aldarate metabolism | pathway redundancy (the same function is involved in multiple pathways) | yes | humans cannot synthesize ascorbic acid (vitamin C) |
| 00290 | valine, leucine and isoleucine biosynthesis | pathway redundancy | yes | all three are essential amino acids in humans |
| 00521 | streptomycin biosynthesis | pathway redundancy | yes | see table note |
| 00720 | reductive carboxylate cycle | pathway redundancy | yes | it is a CO2 fixation pathway found in photosynthetic bacteria |
Streptomycin biosynthesis is not listed for the human genome in KEGG (http://www.genome.jp/kegg-bin/show_organism?menu_type=pathway_maps&org=hsa), but 5 functional roles from this pathway are annotated in the human genome based on the KEGG annotation: K00844, K01092, K01710, K01835, and K01858.
Figure 3. The ascorbate and aldarate metabolism pathway, eliminated by MinPath.
The diagram was prepared based on the corresponding KEGG pathway (ID = 00053), and only part of the pathway is shown for clarity. The three enzymes annotated in the human genome are highlighted in green; none of them is unique to this pathway.
Selected spurious pathways of the E. coli genome (collected in KEGG) eliminated by MinPath.
| KEGG ID | Pathway description | Functions involved | Removed by MinPath? | Justification | Additional notes |
| 00062 | fatty acid elongation in mitochondria | K00022 | yes | K00022 is shared by this pathway and 6 other pathways | |
| 00120 | bile acid biosynthesis | K00001 K00632 | yes | K00001 is shared by several other pathways, including the glycolysis pathway; K00632 is shared by the fatty acid metabolism pathway and others | bile acids are steroid acids found predominantly in the bile of mammals |
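The justification column rests on one test: does any observed family belong to this pathway alone? A minimal Python sketch of that check follows. Only pathway 00062 and K00022 come from the table; `other_pathway_A`, `other_pathway_B`, and the `K_hyp_*` families are hypothetical placeholders, since the table does not name the other pathways that share K00022.

```python
from collections import defaultdict


def pathways_per_family(pathway_to_families):
    """Invert the mapping: family -> set of pathways that contain it."""
    index = defaultdict(set)
    for pathway, families in pathway_to_families.items():
        for family in families:
            index[family].add(pathway)
    return index


def unique_support(pathway, observed, pathway_to_families):
    """Observed families that occur in this pathway and in no other."""
    index = pathways_per_family(pathway_to_families)
    hits = pathway_to_families[pathway] & observed
    return {f for f in hits if index[f] == {pathway}}


# "00062" and K00022 are from the table; everything else is a placeholder.
MAPPING = {
    "00062": {"K00022"},
    "other_pathway_A": {"K00022", "K_hyp_1"},
    "other_pathway_B": {"K00022", "K_hyp_2"},
}
OBSERVED = {"K00022", "K_hyp_1"}

print(unique_support("00062", OBSERVED, MAPPING))            # empty set
print(unique_support("other_pathway_A", OBSERVED, MAPPING))  # {'K_hyp_1'}
```

A pathway with no uniquely supporting family is not automatically spurious, but it is a candidate for removal whenever other retained pathways already account for all of its observed families.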
Comparison of biological pathway reconstruction based on MinPath and the naïve mapping approach for selected metagenomes (note a).
| Environmental samples | Naïve mapping (KEGG) | MinPath (KEGG) | Naïve mapping (SEED) | MinPath (SEED) |
| Coral-Mic (7) | 188/232 | 109/171 | 497/629 | 186/392 |
| Coral-Vir (6) | 174/211 | 105/140 | 594/667 | 285/441 |
| Marine-Mic (8) | 221/236 | 146/174 | 695/730 | 488/577 |
| Marine-Vir (10) | 213/236 | 154/175 | 680/733 | 507/599 |
| Freshwater-Mic (4) | 196/220 | 137/165 | 678/739 | 460/601 |
| Freshwater-Vir (4) | 113/154 | 57/90 | 392/559 | 112/283 |
| Hyper-saline-Mic (9) | 196/221 | 146/170 | 724/763 | 558/650 |
| Hyper-saline-Vir (12) | 164/181 | 105/137 | 613/697 | 347/510 |
a. Metagenomes sampled from different environments [17]; -Mic and -Vir denote microbial and viral metagenomes, respectively, as shown in the table.
b. Coral-Mic: microbial metagenomes sampled from coral; the number in brackets is the total number of sequencing datasets.
c. KEGG columns: based on the KEGG pathways (the KEGG database used in this study was downloaded in Dec, 2008, and has 345 pathways).
d. SEED columns: based on the SEED subsystems (we used FIGfams release 6, which has more subsystems than reported in [17]; the total number of subsystems included is 898).
e. Table entries: the two numbers give the total number of pathways (or subsystems) found in at least two of the datasets (e.g., two out of 7 for Coral-Mic) and in at least one of the datasets for each environmental location, respectively.
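To make the size of the difference concrete, the sketch below computes the relative reduction from naïve mapping to MinPath for the KEGG columns of the table. The cell values are copied from the table; the helper names and the choice to report a fractional reduction are illustrative assumptions, not a statistic reported by the authors.

```python
# KEGG columns from the table: (naïve mapping, MinPath), each cell written as
# "found in at least two datasets / found in at least one dataset".
KEGG = {
    "Coral-Mic": ("188/232", "109/171"),
    "Coral-Vir": ("174/211", "105/140"),
    "Marine-Mic": ("221/236", "146/174"),
    "Marine-Vir": ("213/236", "154/175"),
    "Freshwater-Mic": ("196/220", "137/165"),
    "Freshwater-Vir": ("113/154", "57/90"),
    "Hyper-saline-Mic": ("196/221", "146/170"),
    "Hyper-saline-Vir": ("164/181", "105/137"),
}


def parse_cell(cell):
    """Split an 'x/y' table cell into a pair of integers."""
    first, second = cell.split("/")
    return int(first), int(second)


def reduction(naive_cell, minpath_cell, position=0):
    """Fraction of naïvely mapped pathways that MinPath does not retain.

    position=0 uses the 'at least two datasets' count,
    position=1 the 'at least one dataset' count.
    """
    naive = parse_cell(naive_cell)[position]
    minpath = parse_cell(minpath_cell)[position]
    return 1 - minpath / naive


for sample, (naive_cell, minpath_cell) in KEGG.items():
    print(f"{sample}: {reduction(naive_cell, minpath_cell):.1%} fewer pathways")
```

For Coral-Mic, for example, the first counts are 188 and 109, a reduction of about 42%.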