Emmanuel Prestat1, Maude M David2, Jenni Hultman2, Neslihan Taş2, Regina Lamendella2, Jill Dvornik2, Rachel Mackelprang3, David D Myrold4, Ari Jumpponen5, Susannah G Tringe6, Elizabeth Holman2, Konstantinos Mavromatis6, Janet K Jansson7.
Abstract
A new functional gene database, FOAM (Functional Ontology Assignments for Metagenomes), was developed to screen environmental metagenomic sequence datasets. FOAM provides a new functional ontology dedicated to classifying gene functions relevant to environmental microorganisms based on Hidden Markov Models (HMMs). Sets of aligned protein sequences (i.e. 'profiles') were tailored to a large group of target KEGG Orthologs (KOs) from which HMMs were trained. The alignments were checked and curated to make them specific to the targeted KO. During this process, sequence profiles were enriched with the most abundant sequences available to maximize the yield of accurate classifier models. An associated functional ontology was built to describe the functional groups and hierarchy. FOAM allows the user to select the target search space before HMM-based comparison steps and to easily organize the results into different functional categories and subcategories. FOAM is publicly available at http://portal.nersc.gov/project/m1317/FOAM/.
Year: 2014 PMID: 25260589 PMCID: PMC4231724 DOI: 10.1093/nar/gku702
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. HMM building pipeline: example with KO:K16157 (methane monooxygenase). Step 1: (a) find the combination of Pfam(s) assigned to the KO of interest and (b) check for redundancy. Step 2: fetch the IMG peptide sequences that hit the retrieved Pfam(s). Step 3: fetch the HMM of interest from the Pfam-A database. Step 4: align (hmmalign) and filter each Pfam to remove the extra sequences obtained from IMG. Step 5: stitch the filtered alignments. Step 6: draw a Maximum Likelihood tree (fasttree). Step 7: find clusters in the tree that share the same KO. Step 8: split the alignment (step 5 output) by cluster (step 7 output), build an HMM for each cluster and compute the ‘Trusted Cutoff’.
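Steps 5 and 8 of the pipeline can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary-based alignment format, the function names `stitch_alignments` and `split_by_cluster`, the toy sequences, and the choice to drop sequences missing from any per-Pfam alignment are all assumptions made for illustration.

```python
from typing import Dict, List


def stitch_alignments(alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Concatenate per-Pfam alignments column-wise (step 5).

    Each alignment maps a sequence ID to its aligned (gapped) string.
    Assumption: only sequences present in every alignment are kept.
    """
    if not alignments:
        return {}
    shared = set(alignments[0])
    for aln in alignments[1:]:
        shared &= set(aln)
    # Keep the ordering of the first alignment for reproducibility.
    return {
        seq_id: "".join(aln[seq_id] for aln in alignments)
        for seq_id in alignments[0]
        if seq_id in shared
    }


def split_by_cluster(
    stitched: Dict[str, str], clusters: Dict[str, List[str]]
) -> Dict[str, Dict[str, str]]:
    """Split a stitched alignment into one sub-alignment per tree cluster (step 8)."""
    return {
        name: {sid: stitched[sid] for sid in members if sid in stitched}
        for name, members in clusters.items()
    }


# Hypothetical toy data: two Pfam alignments and two tree clusters.
pfam_a = {"seq1": "MK-L", "seq2": "MKAL", "seq3": "M--L"}
pfam_b = {"seq1": "GG", "seq2": "G-"}
stitched = stitch_alignments([pfam_a, pfam_b])

clusters = {"cluster_1": ["seq1"], "cluster_2": ["seq2", "seq3"]}
by_cluster = split_by_cluster(stitched, clusters)
print(by_cluster)
```

In the pipeline described in the caption, each per-cluster alignment would then be used to build one HMM; that step relies on external tools and is not shown here.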
The current FOAM database is made of 73 969 HMMs designed to target 2870 different KOs
| Category | #HMM | #KO | #HMM/KO |
|---|---|---|---|
| 01_Fermentation | 1342.5 | 173 | 7.76 |
| 02_Homoacetogenesis | 336 | 118 | 2.85 |
| 03_Superpathway of thiosulfate metabolism | 36 | 7 | 5.14 |
| 04_Utililization of sugar, conversion of pentose to EMP pathway intermediates | 100.5 | 14 | 7.18 |
| 05_Fatty acid oxidation | 1179.5 | 41 | 28.77 |
| 06_Amino acid utilization biosynthesis metabolism | 7773 | 805 | 9.66 |
| 07_Nucleic acid metabolism | 2734 | 288 | 9.49 |
| 08_Hydrocarbon degradation | 1415.5 | 85 | 16.65 |
| 09_Carbohydrate Active enzyme (CAZy) | 2305.5 | 305 | 7.56 |
| 10_TCA cycle | 478.5 | 35 | 13.67 |
| 11_Nitrogen cycle | 217.5 | 52 | 4.18 |
| 12_Transporters | 0.5 | 543 | 0.00 |
| 13_Hydrogen metabolism | 194.5 | 16 | 12.16 |
| 14_Methanogenesis | 524.5 | 57 | 9.20 |
| 15_Methylotrophy | 238 | 69 | 3.45 |
| 16_Embden Meyerhof-Parnos (EMP) | 209 | 35 | 5.97 |
| 17_Gluconeogenesis | 258 | 28 | 9.21 |
| 18_Sulfur metabolism | 35.5 | 33 | 1.08 |
| 19_Synthesis of saccharides and deriviatives | 2015.5 | 419 | 4.81 |
| 20_Polymers hydrolysis | 2740 | 358 | 7.65 |
| 21_Cellular response to stress | 11647 | 825 | 14.12 |
On average, an HMM is made from an alignment of 81 peptide sequences and about 26 HMMs are built per KO. The file size is ∼7 GB.
Example of confusion matrix construction for database validation
Figure 2. Validation results. For each of the five functional levels available in the FOAM ontology, three metrics were computed: recall (or sensitivity), precision (also known as ‘positive predictive value’ or sometimes ‘specificity’) and F1-score (the harmonic mean of the two). Precision stays above 92% at every level, rising from the KO level to 97% at level 1 (21 classes). Recall varies much more, from 69% at the KO level to 98% at level 1. Levels 2, 3 and 4 gave similar results for both recall and precision, and their F1-scores fall within a range of 92–94%.
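The three metrics in the caption can be derived from confusion-matrix counts, as in the short Python sketch below. The function name `precision_recall_f1` and the counts are invented for illustration and are not taken from the FOAM validation; how the authors aggregated counts across classes is not stated here.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 correct assignments, 10 wrong assignments,
# 20 sequences that should have been assigned to the class but were not.
p, r, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a level with high precision but low recall still gets a modest F1-score.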