Malay K Basu, Jeremy D Selengut, Daniel H Haft.
Abstract
BACKGROUND: Phylogenetic profiling is a technique of scoring co-occurrence between a protein family and some other trait, usually another protein family, across a set of taxonomic groups. In spite of several refinements in recent years, the technique still invites significant improvement. To be its most effective, a phylogenetic profiling algorithm must be able to examine co-occurrences among protein families whose boundaries are uncertain within large homologous protein superfamilies.
Year: 2011 PMID: 22070167 PMCID: PMC3226654 DOI: 10.1186/1471-2105-12-434
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Figure 1. Flowchart for profile search using ProPhylo with the Partial Phylogenetic Profile algorithm (PPP). For each step, the relevant software name is indicated. (A) Creation of a profile from various search methods or directly using a set of GenBank GIs. (B) The created profile is a tab-delimited text file containing a set of taxonomic IDs from the NCBI taxonomy database and 1's and 0's for the presence and absence of the query protein family. (C) The main script ppp.pl searches a given genome using the query profile and generates results as a ranked list of candidate functionally linked proteins.
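The profile file described in (B) can be read with a few lines of code. The following is a minimal sketch, assuming one `taxon_id<TAB>0|1` pair per line; the exact column layout used by ProPhylo is not given here, and the function name `read_profile`, the comment-line convention, and the example IDs are hypothetical.

```python
import io


def read_profile(handle):
    """Parse 'taxon_id<TAB>0|1' lines into a dict of taxon ID -> presence flag.

    The two-column layout is an assumption made for illustration.
    """
    profile = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines (assumed convention)
        taxon_id, flag = line.split("\t")[:2]
        profile[int(taxon_id)] = flag == "1"
    return profile


# Arbitrary example IDs, used only to show the shape of the data.
example = io.StringIO("2190\t1\n562\t0\n2162\t1\n")
profile = read_profile(example)
present = sorted(taxon for taxon, is_present in profile.items() if is_present)
print(present)  # [2162, 2190]
```

Storing the profile as a mapping keeps the absent taxa (the 0's) available, which matters if a later step needs the full set of genomes rather than only the positive ones.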
Table 1. Proteins identified in Methanohalophilus mahii DSM 5219 as PPP's top hits with perfect agreement (28 of 28 genomes) to the methanogenesis phylogenetic profile, showing the BLAST E-values that flank the boundaries selected by PPP.
| GI number | Last True | First False | Protein Functional Assignment |
|---|---|---|---|
| 294495086 | 1e-24 | 4e-23 | paralog of MtrA |
| 294495087 | 5e-52 | 2e-46 | paralog of MtrH |
| 294495289 | 5e-170 | 2.5 | methyl-coenzyme M reductase, alpha subunit |
| 294495290 | 8e-68 | 0.11 | methyl-coenzyme M reductase, gamma subunit |
| 294495291 | 5e-35 | (none) | methyl-coenzyme M reductase operon protein C |
| 294495292 | 7e-17 | 0.63 | methyl-coenzyme M reductase operon protein D |
| 294495293 | 2e-130 | 2.4 | methyl-coenzyme M reductase, beta subunit |
| 294495294 | 5e-94 | 3e-07 | methanogenesis marker 10 radical SAM protein |
| 294495889 | 0.38 | 0.64 | protein of unknown function DUF2098 |
| 294495924 | 7e-131 | 6e-130 | formylmethanofuran dehydrogenase, subunit A |
| 294495926 | 6e-08 | 1e-07 | formylmethanofuran dehydrogenase, subunit D |
| 294496062 | 2e-73 | 4e-09 | methanogenesis marker 13 metalloprotein |
| 294496216 | 9e-68 | 3e-55 | methanogenesis marker 2 protein |
| 294496423 | 5e-48 | (none) | tetrahydromethanopterin S-methyltransferase, MtrE |
| 294496424 | 3e-25 | (none) | tetrahydromethanopterin S-methyltransferase, MtrD |
| 294496425 | 1e-18 | 0.78 | tetrahydromethanopterin S-methyltransferase, MtrC |
| 294496426 | 0.003 | 0.48 | tetrahydromethanopterin S-methyltransferase, MtrB |
| 294496427 | 2e-37 | 8e-19 | tetrahydromethanopterin S-methyltransferase, MtrA |
| 294496429 | 4e-04 | 0.40 | tetrahydromethanopterin S-methyltransferase, MtrG |
| 294496430 | 8e-76 | 2e-48 | tetrahydromethanopterin S-methyltransferase, MtrH |
| 294496497 | 9e-52 | 0.45 | methanogenesis marker 7 protein |
| 294496498 | 0.011 | 0.36 | methanogenesis marker 17 protein |
| 294496499 | 1e-115 | 9e-31 | methanogenesis marker 15 protein |
| 294496500 | 3e-34 | 1.5 | methanogenesis marker 5 protein |
| 294496501 | 2e-14 | 3.0 | methanogenesis marker 6 protein |
| 294496502 | 2e-45 | 1.8 | methanogenesis marker 3 protein |
| 294496503 | 1e-144 | 4e-39 | methyl coenzyme M reductase system, AtwA protein |
| 294496608 | 2e-43 | 9e-31 | paralog of AtwA |
| 294496619 | 2e-72 | 0.031 | methanogenesis marker 14 protein |
Figure 2. Distribution of PPP scores. The query profile contains all methanogens marked as 1, and the target genome is Methanohalophilus mahii DSM 5219. The binomial distribution probability parameter is raised to 0.2, which helps proteins absolutely restricted to the methanogens, although not universal among them, to get a better relative rank. The plot shows PPP score on the Y-axis, and rank, sorted by PPP score, on the X-axis. The top-scoring 28, with perfect agreement to the query profile, are colored red.
Table 2. Top hits by PPP in Serratia odorifera 4Rx13 using a query profile for DptD (TIGR03185 family) of the DNA phosphorothioation system, with a modified probability of 0.4.
| GI number¹ | DPT² | Total³ | Score⁴ | Protein Functional Assignment |
|---|---|---|---|---|
| 270263651 | 66 | 66 | 26.26 | DptD (DndD) (DNA phosphorothioation) |
| 270263652 | 59 | 59 | 23.48 | DptC (DndC) (DNA phosphorothioation) |
| 270263653 | 32 | 32 | 12.73 | DptB (DndB) (DNA phosphorothioation) |
| 270263650 | 19 | 19 | 7.56 | DptE (DndE) (DNA phosphorothioation) |
| 270263648 | 16 | 16 | 6.37 | DptH (DPT-dependent restriction) |
| 270263645 | 18 | 20 | 5.30 | DptF (DPT-dependent restriction) (N-terminal) |
| 270263647 | 15 | 16 | 4.97 | DptF (DPT-dependent restriction) (C-terminal) |
| 270263646 | 12 | 12 | 4.78 | DptG (DPT-dependent restriction) |
| 270263649 | 7 | 7 | 2.79 | DGQHR domain protein |
¹ NCBI GI. ² The number of unique taxa common to the query profile and the target BLAST hit list. ³ The total number of unique taxa encountered at the depth that PPP found to be optimal. ⁴ Score reported by PPP as the negative logarithm of the P-value at the optimal depth, that is, the lowest P-value that can be obtained from the binomial distribution at any depth.
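The score column can be reproduced with a short calculation. The sketch below assumes the P-value is the upper tail of a binomial distribution, P(X >= yes) with `total` trials, and that the logarithm is base 10; neither detail is stated in the footnotes, but this reading matches the table values above for a probability of 0.4. The function names `binomial_tail`, `ppp_score`, and `best_depth` are hypothetical and are not taken from ppp.pl.

```python
from math import comb, log10


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def ppp_score(yes, total, p):
    """Negative log10 of the upper-tail binomial P-value (assumed scoring rule)."""
    if yes == 0:
        return 0.0  # the tail probability is 1, so the score is 0
    return -log10(binomial_tail(yes, total, p))


def best_depth(hit_taxa, profile_taxa, p):
    """Walk down a ranked hit list and keep the depth with the best score.

    Returns (score, depth, yes, total), where 'total' counts unique taxa seen
    and 'yes' counts those that are also in the query profile.
    """
    seen, yes = set(), 0
    best = (0.0, 0, 0, 0)
    for depth, taxon in enumerate(hit_taxa, start=1):
        if taxon not in seen:
            seen.add(taxon)
            yes += taxon in profile_taxa
        score = ppp_score(yes, len(seen), p)
        if score > best[0]:
            best = (score, depth, yes, len(seen))
    return best


print(f"{ppp_score(66, 66, 0.4):.2f}")  # 26.26, the DptD row
print(f"{ppp_score(18, 20, 0.4):.2f}")  # 5.30, the DptF (N-terminal) row
```

For very long hit lists the tail probability can underflow in floating point, so a production implementation would more likely work in log space. The direct sum is used here only because it is easy to check against the table.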
Figure 3. Flowchart of profile search using ProPhylo with Double Partial Phylogenetic Profiling (DPPP). (A) Search begins with a single query sequence with its BLAST hits. (B) The program then generates a different query profile for each depth in the BLAST hit list. (C) Each of these profiles is then searched against the target genome. (D) The top hits for each of these searches are then collected and the output is presented sorted by significance.
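Steps (B) through (D) can be sketched as a pair of nested loops. This is an illustrative outline only, not the ProPhylo implementation: it assumes the same upper-tail binomial score with base-10 logarithm as a stand-in for the PPP score, represents each BLAST hit list as a ranked list of taxon labels, and collects every (depth, target) result rather than only the top hits per search. All names (`score`, `best_score`, `dppp`, the taxon labels) are hypothetical.

```python
from math import comb, log10


def score(yes, total, p):
    """-log10 of P(X >= yes) for X ~ Binomial(total, p); 0.0 when yes == 0."""
    if yes == 0:
        return 0.0
    tail = sum(
        comb(total, i) * p**i * (1 - p) ** (total - i) for i in range(yes, total + 1)
    )
    return -log10(tail)


def best_score(target_hits, profile, p):
    """Best score over all depths of one target protein's ranked hit list."""
    seen, yes, best = set(), 0, 0.0
    for taxon in target_hits:
        if taxon not in seen:
            seen.add(taxon)
            yes += taxon in profile
            best = max(best, score(yes, len(seen), p))
    return best


def dppp(query_hits, targets, p, step=1):
    """Return (score, query_depth, target_id) rows, most significant first."""
    results = []
    for depth in range(step, len(query_hits) + 1, step):
        profile = set(query_hits[:depth])  # (B) one profile per query depth
        for target_id, hits in targets.items():  # (C) search the target genome
            results.append((best_score(hits, profile, p), depth, target_id))
    return sorted(results, reverse=True)  # (D) sort by significance


query_hits = ["t1", "t2", "t3", "t4"]
targets = {
    "partner": ["t1", "t2", "t3", "u1"],
    "unrelated": ["u1", "u2", "t1"],
}
results = dppp(query_hits, targets, p=0.4)
print(results[0][2])  # partner
```

The `step` parameter mirrors the idea of sampling only some query depths, which keeps the number of profile searches manageable when the query hit list is long.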
Figure 4. Double partial phylogenetic profiling (DPPP) using the GTP-binding protein HydF (GI:113971588) as query. For the query protein (red), the curve rises monotonically, because it measures the correlation of the list of species it generates with itself. Among all proteins other than the query, the peak score occurs for HydE (GI:113971587), where the depth in the query protein's BLAST hit list is about 210. DPPP scores are shown for query protein depths 10 to 930, sampled every tenth hit, for the ten proteins that scored best at depth 210. The curves for HydE (GI:113971587, olive), HydG (GI:113971585, green), and the hydrogenase large (GI:113971582, blue) and small (GI:113971583, purple) subunits all peak at this query protein depth, which largely exhausts the list of species that carry the [FeFe] hydrogenase maturation system (see text and Table 3 for details). Proteins unrelated to [FeFe] hydrogenase maturation are shown in gray.
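Table 3. Top 10 hits from DPPP in Shewanella sp. MR-4, with the [FeFe] hydrogenase maturation GTP-binding protein HydF (GI:113971588) as query, at a query protein BLAST hit list depth of 210 (finding 200 genomes).
| GI number¹ | YES² | Total³ | Depth⁴ | Score⁵ | Protein Functional Assignment⁶ |
|---|---|---|---|---|---|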
| 113970224 | 194 | 336 | 613 | 78.43 | thiazole biosynthesis protein ThiH (TIGR02351) |
| 113971167 | 169 | 320 | 480 | 60.69 | hydroxylamine reductase (TIGR01703) |
| 113971205 | 181 | 374 | 603 | 57.37 | radical SAM protein (TIGR01212) |
| 113971473 | 161 | 327 | 484 | 52.43 | peptidase, U32 family (PF01136) |
| 113971082 | 132 | 240 | 843 | 50.34 | MATE domain protein (PF01554) |
Proteins related to [FeFe]-hydrogenase are marked in boldface.
¹ NCBI GI. ² "YES" denotes the number of unique taxa common to the query profile and the target BLAST hit list. ³ The total number of unique taxa found in the target BLAST hit list at the depth that gives the best PPP score. ⁴ The depth in the target BLAST hit list that gives the best PPP score, which may reflect multiple hits in some genomes. ⁵ Score reported by PPP as the negative logarithm of the P-value. ⁶ TIGRFAMs and Pfam models were used because existing GenBank annotations were out of date.
Figure 5. Distribution of PPP scores resulting from a GI:113971588-based query profile at a depth of 200 distinct genomes (Shewanella sp. MR-4). Green: the query gene itself and the four correlated hydrogenase and hydrogenase maturation genes listed in Table 3. Yellow: the ThiH gene.