Anis Karimpour-Fard, Sonia M Leach, Ryan T Gill, Lawrence E Hunter.
Abstract
BACKGROUND: Applications of computational methods for predicting protein functional linkages are increasing. In recent years, several bacteria-specific methods for predicting linkages have been developed. The four major genomic context methods are Gene cluster, Gene neighbor, Rosetta Stone, and Phylogenetic profiles. These methods have been shown to be powerful tools; this paper provides guidelines for when each method is appropriate by exploring the features of each method and the potential improvements offered by combining them. We also review many previous treatments of these prediction methods, use the latest available annotations, and offer a number of new observations.
Year: 2008 PMID: 18816389 PMCID: PMC2570368 DOI: 10.1186/1471-2105-9-397
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
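Predicted linkages in E. coli K12 at different confidence levels in Prolinks v2.0. Each cell gives number of linkages/number of proteins. GC: Gene cluster; GN: Gene neighbor; RS: Rosetta Stone; PP: Phylogenetic profiles.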
| Method | Confidence > 0.4 | Confidence > 0.6 | Confidence > 0.8 |
| GC | 2,297/3,104 | 1,676/2,486 | 0/0 |
| GN | 11,789/2,843 | 4,946/1,935 | 625/396 |
| RS | 6,668/1,026 | 1,518/776 | 292/366 |
| PP | 10,029/1,730 | 2,926/1,156 | 1,308/779 |
| Total | 27,339/3,924 | 9,623/3,323 | 2,143/1,267 |
| % predicted E. coli genes (4245 genes in NCBI) | 92% | 78% | 30% |
Number of overlapping linkages at confidence > 0.6
| | GC | GN | RS |
| GN | 843 | - | - |
| RS | 100 | 202 | - |
| PP | 77 | 305 | 103 |
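The percentages in the last row of the confidence table can be reproduced from the "Total" row. The sketch below assumes that the second number in each cell is a protein count and that the denominator is the 4245 genes named in the row label; the function and variable names are illustrative only.

```python
# Protein counts taken from the "Total" row (assumed to be the second number
# in each linkages/proteins cell), keyed by confidence threshold.
TOTAL_PROTEINS = {0.4: 3924, 0.6: 3323, 0.8: 1267}
NCBI_GENE_COUNT = 4245  # from the row label in the table


def percent_genes_covered(n_proteins: int, n_genes: int = NCBI_GENE_COUNT) -> float:
    """Percentage of annotated genes with at least one predicted linkage."""
    return 100.0 * n_proteins / n_genes


for threshold, n_proteins in TOTAL_PROTEINS.items():
    print(f"confidence > {threshold}: {percent_genes_covered(n_proteins):.0f}%")
```

Rounded to whole percentages, the three values agree with the 92%, 78% and 30% listed in the table.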
Figure 1. Complete protein-protein networks of E. coli K12. Coverage and overlap are given in the central inset. Network proteins are color-coded by KEGG functional category: Cellular processes (cyan), Environmental information processing (blue), Genetic information processing (red), Metabolism (green), and Unclassified (gray). a) Gene cluster b) Gene neighbor c) Rosetta Stone d) Phylogenetic profile.
Topological analysis of different networks in E. coli K12.
| Protein-protein linkage set | Average clustering coefficient | Average connectivity | Number of proteins |
| GN | 0.56 | 5.11 | 1,935 |
| RS | 0.27 | 0.25 | 776 |
| PP | 0.83 | 5.06 | 1,156 |
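For readers who want to recompute the two topological measures, here is a minimal sketch. It assumes that "average connectivity" means mean node degree and that the clustering coefficient of a node with fewer than two neighbors is taken as zero; neither convention is stated in the table, and the toy protein names are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an iterable of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-links
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbors
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * linked / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def average_connectivity(adj):
    """Mean degree, which equals 2 * edges / nodes."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


# Toy network: a triangle (a, b, c) plus one pendant node d attached to c.
toy = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(average_clustering(toy), average_connectivity(toy))
```

Under the mean-degree reading, the linkage and protein counts at confidence > 0.6 give 2 × 4,946 / 1,935 ≈ 5.11 for GN and 2 × 2,926 / 1,156 ≈ 5.06 for PP, in agreement with the table. The same formula gives about 3.91 for RS, so this sketch does not reproduce the tabulated 0.25.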
Figure 2. Coverage and distribution of predictions relative to functional categorizations. a) among all proteins annotated to a given COG category, the percent of those proteins with at least 1 linkage predicted by a given method b) percent using KEGG categories c) percent using KEGG subcategories d) percent of linkages predicted by a method where the linked proteins share the same function e) percent of linkages predicted by a method with at least 1 unclassified protein in the linked pair.
Figure 3. Function prediction cross-evaluation of protein linkages for each method (PP, RS and GN). UNIF: predicted function is sampled uniformly at random from the set of categories (KEGG 4 categories, COG 19 categories, KEGG 19 subcategories); MAJOR: predicted function is the majority assignment among immediate neighbors (ties are broken randomly).
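The two predictors named in this caption can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the data structures, the function names, and the choice to return `None` when no neighbor is annotated are all hypothetical.

```python
import random
from collections import Counter


def predict_unif(categories, rng):
    """UNIF baseline: draw a category uniformly at random."""
    return rng.choice(list(categories))


def predict_major(protein, neighbors, labels, rng):
    """MAJOR: most common label among annotated immediate neighbors.

    Returns None when no neighbor is annotated (handling assumed here).
    """
    votes = Counter(labels[n] for n in neighbors.get(protein, ()) if n in labels)
    if not votes:
        return None
    top = max(votes.values())
    tied = sorted(c for c, v in votes.items() if v == top)
    return rng.choice(tied)  # ties are broken randomly


# Hypothetical toy data: protein -> linked proteins, protein -> category.
neighbors = {"p1": ["p2", "p3", "p4"], "p5": ["p2", "p4"]}
labels = {"p2": "Metabolism", "p3": "Metabolism", "p4": "Cellular processes"}
rng = random.Random(0)
print(predict_major("p1", neighbors, labels, rng))  # two of three votes: Metabolism
```

For `p5` the two neighbors disagree, so the prediction is one of the two tied categories chosen at random.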
Figure 4. Function prediction cross-evaluation of protein-protein linkages for method combinations (OR, Noisy-OR). UNIF: predicted function is sampled uniformly at random from the set of categories (KEGG 4 categories, COG 19 categories, KEGG 19 subcategories); MAJOR: predicted function is the (weighted) majority assignment among immediate neighbors (ties are broken randomly).
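The combination schemes in this caption could look roughly as follows. The Noisy-OR formula is the standard one, 1 minus the product of (1 - p) over the contributing methods. Keeping the maximum confidence for the plain OR combination and using the combined confidence as the vote weight are assumptions of this sketch, and all names are hypothetical.

```python
import random
from collections import defaultdict
from math import prod


def combine_or(confidences):
    """OR: a pair is linked if any method links it. Keeping the maximum
    confidence is an assumption of this sketch."""
    return max(confidences)


def combine_noisy_or(confidences):
    """Noisy-OR: 1 - product of (1 - p_i) over the methods linking the pair."""
    return 1.0 - prod(1.0 - p for p in confidences)


def weighted_major(neighbor_weights, labels, rng):
    """Label with the largest summed linkage weight among annotated neighbors."""
    totals = defaultdict(float)
    for nbr, weight in neighbor_weights.items():
        if nbr in labels:
            totals[labels[nbr]] += weight
    if not totals:
        return None  # no annotated neighbor (handling assumed here)
    best = max(totals.values())
    tied = sorted(c for c, t in totals.items() if t == best)
    return rng.choice(tied)  # ties are broken randomly


# Hypothetical example: one pair linked by two methods at 0.6 and 0.7.
print(combine_or([0.6, 0.7]), combine_noisy_or([0.6, 0.7]))

labels = {"p2": "Metabolism", "p3": "Cellular processes", "p4": "Cellular processes"}
weights = {"p2": 0.9, "p3": 0.4, "p4": 0.4}
print(weighted_major(weights, labels, random.Random(0)))  # 0.9 > 0.8: Metabolism
```

In the toy data an unweighted majority would pick "Cellular processes" (two votes to one), while the weighted vote picks "Metabolism" because its single linkage carries more weight than the other two combined.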
Figure 5. Percentage of proteins in a pathway with at least one linkage in each method, using confidence threshold 0.6. Numbers above and below each line represent 50% and 100% coverage, respectively.
Statistics on operon predictions.
| | 830 | 355 | 418 |
| Completely predicted | 489 (59%) | 213 (60%) | 333 (80%) |
| Completely missed | 152 (18%) | 52 (15%) | 24 (6%) |
| | 1676 | 1676 | 2104 |
| Both proteins in same operon | 1282 (76%) | 661 (39%) | 843 (40%) |
| Both operon proteins, not same operon | 46 (3%) | 12 (1%) | 20 (1%) |
| Only 1 classified as an operon protein | 161 (10%) | 101 (6%) | 122 (6%) |
| Both not classified as operon proteins | 187 (11%) | 902 (54%) | 1119 (53%) |
| | 1520 | 1392 | 1583 |
| | 1244 | 659 | 706 |
| Mean (Median) intergenic distance in bases | 20.68 (10) | 22.04 (10) | 28.98 (17) |
| Percentage with < 25 bases | 74 | 73 | 63 |
| Percentage with > 25 and < 50 bases | 12 | 10 | 16 |
| Percentage with > 50 bases | 14 | 17 | 21 |
| | 276 | 733 | 877 |
| Mean (Median) intergenic distance in bases | 69.29 (74) | 38.79 (24) | 43.80 (29) |
| Percentage with < 25 bases | 15 | 51 | 46 |
| Percentage with > 25 and < 50 bases | 11 | 14 | 15 |
| Percentage with > 50 bases | 74 | 35 | 39 |
Figure 6. Operon prediction in E. coli K12 and B. subtilis. a-c) Node coloring depicts operon status according to RegulonDB or DBTBS: both proteins in the same operon (blue nodes), both operon proteins but not in the same operon (red nodes), only one classified as an operon protein (yellow nodes), or both not classified as operon proteins (grey nodes) d) precision and recall using different combinations.
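The four pair categories in this caption, and the precision and recall of panel d, can be expressed as a short sketch. The `operon_of` mapping, the gene names, and the treatment of pairs as unordered are assumptions made for illustration; how the reference set of true pairs was assembled is not specified here.

```python
def classify_pair(a, b, operon_of):
    """Assign a linked pair to one of the four categories in the caption.

    operon_of maps a protein to its operon identifier; proteins that the
    reference database does not place in an operon are absent from the map.
    """
    in_a, in_b = a in operon_of, b in operon_of
    if in_a and in_b:
        if operon_of[a] == operon_of[b]:
            return "same operon"
        return "both operon proteins, different operons"
    if in_a or in_b:
        return "only one operon protein"
    return "neither operon protein"


def precision_recall(predicted_pairs, true_pairs):
    """Precision = TP / predicted, recall = TP / true, on unordered pairs."""
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy annotation: g1 and g2 share an operon, g3 is in another.
operon_of = {"g1": "opA", "g2": "opA", "g3": "opB"}
print(classify_pair("g1", "g2", operon_of))
print(precision_recall([("g1", "g2"), ("g1", "g3")], [("g2", "g1"), ("g3", "g4")]))
```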
Table of previous operon prediction methods.
| Predictor method | Data Types | Applied species | Operon Data | Sen (%) | Spe (%) | Accuracy |
| Probability | Intergenic distance | E. coli K12 | 2684 TU E. coli K12 Jun07 | 78 | _ | _ |
| | | B. subtilis | (831 multiprotein operons) | (precision = PPV = 82% E. coli K12) | | |
| | | | 1823 pairs in same operon | | | |
| | | | 770 TU E. coli K12 Jan06 | 75 | _ | _ |
| | | | (356 multiprotein operons) | (precision = 47%) | | |
| | | | 905 pairs in same operon | 94 | _ | _ |
| | | | 1115 TU B. subtilis | (precision = 45% B. subtilis) | | |
| | | | (419 multiprotein operons) | | | |
| | | | 972 pairs in same operon | | | |
| HMM | Sequence information | E. coli K12 | 390 TU | 59 | _ | _ |
| Naïve Bayes | Sequence information | E. coli K12 | 365 TU | 75 | 91 | 83%* |
| Log likelihood | Intergenic distance, functional classes | E. coli K12 | 361 TU (237 multi) | 88 | 88 | 82%* Distance |
| | | | 572 pairs in same operon | | | 88%* Both |
| | | | 346 pairs at TU border | | | |
| Probability | Conserved gene clusters across 34 genomes | E. coli K12 | 389 TU | 48 | 92 | 70%* |
| | | | 541 pairs in same operon | | | |
| | | | 263 pairs at TU border | | | |
| | | | (pair if ≤ 200 bp apart) | | | |
| HMM | Expression data | E. coli K12 | 463 pairs in same operon | 63 | 99 | 81%* |
| Graph analysis | Metabolic pathway information | E. coli K12 (also applied to 42 other genomes) | 128 TU metabolism related | 89 | 87 | 88%* |
| Log likelihood | Intergenic distance | B. subtilis (trained on E. coli K12, applied to 68 genomes) | 100 TU B. subtilis | 88 | 88 | 88%* B. subtilis |
| | | | 310 pairs in same operon | | | 82%* E. coli K12 |
| | | | 123 pairs at TU border | | | |
| Bayesian posterior probability | Intergenic distance, co-expression | E. coli K12 | 257 TU | 82 | 70 | 76%* Co-expr |
| | | | 604 pairs in same operon | 84 | 82 | 83%* Distance |
| | | | 151 pairs at TU border | 88 | 88 | 88%* Both |
| Bayesian network | Intergenic distance, sequence information, expression data | E. coli K12 | 365 TU | 78 | 90 | 84%* |
| Machine learning (Romero et al., 2004) | Intergenic distance, functional information | B. subtilis | 100 TU B. subtilis | 81 | 48 | 65%* B. subtilis |
| | | | | 87 | 86 | 87%* E. coli K12 |
| | | | | (91) | (87) | (89%* if use all info on E. coli) |
| Bayesian classifier | Intergenic distance, operon length, gene expression | B. subtilis | 635 TU | 82¶ | 89¶ | 83¶ distance |
| | | | 582 pairs in same operon | 80¶ | 79¶ | 80¶ expression |
| | | | 91 pairs at TU border | 88¶ | 88¶ | 89¶ all |
| Machine learning without extensive training data | Intergenic distance, functional classes, conserved gene clusters | E. coli K12 | E. coli K12: | 88 | 80 | 84%* E. coli K12 |
| | | B. theta | 797 pairs in same operon | | | |
| | | | 294 pairs at TU border | | | |
| | | | B. theta: | 73 | 80 | 76.5%* B. theta |
| | | | 936 concordant pairs | | | |
| | | | 106 discordant pairs | | | |
Annotations: * value estimated as average of sensitivity and specificity, ¶ value based on leave-one-out analysis as reported by authors
Abbreviations: Transcriptional Unit (TU), Base Pair (bp), Specificity (Spe), Sensitivity (Sen).
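The starred accuracy values are described above as the average of sensitivity and specificity. A small sketch of that arithmetic, using (Sen, Spe) pairs copied from four rows of the table (the function name is illustrative):

```python
def estimated_accuracy(sensitivity: float, specificity: float) -> float:
    """Accuracy estimate for starred (*) entries: the plain average of the two."""
    return (sensitivity + specificity) / 2.0


# (Sen, Spe) pairs taken from rows of the table of previous methods.
for sen, spe in [(75, 91), (48, 92), (63, 99), (73, 80)]:
    print(sen, spe, estimated_accuracy(sen, spe))
```

These four pairs give 83, 70, 81 and 76.5, matching the starred entries in the corresponding rows.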
Figure 7. Intergenic distance between gene pairs. A true positive (TP) was counted when both proteins involved in a linkage had a functional annotation in EcoCyc (DBTBS) and were known to reside in the same operon according to RegulonDB (DBTBS); a false positive (FP) was counted when at least one of the classified proteins was not listed by RegulonDB (DBTBS) as being in the same operon. a) E. coli K12 using RegulonDB June 2007 b) E. coli K12 using RegulonDB January 2006 c) B. subtilis using DBTBS September 2007.
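The distance summaries reported for TP and FP pairs (mean, median, and the share of pairs in each distance bin) can be computed as sketched below. The bins follow the operon statistics table; assigning distances of exactly 25 and 50 bases to the middle bin is an assumption, since the table leaves those boundaries open, and the example distances are made up.

```python
from statistics import mean, median

BINS = ("< 25", "25-50", "> 50")


def distance_bin(distance_bp: int) -> str:
    """Bin an intergenic distance; exactly 25 and 50 go to the middle bin
    (an assumption of this sketch)."""
    if distance_bp < 25:
        return "< 25"
    if distance_bp <= 50:
        return "25-50"
    return "> 50"


def bin_percentages(distances):
    """Percentage of gene pairs that fall into each distance bin."""
    counts = {name: 0 for name in BINS}
    for d in distances:
        counts[distance_bin(d)] += 1
    n = len(distances)
    return {name: (100.0 * c / n if n else 0.0) for name, c in counts.items()}


# Hypothetical intergenic distances (in bases) for a handful of gene pairs.
example = [10, 10, 30, 60]
print(mean(example), median(example), bin_percentages(example))
```

Applying the same summary separately to the TP and FP pairs of each dataset would yield rows of the kind shown in the operon statistics table.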