Kiyoung Lee, Han-Yu Chuang, Andreas Beyer, Min-Kyung Sung, Won-Ki Huh, Bonghee Lee, Trey Ideker.
Abstract
The function of a protein is intimately tied to its subcellular localization. Although localizations have been measured for many yeast proteins through systematic GFP fusions, similar studies in other branches of life are still forthcoming. In the interim, various machine-learning methods have been proposed to predict localization using physical characteristics of a protein, such as amino acid content, hydrophobicity, side-chain mass and domain composition. However, there has been comparatively little work on predicting localization using protein networks. Here, we predict protein localizations by integrating an extensive set of protein physical characteristics over a protein's extended protein-protein interaction neighborhood, using a classification framework called 'Divide and Conquer k-Nearest Neighbors' (DC-kNN). These predictions achieve significantly higher accuracy than two well-known methods for predicting protein localization in yeast. Using new GFP imaging experiments, we show that the network-based approach can extend and revise previous annotations made from high-throughput studies. Finally, we show that our approach remains highly predictive in higher eukaryotes such as fly and human, in which most localizations are unknown and the protein network coverage is less substantial.
Year: 2008 PMID: 18836191 PMCID: PMC2582614 DOI: 10.1093/nar/gkn619
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Schematic overview of the integrated network-based framework. (a) Generation of single-protein feature vectors. Nine kinds of single-protein feature vectors (AA, diAA, gapAA, three kinds of chemAA, pseuAA, Motif and GO) were generated for each protein P based on its sequence, chemical properties, motifs and functions. (b) Calculation of Neighbors’ Significance Matrices (NSMs). These were calculated based on the number of distinct localizations covered by proteins falling along the path with the highest weight from a target protein to a neighbor protein (see Materials and methods section). (c) Calculation of PLCPs. They were calculated by weighted counting with normalization (see Materials and methods section). (d) Generation of the first kind of network feature vector. Each was generated from the single-protein feature vectors of P's network neighbors up to distance D, weighted by the neighbors’ significance degrees from the NSMs. (e) Generation of the second kind of network feature vector. Each was generated using P's network neighbors up to distance D, weighted by NSMs and PLCPs to reflect each neighbor's significance and the conditional probabilities of interactions between localization pairs, respectively. (f) Model selection for each localization. The best combination of feature sets was selected for each localization based on a forward approach with the DC-kNN classifier. (g) Prediction of unknown localizations. After generating all feature vectors using all known localization and network information, a confidence degree and a decision (on whether an unknown protein has a specific localization or not) were computed for each localization.
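The neighborhood aggregation described in panel (d) can be illustrated with a minimal sketch. It is not the authors' implementation: the NSM-based significance weights are replaced by a hypothetical inverse-distance weight (`1/d`), and the names `neighbors_within` and `network_feature` are invented for this example.

```python
from collections import deque

def neighbors_within(graph, source, max_dist):
    """Breadth-first search: map each protein within max_dist hops of
    source to its hop distance (the source itself is excluded)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue  # do not expand beyond the neighborhood radius D
        for nb in graph.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    del dist[source]
    return dist

def network_feature(graph, features, source, max_dist,
                    weight=lambda d: 1.0 / d):
    """Weighted average of the neighbors' single-protein feature vectors.
    ASSUMPTION: `weight` is a stand-in for the NSM significance degrees."""
    total = [0.0] * len(features[source])
    norm = 0.0
    for protein, d in neighbors_within(graph, source, max_dist).items():
        if protein not in features:
            continue  # neighbor without a feature vector is skipped
        w = weight(d)
        norm += w
        for i, x in enumerate(features[protein]):
            total[i] += w * x
    if norm == 0.0:
        return total  # no usable neighbors: all-zero vector
    return [t / norm for t in total]
```

With `max_dist = 0` the neighborhood is empty and the sketch returns a zero vector, which corresponds to using only the protein's own features.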
Data sources integrated to predict localization information
Localization

| Species | Data set | Proteins | Localizations |
|---|---|---|---|
| SC | Localization-known proteins | 3914 | 5184 |
| | Localization-known and having interactions | 3206 | 4284 |
| | Ambiguous | 237 | 189 335 |
| | Localization-unknown | 1530 | 0 |
| DM | Localization-known | 2187 | 2398 |
| | Localization-known and having interactions | 1610 | 1778 |
| | Localization-unknown and having interactions | 5656 | 0 |
| HS | Localization-known | 4570 | 5251 |
| | Localization-known and having interactions | 2684 | 3093 |
| | Localization-unknown and having interactions | 3767 | 0 |

Interaction

| Species | Data set | Proteins | Interactions |
|---|---|---|---|
| SC | BioGRID | 5184 | 70 700 |
| | DIP | 4931 | 17 471 |
| | SGD | 5395 | 56 035 |
| DM | BioGRID | 7545 | 25 463 |
| | DIP | 7038 | 20 719 |
| HS | BioGRID | 7378 | 20 968 |

Protein feature

| Feature | Description |
|---|---|
| Sequences | UniProt (for |
| Chemical property | Hydrophobicity, hydrophilicity and side-chain mass |
| Motifs | InterPro |
| Functions | InterPro and GO |
Here, we only considered the proteins with sequence information.
a. 22 SC localizations are actin, bud, bud neck, cell periphery, cytoplasm, early Golgi, endosome, ER, ER to Golgi, Golgi, late Golgi, lipid particle, microtubule, mitochondrion, nuclear periphery, nucleolus, nucleus, peroxisome, punctate composite, spindle pole, vacuolar membrane, vacuole.
b. 12 DM localizations are actin, cell periphery, centrosome, cytosol, ER, Golgi, lysosome, mitochondrion, nucleolus, nucleus, peroxisome, vacuole.
c. 13 HS localizations are actin, cell cortex, centrosome, cytosol, ER, Golgi, lysosome, mitochondrion, nucleolus, nucleus, peroxisome, plasma membrane, vacuole. Further details regarding localizations and interactions of SC, DM and HS are in Supplementary Figure S1 and Tables S1–S4.
Figure 2. Correlation between known localizations and protein interactions of yeast proteins. (a) The number of interactions (inside the circles) and the fraction of interactions whose proteins share localization information (outside the circles) of three interaction databases: BiG, DIP and SGD. (b–d) The PLCPs of BiG, DIP and SGD, respectively. Given a protein at a particular localization (row), each cell corresponds to the conditional probability of the localization of its interacting partners (column). The squares on the diagonal (or off-diagonal) indicate the locations with relatively low (or high) degrees of location-sharing interactions within (or between) locations; the dotted circles on the diagonal indicate different patterns among three interaction databases for proteins in the lipid particle.
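The row-conditional matrix described in panels (b–d) can be approximated with a short sketch. This is a simplification under stated assumptions: it uses plain, unweighted counting with row normalization, whereas the weighting scheme used for the PLCPs is not given here. The function name `localization_conditional_probs` is hypothetical.

```python
from collections import defaultdict

def localization_conditional_probs(interactions, localizations):
    """Estimate P(partner localization = column | protein localization = row).

    interactions:  iterable of (protein_u, protein_v) pairs
    localizations: dict mapping a protein to a set of localizations
    ASSUMPTION: every (row, column) co-occurrence counts equally.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for u, v in interactions:
        # Count the edge once from each endpoint's point of view.
        for protein, partner in ((u, v), (v, u)):
            for row in localizations.get(protein, ()):
                for col in localizations.get(partner, ()):
                    counts[row][col] += 1.0
    probs = {}
    for row, cols in counts.items():
        total = sum(cols.values())
        probs[row] = {col: c / total for col, c in cols.items()}
    return probs
```

Each row of the result sums to 1, so a large diagonal entry indicates that interactions of proteins at that localization mostly stay within it.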
Figure 6. Performance of predicting yeast protein localization as the available interaction (a) or localization (b) data are eroded. In (a), interactions were randomly deleted to reduce the average degree of the yeast PPI network to that specified (x-axis). In (b), known yeast protein localizations were randomly deleted. In either case, AUC was estimated using the leave-one-out approach. To avoid overfitting, the selected feature sets were taken from Supplementary Figure S6 and not re-optimized. Worm, fly, and human were mapped onto these yeast performance curves using the average degree of their available protein networks (a) or the fraction of known localizations for network proteins (b). The blue diamond represents the performance of a conventional approach using all nine single protein features without feature set selection. The red ‘X’ marks denote the performance of the proposed method when applied to recover known protein localizations in fly and human, using LOOCV.
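The erosion step in panel (a) can be sketched as random edge sampling down to a target average degree. The sketch assumes the average degree is computed as 2E/N with N fixed at the number of proteins in the original network, so proteins that lose all their edges still count; the caption does not state which convention was used. The function names are hypothetical.

```python
import random

def average_degree(edges, n_nodes):
    """Average degree of an undirected network: 2E / N."""
    return 2 * len(edges) / n_nodes

def erode_to_average_degree(edges, target_degree, seed=0):
    """Randomly delete interactions until the average degree is at most
    target_degree. ASSUMPTION: N is the node count of the input network."""
    edges = list(edges)
    nodes = {protein for edge in edges for protein in edge}
    keep = int(target_degree * len(nodes) / 2)  # edges to retain
    if keep >= len(edges):
        return edges  # network is already at or below the target
    rng = random.Random(seed)  # seeded for reproducible erosion
    return rng.sample(edges, keep)
```

Repeating the call with different seeds and averaging the resulting AUC values would give one point per target degree on a curve like the one in panel (a).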
Figure 3. Usefulness of protein interaction networks. (a) The performance of five cases, including (i) random guess of localization, (ii) features only, (iii) only, (iv) only and (v) all three kinds of features. (b–e) The performance of the features for amino acid frequencies (b), chemical amino acid properties (c), and GO terms (d) as well as performance of the feature (e). Performance is based on the five interaction networks BiG, DIP, SGD, Combined, and Random (different color curves). The performance of other network features is shown in Supplementary Figure S3. The x-axis is the radius of neighborhood D; D = 0 means only the single protein feature vector was used, which is a conventional approach. For Combined, the three interactome datasets BiG, DIP and SGD were pooled into a single network. For Random, localizations were randomly assigned on the BiG network. The solid lines and the dotted lines represent the Total and Balanced measures, respectively.
Figure 4. Performance of the network-based approach. (a) The averaged AUC values of three cases: (i) all features without feature set selection (FSS), (ii) all features with FSS and (iii) all , and features with FSS for each localization. (b) Performance comparison with two well-known methods. Performance is computed using the Total versus Balanced metrics (top three versus bottom three bars).
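Feature set selection (FSS) with a forward approach can be sketched as a generic greedy loop. This is an illustration only: in the study the score would come from the DC-kNN classifier, while here `score` is a placeholder callable and `toy_score` is an invented function with made-up numbers, not reported results.

```python
def forward_select(candidates, score):
    """Greedy forward selection: repeatedly add the feature set that most
    improves score(selected), and stop when no addition improves it."""
    selected = []
    best = float("-inf")
    remaining = list(candidates)
    while remaining:
        scored = [(score(selected + [c]), c) for c in remaining]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no candidate improves the current score
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected, best

def toy_score(subset):
    """HYPOTHETICAL scoring function standing in for a cross-validated AUC."""
    base = {"AA": 0.60, "GO": 0.80, "Motif": 0.70}
    value = max(base[name] for name in subset)
    if "GO" in subset and "AA" in subset:
        value += 0.05  # pretend these two feature sets are complementary
    return value - 0.02 * (len(subset) - 1)  # small penalty per extra set

chosen, chosen_score = forward_select(["AA", "GO", "Motif"], toy_score)
```

Running the selection separately for each localization would yield one chosen combination of feature sets per localization.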
Figure 5. Validation of novel localizations for yeast proteins. New localization images for two yeast proteins, Noc4/Ypr144c (a) and Utp21/Ylr409c (b), for which the network-based prediction (nucleolus) differed from the previously measured localization (nucleus) (1). The near-complete overlap between the GFP and RFP images (‘Merge A’), marking the protein and nucleolus, respectively, is consistent with a nucleolar localization (Sik1-RFP was used as a nucleolus marker). Here, DAPI is used for marking the nucleus, and ‘Merge B’ is the overlap among GFP, DAPI and RFP images. (c) Proteins that interact with Noc4/Ypr144c and their localizations. The values in the upper-left box represent the interacting protein pairs’ localization purity (IPLP, or enrichment) among interacting protein pairs for distinct localizations (see the ‘Supplements.doc’ for more information). Panel (c) is drawn using Cytoscape (55).