Jin Wang, Chunhe Li, Erkang Wang, Xidi Wang.
Abstract
Accurately predicting the localization of proteins is of paramount importance in the quest to determine their respective functions within the cellular compartment. Because of the continuous and rapid progress in the fields of genomics and proteomics, more data are available now than ever before. Concurrently, data mining methods have been developed and refined to handle this experimental windfall, allowing the scientific community to quantitatively address long-standing questions such as that of protein localization. Here, we develop a frequent pattern tree (FPT) approach to generate a minimum set of rules (mFPT) for predicting protein localization. We acquire a series of rules according to the features of yeast genomic data. The mFPT prediction accuracy is benchmarked against other commonly used methods such as Bayesian networks and logistic regression under various statistical measures. Our results show that mFPT performed better than the other approaches in predicting protein localization. Setting 0.65 as the minimum hit-rate, we obtained 138 proteins for which the mFPT prediction differed from that of the simple naive Bayesian method (SNB). In our analysis of these 138 proteins, we present novel predictions for the location of 17 proteins that currently do not have any defined localization. These predictions can serve as putative annotations and should provide preliminary clues for experimentalists. We also compared our predictions against the eukaryotic subcellular localization database and related predictions by others on protein localization. Our method is general and can thus be applied to discover the underlying rules for protein-protein interactions, genomic interactions, and structure-function relationships, as well as those of other fields of research.
Year: 2011 PMID: 21283516 PMCID: PMC3023707 DOI: 10.1371/journal.pone.0014449
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Features description.
| Feature | Type | Subtype | Bins(range) | Description |
| MIT1 | Motif | Signal | 2(1–3) | More than one N-terminal residue is cut (good chance of being mitochondrial) |
| GLYC | Motif | Signal | 11(4–14) | Number of predicted N-glycosylation sites (NXS/T) |
| SIGNALP | Motif | Signal | 2(15–17) | Secretory signal peptide according to the SignalP server |
| SIG1 | Motif | Signal | 2(18–20) | If a protein has a signal sequence. The pattern consists of a charged residue within the first seven residues, followed by a stretch of 14 residues with an average GES hydrophobicity less than -1 kcal/mol |
| NUC1 | Motif | Signal | 6(21–27) | Four-residue patterns of 1. All basic amino acid residues (K or R) or 2. Three basic amino acids (K or R), and one H or P |
| PI | Overall-sequence | Isoelectric point | 10(28–38) | pI (isoelectric point) values |
| TMS1 | Overall-sequence | Transmembrane helix | 5(39–44) | Prediction results of whether a protein has transmembrane (TM) segments. TM segments were identified using the GES hydrophobicity scale |
| MAYOUNG | Whole-genome | Absolute expr.(GeneChip) | 10(45–55) | Absolute mRNA expression in a GeneChip experiment |
| KNOCKOUT | Whole-genome | Knockout mutation | 2(56–58) | Knockout mutation (lethal or viable). |
| MRDIASD | Whole-genome | Expr.fluctuation (Diauxic shift) | 10(59–69) | Standard deviation in mRNA expression level over time (i.e. expression fluctuation) for a protein in the diauxic shift experiment |
| PLMNEW1 | Motif | Signal | 2(70–72) | Plasma membrane signal |
| FARN | Motif | Signal | 2(73–75) | C-terminal farnesylation site: the sequence pattern consists of a cysteine followed by two aliphatic residues and one more residue at the C terminus |
| GGSI | Motif | Signal | 2(76–78) | C-terminal geranylgeranylation site |
| MIT2 | Motif | Signal | 2(79–81) | Mitochondrial matrix import sequence: The N-terminal of the protein has repeated alternating hydrophobic and hydrophilic patterns, and the protein contains at least four S or T residues in its 20 N-terminal residues. |
| HDEL | Motif | Signal | 2(82–84) | Endoplasmic reticulum retention signal (HDEL) |
| NUC2 | Motif | Signal | 3(85–88) | Pattern starting with a P and followed within three residues by a basic four-residue segment containing K or R residues |
| POX1 | Motif | Signal | 2(89–91) | C-terminal peroxisome import signal ([SA][KRH]L) |
| MRCYELU | Whole-genome | Expr.fluctuation (Cell cycle) | 10(92–102) | Standard deviation in mRNA expression level over time (i.e. expression fluctuation) for a protein in the elutriation time series experiment in the Yeast Cell Cycle Analysis Project |
| MRCYCSD | Whole-genome | Expr.fluctuation (Cell cycle) | 10(103–113) | Standard deviation in mRNA expression level over time (i.e. expression fluctuation) for a protein in the alpha-factor arrest time series experiment in the Yeast Cell Cycle Analysis Project |
For every feature, proteins are divided into a fixed number of bins according to their feature values. We then use numbers (from 1 to 113) to represent the bins of the different features. For every feature, the largest number indicates that the protein has no record for that feature, and the other numbers correspond to the individual bins. For example, for feature MIT1 (more than one N-terminal residue is cut), 1 means that the protein does not have the feature, 2 means that it does, and 3 means that there is no record of this feature for the protein.
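The numbering scheme described above can be sketched as a small lookup. This is a minimal illustration, not the authors' code: the code ranges are copied from the feature table, while the function names `encode` and `decode` and the use of a 0-based bin index are assumptions made for the example. How raw feature values are assigned to bins is not specified here, so the sketch takes the bin index as given.

```python
# Code ranges copied from the feature table: feature -> (first code, last code).
# The last code of each feature is reserved for "no feature record".
FEATURE_RANGES = {
    "MIT1": (1, 3), "GLYC": (4, 14), "SIGNALP": (15, 17), "SIG1": (18, 20),
    "NUC1": (21, 27), "PI": (28, 38), "TMS1": (39, 44), "MAYOUNG": (45, 55),
    "KNOCKOUT": (56, 58), "MRDIASD": (59, 69), "PLMNEW1": (70, 72),
    "FARN": (73, 75), "GGSI": (76, 78), "MIT2": (79, 81), "HDEL": (82, 84),
    "NUC2": (85, 88), "POX1": (89, 91), "MRCYELU": (92, 102),
    "MRCYCSD": (103, 113),
}


def encode(feature, bin_index):
    """Map a 0-based bin index (hypothetical convention) to a global code.

    Pass bin_index=None when the protein has no record for the feature.
    """
    first, last = FEATURE_RANGES[feature]
    if bin_index is None:
        return last  # largest code of the feature marks a missing record
    code = first + bin_index
    if not first <= code < last:
        raise ValueError(f"bin index {bin_index} out of range for {feature}")
    return code


def decode(code):
    """Inverse of encode: return (feature, bin index or None if missing)."""
    for feature, (first, last) in FEATURE_RANGES.items():
        if first <= code <= last:
            return feature, (None if code == last else code - first)
    raise ValueError(f"code {code} is outside 1..113")
```

With this convention, `encode("MIT1", 0)`, `encode("MIT1", 1)`, and `encode("MIT1", None)` give the codes 1, 2, and 3 used in the MIT1 example.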
Input format of FPT.
| Mt1 | Gly | Sig | Sig1 | NC1 | PI | TMS | MAY | KNO | MRD | PLM | FAR | GGS | MT2 | HDE | NC2 | POX | MRC | MCC | Location |
| 1 | 4 | 16 | 19 | 27 | 31 | 42 | 54 | 56 | 68 | 70 | 73 | 76 | 79 | 82 | 85 | 89 | 97 | 109 | (E) |
| 1 | 9 | 15 | 18 | 27 | 34 | 39 | 54 | 56 | 67 | 70 | 73 | 76 | 79 | 82 | 85 | 89 | 92 | 108 | (C) |
| 1 | 13 | 15 | 18 | 27 | 28 | 39 | 50 | 57 | 59 | 70 | 73 | 76 | 79 | 82 | 85 | 89 | 92 | 105 | (N) |
| 2 | 5 | 15 | 19 | 27 | 34 | 39 | 53 | 56 | 64 | 70 | 73 | 76 | 79 | 82 | 85 | 89 | 95 | 112 | (M) |
| 2 | 5 | 16 | 19 | 27 | 30 | 39 | 53 | 56 | 62 | 70 | 73 | 76 | 79 | 83 | 85 | 89 | 98 | 109 | (E) |
The meanings of the category values are given in Table 1.
Rules of FPT.
| Hit-rate | Hit-number | Rules | Rules label | Location |
| 0.852 | 163 | 76 89 1 54 | C1 | (C) |
| 0.599 | 15 | 89 85 27 1 56 68 | C2 | (C) |
| 0.603 | 222 | 73 76 82 89 79 15 1 57 | N1 | (N) |
| 0.785 | 28 | 89 70 15 13 | N2 | (N) |
| 0.747 | 91 | 73 85 15 27 2 | M1 | (M) |
| 0.666 | 6 | 89 85 27 56 47 36 | M2 | (M) |
| 0.916 | 12 | 89 85 1 56 43 | T1 | (T) |
| 0.714 | 7 | 89 85 27 18 53 62 | T2 | (T) |
| 0.789 | 19 | 89 85 27 2 19 16 | E1 | (E) |
| 1 | 2 | 89 85 5 67 16 | E2 | (E) |
The meanings of the category values are given in Table 1.
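To illustrate how such rules might be applied to a protein in the FPT input format, here is a minimal sketch. It assumes that a rule fires when all of its category values occur in the protein's row, and that the firing rule with the highest hit-rate decides the location. Both the matching criterion and the tie-breaking are assumptions made for this example, not a statement of the authors' procedure; the names `Rule`, `make_rule`, `predict`, and `ROWS` are hypothetical. The rules and rows are copied from the tables above.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    hit_rate: float
    hit_number: int
    items: FrozenSet[int]
    label: str
    location: str


def make_rule(hit_rate, hit_number, items, label, location):
    """Build a Rule from a space-separated string of category values."""
    codes = frozenset(int(x) for x in items.split())
    return Rule(hit_rate, hit_number, codes, label, location)


# Rules copied from the "Rules of FPT" table.
RULES = [
    make_rule(0.852, 163, "76 89 1 54", "C1", "C"),
    make_rule(0.599, 15, "89 85 27 1 56 68", "C2", "C"),
    make_rule(0.603, 222, "73 76 82 89 79 15 1 57", "N1", "N"),
    make_rule(0.785, 28, "89 70 15 13", "N2", "N"),
    make_rule(0.747, 91, "73 85 15 27 2", "M1", "M"),
    make_rule(0.666, 6, "89 85 27 56 47 36", "M2", "M"),
    make_rule(0.916, 12, "89 85 1 56 43", "T1", "T"),
    make_rule(0.714, 7, "89 85 27 18 53 62", "T2", "T"),
    make_rule(0.789, 19, "89 85 27 2 19 16", "E1", "E"),
    make_rule(1.0, 2, "89 85 5 67 16", "E2", "E"),
]

# Rows copied from the "Input format of FPT" table: (category values, location).
ROWS = [
    ([1, 4, 16, 19, 27, 31, 42, 54, 56, 68, 70, 73, 76, 79, 82, 85, 89, 97, 109], "E"),
    ([1, 9, 15, 18, 27, 34, 39, 54, 56, 67, 70, 73, 76, 79, 82, 85, 89, 92, 108], "C"),
    ([1, 13, 15, 18, 27, 28, 39, 50, 57, 59, 70, 73, 76, 79, 82, 85, 89, 92, 105], "N"),
    ([2, 5, 15, 19, 27, 34, 39, 53, 56, 64, 70, 73, 76, 79, 82, 85, 89, 95, 112], "M"),
    ([2, 5, 16, 19, 27, 30, 39, 53, 56, 62, 70, 73, 76, 79, 83, 85, 89, 98, 109], "E"),
]


def predict(protein_items, rules=RULES, min_hit_rate=0.0):
    """Return the best firing rule, or None if no rule fires.

    Assumption: a rule fires when its items are a subset of the protein's
    items; among firing rules, the one with the highest hit-rate wins.
    """
    items = set(protein_items)
    firing = [r for r in rules if r.hit_rate >= min_hit_rate and r.items <= items]
    return max(firing, key=lambda r: r.hit_rate) if firing else None


if __name__ == "__main__":
    for items, known in ROWS:
        best = predict(items, min_hit_rate=0.65)
        print(known, best.label if best else None)
```

Under these assumptions, rule C1 also fires on the first example row, whose recorded location is (E). This is consistent with a hit-rate below 1 and shows why a hit-rate threshold is only a filter, not a guarantee.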
Rules of FPT used to predict Unknown-4700, with 0.65 as the hit-rate threshold.
| Hit-rate | Hit-number | Rules | Rules label | Location |
| 0.852 | 163 | 76 89 1 54 | (C1) | (C) |
| 0.785 | 28 | 89 70 15 13 | (N2) | (N) |
| 0.666 | 18 | 73 82 89 79 15 1 11 | (N3) | (N) |
| 0.747 | 91 | 73 85 15 27 2 | (M1) | (M) |
| 0.916 | 12 | 89 85 1 56 43 | (T1) | (T) |
| 0.789 | 19 | 89 85 27 2 19 16 | (E1) | (E) |
The meanings of the category values are given in Table 1.
Figure 1. ROC curve comparisons on the testing sample for 4 different compartments and 3 different methods.
Panels A, B, C, and D show the comparison results for the C, N, M, and E compartments, respectively.
Figure 2. Comparisons of correct prediction rate for four methods.
Figure 3. Comparisons of correct prediction rate using five-fold cross-validation for FPT and the logistic regression method.
The variances are scaled up by a factor of 1000 to make the comparison clearly visible.
Figure 4. Prediction of localization for the entire set of 6042 yeast proteins.
The FP-tree in Example 1.
| Tid | Items bought | (Ordered) frequent items |
| 1 | F, A, C, D, G, I, M | F, C, A, M |
| 2 | A, B, C, F, L, M, O | F, C, A, B, M |
| 3 | B, F, H, J, O | F, B |
| 4 | B, C, K, S | C, B |
| 5 | A, F, C, E, L, M, N | F, C, A, M |
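The "(Ordered) frequent items" column can be reproduced with a short sketch. A minimum support of 3 is assumed here, since it is the only integer threshold that yields exactly the items F, C, A, B, M for these five transactions. Items with equal support (F and C; A, B, and M) need a tie-break, so the fixed order `TIE_ORDER` below is chosen to match the table and is an assumption, as are all function and variable names.

```python
from collections import Counter

# Transactions copied from the table: Tid -> items bought.
TRANSACTIONS = {
    1: ["F", "A", "C", "D", "G", "I", "M"],
    2: ["A", "B", "C", "F", "L", "M", "O"],
    3: ["B", "F", "H", "J", "O"],
    4: ["B", "C", "K", "S"],
    5: ["A", "F", "C", "E", "L", "M", "N"],
}

MIN_SUPPORT = 3      # assumption: consistent with the table, not stated in it
TIE_ORDER = "FCABM"  # assumption: tie-break among items with equal support


def ordered_frequent_items(transactions, min_support, tie_order):
    """Keep frequent items and sort each transaction by descending support."""
    # Support = number of transactions that contain the item.
    support = Counter(i for items in transactions.values() for i in set(items))
    tie_rank = {item: pos for pos, item in enumerate(tie_order)}
    frequent = sorted(
        (i for i, n in support.items() if n >= min_support),
        key=lambda i: (-support[i], tie_rank.get(i, len(tie_rank))),
    )
    position = {item: pos for pos, item in enumerate(frequent)}
    return {
        tid: sorted((i for i in items if i in position), key=position.get)
        for tid, items in transactions.items()
    }


if __name__ == "__main__":
    ordered = ordered_frequent_items(TRANSACTIONS, MIN_SUPPORT, TIE_ORDER)
    for tid, items in ordered.items():
        print(tid, ", ".join(items))
```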
Figure 5. The FP-tree in Example 1.
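As a rough sketch of how an FP-tree is built from the ordered frequent items of Example 1, each ordered transaction is inserted as a path from the root, and shared prefixes increment the counts of existing nodes. This follows the standard FP-tree construction and is not taken from the authors' code; the class and function names are hypothetical, and the header table with node links is omitted for brevity.

```python
class Node:
    """One FP-tree node: an item, its count, and its children keyed by item."""

    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(ordered_transactions):
    """Insert each ordered transaction as a path, sharing common prefixes."""
    root = Node()
    for items in ordered_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root


def render(node, depth=0):
    """Return the tree as indented 'item:count' lines (root is not printed)."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(render(child, depth + 1))
    return lines


# "(Ordered) frequent items" column of the Example 1 table.
ORDERED = [
    ["F", "C", "A", "M"],
    ["F", "C", "A", "B", "M"],
    ["F", "B"],
    ["C", "B"],
    ["F", "C", "A", "M"],
]

if __name__ == "__main__":
    print("\n".join(render(build_fp_tree(ORDERED))))
```

Because the transactions are sorted by descending support before insertion, frequent items sit near the root and are shared by many paths, which is what keeps the tree compact.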