Daniel S Lieber, Olivier Elemento, Saeed Tavazoie.
Abstract
The increasing ability to generate large-scale, quantitative proteomic data has brought with it the challenge of analyzing such data to discover the sequence elements that underlie systems-level protein behavior. Here we show that short, linear protein motifs can be efficiently recovered from proteome-scale datasets such as sub-cellular localization, molecular function, half-life, and protein abundance data using an information theoretic approach. Using this approach, we have identified many known protein motifs, such as phosphorylation sites and localization signals, and discovered a large number of candidate elements. We estimate that ~80% of these are novel predictions in that they do not match a known motif in both sequence and biological context, suggesting that post-translational regulation of protein behavior is still largely unexplored. These predicted motifs, many of which display preferential association with specific biological pathways and non-random positioning in the linear protein sequence, provide focused hypotheses for experimental validation.
Year: 2010 PMID: 21206902 PMCID: PMC3012054 DOI: 10.1371/journal.pone.0014444
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Schematic of motif-finding approach.
FIRE-pro seeks to identify protein motifs whose pattern of presence and absence across all amino acid sequences is highly informative about the behavior profile for the corresponding proteins. The algorithm takes as input a user-specified protein behavior profile listing a quantitative measurement or discrete attribute of every protein (e.g., half-life or nuclear localization). Presented is a schematic example using discrete localization data. Here, knowing whether the motif is present or absent in the amino acid sequence provides significant information regarding the behavior of the protein. For each candidate motif (e.g., “KRK”), FIRE-pro calculates the correlation between the motif profile and the protein behavior profile using mutual information. Motifs that maximize the mutual information are ultimately selected for further characterization.
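The scoring step described in this caption can be illustrated with a minimal sketch. It is not the FIRE-pro implementation: it assumes motifs can be treated as ordinary regular expressions, uses a plain plug-in estimate of mutual information, and runs on invented toy sequences. The function names `motif_profile` and `mutual_information` are hypothetical.

```python
import math
import re
from collections import Counter


def motif_profile(sequences, motif):
    """Return 1 if the motif (treated as a regex) occurs in a sequence, else 0."""
    pattern = re.compile(motif)
    return [1 if pattern.search(seq) else 0 for seq in sequences]


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in bits) between two discrete profiles."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


# Toy data, invented for illustration only.
sequences = ["MAKRKLS", "GGKRKAA", "MSTLLGA", "AAGGSSV"]
behavior = ["nuclear", "nuclear", "other", "other"]

profile = motif_profile(sequences, "KRK")
print(profile, mutual_information(profile, behavior))
```

In this toy case the presence of "KRK" separates the two classes perfectly, so the mutual information reaches its maximum for a balanced binary profile. A real analysis would also need a significance estimate (the captions below mention z-scores), which this sketch omits.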
Select known and novel motifs found by FIRE-pro.
| (a) Known motifs |
| CLB2: B-type cyclin | SP.[RK] | 312 | SP.[RK] | CDK kinase substrate | Y | Pkinase (1e-04) | −3.5 | cell cycle (1e-16) |
| PTK2: Putative S/T kinase | RR.[SHP] | 122 | RR.S | PKA kinase substrate | - | phosphotransferase activity (0.01) | ||
| GO: nuclear part | [KRN]KR[KSR] | 99 | K[KR].[KR] | Nuclear localization | Bromodomain (0.001) | −1.1 | nuclear lumen (1e-91) | |
| TPK1: cAMP-dependent kinase | R[RK].S | 96 | R[KER].S | PKA kinase substrate | ||||
| LSB3: C-terminal SH3 domain | [PQ]P..P[PTM]R | 92 | P..P | SH3 general ligand | actin cytoskeleton biogenesis (1e-05) | |||
| GO: membrane | L[LAF]G | 89 | LLG | Beta2-Integrin binding | Mito_carr (1e-06) | 0.3 | intrinsic to membrane (1e-67) | |
| GO: transcription | N[NTP]N[NAP] | 77 | NNNN | Poly-asparagine | Y | Zn_clus (0.001) | −0.7 | transcription (1e-10) |
| RSP5: Ubiquitin-protein ligase | PP.Y | 76 | PP.Y | LIG_WW_1 | ||||
| CLB2: B-type cyclin | L..SP | 74 | SP | ERK1,2 Kinase substrate | Pkinase (0.001) | −1.4 | bud neck (1e-06) | |
| RIM11: kinase | [GSQ]S..[ANV]SP | 72 | [ST]...[ST]P | RIM11 Kinase substrate | ||||
| GO: transcription | Q[QNH]Q | 68 | QQQ | Poly-glutamine | zf-C2H2 (1e-11) | −0.9 | transcription (1e-14) | |
| GO: membrane-enclosed lumen | K[KRE][REH]K | 67 | KR | CLV_PCSK_PC1ET2_1 | Y | nuclear lumen (1e-10) | ||
| GO: nucleus | LK | 67 | F.F.LK...K.R | Phosphatidylserine binding | WD40 (1e-07) | −0.4 | nuclear lumen (1e-19) | |
| GO: cellular morphogenesis | [STL]S..[SAD]S | 66 | S..[ST] | Casein kinase I phos. site | Pkinase (0.01) | −4.6 | cellular morphogenesis (1e-15) | |
| Localization: actin | PPP.[PHY] | 63 | PPP | Polyproline | Y | SH3_1 (1e-04) | −0.7 | actin cortical patch (1e-14) |
| GO: cell cycle | [SYI]S...S | 54 | S...S | WD40 binding | Pkinase (1e-04) | −4.8 | cell cycle (10) | |
| PPH22: phosphatase subunit | SP.[GD]R[LYN] | 52 | SP | ERK1,2 Kinase substrate | Proteasome (1e-08) | −3.7 | proteasome core complex (1e-10) | |
| CDC15: MEN kinase | S..[PWH]S | 30 | S...S | WD40 binding | Pkinase (1e-18) | −2 | protein kinase activity (1e-14) | |
| (b) Semi-novel motifs |
| SMT3: SUMO family protein | A[DVA]A | 66 | [LV]IA[DE][PA] | Caveolin pattern | carboxylic acid metabolism (1e-07) | |||
| YCK1: membrane casein kinase | S.[SEV]D | 65 | HSTSDD | BCKDC kinase | ||||
| Plasmodium expression cluster | K..Y[ISH] | 47 | Y[LI] | SH2 ligand for PLCgamma1 | Y | Rifin_STEVOR (0.01) | −5.3 | |
| PRE2: 20S proteasome subunit | VEYA | 46 | VIYAAPF | Abl kinase substrate | Y | Proteasome (1e-09) | −3.8 | proteasome core complex (1e-11) |
| PPH22: phosphatase subunit | [TIV][FH]SP | 36 | SP | ERK substrate | Y | Proteasome (1e-12) | −4.5 | proteasome core complex (1e-16) |
| PPH22: phosphatase subunit | EY.[LS]E[AS] | 36 | [DE]Y | EGFR kinase substrate | Y | Proteasome (1e-10) | −4.1 | proteasome core complex (1e-09) |
| HTZ1: Histone | [GVH]G[KYQ]G | 32 | GGQ | N-methylation in E. coli | Y | Histone (1e-05) | −2.5 | nuclear chromatin (1e-06) |
| PAB1: Poly(A) binding | G.[PRT]G | 31 | IQ.RG.RG | Binding on Calmodulin | RRM_1 (0.001) | −4.1 | RNA metabolism (1e-09) | |
| Localization: periphery (S. pombe) | T..[PSL]N | 30 | T..[SA] | FHA of KAPP binding | Pkinase (1e-04) | −2 | barrier septum (1e-54) | |
| Plasmodium expression cluster | R.[GSA]R | 29 | [AG]R | Protease matriptase site | DEAD (1e-13) | −2.9 | ATP-dependent helicase activity (1e-12) | |
| ARC1: tRNA binding | S[DQP]S | 28 | R.S.S.P | 14-3-3 bindings | Pkinase (1e-14) | −3.9 | protein kinase activity (1e-13) | |
| HHT1: histone | KP..[KFV][KHA] | 28 | KP..[QK] | LIG_SH3_4 | Histone (0.01) | −2.8 | chromatin architecture (1e-07) | |
| PPI clusters | SP[STN] | 24 | SP | ERK substrate | interphase (1e-06) | |||
| Localization clusters (Huh, 2003) | P..[PSE]P | 21 | P.[ST]PP | ERK substrate | Y | PX (1e-05) | −0.3 | cell cortex part (1e-24) |
| Localization multiclass (Huh, 2003) | T..[SFL]T | 11 | T..[SA] | FHA of KAPP binding | Y | nuclear pore (1e-29) | ||
| Localization clusters (Huh, 2003) | TG.G[KLW][TFY] | 11 | TGY | ERK6/SAPK3 activation sites | Helicase_C (1e-10) | −1.1 | RNA helicase activity (1e-11) | |
| (c) Novel motifs |
| GO: nuclear part | DE[EDK][ED] | 131 | Y | nuclear lumen (1e-09) | ||||
| Ubiquitin-conjugates (Peng, 2003) | L..[LDS]A | 125 | Y | IBN_N (1e-05) | −0.4 | Golgi apparatus (1e-08) | ||
| GO: membrane | I[FIW]..V | 70 | Adaptin_N (0.001) | 0.6 | transporter activity (1e-40) | |||
| GO: ribosome biogenesis | E[EDK]..E[EKD] | 67 | WD40 (0.01) | −2.3 | cytoplasm organization (1e-12) | |||
| YAP1: Basic leucine zipper | QQ..M[QIV][QTA] | 66 | RNA polymerase II TF activity (1e-06) | |||||
| NOP2: RNA methyltransferase | R[GST].[DQF]IP | 56 | Y | DEAD (1e-05) | −1.1 | ribosome biogenesis (1e-08) | ||
| GO: DNA-dependent transcription | N.D[DST] | 52 | zf-C2H2 (1e-06) | −1.5 | transcription, DNA-dependent (1e-23) | |||
| GO: transcription | N.D[DST] | 52 | zf-C2H2 (1e-06) | −1.5 | transcription, DNA-dependent (1e-23) | |||
| SMT3: SUMO family protein | V.[DKG]A | 47 | Y | carboxylic acid metabolism (1e-04) | ||||
| POB3: Nucleosome maintenance | [GH]S..KA[SI] | 33 | Histone (0.01) | −1.6 | chromatin architecture (0.001) | |||
| UBP15: Ubiquitin-specific protease | A.[TSL]S | 28 | Pkinase (0.001) | −2.1 | protein kinase activity (0.001) | |||
| PRE2: 20S proteasome subunit | Q[VID]E | 26 | Proteasome (1e-08) | −4.8 | proteasome complex (1e-19) | |||
| Half-life (Belle, 2006) | R.[RSY]S | 25 | reg. of cellular physiological process (1e-04) | |||||
| PPI clusters | GGL[FTL][GEP] | 13 | snRNP protein import into nucleus (1e-07) | |||||
Known: matches previously identified; Semi-novel: matches sequence but has distinct biological context; Novel: no match.
Select (a) known, (b) semi-novel, and (c) novel motifs discovered by FIRE-pro. Known motifs match previously identified motifs in the literature in both sequence and biological context. Semi-novel motifs match previously identified motifs in sequence but not in biological context. Novel motifs do not match any previously identified motif. Motifs presented here were selected based on a combination of criteria including high mutual information and z-score, low domain overlap score, positional bias, GO enrichment, and similarity to known motifs. Name refers to the dataset in which the motif was discovered and is abbreviated as follows. GO: term = binary profile of proteins annotated to the GO term; Protein: description = binary profile of proteins interacting with the protein; Localization: compartment = binary profile of proteins localized to the cellular compartment. See Text S1 for further description of datasets.
Figure 2. Motifs found in Cdc28 (YBR160W) interacting proteins.
(A) P-value heatmap of motifs found in Cdc28-interacting proteins. Columns correspond to classes of proteins and rows correspond to predicted motifs. The yellow color-map indicates over-representation of a motif in a given class; significant over-representation (p<0.05 after Bonferroni correction) is highlighted using red frames. Similarly, the blue color-map and blue frames indicate under-representation. For each motif, we indicate 1) position-weight matrix (PWM) representation, 2) mutual information (MI) value, 3) z-score associated with the MI value, 4) robustness score ranging from 0/10 to 10/10. (B) Motif interaction heat map. Columns and rows correspond to motifs. Light-colored boxes represent co-occurring motifs and “+” signs represent spatial co-localization. Dark-colored boxes represent motif co-avoidance. Values represent information (in bits). (C) Auto-generated enrichment analysis table. For each motif, we indicate 1) the presence of a position bias, 2) GO enrichment, 3) domain enrichment, 4) domain overlap score indicating the positional overlap between the motif and the most enriched domain. (D) Positional bias of “SP.[RK]”. For every motif, a histogram is automatically generated showing the distribution of motif instance positions, normalized to protein length. Upper row, histogram of motif instance positions in Cdc28-interacting proteins (“Targets”). Lower row, histogram of positions in proteins that do not interact with Cdc28 (“Other”).
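The position histogram described in panel (D) can be sketched as follows. This is an illustrative approximation rather than the tool's own code: it assumes the motif can be used directly as a regular expression, takes the start of each instance as its position, and uses made-up sequences. The helper names `normalized_positions` and `histogram` are hypothetical.

```python
import re


def normalized_positions(sequences, motif):
    """Start of every motif instance divided by the length of its protein."""
    # A lookahead wrapper lets overlapping instances be counted as well.
    pattern = re.compile(f"(?=({motif}))")
    return [m.start() / len(seq) for seq in sequences for m in pattern.finditer(seq)]


def histogram(values, bins=10):
    """Count values in [0, 1] into equal-width bins."""
    counts = [0] * bins
    for v in values:
        # Clamp so that a value of exactly 1.0 falls into the last bin.
        counts[min(int(v * bins), bins - 1)] += 1
    return counts


# Toy sequences, invented for illustration only.
targets = ["SPAKMMMMMM", "MSPTRMMMMM"]
others = ["MMMMMMMMMMMMMMSPAKMM"]

target_hist = histogram(normalized_positions(targets, "SP.[RK]"), bins=5)
other_hist = histogram(normalized_positions(others, "SP.[RK]"), bins=5)
print("Targets:", target_hist)
print("Other:  ", other_hist)
```

Comparing the two histograms side by side mirrors the "Targets" versus "Other" rows of the panel; deciding whether a difference is significant would require a statistical test that the caption does not specify.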
Figure 3. Multi-class analysis of protein sub-cellular localization.
The data [44] were clustered into six distinct localization patterns, each represented by a column of the matrix: nucleus, mitochondria, cell periphery, nucleus & cytoplasm, cytoplasm, and ER (see Figure S2 and Text S1). (A) P-value heatmap of motifs uncovered in the analysis. The top motif, “K[KRP].K”, matches the well-known nuclear localization signal (NLS). The hydrophobic motifs found to be enriched in ER proteins may suggest the existence of signals within stretches of hydrophobic residues. Enrichment analysis for the motifs can be found in Figure S3. (B) Linear sequence position bias of a mitochondrial motif corresponding to the “RFYS” consensus sequence for the N-terminal mitochondrial signal peptide cleavage site [47], [48]. Comparing motif positions in mitochondrial proteins against non-mitochondrial proteins reveals strong N-terminal enrichment.
Figure 4. Analysis of quantitative protein half-life data.
Half-life data for ∼3,750 yeast proteins [50] were sorted and binned into ten equally populated classes, with the shortest half-life proteins comprising the left-most column and longest half-life proteins comprising the right-most column. The range of half-lives in each bin in minutes is indicated below the heatmap. Four motifs were found to be informative of half-life, all of which are associated with short half-life. The heatmap shows a gradual transition from over- to under-representation of each motif across the ten bins. The top motif shows an association with protein kinase domains though it does not overlap with the domain, while the bottom three motifs may represent protein kinase domain signatures (see Figure S4 for functional enrichment analysis).
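Sorting a quantitative measurement into equally populated classes, as described for the half-life data, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `equal_population_bins` is hypothetical, ties are broken by input order, and the half-life values are invented rather than taken from the dataset.

```python
def equal_population_bins(values, n_bins=10):
    """Assign each value a bin index 0..n_bins-1 so bins hold (nearly) equal counts."""
    # Indices of the values in ascending order (shortest half-life first).
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # Map the rank of each value onto one of n_bins consecutive groups.
        labels[i] = rank * n_bins // len(values)
    return labels


# Toy half-lives in minutes, invented for illustration only.
half_lives = [5, 300, 42, 18, 90, 7, 120, 60, 33, 240]
labels = equal_population_bins(half_lives, n_bins=5)
print(labels)
```

The resulting bin labels form a discrete behavior profile, so a continuous measurement can be scored against motif presence in the same way as a categorical attribute such as localization.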