Yi-Cheng Chen, Kripamoy Aguan, Chu-Wen Yang, Yao-Tsung Wang, Nikhil R Pal, I-Fang Chung.
Abstract
BACKGROUND: The need for efficient algorithms to uncover biologically relevant phosphorylation motifs has grown with the rapid expansion of proteomic sequence databases and a plethora of new information on phosphorylation sites. Here we present a novel unsupervised method, called Motif Finder (in short, F-Motif), for the identification of phosphorylation motifs. F-Motif clusters sequence information represented by numerical features that exploit the statistical information hidden in some foreground data. The identified motifs are then filtered to retain the "actual" motifs, those with statistically significant motif scores.
RESULTS AND DISCUSSION: We have applied F-Motif to several new and existing data sets and compared its performance with two well-known state-of-the-art methods. In almost all cases F-Motif could identify all statistically significant motifs extracted by the state-of-the-art methods. More importantly, F-Motif also uncovers several novel motifs. Using clues from the literature, we have demonstrated that most of these new motifs discovered by F-Motif are indeed novel. We have also found some interesting phenomena. For example, for CK2 kinase, the conserved sites appear only on the right side of S. However, for CDK kinase, the adjacent site on the right of S is conserved with residue P. In addition, three different encoding methods, including a novel position contrast matrix (PCM) and the simplest binary coding, are used, and the ability of F-Motif to discover motifs remains quite robust with respect to the encoding scheme.
Year: 2011 PMID: 21647451 PMCID: PMC3102080 DOI: 10.1371/journal.pone.0020025
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. An illustration of the extraction of different consensus sequence motifs by the clustering process.
A set of fixed-length sequences is represented by a sequence logo. Sequence logo (a) represents all of the PKA kinase substrates. The sequences in (a) are split into several clusters, and each cluster is then represented by its own sequence logo. Sequence logos (b) to (g) represent PKA kinase substrates in different clusters.
Summary of the data sets.
| Data set (No. of peptides) | Description |
| Foreground data sets | |
| | Foreground data set comprising multiple sets of ATM, Casein II, CaMK II, and MAPK kinase substrates |
| | Foreground data sets from the Phospho.ELM database, considering all species, for PKA, PKC, CK2, and CDK kinase substrates |
| | Foreground data sets from the Phospho.ELM database, considering only human species, for PKA, PKC, CK2, and CDK kinase substrates |
| | Synthetic foreground data set consisting of five specially designed motifs “...D..SQ.N...”, “....R.S..L...”, “...TV.S.E....”, “....R.S..P...”, and “.....KS...I..” |
| | Foreground data set from mouse mass spectrometry data |
| Background data sets | |
| | All-species background data set from the Phospho.ELM database |
| | Human background data set from the Phospho.ELM database |
| | Mouse background data set from the IPI mouse database |
Figure 2. Overview of motif finding steps.
In Step 1, PCM encoding uses both the background data and the foreground data; PWM encoding uses the entire Phospho.ELM database in place of the background data; binary encoding uses neither the foreground nor the background data. In Step 2, the k-means clustering algorithm is applied repeatedly to generate a composite motif list (CML). The CML is then used to generate the final list of motifs in a stepwise manner that enforces two conditions: the motif must be statistically significant under a Binomial distribution based model, and it must occur at least M times in the present foreground data.
Figure 3. An illustration of how a potential motif is extracted from a cluster.
First, the frequency of each residue is counted at every position. Then, for each position, the residue with the highest frequency is noted. If more than one residue shares the highest frequency, one of them is chosen at random. Finally, sites whose top residue has frequency ≥T are treated as conserved sites, and together they form a potential motif.
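The extraction procedure in this caption is compact enough to sketch directly. The version below assumes that T is an absolute count (consistent with the setting T = 15 used in the tables) and that non-conserved sites are written as "."; the function name and the toy cluster are hypothetical.

```python
import random
from collections import Counter

def extract_potential_motif(cluster, threshold, rng=None):
    """Build a motif string from equal-length sequences in one cluster.

    A site is conserved when its most frequent residue occurs at least
    `threshold` times; all other sites are written as '.'.
    """
    rng = rng or random.Random(0)  # fixed seed so tie-breaking is repeatable
    motif = []
    for pos in range(len(cluster[0])):
        counts = Counter(seq[pos] for seq in cluster)
        top = max(counts.values())
        # Ties for the highest frequency are broken at random, as in the caption.
        candidates = sorted(res for res, c in counts.items() if c == top)
        residue = rng.choice(candidates)
        motif.append(residue if top >= threshold else ".")
    return "".join(motif)

# Toy cluster of four 9-residue peptides (made up for illustration).
cluster = ["AARRASLDE", "GKRRSSVEE", "LSRRLSAAK", "PTRKGSQQQ"]
print(extract_potential_motif(cluster, threshold=3))  # ..RR.S...
```

Raising `threshold` makes the motif sparser: with `threshold=5` no site in this four-peptide cluster qualifies and the result is all dots.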
Motifs identified by F-Motif and Motif-X using the foreground data set FM and the background data sets BH and BA, for G = 15 and T = 15.
| Motif-X | F-Motif | MoDL | ||||
| Order | Motif | Score with ( | Index | Motif | Score with ( | Appearance in ( |
| 1 | | (27.45, 27.48) | 1 | | (31.64, 31.61) | (✓, ?) |
| 2 | | (16.00, 16.00) | *1 | | (23.69, 23.68) | (?, ?) |
| 3 | | (16.00, 16.00) | 3 | | (15.80, 15.70) | (?, ?) |
| 4 | | (16.00, 16.00) | 4 | | (16.00, 16.00) | (?, ✓) |
| 5 | | (7.61, 7.36) | 2 | | (15.84, 15.85) | (?, ?) |
| 6 | | (7.95, 7.99) | 5 | | (7.61, 7.36) | (?, ?) |
| | | | 6 | | (7.95, 7.99) | (?, ?) |
The first three columns correspond to the output generated by Motif-X, and the next three columns correspond to F-Motif. In the last column, “✓” indicates that the motif was also identified by MoDL and “?” that it was not. The entry (*1) in row 2 of the right half of the table marks a novel motif that Motif-X could not discover. This novel motif, “......S..EE..”, also has a very high score (see column 6).
Motifs identified by MoDL using the foreground data set, FM and background data sets, BH and BA.
| Data set | Mixture motifs | Motifs with single residues | Foreground match | Background match | Individual position score |
| | | | 13 | 998 | 15.64; 12.52 |
| | | | 33 | 1536 | 15.64; 16.00 |
| | | | 10 | 1003 | 4.66; 12.52 |
| | | | 19 | 2046 | 4.66; 16.00 |
| | | | 9 | 704 | 0.47; 12.52 |
| | | | 5 | 1025 | 0.47; 16.00 |
| | | | 2 | 548 | 13.04; 12.52 |
| | | | 9 | 873 | 13.04; 16.00 |
| | | | 1 | 693 | 0.08; 9.80 |
| | | | 4 | 998 | 0.08; 11.40 |
| | | | 10 | 968 | 12.52; 9.80 |
| | | | 9 | 1188 | 12.52; 11.40 |
| | | | 22 | 1167 | 16.00; 9.80 |
| | | | 32 | 2184 | 16.00; 11.40 |
| | | | 2 | 1477 | 0.24; 9.80 |
| | | | 6 | 1868 | 0.24; 11.40 |
| | | | 10 | 1437 | 12.42; 9.73 |
| | | | 9 | 1762 | 12.42; 11.55 |
| | | | 22 | 1756 | 16.00; 9.73 |
| | | | 32 | 3196 | 16.00; 11.55 |
| | | | 54 | 17260 | 15.61 |
| | | | 49 | 16753 | 13.05 |
Columns 2 and 3 display the mixture motifs found by MoDL (with BH or BA) and the corresponding motifs with single residues, respectively. The columns “Foreground match” and “Background match” show the number of times the associated single-residue motif appears in the foreground and background data, respectively. The “Individual position score” column gives the motif score for each position-residue pair in the motif.
Motifs identified by F-Motif and Motif-X using the four kinase-specific all-species foreground data sets (FA, one per kinase) and the all-species background data set (BA) with G = 15 and T = 15.
| Data set | Index | Motif | Match/Total | PCM hit frequency | Background match | Motif score |
| | 1 | | 98/306 | 37/50 | 0.59% | 32.00 |
| | 2 | | 36/208 | 50/50 | 0.41% | 32.00 |
| | 3 | | 76/172 | 50/50 | 4.94% | 16.00 |
| | 4 | | 52/96 | 50/50 | 5.75% | 16.00 |
| | $1 | | | | | |
| | *1 | | 22/297 | 2/50 | 0.59% | 32.00 |
| | 1 | | 77/275 | 50/50 | 5.15% | 16.00 |
| | 3 | | 54/198 | 50/50 | 5.58% | 16.00 |
| | 2 | | 45/144 | 50/50 | 5.78% | 16.00 |
| | 4 | | 32/99 | 50/50 | 5.11% | 16.00 |
| | $1 | | | | | |
| | $2 | | | | | |
| | 1 | | 36/241 | 11/50 | 0.65% | 32.00 |
| | *1 | | 22/205 | 19/50 | 0.92% | 31.60 |
| | 3 | | 23/183 | 22/50 | 0.44% | 27.91 |
| | 2 | | 55/160 | 50/50 | 6.62% | 16.00 |
| | 4 | | 42/105 | 50/50 | 5.44% | 16.00 |
| | 1 | | 44/209 | 35/50 | 0.45% | 30.62 |
| | *1 | | 27/165 | 22/50 | 0.53% | 22.69 |
| | *(2) | | 24/138 | 8/50 | 0.99% | 21.44 |
| | *(3) | | 21/114 | 24/50 | 1.01% | 17.43 |
| | 2 | | 84/93 | 33/50 | 5.51% | 16.00 |
An asterisk in column 2 indicates a new motif that is found by F-Motif but not found by Motif-X. The information in other columns corresponds to F-Motif. The fourth column labeled “Match/Total” shows the number of times the associated motif appears in the present (remaining) foreground data. The fifth column, PCM hit frequency, gives the number of times, out of the fifty iterations, the associated motif is detected and it refers to the PCM encoding. The sixth column, Background match, displays the percentage of the present (remaining) background data that has matched with the associated motif. Also a “$” symbol in column 2 indicates a new motif (satisfying the criteria on statistical significance) that is found by F-Motif but not found by Motif-X if we do not remove the repeated sequences.
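The “Match/Total” and “Background match” columns can be reproduced with simple pattern matching. The sketch below assumes a motif is written with "." for unconstrained sites and must match a peptide of the same length position by position. It also assumes that sequences matching an accepted motif are removed before the next motif is evaluated, which is what the totals in the table suggest (for example, 306 − 98 = 208). All function names and peptides are hypothetical.

```python
import re

def motif_to_regex(motif):
    """Turn a motif such as 'R..S...' into an anchored regular expression."""
    body = "".join("." if ch == "." else re.escape(ch) for ch in motif)
    return re.compile("^" + body + "$")

def count_matches(motif, peptides):
    """Number of peptides that match the motif at every conserved site."""
    pattern = motif_to_regex(motif)
    return sum(1 for pep in peptides if pattern.match(pep))

def match_summary(motif, foreground, background):
    """Return (matches, total, background match percentage)."""
    fg = count_matches(motif, foreground)
    bg = count_matches(motif, background)
    return fg, len(foreground), 100.0 * bg / len(background)

def remove_matches(motif, peptides):
    """Keep only the peptides that do not match, i.e. the remaining data."""
    pattern = motif_to_regex(motif)
    return [pep for pep in peptides if not pattern.match(pep)]

# Toy 7-residue peptides, made up for illustration.
foreground = ["RRASLDE", "RKGSQQQ", "AAASPEE", "RLLSVAA"]
background = ["RAASAAA", "AAASAAA", "GGGSGGG", "KKKSKKK"]
print(match_summary("R..S...", foreground, background))  # (3, 4, 25.0)
print(remove_matches("R..S...", foreground))             # ['AAASPEE']
```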
Motifs identified by F-Motif and Motif-X using the four kinase-specific human foreground data sets (FH, one per kinase) and the human and all-species background data sets (BH, BA) with G = 15 and T = 15.
| Data set | Index | Motif | Match/Total | PCM hit frequency | Background match | Motif score |
| | 1 | | 58/187 | (50, 48)/50 | (0.58, 0.59)% | (32.00, 32.00) |
| | 2 | | 26/129 | (50, 50)/50 | (0.43, 0.41)% | (31.58, 31.92) |
| | 3 | | 45/103 | (50, 47)/50 | (4.91, 4.94)% | (16.00, 16.00) |
| | 4 | | 30/58 | (50, 50)/50 | (5.75, 5.75)% | (16.00, 16.00) |
| | 1 | | 62/209 | (50, 50)/50 | (5.19, 5.17)% | (16.00, 16.00) |
| | 2 | | 46/147 | (50, 50)/50 | (6.29, 6.31)% | (16.00, 16.00) |
| | 3 | | 37/101 | (50, 50)/50 | (5.79, 5.60)% | (16.00, 16.00) |
| | 4 | | 20/64 | (50, 50)/50 | (5.07, 5.11)% | (10.55, 10.49) |
| | 1 | | 27/177 | (6, 2)/50 | (0.66, 0.65)% | (32.00, 32.00) |
| | 2 | | 56/150 | (50, 50)/50 | (7.45, 7.44)% | (16.00, 16.00) |
| | 3 | | 53/94 | (49, 47)/50 | (5.85, 5.88)% | (16.00, 16.00) |
| | 1 | | 34/155 | (31, 34)/50 | (0.44, 0.45)% | (27.93, 28.36) |
| | *(1) | | 23/121 | (28, 24)/50 | (0.97, 1.02)% | (21.73, 21.53) |
| | 2 | | 89/98 | (44, 49)/50 | (6.78, 6.93)% | (16.00, 16.00) |
An asterisk in column 2 indicates a new motif that is found by F-Motif but not found by Motif-X. The information in other columns corresponds to F-Motif. The fourth column labeled “Match/Total” shows the number of times the associated motif appears in the present (remaining) foreground data. The fifth column, PCM hit frequency, gives the number of times, out of the fifty iterations, the associated motif is detected and it refers to the PCM encoding. The sixth column, Background match, displays the percentage of the present (remaining) background data that has matched with the associated motif.
Comparison of motifs discovered by F-Motif using three coding schemes for the human species foreground data set, FH, and the human species background data set (BH).
| Data set | Motif | Binary hit frequency | PWM hit frequency | PCM hit frequency |
| | | 100 | 100 | 100 |
| | | 39 | 88 | 88 |
| | | 98 | 99 | 99 |
| | | 97 | 87 | 94 |
| | | 81 | 100 | 100 |
| | | 41 | 100 | 100 |
| | | 75 | 96 | 92 |
| | | 81 | 98 | 94 |
| | | 79 | 63 | 99 |
| | | 70 | 29 | 71 |
| | | 70 | 81 | 70 |
| | Average hit frequency/motif | 75.55 | 85.55 | 91.55 |
| | | 97 | 28 | 28 |
| | Execution time | 1509.628s | 98.090s | 90.343s |
The process is repeated 100 times with G = 15 and T = 15 using data represented by three different encoding schemes: PCM, PWM, and binary coding. The results show that the final performance of F-Motif is practically independent of the encoding scheme, although the computation time and the size of the CML vary significantly with the choice of scheme.
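To make the binary scheme concrete, the sketch below assumes it is the usual one-hot encoding, in which each residue becomes a 20-dimensional indicator vector and a peptide is the concatenation of its residue vectors. The excerpt does not spell out the exact encoding, so this is an assumption, and the names `AMINO_ACIDS`, `binary_encode`, and `squared_distance` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def binary_encode(peptide):
    """One-hot encode a peptide: 20 bits per residue, concatenated."""
    vector = []
    for residue in peptide:
        bits = [0] * len(AMINO_ACIDS)
        if residue in INDEX:  # unknown symbols are left as all zeros
            bits[INDEX[residue]] = 1
        vector.extend(bits)
    return vector

def squared_distance(a, b):
    """Squared Euclidean distance, the quantity k-means minimises."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = binary_encode("RRASL")
b = binary_encode("RKASL")
print(len(a), sum(a))          # 100 5
print(squared_distance(a, b))  # 2: the peptides differ at exactly one site
```

Under this encoding the squared distance between two peptides is twice the number of positions at which they differ. The vector length also grows to 20 values per residue.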
The list of novel motifs that are discovered by F-Motif but not by Motif-X in Experiment 2.
| Foreg./Backg. data sets | Motif | Literature |
|
|
|
|
|
|
| MAPK substrate motif |
|
|
| Akt and PKA substrate motif |
|
|
| Classical PKA substrate motif |
|
| PKC substrate motif | |
|
| A possible PKC substrate motif | |
|
|
| Well established CK2 substrate motif |
|
|
| CDK substrate motif |
|
|
| |
|
| Phosphorylated by kinases GSK3 and CK1 rather than CDK |
The third column refers to literature that has discussed such motifs. There are some novel motifs for which we could not find any clue in the existing literature.