Norman E Davey, Richard J Edwards, Denis C Shields.
Abstract
BACKGROUND: Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset, such that p-value estimates from different datasets, or from motifs containing different numbers of non-wildcard positions, are not strictly comparable. Here, we develop more exact methods and explore the potential biases of computationally efficient approximations.
Year: 2010 PMID: 20055997 PMCID: PMC2819990 DOI: 10.1186/1471-2105-11-14
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The four scoring schemes investigated in this study.
Cumulative binomial and binomial p-values for motif example described in Methods
| Motif (k) | p | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| k | 0 | 1 | 2 | 3 | 4 | 0 | 1 | 2 | 3 | 4 |
| 1 | 1 | 1 | 1 | 0.9948 | 0 | 0 | 0.0001 | 0.005 | 0.9948 | |
| 1 | 0.8515 | |||||||||
| 1 | 0.7789 | 0.3736 | 0.2211 | 0.4053 | 0.2787 | |||||
| 1 | 0.7789 | 0.3736 | 0.2211 | 0.4053 | 0.2787 | |||||
Incomplete beta and binomial show the cumulative binomial and binomial p-values, respectively, for each motif for all supports between 0 and N = 4. p1+ is the success probability of the motif considered. The value in italics indicates the highest scoring motif in the example described. The bold values indicate the values of k for which I(k, N, pm) ≤ 0.1485, the p-value of the highest ranking motif. The five right-hand columns are shown only to illustrate how the probabilities in the left-hand columns are calculated (the bold values in the right-hand columns sum to the values in the left-hand columns).
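As an illustration of the two quantities in this table, the following minimal sketch computes the binomial probability of exactly k occurrences and the cumulative probability of k or more occurrences in N proteins. It assumes that the cumulative value is the upper tail, which is consistent with the k = 0 entries of 1 in the table; the function names and the success probability `p_example` are hypothetical and are not taken from the paper.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def cumulative_binomial(k: int, n: int, p: float) -> float:
    """Probability of k or more successes in n trials (upper tail)."""
    return sum(binomial_pmf(j, n, p) for j in range(k, n + 1))


N = 4             # number of proteins, as in the table
p_example = 0.25  # hypothetical per-protein success probability

for k in range(N + 1):
    exact = binomial_pmf(k, N, p_example)
    tail = cumulative_binomial(k, N, p_example)
    print(f"k={k}  binomial={exact:.4f}  cumulative={tail:.4f}")
```

By construction the cumulative value at k = 0 is 1, and at k = N it equals the binomial value, which is the relationship between the left-hand and right-hand columns that the caption describes.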
Figure 1. Comparing the distribution of returned top ranking fixed position motifs to the uniform distribution for the 4 tested significance scoring methods. (a) Scatterplot of the root mean square error (RMSE) from the uniform distribution for the distribution of returned top ranking motifs, for each dataset size and motif length (nine combinations in total), versus the p-value of a Mann-Whitney test for rejection of the hypothesis that the significance values of the top ranking motifs were sampled from the uniform distribution. The top boxplot describes the Mann-Whitney p-values and the boxplot on the right describes the RMSE data. (b) Comparison of the 4 tested significance scoring methods for the probability of being sampled from the uniform distribution. The heatmap plots, for each dataset size (horizontal axis) and motif length (vertical axis), the p-value of a Mann-Whitney test for rejection of the hypothesis that the significance values of the top ranking motifs were sampled from the uniform distribution.
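The RMSE comparison in panel (a) can be sketched as follows. This is one plausible reading, not necessarily the calculation used in the paper, whose exact RMSE definition is not given here: the sorted significance values are compared with the expected quantiles of a uniform distribution on [0, 1]. The function name `rmse_from_uniform` and the sample values are hypothetical.

```python
from math import sqrt


def rmse_from_uniform(p_values):
    """RMSE between sorted values and uniform(0, 1) expected quantiles.

    Assumption: the i-th of n sorted values is compared with the
    midpoint quantile (i + 0.5) / n of the uniform distribution.
    """
    ordered = sorted(p_values)
    n = len(ordered)
    expected = [(i + 0.5) / n for i in range(n)]
    squared = sum((o - e) ** 2 for o, e in zip(ordered, expected))
    return sqrt(squared / n)


# Hypothetical significance values for a set of top ranking motifs.
sample = [0.02, 0.31, 0.48, 0.77, 0.93]
print(rmse_from_uniform(sample))
```

A value near 0 indicates that the scores behave like well-calibrated p-values; larger values indicate departure from uniformity.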
Figure 2. Test for the comparability of the significance scores, for the 4 tested significance scoring methods, between fixed position motifs of different length and datasets of different size. (a) Boxplot comparisons of the 4 tested significance scoring schemes, for each dataset size and motif length, of the distribution of the significance values of top ranking motifs. The first boxplot in each panel is the uniform distribution. (b) The heatmap describes the Mann-Whitney p-value for all-by-all comparisons of the 3 dataset sizes and 3 motif lengths. The Mann-Whitney test, in this case, tests for rejection of the hypothesis that the distributions of Sig values of top ranking motifs of different length and dataset size are sampled from the same distribution.
Redundancy of motif probabilities (see Equations 9-13).
| Motif length | Number of partitions | Number of motifs | Number of non-redundant motifs | Proportion of non-redundant motifs |
|---|---|---|---|---|
| 3 | 3 | 8000 | 1540 | 19.25% |
| 4 | 5 | 160000 | 8855 | 5.53% |
| 6 | 11 | 64000000 | 177100 | 0.27% |
Motif length is the length of the motif. Number of partitions is the number of solutions in the set M for motifs of length n. Number of motifs is the number of distinct motifs of length l where the order of the residues is important. Number of non-redundant motifs is the number of the non-redundant distinct motifs of length l where the order of the residues is not important. Proportion of non-redundant motifs is the percentage of motif calculations needed to calculate the true p-value Sig' distribution.
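The counts in this table can be reproduced with a short sketch. It rests on assumptions that match the tabulated values but are not stated in this excerpt (Equations 9-13 are not shown): a 20-letter amino acid alphabet, ordered motifs counted as 20^l, non-redundant motifs counted as multisets of l residues, and the number of partitions taken as the number of integer partitions of l. All names below are hypothetical.

```python
from math import comb

ALPHABET_SIZE = 20  # assumption: the 20 standard amino acids


def count_partitions(n: int) -> int:
    """Number of integer partitions of n (standard dynamic programme)."""
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


def redundancy_row(length: int):
    """Return (partitions, ordered motifs, unordered motifs, percentage)."""
    ordered = ALPHABET_SIZE**length
    # Multisets of `length` residues drawn from the alphabet.
    unordered = comb(ALPHABET_SIZE + length - 1, length)
    return count_partitions(length), ordered, unordered, 100.0 * unordered / ordered


for length in (3, 4, 6):
    parts, ordered, unordered, pct = redundancy_row(length)
    print(length, parts, ordered, unordered, f"{pct:.4f}%")
```

The proportion of non-redundant motifs falls quickly with motif length, which is why calculating probabilities only once per unordered residue combination makes the more exact calculation tractable.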
Comparison of the Sig and Sig' scoring for the top ranking motifs matching the known interaction motif.
| Dataset (a) | Sig' (b) | Sig (c) | ELM (d) | Motif (e) | k (N) (f) |
|---|---|---|---|---|---|
| LIG_CtBP | | | P.[DEN]L[VAST] | P[ILM]DL (1) | 15(30) |
| TRG_ER_KDEL_1 | | | [KRHQSAP][DENQT]EL$ | DE.$ (1) | 9(11) |
| LIG_PCNA | | | Q..[ILM]..[FHM][FHM] | Q..[IL]..FF (1) | 11(19) |
| MOD_SUMO | | | [VILAFP]K.[EDNGP] | V.VK.EP (1) | 4(29) |
| LIG_SH3_2 | | | P..P.[KR] | P.[LV]P.[KR] (1) | 5(7) |
| LIG_AP_GAE_1 | | | [DE][DES].F.[DE][LVIMFD] | D.F..F.S..P (1) | 3(7) |
| LIG_Dynein_DLC8_1 | | | [KR].TQT | K.TQ.P (1) | 3(7) |
| LIG_RGD | | | RGD | RGD (1) | 6(13) |
| LIG_CYCLIN_1 | | 0.012 | [RK].L.{0,1}[FYLIVMP] | RR.L.{0,1}F (1) | 4(18) |
| LIG_Clathr_ClatBox_1 | 0.011 | 0.054 | L[ILM].[ILMF][DE] | [FL].D[FLM] (1) | 8(14) |
| LIG_14-3-3_1 | 0.013 | 0.186 | R.[^P][ST][^P]P | R.R..S (1) | 4(4) |
| LIG_NRBOX | 0.014 | 0.082 | L..LL | L..LL.[ST] (2) | 5(8) |
| LIG_RB | 0.96 | 1.00 | [LI].C.[DE] | E.L.C.E (29) | 3(25) |
| LIG_14-3-3_3 | 0.95 | 1.00 | [RHK][STALV].[ST].[PESRDIF] | R[ST].S (13) | 7(7) |
| LIG_HP1_1 | | | P.V.[LM] | | 0(8) |
| MOD_N-GLC | | | N[^P][ST] | | 0(5) |
| TRG_LysEnd_APsAcLL_1 | | | [DER]...L[LVI] | | 0(10) |
(a) The ELM dataset used. (b) The Sig' score of the top ranking motif matching the known interaction motif. (c) The Sig score of the top ranking motif matching the known interaction motif. (d) The regular expression of the true functional motif. (e) The regular expression of the top ranking motif that matches the known ELM. Significant motifs (p < 0.01) are shown in bold. (f) The number of proteins in the dataset containing the variant of the motif discovered and the number of proteins in the dataset (in brackets)