Norman E Davey, Denis C Shields, Richard J Edwards.
Abstract
Many important interactions of proteins are facilitated by short, linear motifs (SLiMs) within a protein's primary sequence. Our aim was to establish robust methods for discovering putative functional motifs. The strongest evidence for such motifs is obtained when the same motifs occur in unrelated proteins, evolving by convergence. In practice, searches for such motifs are often swamped by motifs shared in related proteins that are identical by descent. Predictions of motifs among sets of biologically related proteins, including those both with and without detectable similarity, were made using the TEIRESIAS algorithm. The number of motif occurrences arising through common evolutionary descent was normalized based on treatment of BLAST local alignments. Motifs were ranked according to a score derived from the product of the normalized number of occurrences and the information content. The method was shown to significantly outperform methods that do not discount evolutionary relatedness when applied to known SLiMs from a subset of the eukaryotic linear motif (ELM) database. An implementation of Multiple Spanning Tree weighting outperformed two other weighting schemes in a variety of settings.
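The ranking idea in the abstract (score as the product of the normalized number of occurrences and the information content) can be sketched as follows. This is only an illustration under stated assumptions: the information-content formula (the sum of -log2 of a background frequency over fixed positions, with wildcards contributing nothing), the `.` wildcard notation, the uniform background, the example support values and all function names are hypothetical and are not taken from the SLiMDisc implementation.

```python
import math

# Hypothetical uniform background over the 20 standard amino acids (an assumption).
BACKGROUND = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}

def information_content(motif, background=BACKGROUND):
    """Sum -log2(frequency) over fixed positions; '.' wildcards contribute 0."""
    return sum(-math.log2(background[aa]) for aa in motif if aa != ".")

def motif_score(motif, normalized_support, background=BACKGROUND):
    """Score = normalized number of occurrences x information content."""
    return normalized_support * information_content(motif, background)

# Hypothetical candidate motifs mapped to made-up normalized support values.
candidates = {"P.V.L": 3.0, "L..LL": 2.0, "RGD": 1.5}

# Rank candidate motifs by descending score.
ranked = sorted(candidates, key=lambda m: motif_score(m, candidates[m]), reverse=True)
```

With a real background distribution, rarer residues would contribute more information than common ones, so two motifs with equal support could still be ranked differently.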
Year: 2006 PMID: 16855291 PMCID: PMC1524906 DOI: 10.1093/nar/gkl486
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Simplified graphical representation of the SLiMDisc method. The steps completed by SLiMDisc are in green; those that occur outside the program are in red. The input dataset is given to the TEIRESIAS algorithm for pattern discovery and to the BLAST algorithm to establish the evolutionary relationships of the parent proteins. The returned motifs are then filtered according to a number of user-defined criteria. Finally, the motifs are ranked using information content (based on amino acid frequencies) and evolutionary relatedness.
Figure 2. Graphical representation of UHS and UP normalization techniques. Four proteins, labelled 1–4, are shown with annotated domains marked as coloured regions. Regions of homology as detected by BLAST are shown as grey boxes linking the sequences. Sequences 1 and 2 share a large homologous (orange) domain. Sequences 2 and 3 also share a homologous region but this is not annotated as a domain. Three other domains are specific to proteins 1 (green), 3 (blue) and 4 (purple). All motifs a–f have three occurrences in the dataset but have different support (shown in the table on the right) after filtering.
a. Motif a occurs in a shared region between 1 and 2, which is reduced by UHS to a single occurrence. The third occurrence in sequence 3 is not in a region homologous to 1 or 2 and is treated as a separate occurrence by UHS. However, proteins 2 and 3 share a homologous region and so UP will cluster sequences 1, 2 and 3, reducing the number of occurrences to 1. Filtering domains reduces the support to 1 in either case.
b. Motif b occurs in a shared region between 2 and 3, which is reduced by both UHS and UP to a single occurrence. This time, the third occurrence lies in the totally unrelated protein 4 and is counted with either filter. Filtering domains removes the occurrence in 4, reducing the support to 1.
c. Motif c lies purely within a repeated domain in protein 3. This is reduced to a single occurrence by both UHS and UP (the protein is homologous with itself). Although whole-protein self-hits are ignored by UHS, the additional local BLAST hits between different domains (shown in grey) will still cause motif c to be filtered by UHS. Domain filtering removes it completely.
d. Motif d is the same as motif b, except that none of the occurrences lie in domains and so domain filtering makes no difference.
e. Motif e lies in non-homologous regions of proteins 1 and 4. UHS therefore keeps all three occurrences. Whole-protein self-hits are ignored during the UHS filtering, and so both occurrences of motif e in protein 4 are counted. In contrast, UP clusters sequence 4 with itself and reduces the support to 2. No occurrences lie in domains and so domain filtering makes no difference.
f. Motif f is found in proteins 1, 3 and 4. None of these regions are homologous and so UHS gives a support of 3. UP, however, will group proteins 1 and 3; even though they do not directly share homology, they both share homology with common protein 2. UP therefore reduces the support to 2.
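The UP behaviour described in this caption, where proteins linked by any chain of homology are clustered and each cluster is counted once, can be illustrated with a small sketch. It is not the SLiMDisc code: the union-find approach and the names `cluster_proteins` and `up_support` are assumptions made for illustration. The homology pairs and motif occurrences are taken from the Figure 2 example.

```python
def cluster_proteins(proteins, homologous_pairs):
    """Group proteins into clusters linked by any chain of homology (union-find)."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in homologous_pairs:
        parent[find(a)] = find(b)
    return {p: find(p) for p in proteins}

def up_support(occurrence_proteins, cluster_of):
    """Count the distinct clusters containing at least one motif occurrence."""
    return len({cluster_of[p] for p in occurrence_proteins})

# Homology from the Figure 2 example: 1-2 and 2-3 are linked; 4 is unrelated.
cluster_of = cluster_proteins([1, 2, 3, 4], [(1, 2), (2, 3)])

support_a = up_support([1, 2, 3], cluster_of)  # motif a: proteins 1, 2, 3
support_b = up_support([2, 3, 4], cluster_of)  # motif b: proteins 2, 3, 4
support_e = up_support([1, 4, 4], cluster_of)  # motif e: protein 1, twice in protein 4
support_f = up_support([1, 3, 4], cluster_of)  # motif f: proteins 1, 3, 4
```

Under these assumptions the sketch reproduces the UP supports stated in the caption: 1 for motif a and 2 for motifs b, e and f. UHS is not modelled here because it depends on the positions of the local alignments rather than on whole-protein clusters.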
Figure 3. Scattergram of the information content versus the score for the KDEL-retrieving motif (see Table 2). Each blue point on the scattergram is a motif that has been considered by SLiMDisc. The points in green are the top three motifs ranked by the method. The actual SLiM for this dataset is the motif described by the regular expression [KRHQSAP][DENQT]EL.
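The regular expression quoted in the Figure 3 caption can be applied directly with a standard regex engine. The sketch below uses Python's `re` module. The helper name `find_motif` and the toy sequence are hypothetical and do not correspond to a real protein.

```python
import re

# Regular expression for the motif as quoted in the Figure 3 caption.
KDEL_PATTERN = re.compile(r"[KRHQSAP][DENQT]EL")

def find_motif(sequence, pattern=KDEL_PATTERN):
    """Return (start, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]

# Hypothetical toy sequence, not a real protein.
hits = find_motif("MAGTKDELLSHDEL")
```

Here `hits` contains two matches, `KDEL` at position 4 and `HDEL` at position 10 (0-based), since both satisfy the two ambiguous positions followed by the fixed `EL`.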
Summary performance of the different normalization techniques on the ELM benchmark dataset
| Ambiguity | Domain filter | MST rank | MST support % | UHS rank | UHS support % | UP rank | UP support % | TEIRESIAS rank | TEIRESIAS support % |
|---|---|---|---|---|---|---|---|---|---|
| No | No | 18.06 | 73.9 | 24.83 | 69.8 | 50.56 | 51.5 | 62.17 | 64.3 |
| No | Yes | 16.61 | 70.5 | 20.83 | 66.4 | 49.83 | 48.1 | 66.61 | 59.6 |
| Yes | No | 41.61 | 70.9 | 56.56 | 69.6 | 69.11 | 57.9 | 84.22 | 52.6 |
| Yes | Yes | 28.39 | 72.5 | 48.61 | 64.0 | 53.56 | 56.6 | 57.33 | 62.6 |
Comparison of the three normalization techniques and the TEIRESIAS algorithm on the ELM benchmark dataset, with and without domain filtering and ambiguity. Rank is the average rank of the motif matching the regular expression given in the ELM database; support % is the average percentage support of the top-ranked pattern. Rank is calculated using the arbitrary value of 200 when the motif of interest is not found in the top 100 motifs returned. Support for each ELM is the percentage of proteins in the dataset containing the returned ELM.
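The rank averaging described in this caption, where a motif missing from the top 100 is assigned the arbitrary value of 200, can be sketched as below. The function name, the use of `None` for a motif that was not returned, and the example ranks are assumptions for illustration only.

```python
def average_rank(ranks, cutoff=100, penalty=200):
    """Mean rank, substituting `penalty` when a motif is missing (None) or ranked below `cutoff`."""
    adjusted = [r if r is not None and r <= cutoff else penalty for r in ranks]
    return sum(adjusted) / len(adjusted)

# Hypothetical ranks for four datasets; None means the motif was not returned.
example = average_rank([1, 6, None, 150])
```

In this made-up example the two misses are each counted as 200, giving (1 + 6 + 200 + 200) / 4 = 101.75. A few misses therefore dominate the average, which is worth bearing in mind when comparing the rank columns in the table.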
Comparison of methods based on results from the ELM benchmark dataset
| ELM | TEIRESIAS | SLiMDisc | LMD | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Initial motifs | Rank | Motif | Support | Rank | Motif | Support | Initial motifs | Rank | Motif | Support | ||
| 2232 | 225 | |||||||||||
| 11 130 | 6 | SxSxP | 6/6 | 1656 | 6 | RSxSxE | 7/12 | |||||
| 17 339 | 392 | |||||||||||
| 53 613 | 80 | LxDL | 12/15 | 778 | ||||||||
| 157 438 | 53 | PxDL | 20/26 | 26 892 | ||||||||
| 100 074 | — | — | 0/22 | 13 179 | 15 | KRRL | 3/19 | |||||
| 1263 | 117 | |||||||||||
| 10 035 | 2 | PxVxL | 6/6 | 5287 | 4 | KVPxVxL | 3/7 | |||||
| 68 944 | 10 | LxxLL | 8/9 | 27 874 | — | — | 0/18 | |||||
| 64 978 | — | — | 0/13 | 36 | QxxLxxF | 9/13 | 1505 | |||||
| 131 943 | 24 581 | |||||||||||
| 8351 | 25 | RGD | 9/15 | 154 | 2 | R.DV | 3/8 | |||||
| 14 380 | 1 | 1406 | ||||||||||
| 9294 | 18 | QxPxE | 7/8 | 2 | QxPxE | 7/8 | 121 | |||||
| 1982 | — | — | 0/4 | — | — | 0/4 | 53 | — | — | 0/4 | ||
| 132 732 | — | — | 0/29 | 13 722 | — | — | 0/14 | |||||
| 17 236 | 8 | |||||||||||
| 10 201 | — | — | 0/10 | 27 | ExxxLL | 5/10 | 1471 | |||||
Results of the analysis of 18 datasets from the ELM database as proposed by Neduva et al. (5). The table compares the first ranked position, support and number of initial motifs for TEIRESIAS (scored using the product of support and information content), SLiMDisc (using default settings) and the LMD method [as described in the LMD paper (5)]. Results in bold are motifs for which the method returns the best rank or equal best rank for that ELM across the three methods.
Results from ELM-containing HPRD interaction datasets
| Hub protein | HPRD_id | ELM name (annotated motif) | % True annotated motifs | Returned motif (rank) |
|---|---|---|---|---|
| CtBP | 04015 | | 0.29 (9/31) | DLS (6) |
| Clathrin | 00350 | | 0.21 (5/24) | LxDL (2) |
| Peroxisome proliferator activated receptor gamma | 03288 | | 0.14 (3/21) | LxxLL (4) |
| Integrin Alpha 5 | 00627 | | 0.1 (2/21) | RGD (4) |
| Grb2 | 00150 | | 0.04 (6/159) | PxPP (3) |
| 14-3-3 Eta | 00215 | | 0.06 (2/31) | RSxS (4) |
| Ubiquitin conjugating enzyme E2I | 09045 | | 0.09 (4/43) | IKxE (8) |
Results for the seven HPRD interaction datasets that returned true annotated binding motifs in the top 10 ranks. A returned motif is defined as one that is found at the annotated positions of the known instances of the motif (including motifs that account for at least 2 of the residues involved in the ELM interaction). % True annotated motifs is a measure of the extent of anticipated noise in the dataset: in datasets with a low value, the ELM has been annotated in relatively few of the proteins.