Ahmed Sayadi, Leonardo Briganti, Anna Tramontano, Allegra Via.
Abstract
The function of proteins is often mediated by short linear segments of their amino acid sequence, called Short Linear Motifs or SLiMs, whose identification can provide important information about a protein's function. However, the short length of the motifs and their variable degree of conservation make them hard to identify, because it is difficult to estimate the statistical significance of their occurrence correctly. Consequently, only a small fraction of them have been discovered so far. We describe here an approach for the discovery of SLiMs based on their occurrence in evolutionarily unrelated proteins belonging to the same biological, signalling or metabolic pathway, and give specific examples of its effectiveness both in rediscovering known motifs and in discovering novel ones. An automatic implementation of the procedure, available for download, allows significant motifs to be identified, automatically annotated with functional, evolutionary and structural information, and organized in a database that can be inspected and queried. An instance of the database populated with pre-computed data on seven organisms is accessible through a publicly available server (http://www.biocomputing.it/modipath), and we believe it constitutes by itself a useful resource for the life sciences.
Year: 2011 PMID: 21799808 PMCID: PMC3140502 DOI: 10.1371/journal.pone.0022270
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Flowchart of the MoDiPath procedure.
NormIC is the CompariMotif [32] similarity score. The CompariMotif tool was used to find similarities between motifs automatically discovered by MoDiPath and motifs already annotated in other databases.
Number of motifs predicted in KEGG pathways.
| Species | Total (all) | Total (MP) | Total (NMP) | Significant SLiMs (all) | Significant SLiMs (MP) | Significant SLiMs (NMP) | Novel SLiMs (all) | Novel SLiMs (MP) | Novel SLiMs (NMP) |
|---|---|---|---|---|---|---|---|---|---|
| | 2097 | 836 | 1261 | 104 | 21 | 83 | 22 | 6 | 16 |
| | 2094 | 882 | 1212 | 127 | 38 | 89 | 28 | 12 | 16 |
| | 1863 | 809 | 1054 | 72 | 19 | 53 | 15 | 5 | 10 |
| | 1391 | 632 | 759 | 35 | 5 | 30 | 4 | 0 | 4 |
| | 1050 | 610 | 440 | 32 | 12 | 20 | 6 | 6 | 0 |
| | 933 | 733 | 200 | 11 | 10 | 1 | 2 | 1 | 1 |
| | 889 | 584 | 305 | 20 | 15 | 5 | 3 | 2 | 1 |
Total: total number of motifs predicted by SLiMFinder in KEGG pathways;
Significant SLiMs: number of motifs significantly over-represented in pathways with respect to the two reference datasets (hyper-geometric p-value < 3e-9, see Materials and Methods);
Novel SLiMs: number of significant motifs that are novel (hyper-geometric p-value < 3e-9, NormIC < 0.7). MP: Metabolic pathways; NMP: Non-Metabolic Pathways.
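The over-representation test behind the "Significant SLiMs" counts can be illustrated with a minimal sketch of a one-sided hyper-geometric p-value. The mapping of counts to parameters (reference proteins, reference proteins with the motif, pathway proteins, pathway proteins with the motif) is an assumption made here for illustration; the exact counting scheme is described in the paper's Materials and Methods and may differ. The function and argument names are hypothetical; only the 3e-9 threshold comes from the text.

```python
from math import comb

# Significance threshold quoted in the table footnotes.
P_VALUE_THRESHOLD = 3e-9


def hypergeom_upper_tail(k, population, successes, draws):
    """Return P(X >= k) for a hyper-geometric variable X.

    Assumed mapping (illustrative, not taken from the paper):
      population -- proteins in the reference dataset
      successes  -- reference proteins that contain the motif
      draws      -- proteins in the pathway
      k          -- pathway proteins that contain the motif
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    # math.comb(n, r) returns 0 when r > n, so impossible terms vanish.
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def is_significant(k, population, successes, draws, threshold=P_VALUE_THRESHOLD):
    """True when the motif is over-represented at the given threshold."""
    return hypergeom_upper_tail(k, population, successes, draws) < threshold
```

For example, `is_significant(12, 20000, 150, 40)` would ask whether 12 motif-containing proteins out of 40 pathway proteins is surprising given 150 motif-containing proteins in a reference set of 20000; these numbers are invented purely to show the call.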
Number of KEGG pathways (total and with motifs).
| Species | KEGG pathways (all) | KEGG pathways (MP) | KEGG pathways (NMP) | Pathways with SLiMs (all) | Pathways with SLiMs (MP) | Pathways with SLiMs (NMP) | Pathways with novel SLiMs (all) | Pathways with novel SLiMs (MP) | Pathways with novel SLiMs (NMP) |
|---|---|---|---|---|---|---|---|---|---|
| | 201 | 87 | 114 | 42 | 13 | 29 | 19 | 5 | 14 |
| | 198 | 87 | 111 | 50 | 17 | 33 | 18 | 7 | 11 |
| | 197 | 84 | 113 | 38 | 13 | 25 | 14 | 5 | 9 |
| | 118 | 84 | 34 | 9 | 4 | 5 | 3 | 0 | 3 |
| | 117 | 82 | 35 | 15 | 9 | 6 | 4 | 4 | 0 |
| | 105 | 90 | 15 | 8 | 7 | 1 | 2 | 1 | 1 |
| | 92 | 70 | 22 | 11 | 9 | 2 | 2 | 1 | 1 |
KEGG pathways: total number of KEGG pathways in each of the seven organisms under study;
Pathways with SLiMs: number of KEGG pathways for which at least one significant motif was found (hyper-geometric p-value < 3e-9, see Materials and Methods);
Pathways with novel SLiMs: number of KEGG pathways for which at least one statistically significant novel motif was found, i.e. a motif with no similarity to any known motif (hyper-geometric p-value < 3e-9, NormIC < 0.7). MP: Metabolic pathways; NMP: Non-Metabolic Pathways.
Number of motif representatives predicted in KEGG pathways.
| Species | Total (all) | Total (MP) | Total (NMP) | Significant SLiMs (all) | Significant SLiMs (MP) | Significant SLiMs (NMP) | Novel SLiMs (all) | Novel SLiMs (MP) | Novel SLiMs (NMP) |
|---|---|---|---|---|---|---|---|---|---|
| | 813 | 329 | 484 | 64 | 18 | 46 | 21 | 6 | 15 |
| | 803 | 384 | 419 | 58 | 20 | 38 | 22 | 10 | 12 |
| | 727 | 322 | 405 | 55 | 16 | 39 | 15 | 5 | 10 |
| | 616 | 378 | 238 | 14 | 5 | 9 | 4 | 0 | 4 |
| | 513 | 307 | 206 | 20 | 11 | 9 | 5 | 5 | 0 |
| | 465 | 378 | 87 | 7 | 6 | 1 | 2 | 1 | 1 |
| | 502 | 336 | 166 | 16 | 13 | 3 | 2 | 1 | 1 |
Total: total number of motif representatives predicted by SLiMFinder in KEGG pathways;
Significant SLiMs: number of motif representatives significantly over-represented in pathways with respect to the two reference datasets (hyper-geometric p-value < 3e-9, see Materials and Methods);
Novel SLiMs: number of significant motif representatives that are novel (hyper-geometric p-value < 3e-9, NormIC < 0.7). MP: Metabolic pathways; NMP: Non-Metabolic Pathways.
Figure 2. The crystal structure of the human granulocyte colony-stimulating factor (GCSF) receptor.
The structure of the GCSF receptor (PDB:2D9Q [42]) is shown in orange. Residues corresponding to the WS.WS motif (residues 295–299) are shown in blue.
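A motif such as WS.WS is written as a regular expression in which `.` stands for any single residue, so a hit spans five positions (consistent with residues 295–299 above). The following is a minimal sketch of how such a pattern could be located in a protein sequence with Python's standard `re` module; the function name and the toy sequences are hypothetical and are not taken from MoDiPath or from the GCSF receptor.

```python
import re


def find_motif_hits(sequence, motif="WS.WS"):
    """Return 1-based inclusive (start, end) positions of every motif hit.

    A lookahead is used so that overlapping hits are also reported,
    which a plain re.finditer over the motif would skip.
    """
    lookahead = re.compile(f"(?=({motif}))")
    return [(m.start(1) + 1, m.end(1)) for m in lookahead.finditer(sequence)]


# Toy sequence, invented for illustration only.
hits = find_motif_hits("AAWSEWSAA")
```

Reporting 1-based inclusive coordinates matches the way residue ranges are usually quoted in figure legends, which makes the hits directly comparable with structure numbering when the sequence and the structure share the same numbering scheme.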
Figure 3. The information provided by MoDiPath for the hsa04640 KEGG pathway.
(a) First column: the SLiM regular expression. Second column: a ‘+’ is reported if the motif overlaps with a similar motif in other databases (the list of which is shown by moving the mouse over the ‘+’). Third column: the hyper-geometric p-value of the number of motif hits in the hsa04640 pathway compared to the number of motif hits in the SwissProt database. Fourth column: the fraction of proteins in the hsa04640 pathway that contain the WS.WS motif. (b) Multiple sequence alignment of the hsa04640 pathway proteins containing the WS.WS motif. (c) Information about each of the hsa04640 proteins containing the WS.WS motif. Clicking on the ‘Show’ button provides more detailed information, including a visualization of the protein structure with the motif hit(s) highlighted. (d) List of motif overlap(s) with similar motifs in other databases; the last column reports the CompariMotif [32] similarity score (NormIC). (e) GO terms shared by the hsa04640 pathway proteins that have the motif; the last column reports the fraction of the proteins hosting the motif that share a GO term.
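Two of the quantities in this figure, the fraction of pathway proteins containing the motif (panel a) and the fraction of motif-hosting proteins that share a GO term (panel e), are simple ratios. The sketch below shows one way they could be computed; the data structures, function names, protein identifiers and term labels are all hypothetical placeholders, and this is not the MoDiPath implementation.

```python
import re
from collections import Counter


def motif_coverage(proteins, motif):
    """Return (carriers, fraction) for a dict of protein id -> sequence.

    carriers -- ids of proteins with at least one motif hit
    fraction -- carriers divided by all proteins in the pathway
    """
    pattern = re.compile(motif)
    carriers = [pid for pid, seq in proteins.items() if pattern.search(seq)]
    fraction = len(carriers) / len(proteins) if proteins else 0.0
    return carriers, fraction


def shared_go_fractions(carriers, go_annotations):
    """Map each GO term to the fraction of motif carriers annotated with it."""
    if not carriers:
        return {}
    counts = Counter(
        term for pid in carriers for term in set(go_annotations.get(pid, ()))
    )
    return {term: n / len(carriers) for term, n in counts.items()}


# Placeholder pathway and annotations, invented for illustration only.
pathway = {"p1": "AAWSEWSAA", "p2": "MKWSQWSLL", "p3": "MKLLV", "p4": "GGG"}
annotations = {"p1": ["term_a", "term_b"], "p2": ["term_a"]}
carriers, fraction = motif_coverage(pathway, "WS.WS")
go_fractions = shared_go_fractions(carriers, annotations)
```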
Figure 4. Hyper-geometric distribution of motif occurrences.
Hyper-geometric p-value distribution for the number of motif occurrences in true (black) and reshuffled (red) KEGG pathways with respect to the number of motif occurrences in the UniProt dataset for H. sapiens. A p-value of 3e-9 corresponds approximately to a false discovery rate of 10%.
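The link between a p-value cut-off and a false discovery rate can be illustrated with a simple empirical estimate: the number of reshuffled-pathway motifs passing the cut-off divided by the number of true-pathway motifs passing it. This is a common way to derive such a rate from a randomized control and is shown here only as a sketch; it assumes the two sets contain comparable numbers of tests, and the paper's exact procedure may differ. The function name is hypothetical.

```python
def empirical_fdr(true_pvalues, reshuffled_pvalues, threshold):
    """Estimate the false discovery rate at a p-value threshold.

    Assumes the reshuffled set is the same size as the true set, so the
    count of reshuffled hits approximates the expected false positives.
    """
    true_hits = sum(p < threshold for p in true_pvalues)
    false_hits = sum(p < threshold for p in reshuffled_pvalues)
    if true_hits == 0:
        # No discoveries at this threshold, hence no false discoveries.
        return 0.0
    return min(1.0, false_hits / true_hits)
```

Scanning `threshold` over a range of values and picking the largest one whose estimate stays at or below the target rate is one way a cut-off such as 3e-9 could be chosen.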