Abstract
BACKGROUND: Compositionally biased (CB) regions are stretches of protein sequence composed mainly of a distinct subset of amino acid residues; such regions are frequently associated with a structural role in the cell or with protein disorder.
Year: 2006 PMID: 17032452 PMCID: PMC1618407 DOI: 10.1186/1471-2105-7-441
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Universally abundant compositional biases ***
| CB signature | Mean rank (*) | Times in top 10 (number of proteomes with the bias) (**) |
| {C} | 1.8 | 13 (13) |
| {P} | 2.5 | 13 (13) |
| {GP} | 5.0 | 12 (13) |
| {Q} | 6.5 | 11 (13) |
| {G} | 6.9 | 11 (13) |
| {E} | 8.8 | 11 (13) |
| {S} | 11.5 | 11 (13) |
| {ED} | 15.4 | 6 (12) |
| {H} | 23.7 | 1 (13) |
| {RS} | 26.8 | 1 (13) |
| {T} | 31.5 | 6 (13) |
| {A} | 32.2 | 3 (13) |
| {KE} | 34.9 | 0 (13) |
| {K} | 37.6 | 3 (13) |
| {SR} | 44.6 | 0 (13) |
| {QP} | 45.6 | 3 (13) |
| {R} | 52.5 | 1 (13) |
| {PA} | 53.9 | 0 (12) |
| {PG} | 56.8 | 3 (13) |
| {PM} | 56.9 | 0 (12) |
| {EQKL} | 61.2 | 1 (9) |
| {QH} | 65.9 | 2 (13) |
| {CD} | 68.2 | 1 (13) |
| {GR} | 69.5 | 0 (13) |
| {SP} | 71.8 | 0 (10) |
* Mean rank is simply calculated from averaging over rankings (in decreasing order of abundance) for all thirteen proteomes.
** Number of times the bias appears in the top 10 (with the number of proteomes this bias occurs in, in brackets).
*** The types of CB region have been ranked in increasing order of mean rank for the human proteome.
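The two summary columns described in the footnotes above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy proteomes `A` and `B` are hypothetical, ties are broken by input order, and the mean is taken only over the proteomes in which a bias occurs (the source does not say how absent biases were handled).

```python
from collections import defaultdict


def rank_biases(fractions):
    """Map each CB signature to its rank (1 = most abundant) in one proteome."""
    ordered = sorted(fractions, key=lambda sig: fractions[sig], reverse=True)
    return {sig: position + 1 for position, sig in enumerate(ordered)}


def summarize_ranks(proteomes, top_n=10):
    """Mean rank, top-N count and proteome count for every CB signature.

    Assumption: a signature missing from a proteome does not contribute
    to its mean rank.
    """
    ranks = defaultdict(list)
    for fractions in proteomes.values():
        for sig, rank in rank_biases(fractions).items():
            ranks[sig].append(rank)
    return {
        sig: {
            "mean_rank": sum(rs) / len(rs),
            "top_n": sum(1 for r in rs if r <= top_n),
            "n_proteomes": len(rs),
        }
        for sig, rs in ranks.items()
    }


# Hypothetical toy input: fraction of CB regions per signature, per proteome.
toy = {
    "A": {"{C}": 0.04, "{P}": 0.03, "{Q}": 0.01},
    "B": {"{P}": 0.05, "{C}": 0.02},
}
summary = summarize_ranks(toy, top_n=2)
for sig, row in sorted(summary.items(), key=lambda kv: kv[1]["mean_rank"]):
    print(sig, row["mean_rank"], f'{row["top_n"]} ({row["n_proteomes"]})')
```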
Top biases for the thirteen metazoan proteomes (*)
| {C} | 0.036 | {GP} | 0.042 | {C} | 0.039 | {C} | 0.039 | ||||||||
| {P} | 0.031 | {C} | 0.037 | {P} | 0.020 | {GP} | 0.023 | ||||||||
| {GP} | 0.024 | {P} | 0.020 | {GP} | 0.020 | {P} | 0.020 | ||||||||
| {Q} | 0.009 | {ED} | 0.009 | {Q} | 0.011 | {Q} | 0.013 | ||||||||
| {G} | 0.008 | {Q} | 0.009 | {ED} | 0.009 | {ED} | 0.009 | ||||||||
| {E} | 0.008 | {G} | 0.009 | {E} | 0.008 | {KE} | 0.006 | ||||||||
| {S} | 0.008 | {S} | 0.007 | {PQ} | 0.005 | {E} | 0.005 | ||||||||
| {ED} | 0.007 | {E} | 0.007 | {CG} | 0.005 | {RS} | 0.005 | ||||||||
| {PG} | 0.007 | {QP} | 0.007 | {PG} | 0.004 | {S} | 0.004 | ||||||||
| {QP} | 0.006 | {PG} | 0.006 | {G} | 0.004 | {PG} | 0.004 | ||||||||
| {C} | 0.056 | {HT} | 0.099 | {C} | 0.034 | {C} | 0.042 | {C} | 0.035 | {C} | 0.052 | {Q} | 0.070 | {GP} | 0.035 |
| {GP} | 0.048 | {CV} | 0.081 | {GP} | 0.032 | {GP} | 0.038 | {GP} | 0.014 | {GP} | 0.030 | {QH} | 0.055 | {C} | 0.030 |
| {P} | 0.019 | {GP} | 0.048 | {P} | 0.016 | {P} | 0.017 | {T} | 0.012 | {P} | 0.016 | {C} | 0.020 | {T} | 0.021 |
| {EKQL} | 0.008 | {C} | 0.046 | {CV} | 0.014 | {T} | 0.010 | {Q} | 0.012 | {F} | 0.010 | {T} | 0.014 | {Q} | 0.012 |
| {Q} | 0.007 | {P} | 0.016 | {HT} | 0.013 | {ED} | 0.010 | {QH} | 0.009 | {R} | 0.007 | {N} | 0.011 | {KED} | 0.010 |
| {S} | 0.006 | {Q} | 0.009 | {Q} | 0.007 | {G} | 0.008 | {G} | 0.009 | {G} | 0.007 | {H} | 0.009 | {QC} | 0.009 |
| {EKQ} | 0.006 | {S} | 0.009 | {PS} | 0.007 | {Q} | 0.007 | {RDE} | 0.007 | {FIY} | 0.007 | {S} | 0.007 | {ED} | 0.009 |
| {RS} | 0.005 | {CD} | 0.008 | {ED} | 0.007 | {S} | 0.007 | {PIE} | 0.007 | {CN} | 0.007 | {G} | 0.006 | {KE} | 0.008 |
| {QP} | 0.005 | {E} | 0.008 | {E} | 0.006 | {RS} | 0.005 | {P} | 0.005 | {EKAQLRND} | 0.006 | {QPH} | 0.006 | {PG} | 0.007 |
| {EQKL} | 0.005 | {ED} | 0.007 | {PG} | 0.006 | {HC} | 0.005 | {RS} | 0.005 | {A} | 0.005 | {P} | 0.006 | {RS} | 0.007 |
(*) The proteomes are given an abbreviation derived from the Latin name of the organism, i.e., Hsap for human, Rnor for rat, Drer for zebrafish, etc. The Total Number of CB regions is listed at the bottom for each proteome. For each bias (denoted by a CB signature), the fraction of the total number of CB regions is given; the regions are listed in decreasing order of abundance.
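As an illustration of how each table entry could be derived, the sketch below counts CB signatures in one proteome and reports each as a fraction of the total number of CB regions, in decreasing order of abundance. The function name and the toy signature list are hypothetical; the source does not describe its implementation.

```python
from collections import Counter


def bias_fractions(signatures, top_n=10):
    """Return the top_n (signature, fraction of all CB regions) pairs,
    sorted by decreasing abundance."""
    counts = Counter(signatures)
    total = sum(counts.values())
    return [(sig, n / total) for sig, n in counts.most_common(top_n)]


# Hypothetical toy input: one CB signature per assigned region.
regions = ["{C}"] * 5 + ["{P}"] * 3 + ["{GP}"] * 2
for sig, fraction in bias_fractions(regions):
    print(f"{sig}\t{fraction:.3f}")
```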
Figure 1. Number of bias residue types per CB region in the human proteome. The number of bias residue types per CB region is binned in a bar chart (x-axis). The total number of occurrences for each 'number of bias residue types' is on the y-axis.
Figure 7. Examples of assigned CB regions. In each case, the name of the protein, its current Ensembl identifier, its CB signature and Pmin value are indicated. The CB region is in bold and underlined; the rest of the sequence is in plain text. The proteins are as follows: (A) leukosialin from the human proteome, (B) an unnamed fruitfly protein and (C) an unnamed chicken protein.
Most abundant CB regions in Human and their significant functional associations and predicted protein disorder (*)
| 273 | 0.61 | GO:0005737 [55]; cytoplasm (3 × 10^-23) | |
| GO:0007155 [28]; cell adhesion (4 × 10^-10) | |||
| 183 | 0.00 | GO:0005515 [37]; protein-binding (2 × 10^-4) | |
| GO:0005509 [35]; calcium-ion binding (3 × 10^-15) | |||
| GO:0007155 [29]; cell adhesion (5 × 10^-16) | |||
| GO:0005198 [27]; structural-molecule activity (4 × 10^-16) | |||
| GO:0046872 [21]; metal-ion binding (4 × 10^-3) | |||
| GO:0005578 [16]; extracellular matrix (sensu Metazoa) (8 × 10^-9) | |||
| 116 | 0.76 | GO:0005737 [65]; cytoplasm (2 × 10^-61) | |
| GO:0007155 [27]; cell adhesion (2 × 10^-19) | |||
| GO:0005198 [12]; structural-molecule activity (2 × 10^-3) | |||
| GO:0005578 [9]; extracellular matrix (sensu Metazoa) (8 × 10^-3) | |||
| 77 | 0.41 | GO:0005634 [34]; nucleus (3 × 10^-8) | |
| GO:0006355 [21]; DNA-dependent regulation of transcription (1 × 10^-4) | |||
| 74 | 0.71 | ||
| 70 | 0.40 | GO:0005634 [24]; nucleus (2 × 10^-2) | |
| 69 | 0.49 | GO:0005198 [10]; structural-molecule activity (1 × 10^-3) | |
| 33 | 0.56 | GO:0005634 [24]; nucleus (1 × 10^-11) | |
| GO:0006355 [14]; DNA-dependent regulation of transcription (6 × 10^-5) | |||
| 32 | 0.75 | ||
| 30 | 0.52 |
(*) – The CB regions are sorted in decreasing order of abundance. They are denoted by their CB signatures in column #1. Column #2 contains the total number of members in a particular cluster; column #3 is the mean value of D, the disorder fraction of each member; column #4 lists the top five significantly associated (P' ≤ 0.05, adjusted for multiple hypothesis testing) GO (Gene Ontology) terms for the cluster in the format: GO term name [count for GO term]; description of GO term in words. In addition, bias types that are significantly associated with transcription (where we reduced GO categories to just two categories, 'transcription-related' and 'non-transcription-related') are labelled with a † sign.
Top Ten Biases for Fruitfly, and their significant functional associations and protein disorder values (*)
| 274 | 0.45 | GO:0005634 [78]; nucleus (2 × 10^-14) | |
| GO:0006357 [53]; regulation of transcription from RNA polymerase II promoter (1 × 10^-16) | |||
| GO:0003700 [44]; transcription factor activity (3 × 10^-12) | |||
| GO:0003677 [37]; DNA binding (7 × 10^-6) | |||
| GO:0005515 [33]; protein binding (8 × 10^-3) | |||
| GO:0003704 [20]; specific RNA polymerase II transcription factor activity (2 × 10^-5) | |||
| 187 | 0.81 | GO:0005634 [75]; nucleus (2 × 10^-24) | |
| GO:0003700 [52]; transcription factor activity (3 × 10^-27) | |||
| GO:0006357 [45]; regulation of transcription from RNA polymerase II promoter (9 × 10^-18) | |||
| GO:0008270 [36]; Zn-ion binding (9 × 10^-12) | |||
| GO:0003677 [35]; DNA binding (1 × 10^-9) | |||
| GO:0006355 [30]; DNA-dependent regulation of transcription (3 × 10^-13) | |||
| GO:0003702 [29]; RNA polymerase II transcription factor activity (4 × 10^-16) | |||
| GO:0045449 [26]; regulation of transcription (6 × 10^-15) | |||
| GO:0007498 [22]; mesoderm development (1 × 10^-7) | |||
| 70 | 0.00 | GO:0005198 [25]; structural molecule activity (2 × 10^-27) | |
| GO:0007165 [23]; signal transduction (5 × 10^-14) | |||
| GO:0016337 [19]; cell-cell adhesion (1 × 10^-19) | |||
| GO:0005886 [14]; plasma membrane (1 × 10^-3) | |||
| GO:0005102 [14]; receptor binding (6 × 10^-9) | |||
| 62 | 0.13 | ||
| 61 | 0.28 | ||
| 58 | 0.45 | GO:0005634 [19]; nucleus (3 × 10^-2) | |
| GO:0003729 [16]; mRNA binding (1 × 10^-7) | |||
| GO:0003723 [16]; RNA binding (1 × 10^-11) | |||
| 50 | 0.14 | ||
| 44 | 0.42 | ||
| 38 | 0.25 | ||
| 38 | 0.21 | GO:0005634 [16]; nucleus (3 × 10^-3) | |
| GO:0006357 [14]; regulation of transcription from RNA polymerase II promoter (3 × 10^-6) | |||
| GO:0006333 [11]; chromatin (dis)assembly (4 × 10^-10) | |||
| GO:0003700 [10]; transcription factor activity (2 × 10^-2) |
(*) – The CB regions are sorted in decreasing order of abundance. They are denoted by their CB signatures in column #1. Column #2 contains the total number of members in a particular cluster; column #3 is the mean value of D, the disorder fraction of each member; column #4 lists the top five significantly associated (P' ≤ 0.05) GO (Gene Ontology) terms for the cluster in the format: GO term name [count for GO term]; description of GO term in words. In addition, bias types that are significantly associated with transcription (where we reduced GO categories to just two categories, 'transcription-related' and 'non-transcription-related') are labelled with a † sign.
Most abundant GO terms for {Q(X)n} CB regions in the fruitfly, mouse, rat and human proteomes *
| 245 | 38 | 78 | 114 | ||||
| 152 | 28 | 49 | 68 | ||||
| 137 | 15 | 36 | GO:0005515 ; protein-binding | 51 | GO:0008270 ; Zinc ion binding | ||
| 125 | 6 | GO:0004871 ; signal transducer activity | 31 | 39 | |||
| 99 | GO:0005515 ; protein-binding | 4 | GO:0030216 ; keratinocyte differentiation | 28 | GO:0008270 ; Zinc ion binding | 35 | |
| 92 | GO:0008270 ; Zinc ion binding | 4 | GO:0001533 ; cornified envelope | 25 | 24 | ||
| 78 | 21 | GO:0005737 ; cytoplasm | 21 | GO:0046872 ; metal-ion binding | |||
| 62 | GO:0005737 ; cytoplasm | 12 | 20 | ||||
| 61 | GO:0007498 ; mesoderm development | 17 | |||||
| 59 | 9 | 11 | |||||
| 57 | 5 | 11 | GO:0004871 ; signal transducer activity | ||||
| 53 | 5 | 10 | |||||
| 47 | GO:0009993 ; oogenesis (sensu Insecta) | 8 | |||||
| 47 | GO:0007398 ; ectoderm development | 6 | |||||
| 47 | |||||||
| 43 | |||||||
| 41 | GO:0008283 ; cell proliferation | ||||||
| 36 | GO:0003779 ; actin binding | ||||||
| 32 | GO:0007476 ; wing morphogenesis | ||||||
| 30 | GO:0007242 ; intracellular signaling cascade | ||||||
* GO terms common to all four organisms are in bold. Other terms directly associated with 'transcription' or 'nucleic acids' are in italics.
** Number of occurrences of each GO term. For each proteome, the GO terms are sorted in decreasing order of abundance.
Associated SCOP domains for {Q(X)n} regions in Human and Fruitfly (*)
| 53 | 32 | 14 | g.50.1, FYVE/PHD Zinc finger | 14 | |||
| g.39, glucocorticoid receptor-like (DNA-binding domain) | 17 | g.39.1, glucocorticoid receptor-like (DNA-binding domain) | 17 | g.50, FYVE/PHD Zinc finger | 11 | a.40.1, calponin homology (CH) domain | 12 |
| b.1, Ig-like sandwich | 16 | a.4.5, winged-helix DNA-binding domain | 16 | a.40, CH-domain -like | 9 | d.211.2, plakin repeat | 10 |
| 14 | 14 | d.211, beta-hairpin-alpha-hairpin repeat (ankyrin & plakin) | 8 | 10 | |||
| b.34, SH3-like barrel | 12 | a.123.1, nuclear-receptor ligand-binding domain | 12 | 6 | 10 | ||
| a.123, nuclear-receptor ligand-binding domain | 12 | ||||||
(*) For each proteome, the most common SCOP domains [18] are listed in decreasing order of abundance. Those that occur in both the Human and Fruitfly lists are highlighted in bold.
Figure 2. Distribution of lengths of {QH} regions in D. melanogaster. There are two histograms: the overall distribution (red bars), and the nuclear- or transcription-related proteins (blue bars). The nuclear- and transcription-related proteins have been compiled by grouping together all proteins that have been assigned one of the GO terms that has been adjudged transcription-related (see main text for details).
Figure 3. A 'blow-up' of the overall distribution of {QH} region lengths. The {QH} regions are listed horizontally in order of increasing length; Q residues are coloured red and H residues green, with other residues in black.
Conservation of {Q(X)n} and {E(X)n} biased regions (*)
| Conservation♦ | Total Number | Human ♦ Mouse | Human ♦ Rat | Human ♦ Chicken | Human ♦ C.elegans | Human ♦ Fruitfly | Fruitfly ♦ Human | ||||||
| Human bias regions | With CB region | W/o CB region | With CB region | W/o CB region | With CB region | W/o CB region | With CB region | W/o CB region | With CB region | W/o CB region | With CB region | W/o CB region | |
| All Q-rich regions {Q(X)n} | 350 | 255/326 (78%) | 97/140 (69%) | 245/315 (78%) | 100/135 (74%) | 184/281 (65%) | 100/160 (63%) | 46/115 (40%) | 3/18 (17%) | 79/255 (31%) | 12/36 (33%) | 79/246 (32%) | 12/13 (92%) |
| {Q}, {QH} and {QPH} regions | 139 | 73/109 (67%) | 41/66 (62%) | 61/100 (61%) | 38/64 (59%) | 30/93 (32%) | 25/81 (31%) | 1/30 (3%) | 0/1 (0%) | 11/53 (21%) | 0/12 (0%) | 16/80 (20%) | 0/11 (0%) |
| All E-/D-rich regions {E/D(X)n} | 298 | 194/268 (72%) | 107/169 (63%) | 184/264 (70%) | 96/155 (62%) | 125/219 (57%) | 72/152 (47%) | 50/105 (48%) | 14/46 (30%) | 66/244 (27%) | 17/53 (32%) | 66/130 (51%) | 13/28 (46%) |
| {E} and {ED} regions | 102 | 55/89 (62%) | 41/62 (66%) | 53/89 (60%) | 33/59 (56%) | 13/62 (21%) | 26/83 (31%) | 3/32 (9%) | 1/17 (6%) | 5/40 (13%) | 0/12 (0%) | 3/49 (6%) | 0/16 (0%) |
(*) For each bias grouping, the total number of regions is listed, followed by the (total number conserved with the bias region/total number conserved) (percentage in brackets) for each of the proteomes: mouse, rat, chicken, C. elegans and Drosophila (fruitfly), relative to Human. For Drosophila, the 'reverse' conservation is also listed.
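The percentage in each cell follows directly from the two counts given with it. The sketch below reproduces a few cells from the table; the function names are illustrative, and rounding halves upward is an assumption chosen because it matches the printed value of 63% for 100/160.

```python
def conservation_percent(conserved_with_bias, total_conserved):
    """Percentage of conserved regions that retain the bias in the ortholog."""
    if total_conserved <= 0:
        raise ValueError("total_conserved must be positive")
    return 100.0 * conserved_with_bias / total_conserved


def round_half_up(value):
    """Round a non-negative number to the nearest integer, halves upward.

    Python's built-in round() rounds halves to the nearest even integer,
    which would turn 62.5 into 62 rather than 63.
    """
    return int(value + 0.5)


# Cells taken from the 'All Q-rich regions' row of the table above.
cells = {
    "Human vs Mouse, with CB region": (255, 326),
    "Human vs Chicken, w/o CB region": (100, 160),
    "Fruitfly vs Human, w/o CB region": (12, 13),
}
for label, (with_bias, total) in cells.items():
    percent = round_half_up(conservation_percent(with_bias, total))
    print(f"{label}: {with_bias}/{total} ({percent}%)")
```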
Figure 4. Example of conservation of a {Q} region in vertebrates: FOXP2 and its orthologs. A multiple alignment is shown for FOXP2 and its orthologs in other vertebrates, made using the MUSCLE program [21]; the {Q} region is highlighted in red if its P-value was high enough to be included in the present analysis; otherwise, it is highlighted in green.
Figure 5. The fraction of predicted disorder (denoted D in the text) is binned as a bar chart for both the human and fruitfly proteomes. The bin p-q contains all values D such that p ≤
Figure 6. Plot of the D value versus the length of a CB region for the human proteome.
Cellular compartments for protein with CB regions with different D values (*)
| Nucleus | |||||||
| Cytoplasm | |||||||
| GO:0016020 | 238/4618 ||||||
| Membrane | |||||||
| GO:0005634 | 137/980 | ||||||
| Nucleus | Nucleus | Nucleus | Nucleus | ||||
| GO:0005737 | 38/980 | GO:0005737 | 35/867 | GO:0005737 | 37/758 | ||
| Cytoplasm | Cytoplasm | Cytoplasm | Cytoplasm | ||||
| GO:0016020 | 68/980 | GO:0016020 | 29/867 | GO:0016020 | 31/758 | GO:0016020 | 34/948 |
| Membrane | Membrane | Membrane | Membrane | ||||
| Nucleus | |||||||
| GO:0005737 | 141/2972 | ||||||
| Cytoplasm | |||||||
| GO:0016020 | 49/2972 | ||||||
| Membrane | |||||||
| Nucleus | Nucleus | Nucleus | Nucleus | ||||
| GO:0005737 | 20/372 | GO:0005737 | 19/556 | GO:0005737 | 38/678 | ||
| Cytoplasm | Cytoplasm | Cytoplasm | Cytoplasm | ||||
| GO:0016020 | 6/372 | GO:0016020 | 12/556 | GO:0016020 | 7/678 | GO:0016020 | 16/1143 |
(*) The numbers of CB regions (overall, and for four categories split up according to D value) that have the GO term annotation for Nucleus, Cytoplasm and Membrane are counted up.
† Significant overrepresentation using binomial statistics (P' < 0.05), corrected for multiple hypothesis testing over cellular compartment GO terms. P' values are indicated in brackets, rounded up to the nearest power of ten.
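An over-representation test of this kind can be sketched as a one-sided binomial tail probability followed by a multiple-testing correction. The sketch below is hedged throughout: the source does not name the correction, so Bonferroni is an assumption here, as are the choice of the overall membrane fraction (238/4618) as the background rate, the use of three compartment terms as the number of tests, and all function names.

```python
from math import exp, lgamma, log


def log_binom_pmf(i, n, p):
    """Log of the binomial probability of exactly i successes in n trials."""
    log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return log_choose + i * log(p) + (n - i) * log(1 - p)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    tail = sum(exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))
    return min(1.0, tail)


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted P' (an assumed correction, not confirmed by the source)."""
    return min(1.0, p_value * n_tests)


# Counts from the table above: 68 of 980 CB regions in one D category are
# annotated 'Membrane', against 238 of 4618 overall (assumed background).
background = 238 / 4618
p_raw = binomial_upper_tail(68, 980, background)
p_adj = bonferroni(p_raw, n_tests=3)  # Nucleus, Cytoplasm, Membrane
print(f"raw P = {p_raw:.3g}, adjusted P' = {p_adj:.3g}")
```

Summing in log space avoids the overflow that a direct product of large binomial coefficients and small powers would hit for category sizes in the thousands.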