Paul M Harrison, Mark Gerstein.
Abstract
We have derived a novel method to assess compositional biases in biological sequences, which is based on finding the lowest-probability subsequences for a given residue-type set. As a case study, the distribution of prion-like glutamine/asparagine-rich ((Q+N)-rich) domains (which are linked to amyloidogenesis) was assessed for budding and fission yeasts and four other eukaryotes. We find more than 170 prion-like (Q+N)-rich regions in budding yeast, and, strikingly, many fewer in fission yeast. Also, some residues, such as tryptophan or isoleucine, are unlikely to form biased regions in any eukaryotic proteome.
Year: 2003 PMID: 12801414 PMCID: PMC193619 DOI: 10.1186/gb-2003-4-6-r40
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
The four prion sequences*
| Sequence | Notable single-residue bias counts for prion determinant domain (Pbias in brackets)† | Notable LPSs (for whole sequence)‡ |
| Prion sequence: Ure2p (YNL229C) | | |
| | N 27/65 (1.9 × 10⁻¹⁶) | N, 33 in 2 to 78 (1.2 × 10⁻¹⁶) |
| Prion sequence: Sup35p (YDR172W) | | |
| MSDSN | Q 35/123 (9.6 × 10⁻²¹) | Q, 39 in 5 to 134 (7.3 × 10⁻²⁴) |
| Prion sequence: Rnq1p (YCL028W) | | |
| MDTDKLISEAESHFSQGNHAEAVAKLTSAAQSNPNDEQMSTIES | Q 66/253 (4.2 × 10⁻³⁵) | Q, 67 in 152 to 401 (2.1 × 10⁻³⁶) |
| Prion sequence: New1p (YPL226W) | | |
| | N 26/153 (1.4 × 10⁻⁶) | N, 16 in 69 to 94 (9.8 × 10⁻¹⁴) |
*The prion determinant regions (found from experiment) are in bold; the LPSs for the whole protein sequence, for the most pronounced single-amino-acid bias, are underlined. †All biases with a Pbias ≤ 1 × 10⁻⁴ are listed for each prion sequence. ‡As examples, the counts for the sets of residues {DERK} and {VILM} that correspond to the {QN} lowest-probability sequence are listed for each prion.
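The bias probabilities and lowest-probability subsequences (LPSs) in this table can be illustrated with a minimal sketch. It assumes that Pbias is the binomial probability of seeing exactly k residues of the chosen set in a window of n residues, given a background frequency f; the page does not spell the formula out, so this is an assumption, as are the function names `log10_p_bias` and `lowest_probability_subsequence`. The search is a plain exhaustive scan over all windows and considers only windows enriched for the residue set.

```python
from math import lgamma, log


def log10_p_bias(k, n, f):
    """log10 of the binomial probability of exactly k hits in n positions.

    Assumption: this stands in for Pbias; working in log space avoids
    floating-point underflow for long, strongly biased windows.
    """
    ln_p = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(f) + (n - k) * log(1.0 - f))
    return ln_p / log(10.0)


def lowest_probability_subsequence(seq, residues, f):
    """Exhaustive O(L^2) search over all windows of seq.

    Returns (log10_p, start, end) with 1-based inclusive coordinates,
    or (0.0, 0, 0) if no window is enriched for the residue set.
    """
    # Prefix sums give the hit count of any window in constant time.
    hits = [0]
    for aa in seq:
        hits.append(hits[-1] + (aa in residues))

    best = (0.0, 0, 0)
    for start in range(len(seq)):
        for end in range(start + 1, len(seq) + 1):
            n = end - start
            k = hits[end] - hits[start]
            if k <= f * n:
                continue  # skip windows that are not enriched
            lp = log10_p_bias(k, n, f)
            if lp < best[0]:
                best = (lp, start + 1, end)
    return best


# Toy sequence (hypothetical): a run of ten N flanked by alanines.
toy = "AAAA" + "N" * 10 + "AAAA"
print(lowest_probability_subsequence(toy, "N", 0.061))
```

With the Ure2p counts from the table (27 N in 65 residues) and the budding-yeast N fraction of 0.061 quoted in the Q/N composition row further down, `10 ** log10_p_bias(27, 65, 0.061)` comes out on the order of 10⁻¹⁶, the same order as the tabulated value.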
Abundance of biased regions that have biases at the same level as the Q and N biases in the four budding-yeast prions
| Rank | Budding yeast | Fission yeast | Fruit fly | Nematode | Human | |||||||
| 1 | S | 108 [ | S | 74 | Q | 725 (20.7) | P | 494 | P | 345 | P | 549 |
| 2 | Q | 104 [ | P | 40 | G | 400 | G | 448 | E | 292 | E | 322 |
| 3 | N | 73 [ | E | 37 | P | 359 | E | 286 | G | 242 | C | 302 |
| 4 | E | 68 [ | T | 32 | S | 327 | Q | 270 (8.9) | Q | 153 (6.0) | G | 294 |
| 5 | T | 58 | Q | 17 (5.6) | A | 264 | C | 220 | C | 150 | S | 233 |
| 6 | P | 37 [ | G | 17 | H | 231 | T | 199 | S | 134 | K | 188 |
| 7 | D | 35 | K | 16 | E | 212 | K | 184 | D | 90 | Q | 176 (5.9) |
| 8 | K | 24 [ | A | 15 | T | 188 | S | 146 | K | 86 | A | 136 |
| 9 | G | 20 | C | 13 | K | 170 | R | 132 | R | 81 | R | 83 |
| 10 | A | 19 [ | R | 11 | N | 144 (4.1) | A | 102 | L | 56 | H | 80 |
| 11 | H | 8 | V | 6 | C | 144 | D | 55 | A | 56 | T | 59 |
| 12 | C | 8 | H | 6 | R | 118 | H | 46 | H | 47 | D | 49 |
| 13 | R | 6 | M | 5 | D | 74 | N | 40 (1.3) | Y | 28 | L | 31 |
| 14 | M | 5 | N | 4 (1.3) | L | 47 | F | 17 | N | 28 (1.1) | M | 23 |
| 15 | L | 5 | F | 4 | Y | 28 | Y | 16 | T | 17 | F | 21 |
| 16 | V | 3 | L | 3 | M | 24 | M | 15 | M | 11 | V | 18 |
| 17 | Y | 2 | D | 2 | V | 22 | L | 11 | V | 8 | N | 15 (0.6) |
| 18 | F | 2 | Y | 1 | F | 8 | V | 6 | W | 5 | Y | 13 |
| 19 | W | 0 | W | 0 | W | 5 | I | 5 | F | 3 | W | 11 |
| 20 | I | 0 | I | 0 | I | 5 | W | 1 | I | 1 | I | 10 |
| Total | 585 | Total | 303 | Total | 3,495 | Total | 2,693 | Total | 1,833 | Total | 2,613 | |
| Rank | Budding yeast | Fission yeast | Fruit fly | Nematode | Human | |||||||
| 1 | S | 10,630 | S | 9,035 | Q | 39,186 (16.3) | P | 31,917 | E | 23,229 | P | 44,427 |
| 2 | T | 5,900 | T | 5,805 | S | 31,936 | E | 31,216 | P | 21,124 | E | 27,352 |
| 3 | E | 4,704 | P | 2,887 | P | 29,345 | G | 28,192 | G | 13,462 | S | 26,363 |
| 4 | Q | 3,924 (10.4) | E | 2,657 | G | 24,320 | Q | 18,126 (8.9) | S | 10,313 | G | 22,131 |
| 5 | N | 3,745 (10.0) | A | 1,854 | E | 23,384 | T | 15,994 | L | 9,459 | C | 16,681 |
| 6 | P | 2,049 | G | 1,669 | A | 14,730 | S | 15,262 | C | 6,852 | K | 15,459 |
| 7 | K | 1,910 | C | 1,185 | K | 14,448 | C | 15,224 | Q | 6,835 (6.0) | Q | 12,156 (5.9) |
| 8 | D | 1,292 | Q | 1,107 (3.6) | T | 12,560 | K | 14,518 | K | 6,122 | A | 9,587 |
| 9 | G | 961 | L | 1,087 | C | 10,067 | A | 9,124 | R | 4,061 | T | 5,667 |
| 10 | A | 916 | V | 851 | L | 9,331 | R | 7,501 | A | 3,244 | L | 5,646 |
| 11 | L | 554 | K | 680 | R | 6,847 | D | 6,950 | D | 3,176 | R | 5,165 |
| 12 | C | 256 | N | 486 (1.6) | H | 6,302 | N | 2,606 (1.3) | Y | 2,315 | H | 3,189 |
| 13 | R | 204 | R | 425 | D | 5,695 | H | 2,361 | N | 1,259 (1.1) | V | 2,964 |
| 14 | H | 195 | F | 257 | N | 5,690 (2.4) | L | 1,352 | H | 1,044 | D | 2,085 |
| 15 | M | 163 | H | 238 | V | 2,651 | F | 827 | T | 697 | N | 1,714 (0.8) |
| 16 | F | 94 | M | 217 | Y | 1,179 | M | 746 | V | 549 | F | 1,433 |
| 17 | V | 90 | D | 127 | M | 915 | Y | 692 | M | 287 | M | 1,081 |
| 18 | Y | 33 | Y | 60 | I | 798 | V | 608 | F | 221 | I | 924 |
| 19 | W | 0 | I | 0 | F | 667 | I | 404 | W | 162 | Y | 617 |
| 20 | I | 0 | W | 0 | W | 147 | W | 42 | I | 16 | W | 541 |
| Total | 37,620 | Total | 30,627 | Total | 240,198 | Total | 203,662 | Total | 114,427 | Total | 205,182 | |
| Budding yeast | Fission yeast | Fruit fly | Nematode | Human | ||||||||
| RQ/N (total residues) | 1.05 | 2.28 | 6.89 | 6.96 | 5.43 | 7.09 | ||||||
| RQ/N (total regions) | 1.42 | 4.25 | 5.03 | 6.75 | 5.46 | 11.73 | ||||||
| Q/N (composition) | (0.039/0.061) = 0.64 | (0.038/0.052) = 0.73 | (0.052/0.047) = 1.12 | (0.041/0.049) = 0.83 | (0.035/0.044) = 0.79 | (0.047/0.037) = 1.28 | ||||||
*The total numbers of regions that have a compositionally biased LPS with Pbias < 1 × 10⁻¹³. †The number of LPSs for a particular compositional bias in the budding yeast proteome that overlap a region assigned as coiled coil by the MULTICOIL program [22]. ‡The total numbers of bias residues (for example, total number of serines for a serine bias) for all of the regions tallied for part (a) of the table. § RQ/N is the ratio of the number of Q-rich regions to N-rich regions as listed in parts (a) and (b) of the table. The overall abundance of the residues is simply the fraction of the total proteome that is either Q or N.
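The RQ/N ratios can be recomputed directly from the Q and N rows of the table. The sketch below does this for the first three proteomes, whose column labels are unambiguous in this copy; the function name `r_qn` and the dictionary layout are illustrative choices, and the counts are copied from the table.

```python
def r_qn(q_count, n_count):
    """Ratio of Q-rich to N-rich counts, rounded as in the table."""
    return round(q_count / n_count, 2)


# (Q, N) counts copied from the region table and the residue table.
regions = {
    "Budding yeast": (104, 73),
    "Fission yeast": (17, 4),
    "Fruit fly": (725, 144),
}
residues = {
    "Budding yeast": (3924, 3745),
    "Fission yeast": (1107, 486),
    "Fruit fly": (39186, 5690),
}

for organism in regions:
    print(organism,
          "regions:", r_qn(*regions[organism]),
          "residues:", r_qn(*residues[organism]))
```

For budding yeast this gives 1.42 for regions and 1.05 for residues, matching the two RQ/N rows, while the compositional Q/N ratio is only 0.039/0.061 = 0.64.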
Comparison of prevalent compositionally biased regions for the whole proteome, translated intergenic DNA, known proteins, hypothetical proteins and dORFs in budding yeast
| Pbias < 1 × 10⁻⁵ | Pbias < 1 × 10⁻⁹ | Pbias < 1 × 10⁻¹³ | |||
| S | 37,006 | S | 18,502 | S | 10,630 |
| E | 21,163 | E | 9,147 | T | 5,900 |
| L | 18,064 | T | 6,836 | E | 4,704 |
| K | 17,067 | N | 6,462 (9.3) | Q | 3,924 (10.4) |
| N | 15,577 (7.4) | Q | 5,212 (7.5) | N | 3,745 (10.0) |
| A | 13,974 | K | 4,280 | P | 2,049 |
| G | 12,927 | P | 3,831 | K | 1,910 |
| D | 10,004 | L | 3,512 | D | 1,292 |
| P | 9,892 | D | 3,176 | G | 961 |
| T | 9,866 | A | 2,473 | A | 916 |
| F | 8,934 | G | 2,115 | L | 554 |
| Q | 8,689 (4.1) | C | 810 | C | 256 |
| I | 6,939 | F | 764 | R | 204 |
| R | 5,333 | H | 662 | H | 195 |
| V | 4,121 | R | 509 | M | 163 |
| C | 3,293 | I | 264 | F | 94 |
| Y | 2,960 | Y | 262 | V | 90 |
| H | 2,645 | M | 245 | Y | 33 |
| W | 2,009 | V | 150 | W | 0 |
| M | 850 | W | 0 | I | 0 |
| Total | 211,313 | Total | 69,212 | Total | 37,620 |
| Pbias < 1 × 10⁻⁵ | Pbias < 1 × 10⁻⁹ | Pbias < 1 × 10⁻¹³ | |||
| F | 28,949 | F | 5,692 | F | 1,211 |
| C | 10,074 | C | 1,280 | H | 602 |
| K | 7,800 | H | 908 | V | 490 |
| R | 7,551 | V | 814 | T | 448 |
| Y | 6,450 | K | 753 | C | 377 |
| L | 6,283 | Y | 690 | L | 366 |
| I | 3,789 | T | 681 | Y | 282 |
| H | 3,157 | P | 675 | P | 243 |
| P | 1,650 | R | 594 | S | 222 |
| S | 1,613 | L | 576 | K | 186 |
| V | 1,566 | S | 380 | I | 185 |
| T | 1,299 | G | 380 | R | 178 |
| G | 1,136 | I | 353 | N | 173 (3.2) |
| N | 798 (0.9) | W | 299 | G | 166 |
| W | 746 | N | 242 (1.7) | W | 98 |
| Q | 498 (0.6) | Q | 125 (0.9) | Q | 51 (1.0) |
| M | 282 | E | 85 | E | 39 |
| A | 268 | M | 26 | D | 16 |
| E | 241 | D | 16 | M | 15 |
| D | 16 | A | 0 | A | 0 |
| Total | 84,166 | Total | 14,569 | Total | 5,348 |
| Pbias < 1 × 10⁻⁵ | Pbias < 1 × 10⁻⁹ | Pbias < 1 × 10⁻¹³ | |||
| S | 27,539 | S | 15,328 | S | 9,819 |
| E | 17,519 | E | 8,074 | T | 5,900 |
| L | 13,928 | N | 5,716 (9.9) | E | 4,289 |
| K | 13,785 | T | 5,413 | N | 3,551 (11.9) |
| N | 12,854 (7.7) | Q | 4,520 (7.8) | Q | 3,348 (11.3) |
| A | 12,482 | K | 3,653 | K | 1,723 |
| G | 11,783 | L | 2,864 | P | 1,669 |
| D | 1,934 | L | 595 | P | 170 |
| P | 1,883 | P | 453 | G | 62 |
| Q | 7,299 (4.4) | A | 2,434 | G | 899 |
| P | 7,045 | G | 1,969 | L | 451 |
| F | 6,154 | C | 608 | C | 207 |
| I | 5,495 | H | 530 | H | 162 |
| R | 3,973 | R | 447 | R | 155 |
| V | 3,415 | F | 443 | M | 113 |
| C | 2,400 | I | 264 | F | 78 |
| Y | 2,158 | Y | 218 | V | 0 |
| H | 1,536 | M | 195 | Y | 0 |
| W | 1,484 | V | 60 | W | 0 |
| M | 656 | W | 0 | I | 0 |
| Total | 166,920 (13) | Total | 57,938 (38) | Total | 33,070 (19) |
| Pbias < 1 × 10⁻⁵ | Pbias < 1 × 10⁻⁹ | Pbias < 1 × 10⁻¹³ | |||
| S | 8,621 | S | 2,958 | T | 1,240 |
| L | 3,905 | T | 1,423 | S | 772 |
| E | 3,630 | E | 1,073 | Q | 576 (13.7) |
| K | 3,043 | Q | 680 (6.8) | E | 415 |
| F | 2,747 | N | 664 (6.6) | D | 262 |
| N | 2,506 (6.4) | K | 602 | N | 194 (4.6) |
| T | 2,050 | D | 600 | K | 187 |
| D | 1,934 | L | 595 | P | 170 |
| P | 1,883 | P | 453 | G | 62 |
| A | 1,386 | F | 321 | V | 55 |
| I | 1,267 | C | 202 | M | 50 |
| R | 1,264 | G | 146 | L | 50 |
| Q | 1,171 (3.0) | H | 106 | R | 49 |
| G | 882 | R | 62 | C | 49 |
| C | 863 | V | 55 | Y | 33 |
| H | 528 | M | 50 | H | 33 |
| W | 514 | Y | 44 | F | 16 |
| Y | 512 | A | 14 | A | 0 |
| V | 389 | V | 0 | W | 0 |
| M | 179 | W | 0 | I | 0 |
| Total | 39,274 (16) | Total | 10,048 (221) | Total | 4,213 (150) |
| Pbias < 1 × 10⁻⁵ | Pbias < 1 × 10⁻⁹ | Pbias < 1 × 10⁻¹³ | |||
| R | 459 | R | 254 | R | 254 |
| H | 307 | L | 204 | L | 204 |
| S | 288 | T | 138 | H | 122 |
| G | 271 | Q | 129 (11.0) | T | 120 |
| L | 248 | H | 122 | C | 99 |
| Q | 225 (6.8) | C | 99 | Q | 74 (8.3) |
| T | 208 | S | 82 | N | 23 (2.6) |
| N | 172 (5.2) | P | 72 | A | 0 |
| F | 168 | Y | 50 | D | 0 |
| C | 163 | N | 23 (2.0) | E | 0 |
| V | 151 | A | 0 | F | 0 |
| A | 149 | D | 0 | G | 0 |
| D | 111 | E | 0 | I | 0 |
| I | 98 | F | 0 | K | 0 |
| P | 84 | G | 0 | P | 0 |
| Y | 67 | I | 0 | S | 0 |
| E | 45 | K | 0 | V | 0 |
| K | 37 | M | 0 | Y | 0 |
| W | 23 | V | 0 | W | 0 |
| M | 14 | W | 0 | M | 0 |
| Total | 3,288 | Total | 1,173 | Total | 896 |
*Translated igDNA ('intergenic DNA') is conceptually translated in six frames. For analysis of intergenic DNA in budding yeast, we used the 'Not Feature' file of sequences in FASTA format distributed by SGD (this contains all genomic DNA that does not overlap an annotated feature [32]). This set of nucleotide sequences was conceptually translated in all six reading frames, and the amino-acid compositional biases were tallied up as for the annotated budding-yeast proteome. A dORF is an open reading frame that is disrupted by one or more frameshifts or premature stop codons, and which is likely to be a pseudogene. A data set of dORFs has been derived previously for the budding-yeast genome [9]. †In the totals for known and hypothetical proteins, the number of bias residues per residue of protein is given in parentheses.
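The six-frame conceptual translation described in this footnote can be sketched as follows. This is an illustration only: it assumes the standard genetic code, writes stop codons as `*` and unrecognized codons as `X`, and the function names are hypothetical. How stop codons were treated when the biases were tallied is not stated here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:])
            for strand in strands for offset in range(3)]


print(six_frame_translations("ATGGCCTAA"))
# ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```

Each translated frame could then be scanned for compositionally biased regions in the same way as an annotated protein sequence.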
Numbers of (Q+N)-rich domains for the six proteomes
| Category | Budding yeast | Fission yeast | Fruit fly | Nematode | Human | ||
| 1 | (Q+N)-rich domains according to a Q, N or Q+N bias* | 172 | 22 | 853 | 315 | 213 | 194 |
| 2 | (1) plus filter for charged and hydrophobic residues† | 96 | 14 | 473 | 216 | 125 | 69 |
| 3 | (2) plus requirement for a subsidiary bias for G, Y or S‡ | 31 | 7 | 86 | 80 | 35 | 21 |
*The total number of (Q+N)-rich domains. These are all LPSs that have a Q or N bias with Pbias < 1 × 10⁻¹³ or a {QN} bias with Pbias < 1.8 × 10⁻¹⁴. †A filter is used so that only LPSs that have a subsidiary bias against {DERK} with a Pbias < 6.5 × 10⁻³ and against {VILM} with a Pbias < 2 × 10⁻² are considered. ‡A filter is used so that only LPSs that have a subsidiary bias for one of the residues G, Y or S with a Pbias < 5 × 10⁻⁴ are considered.
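The three nested categories defined in these footnotes can be expressed as a small classifier. The sketch below takes the thresholds from the footnotes; the `LPS` record, its field names and the `category` function are hypothetical, and the bias probabilities are assumed to have been computed beforehand.

```python
from dataclasses import dataclass, field


@dataclass
class LPS:
    """Precomputed bias probabilities for one lowest-probability subsequence."""
    p_q: float = 1.0             # bias for Q
    p_n: float = 1.0             # bias for N
    p_qn: float = 1.0            # bias for the set {QN}
    p_against_derk: float = 1.0  # bias against {DERK}
    p_against_vilm: float = 1.0  # bias against {VILM}
    p_for: dict = field(default_factory=dict)  # e.g. {"G": 1e-5} for G, Y, S


def category(lps):
    """Highest category (1 to 3) that the LPS satisfies, or 0 if none."""
    qn_rich = lps.p_q < 1e-13 or lps.p_n < 1e-13 or lps.p_qn < 1.8e-14
    if not qn_rich:
        return 0
    depleted = lps.p_against_derk < 6.5e-3 and lps.p_against_vilm < 2e-2
    if not depleted:
        return 1
    subsidiary = any(lps.p_for.get(res, 1.0) < 5e-4 for res in "GYS")
    return 3 if subsidiary else 2


# Hypothetical example: strong Q bias, depleted in charged and hydrophobic
# residues, with a subsidiary bias for G.
example = LPS(p_q=1e-20, p_against_derk=1e-3, p_against_vilm=1e-2,
              p_for={"G": 1e-5})
print(category(example))  # 3
```

Because each category adds a filter to the previous one, the counts in rows 1 to 3 of the table can only decrease from row to row.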
Figure 1. Histogram of the lengths of the (Q+N)-rich domains for budding yeast, fruit fly and human. The distributions of sequence lengths for the (Q+N)-rich domains are shown for budding yeast (top panel), fruit fly (middle panel) and human (bottom panel). The y-axis is the number of regions per bin, and the x-axis is for bins with labels x such that each bin contains all sequences with length x to x + 24 inclusive. The mean and median lengths for each of these distributions are as follows (organism, mean (± SD), median): budding yeast, 209 ± 209, 116; fruit fly, 236 ± 389, 89; human, 553 ± 730, 268. Only the distributions up to bin x = 275 are shown; a sizeable proportion of each distribution is longer than 275 residues (budding yeast 30% of sequences, fruit fly 22% and human 44%).
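The binning used in this caption (each bin labeled x holds lengths x to x + 24 inclusive) can be sketched as below. The bin origin of 0, the use of the sample standard deviation and the toy lengths are assumptions for illustration; the caption does not state them.

```python
from collections import Counter
from statistics import mean, median, stdev

BIN_WIDTH = 25  # a bin labeled x holds lengths x .. x + 24 inclusive


def bin_label(length, origin=0):
    """Label of the bin containing this length (origin is an assumption)."""
    return origin + ((length - origin) // BIN_WIDTH) * BIN_WIDTH


def length_histogram(lengths, last_shown_label=275):
    """Counts per shown bin, plus the fraction of lengths beyond the last bin."""
    counts = Counter(bin_label(n) for n in lengths)
    shown = {x: c for x, c in sorted(counts.items()) if x <= last_shown_label}
    beyond = sum(c for x, c in counts.items() if x > last_shown_label)
    return shown, beyond / len(lengths)


def summarize(lengths):
    """Mean, sample standard deviation and median of the lengths."""
    return mean(lengths), stdev(lengths), median(lengths)


# Hypothetical domain lengths, not data from the paper.
toy_lengths = [10, 24, 25, 116, 299, 300, 700]
print(length_histogram(toy_lengths))
print(summarize(toy_lengths))
```

A standard deviation as large as or larger than the mean, as reported for all three organisms, points to a long right tail, which is consistent with the sizeable fraction of domains falling beyond the last bin shown.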
Functional categories for the (Q+N)-rich domains for budding yeast, fruit fly and human
| Organism | GO ontology | Five most frequent category annotations* |
| Budding yeast | Component | Nucleus GO:0005634 (23), Cytoplasm GO:0005737 (16), cellular_component_unknown GO:0008372 (14), Plasma_membrane GO:0005886 (9), actin_cortical_patch GO:0005857 (8), nuclear pore GO:0005643 (6) |
| Function | Molecular_function_unknown GO:0005554 (59), transcription_factor GO:0003700 (19), cytoskeletal_adaptor GO:0008093 (7), general transcriptional repressor GO:0016565 (6), general_RNA_polymerase_II_transcription_factor GO:0016251 (6), structural molecule GO:0005198 (6) | |
| Process | Biological_process_unknown GO:0000004 (52), endocytosis GO:0006897 (10), pseudohyphal_growth GO:0007124 (9), transcription GO:0006350 (9), nuclear pore organization GO:0006999 (8), protein amino acid phosphorylation GO:0006468 (7), regulation of cell cycle GO:0000074 (7) | |
| Fly | Component | Nucleus GO:0005634 (157), TFIID_complex GO:0005669 (13), plasma_membrane GO:0005886 (19), cytoplasm GO:0005737 (23), microtubule_associated_protein GO:0005875 (9) |
| Function | RNA_polymerase_II_transcription_factor GO:0003702 (52), transcription_factor GO:0003700 (39), specific_RNA_polymerase_II_transcription_factor GO:0003704 (36), RNA binding GO:0003723 (30), general RNA polymerase II transcription factor GO:0016251 (17) | |
| Process | Notch receptor signaling pathway GO:0007219 (18), protein amino acid phosphorylation GO:0006468 (18), transcription initiation GO:0006367 (13), gene silencing GO:0016458 (9), neuroblast determination GO:0004725 (9) | |
| Human | Component | Nucleus GO:0005634 (52), integral membrane protein GO:0016021 (9), extracellular space GO:0005615 (9), plasma membrane GO:0005887 (7), cytoskeleton GO:0005856 (7) |
| Function | Transcription_factor GO:0003700 (22), DNA binding GO:0003677 (20), calcium ion binding GO:0005509 (11), ATP binding GO:0005524 (10), transcription coactivator GO:0003713 (10) | |
| Process | Regulation of transcription GO:0006355 (34), signal transduction GO:0007165 (15), protein amino acid phosphorylation GO:0006468 (7), transcription from PolII promoter GO:0006366 (7), oncogenesis GO:0005198 (7) |
*A description of each GO category is followed by the number in the ontology and the total number of such designations found, in brackets.
Figure 2. Each proteome has a characteristic distribution of biases. The proportion of bias residues (y-axis), counted up for each of seven residues (S, Q, N, L, I, D and C), is shown as a function of the bias probability (x-axis). The x-axis comprises bins labeled with -log(P) such that all regions with probabilities from -log(P) to 3.0 - log(P) are included. The end (right-most) bin includes all regions with log probability greater than -log(P). From left to right, the first set of panels is for budding yeast, the second set for fission yeast, the third set for fruit fly and the fourth for human. The rows of panels are labeled at the far right with the appropriate one-letter amino-acid symbol (S, Q, N, L, I, D and C).