Shiyi Shen1, Bo Kai1, Jishou Ruan1, J Torin Huzil2, Eric Carpenter2, Jack A Tuszynski2.
Abstract
Here, we describe a probabilistic evaluation of the 20 naturally occurring amino acids and their distributions within the Swiss-Prot and Complete Human Genebank databases. We have developed a computational technique that imposes both directionality and length constraints on searches for unique combinations of amino acids within protein sequences. Using statistical approaches, we have searched for all possible two- and three-residue motifs contained within these databases. The technique exploits the unusually high occurrence of a small number of these motifs compared with the expected probability of finding a specific residue grouping within a given database. Subsequent filtering of the search results to identify such unique combinations has provided several examples that can be used as markers to identify particular proteins within or across databases. We focus on the three motifs that we found to be of greatest interest: CC, CM, and CCM, a combination of the two. Each occurs either more or less frequently than would be predicted from standard amino acid distributions within the entire human proteome.
Keywords: Amino acid motif; Fast detection method; Significance test
Year: 2006 PMID: 32288076 PMCID: PMC7127678 DOI: 10.1016/j.physa.2006.03.004
Source DB: PubMed Journal: Physica A ISSN: 0378-4371 Impact factor: 3.263
Probability distribution of all amino acids contained in the Swiss-Prot and the frequency distribution in the Complete Human Genebank (square brackets) databases
| [0.0704] | [0.0231] | [0.0484] | [0.0692] | [0.0378] |
| [0.0675] | [0.0256] | [0.0450] | [0.0565] | [0.0984] |
| [0.0237] | [0.0368] | [0.0610] | [0.0465] | [0.0552] |
| [0.0799] | [0.0534] | [0.0613] | [0.0121] | [0.0282] |
The occurrence of each individual amino acid has been averaged over the total number of amino acids found within each of the databases.
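The averaging described in the note above can be illustrated with a minimal Python sketch. The function name `residue_frequencies` and the toy sequences are hypothetical and are not taken from the paper; the sketch assumes that residues are simply pooled across every sequence in a database before dividing by the total residue count.

```python
from collections import Counter


def residue_frequencies(sequences):
    """Return {residue: fraction of all residues}, pooled over every sequence."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)  # count each one-letter residue code
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


# Hypothetical toy "database" of three short sequences (not real data).
toy_db = ["MACC", "CCMK", "AAKM"]
freqs = residue_frequencies(toy_db)

for aa in sorted(freqs):
    print(f"{aa}: {freqs[aa]:.4f}")
```

On a real database the same function would return 20 values that sum to 1, one per amino acid.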
Based on the ()-criterion, we obtained the following amino acid motifs that are statistically significant (calculated frequency) when compared to the frequency of amino acids described in Table 1
| Amino acid pair | Calculated frequency (%) | Expected frequency (%) |
|---|---|---|
| AA | 0.78 | 0.60 |
| RR | 0.36 | 0.28 |
| NN | 0.22 | 0.18 |
| CC | 0.04 | 0.02 |
| CH | 0.04 | 0.03 |
| HC | 0.04 | 0.03 |
|  | 0.24 | 0.15 |
| EE | 0.58 | 0.43 |
| EK | 0.48 | 0.39 |
| HH | 0.08 | 0.05 |
| HP | 0.14 | 0.11 |
| KK | 0.47 | 0.31 |
| PP | 0.31 | 0.22 |
| SS | 0.63 | 0.48 |
| WW | 0.02 | 0.01 |
| YY | 0.12 | 0.10 |
| CM | 0.03 | 0.04 |
| EP | 0.24 | 0.26 |
| ES | 0.36 | 0.46 |
| HE | 0.12 | 0.15 |
| HK | 0.11 | 0.14 |
| IM | 0.12 | 0.14 |
| PM | 0.09 | 0.11 |
| WP | 0.04 | 0.08 |
The first 16 residue pairs are “positive” motifs, which occur more frequently within the database than expected. The remaining eight pairs are “negative” motifs (grayed boxes), which occur less frequently than expected. Data are taken from the Swiss-Prot database and include all the -positive and -negative pairs.
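The comparison between calculated and expected pair frequencies can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the expected frequency is taken to be the product of the two single-residue frequencies (an independence model), pairs are counted as ordered, overlapping neighbours, and the significance criterion itself is not reproduced because it is not specified here. The names `observed_and_expected` and `direction` and the toy sequences are hypothetical.

```python
from collections import Counter


def observed_and_expected(sequences):
    """Return {pair: (observed, expected)} for every adjacent pair seen.

    Assumption: expected = p(first residue) * p(second residue).
    """
    residues = Counter()
    pairs = Counter()
    for seq in sequences:
        residues.update(seq)
        # zip(seq, seq[1:]) yields ordered, overlapping neighbours:
        # "CCM" -> "CC", "CM" (so "CM" and "MC" are distinct motifs).
        pairs.update(a + b for a, b in zip(seq, seq[1:]))
    n_res = sum(residues.values())
    n_pairs = sum(pairs.values())
    result = {}
    for pair, n in pairs.items():
        observed = n / n_pairs
        expected = (residues[pair[0]] / n_res) * (residues[pair[1]] / n_res)
        result[pair] = (observed, expected)
    return result


def direction(observed, expected):
    """Label a pair by the sign of its deviation only (no significance test)."""
    if observed > expected:
        return "positive"
    if observed < expected:
        return "negative"
    return "neutral"


# Hypothetical toy sequences (not real data).
stats = observed_and_expected(["MACC", "CCMK", "AAKM"])

# Two rows copied from the table above (values in percent).
print(direction(0.78, 0.60))  # AA -> positive
print(direction(0.03, 0.04))  # CM -> negative
```

A full analysis would additionally apply a significance threshold before calling a pair positive or negative; the sketch only reports the direction of the deviation.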
Frequency of CC and CM pairs within all human protein sequences found within the Swiss-Prot database
| # pairs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| CC | 1678 | 385 | 136 | 35 | 14 | 9 | 7 | 5 | 4 | 7 |
| CM | 1308 | 244 | 31 | 8 | 3 | 1 | 0 | 0 | 1 | 0 |
Statistical thresholds for CC (6) and CM (4) pairs are shown in bold.
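A tabulation of this kind (number of proteins containing a given number of CC or CM pairs) can be produced with a short Python sketch. Two points are assumptions rather than statements from the source: occurrences are counted with overlap (so "CCC" contains two CC pairs), and counts at or above a cap are pooled into a single final bin. The function names and toy sequences are hypothetical.

```python
from collections import Counter


def count_pairs(sequence, pair):
    """Count overlapping occurrences of a two-residue motif in one sequence."""
    return sum(
        1 for i in range(len(sequence) - 1) if sequence[i:i + 2] == pair
    )


def proteins_by_pair_count(sequences, pair, cap=10):
    """Map k -> number of sequences containing k occurrences of the pair.

    Sequences with no occurrence are skipped; counts >= cap share one bin.
    """
    hist = Counter()
    for seq in sequences:
        k = count_pairs(seq, pair)
        if k > 0:
            hist[min(k, cap)] += 1
    return dict(hist)


# Hypothetical toy sequences (not real data).
toy_db = ["MACC", "CCMK", "AAKM", "CCCM"]
print(proteins_by_pair_count(toy_db, "CC"))  # {1: 2, 2: 1}
print(proteins_by_pair_count(toy_db, "CM"))  # {1: 2}
```

Proteins whose count reaches a chosen threshold (for example the values 6 for CC and 4 for CM quoted above) could then be selected by filtering on `count_pairs`.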
Relationship between the total number of CC pairs within human proteins and disease
| Protein name | Total length | Number of C | Number of CC | (Localization) Function or disease |
|---|---|---|---|---|
| NIC1 (NICE-1 protein) | 99 | 17 | 9 | (Uncharacterized) Involved in epidermal differentiation |
| VTDB (Vitamin D-binding protein) | 474 | 28 | 7 | (Secreted-plasma) Prevents polymerization of actin |
| AFAM (Afamin) | 599 | 34 | 8 | (Secreted) Contains albumin domains |
| MCS (Sperm mitochondrial associated cysteine-rich protein) | 116 | 20 | 9 | (Cytoplasmic) Associated with male infertility |
| ALBU (Serum albumin) | 609 | 35 | 8 | (Secreted-plasma) Familial dysalbuminemic hyperthyroxinemia |
| DJC5 (DnaJ homolog) | 198 | 14 | 8 | (Membrane bound) Involved in presynaptic function |
| DJCX (DnaJ homolog) | 199 | 14 | 7 | (Membrane) |
| KRUA (Keratin, ultra high-sulfur matrix protein A) | 169 | 60 | 19 | (Extracellular) Cuticle layers of differentiating hair follicles |
| KRUB (Keratin, ultra high-sulfur matrix protein B) | 194 | 67 | 22 | (Extracellular) Cuticle layers of differentiating hair follicles |
| CIWC (Potassium channel subfamily K member 12) | 430 | 19 | 9 | (Membrane protein) Probable potassium channel subunit |
| MDFI (Myogenic repressor I-mf) | 246 | 29 | 7 | (Cytoplasmic) |
| GRN (Granulins) | 593 | 88 | 28 | (Secreted) May play a role in inflammation, wound repair, and tissue remodeling |
| ECM1 (Extracellular matrix protein 1 [Precursor]) | 540 | 28 | 7 | (Extracellular matrix) Lipoid proteinosis |
| MU5A (Mucin 5AC) | 1233 | 95 | 7 | (Extracellular matrix) |
| LTBS (Latent transforming growth factor beta binding protein, isoform 1S) | 1394 | 139 | 7 | (Secreted) |
| LTBL (Latent transforming growth factor beta binding protein) | 1595 | 138 | 8 | (Secreted) Involved in the assembly, secretion and targeting of TGF |
| LYST (Lysosomal trafficking regulator) | 3801 | 93 | 9 | (Cytoplasmic) Chediak–Higashi syndrome (hypopigmentation) |
| FBN1 (Fibrillin 1) | 2871 | 361 | 17 | (Extracellular matrix) Congenital contractural arachnodactyly |
| FBN2 (Fibrillin 2) | 2911 | 363 | 17 | (Extracellular matrix) Congenital contractural arachnodactyly |
| VWF (Von Willebrand factor) | 2813 | 234 | 8 | (Secreted) Von Willebrand disease |
Shown here are the top 20 of 28 human proteins from the Swiss-Prot database, which contain seven or more CC pairs. Proteins are ordered based upon their total CC content in relation to the total number of cysteine residues and the overall length of the protein.
Frequency of CC and CM pairs within human proteins in the Complete Human Genebank database
| # pairs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| CC | 4651 | 1081 | 330 | 87 | 37 | 29 | 24 | 8 | 7 | 41 |
| CM | 3488 | 520 | 94 | 36 | 5 | 2 | 1 | 0 | 1 | 0 |