Lipika R Pal, Chittibabu Guda.
Abstract
BACKGROUND: The functional repertoire of the human proteome is an incremental collection of functions accomplished by protein domains that evolved along the Homo sapiens lineage. Knowledge of the origin of these functionalities therefore provides a better understanding of domain and protein evolution in human. The lack of a clear understanding of this origin prompted us to study the evolutionary origin of the human proteome using the approach detailed in this study.
Year: 2006 PMID: 17090320 PMCID: PMC1654190 DOI: 10.1186/1471-2148-6-91
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Figure 1. Taxonomic classification of Homo sapiens. Each central box represents a major node corresponding to a distinct group of species. In each box (other than cellular organism), the letter notation for that node is given in square brackets. Each side box (other than archaea and bacteria) was derived by subtracting the sequences of the next higher node from those of the previous lower node. The numbers of human domains that originated at the archaea, bacteria and eukaryotic nodes are given in parentheses. For archaea and bacteria, 13,052 domains have remote homologs in both archaea and bacteria (archaea+bacteria subnode), 1,021 only in archaea and not in bacteria (archaea_only subnode), and 15,294 only in bacteria and not in archaea (bacteria_only subnode) (see Methods).
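The three prokaryotic subnodes in this caption amount to a set partition of the domains with archaeal or bacterial remote homologs. A minimal sketch, assuming each human domain is represented by an identifier; the function name and the identifiers below are hypothetical and are not taken from the study:

```python
def partition_prokaryotic_hits(archaea_hits, bacteria_hits):
    """Split domains into the archaea+bacteria, archaea_only and bacteria_only subnodes."""
    archaea_hits = set(archaea_hits)
    bacteria_hits = set(bacteria_hits)
    return {
        "archaea+bacteria": archaea_hits & bacteria_hits,  # homologs in both
        "archaea_only": archaea_hits - bacteria_hits,      # homologs in archaea, none in bacteria
        "bacteria_only": bacteria_hits - archaea_hits,     # homologs in bacteria, none in archaea
    }

# Hypothetical domain identifiers, for illustration only.
subnodes = partition_prokaryotic_hits({"d1", "d2", "d3"}, {"d2", "d3", "d4", "d5"})
counts = {name: len(members) for name, members in subnodes.items()}
```

Applied to the real hit lists, the three counts would correspond to the 13,052, 1,021 and 15,294 domains reported in the caption.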
Figure 2. Flow diagram of the subtractive searching method, depicting the process of tracing the evolutionary origin of human domains.
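One way to read the subtractive idea for the nested eukaryotic nodes is sketched below. This is an interpretation of the captions, not the authors' pipeline: the oldest-first search order, the function names, the toy sequence identifiers and the `has_homolog` predicate (a stand-in for a real remote-homology search) are all assumptions.

```python
EUKARYOTIC_LINEAGE = ["eukaryota", "metazoa", "chordata", "mammalia", "primates", "Homo sapiens"]

def subtractive_sets(clade_sequences, lineage=EUKARYOTIC_LINEAGE):
    """For nested clades ordered oldest to youngest, keep in each clade only
    the sequences that do not belong to the next clade toward human."""
    side_sets = {}
    for i, clade in enumerate(lineage):
        members = set(clade_sequences[clade])
        if i + 1 < len(lineage):
            members -= set(clade_sequences[lineage[i + 1]])
        side_sets[clade] = members
    return side_sets

def node_of_origin(domain, side_sets, has_homolog, lineage=EUKARYOTIC_LINEAGE):
    """Return the oldest clade whose subtracted set contains a homolog of the domain."""
    for clade in lineage:
        if any(has_homolog(domain, seq) for seq in side_sets[clade]):
            return clade
    return None

# Toy nested clades and homology pairs (hypothetical, for illustration only).
clade_sequences = {
    "eukaryota": {"yeast1", "fly1", "fish1", "mouse1", "chimp1", "human1"},
    "metazoa": {"fly1", "fish1", "mouse1", "chimp1", "human1"},
    "chordata": {"fish1", "mouse1", "chimp1", "human1"},
    "mammalia": {"mouse1", "chimp1", "human1"},
    "primates": {"chimp1", "human1"},
    "Homo sapiens": {"human1"},
}
toy_homologs = {("domX", "fly1"), ("domX", "mouse1"), ("domY", "human1")}

def has_homolog(domain, seq):
    return (domain, seq) in toy_homologs

side_sets = subtractive_sets(clade_sequences)
origin_x = node_of_origin("domX", side_sets, has_homolog)
origin_y = node_of_origin("domY", side_sets, has_homolog)
```

Because each subtracted set holds only the sequences specific to that level of the lineage, a hit in an older set is not masked by the same sequences reappearing in every younger clade.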
Comparison of the performance of HHpred against Hmmpfam in detecting remote homologs
| | HHpred method | Hmmpfam method |
| Pfam-A hits for sequences | 74% | 64% |
| Residues covered by Pfam-A hits | 54% | 34% |
| Unique Pfam-A families found | 3853 | 3192 |
Figure 3. Number of unique Pfam-A and Pfam-B families associated with human domains (redundant across nodes) with origin at different nodes.
Figure 4. Cartoon diagram of representative proteins containing the Pfam-A family EGF (epidermal growth factor) whose remote homologs were found at different nodes along the lineage using the subtractive searching method. For each sequence, the SWISS-PROT identifier is given and the EGF domain is shown along with the name of the node at which its remote homolog was found for that protein sequence. The codes for the nodes are: B, bacteria; E, eukaryota; T, metazoa; C, chordata; M, mammalia; P, primates; H, Homo sapiens. Other functionally significant domain names in the protein sequences are given in the legend.
Figure 5. Grouping of Pfam-A families according to the number of nodes (where remote homologs are found along the lineage) associated with each family. (A) The distribution of the groups of Pfam-A families is plotted on the left axis; its decay is best approximated by a power law, $Y \sim X^{-1.7}$ ($R^2 = 0.95$), compared with an exponential, linear or logarithmic function. The average number of human domains in each group of Pfam-A families is plotted on the right axis and is described as rising exponentially, with $Y = 3.6 \times e^{-0.7X}$ ($R^2 = 0.97$). (B) Distribution of Pfam-A families sorted by the average percentage sequence identity of domains within the same family (families with fewer than 10 domains are excluded from this graph) in the different groups. The maximum probable range of each curve is its more flattened portion.
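Comparing a power-law fit against an exponential fit, as panel (A) does, can be sketched with ordinary least squares on log-transformed data. The snippet below uses synthetic values generated from a power law, not the paper's data, and computes $R^2$ in log space, which may differ from the procedure the authors used; all function names are illustrative.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

def fit_power_law(xs, ys):
    """Fit Y = c * X**k by regressing log Y on log X."""
    a, k, r2 = linear_fit([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(a), k, r2

def fit_exponential(xs, ys):
    """Fit Y = c * exp(k * X) by regressing log Y on X."""
    a, k, r2 = linear_fit(xs, [math.log(y) for y in ys])
    return math.exp(a), k, r2

# Synthetic counts generated from Y = 1000 * X**-1.7 (not the paper's data).
nodes = [1, 2, 3, 4, 5, 6, 7, 8]
families = [1000 * x ** -1.7 for x in nodes]
_, power_exponent, power_r2 = fit_power_law(nodes, families)
_, exp_rate, exp_r2 = fit_exponential(nodes, families)
```

On these synthetic points the power-law fit recovers the exponent and scores a higher $R^2$ than the exponential fit, which is the kind of comparison the caption summarizes.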
Some of the functionally known Pfam-A families in each group, defined by the number of nodes associated with it
| Number of nodes associated with a group | Pfam-A family ID | Frequency of occurrence in human proteome | Description of the family |
| 1 | PF02214 | 120 | K+ channel tetramerization domain |
| | PF02101 | 113 | Ocular albinism type 1 protein |
| | PF02719 | 112 | Polysaccharide biosynthesis protein |
| | PF00307 | 109 | Calponin Homology domain |
| | PF04185 | 19 | Phosphoesterase family |
| 2 | PF02117 | 671 | C. |
| | PF04762 | 306 | IKI3 family |
| | PF00089 | 175 | Trypsin |
| | PF00854 | 159 | Proton-dependent oligopeptide transport family |
| | PF00969 | 133 | Class II histocompatibility antigen, beta domain |
| 3 | PF01748 | 923 | C. |
| | PF05462 | 883 | Slime mold cyclic AMP receptor |
| | PF00002 | 862 | 7 transmembrane receptor (secretin family) |
| | PF03125 | 791 | C. |
| | PF02118 | 777 | C. |
| | PF00169 | 307 | Pleckstrin homology domain |
| | PF02175 | 305 | C. |
| | PF07653 | 273 | Variant SH3 domain |
| 4 | PF01461 | 900 | 7 transmembrane chemoreceptor |
| | PF03402 | 642 | Vomeronasal organ pheromone receptor family |
| | PF01163 | 461 | RIO1 family |
| | PF01352 | 461 | Kruppel-associated box domain |
| | PF00076 | 313 | RNA recognition motif |
| | PF00018 | 288 | SH3 domain |
| | PF00046 | 276 | Homeobox domain |
| | PF00595 | 210 | PDZ domain |
| 5 | PF00096 | 1007 | Zinc finger, C2H2 type |
| | PF05296 | 917 | Mammalian taste receptor protein (TAS2R) |
| | PF00047 | 834 | Immunoglobulin domain |
| | PF00069 | 769 | Protein kinase domain |
| | PF08205 | 647 | CD80-like C2-set immunoglobulin domain |
| | PF07679 | 621 | Immunoglobulin I-set domain |
| | PF00071 | 317 | Ras family |
| 6 | PF03326 | 1403 | Herpes virus transcription activation factor |
| | PF07686 | 965 | Immunoglobulin V-set domain |
| | PF07714 | 767 | Protein tyrosine kinase |
| | PF04388 | 713 | Hamartin protein |
| | PF07654 | 489 | Immunoglobulin C1-set domain |
| | PF00131 | 344 | Metallothionein |
| | PF00023 | 310 | Ankyrin repeat |
| | PF00041 | 261 | Fibronectin type III domain |
| 7 | PF05109 | 2177 | Herpes virus major outer envelope glycoprotein |
| | PF03546 | 1859 | Treacher Collins syndrome protein Treacle |
| | PF00038 | 1194 | Intermediate filament protein |
| | PF00001 | 1004 | 7 transmembrane receptor (rhodopsin family) |
| | PF01391 | 277 | Collagen triple helix repeat |
| | PF00008 | 221 | Epidermal Growth Factor-like domain |
| 8 | PF03154 | 3365 | Atrophin-1 family |
| | PF04554 | 3125 | Extensin-like region |
| | PF03251 | 2546 | Tymo virus 45/70kd protein |
| | PF05956 | 2384 | APC basic domain |
| | PF01500 | 766 | Keratin, high sulfur B2 protein |
Figure 6. Distribution of Pfam-A families according to their origin in the three kingdoms of life: archaea, bacteria and eukaryota. The codes for the nodes are: A, archaea; B, bacteria; E, eukaryota (here E represents the eukaryota node and all nodes above it).
Evolutionary origin of Pfam-A families at different eukaryotic nodes
| Node of origin | Number of Pfam-A families with remote homologs at | | |
| | Single node | Multiple node | Total |
| Eukaryota | 667 | 334 | 1001 |
| Metazoa | 255 | 115 | 370 |
| Chordata | 154 | 64 | 218 |
| Mammalia | 54 | 18 | 72 |
| Primates | 4 | 3 | 7 |
| Homo sapiens | 44 | 0 | 44 |
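The rows of this table can be checked for internal consistency with a few lines of arithmetic. The sketch below copies the (single node, multiple node, total) triples in table order; the variable names are illustrative only.

```python
# (single node, multiple node, total) for each row of the table, top to bottom.
rows = [
    (667, 334, 1001),
    (255, 115, 370),
    (154, 64, 218),
    (54, 18, 72),
    (4, 3, 7),
    (44, 0, 44),
]

# Rows whose single-node and multiple-node counts do not add up to the total.
mismatches = [row for row in rows if row[0] + row[1] != row[2]]

# Families summed over all eukaryotic nodes, and the single-node share of them.
total_families = sum(total for _, _, total in rows)
single_node_share = sum(single for single, _, _ in rows) / total_families
```

Every row sums correctly, and roughly two thirds of the families listed have remote homologs at a single node only.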
Some frequently populated Pfam-A families with origin at different evolutionary nodes.
| Pfam-A family | N* | Functional description |
| Archaea_only (131 sequences) | | |
| Ribosomal proteins | 25 | Involved in catalyzing mRNA-directed protein synthesis |
| RNA polymerase | 20 | Catalyse the DNA dependent polymerisation of RNA |
| Translation initiation factor | 12 | Required for maximal rate of protein biosynthesis, in directing ribosome to proper start state of translation |
| DNA polymerase | 5 | Required in replication of DNA |
| Diphthamide_syn | 5 | Putative diphthamide synthesis protein |
| Bacteria_only (1102 sequences) | | |
| Sulfotransferases | 67 | Responsible for the transfer of sulphate groups to specific compounds |
| Tubulin | 35 | Major component of microtubules, involved in polymer formation |
| DAGAT | 23 | Diacylglycerol acyltransferase, the enzyme that catalyzes the terminal step of triacylglycerol synthesis |
| Carb_anhydrase | 23 | Carbonic anhydrase, catalyzes the reversible hydration of carbon dioxide |
| 2OG-FeII_Oxy | 21 | 2-oxoglutarate and Fe(II)-dependent oxygenase superfamily |
| Ribosomal protein | 17 | Involved in catalyzing mRNA-directed protein synthesis |
| Eukaryota (2928 sequences) | | |
| K_tetra | 120 | K+ channel cytoplasmic tetramerisation domain |
| Ocular_alb | 113 | X-linked disorder characterized by severe impairment of visual acuity, retinal hypopigmentation and the presence of macromelanosomes |
| CH | 109 | Calponin homology domain, found in both cytoskeletal and signal transduction proteins |
| Histone | 68 | Core Histone H2A/H2B/H3/H4, involved in histone-histone and histone-DNA interactions |
| 7tm_3 | 65 | 7 transmembrane receptor (metabotropic glutamate family), coupled to G-proteins; stimulates the inositol phosphate/Ca2+ intracellular signalling pathway |
| Actin | 62 | Involved in formation of filament, major component of cytoskeleton |
| Fork_head | 59 | A transcription factor that promotes terminal rather than segmental development, involved in early developmental decisions of cell fates during embryogenesis |
| UQ_con | 59 | Ubiquitin-conjugating enzyme, involved in catalytic activity or assist in poly-ubiquitin chain formation |
| Metazoa (1362 sequences) | | |
| Zf-C4 | 88 | DNA binding domain of a nuclear hormone receptor |
| PID | 67 | Phosphotyrosine interaction domain |
| RA | 51 | Ras association domain |
| sema | 48 | The Sema domain occurs in semaphorins, which are a large family of secreted and transmembrane proteins, some of which function as repellent signals during axon guidance |
| Ets | 37 | Erythroblast transformation specific domain, required for induction of erythroblastosis |
| Wnt | 25 | Role in intercellular communication, possible role in central nervous system |
| T-box | 24 | Perform DNA-binding and transcriptional activation/repression roles |
| Chordata (470 sequences) | | |
| Connexin | 22 | Gap junction protein |
| Interferon | 19 | Produce antiviral and antiproliferative responses in cells |
| Protocadherin | 17 | Cadherin-related molecules in central nervous system |
| MHC_II_alpha | 15 | Related with cell-mediated immune responses |
| Fn2 | 14 | Fibronectin type II domain, involved in a number of important functions e.g., wound healing; cell adhesion; blood coagulation; cell differentiation and migration; maintenance of the cellular cytoskeleton; and tumour metastasis |
| Mammalia (146 sequences) | | |
| Gag_p10 | 21 | The p10 or matrix protein (MA) is associated with the virus envelope glycoproteins in most mammalian retroviruses and may be involved in virus particle assembly, transport and budding |
| GP41 | 16 | The GP41 subunit of the envelope protein complex mediates membrane fusion during viral entry |
| Bim_N | 11 | Bim protein N terminus, essential initiators of apoptotic cell death |
| Primates (21 sequences) | | |
| SPAN-X | 14 | Human sperm proteins associated with the nucleus and mapped to the X chromosome; they are cancer-testis antigens |
| GP120 | 9 | Envelope glycoprotein GP120 |
| BAGE | 5 | B melanoma antigen family |
* N is the number of protein sequences containing those Pfam-A families in the left column.
Figure 7. Distribution of protein sequences according to the total number of Pfam-A annotations in each. For each category, the number of Pfam-A annotations in a sequence is given, followed by the total number of sequences in that category and the percentage of sequences in that category with respect to the total number of sequences with Pfam-A hits.
Most frequent Pfam-A families/superfamilies in protein sequences that are associated with single or multiple functions
| F^a | P^b | T^c | Most abundant families/superfamilies (Top 10): description | N^d |
| 1 | 11,646 | 1,778 | Immunoglobulin superfamily (Ig) | 598 |
| | | | Zinc finger family (Zf-C2H2) | 348 |
| | | | Protein kinase superfamily (Pkinase) | 339 |
| | | | FAD/NAD(P)-binding Rossmann fold Superfamily (NADP_Rossmann) | 191 |
| | | | Ankyrin repeat | 151 |
| | | | RNA recognition motif (RRM_1) | 130 |
| | | | Major Facilitator Superfamily (MFS) | 114 |
| | | | Ig-like fold superfamily (E-set) | 112 |
| | | | Peptidase clan PA | 106 |
| | | | Methyltransferase superfamily | 104 |
| 2 | 4,933 | 1,333 | Zinc finger family (Zf-C2H2) | 477 |
| | | | Kruppel-associated box | 326 |
| | | | Immunoglobulin superfamily (Ig) | 322 |
| | | | Marek's disease glycoprotein A | 206 |
| | | | WD-40 repeats (beta-transducin repeats) | 167 |
| | | | Protein kinase superfamily (Pkinase) | 160 |
| | | | Ig-like fold superfamily (E-set) | 150 |
| | | | IKI3 family | 149 |
| | | | FAD/NAD(P)-binding Rossmann fold Superfamily | 124 |
| | | | POZ domain superfamily | 99 |
| 3 | 2,229 | 965 | G-protein superfamily | 166 |
| | | | G-protein alpha subunit | 136 |
| | | | Dynein light intermediate chain | 132 |
| | | | Immunoglobulin superfamily (Ig) | 113 |
| | | | WD-40 repeats (beta-transducin repeats) | 104 |
| | | | Zinc finger family (Zf-C2H2) | 98 |
| | | | IKI3 family | 96 |
| | | | Keratin, high sulfur B2 protein | 96 |
| | | | Protein kinase superfamily (Pkinase) | 93 |
| | | | Zinc finger, C3HC4 type (Ring finger) | 90 |
| 4 | 1,038 | 678 | Immunoglobulin superfamily (Ig) | 90 |
| | | | Atrophin-1 | 84 |
| | | | Keratin, high sulfur B2 protein | 83 |
| | | | Class II histocompatibility antigen, beta domain | 77 |
| | | | Class I histocompatibility antigen, domain alpha 1 and 2 | 77 |
| | | | Extensin-like region | 71 |
| | | | Giardia variant-specific surface protein | 67 |
| | | | Class I histocompatibility antigen, C-terminus | 66 |
| | | | P-loop containing nucleoside triphosphate hydrolase superfamily | 57 |
| | | | Family A G protein-coupled receptor-like superfamily | 56 |
| 5 | 1,136 | 536 | Frizzled/OA1/CAR/Secretin receptor-like superfamily | 676 |
| | | | Family A G protein-coupled receptor-like superfamily | 675 |
| | | | Mammalian taste receptor protein | 664 |
| | | | | 660 |
| | | | | 659 |
| | | | Atrophin-1 | 97 |
| | | | Keratin, high sulfur B2 protein | 92 |
| | | | Giardia variant-specific surface protein | 77 |
| | | | Extensin-like region | 63 |
| | | | Dentin matrix protein 1 | 53 |
a F = Number of functions associated with a protein sequence
b P = Number of protein sequences associated with F number of functions
c T = Number of unique Pfam-A families associated with P number of protein sequences
d N = Frequency of a Pfam-A family in P number of protein sequences
Some commonly occurring functional domain combinations (not ordered) in protein sequences with multiple Pfam-A annotations
| F^a | Different domain combinations | N^b |
| 2 | Kruppel-associated box & Zf-C2H2 zinc finger family | 316 |
| | Immunoglobulin superfamily & Marek's disease glycoprotein A | 206 |
| | IKI3 family & WD-40 repeats | 149 |
| 3 | G-protein alpha subunit & G-protein superfamily & Dynein light intermediate chain | 131 |
| | MHC_I & MHC_II_beta & immunoglobulin superfamily | 50 |
| | POZ domain superfamily & recombination activating protein 2 & kelch repeat superfamily | 50 |
| 4 | MHC_I & MHC_II_beta & immunoglobulin superfamily & MHC_I_C | 66 |
| | Cation transporter/ATPase, N-terminus & E1-E2 ATPase & haloacid dehalogenase (HAD) superfamily & Cation transporter/ATPase, C-terminus | 38 |
| | | 36 |
| 5 | | 659 |
a F = Number of functions associated with a protein sequence
b N = Number of occurrences of corresponding combination
Figure 8. Distribution of protein sequences in single- or multi-node combinations. The curve is best approximated by an exponential decay, $Y = 2.6 \times 10^{4} \cdot e^{-0.65X}$ ($R^2 = 0.99$).
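Reading the fitted curve in this caption as Y = 2.6 × 10^4 × exp(-0.65 X), with X taken to be the number of nodes in a combination (1 to 8, an assumption based on the groups listed for Figure 9), the following sketch evaluates the curve and the constant ratio between consecutive groups. The values are the fitted curve, not the observed counts, and the names are illustrative.

```python
import math

AMPLITUDE = 2.6e4   # pre-factor read from the caption
DECAY_RATE = 0.65   # decay constant read from the caption

def fitted_sequences(n_nodes):
    """Value of the fitted exponential decay for a given number of nodes."""
    return AMPLITUDE * math.exp(-DECAY_RATE * n_nodes)

# Fitted values for 1- to 8-node combinations.
predicted = {n: fitted_sequences(n) for n in range(1, 9)}

# An exponential decay shrinks by the same factor at every step.
step_ratio = fitted_sequences(2) / fitted_sequences(1)
```

The step ratio equals exp(-0.65), about 0.52, so under this fit each additional node of origin roughly halves the expected number of protein sequences.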
Figure 9. Different patterns of nodes of origin for protein sequences, grouped according to the number of nodes of origin of their constituent domains. The codes for the nodes are: A, archaea; B, bacteria; R, archaea+bacteria; E, eukaryota; T, metazoa; C, chordata; M, mammalia; P, primates; H, Homo sapiens. The groups of protein sequences are: (a) 1-node combination, (b) 2-node combination, (c) 3-node combination, (d) 4-node combination, (e) 5-node combination, (f) 6-node combination, (g) 7-node combination and (h) 8-node combination. In each group, the number of colored boxes in each row represents the number of nodes in the combination present in each protein sequence of that group; the number of protein sequences with that node combination is given in the column denoted by '#', and the percentage of those sequences out of the total sequences in that group is given in the column denoted by '%'. The total number of sequences in each group is given in Figure 8.
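The grouping described in this caption can be sketched as follows: collapse the origins of a protein's domains into one ordered node pattern, group the proteins by the number of distinct nodes, then report the count and within-group percentage of each pattern (the '#' and '%' columns). The protein names and domain origins below are hypothetical, and the functions are an illustration rather than the authors' code.

```python
from collections import Counter, defaultdict

NODE_ORDER = "ABRETCMPH"  # node codes from the caption, oldest to youngest

def node_combination(domain_origins):
    """Collapse per-domain origins into an ordered, de-duplicated pattern such as 'ET'."""
    present = set(domain_origins)
    return "".join(code for code in NODE_ORDER if code in present)

def summarize(proteins):
    """Map group size -> {pattern: (count, percentage within the group)}."""
    groups = defaultdict(Counter)
    for origins in proteins.values():
        pattern = node_combination(origins)
        groups[len(pattern)][pattern] += 1
    summary = {}
    for size, patterns in groups.items():
        group_total = sum(patterns.values())
        summary[size] = {
            pattern: (count, 100.0 * count / group_total)
            for pattern, count in patterns.items()
        }
    return summary

# Hypothetical proteins with the node of origin of each constituent domain.
proteins = {
    "prot1": ["E", "E"],
    "prot2": ["T", "E"],
    "prot3": ["E", "T", "E"],
    "prot4": ["R", "C"],
}
summary = summarize(proteins)
```

Here `prot2` and `prot3` fall into the same 2-node pattern 'ET' even though they differ in domain order and domain count, which matches grouping by the set of nodes rather than by domain architecture.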
Figure 10. Cartoon diagram showing the evolution of the metalloproteinase family through different stages of evolution. Each central box represents the insertion of a new Pfam-A domain at the evolutionary node shown in square brackets. The codes for the nodes are: R, archaea+bacteria; E, eukaryota; T, metazoa; C, chordata. SWISS-PROT identifiers for human protein sequences are given in the side boxes along with their domain compositions.