Abstract
We present a novel analysis of compositional order (CO) based on the occurrence of Frequent amino-acid Triplets (FTs) that appear much more often than expected at random in protein sequences. The method captures all types of proteomic compositional order, including single amino-acid runs, tandem repeats, periodic structure of motifs and otherwise low-complexity amino-acid regions. We introduce new order measures, distinguishing between 'regularity', 'periodicity' and 'vocabulary', to quantify these phenomena and to facilitate the identification of evolutionary effects. Detailed analysis of representative species across the tree-of-life demonstrates that CO proteins exhibit numerous functional enrichments, including a wide repertoire of particular patterns of dependencies on regularity and periodicity. Comparison between human and mouse proteomes further reveals the interplay of CO with evolutionary trends, such as a faster substitution rate in mouse leading to a decrease in periodicity, while innovation along the human lineage leads to larger regularity. Large-scale analysis of 94 proteomes leads to a systematic ordering of all major taxonomic groups according to FT-vocabulary size. This is measured by the count of Different Frequent Triplets (DFT) in proteomes. The latter provides a clear hierarchical delineation of vertebrates, invertebrates, plants, fungi and prokaryotes, with thermophiles showing the lowest level of FT-vocabulary. Among eukaryotes, this ordering correlates with phylogenetic proximity. Interestingly, in all kingdoms CO accumulation in the proteome has universal characteristics. We suggest that CO is a genomic-information correlate of both macroevolution and various protein functions. The results indicate a mechanism of genomic 'innovation' at the peptide level, involved in protein elongation, shaped in a universal manner by mutational and selective forces.
Year: 2013 PMID: 24278003 PMCID: PMC3836704 DOI: 10.1371/journal.pcbi.1003346
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Typical Examples of proteins containing FTs.
| Protein | Length | # DFTs | Leading FT | MFI | RC, RP | Amino-acid sequence (leading FTs are highlighted) |
| | 770 | 3 | AEEEEETTT | 1 | 0.04, 0.4 | |
| | 312 | 12 | GGG | 1 | 0.72, 0.13 | |
| | 501 | 5 | TPAPATATD | 8 | 0.1, 0.69 | |
| | 455 | 7 | PGP | 6 | 0.18, 0.14 | |
| | 265 | | SGE | 5 | 0.24, 0.35 | |
| | 894 | 28 | HQRHTGTGEGEKYVCVCRCREECG | 28 | 0.36, 0.84 | |
Typical examples of order patterns, as obtained by FT search in the human proteome. For each protein, Swiss-Prot entry name and main function is given in the first column, and then follow the protein length, the number of different frequent-triplets (DFT), the leading FTs, defined by the maximal number of occurrences of a FT, and the CO measures MFI, RC, RP. The leading FTs are highlighted within the protein sequence, displayed in the last column; in some cases they form runs of amino-acids (A–B), while in other cases they form large repetitive motifs of various purities (C–F). See Methods for more details.
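The quantities in this table can be illustrated with a minimal sketch. It assumes that a triplet counts as "frequent" when it occurs at least `min_count` times in a sequence, and that the MFI is the most common distance between consecutive occurrences, pooled over all frequent triplets. Both are simplifying assumptions made for illustration; the exact criteria are given in the paper's Methods, and the function names here are hypothetical.

```python
from collections import Counter, defaultdict

def triplet_positions(seq):
    """Map each overlapping amino-acid triplet to the list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - 2):
        positions[seq[i:i + 3]].append(i)
    return positions

def summarize_fts(seq, min_count=5):
    """Return (DFT count, leading FT, MFI) for one sequence.

    min_count is an assumed frequency threshold, not the paper's exact criterion.
    """
    positions = triplet_positions(seq)
    frequent = {t: p for t, p in positions.items() if len(p) >= min_count}
    if not frequent:
        return 0, None, None
    # Leading FT: the frequent triplet with the most occurrences (first one wins ties).
    leading = max(frequent, key=lambda t: len(frequent[t]))
    # Intervals between consecutive occurrences, pooled over all frequent triplets.
    intervals = Counter()
    for p in frequent.values():
        intervals.update(b - a for a, b in zip(p, p[1:]))
    mfi = intervals.most_common(1)[0][0] if intervals else None
    return len(frequent), leading, mfi

print(summarize_fts("G" * 20))      # a homo-peptide run
print(summarize_fts("ACDEF" * 6))   # a perfect repeat of a 5-residue motif
```

Under these assumptions, a run such as `"G" * 20` yields a single frequent triplet with MFI = 1, while a perfect repeat of a 5-residue motif yields MFI = 5, mirroring the distinction between runs and repetitive motifs described in the caption.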
Figure 1. Analysis of Swiss-Prot human proteome.
Analysis of Swiss-Prot human proteome (n = 20248) containing 5511 CO proteins. A) Histogram of the most frequent intervals, MFI, demonstrates the significant periodic structures originating in ‘runs’ of homo-peptides (MFI = 1) and zinc-fingers (MFI = 28). B) The frequency of intervals of all FTs in all proteins (black circles). The outstanding symbols are mostly due to zinc-finger proteins, which form repetitive sections of 28 amino acids. Multiples of this interval, at 56 and 84 amino acids, are also evident due to mutations acting on these sections. The superimposed red dots display the data in a rank-ordered manner (i.e., the x-axis takes on the role of rank rather than value of interval). C) The number of periodic proteins as defined by the number of FT occurrences at MFI. The bars indicate the fraction of CO proteins with exactly 2–20 (x-axis) occurrences at MFI. 20% of CO proteins are non-periodic (NP). Circles represent the cumulative fraction of proteins with a number of repetitions at MFI above the value indicated by the x-axis. Thus, for a minimum of 4 repeats at MFI (i.e., x = 3), more than 50% of CO proteins have periodic structures.
Examples of compositional order and functional enrichment.
| Function | Within the proteome | Within CO proteins | Mean RP | Mean RC |
| Disease | 2755 (13.6%) | 903 (16.4%) | 0.3 | 0.1 |
| Zinc Fingers | 1799 (8.9%) | 977 (17.7%) | | 0.17 |
| Collagen | 166 (0.8%) | 87 (1.6%) | 0.21 | |
| Keratin | 162 (0.8%) | 100 (1.8%) | 0.27 | |
Examples of selected functional groups with high CO in human. Based on Swiss-Prot records, the portions of each functional group in the entire proteome and within the CO set (i.e., proteins containing FTs) are given in numbers and percentages. Last columns indicate the average RC and RP, which should be compared with the overall mean values of 0.1 (RC) and 0.35 (RP) in the CO set (n = 5511).
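As a rough illustration of how the two percentage columns relate, the sketch below computes a simple fold enrichment: the share of a functional group within the CO set divided by its share within the whole proteome. The set sizes (20248 and 5511) are those quoted in the Figure 1 caption; the ratio itself is an illustrative summary chosen here, not necessarily the statistic used by the authors.

```python
PROTEOME_SIZE = 20248  # Swiss-Prot human proteome size quoted in the Figure 1 caption
CO_SIZE = 5511         # number of CO proteins quoted in the same caption

# (count in proteome, count in CO set), copied from the table
groups = {
    "Disease": (2755, 903),
    "Zinc Fingers": (1799, 977),
    "Collagen": (166, 87),
    "Keratin": (162, 100),
}

def fold_enrichment(n_proteome, n_co):
    """Share of the group in the CO set relative to its share in the proteome."""
    return (n_co / CO_SIZE) / (n_proteome / PROTEOME_SIZE)

for name, (n_all, n_co) in groups.items():
    print(f"{name}: {fold_enrichment(n_all, n_co):.2f}x")
```

A value above 1 means the group is over-represented among CO proteins; for zinc fingers, for example, the ratio comes out close to 2, consistent with 17.7% versus 8.9%.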
Figure 2. Repertoire of functional enrichments in human proteome.
Repertoire of enrichment dependencies of GO (gene ontology) terms on the order measures of regularity, RC (black), and periodicity, RP (red). Portions of proteins belonging to a functional group are estimated based on text search in GO terms (see Methods) and plotted on double axes against increasing thresholds of RC (lower x-axis) and RP (upper x-axis). A) The portions of some terms that are enriched with increasing threshold of RC but not of RP, like keratin (solid) and collagen (dotted). This category also includes filament and cell-adhesion related proteins. B) GO terms that are enriched only for increased RP threshold but not RC, such as neuronal related proteins (solid) and immune system proteins (dotted). These also include synaptic function and cell response genes. C) Other terms, like extracellular region, are enriched with increasing threshold of both RC and RP. D) Some functionalities show more complicated non-monotonic “bump” behaviors. These include DNA-binding, regulation and transcription. As an example, DNA-binding proteins are further analyzed, showing functional dependencies on RP and RC of both repetitive sections (MFI>1) and runs (MFI = 1). E) MFI = 1 proteins exhibit a stable enrichment pattern as a function of the threshold on the sum of repetitions at MFI = 1 (i.e., the effective coverage of all amino-acid “runs”). F) Disease related proteins are enriched with increasing length of runs. In each plot, the portion of the corresponding GO term in the entire CO Swiss-Prot reviewed proteome is the value displayed at 0.
Figure 3. Functional enrichment in A. thaliana and S. cerevisiae.
Similarly to figure 2, functional enrichment in A. thaliana (A–C) and S. cerevisiae (D–F) is shown with respect to RC (black) or RP (red). Portions of cell wall genes (A, D) and extracellular related genes (C, F) are enriched with increasing threshold of RC, while portions of response related genes (B, E) are enriched with RP in A. thaliana but with RC in yeast.
Human and Mouse proteomes analysis.
| Human | Mouse | Orthology | # of proteins | % of Human proteome | % of Mouse proteome |
| CO | CO | V | 3312 | 16.4% | 20% |
| CO | NO | V | 831 | 4.1% | 5% |
| NO | CO | V | 626 | 3.1% | 3.8% |
| NO | NO | V | 10557 | 52.1% | 63.9% |
| CO | — | X | 1368 | 6.8% | — |
| NO | — | X | 3554 | 17.6% | — |
| — | CO | X | 125 | — | 0.8% |
| — | NO | X | 1062 | — | 6.4% |
Human and mouse proteomes were decomposed into compositionally ordered (CO) and non-ordered (NO) subsets as well as into Orthologous (V) and non-orthologous (X) proteins.
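The counts in this table can be cross-checked against the set sizes quoted elsewhere in the record (20248 human proteins, 5511 human CO proteins, 4063 mouse CO proteins). The short sketch below does that arithmetic; the row encoding and the helper `percent` are illustrative choices, and the mouse proteome size is derived from the table rather than stated in the text.

```python
# (human status, mouse status, orthologous?, protein count), copied from the table;
# None marks the species in which the protein has no ortholog.
rows = [
    ("CO", "CO", True, 3312),
    ("CO", "NO", True, 831),
    ("NO", "CO", True, 626),
    ("NO", "NO", True, 10557),
    ("CO", None, False, 1368),
    ("NO", None, False, 3554),
    (None, "CO", False, 125),
    (None, "NO", False, 1062),
]

human_total = sum(n for h, _, _, n in rows if h is not None)
mouse_total = sum(n for _, m, _, n in rows if m is not None)
human_co = sum(n for h, _, _, n in rows if h == "CO")
mouse_co = sum(n for _, m, _, n in rows if m == "CO")

def percent(n, total):
    """Percentage of n in total, rounded to one decimal place."""
    return round(100 * n / total, 1)

print(human_total, human_co)  # totals implied for human
print(mouse_total, mouse_co)  # totals implied for mouse
print(percent(3312, human_total), percent(3312, mouse_total))
```

The human rows sum to the proteome size given in the Figure 1 caption, and the CO rows of each species sum to the CO set sizes used in the human and mouse enrichment table.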
Figure 4. Comparison of CO orthologs in human and mouse.
Comparison of CO orthologs in human and mouse according to their RC values. Each point corresponds to a pair of such proteins (n = 3312). Low homologies are marked by circles. Usually, their CO sections are comparable, although they reveal higher harmonics in the mouse (Text S1 - section 7, figure S11). Off-diagonal pairs always display low homologies. In the upper-left diagonal, CO sections of human and mouse resemble each other, having high similarity of FTs and MFI, despite the low RC in human. In the lower-right diagonal, mouse CO sections do not resemble human CO sections, with few exceptions (see text). High homology is obtained for protein pairs with similar MFIs, such as zinc finger (MFI = 28), collagen (MFI = 3) and keratin (MFI = 5) proteins, which lie along the diagonal.
Human and Mouse CO set – Enrichment by RC and RP.
| Species | CO set name | Orthology | # of CO proteins | RP | RC |
| Human (n = 5511) | H1 | V (CO in mouse) | 3312 | 0.33 | 0.09 |
| | H2 | V (NO in mouse) | 831 | | |
| | H3 | X | 1368 | | |
| Mouse (n = 4063) | M1 | V (CO in human) | 3312 | 0.33 | 0.08 |
| | M2 | V (NO in human) | 626 | | |
| | M3 | X | 125 | 0.34 | |
The CO sets of human (H) and mouse (M) are decomposed into CO orthologous proteins (V) that appear in both species (H1, M1), orthologous proteins that are CO in one species but not (NO) in the other (H2, M2), and non-orthologous genes (X) belonging to the CO sets (H3, M3). The values of RC and RP are shown for each subgroup in each species. P-values correspond to the two-sample Kolmogorov-Smirnov test of each group in a species compared with subgroup 1 of the same species (i.e., H1 and M1, respectively).
List of 94 species.
| Index | DFT | Species | Taxonomy |
| 1 | 5076 | Human (Homo sapiens) | Animal (V) |
| 2 | 4333 | Chimpanzee (Pan troglodytes) | Animal (V) |
| 3 | 4873 | Mouse (Mus musculus) | Animal (V) |
| 4 | 4815 | Rat (Rattus norvegicus) | Animal (V) |
| 5 | 4559 | Dog (Canis lupus familiaris) | Animal (V) |
| 6 | 2901 | Platypus (Ornithorhynchus anatinus) | Animal (V) |
| 7 | 4419 | Chicken (Gallus gallus) | Animal (V) |
| 8 | 3216 | Zebra Finch (Taeniopygia guttata) | Animal (V) |
| 9 | 3989 | Lizard (Anolis carolinensis) | Animal (V) |
| 10 | 5299 | Zebrafish (Danio rerio) | Animal (V) |
| 11 | 4019 | Sea Squirt (Ciona intestinalis) | Animal (IV) |
| 12 | 4146 | Fruit Fly (Drosophila melanogaster) | Animal (IV) |
| 13 | 3518 | Mosquito (Anopheles gambiae) | Animal (IV) |
| 14 | 3225 | Bee (Apis mellifera) | Animal (IV) |
| 15 | 3722 | Nematode (C. elegans) | Animal (IV) |
| 16 | 2630 | Nematode (Brugia malayi) | Animal (IV) |
| 17 | 2262 | Arabidopsis thaliana | Plant |
| 18 | 2785 | Medicago truncatula | Plant |
| 19 | 2094 | Populus trichocarpa | Plant |
| 20 | 2286 | Physcomitrella patens | Plant |
| 21 | 2770 | Chlamydomonas reinhardtii | Plant |
| 22 | 1846 | Rice (Oryza sativa Japonica) | Plant |
| 23 | 1993 | Sorghum bicolor | Plant |
| 24 | 1037 | Maize (Zea mays) | Plant |
| 25 | 1838 | Nectria haematococca | Fungi |
| 26 | 1858 | Botryotinia fuckeliana B05.10 | Fungi |
| 27 | 1411 | Aspergillus niger CBS 513.88 | Fungi |
| 28 | 936 | Ajellomyces capsulatus NAm1 | Fungi |
| 29 | 1439 | Candida albicans SC5314 | Fungi |
| 30 | 1112 | Candida albicans WO1 | Fungi |
| 31 | 1077 | S. cerevisiae | Fungi |
| 32 | 1033 | S. pombe | Fungi |
| 33 | 2990 | Dictyostelium discoideum | Protista |
| 34 | 1380 | Entamoeba histolytica | Protista |
| 35 | 2319 | Leishmania major | Protista |
| 36 | 2740 | Phytophthora infestans | Protista |
| 37 | 1230 | Plasmodium chabaudi | Protista |
| 38 | 3404 | Plasmodium vivax | Protista |
| 39 | 2129 | Thalassiosira pseudonana | Protista |
| 40 | 823 | Staphylococcus aureus MRSA252 | Bacteria, Firmicutes |
| 41 | 644 | Bacillus anthracis AMES | Bacteria, Firmicutes |
| 42 | 432 | Bacillus subtilis str168 | Bacteria, Firmicutes (T) |
| 43 | 413 | Symbiobacterium thermophilum | Bacteria, Firmicutes |
| 44 | 344 | Mycoplasma penetrans HF-2 | Bacteria, Firmicutes |
| 45 | 332 | Alicyclobacillus acidocaldarius | Bacteria, Firmicutes (T) |
| 46 | 210 | Lactococcus lactis cremoris MG1363 | Bacteria, Firmicutes |
| 47 | 154 | Caldocellum saccharolyticum | Bacteria, Firmicutes (T) |
| 48 | 137 | Streptococcus agalactiae NEM316 | Bacteria, Firmicutes |
| 49 | 1007 | Streptomyces coelicolor A3(2) | Bacteria, Actinobacteria |
| 50 | 670 | Mycobacterium tuberculosis CDC1551 | Bacteria, Actinobacteria |
| 51 | 454 | Arthrobacter aurescens TC1 | Bacteria, Actinobacteria |
| 52 | 274 | Corynebacterium glutamicum ATCC13032 | Bacteria, Actinobacteria |
| 53 | 2452 | Chlorobium chlorochromatii CaD3 | Bacteria, Chlorobi |
| 54 | 202 | Bacteroides thetaiotaomicron VPI-5482 | Bacteria, Bacteroidetes |
| 55 | 179 | Bacteroides fragilis YCH46 | Bacteria, Bacteroidetes |
| 56 | 126 | Bacteroides caccae ATCC 43185 | Bacteria, Bacteroidetes |
| 57 | 83 | Chlamydophila pneumoniae AR39 | Bacteria, Chlamydiae |
| 58 | 90 | Chlamydia trachomatis A2497 | Bacteria, Chlamydiae |
| 59 | 330 | Fusobacterium nucleatum ATCC 25586 | Bacteria, Fusobacteria |
| 60 | 77 | Thermotoga maritima | Bacteria, Thermotogae (T) |
| 61 | 45 | Thermotoga lettingae TMO | Bacteria, Thermotogae (T) |
| 62 | 82 | Aquifex aeolicus | Bacteria, Aquificae (T) |
| 63 | 267 | Thermomicrobium roseum | Bacteria, Chloroflexi (T) |
| 64 | 261 | Thermus thermophilus | Bacteria, Deinococcus-Thermus (T) |
| 65 | 1627 | Nostoc punctiforme PCC 73102 | Bacteria, Cyanobacteria, Nostocaceae |
| 66 | 630 | Gloeobacter violaceus PCC 7421 | Bacteria, Cyanobacteria, Gloeobacteraceae |
| 67 | 402 | Prochlorococcus marinus MIT 9303 | Bacteria, Cyanobacteria, Synechococcaceae |
| 68 | 1167 | Geobacter uraniireducens Rf4 | Bacteria, Proteobacteria, Delta |
| 69 | 543 | Yersinia pestis Antiqua | Bacteria, Proteobacteria, Gamma |
| 70 | 482 | Shewanella baltica OS155 | Bacteria, Proteobacteria, Gamma |
| 71 | 432 | Bordetella pertussis Tohama I | Bacteria, Proteobacteria, Beta |
| 72 | 403 | Caulobacter crescentus CB15 | Bacteria, Proteobacteria, Alpha |
| 73 | 268 | Brucella suis 1330 | Bacteria, Proteobacteria, Alpha |
| 74 | 249 | E. coli K12 MG1655 | Bacteria, Proteobacteria, Gamma |
| 75 | 100 | Helicobacter cinaedi CCUG 18818 | Bacteria, Proteobacteria, Epsilon |
| 76 | 1665 | Cenarchaeum symbiosum A | Archaea |
| 77 | 883 | Nitrosopumilus maritimus SCM1 | Archaea |
| 78 | 582 | Methanosphaera stadtmanae | Archaea |
| 79 | 522 | Haloquadratum walsbyi | Archaea |
| 80 | 495 | Methanospirillum hungatei | Archaea |
| 81 | 285 | Natronomonas pharaonis | Archaea |
| 82 | 193 | Halobacterium salinarum R1 | Archaea |
| 83 | 182 | Methanopyrus kandleri | Archaea (T) |
| 84 | 173 | Pyrobaculum aerophilum | Archaea (T) |
| 85 | 141 | Aeropyrum pernix K1 | Archaea (T) |
| 86 | 141 | Methanococcus maripaludis | Archaea |
| 87 | 130 | Metallosphaera sedula | Archaea (T) |
| 88 | 124 | Sulfolobus solfataricus | Archaea (T) |
| 89 | 114 | Methanothermobacter thermautotrophicus | Archaea (T) |
| 90 | 99 | Archaeoglobus fulgidus DSM4304 | Archaea (T) |
| 91 | 98 | Picrophilus torridus DSM9790 | Archaea (T) |
| 92 | 73 | Pyrococcus furiosus | Archaea (T) |
| 93 | 73 | Pyrococcus abyssi GE5 | Archaea (T) |
| 94 | 48 | Nanoarchaeum equitans Kin4-M | Archaea (T) |
List of the 94 species distributed across the tree-of-life studied in the large-scale analysis and their taxonomic identities, Eukaryotes (1–39) and Prokaryotes (40–94). The ordering of species is according to the tree-of-life [47]. Within Eukaryotes, kingdoms are first ordered from Animalia to Plantae (P) to Fungi (F). Animalia are classified as vertebrates (V) and invertebrates (IV). Within each kingdom, ordering is according to the phylogenetic distance from the first species, i.e. Human within Animalia, A. thaliana within Plantae and Nectria within Fungi. Protista (PRT) are added at the end with no phylogenetic analysis. Bacteria are also ordered according to the phylum as presented in [47], where within each phylum the ordering is according to DFT counts. Archaea are ordered by DFT counts. Mesophiles (M) and Thermophiles (T) are indicated.
Figure 5. DFT Box-plot by Kingdom.
Box plots of DFT counts across the tree-of-life. Each box delineates lower quartile, median and upper quartile values. Most extreme values (whiskers) are within 1.5 times the inter-quartile range from the ends of the box. Outliers are also displayed. Prokaryotes are displayed twice: first divided into bacteria and archaea, and second into mesophiles and thermophiles. P-values according to the non-parametric two-sample Kolmogorov-Smirnov test are 2.5×10⁻² (V-IV), 6.5×10⁻³ (IV-P), 9×10⁻³ (P-F), 1.7×10⁻⁵ (F-B), 2.3×10⁻² (B-A) and 1.4×10⁻⁴ (M-T). Protista species show large variability and cannot be distinguished from Plantae or Fungi. Abbreviations: Vertebrates (V), Invertebrates (IV), Plantae (P), Fungi (F), Protista (PRT), Bacteria (B), Archaea (A), Mesophiles (M), Thermophiles (T).
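To make the group comparison concrete, the following sketch takes the vertebrate and invertebrate DFT counts from the species table and computes their medians together with the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical cumulative distributions. It deliberately stops at the statistic; converting it into a p-value is left to a statistics library, and the function name `ks_statistic` is an illustrative choice.

```python
from statistics import median

# DFT counts copied from the species table (indices 1-10 and 11-16).
vertebrates = [5076, 4333, 4873, 4815, 4559, 2901, 4419, 3216, 3989, 5299]
invertebrates = [4019, 4146, 3518, 3225, 3722, 2630]

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # The empirical CDFs are step functions, so the maximum gap is reached at a data point.
    for x in a + b:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

print(median(vertebrates), median(invertebrates))
print(ks_statistic(vertebrates, invertebrates))
```

With these inputs the vertebrate median lies above the invertebrate median, in line with the ordering shown in the box plots.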
Figure 6. DFT enrichment in eukaryotes.
DFT count and correlation C of the 39 studied eukaryotes. Species are indexed and ordered as in table 5, according to the kingdoms Animalia, Plantae, Fungi and, within each kingdom, according to their phylogenetic distance. The upper panel shows the heat-map of the correlation C, the middle panel shows the DFT counts, and the lower panel shows the tree of hierarchical clustering based on Euclidean average distance of C. Colors of the branches correspond to the taxonomic identity as indicated by the colored abbreviations in the middle panel. Abbreviations are the same as defined in figure 5. A solid gray branch corresponds to two proximate end leaves belonging to different taxonomic groups. Dashed gray branches link groups.
Figure 7. DFT enrichment in prokaryotes.
DFT count and correlation C of the 55 studied prokaryotes. Bacteria are grouped into phyla, which are ordered according to their phylogenetic distance, from Firmicutes to Proteobacteria, and within each phylum species are ordered by DFT counts. Archaea are ordered by DFT counts. The upper panel displays the heat-map of C; the lower panel displays DFT counts (red points indicate thermophiles). The color scale differs from that of figure 6 in order to trace trends that extend over several orders of magnitude. Abbreviations: Firmicutes (Firm); Actinobacteria (Act); Bacteroidetes (Bac); Chlamydiae (Ch); Cyanobacteria (Cya); Proteobacteria (Proto); Mesophiles (M); Thermophiles (T).
Predominant FTs in selected species.
| Rank | Human | Mouse | Fly | C. elegans | A. thaliana | S. cerevisiae | E. coli |
| 1 | | | | | | | |
| 2 | | | | | | | LLA |
| 3 | | | | | | | |
| 4 | | | | | | | LAA |
| 5 | | | | | | | ALL |
| 6 | CGK | | | | | SST | EAA |
| 7 | HTG | GEK | | | | | GRL |
| 8 | GEK | HTG | SGS | | | TSS | RLT |
| 9 | TGE | CGK | GSG | STS | | STS | AAG |
| 10 | | EKP | | TSS | | | AAK |
| 11 | EKP | TGE | GSS | PPG | SSL | NSS | AEA |
| 12 | ECG | KPY | QQH | PGP | DLS | | ALA |
| 13 | KPY | | SST | | LLS | SSL | APA |
| 14 | | KAF | | GPP | | | DRL |
| 15 | KAF | SSL | SSG | APG | SLL | SKK | LAE |
| 16 | GKA | | | GAP | SPS | LSS | LAL |
| 17 | IHT | GKA | SGG | APA | LDL | ATT | LLG |
| 18 | | SPS | SAS | SST | LSG | NSN | |
| 19 | PGP | PGP | TSS | PAP | LSS | | RYD |
| 20 | HQR | PSP | | STT | SPP | SLS | TLT |
List of predominant FTs in several species. FTs are ranked according to the number of CO proteins in which they are found. FTs containing a single amino acid, which represent amino-acid runs in the protein sequence, are highlighted. The latter are significantly more abundant in Eukaryotes (see figure S13).
Figure 8. Universal DFT accumulation in proteomes.
Probability of a number of DFT in a protein, on a log-log scale, for 32 eukaryote proteomes, colored differently for Animalia (red), Plantae (green) and Fungi (yellow). A few FTs occur quite often in the proteome while many FTs are rare. The cases of human and E. coli are shown as specific examples. All individual eukaryote species are very well fitted by a pure power-law (see Text S1 - section 9). E. coli serves as an example of a typical prokaryote.
Figure 9. Universal dependence of RP and DFT on protein length.
The relationship, on a log-log scale, between the CO measures RP, RC and DFT and protein length, L. The upper panel (A–C) displays human proteins, indicating strong correlation of RP (A) and DFT (C) but not RC (B) with length; ρ indicates the Pearson correlation coefficient. A clear linear boundary in RC is due to its lower bound 3/L. Linear regression analysis shows excellent power-law fits of RP and DFT dependence on L. Data were binned into 50 equally spaced intervals along the y-axis. ‘X’ symbols denote the average of L in each bin; the error (SD) on the mean is the size of the symbol and therefore not shown. The blue line is the result of a linear regression fit. The middle panel (D–F) shows a superposition of RP-L data for all species (D) and the quality of its linear regression fits in (E, F). Slopes increase from Eukaryotes to Prokaryotes (E), coupled with a decrease in the goodness of fit (F). The lower panel (G–I) is the same type of analysis for DFT-L dependence. Note that the slope trends are opposite. The ratio of the RP-L and DFT-L slopes is close to −1 in all species: it is −1.11±0.05 in eukaryotes. In prokaryotes, excluding 9 outliers, the ratio is −0.85±0.05.
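A power-law dependence y = a·L^b appears as a straight line of slope b on a log-log plot, which is why a linear regression on the logarithms recovers the exponent. The sketch below shows that step in isolation with ordinary least squares. The data are hypothetical and noise-free, chosen only so that the recovered parameters are easy to verify; the binning along the y-axis described in the caption is not reproduced.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + b*log(x); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                 # slope on the log-log plot = power-law exponent
    a = math.exp(my - b * mx)     # intercept on the log-log plot = log of the prefactor
    return a, b

# Hypothetical, noise-free data following y = 2 * L**-0.5 (not taken from the paper).
lengths = [100, 200, 400, 800, 1600]
values = [2 * L ** -0.5 for L in lengths]
a, b = fit_power_law(lengths, values)
print(a, b)
```

Applying the same fit separately to RP versus L and to DFT versus L would give the two slopes whose ratio is discussed in the caption.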
Figure 10. Frequent Triplets – Theory and simulation.
Expected values of Frequent Triplets (FTs) in random proteins as a function of sequence length. The length range is up to 35,000 amino acids, approximately the length of the longest proteins found among the proteomes of the 94 species studied (TITIN in human, and beta-helical in Chlorobium). A) The blue curve is the theoretical expected value given by the Bernoulli probability, for n = 5. Dark circles are the corresponding results of a numerical search of triplets, showing a perfect match to the theoretical estimation. Red circles are the numerical results for restrictive FTs defined by n = 5 and M = 2000. Inset: the same data are shown up to L = 8000 for clarity. Additional black curves represent the theoretical estimation for n = 4–6. B) P-value for FT misidentification as a function of length on a log scale. C) Length distribution of human proteins showing log-normal characteristics. The length of CO proteins is right-shifted (see also Text S1 - section 3, figure S6d). Further analysis based on a human “unigram” reference model is provided in Text S1 - sections 1 and 2, where the few very long proteins are analyzed in detail.
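One way to picture the theoretical curve in panel A is the following back-of-the-envelope model, offered as a sketch under stated assumptions rather than as the authors' formula. Assume residues are independent and equiprobable, so each of the 20³ = 8000 triplets matches a given window with probability 1/8000, and treat the L − 2 overlapping windows as independent Bernoulli trials. The expected number of distinct triplets seen at least n times is then 8000 times a binomial tail probability. The roles of the real amino-acid composition and of the parameter M are not modeled here.

```python
import math

def prob_at_least(n, trials, p):
    """P(X >= n) for X ~ Binomial(trials, p)."""
    below = sum(math.comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(n))
    return 1.0 - below

def expected_random_fts(length, n=5, alphabet=20):
    """Expected number of distinct triplets seen at least n times in a random sequence."""
    n_triplets = alphabet ** 3
    p = 1.0 / n_triplets          # assumes equiprobable, independent residues
    trials = max(length - 2, 0)   # number of overlapping triplet windows
    return n_triplets * prob_at_least(n, trials, p)

for length in (1000, 8000, 35000):
    print(length, expected_random_fts(length))
```

Under this simplified model the expectation is negligible for proteins of typical length and only becomes appreciable for sequences tens of thousands of residues long, which is the qualitative behavior the caption describes for random proteins.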