Patrick C F Buchholz, Silvia Fademrecht, Jürgen Pleiss.
Abstract
The currently known protein sequences are not distributed equally in sequence space, but cluster into families. Analyzing the cluster size distribution gives a glimpse of the large and unknown extant protein sequence space, which has been explored during evolution. For six protein superfamilies with different fold and function, the cluster size distributions followed a power law with slopes between 2.4 and 3.3, which represent upper limits to the cluster distribution of extant sequences. The power law distribution of cluster sizes is in accordance with percolation theory and strongly supports connectedness of extant sequence space. Percolation of extant sequence space has three major consequences: (1) It transforms our view of sequence space as a highly connected network where each sequence has multiple neighbors, and each pair of sequences is connected by many different paths. A high degree of connectedness is a necessary condition of efficient evolution, because it overcomes the possible blockage by sign epistasis and reciprocal sign epistasis. (2) The Fisher exponent is an indicator of connectedness and saturation of sequence space of each protein superfamily. (3) All clusters are expected to be connected by extant sequences that become apparent as a higher portion of extant sequence space becomes known. Being linked to biochemically distinct homologous families, bridging sequences are promising enzyme candidates for applications in biotechnology because they are expected to have substrate ambiguity or catalytic promiscuity.
Year: 2017 PMID: 29261740 PMCID: PMC5738032 DOI: 10.1371/journal.pone.0189646
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Protein superfamily size and the Fisher exponent extrapolated to 100% sequence identity (τ100) of the six protein families.
| Abbreviation | Enzyme superfamily | Superfamily size | τ100 |
|---|---|---|---|
| abH | α/β hydrolases | 395000 | 2.6 |
| SDR | short-chain dehydrogenases/reductases | 141000 | 2.4 |
| oTA | ω-transaminases | 121000 | 2.3 |
| CYP | cytochrome P450 monooxygenases | 53000 | 3.3 |
| DC | thiamine diphosphate-dependent decarboxylases | 39000 | 2.8 |
| bHAD | β-hydroxyacid dehydrogenases/imine reductases | 31000 | 2.5 |
Fig 2. Cluster size distributions.
The cluster size distributions of α/β hydrolases (abH), short-chain dehydrogenases/reductases (SDR), ω-transaminases (oTA), cytochrome P450 monooxygenases (CYP), thiamine diphosphate-dependent decarboxylases (DC), and β-hydroxyacid dehydrogenases/imine reductases (bHAD) follow a power law: $N(s) \sim s^{-\tau}$ (N(s), number of clusters of size s; τ, Fisher exponent). Cluster criterion: 60% global sequence identity.
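The power law in the caption can be illustrated with a short sketch that estimates a Fisher exponent from a list of cluster sizes. This record does not say how the authors performed their fit, so the approach here (ordinary least squares on log N(s) against log s) is an assumption chosen for simplicity, and the function name `estimate_fisher_exponent` and the synthetic input are hypothetical.

```python
import math
from collections import Counter


def estimate_fisher_exponent(cluster_sizes):
    """Estimate tau in N(s) ~ s**(-tau) by least squares on log-log data.

    cluster_sizes: iterable of positive integers, one entry per cluster.
    This is an illustrative fit, not necessarily the authors' procedure.
    """
    counts = Counter(cluster_sizes)  # N(s): number of clusters of size s
    if len(counts) < 2:
        raise ValueError("need at least two distinct cluster sizes")

    xs = [math.log(s) for s in counts]           # log s
    ys = [math.log(n) for n in counts.values()]  # log N(s), same order as xs

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx  # slope of log N(s) versus log s
    return -slope      # the power law slope is -tau


# Synthetic input constructed so that N(s) = 1000 * s**(-2) exactly
# (made-up data for illustration, not taken from the paper).
sizes = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
tau = estimate_fisher_exponent(sizes)
print(f"estimated tau: {tau:.3f}")  # should be close to 2 for this input
```

On real cluster data, a plain log-log regression is sensitive to the sparsely populated tail of large clusters, so binning or a maximum-likelihood estimator may be a better choice.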