Patrick C F Buchholz, Catharina Zeil, Jürgen Pleiss.
Abstract
The sequence space of five protein superfamilies was investigated by constructing sequence networks. The nodes represent individual sequences, and two nodes are connected by an edge if the global sequence identity of two sequences exceeds a threshold. The networks were characterized by their degree distribution (number of nodes with a given number of neighbors) and by their fractal network dimension. Although the five protein families differed in sequence length, fold, and domain arrangement, their network properties were similar. The fractal network dimension Df was distance-dependent: a high dimension for single and double mutants (Df = 4.0), which dropped to Df = 0.7-1.0 at 90% sequence identity, and increased to Df = 3.5-4.5 below 70% sequence identity. The distance dependency of the network dimension is consistent with evolutionary constraints for functional proteins. While random single and double mutations often result in a functional protein, the accumulation of more than ten mutations is dominated by epistasis. The networks of the five protein families were highly inhomogeneous with few highly connected communities ("hub sequences") and a large number of smaller and less connected communities. The degree distributions followed a power-law distribution with similar scaling exponents close to 1. Because the hub sequences have a large number of functional neighbors, they are expected to be robust toward possible deleterious effects of mutations. Because of their robustness, hub sequences have the potential of high innovability, with additional mutations readily inducing new functions. Therefore, they form hotspots of evolution and are promising candidates as starting points for directed evolution experiments in biotechnology.
Year: 2018 PMID: 30067815 PMCID: PMC6070207 DOI: 10.1371/journal.pone.0200815
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
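The network construction described in the abstract can be illustrated with a minimal Python sketch. It assumes sequences that are already aligned to equal length, so identity reduces to the fraction of matching positions; the study itself uses global sequence identity, which would require a real pairwise alignment. The function names (`identity`, `build_network`, `degree_distribution`), the toy sequences, and the choice to connect nodes at or above the threshold are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions in two pre-aligned, equal-length sequences.

    Assumption: a stand-in for global sequence identity from a real alignment.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_network(seqs, threshold):
    """Map each sequence name to the set of its neighbors.

    Two nodes are connected when their identity is at or above `threshold`
    (boundary handling is an assumption of this sketch).
    """
    neighbors = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if identity(seqs[u], seqs[v]) >= threshold:
            neighbors[u].add(v)
            neighbors[v].add(u)
    return neighbors


def degree_distribution(neighbors):
    """Count how many nodes have each degree (number of neighbors)."""
    return Counter(len(adjacent) for adjacent in neighbors.values())


if __name__ == "__main__":
    # Hypothetical toy sequences: three single mutants around one "hub".
    toy = {
        "hub": "MKTAYIAKQR",
        "m1": "MKTAYIAKQK",
        "m2": "MKSAYIAKQR",
        "m3": "AKTAYIAKQR",
        "far": "GGGGGGGGGG",
    }
    network = build_network(toy, threshold=0.9)
    for degree, count in sorted(degree_distribution(network).items()):
        print(f"degree {degree}: {count} node(s)")
```

In this toy case the single mutants each connect only to the hub, which mirrors, at a very small scale, the inhomogeneous structure the abstract reports. The all-pairs comparison is quadratic in the number of sequences, so families with tens of thousands of sequences would need a faster clustering or search tool in practice.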
Table 1. Overview of the analyzed protein family networks by number of nodes (sequences) and maximal degree (number of neighbors) for a 95% sequence identity threshold, with average sequence length.
| Enzyme family (abbreviation) | Nodes | Maximal degree | Average sequence length (amino acids) |
|---|---|---|---|
| TEM β-lactamases (TEM) | 267 (a) | 86 (a) | 250 |
| β-hydroxyacid dehydrogenases/imine reductases (bHAD) | 17020 | 259 | 320 |
| thiamine diphosphate-dependent decarboxylases (DC) | 24880 | 266 | 580 |
| ω-transaminases (oTA) | 79987 | 381 | 460 |
| short-chain dehydrogenases/reductases (SDR) | 81680 | 312 | 300 |
(a) The small family of TEM β-lactamases is shown as a reference because of its high microdiversity; its values refer to a threshold of 99.5% sequence identity.
Table 2. Overview of the analyzed protein families from Table 1 and their derived parameters.
| Enzyme family | Scale-free exponent γ | Network dimension Df |
|---|---|---|
| TEM | 1.2 (a) | 1.8 |
| bHAD | 1.2 | 1.0 |
| DC | 1.1 | 0.7 |
| oTA | 1.2 | 0.9 |
| SDR | 1.3 | 1.0 |
The scale-free exponent γ refers to sequence identity networks constructed with a pairwise identity threshold of 95% (a: 99.5% threshold for TEM β-lactamases). The network dimension Df refers to the slope in the region of high pairwise sequence identity (>90%).
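As a rough illustration of how a scaling exponent such as γ can be read off a degree distribution, the sketch below fits a straight line to log(count) against log(degree) and returns the negated slope, assuming a power law of the form P(k) ∝ k^(−γ). The fitting procedure used in the study is not stated in this record; ordinary least squares on log-log data is a simple and commonly used choice, though it is known to be less reliable than maximum-likelihood estimators on noisy tails. The function name `fit_power_law_exponent` and the input format are assumptions made for this example.

```python
import math


def fit_power_law_exponent(degree_counts):
    """Estimate gamma in P(k) ~ k**(-gamma) by least squares in log-log space.

    `degree_counts` maps degree k -> number of nodes with that degree.
    Degrees of 0 and empty bins are skipped because log(0) is undefined.
    This is a simple sketch, not necessarily the estimator used in the study.
    """
    points = [
        (math.log(k), math.log(n))
        for k, n in degree_counts.items()
        if k > 0 and n > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two non-empty degree bins with k > 0")

    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    slope = sxy / sxx
    return -slope  # gamma is the negated log-log slope


if __name__ == "__main__":
    # Synthetic, noise-free counts generated with gamma = 1.2 (illustrative only).
    synthetic = {k: 1000.0 * k ** -1.2 for k in range(1, 51)}
    print(round(fit_power_law_exponent(synthetic), 3))
```

The input has the same shape as a degree histogram (degree mapped to node count), so the output of a degree-counting step can be passed in directly.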
Table 3. Exemplary network hubs and their annotations from sequence networks with a threshold of 95% sequence identity (a: 99.5% for TEM β-lactamases) for the protein families from Table 1.
| Family | Annotation | Source | NCBI accession | Degree |
|---|---|---|---|---|
| TEM (a) | β-lactamase TEM-1 | | AAP20891 | 86 |
| bHAD | 2-hydroxy-3-oxopropionate reductase | | WP_001303675 | 259 |
| DC | pyruvate dehydrogenase subunit | | WP_044256366 | 266 |
| oTA | putrescine aminotransferase | | WP_042715413 | 381 |
| oTA | aspartate aminotransferase | | WP_000069444 | 378 |
| SDR | GDP-mannose 4,6-dehydratase | | WP_058338748 | 312 |