Charly Empereur-Mot, Hector Garcia-Seisdedos, Nadav Elad, Sucharita Dey, Emmanuel D Levy.
Abstract
Proteins can self-associate with copies of themselves to form symmetric complexes called homomers. Homomers are widespread in all kingdoms of life and allow for unique geometric and functional properties, as reflected in viral capsids or allostery. Once a protein forms a homomer, however, its internal symmetry can compound the effect of point mutations and trigger uncontrolled self-assembly into high-order structures. We identified mutation hot spots for supramolecular assembly, which are predictable by geometry. Here, we present a dataset of descriptors that characterize these hot spot positions both geometrically and chemically, as well as computer scripts allowing the calculation and visualization of these properties for homomers of choice. Since the biological relevance of homomers is not readily available from their X-ray crystallographic structure, we also provide reliability estimates obtained by methods we recently developed. These data have implications for the study of disease-causing mutations and protein evolution, and can be exploited in the design of biomaterials.
Year: 2019 PMID: 31101822 PMCID: PMC6525250 DOI: 10.1038/s41597-019-0058-x
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1. Principle of calculation of different versions of the normal distance to the closest bounding plane (nDp) visualized on the dihedral structure of isoaspartyl dipeptidase. (a) Coloration of the biological assembly of isoaspartyl dipeptidase by subunits (PDB accession 1POK[35]). Symmetry axes appear in green (2-fold axes) and red (4-fold axis). (b) Residues are assigned to their closest bounding plane. For this D4 complex, bounding planes originate from either 2- or 4-fold axes (grey and brown, respectively). (c) Visualization of the nDp-2-fold. (d) Visualization of the nDp-n-fold, where n = 4 in the case of this D4 complex. (e) Visualization of the nDp, which is relative to all bounding planes of the assembly independently of axes folds.
Fig. 2. Workflow used to calculate the ‘environment stickiness’ of a residue illustrated on the dihedral structure of isoaspartyl dipeptidase (PDB accession 1POK). (a) Calculation of the ‘stickiness’ scale. Surface and interface regions are defined for each protein of the dataset[4]. The stickiness of an amino acid is then defined as the log-ratio of its frequency at protein-protein interfaces relative to solvent-exposed surfaces[12]. (b) The environment of a residue of interest is defined by surface residues within a 400 Å² patch centered on the Cα of the residue of interest[12]. The central residue is excluded from the calculation. (c) Projection of the environment stickiness on isoaspartyl dipeptidase. Residues protected by low interaction propensity environments appear in blue.
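The log-ratio definition in panel (a) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Perl implementation: the function names, the use of the natural logarithm, the toy residue lists, and the choice to average stickiness values over the patch are all assumptions made for the example. The caption only states that the central residue is excluded and that the patch contains surface residues.

```python
import math
from collections import Counter

def stickiness_scale(interface_residues, surface_residues):
    """Log-ratio of each amino acid's frequency at interfaces vs. exposed surfaces.

    Natural log is an assumption; the caption does not state the base.
    """
    iface = Counter(interface_residues)
    surf = Counter(surface_residues)
    n_iface = sum(iface.values())
    n_surf = sum(surf.values())
    scale = {}
    # Only amino acids seen in both regions have a defined log-ratio.
    for aa in iface.keys() & surf.keys():
        scale[aa] = math.log((iface[aa] / n_iface) / (surf[aa] / n_surf))
    return scale

def environment_stickiness(scale, patch_residues):
    """Mean stickiness over a patch (averaging is an assumption).

    The caller is expected to have removed the central residue already.
    """
    values = [scale[aa] for aa in patch_residues if aa in scale]
    return sum(values) / len(values) if values else None

# Toy, made-up residue lists (one-letter codes), for illustration only.
interface = list("LLLLKA")
surface = list("LKKKKA")
scale = stickiness_scale(interface, surface)
patch_value = environment_stickiness(scale, list("LK"))
```

With these toy counts, leucine is enriched at the interface and gets a positive value, lysine is depleted and gets a negative value, and alanine, equally frequent in both regions, scores zero.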
Fig. 3. Benchmark of individual methods and of their integration into QSbio. ROC curves are shown for each method with their respective area under the curve (AUC) values, separately for monomers, dimers and larger oligomers. The benchmark was carried out as described previously[20], using the manually curated PiQSi database as a gold-standard dataset.
Overview of table contents.
| Table name | Content | No. of assemblies | Rows | Columns | File size |
|---|---|---|---|---|---|
| protein_assemblies_description | Assemblies descriptors | 165,916 | 165,916 | 13 | 8.8 MB |
| protein_assemblies_symmetry_axes | Axes coordinates | 69,191 | 105,965 | 5 | 3.2 MB |
| residues_all_sym_protein_assemblies | Residue descriptors | 69,922 | 56,547,328 | 17 | 4.329 GB |
| residues_all_asym_protein_assemblies | Residue descriptors | 95,994 | 32,035,629 | 14 | 2.145 GB |
| residues_h80_sym_protein_assemblies | Residue descriptors | 20,820 | 16,649,091 | 17 | 1.278 GB |
| residues_h80_asym_protein_assemblies | Residue descriptors | 19,289 | 7,024,731 | 14 | 468.7 MB |
File sizes are for uncompressed tables. Although the data present in tables ‘residues_h80_sym_protein_assemblies’ and ‘residues_h80_asym_protein_assemblies’ are subsets of tables ‘residues_all_sym_protein_assemblies’ and ‘residues_all_asym_protein_assemblies’, respectively, we decided to provide separate tables for non-redundant assemblies to facilitate data loading and manipulation.
Assembly descriptors records.
| Field | Description | Type |
|---|---|---|
| pdb_long | Four-character PDB accession code, followed by the assembly number | string |
| pdb_short | Four-character PDB accession code | string |
| uniprot | Uniprot accession code | string |
| resol | X-ray crystallography resolution (Å) | float |
| sym | Symmetry of protein assembly | string |
| nsub | Number of subunits in protein assembly | int |
| mw | Molecular weight of protein assembly (Da) | float |
| PiQSi | Quaternary structure validity inferred in the manually curated database PiQSi (YES/NO & PROBYES/PROBNOT). YES/PROBYES indicates likely errors. | string |
| QSalign | Quaternary structure validity inferred from QSalign (YES/NO & PROBYES/PROBNOT). YES/PROBYES indicates likely errors. | string |
| QSbio | Quaternary structure error probability from QSbio (range 0-100) | float |
| tv_discard | Assembly ignored in the technical validation (binary) | int |
| h_80 | Assembly belonging to a non-redundant dataset where no two structures share the same QS and sequence identity >80% (binary) | int |
| h_90 | Assembly belonging to a non-redundant dataset where no two structures share the same QS and sequence identity >90% (binary) | int |
Each line of table protein_assemblies_description.csv.tar.gz[15] corresponds to one unique assembly.
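As a hedged illustration of how these fields might be combined, the sketch below keeps assemblies that belong to the 80% non-redundant set (`h_80`) and whose QSbio error probability falls below a cutoff. The comma delimiter, the `NA` missing-value marker, the 20% cutoff, the function name and every row in `SAMPLE` are assumptions or made-up placeholders, not values from the dataset.

```python
import csv
import io

# Made-up rows using the column names from the table above (not real entries).
SAMPLE = """\
pdb_long,pdb_short,uniprot,resol,sym,nsub,mw,PiQSi,QSalign,QSbio,tv_discard,h_80,h_90
0aa1_1,0aa1,X00001,1.8,C2,2,50000.0,NA,NO,5.0,0,1,1
0aa2_1,0aa2,X00002,2.1,D4,8,320000.0,NA,NO,60.0,0,1,1
0aa3_1,0aa3,X00003,2.5,C3,3,90000.0,NA,NO,10.0,0,0,1
0aa4_1,0aa4,X00004,3.0,D2,4,120000.0,NA,NA,NA,0,1,1
"""

def select_assemblies(handle, max_qsbio_error=20.0):
    """Return pdb_long codes in the h_80 set with QSbio error <= cutoff.

    The cutoff value is an arbitrary choice for this example.
    """
    kept = []
    for row in csv.DictReader(handle):
        if row["h_80"] != "1":
            continue  # not part of the 80% non-redundant set
        try:
            error = float(row["QSbio"])
        except ValueError:
            continue  # missing or non-numeric QSbio estimate
        if error <= max_qsbio_error:
            kept.append(row["pdb_long"])
    return kept

kept = select_assemblies(io.StringIO(SAMPLE))
print(kept)
```

For the real table, the same function could be given a file handle opened on the extracted CSV, after checking the actual delimiter and missing-value encoding.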
Fig. 4. Relating the normal distance to the closest bounding plane (nDp) to assemblies’ molecular weight and environment stickiness. (a) Average and maximum nDp per assembly as a function of its molecular weight for control (Ctrl), cyclic (Cn) and dihedral (Dn) complexes. The control structures are monomers. Number of assemblies: (Ctrl) 11,092, (Cn) 9,996 and (Dn) 3,286. Number of residues: (Ctrl) 3,126,485, (Cn) 5,725,566 and (Dn) 4,585,996. Lines show the average per binned sample. Box height spans the Q1–Q3 quartiles. Lower and upper hinges extend boxes by 150% of the Q1–Q3 interquartile range, within the limits of existing data. Box widths are proportional to the square root of sample size ratio. (b) Distributions of the nDp across symmetry types: control (Ctrl), cyclic (Cn) and dihedral (Dn) complexes. Number of assemblies: same as (a) and (Dn nDp-2-fold & Dn nDp-n-fold) 1,133. Number of residues: same as (a) and (Dn nDp-2-fold & Dn nDp-n-fold) 2,072,956. (c) Environment stickiness as a function of nDp for control (dashed lines), cyclic (red) and dihedral (blue) complexes. In accordance with our previous results[3], environment stickiness is tuned as a function of nDp in dihedral complexes, but not in cyclic complexes. Brown error bars correspond to two standard errors. Number of assemblies: (Ctrl) 10,637, (C2) 8,626, (C3) 857, (C4) 126, (C5) 58, (D2) 2,106, (D3) 693, (D4) 282, (D5) 68. Number of residues: (Ctrl) 1,437,486, (C2) 2,018,957, (C3) 253,493, (C4) 52,484, (C5) 18,381, (D2) 910,575, (D3) 398,613, (D4) 224,067, (D5) 51,004.
Assembly symmetry axes records.
| Field | Description | Type |
|---|---|---|
| pdb_long | Four-character PDB accession code, followed by the biological assembly number | string |
| fold | Symmetry axis fold | int |
| x | Symmetry axis unit vector orientation (x-axis) | float |
| y | Symmetry axis unit vector orientation (y-axis) | float |
| z | Symmetry axis unit vector orientation (z-axis) | float |
Each line of table protein_assemblies_symmetry_axes.csv.tar.gz[15] corresponds to one unique symmetry axis.
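To show how the axis records could be used, the following sketch groups axes by assembly, counts them per fold, and measures the angle between two axes from their unit vectors. The helper names and the example rows are hypothetical. The rows describe an idealized D4 arrangement (one 4-fold axis along z and four perpendicular 2-fold axes), not values taken from the table, and whether the table lists every axis of an assembly is not stated in the source.

```python
import math
from collections import Counter, defaultdict

def axes_by_assembly(rows):
    """Group (pdb_long, fold, x, y, z) records by assembly code."""
    grouped = defaultdict(list)
    for pdb_long, fold, x, y, z in rows:
        grouped[pdb_long].append((int(fold), (float(x), float(y), float(z))))
    return grouped

def fold_counts(axes):
    """Count how many axes of each fold an assembly has."""
    return Counter(fold for fold, _ in axes)

def angle_deg(u, v):
    """Angle between two axes given as unit vectors, folded into [0, 90].

    Axes are undirected lines, so the sign of the dot product is ignored.
    """
    dot = abs(sum(a * b for a, b in zip(u, v)))
    return math.degrees(math.acos(min(1.0, dot)))

# Idealized, made-up D4 axes for a placeholder assembly code.
d = math.sqrt(0.5)
rows = [
    ("0xx1_1", 4, 0.0, 0.0, 1.0),
    ("0xx1_1", 2, 1.0, 0.0, 0.0),
    ("0xx1_1", 2, 0.0, 1.0, 0.0),
    ("0xx1_1", 2, d, d, 0.0),
    ("0xx1_1", 2, d, -d, 0.0),
]
axes = axes_by_assembly(rows)["0xx1_1"]
counts = fold_counts(axes)
four_fold = next(vec for fold, vec in axes if fold == 4)
angles_to_four_fold = [angle_deg(four_fold, vec) for fold, vec in axes if fold == 2]
```

In this idealized example every 2-fold axis is perpendicular to the 4-fold axis, which is the geometric signature of a dihedral complex.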
Residue descriptors records.
| Field | Description | Type |
|---|---|---|
| pdb_long | Four-character PDB accession code, followed by the biological assembly number | string |
| chain | Protein chain in PDB file | char |
| num | Residue number in PDB file | int |
| name | Residue three-letter code | string |
| letter | Residue one-letter code | char |
| x | Residue Cα position (x-axis) | float |
| y | Residue Cα position (y-axis) | float |
| z | Residue Cα position (z-axis) | float |
| rASA_in_BU | Residue relative ASA considering the complexed protein state | float |
| rASA_alone | Residue relative ASA considering the unbound protein state | float |
| sticky_scale | Residue stickiness value | float |
| sticky_patch | Residue environment stickiness | float |
| patch_size | Number of residues used for environment stickiness calculation | int |
| (*) nDp | Residue nDp (minimum values across all axes) | float |
| (**) fold | Symmetry type (2-fold, 3-fold, etc) of the axis with respect to which nDp is calculated | int |
| (**) nDp_n_fold | Residue nDp-n-fold | float |
| (**) nDp_2_fold | Residue nDp-2-fold (minimum values across all 2-fold axes) | float |
Each line of tables ‘residues_all_sym_protein_assemblies’, ‘residues_all_asym_protein_assemblies’, ‘residues_h80_sym_protein_assemblies’ and ‘residues_h80_asym_protein_assemblies’[15] corresponds to one unique residue. (*) Descriptor defined for monomers in tables ‘residues_all_asym_protein_assemblies’ and ‘residues_h80_asym_protein_assemblies’ only because we use it as a control (see section “Technical Validation”). (**) Descriptors exclusively related to high-order dihedral complexes (Dn, n > 2) and present only in tables ‘residues_all_sym_protein_assemblies’ and ‘residues_h80_sym_protein_assemblies’.
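The two accessibility fields can be compared to flag residues that lose solvent exposure when the subunit is placed in its assembly. The sketch below is only one simple way to do this and is not the surface/interface definition used by the authors. The function name, both cutoffs, and the assumption that rASA is stored as a fraction between 0 and 1 (rather than a percentage) are hypothetical.

```python
def classify_residue(rasa_alone, rasa_in_bu, exposed_cutoff=0.25, burial_eps=0.0):
    """Rough residue classification from the two rASA fields.

    Assumes rASA values are fractions in [0, 1]; both cutoffs are
    illustrative choices, not values from the dataset description.
    """
    if rasa_alone - rasa_in_bu > burial_eps:
        return "interface"  # accessibility drops in the assembled state
    if rasa_in_bu > exposed_cutoff:
        return "surface"    # still exposed in the assembled state
    return "interior"       # buried regardless of assembly

# Made-up (rASA_alone, rASA_in_BU) pairs for illustration.
examples = [(0.60, 0.10), (0.50, 0.50), (0.05, 0.05)]
labels = [classify_residue(alone, in_bu) for alone, in_bu in examples]
print(labels)
```

If the tables turn out to store rASA as percentages, the `exposed_cutoff` default would need rescaling accordingly.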
| Design Type(s) | protein interaction analysis objective • protein structure prediction objective • modeling and simulation objective |
| Measurement Type(s) | protein complex |
| Technology Type(s) | computational modeling technique |
| Factor Type(s) | Filtering • source |
| Sample Characteristic(s) | laboratory environment |
Overview of the scripts archive content.
| Folder | File | Type | Description |
|---|---|---|---|
| . | README.txt | README file | README file |
| . | 1pok_3.pdb | Demo Input | Demonstration PDB file |
| . | pymol_visualization.py | PyMol Script | Enables the visualization of properties on structures, and also symmetry axes |
| . | freesasa-2.0.3.tar | Archive | FreeSASA software v2.0.3 that needs to be installed to perform ASA calculations |
| . | wrapper_nDp_and_stickiness_calculations.pl | Perl Script | Calculation of the different nDp versions and environment stickiness |
| . | 1pok_3.nDp_and_stickiness | Demo Output | Different nDp versions and environment stickiness for 1pok_3 in tabulated file |
| ./nDp | nDp_calculation.pl | Perl Script | Calculation of the different nDp versions |
| ./nDp | ananas_linux | Binary | AnAnaS software v0.6 for Linux platforms required to perform symmetry calculations, no installation is required |
| ./nDp | ananas_mac | Binary | AnAnaS software v0.6 for Darwin (Mac) platforms required to perform symmetry calculations, no installation is required |
| ./nDp | 1pok_3.sym | Demo Output | Symmetry order and axes coordinates for 1pok_3 in tabulated file, as calculated by the software AnAnaS |
| ./nDp | 1pok_3.nDp | Demo Output | Different nDp versions for 1pok_3 in tabulated file |
| ./environment_stickiness | environment_stickiness_calculation.pl | Perl Script | Calculation of the environment stickiness |
| ./environment_stickiness | 1pok_3.asa | Demo Output | ASA for 1pok_3 in tabulated file, as calculated by the software FreeSASA |
| ./environment_stickiness | 1pok_3.stickiness | Demo Output | Environment stickiness for 1pok_3 in tabulated file |
Demonstration input and output files are provided for each script.