Ruben Sanchez-Garcia, Carlos Oscar Sanchez Sorzano, Jose Maria Carazo, Joan Segura.
Abstract
Many studies have used position-specific scoring matrix (PSSM) profiles to characterize residues in protein structures and to predict a broad range of protein features. Moreover, PSSM profiles of Protein Data Bank (PDB) entries have been recalculated in many works for different purposes. Although the computational cost of calculating a single PSSM profile is affordable, many statistical studies and machine learning-based methods use thousands of profiles, which substantially increases the overall computational cost. In this work we present a new database compiling PSSM profiles for the proteins of the PDB. Currently, the database contains 333,532 protein chain profiles involving 123,135 different PDB entries.
Keywords: machine learning; position-specific scoring matrices; protein databases; protein structure
Year: 2017 PMID: 29244774 PMCID: PMC6149929 DOI: 10.3390/molecules22122230
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Biological features of domain and non-domain residues for Protein Data Bank (PDB) proteins.
| | Residues (%) 1 | SS (%) 2 | BS (%) 3 | PTM (%) 4 | SLiM (%) 5 | Variants (%) 6 |
|---|---|---|---|---|---|---|
| Domain | 78 | 81 | 78 | 62 | 63 | 77 |
| Non-domain | 22 | 19 | 22 | 38 | 37 | 23 |
1 Percentage of residues in domain and non-domain regions; 2 Percentage of secondary structure (SS) elements in domain and non-domain regions; 3 Percentage of binding site (BS) residues in domain and non-domain regions; 4 Percentage of post-translational modifications (PTMs) in domain and non-domain regions; 5 Percentage of short linear motifs (SLiMs) in domain and non-domain regions; 6 Percentage of genomic variants associated with diseases.
Information per position of position-specific scoring matrices (PSSM) profiles in domain and non-domain regions.
| Region 1 | Gap freq. (%) 2 | Entropy (9 classes) 3 | Entropy (20 amino acids) 4 |
|---|---|---|---|
| Domain | 1.8 | 1.36 | 1.97 |
| Non-domain | 10.5 | 1.11 | 1.62 |
1 Location; 2 Gap frequency in the multiple sequence alignment (MSA); 3 Williamson entropy grouping the amino acids into nine classes; 4 Williamson entropy using the 20 naturally occurring amino acids.
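As a rough illustration of the per-position quantities in the table above, the following minimal sketch computes a gap frequency and a grouped relative-entropy score for a single alignment column. The nine-class grouping, the natural logarithm, and the uniform background distribution are all assumptions made for the example; the exact definition used to produce the tabulated values is not given here and may differ.

```python
import math
from collections import Counter

# Illustrative nine-class grouping (an assumption, not taken from the source).
CLASSES = ["VLIM", "FWY", "ST", "NQ", "HKR", "DE", "AG", "P", "C"]
AA_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}

# Hypothetical background: every class equally likely.
UNIFORM_BACKGROUND = [1.0 / len(CLASSES)] * len(CLASSES)


def gap_frequency(column):
    """Fraction of '-' symbols in one alignment column."""
    return column.count("-") / len(column)


def class_frequencies(column):
    """Fraction of residues per class, ignoring gaps and unknown symbols."""
    counts = Counter(AA_TO_CLASS[r] for r in column if r in AA_TO_CLASS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CLASSES)
    return [counts.get(i, 0) / total for i in range(len(CLASSES))]


def relative_entropy(column, background):
    """Sum of p_i * ln(p_i / q_i) over classes with p_i > 0.

    The background values q_i must be positive wherever p_i > 0.
    """
    p = class_frequencies(column)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, background) if pi > 0)
```

Under these assumptions a column containing a single class scores `ln 9`, and a column spread evenly over all nine classes scores 0, so larger values indicate a distribution further from the background.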
Figure 1. Q3 score histogram. Histogram of Q3 scores for secondary structure prediction on the testing set sequences. Blue: results obtained using the original PSSM profiles. Pink: results obtained using the current 3DCONS-DB PSSM profiles.
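For readers unfamiliar with the metric, Q3 is the percentage of residues whose three-state secondary structure (commonly helix, strand, coil) is predicted correctly. A minimal sketch follows; the function name and the one-letter state strings are illustrative choices, not part of the source.

```python
def q3_score(predicted, observed):
    """Percentage of positions where the predicted state matches the observed one.

    Both arguments are equal-length strings of one-letter states,
    for example 'H' (helix), 'E' (strand), 'C' (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# One mismatch out of eight positions.
print(q3_score("HHHEECCC", "HHHEEECC"))  # 87.5
```

A histogram such as the one in the figure would then be built from one Q3 value per test sequence.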
Root mean square error in predicting residue contact number.
| Threshold 1 | 8 Å | 10 Å | 12 Å | 14 Å |
|---|---|---|---|---|
| Yuan et al. 2 | 0.77 | 0.75 | 0.72 | 0.72 |
| 3DCONS-DB 3 | 0.62 | 0.64 | 0.68 | 0.69 |
1 Distance threshold used to define a contact between C-beta atoms; 2 Root mean square error reported by Yuan et al. [25]; 3 Root mean square error obtained using 3DCONS-DB data to train and test the support vector regression model.
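The two quantities behind this table can be sketched as follows: a residue's contact number is the count of other residues whose C-beta atom lies within the distance threshold, and the root mean square error compares predicted and observed values. This is only an illustration under stated assumptions. The function names are hypothetical, every other residue is counted (no sequence-separation filter), and whether the tabulated errors refer to raw or normalized contact numbers is not stated here.

```python
import math


def contact_numbers(cb_coords, threshold):
    """Count, for each residue, the other residues within `threshold` angstroms.

    `cb_coords` is a list of (x, y, z) C-beta coordinates, one per residue.
    Assumption: all other residues are eligible, including sequence neighbors.
    """
    n = len(cb_coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(cb_coords[i], cb_coords[j]) <= threshold:
                counts[i] += 1
                counts[j] += 1
    return counts


def rmse(predicted, observed):
    """Root mean square error between two equal-length numeric sequences."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


# Three residues on a line, 5 and 6 angstroms apart.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
print(contact_numbers(coords, 8.0))   # [1, 2, 1]
print(contact_numbers(coords, 12.0))  # [2, 2, 2]
```

The toy coordinates show how counts grow as the threshold increases, which is why the table reports a separate error for each threshold.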