Oliviero Carugo.
Abstract
A novel and simple procedure (RaSPDB) for Protein Data Bank mining is described. Ten PDB subsets, each containing 7000 randomly selected protein chains, are built and used to make 10 estimations of the average value of a generic feature F (the length of the protein chain, the amino acid composition, the crystallographic resolution, or the secondary structure composition). These 10 estimations are then used to compute an average estimation of F together with its standard error. It is heuristically verified that the dimension of these 10 subsets (7000 protein chains) is sufficiently small to avoid redundancy within each subset and sufficiently large to guarantee stable estimations among different subsets. RaSPDB has two major advantages over classical procedures that aim to build a single, non-redundant PDB subset: a larger fraction of the information stored in the PDB is used, and the standard error of F can be estimated.
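The procedure summarized above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function name `raspdb_estimate`, the synthetic chain lengths, and the random seeds are hypothetical. It also assumes that chains are drawn without replacement within a subset, that the subsets are drawn independently of one another, and that the standard error is the sample standard deviation of the subset means divided by the square root of the number of subsets; the abstract does not state these details.

```python
import random
import statistics


def raspdb_estimate(values, n_subsets=10, subset_size=7000, seed=0):
    """Estimate the mean of a feature and its standard error from random subsets.

    `values` holds one feature value per protein chain (for example, chain length).
    The name, defaults, and standard-error formula are assumptions for illustration.
    """
    rng = random.Random(seed)
    subset_means = []
    for _ in range(n_subsets):
        # Draw one random subset of chains, without replacement within the subset.
        subset = rng.sample(values, subset_size)
        subset_means.append(statistics.fmean(subset))
    # Average of the per-subset estimations.
    mean = statistics.fmean(subset_means)
    # Assumed standard error: spread of the subset means over sqrt(number of subsets).
    sem = statistics.stdev(subset_means) / n_subsets ** 0.5
    return mean, sem, subset_means


# Synthetic stand-in for PDB chain lengths (not real data).
data_rng = random.Random(42)
chain_lengths = [data_rng.randint(30, 1000) for _ in range(200_000)]

mean_len, sem_len, per_subset = raspdb_estimate(chain_lengths)
print(f"mean chain length = {mean_len:.1f} +/- {sem_len:.1f}")
```

With real data, `values` would be replaced by the feature of interest extracted from the PDB entries, one value per chain.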
Year: 2021 PMID: 34921198 PMCID: PMC8683422 DOI: 10.1038/s41598-021-03615-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Optimization of the dimension of the subsets. (a) Relationship between protein chain length and the dimension of the randomly selected subsets of the PDB. (b) Relationship between resolution and the dimension of the randomly selected subsets of the PDB. (c) Relationship between secondary structure composition and the dimension of the randomly selected subsets of the PDB; secondary structures were assigned with STRIDE [14], and only the most common types, H (α-helix), E (β-strand), C (coil), and T (turn), were considered. (d) Relationship between amino acid composition and the dimension of the randomly selected subsets of the PDB; only the case of alanine is shown, because the plots for the other amino acids are very similar.
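The kind of check summarized in Figure 1 can be illustrated by sweeping the subset dimension and watching how much the estimations differ among subsets. The sketch below is a hypothetical illustration, not the paper's analysis: the function `spread_of_subset_means`, the synthetic resolution values, and the chosen sizes are assumptions, and the synthetic data contain no redundancy, so only the stability side of the trade-off is shown.

```python
import random
import statistics


def spread_of_subset_means(values, subset_size, n_subsets=10, seed=0):
    """Standard deviation of the means of `n_subsets` random subsets of `values`."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(values, subset_size))
        for _ in range(n_subsets)
    ]
    return statistics.stdev(means)


# Synthetic stand-in for crystallographic resolutions in angstroms (not real data).
data_rng = random.Random(1)
resolutions = [data_rng.uniform(1.0, 3.5) for _ in range(100_000)]

sizes = [100, 1000, 7000]
spreads = {size: spread_of_subset_means(resolutions, size) for size in sizes}
for size in sizes:
    print(f"subset size {size:>5}: spread of subset means = {spreads[size]:.4f}")
```

Larger subsets should give subset means that agree more closely with each other, which is the stability criterion mentioned in the abstract.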