David S Wishart, David Arndt, Mark Berjanskii, An Chi Guo, Yi Shi, Savita Shrivastava, Jianjun Zhou, You Zhou, Guohui Lin.
Abstract
The protein property prediction and testing database (PPT-DB) is a database housing nearly 30 carefully curated databases, each of which contains commonly predicted protein property information. These properties include both structural (e.g. secondary structure, contact order, disulfide pairing) and dynamic (e.g. order parameters, B-factors, folding rates) features that have been measured, derived or tabulated from a variety of sources. PPT-DB is designed to serve two purposes. First, it is intended to serve as a centralized, up-to-date, freely downloadable and easily queried repository of predictable or 'derived' protein property data. In this role, PPT-DB can serve as a one-stop, fully standardized repository from which developers can obtain the training, testing and validation data needed for almost any kind of protein property prediction program they may wish to create. Second, PPT-DB can serve as a tool for homology-based protein property prediction. Users may query PPT-DB with a sequence of interest and have a specific property predicted using a sequence similarity search against PPT-DB's extensive collection of proteins with known properties. PPT-DB exploits the well-known fact that protein structure and dynamic properties are highly conserved between homologous proteins. Predictions derived from PPT-DB's similarity searches are typically 85-95% correct (for categorical predictions, such as secondary structure) or exhibit correlations of >0.80 (for numeric predictions, such as accessible surface area). This performance is 10-20% better than what is typically obtained from standard 'ab initio' predictions. PPT-DB, its prediction utilities and all of its contents are available at http://www.pptdb.ca.
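The homology-based prediction described in the abstract can be illustrated with a minimal sketch. It assumes a pairwise alignment between the query and a database hit has already been produced by a similarity search, and that the hit's per-residue property is stored as a string with one character per residue. The function name `transfer_property`, the `?` placeholder and the example sequences are hypothetical and are not taken from PPT-DB itself.

```python
def transfer_property(query_aln, hit_aln, hit_property, unknown="?"):
    """Copy per-residue annotations from an aligned hit onto the query.

    query_aln, hit_aln: aligned sequences of equal length, gaps written as "-".
    hit_property: one annotation character per ungapped residue of hit_aln.
    Returns one character per ungapped query residue.
    """
    if len(query_aln) != len(hit_aln):
        raise ValueError("aligned strings must have equal length")
    predicted = []
    hit_pos = 0  # index into the hit's ungapped annotation string
    for q, h in zip(query_aln, hit_aln):
        if h != "-":
            label = hit_property[hit_pos]
            hit_pos += 1
        else:
            label = unknown  # query residue has no aligned partner
        if q != "-":
            predicted.append(label)
    return "".join(predicted)


# Illustrative (made-up) alignment with 3-state secondary structure labels.
prediction = transfer_property("MKV-LAG", "MRVQL-G", "CHHHHC")
print(prediction)  # CHHH?C
```

Query residues that align to a gap in the hit receive the placeholder, which is one way unaligned regions could reduce coverage in a scheme like this.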
Year: 2007 PMID: 17916570 PMCID: PMC2238980 DOI: 10.1093/nar/gkm800
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Summary of the content and description of different PPT-DB databases
| Database | Description | Database | Description |
|---|---|---|---|
| 2° Structure (cytoplasmic) 15 002 sequences | 3-state 2° structure assignments obtained by VADAR ( | Signal Peptide (Eukaryotic, Gram+, Gram−) 23 067 sequences | 2-state signal peptide assignments obtained from SwissProt comment fields — grouped via organism type |
| EVA 2° Structure Test Set 7117 sequences | 3-state 2° structure assignments via VADAR ( | Accessible Surface Area (integerized) 14 871 sequences | Residue-specific accessible surface area obtained via VADAR ( |
| 2° Structure (membrane helix) 254 sequences | 2-state 2° structure assignments obtained via VADAR (25) for helical membrane proteins | Accessible Surface Area (%) 14 871 sequences | Residue-specific accessible surface area obtained via VADAR (25) and converted to percentage values |
| TMH Benchmark Test Set 2247 sequences | 2-state 2° structure assignments for transmembrane helices from TMH Benchmark ( | B-factor (integerized) 10 332 sequences | Residue-specific B-factors obtained directly from PDB files of X-ray structures and scaled to values from 0 to 9 |
| 2° Structure (membrane barrel) 41 sequences | 2-state 2° structure assignments obtained by VADAR ( | B-factor 10 332 sequences | Residue-specific B-factors obtained directly from PDB ( |
| % 2° Structure (cytoplasmic) 15 002 sequences | 3-state 2° structure content obtained by VADAR ( | RMSF (integerized) 2134 sequences | Scaled (0–9) residue-specific RMSF values determined from NMR structures via SuperPose ( |
| Beta Turns 14 571 sequences | 5-state beta-turn assignments obtained via VADAR ( | RMSF 2134 sequences | Residue-specific root mean square fluctuation (RMSF) determined from NMR structures via SuperPose ( |
| Coiled-coil 824 sequences | 2-state, positional assignments for coiled coil regions from the Paircoil2 training set ( | Order Parameter (integerized) 9800 sequences | Scaled (0–9) residue-specific order parameter (model free) determined using Contact Model method ( |
| Edge/Central Beta Strands 13 255 sequences | 2-state beta strand type assignments obtained by VADAR ( | Order Parameter 9800 sequences | Residue-specific order parameter (model free) determined using Contact Model method ( |
| Beta Hairpins 8600 sequences | 2-state beta hairpin assignments obtained by VADAR ( | Contact Order 14 769 sequences | Contact order calculated using method of Plaxco |
| Disulfide Bonds 2785 sequences | Disulfide bond pairings obtained by VADAR ( | Folding Rate 83 sequences | Experimentally measured folding rates (ln[ |
| SPdb (Eukaryotic, Gram+, Gram−) 2590 sequences | Experimentally verified 2-state signal peptide assignments obtained from the SPdb ( | 3D Folding Decoys 52 sequences | PDB coordinates for misfolded or improperly folded proteins generated via different 3D prediction tools |
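Several of the databases above store "integerized" values scaled from 0 to 9. The table does not state how the scaling is done, so the following is only a sketch under the assumption of linear min-max scaling within a single protein chain; the function name `integerize` and the sample values are hypothetical.

```python
def integerize(values, levels=10):
    """Map numeric per-residue values onto integers 0..levels-1.

    Assumption: linear min-max scaling within one chain. The actual
    PPT-DB scaling procedure may differ.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # flat profile: everything in the lowest bin
    span = hi - lo
    # The maximum value would land on `levels`, so clamp it to the top bin.
    return [min(levels - 1, int((v - lo) / span * levels)) for v in values]


# Made-up B-factors for five residues.
print(integerize([10.0, 20.0, 30.0, 55.0, 100.0]))  # [0, 1, 2, 5, 9]
```

A single-digit encoding of this kind lets a numeric property be written as a string aligned residue-by-residue with the sequence.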
Figure 1. A screenshot montage showing different windows from the PPT-DB server. (A) An example of a typical sub-database query page, with an example of the content found under ‘Database Details’. (B) and (C) illustrate the two kinds of BLAST search output, with (B) showing the standard horizontal or FASTA format and (C) showing the vertical or column format for displaying residue-specific values/properties that are multi-digit numbers.
Figure 2. (A) A scatter plot of the predictive performance (Q3 versus % sequence identity) for secondary structure prediction for 1000 random query sequences that were submitted to PPT-DB's secondary structure database. (B) A moving average of the same data shown in (A) (using 5% sequence identity intervals). Using 10-fold cross-validation (10 sets of 100 random query proteins), the average coverage was 77.1 ± 9.3% and the average Q3 for all secondary structure predictions was 89.1 ± 1.3%.
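The quantities plotted in Figure 2 can be sketched as follows. Q3 is the percentage of residues whose predicted three-state secondary structure matches the observed state. The averaging over 5% identity intervals is implemented here as a simple per-bin mean, which is an assumption about how the curve in (B) was produced; the function names and data points are hypothetical.

```python
def q3(predicted, observed):
    """Percentage of residues with matching 3-state labels."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def binned_mean(points, width=5.0):
    """Average scores within fixed-width sequence identity intervals.

    points: iterable of (percent_identity, score) pairs.
    Returns {interval_start: mean_score}, ordered by interval start.
    """
    bins = {}
    for identity, score in points:
        # 100% identity is placed in the last interval (95-100).
        start = min(int(identity // width) * width, 100.0 - width)
        bins.setdefault(start, []).append(score)
    return {start: sum(s) / len(s) for start, s in sorted(bins.items())}


print(q3("HHHEECCC", "HHHEEECC"))  # 87.5
print(binned_mean([(32.0, 80.0), (34.0, 90.0), (61.0, 95.0)]))
# {30.0: 85.0, 60.0: 95.0}
```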