Antonina Andreeva, Dave Howorth, John-Marc Chandonia, Steven E Brenner, Tim J P Hubbard, Cyrus Chothia, Alexey G Murzin.
Abstract
The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. The SCOP hierarchy comprises the following levels: Species, Protein, Family, Superfamily, Fold and Class. While keeping the original classification scheme intact, we have changed the production of SCOP in order to cope with a rapid growth of new structural data and to facilitate the discovery of new protein relationships. We describe ongoing developments and new features implemented in SCOP. A new update protocol supports batch classification of new protein structures by their detected relationships at Family and Superfamily levels in contrast to our previous sequential handling of new structural data by release date. We introduce pre-SCOP, a preview of the SCOP developmental version that enables earlier access to the information on new relationships. We also discuss the impact of worldwide Structural Genomics initiatives, which are producing new protein structures at an increasing rate, on the rates of discovery and growth of protein families and superfamilies. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.
Year: 2007 PMID: 18000004 PMCID: PMC2238974 DOI: 10.1093/nar/gkm993
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Workflow of the SCOP update protocol. The update sequence set of new unclassified structures is derived from the PDB SEQRES record. Disordered regions at the termini are masked. The update sequences are clustered using a threshold of 100% identity and 95% coverage for inclusion of a protein sequence in a cluster. The resulting clusters are used to select a representative sequence set. This dataset is used as the primary input to the pre-classification pipeline. The representative cluster set is first compared using BLAST against itself and against a database of non-redundant representative ASTRAL sequences for SCOP domains. This step allows detection of close homologs, usually members of the same SCOP family. Representative sequences without a significant sequence match (E-value threshold of 0.001) are further used for two-step PSI-BLAST searches. In the first step, a position-specific scoring matrix (PSSM) is generated by searching the NCBI non-redundant protein database. The resulting PSSM is saved after ten PSI-BLAST iterations, or fewer if the program converges. In the second step, each saved PSSM is used to scan databases of representative ASTRAL and update sequences. In addition, the representative cluster set of unclassified proteins is submitted for an RPS-BLAST search against a database of Pfam profiles. The resulting matches are then compared with the matches from pre-computed large-scale comparisons of SCOP domains and Pfam families. A provisional SCOP classification assignment is made for those proteins with a matching region in Pfam that has given a hit to a SCOP domain. The results of both RPS-BLAST and PSI-BLAST are used to identify relationships between more distant homologs that are likely to be members of the same SCOP superfamily. Update proteins that are identical or nearly identical to domains classified in the current SCOP release or in the SCOP developmental version are classified automatically. The remaining proteins, with and without provisional classification, are curated manually.
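The clustering and first routing step in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the SCOP pipeline itself: the function names `cluster_sequences` and `route` are hypothetical, "100% identity" is approximated by requiring a sequence to be an exact substring of the cluster representative, "95% coverage" is assumed to mean the member spans at least 95% of the representative's length, and treating an E-value at or below 0.001 as significant is also an assumption about how the threshold is applied.

```python
from typing import Dict, List, Optional


def cluster_sequences(
    seqs: Dict[str, str], min_coverage: float = 0.95
) -> Dict[str, List[str]]:
    """Greedy clustering sketch: map each representative ID to its member IDs.

    Assumptions (not from the source): exact-substring matching stands in
    for 100% identity, and coverage is measured against the representative.
    """
    clusters: Dict[str, List[str]] = {}
    # Visit longest sequences first so they become the representatives;
    # ties are broken by ID to keep the result deterministic.
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        seq = seqs[sid]
        for rep in clusters:
            rep_seq = seqs[rep]
            identical = seq in rep_seq  # 100% identity over the matched region
            covered = len(seq) >= min_coverage * len(rep_seq)
            if identical and covered:
                clusters[rep].append(sid)
                break
        else:
            # No existing cluster accepted the sequence: start a new one.
            clusters[sid] = [sid]
    return clusters


def route(best_evalue: Optional[float], threshold: float = 1e-3) -> str:
    """Pick the next step for a representative sequence.

    A significant BLAST match suggests a close homolog (often the same
    SCOP family); otherwise the sequence goes on to the PSI-BLAST searches.
    """
    if best_evalue is not None and best_evalue <= threshold:
        return "blast_close_homolog"
    return "psi_blast"


if __name__ == "__main__":
    toy = {"a": "M" * 100, "b": "M" * 96, "c": "M" * 50, "d": "K" * 100}
    print(cluster_sequences(toy))
    print(route(1e-5), route(0.5), route(None))
```

In the toy input, `b` joins the cluster represented by `a` because it is identical over 96% of the representative's length, while `c` matches only 50% and therefore starts its own cluster.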
Figure 2. Statistics of SCOP classification of SG targets. (A) Numbers of SG-families and SG-superfamilies by fraction of SG domains in them. (B) Division of SG-families into ‘true’ and ‘singleton’ families, their SG target contents and their distribution in ‘true’ and ‘singleton’ superfamilies. Note that different parts of the same SG target can be classified into different families and that a ‘true’ superfamily can contain both ‘true’ and ‘singleton’ families.
Selected SG-superfamilies largely populated with SG-families
| SCOP superfamily | Number of SG-families (SG-FA) | Coverage by SG-FA (%) | Representative structure |
|---|---|---|---|
| PUA domain-like | 12 | 100 | 1wmm |
| NagB/RpiA/CoA transferase-like | 7 | 100 | 2g40 |
| Alpha/beta knot | 6 | 100 | 1mxi |
| Ribokinase-like | 5 | 100 | 1ub0 |
| AhpD-like | 4 | 100 | 2cwq |
| ITPase-like | 3 | 100 | 1vp2 |
| RmlC-like cupins | 20 | 87 | 2atf |
| Dimeric α + β barrel | 18 | 78 | 1mwq |
| Bet v1-like | 6 | 67 | 1xfs |
| NTF2-like | 10 | 54 | 1tp6 |