Abstract
The bias in protein structure and function space resulting from experimental limitations and targeting of particular functional classes of proteins by structural biologists has long been recognized, but never continuously quantified. Using the Enzyme Commission and the Gene Ontology classifications as a reference frame, and integrating structure data from the Protein Data Bank (PDB), target sequences from the structural genomics projects, structure homology derived from the SUPERFAMILY database, and genome annotations from Ensembl and NCBI, we provide a quantified view, both at the domain and whole-protein levels, of the current and projected coverage of protein structure and function space relative to the human genome. Protein structures currently provide at least one domain that covers 37% of the functional classes identified in the genome; whole structure coverage exists for 25% of the genome. If all the structural genomics targets were solved (twice the current number of structures in the PDB), it is estimated that structures of one domain would cover 69% of the functional classes identified and complete structure coverage would be 44%. Homology models from existing experimental structures extend the 37% coverage to 56% of the genome as single domains and 25% to 31% for complete structures. Coverage from homology models is not evenly distributed by protein family, reflecting differing degrees of sequence and structure divergence within families. While these data provide coverage, conversely, they also systematically highlight functional classes of proteins for which structures should be determined. Current key functional families without structure representation are highlighted here; updated information on the "most wanted list" that should be solved is available on a weekly basis from http://function.rcsb.org:8080/pdb/function_distribution/index.html.
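As an illustration of the coverage figures quoted above, the following minimal sketch computes the fraction of functional classes that contain at least one protein with a structure. It assumes a simple mapping from functional class to member proteins; the function name `coverage`, the class labels, and the protein identifiers are hypothetical toy values, and the abstract does not describe the authors' actual pipeline.

```python
from typing import Dict, Set


def coverage(classes_in_genome: Dict[str, Set[str]],
             proteins_with_structure: Set[str]) -> float:
    """Fraction of functional classes with at least one structurally
    characterized member protein (hypothetical helper, not the paper's code)."""
    if not classes_in_genome:
        return 0.0
    covered = sum(
        1 for members in classes_in_genome.values()
        if members & proteins_with_structure  # non-empty intersection
    )
    return covered / len(classes_in_genome)


# Toy data (assumed for illustration only): functional class -> genome proteins.
genome_classes = {
    "class_A": {"P1", "P2"},
    "class_B": {"P3"},
    "class_C": {"P4", "P5"},
    "class_D": {"P6"},
}

experimental = {"P1", "P4"}            # proteins with experimental structures
with_models = experimental | {"P3"}    # plus proteins reachable by homology models

print(coverage(genome_classes, experimental))  # 0.5
print(coverage(genome_classes, with_models))   # 0.75
```

Adding homology models can only enlarge the set of proteins with structure, so coverage computed this way never decreases, which matches the direction of the 37% to 56% extension reported in the abstract.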
Year: 2005; PMID: 16118666; PMCID: PMC1188274; DOI: 10.1371/journal.pcbi.0010031
Source DB: PubMed; Journal: PLoS Comput Biol; ISSN: 1553-734X; Impact factor: 4.475
Coverage/Kendall's Tau Correlations for Major Categories of Enzyme for Both Single Domains and Whole Proteins
Current values for nodes of these major branches can be determined from http://function.rcsb.org:8080/pdb/function_distribution/index.html.
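For readers unfamiliar with the statistic named in these table captions, here is a minimal sketch of Kendall's tau (the tau-a variant, which ignores tie corrections). This record does not state which two distributions are compared or which variant was used; the sketch assumes per-class protein counts in the genome versus per-class structure counts, and all names and numbers are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.
    Tied pairs count as neither concordant nor discordant."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy counts per functional class (assumed values, not from the paper).
genome_counts = [120, 80, 40, 10]
structure_counts = [30, 35, 5, 1]

tau = kendall_tau_a(genome_counts, structure_counts)
print(round(tau, 3))  # 0.667
```

A value near 1 would indicate that classes that are large in the genome also tend to be well represented among structures; a value near 0 would indicate little rank agreement.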
Coverage/Kendall's Tau Correlations for Major Categories of GO Cell Component for Both Single Domains and Whole Proteins
Current values for nodes of these five major branches and the two other minor categories can be determined from http://function.rcsb.org:8080/pdb/function_distribution/index.html.
a Two minor categories—virion and cell component unknown—are not listed in the table, and can be browsed from the Web site.
Coverage/Kendall's Tau Correlations for Major Categories of GO Molecular Function for Both Single Domains and Whole Proteins
Current values for nodes of these seven major branches and the eight other minor categories can be determined from http://function.rcsb.org:8080/pdb/function_distribution/index.html.
a Eight minor categories are not listed in the table, and can be browsed from the Web site.
Coverage/Kendall's Tau Correlations for Major Categories of GO Biological Process for Both Single Domains and Whole Proteins
Current values for nodes of these five major branches and the two other minor categories can be determined from http://function.rcsb.org:8080/pdb/function_distribution/index.html.
a Two minor categories, viral life cycle and biological process unknown, are not listed in the table, and can be browsed from the Web site.
Most Wanted Structures According to EC Numbers (a)
The proteins are clustered at 40% sequence identity.
a Data for clusters of fewer than three can be obtained from http://function.rcsb.org:8080/pdb/function_distribution/index.html.
Most Wanted Structures According to GO Classification (a)
The proteins are clustered at 40% sequence identity.
a Data for clusters of fewer than five can be obtained from http://function.rcsb.org:8080/pdb/function_distribution/index.html.