CDD/SPARCLE: functional classification of proteins via subfamily domain architectures.

Aron Marchler-Bauer¹, Yu Bo², Lianyi Han², Jane He², Christopher J Lanczycki², Shennan Lu², Farideh Chitsaz², Myra K Derbyshire², Renata C Geer², Noreen R Gonzales², Marc Gwadz², David I Hurwitz², Fu Lu², Gabriele H Marchler², James S Song², Narmada Thanki², Zhouxi Wang², Roxanne A Yamashita², Dachuan Zhang², Chanjuan Zheng², Lewis Y Geer², Stephen H Bryant².

Abstract

NCBI's Conserved Domain Database (CDD) aims at annotating biomolecular sequences with the location of evolutionarily conserved protein domain footprints, and functional sites inferred from such footprints. An archive of pre-computed domain annotation is maintained for proteins tracked by NCBI's Entrez database, and live search services are offered as well. CDD curation staff supplements a comprehensive collection of protein domain and protein family models, which have been imported from external providers, with representations of selected domain families that are curated in-house and organized into hierarchical classifications of functionally distinct families and sub-families. CDD also supports comparative analyses of protein families via conserved domain architectures, and a recent curation effort focuses on providing functional characterizations of distinct subfamily architectures using SPARCLE: Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.

Substances: Proteins

Year: 2016 PMID: 27899674 PMCID: PMC5210587 DOI: 10.1093/nar/gkw1129

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

CDD SCOPE AND COVERAGE

The current live CDD version, v3.15, contains 48 963 protein and protein-domain models, with content obtained from Pfam (1), SMART (2), the COGs collection (3), TIGRFAMS (4), the NCBI Protein Clusters collection (5), and NCBI's in-house data curation effort (6). CDD version v3.16 is scheduled for release in late 2016; it will include the most recent release of Pfam, version 30, and a total of 50 369 protein and protein-domain models. For CDD v3.16, the fixed assumed size of the domain model database has been increased to match the current size of the model collection, resulting in slightly higher E-values reported by RPS-BLAST. CDD maintains a fixed model database size for E-value computation so that domain annotation can be updated incrementally without re-computing existing annotation. The increase in the database size parameter will suppress previously reported annotation at the borderline of significance, as the default E-value threshold has not been adjusted accordingly.

Several classifications for large, common, and functionally diverse domain families have recently been updated or added to CDD, such as for the G-protein-coupled receptors (cd14964*), EF-hand domains (cd15900*), MBL-like metallo-hydrolases (cd06262), haloacid-dehalogenases (cd01427*), RING-finger domains (cd00162*), SPRY domains (cd11709), alkaline phosphatases and sulfatases (cd00016), pleckstrin homology domains (cd00900), myosin and kinesin motor domains (cd01363) and metallophosphatases (cd00838) (* available in CDD v3.16).

CDD is part of NCBI's Entrez search and retrieval system and is cross-linked with other databases such as Entrez/protein, Entrez/Gene, 3D-structure (Molecular Modeling Database, MMDB), NCBI BioSystems, PubMed, and PubChem. Domain and site annotation generated via CDD is visible in flat-file and graphical views of protein sequences in Entrez.

Currently, CDD annotates ∼250 million sequences in Entrez/protein, about 80% of the proteins excluding sequences from environmental sampling. CDD annotates 96% of structure-derived protein sequences in Entrez that are over 30 residues long. CDD curators annotate functional sites on NCBI-curated models, such as active sites and binding sites, which are mapped onto protein (query) sequences. Currently, a total of 29 991 site annotations have been created on 10 605 out of 12 805 NCBI-curated domain models*. Conserved sequence patterns have been recorded for 2123 of these site annotations, and their mapping onto query sequences is contingent on pattern matches.
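The effect of the larger assumed database size on borderline hits, described earlier in this section, can be illustrated with a short sketch. It assumes that, for a fixed alignment score, the E-value scales linearly with the assumed database size, as in standard Karlin-Altschul statistics. The two database sizes and the threshold of 0.01 are illustrative assumptions, not values stated in this article.

```python
def rescale_evalue(evalue: float, old_db_size: float, new_db_size: float) -> float:
    """Rescale an E-value to a new assumed database size.

    Assumption: for a fixed score, the E-value is proportional to the
    assumed database size (linear scaling).
    """
    if old_db_size <= 0 or new_db_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * (new_db_size / old_db_size)


def still_reported(evalue: float, old_db_size: float, new_db_size: float,
                   threshold: float = 0.01) -> bool:
    """True if a hit stays at or below the unchanged E-value threshold."""
    return rescale_evalue(evalue, old_db_size, new_db_size) <= threshold


# Hypothetical sizes in arbitrary units; only their ratio matters here.
OLD_SIZE = 1_000_000
NEW_SIZE = 1_250_000

for e in (1e-5, 0.007, 0.009):
    kept = still_reported(e, OLD_SIZE, NEW_SIZE)
    print(f"old E-value {e:g}: {'kept' if kept else 'suppressed'}")
```

Under these assumptions a clearly significant hit is unaffected, while a hit just below the threshold under the old database size moves above it and is no longer reported.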

AVAILABILITY AND DATA SHARING

Table 1 lists a set of URLs for services provided by CDD. The RPS-BLAST program is included in NCBI's BLAST software distribution, and pre-computed search databases are available via CDD's FTP site, so that conserved domain searches can be run locally. We have developed a utility, called ‘rpsbproc’, also distributed via CDD's FTP-site, which can be installed locally as a wrapper for RPS-BLAST in order to provide results that match those computed by NCBI's on-line search services (7), including site annotation and the location of conserved domain superfamily footprints.
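A local search of this kind might be scripted as in the following sketch, which only assembles the two command lines and prints them. The file names are hypothetical, and the command-line flags shown (in particular `-outfmt 11` for `rpsblast` and `-i`, `-o`, `-e`, `-m` for `rpsbproc`) are assumptions recalled from the tools' documentation; they should be checked against the README files distributed with each tool.

```python
import shlex
from pathlib import Path


def build_commands(query_fasta, cdd_db, workdir, evalue=0.01):
    """Return the rpsblast and rpsbproc command lines as argument lists.

    All paths are hypothetical; flags are assumptions to verify locally.
    """
    workdir = Path(workdir)
    archive = workdir / "hits.asn"        # intermediate BLAST archive
    report = workdir / "hits.rpsbproc.txt"  # post-processed annotation

    rpsblast = [
        "rpsblast",
        "-query", str(query_fasta),
        "-db", str(cdd_db),
        "-evalue", str(evalue),
        "-outfmt", "11",                  # archive format read by rpsbproc
        "-out", str(archive),
    ]
    rpsbproc = [
        "rpsbproc",
        "-i", str(archive),
        "-o", str(report),
        "-e", str(evalue),
        "-m", "rep",                      # assumed "representative hits" mode
    ]
    return [rpsblast, rpsbproc]


for cmd in build_commands("queries.fasta", "db/Cdd", "results"):
    print(shlex.join(cmd))
```

The argument lists can be passed to `subprocess.run(cmd, check=True)` once the tools and a pre-computed search database are installed.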

Table 1.

URLs and other resources associated with the CDD project

URL	Description
https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi	CD-Search interface to the RPS-BLAST algorithm and the model database, and to the CDART database of pre-computed domain annotation
https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi	Batch CD-Search interface to the RPS-BLAST algorithm and the model database, and to the CDART database of pre-computed domain annotation. Up to 4,000 protein queries may be submitted per request
https://www.ncbi.nlm.nih.gov/cdd	Entrez interface to CDD
https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml	CDD project home page
https://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi	CDART domain architecture viewer
ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd	CDD FTP site, see README file for content
https://www.ncbi.nlm.nih.gov/Structure/cdtree/cdtree.shtml	Domain hierarchy editor/viewer and protein structure/alignment viewer
ftp://ftp.ncbi.nlm.nih.gov/toolbox; executables can be obtained from: https://www.ncbi.nlm.nih.gov/BLAST/download.shtml	RPS-BLAST stand-alone tool for searching databases of profile models, part of the NCBI toolkit distribution
https://www.ncbi.nlm.nih.gov/sparcle	Entrez interface to SPARCLE (Subfamily Protein Architecture Labeling Engine)
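Because Batch CD-Search accepts up to 4,000 protein queries per request, larger query sets have to be split before submission. The following is a minimal sketch of such a split for FASTA-formatted input; the function names are hypothetical, and the submission step itself is not shown.

```python
from typing import Iterator, List

MAX_QUERIES_PER_REQUEST = 4000  # limit stated in Table 1


def split_fasta(text: str) -> List[str]:
    """Split FASTA text into one string per record (header plus sequence)."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            if current:
                records.append("\n".join(current))
            current = [line]
        elif line.strip() and current:
            current.append(line.strip())
    if current:
        records.append("\n".join(current))
    return records


def batches(records: List[str],
            size: int = MAX_QUERIES_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Synthetic example: 9001 short placeholder sequences.
fasta_text = "\n".join(f">query{i}\nMKTAYIAK" for i in range(9001))
for number, batch in enumerate(batches(split_fasta(fasta_text)), start=1):
    print(f"request {number}: {len(batch)} queries")
```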

The CDD group has started a collaboration with the InterPro group at the European Bioinformatics Institute to supplement sequence annotation provided by InterPro with data that are uniquely provided by NCBI's CDD curation effort, including protein domain models for very specific subfamilies and the annotation of functional sites. To date, >1000 domain signatures provided by CDD have been integrated by InterPro (8).

SUBFAMILY DOMAIN ARCHITECTURES AND SPARCLE

Since its inception, the Conserved Domain Database has been complemented by the CDART (Conserved Domain Architecture Retrieval Tool) service (9), which groups proteins in the Entrez database by common domain architecture. As CDD is a redundant collection, protein domain models are collapsed into domain superfamilies for the sake of defining domain architectures in CDART. The CDART domain architecture viewer supports the comparative analysis of domain architectures by listing the architectures that are most similar to that of a query. It also provides powerful options for filtering these similarity results by domain content/composition and taxonomy. However, architectures defined on the basis of their domain superfamily footprints can summarize sets of proteins that are functionally very diverse. In CDD, a protein domain superfamily is defined as a set of protein domain models that annotate overlapping footprints on protein sequences. An automated clustering method groups models into these superfamily clusters, and the superfamily clusters undergo a subsequent manual review in order to prevent false clustering. In many cases superfamily clusters contain more than one model from a given data contributor, such as Pfam or COG, and clustering may not just reduce redundancy introduced by including multiple data sources, but also collapse information from one data source that would otherwise be useful in distinguishing between functionally distinct protein families. This becomes particularly evident in cases where a superfamily cluster contains one or more NCBI-curated protein domain hierarchies, which were deliberately constructed so that CDD could distinguish between domain and protein sub-families that have a distinct evolutionary history and distinct function. In order to make use of the richness of information collected in CDD, we have investigated alternative methods for grouping proteins, namely by their Subfamily Domain Architecture (SDA).
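The superfamily definition given here, models that annotate overlapping footprints, can be sketched as a connected-components problem. The code below is an illustration only: the article does not describe the actual clustering method or its overlap criteria, the manual review step is not modeled, and the model names and coordinates are hypothetical.

```python
from collections import defaultdict


def cluster_models(hits, min_overlap=1):
    """Group models whose footprints overlap on at least one protein.

    hits: iterable of (protein, model, start, end), 1-based inclusive.
    Assumption: any overlap of at least `min_overlap` residues links two
    models; clusters are the connected components of those links.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    by_protein = defaultdict(list)
    for protein, model, start, end in hits:
        find(model)  # register the model even if it never overlaps another
        by_protein[protein].append((start, end, model))

    for footprints in by_protein.values():
        for i, (s1, e1, m1) in enumerate(footprints):
            for s2, e2, m2 in footprints[i + 1:]:
                if min(e1, e2) - max(s1, s2) + 1 >= min_overlap:
                    union(m1, m2)

    clusters = defaultdict(set)
    for model in list(parent):
        clusters[find(model)].add(model)
    return sorted(sorted(members) for members in clusters.values())


example_hits = [
    ("P1", "modelA", 10, 100), ("P1", "modelB", 50, 120),
    ("P2", "modelB", 5, 60), ("P2", "modelC", 40, 90),
    ("P1", "modelD", 200, 300),
]
print(cluster_models(example_hits))
```

In this toy input, modelA and modelC never overlap directly but end up in the same cluster through modelB, which shows how a superfamily cluster can collapse models that a single data source keeps distinct.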
Here we present a first and simple implementation of that idea, which defines a Subfamily Domain Architecture as the string of the domain model accessions that provide the most concise annotation on a protein. We find that the proteins grouped under such SDAs can often be named accurately, and curators record brief functional characterizations (called functional labels) and supporting evidence. The curation interface used to associate domain architectures with functional descriptions has been named SPARCLE for ‘Subfamily Protein Architecture Labeling Engine’, and that name has been adopted for the public service as well. To date, CDD curators have assigned names and functional labels to more than 6,500 SDAs. A publicly accessible Entrez database has been made available to support querying and to provide summary information for SDAs as well as links to other databases, most importantly the NCBI protein collection. For user protein queries submitted to CD-Search, and in the display of pre-computed domain annotation, the content of the functional label assigned to the corresponding SDA is now shown on the results page, if available (see Figure 1). The functional labels are linked to summary pages, which display additional information about a subfamily domain architecture, including evidence for the name and functional label (see Figure 2).
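The idea of an SDA as a string of domain model accessions can be sketched as follows. The article does not specify how the "most concise annotation" is selected, so this example substitutes a simple greedy rule (keep the best-scoring hits that do not overlap one another) as a stand-in. The accession string is taken from the example in Figure 1; the coordinates, E-values, protein names, and the extra overlapping model are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    accession: str
    start: int   # 1-based, inclusive
    end: int
    evalue: float


def sda_string(hits: List[Hit]) -> str:
    """Build an architecture string from non-overlapping best hits.

    Assumption: greedy selection by E-value approximates the
    'most concise annotation'; the real procedure may differ.
    """
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(hit.end < c.start or hit.start > c.end for c in chosen):
            chosen.append(hit)
    return " ".join(h.accession for h in sorted(chosen, key=lambda h: h.start))


def group_by_sda(annotations: Dict[str, List[Hit]]) -> Dict[str, List[str]]:
    """Group protein identifiers by their architecture string."""
    groups = defaultdict(list)
    for protein, hits in annotations.items():
        groups[sda_string(hits)].append(protein)
    return dict(groups)


annotations = {
    "proteinA": [
        Hit("cd00714", 2, 215, 1e-90),
        Hit("cd05008", 290, 415, 1e-50),
        Hit("cd05009", 460, 605, 1e-60),
        Hit("hypothetical_model", 280, 420, 1e-20),  # weaker, overlapping
    ],
    "proteinB": [
        Hit("cd00714", 5, 220, 1e-85),
        Hit("cd05008", 300, 420, 1e-45),
        Hit("cd05009", 470, 610, 1e-55),
    ],
}
print(group_by_sda(annotations))
```

Both hypothetical proteins receive the same string and therefore fall into one group, to which a curator could attach a single name and functional label.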

Figure 1. CD-Search reporting pre-computed domain annotation for the protein with GenBank accession KUG45846, a hypothetical protein from Pseudomonas savastanoi pv. fraxini. The section circled in red provides the functional label that has been assigned to the subfamily domain architecture characterized by the string ‘cd00714 cd05008 cd05009’, which is shared by over 70 000 sequences in Entrez/protein.

Figure 2. Subfamily domain architecture summary page. The summary pages include a browser that provides options for retrieving sub-sets of the sequences sharing the same subfamily architecture, such as those from particular sources, a particular organism, or those that are linked to papers in PubMed.

Subfamily domain architectures as defined by the SPARCLE resource vary widely in their coverage and functional diversity. The resolution of this protein classification with respect to specific function depends directly on the availability of specific reagents in the CDD domain model collection, and future curation work in CDD will also be aimed at providing more fine-grained classifications where they may help to better resolve functionally distinct SDAs. The current set of curated SDA names and labels focuses on architectures common in bacterial genomes. We are investigating methods for automating name and label assignments for architectures. Automatically assigned names and labels will be flagged as not validated, should they be displayed publicly as annotation on query sequences.

REFERENCES

1. The Pfam protein families database: towards a more sustainable future.

Authors: Robert D Finn; Penelope Coggill; Ruth Y Eberhardt; Sean R Eddy; Jaina Mistry; Alex L Mitchell; Simon C Potter; Marco Punta; Matloob Qureshi; Amaia Sangrador-Vegas; Gustavo A Salazar; John Tate; Alex Bateman
Journal: Nucleic Acids Res Date: 2015-12-15

2. SMART: recent updates, new developments and status in 2015.

Authors: Ivica Letunic; Tobias Doerks; Peer Bork
Journal: Nucleic Acids Res Date: 2014-10-09

3. The COG database: new developments in phylogenetic classification of proteins from complete genomes.

Authors: R L Tatusov; D A Natale; I V Garkavtsev; T A Tatusova; U T Shankavaram; B S Rao; B Kiryutin; M Y Galperin; N D Fedorova; E V Koonin
Journal: Nucleic Acids Res Date: 2001-01-01

4. TIGRFAMs and Genome Properties in 2013.

Authors: Daniel H Haft; Jeremy D Selengut; Roland A Richter; Derek Harkins; Malay K Basu; Erin Beck
Journal: Nucleic Acids Res Date: 2012-11-28

5. The National Center for Biotechnology Information's Protein Clusters Database.

Authors: William Klimke; Richa Agarwala; Azat Badretdin; Slava Chetvernin; Stacy Ciufo; Boris Fedorov; Boris Kiryutin; Kathleen O'Neill; Wolfgang Resch; Sergei Resenchuk; Susan Schafer; Igor Tolstoy; Tatiana Tatusova
Journal: Nucleic Acids Res Date: 2008-10-21

6. CDD: NCBI's conserved domain database.

Authors: Aron Marchler-Bauer; Myra K Derbyshire; Noreen R Gonzales; Shennan Lu; Farideh Chitsaz; Lewis Y Geer; Renata C Geer; Jane He; Marc Gwadz; David I Hurwitz; Christopher J Lanczycki; Fu Lu; Gabriele H Marchler; James S Song; Narmada Thanki; Zhouxi Wang; Roxanne A Yamashita; Dachuan Zhang; Chanjuan Zheng; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2014-11-20

7. CD-Search: protein domain annotations on the fly.

Authors: Aron Marchler-Bauer; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2004-07-01

8. InterPro in 2017-beyond protein family and domain annotations.

Authors: Robert D Finn; Teresa K Attwood; Patricia C Babbitt; Alex Bateman; Peer Bork; Alan J Bridge; Hsin-Yu Chang; Zsuzsanna Dosztányi; Sara El-Gebali; Matthew Fraser; Julian Gough; David Haft; Gemma L Holliday; Hongzhan Huang; Xiaosong Huang; Ivica Letunic; Rodrigo Lopez; Shennan Lu; Aron Marchler-Bauer; Huaiyu Mi; Jaina Mistry; Darren A Natale; Marco Necci; Gift Nuka; Christine A Orengo; Youngmi Park; Sebastien Pesseat; Damiano Piovesan; Simon C Potter; Neil D Rawlings; Nicole Redaschi; Lorna Richardson; Catherine Rivoire; Amaia Sangrador-Vegas; Christian Sigrist; Ian Sillitoe; Ben Smithers; Silvano Squizzato; Granger Sutton; Narmada Thanki; Paul D Thomas; Silvio C E Tosatto; Cathy H Wu; Ioannis Xenarios; Lai-Su Yeh; Siew-Yit Young; Alex L Mitchell
Journal: Nucleic Acids Res Date: 2016-11-29

9. CDART: protein homology by domain architecture.

Authors: Lewis Y Geer; Michael Domrachev; David J Lipman; Stephen H Bryant
Journal: Genome Res Date: 2002-10
