
GenDiS: Genomic Distribution of protein structural domain Superfamilies.

Ganesan Pugalenthi, Anirban Bhaduri, Ramanathan Sowdhamini.

Abstract

Several proteins that have diverged substantially during evolution retain similar three-dimensional structures and biological functions in spite of poor sequence identity. The database on Genomic Distribution of protein structural domain Superfamilies (GenDiS) records the distribution of 4001 protein domains, organized as 1194 structural superfamilies, across 18,997 genomes at various levels of the taxonomic hierarchy. GenDiS surveys the protein domains listed in sequence databases using a 3-fold sequence search approach. Lineage-specific literature is obtained from the taxonomy database for individual protein members to provide a platform for performing genomic and phyletic studies across organisms. The database documents residual properties and provides alignments for the various superfamily members in genomes, offering insights into the rational design of experiments and a better understanding of a superfamily. The GenDiS database can be accessed at http://www.ncbs.res.in/~faculty/mini/gendis/home.html.

Year:  2005        PMID: 15608190      PMCID: PMC540041          DOI: 10.1093/nar/gki087

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

High-throughput, large-scale sequencing efforts have illustrated the enormous diversity embedded within genomes owing to the varied composition of the proteome. Fortunately, structural and sequence analyses suggest strong convergence, indicating that many proteins will share a limited number of folds (1). Curation of protein structural entries in a hierarchy (2,3), compilation of sequence families (4,5) and superfamilies (6,7), establishment of relationships between protein sequence and structural databases (8,9) and the analysis of genomic patterns (10,11) are representative approaches to understanding this convergence. Reliable association of unannotated protein sequences with pre-existing families of well-characterized structure and function allows functionally important residues to be mapped onto sequence alignments, which can provide important insights into functional mechanisms. However, similarity and inheritance of function among homologues related in the twilight zone should be accepted only after careful validation (12). Genomes are classified into taxa on the basis of morphology and genetic content in the taxonomy database (13). Classification of organisms at various taxonomic strata describes the diversity among the organisms along with their proteomic content (14). Genome content and the distribution of proteins provide a better understanding of species phylogeny (15). Exploring the distribution of structural superfamilies across the various taxonomic strata adds to our understanding of proteins and of the phylogeny of organisms. The database of Genomic Distribution of protein structural domain Superfamilies (GenDiS) aims to provide structural assignments, at the superfamily level, to genes listed within the non-redundant protein sequence database. Structural superfamily definitions correspond to the SCOP 1.63 (16) and PASS2 (17) databases.
Searches for homologues within the sequence databases have been performed using multiple approaches (see Methods). Assignments were validated before a member was inducted. The genomic lineage of every individual entry was obtained from the taxonomy database and the corresponding taxon records were assigned. The database offers a platform for understanding and comparing the distribution of protein superfamilies across the different taxonomic strata.

METHODS

Searching for potential superfamily members in sequence databases

Potential members of the superfamilies have been searched for using a 3-fold approach. Members of the PASS2 database (17) were queried against the April 2003 release of the non-redundant sequence database (13) employing PSI-BLAST (18), with an expectation value of 10^-3 for 20 iterations. These profile-to-sequence searches were complemented by the HMMsearch tool of the HMMER suite (19). Hidden Markov models (HMMs) were derived for domain superfamilies starting from structure-based sequence alignments of PASS2 members (17), and an expectation threshold of 0.1 was used during the searches. In addition, motif-constrained PHI-BLAST (20) searches were carried out as reported previously (21,22) for a single iteration with an expectation value of 1.0. A composite set of domain assignments was obtained for individual superfamilies from these three approaches. The alignment lengths were compared with the query to ensure that each alignment corresponds to the full length of the PASS2 domain (23) (Figure 1). Redundant proteins were removed employing CD-HIT (24) at a stringent sequence identity cut-off of 100%. Domains assigned to a superfamily and belonging to a genome were aligned using CLUSTALW (25). The alignments have been colour-coded by examining the conservation and similarity at the various positions.
Figure 1

Flowchart representing the steps involved in the curation of the database and various features in GenDiS. Boxes marked in boldface represent the tools provided while the dotted boxes indicate the residual features evaluated for the protein members in GenDiS.
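The merging and filtering steps described above can be illustrated with a minimal Python sketch. It is not the GenDiS pipeline itself: the `Hit` record, the `composite_assignments` function and the `min_coverage` default are hypothetical names and values introduced here for illustration, and exact-sequence de-duplication stands in for the CD-HIT step at 100% identity. Only the three expectation thresholds are taken from the text.

```python
from dataclasses import dataclass

# E-value cut-offs quoted in the Methods text, keyed by search method.
E_VALUE_CUTOFFS = {"psiblast": 1e-3, "hmmsearch": 0.1, "phiblast": 1.0}


@dataclass(frozen=True)
class Hit:
    """One candidate domain assignment (hypothetical record layout)."""
    method: str          # "psiblast", "hmmsearch" or "phiblast"
    protein_id: str
    sequence: str        # matched domain sequence
    e_value: float
    aligned_length: int  # residues of the query domain covered
    query_length: int    # full length of the PASS2 query domain


def composite_assignments(hits, min_coverage=0.8):
    """Merge hits from the three searches into one filtered, non-redundant set.

    min_coverage is an illustrative assumption; the paper does not state
    the exact length criterion it applied.
    """
    seen_sequences = set()
    kept = []
    # Visit the most significant hits first so they win over duplicates.
    for hit in sorted(hits, key=lambda h: h.e_value):
        if hit.e_value > E_VALUE_CUTOFFS[hit.method]:
            continue  # fails the method-specific expectation threshold
        if hit.aligned_length / hit.query_length < min_coverage:
            continue  # alignment covers too little of the query domain
        if hit.sequence in seen_sequences:
            continue  # exact duplicate, mimicking a 100% identity filter
        seen_sequences.add(hit.sequence)
        kept.append(hit)
    return kept
```

Sorting by expectation value before de-duplicating is a design choice of this sketch, so that when two searches recover the same sequence the more significant hit is retained.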

Taxonomic annotation of the superfamily members and alignments

The non-redundant sequences maintained at the NCBI form a composite resource of several genome databases. GenDiS records the source organism of the assigned proteins and a detailed taxonomic lineage of the species in correspondence with the taxonomy database (13). Taxonomic classifications at the phylum, class, order, family, genus and species levels have been recorded against individual entries. Proteins belonging to the same taxon are clustered together and further sub-grouped at the superfamily level (Figure 1).
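The two-level clustering described above (first by taxon, then by superfamily) can be sketched as follows. This is an illustrative sketch only: the entry layout, the `RANKS` tuple and the `group_by_taxon` function are assumptions made for the example and do not reflect the actual GenDiS schema.

```python
from collections import defaultdict

# Taxonomic ranks recorded against each entry, as listed in the text.
RANKS = ("phylum", "class", "order", "family", "genus", "species")


def group_by_taxon(entries, rank):
    """Cluster entries by taxon at the given rank, then by superfamily.

    Each entry is a dict with a "protein_id", a "superfamily" and a
    "lineage" dict mapping rank names to taxon names (hypothetical layout).
    """
    if rank not in RANKS:
        raise ValueError(f"unknown rank: {rank}")
    grouped = defaultdict(lambda: defaultdict(list))
    for entry in entries:
        # Entries lacking the requested rank are pooled as "unclassified".
        taxon = entry["lineage"].get(rank, "unclassified")
        grouped[taxon][entry["superfamily"]].append(entry["protein_id"])
    # Convert to plain dicts so the result prints and compares cleanly.
    return {taxon: dict(families) for taxon, families in grouped.items()}
```

Calling `group_by_taxon(entries, "genus")` and then `group_by_taxon(entries, "phylum")` on the same entries gives the same proteins viewed at two levels of the hierarchy.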

TOOLS AND SERVICES AT THE GenDiS SERVER

The GenDiS database can be navigated through a search engine to obtain information on taxonomic and superfamily distribution. The database is linked to the taxonomy database and to other protein databases. The GenDiS server provides several tools for performing genome and cross-genome analysis.

Information about superfamily members

The presence of superfamily members at the different taxonomic levels is summarized. Domains of the various superfamilies, both before and after validation (the pruned set), are downloadable. Domain architecture was identified for validated members of GenDiS employing IMPALA (26) against PASS2 profiles of structural domains. The average domain length and the sequence diversity, within genomes and at the superfamily level, are listed. HMMs can be obtained for the various superfamilies.

Genome and taxonomic information

The full list of the superfamilies found at the various levels of the taxonomic hierarchy can be retrieved from the database. Information about the occurrences of the various descending taxa within a particular level of the taxonomic hierarchy is provided. Completely sequenced genomes are listed separately and can be browsed through the complete genome list. The number of superfamilies and homologous sequences present in the various genomes can be obtained. Alignments of the members of particular superfamilies within genomes, and the conserved regions of each alignment, are provided. For multi-membered superfamilies, the diversity score evaluated by the method of Makowski and Soares (27) and the phylogenetic tree obtained on the basis of protein dissimilarity are presented. Domain architectures can also be retrieved at the phylum, class, order and genus levels of the taxonomic hierarchy.

Overlap score within genomes

Distinction among organisms results from the composite proteome encoded by the genome. Comprehensive structural domain assignments at the proteome level provide opportunities to study the distribution of the common and unique superfamilies among the completely sequenced genomes. The overlap score for a pair of completed genomes along with the listing of common and unique superfamilies demonstrates similarity among the organisms at a more holistic level.
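The text does not give the formula behind the overlap score, so the sketch below is only one plausible reading: it assumes a Jaccard-style ratio of shared superfamilies to all superfamilies found in either genome. The `superfamily_overlap` function and its return layout are hypothetical and should not be taken as the GenDiS definition.

```python
def superfamily_overlap(genome_a, genome_b):
    """Compare the superfamily content of two genomes.

    ASSUMPTION: the score is the Jaccard index (shared / total distinct
    superfamilies); the paper does not state the formula it uses.
    """
    a, b = set(genome_a), set(genome_b)
    common = a & b
    union = a | b
    # Two empty genomes share nothing, so the score is defined as 0.0.
    score = len(common) / len(union) if union else 0.0
    return {
        "score": score,
        "common": sorted(common),
        "unique_a": sorted(a - b),
        "unique_b": sorted(b - a),
    }
```

Returning the common and unique lists alongside the score mirrors the listing described in the paragraph above, where the score is reported together with the shared and genome-specific superfamilies.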

Alignments of desired query to superfamilies

Options are provided for aligning query sequences to superfamily members within a genome or by performing genome-wide alignments for specific superfamilies. The alignments are performed employing CLUSTALW (25).

Assigning structural domain architectures

Domain architecture assignments for unannotated sequences reveal the combination of structural domains embedded within the polypeptide, aiding its detailed characterization (28). Structural domains can be assigned to a query sequence by probing it against sequence profiles of PASS2 members employing IMPALA (26).

CONCLUSION

GenDiS is a compendium of sequence domains of evolutionarily related proteins grouped at the superfamily level in direct correspondence with the SCOP (16) and PASS2 (17) databases. Furthermore, it is possible to obtain links between structural hierarchy and taxonomic levels at GenDiS. The availability of alignments for sequence domains in the various genomes over the World Wide Web facilitates the study and design of experiments on specific superfamilies. The database creates a framework for a systematic survey and analysis of various structural superfamilies. The database may be accessed and downloaded over the World Wide Web (http://caps.ncbs.res.in/gendis/download.html). Associating different proteins with structurally similar and evolutionarily related proteins enhances our functional understanding of a protein superfamily. The complete taxonomic information corresponding to individual sequences in the GenDiS database provides a platform for performing cross-genomic or phyletic analysis at various levels of the taxonomic hierarchy. The World Wide Web interface gives a view of the sequence relatives across the various genomes, their conservation and their sequence diversity, improving our understanding of a given protein superfamily or organism.

1.  Clustering of highly homologous sequences to reduce the size of large protein databases.

Authors:  W Li; L Jaroszewski; A Godzik
Journal:  Bioinformatics       Date:  2001-03       Impact factor: 6.937

2.  IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices.

Authors:  A A Schäffer; Y I Wolf; C P Ponting; E V Koonin; L Aravind; S F Altschul
Journal:  Bioinformatics       Date:  1999-12       Impact factor: 6.937

3.  SUPFAM--a database of potential protein superfamily relationships derived by comparing sequence-based and structure-based families: implications for structural genomics and function annotation in genomes.

Authors:  Shashi B Pandit; Dilip Gosar; S Abhiman; S Sujatha; Sayali S Dixit; Natasha S Mhatre; R Sowdhamini; N Srinivasan
Journal:  Nucleic Acids Res       Date:  2002-01-01       Impact factor: 16.971

4.  Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database.

Authors:  Daniel W A Buchan; Adrian J Shepherd; David Lee; Frances M G Pearl; Stuart C G Rison; Janet M Thornton; Christine A Orengo
Journal:  Genome Res       Date:  2002-03       Impact factor: 9.043

5.  Domain combinations in archaeal, eubacterial and eukaryotic proteomes.

Authors:  G Apic; J Gough; S A Teichmann
Journal:  J Mol Biol       Date:  2001-07-06       Impact factor: 5.469

6.  Estimating the diversity of peptide populations from limited sequence data.

Authors:  Lee Makowski; Alexei Soares
Journal:  Bioinformatics       Date:  2003-03-01       Impact factor: 6.937

7.  GeneCensus: genome comparisons in terms of metabolic pathway activity and protein family sharing.

Authors:  J Lin; J Qian; D Greenbaum; P Bertone; R Das; N Echols; A Senes; B Stenger; M Gerstein
Journal:  Nucleic Acids Res       Date:  2002-10-15       Impact factor: 16.971

8.  Multiple sequence alignment with the Clustal series of programs.

Authors:  Ramu Chenna; Hideaki Sugawara; Tadashi Koike; Rodrigo Lopez; Toby J Gibson; Desmond G Higgins; Julie D Thompson
Journal:  Nucleic Acids Res       Date:  2003-07-01       Impact factor: 16.971

9.  Profile hidden Markov models.

Authors:  S R Eddy
Journal:  Bioinformatics       Date:  1998       Impact factor: 6.937

10.  Database resources of the National Center for Biotechnology Information: update.

Authors:  David L Wheeler; Deanna M Church; Ron Edgar; Scott Federhen; Wolfgang Helmberg; Thomas L Madden; Joan U Pontius; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Tugba O Suzek; Tatiana A Tatusova; Lukas Wagner
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

